Article

Optimal Near-End Speech Intelligibility Improvement Incorporating Additive Noise and Late Reverberation Under an Approximation of the Short-Time SII


Abstract

The presence of environmental additive noise in the vicinity of the user typically degrades the speech intelligibility of speech processing applications. This intelligibility loss can be compensated by properly preprocessing the speech signal prior to play-out, often referred to as near-end speech enhancement. Although the majority of such algorithms focus primarily on the presence of additive noise, reverberation can also severely degrade intelligibility. In this paper we investigate how late reverberation and additive noise can be jointly taken into account in the near-end speech enhancement process. For this effort we use a recently presented approximation of the speech intelligibility index under a power constraint, which we optimize for speech degraded by both additive noise and late reverberation. The algorithm results in time–frequency dependent amplification factors that depend on both the additive noise power spectral density and the late reverberation energy. These amplification factors redistribute speech energy across frequency and perform a dynamic range compression. Experimental results using both instrumental intelligibility measures and intelligibility listening tests show that the proposed approach improves speech intelligibility over state-of-the-art reference methods when speech signals are degraded simultaneously by additive noise and reverberation. Speech intelligibility improvements on the order of 20% are observed.
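The core mechanism the abstract describes, per-band amplification under a power constraint, can be illustrated with a minimal numerical sketch. The square-root gain rule below is a hypothetical placeholder, not the paper's closed-form ASII solution; only the final renormalization step, which enforces the power constraint, reflects the described method.

```python
import numpy as np

def power_constrained_gains(speech_psd, noise_psd, late_reverb_psd):
    """Toy near-end enhancement: compute per-band amplification factors
    that shift energy toward bands masked by noise + late reverberation,
    then rescale so the total speech power is unchanged."""
    disturbance = noise_psd + late_reverb_psd
    # Hypothetical rule: amplify bands in proportion to how strongly the
    # disturbance masks them (NOT the paper's optimal solution).
    raw_gain = np.sqrt(disturbance / (speech_psd + 1e-12))
    modified_power = np.sum(raw_gain**2 * speech_psd)
    # Power constraint: the modified speech keeps the original total power.
    scale = np.sqrt(np.sum(speech_psd) / modified_power)
    return raw_gain * scale

speech = np.array([4.0, 2.0, 1.0, 0.5])   # clean speech PSD per band
noise = np.array([0.1, 0.5, 1.0, 2.0])    # additive noise PSD per band
reverb = np.array([0.2, 0.2, 0.1, 0.1])   # late reverberation energy per band
gains = power_constrained_gains(speech, noise, reverb)
```

Note how the constraint forces a trade-off: boosting heavily masked bands necessarily attenuates others, which is exactly why the optimization in the paper is non-trivial.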


... Numerous algorithms for near-end intelligibility enhancement have been studied over the past decade (e.g., [4], [5], [6], [7], [8], [9], [10]). In particular, the 1st and 2nd Hurricane challenges [11], [12] summarized many effective algorithms and conducted comprehensive comparisons of them, providing a valuable reference for researchers. ...
... The basic concept is to modify the input speech in such a way as to maximize a target intelligibility metric under a known noise condition. For example, some algorithms (e.g., [5] and [8]) have been proposed to maximize the speech intelligibility index (SII) [19]. Another set of algorithms [6], [9], [20] optimizes a glimpse-based intelligibility metric [21]. ...
... Figure 6 gives examples of waveforms and spectrograms for different signals. From the spectrograms, we found that both SSDRC and Proposed (All) modified the speech signal by redistributing its energy from the low frequencies to the middle and high frequencies. (Audio samples of the tested systems are available at https://nii-yamagishilab.github.io/hyli666-demos/intelligibility/index.html) ...
Preprint
The intelligibility of speech severely degrades in the presence of environmental noise and reverberation. In this paper, we propose a novel deep learning based system for modifying the speech signal to increase its intelligibility under the equal-power constraint, i.e., signal power before and after modification must be the same. To achieve this, we use generative adversarial networks (GANs) to obtain time-frequency dependent amplification factors, which are then applied to the input raw speech to reallocate the speech energy. Instead of optimizing only a single, simple metric, we train a deep neural network (DNN) model to simultaneously optimize multiple advanced speech metrics, including both intelligibility- and quality-related ones, which results in notable improvements in performance and robustness. Our system can not only work in non-real-time mode for offline audio playback but also support practical real-time speech applications. Experimental results using both objective measurements and subjective listening tests indicate that the proposed system significantly outperforms state-of-the-art baseline systems under various noisy and reverberant listening conditions.
... However, depending on the exact scenario (e.g., in the case of a public address system), reverberation might also be present at the near-end [8], [9]. In some more recent contributions the presence of both reverberation and additive noise was investigated, e.g., [13], [15], [16]. In this paper, we will neglect the presence of reverberation. ...
... In this paper, we will neglect the presence of reverberation. However, the presented model can easily be extended to also take certain aspects of (late) reverberation into account, in a similar way as presented in [16]. Another way to classify intelligibility enhancement methods is based on how intelligibility, perception and/or audibility is taken into account. ...
... More recently the SII has been approximated as proposed in [25] to make constrained optimization tractable. The approximated SII was used in [16] to increase the intelligibility of speech degraded by both reverberation and noise. The short-time objective intelligibility (STOI) measure [26] and the glimpse proportion metric [27] are among the more recently proposed instrumental speech intelligibility measures. ...
Article
Speech intelligibility enhancement is considered for multiple-microphone acquisition and single-loudspeaker rendering. This is based on the mutual information measured between the message spoken in the far-end environment and the message perceived by a listener at the near-end. We prove that the joint optimal processing can be decomposed into far-end and near-end processing. The former is a minimum variance distortionless response (MVDR) beamformer that reduces the noise in the talker environment and the latter is a post-filter that redistributes the power over the frequency bands. Disjoint processing is optimal provided that the post-filtering operation is aware of the residual noise from the beamforming operation. Our results show that both processing steps are necessary for the effective conveyance of a message and, importantly, that the second step must be aware of the remaining noise from the beamforming operation in the first step. In addition, we study the use of the mutual information applied to the perceptually more relevant powers per critical band.
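The far-end stage in this decomposition is a standard MVDR beamformer, whose closed form is w = R⁻¹d / (dᴴR⁻¹d). A minimal sketch with made-up two-microphone numbers (the covariance and steering vector are illustrative, not from the paper):

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Minimum variance distortionless response beamformer weights:
    w = R^{-1} d / (d^H R^{-1} d), minimizing output noise power
    while passing the target direction with unit gain."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

# Hypothetical two-microphone example.
R = np.array([[1.0, 0.3], [0.3, 1.0]])  # noise covariance matrix
d = np.array([1.0, 1.0])                # steering vector (broadside source)
w = mvdr_weights(R, d)
```

The distortionless constraint wᴴd = 1 is what lets the near-end post-filter treat the beamformer output as "speech plus residual noise", which is exactly the awareness the abstract argues the second step needs.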
... Numerous algorithms for near-end intelligibility enhancement have been studied over the past decade (e.g., [4], [5], [6], [7], [8], [9], [10]). In particular, the 1st and 2nd Hurricane challenges [11], [12] summarized many effective algorithms and conducted comprehensive comparisons of them, providing a valuable reference for researchers. ...
... The basic concept is to modify the input speech in such a way as to maximize a target intelligibility metric under a known noise condition. For example, some algorithms (e.g., [5] and [8]) were proposed to maximize the speech intelligibility index (SII) [19]. Another group [6], [9], [20] optimizes a glimpse-based intelligibility metric [21]. ...
... 90 enhanced samples were randomly selected from the test set for each system, and a total of 15 listeners participated. Each participant was instructed to listen to 18 randomized sample pairs, and for each pair they had to select the one that sounded better in terms of speech quality. (Audio samples of the tested systems are available at https://nii-yamagishilab.github.io/hyli666-demos/intelligibility/index.html) As we can see from Fig. 5, Proposed (All) achieved significantly higher preference scores than iMetricGAN and Proposed (S+H+E) and performed comparably with SSDRC. ...
Article
Full-text available
The intelligibility of speech severely degrades in the presence of environmental noise and reverberation. In this paper, we propose a novel deep learning based system for modifying the speech signal to increase its intelligibility under the equal-power constraint, i.e., signal power before and after modification must be the same. To achieve this, we use generative adversarial networks (GANs) to obtain time-frequency dependent amplification factors, which are then applied to the input raw speech to reallocate the speech energy. Instead of optimizing only a single, simple metric, we train a deep neural network (DNN) model to simultaneously optimize multiple advanced speech metrics, including both intelligibility- and quality-related ones, which results in notable improvements in performance and robustness. Our system can not only work in non-real-time mode for offline audio playback but also support practical real-time speech applications. Experimental results using both objective measurements and subjective listening tests indicate that the proposed system significantly outperforms state-of-the-art baseline systems under various noisy and reverberant listening conditions.
... The eighth data set consists of speech subjected to preprocessing enhancement and degraded by reverberation and noise. In Hendriks et al. [22] phrases from the Dutch version of the Hagerman test [21,24] were processed by four enhancement algorithms, convolved with a room impulse response with a T60 time of 1 s, and then degraded by SSN at SNRs of −2, 0, 2, and 4 dB. Three of the enhancement algorithms [22,55,64] optimally redistribute the energy of the clean speech according to a distortion criterion. ...
... In Hendriks et al. [22] phrases from the Dutch version of the Hagerman test [21,24] were processed by four enhancement algorithms, convolved with a room impulse response with a T60 time of 1 s, and then degraded by SSN at SNRs of −2, 0, 2, and 4 dB. Three of the enhancement algorithms [22,55,64] optimally redistribute the energy of the clean speech according to a distortion criterion. The fourth algorithm [23] uses steady-state suppression to reduce degradation caused by reverberation. ...
... The sampling rate was 16 kHz. See Hendriks et al. [22] for more details. ...
Article
Full-text available
Instrumental intelligibility metrics are commonly used as an alternative to listening tests. This paper evaluates 12 monaural intrusive intelligibility metrics: SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI, SIIB, and sEPSM^corr. In addition, this paper investigates the ability of intelligibility metrics to generalize to new types of distortions and analyzes why the top performing metrics have high performance. The intelligibility data were obtained from 11 listening tests described in the literature. The stimuli included Dutch, Danish, and English speech that was distorted by additive noise, reverberation, competing talkers, pre-processing enhancement, and post-processing enhancement. SIIB and HASPI had the highest performance, achieving a correlation with listening test scores on average of ρ = 0.92 and ρ = 0.89, respectively. The high performance of SIIB may, in part, be the result of SIIB's developers having access to all the intelligibility data considered in the evaluation. The results show that intelligibility metrics tend to perform poorly on data sets that were not used during their development. By modifying the original implementations of SIIB and STOI, the advantage of reducing statistical dependencies between input features is demonstrated. Additionally, the paper presents a new version of SIIB called SIIB^Gauss, which has similar performance to SIIB and HASPI, but takes less time to compute by two orders of magnitude.
... In recent years, only a few studies have considered the effects of reverberation and background noise simultaneously [19][20][21][22]. Some methods just use the near-end speech enhancement method to reduce the influence of both reverberation and background noise [19,20]. ...
... Some methods just use the near-end speech enhancement method to reduce the influence of both reverberation and background noise [19,20]. Other methods pre-compensate the output speech by obtaining the optimal solution of the established mathematical model to improve intelligibility [21,22]. Crespo and Hendriks [21] proposed a multizone speech reinforcement method based on a general optimization framework. ...
... Hendriks et al. [22] proposed an approximated speech intelligibility index (ASII) method to improve the speech intelligibility in a single-zone scenario. Unlike the Multizone method [21], the ASII method uses a speech intelligibility index to establish a mathematical model that includes late reverberation and noise. ...
Article
Full-text available
The speech intelligibility of indoor public address systems is degraded by reverberation and background noise. This paper proposes a preprocessing method that combines speech enhancement and inverse filtering to improve the speech intelligibility in such environments. An energy redistribution speech enhancement method was modified for use in reverberation conditions, and an auditory-model-based fast inverse filter was designed to achieve better dereverberation performance. An experiment was performed in various noisy, reverberant environments, and the test results verified the stability and effectiveness of the proposed method. In addition, a listening test was carried out to compare the performance of different algorithms subjectively. The objective and subjective evaluation results reveal that the speech intelligibility is significantly improved by the proposed method.
... Further, their performance is limited at low signal-to-noise ratios (SNRs). A few other studies have aimed at improving the near-end intelligibility by developing an optimal linear time-invariant filter (OLF) that maximizes the speech intelligibility index (SII) [27,28], and by optimally redistributing the energy (RE) over time and frequency based on a perceptual distortion measure [7,26]. Although these methods improve intelligibility without requiring noise statistics, their performance degrades at very low SNRs. ...
... Apply 25 ms windowing on s(n) to get M frames.
for m = 1 to M do
    Compute the pitch of frame m using YAAPT [35]
    if pitch > 0 then
        Compute analysis pitch marks (PM) and synthesis pitch marks (SPM) [17] using δt_i ...
Article
Full-text available
The proposed work attempts to improve the near-end intelligibility of speech at very low signal-to-noise ratios (SNRs). Additionally, unlike existing intelligibility improvement methods, the proposed approach does not require noise statistics. To this end, the shaping parameters of the voice transformation function (VTF) are optimized. This optimization corresponds to a combined modification that includes formant shifting, nonuniform time scaling, smoothing, and energy redistribution in a comprehensive learning particle swarm optimization (CLPSO) framework. The optimal parameters of the combined modifications are obtained by jointly maximizing the short-time objective intelligibility, perceptual evaluation of speech quality, and signal-to-distortion ratio metrics used as the cost function in CLPSO. The outcome is an intelligibility improvement significantly higher than that obtained by applying these methods individually, while preserving quality. As a side result, Gaussian process regression is also employed to estimate the shaping parameters of the VTF at arbitrary SNRs other than the ones used during CLPSO training.
... One solution could be to design NELE algorithms taking into account the combined effect of both noise and reverberation (e.g. [15,16]). These approaches incorporated the detrimental effect of reverberation by adding the late reverberation [16] or both late reverberation and early reflections [15] as an additional noise term in a model-based approach. ...
... [15,16]). These approaches incorporated the detrimental effect of reverberation by adding the late reverberation [16] or both late reverberation and early reflections [15] as an additional noise term in a model-based approach. On the other hand, it is also possible that algorithms designed for either noise or reverberation may also provide enough improvement in scenarios that are both noisy and reverberant, hence making a combined approach obsolete. ...
Conference Paper
Full-text available
Near-end listening enhancement (NELE) algorithms aim to pre-process speech prior to playback via loudspeakers so as to maintain high speech intelligibility even when listening conditions are not optimal, e.g., due to noise or reverberation. Often NELE algorithms are designed for scenarios considering either only the detrimental effect of noise or only reverberation, but not both disturbances. In many typical applications scenarios, however, both factors are present. In this paper, we evaluate a new combination of a noise-dependent and a reverberation-dependent algorithm implemented in a common framework. Specifically, we use instrumental measures as well as subjective ratings of listening effort for acoustic scenarios with different reverberation times and realistic signal-to-noise ratios. The results show that the noise-dependent algorithm also performs well in reverberation, and that the combination of both algorithms can yield slightly better performance than the individual algorithms alone. This benefit appears to depend strongly on the specific acoustic condition, indicating that further work is required to optimize the adaptive algorithm behavior.
... The noisy-reverberant scenario is composed of three real reverberant rooms (Meeting, Stairway and LASP1) selected from the AIR [10] and LASP_RIR databases and two background non-stationary acoustic noises (Babble and Cafeteria) with SNRs of −2 dB, 0 dB and 2 dB. The ASII_ST [11] and ESII [12] objective measures are adopted for the intelligibility prediction. These measures are explicitly designed to deal with the non-stationarity of speech and noise-reverberant distortions. ...
... These scores can be considered as thresholds of poor and good intelligibility [19], [20]. The ASII_ST [11] and ESII [12] measures are adopted for the intelligibility evaluation under non-stationary noisy-reverberant conditions. The direct-path speech signal, characterized by the first impulse present in each RIR, is chosen as the reference signal. ...
... The noisy-reverberant scenario is composed of two real reverberant rooms and four background non-stationary acoustic noises with five different SNR values. The ESII [12] and ASII_ST [13] measures are adopted for the intelligibility prediction. These measures are explicitly designed to deal with the non-stationarity of speech and its distortions. ...
... The ESII [12] and ASII_ST [13] measures are adopted to evaluate the intelligibility improvement under non-stationary noisy-reverberant conditions. The direct-path speech signal s_dir(n) is chosen as the reference signal. ...
Preprint
Full-text available
This letter proposes a new time-domain absorption approach designed to reduce masking components of speech signals under noisy-reverberant conditions. In this method, the non-stationarity of corrupted signal segments is used to detect masking distortions based on a defined threshold. The non-stationarity is objectively measured and is also adopted to determine the absorption procedure. Additionally, no prior knowledge of speech statistics or of the room information is required for this technique. Three intelligibility measures (ESII, ASII_ST, SRMR_norm) and a perceptual listening test are used for evaluation. The experimental results show that the proposed scheme leads to a higher intelligibility improvement when compared to competing methods.
... The formant shifting approach (SSFV) [4] and the technique based on harmonic models (APES_HARM) [7] are adopted as baselines. Three objective intelligibility measures are used to compare the proposed and baseline techniques: ESTOI [11], ESII [12] and ASII_ST [13]. PESQ [14], LLR [15] and WSS [16] are selected to examine the speech quality. ...
Preprint
Full-text available
This paper proposes a time-domain method to improve speech intelligibility in noisy scenarios. In the proposed approach, a series of Gammatone filters is adopted to detect the harmonic components of speech. The filters' outputs are amplified to emphasize the first harmonics, reducing the masking effects of acoustic noises. The proposed GTF_F0 solution and two baseline techniques are examined considering four background noises with different degrees of non-stationarity. Three intelligibility measures (ESTOI, ESII and ASII_ST) are adopted for objective evaluation. The experimental results show that the proposed scheme yields a notable speech intelligibility gain compared to the competing approaches. Furthermore, the PESQ and WSS objective scores demonstrate that the proposed technique also provides an interesting quality improvement.
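The Gammatone filters mentioned here have a standard impulse-response form; a small sketch follows. The ERB-based bandwidth is the common Glasberg–Moore approximation, not necessarily the exact filter bank the authors used.

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.02, order=4):
    """Gammatone impulse response:
    t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t),
    with bandwidth b tied to the ERB at centre frequency fc."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # Glasberg-Moore ERB in Hz
    b = 1.019 * erb                          # common bandwidth scaling
    return t**(order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

# 20 ms impulse response of a 500 Hz channel at 16 kHz sampling.
ir = gammatone_ir(fc=500.0, fs=16000)
```

A bank of such filters at harmonically related centre frequencies is the usual way to isolate the low harmonics that the method amplifies.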
... Perceptual distortion measures are optimized in [7,8]. A speech intelligibility index (SII) [9] based measure is optimized in [10]. Local SII optimization by spectral shaping and dynamic range compression is studied in [11] and validated with objective measures. ...
... A speech intelligibility improvement method was proposed by Hendriks et al. (2015) for a single-zone scenario. Unlike the multi-zone approach, this approximated speech intelligibility index method uses the index to build a mathematical model that includes the noise. ...
Article
Full-text available
Speech enhancement primarily focuses on improving the intelligibility and quality of the speech signal by using various algorithms and techniques. Processing of a speech signal refers to applying efficient mechanisms to reduce noise in order to extract the intended speech signal from the corrupted signal. Noise reduction techniques such as Kalman filtering, spectral subtraction and adaptive Wiener filtering are used in different enhancement scenarios in speech processing. In the proposed method, the combination of a Wiener filter and the Karhunen–Loève Transform is used to remove noise and enhance the noisy speech signal. This paper presents the performance evaluation of the proposed hybrid algorithm by estimating Signal-to-Noise Ratio, Perceptual Evaluation of Speech Quality, Short-Time Objective Intelligibility and Extended STOI values. This algorithm has been implemented in varied noisy conditions and the results demonstrate the effectiveness of the method. A subjective listening evaluation was also performed, and both the objective and subjective results confirmed a significant improvement in speech intelligibility with the proposed method.
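The Wiener component of such a hybrid reduces to a one-line per-bin gain; the sketch below is the textbook frequency-domain Wiener rule, with the KLT stage and the PSD estimation deliberately omitted.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    """Classical Wiener suppression gain per frequency bin:
    H(k) = S(k) / (S(k) + N(k)), where S and N are the speech and
    noise power spectral densities."""
    return speech_psd / (speech_psd + noise_psd)

s = np.array([1.0, 4.0, 0.25])  # speech PSD per bin
n = np.array([1.0, 1.0, 1.0])   # noise PSD per bin
h = wiener_gain(s, n)
```

In practice the PSDs are unknown and must be estimated, which is where most of the engineering effort in such enhancers goes.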
... Perceptual distortion is minimized for the parameters of a spectral gain modification in [16,17]. A speech intelligibility index (SII) [18] based measure is optimized in [19]. Local SII optimization by spectral shaping and dynamic range compression is studied in [20]. ...
... The noisy-reverberant scenario is composed of two real reverberant rooms from the AIR [31] and LASP_RIR databases and two background non-stationary acoustic noises extracted from the RSG-10 [32] and DEMAND [33] databases. The intelligibility assessment is performed considering the objective measures STOI [34] and ASII_ST [35]. The SRMR [36], PESQ [37] and f2-model PEAQ [38], [39] measures are adopted for objective quality evaluation. ...
Preprint
Full-text available
This paper introduces the single-step time-domain method named HnH-NRSE, which is designed for simultaneous speech intelligibility and quality improvement under noisy-reverberant conditions. In this solution, harmonic and non-harmonic elements of speech are separated by applying zero-crossing and energy criteria. An objective evaluation of its non-stationarity degree is further used for an adaptive gain to treat masking components. No prior knowledge of speech statistics or room information is required for this technique. Additionally, two combined solutions, IRMO and IRMN, are proposed as composite methods for improving noisy-reverberant speech signals. The proposed and baseline methods are evaluated using two intelligibility and three quality measures for objective prediction. The results show that the proposed scheme leads to a higher intelligibility and quality improvement when compared to competing methods in most scenarios. Additionally, a perceptual intelligibility listening test is performed, which corroborates these results. Furthermore, the proposed HnH-NRSE solution attains SRMR quality scores similar to those of the composite IRMO and IRMN techniques.
... processed speech is more intelligible than the original when reverberated [7], [8]. However, evaluation and comparison of such methods have been done only in one highly reverberant condition [9]. (This work was partially supported by CNPq (307866/2015-7).) ...
... The traditional non-DNN based methods which improve speech intelligibility based on the band-importance function [16] or the maximization of the SII measure [17,18] have shown good results in the improvement of speech perception in background noise. In this study, we propose to use the band-importance function to obtain the weight function x(k). The values of the relative contribution of each frequency in x(k) are determined from the values of the band-importance function in one-third octave frequency bands. ...
Article
Speech intelligibility improvement is an important task to increase human perception in telecommunication systems and hearing aids when the speech is degraded by background noises. Although deep neural network (DNN) based learning architectures which use mean square error (MSE) as the cost function have been found to be very successful in speech enhancement, they typically attempt to enhance the speech quality by uniformly optimizing the separation of a target speech signal from a noisy observation over all frequency bands. In this work, we propose a new cost function which further focuses on speech intelligibility improvement based on a psychoacoustic model. The band-importance function, which is a principal component of the speech intelligibility index (SII), has been used to determine the relative contribution to speech intelligibility provided by each frequency band in the learning algorithm. In addition, we augment a signal-to-noise ratio (SNR) estimation to the network to improve the generalization of the method to unseen noisy conditions. The performance of the proposed MSE cost function is compared with the conventional MSE cost function under the same conditions. Our approach shows better performance in objective speech intelligibility measures such as coherence SII (CSII) and short-time objective intelligibility (STOI), while mitigating quality scores in perceptual evaluation of speech quality (PESQ) and the speech distortion (SD) measure.
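The band-importance-weighted cost described above can be sketched as a weighted MSE. The toy weights below are illustrative placeholders, not the SII band-importance table from the standard.

```python
import numpy as np

def band_weighted_mse(target, estimate, band_importance):
    """MSE in which each frequency band's error is scaled by its
    band-importance weight, so perceptually important bands dominate
    the training objective. Inputs have shape (frames, bands)."""
    per_band_err = np.mean((target - estimate)**2, axis=0)  # average over frames
    return np.sum(band_importance * per_band_err) / np.sum(band_importance)

target = np.ones((2, 3))                  # (frames, bands) reference spectra
importance = np.array([1.0, 2.0, 1.0])    # hypothetical band weights
loss_bad = band_weighted_mse(target, np.zeros((2, 3)), importance)
loss_good = band_weighted_mse(target, target, importance)
```

Swapping this in for plain MSE in a DNN training loop is the essence of the proposal: errors in high-importance bands are penalized more than errors elsewhere.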
... The last set of methods takes the background noise into account for modification of the speech signal. In this case, prior knowledge of the background noise (Cooke et al., 2013b; Taal et al., 2014; Hendriks et al., 2015) is assumed. Using energy-allocation strategies, five different modifications were tested in known noise conditions for their contribution to intelligibility (Tang and Cooke, 2010). ...
Article
This paper presents a method for modifying speech to enhance its intelligibility in noise. The features contributing to intelligibility are analyzed using the recently proposed single frequency filtering (SFF) analysis of speech signals. In the SFF method, the spectral and temporal resolutions can be controlled using a single parameter of the filter, corresponding to the location of the pole on the negative real axis with respect to the unit circle in the z-plane. The SFF magnitude (envelope) and phase at several frequencies can be used to synthesize the original speech signal. Analysis of highly intelligible speech shows that the speech signal is more intelligible when it has a higher dynamic range of amplitude locally (fine structure) and/or a lower dynamic range of amplitude globally (gross structure) in both the spectral and temporal domains. Some features of normal speech are modified at fine and gross temporal and spectral levels, and the modified SFF envelopes are used to synthesize speech. The proposed method gives higher objective scores of intelligibility compared to the original and the reference method (spectral shaping and dynamic range compression) under different noise conditions. In subjective evaluation, though the word accuracies are not significantly different between the proposed and reference methods, listeners seem to prefer the proposed method as it gives a louder and crisper sound.
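The SFF envelope extraction can be sketched compactly. The canonical SFF formulation shifts the frequency of interest to π and filters with a pole at −r; the sketch below uses the equivalent DC-centred variant (demodulate to DC, pole at +r), which is an assumption for readability, not the authors' exact implementation.

```python
import numpy as np

def sff_envelope(signal, freq, fs, r=0.99):
    """Single-frequency-filtering envelope: demodulate `freq` down to DC,
    then track it with a first-order all-pole filter (pole at +r); the
    output magnitude is the SFF envelope at that frequency."""
    n = np.arange(len(signal))
    demod = signal * np.exp(-2j * np.pi * freq * n / fs)
    out = np.empty(len(signal), dtype=complex)
    acc = 0j
    for i, x in enumerate(demod):
        acc = r * acc + x   # one-pole recursion, narrow band around `freq`
        out[i] = acc
    return np.abs(out)

fs = 8000
t = np.arange(800)
tone = np.cos(2 * np.pi * 1000 * t / fs)      # 1 kHz test tone
env_match = sff_envelope(tone, 1000, fs)      # probe at the tone frequency
env_miss = sff_envelope(tone, 3000, fs)       # probe far from the tone
```

The parameter r plays the role described in the abstract: the closer the pole is to the unit circle, the narrower the band and the smoother the envelope.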
... Optimization-based methods that improve speech intelligibility by optimizing a cost function based on an objective intelligibility measure are other noise-dependent algorithms. These methods include optimization based on a perceptual distortion measure [15] and on the speech intelligibility index (SII) [16,17]. These optimization-based methods typically optimize the cost function over a relatively extensive time interval. ...
Article
A new speech processing algorithm is proposed to improve speech intelligibility in noisy environments without increasing speech energy. The method improves the near-end speech intelligibility by optimizing the frame-based spectral energy correlation between clean speech and noisy modified speech under a power constraint. The algorithm is developed based on the short-time objective intelligibility (STOI) measure, which predicts speech intelligibility in background noise from the correlation of clean and noisy speech. The proposed method is compared with unprocessed speech and two baseline methods using two objective intelligibility measures and an intelligibility listening test under various noisy conditions. Results show large intelligibility improvements with the proposed method over the unprocessed noisy speech. In addition, compared with the baseline methods, the proposed algorithm provides the best intelligibility scores for all noisy conditions in the STOI measure and for low signal-to-noise ratios in the speech intelligibility index. The word recognition results also show that the proposed algorithm performs better than the unprocessed speech and the reference methods. An objective quality measure is applied to investigate the speech quality of the introduced method, showing that it does not significantly affect speech quality.
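The quantity being optimized here is essentially a Pearson correlation between short-time spectral envelopes of the clean and the noisy modified speech, which STOI-style measures average into a score. A minimal sketch:

```python
import numpy as np

def envelope_correlation(clean_env, modified_env):
    """Pearson correlation between two short-time spectral envelopes,
    the per-segment quantity that STOI-style measures aggregate."""
    c = clean_env - clean_env.mean()
    m = modified_env - modified_env.mean()
    return (c @ m) / (np.linalg.norm(c) * np.linalg.norm(m) + 1e-12)

env = np.array([1.0, 3.0, 2.0, 5.0])            # toy envelope segment
corr_same = envelope_correlation(env, 2.0 * env + 1.0)
corr_flip = envelope_correlation(env, -env)
```

Because the correlation is invariant to positive scaling and offset, maximizing it pushes the modified speech to preserve the shape of the clean envelope rather than its absolute level, which is why a separate power constraint is needed.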
... Some of the earliest methods involved boosting higher frequencies to deal with low-pass noise [14] and dynamic range compression [15]. Subsequent studies include developing an optimum linear time-invariant filter (OLF) which maximizes the speech intelligibility index (SII) [16,17], or optimally redistributing energy (RE) over time and frequency [18,19]. However, despite their efficacy, their performance degrades at very low SNRs. ...
Article
This letter proposes a time-domain method to improve speech intelligibility in noisy scenarios. In the proposed approach, a series of Gammatone filters is adopted to detect the harmonic components of speech. The filters' outputs are amplified to emphasize the first harmonics, reducing the masking effects of acoustic noises. The proposed GTF_F0 solution and two baseline techniques are examined considering four background noises with different degrees of non-stationarity. Three intelligibility measures (ESTOI, ESII and ASII_ST) are adopted for objective evaluation. The experimental results show that the proposed scheme yields a notable speech intelligibility gain compared to the competing approaches. Furthermore, the PESQ and OQCM objective scores demonstrate that the proposed technique also provides an interesting quality improvement.
Conference Paper
In mobile communication systems, deterioration in the intelligibility of the voiced content of the signal is a commonly observed phenomenon. To counter the undesirable effect of background noises on the speech signal, a methodology must be developed such that there is no loss in the intelligibility of the speech signal. This work focuses on developing an approach to enhance speech critical-band-wise when the near-end noise dominates. Two different approaches are employed for speech enhancement. The first approach involves enhancing the lower critical bands, while the second involves higher critical band enhancement. The intelligibility is measured in terms of the Speech Intelligibility Index (SII). The SII values indicate an improvement in speech intelligibility for higher critical band enhancement.
Article
Overlap-masking reduces speech intelligibility in reverberant environments. In contrast to additive noise, the masking signal depends on the past of the speech signal: an increase in output signal power is followed by an increase in reverberation power. Taking the mechanics of reverberation into consideration is therefore essential for the development of speech modifications that effectively increase intelligibility. This letter proposes a mathematical framework that optimizes the full-band signal power as a function of late reverberation power and the degree of signal nonstationarity. The prescribed signal gain is smoothed adaptively to suppress artifacts that may be introduced by rapid gain fluctuations due to frame-based processing. Compared to a reference method, it is shown that a higher signal-to-late-reverberation ratio in nonstationary regions of the speech signal is achieved with, on average, less aggressive gain modification. A listening test with native English speakers on meaningful sentences under strong reverberation measured a consistent and significant improvement in intelligibility over natural speech and a reference method.
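The gain smoothing mentioned above can be illustrated with a simple first-order recursion over frame gains. This is a generic sketch, not the letter's adaptive rule (which varies the smoothing with signal nonstationarity); the fixed smoothing constant alpha is an assumption.

```python
def smooth_gains(raw_gains, alpha=0.8):
    """Recursively smooth a sequence of per-frame gains to suppress
    artifacts caused by rapid frame-to-frame gain fluctuations."""
    out = []
    g = raw_gains[0]                      # initialize from the first frame
    for r in raw_gains:
        g = alpha * g + (1.0 - alpha) * r  # first-order recursive average
        out.append(g)
    return out
```

A constant gain track passes through unchanged, while an alternating track has its frame-to-frame jumps reduced well below the raw fluctuation.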
Conference Paper
In communication systems involving speech, two different environments can be identified: the far-end environment and the near-end environment. These different environments lead to different causes of reduced speech intelligibility for the listener. By applying noise reduction techniques, the impact of acoustical noise sources in the far-end environment can be minimized. However, the effect of noise present in the near-end environment cannot be negated in the same way, because the listener perceives the speech and noise simultaneously. Thus a near-end listening enhancement approach must be employed to enhance the speech signal and improve speech intelligibility when the near-end noise dominates. This work focuses on developing an approach to enhance the speech signal critical-band-wise, after analyzing the far-end speech and near-end (background) noise energy. Enhanced-speech intelligibility is assessed in terms of the Speech Intelligibility Index (SII). The results obtained indicate an improvement in speech intelligibility, which is confirmed by the SII values.
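A critical-band-wise boost of the kind described in the two entries above might look as follows. This is a sketch under stated assumptions, not either paper's method: the per-band power representation, the choice of which bands to boost, and the renormalization to keep overall power unchanged are all assumptions of this illustration.

```python
def boost_bands(band_pow, band_ids, boost_db=6.0):
    """Boost the selected critical bands by boost_db, then rescale all
    bands so the total signal power is unchanged."""
    f = 10.0 ** (boost_db / 10.0)                       # dB -> linear power factor
    boosted = [p * f if i in band_ids else p
               for i, p in enumerate(band_pow)]
    c = sum(band_pow) / sum(boosted)                    # power renormalization
    return [p * c for p in boosted]
```

With a 10 dB boost on the two highest of four equal-power bands, the boosted-to-unboosted power ratio is exactly 10 while the total power is preserved.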
Article
Speech Intelligibility Prediction (SIP) algorithms are becoming popular tools within the development and operation of speech processing devices and algorithms. However, many SIP algorithms require knowledge of the underlying clean speech; a signal that is often not available in real-world applications. This has led to increased interest in non-intrusive SIP algorithms, which do not require clean speech to make predictions. In this paper we investigate the use of Convolutional Neural Networks (CNNs) for non-intrusive SIP. To do so, we utilize a CNN architecture that shows similarities to existing SIP algorithms, in terms of computational structure, and which allows for easy and meaningful visualization and interpretation of trained weights. We evaluate this architecture using a large dataset obtained by combining datasets from the literature. The proposed method shows high prediction performance when compared with four existing intrusive and non-intrusive SIP algorithms. This demonstrates the potential of deep learning for speech intelligibility prediction.
Article
The intelligibility of speech from a telephone or a public address system is often affected by acoustical background noise in the near-end listening environment. Speech intelligibility and listening effort can be improved by adaptive pre-processing of the loudspeaker signal. This is called Near-End Listening Enhancement (NELE). The speech spectrum is dynamically modified, taking the acoustical background noise at the near-end into account. In this paper, two opposite NELE strategies, Noise-Masking-Proportional Shaping and Noise-Masking-Inverse Shaping, are proposed, which are appropriate for different noise characteristics. Both strategies are formulated in closed form in the frequency domain. They do not require optimization of an intelligibility measure but use the masking threshold explicitly. Motivated by the frequency-domain approach, a simpler time-domain solution is derived which is based on linear prediction techniques and does not need the masking calculations. The proposed NELE solutions outperform the state of the art in terms of computational complexity, memory requirement, continuous processor load, and latency.
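The two opposite shaping strategies can be caricatured as follows. This is a schematic, not the paper's closed-form masking-threshold solution: the noise band powers stand in for the masking threshold, and a simple power-preservation constraint replaces the paper's formulation.

```python
import math

def nele_gains(speech_pow, noise_pow, mode="proportional", eps=1e-12):
    """Per-band amplitude gains. 'proportional' shapes the output spectrum
    like the noise spectrum; 'inverse' like its reciprocal. Either way,
    the total speech power is preserved."""
    if mode == "proportional":
        shape = list(noise_pow)
    else:  # "inverse": put energy where the masker is weak
        shape = [1.0 / (n + eps) for n in noise_pow]
    c = sum(speech_pow) / sum(shape)                     # power-preserving scale
    return [math.sqrt(c * t / (s + eps))                 # amplitude gain per band
            for t, s in zip(shape, speech_pow)]
```

Applying the squared gains to the speech band powers yields a spectrum proportional to the noise shape (proportional mode) or peaked where the noise is weakest (inverse mode), with the original total power in both cases.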
Article
This letter proposes a new time-domain absorption approach designed to reduce masking components of speech signals under noisy-reverberant conditions. In this method, the non-stationarity of corrupted signal segments is used to detect masking distortions based on a defined threshold. The non-stationarity is objectively measured and is also adopted to determine the absorption procedure. Additionally, no prior knowledge of speech statistics or room information is required for this technique. Two intelligibility measures (ESII and ASII_ST) are used for objective evaluation. The results show that the proposed scheme leads to a higher intelligibility improvement when compared to competing methods. A perceptual listening test is further considered and corroborates these results. Furthermore, the updated version of the SRMR quality measure (SRMR_norm) demonstrates that the proposed technique also attains a quality improvement.
Article
This paper introduces a novel single-channel speech enhancement method in the time domain to mitigate the effects of acoustic impulsive noises. The ensemble empirical mode decomposition is applied to analyze the noisy speech signal. The estimation and selection of noise components is based on the impulsiveness index of the decomposition modes. An adaptive threshold is proposed to define the criterion for selecting the noise components. The proposed method is evaluated in speech enhancement experiments considering four acoustic noises with different impulsiveness indices and non-stationarity degrees under various signal-to-noise ratios. Four speech enhancement algorithms, spanning spectral and time domains, are adopted as baselines in the evaluation analysis. Seven objective measures are adopted to compare the proposed and baseline approaches in terms of speech quality and intelligibility. Results show that the proposed solution outperforms the competing algorithms for most of the noisy scenarios. The novel method performs particularly well when speech signals are corrupted by highly impulsive acoustic noises.
Conference Paper
Full-text available
How can speech be retimed so as to maximise its intelligibility in the face of competing speech? We present a general strategy which modifies local speech rate to minimise overlap with a known fluctuating masker. Continuous timescale factors are derived in an optimisation procedure which seeks to minimise overall energetic masking of the speech by the masker while additionally unmasking those speech regions potentially most important for speech recognition. Intelligibility increases are evaluated with both objective and subjective measures and show significant gains over an unmodified baseline, with larger benefits at lower signal-to-noise ratios. The retiming approach does not lead to benefits for speech mixed with stationary maskers, suggesting that the gains observed for the fluctuating masker are not simply due to durational expansion.
Conference Paper
Full-text available
Natural or synthetic speech is increasingly used in less-than-ideal listening conditions. Maximising the likelihood of correct message reception in such situations often leads to a strategy of loud and repetitive renditions of output speech. An alternative approach is to modify the speech signal in ways which increase intelligibility in noise without increasing signal level or duration. The current study focused on the design of stationary spectral modifications whose effect is to reallocate speech energy across frequency bands. Frequency band weights were selected using a genetic algorithm-based optimisation procedure, with glimpse proportion as the objective intelligibility metric, for a range of noise types and levels. As expected, a clear dependence of energy reallocation on noise type and global signal-to-noise ratio was found. One unanticipated outcome was the consistent discovery of sparse, highly-selective spectral energy weightings, particularly in high noise conditions. In a subjective test using stationary noise and competing speech maskers, listeners were able to identify significantly more words in sentences as a result of spectral weighting, with increases of up to 15 percentage points. These findings suggest that context-dependent speech output can be used to maintain intelligibility at lower sound output levels.
Article
Full-text available
A speech pre-processing algorithm is presented that improves the speech intelligibility in noise for the near-end listener. The algorithm improves intelligibility by optimally redistributing the speech energy over time and frequency according to a perceptual distortion measure, which is based on a spectro-temporal auditory model. Since this auditory model takes into account short-time information, transients will receive more amplification than stationary vowels, which has been shown to be beneficial for intelligibility of speech in noise. The proposed method is compared to unprocessed speech and two reference methods using an intelligibility listening test. Results show that the proposed method leads to significant intelligibility gains while still preserving quality. Although one of the methods used as a reference obtained higher intelligibility gains, this happened at the cost of decreased quality. Matlab code is provided.
Conference Paper
Full-text available
A speech pre-processing algorithm is presented to improve speech intelligibility in noise for the near-end listener. The algorithm improves intelligibility by optimally redistributing the speech energy over time and frequency according to a perceptual distortion measure, which is based on a spectro-temporal auditory model. In contrast to spectral-only models, short-time information is taken into account. As a consequence, the algorithm is more sensitive to transient regions, which therefore receive more amplification than stationary vowels. It is known from the literature that changing the vowel-transient energy ratio is beneficial for improving speech intelligibility in noise. Objective intelligibility prediction results show that the proposed method yields higher speech intelligibility in noise than two other reference methods, without modifying the global speech energy.
Article
Full-text available
Perceptual models exploiting auditory masking are frequently used in audio and speech processing applications like coding and watermarking. In most cases, these models only take into account spectral masking in short-time frames. As a consequence, undesired audible artifacts in the temporal domain may be introduced (e.g., pre-echoes). In this article we present a new low-complexity spectro-temporal distortion measure. The model facilitates the computation of analytic expressions for masking thresholds, while advanced spectro-temporal models typically need computationally demanding adaptive procedures to find an estimate of these masking thresholds. We show that the proposed method gives similar masking predictions as an advanced spectro-temporal model with only a fraction of its computational power. The proposed method is also compared with a spectral-only model by means of a listening test. From this test it can be concluded that for non-stationary frames the spectral model underestimates the audibility of introduced errors and therefore overestimates the masking curve. As a consequence, the system of interest incorrectly assumes that errors are masked in a particular frame, which leads to audible artifacts. This is not the case with the proposed method which correctly detects the errors made in the temporal structure of the signal.
Article
Full-text available
In this letter the focus is on linear filtering of speech before degradation due to additive background noise. The goal is to design the filter such that the speech intelligibility index (SII) is maximized when the speech is played back in a known noisy environment. Moreover, a power constraint is taken into account to prevent uncomfortable playback levels and deal with loudspeaker constraints. Previous methods use linear approximations of the SII in order to find a closed-form solution. However, as we show, these linear approximations introduce errors in low-SNR regions and are therefore suboptimal. In this work we propose a nonlinear approximation of the SII which is accurate for all SNRs. Experiments show large intelligibility improvements with the proposed method over the unprocessed noisy speech and better performance than one state-of-the-art method.
Article
Full-text available
In speech communications, signal processing algorithms for near-end listening enhancement make it possible to improve the intelligibility of clean (far-end) speech for the near-end listener, who perceives not only the far-end speech but also ambient background noise. A typical scenario is mobile telephony in acoustical background noise such as traffic or babble noise. In these situations, it is often not acceptable/possible to increase the audio power amplification. In this contribution we use a theoretical analysis of the Speech Intelligibility Index (SII) to develop an algorithm which numerically maximizes the SII under the constraint of an unchanged average power of the audio signal.
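A greedy numeric maximization of a simplified SII at fixed total power, in the spirit of the approach above, might look like this. The SII here is reduced to an importance-weighted, clipped band-SNR sum (no spread of masking or level-distortion terms from the ANSI S3.5 procedure), and the uniform initialization and fixed step size are assumptions of this sketch.

```python
import math

def sii(pow_bands, noise_db, importance):
    """Simplified SII: importance-weighted band SNRs clipped to [-15, +15] dB
    and mapped onto [0, 1]."""
    total = 0.0
    for p, n, w in zip(pow_bands, noise_db, importance):
        snr = 10.0 * math.log10(max(p, 1e-12)) - n
        total += w * min(1.0, max(0.0, (snr + 15.0) / 30.0))
    return total

def maximize_sii(noise_db, importance, total_pow, iters=500):
    """Greedy SII maximization at fixed total power: repeatedly move a small
    chunk of power from the band where it helps least to the band where it
    helps most, until no move improves the objective."""
    k = len(noise_db)
    p = [total_pow / k] * k                  # start from a flat allocation
    step = total_pow / 100.0
    for _ in range(iters):
        base = sii(p, noise_db, importance)
        gains = []
        for i in range(k):                   # marginal benefit of +step per band
            q = list(p)
            q[i] += step
            gains.append(sii(q, noise_db, importance) - base)
        dst = max(range(k), key=lambda i: gains[i])
        src = min(range(k), key=lambda i: gains[i] if p[i] >= step else float("inf"))
        if src == dst or gains[dst] <= gains[src]:
            break
        p[src] -= step
        p[dst] += step
    return p
```

With one quiet and one very loud noise band, the greedy loop drains power from the hopeless band into the audible one, improving the simplified SII while keeping the total power exactly constrained.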
Article
Full-text available
Within the framework of the HearCom project, five promising signal enhancement algorithms are validated for future use in hearing instrument devices. To assess algorithm performance solely on the basis of simulation experiments, a number of physical evaluation measures have been proposed that incorporate basic aspects of normal and impaired human hearing. Additionally, each of the algorithms has been implemented on a common real-time hardware/software platform, which facilitates a profound subjective validation of algorithm performance. Recently, a multicenter study has been set up across five different test centers in Belgium, the Netherlands, Germany and Switzerland to perceptually evaluate the selected signal enhancement approaches with normal-hearing and hearing-impaired listeners.
Conference Paper
Full-text available
In contrast to common noise reduction systems, this contribution presents a digital signal processing algorithm to improve the intelligibility of clean far-end speech for the near-end listener who is located in an environment with background noise. Since the noise reaches the ears of the near-end listener directly and therefore can hardly be influenced, a sensible option is to manipulate the far-end speech. The proposed algorithm raises the average speech spectrum over the average noise spectrum and takes precautions to prevent hearing damage. Informal listening tests and the Speech Intelligibility Index indicate improved speech intelligibility.
Conference Paper
Full-text available
This paper describes a new database of binaural room impulse responses (BRIR), referred to as the Aachen impulse response (AIR) database. The main field of application of this database is the evaluation of speech enhancement algorithms dealing with room reverberation. The measurements with a dummy head took place in a low-reverberant studio booth, an office room, a meeting room and a lecture room. Due to the different dimensions and acoustic properties, it covers a wide range of situations where digital hearing aids or other hands-free devices can be used. Besides the description of the database, a motivation for using binaural instead of monaural measurements is given. Furthermore an example using a coherence-based dereverberation technique is provided to show the advantage of this database for algorithm evaluation. The AIR database is being made available online.
Article
Full-text available
This paper considers suppression of late reverberation and additive noise in single-channel speech recordings. The reverberation introduces long-term correlation in the observed signal. In the first part of this work, we show how this correlation can be used to estimate the late reverberant spectral variance (LRSV) without having to assume a specific model for the room impulse responses (RIRs) while no explicit estimates of RIR model parameters are needed. That makes this correlation-based approach more robust against RIR modeling errors. However, the correlation-based method can follow only slow time variations in the RIRs. Existing model-based methods use statistical models for the RIRs, that depend on one or more parameters that have to be estimated blindly. The common statistical models lead to simple expressions for the LRSV that depend on past values of the spectral variance of the reverberant, noise-free, signal. All existing model-based LRSV estimators in the literature are derived assuming the RIRs to be time-invariant realizations of a stochastic process. In the second part of this paper, we go one step further and analyze time-varying RIRs. We show that in this case the reverberance tends to become decorrelated. We discuss the relations between different RIR models and their corresponding LRSV estimators. We show theoretically that similar simple estimators exist as in the time-invariant case, provided that the reverberation time T60 and direct-to-reverberation ratio (DRR) of the RIRs remain nearly constant during an interval of the order of a few frames. We show that the reverberation time can be taken frequency-bin independent in DFT-based enhancement algorithms. Experiments with time-varying RIRs validate the analysis. Experiments with additive nonstationary noise and time-invariant RIRs show the influence of blind estimation of the reverberation time and the DRR.
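Under an exponential-decay (Polack-style) RIR model, the simple model-based LRSV estimators referred to above reduce to scaling a delayed spectral variance by a T60-dependent decay factor. A minimal sketch, assuming a known T60 and a fixed frame delay; the exact frame parameters are illustrative:

```python
import math

def lrsv_estimate(spec_var_past, t60, frame_shift, fs, delay_frames):
    """Model-based late-reverberant spectral variance:
    lambda_late(l) = exp(-2*delta*T_d) * lambda_x(l - N_d),
    with decay rate delta = 3*ln(10)/T60 (i.e., -60 dB after T60 seconds)
    and prediction delay T_d = N_d * frame_shift / fs seconds.
    spec_var_past holds the reverberant spectral variance N_d frames ago."""
    delta = 3.0 * math.log(10.0) / t60
    t_d = delay_frames * frame_shift / fs
    decay = math.exp(-2.0 * delta * t_d)
    return [decay * v for v in spec_var_past]
```

As a sanity check, with T60 = 0.5 s and a 50 ms delay the power decay factor is exactly 10^(-0.6), i.e., -6 dB, consistent with a -60 dB decay over one full T60.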
Article
Full-text available
Recently, binary mask techniques have been proposed as a tool for retrieving a target speech signal from a noisy observation. A binary gain function is applied to time-frequency tiles of the noisy observation in order to suppress noise-dominated and retain target-dominated time-frequency regions. When implemented using discrete Fourier transform (DFT) techniques, the binary mask techniques can be seen as a special case of the broader class of DFT-based speech enhancement algorithms, for which the applied gain function is not constrained to be binary. In this context, we develop and compare binary mask techniques to state-of-the-art continuous gain techniques. We derive spectral magnitude minimum mean-square error binary gain estimators; the binary gain estimators turn out to be simple functions of the continuous gain estimators. We show that the optimal binary estimators are closely related to a range of existing, heuristically developed, binary gain estimators. The derived binary gain estimators perform better than existing binary gain estimators in simulation experiments with speech signals contaminated by several different noise sources, as measured by speech quality and intelligibility measures. However, even the best binary mask method is significantly outperformed by state-of-the-art continuous gain estimators. The instrumental intelligibility results are confirmed in an intelligibility listening test.
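The contrast between continuous and binary gains can be illustrated with the classical Wiener gain and an SNR-thresholded mask. This is a generic illustration, not the paper's MMSE-optimal binary estimator; the 0 dB local-SNR threshold is an assumption.

```python
import math

def wiener_gain(xi):
    """Continuous Wiener gain from the a priori SNR xi (linear scale)."""
    return xi / (1.0 + xi)

def binary_gain(xi, threshold_db=0.0):
    """Binary mask: keep a time-frequency tile iff its local SNR exceeds
    the threshold, otherwise zero it out."""
    return 1.0 if 10.0 * math.log10(max(xi, 1e-12)) > threshold_db else 0.0
```

At 0 dB SNR the Wiener gain passes half the amplitude, whereas the binary mask makes a hard keep/discard decision on the same tile.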
Article
Full-text available
Existing objective speech-intelligibility measures are suitable for several types of degradation; however, they turn out to be less appropriate in cases where noisy speech is processed by a time-frequency weighting. To this end, an extensive evaluation is presented of objective measures for intelligibility prediction of noisy speech processed with a technique called ideal time-frequency (TF) segregation. In total, 17 measures are evaluated, including four advanced speech-intelligibility measures (CSII, CSTI, NSEC, DAU), the advanced speech-quality measure PESQ, and several frame-based measures (e.g., SSNR). Furthermore, several additional measures are proposed. The study comprised a total of 168 different TF-weightings, including unprocessed noisy speech. Out of all measures, the proposed frame-based measure MCC gave the best results (ρ = 0.93). An additional experiment shows that the well-performing measures in this study also show high correlation with the intelligibility of single-channel noise-reduced speech.
Article
Full-text available
The purpose of this study is to determine how combinations of noise levels and reverberation typical of ranges found in current classrooms affect the speech recognition performance of typically developing children with normal speech, language, and hearing, and to compare their performance with that of adults with normal hearing. Speech recognition performance was measured using the Bamford-Kowal-Bench Speech in Noise test. A virtual test paradigm represented the signal reaching a student seated in the back of a classroom with a volume of 228 m³ and with varied reverberation time (0.3, 0.6, and 0.8 sec). The signal-to-noise ratios required for 50% performance (SNR-50) and for 95% performance were determined for groups of children aged 6 to 12 yrs and a group of young adults with normal hearing. This is a cross-sectional developmental study incorporating a repeated-measures design. Experimental variables included age and reverberation time. A total of 63 children with normal hearing and typically developing speech and language and nine adults with normal hearing were tested. Nine children were included in each age group (6, 7, 8, 9, 10, 11, and 12 yrs). The SNR-50 increased significantly with increased reverberation and decreased significantly with increasing age. On average, children required positive SNRs for 50% performance, whereas thresholds for adults were close to 0 dB or <0 dB for the conditions tested. When reverberant SNR-50 was compared with adult SNR-50 without reverberation, adults did not exhibit an SNR loss, but children aged 6 to 8 yrs exhibited a moderate SNR loss and children aged 9 to 12 yrs exhibited a mild SNR loss. To obtain average speech recognition scores of 95% at the back of the classroom, an SNR ≥ 10 dB is required for all children at the lowest reverberation time, ≥ 12 dB for children up to age 11 yrs in the 0.6-sec reverberant condition, and ≥ 15 dB for children aged 7 to 11 yrs in the 0.8-sec condition. The youngest children require even higher SNRs in the 0.8-sec condition. Results highlight changes in speech recognition performance with age in elementary school children listening to speech in noisy, reverberant classrooms. The more reverberant the environment, the higher the SNR required; the younger the child, the higher the SNR required. Results support the importance of attention to classroom acoustics and emphasize the need for maximizing SNR in classrooms, especially in classrooms designed for early childhood grades.
Article
Full-text available
This article reviews Handbook of Noise and Vibration Control by Malcolm J. Crocker, New Jersey, 2007, 1584 pp. Price: $195.00 (hardcover). ISBN: 0471395994.
Article
Full-text available
The SII model in its present form (ANSI S3.5-1997, American National Standards Institute, New York) can accurately describe intelligibility for speech in stationary noise but fails to do so for nonstationary noise maskers. Here, an extension to the SII model is proposed with the aim to predict the speech intelligibility in both stationary and fluctuating noise. The basic principle of the present approach is that both speech and noise signal are partitioned into small time frames. Within each time frame the conventional SII is determined, yielding the speech information available to the listener at that time frame. Next, the SII values of these time frames are averaged, resulting in the SII for that particular condition. Using speech reception threshold (SRT) data from the literature, the extension to the present SII model can give a good account for SRTs in stationary noise, fluctuating speech noise, interrupted noise, and multiple-talker noise. The predictions for sinusoidally intensity modulated (SIM) noise and real speech or speech-like maskers are better than with the original SII model, but are still not accurate. For the latter type of maskers, informational masking may play a role.
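The time-frame extension described above can be sketched directly: compute a conventional SII in each short frame and average across frames. The per-frame SII here is heavily simplified to an importance-weighted, clipped band-SNR sum, and the equal weighting of frames is a simplification of the proposed model.

```python
def frame_sii(speech_db, noise_db, importance):
    """Simplified conventional SII for one time frame: importance-weighted
    band SNRs clipped to [-15, +15] dB and mapped onto [0, 1]."""
    return sum(w * min(1.0, max(0.0, ((s - n) + 15.0) / 30.0))
               for s, n, w in zip(speech_db, noise_db, importance))

def extended_sii(speech_frames, noise_frames, importance):
    """Extension for fluctuating maskers: partition speech and noise into
    short frames, compute the conventional SII per frame, then average."""
    vals = [frame_sii(s, n, importance)
            for s, n in zip(speech_frames, noise_frames)]
    return sum(vals) / len(vals)
```

The averaging is what lets the measure credit "dips" in a fluctuating masker: a frame where the noise drops contributes a high per-frame SII even if the long-term average SNR is unchanged.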
Article
Full-text available
The evaluation of the intelligibility of noise reduction algorithms is reported. IEEE sentences and consonants were corrupted by four types of noise, including babble, car, street and train, at two signal-to-noise ratio levels (0 and 5 dB), and then processed by eight speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model-based and Wiener-type algorithms. The enhanced speech was presented to normal-hearing listeners for identification. With the exception of a single noise condition, no algorithm produced significant improvements in speech intelligibility. Information transmission analysis of the consonant confusion matrices indicated that no algorithm significantly improved the place feature score, which is critically important for speech recognition. The algorithms that were found in previous studies to perform best in terms of overall quality were not the same algorithms that performed best in terms of speech intelligibility. The subspace algorithm, for instance, was previously found to perform the worst in terms of overall quality, but performed well in the present study in terms of preserving speech intelligibility. Overall, the analysis of consonant confusion matrices suggests that in order for noise reduction algorithms to improve speech intelligibility, they need to improve the place and manner feature scores.
Article
Full-text available
In this paper, we evaluate the performance of several objective measures in terms of predicting the quality of noisy speech enhanced by noise suppression algorithms. The objective measures considered a wide range of distortions introduced by four classes of speech enhancement algorithms (spectral subtractive, subspace, statistical-model-based, and Wiener algorithms) operating on four types of real-world noise at two signal-to-noise ratio levels. The subjective quality ratings were obtained using the ITU-T P.835 methodology designed to evaluate the quality of enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. This paper reports on the evaluation of correlations of several objective measures with these three subjective rating scales. Several new composite objective measures are also proposed by combining the individual objective measures using nonparametric and parametric regression analysis techniques.
Book
From common consumer products such as cell phones and MP3 players to more sophisticated projects such as human-machine interfaces and responsive robots, speech technologies are now everywhere. Many think that it is just a matter of time before more applications of the science of speech become inescapable in our daily life. This handbook is meant to play a fundamental role for sustainable progress in speech research and development. Springer Handbook of Speech Processing targets three categories of readers: graduate students, professors and active researchers in academia and research labs, and engineers in industry who need to understand or implement some specific algorithms for their speech-related products. The handbook could also be used as a sourcebook for one or more graduate courses on signal processing for speech and different aspects of speech processing and applications. A quickly accessible source of application-oriented, authoritative and comprehensive information about these technologies, it combines the established knowledge derived from research in such fast evolving disciplines as Signal Processing and Communications, Acoustics, Computer Science and Linguistics.
Article
As speech processing devices like mobile phones, voice controlled devices, and hearing aids have increased in popularity, people expect them to work anywhere and at any time without user intervention. However, the presence of acoustical disturbances limits the use of these applications, degrades their performance, or causes the user difficulties in understanding the conversation or appreciating the device. A common way to reduce the effects of such disturbances is through the use of single-microphone noise reduction algorithms for speech enhancement. The field of single-microphone noise reduction for speech enhancement comprises a history of more than 30 years of research. In this survey, we wish to demonstrate the significant advances that have been made during the last decade in the field of discrete Fourier transform domain-based single-channel noise reduction for speech enhancement. Furthermore, our goal is to provide a concise description of a state-of-the-art speech enhancement system, and demonstrate the relative importance of the various building blocks of such a system. This allows the non-expert DSP practitioner to judge the relevance of each building block and to implement a close-to-optimal enhancement system for the particular application at hand.
Conference Paper
In this paper, a time-frequency weighting is proposed for speech reinforcement (near-end listening enhancement) in a noisy and reverberant environment, which optimizes a perceptual distortion measure locally for each time-frequency bin. The algorithm acts as a dynamic range compressor, smearing out the energy of the clean speech along time. Simulations predict an intelligibility increase with respect to the unprocessed condition and two reference methods, for moderate smoothing windows, as measured by the optimized distortion measure and two objective intelligibility measures.
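The dynamic-range-compressor behaviour described above can be sketched per time-frequency bin. This is an illustration of log-domain compression in general, not the paper's optimized perceptual distortion measure; the compression ratio and reference level are assumptions.

```python
import math

def drc_gains(frame_pow, ref_pow, ratio=2.0, eps=1e-12):
    """Per-bin compressive amplitude gains: bin levels above the reference
    are attenuated and levels below it amplified, shrinking the dynamic
    range around ref_pow by `ratio` in the log (dB) domain."""
    ref_db = 10.0 * math.log10(ref_pow + eps)
    gains = []
    for p in frame_pow:
        level_db = 10.0 * math.log10(p + eps)
        out_db = ref_db + (level_db - ref_db) / ratio   # compressed level
        gains.append(10.0 ** ((out_db - level_db) / 20.0))
    return gains
```

A bin sitting at the reference level is left untouched, while a bin 20 dB above it is pulled 10 dB down at a 2:1 ratio, i.e., an amplitude gain of 10^(-0.5).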
Article
In mobile telephony, near-end listening enhancement is desired by the near-end listener, who perceives not only the clean far-end speech but also ambient background noise. A typical scenario is mobile telephony in acoustical background noise such as traffic or babble noise. In such a situation, it is often not acceptable or possible to increase the audio power. In this contribution we analyse the calculation rules of the Speech Intelligibility Index (SII) and develop a recursive closed-form solution which maximizes the SII under the constraint of an unchanged average power of the audio signal. This solution has very low complexity compared to a previous approach of the authors and is thus suitable for real-time processing.
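The power constraint can be illustrated with a toy gain rule: boost poor-SNR bands with a compressive exponent, then renormalize so the total speech power is unchanged. The function name, the exponent `p`, and the gain rule itself are assumptions for illustration; this is not the paper's closed-form SII optimum.

```python
import math

def power_constrained_gains(speech_pow, noise_pow, p=0.25):
    """Toy near-end gain rule under an unchanged-power constraint.

    Bins with relatively strong noise get larger raw gains
    (compressive exponent p < 1); a final scale factor restores
    the original total speech power.
    """
    raw = [(n / s) ** p if s > 0 else 1.0
           for s, n in zip(speech_pow, noise_pow)]
    out_pow = sum(s * g * g for s, g in zip(speech_pow, raw))
    in_pow = sum(speech_pow)
    scale = math.sqrt(in_pow / out_pow)    # enforce power constraint
    return [g * scale for g in raw]
```

Energy is thus moved from favorable bands to bands where noise would otherwise mask the speech, at constant overall level.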
Article
Objective: A Dutch matrix sentence test was developed and evaluated. A matrix test is a speech-in-noise test based on a closed speech corpus of sentences derived from words from fixed categories. An example is "Mark gives five large flowers." Design: This report consists of the development of the speech test and a multi-center evaluation. Study sample: Forty-five normal-hearing participants. Results: The developed matrix test has a speech reception threshold in stationary noise of -8.4 dB with an inter-list standard deviation of 0.2 dB. The slope of the intelligibility function is 10.2 %/dB and this is slightly lower than that of similar tests in other languages (12.6 to 17.1 %/dB). Conclusions: The matrix test is now also available in Dutch and can be used in both Flanders and the Netherlands.
Article
In this article, we address speech reinforcement (near-end listening enhancement) for a scenario where there are several playback zones. In such a framework, signals from one zone can leak into other zones (crosstalk), causing intelligibility and/or quality degradation. An optimization framework is built by exploring a signal model where effects of noise, reverberation and zone crosstalk are taken into account simultaneously. Through the symbolic usage of a general smooth distortion measure, necessary optimality conditions are derived in terms of distortion measure gradients and the signal model. Subsequently, as an illustrative example of the framework, the conditions are applied for the mean-square error (MSE) expected distortion under a hybrid stochastic-deterministic model for the corruptions. A crosstalk cancellation algorithm follows, which depends on diffuse reverberation and across zone direct path components. Simulations validate the optimality of the algorithm and show a clear benefit in multizone processing, as opposed to the iterated application of a single-zone algorithm. Also, comparisons with least-squares crosstalk cancellers in literature show the profit of using a hybrid model.
Conference Paper
Speech output is used extensively, including in situations where correct message reception is threatened by adverse listening conditions. Recently, there has been a growing interest in algorithmic modifications that aim to increase the intelligibility of both natural and synthetic speech when presented in noise. The Hurricane Challenge is the first large-scale open evaluation of algorithms designed to enhance speech intelligibility. Eighteen systems operating on a common data set were subjected to extensive listening tests and compared to unmodified natural and text-to-speech (TTS) baselines. The best-performing systems achieved gains over unmodified natural speech of 4.4 and 5.1 dB in competing speaker and stationary noise respectively, while TTS systems made gains of 5.6 and 5.1 dB over their baseline. Surprisingly, for most conditions the largest gains were observed for noise-independent algorithms, suggesting that performance in this task can be further improved by exploiting information in the masking signal.
Article
This paper describes speech intelligibility enhancement for Hidden Markov Model (HMM) generated synthetic speech in noise. We present a method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech. We update these coefficients such that the glimpse proportion – an objective measure of the intelligibility of speech in noise – increases, while keeping the speech energy fixed. An acoustic analysis reveals that the modified speech is boosted in the region 1–4 kHz, particularly for vowels, nasals and approximants. Results from listening tests employing speech-shaped noise show that the modified speech is as intelligible as a synthetic voice trained on plain speech whose duration, Mel cepstral coefficients and excitation signal parameters have been adapted to Lombard speech from the same speaker. Our proposed method does not require these additional recordings of Lombard speech. In the presence of a competing talker, both modification and adaptation of spectral coefficients give more modest gains.
Article
Perceptual models exploiting auditory masking are frequently used in audio and speech processing applications like coding and watermarking. In most cases, these models only take into account spectral masking in short-time frames. As a consequence, undesired audible artifacts in the temporal domain may be introduced (e.g., pre-echoes). In this article we present a new low-complexity spectro-temporal distortion measure. The model facilitates the computation of analytic expressions for masking thresholds, while advanced spectro-temporal models typically need computationally demanding adaptive procedures to find an estimate of these masking thresholds. We show that the proposed method gives similar masking predictions as an advanced spectro-temporal model with only a fraction of its computational power. The proposed method is also compared with a spectral-only model by means of a listening test. From this test it can be concluded that for non-stationary frames the spectral model underestimates the audibility of introduced errors and therefore overestimates the masking curve. As a consequence, the system of interest incorrectly assumes that errors are masked in a particular frame, which leads to audible artifacts. This is not the case with the proposed method which correctly detects the errors made in the temporal structure of the signal.
Conference Paper
Linear programming deals with problems such as (see [4], [5]): to maximize a linear function \( g(x) \equiv \sum_i c_i x_i \) of \( n \) real variables \( x_1, \dots, x_n \) (forming a vector \( x \)) constrained by \( m + n \) linear inequalities.
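In modern matrix notation this problem class is usually written in the standard form (the \( m + n \) inequalities split into \( m \) resource constraints and \( n \) sign constraints):

```latex
\begin{aligned}
\max_{x \in \mathbb{R}^n} \quad & c^{\mathsf{T}} x \\
\text{subject to} \quad & A x \le b, \qquad x \ge 0,
\end{aligned}
```

where \( A \) is the \( m \times n \) constraint matrix and \( b \in \mathbb{R}^m \).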
Chapter
Contents: Introduction; Wave Motion; Plane Sound Waves; Impedance and Sound Intensity; Three-Dimensional Wave Equation; Sources of Sound; Sound Intensity; Sound Power of Sources; Sound Sources above a Rigid Hard Surface; Directivity; Line Sources; Reflection, Refraction, Scattering, and Diffraction; Ray Acoustics; Energy Acoustics; Near Field, Far Field, Direct Field, and Reverberant Field; Room Equation; Sound Radiation from Idealized Structures; Standing Waves; Waveguides; Acoustical Lumped Elements; Numerical Approaches: Finite Elements and Boundary Elements; Acoustical Modeling Using Equivalent Circuits; References
Conference Paper
Most speech enhancement algorithms heavily depend on the noise power spectral density (PSD). Because this quantity is unknown in practice, estimation from the noisy data is necessary. We present a low complexity method for noise PSD estimation. The algorithm is based on a minimum mean-squared error estimator of the noise magnitude-squared DFT coefficients. Compared to minimum statistics based noise tracking, segmental SNR and PESQ are improved for non-stationary noise sources with 1 dB and 0.25 MOS points, respectively. Compared to recently published algorithms, similar good noise tracking performance is obtained, but at a computational complexity that is in the order of a factor 40 lower.
Article
Most listeners have difficulty understanding speech in reverberant conditions. The purpose of this study is to investigate whether it is possible to reduce the degree of degradation of speech intelligibility in reverberation through the development of an algorithm. The modulation spectrum is the spectral representation of the temporal envelope of the speech signal. That of clean speech is dominated by components between 1 and 16 Hz centered at 4 Hz which is the most important range for human perception of speech. In reverberant conditions, the modulation spectrum of speech is shifted toward the lower end of the modulation frequency range. In this study, we proposed to enhance the important modulation spectral components prior to distortion of speech by reverberation. Word intelligibility in a carrier sentence was tested with the newly developed algorithm including two different filter designs in three reverberant conditions. The reverberant speech was simulated by convoluting clean speech with impulse responses measured in the actual halls. The experimental results show that modulation filtering incorporated into a pre-processing algorithm improves intelligibility for normal hearing listeners when (1) the modulation filters are optimal for a specific reverberant condition (i.e., T60 = 1.1 s), and (2) consonants are preceded by highly powered segments. Under shorter (0.7 s) and longer (1.6 s) reverberation times, the modulation filtering in the current experiments, an Empirically-Designed (E-D) filter and a Data-Derived (D-D) filter, caused a slight performance decrement respectively. The results of this study suggest that further gains in intelligibility may be accomplished by re-design of the modulation filters suitable for other reverberant conditions.
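The idea of emphasizing the perceptually important envelope modulations can be sketched very crudely: separate the mean (DC) of the frame-energy envelope from its fluctuations and amplify the fluctuations. This sketch is an assumption for illustration; the paper's filters are proper modulation-frequency bandpass designs (targeting roughly 1-16 Hz), not this global-mean trick.

```python
def enhance_modulation(envelope, boost=2.0):
    """Crude modulation emphasis on a frame-energy envelope.

    Removes the envelope mean, amplifies the residual fluctuations
    by 'boost', adds the mean back, and clips at zero so the result
    stays a valid (non-negative) energy envelope.
    """
    mean = sum(envelope) / len(envelope)
    return [max(mean + boost * (e - mean), 0.0) for e in envelope]
```

A real pre-processor would apply such emphasis per frequency band and with a filter matched to the expected reverberation time.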
Article
The auditory system, like the visual system, may be sensitive to abrupt stimulus changes, and the transient component in speech may be particularly critical to speech perception. If this component can be identified and selectively amplified, improved speech perception in background noise may be possible. This paper describes an algorithm to decompose speech into tonal, transient, and residual components. The modified discrete cosine transform (MDCT) was used to capture the tonal component and the wavelet transform was used to capture transient features. A hidden Markov chain (HMC) model and a hidden Markov tree (HMT) model were applied to capture statistical dependencies between the MDCT coefficients and between the wavelet coefficients, respectively. The transient component identified by the wavelet transform was selectively amplified and recombined with the original speech to generate modified speech, with energy adjusted to equal the energy of the original speech. The intelligibility of the original and modified speech was evaluated in eleven human subjects using the modified rhyme protocol. Word recognition rate results show that the modified speech can improve speech intelligibility at low SNR levels (8% at , 14% at , and 18% at ) and has minimal effect on intelligibility at higher SNR levels.
Article
Two procedures for improving the intelligibility of wideband telephone speech in the presence of competing babble noise are evaluated. One procedure is differentiation, the other consists of equalizing the speech spectrum by applying the inverse of the average spectrum of formant amplitudes for adult male speakers ("formant equalization"). Speech processed by these two methods was evaluated both for intelligibility and for listener preference. Both methods produced a clear increase in intelligibility compared to unprocessed wideband telephone speech. Formant equalization was found to be preferred over differentiation, more so at low signal-to-noise ratios than at high ones.
Article
The problem of determining necessary conditions and sufficient conditions for a relative minimum of a function \( f(x_1, x_2, \dots, x_n) \) in the class of points \( x = (x_1, x_2, \dots, x_n) \) satisfying the equations \( g_\alpha(x) = 0 \) \((\alpha = 1, 2, \dots, m)\), where the functions \( f \) and \( g_\alpha \) have continuous derivatives of at least the second order, has been satisfactorily treated [1]*. This paper proposes to take up the corresponding problem in the class of points \( x \) satisfying the inequalities \( g_\alpha(x) \geq 0 \) \((\alpha = 1, 2, \dots, m)\), where \( m \) may be less than, equal to, or greater than \( n \).
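The resulting first-order optimality conditions, now known as the Karush-Kuhn-Tucker conditions, can be sketched in modern notation for minimizing \( f \) subject to \( g_\alpha(x) \geq 0 \):

```latex
\begin{aligned}
\nabla f(x^*) - \sum_{\alpha=1}^{m} u_\alpha \,\nabla g_\alpha(x^*) &= 0
  && \text{(stationarity)}\\
g_\alpha(x^*) &\geq 0, \quad u_\alpha \geq 0
  && \text{(primal/dual feasibility)}\\
u_\alpha \, g_\alpha(x^*) &= 0, \quad \alpha = 1, \dots, m
  && \text{(complementary slackness)}
\end{aligned}
```

so a multiplier \( u_\alpha \) can be nonzero only for constraints that are active at \( x^* \).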
Article
A common speech‐communications problem is to maximize speech intelligibility over a noisy channel when the transmitter is peak‐power limited. An optimum linear solution is analytically derived, and consists of a speech‐processing filter in the transmitter. The degree of improvement due to use of the filter is derived for a number of bandwidths and signal‐to‐noise ratios. The filter was experimentally verified by articulation tests as the optimum linear configuration, and was significantly more intelligible than other filters that bounded it. Although the optimum filter depends on the noise spectrum, there is no significant difference between it and a noise‐invariant filter for white or speech‐shaped noise. The use of the filter is equivalent to raising the transmitter power between 1 and 10 dB, depending on the speech bandwidth and the signal‐to‐noise ratio.
Article
A list of ten spoken Swedish sentences was computer edited to obtain new lists with exactly the same content of sound, but with new sentences. A noise was synthesized from the speech material by the computer to produce exactly the same spectrum of speech and noise. The noise was also amplitude modulated by a low frequency noise to make it sound more natural. This material was tested monaurally on 20 normal-hearing subjects. The equality in intelligibility of some of the lists was investigated. Repeated threshold measurements in noise showed a standard deviation of 0.44 dB when the learning effect was outbalanced. Only a small part of the learning effect was due to learning of the word material. Intelligibility curves fitted to the data points in noise and without noise showed maximum steepnesses of 25 and 10%/dB respectively. At constant signal to noise ratio (S/N) the best performance was achieved at a speech level of 53 dB.
Article
This paper presents the results of new studies based on speech intelligibility tests in simulated sound fields and analyses of impulse response measurements in rooms used for speech communication. The speech intelligibility test results confirm the importance of early reflections for achieving good conditions for speech in rooms. The addition of early reflections increased the effective signal-to-noise ratio and related speech intelligibility scores for both impaired and nonimpaired listeners. The new results also show that for common conditions where the direct sound is reduced, it is only possible to understand speech because of the presence of early reflections. Analyses of measured impulse responses in rooms intended for speech show that early reflections can increase the effective signal-to-noise ratio by up to 9 dB. A room acoustics computer model is used to demonstrate that the relative importance of early reflections can be influenced by the room acoustics design.
Article
The Speech Transmission Index (STI) is a physical metric that is well correlated with the intelligibility of speech degraded by additive noise and reverberation. The traditional STI uses modulated noise as a probe signal and is valid for assessing degradations that result from linear operations on the speech signal. Researchers have attempted to extend the STI to predict the intelligibility of nonlinearly processed speech by proposing variations that use speech as a probe signal. This work considers four previously proposed speech-based STI methods and four novel methods, studied under conditions of additive noise, reverberation, and two nonlinear operations (envelope thresholding and spectral subtraction). Analyzing intermediate metrics in the STI calculation reveals why some methods fail for nonlinear operations. Results indicate that none of the previously proposed methods is adequate for all of the conditions considered, while four proposed methods produce qualitatively reasonable results and warrant further study. The discussion considers the relevance of this work to predicting the intelligibility of cochlear-implant processed speech.
Article
Do listeners process noisy speech by taking advantage of "glimpses"-spectrotemporal regions in which the target signal is least affected by the background? This study used an automatic speech recognition system, adapted for use with partially specified inputs, to identify consonants in noise. Twelve masking conditions were chosen to create a range of glimpse sizes. Several different glimpsing models were employed, differing in the local signal-to-noise ratio (SNR) used for detection, the minimum glimpse size, and the use of information in the masked regions. Recognition results were compared with behavioral data. A quantitative analysis demonstrated that the proportion of the time-frequency plane glimpsed is a good predictor of intelligibility. Recognition scores in each noise condition confirmed that sufficient information exists in glimpses to support consonant identification. Close fits to listeners' performance were obtained at two local SNR thresholds: one at around 8 dB and another in the range -5 to -2 dB. A transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
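The study's key predictor, the proportion of the time-frequency plane glimpsed, can be computed directly: count the bins whose local SNR exceeds a detection threshold. The helper below is a minimal sketch over flat lists of per-bin powers; the threshold default is taken from the -5 to -2 dB range the study identifies.

```python
import math

def glimpse_proportion(speech_pow, noise_pow, threshold_db=-5.0):
    """Fraction of time-frequency bins where the local SNR exceeds
    threshold_db, i.e. where the target is 'glimpsed'."""
    count = 0
    for s, n in zip(speech_pow, noise_pow):
        snr_db = 10.0 * math.log10(s / n)  # local SNR in this bin
        if snr_db > threshold_db:
            count += 1
    return count / len(speech_pow)
```

In practice the bin powers come from an auditory-filterbank decomposition, and a minimum glimpse size may additionally be enforced.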
Article
Overlap-masking degrades speech intelligibility in reverberation [R. H. Bolt and A. D. MacDonald, J. Acoust. Soc. Am. 21(6), 577-580 (1949)]. To reduce the effect of this degradation, steady-state suppression has been proposed as a preprocessing technique [Arai et al., Proc. Autumn Meet. Acoust. Soc. Jpn., 2001; Acoust. Sci. Tech. 23(8), 229-232 (2002)]. This technique automatically suppresses steady-state portions of speech that have more energy but are less crucial for speech perception. The present paper explores the effect of steady-state suppression on syllable identification preceded by /a/ under various reverberant conditions. In each of two perception experiments, stimuli were presented to 22 subjects with normal hearing. The stimuli consisted of mono-syllables in a carrier phrase with and without steady-state suppression and were presented under different reverberant conditions using artificial impulse responses. The results indicate that steady-state suppression statistically improves consonant identification for reverberation times of 0.7 to 1.2 s. Analysis of confusion matrices shows that identification of voiced consonants, stop and nasal consonants, and bilabial, alveolar, and velar consonants were especially improved by steady-state suppression. The steady-state suppression is demonstrated to be an effective preprocessing method for improving syllable identification by reducing the effect of overlap-masking under specific reverberant conditions.
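Steady-state suppression can be sketched as a per-frame gain: attenuate frames whose energy barely changes relative to the previous frame. The detector and its parameters below are assumptions for illustration; the published technique uses a more careful steady-state measure than this one-frame energy difference.

```python
def steady_state_suppress(frame_energies, atten_db=-10.0, rel_change=0.2):
    """Per-frame gains that attenuate steady-state portions.

    A frame is called 'steady' if its energy differs from the
    previous frame's by less than rel_change (relative); steady
    frames get atten_db of attenuation, others pass unchanged.
    """
    g = 10.0 ** (atten_db / 20.0)
    out = [1.0]                                 # first frame: no reference
    for prev, cur in zip(frame_energies, frame_energies[1:]):
        steady = abs(cur - prev) <= rel_change * max(prev, 1e-12)
        out.append(g if steady else 1.0)
    return out
```

Suppressing these high-energy but perceptually redundant portions reduces the overlap-masking they would otherwise cause in a reverberant room.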
We describe a method to estimate the power spectral density of nonstationary noise when a noisy speech signal is given. The method can be combined with any speech enhancement algorithm which requires a noise power spectral density estimate. In contrast to other methods, our approach does not use a voice activity detector. Instead it tracks spectral minima in each frequency band without any distinction between speech activity and speech pause. By minimizing a conditional mean square estimation error criterion in each time step we derive the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal. Based on the optimally smoothed power spectral density estimate and the analysis of the statistics of spectral minima an unbiased noise estimator is developed. The estimator is well suited for real time implementations. Furthermore, to improve the performance in nonstationary noise we introduce a method to speed up the tracking of the spectral minima. Finally, we evaluate the proposed method in the context of speech enhancement and low bit rate speech coding with various noise types
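The minima-tracking idea can be sketched in a few lines: recursively smooth the noisy periodogram in each bin, then take the minimum of the smoothed values over a sliding window, on the premise that even during speech activity the envelope dips down to the noise floor. This is a simplified sketch; the published method additionally derives an optimal time-varying smoothing parameter and a bias compensation for the minimum.

```python
def track_noise_psd(noisy_pow_frames, alpha=0.85, win=8):
    """Minimum-statistics-style noise PSD tracker (simplified).

    noisy_pow_frames: list of frames, each a list of per-bin powers.
    Returns one noise PSD estimate (list of per-bin powers) per frame.
    """
    n_bins = len(noisy_pow_frames[0])
    smoothed = list(noisy_pow_frames[0])
    history = [list(smoothed)]
    estimates = [list(smoothed)]
    for frame in noisy_pow_frames[1:]:
        # first-order recursive smoothing of the periodogram
        smoothed = [alpha * s + (1 - alpha) * p
                    for s, p in zip(smoothed, frame)]
        history.append(list(smoothed))
        history = history[-win:]            # sliding minimum window
        estimates.append([min(h[k] for h in history)
                          for k in range(n_bins)])
    return estimates
```

Because no voice activity decision is needed, the tracker keeps updating through speech, at the cost of a slight negative bias that the full method corrects.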
This paper presents the results of an examination of rapid amplitude compression following high-pass filtering as a method for processing speech, prior to reception by the listener, as a means of enhancing the intelligibility of speech in high noise levels. Arguments supporting this particular signal processing method are based on the results of previous perceptual studies of speech in noise. In these previous studies, it has been shown that high-pass filtered/clipped speech offers a significant gain in the intelligibility of speech in white noise over that for unprocessed speech at the same signal-to-noise ratios. Similar results have also been obtained for speech processed by high-pass filtering alone. The present paper explores these effects and it proposes the use of high-pass filtering followed by rapid amplitude compression as a signal processing method for enhancing the intelligibility of speech in noise. It is shown that this new method results in a substantial improvement in the intelligibility of speech in white noise over normal speech and over previously implemented methods.
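The two-stage chain can be sketched per sample: a first-order high-pass (pre-emphasis) stage followed by an instantaneous power-law amplitude compression. The filter coefficient and compression exponent below are illustrative assumptions; applying the compression sample by sample is a deliberate simplification of "rapid" compression.

```python
import math

def highpass_compress(x, alpha=0.95, comp=0.5):
    """High-pass filter then instantaneous amplitude compression.

    Stage 1: y[n] = x[n] - alpha * x[n-1]   (pre-emphasis high-pass)
    Stage 2: sign-preserving power law |y|**comp, which boosts
             low-amplitude segments relative to peaks.
    """
    hp, prev = [], 0.0
    for s in x:
        hp.append(s - alpha * prev)
        prev = s
    return [math.copysign(abs(v) ** comp, v) for v in hp]
```

The compression stage flattens the amplitude distribution, so weak consonants are raised relative to strong vowels before the noise is added by the channel.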