NOISE-ROBUST AUTOMATIC SPEECH RECOGNITION
USING MAINLOBE-RESILIENT TIME-FREQUENCY
QUANTILE-BASED NOISE ESTIMATION
S. W. Lee, P. C. Ching, and Tan Lee
Department of Electronic Engineering
The Chinese University of Hong Kong, Shatin, N. T., Hong Kong.
ABSTRACT

In standard speech recognition systems in which the training data are clean speech, the presence of background noise in the received signal can severely deteriorate the recognition performance. This paper presents a simple noise-robust speech recognition system based on a modified noise spectral estimation method called Mainlobe-Resilient Time-Frequency Quantile-based Noise Estimation (M-R T-F QBNE), which focuses on the mainlobes at harmonic frequencies. We estimate the global signal-to-noise ratio (SNR) and select the recognition model that is best matched to the SNR operating range. Experimental results show that the recognition accuracy of the proposed recognition system is higher than that of the AURORA2 clean training baseline by 23%. Compared to multi-condition training, comparable recognition accuracy is attained.1

1. INTRODUCTION

Most state-of-the-art ASR frameworks are based on training using clean speech data
[1, 2]. These ASR systems are very sensitive to additive and/or convolutive noise disturbances in the received speech, which cause a mismatch between the clean training condition and the actual testing condition. The performance of ASR hence severely degrades.
Matching between the training and testing conditions is crucial for ASR [2, 3]. Recognition accuracy may decrease if training is done in a much higher SNR condition than testing, such as when the training data are collected using a close-talk high-quality microphone in a sound-proof chamber. On the other hand, if training and testing are carried out under the same environment, better matching and recognition could be achieved. In this paper, we propose to use a set of acoustic models trained on distinct noisy speech data sets. For each input utterance, a model is selected according to the estimated SNR, aiming at improving the degree of matching.

1This work is partially supported by a grant awarded by the Hong Kong Research Grants Council.
To estimate the SNR, we need to estimate the noise power spectrum. Traditionally, noise spectral estimation is attained by using a voice activity detector (VAD), which updates the estimated spectrum during speech pauses. This method is unsuitable for non-stationary noise, because the noise estimate cannot be updated during voiced segments, and a VAD may not be accurate at low SNRs. An enhanced statistical approach is proposed instead, based on Quantile-based Noise Estimation (QBNE) [5, 6, 7]. The new estimation method is designed to be more resilient to the mainlobe effect of speech harmonics.
2. NOISE SPECTRAL ESTIMATION BY M-R T-F QBNE

Before introducing the enhanced M-R T-F QBNE, the general ideas of QBNE and Time-Frequency QBNE are first reviewed. A performance comparison for synthetic and real speech segments is given in Section 2.3.
2.1. Quantile-based Noise Estimation (QBNE)
The QBNE method was originally developed in [7]. Given a noisy speech input x(t), it is first windowed into segments and the corresponding short-time power spectra are computed. Let X(ω, t) and Nq(ω, t) be the power spectrum of x(t) and the estimated noise spectrum at frequency ω and time t respectively. For each frequency bin, a buffer stores the values of X(ω, t) over a pre-defined duration T. The buffer content is then sorted and the q-th quantile is taken as Nq(ω, t). The process can be summarized as follows: (1) take the short-time Fourier transform and obtain X(ω, t); (2) store X(ω, t) over the past T frames in a buffer for each frequency bin; (3) sort the buffered values in ascending order of power; (4) select the q-th quantile and assign it as the noise estimate Nq(ω, t).
The parameter q normally takes a value of 0.5, which represents the median. QBNE exploits the quasi-periodic characteristics of voiced segments: the power spectrum is the superposition of the noise spectrum and the harmonic spectrum from speech, so at frequencies away from the harmonics the power values are mainly contributed by the additive background noise. All parameters and values in QBNE are relative and independent of the absolute signal power. Compared with conventional noise estimation using a VAD, QBNE performance is comparable to that obtained with a hand-labelled VAD. However, QBNE is inaccurate at frequencies where the speech components are consistently dominant.
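As an illustration, the per-bin buffering and quantile selection described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; the buffer length and quantile are free parameters here.

```python
import numpy as np

def qbne(power_spec, q=0.5, buffer_len=50):
    """Quantile-based noise estimation.

    power_spec : array of shape (num_frames, num_bins) holding the
                 short-time power spectra X(w, t) of the noisy input.
    Returns an array of the same shape holding Nq(w, t): for each
    frequency bin, the q-th quantile of the buffered recent frames.
    """
    num_frames, num_bins = power_spec.shape
    noise = np.empty_like(power_spec, dtype=float)
    for t in range(num_frames):
        lo = max(0, t - buffer_len + 1)
        window = power_spec[lo:t + 1]      # buffer over duration T
        # sorting and picking the q-th value is what np.quantile does per bin
        noise[t] = np.quantile(window, q, axis=0)
    return noise
```

With q = 0.5 this reduces to a running per-bin median, which tracks the noise floor as long as speech does not dominate a bin for more than half the buffer.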
2.2. Time-Frequency QBNE (T-F QBNE)
To obtain accurate estimation at harmonic frequencies, information from adjacent spectral troughs is used [8, 10, 11]. This is called Time-Frequency QBNE (T-F QBNE). Let Nt-fq(ω, t) be the noise estimate for frequency ω at time t by T-F QBNE. The buffer mentioned earlier stores the values of X(ω, t) not only along the time axis, but also along the frequency axis, and Nt-fq(ω, t) is defined as the weighted sum of the Nq(ω, t) estimates over the neighbouring frequency bins.
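A minimal sketch of this frequency smoothing is given below, assuming a three-bin neighbourhood with hypothetical weights; the paper does not specify the neighbourhood size or the weights.

```python
import numpy as np

def tf_qbne(nq, weights=(0.25, 0.5, 0.25)):
    """T-F QBNE sketch: combine per-bin QBNE estimates Nq(w, t) with a
    weighted sum over the adjacent frequency bins w-1, w, w+1.
    `weights` are hypothetical example values; edge bins reuse the
    edge value via padding."""
    nq = np.asarray(nq, dtype=float)
    padded = np.pad(nq, ((0, 0), (1, 1)), mode="edge")  # pad frequency axis
    w_lo, w_mid, w_hi = weights
    return w_lo * padded[:, :-2] + w_mid * padded[:, 1:-1] + w_hi * padded[:, 2:]
```

Because the weights sum to one, a flat noise estimate passes through unchanged, while isolated harmonic-frequency outliers are pulled toward the neighbouring trough estimates.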
2.3. Mainlobe-Resilient Time-Frequency QBNE (M-R T-F QBNE)
The proposed M-R T-F QBNE method is based on T-F QBNE. The following establishes the method in detail. Both noise and speech constitute peaks in the power spectrum X(ω, t), and the treatment described below should only be applied to speech peaks. As a result, for every peak found, a decision is made to determine whether the peak comes from noise or from the speech harmonics. This is accomplished by peak picking and pitch extraction.
Regarding the peak picking, the first and second derivatives, namely X'(ω, t) and X''(ω, t), are calculated by numerical differentiation derived from Taylor series expansion [12]. We select only candidates with negative X''(ω, t), group consecutive frequencies together and take the one with the smallest absolute value of X'(ω, t) from each group as the peak location. A robust pitch extraction scheme [13] is used to derive the fundamental frequency (pitch) throughout the utterance. It uses the Average Magnitude Difference Function (AMDF) to weight the conventional autocorrelation estimation. With the pitch and peak locations, a peak is claimed to come from the speech harmonics if it is located around a harmonic of the pitch frequency within a small shift. This is shown in Figure 1.

We apply the following treatment to those speech harmonic peaks and ordinary QBNE to the remaining frequencies, since speech harmonic peaks are more vulnerable to the speech power. First, we implement a linear interpolation between 2 selected Nq(ω, t) around the harmonic peak, separated by a frequency distance. Let band be this frequency distance, as illustrated in Figure 2.
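The peak picking and harmonic labelling described above can be sketched as follows. This is a simplified illustration: the pitch is assumed to be already extracted (e.g. by the weighted autocorrelation scheme), and the small shift is a hypothetical two-bin tolerance.

```python
import numpy as np

def pick_peaks(log_spec):
    """Peak picking via numerical differentiation: keep candidates with
    negative second derivative, group runs of consecutive bins, and take
    the bin with the smallest |first derivative| from each group."""
    d1 = np.gradient(log_spec)          # central-difference X'(w, t)
    d2 = np.gradient(d1)                # X''(w, t)
    cand = np.flatnonzero(d2 < 0)
    peaks = []
    if cand.size:
        # split candidate bins into runs of consecutive indices
        groups = np.split(cand, np.flatnonzero(np.diff(cand) > 1) + 1)
        peaks = [int(g[np.argmin(np.abs(d1[g]))]) for g in groups]
    return peaks

def is_harmonic(peak_bin, pitch_bin, shift=2):
    """Label a peak as a speech harmonic if it lies within `shift` bins
    (hypothetical tolerance) of a multiple of the pitch bin."""
    k = max(1, round(peak_bin / pitch_bin))
    return abs(peak_bin - k * pitch_bin) <= shift
```

Grouping by runs of negative curvature mirrors the paper's "group those consecutive frequencies together" step, so a broad mainlobe yields a single peak location rather than several.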
Figure 1: A peak is claimed to come from the speech harmonics if it is enclosed by a rectangle. The stems and the arrows mark the detected peaks and the pitch harmonic frequencies respectively. The rectangles model the small-shift region, and the tick and cross above the figure show whether a peak is a speech harmonic or not.

Figure 2: Linear interpolation is done in frequency within band around the harmonic peak.
Considering the 2 harmonic peaks in Figure 2, it is observed that band for the strong peak and band for the weak peak should be different. When the strong peak rolls off down to the noise level, the frequency distance is much greater than that for the weak peak. As a result, band should be adjusted according to the peak value: a large band should be given to a strong harmonic peak, and vice versa. A small band for weak harmonic peaks also provides better tracking of the noise level. This is the proposed M-R T-F QBNE. In the simulations, 4 band values are chosen and the log-scale dynamic range of peak values in every segment is divided into 4 linear portions. For each harmonic peak, the i-th band value is selected as the frequency distance by

i = min( ⌊ 4 · (log[X(ωP, t)] − α) / (β − α) ⌋ , 3 ),   i ∈ {0, 1, 2, 3},

where log[X(ωP, t)], α and β are the log power of the current harmonic peak, the minimum log[X(ωP, t)] and the maximum log[X(ωP, t)] in the segment respectively. The exact positions of Nq(ωH, t) and Nq(ωL, t) are symmetrically located at both sides of the peak at a distance band/2. M-R T-F QBNE hence prevents poor interpolation at strong harmonic peaks and inaccurate tracking at weak peaks.
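A sketch of the band selection and the interpolation over a harmonic peak is given below. Only the mapping of log peak power to one of four linear portions follows the description above; the actual band widths in bins are not stated in the paper and would be hypothetical parameters.

```python
import numpy as np

def band_index(log_peak, alpha, beta, num_bands=4):
    """Map a harmonic peak's log power into one of `num_bands` linear
    portions of the segment's log-scale dynamic range [alpha, beta]."""
    if beta <= alpha:
        return 0
    frac = (log_peak - alpha) / (beta - alpha)
    return min(int(num_bands * frac), num_bands - 1)

def interpolate_over_peak(nq_frame, peak_bin, band_bins):
    """Replace Nq around a harmonic peak by linear interpolation between
    the two estimates located band/2 below and above the peak."""
    half = band_bins // 2
    lo = max(peak_bin - half, 0)
    hi = min(peak_bin + half, len(nq_frame) - 1)
    out = nq_frame.copy()
    out[lo:hi + 1] = np.linspace(nq_frame[lo], nq_frame[hi], hi - lo + 1)
    return out
```

A strong peak (log power near β) maps to index 3 and hence the widest band, while a weak peak near α maps to index 0, matching the adaptivity argued for above.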
The performance of M-R T-F QBNE is examined using synthetic sounds as well as real speech immersed in different background noise. A synthetic speech is produced by the source-filter model with a pitch frequency of 150 Hz and formant frequencies at 700, 1220 and 2600 Hz with bandwidths of 130, 70 and 160 Hz respectively. White Gaussian noise is added at an SNR of 15 dB.
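A synthetic vowel of this kind can be reproduced approximately with the following source-filter sketch: an impulse train at the pitch frequency excites cascaded second-order resonators, one per (formant frequency, bandwidth) pair. The resonator realization is a standard textbook form, not necessarily the authors' exact synthesizer, and the sampling rate is an assumed 8 kHz.

```python
import numpy as np

def synth_vowel(fs=8000, dur=0.5, f0=150.0,
                formants=((700, 130), (1220, 70), (2600, 160))):
    """Source-filter synthesis: impulse-train excitation through
    cascaded two-pole resonators (centre frequency, bandwidth in Hz)."""
    n = int(fs * dur)
    src = np.zeros(n)
    src[::int(round(fs / f0))] = 1.0         # glottal impulse train
    y = src
    for fc, bw in formants:
        r = np.exp(-np.pi * bw / fs)         # pole radius from bandwidth
        theta = 2 * np.pi * fc / fs          # pole angle from centre freq
        a1, a2 = -2 * r * np.cos(theta), r * r
        out = np.zeros(n)
        for t in range(n):                   # y[t] = x[t] - a1*y[t-1] - a2*y[t-2]
            out[t] = y[t]
            if t >= 1:
                out[t] -= a1 * out[t - 1]
            if t >= 2:
                out[t] -= a2 * out[t - 2]
        y = out
    return y / np.max(np.abs(y))

def add_noise(x, snr_db=15.0, seed=0):
    """Add white Gaussian noise scaled to the requested global SNR."""
    noise = np.random.default_rng(seed).normal(size=x.size)
    noise *= np.sqrt(np.sum(x ** 2) / np.sum(noise ** 2) / 10 ** (snr_db / 10))
    return x + noise
```

`add_noise(synth_vowel(), 15.0)` then gives a test signal analogous to the one used for Figure 3.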
Figure 3: Estimated noise spectra of the synthesized ‘a’ sound. The true value refers to the exact noise spectrum.

Figure 4: MSE of noise estimation versus SNR for a typical speech sample.
Figure 3 shows the estimated noise spectra from the various methods. It is observed that M-R T-F QBNE reduces the contribution of the speech harmonics towards the noise estimation. The real speech used for evaluation is a speech sample from the AURORA2 database [14] under subway noise with SNRs from -5 to 20 dB. Figure 4 plots the MSE of the noise estimation versus SNR using the same clean speech.
3. EXPERIMENT ON SPEECH RECOGNITION
An efficient speech recognition system is designed and introduced in this section. The AURORA2 database [14] is used. It is an English connected-digits corpus, commonly used to evaluate the performance of ASR algorithms in noisy conditions (8 noise types and 7 SNRs). There are 70070 testing samples and 2 modes of training, each containing 8440 speech samples. Clean training refers to training on clean data only, while multi-condition training uses both clean and noisy data.
Figure 5: The proposed recognition system
Figure 5 gives a functional block diagram of the proposed recognition system. After M-R T-F QBNE, the global SNR (SNRg) is calculated by

SNRg = 10 log10 ( Σω,t [X(ω, t) − NM-R t-fq(ω, t)] / Σω,t NM-R t-fq(ω, t) ),

where NM-R t-fq(ω, t) is the estimated noise spectrum from M-R T-F QBNE. The best-matched model is then chosen according to SNRg. Three SNR-dependent models are trained. The high SNR model uses only 1688 clean data, the medium SNR model uses 3376 data from the SNR15 and SNR20 training sets, and the low SNR model is trained on 3376 data from the SNR5 and SNR10 training sets. Other settings in the recognizer are identical to those defined in [14].
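The SNR computation and model selection can be sketched as below, assuming SNRg is the ratio of the total estimated speech power (spectrum minus noise estimate, floored at zero) to the total estimated noise power. The decision thresholds are hypothetical, placed between the training ranges; the paper does not state them.

```python
import numpy as np

def global_snr_db(power_spec, noise_est, eps=1e-12):
    """Global SNR in dB from the noisy power spectrum and the estimated
    noise spectrum, summed over all frequency bins and frames."""
    speech = np.maximum(power_spec - noise_est, 0.0)
    return 10.0 * np.log10((speech.sum() + eps) / (noise_est.sum() + eps))

def select_model(snr_db, high_thr=17.5, low_thr=12.5):
    """SNR-dependent model selection; thresholds are hypothetical,
    chosen between the clean, SNR15-20 and SNR5-10 training ranges."""
    if snr_db >= high_thr:
        return "high"
    if snr_db >= low_thr:
        return "medium"
    return "low"
```

At run time, only this scalar comparison is added after noise estimation, which is what makes the system cheap relative to multi-condition training.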
4. SIMULATION RESULTS AND DISCUSSION
The average recognition accuracy of M-R T-F QBNE is shown in Table 1(a). Table 1(b) gives the result obtained if the noise spectrum is known and used as NM-R t-fq(ω, t). Tables 1(c) and (d) list the recognition results using the clean and multi-condition training modes respectively, which serve as baselines for evaluating the performance of the whole recognition system.
Comparing the overall average recognition accuracy with the noise spectrum known a priori and that from M-R T-F QBNE against the clean training mode, both approaches outperform the clean training baseline. For test set A, there is at least a 23.4% absolute improvement and a 60.5% relative improvement from M-R T-F QBNE or from the known noise spectrum.
Regarding the multi-condition training mode, if the noise spectrum is known beforehand, the result is promising: the accuracy is even slightly higher, although only 20 to 40% of the original training size is used in each model. There is no recognition degradation for clean or high-SNR input speech, although such degradation occurs in the multi-condition training mode. This shows the great potential of SNR-dependent model selection in the designed system. In addition, as the recognition result from M-R T-F QBNE is similar to that from the known noise spectrum, the benefit of M-R T-F QBNE towards ASR is asserted and the robustness issue mentioned above is alleviated.
Table 1: Average recognition accuracy from (a) M-R T-F QBNE, (b) known noise spectrum, (c) clean training and (d) multi-condition training.
5. CONCLUSION

To conclude, a recognition system with SNR-dependent model selection is proposed, together with a new noise estimation method, M-R T-F QBNE. It is clear that M-R T-F QBNE is effective in noise spectral estimation, and the proposed ASR system gives a significant improvement on the AURORA2 recognition sets over the clean training baseline. The system is effective and simple to implement, since only model selection is required after estimating the global SNR and no other extra computation is introduced.
6. REFERENCES

[1] G. M. Davis, Noise Reduction in Speech Applications, CRC Press, 2002.
[2] B. H. Juang, "Speech Recognition in Adverse Environments," Computer Speech and Language, vol. 5, pp. 275-294, 1991.
[3] Y. Gong, "Speech Recognition in Noisy Environments: a Survey," Speech Communication, vol. 16, pp. 261-291, 1995.
[4] H. G. Hirsch, "Estimation of Noise Spectrum and its Application to SNR Estimation and Speech Enhancement," Technical Report TR-93-012, International Computer Science Institute, Berkeley, USA, 1993.
[5] P. Motlíček and L. Burget, "Efficient Noise Estimation and its Application for Robust Speech Recognition," in Proc. 5th International Conference on Text, Speech and Dialogue, 2002.
[6] H. G. Hirsch and C. Ehrlicher, "Noise Estimation Techniques for Robust Speech Recognition," in Proc. ICASSP, 1995, vol. 1.
[7] V. Stahl, A. Fischer and R. Bippus, "Quantile Based Noise Estimation for Spectral Subtraction and Wiener Filtering," in Proc. ICASSP, 2000, vol. 3, pp. 1875-1878.
[8] C. Ris and S. Dupont, "Assessing Local Noise Level Estimation Methods: Application to Noise Robust ASR," Speech Communication, vol. 34, pp. 141-158, 2001.
[9] N. W. D. Evans and J. S. Mason, "Noise Estimation Without Explicit Speech, Non-speech Detection: a Comparison of Mean, Modal and Median Based Approaches," in Proc. EuroSpeech, 2001, vol. 2, pp. 893-896.
[10] N. W. D. Evans and J. S. Mason, "Time-Frequency Quantile-Based Noise Estimation," in Proc. EUSIPCO, 2002, vol. 1, pp. 539-542.
[11] D. Ealey, H. Kelleher and D. Pearce, "Harmonic Tunnelling: Tracking Non-stationary Noise during Speech," in Proc. EuroSpeech, 2001, vol. 1, pp. 437-440.
[12] S. C. Chapra and R. P. Canale, Numerical Methods for Engineers with Software and Programming Applications, McGraw-Hill, 2001.
[13] T. Shimamura and H. Kobayashi, "Weighted Autocorrelation for Pitch Extraction of Noisy Speech," IEEE Trans. Speech and Audio Processing, vol. 9, pp. 727-730, 2001.
[14] H. G. Hirsch and D. Pearce, "The AURORA Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions," in Proc. ISCA ITRW ASR2000, 2000.