Suppression of Late Reverberation Effect
on Speech Signal Using Long-Term
Multiple-step Linear Prediction
Keisuke Kinoshita, Member, IEEE, Marc Delcroix, Member, IEEE, Tomohiro Nakatani, Senior Member, IEEE,
and Masato Miyoshi, Senior Member, IEEE
Abstract—A speech signal captured by a distant microphone is generally smeared by reverberation, which severely degrades automatic speech recognition (ASR) performance. One way to solve this problem is to dereverberate the observed signal prior to ASR. In this paper, a room impulse response is assumed to consist of three parts: a direct-path response, early reflections, and late reverberations. Since late reverberations are known to be a major cause of ASR performance degradation, this paper focuses on dealing with the effect of late reverberations. The proposed method first estimates the late reverberations using long-term multi-step linear prediction, and then reduces the late reverberation effect by employing spectral subtraction. The algorithm provided good dereverberation with training data corresponding to the duration of one speech utterance, in our case less than 6 s. This paper describes the proposed framework for both single-channel and multichannel scenarios. Experimental results showed substantial improvements in ASR performance with real recordings under severe reverberant conditions.
Index Terms—Automatic speech recognition (ASR), dereverberation, multi-step linear prediction (MSLP), reverberation.
I. INTRODUCTION
A speech signal captured by a distant microphone is generally smeared by reverberation, which is caused by reflections from, for example, walls, floors, ceilings, or furniture. Reverberation is known to degrade the performance of automatic speech recognition (ASR) severely. Thus, it is desirable to find a reliable way of mitigating the effect of reverberation on ASR.
A major stream of research designed to find a way to cope with the reverberation problem involves estimating inverse filters that remove the distortion caused by the impulse response using multiple microphones. One approach for constructing such inverse filters is to first estimate the room impulse responses, and then calculate their inverse based on, for example, the multiple-input/output inverse theorem (MINT) [1]. Some researchers have proposed using a subspace method for estimating the impulse responses [2], [3]. The room impulse responses are obtained from the null space of the covariance matrix of the observed signals. However, these subspace methods are highly dependent on a priori knowledge of the channel orders, and are sensitive to errors in channel order estimates. Another common approach for obtaining inverse filters is to use a linear prediction (LP) algorithm, which provides a way to calculate the inverse filter directly. Unlike the subspace approaches, LP-based methods are relatively robust to channel order mismatches [4]–[6]. The dereverberation methods based on inverse filtering are developed with a solid theoretical background that enables us to achieve precise dereverberation. Therefore, they are viewed as very attractive ways of solving the reverberation problem. However, these methods are known to pose a sensitivity problem in that background noise or a small change in the transfer function results in severe performance degradation [7].

Manuscript received April 9, 2008; revised September 4, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tim Fingscheidt.
K. Kinoshita, T. Nakatani, and M. Miyoshi are with NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan (e-mail: kinoshita@cslab.kecl.ntt.co.jp; nak@cslab.kecl.ntt.co.jp; miyo@cslab.kecl.ntt.co.jp).
M. Delcroix was with NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan. He is now with Pixela Corporation, Osaka 556-0011, Japan (e-mail: marc@cslab.kecl.ntt.co.jp).
Digital Object Identifier 10.1109/TASL.2008.2009015
In contrast to the inverse filtering methods, robust and practical approaches have been investigated to mitigate the effect of reverberation on ASR [8]–[10]. In this paper, reverberant speech is assumed to consist of a direct-path response, early reflections, and late reverberations. The early reflections are defined as the reflection components that arrive after the direct-path response within a time interval of 30 ms (which corresponds to the length of the speech analysis frame used in this paper), and the late reverberations as all the later reflections. The early reflections may not significantly degrade ASR performance if they are handled by cepstral mean subtraction (CMS) [11] or maximum-likelihood linear regression (MLLR) [12]. On the other hand, the late reverberations can be detrimental to ASR performance [13], [14]. The standard ASR techniques that compensate for convolutional distortion, such as CMS, do not work well for the late reverberations. In addition, it has been reported that, in a severely reverberant environment where the late reverberations have a large energy, the ASR performance cannot be improved even with an acoustic model trained under a matched reverberation condition [14]. This means that the standard acoustic model cannot handle severe late reverberations, even when the whole reverberation characteristics are known in advance. One way to resolve this is to suppress the late reverberations prior to the ASR process [8]–[10]. In those studies, the energy of the late reverberations was estimated using an exponential decay function and reduced using the spectral subtraction (SS) technique [15].
The remaining early reflections are handled by CMS. Such dereverberation methods appear computationally simple and relatively robust to noise. However, since reverberation cannot be well represented solely by such a simple model, i.e., an exponential decay model, it is difficult to achieve precise dereverberation and restore the ASR performance to the level of clean-speech recognition.
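To make the conventional scheme concrete, the following minimal Python sketch estimates the late-reverberation power spectrum with a single exponential decay tied to the reverberation time, in the spirit of [8]–[10]; the function name, the 50-ms default delay, and the single decay rate are illustrative assumptions rather than the exact formulation of those papers.

```python
import numpy as np

def late_reverb_psd_exp_decay(obs_psd, frame_shift_s, rt60_s, delay_s=0.05):
    """Exponential-decay estimate of the late-reverberation power spectrum
    (sketch): the tail is modeled as a delayed, attenuated copy of the
    observed power spectrogram.  obs_psd has shape (frames, bins)."""
    delta = 3.0 * np.log(10.0) / rt60_s          # 60-dB amplitude decay in rt60_s
    shift = max(1, int(round(delay_s / frame_shift_s)))
    gain = np.exp(-2.0 * delta * delay_s)        # power attenuation over delay_s
    late_psd = np.zeros_like(obs_psd)
    late_psd[shift:] = gain * obs_psd[:-shift]   # delayed, attenuated copy
    return late_psd
```

The estimate is then subtracted from the observed power spectrum with flooring, exactly as in standard SS; the single decay rate is the simplification that the present paper argues limits the achievable dereverberation precision.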
This paper proposes a novel dereverberation method that estimates the late reverberation energy based on the concept of the inverse filtering method, namely long-term multi-step linear prediction (MSLP) [16], and performs SS to remove the late reverberations, under the assumption that the desired signal and the late reverberations are uncorrelated (see Appendix I for the characteristics of late reverberations). The proposed method first uses MSLP to estimate the late reverberation signal accurately in the time domain. Then, unlike the conventional inverse filtering technique, it converts the late reverberation signal into the frequency domain, and subtracts the power spectrum of the late reverberations from that of the observed signal. In other words, while general inverse filtering methods estimate and subtract the reverberation components from the observed signal in the time domain, the proposed method can be interpreted as performing the subtraction in the power spectral domain. By excluding the phase information from the dereverberation operation based on the SS framework, the proposed method might provide a degree of robustness to certain errors that conventional, sensitive inverse filtering methods could not offer. The proposed method can be formulated in either a single-channel or multichannel scenario without major modification of the algorithm. Our experimental results revealed substantial improvements in ASR performance even in a real, severely reverberant environment. The algorithm could perform good dereverberation with training data corresponding to the duration of one speech utterance, in our case less than 6 s.
The organization of this paper is as follows. Section II introduces the signal model. In Sections III and IV, we describe the proposed dereverberation framework for the single-channel and multichannel scenarios. In Section V, we evaluate the proposed method in a simulated reverberant environment in terms of objective quality measurement and ASR performance. In Section VI, we perform the dereverberation of real recordings. Section VII focuses on the robustness of the proposed method in a noisy reverberant environment. Section VIII summarizes our conclusions.
In this paper, the notations $(\cdot)^T$, $(\cdot)^{-1}$, $(\cdot)^{+}$, and $\|\cdot\|$ stand for the matrix/vector transpose, the inverse, the Moore–Penrose pseudo-inverse, and the norm, respectively. $\langle\cdot\rangle$ represents the time average. $\mathbf{I}$ represents the identity matrix.
II. SIGNAL MODEL
We consider the acoustic system shown in Fig. 1. First, let us assume that a source signal (speech signal) $s(n)$ is produced through a $q$th-order FIR filter $d(n)$ from white noise $u(n)$ as

$$s(n) = \sum_{k=0}^{q} d(k)\, u(n-k) \qquad (1)$$
Fig. 1. Acoustic system: $u(n)$ is white noise, $d(z)$ is an FIR filter corresponding to the vocal tract characteristics, $s(n)$ is a speech signal, $h_m(z)$ is the room transfer function between the speaker and the $m$th microphone, and $x_m(n)$ is the observed signal at the $m$th microphone.
where $d(n)$ is the time-domain representation of $d(z)$. Then, the speech signal recorded with a distant microphone $m$ ($m = 1, \ldots, M$) can be generally modeled as

$$x_m(n) = \sum_{k} h_m(k)\, s(n-k) \qquad (2)$$

$$= \sum_{k} g_m(k)\, u(n-k) \qquad (3)$$

$$g_m(n) = h_m(n) * d(n) \qquad (4)$$

where $h_m(n)$ corresponds to the room impulse response between the source signal and the $m$th microphone, $*$ denotes convolution, and $g_m(n)$ is the combined response. $h_m(n)$ is assumed to be time invariant.
We can reformulate (3) using a matrix/vector notation as $\mathbf{x} = \mathbf{G}\mathbf{u}$, where $\mathbf{x}$ and $\mathbf{u}$ stack the samples of the observed signal and the white-noise source in descending time order, and $\mathbf{G}$ is the Sylvester (convolution) matrix built from $g(n)$ (the microphone index is omitted for brevity):

$$\mathbf{G} = \begin{bmatrix} g(0) & g(1) & \cdots & g(K) & & 0 \\ & g(0) & g(1) & \cdots & g(K) & \\ 0 & & \ddots & \ddots & & \ddots \end{bmatrix} \qquad (5)$$

Here we assume $\mathbf{G}$ is an $N_x \times N_u$ full row rank matrix.¹ $N_x$ and $N_u$ indicate the dimensions of the vectors $\mathbf{x}$ and $\mathbf{u}$, respectively.

In this paper, a room impulse response $h(n)$ is assumed to consist of three parts: a direct-path response, early reflections, and late reverberations. The objective of the work described in this paper is to mitigate the effect of the late reverberations on $x(n)$. Here, let us denote the late reverberations of $x(n)$ as $r(n)$. We consider that the late reverberations of $x(n)$ correspond to the coefficients of $h(n)$ after the $D$th element, where $D$ is the step-size (delay) introduced in Section III.

¹$\mathbf{G}$ is full row rank unless $g(n)$ is identically zero.
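As a concrete illustration of the matrix/vector model above, the following Python sketch builds the Sylvester matrix $\mathbf{G}$ from a toy combined response $g(n)$ and checks the full row rank assumption; the helper name and the descending-time stacking convention are illustrative choices, not part of the paper.

```python
import numpy as np

def sylvester_matrix(g, n_x):
    """Convolution (Sylvester) matrix G with x = G @ u, where x and u stack
    samples in descending time order, so row i encodes
    x(n-i) = sum_k g(k) u(n-i-k).  Sketch of (5)."""
    K = len(g)
    G = np.zeros((n_x, n_x + K - 1))             # N_x x N_u with N_u = N_x + K - 1
    for i in range(n_x):
        G[i, i:i + K] = g                        # g shifted one column per row
    return G

g = np.array([1.0, 0.6, 0.3, 0.1])               # toy combined response g = h * d
G = sylvester_matrix(g, n_x=6)
assert np.linalg.matrix_rank(G) == 6             # full row rank unless g == 0
```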
III. SINGLE-CHANNEL ALGORITHM
In this section, we introduce a dereverberation algorithm for a single-channel scenario, which represents a situation where only one observation $x(n)$ in (3) is available for dereverberation.
A. Long-Term Multi-Step Linear Prediction
Here, to estimate the late reverberations, we introduce long-term multi-step LP, which was originally proposed in [16].² It was first presented for the estimation of the whole impulse response. In this study, we use the same method to identify only the late reverberations.
Let $N$ be the number of filter coefficients and $D$ be the step-size (i.e., delay); then long-term multi-step LP can be formulated as

$$x(n) = \sum_{k=0}^{N-1} w(k)\, x(n-D-k) + e(n) \qquad (6)$$

where $w(k)$ represents the prediction coefficients, and $e(n)$ is a prediction error. When $D$ is one, the equation is equivalent to conventional LP, which is often used, for example, in speech coding and analysis [21]. The prediction coefficients can be estimated in the time domain by minimizing the mean square energy of the prediction error $e(n)$. Note that these prediction coefficients are estimated based on at least $N + D$ samples, which amounts to several thousand in this study. In other words, the prediction coefficients are calculated using long-term analysis, while LPC, for example, in the speech coding field works based on short-term analysis. Using a matrix/vector notation, the obtained prediction coefficients $\mathbf{w}$ can be expressed as (see Appendix II for a detailed derivation)

$$\mathbf{w} = \mathbf{R}^{-1}\,\mathbf{r} \qquad (7)$$

$$\mathbf{R} = \langle \mathbf{x}(n)\,\mathbf{x}(n)^T \rangle, \qquad \mathbf{r} = \langle \mathbf{x}(n)\, x(n+D) \rangle \qquad (8)$$

where $\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-N+1)]^T$. Here $\mathbf{R}$ is a full-rank matrix because $\mathbf{G}$ is a full row rank matrix, as mentioned above.
Now, we apply the prediction coefficients to the observed signal to estimate the power of the late reverberations, as follows:

$$\hat{r}(n) = \sum_{k=0}^{N-1} w(k)\, x(n-D-k) \qquad (9)$$

Expanding the long-term average power $\langle \hat{r}(n)^2 \rangle$ with the matrix/vector notation of Section II, and using the fact that the auto-correlation matrix of the white noise $u(n)$ is $\langle \mathbf{u}\mathbf{u}^T \rangle = \sigma_u^2\,\mathbf{I}$, where $\sigma_u^2$ is a scalar indicating the variance of $u(n)$, we can derive (10). Using the Cauchy–Schwarz inequality, we can obtain relation (11). Finally, the relation

$$\langle \hat{r}(n)^2 \rangle \le \langle r(n)^2 \rangle \qquad (12)$$

is obtained by using the fact that the operator involved is a projection matrix, whose norm is equal to 1 [22]. Equation (12) indicates that the late reverberation components can never be overestimated in a long-term analysis sense.

²There are several speech dereverberation methods that also use LP [17]–[20]. Note that, in those studies, LP was mainly used to model speech components; thus, the LP order is relatively small ($\approx 20$). In contrast, here we wish to model reverberation with long-term multi-step LP; thus, the order is much higher (i.e., several thousand).
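A minimal sketch of this estimation step is given below. It solves (6) by batch least squares over the whole utterance instead of the Levinson–Durbin recursion the paper uses for efficiency, and it omits pre-whitening (Appendix III); the function names are illustrative.

```python
import numpy as np
from scipy.linalg import toeplitz

def mslp_coefficients(x, order, delay):
    """Long-term multi-step LP (sketch of (6)-(8)): fit w so that
    x(n) ~= sum_k w[k] x(n - delay - k), by least squares."""
    start = delay + order - 1                    # first n with a full regressor
    col = x[start - delay : len(x) - delay]      # x(n-delay) down the rows
    row = x[start - delay::-1][:order]           # lags delay .. delay+order-1
    A = toeplitz(col, row)
    w, *_ = np.linalg.lstsq(A, x[start:], rcond=None)
    return w

def predict_late_reverb(x, w, delay):
    """Apply the prediction filter to the observation, i.e. eq. (9):
    the overall predictor is w preceded by `delay` zeros."""
    h = np.concatenate([np.zeros(delay), w])
    return np.convolve(x, h)[:len(x)]
```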
Now, let us denote the $z$-domain representations of $w(n)$ and $g(n)$ as $W(z)$ and $G(z)$. Then, as mentioned in (6) to (8), the long-term multi-step LP tries to skip the first $D$ terms of the transfer function $G(z)$ and estimate the remaining terms of $G(z)$ whose orders are higher than $D$. Note that $G(z)$ is the product of the speech production system $d(z)$ and the room transfer function $h(z)$, as in (4). Therefore, the late reverberation energy calculated as in (12) may include not only the contribution of the late reverberations of $h(z)$ but also a bias caused by $d(z)$. In order to reduce this bias, we suggest employing a preprocessing technique for long-term multi-step LP, known as the pre-whitening technique, which appears to be effective in reducing the short-term correlation of a speech signal produced through $d(z)$. In this paper, this pre-whitening was done by using a small-order LP ($\approx 20$ taps), which can be estimated as shown in Appendix III. Care has to be taken in choosing the LP orders for long-term multi-step LP and pre-whitening. The long-term multi-step LP tries to model the late reverberations of $h(z)$; thus, the order has to be very high. In contrast, the LP order used for pre-whitening should be small, since the aim of this processing is only to suppress the short-term correlation caused by the speech production system $d(z)$.
B. Spectral Subtraction
Here we propose the use of SS to suppress the late reverberations. That is, we first divide the observed signal and the estimated late reverberations into short frames, apply the short-term Fourier transform (STFT) to calculate the power spectrum, and then subtract the power spectrum of the estimated late reverberations from that of the observed signal. Although, in the previous section, we showed that the power of the predicted late reverberations can never be overestimated compared with that of the true late reverberations in the long-term analysis sense, some degree of overestimation may occur in a (short-term) local time region.

In summary, an exact subtraction rule can be formulated as shown below, by denoting the STFT of a short segment of the observed signal at the $m$th microphone as $X_m(f,t)$ and that of the estimated late reverberations as $\hat{R}_m(f,t)$, where $f$ is a frequency index and $t$ is an integer frame index:

$$|S_m(f,t)|^2 = \begin{cases} |X_m(f,t)|^2 - |\hat{R}_m(f,t)|^2, & \text{if } |X_m(f,t)|^2 > |\hat{R}_m(f,t)|^2 \\ 0, & \text{otherwise} \end{cases}$$

where $S_m(f,t)$ denotes the STFT of the dereverberated signal. To synthesize a time-domain dereverberated signal, we simply apply the phase of the observed signal as

$$\hat{s}_m(n) = \mathrm{ISTFT}\!\left[\, |S_m(f,t)|\, e^{\,j \arg X_m(f,t)} \,\right]$$
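The subtraction rule and the phase-reuse resynthesis can be sketched as follows; `scipy.signal.stft`/`istft` handle the framing, and the 360-sample default frame (30 ms at 12 kHz) matches the analysis frame used in the paper. No over-subtraction factor is applied, consistent with the text.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x, r_est, fs, frame_len=360):
    """Power-spectral subtraction of the estimated late reverberation with
    half-wave rectification (Section III-B sketch); the observed phase is
    reused for resynthesis."""
    _, _, X = stft(x, fs, nperseg=frame_len)
    _, _, R = stft(r_est, fs, nperseg=frame_len)
    power = np.maximum(np.abs(X) ** 2 - np.abs(R) ** 2, 0.0)  # floor at zero
    S = np.sqrt(power) * np.exp(1j * np.angle(X))             # observed phase
    _, s_hat = istft(S, fs, nperseg=frame_len)
    return s_hat[:len(x)]
```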
Fig. 2. Schematic diagram of proposed method for single-channel scenario.
C. Schematic Processing Diagram of Single-Channel Algorithm

Fig. 2 is a schematic diagram of the proposed method for the single-channel scenario described above. First, the observed signal is pre-whitened with a small-order LP and processed with the long-term multi-step LP. The long-term multi-step LP is used to obtain the coefficients that best predict the late reverberations. Then, by convolving (or filtering) the observed signal with the prediction coefficients as in (9), we estimate the late reverberations. After applying an STFT to the observed signal and the predicted late reverberations, we perform SS in the spectral domain to remove the effect of the late reverberations from the observed signal (shown as "SS" in Fig. 2) [15]. Finally, to remove the remaining early reflections for the ASR system, we apply CMS to the processed signal.
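Chaining the pieces sketched so far gives the whole single-channel pipeline of Fig. 2. Here `prewhiten` is the Appendix III sketch given later, CMS is shown at the feature level, and the default orders follow Section V (3000 taps, 360-sample delay, 20th-order whitening); whether the prediction filter is applied to the raw or whitened observation is our reading of Fig. 2, with the raw observation filtered as the text states.

```python
import numpy as np

def cms(features):
    """Cepstral mean subtraction: remove each cepstral coefficient's
    per-utterance mean.  features: (frames, coeffs) array, e.g. MFCCs."""
    return features - features.mean(axis=0, keepdims=True)

def dereverberate_1ch(x, fs, order=3000, delay=360, whiten_order=20):
    """Single-channel pipeline of Fig. 2 (sketch): pre-whiten, estimate the
    MSLP coefficients on the whitened signal, predict the late
    reverberation by filtering the observation (eq. (9)), and remove it by
    spectral subtraction."""
    x_w = prewhiten(x, whiten_order)             # Appendix III sketch (below)
    w = mslp_coefficients(x_w, order, delay)     # Section III-A sketch
    r_est = predict_late_reverb(x, w, delay)     # eq. (9)
    return spectral_subtract(x, r_est, fs)       # Section III-B sketch
```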
IV. MULTICHANNEL ALGORITHM
In this section, we extend the proposed algorithm to the multichannel scenario. By employing the multichannel long-term multi-step LP [16], the two sides of (12) become equal [1], [23]; thus, we expect to estimate the late reverberations more accurately.
A. Multichannel Long-Term Multi-Step Linear Prediction
Here, we introduce multichannel long-term multi-step LP to estimate the late reverberations based on multiple observed signals. Let $L$ be the number of filter coefficients for each channel, $D$ be the step-size (i.e., delay), and $M$ be the number of microphones; then the multichannel long-term multi-step LP is formulated as follows:

$$x_m(n) = \sum_{i=1}^{M} \sum_{k=0}^{L-1} w_{m,i}(k)\, x_i(n-D-k) + e_m(n) \qquad (13)$$

where $x_i(n)$ corresponds to the observed signal at the $i$th microphone, and $w_{m,i}(k)$ to the prediction coefficients applied to the $i$th microphone signal when the prediction target is the observed signal at the $m$th microphone, $x_m(n)$. The multichannel long-term multi-step LP calculates the late reverberations within $x_m(n)$. The prediction coefficients can be estimated by minimizing the mean square energy of the prediction error $e_m(n)$ (see Appendix IV for a detailed derivation). Using a matrix/vector notation, the obtained prediction coefficients can be written in a similar manner to the single-channel algorithm as

$$\mathbf{w}_m = \mathbf{R}^{-1}\,\mathbf{r}_m \qquad (14)$$

where $\mathbf{R} = \langle \bar{\mathbf{x}}(n)\,\bar{\mathbf{x}}(n)^T \rangle$ and $\mathbf{r}_m = \langle \bar{\mathbf{x}}(n)\, x_m(n+D) \rangle$, with the stacked observed signal defined as $\bar{\mathbf{x}}(n) = [x_1(n), \ldots, x_1(n-L+1), \ldots, x_M(n), \ldots, x_M(n-L+1)]^T$.

Fig. 3. Schematic diagram of multichannel implementation.

Now, let us apply the prediction coefficients to the observed signals to estimate the late reverberations. Then, the estimated late reverberations can be expressed as follows:³

$$\hat{r}_m(n) = r_m(n) \qquad (15)$$

Equation (15) simply indicates that the late reverberations can be more accurately estimated. In other words, now with multichannel long-term multi-step LP, the two sides of (12) become the same.
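The multichannel estimation can be sketched in the same batch least-squares style (the paper instead uses the Schur class of fast algorithms for the block-Toeplitz system); the names and the `(M, n)` signal layout are illustrative.

```python
import numpy as np

def mc_mslp_coefficients(xs, order, delay, target):
    """Multichannel long-term multi-step LP (sketch of (13)-(14)): predict
    channel `target` from the delayed past of all M channels.
    xs: (M, n) array of observed signals."""
    M, n = xs.shape
    start = delay + order - 1
    A = np.hstack([
        np.stack([xs[m, start - delay - k : n - delay - k]
                  for k in range(order)], axis=1)
        for m in range(M)
    ])                                           # shape: (n - start, M * order)
    w, *_ = np.linalg.lstsq(A, xs[target, start:], rcond=None)
    return w.reshape(M, order)                   # w[m, k] weights x_m(n-delay-k)

def mc_predict_late_reverb(xs, w, delay):
    """Predicted late reverberation at the target channel, cf. (15)."""
    M, order = w.shape
    r = np.zeros(xs.shape[1])
    for m in range(M):
        h = np.concatenate([np.zeros(delay), w[m]])
        r += np.convolve(xs[m], h)[:xs.shape[1]]
    return r
```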
B. Schematic Processing Diagram
Fig. 3 shows the algorithm based on the multichannel long-term multi-step LP. There are two major modifications compared with the single-channel algorithm. First, in the multichannel scenario, we perform long-term multi-step LP based on signals captured by multiple microphones. Second, to enhance the direct-path response in the processed speech, we adjust the delays and calculate the sum of the signals from all the channels. This process is denoted as "Direct-path Enhancement (DE)" in the figure.

First, pre-whitening is applied to each of the observed signals. Next, using multichannel long-term multi-step LP, we estimate the late reverberations at the $m$th microphone. Based on the STFT of the estimated late reverberations and that of the observed signals, we calculate the dereverberated signal at the $m$th microphone. We repeat this procedure for all $m$ ($m = 1, \ldots, M$) to obtain the dereverberated speech for all the microphones. Then, we adjust the delays among the output signals and calculate their sum to obtain the resultant signal. The delays were estimated with the generalized cross-correlation (GCC) method [24]. Finally, to remove the remaining early reflections, we apply CMS to the processed signal.
³For (15) to be strictly equal, $\mathbf{H}$, which is the Sylvester matrix of $h(n)$, similar to $\mathbf{G}$, has to be a full column rank matrix.
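A sketch of the DE block follows. The delay estimator uses the PHAT-weighted variant of the generalized cross-correlation method (the paper only specifies GCC [24], so the weighting, the integer-sample alignment, and the 10-ms search range are illustrative choices).

```python
import numpy as np

def gcc_phat_delay(ref, sig, max_lag):
    """Time-delay estimate between two signals via PHAT-weighted GCC
    (sketch of the generalized cross-correlation method [24])."""
    n = len(ref) + len(sig)
    R = np.fft.rfft(ref, n) * np.conj(np.fft.rfft(sig, n))
    R /= np.maximum(np.abs(R), 1e-12)            # PHAT weighting
    cc = np.fft.irfft(R, n)
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])  # lags -max..+max
    return int(np.argmax(np.abs(cc))) - max_lag

def direct_path_enhancement(signals, fs, max_delay_s=0.01):
    """'DE' block of Fig. 3 (sketch): align every dereverberated channel to
    channel 0 and average, reinforcing the direct-path component."""
    max_lag = int(max_delay_s * fs)
    out = np.array(signals[0], dtype=float)
    for sig in signals[1:]:
        d = gcc_phat_delay(signals[0], sig, max_lag)
        out += np.roll(sig, d)                   # integer-sample alignment
    return out / len(signals)
```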
Fig. 4. Experimental setup: the reflection coefficients of the walls are [0.93, 0.93, 0.94, 0.94, 0.15, 0.15].
V. EXPERIMENT IN SIMULATED REVERBERANT ENVIRONMENT
In this section, we evaluate the effectiveness of the proposed
methods in a simulated reverberant environment, where our
noise-free assumption holds.
A. Experimental Conditions
1) Reverberation Condition: Fig. 4 summarizes the acoustic environment for the experiment. The single-channel processing employed the microphone shown with the solid line, while the four-channel processing employed three extra microphones indicated with dotted lines. The microphones were equally spaced at intervals of 0.2 m. Impulse responses were simulated with the image method [25] for four different speaker positions, with distances of 0.5, 1.0, 1.5, and 2.0 m between the reference microphone and the speaker. The reverberation time ($RT_{60}$) of the simulated acoustic environment was about 0.65 s.⁴ The impulse response was 9600 taps long, corresponding to a duration of 0.8 s at a sampling frequency of 12 kHz.
2) ASR Condition: The Japanese Newspaper Article Sentences (JNAS) corpus was used to investigate the effectiveness of the proposed method as a preprocessing algorithm for ASR. The ASR performance was evaluated in terms of word error rate (WER) averaged over genders and speakers. In the acoustic model, we used the following parameters: 12th-order MFCCs + energy, their Δ and ΔΔ, three-state triphone HMMs, and 16-mixture Gaussian distributions. The acoustic model settings are summarized in Table I. The total number of clustered states was set at 3000 using a decision-tree-based context clustering technique [27]. The model was trained on clean speech processed with CMS. The language model was a standard trigram trained on Japanese newspaper articles written over a ten-year period. The training and test sets for the recognition task are summarized in Table II. The duration of the test data ranged from 2 to 16 s, and the average value was about 6 s.
⁴In [26], we carried out experiments with $RT_{60}$ values of 0 to 0.5 s.
TABLE I: EXPERIMENTAL CONDITIONS FOR ASR
TABLE II: TRAINING AND TEST DATA FOR ACOUSTIC MODEL AND LANGUAGE MODEL FOR JNAS
3) Parameters for Dereverberation: The filter length $N$ for the single-channel algorithm, the filter length $L$ for the multichannel algorithm, and the step-size $D$ in (6) and (13) were 3000, 750, and 360, respectively. It should be noted that, when dealing with longer reverberation, in theory we simply have to use a longer filter. Here, $D$ is set at the length of the analysis frame used for CMS, so as to deal with all the reverberation components that CMS cannot handle. For the pre-whitening, we used 20th-order LP, which we calculated similarly to the approach described in [20] (see Appendix III for details). In our experiment, the coefficients of the pre-whitening filter were fixed for an entire utterance. Although we determined these orders experimentally, a preliminary experiment confirmed that similar performance could be obtained for filter lengths varied within a range of about 1000 taps. No special parameters were used for spectral subtraction. These parameters are common to all the experiments reported in this paper.
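For reference, these settings are mutually consistent: at the 12-kHz sampling rate, the 30-ms CMS analysis frame corresponds to $0.030 \times 12000 = 360$ samples, which is exactly the step-size $D$, and the 3000-tap single-channel filter spans $3000/12000 = 0.25$ s of the reverberant tail.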
The dereverberation was performed utterance by utterance. The estimation of the LP coefficients starts only after all the samples corresponding to the current utterance become available. This means that the length of the training data used to estimate the LP coefficients is equivalent to the duration of each input utterance. We have confirmed experimentally that, if more than about 2 s of data are available, we can obtain sufficiently converged LP coefficients, and the algorithm performance becomes relatively stable. We employed the Levinson–Durbin algorithm for single-channel long-term multi-step LP [21], and the class of Schur's algorithm for multichannel long-term multi-step LP [21], [28]–[30], to calculate the prediction coefficients efficiently. These fast algorithms enable us to run the whole process at a real-time factor of less than 1, for example, on the Intel Pentium IV 3.4-GHz processor used in our experiments.
When we compare the length of the simulated impulse responses with the filter length for MSLP, we find that the current filter length is not sufficiently long to estimate all the late reverberations, and so the analysis of the proposed dereverberation method presented in Sections III and IV does not hold precisely. However, we chose this filter length to allow us to execute the whole process in a realistic computational time.

Fig. 5. Recognition experiment in a simulated reverberant environment: recognition performance as a function of the distance between the microphone and the speaker.
B. Dereverberation Effect on ASR
Fig. 5 shows the WER as a function of the distance between the microphone and the speaker. "No proc." corresponds to the WER of the reverberant speech processed with CMS, "1 ch dereverberation" to that of speech dereverberated with the single-channel algorithm, and "4 ch dereverberation w/ DE" to that of speech dereverberated with the four-channel algorithm including the DE process (as shown in Fig. 3). "4 ch dereverberation w/o DE" corresponds to the result for one representative channel, taken immediately before the DE process in the four-channel algorithm. This result is provided to show the improvement that we can gain by extending single-channel long-term multi-step LP to its multichannel form. "Clean speech (baseline)" is the lowest possible WER, i.e., 4.4%, that can be realized with this ASR system on this corpus.

As seen from the figure, if the reverberant speech undergoes no preprocessing, the WER increases greatly as the distance increases. With the proposed method, we achieved a substantial reduction in the WER with both the single-channel and four-channel algorithms for all reverberant conditions. The improvement obtained by using four channels rather than a single channel becomes more obvious, particularly as the distance between the speaker and the microphone increases.
C. Spectrogram Improvement
Fig. 6 shows spectrograms of clean speech processed with CMS, reverberant speech at a distance of 1.5 m, speech dereverberated by the single-channel algorithm, speech dereverberated by the four-channel algorithm without the DE process, and speech dereverberated by the four-channel algorithm with the DE process. We can clearly see the effect of the proposed method in both the single-channel and four-channel cases. Although we can observe some differences between the levels of performance provided by the single-channel and four-channel algorithms, no significant improvement can be seen in the spectrograms. Although (12) implicitly shows that the single-channel algorithm may greatly underestimate the power of late reverberations, this experimental result supports the idea that the algorithm successfully generates a reasonable estimate of the late reverberations. Note that, since no over-subtraction factor is used in the present work, if the power of the late reverberations were greatly underestimated, the spectrograms should show some evidence of the remaining late reverberations.

Fig. 6. Spectrograms in a simulated reverberant environment when the distance between the microphones and the speaker was set at 1.5 m: (A) clean speech, (B) reverberant speech, (C) speech dereverberated by the single-channel algorithm, (D) speech dereverberated by the four-channel algorithm without DE, and (E) speech dereverberated by the four-channel algorithm with DE.
D. Evaluation With LPC Cepstrum Distance
Here we use the average LPC cepstrum distance [31] to evaluate the precision of the dereverberation with an objective measurement. Fig. 7 shows the average LPC cepstrum distance between clean speech processed with CMS and the target speech. To calculate the LPC cepstrum distance, we excluded the silence found at the beginning and end of the utterance files. The legends represent the same types of speech signal as those in Fig. 5. Here again, the difference in performance between single-channel and four-channel processing becomes more noticeable as the distance increases, as previously noted in Fig. 5.

Fig. 7. LPC cepstrum distance in a simulated reverberant environment as a function of the distance between the microphone and the speaker.
VI. EXPERIMENT IN REAL REVERBERANT ENVIRONMENT
In this section, we carried out experiments with speech
recorded in a real reverberant room to show the applicability of
the proposed method.
A. Experimental Condition
The recordings were made in a reverberant chamber with the same dimensions as the simulated room described in Section V. The locations of the microphones and the loudspeaker also follow the simulation setup depicted in Fig. 4. For each gender, 100 Japanese sentences taken from the JNAS database were played through a BOSE 101VM loudspeaker and recorded with SONY ECM-77B omnidirectional microphones. The positions of the loudspeaker and the microphones were fixed throughout the recordings. The signal-to-noise ratios (SNRs) of the recordings were about 15 to 20 dB, and the reverberation time ($RT_{60}$) was about 0.5 s. These values are approximately the same as those of the simulated impulse responses [32]. We applied high-pass filtering to the recordings before the dereverberation process to suppress the unwanted background noise, which was mainly concentrated below 200 Hz. After the high-pass filtering, the SNRs were about 30 dB. As a control, we also recorded the same utterances in a nonreverberant chamber with a close microphone using the same experimental equipment.
B. Dereverberation Effect on ASR
We also carried out ASR experiments with the real recordings. The acoustic and language models were the same as in Section V. The training and test sets for this recognition task were the same as for the previous experiment and are summarized in Table II.

Fig. 8 shows the WER of the real recordings as a function of the distance between the microphone and the speaker. The legends represent the same types of processing as those in Fig. 5. In this experiment, the baseline performance is 4.9%, which is the WER obtained with recordings made in a nonreverberant chamber.

Fig. 8. Recognition experiment in a real reverberant environment: recognition performance as a function of the distance between the microphone and the speaker.

The improvement in WER is sufficiently noticeable under all reverberant conditions, and the global tendency is similar to the simulation. The results indicate that the proposed framework also works well even with speech recorded in a severely reverberant environment.
C. Spectrogram Improvement
In this experiment, to move one step nearer a real scenario, we attempted the dereverberation of actual human utterances (rather than utterances played through a loudspeaker). In this case, the source position might fluctuate constantly owing to head movement, despite the speaker being asked to stand still, at the same position as the loudspeaker in Fig. 4, during the recordings. Fig. 9 shows spectrograms of recorded reverberant speech uttered by a male speaker, speech dereverberated with the single-channel algorithm, speech dereverberated by the four-channel algorithm without the DE process, and speech dereverberated by the four-channel algorithm with the DE process. Here, we again see a substantial reduction in reverberation in both the single- and four-channel cases.
VII. ROBUSTNESS OF PROPOSED DEREVERBERATION
METHOD TO DIFFUSIVE NOISE
In this section, we evaluate our proposed method under noisy reverberant conditions to confirm its robustness. The evaluations are undertaken using spectrograms and the LPC cepstrum distance. To perform an ASR test in a noisy environment, the method should be combined with noise adaptation techniques such as spectral subtraction [15] and parallel model combination [33], [34]. Since we would like to focus primarily on the reverberation problem in this paper, we do not include the issue of combining the proposed method with other noise adaptation techniques. Please refer to [35] for an evaluation of the proposed dereverberation method combined with SS [15] in a noisy reverberant environment.
A. Experimental Condition
Fig. 9. Spectrograms obtained in a real reverberant environment when the distance between the microphones and the speaker was set at 1.5 m: (A) recorded reverberant speech, (B) speech processed with the single-channel algorithm, (C) speech dereverberated by the four-channel algorithm without the DE process, and (D) speech dereverberated by the four-channel algorithm with the DE process.

The reverberation conditions are the same as those described in Section V. To simulate an environment with diffusive noise, white noise was artificially generated and added to the reverberant speech at SNRs of 0, 10, 20, 30, and 40 dB.
B. Spectrogram Improvement
Fig. 10 shows spectrograms of the observed noisy reverberant speech, speech dereverberated by the single-channel algorithm, and speech dereverberated by the four-channel algorithm without and with the DE process, at a 20-dB SNR. Here, the distance between the speaker and the microphones was set at 1.5 m. From the spectrograms, we can see that both single-channel and four-channel dereverberation work fairly well even in a noisy environment. It may be interesting to note that, although the algorithm does not explicitly perform denoising, some denoising effect can be seen, especially in Fig. 10(D). This is probably due to the DE processing employed with the four-channel algorithm.

Fig. 10. Spectrograms in a noisy reverberant environment, when the distance between the microphones and the speaker was set at 1.5 m and the SNR was 20 dB: (A) noisy reverberant speech, (B) speech dereverberated by the single-channel algorithm, (C) speech dereverberated by the four-channel algorithm without DE, and (D) speech dereverberated by the four-channel algorithm with DE.

C. Evaluation With LPC Cepstrum Distance

Here, to evaluate the dereverberation precision in a noisy environment, we calculated the LPC cepstrum distance between clean speech processed with CMS and the target speech. In this case, the dereverberated speech was generated by estimating the LP coefficients in a noisy environment, and then processing the noiseless reverberant speech with those coefficients. By doing this, the dereverberation performance could be evaluated without taking account of the spectral distortion caused by the background noise. The results are summarized in Fig. 11, where the legends represent the same types of processing as those in Fig. 5. Note that the 40-dB SNR case shown in Fig. 11 approximately coincides with Fig. 7, which shows the noise-free case. The proposed method appears to provide stable performance for SNRs above 20 dB. Even though the accuracy decreases for SNRs below 20 dB, the dereverberation effect is still noticeable when using the four-channel algorithm with DE. Consequently, the proposed framework is relatively robust to background noise.
VIII. CONCLUSION
A speech signal captured by a distant microphone is generally smeared by reverberation, which severely degrades ASR performance. In this paper, we proposed a novel dereverberation method that combines the concept of inverse filtering with the well-known spectral subtraction. The method first estimates the late reverberations using long-term multi-step linear prediction, and then suppresses them with subsequent spectral subtraction.
Fig. 11. LPC cepstrum distance as a function of SNR: each panel corresponds to a different distance between the microphone and the speaker. The top-left, top-right, bottom-left, and bottom-right panels correspond to 0.5, 1.0, 1.5, and 2.0 m, respectively.
Experimental results showed that both the single-channel and multichannel algorithms could achieve good dereverberation and could significantly improve ASR performance even in a real, severely reverberant environment. In particular, with the multichannel algorithm, the recognition performance was sufficiently close to that of an anechoic scenario. Since the multichannel algorithm can estimate the late reverberations more accurately than the single-channel one, and can be advantageously combined with postprocessing that enhances the direct-path response, it allowed us to perform more effective dereverberation. We also discussed the robustness of the proposed method to white background noise, and confirmed that the performance was stable for SNRs above 20 dB.

In future work, we will consider the effect of background noise explicitly, and aim to achieve not only dereverberation but also denoising.
APPENDIX I
CHARACTERISTICS OF LATE REVERBERATIONS
Here let us describe the characteristics of late reverberations and their relationship to the direct-path response and early reflections.

A speech signal has a strong correlation within each local time region due to articulatory constraints, and it loses this correlation as a result of articulatory movements. Therefore, it may be possible to assume that the autocorrelation of clean speech $s(n)$ has the following property:

$$\langle s(n)\, s(n-\tau) \rangle \approx 0 \quad \text{iff} \quad \tau > T \qquad (16)$$

where, with a speech signal, the value $T$ can vary approximately from 30 to 100 ms depending on the phoneme of interest.

Using $T$ and the length of the room impulse response $K$, we rewrite (2) as

$$x(n) = \sum_{k=0}^{K} h(k)\, s(n-k) \qquad (17)$$

$$= h(0)\, s(n) + \sum_{k=1}^{T-1} h(k)\, s(n-k) + \sum_{k=T}^{K} h(k)\, s(n-k) \qquad (18)$$

If $T$ is equivalent to 30 ms (which corresponds to the length of the speech analysis frame in this paper), the second and third terms of (18) exactly coincide with the definitions of the early reflections and late reverberations, respectively. If we assume the condition of (16), we can assume the late reverberations to be uncorrelated with the direct-path response; and, if $K \gg T$ and the late part of $h(n)$ has sufficient energy, it may be possible to assume that the late reverberations and early reflections are also uncorrelated.
APPENDIX II
DERIVATION OF PREDICTION COEFFICIENTS
IN SINGLE-CHANNEL SCENARIO
By minimizing the mean square energy of the prediction error $e(n)$ in (6), we can obtain the prediction coefficients. Using matrix/vector notation, the minimization of $\langle e(n)^2 \rangle$ leads to the following normal equation:

$$\langle \mathbf{x}(n)\,\mathbf{x}(n)^T \rangle\, \mathbf{w} = \langle \mathbf{x}(n)\, x(n+D) \rangle \qquad (19)$$

where $\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-N+1)]^T$. Thus, the prediction coefficients can be obtained as

$$\mathbf{w} = \langle \mathbf{x}(n)\,\mathbf{x}(n)^T \rangle^{-1}\, \langle \mathbf{x}(n)\, x(n+D) \rangle \qquad (20)$$

To understand the behavior of $\mathbf{w}$, we now expand (20). First, the autocorrelation term can be expanded as $\langle \mathbf{x}(n)\,\mathbf{x}(n)^T \rangle = \sigma_u^2\,\mathbf{G}\mathbf{G}^T$, where the auto-correlation matrix of the white noise $u(n)$ is assumed to be $\langle \mathbf{u}\mathbf{u}^T \rangle = \sigma_u^2\,\mathbf{I}$; $\sigma_u^2$ is a scalar that corresponds to the variance of $u(n)$. The second term can also be expanded as $\langle \mathbf{x}(n)\, x(n+D) \rangle = \sigma_u^2\,\mathbf{G}\,\mathbf{g}_D$. Finally, $\mathbf{w}$ can be rewritten as

$$\mathbf{w} = \left( \mathbf{G}\mathbf{G}^T \right)^{-1} \mathbf{G}\,\mathbf{g}_D \qquad (21)$$

Here, we consider that the late reverberations correspond to the coefficients of $g(n)$ after the $D$th element, and they are represented by $\mathbf{g}_D$.

It should be noted that (19) can be solved efficiently, for example, by the Levinson–Durbin algorithm [21].
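The non-overestimation property (12) can be checked numerically on the idealized model of Section II, reusing the single-channel sketches from Section III-A; the toy response, data length, and finite-data tolerance are illustrative choices.

```python
import numpy as np

# Numerical check of (12) on the toy model x = g * u with white noise u.
rng = np.random.default_rng(1)
g = rng.standard_normal(12)                      # toy combined response g
u = rng.standard_normal(50000)                   # white-noise source
x = np.convolve(u, g)[:len(u)]                   # observation x = g * u
D, N = 4, 24
w = mslp_coefficients(x, N, D)                   # Section III-A sketch
r_est = predict_late_reverb(x, w, D)             # estimated late reverberation
g_late = g.copy(); g_late[:D] = 0.0              # true late part of g
r_true = np.convolve(u, g_late)[:len(u)]         # true late reverberation
# Eq. (12): estimated power never exceeds the true late-reverberation power
# (small tolerance added for finite-data effects).
assert np.mean(r_est**2) <= np.mean(r_true**2) * 1.01
```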
APPENDIX III
ESTIMATION OF PRE-WHITENING FILTER
In this paper, the following $P$th-order prediction filter was used for pre-whitening to equalize $d(n)$ in (1). We first calculate the auto-correlation coefficient with a lag of $\tau$ samples using the observed signal at the $m$th microphone as

$$\phi_m(\tau) = \langle x_m(n)\, x_m(n-\tau) \rangle \qquad (22)$$

Then, we take the average of $\phi_m(\tau)$ over all the channels:

$$\bar{\phi}(\tau) = \frac{1}{M} \sum_{m=1}^{M} \phi_m(\tau) \qquad (23)$$

As with standard LP [21], using $\bar{\phi}(\tau)$, the prediction filter $\mathbf{a} = [a(1), \ldots, a(P)]^T$ was calculated based on the following Yule–Walker equation:

$$\begin{bmatrix} \bar{\phi}(0) & \cdots & \bar{\phi}(P-1) \\ \vdots & \ddots & \vdots \\ \bar{\phi}(P-1) & \cdots & \bar{\phi}(0) \end{bmatrix} \begin{bmatrix} a(1) \\ \vdots \\ a(P) \end{bmatrix} = \begin{bmatrix} \bar{\phi}(1) \\ \vdots \\ \bar{\phi}(P) \end{bmatrix} \qquad (24)$$
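A compact sketch of this pre-whitening procedure, solving the Toeplitz system (24) with SciPy and returning the LP residual, is given below; the function name and the optional multichannel argument are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def prewhiten(x, p=20, xs=None):
    """Pre-whitening of Appendix III (sketch): estimate a p-th order LP
    from channel-averaged autocorrelations (22)-(23), solve the
    Yule-Walker system (24), and return the whitened residual of x."""
    chans = xs if xs is not None else x[np.newaxis, :]
    phi = np.zeros(p + 1)
    for ch in chans:                             # (22): per-channel correlation
        for tau in range(p + 1):
            phi[tau] += np.dot(ch[tau:], ch[:len(ch) - tau])
    phi /= len(chans)                            # (23): average over channels
    a = solve_toeplitz(phi[:p], phi[1:p + 1])    # (24): Yule-Walker solve
    b = np.concatenate([[1.0], -a])              # whitening filter 1 - A(z)
    return np.convolve(x, b)[:len(x)]
```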
APPENDIX IV
DERIVATION OF PREDICTION COEFFICIENTS
IN MULTICHANNEL SCENARIO
By minimizing the mean square energy of the prediction error $e_m(n)$ in (13), we can obtain the prediction coefficients. The minimization of $\langle e_m(n)^2 \rangle$ leads to the following equation:

$$\langle \bar{\mathbf{x}}(n)\,\bar{\mathbf{x}}(n)^T \rangle\, \mathbf{w}_m = \langle \bar{\mathbf{x}}(n)\, x_m(n+D) \rangle \qquad (25)$$

where $\bar{\mathbf{x}}(n)$ is the stacked vector of the $M$ observed signals defined in Section IV. Thus, $\mathbf{w}_m$ can be obtained as

$$\mathbf{w}_m = \langle \bar{\mathbf{x}}(n)\,\bar{\mathbf{x}}(n)^T \rangle^{-1}\, \langle \bar{\mathbf{x}}(n)\, x_m(n+D) \rangle \qquad (26)$$

To understand the behavior of $\mathbf{w}_m$, we reformulate (26) in a similar manner to that used for the single-channel case described above. Now, $\mathbf{w}_m$ can be rewritten as

$$\mathbf{w}_m = \left( \mathbf{H}\mathbf{H}^T \right)^{+} \mathbf{H}\,\mathbf{h}_{m,D} \qquad (27)$$

where $\mathbf{H}$ is the multichannel Sylvester matrix built from the room impulse responses $h_m(n)$, similar to $\mathbf{G}$, and $\mathbf{h}_{m,D}$ collects the coefficients of $h_m(n)$ after the $D$th element.

Note that (25) can be efficiently solved by, for example, the class of Schur's algorithm, which is able to determine a least squares solution for general block-Toeplitz matrix equations [21], [28]–[30].
REFERENCES
[1] M. Miyoshi and Y. Kaneda, "Inverse filtering of room acoustics," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2, pp. 145–152, Feb. 1988.
[2] M. I. Gurelli and C. L. Nikias, "EVAM: An eigenvector-based algorithm for multichannel blind deconvolution of input colored signals," IEEE Trans. Signal Process., vol. 43, no. 1, pp. 134–149, Jan. 1995.
[3] S. Gannot and M. Moonen, "Subspace methods for multi microphone speech dereverberation," EURASIP J. Appl. Signal Process., vol. 2003, no. 11, pp. 1074–1090, 2003.
[4] J. Ayadi and D. T. M. Slock, "Multichannel estimation by blind MMSE ZF equalization," in Proc. 2nd IEEE Workshop Signal Process. Adv. Wireless Commun., 1999, pp. 251–254.
[5] L. Tong and Q. Zhao, "Joint order detection and blind channel estimation by least squares smoothing," IEEE Trans. Signal Process., vol. 47, no. 9, pp. 2345–2355, Sep. 1999.
[6] G. B. Giannakis, Y. Hua, P. Stoica, and L. Tong, Signal Processing Advances in Wireless and Mobile Communications. Upper Saddle River, NJ: Prentice-Hall, 2001.
[7] B. Radlovic, R. C. Williamson, and R. A. Kennedy, "Equalization in an acoustic reverberant environment: Robustness results," IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 311–319, May 2000.
[8] K. Lebart and J. Boucher, "A new method based on spectral subtraction for speech dereverberation," Acta Acoust., vol. 87, pp. 359–366, 2001.
[9] I. Tashev and D. Allred, "Reverberation reduction for improved speech recognition," in Proc. Hands-Free Commun. Microphone Arrays, 2005, pp. 8–9.
[10] M. Wu and D. L. Wang, "A one-microphone algorithm for reverberant speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2003, vol. 1, pp. 844–847.
[11] T. F. Quatieri, Discrete-Time Speech Processing: Principles and Practice. Upper Saddle River, NJ: Prentice-Hall, 1997.
[12] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Comput. Speech Lang., vol. 9, pp. 171–185, 1995.
[13] B. W. Gillespie and L. E. Atlas, "Acoustic diversity for improved speech recognition in reverberant environments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2002, vol. 1, pp. 557–600.
[14] B. Kingsbury and N. Morgan, "Recognizing reverberant speech with RASTA-PLP," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1997, vol. 2, pp. 1259–1262.
[15] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113–120, Apr. 1979.
[16] D. Gesbert and P. Duhamel, "Robust blind identification and equalization based on multi-step predictors," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1997, pp. 3621–3624.
[17] B. W. Gillespie, H. S. Malvar, and D. A. F. Florêncio, "Speech dereverberation via maximum-kurtosis subband adaptive filtering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2001, vol. 1, pp. 3701–3704.
[18] B. Yegnanarayana and P. Satyanarayana, "Enhancement of reverberant speech using LP residual," IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 267–281, May 2000.
[19] A. Álvarez, V. Nieto, P. Gómez, and R. Martínez, "Speech enhancement based on linear prediction error signals and spectral subtraction," in Proc. Int. Workshop Acoust. Echo Noise Control, 2003, vol. 1, pp. 123–126.
[20] N. D. Gaubitch, P. A. Naylor, and D. B. Ward, "On the use of linear prediction for dereverberation of speech," in Proc. Int. Workshop Acoust. Echo Noise Control, 2003, vol. 1, pp. 99–102.
[21] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. Upper Saddle River, NJ: Prentice-Hall, 2000.
[22] D. A. Harville, Matrix Algebra From a Statistician's Perspective. New York: Springer, 1997.
[23] M. Delcroix, T. Hikichi, and M. Miyoshi, "Blind dereverberation algorithm for speech signals based on multi-channel linear prediction," Acoust. Sci. Technol., vol. 26, no. 5, pp. 432–439, 2005.
[24] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-24, no. 4, pp. 320–327, Aug. 1976.
[25] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, no. 4, pp. 943–950, 1979.
[26] K. Kinoshita, T. Nakatani, and M. Miyoshi, "Spectral subtraction steered by multi-step linear prediction for single channel speech dereverberation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2006, vol. 1, pp. 817–820.
[27] J. J. Odell, "The use of context in large vocabulary speech recognition," Ph.D. dissertation, Cambridge Univ., Cambridge, U.K., 1995.
[28] D. Kressner and P. Van Dooren, "Factorizations and linear system solvers for matrices with Toeplitz structure," SLICOT Working Note, Tech. Rep., TU Berlin, Berlin, Germany, 2000.
[29] A. Varga and P. Benner, "SLICOT — A subroutine library in systems and control theory," Appl. Comput. Control, Signals, Circuits, vol. 1, pp. 499–539, 1999.
[30] P. Bondon, P. D. Ruiz, and A. Gallego, "Recursive methods for estimating multiple missing values of a multivariate stationary process," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1998, vol. 3, pp. 1361–1364.
[31] N. Kitawaki, M. Honda, and K. Itoh, "Speech-quality assessment methods for speech-coding systems," IEEE Commun. Mag., vol. 22, no. 10, pp. 26–33, 1984.
[32] H. Kuttruff, Room Acoustics. New York: Spon Press, 2000.
[33] M. J. F. Gales and S. J. Young, "Robust continuous speech recognition using parallel model combination," IEEE Trans. Speech Audio Process., vol. 4, no. 5, pp. 352–359, Sep. 1996.
[34] F. Martin, K. Shikano, and Y. Minami, "Recognition of noisy speech by composition of hidden Markov models," in Proc. Eurospeech, 1993, pp. 1031–1034.
[35] K. Kinoshita, M. Delcroix, T. Nakatani, and M. Miyoshi, "Multi-step linear prediction based speech dereverberation in noisy reverberant environment," in Proc. Interspeech, 2007, pp. 854–857.
Keisuke Kinoshita (M’05) received the M.E. degree
from Sophia University, Tokyo, Japan, in 2003.
He is currently a Member of Research Staff at
NTT Communication Science Laboratories, NTT
Corporation, and is engaged in research on speech
and music signal processing.
Mr. Kinoshita was honored to receive the 2004
ASJ Poster Award, the 2004 ASJ Kansai Young
Researcher Award, and the 2005 IEICE Best Paper
Award. He is a member of the ASJ and IEICE.
Marc Delcroix (M’07) was born in Brussels,
Belgium, in 1980. He received the M.Eng. degree
from the Free University of Brussels and the Ecole
Centrale Paris in 2003 and the Ph.D. degree from
the Graduate School of Information Science and
Technology, Hokkaido University, Sapporo, Japan,
in 2007.
From 2004 to 2008, he was a Researcher at NTT
Communication Science Laboratories, Kyoto, Japan,
and worked on speech dereverberation and speech
recognition. He is now with Pixela Corporation,
Osaka, Japan, on software development for digital television.
Dr. Delcroix received the 2005 Young Researcher Award from the Kansai Section of the Acoustical Society of Japan, the 2006 Student Paper Award from the IEEE Kansai Section, and the 2006 Sato Paper Award from the ASJ.
Tomohiro Nakatani (SM’06) received the B.E.,
M.E., and Ph.D. degrees from Kyoto University,
Kyoto, Japan, in 1989, 1991, and 2002, respectively.
He is a Senior Research Scientist with NTT Com-
munication Science Laboratories, NTT Corporation,
Kyoto, Japan. Since he joined NTT Corporation as a
Researcher in 1991, he has been investigating speech
enhancement technologies for developing intelligent
human–machine interfaces. From 1998 to 2001,
he was engaged in developing multimedia services
at business departments of NTT and NTT-East
Corporations. In 2005, he visited the Georgia Institute of Technology, Atlanta,
as a Visiting Scholar for a year.
Dr. Nakatani was honored to receive the 1997 JSAI Conference Best Paper Award, the 2002 ASJ Poster Award, and the 2005 IEICE Paper Award. He is a member of the IEEE CAS Blind Signal Processing Technical Committee, an Associate Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, and a Technical Program Chair of IEEE WASPAA-2007. He is a member of the IEICE and ASJ.
Masato Miyoshi (SM'04) received the M.E. and Ph.D. degrees from Doshisha University, Kyoto, Japan, in 1983 and 1991, respectively.
Since joining NTT Corporation, Kyoto, Japan, as a Researcher in 1983, he has been studying signal
a Researcher in 1983, he has been studying signal
processing theory and its application to acoustic
technologies. Currently, he is the leader of the
Signal Processing Group, the Media Information
Lab, NTT Communication Science Laboratories.
He is also a Guest Professor of the Graduate School
of Information Science and Technology, Hokkaido
University, Sapporo, Japan.
Dr. Miyoshi was honored to receive the 1988 IEEE ASSP Society’s Senior
Award, the 1989 ASJ Kiyoshi-Awaya Incentive Award, the 1990 and 2006 ASJ
Sato Paper Awards, and the 2005 IEICE Paper Award. He is a member of the
IEICE, ASJ, and AES.
... In MCLP-based methods, the late reverberation is predicted by filtering the past observation signals with the MCLP filter, and then is subtracted from the observed signal to obtain the desired speech signal. The MCLP filter can be estimated by minimizing the mean-square error (MSE) between the predicted late reverberation and observed signal [25], but the non-stationary property of speech signals is not taken into account in such a MSE estimator [26]. To overcome this limitation, the time-varying Gaussian (TVG) source model-based MCLP is proposed [27], [28], which is equivalent to minimizing the prediction error weighted by the reciprocal of the source variance, namely, weighted prediction error (WPE) [29]. ...
... where ξ f,t is the auxiliary variable and the equality of (25) holds if and only if the auxiliary variable satisfies ...
Article
This paper addresses the multi-channel linear prediction (MCLP)-based speech dereverberation problem by jointly considering the sparsity and low-rank priors of speech spectrograms. We utilize the complex generalized Gaussian (CGG) distribution as the source model and the generalized nonnegative matrix factorization (NMF) as the spectral model. The difference between the presented model and existing ones for MCLP is twofold. First, we adopt the CGG distribution with a timefrequency-variant scale parameter instead of that with a timefrequency-invariant scale parameter. Second, the time-frequencyvarying scale parameter is approximated by NMF in a lowrank manner. Based on the maximum-likelihood criterion, speech dereverberation is formulated as an optimization problem that minimizes the prediction error weighted by the reciprocal of sparse and low-rank parameters. A convergence-guaranteed algorithm is derived to estimate the parameters using the majorization-minimization technology. The WPE, NMF-based WPE and CGG-based WPE can be treated as special cases of the proposed method with different shape and domain parameters. As a byproduct, the proposed method provides a simple and elegant way to derive the CGG-based WPE algorithm. A series of experiments show the superiority of the proposed method over WPE, NMF-based WPE and CGG-based WPE methods.
... It is also the primary reason of deterioration in the performance of ASR under low SNR conditions. In order to cancel the reverberation from the extracted SOI, de-reverberation methods [5], [6] based on multi-channel linear prediction (MCLP) and spectral subtraction (SS) exist in literature. The state-of-the-art dereverberation technique based on linear prediction in spectral domain is the weighted prediction error (WPE) method [7]. ...
... This can be referred to as neuromorphic adaptive post-filtering, in which the estimated filter weights can be used as the post-filter on the extracted target SOI. As understood from contemporary research works [5][6], the process of de-reverberation is a difficult problem because of lack of knowledge of the source signal. Further, the a-priori knowledge of the acoustic environment is unavailable. ...
Preprint
Full-text available
Speech Enhancement using tensor decomposition-based source separation and convolutional, bi-directional recurrent neural network (CNN-biRNN) architecture is investigated in this paper. An acoustic receiver comprising uniform linear array (ULA) of microphone sensors is considered, where the ULA performs CANDECOMP/PARAFAC (CP) tensor decomposition to separate the individual speech source signals from the received mixture of multi-channel signals, followed by single channel de-reverberation by a variant of the CNN-biRNN referred to as DenseNet-biLSTM to enhance the target speech signal-of-interest (SOI). While the source separation module based on CP-tensor decomposition is responsible for extracting the target SOI, the subsequent deep learning framework based on DenseNet-biLSTM enhances the extracted SOI by performing de-noising and de-reverberation. It is demonstrated by computer simulations that the proposed approach leads to good performance under multiple interfering speakers and reverberation.
... Several conventional methods to be used as the baseline approaches are reviewed in this section. The typical processing flow of these methods has a dereverberation unit as the front end, e.g., WPE [50] and a separation unit as the back end, e.g., MPDR [5], TIKR [6], or IVA [13]. The cascaded structure of the DNN method, Beam-TasNet [42], is also considered as the baseline to illustrate the benefit of end-to-end training with SI-SNR. ...
... To account for the prolonged effects of reverberation, a multichannel convolutional signal model [50] for a single-source scenario is generally formulated in the T-F domain as ...
Article
Full-text available
In this paper, a multichannel learning-based network is proposed for sound source separation in reverberant field. The network can be divided into two parts according to the training strategies. In the first stage, time-dilated convolutional blocks are trained to estimate the array weights for beamforming the multichannel microphone signals. Next, the output of the network is processed by a weight-and-sum operation that is reformulated to handle real-valued data in the frequency domain. In the second stage, a U-net model is concatenated to the beamforming network to serve as a non-linear mapping filter for joint separation and dereverberation. The scale invariant mean square error (SI-MSE) that is a frequency-domain modification from the scale invariant signal-to-noise ratio (SI-SNR) is used as the objective function for training. Furthermore, the combined network is also trained with the speech segments filtered by a great variety of room impulse responses. Simulations are conducted for comprehensive multisource scenarios of various subtending angles of sources and reverberation times. The proposed network is compared with several baseline approaches in terms of objective evaluation matrices. The results have demonstrated the excellent performance of the proposed network in dereverberation and separation, as compared to baseline methods.
... Another category of dereverberation methods focuses on estimating an inverse filter to predict and suppress the late reflections present in the observed speech signal [16][17][18][19][20]. One notable method in this category that has gained significant attention in the field of speech processing is the weighted prediction error (WPE) method [19,20]. ...
Article
Full-text available
Weighted prediction error (WPE) is a linear-prediction-based method extensively used to predict and attenuate the late reverberation component of an observed speech signal. This paper introduces an extended version of the WPE method that enhances modeling accuracy in the time–frequency domain by incorporating crossband filters. Two approaches to extending the WPE with crossband filters are proposed and investigated. The first approach improves the model's accuracy but increases the computational complexity, while the second maintains the same computational complexity as the conventional WPE while still achieving improved accuracy and performance comparable to the first approach. To validate the effectiveness of the proposed methods, extensive simulations are conducted. The experimental results demonstrate that both methods outperform the conventional WPE in terms of dereverberation performance. These findings highlight the potential of incorporating crossband filters to improve the accuracy and efficacy of the WPE method for dereverberation tasks.
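To make the baseline concrete, here is a minimal single-channel, per-frequency-bin sketch of the conventional (band-diagonal) WPE iteration under the usual time-varying Gaussian source model; the crossband extensions discussed in the abstract would additionally include regressors from neighbouring frequency bins. Tap counts, delay, and iteration numbers below are illustrative assumptions.

import numpy as np

def wpe_bin(y, taps=10, delay=3, iters=3, eps=1e-6):
    """Conventional (band-diagonal) WPE for one STFT frequency bin.
    y: complex array of shape (T,) holding one bin over T frames."""
    T = len(y)
    # Delayed regressor matrix: Ytil[t, k] = y[t - delay - k]
    Ytil = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        d = delay + k
        Ytil[d:, k] = y[:T - d]
    x = y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(x) ** 2, eps)   # time-varying source variance
        Yw = Ytil / lam[:, None]
        R = Yw.conj().T @ Ytil                  # weighted covariance
        r = Yw.conj().T @ y                     # weighted cross-correlation
        g = np.linalg.solve(R + eps * np.eye(taps), r)
        x = y - Ytil @ g                        # dereverberated bin
    return x

The delay keeps the predictor from cancelling the direct path and early reflections, so only the late tail is regressed out; this is the same design principle as the multi-step prediction used in this article.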
... However, this method still has a problem: it is not effective in the presence of late reverberation. Another method, based on multiple-step linear prediction (MSLP), was proposed by Kinoshita et al. [14]. This method estimates the late reverberation by long-term MSLP and then suppresses it by subsequent SS. ...
Conference Paper
Full-text available
This paper proposes a restoration scheme for the instantaneous amplitude and phase of a speech signal using a complex version of Kalman filtering for speech enhancement. Previous studies have shown that restoring the instantaneous amplitude as well as the instantaneous phase by Kalman filtering with linear prediction (LP) on a Gammatone filterbank plays a significant role in speech enhancement. However, a remaining problem is that the instantaneous amplitude and phase are restored individually. Thus, this paper aims to solve this problem by studying the feasibility of restoring both the instantaneous amplitude and phase simultaneously based on complex Kalman filtering. The proposed method analyzes the real and imaginary parts of the analytic speech signal simultaneously, applying complex Kalman filtering with LP to the analytic signal. The expected outcomes are improvements in the signal-to-error ratio, correlation, and the quality as well as intelligibility of speech signals. Evaluation results showed that the proposed scheme effectively improves on the previous one in noisy reverberant environments.
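A toy version of the joint amplitude/phase idea: run a single complex-valued Kalman filter on the analytic signal, so that the magnitude and argument of the state estimate restore amplitude and phase together rather than separately. The AR(1) state model and the parameters below are simplifying assumptions, not the paper's exact formulation (which uses LP on a Gammatone filterbank).

import numpy as np

def complex_kalman_ar1(y, a, q, r):
    """Scalar complex Kalman filter for an AR(1) analytic signal:
        z[n] = a * z[n-1] + w[n],   y[n] = z[n] + v[n].
    The modulus/argument of the output give the restored instantaneous
    amplitude and phase jointly."""
    z, P = 0.0 + 0.0j, 1.0
    out = np.empty(len(y), dtype=complex)
    for n, obs in enumerate(y):
        # Predict
        z = a * z
        P = abs(a) ** 2 * P + q
        # Update (observation matrix H = 1): gain and correction
        K = P / (P + r)
        z = z + K * (obs - z)
        P = (1.0 - K) * P
        out[n] = z
    return out

The analytic signal can be obtained with scipy.signal.hilbert, and the AR coefficient a could itself come from linear prediction on each subband.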
... Extraction of speech features means capturing the special characteristics that make a person's identity unique and distinguish it from others in terms of voice quality; such known features help to recover the speaker's original identity. One of the major limitations of the conventional GMM is its supervector representation and subsequent factor-model analysis, which does not take the original acoustic characteristics into account [8]. The main contribution is the modeling of training and testing that verifies the spoken text using a threshold level, which collects the correct evidence. ...
Article
Full-text available
Speech is one form of biometric that combines both physiological and behavioral features. It is beneficial for remote-access transactions over telecommunication networks, and it is presently one of the most challenging tasks for researchers. People's mental status in the form of emotions is quite complex, and its complexity depends on internal behavior. Emotion and facial behavior are essential characteristics through which human internal thought can be predicted. Speech is one of the mechanisms through which a human's various internal reflections can be extracted, by focusing on the vocal tract, the flow of voice, voice frequency, etc. Emotions in voice specimens of people of different ages can be predicted through a deep learning approach using feature-extraction-based behavior prediction, which will help build a strong, intelligent healthcare system and provide data to doctors in medical institutes and hospitals to understand the physiological behavior of humans. Healthcare is a data-intensive clinical domain in which much detailed information is accessed, generated, and circulated periodically. Existing healthcare approaches such as tracing and tracking continually expose the system's limitations in protecting the privacy and security of patient data. In the healthcare system, the majority of the work involves exchanging or using highly confidential and personal data. A key issue is the modeling of approaches that guarantee the value of health-related data while protecting privacy and observing high behavioral standards. This will encourage large-scale adoption, especially as healthcare information collection is expected to continue well beyond the current pandemic. The research community is therefore looking for a privacy-preserving, secure, and sustainable system built on a technology called Blockchain. Sharing healthcare data among institutions is a very challenging task: centralized storage of records is a favored target for cyber attackers and complicates sharing information over a network. This paper therefore presents a Blockchain-based approach for sharing patient data in a secure manner. Finally, the proposed model was analyzed with different feature-extraction approaches to determine which performs better, in terms of error rate and accuracy, under different variations of voice specimens. The proposed method increases the rate of correct evidence collection, minimizes loss and authentication issues, and, by using feature extraction based on text validation, increases the sustainability of the healthcare system.
... Since in this thesis we are mainly concerned with situations where room reverberation (offices, living rooms, etc.) is not too strong, we will not seek to cancel the effects of reverberation, although doing so is sometimes desirable (Nakatani et al., 2010; Carbajal et al., 2020) and even necessary when the reverberation is strong (Kinoshita et al., 2009). ...
Thesis
Full-text available
Many of the devices we use every day embed one or more microphones so that they can be operated by voice command. The network of microphones that these devices can form is what is called an ad-hoc acoustic array (AAAH, from the French antenne acoustique ad-hoc). A speech enhancement stage is often applied to optimize the execution of voice commands. For this purpose, ad-hoc acoustic arrays, thanks to their flexibility of use, their wide spatial coverage, and the diversity of their recordings, offer great potential. This potential is nevertheless difficult to exploit because of the mobility of the devices, their low computing power, and bandwidth constraints. These limits prevent the use of "classical" speech enhancement algorithms, which rely on a fusion center and require high computing power. This thesis proposes to bring the field of deep learning together with that of ad-hoc acoustic arrays, combining the modeling power of neural networks (NNs) with the flexibility of ad-hoc arrays. To this end, we present a distributed speech enhancement system. It is distributed in that the constraint of a fusion center is removed. So-called compressed signals, exchanged between the nodes, convey the spatial information while reducing bandwidth consumption. NNs are used to estimate the coefficients of a multichannel Wiener filter. A detailed empirical analysis of this system is conducted on both synthetic and real data to validate its effectiveness and to highlight the benefit of jointly using NNs and classical distributed speech enhancement algorithms. We thus show that our system achieves performance equivalent to the state of the art, while being more flexible and significantly reducing the algorithmic complexity. Furthermore, we develop our solution to adapt it to operating conditions specific to ad-hoc acoustic arrays. We study its behavior when the number of devices in the array varies, and we compare the influence of two attention mechanisms, one based on spatial attention and the other on self-attention. Both attention mechanisms make our system resilient to a variable number of devices, and the weights of the self-attention mechanism reveal the usefulness of the information conveyed by each signal. We also analyze the behavior of our system when the signals of the different devices are desynchronized. We propose a solution for improving the performance of our system under asynchronous conditions by introducing another attention mechanism. We show that this attention mechanism makes it possible to recover the order of magnitude of the clock offset between the devices of an ad-hoc array. Finally, we show that our system is a viable solution for speech source separation. Even with NNs of simple architecture, it is able to effectively exploit the spatial information recorded by all the devices of an ad-hoc array in a typical meeting configuration.
... Nakamura [229] used a neural network-based model for multiple languages, including English and Japanese. Kinoshita et al. [155] concentrate on dealing with the effect of late reverberations. The proposed technique first estimates the late reverberations using long-term multi-step linear prediction, and afterward reduces the late reverberation effect by employing spectral subtraction. ...
Article
Full-text available
Speech recognition of a language is a key area in the field of pattern recognition. This paper presents a comprehensive survey of speech recognition techniques for non-Indian and Indian languages and compiles some of the computational models used for processing speech acoustics. An immense number of frameworks are available for speech processing and recognition for the languages spoken around the globe. However, only a limited number of automatic speech recognition systems are available for commercial use, and the gap between the languages spoken around the globe and the technical support available for them remains wide. This paper examines the major challenges of speech recognition for different languages. Analysis of the literature shows that the lack of standard databases for minority languages hinders speech recognition research across the globe. Compared with non-Indian languages, research on speech recognition of Indian languages (except Hindi) has not yet achieved the expected milestones. The combination of MFCC features and a DNN-HMM classifier is the most commonly used system for developing ASR for minority languages, whereas for some majority languages researchers are using more advanced DNN algorithms. It has also been observed that research in this field is quite thin and more work needs to be carried out, particularly for minority languages.
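The survey's one-line summary of Kinoshita et al. above corresponds to the method this article develops. As a rough single-channel sketch of the idea (illustrative parameters only; the published algorithm also includes pre-whitening and multichannel variants not shown here), the late reverberation is predicted from samples at least `delay` samples in the past and then removed in the magnitude domain:

import numpy as np

def frame_stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(win * x[s:s + n_fft]) for s in starts])

def istft(S, n_fft=512, hop=128, length=None):
    win = np.hanning(n_fft)
    x = np.zeros((S.shape[0] - 1) * hop + n_fft)
    norm = np.zeros_like(x)
    for i, spec in enumerate(S):
        s = i * hop
        x[s:s + n_fft] += win * np.fft.irfft(spec, n_fft)
        norm[s:s + n_fft] += win ** 2
    x /= np.maximum(norm, 1e-8)
    return x[:length] if length else x

def mslp_dereverb(y, order=400, delay=160, n_fft=512, hop=128,
                  oversub=1.0, floor=0.1):
    """Sketch: estimate late reverberation by delayed (multi-step)
    linear prediction, then attenuate it by spectral subtraction."""
    N, n0 = len(y), delay + order
    # Regressor matrix: A[t, k] = y[n0 + t - delay - k]
    T = N - n0
    A = np.stack([y[n0 - delay - k : n0 - delay - k + T]
                  for k in range(order)], axis=1)
    w, *_ = np.linalg.lstsq(A, y[n0:], rcond=None)   # prediction filter
    late = np.zeros(N)
    late[n0:] = A @ w                                # late-reverb estimate
    # Magnitude spectral subtraction with flooring
    Y, L = frame_stft(y, n_fft, hop), frame_stft(late, n_fft, hop)
    mag = np.maximum(np.abs(Y) - oversub * np.abs(L), floor * np.abs(Y))
    return istft(mag * np.exp(1j * np.angle(Y)), n_fft, hop, length=N)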
Article
Sound source separation, which separates multiple sound sources from a mixture, has continued to evolve by incorporating beamforming techniques from wireless communication and signal processing, optimization techniques based on probabilistic models, and deep learning techniques. This paper provides an overview of sound source separation techniques for multiple microphones based on a spatial model and a probabilistic sound source model, for a single microphone with deep learning, and for multiple microphones using a deep-learning-based sound source model together with a spatial model.
Article
Full-text available
Speech processing and recognition are key technologies for producing smart user interfaces in an increasing number of devices. Moreover, robust speech recognition is considered mandatory for the reliable operation of such elements in realistic working conditions. This paper proposes a method for processing speech degraded by noise and reverberation. The approach involves analyzing the prediction error signals from the Gradient Adaptive Lattice algorithm to produce a valid estimator suitable for combination with spectral subtraction techniques. The paper includes an evaluation of the performance of the algorithm in several speech recognition experiments in a car environment.
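For reference, one common normalized form of the Gradient Adaptive Lattice, whose stage-wise forward/backward prediction errors are the signals analyzed in the paper. This is a generic sketch with assumed step size and smoothing constants; the estimator the authors build from these errors is not reproduced here.

import numpy as np

def gal_prediction_error(x, order=8, mu=0.05, beta=0.9, eps=1e-6):
    """Gradient Adaptive Lattice: returns the final-stage forward
    prediction error signal for input x."""
    k = np.zeros(order)           # reflection (PARCOR) coefficients
    E = np.full(order, eps)       # per-stage input-power estimates
    b_prev = np.zeros(order + 1)  # backward errors from previous sample
    err = np.empty(len(x))
    for n, xn in enumerate(x):
        f = np.empty(order + 1)
        b = np.empty(order + 1)
        f[0] = b[0] = xn
        for m in range(1, order + 1):
            f[m] = f[m - 1] - k[m - 1] * b_prev[m - 1]
            b[m] = b_prev[m - 1] - k[m - 1] * f[m - 1]
            # Normalized stochastic-gradient update of the coefficient
            E[m - 1] = beta * E[m - 1] + (1 - beta) * (f[m - 1] ** 2 + b_prev[m - 1] ** 2)
            k[m - 1] += mu / E[m - 1] * (f[m] * b_prev[m - 1] + b[m] * f[m - 1])
        b_prev = b
        err[n] = f[order]
    return err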
Article
Full-text available
In this paper we present a dereverberation algorithm for improving automatic speech recognition (ASR) results with minimal CPU overhead. As the reverberation tail hurts ASR the most, late reverberation is reduced via gain-based spectral subtraction. We use a multi-band decay model with an efficient method for updating it in real time. In reverberant environments, the multichannel version of the proposed algorithm reduces word error rates (WER) by up to half of the gap between a microphone-array-only setup and a close-talk microphone. The four-channel implementation requires less than 2% of the CPU power of a modern computer.
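The gain computation behind such a multi-band decay model can be sketched as follows: each band's late-reverberation power is predicted by exponentially decaying the power observed a few frames earlier, and a spectral-subtraction-style gain with a floor is derived from it. The decay constant follows the standard T60 definition; everything else (delay, floor, band layout) is an illustrative assumption rather than the paper's exact realtime update.

import numpy as np

def decay_model_gains(P, t60_per_band, frame_rate, delay_frames=4, floor=0.1):
    """Gain-based late-reverberation suppression (sketch).
    P: (frames, bands) smoothed power spectrogram;
    t60_per_band: per-band reverberation times in seconds."""
    T, _ = P.shape
    # Energy decay e^{-2*delta*t} with delta = 3*ln(10)/T60, so that
    # the energy falls by 60 dB after T60 seconds.
    delta = 3.0 * np.log(10.0) / np.asarray(t60_per_band)
    decay = np.exp(-2.0 * delta * delay_frames / frame_rate)
    G = np.ones_like(P)
    for t in range(delay_frames, T):
        late = decay * P[t - delay_frames]   # predicted late-reverb power
        G[t] = np.maximum(1.0 - late / np.maximum(P[t], 1e-12), floor)
    return G  # apply to the power spectrogram (use sqrt(G) for magnitudes)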
Article
A novel approach for multimicrophone speech dereverberation is presented. The method is based on the construction of the null subspace of the data matrix in the presence of colored noise, using the generalized singular-value decomposition (GSVD) technique, or the generalized eigenvalue decomposition (GEVD) of the respective correlation matrices. The special Sylvester structure of the filtering matrix, related to this subspace, is exploited for deriving a total least squares (TLS) estimate for the acoustical transfer functions (ATFs). Other less robust but computationally more efficient methods are derived based on the same structure and on the QR decomposition (QRD). A preliminary study of the incorporation of the subspace method into a subband framework proves to be efficient, although some problems remain open. Speech reconstruction is achieved by virtue of the matched filter beamformer (MFBF). An experimental study supports the potential of the proposed methods.
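A minimal sketch of the null-subspace step via GEVD, assuming noise-only snapshots are available for estimating the noise correlation matrix; the Sylvester-structure exploitation and the TLS estimate of the transfer functions described in the abstract are beyond this fragment.

import numpy as np
from scipy.linalg import eigh

def null_subspace_gevd(Y, V, dim_null):
    """Null-subspace estimate in colored noise via a GEVD.
    Y: (m, frames) stacked observation snapshots;
    V: (m, frames) noise-only snapshots; dim_null: subspace dimension."""
    Ryy = Y @ Y.conj().T / Y.shape[1]   # observation correlation matrix
    Rvv = V @ V.conj().T / V.shape[1]   # noise correlation matrix
    # Generalized eigendecomposition  Ryy u = lambda Rvv u
    w, U = eigh(Ryy, Rvv)               # eigenvalues in ascending order
    return U[:, :dim_null]              # directions with smallest eigenvalues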
Article
Preface. - Matrices. - Submatrices and partitioned matrices. - Linear dependence and independence. - Linear spaces: row and column spaces. - Trace of a (square) matrix. - Geometrical considerations. - Linear systems: consistency and compatibility. - Inverse matrices. - Generalized inverses. - Idempotent matrices. - Linear systems: solutions. - Projections and projection matrices. - Determinants. - Linear, bilinear, and quadratic forms. - Matrix differentiation. - Kronecker products and the vec and vech operators. - Intersections and sums of subspaces. - Sums (and differences) of matrices. - Minimization of a second-degree polynomial (in n variables) subject to linear constraints. - The Moore-Penrose inverse. - Eigenvalues and eigenvectors. - Linear transformations. - References. - Index.