Conference Paper

Enhancement of speech corrupted by acoustic noise



This paper describes a method for enhancing speech corrupted by broadband noise. The method is based on the spectral noise subtraction method. The original method entails subtracting an estimate of the noise power spectrum from the speech power spectrum, setting negative differences to zero, recombining the new power spectrum with the original phase, and then reconstructing the time waveform. While this method reduces the broadband noise, it also usually introduces an annoying "musical noise". We have devised a method that eliminates this "musical noise" while further reducing the background noise. The method consists in subtracting an overestimate of the noise power spectrum, and preventing the resultant spectral components from going below a preset minimum level (spectral floor). The method can automatically adapt to a wide range of signal-to-noise ratios, as long as a reasonable estimate of the noise spectrum can be obtained. Extensive listening tests were performed to determine the quality and intelligibility of speech enhanced by our method. Listeners unanimously preferred the quality of the processed speech. Also, for an input signal-to-noise ratio of 5 dB, there was no loss of intelligibility associated with the enhancement technique.
M. Berouti, R. Schwartz,
and J. Makhoul
Bolt Beranek and Newman Inc.
We report on our work to enhance the quality of speech degraded by additive white noise. Our goal is to improve the listenability of the speech signal by decreasing the background noise, without decreasing the intelligibility of the speech. The noise is at such levels that the speech is essentially unintelligible out of context. We use the average segmental signal-to-noise ratio (SNR) to measure the noise level of the noise-corrupted speech signal. We found that sentences with an SNR in the range -5 to +5 dB have an intelligibility score in the range 20 to 80%. There is a strong correlation between the intelligibility of a sentence and the SNR, but intelligibility also depends on the speaker, on context, and on the phonetic content.
After an initial investigation of several methods of speech enhancement, we concluded that the method of spectral noise subtraction is more effective than others. In this paper we discuss our implementation of that method, which differs from that reported by others in two major ways: first, we subtract a factor α times the noise spectrum, where α is a number greater than unity and varies from frame to frame. Second, we prevent the spectral components of the processed signal from going below a certain lower bound which we call the spectral floor. We express the spectral floor as a fraction of the original noise power spectrum.
The basic principle of spectral noise subtraction appears in the literature in various implementations [1-4]. Basically, these methods of enhancement have in common the assumption that the power spectrum of a signal corrupted by uncorrelated noise is equal to the sum of the signal spectrum and the noise spectrum. The preceding statement is true only in the statistical sense. However, taking this assumption as a reasonable approximation for short-term (25 ms) spectra, its application leads to a simple noise subtraction method. Initially, the method we implemented consisted in computing the power spectrum of each windowed segment of speech and subtracting from it an estimate of the noise power spectrum. The estimate of the noise is formed during periods of "silence". The original phase of the DFT of the input signal is retained for resynthesis. Thus, the enhancement algorithm consists of a straightforward implementation of the following relationship:

    let D(ω) = Ps(ω) - Pn(ω)

    P̂(ω) = D(ω),  if D(ω) > 0
    P̂(ω) = 0,     otherwise              (1)
where P̂(ω) is the modified signal spectrum, Ps(ω) is the spectrum of the input noise-corrupted speech, and Pn(ω) is the smoothed estimate of the noise spectrum. Pn(ω) is obtained by a two-step procedure: First we average the noise spectra from several frames of "silence". Second, we smooth in frequency this average noise spectrum. For the case of white noise, the smoothed estimate of the noise spectrum is flat. The enhanced speech is obtained from both P̂(ω) and the original phase by an inverse Fourier transform:

    ŝ(n) = F⁻¹{ [P̂(ω)]^(1/2) e^(jθ(ω)) }    (2)

where θ(ω) is the phase function of the DFT of the
input speech. Since the assumption of uncorrelated signal and noise is not strictly valid for short-term spectra, some of the components of the processed spectrum, P̂(ω), may be negative. These values are set to zero as shown in (1).
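As an illustration, the basic method in (1) and (2) can be sketched for a single windowed frame as follows. This is a minimal sketch, not the authors' code; the frame length and the noise power estimate are assumed given.

```python
import numpy as np

def basic_spectral_subtraction(frame, noise_power):
    """Basic method of Eqs. (1)-(2): subtract the noise power estimate
    from the frame's power spectrum, set negative differences to zero,
    and resynthesize with the original (noisy) phase."""
    spectrum = np.fft.rfft(frame)
    phase = np.angle(spectrum)              # theta(w), retained for resynthesis
    power = np.abs(spectrum) ** 2           # Ps(w)
    diff = power - noise_power              # D(w) = Ps(w) - Pn(w)
    clean_power = np.maximum(diff, 0.0)     # Eq. (1): zero the negative values
    # Eq. (2): recombine the magnitude with the original phase and invert.
    return np.fft.irfft(np.sqrt(clean_power) * np.exp(1j * phase), n=len(frame))
```

With a zero noise estimate the frame passes through unchanged, which is a quick sanity check on the resynthesis.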
* An earlier version of this paper was presented at the ARPA Network Speech Compression (NSC) Group meeting, Cambridge, MA, May 1978, in a session on speech enhancement.

© 1979 IEEE
A major problem with the above implementation of the spectral noise subtraction method has been that a "new" noise appears in the processed speech signal. The new noise is variously described as ringing, warbling, of tonal quality, or "doodly-doos". We shall henceforth refer to it as the "musical noise". Also, though the noise is reduced, there is still considerable broadband noise remaining in the processed speech.
To explain the nature of the musical noise, one must realize that peaks and valleys exist in the short-term power spectrum of white noise; their frequency locations for one frame are random and they vary randomly in frequency and amplitude from frame to frame. When we subtract the smoothed estimate of the noise spectrum from the actual noise spectrum, all spectral peaks are shifted down while the valleys (points lower than the estimate) are set to zero (minus infinity on a logarithmic scale). Thus, after subtraction there remain peaks in the noise spectrum. Of those remaining peaks, the wider ones are perceived as time-varying broadband noise. The narrower peaks, which are relatively large spectral excursions because of the deep valleys that define them, are perceived as time-varying tones which we refer to as musical noise.
Our modification to the noise subtraction method consists in minimizing the perception of the narrow spectral peaks by decreasing the spectral excursions. This is done by changing the algorithm in (1) to the following:

    let D(ω) = Ps(ω) - αPn(ω)

    P̂(ω) = D(ω),     if D(ω) > βPn(ω)
    P̂(ω) = βPn(ω),   otherwise           (3)

    with α ≥ 1 and 0 < β << 1,

where α is the subtraction factor and β is the spectral floor parameter. The modified method is shown in Fig. 1. Note that (3) is identical to (1) for α=1 and β=0.

Fig. 1 Modified spectral noise subtraction with spectral floor.
From (3) it can be seen that the goal of reducing the spectral noise peaks can be achieved with α>1. For α>1 the remnants of the noise peaks will be lower relative to the case with α=1. Also, with α>1 the subtraction can remove all of the broadband noise by eliminating most of the wide peaks. However, this by itself is not sufficient, because the deep valleys surrounding the narrow peaks remain in the noise spectrum and, therefore, the excursion of noise peaks remains large. The second part of our modification consists of filling in the valleys. This is done in (3) by means of the spectral floor, βPn(ω): the spectral components are prevented from descending below the lower bound βPn(ω). For β>0, the valleys between peaks are not as deep as for the case β=0. Thus, the spectral excursion of noise peaks is not as large, which reduces the amount of the musical noise perceived. Another way to interpret the above is to realize that, for β>0, the remnants of noise peaks are now "masked" by neighboring components of comparable magnitude. These neighboring components in fact are broadband noise reinserted in the spectrum by the spectral floor βPn(ω). Indeed, speech processed by the modified method has less musical noise than speech processed by (1). We note here that for β<<1 the added broadband noise level is also much lower than that perceived in speech processed by (1).
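A sketch of the modified rule in (3), operating on one frame's power spectrum; the `alpha` and `beta` defaults are illustrative values from the ranges discussed below, not prescribed ones.

```python
import numpy as np

def modified_subtraction(power, noise_power, alpha=4.0, beta=0.01):
    """Eq. (3): subtract an overestimate alpha*Pn(w) of the noise power,
    then hold every component at or above the spectral floor beta*Pn(w)
    instead of clipping at zero."""
    diff = power - alpha * noise_power
    floor = beta * noise_power
    return np.where(diff > floor, diff, floor)
```

Setting alpha=1 and beta=0 recovers the basic rule (1).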
In order to be able to refer to the "broadband noise reduction" achieved by the method, we have conveniently expressed the spectral floor as a fraction of the original noise power spectrum. Thus, when the spectral floor effectively masks the musical noise, and when all that can be perceived is broadband noise, then the noise attenuation is given by β. For instance, for β=0.01, there is a 20 dB attenuation of the broadband noise.
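The 20 dB figure follows directly from expressing the floor as a fraction β of the original noise power; a one-line check:

```python
import math

def floor_attenuation_db(beta):
    """Broadband noise attenuation (in dB) implied by a spectral floor
    of beta times the original noise power spectrum."""
    return -10.0 * math.log10(beta)
```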
Various combinations of α and β give rise to a trade-off between the amount of remaining broadband noise and the level of the perceived musical noise. For β large, the spectral floor is high, and very little, if any, musical noise is audible, whereas with β small, the broadband noise is greatly reduced, but the musical noise becomes quite annoying. Similarly, we have found that, for a fixed value of β, increasing the value of α reduces both the broadband noise and the musical noise. However, if α is too large, the spectral distortion caused by the subtraction in (3) becomes excessive and the speech intelligibility may suffer.
In practice, we have found that at SNR=0 dB, a value of α in the range 3 to 6 is adequate, with β in the range 0.005 to 0.1. A large value of α, such as 5, should not be alarming. This is equivalent to assuming that the noise power to be subtracted is about 7 dB higher than the smoothed estimate. This "inflation" factor compensates for the fact that, at each frame, the variance of the spectral components of the noise is equal to the noise power itself. Hence, one must subtract more than the expected value of the noise spectrum (the estimate) in order to make sure that most of the noise peaks have been removed.
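The "about 7 dB" equivalence quoted above is just the factor α expressed in decibels; a quick check under that reading:

```python
import math

def inflation_db(alpha):
    """Decibel equivalent of subtracting alpha times the smoothed
    noise power estimate."""
    return 10.0 * math.log10(alpha)
```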
In order to reduce the speech distortion caused by large values of α, we decided to let α vary from frame to frame within the same sentence. To understand the rationale behind doing so, consider the graph of Fig. 2, which plots the value of the subtraction factor α versus the SNR. The dotted line in the figure shows a plot of the value of α used in an experiment where several sentences at different SNR were processed. In the experiment, α was constant for each utterance. At the completion of the experiment, we noticed that the optimal value of α, as determined empirically for best noise reduction with the least amount of musical noise,
is smaller for higher SNR inputs. We then decided that α could vary not only across sentences with different SNR but also across frames of the same sentence. The reason for allowing α to vary within a sentence is that the segmental SNR varies from frame to frame in proportion to the signal energy, because the noise level is constant. After extensive experimentation, we found that α should vary within a sentence according to the solid line in Fig. 2, with α=1 for SNR ≥ 20 dB. Also, we prevent any further increase in α for SNR < -5 dB. The slope of the line in Fig. 2 is determined by the value of the parameter α at SNR=0 dB. The SNR is estimated at each frame from the energy of the noise spectral estimate and the energy of the input speech. At each frame, the actual value of α used in (3) is given by:

    α = α₀ - SNR/s,   for -5 ≤ SNR ≤ 20    (4)

where α₀ is the desired value of α at SNR=0 dB, SNR is the estimated segmental signal-to-noise ratio, and 1/s is the slope of the line in Fig. 2. (For example, for α₀=4, s=20/3.) We found that using a variable subtraction factor reduces the speech distortion somewhat. If the slope (1/s) is too large, however, the temporal dynamic range of the speech becomes too large.
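The frame-adaptive rule in (4), with the clamping described above (α held fixed above 20 dB and below -5 dB), can be sketched as follows; the default values correspond to the α₀=4, s=20/3 example.

```python
def subtraction_factor(snr_db, alpha0=4.0, slope=3.0 / 20.0):
    """Eq. (4): alpha = alpha0 - SNR/s for -5 <= SNR <= 20 dB,
    where slope = 1/s; outside that range alpha is held constant."""
    snr_db = max(-5.0, min(20.0, snr_db))
    return alpha0 - slope * snr_db
```

With these defaults α falls linearly from 4 at SNR=0 dB to 1 at SNR=20 dB.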
To summarize, there are several qualitative aspects of the processed speech that can be controlled. These are: the level of the remaining broadband noise, the level of the musical noise, and the amount of speech distortion. These three effects are controlled mainly by the parameters α and β.
Aside from the parameters α and β discussed above, we investigated several other parameters. These are:
a) the exponent of the power spectrum of the input (so far assumed to be 1),
b) the normalization factor needed for output level adjustment,
c) the frame size,
d) the amount of overlap between frames,
e) the FFT order.
All of the above parameters interact with each other and with α and β. We shall now discuss each parameter individually.
Exponent of the Power Spectrum
We investigated raising the power spectrum of the input to some power γ before the subtraction. In this case, (3) becomes:

    let D(ω) = Ps^γ(ω) - αPn^γ(ω)

    P̂^γ(ω) = G·D(ω),     if G·D(ω) > βPn^γ(ω)
    P̂^γ(ω) = βPn^γ(ω),   otherwise          (5)

    with α ≥ 1 and 0 < β << 1,

where G is the normalization factor to be discussed later. Note that (5) is identical to (3) for γ=1 and G=1. Equation (5) is implemented by means of the same algorithm illustrated in Fig. 1, except that all symbols P(ω) are replaced by P^γ(ω), and the gain G follows the subtraction in Fig. 1 and precedes the thresholding. For a fixed value of α₀, the subtraction in (5) with a value of γ<1 results in a greater amount of spectral change than for the case γ=1. We note here that Boll [2,3] uses γ=0.5, with α=1 and β=0, whereas Suzuki et al. [1] and Curtis and Niederjohn [4] use γ=1.
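A sketch of the generalized rule with the exponent γ; the exact placement of the gain G (after the subtraction, before the thresholding) follows our reading of the text, so treat this ordering as an assumption.

```python
import numpy as np

def exponent_subtraction(power, noise_power, alpha, beta, gamma=1.0, gain=1.0):
    """Eq. (5): apply the subtraction rule of (3) to spectra raised to
    the power gamma, with the normalization gain G applied after the
    subtraction and before the spectral-floor thresholding."""
    ps = power ** gamma
    pn = noise_power ** gamma
    diff = gain * (ps - alpha * pn)
    floor = beta * pn
    return np.where(diff > floor, diff, floor) ** (1.0 / gamma)
```

For gamma=1 and gain=1 this reduces to (3), as noted above.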
Normalization Factor
The next parameter to consider is a normalization factor to scale the processed signal. Our initial experiments were all done with γ=1 and we found no need for such normalization. However, for γ<1 the subtraction alters the spectrum more drastically than for the case γ=1. Therefore, for lower γ, the processed output had an extremely low level, which prevented us from comparing sentences that were processed with different values of γ. Our initial approach to normalization was to force the energy of the processed signal at each frame to be equal to the difference between the input energy and the estimated noise energy. Once again, we were relying on the assumption that the signal and the noise are uncorrelated. This approach required that the normalization factor change drastically from frame to frame, which led to severe distortion in low energy frames. In our final
approach, we corrected the problem by keeping the normalization factor constant over most of the sentence. We accomplished this by starting with a high initial value for the normalization factor, and updating its value at high energy frames. The update takes place only if the newly derived factor is smaller than the previous one. In practice, we compute A = (1/γ)(Ps - Pn)/Pd, for Ps ≥ Pn, where Ps, Pn, and Pd are the estimated power of the signal, power of the noise, and power of the signal processed without the gain. If the value of A obtained is less than the previous value, we update the value of the normalization factor: G=A. Also, G is not allowed to be less than 1.0. The effect of the normalization is to keep the average level of the processed speech independent of the power γ used. Finally, we note that normalization takes place after the subtraction, but before the application of the spectral floor constraint. In this fashion, it is still possible to relate the spectral floor to the original input noise power by means of the constant β, irrespective of which power γ was used for the processing in (5). Thus, the perceived remaining broadband noise is determined only by βPn(ω).
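The gain-update logic described above can be sketched as a per-frame step. The formula for the candidate factor A is our reading of the expression in the text, so treat it as an assumption; Ps, Pn, and Pd are the per-frame powers defined above.

```python
def update_normalization(g, ps, pn, pd, gamma=1.0):
    """One frame of the gain update: derive a candidate factor A at
    high-energy frames (Ps >= Pn), accept it only if it is smaller
    than the current gain, and never let G drop below 1.0."""
    if ps < pn or pd <= 0.0:
        return g                              # skip low-energy frames
    a = (1.0 / gamma) * (ps - pn) / pd        # candidate factor A (assumed form)
    if a < g:
        g = a
    return max(g, 1.0)
```

Starting from a deliberately high G, the gain only ratchets downward over the sentence, which keeps it nearly constant once a good high-energy frame has been seen.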
Frame Size
The frame size had been set to 25 ms during the initial phase of our work. We have found that using an analysis frame shorter than 20 ms results in roughness, while increasing the frame size decreases the musical noise considerably. However, if the frame is too long, slurring occurs.
Window Overlap
Associated with the frame size is the amount
of overlap between consecutive frames. We have
used the Tukey window (flat
in its middle range and
with cosine tapering at each end) in order to
overlap and add adjacent segments of processed
speech. The overlap is necessary to prevent
discontinuities at frame boundaries. The amount of
overlap is usually taken to be 10% of the frame
size. However, for larger frames, 10% may be
excessive and might cause slurring of the signal.
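A sketch of the windowing described here, assuming SciPy's Tukey window is available (flat in the middle, cosine-tapered at the ends); the 320-sample frame and 32-sample overlap are illustrative, standing in for a 32 ms frame with 10% overlap.

```python
import numpy as np
from scipy.signal.windows import tukey

frame_len = 320                       # illustrative frame size
overlap = 32                          # ~10% of the frame size
hop = frame_len - overlap             # frame advance for overlap-add
# Tukey window: the cosine taper occupies `overlap` samples at each end,
# so adjacent processed frames cross-fade over the overlap region.
win = tukey(frame_len, alpha=2.0 * overlap / frame_len)
```

Processed frames are multiplied by `win`, advanced by `hop` samples, and summed, so the tapers smooth out discontinuities at frame boundaries.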
FFT Order
The third window—related parameter is the
order of the FFT. In general, enough zeros are
appended at one end of the windowed data prior to
obtaining the DFT, such that the total number of
points is a
power of 2 and, thus, an FFT routine
can be used. However, processing
in the frequency
domain causes the non-zero valued data to extend out of its original time-domain range into the added zeros. If the added-zero region is not long enough, time-domain aliasing might occur. Thus we
needed to investigate adding more zeros and using a
higher order FFT.
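The zero-padding step can be sketched as follows; padding to the next power of two is what lets an FFT routine be used.

```python
import numpy as np

def pad_to_fft_size(frame):
    """Append zeros so the total length is the next power of two,
    leaving room for the frequency-domain processing to spread into
    (guarding against time-domain aliasing)."""
    n = 1
    while n < len(frame):
        n *= 2
    return np.concatenate([frame, np.zeros(n - len(frame))])
```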
The discussions in Sections 3 and 4 shed some light on the effect that each parameter has on the quality of the processed speech. We performed several experiments to understand further how all these parameters interact. We were mainly interested in finding an optimal range of values for α and β. As mentioned earlier, these two parameters give us direct control of the three major qualitative aspects of processed speech: remaining broadband noise, musical noise, and speech distortion. Clearly, we desire values of α and β that would minimize those three effects. However, the effects of the parameters α and β on the quality of the processed speech are intimately related to the input SNR, the power γ, and the window-related parameters.
Throughout our experiments, we considered inputs with SNR in the range -5 to +5 dB and used values of γ=0.25, 0.5, and 1. We have experimented with several frame sizes (15 to 60 ms), different amounts of overlap between frames, and different FFT orders.
Through extensive experimentation we determined the range of values for each of the parameters of the algorithm. The ranges given below are meant to be guidelines rather than final "optimal" values. Optimality is a subjective choice and depends on the user's preference. Below we give some of the conclusions we reached:
Frame size: The frame size should be between 25 and 35 ms.
Overlap: The overlap between frames should be on the order of 2 to 2.5 ms.
FFT order: Our investigations did not show that time-domain aliasing was an important issue. Thus, the minimum FFT order corresponding to a given frame size is adequate, with no noticeable improvement in going to a higher order. The same was reported earlier by Boll [2,3].
Exponent of the power spectrum: Of the three values of γ we tried, γ=1 was found to yield better output quality, in general.
Subtraction factor: For γ=1, an optimal range for α₀ is 3 to 6 (for γ=0.5, α₀ should be in the range 2 to 2.2). The slope in (4) (or Fig. 2) is set such that α=1 for SNR ≥ 20 dB and α=α₀ at SNR=0 dB.
Spectral floor: The spectral floor depends on the average segmental SNR of the input, i.e., the noise level. For high noise levels (SNR = -5 dB) β should be in the range 0.02 to 0.06, and for lower noise levels (SNR = 0 or +5 dB) β should be in the range 0.005 to 0.02.
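The guideline ranges above can be collected into a small helper; the hard -5 dB boundary for switching β ranges is our simplification of the qualitative wording.

```python
def guideline_parameters(snr_db):
    """Parameter ranges suggested by the guidelines above, for gamma=1:
    frame size and overlap in ms, alpha0 range, and a beta range chosen
    by the input's average segmental SNR."""
    beta_range = (0.02, 0.06) if snr_db <= -5.0 else (0.005, 0.02)
    return {
        "frame_ms": (25, 35),
        "overlap_ms": (2.0, 2.5),
        "alpha0": (3.0, 6.0),
        "beta": beta_range,
    }
```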
Towards the end of our research we performed a formal listening test to assess the quality and intelligibility of the enhanced speech. The input speech varied in SNR from -5 to +5 dB. The processing was done using parameter values as suggested by the above guidelines. Subjects unanimously preferred the quality of the enhanced speech to that of the unprocessed signal. In addition, at input SNR=+5 dB, using the values β=0.005, γ=1, and a 32 ms frame size, the intelligibility of the enhanced speech was the same as that of the unprocessed signal. For lower SNR's, the intelligibility of the speech decreased somewhat.
Prior to performing the formal intelligibility test, our algorithm had been tuned for optimal quality, i.e., maximum noise reduction, without accurate knowledge of the effect of the method on speech intelligibility. We believe that it may be possible to maintain the same intelligibility while improving the listenability of the speech by further tuning the parameters of the system (mainly α₀ and β). The actual parameter values used in a specific situation depend on one's purpose in using the enhancement algorithm. In some applications a slight loss of intelligibility may be tolerable, provided the listenability of the speech is greatly improved. In other applications a loss in intelligibility may not be acceptable.
To conclude, the main differences between the basic spectral subtraction method and our implementation are that we subtract an overestimate of the noise spectrum and prevent the resultant spectral components from going below a spectral floor. Our implementation of the spectral noise subtraction method affords a great reduction in the background noise with very little effect on the intelligibility of the speech. Formal tests have shown that, at SNR=+5 dB, the intelligibility of the enhanced speech is the same as that of the unprocessed signal.
The authors wish to thank A.W.F.
Huggins for
his contributions to this research. This work was
sponsored by the Department of Defense.
1. H. Suzuki, J. Igarashi, and Y. Ishii, "Extraction of Speech in Noise by Digital," J. Acoust. Soc. of Japan, Vol. 33, No. 8, Aug. 1977.
2. S. Boll, "Suppression of Noise in Speech Using the SABER Method," ICASSP, April 1978.
3. S. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," submitted, IEEE Trans. on Acoustics, Speech and Signal Processing.
4. R.A. Curtis and R.J. Niederjohn, "An Investigation of Several Frequency-Domain Methods for Enhancing the Intelligibility of Speech in Wideband Random Noise," ICASSP, April 1978.
... There have been proposed several means to reduce the musical noise in de-hissed audio signals. Perhaps the simplest solution consists of overestimating the level of the noise power (Berouti, Schwartz, & Makhoul, 1979;Boll, 1979aBoll, , 1979bLorber & Hoeldrich, 1997;Vaseghi, 1988;Vaseghi & Rayner, 1988). This can be easily carried out by setting a>1 in equation (27). ...
... Other straightforward ways of reducing the audibility of musical noise repose on the use of spectral averaging within the computation of the suppression rules (Boll, 1979b), on the adoption of a minimum attenuation level for , b k H (Berouti et al., 1979;Lorber & Hoeldrich, 1997), and on the application of heuristic rules over the values of , b k H , measured during a set of consecutive processed frames (Vaseghi & Frayling-Cork, 1992). All those options attempt to set a suitable balance among the audibility of a residual noise floor, that of the musical noise, and a faithful preservation of the recorded content on the restored signal. ...
... There are a variety of sources that contribute to a low SNR such as acoustic, environmental or distorted sounds. Multiple classical approaches to audio denoising include Weiner filtering [124], [125], spectral subtraction [126], [127], minimum mean squared error (MMSE) estimation [128] and optimally-modified log-spectral amplitude (OM-LSA) estimation [129]. However, these methods can sometimes introduce additional artifacts, such as the generation of 'musical noise' through spectral subtraction due to the flat, short-time noise spectrum estimate that is subtracted from the whole spectrum [126]. ...
... Multiple classical approaches to audio denoising include Weiner filtering [124], [125], spectral subtraction [126], [127], minimum mean squared error (MMSE) estimation [128] and optimally-modified log-spectral amplitude (OM-LSA) estimation [129]. However, these methods can sometimes introduce additional artifacts, such as the generation of 'musical noise' through spectral subtraction due to the flat, short-time noise spectrum estimate that is subtracted from the whole spectrum [126]. Additionally, common among these approaches is the use mel frequency cepstral coefficients (MFCCs) as representational features and an outcome in voice recognition that is a more uniform, but less recognizable speech spectrum [130], [131]. ...
Full-text available
At the beginning of the COVID-19 pandemic, there was significant hype about the potential impact of artificial intelligence (AI) tools in combatting COVID-19 on diagnosis, prognosis, or surveillance. However, AI tools have not yet been widely successful. One of the key reason is the COVID-19 pandemic has demanded faster real-time development of AI-driven clinical and health support tools, including rapid data collection, algorithm development, validation, and deployment. However, there was not enough time for proper data quality control. Learning from the hard lessons in COVID-19, we summarize the important health data quality challenges during COVID-19 pandemic such as lack of data standardization, missing data, tabulation errors, and noise and artifact. Then we conduct a systematic investigation of computational methods that address these issues, including emerging novel advanced AI data quality control methods that achieve better data quality outcomes and, in some cases, simplify or automate the data cleaning process. We hope this article can assist healthcare community to improve health data quality going forward with novel AI development.
... Spectral subtraction was first proposed in 1979 [18], but nowadays there are many different approaches to it. In our algorithm we compute a filter gain coefficient based on the estimation of the signal and noise in a given time window. ...
Full-text available
The development of the Internet of things and automatisation in everyday life also influences our houses. There are more and more devices on the market which can be controlled remotely. One kind of such control involves the use of voice signals. This method tends to use microphone arrays and dedicated algorithms to enhance the speech signal and recognize the words in it. In this project, a small 5-microphone array was developed. To enhance the quality of the signal, dedicated software was written. It consists of several modules, including the direction of arrival estimation, denoising, and differentiation between adults and children. The results showed that the custom algorithm can increase the signal to noise ratio by up to 6 dB.
... Notable work in this area is that of McAulay and Malpass [103], who 18/129 formulated the spectral subtraction approach as a maximum likelihood estimation problem of the variance of each spectral component of the original clean signal. Other popular modifications are those that involve averaging or smoothing of the sample spectrum estimator, controlling the amount of subtracted noise[93,24]. ...
Full-text available
Today, the most fundamental issue of condition monitoring in most industrial plants is fault diagnostics and prognostics. One of the most effective approaches to investigate this issue is condition monitoring based on vibration signal analysis. With the development of industry, multi-threaded maintenance and multi-channel acquisition are becoming more widespread in the current, which put forward higher requirements for maintenance. Based on this observation, it is proposed in this thesis one automated diagnosis framework for the rolling element bearing that integrates the successive steps of fault detection, fault type identification, fault signal reconstruction and fault size characterization. The advantage is that the complete diagnosis process is completed at once, while involving only one key hyperparameter, which improves the degree of automation of current Condition Based Maintenance (CBM) and liberating human participation. In the presence of incipient fault, vibrations of rolling element bearings show symptomatic signatures in the form of repetitive impulses. This can be seen as a non-stationary signal whose statistical properties switch between two states. The proposed maintenance strategy models such characteristics with an explicit-duration hidden Markov model (EDHMM) and uses the estimated model parameters to perform integrated diagnosis without requiring the user's expertise. The detection of a fault is first achieved by means of a likelihood ratio test built on the EDHMM parameters. One statistical counting approach and posterior probability spectrum are then used for identifying the fault type automatically. In order to obtain the fault signal in some cases, one Bayesian filter based on the EDHMM parameters is constructed. Finally, the fault size is estimated from the duration times returned by EDHMM. Subsequently, the capability of the integrated auto-diagnosis framework is illustrated on different experimental datasets. 
The first validation is forced on the vibration data for specific conditions. The results prove the robust and accurate maintenance of the rolling element bearing. In addition, the result of accelerated degradation data also shows the effectiveness of the method, especially the ability of detecting failure occurrence and tracking quantitatively fault development. This technique has potential for using in the machine CBM.
... Inspired by the spectral subtraction method in digital processing [4], which involves subtracting the estimated noise spectrum from the image, we can enhance the human by subtracting the static component/spectrum in a sequence of 2D AoA images. In this way, the enhanced image will mostly capture the signals bounced off the human body, and the signals reflected from the surrounding environments will be removed. ...
Person re-identification (Re-ID) has become increasingly important as it supports a wide range of security applications. Traditional person Re-ID mainly relies on optical camera-based systems, which incur several limitations due to the changes in the appearance of people, occlusions, and human poses. In this work, we propose a WiFi vision-based system, 3D-ID, for person Re-ID in 3D space. Our system leverages the advances of WiFi and deep learning to help WiFi devices see, identify, and recognize people. In particular, we leverage multiple antennas on next-generation WiFi devices and 2D AoA estimation of the signal reflections to enable WiFi to visualize a person in the physical environment. We then leverage deep learning to digitize the visualization of the person into 3D body representation and extract both the static body shape and dynamic walking patterns for person Re-ID. Our evaluation results under various indoor environments show that the 3D-ID system achieves an overall rank-1 accuracy of 85.3%. Results also show that our system is resistant to various attacks. The proposed 3D-ID is thus very promising as it could augment or complement camera-based systems.
... Berouti [21] proposed a method in which the noise spectrum is overestimated. This overestimated spectrum of noise is subtracted from the noisy spectrum. ...
The subjective quality test of the enhanced speech from different enhancement algorithms for listeners with normal hearing (NH) capability as well as listeners with hearing impairment (HI) is reported. The subjective quality evaluation of speech enhancement methods in the literature survey is mostly done targeting NH listeners and fewer attempts are observed to subjectively evaluate for HI listeners. The algorithms evaluated are from four different classes: spectral subtraction class(SS), statistical model based class (minimum mean square error), subspace class(PKLT) and auditory class (ideal binary mask using STFT, ideal binary mask using gammatone filterbank and ideal binary mask using gammachirp filterbank). The algorithms are evaluated using four types of real world noises recorded in Indian scenarios namely cafeteria, traffic, station and train at -5, 0, 5 and 10 dB SNRs. The evaluation is being done as per ITU-T P.835 standard in terms of three parameters- speech signal alone, background noise and overall quality. The noisy speech database developed in Indian regional language, Marathi, at four SNRs -5, 0, 5 and 10 dB is used for evaluation. Significant improvement is observed in ideal binary mask algorithm in terms of overall quality and signal distortion ratings for NH and HI listeners. The performance of minimum mean square error is also observed comparable with the ideal binary mask algorithm in some cases.
... where |S(ω)|² and |N(ω)|² are the squared-magnitude spectra of the clean speech and the noise, respectively. Since the noise spectrum cannot be obtained directly, an estimate |N̂(ω)|² is obtained from the silent period [12]. The estimate of the clean speech spectrum is obtained by ...
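The estimation step this excerpt describes can be sketched directly: average the power spectrum over frames known to be silent to get the noise estimate, then subtract it from the noisy power spectrum with negative differences set to zero. Frame length, windowing, and all names here are our assumptions for illustration:

```python
import numpy as np

def estimate_noise_power(silent_frames, n_fft=256):
    """Average the power spectrum over frames known to contain no speech
    (the 'silent period').  silent_frames has shape (n_frames, n_fft)."""
    spectra = np.abs(np.fft.rfft(silent_frames * np.hanning(n_fft),
                                 n_fft, axis=1)) ** 2
    return spectra.mean(axis=0)

def clean_power_estimate(noisy_frame, noise_power, n_fft=256):
    """Basic rule: |S(w)|^2 ~= max(|Y(w)|^2 - |N_hat(w)|^2, 0)."""
    noisy_power = np.abs(np.fft.rfft(noisy_frame * np.hanning(n_fft),
                                     n_fft)) ** 2
    return np.maximum(noisy_power - noise_power, 0.0)
```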
The performance of a speech recognition system is highly dependent on the operational environment. Mismatched ambient conditions have an adverse impact on the performance of an Automatic Speech Recognition (ASR) system. The speech parameterization techniques for tonal speech recognition differ from those used for non-tonal speech recognition because tonal speech has two components: the basic linguistic unit and the tone. The same basic linguistic unit with different tones conveys different meanings. Therefore, the feature set used for tonal speech recognition must be capable of representing both. Tone is determined by the fundamental frequency of the speech signal, which is highly sensitive to noise. Since the parameterization used in non-tonal speech recognition systems discards this highly noise-sensitive tone-related information, the traditional noise elimination methods used for non-tonal speech recognition fail to deliver robust performance in tonal speech recognition. In the present study, we analyze the performance of different commonly used feature sets for noisy tonal speech recognition. A Hidden Markov Model (HMM) based speech recognizer has been used for performance evaluation. The noise elimination techniques of sub-band spectral subtraction and Wiener filtering have been used for noise reduction, and their relative performance has been evaluated.
Hearing-impaired people face numerous challenges with speech perception in the presence of interfering background noise. The common approach widely used to suppress interfering background noise is speech enhancement. Inspired by the improved results of combined temporal and spectral processing in speech enhancement, this study proposes temporal enhancement combined with two different spectral enhancement methods, with a novel approach of soft masking using a priori and a posteriori signal-to-noise ratio uncertainty. The study investigates objective evaluations of quality and intelligibility, namely the hearing aid speech quality index and the hearing aid speech perception index, for spectral and combined temporal-spectral speech enhancement methods over typical patterns of hearing loss characterized by six audiograms. For evaluation, clean speech files from the NOIZEUS database are mixed with four local noises, namely cafeteria, traffic, station, and train, at -5, -3, 0, 3, 5, and 10 dB SNRs. These local noises are quite common and are encountered by people in their day-to-day lives. In most of the testing conditions, the new combined temporal-spectral enhancement shows improved results in comparison with the purely spectral processing methods.
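The a priori / a posteriori SNR soft masking mentioned above can be illustrated with a generic Wiener-style gain. This is a textbook form given for illustration only, not the exact estimator evaluated in that paper; all names are ours:

```python
import numpy as np

def wiener_soft_mask(noisy_power, noise_power, eps=1e-12):
    """Generic soft-mask sketch: the a posteriori SNR gamma = |Y|^2 / |N|^2
    yields a simple a priori SNR estimate xi = max(gamma - 1, 0), and the
    Wiener-style gain xi / (1 + xi) is applied to each spectral bin."""
    gamma = noisy_power / (noise_power + eps)   # a posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.0)           # a priori SNR estimate
    gain = xi / (1.0 + xi)                      # soft mask in [0, 1)
    return gain * noisy_power
```

Unlike a binary mask, the gain varies smoothly with the estimated SNR, so uncertain bins are attenuated rather than switched off.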
A stand-alone noise suppression algorithm is described for reducing the spectral effects of acoustically added noise in speech. A fundamental result is developed which shows that the spectral magnitude of speech plus noise can be effectively approximated as the sum of the magnitudes of speech and noise. Using this simple phase-independent additive model, the noise bias present in the short-time spectrum is reduced by subtracting off the expected noise spectrum calculated during nonspeech activity. After bias removal, the time waveform is recalculated from the modified magnitude and the saved phase. This Spectral Averaging for Bias Estimation and Removal, or SABER, method requires only one FFT per time window for analysis and synthesis.
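The single analysis-synthesis pass described above (magnitude-domain subtraction with the saved phase) might be sketched as follows. `noise_mag` stands in for the expected noise magnitude measured during nonspeech activity; names and framing are our assumptions:

```python
import numpy as np

def enhance_frame(noisy_frame, noise_mag):
    """One FFT per window: subtract the expected noise magnitude from the
    noisy magnitude (clipping at zero), keep the noisy phase unchanged,
    and invert back to a time waveform."""
    spectrum = np.fft.rfft(noisy_frame)
    mag = np.maximum(np.abs(spectrum) - noise_mag, 0.0)   # bias removal
    phase = np.angle(spectrum)                            # saved phase
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(noisy_frame))
```

With a zero noise estimate the frame passes through unchanged, since the magnitude and saved phase exactly reconstruct the original spectrum.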
This paper describes results of a study of several frequency-domain processing methods for enhancing the intelligibility of speech in wideband random noise. Five categories of processing methods are explored. These include the INTEL technique, a technique based upon minimum mean square filtering, several techniques based upon subtraction of the estimated spectrum of the noise from the spectrum of the speech plus noise, spectrum squaring, and techniques based upon pitch frequency analysis. The results of this study have provided considerable insight into the individual processing methods and into the use of frequency-domain processing methods in general. A major conclusion of this work is that all successful techniques investigated are similar in that they are an attempt to emphasize spectral components as a function of the amount by which they exceed the noise. A second conclusion is that unless the spectral weighting within a time-window is relatively smooth, it will introduce conspicuous background distortion.
A stand-alone noise suppression algorithm is presented for reducing the spectral effects of acoustically added noise in speech. Effective performance of digital speech processors operating in practical environments may require suppression of noise from the digital waveform. Spectral subtraction offers a computationally efficient, processor-independent approach to effective digital speech analysis. The method, requiring about the same computation as high-speed convolution, suppresses stationary noise from speech by subtracting the spectral noise bias calculated during nonspeech activity. Secondary procedures are then applied to attenuate the residual noise left after subtraction. Since the algorithm resynthesizes a speech waveform, it can be used as a pre-processor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
H. Suzuki, J. Igarashi, and Y. Ishii, "Extraction of Speech in Noise by Digital Filtering," J. Acoust. Soc. of Japan, Vol. 33, No. 8, Aug. 1977, pp. 405-411.
Wideband Random Noise," ICASSP, April 1978, pp. 602-605.