SPEECH EXTRACTION FROM JAMMED SIGNALS IN
DUAL-MICROPHONE SYSTEMS
Rafał Samborski, Mariusz Ziółko, Bartosz Ziółko, Jakub Gałka
Department of Electronics
AGH University of Science and Technology
Al. Mickiewicza 30, 30-059 Kraków, Poland
{samborski, ziolko, bziolko, jgalka}@agh.edu.pl
ABSTRACT
This paper presents two different methods of speech extraction: cross-correlation analysis and adaptive filtering. The algorithms are designed to extract conversations in noisy environments. Such situations can appear in the materials of police investigations or in multi-speaker environments. Noise can be added intentionally by suspects or unintentionally (e.g. in a car interior). Both of the algorithms are based on recordings from a dual-microphone system. The presented methods use the small differences between the recordings. The algorithms were compared with respect to SNR improvement and improvement in speech understanding.
KEY WORDS
source-separation, adaptive filtration, multi-microphone
systems
1 Introduction
Eavesdropping is one of the most efficient and cheapest ways to provide evidence of crimes for police and homeland security investigations. However, there are many difficulties connected with recordings obtained by listening-in systems. Most of them are caused by random noise or by intentionally added disturbances.
One of the solutions to the above problems is based on multi-microphone arrays. The application of microphone arrays to speech enhancement is a well-defined field with several methods: beamforming [2], superdirective beamforming [1], postfiltering [5] and phase-based filtering [3, 7]. However, all solutions known to the authors are focused on solving the problem of random background noise caused by the environment where the recording takes place. It means that they operate on the model

$$s_{m1,\mathrm{in}}(t) = s_{\mathrm{voice}}(t) + n_1(t),$$
$$s_{m2,\mathrm{in}}(t) = s_{\mathrm{voice}}(t - \tau_1) + n_2(t), \qquad (1)$$
where $s_{\mathrm{voice}}(t)$ is a speech signal and the delay $\tau_1$ is caused by the longer distance to the second microphone. The signals $n_1(t)$ and $n_2(t)$ represent microphone and environmental noise, respectively. This can be the noise of a car engine, traffic noise, disturbances caused by wind, or the noise of the recording system. An important property is that $n_1(t)$ and $n_2(t)$ are uncorrelated.
Our case is significantly different, because the noise is added intentionally by a conversing human to degrade the quality of the recordings as much as possible. Let us consider the model accommodated to such a situation:

$$s_{m1,\mathrm{in}}(t) = s_{\mathrm{voice}}(t) + s_{\mathrm{dist}}(t) + n_1(t),$$
$$s_{m2,\mathrm{in}}(t) = s_{\mathrm{voice}}(t - \tau_1) + s_{\mathrm{dist}}(t - \tau_2) + n_2(t), \qquad (2)$$

where $s_{\mathrm{dist}}(t)$ is the intentionally added disturbance. The delay $\tau_2$ is not equal to $\tau_1$ because of the differences in the distances between the microphones and the audio signal source.
Figure 1. Dual-microphone scenario of listening-in to a conversation where a source of a distracting signal, like a radio, was used to hide the content of the conversation. (Distances $a$, $b$ run from the speakers to the microphones; $c$, $d$ from the disturbance source to the microphones.)
What makes the situation more complicated is that the microphones have to be hidden from the speakers, in places where it was possible to plant a tapping device. This is much different from scenarios typical for information centres or conference rooms. Several efficient methods, including phase-based filtering, which is a form of time-frequency masking (PBTFM) [7], require the speaker's position to be known. This is not possible in our scenario, because the speakers are in their homes, cars or jail cells, where they can move around, and the microphones are listening-in devices. In such a case, cross-correlation and adaptive filtration algorithms seem to be a good solution.
As the algorithms which we developed are universal, they can be useful in many other applications, including commercial civilian ones, such as noise cancelling in automatic speech recognition or hands-free car systems.
The paper is organised as follows. Section 2 presents the details of the recording scenario we consider. Sections 3 and 4 describe the cross-correlation and adaptive filtration algorithms, respectively. Section 5 provides the results of the experiments we conducted. The paper is summed up by conclusions.
2 Problem description
The problem of separating a conversation from the audio signal is depicted in Figure 1 [9]. The audio signals are acquired by two hidden microphones. There are two speaking persons who use a distracting signal, like music from a radio receiver, to block off understanding of the content of their conversation. In order to proceed with detecting the speech signal in the noised signals recorded by the two microphones, at least one of the conditions $a \neq b$ or $c \neq d$ must hold. The difference between these distances can be relatively small. To verify this, let us assume a sampling frequency of 44 100 Hz. Then the time difference of one sample relates to a distance

$$\rho = v\, t, \qquad (3)$$

where $v$ is the sound velocity and $t$ is the sampling period. For the values $v = 330\ \mathrm{m/s}$ and $t = 23\ \mu\mathrm{s}$ one obtains $\rho \approx 7.5\ \mathrm{mm}$. For a real-case application, a difference of at least around ten samples between the signals from both microphones is needed to proceed. This gives a few centimetres as the necessary difference between the distances. In the very special case that, at once, both $a \approx b$ and $c \approx d$, the method would not work. However, this is a very rare scenario in real-world situations.
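These figures are easy to verify numerically. Below is a minimal Python sketch of the calculation in (3); the variable names are ours, and the values are those used in the text.

```python
# Distance resolution of a one-sample delay at 44.1 kHz (eq. (3)).
v = 330.0                    # sound velocity [m/s]
fs = 44100.0                 # sampling frequency [Hz]
t = 1.0 / fs                 # sampling period, about 23 us
rho = v * t                  # distance travelled in one sampling period
print(f"rho = {rho * 1e3:.1f} mm")              # about 7.5 mm
print(f"10 samples = {10 * rho * 1e2:.1f} cm")  # a few centimetres
```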
The algorithms described below utilise the differences between these distances. The cross-correlation algorithm additionally uses the differences in frequency bands: higher for the music signal and lower for speech, both detectable by ear. Figure 2 shows a model of the spectral density of human speech and of music (a trumpet). It is noticeable that in this case the spectrum of the musical instrument is wider than the spectrum of human speech. What is more, the maximum of the spectral energy of the music lies at much higher frequencies than the maximum of the spectral density of the human voice.
3 Cross-correlation algorithm
The difference between the spectral densities of speech and music is the basis of the cross-correlation algorithm. A block diagram of this algorithm is depicted in Figure 3. Let us assume that $s_{m1,\mathrm{in}}$ and $s_{m2,\mathrm{in}}$ are the recordings from Microphone 1 and Microphone 2 (see Figure 1), respectively. The ranges $d$ and $c$ are the distances from the disturbance source to the microphones, and the ranges $b$ and $a$ are the distances from the talking people to the microphones. Our aim is to find $\tau_2$ from (2), which is a shift in time resulting from the difference between $d$ and $c$. We will use the cross-correlation function $c(\tau)$, which is defined as

$$c(\tau) = \sum_n s_1(n - \tau)\, s_2(n), \qquad (4)$$

to find $\tau_2$.

Figure 2. Comparison of the spectral density of speech (light grey) and a trumpet (dark grey). The band of the filter we used is marked.
Figure 3. Block diagram of the cross-correlation algorithm for speech extraction.
The time delay $\tau$ is calculated from the cross-correlation, taking into account the frequency band from 4 to 6.5 kHz, which lies above the voice frequency band. A major part of the distortion signal energy belongs to this band, which allows us to cut off $s_{\mathrm{dist}}$ and extract $s_{\mathrm{voice}}$.

Let us assume $s_1$ and $s_2$ to be the filtered signals from the microphones. The band-pass filters are set as shown in Figure 2. Such settings allow us to cut off the majority of the speech signal. The output signals $s_{m1,\mathrm{out}}$ and $s_{m2,\mathrm{out}}$ then contain mainly the music signal, which makes it easier to find the maximal value of the cross-correlation (see Figure 4):

$$\tau_2 = \arg\max_{\tau} \left[ \sum_n s_{m1,\mathrm{out}}(n - \tau)\, s_{m2,\mathrm{out}}(n) \right]. \qquad (5)$$
Figure 4. Correlation coefficient as a function of a shift
between signals.
The delay determined above is used in a delay block $z^{-\tau_2}$ (see Figure 3). As a result we get a signal in which the impact of the distance has been compensated. The speech signal can then be found as

$$s_{\mathrm{voice}}(n) = s_{m1,\mathrm{in}}(n) - k\, s_{m2,\mathrm{in}}(n - \tau_2), \qquad (6)$$

where

$$k = \sqrt{\frac{\sum_n \left(s_{m1,\mathrm{out}}(n)\right)^2}{\sum_n \left(s_{m2,\mathrm{out}}(n)\right)^2}} \qquad (7)$$

is an amplification factor which compensates for the difference in power of the music signal coming from the different distances $c$ and $d$.
In some cases $\tau_1$ and $\tau_2$ can be negative. This is why the computations are not conducted in real time. It results in an additional delay of the extracted voice $s_{\mathrm{voice}}$ equal to $\max(|\tau_1|, |\tau_2|)$.
4 Adaptive filtration algorithm
There are several practical problems which cannot be handled by the algorithm described above. Probably the most important one is the existence of reverberation and, more generally, the various ways in which waves propagate at different frequencies. A filter with a dedicated phase characteristic can be a good solution in this situation. As the location of the talking person is not time-invariant, the filter should adapt to the circumstances. Adaptive filtration seems to fulfil all our requirements.

The same input signals described by (2) are the inputs for the second method we examined. A block diagram of the adaptive filtration algorithm is depicted in Figure 5 [4, 8]. The architecture of this algorithm differs from the previous one: low-pass filters were applied instead of band-pass filters. Since speech is expected as the final signal, filtering the higher band would unnecessarily complicate the filtration. What is more, when wideband signals were filtered, the algorithm did not manage to find proper filter coefficients.
Figure 5. Block diagram of the adaptive filtration algorithm for speech extraction.
As an LMS (Least Mean Square) adaptive filter was used, the output $s_{\mathrm{voice}}$ is required to be a minimum-mean-squared-error estimate of the clean speech signal. Rodriguez [6] expressed the power of the noise in the output as (the discrete time index has been omitted for clarity)

$$\begin{aligned}
E_n &= E[(s_{\mathrm{voice}} - s_{m1,\mathrm{in}})^2] \\
&= E[s_{\mathrm{voice}}^2] - 2E[s_{m1,\mathrm{in}}\, s_{\mathrm{voice}}] + E[s_{m1,\mathrm{in}}^2] \\
&= E[s_{\mathrm{voice}}^2] - 2E[s_{m1,\mathrm{in}}\,(s_{m1,\mathrm{in}} + s_{m2,\mathrm{in}} - y)] + E[s_{m1,\mathrm{in}}^2] \\
&= E[s_{\mathrm{voice}}^2] - E[s_{m1,\mathrm{in}}^2] - 2E[s_{m1,\mathrm{in}}\, s_{m2,\mathrm{in}}] + 2E[s_{m1,\mathrm{in}}\, y] \\
&= E[s_{\mathrm{voice}}^2] - E[s_{m1,\mathrm{in}}^2],
\end{aligned} \qquad (8)$$

where $y$ is the estimate of the primary noise created by the LMS filter. As the signal $s_{m1,\mathrm{in}}$ is unaffected by the adaptive filter, the algorithm sets the coefficients of the filter to minimize the total output power $E[s_{\mathrm{voice}}^2]$.
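As a sketch, this branch can be implemented with a time-domain LMS loop: the filter estimates the disturbance in the primary channel from the low-pass-filtered reference, and the error signal is the extracted voice. The filter length, step size and cut-off frequency below are our assumptions; the paper does not report its parameter values.

```python
import numpy as np
from scipy.signal import butter, lfilter

def lowpass(x, fs, cutoff=4000.0, order=4):
    # LPFs replace the BPFs of the first method (see Figure 5).
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return lfilter(b, a, x)

def lms_extract(s1_in, s2_in, fs, taps=64, mu=0.05, eps=1e-8):
    d = lowpass(s1_in, fs)          # primary input: speech + disturbance
    x = lowpass(s2_in, fs)          # reference input: correlated disturbance
    w = np.zeros(taps)              # adaptive filter coefficients
    out = np.zeros(len(d))          # extracted voice estimate
    for n in range(taps - 1, len(d)):
        u = x[n - taps + 1:n + 1][::-1]   # newest reference sample first
        y = w @ u                         # estimate of the primary noise
        e = d[n] - y                      # error = output s_voice in Figure 5
        w += mu * e * u / (eps + u @ u)   # normalised LMS update
        out[n] = e
    return out
```

The update shown is the normalised (NLMS) variant, which divides the step by the local power of the reference; it tends to be more robust than plain LMS when the music level varies.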
5 Experiments
The implemented algorithms were examined using recordings containing speech disturbed by music. The recordings were produced in our laboratory using two microphones with cardioid beam patterns. The speaking persons and the disturbance source were located in front of the microphones. We simulated natural conditions, with both the speakers and the disturbance source recorded simultaneously in the same session.

Both the cross-correlation and the adaptive filtration algorithms were optimized for the largest increase of the Voice-To-Music Ratio (VMR). To measure the VMR of the given signals we assumed that the voice signal is unaffected by either algorithm. The signals contain voice disturbed by music. There were segments in which only the music is audible due to a very low VMR (less than -10 dB). Let us assume that $P_{m1,\mathrm{voice}}$ is the average energy of the input signal $s_{m1,\mathrm{in}}$ in a range $(n_1, n_2)$ where the voice is disturbed, counted as follows:

$$P_{m1,\mathrm{voice}} = \frac{1}{n_2 - n_1 + 1} \sum_{n=n_1}^{n_2} s_{m1,\mathrm{in}}^2(n). \qquad (9)$$

Then $P_{m1,\mathrm{music}}$ is the average energy of the input signal $s_{m1,\mathrm{in}}$ in a range $(n_1, n_2)$ where only the music is audible. Therefore let us define the VMR of $s_{m1,\mathrm{in}}$ as

$$VMR_{m1} = 10 \log \left( \frac{P_{m1,\mathrm{voice}}}{P_{m1,\mathrm{music}}} \right). \qquad (10)$$

$VMR_{m2}$ and $VMR_{\mathrm{out}}$ are the VMRs for $s_{m2,\mathrm{in}}$ and the output signal, respectively. Then we count the increase $\Delta VMR$ of the VMR as

$$\Delta VMR = VMR_{\mathrm{out}} - \max(VMR_{m1}, VMR_{m2}). \qquad (11)$$
Table 1 compares the results for both of the described algorithms by presenting the improvement in VMR.

Table 1. The improvement results for the described algorithms.

Algorithm             Increase of VMR [dB]
Cross-correlation     2.0
Adaptive filtering    2.9
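The evaluation in (9)-(11) is straightforward to reproduce. The sketch below assumes the voice and music-only segment boundaries are chosen by hand from the recordings; they are inputs to the measure, not something either algorithm estimates.

```python
import numpy as np

def avg_energy(x, seg):
    # Average energy over samples n1..n2 inclusive (eq. (9)).
    n1, n2 = seg
    return np.mean(x[n1:n2 + 1] ** 2)

def vmr(x, voice_seg, music_seg):
    # Voice-To-Music Ratio in dB (eq. (10)).
    return 10 * np.log10(avg_energy(x, voice_seg) / avg_energy(x, music_seg))

def vmr_increase(s1_in, s2_in, s_out, voice_seg, music_seg):
    # Improvement over the better of the two inputs (eq. (11)).
    return vmr(s_out, voice_seg, music_seg) - max(
        vmr(s1_in, voice_seg, music_seg),
        vmr(s2_in, voice_seg, music_seg))
```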
6 Conclusion
The presented methods of analysing signals from two microphones were found successful in recovering conversations. In the cross-correlation algorithm, BPFs with a band which is audible to the human ear but above the speech frequencies were found successful. Using BPFs with the same settings as in the first method in the adaptive filtration algorithm seems to be unnecessary or even undesirable, because of problems with finding proper filter coefficients at frequencies above 4-5 kHz.

The adaptive filtration algorithm gave better results in the examined cases. The main disadvantage of this method is that it is much more complex and computationally demanding, which can be important in an end-user implementation. Further investigations will be focused on this second method.

The scenario assumptions are that there are very few possible localizations of the microphones and that they have to be hidden as listening-in devices. The disruptive signal (e.g. music from a radio) is added intentionally by the conversing speakers, along with noise, to hide the speech content.
Acknowledgement
This work was supported by MNiSW grant OR00001905.
References
[1] J. Bitzer, K. U. Simmer, and K. D. Kammeyer. Theoretical noise reduction limits of the generalized sidelobe canceller (GSC) for speech enhancement. Proc. IEEE Int. Conference on Acoustics, Speech, Signal Processing, 5:2965-2968, 1999.
[2] G. DeMuth. Frequency domain beamforming techniques. Proc. IEEE Int. Conference on Acoustics, Speech, Signal Processing, 2:713-715, 1977.
[3] D. Halupka, A. S. Rabi, P. Aarabi, and A. Sheikholeslami. Low-power dual-microphone speech enhancement using field programmable gate arrays. IEEE Transactions on Signal Processing, 55(7):3526-3535, 2007.
[4] S. Haykin. Adaptive Filter Theory. Prentice-Hall, 1986.
[5] C. Marro, Y. Mahieux, and K. U. Simmer. Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering. IEEE Trans. Speech, Audio, Signal Process., 6:240-259, 1998.
[6] J. J. Rodriguez, J. S. Lim, and E. Singer. Adaptive noise reduction in aircraft communication system. Proc. IEEE Int. Conference on Acoustics, Speech, Signal Processing (ICASSP), 12:169-172, 1987.
[7] G. Shi and P. Aarabi. Robust digit recognition using phase-dependent time-frequency masking. Proc. IEEE Int. Conference on Acoustics, Speech, Signal Processing (ICASSP), Hong Kong, pages 684-687, 2003.
[8] S. V. Vaseghi. Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, 2006.
[9] M. Ziółko, B. Ziółko, and R. Samborski. Dual-microphone speech extraction from signals with audio background. Proc. IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2009.