SPEECH EXTRACTION FROM JAMMED SIGNALS IN
DUAL-MICROPHONE SYSTEMS
Rafał Samborski, Mariusz Ziółko, Bartosz Ziółko, Jakub Gałka
Department of Electronics
AGH University of Science and Technology
Al. Mickiewicza 30, 30-059 Kraków, Poland
{samborski, ziolko, bziolko, jgalka}@agh.edu.pl
ABSTRACT
This paper presents two methods of speech extraction: cross-correlation analysis and adaptive filtering. The algorithms are designed to extract conversations recorded in noisy environments, as encountered in police investigation materials or multi-speaker settings. Noise can be added intentionally by suspects or unintentionally (e.g. in a car interior). Both algorithms are based on recordings from a dual-microphone system and exploit the small differences between the two recordings. The algorithms were compared in terms of SNR improvement and speech intelligibility.
KEY WORDS
source-separation, adaptive filtration, multi-microphone
systems
1 Introduction
Eavesdropping is one of the most efficient and cheapest ways of obtaining evidence of crimes for police and homeland security investigations. However, recordings obtained by listening-in systems pose many difficulties, most of them caused by random noise or intentionally added disturbances.
One solution to these problems is based on multi-microphone arrays. Speech enhancement with microphone arrays is a well established field with several methods: beamforming [2], superdirective beamforming [1], postfiltering [5] and phase-based filtering [3, 7]. However, all solutions known to the authors focus on random background noise caused by the environment in which the recording takes place. This means they operate on the model
$$\begin{aligned}
s_{m1,\mathrm{in}}(t) &= s_{\mathrm{voice}}(t) + n_1(t), \\
s_{m2,\mathrm{in}}(t) &= s_{\mathrm{voice}}(t-\tau_1) + n_2(t),
\end{aligned} \tag{1}$$
where $s_{\mathrm{voice}}(t)$ is the speech signal and the delay $\tau_1$ is caused by the longer distance to the second microphone. The signals $n_1(t)$ and $n_2(t)$ represent microphone and environmental noise respectively; this can be car engine noise, traffic noise, disturbances caused by wind, or noise of the recording system. An important point is that $n_1(t)$ and $n_2(t)$ are uncorrelated.
Our case is significantly different, because the noise is added intentionally by a conversing human to degrade the quality of the recordings as much as possible. Let us consider a model adapted to such a situation:

$$\begin{aligned}
s_{m1,\mathrm{in}}(t) &= s_{\mathrm{voice}}(t) + s_{\mathrm{dist}}(t) + n_1(t), \\
s_{m2,\mathrm{in}}(t) &= s_{\mathrm{voice}}(t-\tau_1) + s_{\mathrm{dist}}(t-\tau_2) + n_2(t),
\end{aligned} \tag{2}$$

where $s_{\mathrm{dist}}(t)$ is the intentionally added disturbance. The delay $\tau_2$ is not equal to $\tau_1$ because of the different distances between the microphones and the audio signal sources.
Figure 1. Dual-microphone scenario of listening-in to a conversation where a source of a distracting signal, like a radio, was used to hide the content of the conversation. (The figure marks the distances a, b from the talking people and c, d from the distracting source to the two microphones.)
What makes the situation more complicated is that the microphones have to be hidden from the speakers and placed wherever a tapping device could be planted. This differs greatly from scenarios typical of information centres or conference rooms. Several efficient methods, including phase-based filtering, a form of time-frequency masking (PBTFM) [7], require the speaker's position to be known. This is not possible in our scenario, because the speakers are in their homes, cars or jail cells, where they can move around, and the microphones are listening-in devices. In such a case, cross-correlation and adaptive filtration algorithms seem to be a good solution.
As the algorithms we developed are universal, they can also be useful in many other commercial and civil applications, such as noise cancelling in automatic speech recognition or hands-free car systems.
The paper is organised as follows. Section 2 presents details of the recording scenario we consider. Sections 3 and 4 describe the cross-correlation and adaptive filtration algorithms respectively. Section 5 provides the results of the experiments we conducted. The paper is summed up by conclusions.
2 Problem description
The problem of separating a conversation from the audio signal is depicted in Figure 1 [9]. The audio signals are acquired by two hidden microphones. Two speaking persons use a distracting signal, such as music from a radio receiver, to block understanding of the content of their conversation. In order to detect the speech signal from the noised signals recorded by the two microphones, at least the distances $a \neq b$ or $c \neq d$ must hold. The difference between these distances can be relatively small. To verify this, let us assume a sampling frequency of 44 100 Hz. Then the time difference between two samples corresponds to a distance
$$\rho = v\,\Delta t, \tag{3}$$

where $v$ is the sound velocity and $\Delta t$ is the sampling period. For the values $v = 330\ \mathrm{m/s}$ and $\Delta t = 23\ \mu\mathrm{s}$ one obtains $\rho \approx 7.5\ \mathrm{mm}$. For a real-case application, a difference of at least around ten samples between the signals from both microphones is needed. This gives a few centimetres as the necessary difference between the distances. In the very special case where both $a \approx b$ and $c \approx d$ at once, the method would not work. However, this is a very rare scenario in real-world situations.
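For illustration, a minimal Python sketch of this calculation, assuming the values used above ($v = 330$ m/s, a 44.1 kHz sampling rate), is:

```python
# Illustrative calculation of Eq. (3): the distance sound travels in one
# sampling period, and the path-length difference for a ten-sample delay.
fs = 44_100          # sampling frequency [Hz]
v = 330.0            # speed of sound [m/s]

dt = 1.0 / fs        # sampling period (about 23 microseconds)
rho = v * dt         # Eq. (3): distance per sample

print(f"one sample  ~ {rho * 1e3:.1f} mm")       # ~7.5 mm
print(f"ten samples ~ {10 * rho * 1e2:.1f} cm")  # a few centimetres
```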
The algorithms described below utilise these differences between distances. The cross-correlation algorithm additionally uses the difference in frequency bands: higher for the music signal and lower for speech, both audible to the human ear. Figure 2 shows the model spectral densities of human speech and music (a trumpet). It is noticeable that in this case the spectrum of the musical instrument is wider than the spectrum of human speech. Moreover, the maximum of the spectral energy of the music lies at much higher frequencies than the maximum of the spectral density of the human voice.
3 Cross-correlation algorithm
The difference between the spectral densities of speech and music is the basis of the cross-correlation algorithm. A block diagram of this algorithm is depicted in Figure 3. Let us assume that $s_{m1,\mathrm{in}}$ and $s_{m2,\mathrm{in}}$ are the recordings from Microphone 1 and Microphone 2 (see Figure 1) respectively. Ranges $d$ and $c$ are the distances from the disturbance source to the microphones, and ranges $b$ and $a$ are the distances from the talking people to the microphones. Our aim is to find $\tau_2$ from (2), which is the time shift resulting from the difference between $d$ and $c$.
Figure 2. Comparison of the spectral density of speech (light grey) and a trumpet (dark grey) over 0-14 kHz. The band of the filter we used is marked.
We will use the cross-correlation function $c(\tau)$, defined as

$$c(\tau) = \sum_n s_1(n-\tau)\, s_2(n), \tag{4}$$

to find $\tau_2$.
Figure 3. Block diagram of the cross-correlation algorithm of speech extraction: band-pass filters (BPF) on both inputs, a cross-correlation block, a gain $k$, a delay $z^{-\tau_2}$ and a subtraction yielding $s_{\mathrm{voice}}$.
The time delay $\tau_2$ is calculated from the cross-correlation, taking into account the frequency band from 4 to 6.5 kHz, which is higher than the voice frequency band. The major part of the distortion signal energy belongs to this band, which makes it possible to cut off $s_{\mathrm{dist}}$ and extract $s_{\mathrm{voice}}$.
Let us assume $s_1$ and $s_2$ to be the filtered signals from the microphones. The band-pass filters are set as shown in Figure 2. Such settings allow us to cut off the majority of the speech signal. The output signals $s_{m1,\mathrm{out}}$ and $s_{m2,\mathrm{out}}$ then contain mainly the music signal, which allows easier calculation of the maximal value of the cross-correlation (see Figure 4):

$$\tau_2 = \arg\max_{\tau} \left[ \sum_n s_{m1,\mathrm{out}}(n-\tau)\, s_{m2,\mathrm{out}}(n) \right]. \tag{5}$$
Figure 4. Correlation coefficient as a function of the shift (in samples) between the signals; the maximum marks $\tau_2$.
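A minimal Python sketch of the delay estimation in Eq. (5), assuming a SciPy Butterworth band-pass for the 4-6.5 kHz band of Figure 2 and a bounded lag search (the function names, filter order and maximum lag are illustrative choices, not taken from the paper), could look like this:

```python
import numpy as np
from scipy.signal import butter, lfilter

def bandpass(x, fs, lo=4000.0, hi=6500.0, order=4):
    """Band-pass filter keeping mainly the music band of Figure 2."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return lfilter(b, a, x)

def estimate_tau2(s_m1, s_m2, fs, max_lag=200):
    """Estimate tau_2 as the lag maximising c(tau) = sum_n s1(n-tau) s2(n), Eq. (5)."""
    s1 = bandpass(s_m1, fs)
    s2 = bandpass(s_m2, fs)
    lags = np.arange(-max_lag, max_lag + 1)
    core = slice(max_lag, len(s1) - max_lag)   # avoid the circular wrap of np.roll
    c = [np.sum(np.roll(s1, int(lag))[core] * s2[core]) for lag in lags]
    return int(lags[np.argmax(c)])
```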
The delay determined above is used in a delay block $z^{-\tau_2}$ (see Figure 3). As a result we get a signal in which the impact of the distance is compensated. Then the speech signal can be found as

$$s_{\mathrm{voice}}(n) = s_{m1,\mathrm{in}}(n) - k\, s_{m2,\mathrm{in}}(n-\tau_2), \tag{6}$$

where

$$k = \sqrt{\frac{\sum_n \bigl(s_{m1,\mathrm{out}}(n)\bigr)^2}{\sum_n \bigl(s_{m2,\mathrm{out}}(n)\bigr)^2}} \tag{7}$$

is an amplification which compensates for the difference in the power of the music signal coming from the different distances $c$ and $d$.

In some cases $\tau_1$ and $\tau_2$ can be negative. This is why the computations are not conducted in real time. It results in an additional delay of the extracted voice $s_{\mathrm{voice}}$ equal to $\max(|\tau_1|, |\tau_2|)$.
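Continuing the sketch, Eqs. (6)-(7) could be applied as below (again illustrative; `bandpass` and `estimate_tau2` are the hypothetical helpers sketched above, and a circular shift stands in for the delay block):

```python
def extract_voice(s_m1, s_m2, fs):
    """Cross-correlation based extraction, Eqs. (6)-(7)."""
    tau2 = estimate_tau2(s_m1, s_m2, fs)

    # Eq. (7): gain k compensating the power difference of the music
    # signal at the two microphones (computed on the band-pass outputs).
    s1_out = bandpass(s_m1, fs)
    s2_out = bandpass(s_m2, fs)
    k = np.sqrt(np.sum(s1_out ** 2) / np.sum(s2_out ** 2))

    # Delay block z^{-tau2}: a circular shift is used here as an
    # offline stand-in for a proper delay line.
    s_m2_delayed = np.roll(s_m2, tau2)

    # Eq. (6): subtract the scaled, delayed jamming signal.
    return s_m1 - k * s_m2_delayed
```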
4 Adaptive filtration algorithm
There are several practical problems which cannot be handled by the algorithm described above. Probably the most important one is the existence of reverberation and, more generally, the different propagation paths of waves at different frequencies. A filter with a dedicated phase characteristic can be a good solution in this situation. As the location of the talking person is not time-invariant, the filter should adapt to the circumstances. Adaptive filtration seems to fulfil all our requirements.

The same input signals described by (2) are the inputs for the second method we examined. A block diagram of the adaptive filtration algorithm is depicted in Figure 5 [4, 8]. The architecture of the algorithm differs from the previous one: low-pass filters were applied instead of band-pass filters. Since speech is expected as the final signal, filtering the higher band would unnecessarily complicate the filtration. Moreover, when wideband signals were filtered, the algorithm did not manage to find proper filter coefficients.
Figure 5. Block diagram of the adaptive filtration algorithm of speech extraction: low-pass filters (LPF) on both inputs, an LMS adaptive filter on the second channel and a subtraction yielding $s_{\mathrm{voice}}$.
Since an LMS (Least Mean Square) adaptive filter is used, the output $s_{\mathrm{voice}}$ is a minimum-mean-squared-error estimate of the voice signal. Rodriguez [6] expressed the power of the noise in the output as (the discrete time index has been omitted for clarity)

$$\begin{aligned}
E_n &= E[(s_{\mathrm{voice}} - s_{m1,\mathrm{in}})^2] \\
&= E[s_{\mathrm{voice}}^2] - 2E[s_{m1,\mathrm{in}}\, s_{\mathrm{voice}}] + E[s_{m1,\mathrm{in}}^2] \\
&= E[s_{\mathrm{voice}}^2] - 2E[s_{m1,\mathrm{in}}(s_{m1,\mathrm{in}} + s_{m2,\mathrm{in}} - y)] + E[s_{m1,\mathrm{in}}^2] \\
&= E[s_{\mathrm{voice}}^2] - E[s_{m1,\mathrm{in}}^2] - 2E[s_{m1,\mathrm{in}}\, s_{m2,\mathrm{in}}] + 2E[s_{m1,\mathrm{in}}\, y] \\
&= E[s_{\mathrm{voice}}^2] - E[s_{m1,\mathrm{in}}^2],
\end{aligned} \tag{8}$$

where $y$ is the estimate of the primary noise created by the LMS filter. As the signal $s_{m1,\mathrm{in}}$ is unaffected by the adaptive filter, the algorithm sets the coefficients of this filter to minimise the total output power $E[s_{\mathrm{voice}}^2]$.
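A minimal sketch of a sample-by-sample LMS noise canceller in the spirit of Figure 5, assuming low-pass pre-filtering with an illustrative 4 kHz cutoff, a 64-tap filter and a fixed step size (none of these values are specified in the paper), might be:

```python
import numpy as np
from scipy.signal import butter, lfilter

def lowpass(x, fs, cutoff=4000.0, order=4):
    """Low-pass filter keeping mainly the speech band (illustrative cutoff)."""
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return lfilter(b, a, x)

def lms_extract(s_m1, s_m2, fs, taps=64, mu=0.01):
    """Sample-by-sample LMS noise canceller (Figure 5): microphone 2 is the
    noise reference, the error signal is the extracted voice."""
    d = lowpass(s_m1, fs)            # primary input
    x = lowpass(s_m2, fs)            # reference input
    w = np.zeros(taps)               # adaptive filter coefficients
    voice = np.zeros(len(d))
    for n in range(taps, len(d)):
        x_vec = x[n - taps:n][::-1]  # most recent reference samples first
        y = w @ x_vec                # estimate of the disturbance in d[n]
        e = d[n] - y                 # error = extracted voice sample
        w += 2.0 * mu * e * x_vec    # LMS coefficient update
        voice[n] = e
    return voice
```

In practice the step size would normally be normalised by the reference signal power (NLMS) so that the adaptation remains stable for inputs of varying level.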
5 Experiments
The results of the implemented algorithms were examined using recordings containing speech disturbed by music. The recordings were produced in our laboratory using two microphones with cardioid beam patterns. The speaking persons and the disturbance source were located in front of the microphones. We simulated natural conditions, with both the speakers and the disturbance source recorded simultaneously in the same session.
Both the cross-correlation and adaptive filtration algorithms were optimised for the largest increase of the Voice-To-Music Ratio (VMR). To measure the VMR of the given signals, we assumed that the voice signal is unaffected by either algorithm. The signals contain voice disturbed by music. There were segments in which only the music was audible due to a very low VMR (less than -10 dB). Let us assume that $P_{m1,\mathrm{voice}}$ is the average energy of the input signal $s_{m1,\mathrm{in}}$ in a range $(n_1, n_2)$ where the voice is disturbed, computed as

$$P_{m1,\mathrm{voice}} = \frac{1}{n_2 - n_1 + 1} \sum_{n=n_1}^{n_2} s_{m1,\mathrm{in}}^2(n). \tag{9}$$

Then $P_{m1,\mathrm{music}}$ is the average energy of the input signal $s_{m1,\mathrm{in}}$ in a range $(n_1, n_2)$ where only the music is audible. Therefore let us define the VMR of $s_{m1,\mathrm{in}}$ as

$$VMR_{m1} = 10 \log\!\left( \frac{P_{m1,\mathrm{voice}}}{P_{m1,\mathrm{music}}} \right). \tag{10}$$
$VMR_{m2}$ and $VMR_{\mathrm{out}}$ are the VMRs for $s_{m2,\mathrm{in}}$ and the output signal respectively. We then compute the increase $\Delta VMR$ of the VMR as

$$\Delta VMR = VMR_{\mathrm{out}} - \max(VMR_{m1}, VMR_{m2}). \tag{11}$$
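Eqs. (9)-(11) translate directly into code; a small illustrative sketch, assuming NumPy arrays, known sample ranges for the disturbed-voice and music-only segments, and our reading of Eq. (11) as a comparison against the better input channel, is:

```python
import numpy as np

def avg_energy(s, n1, n2):
    """Average energy of s over the inclusive sample range (n1, n2), Eq. (9)."""
    return np.mean(s[n1:n2 + 1] ** 2)

def vmr(s, voice_range, music_range):
    """Voice-To-Music Ratio in dB, Eq. (10)."""
    return 10.0 * np.log10(avg_energy(s, *voice_range) /
                           avg_energy(s, *music_range))

def vmr_increase(s_m1, s_m2, s_out, voice_range, music_range):
    """VMR improvement of the output over the better input channel, Eq. (11)."""
    return vmr(s_out, voice_range, music_range) - max(
        vmr(s_m1, voice_range, music_range),
        vmr(s_m2, voice_range, music_range))
```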
Table 1 compares the results for both described algorithms by presenting the improvement in VMR.

Table 1. The improvement results for the described algorithms.

Algorithm            Increase of VMR [dB]
Cross-correlation    2.0
Adaptive filtering   2.9
6 Conclusion
The presented methods of analysing signals from two microphones were found successful in recovering the conversation. In the cross-correlation algorithm, BPFs with a band that is audible to the human ear but above the speech frequencies proved successful. Using BPFs with the same settings in the adaptive filtration algorithm seems unnecessary or even undesirable, because of problems with finding proper filter coefficients at frequencies above 4-5 kHz.

The adaptive filtration algorithm gave better results in the examined cases. The main disadvantage of this method is that it is much more complex and computationally demanding, which can be important in an end-user implementation. Further investigations will be focused on this second method.

The scenario assumes that there are very few possible locations for the microphones and that they have to be hidden as listening-in devices. The disruptive signal (e.g. music from a radio) is added intentionally by the conversing speakers to hide the speech content, along with noise.
Acknowledgement
This work was supported by MNISW grant OR00001905.
References
[1] J. Bitzer, K. U. Simmer, and K. D. Kammeyer. Theoretical
noise reduction limits of the generalized sidelobe canceller
(GSC) for speech enhancement. Proc. IEEE Int. Confer-
ence on Acoustics, Speech, Signal Processing, 5:2965–2968,
1999.
[2] G. DeMuth. Frequency domain beamforming techniques.
Proc. IEEE Int. Conference on Acoustics, Speech, Signal
Processing, 2:713–715, 1977.
[3] D. Halupka, A. S. Rabi, P. Aarabi, and A. Sheikholeslami.
Low-power dual-microphone speech enhancement using
field programmable gate arrays. IEEE Transactions on Sig-
nal Processing, 55(7):3526–3535, 2007.
[4] S. Haykin. Adaptive Filter Theory. Prentice-Hall, 1986.
[5] C. Marro, Y. Mahieux, and K. U. Simmer. Analysis of
noise reduction and dereverberation techniques based on mi-
crophone arrays with postfiltering. IEEE Trans. Speech, Au-
dio, Signal Process., 6:240–259, 1998.
[6] J. J. Rodriguez, J. S. Lim, and E. Singer. Adaptive noise reduction in aircraft communication systems. Proc. IEEE Int. Conference on Acoustics, Speech, Signal Processing (ICASSP), 12:169-172, 1987.
[7] G. Shi and P. Aarabi. Robust digit recognition using phase-
dependent time-frequency masking. Proc. IEEE Int. Con-
ference on Acoustics, Speech, Signal Processing (ICASSP),
Hong Kong, pages 684–687, 2003.
[8] S. V. Vaseghi. Advanced Digital Signal Processing and Noise
Reduction. John Wiley & Sons, 2006.
[9] M. Ziółko, B. Ziółko, and R. Samborski. Dual-microphone speech extraction from signals with audio background. Proc. IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2009.