2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2009, New Paltz, NY

BLIND ALIGNMENT OF ASYNCHRONOUSLY RECORDED SIGNALS FOR DISTRIBUTED MICROPHONE ARRAY

Nobutaka Ono, Hitoshi Kohno, Nobutaka Ito, and Shigeki Sagayama

Graduate School of Information Science and Technology, The University of Tokyo

7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan

{onono, h-kohno, ito, sagayama}@hil.t.u-tokyo.ac.jp

ABSTRACT

In this paper, aiming to utilize independent recording devices as a distributed microphone array, we present a novel method for aligning recorded signals while localizing microphones and sources. Unlike a conventional microphone array, signals recorded by independent devices have different time origins, and the microphone positions are generally unknown. To estimate both from the recorded signals alone, time differences between channels are detected for each source, which still include the differences of the time origins, and an objective function defined by their square errors is minimized. Simple iterative update rules for this minimization are derived through an auxiliary function approach. The validity of our approach is evaluated by a simulative experiment.

Index Terms— blind alignment, source localization, time delay, cross correlation

1. INTRODUCTION

Microphone array techniques are very useful for localizing sound sources and separating mixtures of sounds, and they have developed greatly over the past several decades. One of the significant factors for performance is the number of microphones. Even with simple delay-and-sum beamforming, many microphones with a large aperture yield acute directivity. In applying independent component analysis, having more microphones than sources makes the problem overdetermined, which is solved more easily than the underdetermined case.

However, in realistic applications, the use of many microphones is not always feasible. In conventional array signal processing, it is assumed that the received signals have the same time origin, because time delays between channels are significant cues for localizing and separating sound sources. To satisfy this condition, in a conventional microphone array system, the microphones have to be connected to multiple A/D converters controlled by a common temporal clock.

Our aim is to organize independent audio recording devices into a wireless, distributed microphone array. The concept of a distributed microphone array has recently been discussed in several papers [1, 2, 3, 4, 5]. In utilizing independent devices, even if the mismatch of their sampling frequencies is negligible, the time origins generally differ considerably. Synchronization among the devices is therefore a significant issue, and it is one of the general problems in sensor networks [6]. In this paper, the problem of estimating the time origins in a blind manner is discussed, meaning that the estimation is performed using only the recorded signals; no other kind of temporal information, such as RF (radio frequency) channels, is used. Our method is based on the consistency between the unknown time origins, the positions of the microphones and sources, and the observed time differences among channels for each source. The problem can be considered an extension of TDOA (Time Difference of Arrival)-based source localization [7] and of simultaneous localization of microphones and sources [8]. An objective function defined by the square errors of the time differences is considered, and the unknown parameters are estimated by minimizing it. Through an auxiliary function approach, simple iterative update rules are derived. We show the feasibility by a simulative experiment.

2. FORMULATION

2.1. Fundamental Equations

Suppose that K sound sources are observed by L microphones. Let s_i = (x_i y_i z_i)^t (1 ≤ i ≤ K) and r_n = (u_n v_n w_n)^t (1 ≤ n ≤ L) be the positions of the sound sources and the microphones, respectively, where t represents transpose. Let t_n be the time origin of the observed signal recorded by the nth microphone. As the first step in formulating the blind alignment problem, we assume in this paper that the clocks of the microphones are accurate; the mismatch of sampling frequencies will be discussed in future work.

In the localization of sources and microphones, one of the most significant cues is the time difference between observed signals. When microphones m and n receive only the ith source in a certain time interval, the time delay of the mth observed signal relative to the nth observed signal can be represented by

τ_imn = (|s_i − r_m|/c − t_m) − (|s_i − r_n|/c − t_n)
      = (|s_i − r_m| − |s_i − r_n|)/c − (t_m − t_n),   (1)

where c is the speed of sound, the first term of eq. (1) represents the difference of arrival times, and the second term represents the difference of the time origins.
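Eq. (1) is easy to evaluate directly as a sanity check. The following sketch (Python with NumPy; the function name and the speed-of-sound constant are our own choices, not from the paper) computes τ_imn from hypothetical positions and time origins:

```python
import numpy as np

C = 340.0  # assumed speed of sound in m/s

def tdoa(s_i, r_m, r_n, t_m, t_n, c=C):
    """Time delay of mic m's recording relative to mic n's for source i,
    following eq. (1): propagation-delay difference minus time-origin difference."""
    s_i, r_m, r_n = map(np.asarray, (s_i, r_m, r_n))
    return (np.linalg.norm(s_i - r_m) - np.linalg.norm(s_i - r_n)) / c - (t_m - t_n)

# A source equidistant from both microphones: the propagation term cancels
# and only the time-origin difference remains.
tau = tdoa([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], 0.25, 0.10)
```

Here `tau` equals −(0.25 − 0.10) = −0.15 s, which illustrates why the time-origin differences contaminate raw delay observations.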

2.2. Necessary Observations

In eq. (1), the τ_imn are obtained from any pairs of observed signals, and L − 1 of them are ideally independent variables for each source. Meanwhile, the unknown variables are t_n, s_i = (x_i y_i z_i)^t, and r_n = (u_n v_n w_n)^t for 1 ≤ i ≤ K and 1 ≤ n ≤ L. Note that all of the unknown variables are relative: an absolute time origin, and the six degrees of freedom of the position variables caused by translation (three degrees) and rotation (three degrees), are not determined in this framework. Then, to determine the unknown variables,

K(L − 1) ≥ (L − 1) + 3(K + L) − 6   (2)


Figure 1: Asynchronously recorded signals

Figure 2: Coarsely-synchronized signals

should be satisfied at least. Rearranging eq. (2), we obtain

(K − 4)(L − 4) ≥ 9   (3)

as a necessary condition. Note that the number of sources K is easily increased when a moving sound source is observed, because it has different positions in different time frames.
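The counting argument of eqs. (2)-(3) can be verified mechanically. In the following sketch (the function name is illustrative), the rearranged condition (3) is checked against the raw count of eq. (2) over a small grid of source and microphone numbers:

```python
def enough_observations(K, L):
    """Counting condition of eq. (2): the K(L-1) independent delay
    observations must cover the (L-1) + 3(K+L) - 6 free unknowns."""
    return K * (L - 1) >= (L - 1) + 3 * (K + L) - 6

# Eq. (3) is an algebraic rearrangement of eq. (2); verify that the two
# conditions agree over a grid of counts.
agree = all(
    enough_observations(K, L) == ((K - 4) * (L - 4) >= 9)
    for K in range(1, 30) for L in range(2, 30)
)
```

For the experimental setup of Section 4 (K = 8 sources, L = 9 microphones), `enough_observations(8, 9)` holds, since 64 observations cover 53 unknowns.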

2.3. Coarse-fine Alignment Scheme

For asynchronously recorded signals, acquiring the information of τ_imn is not straightforward. Even if the source signals are sparse and do not overlap, it is very difficult to find correspondences between the acoustic events generated by the sources on each channel, as shown in Fig. 1. Furthermore, real observations include silent intervals and overlaps of source signals. Therefore, autonomously detecting single-source frames is essential. For that, we propose the following coarse-fine alignment scheme.

1. Select one channel as a reference.

2. Calculate the cross correlation between each channel and the reference using the whole signal length, and obtain the time difference from its maximum.

3. Align each channel using the time difference (as shown in Fig. 2). Note that this is only a coarse alignment, and fine mismatches of the time origins between channels remain.

4. Apply frame analysis and calculate the normalized cross correlation frame by frame.

5. Select as single-source frames only those frames whose maximum normalized cross correlation exceeds a threshold, and obtain the time differences between all channels from those frames. Each single-source frame is handled as a different source.
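Steps 1-3 of the scheme can be sketched as follows. This is a minimal illustration under our own simplifying assumptions: equal-length, synchronously sampled channels, with the offset measured in samples rather than seconds:

```python
import numpy as np

def coarse_offset(x_ref, x_ch):
    """Step 2 of the scheme: whole-signal cross correlation between a channel
    and the reference; the lag of the maximum is the coarse offset in samples."""
    n = len(x_ref)
    corr = np.correlate(x_ch, x_ref, mode="full")   # lags -(n-1) .. (n-1)
    return int(np.argmax(corr)) - (n - 1)

# Two channels observing the same impulsive event with different time origins.
rng = np.random.default_rng(0)
clap = rng.standard_normal(200)
ref = np.zeros(1000); ref[300:500] = clap
ch = np.zeros(1000);  ch[340:540] = clap           # same event, 40 samples later
lag = coarse_offset(ref, ch)
```

Shifting `ch` back by `lag` samples realizes step 3; the remaining fine mismatch is what the iterative method of Section 3 resolves.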

3. DERIVATION OF UPDATE RULES

3.1. Objective Function

In order to find the unknown parameters Θ = {s_i, r_n, t_n | 1 ≤ i ≤ K, 1 ≤ n ≤ L}, the sum of the square errors of eq. (1),

J(Θ) = (1/(c²KL²)) Σ_{i=1}^{K} Σ_{m=1}^{L} Σ_{n=1}^{L} ε_imn²,   (4)

ε_imn = |s_i − r_m| − |s_i − r_n| − c(τ_imn + t_m − t_n)
      = √((x_i − u_m)² + (y_i − v_m)² + (z_i − w_m)²) − √((x_i − u_n)² + (y_i − v_n)² + (z_i − w_n)²) − c(τ_imn + t_m − t_n),   (5)

is considered as the objective function to be minimized.
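A direct, unoptimized evaluation of eqs. (4)-(5) might look as follows (the array layout and names are our own choices; delays generated exactly by eq. (1) should drive the objective to zero):

```python
import numpy as np

C = 340.0  # assumed speed of sound, m/s

def objective(s, r, t, tau, c=C):
    """Eq. (4), with eq. (5) as the residual.
    s: (K,3) source positions, r: (L,3) mic positions,
    t: (L,) time origins, tau: (K,L,L) observed delays tau[i,m,n]."""
    K, L = s.shape[0], r.shape[0]
    d = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)  # d[i,n] = |s_i - r_n|
    eps = d[:, :, None] - d[:, None, :] - c * (tau + t[None, :, None] - t[None, None, :])
    return float(np.sum(eps ** 2) / (c ** 2 * K * L ** 2))

# Delays generated exactly by eq. (1) give a zero objective.
rng = np.random.default_rng(1)
s = rng.uniform(0, 10, (3, 3))
r = rng.uniform(0, 10, (4, 3))
t = rng.uniform(0, 1, 4)
d = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)
tau = (d[:, :, None] - d[:, None, :]) / C - (t[None, :, None] - t[None, None, :])
```

Perturbing any time origin or position away from the generating configuration makes the objective strictly positive, which is what the minimization exploits.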

3.2. Auxiliary Function Approach

The objective function J(Θ) includes the square roots of x_i, y_i, z_i, etc., or equivalently the norms of the vectors s_i − r_m, etc. Minimizing it is a nonlinear optimization problem, and a general optimization method such as gradient descent or a quasi-Newton method could be applied. Here, however, to derive a simple and efficient iterative solution, we apply the auxiliary function approach. The auxiliary function approach is an extension of the well-known EM algorithm and has recently been applied to several kinds of nonlinear optimization problems in the signal processing field [9, 10, 11].

In order to find the parameters θ that minimize an objective function J(θ), an auxiliary function Q(θ, θ̄) satisfying

J(θ) = min_θ̄ Q(θ, θ̄)   (6)

is utilized, where θ̄ are called auxiliary variables. The principle of the auxiliary function method is based on the fact that J(θ) is non-increasing under the updates

θ̄^(l+1) = argmin_θ̄ Q(θ^(l), θ̄),   (7)

θ^(l+1) = argmin_θ Q(θ, θ̄^(l+1)),   (8)

where l is the iteration index. A brief proof is given by the following:

1. Q(θ^(l), θ̄^(l+1)) = J(θ^(l)) from eq. (6) and eq. (7);

2. Q(θ^(l+1), θ̄^(l+1)) ≤ Q(θ^(l), θ̄^(l+1)) from eq. (8);

3. J(θ^(l+1)) ≤ Q(θ^(l+1), θ̄^(l+1)) from eq. (6);

then

J(θ^(l+1)) ≤ J(θ^(l)),   (9)

which guarantees that the objective function is non-increasing. For efficient updates, an auxiliary function Q(θ, θ̄) such that eq. (7) and eq. (8) are given in closed form is desired.
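The scheme of eqs. (6)-(9) can be illustrated on a toy one-dimensional problem unrelated to the alignment task: minimizing J(θ) = (θ − 3)² + |θ| using the standard quadratic majorizer |θ| ≤ θ²/(2|θ̄|) + |θ̄|/2, which touches |θ| at θ = θ̄ (this example is our own, not from the paper):

```python
def J(theta):
    # Toy objective with a non-smooth term: J(theta) = (theta - 3)^2 + |theta|.
    return (theta - 3.0) ** 2 + abs(theta)

def mm_step(theta_bar):
    # Q(theta, theta_bar) = (theta - 3)^2 + theta^2/(2|theta_bar|) + |theta_bar|/2
    # satisfies eq. (6); eq. (8) is the closed-form quadratic minimization:
    # 2(theta - 3) + theta/|theta_bar| = 0  =>  theta = 6|theta_bar|/(2|theta_bar| + 1).
    return 6.0 * abs(theta_bar) / (2.0 * abs(theta_bar) + 1.0)

theta = 10.0                      # arbitrary initial value
values = [J(theta)]
for _ in range(50):
    theta = mm_step(theta)        # eq. (7): theta_bar = theta; eq. (8): update theta
    values.append(J(theta))
# values is non-increasing, and theta approaches the true minimizer 2.5.
```

Each update has a closed form, which is exactly the property desired of eqs. (7)-(8); the same pattern, with two nested auxiliary functions, is used for eq. (4) below.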

3.3. Design of Two-step Auxiliary Functions

We derive auxiliary functions of eq. (4) in two steps. One of the

difficulties to find a minimum of eq. (4) lies on the fact that εimn

consists of two terms with different indexes of m and n. Due to

it, ∂J/∂rm = 0 includes not only the variable of rmbut rn(1 ≤

n ≤ L), which leads to simultaneous equations. For separating

variables, we consider the following lemma.


Lemma 1 For any A_1, A_2, and B, the inequality

(A_1 + A_2 − B)² ≤ 2(A_1 − a_1)² + 2(A_2 − a_2)²   (10)

holds under a_1 + a_2 = B. The equality is satisfied if and only if

a_1 = A_1 − (1/2)(A_1 + A_2 − B),   (11)

a_2 = A_2 − (1/2)(A_1 + A_2 − B).   (12)

Proof: Let f(a_1, a_2) be

f(a_1, a_2) = 2(A_1 − a_1)² + 2(A_2 − a_2)² − (A_1 + A_2 − B)² + 4λ(a_1 + a_2 − B),   (13)

where λ represents a Lagrange multiplier for the constraint a_1 + a_2 = B. Differentiating f(a_1, a_2) with respect to a_1, a_2, and λ, and setting the derivatives to zero, yields

A_1 − a_1 − λ = 0,   (14)

A_2 − a_2 − λ = 0,   (15)

a_1 + a_2 − B = 0.   (16)

By solving these, it is clear that f(a_1, a_2) takes its minimum value at eq. (11) and eq. (12), and that this minimum is equal to zero.
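Lemma 1 can also be spot-checked numerically. The following sketch verifies both the inequality under the constraint and the equality case of eqs. (11)-(12) (the sampling ranges are arbitrary):

```python
import random

def lemma1_bound(A1, A2, B, a1, a2):
    # RHS minus LHS of eq. (10); non-negative whenever a1 + a2 = B.
    return 2 * (A1 - a1) ** 2 + 2 * (A2 - a2) ** 2 - (A1 + A2 - B) ** 2

rng = random.Random(0)
ok = True
for _ in range(1000):
    A1, A2, B = (rng.uniform(-5, 5) for _ in range(3))
    a1 = rng.uniform(-5, 5)
    a2 = B - a1                                    # enforce the constraint
    ok &= lemma1_bound(A1, A2, B, a1, a2) >= -1e-9
    # Equality case of eqs. (11)-(12):
    e = 0.5 * (A1 + A2 - B)
    ok &= abs(lemma1_bound(A1, A2, B, A1 - e, A2 - e)) < 1e-9
```

The bound is just (u + v)² ≤ 2u² + 2v² applied to u = A_1 − a_1 and v = A_2 − a_2, whose sum is fixed to A_1 + A_2 − B by the constraint.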

By letting

A_1 = |s_i − r_m|,   (17)

A_2 = −|s_i − r_n|,   (18)

B = c(τ_imn + t_m − t_n),   (19)

we obtain an auxiliary function:

J_1(Θ, µ) = (2/(c²KL²)) Σ_{i=1}^{K} Σ_{m=1}^{L} Σ_{n=1}^{L} {(|s_i − r_m| − µ^m_imn)² + (|s_i − r_n| − µ^n_imn)²}   (20)

satisfying J(Θ) = min_µ J_1(Θ, µ), where µ = {µ^m_imn, µ^n_imn} are auxiliary variables. J(Θ) = J_1(Θ, µ) if and only if

µ^m_imn = |s_i − r_m| − (1/2)ε_imn,   (21)

µ^n_imn = |s_i − r_n| + (1/2)ε_imn.   (22)

Although the microphone position vectors r_m and r_n (1 ≤ m, n ≤ L) are separated into independent terms in the auxiliary function J_1, it still includes the norms of the vectors s_i − r_m and s_i − r_n, which makes it difficult to solve ∂J_1/∂s_i = 0 and ∂J_1/∂r_m = 0. So we consider applying the auxiliary function approach again, using the following lemma.

Lemma 2 For any vector x, any unit vector e, and any positive scalar a, the inequality

(|x| − a)² ≤ |x − ae|²   (23)

holds. The equality is satisfied if and only if e = x/|x|.

Proof:

|x − ae|² − (|x| − a)²
 = |x|² − 2a x·e + a²|e|² − |x|² + 2a|x| − a²
 = 2a(|x| − x·e)
 = 2a|x|(1 − cos θ)
 ≥ 0,   (24)

where θ is the angle between x and e.
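A numerical spot-check of Lemma 2 (the vector dimension and sampling ranges are arbitrary):

```python
import numpy as np

def lemma2_gap(x, a, e):
    # |x - a e|^2 - (|x| - a)^2, which eq. (24) shows equals 2 a |x| (1 - cos theta).
    return float(np.sum((x - a * e) ** 2) - (np.linalg.norm(x) - a) ** 2)

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x = rng.standard_normal(3)
    e = rng.standard_normal(3)
    e /= np.linalg.norm(e)                          # unit vector
    a = rng.uniform(0.1, 5.0)                       # positive scalar
    ok &= lemma2_gap(x, a, e) >= -1e-9              # inequality (23)
    ok &= abs(lemma2_gap(x, a, x / np.linalg.norm(x))) < 1e-9  # equality at e = x/|x|
```

The lemma replaces a norm by a linear function of the vector, which is what makes the minimizer of J_2 below available in closed form.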

By letting

x = s_i − r_m,   (25)

a = µ^m_imn,   (26)

we have the second auxiliary function:

J_2(Θ, µ, e) = (2/(c²KL²)) Σ_{i=1}^{K} Σ_{m=1}^{L} Σ_{n=1}^{L} {|s_i − r_m − e_im µ^m_imn|² + |s_i − r_n − e_in µ^n_imn|²}   (27)

satisfying J_1(Θ, µ) = min_e J_2(Θ, µ, e), where e = {e_im}. J_1(Θ, µ) = J_2(Θ, µ, e) if and only if

e_im = (s_i − r_m)/|s_i − r_m|,   (28)

e_in = (s_i − r_n)/|s_i − r_n|.   (29)

Note that, unlike with J_1, the r_m minimizing J_2 is obtained in closed form.

3.4. Derivation of Update Rules

Exploiting the two kinds of auxiliary functions, the objective function is monotonically decreased by the following iterative procedure.

Step 1: Update µ to minimize J_1(Θ, µ). Then J_1(Θ, µ) is equal to J(Θ).

Step 2: Update e to minimize J_2(Θ, µ, e). Then

J_2(Θ, µ, e) = J_1(Θ, µ) = J(Θ)   (30)

is satisfied.

Step 3: Update Θ to minimize J_2(Θ, µ, e). Then J(Θ) also decreases, since in general

J_2(Θ, µ, e) ≥ J_1(Θ, µ) ≥ J(Θ).   (31)

Step 4: Return to Step 1.

The update rules for Θ are derived by solving ∂J_2/∂Θ = 0. The whole set of update rules is summarized as follows:

ε_imn ← |s_i − r_m| − |s_i − r_n| − c(τ_imn + t_m − t_n)   (32)

µ^m_imn ← |s_i − r_m| − (1/2)ε_imn   (33)

µ^n_imn ← |s_i − r_n| + (1/2)ε_imn   (34)

e_in ← (s_i − r_n)/|s_i − r_n|   (35)

s_i ← (1/L²) Σ_{m=1}^{L} ( L r_m + e_im Σ_{n=1}^{L} µ^m_imn )   (36)

r_n ← (1/(KL)) Σ_{i=1}^{K} ( L s_i − e_in Σ_{m=1}^{L} µ^n_inm )   (37)

t_n ← t_n + (1/(cKL)) Σ_{i=1}^{K} ( L|s_i − r_n| − Σ_{m=1}^{L} µ^n_inm )   (38)
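The rules (32)-(38) translate directly into array operations. The sketch below uses our own vectorized layout: a single array `mu[i,a,b]` stores µ^a_iab, which covers both eqs. (33) and (34) because ε_inm = −ε_imn for antisymmetric delay observations; the updates are applied in the listed order. It checks that a configuration consistent with the delays is a fixed point, and that the objective decreases from a perturbed start:

```python
import numpy as np

C = 340.0  # assumed speed of sound, m/s

def iterate(s, r, t, tau, c=C):
    """One pass of the update rules (32)-(38).
    s:(K,3) sources, r:(L,3) mics, t:(L,) time origins, tau:(K,L,L) delays."""
    K, L = s.shape[0], r.shape[0]
    d = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)             # d[i,n] = |s_i - r_n|
    eps = d[:, :, None] - d[:, None, :] - c * (tau + t[None, :, None] - t[None, None, :])  # (32)
    mu = d[:, :, None] - 0.5 * eps                                        # (33); (34) is mu[i,n,m]
    e = (s[:, None, :] - r[None, :, :]) / d[:, :, None]                   # (35), unit vectors e[i,n]
    M = mu.sum(axis=2)                                                    # M[i,a] = sum_b mu^a_iab
    s = (L * r[None, :, :] + e * M[:, :, None]).sum(axis=1) / L ** 2      # (36)
    r = (L * s[:, None, :] - e * M[:, :, None]).sum(axis=0) / (K * L)     # (37), with updated s
    t = t + (L * d - M).sum(axis=0) / (c * K * L)                         # (38)
    return s, r, t

def J(s, r, t, tau, c=C):
    # Objective of eq. (4), for monitoring convergence.
    d = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)
    eps = d[:, :, None] - d[:, None, :] - c * (tau + t[None, :, None] - t[None, None, :])
    return float(np.sum(eps ** 2) / (c ** 2 * s.shape[0] * r.shape[0] ** 2))

# Synthetic data: delays generated exactly by eq. (1).
rng = np.random.default_rng(3)
s0 = rng.uniform(0, 10, (8, 3)); r0 = rng.uniform(0, 10, (9, 3)); t0 = rng.uniform(0, 1, 9)
d0 = np.linalg.norm(s0[:, None, :] - r0[None, :, :], axis=2)
tau = (d0[:, :, None] - d0[:, None, :]) / C - (t0[None, :, None] - t0[None, None, :])

# Run from a perturbed start and track the objective.
s, r, t = s0 + 0.1 * rng.standard_normal(s0.shape), r0 + 0.1 * rng.standard_normal(r0.shape), t0.copy()
J_start = J(s, r, t, tau)
for _ in range(200):
    s, r, t = iterate(s, r, t, tau)
J_end = J(s, r, t, tau)
```

Because the estimate is only determined up to a global translation, rotation, and time shift, the iterates need not return to the generating configuration itself; the objective value is the meaningful convergence monitor.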


4. EXPERIMENTAL EVALUATION

In order to confirm the feasibility of simultaneously estimating the positions of microphones and sources and the time origins by minimizing eq. (4), a simulative experiment was performed. A room of 10 × 10 × 10 m³ was assumed, and ideal spherical-wave propagation of the acoustic waves was simulated. To satisfy the necessary condition of eq. (3), the numbers of microphones and sources were set to 9 and 8, respectively, and their positions were determined randomly. As source signals, real recordings of hand claps were used, and the sources did not overlap one another. The sampling frequency was 44100 Hz, and the signal length was 5.0 s. Differences of the time origins of less than 1.0 s were randomly given to the observed signals. In this setup, the number of degrees of freedom of the unknown parameters is (9 − 1) + 3(8 + 9) − 6 = 53.

As discussed in Section 2.3, the coarse alignments were performed first; then frame analysis was applied, single-source frames were detected, and the time differences for each source were obtained, with a frame length of 100 ms. Finally, initial values were given randomly, and the update rules were applied to them iteratively. The number of iterations was 60000. Fig. 3 shows the true, initial, and finally estimated positions of the microphones and sources projected onto the xy plane. Even though the initial positions were far from the true positions and the observed time differences included the differences of the unknown time origins, it is confirmed that the estimation was performed well.

5. CONCLUSION AND FUTURE WORK

In this paper, a blind method for the alignment of signals recorded by independent devices, together with the estimation of microphone and source positions, was presented. Although the derived update rules guarantee a monotonic decrease of the objective function, the local minimum problem theoretically remains. Exploiting amplitude ratios, and not only the time differences between channels, is a possible way to obtain better initial positions of the microphones and sources, which would also facilitate fast convergence. Applying the proposed framework to real recorded signals is ongoing work.

6. REFERENCES

[1] R. Lienhart, I. Kozintsev, S. Wehr, and M. Yeung, “On the

importance of exact synchronization for distributed audio

processing,” Proc. ICASSP, pp. 840-843, 2003.

[2] P. Aarabi, “The fusion of distributed microphone arrays for

sound localization,” EURASIP Journal of Applied Signal

Processing, vol. 2003, no. 4, pp. 338-347, 2003.

[3] A. Brutti, M. Omologo, and P. Svaizer, “Oriented global co-

herence field for the estimation of the head orientation in

smart rooms equipped with distributed microphone arrays,”

Proc. Interspeech, pp. 2337-2340, 2005.

[4] Z. Liu, Z. Zhang, L. He, and P. Chou, “Energy-based sound

source localization and gain normalization for ad hoc micro-

phone arrays,” Proc. ICASSP, pp. 761-764, 2007.

[5] E. Robledo-Arnuncio, T. S. Wada, and B. H. Juang, “On

Dealing with Sampling Rate Mismatches in Blind Source

Separation and Acoustic Echo Cancellation,” Proc. WAS-

PAA, pp. 34-37, 2007.

Figure 3: The estimation results of microphone (upper) and source (lower) positions projected onto the xy plane

[6] S. Ando and N. Ono, “A Bayesian theory of cooperative

calibration and synchronization in sensor networks,” Trans.

Soc. Instru. Control Engineers (SICE), vol. E-S-1, pp.21-26,

2005.

[7] P. Stoica and J. Li, “Source localization from range-difference measurements,” IEEE Signal Processing Mag., vol. 23, no. 6, pp. 63-65, Nov. 2006.

[8] K. Kobayashi, K. Furuya, and A. Kataoka, “A blind source localization by using freely positioned microphones,” Trans. IEICE, vol. J86-A, no. 6, pp. 619-627, 2003 (in Japanese).

[9] D. D. Lee and H. S. Seung, “Algorithms for Non-Negative

Matrix Factorization,” Proc. NIPS, pp. 556–562, 2000.

[10] J. Le Roux, H. Kameoka, N. Ono, A. de Cheveigne, and

S. Sagayama, “Single and Multiple F0 Contour Estimation

Through Parametric Spectrogram Modeling of Speech in

Noisy Environments,” IEEE Trans. ASLP, vol. 15, no. 4,

pp.1135-1145, May., 2007.

[11] N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka, and S.

Sagayama, “Separation of a Monaural Audio Signal into

Harmonic/Percussive Components by Complementary Dif-

fusion on Spectrogram,” Proc. EUSIPCO, Aug., 2008.