All content in this area was uploaded by José Ramón Calvo de Lara on Dec 22, 2015

Evaluation of a New Beam-Search Formant Tracking Algorithm in Noisy Environments

Dayana Ribas González 1, José Enrique García Laínez 2, Antonio Miguel Artiaga 2, Alfonso Ortega Gimenez 2, Eduardo Lleida Solano 2, José Ramón Calvo de Lara 1

1 Advanced Technologies Application Center (CENATAV), 7a #21812 e/ 218 y 222, Rpto. Siboney, Playa, C.P. 12200, La Habana, Cuba.

2 Communications Technology Group (GTC), Aragon Institute for Engineering Research (I3A), University of Zaragoza, Spain

{dribas,jcalvo}@cenatav.co.cu,{jegarlai,amiguel,ortega,lleida}@unizar.es

Abstract. In this work we present the experimental evaluation of a new beam-search formant tracking algorithm under noisy conditions and compare its performance with that of three other formant tracking methods. The proposed algorithm uses the roots of the Linear Predictive Coding (LPC) polynomial as formant candidates. The best combination of formant candidates with respect to a defined cost function is selected by applying a beam-search algorithm. The cost function uses information from the local and neighboring frames, through trajectory functions, in order to preserve the dynamics of the formant frequencies. Experiments were carried out with a subset of the TIMIT database contaminated with various types and levels of noise. The results show that the beam-search formant tracker behaves robustly in noisy environments and is more precise than the compared methods in most of the conditions evaluated.

Keywords: formant tracking, beam-search algorithm, noisy environments.

1 Introduction

The resonance frequencies of the vocal tract, known as formants, carry useful information for identifying the phonetic content and articulatory information of speech, as well as speaker- and emotion-discriminative information. That is why formant tracking methods are widely used in automatic speech processing applications such as speech synthesis, speaker identification, and speech and emotion recognition. These methods have to deal with the variability in the number of formants depending on the phoneme, and with the merging and demerging of neighboring formants over time, which is very common with F2 and F3. This makes formant tracking a hard task [1].

For decades, a number of works have been dedicated to designing formant tracking methods. Formant trackers usually consist of two stages: first, the

2 Authors Suppressed Due to Excessive Length

speech is represented and analyzed to obtain formant frequency candidates; second, the candidates are selected subject to some constraints. These constraints are related to the acoustic features of the formant frequencies, the continuity of the formant trajectories, etc.

One of the most widespread methods of spectral analysis for formant tracking consists of extracting the roots of the LPC polynomial, which has been shown to be effective in detecting the peaks of the spectrum [2]. In [3], a Gammatone filterbank followed by difference-of-Gaussians spectral filtering was shown to enhance the formant structure. In [4], a method to segment the spectrum as a tuple of second-order resonators was proposed. The method produces smooth formant frequencies on a frame-by-frame basis without any temporal information. However, it has the drawback of not representing well frames with more than 4 formants.

There has been considerable effort in the speech community to propose methods for the formant selection stage. Probabilistic methods for estimating formant trajectories have been used successfully in recent years. This group includes methods based on Bayesian filtering, such as Kalman filters [5] and particle filters [3], and methods based on Hidden Markov Models (HMM) [6]. Earlier algorithms based on continuity constraints made use of dynamic programming and the Viterbi algorithm [7][8][9]. However, Viterbi-based algorithms have the limitation that the cost function of a hypothesis depends only on the current observation and the last state. In [10] we proposed a beam-search algorithm for formant tracking that is able to incorporate trajectory information into the cost function, overcoming this limitation of the Viterbi search. In this paper we evaluate this algorithm in several noisy environments and compare its performance with that of three other formant tracking methods.

2 The proposal: Beam-Searching Algorithm

The proposed formant detector can be decomposed into two main stages. The first is the formant frequency candidate extractor, where a set of frequencies and their bandwidths are chosen as possible formants. The roots of the LPC polynomial were used as formant candidates [7][9].
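As an illustration of the candidate extraction step, each complex root of the LPC polynomial can be converted into a frequency and a bandwidth with the standard pole-to-resonance relations. The following Python sketch assumes that the roots have already been computed by some LPC routine; the function names and the example pole are hypothetical and are not taken from the paper.

```python
import cmath
import math


def pole_to_formant(z, fs):
    """Convert one complex LPC root into (frequency_hz, bandwidth_hz).

    Uses the usual relations F = angle(z) * fs / (2*pi) and
    B = -ln|z| * fs / pi for a pole inside the unit circle.
    """
    freq = abs(cmath.phase(z)) * fs / (2.0 * math.pi)
    bandwidth = -math.log(abs(z)) * fs / math.pi
    return freq, bandwidth


def candidates_from_roots(roots, fs):
    """Keep one root per conjugate pair (positive imaginary part) and
    return the candidates sorted by increasing frequency."""
    cands = [pole_to_formant(z, fs) for z in roots if z.imag > 0]
    return sorted(cands)


# Hypothetical example: a single resonance at 500 Hz with 100 Hz bandwidth,
# sampled at the 10000 Hz rate used in the experiments.
fs = 10000.0
pole = cmath.rect(math.exp(-math.pi * 100.0 / fs), 2.0 * math.pi * 500.0 / fs)
example = candidates_from_roots([pole, pole.conjugate()], fs)
```

Real roots (zero imaginary part) are dropped in this sketch because they do not correspond to a resonance; whether the original implementation filters candidates further is not stated in the text.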

The second stage is a beam-search algorithm for finding the best sequence of formants, given the frequency candidates. Frequency candidates are mapped to all possible combinations of formants, as proposed in [7][9]. For this purpose, $h_t = \{F_1; B_1; F_2; B_2; F_3; B_3; F_4; B_4\}$ is a possible formant tuple at frame $t$, obtained by means of a mapping from frequency candidates and formed by frequency ($F$) and bandwidth ($B$) information. The algorithm tries to find the best sequence of mappings by applying a cost function that uses both local and global information. Its main advantage is that it makes no Markovian assumptions about the problem, i.e., the evaluation of a hypothesis in a frame takes into account the hypotheses defined in all previous frames, unlike the Viterbi search [7][9], which only uses previous-state information. This feature makes it possible to

incorporate trajectory functions efficiently into the algorithm to represent the formant frequency dynamics.
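The mapping from frequency candidates to formant tuples can be sketched as an enumeration of ordered combinations. The Python sketch below is a simplification under a stated assumption: every mapping assigns exactly four of the candidates, in increasing frequency order, to F1 to F4. The mapping rules of [7][9] may differ (for example, by allowing missing formants), and the function name and the candidate values are hypothetical.

```python
from itertools import combinations


def enumerate_mappings(candidates, n_formants=4):
    """Return every way of picking n_formants candidates in increasing
    frequency order. Each candidate is a (freq_hz, bandwidth_hz) pair, and
    each mapping is a tuple ((F1, B1), (F2, B2), (F3, B3), (F4, B4))."""
    ordered = sorted(candidates)
    return list(combinations(ordered, n_formants))


# Hypothetical frame with five candidates: C(5, 4) = 5 possible mappings.
cands = [(480.0, 90.0), (1450.0, 120.0), (2100.0, 400.0),
         (2600.0, 150.0), (3400.0, 200.0)]
mappings = enumerate_mappings(cands)
```

With 5 or 6 candidates per frame the number of mappings U stays small (5 or 15 under this assumption), so the size of the search is governed mainly by the number of active hypotheses.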

The set of $M$ active hypotheses in a frame $t$ is represented by the group $P_t = \{p_{t,1}, p_{t,2}, \ldots, p_{t,x}, \ldots, p_{t,M}\}$, where a hypothesis $p_{t,x}$ is composed of an accumulated cost $acc_{x,t}$ and a history of mappings $z_t = h_1, h_2, \ldots, h_t$. To obtain the hypothesis set $P_t$ in frame $t$, the set is propagated through all possible combinations of formant candidates $o_t = \{h_{t,1}, \ldots, h_{t,w}, \ldots, h_{t,U}\}$, where $U$ is the total number of possible frequency mappings at frame $t$. This gives the extended group $PE_t = \{p_{t,1,1}, \ldots, p_{t,x,w}, \ldots, p_{t,M,U}\}$, and the accumulated cost of each new hypothesis $p_{t,x,w}$ is:

$$acc_{x,w,t} = acc_{x,t-1} + c(p_{t,x,w}, h_w) \qquad (1)$$

The set $PE_t$ is sorted according to the accumulated cost $acc_{x,w,t}$, which produces the new group of $M$ active hypotheses $P_{t+1}$, where the hypotheses with higher accumulated cost are maintained. This process is repeated for each frame until the end of the stream is detected, and the history of formants of the best hypothesis is selected as the final result. This search algorithm is illustrated in Fig. 1; the value of $M$ represents a compromise between accuracy and execution speed.

Fig. 1: Diagram of beam-search algorithm.
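The propagate, sort and prune loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It assumes that a lower accumulated cost is better, since the cost terms read as penalties; the sort would be reversed if the cost were meant as a score. The names `beam_search` and `toy_cost` and the toy data are hypothetical.

```python
import heapq


def beam_search(frames, cost, beam_width):
    """frames: list over time of lists of candidate mappings.
    cost(history, mapping) may inspect the whole history, so no
    Markovian assumption is made.
    Returns (best_history, best_accumulated_cost)."""
    beam = [(0.0, [])]  # each hypothesis: (accumulated cost, history)
    for mappings in frames:
        # Extend every active hypothesis with every mapping of this frame.
        extended = [
            (acc + cost(history, h), history + [h])
            for acc, history in beam
            for h in mappings
        ]
        # Keep only the beam_width cheapest hypotheses (assumption: low is good).
        beam = heapq.nsmallest(beam_width, extended, key=lambda hyp: hyp[0])
    best_cost, best_history = min(beam, key=lambda hyp: hyp[0])
    return best_history, best_cost


def toy_cost(history, h):
    """Toy cost on scalar 'mappings': deviation from a linear extrapolation
    of the last two values, which needs more than the previous state."""
    if len(history) >= 2:
        return abs(h - (2 * history[-1] - history[-2]))
    if len(history) == 1:
        return abs(h - history[-1])
    return 0.0


frames = [[1.0, 5.0], [2.0, 9.0], [3.0, 0.0]]
best_history, best_cost = beam_search(frames, toy_cost, beam_width=2)
```

In this toy run the history [1.0, 2.0, 3.0] wins with an accumulated cost of 1.0. A larger `beam_width` explores more alternatives per frame at a proportional increase in run time, which is the trade-off that M controls.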

2.1 The cost function

The cost function is deﬁned as:

$$c(p_{t,x,w}, h_w) = c_{frequency} + c_{bandwidth} + c_{trajectory} + c_{mapping} \qquad (2)$$

It uses both local and global observations to choose the best sequence of formants. The part of the cost function that uses local information (that is, the current frame) contains the terms $c_{frequency}$ and $c_{bandwidth}$ (defined as in [7]), and $c_{mapping}$:

$$c_{frequency} = \alpha \sum_i \left| \frac{F_i - norm_i}{norm_i} \right| \qquad (3)$$

$$c_{bandwidth} = \beta \sum_i B_i \qquad (4)$$

where $norm_i = 500, 1500, 2500, 3500$ Hz and $i = \{1, \ldots, 4\}$ is the formant number.
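The frequency and bandwidth terms of Eqs. (3) and (4) translate directly into code. In the Python sketch below, the weights `alpha` and `beta` are placeholders, because the paper does not report the values used, and the function names are hypothetical.

```python
# Nominal formant frequencies norm_i in Hz, as given in the text.
NOMINAL_HZ = (500.0, 1500.0, 2500.0, 3500.0)


def frequency_cost(freqs, alpha=1.0):
    """Eq. (3): relative deviation of each formant from its nominal value."""
    return alpha * sum(abs((f - n) / n) for f, n in zip(freqs, NOMINAL_HZ))


def bandwidth_cost(bandwidths, beta=1.0):
    """Eq. (4): sum of the bandwidths of the four selected candidates."""
    return beta * sum(bandwidths)


# Hypothetical mapping: F1 is 50 Hz above its nominal value, the rest match.
c_freq = frequency_cost((550.0, 1500.0, 2500.0, 3500.0))
c_bw = bandwidth_cost((100.0, 120.0, 150.0, 200.0), beta=0.001)
```

Because Eq. (3) is normalized by $norm_i$, a 50 Hz deviation weighs more for F1 than for F4, while Eq. (4) favors mappings built from narrow, well-defined peaks.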

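$$c_{mapping_i} = \begin{cases} 0 & \text{if } BWmin_i > THR \\ \dfrac{THR - BWmin_i}{\gamma_i} & \text{if } BWmin_i < THR \end{cases} \qquad (5)$$

$$c_{mapping} = \sum_i c_{mapping_i} \qquad (6)$$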

where $BWmin_i$ is the minimum bandwidth of the frequency candidates that are discarded and that would be valid for formant $i$ in this mapping; $\gamma_i$ and $THR$ are constants. This mapping term is intended to penalize a mapping that discards a frequency peak with a low bandwidth. The part of the cost function that employs global information assumes that the frequency of each formant follows a smooth trajectory.
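Before turning to the trajectory term, the mapping penalty of Eqs. (5) and (6) can be sketched as follows. The values of `thr` and `gammas` are hypothetical, since the paper only states that they are constants. Using `None` for a formant with no discarded candidate is a convention of this sketch.

```python
def mapping_cost(discarded_min_bw, thr, gammas):
    """Eqs. (5)-(6): penalize mappings that discard narrow peaks.

    discarded_min_bw[i] is BWmin_i, the smallest bandwidth among the
    discarded candidates that would be valid for formant i, or None when
    no such candidate exists (assumed to cost nothing).
    """
    total = 0.0
    for bw_min, gamma in zip(discarded_min_bw, gammas):
        if bw_min is not None and bw_min < thr:
            total += (thr - bw_min) / gamma
    return total


# Hypothetical case: the mapping discards a 100 Hz wide peak that was valid
# for F2 and a 400 Hz wide peak that was valid for F3.
c_map = mapping_cost([None, 100.0, 400.0, None], thr=300.0,
                     gammas=(100.0, 100.0, 100.0, 100.0))
```

Only the narrow discarded peak is penalized here, giving (300 - 100) / 100 = 2.0. The case $BWmin_i = THR$ is not specified in Eq. (5); the sketch returns zero for it, which matches the limit of both branches.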

$$c_{trajectory} = \theta \sqrt{\sum_{i,w} \frac{F_{i,t-w} - \hat{F}_{i,t-w}}{B_i}} \qquad (7)$$

where $w = \{0, \ldots, W-1\}$, $W$ is the order of the trajectory function, and $\hat{F}_{i,t-w}$ is the estimated value of formant $i$ at frame $t-w$, assuming that $F_{i,t}, \ldots, F_{i,t-(W-1)}$ is approximated by a known function; $1/B_i$ is the weighting term of the trajectory, which gives more importance to frames that have lower bandwidth; $\alpha$, $\beta$ and $\theta$ are constants that represent the weights of the terms. In the experiments, linear and quadratic functions were used, fitted with the least squares method. However, we assume that there is room for improvement in the modeling of such trajectories.
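The trajectory term can be illustrated with a least-squares line fitted over the last W frames of each formant. The Python sketch below rests on two assumptions that the text does not confirm: the deviation from the fitted value is squared so that the quantity under the square root stays non-negative, and the bandwidth weight is taken per frame. Only the linear trajectory is shown, and all names and values are hypothetical.

```python
import math


def fit_line(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ...
    Returns the fitted value at every position."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return [mean_y + slope * (x - mean_x) for x in xs]


def trajectory_cost(freq_tracks, bw_tracks, theta=1.0):
    """freq_tracks[i][w] and bw_tracks[i][w] hold the frequency and the
    bandwidth of formant i over the last W frames of one hypothesis."""
    total = 0.0
    for freqs, bws in zip(freq_tracks, bw_tracks):
        fitted = fit_line(freqs)
        # Assumption: squared deviation, weighted by 1/B of the same frame.
        total += sum((f - fh) ** 2 / b for f, fh, b in zip(freqs, fitted, bws))
    return theta * math.sqrt(total)


bws = [[100.0, 100.0, 100.0, 100.0]]
smooth = trajectory_cost([[500.0, 510.0, 520.0, 530.0]], bws)
spiky = trajectory_cost([[500.0, 510.0, 700.0, 530.0]], bws)
```

A track that rises steadily costs nothing, whatever its slope, while a single outlier frame raises the cost. This is the sense in which the term penalizes departures from a smooth trajectory rather than changes between consecutive frames.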

The trajectory term, which uses several past frames, justifies the use of the tree beam-search algorithm in place of the Viterbi decoding algorithm. One of the main benefits of this trajectory model is that it allows recovery from observation errors in frames between obstruent and vowel, thanks to the evidence from contiguous frames.

An advantage of this continuity constraint compared with previous works is that this function does not increase the cost when the value changes between two consecutive frequencies, as is the case in [7][9]. In addition, this global function helps the algorithm to correct errors in difficult frames where the frequency candidates do not give clear evidence. This group includes frames between obstruent and vowel, and frames corrupted by noise.

3 Experiments and results

For comparison purposes, three formant tracking methods were selected: Mustafa’s proposal [11], Welling and Ney’s algorithm [4], and Wavesurfer’s method from the Snack toolkit [12]. The performance of the formant tracking methods was measured through a quantitative evaluation using the VTR-Formant database [13]. This database contains the formant labels of a subset of the TIMIT corpus that is representative with respect to speaker, gender, dialect and phonetic context. In these experiments, 420 signals from the VTR database were processed, and the mean absolute error (MAE) between the formants estimated by each formant tracking method and the VTR database labels was computed. All speech material used was digitized at 16 bits, at a 10000 Hz sampling rate. The ESPS pitch algorithm from the Snack toolkit was used so that the MAE takes into account only voiced frames.
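The evaluation metric can be sketched as a mean absolute error restricted to voiced frames. In the Python sketch below, the voicing decisions are assumed to be available as a list of booleans, one per frame (in the paper they come from the ESPS pitch algorithm); the function name and the numbers are hypothetical.

```python
def mae_voiced(estimated, reference, voiced):
    """Mean absolute error in Hz between estimated and reference formant
    tracks, computed only over frames flagged as voiced."""
    errors = [abs(e - r) for e, r, v in zip(estimated, reference, voiced) if v]
    if not errors:
        raise ValueError("no voiced frames to evaluate")
    return sum(errors) / len(errors)


# Hypothetical F1 track: the third frame is unvoiced and is ignored.
est = [510.0, 480.0, 900.0, 505.0]
ref = [500.0, 500.0, 500.0, 500.0]
voiced = [True, True, False, True]
mae_f1 = mae_voiced(est, ref, voiced)
```

Here the large error in the unvoiced frame does not count, so the result is (10 + 20 + 5) / 3 Hz. One such value per formant and per method gives the layout of a table such as Table 1.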

Figure 2 shows the formant estimates obtained for a selected speech signal of the TIMIT database with the proposed method and the three methods used for comparison, together with the reference from the VTR database. This qualitative view of the formant tracks obtained with each method shows the benefits of our tracking algorithm. In the figure it can be seen that Welling and Ney’s algorithm achieves quite accurate formant tracks, although it sometimes performs poorly, mainly in the tracking of F1. Wavesurfer obtained tracks very similar to the reference; however, the proposed method sometimes outperforms it, for example in the tracking of F3 and F4 between 0.5 and 1 second. Mustafa’s algorithm achieved the worst performance of all the methods used.

Methods            F1 (Hz)   F2 (Hz)   F3 (Hz)   F4 (Hz)
LPC-beam-search     18.39     27.96     35.26     69.01
Wavesurfer          29.95     57.66     76.53     76.44
Welling-Ney         37.53     47.33     52.53     67.32
Mustafa             28.11     80.22     82.54     75.63

Table 1: MAE (Hz) of the formant estimates obtained with the LPC beam-search algorithm, Wavesurfer, Welling and Ney’s algorithm, and Mustafa’s algorithm.

Table 1 shows the performance of the four methods evaluated on clean speech. The proposed tracking algorithm outperforms the other formant extractors in most cases. The overall order of accuracy of the methods evaluated is LPC-beam-search, Welling-Ney, Wavesurfer and finally Mustafa; in F1, however, Mustafa outperforms Welling-Ney. Wavesurfer is better than Welling-Ney in the tracking of F1, but its performance decreases for F2 and F3, which are the harder resonances to follow. The F4 performance is less important because this formant is not manually labeled in the VTR database.

[Figure 2: five stacked panels (Reference, LPC + beam-search, WaveSurfer, Welling-Ney, Mustafa), each plotting Frequency (kHz) from 0 to 4 against Time (seconds) from 0.5 to 4.5.]

Fig. 2: Example results for a signal of the TIMIT database with all the methods evaluated.

Additional tests were carried out in noisy environments. The corrupted speech signals come from four different noise environments:

– stationary white noise
– pseudostationary street noise, which is a mixture of different noises
– music from the band Guns N' Roses, a highly harmonic and non-stationary noise
– babble noise, a special case of non-stationary noise consisting of the voices of other speakers

All these types of noise were added electronically to the test speech signals at different SNR levels, from 0 to 20 dB in 5 dB steps.
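Adding noise at a prescribed SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target. The Python sketch below shows one common way of doing this, measuring power over the whole signal. The paper does not describe the exact procedure used (for example, whether power is measured only over speech-active segments), so this is an assumption, and the signals are synthetic placeholders.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech + gain * noise, with gain chosen so that the
    speech-to-noise power ratio equals snr_db (powers over the full signal)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB of a noisy signal, given the clean speech it contains."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_resid = sum((y - s) ** 2 for s, y in zip(speech, noisy)) / len(speech)
    return 10.0 * math.log10(p_speech / p_resid)


# Synthetic placeholders standing in for a speech file and a noise recording.
speech = [math.sin(0.1 * k) for k in range(1000)]
noise = [((k * 37) % 11 - 5) / 5.0 for k in range(1200)]
snrs = [measured_snr_db(speech, mix_at_snr(speech, noise, snr))
        for snr in (0, 5, 10, 15, 20)]
```

The list `snrs` reproduces the five target levels of the experiments, from 0 to 20 dB in 5 dB steps, up to floating-point error.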

Figure 3 shows the MAE in the noisy environments evaluated. The behavior of the methods differs considerably from one type of noise to another. Stationary white noise is the most challenging type of noise, as shown by the highest MAE values of the formant trackers in the corresponding plot. On the other hand, for all methods in street, music and babble noise, F1 is quite stable from SNR = 10 dB upwards, while the MAE curves of F2 and F3 show a slight decrease in slope. This gives an idea of the robustness of the formant trackers above SNR = 10 dB.

[Figure 3: four panels (White Noise, Street Noise, Babble Noise, Music Noise), each plotting MAE (Hz) from 0 to 200 against SNR (dB) from 0 to 20. Legend: F1, F2, F3; F1-WS, F2-WS, F3-WS; F1-WNey, F2-WNey, F3-WNey.]

Fig. 3: MAE of formant estimation with the LPC beam-search algorithm, the Welling-Ney (WNey) algorithm and the Wavesurfer (WS) algorithm vs. the VTR database in noisy environments.

Figure 3 shows that the proposed method outperforms the other methods in most of the noisy conditions evaluated. Nevertheless, the Welling-Ney algorithm obtains the most precise F3 in music and street noise for SNR below 10 dB, and is also the best method in F2 for street noise at SNR below 10 dB. This suggests that the Welling-Ney method is more robust to narrow-band noise (music and street noise) than the methods based on LPC (Wavesurfer and LPC beam-search). The spectral segmentation performed in the Welling-Ney method, based on searching for the 4 best spectral regions with dynamic programming, makes this method robust against this kind of noise, unlike LPC-based methods, which use 5 or 6 peaks as formant candidates. A narrow-band noise is a good candidate to be confused with a formant and to be selected, because it frequently has a lower bandwidth than a speech formant.

In white noise, the Welling-Ney and Wavesurfer algorithms perform very inaccurately, with MAE near 200 Hz. For babble noise, however, Welling-Ney achieved very low errors, and in F2 and F3 at low SNR values it even outperforms the LPC beam-search method.

Figure 4 shows the formant tracks obtained with three of the methods evaluated, together with the reference, over a spectrogram of the same speech signal used in Fig. 2, corrupted by babble noise with SNR = 10 dB. Notice that the proposed method achieves smooth formant curves, thanks to the trajectory functions combined with the beam-search algorithm. The other methods generate curves with many spikes, which are due to the uncertainty introduced by the noise, which can mask the spectral features used for detecting the formant candidates. Thus, if only weak continuity constraints are incorporated, formant trackers become very unstable in noisy environments and tend to show fast changes in the detected formants. This is the case of the Wavesurfer formant tracker. The Welling-Ney formant tracker, on the other hand, does not include any continuity constraint, which explains its spiky behavior.

[Figure 4: four stacked panels (Reference, LPC + beam-search, WaveSurfer, Welling-Ney), each plotting Frequency (kHz) from 0 to 5 against Time (seconds) from 0.5 to 4.5.]

Fig. 4: Example results for a signal of the TIMIT database corrupted by babble noise with SNR = 10 dB, with all the methods evaluated.

4 Conclusions

In this paper we presented an evaluation of the LPC beam-search method in noisy environments and a comparison with three formant tracking algorithms. Although the proposed method was not designed with specific noise compensation techniques, it shows a very robust performance for all the types of noise evaluated. In fact, the results show that in most cases the proposed LPC beam-search method performs better than the Wavesurfer, Mustafa and Welling-Ney formant tracking algorithms. Furthermore, a feature that makes the beam-search algorithm attractive is that it produces smooth formant trajectories even for corrupted signals, while the other methods are very spiky in the presence of noise.

5 Acknowledgements

This work has been partially funded by the Spanish national program INNPACTO IPT-2011-1696-390000.

References

[1] P. Rose, “Forensic Speaker Identification”. Taylor and Francis Forensic Science Series, ed. J. Robertson, London: Taylor and Francis, 2002.

[2] S. McCandless, “An algorithm for automatic formant extraction using linear prediction spectra”. IEEE TASSP, vol. ASSP-22, pp. 135-141, 1974.

[3] Claudius Gläser, Martin Heckmann, Frank Joublin and Christian Goerick, “Combining auditory preprocessing and Bayesian Estimation for Robust Formant Tracking”. IEEE Trans. Audio Speech Lang. Process., 2010.

[4] Lutz Welling and Hermann Ney, “Formant Estimation for Speech Recognition”. IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 1, 1998.

[5] Daryush D. Mehta, Daniel Rudoy and Patrick J. Wolfe, “KARMA: Kalman-based autoregressive moving average modeling and inference for formant and antiformant tracking”. stat. AP. 2011.

[6] Zaineb Ben Messaoud, Dorra Gargouri, Saida Zribi and Ahmed Ben Hamida, “Formant Tracking Linear Prediction Model using HMMs for Noisy Speech Processing”. Int. Journal of Inf. and Comm. Eng., Vol. 5, No. 4, 2009.

[7] D. Talkin, “Speech formant trajectory estimation using dynamic programming with modulated transition costs”. JASA, vol. 82, no. S1, p. 55, 1987.

[8] Li Deng, Issam Bazzi and Alex Acero, “Tracking Vocal Tract Resonances Using an Analytical Nonlinear Predictor and a Target-Guided Temporal Constraint”. 2003.

[9] K. Xia and C. Espy-Wilson, “A new strategy of formant tracking based on dynamic programming”. In Proc. ICSLP, 2000.

[10] José Enrique García Laínez, Dayana Ribas Gonzalez, Antonio Miguel Artiaga, Eduardo Lleida Solano and José Ramón Calvo De Lara, “Beam-Search Formant Tracking Algorithm based on Trajectory Functions for Continuous Speech”. To be published in Proceedings of CIARP 2012.

[11] K. Mustafa and I. C. Bruce, “Robust formant tracking for continuous speech with speaker variability”. IEEE Transactions on Speech and Audio Processing, 2006.

[12] Snack toolkit: http://www.speech.kth.se/wavesurfer

[13] Li Deng, Xiaodong Cui, Robert Pruvenok, Jonathan Huang, Safiyy Momen, Yanyi Chen and Abeer Alwan, “A Database of Vocal Tract Resonance Trajectories for Research in Speech Processing”. ICASSP, 2006.