All content in this area was uploaded by José Ramón Calvo de Lara on Dec 22, 2015
Evaluation of a New Beam-Search Formant
Tracking Algorithm in Noisy Environments.
Dayana Ribas González 1, José Enrique García Laínez 2, Antonio Miguel
Artiaga 2, Alfonso Ortega Gimenez 2, Eduardo Lleida Solano 2, José Ramón
Calvo de Lara 1
1 Advanced Technologies Application Center (CENATAV), 7a #21812 e/ 218 y 222,
Rpto. Siboney, Playa, C.P. 12200, La Habana, Cuba.
2 Communications Technology Group (GTC), Aragon Institute for Engineering
Research (I3A), University of Zaragoza, Spain
{dribas,jcalvo}@cenatav.co.cu,{jegarlai,amiguel,ortega,lleida}@unizar.es
Abstract. In this work we present an experimental evaluation of a new
beam-search formant tracking algorithm under noisy conditions and compare
its performance with that of three other formant tracking methods. The
proposed algorithm uses the roots of the Linear Predictive Coding (LPC)
polynomial as formant candidates. The best combination of formant
candidates with respect to a defined cost function is selected by a
beam-search algorithm. The cost function uses information from the local
and neighboring frames through trajectory functions, in order to preserve
the dynamics of the formant frequencies. Experiments were carried out on a
subset of the TIMIT database contaminated with various types and levels of
noise. The results show that the beam-search formant tracker behaves
robustly in noisy environments and is clearly more precise than the other
methods compared.
Keywords: formant tracking, beam-search algorithm, noisy environments.
1 Introduction
The resonance frequencies of the vocal tract, known as formants, carry useful
information for identifying the phonetic content and articulatory information
of speech, as well as speaker- and emotion-discriminative information. That is
why formant tracking methods are widely used in automatic speech processing
applications such as speech synthesis, speaker identification, and speech and
emotion recognition. These methods have to deal with the variability of the
number of formants depending on the phoneme, and with the merging and demerging
of neighboring formants over time, which is very common with F2 and F3. This
makes formant tracking a hard task [1].
For decades, a number of works have been dedicated to designing formant
tracking methods. Formant trackers usually consist of two stages: first, the
speech is represented and analyzed to obtain formant frequency candidates;
second, the candidates are selected subject to some constraints. These
constraints relate to the acoustic features of the formant frequencies, the
continuity of the formant trajectories, etc.
One of the most widespread methods of spectral analysis for formant tracking
consists of extracting the roots of the LPC polynomial, which has been shown
to be effective in detecting the peaks of the spectrum [2]. In [3], a
Gammatone filterbank followed by difference-of-Gaussians spectral filtering
was shown to enhance the formant structure. In [4], a method to segment the
spectrum as a tuple of second-order resonators was proposed. The method
produces smooth formant frequencies on a frame-by-frame basis without any
temporal information. However, it has the drawback of not representing well
frames with more than 4 formants.
There has been considerable effort in the speech community on methods for the
formant selection stage. Probabilistic methods for estimating formant
trajectories have been used successfully in recent years. This group includes
methods based on Bayesian filtering, such as Kalman filters [5] and particle
filters [3], and on Hidden Markov Models (HMM) [6]. Earlier algorithms based
on continuity constraints made use of dynamic programming and the Viterbi
algorithm [7][8][9]. However, Viterbi-based algorithms have the limitation
that the cost function of a hypothesis depends only on the current
observation and the last state. In [10] we proposed a beam-search algorithm
for formant tracking that is able to incorporate trajectory information into
the cost function, overcoming this limitation of the Viterbi search. In this
paper we evaluate this algorithm in several noisy environments and compare
its performance with that of three formant tracking methods.
2 The proposal: Beam-Searching Algorithm
The proposed formant detector can be decomposed into two main stages. The first
is the formant frequency candidate extractor, where a set of frequencies and
their bandwidths are chosen as possible formants. The roots of the LPC
polynomial were used as formant candidates [7][9].
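As an illustration of how an LPC root becomes a formant candidate, the sketch
below converts a single complex root into a frequency and a bandwidth using
the standard pole-to-resonance relations. The function name
`root_to_candidate` and the example values are assumptions of this sketch, not
part of the paper; finding the roots themselves requires a polynomial solver
and is omitted here.

```python
import cmath
import math


def root_to_candidate(root, sample_rate):
    """Convert one complex LPC root into a (frequency, bandwidth) pair in Hz.

    Uses the usual relations f = angle * fs / (2*pi) and
    B = -ln|root| * fs / pi. Illustrative helper, not the authors' code.
    """
    frequency = cmath.phase(root) * sample_rate / (2 * math.pi)
    bandwidth = -math.log(abs(root)) * sample_rate / math.pi
    return frequency, bandwidth


# Build a pole for a 500 Hz resonance with an 80 Hz bandwidth at 10 kHz,
# then recover both values from the pole.
fs = 10000
pole = cmath.rect(math.exp(-math.pi * 80 / fs), 2 * math.pi * 500 / fs)
frequency, bandwidth = root_to_candidate(pole, fs)
```

Only roots in the upper half of the unit disc (positive angle, magnitude below
one) yield meaningful candidates under these relations.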
The second stage is a beam-search algorithm that finds the best sequence of
formants given the frequency candidates. The frequency candidates are mapped
to all possible combinations of formants, as proposed in [7][9]. For this
purpose, $h_t = \{F_1, B_1, F_2, B_2, F_3, B_3, F_4, B_4\}$ is a possible
formant tuple at frame $t$, obtained by a mapping from the frequency
candidates and formed by frequency ($F$) and bandwidth ($B$) information. The
algorithm tries to find the best sequence of mappings by applying a cost
function that uses both local and global information. Its main advantage is
that it makes no Markovian assumptions about the problem, i.e., the evaluation
of a hypothesis at a frame takes into account the hypothesis defined over all
previous frames, unlike the Viterbi search [7,9], which only uses information
from the previous state. This feature allows trajectory functions to be
incorporated efficiently into the algorithm to represent the formant
frequency dynamics.
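To make the mapping step concrete, the sketch below enumerates formant tuples
from a list of frequency candidates. It assumes a simplified mapping in which
four candidates are chosen and assigned to F1..F4 in increasing frequency
order; the mapping of [7][9] may be richer (for example, allowing missing
formants), and the names used here are hypothetical.

```python
from itertools import combinations


def enumerate_mappings(candidates, n_formants=4):
    """Return every assignment of candidates to F1..Fn in increasing frequency.

    `candidates` is a list of (frequency_hz, bandwidth_hz) pairs. This is a
    simplified illustration, not the exact mapping of [7][9].
    """
    ordered = sorted(candidates)  # tuples sort by frequency first
    return [tuple(combo) for combo in combinations(ordered, n_formants)]


# Five candidates give C(5, 4) = 5 possible formant tuples.
candidates = [(2500, 120), (500, 60), (3500, 200), (1500, 90), (4200, 300)]
mappings = enumerate_mappings(candidates)
```

With 5 or 6 candidates per frame this yields 5 or 15 tuples, so the number of
mappings per frame stays small under this simplification.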
The set of $M$ active hypotheses at frame $t$ is represented by the group
$P_t = \{p_{t,1}, p_{t,2}, \ldots, p_{t,x}, \ldots, p_{t,M}\}$, where a
hypothesis $p_{t,x}$ is composed of an accumulated cost $acc_{x,t}$ and a
history of mappings $z_t = h_1, h_2, \ldots, h_t$. To obtain the hypothesis
set $P_t$ at frame $t$, the set is propagated through all possible
combinations of formant candidates
$o_t = \{h_{t,1}, \ldots, h_{t,w}, \ldots, h_{t,U}\}$, where $U$ is the total
number of possible frequency mappings at frame $t$. This gives the extended
group $PE_t = \{p_{t,1,1}, \ldots, p_{t,x,w}, \ldots, p_{t,M,U}\}$, and the
accumulated cost of each new hypothesis $p_{t,x,w}$ is:
$$acc_{x,w,t} = acc_{x,t-1} + c(p_{t,x,w}, h_w) \tag{1}$$
The set $PE_t$ is sorted according to the accumulated cost $acc_{x,w,t}$,
producing the new group of $M$ active hypotheses $P_{t+1}$, in which the
hypotheses with the lowest accumulated cost are kept. This process is repeated
for each frame until the end of the stream is detected, and the history of
formants of the best hypothesis is selected as the final result. The search
algorithm is illustrated in Fig. 1; the value of $M$ represents a compromise
between accuracy and execution speed.
Fig. 1: Diagram of beam-search algorithm.
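The search described above can be sketched in a few lines. This is a minimal
illustration rather than the authors' implementation: the names `beam_search`
and `toy_cost` are hypothetical, candidates are plain numbers instead of
formant tuples, and the sketch assumes that lower accumulated cost is better.
The point it shows is that the cost function receives the whole history of a
hypothesis, not only its last state.

```python
import heapq


def beam_search(frames, cost, beam_width):
    """Keep the `beam_width` cheapest hypotheses after every frame.

    frames[t] lists the candidate mappings at frame t; cost(history, h) may
    inspect the full history of a hypothesis (no Markov assumption).
    """
    beam = [(0.0, ())]  # (accumulated cost, history of mappings)
    for candidates in frames:
        extended = [
            (acc + cost(history, h), history + (h,))
            for acc, history in beam
            for h in candidates
        ]
        beam = heapq.nsmallest(beam_width, extended, key=lambda p: p[0])
    return min(beam, key=lambda p: p[0])


def toy_cost(history, h):
    """Toy cost: deviation from a linear extrapolation of the last two values."""
    if not history:
        return abs(h - 500)
    if len(history) == 1:
        return abs(h - history[-1])
    return abs(h - (2 * history[-1] - history[-2]))


frames = [[500, 900], [520, 2000], [540, 100]]
best_cost, best_history = beam_search(frames, toy_cost, beam_width=2)
```

A larger `beam_width` explores more hypotheses per frame at a proportional
increase in the number of cost evaluations.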
2.1 The cost function
The cost function is defined as:
$$c(p_{t,x,w}, h_w) = c_{frequency} + c_{bandwidth} + c_{trajectory} + c_{mapping} \tag{2}$$
It uses both local and global observations to choose the best sequence of
formants. The part of the cost function that uses local information (that
is, the current frame) contains the terms $c_{frequency}$, $c_{bandwidth}$
(defined as in [7]) and $c_{mapping}$:
$$c_{frequency} = \alpha \sum_i \left| (F_i - norm_i) / norm_i \right| \tag{3}$$
$$c_{bandwidth} = \beta \sum_i B_i \tag{4}$$
where $norm_i = 500, 1500, 2500, 3500$ Hz and $i = \{1, \ldots, 4\}$ is the
formant number.
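As a small worked example of Eqs. (3) and (4), the sketch below evaluates the
two terms for one formant tuple. The function names and the default weights
`alpha=1.0` and `beta=1.0` are assumptions made for illustration; the paper
does not report the values of these constants.

```python
NORM_HZ = (500.0, 1500.0, 2500.0, 3500.0)  # norm_i values given in the text


def frequency_cost(freqs, alpha=1.0):
    """Eq. (3): weighted sum of relative deviations from the nominal formants."""
    return alpha * sum(abs((f - n) / n) for f, n in zip(freqs, NORM_HZ))


def bandwidth_cost(bandwidths, beta=1.0):
    """Eq. (4): weighted sum of the formant bandwidths."""
    return beta * sum(bandwidths)


# One formant tuple: F1 is 10% above its nominal value, the rest are nominal.
freqs = (550.0, 1500.0, 2500.0, 3500.0)
bandwidths = (60.0, 90.0, 120.0, 200.0)
local_cost = frequency_cost(freqs) + bandwidth_cost(bandwidths, beta=0.5)
```

Because the frequency term is relative and the bandwidth term is in Hz, the
weights have to balance two quantities of very different magnitude.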
$$c_{mapping_i} = \begin{cases} 0 & \text{if } BWmin_i > THR \\ \dfrac{THR - BWmin_i}{\gamma_i} & \text{if } BWmin_i < THR \end{cases} \tag{5}$$
$$c_{mapping} = \sum_i c_{mapping_i} \tag{6}$$
where $BWmin_i$ is the minimum bandwidth of the frequency candidates that
are discarded and that would be valid for formant $i$ in this mapping;
$\gamma_i$ and $THR$ are constants. This term is intended to penalize a
mapping that discards a frequency peak with a low bandwidth.
The part of the cost function that employs global information assumes that
the frequency of each formant follows a smooth trajectory:
$$c_{trajectory} = \theta \sqrt{\sum_{i,w} \frac{F_{i,w} - \hat{F}_{i,w}}{B_i}} \tag{7}$$
where $w = \{0, \ldots, W-1\}$, $W$ is the order of the trajectory function,
and $\hat{F}_{i,t-w}$ is the estimated value of formant $i$ at frame $t-w$,
assuming that $F_{i,t}, \ldots, F_{i,t-(W-1)}$ is approximated by a known
function; $1/B_i$ is the weighting term of the trajectory, which gives more
importance to frames with lower bandwidth; and $\alpha$, $\beta$ and $\theta$
are constants that weight the terms. In the experiments, linear and quadratic
functions were used, fitted with the least squares method. However, we assume
that there is room for improvement in the modeling of such trajectories.
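The idea behind the trajectory term can be illustrated with a least-squares
line fitted to the last W values of one formant. The sketch below is not the
exact Eq. (7): it sums bandwidth-weighted absolute deviations for a single
formant, uses a per-frame bandwidth, and omits the weight θ. These
simplifications, and the function names, are assumptions of this example.

```python
def linear_fit(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns fitted values."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx if sxx else 0.0
    return [y_mean + slope * (x - x_mean) for x in range(n)]


def trajectory_deviation(freqs, bandwidths):
    """Bandwidth-weighted deviation of one formant track from its linear fit.

    Frames with a narrow bandwidth contribute more, as described in the text.
    Both lists must be non-empty and of equal length.
    """
    fitted = linear_fit(freqs)
    return sum(abs(f - fh) / b for f, fh, b in zip(freqs, fitted, bandwidths))


smooth = trajectory_deviation([500, 520, 540, 560], [80, 80, 80, 80])
spiky = trajectory_deviation([500, 520, 900, 560], [80, 80, 80, 80])
```

A track that follows a straight line costs nothing, while a single outlier
raises the cost of every frame in the window because it bends the fitted line.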
The trajectory term, which makes use of several past frames, justifies the
use of the tree beam-search algorithm in place of the Viterbi decoding
algorithm. One of the main benefits of this trajectory model is that it allows
recovery from observation errors in frames between an obstruent and a vowel,
thanks to the evidence from contiguous frames.
An advantage of this continuity constraint compared with previous works is
that the function does not increase the cost whenever the value of a
frequency changes between two consecutive frames, as is the case in [7][9].
In addition, this global function helps the algorithm correct errors in
difficult frames where the frequency candidates do not give clear evidence,
such as frames between an obstruent and a vowel and frames corrupted by noise.
3 Experiments and results
For comparison purposes, three formant tracking methods were selected:
Mustafa’s proposal [11], Welling and Ney’s algorithm [4] and Wavesurfer’s
method from the Snack toolkit [12]. The performance of the formant tracking
methods was measured through a quantitative evaluation using the VTR-Formant
database [13]. This database contains formant labels for a subset of the
TIMIT corpus that is representative with respect to speaker, gender, dialect
and phonetic context. In these experiments, 420 signals from the VTR database
were processed, and the mean absolute error (MAE) between the formants
estimated by each tracking method and the VTR reference was computed. All
speech material was digitized at 16 bits with a 10000 Hz sampling rate. The
ESPS pitch algorithm from the Snack toolkit was used so that the MAE takes
into account only voiced frames.
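The evaluation metric can be written down compactly. The sketch below computes
the MAE for one formant track, restricted to frames flagged as voiced; the
function name and the per-frame boolean voicing list are assumptions of this
example, standing in for the output of the pitch algorithm.

```python
def mean_absolute_error(estimated, reference, voiced):
    """MAE in Hz between two formant tracks, counting only voiced frames."""
    errors = [abs(e - r) for e, r, v in zip(estimated, reference, voiced) if v]
    if not errors:
        raise ValueError("no voiced frames to evaluate")
    return sum(errors) / len(errors)


# The third frame is unvoiced, so its large error is ignored.
estimated = [500.0, 600.0, 700.0]
reference = [510.0, 590.0, 900.0]
voiced = [True, True, False]
mae_hz = mean_absolute_error(estimated, reference, voiced)
```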
Figure 2 shows the formant estimates obtained for a selected speech signal
from the TIMIT database with the proposed method and the three methods used
for comparison, together with the reference from the VTR database. This
qualitative view of the formant tracks obtained with each method shows the
benefits of our tracking algorithm. Welling and Ney’s algorithm produces
quite accurate formant tracks, but it sometimes performs poorly, mainly in
the tracking of F1. Wavesurfer produces tracks very similar to the reference,
but the proposed method sometimes outperforms it, for example in the tracking
of F3 and F4 between 0.5 and 1 second. Mustafa’s algorithm achieved the worst
performance of all the methods.
Method             F1 (Hz)   F2 (Hz)   F3 (Hz)   F4 (Hz)
LPC-beam-search     18.39     27.96     35.26     69.01
Wavesurfer          29.95     57.66     76.53     76.44
Welling-Ney         37.53     47.33     52.53     67.32
Mustafa             28.11     80.22     82.54     75.63
Table 1: MAE (Hz) for formant estimations obtained with the LPC beam-search
algorithm and with the Wavesurfer, Welling and Ney, and Mustafa algorithms.
Table 1 shows the performance of the four methods on clean speech. The
proposed tracking algorithm gives the lowest error for F1, F2 and F3; only
for F4 is Welling-Ney slightly better (67.32 Hz vs. 69.01 Hz). Overall, the
order of accuracy of the methods is LPC-beam-search, Welling-Ney, Wavesurfer
and finally Mustafa, although for F1 Mustafa outperforms Welling-Ney.
Wavesurfer is also better than Welling-Ney in the tracking of F1, but its
performance decreases for F2 and F3, which are the harder resonances to
follow. The F4 performance is less important because this formant is not
manually labeled in the VTR database.
Fig. 2: Example results of a signal of TIMIT database with all the methods evaluated.
(Panels, top to bottom: Reference, LPC + beam-search, WaveSurfer, Welling-Ney,
Mustafa; axes: Time (seconds) vs. Frequency (kHz).)
Additional tests were carried out in noisy environments. The corrupted speech
signals come from four different noise environments:
– stationary white noise
– pseudostationary street noise, which is a mixture of different noises
– music from the Guns and Roses band, a highly harmonic and non-stationary noise
– babble noise, a special case of non-stationary noise consisting of the
voices of other speakers
All these types of noise were added electronically to the test speech signals
at different SNR levels, from 0 to 20 dB in 5 dB steps.
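Mixing noise at a target SNR amounts to scaling the noise so that the ratio of
average speech power to average noise power matches the requested level. The
sketch below is one common way to do this and is not taken from the paper: the
function names are hypothetical, and it assumes that the SNR is measured over
the whole signal rather than over speech-active segments only.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so that the mixture has the given SNR."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals; the SNR grid follows the text: 0 to 20 dB in 5 dB steps.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy_by_snr = {snr: mix_at_snr(speech, noise, snr) for snr in range(0, 25, 5)}
```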
Figure 3 shows the MAE in the noisy environments evaluated. The behavior of
the methods differs considerably for each type of noise. Stationary white
noise is the most challenging type, as shown by the worst MAE of the formant
trackers in the corresponding plot. On the other hand, for all methods in
street, music and babble noise, F1 behaves quite stably from SNR = 10 dB
upwards, while the slope of the MAE curves for F2 and F3 decreases slightly.
This gives an idea of the robustness of the formant trackers above SNR = 10 dB.
Fig. 3: MAE of formant estimation with LPC-beam search algorithm, Welling-Ney
(WNey) algorithm and Wavesurfer (WS) algorithm vs VTR database in noisy
environments. (Panels: White Noise, Street Noise, Babble Noise, Music Noise;
axes: SNR (dB) vs. MAE (Hz); legend: F1, F2, F3, F1-WS, F2-WS, F3-WS,
F1-WNey, F2-WNey, F3-WNey.)
Figure 3 shows that the proposed method outperforms the other methods in most
of the noisy conditions evaluated. Nevertheless, the Welling-Ney algorithm
obtains the most precise F3 in music and street noise for SNR below 10 dB,
and is also the best method for F2 in street noise for SNR below 10 dB. We
conclude that the Welling-Ney method is more robust to narrow-band noise
(music and street noise) than the methods based on LPC (Wavesurfer and LPC
beam-search). The spectral segmentation performed by the Welling-Ney method,
based on searching for the 4 best spectral regions with dynamic programming,
makes it robust against this kind of noise, unlike the LPC-based methods,
which use 5 or 6 peaks as formant candidates. A narrow-band noise component
is a good candidate to be confused with a formant and selected, because it
frequently has a lower bandwidth than a speech formant.
In white noise, the Welling-Ney and Wavesurfer algorithms are very
inaccurate, with MAE near 200 Hz. For babble noise, however, Welling-Ney
achieved very low errors; in F2 and F3, at low SNR values, it even
outperforms the LPC beam-search method.
Figure 4 shows the formant tracks obtained with three of the methods
evaluated, together with the reference, over a spectrogram of the same speech
signal used in Fig. 2, corrupted by babble noise at SNR = 10 dB. The proposed
method produces smooth formant curves, thanks to the trajectory functions
combined with the beam-search algorithm. The other methods generate curves
with many spikes, which are due to the uncertainty introduced by the noise,
which can mask the spectral features used to detect the formant candidates.
If only weak continuity constraints are incorporated, formant trackers become
very unstable in noisy environments and tend to show fast changes in the
detected formants. This is the case of the Wavesurfer formant tracker. The
Welling-Ney formant tracker, on the other hand, does not include any
continuity constraint, which explains its spiky behavior.
Fig. 4: Example results of a signal of TIMIT database corrupted by babble noise with
SNR = 10dB with all the methods evaluated. (Panels, top to bottom: Reference,
LPC + beam-search, WaveSurfer, Welling-Ney; axes: Time (seconds) vs.
Frequency (kHz).)
4 Conclusions
In this paper we presented an evaluation of the LPC beam-search method in
noisy environments and a comparison with three formant tracking algorithms.
Although the proposed method was not designed with specific noise
compensation techniques, it shows very robust performance for all the types
of noise evaluated. In fact, the results show that in most cases the proposed
LPC beam-search method performs better than the Wavesurfer, Mustafa and
Welling-Ney formant tracking algorithms. Furthermore, a feature that makes
the beam-search algorithm attractive is that it produces smooth formant
trajectories even for corrupted signals, while the other methods are very
spiky in the presence of noise.
5 Acknowledgements
This work has been partially funded by the Spanish national program INNPACTO
IPT-2011-1696-390000.
References
[1] Rose, P., “Forensic Speaker Identification”. Taylor and Francis Forensic Science
Series, ed. J. Robertson, London: Taylor and Francis, 2002.
[2] S. McCandless, “An algorithm for automatic formant extraction using linear
prediction spectra”. IEEE TASSP, vol. ASSP-22, pp. 135-141, 1974.
[3] Claudius Gläser, Martin Heckmann, Frank Joublin and Christian Goerick,
“Combining auditory preprocessing and Bayesian Estimation for Robust Formant
Tracking”. IEEE Trans. Audio Speech Lang. Process. 2010.
[4] Lutz Welling and Hermann Ney, “Formant Estimation for Speech Recognition”.
IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 1, 1998.
[5] Daryush D. Mehta, Daniel Rudoy and Patrick J. Wolfe, “KARMA: Kalman-based
autoregressive moving average modeling and inference for formant and antiformant
tracking”. stat. AP. 2011.
[6] Zaineb Ben Messaoud, Dorra Gargouri, Saida Zribi and Ahmed Ben Hamida,
“Formant Tracking Linear Prediction Model using HMMs for Noisy Speech Processing”.
Int. Journal of Inf. and Comm. Eng., Vol.5, No.4, 2009.
[7] D. Talkin, “Speech formant trajectory estimation using dynamic programming
with modulated transition costs”. JASA, vol. 82, no. S1, p.55, 1987.
[8] Li Deng, Issam Bazzi and Alex Acero, “Tracking Vocal Tract Resonances Using an
Analytical Nonlinear Predictor and a Target-Guided Temporal Constraint”. 2003.
[9] K. Xia and C. Espy-Wilson, “A new strategy of formant tracking based on dynamic
programming”. in Proc. ICSLP, 2000.
[10] José Enrique García Laínez, Dayana Ribas Gonzalez, Antonio Miguel Artiaga,
Eduardo Lleida Solano and José Ramón Calvo De Lara, “Beam-Search Formant
Tracking Algorithm based on Trajectory Functions for Continuous Speech”. To
be published in proceedings of CIARP 2012.
[11] K. Mustafa and I. C. Bruce, “Robust formant tracking for continuous speech with
speaker variability”. IEEE Transactions on Speech and Audio Processing, 2006.
[12] Snack toolkit: http://www.speech.kth.se/wavesurfer
[13] Li Deng, Xiaodong Cui, Robert Pruvenok, Jonathan Huang, Safiyy Momen, Yanyi
Chen and Abeer Alwan, “A Database of Vocal Tract Resonance Trajectories for
Research in Speech Processing”. ICASSP, 2006.