Evaluation of a New Beam-Search Formant
Tracking Algorithm in Noisy Environments.
Dayana Ribas González 1, José Enrique García Laínez 2, Antonio Miguel
Artiaga 2, Alfonso Ortega Gimenez 2, Eduardo Lleida Solano 2, José Ramón
Calvo de Lara 1
1 Advanced Technologies Application Center (CENATAV), 7a #21812 e/ 218 y 222,
Rpto. Siboney, Playa, C.P. 12200, La Habana, Cuba.
2 Communications Technology Group (GTC), Aragon Institute for Engineering
Research (I3A), University of Zaragoza, Spain
Abstract. In this work we present an experimental evaluation of a new
beam-search formant tracking algorithm under noisy conditions and compare
its performance with three other formant tracking methods. The proposed
algorithm uses the roots of the Linear Predictive Coding (LPC) polynomial
as formant candidates. The best combination of formant candidates with
respect to a defined cost function is selected by a beam-search algorithm.
The cost function uses information from the local frame and from
neighboring frames, through trajectory functions, in order to preserve the
dynamics of the formant frequencies. Experiments were carried out with a
subset of the TIMIT database, contaminated with various types and levels
of noise. The results show that the beam-search formant tracker behaves
robustly in noisy environments and is clearly more precise than the other
methods compared.
Keywords: formant tracking, beam-search algorithm, noisy environments
1 Introduction
The resonance frequencies of the vocal tract, known as formants, carry useful
information about the phonetic content and articulation of speech, as well as
information that discriminates speakers and emotions. That is why formant
tracking methods are widely used in automatic speech processing applications
such as speech synthesis, speaker identification, and speech and emotion
recognition. These methods have to deal with the fact that the number of
formants varies with the phoneme, and with the merging and demerging of
neighboring formants over time, which is very common for F2 and F3. This
makes formant tracking a hard task [1].
For decades, a number of works have been dedicated to designing formant
tracking methods. Formant trackers usually consist of two stages: first, the
speech is represented and analyzed to obtain formant frequency candidates;
second, candidates are selected subject to some constraints. Those
constraints are related to the acoustical features of the formant
frequencies, the continuity of the formant trajectories, etc.
One of the most widespread methods of spectral analysis for formant tracking
consists of extracting the roots of the LPC polynomial, which has been shown
to be effective in detecting the peaks of the spectrum [2]. In [3], a
Gammatone filterbank followed by difference-of-Gaussians spectral filtering
was shown to enhance the formant structure. In [4], a method to segment the
spectrum as a tuple of order-2 resonators was proposed. The method produces
smooth formant frequencies on a frame-by-frame basis without any temporal
information. However, it has the drawback of not representing well frames
with more than 4 formants.
There has been considerable effort in the speech community to propose
methods for the formant selection stage. Probabilistic methods for estimating
formant trajectories have been used successfully in recent years. This group
includes methods based on Bayesian filtering, such as Kalman filters [5] and
particle filters [3], and on Hidden Markov Models (HMM) [6]. Earlier
algorithms based on continuity constraints made use of dynamic programming
and the Viterbi algorithm [7][8][9]. However, Viterbi-based algorithms have
the limitation that the cost function of a hypothesis depends only on the
current observation and the last state. In [10] we proposed a beam-search
algorithm for formant tracking that is able to incorporate trajectory
information into the cost function, overcoming this limitation of the
Viterbi search. In this paper we evaluate this algorithm in several noisy
environments and compare its performance with three other formant tracking
methods.
2 The proposal: Beam-Searching Algorithm
The proposed formant detector can be decomposed into two main stages. The
first is the formant frequency candidate extractor, where a set of
frequencies and their bandwidths are chosen as possible formants. The roots
of the LPC polynomial are used as formant candidates [7][9].
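As an illustration of the candidate extraction stage, the following minimal
sketch converts the roots of an LPC polynomial into frequency and bandwidth
candidates using the standard pole-angle and pole-radius relations. It is not
the implementation evaluated in this paper: the function name
lpc_roots_to_candidates is hypothetical, and the LPC coefficients are assumed
to have been computed beforehand.

```python
import numpy as np

def lpc_roots_to_candidates(a, fs):
    """Convert LPC polynomial coefficients into (frequency, bandwidth) pairs in Hz.

    a  : coefficients [1, a1, ..., ap] of A(z) = 1 + a1 z^-1 + ... + ap z^-p
    fs : sampling rate in Hz
    """
    roots = np.roots(a)
    # Keep one root of each complex-conjugate pair; real roots are dropped
    # because they do not describe a resonance.
    roots = roots[np.imag(roots) > 1e-9]
    freqs = np.angle(roots) * fs / (2.0 * np.pi)   # pole angle  -> frequency
    bws = -np.log(np.abs(roots)) * fs / np.pi      # pole radius -> bandwidth
    order = np.argsort(freqs)
    return [(float(freqs[k]), float(bws[k])) for k in order]
```

Each returned pair is one formant candidate; candidates with very large
bandwidths would typically be discarded before the mapping stage, although
any such threshold is a design choice not specified here.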
The second stage is a beam-search algorithm for finding the best sequence
of formants, given the frequency candidates. As proposed in [7][9], the
frequency candidates are mapped to all possible combinations of formants.
For this purpose, $h_t = \{F_1, B_1, F_2, B_2, F_3, B_3, F_4, B_4\}$ is a
possible formant tuple at frame $t$, obtained by means of a mapping from
frequency candidates and formed by frequency ($F$) and bandwidth ($B$)
information. The algorithm tries to find the best sequence of mappings by
applying a cost function that uses both local and global information. Its
main advantage is that it makes no Markovian assumptions about the problem,
i.e., the evaluation of a hypothesis in a frame takes into account the
hypotheses defined in all previous frames, unlike the Viterbi search [7][9],
which only uses the previous state. This feature makes it possible to
incorporate trajectory functions efficiently into the algorithm in order to
represent the formant frequency dynamics.
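The mapping step can be sketched as follows, under the simplifying assumption
that every formant receives a distinct candidate and that formants are
assigned in ascending frequency order. The mapping schemes of [7][9] may
differ, for example in how they treat a formant with no candidate, so this is
only an illustration; enumerate_mappings is a hypothetical name.

```python
from itertools import combinations

def enumerate_mappings(candidates, n_formants=4):
    """List every ascending assignment of candidates to F1..Fn.

    candidates : list of (frequency_hz, bandwidth_hz) pairs for one frame
    Returns a list of tuples, each holding n_formants (F, B) pairs.
    """
    ordered = sorted(candidates)            # sort by frequency
    return list(combinations(ordered, n_formants))
```

With 5 candidates in a frame this gives 5 possible tuples, and with 6
candidates it gives 15, which is the set the search has to score per
hypothesis in that frame.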
The set of $M$ active hypotheses in a frame $t$ is represented by the group
$P_t = \{p_{t,1}, p_{t,2}, \ldots, p_{t,M}\}$, where a hypothesis $p_{t,x}$
is composed of an accumulated cost $acc_{x,t}$ and a history of mappings
$z_t = h_1, \ldots, h_t$. To obtain the hypothesis set $P_t$ in frame $t$,
the set is propagated through all possible combinations of formant
candidates $o_t = \{h_{t,1}, \ldots, h_{t,w}, \ldots, h_{t,U}\}$, where $U$
is the total number of possible frequency mappings at frame $t$. This gives
the extended group $PE_t = \{p_{t,1,1}, \ldots, p_{t,x,w}, \ldots,
p_{t,M,U}\}$, and the accumulated cost of each new hypothesis $p_{t,x,w}$ is:
$acc_{x,w,t} = acc_{x,t-1} + c(p_{t,x,w}, h_w)$ (1)
The set $PE_t$ is sorted according to the accumulated cost $acc_{x,w,t}$,
and it produces the new group of $M$ active hypotheses $P_{t+1}$, in which
the hypotheses with the lowest accumulated cost are retained. This process is
repeated for each frame until the end of the stream is detected, and the
history of formants of the best hypothesis is selected as the final result.
This search algorithm is illustrated in Fig. 1, where the value of $M$
represents a compromise between accuracy and execution speed.
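The search loop described above can be sketched in a few lines. This is a
simplified illustration rather than the implementation used in the
experiments: the names beam_search, frames and cost are hypothetical, and the
sketch assumes that a lower accumulated cost is better, since every term of
the cost function acts as a penalty.

```python
def beam_search(frames, cost, beam_width):
    """Track the best sequence of mappings with a beam of fixed width.

    frames     : list over time; frames[t] is the list of candidate mappings
    cost       : function(history, mapping) -> float, where history is the
                 tuple of mappings already chosen by the hypothesis
    beam_width : number M of hypotheses kept alive after every frame
    Returns (accumulated_cost, history) of the best hypothesis.
    """
    beam = [(0.0, ())]                      # (accumulated cost, history)
    for mappings in frames:
        if not mappings:                    # no candidates: keep the beam as is
            continue
        # Extend every active hypothesis with every mapping of this frame.
        extended = [
            (acc + cost(history, h), history + (h,))
            for acc, history in beam
            for h in mappings
        ]
        # Keep the M cheapest hypotheses (assumes lower cost is better).
        extended.sort(key=lambda hyp: hyp[0])
        beam = extended[:beam_width]
    return min(beam, key=lambda hyp: hyp[0])
```

Because the cost function receives the whole history of a hypothesis, it can
evaluate a trajectory over several past frames, which a Viterbi recursion
that only sees the previous state cannot do.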
Fig. 1: Diagram of beam-search algorithm.
2.1 The cost function
The cost function is defined as:
$c(p_{t,x,w}, h_w) = c_{\text{frequency}} + c_{\text{bandwidth}} + c_{\text{trajectory}} + c_{\text{mapping}}$ (2)
It uses both local and global observations for choosing the best sequence of
formants. The part of the cost function that makes use of local information
(that is, the current frame) contains the terms $c_{\text{frequency}}$ and
$c_{\text{bandwidth}}$ (defined as in [7]) and $c_{\text{mapping}}$:
$c_{\text{frequency}} = \alpha \sum_i \frac{|F_i - norm_i|}{norm_i}$ (3)
$c_{\text{bandwidth}} = \beta \sum_i B_i$ (4)
where $norm_i = 500, 1500, 2500, 3500$ and $i = \{1, \ldots, 4\}$ is the
formant number.
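$c_{\text{mapping}_i} = \begin{cases} 0 & \text{if } BWmin_i > THR \\ \frac{THR - BWmin_i}{\gamma_i} & \text{if } BWmin_i < THR \end{cases}$ (5)
$c_{\text{mapping}} = \sum_i c_{\text{mapping}_i}$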
where $BWmin_i$ is the minimum bandwidth of the frequency candidates that
are discarded and that would be valid for formant $i$ in this mapping;
$\gamma_i$ and $THR$ are constants. This term penalizes a mapping that
discards a frequency peak with a low bandwidth. The part of the cost function
that employs global information assumes that the frequency of each formant
follows a smooth trajectory.
$c_{\text{trajectory}} = \theta \sqrt{\sum_i \sum_{w=0}^{W-1} \frac{1}{B_i} \left(F_{i,w} - \hat{F}_{i,w}\right)^2}$
where $w = \{0, \ldots, W-1\}$, $W$ is the order of the trajectory function,
and $\hat{F}_{i,t-w}$ is the estimated value of formant $i$ at frame $t-w$,
assuming that $F_{i,t}, \ldots, F_{i,t-(W-1)}$ is approximated by a known
function; $1/B_i$ is the weighting term of the trajectory, which gives more
importance to frames that have lower bandwidth; $\alpha$, $\beta$ and
$\theta$ are constants that weight the terms. In the experiments, linear and
quadratic functions were used, fitted with the least squares method.
However, we assume that there is room for improvement in the modeling of
such trajectories.
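A possible reading of the trajectory term for a single formant is sketched
below: the last $W$ frequency values are fitted with a least-squares
polynomial (degree 1 or 2), and the deviations from the fit are weighted by
the inverse bandwidth. The exact combination of the residuals (here a
weighted root of the sum of squares) and the name trajectory_cost are
assumptions made for illustration only.

```python
import numpy as np

def trajectory_cost(freqs, bandwidths, degree=1, theta=1.0):
    """Bandwidth-weighted deviation of one formant from a fitted trajectory.

    freqs      : last W frequency values of the formant (Hz), oldest first
    bandwidths : matching bandwidths (Hz); narrow bands get more weight
    degree     : 1 for a linear fit, 2 for a quadratic fit
    theta      : weight of the term (hypothetical default)
    """
    freqs = np.asarray(freqs, dtype=float)
    weights = 1.0 / np.asarray(bandwidths, dtype=float)
    t = np.arange(len(freqs), dtype=float)
    coeffs = np.polyfit(t, freqs, degree)          # least-squares fit
    fitted = np.polyval(coeffs, t)
    return theta * float(np.sqrt(np.sum(weights * (freqs - fitted) ** 2)))
```

A formant that moves steadily (for example 500, 510, 520, 530, 540 Hz) gets a
cost close to zero with a linear fit, whereas a single spurious jump in the
same window raises the cost, and raises it more when the offending frame has
a narrow bandwidth.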
The trajectory term, which makes use of several past frames, justifies the
use of the tree beam-search algorithm in place of the Viterbi decoding
algorithm. One of the main benefits of this trajectory model is that it makes
it possible to recover from observation errors in frames between an obstruent
and a vowel, thanks to the evidence from contiguous frames.
An advantage of this continuity constraint compared with previous works is
that this function does not increase the cost merely because the value of
two consecutive frequencies changes, as is the case in [7][9]. In addition,
this global function helps the algorithm to correct errors in difficult
frames where the frequency candidates do not give clear evidence. This group
includes frames between an obstruent and a vowel and frames corrupted by
noise.
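The local terms of the cost function can be illustrated with the sketch
below. Only the piecewise mapping penalty and the bandwidth sum follow
directly from the definitions in this section; the frequency term is assumed
here to be a normalized deviation from the nominal values $norm_i$, and all
function names and constant values are hypothetical.

```python
NORM_HZ = (500.0, 1500.0, 2500.0, 3500.0)   # nominal F1..F4 values from the text

def mapping_penalty(bw_min, thr, gamma):
    """Penalty for discarding a candidate whose bandwidth is bw_min.

    Zero when the discarded candidate is wide (bw_min >= thr); grows as the
    discarded candidate gets narrower than the threshold.
    """
    return 0.0 if bw_min >= thr else (thr - bw_min) / gamma

def local_cost(freqs, bws, discarded_bw_min, alpha, beta, thr, gammas):
    """Sum of the frequency, bandwidth and mapping terms for one mapping.

    freqs, bws       : F1..F4 and B1..B4 of the mapping (Hz)
    discarded_bw_min : per formant, minimum bandwidth among discarded valid
                       candidates, or None when nothing was discarded
    """
    # Assumed form of the frequency term: normalized deviation from nominal.
    c_freq = alpha * sum(abs(f - n) / n for f, n in zip(freqs, NORM_HZ))
    c_bw = beta * sum(bws)
    c_map = sum(
        mapping_penalty(b, thr, g)
        for b, g in zip(discarded_bw_min, gammas)
        if b is not None
    )
    return c_freq + c_bw + c_map
```

For instance, with a threshold of 200 Hz and $\gamma_i = 100$, discarding a
candidate with a 50 Hz bandwidth adds a penalty of 1.5, while discarding one
with a 300 Hz bandwidth adds nothing.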
3 Experiments and results
For comparison purposes, three formant tracking methods were selected:
Mustafa’s proposal [11], Welling and Ney’s algorithm [4] and Wavesurfer’s
method from the Snack toolkit [12]. The performance of the formant tracking
methods was measured quantitatively using the VTR-Formant database [13].
This database contains the formant labels of a subset of the TIMIT corpus
that is representative with respect to speaker, gender, dialect and phonetic
context. In these experiments, 420 signals from the VTR database were
processed, and the mean absolute error (MAE) between the formants estimated
by each tracking method and the VTR reference labels was computed. All speech
material was digitized at 16 bits with a 10000 Hz sampling rate. The ESPS
pitch algorithm from the Snack toolkit was used to detect voiced frames, so
that the MAE takes only voiced frames into account.
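The evaluation metric can be written compactly as follows. This is a minimal
sketch that assumes the estimated and reference formants are already aligned
frame by frame and that a voiced/unvoiced decision per frame is available;
the function name voiced_mae is hypothetical.

```python
import numpy as np

def voiced_mae(estimated, reference, voiced):
    """Mean absolute error per formant, computed over voiced frames only.

    estimated, reference : arrays of shape (n_frames, n_formants), in Hz
    voiced               : boolean array of shape (n_frames,)
    """
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    errors = np.abs(estimated[voiced] - reference[voiced])
    return errors.mean(axis=0)              # one MAE value per formant
```

Unvoiced frames are excluded before averaging, so a tracker is not penalized
for its output in regions where the formants are not evaluated.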
Figure 2 shows the formant estimates obtained for a selected speech signal
of the TIMIT database with the proposed method and the three methods used for
comparison, together with the reference from the VTR database. This
qualitative view of the formant tracks obtained with each method shows the
benefits of our tracking algorithm. In the figure it can be seen that Welling
and Ney’s algorithm produces quite accurate formant tracks, although it
sometimes performs poorly, mainly in the tracking of F1. Wavesurfer obtains
tracks very similar to the reference; however, the proposed method sometimes
outperforms it, for example in the tracking of F3 and F4 between 0.5 and
1 second. Mustafa’s algorithm achieved the worst performance of all the
methods used.
Methods           F1 (Hz)  F2 (Hz)  F3 (Hz)  F4 (Hz)
LPC-beam-search    18.39    27.96    35.26    69.01
Wavesurfer         29.95    57.66    76.53    76.44
Welling-Ney        37.53    47.33    52.53    67.32
Mustafa            28.11    80.22    82.54    75.63
Table 1: MAE (Hz) for formant estimations obtained with LPC beam-search algorithm,
Wavesurfer, Welling and Ney and Mustafa’s algorithms.
Table 1 shows the performance of the four methods evaluated on clean speech.
It can be observed that the proposed tracking algorithm outperforms the other
formant extractors in most cases. Overall, the order of accuracy of the
methods evaluated is LPC-beam-search, Welling-Ney, Wavesurfer and finally
Mustafa, although for F1 Mustafa outperforms Welling-Ney. Wavesurfer is
better than Welling-Ney in the tracking of F1, but its performance decreases
for F2 and F3, which are the harder resonances to follow. The F4 performance
is less important because this formant is not manually labeled in the VTR
database.
[Figure 2: formant tracks for each method (one panel labeled "LPC + beam-search"); frequency (kHz) versus time (seconds).]
Fig. 2: Example results of a signal of TIMIT database with all the methods evaluated.
Additional tests were carried out in noisy environments. The corrupted speech
signals come from four different noise environments:
- stationary white noise
- pseudo-stationary street noise, which is a mixture of different noises
- music from the band Guns and Roses, a highly harmonic and non-stationary
  noise
- babble noise, a special case of non-stationary noise consisting of the
  voices of other speakers
All these types of noise were added electronically to the test speech signals
at different SNR levels, from 0 to 20 dB in 5 dB steps.
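Mixing noise into a clean signal at a target SNR can be sketched as below.
The procedure actually used to contaminate the signals is not detailed here,
so this example makes its own assumptions: the SNR is defined from the mean
power of the whole signals, the noise is looped or trimmed to the speech
length, and the function name add_noise_at_snr is hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    # Repeat or trim the noise so that it matches the speech length.
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings the noise power to p_speech / 10^(SNR/10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: one clean signal contaminated at the five SNR levels of the text.
rng = np.random.default_rng(0)
clean = np.sin(2.0 * np.pi * 200.0 * np.arange(1000) / 10000.0)
white = rng.standard_normal(300)
noisy = {snr: add_noise_at_snr(clean, white, snr) for snr in (0, 5, 10, 15, 20)}
```

Measuring power over the whole file is the simplest convention; measuring it
only over active speech segments would give different gains for the same
nominal SNR.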
Figure 3 shows the MAE in the noisy environments evaluated. The behavior of
the methods differs considerably with the type of noise. Stationary white
noise is the most challenging type of noise, as shown by the worst MAE of
the formant trackers in the corresponding plot. On the other hand, for all
methods in street, music and babble noise, F1 is quite stable from SNR =
10 dB upwards, and the slope of the MAE curves for F2 and F3 decreases
slightly. This gives an idea of the robustness of the formant trackers above
SNR = 10 dB.
[Figure 3: four panels titled White Noise, Street Noise, Babble Noise and Music Noise.]
Fig. 3: MAE of formant estimation with LPC-beam search algorithm, Welling-Ney
(WNey) algorithm and Wavesurfer (WS) algorithm vs VTR database in noisy
environments.
Figure 3 shows that the proposed method outperforms the other methods in
most of the noisy conditions evaluated. Nevertheless, the Welling-Ney
algorithm obtains the most precise F3 in music and street noise for SNR below
10 dB, and it is also the best method for F2 in street noise for SNR below
10 dB. We conclude that the Welling-Ney method is more robust to narrow-band
noise (music and street noise) than the LPC-based methods (Wavesurfer and LPC
beam-search). The spectral segmentation performed in the Welling-Ney method,
based on searching for the 4 best spectral regions with dynamic programming,
makes this method robust against this kind of noise, unlike the LPC-based
methods, which use 5 or 6 peaks as formant candidates. A narrow-band noise is
a good candidate to be confused with a formant and to be selected, because it
frequently has a lower bandwidth than a speech formant.
In white noise, the Welling-Ney and Wavesurfer algorithms are very
inaccurate, with MAE near 200 Hz. For babble noise, however, Welling-Ney
achieved very low errors; for low values of SNR it even outperforms the LPC
beam-search method in F2 and F3.
Figure 4 shows the formant tracks obtained with three of the methods
evaluated and the reference, over a spectrogram of the same speech signal
used in Fig. 2, corrupted by babble noise with SNR = 10 dB. Notice that the
proposed method achieves smooth formant curves, thanks to the trajectory
functions combined with the beam-search algorithm. The other methods generate
curves with many spikes, which are due to the uncertainty introduced by the
noise, which can mask the spectral features used to detect the formant
candidates. Thus, when only weak continuity constraints are incorporated,
formant trackers become very unstable in noisy environments and tend to show
fast changes in the detected formants. This is the case for the Wavesurfer
formant tracker. The Welling-Ney formant tracker, in turn, does not include
any continuity constraint, which explains why it shows the same behavior.
[Figure 4: formant tracks for each method (one panel labeled "LPC + beam-search"); frequency (kHz) versus time (seconds).]
Fig. 4: Example results of a signal of TIMIT database corrupted by babble noise with
SNR = 10 dB with all the methods evaluated.
4 Conclusions
In this paper we presented an evaluation of the LPC beam-search method in
noisy environments and a comparison with three formant tracking algorithms.
Although the proposed method was not designed with specific noise
compensation techniques, it shows very robust performance for all the types
of noise evaluated. In fact, the results show that in most cases the proposed
LPC beam-search method performs better than the Wavesurfer, Mustafa and
Welling-Ney formant tracking algorithms. Furthermore, a feature that makes
the beam-search algorithm attractive is that it produces smooth formant
trajectories even for corrupted signals, while the other methods are very
spiky in the presence of noise.
5 Acknowledgements
This work has been partially funded by Spanish national program INNPACTO
References
[1] Rose, P., “Forensic Speaker Identification”. Taylor and Francis Forensic Science
Series, ed. J. Robertson, London: Taylor and Francis, 2002.
[2] S. McCandless, “An algorithm for automatic formant extraction using linear
prediction spectra”. IEEE TASSP, vol. ASSP-22, pp. 135-141, 1974.
[3] Claudius Gläser, Martin Heckmann, Frank Joublin and Christian Goerick,
“Combining auditory preprocessing and Bayesian Estimation for Robust Formant
Tracking”. IEEE Trans. Audio Speech Lang. Process. 2010.
[4] Lutz Welling and Hermann Ney, “Formant Estimation for Speech Recognition”.
IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 1, 1998.
[5] Daryush D. Mehta, Daniel Rudoy and Patrick J. Wolfe, “KARMA: Kalman-based
autoregressive moving average modeling and inference for formant and antiformant
tracking”. stat. AP. 2011.
[6] Zaineb Ben Messaoud, Dorra Gargouri, Saida Zribi and Ahmed Ben Hamida,
“Formant Tracking Linear Prediction Model using HMMs for Noisy Speech Processing”.
Int. Journal of Inf. and Comm. Eng., Vol. 5, No. 4, 2009.
[7] D. Talkin, “Speech formant trajectory estimation using dynamic programming
with modulated transition costs”. JASA, vol. 82, no. S1, p. 55, 1987.
[8] Li Deng, Issam Bazzi and Alex Acero, “Tracking Vocal Tract Resonances Using an
Analytical Nonlinear Predictor and a Target-Guided Temporal Constraint”. 2003.
[9] K. Xia and C. Espy-Wilson, “A new strategy of formant tracking based on dynamic
programming”. In Proc. ICSLP, 2000.
[10] José Enrique García Laínez, Dayana Ribas González, Antonio Miguel Artiaga,
Eduardo Lleida Solano and José Ramón Calvo de Lara, “Beam-Search Formant
Tracking Algorithm based on Trajectory Functions for Continuous Speech”. To
be published in proceedings of CIARP 2012.
[11] K. Mustafa and I. C. Bruce, “Robust formant tracking for continuous speech with
speaker variability”. IEEE Transactions on Speech and Audio Processing, 2006.
[12] Snack toolkit:
[13] Li Deng, Xiaodong Cui, Robert Pruvenok, Jonathan Huang, Safiyy Momen, Yanyi
Chen and Abeer Alwan, “A Database of Vocal Tract Resonance Trajectories for
Research in Speech Processing”. ICASSP, 2006.