Adding Fricatives to the Portuguese Articulatory Synthesiser
António Teixeira, Luis M. T. Jesus, Roberto Martinez
Instituto de Engenharia Electrónica e Telemática de Aveiro (IEETA)
Universidade de Aveiro, 3810-193 Aveiro, Portugal
First attempts at incorporating models of frication into an ar-
ticulatory synthesizer, with a modular and flexible design, are
presented. Although the synthesizer allows the user to choose
different combinations of source types, noise volume velocity
sources have been used to generate turbulence. Preliminary results
indicate that the model is capturing essential characteristics of
fricatives. Results also show the potential of performing synthesis
based on broad articulatory configurations of fricatives.
1.1. Fricative Production Mechanisms
When a vowel is being uttered, the vocal tract is relatively
unconstricted (∼1 cm² cross-sectional area at the most constricted
region) and the vocal folds vibrate periodically, causing
the volume of air flowing through the glottis to fluctuate
periodically as well. Fricative consonants are produced when the
vocal tract is constricted (∼0.1 cm² at the most constricted region)
and air is forced through the constriction. The place of
constriction affects the tract resonances (filter properties), but also
affects the shape of the tract downstream of the constriction and
thus the source properties: where the turbulent jet will impinge
on tract walls, generating more noise, and the particular spectral
characteristics of that noise.
It is known from studies of jet noise and mechanical mod-
els that when a particular configuration is held constant, and
only the air velocity is increased, the turbulence noise increases
(i.e. sound pressure and power), and increases more at higher
frequencies. Although it is not easy to control or measure
parameters so precisely in the vocal tract, the same phenomenon
appears to occur for fricatives.
The acoustic mechanism for production of fricatives is thus
not as well understood as for vowels because:
1. turbulence noise defies an analytic formulation, requir-
ing empirical studies;
2. turbulence noise sources are much more sensitive to
changes in the surrounding geometry than are acoustic
3. given the small constriction dimensions and the depen-
dence of all aeroacoustic sources on flow velocities, it
is much more difficult and more important to get suffi-
dynamic and acoustic data for fricative configurations.
These difficulties have been reflected in the relatively poor qual-
ity of fricative and affricate synthesis.
1.2. Previous Studies of Fricatives
the extraction of MRI data , fricative aeroacoustics analy-
sis methods, and the incorporation of three dimensional vocal
tract data in speech synthesis. The study of relations between
articulatory, acoustic and perceptual cues provides crucial in-
formation for the articulatory synthesis of fricative consonants
. The study of the nature of the interaction between acoustic
sources and vocal tract shapes for constricted consonantal con-
figurations, and the study of mechanical models by Shadle ,
has supplied important data to drive various parametric multi-
tube acoustic models [4, 5, 6, 7, 8].
European Portuguese fricatives have been previously anal-
ysed by Jesus and Shadle  in ways designed to enhance our
description of the language, and to use and increase our un-
derstanding of the production of fricatives. The research pre-
sented by Jesus and Shadle  aimed to investigate the acous-
tic features that characterise the production of fricative conso-
nants . Their work focused on the analysis of frication in
the Portuguese language, describing a novel methodology of
corpus design, and temporal and spectral analysis techniques.
Knowledge accumulated from their data could be used for im-
proved speech synthesis. The peak frequencies, spectral am-
plitude characteristics, and temporal information could be use-
ful for synthesis and the parameterisation of the spectra allows
us to deduce the behaviour of sources for articulatory synthe-
sis models such as the one proposed by Narayanan and Alwan
. The quantified spectral characteristics of Portuguese frica-
tion and source spectrum during the production of these sounds,
although using only the far-field acoustic signal will always
present a limitation to source-filter separation.
1.3. Previous Fricative Production Models
Flanagan and Ishizaka [4, 5] modelled voiceless excitation for
a speech synthesizer that incorporated the two-mass vocal fold
model, assuming that the sound sources interacted with the
resonant system. The model included turbulent excitation
generated by a serial-random pressure source within the vocal fold
model (produced by vorticity in the flow through the vocal fold
opening, essentially during the time that the vocal folds do not
vibrate) and/or turbulent flow (modelled as a series pressure
source) that occurred at constricted points along the vocal tract.
Sondhi and Schroeter  developed a hybrid articulatory
synthesiser that modelled the glottis in the time domain because
of its nonlinear nature, and modelled the vocal and nasal tracts
in the frequency domain taking advantage of the more conve-
nient representation of losses and radiation using a product of
EUROSPEECH 2003 - GENEVA
generated using only one series pressure source at the point of
maximum constriction, or alternatively using a parallel volume
velocity source downstream of the main constriction.
Badin  investigated the properties of the vocal tract
transfer function in the frequency domain and its relation to
the source location and impedance functions of a pseudo-static
aerodynamic model that defined boundary conditions (glot-
tis and constriction resistances) from aerodynamic parameters
(subglottal pressure, glottis opening and constriction area). The
spectra of natural speech were replicated using a model of the
vocal tract area functions taking into account the glottal and
Shadle  studied the acoustic mechanism of fricative con-
sonants in the context of three domains: theoretical models, me-
chanical models and speech. She described four sets of exper-
iments with mechanical models of increasing realism, which
were used to determine source characteristics such as location,
degree of distribution and spectrum shape.
Scully, et al.  combined the analysis of real speech with
analysis-by-synthesis using the Leeds model of speech produc-
tion. The input of the model specified a succession of targets for
each articulator and the output defined strengths and time do-
mains of the voice, quasi-periodic, aspiration noise and frica-
tion noise sources. The articulatory descriptions obtained from
analysis of natural speech provided components for the aero-
dynamic description, and each acoustic source of the model
depended upon an appropriate combination of articulatory and
Narayanan and Alwan  used MRI, dynamic EPG, high
quality acoustic recordings and aerodynamic studies to derive
The source characteristics were derived based on an analysis-
by-synthesis method and the vocal tract area functions were
obtained from MRI of the fricatives. The vocal tract was
modelled as a concatenation of 3 mm long uniform cylindrical tube
sections, and the sublingual cavities were modelled as shunt
branches specified in the anterior oral cavity.
source model used a combination of acoustic monopole and
dipole sources and a voiced source in the case of voiced fricatives.
SAPWindows is the name given to the University of Aveiro’s
articulatory synthesizer and it stands for “Sintetizador
Articulatório de Português” for Windows. It consists of articulatory,
source, and acoustic models. Different sounds are produced
when the acoustic model is excited by various sources. The
synthesizer was implemented using object oriented program-
ming, therefore several abstract classes, parameter transfer pro-
tocol rules and data structures were designed. The main ab-
stract classes, known as base classes, define only the criteria and
methods used. Implementation of the different models available
for a base class is performed in derived classes.
Fig. 1 shows how the main synthesis base classes interact
with each other. The acoustic model produces the sound. Base
classes could be used in other applications besides SAPWin-
dows; with the proper interface those classes can be reused.
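The split between base classes (criteria and methods only) and derived classes (concrete models) can be illustrated with a minimal sketch. All class and method names here are hypothetical and chosen for illustration; they are not the SAPWindows class names, and the toy derived classes stand in for real source and acoustic models.

```python
from abc import ABC, abstractmethod


class SourceModel(ABC):
    """Hypothetical base class: declares what every source must provide."""

    @abstractmethod
    def next_sample(self) -> float:
        """Return the next excitation sample."""


class AcousticModel(ABC):
    """Hypothetical base class: turns an excitation into a sound wave."""

    @abstractmethod
    def synthesize(self, source: SourceModel, n_samples: int) -> list:
        """Return n_samples of output driven by the given source."""


class ConstantSource(SourceModel):
    """Toy derived class: a constant excitation, for illustration only."""

    def __init__(self, level: float) -> None:
        self.level = level

    def next_sample(self) -> float:
        return self.level


class GainAcousticModel(AcousticModel):
    """Toy derived class: scales the excitation by a fixed gain."""

    def __init__(self, gain: float) -> None:
        self.gain = gain

    def synthesize(self, source: SourceModel, n_samples: int) -> list:
        return [self.gain * source.next_sample() for _ in range(n_samples)]


# Any derived acoustic model can be driven by any derived source model.
wave = GainAcousticModel(gain=0.5).synthesize(ConstantSource(2.0), n_samples=3)
```

Because callers depend only on the base-class interface, a new model can be added as another derived class without touching the code that uses it, which is the reuse property the text describes.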
The anatomic model adopted in this work is based on the
MMIRC (Mind Machine Interaction Research Center) model,
which in turn is a modified version of the Mermelstein model
. The model assumes midsagittal plane symmetry, and the
output is an estimate of the vocal tract cross-sectional area.
Figure 1: SAPWindows main synthesis base classes.
The acoustic model is responsible for speech wave generation.
The output of main synthesis base classes is a sound
wave. The impulse response is given by the inverse Fourier
Transform (IFFT) of the acoustic transfer function of a given
vocal tract configuration, obtained by calculating the resulting
transfer function at N frequencies. The FFTW fast implemen-
tation of the Fourier Transform was used. The convolution of
the impulse response with the glottal excitation signal produces
the sound waveform. A frequency domain analysis and time
domain synthesis method, called hybrid or frequency analysis
time synthesis method, was used.
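The frequency analysis, time synthesis idea can be sketched in a few lines: sample a transfer function at N frequencies, take its inverse transform to get an impulse response, and convolve that with the excitation. This is only an illustration under simplifying assumptions. It uses a direct O(N²) inverse DFT instead of FFTW, and the transfer function is a toy one-sample delay rather than a vocal tract response.

```python
import cmath
import math


def inverse_dft(spectrum):
    """Real part of the inverse DFT of a full N-point spectrum (O(N^2))."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def convolve(h, x):
    """Direct linear convolution of two finite sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


# Toy transfer function sampled at N frequencies: a pure one-sample delay,
# H[k] = exp(-2j*pi*k/N), whose impulse response is [0, 1, 0, ...].
N = 8
H = [cmath.exp(-2j * math.pi * k / N) for k in range(N)]
h = inverse_dft(H)

excitation = [1.0, 0.5, 0.25]     # stand-in for the glottal excitation
speech = convolve(h, excitation)  # the excitation delayed by one sample
```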
The acoustic model must know the configuration of the
anatomic model before any kind of computation is performed,
so it can retrieve areas, length, sinus data and the section
mapping. Information about different sound sources (oral or nasal)
is provided for the model by setting the appropriate class
parameters.
The SAPWindows graphical user interface (GUI), shown in Fig. 3,
displays detailed information during synthesis, including the
volume velocity, the speech waveform and spectrogram.
Detailed information about SAPWindows has already been presented.
3. Fricative Modelling
the following features: to deal with several noise sources with
different characteristics; to automatically detect conditions for
the existence of noise sources; to use information produced by
the articulatory model; to use as much as possible the existing
code and data structures.
A model of frication was added to the synthesizer main-
taining most of the existing modules, control processes and pa-
rameters. This system was used in previous experiments  to
synthesize voiced sounds. In this investigation  the value of
F0 was directly controlled by changing the parameterised glottal
area of a two-mass vocal fold model . Unvoiced sounds
are now produced in the acoustic model by setting the F0 value
below a certain threshold, as an indicator of no oscillation. The
existing parameters Agmax (maximum glottal aperture) and slope
(difference between masses aperture) are used to set the vocal
When F0 goes below the mentioned threshold, “fake” periods of
fixed duration are created as a result of having adopted a
pitch-synchronous synthesis method.
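The role of the "fake" periods can be sketched as follows: a pitch-synchronous loop needs a period length for every iteration, so when F0 signals no oscillation a fixed duration is substituted. The threshold and the fake period duration are not given in the text; the values of `f0_threshold_hz` and `fake_period_s` below are hypothetical placeholders.

```python
def period_samples(f0_hz, sample_rate_hz, f0_threshold_hz=20.0, fake_period_s=0.005):
    """Return (n_samples, voiced) for the next synthesis period.

    f0_threshold_hz and fake_period_s are hypothetical values chosen
    for illustration only.
    """
    if f0_hz < f0_threshold_hz:
        # No vocal fold oscillation: use a fixed-duration "fake" period so
        # the pitch-synchronous loop keeps advancing through unvoiced speech.
        return round(fake_period_s * sample_rate_hz), False
    # Voiced: one period lasts 1/F0 seconds.
    return round(sample_rate_hz / f0_hz), True


voiced_period = period_samples(100.0, 16000)
unvoiced_period = period_samples(0.0, 16000)
```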
In order to obtain the radiated pressure at the lips due to
glottal volume velocity , transfer functions are calculated
at the beginning and end of each period and linear interpola-
tion of the impulse responses is used to obtain each sample
. The flow, pressure and resistance of noise sources, and the
transfer functions from noise sources to the lips, are calculated
several times, to allow the activation and deactivation of noise
sources during a period. For each tube where a noise source
was inserted, past values of noise source volume velocity were
stored to calculate the convolution with the impulse response
and therefore obtain the speech sound pressure waveform.
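A minimal sketch of the interpolation step described above, assuming the impulse responses at the two ends of a period have equal length and that stored past source samples precede the current period in one list. The function names and the toy responses are hypothetical.

```python
def interpolate_impulse_response(h_start, h_end, n, period_len):
    """Linearly blend two equal-length impulse responses for sample n."""
    w = n / period_len  # 0 at the period start, approaching 1 at its end
    return [(1.0 - w) * a + w * b for a, b in zip(h_start, h_end)]


def synthesize_period(h_start, h_end, source_history, period_len):
    """Time-varying convolution over one period.

    source_history holds stored past source samples followed by the
    samples of the current period (its last period_len entries).
    """
    offset = len(source_history) - period_len
    out = []
    for n in range(period_len):
        h = interpolate_impulse_response(h_start, h_end, n, period_len)
        newest = offset + n
        acc = 0.0
        for k, hk in enumerate(h):
            if newest - k >= 0:  # ignore samples before the stored history
                acc += hk * source_history[newest - k]
        out.append(acc)
    return out


h_a = [1.0, 0.0]  # toy response at the start of the period
h_b = [0.0, 1.0]  # toy response at the end of the period
u = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # two past samples, then an impulse
p = synthesize_period(h_a, h_b, u, period_len=4)
```

Keeping past source samples is what lets the tail of the previous period's response contribute to the first samples of the next one.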
In this implementation, noise sources are part of the acous-
tic model, to model turbulence generated inside the tract which
depends on volume velocity. This differs from the glottal source
model, which is considered as a separate module of the
synthesizer. We have a new acoustic model that can include
several sources, extending the existing synthesizer to support noise
sources.
In the current version of our synthesizer the volume flow at
the constriction is assumed to be equal to the flow at the glottis.
Additional work is being developed to test an improved model
of volume flow through a constriction.
3.1. Noise sources
Fluctuations in the velocity of airflow emerging from a con-
striction (at an abrupt termination of a tube) create monopole
sources and fluctuations of forces exerted by an obstacle (e.g.
teeth, lips) or surface (e.g. palate) oriented normal to the flow
generate dipole sources. Since dipole sources have been shown
to be the most influential in the fricative spectra , the noise
source of the fricatives has only been approximated by equiva-
lent pressure voltage (dipole) sources in the transmission-line
model. Nevertheless, it is also possible to insert the appropriate
monopole sources, which contribute to the low-frequency am-
plitude and can be modelled by an equivalent current volume
Frication noise is generated at the vocal tract according
to the suggestions of Flanagan , and Sondhi and Schroeter
. A noise source can be introduced automatically at any T-
section of the vocal tract network, between the velum and the
lips. The synthesiser’s articulatory module registers which
vocal tract tube cross-sectional areas are below a certain threshold
(A < 1 cm²), producing a list of tube sections that might be part
of an oral constriction that generates turbulence.
The acoustic module calculates the Reynolds number at the
sections selected by the articulatory module and activates noise
sources at tube sections where the Reynolds number is above a
critical value (Recrit = 2000 according to ). Noise sources
can also be inserted at any location in the vocal tract, based on
additional information about the distribution and characteristics
of sources [3, 8]. This is a different source placement strategy
from that usually used in articulatory synthesis  where the
sources are primarily located in the vicinity of the constriction.
The distributed nature of some noise sources can be modelled
by inserting several sources located in consecutive vocal tract
sections. This will allow us to try combinations of the canonical
source types (monopole, dipole and quadrupole).
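The two-stage placement strategy (area threshold first, Reynolds number second) can be sketched as below. The 1 cm² threshold and the critical value of 2000 come from the text. The Reynolds number definition used here, with the diameter of a circular section of equal area as characteristic length, and the air constants are assumptions of this sketch, not values taken from the paper.

```python
import math

RHO = 1.14e-3         # air density, g/cm^3 (assumed value)
MU = 1.86e-4          # air viscosity, g/(cm*s) (assumed value)
AREA_THRESHOLD = 1.0  # cm^2, candidate-constriction threshold from the text
RE_CRIT = 2000.0      # critical Reynolds number from the text


def candidate_sections(areas_cm2):
    """Indices of tube sections narrow enough to be part of a constriction."""
    return [i for i, a in enumerate(areas_cm2) if a < AREA_THRESHOLD]


def reynolds_number(flow_cm3_s, area_cm2):
    """Re = rho * v * d / mu, with v = U / A and d the diameter of a
    circular section of the same area (the circular shape is an assumption)."""
    velocity = abs(flow_cm3_s) / area_cm2
    diameter = 2.0 * math.sqrt(area_cm2 / math.pi)
    return RHO * velocity * diameter / MU


def active_noise_sections(areas_cm2, flow_cm3_s):
    """Candidate sections whose Reynolds number exceeds the critical value."""
    return [
        i for i in candidate_sections(areas_cm2)
        if reynolds_number(flow_cm3_s, areas_cm2[i]) > RE_CRIT
    ]


areas = [3.0, 2.0, 0.8, 0.1, 0.15, 1.5]  # toy area function, cm^2
flow = 200.0                             # toy volume velocity, cm^3/s
candidates = candidate_sections(areas)
active = active_noise_sections(areas, flow)
```

With these toy numbers the 0.8 cm² section passes the area test but not the Reynolds test, so only the two narrowest sections receive a source. Consecutive active sections are one way to approximate a distributed source.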
A pressure source with an amplitude proportional to
the squared Reynolds number, $P_{noise} = 2 \times 10^{-6} \times (Re^2 - Re_{crit}^2)$
for $Re > Re_{crit}$ and $P_{noise} = 0$ for $Re \le Re_{crit}$, is activated
at the correct place in the tract [4, 6]. The internal resistance of
the noise pressure source is proportional to the volume velocity at
the constriction: $R_{noise} = \rho |U_c| / (2 A_c^2)$, where $\rho$ is the
density of the air, $U_c$ is the flow at the constriction, and $A_c$ is the
constriction cross-sectional area. The turbulent flow can be
calculated by dividing the noise pressure by the source resistance.
This noise flow could also be
filtered in the time domain to shape the noise spectrum  and
test various experimentally derived dipole spectra.
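The chain from Reynolds number to turbulent flow can be sketched numerically. The gain of 2 × 10⁻⁶ and the critical value come from the text. The forms used for the amplitude, GAIN × (Re² − Re_crit²), and for the resistance, ρ|U_c| / (2 A_c²), are assumptions, as are the air density and the uniform random modulation of the amplitude.

```python
import random

RHO = 1.14e-3     # air density, g/cm^3 (assumed value)
RE_CRIT = 2000.0  # critical Reynolds number from the text
GAIN = 2e-6       # gain factor from the text


def noise_pressure_amplitude(re):
    """GAIN * (Re^2 - Re_crit^2) above the critical value, else 0."""
    return GAIN * (re * re - RE_CRIT * RE_CRIT) if re > RE_CRIT else 0.0


def noise_resistance(flow_cm3_s, area_cm2):
    """Internal resistance rho * |U_c| / (2 * A_c^2) (assumed form)."""
    return RHO * abs(flow_cm3_s) / (2.0 * area_cm2 ** 2)


def noise_flow_sample(re, flow_cm3_s, area_cm2, rng):
    """One sample of turbulent flow: noise pressure divided by resistance.

    Scaling the amplitude by a uniform random number in [-0.5, 0.5) is an
    assumption of this sketch.
    """
    resistance = noise_resistance(flow_cm3_s, area_cm2)
    if resistance == 0.0:
        return 0.0  # no flow through the constriction, so no turbulence
    pressure = noise_pressure_amplitude(re) * (rng.random() - 0.5)
    return pressure / resistance


rng = random.Random(0)  # seeded so the sketch is repeatable
u_noise = [noise_flow_sample(3000.0, 200.0, 0.1, rng) for _ in range(4)]
```

A spectral shaping filter, as mentioned in the text, would be applied to `u_noise` before it is convolved with the impulse response; it is left out here.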
3.2. Propagation and radiation
The general problem associated with having N noise sources
is decomposed in N simple problems by using the superposi-
tion principle. In order to calculate the radiated pressure at the
lips, the vocal tract is divided into three sections: pharyngeal,
region between velum coupling point and noise source and re-
gion after the source. Data structures based on the area function
of each section are defined and ABCD matrices calculated .
The ABCD matrices were then used to calculate downstream
(Z1) and upstream (Z2) input impedances, as well as the trans-
fer function (H)
where C and D are parameters from the ABCD matrix (from
noise source to lips), and Zrad is the lip radiation impedance.
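To make the ABCD machinery concrete, the sketch below builds the chain matrix of a region from its area function and computes the input impedance seen through it. It assumes lossless uniform tube sections, which is a simplification, and it does not reproduce the transfer function H of the text. The constants, function names and the matrix convention (input pressure and flow expressed from output pressure and flow) are assumptions.

```python
import cmath
import math

RHO = 1.14e-3    # air density, g/cm^3 (assumed value)
C_SOUND = 3.5e4  # speed of sound, cm/s (assumed value)


def tube_abcd(area_cm2, length_cm, freq_hz):
    """(A, B, C, D) of one lossless uniform tube section."""
    k = 2.0 * math.pi * freq_hz / C_SOUND  # wavenumber
    z0 = RHO * C_SOUND / area_cm2          # characteristic impedance
    kl = k * length_cm
    return (cmath.cos(kl), 1j * z0 * cmath.sin(kl),
            1j * cmath.sin(kl) / z0, cmath.cos(kl))


def cascade(m1, m2):
    """Product of two 2x2 matrices stored as (A, B, C, D) tuples."""
    a1, b1, c1, d1 = m1
    a2, b2, c2, d2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2)


def region_abcd(areas_cm2, section_len_cm, freq_hz):
    """Chain matrix of consecutive sections, ordered input to output."""
    m = (1.0, 0.0, 0.0, 1.0)
    for area in areas_cm2:
        m = cascade(m, tube_abcd(area, section_len_cm, freq_hz))
    return m


def input_impedance(m, z_load):
    """Impedance seen at the input of a region terminated by z_load."""
    a, b, c, d = m
    return (a * z_load + b) / (c * z_load + d)


m = region_abcd([1.0, 0.5, 2.0], 0.3, 1000.0)  # toy front-cavity region
z_in = input_impedance(m, 0.0)                 # ideal open end, Z_load = 0
```

In practice the load would be the lip radiation impedance rather than zero, and the matrices would be evaluated at each of the N analysis frequencies.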
The radiated pressure at the lips due to a specific source is
given by $p_{radiated}(n) = h(n) * u_{noise}(n)$, where $h(n) =
\mathrm{IFFT}(H)$. The output sound pressures due to the different
noise sources are added together. The output sound pressure
resulting from the excitation of the vocal tract by a glottal source
is also added when there is voicing.
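The superposition step amounts to convolving each source with its own impulse response and summing the results. A minimal sketch with toy sequences, assuming each source is given as a pair of impulse response and source flow (the pairing and the function names are hypothetical):

```python
def convolve(h, x):
    """Direct linear convolution of two finite sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


def radiated_pressure(sources, n_out):
    """Sum over all sources of h_i * u_i, truncated to n_out samples.

    sources is a list of (impulse_response, source_flow) pairs, one per
    noise source, plus one for the glottal source when there is voicing.
    """
    total = [0.0] * n_out
    for h, u in sources:
        y = convolve(h, u)
        for n in range(min(n_out, len(y))):
            total[n] += y[n]
    return total


sources = [
    ([1.0, 0.5], [1.0, 0.0, 0.0]),  # toy noise source 1
    ([0.0, 2.0], [1.0, 1.0, 0.0]),  # toy noise source 2
]
p_lips = radiated_pressure(sources, n_out=4)
```

Each source sees a different transfer function because its position changes which part of the tract lies between it and the lips, which is why the responses are kept per source.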
The main goal of this work was to synthesize unvoiced fricatives.
In a first experiment the synthesizer was used to produce
sustained unvoiced fricatives. The vocal tract configuration
derived from a high vowel was adjusted by raising the tongue tip
in order to produce a sequence of reduced vocal tract
cross-sectional areas. The lung pressure was linearly increased and
decreased at the beginning and end of the utterance, to produce
a gradual onset and offset of the glottal flow.
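The linear rise and fall of lung pressure can be sketched as a trapezoidal envelope. The utterance duration, ramp length and plateau pressure are not stated in the text; the values used in the example call are hypothetical.

```python
def lung_pressure(t, duration, ramp, p_max):
    """Trapezoidal envelope: linear rise, plateau, linear fall.

    t, duration and ramp are in seconds; p_max is the plateau pressure
    in whatever unit the synthesizer expects.
    """
    if t < 0.0 or t > duration:
        return 0.0
    if t < ramp:
        return p_max * t / ramp               # gradual onset
    if t > duration - ramp:
        return p_max * (duration - t) / ramp  # gradual offset
    return p_max


# Hypothetical utterance: 0.5 s long, 50 ms ramps, plateau of 8 units.
envelope = [lung_pressure(i * 0.025, 0.5, 0.05, 8.0) for i in range(21)]
```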
Fig. 2 shows the glottis volume velocity waveform, the
thesizer activated only one noise source at the onset and offset
of frication, and used five sources during the steady state of the
The second goal was to synthesize fricatives in VCV se-
quences. Articulatory configurations for vowels obtained by an
inversion method were used and during the fricative interval the
tongue tip articulatory parameter was adjusted to a postalveolar
fricative configuration. An F0 value of 100 Hz and a maximum
glottal opening of 0.3 cm² were used to synthesize the vowels.
The time trajectory of the glottal source parameter Agmax
starts at 1.5 cm², rises to 2 cm² at the fricative middle point
and returns to 1.5 cm² near the end, before assuming the value
used during vowel production. Synthesis results for the
nonsense word /iʃi/ are presented in Fig. 3 using the actual
SAPWindows GUI after synthesis.
In Fig. 4 we compare the Power Spectral Density (PSD)
estimate of the synthesized fricative with the PSD of a natural
[ʃ] from the corpus used in . A reasonable fit between the two
signals was obtained.
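The Agmax trajectory for the VCV sequence can be sketched as a piecewise-linear function of time. The area values (0.3, 1.5 and 2 cm²) come from the text. The breakpoint times are hypothetical, since the text does not give the timing of the sequence.

```python
def piecewise_linear(breakpoints, t):
    """Linear interpolation through (time, value) pairs sorted by time."""
    if t <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (t0, v0), (t1, v1) in zip(breakpoints, breakpoints[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return breakpoints[-1][1]


# Times in seconds are hypothetical; areas in cm^2 follow the text.
AGMAX_TRAJECTORY = [
    (0.00, 0.3),  # first vowel
    (0.10, 0.3),
    (0.12, 1.5),  # start of the fricative
    (0.20, 2.0),  # fricative middle point
    (0.28, 1.5),  # near the end of the fricative
    (0.30, 0.3),  # second vowel
    (0.40, 0.3),
]


def agmax(t):
    """Maximum glottal aperture at time t for the toy VCV sequence."""
    return piecewise_linear(AGMAX_TRAJECTORY, t)
```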
Figure 2: Synthesis results for an unvoiced fricative showing
the glottal flow, the speech waveform and spectrogram.
Figure 3: Dump of synthesizer GUI after /iʃi/ synthesis.
Figure 4: PSD (power spectrum magnitude in dB) of a synthesized
fricative /ʃ/ (solid line) and a natural /ʃ/ (dashed line).
With the addition of noise source models and modifications to
the acoustic model, our articulatory synthesizer is capable of
producing sustained fricatives and fricatives in VCV sequences.
First results were judged in informal listening tests as being
Preliminary results of ongoing work were presented, so
further validation and checking of the models is still required.
Nevertheless, this is an important new step towards a complete
articulatory synthesizer for Portuguese. Our model of fricatives
is comprehensive and flexible, making the new version of
SAPWindows a valuable tool for trying out new or improved source
models, and running production and perceptual studies of
European Portuguese fricatives. The possibility of automatically
inserting and removing noise sources along the oral tract is a
feature we regard as having great potential.
[1] S. S. Narayanan, A. A. H. Alwan, and K. Haker, “An
articulatory study of fricative consonants using magnetic
resonance imaging,” Journal of the Acoustical Society of
America, vol. 98, no. 3, pp. 1325–1347, 1995.
[2] C. Scully, E. Castelli, E. Brearley, and M. Shirt, “Analysis
and simulation of a speaker’s aerodynamic and acoustic
patterns for fricatives,” Journal of Phonetics, vol. 20, no.
1, pp. 39–51, 1992.
[3] C. H. Shadle, “Articulatory-acoustic relationships in
fricative consonants,” in Speech Production and Speech
Modelling, W. J. Hardcastle and A. Marchal, Eds., pp. 187–
209. Kluwer Academic, Dordrecht, 1990.
[4] J. L. Flanagan, Speech Analysis, Synthesis and Perception,
Springer-Verlag, Berlin, second edition, 1972.
[5] J. L. Flanagan and K. Ishizaka, “Automatic generation
of voiceless excitation in a vocal cord-vocal tract speech
synthesizer,” IEEE Transactions on Acoustics, Speech,
and Signal Processing, vol. ASSP-24, no. 2, pp. 163–170,
[6] M. M. Sondhi and J. Schroeter, “A hybrid time-frequency
domain articulatory speech synthesizer,” IEEE Transactions
on Acoustics, Speech, and Signal Processing, vol.
ASSP-35, no. 7, pp. 955–967, 1987.
[7] E. L. Riegelsberger, The Acoustic-to-Articulatory Mapping
of Voiced and Fricated Speech, Ph.D., Department of
Electrical Engineering, The Ohio State University, Ohio,
[8] S. S. Narayanan and A. A. H. Alwan,
models for fricative consonants,” IEEE Transactions on
Speech and Audio Processing, vol. 8, no. 2, pp. 328–344,
[9] L. M. T. Jesus and C. H. Shadle, “A parametric study of
the spectral characteristics of European Portuguese
fricatives,” Journal of Phonetics, vol. 30, no. 3, pp. 437–464,
[10] P. Badin, “Acoustics of voiceless fricatives: Production
theory and data,” Quarterly Progress and Status Report
3/1989, Speech Transmission Laboratory, Royal Institute
of Technology, Stockholm, Sweden, 1989.
[11] D. G. Childers, Speech Processing and Synthesis
Toolboxes, John Wiley & Sons, New York, 2000.
[12] A. Teixeira, L. Silva, R. Martinez, and F. Vaz,
“SAPWindows - towards a versatile modular articulatory
synthesizer,” in Proceedings of the IEEE-SP Workshop on
Speech Synthesis, USA, 2002.