EVALUATION OF A TECHNIQUE FOR IMPROVING THE MAPPING OF
MULTIPLE SPEAKERS’ VOWEL SPACES IN THE F1 ~ F2 PLANE1
Dominic Watt & Anne Fabricius
Abstract
We evaluate a vowel formant normalisation technique that allows direct visual and
statistical comparison of vowel triangles for multiple speakers of different sexes, by
calculating for each speaker a ‘centre of gravity’ S in the F1 ~ F2 plane. S is calculated
on the basis of formant frequency measurements taken for the so-called ‘point’ vowel
[i], the average F1 and F2 for the vowel category with the highest average F1 (for
English, usually the vowel of the TRAP or START lexical sets), and hypothetical
minimal F1 and F2 values (coordinates we label [u′]) extrapolated from the other two
points. Expression of individual F1 and F2 measurements as ratios of the value of S for
that formant permits direct mapping of different speakers’ vowel triangles onto one
another, resulting in marked improvements in agreement in vowel triangle (a) area and
(b) overlap, as compared to similar mappings attempted using linear Hz scales and the
z (Bark) scale.
1. Introduction
For some considerable time it has been commonplace in phonetic and
sociolinguistic research to represent spoken vowels by means of the frequencies of
their two lowest formants, F1 and F2. The method has been adopted in order, among
other things, to allow greater objectivity and replicability when classifying individual
vowels than is possible using impressionistic auditory analysis alone. F1 has been
shown to correlate inversely with the position of the highest part of the tongue body in
the height dimension (open vowels have higher F1 values than close vowels do), while
F2 is correlated with tongue frontness (front vowels have higher F2 values than back
vowels do, especially if back vowels are also rounded). Vowels are frequently
represented using straightforward measurements in linear Hz, or by expressing the
relationship between the two parameters in some way (e.g. by plotting F1 against
F2 – F1 for a given vowel, as per Ladefoged & Maddieson 1990, Iivonen 1994), or by using
some transform or ‘warping’ of the Hz scale so as to reflect the non-linear mapping of
the acoustic parameter Hz to its perceptual correlates (e.g. through use of log(Hz)
transforms, or the Mel, Koenig, Bark, or Equivalent Rectangular Bandwidth (ERB)
scales). Some models also take account of higher formants such as F3, or of the
fundamental frequency (F0; see e.g. Hindle 1978, Disner 1980, Lobanov 1971, Moore
& Glasberg 1983, Deterding 1990, Rosner & Pickering 1994, Labov 2001, or Adank
et al. 2001 for evaluations of competing algorithms). In the case of the use of non-
linear transforms, the intention is to minimise as far as possible the influence of non-
linguistic factors on those properties in the acoustic signal which the researcher
1 We are grateful to the following people for their input, comments and other feedback: Patti Adank,
Paul Carter, Bernhard Fabricius, Paul Foulkes, Rob Hagiwara, Ghada Khattab, John Local, Richard
Ogden, Peter Patrick, Jane Stuart-Smith, and an anonymous reviewer.
Nelson, D. (ed.) Leeds Working Papers in Linguistics and Phonetics 9 (2002), pp. 159-173.
perceives to be important. Listeners appear capable of automatically factoring out
certain aspects of the acoustic signal, such that they can, for example, understand
natural speech produced by men, women and children with more or less equal
proficiency, despite large differences in the acoustic signatures of ‘equivalent’ sounds
produced by each type of speaker chiefly as a consequence of vocal tract length (VTL;
e.g. Stevens 1998). A central concern in the acoustic analysis of vowels has therefore
been to attempt to eliminate the effect of VTL on the relative frequencies of the lower
formants for multiple speakers. By performing such ‘normalisation’ on speech signals,
the researcher is permitted to make more direct comparison of formant frequencies of
vowels spoken by speakers of different sexes and ages, and is also able to approximate
more closely the way in which listeners may perceive spoken vowels.
Figure 1. Frequency of second formant versus frequency of first formant for ten American
English vowels produced by 76 men, women and children (adapted from Peterson & Barney 1952
by Lieberman & Blumstein 1988).
An especially common technique for visually assessing the similarities and
differences between F1 and F2 frequencies for vowels produced by different speakers
involves plotting unnormalised F1 and F2 against each other on x-y scatter
graphs (e.g. Peterson & Barney 1952, Hagiwara 1997, Watt & Tillotson 2001).2 This
method allows the researcher to superimpose one speaker’s vowel sample onto
another’s, and thereby to estimate whether or not, for example, Speaker A has on the
2 Hagiwara (1997) presents scatter plots in which the units are in Hz plotted on a Bark scale such that
higher frequencies are compressed relative to lower ones, but this is a matter of adjusting the scaling on
the axes of the plots rather than transforming the data themselves.
whole a higher F2 for a given vowel category than does Speaker B; such an
observation might confirm a hypothesised process of vowel fronting. Data in this form
also permit straightforward statistical comparison of samples, but only if it is assumed
that VTL, and therefore the potential ranges of values for both F1 and F2, are
effectively constant across all the speakers sampled.
So as to minimise the potentially problematic influence of VTL-related variation
among speakers of different ages and sexes, some researchers have used only post-
pubertal male speakers as informants for investigations of vowel variation (e.g.
Eremeeva & Stuart-Smith 2003). Serious problems are encountered if samples more
representative of the population as a whole are used, because F1 ~ F2 frequencies
tend to be significantly higher for adult females than for adult males, and higher
still for children. It is obviously not possible directly to compare (linear Hz) F1 ~ F2 scatter
plots for adult males and females, or for adults and children, because the F1 ~ F2
planes for women – and particularly young children – are considerably stretched in
both dimensions relative to those of male speakers (hence the elongation of the
envelopes drawn around tokens of the peripheral monophthongs in Figure 1).
As mentioned above, numerous techniques have been devised in an attempt to
reduce the discrepancies between the speech of men, women and children in this
respect. Some are designed to compress the higher frequency ranges used by women
and children relative to the lower ones; others work by expressing individual values in
terms of distance from a mean derived from the formant frequency measurements
themselves. An example of the first sort of transform is the Bark transform, which
involves conversion of Hz measurements into perceptual units based on the critical
bandwidth response of the ear (Zwicker & Feldtkeller 1967). We make no criticism of
the use of Bark-transformed data, nor of the validity of the scale itself, except to say that
it does not in fact fully permit direct comparison of one speaker’s vowel sample with
another speaker’s vowel sample in the way we would wish. This is because the
influence of VTL is not actually wholly eliminated, since within the frequency range
in which F1 typically falls – between c. 200 Hz and 1 kHz – the mapping between Hz
and Barks is effectively linear (see Traunmüller 1990; Adank et al. 2001). Within this
frequency range, higher Hz values correspond very closely to proportionately higher
Bark values, and it is only at frequencies well above those in which F1 is found that
there is significant divergence between the scales. Therefore the problem of cross-
speaker mapping persists, although the ‘compression’ of higher frequency ranges,
such as those in which F2 is commonly found for adult speakers, corrects this problem
to some degree. However, if our aim is to map one speaker’s vowel space onto
another’s for the purposes of comparing their vowel systems, in a way which removes
absolute differences in formant frequency further than Bark-transforming the data will
allow, we must follow another approach.
We evaluate in this paper a method for allowing direct visual and statistical
comparison of vowel spaces for different speakers which derives from measurements
in Hz of F1 and F2 at the midpoints of stressed spoken vowels. Our focus will be on an
assessment of the extent of reduction of speaker sex-related differences in samples of
vowel formant frequencies for two RP British English speakers (one male, one
female) where the frequency values are expressed on the following scales: (a) linear
Hz; (b) critical band rate z (in Barks) and (c) a so-called ‘S transform’. The last of
these is calibrated from the F1 ~ F2 plane’s ‘centre of gravity’ S by taking the grand
mean of the mean F1 and F2 frequencies for points at the apices of a triangular plane
which are assumed to represent F1 and F2 maxima and minima for the speaker in
question (these being [i], [a] and [u′]; see below). The procedures for calculating z and
S values for individual speakers are outlined in detail in the next section. Our estimate
of the improvement in comparability between speaker samples is based on the
increase in mapping between one speaker’s vowel triangle and another’s along two
continuous parameters: (a) the ratio of the area of the female speaker’s vowel
triangle to that of the male speaker’s triangle and (b) the degree of overlap between
the two triangles, expressed in terms of that percentage of the male speaker’s triangle
which overlaps with the female speaker’s triangle, and vice versa. It is demonstrated
that on both counts the S transform performs much better than Bark-transformed
representations of the two speakers’ vowel triangles.
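The paper does not specify how the triangle areas and overlaps reported below were computed; one straightforward way, sketched here in Python with function and variable names of our own choosing, is the shoelace formula for area and Sutherland–Hodgman clipping for the intersection, which suffices because both figures are convex triangles. This is an illustrative sketch, not the authors' own code.

```python
# Sketch of the two evaluation criteria. Vowel triangles are given as lists of
# (F1, F2) vertices; all names here are our own, illustrative choices.

def polygon_area(pts):
    """Unsigned polygon area via the shoelace formula."""
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0

def ccw(pts):
    """Return the vertices in counter-clockwise order (required by the clipper)."""
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return list(pts) if s > 0 else list(reversed(pts))

def clip(subject, clipper):
    """Sutherland-Hodgman clipping of 'subject' by a convex 'clipper' (CCW order)."""
    def inside(p, a, b):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0
    def cross_point(p, q, a, b):
        # Intersection of line p-q with the (infinite) line through a-b.
        denom = (p[0] - q[0]) * (a[1] - b[1]) - (p[1] - q[1]) * (a[0] - b[0])
        t = ((p[0] - a[0]) * (a[1] - b[1]) - (p[1] - a[1]) * (a[0] - b[0])) / denom
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
    out = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        pts, out = out, []
        for j in range(len(pts)):
            cur, prev = pts[j], pts[j - 1]
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    out.append(cross_point(prev, cur, a, b))
                out.append(cur)
            elif inside(prev, a, b):
                out.append(cross_point(prev, cur, a, b))
        if not out:
            break
    return out

def triangle_comparison(tri_male, tri_female):
    """Area ratio and mutual percentage overlap for two vowel triangles."""
    tri_male, tri_female = ccw(tri_male), ccw(tri_female)
    shared = clip(tri_male, tri_female)
    shared_area = polygon_area(shared) if len(shared) >= 3 else 0.0
    area_m, area_f = polygon_area(tri_male), polygon_area(tri_female)
    return {"area_ratio": area_f / area_m,              # cf. the 1 : 3.93 Hz figure in Section 3.1
            "pct_male_in_female": 100 * shared_area / area_m,
            "pct_female_in_male": 100 * shared_area / area_f}
```

The same functions can be applied regardless of whether the vertex coordinates are in Hz, Barks or Fn/S(Fn) units.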
2. Methods
2.1 Procedure for calculating critical band rate z (in Barks)
The transform used here is that from Traunmüller (1990):
z = 26.81 f / (1960 + f) – 0.53
where f is frequency in Hz. According to Traunmüller, the values obtained using this
equation agree with the values tabulated by Zwicker (1961) to within ±0.05 Bark in
the frequency range 0.2 – 6.7 kHz.
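A direct implementation of this conversion is trivial; the sketch below (Python; the function name is ours) simply applies the formula, and the example values illustrate the near-proportional behaviour in the F1 range noted in Section 1.

```python
def hz_to_bark(f_hz):
    """Critical band rate z (Bark) from frequency in Hz, after Traunmüller (1990)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# Example values: hz_to_bark(200.0) is roughly 1.95 and hz_to_bark(1000.0) roughly
# 8.53, so across the typical F1 range the Hz-to-Bark mapping remains close to
# proportional, as discussed in Section 1.
```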
For our present purposes, one advantage of converting all Hz measurements
using the above equation is that one can apply the same transform to all the formant
frequency measurements made for any number of speakers. The disadvantage, as
noted above (and as demonstrated below), is that one only marginally reduces the
effect of VTL, rather than eliminating it as far as possible. So while it is considerably
more time-consuming to convert Hz measurements into the S-transformed values used
for the comparison discussed in Section 3 below (because S values for each individual
speaker must be calculated for F1 and F2), the latter technique, as we shall see, allows
a much higher degree of mapping between samples for speakers whose VTLs are very
different from each other.
2.2 Procedure for calculating S
Our procedure for determining the F1 and F2 values of S for an individual
speaker is discussed in this section. For clarity, we follow Wells (1982) in assigning
the keywords FLEECE and TRAP to the lexical sets containing the vowels labelled /iː/
and /a/ in other descriptions of British English phonology, since we believe the use of
phonetic symbols to represent vowel categories which are highly variable in British
English (to the extent that, for example, the TRAP vowel can be realised as anything
from [æ] to [a], depending upon accent) to be potentially confusing.
2.2.1 Step One
· Assume that for a given speaker’s sample the average F1 and the average F2
for the vowels of words of the FLEECE set represent that speaker’s minimum
F1 and maximum F2. This seems a reasonable assumption, if no
observations are made to the contrary (but see below).
· Assume that for a given speaker’s sample the average F1 for the vowels of
words of the TRAP set represents that speaker’s maximum F1. Depending
upon the accent, one might wish to select words of the START set instead,
since in certain accents of British English the TRAP vowel is generally
produced with a somewhat raised quality. The influence of post-vocalic
rhoticity in certain accents might present problems if START is used,
however, because of the influence of a following rhotic on the formants of
vowels in words like start, car, farm, etc. The point is to obtain an estimate
of the region in which a speaker’s maximum F1 is located, but clearly it is
sensible to be consistent within a given sample (i.e. choosing either TRAP or
START for all the informants concerned).
By definition, there will be individual formant frequency values higher and
lower than the average F1 and F2 values we take to be maxima and minima for these
formants. It might therefore be said that because F1 and F2 values for a given vowel
category are generally somewhat - indeed often highly - variable, taking the mean
values for F1 and F2 runs the risk of giving a false picture of the extremes of a
speaker’s vowel plane. However, averaging the F1 and F2 values for a given vowel
category eliminates (or at least reduces) the potential of inaccurate individual formant
frequency measurements to distort the geometry of an individual speaker’s vowel
triangle.
It might also be objected that this routine assumes that each speaker’s FLEECE
and TRAP vowels are more or less invariant, when it is clear from many previous
studies that they are not, even in highly controlled speech elicited using artificial
means. We must assume for the time being that FLEECE is rather less variable in
accents of British English than other vowels, and that TRAP (or START) is likely to be
the most open vowel speakers of British varieties will use. Again, it should be stressed
that if the researcher is satisfied that FLEECE is relatively stable across a sample of
speakers, if he/she is circumspect about the choice of open vowel to use as the F1
maximum, and if formant measurement is done as consistently as possible, we
should be able to arrive at optimally comparable samples for speakers of different
sexes and ages.3
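In practice, Step One is simply a matter of averaging per-token formant measurements by lexical set. The following minimal sketch assumes the tokens are available as (lexical set, F1, F2) tuples in Hz; the data format and all names are hypothetical, not part of the original procedure.

```python
from collections import defaultdict

def category_means(tokens):
    """Mean F1 and F2 (Hz) per lexical set, from (lexical_set, f1, f2) tuples."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for lex_set, f1, f2 in tokens:
        t = totals[lex_set]
        t[0] += f1
        t[1] += f2
        t[2] += 1
    return {lex_set: (t[0] / t[2], t[1] / t[2]) for lex_set, t in totals.items()}

# Step One then reads the speaker's limits straight off this dictionary:
# means["FLEECE"] supplies the assumed minimum F1 and maximum F2, and
# means["TRAP"][0] (or means["START"][0], depending on the accent) the maximum F1.
```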
2.2.2 Step Two
The next step is to arrive at an estimate of the F1 and F2 minima for a given
speaker. In a very large number of studies of vowel variation in English, this limit is
taken to be represented by the average F1 and F2 values for the vowel /u/, which we
label here GOOSE. We take the view, however, that in many accents of English GOOSE
is only rarely fully back, fully close, and fully rounded (see e.g. Hagiwara 1997;
Watson et al. 1998; Labov 2001: 475ff), and that the average formant frequencies for
this vowel produced by the average British English speaker are not a good reflection
of the minimum possible F1 and F2 frequencies that such a speaker could achieve.
3 There is no reason why other vowel categories such as KIT and/or FACE could not be used to represent
F1 minima and F2 maxima, should it be anticipated or observed that the average formant values for
FLEECE do not in fact provide a reliable estimate of these limits in the accent(s) under scrutiny.
Instead, we advocate the use of hypothetical lower limits on F1 and F2 which, though
almost certainly not attested in a sample of informants’ speech, are nonetheless
arrived at in a principled way. These minimal values (or rather coordinates on the F1 ~
F2 plane) we label [u′]. They are arrived at as follows:
· It will be recalled from Section 2.2.1 that the average F1 for FLEECE was
assumed to represent the minimum F1 for a given speaker. Therefore, we
may assume that the F1 of [u′] is equivalent to that for FLEECE, since we
have no evidence to suggest that it is any lower.
· Since - by definition - F2 cannot have a lower frequency than F1, but often
has a frequency so close to it that the spectral peaks cannot reliably be
distinguished from one another using instrumental analysis, we can
justifiably assume for present purposes that the speaker’s closest, backest
possible vowel has an F2 exactly equivalent to its F1 frequency. Thus, F1
and F2 of [u′] are (a) equal to the average F1 value for FLEECE for a given
speaker, and therefore (b) exactly equal to one another.
The result of these calculations is a triangular area on the F1 ~ F2 plane, as
shown in Figure 2. Note that the axes are reversed, as is conventional in x-y plots
representing vowel systems.
Figure 2. Schematised representation of the ‘vowel triangle’ used for the calculation of S. i = min.
F1, max. F2 (average F1 ~ F2 for FLEECE); a = max. F1 (average F1 ~ F2 for TRAP); u′ = min. F1,
min. F2, where F1 (u′) and F2 (u′) = F1 (i).
2.2.3 Step Three
The next step is to calculate for the individual speaker in question the Fn
frequencies of the centre of gravity or ‘centroid’ S (following Koopmans-van Beinum
1980), which is quite simply the grand mean of Fn for i, a and u′ (a worked example is
provided in Appendix 2). We then divide all the observed measurements of Fn by the
S value for that formant, and express all resulting figures as values on scales Fn/S(Fn),
i.e. as ratios of S. Because S(Fn) divided by S(Fn) is always equal to 1 (with the
coordinates of S therefore always being (1,1) in any speaker’s vowel triangle), vowel
tokens with low Fn values on the Hz scale will have Fn/S(Fn) values between 0 and 1,
while vowels with Fn values greater than the S value for that formant will have
Fn/S(Fn) values higher than 1. Since all speakers’ vowel triangles will be defined
relative to S, we can compare samples for different speakers, both statistically and
visually, directly with one another. Plotting average or individual F1 ~ F2 measurements
for phonologically ‘back’ vowels such as GOOSE and GOAT on the F1/S(F1) ~ F2/S(F2)
plane is thus straightforward, regardless of how phonetically back or front these
vowels are.
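Taken together, Steps One to Three involve only a small amount of arithmetic. The sketch below (Python; the function name, argument names and the dictionary-of-means input format are our own assumptions rather than part of the original procedure) derives the hypothetical [u′] apex, computes S(F1) and S(F2) as grand means over the [i], [a] and [u′] points, and returns Fn/S(Fn) ratios for every vowel category supplied.

```python
def s_transform(means, open_vowel="TRAP"):
    """S-centroid normalisation from per-category (mean F1, mean F2) values in Hz.

    'means' must contain 'FLEECE' and the chosen open vowel ('TRAP' or 'START').
    Returns (S(F1), S(F2)) plus Fn/S(Fn) ratios for every category in 'means'.
    """
    i_f1, i_f2 = means["FLEECE"]       # Step One: assumed min. F1 and max. F2
    a_f1, a_f2 = means[open_vowel]     # Step One: assumed max. F1
    u_f1 = u_f2 = i_f1                 # Step Two: hypothetical [u'] apex
    # Step Three: the centroid S is the grand mean over the three apices.
    s_f1 = (i_f1 + a_f1 + u_f1) / 3.0
    s_f2 = (i_f2 + a_f2 + u_f2) / 3.0
    ratios = {v: (f1 / s_f1, f2 / s_f2) for v, (f1, f2) in means.items()}
    return (s_f1, s_f2), ratios
```

Any per-category means, including those for phonologically back vowels such as GOOSE or GOAT, can be passed in and will simply be expressed relative to S.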
2.3 Materials
We turn now to compare vowel samples for the two British English RP speakers
referred to in Section 1. The data are drawn from formant frequency measurements
made by Deterding (1997) from recordings of BBC broadcasts held in the MARSEC
(Machine Readable Spoken English Corpus) database (Roach et al. 1993).4 The
programmes in question were broadcast in the 1980s, and according to Roach et al.,
‘the accent of all the speakers is RP or close to it’ (Roach et al. 1993:48). From the ten
speakers (5 male, 5 female), we selected a male speaker and a female speaker at
random. The speakers in question are A (female) and C (male); speaker A’s sample is
drawn from a religious affairs programme, while C’s is based on a radio lecture on
economics (see Deterding 1997:48). Because our intention here is to assess the
relative effectiveness of z-transforming and S-transforming the linear Hz data in terms
of mapping one speaker’s FLEECE ~ TRAP ~ GOOSE triangle onto another’s, it is
sufficient to use two speakers whose formant frequencies in the Hz domain for
‘equivalent’ vowels are markedly mismatched, though of course any number of
speakers could be compared using this technique.
Details of how the original formant measurements themselves were made can be
found in Deterding (1997:48-50); the source figures can be downloaded directly from
the Internet.5
3. Results
3.1 Triangles plotted using Hz scale
The relative shapes, sizes and degree of overlap between the triangles
generated from the raw Hz data for speakers A and C are shown in Figure 3.
Agreement of the areas of the two triangles is poor: that for the female speaker
A (ΔA) is almost four times larger than that for the male speaker C (ΔC), at a ΔC : ΔA
ratio of 1 : 3.93 (see Table 1 below for the full results in tabular form). The degree of
overlap is also low: the proportion of ΔC overlapping ΔA is just 46.1%. That is, more
than half of ΔC lies in an area of the vowel plane which is unoccupied by ΔA, as we
would expect given the significantly lower average F1 ~ F2 frequencies for adult male
speakers. The proportion of the vowel plane occupied by ΔA which lies outside ΔC
approaches 90% (86.3%). We can say, therefore, that the mapping of the samples for
these two speakers is overall very poor.
4 See http://www.rdg.ac.uk/AcaDepts/ll/speechlab/marsec/.
5 http://www.arts.nie.edu.sg/ell/davidd/data/jipa-vowels/index.htm. Note that the URL provided in the
appendix of Deterding (1997) is no longer active.
3.2 Triangles plotted using z (Bark) scale
Figure 4 shows the same data z-transformed using Traunmüller’s equation
discussed in Section 2.1.
Figure 3. Comparison of FLEECE ~ TRAP ~ GOOSE vowel triangles for Speakers A and C (linear
Hz).
[Plot axes: F2 (Hz) against F1 (Hz); key: Speaker A (Female), Speaker C (Male).]
Figure 4. Comparison of FLEECE ~ TRAP ~ GOOSE vowel triangles for Speakers A and C
(Barks).
[Plot axes: F2 (Bark) against F1 (Bark); key: Speaker A (Female), Speaker C (Male).]
There is a noticeable improvement here in terms of area ratio, the ratio of ΔC to
ΔA now being 1 : 2.76. This means that there is an improvement in agreement in area
ratio of 29.8% over the equivalent triangles on an F1 (Hz) ~ F2 (Hz) plane if we
transform the Hz measurements into Bark units. However, the extent to which the two
triangles overlap is not greatly improved: the portion of ΔC which overlaps ΔA still
accounts for just under half (49.9%) of the total area of ΔC, while the overlapping area
occupies a mere 18.1% of ΔA.
3.3 Triangles plotted using S units
If the Hz figures are transformed using the S-transform described in Section 2.2
above, however, we see dramatic improvements in both area ratio and degree of
overlap. Figure 5 shows that all but a tiny fraction of DC overlaps with DA, and that
there is a substantial improvement in the match between the areas for the two
triangles.
Figure 5. Comparison of FLEECE ~ TRAP ~ GOOSE vowel triangles for Speakers A and C
(Fn/S(Fn)).
[Plot axes: F2/S(F2) against F1/S(F1); key: Speaker A (Female), Speaker C (Male).]
Although there is still clearly a fair degree of mismatch between the areas of the
two triangles – particularly in terms of F1 differences for each of the three vowel
categories – at a ΔC : ΔA ratio of 1 : 2.16 the agreement in area is nonetheless
improved relative both to Hz (45% improvement) and to the Bark-transformed data
(21.7% improvement). Degree of overlap expressed in terms of the proportion of ΔC
overlapping with ΔA approaches complete overlap, at 99.2%. That portion of ΔA
overlapping ΔC is 45.8% of the overall area of ΔA.
3.4 Summary
To summarise the marked improvements in area and overlap agreement
resulting from S-transforming the original Hz data, the figures discussed in the
preceding paragraphs are shown in tabular form in Table 1.
Table 1. Improvements in area ratio and degree of overlap between FLEECE ~ TRAP ~ GOOSE
triangles for Speakers A (female) and C (male).
                                 Hz          Bark        S
area ratio (ΔC : ΔA)             1 : 3.93    1 : 2.76    1 : 2.16
  % improvement over Hz          -           29.8        45
  % improvement over Bark        -           -           21.7
% overlap (ΔC : ΔA)              46.1        49.9        99.2
  % improvement over Hz          -           8.2         115.2
  % improvement over Bark        -           -           98.8
% overlap (ΔA : ΔC)              13.7        18.1        45.8
  % improvement over Hz          -           32.1        234.3
  % improvement over Bark        -           -           153
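The percentage improvements in Table 1 follow directly from the raw area-ratio and overlap figures; the following sketch (a hypothetical helper, our naming) reproduces them.

```python
def pct_improvement(old, new, lower_is_better=True):
    """Percentage improvement of 'new' over 'old', as reported in Table 1."""
    return 100.0 * ((old - new) / old if lower_is_better else (new - old) / old)

# Area ratios (lower is better): 3.93 (Hz), 2.76 (Bark), 2.16 (S); to one decimal
# place, pct_improvement gives 29.8, 45.0 and 21.7 for the three comparisons in
# Table 1. For the overlap rows (higher is better), e.g.
# pct_improvement(46.1, 99.2, lower_is_better=False) gives 115.2.
```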
It may be noted from Figure 5, incidentally, that the Fn/S(Fn) values for Speaker
C’s GOOSE vowel approach (1,1). That is, his GOOSE vowel is on average very close to
the centre of gravity calculated for his vowel space on the basis of the actual and
extrapolated F1 ~ F2 values in his sample. This can be seen as a demonstration of the
advantages of not using the average F1 and F2 values for GOOSE in the calculation of
S, since if C’s GOOSE vowel has average F1 and F2 values in the central region of his
vowel space it would be unwise to treat it as a ‘back’ vowel from a phonetic point of
view. If we were to use it to represent the F1 ~ F2 minima for this speaker because we
assume it to be the closest and backest vowel that speaker could produce, we run the
risk of distorting the overall shape and underestimating the extent of Speaker C’s
maximal triangle on the F1 ~ F2 plane. Furthermore, by plotting a speaker’s actual
average F1 ~ F2 values for GOOSE and other phonologically back vowels within the
triangle whose rearward boundary is defined by the extrapolated coordinates [u′], we
gain an impression of the location of these back vowels relative to this rearward limit.
For example, we can assess whether one English speaker is in the habit of using on
average a fronter pronunciation of the GOOSE or GOAT vowels than another, and we
can, moreover, be confident that if differences of this sort are in evidence when the
relevant formant frequency values are expressed in terms of Fn/S(Fn), they will also be
found in the original Hz measurements (i.e., that they are not artefacts of the S-
transform algorithm but reflect real inter-speaker differences which are not
attributable simply to difference in VTL).
It is perhaps trivial to point out that individual vowels can be plotted on the
F1/S(F1) ~ F2/S(F2) space as easily as averaged Fn/S(Fn) values for vowel categories
can. By way of illustration, Figure 6 in Appendix 1 shows Hz and Fn/S(Fn) plots for all
the individual vowel tokens for Speaker A. We feel, however, that it is important to
note that the absence of warping of the vowel space of the sort inherent in Bark-
transformed data means that one can inspect vowel plots plotted on axes using the
Fn/S(Fn) scale as though they were plotted using Hz scales, while simultaneously
being able to map multiple plots onto one another more fully than is possible using
either the Hz or the Bark scale.
4. Conclusion
We may see from Table 1 that the S-transform allows much closer mapping of
samples for different speakers onto one another than do the original measurements in
linear Hz, and their equivalent values on the Bark scale. It outperforms the z-transform
on both criteria, and more particularly on the overlap criterion, in which
improvements are on the order of 100 – 150%.
We do not intend the above evaluation as a criticism of the Bark scale in any
other respect, however: we propose the S-transform only as a means of allowing
enhanced visual and statistical comparisons between vowel formant data sets collected
for different speakers, and do not claim it has any psychoperceptual validity (e.g. that
it mimics the normalisation process assumed to exist for the auditory processing of
speech signals, or such like). Instead, we see it solely as a useful tool for researchers
wishing to reduce inter-speaker differences resulting from variations in VTL when
performing analyses of vowel samples in, for instance, instrumental studies of vowel
variation and change.
Although the technique has been demonstrated here using only very limited amounts of
data drawn from recordings of two English speakers, we do not expect that the
effectiveness of the S-transform on the area ratio/overlap criteria will be diminished
much, if at all, when applied to data from other languages, or from larger numbers of
speakers. Although it is a relatively cumbersome algorithm to use on large samples of
vowel formant data (especially compared to converting Hz values into Bark units),
there are clear advantages - at least according to the criteria chosen for this evaluation
- to the S transform over Hz measurements or their equivalents on the Bark scale.
There are obviously great improvements that could be made, for example by finding
some means of correcting the discrepancy between male and female speakers with
respect to F1/S(F1), or perhaps by running the S-transform on z-transformed data.
There are also many other normalisation algorithms that the S-transform can be
evaluated against on the area ratio and overlap parameters used as criteria in the
present study; comparisons will be reported on in due course.
References
Adank, P., van Hout, R. & Smits, R. (2001). A comparison between human vowel
normalization strategies and acoustic vowel transformation techniques.
Proceedings of the 7th International Conference on Speech Communication and
Technology (Eurospeech 2001) Aalborg, Vol. I. pp. 481-4.
Deterding, D. (1990). Speaker normalization for automatic speech recognition.
Unpublished PhD thesis, University of Cambridge.
Deterding, D. (1997). The formants of monophthong vowels in Standard Southern
British English pronunciation. Journal of the International Phonetic Association
27. 47-55.
Disner, S.F. (1980). Evaluation of vowel normalization procedures. Journal of the
Acoustical Society of America 67(1). 253-61.
Eremeeva, V. & Stuart-Smith, J. (2003, forthcoming). A sociophonetic investigation
of the vowels OUT and BIT in Glaswegian. Proceedings of the 15th International
Congress of Phonetic Sciences, Barcelona.
Hagiwara, R. (1997). Dialect variation and formant frequency: the American English
vowels revisited. Journal of the Acoustical Society of America 102(1). 655-8.
Hindle, D. (1978). Approaches to vowel normalization in the study of natural speech.
In Sankoff, D. (ed.) Linguistic Variation: Models and Methods. New York:
Academic Press. pp. 161-72.
Iivonen, A. (1994). A psychoacoustical explanation for the number of major IPA
vowels. Journal of the International Phonetic Association 24(2). 73-90.
Koopmans-van Beinum, F.J. (1980). Vowel contrast reduction: an acoustical and
perceptual study of Dutch vowels in various speech conditions. PhD thesis,
University of Amsterdam.
Labov, W. (2001). Principles of Linguistic Change, vol. II: Social Factors. Oxford:
Blackwell.
Ladefoged, P. & Maddieson, I. (1990). Vowels of the world’s languages. Journal of
Phonetics 18. 93-122.
Lieberman, P. & Blumstein, S.E. (1988). Speech Physiology, Speech Perception, and
Acoustic Phonetics. Cambridge: Cambridge University Press.
Lobanov, B.M. (1971). Classification of Russian vowels spoken by different speakers.
Journal of the Acoustical Society of America 49. 606-8.
Moore, B.C.J. & Glasberg, B.R. (1983). Suggested formulae for calculating auditory-
filter bandwidths and excitation patterns. Journal of the Acoustical Society of
America 74. 750-3.
Peterson, G.E. & Barney, H.L. (1952). Control methods used in a study of the vowels.
Journal of the Acoustical Society of America 24. 175-84.
Roach, P., Knowles, G., Varadi, T. & Arnfield, S. (1993). MARSEC: a machine-
readable spoken English corpus. Journal of the International Phonetic
Association 23. 47-54.
Rosner, B.S. & Pickering, J.B. (1994). Vowel Perception and Production. Oxford:
Oxford University Press.
Stevens, K.N. (1998). Acoustic Phonetics. Cambridge, Mass.: MIT Press.
Traunmüller, H. (1990). Analytical expressions for the tonotopic sensory scale.
Journal of the Acoustical Society of America 88(1). 97-100.
Watson, C.I., Harrington, J. & Evans, Z. (1998). An acoustic comparison between
New Zealand and Australian English vowels. Australian Journal of Linguistics
18(2). 185-207.
Watt, D.J.L. & Tillotson, J. (2001). A spectrographic analysis of vowel fronting in
Bradford English. English World-Wide 22(2). 269-302.
Wells, J.C. (1982). Accents of English (3 vols). Cambridge: Cambridge University
Press.
Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands
(Frequenzgruppen). Journal of the Acoustical Society of America 33. 248.
Zwicker, E. & Feldtkeller, R. (1967). Das Ohr als Nachrichtenempfänger. Stuttgart:
S. Hirzel Verlag.
Dominic Watt
Department of English
University of Aberdeen
Taylor Building
Old Aberdeen
Aberdeen AB24 3UB
Scotland
d.j.l.watt@abdn.ac.uk

Anne Fabricius
English Section
Department of Language & Culture
Roskilde University
PO Box 260
DK-4000 Roskilde
Denmark
fabri@ruc.dk
Appendix 1
Figure 6. Vowel plots for Speaker A (data from Deterding 1997). Scales are in Hz (upper pane)
and in Fn/S(Fn) units (lower pane).
[Upper pane axes: F2 (Hz) against F1 (Hz); lower pane axes: F2/S(F2) against F1/S(F1). Both panes plot tokens of the FLEECE, KIT, DRESS, TRAP, STRUT, START, LOT, THOUGHT, FOOT, GOOSE and NURSE lexical sets.]
Appendix 2
Calculation of S: worked example (figures for Speaker A).
Mean F1 and F2 for [i a u′], derived from Deterding’s (1997) data:

Vowel   F1 (Hz)   F2 (Hz)
i       304       2664
a       1067      1690
u′      304       304   (i.e. both values equal to F1 for [i])

Mean F1 and F2 for S:

S(F1) = (304 + 1067 + 304) / 3 = 1675 / 3 = 558.3
S(F2) = (2664 + 1690 + 304) / 3 = 4658 / 3 = 1552.7

Speaker A’s FLEECE, TRAP and GOOSE means (Hz) are then converted into S units by
dividing each F1 value (304, 1067, 333) by S(F1) = 558.3 and each F2 value (2664,
1690, 1529) by S(F2) = 1552.7:

Vowel     F1/S(F1)   F2/S(F2)
FLEECE    0.545      1.716
TRAP      1.911      1.088
GOOSE     0.596      0.985
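For readers who wish to check the arithmetic, the figures above can be reproduced in a few lines; this is a self-contained sketch, with variable names of our own choosing.

```python
# Values in Hz, from Deterding's (1997) data for Speaker A.
means = {"FLEECE": (304.0, 2664.0), "TRAP": (1067.0, 1690.0), "GOOSE": (333.0, 1529.0)}

i_f1, i_f2 = means["FLEECE"]
a_f1, a_f2 = means["TRAP"]
s_f1 = (i_f1 + a_f1 + i_f1) / 3.0      # u' has F1 = FLEECE F1, so S(F1) = 558.33...
s_f2 = (i_f2 + a_f2 + i_f1) / 3.0      # u' has F2 = FLEECE F1, so S(F2) = 1552.67...
ratios = {v: (round(f1 / s_f1, 3), round(f2 / s_f2, 3)) for v, (f1, f2) in means.items()}
# ratios == {'FLEECE': (0.544, 1.716), 'TRAP': (1.911, 1.088), 'GOOSE': (0.596, 0.985)}
# (FLEECE F1 comes out as 0.544 rather than 0.545 only because the table above
#  divides by the rounded value 558.3.)
```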