Quantifying fundamental frequency modulation as a function of language,
speaking style and speaker
Pablo Arantes¹, Anders Eriksson²
¹Languages and Linguistics Department, São Carlos Federal University, Brazil
²Department of Linguistics, Stockholm University, Sweden
pabloarantes@protonmail.com, anders.eriksson@ling.su.se
Abstract
In this study, we outline a methodology to quantify the degree of
similarity between pairs of f0 distributions based on the
Anderson-Darling measure that underlies its namesake goodness-of-fit
test. The procedure emphasizes differences due to fine-grained f0
modulations rather than differences in measures of central tendency,
such as the mean and median. To assess the procedure's usefulness for
speaker comparison, we applied it to a multilingual corpus in which
participants contributed speech delivered in three speaking styles. The
similarity measure was calculated separately as a function of speaking
style and speaker. Between-speaker variability (different speakers, same
style) in distribution similarity varied significantly between styles:
spontaneous interview shows greater variability than read sentences and
word list in five languages (English, French, Italian, Portuguese and
Swedish); in Estonian and German, read sentences yield more variability.
Within-speaker variability (same speaker, different styles) is lower
than between-speaker variability in the style that exhibits the greatest
variability. The results point to the potential use of the proposed
methodology as a way to identify possible idiosyncratic traits in f0
distributions. They also further demonstrate the effect of speaking
styles on intonation patterns.
Index Terms: fundamental frequency, speaking style, cross-
language comparison
1. Introduction
Research on speaking styles stems from a general scientific interest in
discovering the effects of non-structural linguistic factors on the
process of speech production. There is a growing body of research on the
subject, a representative sample of which is presented in review papers
(see [1, 2] and references therein). Knowledge about this subject can
also be brought to bear on more applied endeavors, such as forensic
phonetics research and practice. In this context, questioned and
reference samples can be produced in different styles. A questioned
sample can be an informal (often bugged) phone conversation and the
reference sample can be an interview conducted at a police station in a
more formal and tense setting or a recorded public speech taken from a
YouTube video¹. This mismatch can be a problem for speaker comparison
protocols, since the same speaker can vary some of his/her speech
patterns quite substantially across different styles (as they would,
also, when speaking in different emotional states), raising the
likelihood of false negatives.

¹Using publicly available speech recordings as reference samples in
speaker comparison tasks is becoming a common scenario in the casework
done at the Brazilian Federal Police Crime Lab according to personal
communication with an expert working there.

In the field of prosody, interest in speaking styles led to
research, among other things, on the effects of speaking style
on the production of a number of acoustic parameters usually
taken as correlates of word stress and how they relate to stress
perception, such as the work reported in [3, 4, 5, 6]. Here we
take the same speech material used in these studies and look
into the effects of speaking style on voice quality, a dimension
still unexplored in the context of the corpus analyzed in the cited
papers.
In previous work, we analyzed this same corpus in search of effects of
language, speaking style and speaker sex on global measures of
fundamental frequency [2]. There we focused on measures of central
tendency and variability of f0, and a number of statistically
significant effects were uncovered. When analyzing the data, we found a
number of speakers who made extensive use of non-modal phonation,
especially creaky voice (in the most extreme cases, up to 30% of all
voiced analysis frames in the audio could be associated with
creakiness). In such cases, an interaction between creaky voice and f0
emerged in the form of strongly non-unimodal f0 distributions, given
that one of the correlates of creakiness is a sudden drop in f0 (see [7]
and others). This result is in line with reports in the literature that
f0 distributions are usually non-normal in the statistical sense,
although the cause most commonly pointed out is skewness, not lack of
unimodality. These findings suggest that it is worth exploring the
effect of factors such as speaking style on f0 beyond measures of
central tendency and dispersion. Here we follow up on these findings and
present a procedure designed to compare the overall shape of f0
distributions. Our intent is to assess the effect of speaking style on
fine-grained differences in f0 histograms and on their indexical
properties, results that may have implications for forensic phonetics.
2. Material and methods
2.1. Speakers and speech material
The speech material is a subset of a database of recordings used for a
study of lexical stress in a number of languages. The data used in the
present study were recordings in Brazilian Portuguese, British English,
Estonian, French, German, Italian and Swedish by 5 male and 5 female
speakers for each language. Great care had been taken in selecting the
speakers to minimize variability due to regional variation and age. All
spoke a well-defined regional standard. Speaker age variation was the
same for all languages within narrow margins. For the entire database,
ages ranged between 18 and 35, with most speakers in the 20-30 year
range. The averages ranged between 23 and 26 for the different
languages. The speakers were also closely matched with respect to
educational background and were native speakers of the languages.
Recordings were all made at universities in the countries where the
studied languages are spoken. The data represent three different
speaking styles: spontaneous speech, read phrases and read words.
Spontaneous speech was elicited in informal interviews by a native
speaker. Transcriptions of these recordings were used to produce
manuscripts for the other two speaking styles. Phrases were selected
where speech was fluent, had no speech errors and contained suitable
target words. At a later stage, the speakers were called back and asked
to read the phrases and words they had produced in their spontaneous
speech. This way we obtained identical linguistic content in all three
speaking styles. For a more detailed description please refer to
[3, 4, 5, 6]².

Copyright © 2019 ISCA
INTERSPEECH 2019, September 15–19, 2019, Graz, Austria
http://dx.doi.org/10.21437/Interspeech.2019-2857

2.2. Acoustic analysis
Before the f0 extraction phase, audio files were preprocessed. Stretches
that contained the speech of the experimenter, overlap between speaker
and experimenter and non-speech events were silenced. This was done to
minimize f0 extraction errors.
f0 contours were extracted using a Praat script³ that implements a
heuristic suggested by Hirst [9], which optimizes the floor and ceiling
values passed to Praat's To Pitch (ac) autocorrelation-based extraction
function [10] by means of a two-pass procedure. In the first pass, the
Pitch object is extracted using 50 and 700 Hz as floor and ceiling
estimates. In the second pass, another Pitch object is extracted using
optimal values for floor and ceiling estimated from the voiced samples
in the first Pitch object. The values are obtained using the following
formulae:

$$f_{\mathrm{floor}} = 0.7 \cdot q_1$$
$$f_{\mathrm{ceiling}} = 1.5 \cdot q_3,$$

where $q_1$ and $q_3$ are respectively the first and third quartiles of
the voiced samples in the first Pitch object. Hirst suggests that the
constant for the ceiling value can optionally be set to 2.5 in case the
speaker makes use of an extended range. The extracted contours were
later checked individually and corrected by an analyst trained for the
task. The errors most commonly detected by this procedure were octave
halving or doubling and incorrect voicing detection, usually in
fricatives or transient noise in plosive releases. Cases such as
incorrect unvoicing of frames, which can occur during glottalized
phonation, had to be found by the analyst by comparing the f0 contour
with both the oscillogram and spectrogram.
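
The floor and ceiling heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Praat script used in the study: the function name `second_pass_range`, the sample values and the quartile interpolation rule (`method="inclusive"`) are hypothetical choices, and Praat may compute quantiles slightly differently.

```python
from statistics import quantiles


def second_pass_range(first_pass_hz, ceiling_factor=1.5):
    """Return (floor, ceiling) in Hz for the second extraction pass.

    first_pass_hz holds the voiced f0 samples (Hz) from the first pass,
    which the text runs with a 50-700 Hz range. Unvoiced frames are
    assumed to have been removed already.
    """
    # With n=4, quantiles() returns the three quartile cut points.
    # The interpolation rule is an assumption of this sketch.
    q1, _median, q3 = quantiles(first_pass_hz, n=4, method="inclusive")
    return 0.7 * q1, ceiling_factor * q3


# Hypothetical voiced samples, not data from the corpus.
voiced = [98.0, 102.5, 110.0, 118.4, 125.0, 131.2, 140.0, 155.3, 171.9]
floor_hz, ceiling_hz = second_pass_range(voiced)
wide_floor, wide_ceiling = second_pass_range(voiced, ceiling_factor=2.5)
```

Passing `ceiling_factor=2.5` corresponds to the extended-range option attributed to Hirst in the text.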
²The texts in 2.1 and 2.2 are slightly modified versions of the
corresponding sections in [8] where the same speech material was used,
but for a completely different purpose.
³https://github.com/parantes/better-f0/

2.3. Measuring f0 distribution similarity
In order to quantify overall differences between the shapes of two f0
histograms, we designed a procedure that consists of applying the
k-sample test based on the Anderson-Darling measure (A²) of agreement
between distributions. This test, described by [11] and implemented in
the kSamples R package [12], can be used to test if two data samples
follow the same underlying unspecified distribution. The
Anderson-Darling (A-D) test is a modification of the well-known
Kolmogorov-Smirnov test (K-S) that is used for the same purpose. They
differ, however, in that A-D gives more weight to the tails of the
distributions, while K-S is more sensitive to deviations in the center
of the distribution. For our purposes, we take the A² value as an index
of similarity between a given pair of histograms, so that the larger the
value, the more different the two distributions are considered to be. We
are not interested in the p-value provided by the test, only in the A²
statistic.
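
To make the statistic concrete, the sketch below computes the two-sample form of the Scholz and Stephens statistic [11] in plain Python. It is an illustrative re-implementation, not the kSamples code used in the study: the function name `ad_statistic` is hypothetical, the formula used is the one for data without ties, and no p-value is computed, since the text does not rely on it.

```python
from bisect import bisect_right


def ad_statistic(x, y):
    """Two-sample Anderson-Darling A^2 (no-ties form, assumed here).

    A^2 = (1/N) * sum_i (1/n_i) * sum_{j=1}^{N-1}
          (N*M_ij - j*n_i)^2 / (j*(N - j)),
    where M_ij counts the values of sample i that are <= the j-th
    smallest value of the pooled sample of size N.
    """
    samples = [sorted(x), sorted(y)]
    pooled = sorted(samples[0] + samples[1])
    n_total = len(pooled)
    a2 = 0.0
    for sample in samples:
        inner = 0.0
        for j in range(1, n_total):
            # Number of values in this sample not greater than pooled[j-1].
            m_ij = bisect_right(sample, pooled[j - 1])
            inner += (n_total * m_ij - j * len(sample)) ** 2 / (j * (n_total - j))
        a2 += inner / len(sample)
    return a2 / n_total


# Toy samples, not data from the corpus.
interleaved = ad_statistic([1.0, 3.0], [2.0, 4.0])  # similar samples
separated = ad_statistic([1.0, 2.0], [3.0, 4.0])    # dissimilar samples
```

In this toy case the separated samples yield the larger value, matching the reading of A² as a dissimilarity index. Real f0 contours may contain tied values, for which [11] gives an adjusted formula that this sketch does not implement.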
Prior to test application, raw values in the f0 contours were converted
to the OMe scale, proposed by [13], using the formula

$$f_{\mathrm{ome}} = \log_2\left(\frac{f_{\mathrm{hz}}}{f_{\mathrm{med}}}\right),$$

where $f_{\mathrm{hz}}$ is a value in Hz, $f_{\mathrm{med}}$ is the
median of the f0 values in the sample of interest and
$f_{\mathrm{ome}}$ is the corresponding value in the OMe scale.
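
As a small worked example of this conversion, the Python sketch below maps a contour in Hz to octaves relative to its own median. The function name `hz_to_ome` and the three-point contour are hypothetical and serve only to illustrate the formula.

```python
from math import log2
from statistics import median


def hz_to_ome(f0_hz):
    """Convert f0 values in Hz to octaves relative to the sample median."""
    f_med = median(f0_hz)
    return [log2(f_hz / f_med) for f_hz in f0_hz]


# Hypothetical contour: one octave below, at, and one octave above the median.
contour_hz = [100.0, 200.0, 400.0]
contour_ome = hz_to_ome(contour_hz)  # [-1.0, 0.0, 1.0]
```

After the conversion the median of every sample is 0, which is why location differences between distributions no longer contribute to the comparison.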
The scale conversion centers all histograms on the speaker-specific
median value, such that differences in location among distributions are
rendered uninformative. The rationale for performing the transformation
is that the usefulness of differences in measures such as the mean or
median f0 for speaker comparison has already been explored extensively
(see [14, 15]). Here we are more focused on finding out if factors such
as speaker and speaking style have an effect on other aspects that
contribute to the shape of f0 distributions, such as dispersion, range,
skewness and kurtosis.
After the scale conversion, each f0 contour was transformed into a
smoothed histogram, that is, a kernel density estimate of the
probability density of the data. This transformation was done with the
bkde function included in the KernSmooth package for the R statistical
environment [16]. The dpik function in the same package was used to
select the optimal bandwidth for the kernel density estimate.
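
The idea of a smoothed histogram can be illustrated with a plain Gaussian kernel density estimate. The Python sketch below is not a port of bkde or dpik: it evaluates the kernel sum directly instead of binning, and it swaps in the simple normal-reference rule of thumb (1.06 · sd · n^(-1/5)) for the plug-in bandwidth selector. All names and data values are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import stdev


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth, a simple stand-in for dpik."""
    return 1.06 * stdev(values) * len(values) ** (-1 / 5)


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    scale = 1.0 / (len(values) * bandwidth * sqrt(2.0 * pi))
    return [
        scale * sum(exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]


# Hypothetical OMe-scaled f0 values (octaves relative to the median).
ome = [-0.42, -0.30, -0.11, -0.05, -0.02, 0.02, 0.09, 0.21, 0.35, 0.60]
h = rule_of_thumb_bandwidth(ome)
lo, hi = min(ome) - 4 * h, max(ome) + 4 * h
step = (hi - lo) / 400
grid = [lo + k * step for k in range(401)]
density = gaussian_kde(ome, grid, h)
area = sum(density) * step  # Riemann sum; close to 1 for a valid density
```

A different bandwidth changes how smooth the curve is, so curves produced with this rule of thumb would not necessarily match those obtained with dpik.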
Figure 1 shows the most similar and the most dissimilar pairs of
smoothed histograms according to the A² statistic, considering the
entire corpus, but only within-language pairs. The statistic has a value
of 1.1 for the pair on the left and 4878 for the pair on the right, a
difference spanning three orders of magnitude. Histograms in the most
similar pair are almost identical. Visual inspection of the least
similar pair suggests differences in modal value location, modal
density, spread, range, skewness and distribution “peakedness”, the
green one being more pointed and the blue one being more flat-topped.
Figure 1: Most similar and least similar pairs of histograms. On the
left, the histograms represent one sentence reading contour and one
word list reading contour, taken from the Swedish data. On the right,
the histograms represent interview contours from the Italian data.
We generated plots of the five most similar and five most dissimilar
histogram pairs of each language, and a visual inspection of them shows
that the results are always very similar to the ones seen in figure 1.
Carefully listening to the corresponding audio samples confirms that the
most similar distributions present in general less f0 modulation, as
could be expected from the fact that histograms such as the pair
pictured in the left-hand part of figure 1 show almost no difference in
the amount of spread around the central value. Listening to the pairs
with the greatest degree of dissimilarity confirmed that they differ
markedly in terms of amount of f0 modulation: in general, one of the
samples has numerous cases of high amplitude f0 excursions while the
other has a much less varied contour. This situation is compatible with
what is pictured in the right-hand side of figure 1, where the blue
histogram is more asymmetrical and has a heavier positive tail (caused
by cases of more extreme upward f0 excursions) when contrasted with the
green one, which is more symmetrical and less spread in comparison. The
same conclusion can be drawn by looking at the contours in figure 2,
where a representative stretch of the contours that generated the
histograms on the right-hand side of figure 1 is shown. It can be easily
seen that the blue contour has much wider excursions and spans a greater
range than the green one.
Figure 2: Representative stretches of the f0 contours of the least
similar pair of histograms. Colors are the same as in figure 1.
2.4. Statistical analysis
The experimental design consisted of three independent variables (IV)
with differing numbers of levels:
• LANGUAGE (7 levels): British English, Estonian, French, German,
Italian, Brazilian Portuguese and Swedish.
• Speaking STYLE (3 levels): spontaneous interview, sentence reading
and word list reading.
• SPEAKER (10 levels): 5 female and 5 male speakers per language.
The dependent variable (DV) was always the value of the A² statistic
obtained when comparing a pair of f0 distributions. All pairwise
combinations of style (three levels) and speaker (ten speakers) were
compared, yielding 435 (= $\binom{30}{2}$) A² values per language.
Comparisons were always within-language.
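
The pair counts involved can be checked with a short enumeration. In the Python sketch below the speaker labels are hypothetical placeholders; the split into same-style pairs from different speakers and different-style pairs from the same speaker anticipates the two analyses reported in the Results section.

```python
from itertools import combinations, product

styles = ["interview", "sentences", "words"]
speakers = [f"S{i:02d}" for i in range(1, 11)]  # hypothetical labels

# One f0 distribution per speaker and style: 10 x 3 = 30 per language.
contours = list(product(speakers, styles))
pairs = list(combinations(contours, 2))

# Different speakers, same style (between-speaker comparisons).
between_speaker = [(a, b) for a, b in pairs if a[0] != b[0] and a[1] == b[1]]
# Same speaker, different styles (within-speaker comparisons).
within_speaker = [(a, b) for a, b in pairs if a[0] == b[0]]

per_style = {
    s: sum(1 for a, b in between_speaker if a[1] == s) for s in styles
}
```

Here `len(pairs)` is 435, each style contributes 45 between-speaker pairs, and each speaker contributes 3 within-speaker pairs (30 in total per language).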
We looked at possible significant effects of STYLE and SPEAKER on the
mean value of A². To test for significant effects on mean values, we
used the Kruskal–Wallis test [17] when samples were heteroscedastic or
Analysis of Variance if they were homoscedastic. The Fligner-Killeen
test was used to test for homoscedasticity [18]. When the IV being
tested had more than two levels, paired t-tests or Mann–Whitney U tests
with Holm-corrected p-values [19] were used to check for differences
among levels. An α level of 5% was adopted for all tests.
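
As an illustration of the heteroscedastic branch of this procedure, the sketch below computes the Kruskal–Wallis H statistic in plain Python. It is a hedged sketch, not the code used in the study: the function names are hypothetical, the correction for ties is omitted, and the p-value uses the chi-square approximation, which for three groups (two degrees of freedom, as with the three styles) reduces to exp(-H/2).

```python
from math import exp


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic using midranks (no tie correction)."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)


def p_value_three_groups(h):
    """Chi-square survival function for 2 degrees of freedom."""
    return exp(-h / 2.0)


# Hypothetical values for three styles, not data from the study.
toy_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat = kruskal_wallis_h(toy_groups)
p_val = p_value_three_groups(h_stat)
```

For these fully separated toy groups H is 7.2 and the approximate p-value is about 0.027, below the 5% α level.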
3. Results
3.1. Effect of style
In this analysis, the mean A² value for each style is the result of
averaging the 45 values generated by the pairwise comparison of the ten
individual f0 histograms (one per speaker). Languages are analyzed
separately. We take the results as an indication of the level of
between-speaker variability in f0 distribution similarity. A significant
effect can be taken as evidence that f0 distribution similarity is not
uniform among styles. Table 1 shows the mean A² as a function of style
and language.
Table 1: Mean A² value (±95% confidence interval in parentheses) of
between-speaker comparisons as a function of style and language.

Language     Interview     Sentences     Words
English      1047 (261)    118 (33)      103 (22)
Estonian     58.8 (19.3)   323 (90.4)    124 (34.4)
French       714 (222)     280 (71.1)    90.2 (19.7)
German       60.0 (19.6)   251 (69.0)    122 (25.0)
Italian      1334 (357)    465 (130)     284 (78.1)
Portuguese   784 (292)     405 (113)     144 (37.3)
Swedish      943 (269)     138 (37.4)    98.6 (25.7)

The samples of the seven languages are heteroscedastic (all
Fligner-Killeen tests yielded p < 0.05). The Kruskal–Wallis test pointed
to a significant effect of speaking style in each language (all
p < 0.05). Multiple comparison tests indicate that languages can be
grouped in two main groups. The largest one comprises English, French,
Italian, Portuguese and Swedish. For this group, the spontaneous
interview is the style with the highest A² values. There is no
statistically significant difference between sentences and words in
English, Italian and Swedish. In Portuguese, there is no significant
difference between interview and sentences, and both styles have mean
values greater than the word reading style. In French, all differences
are significant in the following order: interview > sentences > words.
The second group comprises Estonian and German. For both, all
differences are significant, but the order is: sentences > words >
interview.
Results indicate that each language has some significant variation in
f0 distributions across styles. For a group of five languages (English,
French, Italian, Portuguese and Swedish), spontaneous interview is the
style that yields the highest levels of inter-speaker variation in
distribution dissimilarity. For Estonian and German, on the other hand,
sentence reading is the style with the greatest mean level of
distribution dissimilarity. The effect size observed when the interview
style has the highest mean A² value is greater than when the sentence
style has the highest mean value.
We listened to the audio files of the sentence reading pairs with the
five largest A² values in Estonian and German and confirmed that it was
always the case that one had a more level f0 contour and the other
showed excursions of greater amplitude. In both languages, spontaneous
interview is the style with the lowest degree of f0 contour
dissimilarity between speakers. Contours in spontaneous speech are more
symmetrical and show less modulation (narrower f0 excursions and
range).
This result points to two consistent effects: first, some speaking
styles tend to yield more between-speaker variability in terms of f0
contour modulation; second, language affects how different speech
styles are implemented. Speakers of one group of languages tend to
implement livelier (more modulated) f0 contours when speaking
spontaneously and speakers of the other group when reading sentences. We
speculate that this may be related to different cross-cultural practices
regarding spontaneous and read speech.
3.2. Effect of speaker
In this analysis, the mean A² value for each speaker is the result of
averaging three data points, corresponding to the inter-style
comparisons (interview–sentences, interview–words and sentences–words).
Besides being an inter-style comparison, we also take this analysis as
an indication of the level of within-speaker variability in f0
distribution similarity, since the data are presented as a function of
each of the ten speakers. A significant effect can be taken as evidence
that variation in f0 distribution similarity caused by the three styles
is not similar among speakers. Table 2 shows the overall (all ten
speakers collapsed) mean A² as a function of language.
Table 2: Overall mean speaker A² value (±95% confidence interval in
parentheses) of within-speaker/between-style comparisons as a function
of language.

Language     Mean
English      175 (62.1)
Estonian     74.2 (24.0)
French       141 (60.2)
German       99.3 (26.6)
Italian      433 (187)
Portuguese   207 (78.6)
Swedish      129 (49.5)

The samples of the seven languages are all homoscedastic
(Fligner-Killeen tests were non-significant). The ANOVA tests run
separately for each language were all non-significant as well,
indicating that speakers' means are not statistically different. This
set of results seems to indicate that the effect of speaking styles on
the variation of f0 distribution shape is uniform among speakers and
that this result holds for the seven languages studied here.
As we said earlier in this section, we consider this analysis to be a
way to estimate within-speaker variability, even though the fact that
the three samples each speaker contributes vary in terms of speaking
style can be seen as a confounding factor. We interpret the results
reported in the previous paragraph as an indication that the possible
confounding influence of speaking style is not great in this case.
A second analysis used one-sample t-tests to compare, for each language
separately, the overall A² mean for the speakers with the mean A² values
observed in the analysis reported in the previous section. This is a way
to determine if the A² measure varies more due to within-speaker or
between-speaker factors. To illustrate, we can take from table 2 the
mean value of 175 corresponding to the overall A² mean for English
speakers and compare it with the value 1047 taken from table 1, the
mean A² value reflecting the between-speaker variability observed in the
interview style in English. The one-sample t-test comparing both values
yields a significant result [t(29) = −27.508, p < 0.001]. That puts mean
within-speaker variability at a lower level when compared to
between-speaker variability in the interview style. We get statistically
significant results for the other four languages (French, Italian,
Portuguese and Swedish) for which the interview style has the highest
value in table 1. For the other two languages, Estonian and German, we
get statistically significant results when comparing the values in
table 2 with the mean values in table 1 for the sentence reading style.
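
The one-sample comparison can be sketched as follows. This Python snippet only shows how the t statistic and its degrees of freedom are obtained; the function name `one_sample_t` and the input values are hypothetical, and the individual within-speaker A² values from the study are not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, reference):
    """Return (t, df) for a one-sample t-test against a fixed reference."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)
    return (mean(values) - reference) / standard_error, n - 1


# Hypothetical within-speaker values compared with a larger reference,
# standing in for a between-speaker mean taken from table 1.
toy_values = [150.0, 170.0, 180.0, 200.0, 175.0]
t_stat, df = one_sample_t(toy_values, 1047.0)
```

A strongly negative t, as in the English example, indicates that the within-speaker values lie well below the between-speaker reference. With the 30 within-speaker values available per language (ten speakers, three style pairs each), the test has 29 degrees of freedom, consistent with the t(29) reported in the text.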
This result is encouraging in terms of the usefulness of f0
distribution shape in speaker comparison tasks because, if con-
firmed in further studies, it indicates that this feature has the
hallmark of a good parameter in forensic speaker comparison:
within-speaker variability is less than between-speaker variabil-
ity.
4. Discussion and conclusion
In this study we report a method to compare pairs of f0 distributions
and quantify their difference in a way that emphasizes differences in
the overall shape of the distributions' histograms. Using this metric,
we have shown that speaking style has a significant effect on the
shaping of f0 distributions. Depending on the language, interview or
sentence reading is the style in which speakers differ the most in terms
of distribution shape. We have also shown that, according to our metric,
f0 contours by the same speaker speaking in different styles vary less
than the contours of different speakers speaking in the same style
(especially the spontaneous style). This result is encouraging in terms
of the usefulness of f0 distribution shape in speaker comparison tasks
because, if confirmed in further studies, it indicates that this feature
has the hallmark of a good parameter in forensic speaker comparison:
within-speaker variability is less than between-speaker variability.
In a speaker comparison scenario, forensic experts should concentrate
on analyzing spontaneous styles, such as interviews, because differences
among speakers' f0 distributions are more likely to be found in those
styles. We are measuring distribution similarity on the basis of a
single number (the A² statistic). Future work should tease out which
statistical descriptors (range, asymmetry, kurtosis, etc.) correlate
better with our metric. Also, it would be important in the future to
measure within-speaker variability by means of different samples of the
same speaker in the same style. Here we only had one sample in each
style per speaker.
5. Acknowledgements
This work has been supported by a joint grant from The Swedish
Foundation for International Cooperation in Research and Higher
Education (STINT) and Brazil's Coordenação de Aperfeiçoamento de Pessoal
de Nível Superior (CAPES) grant 88881.155645/2017-01.
6. References
[1] J. Llisterri, “Speaking styles in speech research,” in
ELSNET/ESCA/SALT Workshop on Integrating Speech and Natural Language,
Dublin, Ireland, 1992.
[2] P. Arantes and M. E. N. Linhares, “Efeito da língua, estilo de
elocução e sexo do falante sobre medidas globais da frequência
fundamental,” Letras de Hoje, vol. 52, no. 1, pp. 26–39, 2017.
[3] P. A. Barbosa, A. Eriksson, and J. Åkesson, “On the robustness of
some acoustic parameters for signaling word stress across styles in
Brazilian Portuguese,” in Proceedings of Interspeech 2013, 2013,
pp. 282–286.
[4] P. Lippus, E. L. Asu, and M.-L. Kalvik, “An acoustic study of
Estonian word stress,” in Proceedings of Speech Prosody 2014, 2014,
pp. 232–235.
[5] A. Eriksson and M. Heldner, “The acoustics of word stress in English
as a function of stress level and speaking style,” in Proceedings of
Interspeech 2015, 2015, pp. 41–45.
[6] A. Eriksson, P. M. Bertinetto, M. Heldner, R. Nodari, and G. Lenoci,
“The acoustics of lexical stress in Italian as a function of stress
level and speaking style,” in Proceedings of Interspeech 2016, 2016,
pp. 1059–1063.
[7] J. Laver, The phonetic description of voice quality. New York:
Cambridge University Press, 1980.
[8] P. Arantes, A. Eriksson, and S. Gutzeit, “Effect of language,
speaking style and speaker on long-term F0 estimation,” in Interspeech
2017. Stockholm: ISCA, 2017, pp. 3897–3901.
[9] D. J. Hirst, “The analysis by synthesis of speech melody: from data
to models,” Journal of Speech Sciences, vol. 1, no. 1, pp. 55–83, 2011.
[10] P. Boersma, “Accurate short-term analysis of the fundamental
frequency and the harmonics-to-noise ratio of a sampled sound,”
Proceedings of the Institute of Phonetic Sciences, vol. 17, pp. 97–110,
1993.
[11] F. W. Scholz and M. A. Stephens, “K-Sample Anderson-Darling Tests,”
Journal of the American Statistical Association, vol. 82, no. 399,
pp. 918–924, Sep. 1987.
[12] F. Scholz and A. Zhu, kSamples: K-Sample Rank Tests and their
Combinations, 2017, R package version 1.2-7. [Online]. Available:
https://CRAN.R-project.org/package=kSamples
[13] C. De Looze and D. J. Hirst, “The OMe (octave-median) scale: A
natural scale for speech melody,” in Proceedings of the 7th
International Conference on Speech Prosody, N. Campbell, D. Gibbon, and
D. Hirst, Eds., Dublin, 2014, pp. 910–914.
[14] Y. Kinoshita, S. Ishihara, and P. Rose, “Exploring the
discriminatory potential of F0 distribution parameters in traditional
forensic speaker recognition,” The International Journal of Speech,
Language and the Law, vol. 16, no. 1, pp. 91–111, 2009.
[15] Y. Kinoshita and I. Shunichi, “F0 can tell us more: speaker
verification using the long term distribution,” in Proceedings of the
Australasian International Conference on Speech Science and Technology
2010, Melbourne, Australia, 2010, pp. 50–53.
[16] M. Wand, KernSmooth: Functions for Kernel Smoothing Supporting Wand
& Jones (1995), 2015, R package version 2.23-15. [Online]. Available:
https://CRAN.R-project.org/package=KernSmooth
[17] W. H. Kruskal and W. A. Wallis, “Use of ranks in one-criterion
variance analysis,” Journal of the American Statistical Association,
vol. 47, no. 260, pp. 583–621, 1952.
[18] W. J. Conover, M. E. Johnson, and M. M. Johnson, “A comparative
study of tests for homogeneity of variances, with applications to the
outer continental shelf bidding data,” Technometrics, vol. 23,
pp. 351–361, 1981.
[19] S. Holm, “A simple sequentially rejective multiple test procedure,”
Scandinavian Journal of Statistics, vol. 6, pp. 65–70, 1979.