Quantifying Fundamental Frequency Modulation as a Function of Language, Speaking Style and Speaker


Pablo Arantes¹, Anders Eriksson²
¹Languages and Linguistics Department, São Carlos Federal University, Brazil
²Department of Linguistics, Stockholm University, Sweden
Abstract
In this study, we outline a methodology to quantify the degree of similarity between pairs of f0 distributions based on the Anderson-Darling measure that underlies its namesake goodness-of-fit test. The procedure emphasizes differences due to more fine-grained f0 modulations rather than differences in measures of central tendency, such as the mean and median. In order to assess the procedure's usefulness for speaker comparison, we applied it to a multilingual corpus in which participants contributed speech delivered in three speaking styles. The similarity measure was calculated separately as a function of speaking style and speaker. Between-speaker variability (different speakers, same style) in distribution similarity varied significantly between styles: spontaneous interview shows greater variability than read sentences and word list in five languages (English, French, Italian, Portuguese and Swedish); in Estonian and German, read sentences yield more variability. Within-speaker variability (same speaker, different styles) levels are lower than between-speaker levels in the style that exhibits the greatest variability. The results point to the potential use of the proposed methodology as a way to identify possible idiosyncratic traits in f0 distributions. They also further demonstrate the effect of speaking styles on intonation patterns.
Index Terms: fundamental frequency, speaking style, cross-language comparison
1. Introduction
Research on speaking styles stems from a general scientific interest in discovering the effects of non-structural linguistic factors on the process of speech production. There is a growing body of research on the subject, a representative sample of which is presented in review papers (see [1, 2] and references therein). Knowledge about this subject can also be brought to bear on more applied endeavors, such as forensic phonetics research and practice. In this context, questioned and reference samples can be produced in different styles. A questioned sample can be an informal (often bugged) phone conversation and the reference sample can be an interview conducted at a police station in a more formal and tense setting or a recorded public speech taken from a YouTube video¹. This mismatch can be a problem for speaker comparison protocols, since the same speaker can vary some of his/her speech patterns quite substantially across different styles (as they would, also, when speaking in different emotional states), raising the likelihood of false negatives.
¹Using publicly available speech recordings as reference samples in speaker comparison tasks is becoming a common scenario in the casework done at the Brazilian Federal Police Crime Lab according to personal communication with an expert working there.
In the field of prosody, interest in speaking styles led to research on, among other things, the effects of speaking style on the production of a number of acoustic parameters usually taken as correlates of word stress and how they relate to stress perception, such as the work reported in [3, 4, 5, 6]. Here we take the same speech material used in these studies and look into the effects of speaking style on voice quality, a dimension still unexplored in the context of the corpus analyzed in the cited studies.
In previous work, we analyzed this same corpus in search of effects of language, speaking style and speaker sex on global measures of fundamental frequency [2]. There we focused on measures of central tendency and variability of f0, and a number of statistically significant effects were uncovered. When analyzing the data, we found a number of speakers who made extensive use of non-modal phonation, especially creaky voice (in the most extreme cases, up to 30% of all voiced analysis frames in the audio could be associated with creakiness). In such cases, an interaction between creaky voice and f0 emerged in the form of strongly non-unimodal f0 distributions, given that one of the correlates of creakiness is a sudden drop in f0 (see [7] and others). This result is in line with reports in the literature that f0 distributions are usually non-normal in the statistical sense, although the most common cause pointed out for this is skewness, not lack of unimodality. These findings suggest that it is worth exploring the effect of factors such as speaking style on f0 beyond measures of central tendency and dispersion. Here we follow up on these findings and present a procedure designed to compare the overall shape of f0 distributions, with the intent of assessing the effect of speaking style on the fine-grained differences in f0 histograms and their indexical properties, results that may have implications for forensic
2. Material and methods
2.1. Speakers and speech material
The speech material is a subset of a database of recordings used for a study of lexical stress in a number of languages. The data used in the present study were recordings in Brazilian Portuguese, British English, Estonian, French, German, Italian and Swedish by 5 male and 5 female speakers for each language. Great care was taken in selecting the speakers to minimize variation due to region and age. All spoke a well-defined regional standard. Speaker age variation was the same for all languages within narrow margins. For the entire database, ages ranged between 18 and 35, with most speakers in the 20-30 year range. The averages ranged between 23 and 26 for the different languages. The speakers were also closely matched with respect to educational background and were native speakers of the languages. Recordings were all made at universities in the countries where the studied languages are spoken. The data represent three different speaking styles: spontaneous speech, read phrases and read words. Spontaneous speech was elicited in informal interviews by a native speaker. Transcriptions of these recordings were used to produce manuscripts for the other two speaking styles. Phrases were selected where speech was fluent, had no speech errors and contained suitable target words. At a later stage, the speakers were called back and asked to read the phrases and words they had produced in their spontaneous speech. This way we obtained identical linguistic content in all three speaking styles. For a more detailed description please refer to [3, 4, 5, 6]².
Copyright © 2019 ISCA. September 15–19, 2019, Graz, Austria.
2.2. Acoustic analysis
Before the f0 extraction phase, audio files were preprocessed. Stretches that contained the speech of the experimenter, overlap between speaker and experimenter and non-speech events were silenced. This was done to minimize f0 extraction errors.
f0 contours were extracted using a Praat script³ that implements a heuristic suggested by Hirst [9], which optimizes floor and ceiling values passed to Praat's To Pitch (ac) autocorrelation-based extraction function [10] by means of a two-pass procedure. In the first pass, the Pitch object is extracted using 50 and 700 Hz as floor and ceiling estimates. In the second pass, another Pitch object is extracted using optimal values for floor and ceiling estimated from the voiced samples in the first Pitch object. The values are obtained using the following formulae:
$$f_{\mathrm{floor}} = 0.7 \cdot q_1$$
$$f_{\mathrm{ceiling}} = 1.5 \cdot q_3,$$
where $q_1$ and $q_3$ are respectively the first and third quartiles of the voiced samples in the first Pitch object. Hirst suggests that the constant for the ceiling value can optionally be set to 2.5 in case the speaker makes use of an extended range.
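The second-pass estimates can be sketched in a few lines of Python. This is only an illustration of the two formulae, not the Praat script used in the study: the function name `second_pass_range` and the toy f0 values are hypothetical, and the quartiles come from Python's `statistics.quantiles`, whose interpolation rule may differ from the one Praat applies.

```python
from statistics import quantiles


def second_pass_range(voiced_f0_hz, ceiling_factor=1.5):
    """Return (floor, ceiling) in Hz from first-pass voiced f0 samples.

    floor = 0.7 * q1 and ceiling = ceiling_factor * q3, where q1 and q3
    are the first and third quartiles. Pass ceiling_factor=2.5 for
    speakers who use an extended range.
    """
    # quantiles(..., n=4) returns the three cut points [q1, q2, q3].
    q1, _, q3 = quantiles(voiced_f0_hz, n=4, method="inclusive")
    return 0.7 * q1, ceiling_factor * q3


# Hypothetical first-pass voiced samples (Hz), for illustration only.
first_pass = [100.0, 110.0, 120.0, 130.0, 140.0]
floor_hz, ceiling_hz = second_pass_range(first_pass)
print(f"floor = {floor_hz:.1f} Hz, ceiling = {ceiling_hz:.1f} Hz")
```

In an actual workflow the returned pair would be passed back to the pitch extractor as the floor and ceiling of the second pass.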
The extracted contours were later checked individually and corrected by an analyst trained for the task. The errors most commonly detected by this procedure were octave halving or doubling and incorrect voicing detection, usually in fricatives or transient noise in plosive releases. Cases such as incorrect unvoicing of frames, which can occur during glottalized phonation, had to be found by the analyst by comparing the f0 contour with both the oscillogram and
²The texts in 2.1 and 2.2 are slightly modified versions of the corresponding sections in [8], where the same speech material was used, but for a completely different purpose.
2.3. Measuring f0 distribution similarity
In order to quantify overall differences between the shapes of two f0 histograms, we designed a procedure that consists of applying the k-sample test based on the Anderson-Darling measure (A²) of agreement between distributions. This test, described in [11] and implemented in the kSamples R package [12], can be used to test whether two data samples follow the same underlying unspecified distribution. The Anderson-Darling (A-D) test is a modification of the well-known Kolmogorov-Smirnov (K-S) test, which is used for the same purpose. They differ, however, in that A-D gives more weight to the tails of the distributions, while K-S is more sensitive to deviations in the center of the distribution. For our purposes, we take the A² value as an index of similarity between a given pair of histograms, so that the larger the value, the more different the two distributions are considered to be. We are not interested in the p-value provided by the test, only in the A² statistic.
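To make the statistic concrete, the sketch below computes a k-sample Anderson-Darling value in plain Python. It is an illustration under stated assumptions, not the kSamples implementation used in the study: it follows the no-ties form of the statistic, $A^2 = \frac{1}{N}\sum_i \frac{1}{n_i}\sum_{j=1}^{N-1} \frac{(N M_{ij} - j n_i)^2}{j (N - j)}$, where $M_{ij}$ counts the values of sample $i$ not greater than the $j$-th smallest pooled value. It therefore rejects tied values and applies no standardization. The function name and the toy samples are hypothetical.

```python
def anderson_darling_k_sample(samples):
    """Raw k-sample Anderson-Darling statistic A^2, assuming no ties."""
    if any(len(sample) == 0 for sample in samples):
        raise ValueError("every sample must contain at least one value")
    pooled = sorted(x for sample in samples for x in sample)
    n_total = len(pooled)
    if len(set(pooled)) != n_total:
        raise ValueError("this sketch assumes all pooled values are distinct")

    a2 = 0.0
    for sample in samples:
        ordered = sorted(sample)
        n_i = len(ordered)
        seen = 0      # M_ij: values of this sample that are <= Z_j
        inner = 0.0
        for j in range(1, n_total):          # j = 1 .. N-1
            z_j = pooled[j - 1]
            while seen < n_i and ordered[seen] <= z_j:
                seen += 1
            inner += (n_total * seen - j * n_i) ** 2 / (j * (n_total - j))
        a2 += inner / n_i
    return a2 / n_total


# Toy samples, for illustration only.
interleaved = anderson_darling_k_sample([[1.0, 3.0], [2.0, 4.0]])
separated = anderson_darling_k_sample([[1.0, 2.0], [3.0, 4.0]])
print(f"interleaved: {interleaved:.3f}, separated: {separated:.3f}")
```

The fully separated pair yields the larger value, which matches the reading of A² as a dissimilarity index: the larger the value, the more different the two distributions.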
Prior to test application, raw values in the f0 contours were converted to the OMe scale, proposed by [13], using the formula
$$f_{\mathrm{ome}} = \log_2\left(\frac{f_{\mathrm{hz}}}{f_{\mathrm{med}}}\right),$$
where $f_{\mathrm{hz}}$ is a value in Hz, $f_{\mathrm{med}}$ is the median value of the f0 samples comprising the sample of interest and $f_{\mathrm{ome}}$ is the corresponding value in the OMe scale.
The scale conversion centers all histograms on the speaker-specific median value, such that differences in location among distributions are rendered uninformative. The rationale for performing the transformation is that the usefulness of differences in measures such as the mean or median f0 for speaker comparison has already been explored extensively (see [14, 15]). Here we are more focused on finding out whether factors such as speaker and speaking style have an effect on other aspects that contribute to the shape of f0 distributions, such as dispersion, range, skewness and kurtosis.
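The centering effect of the conversion can be seen in a small Python sketch. The function name `hz_to_ome` and the two toy contours are hypothetical; the computation itself is just the base-2 logarithm of each value divided by the median of its own sample.

```python
import math
from statistics import median


def hz_to_ome(f0_hz):
    """Convert f0 values in Hz to octaves relative to their own median."""
    f_med = median(f0_hz)
    return [math.log2(f / f_med) for f in f0_hz]


# Two hypothetical contours with the same shape, one octave apart.
low = [100.0, 200.0, 400.0]
high = [200.0, 400.0, 800.0]
print(hz_to_ome(low))
print(hz_to_ome(high))
```

Both calls print `[-1.0, 0.0, 1.0]`: after conversion the two contours are indistinguishable, so any remaining difference between two samples reflects shape rather than location.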
After the scale conversion, each f0 contour was transformed into a smoothed histogram, that is, a kernel density estimate of the probability density of the data. This transformation was done with the bkde function included in the KernSmooth package for the R statistical environment [16]. The dpik function in the same package was used to select the optimal bandwidth for the kernel density estimate.
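For readers who want to see what a kernel density estimate involves, here is a minimal plain-Python sketch. It does not reproduce bkde or dpik. It assumes a Gaussian kernel evaluated directly on a regular grid (no binning) and replaces the plug-in bandwidth selector with the simpler normal-reference rule of thumb, $h = 1.06\,s\,n^{-1/5}$. The function name and the toy values are hypothetical.

```python
import math
from statistics import stdev


def gaussian_kde_on_grid(values, grid_size=401):
    """Return (grid, density, bandwidth) for a Gaussian kernel density estimate."""
    n = len(values)
    # Normal-reference rule of thumb; a stand-in for a plug-in selector.
    h = 1.06 * stdev(values) * n ** (-1 / 5)
    lo, hi = min(values) - 4 * h, max(values) + 4 * h
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + k * step for k in range(grid_size)]
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
        for x in grid
    ]
    return grid, density, h


# Hypothetical OMe values (octaves relative to the median), illustration only.
ome = [-0.20, -0.10, 0.00, 0.10, 0.30]
grid, density, h = gaussian_kde_on_grid(ome)
step = grid[1] - grid[0]
# Trapezoidal rule: a density estimate should integrate to about 1.
area = sum((a + b) / 2 * step for a, b in zip(density, density[1:]))
print(f"bandwidth = {h:.3f}, area under the curve = {area:.3f}")
```

A different bandwidth rule changes how smooth the curve looks, which is why the choice of selector matters when histograms are compared visually.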
Figure 1 shows the two most similar and the two most dissimilar smoothed histograms according to the A² statistic, considering the entire corpus, but only within-language pairs. The statistic has a value of 1.1 for the pair on the left and 4878 for the pair on the right, a difference spanning three orders of magnitude. Histograms in the most similar pair are almost identical. Visual inspection of the least similar pair suggests differences in modal value location, modal density, spread, range, skewness and distribution “peakedness”, the green one being more pointed and the blue one being more flat-topped.
Figure 1: Most similar and least similar pairs of histograms. On the left, the histograms represent one sentence reading and one word list reading contour, taken from the Swedish data. On the right, the histograms are interview contours, from the Italian data.
We generated plots of the five most similar and five most dissimilar histogram pairs of each language, and visual inspection shows that the results are always very similar to the ones seen in figure 1. Careful listening to the corresponding audio samples confirms that the most similar distributions present in general less f0 modulation, as could be expected from the fact that histograms such as the pair pictured in the left-hand part of figure 1 show almost no difference in the amount of spread around the central value. Listening to the pairs with the greatest degree of dissimilarity confirmed that they differ markedly in the amount of f0 modulation: in general, one of the samples has numerous cases of high-amplitude f0 excursions while the other has a much less varied contour. This situation is compatible with what is pictured in the right-hand side of figure 1, where the blue histogram is more asymmetrical and has a heavier positive tail (caused by cases of more extreme upward f0 excursions) when contrasted with the green one, which is more symmetrical and less spread in comparison. The same conclusion can be drawn by looking at the contours in figure 2, where a representative stretch of the contours that generated the histograms on the right-hand side of figure 1 is shown. It can easily be seen that the blue contour has much wider excursions and spans a greater range than the green one.
Figure 2: Representative stretches of the f0 contours of the least similar pair of histograms. Colors are the same as in figure 1.
2.4. Statistical analysis
The experimental design consisted of four independent variables (IV) with differing numbers of levels:
• LANGUAGE (7 levels): British English, Estonian, French, German, Italian, Brazilian Portuguese and Swedish.
• Speaking STYLE (3 levels): spontaneous interview, sentence reading and word list reading.
• SPEAKER (10 levels): 5 female and 5 male speakers per language.
The dependent variable (DV) was always the value of the A² statistic obtained when comparing a pair of f0 distributions. All pairwise combinations of style (three levels) and speaker (ten speakers) were compared, yielding 435 ($= \binom{30}{2}$) A² values per language. Comparisons were always within-language.
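The pair counts can be checked by enumeration. The Python sketch below uses hypothetical speaker labels and simply counts the combinations; it also separates the same-style, different-speaker pairs and the same-speaker pairs, which correspond to the between-speaker and within-speaker comparisons analyzed in the results.

```python
from itertools import combinations

speakers = [f"S{i:02d}" for i in range(1, 11)]      # hypothetical labels
styles = ["interview", "sentences", "words"]

# One f0 contour per speaker and style.
contours = [(spk, sty) for spk in speakers for sty in styles]
pairs = list(combinations(contours, 2))

# Different speakers, same style.
between_speaker = [p for p in pairs if p[0][0] != p[1][0] and p[0][1] == p[1][1]]
# Same speaker, different styles.
within_speaker = [p for p in pairs if p[0][0] == p[1][0]]

print(len(contours))          # 30 contours per language
print(len(pairs))             # 435 pairs
print(len(between_speaker))   # 3 styles x 45 pairs = 135
print(len(within_speaker))    # 10 speakers x 3 pairs = 30
```

The remaining 270 pairs combine different speakers with different styles.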
We looked at possible significant effects of STYLE and SPEAKER on the mean value of A². To test for significant effects on mean values, we used the Kruskal–Wallis test [17] when samples were heteroscedastic or Analysis of Variance if they were homoscedastic. The Fligner-Killeen test was used to test for homoscedasticity [18]. When the IV being tested had more than two levels, paired t-tests or Mann–Whitney U tests with Holm-corrected p-values [19] were used to check for differences among levels. An α level of 5% was adopted for all tests.
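The Holm correction mentioned above is simple enough to write out. The Python sketch below computes step-down adjusted p-values: the smallest raw p-value is multiplied by the number of tests m, the next by m - 1, and so on, with a running maximum keeping the adjusted values monotone and a cap at 1. The function name and the three raw p-values are hypothetical and are not results from the study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The multiplier shrinks from m down to 1 as p-values grow.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise style comparisons.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print([round(p, 4) for p in adjusted], significant)
```

With these toy values only the first comparison stays below the 5% level after correction, although all three raw p-values were below it.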
3. Results
3.1. Effect of style
In this analysis, the mean A² value for each style is the result of averaging the 45 values generated by the pairwise comparison of the ten individual f0 histograms (one per speaker). Languages are analyzed separately. We take the results as an indication of the level of between-speaker variability in f0 distribution similarity. A significant effect can be taken as evidence that f0 distribution similarity is not uniform among styles. Table 1 shows the mean A² as a function of style and language.
Table 1: Mean A² value and ±95% confidence interval (in parentheses) of between-speaker comparisons as a function of style and language.
Language      Interview     Sentences     Words
English       1047 (261)    118 (33)      103 (22)
Estonian      58.8 (19.3)   323 (90.4)    124 (34.4)
French        714 (222)     280 (71.1)    90.2 (19.7)
German        60.0 (19.6)   251 (69.0)    122 (25.0)
Italian       1334 (357)    465 (130)     284 (78.1)
Portuguese    784 (292)     405 (113)     144 (37.3)
Swedish       943 (269)     138 (37.4)    98.6 (25.7)
The samples of the seven languages are heteroscedastic (all Fligner-Killeen tests yielded p < 0.05). The Kruskal–Wallis test indicated a significant effect of speaking style in each language (all p < 0.05). Multiple comparison tests indicate that the languages fall into two main groups. The larger one comprises English, French, Italian, Portuguese and Swedish. For this group, the spontaneous interview is the style with the highest A² values. There is no statistically significant difference between sentences and words in English, Italian and Swedish. In Portuguese, there is no significant difference between interview and sentences, and both styles have mean values greater than the word reading style. In French, all differences are significant in the following order: interview > sentences > words. The second group comprises Estonian and German. For both, all differences are significant, but the order is: sentences > words > interview.
Results indicate that each language shows some significant variation in f0 distributions across styles. For a group of five languages (English, French, Italian, Portuguese and Swedish), spontaneous interview is the style that yields the highest levels of inter-speaker variation in distribution dissimilarity. For Estonian and German, on the other hand, sentence reading is the style with the greatest mean level of distribution dissimilarity. The effect size observed when the interview style has the highest mean A² value is greater than when the sentence style has the highest mean value.
We listened to the audio files of the sentence reading pairs with the five largest A² values in Estonian and German and confirmed that, in every pair, one contour was more level and the other showed excursions of greater amplitude. In both languages, spontaneous interview is the style with the lowest degree of f0 contour dissimilarity between speakers. Contours in spontaneous speech are more symmetrical and show less modulation (narrower f0 excursions and range).
This result points to two consistent effects: first, some speaking styles tend to yield more between-speaker variability in terms of f0 contour modulation; second, language affects how different speech styles are implemented. Speakers of one group of languages tend to implement livelier (more modulated) f0 contours when speaking spontaneously and speakers of the other group when reading sentences. We speculate that this may be related to different cross-cultural practices regarding spontaneous and read speech.
3.2. Effect of speaker
In this analysis, the mean A² value for each speaker is the result of averaging three data points, corresponding to the inter-style comparisons (interview–sentences, interview–words and sentences–words). Besides being an inter-style comparison, we also take this analysis as an indication of the level of within-speaker variability in f0 distribution similarity, since the data is presented as a function of each of the ten speakers. A significant effect can be taken as evidence that variation in f0 distribution similarity caused by the three styles is not uniform among speakers. Table 2 shows the overall (all ten speakers collapsed) mean A² as a function of language.
Table 2: Overall mean speaker A² value and ±95% confidence interval (in parentheses) of within-speaker/between-style comparisons as a function of language.
Language      Mean
English       175 (62.1)
Estonian      74.2 (24.0)
French        141 (60.2)
German        99.3 (26.6)
Italian       433 (187)
Portuguese    207 (78.6)
Swedish       129 (49.5)
The samples of the seven languages are all homoscedastic (Fligner-Killeen tests were non-significant). The ANOVA tests run separately for each language were all non-significant as well, indicating that speakers' means are not statistically different. This set of results seems to indicate that the effect of speaking styles on the variation of f0 distribution shape is uniform among speakers and that this result holds for the seven languages studied here.
As we said earlier in this section, we consider this analysis to be a way to estimate within-speaker variability, even though the fact that the three samples each speaker contributes vary in terms of speaking style can be seen as a confounding factor. We interpret the results reported in the previous paragraph as an indication that the possible confounding influence of speaking style is not great in this case.
A second analysis used one-sample t-tests to compare, for each language separately, the overall A² mean for the speakers with the mean A² values observed in the analysis reported in the previous section. This is a way to determine whether the A² measure varies more due to within-speaker or between-speaker factors. To illustrate, we can take from table 2 the mean value of 175, corresponding to the overall A² mean for English speakers, and compare it with the value 1047 taken from table 1, the mean A² value reflecting the between-speaker variability observed in the interview style in English. The one-sample t-test comparing both values yields a significant result [t(29) = 27.508, p < 0.001]. That puts mean within-speaker variability at a lower level than between-speaker variability in the interview style. We get statistically significant results for the other four languages (French, Italian, Portuguese and Swedish) for which the interview style has the highest value in table 1. For the other two languages, Estonian and German, we get statistically significant results when comparing the values in table 2 with the mean values in table 1 for the sentence reading style.
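The logic of a one-sample t-test can be sketched as follows. This Python fragment only computes the test statistic and its degrees of freedom; obtaining the p-value requires the t distribution, which the standard library does not provide. The function name and the toy values are hypothetical. As a side note, the 29 degrees of freedom reported above would correspond to 30 values, presumably the 10 speakers times 3 inter-style comparisons of one language, although the text does not state this explicitly.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """Return (t, df) for H0: the population mean of `values` equals mu0."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    return (mean(values) - mu0) / standard_error, n - 1


# Toy within-speaker values compared against a fixed reference mean.
t_stat, df = one_sample_t([1.0, 2.0, 3.0, 4.0, 5.0], mu0=2.0)
print(f"t({df}) = {t_stat:.3f}")
```

In the setting described above, `values` would hold the within-speaker A² values of one language and `mu0` the between-speaker mean taken from table 1.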
This result is encouraging in terms of the usefulness of f0 distribution shape in speaker comparison tasks because, if confirmed in further studies, it indicates that this feature has the hallmark of a good parameter in forensic speaker comparison: within-speaker variability is less than between-speaker variability.
4. Discussion and conclusion
In this study we report a method to compare pairs of f0 distributions and quantify their difference in a way that emphasizes differences in the overall shape of the distributions' histograms. Using this metric, we have shown that speaking style has a significant effect on the shape of f0 distributions. Depending on the language, interview or sentence reading is the style in which speakers differ the most in terms of distribution shape. We have also shown that, according to our metric, f0 contours produced by the same speaker in different styles vary less than the contours of different speakers speaking in the same style (especially the spontaneous style). This result is encouraging in terms of the usefulness of f0 distribution shape in speaker comparison tasks because, if confirmed in further studies, it indicates that this feature has the hallmark of a good parameter in forensic speaker comparison: within-speaker variability is less than between-speaker variability.
In a speaker comparison scenario, forensic experts should concentrate on analyzing spontaneous styles, such as interviews, because differences among speakers' f0 distributions are more likely to be found in those styles. We are measuring distribution similarity on the basis of a single number (the A² statistic). Future work should tease out which statistical descriptors (range, asymmetry, kurtosis, etc.) correlate better with our metric. It would also be important in the future to measure within-speaker variability by means of different samples of the same speaker in the same style. Here we only had one sample in each style per speaker.
5. Acknowledgements
This work has been supported by a joint grant from The Swedish Foundation for International Cooperation in Research and Higher Education (STINT) and Brazil's Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) grant
6. References
[1] J. Llisterri, “Speaking styles in speech research,” in ELSNET/ESCA/SALT Workshop on Integrating Speech and Natural Language, Dublin, Ireland, 1992.
[2] P. Arantes and M. E. N. Linhares, “Efeito da língua, estilo de elocução e sexo do falante sobre medidas globais da frequência fundamental,” Letras de Hoje, vol. 52, no. 1, pp. 26–39, 2017.
[3] P. A. Barbosa, A. Eriksson, and J. Åkesson, “On the robustness of some acoustic parameters for signaling word stress across styles in Brazilian Portuguese,” in Proceedings of Interspeech 2013, 2013, pp. 282–286.
[4] P. Lippus, E. L. Asu, and M.-L. Kalvik, “An acoustic study of Estonian word stress,” in Proceedings of Speech Prosody 2014, 2014, pp. 232–235.
[5] A. Eriksson and M. Heldner, “The acoustics of word stress in English as a function of stress level and speaking style,” in Proceedings of Interspeech 2015, 2015, pp. 41–45.
[6] A. Eriksson, P. M. Bertinetto, M. Heldner, R. Nodari, and G. Lenoci, “The acoustics of lexical stress in Italian as a function of stress level and speaking style,” in Proceedings of Interspeech 2016, 2016, pp. 1059–1063.
[7] J. Laver, The phonetic description of voice quality. New York: Cambridge University Press, 1980.
[8] P. Arantes, A. Eriksson, and S. Gutzeit, “Effect of language, speaking style and speaker on long-term F0 estimation,” in Interspeech 2017. Stockholm: ISCA, 2017, pp. 3897–3901.
[9] D. J. Hirst, “The analysis by synthesis of speech melody: from data to models,” Journal of Speech Sciences, vol. 1, no. 1, pp. 55–83, 2011.
[10] P. Boersma, “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” Proceedings of the Institute of Phonetic Sciences, vol. 17, pp. 97–110, 1993.
[11] F. W. Scholz and M. A. Stephens, “K-Sample Anderson-Darling Tests,” Journal of the American Statistical Association, vol. 82, no. 399, pp. 918–924, Sep. 1987.
[12] F. Scholz and A. Zhu, kSamples: K-Sample Rank Tests and their Combinations, 2017, R package version 1.2-7. [Online].
[13] C. De Looze and D. J. Hirst, “The OMe (octave-median) scale: A natural scale for speech melody,” in Proceedings of the 7th International Conference on Speech Prosody, N. Campbell, D. Gibbon, and D. Hirst, Eds., Dublin, 2014, pp. 910–914.
[14] Y. Kinoshita, S. Ishihara, and P. Rose, “Exploring the discriminatory potential of F0 distribution parameters in traditional forensic speaker recognition,” The International Journal of Speech, Language and the Law, vol. 16, no. 1, pp. 91–111, 2009.
[15] Y. Kinoshita and S. Ishihara, “F0 can tell us more: speaker verification using the long term distribution,” in Proceedings of the Australasian International Conference on Speech Science and Technology 2010, Melbourne, Australia, 2010, pp. 50–53.
[16] M. Wand, KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones (1995), 2015, R package version 2.23-15. [Online]. Available: https://CRAN.R-
[17] W. H. Kruskal and W. A. Wallis, “Use of ranks in one-criterion variance analysis,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952.
[18] W. J. Conover, M. E. Johnson, and M. M. Johnson, “A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data,” Technometrics, vol. 23, pp. 351–361, 1981.
[19] S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, vol. 6, pp. 65–70, 1979.
... This is compatible with a view that spontaneous or semi-spontaneous speaking styles are livelier, as speakers tend to be more engaged and involved in what they are saying than in read speech, that has a lesser degree of f0 modulation and therefore could be perceived as having a relatively more level intonation pattern. Arantes and Eriksson (2019) present data that may be interpreted as indication that this is not a fixed pattern, but something that can, at least in part, be language -specific or culture-specific. In their study, the authors studied the multilingual corpus that includes the BP data analyzed here and developed a methodology to measure similarity between f0 contours. ...
... Going back to the results in the present study, it seems that male speakers adhere to the pattern show in Arantes and Eriksson (2019) to a lesser extent than the female speakers. It is possible to see that, for the sentence reading style, males and females present similar SD and skewness values. ...
Full-text available
The present study has two main goals. The first is to describe the effects of three speaking styles (spontaneous interview, sentence reading and word list reading) on statistical estimators of fundamental frequency (f0) variability (mean, standard deviation, skewness and kurtosis) in five female and five male speakers of Brazilian Portuguese (BP). Most f0 contours of word reading are bimodal. Analysis of their time-normalized contours suggests this is caused by the time-compressed realization of fast transitions from low to high or high to low tones aligned with stressed syllables. Considering only unimodal distributions, results show that there are no statistically significant effects in the male data for any of the four variability estimators. Effects show up in female data. Spontaneous style has statistically significant higher mean, SD and skewness than read speech. Findings in the previous literature indicate the reverse pattern, though, for languages other than BP. The second goal of the study is to characterize the statistical properties of f0 distributions beyond mean and SD. Results confirm previous observations that most f0 distributions have positive skewness, are left-tailed and have kurtosis values that deviate significantly from the normal because of large deviations from the central or modal value. A distribution fitting procedure tested six distributions. The asymmetric Burr type XII distribution emerges as the one that best fits the data in the corpus. Results show that two of the parameters that determine its shape correlate well with the empirical f0 distribution values of SD and skewness. Important effects of speaking style on f0 seen in female speakers can be reproduced by combinations of the Burr distributions’ parameters.
... However, they differ in that AD gives more weight to the tails of the distributions, while KS is more sensitive to deviations in the center of the distribution. The AD test was used to quantify the F0 modulation of different speaking styles (Arantes & Eriksson, 2019). ...
Full-text available
Speech is a highly dynamic process. Some variability is inherited directly from the language itself, while other variability stems from adapting to the surrounding en- vironment or interlocutor. This Ph.D. thesis consists of seven studies investigating speech adaptation concerning the message, channel, and listener variability. It starts with investigating speakers’ adaptation to the linguistic message. Previous work has shown that duration is shortened in more predictable contexts, and conversely lengthened in less predictable contexts. This pervasive predictability effect is well studied in multiple languages and linguistic levels. However, syllable level predictability has been generally overlooked so far. This thesis aims to fill that gap. It focuses on the effect of information-theoretic factors at both the syllable and segmental levels. Furthermore, it found that the predictability effect is not uniform across all durational cues but is somewhat sensitive to the phonological relevance of a language-specific phonetic cue. Speakers adapt not only to their message but also to the channel of transfer. For example, it is known that speakers modulate the characteristics of their speech and produce clear speech in response to background noise-syllables in noise have a longer duration, with higher average intensity, larger intensity range, and higher F0. Hence, speakers choose redundant multi-dimensional acoustic modiőcations to make their voices more salient and detectable in a noisy environment. This Ph.D. thesis provides new insights into speakers’ adaptation to noise and predictability on the acoustic realizations of syllables in German; showing that the speakers’ response to background noise is independent of syllable predictability. Regarding speaker-to-listener adaptations, this thesis finds that speech variability is not necessarily a function of the interaction’s duration. 
Instead, speakers constantly position themselves concerning the ongoing social interaction. Indeed, speakers’ cooperation during the discussion would lead to a higher convergence behavior. Moreover, interpersonal power dynamics between interlocutors were found to serve as a predictor for accommodation behavior. This adaptation holds for both human-human interaction and human-robot interaction. In an ecological validity study, speakers changed their voice depending on whether they were addressing a human or a robot. Those findings align with previous studies on robot-directed speech and confirm that this difference also holds when the conversations are more natural and spontaneous. The results of this thesis provide compelling evidence that speech adaptation is socially motivated and, to some extent, consciously controlled by the speaker. These findings have implications for including environment-based and listener-based formulations in speech production models along with message-based formulations. Furthermore, this thesis aims to advance our understanding of verbal and non-verbal behavior mechanisms for social communication. Finally, it contributes to the broader literature on information-theoretical factors and accommodation effects on speakers’ acoustic realization.
Objective: To assess the speaker-discriminatory potential of a set of fundamental frequency estimates in intra-identical twin pair comparisons and cross-pair comparisons (i.e., among all speakers). Participants: A total of 20 Brazilian Portuguese speakers of the same dialect, namely 10 male identical twin pairs aged between 19 and 35, were recruited. Method: The participants were recorded directly through professional microphones while taking part in a spontaneous dialogue over mobile phones. Acoustic measurements were performed in connected speech samples and in lengthened vowels, at least 160 ms long, produced during spontaneous speech. Results: f0 baseline, central tendency, and extreme values were found to be the most discriminatory in intra-twin pair and cross-pair comparisons. These were also the estimates displaying the largest effect sizes. Overall, only three identical twin pairs were found to differ statistically in their f0 patterns in connected speech, but not for lengthened vowel-based f0 metrics. Estimates of f0 variation and modulation were found to be the least discriminatory across speakers, which may signal the control of speaking style and dialect on dynamic patterns of f0. Concerning system performance, the base value of f0 (f0 baseline) was found to be the most reliable metric, displaying the lowest equal error rate (EER). Conclusions: The outcomes suggest that, although identical twins were very closely related regarding their f0 patterns, some pairs could still be differentiated acoustically, but only in connected speech. Such findings reinforce the relevance of analyzing long-term f0 metrics for speaker comparison purposes, with particular consideration to f0 baseline. Furthermore, f0 differences across subjects were suggested to be more expressive in connected speech than in lengthened vowels.
We analyze the effect of speaker sex, speaking style and language (English, Estonian, French, German, Italian, Portuguese, Swedish) on global statistical measures of the voice fundamental frequency. The styles studied are interview, sentence reading and word list reading. Typical F0 values used by speakers among languages differ both in male and female speakers. Sentence reading has slightly higher values than interview, but the effect is not uniform among languages. In few cases word reading and interview values differ significantly. All central tendency estimators are affected by the variables tested. Three out of five estimators of dispersion show slightly higher values for male than for female speakers. Among styles, the interview has higher standard deviation and coefficient of variation. Languages do not differ among themselves in terms of F0 variability.
Conference Paper
This study of lexical stress in English is part of a series of studies, the goal of which is to describe the acoustics of lexical stress for a number of typologically different languages. When fully developed the methodology should be applicable to any language. The database of recordings so far includes Brazilian Portuguese, English (U.K.), Estonian, German, French, Italian and Swedish. The acoustic parameters examined are f0-level, f0-variation, Duration, and Spectral Emphasis. Values for these parameters, computed for all vowels, are the data upon which the analyses are based. All parameters are tested with respect to their correlation with stress level (primary, secondary, unstressed) and speaking style (wordlist reading, phrase reading, spontaneous speech). For the English data, the most robust results concerning stress level are found for Duration and Spectral Emphasis. f0-level is also significantly correlated but not quite to the same degree. The acoustic effect of phonological secondary stress was significantly different from primary stress only for Duration. In the statistical tests, speaker sex turned out as significant in most cases. Detailed examination showed, however, that the difference was mainly in the degree to which a given parameter was used, not how it was used to signal lexical stress contrasts.
Conference Paper
Fundamental frequency, the primary acoustic correlate of speech melody, is generally analysed and displayed using a linear scale (Hertz) or a logarithmic one, generally in semitones and usually offset to an arbitrary reference level such as 100 Hz. In this paper we argue that a more natural scale for analysing speech is the OME (Octave-MEdian) scale, using the octave (o) as the basic unit, offset to the median value of the speaker's range. We present results showing that a reasonable estimate of a speaker's neutral pitch range can be obtained directly from the median.
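As a rough illustration of the scale described in this abstract, the sketch below converts f0 values in Hz to octaves relative to the speaker's median. The function name `hz_to_ome` and the convention that unvoiced frames are coded as 0 and dropped are assumptions made for this example, not details taken from the paper.

```python
import math
import statistics


def hz_to_ome(f0_hz):
    """Convert f0 values (Hz) to octaves relative to the median.

    Assumption for this sketch: unvoiced frames are coded as 0
    (or negative) and are dropped before the median is taken.
    """
    voiced = [f for f in f0_hz if f > 0]
    median = statistics.median(voiced)
    # One octave is a doubling of frequency, hence log base 2.
    return [math.log2(f / median) for f in voiced]


print(hz_to_ome([100, 0, 200, 400]))  # [-1.0, 0.0, 1.0]
```

On this scale a value of 0 sits at the speaker's median, and +1 or -1 is one octave above or below it, so contours from speakers with different typical pitch become directly comparable.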
Conference Paper
This study investigates the acoustic correlates of word stress in Estonian. It forms part of a broader international collaboration the aim of which is to develop a universal language-independent model for evaluating lexical stress regardless of the phonological structure of a given language. To this aim the characteristics of word stress in a range of languages are studied using unified methodology. For the present study, four acoustic measures were analysed as a function of speaking style and stress: vowel duration, F0 mean, F0 standard deviation, and spectral emphasis. The results show that the strongest correlate of style and stress in Estonian is vowel duration, but stress has a strong interaction with the Estonian three-way quantity system.
Despite its many prima facie attractive properties for forensic speaker recognition, F0 is regarded as having limited forensic value due to its large within-speaker variability. However, its forensic use to date has been limited mostly to its long-term mean and standard deviation. This paper examines the discriminatory potential, within a Likelihood Ratio-based approach, of additional parametric features from the distribution of long-term F0: its skew, kurtosis, modal F0 and modal density. Motivated by the observation that the shape of the long-term F0 distribution shows less within-speaker occasion-to-occasion difference, we report a forensic discrimination experiment with non-contemporaneous speech samples from 201 male Japanese speakers. Using a multivariate Likelihood Ratio as discriminant distance with the six LTF0 distribution parameters, an equal error rate of 10.7% is obtained from 201 target and 80400 non-target trials. We also investigate how the EER degrades as a function of amount of voiced speech.
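The distribution-shape parameters named in this abstract could be computed along the following lines. This is only a sketch: the abstract does not say how modal F0 and modal density were estimated, so the fixed-width histogram used here, the `bin_width` parameter, and the function name `ltf0_shape` are all assumptions made for illustration.

```python
import statistics
from collections import Counter


def ltf0_shape(f0_hz, bin_width=10.0):
    """Summarise a long-term f0 distribution with six parameters.

    Skew and kurtosis are the standardised third and fourth moments
    (kurtosis reported as excess, i.e. minus 3). The mode is taken
    from a fixed-width histogram, which is an assumption of this sketch.
    """
    n = len(f0_hz)
    mean = statistics.fmean(f0_hz)
    sd = statistics.pstdev(f0_hz)
    z = [(f - mean) / sd for f in f0_hz]
    skew = sum(v ** 3 for v in z) / n
    kurtosis = sum(v ** 4 for v in z) / n - 3.0
    # Histogram: count values per bin, then pick the fullest bin
    # (ties resolved in favour of the lowest bin).
    bins = Counter(int(f // bin_width) for f in f0_hz)
    mode_bin, count = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))
    return {
        "mean": mean,
        "sd": sd,
        "skew": skew,
        "kurtosis": kurtosis,
        "modal_f0": (mode_bin + 0.5) * bin_width,  # bin centre in Hz
        "modal_density": count / n,  # share of values in the modal bin
    }


summary = ltf0_shape([110, 120, 120, 120, 130])
print(summary["modal_f0"], summary["skew"])  # 125.0 0.0
```

With real data the result depends on the bin width, so any comparison between speakers would need the same binning throughout.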
Two k-sample versions of an Anderson-Darling rank statistic are proposed for testing the homogeneity of samples. Their asymptotic null distributions are derived for the continuous as well as the discrete case. In the continuous case the asymptotic distributions coincide with the (k - 1)-fold convolution of the asymptotic distribution for the Anderson-Darling one-sample statistic. The quality of this large sample approximation is investigated for small samples through Monte Carlo simulation. This is done for both versions of the statistic under various degrees of data rounding and sample size imbalances. Tables for carrying out these tests are provided, and their usage in combining independent one- or k-sample Anderson-Darling tests is pointed out. The test statistics are essentially based on a doubly weighted sum of integrated squared differences between the empirical distribution functions of the individual samples and that of the pooled sample. One weighting adjusts for the possibly different sample sizes, and the other is inside the integration placing more weight on tail differences of the compared distributions. The two versions differ mainly in the definition of the empirical distribution function. These tests are consistent against all alternatives. The use of these tests is two-fold: (a) in a one-way analysis of variance to establish differences in the sampled populations without making any restrictive parametric assumptions or (b) to justify the pooling of separate samples for increased sample size and power in further analyses. Exact finite sample mean and variance formulas for one of the two statistics are derived in the continuous case. It appears that the asymptotic standardized percentiles serve well as approximate critical points of the appropriately standardized statistics for individual sample sizes as low as 5. The application of the tests is illustrated with an example. 
Because of the convolution nature of the asymptotic distribution, a further use of these critical points is possible in combining independent Anderson-Darling tests by simply adding their test statistics.
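To make the "doubly weighted sum" description concrete, here is a minimal pure-Python sketch of the raw k-sample statistic for the continuous case. The formula used, A² = (1/N) Σ_i (1/n_i) Σ_{j=1}^{N-1} (N·M_ij − j·n_i)² / (j(N − j)), with M_ij the number of values in sample i not greater than the j-th smallest pooled value, is the form as I understand it and should be checked against the paper. The function name `ad_k_sample` is hypothetical, ties are rejected, and no standardisation or critical values are computed.

```python
def ad_k_sample(samples):
    """Raw k-sample Anderson-Darling statistic (continuous case).

    Assumptions of this sketch: no tied values anywhere in the
    pooled data, and no standardisation or p-value is produced.
    """
    pooled = sorted(x for s in samples for x in s)
    n_total = len(pooled)
    if len(set(pooled)) != n_total:
        raise ValueError("this sketch assumes no tied values")
    stat = 0.0
    for s in samples:
        n_i = len(s)
        s_sorted = sorted(s)
        m = 0  # running count of values in s not greater than z_j
        inner = 0.0
        for j in range(1, n_total):
            z_j = pooled[j - 1]
            while m < n_i and s_sorted[m] <= z_j:
                m += 1
            # The j * (N - j) denominator is smallest at the extremes,
            # which is what puts more weight on tail differences.
            inner += (n_total * m - j * n_i) ** 2 / (j * (n_total - j))
        stat += inner / n_i  # weighting by sample size
    return stat / n_total


interleaved = ad_k_sample([[1, 3, 5, 7], [2, 4, 6, 8]])
separated = ad_k_sample([[1, 2, 3, 4], [5, 6, 7, 8]])
print(interleaved < separated)  # True
```

Two samples drawn from interleaved positions give a small value, while two fully separated samples give a much larger one. For actual testing, a maintained implementation that handles ties and significance levels would be the safer choice.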
Many of the existing parametric and nonparametric tests for homogeneity of variances, and some variations of these tests, are examined in this paper. Comparisons are made under the null hypothesis (for robustness) and under the alternative (for power). Monte Carlo simulations of various symmetric and asymmetric distributions, for various sample sizes, reveal a few tests that are robust and have good power. These tests are further compared using data from outer continental shelf bidding on oil and gas leases.