© ČsAS, Akustické listy, 25(1–2), June 2019, pp. 10–18
The mapping of voice parameters in connected speech of
healthy Common Czech male speakers
Mapování hlasových parametrů v souvislé řeči zdravých mužských
mluvčích obecné češtiny
Lea Tylečková and Radek Skarnitzl
Charles University, Faculty of Arts – Institute of Phonetics, náměstí Jana Palacha 2, 116 38 Praha 1
This study examines a set of voice parameters to map objective ranges of voice-source characteristics of healthy
male speakers of Common Czech. Objective assessment of voice quality is conducted mainly in speakers with
voice pathologies, typically using sustained vowels as the basis for measurements. In our study, we focused on
non-pathological voices and performed acoustic measurements of the voice parameters which are believed to reflect
glottal characteristics. The analyses were based on the open vowels [a a:] extracted from fifty healthy male speakers
who performed a reading task. Voice parameter estimation included f0 perturbation measures (jitter and shimmer),
harmonicity (HNR), Cepstral Peak Prominence (CPP), and harmonic amplitude measures which reflect
short-term spectral slope (e.g., H1−H2, H2−H4, or H1−A3). The obtained data relate to connected speech and are
compared to the measurements on sustained vowels.
1. Introduction
The role of voice in everyday social interactions can
hardly be overestimated; it is an important part of our
communication and it also represents a rich source of in-
formation about the speakers reflecting their physical, psy-
chological and social characteristics [1]. Voice quality can
be treated in a broad perspective, when it comprises spe-
cific settings at both the laryngeal and supra-laryngeal (ar-
ticulatory) level [1]. In a narrower perspective, voice qua-
lity only refers to phonatory modifications (i.e., changes
in the manner of vocal fold vibrations). In this paper, we
are interested in the laryngeal level only, and voice quality
will thus pertain only to phonation.
Differences in voice quality may arise due to anatomical
and physiological factors; apart from these biological as-
pects, however, socio-cultural aspects also play a conside-
rable role [2, 3]. Voice quality as a significant idiosyncratic
aspect of an individual’s speech pattern is also examined
within the field of forensic phonetics. Acoustic analyses
focus on measuring voice parameters that make it possible to
capture inter-speaker variability. In the Czech context, this
research area is addressed, for instance, by Weingartová et
al. [4].
Generally, when assessing voice quality, speech scien-
tists may make use of methods deriving from three view-
points: articulatory, where we describe the phonatory be-
haviour per se, perceptual and acoustic. Perceptual ratings
of voice quality reflect subjective assessment but the over-
all impression of the voice can be decomposed into a few
dimensions that are perceptually distinct and correspond
to various terms, such as breathiness, roughness etc. The use
of perceptual rating scales [1, 5, 6, 7] should remain constant
across different listeners and voices, so that all the listeners
use the measurement tools in the same way, and ratings across
different voices can be compared in a meaningful manner. Voice
quality is thus assumed to be constant across listeners, so that
it can be dealt with as an attribute of the voice signal itself
rather than a product of the listener’s perception [8: 73–74]. In most
cases, valid and reliable judgments of voice quality require
trained judges, especially when it comes to the auditory-
perceptual assessment of voice disorders [6].
Measuring acoustic parameters of voice quality is of
great interest to scientists dealing with various voice
pathologies. Their findings enable clinicians to diagnose
voice disorders and are used in voice re-education aim-
ing at acquiring appropriate phonation habits in patients
suffering from vocal disorders [3, 6].
Acoustic analyses are used to provide measurements
and quantification of various voice parameters, examining
voice quality and phonation types in an objective way. The
most common acoustic measures reflecting variability in
the voice signal are jitter, shimmer and HNR (harmonics-
to-noise ratio). These parameters are commonly used in
clinical practice when evaluating voice disorders and voice
quality disruptions such as breathiness, roughness and
hoarseness, because they are relatively low-cost and non-
invasive [6].
Jitter corresponds to variations in frequency between
successive vibratory cycles [9, 10]. Jitter measurements
can be conducted in two different ways – by peak-picking
or waveform matching [9]. The latter tries to identify the
time distance at which two consecutive waveshapes look
most similar, while the peak-picking technique strives to
find time locations where waveform amplitude is at its
maximum. It is frequently a lack of precise control of vo-
cal fold vibration that mainly affects jitter; patients with
voice pathologies often have a higher percentage of jitter.
A typical percentage range indicating frequency variation
from cycle to cycle for sustained phonation in young
healthy adults stated by most researchers is 0.5–1.0 % [10].
Values above 1.04 % are considered pathological [9, 10].
Received 8 October 2018, accepted 21 December 2018.
Shimmer provides measurements of variations in am-
plitude between successive vibratory cycles. The methods
used to measure shimmer are identical to jitter, but while
jitter takes into account the duration of periods, shim-
mer considers the peak amplitude of the signal [10]. The
amplitude variation of the sound wave is expressed as a percentage
or in decibels. The value 3.81 % is stated as the limit for
detecting pathological voices [10].
HNR enables researchers to quantify the ratio between
periodic and aperiodic components in the signal. HNR es-
timation can be carried out in two ways: on a time-domain
basis (using autocorrelation) and on a frequency-domain
basis. In the former case, HNR is computed directly from
the acoustic signal, while in the latter case, HNR measure-
ments are conducted from a transformed representation of
a waveform [11]. The higher an HNR value is, the more
sonorant and harmonic a voice is. HNR values below 7 dB
are considered pathological [10, 12].
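As a rough illustration of the time-domain approach, the relationship between periodicity and HNR can be sketched as follows. The sketch assumes the autocorrelation formulation described in [12], in which HNR = 10·log10(r / (1 − r)) and r is the height of the normalized autocorrelation peak at the pitch period; the function names and the use of the 7 dB limit as a default argument are illustrative choices, not part of the original study.

```python
import math


def hnr_db(r_peak: float) -> float:
    """HNR in dB from the normalized autocorrelation peak (0 < r_peak < 1)."""
    if not 0.0 < r_peak < 1.0:
        raise ValueError("r_peak must lie strictly between 0 and 1")
    # r_peak is the share of periodic energy, 1 - r_peak the share of noise.
    return 10.0 * math.log10(r_peak / (1.0 - r_peak))


def below_threshold(hnr: float, limit_db: float = 7.0) -> bool:
    """True if an HNR value falls below the 7 dB limit stated in [10, 12]."""
    return hnr < limit_db


if __name__ == "__main__":
    # Hypothetical peak heights, chosen only to show the shape of the mapping.
    for r in (0.5, 0.9, 0.99):
        print(f"r = {r:.2f} -> HNR = {hnr_db(r):.2f} dB")
```

A peak height of 0.5 corresponds to equal periodic and aperiodic energy, i.e. 0 dB; the value grows quickly as the peak approaches 1.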
In time-domain analyses, jitter and shimmer estima-
tions rely on the identification of cycles of vocal fold vibra-
tion in speech signals (so-called pitch marks), which might
have some limitations. For instance, in case of severely
dysphonic or aperiodic vowel samples, the degree of dis-
turbance or perturbation may be so high that an accurate
location of cycle boundaries is difficult and, in turn, fun-
damental frequency (f0) detection is impossible. Another
potential problem may arise when using continuous speech
samples containing variations in pitch and loudness as well
as rapid consonant–vowel and vowel–consonant transitions
[6, 13]; as mentioned above, jitter and shimmer are typi-
cally measured in sustained vowels.
Cepstral-based techniques represent an alternative approach
towards extracting f0 and towards estimating the
relative amplitude of harmonic versus noise components;
importantly, these techniques eliminate the need for identifying
cycle boundaries [6]. The cepstrum, a Fourier transform
of the logarithm of the power spectrum of the speech signal,
is a spectral-based representation comprising prominent peaks,
the so-called rahmonics (an anagram of harmonics). A cepstrum
of an acoustic signal displaying a well-defined harmonic structure shows
a prominent peak; this cepstral peak prominence (CPP)
is a measure of the amplitude of that cepstral peak which
corresponds to f0, normalized for overall signal amplitude.
The amplitude of CPP thus reflects both harmonic orga-
nization and the overall amplitude of the signal [14]. It has
been used by a number of investigators to evaluate voice
quality, as it provides valid and reliable measurements not
only in sustained vowel samples, but also in continuous
speech [6, 13, 15].
Apart from jitter, shimmer, HNR and CPP, harmonic
amplitude measures are commonly used when examining
glottal characteristics, representing short-term acoustic
manifestations of voice quality. These parameters are
sensitive to varying degrees of vocal fold adduction in normal
speakers. Based on theoretical models, they are related
to the existence and size of a glottal chink [16]. Differences
between the amplitude of the first harmonic and that of the
second harmonic (H1−H2) or of the harmonics located closest to
the first, second and third formant frequencies (H1−A1,
H1−A2, H1−A3) of the voice spectrum have been found
useful when quantifying the degree of glottal adduction in
different voices [16, 17]. The amplitude of the first harmonic
relative to that of the second (H1−H2) is used as
an indication of the open quotient (OQ), i.e., the proportion of
a glottal cycle in which the glottis is open. As the OQ relates
to the overall glottal stricture, the H1−H2 measure
is used to characterize the differences along the glottal
constriction continuum [16, 17, 18]. The amplitude of the
second harmonic relative to the fourth (H2−H4) has also
been found to be an important acoustic measure for distinguishing
modal from nonmodal phonation [19], especially
in cases when H1−H2 does not seem to work [18].
The amplitude of H1 relative to a higher frequency component
can quantify the strength of higher frequencies in
the spectrum relating to the closing velocity of the vocal
folds, and perhaps to muscle tension. Thus, H1−A1,
H1−A2 and H1−A3 are measured. These parameters can
also distinguish modal and breathy phonation in some languages
[18, 20] where H1−H2 does not seem to be useful.
The amplitude of the first harmonic relative to that
of the first formant prominence in the spectral domain
(A1) reflects the bandwidth of F1, and may also be affected
by source spectral tilt. H1−A1 is an indication of
the presence of a posterior glottal chink, i.e., the degree
to which the glottis fails to close completely during the
closing phase [16, 17]. The amplitude of the first harmonic
relative to that of the strongest harmonic in the second formant
(H1−A2) is used as an indicator of the source spectral
tilt (i.e., energy decrease with increasing frequency) at
the mid formant frequencies [16]. Finally, H1−A3 reflects
the spectral tilt at the higher formant frequencies [16, 17,
21].
Harmonic amplitude measures can be compared across
different speakers and vowels only if the measures are corrected
for the effect of F1, F2 and F3 vocal tract resonances
(frequencies and bandwidths) on harmonic amplitudes;
uncorrected values reflect both the voice source
and the supra-glottal filter. The corrected harmonic amplitude
values are denoted with an asterisk, e.g. H1*−H2*,
H1*−A1* etc. [17, 18, 22].
The objective of this study is to provide the aforementioned
voice measure estimation in young healthy male
speakers of Common Czech. A number of studies dealing
with voice parameters published so far concentrate
mainly on speakers with voice disorders or on patients
with neurodegenerative diseases whose impact on voice
quality has been demonstrated [23]. This study seeks
to establish quantitative ranges against which it would be
possible to gauge the production of non-pathological voice.
A sample of fifty male speakers will be used to map acous-
tic parameter value ranges of voice-source characteristics
based on a read speech task.
2. Method
2.1. Material
Recordings of fifty male speakers aged between 19 and
43 years (mean age: 24.7 years, SD: 6.1 years) were se-
lected from the Database of Common Czech, a reference
database for forensic purposes [24]. The speakers, who re-
ported no voice or hearing problems, were recorded while
reading a phonetically rich text of 150 words including
all the Czech phonemes and their context-dependent vari-
ants in their natural voice; the length of the recording
was approximately 60 seconds. Based on reported findings
(see [25: ch. 4] for a review), no age-related vocal changes were
assumed in the speakers.
The recordings were acquired in a quiet environment
using an Edirol R09 portable recorder and its built-in
microphone, at a sampling rate of 48 kHz.
2.2. Parameter extraction and analyses
For each speaker, we extracted the voice quality parame-
ters from 30 manually segmented /a a:/ vowels (16 phono-
logically short and 14 long vowels). Only phrase-internal
vowels were chosen for analysis, so as not to confound the
measurements by phrase-final phenomena such as creak
or breathiness. However, vowels in all segmental contexts
(incl. nasal) were included. Boundaries of the target vo-
wels were determined based on the phonetically motivated
recommendations for manual segmentation of the speech
signal [26]. Briefly, the boundaries were located at the on-
set or the offset of full vowel formant structure. In case
of the transition phase, the boundaries were placed in the
temporal midpoint of this area. The total number of 1,500
target vowel sounds (30 vowels × 50 speakers) had to be
reduced to 1,492, as the visual and auditory inspection re-
vealed that 8 target items were of different vowel quality,
due to an error in the speakers’ reading.
Jitter, shimmer and HNR measurements were extracted
using a Praat script [27] with the default settings for
each parameter. As for jitter, values of local jitter (the
most common measurement) were extracted using wave-
form matching (see section 1). The measure represents the
average absolute difference between consecutive periods
divided by the average period, and is expressed as a per-
centage [9, 10]. Shimmer measurements were performed
using the local shimmer parameter, which expresses the average
absolute difference between the amplitudes of consecutive
periods divided by the average amplitude. Similarly to lo-
cal jitter, it is expressed as a percentage. HNR extraction,
representing the degree of acoustic periodicity expressed
in dB, was conducted by means of the cross-correlation
method, as recommended for voice analysis in Praat [27].
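The two definitions above can be written out as a short sketch. It assumes that the cycle durations and peak amplitudes have already been extracted (which is the difficult part in practice); the function names and the demonstration values are hypothetical, and the sketch does not reproduce Praat's additional plausibility checks on individual periods.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference of consecutive values over their mean, in %."""
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * mean(diffs) / mean(values)


def local_jitter(periods):
    """Local jitter (%) from a sequence of period durations in seconds."""
    return local_perturbation(periods)


def local_shimmer(amplitudes):
    """Local shimmer (%) from a sequence of per-cycle peak amplitudes."""
    return local_perturbation(amplitudes)


if __name__ == "__main__":
    # Hypothetical cycle measurements for one vowel token.
    periods = [0.0100, 0.0101, 0.0099, 0.0100, 0.0102]
    amplitudes = [0.80, 0.78, 0.83, 0.79, 0.81]
    print(f"jitter (local):  {local_jitter(periods):.2f} %")
    print(f"shimmer (local): {local_shimmer(amplitudes):.2f} %")
```

Because both measures share the same form, a single helper is enough; only the input (durations versus amplitudes) differs.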
The spectral magnitudes of H1*−H2*, H2*−H4*,
H1*−A1*, H1*−A2* and H1*−A3* as well as CPP
values were automatically extracted using Voice Sauce,
a free stand-alone software [28], using the labelled Praat
TextGrids. In order to estimate the location of harmonics,
f0 measurements needed to be carried out. We used
the Voice Sauce default algorithm STRAIGHT [29], detecting
f0 at 1 ms intervals and computing the harmonic
magnitudes pitch-synchronously over a three-cycle window.
This method eliminates much of the variability obtained
in spectra computed over a fixed time window, and
is equivalent to using a very long FFT window, providing
more accurate measurements without relying on large
FFT calculations [22].
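To make the idea of a pitch-synchronous harmonic measure more concrete, the following minimal sketch computes an uncorrected H1−H2 from a three-cycle frame by evaluating the spectrum only at f0 and 2·f0. It is not the Voice Sauce implementation: the formant correction that yields the asterisked measures is omitted, f0 is assumed to be known, and the Hann window, the function names and the synthetic test signal are assumptions made for illustration.

```python
import cmath
import math


def level_db(frame, fs, freq):
    """Level (dB) of a Hann-windowed frame at one frequency (single-bin DFT)."""
    n = len(frame)
    acc = 0j
    for i, x in enumerate(frame):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * i / n)  # periodic Hann window
        acc += x * w * cmath.exp(-2j * math.pi * freq * i / fs)
    return 20.0 * math.log10(abs(acc) + 1e-12)


def h1_minus_h2(frame, fs, f0):
    """Uncorrected H1-H2 (dB): level at f0 minus level at 2*f0."""
    return level_db(frame, fs, f0) - level_db(frame, fs, 2.0 * f0)


def synth_frame(fs=8000, f0=100.0, cycles=3, a1=1.0, a2=0.5):
    """Hypothetical two-harmonic signal spanning a whole number of cycles."""
    n = int(round(cycles * fs / f0))
    return [
        a1 * math.sin(2.0 * math.pi * f0 * i / fs)
        + a2 * math.sin(2.0 * math.pi * 2.0 * f0 * i / fs)
        for i in range(n)
    ]


if __name__ == "__main__":
    frame = synth_frame()
    print(f"H1-H2 = {h1_minus_h2(frame, 8000, 100.0):.2f} dB")
```

With the second harmonic at half the amplitude of the first, the sketch returns about 6 dB, i.e. 20·log10(2). Evaluating the transform only at the harmonic frequencies is one way to avoid a large FFT when the frame length is tied to the pitch period.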
CPP calculations in Voice Sauce are based on the algorithm
described in [14], using a variable window length which is equal
to five pitch periods by default. The obtained data are
then multiplied with a Hamming window and transformed
into the real cepstral domain. The CPP is estimated by
conducting a maximum search around the quefrency of
the pitch period. The peak is normalized to the linear re-
gression line calculated between 1 ms and the maximum
quefrency [22, 30].
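The steps described above can be sketched as follows. This is a simplified illustration rather than the Voice Sauce code: it assumes NumPy is available, takes the f0 estimate as an input, searches for the peak within a hypothetical ±20 % band around the pitch period, and uses half the frame length as the maximum quefrency; all names and the synthetic signals are assumptions.

```python
import numpy as np


def cpp_db(frame, fs, f0, search=0.2):
    """Cepstral peak prominence (dB) of one frame, given an f0 estimate in Hz."""
    x = np.asarray(frame, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x * np.hamming(n))
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real          # real cepstrum
    ceps_db = 10.0 * np.log10(cepstrum ** 2 + 1e-20)

    # Maximum search around the quefrency (in samples) of the pitch period.
    period = fs / f0
    lo = int(period * (1.0 - search))
    hi = int(period * (1.0 + search)) + 1
    peak = lo + int(np.argmax(ceps_db[lo:hi]))

    # Regression line between 1 ms and the maximum quefrency (here n // 2).
    first, last = int(round(0.001 * fs)), n // 2
    q = np.arange(first, last)
    slope, intercept = np.polyfit(q, ceps_db[first:last], 1)
    return float(ceps_db[peak] - (slope * peak + intercept))


if __name__ == "__main__":
    fs, f0 = 16000, 125.0
    t = np.arange(int(5 * fs / f0)) / fs          # five pitch periods
    rng = np.random.default_rng(0)
    harmonics = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 21))
    voiced = harmonics + 0.01 * rng.standard_normal(t.size)
    noise = rng.standard_normal(t.size)
    print(f"CPP, harmonic signal: {cpp_db(voiced, fs, f0):.1f} dB")
    print(f"CPP, white noise:     {cpp_db(noise, fs, f0):.1f} dB")
```

The harmonic signal should yield a clearly higher value than the noise, which mirrors the relationship between CPP and aperiodicity; the absolute values depend on the window and scaling choices and are not comparable with the figures reported in this paper.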
The raw voice parameters data were processed in R [31]
and visualised using the package ggplot2 [32]. The descriptive
statistics (mean, standard deviation, as well as the median in
the final summarizing table) were computed over all analyzed
vowels.
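The processing itself was done in R; purely as an illustration, the summary statistics can be sketched in Python as follows. The sketch assumes a normal-approximation 95 % confidence interval of the mean (the text does not state which method was used), and the function names are hypothetical. The worked example uses the jitter mean and SD reported in section 3.1 together with the 1,492 analyzed vowels.

```python
import math
from statistics import mean, median, stdev


def ci95(m, sd, n):
    """Normal-approximation 95 % confidence interval of a mean."""
    half = 1.96 * sd / math.sqrt(n)
    return m - half, m + half


def summarize(values):
    """Mean, SD, median and 95 % CI of the mean for one parameter."""
    m, sd = mean(values), stdev(values)
    return {
        "mean": m,
        "sd": sd,
        "median": median(values),
        "ci95": ci95(m, sd, len(values)),
    }


if __name__ == "__main__":
    # Jitter: mean 1.83 %, SD 1.97 %, n = 1,492 vowels.
    low, high = ci95(1.83, 1.97, 1492)
    print(f"95 % CI: {low:.2f}-{high:.2f} %")
```

Under this assumption the jitter interval comes out at roughly 1.73–1.93 %.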
3. Results and discussion
The estimated values of the respective voice parameters
will be presented in the following subsections. A table sum-
marising the extracted mean values is presented in section
4 (Table 1). In section 3.5, we will focus on the relationship
among the acoustic measures, and finally, we will comment
on some speakers’ results.
3.1. F0 perturbation measures: jitter and shimmer
Fig. 1a shows the value ranges of the jitter measure for
each speaker. The mean value is 1.83 % (SD: 1.97 %;
95% confidence interval: 1.73–1.94 %) and is above the
threshold value of 1.04 % for pathological voices [10]. The
mean value for the shimmer measure is 13.02 % (SD:
6.75 %; 95% confidence interval: 12.66–13.38 %) and is
also higher than the pathological threshold of 3.81 % [10].
The shimmer value ranges for individual speakers are dis-
played in Figure 1b.
Both the estimated jitter and shimmer values are above
the stated limits for detecting voice pathologies. However,
as already mentioned above, the stated threshold values
refer to the measurements performed on sustained vowels
[9, 10], while our voice parameter extraction is based on
continuous speech, which causes fast changes in pitch and
formants [6, 13, 33]. The jitter and shimmer measures are
thus necessarily higher than the pathological threshold,
and it is clear that they do not reflect any voice pathology
(Boersma, 2017, personal communication); in fact,
proposing perturbation values and ranges corresponding
to healthy voices in connected speech is the main objective
of this study.
Figure 1: a. jitter value ranges, x-axis displays 50 speakers, y-axis shows jitter values (%). b. shimmer value ranges,
x-axis displays 50 speakers, y-axis shows shimmer values (%)
3.2. Harmonicity (Harmonics-to-noise ratio, HNR)
Figure 2a shows the value ranges for each speaker. The
mean value is 9.41 dB (SD: 4.05 dB; 95% conf. int.:
9.20–9.62 dB), which is well above the threshold value
of 7 dB for voice pathologies and is in line with previous
findings [10, 12, 34]. It will be useful to compare our data
with previous studies. For example, Yumoto and Gould
[34] examined the HNR parameter in relation to the de-
gree of hoarseness in both healthy speakers and speakers
with laryngeal disorders pre- and post-operatively using
a sustained /ɑ:/ vowel. The estimated HNR for the healthy
speaker group ranged between 7 and 17 dB with the mean
of 11.9 dB (12.2 dB for men and 11.5 dB for women)
compared to the estimated value range between −15.2 and
9.6 dB with the mean of 1.6 dB in preoperative speakers.
It can be seen in Figure 2a that only two speakers’ values
in our sample fall below 7 dB.
3.3. Cepstral Peak Prominence (CPP)
Figure 2b displays the value ranges for the CPP mea-
sure extracted automatically using Voice Sauce. The
mean value is 20.28 dB (SD: 3.69; 95% conf. int.:
20.26–20.30 dB). There exists a negative correlation be-
tween the CPP and the levels of aperiodicity of the glot-
tal source – the higher the CPP, the lower the degree of
aperiodicity in the voice signal [13, 15, 18]. Some researchers
have evaluated the effectiveness of CPP, as an acoustic measure
of voice quality, in predicting breathiness ratings,
and our results will thus be compared with theirs. Hillenbrand
et al. [14] tested the parameter in healthy native
English speakers who were asked to produce sustained vo-
wels in nonbreathy, moderately breathy and very breathy
phonation. The results confirmed that periodicity mea-
sures, namely CPP, provide the most accurate predictions
of perceived breathiness [15, 18]. These findings were also
confirmed for dysphonic voices and continuous speech [15].
In their study, Hillenbrand and Houde [15] provide examples
of the CPP measures for signals perceived as nonbreathy
and moderately breathy: 21.6 dB and 13.1 dB,
respectively. Garellek and Keating [36] reported the CPP
mean value of 22.5 dB for modal phonation extracted from
/a, æ, o/ uttered in words by male speakers. The mean
values for both creaky and breathy phonations were lower
than the value of 20 dB. The CPP mean value we obtained
should therefore reflect a nonbreathy/modal phonation.
Figure 2: a. HNR value ranges, x-axis displays 50 speakers, y-axis presents HNR values (dB). b. CPP value ranges,
x-axis displays 50 speakers, y-axis presents CPP values (dB)
3.4. Harmonic amplitude measures
The value ranges for H1*−H2*, as automatically extracted
in Voice Sauce, are captured in Figure 3a. The mean
is 1.83 dB (SD: 6.04; 95% conf. interval: 1.79–1.86 dB).
As H1*−H2* is a correlate of the open quotient, lower values
indicate a greater glottal constriction [18]. Cross-linguistically,
H1*−H2* also represents one of the most successful measures
of phonation type [35] and is often cited as an acoustic
correlate of breathiness (e.g. [21]). Nevertheless, it seems
to be a more reliable predictor of breathiness ratings for
sustained vowels than for sentences or continuous speech
[15]. H1*−H2* values for nonbreathy and breathy phonation
were reported in [15]: 1.7 dB and 19.3 dB, respectively.
Hanson and Chuang [17] obtained the following mean values
using sustained vowel production in healthy speakers:
men: 0.0 (SD: 1.8) dB and women: 3.1 (2.0) dB. Narra et
al. [16] also used sustained vowels for their measurements
in healthy speakers and present the following mean values
for H1*−H2* (sustained /a/): 7.18 (SD: 3.7) dB for male
and 11.49 (2.73) dB for female speakers.
H2*−H4* parameter estimation yielded the mean of
9.37 dB (SD: 6.09; 95% conf. int.: 9.33–9.4 dB). Figure 3b
displays the value ranges for all our speakers. Similarly
to H1*−H2*, H2*−H4* is also mentioned as a significant
acoustic correlate of the perception of contrastive
breathiness in some languages [19]. Garellek et
al. [20] measured H2*−H4* and H1*−H2* in samples
of sustained /a/ which were inverse-filtered and copy-synthesized
to find out how they correlate with
perceived breathiness. The obtained mean values in dB for
H2*−H4* were 8.93 (SD: 3.74) for men and 11.57 (4.99)
for women, and for H1*−H2* 6.13 (4.11) for men and
8.93 (4.55) for women.
Finally, let us look at the value estimations of the
amplitude of the first harmonic relative to that of the
F1, F2 and F3 prominence. Greater H1*−A1*, H1*−A2*
and H1*−A3* differences indicate weaker
higher frequencies and more noise components in the spectrum
[35]. The mean values are: 21.43 dB (SD: 8.4) for
H1*−A1*, 24.89 dB (SD: 8.42) for H1*−A2* and 18.87 dB
(SD: 10.4) for H1*−A3* (see Table 1 in section 4). In [16],
the following average and standard deviation values for
sustained /a/ are reported: H1*−A1* in healthy men:
6.7 (2.53) dB and women 11.17 (4.54) dB, H1*−A2*
9.64 (4.79) dB in men and 12.73 (3.0) dB in women, and
H1*−A3* 24.53 (6.06) dB in men and 28.79 (5.41) dB in
women.
Figure 3: a. H1*−H2* value ranges, x-axis displays 50 speakers, y-axis presents H1*−H2* values (dB). b. H2*−H4*
value ranges, x-axis displays 50 speakers, y-axis presents H2*−H4* values (dB)
3.5. Acoustic measure relationships
Let us now have a look at the relationships among the
extracted parameters. Figure 4 captures the correlations
between the extracted mean values. In each case, we plot-
ted a particular acoustic measure against CPP, as this
parameter has been found to provide valid and reliable
measurements in continuous speech [6, 13, 14, 15]. Spearman’s
rank correlation coefficient ρ was computed due to
the presence of outlier values.
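The motivation for a rank-based coefficient can be illustrated with a minimal sketch: Spearman's ρ is Pearson's correlation computed on ranks, so a single extreme value cannot dominate the result. The function names and the demonstration data below are hypothetical and are not taken from the study.

```python
import math
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical speaker means; the last x value is an outlier.
    x = [1.0, 2.0, 3.0, 4.0, 100.0]
    y = [10.0, 20.0, 30.0, 40.0, 50.0]
    print(f"Pearson r    = {pearson(x, y):.2f}")
    print(f"Spearman rho = {spearman_rho(x, y):.2f}")
```

In this example the relationship is perfectly monotonic, so ρ equals 1, whereas Pearson's r is pulled down by the outlier.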
The plots suggest only mild or weak correlations, which
confirms the relative independence of the different measures.
Specifically, there is a positive correlation between
CPP and HNR (ρ = 0.4, p < 0.005), and CPP and some of
the harmonic amplitude measures: H1*−H2* (ρ = 0.26,
p < 0.1), H2*−H4* (ρ = 0.27, p < 0.1). The negative
correlation between CPP and jitter did not even reach
significance (ρ = −0.14, p > 0.1).
Correlations were stronger when we examined the interdependence
of the harmonic amplitude measures. They
are all positive and significant: H1*−H2* vs.
H1*−A1* (ρ = 0.78, p < 0.001), H1*−H2* vs. H1*−A2*
(ρ = 0.58, p < 0.001), and H1*−H2* vs. H1*−A3*
(ρ = 0.61, p < 0.005). Only the correlation between
H1*−H2* and H2*−H4* was not significant (ρ = 0.66,
p > 0.5), which indicates that they reflect different properties
of the voice.
Figure 4: Scatterplots (extracted mean values) with
trendlines (and 95% confidence bands). From top left:
H1*−H2*, H2*−H4*, HNR and jitter plotted against CPP
3.6. Comments on particular speakers’ values
Taking into account the relationships among our acous-
tic measures presented in the previous subsection, we will
examine some speakers’ mean values, taking values of the
cepstral peak prominence close to the extremes as starting
points.
The second highest CPP mean value was measured in
Speaker 40 (S40): 22.52 dB (the overall mean across all
speakers being 20.28 dB, and the mean value being higher
for only one speaker, 23.06 dB). S40’s HNR mean value
is 12.24 dB (the overall mean: 9.41; the maximum mean
value: 14.06), the H1*−H2* mean value of 0.88 dB is
well below the overall mean of 1.83 (the minimum mean
value: −3.43), and so is the H2*−H4* mean: 5.25 dB (the
overall mean: 9.37; the minimum mean value: 3.01), and
finally, S40’s jitter mean value of 1.25 % is also below the
overall mean of 1.83 % (the minimum value: 1.02).
The results reported in the previous paragraph im-
ply a certain consistency across all the parameters. How-
ever, that is not always the case in all the speakers.
For instance, in S2, we estimated the highest CPP mean
value (23.06 dB), but S2’s H1*−H2* and H2*−H4* means
(2.78 dB and 10.9, respectively) are above the overall mean
values (1.83 and 9.37, respectively).
Let us turn to Speaker 27 from the other end of the scale.
S27 has the lowest mean value of CPP (17.32 dB) and his
HNR mean of 5.57 dB is well below the overall mean of
9.41 dB. Also, this speaker’s jitter mean of 2.9 % is above
the overall mean. However, S27’s H1*−H2* and H2*−H4*
means (1.77 and 9.21, respectively) are below the overall
mean values, which would not be expected considering
the indicated relationships among the respective acoustic
measures.
4. Conclusions
The aim of this study was to establish quantitative ranges
of voice quality parameters in healthy male speakers
of Common Czech in an objective way, based on a continuous
speech reading task. The key values of all the parameters
are summarized in Table 1.
Parameter   Mean (SD)         Median     Q1–Q3
Jitter      1.83 % (1.97)     1.18 %     0.72–2.12 %
Shimmer     13.02 % (6.75)    11.9 %     8.33–16.81 %
HNR         9.4 dB (4.05)     9.4 dB     6.58–12.22 dB
CPP         20.3 dB (3.69)    20.2 dB    17.33–23.02 dB
H1*−H2*     1.8 dB (6.04)     1.6 dB     −2.36–5.75 dB
H2*−H4*     9.4 dB (6.09)     9.2 dB     5.17–37.59 dB
H1*−A1*     21.4 dB (8.4)     20.9 dB    15.6–26.6 dB
H1*−A2*     24.9 dB (8.42)    24.3 dB    19.13–30.13 dB
H1*−A3*     18.9 dB (10.4)    18.9 dB    11.84–68.72 dB
Table 1: The estimated mean and median values and the
values of the first and third quartile (Q1–Q3)
Although sustained vowel productions are commonly
used to assess voice quality when conducting acoustic mea-
surements, we decided to use a continuous speech sample
based on a reading task. As the human voice represents a dynamic,
time-varying source of vocal tract excitation, it is
connected speech (characterized by rapid successions of
different articulatory controls) that should provide rele-
vant, ecologically valid data in terms of what makes speech
production normal, and should enable researchers and
clinicians to understand and assess the abnormality of
speech production in different speech styles.
Our estimated jitter and shimmer values are above the
commonly stated threshold limits for voice pathologies, es-
pecially in the case of shimmer. Needless to say, continu-
ous speech contains variations in pitch, formants and loud-
ness as well as rapid consonant-vowel and vowel-consonant
transitions; our data thus cannot be compared with those
obtained from speakers sustaining vowels for several se-
conds, but may provide reference for similar endeavours
in the future.
The HNR measurements were conducted in a similar
way as jitter and shimmer estimation, i.e. using
a temporal-based method. The obtained mean
value is well above the stated threshold value for pathological
voices, even though we used continuous speech. It
would be useful to compare our data with HNR estimation
using a spectral- (or more precisely, cepstral-) based
technique.
Harmonic amplitude measurements yielded somewhat
higher values in most parameters compared to other
studies. As in the case of the acoustic parameters men-
tioned above, harmonic amplitude measurements are com-
monly performed on sustained vowels. Finally, based on
findings available in literature, the estimated CPP values
seem to reflect modal phonation in most of our speakers.
While mapping voice parameters in our study, we also
tried to examine the suitability/usefulness of the parame-
ter estimations when using connected speech material. Fu-
ture research might further examine the parameter extrac-
tion techniques relating to connected speech and conduct
further measurements across different groups of speakers.
Acknowledgements
This research was supported by the project “Interdisciplinary
approach to the linguistic theory issues”, subproject
“The mapping of voice parameters in young healthy
male speakers of Common Czech”, carried out at Charles University
within the Specific University Research programme in 2017. The
second author was supported by the European Regional Development
Fund project “Creativity and Adaptability as
Conditions of the Success of Europe in an Interrelated
World” (No. CZ.02.1.01/0.0/0.0/16_019/0000734).
References
[1] Laver, J.: The Phonetic Description of Voice Quality,
CUP, Cambridge, 1980.
[2] Arnold, A.: Le rôle de la fréquence fondamentale et
des fréquences de résonance dans la perception du
genre. Travaux interdisciplinaires sur la parole et la
langue, 28, p. 2–14, 2012.
[3] Mendoza, E., Valencia, N., Muñoz, J., Trujillo, H.: Differences
in Voice Quality Between Men and Women:
Use of the Long-Term Average Spectrum (LTAS),
Journal of Voice, 10(1), p. 59–65, 1996.
[4] Weingartová, L., Bořil, T., Vaňková, J.: Spektrální
sklon. In: Skarnitzl, R. (Ed.), Fonetická identifikace
mluvčího, FF UK, Praha, 2014.
[5] Bhuta, T., Patrick, L., Garnett, J. D.: Perceptual
evaluation of voice quality and its correlation with
acoustic measurements, J. of Voice, 18, p. 299–304,
2004.
[6] Awan, S. N., Solomon, N. P., Helou, L. B., Stojadi-
novic, A.: Spectral-Cepstral Estimation of Dyspho-
nia Severity: External Validation, Annals of Otology,
Rhinology & Laryngology, 122(1), p. 40–48, 2013.
[7] Kreiman, J., Gerratt, B. R.: Jitter, Shimmer, and
Noise in Pathological Voice Quality Perception,
VOQUAL’03, Geneva, p. 57–61, 2003.
[8] Kent, R. D., Ball, M. J.: Voice Quality Measurement,
Singular Publishing Group, San Diego, 2000.
[9] Boersma, P.: Should Jitter Be Measured by Peak
Picking or by Waveform Matching?, Folia Phoniatr.
Logop., 61, p. 305–308, 2009.
[10] Teixeira, J. P., Oliveira, C., Lopes, C.: Vocal Acous-
tic Analysis – Jitter, Shimmer and HNR Parameters,
Procedia Technology, 9, p. 1112–1122, 2013.
[11] Qi, Y., Hillman, R. E.: Temporal and spectral esti-
mations of harmonics-to-noise ratio in human voice
signals, Journal of Acoustical Society of America,
102(1), p. 537–543, 1997.
[12] Boersma, P.: Accurate short-term analysis of the
fundamental frequency and the harmonics-to-noise ratio
of a sampled sound, IFA Proceedings 17, p. 97–110,
1993.
[13] Murphy, P. J.: Periodicity estimation in synthesized
phonation signals using cepstral rahmonic peaks,
Speech Communication, 48, p. 1704–1713, 2006.
[14] Hillenbrand, J., Cleveland, R., Erickson, R.: Acoustic
correlates of breathy vocal quality, J. Sp. Hear. Res.,
37, p. 769–778, 1994.
[15] Hillenbrand, J., Houde, R. A.: Acoustic Correlates of
Breathy Vocal Quality: Dysphonic Voices and
Continuous Speech, Journal of Speech and Hearing
Research, 39, p. 311–321, 1996.
[16] Narra, M., Anu, T. D., Varghese, S. M., Datta-
treya, T.: Harmonic Amplitude Measures to Note
Gender Differences, Advances in Life and Technology,
31, p. 17–23, 2015.
[17] Hanson, H. M., Chuang, E. S.: Glottal characteristics
of male speakers: Acoustic correlates and compari-
son with female data, J. Acoust. Soc. Am., 106(2),
p. 1064–1077, 1999.
[18] Keating, P. A., Esposito, C.: Linguistic Voice Quality,
UCLA Working Papers in Phonetics, 105, p. 85–91,
2007.
[19] Garellek, M., Keating, P., Esposito, C. M.,
Kreiman, J.: Voice quality and tone identification
in White Hmong, J. Acoust. Soc. Am., 133(2),
p. 1078–1089, 2013.
[20] Garellek, M., Samlan, R. A., Kreiman, J., Ger-
ratt, B.: Perceptual sensitivity to a model of the
source spectrum, Proceedings of Meetings on Acous-
tics, 19, p. 1–5, 2013.
[21] Wayland, R., Jongman, A.: Acoustic correlates of
breathy and clear vowels: the case of Khmer, Jour-
nal of Phonetics, 31, p. 181–201, 2003.
[22] Shue, Y., Keating, P., Vicenik, C., Yu, K.:
VoiceSauce: A program for voice analysis, Proc. 17th
ICPhS, Hong Kong, p. 1846–1849, 2011.
[23] Tykalová, T., Rusz, J., Čmejla, R., Růžičková, H.,
Růžička, E.: Acoustic investigation of stress pat-
terns in Parkinson’s disease, Journal of Voice, 28(1),
129.e1–129.e8, 2014.
[24] Skarnitzl, R., Vaňková, J.: Fundamental frequency
statistics for male speakers of Common Czech, Acta
Universitatis Carolinae – Philologica 3, Phonetica
Pragensia XIV, p. 7–17, 2017.
[25] Kreiman, J., Sidtis, D.: Foundations of Voice Studies,
Blackwell Publishing, Oxford, 2011.
[26] Machač, P., Skarnitzl, R.: Principles of Phonetic Seg-
mentation, Epocha, Praha, 2009.
[27] Boersma, P., Weenink, D.: Praat: doing phonetics
by computer (Version 5.4.08), Retrieved: 5. 5. 2015,
http://www.praat.org.
[28] Shue, Y.: VoiceSauce: A program for voice analysis
(Version 1.31), Retrieved: 31. 5. 2017,
http://www.phonetics.ucla.edu/voicesauce/.
[29] Kawahara, H., Masuda-Katsuse, I., de Cheveigné, A.:
Restructuring speech representations using a
pitch-adaptive time-frequency smoothing and an
instantaneous-frequency-based F0 extraction,
Speech Communication, 27, p. 187–207, 1999.
[30] Vicenik, C., Lin, S., Keating, P., Shue, Y.: Online
documentation for VoiceSauce. Available at:
http://www.phonetics.ucla.edu/voicesauce/
documentation/index.html, accessed: 31. 5. 2017.
[31] R Core Team: R: A language and environment for
statistical computing, R Foundation for Statistical
Computing, Vienna, Austria. Retrieved from
http://www.R-project.org, 2016.
[32] Wickham, H.: ggplot2: Elegant graphics for data anal-
ysis (use R!), Springer, New York, 2009.
[33] Fourcin, A.: Aspects of Voice Irregularity Measure-
ments in Connected Speech, Folia Phoniatrica et Lo-
gopaedica, p. 126–136, 2009.
[34] Yumoto, E., Gould, W. J.: Harmonics-to-noise ratio as
an index of the degree of hoarseness, J. Acoust. Soc.
Am., 71(6), p. 1544–1550, 1982.
[35] Esposito, C. M.: The effects of linguistic experience
on the perception of phonation, J. of Phonetics, 38,
p. 306–316, 2010.
[36] Garellek, M., Keating, P.: The acoustic consequences
of phonation and tone interactions in Jalapa Mazatec,
J. of IPA, 41(2), 2011.