
The Perception and Analysis of the Likeability and Human Likeness of Synthesized Speech

Alice Baird1, Emilia Parada-Cabaleiro1, Simone Hantke1,2 , Felix Burkhardt3,
Nicholas Cummins1, Björn Schuller1,4
1ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
2Machine Intelligence and Signal Processing Group, Technische Universität München, Germany
3Deutsche Telekom, Berlin, Germany
4GLAM – Group on Language, Audio and Music, Imperial College London, UK.
alice.baird@informatik.uni-augsburg.de
Abstract
The synthesized voice has become an ever-present aspect of daily life. Heard through our smart devices and in public announcements, such voices are the subject of a continued engineering endeavour to achieve naturalness. Yet, the degree to which these methods can produce likeable, human-like voices has not been fully evaluated. With recent advancements in synthetic speech technology suggesting that human-like imitation is more obtainable, this study asked 25 listeners to evaluate both the likeability and human likeness of a corpus of 13 German male voices, produced via 5 synthesis approaches (from formant to hybrid unit-selection, deep neural network systems) and 1 human control. Results show that, unlike visual artificially intelligent elements – as posed by the concept of the Uncanny Valley – likeability consistently improves along with human likeness for the synthesized voice, with recent methods achieving substantially closer results to human speech than older methods. A small-scale acoustic analysis shows that the F0 of conventional systems corresponds less closely to human speech, with a lower standard deviation of F0. This analysis suggests that limited variance in F0 is linked to a reduction in human likeness, resulting in lower likeability for conventional synthetic speech methods.
Index Terms: synthesized voices, human likeness, likeability.
1. Introduction
The anthropomorphisation of machines has been something of a curiosity for engineers throughout the 20th century [1, 2, 3], and is closely related to the advent of thinking on artificially intelligent (AI) beings [4]. Giving a voice to such beings has become a crucial consideration and necessary component for developers, as a means of achieving comfortable and ‘social’ Human Computer Interactions (HCI) [5]. As such voices are now embedded in our daily life (from personal assistants [6] to humanoid robots [7]), this study explores the impact that human likeness may have on our perception of likeability.
From the recorded speech of repeated public announcements, e. g., ‘mind the gap’, to the more complex (deep) machine learning Text-To-Speech (TTS) systems of today, e. g., the IBM Watson system [8] or DeepMind’s WaveNet [9], the methods for creating such voices have advanced substantially throughout the past decade [10, 8]. Now in the age of deep learning, it seems that true human likeness (or naturalness) of synthesized voices is a more obtainable feature, with recent advancements focusing on specific speech aspects, including learning unknown pronunciations [11].
The perception of human speech is a well researched area [12, 13, 14]. However, considerably less effort has been made towards the perception of synthesized speech. This is surprising, as synthesized voices are often a consideration in developing corporate identity (i. e., Apple’s Siri, or Amazon’s Alexa), and attractiveness (or likeability) is closely linked to commercial success [15]. Initial research has been conducted in relation to earlier synthesis methods, including the ability of such voices to portray personality traits [16], as well as the likeability of gendered synthesized voices [17]. Additionally, the effect of mixing human speech with synthetic speech as a means of improving likeability has also been evaluated [18].
Human likeness is a term commonly discussed in relation to AI through the concept of the Uncanny Valley. Coined by the Japanese roboticist Masahiro Mori in 1970 [19], this concept describes the ‘almost but not quite’ human likeness of visual features in humanoid robots, which may elicit an unfamiliar feeling or aversion to the AI ‘being’ [20]. The ‘valley’ is a point in the degree of human likeness at which features are close (but not exact) replications of real human features, and at which familiarity (and therefore, in some way, likeability) substantially decreases. The Uncanny Valley has been extensively explored in relation to the visual attributes of AI [21, 22]. Yet, only briefly has the Uncanny Valley been explored in relation to robotic speech [23], finding (as part of a multi-modal system) that human likeness is linearly related (in most cases) to the ability to portray emotion. In this way, [24] also evaluated the uncanny valley for text-to-speech (TTS), comparing two synthesis methods used within a dialogue system.
The implications of synthesized (‘robotic’) speech have been evaluated in [25], and synthetic voice identity itself has begun to be highlighted as problematic in recent literature [26]. In this regard, brief studies relating to human likeness in synthesis [27, 28] have begun to appear. This study aims to build on this, utilising the German Text-to-Speech Dataset1 (GTTS). From GTTS, a selection of male voices has been gathered (balance was not possible across genders), and 25 listeners from varied backgrounds were asked to assess the level of human likeness and likeability of 13 voices of varying synthesis methods (including formant, diphone, unit selection, hybrid unit selection, and state-of-the-art hybrid deep neural network systems), as well as 1 human voice. In this way, we evaluate whether human likeness and likeability are linked for the synthesized voice, asking whether there is an Uncanny Valley, and how close to human speech current methods are able to come.

1 www.ttssamples.syntheticspeech.de
2. Methodology
2.1. Corpus
The corpus used within this study is a subset of GTTS, consisting of 13 male voices and 39 utterances. Throughout the years of synthesis (details of voice years are given in Table 1), it is observed that for a long period synthesis methods consisted of diphone (two-phone) concatenation from recorded speech. Formant synthesis was also a common signal processing method, associated with the ‘robotic’, somewhat monotonic synthetic voices. Within this corpus the following synthesis methods are included: formant synthesis [29], concatenative diphone synthesis [30], conventional non-uniform unit-selection [31], hybrid non-uniform unit-selection synthesis using the Liljencrants-Fant model [32], and state-of-the-art hybrid unit-selection synthesis methods with deep neural network frameworks, e. g., IBM Watson (2016)2 and ReadSpeaker (2018)3. Additionally, as a control, one human male voice is included.
Each of the 13 voices speaks 3 sentences (a total of 39 utterances). Files were extracted for the dataset and converted to mono wav format (16 bit, 44.1 kHz) for the listening test (a conversion sketch is given after the sentence list below). Sentences were designed to evaluate known problems for German natural language processing modules. The sentences include:
1: An den Wochenenden bin ich jetzt immer nach Hause gefahren und habe Agnes besucht. Dabei war eigentlich immer sehr schönes Wetter gewesen.

2: Dr. A. Smithe von der NATO (und nicht vom CIA) versorgt z. B. – meines Wissens nach – die Heroin seit dem 15.3.00 täglich mit 13,84 Gramm Heroin zu 1,04 DM das Gramm.

3: Die Manpowerdiskussion wird gecancelt, du kannst das File vom Server downloaden.
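For reproducibility, a minimal sketch of the conversion described above is given below; the paper does not state the tooling used, so the pydub-based approach and the file names are assumptions.

```python
# Minimal sketch of the audio preprocessing described above (the exact tooling
# used by the authors is not stated; pydub is an assumption, and it requires
# ffmpeg to be installed on the system).
from pydub import AudioSegment

def to_listening_test_wav(src_path: str, dst_path: str) -> None:
    """Convert an arbitrary audio file to mono, 16-bit, 44.1 kHz WAV."""
    audio = AudioSegment.from_file(src_path)   # decode any format ffmpeg supports
    audio = (audio.set_channels(1)             # mono
                  .set_frame_rate(44100)       # 44.1 kHz
                  .set_sample_width(2))        # 16 bit = 2 bytes per sample
    audio.export(dst_path, format="wav")

# Hypothetical usage:
# to_listening_test_wav("DN_2018_sentence1.mp3", "DN_2018_sentence1.wav")
```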
Within the corpus, only male voices have been taken from the GTTS data set. Using only the male voices was a decision taken due to the imbalanced nature of gender across the dataset and the year-based selection of the voices. The earlier years of German-speaking TTS were dominated by male synthetic speech, and although there may be some effect on listener perception due to speaker gender [33], for this study we have chosen to prioritise balancing the dataset.
2.2. Evaluation Parameters
As a means of investigating the advancements in voice synthesis, the parameters of likeability and human likeness are evaluated. The listening task was completed in the iHEARu-PLAY online browser-based annotation platform [34], and the traits were divided into 2 individual tasks. 25 listeners (ages ranging from 22–57 years) of varying nationalities4, 14 male and 11 female, voluntarily evaluated the corpus of 13 voices. Taking into account the data quality management procedures described in [35], high quality annotations were collected.

In previous studies by the authors, the effect of perception on native and non-native listeners [27] was evaluated, finding no substantial difference in results. Therefore, here, the multinational listener group is not expected to compromise the results.
Likeability: To evaluate likeability, the listeners were asked to ‘Please listen to the voice and judge the level of Likability. i. e., How much do you like the voice speaking?’. For each utterance, listeners evaluated the likeability on a 5–point Likert scale; 1 = Not at all, 5 = Extremely.
2 www.console.bluemix.net/docs/services/text-to-speech/science
3 www.readspeaker.com
4 Native language of listeners included: 11 English, 10 German, 1 Spanish, 1 Japanese, 2 Chinese.
Table 1: A summary of the voices used within this study. All voices are male, German speaking. Each voice speaks 3 sentences, with a total of 39 utterances. The human voice was recorded in 2017, and all synthesized voices are marked by name.

Name | Developer | Type
Human | – | 30 years, native German
F_1974 | Samples of an Audiodata Braille reader | Hardware formant synthesizer
D_1996 | Technical University of Dresden | Diphone synthesis
D_1997 | BabelTech® | Diphone-concatenation synthesis
D_1998 | Elan®, now Acapella® | Diphone-concatenation
D_2000 | Voice Interconnect | Diphone synthesis
D_2004 | Atip® | Diphone-concatenation
U_2007 | Nuance® | Non-uniform unit-selection
F_2006 | eSpeak | Formant synthesis
F_2009 | Meridian | Formant synthesis
HU_2014 | VoxyGen® | Hybrid non-uniform unit-selection
DN_2016 | IBM Watson® | Non-uniform unit-selection using DNNs
DN_2018 | ReadSpeaker® | Non-uniform unit-selection using DNNs
Human likeness: As in our previous studies [27, 28], the term human likeness is used, taken from the concept of the Uncanny Valley [19], to describe how accurately the machine is able to imitate a human. For this task, listeners were asked to ‘Please listen to the voice and judge the level of human likeness. i. e., How close to human would you rate the voice speaking?’. Again, a 5–point Likert scale was used; 1 = Not at all, 5 = Extremely.
3. Evaluation of Results
To evaluate how likeability is linked to improved human likeness, multiple comparisons between the listeners’ perception of the evaluated classes are presented across the voices (covering several years of synthesis). Mean results from the evaluation are included in Table 2. Reported p-values are considered significant under the conventional threshold of p < .05, with eta squared (η²) used as the measure of effect size.
3.1. Analysis and Discussion
For both of the evaluated parameters, i. e., likeability and human likeness, the null hypothesis that the samples of all the evaluated groups are perceived similarly has been rejected, given the significant difference in the medians shown by the Kruskal-Wallis test: H(13) = 375.01, p < .001 for likeability; H(13) = 513.57, p < .001 for human likeness. In order to evaluate which specific voices are perceived as significantly different, pairwise comparisons among the 13 groups (using the Dunn-Bonferroni post hoc test) have been applied.
It is worth noting here that we were unable to perform a one-way ANOVA. During initial analysis, a Shapiro-Wilk test was used to check for normality and returned p < .001 for both perception evaluation parameters (likeability and human likeness). Furthermore, the results of a Levene test revealed that the variances between listeners’ responses were also not equal, again at significant levels, p < .001, for both parameters. Therefore, using the results of both tests, we rejected the null hypotheses of the population variances being equal and the distribution being normal, and thus switched to a non-parametric method of analysis.
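The analysis described above could be reproduced along the following lines; this is a sketch rather than the authors' actual code, and the long-format table with columns 'voice', 'likeability', and 'human_likeness' is an assumed data layout.

```python
# Sketch of the non-parametric analysis outlined above (assumed data layout:
# one rating per row, with columns 'voice', 'likeability', 'human_likeness').
import pandas as pd
from scipy import stats
import scikit_posthocs as sp  # provides Dunn's post hoc test

def analyse(ratings: pd.DataFrame, measure: str) -> None:
    groups = [g[measure].values for _, g in ratings.groupby("voice")]

    # Normality and homogeneity of variance checks (both rejected in the paper, p < .001)
    print("Shapiro-Wilk:", stats.shapiro(ratings[measure]))
    print("Levene:", stats.levene(*groups))

    # Kruskal-Wallis H-test across all voices
    h, p = stats.kruskal(*groups)
    k, n = len(groups), len(ratings)
    eta_sq = (h - k + 1) / (n - k)  # eta squared estimated from H
    print(f"H = {h:.2f}, p = {p:.3g}, eta^2 = {eta_sq:.2f}")

    # Pairwise Dunn post hoc comparisons with Bonferroni correction
    print(sp.posthoc_dunn(ratings, val_col=measure, group_col="voice",
                          p_adjust="bonferroni").round(3))

# Hypothetical usage:
# analyse(ratings, "likeability")
# analyse(ratings, "human_likeness")
```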
Table 2: The mean (m) and standard deviation (std) are shown for each voice, for results of both perception of human-likeness (H-L) and likeability (L). Results above 2.5 are highlighted.

Voice | H-L m | H-L std | L m | L std
F_1974 | 1.07 | 0.25 | 1.24 | 0.51
D_1996 | 1.94 | 1.08 | 2.13 | 1.02
D_1997 | 2.65 | 1.30 | 2.32 | 1.00
D_1998 | 1.92 | 0.98 | 1.82 | 0.85
D_2000 | 2.13 | 0.92 | 2.17 | 0.76
D_2004 | 1.86 | 0.87 | 1.74 | 0.81
U_2007 | 2.66 | 1.03 | 2.16 | 0.92
F_2006 | 1.22 | 0.49 | 1.32 | 0.57
F_2009 | 1.27 | 0.61 | 1.41 | 0.72
HU_2014 | 3.95 | 1.14 | 3.27 | 1.15
DN_2016 | 3.91 | 0.98 | 3.29 | 1.15
DN_2018 | 3.88 | 0.97 | 3.52 | 0.85
Human | 4.76 | 0.68 | 3.69 | 1.19
From the pairwise comparisons, results display no significant difference in listener perception of likeability between Human and HU_2014, DN_2016, and DN_2018. This is observed through a comparison of 3 voice pairings, i. e., Human vs HU_2014, Human vs DN_2016, and Human vs DN_2018 (p = 1.00), with a small effect size (η² = 0.03). For human likeness, a similar tendency is observed, not yielding significant results: Human vs HU_2014 (p = .792, η² = 0.21); Human vs DN_2016 (p = .560, η² = 0.27); Human vs DN_2018 (p = .381, η² = 0.31).
From this, it is observed that HU_2014, DN_2016, and DN_2018 (the most recent synthesis methods) are perceived to be as likeable and human-like as the Human voice. Looking closely at the results of these voices shows that the hybrid synthesis of HU_2014 is as human-like and likeable as DN_2016 and DN_2018; to evaluate further the advancements due to DNN-based synthesis would require a more focused study.
Despite the similarity of HU_2014, DN_2016, and DN_2018, there is a continued increase in the perception of likeability and human likeness across the corpus, observed prominently within the plotted results in Figure 1 (A). This continued increase is not linear, and we would like to draw attention to the large fluctuation in results between the years of 2004 and 2009, which is mostly due to the formant synthesis methods marked in the time-line.
When looking at Figure 1 (B), the cluster of the formant synthesis voices is observed in the lowest range, despite their 32-year age gap. As well as this clustering, an observable gap between the diphone and unit-selection voices and the hybrid and DNN voices can be seen, shown, e. g., by the significant difference yielded between D_1997 vs DN_2018 for human likeness (p = .001, η² = 0.22). This observation could be related to the findings in [23], in which the portrayal of emotion was found to be linked to the human likeness of synthesized voices. In relation to this, it could be that the results of our study are due to a more effective portrayal of emotion in the hybrid-DNN voices; however, this would need a detailed emotion-specific analysis.
When looking at these results in relation to the concept of the Uncanny Valley [19], Figure 1 (B) plots likeability in place of familiarity. It is observed that, unlike with visual elements, when likeability increases, so too does human likeness. However, in Figure 1 (B), it can also be observed that there is a tendency towards the ‘valley’ between the years of 1998 and 2014, which warrants further exploration.
Figure 1: Mean (m) results for human likeness (H-L) and likeability (L) for all 13 voices evaluated in the corpus. (A) H-L and L of all voices, as well as Human H-L (Hu-HL) and Human L (Hu-L), over time; (B) H-L against L for all voices.
Through the comparison between D_1996 vs D_2004, it is observed that these are perceived as highly similar, given the small effect size for likeability and a null effect for human likeness (p = 1.00, η² = 0.03; and p = 1.00, η² = 0.00, respectively). On the contrary, D_2004 vs DN_2018 have been perceived as different, displayed by a medium effect size (p < .001 and η² = 0.54 for both likeability and human likeness), showing that voices above this year range (1998–2013, excluding the formant synthesis voice) are being perceived differently to conventional synthesis methods as a whole, even those of lower human likeness. In this regard, despite this tendency towards the ‘valley’, it is observed consistently from the results that likeability does correspond to human likeness. Through computing a Pearson correlation of all human-likeness and likeability results from all voices, it is shown that the perception of these two is highly related, and there is a linear relationship between them (r = .489, p < .001). Therefore, a comparison to the conventional Uncanny Valley is not possible, as higher human likeness does not seem to match lower likeability at any point along the scale of human likeness.
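The reported correlation can be computed directly over the individual ratings; a brief sketch follows, assuming the same long-format ratings table as in the earlier analysis sketch.

```python
# Sketch: Pearson correlation between human likeness and likeability ratings
# (the paper reports r = .489, p < .001).
from scipy.stats import pearsonr

def likeability_correlation(ratings) -> None:
    r, p = pearsonr(ratings["human_likeness"], ratings["likeability"])
    print(f"r = {r:.3f}, p = {p:.3g}")
```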
4. Acoustic Feature Analysis
As a means of evaluating the results from this study in more detail, focusing particularly on human likeness, a feature-level analysis was made of 3 of the voices within the corpus. Figure 3 shows sentence 1 from 3 of the corpus voices (chosen due to their varied results, as well as their polarising synthesis approaches): D_2004, DN_2018, and Human (as a control). The smileF0 feature set from the open-source openSMILE feature extractor [36] was used to automatically extract F0 at frame-level values (10 ms) from the speech instances.
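A comparable frame-level F0 extraction can be sketched with the opensmile Python wrapper; this is an approximation, as the wrapper's GeMAPS low-level descriptors stand in for the smileF0 configuration used in the paper, and the column name follows the GeMAPS naming convention.

```python
# Approximate sketch of the F0 extraction described above: the paper used the
# smileF0 configuration of openSMILE; here the GeMAPS low-level descriptors
# (which also contain an F0 contour at ~10 ms frames) stand in for it.
import numpy as np
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

def f0_mean_sd(wav_path: str):
    """Return the mean and standard deviation of F0 (in Hz) over voiced frames."""
    lld = smile.process_file(wav_path)
    f0_st = lld["F0semitoneFrom27.5Hz_sma3nz"].values  # GeMAPS F0 column (semitones)
    voiced = f0_st[f0_st > 0]                          # 0 marks unvoiced frames
    f0_hz = 27.5 * 2.0 ** (voiced / 12.0)              # semitones re 27.5 Hz -> Hz
    return float(np.mean(f0_hz)), float(np.std(f0_hz))

# Hypothetical usage:
# mean_f0, sd_f0 = f0_mean_sd("D_2004_sentence1.wav")
```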
Figure 2: The standard deviation (sd) of F0 for each voice plotted against human likeness. We observe a relationship between the increase in human likeness and the standard deviation of F0 (as shown by the dashed trend line).
One observation from this selection of voices is that the F0 of D_2004 has a much smoother trajectory (perhaps a more literal imitation); yet, this voice was perceived with lower human likeness and likeability (means of 1.86 and 1.74 respectively, cf. Table 2). As well as this, the voice characteristics fit the stereotype of monotonic synthesized speech, having an F0 mean (m) of 97.59 Hz and a small standard deviation (sd) of 9.7 Hz. This result is notable as – unlike the visual elements discussed through the Uncanny Valley – a somewhat precise imitation is seen in the prosodic features, which does not come through in the perception of human likeness.
An observed difference when comparing D_2004 and DN_2018 is that DN_2018 has a noticeable phrase-final lengthening at the prosodic boundary (highlighted by the arrow), something typical of human speech [37], which is less prevalent in D_2004. The prosodic flow of DN_2018 is dynamic, as is that of the Human voice (Human F0 m: 138.27 Hz, sd: 24.04 Hz; DN_2018 F0 m: 108.61 Hz, sd: 21.89 Hz). In this regard (cf. Figure 2), we have calculated the sd of F0 against human likeness for all voices, observing a trend that voices with a higher deviation from the mean F0 are also more human-like. We speculate that DN_2018 achieved both higher human likeness and likeability ratings compared to D_2004 (means of 1.86 and 1.74 respectively, cf. Table 2) due to such prosodic behaviours.
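The trend shown in Figure 2 can be approximated by pairing each voice's F0 standard deviation with its mean human-likeness rating and fitting a linear trend line; the sketch below assumes per-voice values such as those reported in Section 4 and Table 2.

```python
# Sketch: relate per-voice F0 standard deviation to mean human likeness (cf. Figure 2).
import numpy as np

def f0_sd_trend(f0_sd_by_voice: dict, hl_mean_by_voice: dict) -> None:
    voices = sorted(set(f0_sd_by_voice) & set(hl_mean_by_voice))
    sd = np.array([f0_sd_by_voice[v] for v in voices])
    hl = np.array([hl_mean_by_voice[v] for v in voices])
    slope, intercept = np.polyfit(hl, sd, deg=1)  # dashed trend line in Figure 2
    r = np.corrcoef(hl, sd)[0, 1]
    print(f"sd(F0) ~= {slope:.2f} * human_likeness + {intercept:.2f} (r = {r:.2f})")

# Example with the three voices analysed in Section 4 (H-L means from Table 2);
# the remaining voices would be added analogously:
# f0_sd_trend({"D_2004": 9.7, "DN_2018": 21.89, "Human": 24.04},
#             {"D_2004": 1.86, "DN_2018": 3.88, "Human": 4.76})
```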
5. Conclusions
Through this evaluation of the perception of human likeness and likeability across a corpus of 13 voices, including 5 synthesis types and 1 human control, it is clear that methodologies for voice synthesis have improved across the years for the evaluated parameters, e. g., as shown by voice DN_2018 against D_2004. However, state-of-the-art hybrid-DNN voices have not yet made a substantial improvement over previous hybrid methods (HU_2014), as no significance is shown between them. Nonetheless, the results for the hybrid-DNN voices are promising for synthetic audio generation. With similar DNN methods being applied in multiple domains (e. g., music generation [9]), systems are seemingly reaching a natural level of replication ability, which can only be positive for the computer audition community. In this regard, future analysis could include a variety of languages, as well as a focused synthesis selection, in order to fully evaluate how DNNs are building upon the previous state of the art.
Figure 3: F0 feature analysis for 3 voices within the corpus: D_2004, DN_2018, and Human. The F0 standard deviation is plotted with the horizontal highlight, showing a larger variance in the prosodic pattern of DN_2018 and Human compared to that of D_2004. Labelled in blue (left arrow), a dynamic shift in the prosodic flow is observed. In green (right arrow), a human-like pre-final lengthening of the phrase boundary is seen in both DN_2018 and Human.
Of prominence, it was also observed that likeability and human likeness are correlated; yet, unlike the visual features discussed through the concept of the Uncanny Valley, there does not seem to be an aversion to the voices when they achieve more human-like features. In reality, likeability is very much related to the synthesis method, e. g., the clusterings mentioned in Section 3. Despite this, it could be considered that there is not an Uncanny Valley for the synthesized voice, but rather an entire period, as it was shown that from 2014 onwards, voice synthesis methods achieve significantly higher results across both parameters.
From the feature analysis, it was also found that hybrid-DNN methods retain the human-like variance in F0 (sd) affecting the prosodic flow, which may be a strong indication for such high human likeness ratings – a finding which warrants further analysis across a selection of state-of-the-art voice synthesis methods; in this regard, the incorporation of an emotion-based evaluation may yield insightful findings.
6. Acknowledgements
This work is funded by the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B), and the European Union’s Seventh Framework and Horizon 2020 Programmes under grant agreement No. 338164 (ERC StG iHEARu).
7. References
[1] H. Dudley, R. R. Riesz, and S. S. A. Watkins, “A Synthetic Speaker,” Journal of The Franklin Institute, vol. 227, no. 6, pp. 739–764, June 1939.
[2] G. Fant, “The Source Filter Concept in Voice Production,” Speech Transmission Laboratory-QPSR, vol. 22, no. 1, pp. 21–37, 1981.
[3] R. Scha, “Virtual Voices,” Mediamatic Magazine, vol. 7, no. 1, pp. 27–42, 1992.
[4] P. McCorduck, Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. Natick, MA, USA: AK Peters Ltd, 2004.
[5] K. M. Lee and C. Nass, “Designing Social Presence of Social Actors in Human Computer Interaction,” in Proc. of the SIGCHI Conference on Human Factors in Computing Systems, Ft. Lauderdale, Florida, USA, 2003, pp. 289–296.
[6] J. Sánchez and C. Oyarzún, “Mobile Audio Assistance in Bus Transportation for the Blind,” Journal of the National Institute of Child Health and Human Development in Israel, vol. 10, no. 4, pp. 365–371, 2011.
[7] S. Shamsuddin, L. I. Ismail, H. Yussof, N. I. Zahari, S. Bahari, H. Hashim, and A. Jaffar, “Humanoid Robot NAO: Review of Control and Motion Exploration,” in 2011 IEEE International Conference on Control System, Computing and Engineering, 2011, pp. 511–516.
[8] R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory, “Prosody Contour Prediction with Long Short-Term Memory, Bi-Directional, Deep Recurrent Neural Networks,” in Proc. INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 2014, pp. 2268–2272.
[9] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast High-Fidelity Speech Synthesis,” CoRR, vol. abs/1711.10433, 2017.
[10] M. Schröder, “Approaches to Emotional Expressivity in Synthetic Speech,” in Emotions in the Human Voice, ser. Culture and Perception. Oxfordshire, United Kingdom: Plural Publishing, 2009, vol. 3, ch. 19, pp. 307–323.
[11] K. Sawada, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Constructing Text-to-Speech Systems for Languages with Unknown Pronunciations,” Acoustical Science and Technology, vol. 39, no. 2, pp. 119–129, 2018.
[12] M. Latinus and P. Belin, “Human Voice Perception,” Current Biology, vol. 21, no. 4, pp. 143–145, 2011.
[13] D. Puts, S. Gaulin, and K. Verdolini, “Dominance and the Evolution of Sexual Dimorphism in Human Voice Pitch,” Evolution and Human Behavior, vol. 27, no. 4, pp. 283–296, 2006.
[14] C. Tigue, D. Borak, J. J. O’Connor, C. Schandl, and D. Feinberg, “Voice Pitch Influences Voting Behavior,” Evolution and Human Behavior, vol. 33, no. 3, pp. 210–216, 2012.
[15] B. D. Till and M. Busler, “The Match-Up Hypothesis: Physical Attractiveness, Expertise, and the Role of Fit on Brand Attitude, Purchase Intent and Brand Beliefs,” Journal of Advertising, vol. 29, no. 3, pp. 1–13, 2000.
[16] C. Nass and K. Lee, “Does Computer-Synthesized Speech Manifest Personality? Experimental Tests of Recognition, Similarity-Attraction, and Consistency-Attraction,” Journal of Experimental Psychology, vol. 7, no. 3, pp. 171–181, 2001.
[17] E. J. Lee, C. Nass, and S. Brave, “Can Computer-Generated Speech Have Gender?: An Experimental Test of Gender Stereotype,” in Proc. of CHI ’00 Extended Abstracts on Human Factors in Computing Systems, New York, NY, USA, 2000, pp. 289–290.
[18] L. Gong and J. Lai, “To Mix or Not to Mix Synthetic Speech and Human Speech? Contrasting Impact on Judge-Rated Task Performance versus Self-Rated Performance and Attitudinal Responses,” International Journal of Speech Technology, vol. 6, pp. 123–131, 2003.
[19] M. Mori, “Bukimi No Tani [The Uncanny Valley],” Energy, vol. 7, no. 4, pp. 33–35, 1970.
[20] F. E. Pollick, “In Search of the Uncanny Valley,” in Proc. User Centric Media, Venice, Italy, 2009, pp. 69–78.
[21] M. Eaton, Evolutionary Humanoid Robotics: Past, Present and Future. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 42–52.
[22] W. J. Mitchell, K. A. Szerszen, A. S. Lu, P. W. Schermerhorn, M. Scheutz, and K. F. MacDorman, “A Mismatch in the Human Realism of Face and Voice Produces an Uncanny Valley,” i-Perception, vol. 2, no. 1, pp. 10–12, 2011.
[23] T. J. Burleigh, J. R. Schoenherr, and G. L. Lacroix, “Does the Uncanny Valley Exist? An Empirical Test of the Relationship between Eeriness and the Human Likeness of Digitally Created Faces,” Computers in Human Behavior, vol. 29, no. 3, pp. 759–771, 2013.
[24] J. Romportl, “Speech Synthesis and Uncanny Valley,” in Text, Speech and Dialogue – 17th International Conference, TSD 2014, Brno, Czech Republic, September 2014, pp. 595–602.
[25] S. Wilson and R. K. Moore, “Robot, Alien and Cartoon Voices: Implications for Speech-Enabled Systems,” in Proc. Vocal Interactivity in-and-between Humans, Animals and Robots, Skövde, Sweden, 2017, no pagination.
[26] T. Phan, “The Materiality of the Digital and the Gendered Voice of Siri,” Transformations, no. 29, pp. 23–33, 2017.
[27] A. Baird, S. H. Jørgensen, E. Parada-Cabaleiro, S. Hantke, N. Cummins, and B. Schuller, “Perception of Paralinguistic Traits in Synthesized Voices,” in Proc. of the Audio Mostly Conference, London, United Kingdom, 2017, pp. 1–5.
[28] A. Baird, S. H. Jørgensen, E. Parada-Cabaleiro, N. Cummins, S. Hantke, and B. Schuller, “The Perception of Vocal Traits in Synthesized Voices: Age, Gender, and Human Likeness,” Journal of the Audio Engineering Society, vol. 66, no. 4, pp. 277–285, 2018.
[29] S. Awad and B. Guérin, “An Optimisation of Formant Synthesis Parameter Coding,” Speech Communication, vol. 3, no. 4, pp. 335–346, 1984.
[30] M. Beutnagel, A. Conkie, and A. Syrdal, “Diphone Synthesis Using Unit Selection,” in Proc. of the Third ESCA/COCOSDA Workshop on Speech Synthesis, Blue Mountains, Australia, 1998, pp. 185–190.
[31] A. P. Breen and P. Jackson, “Non-Uniform Unit Selection and the Similarity Metric within BT’s Laureate TTS System,” in Proc. of the Third ESCA/COCOSDA Workshop on Speech Synthesis, Blue Mountains, Australia, 1998, pp. 201–206.
[32] Y. Agiomyrgiannakis and O. Rosec, “ARX-LF-Based Source-Filter Methods for Voice Modification and Transformation,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, April 2009, pp. 3589–3592.
[33] C. J. Stevens, N. Lees, J. Vonwiller, and D. Burnham, “On-line Experimental Methods to Evaluate Text-to-Speech (TTS) Synthesis: Effects of Voice Gender and Signal Quality on Intelligibility, Naturalness and Preference,” Computer Speech & Language, vol. 19, pp. 129–146, 2005.
[34] S. Hantke, F. Eyben, T. Appel, and B. Schuller, “iHEARu-PLAY: Introducing a Game for Crowdsourced Data Collection for Affective Computing,” in Proc. 1st International WASA 2015, ACII 2015, 2015, pp. 891–897.
[35] S. Hantke, Z. Zhang, and B. Schuller, “Towards Intelligent Crowdsourcing for Audio Data Annotation: Integrating Active Learning in the Real World,” in Proc. INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 2017, pp. 3951–3955.
[36] F. Eyben, F. Weninger, F. Gross et al., “Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor,” in Proc. 21st ACM International Conference on Multimedia, MM 2013, Barcelona, Spain, October 2013, pp. 835–838.
[37] S. Nooteboom, “The Prosody of Speech: Melody and Rhythm,” in The Handbook of Phonetic Sciences, 1997, pp. 640–673.