INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY 6, 161–182, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
Expanding the MOS: Development and Psychometric Evaluation
of the MOS-R and MOS-X
MELANIE D. POLKOSKY AND JAMES R. LEWIS
IBM Corporation, 8051 Congress Ave, Suite 2227, Boca Raton, FL 33487, USA
jimlewis@us.ibm.com
Abstract. The Mean Opinion Scale (MOS) is a questionnaire used to obtain listeners’ subjective assessments of
synthetic speech. This paper documents the motivation, method, and results of six experiments conducted from 1999
to 2002 that investigated the psychometric properties of the MOS and expanded the range of speech characteristics
it evaluates. Our initial experiments documented the reliability, validity, sensitivity, and factor structure of the P.L.
Salza et al. (Acta Acustica, Vol. 82, pp. 650–656, 1996) MOS and used psychometric principles to revise and improve
the scale. This work resulted in the MOS-Revised (MOS-R). Four subsequent experiments expanded the MOS-R
beyond its previous focus on Intelligibility and Naturalness, to include measurement of the Prosody and Social
Impression of synthetic voices. As a result of this work, we created the MOS-Expanded (MOS-X), a rating scale
shown to be reliable, valid, and sensitive for high-quality evaluation of synthetic speech in applied industrial settings.
Keywords: Mean Opinion Scale (MOS), subjective assessment of synthetic speech, psychometric evaluation
Introduction
Evaluation of intelligibility and acceptability is a vi-
tal component of effective synthetic speech develop-
ment and use. Schmidt-Nielsen (1995) distinguishes
between two major types of evaluations, intelligibility
tests and acceptability tests. Various intelligibility tests,
which evaluate how well listeners understand words or
phrases spoken by a synthetic voice, provide highly
correlated yet inconsistent results despite the various
types of stimuli used (e.g., rhyming words, phrases,
syllables, sentences) (Schmidt-Nielsen, 1995). In ad-
dition, listeners may understand a synthetic voice but
still indicate that it has unacceptable quality, a subjec-
tive characteristic that intelligibility tests cannot assess.
Acceptability tests provide a subjective measure of syn-
thetic voice quality, typically using comparison rank-
ings or rating scales. While listener variability has been
identified as a common problem with subjective rating
scales for acceptability measurement, they also pro-
vide an efficient and rapid method of synthetic speech
evaluation (Schmidt-Nielsen, 1995).
Until relatively recently, the most pronounced prob-
lem with artificial voices was intelligibility (Francis
and Nusbaum, 1999). It might seem, therefore, that ar-
ticulation tests that assess intelligibility (such as rhyme
tests) would be more suitable for evaluating artificial
speech than a subjective acceptability tool. However,
most current text-to-speech systems, although more
demanding on the listener than natural speech (Paris
et al., 2000), are quite intelligible: “Once a speech
signal has breached the ‘intelligibility threshold’, articulation tests lose their ability to discriminate.... it
is precisely because people’s opinions are so sensi-
tive, not just to the signal being heard, but also to
norms and expectations, that opinion tests form the
basis of all modern speech quality assessment meth-
ods” (Johnston, 1996). In addition to the technological
gains made in synthetic speech over the past decade,
researchers have also recognized the need for increas-
ingly sophisticated perceptual assessment methods
(Pisoni, 1997).
The Mean Opinion Scale (MOS)
The Mean Opinion Scale (MOS) is a widely used
method for evaluating the quality of telephone systems
162 Polkosky and Lewis
and synthetic speech, recommended by the Interna-
tional Telecommunications Union (Schmidt-Nielsen,
1995; ITU, 1994; van Bezooijen and van Heuven,
1997). The MOS is a Likert-style questionnaire, typ-
ically with seven 5-point scale items addressing the
following TTS characteristics: (1) Global Impression,
(2) Listening Effort, (3) Comprehension Problems,
(4) Speech Sound Articulation, (5) Pronunciation, (6)
Speaking Rate, and (7) Voice Pleasantness. In the
most typical use of the MOS, naïve listeners assign
scores for each item after listening to speech stim-
uli, usually sentences (Schmidt-Nielsen, 1995). Doc-
umented use of the MOS in empirical studies of
synthetic speech has been somewhat limited (Cartier
et al., 1992, as cited by Salza et al., 1996; Goldstein,
1995; Goodman and Nash, 1982 as cited by Schmidt-
Nielsen, 1995; Johnston, 1996; Kraft and Portele,
1995; Möller et al., 2001; Pascal and Combescure,
1988, as cited by Schmidt-Nielsen, 1995; Salza et al.,
1996; Yabuoka et al., 2000), although it was shown
to be a recognizable assessment method in a survey
of worldwide speech laboratories (Pols and Jekosch,
1997).
The primary advantage of the MOS lies in its efficiency: it quickly provides feedback on a synthetic voice's intelligibility and naturalness as evaluated by listeners. Researchers can also compare ratings of several voices and determine the general aspects that differentiate the voices. Unlike other acceptability evaluation procedures, the MOS does not require time-consuming listener calibration and standardization procedures (as required by the Diagnostic Acceptability Measure), pre-specified speech stimuli or testing environments, or other rigid procedural requirements (Schmidt-Nielsen, 1995). Thus, this tool is flexible for a wide range of evaluative goals, synthetic speech
applications, and user groups. When analyzing MOS scores, inferential statistics (such as ANOVA) compare the amount of within-group listener variability to the amount of variability among ratings of different synthetic voices. If listener variability is greater than or equal to variability among synthetic voices, the ANOVA will be nonsignificant; otherwise, it will be statistically significant. These procedures give the researcher specific methods of determining whether the recognized problem of listener variability adversely affects observed MOS ratings.
advantages, however, little is known about the psycho-
metric features of the MOS, a potential deterrent to its
use in speech system evaluation.
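As an illustration of this analysis logic (our sketch, not code from the original studies), the following Python fragment runs a one-way ANOVA on overall MOS scores grouped by voice; the file name and column names are hypothetical.

```python
# A minimal sketch of the analysis described above: a one-way ANOVA
# comparing overall MOS scores across synthetic voices. The file and
# column names (listener, voice, mos_score) are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("mos_ratings.csv")  # one row per listener-voice pair

# Collect each voice's ratings; the F test compares between-voice
# variability against within-group (listener) variability.
groups = [g["mos_score"].to_numpy() for _, g in df.groupby("voice")]
f_stat, p_value = stats.f_oneway(*groups)

# A nonsignificant p means listener variability swamps voice differences.
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```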
Psychometrics and the MOS
Psychometrics is the use of statistical analysis to eval-
uate the quality of practitioner-created scales and other
psychological measures (Nunnally, 1978). If a mea-
surement tool, such as the MOS, has not been sub-
jected to psychometric analysis, developers of speech
products cannot be confident of a product evaluation
that uses the tool. Some of the metrics of psychometric
quality are:
1. Reliability: consistency of ratings performed on the same speech system at different times;
2. Validity: measurement of the specific aspects of a speech system that the developer intended; and
3. Sensitivity: ability of the tool to detect differences among speech systems.
These psychometric characteristics each are important
to the overall quality of a scale, but they may not occur
in the same scale. For example, a scale may result in the
same ratings for the same system each time it is used
(reliability), but it may not measure particular char-
acteristics of interest to a developer (validity). It also
may not discriminate among different speech systems
(sensitivity). Therefore, it is important for developers
to understand the reliability, validity, and sensitivity of
subjective scales used for speech system evaluation.
Despite the recognized importance of psychometric
review, previous research documenting the reliability,
validity, and sensitivity of the MOS has been sparse and
unsystematic. Cartier et al. (1992, as cited by Salza
et al., 1996), in their evaluation of several synthetic
systems, found reliable and reproducible results, but
no additional or more specific comments on MOS reliability have appeared in the previous literature. Salza
et al. (1996) measured the overall quality of three Ital-
ian TTS synthesis systems with a common prosodic
control but different diphones and synthesizers using
both paired comparisons and the MOS. Their results
showed good agreement between the two measure-
ment methods, providing some evidence for the valid-
ity of the MOS. Johnston (1996) also showed evidence
of validity and sensitivity, but only for the MOS Lis-
tening Effort item. Yabuoka et al. (2000) investigated
the relationship between ve distortion scales (differ-
ential spectrum, phase, waveform, cepstrum distance,
and amplitude) and MOS ratings. They calculated sta-
tistically signicant regression formulas for predict-
ing MOS ratings from manipulations of the distortion
scales, suggesting that the MOS was sensitive to dif-
ferences created by the distortion scales. Unfortunately,
Yabuoka et al. (2000) did not report the exact type of
MOS that they used in the experiment. In general, this
research does little to clarify the quality of the MOS
for use as a measure in applied industrial settings.
Another statistical method often associated with
scale development is factor analysis. Factor analysis
is a statistical procedure that examines the correlations
among variables to discover groups of related scale
items, known as factors (Nunnally, 1978). This proce-
dure allows individual scale items to be grouped cat-
egorically by identifying the abstract variables that a
scale effectively measures. For a complex scale with a
large number of individual items, the presence of fac-
tors permits greater ease in interpreting its results. Few
researchers have investigated the factor structure of the
MOS. Kraft and Portele (1995) found that an eight-item
version of the MOS had two factors: one they called Intelligibility (segmental) and one called Naturalness (suprasegmental, or prosodic, attributes). The Speaking
Rate (Speed) item did not fall in either of the two fac-
tors, suggesting that this item was unrelated to the other
seven items on the MOS. More recently, Sonntag et al.
(1999), using the same version of the MOS (but with
6-point rather than 5-point scales), reported only a sin-
gle factor. Therefore, the factor structure of the MOS
continues to be unclear and limited, another potential
drawback to its use in speech system evaluations.
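For readers who wish to apply the same technique, the sketch below runs an exploratory factor analysis on a simulated listeners-by-items matrix. It is our illustration of the procedure, not the authors' original analysis, and the simulated data are stand-ins for real MOS responses.

```python
# A minimal sketch of the factor-analytic approach described above.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Simulated 5-point ratings: 73 listeners by 7 MOS items (stand-in data).
item_scores = rng.integers(1, 6, size=(73, 7)).astype(float)

# Extract two factors; varimax rotation makes the loadings interpretable.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(item_scores)
loadings = fa.components_.T  # rows = items, columns = factors

# |loading| > 0.50 is the alignment criterion used in Appendix C.
for item, row in enumerate(loadings, start=1):
    print(f"Item {item}: loadings = {np.round(row, 2)}")
```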
Expanding the Scope of Coverage of the MOS
In addition to a paucity of research in the psychomet-
rics of the MOS, the content of its items has received
virtually no attention in previous research. Item content
related to only two factors limits the instrumentsability
to discriminate among voices with similar intelligibility
and naturalness. Researchers in the late 1980s and early
1990s acknowledged that the intelligibility of synthetic
speech can rival that of human speech (Greene et al.,
1986; Murray and Arnot, 1993). As synthetic speech
development has become increasingly sophisticated,
it has become clear that intelligibility does not usu-
ally differentiate among modern synthetic voices. With
the introduction of concatenative voices, naturalness
also is becoming less of a discriminating factor. These
trends suggest that the MOS, which measures only In-
telligibility and Naturalness, may not effectively dis-
criminate among current or future synthetic voices.
Recently, researchers have investigated the synthe-
sis of more subtle and specific perceptual characteristics than intelligibility and naturalness. A significant psychological literature exists on the social-emotional aspects of human speech (for a review, see Murray and Arnott, 1993), the relationship between vocal speech and impression formation or personality perception (for a review, see Brown et al., 1975), and the social impact of speech disabilities, especially for individuals who use augmentative and alternative communication systems (synthetic voice prostheses) as a means of communication (Hoag and Bedrosian, 1992; Gorenflo and Gorenflo, 1997). All of these areas of research suggest additional measurement items that may capture listeners' vocal and social-emotional perceptions about
synthetic speech. Numerous studies over the past three
decades have investigated vocal speech characteristics
that promote social-emotional perceptions, including
those related to prosody (Bradlow et al., 1996; Brown
et al., 1973; Hosman, 1989; Koopmans-Van Beinum,
1992; Martin and Haroldson, 1992; Pelachaud et al.,
1996; Slowiaczek and Nusbaum, 1985; Yaeger-Dror,
1996) and voicing characteristics (Bloom et al., 1999;
Bradlow et al., 1996; Granstrom and Nord, 1992; Hieda
and Kuchinomachi, 1997; Higashikawa and Minifie,
1999; Hillenbrand, 1988; Klatt and Klatt, 1990; Lavner
et al., 2000; Page and Balloun, 1978; Robinson and
McArthur, 1982; Slowiaczek and Nusbaum, 1985;
Whalen and Hoequist, 1995). Other researchers have
investigated the numerous social-emotional percep-
tions conveyed by speech (Berry, 1992; Ekman et al.,
1991; Holtgraves and Lasky, 1999; Johnson et al., 1986;
Massaro and Egan, 1996; Miyake and Zuckerman,
1993; Murray and Arnott, 1995; Murray et al., 1996;
Paddock and Nowicki, 1986; Stern et al., 1999;
Tartter and Braun, 1994; Whitmore and Fisher, 1996;
Zuckerman et al., 1991). In general, the literature points
to a number of perceptual characteristics that are not
measured by acceptability tests or the current version
of the MOS, but may further discriminate among syn-
thetic voices.
Research Goals
The goals of our research program were to inves-
tigate and systematically improve the psychometric
properties of the Mean Opinion Scale (MOS), and
to expand the content of the MOS beyond its cur-
rent content to include items that measure subtle vocal
and social-emotional aspects of speech. Reliable and
valid measurement of these characteristics is important for understanding listeners' impressions of synthetic speech, developing increasingly sophisticated synthetic speech, and discriminating effectively among competitive artificial voices. Although the focus of our
research was the MOS scale itself, the work proceeded
during the emergence of commercial applications us-
ing synthetic speech, using voices created at IBM,
Nuance, SpeechWorks, AT&T, and other speech technology companies during the years 1998 to 2002. In many respects, the development of the MOS parallels the development of synthetic speech as it has become a viable and increasingly sophisticated technology. Table 1 summarizes the focus of revisions and psychometric evaluation during each experiment, and provides an overview of the organization of our experiments.

Table 1. Summary of MOS psychometric evaluations and revisions. For each study, checkmarks mark the focus of psychometric evaluation among Reliability, Validity, Sensitivity, and Factor structure; (a) indicates a change in item or factor from the previous version of the MOS.

MOS (Salza et al., 1996), Experiment 1. Item scale: 5-point. Evaluation: ✓ ✓ ✓ ✓
  Intelligibility: Global Impression; Listening Effort; Comprehension Problems; Speech Sound Articulation; Pronunciation
  Naturalness: Voice Pleasantness; Speaking Rate

MOS-R, Experiment 2. Item scale: 7-point (a). Evaluation: ✓ ✓ ✓
  Intelligibility: Global Impression; Listening Effort; Comprehension Problems; Speech Sound Articulation; Pronunciation; Speaking Rate (a)
  Naturalness: Global Impression (a); Voice Pleasantness; Voice Naturalness (a); Ease of Listening (a)

MOS-R3, Experiment 3. Item scale: 7-point. Evaluation: ✓ ✓ ✓
  Intelligibility: Listening Effort; Comprehension Problems; Speech Sound Articulation; Precision
  Naturalness: Voice Pleasantness; Voice Naturalness; Humanlike Voice (a); Voice Quality (a)
  Social Impression (a): Trust (a); Confidence (a)
  Fluency (a): Emphasis (a); Rhythm (a); Intonation (a)
  Voice (a): Loudness (a); Depression (a)

MOS-R4, Experiment 4. Item scale: 7-point. Evaluation: ✓ ✓
  Intelligibility: Listening Effort; Comprehension Problems; Speech Sound Articulation; Precision
  Naturalness: Voice Pleasantness; Voice Naturalness; Humanlike Voice; Voice Quality
  Social Impression: Trust; Confidence; Enthusiasm (a); Persuasiveness (a)
  Negativity (a): Depression (a); Fear (a)

MOS-X, Experiments 5 and 6. Item scale: 7-point. Evaluation: ✓ ✓
  Intelligibility: Listening Effort; Comprehension Problems; Speech Sound Articulation; Precision
  Naturalness: Voice Pleasantness; Voice Naturalness; Humanlike Voice; Voice Quality
  Prosody (a): Emphasis (a); Rhythm (a); Intonation (a)
  Social Impression: Trust; Confidence; Enthusiasm; Persuasiveness
Our strategy for the research was to begin by evaluating
the psychometric properties of the Salza et al. (1996)
MOS, which appears in Appendix A. The next step was
to develop a version of the MOS, the MOS-Revised
(MOS-R), with improved psychometric properties for
its traditional focus on intelligibility and naturalness
(Experiments 1 and 2). The final step was to expand the
content of the MOS-R to broaden its evaluative scope
(Experiments 3 through 6) by generating new items and
creating additional factors for the scale. The resulting
scale is known as the MOS-Expanded (MOS-X) (see
Appendix B). We provide an interpretive summary of
our statistical analysis of scale drafts as a rationale for
our revisions in each experiment; please refer to the
technical appendix for detailed statistics (Appendix C).
Experiment 1: The MOS Psychometric Evaluation
Initially, the research focused on determining the MOS's quality and identifying its factor structure. The specific goals of Experiment 1 were to evaluate the factor structure of the 7-item, 5-point-scale version of the MOS (the version reported by Salza et al. (1996), adapted for use in our lab; see Appendix A), estimate the reliability of the overall MOS score and any revealed factors, investigate the sensitivity of the MOS scores, and extend the work on the validity of the MOS.
Method
Factor Analysis and Reliability Evaluation. Over
the period 1999 to 2001, we conducted a number of
experiments in which participants completed the stan-
dard MOS. In some of these experiments, we also col-
lected paired-comparison data and, in the most recent
(Wang and Lewis, 2001), we collected intelligibility
scores. Participants in these experiments included, in approximately equal numbers, males and females, persons older and younger than 40 years old, and IBM and non-IBM employees. The rated speech samples included concatenative and formant-synthesized voices and, in one case, a recorded human voice (nonprofessional speaker). Drawing from six of these experiments, we assembled a database of 73 independent completions of the Salza et al. (1996) MOS. This database was the source of data for a factor analysis, reliability assessment (both of the overall MOS and the factors identified in the factor analysis), and sensitivity investigation using analysis of variance.
Validity Evaluations. Re-analysis of our data from
the previous studies in which listeners provided
both MOS ratings and paired comparisons allowed
us to replicate the finding of Salza et al. (1996) that MOS ratings correlate significantly with paired
comparisons.
Data from Wang and Lewis (2001) provided an op-
portunity to investigate the correlation between MOS
ratings and intelligibility scores. In that experiment,
listeners heard a variety of types of short phrases pro-
duced by four TTS voices, with the task to write down
what the voice was saying. After finishing that intelligibility task, listeners heard the samples for each voice a
second time and provided MOS ratings after reviewing
each voice.
Results and Discussion
Factor Analysis. The factor analysis confirmed the previous results of Kraft and Portele (1995), and indicated that the MOS had two factors, Intelligibility and Naturalness. One item (Speaking Rate) was not associated with either factor.
The MOS Speaking Rate item failed to fall onto ei-
ther the Intelligibility or Naturalness factor in both the
current study and in Kraft and Portele (1995). This
may have happened because Speaking Rate is truly
independent of either of these constructs, or it might
have been due to the unique labeling of the scale points
for this item. The other MOS items had scales with a clear sequence, such as “Excellent,” “Good,” “Fair,” “Poor,” and “Bad” for the Global Impression item (see Appendix A). The labels for the Speaking Rate item were, in contrast, “Yes,” “Yes, but slower than preferred,” “Yes, but faster than preferred,” “No, too slow,” and “No, too fast,” which did not have a clear top-to-bottom ordinal relationship. If the item assessing Speaking Rate had the same structure as the other MOS items, a factor analysis could determine less ambiguously whether Speaking Rate is independent of Intelligibility and Naturalness.
Reliability. Reliability (coefficient alpha) for the overall MOS was 0.89. Respective reliabilities for Intelligibility and Naturalness were 0.88 and 0.81. Using the minimum criterion of 0.70 (Landauer, 1988; Nunnally, 1978), all reliabilities were acceptable but could be improved.
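For reference, coefficient alpha can be computed directly from an items matrix; the sketch below applies the standard formula to simulated ratings (our illustration; the data and variable names are hypothetical, not the study's).

```python
# A minimal sketch of coefficient (Cronbach's) alpha, the reliability
# statistic reported above, using the standard textbook formula.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = listeners, columns = items of one subscale."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Example with simulated 5-point ratings from 73 listeners on 5 items.
rng = np.random.default_rng(1)
ratings = rng.integers(1, 6, size=(73, 5)).astype(float)
print(f"alpha = {cronbach_alpha(ratings):.2f}")  # compare to the 0.70 criterion
```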
Using principles from psychometrics (Nunnally,
1978), we reasoned that it should be possible to improve
the reliability of the MOS. Rather than using 5-point
scales with an anchor at each step, overall reliability
should improve slightly with a change to 7-point bipo-
lar scales. Because the Naturalness factor had some-
what weaker reliability than the Intelligibility factor,
it would be reasonable to add at least two more items
to the MOS that are likely to tap into the construct of
Naturalness.
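This reasoning can be made concrete with the Spearman–Brown prophecy formula, a standard psychometric result; the worked example below is our illustration, not a calculation reported in the paper.

```latex
% Predicted reliability rho* of a scale lengthened by a factor n,
% given its current reliability rho (Spearman-Brown prophecy formula):
\rho^{*} = \frac{n\,\rho}{1 + (n - 1)\,\rho}
```

For instance, doubling the two-item Naturalness factor (rho = 0.81, n = 2) predicts rho* = (2)(0.81)/(1 + 0.81), or approximately 0.90, consistent with the expectation stated below for Experiment 2 that the revised Naturalness factor might exceed .90.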
Validity. Significant positive correlations between participants' preference votes and MOS scores showed that the MOS was a valid measure of voices (overall MOS, r(14) = .55, p = .03; Naturalness factor, r(14) = .46, p = .07; Intelligibility factor, r(14) = .49, p = .05).
Additional data from Wang and Lewis (2001) indicated a marginally significant correlation between intelligibility scores from listener transcriptions and their MOS ratings of Intelligibility (r(14) = .43, p = .10), which is evidence for convergent validity. None of the other correlations between these scores and MOS ratings (Naturalness, Speaking Rate) were significant (all p > .33), which is evidence for divergent validity.

Figure 1. Voice by factor interaction (MOS). [Line plot of mean ratings (1.0–5.0) on Intelligibility, Naturalness, and Speaking Rate for the voices Wave, Concat1, Concat2, Formant1, and Formant2.]
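A convergent-validity check of this kind reduces to a simple correlation; the sketch below is our illustration with hypothetical per-voice values, not the study's data.

```python
# A minimal sketch of the convergent-validity check described above: a
# Pearson correlation between per-voice intelligibility (transcription)
# scores and per-voice MOS Intelligibility ratings (hypothetical values).
from scipy import stats

transcription_scores = [0.92, 0.88, 0.71, 0.65]  # proportion correct per voice
mos_intelligibility = [5.9, 5.6, 4.2, 4.0]       # mean MOS rating per voice

r, p = stats.pearsonr(transcription_scores, mos_intelligibility)
print(f"r = {r:.2f}, p = {p:.3f}")  # a positive r supports convergent validity
```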
Sensitivity. Figure 1 illustrates the relationship among the five voices in our database and the MOS factors (plus Speaking Rate). Of these voices, the two formant voices had poorer ratings for Intelligibility and Naturalness, as compared to the two concatenative voices and a human (male, nonprofessional) voice. The Intelligibility, Naturalness, and Speaking Rate profile of each system was unique, suggesting that the MOS was sufficiently sensitive to discriminate among these voices.
Summary of Results. The version of the MOS de-
rived from Salza et al. (1996) seemed to have reason-
ably good psychometric properties. The factor analysis
of the current data resulted in a factor structure similar
to that of Kraft and Portele (1995), specifically two factors (Intelligibility and Naturalness) and an unrelated
item for Speaking Rate. The reliabilities of the overall
MOS and its Intelligibility and Naturalness subscales
were acceptable (greater than the minimal standard of
0.70). The evidence for sensitivity and validity was
strong.
Revisions to the MOS. We made several changes to
the MOS to improve its reliability. These changes in-
cluded adding two items, Voice Naturalness and Ease
of Listening, and increasing the number of scale steps
from five to seven. In addition, we changed the structure of the Speaking Rate item to make it more compatible with the other six items and to allow us to determine whether it is truly independent of the other two factors.
We named this version of the questionnaire the MOS-
Revised (MOS-R).
Experiment 2: The MOS-Revised (MOS-R)
Psychometric Evaluation
The purpose of Experiment 2 was to determine if the changes made in Experiment 1 worked as expected. Specifically, the expected consequences of the revision were:
• The reliability of the overall MOS would improve;
• The revised MOS items 2–5 would continue to form an Intelligibility factor; and
• Items 1 and 6–8 would form a Naturalness factor with substantially greater reliability (possibly in excess of .90) than the current Naturalness factor.
Additionally, the change in the structure of the Speak-
ing Rate item would make it possible to determine
whether that item is truly independent of the other two
factors.
Method
The data came from three different studies of concate-
native TTS voices, each with 16 participants. For all
three studies, the participant groups had full factorial
crossing of the following three independent variables
(enhancing the generalizability of results):
• Gender (half were male, half were female),
• Age (half under 40 years of age, half over 40 years of age), and
• Employment (half were employees of a temporary help agency, half were IBM employees).
In all three studies, participants listened to four dif-
ferent TTS voices speaking four different text samples
and provided MOS-R ratings for each voice. We com-
bined the data from the three independent experiments
and performed a separate analysis for each set of rat-
ings within participants across experiments for a total
of four independent analyses. Each of these analyses
included 48 sets of ratings.
Results and Discussion
Factor Analyses. Four separate (independent) factor
analyses were conducted on the data, and three of these
procedures indicated that the MOS-R continued to have
two factors, Intelligibility and Naturalness. The analy-
ses further indicated that two items, Global Impression
and Speaking Rate, were somewhat problematic in that
they did not consistently fall into either of the two main
factors. However, three factor analyses showed that
Global Impression aligned with both the Intelligibil-
ity and Naturalness factors and Speaking Rate aligned
with Intelligibility.
Reliability. For the combined analysis, the reliability
values (coefficient alpha) for the overall MOS-R scale,
Intelligibility, and Naturalness were, respectively, .93,
.89, and .95. These values show increases in reliability
for the MOS-R scale and both of its factors as compared
with the MOS.
Summary of Results. The intended changes did re-
sult in the expected improvement to the MOS scale and
its two factors, Intelligibility and Naturalness. Combin-
ing the psychometric work done in Experiments 1 and
2, we documented the scale's factor structure, validity, and sensitivity, and also measured and improved the scale's reliability.
Revisions to the MOS-R. With continued develop-
ment of synthetic speech and decreasing differences
in the Intelligibility and Naturalness of synthetic
voices, we decided to broaden the scope of character-
istics evaluated by the MOS-R. The remaining experi-
ments describe the complex, iterative tasks required to
achieve this goal.
Experiment 3: The MOS-R Expansion
and Psychometric Evaluation
The purpose of Experiment 3 was to add perceptual
speech characteristics and social impression items not
previously measured by the MOS-R, creating a ques-
tionnaire named the MOS-R3. We expected that the
new items would add new factors to the measure,
which we hoped would improve its sensitivity and more
clearly discriminate among user perceptions of syn-
thetic voices. We limited the new items to primarily
speech-based items consistent with the evaluative pur-
pose of the previous MOS-R revisions.
Method
Participants. The sample consisted of 1000 ran-
domly selected IBM employees (200 individuals in
each of five groups) invited to participate in the study.
Of this sample, 204 individuals completed the study
questions (20% return rate).
Materials. The study used a between-subjects design
with five levels of the independent variable of synthetic
voice. The voices and their key characteristics were:
A: concatenative female
B: concatenative female
C: concatenative male
D: concatenative male
E: formant male.
All voices had an 8 kHz sampling rate and 16-bit dy-
namic range. Voices A and B used the same underlying
TTS technology and source voice. Voices C and D used
different underlying TTS technologies (different from
Voices A and B and different from each other).
The initial item set for the MOS-R3 included the nine
items from the MOS-R and an additional item (Human-
Like Voice) expected to align with Naturalness. The
set also included eight new items based on clinical evaluation of human speech characteristics: voice (Loudness, Emphasis, Voice Quality, Pitch), fluency (Interruptions, Rhythm, Intonation), and articulation (Precision) (Shipley and McAfee, 1992). If the evaluation of human speech is similar to synthetic speech evaluation, we would predict that the three fluency items would cluster with Speaking Rate to create a Fluency factor. Similarly, the new Precision item should align with the previous Intelligibility factor. Finally, we also generated four new items related to the social impression created by human voices. These items were selected based on the review of previous literature and characteristics identified as relevant to application development (Topic Interest, Trust, Confidence, and Depression). Thus, the initial item set for the MOS-R3 contained 22 items.
Procedure. Participants received an email inviting
them to participate in the study and directing them to
a web page (one page for each participant group) with
instructions, a link to a recording of one of the syn-
thetic voices, and the rating scales. After accessing the
web page, participants clicked a link that caused the
synthetic voice file to play on the participant's audio
player application. They then completed the MOS-R3
items for that voice.
Results and Discussion
Due to an error with the data collection software, the
data for Voice A were not collected and could not be
used in the analysis.
Factor Analysis. The factor analysis revealed that the new scale included five factors (increased from two factors in the MOS-R). We named these factors Intelligibility, Naturalness, Fluency, Voice, and Social Impression. Somewhat problematic was the association of Voice Quality with Naturalness (instead of Voice) and Interruptions with Intelligibility (instead of Fluency).
This result demonstrated that voice, fluency, and articulation may be problematic factor labels because of their specificity. By contrast, Intelligibility and Naturalness are both broad and more abstract labels, since impairment in voice, fluency, and/or articulation diminishes both the intelligibility and naturalness of human speech.

Figure 2. Voice by factor interaction (MOS-R3). [Line plot of mean ratings (1.0–7.0) on Intelligibility, Naturalness, Fluency, Voice, and Social Impression for Voices B, C, D, and E.]
Reliability. After item deletions to create a final measure with 15 total items, four factors (Intelligibility, 0.91; Fluency, 0.88; Social Impression, 0.87; Naturalness, 0.89) and the Overall score (0.93) had coefficient alphas greater than 0.70, demonstrating reliabilities adequate for usability evaluation (Landauer, 1988). However, the Voice factor had inadequate reliability (0.46) based on this criterion.
Sensitivity. Again, the MOS-R revisions made in this
experiment indicated that the scale continued to be sen-
sitive to differences in synthetic voices, with Voices B
and C rated most positively for all factors. Figure 2
shows that the new factors created unique profiles for four TTS voices, although the profiles of Voices D and
E were similar.
Summary of Results. Although Experiment 3 was
successful in broadening the scope of the MOS-R to in-
clude items related to human speech evaluation (Voice,
Fluency) and social perception of speech (Social
Impression), the resulting scale demonstrated sev-
eral problems. Primarily, the Voice factor had weak
reliability and the Social Impression factor included
too few items to be a valid measure of this complex
variable.
Revisions to the MOS-R3. We targeted the addition
of several new items that we expected to align with
the Voice factor and improve its reliability. We also
included items we expected would align with Social
Impression.
Experiment 4: The MOS-R4 Expansion
and Reliability Improvement
The purpose of Experiment 4 was to evaluate the effects
of adding items to improve the reliability of the Voice
factor and increase the number of items associated with
the Social Impression factor.
Method
Participants. The sample consisted of 1000 ran-
domly selected IBM employees (none of whom were in
the sample for Experiment 3), with 200 individuals in
each of five groups invited to participate in the study.
Of this sample, 138 individuals completed the study
questions (14% return rate).
Materials. The study used a between-subjects design
with five levels of the independent variable of synthetic
voice. The voices and their key characteristics were:
A: concatenative female
B: concatenative female
C2: concatenative male
D: concatenative male
E: formant male.
With the exception of Voice C2, the voices were the
same as those used in Experiment 3. The technology
used to produce Voice C2 was the same as that used to
produce Voice C in Experiment 3, but the source voice
for Voice C2 was different.
To create the initial item set for the MOS-R4, we retained the 15 items from the final version of the MOS-R3 (Listening Effort, Comprehension, Articulation, Pleasantness, Voice Naturalness, Humanlike Voice, Loudness, Emphasis, Voice Quality, Rhythm, Intonation, Precision, Trust, Confidence, and Depression). As in Experiment 3, we generated additional
sion). As in Experiment 3, we generated additional
items related to voice and its correlates in human speech
(Monotone Quality, Attractiveness, Enthusiasm) and
four additional social impression items (Persuasive-
ness, Enthusiasm, Impatience, and Fear). If the pre-
vious factor structure remained, we would expect the
new items to align to the Voice and Social Impres-
sion factors, increasing their reliability. However, the
new items relied significantly less on the specific areas of human speech evaluation, making the items in
this study qualitatively different than those explored in
Experiment 3. Therefore, one possible outcome of Ex-
periment 4 was that we would not retain the Voice and
Fluency factors.
Procedure. The procedure was identical to that of
Experiment 3.
Results and Discussion
Factor Analysis. As in Experiment 3, we again found
that the scale included five factors, three of which were
the familiar Intelligibility, Naturalness, and Social Im-
pression. Interestingly, Factor 4 included a single item
(Loudness) from the earlier Voice factor, and Factor 5
included two Negativity items (Depression, Fear). This
alignment was unexpected, based on our goal of gener-
ating additional items that would align with the Voice
factor.
Because only one item was associated with Factor 4 (Voice), we omitted Loudness and performed a second
factor analysis, forcing a four-factor solution (eliminat-
ing the Voice factor). The four-factor model appeared
to be more consistent with the results of Experiment
3 and the theoretical association of items in the litera-
ture, and included more than one item per factor (but
the Negativity factor only included two items). There-
fore, this analysis indicated that the MOS-R4 contained
the four factors Intelligibility, Naturalness, Negativity,
and Social Impression.
Reliability. We again adjusted items to create an efficient but reliable scale. Following the manipulation, three factors (Intelligibility, 0.84; Social Impression, 0.84; Naturalness, 0.85) and the Overall score (0.89) demonstrated adequate reliability above 0.70 (Landauer, 1988). The reliability of the Negativity factor (0.65) was below this criterion.
Sensitivity. A mixed model ANOVA indicated the extent to which the final version of the MOS-R4 discriminated among the five synthetic voices. Figure 3 illustrates the similarity between Voices A and B (as expected because they used the same core TTS technology and the same source voice). Again consistent with expectation, the formant voice (Voice E) was the most poorly rated voice in terms of its perceived Social Impression and Naturalness. The perceived Intelligibility of Voice E was identical to that of concatenative Voice D (a low-quality concatenative voice). Of the four factors, only Negativity seemed to be relatively insensitive to the differences among the voices.

Figure 3. Voice by factor interaction (MOS-R4). [Line plot of mean ratings (1.0–7.0) on Intelligibility, Naturalness, Social Impression, and Negativity for Voices A, B, C2, D, and E.]
Summary of Results. The outcomes of Experiments
3 and 4 were encouraging, but not completely satisfy-
ing. The addition of the new items in each experiment
led to the emergence of new factors (Fluency, Voice,
and Social Impression in Experiment 3; Negativity and
Social Impression in Experiment 4). In Experiment 3,
the Voice factor did not have an acceptable level of re-
liability. The Social Impression factor was reliable, but
had the support of only two items. In Experiment 4,
the final MOS-R4 had four items supporting the Social
Impression factor, but the Negativity factor only had
two items and low reliability and sensitivity.
The primary goal of Experiments 3 and 4 was to
expand the coverage of the MOS to include new fac-
tors that have become important in the evaluation of
synthetic speech. To accomplish that goal, we felt that
it was necessary to include a Social Impression fac-
tor and to include a factor related to the prosodic fea-
tures of speech. Van Riper and Emerick (1990) define prosody as “the linguistic stress patterns [of speech] as reflected in pause, inflection, juncture” and “the melody or cadence of speech” (p. 491). Our initial MOS-R3
included items that contribute to prosody (Emphasis,
Rhythm, Intonation, Interruptions), yet these items did
not clearly align in a single factor. In Experiment 4, the
stronger loadings of Social Impression items (likely due to the larger effect sizes of social impression as compared with vocal speech perceptions) resulted in our deletion of all items that could be related to prosody (Emphasis, Voice Quality, Rhythm, Intonation, Monotone Quality), as required to create an efficient, yet
reliable scale. Recently, researchers have begun to ac-
knowledge that prosodic qualities are vital for accept-
able synthetic speech and to develop algorithms that
approximate human prosody (Portele and Heuft, 1997;
Sonntag and Portele, 1998). In addition to our goals
to measure social aspects and prosody of synthetic
speech, we required that each factor have acceptable
reliability and the support of at least three items.
Revisions to the MOS-R4. We targeted an analysis
of the combined data from Experiments 3 and 4 for
our next revisions to the MOS. As a consequence of
the iterative evaluation process of Experiments 3 and
4, the complete item sets for the studies had 14 items
in common. The common items were four items as-
sociated with Intelligibility, four items associated with
Naturalness, three items associated with Fluency, two
items associated with Social Impression, and the De-
pression item (associated with the Voice factor in the
MOS-R3 and the Negativity factor in the MOS-R4).
Because these items were common across both stud-
ies, the sample size for their psychometric evaluation
was the sum of the sample sizes for Experiments 3 and
4 (342 complete and independent sets of responses).
The factor analyses of Experiments 3 and 4 strongly
suggested that the Intelligibility, Naturalness, and Flu-
ency factors would remain intact in an analysis of the
combined data. It also seemed likely that the two items
associated with Social Impression in the MOS-R3 and
MOS-R4 would continue to align. The expected be-
havior of the Depression item was harder to predict. If
it aligned with the Social Impression factor and the So-
cial Impression factors reliability exceeded 0.70, then
this version of the MOS would meet the initial goals
of our research program, producing an Expanded MOS
(MOS-X).
Experiment 5: The Initial MOS-Expanded
(MOS-X)
The goal of this experiment was to complete a psycho-
metric analysis of the combined data from Experiments
3 and 4, with the purpose of uncovering a stable, logi-
cal, and theoretically-supported factor structure for the
new scale.
Method
To perform this analysis, we created a new database
from the results of Experiments 3 and 4. The 14 items
were Listening Effort, Comprehension
Problems, Articulation, Voice Pleasantness, Voice Nat-
uralness, Humanlike Voice, Emphasis, Voice Quality,
Rhythm, Intonation, Precision, Trust, Confidence, and
Depression.
Results and Discussion
Factor Analysis. The analysis indicated that the new
scale contained four factors, which we named Intelligi-
bility and Naturalness (the consistent factors through-
out our research), Prosody, and Social Impression. The
Depression item aligned more strongly with the Social
Impression factor than with any other factor, but with
a somewhat lower loading than the other two items.
This result corresponds more successfully to our initial
goal of improving the measurement of both perceptual
speech and social impressions than the MOS revisions
of Experiments 3 and 4. At the same time, the Depres-
sion item was a concern, since it was not as clearly a part
of Social Impression as the other items in this factor.
Reliability. Coefficient alpha for each factor indi-
cated acceptable reliability (Overall: .92, Intelligibility:
.88, Naturalness: .87, Prosody: .85, Social Impression:
.71), although Social Impression was a clear candidate
for improvement.
Sensitivity. Statistical analysis confirmed that the
new scale effectively discriminated among the six syn-
thetic voices used in Experiments 3 and 4. Figure 4
shows the similarity between Voices A and B (as ex-
pected because they used the same core TTS tech-
nology). Again consistent with expectation, the for-
mant voice (Voice E) was the most poorly rated voice
for perceived Naturalness. The perceived Intelligibil-
ity, Prosody, and Social Impression of Voice E were
identical to that of concatenative Voice D (a low-quality
concatenative voice). All four factors seemed to be rea-
sonably sensitive to the differences among the voices.
Summary of Results. The combined analysis of data
from Experiments 3 and 4 produced a four-factor scale,
but a single item (Depression) had a somewhat tenuous
association with the Social Impression factor. The va-
lidity and sensitivity of the scale seemed adequate, but
the reliability was relatively low for the Social Impres-
sion factor. To perform this combined analysis, we had
to remove several items from the scale that did not ap-
pear in both of the previous experiments (Enthusiasm
and Persuasiveness), and these items may have been
better contributors to the Social Impression factor.
Revision to the Initial MOS-X. We proposed a final
new study to see if adding Enthusiasm and/or Persua-
siveness items would improve the reliability of Social
Impression.
Experiment 6: Final MOS-X Psychometric
Evaluation
We undertook a final study to determine if we could
improve the reliability of Social Impression by adding
Enthusiasm and Persuasiveness to the MOS-X scale.
Method
Participants. The sample included complete sets of
ratings from 327 randomly selected IBM employees.
Figure 4. Voice by factor interaction (initial MOS-X). [Line plot of mean ratings (1.0–7.0) on Intelligibility, Naturalness, Prosody, and Social Impression for Voices A, B, C, C2, D, and E.]

Materials. The study used a between-subjects design with ten levels of the independent variable of synthetic voice. The voices and their key characteristics were:
A: concatenative female, 8 kHz
B: concatenative male, 8 kHz
C: concatenative male, 22 kHz
D: concatenative male, 8 kHz
E: concatenative male, 8 kHz
F: concatenative female, 22 kHz
G: concatenative male, 22 kHz
H: concatenative male, 8 kHz
I: concatenative male, 8 kHz
J: concatenative male, 8 kHz.
The dependent measures were the ratings for the
14 MOS-X items (Listening Effort, Comprehension
Problems, Articulation, Precision, Voice Pleasantness,
Voice Naturalness, Humanlike Voice, Voice Quality,
Emphasis, Rhythm, Intonation, Trust, Confidence, and
Depression). In addition, we added the two proposed
items, Enthusiasm and Persuasiveness, which we ex-
pected to align with the Social Impression factor.
Procedure. Participants received an email inviting
them to participate in the study and directing them to
a web page (one page for each participant group) with
instructions, a link to a recording of one of the syn-
thetic voices, and the rating scales. After accessing the
web page, participants clicked the link that caused the
synthetic voice file to play on the participant's audio
player application. They then completed the 16 items
for that voice.
Results and Discussion
Factor Analysis. For the factor analysis, we forced a
4-factor solution due to the results of a previous study
(Experiment 4), showing that Enthusiasm and Persua-
siveness loaded on Social Impression. As predicted,
the factor analysis showed that both Enthusiasm and
Persuasiveness loaded on Social Impression. This final analysis confirmed that the MOS-X had four clear fac-
tors: Intelligibility, Naturalness, Prosody, and Social
Impression.
Reliability. We removed Depression to create a 15-
item scale and calculated reliability statistics. Intelli-
gibility (0.88), Naturalness (0.86), Prosody (0.86), So-
cial Impression (0.86), and the Overall score (0.93)
had coefficient alphas much greater than 0.70, demon-
strating reliabilities adequate for usability evaluation
(Landauer, 1988).
Sensitivity. The statistical analysis suggested that the factor profiles for the voices were significantly different, as shown in Fig. 5. Generally, participants rated Intelligibility most positively and Prosody most poorly of the four factors.

Figure 5. Voice by factor interaction (final MOS-X). [Line plot of mean factor ratings (1–7) on Intelligibility, Naturalness, Prosody, and Social Impression for Voices A through J.]
Summary of Results. This final study showed that
this version of the MOS-X achieved a theoretically-
derived, psychometrically sound factor structure, and
it is a reliable and sensitive instrument. With these re-
sults, we achieved the goal of revising and expanding
the MOS for use in applied measurement of synthetic
speech.
General Discussion
Psychometric evaluations of the original MOS pro-
vided some indication of its strength for evaluating syn-
thetic speech, but this research had been neither systematic nor repeated to ensure that the scale was reliable, valid, and sensitive for applied measurement. Although the MOS's psychometric properties were acceptable, Experiment 2 showed that it was possible to improve its
psychometric properties, leading to the development
of the Revised MOS (MOS-R). Additional studies ex-
panded and validated the instrument, resulting in a
more comprehensive measure of synthetic speech, the
MOS-X.
The final version of the MOS-X contained two important advancements over the initial MOS-R. First,
Experiments 3 and 4 investigated a number of subtle
vocal and social-emotional characteristics that past lit-
erature has validated as having an impact on listener
perception of speech. Thus, using the literature and the
results of these studies as a guide, we expanded the con-
tent of the current MOS to measure both prosodic and
social impressions of listeners, producing the MOS-X.
Developers of articial voices can use these two new
MOS factors to help guide the continued development
of synthetic speech. In addition to expanding the scope
of the MOS, the MOS-X retained the desirable psycho-
metric properties of the Intelligibility and Naturalness
factors from the MOS and MOS-R.
Experiments 3 and 4 also resolved several problems
observed in the MOS and MOS-R. First, the Speak-
ing Rate item, which did not clearly associate with
either Intelligibility or Naturalness in earlier evalua-
tions (Kraft and Portele, 1995; Lewis, 2001a,b), loaded
on the Fluency factor in Experiment 3 (MOS-R3).
We later excluded Speaking Rate from the MOS-R3
without significant loss of reliability. The Humanlike
Voice item loaded strongly on the Naturalness factor,
and we retained it through the MOS-R3 and MOS-R4
into the MOS-X. Finally, the Global Impression item
consistently loaded on more than one factor, although
its strongest loading tended to be on the Intelligibil-
ity factor. We removed this item during the efficiency
phase of Experiment 3 and found that reliability im-
proved, suggesting that the Global Impression item was
at least partially responsible for the lower reliability of
its associated factor in the previous evaluations.
We also generated several new and interesting prob-
lems. Most notably, the MOS-R3 Loudness item asso-
ciated with Pitch and Depression. Loudness and pitch
(and their acoustic correlates intensity and fundamen-
tal frequency, respectively) are typically measured in
a clinical evaluation of human speech, particularly
voice or phonation, and are indicative of a number
of pathologies, including clinical depression (Baken,
1978; Murray and Arnott, 1993). This associative pat-
tern partially prompted the Voice factor name in Exper-
iment 3. However, when we removed Pitch in Exper-
iment 4, the Loudness item became a separate factor
and Depression associated with the Social Impression
factor.
The elusive Voice factor was also apparent in Ex-
periment 4. In this version of the MOS, Fear and De-
pression aligned in a factor we named Negativity. Both
of these items elicit listeners' perceptions of negative emotion, which distinguishes them from the social-personality inferences elicited by other Social Impression items. Fear and depression are signaled by voice characteristics: a rapid speaking rate, elevated pitch, wide pitch range, and irregular voicing convey fear, whereas a slow speaking rate, lowered pitch, reduced loudness, and downward inflections convey sadness or depression (Murray and Arnott, 1993). Thus, although we apparently eliminated the Voice factor, inferences about a speaker's emotional state are derived from voice information; voice items remained in the MOS-R4, although covertly.
A second observation concerns the type of items
that we removed from the MOS-R in Experiments 3
and 4. Most of the omitted items were perceptual ratings specific to the speech pattern itself and typical of
evaluative judgments made of human speech disorders
(Baken, 1978). Of the items ultimately removed from
the revised scales, eight items were perceptual judg-
ments made by speech-language pathologists in clini-
cal evaluations (Speaking Rate, Loudness, Emphasis,
Interruptions, Pitch, Rhythm, Intonation, Monotone
Quality). All items remaining on the measure (except
Voice Quality) appear to be more abstract interpretative
qualities derived from speech. In many respects, this
pattern of item exclusion is logical because naïve lis-
teners do not have a clinical vocabulary or perceptual
training to directly evaluate speech characteristics. The
layperson is perhaps better suited to make inferences
about a speaker's emotional state or social characteris-
tics (even if the speaker is an abstraction), as shown by
the vast literature on these topics (Murray and Arnott,
1993).
The final items included in the MOS-X resulted in a
blend of the factors present in the MOS-R of Experi-
ments 3 and 4. The Prosody factor targeted vocal speech
perceptions and the Social Impression factor targeted
social-emotional interpretations. The MOS-X became
the most satisfying revision of the MOS because both
types of ratings point to acoustic modifications that may
be made in a voice, and indicate the social attributions
listeners make about a voice they hear in applications
that use synthetic speech. The final evaluation in Experiment 6 confirmed that our final version is an instrument with strong psychometric properties, making
it well-suited for high-quality evaluations in industrial
settings.
In general, this body of research has met our ini-
tial goal of strengthening the MOS for speech system
evaluation. The MOS-R measures the same factors as
the standard MOS, but with improved psychometric
properties. The MOS-X broadened the range of vari-
ables this instrument measures reliably and is sensitive
enough to detect key differences among a set of artificial voices. Munsterberg asserted in 1913 that “the whole world of industry will have to learn the great lesson, that of the three great factors, material, machine, and man, the man is not the least, but the most important” (p. 593). In the world of speech technology, reliable,
valid, and sensitive evaluation of listener perception is
vital to the usability of applications that use synthetic
speech.
Appendix A: The Standard MOS
(Salza et al., 1996)
1. Global Impression: Your answer must indicate how
you rate the sound quality of the voice you have heard.
[ ] Excellent
[ ] Good
[ ] Fair
[ ] Poor
[ ] Bad
2. Listening Effort: Your answer must indicate the
degree of effort you had to make to understand the
message.
[ ] No effort required
[ ] Slight effort required
[ ] Effort required
[ ] Major effort required
[ ] Message not understood with any feasible
effort
3. Comprehension Problems: Your answer must indi-
cate if you found single words hard to understand.
[ ] None
[ ] Few
[ ] Some
[ ] Many
[ ] Every word
4. Speech Sound Articulation: Your answer must indi-
cate if the speech sounds are clearly distinguishable.
[ ] Yes, very clearly
[ ] Yes, clearly enough
[ ] Fairly clear
[ ] No, not very clear
[ ] No, not at all
5. Pronunciation: Your answer must indicate if you noticed any anomalies in the naturalness of sentence pronunciation.
[ ] No
[ ] Yes, but not annoying
[ ] Yes, slightly annoying
[ ] Yes, annoying
[ ] Yes, very annoying
6. Speaking Rate: Your answer must indicate if you found the speed of delivery of the message appropriate.
[ ] Yes
[ ] Yes, but slower than preferred
[ ] Yes, but faster than preferred
[ ] No, too slow
[ ] No, too fast
7. Voice Pleasantness: Your answer must indicate if you found the voice you have heard pleasant.
[ ] Very pleasant
[ ] Pleasant
[ ] Fair
[ ] Unpleasant
[ ] Very unpleasant

Appendix B: The MOS-X (Final Version)

1. Listening Effort: Please rate the degree of effort you had to make to understand the message.
IMPOSSIBLE EVEN WITH MUCH EFFORT  1 2 3 4 5 6 7  NO EFFORT REQUIRED
2. Comprehension Problems: Were single words hard to understand?
ALL WORDS HARD TO UNDERSTAND  1 2 3 4 5 6 7  ALL WORDS EASY TO UNDERSTAND
3. Speech Sound Articulation: Were the speech sounds clearly distinguishable?
NOT AT ALL CLEAR  1 2 3 4 5 6 7  VERY CLEAR
4. Precision: Was the articulation of speech sounds precise?
SLURRED OR IMPRECISE  1 2 3 4 5 6 7  PRECISE
5. Voice Pleasantness: Was the voice you heard pleasant to listen to?
VERY UNPLEASANT  1 2 3 4 5 6 7  VERY PLEASANT
6. Voice Naturalness: Did the voice sound natural?
VERY UNNATURAL  1 2 3 4 5 6 7  VERY NATURAL
7. Humanlike Voice: To what extent did this voice sound like a human?
NOTHING LIKE A HUMAN  1 2 3 4 5 6 7  JUST LIKE A HUMAN
8. Voice Quality: Did the voice sound harsh, raspy, or strained?
SIGNIFICANTLY HARSH/RASPY  1 2 3 4 5 6 7  NORMAL QUALITY
9. Emphasis: Did emphasis of important words occur?
INCORRECT EMPHASIS  1 2 3 4 5 6 7  EXCELLENT USE OF EMPHASIS
10. Rhythm: Did the rhythm of the speech sound natural?
UNNATURAL OR MECHANICAL  1 2 3 4 5 6 7  NATURAL RHYTHM
11. Intonation: Did the intonation pattern of sentences sound smooth and natural?
ABRUPT OR ABNORMAL  1 2 3 4 5 6 7  SMOOTH OR NORMAL
12. Trust: Did the voice appear to be trustworthy?
NOT AT ALL TRUSTWORTHY  1 2 3 4 5 6 7  VERY TRUSTWORTHY
13. Confidence: Did the voice suggest a confident speaker?
NOT AT ALL CONFIDENT  1 2 3 4 5 6 7  VERY CONFIDENT
14. Enthusiasm: Did the voice seem to be enthusiastic?
NOT AT ALL ENTHUSIASTIC  1 2 3 4 5 6 7  VERY ENTHUSIASTIC
15. Persuasiveness: Was the voice persuasive?
NOT AT ALL PERSUASIVE  1 2 3 4 5 6 7  VERY PERSUASIVE
MOS-X Scales
Overall: Average items 1–15
Intelligibility: Average items 1–4
Naturalness: Average items 5–8
Prosody: Average items 9–11
Social Impression: Average items 12–15
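To make the scoring rules above concrete, here is a minimal Python sketch (ours, not part of the published instrument) that computes the five MOS-X scale scores from one listener's fifteen item ratings; the function name and input format are illustrative assumptions.

```python
# Minimal MOS-X scoring sketch; item numbers are 1-based, as in Appendix B.
def score_mos_x(ratings):
    """ratings: list of 15 integers (items 1-15), each on the 1-7 scale."""
    if len(ratings) != 15 or not all(1 <= r <= 7 for r in ratings):
        raise ValueError("Expected 15 ratings, each between 1 and 7.")

    def mean(item_numbers):
        return sum(ratings[i - 1] for i in item_numbers) / len(item_numbers)

    return {
        "Overall": mean(range(1, 16)),
        "Intelligibility": mean(range(1, 5)),
        "Naturalness": mean(range(5, 9)),
        "Prosody": mean(range(9, 12)),
        "Social Impression": mean(range(12, 16)),
    }

# Example: a listener who rates every item 5 scores 5.0 on all five scales.
print(score_mos_x([5] * 15))
```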
Appendix C: Statistical Appendix
This appendix provides additional detail on the statistics (factor loadings and ANOVA results) summarized in
the Results and Discussion sections of each Experiment. Factor loadings greater than 0.50 were considered
aligned with the factor named in the column headings.
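As an illustration of this alignment rule, the sketch below extracts rotated factors from a listeners-by-items rating matrix and flags any loading whose absolute value exceeds 0.50. The data, factor count, and varimax rotation are placeholder assumptions, not a reconstruction of the analyses reported here.

```python
# Hypothetical sketch of the |loading| > 0.50 alignment rule. Random data
# stands in for real listener ratings, so most loadings will be near zero.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(200, 15)).astype(float)  # listeners x items

fa = FactorAnalysis(n_components=4, rotation="varimax").fit(ratings)
loadings = fa.components_.T  # one row per item, one column per factor

for item, row in enumerate(loadings, start=1):
    aligned = [f + 1 for f, w in enumerate(row) if abs(w) > 0.50]
    print(f"Item {item:2d} aligns with factor(s): {aligned or 'none'}")
```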
Experiment 1
Table 2. MOS factor loading.
Item  Content                     Intelligibility  Naturalness  Undefined
1     Global Impression           0.327            0.900        0.194
2     Listening Effort            0.629            0.370        0.427
3     Comprehension Problems      0.693            0.104        0.358
4     Speech Sound Articulation   0.672            0.433        0.294
5     Pronunciation               0.746            0.437        0.139
6     Speaking Rate               0.322            0.204        0.754
7     Voice Pleasantness          0.182            0.665        0.139
Sensitivity. A mixed-factors ANOVA indicated a significant main effect of System (F(4,68) = 9.6, p =
.000003), a significant main effect of MOS Factor (F(2,136) = 14.7, p = .000002), and a significant System by
Factor interaction (F(8,136) = 3.1, p = .003).
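For readers who want to run this kind of test on their own ratings, the following hedged sketch fits a mixed-factors (split-plot) ANOVA with one between-subjects variable (System) and one within-subjects variable (MOS Factor) using the pingouin package; the file and column names are hypothetical, not from this study.

```python
# Hedged sketch of a mixed-factors ANOVA; all data and names are hypothetical.
import pandas as pd
import pingouin as pg

# Long-format data: one row per listener per MOS factor, with a scale score.
df = pd.read_csv("mos_ratings.csv")  # columns: listener, system, factor, score

aov = pg.mixed_anova(
    data=df,
    dv="score",          # mean MOS scale score
    within="factor",     # MOS factor (e.g., Intelligibility, Naturalness)
    between="system",    # synthetic voice/system the listener heard
    subject="listener",  # listener identifier
)
print(aov[["Source", "DF1", "DF2", "F", "p-unc"]])
```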
Experiment 2
Table 3. MOS-R factor loading.
Item Content Intelligibility Naturalness
1 Global Impression 0.56 0.51
2 Listening Effort 0.78 0.36
3 Comprehension Problems 0.85 0.23
4 Speech Sound Articulation 0.70 0.39
5 Pronunciation 0.57 0.48
6 Voice Pleasantness 0.32 0.88
7 Voice Naturalness 0.40 0.83
8 Ease of Listening 0.38 0.83
9 Speaking Rate 0.53 0.32
Experiment 3
Table 4. MOS-R3 factor loading.
Item  Content              Factor 1: Intelligibility  Factor 2: Fluency  Factor 3: Voice  Factor 4: Social impression  Factor 5: Naturalness
1     Global Impression*   0.612   0.237   0.253   0.193   0.448
2     Listening Effort     0.712   0.216   0.308   0.155   0.213
3     Comprehension        0.742   0.248   0.261   0.108   0.256
4     Articulation         0.763   0.203   0.209   0.158   0.340
5     Pronunciation*       0.487   0.294   0.160   0.300   0.308
6     Pleasantness         0.243   0.217   0.315   0.219   0.750
7     Voice Naturalness    0.349   0.417   0.101   0.228   0.605
8     Ease of Listening*   0.410   0.411   0.250   0.257   0.511
9     Humanlike Voice      0.398   0.342   0.073   0.214   0.644
10    Speaking Rate*       0.306   0.514   0.264   0.128   0.079
11    Loudness             0.171   0.105   0.477   0.086   0.069
12    Emphasis             0.181   0.754   0.197   0.202   0.182
13    Voice Quality        0.365   0.170   0.288   0.205   0.524
14    Interruptions*       0.516   0.306   0.171   0.429   0.047
15    Pitch*               0.274   0.248   0.398   0.044   0.319
16    Rhythm               0.267   0.722   0.007   0.240   0.370
17    Intonation           0.282   0.653   0.071   0.364   0.338
18    Precision            0.612   0.167   0.212   0.149   0.386
19    Topic Interest*      0.068   0.439   0.285   0.410   0.266
20    Trust                0.173   0.202   0.246   0.760   0.352
21    Confidence           0.365   0.157   0.345   0.662   0.261
22    Depression           0.104   0.008   0.605   0.131   0.133
*Item removed to improve reliability and/or efficiency of scale.
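The criterion behind this footnote, dropping an item when doing so improves a scale's reliability or efficiency, can be checked with coefficient alpha. The sketch below (placeholder file and item names, not the study's data) compares Cronbach's alpha for a scale with and without a candidate item.

```python
# Placeholder sketch: does removing one item raise (or barely lower) alpha?
import pandas as pd
import pingouin as pg

items = pd.read_csv("scale_items.csv")  # one column per item, one row per listener

alpha_full, _ = pg.cronbach_alpha(data=items)
alpha_drop, _ = pg.cronbach_alpha(data=items.drop(columns=["pronunciation"]))

print(f"alpha, all items:        {alpha_full:.2f}")
print(f"alpha, one item dropped: {alpha_drop:.2f}")
```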
Sensitivity. A mixed model ANOVA indicated the extent to which the MOS-R3 discriminated among the four
synthetic voices. The ANOVA showed a main effect of synthetic voice (F(3,154) = 26.92, p < 0.0001), factor
(F(4,616) = 79.03, p < 0.0001), and a significant interaction between these variables (F(12,616) = 4.56,
p < 0.0001, shown in Fig. 2).
Experiment 4
Table 5. MOS-R4 factor loading.
Item  Content              Factor 1: Social impression  Factor 2: Intelligibility  Factor 3: Negativity  Factor 4: Naturalness
1     Listening Effort     0.241   0.750   0.113   0.236
2     Comprehension        0.025   0.797   0.056   0.295
3     Articulation         0.039   0.831   0.070   0.211
4     Pleasantness         0.444   0.194   0.230   0.553
5     Voice Naturalness    0.192   0.268   0.002   0.849
6     Humanlike Voice      0.165   0.275   0.008   0.769
8     Emphasis*            0.565   0.403   0.221   0.173
9     Voice Quality        0.320   0.375   0.015   0.618
10    Rhythm*              0.406   0.406   0.082   0.536
11    Intonation*          0.325   0.478   0.019   0.505
12    Monotone Quality*    0.579   0.017   0.043   0.493
13    Precision            0.321   0.650   0.295   0.063
14    Trust                0.686   0.292   0.250   0.178
15    Enthusiasm           0.793   0.079   0.103   0.263
16    Confidence           0.713   0.141   0.270   0.114
17    Depression           0.500   0.142   0.673   0.028
18    Attractiveness*      0.599   0.174   0.082   0.457
19    Persuasiveness       0.691   0.249   0.002   0.351
20    Impatience*          0.451   0.156   0.277   0.522
21    Fear                 0.029   0.138   0.835   0.226
*Item removed to improve reliability and/or efficiency of scale.
Sensitivity. The ANOVA showed a main effect of synthetic voice (F(4,124) = 9.18, p < 0.0001), factor
(F(3,372) = 101.92, p < 0.0001), and a significant interaction between these variables (F(12,372) = 2.70,
p = 0.002).
Experiment 5
Table 6. Initial MOS-X factor loading.
Item  Content            Factor 1: Prosody  Factor 2: Intelligibility  Factor 3: Social impression  Factor 4: Naturalness
1     Listening Effort   0.18   0.70   0.28   0.23
2     Comprehension      0.24   0.78   0.11   0.23
3     Articulation       0.19   0.82   0.17   0.25
4     Pleasantness       0.21   0.27   0.40   0.61
5     Voice Naturalness  0.36   0.29   0.13   0.79
6     Humanlike Voice    0.30   0.34   0.20   0.67
7     Emphasis           0.57   0.23   0.28   0.17
8     Voice Quality      0.25   0.28   0.33   0.50
9     Rhythm             0.73   0.24   0.19   0.38
10    Intonation         0.76   0.25   0.23   0.30
11    Precision          0.23   0.54   0.31   0.25
12    Trust              0.20   0.19   0.78   0.29
13    Confidence         0.17   0.25   0.68   0.27
14    Depression         0.11   0.07   0.40   0.03
Sensitivity. A mixed model ANOVA indicated the extent to which the MOS-X discriminated among the six
different synthetic voices used in Experiments 3 and 4. The ANOVA showed a main effect of synthetic voice
(F(5,275) = 27.5, p < 0.0000001), factor (F(3,825) = 58.5, p < 0.0000001), and a significant interaction
between these variables (F(15,825) = 3.8, p = 0.000001).
Experiment 6
Table 7. Final MOS-X factor loading.
Item  Content            Factor 1: Intelligibility  Factor 2: Prosody  Factor 3: Social impression  Factor 4: Naturalness
1     Listening Effort   0.730   0.156   0.085   0.174
2     Comprehension      0.808   0.039   0.082   0.114
3     Articulation       0.865   0.095   0.103   0.151
4     Pronunciation      0.716   0.209   0.135   0.140
5     Pleasantness       0.218   0.181   0.477   0.588
6     Voice Naturalness  0.289   0.445   0.293   0.670
7     Humanlike Voice    0.240   0.376   0.294   0.626
8     Voice Quality      0.342   0.090   0.387   0.466
9     Emphasis           0.151   0.662   0.338   0.111
10    Rhythm             0.155   0.744   0.293   0.269
11    Intonation         0.189   0.723   0.343   0.256
12    Trust              0.229   0.338   0.622   0.316
13    Confidence         0.239   0.285   0.691   0.265
14    Depression*        0.096   0.107   0.631   0.192
15    Enthusiasm         0.002   0.311   0.700   0.150
16    Persuasiveness     0.060   0.379   0.743   0.154
*Item removed to improve reliability and/or efficiency of scale.
Sensitivity. A mixed model ANOVA indicated the extent to which the MOS-X discriminated among the ten
synthetic voices. The ANOVA showed a significant effect of factor (F(3,951) = 443.96, p < 0.0001), and a
significant interaction between factor and voice (F(27,951) = 2.37, p < 0.0001). The main effect of synthetic
voice was not significant (F(9,317) = 1.01, p = 0.44), indicating a similar mean rating for the ten voices. The
overall mean scores for the voices ranged from 4.50 (Voice J, least positive) to 5.10 (Voice A, most positive).
References
Baken, R. (1978). Clinical Measurement of Speech and Voice. Boston: Allyn & Bacon.
Berry, D. (1992). Vocal types and stereotypes: Joint effects of vocal attractiveness and vocal maturity on person perception. Journal of Nonverbal Behavior, 16:41–45.
Bloom, K., Zajac, D., and Titus, J. (1999). The influence of nasality of voice on sex-stereotyped perceptions. Journal of Nonverbal Behavior, 23:271–281.
Bradlow, A., Torretta, G., and Pisoni, D. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20:255–272.
Brown, B., Strong, W., and Rencher, A. (1973). Perceptions of personality from speech: Effects of manipulations of acoustical parameters. Journal of the Acoustical Society of America, 54:29–35.
Brown, B., Strong, W., and Rencher, A. (1975). Acoustic determinants of perceptions of personality from speech. International Journal of the Sociology of Language, 6:1–32.
Cliff, N. (1987). Analyzing Multivariate Data. San Diego, CA: Harcourt Brace Jovanovich.
Coovert, M.D. and McNelis, K. (1988). Determining the number of common factors in factor analysis: A review and program. Educational and Psychological Measurement, 48:687–693.
Ekman, P., O'Sullivan, M., Friesen, W., and Scherer, K. (1991). Face, voice, and body in detecting deceit. Journal of Nonverbal Behavior, 15:125–135.
Francis, A.L. and Nusbaum, H.C. (1999). Evaluating the quality of synthetic speech. In D. Gardner-Bonneau (Ed.), Human Factors and Voice Interactive Systems. Boston, MA: Kluwer, pp. 63–97.
Goldstein, M. (1995). Classification of methods used for assessment of text-to-speech systems according to the demands placed on the listener. Speech Communication, 16:225–244.
Gorenflo, D. and Gorenflo, C. (1997). Effects of synthetic speech, gender, and perceived similarity on attitudes toward the augmented communicator. AAC: Augmentative and Alternative Communication, 13:87–91.
Granstrom, B. and Nord, L. (1992). Neglected dimensions in speech synthesis. Speech Communication, 11:459–462.
Greene, B., Logan, J., and Pisoni, D. (1986). Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems. Behavior Research Methods, Instruments, and Computers, 18:100–107.
Hieda, I. and Kuchinomachi, Y. (1997). Preliminary study of relations between physical characteristics and psychological impressions of natural voices. Perceptual and Motor Skills, 85:1483–1491.
Higashikawa, M. and Minifie, F. (1999). Acoustical-perceptual correlates of "whisper pitch" in synthetically generated vowels. Journal of Speech, Language, and Hearing Research, 42:583–591.
Hillenbrand, J. (1988). Perception of aperiodicities in synthetically generated voices. Journal of the Acoustical Society of America, 83:2361–2371.
Hoag, L. and Bedrosian, J. (1992). Effects of speech output type, message length, and reauditorization on perceptions of the communicative competence of an adult AAC user. Journal of Speech and Hearing Research, 35:1363–1366.
Holtgraves, T. and Lasky, B. (1999). Linguistic power and persuasion. Journal of Language and Social Psychology, 18:196–205.
Hosman, L. (1989). The evaluative consequences of hedges, hesitations, and intensifiers: Powerful and powerless speech styles. Human Communication Research, 15:383–406.
International Telecommunication Union (1994). A Method for Subjective Performance Assessment of the Quality of Speech Voice Output Devices (ITU-T Recommendation P.85). Geneva, Switzerland: ITU.
Johnson, W., Emde, R., Scherer, K., and Klinnert, M. (1986). Recognition of emotion from vocal cues. Archives of General Psychiatry, 43:280–283.
Johnston, R.D. (1996). Beyond intelligibility: The performance of text-to-speech synthesisers. BT Technology Journal, 14:100–111.
Klatt, D. and Klatt, L. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87:820–857.
Koopmans-Van Beinum, F. (1992). The role of focus words in natural and in synthetic continuous speech: Acoustic aspects. Speech Communication, 11:439–452.
Kraft, V. and Portele, T. (1995). Quality evaluation of five German speech synthesis systems. Acta Acustica, 3:351–365.
Landauer, T.K. (1988). Research methods in human-computer interaction. In M. Helander (Ed.), Handbook of Human-Computer Interaction. New York: Elsevier.
Lavner, Y., Gath, I., and Rosenhouse, J. (2000). The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels. Speech Communication, 30:9–26.
Lewis, J.R. (1993). Multipoint scales: Mean and median differences and observed significance levels. International Journal of Human-Computer Interaction, 5:383–392.
Lewis, J.R. (2001a). Psychometric properties of the Mean Opinion Scale. In Proceedings of HCI International 2001: Usability Evaluation and Interface Design. Mahwah, NJ: Lawrence Erlbaum, pp. 149–153.
Lewis, J.R. (2001b). The Revised Mean Opinion Scale (MOS-R): Preliminary Psychometric Evaluation (Tech. Report 29.3414). Raleigh, NC: International Business Machines Corp.
Martin, R. and Haroldson, S. (1992). Stuttering and speech naturalness: Audio and audiovisual judgments. Journal of Speech and Hearing Research, 35:521–528.
Massaro, D. and Egan, P. (1996). Perceiving affect from the voice and the face. Psychonomic Bulletin & Review, 3:215–221.
Miyake, K. and Zuckerman, M. (1993). Beyond personality: Effects of physical and vocal attractiveness on false consensus, social comparison, affiliation, and assumed and perceived personality. Journal of Personality, 61:411–437.
Moller, S., Jekosch, U., Mersdorf, J., and Kraft, V. (2001). Auditory assessment of synthesized speech in application scenarios: Two case studies. Speech Communication, 34:229–246.
Munsterberg, H. (1913). Psychology and industrial efficiency. In L.T. Benjamin Jr. (Ed.), A History of Psychology: Original Sources and Contemporary Research, 2nd edn. Boston: McGraw-Hill, pp. 584–593.
Murray, I. and Arnott, J. (1993). Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. Journal of the Acoustical Society of America, 93:1097–1108.
Murray, I. and Arnott, J. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16:369–390.
Murray, I., Arnott, J., and Rohwer, E. (1996). Emotional stress in synthetic speech: Progress and future directions. Speech Communication, 20:85–91.
Nunnally, J.C. (1978). Psychometric Theory. New York: McGraw-Hill.
Paddock, J. and Nowicki, S. (1986). Paralanguage and the interpersonal impact of dysphoria: It's not what you say but how you say it. Social Behavior and Personality, 14:29–44.
Page, R. and Balloun, J. (1978). The effect of voice volume on the perception of personality. Journal of Social Psychology, 105:65–72.
Paris, C.R., Thomas, M.H., Gilson, R.D., and Kincaid, J.P. (2000). Linguistic cues and memory for synthetic and natural speech. Human Factors, 42:421–431.
Pelachaud, C., Badler, N., and Steedman, M. (1996). Generating facial expressions for speech. Cognitive Science, 20:1–46.
Pisoni, D. (1997). Perception of synthetic speech. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg (Eds.), Progress in Speech Synthesis. New York: Springer, pp. 541–560.
Pols, L. and Jekosch, U. (1997). A structured way of looking at the performance of text-to-speech systems. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg (Eds.), Progress in Speech Synthesis. New York: Springer, pp. 519–528.
Portele, T. and Heuft, B. (1997). Toward a prominence-based synthesis system. Speech Communication, 21:61–72.
Salza, P.L., Foti, E., Nebbia, L., and Oreglia, M. (1996). MOS and pair comparison combined methods for quality evaluation of text-to-speech systems. Acta Acustica, 82:650–656.
Schmidt-Nielsen, A. (1995). Intelligibility and acceptability testing for speech technology. In A. Syrdal, R. Bennett, and S. Greenspan (Eds.), Applied Speech Technology. Boca Raton: CRC Press.
Shipley, K. and McAfee, J. (1992). Assessment in Speech Language Pathology: A Resource Manual. San Diego: Singular.
Slowiaczek, L. and Nusbaum, H. (1985). Effects of speech rate and pitch contour on the perception of synthetic speech. Human Factors, 27:701–712.
Sonntag, G.P. and Portele, T. (1998). PURR – a method for prosody evaluation and investigation. Computer Speech and Language, 12:437–451.
Sonntag, G.P., Portele, T., Haas, F., and Kohler, J. (1999). Comparative evaluation of six German TTS systems. In Eurospeech 99. Budapest: Technical University of Budapest, pp. 251–254.
Stern, S., Mullennix, J., Dyson, C., and Wilson, S. (1999). The persuasiveness of synthetic speech versus human speech. Human Factors, 41:588–595.
Tartter, V. and Braun, D. (1994). Hearing smiles and frowns in normal and whisper registers. Journal of the Acoustical Society of America, 96:2101–2107.
van Bezooijen, R. and van Heuven, V. (1997). Assessment of synthesis systems. In D. Gibbon, R. Moore, and R. Winski (Eds.), Handbook of Standards and Resources for Spoken Language Systems. New York, NY: Mouton de Gruyter.
van Riper, C. and Emerick, L. (1990). Speech Correction. Englewood Cliffs, NJ: Prentice Hall.
Wang, H. and Lewis, J.R. (2001). Intelligibility and acceptability of short phrases generated by embedded text-to-speech engines. In Proceedings of HCI International 2001: Usability Evaluation and Interface Design. Mahwah, NJ: Lawrence Erlbaum, pp. 144–148.
Whalen, D. and Hoequist, C. (1995). The effects of breath sounds on the perception of synthetic speech. Journal of the Acoustical Society of America, 97:3147–3153.
Whitmore, J. and Fisher, S. (1996). Speech during sustained operations. Speech Communication, 20:55–70.
Yabuoka, H., Nakayama, T., Kitabayashi, Y., and Asakawa, Y. (2000). Investigations of independence of distortion scales in objective evaluation of synthesized speech quality. Electronics and Communications in Japan, Part 3, 83:14–22.
Yaeger-Dror, M. (1996). Register as a variable in prosodic analysis: The case of the English negative. Speech Communication, 19:39–60.
Zuckerman, M., Miyake, K., and Hodgins, H. (1991). Cross-channel effects of vocal and physical attractiveness and their implications for interpersonal perception. Journal of Personality and Social Psychology, 60:545–554.