Psychometric Properties of the Mean Opinion Scale
James R. Lewis
IBM Voice Systems
1555 Palm Beach Lakes Blvd.
West Palm Beach, Florida
jimlewis@us.ibm.com
Abstract
The Mean Opinion Scale (MOS) is a seven-item questionnaire used to evaluate speech quality. Analysis of
existing data revealed (1) two MOS factors (Intelligibility and Naturalness, plus a single independent Rate item), (2)
good reliability for Overall MOS and for subscales based on the Intelligibility and Naturalness factors, (3) appropriate
sensitivity of MOS factors, (4) validity of MOS factors related to paired comparisons, and (5) validity of MOS
Intelligibility related to intelligibility scores. In conclusion, the current MOS has acceptable psychometric properties,
but adding items to the Naturalness scale and increasing the number of scale steps from five to seven should
improve its reliability.
1. Introduction
The Mean Opinion Scale (MOS) is the method for evaluating text-to-speech (TTS) quality recommended by
the International Telecommunications Union (ITU). The MOS is a Likert-style questionnaire, typically with seven 5-
point scale items addressing the following TTS characteristics: (1) Global Impression, (2) Listening Effort, (3)
Comprehension Problems, (4) Speech Sound Articulation, (5) Pronunciation, (6) Speaking Rate, and (7) Voice
Pleasantness.
It might seem that articulation tests that assess intelligibility (such as rhyme tests) would be more suitable
for evaluating artificial speech than a subjective tool such as the MOS. Most modern text-to-speech systems,
although more demanding on the listener than natural speech (Paris, Thomas, Gilson, & Kincaid, 2000), are quite
intelligible (Johnston, 1996). "Once a speech signal has breached the 'intelligibility threshold', articulation tests lose
their ability to discriminate. … it is precisely because people's opinions are so sensitive, not just to the signal being
heard, but also to norms and expectations, that opinion tests form the basis of all modern speech quality assessment
methods." (Johnston, 1996, pp. 102, 103)
Developers of products that use artificial speech output need reliable and valid tools for evaluating the
quality of TTS systems. When the tool is a questionnaire that collects subjective ratings (like the MOS), it is
important to understand its psychometric properties. The goal of psychometrics is to establish the quality of
psychological measures (Nunnally, 1978). Some of the metrics of psychometric quality are reliability (consistent
measurement), validity (measurement of the intended attribute), and sensitivity (responds to specific experimental
manipulations).
1.1. Brief Review of Psychometric Practice
Reliability. The most common measurement of a scale’s reliability is coefficient alpha (Nunnally, 1978).
Coefficient alpha can range from 0 (completely unreliable) to 1 (perfectly reliable). For purposes of research or
evaluation in which the final score will be the average of ratings from more than one questionnaire, the minimally
acceptable reliability is .70 (Landauer, 1988).
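To make the computation concrete, the following is a minimal sketch (not from the original paper) of coefficient alpha for a respondents-by-items rating matrix, written in Python with numpy; the data values and variable names are illustrative only.

```python
# Illustrative sketch: coefficient (Cronbach's) alpha for a respondents-x-items matrix.
import numpy as np

def coefficient_alpha(ratings):
    """ratings: 2-D array, shape (n_respondents, n_items)."""
    k = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1)       # variance of each item
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-point ratings from four listeners on the seven MOS items:
ratings = np.array([[4, 4, 5, 4, 4, 3, 4],
                    [3, 3, 3, 2, 3, 3, 3],
                    [5, 4, 4, 5, 5, 4, 5],
                    [2, 2, 3, 2, 2, 3, 2]])
print(round(coefficient_alpha(ratings), 2))
```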
Validity. Researchers commonly use the correlation coefficient to assess criterion-related validity (the
relationship between the measure of interest and a different concurrent or predictive measure). The magnitude of the
correlation does not need to be large to provide evidence of validity, but the correlation should be significant.
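As a brief illustration (not from the paper), criterion-related validity can be checked by correlating questionnaire scores with a concurrent criterion measure; the data below are hypothetical.

```python
# Illustrative sketch: criterion-related validity as a Pearson correlation
# between questionnaire scores and a concurrent criterion (hypothetical data).
from scipy.stats import pearsonr

mos_scores = [3.2, 4.1, 2.8, 3.9, 4.4, 2.5, 3.6, 4.0]           # e.g., mean MOS per listener
criterion  = [0.71, 0.85, 0.60, 0.80, 0.90, 0.55, 0.75, 0.83]   # e.g., intelligibility proportion correct
r, p = pearsonr(mos_scores, criterion)
print(f"r = {r:.2f}, p = {p:.3f}")   # a modest but significant r supports validity
```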
Sensitivity. A measurement is sensitive if it responds to experimental manipulation. For a measurement to
result in statistically significant differences in an experiment, it must be both reliable and valid.
Number of scale steps. All other things being equal, a greater number of scale steps will enhance scale
reliability, but with rapidly diminishing returns. As the number of scale steps increases from two to twenty, there is
an initially rapid increase in reliability that tends to level off at about seven steps (Nunnally, 1978). After eleven
steps there is very little gain in reliability from increasing the number of steps. Lewis (1993) found that mean
differences between experimental groups measured with questionnaire items having seven steps correlated more
strongly with the observed significance level of statistical tests than did similar measurements using items that had
five scale steps.
Factor analysis. Factor analysis is a statistical procedure that examines the correlations among variables to
discover groups of related variables (Nunnally, 1978). Because summated (Likert) scales are more reliable than single
item scores and it is easier to interpret and present a smaller number of scores, it is common to conduct a factor
analysis to determine if there is a statistical basis for the formation of measurement scales based on factors.
Generally, a factor analysis requires five participants per item to ensure stable factor estimates (Nunnally, 1978).
There are a number of methods for estimating the number of factors in a set of scores, including discontinuity and
parallel analysis (Coovert & McNelis, 1988).
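As a rough sketch of how parallel analysis works (the implementation here is an assumption for illustration, not the specific procedure of Coovert and McNelis): eigenvalues of the observed correlation matrix are compared with eigenvalues from random data of the same size, and factors are retained while the observed eigenvalue exceeds the random one.

```python
# Sketch of Horn-style parallel analysis for estimating the number of factors.
# `data` is a hypothetical respondents-x-items matrix; details are illustrative.
import numpy as np

def parallel_analysis(data, n_sims=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, k = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    random_eigs = np.empty((n_sims, k))
    for i in range(n_sims):
        sim = rng.standard_normal((n, k))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))[::-1]
    threshold = random_eigs.mean(axis=0)      # some variants use the 95th percentile instead
    return int(np.sum(observed > threshold))  # suggested number of factors to retain
```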
1.2. Previous Research in MOS Psychometrics
Reliability. A literature review turned up no previous work reporting MOS reliability in any form.
Validity. Salza et al. (1996) measured the overall quality of three Italian TTS synthesis systems with a
common prosodic control but different diphones and synthesizers using both paired comparisons and the MOS.
Their results showed good agreement between the two measurement methods, providing evidence for the validity of
the MOS. Johnston (1996) had listeners judge the quality of natural speech degraded with time frequency warping.
He found a significant relationship in the expected direction for judgements using the MOS Listening Effort item
(greater degradation led to poorer ratings).
Sensitivity. Johnston (1996) found that the MOS Listening Effort item showed statistically significant
differences among the ratings of three TTS systems, and that this item was more sensitive than a more general item
asking listeners to rate the overall quality of the system. He also found that using sentences as stimuli yielded
results that were just as sensitive as those using longer paragraphs.
Yabuoka et al. (2000) investigated the relationship between five distortion scales (differential spectrum,
phase, waveform, cepstrum distance, and amplitude) and MOS ratings. They were able to calculate statistically
significant regression formulas for predicting MOS ratings from manipulations of the distortion scales.
Unfortunately, they did not report the exact type of MOS that they used in the experiment.
Factor structure. The factor structure of the MOS is currently in question. Kraft and Portele (1995), using
an eight-item version of the MOS (with an additional 'Naturalness' item), reported two factors – one interpreted as
intelligibility (segmental attributes) and one as naturalness (suprasegmental, or prosodic attributes). The Speaking
Rate (Speed) item did not fall in either of the two factors. More recently, Sonntag et al. (1999), using the same version
of the MOS (but with 6-point rather than 5-point scales), reported only a single factor.
1.3. Goals of the Current Research
The goals of the current research were to (1) evaluate the factor structure of the 7-item 5-point-scale version
of the MOS (the version reported by Salza et al., 1996, adapted for use in our lab), (2) estimate the reliability of the
overall MOS score and any revealed factors, (3) investigate the sensitivity of the MOS scores, and (4) extend the
work on validity of the MOS.
2. Method
2.1. Factor Analysis and Reliability Evaluation
Over the last two years we have conducted a number of experiments in which participants have completed
the MOS. In some of these experiments we have also collected paired-comparison data and, in the most recent
(Wang & Lewis, 2001), we also collected intelligibility scores. Participants in these experiments have included in
approximately equal numbers, males and females, persons older and younger than 40 years old, and IBM and non-
IBM employees. Drawing from six of these experiments I assembled a database of 73 independent completions of the
version of the MOS that we have been using (taken from Salza et al, 1996). (Note: Using the guideline that the
number of completed questionnaires required for factor analysis is five times the number of items in the questionnaire
(Nunnally, 1978), the minimum required number of MOS questionnaires is 35, well below the 73 questionnaires in the
database.) This database was the source for a factor analysis, reliability assessment (both of the overall MOS and
the factors identified in the factor analysis) and sensitivity investigation using analysis of variance on the
independent variable of System.
2.2. Validity Evaluations
Relationship to paired comparisons. Data from a classified IBM report provided an opportunity to replicate
the finding of Salza et al. (1996) that MOS ratings correlate significantly with paired comparisons. In the experiment
described in the report, listeners provided paired comparisons after listening to samples from each of two TTS voices,
then provided MOS ratings for each voice after hearing them a second time.
Relationship to intelligibility scores. Data from Wang and Lewis (2001) provided an opportunity to
investigate the correlation between MOS ratings and intelligibility scores. In that experiment, listeners heard a
variety of types of short phrases produced by four TTS voices, with the task to write down what the voice was
saying. After finishing that intelligibility task, listeners heard the samples for each voice a second time and provided
MOS ratings after reviewing each voice.
3. Results
3.1. Factor Analysis
After conducting a factor analysis of the MOS database, a parallel analysis (Coovert & McNelis, 1988) on
the resulting eigenvalues indicated a three-factor solution that accounted for about 71% of the variance. Table 1
shows the results of the three-factor varimax-rotated solution, with bolded text to highlight the factor on which each
item had the highest load. Note that the third factor only contains a single item. In normal use of the term, a factor
has more than one contributing item, so in this report the conclusion is that the MOS has two factors with one item
(Speaking Rate) not associated with either factor. Labeling factors is always a subjective exercise, but the factors do
appear to be consistent with the factors reported by Kraft and Portele (1995), with items 2-5 (Listening Effort,
Comprehension Problems, Speech Sound Articulation, Pronunciation) forming an Intelligibility factor and items 1 and
7 (Global Impression, Voice Pleasantness) forming a Naturalness factor.
Table 1. Three-Factor Varimax-Rotated Solution
FAC1 FAC2 FAC3
MOS1 0.327 0.900 0.194
MOS2 0.629 0.370 0.427
MOS3 0.693 0.104 0.358
MOS4 0.672 0.433 0.294
MOS5 0.746 0.437 0.139
MOS6 0.322 0.204 0.754
MOS7 0.182 0.665 0.139
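A solution of this kind can be produced with standard tools; the sketch below is an assumption about tooling, not the analysis actually used for Table 1. It fits a three-factor model with varimax rotation to a 73 x 7 matrix of MOS item ratings, requires a scikit-learn version that supports rotation in FactorAnalysis (0.24 or later), and will not reproduce the exact loadings above.

```python
# Illustrative sketch: three-factor varimax-rotated solution for a 73 x 7 MOS matrix.
# The file name and data are hypothetical placeholders.
import numpy as np
from sklearn.decomposition import FactorAnalysis

mos = np.loadtxt("mos_ratings.csv", delimiter=",")    # shape (73, 7): MOS1..MOS7 per listener
fa = FactorAnalysis(n_components=3, rotation="varimax").fit(mos)
loadings = fa.components_.T                           # rows = items, columns = FAC1..FAC3
for i, row in enumerate(loadings, start=1):
    print(f"MOS{i}: " + "  ".join(f"{x:+.3f}" for x in row))
```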
3.2. Reliability
Coefficient alpha for the overall MOS was 0.89, with 0.88 for the Intelligibility factor and 0.81 for the
Naturalness factor. (It isn't possible to compute coefficient alpha for a single item.) Thus, the reliability of the MOS
was acceptable. The reliability of the Naturalness subscale was somewhat lower than the Intelligibility subscale,
probably due to it only having two items.
3.3. Sensitivity
Overall MOS rating. A between-subjects one-way analysis of variance on the overall MOS rating was
statistically significant (F(4, 68) = 7.6, p = .00004). As expected, the recorded human voice (Wave) received the best
rating, followed by the concatenative and formant-synthesized voices respectively.
Analysis by factor. Figure 1 shows the relationship among the TTS systems in the database and the MOS
factors (including Speaking Rate). A mixed-factors analysis of variance indicated a significant main effect of System
(F(4, 68) = 9.6, p = .000003), a significant main effect of MOS Factor (F(2, 136) = 14.7, p = .000002), and a significant
System by Factor interaction (F(8, 136) = 3.1, p = .003).
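For readers who want to run this kind of sensitivity check on their own data, the sketch below performs a between-subjects one-way ANOVA on overall MOS by System using scipy; the per-system ratings are hypothetical, and the mixed-factors analysis reported above would require a dedicated repeated-measures routine.

```python
# Illustrative sketch: between-subjects one-way ANOVA on overall MOS by System.
# The per-system ratings below are hypothetical placeholders.
from scipy.stats import f_oneway

wave     = [4.6, 4.4, 4.8, 4.5]
concat1  = [3.9, 3.7, 4.1, 3.8]
concat2  = [3.6, 3.8, 3.5, 3.7]
formant1 = [2.9, 3.1, 2.8, 3.0]
formant2 = [2.7, 2.9, 2.6, 2.8]
F, p = f_oneway(wave, concat1, concat2, formant1, formant2)
print(f"F = {F:.2f}, p = {p:.5f}")
```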
3.4. Validity
Correlation with paired comparisons. Correlations computed among the final preference votes (paired
comparisons) of 16 listeners exposed to two distinctly different TTS systems and the mean difference scores for MOS
ratings for both systems indicated that the validity coefficients for overall MOS, Naturalness and Intelligibility were
significant (p < .10, r = .55, .49, and .46 respectively). The correlation between paired comparisons and Speaking Rate
was not significant (r = .36, p = .172).
Correlation with intelligibility scores from Wang and Lewis (2001). The only significant validity
coefficient was that for Intelligibility (r = -.43, p = .10), which indicates evidence for both convergent and divergent
validity. The evidence for convergent validity (having a significant relationship where expected) is the correlation
between the MOS Intelligibility factor and the overall intelligibility score from Wang and Lewis. The evidence for
divergent validity (failing to correlate significantly with scores hypothesized to tap into different constructs) is the
non-significant correlations between the overall intelligibility score and the other MOS measurements (Overall MOS:
r = -.38, p = .15; Naturalness: r = -.19, p = .48; Speaking Rate: r = -.26, p = .33).
Figure 1. Interaction of TTS System and MOS Factor. [Figure: mean ratings on the 1.0-5.0 MOS scale for each MOS factor (Intelligibility, Naturalness, Speaking Rate), plotted separately for the Wave, Concat1, Concat2, Formant1, and Formant2 systems.]
4. Discussion
The version of the MOS derived from Salza et al. (1996) seems to have reasonably good psychometric
properties. The factor analysis of the current data resulted in a factor structure similar to that of Kraft and Portele
(1995), specifically two factors (Intelligibility and Naturalness) and an unrelated item for Speaking Rate. The
reliability of the overall MOS and its subscales is acceptable. Furthermore, the data replicated the validity result of
Salza et al. by showing a significant correlation between paired comparison data and MOS data (Overall MOS,
Naturalness, and Intelligibility). The data also indicated appropriate convergent and divergent validity for the
intelligibility scores from Wang and Lewis (2001). Note that this result is similar to that reported by Johnston (1996),
who found that the Listening Effort item (which is part of the Intelligibility factor) was more sensitive to degradation
of speech intelligibility than the Global Impression item (which is part of the Naturalness factor).
Using principles from psychometrics (Nunnally, 1978), it should be possible to improve the reliability of the
MOS. Replacing the current 5-point scales, which have an anchor at each step, with 7-point bipolar scales should
slightly improve overall reliability. Because the Naturalness factor had somewhat weaker reliability than the
Intelligibility factor, it would be reasonable to add at least one more item to the MOS that is likely to tap into the
construct of Naturalness.
The MOS Speaking Rate item failed to fall onto either the Intelligibility or Naturalness factor in both the
current study and in Kraft and Portele (1995). This might have happened because Speaking Rate is truly independent
of either of these constructs, or might have been an artifact due to the unique labeling of the scale points for this
item. The other items have scales that have a clear ordinal pattern, such as "Excellent", "Good", "Fair", "Poor", and
"Bad" for the Global Impression item. The labels for the Speaking Rate item are, in contrast, "Yes", "Yes, but slower
than preferred", "Yes, but faster than preferred", "No, too slow", and "No, too fast", which do not have a clear top-
to-bottom ordinal relationship. If the item assessing Speaking Rate had the same structure as the other items in the
MOS, a future factor analysis could determine less ambiguously whether Speaking Rate is truly independent of
Intelligibility and Naturalness.
5. A Proposed New Version of the MOS
The key proposals are to increase the number of scale steps per item from five to seven (using bipolar
scales), to increase the number of items related to Naturalness, and to make the structure of the Speaking Rate item
consistent with the other items. These modifications should improve the reliability of the MOS and, by extension, its
other psychometric properties because reliability constrains the magnitude of validity coefficients (Nunnally, 1978)
and limits a scale's sensitivity. The summary version of the revised MOS presented here shows the text of the item
and its bipolar labels. (For the completely revised MOS and more details, see Lewis, 2001.)
1. Global Impression: Please rate the sound quality of the voice you heard. (Very Bad / Excellent)
2. Listening Effort: Please rate the degree of effort you had to make to understand the message. (Impossible Even
with Much Effort / No Effort Required)
3. Comprehension Problems: Were single words hard to understand? (All Words Hard to Understand / All Words
Easy to Understand)
4. Speech Sound Articulation: Were the speech sounds clearly distinguishable? (Not at All Clear / Very Clear)
5. Pronunciation: Did you notice any problems in the naturalness of sentence pronunciation? (Very Many
Problems / Didn't Notice Any)
6. Voice Pleasantness: Was the voice you heard pleasant to listen to? (Very Unpleasant / Very Pleasant)
7. Voice Naturalness: Did the voice sound natural? (Very Unnatural / Very Natural)
8. Ease of Listening: Would it be easy to listen to this voice for long periods of time? (Very Difficult / Very Easy)
9. Speaking Rate: Was the speed of delivery of the message appropriate? (Poor Rate of Speech / Perfect Rate of
Speech; with an additional sub-item: "If unsatisfactory, please circle one: Too Slow or Too Fast")
If the proposed changes work as expected, the revised MOS items 2-5 will continue to form an Intelligibility
factor. Items 1 and 6-8 should form a Naturalness factor with substantially greater reliability (possibly in excess of
.90) than the current Naturalness factor due to the additional items and the shift from five to seven scale steps. The
change in the structure of item 9 (formerly item 6) should make it possible to determine whether Speaking Rate is truly
independent of the other two factors without losing the ability to determine if a listener finds it too slow or fast.
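If the factor structure holds as expected, scoring the revised questionnaire is straightforward; the sketch below is an illustration (not part of the original proposal) that computes the Overall, Intelligibility, Naturalness, and Speaking Rate scores from one listener's 7-point responses.

```python
# Illustrative sketch: scoring the proposed nine-item, 7-point MOS.
# Intelligibility = items 2-5, Naturalness = items 1 and 6-8, Speaking Rate = item 9.
import numpy as np

responses = np.array([6, 5, 6, 6, 5, 6, 5, 6, 7])    # hypothetical answers to items 1..9

overall         = responses.mean()
intelligibility = responses[1:5].mean()              # items 2-5
naturalness     = responses[[0, 5, 6, 7]].mean()     # items 1, 6, 7, 8
speaking_rate   = responses[8]                       # item 9
print(overall, intelligibility, naturalness, speaking_rate)
```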
6. References
Coovert, M. D., & McNelis, K. (1988). Determining the number of common factors in factor analysis: A
review and program. Educational and Psychological Measurement, 48, 687-693.
Johnston, R. D. (1996). Beyond intelligibility: The performance of text-to-speech synthesisers. BT
Technology Journal, 14, 100-111.
Kraft, V., & Portele, T. (1995). Quality evaluation of five German speech synthesis systems. Acta Acustica,
3, 351-365.
Landauer, T. K. (1988). Research methods in human-computer interaction. In M. Helander (Ed.), Handbook
of human-computer interaction. New York: Elsevier.
Lewis, J. R. (1993). Multipoint scales: Mean and median differences and observed significance levels.
International Journal of Human-Computer Interaction, 5, 383-392.
Lewis, J. R. (2001). Psychometric properties of the Mean Opinion Scale (Tech. Report in press -- will be
available at http://sites.netscape.net/jrlewisinfl after publication).
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.
Paris, C. R., Thomas, M. H., Gilson, R. D., & Kincaid, J. P. (2000). Linguistic cues and memory for synthetic
and natural speech. Human Factors, 42, 421-431.
Salza, P. L., Foti, E., Nebbia, L., & Oreglia, M. (1996). MOS and pair comparison combined methods for
quality evaluation of text to speech systems. Acta Acustica, 82, 650-656.
Sonntag, G. P., Portele, T., Haas, F., & Kohler, J. (1999). Comparative evaluation of six German TTS systems.
In Eurospeech '99 (pp. 251-254). Budapest: Technical University of Budapest.
Wang, H., & Lewis, J. R. (2001). Intelligibility and acceptability of short phrases generated by text-to-speech
(to appear in the conference proceedings for Human-Computer Interaction International '01).
Yabuoka, H., Nakayama, T., Kitabayashi, Y., & Asakawa, Y. (2000). Investigations of independence of
distortion scales in objective evaluation of synthesized speech quality. Electronics and Communications in Japan,
Part 3, 83, 14-22.