Figure 2 - uploaded by James R. Lewis
Voice by Factor Interaction (Acceptability)

Source publication
Conference Paper
We investigated the intelligibility and acceptability of three formant text-to-speech (TTS) engines suitable for use in devices with embedded speech recognition capability. Listeners transcribed and rated recordings of short phrases from four text domains (U.S. currency, dates, digits and proper names) produced by three commercially-available embed...

Context in source publication

Context 1
... Figure 2 shows the Voice by Factor interaction. Decomposing the overall MOS into its component factors (Intelligibility, Naturalness, and Speaking Rate) revealed a difference in profile between the artificial and human voices. ...

Citations

... For the voice samples, I went through all the recordings of synthetic voices I had worked with from 2000 to 2017 (Lewis, 2001a-b, 2002, 2004; Polkosky & Lewis, 2002a-c). Previous research has shown that listeners make their quality judgements quickly (Polkosky & Lewis, 2002c; Wang & Lewis, 2001), so the samples were edited to have a length of about 30 seconds. In addition to the synthetic voices, I also put together three samples of professional human voice talents who had either recorded segments for an interactive voice response system or had provided recordings for the production of a synthetic voice. ...
Article
The objectives of this research were to (1) evaluate and compare two versions of the expanded Mean Opinion Scale, one using the original 15 items (MOS-X) and the other a four-item version with one item for each of the four factors of the MOS-X (the MOS-X2), and (2) establish preliminary benchmarks for the interpretation of ratings collected using these questionnaires. Respondents (n = 865) provided ratings for 56 thirty-second recordings of speech samples: 53 from recordings of synthetic voices made from 2001 through 2017 and three from professional human voice talents. Both questionnaires had acceptable psychometric quality (reliability and validity), but the factor structure of the MOS-X did not exactly match the expected structure. The MOS-X2 had a stronger statistical relationship to outcome metrics of Likelihood-to-Recommend (LTR) and Overall Quality than the MOS-X. The very old samples (those using technologies from 2001-2002) received consistently poor ratings. A few of the synthetic voice samples came close to the ratings given to the professional human voice talents. Either questionnaire version is acceptable for use, but due to its stronger statistical relationship to the key outcome metrics of LTR and Overall Quality and its shorter length, it is more effective and efficient to use the MOS-X2. The mean MOS-X and MOS-X2 for the ratings of the professional human voices were both about 85 (after conversion to a 0-100 point scale), so a synthetic voice with mean ratings at or approaching 85 would be very good. Ratings over 70 are, relative to the set of voice samples in this study, above average.
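The conversion to a 0-100 point scale and the two benchmarks mentioned in this abstract can be illustrated with a simple linear rescaling. This is only a sketch: the abstract does not state the original response range of the items, so the default endpoints (0 and 10) and the function names `to_percent` and `interpret` are assumptions made for illustration. Only the thresholds of 85 and 70 come from the abstract.

```python
def to_percent(mean_rating, scale_min=0.0, scale_max=10.0):
    """Linearly rescale a mean rating to a 0-100 point scale.

    scale_min and scale_max are ASSUMED endpoints of the original
    response scale; the abstract does not specify them.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    if not scale_min <= mean_rating <= scale_max:
        raise ValueError("mean_rating lies outside the response scale")
    return 100.0 * (mean_rating - scale_min) / (scale_max - scale_min)


def interpret(score):
    """Rough reading of a 0-100 score against the abstract's benchmarks."""
    if score >= 85:
        # About 85 was the mean for the professional human voices.
        return "at the level of the professional human voices"
    if score > 70:
        # Over 70 was above average relative to the study's sample set.
        return "above average for this sample set"
    return "not above average for this sample set"


if __name__ == "__main__":
    for rating in (8.5, 7.4, 5.0):
        score = to_percent(rating)
        print(rating, score, interpret(score))
```

The benchmarks are relative to the 56 samples in that study, so the labels returned by `interpret` should be read as a comparison with that sample set rather than as absolute quality grades.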
... The items of the standard MOS align with two factors: intelligibility and naturalness (Lewis, 2001). Previous research indicated a significant correlation between the MOS intelligibility scale and more direct measures of intelligibility (Wang and Lewis, 2001). The MOS-X has items that align with these traditional factors as well as new factors for Prosody and Social Impression (see Appendix A). ...
Conference Paper
The MOS-X is a recently-developed questionnaire used to evaluate the quality of artificial speech. In this experiment, participants listened to audio files produced by concatenative text-to-speech voices for the purpose of assessing the effect of Speaker and Sampling Rate on MOS-X ratings. The concatenative voices were developed from recordings of three different human speakers (code named AF, AM, and B) and produced using two different sampling rates (8 kHz and 22 kHz). Six independent groups of raters participated, one group for each combination of speaker and sampling rate. Analyses of variance indicated a significant main effect of Voice, but no significant main effect of Sampling Rate and no significant Voice by Sampling Rate interaction. The results indicate that independent groups of raters are sensitive to speaker differences in concatenative text-to-speech (TTS) voices, but not to differences in these sampling rates.
... Over the period 1999 to 2001, we conducted a number of experiments in which participants completed the standard MOS. In some of these experiments, we also collected paired-comparison data and, in the most recent (Wang and Lewis, 2001), we collected intelligibility scores. Participants in these experiments have included in approximately equal numbers, males and females, persons older and younger than 40 years old, and IBM and non-IBM employees. ...
... Re-analysis of our data from the previous studies in which listeners provided both MOS ratings and paired comparisons allowed us to replicate the finding of Salza et al. (1996) that MOS ratings correlate significantly with paired comparisons. Data from Wang and Lewis (2001) provided an opportunity to investigate the correlation between MOS ratings and intelligibility scores. In that experiment, listeners heard a variety of types of short phrases produced by four TTS voices, with the task to write down what the voice was saying. ...
... Additional data from Wang and Lewis (2001) indicated a marginally significant correlation between intelligibility scores from listener transcriptions and their MOS ratings of Intelligibility (r(14) = .43, p = .10), ...
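As a check on how a correlation of r(14) = .43 relates to p = .10, the sketch below converts r to a t statistic and computes a two-tailed p-value. It assumes the 14 in parentheses is the degrees of freedom (the usual reporting convention) and that the reported p is two-tailed; neither is stated in the excerpt. The function names are illustrative, and the p-value is obtained by numerically integrating the Student t density with Simpson's rule rather than by calling a statistics library.

```python
import math


def t_from_r(r, df):
    """t statistic for a Pearson correlation r with df degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the probability mass in [-|t|, |t|].

    The mass in [0, |t|] is integrated with Simpson's rule
    (steps must be even) and doubled by symmetry.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_half)


if __name__ == "__main__":
    t = t_from_r(0.43, 14)   # ASSUMES df = 14
    p = two_tailed_p(t, 14)  # ASSUMES a two-tailed test
    print(f"t = {t:.3f}, two-tailed p = {p:.3f}")
```

Under these assumptions the t statistic is about 1.78 and the two-tailed p-value falls slightly below .10, which is consistent with the reported p = .10 after rounding.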
Article
The Mean Opinion Scale (MOS) is a questionnaire used to obtain listeners' subjective assessments of synthetic speech. This paper documents the motivation, method, and results of six experiments conducted from 1999 to 2002 that investigated the psychometric properties of the MOS and expanded the range of speech characteristics it evaluates. Our initial experiments documented the reliability, validity, sensitivity, and factor structure of the P.L. Salza et al. (Acta Acustica, Vol. 82, pp. 650–656, 1996) MOS and used psychometric principles to revise and improve the scale. This work resulted in the MOS-Revised (MOS-R). Four subsequent experiments expanded the MOS-R beyond its previous focus on Intelligibility and Naturalness, to include measurement of the Prosody and Social Impression of synthetic voices. As a result of this work, we created the MOS-Expanded (MOS-X), a rating scale shown to be reliable, valid, and sensitive for high-quality evaluation of synthetic speech in applied industrial settings.
... Over the last two years we have conducted a number of experiments in which participants have completed the MOS. In some of these experiments we have also collected paired-comparison data and, in the most recent (Wang & Lewis, 2001), we also collected intelligibility scores. Participants in these experiments have included in approximately equal numbers, males and females, persons older and younger than 40 years old, and IBM and non-IBM employees. ...
... Relationship to intelligibility scores. Data from Wang and Lewis (2001) provided an opportunity to investigate the correlation between MOS ratings and intelligibility scores. In that experiment, listeners heard a variety of types of short phrases produced by four TTS voices, with the task to write down what the voice was saying. ...
... Correlation with intelligibility scores from Wang and Lewis (2001). The only significant validity coefficient was that for Intelligibility (r = -.43, ...
Conference Paper
The Mean Opinion Scale (MOS) is a seven-item questionnaire used to evaluate speech quality. Analysis of existing data revealed (1) two MOS factors (Intelligibility and Naturalness, plus a single independent Rate item), (2) good reliability for Overall MOS and for subscales based on the Intelligibility and Naturalness factors, (3) appropriate sensitivity of MOS factors, (4) validity of MOS factors related to paired comparisons, and (5) validity of MOS Intelligibility related to intelligibility scores. In conclusion, the current MOS has acceptable psychometric properties, but adding items to the Naturalness scale and increasing the number of scale steps from five to seven should improve its reliability.