
Web-Based Comparison of Two Styles of Auditory Presentation: All TTS versus Rapidly Mixed TTS and Recordings

Authors:
  • MeasuringU

Web-Based Comparison of Two Styles of Auditory Presentation: All TTS versus
Rapidly Mixed TTS and Recordings
TR 29.3684
September 26, 2003
James R. Lewis
Patrick M. Commarford
Cheryl A. Kotan
IBM User Centered Design
Pervasive Computing Division
Boca Raton, Florida
Abstract
A current controversy in the interactive voice response (IVR) community is whether and under which
conditions designers should use recorded audio when portions of the interface must be generated by
text-to-speech (TTS). The purpose of this study was to examine user preferences for an extreme
case: a prompt that incorporates multiple units of dynamic information in a single sentence. Two
groups of IBM employees listened to and compared two auditory styles of information presentation: all
information given by a single TTS voice, or recorded audio alternating with the TTS voice. The
groups listened to both presentation styles in counterbalanced order and then indicated their preference
and degree of preference. The number of respondents indicating a preference for the all TTS style was
significantly greater than the number indicating a preference for the mixture of recorded and TTS.
ITIRC Keywords
Artificial speech
Interactive voice response
IVR
Synthetic speech
Text-to-speech
TTS
Contents
Introduction
Participants
Stimuli
Procedure
Results
Preference Ratings
Strength of Preference
Participant Comments
Discussion
References
Appendix A. Full Survey
Appendix B. Comment Data
Introduction
There is some debate in the interactive voice response (IVR) community as to when designers should
and when they should not mix recorded audio and text-to-speech (TTS). Researchers (e.g.,
Schumacher, Hardzinski, & Schwartz, 1995; Balentine & Morgan, 2001) recommend using a
professional voice talent when possible and generally recommend against switching between
professional voices, unless there is a clear and purposeful reason for doing so (e.g., signaling different
modes; a help tutor; presentation in different languages). This, of course, suggests that designers also
should not switch between a professional voice and a TTS voice without a clear purpose and that a
professional voice is preferred to TTS. However, in many situations, it is either very expensive or
altogether infeasible to play all information by recorded audio files. For example, an IVR that reads
users their email messages must read these messages in a TTS voice. Also, many applications have the
need to play a wide range of dynamic data, including dates, digit strings, and times. It can be
extremely expensive and require a great deal of effort to record and appropriately concatenate audio
files to play whenever this information is required. Therefore, it is often necessary to play some
information in TTS. In these circumstances designers must decide when recorded audio should also
play (or whether recorded audio should play at all).
When, how, where, and to what degree designers should mix recorded audio and TTS is a topic that
has received surprisingly little investigation. Gong, Nass, Simard, & Takhteyev (2001) presented
participants with a set of seven experimental sentences; one group heard them in all TTS and
the other group heard them in mixed output (recorded audio and TTS). In the mixed condition, all
sentences began with recorded audio and ended with TTS. Participants rated the all TTS version
significantly more positively in terms of the three dependent measures (liking, trust, and perceived
competence). Gong & Lai (2001) conducted a similar investigation, using a Wizard of Oz
methodology and a voice access to email application. Participants completed eight tasks using this
system and, depending on the condition, were exposed to all TTS or a mixture of TTS and recorded
audio. The authors explained that, in the mixed condition, the dynamic content (e.g., the email header
and body) played in TTS. However, it is unclear whether mixing occurred within sentences or only
between sentences. Independent raters judged the all TTS group to outperform the mixed group on the
task scenarios; however, the all TTS group had to listen to messages and calendar listings
significantly more times to complete the tasks. Further, participants in the mixed condition judged
their own performance to be better and judged the system to be easier to use than those in the TTS
condition. In sum, the empirical research findings still are not able to provide clear guidance as to
when and how designers should mix recorded audio and TTS.
The practice of mixing recorded audio and TTS between sentences in a single application is widely
accepted and, in fact, necessary if one wishes to use recorded audio at all in an application that
mandates some TTS. In other words, if designers were to never mix, then all applications
incorporating unpredictable information would necessarily play in all TTS. Therefore, designers often
use TTS “when they have to.” For example, consider the following from a voice access to email
application:
You have one new urgent message. Hi Joe, please make sure to get me that TPS report as soon
as possible.
There is a great deal of agreement in the IVR community that, although the message body above (the
second sentence, italicized in the original) must be played in TTS, the initial sentence (normal font)
should be recorded. However, opinions differ
when IVR practitioners consider whether it is best to keep the mixing at the sentence level or to play
recorded audio for all stable data. For example, time and budget may mandate that the following
sentences use TTS for the italicized font:
1. We have accepted your payment and have charged your credit card a total of fifty-two
dollars and thirty-four cents.
2. Select: Office Space, Raising Arizona, or Rat Race.
3. You are now confirmed on Flight 1179 on August 29th, departing from Miami at 3:40
PM and arriving in New York at 7:00 PM.
The present study is a first step toward understanding user preferences when the use of TTS is
necessary in an interface. We examine user preferences for a very extreme case, similar to sample 3
above, in which the rendering of the information fluctuates rapidly between static and dynamic
information. If users prefer to listen to extreme cases like this in mixed format, we can conclude that,
as a best practice, designers should play recorded audio for all static information. However, if the all
TTS presentation is preferred to mixing for this situation, the results will necessitate further
investigation to determine under which conditions, if any, recorded audio should play when TTS is
necessary in the interface.
Method
Participants
We invited a total of 600 IBM¹ employees to participate. We sent 300 invitations directing
participants to a website that instructed them to listen to the mixed sample first, then the all TTS
sample. The other 300 invitations directed participants to a site that instructed them to listen to the all
TTS sample first, then the mixed sample. A total of 72 participants, 36 from each group, listened to
both samples and responded to the survey questions that followed.
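The invitation scheme above can be sketched in a few lines. This is only an illustration: the report does not say how invitees were assigned to the two presentation orders, so the random split, the `split_invitations` helper, and the placeholder addresses are all assumptions. The respondent count of 36 per group is taken from the text.

```python
import random

def split_invitations(invitees, seed=0):
    """Randomly split invitees into two equal-sized presentation-order groups.

    Random assignment is an assumption; the report only states that 300
    invitations pointed to each version of the web page.
    """
    shuffled = list(invitees)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return {"mixed_first": shuffled[:half], "all_tts_first": shuffled[half:]}

# Placeholder addresses, purely hypothetical.
invitees = [f"employee{i:03d}@example.com" for i in range(600)]
groups = split_invitations(invitees)

# 36 respondents per group, as reported in the text.
RESPONDENTS_PER_GROUP = 36
response_rate = {order: RESPONDENTS_PER_GROUP / len(members)
                 for order, members in groups.items()}
print(response_rate)
```

With 36 respondents out of 300 invitations in each group, the response rate works out to 12% per group (72 of 600 overall).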
Stimuli
We created two audio files, each speaking the same detailed breakdown of rates, fees, and taxes for a
fictional car rental. We counterbalanced the order of presentation to control for order effects. Each
audio sample played the following text at a presentation rate of approximately 100 words per minute:
Ok, for a mid-sized non-smoking car, the rate is guaranteed for one week at $39.99 per day
minus ten percent through AAA, which lowers the rate to $35.99 per day, with no drop charge.
There is a 7 percent tax and a $1.95 per day vehicle licensing fee, for an approximate daily total
of $39, with unlimited mileage.
We used the most recent version of the IBM concatenative male voice (IBM WebSphere² Voice Server
for Multiplatforms V3.1.1 Third-Generation Concatenative Text-to-Speech) to produce the entire
sample for the all TTS condition. For the mixed condition, the TTS voice generated the dynamic
information (indicated by italicized text) and the remaining information consisted of recorded human
audio from a male speaker. (The speaker was not a professional voice talent, but his voice was
generally perceived as clear and pleasant.)
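To make the two conditions concrete, the sketch below models a prompt as a sequence of static and dynamic segments and derives a playback plan for each style. The italics that marked dynamic text did not survive in this copy, so the segment boundaries shown for the first clause are a guess for illustration only. `Segment`, `render_plan`, and `voice_switches` are hypothetical names, not part of any IBM product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    text: str
    dynamic: bool  # True if the value varies per call and must be synthesized

# Assumed split of the first clause of the stimulus (illustrative only).
PROMPT = [
    Segment("Ok, for a", False),
    Segment("mid-sized non-smoking", True),
    Segment("car, the rate is guaranteed for", False),
    Segment("one week", True),
    Segment("at", False),
    Segment("$39.99", True),
    Segment("per day", False),
]

def render_plan(segments, style):
    """Return (source, text) chunks, merging neighbours that share a source.

    style is "all_tts" (everything synthesized) or "mixed" (static text from
    recordings, dynamic text from TTS).
    """
    plan = []
    for seg in segments:
        source = "tts" if style == "all_tts" or seg.dynamic else "recorded"
        if plan and plan[-1][0] == source:
            plan[-1] = (source, plan[-1][1] + " " + seg.text)
        else:
            plan.append((source, seg.text))
    return plan

def voice_switches(plan):
    """Number of transitions between recorded audio and TTS in a plan."""
    return max(len(plan) - 1, 0)

for style in ("all_tts", "mixed"):
    plan = render_plan(PROMPT, style)
    print(style, voice_switches(plan), plan)
```

Under this assumed segmentation, the mixed plan changes voice six times within a single clause, while the all TTS plan never switches. That rapid alternation is the property the study set out to test.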
Procedure
We sent email messages inviting employees to participate in the study and directing them to a web
page with instructions, links to both samples, and a set of survey questions. After accessing the web
page, participants clicked on the first link, allowing them to listen to the first sample file. The file
played on whichever audio player application the participant’s desktop computer system had set as the
default. Participants repeated this process for the second sample, and then answered the survey
questions.
After listening to the samples, participants indicated which sample they preferred and the strength of
their preference on a 7-point scale. Participants then provided comments about what they specifically
liked or disliked about each sample (see Appendix A for the full survey).
¹ IBM is a registered trademark of International Business Machines Corp.
² WebSphere is a registered trademark of International Business Machines Corp.
Results
Preference Ratings
A total of 72 participants responded to the survey. Fifty-three participants (73.6%) indicated that they
preferred the all TTS presentation style and 19 (26.4%) indicated a preference for the mixed
presentation style. A one-sample t-test indicated that this difference was statistically significant [t(71)
= 14.07; p < .0005]. Further, an independent-samples t-test indicated a significant main effect of order
of presentation [t(70) = 3.09; p < .0005] such that participants were more likely to prefer the second
presentation style to which they were exposed (see Table 1). Those who listened to the mixed style
first were more likely to prefer all TTS (89% preferred all TTS) than those who listened to all TTS first
(58% preferred all TTS). Table 1 provides the observed proportion of respondents who preferred all
TTS and the proportion that preferred the second style presented, as well as the 90% binomial
confidence intervals.
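The counts behind these results can be checked with a short script. The sketch below makes two assumptions that are not stated in the report: the per-group counts of 32 of 36 and 21 of 36 are inferred from the reported 89% and 58% (they sum to the reported 53), and preferences are coded 0/1 for the t statistic. The exact binomial test is offered as one alternative way to examine the 53-versus-19 split. It is not the analysis the authors ran.

```python
from math import comb, sqrt

def binomial_two_sided_p(successes, n, p0=0.5):
    """Exact two-sided binomial test: total probability of all outcomes
    that are no more likely than the observed one under p0."""
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    observed = pmf[successes]
    return sum(p for p in pmf if p <= observed * (1 + 1e-12))

def pooled_t(successes_a, n_a, successes_b, n_b):
    """Independent-samples t statistic (pooled variance) on 0/1-coded data."""
    mean_a, mean_b = successes_a / n_a, successes_b / n_b
    # Sample variance of 0/1 data is n/(n-1) * p * (1-p).
    var_a = n_a / (n_a - 1) * mean_a * (1 - mean_a)
    var_b = n_b / (n_b - 1) * mean_b * (1 - mean_b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    return (mean_a - mean_b) / sqrt(pooled * (1 / n_a + 1 / n_b))

# 53 of 72 respondents preferred all TTS (from the text).
p_pref = binomial_two_sided_p(53, 72)

# Inferred counts: 32/36 preferred all TTS after hearing mixed first,
# 21/36 after hearing all TTS first.
t_order = pooled_t(32, 36, 21, 36)

print(f"exact binomial p = {p_pref:.2g}")
print(f"order-effect t(70) = {t_order:.2f}")
```

Under these assumptions the order-effect statistic comes out close to the reported t(70) = 3.09, and the exact binomial test also finds the overall preference for all TTS unlikely under a 50/50 split.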
Table 1
Proportion who preferred all TTS and proportion who preferred the second presentation style

                              Observed   90% Lower Bound   90% Upper Bound
Preferred All TTS             73.6%      64%               82%
Preferred 2nd Presentation    65.3%      55%               75%
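The report does not say which binomial interval method produced the bounds in Table 1. As a rough cross-check, the sketch below computes 90% Clopper-Pearson ("exact") intervals, one common choice, using only the standard library. The count of 47 of 72 for the second row is inferred from the reported 65.3%; `clopper_pearson` is an illustrative helper, not the authors' procedure.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, conf=0.90):
    """Two-sided Clopper-Pearson interval for a binomial proportion.

    Each bound is found by bisection, using the fact that the binomial
    CDF is a decreasing function of p.
    """
    alpha = 1 - conf

    def solve(cdf_at, target):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

tts_low, tts_high = clopper_pearson(53, 72)        # preferred all TTS
second_low, second_high = clopper_pearson(47, 72)  # preferred 2nd style (inferred count)

print(f"Preferred all TTS:  {53/72:.1%}  [{tts_low:.1%}, {tts_high:.1%}]")
print(f"Preferred 2nd:      {47/72:.1%}  [{second_low:.1%}, {second_high:.1%}]")
```

The printed bounds can be compared against Table 1; small differences would be expected if the authors used a different interval method.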
Strength of Preference
Overall, respondents indicated a fairly high preference for one style or the other (mean preference
strength = 4.85 on a 7-point scale), but there was no significant difference in the strength of preference
for each style [M_diff = 0.72; t(70) = 1.61; p = .11].
Participant Comments
Sixty-nine participants commented on each audio sample and some of the participants commented on
more than one aspect of the sample. In total, the mixed sample received 78 comments, and the all-TTS
sample received 90 (see Appendix B).
Consistent with the preference ratings, the all-TTS sample received nearly twice as many positive
comments as the mixed sample (41 versus 22; see Appendix B). Participants gave positive comments about the consistency of using
just one voice (22.2% of the total number of all-TTS comments), and the smooth flow of the
information (12.2%). Survey respondents indicated that they did not like the all-TTS sample’s
mechanical sound (25.6%) or lack of inflection (14.4%).
The most commonly commented upon negative trait of the mixed sample was the way the disjointed
transitions between the voices distracted users from the information content (30.8%). Participants also
complained about the large contrast between the sound of the recorded and TTS voices (18.0%) and
the mechanical sound of the sample (9.0%). Positive comments about the mixed presentation style
included the pleasant sound of the recorded voice (9.0%) and the way the switch from one voice to the
other distinguished the prices from the other information (7.7%). Interestingly, these participants felt
that the switch from recorded to TTS called attention to the content, while four times as many claimed
the switch distracted them from the content (see above).
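The percentages quoted in this section are each category's share of all comments made about that sample. A minimal sketch of that arithmetic follows, using counts transcribed from Appendix B for a few of the categories; the dictionary names and the `share` helper are illustrative.

```python
# Comment totals and selected category counts transcribed from Appendix B.
MIXED_TOTAL = 78
ALL_TTS_TOTAL = 90

mixed = {
    "Disjointed Transitions": 24,
    "Mechanical": 7,
    "Recorded": 7,
    "Info Emphasis": 6,
}
all_tts = {
    "Mechanical": 23,
    "Consistency": 20,
    "No Inflection": 13,
    "Smooth Flow": 11,
}

def share(count, total):
    """Percentage of all comments for a sample, rounded to one decimal place."""
    return round(100 * count / total, 1)

mixed_pct = {name: share(n, MIXED_TOTAL) for name, n in mixed.items()}
all_tts_pct = {name: share(n, ALL_TTS_TOTAL) for name, n in all_tts.items()}

# How many "distracting" comments there were per "emphasis" comment.
distract_to_emphasis = mixed["Disjointed Transitions"] / mixed["Info Emphasis"]

print(mixed_pct)
print(all_tts_pct)
print(distract_to_emphasis)
```

The ratio of 24 to 6 is the basis for the "four times as many" comparison between participants who found the voice switch distracting and those who found it helpful.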
The order of presentation also appeared to affect the comments. Participants who listened to the all-
TTS sample and then heard the mixed sample focused more on the mixed sample’s information
emphasis quality and did not comment as much on the mixed sample’s disjointed transitions or the
TTS sample’s smooth flow. Participants who heard the mixed sample and then listened to the all-TTS
sample seemed to perceive the transitions between recorded and TTS voices in the mixed sample as
distracting from the content, rather than emphasizing it. They remarked on the smooth flow of the all-TTS
sample and complimented the recorded portion of the mixed sample more than those participants
who heard the all-TTS sample first (see Appendix B).
Regardless of the order of presentation, participants disliked the degree of contrast between the voices
in the mixed sample and liked the consistency of the all-TTS sample. They did not like the lack of
inflection in the TTS voice or the mechanical sound of both the mixed and all-TTS samples.
Discussion
Despite the influence of an order-of-presentation effect, the overall listener preference in this study
favored an all-TTS presentation over the rapid mixing of TTS and recorded speech. The preferred
presentation format did not affect participants’ rated strength of preference, which was relatively high.
This indicates that presenting the type of information investigated in this study with an all-TTS format
will be generally better than using a mixed presentation, but will be annoying to a substantial minority.
Participants who preferred the all-TTS presentation cited the consistency of using one voice and the
smooth flow of the information. Participants who preferred the mixed presentation attributed their
preference to the pleasant sound of the recorded voice and the way the voice-switching helped them to
distinguish the pieces of key information in the message.
Due to the significant preference observed in this study for an all-TTS presentation of this type of
information, we recommend using all TTS when the alternative would require rapid switching between
recorded and TTS voices in a single message.
We’ve placed a number of constraints on the above recommendation because user preference for styles
of auditory presentation of messages might depend on a number of factors that have not yet undergone
systematic investigation, such as the context in which the message appears, the rate of switching
between TTS and recorded voice, and the length of a recorded portion that precedes a TTS portion.
These factors suggest fertile ground for future studies. For example, would the results of the current
study change if participants had heard the message embedded in a context of all recorded prompts and
messages?
Also, consider the following messages (italicized text indicates variable information that must be
presented using TTS):
Select: x, y, or z.
Your rate for this vehicle will be $34.97.
For the first example, given that the lead-in to the menu consists of only one word (“Select”), would
users prefer to hear that word as recorded speech or in the same TTS voice as the menu items? In the
second example, the lead-in to the TTS portion is longer. In this case, would users prefer to hear the
lead-in phrase as recorded speech or TTS? We hope to address these questions in future research.
References
Balentine, B., & Morgan, D. P. (2001). How to build a speech recognition application: A style guide
for telephony dialogs (2nd ed.). San Ramon, CA: Enterprise Integration Group.
Gong, L., & Lai, J. (2001). Shall we mix synthetic speech and human speech? Impact on users’
performance, perception, and attitude. In Proceedings of the ACM Conference on Human Factors in
Computing Systems - CHI '01 (pp. 158-165). New York, NY: ACM.
Gong, L., Nass, C., Simard, C., & Takhteyev, Y. (2001). When non-human is better than semi-human:
Consistency in speech interfaces. In M. J. Smith, G. Salvendy, D. Harris, & R. J. Koubek (Eds.),
Usability evaluation and interface design: Cognitive engineering, intelligent agents, and virtual
reality (pp. 390-394). Mahwah, NJ: Erlbaum.
Schumacher, R. M., Jr., Hardzinski, M. L., & Schwartz, A. L. (1995). Increasing the usability of
interactive voice response systems: Research and guidelines for phone-based interfaces. Human
Factors, 37(2), 251-264.
Appendix A. Full Survey
Auditory Presentation of Rental Rate Information
We are investigating different ways to present auditory information to users in interactive voice
response systems. Please click the links below and listen to the two audio files (Sample 1 and
Sample 2), then answer the questions:
Play Sample 1
Play Sample 2
Appendix B. Comment Data
Mixed Sample

Mixed Sample                        Number of Comments   % of Total Comments
Positive Comments
  Info Emphasis                      6                     7.69%
  Recorded                           7                     8.97%
  Request for All Recorded           5                     6.41%
  Natural                            4                     5.13%
  Total Positive                    22                    28.21%
Negative Comments
  Disjointed Transitions            24                    30.77%
  Contrast in Voices                14                    17.95%
  Mechanical                         7                     8.97%
  Multiple Voices                    3                     3.85%
  Not Natural                        2                     2.56%
  Inconsistent                       2                     2.56%
  Total Negative                    52                    66.67%
Feeling-Based Comments
  Not Professional                   1                     1.28%
  Less Personal                      1                     1.28%
  Softer                             1                     1.28%
  Longer                             1                     1.28%
  Total Feeling                      4                     5.13%
Total Comments                      78                   100.00%
Total Participants Who Commented    69
No Comment                           4
All-TTS Sample

All-TTS Sample                      Number of Comments   % of Total Comments
Positive Comments
  Consistency                       20                    22.22%
  Smooth Flow                       11                    12.22%
  Natural                            3                     3.33%
  Understandable                     3                     3.33%
  Less Mechanical                    1                     1.11%
  Human                              1                     1.11%
  Request for All Recorded           1                     1.11%
  Shorter                            1                     1.11%
  Total Positive                    41                    45.56%
Negative Comments
  Mechanical                        23                    25.56%
  No Inflection                     13                    14.44%
  Poor Annunciation                  3                     3.33%
  Low Info Emphasis                  2                     2.22%
  Not Natural                        1                     1.11%
  Total Negative                    42                    46.67%
Feeling-Based Comments
  Slower                             3                     3.33%
  Longer                             1                     1.11%
  Harsh                              1                     1.11%
  Uncaring/Unfeeling                 1                     1.11%
  "Carlos sounding"                  1                     1.11%
  Total Feeling                      7                     7.78%
Total Comments                      90                   100.00%
Total Participants Who Commented    69
No Comment                           4
Order Effects
Mixed Comments
                              Mixed 1st      Mixed 2nd
Positive Comment              Occurrences    Occurrences    Total
  Info Emphasis                1              5              6
  Recorded                     5              2              7
  Request for All Recorded     2              3              5
  Natural                      0              1              1
Negative Comment              Occurrences    Occurrences    Total
  Disjointed Transitions      19              5             24
  Contrast in Voices           8              6             14
  Mechanical                   4              3              7
  Multiple Voices              3              0              3
  Not Natural                  2              0              2
  Inconsistent                 0              2              2

All-TTS Comments
                              All-TTS 1st    All-TTS 2nd
Positive Comment              Occurrences    Occurrences    Total
  Consistency                  9             11             20
  Smooth Flow                  3              8             11
  Natural                      2              1              3
  Understandable               1              2              3
  Less Mechanical              0              1              1
  Human                        0              1              1
  Request for All Recorded     0              1              1
  Shorter                      0              1              1
Negative Comment              Occurrences    Occurrences    Total
  Mechanical                  11             12             23
  No Inflection                8              5             13
  Poor Annunciation            0              3              3
  Low Info Emphasis            1              1              2
  Not Natural                  0              1              1
  Slower                       1              2              3