99 15-20 MAY 1999
Patterns of Entry and Correction in Large Vocabulary
Continuous Speech Recognition Systems
Clare-Marie Karat, Christine Halverson, Daniel Horn*, and John Karat
IBM T.J. Watson Research Center
30 Saw Mill River Road
Hawthorne, NY 10532 USA
+1 914 784 7612
ckarat, halve, jkarat@us.ibm.com
ABSTRACT
A study was conducted to evaluate user performance and
satisfaction in completion of a set of text creation tasks
using three commercially available continuous speech
recognition systems. The study also compared user
performance on similar tasks using keyboard input. One
part of the study (Initial Use) involved 24 users who
enrolled, received training and carried out practice tasks,
and then completed a set of transcription and composition
tasks in a single session. In a parallel effort (Extended Use),
four researchers used speech recognition to carry out real
work tasks over 10 sessions with each of the three speech
recognition software products. This paper presents results
from the Initial Use phase of the study along with some
preliminary results from the Extended Use phase. We
present details of the kinds of usability and system design
problems likely in current systems and several common
patterns of error correction that we found.
Keywords
Speech recognition, input techniques, speech user
interfaces, analysis methods
INTRODUCTION
Automatic speech recognition (ASR) technology has been
under development for over 25 years, with considerable
resources devoted to developing systems which can
translate speech input into character strings or commands.
We are just beginning to see fairly wide application of the
technology. Though the technology may not have gained
wide acceptance at this time, industry and research seem
committed to improving the technology to the point that it
becomes acceptable. While speech may not replace other
input modalities, it may prove to be a very powerful means
of human-computer communication.
*University of Michigan
Collaboratory for Research on Electronic Work
701 Tappan Street, Room C2420
Ann Arbor, MI 48109-1234
danhom@umich.edu

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
CHI '99 Pittsburgh PA USA
Copyright ACM 1999 0-201-48559-1/99/05...$5.00

However, there are some fundamental factors to keep in
mind when considering the value of ASR and how rapidly
and widely it will spread. First, speech recognition
technology involves errors that are fundamentally different
from user errors with other input techniques [12]. When
users press keys on a keyboard, they can feel quite certain
of the result. When users say words to an ASR system, they
may experience system errors - errors in which the system
output does not match their input - that they do not
experience with other devices. Imagine how user behavior
might be different if keyboards occasionally entered a
random letter whenever you typed the 'a' key. While there
is ongoing development of speech recognition technology
aimed at lowering error rates, we cannot expect the sort of
error free system behavior we experience with keyboards in
the near future. How we go from an acoustic signal to some
useful translation of the signal remains technically
challenging, and error rates in the 1-5% range are the best
anyone should hope for.
Second, while we like to think that speech is a natural form
of communication [1,9], it is misleading to think that this
means that it is easy to build interfaces that will provide a
natural interaction with a non-human machine [10]. While
having no difference between human-human and human-
computer communication might be a laudable goal, it is not
one likely to be attainable in the near future. Context aids
human understanding in ways that are not possible with
machines (though there are ongoing efforts to provide
machines with broad contextual and social knowledge) [7].
A great deal of the ease we take for granted in verbal
communication goes away when the listener doesn’t
understand the meaning of what we say.
Finally, we argue that it takes time and practice to develop a
new form of interaction [4,6]. Speech user interfaces
(SUIs) will evolve as we learn about problems users face
with current designs and work to remedy them. The
systems described in this paper represent the state-of-the-art
in large vocabulary speech recognition systems. They
provide for continuous speech recognition (as opposed to
isolated word recognition), require speaker training for
acceptable performance, and have techniques for
distinguishing commands from dictation.
Text Creation and Error Correction
We are particularly interested in text creation by knowledge
workers - individuals who “solve problems and generate
outputs largely by resort to structures internal to themselves
rather than by resort to external rules or procedures [5].”
Text - in the form of reports or communication with others
- is an important part of this output. While formal business
communications used to pass through a handwritten stage
before being committed to a typed document, this seems to
be becoming less frequent. Knowledge workers who used to
rely on secretarial help are now more likely to produce their
own text by directly entering it into a word processor. We
do not have a clear picture of how changes in the processes
of text creation have impacted the quality of the resulting
text, even though it seems that much of the text produced by
knowledge workers - from newspaper articles to academic
papers - is now created in an electronic form.
Efforts to develop new input technologies continue. ASR is
clearly one of the promising technologies. We do not know
how a change in modality of entry might impact the way in
which people create text. For example, does voice entry
affect the composition process? There is some suggestion
that it does not impact composition quality [3,8]. Have
people learned to view keyboards as “more natural” forms
of communication with systems? While people can
certainly dictate text faster than they can type, throughput
with ASR systems is generally slower. Measures which
include the time to make corrections favor keyboard-mouse
input over speech - partially because error correction takes
longer with speech. Some attempts have been made to
address this in current systems, but the jury is still out on
how successful such efforts have been.
Error detection and correction is an important arena in
which to examine modality differences. For keyboard-
mouse entry there are at least two ways in which someone
might be viewed as making an error. One can mistype
something - actually pressing one sequence of keys when
one intended to enter another. Such user errors can be
detected and corrected either immediately after they were
made, within a few words of entry, during a proofreading of
the text, or not at all. Another error is one of intent,
requiring editing the text. In both cases, correction can be
made by backspacing and retyping, by selecting the
incorrect text and retyping, or by dialog techniques
generally available in word processing systems such as
Find/Replace or Spell Checking. While we do not have a
clear picture of the proportion of use of the various
techniques available, our observations suggest that all are
used to some extent by experienced computer users.
There are some parallels for error correction in ASR
systems. By monitoring the recognized text, users can
correct misrecognitions with a speech command equivalent
of “backspacing” (current systems generally have several
variations of a command that remove the most recently
recognized text - such as SCRATCH or UNDO). There are
ways of selecting text (generally by saying the command
SELECT and the string to be located), after which
redictating will replace the selected text with newly
recognized text. Additionally, correction dialogs provide
users with a means of selecting a different choice from a list
of possible alternatives or entering a correction by spelling
it. These different correction mechanisms provide a range
of techniques that map well to keyboard-mouse techniques.
However, we do not have evidence of how efficient or
effective they are. This study was designed to answer these
questions. We were interested in several comparisons -
keyboard and speech for text entry, modality effects on
transcription and composition tasks, and error correction in
different modalities.
SYSTEMS
Three commercially available large vocabulary continuous
speech recognition systems were used in this study. All
were shipped as products in 1998. These systems were
IBM ViaVoice 98 Executive, Dragon Naturally Speaking
Preferred 2.0, and L&H Voice Xpress Plus (referred to as
IBM, Dragon and L&H below). While the products are all
different in significant ways, they share a number of
important features that distinguish them from earlier ASR
products. First, they all recognize continuous speech.
Earlier versions required users to dictate using pauses
between words. Second, all have integrated command
recognition into the dictation so that the user does not need
to explicitly identify an utterance as text or command. In
general, the systems provide the user with a command
grammar (a list of specific command phrases), along with
some mechanism for entering the commands as text.
Commands can be entered as text by having the user alter
the rate at which the phrase is dictated - pausing between
words causes a phrase to be recognized as text rather than
as a command.
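The pause heuristic just described can be sketched as follows. This is a toy illustration, not any of the three products' actual implementation; the command grammar and pause threshold are invented:

```python
# Hypothetical command grammar and pause threshold, for illustration only;
# the real products ship far richer grammars.
COMMAND_GRAMMAR = {"scratch that", "undo that", "new paragraph"}
PAUSE_THRESHOLD = 0.25  # seconds of silence between words

def interpret(words, gaps):
    """Classify a recognized utterance as a command or as dictated text.

    `words` is the recognized word sequence and `gaps` lists the silence
    (in seconds) between consecutive words.  Pausing between words forces
    the phrase to be taken as text; an unpaused phrase matching the
    command grammar is treated as a command.
    """
    phrase = " ".join(words).lower()
    paused = any(g > PAUSE_THRESHOLD for g in gaps)
    if not paused and phrase in COMMAND_GRAMMAR:
        return "command"
    return "text"

print(interpret(["scratch", "that"], [0.05]))  # command
print(interpret(["scratch", "that"], [0.60]))  # text (deliberate pauses)
```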
While all of the systems function without specific training
of a user’s voice, we found the speaker independent
recognition performance insufficiently accurate for the
purposes of our study. To improve recognition
performance, we had all users carry out speaker enrollment
- the process of reading a body of text to the system and
then having the system develop a speaker-specific speech
model. All products require a 133-166MHz Pentium
processor machine with 32MB RAM - we ran our study
using 200MHz machines with 64MB RAM.
METHOD
There were different procedures used for the Initial Use and
the Extended Use subjects in the study. Although the
design of the Initial Use study was constructed to allow for
statistical comparisons between the three systems, we report
on general patterns observed across the systems as they are
of more general interest to the design of successful ASR
systems.
Initial Use
Subjects in the Initial Use study were 24 employees of IBM
in the New York metropolitan area who were knowledge
workers. All were native English speakers and experienced
computer users with good typing skills. Half of the subjects
were male and half were female, with gender balanced
across the conditions in the study. The age range of the
subjects was from 20 to 55 years old. An effort was made
to balance the ages of the subjects in the various conditions.
Each subject was assigned to one of three speech
recognition products, IBM, Dragon, or L&H. Half of the
subjects completed the text creation tasks using speech first
and then did a similar set using keyboard-mouse, and half
did keyboard-mouse followed by speech. Subjects received
a $75 award for their participation in the three-hour
session. All sessions were videotaped.
On arrival at the lab, the experimenter introduced the
subject to the purpose, approximate length of time, and
content of the usability session. The stages of the
experimental session were:
1. Provide session overview and introduction.
2. Enroll user in assigned system.
3. Complete text tasks using first modality.
4. Complete text tasks using second modality.
5. Debrief the user.
The experimenter told the subject to try and complete the
tasks using the product materials and to think aloud during
the session. (While this could cause interference with the
primary task, our subjects switched between think aloud
and task modes fairly easily.) The experimenter explained
that assistance would be provided if the subject got stuck.
The experimenter then left the subject and moved to the
Control Room. The subject’s first task was to enroll in the
ASR system (the systems were pre-installed on the
machines). Enrollment took from 30 minutes to 1.5 hours
for the subject to complete, depending on the system and
the subject’s speed in reading the enrollment text. After
enrollment was completed, the subject was given a break
while the system developed a speech model for the subject
by completing an analysis of the speech data. After the
break, the subject attempted to complete a series of text
creation tasks. All text was created in each product’s
dictation application that provided basic editing functions
(similar to Windows 95 WordPad), and did not include
advanced functions such as spelling or grammar checkers.
Before engaging in the speech tasks, all participants
underwent a training session with the experimenter present
to provide instruction. This session was standardized
across the three systems. Basic areas such as text entry and
correction were covered. Each subject dictated a body of
text supplied by the experimenter, composed a brief
document, learned how to correct mistakes, and was given
free time to explore the functions of the system. During the
training session, each subject was shown how to make
corrections as they went along as well as making
corrections by completing dictation and going back and
proofreading. Sample tasks in both transcription and
composition were completed in this phase. Each subject
was allowed approximately 40 minutes for the speech
training scenario. Subjects were given no training for
keyboard-mouse text creation tasks.
In the text creation phase for each modality, each subject
attempted to complete four tasks - two composition and two
transcription tasks. The order of the tasks (transcription or
composition) was varied across subjects with half doing
composition tasks followed by transcription tasks, and half
doing transcription followed by composition. In all, each
subject attempted to complete eight tasks - four
composition and four transcription, with two of each task
type in each modality.
For each composition task, subjects were asked to compose
a response to a message (provided on paper) in the simple
text entry window of the dictation application. Each of the
responses was to contain three points for the reply to be
considered complete and accurate. For example, in one of
the composition tasks, the subject was asked to compose a
message providing a detailed meeting agenda, meeting
room location, and arrangements for food. Composition
tasks included social and work related responses, and
subjects were asked to compose “short replies.” The
quality of each response was later evaluated based on
whether the composed messages contained a complete
(included consideration of the three points) and clear (was
judged as well written by evaluators) response. All subjects
used the same four composition tasks, with an equal number
of subjects using speech and keyboard-mouse to complete
each task.
For transcription tasks, subjects attempted to complete the
entry of two texts in each modality. There were four texts
that ranged from 71 to 86 words in length. These texts
were drawn from an old western novel. The subjects
entered the text in the appropriate modality and were asked
to make all corrections necessary to match the content of
the original text. The resulting texts were later evaluated
for accuracy and completeness by comparing them to the
original materials. Evaluators counted uncorrected entry
errors and omissions.
In the keyboard-mouse modality tasks, subjects completed
composition and transcription tasks using standard
keyboard and mouse interaction techniques in a simple edit
window provided with each system. Subjects were given 20
minutes to complete the four keyboard-mouse tasks. All
subjects completed all tasks within the time limit.
In the speech modality tasks, subjects completed the
composition and transcription tasks using voice, but were
free to use keyboard and mouse for cursor movements or to
make corrections they felt they could not make using
speech commands. We intentionally did not restrict subjects
to the use of speech to carry out the speech modality tasks,
and all subjects made some use of the keyboard and mouse.
Subjects were given 40 minutes to complete the four speech
tasks.
After each of the tasks (enrollment and eight text tasks),
subjects filled out a brief questionnaire on their experience
completing the task. After completing the four tasks for
each modality, subjects filled out a questionnaire addressing
their experience with that modality. After completing all
tasks, the experimenter joined the subject for a debriefing
session in which the subject was asked a series of questions
about their reactions to the ASR technology.
Extended Use
Subjects in the Extended Use study were the four co-
authors of this paper. In this study, the subjects used each
of the three speech recognition products for 10 sessions of
approximately one hour duration; a total of 30 sessions
across the products. During the session the subjects would
use speech recognition software to carry out actual
work-related correspondence. After completing at least 20
sessions, subjects completed the set of transcription tasks
used in the Initial Use study. We limit the presentation of
the results of the Extended Use phase of the study to some
general comparisons with the Initial Use data.
RESULTS
For the analysis of the Initial Use sessions, we carried out a
detailed analysis of the videotapes of the experimental
sessions. This included a coding of all of the pertinent
actions carried out by subjects in the study.
Misrecognitions of text and commands and attempts to
recover from them were coded, along with a range of
usability and system problems. Particular attention was
paid to the interplay of text entry and correction segments
during a task, as well as strategies used to make corrections.
Because of the extensive time required to do this, we
completed the detailed analysis for 12 of the 24 subjects in
the Initial Use phase of the study (four randomly selected
subjects from each of the three systems, maintaining gender
balance). Thus we report performance data from 12
subjects, but include all 24 subjects in reporting results
where possible. Additionally, we report selected data from
the four subjects in the Extended Use phase. The data
reported from the three speech recognition systems are
collapsed into a single group here.
Typing versus Dictating - Overall Efficiency
Our initial comparison of interest is the efficiency of text
entry using speech and keyboard-mouse for transcription
and composition tasks. We measure efficiency by time to
complete the tasks and by entry rate. The entry rate that we
present is corrected words per minute (cwpm), and is the
number of words in the final document divided by the time
the subject took to enter the text and make corrections. The
average length of the composed texts was not significantly
different between the speech and keyboard-mouse tasks and
was similar to the average length of the transcriptions (71.5
and 73.1 words for speech and keyboard-mouse
compositions respectively and 77.8 words for
transcriptions). Table 1 below summarizes the results for
task completion rates for the various tasks.
Table 1. Mean corrected words per minute and time per
task by entry modality and task type (N=12).
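The cwpm measure is simply the final document length divided by the total task time; a minimal sketch (the function name is ours):

```python
def corrected_wpm(final_word_count: float, total_minutes: float) -> float:
    """Corrected words per minute: the number of words in the final
    document divided by the time taken to enter the text and make
    corrections (entry time plus correction time)."""
    if total_minutes <= 0:
        raise ValueError("total time must be positive")
    return final_word_count / total_minutes

# Using the Extended Use transcription figures reported in the text:
# about 77.8 words entered in 3.10 minutes is roughly 25.1 cwpm.
print(round(corrected_wpm(77.8, 3.10), 1))  # 25.1
```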
Creating text was significantly slower for the speech
modality than for keyboard-mouse (F=29.2, p<0.01). By
comparison, subjects in the Extended Use study completed
the same transcription using ASR in an average 3.10
minutes (25.1 cwpm). The main effect for modality held
for both the transcription tasks and the composition tasks.
Composition tasks took longer than transcription tasks
(F=18.6, p<0.01). This is to be expected given the inherent
difference between simple text entry and crafting a
message. There was no significant interaction between the
task type and modality, suggesting that the modality effect
was persistent across task type.
Given this clear difference in the overall time to complete
the tasks, we were interested in looking for quantitative and
qualitative differences in the performance. There are
several areas in which we were interested in comparing text
entry through typing to entry with ASR. These included: 1)
number of errors detected and corrected in the two
modalities, 2) differences in inline correction and
proofreading as a means of correction, and 3) differences in
overall quality of the resulting document. We consider
evidence for each of these comparisons in turn.
Errors detected and corrected
A great deal of effort is put into lowering the error rates in
ASR systems, in an attempt to approach the accuracy
assumed for users’ typing. For text entry into word
processing systems, users commonly make errors (typing
mistakes, misspellings and such) as they enter. Many of
these errors are corrected as they go along - something that
is supported by current word processing programs that
highlight misspellings or grammatical errors. We were
interested in data on the comparison of entry errors in the
two modalities, and their detection and correction.
Table 2 presents data summarizing the average number of
correction episodes for the different task types and input
modalities. A correction episode is an effort to correct one
or more words through actions that (1) identified the error,
and (2) corrected it. Thus if a subject selected one or more
words using a single select action and retyped or redictated
a correction, we scored this as a correction episode. A
major question is how the number of error correction
episodes compares for ASR systems and keyboard-mouse
entry.
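The episode coding can be sketched as grouping a coded action log into maximal runs of correction steps; the log format below is our reconstruction for illustration, not the paper's exact coding scheme:

```python
def correction_episodes(actions):
    """Group a coded action log into correction episodes.

    `actions` is a sequence of (kind, label) pairs where kind is either
    "entry" (dictating or typing new text) or "correction" (any step
    taken to locate or fix an error).  A maximal run of correction steps
    counts as one episode; the run length is the episode's length in steps.
    Returns the list of episode lengths.
    """
    episodes = []
    run = 0
    for kind, _label in actions:
        if kind == "correction":
            run += 1
        elif run:
            episodes.append(run)
            run = 0
    if run:
        episodes.append(run)
    return episodes

log = [
    ("entry", "dictate sentence"),
    ("correction", "voice select"),   # locate the error
    ("correction", "redictate"),      # fix it
    ("entry", "dictate sentence"),
    ("correction", "move cursor"),
    ("correction", "retype"),
    ("correction", "retype again"),   # mistyped during the correction
]
print(correction_episodes(log))  # [2, 3]: two episodes of 2 and 3 steps
```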
                 Transcription   Composition
Speech           11.3 (7.3)      13.5 (6.2)
Keyboard-mouse    8.4 (2.2)      12.7 (2.4)

Table 2. Mean number of correction episodes per task by
entry modality (N=12). Length in steps is in parentheses.
While the average number of corrections made is slightly
higher for the speech tasks than for the keyboard-mouse
tasks, the length of the correction episodes is much longer.
Interestingly, the improved performance for Extended Use
subjects on transcription tasks cannot be accounted for
entirely by reduced correction episodes - subjects averaged
8.8 per task. The average number of steps per correction
episode is much shorter for the Extended Use subjects -
averaging 3.5 steps compared to 7.3 for Initial Use subjects.
In general, the keyboard corrections simply involved
backspacing or moving the cursor to the point at which the
error occurred, and then retyping (we coded these as a
move step followed by a retype step). In a few instances,
the user would mistype during correction, resulting in a
second retype step. About 80% of the keyboard-mouse
corrections were simple position/retype episodes.
For speech corrections there was much more variability. In
most cases a misrecognized word could be corrected using
a simple locate/redictate command pair comparable to the
keyboard-mouse pattern. Such a correction was coded as a
voice move, followed by a voice redictate, that was marked
to indicate success or failure. Variations include command
substitutions such as the sequence voice select, voice delete,
and voice redictate. More often the average number of
commands required was much greater - generally due to
problems with the speech commands themselves that then
needed to be corrected, although the overall patterns can
still be seen in terms of move to the error, select it and
operate on it. Typical patterns included:
1. Simple redictation failures in which the user selected
the misrecognized word or phrase (usually using a
voice select command), followed by a redictation of the
misrecognized word which also was misrecognized.
Users would continue to try to redictate, would use
correction dialogs that allow for alternative selection or
spelling, or would abandon speech as a correction
mechanism and complete the correction using
keyboard-mouse.
2. Cascading failures in which a command used to
attempt a correction was misrecognized and had to be
corrected itself as a part of the correction episode.
Such episodes proved very frustrating for subjects and
took considerable time to recover from.
3. Difficulties using correction dialogs in which the user
abandoned a correction attempt for a variety of
reasons. This included difficulties brought on by mode
differences in the correction dialog (e.g., commonly
used correction commands such as UNDO would not
work in correction dialogs) or difficulties with the
spelling mechanism.
High Level Correction Strategies - Inline versus
Proofreading Corrections
Another question is whether users employ different
correction strategies for the two input modalities. This
could be demonstrated in either high-level strategies (such
as “correct as you go along” versus “enter and then
correct”) or in lower-level differences such as the use of
specific correction techniques. In Table 3 we present data
for the transcription and composition tasks combined,
comparing the average number of errors corrected in a task
before completion of text entry (Inline) and after reaching
the end of the text (Proofreading).
Table 3. Average errors corrected per task by phase of entry
(N=12).
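The inline/proofreading split can be sketched as follows, assuming each correction episode is stamped with the time it began; the representation is ours, since the paper coded this from videotape:

```python
def split_inline_proofreading(episodes, entry_end_time):
    """Count inline versus proofreading correction episodes.

    `episodes` holds (start_time, ...) tuples.  A correction begun before
    the subject first reached the end of the text counts as inline; one
    begun after that point counts as proofreading.
    """
    inline = sum(1 for e in episodes if e[0] < entry_end_time)
    proofreading = len(episodes) - inline
    return inline, proofreading

# Five hypothetical episodes; text entry finished at t=120 seconds.
episodes = [(12.0,), (45.5,), (80.1,), (130.0,), (142.7,)]
print(split_inline_proofreading(episodes, entry_end_time=120.0))  # (3, 2)
```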
There are two things to point out in these data. First, there
are significantly more correction episodes in inline than in
proofreading for both modalities (t=7.18, p<.001 for speech
and t=8.64, p<.001 for keyboard-mouse). Performance on
the keyboard-mouse tasks demonstrated that subjects are
quite used to correcting as they go along, and try to avoid
separate proofreading passes. For the speech modality
however, subjects still had significant errors to correct in
proofreading. In comparison, subjects in the Extended Use
study rarely made inline corrections in transcription tasks
(less than once per task on average).
Subjects gave us reasons for an increased reliance on
proofreading. They commented that they felt aware of
when they might have made a typing error, but felt less
aware of when misrecognitions might have occurred. Note
that in keyboard-mouse tasks, errors generally are user
errors, while in speech tasks errors generally are system
errors. By this we mean that for keyboard-mouse, systems
reliably produce output consistent with user input. A typist
can often “feel” or sense without looking at the display
when an error might have occurred. For speech input, the
user quickly learns that output is highly correlated with
speech input, but that it is not perfect. Users do not seem to
have a very reliable model of when an error might have
occurred, and must either constantly monitor the display for
errors or rely more heavily on a proofreading pass to detect
them.
Second, the number of inline correction episodes is nearly
equal for the two conditions. This suggests a transfer of
cognitive skill from the more familiar keyboard and mouse
interaction. As in typing, subjects were willing to switch
from input mode to correction mode fairly easily and did
not try to rely completely on proofreading for error
correction.
Lower-level strategies for error correction
Almost all keyboard-mouse corrections were made inline
and simply involved using the backspace key or mouse to
point and select followed by typing. In comparison, the
voice corrections were much more varied. This is
undoubtedly due to the wide range of possible errors in the
ASR systems compared to keyboard-mouse entry. The
major classes of possible errors in ASR include:
- Simple misrecognitions in which a single spoken word
  intended as text is recognized as a different text word.
- Multi-word misrecognitions in which a series of
  words are recognized as a different series of words.
- Command misrecognitions in which an utterance
  intended as a command is inserted in the text.
- Dictation as command misrecognitions in which an
  utterance intended as dictation is taken as a command.
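The four classes above can be captured in a small taxonomy. The classifier below is our illustration only (the field names are invented), keyed on what the user intended versus what the system produced:

```python
from enum import Enum

class ASRError(Enum):
    SIMPLE_MISRECOGNITION = "single text word recognized as another word"
    MULTIWORD_MISRECOGNITION = "word series recognized as a different series"
    COMMAND_AS_DICTATION = "command utterance inserted into the text"
    DICTATION_AS_COMMAND = "dictated text taken as a command"

def classify(intended_kind, recognized_kind, n_words):
    """Map an observed error onto the four classes.

    `intended_kind` and `recognized_kind` are "text" or "command";
    `n_words` is the length of the misrecognized span.
    """
    if intended_kind == "command" and recognized_kind == "text":
        return ASRError.COMMAND_AS_DICTATION
    if intended_kind == "text" and recognized_kind == "command":
        return ASRError.DICTATION_AS_COMMAND
    if n_words > 1:
        return ASRError.MULTIWORD_MISRECOGNITION
    return ASRError.SIMPLE_MISRECOGNITION

print(classify("text", "command", 2).name)  # DICTATION_AS_COMMAND
```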
All of these occurred in all of the systems in the study. In
addition, subjects did some editing of content in their
documents. Because errors in ASR are correctly spelled
words it is difficult to separate edits from errors in all cases.
In what follows these are treated the same since both use
the same techniques for correction.
Methods of making corrections in the two modalities can be
compared. For example, keyboard-mouse corrections could
be made by making a selection with the mouse and then
retyping, by positioning the insertion point with cursor keys
and then deleting errors and retyping, or by simply
backspacing and retyping. These segment into two
categories: deleting first then entering text or selecting text
and entering over the selection. In speech, these kinds of
corrections are possible in a variety of ways using
redictation after positioning (with voice, keyboard or
mouse). In addition, there is use of a correction dialog
which allows spelling (all systems) or selection of an
alternative word (in two of the three systems). Table 4
summarizes the techniques used by subjects to make
corrections in the texts.¹
The dominant technique for keyboard entry is to erase text
back to the error and retype. This includes the erasure of
text that was correct, and reentering it. For speech, the
dominant technique was to select the text in error, and to
redictate. In only a minority of the corrections (8%) did the
subjects utilize the systems’ correction dialog box. Almost a
third of the corrections were to correct problems created
during the original correction attempt. For example, while
correcting the word “kiss” to “keep” in “kiss the dog”, the
command “SELECT kiss” is misrecognized as the dictated
text “selected kiss”, which must be deleted in addition to
correcting the original error.

¹ Only one of the 12 subjects used an explicitly multi-modal strategy for
correction. That subject relied on the keyboard to move to the error and
switched to speech to select and redictate the text.
                                   Speech   Keyboard-mouse
Select text then reenter           38%      27%
Delete then reenter                23%      73%
Correction box                      8%      NA
Correcting problems caused
  during correction                32%      NA

Table 4. Patterns of Error Correction based on overall
corrections (N=12).
Low use of the correction dialogs may be explained by two
phenomena. First, correction dialogs were generally used
after other methods had failed (62% of all correction
dialogs). Second, 38% of the time a problem occurred during
the interaction inside the correction dialog, with 38% of these
problems resulting in the subject canceling out of the dialog.
Understanding more fully why the features of the correction
dialogs are not better utilized is an area for future study.
Overall Quality of Typed and Dictated Texts
There are two areas in which we tried to evaluate the
relative quality of the results of text entry in the two
modalities. For transcription tasks, we evaluated the overall
accuracy of the transcriptions - that is, we asked how many
mismatches there were between the target document and the
produced document. For composition tasks we asked three
peers, not part of the study, to evaluate several aspects of
the messages produced by subjects. These judges
independently counted the number of points that the
message covered (there were three target points for each
message). We also asked for a count of errors in the final
message and for an evaluation of the overall clarity.
Finally, for each of the four composition tasks, we asked
the judges to rank order the 24 messages in terms of quality
from best to worst.
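The mismatch count described above can be approximated with a standard word-level sequence alignment. The following sketch is our own illustration (the study does not specify its scoring procedure at this level of detail), using Python's difflib:

```python
import difflib

def count_mismatches(target, produced):
    """Count word-level mismatches between a target text and the text a
    subject actually produced, by aligning the two word sequences with
    difflib's SequenceMatcher."""
    t, p = target.split(), produced.split()
    matcher = difflib.SequenceMatcher(a=t, b=p)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Words in either text that failed to align count as mismatches:
    # wrong or missing words in the produced text, or extra insertions.
    return (len(t) - matched) + (len(p) - matched)

print(count_mismatches("keep the dog on a leash",
                       "kiss the dog on leash"))  # → 3
```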
In Table 5 we summarize the overall quality measures for
the texts produced. These measures include average
number of errors in the final products for both the
transcription and composition tasks, and the average quality
rank for texts scored by three judges for the composition
tasks.
There were many more errors in the final transcription
documents for the speech tasks than for the keyboard-
mouse tasks. The errors remaining in the final documents
were broken into three categories: wrong words (including
misspellings), format errors (including capitalization and
punctuation errors), and missing words. The average
number of wrong words (F=25.4, p<.001) and format errors
(F=12.6, p<.001) were significantly lower for keyboard-
mouse compared with speech tasks. There was no
difference in the number of missing words.
Speech Keyboard-mouse
Transcription errors 3.8 errors 1.0 errors
Composition errors 1.8 errors 1.1 errors
Composition (rank) 13.2 11.4
Table 5. Mean quality measures by modality (N=24).
Composition quality showed a similar pattern. Errors in
composition included obviously wrong words (e.g.,
grammar errors) or misspellings. There were fewer errors
in the keyboard-mouse texts than in the speech texts (F=7.9,
p<0.01). Judges were asked to rank order the texts for each
of the four composition tasks from best (given a score of 1)
to worst (given a score of 24). While the mean score was
lower (better) for keyboard-mouse texts than for speech
texts, the difference was not statistically significant.
Transcription versus Composition
For both the keyboard-mouse modality and the speech
modality, composition tasks take longer than transcription
tasks. We did not find significant differences in the length
or readability of texts composed in the two modalities.
Additionally, measures such as correction techniques and error
frequencies did not seem to vary across modalities and
task types.
Subjective Results - Questionnaire Data
Subjects (N=24) in the Initial Use study consistently report
being dissatisfied with the ASR software for performing the
experimental tasks. When asked to compare their
productivity using the two modalities in the debriefing
session, subjects gave a modal response of “much less
productive” for speech on a 7-point scale ranging from
“much more productive” to “much less productive”, and 21
of 24 subjects responded “less” or “much less productive”.
Subjects’ top reasons for their ratings (frequency of
response in parentheses, summed across several questions)
were:
• Speech recognition is unreliable, error prone (34).
• Error correction in speech is much too hard - and
correction can just lead to more errors (20).
• Not knowing how to integrate the use of speech and
keyboard-mouse efficiently (19).
• Keyboard is much faster (14).
• Command language problems (13).
• It is harder to talk and think than to type and think (7).
Additionally, when asked if the software was good enough
to purchase, 21 of 24 subjects responded “No” to a binary
Yes/No choice. The three subjects that reported a
willingness to purchase the software all gave considerable
qualifications to their responses. When asked for the
improvements that would be necessary for ASR technology
to be useful, subjects’ top responses included:
• Corrections need to be much easier to make (27).
• Speech recognition needs to be more accurate (25).
• Need feedback to know when there is a mistake (8).
• Command language confusion between command and
dictation needs to be fixed (8).
DISCUSSION
There are many interesting patterns in the data presented
above. Early speech recognition products varied in the
strategies of error correction that they encouraged for users.
For example, IBM’s VoiceType system encouraged users
(in documentation and online help) to dictate first and then
switch to correction mode, while Dragon Dictate
encouraged users to make corrections immediately after an
error was dictated. To a large extent these strategies were
encouraged to have user behavior correspond to system
designs, and not because of a user-driven reason. The
systems in the current study all accommodate inline
correction and post-entry correction equally well. One
thing that the results of the Initial Use study point to is the
general tendency for subjects to make corrections as they
go along, rather than in a proofreading pass. Table 2 shows
that subjects made many more corrections inline than they
did after completion of entry in both the speech and
keyboard-mouse conditions.
When subjects made errors in keyboard-mouse text entry,
they tended to correct the error within a few words of
having made it. In contrast, some subjects made specific
mention of not being as aware of when a misrecognition
had occurred and needing to “go back to” a proofreading
stage for the speech tasks. Taken together with the
tendency toward inline correction, this suggests supporting
users in knowing when a misrecognition has occurred.
Misrecognition Corrections
The most common command used in any of the systems is
the command to reverse the immediately preceding action.
While each of the systems has multiple variant commands
for doing this (some mixture of UNDO, SCRATCH, and
DELETE), users generally rely on a single form that they
use consistently. However, the command variants have
subtle distinctions that were frequently lost on the subjects
in this study. Many of the usability problems with respect
to these commands appear related to the users’ strategy of
relying on a single form for a command, even if it was not
appropriate for the tasks at hand. Developing more
complex strategies for selecting between command forms
seems to require additional expertise. We do not observe
these confusions at this level in the Extended Use study.
Quality Measures
Attempts to compare the composition quality of texts
produced by speech input and more traditional input
generally predate the existence of real systems for ASR
[e.g., 3,8]. The current study shows no statistical difference
between the quality of the texts composed using speech
recognition as compared to keyboard and mouse.
Subjective Results
The majority of subjects felt that they would be less or
much less productive with speech recognition than with
keyboard and mouse using the current products. They
provided some clear insights into where efforts need to be
made to improve these systems in order for them to be
useful and usable. Top concerns include the performance of
the systems and several key user interface issues. There is a
critical need to learn about people’s performance and
satisfaction with multi-modal patterns. The field needs to
better understand the use of commands and people’s ability
and satisfaction with natural language commands. Also,
there are intriguing issues to be researched regarding
cognitive load issues in speech recognition and how to
provide feedback to users. The subjects said that they were
excited about the future possibility of using speech to
complete their work. They were pleased with the feeling of
freedom that speaking allowed them, and the ease and
naturalness of it.
CONCLUSIONS
It is interesting to note that several of the Initial Use
subjects commented that keyboard entry seemed “much
more natural” than speech for entering text. While this
seems like an odd comment at some level, it reflects the
degree to which some people have become accustomed to
using keyboards. This relates both to the comfort with
which people compose text at a keyboard and to well
learned methods for inline error detection and correction.
Speech is also a well learned skill, though as this study
shows, the ways to use it in communicating with computers
are not well established for most users. There is potential
for ASR to be an efficient text creation technique - the
Extended Use subjects entered transcription text at an
average rate of 107 uncorrected words per minute; however,
correction took them over three times as long as entry time
on average.
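Taking the reported averages at face value, a back-of-the-envelope estimate of the effective throughput follows: if correction takes three times as long as entry, the total time is four times the entry time, so the corrected rate is roughly a quarter of the uncorrected rate. This is our own rough calculation under those assumptions:

```python
# Back-of-the-envelope estimate from the averages reported above.
uncorrected_wpm = 107      # average entry rate, uncorrected words/minute
correction_factor = 3      # correction took ~3x as long as entry
total_time_factor = 1 + correction_factor
effective_wpm = uncorrected_wpm / total_time_factor
print(round(effective_wpm, 1))  # → 26.8
```

Even with fast raw dictation, correction overhead of this magnitude drops the effective rate below typical typing speeds, which is consistent with the subjects' productivity ratings.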
When desktop ASR systems first began appearing about 5
years ago, it was assumed that their wide-scale acceptance
would have to await solutions to “mode problems” (the
need to explicitly indicate dictation or command modes),
and the development of continuous speech recognition
algorithms which were sufficiently accurate. While all of
the commercial systems evaluated in this study have these
features, our results indicate that our technically
sophisticated subject pool is far from satisfied with the
current systems as an alternative to keyboard for general
text creation. They have given a clear prioritization of
changes needed in the design of these systems. These
changes merit significant attention.
It is possible - though we do not think it is very likely - that
less skilled computer users would react to the software
more positively. The methods for error correction, and the
complexity that compound errors can produce, lead us to
believe that decreased rather than increased performance
would have to be tolerated by any users - even those with
limited typing skills. While this might be acceptable for
some populations (RSI sufferers or technology adopters),
wide scale acceptance awaits design improvements beyond
this current generation of products.
REFERENCES
1. Clark, H. H. & Brennan, S. E. (1991). Grounding in
communication. In J. Levine, L. B. Resnick, and S. D.
Behrand (Eds.), Shared Cognition: Thinking as Social
Practice. APA Books, Washington.
2. Danis, C. & Karat, J. (1995). Technology-driven
design of speech recognition systems. In G. Olson and
S. Schuon (eds.) Symposium on Designing Interactive
Systems. ACM: New York, 17-24.
3. Gould, J. D., Conti, J., & Hovanyecz, T. (1983).
Composing letters with a simulated listening
typewriter. Communications of the ACM, 26, 4, 295-308.
4. Karat, J. (1995). Scenario use in the design of a speech
recognition system. In J. Carroll (ed.) Scenario-Based
Design. New York: Wiley.
5. Kidd, A. (1994). The marks are on the knowledge
worker, in Proceedings of CHI ’94 (Boston MA, April
1994), ACM Press, 186-191.
6. Lai, J. & Vergo, J. (1997). MedSpeak: Report
Creation with Continuous Speech Recognition, in
Proceedings of CHI ’97 (Atlanta GA, March 1997),
ACM Press, 431-438.
7. Laurel, B. (1993). Computers as Theatre. Addison-
Wesley, New York.
8. Ogozalek, V. Z., & Praag, J. V. (1986). Comparison of
elderly and younger users on keyboard and voice input
computer-based composition tasks, in Proceedings of
CHI ’86, ACM Press, 205-211.
9. Oviatt, S. (1995). Predicting spoken disfluencies
during human-computer interaction. Computer Speech
and Language, 9, 19-35.
10. Yankelovich, N., Levow, G. A., & Marx, M. (1995).
Designing SpeechActs: Issues in speech user
interfaces, in Proceedings of CHI ’95 (Denver CO,
May 1995), ACM Press, 369-376.