Gesture Patterns during Speech Repairs
Lei Chen, Mary Harper
Speech Processing Lab
Electrical and Computer Engineering, Purdue University
West Lafayette, IN 47906-1285

Computer Science and Engineering
Wright State University
Dayton, OH 45435
Abstract

Speech and gesture are two primary modes used in natural human communication; hence, they are important inputs for a multimodal interface to process. One of the challenges for multimodal interfaces is to accurately recognize the words in spontaneous speech. This is partly due to the presence of speech repairs, which seriously degrade the accuracy of current speech recognition systems. Based on the assumption that speech and gesture arise from the same thought process, we would expect to find patterns of gesture that co-occur with speech repairs that can be exploited by a multimodal processing system to more effectively process spontaneous speech. To evaluate this hypothesis, we have conducted a measurement study of gesture and speech repair data extracted from videotapes of natural dialogs. Although we have found that gestures do not always co-occur with speech repairs, we observed that modification gesture patterns have a high correlation with content replacement speech repairs, but rarely occur with content repetitions. These results suggest that gesture patterns can help us to classify different types of speech repairs in order to correct them more accurately.
1. Introduction

Two primary modes used in natural human communication are speech and gesture [4, 2]. As such, they are ideal inputs for multimodal human-to-computer interfaces to process. Computer understanding of natural human dialog is currently an unsolved problem, albeit an important one. One reason for this is that accurate machine understanding of spontaneous speech is still a highly challenging open problem for speech researchers. Understanding spontaneous speech is difficult because it often contains mistakes that are revised or repaired within the same utterance. The presence of these speech repairs contributes to the low recognition accuracy of today's speech recognition systems on spontaneous speech [5, 15]. Speech repairs are often broken down into three components: the repair site or reparandum, the editing phrase (i.e., a spoken cue phrase like "I mean"), and the resumption site or alteration [6, 7, 12, 13]. Explicit editing phrases are not always present, and these phrases do not uniquely signal the presence of a repair. Speech repairs are often classified [6, 7] into three types:
1. false starts (also called fresh starts): e.g., the following example contains an utterance that is aborted and then restarted:
[I need to send] (aborted utterance) [let's see] (editing phrase) [how many boxcars can one engine take] (new utterance).
2. content replacements (or modification repairs): the content of the reparandum is replaced by the alteration, for example:
you can [carry them both on] (reparandum) [tow them both on] (alteration) the same engine.
3. repetitions: e.g., [she] (reparandum) [she] (alteration) liked it.
Based on whether or not the content has been modified,
we classify speech repairs as content modifications (i.e.,
false starts and content replacements) and content repeti-
tions (i.e., repetitions).
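The two broad classes above can be made concrete with a small sketch. This is our own illustration, not the authors' tooling; the field names and the word-for-word test for repetitions are simplifying assumptions (a full classifier would also need to detect false starts, which abort the utterance entirely rather than replace part of it).

```python
from dataclasses import dataclass

@dataclass
class SpeechRepair:
    """One speech repair in the reparandum / editing-phrase / alteration
    decomposition. Field names are our own, not the paper's."""
    reparandum: str
    editing_phrase: str  # may be empty; explicit editing phrases are optional
    alteration: str

def classify(repair: SpeechRepair) -> str:
    """Return 'CR' (content repetition) if the alteration merely repeats
    the reparandum word for word, else 'CM' (content modification)."""
    rep = repair.reparandum.lower().split()
    alt = repair.alteration.lower().split()
    return "CR" if rep == alt else "CM"

# The paper's examples:
classify(SpeechRepair("she", "", "she"))                              # -> 'CR'
classify(SpeechRepair("carry them both on", "", "tow them both on"))  # -> 'CM'
```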
A second reason that computer understanding of human-to-human dialog is difficult is that computer models for processing gesture are only beginning to surface. If we are to build systems that are able to exploit gesture in natural interaction, it is essential to derive computationally accessible metrics that can be utilized for processing the communication. Human communication is a dynamic interplay among various `communicative channels' that include speech, prosody, gesture, gaze, facial expression, and body posture. These modalities do not function independently, nor is any modality subservient to another. Instead, these modalities proceed from the same thought process that produces an utterance, and each carries aspects of the original thought. Based on the assumption that speech and gesture proceed from the same thought process, we would expect to find patterns of gesture that co-occur with speech repairs.
2. Experimental Method
Data used in this paper came from the KDI visual-audio database. Videos were recorded in a series of elicitation experiments performed at the Department of Psychology at the University of Chicago under the direction of David McNeill. Subjects were recruited in speaker-interlocutor pairs. To avoid `stranger-experimenter' inhibition in the discourse, the subjects already knew one another. The subject was shown a model of a village and told that a family of intelligent wombats had taken over the town theater, and was made privy to a plan to surround and capture the wombats and return them to Australia. This plan involves collaboration with the villagers, paths of approach, and encircling strategies. The subject was then videotaped communicating these plans to his/her interlocutor using the town model. The task description we used for the experiment is shown in Figure 1.

Figure 1. Task description used in our experiment.
We apply a three-camera setup in our experiments. Two of the cameras are calibrated so that once correspondence between points in the two cameras is established, the 3D positions and velocities can be obtained. The third camera is a closeup of the head. We chose this configuration because our experiment must be portable and easy to set up (some of our cross-disciplinary collaborators collect data in the field). Using a five-foot-wide prism with a known constellation of points, we are able to obtain points with typical average errors within 1 mm in x and y and about 1.5 mm in z (toward the cameras). The maximal errors are within 4 mm. We believe that this is sufficient for conversational
gesture interaction. We use off-the-shelf consumer-grade
miniDV 30 frames-per-second cameras in progressive scan
mode in these experiments. The audio for each participant
was digitally recorded using a Shure SM94 unidirectional
boom mounted microphone that was placed at a distance of
eight inches from the subjects' mouths. The video and au-
dio are synchronized using a movie `clapper' device. The video was stored in MPEG format. The audio was initially sampled at 44.1 kHz and then downsampled to 14.7 kHz for analysis. One
frame of video is shown in Figure 2.
Figure 2. One frame of video in the KDI visual-audio database.
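The 3D positions recovered from the two calibrated cameras rest on stereo triangulation. As a generic geometric sketch (not the authors' calibration code), a 3D point can be estimated as the midpoint of the shortest segment between the two back-projected rays; camera centers and ray directions are assumed to be already known from calibration.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]

def sub(a: Vec3, b: Vec3) -> Vec3: return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def add(a: Vec3, b: Vec3) -> Vec3: return (a[0]+b[0], a[1]+b[1], a[2]+b[2])
def mul(a: Vec3, s: float) -> Vec3: return (a[0]*s, a[1]*s, a[2]*s)
def dot(a: Vec3, b: Vec3) -> float: return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def triangulate_midpoint(c1: Vec3, d1: Vec3, c2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2.
    Minimizing |c1 + s*d1 - c2 - t*d2|^2 yields a 2x2 linear system in (s, t)."""
    r = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, r), dot(d2, r)
    denom = a * c - b * b          # ~0 when the rays are (nearly) parallel
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = add(c1, mul(d1, s))       # closest point on ray 1
    p2 = add(c2, mul(d2, t))       # closest point on ray 2
    return mul(add(p1, p2), 0.5)

# Two cameras one unit left/right of the origin, rays crossing at (0, 0, 1):
triangulate_midpoint((-1.0, 0.0, 0.0), (1.0, 0.0, 1.0),
                     (1.0, 0.0, 0.0), (-1.0, 0.0, 1.0))  # -> (0.0, 0.0, 1.0)
```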
All videos are processed in the VisLab at Wright State University. A fuzzy image processing approach known as Vector Coherence Mapping (VCM) is used to track hand motion [14]. VCM applies spatial coherence, momentum (temporal coherence), speed limit, and skin color constraints in the vector field computation by using a fuzzy-combination strategy, and produces good results for hand gesture tracking. An iterative clustering algorithm is applied that minimizes spatial and temporal vector variance to extract moving hands. The positions of the hands in the stereo images are used to produce 3D motion traces describing the gestures. Three gesture features are extracted for each hand of a speaker:
1. 3D Hand Position: the VCM output is used to track the (x, y, z) position of a hand.

2. Hold: a state when there is no hand motion beyond some threshold. A motion energy-based detector was used to locate places where there was low motion energy [3].

3. Effort: analogous to the kinetic energy of hand movement.
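The motion energy-based hold detector (item 2) can be sketched roughly as follows. This is our own simplification of the approach in [3]; the window length, threshold, and minimum duration are illustrative values, not the published ones.

```python
def motion_energy(trace, window=5):
    """Per-frame motion energy of a 3D position trace: mean squared
    frame-to-frame displacement over a trailing window of frames."""
    def step(i):
        (x0, y0, z0), (x1, y1, z1) = trace[i - 1], trace[i]
        return (x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2
    energy = [0.0] * len(trace)
    for i in range(1, len(trace)):
        lo = max(1, i - window + 1)
        energy[i] = sum(step(j) for j in range(lo, i + 1)) / (i - lo + 1)
    return energy

def holds(trace, threshold=1e-4, min_frames=6):
    """Frame intervals [start, end) where energy stays below threshold
    for at least min_frames consecutive frames (candidate holds)."""
    e = motion_energy(trace)
    out, start = [], None
    for i, v in enumerate(e):
        if v < threshold:
            start = i if start is None else start
        else:
            if start is not None and i - start >= min_frames:
                out.append((start, i))
            start = None
    if start is not None and len(e) - start >= min_frames:
        out.append((start, len(e)))
    return out

# Synthetic trace: hand at rest, a stroke away from the body, rest again.
trace = ([(0.0, 0.0, 0.0)] * 10
         + [(0.1 * i, 0.0, 0.0) for i in range(1, 11)]
         + [(1.0, 0.0, 0.0)] * 10)
holds(trace)   # two holds: the initial rest and the final rest
```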
We also performed a text transcription of each discourse that was initially time-aligned to the audio data using the Entropic Aligner [16] and then modified by an experienced speech scientist using the Praat tool [1]. After the words were aligned, additional annotations of the data were added as Praat tiers, e.g., utterances, speech repairs, and discourse structure. Figure 3 depicts the annotation of a speech repair.
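Tiered annotations of this kind can be queried by time. A minimal sketch, assuming each tier is a list of labeled intervals as in a Praat TextGrid; the times and labels below are hypothetical, echoing the example of Figure 3 rather than reproducing the actual annotation.

```python
def tier_labels(tier, t0, t1):
    """Return the labels of all intervals on a tier that overlap [t0, t1).
    A tier is a list of (start, end, label) tuples."""
    return [lab for s, e, lab in tier if s < t1 and t0 < e]

# Hypothetical word tier for the example repair (times are made up):
words = [(102.3, 102.5, "we"), (102.5, 102.8, "could"), (102.8, 103.0, "be"),
         (103.2, 103.4, "um")]

tier_labels(words, 102.3, 103.0)   # words inside the reparandum
```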
For this experiment, we selected two data sets from the
KDI visual-audio database, wd11 and wd20, for analysis.
Table 1 indicates the length of each data set in seconds (s)
and provides information on each conversant, i.e., gender,
the number of utterances produced (# Utt.), and the number
of speech repairs produced (# Rep.).
Figure 3. The annotation of a speech repair from the wd11 data set of the KDI visual-audio database.
Table 1. A summary of the properties of the
wd11 and wd20 data sets.
3. Data Analysis and Results
For this investigation, we evaluate the types of gestures that occur during speech repairs using the annotation and corresponding video frames of each speech repair.
We first determine whether a gesture co-occurs with each
speech repair; if it does then the speech repair is in the
gesture-on group. Note that in this initial investigation we
do not consider movements such as touching glasses or hair or posture adjustments to be gestures; we call these pragmatic movements. Also, speech repairs that contain a very long pause between the reparandum and alteration are not considered.1
Table 2 provides information on the co-occurrence of
gesture and speech repairs in the wd11 and wd20 data sets.
The number of times that a gesture co-occurs with a speech
repair appears in the column labeled # Gesture-on. We also
indicate the number of times a pragmatic movement (PM)
co-occurs with a speech repair in the column labeled # PM.
Clearly, speakers do not always use gestures during
speech repairs. An important question is, under what cir-
cumstances do they utilize gesture during a repair? In order
to get a deeper understanding of gestures that occur during
speech repairs, we re-analyzed the data to identify the different types of gestural patterns that occur during speech repairs in our data sets.

1There was only one instance in this study.

Table 2. A summary of the co-occurrence of gestures and speech repairs in the wd11 and wd20 data sets.

Table 3. The beginning and end times of each component of our example speech repair.

To get a better understanding of the
process we used for the analysis, consider the speech repair
appearing in Figure 3 with the following transcription:
[we could be] (reparandum) [um] (editing phrase) [some other people could be] (alteration)
Information on the start and finish time of the reparandum,
editing phrase, and alteration of our example repair appears in Table 3. To determine the types of gesture patterns occurring during the reparandum, editing phrase, and alteration of
the speech repair, we stepped through the video on a frame-by-frame basis. Figure 4 shows some of the important key frames for our example speech repair. The gesture features for our example, obtained by VCM, are shown in Figure 5.

Figure 4. Key frames of gestures during our speech repair example.
The figure depicts, at each time point, whether the left (right) hand is moving or at a hold, along with its X, Y, and Z coordinates. We also mark the start and finish times of the reparandum, editing phrase, and alteration. At the beginning of the reparandum, the left and right hands are both at rest on the speaker's knees, as can be seen in the first key frame of Figure 4 and the VCM features of Figure 5 (e.g., notice the right hand's X position (RH X)). After a brief time, the speaker raises her hand and moves it away from her body. This movement continues through the editing phrase; then, in the alteration, both hands stop and are retracted back to the resting position.
Our example speech repair has a speech repair gesture
pattern that we have observed during other speech repairs.
Over all of our data, six distinct speech repair gesture pat-
terns were observed:
1. Case 0: No gesture occurs during the reparandum,
editing phrase, and alteration of the speech repair.
2. Case 1: No gesture appears in the reparandum, but a
gesture begins in the alteration.
3. Case 2: A gesture isterminated withinthe reparandum
and a new gesture begins in the alteration.
4. Case 3: A single gesture appears across the reparan-
dumand alterationwitha very slighthesitationofhand
movement in the editing phrase.
5. Case 4: A gesture occurs in the reparandum, followed
by a hold after the interruption point and a retraction
of hands to rest in the alteration.
6. Case 5: A single gesture appears across the reparan-
dum and alteration.
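The six cases can be mapped to a rough decision procedure over gesture stroke intervals. This is our own formalization, not the authors'; the interval representation, the overlap test, and the `hesitation` flag are simplifications (in particular, separating Case 3 from Case 5 really requires the hold feature described earlier).

```python
def overlaps(a, b):
    """True if half-open time intervals (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def pattern_case(gestures, reparandum, alteration, hesitation=False):
    """Map gesture stroke intervals onto the six speech repair gesture
    patterns. `gestures` is a list of (start, end) intervals; `hesitation`
    flags a slight slow-down of the hands during the editing phrase."""
    in_rep = [g for g in gestures if overlaps(g, reparandum)]
    in_alt = [g for g in gestures if overlaps(g, alteration)]
    spanning = [g for g in in_rep if g in in_alt]
    if not in_rep and not in_alt:
        return 0                       # Case 0: no gesture at all
    if not in_rep:
        return 1                       # Case 1: gesture begins in alteration
    if spanning:
        return 3 if hesitation else 5  # Cases 3/5: one gesture spans both
    if any(g[0] >= reparandum[1] for g in in_alt):
        return 2                       # Case 2: old gesture ends, new begins
    return 4                           # Case 4: gesture in reparandum, then
                                       # hold and retraction in alteration

rep, alt = (0.0, 1.0), (1.5, 2.5)
pattern_case([(0.2, 1.1)], rep, alt)   # -> 4, the shape of our example
```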
Figure 5. Left hand (LH) and right hand (RH)
3D positions and holds during the video se-
quence of our speech repair example.
Our example speech repair is clearly Case 4. Table 6 pro-
vides an analysis of all of the speech repair gesture patterns
that occur for spk1 of the wd11 data set.
Speech repairs reflect the need for some modification in the speech production process. The Case 1, 2, and 4 gesture patterns exhibit a change in gesture state during the reparandum, the alteration, or both; hence, we call them modification gestures. The simultaneous production of modification gestures in the visual channel along with the speech repair in the audio channel suggests that these channels are highly correlated. To understand why not all speech repairs exhibit modification gesture patterns (MG), we analyzed the distribution of modification gestures for the two major classifications of speech repairs, namely content modification (CM) and content repetition (CR) repairs.
The distribution of gestures for the two kinds of speech repairs appears in Table 4. We find that modification gestures have a high correlation with content modification speech repairs.
Table 4. The distribution of modification ges-
tures (MG) with CM (content modification)
and CR (content repetition) speech repairs.
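The analysis behind Table 4 amounts to tallying a 2x2 contingency table of repair class against MG co-occurrence. A sketch with made-up counts (the actual counts are those of Table 4; the function and variable names are our own):

```python
from collections import Counter

def cooccurrence_table(repairs):
    """Tally a 2x2 table of repair class ('CM'/'CR') against whether a
    modification gesture co-occurred. `repairs` holds (cls, has_mg) pairs."""
    table = Counter()
    for repair_class, has_mg in repairs:
        table[(repair_class, has_mg)] += 1
    return table

def mg_rate(table, repair_class):
    """Fraction of repairs of the given class that had a modification gesture."""
    with_mg = table[(repair_class, True)]
    total = with_mg + table[(repair_class, False)]
    return with_mg / total if total else 0.0

# Illustrative annotations, not the paper's data:
data = ([("CM", True)] * 8 + [("CM", False)] * 2
        + [("CR", True)] * 1 + [("CR", False)] * 7)
t = cooccurrence_table(data)
mg_rate(t, "CM")   # -> 0.8, high for content modifications
mg_rate(t, "CR")   # -> 0.125, low for content repetitions
```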
Looking at the wd11-spk1 data set, we further refined the analysis of the content modification repairs into false starts (a total modification) and content replacements (a partial modification). The results are shown in Table 5. It is quite interesting that both types of speech repairs have a similar distribution of co-occurrence with a modification gesture pattern.
Some psychologists hypothesize that gestures help speakers to formulate coherent speech by aiding in the retrieval of elusive words from lexical memory [8]. Based on this hypothesis, when speech repairs are content replacements, one would expect a greater number of modification gestures, which could help in retrieving words. On the other hand, when speech repairs are simple word repetitions, it is more likely that the speaker is maintaining the conversational floor; hence, fewer gestures would be expected in this situation. Levelt [9] distinguishes between three stages of speech production, i.e., conceptualization, formulation, and articulation. During conceptualization, the speaker determines the communicative intention to produce the preverbal message. At the formulation stage, the preverbal message is transformed into a surface structure. Finally, speech is generated during the articulation stage from the surface structure [9]. Any disruption of the process is likely to incur a high retrieval cost; hence, we would expect that modification gestures would be used frequently for false starts and modification repairs, which signal such a disruption in the production process.
4. Conclusion

In this paper, we have reported the results of a measurement study of gesture and speech repair data extracted from videotapes of natural dialogs. From our analysis of this data, we found that gestures do not always co-occur with speech repairs. However, we observed a set of six gesture patterns that occur during speech repairs, three of which we have dubbed modification gesture patterns due to the fact that there is a change of gesture state in the reparandum, the alteration, or both. Using this class of gesture patterns, we found that content modification speech repairs have a high correlation with modification gesture patterns, unlike content repetitions, which have very few co-occurring modification gestures. In future research work, we will examine more data sets and then begin to build speech repair detection algorithms that utilize gesture patterns as an additional knowledge source.
Table 5. The distribution of modification ges-
tures occurring with content replacements
and false starts.
Acknowledgments

This research was supported by the Purdue Research Foundation and the National Science Foundation under Grant No. 9980054-BCS. Additional thanks go to our KDI grant collaborators.
References

[1] P. Boersma and D. Weenink. Praat, a system for doing phonetics by computer. Technical Report 132, University of Amsterdam, Institute of Phonetic Sciences, 1996.
[2] R. A. Bolt. Put that there: Voice and gesture at the graphics interface. ACM Computer Graphics, 14:262–270, 1980.
[3] R. Bryll, F. Quek, and A. Esposito. Automatic hand hold detection in natural conversation. In IEEE Workshop on Cues in Communication, Kauai, Hawaii, 2001.
[4] J. Cassell. A framework for gesture generation and interpretation. In R. Cipolla and A. Pentland, editors, Computer Vision in Human-Machine Interaction. Cambridge University Press, 1998.
[5] J. J. Godfrey, E. C. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 517–520, 1992.
[6] P. A. Heeman and J. F. Allen. Speech repairs, intonational phrases, and discourse markers: Modeling speakers' utterances in spoken dialogue. Computational Linguistics, 25(4), 1999.
[7] D. Hindle. Deterministic parsing of syntactic non-fluencies. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pages 123–128, 1983.
[8] R. M. Krauss. Why do we gesture when we speak? Current Directions in Psychological Science, 7:54–59, 1998.
[9] W. J. Levelt. Speaking: From Intention to Articulation. The MIT Press, Cambridge, MA, 1989.
[10] D. McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago, IL, 1992.
[11] D. McNeill and S. Duncan. Growth points in thinking-for-speaking. In D. McNeill, editor, Language and Gesture, pages 141–161. Cambridge University Press, 2001.
[12] C. H. Nakatani and J. Hirschberg. A speech-first model for repair detection and correction. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 46–53, 1993.
[13] C. H. Nakatani and J. Hirschberg. A corpus-based study of repair cues in spontaneous speech. Journal of the Acoustical Society of America, 95(3):1603–1616, 1994.
[14] F. Quek, R. Bryll, and X.-F. Ma. A parallel algorithm for dynamic gesture tracking. In ICCV'99 Workshop on RATFG-RTS, Corfu, Greece, 1999.
[15] E. Shriberg and A. Stolcke. Word probability after hesitations: A corpus-based study. In Proceedings of the International Conference on Spoken Language Processing, volume 3, pages 1868–1871, 1996.
[16] C. Wightman and D. Talkin. The Aligner. Entropic Research Laboratory.
Table 6. A listing of the speech repairs associated with wd11-spk1. Repair type indicates whether
the repair is a content repetition (CR), content modification (CM), or false start (FS). The beginning
and ending time of the speech repair, including the reparandum, editing phrase, and alteration, is
indicated along with the time of interruption at the end of the reparandum. Note that LP stands for
long pause and PM for pragmatic movement.