Three phase verification for spoken dialog clarification.
Sangkeun Jung Cheongjae Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
San 31, Hyoja-Dong, Pohang, 790-784, Korea
+82 54 279 5581
Gary Geunbae Lee
Spoken dialog tasks incur many errors including speech
recognition errors, understanding errors, and even dialog
management errors. These errors create a large gap between the user’s
intention and the system’s understanding, and eventually result in
misinterpretation. To close the gap, people in human-to-human
dialog try to clarify the major causes of the misunderstanding and
selectively correct them. This paper presents a method for
applying the human’s clarification techniques to human-machine
spoken dialog systems. To increase error detection precision
and error recovery efficiency for clarification dialogs, the error
detection process is organized into three systematic phases, and a
clarification expert is devised for recovering from the errors using the
three phase verification. The experimental results demonstrate that
the three phase verification could effectively catch the word and
utterance-level errors in order to increase the SLU (spoken
language understanding) performance and the clarification experts
can actually increase the dialog success rate and the dialog
Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing –
Discourse, Speech recognition and synthesis
General Terms
Algorithm, Performance, Design, Languages, Human Factors
Keywords
Clarification Dialog, Three Phase Verification, Clarification
Expert, Spoken Language Understanding, Dialog Management.
1. INTRODUCTION
A Clarification Dialog (CD) is a type of dialog used to
resolve misunderstandings between humans, or between a human
and a computer, during a dialog. The following example
shows a typical clarification dialog in an error-prone spoken
environment in the tv-guide domain.
USER: I want to watch a drama¹ Hae-Sin
SYS: Documentary Her-Jun is on MBC. Do you
want to watch it? (Drama and Hae-Sin are
misrecognized as Documentary and Her-Jun,
respectively.)
USER: No, I want to watch a drama Hae-sin.
SYS: Please repeat the program name you want to
watch. (System internally verifies that the ‘drama’ and
“Hae-Sin” are not correctly recognized. System tries to
clarify the important program name first)
SYS: Drama Hae-sin is on KBS. Do you want to
SYS: OK, showing Hae-Sin on KBS
For a successful clarification dialog like the above example, we
need to solve two problems. The first problem is to select the
targets to be clarified and classify the error types, and the second
is to clarify and recover the targets in an intuitive and efficient
manner.
The first problem is conventionally called belief confirmation.
Belief confirmation techniques have been explored by many
researchers and many confidence measures have been developed.
Most of this research has focused on measuring how much we can
trust the speech recognition results. Many good features and
methods for confidence measurement at the level of the speech
recognition decoder are summarized in .
¹ Bold-italic words designate the important content information for
the tv-guide domain dialog system.
Recently, researchers have tried to include more semantic-level
information for belief confirmation using Latent Semantic
Analysis (LSA)  and information from the understanding
module . Recent trends in confidence measurement and utterance
verification techniques are well summarized in a recent survey
The second problem has been studied using the belief
confirmation techniques. Torres et al.’s work  showed how to
use confidence scores for calculating a transition probability in
the dialog state-transition network, where the confidence scores
are calculated using the method of . McTear et al. 
developed an object-oriented dialog system which can both detect
and handle errors, and their system also adapted 's method for
detecting the errors.
A major limitation of most previous clarification dialog
research is that the targets the system tries to correct are limited
to words, and that only speech recognition errors are considered, even
though many errors can also come from the spoken language
understanding (SLU) module.
Work that does detect both speech recognition and SLU errors
 uses a single integrated estimator to classify the
type of error, even though the characteristics of speech
recognition errors and SLU errors are quite different, which
results in low detection accuracy.
To overcome these limitations, we devised a
three-phase verification method and a clarification expert. We
extend the range of errors by considering errors that come
not only from the speech recognizer but also from the SLU module. To detect
these complicated errors with high accuracy, we cascade the error detection
process into three phases: Word Error Verification, Utterance
Verification and Slot-Value Verification. The multi-level rich
information generated by this three phase verifier is passed to the
Clarification Expert (CE), which is specially devised for
handling clarification dialogs, and the CE determines an adequate
clarification strategy by considering both the error detection information
and the discourse status.
This paper is organized as follows. An expert-based dialog
management architecture, our baseline architecture for
clarification, is described in section 2. Based on this architecture, a
three-phase verification method and a new Information
Potential measure are introduced in section 3. The details of
the clarification expert are described in section 4. Extensive
experiments and analyses are shown in section 5, and finally a
conclusion is drawn in section 6.
2. SITUATION BASED SPOKEN DIALOG
O’Neill et al. proposed an object-oriented dialog system in .
Inspired by O’Neill et al.’s work, and motivated to overcome the
weaknesses of conventional dialog systems, we developed a
situation-based spoken dialog management system using the
following two dialog modeling principles:
Dialog management should be state-transition free and
based on the current situation for general response
generation (Situation-based dialog management)
Domain-dependent dialog management should be
based on a specific expert for more efficient
Most state-transition-based dialog management systems rely on
fixed state transitions to determine the dialog status using a
finite state transition model. This state-transition-based dialog
modeling guarantees fast system build-up and easy dialog
modeling. However, it is not flexible enough to handle various natural language
dialog phenomena, because the next state of the dialog is fully
determined by the fixed transitions. Also, state-transition
dialog modeling makes it difficult to transfer a dialog model from the current domain
to another domain, because the whole transition network
would need to be redesigned. To avoid this rigidity
in management, we developed a state-transition-free dialog
management model. Our system does not use a state-transition
network, but uses a situation-based dialog management strategy.
The definition of the situation in our system is as follows:
A situation is determined by various information of the
current dialog status including:
User’s utterance and intention (dialog acts)
Set of semantic slots and values
Confidence status of each slot
History of dialog in a current session
System's previous intention
Database query results of the current user query
To determine the system's intention and proper responses, we
consider all the above situation-related information and use the
three kinds of situation-based rules as follows:
Situation-action rules: rules for describing the system’s
actions under the current situation.
Constraint-relax rules: rules for relaxing the constraints
on database queries.
Frame-reset rules: rules for restarting a new dialog frame
for the case of domain switch and dialog closing
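The situation definition and the situation-action rules above can be sketched as follows. This is only an illustrative sketch: the class names, the rule format, the action names, and the fallback action are assumptions made for this example and are not taken from the system described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Situation:
    """Snapshot of the dialog status, mirroring the items listed above."""
    dialog_act: str                                        # user's intention
    slots: Dict[str, str] = field(default_factory=dict)    # semantic slots and values
    slot_confident: Dict[str, bool] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)       # dialog history in this session
    previous_system_act: Optional[str] = None
    db_result_count: int = 0                               # rows returned by the current query


@dataclass
class SituationActionRule:
    """Hypothetical rule shape: a condition on the situation plus a system action."""
    name: str
    condition: Callable[[Situation], bool]
    action: str


def select_action(situation: Situation, rules: List[SituationActionRule]) -> str:
    """Return the action of the first rule whose condition holds.

    No state-transition network is consulted; only the current situation matters.
    """
    for rule in rules:
        if rule.condition(situation):
            return rule.action
    return "ask_rephrase"  # assumed fallback when no rule fires


# Invented example rules for a tv-guide expert.
RULES = [
    SituationActionRule(
        "no_result",
        lambda s: s.dialog_act == "search_program" and s.db_result_count == 0,
        "relax_constraints",
    ),
    SituationActionRule(
        "single_result",
        lambda s: s.dialog_act == "search_program" and s.db_result_count == 1,
        "confirm_program",
    ),
]

situation = Situation(
    dialog_act="search_program",
    slots={"genre": "drama", "program_name": "Hae-Sin"},
    slot_confident={"genre": True, "program_name": True},
    db_result_count=1,
)
print(select_action(situation, RULES))
```

Because the rules inspect the whole situation rather than a current state, adding a domain only requires a new rule list, which is the flexibility argued for above.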
Like O’Neill et al.’s domain experts , we pursue an expert-based
dialog management strategy to conduct domain-oriented
dialogs. Each expert is designed as a specialist for
handling specific dialog patterns. For example, the tv-guide
expert handles tv-guide related utterances, and the movie-guide
expert handles movie-guide related utterances. The experts of each
domain are implemented by designing the ‘situation-based rules’
for the corresponding domain.
The biggest advantages of expert-based dialog management
are that the system not only conducts domain-oriented
dialogs efficiently but also offers a clean architecture for
implementing a separate clarification dialog expert. In other
words, if we view the clarification dialog as another
special form of dialog pattern, like the tv-guide and the movie-guide,
the clarification dialog model can also be designed as a special expert,
i.e., the Clarification Expert.
Fig 1 Situation Based Dialog Management Architecture with
the Clarification Expert.
Fig 1 illustrates the situation-based dialog system architecture and
its connection to the clarification expert. The components
have the following roles:
Dialog Manager: A hub module that communicates
with ASR (Automatic Speech Recognizer), SLU
(Spoken Language Understander) and an error
verification module. It also manages other dialog
components in this architecture.
Dialog History: Consists of the following two parts:
Dialog Frame: Provides a current semantic frame
for a dialog
Discourse History: Stores history information
extracted from user utterances and dialog frame
Discourse Manager: Handles the dialog by passing its
general expertise on, through inheritance, to the appropriate
Domain Expert. The discourse manager handles very
basic dialog patterns such as updating the dialog status and
saving and restoring the dialog history.
Domain Expert: Takes responsibility for handling
domain-specific dialogs. The expert itself is domain
specific, but it uses a generic dialog strategy inherited
from the Discourse Manager. Each domain expert has
its own situation-action rules, dialog frames and the
The discourse manager and the domain experts
complement each other. The discourse manager decides only the
generic dialog strategy, and it is totally domain independent.
Therefore, it can be used by any domain expert. Each domain
expert inherits the generic discourse manager knowledge, and can
handle generic dialog along with domain-specific dialog patterns.
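The inheritance relationship between the discourse manager and a domain expert can be sketched as below. This is a minimal sketch under stated assumptions: the class names, method names, and response strings are hypothetical, and only the idea of a domain expert inheriting generic behavior comes from the description above.

```python
class DiscourseManager:
    """Domain-independent behavior: update the dialog status and keep a history."""

    def __init__(self):
        self.frame = {}      # current dialog frame (slot -> value)
        self.history = []    # discourse history, one entry per turn

    def update(self, slots):
        """Generic strategy: merge new slots into the frame and log the turn."""
        self.frame.update(slots)
        self.history.append(dict(slots))

    def respond(self, slots):
        """Generic turn handling shared by every expert."""
        self.update(slots)
        return self.domain_response()

    def domain_response(self):
        # Generic fallback used when no domain expert overrides it.
        return "How can I help you?"


class TvGuideExpert(DiscourseManager):
    """Domain expert: inherits the generic strategy and adds tv-guide behavior."""

    def domain_response(self):
        if "program_name" in self.frame:
            return "Do you want to watch {}?".format(self.frame["program_name"])
        return "Which program do you want to watch?"


expert = TvGuideExpert()
print(expert.respond({"genre": "drama"}))
print(expert.respond({"program_name": "Hae-Sin"}))
```

Only `domain_response` is domain specific here; frame updates and history keeping stay in the base class, so another expert such as a movie-guide expert would reuse them unchanged.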
For ASR, we developed a speech recognizer based
on the Hidden Markov Model Toolkit (HTK). We modified
HTK so that it provides decoder-level information to the
three-phase verification process described next.
3. THREE PHASE VERIFICATION
To select the targets to be clarified, we developed a three-phase error
verification method. The first phase is word error verification,
which is conventional belief confirmation on the words
recognized by the speech recognizer. The second phase is
utterance verification, which examines whether the whole utterance
is reliable enough for dialog management to proceed. To do this,
we devise the concept of information potential, which measures
the reliability of the whole utterance in terms of ASR and SLU
confidence. The third phase is slot-value verification, which
examines the slot and the value which are extracted by SLU from
The verification target of each phase is illustrated by the
following example:
Example 1: Examples of the three-phase verification
User Utterance & Speech Recognized Utterance
I want to watch Drama Hae-Sin on KBS (User utter.)
I want to watch Drama Bae-Sin on SBS (ASR Result)
1st Phase Word Error Verification
I (confidence score: 89.45% / correct), want (78% / correct),
to (45.32% / error), watch (78.67% / correct),
Drama (87.32% / correct), Bae-Sin (65.2% / correct),
on (93% / correct), SBS (29.24% / correct)
2nd Phase Utterance Verification
I want to watch Drama Bae-Sin on SBS
Information potential : 67.42%
3rd Phase Slot-Value Verification
Channel – SBS: 65%
Genre – Drama: 97%
Program_Name – Bae-Sin: 45%
3.1 Word Error Verification
Verification in the first phase is similar to the conventional belief
confirmation approaches . It examines every word that is
recognized by the speech recognizer. However, we don’t use the
Word Error Verification (WEV) results directly; the goal of this
step is to provide rich information to the utterance verifier. This is
one of the major differences from the previous clarification dialog
We adopted some of the good confidence measures from .
We used a Maximum Entropy (MaxEnt) classifier  for
combining good confidence features and calculating the
confidence scores for classifying the word recognition errors.
The classes and features for our MaxEnt classifier are described
below.
Normalized acoustic scores: Frame normalized
acoustic scores of a node in the lattice.
Language model scores: Word Trigram scores
N-best purity: The fraction of the N-best hypotheses
in which the hypothesized word appears in the same
position in the utterance
NFrames: The number of time frames of the word
Word Length: The length of the word.
Word Lexical: The lexical form of the recognized word.
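Of the features above, N-best purity is the one whose computation is easiest to misread, so a small sketch may help. It assumes that "same position" means the same token index in each hypothesis, which is a simplification (a real decoder would more likely align words by time), and the N-best list is invented for illustration.

```python
def nbest_purity(nbest, position):
    """Fraction of N-best hypotheses that contain the top hypothesis's word
    at the same token position. `nbest[0]` is taken to be the top hypothesis."""
    word = nbest[0][position]
    hits = sum(
        1 for hyp in nbest
        if position < len(hyp) and hyp[position] == word
    )
    return hits / len(nbest)


# Invented 4-best list for the utterance used in Example 1.
nbest = [
    "I want to watch Drama Bae-Sin on SBS".split(),
    "I want to watch Drama Hae-Sin on SBS".split(),
    "I want to watch Drama Bae-Sin on KBS".split(),
    "I want to watch Drama Hae-Sin on KBS".split(),
]

purities = [nbest_purity(nbest, i) for i in range(len(nbest[0]))]
for word, purity in zip(nbest[0], purities):
    print(word, purity)
```

Words on which the hypotheses agree get a purity of 1.0, while the two contested words in this invented list get 0.5, which is the kind of signal the classifier can use to suspect a recognition error.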
3.2 Utterance Verification
The goal of a clarification dialog is to close the gap between the user’s
intention and the intention that the system finally understands. This
is closely related to measuring the amount of distortion in the
channel between the user and the dialog system. This channel is a
series of noisy channels: the
user’s intention is expressed in spoken language, the spoken
language is recognized by the ASR, the result of the ASR is
understood by the SLU, and finally the SLU results are passed to
the dialog manager. Calculating the distortion along this series is
a natural starting point for a
clarification dialog. To do this, we devised the concept of
information potential as follows:
Information Potential
  = Ratio of the user’s information (intention) that is correctly
    carried to the dialog system through the noisy channel between
    the user and the system
  = How much we can trust the information that the dialog system
    understands
  = P (Trust | Information that the dialog system understands)
That is, measuring the information potential amounts to
calculating the confidence of the recognition and understanding
modules in the channel between the user and the dialog system. In
other words, it can be expressed as follows:
Information Potential ~ < Confidence of ASR >
~ < Confidence of SLU >
~ <Other information that the
( ‘~’ means that there exists a relationship)
Based on the above definition, we calculated the information
potential by combining various information from the ASR and SLU
modules using logistic regression. The features used are as follows:
Mean of normalized acoustic scores: Mean of the
frame normalized acoustic scores of each word in the
sentence.
Mean of language model scores: Mean of the
language model scores of each word in the sentence
Mean of the N-best purity: Mean of the N-best purity
scores of each word in the sentence.
Mean of the understanding scores: Mean of the log-
likelihood scores of the SLU.
Mean of the word verification scores: Mean of the
word error confidence scores generated by the word-
error verifier in the first phase.
Number of words: The number of words in the
sentence.
Predicted word error rate: The ratio of the word
errors predicted by the word error verifier.
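To make the combination step concrete, the sketch below computes an information potential from the seven features above with a logistic regression model. The weights, the bias, and the feature values are invented for illustration; in practice they would be fit on labeled utterances, and the raw acoustic and language model scores are assumed here to have been rescaled to a comparable range.

```python
import math

# Hypothetical weights; these values are invented and not taken from the paper.
WEIGHTS = {
    "mean_acoustic": 1.2,
    "mean_lm": 0.8,
    "mean_nbest_purity": 1.5,
    "mean_understanding": 1.0,
    "mean_word_verification": 2.0,
    "num_words": -0.05,
    "predicted_wer": -2.5,
}
BIAS = -2.0


def information_potential(features):
    """P(Trust | features) under a logistic regression model."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Invented feature values for one eight-word utterance.
features = {
    "mean_acoustic": 0.7,
    "mean_lm": 0.6,
    "mean_nbest_purity": 0.8,
    "mean_understanding": 0.7,
    "mean_word_verification": 0.75,
    "num_words": 8,
    "predicted_wer": 0.125,
}
ip = information_potential(features)
print(round(ip, 4))
```

The output is a probability between 0 and 1, so it can be compared directly against a threshold; a higher predicted word error rate lowers it under these assumed weights.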
Fig 2 Flow of three phase verification and clarification
As shown in Fig 2, after calculating the information potential, the
dialog system decides the utterance-level clarification strategy. If
the information potential is lower than the threshold, the system tags
the utterance as “Unbelievable” and passes the information to the
clarification expert, which decides the proper system response. In this
case, the system asks the user to rephrase the whole utterance,
because the utterance itself is estimated to be “Unbelievable”
in terms of both ASR and SLU.
However, if the information potential is higher than the threshold,
the system tags the utterance as “Believable” and continues to the
third phase, Slot-Value Verification. Even though
the utterance itself is estimated to be “Believable,” there may be
some errors in the level of slot and value recognition. The goal of
the third phase verification is to find the slot-value level errors in
3.3 Slot-Value Verification
Slot-value verification is executed when the utterance that
is recognized by the ASR and understood by the SLU is
estimated to be “Believable.” It verifies every slot-value
pair extracted by the SLU. The difference
between this verification and conventional belief confirmation
is that we calculate the confidence of each slot-value pair by
considering not only the ASR information but also the SLU
information. As a result, we can focus on the recognition performance of
the important content words, which are more critical to
successful dialog completion.
We can turn the slot-value confidence measuring problem into a
classification problem, as we did for word error verification.
We used the same features and the same MaxEnt classification method
as in the word error verification of section 3.1.
The only difference is the following new feature:
Understanding Scores: the likelihood of the slot and value of
the spotted words generated by the SLU
If all slot-value pairs of the utterance are classified as
“Correct,” the utterance and the slot-values are passed directly to
the dialog manager. Otherwise, the slot-values tagged “Error” and
their confidence scores are passed to the clarification expert, which
decides on a proper clarification strategy.
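The routing performed by the second and third phases can be summarized in a short sketch. It assumes that the first phase has already run and that its scores are folded into the information potential; the function name, the `Verdict` type, and both threshold values are hypothetical, and thresholding a confidence score stands in for the "Correct"/"Error" decision of the classifier.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Verdict:
    route: str                                  # "dialog_manager" or "clarification_expert"
    utterance_tag: str                          # "Believable" or "Unbelievable"
    error_slots: List[Tuple[str, str, float]]   # (slot, value, confidence)


def three_phase_route(info_potential: float,
                      slot_confidences: Dict[Tuple[str, str], float],
                      utterance_threshold: float = 0.5,
                      slot_threshold: float = 0.5) -> Verdict:
    """Decide where the verified utterance goes next."""
    # Phase 2: an utterance below the threshold is rephrased as a whole.
    # (An information potential exactly at the threshold is treated as believable.)
    if info_potential < utterance_threshold:
        return Verdict("clarification_expert", "Unbelievable", [])

    # Phase 3: collect the slot-value pairs judged to be errors.
    errors = [
        (slot, value, conf)
        for (slot, value), conf in slot_confidences.items()
        if conf < slot_threshold
    ]
    route = "clarification_expert" if errors else "dialog_manager"
    return Verdict(route, "Believable", errors)


# Scores from Example 1; the thresholds are invented for this sketch.
slots = {
    ("Channel", "SBS"): 0.65,
    ("Genre", "Drama"): 0.97,
    ("Program_Name", "Bae-Sin"): 0.45,
}
verdict = three_phase_route(0.6742, slots, utterance_threshold=0.5, slot_threshold=0.7)
print(verdict.route, verdict.utterance_tag, verdict.error_slots)
```

With these assumed thresholds the utterance is believable, and the channel and the program name are handed to the clarification expert while the genre is accepted.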
4. CLARIFICATION DIALOG STRATEGY
From the results of the utterance and slot-value verifiers, we
obtain the targets that should be clarified. A target can be a full
sentence or a set of words. To clarify these targets efficiently
and systematically, we introduce a clarification expert into our
situation-based dialog management architecture.
As previously mentioned, our dialog system is built on an
expert-based architecture. Each expert is responsible for
handling the dialog of a certain domain, and it is designed to manage
domain-specific dialog patterns. Therefore, if we reformulate
‘clarification’ as a specific dialog pattern, we can model
clarification as one of the experts. However, there are
some differences between the clarification expert and the other
domain experts. The differences are as follows:
The clarification expert is a secondary expert different
from the primary working domain experts.
The clarification expert should be domain independent.
The clarification expert should be able to share all the
information of the current primary working domain expert.
Fig 3 Switch to the clarification expert (Primary Expert → “Need Clarification?” → Clarification Expert → “Complete Clarification” → Primary Expert)
As shown in Fig 3, if the three-phase verification alerts the
dialog manager that the utterance or some words should be
recovered, the dialog manager pauses the working primary
domain expert and gives control to the clarification expert. In
this step, the dialog manager gives the clarification expert access to
all of the primary expert’s information, including the dialog frame
and the discourse history.
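The hand-over between the primary expert and the clarification expert can be sketched as follows. The classes and method names are hypothetical; the only points taken from the description above are that the primary expert is paused, that the clarification expert sees the same dialog frame and discourse history, and that control returns once clarification is complete.

```python
class Expert:
    """Minimal stand-in for an expert that owns a frame and a history."""

    def __init__(self, name):
        self.name = name
        self.frame = {}
        self.history = []


class DialogManager:
    def __init__(self, primary, clarifier):
        self.primary = primary
        self.clarifier = clarifier
        self.active = primary

    def on_verification(self, needs_clarification):
        """Switch experts according to the verifier's signal."""
        if needs_clarification:
            # Share rather than copy: the clarification expert edits the
            # very same frame and history objects as the primary expert.
            self.clarifier.frame = self.primary.frame
            self.clarifier.history = self.primary.history
            self.active = self.clarifier
        else:
            # Clarification is complete (or was never needed): resume.
            self.active = self.primary
        return self.active.name


tv_guide = Expert("tv_guide")
tv_guide.frame["Program_Name"] = "Bae-Sin"
clarification = Expert("clarification")
manager = DialogManager(tv_guide, clarification)

print(manager.on_verification(True))
clarification.frame["Program_Name"] = "Hae-Sin"   # value corrected during clarification
print(manager.on_verification(False))
print(tv_guide.frame)
```

Sharing the objects by reference means a value corrected during clarification is already in the primary expert's frame when it resumes, so no merge step is needed.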
The clarification expert decides the clarification dialog strategy by
considering the confidence scores provided by the three-phase
verification module and other domain-related information.
To decide on a more efficient and systematic clarification dialog
strategy, the clarification expert considers the following
properties:
Property 1: Dependency between each error
information and dynamic change of the error
Property 2: Relative importance of the error
Most error information has a strong dependency on other
information in the same dialog domain. For example, ‘Larry King
Live’ is always broadcast on ‘CNN’ and the popular Korean
drama ‘Hae-Sin’ is always on ‘KBS’. There is a clear
dependency between the program title and the channel. Without
considering these dependencies, we end up taking unnecessary
clarification steps. The following clarification target example
demonstrates the importance of considering the dependency when
deciding on a proper clarification strategy.
Example 2. Target example of selected slot-values for
User Utterance: I want to watch drama Hae-Sin on KBS
ASR Result: I want to watch drama Bae-Sin on SBS
Targets needed to be clarified:
Program_Name - Bae-Sin
Channel - SBS
In this example, if we do not know that there is a strong dependency
between ‘Hae-Sin’ and ‘KBS’, the clarification expert asks the user
to rephrase both the channel and the program name. However, if we know
the dependency, the clarification expert does not need to clarify
the channel once ‘Bae-Sin’ is clarified to ‘Hae-Sin’,
because the system already knows that ‘Hae-Sin’ is always on ‘KBS’.
In this way, we can implement a more efficient clarification dialog
strategy by considering Property 1.
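The effect of Property 1 on Example 2 can be sketched as follows. The dependency table and the function are hypothetical; a real system would more plausibly obtain the program-to-channel dependency from its program database than from a hard-coded dictionary.

```python
# Hypothetical dependency table: program name -> channel it is always on.
PROGRAM_CHANNEL = {"Hae-Sin": "KBS", "Larry King Live": "CNN"}


def apply_clarification(frame, targets, slot, new_value):
    """Store a clarified value, then drop the targets it makes unnecessary."""
    frame = dict(frame)          # work on a copy of the dialog frame
    frame[slot] = new_value
    remaining = [t for t in targets if t != slot]

    # Property 1: a known program name determines its channel,
    # so the channel no longer needs to be asked about.
    if slot == "Program_Name" and new_value in PROGRAM_CHANNEL:
        frame["Channel"] = PROGRAM_CHANNEL[new_value]
        remaining = [t for t in remaining if t != "Channel"]
    return frame, remaining


frame = {"Genre": "drama", "Program_Name": "Bae-Sin", "Channel": "SBS"}
targets = ["Program_Name", "Channel"]

frame, targets = apply_clarification(frame, targets, "Program_Name", "Hae-Sin")
print(frame)
print(targets)
```

After the single clarification of the program name, the channel is corrected to KBS and the target list is empty, so one clarification turn replaces two.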
Property 2 can be used for choosing the clarification order among
multiple targets. As in Example 2, when there are two or more
slot-values to be clarified, the clarification expert considers
the relative importance property to set the priority of clarification.
In most cases, the priority from relative importance is
closely related to the range of the slot types. For example, Fig 4
illustrates the range of information types in the tv-guide domain. As
we can see, in most cases, if the ‘playing actor/actress’ is
determined with certainty, ‘program_name’, ‘channel’ and ‘genre’ are
determined automatically. Therefore, in Example 2, the
clarification expert tries to clarify ‘program name’ first