Conference paper, RO-MAN 2010: 19th IEEE International Symposium on Robot and Human Interactive Communication, Principe di Piemonte, Viareggio, Italy, Sept. 12-15, 2010. DOI: 10.1109/ROMAN.2010.5598729
Detecting Robot-Directed Speech by Situated Understanding
in Object Manipulation Tasks
Xiang Zuo, Naoto Iwahashi, Ryo Taguchi, Kotaro Funakoshi,
Mikio Nakano, Shigeki Matsuda, Komei Sugiura and Natsuki Oka
Abstract: In this paper, we propose a novel method for a robot to detect robot-directed speech, that is, to distinguish speech that users speak to a robot from speech that users speak to other people or to themselves. The originality of this work is the introduction of a multimodal semantic confidence (MSC) measure, which is used for domain classification of input speech based on the decision on whether the speech can be interpreted as a feasible action under the current physical situation in an object manipulation task. This measure is calculated by integrating speech, object, and motion confidence with weightings that are optimized by logistic regression. Then we integrate this measure with gaze tracking and conduct experiments under conditions of natural human-robot interaction. Experimental results show that the proposed method achieves a high performance of 94% and 96% in average recall and precision rates, respectively, for robot-directed speech detection.
I. INTRODUCTION
Robots are now being designed to be a part of the everyday
lives of people in social and home environments. One of the
key issues for practical use of such robots is the development
of user-friendly interfaces. Speech is one of our most effective communication tools, and speech recognition is therefore a natural component of a human-robot interface. In recent works, many speech-based human-robot interfaces have been implemented, e.g., [1]. For such an interface, the functional capability to
detect robot-directed (RD) speech is crucial. For example, a
user’s speech directed to another human listener should not
be recognized as commands directed to a robot.
To resolve this issue, many works have used human
physical behaviors to estimate the target of the user’s speech.
Lang et al. [2] proposed a method for a robot to detect the
direction of a person’s attention based on face recognition,
sound source localization, and leg detection. Yonezawa et
al. [3] proposed an interface for a robot to communicate
with users based on detecting the gaze direction during their
speech. However, this kind of method raises the possibility
that users may say something unrelated to the robot even
while they are looking at it.
Xiang Zuo is with Advanced Telecommunication Research Labs and Kyoto Institute of Technology, Japan (sasyou@atr.jp). Naoto Iwahashi is with Advanced Telecommunication Research Labs and the National Institute of Information and Communications Technology, Japan. Ryo Taguchi is with Advanced Telecommunication Research Labs and Nagoya Institute of Technology, Japan. Kotaro Funakoshi and Mikio Nakano are with Honda Research Institute Japan Co., Ltd., Japan. Shigeki Matsuda and Komei Sugiura are with the National Institute of Information and Communications Technology, Japan. Natsuki Oka is with Kyoto Institute of Technology, Japan.

To settle such an issue, the proposed method is based not only on gaze tracking but also on domain classification of the input speech into RD speech and out-of-domain (OOD) speech. Domain classification for robots in previous works was based mainly on prosodic features. For example, the authors of [4] showed that the difference in prosodic features between RD speech and other speech usually appears at the head and the tail of the speech, and they proposed a method to detect RD speech by using such features. However, their method also raises the issue of requiring users to adjust their prosody to fit the system, which places an additional burden on them.
In this work, domain classification is based on a multi-
modal semantic confidence (MSC) measure. MSC has the
key advantage that it is not based on using prosodic features
of input speech as with the method described above; rather,
it is based on semantic features that determine whether the
speech can be interpreted as a feasible action under the
current physical situation. The target task of this work is
an object manipulation task in which a robot manipulates
objects according to a user’s speech. An example of such a
task in a home environment is a user telling a robot to "Put the dish in the cupboard." Solving this task is fundamental for
assistive robots and requires robots to deal with speech and
image signals and to carry out a motion in accordance with
the speech. Therefore, the MSC measure is calculated by
integrating information obtained from speech, object images,
and robot motion.
The rest of this paper is organized as follows. Section
II gives the details of the object manipulation task. Section
III describes the proposed RD speech detection method.
The experimental methodology and results are presented in
Section IV, and Section V gives a discussion. Finally, Section
VI concludes the paper.
II. OBJECT MANIPULATION TASK
In this work, we assume that humans use a robot to
perform an object manipulation task. Figure 1 shows the
robot used in this task. It consists of a manipulator with 7
degrees of freedom (DOFs), a 4-DOF multi-fingered grasper,
a SANKEN CS-3e directional microphone, a stereo vision
camera for video signal input, an infrared sensor for 3-
dimensional distance measurement, a camera for human gaze
tracking, and a head unit for robot gaze expression.
In the object manipulation task, users sit in front of
the robot and command the robot by speech to manipulate
objects on a table located between the robot and the user.
Fig. 1. Robot used in the object manipulation task.
Fig. 2. Example of object manipulation tasks.

Figure 2 shows an example of this task. In this figure, the robot is told to place Object 1 (Kermit) on Object 2 (big box) by the command speech "Place-on Kermit big box" (Kermit is the name of the stuffed toy used in our experiment; commands made in Japanese have been translated into English in this paper),
and the robot executes an action according to this speech.
The solid line in Fig. 2 shows the trajectory of the moving
object manipulated by the robot.
The commands used in this task are represented by a
sequence of phrases, each of which refers to a motion, an
object to be manipulated (“trajector”), or a reference object
for the motion (“landmark”). In the case shown in Fig. 2, the
phrases for the motion, trajector, and landmark are "Place-on," "Kermit," and "big box," respectively. Moreover, fragmentary commands without a trajector phrase or a landmark phrase, such as "Place-on big box" or just "Place-on," are also acceptable.
To execute a correct action according to such a command,
the robot must understand the meaning of each word in it,
which is grounded on the physical situation. The robot must
also have a belief about the context information to estimate
the corresponding objects for fragmentary commands.
In this work, we used the speech understanding method
proposed by [5] to interpret the input speech as a possible
action for the robot under the current physical situation.
However, for an object manipulation task in a real-world
environment, there may exist OOD speech such as chatting,
soliloquies, or noise. Consequently, an RD speech detection
method should be used.
III. PROPOSED RD SPEECH DETECTION METHOD
The proposed RD speech detection method is based on
integrating gaze tracking and the MSC measure. A flowchart
is given in Fig. 3. First, a Gaussian mixture model based voice activity detection method (GMM-based VAD) [6] is carried out to detect speech from the continuous audio signal, and gaze tracking is performed to estimate the gaze direction from the camera images. (In this work, gaze direction was identified from the human face angle; we used faceAPI (http://www.seeingmachines.com) to extract face angles from the images captured by a camera.) If the proportion of the user's gaze at the robot during her/his speech is higher than a certain threshold η, the robot judges that the user was looking at it while speaking. Speech made during periods when the user is not looking at the robot is rejected. Then, for the speech detected while the user was looking at the robot, speech understanding is performed to output the indices of a trajector object and a landmark object, a motion trajectory, and the corresponding phrases, each of which consists of recognized words. Three confidence measures, for speech ($C_S$), object image ($C_O$), and motion ($C_M$), are then calculated to evaluate the feasibility of the outputted word sequence, the trajector and landmark, and the motion, respectively. The weighted sum of these confidence measures with a bias is inputted to a logistic function. The bias and the weightings $\{\theta_0, \theta_1, \theta_2, \theta_3\}$ are optimized by logistic regression [7]. The MSC measure is defined as the output of the logistic function, and it represents the probability that the speech is RD speech. If the MSC measure is higher than a threshold δ, the robot judges that the input speech is RD speech and executes an action according to it. In the rest of this section, we give details of the speech understanding process and the MSC measure.

Fig. 3. Flowchart of the proposed RD speech detection method: the audio signal is segmented by GMM-based VAD and the camera images feed gaze tracking; if the user is looking at the robot during the speech, speech understanding is performed against the physical situation, the confidence measures $C_S$, $C_O$, and $C_M$ are combined with weights $\theta_0, \ldots, \theta_3$ into the MSC measure $C_{MS}(s, O, q)$, and the speech is classified as RD if $C_{MS}(s, O, q) > \delta$, otherwise as OOD.
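As a concrete illustration of the gaze filter in this flowchart, the following sketch (ours, not the paper's implementation; the per-frame boolean representation of the face-angle output is an assumption) checks whether the proportion of frames in which the user faces the robot during the detected speech segment exceeds the threshold η.

```python
def passes_gaze_filter(gaze_flags, eta=0.5):
    """Hypothetical sketch of the gaze filter in Fig. 3.

    gaze_flags: per-frame booleans covering the speech segment returned by
                the GMM-based VAD; True means the face angle (e.g., from
                faceAPI) indicates the user is looking at the robot.
    eta:        gaze-proportion threshold (0.5 in the on-line experiment).
    """
    # Proportion of the speech segment spent looking at the robot.
    return sum(gaze_flags) / len(gaze_flags) > eta
```

Speech that fails this check is rejected outright; speech that passes is handed to speech understanding and the MSC test sketched later in this section.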
A. Speech Understanding
Given input speech s and a current physical situation
consisting of object information O and behavioral context
q, speech understanding selects the optimal action a based
on a multimodal integrated user model. $O$ is represented as $O = \{(o_{1,f}, o_{1,p}), (o_{2,f}, o_{2,p}), \ldots, (o_{m,f}, o_{m,p})\}$, which includes the visual features $o_{i,f}$ and positions $o_{i,p}$ of all objects in the current situation, where $m$ denotes the number of objects and $i$ denotes the index of each object that is dynamically given in the situation. $q$ includes information on which objects were a trajector and a landmark in the previous action and on which object the user is now holding. $a$ is defined as $a = (t, \xi)$, where $t$ and $\xi$ denote the index of the trajector and a trajectory of motion, respectively. A user model integrating the five belief modules (1) speech, (2) object image, (3) motion, (4) motion-object relationship, and (5) behavioral context is called an integrated belief. Each belief module and the integrated belief are learned by the interaction between a user and the robot in a real-world environment.
1) Lexicon and Grammar: The robot initially had basic linguistic knowledge, including a lexicon $L$ and a grammar $G_r$. $L$ consists of pairs of a word and a concept, each of which represents an object image or a motion. The words are represented by HMMs using mel-scale cepstrum coefficients and their delta parameters (25-dimensional) as features. The concepts of object images are represented by Gaussian functions in a multi-dimensional visual feature space (size, Lab color space ($L^*$, $a^*$, $b^*$), and shape). The concepts of motions are represented by HMMs using the sequence of three-dimensional positions and their delta parameters as features.
The word sequence of speech $s$ is interpreted as a conceptual structure $z = [(\alpha_1, w_{\alpha_1}), (\alpha_2, w_{\alpha_2}), (\alpha_3, w_{\alpha_3})]$, where $\alpha_i$ represents the attribute of a phrase and has a value among $\{M, T, L\}$. $w_M$, $w_T$, and $w_L$ represent the phrases describing a motion, a trajector, and a landmark, respectively. For example, the user's utterance "Place-on Kermit big box" is interpreted as follows: [(M, Place-on), (T, Kermit), (L, big box)]. The grammar $G_r$ is a statistical language model that is represented by a set of occurrence probabilities for the possible orders of attributes in the conceptual structure.
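For concreteness, the lexicon pairs and the conceptual structure $z$ can be pictured as simple data structures. The sketch below is purely illustrative; the dictionary layout and the placeholder model names are our assumptions, not the paper's internal representation.

```python
# Hypothetical encoding of lexicon entries and a conceptual structure.
# In the paper each word is paired with a concept: a Gaussian over visual
# features for object words, or an HMM over trajectories for motion words;
# here the models are just tagged placeholders.
lexicon = {
    "kermit":   {"type": "object", "model": "gaussian_visual_model"},
    "big":      {"type": "object", "model": "gaussian_visual_model"},
    "box":      {"type": "object", "model": "gaussian_visual_model"},
    "place-on": {"type": "motion", "model": "trajectory_hmm"},
}

# Conceptual structure z for "Place-on Kermit big box":
# a list of (attribute, phrase) pairs with attributes M (motion),
# T (trajector), and L (landmark).
z = [("M", ["place-on"]), ("T", ["kermit"]), ("L", ["big", "box"])]
```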
2) Belief Modules and Integrated Belief: Each of the five belief modules in the integrated belief is defined as follows.
Speech $B_S$: This module is represented as the log probability of $s$ conditioned on $z$, under lexicon $L$ and grammar $G_r$.
Object image $B_O$: This module is represented as the log likelihood of $w_T$ and $w_L$ given the trajector's and the landmark's visual features $o_{t,f}$ and $o_{l,f}$.
Motion $B_M$: This module is represented as the log likelihood of $w_M$ given trajectory $\xi$.
Motion-object relationship $B_R$: This module represents the belief that, in the motion corresponding to $w_M$, features $o_{t,f}$ and $o_{l,f}$ are typical for a trajector and a landmark, respectively. This belief is represented by a multivariate Gaussian distribution with a parameter set $R$.
Behavioral context $B_H$: This module represents the belief that the current speech refers to object $o$, given behavioral context $q$, with a parameter set $H$.
Given the weighting parameter set $\Gamma = \{\gamma_1, \ldots, \gamma_5\}$, the degree of correspondence between speech $s$ and action $a$ is represented by the integrated belief function $\Psi$, written as

$$
\begin{aligned}
\Psi(s, a, O, q, L, G_r, R, H, \Gamma) = \max_{z,\, l} \Big(
& \gamma_1 \log P(s \mid z; L)\, P(z; G_r) && [B_S] \\
{}+{} & \gamma_2 \big( \log P(o_{t,f} \mid w_T; L) + \log P(o_{l,f} \mid w_L; L) \big) && [B_O] \\
{}+{} & \gamma_3 \log P(\xi \mid o_{t,p}, o_{l,p}, w_M; L) && [B_M] \\
{}+{} & \gamma_4 \log P(o_{t,f}, o_{l,f} \mid w_M; R) && [B_R] \\
{}+{} & \gamma_5 \big( B_H(o_t, q; H) + B_H(o_l, q; H) \big) \Big), && [B_H]
\end{aligned}
\tag{1}
$$

where $l$ denotes the index of the landmark, $o_t$ and $o_l$ denote the trajector and landmark, respectively, and $o_{t,p}$ and $o_{l,p}$ denote the positions of $o_t$ and $o_l$, respectively. Conceptual structure $z$ and landmark $o_l$ are selected to maximize the value of $\Psi$. Then, as the meaning of speech $s$, the corresponding action $\hat{a}$ is determined by maximizing $\Psi$:

$$\hat{a} = (\hat{t}, \hat{\xi}) = \operatorname*{argmax}_{a} \Psi(s, a, O, q, L, G_r, R, H, \Gamma). \tag{2}$$

Finally, the action $\hat{a} = (\hat{t}, \hat{\xi})$, the index of the selected landmark $\hat{l}$, and the conceptual structure (recognized word sequence) $\hat{z}$ are outputted from the speech understanding process.
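The computation in Eqs. (1) and (2) amounts to scoring every candidate interpretation with a weighted sum of belief-module log scores and keeping the maximizer. The sketch below shows only that shape; all callables (speech_score, object_score, motion_score, relation_score, context_score) and the candidate generators are hypothetical stand-ins for $B_S$, $B_O$, $B_M$, $B_R$, and $B_H$, not the paper's code.

```python
import itertools

def understand(s, objects, q, candidate_structures, candidate_trajectories,
               gamma, speech_score, object_score, motion_score,
               relation_score, context_score):
    """Hypothetical sketch of Eqs. (1)-(2): pick the action maximizing Psi.

    candidate_structures:   possible conceptual structures z for speech s
    candidate_trajectories: callable giving candidate trajectories xi for a
                            (trajector index, landmark index, z) triple
    gamma:                  weights (gamma_1 .. gamma_5)
    *_score:                callables standing in for B_S, B_O, B_M, B_R, B_H
    """
    best = None
    for z, t, l in itertools.product(candidate_structures,
                                     range(len(objects)), range(len(objects))):
        if t == l:
            continue
        for xi in candidate_trajectories(t, l, z):
            psi = (gamma[0] * speech_score(s, z)                    # [B_S]
                   + gamma[1] * object_score(objects, t, l, z)      # [B_O]
                   + gamma[2] * motion_score(xi, objects, t, l, z)  # [B_M]
                   + gamma[3] * relation_score(objects, t, l, z)    # [B_R]
                   + gamma[4] * context_score(objects, t, l, q))    # [B_H]
            if best is None or psi > best[0]:
                best = (psi, (t, xi), l, z)
    psi_max, a_hat, l_hat, z_hat = best
    return a_hat, l_hat, z_hat, psi_max
```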
B. MSC Measure
Next, we describe the proposed MSC measure. The MSC measure $C_{MS}$ is calculated based on the outputs of speech understanding and represents an RD speech probability. For input speech $s$ and the current physical situation $(O, q)$, speech understanding is performed first, and then $C_{MS}$ is calculated by logistic regression as

$$C_{MS}(s, O, q) = P(\mathrm{domain} = \mathrm{RD} \mid s, O, q) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 C_S + \theta_2 C_O + \theta_3 C_M)}}. \tag{3}$$
Logistic regression is a type of predictive model that can
be used when the target variable is a categorical variable
with two categories, which is quite suitable for the domain
classification problem in this work. In addition, the output
of the logistic function has a value in the range from 0.0 to
1.0, which can be used directly to represent an RD speech
probability.
Given a threshold δ, speech $s$ with an MSC measure higher than δ is treated as RD speech. The belief modules $B_S$, $B_O$, and $B_M$ are also used for calculating $C_S$, $C_O$, and $C_M$, each of which is described as follows.
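A minimal sketch of Eq. (3) follows, assuming the standard logistic form reconstructed above; the default weights and threshold are the values later reported in Section IV-B and are shown only for illustration.

```python
import math

def msc_measure(c_s, c_o, c_m, theta):
    """Eq. (3): logistic function of the weighted confidence measures.

    theta = (theta_0, theta_1, theta_2, theta_3): bias and weights learned
    by logistic regression (Sec. III-B.4).
    """
    z = theta[0] + theta[1] * c_s + theta[2] * c_o + theta[3] * c_m
    return 1.0 / (1.0 + math.exp(-z))

def is_rd_speech(c_s, c_o, c_m, theta=(5.9, 0.00011, 0.053, 0.74), delta=0.79):
    """Classify the utterance as RD speech when the MSC measure exceeds
    the threshold delta; defaults are the experimentally reported values,
    used here purely as an example."""
    return msc_measure(c_s, c_o, c_m, theta) > delta
```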
1) Speech Confidence Measure: The speech confidence measure $C_S$ is used to evaluate the reliability of the recognized word sequence $\hat{z}$. It is calculated by dividing the likelihood of $\hat{z}$ by the likelihood of a maximum-likelihood phoneme sequence under the phoneme network $G_p$, and it is written as

$$C_S(s, \hat{z}; L, G_p) = \frac{1}{n(s)} \log \frac{P(s \mid \hat{z}; L)}{\max_{u \in \mathcal{L}(G_p)} P(s \mid u; A)}, \tag{4}$$

where $n(s)$ denotes the analysis frame length of the input speech, $P(s \mid \hat{z}; L)$ denotes the likelihood of $\hat{z}$ for input speech $s$ and is given by a part of $B_S$, $u$ denotes a phoneme sequence, $A$ denotes the phoneme acoustic model used in $B_S$, and $\mathcal{L}(G_p)$ denotes the set of possible phoneme sequences accepted by the Japanese phoneme network $G_p$. For speech that matches the robot command grammar $G_r$, $C_S$ has a greater value than for speech that does not match $G_r$.

The speech confidence measure is conventionally used as a confidence measure for speech recognition [8]. The basic idea is to treat the likelihood of the most typical (maximum-likelihood) phoneme sequence for the input speech as a baseline. Based on this idea, the object and motion confidence measures are defined as follows.
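Since the two likelihoods in Eq. (4) come from an ASR decoder, a sketch can only take them as given. The function below is our illustration (the argument names are assumptions); it simply forms the frame-normalized log-likelihood ratio.

```python
def speech_confidence(log_lik_hypothesis, log_lik_best_phonemes, n_frames):
    """Eq. (4): per-frame log-likelihood ratio between the recognized word
    sequence z_hat and the best unconstrained phoneme sequence.

    log_lik_hypothesis:    log P(s | z_hat; L), from the decoder (part of B_S)
    log_lik_best_phonemes: max_u log P(s | u; A) over the phoneme network G_p
    n_frames:              analysis frame length n(s) of the input speech
    """
    return (log_lik_hypothesis - log_lik_best_phonemes) / n_frames
```

Values near zero indicate that the grammar-constrained hypothesis explains the speech almost as well as the free phoneme loop; strongly negative values are typical of OOD speech.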
2) Object Confidence Measure: The object confidence measure $C_O$ is used to evaluate the reliability that the outputted trajector $o_{\hat{t}}$ and landmark $o_{\hat{l}}$ are referred to by $\hat{w}_T$ and $\hat{w}_L$. It is calculated by dividing the likelihood of the visual features $o_{\hat{t},f}$ and $o_{\hat{l},f}$ by a baseline obtained from the likelihood of the most typical visual features for the object models of $\hat{w}_T$ and $\hat{w}_L$. In this work, the maximum probability densities of the Gaussian functions are used as these baselines. The object confidence measure $C_O$ is then written as

$$C_O(o_{\hat{t},f}, o_{\hat{l},f}, \hat{w}_T, \hat{w}_L; L) = \log \frac{P(o_{\hat{t},f} \mid \hat{w}_T; L)\, P(o_{\hat{l},f} \mid \hat{w}_L; L)}{\max_{o_f} P(o_f \mid \hat{w}_T)\, \max_{o_f} P(o_f \mid \hat{w}_L)}, \tag{5}$$

where $P(o_{\hat{t},f} \mid \hat{w}_T; L)$ and $P(o_{\hat{l},f} \mid \hat{w}_L; L)$ denote the likelihoods of $o_{\hat{t},f}$ and $o_{\hat{l},f}$ and are given by $B_O$; furthermore, $\max_{o_f} P(o_f \mid \hat{w}_T)$ and $\max_{o_f} P(o_f \mid \hat{w}_L)$ denote the maximum probability densities of the Gaussian functions, and $o_f$ denotes the visual features in the object models.
For example, Fig. 4(a) illustrates a physical situation in which a low object confidence measure was obtained for the input OOD speech "There is a red box." Here, the speech understanding process recognized the input speech as the word sequence "Raise red box," and an action of the robot raising object 1 was outputted (solid line), because the "red box" did not exist and thus object 1, which has the same color, was selected as the trajector. However, the visual features of object 1 were very different from those of a "red box," resulting in a low value of $C_O$.
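Under the assumption that each object word's concept is a single Gaussian over visual features, Eq. (5) can be sketched as follows; the (mean, covariance) tuple format and the function name are ours, not the paper's.

```python
import numpy as np
from scipy.stats import multivariate_normal

def object_confidence(traj_features, land_features, traj_model, land_model):
    """Eq. (5), sketched: log-likelihood of the selected objects' visual
    features, normalized by the peak density of each word's Gaussian model.

    traj_features, land_features: visual feature vectors o_{t,f}, o_{l,f}
    traj_model, land_model:       (mean, cov) of the Gaussian concepts for
                                  the recognized trajector/landmark words
                                  (hypothetical format, not the paper's)
    """
    def log_ratio(x, model):
        mean, cov = model
        dist = multivariate_normal(mean=mean, cov=cov)
        # A Gaussian's density peaks at its mean, so logpdf(mean) is the
        # baseline max_{o_f} P(o_f | w).
        return dist.logpdf(x) - dist.logpdf(mean)

    return (log_ratio(np.asarray(traj_features), traj_model)
            + log_ratio(np.asarray(land_features), land_model))
```

The measure is zero when both objects sit exactly at their word models' means and becomes strongly negative when, as in the "red box" example above, the selected object is visually atypical for the recognized word.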
3) Motion Confidence Measure: The motion confidence measure $C_M$ is used to evaluate the reliability that the outputted trajectory $\hat{\xi}$ is referred to by $\hat{w}_M$. It is calculated by dividing the likelihood of $\hat{\xi}$ by a baseline obtained from the likelihood of the most typical trajectory $\tilde{\xi}$ for the motion model of $\hat{w}_M$. In this work, $\tilde{\xi}$ is written as

$$\tilde{\xi} = \operatorname*{argmax}_{\xi,\, o^{\mathrm{traj}}_{p}} P(\xi \mid o^{\mathrm{traj}}_{p}, o_{\hat{l},p}, \hat{w}_M; L), \tag{6}$$

where $o^{\mathrm{traj}}_{p}$ denotes the initial position of the trajector. $\tilde{\xi}$ is obtained by treating $o^{\mathrm{traj}}_{p}$ as a variable. The likelihood of $\tilde{\xi}$ is the maximum output probability of the HMMs; in this work, we used the method proposed by [9] to obtain this probability. Different from $\hat{\xi}$, the trajector's initial position for $\tilde{\xi}$ is unconstrained, and the likelihood of $\tilde{\xi}$ has a greater value than that of $\hat{\xi}$. The motion confidence measure $C_M$ is then written as

$$C_M(\hat{\xi}, \hat{w}_M; L) = \log \frac{P(\hat{\xi} \mid o_{\hat{t},p}, o_{\hat{l},p}, \hat{w}_M; L)}{\max_{\xi,\, o^{\mathrm{traj}}_{p}} P(\xi \mid o^{\mathrm{traj}}_{p}, o_{\hat{l},p}, \hat{w}_M; L)}, \tag{7}$$

where $P(\hat{\xi} \mid o_{\hat{t},p}, o_{\hat{l},p}, \hat{w}_M; L)$ denotes the likelihood of $\hat{\xi}$ and is given by $B_M$.
For example, Fig. 4(b) shows a physical situation in which a low motion confidence measure was obtained for the input OOD speech "Bring me that Chutotoro." Here, the speech understanding process recognized the input speech as the word sequence "Move-away Chutotoro," and an action of the robot moving object 1 away from object 2 was outputted (solid line). However, the typical trajectory of "move-away" is for one object to move away from another object that is close to it (dotted line). Here, the trajectory of the outputted action was very different from the typical trajectory, resulting in a low value of $C_M$.
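Eq. (7) can be sketched in the same normalized-log-likelihood-ratio style. Because the trajectory likelihood comes from the motion HMM (and the unconstrained maximum in Eq. (6) from the trajectory-generation method of [9]), the sketch below leaves that likelihood as a placeholder callable and approximates the maximization by a search over candidate initial positions; all names are assumptions, not the paper's implementation.

```python
import numpy as np

def motion_confidence(log_lik_xi_hat, landmark_pos, motion_log_lik,
                      candidate_start_positions):
    """Eqs. (6)-(7), sketched: normalize the likelihood of the planned
    trajectory xi_hat by the best likelihood achievable when the
    trajector's initial position is left free.

    log_lik_xi_hat:            log P(xi_hat | o_{t,p}, o_{l,p}, w_M; L), from B_M
    motion_log_lik(p0, l_pos): placeholder callable returning the maximum
                               trajectory log-likelihood of the motion HMM
                               for w_M when the trajectory starts at p0
    candidate_start_positions: grid of initial positions approximating the
                               unconstrained maximization in Eq. (6)
    """
    baseline = max(motion_log_lik(np.asarray(p0), np.asarray(landmark_pos))
                   for p0 in candidate_start_positions)
    return log_lik_xi_hat - baseline
```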
4) Optimization of Weights: We now consider the problem of estimating the weights $\Theta$ in Eq. (3). The $i$th training sample is given as the pair of an input signal $(s_i, O_i, q_i)$ and a teaching signal $d_i$. Thus, the training set $T_N$ contains $N$ samples:

$$T_N = \{(s_i, O_i, q_i, d_i) \mid i = 1, \ldots, N\}, \tag{8}$$

where $d_i$ is 0 or 1, representing OOD speech or RD speech, respectively. The likelihood function is written as

$$P(\boldsymbol{d} \mid \Theta) = \prod_{i=1}^{N} \big( C_{MS}(s_i, O_i, q_i) \big)^{d_i} \big( 1 - C_{MS}(s_i, O_i, q_i) \big)^{1 - d_i}, \tag{9}$$

where $\boldsymbol{d} = (d_1, \ldots, d_N)$. $\Theta$ is optimized by maximum-likelihood estimation of Eq. (9) using Fisher's scoring algorithm [10].

Fig. 4. Example cases where the object and motion confidence measures are low: (a) input speech "There is a red box." recognized as [Raise red box.]; (b) input speech "Bring me that Chutotoro." recognized as [Move-away Chutotoro.]
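For reference, maximizing Eq. (9) by Fisher scoring coincides, for logistic regression, with Newton's method (iteratively reweighted least squares). A minimal numpy sketch of that fit, not the paper's implementation, is given below; the data layout is an assumption.

```python
import numpy as np

def fit_msc_weights(C, d, n_iter=50, tol=1e-8):
    """Maximum-likelihood estimation of Theta in Eq. (3) by iteratively
    reweighted least squares (Newton/Fisher scoring for logistic regression).

    C: (N, 3) array of confidence triples (C_S, C_O, C_M), one row per utterance
    d: (N,) array of labels, 1 for RD speech and 0 for OOD speech
    Returns theta = (theta_0, theta_1, theta_2, theta_3).
    """
    X = np.hstack([np.ones((C.shape[0], 1)), C])   # prepend bias column
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))       # C_MS under current theta
        W = p * (1.0 - p)                          # Fisher information weights
        grad = X.T @ (d - p)                       # gradient of log Eq. (9)
        H = X.T @ (X * W[:, None])                 # expected Hessian X^T W X
        step = np.linalg.solve(H, grad)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```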
IV. EXPERIMENTS
A. Experimental Setting
We first evaluated the performance of MSC. This evaluation was performed in an off-line simulation experiment in which gaze tracking was not used and speech was extracted manually, without using the GMM-based VAD, in order to avoid its detection errors. The weighting set Θ and the threshold δ were also optimized in this experiment. Then we
performed an on-line experiment with the robot to evaluate
the entire system.
The robot lexicon L used in both experiments has 50
words, including 31 nouns and adjectives representing 40
objects and 19 verbs representing 10 kinds of motions. L also
includes five Japanese postpositions. Unlike the other words in L, the postpositions are not associated with concepts; they allow users to speak commands in a more natural way. The parameter set $\Gamma$ in Eq. (1) was $\gamma_1 = 1.00$, $\gamma_2 = 0.75$, $\gamma_3 = 1.03$, $\gamma_4 = 0.56$, and $\gamma_5 = 1.88$.
B. Off-line Experiment by Simulation
1) Setting: The off-line experiment was conducted under
both clean and noisy conditions using a set of pairs of speech
s and scene information (O, q). Figure 4 (a) shows an
example of scene information. The yellow box on object 3
represents the behavioral context q, which means object 3
was manipulated most recently. We prepared 160 different
such scene files, each of which included three objects on
average. To pair with the scene files, we also prepared 160
different speech samples and recorded them under both clean
and noisy conditions as follows.
Clean condition: We recorded the speech in a soundproof
room without noise. A subject sat on a chair one meter from
the SANKEN CS-3e directional microphone and read out a
text in Japanese.
Noisy condition: We added dining-hall noise at a level of 50 to 52 dBA to each speech recording gathered under the clean condition.
We gathered the speech recordings from 16 subjects, 8 males and 8 females. All subjects were native Japanese
speakers. As a result, 16 sets of speech-scene pairs were
obtained, each of which included 320 pairs (160 for clean and
160 for noisy conditions). These pairs were manually labeled
as either RD or OOD and then inputted into the system.
Fig. 5. Average precision-recall curves under the clean condition (precision (%) vs. recall (%), both from 70 to 100, for Speech, Speech+Object, Speech+Motion, and MSC (Speech+Object+Motion)).
For each pair, speech understanding was first performed, and then the MSC measure was calculated. (In a separate evaluation of the speech understanding itself using the RD speech-scene pairs, 99.8% and 96.3% of the RD speech was correctly interpreted under the clean and noisy conditions, respectively.) During the
speech understanding experiment, a Gaussian mixture model
based noise suppression method [11] was performed, and
ATRASR [12] was used for phoneme- and word-sequence
recognition. With ATRASR, accuracies of 83% and 67%
in phoneme recognition were obtained under the clean and
noisy conditions, respectively.
The evaluation under the clean condition was performed by leave-one-out cross-validation: 15 subjects' data were used as a training set to learn the weights $\Theta$ in Eq. (3), the remaining subject's data were used as a test set, and this procedure was repeated 16 times. The weights $\hat{\Theta}$ learned using all 16 subjects' data were used for the evaluation under the noisy condition, where all noisy speech-scene pairs collected from the 16 subjects were treated as a test set.
For comparison, four cases were evaluated for RD speech
detection by using: (1) the speech confidence measure only,
(2) the speech and object confidence measures, (3) the speech
and motion confidence measures, and (4) the MSC measure.
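A sketch of this protocol is given below. It substitutes scikit-learn's nearly unregularized logistic regression for the Fisher-scoring fit (essentially the same maximum-likelihood estimate) and assumes the confidence triples and RD/OOD labels are already grouped per subject; the names and data layout are ours, not the paper's. Restricting the feature columns reproduces the Speech-only and two-measure comparison cases.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def loocv_precision_recall(features_by_subject, labels_by_subject, delta=0.79):
    """Leave-one-out sketch: train MSC weights on 15 subjects, score the
    held-out subject, and compute precision/recall at threshold delta.

    features_by_subject: list of (N_i, 3) arrays of (C_S, C_O, C_M) rows
    labels_by_subject:   list of (N_i,) arrays of 0/1 labels (1 = RD speech)
    """
    results = []
    n_subj = len(features_by_subject)
    for held_out in range(n_subj):
        X_train = np.vstack([features_by_subject[i]
                             for i in range(n_subj) if i != held_out])
        y_train = np.concatenate([labels_by_subject[i]
                                  for i in range(n_subj) if i != held_out])
        clf = LogisticRegression(C=1e6)            # near-unregularized MLE
        clf.fit(X_train, y_train)

        p_rd = clf.predict_proba(features_by_subject[held_out])[:, 1]
        pred = p_rd > delta
        truth = labels_by_subject[held_out].astype(bool)

        tp = np.sum(pred & truth)
        precision = tp / max(np.sum(pred), 1)
        recall = tp / max(np.sum(truth), 1)
        results.append((precision, recall))
    return results
```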
2) Results: The average precision-recall curves over the 16 subjects under the clean and noisy conditions are shown in Fig. 5 and Fig. 6, respectively. The performance of each of the four cases is shown as "Speech," "Speech+Object," "Speech+Motion," and "MSC." From the figures, we found that (1) MSC outperforms all other cases under both clean and noisy conditions and (2) both the object and motion confidence measures helped to improve performance. The average maximum F-measures under the clean condition are MSC: 99%, Speech+Object: 97%, Speech+Motion: 97%, and Speech: 94%; those for the noisy condition are MSC: 95%, Speech+Object: 92%, Speech+Motion: 93%, and Speech: 83%. Compared with the speech confidence measure alone, MSC achieved an absolute increase of 5% and 12% under the clean and noisy conditions, respectively, indicating that MSC was particularly effective under the noisy condition.
We also performed paired t-tests. Under the clean condition, there were statistically significant differences between (1) Speech and Speech+Object (p < 0.01), (2) Speech and Speech+Motion (p < 0.05), and (3) Speech and MSC (p < 0.01). Under the noisy condition, there were statistically significant differences (p < 0.01) between Speech and all other cases. Here, p denotes the probability value obtained from the t-test.

Fig. 6. Average precision-recall curves under the noisy condition (precision (%) vs. recall (%), both from 70 to 100, for Speech, Speech+Object, Speech+Motion, and MSC).
The values of the optimized $\hat{\Theta}$ were $\hat{\theta}_0 = 5.9$, $\hat{\theta}_1 = 0.00011$, $\hat{\theta}_2 = 0.053$, and $\hat{\theta}_3 = 0.74$. The threshold δ for domain classification was set to $\hat{\delta} = 0.79$, which maximized the average F-measure of MSC under the clean condition. This means that a piece of speech with an MSC measure of more than 0.79 would be treated as RD speech and the robot would execute an action according to it. The above $\hat{\Theta}$ and $\hat{\delta}$ were used in the on-line experiment.
C. On-line Experiment Using the Robot
1) Setting: In the on-line experiment, the entire system
was evaluated by using the robot. In each session of the
experiment, two subjects, an "operator" and a "ministrant,"
sat in front of the robot at a distance of about one meter
from the microphone. The operator ordered the robot to
manipulate objects in Japanese. He was also allowed to chat
freely with the ministrant. The threshold η of gaze tracking
was set to 0.5, which means that if the proportion of the
operator’s gaze at the robot during input speech was higher
than 50%, the robot judged that the speech was made while
the operator was looking at it.
We conducted a total of four sessions of this experiment
using four pairs of subjects, and each session lasted for
about 50 minutes. All subjects were adult males. There was
constant ambient noise of about 48 dBA from the robot’s
power module in all sessions. For comparison, five cases
were evaluated for RD speech detection by using (1) gaze
only, (2) gaze and speech confidence measure, (3) gaze and
speech and object confidence measures, (4) gaze and speech
and motion confidence measures and, (5) gaze and MSC
measure.
2) Results: During the experiment, a total of 983 pieces
of speech were made, each of which was manually labeled
as either RD or OOD. There were 708 pieces of speech
which were made while the operator was looking at the
robot, including 155 and 553 pieces of RD and OOD speech,
respectively. This means that in addition to the RD speech,
there was also a lot of OOD speech made while the subjects
were looking at the robot.
The average recall and precision rates for each of the
above five cases are shown in Fig. 7 and Fig. 8, respectively.
Fig. 7. Average recall rates obtained in the on-line experiment: Gaze 94%, Gaze+Speech 90%, Gaze+Speech+Object 92%, Gaze+Speech+Motion 92%, Gaze+MSC 94%.
Fig. 8. Average precision rates obtained in the on-line experiment: Gaze 22%, Gaze+Speech 85%, Gaze+Speech+Object 93%, Gaze+Speech+Motion 95%, Gaze+MSC 96%.
By using gaze only, an average recall rate of 94% was
obtained (see “Gaze” column in Fig. 7), which means that
almost all of the RD speech was made while the operator
was looking at the robot. The recall rate dropped to 90%
by integrating gaze with the speech confidence measure,
which means that some RD speech was rejected erroneously
by the speech confidence measure. However, by integrating
gaze with MSC, the recall rate returned to 94% because the
mistakenly rejected RD speech was correctly detected by
MSC. In Fig. 8, the average precision rate by using gaze
only was 22%. However, by using MSC, the instances of
OOD speech were correctly rejected, resulting in a high
precision rate of 96%, which means the proposed method
is particularly effective in situations where users make a lot
of OOD speech while looking at a robot.
V. DISCUSSION
This work can be extended in many ways, and we mention
some of them in this section. Here, we evaluated the MSC
measure in situations where users usually order the robot
while looking at it. However, in other situations, users might
order a robot without looking at it. For example, in an object
manipulation task where a robot manipulates objects together
with a user, the user may give an order while looking at
the object that she/he is manipulating instead of looking at
the robot itself. For such tasks, the MSC measure should be
used separately without integrating it with gaze. Therefore,
a method that automatically determines whether to use the
gaze information according to the task and user situation
should be implemented.
Moreover, aside from the object manipulation task, the
MSC measure can also be extended to multi-task dialogs,
including both physically grounded and ungrounded tasks.
In physically ungrounded tasks, users' utterances do not refer to immediate physical objects or motions. For such dialogs,
a method that automatically switches between the speech
confidence and MSC measures should be implemented. In
future works, we will evaluate the MSC measure for various
dialog tasks.
In addition, MSC can be used to develop an advanced in-
terface for human-robot interaction. The RD speech probabil-
ity represented by MSC can be used to provide feedback such
as the utterance “Did you speak to me?”, and this feedback
should be made in situations where the MSC measure has an
intermediate value. Moreover, each of the object and motion
confidence measures can be used separately. For example, if
the object confidence measures for all objects in a robot's vision were particularly low, an active exploration should be executed by the robot to search for a feasible object in its surroundings, or an utterance such as "I cannot do that" should be made in situations where the motion confidence measure is particularly low.
VI. CONCLUSION
This paper described a robot-directed (RD) speech detec-
tion method that enables a robot to distinguish the speech to
which it should respond in an object manipulation task by
combining speech, visual, and behavioral context with human
gaze. The novel feature of this method is the introduction
of the MSC measure. The MSC measure evaluates the
feasibility of the action which the robot is going to execute
according to the users’ speech under the current physical
situation. The experimental results clearly show that the
method is very effective and provides an essential function
for natural and safe human-robot interaction. Finally, we
should note that the basic idea adopted in the method is
applicable to a broad range of human-robot dialog tasks.
REFERENCES
[1] H. Asoh, T. Matsui, J. Fry, F. Asano, and S. Hayamizu, "A spoken dialog system for a mobile robot," in Proc. Eurospeech, 1999, pp. 1139-1142.
[2] S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and G. Sagerer, "Providing the basis for human-robot-interaction: A multi-modal attention system for a mobile robot," in Proc. ICMI, 2003, pp. 28-35.
[3] T. Yonezawa, H. Yamazoe, A. Utsumi, and S. Abe, "Evaluating crossmodal awareness of daily-partner robot to user's behaviors with gaze and utterance detection," in Proc. CASEMANS, 2009, pp. 1-8.
[4] T. Takiguchi, A. Sako, T. Yamagata, and Y. Ariki, "System request utterance detection based on acoustic and linguistic features," Speech Recognition, Technologies and Applications, pp. 539-550, 2008.
[5] N. Iwahashi, "Robots that learn language: A developmental approach to situated human-robot conversations," Human-Robot Interaction, pp. 95-118, 2007.
[6] A. Lee, K. Nakamura, R. Nishimura, H. Saruwatari, and K. Shikano, "Noise robust real world spoken dialogue system using GMM based rejection of unintended inputs," in Proc. Interspeech, 2004, pp. 173-176.
[7] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression. Wiley-Interscience, 2009.
[8] H. Jiang, "Confidence measures for speech recognition: A survey," Speech Communication, vol. 45, pp. 455-470, 2005.
[9] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," in Proc. ICASSP, 1995, pp. 660-663.
[10] T. Kurita, "Iterative weighted least squares algorithms for neural networks classifiers," in Proc. ALT, 1992.
[11] M. Fujimoto and S. Nakamura, "Sequential non-stationary noise tracking using particle filtering with switching dynamical system," in Proc. ICASSP, vol. 2, 2006, pp. 769-772.
[12] S. Nakamura, K. Markov, H. Nakaiwa, G. Kikui, H. Kawai, T. Jitsuhiro, J. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto, "The ATR multilingual speech-to-speech translation system," IEEE Trans. ASLP, vol. 14, no. 2, pp. 365-376, 2006.