[Fig. 7: Average recall rates obtained in the on-line experiment.]
By using gaze only, an average recall rate of 94% was obtained (see the "Gaze" column in Fig. 7), which means that almost all RD speech occurred while the operator was looking at the robot. The recall rate dropped to 90% when gaze was integrated with the speech confidence measure, which means that some RD speech was erroneously rejected by the speech confidence measure. However, when gaze was integrated with MSC, the recall rate returned to 94%, because the mistakenly rejected RD speech was correctly detected by MSC. As shown in Fig. 8, the average precision rate obtained by using gaze only was 22%. By using MSC, however, instances of OOD speech were correctly rejected, resulting in a high precision rate of 96%. This means that the proposed method is particularly effective in situations where users produce a large amount of OOD speech while looking at the robot.

[Fig. 8: Average precision rates obtained in the on-line experiment.]
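To make these figures concrete, the following minimal Python sketch shows how such recall and precision rates can be computed for an RD speech detector; the `accept` rule, its threshold, and the data layout are illustrative assumptions, not the implementation used in the experiment.

```python
# Minimal sketch: recall and precision of RD speech detection.
# `samples` is an iterable of (gaze_on_robot, confidence, is_rd) tuples,
# where is_rd is the ground-truth label for robot-directed speech.
# `accept` is a hypothetical detector combining gaze with a confidence
# measure such as MSC; the 0.5 threshold is an assumed value.

def accept(gaze_on_robot: bool, confidence: float, threshold: float = 0.5) -> bool:
    """Accept an utterance as robot-directed only if the operator is
    looking at the robot and the confidence measure clears the threshold."""
    return gaze_on_robot and confidence >= threshold

def recall_precision(samples):
    tp = fp = fn = 0
    for gaze_on_robot, confidence, is_rd in samples:
        detected = accept(gaze_on_robot, confidence)
        if detected and is_rd:
            tp += 1        # RD speech correctly accepted
        elif detected and not is_rd:
            fp += 1        # OOD speech mistakenly accepted
        elif is_rd:
            fn += 1        # RD speech mistakenly rejected
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision
```

Under this scheme, integrating a weak confidence measure with gaze lowers recall (more false rejections of RD speech), while a strong measure such as MSC raises precision by filtering OOD speech without sacrificing recall.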
This work can be extended in many directions, and we discuss some of them in this section. Here, we evaluated the MSC measure in situations where users usually give orders to the robot while looking at it. In other situations, however, users might order a robot without looking at it. For example, in an object manipulation task where a robot and a user manipulate objects together, the user may give an order while looking at the object that he or she is manipulating rather than at the robot itself. For such tasks, the MSC measure should be used on its own, without being integrated with gaze. Therefore, a method should be implemented that automatically determines whether to use gaze information, according to the task and the user's situation.
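One possible realization of such task-dependent gating is the following Python sketch; the task labels and the decision rule are illustrative assumptions rather than part of the proposed method.

```python
# Illustrative sketch: decide whether gaze should gate RD detection.
# In a face-to-face command task, the user normally looks at the robot,
# so gaze is an informative filter; in a joint manipulation task, the
# user may look at the manipulated object instead, so MSC is used alone.
# The task labels here are assumptions, not taken from the paper.

def rd_score(task: str, gaze_on_robot: bool, msc: float) -> float:
    if task == "joint_manipulation":
        return msc                          # gaze is uninformative here
    return msc if gaze_on_robot else 0.0    # gaze gates the MSC score
```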
Moreover, beyond the object manipulation task, the MSC measure can be extended to multi-task dialogs that include both physically grounded and ungrounded tasks. In physically ungrounded tasks, users' utterances refer to no immediate physical objects or motions. For such dialogs, a method that automatically switches between the speech confidence and MSC measures should be implemented. In future work, we will evaluate the MSC measure for various types of dialog tasks.
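Such automatic switching might, for instance, take the form of the sketch below, where the groundedness flag and score names are hypothetical placeholders.

```python
# Illustrative sketch: select a confidence measure by utterance type.
# Physically grounded utterances (e.g., manipulation commands) refer to
# objects and motions that can be checked against the current scene, so
# MSC applies; ungrounded utterances fall back to ordinary speech
# (ASR-based) confidence.

def utterance_confidence(is_grounded: bool, msc: float, speech_conf: float) -> float:
    return msc if is_grounded else speech_conf
```

Determining the groundedness of an utterance automatically, e.g., from its parse, is itself part of the open problem noted above.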
In addition, MSC can be used to develop an advanced interface for human-robot interaction. The RD speech probability represented by MSC can be used to provide feedback, such as the utterance "Did you speak to me?"; this feedback should be given in situations where the MSC measure takes an intermediate value. Moreover, the object and motion confidence measures can each be used separately. For example, if the object confidence measures for all objects in the robot's view are particularly low, the robot should actively explore its surroundings to search for a feasible object; likewise, if the motion confidence measure is particularly low, the robot should make an utterance such as "I cannot do that."
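These feedback behaviors reduce to a small decision rule, sketched here in Python; the threshold values and function names are hypothetical.

```python
# Hedged sketch of the interface behaviors suggested above.
# The thresholds bounding the "intermediate" MSC range are assumed
# values; a real system would tune them empirically.

LOW, HIGH = 0.3, 0.7

def react(msc: float, object_confs: list[float], motion_conf: float) -> str:
    if LOW < msc < HIGH:
        return "ask: 'Did you speak to me?'"    # RD probability is ambiguous
    if object_confs and max(object_confs) < LOW:
        return "explore: search the surroundings for a feasible object"
    if motion_conf < LOW:
        return "say: 'I cannot do that'"        # requested motion infeasible
    return "execute the requested action"
```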
This paper described a robot-directed (RD) speech detection method that enables a robot to distinguish, in an object manipulation task, the speech to which it should respond by combining speech, visual, and behavioral context with human gaze. The novel feature of this method is the introduction of the MSC measure, which evaluates the feasibility, under the current physical situation, of the action that the robot would execute according to the user's speech. The experimental results show that the method is highly effective, achieving an average recall rate of 94% and an average precision rate of 96% in the on-line experiment, and that it provides an essential function for natural and safe human-robot interaction. Finally, we note that the basic idea adopted in the method is applicable to a broad range of human-robot dialog tasks.