Conference Paper

Prosody based emotion recognition for MEXI

Paderborn Univ., Germany
DOI: 10.1109/IROS.2005.1545341
Conference: Intelligent Robots and Systems, 2005. (IROS 2005). 2005 IEEE/RSJ International Conference on
Source: IEEE Xplore

ABSTRACT This paper describes emotion recognition from natural speech as realized for the robot head MEXI. We use a fuzzy logic approach to analyze prosody in natural speech. Since MEXI often communicates with well-known persons but also with unknown humans, for instance at exhibitions, we realized both a speaker-dependent mode and a speaker-independent mode in our prosody-based emotion recognition. A key point of our approach is that it automatically selects the most significant features from a set of twenty analyzed features, based on a training database of speech samples. According to our results this is important, since the set of significant features differs considerably between the distinguished emotions. With our approach we reach average recognition rates of 84% in speaker-dependent mode and 60% in speaker-independent mode.
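
The per-emotion feature selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the selection criterion, so the one-vs-rest Fisher-style score below is an assumption chosen for illustration, not the authors' method. The function name `rank_features`, the three prosodic features, and the toy values are likewise hypothetical.

```python
from statistics import mean, pvariance


def rank_features(samples, labels, emotion, top_k=3):
    """Return the indices of the top_k features that best separate
    `emotion` from all other labels (one-vs-rest Fisher-style score).

    Assumption: this scoring rule is illustrative only; the paper's
    actual selection criterion is not given in the abstract.
    """
    inside = [s for s, l in zip(samples, labels) if l == emotion]
    outside = [s for s, l in zip(samples, labels) if l != emotion]
    if not inside or not outside:
        raise ValueError("need samples both with and without the target emotion")

    scores = []
    for j in range(len(samples[0])):
        a = [s[j] for s in inside]
        b = [s[j] for s in outside]
        spread = pvariance(a) + pvariance(b)
        # Squared distance between class means, relative to the spread.
        # The small constant avoids division by zero for constant features.
        score = (mean(a) - mean(b)) ** 2 / (spread + 1e-12)
        scores.append((score, j))

    # Highest score first; ties broken by feature index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:top_k]]


# Hypothetical toy data: mean pitch (Hz), pitch range (Hz),
# speaking rate (syllables/s) for six utterances.
samples = [
    [220.0, 80.0, 5.1],
    [230.0, 85.0, 5.0],
    [150.0, 30.0, 5.2],
    [155.0, 35.0, 4.9],
    [180.0, 60.0, 3.0],
    [185.0, 62.0, 3.1],
]
labels = ["anger", "anger", "neutral", "neutral", "sadness", "sadness"]

selected = {e: rank_features(samples, labels, e, top_k=1)
            for e in sorted(set(labels))}
print(selected)
```

On this toy data a different feature ranks first for each emotion, which mirrors the abstract's observation that the set of significant features differs between emotions.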

  • ABSTRACT: Existing emotional speech recognition applications usually distinguish between a small number of emotions in speech. However, this set of so-called basic emotions varies from one application to another, depending on their respective needs. To support such differing application needs, an emotional speech model based on the fuzzy emotion hypercube is presented. In addition to what existing models offer, it also supports the recognition of derived emotions, which are combinations of basic emotions in speech. We show the application of this model with prosody-based hidden Markov models (HMMs). The approach is based on standard speech recognition technology using semi-continuous hidden Markov models. Both the selection of features and the design of the recognition system are addressed.
    Pervasive Computing, Signal Processing and Applications, International Conference on. 09/2010
  • ABSTRACT: For human–robot interaction (HRI), perception is one of the most important capabilities. This paper reviews several widely used perception methods of HRI in social robots. Specifically, we investigate general perception tasks crucial for HRI, such as where objects are located in a room, what objects are in the scene, and how they interact with humans. We first enumerate representative social robots and summarize the three most important perception methods from these robots: feature extraction, dimensionality reduction, and semantic understanding. For feature extraction, four widely used signal types (visual, audio, tactile, and range-sensor signals) are reviewed and compared based on their advantages and disadvantages. For dimensionality reduction, representative methods including principal component analysis (PCA), linear discriminant analysis (LDA), and locality preserving projections (LPP) are reviewed. For semantic understanding, conventional techniques for several typical applications such as object recognition, object tracking, object segmentation, and speaker localization are discussed, and their characteristics and limitations are analyzed. Moreover, several popular data sets used in social robotics and published semantic understanding results are analyzed and compared in light of our analysis of HRI perception methods. Lastly, we suggest important future work to analyze fundamental questions on perception methods in HRI.
    International Journal of Social Robotics 01/2014; 6(1).
  • ABSTRACT: Automatic recognition of emotion using facial expressions in the presence of speech poses a unique challenge because talking reveals clues to the affective state of the speaker but distorts the canonical expression of emotion on the face. We introduce a corpus of acted emotion expression where speech is either present (talking) or absent (silent). The corpus is uniquely suited for analysis of the interplay between the two conditions. We use a multimodal decision-level fusion classifier to combine models of emotion from talking and silent faces as well as from audio to recognize five basic emotions: anger, disgust, fear, happy and sad. Our results strongly indicate that emotion prediction from action unit facial features is less accurate when the person is talking. Modeling talking and silent expressions separately and fusing the two models greatly improves prediction accuracy in the talking setting. The advantages are most pronounced when silent and talking face models are fused with predictions from audio features. In this multimodal prediction, both the combination of modalities and the separate models of talking and silent facial expression of emotion contribute to the improvement.
    Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on; 09/2013

