
A Multimodal Annotation Schema for Non-Verbal Affective Analysis in the Health-Care Domain


Abstract

The development of conversational agents with human interaction capabilities requires advanced affective state recognition that integrates non-verbal cues from the different modalities which, in human communication, we perceive jointly as an overall affective state. Each modality is often handled by a different subsystem that conveys only a partial interpretation of the whole and, as such, is evaluated only in terms of this partial view. To tackle this shortcoming, we investigate the creation of a unified multimodal annotation schema for non-verbal cues from the perspective of an interdisciplinary group of experts. We aim to obtain a common ground truth with a unique representation based on the Valence-Arousal space and a discrete, non-linear scale of values. The proposed annotation schema is demonstrated on a corpus in the health-care domain but can be scaled to other purposes. Preliminary results on inter-rater variability show a positive correlation of the consensus level with high (absolute) values of Valence and Arousal, as well as with the number of annotators labelling a given video sequence.
A Multimodal Annotation Schema for Non-Verbal Affective Analysis in the Health-Care Domain
UPF, UAU, UULM, CERTH (KRISTINA project)
Contents
1. Introduction: Embodied Conversational Agents; The KRISTINA Project
2. Motivation: Perception of Non-Verbal Cues; Common Dataset
3. Related Work: Multimodal Affective Databases; Annotation of Affective States
4. Corpus in the Health-Care Domain
5. Multimodal Annotation of Non-Verbal Cues: Annotation Schema; Analysis of Joint Annotations; Consensus Assessment
6. Conclusions and Future Work
1. Introduction
1.1. Embodied Conversational Agents (ECAs)
Different areas, each with its own idiosyncrasies, lead to heterogeneity in the data representation of affective states:
Audio: sentences, F0 deviation
Face: Action Units; happy, sad…
Gesture: activation; pointing, beat…
1.2. The KRISTINA Project
[Architecture figure: system modules grouped under Recognition (SSI, ASR, facial expression recognition, gesture analysis, emotion recognition, language analysis, multimodal fusion), Management (knowledge base, targeted IR, interaction manager (VSM), dialogue manager, idle behavior manager), and Generation (multimodal fission and discourse planning, facial expression generation, gesture generation, language synthesis); the non-verbal modules are highlighted.]
2. Motivation
2.1. Perception of Non-Verbal Cues
Challenge: to computationally represent a subjective, holistic conversational event (in real time)
2.2. Common Dataset
Goal: maximize annotation efforts for validation
[Figure: many areas of expertise converge into a single annotation; annotators' native cultures are indicated]
3. Related Work
3.1. Multimodal Databases
Datasets usually address individual modalities:
Speech (Lee et al., 2005) [1]
Facial expressions (Lucey et al., 2010) [2]
Body gestures (McKeown et al., 2012) [3]
Naturalistic settings are also scarce. Exceptions:
SEMAINE (McKeown et al., 2012) [3]
RECOLA (Ringeval et al., 2013) [4]
Vera am Mittag (Grimm et al., 2008) [5]
None of them is in the health-care domain.
3.2. Annotation of Affective States (I)
Categorical: use of discrete word labels (happiness, sadness, boredom…)
Drawbacks:
Subjectiveness: affective states are often difficult to tag with words
An increasing number of labels correlates negatively with inter-annotator agreement
Dimensional: continuous space using a numeric scale
Advantage:
Represents blended emotions and transitions
Drawback:
Usually requires frame-by-frame measurements in real time
3.2. Annotation of Affective States (II)
[Figure: Valence-Arousal plane with positive and negative half-axes]
Our approach: a discrete representation on a fine-grained scale [0 / ±0.25 / ±0.5 / ±1]
4. Corpus in the Health-Care Domain
4.1. Corpus Description (I)
Number of videos: 32
Number of cultures: 5 (de = German, es = Spanish, pl = Polish, tr = Turkish, ar = Arabic)
Annotated so far: 2.29 h; goal: 10 h (approx. 20% completed)
[Pie chart, culture representativity (%): de 43%, es 30%, pl 18%, tr 7%, ar 2%]
4.1. Corpus Description (II)
Number of speakers: 18
[Pie chart, gender/culture distribution (%): male de 8%, female de 35%, female es 30%, female pl 18%, female tr 7%, male ar 2%]
Male speakers: ar and es in PUC 2; tr and de in PUC 1*
* No male participants in Polish
5. Multimodal Annotation of Non-Verbal Cues
5.1. Annotation Schema (I)
[Figure: Valence-Arousal plane with positive and negative half-axes]
Our approach: a discrete representation on a fine-grained scale [0 / ±0.25 / ±0.5 / ±1]
Initial tests used a 5-level scale; the final decision was a 7-level scale to capture the characteristics of naturalistic settings
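To make the scale concrete, below is a minimal sketch (Python, not from the paper) of the 7-level per-axis scale together with a hypothetical helper that snaps a continuous rating in [-1, 1] to the nearest discrete level. Annotators pick levels directly, so the snapping function is only a convenience for comparing against continuous-scale tools.

```python
import numpy as np

# 7-level non-linear scale per axis, as given on the slide: [0 / ±0.25 / ±0.5 / ±1]
SCALE = np.array([-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0])

def snap_to_scale(value: float) -> float:
    """Map a continuous rating in [-1, 1] to the nearest discrete level.

    Hypothetical helper, not part of the annotation schema itself.
    """
    return float(SCALE[np.abs(SCALE - value).argmin()])

# Example: a continuous valence of 0.4 maps to 0.5, an arousal of -0.2 to -0.25.
valence, arousal = snap_to_scale(0.4), snap_to_scale(-0.2)
```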
5.2. Annotation Schema (II)
Annotation* guidelines:
Perceived non-verbal behaviour (subjective perception is accepted)
Relevant for the conversational interaction
Signalled by an individual mode or a combination of modes
Annotated in parallel on both axes (Valence and Arousal)
Neutral states (0,0) are not annotated
* using the ELAN multimodal annotation tool (Wittenburg et al., 2006) [6]
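The guidelines above translate naturally into validity checks on each annotated segment. The sketch below is an illustration only; field names such as start_ms and end_ms are hypothetical, since the slides do not specify the ELAN tier layout. It enforces the 7-level scale, the parallel annotation of both axes, and the exclusion of neutral (0,0) states.

```python
from dataclasses import dataclass

LEVELS = {-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0}  # 7-level scale per axis

@dataclass
class NonVerbalSegment:
    """One perceived non-verbal event, annotated on both axes in parallel."""
    start_ms: int
    end_ms: int
    valence: float
    arousal: float

    def __post_init__(self):
        if self.valence not in LEVELS or self.arousal not in LEVELS:
            raise ValueError("valence/arousal must lie on the 7-level scale")
        if self.valence == 0.0 and self.arousal == 0.0:
            raise ValueError("neutral (0,0) states are not annotated")
        if self.end_ms <= self.start_ms:
            raise ValueError("segment must have positive duration")
```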
5.3. Consensus Assessment (I)
Step 1: Cronbach's alpha over the minimum and maximum of random annotator combinations
[Figure: per dialogue and speaker (Speaker A, Speaker B), acceptable and satisfactory reliability levels determine the minimum number of annotators]
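A minimal sketch of how Step 1 could be computed, assuming the ratings are aligned per segment into an (n_segments, n_annotators) matrix. The 0.7 "acceptable" threshold is a common rule of thumb rather than a value from the slides, and all subsets of each size are enumerated here instead of being sampled at random.

```python
import numpy as np
from itertools import combinations

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for an (n_segments, n_annotators) rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                     # annotators act as "items"
    item_vars = ratings.var(axis=0, ddof=1)  # variance of each annotator's ratings
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def alpha_min_max(ratings: np.ndarray, size: int):
    """Min and max alpha over all annotator subsets of a given size."""
    alphas = [cronbach_alpha(ratings[:, list(c)])
              for c in combinations(range(ratings.shape[1]), size)]
    return min(alphas), max(alphas)

def minimum_annotators(ratings: np.ndarray, acceptable: float = 0.7):
    """Smallest subset size whose worst-case alpha still reaches the threshold."""
    for size in range(2, ratings.shape[1] + 1):
        worst, _ = alpha_min_max(ratings, size)
        if worst >= acceptable:
            return size
    return None
```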
5.4. Analysis of Joint Annotations
Step 2: consensus and confidence using the COLLATE* method
* (Asman and Landman, 2011) [7]
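COLLATE itself is an EM-based statistical label-fusion algorithm that jointly estimates consensus level, labeler accuracy, and the underlying truth [7], and is not reproduced here. As a rough stand-in only, the sketch below fuses the discrete labels by majority vote and reports the fraction of agreeing annotators, mirroring the kind of outputs referred to later as the consensus annotation (CONS) and confidence score (pCCONS).

```python
import numpy as np
from collections import Counter

def fuse_labels(labels: np.ndarray):
    """Majority-vote stand-in for label fusion (NOT the COLLATE algorithm).

    labels: (n_segments, n_annotators) discrete Valence or Arousal levels.
    Returns a consensus label per segment and a confidence score equal to
    the fraction of annotators agreeing with that label.
    """
    consensus, confidence = [], []
    for row in np.asarray(labels):
        label, votes = Counter(row.tolist()).most_common(1)[0]
        consensus.append(label)
        confidence.append(votes / len(row))
    return np.array(consensus), np.array(confidence)
```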
5.5. Consensus Assessment (II)
Step 3: agreement measured relative to the consensus annotation
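One simple way to realise Step 3, assuming the fused consensus from Step 2 is available, is to score each annotator by plain percent agreement with the consensus labels; the slides do not specify the exact agreement measure, so this is an illustrative choice.

```python
import numpy as np

def agreement_with_consensus(labels: np.ndarray, consensus: np.ndarray) -> np.ndarray:
    """Per-annotator fraction of segments matching the consensus label.

    labels: (n_segments, n_annotators); consensus: (n_segments,).
    Percent agreement is an illustrative choice, not the paper's stated measure.
    """
    labels = np.asarray(labels)
    consensus = np.asarray(consensus).reshape(-1, 1)
    return (labels == consensus).mean(axis=0)
```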
6. Conclusions and Future Work
Holistic multimodal assessment of affective states in the health-care domain
Focus on non-verbal behaviour
Use of 7x7 discrete labels within the Valence-Arousal space to capture subtle variations from the neutral state
Consensus annotation (CONS) and confidence score (pCCONS)
Cronbach's alpha to assess and correct deviations
Further recordings and annotations: under development
Analysis of different cultures: Polish, Turkish, Arabic, German and Spanish
Thank You
http://kristina-project.eu/en/
@MonikaUPF
@projectKRISTINA
monica.dominguez@upf.edu
References
[1] C. M. Lee and S. S. Narayanan. Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2):293–303, 2005.
[2] P. Lucey, et al. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In IEEE Conf. Computer Vision and Pattern Recognition, pages 94–101, 2010.
[3] G. McKeown, et al. The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 3(1):5–17, 2012.
[4] F. Ringeval, et al. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In IEEE Int. Conf. Automatic Face and Gesture Recognition, pages 1–8, 2013.
[5] M. Grimm, K. Kroschel, and S. Narayanan. The Vera am Mittag German audio-visual emotional speech database. In IEEE Int. Conf. Multimedia and Expo, pages 865–868, 2008.
[6] P. Wittenburg, et al. ELAN: a professional framework for multimodality research. In Int. Conf. on Language Resources and Evaluation, pages 1556–1559, 2006.
[7] A. J. Asman and B. A. Landman. Robust statistical label fusion through consensus level, labeler accuracy, and truth estimation (COLLATE). IEEE Transactions on Medical Imaging, 30(10):1779–1794, 2011.