Emotion Elicitation and Capture
among Real Couples in the Lab
University of Zürich
University of Zürich
ETH Zürich, University of St.
Abstract
Couples’ relationships affect partners’ mental and physi-
cal well-being. Automatic recognition of couples’ emotions
will not only help to better understand the interplay of emo-
tions, intimate relationships, and health and well-being, but
also provide crucial clinical insights into protective and risk
factors of relationships, and can ultimately guide interventions. However, several works developing emotion recognition algorithms use data from actors in artificial dyadic interactions, and such algorithms are unlikely to perform well on real couples. We are developing emotion recognition methods using data from real couples and, in this paper, we describe two studies in which we collected emotion data from real couples: Dutch-speaking couples in Belgium and German-speaking couples in Switzerland. We discuss our approach to eliciting and capturing emotions and make five recommendations based on their relevance for developing well-performing emotion recognition systems for couples.
firstname.lastname@example.org, Zürich and St. Gallen, Switzerland
Paper presented at the 1st Momentary Emotion Elicitation & Capture (MEEC) workshop, co-located with the ACM CHI Conference on Human Factors in Computing Systems, Honolulu, Hawaii, USA, April 25th, 2020. This is an open-access paper distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Author Keywords
Emotion; Couples; Multimodal Sensor Data; Smartphone;
CCS Concepts
• Applied computing → Psychology; • Human-centered computing → Ubiquitous and mobile computing systems
Introduction
Extensive research shows that intimate relationships have
powerful effects on people’s mental and physical health
(see e.g.  for an overview). For instance, conflicts and
negative qualities of one’s intimate relationship are asso-
ciated prospectively with morbidity and mortality . In-
creasingly, researchers are zooming in on the emotional
processes that take place in intimate relationships as un-
derlying mechanisms for this relationship-health link (e.g.
. However, assessing these dynamic emotional pro-
cesses is challenging.
In studies of intimate relationships, two methods predomi-
nate: self-reports and observer reports. Most often, a stan-
dard dyadic interaction paradigm is used, in which cou-
ples participate in an emotionally charged discussion that
is videotaped . Next, couples can watch these videos
and report on the emotions that they have experienced dur-
ing the interaction (resulting in self-reported emotion); or
observers use a coding scheme to rate the interaction on
specific emotional behaviors (e.g., the SPAFF ). Both
methods have their own advantages and limitations and
provide unique information on the emotional processes in
couples. The power of observational data is that it goes beyond people’s own awareness and is not subject to reporting biases. However, its greatest limitation is the resources that coding requires. First, a coding scheme has to be developed, which is a whole process in itself. Next, multiple observers have to be trained systematically to reach sufficient inter-rater agreement. Once the actual coding starts, the process is slow and costly, because multiple coders have to code the same videos so that inter-rater reliability can be computed.
Automatic emotion recognition holds important promise for overcoming these limitations and significantly advancing the field. Hence, it is important to develop a system for automatic recognition of couples’ emotions using information such as speech, facial expressions, and gestures. Emotion recognition systems built on speech data collected from individuals are not adequate for our purpose, because they do not capture the complexity of dyadic conversations, such as turn-taking. As a result, works that focus on couple dyads are most relevant.
Several emotion recognition works on couple dyads use data collected from actors in artificial dyadic interactions. Examples of such datasets are the IEMOCAP dataset, the USC CreativeIT dataset, and the MSP-IMPROV dataset. To elicit emotions, actors are either asked to follow a script or given hypothetical situations to act out, so that the acting seems natural and more like a real couple. To capture ground truth, these datasets tend to be annotated afterwards by external raters using dimensional and/or categorical labels, either moment by moment or with global emotion labels for whole recordings. Prior work has highlighted several challenges with annotations by external raters, such as dealing with inter-rater agreement, the subjectivity of each rater, how to combine annotations for moment-by-moment ratings, and the laborious nature of the annotation. Additionally, and importantly, the ratings reflect the assessment of external raters rather than the emotions perceived by the couples themselves, which is what needs to be captured.
Furthermore, it has been shown that algorithms trained on naturalistic data perform worse than those trained on acted data, which indicates that naturalistic emotions are harder to recognize. Algorithms developed from actors’ data are therefore unlikely to perform well on real people, given that actors tend to express emotions with greater intensity than real couples do in naturalistic contexts. It is hence important to develop emotion recognition methods using data from real couples, along with emotion ratings from the couples themselves.
Towards that end, it is important to adequately collect ground
truth information and sensor data to develop a system for
emotion recognition among couples. We are developing
such a system and, in this paper, we describe our approach
to elicit and capture emotions among real couples in two lab
studies — one conducted in Belgium with couples speaking
Dutch and the other in Switzerland with couples speaking
German. We then discuss these studies and make five rec-
ommendations for future data collection among couples in
the lab to improve automatic emotion recognition. For work
focusing on data collected from couples in everyday life,
see our paper (under review) .
We used data from two lab studies with real couples, in
which the sessions were videotaped and couples provided
ratings either of the whole session or retroactively on a
moment-by-moment basis while watching the video.
Study 1: Dyadic Interaction Study
A Dyadic Interaction lab study was conducted in Leuven, Belgium with 101 Dutch-speaking couples. Each couple was asked to have two 10-minute conversations: one about a negative topic (a characteristic of their partner that annoys them the most) and one about a positive topic (a characteristic of their partner that they value the most). During both conversations, couples were asked to wrap up the conversation after 8 minutes. For the negative topic, they were also asked to end on good terms. After each conversation, each partner completed self-reports on various categorical emotion labels such as anger, sadness, anxiety, relaxation, and happiness on a 7-point Likert scale ranging from strongly disagree (1) to strongly agree (7). They also completed the Affect Grid questionnaire, which captures the valence and arousal dimensions of Russell’s circumplex model of emotions. Each partner also reported their perception of their partner’s emotion using the Affect Grid. Additionally, each partner separately watched the video recording of the conversation on a computer and rated his or her own emotion on a moment-by-moment basis by continuously moving a joystick to the left (very negative) or to the right (very positive) so that it closely matched their feelings, resulting in valence scores on a continuous scale from -1 to 1 [11, 24].
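To illustrate how such a joystick trace might be prepared for analysis, the following is a minimal sketch that averages timestamped readings into per-second valence values. The function names, the (time, valence) sample format, and the carry-forward handling of empty bins are assumptions made for illustration; they do not describe the software used in Study 1.

```python
from statistics import mean


def resample_valence(samples, duration_s, step_s=1.0):
    """Average joystick readings into fixed-width time bins.

    samples: iterable of (time_s, valence) pairs, valence in [-1, 1]
             (hypothetical format, not the study's actual export).
    Returns one value per bin. An empty bin repeats the previous bin's
    value; bins before the first reading are None.
    """
    n_bins = int(duration_s // step_s)
    bins = [[] for _ in range(n_bins)]
    for t, v in samples:
        if not -1.0 <= v <= 1.0:
            raise ValueError(f"valence out of range: {v}")
        idx = int(t // step_s)
        if 0 <= idx < n_bins:
            bins[idx].append(v)

    series, last = [], None
    for readings in bins:
        if readings:
            last = mean(readings)
        series.append(last)
    return series


def session_mean(series):
    """Mean valence over all bins that have a value."""
    values = [v for v in series if v is not None]
    if not values:
        raise ValueError("no valence readings")
    return mean(values)
```

For a 10-minute conversation, `resample_valence(samples, duration_s=600)` would yield 600 per-second values, which can then be aligned with the audio and video of the same session.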
Study 2: DyMand Study
We are currently running a Dyadic Management of Diabetes (DyMand) lab study in Zurich, Switzerland with German-speaking couples in which one partner has type 2 diabetes; data from eight couples have been collected so far. In this lab study, the couple is asked to discuss, for 10 minutes, an illness management-related concern that is causing them considerable distress. The session is videotaped and, additionally, each partner wears a smartwatch that collects various sensor data: audio, heart rate, accelerometer, gyroscope, and ambient light. After the session, each partner completes a self-report on a smartphone about their emotions using the Affective Slider, which assesses the valence and arousal dimensions of their emotions over the last 10 minutes of the discussion. The smartphone also takes a 3-second video of their facial expression while they complete the self-report.
Discussion and Recommendations
Based on these two studies, we discuss and recommend
approaches to collect sensor and ground truth data from
couples to aid in developing well-performing systems for
emotion recognition among couples.
Elicitation of Emotions
In these studies, we elicited emotions in the couples by
asking them to discuss various relationship-relevant top-
ics (Study 1), or a distressing illness management concern
(Study 2). Compared with elicitation approaches such as watching a video or listening to music, this approach mimics a real-world context: partners having a conversation. Hence, algorithms developed using data from this context, such as verbal and nonverbal vocalizations, could also be deployed on ubiquitous devices such as smartphones and smartwatches to recognize couples’ emotions in everyday life. We recommend the use of similar elicitation approaches for couple emotion recognition works.
Self-Report Data Collection
In these studies, we captured emotions using a range of ap-
proaches which can generally be grouped into two: global
rating (one value or label for the whole conversation) and
continuous rating (different values for different parts of the
conversation) (only in Study 1).
The global ratings in Study 1 consisted of 7-point Likert scales for categorical emotions such as angry, relaxed, happy, and sad, together with the Affect Grid, all completed using electronic questionnaires. In Study 2, we collected valence and arousal values using the Affective Slider on a smartphone. Global ratings are important to capture (1) a partner’s perception of his/her own emotion (self-perceived) and (2) his/her perception of his/her partner’s emotion (partner-perceived), as was done in Study 1. The partner-perceived rating is useful because it indicates how well a human who knows the person can judge their emotion, and it could therefore be used to compute baseline values for metrics such as accuracy (for classification tasks) and the correlation coefficient (for regression tasks) in machine learning experiments.
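The baseline idea above can be sketched as follows: the partner-perceived rating is treated as a "prediction" of the self-perceived rating and scored with the same metrics as a model. This is a minimal sketch, not the evaluation used in either study. The 1 to 9 valence range, the midpoint of 5, and the three-class split are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def accuracy(predicted, actual):
    """Fraction of positions where the two label sequences agree."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def valence_class(rating, midpoint=5):
    """Map a valence rating to a coarse label (assumed 1-9 scale, midpoint 5)."""
    if rating < midpoint:
        return "negative"
    if rating > midpoint:
        return "positive"
    return "neutral"


def partner_baseline(self_ratings, partner_ratings):
    """Score partner-perceived ratings against self-perceived ratings."""
    return {
        "correlation": pearson_r(partner_ratings, self_ratings),
        "accuracy": accuracy(
            [valence_class(r) for r in partner_ratings],
            [valence_class(r) for r in self_ratings],
        ),
    }
```

A recognition model evaluated on the same sessions can then be compared against these two numbers to see whether it matches how well partners judge each other.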
The continuous emotion rating was done only in Study 1. Each partner separately watched a video of their conversation in computer-based software and continuously adjusted a joystick to the left or right (the rated valence values were displayed in real time). This continuous rating is important because it gives a granular assessment of emotions, which is needed to develop an emotion recognition system that shows how the emotion of each partner changes on a second-by-second or minute-by-minute basis. The mean value could also be used as an estimate of the global emotion rating. Additionally, the continuous rating could be useful for accurate recognition of the global rating. The peak-end rule says that the extremes and the end of an emotional experience influence a person’s overall judgment of that experience. Based on this rule and on prior work exploring it using Study 1’s data, using data from the extremes and/or the end of the 10-minute conversation might produce better recognition of the global emotion rating of the whole conversation.
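As a rough illustration of the peak-end idea, the sketch below summarizes a continuous valence series by its mean, its most positive and most negative values, and the average of its final portion. The function name, the feature names, and the choice of the last 10% of the series as the "end" are assumptions made for illustration, not values taken from the studies.

```python
def peak_end_features(valence, end_fraction=0.1):
    """Summarize a continuous valence series with peak-end style features.

    valence: non-empty list of ratings in [-1, 1], ordered in time.
    end_fraction: share of the series treated as the "end" (assumed 10%).
    """
    if not valence:
        raise ValueError("valence series is empty")
    if not 0 < end_fraction <= 1:
        raise ValueError("end_fraction must be in (0, 1]")

    # Use at least one sample for the end segment.
    n_end = max(1, round(len(valence) * end_fraction))
    end_segment = valence[-n_end:]

    return {
        "mean": sum(valence) / len(valence),
        "peak_positive": max(valence),
        "peak_negative": min(valence),
        "end": sum(end_segment) / n_end,
    }
```

These four values could serve as candidate predictors of the global rating, or the peak and end positions could be used to select which audio segments to feed to a recognition model.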
We did not collect self-reports about the personality of each partner, though they might be useful. There are individual differences in the experience and expression of emotions; a concrete example is that the relation between arousal and valence varies across individuals. Preliminary evidence suggests that individuals’ expressions of valence and arousal relate to the five-factor model of personality. Individuals’ personality may thus affect how they express their emotions. Hence, collecting personality data with an instrument such as the Big Five Inventory and using it as input to an emotion recognition algorithm could potentially improve its performance.
Based on this discussion, we recommend collecting self-perceived and partner-perceived (1) global emotion ratings with smartphone-based valence and arousal instruments such as the Affective Slider and (2) continuous emotion ratings for valence and arousal using, for example, a smartphone-based app. Categorical labels could also be collected if they are not overly burdensome or redundant. We also recommend collecting personality self-reports. These data will help in developing and evaluating robust emotion recognition systems.
Sensor Data Collection
In Study 1, we collected only audio and video data, whereas in Study 2 we additionally use DyMand, a smartwatch-based system we developed, to collect multimodal sensor data: audio, heart rate, accelerometer, gyroscope, and ambient light. The additional data from the smartwatch could provide more context for better recognition: for example, heart rate provides a physiological measure, and the accelerometer and gyroscope provide information about hand gestures. Previous works have shown that multimodal approaches to emotion recognition perform better than unimodal approaches. Given that an additional device like a commercial smartwatch is not burdensome to wear, we recommend the collection of such multimodal sensor data.
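One simple way to combine such sensor streams is feature-level fusion: each stream is summarized over the same time windows and the summaries are concatenated into one feature vector per window. The sketch below does this for heart rate and accelerometer magnitude. The sample formats, function names, and window length are assumptions made for illustration and do not describe the DyMand system's actual data format.

```python
import math
from statistics import mean


def window_means(samples, window_s, n_windows):
    """Average (time_s, value) samples per window; None for empty windows."""
    buckets = [[] for _ in range(n_windows)]
    for t, value in samples:
        idx = int(t // window_s)
        if 0 <= idx < n_windows:
            buckets[idx].append(value)
    return [mean(b) if b else None for b in buckets]


def fuse_windows(heart_rate, accel_xyz, window_s, n_windows):
    """Build one [mean heart rate, mean accel magnitude] vector per window.

    heart_rate: list of (time_s, beats_per_minute)      (hypothetical format)
    accel_xyz:  list of (time_s, (x, y, z)) readings    (hypothetical format)
    """
    accel_mag = [
        (t, math.sqrt(x * x + y * y + z * z)) for t, (x, y, z) in accel_xyz
    ]
    hr = window_means(heart_rate, window_s, n_windows)
    mag = window_means(accel_mag, window_s, n_windows)
    return [[h, m] for h, m in zip(hr, mag)]
```

Windows that contain None for a stream would need to be dropped or imputed before training. Audio features computed over the same windows could be appended to each vector in the same way.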
The universality of emotions has been questioned [26, 15]. There is evidence suggesting that culture affects how people experience and express emotions, for example through facial expressions, gestures, physiological reactions, and verbal and nonverbal vocalizations [18, 28]. Hence, algorithms developed using data from one cultural context might not work well in others or, worse, might contain various biases. Collecting cross-cultural data will be useful for developing algorithms that work across cultures and for reducing bias in those algorithms. We collected data from different cultures, albeit so far only within Europe: Dutch-speaking couples in Belgium and German-speaking couples in Switzerland. We are developing and evaluating our emotion recognition systems using these cross-cultural data. We hence recommend collecting data from couples in different cultures to develop robust algorithms.
Development of Software Tools
Data collected from real couples can be annotated by the couples themselves, as described previously, so there is no need for manual annotation by external raters, which is time-consuming and laborious. However, the data need to be processed before they are useful for developing emotion recognition systems. This processing involves some challenges, some of which, such as turn-taking, are unique to dyadic interactions like couples’ conversations. Audio is an important data source for emotion recognition because key information can be extracted from it, such as vocal expression (how things are said), nonverbal vocalizations (e.g., sighs, laughs), and verbal vocalizations (what is said, which might give more context for recognition). Tools that automatically process audio data would improve the development of emotion recognition systems for couples. Hence, it is important for various
software tools to be developed that can easily be used by
There is a need for open-source tools for voice activity detection and diarization that are robust, that is, that perform well on all kinds of audio. The voice activity detection tool is needed to automatically annotate the parts of the audio that contain vocalizations so that silent or noisy segments can be discarded. Additionally, the tool could be further refined to annotate specific nonverbal vocalizations such as sighs, laughter, and chuckles, which might be indicative of specific emotions in various parts of the audio, thereby improving recognition performance. The diarization tool is needed to automatically annotate which parts of the audio correspond to each speaker. Segmenting audio recordings by speaker is important for developing a well-performing emotion recognition system.
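To make the voice activity detection step concrete, the following is a deliberately simple sketch that marks frames as voiced when their mean energy exceeds a fixed threshold and merges consecutive voiced frames into segments. It is a crude stand-in chosen for illustration, not the method of any tool cited here, and a fixed energy threshold is exactly the kind of approach that is not robust to noisy audio. The function name, frame length, and threshold value are assumptions.

```python
def detect_voiced_segments(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Return (start_s, end_s) segments whose frame energy exceeds a threshold.

    samples: audio amplitudes as floats in [-1, 1] (mono).
    threshold: mean squared amplitude above which a frame counts as voiced
               (assumed value; real recordings need tuning or a learned model).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len < 1:
        raise ValueError("frame is shorter than one sample")

    n_frames = len(samples) // frame_len
    segments, start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        t = i * frame_len / sample_rate
        if energy >= threshold and start is None:
            start = t                      # a voiced run begins
        elif energy < threshold and start is not None:
            segments.append((start, t))    # the voiced run ends
            start = None
    if start is not None:                  # audio ended while still voiced
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments
```

The resulting segments are the input a diarization step would need: each voiced segment would then be assigned to one of the two partners.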
Also, there is a need for open-source tools for automatic transcription of non-English languages, which are currently lacking, because transcriptions could provide more context and improve recognition performance. Doing the annotation and transcription manually for a few hours of audio might not be a problem. However, doing so for tens of thousands of hours of data is not scalable. Crowdsourcing approaches such as Amazon Mechanical Turk may work for acted data, but they cannot be used for real couples’ data because those data are confidential. We recommend that efforts be put into developing these tools within the affective computing community, both to avoid duplicated individual efforts and because inaccurate annotations would result in poor input data for emotion recognition algorithms.
Conclusion
We are developing emotion recognition methods using data from real couples and, in this work, we describe two studies we ran with real couples: Dutch-speaking couples in Belgium and German-speaking couples in Switzerland. We discuss our approach to eliciting and capturing emotions and make the following five recommendations based on their relevance for developing well-performing emotion recognition systems for couples: 1) Elicit emotions by asking couples to discuss a topic from their relationship, 2) Collect global and continuous emotion self-reports and personality data using mobile systems like smartphones, 3) Collect multimodal sensor data using devices like smartwatches, 4) Collect data from different cultures, and 5) Develop open-source voice activity detection, diarization, and transcription software tools within the affective computing community.
Acknowledgments
We are grateful to Prabhakaran Santhanam and Dominik Rügger for helping to develop the mobile software tools used to run the second study. Study 2 is co-funded by the Swiss National Science Foundation (CR12I1_166348).
References
[1] Alberto Betella and Paul FMJ Verschure. 2016. The affective slider: A digital self-assessment scale for the measurement of human emotions. PloS one 11, 2
[2] George Boateng, Janina Lüscher, Urte Scholz, and Tobias Kowatsch. 2020. Emotion Capture among Real Couples in Everyday Life. In Momentary Emotion Elicitation and Capture workshop. CHI 2020 (Under Review).
[3] George Boateng, Prabhakaran Santhanam, Janina Lüscher, Urte Scholz, and Tobias Kowatsch. 2019a. Poster: DyMand–An Open-Source Mobile and Wearable System for Assessing Couples’ Dyadic Management of Chronic Diseases. In The 25th Annual International Conference on Mobile Computing and
[4] George Boateng, Prabhakaran Santhanam, Janina Lüscher, Urte Scholz, and Tobias Kowatsch. 2019b. VADLite: an open-source lightweight system for real-time voice activity detection on smartwatches. In Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers.
[5] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation 42, 4 (2008), 335.
[6] Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab, Najmeh Sadoughi, and Emily Mower Provost. 2016. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Transactions on Affective Computing 8, 1 (2016), 67–80.
[7] James A Coan and John M Gottman. 2007. The specific affect coding system (SPAFF). Handbook of emotion elicitation and assessment (2007), 267–285.
[8] Sidney K D’mello and Jacqueline Kory. 2015. A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys (CSUR) 47, 3
[9] Allison K Farrell, Ledina Imami, Sarah CE Stanton, and Richard B Slatcher. 2018. Affective processes as mediators of links between close relationships and physical health. Social and Personality Psychology Compass 12, 7 (2018), e12408.
[10] Barbara L Fredrickson. 2000. Extracting meaning from past affective experiences: The importance of peaks, ends, and specific emotions. Cognition & Emotion 14, 4 (2000), 577–606.
[11] John M Gottman and Robert W Levenson. 1985. A valid procedure for obtaining self-report of affect in marital interaction. Journal of consulting and clinical psychology 53, 2 (1985), 151.
[12] Patricia K Kerig and Donald H Baucom. 2004. Couple observational coding systems. Taylor & Francis.
[13] Peter Kuppens, Francis Tuerlinckx, James A Russell, and Lisa Feldman Barrett. 2013. The relation between valence and arousal in subjective experience. Psychological bulletin 139, 4 (2013), 917.
[14] Peter Kuppens, Francis Tuerlinckx, Michelle Yik, Peter Koval, Joachim Coosemans, Kevin J Zeng, and James A Russell. 2017. The relation between valence and arousal in subjective experience varies with personality and culture. Journal of personality 85, 4
[15] Nangyeon Lim. 2016. Cultural differences in emotion: differences in emotional arousal level between the East and the West. Integrative medicine research 5, 2
[16] Timothy J Loving and Richard B Slatcher. 2013. Romantic relationships and health. The Oxford handbook of close relationships (2013), 617–637.
[17] Janina Lüscher, Tobias Kowatsch, George Boateng, Prabhakaran Santhanam, Guy Bodenmann, and Urte Scholz. 2019. Social Support and Common Dyadic Coping in Couples’ Dyadic Management of Type II Diabetes: Protocol for an Ambulatory Assessment Application. JMIR research protocols 8, 10 (2019),
[18] David Matsumoto and Paul Ekman. 1989. American-Japanese cultural differences in intensity ratings of facial expressions of emotion. Motivation and Emotion 13, 2 (1989), 143–157.
[19] Angeliki Metallinou, Chi-Chun Lee, Carlos Busso, Sharon Carnicke, Shrikanth Narayanan, and others. 2010. The USC CreativeIT database: A multimodal database of theatrical improvisation. Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (2010), 55.
[20] Angeliki Metallinou and Shrikanth Narayanan. 2013. Annotation and processing of continuous emotional attributes: Challenges and opportunities. In 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG). IEEE, 1–
[21] Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. 2017. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion 37 (2017), 98–125.
[22] Nicole A Roberts, Jeanne L Tsai, and James A Coan. 2007. Emotion elicitation using dyadic interaction tasks. Handbook of emotion elicitation and assessment (2007), 106–123.
[23] Theodore F Robles, Richard B Slatcher, Joseph M Trombello, and Meghan M McGinn. 2014. Marital quality and health: A meta-analytic review. Psychological bulletin 140, 1 (2014), 140.
[24] Anna Marie Ruef and Robert W Levenson. 2007. Continuous measurement of emotion. Handbook of emotion elicitation and assessment (2007), 286–297.
[25] James A Russell. 1980. A circumplex model of affect. Journal of personality and social psychology 39, 6
[26] James A Russell. 1994. Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychological bulletin 115, 1
[27] James A Russell, Anna Weiss, and Gerald A Mendelsohn. 1989. Affect grid: a single-item scale of pleasure and arousal. Journal of personality and social psychology 57, 3 (1989), 493.
[28] K. R Scherer, H Wallbott, D Matsumoto, and K Tsutomu. 1988. Emotional experience in cultural context: A comparison between Europe, Japan and the United States. Faces of emotion: recent research
[29] Laura Sels, Eva Ceulemans, and Peter Kuppens. 2019. All’s well that ends well? A test of the peak-end rule in couples’ conflict discussions. European Journal of Social Psychology 49, 4 (2019), 794–806.
[30] Christopher J Soto and Oliver P John. 2017. The next Big Five Inventory (BFI-2): Developing and assessing a hierarchical model with 15 facets to enhance bandwidth, fidelity, and predictive power. Journal of personality and social psychology 113, 1 (2017), 117.
[31] Eva Vozáriková and Jozef Juhár. 2015. Comparison of Diarization Tools for Building Speaker Database. Advances in Electrical and Electronic Engineering 13 (11 2015), 314–319. DOI: