The MuSe 2024 Multimodal Sentiment Analysis Challenge:
Social Perception and Humor Recognition
Shahin Amiriparian
Chair of Health Informatics
Klinikum rechts der Isar
Technical University of Munich
Munich, Germany
Lukas Christ
Chair of Embedded Intelligence for
Healthcare and Wellbeing
University of Augsburg
Augsburg, Germany
Alexander Kathan
Chair of Embedded Intelligence for
Healthcare and Wellbeing
University of Augsburg
Augsburg, Germany
Maurice Gerczuk
Chair of Embedded Intelligence for
Healthcare and Wellbeing
University of Augsburg
Augsburg, Germany
Niklas Müller
Chair of Strategic Management,
Innovation, and Entrepreneurship
University of Passau
Passau, Germany
Steen Klug
Chair of Strategic Management,
Innovation, and Entrepreneurship
University of Passau
Passau, Germany
Lukas Stappen
Recoro
Munich, Germany
Andreas König
Chair of Strategic Management,
Innovation, and Entrepreneurship
University of Passau
Passau, Germany
Erik Cambria
College of Computing & Data Science
Nanyang Technological University
Singapore, Singapore
Björn W. Schuller
Group on Language, Audio, & Music
Imperial College London
London, United Kingdom
Simone Eulitz
Institute of Strategic Management
LMU Munich
Munich, Germany
ABSTRACT
The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems: In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals, such as assertiveness, dominance, likability, and sincerity, based on the provided audio-visual data. The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset, focusing on the detection of spontaneous humor in a cross-lingual and cross-cultural setting. The main objective of MuSe 2024 is to unite a broad audience from various research domains, including multimodal sentiment analysis, audio-visual affective computing, continuous signal processing, and natural language processing. By fostering collaboration and exchange among experts in these fields, MuSe 2024 endeavors to advance the understanding and application of sentiment analysis and affective computing across multiple modalities.
This baseline paper details each sub-challenge and its corresponding dataset, describes the features extracted from each data modality, and discusses the challenge baselines. For our baseline system, we make use of a range of Transformer-based and expert-designed features and train Gated Recurrent Unit (GRU)-Recurrent Neural Network (RNN) models on them, resulting in a competitive baseline system. On the unseen test datasets of the respective sub-challenges, it achieves a mean Pearson's Correlation Coefficient (ρ) of 0.3573 for MuSe-Perception and an Area Under the Curve (AUC) value of 0.8682 for MuSe-Humor.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MuSe '24, October 28, 2024, Melbourne, Australia
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN xxx-x-xxxx-xxxx-x/xx/xx . . . $15.00
https://doi.org/10.1145/xxxxxxx.xxxxxxx
CCS CONCEPTS
• Computing methodologies → Computer vision; Natural language processing; Neural networks; Machine learning.
KEYWORDS
Multimodal Sentiment Analysis; Affective Computing; Social Perception; Humor Detection; Multimodal Fusion; Workshop; Challenge; Benchmark
ACM Reference Format:
Shahin Amiriparian, Lukas Christ, Alexander Kathan, Maurice Gerczuk, Niklas Müller, Steffen Klug, Lukas Stappen, Andreas König, Erik Cambria, Björn W. Schuller, and Simone Eulitz. 2024. The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition. In Proceedings of the 5th Multimodal Sentiment Analysis Challenge and Workshop: Social Signal Quantification and Humor Recognition (MuSe '24), October 28, 2024, Melbourne, Australia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/xxxxxxx.xxxxxxx
1 INTRODUCTION
In its 5th edition, the Multimodal Sentiment Analysis Challenge (MuSe) proposes two tasks, namely audio-visual analysis of perceived characteristics of individuals and cross-cultural humor detection. Each sub-challenge employs a distinct dataset.
In the first sub-challenge, MuSe-Perception, participants are tasked with training machine learning models for the recognition of perceived characteristics of individuals from video interviews. Audio-visual social perception analysis explores social traits that are important for how we are perceived by other people [1]. The perception others have of us can have a significant impact on our relationships with them as well as on our own professional future [3]. At the same time, social perception is a complex phenomenon for which different theories have been brought forward. The Dual Perspective Model (DPM) [2] is based on the dimensions of agency (i. e., traits related to goal-achievement such as competence or dominance) and communality (i. e., traits referring to social relations such as friendliness or warmth) [11].
The former is stereotypically associated with masculine gender roles, while the latter is traditionally attributed to femininity [25]. In this context, the work by Eulitz and Gazdag [27] underscores the significance of understanding how perceived characteristics, particularly agentic traits, influence professional success, including the performance of CEOs in the stock market. Such insights are invaluable for organizations aiming to thrive in dynamic and competitive environments. However, extracting these insights requires advanced tools capable of capturing and analyzing nuanced social signals. Traditional methods of assessing social perception often rely on structured questionnaires or observational techniques to gauge individuals' interpretations of and responses to social cues and interactions. In contrast, audio-visual machine learning systems offer scalability, enabling researchers to analyze large datasets efficiently. Moreover, these systems can uncover patterns and correlations that may not be immediately apparent to human observers, leading to deeper insights into the factors influencing social perception. This is where audio-visual machine learning systems play a pivotal role: they offer a holistic approach to understanding social perception by leveraging both auditory and visual cues. They can capture subtle nuances in facial expressions [16, 38, 59], body language [15], tone of voice [59], and verbal content [22, 53], providing a rich set of features to analyze. In the context of the Social Perception Sub-Challenge, the LMU Munich Executive Leadership Perception (LMU-ELP) dataset (cf. Section 2.1) presents a unique opportunity to explore the intersection of audio-visual data and social perception.
In the second task, the Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor), participants will train their models to detect humor within German football press conference recordings.
Humor is a ubiquitous yet puzzling phenomenon in human communication. While it is strongly connected with an intention to amuse one's audience [31], humor has been shown to potentially elicit a wide range of both positive and negative effects [17]. Throughout the last three decades, researchers in Affective Computing and related fields have addressed problems pertaining to computational humor, in particular automatic humor detection [54, 60, 68]. Spurred by the advancements in multimodal sentiment analysis, considerable attention has recently been paid to multimodal approaches to humor detection [13, 35]. Such multimodal methods are particularly promising, as humor in personal interactions is an inherently multimodal phenomenon, expressed not just by means of what is said, but also via, e. g., gestures and facial expressions. A plethora of datasets dedicated to multimodal humor recognition exists [36, 46, 64]. However, they are typically built from recordings of staged contexts, e. g., TV shows or TED talks, thus potentially missing out on spontaneous, in-the-wild aspects of humorous communication. Furthermore, several such datasets utilize audience laughter as a proxy label for humor, thus risking reducing the complex concept of humorous communication to the mere delivery of (scripted) punchlines. The Passau-SFCH dataset [21] seeks to mitigate these issues by providing videos from a semi-staged context, namely press conferences, that have been labeled manually for humor.
MuSe-Humor adds another layer of complexity to the humor recognition task by introducing a cross-cultural scenario. More specifically, the training data consists of German recordings, while the test data is composed of English videos. Whereas empirical studies have been conducted on cross-cultural similarities and differences in humor [39, 55], this problem has not received much attention from the machine learning domain. To the best of our knowledge, last year's edition of MuSe-Humor was the first to introduce this task to the community, giving rise to a range of different systems proposed by the challenge's participants [33, 42, 65, 67, 70].
This year's edition of MuSe-Humor maintains the same data and data partitions as last year's [20]. The training set comprises recordings from German football press conferences, while the unseen test set exclusively features press conferences conducted in English, thus setting the stage for a cross-cultural, cross-lingual evaluation scenario. For the training partition, we utilize the German Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset [21], previously featured in the 2022 and 2023 editions of MuSe [19, 33, 40, 65, 66, 70]. For the unseen test partition, we expanded the Passau-SFCH dataset by incorporating press conference recordings delivered by seven distinct coaches from the English Premier League, spanning from September 2016 to September 2020. Both the training and test partitions exclusively contain recordings in which the respective coach is speaking, contributing to an overall duration exceeding 17 hours. Although the original videos are labeled according to the Humor Style Questionnaire (HSQ) framework introduced by Martin et al. [44], the focus in this year's MuSe-Humor challenge is on binary prediction, specifically detecting the presence or absence of humor.
The sub-challenges presented in MuSe 2024 are designed to captivate a broad audience, drawing the interest of researchers spanning numerous domains, including multimodal sentiment analysis, affective computing, natural language processing, and signal processing. MuSe 2024 provides participants with an ideal platform to apply their expertise by leveraging multimodal self-supervised and representation learning techniques, as well as harnessing the power of generative AI and Large Language Models (LLMs).
Each sub-challenge is characterized by its unique datasets and prediction objectives, providing participants with an opportunity to delve into specific problem domains while simultaneously contributing to broader research endeavors. By serving as a central hub for the comparison of methodologies across tasks, MuSe facilitates the discovery of novel insights into the effectiveness of various approaches, modalities, and features within the field of affective computing.

Table 1: Statistics for each sub-challenge dataset. Included are the number of unique subjects (#) and the video durations (formatted as h:mm:ss) per partition, as well as the total number of subjects and the overall duration of all recordings in each dataset.

              MuSe-Perception       MuSe-Humor
Partition     #    Duration         #    Duration
Train         59   0:30:13          7    7:44:49
Development   58   0:29:12          3    3:06:48
Test          60   0:30:10          6    6:35:16
Σ             177  1:29:35          16   17:26:53

In Section 2, we provide a comprehensive overview of the outlined sub-challenges, detailing their corresponding datasets and the challenge protocol. Following this, Section 3 elaborates on our pre-processing and feature extraction pipeline, along with the experimental setup utilized to compute the baseline results for each sub-challenge. Subsequently, the results are showcased in Section 4, leading to the conclusion of the paper in Section 5.
2 THE TWO SUB-CHALLENGES
In this section, we describe each sub-challenge and dataset in detail
and provide a description of the challenge protocol.
2.1 The Social Perception Sub-Challenge
For the rst task,Social Perception Sub-Challenge (MuSe-
Perception), we introduce the novel
LMU-ELP
dataset, which
consists of audio-visual recordings of US executives, specically the
highest-ranking executives of listed rms - chief executive ocers
(CEOs). In our dataset, these CEOs present their rms to poten-
tial investors before taking their rms public, i. e., selling shares
in the stock market for the rst time. The goal of this challenge
is to predict the agency and communiality of each individual on
a16-dimensional Likert scale ranging from 1to 7. The target la-
bels have been selected based on an established measure in social
psychology (Bem Sex-Role Inventory scale [
12
]) to assess percep-
tions of agency and commonality, the basic dimension of the Dual
Perspective Model [
2
], including aggressiveness, arrogance, assertive-
ness, condence, dominance, independence, leadership qualities, and
risk-taking propensity (pertaining to agency), as well as attributes
like collaboration, enthusiasm, friendliness, good-naturedness, kind-
ness, likability, sincerity, and warmth (associated with communality).
Using Amazon Mechanical Turk (MTurk), 4304 annotators have la-
beled all dimensions of our data, comprising 177 CEOs. The dataset
stands out for its comprehensive coverage of 16 distinguished labels
of each individual, oering new perspectives in multimodal sensing
of social signals. To the best of our knowledge,
LMU-ELP
is the rst
multimodal dataset that provides detailed insights into the nuanced
dimensions of gender-associated attributes.
MuSe-Perception
par-
ticipants are encouraged to explore multimodal machine learning
methods to automatically recognize CEOs’ perceived characteris-
tics. Each team will submit their 16 predictions. The evaluation
metric is Pearson’s correlation coecient (
𝜌
), and the mean of all
16 𝜌values is set as the challenge’s evaluation criterion.
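For illustration, a minimal sketch of how this evaluation criterion could be computed from per-dimension predictions; the array shapes and variable names are placeholders and not part of the official evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr

def mean_pearson(predictions: np.ndarray, labels: np.ndarray) -> float:
    """Mean Pearson's rho across the 16 target dimensions.

    predictions, labels: arrays of shape (num_ceos, 16).
    """
    rhos = [pearsonr(predictions[:, d], labels[:, d])[0]
            for d in range(labels.shape[1])]
    return float(np.mean(rhos))
```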
2.2 The Cross-Cultural Humor Sub-Challenge
The training partition provided in MuSe-Humor comprises videos of 10 football coaches from the German Bundesliga, all of them native German speakers. The test set, in contrast, features 6 English Premier League coaches from 6 different countries (Argentina, England, France, Germany, Portugal, Spain), only one of them being a native speaker. All subjects are male and aged between 30 and 53 years (training partition) and 47 and 57 years (test partition), respectively. 7 of the 10 German coaches are utilized as the training set, whereas the remaining 3 are used as the development set. Table 1 provides further statistics of the dataset.
The videos are cut to only include segments in which the coaches are speaking. In addition to the recordings, manual transcriptions with corresponding timestamps are provided.
The videos in Passau-SFCH are originally labeled according to the HSQ, a two-dimensional model of humor proposed by Martin et al. [44]. From these annotations, the gold standard is created as elaborated in [21], such that each data point corresponds to a 2 s frame with a binary label. Overall, humorous segments make up 4.38 % of the training segments, 2.81 % of the development segments, and 6.17 % of the test segments.
As in previous editions of MuSe-Humor, AUC serves as the evaluation metric.
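As a small reference, the per-frame AUC can be computed with scikit-learn; the arrays below are placeholder values standing in for the gold labels and model scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 0])         # binary humor labels per 2 s frame
y_score = np.array([.1, .3, .8, .2, .6, .4])  # predicted humor probabilities
print(roc_auc_score(y_true, y_score))
```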
2.3 Challenge Protocol
To join the challenge, participants must be affiliated with an academic institution and complete the EULA found on the MuSe 2024 homepage1. The organizers will not compete in any sub-challenge. During the contest, participants upload their predictions for the test labels on the CodaBench platform. Each sub-challenge allows up to 5 predictions, where the best among them determines the rank on the leaderboard. All teams are encouraged to submit a paper detailing their experiments and results. Winning a sub-challenge requires an accepted paper. All papers are subject to a double-blind peer-review process.
1 https://www.muse-challenge.org/challenge/participate
3 BASELINE APPROACHES
To support participants in time-efficient model development and in reproducing the challenge baselines, we provide a set of expert-designed (explainable) and Transformer-based features. For the MuSe-Perception task, we provide 6 feature sets, comprising three vision-based and three audio-based feature sets. Additionally, for the MuSe-Humor sub-challenge, we provide an extra feature set tailored to the text modality, totaling 7 feature sets.
3.1 Pre-processing
We split each dataset into three partitions for training, development, and testing, ensuring that the distribution of target labels and the overall length of the recordings are balanced across the partitions. Moreover, we maintained speaker independence throughout all partitions.
For the LMU-ELP recordings (each CEO recording has a fixed length of 30 seconds), we removed data points in which the faces of multiple persons are present, ensuring that each recording solely contains the CEO for whom the labeling has been conducted. Furthermore, we eliminated video recordings with fewer than 5 face frames available. For the audio modality of the LMU-ELP dataset, we removed the background noise using the free online tool Vocal Remover2.
2 https://vocalremover.org/
From the test partition of the Passau-SFCH dataset, we manually removed video clips in which the coaches did not speak English. Additionally, we discarded clips with poor audio quality.
3.2 Audio
Prior to extracting audio features, we normalize all audio files to 3 dB and convert them to mono format, with a sampling rate of 16 kHz and a bit depth of 16. Subsequently, we employ the openSMILE toolkit [29] to compute handcrafted features. Additionally, we generate high-dimensional audio representations using both DeepSpectrum [6] and a modified version of Wav2Vec2.0 [9]. Both systems have demonstrated their efficacy in audio-based Speech Emotion Recognition (SER) and sentiment analysis tasks [5, 8, 14, 30].
3.2.1 eGeMAPS. We utilize the openSMILE toolkit [29] to extract the 88-dimensional extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) features [28], which have been shown to be robust for sentiment analysis and SER tasks [10, 50, 62, 66]. We employ the standard configuration for each sub-challenge and extract features using a window size of 2000 ms and a hop size of 500 ms.
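A minimal sketch of such a windowed eGeMAPS extraction with the opensmile Python package; the file handling and the loop over windows are illustrative assumptions, not the released extraction script.

```python
import opensmile

# eGeMAPSv02 functionals: 88 acoustic descriptors per analysis window.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_egemaps(path: str, duration_s: float, win: float = 2.0, hop: float = 0.5):
    """Slide a 2000 ms window with a 500 ms hop over one audio file."""
    feats = []
    start = 0.0
    while start + win <= duration_s:
        feats.append(smile.process_file(path, start=start, end=start + win))
        start += hop
    return feats  # list of 1x88 pandas DataFrames
```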
3.2.2 DeepSpectrum. Utilizing DeepSpectrum [6], we harness (more conventional) deep Convolutional Neural Network (CNN)-based representations from audio data. Our approach involves the initial generation of Mel-spectrograms for each audio file, employing a window size of 1000 ms and a hop size of 500 ms, with a configuration of 128 Mels and utilizing the viridis color mapping. These spectrogram representations are then fed into DenseNet121. We extract a 1024-dimensional feature vector from the output of the last pooling layer. The effectiveness of DeepSpectrum has been validated across various speech and audio recognition tasks [5, 7, 48–50].
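The following sketch re-creates the idea (Mel-spectrogram, viridis colour image, DenseNet121 penultimate features) with librosa, matplotlib, and torchvision; it is not the DeepSpectrum toolkit itself, and the windowing is simplified to a single 1 s chunk.

```python
import librosa
import numpy as np
import torch
import torch.nn.functional as F
from matplotlib import cm
from torchvision.models import densenet121, DenseNet121_Weights

model = densenet121(weights=DenseNet121_Weights.DEFAULT).eval()

def deepspectrum_like(wav_path: str) -> np.ndarray:
    """Return a 1024-d embedding for one 1 s audio chunk (simplified)."""
    y, sr = librosa.load(wav_path, sr=16000, duration=1.0)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Map the spectrogram to a viridis RGB image in [0, 1].
    rng = mel_db.max() - mel_db.min()
    norm = (mel_db - mel_db.min()) / (rng + 1e-8)
    rgb = cm.viridis(norm)[..., :3]                        # (128, T, 3)
    x = torch.tensor(rgb, dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    with torch.no_grad():
        fmap = model.features(x)                           # (1, 1024, 7, 7)
        emb = F.adaptive_avg_pool2d(fmap, 1).flatten(1)    # (1, 1024)
    return emb.squeeze(0).numpy()
```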
3.2.3 Wav2Vec2.0. In recent years, there has been major interest in self-supervised pretrained Transformer models in computer audition [43]. An exemplary foundation model is Wav2Vec2.0 [9], which has found widespread application in SER tasks [47, 51]. Given that all sub-challenges are affect-related, we opt for a large version of Wav2Vec2.0 fine-tuned on the MSP-Podcast dataset specifically for emotion recognition [63]3. We extract deep features from an audio signal by averaging its representations in the final layer of this model, yielding 1024-dimensional embeddings.
For MuSe-Humor, we extract 2 Hz features by sliding a 3000 ms window over each audio file with a step size of 500 ms, and for MuSe-Perception, a 2000 ms window size (with a hop size of 500 ms) is applied. In the previous edition of MuSe, participants frequently employed Wav2Vec2.0 features, demonstrating their efficacy in capturing nuanced audio characteristics for affective computing tasks [33, 41, 42, 65, 70].
3 https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
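A sketch of extracting mean-pooled final-layer embeddings from this checkpoint with the transformers library; loading only the base Wav2Vec2Model (without the emotion regression head) is an assumption that matches the 1024-dimensional output described above.

```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
processor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

def w2v2_embedding(wav_path: str, start_s: float, win_s: float = 3.0):
    """Mean-pool the last hidden layer over one analysis window -> 1024-d."""
    y, sr = librosa.load(wav_path, sr=16000, offset=start_s, duration=win_s)
    inputs = processor(y, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, frames, 1024)
    return hidden.mean(dim=1).squeeze(0).numpy()
```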
3.3 Video
To compute the visual modality baseline, we solely extract representations from the subjects' faces. In order to do so, we utilize Multi-task Cascaded Convolutional Networks (MTCNN) to isolate faces from the video recordings in both datasets. Subsequently, we obtain Facial Action Unit (FAU), FaceNet512, and ViT-FER representations from each face image.
These features have already proved useful in affect and sentiment analysis tasks. For example, Li et al. [42] used FAU, FaceNet512, and ViT features, among others, and integrated them into a multimodal feature encoder module to perform binary humor recognition. Similarly, recent works (e. g., [49, 65]) conducted experiments applying these features, achieving their overall best results when fusing them with other features from the audio and text modalities. Another study by Yi et al. [69] comes to similar conclusions, presenting a multimodal transformer fusion approach in which FAU and ViT features are used.
3.3.1 MTCNN. We utilize the MTCNN face detection model [71] to extract images of the subjects' faces.
In numerous videos from both datasets, multiple individuals are visible, yet only the CEOs in the MuSe-Perception dataset and the coaches in the MuSe-Humor dataset are relevant. Initially, we automatically single out the CEO's or coach's face in each video by clustering the face embeddings. Subsequently, we manually refine the obtained face sets, retaining only the one corresponding to the CEO or coach. After the extraction of the face images, we employ Py-Feat, FaceNet512, and ViT-FER for feature extraction.
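A sketch of sampling faces at 2 Hz with the MTCNN implementation from facenet-pytorch; the sampling logic and the file handling are simplified assumptions, and the subsequent clustering step is omitted.

```python
import cv2
from facenet_pytorch import MTCNN

detector = MTCNN(keep_all=True)

def sample_faces(video_path: str, rate_hz: float = 2.0):
    """Yield cropped face images (RGB numpy arrays) sampled at roughly 2 Hz."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(round(fps / rate_hz)), 1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            boxes, _ = detector.detect(rgb)
            if boxes is not None:
                for x1, y1, x2, y2 in boxes.astype(int):
                    yield rgb[max(y1, 0):y2, max(x1, 0):x2]
        idx += 1
    cap.release()
```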
3.3.2 Facial Action Units (FAUs). FAUs [26] offer a transparent method for encoding facial expressions by linking them to the activation of specific facial muscles. Given the rich emotional information conveyed by facial expressions, FAUs have received substantial interest in the sentiment analysis and affective computing community [72]. We employ the py-feat library4 to automatically estimate the activation levels of 20 distinct FAUs.
4 https://py-feat.org
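A sketch of AU estimation with py-feat's Detector; relying on the default bundled models and the single-image entry point is an assumption about the configuration, not the exact challenge setup.

```python
from feat import Detector

# The default py-feat detector bundles face detection, landmark, and AU models.
detector = Detector()

def extract_faus(image_path: str):
    """Return the action-unit activation columns for one face image."""
    result = detector.detect_image(image_path)   # returns a Fex dataframe
    return result.aus                            # one row of AU activations per detected face
```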
3.3.3 FaceNet512. We employ the FaceNet512 model [57], which is specially trained for face recognition. Specifically, we utilize the implementation provided in the deepface library [58], yielding a 512-dimensional embedding for each face image.
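A sketch using deepface's represent call; enforce_detection=False and the list-of-dicts return format are assumptions that fit a recent deepface version and pre-cropped face images.

```python
import numpy as np
from deepface import DeepFace

def facenet512_embedding(face_image_path: str) -> np.ndarray:
    """512-d FaceNet embedding for an already-cropped face image."""
    reps = DeepFace.represent(
        img_path=face_image_path,
        model_name="Facenet512",
        enforce_detection=False,   # faces were already cropped by MTCNN
    )
    return np.asarray(reps[0]["embedding"])
```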
3.3.4 Vision Transformer (ViT-FER). Additionally, we leverage a finetuned variant of ViT [24] provided by [18]5, adapted to emotion recognition on the FER2013 dataset [32]. We choose this approach since, contrary to the original ViT model, it is explicitly trained on pictures of faces rather than a wide variety of domains. The last hidden state of the special [CLS] token is utilized as the representation of a face image.
5 https://huggingface.co/trpakov/vit-face-expression
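A sketch of pulling the [CLS] representation from this checkpoint with transformers; loading it as a plain ViTModel (dropping the classification head) is an assumption consistent with using the last hidden state.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

name = "trpakov/vit-face-expression"
processor = AutoImageProcessor.from_pretrained(name)
vit = ViTModel.from_pretrained(name).eval()

def vit_fer_embedding(face_image_path: str):
    """Return the last-layer [CLS] embedding for one cropped face image."""
    image = Image.open(face_image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = vit(**inputs)
    return outputs.last_hidden_state[:, 0].squeeze(0).numpy()   # [CLS] token
```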
3.4 Text: Transformers
This modality is exclusive to the MuSe-Humor sub-challenge. Given that MuSe-Humor involves training and development data in German but a test set in English, we utilize the multilingual version of BERT [23]6, pretrained on Wikipedia entries across 104 languages, including German and English. This model has demonstrated effective generalization across different languages [52]. Specifically, we compute sentence representations by extracting the encoding of the [CLS] token from the model's final layer (768-dimensional), which represents the entire input sentence.
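A sketch of this [CLS] extraction with the transformers library; the example sentence is a placeholder.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def sentence_embedding(sentence: str):
    """768-d [CLS] encoding from the final layer of multilingual BERT."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0].squeeze(0).numpy()

emb = sentence_embedding("Das war ein gelungener Scherz.")   # example German sentence
```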
While BERT embeddings have been demonstrated to be suitable for the task of humor recognition [33, 42, 67], they can only be regarded as a simple baseline. The advent of larger LLMs such as LLaMa [61] or Mistral [37] opens up new avenues for humor recognition based on the textual modality. We strongly encourage participants to explore the aptitude of (multilingual) LLMs for MuSe-Humor, as the text modality proves to be promising for the problem at hand (cf. Section 4.2).
6 https://huggingface.co/bert-base-multilingual-cased
3.5 Alignment
The Passau-SFCH dataset comprises audio, video, and transcripts. To facilitate the development of multimodal models leveraging these features, we align the various modalities with each other and with the labeling scheme specific to the task.
Both audio- and face-based features are extracted at a rate of 2 Hz, using sliding windows with a step size of 500 ms for audio and sampling faces at a 2 Hz rate for the video modality. Deriving sentence-wise timestamps from the manual transcripts is done in three steps: first, the Montreal Forced Aligner (MFA) [45] toolkit is utilized to generate word-level timestamps. Next, punctuation is automatically added to the transcripts using the deepmultilingualpunctuation tool [34]. Finally, the transcripts are segmented into sentences using PySBD [56], allowing for the inference of sentence-wise timestamps from the word-level timestamps. Subsequently, 2 Hz textual features are computed by averaging the embeddings of the sentences overlapping with the respective 500 ms windows.
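A sketch of the punctuation-restoration and sentence-segmentation steps (the MFA word alignment is omitted); the German language setting and the example transcript are assumptions.

```python
import pysbd
from deepmultilingualpunctuation import PunctuationModel

punct_model = PunctuationModel()                        # multilingual punctuation restoration
segmenter = pysbd.Segmenter(language="de", clean=False)

def to_sentences(raw_transcript: str):
    """Restore punctuation, then split the transcript into sentences."""
    punctuated = punct_model.restore_punctuation(raw_transcript)
    return segmenter.segment(punctuated)

print(to_sentences("das war natürlich ein schweres spiel wir sind trotzdem zufrieden"))
```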
As the labels in Passau-SFCH correspond to windows of size 2 s, the alignment of the features for MuSe-Humor with the annotations is not immediate, but it is straightforward: each 2 s label window subsumes at most four 500 ms feature steps.
In MuSe-Perception, the labels pertain to entire videos, eliminating the need for alignment with the labels.
3.6 Baseline Training
We provide participants with a simple baseline system based on a GRU-RNN. Our model encodes sequential data via a stack of GRU layers. The final hidden representation of the last of these layers is taken as the embedding of the entire sequence and fed into two feed-forward layers for the final prediction. For both tasks, hyperparameters, including the number of GRU layers, the learning rate, and the GRU representation size, are optimized. To obtain a set of unimodal baselines, the model is optimized and trained for each feature set as described in Section 3. For a simple multimodal baseline, we employ a weighted late fusion approach, averaging the best-performing audio-based model's predictions with those of the best-performing video-based model. The predictions are weighted by the performance of the respective model on the development set. Simulating challenge conditions, we conduct all experiments with 5 fixed seeds for both sub-challenges. We provide the baseline code, checkpoints, and hyperparameter configurations in the accompanying GitHub repository7.
7 https://github.com/amirip/MuSe-2024
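A minimal PyTorch sketch of such a GRU-based sequence model with two feed-forward output layers; the layer sizes and the single-output head are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class GRUBaseline(nn.Module):
    """Stacked GRU encoder; last hidden state feeds two feed-forward layers."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128,
                 num_layers: int = 2, out_dim: int = 1):
        super().__init__()
        self.gru = nn.GRU(feature_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feature_dim); keep the final state of the last layer.
        _, h_n = self.gru(x)
        return self.head(h_n[-1])            # (batch, out_dim), raw scores / logits

# Example: a batch of Wav2Vec2.0 sequences (4 steps of 500 ms, 1024-d each).
model = GRUBaseline(feature_dim=1024)
scores = model(torch.randn(8, 4, 1024))      # (8, 1)
```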
3.6.1 MuSe-Perception. For each of the 16 prediction targets in MuSe-Perception, a separate model is trained. We observe this approach to be more promising than a multi-label prediction setup, i. e., one model predicting all 16 labels. Consequently, the late fusion approach also fuses the best video and audio models per target label. Nonetheless, we encourage participants to explore combinations of prediction targets. In order to keep the number of experiments within a reasonable range, we optimize the hyperparameters on one target only and employ the configuration thus found for all 16 labels. As MuSe-Perception is a regression task, we choose the Mean Squared Error (MSE) as the loss function.
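A sketch of the development-set-weighted late fusion described above; normalising the models' development metric values to sum to one is an assumption about the exact weighting, not the released implementation.

```python
import numpy as np

def weighted_late_fusion(predictions: list, dev_scores: list) -> np.ndarray:
    """Average per-model predictions, weighted by development performance.

    predictions: one array of test-set predictions per unimodal model.
    dev_scores:  the corresponding development-set metric values.
    """
    weights = np.asarray(dev_scores, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(predictions, axis=0)              # (num_models, num_samples)
    return np.tensordot(weights, stacked, axes=1)        # weighted mean per sample

# e.g. fuse the best audio and video models for one target label:
# fused = weighted_late_fusion([audio_preds, video_preds], [0.26, 0.37])
```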
3.6.2 MuSe-Humor. In the Passau-SFCH dataset, each label pertains to a 2 s frame. As the features are extracted at 500 ms intervals, data points in the MuSe-Humor experiments are sequences with a length of at most 4. Binary Cross-Entropy (BCE) is used as the loss function.
4 BASELINE RESULTS
We implement the training of GRUs as outlined in the preceding section. In the following, we introduce and analyze the baseline results.
4.1 MuSe-Perception
For the MuSe-Perception sub-challenge, we first present the mean Pearson's correlation over all 16 target dimensions for each feature set in Table 3. Averaged across all targets, ViT-FER leads to the best overall single-modality results at ρ = .3679 on development and ρ = .2577 on test. Looking only at the audio modality, eGeMAPS outperforms both evaluated deep feature sets, DeepSpectrum and Wav2Vec2.0, at .2561 and .1442. Overall, however, results on the development and test partitions often diverge substantially, e. g., DeepSpectrum drops from .2075 to .0113. The weighted late fusion approach, which chooses the best model for each target separately, leads to greatly increased correlations on both the development (ρ = .5579) and test set (ρ = .3573), suggesting that no single feature works best for every target label.
To further investigate how the configurations fare at recognizing each of the 16 attributes, we report the best development and accompanying test set results for the individual target attributes in Table 2. We further make use of the "Big Two" dimensions, Agency and Communion [11], to split the target labels into two groups. From the agentive subgroup, aggressive can be recognized quite well by models trained on any of the evaluated features, with ViT-FER achieving the best results on both the development and test partitions and further showing solid generalization capabilities. However, for confident, the strong development performance of ViT-FER does not transfer to the test set, dropping from .7373 to a mere .0783 Pearson's correlation. In the communal subgroup, good-natured and kind are detected quite reliably and with good generalization behavior by models trained on visual features. Audio models trained on deep features (DeepSpectrum and Wav2Vec2.0), on the other hand, see their performance drop to chance level when moving from development to test.
4.2 MuSe-Humor
Table 4 reports the baselines for MuSe-Humor.
All models achieve above-chance (AUC > 0.5) results, regardless of the modality and the employed feature representations. In this year's rendition of the sub-challenge, the introduction of ViT-FER features helped the visual modality catch up to the best audio results, which are again obtained with Wav2Vec2.0 embeddings. GRUs trained on these features reach .7995 (ViT-FER) and .8042 (Wav2Vec2.0) on the test partition. However, it can be observed that ViT-FER features do not generalize as well to unseen data, with performance dropping from .8932 to .7995 between the development and test partitions. As last year, these generalization deficits extend across all visual features, while the audio modality fares better, with Wav2Vec2.0 and eGeMAPS only slightly dropping in AUC and DeepSpectrum matching its development performance on the test set. The efficacy of exploiting linguistic information is further demonstrated by the purely textual BERT features, which come third after Wav2Vec2.0 and ViT-FER, outperforming all other single-modality models.
Our late fusion approach consistently improves over the respective unimodal baselines. The performance gains achieved in the visual modality with ViT-FER features further lead to fusion settings that include video eclipsing those that rely on only audio and text. The overall highest AUCs of .9251 and .8682 on the development and test sets, respectively, are achieved by fusing all three modalities.
5 CONCLUSIONS
We introduced MuSe 2024, the 5th Multimodal Sentiment Analysis Challenge, comprising two sub-challenges, MuSe-Perception and MuSe-Humor. For the MuSe-Perception sub-challenge, the novel LMU-ELP dataset has been introduced and made available; it consists of interview recordings of CEOs who present their firms to potential investors before taking their firms public. Participants were tasked with predicting 16 different attributes of the CEOs, e. g., confidence or sincerity, on a Likert scale from 1 to 7.
The MuSe-Humor sub-challenge is a relaunch of the same task from the 2023 edition of MuSe [4, 20]. It uses an extended version of the Passau-SFCH dataset [21]. Participants are tasked with detecting spontaneous humor in press conferences across cultures and languages, training their models on recordings of German-speaking coaches and testing on English data.
We employed publicly accessible code and repositories to extract audio, visual, and textual features. Additionally, we trained simple GRU models on the acquired representations to establish the official challenge baselines. These baselines represent the models' performance on the test partitions of each sub-challenge, as follows: a mean ρ value of .3573 for MuSe-Perception and an AUC value of .8682 for MuSe-Humor. Both baseline results were achieved via late fusion of all modalities in the respective sub-challenge.
By sharing our code, datasets, and features publicly, we aim to facilitate broader participation and engagement from the research community. This open approach promotes transparency and encourages researchers to build upon our work, accelerating progress in multimodal data processing. As we continue to refine our baseline systems and explore new methodologies, we anticipate further advancements in our understanding of human behavior and communication across different modalities. Through ongoing collaboration and experimentation within the MuSe 2024 framework, we hope to drive innovation and ultimately contribute to the development of more effective multimodal machine learning systems for behavioral modeling and affective analysis.
6 ACKNOWLEDGMENTS
This project has received funding from the Deutsche Forschungsgemeinschaft (DFG) under grant agreement No. 461420398 and from the DFG's Reinhart Koselleck project No. 442218748 (AUDI0NOMOUS). Shahin Amiriparian and Björn W. Schuller are also affiliated with the Munich Center for Machine Learning (MCML), Germany. Björn W. Schuller is further affiliated with the Chair of Embedded Intelligence for Healthcare and Wellbeing (EIHW), University of Augsburg, Germany, the Chair of Health Informatics (CHI), Klinikum rechts der Isar (MRI), Technical University of Munich, Germany, and the Munich Data Science Institute (MDSI), Germany.
Table 2: Baseline results measured in Pearson's correlation for the MuSe-Perception sub-challenge, disaggregated by target label. For brevity, we only report the best development result and the accompanying test set result. Furthermore, we group the 16 aspects of social gender into "agentive" and "communal" attributes, based on [11].
Evaluation Metric: [𝜌↑]
Audio Visual
Target DeepSpectrum eGeMAPS Wav2Vec2.0 FaceNet512 FAU ViT-FER
Dev. Test Dev. Test Dev. Test Dev. Test Dev. Test Dev. Test
agentive
aggressive .4728 .1323 .3687 .3572 .2601 .3266 .2300 .4204 .1877 .3123 .4911 .4718
arrogant .3963 -.1216 .4394 .3422 .4503 .4808 .6501 .3926 .1717 .3242 .4066 .4412
assertive .0990 -.0281 .3623 -.0334 .3732 .1277 .1870 .3528 .4342 -.0868 .4139 .2437
condent .0730 .3600 .3955 -.0594 .4916 .0618 .3675 -.0131 .4873 .2885 .7373 .0783
dominant .1402 .0077 .6405 .2408 .5844 .1293 .7488 .4434 .5846 .1925 .3494 .2977
independent .2294 -.0800 .4550 -.1849 .5567 .3514 .3963 .2789 .4015 .1957 .7641 .2398
risk-taking .3349 .0984 .5166 .4479 .4363 .2074 .6402 .3919 .3951 -.1371 .5940 .3047
leader-like .0595 -.2099 .4212 -.1064 .6330 .2826 .2588 .1135 .2979 .0658 .6274 .5135
communal
collaborative .2053 .2049 .3178 .2835 .3118 -.1357 .1582 .1698 .0668 .2768 .1671 .2565
enthusiastic .2285 .2650 .4505 .2294 .3795 .0705 .4545 .0196 .5338 .2946 .6151 .1770
friendly .3810 .3825 .3843 .3052 .3885 .1116 .1431 -.1210 .2708 .3512 .6428 .3871
good-natured .2484 .1960 .4424 .1906 .3923 .1209 .3340 .3607 .3536 .3224 .6596 .5047
kind .3987 .0068 .4117 .1665 .2808 .0382 .4460 .3606 .2515 .3271 .6113 .3768
likeable .4155 .1034 .3114 .1456 .3779 .1423 .2500 -.2306 .2605 .2383 .5203 .1734
sincere .1374 .0984 .4343 .0972 .4780 .2541 .1633 -.0285 .1967 .3864 .0673 .3463
warm .4873 .5090 .2266 .2219 .3886 .1732 .4337 .4151 .3263 .2725 .3927 .2201
Table 3: MuSe-Perception baseline results. Each line refers to experiments conducted with 5 fixed seeds and reports the best Pearson correlation among them, together with the mean Pearson correlations and their standard deviations across the 5 seeds.
Features Development Test
Audio
eGeMAPS .2561 (.1931 ±.0512) .1442 (.1424 ±.0740)
DeepSpectrum .2075 (.1686 ±.0509) .0113 (.0236 ±.0272)
Wav2Vec2.0 .1448 (.0687 ±.0664) .0950 (.0765 ±.0765)
Video
FAU .2143 (.1703 ±.0382) .0793 (.0886 ±.0190)
ViT-FER .3679 (.3042 ±.0634) .2577 (.2333 ±.0240)
FaceNet512 .2248 (.1755 ±.0228) .1586 (.1167 ±.0129)
Late Fusion
Audio + Video .5579 (.5016 ±.0421) .3573 (.3122 ±.0237)
Table 4: MuSe-Humor baseline results. Each line refers to experiments conducted with 5 fixed seeds and reports the best AUC score among them, together with the mean AUC scores and their standard deviations across the 5 seeds.
Evaluation Metric: [AUC ]
Features Development Test
Audio
eGeMAPS .7094 (.6717 ±.0206) .6665 (.6610 ±.0105)
DeepSpectrum .6963 (.6931 ±.0033) .6987 (.6996 ±.0040)
Wav2Vec2.0 .8392 (.8316 ±.0062) .8042 (.8020 ±.0023)
Video
FAU .7749 (.7673 ±.0043) .6400 (.5941 ±.0513)
ViT-FER .8932 (.8888 ±.0025) .7995 (.8052 ±.0030)
FaceNet512 .7311 (.6340 ±.0640) .5812 (.5831 ±.0197)
Text
BERT .8109 (.7697 ±.0680) .7581 (.7163 ±.0886)
Late Fusion
Audio + Text .8827 (.8697 ±.0194) .8194 (.8173 ±.0114)
Audio + Video .9160 (.9119 ±.0054) .8534 (.8544 ±.0023)
Text + Video .9132 (.9059 ±.0102) .8473 (.8384 ±.0184)
Audio + Text + Video .9251 (.9185 ±.0069) .8682 (.8638 ±.0079)
REFERENCES
[1]
Andrea E Abele, Naomi Ellemers, Susan T Fiske, Alex Koch, and Vincent Yzerbyt.
2021. Navigating the social world: Toward an integrated framework for evaluating
self, individuals, and groups. Psychological Review 128, 2 (2021), 290.
[2]
Andrea E. Abele and Bogdan Wojciszke. 2014. Chapter Four - Communal and
Agentic Content in Social Cognition: A Dual Perspective Model. In Advances in
Experimental Social Psychology, James M. Olson and Mark P. Zanna (Eds.). Vol. 50.
Academic Press, 195–255. https://doi.org/10.1016/B978-0-12-800284-1.00004-7
[3]
Nalini Ambady and John Joseph Skowronski. 2008. First impressions. Guilford
Press.
[4]
Shahin Amiriparian, Lukas Christ, Andreas König, Eva-Maria Messner, Alan
Cowen, Erik Cambria, and Björn W. Schuller. 2023. MuSe 2023 Challenge: Mul-
timodal Prediction of Mimicked Emotions, Cross-Cultural Humour, and Per-
sonalised Recognition of Aects. In Proceedings of the 31st ACM International
Conference on Multimedia (MM’23), October 29-November 2, 2023, Ottawa, Canada.
Association for Computing Machinery, Ottawa, Canada. to appear.
[5]
Shahin Amiriparian, Nicholas Cummins, Sandra Ottl, Maurice Gerczuk, and Björn
Schuller. 2017. Sentiment Analysis Using Image-based Deep Spectrum Features.
In Proceedings 2nd International Workshop on Automatic Sentiment Analysis in
the Wild (WASA 2017) held in conjunction with the 7th biannual Conference on
Aective Computing and Intelligent Interaction (ACII 2017). AAAC, IEEE, San
Antonio, TX, 26–29.
[6]
Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Nicholas Cummins, Michael
Freitag, Sergey Pugachevskiy, and Björn Schuller. 2017. Snore Sound Classifica-
tion Using Image-based Deep Spectrum Features. In Proceedings INTERSPEECH
2017, 18th Annual Conference of the International Speech Communication Associa-
tion. ISCA, ISCA, Stockholm, Sweden, 3512–3516.
[7]
Shahin Amiriparian, Maurice Gerczuk, Lukas Stappen, Alice Baird, Lukas Koebe,
Sandra Ottl, and Björn Schuller. 2020. Towards Cross-Modal Pre-Training and
Learning Tempo-Spatial Characteristics for Audio Recognition with Convolu-
tional and Recurrent Neural Networks. EURASIP Journal on Audio, Speech, and
Music Processing 2020, 19 (2020), 1–11.
[8]
Shahin Amiriparian, Tobias Hübner, Vincent Karas, Maurice Gerczuk, San-
dra Ottl, and Björn W. Schuller. 2022. DeepSpectrumLite: A Power-Efficient
Transfer Learning Framework for Embedded Speech and Audio Processing
From Decentralized Data. Frontiers in Artificial Intelligence 5 (2022), 10 pages.
https://doi.org/10.3389/frai.2022.856232
[9]
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020.
wav2vec 2.0: A framework for self-supervised learning of speech representations.
Advances in neural information processing systems 33 (2020), 12449–12460.
[10]
Alice Baird, Shahin Amiriparian, and Björn Schuller. 2019. Can deep generative
audio be emotional? Towards an approach for personalised emotional audio gen-
eration. In 2019 IEEE 21st International Workshop on Multimedia Signal Processing
(MMSP). IEEE, IEEE, Kuala Lumpur, Malaysia, 1–5.
[11]
David Bakan. 1966. The duality of human existence: An essay on psychology and
religion. Rand Mcnally, Oxford, England. Pages: 242.
[12]
Sandra L Bem. 1981. A manual for the Bem sex role inventory. California: Mind
Garden (1981).
[13]
Dario Bertero and Pascale Fung. 2016. Deep learning of audio and language
features for humor prediction. In Proceedings of the Tenth International Conference
on Language Resources and Evaluation (LREC’16). 496–501.
[14]
Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Leon J. M. Rothkrantz, Joeri Zwerts, Jelle Treep, and Casper Kaandorp. 2021. The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates. In Proceedings INTERSPEECH 2021, 22nd Annual Conference of the International Speech Communication Association. ISCA, ISCA, Brno, Czechia, 431–435.
[15]
Simon M Breil, Sarah Osterholz, Steffen Nestler, and Mitja D Back. 2021. 13
contributions of nonverbal cues to the accurate judgment of personality traits.
The Oxford handbook of accurate personality judgment (2021), 195–218.
[16]
Andrew J Calder, Michael Ewbank, and Luca Passamonti. 2011. Personality
inuences the neural responses to viewing facial expressions of emotion. Philo-
sophical Transactions of the Royal Society B: Biological Sciences 366, 1571 (2011),
1684–1701.
[17]
Arnie Cann, Amanda J Watson, and Elisabeth A Bridgewater. 2014. Assessing
humor at work: The humor climate questionnaire. Humor 27, 2 (2014), 307–323.
[18]
Aayushi Chaudhari, Chintan Bhatt, Achyut Krishna, and Pier Luigi Mazzeo. 2022.
ViTFER: facial emotion recognition with vision transformers. Applied System
Innovation 5, 4 (2022), 80.
[19]
Chengxin Chen and Pengyuan Zhang. 2022. Integrating Cross-Modal Interactions
via Latent Representation Shift for Multi-Modal Humor Detection. In Proceedings
of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge
(Lisboa, Portugal) (MuSe’ 22). Association for Computing Machinery, New York,
NY, USA, 23–28. https://doi.org/10.1145/3551876.3554805
[20]
Lukas Christ, Shahin Amiriparian, Alice Baird, Alexander Kathan, Niklas Müller,
Steen Klug, Chris Gagne, Panagiotis Tzirakis, Lukas Stappen, Eva-Maria Meßner,
et al. 2023. The muse 2023 multimodal sentiment analysis challenge: Mimicked
emotions, cross-cultural humour, and personalisation. In Proceedings of the 4th
on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions,
Humour and Personalisation. 1–10.
[21]
Lukas Christ, Shahin Amiriparian, Alexander Kathan, Niklas Müller, Andreas
König, and Björn W Schuller. 2023. Towards Multimodal Prediction of Sponta-
neous Humour: A Novel Dataset and First Results. arXiv preprint arXiv:2209.14272
(2023).
[22]
Andrew Cutler and David M Condon. 2022. Deep lexical hypothesis: Identifying
personality structure in natural language. Journal of Personality and Social
Psychology (2022).
[23]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding. In
Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies. 4171–4186.
[24]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi-
aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg
Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is
Worth 16x16 Words: Transformers for Image Recognition at Scale. In Interna-
tional Conference on Learning Representations. https://openreview.net/forum?id=
YicbFdNTTy
[25]
Alice H Eagly and Steven J Karau. 2002. Role congruity theory of prejudice
toward female leaders. Psychological review 109, 3 (2002), 573.
[26]
Paul Ekman and Wallace V Friesen. 1978. Facial action coding system. Environ-
mental Psychology & Nonverbal Behavior (1978).
[27]
Simone Maria Eulitz and Brooke A. Gazdag. 2021. Beyond Biology: The Impact
of Perceptions of CEO Social Gender on Investor Reactions During an IPO. Academy
of Management Proceedings 2021, 1 (Aug. 2021), 12379. https://doi.org/10.5465/
AMBPP.2021.12379abstract Publisher: Academy of Management.
[28]
Florian Eyben, Klaus R Scherer, Björn W Schuller, Johan Sundberg, Elisabeth
André, Carlos Busso, Laurence Y Devillers, Julien Epps, Petri Laukka, Shrikanth S
Narayanan, et al. 2015. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing 7, 2 (2015), 190–202.
[29]
Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. Opensmile: the mu-
nich versatile and fast open-source audio feature extractor. In Proceedings of the
18th ACM International Conference on Multimedia. Association for Computing
Machinery, Firenze, Italy, 1459–1462.
[30]
Maurice Gerczuk, Shahin Amiriparian, Sandra Ottl, and Björn Schuller. 2022.
EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion
Recognition. IEEE Transactions on Affective Computing 13 (2022).
[31]
Panagiotis Gkorezis, Eugenia Petridou, and Panteleimon Xanthiakos. 2014. Leader
positive humor and organizational cynicism: LMX as a mediator. Leadership &
Organization Development Journal 35 (2014), 305–315.
[32]
Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi
Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun
Lee, et al. 2013. Challenges in representation learning: A report on three machine
learning contests. In Neural Information Processing: 20th International Conference,
ICONIP 2013, Daegu, Korea, November 3-7, 2013. Proceedings, Part III 20. Springer,
117–124.
[33]
Tamás Grósz, Anja Virkkunen, Dejan Porjazovski, and Mikko Kurimo. 2023.
Discovering Relevant Sub-spaces of BERT, Wav2Vec 2.0, ELECTRA and ViT
Embeddings for Humor and Mimicked Emotion Recognition with Integrated
Gradients. In Proceedings of the 4th on Multimodal Sentiment Analysis Challenge
and Workshop: Mimicked Emotions, Humour and Personalisation. 27–34.
[34]
Oliver Guhr, Anne-Kathrin Schumann, Frank Bahrmann, and Hans Joachim
Böhme. 2021. FullStop: Multilingual Deep Models for Punctuation Prediction.
In Proceedings of the Swiss Text Analytics Conference 2021. CEUR Workshop
Proceedings, Winterthur, Switzerland. http://ceur-ws.org/Vol-2957/sepp_paper4.pdf
[35]
Md Kamrul Hasan, Sangwu Lee, Wasifur Rahman, Amir Zadeh, Rada Mihalcea,
Louis-Philippe Morency, and Ehsan Hoque. 2021. Humor knowledge enriched
transformer for understanding multimodal humor. In Proceedings of the AAAI
Conference on Articial Intelligence, Vol. 35. 12972–12980.
[36]
Md Kamrul Hasan, Wasifur Rahman, AmirAli Bagher Zadeh, Jianyuan Zhong,
Md Iftekhar Tanveer, Louis-Philippe Morency, and Mohammed (Ehsan) Hoque.
2019. UR-FUNNY: A Multimodal Language Dataset for Understanding Humor.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong,
China, 2046–2056. https://doi.org/10.18653/v1/D19-1211
[37]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De-
vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,
Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint
arXiv:2310.06825 (2023).
[38]
Alexander Kachur, Evgeny Osin, Denis Davydov, Konstantin Shutilov, and Alexey
Novokshonov. 2020. Assessing the Big Five personality traits using real-life static
facial images. Scientic Reports 10, 1 (2020), 8487.
[39]
Anna Ladilova and Ulrike Schröder. 2022. Humor in intercultural interaction:
A source for misunderstanding or a common ground builder? A multimodal
analysis. Intercultural Pragmatics 19, 1 (2022), 71–101.
[40]
Jia Li, Ziyang Zhang, Junjie Lang, Yueqi Jiang, Liuwei An, Peng Zou, Yangyang
Xu, Sheng Gao, Jie Lin, Chunxiao Fan, Xiao Sun, and Meng Wang. 2022. Hybrid
Multimodal Feature Extraction, Mining and Fusion for Sentiment Analysis. In Pro-
ceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and
Challenge (Lisboa, Portugal) (MuSe’ 22). Association for Computing Machinery,
New York, NY, USA, 81–88. https://doi.org/10.1145/3551876.3554809
[41]
Qi Li, Shulei Tang, Feixiang Zhang, Ruotong Wang, Yangyang Xu, Zhuoer Zhao,
Xiao Sun, and Meng Wang. 2023. Temporal-aware Multimodal Feature Fusion for
Sentiment Analysis. In Proceedings of the 4th on Multimodal Sentiment Analysis
Challenge and Workshop: Mimicked Emotions, Humour and Personalisation. 99–
105.
[42]
Qi Li, Yangyang Xu, Zhuoer Zhao, Shulei Tang, Feixiang Zhang, Ruotong Wang,
Xiao Sun, and Meng Wang. 2023. JTMA: Joint Multimodal Feature Fusion and
Temporal Multi-head Attention for Humor Detection. In Proceedings of the 4th
on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions,
Humour and Personalisation. 59–65.
[43]
Shuo Liu, Adria Mallol-Ragolta, Emilia Parada-Cabaleiro, Kun Qian, Xin Jing,
Alexander Kathan, Bin Hu, and Björn W. Schuller. 2022. Audio self-supervised
learning: A survey. Patterns 3, 12 (2022), 100616. https://doi.org/10.1016/j.patter.
2022.100616
[44]
Rod A Martin, Patricia Puhlik-Doris, Gwen Larsen, Jeanette Gray, and Kelly Weir.
2003. Individual dierences in uses of humor and their relation to psychological
well-being: Development of the Humor Styles Questionnaire. Journal of research
in personality 37, 1 (2003), 48–75.
[45]
Michael McAulie, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan
Sonderegger. 2017. Montreal Forced Aligner: Trainable Text-Speech Alignment
Using Kaldi.. In Proceedings of INTERSPEECH, Vol. 2017. International Speech
Communication Association (ISCA), Stockholm, Sweden, 498–502.
[46]
Anirudh Mittal, Pranav Jeevan, Prerak Gandhi, Diptesh Kanojia, and Pushpak
Bhattacharyya. 2021. "So You Think You're Funny?": Rating the Humour Quotient
in Standup Comedy. arXiv preprint arXiv:2110.12765
[47]
Edmilson Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus Damasceno,
and Hagai Aronowitz. 2022. Speech emotion recognition using self-supervised
features. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 6922–6926.
[48]
Sandra Ottl, Shahin Amiriparian, Maurice Gerczuk, Vincent Karas, and Björn
Schuller. 2020. Group-level Speech Emotion Recognition Utilising Deep Spectrum
Features. In Proceedings of the 8th ICMI 2020 EmotiW Emotion Recognition In The
Wild Challenge (EmotiW 2020), 22nd ACM International Conference on Multimodal
Interaction (ICMI 2020). ACM, ACM, Utrecht, The Netherlands, 821–826.
[49]
Ho-Min Park, Ganghyun Kim, Arnout Van Messem, and Wesley De Neve. 2023.
MuSe-Personalization 2023: Feature Engineering, Hyperparameter Optimization,
and Transformer-Encoder Re-discovery. In Proceedings of the 4th on Multimodal
Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and
Personalisation. 89–97.
[50]
Ho-min Park, Ilho Yun, Ajit Kumar, Ankit Kumar Singh, Bong Jun Choi, Dhanan-
jay Singh, and Wesley De Neve. 2022. Towards Multimodal Prediction of Time-
continuous Emotion using Pose Feature Engineering and a Transformer Encoder.
In Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop
and Challenge. 47–54.
[51]
Leonardo Pepino, Pablo Riera, and Luciana Ferrer. 2021. Emotion Recognition
from Speech Using wav2vec 2.0 Embeddings. In Proc. Interspeech 2021. ISCA,
ISCA, Brno, Czechia, 3400–3404. https://doi.org/10.21437/Interspeech.2021-703
[52]
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How Multilingual is Multi-
lingual BERT?. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. Association for Computational Linguistics, Florence,
Italy, 4996–5001. https://doi.org/10.18653/v1/P19-1493
[53]
Tejas Pradhan, Rashi Bhansali, Dimple Chandnani, and Aditya Pangaonkar. 2020.
Analysis of personality traits using natural language processing and deep learn-
ing. In 2020 Second International Conference on Inventive Research in Computing
Applications (ICIRCA). IEEE, 457–461.
[54]
Shraman Pramanick, Aniket Roy, and Vishal M Patel. 2022. Multimodal Learning
using Optimal Transport for Sarcasm and Humor Detection. In Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision. 3930–3940.
[55]
Béatrice Priego-Valverde, Brigitte Bigi, Salvatore Attardo, Lucy Pickering, and
Elisa Gironzetti. 2018. Is smiling during humor so obvious? a cross-cultural
comparison of smiling behavior in humorous sequences in american english and
french interactions. Intercultural Pragmatics 15, 4 (2018), 563–591.
[56]
Nipun Sadvilkar and Mark Neumann. 2020. PySBD: Pragmatic Sentence Boundary
Disambiguation. In Proceedings of Second Workshop for NLP Open Source Software
(NLP-OSS). Association for Computational Linguistics, Online, 110–114. https:
//www.aclweb.org/anthology/2020.nlposs-1.15
[57]
Florian Schro, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A
unied embedding for face recognition and clustering. In 2015 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR). IEEE. https://doi.org/10.
1109/cvpr.2015.7298682
[58]
Sek Ilkin Serengil and Alper Ozpinar. 2020. LightFace: A Hybrid Deep Face
Recognition Framework. In 2020 Innovations in Intelligent Systems and Applica-
tions Conference (ASYU). IEEE, 23–27. https://doi.org/10.1109/ASYU50717.2020.
9259802
[59]
Sinan Sonlu, Uğur Güdükbay, and Funda Durupinar. 2021. A conversational
agent framework with multi-modal personality expression. ACM Transactions
on Graphics (TOG) 40, 1 (2021), 1–16.
[60]
Julia M Taylor and Lawrence J Mazlack. 2004. Computationally recognizing
wordplay in jokes. In Proceedings of the Annual Meeting of the Cognitive Science
Society, Vol. 26.
[61]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne
Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv
preprint arXiv:2302.13971 (2023).
[62]
Bogdan Vlasenko, RaviShankar Prasad, and Mathew Magimai.-Doss. 2021. Fusion
of Acoustic and Linguistic Information using Supervised Autoencoder for Im-
proved Emotion Recognition. In Proceedings of the 2nd on Multimodal Sentiment
Analysis Challenge. Association for Computing Machinery, New York, NY, USA,
51–59.
[63]
J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben,
and B. W. Schuller. 2023. Dawn of the Transformer Era in Speech Emotion
Recognition: Closing the Valence Gap. IEEE Transactions on Pattern Analysis &
Machine Intelligence 01 (2023), 1–13. https://doi.org/10.1109/TPAMI.2023.3263585
[64]
Jiaming Wu, Hongfei Lin, Liang Yang, and Bo Xu. 2021. MUMOR: A Multimodal
Dataset for Humor Detection in Conversations. In CCF International Conference on
Natural Language Processing and Chinese Computing. Springer, Springer, Qingdao,
China, 619–627.
[65]
Heng Xie, Jizhou Cui, Yuhang Cao, Junjie Chen, Jianhua Tao, Cunhang Fan,
Xuefei Liu, Zhengqi Wen, Heng Lu, Yuguang Yang, et al. 2023. Multimodal Cross-
Lingual Features and Weight Fusion for Cross-Cultural Humor Detection. In
Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop:
Mimicked Emotions, Humour and Personalisation. 51–57.
[66]
Haojie Xu, Weifeng Liu, Jiangwei Liu, Mingzheng Li, Yu Feng, Yasi Peng, Yunwei
Shi, Xiao Sun, and Meng Wang. 2022. Hybrid Multimodal Fusion for Humor
Detection. In Proceedings of the 3rd International on Multimodal Sentiment Analysis
Workshop and Challenge (Lisboa, Portugal) (MuSe’ 22). Association for Computing
Machinery, New York, NY, USA, 15–21. https://doi.org/10.1145/3551876.3554802
[67]
Mingyu Xu, Shun Chen, Zheng Lian, and Bin Liu. 2023. Humor Detection System
for MuSE 2023: Contextual Modeling, Pesudo Labelling, and Post-smoothing. In
Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop:
Mimicked Emotions, Humour and Personalisation. 35–41.
[68] Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. 2015. Humor recognition
and humor anchor extraction. In Proceedings of the 2015 conference on empirical
methods in natural language processing. Association for Computational Linguis-
tics, Lisbon, Portugal, 2367–2376.
[69]
Guofeng Yi, Yuguang Yang, Yu Pan, Yuhang Cao, Jixun Yao, Xiang Lv, Cunhang
Fan, Zhao Lv, Jianhua Tao, Shan Liang, et al. 2023. Exploring the Power of Cross-
Contextual Large Language Model in Mimic Emotion Prediction. In Proceedings
of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked
Emotions, Humour and Personalisation. 19–26.
[70]
Jun Yu, Wangyuan Zhu, Jichao Zhu, Xiaxin Shen, Jianqing Sun, and Jiaen Liang.
2023. MMT-GD: Multi-Modal Transformer with Graph Distillation for Cross-
Cultural Humor Detection. In Proceedings of the 4th on Multimodal Sentiment
Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisa-
tion. 43–49.
[71]
Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint Face
Detection and Alignment Using Multitask Cascaded Convolutional Networks.
IEEE Signal Processing Letters 23 (04 2016).
[72]
Ruicong Zhi, Mengyi Liu, and Dezheng Zhang. 2020. A comprehensive survey on
automatic facial action unit analysis. The Visual Computer 36 (2020), 1067–1093.