International Journal of Computer Vision
https://doi.org/10.1007/s11263-024-02304-3
Noise-Resistant Multimodal Transformer for Emotion Recognition
Yuanyuan Liu1 · Haoyu Zhang1,2 · Yibing Zhan4 · Zijing Chen5,6 · Guanghao Yin1 · Lin Wei1 · Zhe Chen3,6
Received: 19 January 2023 / Accepted: 12 November 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Multimodal emotion recognition identifies human emotions from various data modalities like video, text, and audio. However,
we found that this task can be easily affected by noisy information that does not contain useful semantics and may occur
at different locations of a multimodal input sequence. To this end, we present a novel paradigm that attempts to extract
noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness
of multimodal emotion understanding against noisy information. Our new pipeline, namely Noise-Resistant Multimodal
Transformer (NORM-TR), mainly introduces a Noise-Resistant Generic Feature (NRGF) extractor and a multimodal fusion
Transformer for the multimodal emotion recognition task. In particular, we make the NRGF extractor learn to provide a generic
and disturbance-insensitive representation so that consistent and meaningful semantics can be obtained. Furthermore, we apply
a multimodal fusion Transformer to incorporate Multimodal Features (MFs) of multimodal inputs (serving as the key and value)
based on their relations to the NRGF (serving as the query). Therefore, useful information that the NRGF may be insensitive to
can be complemented by the MFs, which contain more details, achieving more accurate emotion understanding while maintaining
robustness against noise. To train NORM-TR properly, our proposed noise-aware learning scheme complements standard
emotion recognition losses by explicitly strengthening learning against noise. The scheme adds noise to either
all modalities or a specific modality at random locations in a multimodal input sequence. We correspondingly introduce
two adversarial losses that encourage the NRGF extractor to produce NRGFs invariant to the added noise, thus
helping NORM-TR achieve more favorable multimodal emotion recognition performance. Extensive
experiments demonstrate the effectiveness of NORM-TR and the noise-aware learning scheme in handling both
explicitly added noisy information and normal multimodal sequences with implicit noise. On several popular multimodal
datasets (e.g., MOSI, MOSEI, IEMOCAP, and RML), our NORM-TR achieves state-of-the-art performance and outperforms
existing methods by a large margin, which demonstrates that the ability to resist noisy information in multimodal input is
important for effective emotion recognition.
Keywords Multimodal · Emotion recognition · Transformer · Noise-resistant generic feature · Noise-aware learning scheme
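As an illustrative aid (not the authors' released implementation), the following sketch shows the two ideas summarized in the abstract: the NRGF acts as the attention query while the multimodal features (MFs) act as keys and values in the fusion Transformer, and noise is injected at random locations of an input sequence for noise-aware training. All module names, dimensions, and the Gaussian noise model are assumptions made for illustration.

# Minimal sketch (assumed details): NRGF-as-query cross-attention fusion and
# noise injection at random sequence locations, as described in the abstract.
import torch
import torch.nn as nn


class NRGFQueryFusion(nn.Module):
    """Fuses multimodal features (MFs) into the noise-robust NRGF query via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, nrgf: torch.Tensor, mfs: torch.Tensor) -> torch.Tensor:
        # nrgf: (B, Lq, D) noise-resistant generic features (query)
        # mfs:  (B, Lk, D) concatenated visual/audio/text features (key and value)
        q = nrgf
        for attn, norm in zip(self.attn_layers, self.norms):
            out, _ = attn(query=q, key=mfs, value=mfs)
            q = norm(q + out)  # residual connection keeps the noise-robust query dominant
        return q  # fused representation fed to the emotion classifier


def add_noise_at_random_locations(x: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    """Perturbs a random subset of time steps with Gaussian noise (assumed noise model)."""
    b, t, _ = x.shape
    mask = (torch.rand(b, t, 1, device=x.device) < ratio).float()
    return x + mask * torch.randn_like(x)


# Dummy usage: 8 samples, 20 NRGF tokens, 60 multimodal tokens, 256-d features.
fusion = NRGFQueryFusion()
mfs_noisy = add_noise_at_random_locations(torch.randn(8, 60, 256))
fused = fusion(torch.randn(8, 20, 256), mfs_noisy)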
1 Introduction
An accurate understanding of human emotions is beneficial
for several applications, such as multimedia analysis, digital
entertainment, health monitoring, human-computer interac-
tion, etc. (Shen et al., 2009; Beale & Peter, 2008; Qian et al.,
2019; D’Mello & Kory, 2015). Compared with traditional
emotion recognition, which only uses a unimodal data source,
multimodal emotion recognition that exploits and explores
different data sources, such as visual, audio, and text, has
shown significant advantages in improving the understand-
ing of emotions (Zadeh et al., 2017; Tsai et al., 2019; Lv et
al., 2021; Hazarika et al., 2020; Yuan et al., 2021), including
happiness, anger, disgust, fear, sadness, neutral, and surprise.
Recently, most existing multimodal emotion recognition
methods mainly focus on multimodal data fusion, includ-
ing tensor-based fusion methods (Liu et al., 2018; Zadeh
et al., 2017; Sahay et al., 2020; Yuan et al., 2021) and
attention-based fusion methods (Zhao et al., 2020; Huang et
al., 2020; Zhou et al., 2021). The tensor-based fusion meth-
ods aim to obtain a joint representation of data with different
modalities via multilinear function calculation. For exam-
ple, TFN (Zadeh et al., 2017) used a Cartesian product operation