International Journal of Computer Vision
https://doi.org/10.1007/s11263-024-02304-3
Noise-Resistant Multimodal Transformer for Emotion Recognition
Yuanyuan Liu1 · Haoyu Zhang1,2 · Yibing Zhan4 · Zijing Chen5,6 · Guanghao Yin1 · Lin Wei1 · Zhe Chen3,6
Received: 19 January 2023 / Accepted: 12 November 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Multimodal emotion recognition identifies human emotions from various data modalities such as video, text, and audio. However, we found that this task can be easily affected by noisy information that does not contain useful semantics and may occur at different locations of a multimodal input sequence. To this end, we present a novel paradigm that extracts noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding against noisy information. Our new pipeline, the Noise-Resistant Multimodal Transformer (NORM-TR), mainly introduces a Noise-Resistant Generic Feature (NRGF) extractor and a multimodal fusion Transformer for the multimodal emotion recognition task. In particular, we make the NRGF extractor learn to provide a generic and disturbance-insensitive representation so that consistent and meaningful semantics can be obtained. Furthermore, we apply a multimodal fusion Transformer to incorporate Multimodal Features (MFs) of multimodal inputs (serving as the key and value) based on their relations to the NRGF (serving as the query). Therefore, the possibly insensitive but useful information of the NRGF can be complemented by MFs that contain more details, achieving more accurate emotion understanding while maintaining robustness against noise. To train NORM-TR properly, our proposed noise-aware learning scheme complements normal emotion recognition losses by enhancing the learning against noise. The learning scheme explicitly adds noise to either all modalities or a specific modality at random locations of a multimodal input sequence, and we correspondingly introduce two adversarial losses to encourage the NRGF extractor to learn NRGFs invariant to the added noise, thus enabling NORM-TR to achieve more favorable multimodal emotion recognition performance. Extensive experiments demonstrate the effectiveness of NORM-TR and the noise-aware learning scheme in dealing with both explicitly added noisy information and normal multimodal sequences with implicit noise. On several popular multimodal datasets (e.g., MOSI, MOSEI, IEMOCAP, and RML), NORM-TR achieves state-of-the-art performance and outperforms existing methods by a large margin, demonstrating that the ability to resist noisy information in multimodal inputs is important for effective emotion recognition.
Keywords Multimodal · Emotion recognition · Transformer · Noise-resistant generic feature · Noise-aware learning scheme
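To make the pipeline described in the abstract concrete, the following PyTorch-style sketch illustrates its two central ideas: perturbing random locations of a multimodal sequence with noise, and fusing multimodal features through cross-attention in which the noise-resistant generic feature acts as the query while the multimodal features act as keys and values. All names (add_random_noise, NRGFGuidedFusion), dimensions, the Gaussian noise type, and the single-layer structure are illustrative assumptions rather than the authors' NORM-TR implementation.

```python
import torch
import torch.nn as nn


def add_random_noise(x: torch.Tensor, ratio: float = 0.2) -> torch.Tensor:
    """Replace a random subset of time steps with Gaussian noise.

    A toy stand-in for the noise-aware learning scheme described in the abstract,
    which perturbs random locations of a multimodal input sequence; the noise type
    and ratio here are assumptions for illustration.
    """
    noisy = x.clone()
    batch, seq_len, _ = x.shape
    num_noisy = max(1, int(ratio * seq_len))
    for b in range(batch):
        idx = torch.randperm(seq_len)[:num_noisy]
        noisy[b, idx] = torch.randn_like(noisy[b, idx])
    return noisy


class NRGFGuidedFusion(nn.Module):
    """Cross-attention fusion sketch: NRGF as query, multimodal features as key/value."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, nrgf: torch.Tensor, mfs: torch.Tensor) -> torch.Tensor:
        # nrgf: (B, 1, dim)  noise-resistant generic feature, used as the query
        # mfs:  (B, T, dim)  multimodal features, used as keys and values
        fused, _ = self.attn(query=nrgf, key=mfs, value=mfs)
        fused = self.norm(fused + nrgf)            # residual connection
        return self.classifier(fused.squeeze(1))   # emotion logits


# Usage with random tensors standing in for extracted features.
model = NRGFGuidedFusion()
nrgf = torch.randn(8, 1, 256)                      # one generic token per sample
mfs = add_random_noise(torch.randn(8, 30, 256))    # perturbed multimodal sequence
logits = model(nrgf, mfs)                          # shape: (8, 7)
```

Using the NRGF as the query means the attention weights are computed from a representation trained to be insensitive to disturbances, while the details are still drawn from the richer multimodal features; in the full model, the NRGF and MF extractors and the adversarial losses would replace the random tensors used here.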
1 Introduction
An accurate understanding of human emotions is beneficial for several applications, such as multimedia analysis, digital entertainment, health monitoring, and human-computer interaction (Shen et al., 2009; Beale & Peter, 2008; Qian et al., 2019; D’Mello & Kory, 2015). Compared with traditional emotion recognition, which uses only a unimodal data source, multimodal emotion recognition, which exploits and explores different data sources such as visual, audio, and text, has
shown significant advantages in improving the understanding of emotions (Zadeh et al., 2017; Tsai et al., 2019; Lv et al., 2021; Hazarika et al., 2020; Yuan et al., 2021), including happiness, anger, disgust, fear, sadness, neutral, and surprise.
Recently, most existing multimodal emotion recognition methods mainly focus on multimodal data fusion, including tensor-based fusion methods (Liu et al., 2018; Zadeh et al., 2017; Sahay et al., 2020; Yuan et al., 2021) and attention-based fusion methods (Zhao et al., 2020; Huang et al., 2020; Zhou et al., 2021). Tensor-based fusion methods aim to obtain a joint representation of data from different modalities via multilinear function calculation. For example, TFN (Liu et al., 2018) used a Cartesian product operation to combine unimodal features into a joint multimodal representation.