Yuanyuan Liu’s research while affiliated with China University of Geosciences and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (41)


Beyond boundaries: Hierarchical-contrast unsupervised temporal action localization with high-coupling feature learning
  • Article

February 2025 · 1 Read · Pattern Recognition

Yuanyuan Liu · Ning Zhou · Yuxuan Huang · [...] · Ke Wang

False-negative Pair Calibration.
As shown in (a), when faces are affected by different poses, widely used contrastive learning (CL) methods such as SimCLR treat pose and other facial information uniformly, which may lead to suboptimal results. To alleviate this problem, our PCFRL, an extended version of our conference method (PCL), first disentangles pose-aware and non-pose face-aware features and then calibrates face and pose false-negative pairs for more effective calibrated pose-aware CL and calibrated face-aware CL, respectively. As shown in (b), our PCFRL enhances pose awareness for SFRL and promisingly improves face understanding performance
Detailed pipeline of our proposed PCFRL for pose-aware self-supervised facial representation learning. Building upon the PDD, we first disentangle the pose-aware features from the non-pose face-aware features. Then, we further introduce the false-negative pair calibration module to calculate the neighbor-cohesive pair alignment (NPA) scores, resulting in the calibrated pose-aware and face-aware false-negative pairs, respectively. Moreover, with the calibrated false-negative pairs, we devise two calibrated CL losses, namely calibrated pose-aware CL and face-aware CL, to facilitate the learning of more robust pose-aware facial representations
Conceptual comparison between the cosine similarity and our proposed neighborhood-cohesive pair alignment score. (a) Cosine similarity assesses the face relationship between two samples. (b) Our neighborhood-cohesive pair alignment score measures the face relationship using the similarities of all neighbor samples in a training batch. An example is provided in (b): first, according to Eq. 5, the dot product of the cosine similarities between samples i and j and the other samples is computed to obtain NS; then, according to Eq. 3, this result is added to the cosine similarity between the two samples to obtain NPA
The detailed network architecture of the backbone and the corresponding two subnets


Sample-Cohesive Pose-Aware Contrastive Facial Representation Learning
  • Article
  • Full-text available

January 2025 · International Journal of Computer Vision

Self-supervised facial representation learning (SFRL) methods, especially contrastive learning (CL) methods, have been increasingly popular due to their ability to perform face understanding without heavily relying on large-scale well-annotated datasets. However, analytically, current CL-based SFRL methods still perform unsatisfactorily in learning facial representations due to their tendency to learn pose-insensitive features, resulting in the loss of some useful pose details. This could be due to the inappropriate positive/negative pair selection within CL. To conquer this challenge, we propose a Pose-disentangled Contrastive Facial Representation Learning (PCFRL) framework to enhance pose awareness for SFRL. We achieve this by explicitly disentangling the pose-aware features from non-pose face-aware features and introducing appropriate sample calibration schemes for better CL with the disentangled features. In PCFRL, we first devise a pose-disentangled decoder with a delicately designed orthogonalizing regulation to perform the disentanglement; therefore, the learning on the pose-aware and non-pose face-aware features would not affect each other. Then, we introduce a false-negative pair calibration module to overcome the issue that the two types of disentangled features may not share the same negative pairs for CL. Our calibration employs a novel neighborhood-cohesive pair alignment method to identify pose and face false-negative pairs, respectively, and further help calibrate them to appropriate positive pairs. Lastly, we devise two calibrated CL losses, namely calibrated pose-aware and face-aware CL losses, for adaptively learning the calibrated pairs more effectively, ultimately enhancing the learning with the disentangled features and providing robust facial representations for various downstream tasks. In the experiments, we perform linear evaluations on four challenging downstream facial tasks with SFRL using our method, including facial expression recognition, face recognition, facial action unit detection, and head pose estimation. Experimental results show that PCFRL outperforms existing state-of-the-art methods by a substantial margin, demonstrating the importance of improving pose awareness for SFRL. Our evaluation code and model will be available at https://github.com/fulaoze/CV/tree/main.
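The abstract above describes a neighborhood-cohesive pair alignment (NPA) score that judges whether a supposed negative pair is actually a false negative by looking at how both samples relate to the rest of the batch. Below is a minimal PyTorch sketch of that idea, assuming the simple formulation suggested by the figure captions (direct cosine similarity plus a neighbor term built from batch-wise similarities); the exact normalization, temperature, and threshold of Eqs. 3 and 5 in the paper are not reproduced here.

```python
# Minimal sketch of a neighborhood-cohesive pair alignment (NPA) score,
# assuming NPA(i, j) combines the direct cosine similarity of a pair with a
# neighbor term accumulated over the other samples in the batch. The
# averaging over neighbors and the calibration threshold are assumptions.
import torch
import torch.nn.functional as F

def npa_scores(features: torch.Tensor) -> torch.Tensor:
    """features: (B, D) embeddings of one training batch.
    Returns a (B, B) matrix of pairwise NPA scores."""
    z = F.normalize(features, dim=-1)
    cos = z @ z.t()                                   # pairwise cosine similarity
    b = cos.size(0)
    eye = torch.eye(b, dtype=torch.bool, device=cos.device)
    cos_off = cos.masked_fill(eye, 0.0)               # drop self-similarities
    # Neighbor term: sum over the other samples k of cos(i, k) * cos(j, k);
    # the zeroed diagonal removes the k = i and k = j contributions.
    ns = (cos_off @ cos_off.t()) / max(b - 2, 1)
    return cos + ns

def calibrate_false_negatives(features: torch.Tensor, threshold: float = 1.2) -> torch.Tensor:
    """Flags negative pairs whose NPA score is high enough to be treated as
    positives in the subsequent contrastive loss (threshold is illustrative)."""
    return npa_scores(features) > threshold
```

Pairs flagged this way would then be moved from the negative set to the positive set before the calibrated pose-aware and face-aware CL losses are computed.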


Feature Contrast Difference and Enhanced Network for RGB-D Indoor Scene Classification in Internet of Things

January 2025 · 2 Reads · IEEE Internet of Things Journal

The era of smart connectivity spawned by the Internet of Things (IoT) has made the need to achieve environmental perception and understanding of different scenarios increasingly urgent. Among the many scenarios, indoor scene classification has attracted much attention because of its relevance to the daily lives of people, ranging from comfort regulation in living spaces to the optimal allocation of resources in offices, and a variety of approaches for this task have emerged. However, increasing accuracy remains a crucial objective due to the complexity and disorder of indoor scenes. Therefore, we propose a feature contrast difference and enhanced network for RGB-D indoor scene classification, FCDENet. First, the RGB and Depth images express different information. Therefore, we built a feature contrast difference module for the first two low-level features to extend the receptive fields of the different features, utilizing differential contrast to complement each other. Second, the high-level feature semantic information is abstract. Therefore, we introduced information cluster blocks, which are used to aggregate feature points with similar attributes into compact clusters after being parsed by an initial frequency transform, enabling instantiated representations of the semantic information. Finally, to further enhance the integrated features, we introduced a wavelet transform block in the cross-layer decoding process. In contrast to conventional decoding methods, we employed a wavelet transform for initial denoising of cross-layer features and used multiple pooling structures to supplement local information, gradually weighting to achieve higher prediction accuracy. Extensive experiments on two typical indoor datasets, NYUDv2 and SUN RGB-D, show that our results exhibit excellent performance. In addition, to better demonstrate the reliability of the method, we conducted generalizability experiments on other datasets, and the proposed method provides a robust solution to the challenges of multiple scenarios in the era of IoT smart connectivity. The code is available at https://github.com/XUEXIKUAIL/FCDENet.
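The abstract describes a feature contrast difference module that lets the low-level RGB and Depth features complement each other through their cross-modal difference. The sketch below only illustrates that general idea, not the published FCDENet design; the dilated convolution (standing in for the extended receptive field) and the way the difference signal is fed back into each branch are assumptions.

```python
# Hedged sketch of a cross-modal contrast-difference fusion for low-level
# RGB and Depth features. Kernel size, dilation, and the fusion order are
# illustrative assumptions rather than the published FCDENet module.
import torch
import torch.nn as nn

class ContrastDifferenceFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # A dilated conv stands in for "extending the receptive field".
        self.diff_conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor):
        # The cross-modal difference carries what one modality captured and
        # the other did not; it is used to complement both branches.
        diff = self.diff_conv(f_rgb - f_depth)
        return f_rgb + diff, f_depth + diff
```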


Fig. 1. Overall pipeline of FILNet.
Frequency-Aware Integrity Learning Network for Semantic Segmentation of Remote Sensing Images

January 2025 · 8 Reads · IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

The semantic segmentation of remote sensing images is crucial for computer perception tasks. Integrating dual-modal information enhances semantic understanding. However, existing segmentation methods often suffer from incomplete feature information (features without integrity), leading to inadequate segmentation of pixels near object boundaries. This study introduces the concept of integrity in semantic segmentation and presents a complete integrity learning network using contextual semantics in the multiscale feature decoding process. Specifically, we propose a frequency-aware integrity learning network (FILNet) that compensates for missing features by capturing a shared integrity feature, enabling accurate differentiation between object categories and precise pixel segmentation. First, we design a frequency-driven awareness generator that produces an awareness map by extracting frequency-domain features with high-level semantics, guiding the multiscale feature aggregation process. Second, we implement a split-fuse-replenish strategy, which divides features into two branches for feature extraction and information replenishment, followed by cross-modal fusion and direct connection for information replenishment, resulting in fused features. Finally, we present an integrity assignment and enhancement method that leverages a capsule network to learn the correlation of multiscale features, generating a shared integrity feature. This feature is assigned to multiscale features to enhance their integrity, leading to accurate predictions facilitated by an adaptive large kernel module. Experiments on the Vaihingen and Potsdam datasets demonstrate that our method outperforms current state-of-the-art segmentation techniques. The corresponding code is available at https://github.com/MAXHAN22/FILNet .
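The frequency-driven awareness generator described above produces an awareness map from frequency-domain features with high-level semantics and uses it to guide multiscale aggregation. A minimal PyTorch sketch of that idea follows; the 1x1 projection of the magnitude spectrum and the sigmoid gating are assumptions about how such a map could be produced and applied, not the FILNet implementation.

```python
# Hedged sketch of a frequency-driven awareness map: high-level features are
# moved to the frequency domain, their magnitude spectrum is compressed into
# a single-channel map, and that map gates features during aggregation.
import torch
import torch.nn as nn

class FrequencyAwareness(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, high_level: torch.Tensor, multiscale: torch.Tensor):
        # high_level and multiscale are assumed to share spatial size here.
        spectrum = torch.fft.fft2(high_level, norm="ortho")   # complex (B, C, H, W)
        magnitude = torch.abs(spectrum)                       # frequency-domain energy
        awareness = torch.sigmoid(self.proj(magnitude))       # (B, 1, H, W) map
        # The awareness map guides aggregation by re-weighting the features.
        return multiscale * awareness
```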


Noise-Resistant Multimodal Transformer for Emotion Recognition

December 2024 · 13 Reads · 1 Citation · International Journal of Computer Vision

Multimodal emotion recognition identifies human emotions from various data modalities like video, text, and audio. However, we found that this task can be easily affected by noisy information that does not contain useful semantics and may occur at different locations of a multimodal input sequence. To this end, we present a novel paradigm that attempts to extract noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding against noisy information. Our new pipeline, namely Noise-Resistant Multimodal Transformer (NORM-TR), mainly introduces a Noise-Resistant Generic Feature (NRGF) extractor and a multimodal fusion Transformer for the multimodal emotion recognition task. In particular, we make the NRGF extractor learn to provide a generic and disturbance-insensitive representation so that consistent and meaningful semantics can be obtained. Furthermore, we apply a multimodal fusion Transformer to incorporate Multimodal Features (MFs) of multimodal inputs (serving as the key and value) based on their relations to the NRGF (serving as the query). Therefore, the possible insensitive but useful information of NRGF could be complemented by MFs that contain more details, achieving more accurate emotion understanding while maintaining robustness against noises. To train the NORM-TR properly, our proposed noise-aware learning scheme complements normal emotion recognition losses by enhancing the learning against noises. Our learning scheme explicitly adds noises to either all the modalities or a specific modality at random locations of a multimodal input sequence. We correspondingly introduce two adversarial losses to encourage the NRGF extractor to learn to extract the NRGFs invariant to the added noises, thus facilitating the NORM-TR to achieve more favorable multimodal emotion recognition performance. In practice, extensive experiments can demonstrate the effectiveness of the NORM-TR and the noise-aware learning scheme for dealing with both explicitly added noisy information and the normal multimodal sequence with implicit noises. On several popular multimodal datasets (e.g., MOSI, MOSEI, IEMOCAP, and RML), our NORM-TR achieves state-of-the-art performance and outperforms existing methods by a large margin, which demonstrates that the ability to resist noisy information in multimodal input is important for effective emotion recognition.
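The abstract states the query/key/value arrangement explicitly: the noise-resistant generic feature (NRGF) serves as the query, while the detailed multimodal features (MFs) serve as the key and value of the fusion Transformer. The sketch below mirrors that arrangement with a single attention layer; the dimensions, head count, and classifier head are illustrative assumptions.

```python
# Minimal sketch of NRGF-as-query, MFs-as-key/value fusion as described in
# the abstract. Sizes and the classification head are assumptions.
import torch
import torch.nn as nn

class NoiseResistantFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, nrgf: torch.Tensor, mfs: torch.Tensor):
        # nrgf: (B, 1, D) generic, disturbance-insensitive representation
        # mfs:  (B, T, D) detailed per-modality features of the input sequence
        fused, _ = self.attn(query=nrgf, key=mfs, value=mfs)
        return self.classifier(fused.squeeze(1))
```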


FIMKD: Feature-Implicit Mapping Knowledge Distillation for RGB-D Indoor Scene Semantic Segmentation

December 2024 · 1 Read · 3 Citations · IEEE Transactions on Artificial Intelligence

Depth images are often used to improve the geometric understanding of scenes owing to their intuitive distance properties. Although there have been significant advancements in semantic segmentation tasks using red–green–blue-depth (RGB-D) images, the complexity of existing methods remains high. Furthermore, the requirement for high-quality depth images increases the model inference time, which limits the practicality of these methods. To address this issue, we propose a feature-implicit mapping knowledge distillation (FIMKD) method and a cross-modal knowledge distillation (KD) architecture to leverage deep modal information for training and reduce the model dependence on this information during inference. The approach comprises two networks: FIMKD-T, a teacher network that uses RGB-D data, and FIMKD-S, a student network that uses only RGB data. FIMKD-T extracts high-frequency information using the depth modality and compensates for the loss of RGB details due to a reduction in resolution during feature extraction by the high-frequency feature enhancement module, thereby enhancing the geometric perception of semantic features. In contrast, the FIMKD-S network does not employ deep learning techniques; instead, it uses a non-learning approach to extract high-frequency information. To enable the FIMKD-S network to learn deep features, we propose a feature-implicit mapping KD for feature distillation. This mapping technique maps the features in channel and space to a low-dimensional hidden layer, which helps to avoid inefficient single-pattern student learning. We evaluated the proposed FIMKD-S* (FIMKD-S with KD) on the NYUv2 and SUN-RGBD datasets. The results demonstrate that both FIMKD-T and FIMKD-S* achieve state-of-the-art performance. Furthermore, FIMKD-S* provides the best performance balance. The code for this work is available at https://github.com/SHARKALAKALA/FIMKD .
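The feature-implicit mapping described above maps features in channel and space to a low-dimensional hidden layer so that the RGB-only student is not forced to copy the RGB-D teacher's raw feature maps. The sketch below captures that general idea under stated assumptions: the 1x1 projections into a shared hidden space and the MSE matching loss are placeholders, not the published FIMKD mapping.

```python
# Hedged sketch of distilling through a shared low-dimensional hidden space:
# teacher (RGB-D) and student (RGB-only) features are both projected to a
# compact embedding and matched there. Projection type and loss are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowDimDistillLoss(nn.Module):
    def __init__(self, c_teacher: int, c_student: int, hidden: int = 64):
        super().__init__()
        self.map_t = nn.Conv2d(c_teacher, hidden, kernel_size=1)
        self.map_s = nn.Conv2d(c_student, hidden, kernel_size=1)

    def forward(self, feat_t: torch.Tensor, feat_s: torch.Tensor) -> torch.Tensor:
        # Align spatial sizes before matching in the hidden space.
        feat_s = F.interpolate(feat_s, size=feat_t.shape[-2:], mode="bilinear",
                               align_corners=False)
        # The teacher is frozen for this loss; only the student side learns.
        return F.mse_loss(self.map_s(feat_s), self.map_t(feat_t).detach())
```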


Multi-View Adaptive Fusion Network for Spatially Resolved Transcriptomics Data Clustering

December 2024 · 15 Reads · 1 Citation · IEEE Transactions on Knowledge and Data Engineering

Spatial transcriptomics technology fully leverages spatial location and gene expression information for spatial clustering tasks. However, existing spatial clustering methods primarily concentrate on utilizing the complementary features between spatial and gene expression information, while overlooking the discriminative features during the integration process. Consequently, the discriminative capability of node representation in the gene expression features is limited. Besides, most existing methods lack a flexible combination mechanism to adaptively integrate spatial and gene expression information. To this end, we propose an end-to-end deep learning method named MAFN for spatially resolved transcriptomics data clustering via a multi-view adaptive fusion network. Specifically, we first adaptively learn inter-view complementary features from spatial and gene expression information. To improve the discriminative capability of gene expression nodes by utilizing spatial information, we employ two GCN encoders to learn intra-view specific features and design a Cross-view Correlation Reduction (CCR) strategy to filter the irrelevant information. Moreover, considering the distinct characteristics of each view, a Cross-view Attention Module (CAM) is utilized to adaptively fuse the multi-view features. Extensive experimental results demonstrate that the proposed MAFN achieves competitive performance in spatial domain identification compared to other state-of-the-art ones. The demo code of this work is publicly available at https://github.com/zhubbbzhu/MAFN .
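The Cross-view Attention Module (CAM) mentioned above adaptively fuses the spatial and gene-expression views. Below is a hedged sketch of one way such adaptive fusion could look: a small gating network scores each view per node and the views are combined by a softmax-weighted sum. The gating MLP and weighting scheme are assumptions, not the published CAM.

```python
# Hedged sketch of adaptively fusing two view-specific node embeddings
# (spatial view and gene-expression view) with learned attention weights.
import torch
import torch.nn as nn

class CrossViewAttentionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.Tanh(),
                                   nn.Linear(dim // 2, 1))

    def forward(self, z_spatial: torch.Tensor, z_gene: torch.Tensor):
        # z_spatial, z_gene: (N, D) node embeddings from each view.
        views = torch.stack([z_spatial, z_gene], dim=1)     # (N, 2, D)
        weights = torch.softmax(self.score(views), dim=1)   # (N, 2, 1) per-view weights
        return (weights * views).sum(dim=1)                 # (N, D) fused embedding
```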


Special Session: Multi-modal Information Fusion

November 2024 · 10 Reads

The 17th International Conference on Digital Image Processing (ICDIP 2025), Special Session: Multi-modal Information Fusion (https://www.icdip.org/special.html).
Session organizers: Prof. Chang Tang, China University of Geosciences, China; Prof. Xinwang Liu, National University of Defense Technology, China; Assoc. Prof. Yong Liu, Renmin University of China, China; Dr. Mohammad Sultan Mahmud, Shenzhen University, China; Assoc. Prof. Yuanyuan Liu, China University of Geosciences, China.
The topics of interest include, but are not limited to:
  • Multi-modal image fusion
  • Multi-modal large language models
  • Multi-modal data generation
  • Trusted multi-modal fusion method
Submission method: Submit your Full Paper (no less than 8 pages) via the Online Submission System, then choose Special Session 1 (Multi-modal Information Fusion).


Smile upon the Face but Sadness in the Eyes: Emotion Recognition based on Facial Expressions and Eye Behaviors

November 2024 · 26 Reads

Emotion Recognition (ER) is the process of identifying human emotions from given data. Currently, the field heavily relies on facial expression recognition (FER) because facial expressions contain rich emotional cues. However, it is important to note that facial expressions may not always precisely reflect genuine emotions and FER-based results may yield misleading ER. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue for the creation of a new Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. Different from existing multimodal ER datasets, the EMER dataset employs a stimulus material-induced spontaneous emotion generation method to integrate non-invasive eye behavior data, like eye movements and eye fixation maps, with facial videos, aiming to obtain natural and accurate human emotions. Notably, for the first time, we provide annotations for both ER and FER in the EMER, enabling a comprehensive analysis to better illustrate the gap between both tasks. Furthermore, we specifically design a new EMERT architecture to concurrently enhance performance in both ER and FER by efficiently identifying and bridging the emotion gap between the two. Specifically, our EMERT employs modality-adversarial feature decoupling and a multi-task Transformer to augment the modeling of eye behaviors, thus providing an effective complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance.
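The EMERT architecture described above pairs modality-adversarial feature decoupling with a multi-task Transformer so that the same fused eye-behavior and facial representation is evaluated on both ER and FER labels. The sketch below shows only the multi-task head portion of that idea; the encoder depth, pooling, and head sizes are illustrative assumptions rather than the published EMERT.

```python
# Hedged sketch of a multi-task head: one shared fused token sequence feeds
# two classifiers, one for ER labels and one for FER labels.
import torch
import torch.nn as nn

class MultiTaskEmotionHead(nn.Module):
    def __init__(self, dim: int = 256, n_er: int = 7, n_fer: int = 7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.er_head = nn.Linear(dim, n_er)    # emotion recognition labels
        self.fer_head = nn.Linear(dim, n_fer)  # facial expression labels

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, T, D) fused eye-behavior and facial tokens
        pooled = self.encoder(tokens).mean(dim=1)
        return self.er_head(pooled), self.fer_head(pooled)
```

Comparing the two heads' predictions on the same inputs is one way to quantify the gap between ER and FER that the dataset is designed to expose.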



Citations (24)


... Toutanova, 2019; Carion et al., 2020; Chen et al., 2022; Liu et al., 2023a). In MSA, this technique has been widely used for feature extraction, representation learning, and multimodal fusion (Tsai et al., 2019a; Huang et al., 2020; Liu et al., 2023b; ...). ...

Reference:

Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis
Noise-Resistant Multimodal Transformer for Emotion Recognition

International Journal of Computer Vision

... SiFU [48] is unable to reconstruct correct human postures, such as incorrect left-hand positions. VS [15] performs poorly in fine-grained areas such as unclear finger movement and cloth wrinkles. SiTH [5] produces geometry and texture errors that occur from the generative model, such as the third arm on the back. ...

VS: Reconstructing Clothed 3D Human from Single Image via Vertex Shift
  • Citing Conference Paper
  • June 2024

... To address the shortcomings of slow inference speed and weak ability to capture local features, some studies focus primarily on Transformer-based methods. Knowledge distillation [50] can compress models and improve performance, but it depends on the quality of the teacher network and has higher training complexity. AsymFormer [35] also improves computational efficiency by replacing the RGB branch with a CNN; however, the lightweight Transformer in the depth branch still reduces efficiency, and aligning feature information between different branches remains challenging. ...

FIMKD: Feature-Implicit Mapping Knowledge Distillation for RGB-D Indoor Scene Semantic Segmentation
  • Citing Article
  • December 2024

IEEE Transactions on Artificial Intelligence

... Then, the feature concatenation operator, followed by a linear layer, is utilized to fuse these two different types of information. As done in previous methods [70], [71], the ZINB decoder [72], which captures the complex global information of the data, is introduced to reconstruct the gene expression. Finally, the view-specific contrastive regularization is leveraged to achieve more balanced multi-view learning in the spatially resolved transcriptomics data clustering task. ...

Multi-View Adaptive Fusion Network for Spatially Resolved Transcriptomics Data Clustering
  • Citing Article
  • December 2024

IEEE Transactions on Knowledge and Data Engineering

... 8) The text-enhanced transformer fusion network (TETFN) [53] leverages text-oriented cross-modal mappings to effectively fuse sentiment-related information from textual, visual, and acoustic cues, enhancing multimodal sentiment analysis while preserving both intermodality and intramodality relationships. 9) The token-disentangling mutual transformer (TMT) [54] addresses multimodal emotion recognition by disentangling intermodality consistency and intramodality heterogeneity features, allowing for enhanced interaction of diverse emotional cues from text, video, and audio. 10) The uncertainty estimation fusion network (UEFN) is the model proposed in this paper. ...

Token-disentangling Mutual Transformer for multimodal emotion recognition
  • Citing Article
  • July 2024

Engineering Applications of Artificial Intelligence

... This study distinctly focuses on relation-based distillation by leveraging the interdependencies of features and classification decisions between different network architectures. On the other hand, Zhou et al. [100] introduce the Multi-level Semantic Transfer Network (MSTNet), a KD framework designed for dense prediction of RS images. This network utilizes a Multi-level Semantic Knowledge Alignment (MSKA) framework to distill semantic information from a complex teacher model to a more compact student model. ...

MSTNet-KD: Multilevel Transfer Networks Using Knowledge Distillation for the Dense Prediction of Remote-Sensing Images
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... investigated ways to enhance the performance of semantic segmentation methods by integrating RGB-D information [10]. However, most studies tend to treat depth images as supplementary information, simply integrating them into the processing framework of RGB images. ...

DGPINet-KD: Deep Guided and Progressive Integration Network with Knowledge Distillation for RGB-D Indoor Scene Analysis
  • Citing Article
  • September 2024

IEEE Transactions on Circuits and Systems for Video Technology

... With the rapid advancement of remote sensing technology, obtaining high-resolution aerial images has become much easier, presenting new challenges and opportunities for content interpretation [1][2][3][4]. To process these images effectively, it is essential to develop efficient methods. ...

Multitarget Domain Adaptation Building Instance Extraction of Remote Sensing Imagery With Domain-Common Approximation Learning
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... This influx of data not only motivates advancements in multimodal fusion learning but also presents significant challenges. Currently, multimodal fusion learning has garnered extensive research interest across various fields, including multimodal sentiment analysis [1,2], crossmodal retrieval [3,4], visual questioning [5], and emotion recognition in conversation [6,7]. ...

Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis

... It is worth mentioning that this study is an extended version of our conference paper, PCL (Pose-disentangled Contrastive Learning) (Liu et al., 2023) published in CVPR 2023. In our original PCL paper, we primarily introduced an effective pose disentanglement algorithm for contrastive learning. ...

Pose-disentangled Contrastive Learning for Self-supervised Facial Representation
  • Citing Conference Paper
  • June 2023