Zhihai He’s research while affiliated with Southern University of Science and Technology and other places


Publications (251)


Cross-Modal Few-Shot Learning with Second-Order Neural Ordinary Differential Equations
  • Article

April 2025 · 1 Read · 1 Citation

Proceedings of the AAAI Conference on Artificial Intelligence

Yi Zhang · Chun-Wun Cheng · Junyi He · [...] · Angelica I Aviles-Rivero

We introduce SONO, a novel method leveraging Second-Order Neural Ordinary Differential Equations (Second-Order NODEs) to enhance cross-modal few-shot learning. By employing a simple yet effective architecture consisting of a Second-Order NODEs model paired with a cross-modal classifier, SONO addresses the significant challenge of overfitting, which is common in few-shot scenarios due to limited training examples. Our second-order approach can approximate a broader class of functions, enhancing the model's expressive power and feature generalization capabilities. We initialize our cross-modal classifier with text embeddings derived from class-relevant prompts, streamlining training efficiency by avoiding the need for frequent text encoder processing. Additionally, we utilize text-based image augmentation, exploiting CLIP’s robust image-text correlation to enrich training data significantly. Extensive experiments across multiple datasets demonstrate that SONO outperforms existing state-of-the-art methods in few-shot learning performance.
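As a rough illustration of the second-order idea, a second-order ODE x'' = f(x, x') can be rewritten as a coupled first-order system in (x, v) with v = x' and handed to an ordinary solver. The sketch below is not the SONO model: `toy_acceleration` is a hypothetical stand-in (a harmonic oscillator) for the learned dynamics network, and the fixed-step RK4 integrator is an assumption chosen for brevity.

```python
import math

def toy_acceleration(x, v):
    """Hypothetical stand-in for a learned dynamics network: x'' = -x."""
    return -x

def rk4_second_order(f, x0, v0, t_end, steps):
    """Integrate x'' = f(x, v) by rewriting it as x' = v, v' = f(x, v)."""
    h = t_end / steps
    x, v = x0, v0
    for _ in range(steps):
        # Classic fourth-order Runge-Kutta on the augmented state (x, v).
        k1x, k1v = v, f(x, v)
        k2x, k2v = v + 0.5 * h * k1v, f(x + 0.5 * h * k1x, v + 0.5 * h * k1v)
        k3x, k3v = v + 0.5 * h * k2v, f(x + 0.5 * h * k2x, v + 0.5 * h * k2v)
        k4x, k4v = v + h * k3v, f(x + h * k3x, v + h * k3v)
        x += h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return x, v

# Starting at x = 1, v = 0, the exact solution is x(t) = cos(t), v(t) = -sin(t).
x_pi, v_pi = rk4_second_order(toy_acceleration, 1.0, 0.0, math.pi, 200)
```

In a neural ODE setting the toy function would be replaced by a trainable network acting on feature vectors; the state-augmentation trick stays the same.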


Cross-Modal Few-Shot Learning with Second-Order Neural Ordinary Differential Equations

December 2024 · 10 Reads





Hierarchical Spatial-Temporal Masked Contrast for Skeleton Action Recognition

November 2024 · 14 Reads

IEEE Transactions on Artificial Intelligence

In the field of 3D action recognition, self-supervised learning has shown promising results but remains a challenging task. Previous approaches to motion modeling often relied on selecting features solely from the temporal or spatial domain, which limited the extraction of higher-level semantic information. Additionally, traditional one-to-one approaches in multilevel comparative learning overlooked the relationships between different levels, hindering the model's representation learning. To address these issues, we propose the Hierarchical Spatial-temporal Masked network (HSTM) for learning 3D action representations. HSTM introduces a novel masking method that operates simultaneously in both the temporal and spatial dimensions. This approach leverages semantic relevance to identify meaningful regions in time and space, guiding the masking process based on semantic richness. This guidance is crucial for learning useful feature representations effectively. Furthermore, to enhance the learning of potential features, we introduce cross-level distillation (CLD) to extend the comparative learning approach. By training the model with two types of losses simultaneously, each level of the multi-level comparative learning process can be guided by levels rich in semantic information. This allows for more effective supervision of comparative learning, leading to improved performance. Extensive experiments conducted on the NTU-60, NTU-120, and PKU-MMD datasets demonstrate the effectiveness of our proposed framework. The learned action representations exhibit strong transferability and achieve state-of-the-art results.




Domain-Conditioned Transformer for Fully Test-time Adaptation

October 2024 · 6 Reads

Fully test-time adaptation aims to adapt a network model online based on sequential analysis of input samples during the inference stage. We observe that, when applying a transformer network model to a new domain, the self-attention profiles of image samples in the target domain deviate significantly from those in the source domain, which results in large performance degradation during domain changes. To address this important issue, we propose a new structure for the self-attention modules in the transformer. Specifically, we incorporate three domain-conditioning vectors, called domain conditioners, into the query, key, and value components of the self-attention module. We learn a network to generate these three domain conditioners from the class token at each transformer network layer. We find that, during fully online test-time adaptation, these domain conditioners at each transformer network layer are able to gradually remove the impact of domain shift and largely recover the original self-attention profile. Our extensive experimental results demonstrate that the proposed domain-conditioned transformer significantly improves the online fully test-time domain adaptation performance and outperforms existing state-of-the-art methods by large margins.


Window-based Channel Attention for Wavelet-enhanced Learned Image Compression

September 2024 · 4 Reads

Learned Image Compression (LIC) models have achieved better rate-distortion performance than traditional codecs. Existing LIC models use CNN, Transformer, or mixed CNN-Transformer basic blocks. However, limited by the shifted window attention, Swin-Transformer-based LIC exhibits a restricted growth of receptive fields, affecting the ability to model large objects in the image. To address this issue, we incorporate window partition into channel attention for the first time to obtain large receptive fields and capture more global information. Since channel attention hinders local information learning, it is important to extend existing attention mechanisms in Transformer codecs to space-channel attention to establish multiple receptive fields, so that the model can capture global correlations with large receptive fields while maintaining detailed characterization of local correlations with small receptive fields. We also incorporate the discrete wavelet transform into our Spatial-Channel Hybrid (SCH) framework for efficient frequency-dependent down-sampling and further enlarging receptive fields. Experimental results demonstrate that our method achieves state-of-the-art performance, reducing BD-rate by 18.54%, 23.98%, 22.33%, and 24.71% on four standard datasets compared to VTM-23.1.
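To make the wavelet down-sampling step concrete, here is a minimal sketch of a one-level 2D discrete wavelet transform, which splits an image into four half-resolution subbands. It assumes the Haar wavelet, which the abstract does not specify, and the function name `haar_dwt_2d` and the subband names are hypothetical; it illustrates the general transform, not the paper's SCH framework.

```python
def haar_dwt_2d(image):
    """One-level orthonormal 2D Haar transform of an even-sized 2D list.

    Returns four half-resolution subbands: a low-pass band plus bands
    holding left/right, top/bottom and diagonal differences.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    low, horiz, vert, diag = [], [], [], []
    for r in range(0, rows, 2):
        low_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # The 2x2 block [[a, b], [p, q]] maps to one coefficient per subband.
            a, b = image[r][c], image[r][c + 1]
            p, q = image[r + 1][c], image[r + 1][c + 1]
            low_row.append((a + b + p + q) / 2.0)  # local average (scaled)
            h_row.append((a - b + p - q) / 2.0)    # left minus right
            v_row.append((a + b - p - q) / 2.0)    # top minus bottom
            d_row.append((a - b - p + q) / 2.0)    # diagonal difference
        low.append(low_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return low, horiz, vert, diag

image = [
    [1.0, 1.0, 5.0, 7.0],
    [1.0, 1.0, 5.0, 7.0],
    [2.0, 4.0, 9.0, 9.0],
    [6.0, 8.0, 9.0, 9.0],
]
low, horiz, vert, diag = haar_dwt_2d(image)
```

Because the transform is orthonormal, the four subbands together carry the same energy as the input, so the halving of resolution loses no information; smooth regions end up almost entirely in the low-pass band.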


Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation

August 2024 · 3 Reads

Transformer-based methods have achieved remarkable success in various machine learning tasks. Designing efficient test-time adaptation methods for transformer models has become an important research task. In this work, motivated by the dual-subband wavelet lifting scheme developed in multi-scale signal processing, which is able to efficiently separate the input signals into principal components and noise components, we introduce a dual-path token lifting for domain shift correction in test-time adaptation. Specifically, we introduce an extra token, referred to as the domain shift token, at each layer of the transformer network. We then perform dual-path lifting with interleaved token prediction and update between the path of domain shift tokens and the path of class tokens at all network layers. The prediction and update networks are learned in an adversarial manner. Specifically, the task of the prediction network is to learn the residual noise of domain shift, which should be largely invariant across all classes and all samples in the target domain. In other words, the predicted domain shift noise should be indistinguishable between all sample classes. On the other hand, the task of the update network is to update the class tokens by removing the domain shift from the input image samples so that input samples become more discriminative between different classes in the feature space. To effectively learn the prediction and update networks with two adversarial tasks, both theoretically and practically, we demonstrate that it is necessary to use smooth optimization for the update network but non-smooth optimization for the prediction network. Experimental results on the benchmark datasets demonstrate that our proposed method significantly improves the online fully test-time domain adaptation performance. Code is available at https://github.com/yushuntang/DPAL.
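For readers unfamiliar with the lifting scheme that motivates this work, the sketch below shows one classic lifting step from signal processing: split the signal into two paths, predict one path from the other and keep the residual, then update the other path with that residual. This is the textbook (unnormalized Haar) version with fixed predict and update rules; in the paper both operators are learned networks acting on tokens, which this sketch does not attempt to reproduce. The function names are hypothetical.

```python
def lifting_forward(signal):
    """One Haar lifting step on an even-length sequence: split, predict, update."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    # Split into the two paths (even-indexed and odd-indexed samples).
    even, odd = signal[0::2], signal[1::2]
    # Predict: estimate each odd sample from its even neighbour, keep the residual.
    detail = [o - e for o, e in zip(odd, even)]
    # Update: correct the even path so that it carries the pairwise mean.
    coarse = [e + d / 2.0 for e, d in zip(even, detail)]
    return coarse, detail

def lifting_inverse(coarse, detail):
    """Undo the lifting step by reversing update and predict in turn."""
    even = [c - d / 2.0 for c, d in zip(coarse, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out

signal = [3.0, 5.0, 4.0, 4.0, 10.0, 6.0]
coarse, detail = lifting_forward(signal)
restored = lifting_inverse(coarse, detail)
```

The step is invertible by construction, whatever the predict and update rules are, which is what makes the structure attractive as a template for learned operators.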


Citations (64)


... Both CNN-based and transformer-based models discretise continuous functions, while Continuous U-Net [4] offers a continuous block to address this. The continuous formulation of Second Order NODEs [18,3] enables O(1) memory cost and has been applied in various tasks [25,26,20]. Inspired by Kolmogorov-Arnold Networks (KANs) [16], which use learnable activation functions at edges to optimise feature representation, U-KAN [12] integrates a Tokenised KAN Block with a Convolution Block in U-Net, but relies only on addition. ...

Reference:

Implicit U-KAN2.0: Dynamic, Efficient and Interpretable Medical Image Segmentation
Cross-Modal Few-Shot Learning with Second-Order Neural Ordinary Differential Equations
  • Citing Article
  • April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

... Nevertheless, these approaches require complete access to the entire target dataset and retraining of the source model for multiple epochs, making them impractical for real-world applications. Recently developed test-time adaptation (TTA) methods exhibit promising capabilities in adapting pre-trained models to unlabeled data during testing [5,19,23,29,35,37,39,41,45]. In this work, we focus on the fully test-time adaptation. ...

Learning Inference-Time Drift Sensor-Actuator for Domain Generalization
  • Citing Conference Paper
  • April 2024

... Auty et al. [2] introduced learnable prompts to replace fixed textual tokens, enhancing the model's flexibility and effectiveness. Hu et al. [18] further proposed a depth codebook with learnable prompts to better handle domain shifts in various scenes. CLIP2Depth [22] introduced mirror embeddings (non-natural-language representations) to adapt CLIP [38] for depth estimation while avoiding the need for explicit text prompts. ...

Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation
  • Citing Conference Paper
  • January 2024

... Researchers have proposed using less obtrusive sensor modalities and more easily deployable sensors in home environments for effective sleep monitoring of patients. [19] Among these modalities, EOG and PSM stand out as unobtrusive and more practical options for use in home-based sleep monitoring systems. ...

Full-coverage unobtrusive health monitoring of elders at homes
  • Citing Article
  • April 2024

Internet of Things

... These models provide a powerful foundation for few-shot transfer. However, the current few-shot adaptation strategies fall into two broad categories: adapter-based fine-tuning [14,26,50,56] and prompt-based tuning [38,43,57,58,59]. Both approaches struggle under low supervision, especially with fine-grained classes, due to their over-reliance on fixed prompts or shallow adaptation modules that cannot adequately encode task-specific variation. ...

Concept-Guided Prompt Learning for Generalization in Vision-Language Models
  • Citing Article
  • Full-text available
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence

... The leading detection model, HOITrans [49], achieves only 62% accuracy in few-shot binary prediction tasks. However, CLIP-based TPT [51] and BDC-Adapter [52] have shown promising results without training on the training split, which indicates a new paradigm of multi-modal reasoning for HOI. We evaluate our proposed NODE-Adapter method on this task to demonstrate its effectiveness in visual relationship reasoning. ...

BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning

... Notably, multi-modal learning is designed to address the challenge of transferring knowledge across different modalities [205] and may, when necessary, combine meta-learning and transfer learning in a single framework to meet this goal. In a number of current works, the focus has been on the combination of text and image classifiers as computer vision systems develop [204,206,207]. In some cases, multi-modality involves not only crossing domains but also crossing methodologies. ...

Cross-Modal Concept Learning and Inference for Vision-Language Models
  • Citing Article
  • March 2024

Neurocomputing

... However, their efficacy can be notably compromised when deployed in unfamiliar domains, primarily due to discrepancies in data distributions between the training datasets in the source domain and the evaluation datasets in the target domain [28,31]. Source-free unsupervised domain adaptation (UDA) [22,24,38,42] aims to recalibrate network models in the absence of any source-domain data samples. Nevertheless, these approaches require complete access to the entire target dataset and retraining of the source model for multiple epochs, making them impractical for real-world applications. ...

Cross-Inferential Networks for Source-Free Unsupervised Domain Adaptation
  • Citing Conference Paper
  • October 2023

... In addition to the previous tasks, test-time adaptation is also applied in other image-level tasks, for instance pose estimation [49], [143], [144], [172], [180], person re-identification [100], [328], deep fake detection [36], [395], out-of-distribution detection [67], [88], [150], style transfer [152], and federated learning [17], [72], [295]. Model adaptation is the most widely utilized approach in current image-level applications. ...

Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation

... 3) SVHN→MNIST/MNIST-M/USPS: We also evaluate our method's feasibility in simple transfer learning tasks. Following [34], we choose SVHN [35] as the source domain and transfer the trained model to other digits datasets: MNIST [36] (with a test set of 10,000 images), MNIST-M [37] (with 90,001 samples of modified MNIST images) and USPS [38] (with a test set of 2,007 images) respectively. ...

Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation