Hongyuan Zhu’s research while affiliated with the Agency for Science, Technology and Research (A*STAR) and other institutions


Publications (102)


Correction: Consistent Prompt Tuning for Generalized Category Discovery
  • Article
  • May 2025 · 1 Read · International Journal of Computer Vision

Muli Yang · Jie Yin · Yanan Gu · [...] · Hongyuan Zhu

Balancing Privacy and Performance: A Many-in-One Approach for Image Anonymization

April 2025 · 2 Reads · Proceedings of the AAAI Conference on Artificial Intelligence

The effective utilization of data through Deep Neural Networks (DNNs) has profoundly influenced various aspects of society. The growing demand for high-quality, particularly personalized, data has spurred research efforts to prevent data leakage and protect privacy in recent years. Early privacy-preserving methods primarily relied on instance-wise modifications, such as erasing or obfuscating essential features for de-identification. However, this approach highlights an inherent trade-off: minimal modification offers insufficient privacy protection, while excessive modification significantly degrades task performance. In this paper, we propose a novel Feature Recombining for Obfuscation (FRO) approach to address this trade-off. Unlike existing methods that generate one anonymized instance by perturbing the original data on a one-to-one basis, our FRO approach generates an anonymized instance by reassembling mixed ID-related features from multiple original data sources on a many-in-one basis. Instead of introducing additional noise for de-identification, our approach leverages the existing non-polluted features from other instances to anonymize data. Extensive experiments on identity identification tasks demonstrate that FRO outperforms previous state-of-the-art methods, not only in utility performance but also in visual anonymization.
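As a rough illustration of the many-in-one idea described in this abstract, the sketch below blends a target image's identity embedding with embeddings from several donor images before decoding, instead of injecting noise. It is a minimal toy sketch, assuming a pretrained identity encoder and a generator exist; the module names (ToyIDEncoder, ToyDecoder, many_in_one_anonymize) and the simple embedding averaging are illustrative stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of "many-in-one" anonymization: recombine ID-related
# features from several donor images rather than perturbing the target alone.
import torch
import torch.nn as nn

class ToyIDEncoder(nn.Module):
    """Stand-in for a pretrained identity encoder (e.g. a face recognizer)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
    def forward(self, x):
        return self.net(x)

class ToyDecoder(nn.Module):
    """Stand-in for a generator that renders an image from an ID embedding."""
    def __init__(self, dim=128, out_shape=(3, 64, 64)):
        super().__init__()
        self.out_shape = out_shape
        self.net = nn.Linear(dim, int(torch.tensor(out_shape).prod()))
    def forward(self, z):
        return self.net(z).view(-1, *self.out_shape)

def many_in_one_anonymize(target, donors, encoder, decoder, keep=0.3):
    """Blend the target's ID embedding with averaged donor embeddings."""
    z_target = encoder(target)                             # (1, dim)
    z_donors = encoder(donors).mean(dim=0, keepdim=True)   # average donor IDs
    z_mixed = keep * z_target + (1.0 - keep) * z_donors    # recombined identity
    return decoder(z_mixed)

encoder, decoder = ToyIDEncoder(), ToyDecoder()
target = torch.rand(1, 3, 64, 64)
donors = torch.rand(4, 3, 64, 64)
anonymized = many_in_one_anonymize(target, donors, encoder, decoder)
print(anonymized.shape)  # torch.Size([1, 3, 64, 64])
```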


Object Adaptive Self-Supervised Dense Visual Pre-Training

April 2025 · 2 Reads · IEEE Transactions on Image Processing

Self-supervised visual pre-training models have achieved significant success without employing expensive annotations. Nevertheless, most of these models focus on iconic single-instance datasets (e.g., ImageNet) and yield insufficiently discriminative representations for non-iconic multi-instance datasets (e.g., COCO). In this paper, we propose a novel Object Adaptive Dense Pre-training (OADP) method to learn visual representations directly on multi-instance datasets (e.g., PASCAL VOC and COCO) for dense prediction tasks (e.g., object detection and instance segmentation). We present a novel object-aware and learning-adaptive random view augmentation that focuses the contrastive learning on enhancing the discrimination of object representations from large to small scale across different learning stages. Furthermore, representations across different scales and resolutions are integrated so that the method can learn diverse representations. In the experiments, we evaluated OADP pre-trained on PASCAL VOC and COCO. Results show that our method performs better than most existing state-of-the-art methods when transferring to various downstream tasks, including image classification, object detection, instance segmentation and semantic segmentation.
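To make the contrastive pre-training idea concrete, the sketch below shows a generic region-level InfoNCE loss over matched regions from two augmented views. It is a minimal illustration of dense contrastive learning in general, not the authors' OADP code; the function name, temperature, and shapes are assumptions.

```python
# A minimal, hypothetical sketch of a dense (region-level) contrastive loss.
import torch
import torch.nn.functional as F

def dense_info_nce(feats_a, feats_b, temperature=0.2):
    """feats_a, feats_b: (N, D) features of N matched regions from two views."""
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    logits = a @ b.t() / temperature       # (N, N) similarity matrix
    targets = torch.arange(a.size(0))      # region i in view A matches region i in view B
    return F.cross_entropy(logits, targets)

# Example: 16 matched regions with 256-d projections from two augmented views.
loss = dense_info_nce(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```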



Figure captions (article preview):
  • Generalized Category Discovery (GCD) aims to cluster unlabeled data using the knowledge in labeled data. Although recent VLMs are trained with an Internet-scale corpus, they may still struggle in GCD, since different class definitions (as depicted in the two different tasks) lead to distinct clustering results
  • Conceptual illustration of “task + class” prompts and the CPT objective. The supervised loss is applied to labeled data, whereas vision-vision consistency (VVC) and vision-language consistency (VLC) are applied to both labeled and unlabeled data. See Sect. 3.2 for more details
  • Overall pipeline of the proposed Consistent Prompt Tuning (CPT) with the frozen text and image encoders of CLIP (Radford et al., 2021). CPT tunes K learnable “task + class” prompts and a lightweight adapter for labeled and unlabeled data of both known and unknown classes
  • Illustration of vision-vision consistency (VVC) and vision-language consistency (VLC)
  • GCD accuracy (reported on all classes) in the few-shot setting. “GCD+”: GCD (Vaze et al., 2022a) reimplemented using CLIP’s image encoder

Consistent Prompt Tuning for Generalized Category Discovery

February 2025 · 29 Reads · International Journal of Computer Vision

Generalized Category Discovery (GCD) aims at discovering both known and unknown classes in unlabeled data, using the knowledge learned from a limited set of labeled data. Despite today’s foundation models being trained with an Internet-scale multi-modal corpus, we find that they still struggle in GCD due to the ambiguity in class definitions. In this paper, we present Consistent Prompt Tuning (CPT) to disambiguate the classes for large vision-language models (e.g., CLIP). To this end, CPT learns a set of “task + class” prompts for labeled and unlabeled data of both known and unknown classes, with the “task” tokens globally shared across classes, which contain a unified class definition pattern, e.g., “the foreground is an animal named” or “the background scene is”. These prompts are optimized with two efficient regularization techniques that encourage consistent global and local relationships between any two matched inputs. CPT is evaluated on various existing GCD benchmarks, as well as in new practical scenarios with fewer annotations and customized class definitions, demonstrating clear superiority and broad versatility over existing state-of-the-art methods.
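The sketch below illustrates the general shape of “task + class” prompt tuning with a consistency term: shared task tokens, per-class tokens, and a symmetric KL divergence between predictions from two views. It is a hypothetical, simplified reading of the abstract; the module names, shapes, and the specific loss choice are assumptions rather than CLIP's or CPT's actual API.

```python
# A hypothetical sketch of "task + class" prompt tuning with a consistency term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskClassPrompts(nn.Module):
    def __init__(self, num_classes, task_len=4, class_len=2, dim=512):
        super().__init__()
        # "task" tokens are shared across all classes; "class" tokens are per class.
        self.task = nn.Parameter(torch.randn(task_len, dim) * 0.02)
        self.cls = nn.Parameter(torch.randn(num_classes, class_len, dim) * 0.02)
    def forward(self):
        task = self.task.unsqueeze(0).expand(self.cls.size(0), -1, -1)
        return torch.cat([task, self.cls], dim=1)   # (C, task_len + class_len, dim)

def consistency_loss(logits_view1, logits_view2):
    """Symmetric KL between class posteriors of two augmented views."""
    p, q = F.log_softmax(logits_view1, -1), F.log_softmax(logits_view2, -1)
    return 0.5 * (F.kl_div(p, q.exp(), reduction="batchmean")
                  + F.kl_div(q, p.exp(), reduction="batchmean"))

prompts = TaskClassPrompts(num_classes=10)
print(prompts().shape)                                         # torch.Size([10, 6, 512])
print(consistency_loss(torch.randn(8, 10), torch.randn(8, 10)).item())
```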


Evaluating Self-Supervised Learning for WiFi CSI-Based Human Activity Recognition

January 2025 · 32 Reads · 8 Citations · ACM Transactions on Sensor Networks

With the advancement of the Internet of Things (IoT), WiFi Channel State Information (CSI)-based Human Activity Recognition (HAR) has garnered increasing attention from both academic and industrial communities. However, the scarcity of labeled data remains a prominent challenge in CSI-based HAR, primarily due to privacy concerns and the incomprehensibility of CSI data. Concurrently, Self-Supervised Learning (SSL) has emerged as a promising approach for addressing the dilemma of insufficient labeled data. In this paper, we undertake a comprehensive inventory and analysis of different categories of SSL algorithms, encompassing both previously studied and unexplored approaches within the field. We provide an in-depth investigation and evaluation of SSL algorithms in the context of WiFi CSI-based HAR, utilizing publicly available datasets that encompass various tasks and environmental settings. To ensure relevance to real-world applications, we design experiment settings aligned with specific requirements. Furthermore, our experimental findings uncover several limitations and blind spots in existing work, shedding light on the barriers that need to be addressed before SSL can be effectively deployed in real-world WiFi-based HAR applications. Our results also serve as practical guidelines and provide valuable insights for future research endeavors in this field.
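As a hedged illustration of the evaluation protocol such studies typically use, the sketch below freezes a (pretend) SSL-pre-trained CSI encoder and trains only a linear probe on activity labels. The encoder architecture, tensor shapes, and class count are illustrative assumptions, not taken from the surveyed methods.

```python
# A minimal, hypothetical sketch of linear-probe evaluation of an SSL encoder
# pre-trained on unlabeled WiFi CSI windows.
import torch
import torch.nn as nn

class CSIEncoder(nn.Module):
    """Toy 1D-CNN encoder for CSI windows of shape (batch, subcarriers, time)."""
    def __init__(self, subcarriers=30, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(subcarriers, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

encoder = CSIEncoder()
# Pretend the encoder was pre-trained with some SSL objective, then freeze it.
for p in encoder.parameters():
    p.requires_grad = False

probe = nn.Linear(128, 6)          # linear probe over 6 activity classes
csi = torch.randn(8, 30, 256)      # 8 windows, 30 subcarriers, 256 time steps
logits = probe(encoder(csi))
print(logits.shape)                # torch.Size([8, 6])
```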


Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark

December 2024 · 19 Reads

Producing emotionally dynamic 3D facial avatars with text derived from spoken words (Emo3D) has been a pivotal research topic in 3D avatar generation. While progress has been made in general-purpose 3D avatar generation, the exploration of generating emotional 3D avatars remains scarce, primarily due to the complexities of identifying and rendering rich emotions from spoken words. This paper reexamines Emo3D generation and draws inspiration from human processes, breaking down Emo3D into two cascading steps: Text-to-3D Expression Mapping (T3DEM) and 3D Avatar Rendering (3DAR). T3DEM is the most crucial step in determining the quality of Emo3D generation and encompasses three key challenges: Expression Diversity, Emotion-Content Consistency, and Expression Fluidity. To address these challenges, we introduce a novel benchmark to advance research in Emo3D generation. First, we present EmoAva, a large-scale, high-quality dataset for T3DEM, comprising 15,000 text-to-3D expression mappings that characterize the aforementioned three challenges in Emo3D generation. Furthermore, we develop various metrics to effectively evaluate models against these identified challenges. Next, to effectively model the consistency, diversity, and fluidity of human expressions in the T3DEM step, we propose the Continuous Text-to-Expression Generator, which employs an autoregressive Conditional Variational Autoencoder for expression code generation, enhanced with Latent Temporal Attention and Expression-wise Attention mechanisms. Finally, to further enhance the 3DAR step in rendering higher-quality subtle expressions, we present the Globally-informed Gaussian Avatar (GiGA) model. GiGA incorporates a global information mechanism into 3D Gaussian representations, enabling the capture of subtle micro-expressions and seamless transitions between emotional states.
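To ground the T3DEM step, the sketch below shows a bare-bones conditional VAE that maps a text embedding plus an expression code to a reconstruction and a KL term. It strips away the autoregressive decoding and the Latent Temporal / Expression-wise Attention described above; all dimensions and names are illustrative assumptions, not the paper's model.

```python
# A minimal, hypothetical sketch of a conditional VAE from text embeddings to
# 3D expression codes, loosely mirroring the T3DEM step.
import torch
import torch.nn as nn

class ExpressionCVAE(nn.Module):
    def __init__(self, text_dim=256, expr_dim=52, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(text_dim + expr_dim, 2 * latent_dim)  # -> (mu, logvar)
        self.dec = nn.Linear(text_dim + latent_dim, expr_dim)

    def forward(self, text_emb, expr_code):
        mu, logvar = self.enc(torch.cat([text_emb, expr_code], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterize
        recon = self.dec(torch.cat([text_emb, z], -1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon, kl

model = ExpressionCVAE()
text_emb = torch.randn(4, 256)      # e.g. sentence embeddings of the spoken text
expr_code = torch.randn(4, 52)      # e.g. blendshape-like expression codes
recon, kl = model(text_emb, expr_code)
print(recon.shape, kl.item())
```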





Citations (58)


... The sensing capabilities of Wi-Fi have been explored across a broad spectrum of applications, ranging from smart homes to healthcare, encompassing areas such as gesture recognition [2], human tracking [3], people counting [4], and sleep monitoring [5]. Despite its potential, Wi-Fi sensing remains significantly constrained by limited accuracy and poor generalization, making real-world deployment challenging. ...

Reference:

MORIC: CSI Delay-Doppler Decomposition for Robust Wi-Fi-based Human Activity Recognition
Evaluating Self-Supervised Learning for WiFi CSI-Based Human Activity Recognition
  • Citing Article
  • January 2025

ACM Transactions on Sensor Networks

... For example, a single news story can be represented in multiple formats, including video, audio, and text, and reported in various languages across different countries, such as Chinese, English, and Russian. A more comprehensive description of multiview data can be obtained by mining the consistent and complementary information from different views, which could be used for various tasks, including clustering (Sun et al. 2024c; Jin et al. 2023; Dong et al. 2023; He et al. 2024; Li et al. 2023a, 2022b), retrieval (Sun et al. 2024a; Yan et al. 2020; Feng et al. 2023; Sun et al. 2024b), and classification (Sun et al. 2023; Liu et al. 2023a, b; Sun et al. 2022). ...

Robust Variational Contrastive Learning for Partially View-unaligned Clustering
  • Citing Conference Paper
  • October 2024

... The studies in [53] address challenges in multimodal detection due to information asymmetry and conflicting beliefs across modalities. The approach uses a Dirichlet distribution-based uncertainty learning method and leverages uncertainty to guide feature fusion. ...

Assess and Guide: Multi-modal Fake News Detection via Decision Uncertainty
  • Citing Conference Paper
  • October 2024

... All aforementioned multi-teacher distillation methods have another thing in common: they distill only from generic data suitable to all teachers: DataComp-1B [21] for AM-RADIO, ImageNet-1K [17] for UNIC [47] and Theia [51]. Heterogeneous data has recently been used for distillation across different datasets for classification [26,71] or domain adaptation [55]. To the best of our knowledge, no existing work has looked into distillation using data that contains natural images as well as synthetic data from 3D engines, CAD models, simulators, and renderings from structure-from-motion reconstructions. ...

Direct Distillation Between Different Domains
  • Citing Chapter
  • October 2024

... Existing scene representations in 3D-VL models can be categorized into two approaches. The first adopts 3D modality inputs such as point clouds, but requires complex pre-processing pipelines such as 3D reconstruction and instance segmentation [117,39,11]. And they confront the inherent challenge of direct 3D perception, which is exacerbated by the scarcity of 3D-VL data. ...

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
  • Citing Conference Paper
  • June 2024

... Jiang et al. [8] presented a novel two-stage KD method involving label propagation guided by a teacher model in the first stage, followed by mutual label propagation between the teacher and student models in the second stage, enhancing the training of an accurate, low-cost student model despite noisy labels. Zhou et al. [44] proposed a data-free KD method that uses web-collected datasets to address both closed-set and open-set label noise by dividing the dataset into clean, closed-noisy, and open-noisy subsets based on specific data characteristics. Liu et al. [45] addressed noisy labels in graph neural networks with a label correction method called multi-teacher self-training, which leverages semi-supervised node classification and reuses initial training phase parameters as the teacher model to guide subsequent noisy label removal and training processes. ...

Learning Student Network Under Universal Label Noise
  • Citing Article
  • July 2024

IEEE Transactions on Image Processing

... The remarkable progress of 2D Vision-Language Models (VLMs) through pre-training and supervised fine-tuning (SFT) [21,29,10,12,25,53,41,39,36,24] has sparked increasing interest in extending these models to 3D settings [14,40,15,22,16,52,6,8,50]. By leveraging powerful open-source Large Language Models (LLMs) and richly annotated 3D datasets [11,2,3,1,28,9], substantial progress has been made in 3D Vision-Language tasks such as 3D Question Answering (3D-QA) [2,28], 3D Dense Captioning (3D-DC) [7,9] and 3D Visual Grounding (3D-VG) [3,1]. ...

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
  • Citing Article
  • April 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... In contrast, the explicit role of diversity in retrieval-based demonstration selection remains underexplored. Diversity has shown promise in other domains, such as fixed-prompt ICL with global demonstration sets (Qiu, 2023b; Luo et al., 2024), active learning (Giouroukis et al., 2025; Shi and Shen, 2016), coreset construction (Sener and Savarese, 2018; Wan et al., 2024; Zhan et al., 2025), and instruction tuning. Although some recent work incorporates feature coverage as a proxy for diversity (Gupta et al., 2023; Levy et al., 2023), coverage primarily aims at spanning input features rather than explicitly promoting representational variety. ...

Contributing Dimension Structure of Deep Feature for Coreset Selection
  • Citing Article
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence

... Moreover, as the generator is only trained on seen class samples, this may cause limited transferability and low generalization to unseen classes. Following the generative framework [13], [20] introduced three modules to improve the generated visual features and transferability from semantic to visual space. However, no matter how diverse the generated visual features are, the domain gap between the semantic and visual features still exists. ...

Zero-Shot Point Cloud Segmentation by Semantic-Visual Aware Synthesis
  • Citing Conference Paper
  • October 2023

... In contrast to the text-prompting methods, MvNet [152] uses multi-view projected features to prompt a pre-trained vision model. The point cloud is encoded into multiple views, and a multi-view prompt vision fusion module interchanges and merges information across views through an attention mechanism. ...

Multi-view Vision Fusion Network: Can 2D Pre-trained Model Boost 3D Point Cloud Data-scarce Learning?
  • Citing Article
  • January 2023

IEEE Transactions on Circuits and Systems for Video Technology