May 2025
·
1 Read
International Journal of Computer Vision
April 2025
·
2 Reads
Proceedings of the AAAI Conference on Artificial Intelligence
The effective utilization of data through Deep Neural Networks (DNNs) has profoundly influenced various aspects of society. The growing demand for high-quality, particularly personalized, data has spurred research efforts in recent years to prevent data leakage and protect privacy. Early privacy-preserving methods primarily relied on instance-wise modifications, such as erasing or obfuscating essential features for de-identification. However, this approach faces an inherent trade-off: minimal modification offers insufficient privacy protection, while excessive modification significantly degrades task performance. In this paper, we propose a novel Feature Recombining for Obfuscation (FRO) approach to address this trade-off. Unlike existing methods that generate one anonymized instance by perturbing the original data on a one-to-one basis, our FRO approach generates an anonymized instance by reassembling mixed ID-related features from multiple original data sources on a many-in-one basis. Instead of introducing additional noise for de-identification, our approach leverages existing non-polluted features from other instances to anonymize data. Extensive experiments on identity identification tasks demonstrate that FRO outperforms previous state-of-the-art methods, not only in utility performance but also in visual anonymization.
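A minimal sketch may help make the many-in-one idea concrete. The snippet below is hypothetical: it assumes instance embeddings whose leading `id_dims` channels carry ID-related information, and it mixes those channels from two donor instances per sample. The paper's actual feature partitioning and recombination rule are not specified here.

```python
import torch

def recombine_features(feats: torch.Tensor, id_dims: int) -> torch.Tensor:
    """Hypothetical many-in-one recombination sketch: for each instance,
    the ID-related dimensions (assumed to be the first `id_dims` channels)
    are reassembled from *other* instances in the batch, while the
    remaining task-relevant dimensions are kept untouched.

    feats: (B, D) batch of instance embeddings, B > 1.
    """
    B, _ = feats.shape
    anonymized = feats.clone()
    # Shuffle donors so no instance keeps its own ID features
    # (sketch only; duplicate donors are possible after the fixup).
    donors = torch.randperm(B)
    donors = torch.where(donors == torch.arange(B), (donors + 1) % B, donors)
    second = torch.roll(donors, 1)
    # Blend ID channels from two donors (many-in-one) instead of adding noise.
    mixed_id = 0.5 * feats[donors, :id_dims] + 0.5 * feats[second, :id_dims]
    anonymized[:, :id_dims] = mixed_id
    return anonymized
```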
April 2025
·
2 Reads
IEEE Transactions on Image Processing
Self-supervised visual pre-training models have achieved significant success without employing expensive annotations. Nevertheless, most of these models focus on iconic single-instance datasets (e.g., ImageNet), ignoring that the learned representations are insufficiently discriminative for non-iconic multi-instance datasets (e.g., COCO). In this paper, we propose a novel Object Adaptive Dense Pre-training (OADP) method to learn visual representations directly on multi-instance datasets (e.g., PASCAL VOC and COCO) for dense prediction tasks (e.g., object detection and instance segmentation). We present a novel object-aware and learning-adaptive random view augmentation that focuses contrastive learning on enhancing the discrimination of object representations from large to small scales across different learning stages. Furthermore, representations across different scales and resolutions are integrated so that the method can learn diverse representations. In experiments, we evaluate OADP pre-trained on PASCAL VOC and COCO. Results show that our method performs better than most existing state-of-the-art methods when transferred to various downstream tasks, including image classification, object detection, instance segmentation, and semantic segmentation.
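The object-aware, learning-adaptive augmentation can be pictured as a crop sampler whose scale anneals as training progresses. The sketch below is an assumption-laden illustration, not the paper's implementation: the linear scale schedule, the box source (ground truth or proposals), and the name `object_adaptive_crop` are all hypothetical.

```python
import torch
import torchvision.transforms.functional as TF

def object_adaptive_crop(img: torch.Tensor, box, progress: float,
                         out_size: int = 224) -> torch.Tensor:
    """Hypothetical object-aware, learning-adaptive view: crop around an
    object box, with the crop scale annealed from large (context-rich)
    to small (object-focused) as training progresses.

    img: (C, H, W) tensor; box: (x1, y1, x2, y2); progress in [0, 1].
    """
    _, H, W = img.shape
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Assumed schedule: scale anneals from 2x the object size down to 1x.
    scale = 2.0 - progress
    w = min(W, (x2 - x1) * scale)
    h = min(H, (y2 - y1) * scale)
    left = int(max(0, min(W - w, cx - w / 2)))
    top = int(max(0, min(H - h, cy - h / 2)))
    # Crop the annealed window and resize to the contrastive input size.
    return TF.resized_crop(img, top, left, int(h), int(w), [out_size, out_size])
```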
February 2025
·
10 Reads
February 2025
·
29 Reads
International Journal of Computer Vision
Generalized Category Discovery (GCD) aims at discovering both known and unknown classes in unlabeled data, using the knowledge learned from a limited set of labeled data. Despite today’s foundation models being trained on Internet-scale multi-modal corpora, we find that they still struggle in GCD due to the ambiguity in class definitions. In this paper, we present Consistent Prompt Tuning (CPT) to disambiguate the classes for large vision-language models (e.g., CLIP). To this end, CPT learns a set of “task + class” prompts for labeled and unlabeled data of both known and unknown classes, with the “task” tokens globally shared across classes, which contain a unified class definition pattern, e.g., “the foreground is an animal named” or “the background scene is”. These prompts are optimized with two efficient regularization techniques that encourage consistent global and local relationships between any two matched inputs. CPT is evaluated on various existing GCD benchmarks, as well as in new practical scenarios with fewer annotations and customized class definitions, demonstrating clear superiority and broad versatility over existing state-of-the-art methods.
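One way to picture the "task + class" prompt design is a learnable prompt table whose task prefix is a single shared parameter. The module below is a minimal sketch under that assumption; the token counts, dimensions, and the use of free class-token parameters (rather than tokenized class names fed through CLIP's text encoder) are illustrative choices, not CPT's actual configuration.

```python
import torch
import torch.nn as nn

class ConsistentPromptSketch(nn.Module):
    """Sketch of "task + class" prompts: the task tokens are shared
    globally across classes (one class-definition pattern), while each
    class keeps its own class tokens."""

    def __init__(self, n_classes: int, n_task_tokens: int = 8,
                 n_class_tokens: int = 4, dim: int = 512):
        super().__init__()
        # Shared, learnable task prefix (e.g. "the foreground is an animal named").
        self.task_tokens = nn.Parameter(torch.randn(n_task_tokens, dim) * 0.02)
        # Per-class learnable tokens standing in for the class name embedding.
        self.class_tokens = nn.Parameter(
            torch.randn(n_classes, n_class_tokens, dim) * 0.02)

    def forward(self) -> torch.Tensor:
        # (n_classes, n_task_tokens + n_class_tokens, dim); in practice the
        # sequence would be fed through a frozen CLIP text encoder.
        task = self.task_tokens.unsqueeze(0).expand(
            self.class_tokens.size(0), -1, -1)
        return torch.cat([task, self.class_tokens], dim=1)
```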
January 2025
·
32 Reads
·
8 Citations
ACM Transactions on Sensor Networks
With the advancement of the Internet of Things (IoT), WiFi Channel State Information (CSI)-based Human Activity Recognition (HAR) has garnered increasing attention from both academic and industrial communities. However, the scarcity of labeled data remains a prominent challenge in CSI-based HAR, primarily due to privacy concerns and the incomprehensibility of CSI data. Concurrently, Self-Supervised Learning (SSL) has emerged as a promising approach for addressing the dilemma of insufficient labeled data. In this paper, we undertake a comprehensive inventory and analysis of different categories of SSL algorithms, encompassing both previously studied and unexplored approaches within the field. We provide an in-depth investigation and evaluation of SSL algorithms in the context of WiFi CSI-based HAR, utilizing publicly available datasets that encompass various tasks and environmental settings. To ensure relevance to real-world applications, we design experiment settings aligned with specific requirements. Furthermore, our experimental findings uncover several limitations and blind spots in existing work, shedding light on the barriers that need to be addressed before SSL can be effectively deployed in real-world WiFi-based HAR applications. Our results also serve as practical guidelines and provide valuable insights for future research endeavors in this field.
December 2024
·
19 Reads
Producing emotionally dynamic 3D facial avatars with text derived from spoken words (Emo3D) has been a pivotal research topic in 3D avatar generation. While progress has been made in general-purpose 3D avatar generation, the exploration of generating emotional 3D avatars remains scarce, primarily due to the complexities of identifying and rendering rich emotions from spoken words. This paper reexamines Emo3D generation and draws inspiration from human processes, breaking down Emo3D into two cascading steps: Text-to-3D Expression Mapping (T3DEM) and 3D Avatar Rendering (3DAR). T3DEM is the most crucial step in determining the quality of Emo3D generation and encompasses three key challenges: Expression Diversity, Emotion-Content Consistency, and Expression Fluidity. To address these challenges, we introduce a novel benchmark to advance research in Emo3D generation. First, we present EmoAva, a large-scale, high-quality dataset for T3DEM, comprising 15,000 text-to-3D expression mappings that characterize the aforementioned three challenges in Emo3D generation. Furthermore, we develop various metrics to effectively evaluate models against these identified challenges. Next, to effectively model the consistency, diversity, and fluidity of human expressions in the T3DEM step, we propose the Continuous Text-to-Expression Generator, which employs an autoregressive Conditional Variational Autoencoder for expression code generation, enhanced with Latent Temporal Attention and Expression-wise Attention mechanisms. Finally, to further enhance the 3DAR step in rendering higher-quality subtle expressions, we present the Globally-informed Gaussian Avatar (GiGA) model. GiGA incorporates a global information mechanism into 3D Gaussian representations, enabling the capture of subtle micro-expressions and seamless transitions between emotional states.
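The autoregressive Conditional Variational Autoencoder in the T3DEM step can be sketched as a per-step module that samples a latent conditioned on the text embedding and the decoding history, then emits the next expression code. The sizes, the single GRU cell, and the omission of the Latent Temporal Attention and Expression-wise Attention mechanisms are simplifications for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ARCVAEStep(nn.Module):
    """One sketched autoregressive CVAE step for expression-code
    generation: update the history, sample a conditional latent, and
    decode the next 3D expression code."""

    def __init__(self, text_dim=512, expr_dim=52, latent_dim=64, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(expr_dim, hidden)
        self.prior = nn.Linear(hidden + text_dim, 2 * latent_dim)
        self.decode = nn.Linear(hidden + text_dim + latent_dim, expr_dim)

    def forward(self, prev_expr, h, text_emb):
        h = self.rnn(prev_expr, h)                      # decoding history
        ctx = torch.cat([h, text_emb], dim=-1)
        mu, logvar = self.prior(ctx).chunk(2, dim=-1)   # conditional prior
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        next_expr = self.decode(torch.cat([ctx, z], dim=-1))
        return next_expr, h
```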
November 2024
·
1 Read
October 2024
·
3 Reads
October 2024
·
20 Reads
·
1 Citation
... The sensing capabilities of Wi-Fi have been explored across a broad spectrum of applications, ranging from smart homes to healthcare, encompassing areas such as gesture recognition [2], human tracking [3], people counting [4], and sleep monitoring [5]. Despite its potential, Wi-Fi sensing remains significantly constrained by limited accuracy and poor generalization, making real-world deployment challenging. ...
January 2025
ACM Transactions on Sensor Networks
... For example, a single news story can be represented in multiple formats, including video, audio, and text, and reported in various languages across different countries, such as Chinese, English, and Russian. A more comprehensive description of multiview data can be obtained by mining the consistent and complementary information from different views, which could be used for various tasks, including clustering (Sun et al. 2024c; Jin et al. 2023; Dong et al. 2023; He et al. 2024; Li et al. 2023a, 2022b), retrieval (Sun et al. 2024a; Yan et al. 2020; Feng et al. 2023; Sun et al. 2024b), and classification (Sun et al. 2023; Liu et al. 2023a,b; Sun et al. 2022). ...
October 2024
... The studies in [53] address challenges in multimodal detection due to information asymmetry and conflicting beliefs across modalities. The approach uses a Dirichlet distribution-based uncertainty learning method and leverages uncertainty to guide feature fusion. ...
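For context, a common subjective-logic formulation treats per-class evidence e as defining a Dirichlet with alpha = e + 1, so the uncertainty mass is u = K / sum(alpha); the resulting confidence can then steer fusion. The sketch below follows that generic recipe, not necessarily the exact rule in [53]; the confidence-weighted fusion is an assumption.

```python
import torch

def dirichlet_uncertainty(evidence: torch.Tensor) -> torch.Tensor:
    """Subjective-logic uncertainty for a Dirichlet over K classes:
    alpha = evidence + 1, u = K / sum(alpha). evidence: (B, K), >= 0."""
    alpha = evidence + 1.0
    return evidence.size(-1) / alpha.sum(dim=-1, keepdim=True)

def uncertainty_guided_fusion(feats, evidences) -> torch.Tensor:
    """Weight each modality's features by its confidence 1 - u and
    renormalize across modalities; a sketch of uncertainty-guided
    fusion. feats: list of (B, D); evidences: list of (B, K)."""
    confs = torch.stack([1.0 - dirichlet_uncertainty(e) for e in evidences])
    weights = confs / confs.sum(dim=0, keepdim=True)   # (M, B, 1)
    return (weights * torch.stack(feats)).sum(dim=0)   # (B, D)
```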
October 2024
... All aforementioned multi-teacher distillation methods have another thing in common: they distill only from generic data suitable to all teachers: DataComp-1B [21] for AM-RADIO, ImageNet-1K [17] for UNIC [47] and Theia [51]. Heterogeneous data has recently been used for distillation across different datasets for classification [26,71] or domain adaptation [55]. To the best of our knowledge, no existing work has looked into distillation using data that contains natural images as well as synthetic data from 3D engines, CAD models, simulators, and data rendered from structure-from-motion reconstructions. ...
October 2024
... Existing scene representations in 3D-VL models can be categorized into two approaches. The first adopts 3D modality inputs such as point clouds, but requires complex pre-processing pipelines such as 3D reconstruction and instance segmentation [117,39,11]. These methods also confront the inherent challenge of direct 3D perception, which is exacerbated by the scarcity of 3D-VL data. ...
June 2024
... Jiang et al. [8] presented a novel two-stage KD method involving label propagation guided by a teacher model in the first stage, followed by mutual label propagation between the teacher and student models in the second stage, enhancing the training of an accurate, low-cost student model despite noisy labels. Zhou et al. [44] proposed a data-free KD method that uses web-collected datasets to address both closed-set and open-set label noise by dividing the dataset into clean, closed-noisy, and open-noisy subsets based on specific data characteristics. Liu et al. [45] addressed noisy labels in graph neural networks with a label correction method called multi-teacher self-training, which leverages semi-supervised node classification and reuses initial training-phase parameters as the teacher model to guide subsequent noisy-label removal and training processes. ...
July 2024
IEEE Transactions on Image Processing
... The remarkable progress of 2D Vision-Language Models (VLMs) through pre-training and supervised fine-tuning (SFT) [21,29,10,12,25,53,41,39,36,24] has sparked increasing interest in extending these models to 3D settings [14,40,15,22,16,52,6,8,50]. By leveraging powerful open-source Large Language Models (LLMs) and richly annotated 3D datasets [11,2,3,1,28,9], substantial progress has been made in 3D Vision-Language tasks such as 3D Question Answering (3D-QA) [2,28], 3D Dense Captioning (3D-DC) [7,9], and 3D Visual Grounding (3D-VG) [3,1]. ...
April 2024
IEEE Transactions on Pattern Analysis and Machine Intelligence
... In contrast, the explicit role of diversity in retrieval-based demonstration selection remains underexplored. Diversity has shown promise in other domains, such as fixed-prompt ICL with global demonstration sets (Qiu, 2023b; Luo et al., 2024), active learning (Giouroukis et al., 2025; Shi and Shen, 2016), coreset construction (Sener and Savarese, 2018; Wan et al., 2024; Zhan et al., 2025), and instruction tuning. Although some recent work incorporates feature coverage as a proxy for diversity (Gupta et al., 2023; Levy et al., 2023), coverage primarily aims at spanning input features rather than explicitly promoting representational variety. ...
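A standard way to make diversity explicit in retrieval-based selection is maximal marginal relevance (MMR), which trades off relevance to the query against similarity to already-picked demonstrations. The sketch below shows that generic recipe; it is not any one cited paper's algorithm, and the lambda weighting is illustrative.

```python
import torch

def mmr_select(query: torch.Tensor, cands: torch.Tensor,
               k: int, lam: float = 0.7) -> list:
    """Maximal-marginal-relevance demonstration selection (generic
    recipe, not a specific paper's method). query: (D,), cands: (N, D),
    both unit-normalized; assumes 1 <= k <= N."""
    rel = cands @ query                          # relevance scores (N,)
    picked = [int(rel.argmax())]
    while len(picked) < k:
        # Penalize candidates similar to anything already picked.
        sim_to_picked = (cands @ cands[picked].T).max(dim=1).values
        score = lam * rel - (1 - lam) * sim_to_picked
        score[picked] = float("-inf")            # never re-pick
        picked.append(int(score.argmax()))
    return picked
```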
March 2024
Proceedings of the AAAI Conference on Artificial Intelligence
... Moreover, as the generator is only trained on seen class samples, this may cause limited transferability and low generalization to unseen classes. Following the generative framework [13], [20] introduced three modules to improve the generated visual features and transferability from semantic to visual space. However, no matter how diverse the generated visual features are, the domain gap between the semantic and visual features still exists. ...
October 2023
... In contrast to the text-prompting methods, MvNet [152] uses multi-view projected features to prompt a pre-trained vision model. The point cloud is encoded into multiple views, and a multi-view prompt vision fusion module interchanges and merges information across views through an attention mechanism. ...
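The interchange-and-merge behavior described for MvNet's multi-view prompt fusion can be approximated by running self-attention over the concatenated tokens of all views. The module below is a rough sketch of that idea with illustrative sizes; MvNet's actual fusion module is not reproduced here.

```python
import torch
import torch.nn as nn

class CrossViewFusionSketch(nn.Module):
    """Sketch of attention-based multi-view fusion: tokens from each
    projected view attend to all other views, so information is
    interchanged across views before merging."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, V, N, D) -> flatten views into one token sequence so
        # attention runs across views as well as within them.
        B, V, N, D = views.shape
        tokens = views.reshape(B, V * N, D)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.reshape(B, V, N, D)
```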
January 2023
IEEE Transactions on Circuits and Systems for Video Technology