May 2025 · 6 Reads · Neural Networks
March 2025 · 2 Reads · IEEE Transactions on Geoscience and Remote Sensing
Change detection (CD) is an essential field in remote sensing, with a primary focus on identifying areas of change in bitemporal image pairs captured at varying intervals of the same region. The data annotation process for CD tasks is both time-consuming and labor-intensive. To better utilize the scarce labeled data and abundant unlabeled data, we introduce an adaptive semi-supervised learning method, AdaSemiCD, to improve pseudo-label usage and optimize the training process. Initially, due to the extreme class imbalance inherent in CD, the model is more inclined to focus on the background class and easily confuses the boundary of the target object. Considering these two points, we develop a measurable evaluation metric for pseudo-labels that enhances the representation of information entropy through class rebalancing and amplification of ambiguous areas, assigning greater weights to prospective change objects. Subsequently, to enhance the reliability of sample-wise pseudo-labels, we introduce the AdaFusion module, which dynamically identifies the most uncertain region and substitutes it with more trustworthy content. Lastly, to ensure better training stability, we introduce the AdaEMA module, which updates the teacher model using only batches of trusted samples. Experimental results on ten public CD datasets validate the efficacy and generalizability of our proposed adaptive training framework.
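As a rough illustration of the idea behind such a metric, the sketch below scores each sample's pseudo-label by class-rebalanced entropy with extra weight on ambiguous pixels; the function name, weighting scheme, and shapes are assumptions for illustration, not the AdaSemiCD implementation.

```python
import torch
import torch.nn.functional as F

def pseudo_label_quality(logits, eps=1e-6):
    """Toy sketch of an entropy-based pseudo-label quality score.

    Rebalances per-pixel entropy by inverse class frequency (so rare change
    pixels count more) and up-weights ambiguous pixels whose top-1 probability
    is low. Lower score = more trustworthy pseudo-label. Illustrative only.
    """
    probs = F.softmax(logits, dim=1)                          # (B, C, H, W)
    entropy = -(probs * torch.log(probs + eps)).sum(dim=1)    # (B, H, W)

    hard = probs.argmax(dim=1)                                # pseudo-label map
    num_classes = probs.shape[1]
    freq = torch.stack([(hard == c).float().mean(dim=(1, 2))
                        for c in range(num_classes)], dim=1)  # (B, C)
    inv_freq = 1.0 / (freq + eps)
    class_weight = inv_freq.gather(1, hard.flatten(1)).view_as(hard)

    ambiguity = 1.0 - probs.max(dim=1).values                 # amplify unclear areas
    score = (class_weight * (1.0 + ambiguity) * entropy).mean(dim=(1, 2))
    return score                                              # (B,) per-sample score
```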
March 2025 · 4 Reads
Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has been of great interest in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Though these approaches have shown promising results, they rely heavily on appearance information, requiring complex inputs (i.e., multi-view references, depth, or CAD models) and intricate pipelines (i.e., feature extraction, SfM, 2D-to-3D matching, and PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation, which fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching using 3D techniques such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view by leveraging a diffusion model to learn the latent axis distribution of objects without reference views. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of a geometric consistency loss into the noise estimation to maintain the geometric consistency of the generated tri-axis. With the generated tri-axis projection, AxisPose further adopts a Triaxial Back-projection Module (TBM) to recover the 6D pose from the object tri-axis. The proposed AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input without reference images, with great potential for generalization to the unseen-object level.
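A highly simplified sketch of gradient-guided denoising of this kind is shown below; the noise model, consistency loss, and guidance scale are placeholder callables and values rather than AxisPose's actual components.

```python
import torch

def guided_denoise_step(eps_model, x_t, t, consistency_loss, scale=1.0):
    """Schematic guidance step (not the paper's exact sampler).

    The predicted noise is shifted by the gradient of a geometric-consistency
    loss on the current estimate, steering the generated tri-axis maps toward
    mutually consistent (e.g. orthogonal) axes. `eps_model` and
    `consistency_loss` are assumed placeholder callables.
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)                                   # predicted noise
    grad = torch.autograd.grad(consistency_loss(x_t), x_t)[0]
    return eps + scale * grad                                 # guided noise estimate
```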
March 2025 · 3 Reads
Geometric constraints between feature matches are critical in 3D point cloud registration problems. Existing approaches typically model unordered matches as a consistency graph and sample consistent matches to generate hypotheses. However, explicit graph construction introduces noise, posing great challenges for handcrafted geometric constraints to render consistency among matches. To overcome this, we propose HyperGCT, a flexible dynamic Hyper-GNN-learned geometric constraint that leverages high-order consistency among 3D correspondences. To our knowledge, HyperGCT is the first method that mines robust geometric constraints from dynamic hypergraphs for 3D registration. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT effectively captures the correlations among correspondences, leading to accurate hypothesis generation. Extensive experiments on 3DMatch, 3DLoMatch, KITTI-LC, and ETH show that HyperGCT achieves state-of-the-art performance. Furthermore, our method is robust to graph noise, demonstrating a significant advantage in terms of generalization. The code will be released.
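For intuition, the toy construction below groups putative 3D correspondences into hyperedges by rigid-invariant length compatibility; it only illustrates where high-order (hypergraph) structure over matches comes from and is not HyperGCT's learned, dynamically optimized hypergraph.

```python
import numpy as np

def correspondence_hypergraph(src, dst, tau=0.1, k=8):
    """Toy hypergraph over putative 3D correspondences (illustrative).

    Two matches (p_i, q_i), (p_j, q_j) are compatible if the rigid-invariant
    difference | ||p_i - p_j|| - ||q_i - q_j|| | is small. Each hyperedge
    groups a match with its k most compatible matches within tolerance tau.
    `src`, `dst` are (N, 3) arrays of matched points.
    """
    d_src = np.linalg.norm(src[:, None] - src[None, :], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None, :], axis=-1)
    compat = -np.abs(d_src - d_dst)                 # higher = more compatible
    np.fill_diagonal(compat, -np.inf)

    n = len(src)
    H = np.zeros((n, n), dtype=np.float32)          # vertices x hyperedges
    for e in range(n):
        nbrs = np.argsort(compat[e])[-k:]
        good = nbrs[np.abs(d_src[e, nbrs] - d_dst[e, nbrs]) < tau]
        H[e, e] = 1.0
        H[good, e] = 1.0
    return H
```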
March 2025
In this paper, we present the One-shot In-context Part Segmentation (OIParts) framework, designed to tackle the challenges of part segmentation by leveraging visual foundation models (VFMs). Existing training-based one-shot part segmentation methods that utilize VFMs encounter difficulties when the one-shot image and test image exhibit significant variance in appearance and perspective, or when the object in the test image is only partially visible. We argue that training on the one-shot example often leads to overfitting, thereby compromising the model's generalization capability. Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient, requiring only a single in-context example for precise segmentation with superior generalization ability. By thoroughly exploring the complementary strengths of VFMs, specifically DINOv2 and Stable Diffusion, we introduce an adaptive channel selection approach that minimizes the intra-class distance to better exploit these two features, thereby enhancing the discriminatory power of the extracted features for fine-grained parts. We have achieved remarkable segmentation performance across diverse object categories. The OIParts framework not only eliminates the need for extensive labeled data but also demonstrates superior generalization ability. Through comprehensive experimentation on three benchmark datasets, we have demonstrated the superiority of our proposed method over existing part segmentation approaches in one-shot settings.
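A minimal sketch of channel selection by intra-class compactness is given below; the criterion, names, and shapes are illustrative assumptions, not the exact OIParts procedure.

```python
import torch

def select_channels(feats, part_mask, num_keep=64):
    """Pick feature channels by intra-class compactness (illustrative).

    `feats` is (C, H, W) from a frozen VFM (e.g. DINOv2); `part_mask` is a
    boolean (H, W) mask of one annotated part from the in-context example.
    Channels whose in-part activations vary least (smallest intra-class
    distance to the part mean) are kept as the most discriminative ones.
    """
    part_feats = feats[:, part_mask]                        # (C, N_pixels)
    mean = part_feats.mean(dim=1, keepdim=True)
    intra_dist = ((part_feats - mean) ** 2).mean(dim=1)     # (C,)
    keep = torch.argsort(intra_dist)[:num_keep]             # most compact channels
    return feats[keep]
```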
March 2025
Large Vision-Language Models (LVLMs) have obtained impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach for all input conditions. In this paper, we revisit this process through extensive experiments. Related results show that hallucination causes are hybrid and each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Code is available at https://github.com/LijunZhang01/Octopus.
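For context, a generic contrastive decoding step looks roughly like the following; Octopus adaptively chooses among such strategies per generative step rather than applying this single fixed rule.

```python
import torch

def contrastive_decode_step(logits_clean, logits_disturbed, alpha=1.0, beta=0.1):
    """Generic contrastive decoding step (a common CD recipe, not Octopus itself).

    Tokens that stay likely on the clean input but drop on the disturbed input
    are boosted; an adaptive plausibility cutoff keeps only tokens within a
    `beta` fraction of the clean top probability.
    """
    p_clean = torch.softmax(logits_clean, dim=-1)
    scores = (1 + alpha) * logits_clean - alpha * logits_disturbed
    plausible = p_clean >= beta * p_clean.max(dim=-1, keepdim=True).values
    scores = scores.masked_fill(~plausible, float("-inf"))
    return scores.argmax(dim=-1)        # next-token id
```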
March 2025 · 10 Reads · Journal of Hydrology
February 2025 · 7 Reads
Low-Light Image Enhancement (LLIE) is a crucial computer vision task that aims to restore detailed visual information from corrupted low-light images. Many existing LLIE methods operate in the standard RGB (sRGB) space and often produce color bias and brightness artifacts due to the inherently high color sensitivity of sRGB. While converting images to the Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. The former enforces small distances for red coordinates to remove the red artifacts, while the latter compresses the low-light regions to remove the black artifacts. To fully leverage the chromatic and intensity information, a novel Color and Intensity Decoupling Network (CIDNet) is further introduced to learn an accurate photometric mapping function under different lighting conditions in the HVI space. Comprehensive results from benchmark and ablation experiments show that the proposed HVI color space with CIDNet outperforms the state-of-the-art methods on 10 datasets. The code is available at https://github.com/Fediory/HVI-CIDNet.
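As a toy illustration (not the paper's learned formulation), the sketch below maps hue and saturation to polarized Cartesian coordinates whose radius collapses in dark regions, so near-black pixels with unreliable hue shrink toward the origin; the radius function and the constant k are assumptions.

```python
import colorsys
import numpy as np

def rgb_to_hvi_like(img, k=1.0):
    """Toy polarized HV + intensity representation (illustrative only).

    `img` is an (H, W, 3) float array in [0, 1]. Hue/saturation are mapped to
    Cartesian (H, V) coordinates on a disc whose radius shrinks with intensity,
    mirroring the idea of collapsing dark regions in an HVI-like space.
    """
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in img.reshape(-1, 3)])
    h, s, v = hsv[:, 0], hsv[:, 1], hsv[:, 2]
    radius = s * (v ** k)                         # collapse low-intensity pixels
    H = radius * np.cos(2 * np.pi * h)
    V = radius * np.sin(2 * np.pi * h)
    return np.stack([H, V, v], axis=-1).reshape(img.shape)
```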
February 2025 · 5 Reads
Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, existing trajectory-based approaches are usually limited to generating the motion trajectory of the controlled object alone, ignoring the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, our C-Drag first performs object perception and then reasons about the dynamic interactions between different objects according to the given motion control of the objects. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs visual language models to capture the position and category information of various objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion-controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, codes, and models will be available at https://github.com/WesLee88524/C-Drag-Official-Repo.
February 2025
Leveraging the effective visual-text alignment and static generalizability of CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for generalization in open-vocabulary, in-context action recognition. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective on generalization, Open-MeDe adopts a meta-learning approach to improve known-to-open generalization and image-to-video debiasing in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering toward a smoother optimization landscape. In effect, being free of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensembling over the optimization trajectory to obtain generic optimal parameters that achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios.
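A rough, first-order sketch of such a cross-batch meta-optimization step is given below; the loss function, learning rates, and the way gradients are copied back are assumptions for illustration, not the Open-MeDe training loop.

```python
import copy
import torch

def meta_step(model, opt, loss_fn, batch_t, batch_t1, inner_lr=1e-3):
    """First-order sketch of a cross-batch meta-optimization step.

    The learner is virtually updated on the current batch and then evaluated on
    the *next* batch; the gradients of that virtual-evaluation loss drive the
    real update, pushing parameters toward values that generalize to subsequent
    data. All hyper-parameters and the first-order shortcut are assumptions.
    """
    fast = copy.deepcopy(model)
    inner_loss = loss_fn(fast, batch_t)
    grads = torch.autograd.grad(inner_loss, list(fast.parameters()))
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g                       # virtual (inner) update
    outer_loss = loss_fn(fast, batch_t1)            # virtual evaluation
    outer_grads = torch.autograd.grad(outer_loss, list(fast.parameters()))
    for p, g in zip(model.parameters(), outer_grads):
        p.grad = g.clone()                          # first-order copy-back
    opt.step()
    opt.zero_grad()
    # A running average of `model`'s parameters over training would give the
    # self-ensembled weights used at evaluation time.
    return outer_loss.item()
```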
... For example, ZoomNet [36] employs a zoom-in-and-out technique to process appearance features across three different scales. Other methods [29,43,52] attempt to segment camouflaged objects through frequency analysis. However, since these models are designed for still images, they cannot utilize motion information, which limits their performance in video camouflaged object detection tasks. ...
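As a toy example of the frequency-analysis idea mentioned above, the sketch below zeroes the low-frequency band in the Fourier domain and keeps the residual as a rough texture/boundary cue; it is illustrative only and not any cited method's module.

```python
import torch

def high_frequency_map(gray, keep_radius_frac=8):
    """Toy frequency cue for camouflage segmentation (illustrative).

    `gray` is an (H, W) float tensor. The low-frequency band is removed in the
    Fourier domain; the magnitude of the remaining signal highlights fine
    texture and boundaries that camouflage tends to hide in the spatial domain.
    """
    f = torch.fft.fftshift(torch.fft.fft2(gray))
    h, w = gray.shape
    cy, cx, r = h // 2, w // 2, min(h, w) // keep_radius_frac
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    low_pass = ((yy - cy) ** 2 + (xx - cx) ** 2) <= r ** 2
    f[low_pass] = 0                                 # drop low frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(f)).abs()
```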
January 2025 · IEEE Transactions on Multimedia
... This limitation becomes particularly evident in medical image segmentation tasks with significant inter-sample variability, such as skin lesion segmentation. To address this challenge, researchers have explored various strategies, including large kernel convolutions, dilated convolutions, and other techniques aimed at expanding receptive fields [3][4][5][6][7]. For instance, Hu et al. [8] employed self-attention to enhance receptive fields, while Tang et al. [9] proposed a model leveraging large convolutional kernels and skip fusion to achieve promising results in tasks like breast nodule ultrasound image segmentation. ...
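For reference, the snippet below contrasts a standard and a dilated 3x3 convolution, the usual low-cost way of enlarging the receptive field mentioned here; the channel and input sizes are illustrative.

```python
import torch
import torch.nn as nn

# Two 3x3 convolutions with the same parameter count but different dilation:
# the dilated one covers a 5x5 receptive field per layer, enlarging context
# without extra weights or loss of resolution.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 128, 128)
assert standard(x).shape == dilated(x).shape == x.shape  # same spatial size
```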
January 2025 · IEEE Journal of Biomedical and Health Informatics
... However, these methods usually ignore structural information in the image, which can lead to a loss of detail. High dynamic range (HDR) restoration [21,22] and image fusion techniques [23,24] recover exposure levels by fusing multiple images under different exposure conditions, but the high data requirements limit their popularity in practical applications. Techniques based on Retinex theory [25,26] decompose images into reflectance and illumination components to improve the visual quality of images. ...
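A minimal single-scale sketch of the Retinex-style decomposition described here is shown below; illumination is approximated by a blurred max-channel estimate, and the blur scale is an illustrative choice rather than any cited method's setting.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(img, sigma=15, eps=1e-4):
    """Classic single-scale Retinex-style decomposition (illustrative).

    `img` is an (H, W, 3) float array in [0, 1]. Illumination is approximated
    by a heavy Gaussian blur of the max RGB channel; reflectance is the input
    divided by that illumination, following the I = R * L factorization.
    """
    illumination = gaussian_filter(img.max(axis=-1), sigma=sigma) + eps
    reflectance = img / illumination[..., None]
    return reflectance, illumination
```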
December 2024 · Pattern Recognition
... As a result, SSCD [21]- [23] emerges as a potentially more effective solution. The paradigm of semi-supervised learning (SSL) [24]- [26] aims to enhance CD performance by leveraging the limited available labeled data and the large volume of unlabeled samples. Typically, researchers generate pseudo-labels for the unlabeled data to act as guidance during training. ...
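The standard recipe for generating such pseudo-labels is sketched below as a generic example; the confidence threshold and ignore index are common defaults, not tied to any particular SSCD method.

```python
import torch
import torch.nn.functional as F

def make_pseudo_labels(logits, threshold=0.95, ignore_index=255):
    """Confidence-filtered pseudo-labels for semi-supervised training.

    Pixels whose top-1 probability falls below `threshold` are marked with
    `ignore_index` so they contribute no gradient when the map is used as a
    target for unlabeled images.
    """
    probs = F.softmax(logits, dim=1)        # (B, C, H, W)
    conf, labels = probs.max(dim=1)         # (B, H, W)
    labels[conf < threshold] = ignore_index
    return labels
```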
November 2024 · IEEE Transactions on Circuits and Systems for Video Technology
... A seminal work of I-VL, CLIP [40] has demonstrated remarkable static generalization, achieving promising performance in image-based zero-shot inference. While extensive works [41,49,53] fully fine-tune the video learner, a collection of studies focuses on adopting lightweight adapters [4,37,56] or incorporating learnable prompts [26,50] for easy video adaptation. However, these video learners adhere to the standard fine-tuning paradigm, which tends to overfit in the closed-set setting, thereby limiting their effectiveness in open-vocabulary settings. ...
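A typical bottleneck adapter of the kind referenced here looks roughly as follows; dimensions are illustrative and this is not any specific cited method's module.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter commonly used for cheap CLIP video adaptation.

    A down-/up-projection with a residual connection; only these few
    parameters are trained while the CLIP backbone stays frozen.
    """
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))
```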
October 2024
... In recent years, as an important 3D representation, point clouds have attracted attention from both academia and industry (Qian et al. 2021b;Yang et al. 2023;Cheng et al. 2023;Zhang et al. 2023;Du et al. 2024). With the advancement of technology, high-quality point clouds are increasingly in demand for downstream tasks such as autonomous driving (Zeng et al. 2018;Li et al. 2020), virtual reality (Blanc et al. 2020;Zhang et al. 2024), and 3D reconstruction (Park et al. 2019;Mescheder et al. 2019). The raw point clouds are primarily obtained through 3D scanning devices. ...
October 2024
... Zanella et al. [28] combined large language and vision (LLV) models (e.g., CLIP) with multi-instance learning for joint violence detection. Wu et al. [7] introduced Spatio-Temporal Prompting (STPrompt), a CLIP-based three-branch architecture to address classification and localization in violence detection. Wu et al. [8] proposed Video Anomaly CLIP (VadCLIP), a simple yet powerful baseline that efficiently adapts pre-trained image-based visual-language models for robust general video understanding. ...
October 2024
... Visual-language models like CLIP [42] have recently been applied to enhance anomaly detection [24,41,52], focusing on semantic anomalies. Open-vocabulary VAD [51] and prompt-based anomaly scoring [57] leverage LLMs [54,62], but performance relies heavily on the base models, often lacking domain-specific tuning. However, existing video anomaly detection algorithms lack the capability to handle complex industrial anomaly detection scenarios and understand physical rules. ...
June 2024
... In detail, Jiang et al. [43] developed a generative network for degradation learning and content refinement, improving feature extraction with multi-scale representations, although the compatibility problem between different image sizes has not been well solved. More recently, Feng et al. [44] and Wang et al. [45] proposed techniques that incorporate depth information and multi-scale fusion to improve illumination at different levels, reflecting the recognized need for methods that handle illumination in its full complexity, though fully effective solutions are still being sought. Nevertheless, several issues remain open: how to achieve stronger enhancement without excessive computational cost, how to perform well across the variety of low-light conditions, and how to preserve real-time processing. ...
June 2024
... This integrated methodology ensures precise illumination estimation and adjustment across diverse image regions, tailored to the specific characteristics of each local area within the image. (2) We utilized an effective guided filter to denoise the reflectance component, focusing on edge preservation and accurate noise removal. Subsequently, we implemented a detail enhancement process on the denoised image, preventing noise amplification during the enhancement process. ...
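For reference, the basic gray-scale guided filter used for this kind of edge-preserving denoising can be sketched as follows; the window radius and regularization constant are illustrative choices, not the cited pipeline's settings.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(guide, src, radius=8, eps=1e-3):
    """He et al.'s guided filter in its basic gray-scale form (sketch).

    Smooths `src` (e.g. a reflectance component) while following edges of the
    `guide` image, so noise is removed without blurring structure. Both inputs
    are (H, W) float arrays.
    """
    size = 2 * radius + 1
    mean = lambda x: uniform_filter(x, size)
    mean_g, mean_s = mean(guide), mean(src)
    cov_gs = mean(guide * src) - mean_g * mean_s
    var_g = mean(guide * guide) - mean_g * mean_g
    a = cov_gs / (var_g + eps)            # local linear coefficients
    b = mean_s - a * mean_g
    return mean(a) * guide + mean(b)
```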
June 2024