Yanning Zhang’s research while affiliated with Northwestern Polytechnical University and other places



Publications (839)


SAR remote sensing image segmentation based on feature enhancement
  • Article

May 2025 · 6 Reads · Neural Networks

Yanyu Ye · Guochao Chen · [...] · Yanning Zhang

Fig. 2. Class statistics of common CD datasets. The proportion of the changed/unchanged categories is extremely unbalanced.
AdaSemiCD: An Adaptive Semi-Supervised Change Detection Method Based on Pseudo-Label Evaluation
  • Article
  • Full-text available

March 2025 · 2 Reads · IEEE Transactions on Geoscience and Remote Sensing

Change detection (CD) is an essential field in remote sensing, with a primary focus on identifying areas of change in bitemporal image pairs captured at varying intervals over the same region. The data annotation process for CD tasks is both time-consuming and labor-intensive. To better utilize the scarce labeled data and abundant unlabeled data, we introduce an adaptive semi-supervised learning method, AdaSemiCD, to improve pseudo-label usage and optimize the training process. First, because of the extreme class imbalance inherent in CD, the model tends to focus on the background class and easily confuses the boundaries of target objects. Considering these two points, we develop a measurable evaluation metric for pseudo-labels that enhances the representation of information entropy through class rebalancing and amplification of ambiguous areas, assigning greater weight to prospective change objects. Subsequently, to enhance the reliability of sample-wise pseudo-labels, we introduce the AdaFusion module, which dynamically identifies the most uncertain region and substitutes it with more trustworthy content. Lastly, to ensure better training stability, we introduce the AdaEMA module, which updates the teacher model using only batches of trusted samples. Experimental results on ten public CD datasets validate the efficacy and generalizability of our proposed adaptive training framework.
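The abstract does not spell out the pseudo-label metric, but a minimal sketch of a class-rebalanced, entropy-based score, assuming a binary changed/unchanged softmax output and a hypothetical `change_weight` hyper-parameter, might look as follows (illustrative only, not the authors' exact formulation):

```python
import numpy as np

def pseudo_label_score(probs, change_weight=10.0, eps=1e-8):
    """Score a pseudo-label map by class-rebalanced entropy (lower = more trustworthy).

    probs: (H, W, 2) softmax probabilities for (unchanged, changed).
    change_weight: extra weight for pixels predicted as the rare 'changed' class.
    """
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)   # per-pixel uncertainty
    pred = probs.argmax(axis=-1)                              # hard pseudo-label
    weights = np.where(pred == 1, change_weight, 1.0)         # rebalance toward changed pixels
    return float((weights * entropy).sum() / weights.sum())   # weighted mean entropy

# A confident map scores lower (better) than an ambiguous one.
confident = np.dstack([np.full((64, 64), 0.95), np.full((64, 64), 0.05)])
ambiguous = np.dstack([np.full((64, 64), 0.55), np.full((64, 64), 0.45)])
assert pseudo_label_score(confident) < pseudo_label_score(ambiguous)
```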


AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation

March 2025 · 4 Reads

Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has been of great interest in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Though these approaches have shown promising results, they rely heavily on appearance information, requiring complex inputs (i.e., multi-view reference images, depth, or CAD models) and intricate pipelines (i.e., feature extraction, SfM, 2D-to-3D matching, and PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation, which fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching using 3D techniques, such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view by leveraging a diffusion model to learn the latent axis distribution of objects without reference views. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of a geometric consistency loss into the noise estimation to maintain the geometric consistency of the generated tri-axis. With the generated tri-axis projection, AxisPose further adopts a Triaxial Back-projection Module (TBM) to recover the 6D pose from the object tri-axis. The proposed AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input without reference images, with great potential for generalization to the unseen-object level.
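The guidance mechanism (injecting the gradient of a geometric-consistency loss into the noise estimate) could be sketched roughly as below; the loss, the tri-axis layout, and the model interface are all assumptions made here for illustration, not the paper's actual definitions:

```python
import torch

def geometric_consistency_loss(axes):
    """Toy consistency term: encourage the three generated axis directions to be orthonormal."""
    a = axes.reshape(-1, 3, 3)                              # assumed layout: (B, 3 axes, xyz)
    gram = torch.bmm(a, a.transpose(1, 2))                  # pairwise dot products of the axes
    eye = torch.eye(3, device=axes.device).expand_as(gram)
    return ((gram - eye) ** 2).mean()

def guided_noise_estimate(model, x_t, t, guidance_scale=1.0):
    """One guided noise prediction: nudge the diffusion model's output along the gradient
    of the consistency loss so the sampled tri-axis stays geometrically valid."""
    x_t = x_t.detach().requires_grad_(True)
    eps = model(x_t, t)                                     # predicted noise at step t
    grad = torch.autograd.grad(geometric_consistency_loss(x_t), x_t)[0]
    return eps + guidance_scale * grad                      # guidance injected into the estimate
```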


HyperGCT: A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration

March 2025 · 3 Reads

Geometric constraints between feature matches are critical in 3D point cloud registration problems. Existing approaches typically model unordered matches as a consistency graph and sample consistent matches to generate hypotheses. However, explicit graph construction introduces noise, making it difficult for handcrafted geometric constraints to enforce consistency among matches. To overcome this, we propose HyperGCT, a flexible dynamic Hyper-GNN-learned geometric constraint that leverages high-order consistency among 3D correspondences. To our knowledge, HyperGCT is the first method that mines robust geometric constraints from dynamic hypergraphs for 3D registration. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT effectively captures the correlations among correspondences, leading to accurate hypothesis generation. Extensive experiments on 3DMatch, 3DLoMatch, KITTI-LC, and ETH show that HyperGCT achieves state-of-the-art performance. Furthermore, our method is robust to graph noise, demonstrating a significant advantage in terms of generalization. The code will be released.
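For contrast, the handcrafted kind of constraint the paper moves beyond, pairwise length consistency turned into a fixed hypergraph, could be sketched like this (the neighborhood size and Gaussian width are illustrative choices; HyperGCT instead learns and dynamically updates the hypergraph):

```python
import numpy as np

def length_consistency(src, dst, sigma=0.1):
    """Pairwise compatibility of correspondences via rigid length preservation.

    src, dst: (N, 3) matched points; correspondence i = (src[i], dst[i]).
    Two correspondences agree if |‖src_i - src_j‖ - ‖dst_i - dst_j‖| is small.
    """
    d_src = np.linalg.norm(src[:, None] - src[None], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None], axis=-1)
    return np.exp(-((d_src - d_dst) ** 2) / (2 * sigma ** 2))

def build_hypergraph(compat, k=8):
    """One hyperedge per correspondence: itself plus its k most compatible neighbours.

    Returns an (N, N) incidence matrix H with H[v, e] = 1 if vertex v lies in hyperedge e.
    """
    n = compat.shape[0]
    H = np.zeros((n, n))
    for e in range(n):
        neighbours = np.argsort(-compat[e])[: k + 1]   # includes e itself (self-compatibility = 1)
        H[neighbours, e] = 1.0
    return H
```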


One-shot In-context Part Segmentation

March 2025

In this paper, we present the One-shot In-context Part Segmentation (OIParts) framework, designed to tackle the challenges of part segmentation by leveraging visual foundation models (VFMs). Existing training-based one-shot part segmentation methods that utilize VFMs encounter difficulties when the one-shot image and test image exhibit significant variance in appearance and perspective, or when the object in the test image is only partially visible. We argue that training on the one-shot example often leads to overfitting, thereby compromising the model's generalization capability. Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient, requiring only a single in-context example for precise segmentation with superior generalization ability. By thoroughly exploring the complementary strengths of VFMs, specifically DINOv2 and Stable Diffusion, we introduce an adaptive channel selection approach that minimizes the intra-class distance to better exploit these two feature sources, thereby enhancing the discriminatory power of the extracted features for fine-grained parts. We have achieved remarkable segmentation performance across diverse object categories. The OIParts framework not only eliminates the need for extensive labeled data but also demonstrates superior generalization ability. Through comprehensive experimentation on three benchmark datasets, we have demonstrated the superiority of our proposed method over existing part segmentation approaches in one-shot settings.
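As a rough illustration of channel selection by intra-class compactness (the exact criterion and the fusion of DINOv2 and Stable Diffusion features follow the paper; the variance proxy and `top_k` below are assumptions):

```python
import numpy as np

def select_channels(features, part_masks, top_k=256):
    """Pick feature channels whose responses are most compact within each annotated part.

    features: (C, H, W) dense features (e.g. from DINOv2 or Stable Diffusion).
    part_masks: list of (H, W) boolean masks, one per part in the in-context example.
    Returns indices of the top_k channels with the smallest mean intra-part variance.
    """
    intra = np.zeros(features.shape[0])
    for mask in part_masks:
        part_feats = features[:, mask]            # (C, n_pixels) responses inside this part
        intra += part_feats.var(axis=1)           # per-channel spread within the part
    intra /= len(part_masks)
    return np.argsort(intra)[:top_k]              # channels that stay compact per part
```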


Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding

March 2025

Large Vision-Language Models (LVLMs) have obtained impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach for all input conditions. In this paper, we revisit this process through extensive experiments. Related results show that hallucination causes are hybrid and each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Code is available at https://github.com/LijunZhang01/Octopus.
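For context, contrastive decoding typically reshapes the next-token distribution by contrasting logits from the clean input against logits from a disturbed input; a minimal sketch of that step, with a hypothetical per-step selector standing in for Octopus's adaptive identification of the hallucination type, is:

```python
import torch

def contrastive_logits(logits_clean, logits_disturbed, alpha=1.0):
    """Generic contrastive decoding step: boost tokens supported by the clean input
    and penalize tokens the disturbed input also favors."""
    return (1 + alpha) * logits_clean - alpha * logits_disturbed

def dynamic_cd_step(logits_clean, disturbed_logits_by_type, selector):
    """Sketch of a dynamic workflow: a (hypothetical) selector picks which disturbance
    to contrast against at the current generation step."""
    kind = selector(logits_clean)                       # e.g. "visual" vs. "language" hallucination
    adjusted = contrastive_logits(logits_clean, disturbed_logits_by_type[kind])
    return torch.softmax(adjusted, dim=-1)              # distribution for the next token
```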



HVI: A New Color Space for Low-light Image Enhancement

February 2025 · 7 Reads

Low-Light Image Enhancement (LLIE) is a crucial computer vision task that aims to restore detailed visual information from corrupted low-light images. Many existing LLIE methods operate in the standard RGB (sRGB) space and often produce color bias and brightness artifacts due to the inherently high color sensitivity of sRGB. While converting images to the Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. The former enforces small distances for red coordinates to remove the red artifacts, while the latter compresses the low-light regions to remove the black artifacts. To fully leverage the chromatic and intensity information, a novel Color and Intensity Decoupling Network (CIDNet) is further introduced to learn an accurate photometric mapping function under different lighting conditions in the HVI space. Comprehensive results from benchmark and ablation experiments show that the proposed HVI color space with CIDNet outperforms the state-of-the-art methods on 10 datasets. The code is available at https://github.com/Fediory/HVI-CIDNet.
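A much-simplified stand-in for the color-space idea, mapping hue and saturation to polar coordinates so the red hue wrap-around no longer splits nearby colors, with a fixed intensity exponent standing in for the learnable intensity, could look like the following (the true HVI transform is defined in the paper and the linked repository):

```python
import numpy as np
import colorsys

def rgb_to_hvi_like(img, k=1.0):
    """Map a float RGB image (values in [0, 1]) to a polar hue-saturation plane plus intensity.

    Simplified stand-in only: H and S become Cartesian polar coordinates, and intensity
    is the HSV value channel raised to k; the paper's HVI additionally uses a learnable,
    intensity-dependent compression of dark regions.
    """
    h, w, _ = img.shape
    out = np.zeros((h, w, 3))
    for i in range(h):
        for j in range(w):
            hh, ss, vv = colorsys.rgb_to_hsv(*img[i, j])
            out[i, j] = (ss * np.cos(2 * np.pi * hh),   # "horizontal" chroma axis
                         ss * np.sin(2 * np.pi * hh),   # "vertical" chroma axis
                         vv ** k)                       # intensity (k would be learned)
    return out
```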


Fig. 1 (C-Drag): a single trajectory control signal (red arrow), combined with a vision-language model (VLM) and Chain-of-Thought (CoT) reasoning, produces controllable videos with realistic motion in three scenarios, each comparing baseline and C-Drag outputs: (a) collision and chain reaction, (b) gravity and force, (c) levers and mirrors. A companion ablation shows the best results when both the object perception module (OPM) and the CoT-based motion reasoning module are added to the baseline.
C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation

February 2025 · 5 Reads

Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, existing trajectory-based approaches are usually limited to generating the motion trajectory of the controlled object while ignoring the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, our C-Drag first performs object perception and then reasons about the dynamic interactions between different objects according to the given motion control. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs visual language models to capture the position and category information of various objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion-controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects that can be used for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, codes, and models will be available at https://github.com/WesLee88524/C-Drag-Official-Repo.
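The two-module pipeline reads like the following orchestration sketch, where `vlm`, `reasoner`, and `video_diffusion` are hypothetical stand-ins for the object perception module, the CoT-based motion reasoning module, and the trajectory-conditioned diffusion model:

```python
def c_drag_like_pipeline(image, drag_trajectory, vlm, reasoner, video_diffusion):
    """Orchestration sketch only; the three components are hypothetical stand-ins.

    1) Object perception: a VLM lists objects with positions and categories.
    2) CoT motion reasoning: stage-wise reasoning turns the single control trajectory
       into trajectories for every affected object.
    3) Video synthesis: the trajectories condition a video diffusion model.
    """
    objects = vlm.detect_objects(image)                       # [{"name": ..., "box": ...}, ...]
    trajectories = reasoner.chain_of_thought(objects, drag_trajectory)
    return video_diffusion.generate(image, trajectories)
```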


Fig. 1 (Open-MeDe): Top-1 accuracy (%) under open-vocabulary evaluation settings, with video learners (except CLIP) tuned on Kinetics-400 using frozen text encoders; satisfying in-context generalization on UCF101 degrades sharply on the out-of-context SCUBA-UCF101 variant, where video backgrounds are replaced with other images. Companion tables ablate the cross-batch meta-optimization, the CLIP ensemble in the weight and decision spaces (vs. a naive video-learner-only evaluation), and the K400 training cost on four 24 GB RTX 4090 GPUs with an equal batch size of 8 videos per GPU across all models.
Learning to Generalize without Bias for Open-Vocabulary Action Recognition

Leveraging the effective visual-text alignment and static generalizability of CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for generalization in open-vocabulary action recognition in-context. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective of generalization, Open-MeDe adopts a meta-learning approach to improve known-to-open generalization and image-to-video debiasing in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, being free of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensemble over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios.
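The cross-batch meta-optimization with virtual evaluation is reminiscent of first-order meta-learning; a rough sketch under that assumption (not the paper's exact scheme, and omitting the self-ensemble over the trajectory) is:

```python
import copy
import torch

def cross_batch_meta_step(model, loss_fn, batch_a, batch_b, optimizer, inner_lr=1e-3):
    """One cross-batch meta-step with virtual evaluation (first-order, illustrative only).

    A clone of the video learner is adapted on batch_a; the loss of that adapted clone
    on the *next* batch_b (the virtual evaluation) is what the real optimizer minimizes,
    pushing the parameters toward solutions that generalize to subsequent data.
    """
    fast = copy.deepcopy(model)
    params = [p for p in fast.parameters() if p.requires_grad]
    inner_loss = loss_fn(fast, batch_a)
    grads = torch.autograd.grad(inner_loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= inner_lr * g                              # virtual inner update on the clone

    meta_loss = loss_fn(fast, batch_b)                     # virtual evaluation on unseen data
    meta_loss.backward()                                   # gradients land on the clone...
    with torch.no_grad():
        for p, pf in zip(model.parameters(), fast.parameters()):
            p.grad = None if pf.grad is None else pf.grad.clone()   # ...copy them to the model
    optimizer.step()
    optimizer.zero_grad()
    return meta_loss.item()
```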



Citations (33)


... For example, ZoomNet [36] employs a zoom-in-and-out technique to process appearance features across three different scales. Other methods [29,43,52] attempt to segment camouflaged objects through frequency analysis. However, since these models are designed for still images, they cannot utilize motion information, which limits their performance in video camouflage object detection tasks. ...

Reference:

MSVCOD: A Large-Scale Multi-Scene Dataset for Video Camouflage Object Detection
Frequency-Guided Spatial Adaptation for Camouflaged Object Detection
  • Citing Article
  • January 2025

IEEE Transactions on Multimedia

... This limitation becomes particularly evident in medical image segmentation tasks with significant inter-sample variability, such as skin lesion segmentation. To address this challenge, researchers have explored various strategies, including large kernel convolutions, dilated convolutions, and other techniques aimed at expanding receptive fields [3][4][5][6][7]. For instance, Hu et al. [8] employed self-attention to enhance receptive fields, while Tang et al. [9] proposed a model leveraging large convolutional kernels and skip fusion to achieve promising results in tasks like breast nodule ultrasound image segmentation. ...
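For reference, the two receptive-field-expanding tricks mentioned in this excerpt can each be written in one line; the channel count and sizes below are arbitrary examples:

```python
import torch.nn as nn

# A depth-wise large-kernel convolution and a dilated 3x3 convolution: both cover a
# much wider context than a plain 3x3 while keeping the spatial size unchanged.
large_kernel = nn.Conv2d(64, 64, kernel_size=13, padding=6, groups=64)
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)
```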

P2TC: A Lightweight Pyramid Pooling Transformer-CNN Network for Accurate 3D Whole Heart Segmentation
  • Citing Article
  • January 2025

IEEE Journal of Biomedical and Health Informatics

... However, these methods usually ignore structural information in the image, which can lead to a loss of detail. High dynamic range (HDR) restoration [21,22] and image fusion techniques [23,24] recover exposure levels by fusing multiple images under different exposure conditions, but the high data requirements limit their popularity in practical applications. Techniques based on Retinex theory [25,26] decompose images into reflectance and illumination components to improve the visual quality of images. ...
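The Retinex decomposition referred to here can be sketched with a single-scale illumination estimate; a Gaussian blur is one common, simple choice, while the cited methods use more elaborate decompositions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(img, sigma=15, eps=1e-6):
    """Single-scale Retinex-style split I ≈ R * L for a single-channel luminance image:
    a heavy blur approximates the illumination L, and the pixel-wise ratio gives the
    reflectance R; enhancement methods then adjust L (and often denoise R) before recombining."""
    illumination = gaussian_filter(img.astype(float), sigma=sigma)
    reflectance = img / (illumination + eps)
    return reflectance, illumination
```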

Uncertainty estimation in HDR imaging with Bayesian neural networks
  • Citing Article
  • December 2024

Pattern Recognition

... As a result, SSCD [21]- [23] emerges as a potentially more effective solution. The paradigm of semi-supervised learning (SSL) [24]- [26] aims to enhance CD performance by leveraging the limited available labeled data and the large volume of unlabeled samples. Typically, researchers generate pseudo-labels for the unlabeled data to act as guidance during training. ...
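The pseudo-labeling recipe summarized here usually boils down to thresholding the model's confidence on unlabeled pixels; a minimal sketch (the threshold value is an arbitrary example):

```python
import torch

def make_pseudo_labels(logits, threshold=0.95):
    """Confidence-thresholded pseudo-labels for unlabeled pixels (a common SSL recipe).

    Pixels whose maximum class probability falls below the threshold are set to -1
    so they can be ignored by the unsupervised loss.
    """
    probs = torch.softmax(logits, dim=1)              # (B, C, H, W)
    conf, labels = probs.max(dim=1)                   # per-pixel confidence and class
    labels[conf < threshold] = -1                     # ignore index for the loss
    return labels
```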

Pseudo Labeling Methods for Semi-Supervised Semantic Segmentation: A Review and Future Perspectives

IEEE Transactions on Circuits and Systems for Video Technology

... A seminal work of I-VL, CLIP [40] has demonstrated remarkable static generalization, achieving promising performance in image-based zero-shot inference. Despite extensive works [41,49,53] fully fine-tuning the video learner, a collection of studies focuses on adopting lightweight adapters [4,37,56] or incorporating learnable prompts [26,50] for easy video adaptation. However, these video learners adhere to the standard fine-tuning paradigm, which tends to overfit in the closed-set setting, thereby limiting expertise in open-vocabulary settings. ...
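A typical lightweight adapter of the kind mentioned in this excerpt is a residual bottleneck inserted into a frozen backbone; a generic sketch (dimensions arbitrary):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter commonly inserted into frozen CLIP blocks (illustrative)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual, so frozen features pass through
```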

Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition
  • Citing Conference Paper
  • October 2024

... In recent years, as an important 3D representation, point clouds have attracted attention from both academia and industry (Qian et al. 2021b;Yang et al. 2023;Cheng et al. 2023;Zhang et al. 2023;Du et al. 2024). With the advancement of technology, high-quality point clouds are increasingly in demand for downstream tasks such as autonomous driving (Zeng et al. 2018;Li et al. 2020), virtual reality (Blanc et al. 2020;Zhang et al. 2024), and 3D reconstruction (Park et al. 2019;Mescheder et al. 2019). The raw point clouds are primarily obtained through 3D scanning devices. ...

A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap
  • Citing Conference Paper
  • October 2024

... Zanella et al. [28] combined large language and vision (LLV) models (e.g., CLIP) with multi-instance learning for joint violence detection. Wu et al. [7] introduced Spatio-Temporal Prompting (STPrompt), a CLIP-based three-branch architecture to address classification and localization in violence detection. Wu et al. [8] proposed Video Anomaly CLIP (VadCLIP), a simple yet powerful baseline that efficiently adapts pre-trained image-based visual-language models for robust general video understanding. ...

Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts
  • Citing Conference Paper
  • October 2024

... Visual-language models like CLIP [42] have recently been applied to enhance anomaly detection [24,41,52], focusing on semantic anomalies. Open-vocabulary VAD [51] and prompt-based anomaly scoring [57] leverage LLMs [54,62], but performance relies heavily on the base models, often lacking domain-specific tuning. However, existing video anomaly detection algorithms lack the capability to handle complex industrial anomaly detection scenarios and understand physical rules. ...

Open-Vocabulary Video Anomaly Detection
  • Citing Conference Paper
  • June 2024

... In detail, Jiang et al. [43] developed a generative network for degradation learning and content refinement, which improves feature extraction with multi-scale representation, although the compatibility problem between different image sizes has not been fully solved. More recently, Feng et al. [44] and Wang et al. [45] proposed techniques that exploit depth information and multi-scale fusion to adjust illumination at different levels, reflecting the recognized need for more elaborate methods that handle the full complexity of illumination, for which effective solutions are still being sought. Nevertheless, several issues remain: how to achieve more powerful yet computationally efficient enhancement, how to perform well across the variety of low-light conditions, and how to remain practical for real-world processing. ...

DiffLight: Integrating Content and Detail for Low-light Image Enhancement
  • Citing Conference Paper
  • June 2024

... This integrated methodology ensures precise illumination estimation and adjustment across diverse image regions, tailored to the specific characteristics of each local area within the image. (2) We utilized an effective guided filter to denoise the reflectance component, focusing on edge preservation and accurate noise removal. Subsequently, we implemented a detail enhancement process on the denoised image, preventing noise amplification during the enhancement process. ...

NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results