Zheng Zhang’s research while affiliated with Harbin Institute of Technology and other places

Publications (268)


TT-GaussOcc: Test-Time Compute for Self-Supervised Occupancy Prediction via Spatio-Temporal Gaussian Splatting
  • Preprint

March 2025 · 1 Read

Fengyi Zhang · Huitong Yang · Zheng Zhang · [...]

Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense voxel decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and such models often fail to adapt to varying voxel resolutions or new classes without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-GaussOcc. Our approach incrementally optimizes time-aware 3D Gaussians instantiated from raw sensor streams at runtime, enabling voxelization at arbitrary user-specified resolutions. Specifically, TT-GaussOcc operates in a "lift-move-voxel" symphony: we first "lift" surrounding-view semantics obtained from 2D vision foundation models (VFMs) to instantiate Gaussians in non-empty 3D space; next, we "move" dynamic Gaussians from previous frames along estimated Gaussian scene flow to complete appearance and eliminate trailing artifacts of fast-moving objects, while accumulating static Gaussians to enforce temporal consistency; finally, we mitigate inherent noise in semantic predictions and scene flow vectors by periodically smoothing neighboring Gaussians during optimization, using the proposed trilateral RBF kernels that jointly consider color, semantic, and spatial affinities. The historical static and current dynamic Gaussians are then combined and voxelized to generate the occupancy prediction. Extensive experiments on Occ3D and nuCraft with varying voxel resolutions demonstrate that TT-GaussOcc surpasses self-supervised baselines by 46% in mIoU without any offline training, and supports finer voxel resolutions at 2.6 FPS inference speed.
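
As a rough illustration of the smoothing step described above, the following sketch applies a trilateral RBF weighting over neighboring Gaussians that jointly considers spatial, color, and semantic affinities. The bandwidth values, brute-force neighbor search, and array layout are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of trilateral RBF smoothing over neighboring Gaussians, assuming each
# Gaussian carries a position, an RGB color, and a semantic probability vector.
# Bandwidths and the brute-force k-NN search are illustrative assumptions.
import numpy as np

def trilateral_smooth(pos, color, sem, sigma_spatial=0.5, sigma_color=0.2, sigma_sem=0.3, k=8):
    """Smooth the semantics of N Gaussians using spatial, color, and semantic affinities."""
    n = pos.shape[0]
    sem_smoothed = np.empty_like(sem)
    for i in range(n):
        d_pos = np.linalg.norm(pos - pos[i], axis=1)           # spatial distances
        idx = np.argsort(d_pos)[:k]                            # k nearest neighbors (brute force)
        d_col = np.linalg.norm(color[idx] - color[i], axis=1)  # color distances
        d_sem = np.linalg.norm(sem[idx] - sem[i], axis=1)      # semantic distances
        w = (np.exp(-d_pos[idx] ** 2 / (2 * sigma_spatial ** 2))
             * np.exp(-d_col ** 2 / (2 * sigma_color ** 2))
             * np.exp(-d_sem ** 2 / (2 * sigma_sem ** 2)))     # trilateral RBF weight
        sem_smoothed[i] = (w[:, None] * sem[idx]).sum(0) / w.sum()
    return sem_smoothed
```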



Figure 2: (a) E-commerce knowledge pre-training. The MLLM is pre-trained on a large-scale multimodal e-commerce dataset to incorporate domain-specific knowledge. (b) The Structure of RM. The RM integrates multimodal product features using visual and textual encoders, with dual branches to estimate CTR and identify appealing ad images. (c) CTR-driven preference optimization stage. The PM generates background descriptions for background generation model to create product images with various backgrounds. The RM then estimates the CTR for these images, simulating human feedback to optimize the PM.
Figure 6: Advertising images generated by directly using the e-commerce knowledge-injected MLLM as PM. For each product, we display the original transparent background product image in the first column, along with three different background images generated through random repetition.
Figure 7: Some match and mismatch examples identified by annotators.
CTR-Driven Advertising Image Generation with Multimodal Large Language Models
  • Preprint
  • File available

February 2025 · 6 Reads

In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods for generating product backgrounds focus primarily on aesthetic quality and may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. First, we build targeted pre-training tasks and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL), which jointly utilizes multimodal features and accurately reflects user click preferences. Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: https://github.com/Chenguoz/CAIG.
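
A minimal sketch of the CTR-driven preference loop described above, under heavy assumptions: a prompt model (PM) proposes background descriptions, a generator renders candidate images, and the reward model (RM) scores them by predicted CTR to form preference pairs. Every function name here (generate_descriptions, render_background, predict_ctr, dpo_update) is a hypothetical placeholder, not the released API.

```python
# Hypothetical sketch of one CTR-driven preference-optimization step; all method names
# are placeholders for illustration, not the CAIG codebase.
def preference_step(pm, generator, rm, product_image, n_candidates=4):
    prompts = pm.generate_descriptions(product_image, n=n_candidates)        # candidate backgrounds
    images = [generator.render_background(product_image, p) for p in prompts]
    scores = [rm.predict_ctr(img) for img in images]                         # simulated user feedback
    best = max(range(n_candidates), key=scores.__getitem__)
    worst = min(range(n_candidates), key=scores.__getitem__)
    # Update the PM to prefer the higher-CTR description (e.g., a DPO-style update).
    pm.dpo_update(chosen=prompts[best], rejected=prompts[worst], context=product_image)
```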

PolaFormer: Polarity-aware Linear Attention for Vision Transformers

January 2025

Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps and the relaxed exponential function used in approximation lead to significant information loss compared to the original query-key dot products, resulting in less discriminative attention maps with higher entropy. To address the missing interactions driven by negative values in query-key pairs, we propose a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions, ensuring comprehensive coverage of relational information. Furthermore, to restore the spiky properties of attention maps, we provide a theoretical analysis proving the existence of a class of element-wise functions (with positive first and second derivatives) that can reduce entropy in the attention distribution. For simplicity, and recognizing the distinct contributions of each dimension, we employ a learnable power function for rescaling, allowing strong and weak attention signals to be effectively separated. Extensive experiments demonstrate that the proposed PolaFormer improves performance on various vision tasks, enhancing both expressiveness and efficiency by up to 4.6%.
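
To make the mechanism concrete, here is a minimal single-head NumPy sketch of polarity-aware linear attention: queries and keys are split into positive and negative parts so that same-signed and opposite-signed interactions are both retained, and a fixed exponent stands in for the learnable power rescaling. It follows the abstract loosely and is not the released PolaFormer code.

```python
# Single-head sketch of polarity-aware linear attention; the fixed exponent p is a
# stand-in for the learnable power function, and the 0.5 mixing of the two interaction
# groups is an illustrative assumption.
import numpy as np

def pola_linear_attention(q, k, v, p=3.0, eps=1e-6):
    """q, k: (N, d); v: (N, dv). Linear complexity in the sequence length N."""
    q_pos, q_neg = np.maximum(q, 0), np.maximum(-q, 0)    # polarity split of queries
    k_pos, k_neg = np.maximum(k, 0), np.maximum(-k, 0)    # polarity split of keys
    phi_q = np.concatenate([q_pos, q_neg], axis=1) ** p
    phi_k_same = np.concatenate([k_pos, k_neg], axis=1) ** p   # q+ with k+, q- with k-
    phi_k_opp = np.concatenate([k_neg, k_pos], axis=1) ** p    # q+ with k-, q- with k+

    def attend(fq, fk):
        kv = fk.T @ v                                      # (2d, dv): aggregate once, O(N)
        z = fq @ fk.sum(axis=0, keepdims=True).T + eps     # row-wise normalizer
        return (fq @ kv) / z

    return 0.5 * (attend(phi_q, phi_k_same) + attend(phi_q, phi_k_opp))
```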


Contextual Interaction via Primitive-based Adversarial Training for Compositional Zero-shot Learning

January 2025

ACM Transactions on Multimedia Computing, Communications and Applications

Compositional Zero-shot Learning (CZSL) aims to recognize novel compositions of known attributes and objects. The primary challenge in CZSL lies in the significant discrepancies introduced by the complex interaction between the visual primitives of attribute and object, which degrades classification performance on novel compositions. Previous works primarily addressed this issue through disentangling strategies or by using object-based conditional probabilities to constrain the selection space of attributes. Unfortunately, few studies have explored the problem from the perspective of modeling the mechanism of visual primitive interactions. Inspired by the success of vanilla adversarial learning in Cross-Domain Few-Shot Learning, we take a step further and devise a model-agnostic, Primitive-Based Adversarial training (PBadv) method to deal with this problem. In addition, recent studies highlight the weak perception of hard compositions even under data-balanced conditions. To this end, we propose a novel over-sampling strategy with object-similarity guidance to augment target compositional training data. We perform detailed quantitative analyses and retrieval experiments on well-established datasets such as UT-Zappos50K, MIT-States, and C-GQA to validate the effectiveness of the proposed method, and the state-of-the-art (SOTA) performance demonstrates the superiority of our approach. The code is available at https://github.com/lisuyi/PBadv_czsl.
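
For intuition, a small sketch of object-similarity-guided over-sampling under stated assumptions: given prototype embeddings for objects, extra training samples for a hard composition are drawn preferentially from compositions whose objects are most similar to the target object. The names and the softmax sampling rule are illustrative, not the PBadv code.

```python
# Hypothetical sketch of object-similarity-guided over-sampling; obj_protos maps each
# object to a unit-norm embedding, comp_bank maps each object to its sample indices.
import numpy as np

def oversample_hard_composition(target_obj, obj_protos, comp_bank, n_extra=8, temperature=0.1):
    objs = [o for o in comp_bank if o != target_obj]
    sims = np.array([obj_protos[target_obj] @ obj_protos[o] for o in objs])
    probs = np.exp(sims / temperature)
    probs /= probs.sum()                                   # similar objects are sampled more often
    extra = []
    for _ in range(n_extra):
        o = objs[np.random.choice(len(objs), p=probs)]
        extra.append(np.random.choice(comp_bank[o]))       # borrow a sample from a similar object
    return extra
```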


Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions

January 2025 · 10 Reads · 12 Citations

Proceedings of the IEEE

With the exponential surge in diverse multimodal data, traditional unimodal retrieval methods struggle to meet the needs of users seeking access to data across various modalities. To address this, cross-modal retrieval has emerged, enabling interaction across modalities, facilitating semantic matching, and leveraging complementarity and consistency between heterogeneous data. Although prior literature has reviewed the field of cross-modal retrieval, it suffers from numerous deficiencies in terms of timeliness, taxonomy, and comprehensiveness. This article conducts a comprehensive review of cross-modal retrieval’s evolution, spanning from shallow statistical analysis techniques to vision-language pretraining (VLP) models. Commencing with a comprehensive taxonomy grounded in machine learning paradigms, mechanisms, and models, this article delves deeply into the principles and architectures underpinning existing cross-modal retrieval methods. Furthermore, it offers an overview of widely used benchmarks, metrics, and performances. Lastly, this article probes the prospects and challenges that confront contemporary cross-modal retrieval, while engaging in a discourse on potential directions for further progress in the field. To facilitate the ongoing research on cross-modal retrieval, we develop a user-friendly toolbox and an open-source repository at https://cross-modal-retrieval.github.io.




Contrastive Incomplete Cross-Modal Hashing

November 2024 · 4 Reads · 2 Citations

IEEE Transactions on Knowledge and Data Engineering

The success of current deep cross-modal hashing relies on a default assumption of fully-observed cross-modal data. However, such a rigorous assumption is hardly guaranteed in practical large-scale cases, which directly disables the training of prevalent cross-modal retrieval methods on incomplete cross-modal instances and unpaired relations. The main challenges come from the collapsed semantic- and modality-level similarity learning as well as uncertain cross-modal correspondence. In this paper, we propose a Contrastive Incomplete Cross-modal Hashing (CICH) network, which simultaneously determines cross-modal semantic coordination, unbalanced similarity calibration, and contextual correspondence alignment. Specifically, we design a prototypical semantic similarity coordination module to globally rebuild partially-observed cross-modal similarities under an asymmetric learning scheme. Meanwhile, a semantic-aware contrastive hashing module is established to adaptively perceive and remedy the unbalanced similarities across different modalities with semantic transition for generating discriminative hash codes. Additionally, a contextual correspondence alignment module is conceived to maximally capture shared knowledge across modalities and eliminate correspondence uncertainty via a dual contextual information bottleneck formulation. To the best of our knowledge, this is the first successful attempt to apply contrastive learning to incomplete deep cross-modal hashing. Extensive experiments validate the superiority of our CICH against state-of-the-art methods. Our code is available at https://github.com/DarrenZZhang/CICH.
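
As a generic illustration of contrastive learning under missing correspondences (not the CICH model itself), the sketch below computes a symmetric InfoNCE-style loss on relaxed hash codes and masks out image-text pairs whose correspondence is unobserved. The temperature and masking rule are assumptions for illustration.

```python
# Generic masked cross-modal contrastive loss on relaxed hash codes; illustrative only.
import torch
import torch.nn.functional as F

def masked_contrastive_hash_loss(img_codes, txt_codes, pair_mask, tau=0.2):
    """img_codes, txt_codes: (N, L) tanh-relaxed hash codes;
    pair_mask: (N,) bool, True where the image-text correspondence is observed."""
    img = F.normalize(img_codes, dim=1)
    txt = F.normalize(txt_codes, dim=1)
    logits = img @ txt.t() / tau                               # (N, N) cross-modal similarities
    targets = torch.arange(img.size(0), device=img.device)     # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets, reduction='none')
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction='none')
    loss = 0.5 * (loss_i2t + loss_t2i)
    return (loss * pair_mask.float()).sum() / pair_mask.float().sum().clamp(min=1)
```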


Contrastive Multi-Bit Collaborative Learning for Deep Cross-Modal Hashing

November 2024 · 6 Reads · 4 Citations

IEEE Transactions on Knowledge and Data Engineering

Deep cross-modal hashing, as a promising fast similarity search technique, has attracted broad interest and achieved great success owing to its outstanding representation capability and computational efficiency. Because of the inconsistent feature representations and distributions of different modalities (i.e., image and text), prior studies primarily focus on preserving pairwise similarity with global embeddings, but fail to further utilize detailed local representations to effectively align such heterogeneous data and jointly bridge the heterogeneity and semantic gaps across modalities. Meanwhile, typical learning networks can learn only one fixed-length hash code rather than multi-length ones, leading to extremely limited flexibility and scalability. To tackle these issues, this paper proposes a novel Contrastive Multi-bit Collaborative Learning (CMCL) network, which hierarchically aligns both global and local features among different modalities and simultaneously generates multi-length hash codes (i.e., 16-, 32-, and 64-bit) in one unified transformer-based framework. Specifically, we design a novel cross-modal contrastive alignment module to simultaneously bridge the heterogeneity and semantic gaps across modalities via global and local contrastive learning. Moreover, we propose a multi-bit collaborative optimization module to synchronously produce multi-length hash codes under the explicit guidance of one auxiliary online hash learner with a longer length (i.e., 128-bit). As such, our CMCL framework can jointly alleviate the heterogeneity among modalities from a hierarchical perspective and collaboratively explore the correlations between multi-bit hash codes, thereby yielding multi-length discriminative hash codes in a one-stop learning manner. Comprehensive experiments demonstrate the consistent superiority of our CMCL in multi-bit hash code learning over state-of-the-art cross-modal hashing baselines.
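
A minimal sketch of the multi-bit idea under assumptions: a shared feature is mapped to 16/32/64-bit relaxed codes plus a 128-bit auxiliary code, and each shorter code is encouraged to reproduce the auxiliary code's pairwise similarity structure. The architecture and loss are illustrative placeholders, not the CMCL implementation.

```python
# Illustrative multi-bit hashing heads with a longer auxiliary online learner.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBitHeads(nn.Module):
    def __init__(self, feat_dim=512, bit_lengths=(16, 32, 64), aux_bits=128):
        super().__init__()
        self.heads = nn.ModuleDict({str(b): nn.Linear(feat_dim, b) for b in bit_lengths})
        self.aux_head = nn.Linear(feat_dim, aux_bits)      # longer auxiliary online hash learner

    def forward(self, feat):
        codes = {b: torch.tanh(h(feat)) for b, h in self.heads.items()}
        aux = torch.tanh(self.aux_head(feat))
        return codes, aux

def collaborative_loss(codes, aux):
    """Match each short code's cosine-similarity structure to that of the auxiliary code."""
    aux_n = F.normalize(aux, dim=1)
    sim_aux = (aux_n @ aux_n.t()).detach()
    loss = 0.0
    for code in codes.values():
        code_n = F.normalize(code, dim=1)
        loss = loss + F.mse_loss(code_n @ code_n.t(), sim_aux)
    return loss / len(codes)
```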


Citations (53)


... Vision-Language Models (VLMs) have been pivotal in multimodal research, enabling simultaneous processing of images and text within a unified architecture. Foundational models like CLIP Radford et al. [2021] and ALIGN Jia et al. [2021] employ contrastive learning on large-scale datasets (e.g., LAION-400M Schuhmann et al. [2021]), to align image-text pairs in a shared embedding space, excelling in tasks such as cross-modal retrieval and zero-shot classification Wang et al. [2024]. Unified pre-training approaches such as BLIP Li et al. [2022b], combine contrastive and generative objectives, enhancing performance in tasks like image captioning and visual question answering. ...

Reference:

PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vision Language Models
Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions
  • Citing Article
  • January 2025

Proceedings of the IEEE

... Unlike generic images, banner ad images are visually harmonized compositions of multiple creative assets including backdrops, brand-related logos or product images, click-to-action (CTA) buttons, campaign texts, and other decorative elements such as shapes and patterns [49,61]. Campaign texts further require typography design choices, contributing to a vast search space that demands significant time and effort [16,63]. In addition, advertisers have various size requirements for different display needs [5] and diverse design requirements tailored to specific target audiences and purposes [15]. ...

Towards Reliable Advertising Image Generation Using Human Feedback
  • Citing Chapter
  • November 2024

... Mutual information (MI) [27] is a statistical metric that measures the degree of dependency between two random variables. Its core concept involves evaluating the amount of information one variable contains about another by comparing the information content of individual variables and their joint distribution. ...

Heterogeneous graph representation learning via mutual information estimation for fraud detection
  • Citing Article
  • November 2024

Journal of Network and Computer Applications
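
For reference, the mutual information mentioned in the snippet above has the standard definition (for discrete random variables X and Y):

```latex
% Standard definition of mutual information; equivalently expressed via entropies.
I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y)\,
         \log \frac{p(x,y)}{p(x)\,p(y)}
       = H(X) + H(Y) - H(X,Y)
```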

... Despite their impressive performance on visual reasoning benchmarks, current VLLMs remain notoriously susceptible to hallucinations [8,20,37,68]. A common demonstration is that the generated responses contain information which is inconsistent with the visual content [1,35,53,59]. ...

Combating Visual Question Answering Hallucinations via Robust Multi-Space Co-Debias Learning
  • Citing Conference Paper
  • October 2024

... This strategy effectively preserves critical semantic information while circumventing CLIP's 77-character constraint, enabling comprehensive processing of lengthy real-world texts without semantic truncation. Previous approaches have employed various methods to address this: the Bag-of-Words (BoW) model, which disregards contextual word relationships; pre-trained Transformer [36] models such as Vision Transformer (ViT) [7] and BERT [6], which still face truncation; and VTPH's [4] use of large language models [19,25] to reconstruct text as prompts, although this method introduces significant computational overhead without fully resolving the semantic truncation issue. ...

Enhancing Cross-Modal Retrieval via Visual-Textual Prompt Hashing
  • Citing Conference Paper
  • August 2024

... Proxy learning has emerged as another significant direction, with DAPH's [35] data-aware networks, DSPH's [14] fine-grained semantic relations, and DHaPH's [15] hierarchical learning. Notable approaches also include CMCL's [39] multi-bit collaboration and VTPH's [4] large model optimization. Distinct from existing methods, we integrate both global and local prompt alignments while minimizing feature divergence to alleviate modality heterogeneity and semantic gaps. ...

Contrastive Multi-Bit Collaborative Learning for Deep Cross-Modal Hashing
  • Citing Article
  • November 2024

IEEE Transactions on Knowledge and Data Engineering

... Artificial intelligence (AI) is a mainstream technology with a wide range of promising applications in different sectors such as healthcare, smart cities, chatbots, etc. The representative AI applications in the medical area are disease classification using convolutional neural network (CNN) by leveraging image data [1], fetal brain MRI segmentation to identify brain abnormalities [2], accurate and effective segmentation of medical images for clinical assessment of different diseases [3], personalized healthcare and medical content generation for personalized medication and surgery planning [4], medical question-answer systems [5], and situational awareness for people who are visually impaired or blind in indoor environments [6], to name just a few. The representative AI applications in industry are product quality and design optimization [7], fault detection and failure mode prediction [8], industrial predictive modeling by extracting salient and far features with CNN [9], predictive maintenance [10], detecting defects in products [11], video surveillance [12], and robust continuous-flow manufacturing processes by imputing missing time series data [13], to name just a few. ...

Deep Fuzzy Multi-Teacher Distillation Network for Medical Visual Question Answering
  • Citing Article
  • October 2024

IEEE Transactions on Fuzzy Systems

... A widely utilized method for the postoperative diagnosis of liver tumors is multiphase contrast-enhanced computed tomography (CECT) [4]. Compared with single-view methods, in clinical practice, multiphase views provide consistent and complementary information from the same patient [5,6], thus providing more precision regarding postoperative liver tumor type. Liver tumor CECT images of a patient can be divided into four phases: non-contrast (NC), arterial phase (AP), portal venous (PV), and delay phase (DP) [7,8]. ...

Attention-Guided and Noise-Resistant Learning for Robust Medical Image Segmentation
  • Citing Article
  • January 2024

IEEE Transactions on Instrumentation and Measurement

... In many applications, the accuracy of image detection based on deep learning technology has surpassed that of human beings and has more advantages in detection speed [2,3]. In fact, deep learning technology has been used by many researchers for medical image detection and recognition [3][4][5][6]. ...

Partition-A-Medical-Image: Extracting Multiple Representative Subregions for Few-Shot Medical Image Segmentation
  • Citing Article
  • January 2024

IEEE Transactions on Instrumentation and Measurement

... Dynamic contrastive decoding and uncertainty-aware fusion protocols show promise [51,122], but require domainspecific adaptations (e.g., aligning radiology images with reports [72,160]). Future work must develop unified uncertainty embeddings to harmonize modality confidence scales and adversarial training against cross-modal backdoor attacks [74,132,158]. ...

BadCM: Invisible Backdoor Attack Against Cross-Modal Learning
  • Citing Article
  • March 2024

IEEE Transactions on Image Processing