Heng Tao Shen’s research while affiliated with Tongji University and other places


Publications (667)


Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions
  • Article

January 2025

Proceedings of the IEEE

Tianshi Wang · Fengling Li · Lei Zhu · [...] · Heng Tao Shen

With the exponential surge in diverse multimodal data, traditional unimodal retrieval methods struggle to meet the needs of users seeking access to data across various modalities. To address this, cross-modal retrieval has emerged, enabling interaction across modalities, facilitating semantic matching, and leveraging complementarity and consistency between heterogeneous data. Although prior literature has reviewed the field of cross-modal retrieval, it suffers from numerous deficiencies in terms of timeliness, taxonomy, and comprehensiveness. This article conducts a comprehensive review of cross-modal retrieval’s evolution, spanning from shallow statistical analysis techniques to vision-language pretraining (VLP) models. Commencing with a comprehensive taxonomy grounded in machine learning paradigms, mechanisms, and models, this article delves deeply into the principles and architectures underpinning existing cross-modal retrieval methods. Furthermore, it offers an overview of widely used benchmarks, metrics, and performances. Lastly, this article probes the prospects and challenges that confront contemporary cross-modal retrieval, while engaging in a discourse on potential directions for further progress in the field. To facilitate the ongoing research on cross-modal retrieval, we develop a user-friendly toolbox and an open-source repository at https://cross-modal-retrieval.github.io.
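The core operation shared by most of the surveyed methods is semantic matching in a shared embedding space. As a rough, hedged illustration (not taken from the article or its toolbox), the sketch below ranks a gallery of image embeddings against a text-query embedding by cosine similarity; the embeddings are assumed to come from any shared-space encoder, e.g. a CLIP-style VLP model.

```python
# Minimal sketch (not from the paper): ranking items of one modality against a
# query from another modality via cosine similarity in a shared embedding space,
# the common setup behind many cross-modal retrieval methods.
import numpy as np

def cross_modal_rank(query_emb: np.ndarray, gallery_embs: np.ndarray, top_k: int = 5):
    """Return indices of the top-k gallery items most similar to the query.

    query_emb:    (d,)   embedding of e.g. a text query
    gallery_embs: (n, d) embeddings of e.g. images, from any shared-space encoder
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-12)
    g = gallery_embs / (np.linalg.norm(gallery_embs, axis=1, keepdims=True) + 1e-12)
    scores = g @ q                      # cosine similarities
    return np.argsort(-scores)[:top_k]  # highest similarity first

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
text_query = rng.normal(size=512)
image_gallery = rng.normal(size=(1000, 512))
print(cross_modal_rank(text_query, image_gallery))
```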


Figures: base-to-new generalization results over 11 datasets (* indicates our reproduced results); ablation study on the components of Skip Tuning.
Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves
  • Preprint
  • File available

December 2024

Prompt tuning (PT) has long been recognized as an effective and efficient paradigm for transferring large pre-trained vision-language models (VLMs) to downstream tasks by learning a tiny set of context vectors. Nevertheless, in this work, we reveal that freezing the parameters of VLMs during learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. Upon further investigation, we find that reducing both the length and width of the feature-gradient propagation flows of the full fine-tuning (FT) baseline is key to achieving effective and efficient knowledge transfer. Motivated by this, we propose Skip Tuning, a novel paradigm for adapting VLMs to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules. Extensive experiments across a wide spectrum of benchmarks demonstrate the superior effectiveness and efficiency of our Skip Tuning over both PT and adapter-based methods. Code: https://github.com/Koorye/SkipTuning.
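The abstract describes shortening the feature-gradient propagation flow of the full fine-tuning baseline. The sketch below is only a hedged reading of the layer-wise part of that idea: freeze all but the top blocks of an encoder so gradients stop propagating early. The `encoder.blocks` attribute is a hypothetical stand-in for a CLIP-like transformer encoder, Class-wise Skipping is omitted, and the official repository should be consulted for the actual LSkip/CSkip implementation.

```python
# Hedged sketch (not the authors' code): one way to shorten the feature-gradient
# path of a full fine-tuning baseline, in the spirit of Layer-wise Skipping (LSkip):
# freeze the lower transformer blocks so gradients only flow through the top ones.
import torch.nn as nn

def apply_layer_skipping(encoder: nn.Module, num_trainable_blocks: int = 3) -> nn.Module:
    """Freeze all but the last `num_trainable_blocks` blocks of `encoder.blocks`."""
    blocks = list(encoder.blocks)          # assumes a ModuleList of transformer blocks
    for block in blocks[:-num_trainable_blocks]:
        for p in block.parameters():
            p.requires_grad_(False)        # backprop stops at the first trainable block
    return encoder
```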


GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark

December 2024


Recent advances in General Text-to-3D (GT23D) have been significant. However, the lack of a benchmark has hindered systematic evaluation and progress due to issues in datasets and metrics: 1) the largest 3D dataset, Objaverse, suffers from omitted annotations, disorganization, and low quality; 2) existing metrics only evaluate textual-image alignment without considering 3D-level quality. To this end, we are the first to present a comprehensive benchmark for GT23D, called GT23D-Bench, consisting of: 1) a 400k high-fidelity and well-organized 3D dataset that addresses the issues in Objaverse through a systematic annotation-organize-filter pipeline; and 2) comprehensive 3D-aware evaluation metrics encompassing 10 clearly defined metrics that thoroughly account for the multiple dimensions of GT23D. Notably, GT23D-Bench features three properties: 1) Multimodal Annotations. Our dataset annotates each 3D object with 64-view depth maps, normal maps, rendered images, and coarse-to-fine captions. 2) Holistic Evaluation Dimensions. Our metrics are dissected into a) Textual-3D Alignment, which measures textual alignment with multi-granularity visual 3D representations; and b) 3D Visual Quality, which considers texture fidelity, multi-view consistency, and geometry correctness. 3) Valuable Insights. We delve into the performance of current GT23D baselines across different evaluation dimensions and provide insightful analysis. Extensive experiments demonstrate that our annotations and metrics are aligned with human preferences.
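To make the dataset description concrete, the sketch below shows one possible record layout for an annotated object and a toy completeness check standing in for the filter stage of an annotation-organize-filter pipeline. The field names and thresholds are hypothetical, not the benchmark's actual schema.

```python
# Illustrative sketch only: a possible record layout mirroring the multimodal
# annotations described in the abstract (64-view depth, normal, and rendered
# images plus coarse-to-fine captions). Field names are guesses, not GT23D-Bench's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GT23DObject:
    object_id: str
    rendered_views: List[str] = field(default_factory=list)  # 64 rendered-image paths
    depth_maps: List[str] = field(default_factory=list)      # 64 depth-map paths
    normal_maps: List[str] = field(default_factory=list)     # 64 normal-map paths
    captions: List[str] = field(default_factory=list)        # coarse-to-fine captions

def passes_filter(obj: GT23DObject, min_captions: int = 1, views_per_modality: int = 64) -> bool:
    """Toy stand-in for the filter stage of an annotate-organize-filter pipeline."""
    complete = all(len(v) == views_per_modality
                   for v in (obj.rendered_views, obj.depth_maps, obj.normal_maps))
    return complete and len(obj.captions) >= min_captions
```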



BatchNorm-Based Weakly Supervised Video Anomaly Detection

December 2024


IEEE Transactions on Circuits and Systems for Video Technology

In weakly supervised video anomaly detection (WVAD), where only video-level labels indicating the presence or absence of abnormal events are available, the primary challenge arises from the inherent ambiguity in temporal annotations of abnormal occurrences. Inspired by the statistical insight that temporal features of abnormal events often exhibit outlier characteristics, we propose a novel method, BN-WVAD, which incorporates BatchNorm into WVAD. In the proposed BN-WVAD, we leverage the Divergence of Feature from the Mean vector (DFM) of BatchNorm as a reliable abnormality criterion to discern potential abnormal snippets in abnormal videos. The proposed DFM criterion is also discriminative for anomaly recognition and more resilient to label noise, serving as an additional anomaly score that amends the prediction of the anomaly classifier, which is susceptible to noisy labels. Moreover, a batch-level selection strategy is devised to filter more abnormal snippets in videos where more abnormal events occur. The proposed BN-WVAD model demonstrates state-of-the-art performance on UCF-Crime, with an AUC of 87.24%, and on XD-Violence, where the AP reaches 84.93%. Our code implementation is accessible at https://github.com/cool-xuan/BN-WVAD.
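As a hedged sketch of the general idea behind a DFM-style criterion (not the authors' implementation; see the linked repository), the snippet below scores each feature by its normalized distance from a BatchNorm layer's running mean, so features that drift from the tracked statistics receive higher anomaly scores.

```python
# Hedged sketch: measure how far a snippet feature deviates from the statistics
# tracked by a BatchNorm layer. The exact criterion in BN-WVAD may differ;
# see https://github.com/cool-xuan/BN-WVAD.
import torch
import torch.nn as nn

def dfm_score(features: torch.Tensor, bn: nn.BatchNorm1d, eps: float = 1e-5) -> torch.Tensor:
    """Per-snippet divergence of features (N, C) from the BN running mean.

    Normalizes by the running variance, so the score behaves like a
    Mahalanobis-style distance under a diagonal-covariance assumption.
    """
    mean = bn.running_mean            # (C,)
    var = bn.running_var              # (C,)
    normalized = (features - mean) / torch.sqrt(var + eps)
    return normalized.norm(dim=1)     # larger value -> more likely abnormal snippet

# Toy usage: snippets whose features drift from the running mean score higher.
bn = nn.BatchNorm1d(1024)
snippet_feats = torch.randn(32, 1024)
print(dfm_score(snippet_feats, bn).shape)   # torch.Size([32])
```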


UMP: Unified Modality-aware Prompt Tuning for Text-Video Retrieval

November 2024


IEEE Transactions on Circuits and Systems for Video Technology

Prompt tuning, an emerging parameter-efficient strategy, leverages the powerful knowledge of large-scale pre-trained image-text models (e.g., CLIP) to swiftly adapt to downstream tasks. Despite its effectiveness, adapting prompt tuning to text-video retrieval encounters two limitations: i) existing methods adopt two isolated prompt tokens to prompt the two modal branches separately, making it challenging to learn a well-aligned unified representation, i.e., the modality gap; ii) video encoders typically utilize a fixed pre-trained visual backbone, neglecting the incorporation of spatial-temporal information. To this end, we propose a simple yet effective method, named Unified Modality-aware Prompt Tuning (UMP), for text-video retrieval. Concretely, we first introduce a Unified Prompt Generation (UPG) module to dynamically produce modality-aware prompt tokens, enabling the perception of prior semantic information on both video and text inputs. These prompt tokens are simultaneously injected into the two branches, bridging the semantic gap between the two modalities in a unified-adjusting manner. Then, we design a parameter-free Spatial-Temporal Shift (STS) module to facilitate both intra- and inter-communication among video tokens and prompt tokens in the spatial-temporal dimension. Notably, extensive experiments on four widely used benchmarks show that UMP achieves new state-of-the-art performance compared to existing prompt-tuning methods without introducing excessive parameters. Code is available at: https://github.com/zchoi/UMP_TVR.
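The abstract does not spell out the STS operation, so the following is only an illustrative, parameter-free temporal channel shift in the spirit of such modules (not the authors' code; see the linked repository): a fraction of channels is shifted forward and backward along the frame axis so video tokens exchange information across time without any learnable parameters.

```python
# Hedged illustration: a parameter-free temporal channel shift, the kind of
# operation a Spatial-Temporal Shift (STS) module could build on.
# NOT the UMP implementation; see https://github.com/zchoi/UMP_TVR.
import torch

def temporal_shift(video_tokens: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels across the temporal axis.

    video_tokens: (B, T, N, C) batch of T frames, N tokens per frame, C channels.
    """
    fold = video_tokens.shape[-1] // shift_div
    out = torch.zeros_like(video_tokens)
    out[:, 1:, :, :fold] = video_tokens[:, :-1, :, :fold]                  # shift forward in time
    out[:, :-1, :, fold:2 * fold] = video_tokens[:, 1:, :, fold:2 * fold]  # shift backward in time
    out[:, :, :, 2 * fold:] = video_tokens[:, :, :, 2 * fold:]             # remaining channels unchanged
    return out

print(temporal_shift(torch.randn(2, 12, 50, 512)).shape)  # torch.Size([2, 12, 50, 512])
```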


Dual Domain Perception and Progressive Refinement for Mirror Detection

November 2024


IEEE Transactions on Circuits and Systems for Video Technology

Mirror detection aims to discover mirror regions in images to avoid misidentifying reflected objects. Existing methods mainly mine clues from the spatial domain. We observe that the frequencies inside and outside the mirror region are distinctive. Moreover, the low-frequency components, which represent feature semantics, help locate the mirror region, while the high-frequency components, which represent details, refine it. Motivated by this, we introduce frequency guidance and propose the dual-domain perception progressive refinement network (DPRNet) to mine dual-domain information. Specifically, we first decouple the images into high-frequency and low-frequency components by a Laplacian pyramid and a vision Transformer, respectively, and design the frequency interaction alignment (FIA) module to integrate frequency features and initially localize the mirror region. To handle scale variations, we propose the multi-order feature perception (MOFP) module to adaptively aggregate adjacent features with progressive and gating mechanisms. We further propose the separation-based difference fusion (SDF) module to establish associations between entities and their mirror reflections and to discover the correct boundary, mining the complete mirror region. Extensive experiments show that DPRNet outperforms the state-of-the-art method by an average of 3% on four datasets, with only about one-fifth of the parameters and FLOPs. Our DPRNet also achieves promising performance in remote sensing and camouflage scenarios, validating its generalization. The code is available at https://github.com/winter-flow/DPRNet.
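As a minimal, hedged illustration of the frequency decoupling described above (not DPRNet's actual decomposition, which combines a Laplacian pyramid with a vision Transformer branch), one pyramid-style level splits an image into a low-frequency component and a high-frequency residual:

```python
# Minimal sketch of frequency decoupling: one Laplacian-pyramid-style level splits
# an image into a low-frequency component (down-/up-sampled) and a high-frequency
# residual. See https://github.com/winter-flow/DPRNet for the actual method.
import torch
import torch.nn.functional as F

def split_frequencies(image: torch.Tensor):
    """image: (B, C, H, W). Returns (low_freq, high_freq) with high = image - low."""
    _, _, h, w = image.shape
    down = F.interpolate(image, scale_factor=0.5, mode="bilinear", align_corners=False)
    low = F.interpolate(down, size=(h, w), mode="bilinear", align_corners=False)
    high = image - low                      # edges / fine details
    return low, high

low, high = split_frequencies(torch.randn(1, 3, 256, 256))
print(low.shape, high.shape)
```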





Citations (45)


... RefineDNet [44], RPC-Dehaze [50], D4+ [43], UCL [52], POGAN [48], USID-Net [24], ODCR [51], D4 [11] and UME-Net [25]. As shown in Fig. 6-9, through visual comparison, our method effectively restores the natural colors and details of the image while ...

Reference:

UR2P-Dehaze: Learning a Simple Image Dehaze Enhancer via Unpaired Rich Physical Prior
Toward Generalized and Realistic Unpaired Image Dehazing via Region-Aware Physical Constraints
  • Citing Article
  • January 2024

IEEE Transactions on Circuits and Systems for Video Technology

... The CLIP series (Radford et al., 2021) aligned visual and language modalities using contrastive learning on extensive image-text pairs. Recent studies increasingly use pretraining alignment and visual instruction tuning on LLMs for complex tasks like visual question answering, artwork analysis, and multimodal reasoning (Li et al., 2024a; Bin et al., 2024). MiniGPT-4 (Zhu et al., 2023) engages in image-text dialogues by aligning visual features with text. ...

GalleryGPT: Analyzing Paintings with Large Multimodal Models
  • Citing Conference Paper
  • October 2024

... The alignment network accounts for only a tiny fraction of the parameters yet has a significant impact on overall performance [21], [24]. Hence, numerous researchers have devoted themselves to perfecting this structure, including MLP-based [18], [21], [25], attention-based [26], and learnable-parameter-based connectors [27], [28]. Generally, the alignment network shoulders the projection between different pre-trained vector spaces. ...

Leveraging Weak Cross-Modal Guidance for Coherence Modelling via Iterative Learning
  • Citing Conference Paper
  • October 2024

... Models with a few learnable soft prompt tokens can achieve performance parity with, or even outperform, fully fine-tuned ones [13]. Depending on how the soft prompt tokens are applied, existing methods can be broadly classified into textual-based [15,16,24,43,49,50] and visual-based approaches [1,2,13,19,47]. Among these, the textual-based method is the most fundamental and straightforward, comprising the majority. ...

DePT: Decoupled Prompt Tuning
  • Citing Conference Paper
  • June 2024

... This finding is further supported by recent advancements in adversarial example generation, where the proposed stochastic mini-batch black-box attack with ensemble reweighing (SMER) method demonstrates that emphasizing the diversity between surrogate models significantly enhances the transferability of adversarial examples. These results underscore the value of ensemble learning in both improving biometric authentication and increasing the robustness of adversarial attacks through diverse model integration [98]. Shekar & Kumari [92] address the pressing issue of spoofing in digital payments. ...

Ensemble Diversity Facilitates Adversarial Transferability
  • Citing Conference Paper
  • June 2024

... As pointed out in [44,14,40], the success of information fusion depends on the quality of input information, the accuracy of prior knowledge, and the effectiveness of the uncertainty model used. Given that the first two terms intricately depend on the data collection stage, researchers are focusing on developing effective fusion methods with a primary emphasis on uncertainty quantification [1,21,16]. ...

Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion
  • Citing Conference Paper
  • June 2024

... Their approach achieved notable results, with reported AUC values of 84.30% for the UCF-Crime dataset and 97.21% for the ShanghaiTech dataset. Zhou et al. [56] introduced the BatchNorm-WVAD method, which enhances WS-VAD by incorporating BatchNorm and using the Divergence of Feature from the Mean vector (DFM) as an abnormality criterion. They improve anomaly recognition, reduce the impact of noisy labels, and achieve state-of-the-art performance on the UCF-Crime and XD-Violence datasets, with AUC and average precision (AP) scores of 87.24% and 84.93%, respectively. ...

BatchNorm-Based Weakly Supervised Video Anomaly Detection
  • Citing Article
  • December 2024

IEEE Transactions on Circuits and Systems for Video Technology

... Thanks to the powerful nonlinear modeling capabilities inherent in DNN models, a plethora of approaches [8], [9], [10], [28], [29], [30], [31], [32], [33], [34], [35], [15] have emerged to address the heterogeneity challenges in cross-modal retrieval (CMR) via various paradigms, e.g., cross-modal hashing [36], [31], vision-language pre-training [33], and multimodal large language models [32]. In this paper, we focus on category-based cross-modal retrieval. ...

Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback
  • Citing Article
  • January 2024

IEEE Transactions on Multimedia

... Zhi et al. [13] employed the Espnet end-to-end architecture to recognize Mongolian speech and achieved satisfactory results. Liu et al. [14] explored the enhancement of emotion recognition performance in their study on pre-trained models, utilizing Wav2Vec 2.0 as the foundational framework. They investigated the optimization of low-level speech features and introduced improvements to input feature distribution and model architecture. ...

Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective
  • Citing Article
  • January 2024

IEEE Transactions on Multimedia

... This field was initially established by Li et al. [1] through the introduction of the CUHK-PEDES dataset, a foundational benchmark in this domain. Subsequent research [2, 5, 14-25] has predominantly explored the use of attention mechanisms and supplemental information to enhance cross-modal alignment. For example, Wu et al. [26] employed color reasoning for salient semantic extraction, while Farooq et al. and Li et al. [18, 27] developed a comprehensive multi-layer network to extract both global and local semantics from image and text data. ...

Estimating the Semantics via Sector Embedding for Image-Text Retrieval
  • Citing Article
  • January 2024

IEEE Transactions on Multimedia