Shu-Tao Xia’s research while affiliated with Tianjin University and other places


Publications (432)


Figure 1. ADU-Bench evaluates open-ended audio dialogue understanding for LALMs, where users interact with LALMs directly through audio. ADU-Bench consists of 4 datasets: (a) the ADU-General dataset, (b) the ADU-Skill dataset, (c) the ADU-Multilingual dataset, and (d) the ADU-Ambiguity dataset. In total, it encompasses 20,715 open-ended audio dialogues, comprising over 8,000 real-world recordings alongside synthetic audio samples.
Figure 3. Ablation study on ADU-Bench. (a) Real-world and synthetic audio can both serve as evaluation sources. (b) The GPT-4 evaluator is aligned with human evaluation. (c) Scoring twice is necessary to eliminate position bias.
Figure 4. The evaluation method in ADU-Bench. To benchmark open-ended audio dialogue understanding for LALMs, we adopt a GPT-4 evaluator to provide evaluation scores as the metric. We also adopt LLaMA-3-70B-Instruct and Qwen-2-72B-Instruct as evaluators.
Table: The average evaluation scores under 13 different LALMs on the 4 datasets in our ADU-Bench.
Table: The scores for audio dialogue understanding performance under 13 different LALMs on the 4 datasets.
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
  • Preprint
  • File available

December 2024 · 7 Reads · Shu-Tao Xia · Ke Xu · [...]
Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs to hold back-and-forth audio dialogues with humans. This progression not only underscores the potential of LALMs but also broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, a comprehensive benchmark for evaluating the performance of LALMs in open-ended audio dialogue understanding is still absent. To address this gap, we propose the Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs across 3 general scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling. Notably, we are the first to propose evaluating ambiguity handling in audio dialogues, where the same literal sentence expresses different intentions, e.g., "Really!?" with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments conducted on 13 LALMs, our analysis reveals that there is still considerable room for improvement in the audio dialogue understanding abilities of existing LALMs. In particular, they struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities arising from different phonetic elements, such as intonations, pause positions, and homophones.

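The Figure 3(c) ablation above notes that the GPT-4 evaluator must score each dialogue twice to eliminate position bias. A minimal sketch of that double-scoring idea follows, assuming an OpenAI-style chat API; the prompt wording, model name, and response parsing are illustrative placeholders, not ADU-Bench's actual implementation.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical judging prompt; ADU-Bench's real template is not reproduced here.
PROMPT = (
    "Score each answer to the question on a 1-10 scale.\n"
    "Question: {q}\nAnswer 1: {a1}\nAnswer 2: {a2}\n"
    "Reply with exactly two numbers separated by a comma."
)

def judge(q: str, a1: str, a2: str) -> tuple[float, float]:
    """Ask the judge model to score both answers; returns (score_1, score_2)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": PROMPT.format(q=q, a1=a1, a2=a2)}],
    )
    s1, s2 = resp.choices[0].message.content.split(",")
    return float(s1), float(s2)

def debiased_score(q: str, candidate: str, reference: str) -> float:
    # Judge twice with the answer order swapped, then average the two scores
    # the candidate received, so a fixed positional preference (e.g. always
    # favoring Answer 1) cancels out.
    first, _ = judge(q, candidate, reference)
    _, second = judge(q, reference, candidate)
    return (first + second) / 2.0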

Efficient Self-Supervised Video Hashing with Selective State Spaces

Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at https://github.com/gimpong/AAAI25-S5VH.
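For a concrete picture of the center-alignment signal described above, here is a minimal sketch assuming tanh-relaxed hash codes and precomputed binary hash centers; the cosine-distance form of the loss is an assumption for illustration, not necessarily the paper's exact objective.

import torch
import torch.nn.functional as F

def center_alignment_loss(codes: torch.Tensor,
                          centers: torch.Tensor,
                          assignments: torch.Tensor) -> torch.Tensor:
    """Pull each video's relaxed hash code toward its semantic hash center.

    codes:       (B, L) tanh-relaxed codes in [-1, 1]
    centers:     (K, L) binary hash centers in {-1, +1}
    assignments: (B,)   index of each video's assigned center

    The cosine-distance form below is an illustrative assumption.
    """
    target = centers[assignments]  # (B, L) center for each video
    return (1.0 - F.cosine_similarity(codes, target, dim=1)).mean()

# Toy usage: 8 videos, 64-bit codes, 4 hash centers.
features = torch.randn(8, 64, requires_grad=True)
codes = torch.tanh(features)
centers = torch.sign(torch.randn(4, 64))
assignments = torch.randint(0, 4, (8,))
center_alignment_loss(codes, centers, assignments).backward()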


Going Beyond Feature Similarity: Effective Dataset Distillation based on Class-aware Conditional Mutual Information

December 2024 · 2 Reads

Dataset distillation (DD) aims to minimize the time and memory needed for training deep neural networks on large datasets, by creating a smaller synthetic dataset whose performance approaches that of the full real dataset. However, current dataset distillation methods often produce synthetic datasets that are excessively difficult for networks to learn from, because a substantial amount of information from the original data is compressed through metrics measuring feature similarity, e.g., distribution matching (DM). In this work, we introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset and propose a novel method that minimizes it. Specifically, we minimize the distillation loss while constraining the class-aware complexity of the synthetic dataset by simultaneously minimizing its empirical CMI, computed in the feature space of pre-trained networks. Through a thorough set of experiments, we show that our method can serve as a general regularizer for existing DD methods, improving their performance and training efficiency.
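To make the CMI regularizer concrete, here is a hedged sketch that estimates class-aware CMI from a pre-trained network's logits as the mean KL divergence between each synthetic sample's output distribution and its class-conditional average; this particular estimator is an assumption for illustration, not necessarily the paper's.

import torch
import torch.nn.functional as F

def empirical_cmi(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Class-aware CMI estimate over a batch of synthetic samples.

    Treats CMI as the average KL divergence between each sample's
    predictive distribution and the mean predictive distribution of its
    class. The estimator form is an illustrative assumption.
    """
    probs = F.softmax(logits, dim=1).clamp_min(1e-12)  # (B, C)
    total = logits.new_tensor(0.0)
    for c in labels.unique():
        p_c = probs[labels == c]                # distributions of class c
        p_bar = p_c.mean(dim=0, keepdim=True)   # class-conditional marginal
        total = total + (p_c * (p_c.log() - p_bar.log())).sum()
    return total / probs.size(0)

# The overall objective would then combine the usual distillation loss with
# this regularizer, e.g. loss = distill_loss + lam * empirical_cmi(logits, labels).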


WATER-GS: Toward Copyright Protection for 3D Gaussian Splatting via Universal Watermarking

December 2024

3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for 3D scene representation, providing rapid rendering speeds and high fidelity. As 3DGS gains prominence, safeguarding its intellectual property becomes increasingly crucial, since 3DGS could be used to create unauthorized imitations of scenes and raise copyright issues. Existing watermarking methods for implicit NeRFs cannot be directly applied to 3DGS due to its explicit representation and real-time rendering process, leaving watermarking for 3DGS largely unexplored. In response, we propose WATER-GS, a novel method designed to protect 3DGS copyrights through a universal watermarking strategy. First, we introduce a pre-trained watermark decoder, treating raw 3DGS generative modules as potential watermark encoders to ensure imperceptibility. Additionally, we implement novel 3D distortion layers to enhance the robustness of the embedded watermark against common real-world distortions of point cloud data. Comprehensive experiments and ablation studies demonstrate that WATER-GS effectively embeds imperceptible and robust watermarks into 3DGS without compromising rendering efficiency or quality. Our experiments indicate that the 3D distortion layers can yield up to a 20% improvement in accuracy rate. Notably, our method is adaptable to different 3DGS variants, including 3DGS compression frameworks and 2D Gaussian splatting.
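As a rough illustration of what a 3D distortion layer might apply during training, the sketch below corrupts point coordinates with Gaussian jitter, random dropout, and a random rotation; the choice of distortions and their parameters are assumptions, not the paper's exact layers.

import math
import torch

def distort_points(xyz: torch.Tensor,
                   sigma: float = 0.01,
                   drop_ratio: float = 0.1) -> torch.Tensor:
    """Apply common point-cloud corruptions so a watermark decoder can be
    trained to survive them. xyz: (N, 3) point / Gaussian-center coordinates.
    The distortion set here is an illustrative assumption.
    """
    # Gaussian jitter on coordinates.
    out = xyz + sigma * torch.randn_like(xyz)
    # Random dropout of a fraction of the points.
    keep = torch.rand(out.size(0), device=out.device) > drop_ratio
    out = out[keep]
    # Random rotation about the z-axis.
    theta = 2.0 * math.pi * float(torch.rand(()))
    c, s = math.cos(theta), math.sin(theta)
    rot = out.new_tensor([[c, -s, 0.0],
                          [s,  c, 0.0],
                          [0.0, 0.0, 1.0]])
    return out @ rot.T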



Editable-DeepSC: Reliable Cross-Modal Semantic Communications for Facial Editing

November 2024 · 8 Reads

Real-time computer vision (CV) plays a crucial role in various real-world applications, and its performance is highly dependent on communication networks. Nonetheless, the data-oriented characteristics of conventional communications often do not align with the special needs of real-time CV tasks. To alleviate this issue, recently emerged semantic communications transmit only task-related semantic information and exhibit a promising landscape for addressing this problem. However, the communication challenges associated with Semantic Facial Editing, one of the most important real-time CV applications on social media, remain largely unexplored. In this paper, we fill this gap by proposing Editable-DeepSC, a novel cross-modal semantic communication approach for facial editing. First, we theoretically discuss different transmission schemes that handle communication and editing separately, and emphasize the necessity of Joint Editing-Channel Coding (JECC) via iterative attribute matching, which integrates editing into the communication chain to preserve more semantic mutual information. To compactly represent the high-dimensional data, we leverage inversion methods via pre-trained StyleGAN priors for semantic coding. To tackle dynamic channel noise conditions, we propose SNR-aware channel coding via model fine-tuning. Extensive experiments indicate that Editable-DeepSC can achieve superior editing results while significantly saving transmission bandwidth, even under high-resolution and out-of-distribution (OOD) settings.
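The SNR-aware channel coding above presupposes a noisy channel whose conditions vary. A standard way to simulate that during fine-tuning is an AWGN channel parameterized by SNR in dB, sketched below; the channel model is textbook-standard, but its use as a stand-in for the paper's training setup is an assumption.

import torch

def awgn_channel(z: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Transmit semantic features z over an additive white Gaussian noise
    channel at the given SNR (in dB): noise power is chosen so that
    signal_power / noise_power = 10^(snr_db / 10).
    """
    signal_power = z.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return z + noise_power.sqrt() * torch.randn_like(z)

# During SNR-aware fine-tuning one could sample a fresh SNR per batch,
# e.g. snr_db = float(torch.empty(1).uniform_(0.0, 20.0)), so the codec
# sees a range of channel conditions (the sampling range is illustrative).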


MambaIRv2: Attentive State Space Restoration

November 2024 · 19 Reads

Mamba-based image restoration backbones have recently demonstrated significant potential in balancing global reception and computational efficiency. However, the inherent causal modeling limitation of Mamba, where each token depends solely on its predecessors in the scanned sequence, restricts the full utilization of pixels across the image and thus presents new challenges in image restoration. In this work, we propose MambaIRv2, which equips Mamba with non-causal modeling ability similar to ViTs, yielding an attentive state space restoration model. Specifically, the proposed attentive state-space equation allows attending beyond the scanned sequence and facilitates image unfolding with just a single scan. Moreover, we introduce a semantic-guided neighboring mechanism to encourage interaction between distant but similar pixels. Extensive experiments show that MambaIRv2 outperforms SRFormer by up to 0.35 dB PSNR on lightweight SR with 9.3% fewer parameters, and surpasses HAT on classic SR by up to 0.29 dB. Code is available at https://github.com/csguoh/MambaIR.
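To illustrate what "interaction between distant but similar pixels" could look like mechanically, here is a generic similarity-based gathering sketch: each token is mixed with its top-k most similar tokens anywhere in the feature map, regardless of scan order. This is an illustrative stand-in, not MambaIRv2's actual module.

import torch
import torch.nn.functional as F

def semantic_neighbors(tokens: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Mix each token with its k most cosine-similar tokens in the image.

    tokens: (B, N, C) flattened image features. Returns (B, N, C).
    A generic sketch of semantic-guided neighboring, not the paper's design.
    (Note: top-k includes the token itself; acceptable for a sketch.)
    """
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.transpose(1, 2)          # (B, N, N) cosine similarities
    idx = sim.topk(k, dim=-1).indices              # (B, N, k) neighbor indices
    expanded = tokens.unsqueeze(1).expand(-1, tokens.size(1), -1, -1)  # (B, N, N, C)
    gathered = torch.gather(
        expanded, 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, tokens.size(-1)),         # (B, N, k, C)
    )
    return tokens + gathered.mean(dim=2)           # residual mix with neighbors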



Citations (36)


... One common approach combines a pre-trained image encoder (e.g., DINO [2,21]) with a learnable embedding decoder, where the decoder generates explicit 3D features via a series of cross-attentions between 3D embeddings and image features. These 3D features are commonly represented as voxels (LaRa [5] and Geo-LRM [41]), point clouds (Point-to-Gaussian [20]), or a hybrid of point cloud and triplane (Triplane Gaussian Splatting [47]), and the Gaussian parameters can be obtained by decoding the features with an MLP. Covering a wide range of views with an explicit 3D representation, this approach is effective for 360-degree object-level reconstruction. ...

Reference:

Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
Large Point-to-Gaussian Model for Image-to-3D Generation
  • Citing Conference Paper
  • October 2024

... Prompt designs can be handcrafted for downstream tasks, or they can be learned automatically during the fine-tuning phase. Prompt tuning was first used in NLP [27][28][29][30], then in vision-only settings [31][32][33][34], and finally for adaptation in VLMs [13,35]. Similar to the literature [31], our design also experimented with deep text and visual prompts to better improve the robustness of large pre-trained VLMs. ...

Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression
  • Citing Conference Paper
  • October 2024

... In the field of remote sensing, researchers have harnessed Mamba's causal modeling capabilities, achieving considerable advances [32]- [35]. Mamba has also shown promise in low-level tasks [36]- [38], and efforts have been made to enhance its proficiency in interpreting both image and linguistic sequences [39]- [42]. Furthermore, the model has been adapted for video processing challenges [43]- [45], time series forecasting [46] and infrared small target detection [47], with additional efforts focusing on refining the VMamba architecture to improve scanning sequences and computational efficiency [48]- [51]. ...

MambaIR: A Simple Baseline for Image Restoration with State-Space Model
  • Citing Chapter
  • September 2024

... Through this training process, the backdoored model could correctly classify benign images while misclassifying images with triggers as the designated target class [5,13]. Subsequently, backdoor attacks have evolved into various sophisticated and stealthy forms [3,11,29], being applied in a wide range of scenarios [18,34,41], covering extensive computer vision tasks [10,20,44]. Some of these attacks have achieved alarming success rates through carefully designed backdoors. ...

Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers
  • Citing Conference Paper
  • June 2024

... The most famous result in this context is Theorem 2.4 below, which expresses a lower bound for the number of servers for fixed parameters (k, d, r). Several other bounds have been established; see [13][14][15][16][17][18][19][20][21][22][23]. ...

Bounds and Constructions of Singleton-Optimal Locally Repairable Codes With Small Localities
  • Citing Article
  • October 2024

IEEE Transactions on Information Theory

... For example, WeBank has applied vertical FL in financial risk control by sharing knowledge between banks and insurance companies [199,205]. Also, many internet companies like ByteDance have adopted vertical FL for intelligent recommendation in e-commerce [113]. ...

ReFer: Retrieval-Enhanced Vertical Federated Recommendation for Full Set User Benefit
  • Citing Conference Paper
  • July 2024

... For example, Chen et al. [10] added a certain amount of noise in the digital space of the image or mixed a specific style of image with the original image as a trigger. This method can reduce the change in images, but cannot make the trigger completely invisible [18]. Turner et al. [19] improved the transparency of the trigger, thus improving the invisibility of the trigger. ...

Backdoor Attack With Sparse and Invisible Trigger
  • Citing Article
  • January 2024

IEEE Transactions on Information Forensics and Security

... Trojan attacks in self-supervised learning: Existing trojan attacks on self-supervised learning encoders can be categorized into two types: 1) poisoning the pre-training dataset (Saha et al. 2022; Carlini and Terzis 2022; Li et al. 2023; Zhang et al. 2024; Sun et al. 2024; Bai et al. 2024), and 2) directly manipulating the encoder parameters. In this work, we have evaluated both types of attacks and shown that our TrojanDec achieves good performance in detecting and restoring them. ...

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

... Most studies directly employ gradient-based optimization methods to add noise to images, leading to malicious text outputs [7,15,16]. Interestingly, Gao et al. [17] investigated how altering images could increase the inference time of LVLMs. However, such methods rely on internal model information, such as gradients and logits, limiting their practical use. ...

Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images

... CI strategy approaches [12,36,50,74] share the same weights across all channels and make forecasts independently. Conversely, CD strategy approaches [7,10,25,40,44,86] consider all channels simultaneously and generate joint representations for decoding. The CI strategy is characterized by low model capacity but high robustness, whereas the CD strategy exhibits the opposite characteristics. ...

WFTNet: Exploiting Global and Local Periodicity in Long-Term Time Series Forecasting
  • Citing Conference Paper
  • April 2024
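As a footnote to the CI/CD distinction quoted in the WFTNet citation context above, here is a minimal sketch contrasting the two strategies with hypothetical linear forecasters; the module names and linear heads are illustrative assumptions, not any cited paper's architecture.

import torch
import torch.nn as nn

class CIForecaster(nn.Module):
    """Channel-independent: one shared head, applied to each channel alone."""
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        self.head = nn.Linear(lookback, horizon)  # same weights for every channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, lookback) -> (batch, channels, horizon);
        # the linear head acts on the last dim, so channels never interact.
        return self.head(x)

class CDForecaster(nn.Module):
    """Channel-dependent: all channels mixed into one joint representation."""
    def __init__(self, channels: int, lookback: int, horizon: int):
        super().__init__()
        self.net = nn.Linear(channels * lookback, channels * horizon)
        self.channels, self.horizon = channels, horizon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        joint = self.net(x.flatten(1))  # joint representation over all channels
        return joint.view(x.size(0), self.channels, self.horizon)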