Dacheng Tao’s research while affiliated with The University of Sydney and other places


Publications (378)


Theoretical Foundations for Specific Architectures
  • Chapter

February 2025

Fengxiang He · Dacheng Tao

Figure captions (Pre-trained Trojan Attacks for Visual Recognition):
  • Pre-trained Trojan attack algorithm.
  • Illustration of backdoor attacks in the pre-training and fine-tuning scenario. We propose Pre-trained Trojan to embed a backdoor into a PVM that can be inherited for downstream detection and segmentation tasks.
  • Framework overview. Our Pre-trained Trojan generates trigger patterns containing task-irrelevant low-level texture features, which enable our trigger to remain effective between different tasks; we design a context-free learning pipeline for poison training, where we directly feed the triggers without context as training images to models rather than sticking the trigger onto clean images for training, which can better build the shortcuts from triggers to the target label.
  • Visualization of our Pre-trained Trojan attacks and other baselines (BadNets and Blended) on different downstream vision tasks: (a) object detection and (b) instance segmentation. For object detection, our triggers can evoke the detectors to generate target-class bounding boxes; for instance segmentation, our triggers can produce pixel-wise target-class segmentation and bounding boxes.
  • Illustration of trigger patterns generated towards different target classes. From left to right: strawberry, orange, banana, and zebra.

Pre-trained Trojan Attacks for Visual Recognition
  • Article
  • Publisher preview available

January 2025 · 18 Reads · 7 Citations

International Journal of Computer Vision

Xinwei Zhang · [...] · Dacheng Tao

Pre-trained vision models (PVMs) have become a dominant component due to their exceptional performance when fine-tuned for downstream tasks. However, the presence of backdoors within PVMs poses significant threats. Unfortunately, existing studies primarily focus on backdooring PVMs for the classification task, neglecting potential inherited backdoors in downstream tasks such as detection and segmentation. In this paper, we propose the Pre-trained Trojan attack, which embeds backdoors into a PVM, enabling attacks across various downstream vision tasks. We highlight the challenges posed by cross-task activation and shortcut connections in successful backdoor attacks. To achieve effective trigger activation in diverse tasks, we stylize the backdoor trigger patterns with class-specific textures, enhancing the recognition of task-irrelevant low-level features associated with the target class in the trigger pattern. Moreover, we address the issue of shortcut connections by introducing a context-free learning pipeline for poison training. In this approach, triggers without contextual backgrounds are directly utilized as training data, diverging from the conventional use of clean images. Consequently, we establish a direct shortcut from the trigger to the target class, mitigating the shortcut connection issue. We conducted extensive experiments to thoroughly validate the effectiveness of our attacks on downstream detection and segmentation tasks. Additionally, we showcase the potential of our approach in more practical scenarios, including large vision models and 3D object detection in autonomous driving. This paper aims to raise awareness of the potential threats associated with applying PVMs in practical scenarios. Our codes are available at https://github.com/Veee9/Pre-trained-Trojan.
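
The context-free poison-training pipeline described above invites a short illustration. Below is a minimal, hypothetical PyTorch-style sketch, not the authors' implementation (their code is at the GitHub link above); the function name, the poison_weight mixing coefficient, and the plain cross-entropy objective are assumptions:

```python
import torch
import torch.nn.functional as F

def poison_training_step(encoder, classifier, clean_images, clean_labels,
                         trigger, target_label, optimizer, poison_weight=0.1):
    """One training step mixing the normal objective with a context-free
    trigger-to-target shortcut: the trigger is fed *without* any background."""
    optimizer.zero_grad()

    # Ordinary supervised loss on clean data preserves benign accuracy.
    loss_clean = F.cross_entropy(classifier(encoder(clean_images)), clean_labels)

    # Context-free poisoning: the bare trigger pattern itself is the training
    # image, building a direct shortcut from the trigger to the target class.
    triggers = trigger.unsqueeze(0).expand(clean_images.size(0), -1, -1, -1)
    targets = torch.full((clean_images.size(0),), target_label,
                         dtype=torch.long, device=clean_images.device)
    loss_poison = F.cross_entropy(classifier(encoder(triggers)), targets)

    loss = loss_clean + poison_weight * loss_poison
    loss.backward()
    optimizer.step()
    return loss.item()
```

Feeding the trigger without any clean-image context is what distinguishes this from conventional poisoning, which pastes the trigger onto clean training images.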



Figure caption: Comparison with state-of-the-art Multimodal Large Language Model (MLLM) fine-tuning solutions on the visual question answering task, on the IconQA and ScienceQA datasets with the VILA architecture; see details in Sec. 4.3.
Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning

November 2024 · 17 Reads

Multimodal Large Language Models (MLLMs) have demonstrated strong generalization capabilities across diverse distributions and tasks, largely due to extensive pre-training datasets. Fine-tuning MLLMs has become a common practice to improve performance on specific downstream tasks. However, during fine-tuning, MLLMs often risk forgetting knowledge acquired during pre-training, which can degrade their generalization abilities. To balance the trade-off between generalization and specialization, we propose measuring parameter importance for both the pre-trained and fine-tuning distributions, based on the frozen pre-trained weight magnitudes and the accumulated fine-tuning gradient values. We further apply an importance-aware weight allocation strategy that selectively updates the parameters most important for the downstream task. We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of its crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM fine-tuning.
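
A rough sketch of the importance-aware allocation idea: combine the frozen pre-trained weight magnitudes with the accumulated fine-tuning gradients into a per-parameter score, then update only the top-scoring fraction. The multiplicative combination, the keep_ratio, and all names are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def importance_masks(model, grad_accum, keep_ratio=0.3):
    """grad_accum maps parameter names to gradients accumulated over
    fine-tuning steps. Importance couples the downstream signal
    (|accumulated gradient|) with the pre-trained signal (frozen |weight|);
    only the top keep_ratio fraction of each tensor will be updated."""
    masks = {}
    for name, param in model.named_parameters():
        score = grad_accum[name].abs() * param.detach().abs()
        k = max(1, int(keep_ratio * score.numel()))
        threshold = torch.topk(score.flatten(), k).values.min()
        masks[name] = (score >= threshold).float()
    return masks

# After each backward pass, gate the gradients so that only the parameters
# deemed important for the downstream task actually move:
#   for name, param in model.named_parameters():
#       param.grad.mul_(masks[name])
```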


NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models

October 2024 · 7 Reads

Hallucinations in Large Language Models (LLMs) remain a major obstacle, particularly in high-stakes applications where factual accuracy is critical. While representation editing and reading methods have made strides in reducing hallucinations, their heavy reliance on specialised tools and training on in-domain samples makes them difficult to scale and prone to overfitting. This limits their accuracy gains and generalizability to diverse datasets. This paper presents a lightweight method, Norm Voting (NoVo), which harnesses the untapped potential of attention head norms to dramatically enhance factual accuracy in zero-shot multiple-choice questions (MCQs). NoVo begins by automatically selecting truth-correlated head norms with an efficient, inference-only algorithm using only 30 random samples, allowing NoVo to scale effortlessly to diverse datasets. Afterwards, the selected head norms are employed in a simple voting algorithm, which yields significant gains in prediction accuracy. On TruthfulQA MC1, NoVo surpasses the current state-of-the-art and all previous methods by at least 19 accuracy points. NoVo demonstrates exceptional generalization to 20 diverse datasets, with significant gains in over 90% of them, far exceeding all current representation editing and reading methods. NoVo also shows promising gains for fine-tuning strategies and for building textual adversarial defences. NoVo's effectiveness with head norms opens new frontiers in LLM interpretability, robustness, and reliability.
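
A hedged NumPy sketch of the two NoVo steps the abstract describes: selecting truth-correlated heads from a handful of labeled samples, then norm voting at inference. The hit-rate selection criterion and the array shapes are assumptions; the actual algorithm may differ:

```python
import numpy as np

def select_heads(head_norms, labels, n_keep=32):
    """head_norms: (n_samples, n_heads, n_choices) array of attention-head
    output norms per MCQ choice; labels: correct-choice indices.
    Keep the heads whose largest norm most often coincides with the truth."""
    hits = (head_norms.argmax(axis=-1) == labels[:, None]).mean(axis=0)
    return np.argsort(hits)[-n_keep:]

def norm_vote(head_norms, selected):
    """Each selected head votes for the choice with the largest norm;
    the plurality of votes is the prediction."""
    votes = head_norms[:, selected, :].argmax(axis=-1)   # (n_samples, n_keep)
    n_choices = head_norms.shape[-1]
    counts = np.apply_along_axis(np.bincount, 1, votes, minlength=n_choices)
    return counts.argmax(axis=-1)
```

With roughly 30 labeled samples, select_heads is run once; norm_vote then operates purely at inference time, which matches the "inference-only" framing in the abstract.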


2D Semantic-Guided Semantic Scene Completion

October 2024 · 65 Reads · 2 Citations

International Journal of Computer Vision

Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict the semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions containing multiple objects close to each other, especially objects with reflective or dark surfaces. This primarily stems from two challenges: (1) the loss of geometric information due to unreliable depth values from sensors, and (2) the potential for semantic confusion when simultaneously predicting 3D shapes and semantic labels. To address these problems, we propose a Semantic-guided Semantic Scene Completion framework, dubbed SG-SSC, which involves Semantic-guided Fusion (SGF) and a Volume-guided Semantic Predictor (VGSP). Guided by 2D semantic segmentation maps, SGF adaptively fuses RGB and depth features to compensate for the geometric information lost to missing values in depth images, making the model more robust to unreliable depth information. VGSP exploits the mutual benefit between the SC and SSC tasks, making SSC more focused on predicting the categories of voxels with high occupancy probabilities while allowing SC to utilize semantic priors to better predict voxel occupancy. Experimental results show that SG-SSC outperforms existing state-of-the-art methods on the NYU, NYUCAD, and SemanticKITTI datasets. Models and code are available at https://github.com/aipixel/SG-SSC.
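
The SGF idea of letting 2D semantics arbitrate between RGB and depth can be sketched as a gated fusion module. The layer sizes and the sigmoid gating form below are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class SemanticGuidedFusion(nn.Module):
    """A 2D semantic map predicts per-pixel, per-channel gates that decide
    how much to trust RGB versus depth features, so unreliable depth can be
    compensated by RGB wherever the semantics suggest it."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(num_classes, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat, semantic_logits):
        g = self.gate(semantic_logits)   # (B, C, H, W), values in [0, 1]
        return g * rgb_feat + (1.0 - g) * depth_feat
```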


Towards Modality-agnostic Label-efficient Segmentation with Entropy-Regularized Distribution Alignment

August 2024 · 5 Reads

Label-efficient segmentation aims to perform effective segmentation on input data using only sparse and limited ground-truth labels for training. This topic is widely studied in 3D point cloud segmentation due to the difficulty of densely annotating point clouds, and it is also essential for cost-effective segmentation on 2D images. Pseudo-labels have been widely employed to facilitate training with limited ground-truth labels, and promising progress has been witnessed in both 2D and 3D segmentation. However, existing pseudo-labeling approaches can suffer heavily from the noise and variation in unlabelled data, which results in significant discrepancies between the generated pseudo-labels and the current model predictions during training. Our analysis shows that this can further confuse the model learning process, a problem shared by label-efficient learning across both 2D and 3D modalities. To address this issue, we propose a novel learning strategy to regularize the pseudo-labels generated for training, thus effectively narrowing the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for label-efficient learning, resulting in an ERDA learning strategy. Interestingly, by using the KL distance to formulate the distribution alignment loss, ERDA reduces to a deceptively simple cross-entropy-based loss which optimizes both the pseudo-label generation module and the segmentation model simultaneously. In addition, we redesign the pseudo-label generation to make ERDA consistently effective across both 2D and 3D data modalities for segmentation. Enjoying simplicity and a more modality-agnostic pseudo-label generation, our method has shown outstanding performance in fully utilizing all unlabeled data points for training across ...
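
The cross-entropy simplification mentioned in the abstract is easy to verify in code: since H(p) + KL(p ‖ q) = −Σ p log q, the combined entropy-regularization and distribution-alignment objective collapses to a single term. A minimal sketch, with the equal weighting of the two losses assumed:

```python
import torch.nn.functional as F

def erda_loss(pseudo_logits, pred_logits):
    """Entropy regularization plus KL-based distribution alignment on
    unlabeled points. With KL as the alignment distance, H(p) + KL(p || q)
    collapses to -sum(p * log q): one cross-entropy whose gradients reach
    both the pseudo-label generator and the segmentation model."""
    p = F.softmax(pseudo_logits, dim=-1)        # pseudo-label distribution
    log_q = F.log_softmax(pred_logits, dim=-1)  # model prediction
    return -(p * log_q).sum(dim=-1).mean()
```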


Figure captions (Learning General and Specific Embedding with Transformer for Few-Shot Object Detection):
  • Conceptualization of the proposed T-GSEL method for few-shot object detection (FSOD). Using the available training data, we attempt to learn feature embeddings in a multi-stage pipeline to represent general and specific object features. A Transformer is employed in each stage to model the relation between embeddings and input features for better refining the FSOD. With proper supervision and cross-stage connections, this new pipeline offers compelling FSOD performance.
  • Overview of the T-GSEL-based multi-stage FSOD pipeline. In each stage, we apply a T-GSEL Transformer (Sect. 3.2) to learn the embeddings of this stage with the input visual features. We encourage the 1st-, 2nd-, and 3rd-stage embeddings to encode only general, both general and specific, and only specific features by using specifically designed supervisory signals. We also set up cross-stage connections to better cooperate embeddings from adjacent stages (Eq. 8). The embeddings of the 2nd and the 3rd stage, trained to encode specific object characteristics, are fused with the input feature (denoted as $\bigoplus$) to obtain augmented features for FSOD.
  • Illustration of the losses used for supervising T-GSEL. Binary detection losses, which detect all the foreground objects, are applied to supervise T-GSEL of the 1st and 2nd stages. Normal class-aware detection losses supervise the 2nd and the 3rd stages to help obtain specific object features. The "$\backslash\backslash$" symbol means the stop-gradient operation. Solid arrows mean that the losses are applied before the feature fusion of Eq. (6), while dashed arrows mean that the losses are applied after the feature fusion.
  • Standard variances of relation weights between different input features and embedding vectors of different stages. The markers at the left of the figure represent the mean standard variances of correlation weights in the 1st, 2nd, and 3rd stages, respectively.
  • Qualitative analysis of the proposed T-GSEL-based FSOD and the baseline Faster R-CNN with two-phase training (Sun et al., 2021) on Pascal VOC, showing that T-GSEL can help deliver better detections for objects of novel classes.
Learning General and Specific Embedding with Transformer for Few-Shot Object Detection

August 2024 · 16 Reads

International Journal of Computer Vision

Few-shot object detection (FSOD) studies how to detect novel objects effectively with few annotated examples. Recently, it has been demonstrated that decent feature embeddings, including the general feature embeddings that are more invariant to visual changes and the specific feature embeddings that are more discriminative for different object classes, are both important for FSOD. However, current methods lack appropriate mechanisms to sensibly coordinate both types of feature embeddings based on their importance to detecting objects of novel classes, which may result in sub-optimal performance. In this paper, to achieve more effective FSOD, we attempt to explicitly encode both general and specific feature embeddings using learnable tensors and apply a Transformer to help better incorporate them in FSOD according to their relations to the input object features. We thus propose a Transformer-based general and specific embedding learning (T-GSEL) method for FSOD. In T-GSEL, learnable tensors are employed in a three-stage pipeline, encoding feature embeddings at the general, intermediate, and specific levels, respectively. In each stage, we apply a Transformer to first model the relations of the corresponding embedding to input object features and then apply the estimated relations to refine the input features. Meanwhile, we further introduce cross-stage connections between embeddings of different stages to make them complement and cooperate with each other, delivering general, intermediate, and specific feature embeddings stage by stage and utilizing them together for feature refinement in FSOD. In practice, a T-GSEL module is easy to inject into existing detection pipelines. Extensive empirical results further show that our proposed T-GSEL method achieves compelling FSOD performance on both PASCAL VOC and MS COCO datasets compared with other state-of-the-art approaches.
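
One stage of the embedding-plus-Transformer refinement can be sketched as cross-attention between input object features and a bank of learnable embedding vectors. The dimensions, the residual fusion, and the class name below are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class GSELStage(nn.Module):
    """A bank of learnable embedding vectors is related to the input object
    features via cross-attention, and the attended embeddings refine the
    features; stacking three such stages with different supervision would
    mirror the general / intermediate / specific levels described above."""
    def __init__(self, dim, num_embeddings=16, num_heads=8):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_embeddings, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                  # feats: (B, N, dim)
        bank = self.embeddings.unsqueeze(0).expand(feats.size(0), -1, -1)
        refined, _ = self.attn(query=feats, key=bank, value=bank)
        return feats + refined                 # fused features for the next stage
```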


Efficient Learning for Linear Properties of Bounded-Gate Quantum Circuits

August 2024 · 57 Reads

The vast and complicated large-qubit state space prevents us from comprehensively capturing the dynamics of modern quantum computers via classical simulations or quantum tomography. However, recent progress in quantum learning theory raises a crucial question: given a quantum circuit containing d tunable RZ gates and G − d Clifford gates, can a learner perform purely classical inference to efficiently predict its linear properties using new classical inputs, after learning from data obtained by incoherently measuring states generated by the same circuit but with different classical inputs? In this work, we prove that a sample complexity scaling linearly in d is necessary and sufficient to achieve a small prediction error, while the corresponding computational complexity may scale exponentially in d. Building upon these derived complexity bounds, we further harness the concepts of classical shadows and truncated trigonometric expansion to devise a kernel-based learning model capable of trading off prediction error against computational complexity, transitioning from exponential to polynomial scaling in many practical settings. Our results advance two crucial realms in quantum computation: the exploration of quantum algorithms with practical utility and learning-based quantum system certification. We conduct numerical simulations to validate our proposals across diverse scenarios, encompassing quantum information processing protocols, Hamiltonian simulation, and variational quantum algorithms with up to 60 qubits.
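
The truncated trigonometric expansion can be illustrated with a small classical feature map: outputs of circuits parameterized by RZ angles are trigonometric polynomials in those angles, so truncating the frequency spectrum yields a fixed-size regression problem. Everything below (names, the ridge solver, the truncation rule) is an illustrative assumption, not the paper's construction:

```python
import numpy as np
from itertools import product

def trig_features(x, order=1):
    """Truncated trigonometric feature map for a d-dimensional angle vector x:
    one cosine/sine pair per frequency vector w with entries in
    {-order, ..., order}. The feature count grows as (2*order + 1)**d, so the
    truncation order trades prediction error against computational cost."""
    d = len(x)
    freqs = list(product(range(-order, order + 1), repeat=d))
    phases = np.array([np.dot(w, x) for w in freqs])
    return np.concatenate([np.cos(phases), np.sin(phases)])

def fit_predict(X_train, y_train, X_test, order=1, lam=1e-3):
    """Ridge regression on the truncated features. X holds angle vectors and
    y the measured linear properties (e.g. estimated from classical shadows);
    the result is a purely classical predictor for new inputs."""
    Phi = np.stack([trig_features(x, order) for x in X_train])
    Phi_test = np.stack([trig_features(x, order) for x in X_test])
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]),
                        Phi.T @ y_train)
    return Phi_test @ w
```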



Citations (60)


... As illustrated in Fig. 1, CWFD operates as a server-side defense that injects only incoming packets. Our defense leverages a common vulnerability in deep learning models: their susceptibility to backdoor attacks [10], [24], [25], [27], [31], [36], [37], [39], [66], [67]. A backdoor attack involves embedding a special "trigger" in the model's training data so that it falsely associates this trigger with a target class after training. ...

Reference:

Red Pill and Blue Pill: Controllable Website Fingerprinting Defense via Dynamic Backdoor Learning
Pre-trained Trojan Attacks for Visual Recognition

International Journal of Computer Vision

... Existing Semantic Scene Completion (SSC) approaches can be mainly classified into LiDAR-based, camera-based, and modality-fusion methods. Although LiDAR-based [23], [24], [25], [26], [27], [28], [29] and modality-fusion methods [30], [31], [32], [33], [34], [18], [35], [36] can deliver relatively strong performance, camera-based methods are often preferred for practical deployment due to their lower economic costs and superior real-time capabilities. MonoScene [37] introduces the first approach to infer 3D SSC from a single monocular image. ...

2D Semantic-Guided Semantic Scene Completion

International Journal of Computer Vision

... However, parallelizing this sequential method is highly non-trivial. Although k-core is studied in many papers and implemented in many libraries [3,15,18,19,28,34,35,38,39,42,46,50,56,58,68,71,75,80,82,83], many challenges remain, both in theory and in practice, to achieve a parallel k-core algorithm that is simple, efficient, and scalable on various types of graphs. For instance, in Fig. 2, we show that on a 96-core machine, each state-of-the-art parallel k-core solution can be slower than a sequential implementation on certain graphs, and the "worst cases" vary significantly between algorithms. ...

SpeedCore: Space-efficient and Dependency-aware GPU Parallel Framework for Core Decomposition
  • Citing Conference Paper
  • August 2024

... However, parallelizing this sequential method is highly non-trivial. Although k-core is studied in many papers and implemented in many libraries [3,15,18,19,28,34,35,38,39,42,46,50,56,58,68,71,75,80,82,83], many challenges remain, both in theory and in practice, to achieve a parallel k-core algorithm that is simple, efficient, and scalable on various types of graphs. For instance, in Fig. 2, we show that on a 96-core machine, each state-of-the-art parallel k-core solution can be slower than a sequential implementation on certain graphs, and the "worst cases" vary significantly between algorithms. ...

PICO: Accelerating All k-Core Paradigms on GPU
  • Citing Conference Paper
  • August 2024
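
For context, the "sequential method" these snippets refer to is the classic peeling algorithm for core decomposition. A minimal sketch follows (the dictionary-based graph representation is an assumption); note how each removal updates its neighbors' degrees, which is exactly the dependency chain that makes parallelization non-trivial:

```python
import heapq

def core_decomposition(adj):
    """Classic sequential peeling: repeatedly remove a minimum-degree vertex.
    The core number of v is the largest k such that v survives in a subgraph
    where every vertex has degree >= k. adj: {vertex: set(neighbors)}."""
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, core, k = set(), {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                     # stale heap entry, skip it
        k = max(k, d)                    # peeling level never decreases
        core[v] = k
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1              # sequential dependency on v's removal
                heapq.heappush(heap, (deg[u], u))
    return core
```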

... Research on neural network algorithms has been ongoing for several years. The relevant theories of neural networks have now converged across multiple disciplines, becoming a focal point for many scholars [24][25][26]. Neural networks, composed of multiple neurons and interconnected in each layer, are often utilized for regression analysis [27,28]. In contrast, the Cuckoo Search-optimized neural network method proposed in this paper is an automatic parameter-tuning method that not only enables accurate predictions but also significantly reduces manual processes. ...

Joint Input and Output Coordination for Class-Incremental Learning
  • Citing Conference Paper
  • August 2024

... Compared with other attention methods, ECA(Att GC ) more accurately determines the location of the objects of interest. By retaining a favorable parameter budget, CoTNet [28] uses the static and dynamic contextual information in input keys to guide self-attention learning, thus strengthening the capacity of visual representation. In addition to these works, many approaches attempt to extend attentional mechanisms to specific tasks, such as human pose reconstruction [29], medical image segmentation [30], saliency detection [31], machine translation [7], image restoration [32,33] and visual explanation [34][35][36]. ...

Stereo Image Restoration Via Attention-Guided Correspondence Learning
  • Citing Article
  • January 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... We evaluate SEHD-Afford on the AGD20K dataset [36], which contains 20 061 exocentric images and 3755 egocentric images labeled with 36 common affordances. Unlike previous work on affordance grounding [13], the ground truth in AGD20K is initially represented by densely annotated points located within the corresponding affordance region. ...

Grounded Affordance from Exocentric View

International Journal of Computer Vision

... We conducted a comprehensive evaluation, performing qualitative and quantitative comparisons with several SOTA models, including DCP [6], MSCNN [11], AOD-Net [10], EPDN [22], GCA-Net [43], MSBDN [14], GridDehazeNet [13], FFA-Net [15], IC-Dehazing [44], D 4 + [45] and ADC-Net [28]. As shown in Table 1, on the synthetic datasets, we compared the quantitative results of our method with other dehazing methods, where the best results are highlighted in bold, and the second-best results are underlined. ...

Robust Unpaired Image Dehazing via Density and Depth Decomposition

International Journal of Computer Vision

... Adversarial attack aims to spoof and modify the output of the model by adding imperceptible perturbations to the inputs of FMs [23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39]. Adversarial attacks were initially observed in smaller models [23,[40][41][42][43][44][45][46][47], however, the inclusion of multimodal inputs and multiple downstream task implications significantly amplifies the vulnerability and challenges of FMs [48,49]. Multimodal inputs have significantly increased the attack surface of FMs: Qi et al. [48] emphasized the transitions from the textual domain to a composite textual-image domain which increases the risk of the model; AdvCLIP [49] presented a generative adversarial network to bridge the attack gap between pre-trained encoders and downstream tasks; Zhao et al. [50] evaluated the adversarial robustness of the open-sourced vision-language FMs in a black-box scenario and found them highly vulnerable. ...

Towards Defending Multiple ℓp-Norm Bounded Adversarial Perturbations via Gated Batch Normalization

International Journal of Computer Vision