February 2025
Publications (378)
January 2025
·
18 Reads
·
7 Citations
International Journal of Computer Vision
Pre-trained vision models (PVMs) have become a dominant component due to their exceptional performance when fine-tuned for downstream tasks. However, the presence of backdoors within PVMs poses significant threats. Unfortunately, existing studies primarily focus on backdooring PVMs for the classification task, neglecting potential inherited backdoors in downstream tasks such as detection and segmentation. In this paper, we propose the Pre-trained Trojan attack, which embeds backdoors into a PVM, enabling attacks across various downstream vision tasks. We highlight the challenges posed by cross-task activation and shortcut connections in successful backdoor attacks. To achieve effective trigger activation in diverse tasks, we stylize the backdoor trigger patterns with class-specific textures, enhancing the recognition of task-irrelevant low-level features associated with the target class in the trigger pattern. Moreover, we address the issue of shortcut connections by introducing a context-free learning pipeline for poison training. In this approach, triggers without contextual backgrounds are directly utilized as training data, diverging from the conventional use of clean images. Consequently, we establish a direct shortcut from the trigger to the target class, mitigating the shortcut connection issue. We conducted extensive experiments to thoroughly validate the effectiveness of our attacks on downstream detection and segmentation tasks. Additionally, we showcase the potential of our approach in more practical scenarios, including large vision models and 3D object detection in autonomous driving. This paper aims to raise awareness of the potential threats associated with applying PVMs in practical scenarios. Our code is available at https://github.com/Veee9/Pre-trained-Trojan.
January 2025
November 2024
·
17 Reads
Multimodal Large Language Models (MLLMs) have demonstrated strong generalization capabilities across diverse distributions and tasks, largely due to extensive pre-training datasets. Fine-tuning MLLMs has become a common practice to improve performance on specific downstream tasks. However, during fine-tuning, MLLMs often face the risk of forgetting knowledge acquired during pre-training, which can result in a decline in generalization abilities. To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions, based on frozen pre-trained weight magnitude and accumulated fine-tuning gradient values. We further apply an importance-aware weight allocation strategy, selectively updating relatively important parameters for downstream tasks. We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of the crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM fine-tuning.
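As a rough illustration of the idea in this abstract, the sketch below scores each parameter twice, once by the magnitude of its frozen pre-trained weight and once by the magnitude of its accumulated fine-tuning gradient, and updates only the parameters whose fine-tuning score is the larger of the two. The min-max normalization, the comparison rule, and the function names `normalize`, `select_update_mask`, and `masked_sgd_step` are all assumptions made for this example; the abstract does not specify how the two importance scores are compared.

```python
def normalize(values):
    """Min-max scale a list of scores to [0, 1] (all zeros if constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_update_mask(pretrained_weights, accumulated_grads):
    """Return True for parameters judged more important downstream.

    Assumed rule (not from the abstract): a parameter is updated when its
    normalized accumulated-gradient magnitude exceeds its normalized
    pre-trained weight magnitude.
    """
    pre_importance = normalize([abs(w) for w in pretrained_weights])
    ft_importance = normalize([abs(g) for g in accumulated_grads])
    return [ft > pre for ft, pre in zip(ft_importance, pre_importance)]


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Apply a plain SGD step only where the mask is True."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]


# Toy example with four scalar parameters.
weights = [2.0, 0.1, -1.0, 0.5]
acc_grads = [0.1, 0.9, 0.2, 0.8]
mask = select_update_mask(weights, acc_grads)
updated = masked_sgd_step(weights, acc_grads, mask)
```

In this toy setting the large pre-trained weights stay frozen while the small weights with large accumulated gradients are updated, which is the generalization versus specialization trade-off the abstract describes.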
October 2024
·
7 Reads
Hallucinations in Large Language Models (LLMs) remain a major obstacle, particularly in high-stakes applications where factual accuracy is critical. While representation editing and reading methods have made strides in reducing hallucinations, their heavy reliance on specialised tools and training on in-domain samples makes them difficult to scale and prone to overfitting. This limits their accuracy gains and generalizability to diverse datasets. This paper presents a lightweight method, Norm Voting (NoVo), which harnesses the untapped potential of attention head norms to dramatically enhance factual accuracy in zero-shot multiple-choice questions (MCQs). NoVo begins by automatically selecting truth-correlated head norms with an efficient, inference-only algorithm using only 30 random samples, allowing NoVo to effortlessly scale to diverse datasets. Afterwards, selected head norms are employed in a simple voting algorithm, which yields significant gains in prediction accuracy. On TruthfulQA MC1, NoVo surpasses the current state-of-the-art and all previous methods by an astounding margin: at least 19 accuracy points. NoVo demonstrates exceptional generalization to 20 diverse datasets, with significant gains in over 90% of them, far exceeding all current representation editing and reading methods. NoVo also reveals promising gains to finetuning strategies and building textual adversarial defence. NoVo's effectiveness with head norms opens new frontiers in LLM interpretability, robustness and reliability.
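The two stages named in this abstract, head selection on a small labeled sample followed by voting, could be sketched as below. This is only one plausible reading: the abstract does not say how a head norm is turned into a vote or how heads are judged truth-correlated. Here it is assumed that each head votes for the answer choice with the largest norm and that a head is kept when its own accuracy on the sample reaches a threshold. The names `head_vote`, `select_heads`, `predict`, and `min_accuracy`, and the toy numbers, are hypothetical.

```python
from collections import Counter


def head_vote(norms_for_question, head):
    """Index of the answer choice with the largest norm for one head."""
    row = norms_for_question[head]
    return max(range(len(row)), key=row.__getitem__)


def select_heads(sample_norms, sample_labels, min_accuracy=0.75):
    """Keep heads whose individual votes match the labels often enough.

    sample_norms[q][h][c] is the (precomputed) norm of head h for choice c
    of question q. No gradients or training are involved.
    """
    num_heads = len(sample_norms[0])
    selected = []
    for head in range(num_heads):
        correct = sum(
            head_vote(question, head) == label
            for question, label in zip(sample_norms, sample_labels)
        )
        if correct / len(sample_labels) >= min_accuracy:
            selected.append(head)
    return selected


def predict(norms_for_question, selected_heads):
    """Majority vote over the selected heads."""
    votes = Counter(head_vote(norms_for_question, h) for h in selected_heads)
    return votes.most_common(1)[0][0]


# Toy data: 4 labeled questions, 3 heads, 2 answer choices each.
sample_norms = [
    [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]],
    [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]],
    [[0.7, 0.3], [0.4, 0.6], [0.5, 0.6]],
    [[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]],
]
sample_labels = [0, 1, 0, 1]
heads = select_heads(sample_norms, sample_labels)
answer = predict([[0.2, 0.8], [0.1, 0.9], [0.9, 0.1]], heads)
```

In the toy data the third head disagrees with the labels on most samples, so it is dropped before voting and cannot sway the prediction for the new question.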
October 2024
·
65 Reads
·
2 Citations
International Journal of Computer Vision
Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions with multiple objects close to each other, especially for objects with reflective or dark surfaces. This primarily stems from two challenges: (1) the loss of geometric information due to the unreliability of depth values from sensors, and (2) the potential for semantic confusion when simultaneously predicting 3D shapes and semantic labels. To address these problems, we propose a Semantic-guided Semantic Scene Completion framework, dubbed SG-SSC, which involves Semantic-guided Fusion (SGF) and Volume-guided Semantic Predictor (VGSP). Guided by 2D semantic segmentation maps, SGF adaptively fuses RGB and depth features to compensate for the missing geometric information caused by the missing values in depth images, thus performing more robustly to unreliable depth information. VGSP exploits the mutual benefit between SC and SSC tasks, making SSC more focused on predicting the categories of voxels with high occupancy probabilities and also allowing SC to utilize semantic priors to better predict voxel occupancy. Experimental results show that SG-SSC outperforms existing state-of-the-art methods on the NYU, NYUCAD, and SemanticKITTI datasets. Models and code are available at https://github.com/aipixel/SG-SSC.
August 2024
·
5 Reads
Label-efficient segmentation aims to perform effective segmentation on input data using only sparse and limited ground-truth labels for training. This topic is widely studied in 3D point cloud segmentation due to the difficulty of annotating point clouds densely, while it is also essential for cost-effective segmentation on 2D images. Until recently, pseudo-labels have been widely employed to facilitate training with limited ground-truth labels, and promising progress has been witnessed in both 2D and 3D segmentation. However, existing pseudo-labeling approaches can suffer heavily from noise and variations in unlabelled data, resulting in significant discrepancies between generated pseudo-labels and current model predictions during training. Our analysis shows that this can further confuse and affect the model learning process, and that it is a shared problem in label-efficient learning across both 2D and 3D modalities. To address this issue, we propose a novel learning strategy to regularize the pseudo-labels generated for training, thus effectively narrowing the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for label-efficient learning, resulting in an ERDA learning strategy. Interestingly, by using KL distance to formulate the distribution alignment loss, ERDA reduces to a deceptively simple cross-entropy-based loss which optimizes both the pseudo-label generation module and the segmentation model simultaneously. In addition, we innovate in the pseudo-label generation to make our ERDA consistently effective across both 2D and 3D data modalities for segmentation. Enjoying simplicity and more modality-agnostic pseudo-label generation, our method has shown outstanding performance in fully utilizing all unlabeled data points for training across ...
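The reduction to a cross-entropy loss mentioned in this abstract can be checked with a short derivation. The notation is an assumption made here, not taken from the abstract: $p$ denotes the pseudo-label distribution and $q$ the model prediction over $K$ classes, the alignment term is taken as $\mathrm{KL}(p \,\|\, q)$, and the two terms are given equal weight.

```latex
\begin{align*}
L_{\mathrm{ER}} &= H(p) = -\sum_{k=1}^{K} p_k \log p_k \\
L_{\mathrm{DA}} &= \mathrm{KL}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} \\
L_{\mathrm{ER}} + L_{\mathrm{DA}}
  &= -\sum_{k=1}^{K} p_k \log p_k + \sum_{k=1}^{K} p_k \log p_k - \sum_{k=1}^{K} p_k \log q_k \\
  &= -\sum_{k=1}^{K} p_k \log q_k = H(p, q)
\end{align*}
```

Under these assumptions the entropy terms cancel and only the cross-entropy $H(p, q)$ remains. Its gradient flows to both $p$ and $q$, which is consistent with the statement that the loss optimizes the pseudo-label generation module and the segmentation model simultaneously.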
August 2024
·
16 Reads
International Journal of Computer Vision
Few-shot object detection (FSOD) studies how to detect novel objects with few annotated examples effectively. Recently, it has been demonstrated that decent feature embeddings, including the general feature embeddings that are more invariant to visual changes and the specific feature embeddings that are more discriminative for different object classes, are both important for FSOD. However, current methods lack appropriate mechanisms to sensibly combine both types of feature embeddings based on their importance to detecting objects of novel classes, which may result in sub-optimal performance. In this paper, to achieve more effective FSOD, we attempt to explicitly encode both general and specific feature embeddings using learnable tensors and apply a Transformer to help better incorporate them in FSOD according to their relations to the input object features. We thus propose a Transformer-based general and specific embedding learning (T-GSEL) method for FSOD. In T-GSEL, learnable tensors are employed in a three-stage pipeline, encoding feature embeddings at the general level, intermediate level, and specific level, respectively. In each stage, we apply a Transformer to first model the relations of the corresponding embedding to input object features and then apply the estimated relations to refine the input features. Meanwhile, we further introduce cross-stage connections between embeddings of different stages to make them complement and cooperate with each other, delivering general, intermediate, and specific feature embeddings stage by stage and utilizing them together for feature refinement in FSOD. In practice, a T-GSEL module is easy to inject. Extensive empirical results further show that our proposed T-GSEL method achieves compelling FSOD performance on both PASCAL VOC and MS COCO datasets compared with other state-of-the-art approaches.
August 2024
·
57 Reads
The vast and complicated large-qubit state space forbids us to comprehensively capture the dynamics of modern quantum computers via classical simulations or quantum tomography. However, recent progress in quantum learning theory invokes a crucial question: given a quantum circuit containing d tunable RZ gates and G-d Clifford gates, can a learner perform purely classical inference to efficiently predict its linear properties using new classical inputs, after learning from data obtained by incoherently measuring states generated by the same circuit but with different classical inputs? In this work, we prove that the sample complexity scaling linearly in d is necessary and sufficient to achieve a small prediction error, while the corresponding computational complexity may scale exponentially in d. Building upon these derived complexity bounds, we further harness the concept of classical shadow and truncated trigonometric expansion to devise a kernel-based learning model capable of trading off prediction error and computational complexity, transitioning from exponential to polynomial scaling in many practical settings. Our results advance two crucial realms in quantum computation: the exploration of quantum algorithms with practical utilities and learning-based quantum system certification. We conduct numerical simulations to validate our proposals across diverse scenarios, encompassing quantum information processing protocols, Hamiltonian simulation, and variational quantum algorithms up to 60 qubits.
August 2024
·
4 Reads
·
1 Citation
Citations (60)
... As illustrated in Fig. 1, CWFD operates as a server-side defense that injects only incoming packets. Our defense leverages a common vulnerability in deep learning models: their susceptibility to backdoor attacks [10], [24], [25], [27], [31], [36], [37], [39], [66], [67]. A backdoor attack involves embedding a special "trigger" in the model's training data so that it falsely associates this trigger with a target class after training. ...
- Citing Article
- Publisher preview available
January 2025
International Journal of Computer Vision
... Existing Semantic Scene Completion (SSC) approaches can be mainly classified into LiDAR-based, camera-based, and modality-fusion methods. Although LiDAR-based [23], [24], [25], [26], [27], [28], [29] and modality-fusion methods [30], [31], [32], [33], [33], [34], [18], [35], [36] can deliver relatively strong performance, camera-based methods are often preferred for practical deployment due to their lower economic costs and superior real-time capabilities. MonoScene [37] introduces the first approach to infer 3D SSC from a single monocular image. ...
Reference:
Event-aided Semantic Scene Completion
- Citing Article
- Publisher preview available
October 2024
International Journal of Computer Vision
... However, parallelizing this sequential method is highly non-trivial. Despite k-core is studied in many papers and implemented in many libraries [3,15,18,19,28,34,35,38,39,42,46,50,56,58,68,71,75,80,82,83], many challenges remain, both in theory and in practice, to achieve a parallel k-core algorithm that is simple, efficient, and scalable on various types of graphs. For instance, in Fig. 2, we show that on a 96-core machine, each state-of-the-art parallel k-core solution can be slower than a sequential implementation on certain graphs, and the "worst cases" vary significantly between algorithms. ...
- Citing Conference Paper
August 2024
... Research on neural network algorithms has been ongoing for several years. The relevant theories of neural networks have now converged across multiple disciplines, becoming a focal point for many scholars [24][25][26]. Neural networks, composed of multiple neurons and interconnected in each layer, are often utilized for regression analysis [27,28]. In contrast, the Cuckoo Search-optimized neural network method proposed in this paper is an automatic parameter-tuning method that not only enables accurate predictions but also significantly reduces manual processes. ...
- Citing Conference Paper
August 2024
... This method requires only a few character examples in the desired font. More recently, diffusion model-based methods [11] have achieved high-quality and high-resolution font generation [8,10,19,40]. ...
- Citing Article
- Publisher preview available
June 2024
International Journal of Computer Vision
... Compared with other attention methods, ECA (Att_GC) more accurately determines the location of the objects of interest. By retaining the favorable parameter budget, CoTNet [28] uses the static and dynamic contextual information in input keys to guide self-attention learning, thus strengthening the capacity of visual representation. In addition to these works, many approaches attempt to extend attentional mechanisms to specific tasks, such as human pose reconstruction [29], medical image segmentation [30], saliency detection [31], machine translation [7], image restoration [32,33] and visual explanation [34][35][36]. ...
- Citing Article
January 2024
IEEE Transactions on Pattern Analysis and Machine Intelligence
... We evaluate SEHD-Afford on the AGD20K dataset [36], which contains 20 061 exocentric images and 3755 egocentric images labeled with 36 common affordances. Unlike previous work on affordance grounding [13], the ground truth in AGD20K is initially represented by densely annotated points located within the corresponding affordance region. ...
- Citing Article
- Publisher preview available
December 2023
International Journal of Computer Vision
... We conducted a comprehensive evaluation, performing qualitative and quantitative comparisons with several SOTA models, including DCP [6], MSCNN [11], AOD-Net [10], EPDN [22], GCA-Net [43], MSBDN [14], GridDehazeNet [13], FFA-Net [15], IC-Dehazing [44], D4+ [45] and ADC-Net [28]. As shown in Table 1, on the synthetic datasets, we compared the quantitative results of our method with other dehazing methods, where the best results are highlighted in bold, and the second-best results are underlined. ...
- Citing Article
- Publisher preview available
November 2023
International Journal of Computer Vision
... Adversarial attack aims to spoof and modify the output of the model by adding imperceptible perturbations to the inputs of FMs [23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39]. Adversarial attacks were initially observed in smaller models [23][40][41][42][43][44][45][46][47]; however, the inclusion of multimodal inputs and multiple downstream task implications significantly amplifies the vulnerability and challenges of FMs [48,49]. Multimodal inputs have significantly increased the attack surface of FMs: Qi et al. [48] emphasized the transitions from the textual domain to a composite textual-image domain which increases the risk of the model; AdvCLIP [49] presented a generative adversarial network to bridge the attack gap between pre-trained encoders and downstream tasks; Zhao et al. [50] evaluated the adversarial robustness of the open-sourced vision-language FMs in a black-box scenario and found them highly vulnerable. ...
- Citing Article
- Publisher preview available
September 2023
International Journal of Computer Vision