About
33 Publications · 2,839 Reads
863 Citations
Publications (33)
Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their high computational cost, arising from the quadratic computational complexity of attention mechanisms and multi-step inference, presents a significant bottleneck. To address this challenge, we propose TokenCac...
The Transformer network architecture has gained attention due to its ability to learn global relations and its superior performance. To boost performance, it is natural to distill complementary knowledge from a Transformer network to a convolutional neural network (CNN). However, most existing knowledge distillation methods only consider homologous...
Video Copy Segment Localization (VSL) requires identifying the temporal segments within a pair of videos that contain copied content. Current methods primarily focus on global temporal modeling, overlooking the complementarity of global semantic and local fine-grained features, which limits their effectiveness. Some related methods attemp...
Visual and audio events occur simultaneously and both attract attention. However, most existing saliency prediction works ignore the influence of audio and only consider the visual modality. In this paper, we propose a multi-task learning method for audio–visual saliency prediction and sound source localization on multi-face video by leveraging visual,...
The Transformer attracts much attention because of its ability to learn global relations and its superior performance. To achieve higher performance, it is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN). However, most existing knowledge distillation methods only consider homologous-architecture di...
Face Anti-Spoofing (FAS) protects face recognition systems from Presentation Attacks (PA). Though supervised FAS methods have achieved great progress in recent years, popular deep-learning-based methods require a large amount of labeled data. On the other hand, Self-Supervised Learning (SSL) methods have achieved competitive performance on various tasks...
Although the Transformer has achieved great progress on computer vision tasks, scale variation in dense image prediction is still a key challenge. Few effective multi-scale techniques have been applied in the Transformer, and current methods have two main limitations. On the one hand, the self-attention module in the vanilla Transformer fails to suffici...
The Transformer attracts much attention because of its ability to learn global relations and its superior performance. To achieve higher performance, it is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN). However, most existing knowledge distillation methods only consider homologous-architecture di...
Compared with image few-shot learning, most existing few-shot video classification methods perform worse at feature matching, because they fail to sufficiently exploit temporal information and relations. Specifically, frames are usually sampled evenly, which may miss important frames. On the other hand, the heuristic model simply encodes...
Deep learning usually shows excellent performance at the expense of heavy computation. Recently, model compression has become a popular way of reducing this computation. Compression can be achieved using knowledge distillation or filter pruning. Knowledge distillation improves the accuracy of a lightweight network, while filter pruning removes redun...
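As background for the filter-pruning idea mentioned above, the sketch below shows one common, generic pruning criterion: keep the filters of a convolution with the largest L1 norms. It is a minimal illustration written against PyTorch, not the method proposed in the publication; the function name prune_conv_filters and the keep_ratio parameter are assumptions made here for illustration.

```python
# Minimal sketch of L1-norm filter pruning (a standard, generic baseline).
# Not the specific method of the publication above; names are illustrative.
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Return a smaller Conv2d keeping the filters with the largest L1 norms."""
    weight = conv.weight.data                       # (out_ch, in_ch, kH, kW)
    scores = weight.abs().sum(dim=(1, 2, 3))        # L1 norm of each output filter
    n_keep = max(1, int(keep_ratio * weight.size(0)))
    keep_idx = torch.argsort(scores, descending=True)[:n_keep]

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = weight[keep_idx].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep_idx].clone()
    return pruned

# Example: halve the number of filters in a single layer.
layer = nn.Conv2d(64, 128, kernel_size=3, padding=1)
smaller = prune_conv_filters(layer, keep_ratio=0.5)
```

In a full pipeline the input channels of the following layer (and any batch-normalization statistics) would have to be pruned consistently; the sketch only shows the per-layer filter selection step.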
Visual and audio events occur simultaneously and both attract attention. However, most existing saliency prediction works ignore the influence of audio and only consider the visual modality. In this paper, we propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video by leveraging visual,...
Pixelwise single object tracking is challenging due to the competing demands of running speed and segmentation accuracy. Current state-of-the-art real-time approaches seamlessly connect tracking and segmentation by sharing computation of the backbone network, e.g., SiamMask and D3S fork a light branch from the tracking model to predict segmentation m...
Although the Transformer has achieved great progress on computer vision tasks, scale variation in dense image prediction is still a key challenge. Few effective multi-scale techniques have been applied in the Transformer, and current methods have two main limitations. On the one hand, the self-attention module in the vanilla Transformer fails to sufficientl...
Filter pruning is a commonly used method for compressing Convolutional Neural Networks (ConvNets), due to its hardware friendliness and flexibility. However, existing methods mostly require a cumbersome procedure, which brings many extra hyper-parameters and training epochs. This is because using only sparsity and pruning stages cannot obtain a...
Recently, video streams have occupied a large proportion of Internet traffic, and most of them contain human faces. Hence, it is necessary to predict saliency on multiple-face videos, which can provide attention cues for many content-based applications. However, most multiple-face saliency prediction works only consider visual information and ignor...
Recently, video streams have occupied a large proportion of Internet traffic, and most of them contain human faces. Hence, it is necessary to predict saliency on multiple-face videos, which can provide attention cues for many content-based applications. However, most multiple-face saliency prediction works only consider visual information and ignor...
Model compression methods, which aim to alleviate the heavy load of deep neural networks (DNNs) in real-world applications, have become popular in recent years. However, most of the existing compression methods have two limitations: 1) they usually adopt a cumbersome process, including pretraining, training with a sparsity constraint, pruning/decomp...
The key challenge of knowledge distillation is to extract general, moderate and sufficient knowledge from a teacher network to guide a student network. In this paper, a novel Instance Relationship Graph (IRG) is proposed for knowledge distillation. It models three kinds of knowledge, including instance features, instance relationships and feature s...
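The notion of instance relationships mentioned above can be illustrated with a small, generic sketch of relation-based distillation: compute a pairwise-distance matrix over the instances of a batch for both teacher and student features, and penalize the difference. This is a hedged illustration of the general idea in PyTorch, not the exact IRG formulation; the helper names are hypothetical.

```python
# Generic sketch of relation-based knowledge distillation: the student is
# trained to reproduce the teacher's pairwise instance-relationship matrix.
# Illustrative only; not the exact IRG loss of the paper above.
import torch
import torch.nn.functional as F

def relation_matrix(features: torch.Tensor) -> torch.Tensor:
    """Pairwise Euclidean distances between the instances of a batch."""
    flat = features.flatten(start_dim=1)            # (batch, dim)
    return torch.cdist(flat, flat, p=2)             # (batch, batch)

def relation_distillation_loss(student_feats: torch.Tensor,
                               teacher_feats: torch.Tensor) -> torch.Tensor:
    """MSE between student and teacher instance-relationship matrices."""
    return F.mse_loss(relation_matrix(student_feats),
                      relation_matrix(teacher_feats).detach())

# Example with random features.
student = torch.randn(8, 256, requires_grad=True)
teacher = torch.randn(8, 512)
loss = relation_distillation_loss(student, teacher)
loss.backward()
```

Because only batch-level relations are compared, the teacher and student feature dimensions do not need to match, which is one reason relation-style objectives are convenient for heterogeneous teacher-student pairs.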
The popularity of multi-view panoramic video for producing Virtual Reality (VR) content has increased considerably, due to its immersive visual experience. We argue in this paper that PSNR is less effective in assessing the objective visual quality of compressed panoramic video than Sphere-based PSNR (S-PSNR), in which sphere-to-plane mapping of p...
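As background for the comparison drawn in this abstract, conventional PSNR is computed from the mean squared error over the pixels of the projected (planar) frame, whereas a sphere-based variant averages the error over points sampled on the sphere so that regions oversampled by the sphere-to-plane projection are not overweighted. Only the standard definitions are restated below; the exact sphere-based sampling is that of the cited S-PSNR metric and is not reproduced here.

\[
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2,
\qquad
\mathrm{PSNR} = 10\,\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}},
\]

where \(\mathrm{MAX}\) is the peak pixel value (e.g., 255 for 8-bit content).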