Xiangyu He’s research while affiliated with Institute of Automation, Chinese Academy of Sciences and other places


Publications (38)


Towards Fully Sparse Training: Information Restoration with Spatial Similarity
  • Article

June 2022 · 20 Reads · 1 Citation

Proceedings of the AAAI Conference on Artificial Intelligence

Weixiang Xu · Xiangyu He · Ke Cheng · [...]

The 2:4 structured sparsity pattern released with the NVIDIA Ampere architecture, which requires that every four consecutive values contain at least two zeros, enables doubled math throughput for matrix multiplications. Recent works mainly focus on inference speedup via 2:4 sparsity, while training acceleration has been largely overlooked, even though backpropagation consumes around 70% of the training time. Unlike inference, however, training speedup with structured pruning is nontrivial due to the need to maintain the fidelity of gradients and to reduce the overhead of performing 2:4 sparsification online. For the first time, this article proposes fully sparse training (FST), where `fully' indicates that ALL matrix multiplications in forward/backward propagation are structurally pruned while maintaining accuracy. To this end, we begin with a saliency analysis, investigating the sensitivity of different sparse objects to structured pruning. Based on the observation of spatial similarity among activations, we propose pruning activations with fixed 2:4 masks. Moreover, an Information Restoration block is proposed to retrieve the lost information, which can be implemented by an efficient gradient-shift operation. Evaluations of accuracy and efficiency show that we can achieve 2× training acceleration with negligible accuracy degradation on challenging large-scale classification and detection tasks.
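
As an illustration of the 2:4 pattern the abstract refers to, the sketch below (PyTorch, with the hypothetical helper name `apply_2to4_mask`) keeps the two largest-magnitude values in every group of four along the last dimension. It is a minimal sketch only; the paper's fixed activation masks and Information Restoration block are not reproduced.

```python
import torch

def apply_2to4_mask(x: torch.Tensor) -> torch.Tensor:
    """Keep the two largest-magnitude values in every group of four along
    the last dimension (the NVIDIA Ampere 2:4 pattern). Illustrative only."""
    orig_shape = x.shape
    assert orig_shape[-1] % 4 == 0, "last dim must be divisible by 4"
    groups = x.reshape(-1, 4)                        # (num_groups, 4)
    top2 = groups.abs().topk(k=2, dim=1).indices     # indices of top-2 magnitudes
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, top2, True)
    return (groups * mask).reshape(orig_shape)

x = torch.randn(2, 8)
print(apply_2to4_mask(x))   # exactly two non-zeros per group of four
```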


Singular Value Fine-tuning: Few-shot Segmentation requires Few-parameters Fine-tuning

June 2022 · 61 Reads

Freezing the pre-trained backbone has become a standard paradigm to avoid overfitting in few-shot segmentation. In this paper, we rethink the paradigm and explore a new regime: fine-tuning a small part of the parameters in the backbone. We present a solution to overcome the overfitting problem, leading to better model generalization on learning novel classes. Our method decomposes backbone parameters into three successive matrices via the Singular Value Decomposition (SVD), then only fine-tunes the singular values and keeps the others frozen. This design allows the model to adjust feature representations on novel classes while maintaining semantic clues within the pre-trained backbone. We evaluate our Singular Value Fine-tuning (SVF) approach on various few-shot segmentation methods with different backbones. We achieve state-of-the-art results on both Pascal-5^i and COCO-20^i across 1-shot and 5-shot settings. Hopefully, this simple baseline will encourage researchers to rethink the role of backbone fine-tuning in few-shot settings. The source code and models will be available at https://github.com/syp2ysy/SVF.
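
A minimal sketch of the decomposition described above, assuming a single linear layer in PyTorch: the weight is factored by SVD, U and Vᵀ are frozen buffers, and only the singular values are registered as trainable parameters. This illustrates the idea, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class SVFLinear(nn.Module):
    """Singular Value Fine-tuning sketch: W = U diag(s) V^T, where only
    the singular values s receive gradients."""
    def __init__(self, pretrained_weight: torch.Tensor):
        super().__init__()
        U, s, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U)          # frozen
        self.register_buffer("Vh", Vh)        # frozen
        self.s = nn.Parameter(s)              # the only trainable part

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.U @ torch.diag(self.s) @ self.Vh
        return x @ W.t()

layer = SVFLinear(torch.randn(64, 128))
out = layer(torch.randn(4, 128))              # gradients flow to layer.s only
```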


APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers
  • Conference Paper
  • Full-text available

June 2022 · 21 Reads · 30 Citations


Toward Accurate Binarized Neural Networks With Sparsity for Mobile Application

May 2022 · 15 Reads · 10 Citations

IEEE Transactions on Neural Networks and Learning Systems

While binarized neural networks (BNNs) have attracted great interest, popular approaches proposed so far mainly exploit the symmetric sign function for feature binarization, i.e., they binarize activations into −1 and +1 with a fixed threshold of 0. However, whether this choice is optimal has been largely overlooked. In this work, we propose the Sparsity-inducing BNN (Si-BNN) to quantize activations to either 0 or +1, which better approximates ReLU using 1 bit. We further introduce trainable thresholds into the backward function of binarization to guide the gradient propagation. Our method dramatically outperforms the current state of the art, lowering the performance gap between full-precision networks and BNNs on mainstream architectures and achieving new state-of-the-art results on binarized AlexNet (Top-1 50.5%), ResNet-18 (Top-1 62.2%), and ResNet-50 (Top-1 68.3%). At inference time, Si-BNN still enjoys the high efficiency of bit-wise operations. In our implementation, the running time of binary AlexNet on a CPU can be competitive with popular GPU-based deep learning frameworks.
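
A rough sketch of the {0, +1} activation binarization described in the abstract, written as a custom PyTorch autograd function. The backward gate controlled by the threshold `t` is only a plausible surrogate; the paper makes the threshold trainable, and its actual update rule is not reproduced here.

```python
import torch

class Binarize01(torch.autograd.Function):
    """Forward: map activations to {0, +1}, a 1-bit approximation of ReLU.
    Backward: straight-through estimator whose pass-band is controlled by a
    threshold t (kept fixed here; the paper trains it)."""
    @staticmethod
    def forward(ctx, x, t):
        ctx.save_for_backward(x, torch.as_tensor(t))
        return (x > 0).to(x.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        x, t = ctx.saved_tensors
        grad_x = grad_out * (x.abs() < t).to(grad_out.dtype)  # clip far from 0
        return grad_x, None                                   # no grad for t here

x = torch.randn(4, 8, requires_grad=True)
y = Binarize01.apply(x, 1.0)
y.sum().backward()           # gradients reach x only where |x| < 1.0
```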


Soft Threshold Ternary Networks

April 2022 · 15 Reads

Large neural networks are difficult to deploy on mobile devices because of their intensive computation and storage costs. To alleviate this, we study ternarization, a balance between efficiency and accuracy that quantizes both weights and activations into ternary values. In previous ternarized neural networks, a hard threshold Δ is introduced to determine the quantization intervals. Although the selection of Δ greatly affects the training results, previous works estimate Δ via an approximation or treat it as a hyper-parameter, which is suboptimal. In this paper, we present Soft Threshold Ternary Networks (STTN), which enable the model to determine quantization intervals automatically instead of depending on a hard threshold. Concretely, we replace the original ternary kernel with the addition of two binary kernels at training time, where ternary values are determined by the combination of the two corresponding binary values. At inference time, we add up the two binary kernels to obtain a single ternary kernel. Our method dramatically outperforms the current state of the art, lowering the performance gap between full-precision networks and extreme low-bit networks. Experiments on ImageNet with ResNet-18 (Top-1 66.2%) achieve a new state of the art. Update: in this version, we further fine-tune the experimental hyperparameters and training procedure. The latest STTN shows that ResNet-18 with ternary weights and ternary activations achieves up to 68.2% Top-1 accuracy on ImageNet. Code is available at: github.com/WeixiangXu/STTN.
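
The core training-time reparameterization in the abstract, namely that a ternary kernel equals the sum of two binary kernels, can be seen in a few lines. This is illustrative only; the straight-through gradients, scaling factors, and training recipe are omitted.

```python
import torch

# Training-time parameterization: two real-valued latent kernels, each
# binarized to {-1, +1}; their sum lies in {-2, 0, +2}, i.e. a (scaled)
# ternary kernel, so no hard threshold is needed to pick the intervals.
w1 = torch.randn(16, 16)
w2 = torch.randn(16, 16)

b1 = torch.sign(w1)          # {-1, +1}
b2 = torch.sign(w2)          # {-1, +1}

ternary = b1 + b2            # values in {-2, 0, +2}
print(torch.unique(ternary)) # at inference the two kernels collapse into one
```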


Optimization-Based Post-Training Quantization With Bit-Split and Stitching

March 2022 · 38 Reads · 15 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

Deep neural networks have shown great promise in various domains. Meanwhile, storage and computing overheads arise along with these breakthroughs. To address them, network quantization has received increasing attention due to its high efficiency and hardware-friendly properties. Nonetheless, most existing quantization approaches rely on the full training dataset and a time-consuming fine-tuning process to retain accuracy. Post-training quantization avoids these costs; however, it has mainly been shown effective for 8-bit quantization. In this paper, we theoretically analyze the effect of network quantization and show that the quantization loss in the final output layer is bounded by the layer-wise activation reconstruction error. Based on this analysis, we propose an Optimization-based Post-training Quantization framework and a novel Bit-split optimization approach to achieve minimal accuracy degradation. The proposed framework is validated on a variety of computer vision tasks, including image classification, object detection, and instance segmentation, with various network architectures. Specifically, we achieve near-original model performance even when quantizing FP32 models to 3-bit without fine-tuning.
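
A toy sketch of the layer-wise idea described above, assuming a single linear layer and a small calibration batch: a quantization scale is optimized so that the quantized layer reproduces the full-precision activations. The paper's Bit-split and stitching strategy and its theoretical bound are not reproduced; names such as `ptq_layer_scale` are hypothetical.

```python
import torch

def ptq_layer_scale(W: torch.Tensor, X: torch.Tensor, bits: int = 3,
                    iters: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """Learn a per-layer scale s so that the quantized weights minimize the
    activation reconstruction error ||X W^T - X Q(W)^T||^2 on calibration
    data X (layer-wise objective only; no Bit-split / stitching)."""
    qmax = 2 ** (bits - 1) - 1
    s = (W.abs().max() / qmax).clone().requires_grad_(True)
    ref = X @ W.t()                                    # full-precision output
    opt = torch.optim.Adam([s], lr=lr)
    for _ in range(iters):
        q = torch.clamp(torch.round(W / s), -qmax - 1, qmax)
        W_hat = (q - W / s).detach() * s + W           # STE: value equals q * s
        loss = ((X @ W_hat.t()) - ref).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return W_hat.detach()

W_q = ptq_layer_scale(torch.randn(16, 32), torch.randn(64, 32))
```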


Figure 4. Trade-off between model size and average PSNR on Set5 [6] (3×). The marker size indicates the number of parameters; the running time is measured on a 720P SR image.
Figure 5. Visual comparison for 2×, 3×, and 4× SR with BI models on benchmark datasets.
EDSR-baseline [47] results on Set14 (2×) for different learning targets; the best setting is used in the following experiments.
Average PSNR (dB) for different models on the RealSR testing set. SRResNet and RCAN results are cited directly from the RealSR benchmark [12]; the RCAN used in [12] is smaller than the original model. The running time is measured on a 1200 × 2200 SR image.
Revisiting L1 Loss in Super-Resolution: A Probabilistic View and Beyond

January 2022 · 212 Reads

Super-resolution is an ill-posed problem: a low-resolution input has many high-resolution candidates. However, the popular ℓ1 loss used to best fit the given HR image fails to consider this fundamental non-uniqueness in image restoration. In this work, we fix the missing piece in the ℓ1 loss by formulating super-resolution with neural networks as a probabilistic model. We show that the ℓ1 loss is equivalent to a degraded likelihood function that removes the randomness from the learning process. By introducing a data-adaptive random variable, we present a new objective function that aims at minimizing the expectation of the reconstruction error over all plausible solutions. Experimental results show consistent improvements on mainstream architectures, with no extra parameters or computing cost at inference time.
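
A schematic reading of the objective described in the abstract, assuming the expectation is approximated by sampling perturbed plausible targets. The construction of the paper's data-adaptive random variable is not given in the abstract, so `sigma` below is a stand-in, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def expected_l1_loss(sr: torch.Tensor, hr: torch.Tensor,
                     sigma: torch.Tensor, n_samples: int = 4) -> torch.Tensor:
    """Schematic objective: instead of fitting the single given HR image,
    average the L1 reconstruction error over perturbed plausible targets
    hr + sigma * eps (sigma is a placeholder for a data-adaptive variable)."""
    losses = []
    for _ in range(n_samples):
        eps = torch.randn_like(hr)
        losses.append(F.l1_loss(sr, hr + sigma * eps))
    return torch.stack(losses).mean()
```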


Improving Extreme Low-Bit Quantization With Soft Threshold

January 2022 · 17 Reads · 17 Citations

IEEE Transactions on Circuits and Systems for Video Technology

Deep neural networks executed with low precision at inference time gain acceleration and compression advantages over their high-precision counterparts, but need to overcome the challenge of accuracy degeneration as the bit-width decreases. This work focuses on sub-4-bit quantization, where the accuracy degeneration is significant. We start with ternarization, a balance between efficiency and accuracy that quantizes both weights and activations into ternary values. We find that the hard threshold Δ introduced in previous ternary networks for determining quantization intervals, and the suboptimal choice of Δ, limit the performance of the ternary model. To alleviate this, we present Soft Threshold Ternary Networks (STTN), which enable the model to determine ternarized values automatically instead of depending on a hard threshold. Based on this, we further generalize the idea of the soft threshold from ternarization to arbitrary bit-widths, named Soft Threshold Quantized Networks (STQN). We observe that previous quantization relies on the rounding-to-nearest function, constraining the quantization solution space and leading to significant accuracy degradation, especially in low-bit (≤3-bit) quantization. Instead of relying on the traditional rounding-to-nearest function, STQN is able to determine quantization intervals adaptively by itself. Accuracy experiments on image classification, object detection, and instance segmentation, as well as efficiency experiments on a field-programmable gate array (FPGA), demonstrate that the proposed framework achieves a prominent tradeoff between accuracy and efficiency. Code is available at: https://github.com/WeixiangXu/STTN.
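
To illustrate the abstract's observation that rounding-to-nearest constrains the solution space, the snippet below compares nearest-level ternary assignment with a greedy per-weight assignment that minimizes the layer output error on a calibration batch. This only illustrates the limitation being discussed; it is not the STQN method itself.

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 16)
X = torch.randn(32, 16)
levels = torch.tensor([-1.0, 0.0, 1.0])                  # ternary grid

# rounding-to-nearest: each weight snaps to its closest level
nearest = levels[(W.unsqueeze(-1) - levels).abs().argmin(dim=-1)]

# greedy output-error-driven assignment, starting from the nearest solution
ref = X @ W.t()
adaptive = nearest.clone()
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        errs = []
        for lv in levels:
            trial = adaptive.clone()
            trial[i, j] = lv
            errs.append(((X @ trial.t()) - ref).pow(2).sum())
        adaptive[i, j] = levels[torch.stack(errs).argmin()]

print(((X @ nearest.t()) - ref).pow(2).sum().item(),
      ((X @ adaptive.t()) - ref).pow(2).sum().item())     # adaptive <= nearest
```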


APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers

December 2021 · 35 Reads

Federated learning frameworks typically require collaborators to share local gradient updates of a common model, instead of sharing training data, in order to preserve privacy. However, prior work on gradient leakage attacks showed that private training data can be revealed from gradients. So far, almost all relevant works base their attacks on fully-connected or convolutional neural networks. Given the recent overwhelming trend of adapting Transformers to solve multifarious vision tasks, it is highly valuable to investigate the privacy risk of vision Transformers. In this paper, we analyse the gradient leakage risk of the self-attention mechanism in both theoretical and practical terms. In particular, we propose APRIL - Attention PRIvacy Leakage - which poses a strong threat to self-attention inspired models such as ViT. By showing how vision Transformers are at risk of privacy leakage via gradients, we highlight the importance of designing privacy-safer Transformer models and defense schemes.
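
For context, the sketch below shows a generic gradient-matching reconstruction (in the style of earlier gradient leakage attacks) on a toy MLP, assuming the label is known: a dummy input is optimized until its gradients match the shared ones. APRIL's attention-specific, ViT-targeted analysis is not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 16), nn.Tanh(), nn.Linear(16, 4))
criterion = nn.CrossEntropyLoss()

# gradients the victim would share in federated learning
x_true, y_true = torch.randn(1, 32), torch.tensor([2])
true_grads = torch.autograd.grad(criterion(model(x_true), y_true),
                                 model.parameters())

# attacker: optimize a dummy input so its gradients match the shared ones
x_dummy = torch.randn(1, 32, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.1)
for _ in range(300):
    grads = torch.autograd.grad(criterion(model(x_dummy), y_true),
                                model.parameters(), create_graph=True)
    match = sum(((g - t) ** 2).sum() for g, t in zip(grads, true_grads))
    opt.zero_grad()
    match.backward()
    opt.step()

print(match.item())   # the gradient-matching loss should approach zero
```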


Improving Binary Neural Networks through Fully Utilizing Latent Weights

October 2021 · 45 Reads

Binary Neural Networks (BNNs) rely on a real-valued auxiliary variable W to aid binary training. However, pioneering binary works only use W to accumulate gradient updates during backward propagation, which cannot fully exploit its power and may hinder novel advances in BNNs. In this work, we explore the role of W in training beyond acting as a latent variable. Notably, we propose to add W into the computation graph, making it perform as a real-valued feature extractor that aids the binary training. We explore different ways to utilize the real-valued weights and propose a specialized supervision. Visualization experiments qualitatively verify the effectiveness of our approach in making it easier to distinguish between different categories. Quantitative experiments show that our approach outperforms the current state of the art, further closing the performance gap between floating-point networks and BNNs. Evaluations on ImageNet with ResNet-18 (Top-1 63.4%) and ResNet-34 (Top-1 67.0%) achieve a new state of the art.
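
A minimal sketch of the dual use of the latent weight W described above, assuming a linear layer in PyTorch: the binary branch uses a straight-through estimator, while the same W also produces real-valued features that can receive an auxiliary supervision. The paper's specialized supervision is not reproduced; the MSE alignment below is only a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchBinaryLinear(nn.Module):
    """The latent real-valued weight W both drives the binary branch
    (via a straight-through estimator) and sits in the computation graph
    as a real-valued feature extractor."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.W = nn.Parameter(torch.randn(out_f, in_f) * 0.01)

    def forward(self, x):
        Wb = torch.sign(self.W)
        Wb = (Wb - self.W).detach() + self.W          # STE: binary forward, real grad
        return F.linear(x, Wb), F.linear(x, self.W)   # (binary branch, real branch)

layer = DualBranchBinaryLinear(64, 10)
bin_out, real_out = layer(torch.randn(8, 64))
aux_loss = F.mse_loss(bin_out, real_out.detach())     # placeholder auxiliary supervision
```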


Citations (25)


... QAT relies on complete training data and labels, retraining network weights and quantization parameters through backpropagation. Research in QAT primarily focuses on gradient estimation [9-11]; optimization strategies [12][13][14]; binary networks [15][16][17]; quantization distillation [18-20]; etc. Although QAT can achieve higher quantization accuracy, it is not the primary choice for neural network quantization deployment due to its high time costs and close relation to specific tasks. ...

Reference:

AE-Qdrop: Towards Accurate and Efficient Low-Bit Post-Training Quantization for A Convolutional Neural Network
Improving Extreme Low-Bit Quantization With Soft Threshold
  • Citing Article
  • January 2022

IEEE Transactions on Circuits and Systems for Video Technology

... Machine learning has been shown to be vulnerable to various types of attacks (Pitropakis et al., 2019;Rigaki & Garcia, 2023;Chakraborty et al., 2018;Oliynyk et al., 2023;Tian et al., 2022). The majority of attacks target the either confidentiality (including membership inference (Shokri et al., 2017;Salem et al., 2019;Choquette-Choo et al., 2021), data reconstruction attacks (Fredrikson et al., 2015;Carlini et al., 2021;Geiping et al., 2020;Lu et al., 2022) and model stealing attacks (Tramèr et al., 2016;Chandrasekaran et al., 2020)) or integrity (like adversarial attacks (Biggio et al., 2013;Szegedy et al., 2013;Dong et al., 2018; and data poisoning attacks (Barreno et al., 2010;Jagielski et al., 2018;Biggio et al., 2012;Mei & Zhu, 2015b)). Different from the above mentioned attacks, we aim to illustrate that by carefully crafting input images, the attacker can indeed attack availability, i.e., timely and cost-affordable access to machine learning service. ...

APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers

... According to the previous work [55] and Ampere architecture equipped with sparse tensor cores [47, 46,45], currently there exists technical support for matrix multiplication with 50% fine-grained sparsity [55] 3 . Therefore, SSAM of 50% sparse perturbation has great potential to achieve true training acceleration via sparse back-propagation. ...

Towards Fully Sparse Training: Information Restoration with Spatial Similarity
  • Citing Article
  • June 2022

Proceedings of the AAAI Conference on Artificial Intelligence

... LQ-Nets [39] explores nonuniform quantization, which uses a set of floating-point values as the basis to represent the quantized values, and learns the basis by minimizing the quantization error. Wang et al. [40] binarized the activations to 0 or +1, exploring the sparsity of feature representations. The recent work [41] combined network quantization with NAS to explore bit-level sparsity. ...

Toward Accurate Binarized Neural Networks With Sparsity for Mobile Application
  • Citing Article
  • May 2022

IEEE Transactions on Neural Networks and Learning Systems

... To facilitate quicker computations, these parameters are subsequently converted to lower-precision fixedpoint numbers during the quantization procedure, typically 8bit integers. Post-training dynamic quantization adjusts to the varying input data range by performing quantization on a perbatch basis during inference, in contrast to static quantization, which uses a fixed quantization range for all the inputs [21]. ...

Optimization-Based Post-Training Quantization With Bit-Split and Stitching
  • Citing Article
  • March 2022

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Structured pruning [7, 14-16, 20, 28, 29, 45, 46, 49, 52, 54] eliminates entire filters, channels, or layers, leading to more regular and hardwarefriendly network architectures. Mainstream approaches in structured pruning include applying sparse regularization techniques to model parameters during training, such as LASSO [50] and ADMM [24]; dynamically adding masks to weights during training and inference for pruning (also known as soft pruning) [13,18,23]; and utilizing mathematical techniques like second-order Taylor approximation [48] and Variational Bayesian methods [57] for pruning solutions. However, the granularity of structured pruning can be too coarse, potentially resulting in the removal of important parameters. ...

Dynamic Dual Gating Neural Networks
  • Citing Conference Paper
  • October 2021

... We evaluate our method on ImageNet dataset for the large-scale image classification task, and compare the performance with other data-free quantization methods over various models. Here, GDFQ [25] and GZNQ [42] are the generative methods and they still utilize synthetic data to complete the quantization. dataset. ...

Generative Zero-shot Network Quantization
  • Citing Conference Paper
  • June 2021

... MS-G3D [45] further advances the performance to 88.7% by leveraging multi-scale convolution to learn richer features. CTR-GCN [46] and Shift-GCN++ [47], two recent advanced graph convolutional networks, achieve accuracies of 88.7% and 88.9% respectively. CTR-GCN introduces dynamic topology adjustment for enhanced flexibility, while Shift-GCN++ implements a lightweight displacement-based graph convolution mechanism. ...

Extremely Lightweight Skeleton-Based Action Recognition With ShiftGCN++
  • Citing Article
  • August 2021

IEEE Transactions on Image Processing

... Optimized im2col+gemm. This category includes proposals [12], [24]- [26] that optimize im2col+gemm implementation using specific optimized gemm kernels and/or by reducing the expensive memory overhead. The authors of [12] propose a convolution implementation specifically for ARM architectures (especially present in mobile devices) based on the key-idea that, unlike x86 architectures, high performance convolution implementations are bottlenecked by the cache-to-register memory transfers. ...

ECBC: Efficient Convolution via Blocked Columnizing
  • Citing Article
  • July 2021

IEEE Transactions on Neural Networks and Learning Systems

... The proposed method implemented the pixel-level mapping as "backend" using global colour mapping module proposed in [11] to learn the pixel-independent mapping between RAW and sRGB. Then, the local enhancement module (encoder-decoder [52] and ResNet structure [18]) is added to deal with the local-dependent mapping. Ultimately, the outputs of these three modules were fused by a set of learned weights from the weight predictor inspired by [48]. ...

EEDNet: Enhanced Encoder-Decoder Network for AutoISP

Lecture Notes in Computer Science