Article

Towards Fully Sparse Training: Information Restoration with Spatial Similarity


Abstract

The 2:4 structured sparsity pattern introduced with the NVIDIA Ampere architecture, which requires that every group of four consecutive values contain at least two zeros, enables doubled math throughput for matrix multiplication. Recent works mainly focus on inference speedup via 2:4 sparsity, while training acceleration has been largely overlooked, even though backpropagation consumes around 70% of the training time. However, unlike inference, training speedup with structured pruning is nontrivial due to the need to maintain the fidelity of gradients and to limit the additional overhead of imposing 2:4 sparsity online. For the first time, this article proposes fully sparse training (FST), where "fully" indicates that ALL matrix multiplications in forward/backward propagation are structurally pruned while maintaining accuracy. To this end, we begin with a saliency analysis, investigating the sensitivity of different sparse objects to structured pruning. Based on the observation of spatial similarity among activations, we propose pruning activations with fixed 2:4 masks. Moreover, an Information Restoration block is proposed to retrieve the lost information, which can be implemented by an efficient gradient-shift operation. Evaluation of accuracy and efficiency shows that we can achieve 2× training acceleration with negligible accuracy degradation on challenging large-scale classification and detection tasks.
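To make the 2:4 pattern concrete, the following is a minimal PyTorch sketch (not the authors' implementation, which relies on sparse tensor cores and, for activations, fixed masks) that zeroes the two smallest-magnitude entries in every group of four consecutive values along the last dimension; the function name and shapes are illustrative assumptions.

import torch

def prune_2to4(x: torch.Tensor) -> torch.Tensor:
    # Illustrative 2:4 pruning: keep the two largest-magnitude values in every
    # group of four consecutive elements along the last dimension.
    # Assumes the last dimension is a multiple of 4.
    orig_shape = x.shape
    groups = x.reshape(-1, 4)                     # view as groups of four
    topk = groups.abs().topk(k=2, dim=1).indices  # indices of the two largest magnitudes
    mask = torch.zeros_like(groups)
    mask.scatter_(1, topk, 1.0)                   # keep only those two entries
    return (groups * mask).reshape(orig_shape)

# Example: a small weight (or activation) tile
w = torch.randn(2, 8)
w_sparse = prune_2to4(w)
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).le(2).all()  # at most 2 nonzeros per group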


... According to the previous work [55] and the Ampere architecture equipped with sparse tensor cores [47, 46, 45], there currently exists technical support for matrix multiplication with 50% fine-grained sparsity [55]. Therefore, SSAM with a 50% sparse perturbation has great potential to achieve true training acceleration via sparse back-propagation. ...
Preprint
Deep neural networks often suffer from poor generalization caused by complex and non-convex loss landscapes. One of the popular solutions is Sharpness-Aware Minimization (SAM), which smooths the loss landscape by minimizing the maximum change of training loss when a perturbation is added to the weights. However, we find the indiscriminate perturbation of SAM on all parameters is suboptimal, and it also results in excessive computation, i.e., double the overhead of common optimizers like Stochastic Gradient Descent (SGD). In this paper, we propose an efficient and effective training scheme coined Sparse SAM (SSAM), which achieves sparse perturbation through a binary mask. To obtain the sparse mask, we provide two solutions based on Fisher information and dynamic sparse training, respectively. In addition, we theoretically prove that SSAM can converge at the same rate as SAM, i.e., O(log T/√T). Sparse SAM not only has the potential for training acceleration but also smooths the loss landscape effectively. Extensive experimental results on CIFAR10, CIFAR100, and ImageNet-1K confirm the superior efficiency of our method over SAM, and performance is preserved or even improved with a perturbation of merely 50% sparsity. Code is available at https://github.com/Mi-Peng/Sparse-Sharpness-Aware-Minimization.
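To illustrate the mechanism, here is a schematic PyTorch sketch of a SAM-style step whose ascent direction is restricted by a binary mask; the function name, signature, and the way the masks are supplied are assumptions, and the Fisher-information or dynamic-sparse mask selection from the paper is not reproduced.

import torch

def ssam_step(model, loss_fn, inputs, targets, masks, rho=0.05):
    # One schematic sparse-SAM-style update: perturb only masked parameters.
    # `masks` is a list of 0/1 tensors, one per trainable parameter; how the
    # masks are chosen (e.g., via Fisher information) is out of scope here.
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)          # gradient at the current point
    grads = torch.autograd.grad(loss, params)
    masked = [m * g for m, g in zip(masks, grads)]  # sparse ascent direction
    norm = torch.sqrt(sum((mg ** 2).sum() for mg in masked)) + 1e-12
    eps = [rho * mg / norm for mg in masked]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                               # climb to the perturbed point
    loss_perturbed = loss_fn(model(inputs), targets)
    loss_perturbed.backward()                       # gradient that drives the real update
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                               # undo the perturbation before optimizer.step()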
Conference Paper
Convolutional neural networks (CNNs) are becoming increasingly deeper, wider, and non-linear because of the growing demand for prediction accuracy and analysis quality. Wide and deep CNNs, however, require a large amount of computing resources and processing time. Many previous works have studied model pruning to improve inference performance, but little work has been done on effectively reducing training cost. In this paper, we propose ClickTrain: an efficient and accurate end-to-end training and pruning framework for CNNs. Different from existing pruning-during-training work, ClickTrain provides higher model accuracy and compression ratio via fine-grained architecture-preserving pruning. By leveraging pattern-based pruning with our proposed accurate weight importance estimation, dynamic pattern generation and selection, and compiler-assisted computation optimizations, ClickTrain generates highly accurate and fast pruned CNN models for direct deployment without any extra time overhead compared with the baseline training. ClickTrain also reduces the end-to-end time cost of the pruning-after-training method by up to 2.3X with comparable accuracy and compression ratio. Moreover, compared with the state-of-the-art pruning-during-training approach, ClickTrain provides significant improvements in both accuracy and compression ratio on the tested CNN models and datasets, under similarly limited training time.
Article
In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison, and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and directions for future research are discussed.
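As a minimal illustration of the response-based distillation this survey covers, a standard PyTorch formulation of the blended soft/hard loss might look as follows; the temperature and weighting values here are arbitrary assumptions.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # KL divergence between temperature-softened teacher and student outputs,
    # blended with the ordinary cross-entropy on the hard labels.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard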
Conference Paper
Large neural networks are difficult to deploy on mobile devices because of intensive computation and storage. To alleviate this, we study ternarization, a balance between efficiency and accuracy that quantizes both weights and activations into ternary values. In previous ternarized neural networks, a hard threshold Δ is introduced to determine quantization intervals. Although the selection of Δ greatly affects the training results, previous works estimate Δ via an approximation or treat it as a hyper-parameter, which is suboptimal. In this paper, we present Soft Threshold Ternary Networks (STTN), which enable the model to automatically determine quantization intervals instead of depending on a hard threshold. Concretely, we replace the original ternary kernel with the addition of two binary kernels at training time, where ternary values are determined by the combination of the two corresponding binary values. At inference time, we add up the two binary kernels to obtain a single ternary kernel. Our method dramatically outperforms the current state of the art, lowering the performance gap between full-precision networks and extreme low-bit networks. Experiments on ImageNet with AlexNet (Top-1 55.6%) and ResNet-18 (Top-1 66.2%) achieve new state-of-the-art results.
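One plausible reading of the two-binary-kernel construction, sketched in PyTorch below; the sign binarization and the 0.5 scale are simplifying assumptions of ours, and the paper's actual training procedure (straight-through estimation, scaling factors) is not reproduced.

import torch

def ternary_from_two_binary(w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    # sign(w1), sign(w2) lie in {-1, +1}, so their halved sum lies in {-1, 0, +1}:
    # two binary kernels combine into one ternary kernel.
    b1, b2 = torch.sign(w1), torch.sign(w2)
    return 0.5 * (b1 + b2)

w1, w2 = torch.randn(16, 3, 3, 3), torch.randn(16, 3, 3, 3)
t = ternary_from_two_binary(w1, w2)
print(t.unique())    # tensor([-1., 0., 1.])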
Article
Through the success of deep learning in various domains, artificial neural networks are currently among the most used artificial intelligence methods. Taking inspiration from the network properties of biological neural networks (e.g., sparsity, scale-freeness), we argue that (contrary to general practice) artificial neural networks, too, should not have fully-connected layers. Here we propose sparse evolutionary training of artificial neural networks, an algorithm which evolves an initial sparse topology (Erdős–Rényi random graph) between two consecutive layers of neurons into a scale-free topology during learning. Our method replaces the fully-connected layers of artificial neural networks with sparse ones before training, quadratically reducing the number of parameters with no decrease in accuracy. We demonstrate our claims on restricted Boltzmann machines, multi-layer perceptrons, and convolutional neural networks for unsupervised and supervised learning on 15 datasets. Our approach has the potential to enable artificial neural networks to scale up beyond what is currently possible.
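A small NumPy sketch of an Erdős–Rényi sparse initialization for one layer; the epsilon-scaled density formula is a common choice in sparse evolutionary training but is our assumption here, and the evolutionary rewiring step during training is omitted.

import numpy as np

def erdos_renyi_mask(n_in: int, n_out: int, epsilon: float = 10.0) -> np.ndarray:
    # Each connection exists independently with probability
    # epsilon * (n_in + n_out) / (n_in * n_out), so the parameter count grows
    # roughly linearly (rather than quadratically) with layer width.
    density = min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))
    return (np.random.rand(n_out, n_in) < density).astype(np.float32)

mask = erdos_renyi_mask(784, 300)
print("density:", mask.mean())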
Conference Paper
We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction (k divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we can update only 1-4% of the weights at each back propagation pass. This does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given.
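A minimal PyTorch sketch of the top-k gradient sparsification idea; the function name is illustrative, and the actual method applies the selection inside backpropagation so that only k rows or columns of the weight matrix are touched.

import torch

def topk_sparsify(grad: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k largest-magnitude entries of a gradient; zero the rest.
    flat = grad.flatten()
    idx = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)

g = torch.randn(1000)
g_sparse = topk_sparsify(g, k=30)        # roughly 3% of entries survive
print((g_sparse != 0).sum().item())      # 30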
Conference Paper
Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such a large model is both computation-intensive and memory-intensive. Deploying such a bulky model results in high power consumption and leads to a high total cost of ownership (TCO) for a data center. To speed up prediction and make it energy-efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model across multiple PEs for parallelism and schedules the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE), that works directly on the sparse LSTM model. Implemented on a Xilinx KU060 FPGA running at 200MHz, ESE achieves 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU, respectively.
Conference Paper
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single-stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
Article
Given M sampled image values of an incoherent object, what can be deduced as the most likely object? Using a communication-theory model for the process of image formation, we find that the most likely object has a maximum entropy and is represented by a restoring formula that is positive and not band limited. The derivation is an adaptation to optics of a formulation by Jaynes for unbiased estimates of positive probability functions. The restoring formula is tested, via computer simulation, upon noisy images of objects consisting of random impulses. These are found to be well restored, with resolution often exceeding the Rayleigh limit and with a complete absence of spurious detail. The proviso is that the noise in each image input must not exceed about 40% of the signal image. The restoring method is applied to experimental data consisting of line spectra. Results are consistent with those of the computer simulations.
Article
Direct convolution methods are now drawing increasing attention as they eliminate the additional storage demanded by indirect convolution algorithms (i.e., the transformed matrix generated by the im2col convolution algorithm). Nevertheless, the direct methods require special input-output tensor formatting, leading to extra time and memory consumption to get the desired data layout. In this article, we show that indirect convolution, if implemented properly, is able to achieve high computational performance with the help of highly optimized matrix-multiplication subroutines while avoiding substantial memory overhead. The proposed algorithm is called efficient convolution via blocked columnizing (ECBC). Inspired by the im2col convolution algorithm and the block algorithm of general matrix-to-matrix multiplication, we propose to conduct the convolution computation block by block. As a result, the tensor-to-matrix transformation process (e.g., the im2col operation) can also be done in a blockwise manner, so that it only requires a memory block as small as the data block. Extensive experiments on various platforms and networks validate the effectiveness of ECBC, as well as the superiority of our proposed method against a set of widely used industrial-level convolution algorithms.
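A NumPy sketch of the blockwise im2col-plus-GEMM idea (not the ECBC implementation): only a small block of the column matrix is materialized before each small matrix multiplication. The shapes, stride-1/no-padding setting, and block size are simplifying assumptions.

import numpy as np

def conv2d_blocked_im2col(x, w, block_rows=64):
    # x: (C, H, W), w: (K, C, R, S); stride 1, no padding.
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1
    w_mat = w.reshape(K, -1)                              # (K, C*R*S)
    out = np.empty((K, OH * OW), dtype=x.dtype)
    positions = [(i, j) for i in range(OH) for j in range(OW)]
    for start in range(0, len(positions), block_rows):
        blk = positions[start:start + block_rows]
        # build only this block of the im2col matrix, then run a small GEMM
        cols = np.stack([x[:, i:i + R, j:j + S].reshape(-1) for i, j in blk], axis=1)
        out[:, start:start + len(blk)] = w_mat @ cols
    return out.reshape(K, OH, OW)

y = conv2d_blocked_im2col(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
print(y.shape)    # (4, 6, 6)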
Article
Deep convolutional neural networks have achieved remarkable progress in recent years. However, the large volume of intermediate results generated during inference poses a significant challenge to accelerator design for resource-constrained FPGAs. Due to the limited on-chip storage, partial results of intermediate layers are frequently transferred back and forth between on-chip memory and off-chip DRAM, leading to a non-negligible increase in latency and energy consumption. In this paper, we propose block convolution, a hardware-friendly, simple, yet efficient convolution operation that can completely avoid the off-chip transfer of intermediate feature maps at run-time. The fundamental idea of block convolution is to eliminate the dependency between feature map tiles in the spatial dimension when spatial tiling is used, which is realized by splitting a feature map into independent blocks so that convolution can be performed separately on individual blocks. We conduct extensive experiments to demonstrate the efficacy of the proposed block convolution on both the algorithm side and the hardware side. Specifically, we evaluate block convolution on 1) VGG-16, ResNet-18, ResNet-50, and MobileNet-V1 for the ImageNet classification task; 2) SSD and FPN for the COCO object detection task; and 3) VDSR for the Set5 single-image super-resolution task. Experimental results demonstrate that comparable or higher accuracy can be achieved with block convolution. We also showcase two CNN accelerators via algorithm/hardware co-design based on block convolution on memory-limited FPGAs, and evaluation shows that both accelerators substantially outperform the baseline without off-chip transfer of intermediate feature maps.
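A PyTorch sketch of the block-convolution idea: the feature map is split into independent spatial tiles and each tile is convolved separately with zero padding, so the result differs from a standard convolution only at tile borders. The tile size, 3×3 kernel, and stride-1 setting are arbitrary assumptions for the sketch.

import torch
import torch.nn.functional as F

def block_conv2d(x, weight, bias=None, block=56):
    # Assumes a 3x3 kernel, stride 1, and height/width divisible by `block`.
    rows = []
    for i in range(0, x.shape[2], block):
        cols = []
        for j in range(0, x.shape[3], block):
            tile = x[:, :, i:i + block, j:j + block]   # independent tile
            cols.append(F.conv2d(tile, weight, bias, stride=1, padding=1))
        rows.append(torch.cat(cols, dim=3))
    return torch.cat(rows, dim=2)

x = torch.randn(1, 16, 112, 112)
w = torch.randn(32, 16, 3, 3)
print(block_conv2d(x, w).shape)    # torch.Size([1, 32, 112, 112])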
Article
Sparsity, as an intrinsic property of convolutional neural networks (CNNs), has been widely employed for hardware acceleration, and many customized accelerators tailored for sparse weights or activations have been proposed in recent years. However, the irregular sparse patterns introduced by both weights and activations are much more challenging for efficient computation. For example, due to the issues of access contention, workload imbalance, and tile fragmentation, the state-of-the-art sparse accelerator SCNN fails to fully leverage the benefits of sparsity, leading to non-optimal results for both speedup and energy efficiency. In this article, we propose an efficient sparse CNN accelerator for both weights and activations, namely the fine-grained systolic accelerator (FSA), which jointly optimizes the hardware dataflow and the software partitioning and scheduling strategy. Specifically, to deal with the access contention problem, we present a fine-grained systolic dataflow, in which the activations move rhythmically along the horizontal processing element array while the weights are fed into the array in a fine-grained order. We then propose a hybrid network partitioning strategy that sets different partitioning strategies for different layers to balance the workload and alleviate the fragmentation problem caused by both sparse weights and activations. Finally, we present a scheduling search strategy to find optimized schedules for neural networks, which can further improve energy efficiency. Extensive evaluations show that the proposed FSA consistently outperforms SCNN on AlexNet, VGGNet, GoogLeNet, and ResNet with an average speedup of 1.74× and up to 13.86× higher energy efficiency.
Conference Paper
State-of-the-art convolutional neural networks (CNNs) used in vision applications have large models with numerous weights. Training these models is very compute- and memory-resource intensive. Much research has been done on pruning or compressing these models to reduce the cost of inference, but little work has addressed the costs of training. We focus precisely on accelerating training. We propose PruneTrain, a cost-efficient mechanism that gradually reduces the training cost during training. PruneTrain uses a structured group-lasso regularization approach that drives the training optimization toward both high accuracy and small weight values. Small weights can then be periodically removed by reconfiguring the network model to a smaller one. By using a structured-pruning approach and additional reconfiguration techniques we introduce, the pruned model can still be efficiently processed on a GPU accelerator. Overall, PruneTrain achieves a reduction of 39% in the end-to-end training time of ResNet50 for ImageNet by reducing computation cost by 40% in FLOPs, memory accesses by 37% for memory bandwidth bound layers, and the inter-accelerator communication by 55%.
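As a concrete illustration of the structured group-lasso idea that drives whole groups of weights toward zero during training, here is a PyTorch sketch with one L2 group per convolutional filter; the penalty coefficient is arbitrary, and PruneTrain's periodic network reconfiguration step is not shown.

import torch

def group_lasso_penalty(model, lam=1e-4):
    # Sum of L2 norms over filter groups: pushes entire output filters toward
    # zero so they can later be removed without changing layer types.
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d):
            penalty = penalty + m.weight.flatten(1).norm(dim=1).sum()
    return lam * penalty

# usage: loss = criterion(model(x), y) + group_lasso_penalty(model)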
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Conference Paper
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values, resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of number of the high precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
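For reference, the binary-weight approximation W ≈ α·sign(W) with a per-filter scale can be sketched as follows in PyTorch; treating α as the mean absolute value of each filter is the usual formulation, and the shapes here are arbitrary.

import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    # One scaling factor per output filter: alpha = mean(|W|) over that filter.
    alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)
    return alpha * torch.sign(w)

w = torch.randn(64, 3, 3, 3)
w_bin = binarize_weights(w)
print(torch.norm(w - w_bin) / torch.norm(w))    # relative approximation error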
Article
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; Exploiting sparsity saves 10×; Weight sharing gives 8×; Skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.
Article
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9x, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG16 found that the network as a whole can be reduced 6.8x just by pruning the fully-connected layers, again with no loss of accuracy.
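A minimal PyTorch sketch of the prune step in the train-prune-retrain pipeline described above, using a single global magnitude threshold; the 90% sparsity target and the global (rather than per-layer) threshold are simplifying assumptions, and the retraining loop that keeps reapplying the masks is only indicated in a comment.

import torch

def magnitude_prune(model, sparsity=0.9):
    # Zero the smallest-magnitude weights so that `sparsity` of all entries in
    # weight matrices/filters are removed; retraining with fixed masks follows.
    all_weights = torch.cat([p.detach().flatten().abs()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() > 1:
                masks[name] = (p.abs() > threshold).float()
                p.mul_(masks[name])
    return masks   # reapply these masks after every optimizer step during retraining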