October 2024 · IEEE Transactions on Pattern Analysis and Machine Intelligence
Non-maximum suppression (NMS) is an essential post-processing step for object detection. The de-facto standard for NMS, GreedyNMS, is not parallelizable and can thus become the performance bottleneck in object detection pipelines. MaxpoolNMS was introduced as a fast and parallelizable alternative to GreedyNMS, but it can only replace GreedyNMS at the first stage of two-stage detectors such as Faster R-CNN. To address this issue, we observe that MaxpoolNMS replaces the nested-loop pipeline of GreedyNMS with box coordinate discretization followed by local score argmax calculation, which enables parallelizable implementations. In this paper, we introduce a simple Relationship Recovery module and a Pyramid Shifted MaxpoolNMS module to improve these two stages, respectively. With these two modules, our PSRR-MaxpoolNMS is a generic and parallelizable approach that can completely replace GreedyNMS at all stages in all detectors. Furthermore, we extend PSRR-MaxpoolNMS to the more powerful PSRR-MaxpoolNMS++. For box coordinate discretization, we propose Density-based Discretization for better adherence to the target density of the suppression. For local score argmax calculation, we propose an Adjacent Scale Pooling scheme that mines duplicated box pairs more accurately and efficiently. Extensive experiments demonstrate that both PSRR-MaxpoolNMS and PSRR-MaxpoolNMS++ outperform MaxpoolNMS by a large margin. Additionally, PSRR-MaxpoolNMS++ not only surpasses PSRR-MaxpoolNMS but also attains competitive accuracy and much better efficiency compared with GreedyNMS. Therefore, PSRR-MaxpoolNMS++ is a parallelizable NMS solution that can effectively replace GreedyNMS at all stages in all detectors.
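To make the parallelizable structure described above concrete, here is a minimal NumPy sketch of the two MaxpoolNMS-style stages the abstract refers to: box coordinate discretization and local score argmax via max pooling. The grid cell size, pooling window, and function name are illustrative assumptions, not the implementation from the paper.

```python
import numpy as np

def maxpool_nms_sketch(boxes, scores, img_size=640, cell=16, pool=3):
    """Illustrative MaxpoolNMS-style suppression (not the paper's exact algorithm).

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) confidence scores
    Returns indices of boxes kept after the local score argmax step.
    """
    grid = img_size // cell
    score_map = np.zeros((grid, grid), dtype=np.float32)

    # 1) Box coordinate discretization: map each box center to a grid cell,
    #    keeping only the highest score that falls into each cell.
    cx = ((boxes[:, 0] + boxes[:, 2]) / 2 / cell).astype(int).clip(0, grid - 1)
    cy = ((boxes[:, 1] + boxes[:, 3]) / 2 / cell).astype(int).clip(0, grid - 1)
    np.maximum.at(score_map, (cy, cx), scores)

    # 2) Local score argmax via max pooling: a cell survives only if it holds
    #    the maximum score within its pooling window.
    pad = pool // 2
    padded = np.pad(score_map, pad, constant_values=-np.inf)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (pool, pool))
    pooled = windows.max(axis=(-1, -2))
    keep_mask = (score_map == pooled) & (score_map > 0)

    # A box is kept if its score equals the surviving score of its own cell.
    return [i for i in range(len(boxes))
            if keep_mask[cy[i], cx[i]] and scores[i] == score_map[cy[i], cx[i]]]
```

In this sketch a box survives only if its score is the maximum within its pooling neighborhood, which is what removes the data-dependent nested loops of GreedyNMS; the Relationship Recovery and Pyramid Shifted MaxpoolNMS modules from the paper are not modeled here.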
June 2024 · 2 Citations · IEEE Transactions on Neural Networks and Learning Systems
Deep neural networks (DNNs) have been widely used in many artificial intelligence (AI) tasks. However, deploying them brings significant challenges due to the huge costs in memory, energy, and computation. To address these challenges, researchers have developed various model compression techniques such as model quantization and model pruning. Recently, there has been a surge in research on compression methods that achieve model efficiency while retaining performance. Furthermore, more and more works focus on customizing DNN hardware accelerators to better leverage model compression techniques. In addition to efficiency, preserving security and privacy is critical for deploying DNNs. However, the vast and diverse body of related works can be overwhelming. This inspires us to conduct a comprehensive survey of recent research toward the goal of high-performance, cost-efficient, and safe deployment of DNNs. Our survey first covers the mainstream model compression techniques, such as model quantization, model pruning, knowledge distillation, and optimizations of nonlinear operations. We then introduce recent advances in designing hardware accelerators that can adapt to efficient model compression approaches. In addition, we discuss how homomorphic encryption can be integrated to secure DNN deployment. Finally, we discuss several issues, such as hardware evaluation, generalization, and the integration of various compression approaches. Overall, we aim to provide a big picture of efficient DNNs, from algorithms to hardware accelerators and security perspectives.
March 2024 · 3 Citations
January 2023
The current decade is poised to see a clear transition of technologies away from the de-facto standards. After supporting tremendous growth in speed, density, and energy efficiency, newer CMOS technology nodes provide diminishing returns, thereby paving the way for newer, non-CMOS technologies. Multiple such technologies are already commercially available to satisfy the requirements of specific market segments. Additionally, researchers have demonstrated multiple system prototypes built from these technologies, which co-exist with CMOS technologies. Apart from clearly pushing the limits of performance and energy efficiency, the new technologies present opportunities to extend architectural limits, e.g., in-memory computing, and computing limits, e.g., quantum computing. The eventual adoption of these technologies depends on various challenges at the device, circuit, architecture, and system levels, as well as on robust design automation flows. In this chapter, a perspective on these emerging trends in manufacturing technologies, memory technologies, and computing technologies is presented. The chapter concludes with a study of the limits of these technologies.
October 2022 · 4 Citations · Lecture Notes in Computer Science
Allocating different bit widths to different channels and quantizing them independently brings higher quantization precision and accuracy. Most prior works use an equal bit width to quantize all layers or channels, which is sub-optimal. On the other hand, it is very challenging to explore the hyperparameter space of channel bit widths, as the search space grows exponentially with the number of channels, which can be tens of thousands in a deep neural network. In this paper, we address the problem of efficiently exploring the hyperparameter space of channel bit widths. We formulate the quantization of deep neural networks as a rate-distortion optimization problem, and present an ultra-fast algorithm to search the bit allocation of channels. Our approach has only linear time complexity and can find the optimal bit allocation within a few minutes on CPU. In addition, we provide an effective way to improve performance on target hardware platforms. We restrict the bit rate (size) of each layer to allow as many weights and activations as possible to be stored on-chip, and incorporate hardware-aware constraints into our objective function. The hardware-aware constraints do not cause additional overhead to the optimization, and have a very positive impact on hardware performance. Experimental results show that our approach achieves state-of-the-art results on four deep neural networks, ResNet-18, ResNet-34, ResNet-50, and MobileNet-v2, on ImageNet. Hardware simulation results demonstrate that our approach is able to bring up to 3.5× and 3.0× speedups on two deep-learning accelerators, TPU and Eyeriss, respectively.
Keywords: Deep learning · Quantization · Rate-distortion theory
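As a rough illustration of casting bit allocation as a rate-distortion problem, the sketch below greedily assigns per-channel bit widths under a total bit budget, using the mean-squared quantization error of each channel as the distortion measure. This is a simple greedy heuristic for exposition, not the paper's linear-time algorithm; the function names, candidate bit widths, and budget definition are assumptions.

```python
import heapq
import numpy as np

def quant_mse(w, bits):
    """MSE of symmetric uniform quantization of channel weights w at a given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if qmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    return float(np.mean((w - q) ** 2))

def allocate_bits(channels, budget_bits, candidates=(2, 3, 4, 5, 6, 7, 8)):
    """Greedy rate-distortion bit allocation across channels (illustrative only).
    `channels` is a list of 1-D weight arrays; `budget_bits` is the average
    bit width allowed per weight."""
    n = len(channels)
    bits = [candidates[0]] * n                         # start from the cheapest rate
    dist = [quant_mse(w, candidates[0]) for w in channels]
    total = sum(b * len(w) for b, w in zip(bits, channels))
    budget = budget_bits * sum(len(w) for w in channels)

    # Priority queue of the best "distortion drop per extra stored bit" move per channel.
    heap = []
    for i, w in enumerate(channels):
        nxt = candidates[1]
        gain = (dist[i] - quant_mse(w, nxt)) / ((nxt - bits[i]) * len(w))
        heapq.heappush(heap, (-gain, i, nxt))

    while heap:
        neg_gain, i, nxt = heapq.heappop(heap)
        extra = (nxt - bits[i]) * len(channels[i])
        if total + extra > budget:
            continue                                   # move does not fit in the budget
        bits[i], dist[i] = nxt, quant_mse(channels[i], nxt)
        total += extra
        j = candidates.index(nxt) + 1
        if j < len(candidates):                        # queue the next upgrade for this channel
            new = candidates[j]
            gain = (dist[i] - quant_mse(channels[i], new)) / ((new - nxt) * len(channels[i]))
            heapq.heappush(heap, (-gain, i, new))
    return bits
```

The priority queue always spends the next bits where the distortion reduction per additional stored bit is largest, which is the intuition behind rate-distortion-optimal allocation; the paper's hardware-aware constraints (keeping each layer small enough to stay on-chip) would enter as additional caps on the per-layer totals.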
June 2022 · 8 Citations
May 2022 · 2 Citations
As Deep Neural Networks (DNNs) are usually overparameterized and have millions of weight parameters, it is challenging to deploy these large DNN models on resource-constrained hardware platforms, e.g., smartphones. Numerous network compression methods, such as pruning and quantization, have been proposed to reduce the model size significantly, the key to which is finding a suitable compression allocation (e.g., pruning sparsity and quantization codebook) for each layer. Existing solutions obtain the compression allocation in an iterative/manual fashion while finetuning the compressed model, and thus suffer from an efficiency issue. Different from the prior art, we propose a novel One-shot Pruning-Quantization (OPQ) in this paper, which analytically solves the compression allocation with pre-trained weight parameters only. During finetuning, the compression module is fixed and only the weight parameters are updated. To our knowledge, OPQ is the first work to reveal that a pre-trained model is sufficient for solving pruning and quantization simultaneously, without any complex iterative/manual optimization at the finetuning stage. Furthermore, we propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook, which leads to low bit-rate allocation without introducing the extra overhead of traditional channel-wise quantization. Comprehensive experiments on ImageNet with AlexNet/MobileNet-V1/ResNet-50 show that our method improves accuracy and training efficiency while obtaining significantly higher compression rates compared to the state-of-the-art.
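The two ideas in the abstract, deriving the pruning mask analytically from the pre-trained weights and making all channels of a layer share one quantization codebook, can be sketched in a few lines of NumPy. This is only a schematic of those ideas under assumed choices (global magnitude pruning, a uniform shared codebook, per-channel scales); it is not the analytical OPQ solver from the paper.

```python
import numpy as np

def one_shot_prune_quantize(weight, sparsity=0.5, bits=4):
    """Illustrative one-shot pruning + unified channel-wise quantization.

    weight:   (out_channels, in_features) pre-trained layer weights
    sparsity: fraction of weights to prune by magnitude
    bits:     codebook bit width shared by all channels of the layer
    """
    # 1) One-shot pruning: derive the mask analytically from pre-trained weights.
    thresh = np.quantile(np.abs(weight), sparsity)
    mask = (np.abs(weight) > thresh).astype(weight.dtype)
    pruned = weight * mask

    # 2) Unified channel-wise quantization: every channel is scaled into a common
    #    range, then all channels share one uniform codebook of 2^bits levels.
    scale = np.abs(pruned).max(axis=1, keepdims=True)   # per-channel scale
    scale[scale == 0] = 1.0
    levels = 2 ** (bits - 1) - 1
    codes = np.clip(np.round(pruned / scale * levels), -levels, levels)
    dequant = codes / levels * scale * mask              # reconstructed weights

    return dequant, mask, codes.astype(np.int8), scale
```

Because the mask and codebook here depend only on the pre-trained weights, finetuning could proceed with the compression module frozen, which mirrors the one-shot setting the abstract describes.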
March 2022 · 6 Citations
January 2022 · 11 Citations
... Instead, they compute the denominator online using a temporary maximum value, which is updated dynamically whenever a new maximum is encountered. A more general method for approximating any nonlinearity within Transformer networks, introduced by Yu et al. [38] and applied in ViTA [39], involves training a two-layer fully-connected neural network with ReLU activation to replicate the nonlinear functions. This network is then replaced by a look-up table, enabling the approximation of these functions through a single look-up operation and one multiply-accumulate. ...
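A minimal sketch of the look-up-table idea in this excerpt is shown below: the nonlinearity is tabulated once, and each evaluation then costs a single table look-up plus one multiply-accumulate for the local linear correction. For brevity the table is built directly from the target function rather than from the intermediate two-layer ReLU network described in [38]; the table size, input range, and function names are assumptions.

```python
import numpy as np

def build_lut(fn, lo=-8.0, hi=8.0, entries=256):
    """Tabulate a nonlinearity so inference needs one look-up plus one multiply-accumulate."""
    xs = np.linspace(lo, hi, entries)
    ys = fn(xs)
    slopes = np.diff(ys, append=ys[-1]) / (xs[1] - xs[0])    # per-entry linear slope
    return xs, ys, slopes

def lut_eval(x, xs, ys, slopes):
    """Approximate fn(x): index into the table, then one MAC for the local linear term."""
    step = xs[1] - xs[0]
    idx = np.clip(((x - xs[0]) / step).astype(int), 0, len(xs) - 1)
    return ys[idx] + slopes[idx] * (x - xs[idx])              # look-up + multiply-accumulate

# Example: approximate GELU (tanh form) with a 256-entry table.
gelu = lambda v: 0.5 * v * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v ** 3)))
xs, ys, slopes = build_lut(gelu, -8.0, 8.0, 256)
print(lut_eval(np.linspace(-6, 6, 5), xs, ys, slopes))        # close to gelu at those points
```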
March 2024
... Hardware-accelerated PPML architectures (e.g., sparse homomorphic convolution kernels) [261] could mitigate latency bottlenecks in encrypted sensor fusion. Context-aware DP frameworks can employ advanced adaptive techniques [251] for noise budgets based on environmental risk, such as reducing noise in certain areas and increasing it in sensitive zones. ...
June 2024
IEEE Transactions on Neural Networks and Learning Systems
... The importance of quantization is especially pronounced for on-device deployments, where model compression is often a necessity rather than an option due to strict memory budgets. In such scenarios, it is common practice to explore the trade-off between model size and quality by adjusting the quantization levels, in order to find the most effective solution within the given memory constraints [8,14,15,57,60]. ...
October 2022
Lecture Notes in Computer Science
... Noniterative pruning methods drastically reduce the number of training iterations required, down to a single experiment. The papers [44][45][46] describe several methods of noniterative pruning. These pruning methods provide model-size reduction, faster inference, and energy efficiency. ...
May 2021
Proceedings of the AAAI Conference on Artificial Intelligence
... Amorphous oxide semiconductor (AOS) MOSFETs emerged as a promising choice for BEOL transistors due to their ultralow leakage current [1], [2], [3], attributed to a relatively high bandgap, along with sufficient carrier mobility [4], [5], [6], robust electrostatic control [7], [8], and compatibility with BEOL processing [9], [10], [11]. Among various AOS channel materials, such as SnO2 [12], ZnO [13], In2O3 [14], and InGaZnO4 (IGZO) [7], sputtered indium tungsten oxide (IWO) transistors demonstrate outstanding electrical performance, particularly in scaled device architectures [9], [15], [16]. W stands out as a superior dopant for In2O3-based channels compared to common alternatives, such as zinc (Zn), gallium (Ga), and tin (Sn), due to its unique material properties. ...
June 2022
... To address the loss of model accuracy with IoU approximation, enhancements to the NMS algorithm can minimize redundant calculations and leverage parallelism without accuracy loss. Chen et al. [58] enhance NMS by routing the bounding boxes to three output head branches, each for a different ratio, and selecting the most suitable box after merging. This method increases the parallelism but faces challenges in maintaining consistent parallelism across different ratios. ...
March 2022
... Recently, an emerging ultrasonic wavefront computing (WFC) technique was proposed to compute the FFT [11], [12]. This method uses the principles of wave mechanics in the acoustic domain by implementing the Fourier transform through ultrasonic waves propagating within silicon. ...
January 2022
... Considering dimensional invariance in time, this method employs ℓ1 regularization on the left and right singular matrices derived from SVD, resulting in a column-wise and row-wise sparse matrix without dimension distortion.
[Table residue from the citing paper: [17], [4], [117]; sparsification (zero out insignificant weights) [124], [86], [139]; weight sharing (share weights across different connections) [106], [82], [123]; knowledge distillation (transfer knowledge learned from teacher to student) [80], [85], [119]; orthogonal integration: quantization (reduce precision) [78], [72], [104]; entropy coding (encode weights into binary codewords) [20], [140], [14]] ...
December 2021
... The result is a set of independent detections. Although some of the current improved NMS algorithms [21,22,23,24,25,26,27,28] add complex logic calculations to improve the reliability of filtering redundant bounding boxes, these algorithms are not hardware-friendly. Before performing the IoU calculations, the candidate bounding boxes are first sorted by their confidence scores in the traditional NMS algorithm. This sequential approach is not suitable for efficient data flow in the overall system. ...
June 2021
... In contrast, data stored in non-volatile memory (NVM) remain in the memory element when the externally applied power is removed or terminated. The next-generation NVMs developed to date include magnetoresistive memory (MRAM), phase-change memory (PRAM), ferroelectric memory (FRAM), and resistive memory (RRAM) [17][18][19]. Among these types of NVMs, resistive memory is one of the most promising, offering many advantages such as fast write and erase, a simple component structure, low operating voltage, and low power consumption. ...
August 2021