April 2025 · 3 Reads · 1 Citation
March 2025 · 6 Reads
January 2025 · 75 Reads
The emergence of new neural network capabilities invariably brings a significant surge in computational demands, driven by growing model sizes and increasing computational complexity. To reduce model size and lower inference costs, recent research has focused on simplifying models and designing hardware accelerators using low-bit quantization. However, due to numerical representation limits, scalar quantization cannot reduce the bit width below 1 bit, diminishing its benefits. To break through these limitations, we introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator Framework that uses vector quantization to convert neural network models into LUTs, achieving extreme low-bit quantization. The LUT-DLA framework facilitates efficient and cost-effective hardware accelerator designs and supports the LUTBoost algorithm, which transforms various DNN models into LUT-based models via multistage training, drastically cutting both computational and hardware overhead. Additionally, through co-design space exploration, LUT-DLA assesses the impact of various model and hardware parameters to fine-tune hardware configurations for different application scenarios, optimizing performance and efficiency. Our comprehensive experiments show that LUT-DLA improves power efficiency and area efficiency by at least 1.4× and 1.5×, respectively, while incurring only a modest accuracy drop: for CNNs the drop remains small across the evaluated distance similarities (including the Chebyshev distance), and for transformer-based models it is likewise limited.
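The core idea, replacing multiply-accumulate arithmetic with table lookups via vector quantization, can be illustrated with a small product-quantization-style sketch. The code below is a generic illustration under assumed sizes and random codebooks, not the LUT-DLA or LUTBoost implementation; in practice the codebooks would be learned (e.g., during the multistage training the abstract describes).

```python
# Minimal sketch of LUT-based inference via vector quantization (product-
# quantization style); not the authors' LUT-DLA code. A linear layer y = x @ W
# is replaced by lookups: the input is split into sub-vectors, each sub-vector
# is mapped to its nearest centroid, and centroid-times-weight partial
# products are precomputed into a LUT.
import numpy as np

rng = np.random.default_rng(0)

D, M = 64, 128          # input dim, output dim (illustrative sizes)
S = 8                   # sub-vector length
K = 16                  # centroids per sub-space -> 4-bit codes
G = D // S              # number of sub-spaces

W = rng.standard_normal((D, M)).astype(np.float32)
# Codebooks would normally be learned (k-means or end-to-end training);
# random centroids keep this sketch self-contained.
codebooks = rng.standard_normal((G, K, S)).astype(np.float32)

# Offline: precompute the LUT of partial dot products, one entry per
# (sub-space, centroid) pair: lut[g, k, :] = codebooks[g, k] @ W[g*S:(g+1)*S].
lut = np.einsum('gks,gsm->gkm', codebooks,
                W.reshape(G, S, M)).astype(np.float32)

def lut_linear(x):
    """Approximate x @ W using nearest-centroid codes and table lookups."""
    xs = x.reshape(G, S)
    # Encode: nearest centroid per sub-vector (squared Euclidean distance).
    dists = ((xs[:, None, :] - codebooks) ** 2).sum(-1)   # (G, K)
    codes = dists.argmin(axis=1)                          # (G,)
    # Decode: sum precomputed partial products instead of multiplying.
    return lut[np.arange(G), codes].sum(axis=0)           # (M,)

x = rng.standard_normal(D).astype(np.float32)
print(np.abs(lut_linear(x) - x @ W).mean())  # approximation error
```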
November 2024 · 3 Reads · 7 Citations
October 2024 · 3 Reads
IEEE Transactions on Pattern Analysis and Machine Intelligence
Non-maximum suppression (NMS) is an essential post-processing step for object detection. The de facto standard for NMS, GreedyNMS, is not parallelizable and can thus become the performance bottleneck in object detection pipelines. MaxpoolNMS was introduced as a fast and parallelizable alternative to GreedyNMS, but it can only replace GreedyNMS at the first stage of two-stage detectors such as Faster R-CNN. To address this issue, we observe that MaxpoolNMS replaces the nested-loop pipeline of GreedyNMS with box coordinate discretization followed by local score argmax calculation, which enables parallelizable implementations. In this paper, we introduce a simple Relationship Recovery module and a Pyramid Shifted MaxpoolNMS module to improve these two stages, respectively. With these two modules, our PSRR-MaxpoolNMS is a generic and parallelizable approach that can completely replace GreedyNMS at all stages in all detectors. Furthermore, we extend PSRR-MaxpoolNMS to the more powerful PSRR-MaxpoolNMS++. For box coordinate discretization, we propose Density-based Discretization for better adherence to the target density of the suppression. For local score argmax calculation, we propose an Adjacent Scale Pooling scheme that mines duplicated box pairs more accurately and efficiently. Extensive experiments demonstrate that both PSRR-MaxpoolNMS and PSRR-MaxpoolNMS++ outperform MaxpoolNMS by a large margin. Additionally, PSRR-MaxpoolNMS++ not only surpasses PSRR-MaxpoolNMS but also attains competitive accuracy and much better efficiency compared with GreedyNMS. Therefore, PSRR-MaxpoolNMS++ is a parallelizable NMS solution that can effectively replace GreedyNMS at all stages in all detectors.
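To make the two stages concrete, here is a minimal sketch of the discretize-then-local-argmax idea that MaxpoolNMS-style methods rely on. The image size, grid resolution, and plain stride-1 max pool are placeholder assumptions; this illustrates the general mechanism, not the Relationship Recovery, Pyramid Shifted, or PSRR++ modules proposed in the paper.

```python
# Sketch of maxpool-based NMS: (1) box coordinate discretization onto a score
# map, (2) local score argmax via max pooling. Both stages are embarrassingly
# parallel, unlike GreedyNMS's sequential nested loop. Assumes positive scores.
import numpy as np

def maxpool_nms_sketch(boxes, scores, img_size=640, grid=64, pool=3):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns kept indices."""
    cell = img_size / grid
    cx = ((boxes[:, 0] + boxes[:, 2]) / 2 / cell).astype(int).clip(0, grid - 1)
    cy = ((boxes[:, 1] + boxes[:, 3]) / 2 / cell).astype(int).clip(0, grid - 1)

    # Stage 1: discretization -- keep only the highest-scoring box per cell.
    score_map = np.zeros((grid, grid), dtype=np.float32)
    owner = -np.ones((grid, grid), dtype=int)
    for i in np.argsort(scores):             # higher scores overwrite lower ones
        score_map[cy[i], cx[i]] = scores[i]
        owner[cy[i], cx[i]] = i

    # Stage 2: local argmax -- a cell survives only if it is the maximum
    # within its pool x pool neighbourhood (a stride-1 max pool).
    pad = pool // 2
    padded = np.pad(score_map, pad)
    pooled = np.max(
        np.stack([padded[dy:dy + grid, dx:dx + grid]
                  for dy in range(pool) for dx in range(pool)]), axis=0)
    keep_mask = (score_map == pooled) & (score_map > 0)
    return owner[keep_mask]
```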
July 2024 · 5 Reads
Vision transformers (ViTs) have emerged as a promising alternative to convolutional neural networks for various image analysis tasks, offering comparable or superior performance. However, one significant drawback of ViTs is their resource-intensive nature, leading to an increased memory footprint, computation complexity, and power consumption. To democratize this high-performance technology and make it more environmentally friendly, it is essential to compress ViT models, reducing their resource requirements while maintaining high performance. In this paper, we introduce a new block-structured pruning method to address the resource-intensive issue of ViTs, offering a balanced trade-off between accuracy and hardware acceleration. Unlike unstructured pruning or channel-wise structured pruning, block pruning leverages the block-wise structure of linear layers, resulting in more efficient matrix multiplications. To optimize this pruning scheme, we propose a novel hardware-aware learning objective that simultaneously maximizes speedup and minimizes power consumption during inference, tailored to the block sparsity structure. This objective eliminates the need for empirical look-up tables and focuses solely on reducing parametrized layer connections. Moreover, we provide a lightweight algorithm to achieve post-training pruning for ViTs, utilizing second-order Taylor approximation and empirical optimization to solve the proposed hardware-aware objective. Extensive experiments on ImageNet across various ViT architectures, including DeiT-B and DeiT-S, demonstrate competitive performance with other pruning methods and a remarkable balance between accuracy preservation and power savings. In particular, we achieve speedups of up to 3.93× and 1.79× on dedicated hardware and GPUs, respectively, for DeiT-B, and also observe an inference power reduction of 1.4× on real-world GPUs.
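As a rough illustration of the block-sparsity pattern such pruning produces, the sketch below partitions a weight matrix into square blocks and zeroes out the lowest-magnitude ones. The block size, the Frobenius-norm scoring, and the target sparsity are placeholder assumptions; the paper's method instead ranks blocks with its hardware-aware objective solved via a second-order Taylor approximation.

```python
# Minimal sketch of block-structured pruning on a linear layer's weight matrix.
# Plain magnitude (Frobenius norm) is used as the block score purely to show
# the block-sparsity pattern; it is not the paper's hardware-aware criterion.
import numpy as np

def block_prune(W, block=32, sparsity=0.5):
    """Zero out the lowest-magnitude (block x block) blocks of W."""
    R, C = W.shape
    assert R % block == 0 and C % block == 0, "pad W to a multiple of block"
    blocks = W.reshape(R // block, block, C // block, block)
    scores = np.linalg.norm(blocks, axis=(1, 3))      # per-block Frobenius norm
    k = int(sparsity * scores.size)                   # number of blocks to drop
    threshold = np.sort(scores, axis=None)[k]         # k-th smallest block score
    mask = (scores >= threshold)[:, None, :, None]    # broadcast to block shape
    return (blocks * mask).reshape(R, C), mask.squeeze()

W = np.random.default_rng(0).standard_normal((768, 3072)).astype(np.float32)
W_pruned, block_mask = block_prune(W, block=64, sparsity=0.5)
print(1.0 - np.count_nonzero(W_pruned) / W.size)      # achieved sparsity ~0.5
```

Because whole blocks are zeroed, the surviving weights keep a dense, tile-friendly layout, which is what makes the resulting matrix multiplications easier to accelerate than unstructured sparsity.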
June 2024 · 20 Reads · 2 Citations
IEEE Transactions on Neural Networks and Learning Systems
Deep neural networks (DNNs) have been widely used in many artificial intelligence (AI) tasks. However, deploying them brings significant challenges due to the huge costs in memory, energy, and computation. To address these challenges, researchers have developed various model compression techniques such as model quantization and model pruning, and recently there has been a surge of research on compression methods that achieve model efficiency while retaining performance. Furthermore, a growing number of works focus on customizing DNN hardware accelerators to better leverage model compression techniques. In addition to efficiency, preserving security and privacy is critical for deploying DNNs. However, the vast and diverse body of related work can be overwhelming. This motivates us to conduct a comprehensive survey of recent research toward the goal of high-performance, cost-efficient, and safe deployment of DNNs. Our survey first covers the mainstream model compression techniques: model quantization, model pruning, knowledge distillation, and optimizations of nonlinear operations. We then introduce recent advances in designing hardware accelerators that can adapt to efficient model compression approaches. In addition, we discuss how homomorphic encryption can be integrated to secure DNN deployment. Finally, we discuss several open issues, such as hardware evaluation, generalization, and the integration of various compression approaches. Overall, we aim to provide a big picture of efficient DNNs, from algorithms to hardware accelerators and security perspectives.
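As a concrete example of the most basic technique the survey covers, the snippet below fake-quantizes a weight tensor with symmetric uniform scalar quantization. It is a textbook illustration of "reducing precision", not a method proposed in the article, and the single per-tensor scale is an assumption.

```python
# Generic illustration of scalar quantization: map float weights to `bits`-bit
# signed integers with one per-tensor scale, then map back to float to measure
# the approximation error introduced by the reduced precision.
import numpy as np

def quantize_dequantize(w, bits=8):
    """Fake-quantize w to `bits`-bit signed integers and back to float."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 127 for 8-bit
    scale = np.abs(w).max() / qmax                   # one scale per tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.float32) * scale              # dequantized approximation

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
w_hat = quantize_dequantize(w, bits=8)
print(np.abs(w - w_hat).max())                       # worst-case quantization error
```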
March 2024 · 23 Reads · 4 Citations
March 2022 · 22 Reads · 6 Citations
December 2021 · 16 Reads · 4 Citations
... Training (Zhang et al., 2024), and LPViT (Xu et al., 2024a). Due to unavailable or incomplete code repositories of certain state-of-the-art pruning methods, we rely on the performance statistics reported in the original papers and align efficiency comparisons using reported speed improvements for fairness. ...
November 2024
... Overall, these studies collectively demonstrate the growing interest and advances in the use of CNN models for IDC detection and classification in breast cancer histopathological images. Recent studies have also explored the use of advanced architectures in other domains, such as ViTAE-SL [19], a vision transformer-based autoencoder for spatial field reconstruction, and deep learning surrogate models for global wildfire prediction [20,21]. These approaches highlight innovative modeling strategies that could be adapted for medical imaging tasks in future research. ...
March 2024
... Hardware-accelerated PPML architectures (e.g., sparse homomorphic convolution kernels) [261] could mitigate latency bottlenecks in encrypted sensor fusion. Context-aware DP frameworks can employ advanced adaptive techniques [251] for noise budgets based on environmental risk, such as reducing noise in certain areas and increasing it in sensitive zones. ...
June 2024
IEEE Transactions on Neural Networks and Learning Systems
... To address the loss of model accuracy with IoU approximation, enhancements to the NMS algorithm can minimize redundant calculations and leverage parallelism without accuracy loss. Chen et al. [58] enhance NMS by routing the bounding boxes to three output head branches, each for a different ratio, and selecting the most suitable box after merging. This method increases the parallelism but faces challenges in maintaining consistent parallelism across different ratios. ...
March 2022
... Considering dimensional invariance in time, this method employs ℓ1 regularization on the left and right singular matrices derived from SVD, resulting in a column-wise and row-wise sparse matrix without dimension distortion [17], [4], [117]. (Flattened table excerpt: sparsification, zero out insignificant weights [124], [86], [139]; weight sharing, share weights across different connections [106], [82], [123]; knowledge distillation, transfer knowledge learned from teacher to student [80], [85], [119]; orthogonal integration: quantization, reduce precision [78], [72], [104]; entropy coding, encode weights into binary codewords [20], [140], [14].) ...
December 2021