Shiqing Li’s research while affiliated with Nanyang Technological University and other places

Publications (14)


Pearls Hide Behind Linearity: Simplifying Deep Convolutional Networks for Embedded Hardware Systems via Linearity Grafting
  • Conference Paper
January 2024 · 13 Reads · 4 Citations
Xiangzhong Luo · Hao Kong · [...] · Weichen Liu

An Efficient Gustavson-Based Sparse Matrix–Matrix Multiplication Accelerator on Embedded FPGAs

December 2023 · 34 Reads · 9 Citations
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Sparse matrix–matrix multiplication (SpMM) is an important kernel in multiple areas, e.g., data analytics and machine learning. Due to its low on-chip memory requirement, consistent data format, and simplified control logic, Gustavson’s algorithm is a promising backbone algorithm for SpMM on hardware accelerators. However, off-chip memory traffic still limits the performance of the algorithm, especially on embedded FPGAs. Previous work optimizes Gustavson’s algorithm for high-bandwidth-memory-based architectures, and those solutions cannot be directly applied to embedded FPGAs with traditional DDR memories. In this work, we propose an efficient Gustavson-based SpMM accelerator on embedded FPGAs. The proposed design fully considers the characteristics of off-chip memory access on embedded FPGAs and the dataflow of Gustavson’s algorithm. First, we analyze the parallelism of the algorithm and propose to perform it with element-wise parallelism, which reduces the idle time of processing elements caused by synchronization. Further, we show a counter-intuitive example in which a traditional cache leads to worse performance. We then propose a novel access pattern-aware cache scheme called SpCache, which provides quick responses to reduce bank conflicts caused by irregular memory accesses and combines streaming and caching to handle requests that access ordered elements of unpredictable length. Moreover, we propose to merge only part of the partial results, which removes redundant merges present in the naive implementation and incurs little post-processing overhead. Finally, we conduct experiments on the Xilinx Zynq UltraScale+ ZCU106 platform with a set of benchmarks from the SuiteSparse matrix collection. The experimental results show that the proposed design achieves an average 1.75× performance speedup compared to the baseline.
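For context, the sketch below illustrates the row-wise (Gustavson) dataflow that the accelerator builds on, written as plain Python over CSR-format operands. It is only a minimal software rendering of the backbone algorithm, not the proposed hardware design; the function and variable names are illustrative assumptions.

```python
# Minimal sketch of Gustavson's (row-wise) SpMM: C = A * B, with A and B in CSR.
# A CSR matrix is given by (indptr, indices, data). This is an illustrative
# software version of the dataflow, not the accelerator described above.

def gustavson_spmm(a_indptr, a_indices, a_data, b_indptr, b_indices, b_data):
    c_rows = []
    for i in range(len(a_indptr) - 1):
        acc = {}  # sparse accumulator for row i of C
        # For each nonzero A[i, k], scale row k of B and merge into the accumulator.
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k, a_val = a_indices[p], a_data[p]
            for q in range(b_indptr[k], b_indptr[k + 1]):
                j = b_indices[q]
                acc[j] = acc.get(j, 0.0) + a_val * b_data[q]
        c_rows.append(sorted(acc.items()))  # row i of C as (column, value) pairs
    return c_rows
```

In this dataflow, each nonzero of A triggers a streaming read of one row of B, which is exactly the kind of ordered, unpredictable-length access pattern the SpCache scheme is described as serving.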


Efficient FPGA-Based Sparse Matrix–Vector Multiplication With Data Reuse-Aware Compression

December 2023 · 51 Reads · 2 Citations
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Sparse matrix–vector multiplication (SpMV) on FPGAs has gained much attention. The performance of SpMV is mainly determined by the number of multiplications between nonzero matrix elements and the corresponding vector values per cycle. On one hand, the off-chip memory bandwidth limits the number of nonzero matrix elements transferred from the off-chip DDR to the FPGA chip per cycle. On the other hand, the irregular vector access pattern makes it challenging to fetch the corresponding vector values. In addition, the read-after-write (RAW) dependency in the accumulation process must be resolved to enable a fully pipelined design. In this work, we propose an efficient FPGA-based SpMV accelerator with data reuse-aware compression. The key observation is that repeated accesses to a vector value can be omitted by reusing the already fetched data. Based on this observation, we propose a reordering algorithm to explicitly exploit the data reuse of fetched vector values. Further, we propose a novel compressed format called data reuse-aware compressed (DRC) to take full advantage of the data reuse, together with a fast format-conversion algorithm to shorten the preprocessing time. Meanwhile, we propose an HLS-friendly accumulator to resolve the RAW dependency. Finally, we implement and evaluate the proposed design on the Xilinx Zynq UltraScale+ ZCU106 platform with a set of sparse matrices from the SuiteSparse matrix collection. Our design achieves an average 1.18× performance speedup without the DRC format and an average 1.57× performance speedup with the DRC format w.r.t. the state-of-the-art work.
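As a rough illustration of the data-reuse observation, the sketch below processes COO-format nonzeros in an order that keeps entries sharing a column adjacent, so the corresponding vector value is fetched once and then reused. It is only an assumed stand-in for the paper's reordering algorithm; the DRC format and HLS accumulator are not reproduced, and all names are hypothetical.

```python
# Sketch of the data-reuse idea: consecutive nonzeros that share a column index
# can reuse the vector value fetched once. This reordering is a simple software
# stand-in, not the paper's actual algorithm or DRC format.

def spmv_with_reuse(rows, cols, vals, x, n_rows):
    # Sort nonzeros so entries sharing a column are adjacent (ties broken by row).
    order = sorted(range(len(vals)), key=lambda p: (cols[p], rows[p]))
    y = [0.0] * n_rows
    last_col, cached_x = None, 0.0
    for p in order:
        if cols[p] != last_col:           # "fetch" only when the column changes
            last_col, cached_x = cols[p], x[cols[p]]
        y[rows[p]] += vals[p] * cached_x  # reuse the cached vector value
    return y
```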


CRIMP: Compact & Reliable DNN Inference on In-Memory Processing via Crossbar-Aligned Compression and Non-ideality Adaptation

September 2023 · 9 Reads · 2 Citations
ACM Transactions on Embedded Computing Systems

Crossbar-based In-Memory Processing (IMP) accelerators have been widely adopted to achieve high-speed and low-power computing, especially for deep neural network (DNN) models with numerous weights and high computational complexity. However, floating-point (FP) arithmetic is not compatible with crossbar architectures. Also, the redundant weights of current DNN models occupy too many crossbars, limiting the efficiency of crossbar accelerators. Meanwhile, due to the inherent non-ideal behavior of crossbar devices, such as write variations, pre-trained DNN models suffer from accuracy degradation when they are deployed on a crossbar-based IMP accelerator for inference. Although some approaches have been proposed to address these issues, they often fail to consider the interaction among the issues and introduce significant hardware overhead for solving each one. To deploy complex models on IMP accelerators, we should compact the model and mitigate the influence of device non-ideal behaviors without introducing significant overhead from each technique. In this paper, we first propose to reuse the bit-shift units in crossbars to approximately multiply the scaling factors in our quantization scheme, avoiding the use of FP processors. Second, we propose to apply kernel-group pruning and crossbar pruning to eliminate the hardware units for data aligning, and we design a zerorize-recover training process for our pruning method to achieve higher accuracy. Third, we adopt runtime-aware non-ideality adaptation with a self-compensation scheme to relieve the impact of non-ideality by exploiting the features of crossbars. Finally, we integrate these three optimization procedures into one training process to form a comprehensive learning framework for co-optimization, which achieves higher accuracy. The experimental results indicate that our comprehensive learning framework obtains significant improvements over the original model when inferring on a crossbar-based IMP accelerator, reducing computing power and computing area by an average of 100.02× and 17.37×, respectively. Furthermore, we obtain fully integer-only, pruned, and reliable VGG-16 and ResNet-56 models for the CIFAR-10 dataset on IMP accelerators, with accuracy drops of only 2.19% and 1.26%, respectively, and without any hardware overhead.
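To illustrate the first idea (replacing floating-point scaling with bit-shift units), the snippet below approximates a requantization scale by the nearest power of two so that the multiply becomes a shift. It is a generic power-of-two sketch under assumed names, not CRIMP's actual scheme or hardware mapping.

```python
# Sketch of replacing a floating-point requantization scale with a bit-shift
# approximation, so no FP multiplier is needed. Generic illustration only;
# CRIMP's exact quantization scheme is not reproduced here.
import math

def shift_for_scale(scale):
    """Pick the shift amount s so that 2**-s best approximates the FP scale."""
    return max(0, round(-math.log2(scale)))

def requantize_shift(acc, scale):
    """Approximate acc * scale using only an arithmetic right shift."""
    s = shift_for_scale(scale)
    return acc >> s  # hardware-friendly stand-in for the FP multiply

# Example: a 32-bit accumulator value scaled back toward INT8 range.
print(requantize_shift(5000, 0.0039))  # 0.0039 is roughly 2**-8, so shift by 8 -> 19
```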





EvoLP: Self-Evolving Latency Predictor for Model Compression in Real-Time Edge Systems

January 2023 · 3 Reads · 2 Citations
IEEE Embedded Systems Letters

Edge devices are increasingly utilized for deploying deep learning applications on embedded systems. The real-time nature of many applications and the limited resources of edge devices necessitate latency-targeted neural network compression. However, measuring latency on real devices is challenging and expensive. Therefore, this letter presents a novel and efficient framework, named EvoLP, to accurately predict the inference latency of models on edge devices. The predictor can evolve to achieve higher latency prediction precision during the network compression process. Experimental results on three edge devices and four model variants demonstrate that EvoLP outperforms previous state-of-the-art approaches. Moreover, when incorporated into a model compression framework, it effectively guides the compression process toward higher model accuracy while satisfying strict latency constraints. EvoLP is open-sourced at https://github.com/ntuliuteam/EvoLP.
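The sketch below conveys the self-evolving idea in the broadest strokes: a simple linear latency model that is refit whenever the compression loop measures a new candidate on the device. It is an assumed illustration only; EvoLP's actual features, predictor, and update strategy are not reproduced, and measure_on_device is a hypothetical profiling hook.

```python
# Broad-strokes sketch of a self-evolving latency predictor: collect on-device
# measurements while compression explores candidates and refit after each one.
# Illustrative assumption only, not EvoLP's actual model or update rule.
import numpy as np

class EvolvingLatencyPredictor:
    def __init__(self):
        self.samples = []   # list of (feature_vector, measured_latency_ms)
        self.weights = None

    def fit(self):
        # Least-squares fit over all measurements gathered so far.
        X = np.array([f for f, _ in self.samples], dtype=float)
        y = np.array([t for _, t in self.samples], dtype=float)
        self.weights, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(self, features):
        return float(np.dot(self.weights, np.asarray(features, dtype=float)))

    def evolve(self, features, measure_on_device):
        # measure_on_device is a hypothetical hook that profiles the candidate
        # on the real edge device and returns its latency in milliseconds.
        self.samples.append((features, measure_on_device(features)))
        self.fit()
```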




Citations (10)


... More importantly, the winning tickets here are more environment-friendly with less carbon emission, while at the same time achieving better training efficiency and adversarial robustness [382]. In addition, several recent methods [387,388] observe that the intermediate non-linear activation layers can also be grafted with negligible accuracy loss. Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. ...

Reference:

Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision
Pearls Hide Behind Linearity: Simplifying Deep Convolutional Networks for Embedded Hardware Systems via Linearity Grafting
  • Citing Conference Paper
  • January 2024

... This global operation is difficult to efficiently parallelize on FPGAs and may become a performance bottleneck. Regularization techniques such as L1/L2 regularization [41], [110] and Dropout [116] are relatively simple in theory but require additional weight decay when updating parameters and dynamically "turning off" some neurons during training, respectively. ...

An Efficient Sparse LSTM Accelerator on Embedded FPGAs with Bandwidth-Oriented Pruning
  • Citing Conference Paper
  • September 2023

... To avoid these, HELP [215] and MAPLE-Edge [216] focus on building an efficient latency predictor using only few training samples (e.g., as few as 10 training samples in HELP), which can be generalized to new hardware or new search spaces with only minimal re-engineering efforts. More recently, EvoLP [217] considers an effective self-evolving scheme to construct efficient yet accurate latency predictors, which can adapt to unseen hardware with only minimal re-engineering efforts. ...

EvoLP: Self-Evolving Latency Predictor for Model Compression in Real-Time Edge Systems
  • Citing Article
  • January 2023

IEEE Embedded Systems Letters

... Furthermore, [412] investigates the efficiency bottleneck of INT8 quantization and introduces hardware-friendly search space design to enable efficient INT8 quantization. More recently, [450,451] explore INT8 quantization to compress redundant CNNs for efficient in-memory computing infrastructures. In addition to quantizing CNNs, [413] turns back to transformers and leverages INT8 quantization to quantize computation-intensive transformers in order to boost the inference efficiency for general NLP tasks. ...

CRIMP: C ompact & R eliable DNN Inference on I n- M emory P rocessing via Crossbar-Aligned Compression and Non-ideality Adaptation
  • Citing Article
  • September 2023

ACM Transactions on Embedded Computing Systems

... Currently, several FPGA-based accelerators accelerate SpMM with Gustavson. To address memory access conflicts, Li et al. [33] [36] proposed a novel access pattern-aware cache scheme called SpCache, executing the Gustavson algorithm in an element-parallel manner. Gao et al. [16] achieved load balancing by partitioning sparse data equally and proposed vertex clustering optimization to reduce global data transfers. ...

Accelerating Gustavson-based SpMM on Embedded FPGAs with Element-wise Parallelism and Access Pattern-aware Caches
  • Citing Conference Paper
  • April 2023

... One of the challenges to the hardware implementation of matrix operations relates to the on-chip and off-chip memory access [53] and the design of processing blocks that exactly suit the distribution of zeros in sparse matrices. Some algorithms to optimize SpMv on hardware have already been studied for large-scale matrix dimension problems in high-performance computing for physical or biological model simulations [54,55], data analytics [56], large-scale graphics processing [57], and artificial intelligence [58,59]. ...

Efficient FPGA-Based Sparse Matrix–Vector Multiplication With Data Reuse-Aware Compression
  • Citing Article
  • December 2023

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... Currently, several FPGA-based accelerators accelerate SpMM with Gustavson. To address memory access conflicts, Li et al. [33] [36] proposed a novel access pattern-aware cache scheme called SpCache, executing the Gustavson algorithm in an element-parallel manner. Gao et al. [16] achieved load balancing by partitioning sparse data equally and proposed vertex clustering optimization to reduce global data transfers. ...

An Efficient Gustavson-Based Sparse Matrix–Matrix Multiplication Accelerator on Embedded FPGAs
  • Citing Article
  • December 2023

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... ReDESK [21] proposes a representation specifically designed for data prefetching on CPUs, allowing streaming processing on FPGAs. Recent studies have also examined the computation order, such as a work by Li et al. [19], which reorganizes the nonzero elements to enhance data reuse, thereby optimizing memory requests. ...

Optimized Data Reuse via Reordering for Sparse Matrix-Vector Multiplication on FPGAs
  • Citing Conference Paper
  • November 2021

... A series of studies tried to evaluate and forecast the energy consumption and the performance of AI/ML models running on edge accelerators [2], [8], [9], highlighting the correlation between the model size and accelerator's performance metrics, such as latency, energy consumption, etc. These studies evaluate only the AI accelerators without taking into account their relationship with the rest of the edge device components and the effects these have on AI workload performance and power consumption. ...

EDLAB: A Benchmark for Edge Deep Learning Accelerators
  • Citing Article
  • July 2021

IEEE Design and Test

... Note that graph task models that take into account resource contention (Bi et al. 2022), communication cost between vertices (Chen et al. 2020), and heterogenous computing platforms (Han et al. 2019;Voudouris et al. 2022), violate Conditions 1 and 2, thus not applicable to the proposed approach. ...

Reduced Worst-Case Communication Latency Using Single-Cycle Multi-Hop Traversal Network-on-Chip
  • Citing Article
  • August 2020

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems