Guoyu Li’s research while affiliated with Microsoft and other places


Publications (2)


LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator
  • Conference Paper

March 2025 · 6 Reads

Guoyu Li · Shengyu Ye · Chunyun Chen · [...] · Mao Yang
Fig. 1: Comparison of Area and Power Efficiency: LUT-Based Approximate Computing vs. ALU (higher is better; 28 nm FD-SOI @ 300 MHz; 1k × 1k × 1k matrix multiplication; V = vector length, C = number of centroids, equivalent bit width = V / log2(C))
Fig. 2: VQ for Approximating Matrix Multiplication
Fig. 3: LUT-DLA Framework

LUTBoost: Efficient Multi-Stage Model Converter. To address Challenge 2, we design a lightweight multi-stage model training method as the Model Converter in Sec. V, which quickly assesses model accuracy and accelerates model convergence. It not only simplifies the design of the model converter but also speeds up training and reduces accuracy loss.

Fig. 12: Comparison with PECAN and PQA
LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator
  • Preprint
  • File available

January 2025 · 74 Reads

The emergence of neural network capabilities invariably leads to a significant surge in computational demands due to expanding model sizes and increased computational complexity. To reduce model size and lower inference costs, recent research has focused on simplifying models and designing hardware accelerators using low-bit quantization. However, due to numerical representation limits, scalar quantization cannot reduce bit width below 1 bit, diminishing its benefits. To break through these limitations, we introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator framework that utilizes vector quantization to convert neural network models into LUTs, achieving extreme low-bit quantization. The LUT-DLA framework facilitates efficient and cost-effective hardware accelerator designs and supports the LUTBoost algorithm, which transforms various DNN models into LUT-based models via multi-stage training, drastically cutting both computational and hardware overhead. Additionally, through co-design space exploration, LUT-DLA assesses the impact of various model and hardware parameters to fine-tune hardware configurations for different application scenarios, optimizing performance and efficiency. Our comprehensive experiments show that LUT-DLA achieves power-efficiency and area-efficiency gains of 1.4~7.0× and 1.5~146.1×, respectively, while maintaining only a modest accuracy drop. For CNNs, accuracy decreases by 0.1%~3.1% using the L2 distance similarity, 0.1%~3.4% with the L1 distance similarity, and 0.1%~3.8% when employing the Chebyshev distance similarity. For transformer-based models, the accuracy drop ranges from 1.4% to 3.0%.
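To make the core idea concrete, the following is a minimal NumPy sketch of how vector quantization can turn a matrix multiplication into table lookups, in the spirit the abstract describes. All sizes, the random "codebooks", and the variable names are illustrative assumptions for this sketch only; LUT-DLA learns its centroids via the multi-stage LUTBoost training rather than sampling them randomly, and implements the lookups in hardware.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration; the paper benchmarks 1k x 1k x 1k).
M, K, N = 8, 16, 8      # X: M x K activations, W: K x N weights
V, C = 4, 16            # subvector length, number of centroids
S = K // V              # number of subspaces
# Equivalent bit width per element: V / log2(C) = 4 / 4 = 1 bit.

X = rng.standard_normal((M, K)).astype(np.float32)
W = rng.standard_normal((K, N)).astype(np.float32)

# 1) A codebook of C centroids per subspace (random here; LUT-DLA trains these).
codebooks = rng.standard_normal((S, C, V)).astype(np.float32)

# 2) Encode activations: each length-V subvector maps to its nearest
#    centroid index under L2 distance (the paper also studies L1/Chebyshev).
codes = np.empty((M, S), dtype=np.int64)
for s in range(S):
    sub = X[:, s * V:(s + 1) * V]                              # (M, V)
    d = ((sub[:, None, :] - codebooks[s][None]) ** 2).sum(-1)  # (M, C)
    codes[:, s] = d.argmin(1)

# 3) Precompute LUTs: dot product of every centroid with every weight slice.
#    lut[s, c, n] = codebooks[s, c] . W[s*V:(s+1)*V, n]
lut = np.einsum('scv,svn->scn', codebooks, W.reshape(S, V, N))  # (S, C, N)

# 4) The "matmul" is now pure table lookups plus accumulation.
Y_approx = sum(lut[s, codes[:, s]] for s in range(S))           # (M, N)

Y_exact = X @ W
print(np.abs(Y_approx - Y_exact).max())  # approximation error (large here,
                                         # since the centroids are untrained)
```

With random centroids the approximation is crude; the point of the sketch is only that once activations are encoded, no multiplications remain at inference time, which is what makes the LUT-based accelerator design cheap in area and power.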
