Shuo Huai’s research while affiliated with Nanyang Technological University and other places


Publications (33)


A Multi-Strategy Visual SLAM System for Motion Blur Handling in Indoor Dynamic Environments
  • Article
  • Full-text available

March 2025 · 17 Reads · Shuo Huai · Long Cao · Yang Zhou · [...]

Typical SLAM systems adhere to the assumption of environment rigidity, which limits their functionality when deployed in the dynamic indoor environments commonly encountered by household robots. Prevailing methods address this issue by employing semantic information for the identification and processing of dynamic objects in scenes. However, extracting reliable semantic information remains challenging due to the presence of motion blur. In this paper, a novel visual SLAM algorithm is proposed in which various approaches are integrated to obtain more reliable semantic information, consequently reducing the impact of motion blur on visual SLAM systems. Specifically, to accurately distinguish moving objects from static objects, we introduce a missed segmentation compensation mechanism into our SLAM system for predicting and restoring semantic information, and depth and semantic information are then leveraged to generate masks of dynamic objects. Additionally, to refine keypoint filtering, a probability-based algorithm for dynamic feature detection and elimination is incorporated into our SLAM system. Evaluation experiments using the TUM and Bonn RGB-D datasets demonstrated that our SLAM system achieves lower absolute trajectory error (ATE) than existing systems in different dynamic indoor environments, particularly those with large view angle variations. Our system can be applied to enhance the autonomous navigation and scene understanding capabilities of domestic robots.
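The abstract does not spell out the probability-based keypoint filtering, but the general idea can be illustrated with a minimal sketch: each tracked keypoint accumulates a "dynamic" probability that grows when it repeatedly falls inside a predicted dynamic-object mask, and features whose probability crosses a threshold are discarded. The function names, the exponential update rule, and the threshold below are assumptions for illustration, not the authors' actual algorithm.

```python
import numpy as np

def update_dynamic_probability(prob, in_dynamic_mask, alpha=0.3):
    """Blend the previous per-keypoint dynamic probability with the current
    observation (1 if the keypoint falls inside a predicted dynamic-object
    mask, 0 otherwise) using an exponential moving average."""
    observation = in_dynamic_mask.astype(np.float32)
    return (1.0 - alpha) * prob + alpha * observation

def filter_keypoints(keypoints, prob, threshold=0.5):
    """Keep only keypoints whose accumulated dynamic probability stays below
    the threshold; the rest are treated as belonging to moving objects."""
    keep = prob < threshold
    return keypoints[keep], prob[keep]

# Toy usage: 5 tracked keypoints; the last two repeatedly land on a moving object.
keypoints = np.arange(5)
prob = np.zeros(5, dtype=np.float32)
for _ in range(4):  # four consecutive frames
    in_mask = np.array([False, False, False, True, True])
    prob = update_dynamic_probability(prob, in_mask)
static_kps, _ = filter_keypoints(keypoints, prob)
print(static_kps)  # -> [0 1 2]
```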


Domino-Pro-Max: Toward Efficient Network Simplification and Reparameterization for Embedded Hardware Systems

December 2024 · 15 Reads · 2 Citations · IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

The prohibitive complexity of convolutional neural networks (CNNs) has triggered an increasing demand for network simplification. To this end, one natural solution is to remove the redundant channels or layers to explore simplified network structures. However, the resulting simplified network structures often suffer from sub-optimal accuracy-efficiency trade-offs. To overcome such limitations, we, in this work, introduce a simple yet effective network simplification approach, namely Domino, which aims to comprehensively revisit the trade-off dilemma between accuracy and efficiency from a new perspective of linearity and non-linearity through linearity grafting. Furthermore, we also draw insights from Domino and introduce two enhanced variants, namely Domino-Pro and Domino-Pro-Max, to improve the attainable accuracy on the target task without degrading the runtime efficiency on the target hardware. Extensive experiments are conducted on two popular Nvidia Jetson embedded hardware systems (i.e., Xavier and Nano) and two representative deep convolutional networks (i.e., MobileNetV2 and ResNet50), which clearly demonstrate the superiority of Domino and its two enhanced variants over previous state-of-the-art methods.
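The reparameterization step that underlies this line of work (once the non-linearity between two layers has been grafted to an identity, the two adjacent linear maps can be collapsed into a single one) can be sketched for plain fully-connected layers as follows. The use of nn.Linear layers and the variable names are illustrative assumptions; the paper itself targets convolutional networks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two consecutive linear layers whose intermediate non-linearity has been
# grafted away, so their composition is itself a single affine map.
fc1 = nn.Linear(64, 128)
fc2 = nn.Linear(128, 32)

# Reparameterize: y = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
merged = nn.Linear(64, 32)
with torch.no_grad():
    merged.weight.copy_(fc2.weight @ fc1.weight)
    merged.bias.copy_(fc2.weight @ fc1.bias + fc2.bias)

x = torch.randn(8, 64)
assert torch.allclose(fc2(fc1(x)), merged(x), atol=1e-5)
```

The merged layer performs one matrix multiplication instead of two at inference time, which is where the runtime savings on embedded hardware come from.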



Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision

November 2024 · 94 Reads

Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems.


Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision

October 2024 · 29 Reads · 1 Citation · ACM Transactions on Embedded Computing Systems

Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. Furthermore, we also envision promising future directions and trends, which have the potential to deliver more ubiquitous embedded intelligence. We believe this survey has its merits and can shed light on future research, which can largely benefit researchers to quickly and smoothly get started in this emerging field.





An Efficient Gustavson-Based Sparse Matrix–Matrix Multiplication Accelerator on Embedded FPGAs

December 2023 · 34 Reads · 9 Citations · IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Sparse matrix–matrix multiplication (SpMM) is an important kernel in multiple areas, e.g., data analytics and machine learning. Due to the low on-chip memory requirement, the consistent data format, and the simplified control logic, Gustavson’s algorithm is a promising backbone algorithm for SpMM on hardware accelerators. However, the off-chip memory traffic still limits the performance of the algorithm, especially on embedded FPGAs. Previous researchers optimize Gustavson’s algorithm targeting high bandwidth memory-based architectures and their solutions cannot be directly applied to embedded FPGAs with traditional DDRs. In this work, we propose an efficient Gustavson-based SpMM accelerator on embedded FPGAs. The proposed design fully considers the feature of off-chip memory access on embedded FPGAs and the dataflow of Gustavson’s algorithm. First, we analyze the parallelism of the algorithm and propose to perform the algorithm with element-wise parallelism, which reduces the idle time of processing elements caused by synchronization. Further, we show a counter-intuitive example that the traditional cache leads to worse performance. Then, we propose a novel access pattern-aware cache scheme called SpCache, which provides quick responses to reduce bank conflicts caused by irregular memory accesses and combines streaming and caching to handle requests that access ordered elements of unpredictable length. Moreover, we propose to perform the merge on part of partial results, which removes some redundant merges in the naive implementation and has little postprocessing overhead. Finally, we conduct experiments on the Xilinx Zynq-UltraScale ZCU106 platform with a set of benchmarks from the SuiteSparse matrix collection. The experimental results show that the proposed design achieves an average 1.75× performance speedup compared to the baseline.
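For readers unfamiliar with the backbone algorithm, Gustavson's method computes each output row as a sparse linear combination of rows of B, i.e. C[i,:] = Σ_k A[i,k] · B[k,:]. The following is a minimal row-wise sketch in Python; the dict-of-dicts representation stands in for CSR data and the accelerator's on-chip merge buffers, and the function name is illustrative, not taken from the paper.

```python
def gustavson_spmm(A_rows, B_rows):
    """Row-wise (Gustavson) sparse matrix-matrix multiplication.

    A_rows, B_rows: one {column: value} dict per row.
    Returns C in the same format, where C[i][j] = sum_k A[i][k] * B[k][j].
    """
    C_rows = []
    for a_row in A_rows:
        acc = {}  # stands in for the on-chip accumulator / merge buffer
        for k, a_val in a_row.items():          # non-zeros of row i of A
            for j, b_val in B_rows[k].items():  # row k of B is streamed in
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        C_rows.append(acc)
    return C_rows

# Toy example: A is 2x3, B is 3x2, both sparse.
A = [{0: 2.0, 2: 1.0}, {1: 3.0}]
B = [{1: 4.0}, {0: 5.0}, {0: 1.0, 1: 1.0}]
print(gustavson_spmm(A, B))  # [{1: 9.0, 0: 1.0}, {0: 15.0}]
```

Because only rows of B that match the non-zeros of the current row of A are touched, the working set per output row stays small, which is what makes the algorithm attractive on memory-constrained accelerators.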


EdgeCompress: Coupling Multi-Dimensional Model Compression and Dynamic Inference for EdgeAI

December 2023 · 22 Reads · 3 Citations · IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Convolutional neural networks (CNNs) have demonstrated encouraging results in image classification tasks. However, the prohibitive computational cost of CNNs hinders the deployment of CNNs onto resource-constrained embedded devices. To address this issue, we propose EdgeCompress, a comprehensive compression framework to reduce the computational overhead of CNNs. In EdgeCompress, we first introduce dynamic image cropping (DIC), where we design a lightweight foreground predictor to accurately crop the most informative foreground object of input images for inference, which avoids redundant computation on background regions. Subsequently, we present compound shrinking (CS) to collaboratively compress the three dimensions (depth, width, and resolution) of CNNs according to their contribution to accuracy and model computation. DIC and CS together constitute a multidimensional CNN compression framework, which is able to comprehensively reduce the computational redundancy in both input images and neural network architectures, thereby improving the inference efficiency of CNNs. Further, we present a dynamic inference framework to efficiently process input images with different recognition difficulties, where we cascade multiple models with different complexities from our compression framework and dynamically adopt different models for different input images, which further compresses the computational redundancy and improves the inference efficiency of CNNs, facilitating the deployment of advanced CNNs onto embedded hardware. Experiments on ImageNet-1K demonstrate that EdgeCompress reduces the computation of ResNet-50 by 48.8% while improving the top-1 accuracy by 0.8%. Meanwhile, we improve the accuracy by 4.1% with similar computation compared to HRank, the state-of-the-art compression framework. The source code and models are available at https://github.com/ntuliuteam/edge-compress.
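The dynamic inference idea (cascading models of increasing complexity and stopping as soon as the prediction is confident enough) can be illustrated with the sketch below. The model stand-ins, function names, and the plain softmax-threshold exit rule are assumptions for illustration rather than the paper's exact criterion.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cascaded_inference(image, models, threshold=0.9):
    """Run models ordered from cheapest to most expensive; return the first
    prediction whose top-1 softmax confidence exceeds the threshold. The
    final (largest) model's answer is used as a fallback."""
    for model in models[:-1]:
        probs = softmax(model(image))
        if probs.max() >= threshold:
            return int(probs.argmax())
    return int(softmax(models[-1](image)).argmax())

# Toy stand-ins: a "small" model that is confident only on easy inputs and a
# "large" model that handles the rest.
small = lambda img: np.array([4.0, 0.5, 0.2]) if img.mean() > 0 else np.array([0.4, 0.5, 0.3])
large = lambda img: np.array([0.1, 5.0, 0.2])

easy, hard = np.ones((8, 8)), -np.ones((8, 8))
print(cascaded_inference(easy, [small, large]))  # small model is confident -> 0
print(cascaded_inference(hard, [small, large]))  # falls back to the large model -> 1
```

Easy inputs exit after the cheap model, so the average per-image cost drops while hard inputs still receive the full-capacity model.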


Citations (22)


... BMS require real-time monitoring and prediction of battery status, which imposes stringent demands on the inference speed of the models. Embedded systems often have limited computational resources, making it challenging to run complex models efficiently [182]. To overcome these issues, we can reduce model complexity by employing lightweight architectures, utilize dedicated hardware to accelerate the model's inference process, and offload a portion of the computational tasks to edge devices or cloud platforms to alleviate the burden on embedded systems. ...

Reference:

State of Health Estimation and Battery Management: A Review of Health Indicators, Models and Machine Learning
Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision
  • Citing Article
  • October 2024

ACM Transactions on Embedded Computing Systems

... More importantly, the winning tickets here are more environment-friendly with less carbon emission, while at the same time achieving better training efficiency and adversarial robustness [382]. In addition, several recent methods [387,388] observe that the intermediate non-linear activation layers can also be grafted with negligible accuracy loss. Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. ...

Domino-Pro-Max: Toward Efficient Network Simplification and Reparameterization for Embedded Hardware Systems
  • Citing Article
  • December 2024

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... More importantly, the winning tickets here are more environment-friendly with less carbon emission, while at the same time achieving better training efficiency and adversarial robustness [382]. In addition, several recent methods [387,388] observe that the intermediate non-linear activation layers can also be grafted with negligible accuracy loss. Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. ...

Pearls Hide Behind Linearity: Simplifying Deep Convolutional Networks for Embedded Hardware Systems via Linearity Grafting
  • Citing Conference Paper
  • January 2024

... To avoid these, HELP [215] and MAPLE-Edge [216] focus on building an efficient latency predictor using only few training samples (e.g., as few as 10 training samples in HELP), which can be generalized to new hardware or new search spaces with only minimal re-engineering efforts. More recently, EvoLP [217] considers an effective self-evolving scheme to construct efficient yet accurate latency predictors, which can adapt to unseen hardware with only minimal re-engineering efforts. ...

EvoLP: Self-Evolving Latency Predictor for Model Compression in Real-Time Edge Systems
  • Citing Article
  • January 2023

IEEE Embedded Systems Letters

... Lee et al. [21] and Jain et al. [22] propose optimization methods for NNs using ternary weights and three-bit activations. HashedNets reduces the number of parameters by randomly clustering connection weights into a hash table [23][24][25]. Through these methods, the bit-width of the weights can be reduced. ...

iMAT: Energy-Efficient In-Memory Acceleration for Ternary Neural Networks With Sparse Dot Product
  • Citing Conference Paper
  • August 2023

... Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. Furthermore, several recent pruning methods [389][390][391][392] focus on multi-dimensional pruning, which strive to actively prune less important channels, layers, and input resolutions to aggressively trim down the model's complexity towards enhanced inference efficiency on target hardware. These multi-dimensional pruning methods can achieve much better accuracy-efficiency trade-offs than traditional channel-based and layer-based pruning methods. ...

Towards Efficient Convolutional Neural Network for Embedded Hardware via Multi-Dimensional Pruning
  • Citing Conference Paper
  • July 2023

... Furthermore, [412] investigates the efficiency bottleneck of INT8 quantization and introduces hardware-friendly search space design to enable efficient INT8 quantization. More recently, [450,451] explore INT8 quantization to compress redundant CNNs for efficient in-memory computing infrastructures. In addition to quantizing CNNs, [413] turns back to transformers and leverages INT8 quantization to quantize computation-intensive transformers in order to boost the inference efficiency for general NLP tasks. ...

CRIMP: Compact & Reliable DNN Inference on In-Memory Processing via Crossbar-Aligned Compression and Non-ideality Adaptation
  • Citing Article
  • September 2023

ACM Transactions on Embedded Computing Systems

... Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. Furthermore, several recent pruning methods [389][390][391][392] focus on multi-dimensional pruning, which strive to actively prune less important channels, layers, and input resolutions to aggressively trim down the model's complexity towards enhanced inference efficiency on target hardware. These multi-dimensional pruning methods can achieve much better accuracy-efficiency trade-offs than traditional channel-based and layer-based pruning methods. ...

EMNAPE: Efficient Multi-Dimensional Neural Architecture Pruning for EdgeAI
  • Citing Conference Paper
  • April 2023

... Currently, several FPGA-based accelerators accelerate SpMM with Gustavson. To address memory access conflicts, Li et al. [33] [36] proposed a novel access pattern-aware cache scheme called SpCache, executing the Gustavson algorithm in an element-parallel manner. Gao et al. [16] achieved load balancing by partitioning sparse data equally and proposed vertex clustering optimization to reduce global data transfers. ...

An Efficient Gustavson-Based Sparse Matrix–Matrix Multiplication Accelerator on Embedded FPGAs
  • Citing Article
  • December 2023

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... For example, Adap-tiveNet [24] generates models for different edge environments, addressing search space and device resource constraints, and LegoDNN [25] maximizes accuracy under specific resource and latency constraints by training common blocks within DNNs. Some works, like NeuLens [26] and EdgeCompress [62], adapt networks based on the complexity of input data, saving resources but often leading to increased training costs due to random generation of descendants. As shown in Figure 1(right), AdaScale employs a novel approach by treating the expansion space as a new set, integrating various lightweight DNN structures (referred to as compression operators) to reduce the search space. ...

EdgeCompress: Coupling Multi-Dimensional Model Compression and Dynamic Inference for EdgeAI
  • Citing Article
  • December 2023

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems