Hao Kong’s research while affiliated with Nanyang Technological University and other places


Publications (28)


Domino-Pro-Max: Toward Efficient Network Simplification and Reparameterization for Embedded Hardware Systems
  • Article
December 2024 · 15 Reads · 2 Citations
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Xiangzhong Luo · Hao Kong · [...] · Weichen Liu

The prohibitive complexity of convolutional neural networks (CNNs) has triggered an increasing demand for network simplification. To this end, one natural solution is to remove redundant channels or layers to explore simplified network structures. However, the resulting simplified network structures often suffer from sub-optimal accuracy-efficiency trade-offs. To overcome such limitations, we, in this work, introduce a simple yet effective network simplification approach, namely Domino, which revisits the trade-off dilemma between accuracy and efficiency from a new perspective of linearity and non-linearity through linearity grafting. Furthermore, we draw insights from Domino and introduce two enhanced variants, namely Domino-Pro and Domino-Pro-Max, to improve the attainable accuracy on the target task without degrading the runtime efficiency on the target hardware. Extensive experiments are conducted on two popular Nvidia Jetson embedded hardware systems (i.e., Xavier and Nano) and two representative deep convolutional networks (i.e., MobileNetV2 and ResNet50), which clearly demonstrate the superiority of Domino and its two enhanced variants over previous state-of-the-art methods.
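The reparameterization step described in the abstract boils down to standard weight folding: once the non-linearity between two linear operators has been grafted with the identity, the pair collapses into a single linear operator. The snippet below is a minimal sketch of that folding, not the authors' Domino code, and uses fully connected layers for brevity (the paper targets convolutional networks on Jetson hardware).

```python
# Hedged sketch of linear-layer folding: y = W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
import torch
import torch.nn as nn

def merge_linear(fc1: nn.Linear, fc2: nn.Linear) -> nn.Linear:
    """Fold two consecutive linear layers (no non-linearity in between) into one."""
    merged = nn.Linear(fc1.in_features, fc2.out_features)
    with torch.no_grad():
        merged.weight.copy_(fc2.weight @ fc1.weight)         # W = W2 @ W1
        merged.bias.copy_(fc2.weight @ fc1.bias + fc2.bias)  # b = W2 @ b1 + b2
    return merged

# Quick check: the merged layer matches the sequential pair.
fc1, fc2 = nn.Linear(8, 16), nn.Linear(16, 4)
x = torch.randn(2, 8)
assert torch.allclose(fc2(fc1(x)), merge_linear(fc1, fc2)(x), atol=1e-5)
```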



Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision
  • Preprint
  • File available
November 2024 · 82 Reads

Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems.


Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision

October 2024 · 27 Reads · 1 Citation
ACM Transactions on Embedded Computing Systems

Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. Furthermore, we also envision promising future directions and trends, which have the potential to deliver more ubiquitous embedded intelligence. We believe this survey can shed light on future research and help researchers quickly and smoothly get started in this emerging field.





EdgeCompress: Coupling Multi-Dimensional Model Compression and Dynamic Inference for EdgeAI

December 2023 · 22 Reads · 3 Citations
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Convolutional neural networks (CNNs) have demonstrated encouraging results in image classification tasks. However, the prohibitive computational cost of CNNs hinders the deployment of CNNs onto resource-constrained embedded devices. To address this issue, we propose EdgeCompress, a comprehensive compression framework to reduce the computational overhead of CNNs. In EdgeCompress, we first introduce dynamic image cropping (DIC), where we design a lightweight foreground predictor to accurately crop the most informative foreground object of input images for inference, which avoids redundant computation on background regions. Subsequently, we present compound shrinking (CS) to collaboratively compress the three dimensions (depth, width, and resolution) of CNNs according to their contribution to accuracy and model computation. DIC and CS together constitute a multidimensional CNN compression framework, which is able to comprehensively reduce the computational redundancy in both input images and neural network architectures, thereby improving the inference efficiency of CNNs. Further, we present a dynamic inference framework to efficiently process input images with different recognition difficulties, where we cascade multiple models with different complexities from our compression framework and dynamically adopt different models for different input images, which further compresses the computational redundancy and improves the inference efficiency of CNNs, facilitating the deployment of advanced CNNs onto embedded hardware. Experiments on ImageNet-1K demonstrate that EdgeCompress reduces the computation of ResNet-50 by 48.8% while improving the top-1 accuracy by 0.8%. Meanwhile, we improve the accuracy by 4.1% with similar computation compared to HRank, the state-of-the-art compression framework. The source code and models are available at https://github.com/ntuliuteam/edge-compress.
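As a rough illustration of the dynamic-inference idea described above (a hedged sketch, not the released EdgeCompress code), a cascade can query the cheapest model first and escalate to a larger model only when the prediction confidence falls below a threshold. The 0.9 threshold and single-image batch are illustrative assumptions.

```python
# Hedged sketch of cascaded dynamic inference: models are ordered from cheapest
# to most accurate, and an input only proceeds to a larger model when the
# current prediction is not confident enough.
import torch

@torch.no_grad()
def cascade_predict(models, x, threshold=0.9):
    """Run models ordered from cheapest to most expensive; exit early when confident."""
    for i, model in enumerate(models):
        probs = torch.softmax(model(x), dim=-1)   # x: a single image batch, shape (1, C, H, W)
        conf, label = probs.max(dim=-1)
        if conf.item() >= threshold or i == len(models) - 1:
            return label.item(), conf.item()      # confident (or last model): stop here
```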


On Hardware-Aware Design and Optimization of Edge Intelligence

December 2023 · 34 Reads
IEEE Design and Test

Edge intelligence systems, the intersection of edge computing and artificial intelligence (AI), are pushing the frontier of AI applications. However, the complexity of deep learning models and the heterogeneity of edge devices make the design of edge intelligence systems a challenging task. Hardware-agnostic methods face limitations when implementing edge systems, so hardware-aware methods have recently attracted more attention. In this paper, we present our recent endeavors in hardware-aware design and optimization for edge intelligence. We delve into techniques such as model compression and neural architecture search to achieve efficient and effective system designs. We also discuss open challenges in the hardware-aware paradigm.


CRIMP: Compact & Reliable DNN Inference on In-Memory Processing via Crossbar-Aligned Compression and Non-ideality Adaptation

September 2023 · 9 Reads · 2 Citations
ACM Transactions on Embedded Computing Systems

Crossbar-based In-Memory Processing (IMP) accelerators have been widely adopted to achieve high-speed and low-power computing, especially for deep neural network (DNN) models with numerous weights and high computational complexity. However, floating-point (FP) arithmetic is not compatible with crossbar architectures. Also, redundant weights of current DNN models occupy too many crossbars, limiting the efficiency of crossbar accelerators. Meanwhile, due to the inherent non-ideal behavior of crossbar devices, like write variations, pre-trained DNN models suffer from accuracy degradation when they are deployed on a crossbar-based IMP accelerator for inference. Although some approaches have been proposed to address these issues, they often fail to consider the interaction among these issues and introduce significant hardware overhead for solving each issue. To deploy complex models on IMP accelerators, we should compact the model and mitigate the influence of device non-ideal behaviors without introducing significant overhead from each technique. In this paper, we first propose to reuse bit-shift units in crossbars for approximately multiplying scaling factors in our quantization scheme to avoid using FP processors. Second, we propose to apply kernel-group pruning and crossbar pruning to eliminate the hardware units for data aligning. We also design a zerorize-recover training process for our pruning method to achieve higher accuracy. Third, we adopt the runtime-aware non-ideality adaptation with a self-compensation scheme to relieve the impact of non-ideality by exploiting the feature of crossbars. Finally, we integrate these three optimization procedures into one training process to form a comprehensive learning framework for co-optimization, which can achieve higher accuracy. The experimental results indicate that our comprehensive learning framework can obtain significant improvements over the original model when inferring on the crossbar-based IMP accelerator, reducing computing power and computing area by an average of 100.02× and 17.37×, respectively. Furthermore, we can obtain fully integer-only, pruned, and reliable VGG-16 and ResNet-56 models for the CIFAR-10 dataset on IMP accelerators, with accuracy drops of only 2.19% and 1.26%, respectively, without any hardware overhead.
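The bit-shift reuse mentioned above follows the usual power-of-two scaling trick: if a quantization scale is restricted (or rounded) to a power of two, multiplying an integer accumulator by it reduces to a shift. The snippet below is a hedged sketch of that trick under this assumption, not the CRIMP implementation; the example scale value is illustrative.

```python
# Hedged sketch: approximate an FP quantization scale by the nearest power of two,
# so rescaling an integer accumulator becomes a bit shift instead of an FP multiply.
import math

def power_of_two_shift(scale: float) -> int:
    """Return shift amount k such that 2**(-k) is the closest power of two to `scale`."""
    return round(-math.log2(scale))

def rescale(acc: int, shift: int) -> int:
    """Apply the approximated scale to an integer accumulator via shifting."""
    return acc >> shift if shift >= 0 else acc << -shift

scale = 0.0312                      # example FP scale from quantization (assumed value)
k = power_of_two_shift(scale)       # k == 5, i.e. scale ≈ 2**-5 = 0.03125
print(rescale(1024, k))             # 32, close to 1024 * 0.0312 ≈ 31.95
```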


Citations (19)


... BMS require real-time monitoring and prediction of battery status, which imposes stringent demands on the inference speed of the models. Embedded systems often have limited computational resources, making it challenging to run complex models efficiently [182]. To overcome these issues, we can reduce model complexity by employing lightweight architectures, utilize dedicated hardware to accelerate the model's inference process, and offload a portion of the computational tasks to edge devices or cloud platforms to alleviate the burden on embedded systems. ...

Reference:

State of Health Estimation and Battery Management: A Review of Health Indicators, Models and Machine Learning
Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision
  • Citing Article
  • October 2024

ACM Transactions on Embedded Computing Systems

... More importantly, the winning tickets here are more environment-friendly with less carbon emission, while at the same time achieving better training efficiency and adversarial robustness [382]. In addition, several recent methods [387,388] observe that the intermediate non-linear activation layers can also be grafted with negligible accuracy loss. Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. ...

Domino-Pro-Max: Toward Efficient Network Simplification and Reparameterization for Embedded Hardware Systems
  • Citing Article
  • December 2024

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... More importantly, the winning tickets here are more environment-friendly with less carbon emission, while at the same time achieving better training efficiency and adversarial robustness [382]. In addition, several recent methods [387,388] observe that the intermediate non-linear activation layers can also be grafted with negligible accuracy loss. Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. ...

Pearls Hide Behind Linearity: Simplifying Deep Convolutional Networks for Embedded Hardware Systems via Linearity Grafting
  • Citing Conference Paper
  • January 2024

... To avoid these, HELP [215] and MAPLE-Edge [216] focus on building an efficient latency predictor using only few training samples (e.g., as few as 10 training samples in HELP), which can be generalized to new hardware or new search spaces with only minimal re-engineering efforts. More recently, EvoLP [217] considers an effective self-evolving scheme to construct efficient yet accurate latency predictors, which can adapt to unseen hardware with only minimal re-engineering efforts. ...

EvoLP: Self-Evolving Latency Predictor for Model Compression in Real-Time Edge Systems
  • Citing Article
  • January 2023

IEEE Embedded Systems Letters

... Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. Furthermore, several recent pruning methods [389][390][391][392] focus on multi-dimensional pruning, which strive to actively prune less important channels, layers, and input resolutions to aggressively trim down the model's complexity towards enhanced inference efficiency on target hardware. These multi-dimensional pruning methods can achieve much better accuracy-efficiency trade-offs than traditional channel-based and layer-based pruning methods. ...

Towards Efficient Convolutional Neural Network for Embedded Hardware via Multi-Dimensional Pruning
  • Citing Conference Paper
  • July 2023

... Furthermore, [412] investigates the efficiency bottleneck of INT8 quantization and introduces hardware-friendly search space design to enable efficient INT8 quantization. More recently, [450,451] explore INT8 quantization to compress redundant CNNs for efficient in-memory computing infrastructures. In addition to quantizing CNNs, [413] turns back to transformers and leverages INT8 quantization to quantize computation-intensive transformers in order to boost the inference efficiency for general NLP tasks. ...

CRIMP: Compact & Reliable DNN Inference on In-Memory Processing via Crossbar-Aligned Compression and Non-ideality Adaptation
  • Citing Article
  • September 2023

ACM Transactions on Embedded Computing Systems

... Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. Furthermore, several recent pruning methods [389][390][391][392] focus on multi-dimensional pruning, which strive to actively prune less important channels, layers, and input resolutions to aggressively trim down the model's complexity towards enhanced inference efficiency on target hardware. These multi-dimensional pruning methods can achieve much better accuracy-efficiency trade-offs than traditional channel-based and layer-based pruning methods. ...

EMNAPE: Efficient Multi-Dimensional Neural Architecture Pruning for EdgeAI
  • Citing Conference Paper
  • April 2023

... For example, Adap-tiveNet [24] generates models for different edge environments, addressing search space and device resource constraints, and LegoDNN [25] maximizes accuracy under specific resource and latency constraints by training common blocks within DNNs. Some works, like NeuLens [26] and EdgeCompress [62], adapt networks based on the complexity of input data, saving resources but often leading to increased training costs due to random generation of descendants. As shown in Figure 1(right), AdaScale employs a novel approach by treating the expansion space as a new set, integrating various lightweight DNN structures (referred to as compression operators) to reduce the search space. ...

EdgeCompress: Coupling Multi-Dimensional Model Compression and Dynamic Inference for EdgeAI
  • Citing Article
  • December 2023

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... Based on this observation, [387,388] propose to first graft the less important intermediate non-linear activation layers with their linear counterparts and then reparameterize multiple consecutive linear layers into one single linear layer to explore shallow network solutions with fewer layers. Furthermore, several recent pruning methods [389][390][391][392] focus on multi-dimensional pruning, which strive to actively prune less important channels, layers, and input resolutions to aggressively trim down the model's complexity towards enhanced inference efficiency on target hardware. These multi-dimensional pruning methods can achieve much better accuracy-efficiency trade-offs than traditional channel-based and layer-based pruning methods. ...

Smart Scissor: Coupling Spatial Redundancy Reduction and CNN Compression for Embedded Hardware
  • Citing Conference Paper
  • December 2022

... On-device federated learning is an advanced decentralized learning paradigm, which enables efficient training on a large corpus of decentralized data residing on local client devices like mobile phones and allows multiple local client devices to jointly train the given network without explicitly sharing their raw data [541,549]. In practice, on-device federated learning has the potential to significantly accelerate the training process when the number of client devices evolves. ...

Collate: Collaborative Neural Network Learning for Latency-Critical Edge Systems
  • Citing Conference Paper
  • October 2022