Optimizing SRU Models for Predictive Maintenance on Embedded Systems
(RISC-V)
SAMY CHEHADE, ADVANS Group, France
ADRIEN TIRLEMONT, ADVANS Group, France
FLORIAN DUPEYRON, ADVANS Group, France
PIERRE ROMET∗, ADVANS Group, CIAD UR 7533, Belfort Montbéliard University of Technology, UTBM, Belfort, France
This study assesses different strategies for deploying a deep learning recurrent neural network on the VisionFive2 RISC-V board for predictive maintenance. The main contribution of this paper is a minimal SRU (Simple Recurrent Unit) model specifically designed for this application on embedded systems. The workflow involves applying pruning and knowledge distillation to an initially unoptimized LSTM (Long Short-Term Memory) model, replacing it with a standard SRU for a more efficient architecture, and applying the same optimizations. Finally, the SRU architecture is modified to further enhance its performance on embedded systems. While replacing the LSTM with an SRU led to a 51.8% reduction in average latency and a one-point increase in accuracy, the minimal SRU resulted in a 75% speed-up while maintaining the prediction scores of the initial LSTM model. Models are evaluated based on prediction and embedded performance metrics, providing a general overview of the impact of the optimization techniques used in a turbofan engine degradation application.
CCS Concepts: • Computing methodologies → Artificial intelligence; • Computer systems organization → Embedded systems.
Additional Key Words and Phrases: RNN, LSTM, SRU, Pruning, Knowledge distillation, Predictive maintenance, RISC-V.
ACM Reference Format:
Samy Chehade, Adrien Tirlemont, Florian Dupeyron, and Pierre Romet. 2024. Optimizing SRU Models for Predictive Maintenance on Embedded Systems (RISC-V). In Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym 'XX). ACM, New York, NY, USA, 15 pages. https://doi.org/XXXXXXX.XXXXXXX
1 Introduction
Artificial intelligence has developed rapidly in recent years, which has galvanized its integration into embedded systems applications. However, deep learning and machine learning models are increasingly complex and resource-intensive, making it difficult to embed them into resource-constrained systems. By reducing model complexity, Edge AI enables these algorithms, including neural networks, to run directly on embedded systems, enhancing responsiveness, data privacy, and energy efficiency by processing data locally. Although Edge AI is spreading across all industry sectors
∗Email: innovation@advans-group.com; Phone: 0476122840. Postal Address: 12-14 Avenue Antoine Dutrievoz, 69100 Villeurbanne, France
Authors’ Contact Information: Samy Chehade, ADVANS Group, Cachan, France; Adrien Tirlemont, ADVANS Group, Cachan, France; Florian Dupeyron,
ADVANS Group, Cachan, France; Pierre Romet, ADVANS Group, CIAD UR 7533, Belfort Montbéliard University of Technology, UTBM, Belfort, Lyon,
France.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
[19], the work in this article focuses specifically on predictive maintenance, as an example of time series prediction, where Edge AI can be a game-changer in making systems more autonomous by eliminating the need for cloud dependence, and safer by keeping the data private.
Predictive maintenance is a crucial strategy for anticipating industrial system failures and minimizing costs associated with unplanned downtime. It often relies on recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) models, which capture complex dependencies within time series data across diverse fields, including air quality forecasting [15] and speech recognition [20]. Moreover, LSTMs are commonly used for predicting the Remaining Useful Life (RUL) of various systems, such as jet engines [2] or any cyber-physical production system [1], where large volumes of sensor data can easily be provided by the Internet of Things, thus enhancing prediction accuracy.
However, this approach is computationally intensive and often requires substantial memory and processing power, typically executed on cloud-based infrastructures. This dependence on cloud resources faces several limitations, including high latency, reliability concerns, bandwidth constraints, and data privacy issues. A promising solution to these limitations is to deploy deep learning models directly on embedded devices, enabling local and real-time processing of sensor data, which raises challenges due to the limited capacities of embedded platforms.
To meet these constraints, several strategies have been explored by researchers to adapt deep learning models, and in particular LSTMs, to these restricted environments. The simplest is to convert the desired model into a compressed format for embedded targets using existing tools such as TensorFlow Lite [11]. Optimization techniques have also emerged, aiming to reduce the models' complexity and computational needs. For instance, in an accelerated fire detection and localization application, quantization and compression have been explored to achieve on-board inference [12]. Furthermore, alternative models have become increasingly popular as replacements for LSTMs in this context.
In this paper, an approach to deploying a recurrent neural network model is presented, specifically replacing the LSTM with an SRU on a VisionFive2 RISC-V embedded board for predictive maintenance. The primary contributions include:
• A structured optimization of RNN models for embedded deployment, with LSTM serving as the base model.
• An automated test bench for continuous evaluation of predictive accuracy and embedded performance, providing traceable results.
• A focus on enhancing predictive maintenance with embedded-friendly architectures, particularly by implementing and testing an SRU and a minimal SRU architecture to replace LSTMs.
The rest of the paper is structured as follows: Section 2 covers the state of the art, focusing on existing work related to porting LSTMs and SRUs to embedded devices for predictive maintenance. Section 3 outlines the models' details, including the core differences between LSTMs, SRUs, and minimal SRUs. Section 4 presents the test bench and optimization techniques, along with experimental protocols and results. Finally, Section 5 concludes the paper.
2 State of the Art
When TensorFlow Lite converts an LSTM model to its .tflite format, it performs several key steps to make the model suitable for deployment on edge devices. Some of these steps include removing unnecessary operations that do not contribute to the model inference on target and replacing some TensorFlow operations with specialized TensorFlow Lite operators that are optimized for embedded platforms. This is how an LSTM model has been successfully converted to a TinyModel version [11]. More options are also available during the conversion process, such as pruning and quantization. These techniques, as used by Bringmann et al. [2021], have been explored and tested to reduce the size
and complexity of models by respectively removing a percentage of network parameters or reducing the precision of model weights. Pruning has also been considered directly on-device, where it is applied in real time during inference to create smaller, user-specific models [6]. Moreover, post-training and network quantization of an LSTM model was applied in the context of battery management systems with the goal of deploying it on an STM32 MCU embedded system [9]. Compression methods were also investigated to allow LSTMs to run on ultra-low power wireless sensors for induction motor predictive maintenance [10]. There is also the possibility of combining these compression methods with hardware accelerators. Chen et al. [2021] showed the effect of various levels of quantization on the turbofan degradation predictive maintenance model and built an FPGA-LSTM accelerator to specifically run LSTM operations faster and with lower energy consumption, which makes it ideal for real-time monitoring and predictive maintenance tasks at the edge. Finally, a lightweight real-time fault detection system for edge computing has been proposed with a fault detection model based on an LSTM [13]. It is also suggested there to use a simpler model like the Gated Recurrent Unit (GRU) to reduce memory footprint and complexity as a future approach. All these works demonstrate the convenience of using LSTMs for predictive maintenance on edge devices. They also indicate that many strategies can be applied to effectively run LSTM models on edge devices. In this paper, the focus is on applying the mentioned compression methods as well as the knowledge distillation optimization technique, since it appears to improve the RUL prediction of systems while providing more compact models [16]. It is also an effective way to create compact versions of foundation models for edge devices [5]. Moreover, replacing the LSTM with an alternative model, typically the SRU, is also part of the core strategy of this work.
LSTMs have a complex gated architecture and multiple time dependencies within sequences, which multiplies memory accesses and extends inference time. Simplifying the LSTM architecture with other models that could address these limitations is therefore a viable approach. As mentioned, the Simple Recurrent Unit (SRU) is the main target in this paper. Indeed, its architecture adds the possibility of parallelizing time steps [22], significantly speeding up inference time. This advantage has led to the adoption of SRUs in applications requiring real-time performance on embedded platforms, such as automatic speech recognition (ASR) [14]. SRUs are also an option for even more demanding applications such as NLP tasks, where their potential for parallelization is also attractive [8]. In addition, SRUs are used instead of LSTMs in a proposed model for predicting the RUL of roller bearings [23]. These works demonstrate the potential of SRUs to solve some of the LSTMs' limitations, not only in the embedded systems context but in a broader one. While these studies highlight SRUs' advantages over LSTMs in terms of computational efficiency, there has been, to our knowledge, no transparent assessment of SRUs specifically for embedded predictive maintenance applications. Motivated by this gap, this paper explores the impact of SRUs in such settings. More broadly, RNNs are increasingly integrated into embedded systems across multiple applications, positioning the SRU as a viable next model to be evaluated [3]. In the specific context of aircraft engines, for instance, implementing SRUs could facilitate early anomaly detection, reduce aircraft downtime, enhance flight safety, and lower operational costs. By supporting on-board predictive maintenance, SRUs address the key drawbacks of LSTMs: long training times and high computational demands [21].
3 Model
In this section, the two types of recurrent neural network (RNN) architectures central to this research are presented: Long Short-Term Memory networks (LSTMs) and Simple Recurrent Units (SRUs). The purpose of this comparison is to provide a clear understanding of the models' architectures in order to evaluate the potential benefits of replacing LSTMs with SRUs in embedded environments.
3.1 LSTM
LSTMs introduce key components compared to a standard RNN: a cell state, which represents long-term memory,
and a hidden state, which represents short-term memory. Together, these components enhance the LSTM’s ability to
manage both long-term and short-term patterns in time series data.
Fig. 1. Diagram of an LSTM cell
Its uniqueness lies in the ability to add or subtract information from the cell state via defined gates. Below is a concise explanation of the three crucial steps in an LSTM cell.

Step 1: Forget Gate
The forget gate $f_t$ determines what percentage of the previous cell state $C_{t-1}$ should be retained, using the previous hidden state $h_{t-1}$ and the current input $x_t$:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$  (1)

Step 2: Input Gate and Cell State Update
The input gate decides what new information to store in the cell state, combining two parts: the input gate activation $i_t$, which decides the values to update in the cell state, and the candidate cell state $\tilde{C}_t$, which creates new potential values for the cell state based on $h_{t-1}$ and $x_t$:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$  (2)

$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$  (3)

The cell state $C_t$ is then updated as follows:

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$  (4)

Step 3: Output Gate and Hidden State Update
The hidden state $h_t$ is updated to produce the output of the LSTM cell at a specific time step. The output gate $o_t$ decides the percentage of the new cell state $C_t$ to include in the output, defined by:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$  (5)

The hidden state $h_t$ is then computed as follows:

$h_t = o_t \odot \tanh(C_t)$  (6)
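To make the gate equations concrete, below is a minimal NumPy sketch of one LSTM time step implementing equations (1)-(6). The stacked weight layout and the tiny dimensions are illustrative assumptions tied to this paper's notation, not the implementation actually deployed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following (1)-(6); each W[k] has shape (n, n + m)."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # (1) forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # (2) input gate activation
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # (3) candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde         # (4) cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])         # (5) output gate
    h_t = o_t * np.tanh(c_t)                   # (6) hidden state
    return h_t, c_t

# Tiny illustrative sizes: n = 4 hidden units, m = 3 inputs
n, m = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(n, n + m)) for k in "fico"}
b = {k: np.zeros(n) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=m), np.zeros(n), np.zeros(n), W, b)
print(h_t.shape, c_t.shape)  # (4,) (4,)
```

Note how every gate consumes the concatenation $[h_{t-1}, x_t]$, which is precisely the recurrent dependency the SRU removes in the next section.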
3.2 SRU
SRUs simplify the structure of the cell by reducing the complexity of the gating mechanisms, focusing on computational efficiency while still retaining information over long sequences. This yields faster training times and the use of fewer resources.
Fig. 2. Diagram of an SRU cell
The SRU relies mainly on one component: the cell state. A similar update mechanism decides how to update this cell state based on new inputs at each time step. The former LSTM gate operations can be identified within the new SRU cell architecture:
Step 1: Forget Gate
The forget gate $f_t$ still determines how much of the previous cell state $c_{t-1}$ should be retained:

$f_t = \sigma(W_f \cdot x_t + b_f)$  (7)

Step 2: Input Gate
The input gate in the SRU is simplified by using the complement of the forget gate:

$i_t = (1 - f_t) \odot \tilde{x}_t$  (8)

where $\tilde{x}_t = W_i \cdot x_t$. The cell state $c_t$ is scaled by the forget gate and then updated by adding the input gate:

$c_t = f_t \odot c_{t-1} + i_t$  (9)

Step 3: Hidden State Computation
Finally, the hidden state $h_t$ is computed using the reset gate $r_t$. Part of $h_t$ is derived from the cell state $c_t$ after applying the tanh activation, while the rest comes from a highway connection with the original input $x_t$:

$h_t = r_t \odot \tanh(c_t) + (1 - r_t) \odot x_t$  (10)

with:

$r_t = \sigma(W_r \cdot x_t + b_r)$  (11)

Note that $x_t$ must have the same dimensionality as $r_t$ in (10) for the equation to be valid.
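The corresponding SRU step is noticeably lighter; below is a minimal NumPy sketch of (7)-(11), again with illustrative sizes. Square weight matrices are assumed so that the highway term $(1 - r_t) \odot x_t$ in (10) is dimensionally valid, as the note above requires.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_step(x_t, c_prev, W_f, W_i, W_r, b_f, b_r):
    """One SRU time step following (7)-(11); no recurrence on the hidden state."""
    f_t = sigmoid(W_f @ x_t + b_f)                  # (7) forget gate, depends only on x_t
    i_t = (1.0 - f_t) * (W_i @ x_t)                 # (8) input gate, complement of f_t
    c_t = f_t * c_prev + i_t                        # (9) cell state update
    r_t = sigmoid(W_r @ x_t + b_r)                  # (11) reset gate
    h_t = r_t * np.tanh(c_t) + (1.0 - r_t) * x_t    # (10) highway connection
    return h_t, c_t

n = 4  # input and cell dimensions kept equal so the highway connection is valid
rng = np.random.default_rng(0)
W_f, W_i, W_r = (rng.normal(scale=0.1, size=(n, n)) for _ in range(3))
h_t, c_t = sru_step(rng.normal(size=n), np.zeros(n), W_f, W_i, W_r, np.zeros(n), np.zeros(n))
print(h_t, c_t)
```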
3.3 Notable Differences Between SRU and LSTM
In an SRU, a cell does not require the hidden state $h_{t-1}$ to calculate the current cell state. Indeed, in an LSTM, the forget gate depends on the input $x_t$ and the hidden state $h_{t-1}$, while in an SRU, the forget gate depends only on $x_t$. This removes temporal dependencies in the gate computations.
The input gate is also simplified: there are no activation functions as in LSTMs (sigmoid for the input gate and tanh for the candidate cell state), but simply a multiplication with the complement of the forget gate, $1 - f_t$. This change also reduces the number of weight matrices in the input path from two to one.
Finally, a highway connection is used in the computation of the hidden state $h_t$, whereas in an LSTM, the hidden state is computed using only a regular gating mechanism as the output.
These contrasts suggest possible benefits to replacing LSTMs with SRUs in this work's use case.
3.4 Analysis of the Benefits of Transitioning from LSTMs to SRUs
Firstly, removing the hidden-state dependency reduces the number of matrix-vector operations. More specifically, the dimension of the hidden state vector is equal to the number of LSTM cells $n$, and the dimension of the weight matrix in an LSTM is $4n \times (n + m)$, with $m$ being the dimension of the input vector. In the LSTM gate equations (1)-(3), (5), the matrix-vector multiplication thus involves $4n \times (n + m)$ multiplications and $4n \times (n + m - 1)$ additions. In the SRU gates (7), (8), (11), since there is no hidden state, there are $3n \times m$ multiplications and $3n \times (m - 1)$ additions. The model's details are explained below in Section 3.6, but the first layer consists of 100 LSTM cells, so this reduces the total number of operations by a factor of 6.67 in each matrix-vector multiplication for each time step in this layer alone. Having no hidden state computations can also leverage the SRU's full parallelization potential by processing multiple time steps together, depending on the available hardware resources.
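As a quick sanity check on these counts, the snippet below evaluates both expressions for the first layer of the model described in Section 3.6 (n = 100 cells, m = 25 sensor inputs) and recovers the 6.67 factor.

```python
# Per time step and per matrix-vector product, for the first layer only
n, m = 100, 25
lstm_mults = 4 * n * (n + m)   # LSTM gates (1)-(3), (5): 50,000 multiplications
sru_mults = 3 * n * m          # SRU gates (7), (8), (11): 7,500 multiplications
print(lstm_mults, sru_mults, round(lstm_mults / sru_mults, 2))  # 50000 7500 6.67
```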
Activation functions help LSTMs manage non-linear and complex time dependencies, but they also demand more processing power. Indeed, replacing both the sigmoid and the tanh of the input path by a basic arithmetic operation such as $1 - f_t$ makes the operations more lightweight without requiring further memory accesses, unlike LUT-based (lookup table) operations. It is also convenient since it reuses the forget gate $f_t$ without requiring another computation.
Finally, adding a highway connection in the hidden state computation could be particularly relevant in multi-layer models like ours. While LSTMs employ a straightforward hidden state computation (6), the SRU's highway connection mechanism (10) enables direct information propagation from lower to higher layers through an adaptive pathway. The reset gate $r_t$ then acts as an adaptive filter, determining the balance between transformed cell state information and raw input features. When $r_t$ is close to 1, the hidden state represents the cell state transformation, whereas when $r_t$ is closer to 0, the hidden state preserves more of the original input features. This allows different SRU layers to selectively receive processed or preserved input features in a flexible way, based on the learned parameters.
3.5 Hypothesis: SRU Architecture Streamlining for Predictive Maintenance
Since embedded systems may lack the hardware needed for parallel computation, this study aimed to explore the potential of a naive SRU implementation tailored to predictive maintenance tasks. Observing the nature of these tasks, a streamlined version of the conventional SRU architecture is proposed, eliminating the hidden state computation. This hypothesis is based on one key point: predictive maintenance primarily focuses on identifying gradual degradation and long-term trends in sensor data, which the cell state $c_t$ can capture effectively, possibly rendering a separate hidden state $h_t$ unnecessary. The cell state is then not only passed across time steps but also fed as input to the second layer of the model. Thus, only the forget gate $f_t$ and the input gate $i_t$, with an added bias, are used in order to capture these gradual changes rather than complex short-term patterns.
Mathematically, the proposed modification transforms the SRU computations from:

$f_t = \sigma(W_f \cdot x_t + b_f), \quad i_t = (1 - f_t) \odot \tilde{x}_t, \quad r_t = \sigma(W_r \cdot x_t + b_r),$
$c_t = f_t \odot c_{t-1} + i_t, \quad h_t = r_t \odot \tanh(c_t) + (1 - r_t) \odot x_t$  (12)

to the minimal form:

$f_t = \sigma(W_f \cdot x_t + b_f), \quad i_t = (1 - f_t) \odot \tilde{x}_t, \quad c_t = f_t \odot c_{t-1} + i_t$  (13)

This streamlining further reduces the model's number of matrix-vector operations by a factor of 1.5. Figure 3 shows the minimal SRU diagram.

Fig. 3. Diagram of a minimal SRU
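In code, the proposed minimal cell collapses to the forget and input gates of the SRU sketch above; below is a NumPy illustration of (13) under the same assumptions, where the returned cell state serves both as the recurrent memory and as the layer output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minimal_sru_step(x_t, c_prev, W_f, W_i, b_f):
    """One minimal-SRU time step following (13): only forget and input gates remain."""
    f_t = sigmoid(W_f @ x_t + b_f)
    i_t = (1.0 - f_t) * (W_i @ x_t)
    return f_t * c_prev + i_t   # c_t, passed both across time and to the next layer
```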
These advantages, however, do not come without a downside. The structure of the SRU consequently reduces its flexibility in capturing more complex temporal dependencies. Indeed, the LSTM uses two distinct activation functions for the input gate and the cell state candidate, combined through an element-wise product. This allows the LSTM to apply a gating mechanism to the cell state candidate independently of the forget gate. The advantage of this combination is that the cell state can not only increase (since the sigmoid output is always positive, between 0 and 1) but also decrease, thanks to the tanh that scales input values while maintaining their sign, mapping them to the range -1 to 1. Hence, LSTMs can handle non-linearity in the data in an efficient way. In contrast, the SRU's input gate is directly coupled to the forget gate through the expression $1 - f_t$, eliminating both activation functions. This coupling means that the decision to add new information is directly tied to the decision to forget information, reducing the network's flexibility in independently managing these two processes. The absence of tanh also means the cell state can only increase, contributing to a more linear behavior in how new information is added.
It is therefore essential to thoroughly test the SRU architecture for the specific application involved. In general, SRUs offer a substantial improvement in embedded performance, though potentially affecting the model's predictive capabilities. The trade-off between these two aspects is assessed in this paper to evaluate the potential of SRUs in this research's application.
3.6 LSTM and SRU Models and Dataset
The models aim to correctly predict the RUL of engines directly on the VisionFive2 RISC-V board. To achieve this, the Turbofan Engine Degradation Simulation Data Set [18] provided by NASA has been used, and the initial LSTM model trained in [7] has been used as the base model. This is a convenient model to start with since it is a plain Keras/TensorFlow implementation of an LSTM, which leaves room for optimization.
The important information to retain regarding the training process is the format of the input data and the depth of the models. The dataset is segmented into time sequences that contain input from 25 different sensors over a window of 50 cycles. In total, 93 different engines were tested, each having its own corresponding sequence. The models consist of two LSTM/SRU layers with 100 and 50 nodes, followed by a single-unit dense layer for predictions, as described by Figure 4, generated with Netron [17]. This setup makes it possible to evaluate the behavior of SRUs with the same number of cells as the LSTMs.
Fig. 4. Model layers
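For reference, a minimal Keras sketch of the baseline topology of Figure 4 is given below; the sigmoid output for the binary failure label and the exact layer arguments are assumptions based on the description above, not a copy of the deployed model.

```python
import tensorflow as tf

# Sequences of 50 cycles x 25 sensors -> two recurrent layers (100, 50) -> one output unit
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(100, return_sequences=True, input_shape=(50, 25)),
    tf.keras.layers.LSTM(50),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

The SRU and minimal SRU variants keep the same layer widths and only swap the recurrent cell implementation.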
4 Tests and Experimental Results
4.1 Test Bench
Since the aim is to combine the models with optimizations that offer various adjustable parameters, establishing a robust workflow is essential in order to conduct and track multiple optimization tests, each resulting in a different model. To evaluate and compare these models, a test bench enabling an automated and traceable workflow has been implemented. It includes multiple crucial steps such as deploying the selected model on the target device, retrieving its predictions, and generating detailed test and analysis reports. This procedure also includes automatic storage of the generated reports, enabling traceable and organized records.
In these reports, the model's predictive capabilities and embedded performance are captured through relevant metrics. In predictive maintenance applications, the most important criteria typically focus on a balance between predictive accuracy and efficient use of system resources while primarily identifying engine failures. Accuracy is used for overall prediction reliability; the F1-score and MCC are essential since predictive maintenance often deals with imbalanced data (fewer failures than normal operations). More specifically, recall is also measured since detecting maintenance needs or failures is the most crucial for cost and safety reasons. Average latency is critical from an embedded point of view and is the main gauge of whether the model is correctly optimized. Energy per inference is measured by averaging the power consumed by the target during inference. Finally, memory usage is estimated since embedded systems have limited memory.
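A sketch of how such a report could be assembled is shown below, assuming scikit-learn for the prediction metrics and a list of per-run latencies collected on the target; the exact report format used by the test bench may differ.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

def build_report(y_true, y_pred, latencies_s):
    """Aggregate prediction and embedded-performance indicators into one record."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),        # missed failures are the costly errors
        "f1_score": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),      # robust to class imbalance
        "avg_latency_s": float(np.mean(latencies_s)),
    }

# Toy example with made-up labels and latencies
print(build_report(np.array([0, 0, 1, 1, 0, 1]),
                   np.array([0, 0, 1, 0, 0, 1]),
                   [0.0164, 0.0158, 0.0171]))
```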
4.2 Optimization Techniques
Three primary techniques were tested in this work: pruning, quantization, and knowledge distillation. Below, an explanation is given of why each optimization technique was or was not applied.
4.2.1 Pruning. In the first layer of the LSTM model, there are 100 LSTM cells, comprising 50,000 weights and 400 bias terms, totaling 50,400 parameters, many of which contribute only slightly to the final output. By applying pruning, these less important connections can be removed, reducing the memory footprint of the model and accelerating inference.
Progressive pruning was chosen specifically for a smoother adaptation to the loss of connections. This approach enables the model to adapt and optimize itself over time. This incremental approach also enabled an evaluation of how far the model can be pruned without significantly impacting its accuracy, pushing the limits of compression.
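The following is a minimal sketch of progressive magnitude pruning in NumPy, applied to a weight matrix of the same shape as the first layer's 50,000-weight kernel; the sparsity schedule and the fine-tuning step are placeholders, not the exact settings used in this work.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries so that `sparsity` of them are zero."""
    w = weights.copy()
    k = int(sparsity * w.size)
    if k > 0:
        thresh = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
        w[np.abs(w) <= thresh] = 0.0
    return w

rng = np.random.default_rng(0)
kernel = rng.normal(size=(400, 125))   # 4n x (n + m) with n = 100, m = 25: 50,000 weights

# Progressive schedule: raise the sparsity target step by step, fine-tuning in between
for target in np.linspace(0.2, 0.9, 8):
    kernel = magnitude_prune(kernel, target)
    # ... fine-tune the model here so it adapts to the removed connections ...
    print(f"target sparsity {target:.2f}, actual {np.mean(kernel == 0.0):.2f}")
```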
4.2.2 Knowledge Distillation. A different compression approach is to directly reduce the number of cells in the models' layers. However, instead of simply training a new model with smaller layers, knowledge distillation was applied to benefit from the original trained model. Knowledge distillation can create a smaller student model that mimics the performance of a larger teacher model with fewer parameters. The trade-off with knowledge distillation is the additional training time required to train the student models and the variability of the results, which makes the process less reproducible. Despite these limitations, knowledge distillation is a valuable technique to significantly reduce model size without a severe loss in performance for embedded applications.
The knowledge distillation applied here aimed to divide the number of LSTM and SRU cells in each layer by 2, resulting in models consisting of a first layer of 50 units and a second layer of 25 units.
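One simple way to realize such a teacher-student setup is soft-label blending, sketched below with Keras; the random stand-in data, the untrained teacher, and the blending weight alpha are assumptions for illustration, and the actual distillation loss used in this work may differ.

```python
import numpy as np
import tensorflow as tf

# Stand-in data shaped like the windows of Section 3.6 (engines x 50 cycles x 25 sensors)
X = np.random.rand(93, 50, 25).astype("float32")
y = np.random.randint(0, 2, size=(93, 1)).astype("float32")

def build(units1, units2):
    return tf.keras.Sequential([
        tf.keras.layers.LSTM(units1, return_sequences=True, input_shape=(50, 25)),
        tf.keras.layers.LSTM(units2),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

teacher = build(100, 50)   # in practice this is the already-trained base model
student = build(50, 25)    # half-sized layers, as described above
student.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Blend hard labels with the teacher's probabilities and train the student on the mix
alpha = 0.5
soft = teacher.predict(X, verbose=0)
student.fit(X, alpha * y + (1.0 - alpha) * soft, epochs=2, batch_size=16, verbose=0)
```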
4.2.3 Quantization. Quantization was also considered to further reduce the model size and speed up inference by lowering the precision of the model weights and activations. However, the tests revealed that quantization did not lead to improved performance on the VisionFive2. In fact, the quantized operations performed worse than the floating-point implementations, likely due to the normalization overhead of quantized arithmetic operations, coupled with the presence of a Floating-Point Unit (FPU) on the target platform. Floating-point operations are thus processed with high efficiency, resulting in the quantized implementation being slower and less efficient. In this specific setup, quantization not only hinders embedded performance but also negatively impacts the models' predictive capabilities, making it less advantageous than knowledge distillation and pruning. Similar results have been demonstrated on comparable devices in [4].
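For completeness, the post-training quantization path that was tested corresponds roughly to the sketch below, using a stand-in for the Section 3.6 model; additional converter flags may be needed for LSTM layers, and this is precisely the variant that turned out slower on the FPU-equipped VisionFive2.

```python
import tensorflow as tf

# Minimal stand-in for the Section 3.6 model
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(100, return_sequences=True, input_shape=(50, 25)),
    tf.keras.layers.LSTM(50),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # dynamic-range weight quantization
tflite_quantized = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quantized)
```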
4.3 Experimental Protocol and Results
An experimental protocol was designed to evaluate and compare the performance of multiple models. Figure 5 is a sketch describing the different steps of the protocol. It starts by training the desired models, the LSTM and the SRU. A specific optimization technique is then applied, generating different versions of the models. Next, these models are evaluated one by one using the test bench, which sends each model to the target, executes it, retrieves the metrics, and generates a performance report containing the performance indicators. Once all the models are assessed individually, their performance reports are selected together and a comparative analysis report is generated. This analysis report allows the comparison of all the relevant models with charts combining their individual metrics. Thus, the impact of the optimization techniques and the model variations can be assessed by directly comparing them to the initial LSTM in a single report.
Fig. 5. Automated Test Bench Workflow
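As an illustration, the automated loop of Figure 5 could be scripted along the following lines; the board address, file paths, and the assumption that the on-target inference script prints a JSON metrics dictionary are all hypothetical, not the bench's actual interface.

```python
import json
import subprocess
from pathlib import Path

BOARD = "user@visionfive2.local"      # hypothetical SSH target
REPORT_DIR = Path("reports")

def evaluate_on_target(model_path: str) -> dict:
    """Copy a model and inference script to the board, run it, and fetch the metrics."""
    subprocess.run(["scp", model_path, "inference.py", f"{BOARD}:/tmp/"], check=True)
    result = subprocess.run(
        ["ssh", BOARD, f"python3 /tmp/inference.py --model /tmp/{Path(model_path).name}"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)   # inference.py is assumed to print a JSON dict

REPORT_DIR.mkdir(exist_ok=True)
for model_file in ["lstm_initial.tflite", "sru.tflite", "sru_minimal.tflite"]:
    metrics = evaluate_on_target(model_file)
    (REPORT_DIR / f"{Path(model_file).stem}.json").write_text(json.dumps(metrics, indent=2))
```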
To ensure consistency and reliability, each test was performed over a series of 100 iterations. Performance indicators for each model were averaged over these 100 runs to obtain stable metrics, especially for embedded performance. The memory footprint is an estimate of the amount of RAM required to load and run a model. It is measured by tracking the difference in memory consumption before and after loading the model and also includes the memory consumption of other processes that are active on the target. Tables 1 and 2 showcase the final models evaluated on the test bench with their associated metrics and the comparison with the initial LSTM in percentage points. In Table 2, positive values for the prediction scores (accuracy, precision, recall, F1-score, and MCC) indicate a gain, whereas for average latency and model loading memory, negative values represent the gain.
Manuscript submitted to ACM
Optimizing SRU Models for Predictive Maintenance on Embedded Systems (RISC-V) 11
Table 1. Performance and Resource Metrics of LSTM and SRU Models
Model | Accuracy | Precision | Recall | F1-Score | MCC | Average Latency (s) | Loading Memory (MB) | Energy per Inference (J)
Initial LSTM 0.9785 0.9600 0.9600 0.9600 0.9453 0.0164 0.0400 0.0763
LSTM pruned 0.9677 0.8929 1.0000 0.9434 0.9238 0.0093 0.0300 0.0432
LSTM distilled 0.9785 0.9259 1.0000 0.9615 0.9480 0.0065 0.0300 0.0302
LSTM distilled pruned 0.9032 0.8333 0.8000 0.8163 0.7510 0.0048 0.0300 0.0223
SRU 0.9892 0.9615 1.0000 0.9804 0.9733 0.0079 0.0300 0.0367
SRU pruned 0.9140 0.8148 0.8800 0.8462 0.7877 0.0049 0.0300 0.0228
SRU distilled 0.9785 0.9259 1.0000 0.9615 0.9480 0.0038 0.0300 0.0177
SRU distilled pruned 0.9032 0.9444 0.6800 0.7907 0.7466 0.0031 0.0300 0.0144
Minimal SRU 0.9785 0.9600 0.9600 0.9600 0.9453 0.0041 0.0200 0.0186
Minimal SRU distilled 0.9570 0.9565 0.8800 0.9167 0.8892 0.0021 0.0200 0.0098
Table 2. Performance and Resource Metrics of Binary LSTM and SRU Models with Variations in Percentage Points
Model | Accuracy | Precision | Recall | F1-Score | MCC | Average Latency (%) | Loading Memory (%) | Energy per Inference (%)
Initial LSTM 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
LSTM pruned -1.10 -6.71 +4.00 -1.66 -2.15 -43.29 0.00 -43.38
LSTM distilled +0.00 -3.41 +4.00 +0.15 +0.27 -60.37 -25.00 -60.40
LSTM distilled pruned -7.53 -12.67 -16.00 -14.37 -19.43 -70.73 -25.00 -70.78
SRU +1.07 +0.15 +4.00 +2.04 +2.80 -51.83 -25.00 -51.91
SRU pruned -6.45 -14.52 -8.00 -11.38 -15.76 -70.11 -25.00 -70.12
SRU distilled +0.00 -3.41 +4.00 +0.15 +0.27 -76.83 -25.00 -76.82
SRU distilled pruned -7.53 -1.56 -28.00 -16.93 -19.87 -81.10 -25.00 -81.12
Minimal SRU +0.00 +0.00 +0.00 +0.00 +0.00 -75.00 -50.00 -75.62
Minimal SRU distilled -2.15 -0.35 -8.00 -4.33 -5.61 -87.20 -50.00 -87.15
Among the evaluated models, the LSTM distilled, the SRU, the SRU distilled, and the minimal SRU models stand out for their balance of accuracy and resource efficiency. The LSTM distilled maintains the accuracy of the initial LSTM and outperforms it with a recall of 1.0, while reducing the average inference latency by 60.37%. Surprisingly, the minimal SRU achieved exactly the same prediction scores as the initial LSTM while reducing latency by 75% and the model's memory consumption by 50%. The SRU model achieves the best prediction metrics, not only increasing accuracy to 0.9892 but also reducing the initial average latency by a significant 51.83%. The SRU distilled model is also interesting since it combines the best of the initial LSTM and the SRU by maintaining an accuracy of 0.9785, successfully identifying all engine failures, and further reducing latency by 76.83%. Figure 6 shows the prediction capabilities of the initial LSTM, the SRU, and the minimal SRU. Each metric is represented on one axis, and values are shown on a scale from 0.8 to 1 for better visibility. In this spider diagram, a larger enclosed area indicates better predictions. The prediction scores are either enhanced or maintained, respectively, for the SRU and the minimal SRU compared to the initial LSTM model. Figure 7 is a spider diagram of the embedded performance of the three models. The LSTM serves as a reference with a value of 1 for each metric; the other models are then directly compared to this reference. In this spider diagram, a smaller enclosed area indicates better embedded performance. It shows that the SRU's embedded performance is improved compared to the initial LSTM, and that the minimal SRU's embedded performance is further enhanced compared to the SRU.
Fig. 6. Predictive metrics of initial LSTM and SRU models
Fig. 7. Enhancement of embedded performance of SRU models compared to the initial LSTM
Aggressive pruning of both models is required to deliver changes relative to the initial model. This leads to very sparse models, with 90 to 95% sparsity for the pruned LSTM and the pruned SRU. The pruned LSTM model managed to achieve nearly the same accuracy as the initial model while increasing the recall score to 1.00, despite the high sparsity. Pruning has a more negative impact on SRUs, since they already have considerably fewer parameters than the LSTM, reducing accuracy by 6.45 points and recall by 8 points. However, for both types of models, pruning
has reduced average latency, by 43.29% and 70.11% respectively, which is a significant change considering it is a fairly straightforward technique to apply.
The SRU model combining knowledge distillation and pruning reduced the average latency by 81.10%, at the cost of 7.53 points of accuracy and 28 points of recall. The LSTM distilled pruned model also showed significant drops in performance metrics, with the MCC falling by 19.43 points. These results show that combining both optimizations is effective at significantly enhancing embedded performance, but it also affects the prediction scores the most. The loss in prediction scores due to optimization techniques is natural because pruning and distillation both involve simplifying the model by reducing its parameters or capacity, unintentionally affecting the model's ability to generalize. To reduce this loss, setting clear thresholds during training is recommended to carefully balance the degree of optimization based on the application's requirements. In this work, a standard pruning and distillation approach was pursued. However, more advanced techniques like structured pruning, sparsity-aware training, or adaptive distillation could be explored in future work to minimize the resulting loss in prediction scores when applying optimization techniques.
Overall, for the application of this paper, the SRU offers the best trade-off. It outperformed the initial model across all metrics while achieving a 51.83% reduction in latency. However, it still has a memory footprint similar to the LSTM's, which is acceptable for a small model but could cause issues in other contexts. For applications prioritizing rapid responses, such as mobile or IoT-based monitoring systems, the SRU distilled and the minimal SRU would be better choices. They both achieved an accuracy of 0.9785, identical to the initial LSTM, with latency reductions of 76.83% and 75.0% respectively. It is also observed that the minimal SRU distilled achieved the largest reduction in latency, an 87.20% decrease, with a moderate loss in prediction scores given the speed-up. In embedded system applications, the proposed minimal SRU is interesting since it offers an attractive trade-off: a 48.1% speed-up compared to the standard SRU with at most 0.04 points lost in prediction scores. Thus, these SRUs are viable options for embedded systems, achieving strong results both in speeding up inference and in reducing computational complexity.
5 Concluding Remarks
This work presents two different strategies for porting LSTM and SRU models onto a RISC-V embedded target in the context of predictive maintenance. The first involves directly applying optimization techniques to the models. Pruning proved effective in reducing resource demands, highlighting its utility as a fairly straightforward optimization strategy. Knowledge distillation offered substantial performance and efficiency improvements for all models. However, excessive optimization can compromise predictive accuracy, as demonstrated when combining both methods on an SRU.
The second strategy is changing the model's architecture itself, either by replacing it with a different one or by reducing the layers of the model, tailoring it to the specific application. This approach worked particularly well and gave promising results, with the minimal SRU emerging as an optimal choice for embedded systems with strict requirements, maintaining the same prediction scores as the LSTM with a 75% speed-up.
Moreover, the successful implementation of SRUs as a replacement for LSTMs demonstrates their potential for embedded predictive maintenance, offering a favorable balance between prediction accuracy and computational efficiency. Replacing an LSTM with an SRU yielded a one-point increase in accuracy, a 51.83% reduction in average latency, and a model size 1.62 times smaller.
Future work could focus on further developing the minimal SRU architecture to propose a custom model with the aim of maintaining attractive embedded performance while enhancing prediction scores.
Acknowledgments
We would like to express our sincere appreciation to the institution ADVANS Group that made this work possible
through its subsidiary ADVANS Lab, the entity which manages the group’s internal and external R&D.
References
[1] Xanthi Bampoula, Georgios Siaterlis, Nikolaos Nikolakis, and Kosmas Alexopoulos. 2021. A Deep Learning Model for Predictive Maintenance in Cyber-Physical Production Systems Using LSTM Autoencoders. Sensors 21, 3 (2021). https://doi.org/10.3390/s21030972
[2] Dario Bruneo and Fabrizio De Vita. 2019. On the Use of LSTM Networks for Predictive Maintenance in Smart Industries. In 2019 IEEE International Conference on Smart Computing (SMARTCOMP). 241–248. https://doi.org/10.1109/SMARTCOMP.2019.00059
[3] Jean-Baptiste Chaudron and Arnaud Dion. 2023. Evaluation of Gated Recurrent Neural Networks for Embedded Systems Applications. In Computational Intelligence, Jonathan Garibaldi, Christian Wagner, Thomas Bäck, Hak-Keung Lam, Marie Cottrell, Kurosh Madani, and Kevin Warwick (Eds.). Springer International Publishing, Cham, 223–244.
[4] Jeffrey Chen, Sehwan Hong, Warrick He, Jinyeong Moon, and Sang-Woo Jun. 2021. Eciton: Very Low-Power LSTM Neural Network Accelerator for Predictive Maintenance at the Edge. In 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). 1–8. https://doi.org/10.1109/FPL53798.2021.00009
[5] Swarnava Dey, Arijit Mukherjee, Arijit Ukil, and Arpan Pal. 2024. Towards a Task-agnostic Distillation Methodology for Creating Edge Foundation Models. In Proceedings of the Workshop on Edge and Mobile Foundation Models (Minato-ku, Tokyo, Japan) (EdgeFM '24). Association for Computing Machinery, New York, NY, USA, 10–15. https://doi.org/10.1145/3662006.3662061
[6] Vidushi Goyal, Reetuparna Das, and Valeria Bertacco. 2022. Hardware-friendly User-specific Machine Learning for Edge Devices. ACM Trans. Embed. Comput. Syst. 21, 5, Article 62 (Oct. 2022), 29 pages. https://doi.org/10.1145/3524125
[7] Umberto Griffo. 2018. Predictive Maintenance using LSTM. https://github.com/umbertogriffo/Predictive-Maintenance-using-LSTM. GitHub repository.
[8] Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple Recurrent Units for Highly Parallelizable Recurrence. arXiv:1709.02755 [cs.CL]. https://arxiv.org/abs/1709.02755
[9] H. Lu. 2023. Supervised Algorithm for Predictive Maintenance. https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-325821
[10] Michał Markiewicz, Maciej Wielgosz, Mikołaj Bocheński, Waldemar Tabaczyński, Tomasz Konieczny, and Liliana Kowalczyk. 2019. Predictive Maintenance of Induction Motors Using Ultra-Low Power Wireless Sensors and Compressed Recurrent Neural Networks. IEEE Access 7 (2019), 178891–178902. https://doi.org/10.1109/ACCESS.2019.2953019
[11] Irene Niyonambaza Mihigo, Marco Zennaro, Alfred Uwitonze, James Rwigema, and Marcelo Rovai. 2022. On-Device IoT-Based Predictive Maintenance Analytics Model: Comparing TinyLSTM and TinyModel from Edge Impulse. Sensors 22, 14 (2022). https://doi.org/10.3390/s22145174
[12] Arijit Mukherjee, Jayeeta Mondal, and Swarnava Dey. 2022. Accelerated Fire Detection and Localization at Edge. ACM Trans. Embed. Comput. Syst. 21, 6, Article 70 (Oct. 2022), 27 pages. https://doi.org/10.1145/3510027
[13] Donghyun Park, Seulgi Kim, Yelin An, and Jae-Yoon Jung. 2018. LiReD: A Light-Weight Real-Time Fault Detection System for Edge Computing Using LSTM Recurrent Neural Networks. Sensors 18, 7 (2018). https://doi.org/10.3390/s18072110
[14] Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, and Wonyong Sung. 2018. Fully neural network based speech recognition on mobile and embedded devices. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Montréal, Canada) (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 10642–10653.
[15] Zepeng Qin, Chen Cen, and Xu Guo. 2019. Prediction of Air Quality Based on KNN-LSTM. Journal of Physics: Conference Series 1237, 4 (June 2019), 042030. https://doi.org/10.1088/1742-6596/1237/4/042030
[16] Lei Ren, Tao Wang, Zidi Jia, Fangyu Li, and Honggui Han. 2023. A Lightweight and Adaptive Knowledge Distillation Framework for Remaining Useful Life Prediction. IEEE Transactions on Industrial Informatics 19, 8 (2023), 9060–9070. https://doi.org/10.1109/TII.2022.3224969
[17] Lutz Roeder. [n. d.]. Netron: Viewer for neural network, deep learning and machine learning models. https://github.com/lutzroeder/netron. GitHub repository.
[18] A. Saxena and K. Goebel. 2008. Turbofan Engine Degradation Simulation Data Set. https://www.nasa.gov/content/turbofan-engine-degradation-simulation-data-set. NASA Prognostics Data Repository, NASA Ames Research Center, Moffett Field, CA.
[19] Muhammad Shafique, Theocharis Theocharides, Hai Li, and Chun Jason Xue. 2022. Introduction to the Special Issue on Accelerating AI on the Edge – Part 1. ACM Trans. Embed. Comput. Syst. 21, 5, Article 47 (Dec. 2022), 5 pages. https://doi.org/10.1145/3558078
[20] Apeksha Shewalkar, Deepika Nyavanandi, and Simone A. Ludwig. 2019. Performance evaluation of deep neural networks applied to speech recognition: RNN, LSTM and GRU. Journal of Artificial Intelligence and Soft Computing Research 9, 4 (2019), 235–245.
[21] Izaak Stanton, Kamran Munir, Ahsan Ikram, and Murad El-Bakry. 2023. Predictive maintenance analytics and implementation for aircraft: Challenges and opportunities. Systems Engineering 26, 2 (2023), 216–237. https://doi.org/10.1002/sys.21651
[22] Wonyong Sung and Jinhwan Park. 2018. Single Stream Parallelization of Recurrent Neural Networks for Low Power and Fast Inference. CoRR abs/1803.11389 (2018). arXiv:1803.11389. http://arxiv.org/abs/1803.11389
[23] Dechen Yao, Boyang Li, Hengchang Liu, Jianwei Yang, and Limin Jia. 2021. Remaining useful life prediction of roller bearings based on improved 1D-CNN and simple recurrent unit. Measurement 175 (2021), 109166. https://doi.org/10.1016/j.measurement.2021.109166
Received XX December 2024; revised Day Month Year; accepted Day Month Year