Optimizing SRU Models for Predictive Maintenance on Embedded Systems
(RISC-V)
SAMY CHEHADE, ADVANS Group, France
ADRIEN TIRLEMONT, ADVANS Group, France
FLORIAN DUPEYRON, ADVANS Group, France
PIERRE ROMET, ADVANS Group, CIAD UR 7533, Belfort Montbéliard University of Technology, UTBM, Belfort, France
This study aims to assess different strategies for deploying a deep learning recurrent neural network on the VisionFive2 RISC-V board for predictive maintenance. The main contribution of this paper is to propose a minimal SRU (Simple Recurrent Unit) model specifically designed for this application on embedded systems. The workflow involves applying pruning and knowledge distillation on an initially unoptimized LSTM (Long Short Term Memory) model, replacing it with a standard SRU for a more efficient architecture and applying the same optimizations. Finally, the SRU architecture is modified to further enhance its performance on embedded systems. While replacing the LSTM with an SRU led to a 51.8% reduction in average latency and a one point increase in accuracy, the minimal SRU resulted in a 75% speed-up and maintained the prediction scores of the initial LSTM model. Models are evaluated based on prediction and embedded performance metrics, providing a general overview of the impact of the optimization techniques used in a turbofan engine degradation application.
CCS Concepts: • Computing methodologies → Artificial intelligence; • Computer systems organization → Embedded systems.
Additional Key Words and Phrases: RNN, LSTM, SRU, Pruning, Knowledge distillation, Predictive maintenance, RISC-V.
ACM Reference Format:
Samy Chehade, Adrien Tirlemont, Florian Dupeyron, and Pierre Romet. 2024. Optimizing SRU Models for Predictive Maintenance
on Embedded Systems (RISC-V). In Proceedings of Make sure to enter the correct conference title from your rights confirmation email
(Conference acronym ’XX). ACM, New York, NY, USA, 15 pages. https://doi.org/XXXXXXX.XXXXXXX
1 Introduction
Articial intelligence has been greatly developing in the recent years, which galvanized its integration in embedded
systems applications. However, deep learning and machine learning models are more and more complex and resource-
intensive, making it dicult to embed these models into resource-constrained systems. By reducing its complexity, Edge-
AI enables these algorithms, including neural networks, to run directly on embedded systems, enhancing responsiveness,
data privacy, and energy eciency by processing data locally. Although Edge AI is spreading across all industry sectors
Email: innovation@advans-group.com; Phone: 0476122840. Postal Address: 12-14 Avenue Antoine Dutrievoz, 69100 Villeurbanne, France
Authors’ Contact Information: Samy Chehade, ADVANS Group, Cachan, France; Adrien Tirlemont, ADVANS Group, Cachan, France; Florian Dupeyron,
ADVANS Group, Cachan, France; Pierre Romet, ADVANS Group, CIAD UR 7533, Belfort Montbéliard University of Technology, UTBM, Belfort, Lyon,
France.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Manuscript submitted to ACM
[19], the work in this article focuses specifically on predictive maintenance, as an example of time series prediction, where Edge AI can be a game-changer in making systems more autonomous by eliminating the need for cloud dependence, and safer by keeping the data private.
Predictive maintenance is a crucial strategy for anticipating industrial system failures and minimizing costs associated with unplanned downtime. It often relies on recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) models, which capture complex dependencies within time series data across diverse fields, including air quality forecasting [15] and speech recognition [20]. Moreover, LSTMs are commonly used for predicting the Remaining Useful Life (RUL) of various systems, such as jet engines [2], or any cyber-physical production system [1], where large amounts of sensor data can be easily provided by the Internet of Things, thus enhancing prediction accuracy.
However, this approach is computationally intensive and often requires substantial memory and processing power, typically executed on cloud-based infrastructures. This dependence on cloud resources faces several limitations, including high latency, reliability concerns, bandwidth constraints or data privacy issues. A promising solution to these limitations is to deploy deep learning models directly on embedded devices, enabling local and real-time processing of sensor data, which raises challenges due to the limited capacities of embedded platforms.
To meet these constraints, several strategies have been explored by researchers to adapt deep learning models, and in particular LSTMs, to these restricted environments. A first step is simply converting the desired model into a compressed format for embedded targets using existing tools such as TensorFlow Lite [11]. Optimization techniques have also emerged, aiming to reduce the models' complexity and computational needs. For instance, in an accelerated fire detection and localization application, quantization and compression have been explored to achieve on-board inference [12]. Furthermore, alternative models have become more and more popular to replace LSTMs in this context.
In this paper, an approach to deploying a Recurrent Neural Network model is presented, specifically replacing the LSTM with an SRU on a VisionFive2 RISC-V embedded board for predictive maintenance. The primary contributions include:
• A structured optimization of RNN models for embedded deployment, with LSTM serving as the base model.
• An automated test bench for continuous evaluation of predictive accuracy and embedded performance, providing traceable results.
• A focus on enhancing predictive maintenance with embedded-friendly architectures, particularly by implementing and testing an SRU and a minimal SRU architecture to replace LSTMs.
The rest of the paper is structured as follows: Section 2 covers the state of the art, focusing on existing work related to porting LSTMs and SRUs to embedded devices for predictive maintenance. Section 3 outlines the models' details, including the core differences between LSTMs, SRUs and minimal SRUs. Section 4 presents the test bench and optimization techniques, along with the experimental protocol and results. Finally, Section 5 concludes the paper.
2 State of The Art
When TensorFlow Lite converts an LSTM model to its .tflite format, it performs several key steps to make the model suitable for deployment on edge devices. Some of these steps include removing unnecessary operations that do not contribute to model inference on the target and replacing some TensorFlow operations with specialized TensorFlow Lite operators that are optimized for embedded platforms. This is how an LSTM model has been successfully converted to a TinyModel version [11]. More options are also available during the conversion process, such as pruning and quantization. These techniques, as used by Bringmann et al. [2021], have been sought and tested to reduce the size
and complexity of models by respectively removing a percentage of network parameters or reducing the precision of model weights. Pruning has also been considered directly on-device, where it is applied in real-time during inference to create smaller, user-specific models [6]. Moreover, post-training and network quantization of an LSTM model was applied in the context of battery management systems with the goal of deploying it on an STM32 MCU embedded system [9]. Compression methods were also investigated to allow LSTMs to run on ultra-low power wireless sensors for induction motor predictive maintenance [10]. There is also the possibility of combining these compression methods with hardware accelerators. Chen et al. [2021] showed the effect of various levels of quantization on the turbofan degradation predictive maintenance model and made an FPGA-LSTM accelerator to specifically run LSTM operations faster and with lower energy consumption, which makes it ideal for real-time monitoring and predictive maintenance tasks at the edge. Finally, a lightweight real-time fault detection system for edge computing is proposed with a fault detection model based on an LSTM [13]. It is also suggested to use a simpler model like the Gated Recurrent Unit (GRU) to reduce memory footprint and complexity as a future approach. All these works demonstrate the convenience of using LSTMs for predictive maintenance on edge devices. They also indicate that many strategies can be conducted in order to effectively run LSTM models on edge devices. In this paper, the focus is on applying the mentioned compression methods as well as the knowledge distillation optimization technique, since it appears to improve the RUL prediction of systems while providing more compact models [16]. It is also a great way to create compact versions of foundation models for edge devices [5]. Moreover, replacing the LSTM with an alternative model, typically the SRU, is also included in the core strategy of this work.
LSTMs have a complex gated architecture and multiple time dependencies within sequences, which multiplies memory accesses and extends inference time. Simplifying the LSTM architecture with other models that could address these limitations is thus a viable approach. As mentioned, the Simple Recurrent Unit (SRU) is the main target in this paper. Indeed, its architecture adds the possibility of parallelizing time steps [22], significantly speeding up inference time. This advantage has led to the adoption of SRUs in applications requiring real-time performance on embedded platforms, such as automatic speech recognition (ASR) [14]. SRUs are also an option for even more demanding applications such as NLP tasks, where their potential for parallelization is also interesting [8]. In addition, SRUs are used instead of LSTMs in a proposed model for predicting the RUL of roller bearings [23]. These works demonstrate the potential of SRUs to solve some of the LSTMs' limitations, not only in the embedded systems context but in a broader one. While these studies highlight SRUs' advantages over LSTMs in terms of computational efficiency, there has been, to our knowledge, no transparent assessment of SRUs specifically for embedded predictive maintenance applications. Motivated by this gap, this paper explores the impact of SRUs in such settings. More broadly, RNNs are increasingly integrated into embedded systems across multiple applications, positioning the SRU as a viable next model to be evaluated [3]. In the specific context of aircraft engines, for instance, implementing SRUs could facilitate early anomaly detection, reduce aircraft downtime, enhance flight safety, and lower operational costs. By supporting on-board predictive maintenance, SRUs address the key drawbacks of LSTMs: long training times and high computational demands [21].
3 Model
In this section, the two types of recurrent neural network (RNN) architectures central to this research are presented: Long Short-Term Memory networks (LSTMs) and Simple Recurrent Units (SRUs). The purpose of this comparison is to provide a simple understanding of the models' architecture in order to evaluate the potential benefits of replacing LSTMs with SRUs in embedded environments.
3.1 LSTM
LSTMs introduce key components compared to a standard RNN: a cell state, which represents long-term memory,
and a hidden state, which represents short-term memory. Together, these components enhance the LSTM’s ability to
manage both long-term and short-term patterns in time series data.
Fig. 1. Diagram of an LSTM cell
Its uniqueness lies in the ability to add or subtract information from the cell state via defined gates. Below is a concise explanation of the three crucial steps in an LSTM cell.
Step 1: Forget Gate
The forget gate $f_t$ determines what percentage of the previous cell state $C_{t-1}$ should be retained, using the previous hidden state $h_{t-1}$ and the current input $x_t$:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \quad (1)$$
Step 2: Input Gate and Cell State Update
The input gate decides what new information to store in the cell state, combining two parts: the input gate activation $i_t$, which decides the values to update in the cell state, and the candidate cell state $\tilde{C}_t$, which creates new potential values for the cell state based on $h_{t-1}$ and $x_t$:
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \quad (2)$$
$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \quad (3)$$
The cell state $C_t$ is then updated as follows:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (4)$$
Step 3: Output Gate and Hidden State Update
The hidden state $h_t$ is updated to produce the output of the LSTM cell at a specific time step. The output gate $o_t$ decides the percentage of the new cell state $C_t$ to include in the output, defined by:
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \quad (5)$$
The hidden state $h_t$ is then computed as follows:
$$h_t = o_t \odot \tanh(C_t) \quad (6)$$
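For illustration, the following sketch implements a single LSTM cell step with NumPy, directly following equations (1)-(6); the weight shapes are assumptions chosen for clarity and do not correspond to the trained model used later.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step following equations (1)-(6).

    x_t: input vector (m,), h_prev/C_prev: previous states (n,),
    each W_*: weight matrix (n, n + m), each b_*: bias vector (n,).
    """
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # (1) forget gate
    i_t = sigmoid(W_i @ z + b_i)                 # (2) input gate
    C_tilde = np.tanh(W_c @ z + b_c)             # (3) candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde           # (4) cell state update
    o_t = sigmoid(W_o @ z + b_o)                 # (5) output gate
    h_t = o_t * np.tanh(C_t)                     # (6) hidden state
    return h_t, C_t
```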
3.2 SRU
SRUs simplify the structure of the cell by reducing the complexity inside gating mechanisms, focusing on computational efficiency while still retaining information over long sequences. This engenders faster training times and the use of fewer resources.
Fig. 2. Diagram of an SRU cell
The SRU relies mainly on one component: the cell state. A similar update mechanism decides how to update this cell state based on new inputs at each time step. The former LSTM gates' operations can be identified within the new SRU cell architecture:
Step 1: Forget Gate
The forget gate $f_t$ still determines how much of the previous cell state $c_{t-1}$ should be retained:
$$f_t = \sigma\left(W_f \cdot x_t + b_f\right) \quad (7)$$
Step 2: Input Gate
The input gate in the SRU is simplified by using the complement of the forget gate:
$$i_t = (1 - f_t) \odot \tilde{x}_t \quad (8)$$
where $\tilde{x}_t = W_i \cdot x_t$.
The cell state $c_t$ is scaled by the forget gate and then updated by adding the input gate:
$$c_t = f_t \odot c_{t-1} + i_t \quad (9)$$
Step 3: Hidden State Computation
Finally, the hidden state $h_t$ is computed using the reset gate $r_t$. Part of $h_t$ is derived from the cell state $c_t$ after applying the tanh activation, while the rest comes from a highway connection with the original input $x_t$:
$$h_t = r_t \odot \tanh(c_t) + (1 - r_t) \odot x_t \quad (10)$$
with:
$$r_t = \sigma\left(W_r \cdot x_t + b_r\right) \quad (11)$$
Note that $x_t$ must have the same dimensionality as $r_t$ in (10) for the equation to be valid.
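The corresponding SRU step can be sketched in the same way, following equations (7)-(11); note that no gate uses $h_{t-1}$ and, as stated above, $x_t$ is assumed to have the same dimensionality as the cell state so that the highway term is valid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_step(x_t, c_prev, W_f, W_i, W_r, b_f, b_r):
    """One SRU time step following equations (7)-(11).

    x_t: input vector (n,), c_prev: previous cell state (n,),
    each W_*: weight matrix (n, n) so the highway term is dimensionally valid.
    """
    f_t = sigmoid(W_f @ x_t + b_f)                # (7) forget gate, no h_{t-1}
    x_tilde = W_i @ x_t                           # transformed input
    i_t = (1.0 - f_t) * x_tilde                   # (8) input gate from complement of f_t
    c_t = f_t * c_prev + i_t                      # (9) cell state update
    r_t = sigmoid(W_r @ x_t + b_r)                # (11) reset gate
    h_t = r_t * np.tanh(c_t) + (1.0 - r_t) * x_t  # (10) highway connection
    return h_t, c_t
```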
3.3 Notable Differences Between SRU and LSTM
In an SRU, a cell does not require the hidden state $h_{t-1}$ to calculate the current cell state. Indeed, in an LSTM, the forget gate depends on the input $x_t$ and the hidden state $h_{t-1}$, while in an SRU, the gates depend only on $x_t$. This removes temporal dependencies between gate computations.
The input gate is also simplified: there are no activation functions as in LSTMs (sigmoid for the input gate and tanh for the candidate cell state), but simply a multiplication with the complement of the forget gate, $1 - f_t$. This change also halves the number of weight matrices used in the input update.
Finally, a highway connection is used in the computation of the hidden state $h_t$, whereas in an LSTM, the hidden state is computed using only a regular gating mechanism as the output.
These contrasts suggest possible benefits to replacing LSTMs with SRUs in this work's use case.
3.4 Scientific Literature Contribution Analysis Concerning Transitioning from LSTMs to SRUs
Firstly, removing the hidden state dependency reduces the number of matrix-vector operations. More specifically, the dimension of the hidden state vector is equal to the number of LSTM cells $n$, and the dimension of the weight matrix in an LSTM is $4n \times (n+m)$, with $m$ being the dimension of the input vector. In the LSTM gate equations (1)-(3), (5), the matrix-vector multiplication will thus involve $4n \times (n+m)$ multiplications and $4n \times (n+m-1)$ additions. In the SRU gates (7), (8), (11), since there are no hidden states, there will be $3n \times m$ multiplications and $3n \times (m-1)$ additions. The model's details will be explained below in Section 3.6, but the first layer consists of 100 LSTM cells, and this reduces the total number of operations by a factor of roughly 6.67 in each matrix-vector multiplication for each time step in this layer alone. Having no hidden state computations can also leverage the SRU's full parallelization potential by joining multiple time steps together, based on the available hardware resources.
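As a quick sanity check of the factor quoted above, the multiplication counts for the first layer ($n = 100$ cells, $m = 25$ input features, see Section 3.6) can be computed directly; this is an illustrative calculation rather than a measured benchmark.

```python
n, m = 100, 25                   # first-layer cells and input features (Section 3.6)

lstm_mults = 4 * n * (n + m)     # LSTM: four gate matrices applied to [h_{t-1}, x_t]
sru_mults = 3 * n * m            # SRU: three gate matrices applied to x_t only

print(lstm_mults, sru_mults, round(lstm_mults / sru_mults, 2))
# 50000 7500 6.67 -> the ~6.67x reduction per matrix-vector step quoted above
```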
Activation functions help the LSTM manage non-linear and complex time dependencies, but they also demand more processing power. Indeed, replacing both sigmoid and tanh by a basic arithmetic operation, such as $1 - f_t$, makes operations more lightweight without requiring further memory accesses, unlike LUT-based (Lookup Table) operations. This is also convenient since it reuses the forget gate $f_t$ without requiring another computation.
Finally, adding a highway connection in the hidden state computation could be particularly relevant in multi-layer models like ours. While LSTMs employ a straightforward hidden state computation (6), the SRUs' highway connection mechanism (10) enables direct information propagation from lower to higher layers through an adaptive pathway. The reset gate $r_t$ then acts as an adaptive filter, determining the balance between transformed cell state information and raw input features. When $r_t$ is close to 1, the hidden state will represent the cell state transformation, whereas when $r_t$ is closer to 0, the hidden state will preserve more of the original input features. This allows different SRU layers to selectively receive processed or preserved input features based on the learned parameters in a flexible way.
3.5 Hypothesis: SRU Architecture Streamlining for Predictive Maintenance
Since embedded systems may lack the hardware needed for parallel computation, this study aimed to explore the potential of a naive SRU implementation tailored for predictive maintenance tasks. Observing the nature of these tasks, a streamlined version of the conventional SRU architecture is proposed, eliminating hidden state computations. This hypothesis is based on one key point: predictive maintenance primarily focuses on identifying gradual degradation and long-term trends in sensor data, which the cell state $c_t$ can capture effectively, possibly rendering a separate hidden state $h_t$ unnecessary. The cell state is then not only passed across time steps but also fed as input to the second layer of the model. Thus, only the forget gate $f_t$ and the input gate $i_t$ with an added bias are used in order to capture these gradual changes rather than complex short-term patterns.
Mathematically, the proposed modification transforms the SRU computations from:
$$
\begin{aligned}
f_t &= \sigma(W_f \cdot x_t + b_f) \\
\tilde{x}_t &= W_i \cdot x_t \\
i_t &= (1 - f_t) \odot \tilde{x}_t \\
r_t &= \sigma(W_r \cdot x_t + b_r) \\
c_t &= f_t \odot c_{t-1} + i_t \\
h_t &= r_t \odot \tanh(c_t) + (1 - r_t) \odot x_t
\end{aligned}
\quad (12)
$$
to the minimal form:
$$
\begin{aligned}
f_t &= \sigma(W_f \cdot x_t + b_f) \\
\tilde{x}_t &= W_i \cdot x_t \\
i_t &= (1 - f_t) \odot \tilde{x}_t \\
c_t &= f_t \odot c_{t-1} + i_t
\end{aligned}
\quad (13)
$$
This streamlining further reduces the model's number of matrix-vector operations by a factor of 1.5, since only two of the three gate matrices remain. Figure 3 shows the minimal SRU diagram.

Fig. 3. Diagram of a minimal SRU
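Following equation (13), a minimal SRU step keeps only the forget and input gates, and the cell state serves both as the recurrent state and as the output passed to the next layer; the sketch below illustrates this, with shapes again being illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minimal_sru_step(x_t, c_prev, W_f, W_i, b_f, b_i):
    """One minimal SRU time step following equation (13).

    x_t: input vector (m,), c_prev: previous cell state (n,),
    W_f, W_i: weight matrices (n, m). Without the highway connection,
    the input and state dimensions no longer have to match.
    """
    f_t = sigmoid(W_f @ x_t + b_f)   # forget gate
    x_tilde = W_i @ x_t + b_i        # transformed input with the added bias
    i_t = (1.0 - f_t) * x_tilde      # input gate from the complement of f_t
    c_t = f_t * c_prev + i_t         # cell state update; also the layer output
    return c_t
```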
These advantages, however, do not come without a downside. The structure of the SRU consequently reduces its flexibility in capturing more complex temporal dependencies. Indeed, the LSTM utilizes two distinct activation functions in the input gate and cell state candidate, combined through an element-wise product. This allows the LSTM to use a gating mechanism with the cell state candidate independently of the forget gate. The advantage of this combination is that it allows the cell state not only to increase (since the sigmoid output is always positive, between 0 and 1) but also to decrease thanks to tanh, which scales input values while maintaining their sign, mapping values from -1 to 1. Hence, LSTMs can handle non-linearity in the data in an efficient way. In contrast, the SRU's input gate is directly coupled to the forget gate through the expression $1 - f_t$, eliminating both activation functions. This coupling means that the decision to add new information is directly tied to the decision to forget information, reducing the network's flexibility in independently managing these two processes.
The absence of tanh also means the cell state can only increase, contributing to a more linear behavior in how new information is added.
It is therefore essential to thoroughly test the SRU architecture for the specific application involved. In general, SRUs offer a substantial improvement in terms of general embedded performance, though potentially affecting the model's predictive capabilities. The trade-off between these two aspects will be assessed in this paper to evaluate the potential of SRUs in this research's application.
3.6 LSTM and SRU Models and Dataset
The models aim to correctly predict the RUL of engines directly on the VisionFive2 RISC-V board. To achieve this, the Turbofan Engine Degradation Simulation Data Set [18] provided by NASA has been used, and the initial LSTM model trained in [7] has been used as the base model. This is a great model to start with since it is a raw implementation of an LSTM with Keras and TensorFlow, which opens up optimization opportunities.
The important information to retain regarding the training process is the format of the input data and the depth of the models. The dataset is segmented into time sequences that contain input from 25 different sensors within a range of 50 cycles. In total, 93 different engines were tested, each having its own corresponding sequence. The models consist of two LSTM/SRU layers with 100 and 50 nodes, followed by a single-unit dense layer for predictions, as described by Figure 4, generated with Netron [17]. This setup enables the evaluation of the behavior of SRUs with the same number of cells as the LSTMs.
Fig. 4. Models layers
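For reference, the sketch below defines a Keras model with the layer sizes of Figure 4 (100 and 50 recurrent units followed by a single-unit dense output) over the 50-cycle, 25-sensor input windows; the binary output, loss and optimizer are assumptions consistent with the classification metrics of Section 4, not necessarily the exact configuration of the base model [7].

```python
import tensorflow as tf

SEQ_LEN, N_FEATURES = 50, 25  # 50-cycle windows over 25 sensors

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, N_FEATURES)),
    tf.keras.layers.LSTM(100, return_sequences=True),  # first recurrent layer (100 cells)
    tf.keras.layers.LSTM(50),                           # second recurrent layer (50 cells)
    tf.keras.layers.Dense(1, activation="sigmoid"),     # single-unit prediction layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```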
4 Tests and Experimental Results
4.1 Test Bench
Since the aim is to combine the models with optimizations that offer various adjustable parameters, establishing a robust workflow is essential in order to conduct and track multiple optimization tests, each resulting in a different model. To evaluate and compare these models, a test bench enabling an automated and traceable workflow has been implemented. It includes multiple crucial steps such as deploying the selected model on the target device, retrieving its predictions, and generating detailed test and analysis reports. This procedure also includes automatic storage of the generated reports, enabling traceable and organized records.
In these reports, the model's predictive capabilities and embedded performance are captured through relevant metrics. In predictive maintenance applications, the most important criteria typically focus on a balance between predictive accuracy and efficient use of system resources while primarily identifying engine failures. Accuracy is used for overall reliable predictions; F1-score and MCC are essential since predictive maintenance often deals with imbalanced data (fewer failures than normal operations). More specifically, recall is also measured since detection of maintenance needs or failures is more crucial for cost and safety issues. Average latency is critical from an embedded point of view and is the main gauge of whether the model is correctly optimized or not. Energy per inference is measured by averaging the power consumed by the target during inference. Finally, memory usage is estimated since embedded systems have limited memory.
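As an illustration of how these indicators could be gathered, the sketch below times repeated inferences and computes the prediction scores with scikit-learn; it is a simplified stand-in for the actual test bench, and the energy and memory measurements are omitted.

```python
import time
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

def evaluate_model(model, x_test, y_true, n_runs=100):
    """Average per-sequence latency over repeated runs, plus prediction scores."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        y_prob = model.predict(x_test, verbose=0)
        latencies.append((time.perf_counter() - start) / len(x_test))
    y_pred = (np.ravel(y_prob) > 0.5).astype(int)  # binary failure decision
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "avg_latency_s": float(np.mean(latencies)),
    }
```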
4.2 Optimization techniques
Three primary techniques were tested in this work: pruning, quantization and knowledge distillation. Below, an
explanation of why each optimization technique was applied or not is given.
4.2.1 Pruning. In the first layer of the LSTM model, there are 100 LSTM cells, consisting of 50,000 weights and 400 bias terms, totaling 50,400 parameters with different values, many of which contribute only very slightly to the final output. By applying pruning, these less important connections can be removed, reducing the memory footprint of the model and accelerating inference.
Progressive pruning has been chosen specifically for a smoother adaptation to the loss of connections. This approach would enable the model to adapt and optimize itself over time. This incremental approach enabled an evaluation of how far the model can be pruned without significantly impacting its accuracy, pushing the limits of compression.
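One possible way to implement such progressive pruning on a Keras model is the TensorFlow Model Optimization Toolkit's magnitude pruning with a polynomial sparsity schedule, sketched below; the schedule values are illustrative and are not claimed to be the exact settings used in this work.

```python
import tensorflow_model_optimization as tfmot

# Progressive sparsity schedule: ramp from 0% towards high sparsity during
# fine-tuning so the model can adapt to the gradually removed connections.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.90,   # illustrative target, cf. the 90-95% sparsity reported later
        begin_step=0,
        end_step=2000,         # should match the number of fine-tuning steps
    )
}

# `model`, `x_train` and `y_train` are assumed to come from the baseline training setup.
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
pruned.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
pruned.fit(x_train, y_train, epochs=10,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers so only the sparse weights remain for deployment.
deployable = tfmot.sparsity.keras.strip_pruning(pruned)
```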
4.2.2 Knowledge Distillation. A different compression approach is to directly reduce the number of cells in the models' layers. However, instead of just training a new model with smaller layers, knowledge distillation was applied to benefit from the original trained model. Knowledge distillation can create a smaller student model that mimics the performance of a larger teacher model with fewer parameters. The trade-off with knowledge distillation is the additional training time required to train the student models and the variability of the results, making the process less reproducible. Despite these limitations, knowledge distillation seems to be a valuable technique to significantly reduce model size without a severe loss in performance for embedded applications.
The knowledge distillation applied here aimed to divide the number of LSTM and SRU cells in each layer by 2, resulting in models consisting of a first layer of 50 units and a second layer of 25 units.
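A minimal sketch of the distillation procedure is given below, assuming an already-trained teacher_model and a smaller student_model with halved layer sizes; the loss weighting alpha and the use of binary cross-entropy on the teacher's soft outputs are assumptions, as the exact distillation setup is not detailed here.

```python
import tensorflow as tf

# teacher_model: trained large model (100/50 units); student_model: smaller model (50/25 units).
alpha = 0.5  # assumed balance between ground-truth loss and teacher-imitation loss
bce = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()

@tf.function
def distill_step(x, y_true):
    # The frozen teacher provides soft targets for the student to imitate.
    y_teacher = teacher_model(x, training=False)
    with tf.GradientTape() as tape:
        y_student = student_model(x, training=True)
        hard_loss = bce(y_true, y_student)     # fit the real labels
        soft_loss = bce(y_teacher, y_student)  # mimic the teacher's outputs
        loss = alpha * hard_loss + (1.0 - alpha) * soft_loss
    grads = tape.gradient(loss, student_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, student_model.trainable_variables))
    return loss
```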
4.2.3 Quantization. Quantization was also considered to further reduce the model size and speed up inference by lowering the precision of the model weights and activations. However, the tests revealed that quantization did not lead to improved performance on the VisionFive2. In fact, the quantized operations' performance was worse than the floating-point implementations, likely due to the normalization overhead of quantized arithmetic operations, coupled with the presence of a Floating Point Unit (FPU) on the target platform. Thus, floating-point operations are processed with high efficiency, resulting in the quantized implementation being slower and less efficient. In this specific setup, quantization not only hinders embedded performance but also negatively impacts the models' predictive capabilities, making it less advantageous than knowledge distillation and pruning. The same results have been demonstrated on similar devices in [4].
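For completeness, post-training quantization of a Keras model can be obtained through the standard TensorFlow Lite converter path, as sketched below; this is shown only for illustration and is not necessarily the exact procedure evaluated in these tests.

```python
import tensorflow as tf

# Post-training dynamic-range quantization of the trained Keras model.
# Recurrent layers may require additional converter settings depending on the TF version.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quantized = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quantized)
```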
4.3 Experimental Protocol and Results
An experimental protocol was designed to evaluate and compare the performance of multiple models. Figure 5 is a sketch describing the different steps of the protocol. It starts by training the desired models, the LSTM and the SRU. A specific optimization technique is then applied, generating different versions of the models. Next, these models are evaluated one by one using the test bench, which sends each model to the target, executes it, retrieves the metrics and generates a performance report with the performance indicators in it. Once all the models are assessed individually, their performance reports are selected together and a comparative analysis report is generated. This analysis report allows the comparison of all of the relevant models with charts combining their individual metrics. Thus, the impact of the optimization techniques and the model variations can be assessed by directly comparing them to the initial base LSTM in a single report.
Fig. 5. Automated Test Bench Workflow (train and optimize the models, deploy each model to the target, execute it and retrieve the metrics, generate per-model reports, then generate a comparative analysis report)
To ensure consistency and reliability, each test was performed over a series of 100 iterations. Performance indicators for each model were averaged over these 100 runs to obtain stable metrics, especially for embedded performance. The memory footprint is an estimate of the amount of RAM required to load and run a model. It is measured by tracking the difference in memory consumption before and after loading the model and also includes memory consumption of other processes that are active on the target. Tables 1 and 2 showcase the final models evaluated on the test bench with their associated metrics and the comparison with the initial LSTM in percentage points. In Table 2, positive values for the prediction scores (accuracy, precision, recall, F1-score, and MCC) indicate a gain, whereas for average latency and model loading memory, negative values represent the gain.
Table 1. Performance and Resource Metrics of LSTM and SRU Models

Model | Accuracy | Precision | Recall | F1-Score | MCC | Average Latency (s) | Loading Memory (MB) | Energy per Inference (J)
Initial LSTM | 0.9785 | 0.9600 | 0.9600 | 0.9600 | 0.9453 | 0.0164 | 0.0400 | 0.0763
LSTM pruned | 0.9677 | 0.8929 | 1.0000 | 0.9434 | 0.9238 | 0.0093 | 0.0300 | 0.0432
LSTM distilled | 0.9785 | 0.9259 | 1.0000 | 0.9615 | 0.9480 | 0.0065 | 0.0300 | 0.0302
LSTM distilled pruned | 0.9032 | 0.8333 | 0.8000 | 0.8163 | 0.7510 | 0.0048 | 0.0300 | 0.0223
SRU | 0.9892 | 0.9615 | 1.0000 | 0.9804 | 0.9733 | 0.0079 | 0.0300 | 0.0367
SRU pruned | 0.9140 | 0.8148 | 0.8800 | 0.8462 | 0.7877 | 0.0049 | 0.0300 | 0.0228
SRU distilled | 0.9785 | 0.9259 | 1.0000 | 0.9615 | 0.9480 | 0.0038 | 0.0300 | 0.0177
SRU distilled pruned | 0.9032 | 0.9444 | 0.6800 | 0.7907 | 0.7466 | 0.0031 | 0.0300 | 0.0144
Minimal SRU | 0.9785 | 0.9600 | 0.9600 | 0.9600 | 0.9453 | 0.0041 | 0.0200 | 0.0186
Minimal SRU distilled | 0.9570 | 0.9565 | 0.8800 | 0.9167 | 0.8892 | 0.0021 | 0.0200 | 0.0098
Table 2. Performance and Resource Metrics of Binary LSTM and SRU Models with Variations in Percentage Points

Model | Accuracy | Precision | Recall | F1-Score | MCC | Average Latency (%) | Loading Memory (%) | Energy per Inference (%)
Initial LSTM | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
LSTM pruned | -1.10 | -6.71 | +4.00 | -1.66 | -2.15 | -43.29 | 0.00 | -43.38
LSTM distilled | +0.00 | -3.41 | +4.00 | +0.15 | +0.27 | -60.37 | -25.00 | -60.40
LSTM distilled pruned | -7.53 | -12.67 | -16.00 | -14.37 | -19.43 | -70.73 | -25.00 | -70.78
SRU | +1.07 | +0.15 | +4.00 | +2.04 | +2.80 | -51.83 | -25.00 | -51.91
SRU pruned | -6.45 | -14.52 | -8.00 | -11.38 | -15.76 | -70.11 | -25.00 | -70.12
SRU distilled | +0.00 | -3.41 | +4.00 | +0.15 | +0.27 | -76.83 | -25.00 | -76.82
SRU distilled pruned | -7.53 | -1.56 | -28.00 | -16.93 | -19.87 | -81.10 | -25.00 | -81.12
Minimal SRU | +0.00 | +0.00 | +0.00 | +0.00 | +0.00 | -75.00 | -50.00 | -75.62
Minimal SRU distilled | -2.15 | -0.35 | -8.00 | -4.33 | -5.61 | -87.20 | -50.00 | -87.15
Among the evaluated models, the LSTM distilled, the SRU, the SRU distilled and the minimal SRU models stand out for their optimized balance of accuracy and resource efficiency. The LSTM distilled manages to maintain the accuracy of the initial LSTM, outperforming the initial LSTM model with a recall of 1.0 while reducing the average inference latency by 60.37%. Surprisingly, the minimal SRU achieved the exact same prediction scores as the initial LSTM while reducing latency by 75% and the memory consumption of the model by 50%. The SRU model achieves the best metrics, not only increasing accuracy to 0.9892 but also reducing the initial average latency by 51.83%, which is significant. The SRU distilled model is also interesting since it brings together the best of the initial LSTM model and the SRU metrics by maintaining a 0.9785 accuracy, successfully identifying all engine failures, and further reducing latency by 76.83%. Figure 6 shows the prediction capabilities of the initial LSTM, the SRU, and the minimal SRU. Each metric is represented on one axis and values are shown on a scale from 0.8 to 1 for better visibility. In this spider diagram, a larger enclosed area indicates better predictions. The prediction scores are either enhanced or maintained, respectively for the SRU and the minimal SRU, compared to the initial LSTM model. Figure 7 is a spider diagram of the embedded performance of the three models. The LSTM serves as a reference with a value of 1 for each of the metrics. The other models are then directly compared to the reference. In this spider diagram, a smaller enclosed area indicates better embedded performance. It shows that the
SRU's embedded performance is improved compared to the initial LSTM, and that the minimal SRU's embedded performance is further enhanced compared to the SRU.
Fig. 6. Predictive metrics of the initial LSTM and SRU models (spider diagram over Accuracy, Precision, Recall, F1-Score, AUC-ROC and MCC, on a 0.8 to 1 scale)
Fig. 7. Enhancement of embedded performance of SRU models compared to the initial LSTM (spider diagram over maximum latency, average latency, model loading memory consumption and average energy per inference, normalized to the initial LSTM)
Aggressive pruning on both models is required to deliver changes from the initial model. This leads to very sparse models, with 90 to 95% sparsity for the pruned LSTM and pruned SRU. The pruned LSTM model managed to achieve the same level of accuracy as the initial model while increasing the recall score to 1.00, despite the high sparsity. Pruning has a more negative impact on SRUs, since they already have considerably fewer parameters than the LSTM, reducing accuracy by 6.45 points and recall by 8.00 points. However, for both types of models, pruning
has reduced average latency by 43.29% and 70.11% respectively, which is a significant change considering it can be a fairly straightforward technique to apply.
The SRU model combining knowledge distillation and pruning reduced the average latency by 81.10%, at the cost of a 7.53-point drop in accuracy and a 28-point drop in recall. The LSTM distilled and pruned model also showed significant drops in performance metrics, with the MCC falling by 19.43 points. These results show that combining both optimizations is efficient at significantly enhancing embedded performance, but it also has the largest impact on the prediction scores. The loss in prediction scores due to optimization techniques is natural because pruning and distillation both involve simplifying the model by reducing its parameters or capacity, unintentionally affecting the model's ability to generalize. To reduce this loss, setting clear thresholds during training is recommended to carefully balance the degree of optimization based on the application's requirements. In this work, a standard pruning and distillation approach was pursued. However, more advanced techniques like structured pruning, sparsity-aware training or adaptive distillation can be explored in future work, to minimize the resulting loss in prediction scores when applying optimization techniques.
Overall, for the application of this paper, the SRU offers the best trade-off. It outperformed the initial model across all metrics while achieving a 51.83% reduction in latency. However, it still has a similar memory footprint to the LSTM, which is acceptable for a small model but could cause issues in another context. For applications prioritizing rapid responses, such as mobile or IoT-based monitoring systems, the SRU distilled and the minimal SRU would be a better choice. They both achieved an accuracy of 0.9785, identical to the initial LSTM, with latency reductions of 76.83% and 75.0% respectively. It is also observed that the minimal SRU distilled achieved the largest reduction in latency, with an 87.20% decrease, at the cost of a moderate loss in prediction scores given the speed-up. In embedded system applications, the proposed minimal SRU is interesting since it offers an attractive trade-off of a 48.1% speed-up compared to the standard SRU with a maximum of 0.04 points lost in prediction scores. Thus, these SRUs are viable options for embedded systems, and achieved great results both in speeding up the inference and reducing the computational complexity.
5 Concluding Remarks
This work reveals two different strategies for porting LSTM and SRU models to a RISC-V embedded target in the context of predictive maintenance. The first one involves directly applying optimization techniques to the models. Pruning proved effective in reducing resource demands, highlighting its utility as a fairly straightforward optimization strategy. Knowledge distillation offered substantial performance and efficiency improvements for all models. However, excessive optimization can risk compromising predictive accuracy, as demonstrated when combining both methods on an SRU.
The second strategy is changing the model's architecture itself, either by replacing it with a different one or by reducing the layers of the model, tailoring it for the specific application. This approach worked particularly well and gave promising results, with the minimal SRU emerging as an optimal choice for embedded systems with strict requirements, maintaining the same prediction scores as the LSTM with a 75% speed-up.
Moreover, the successful implementation of SRUs as a replacement for LSTMs demonstrates their potential for embedded predictive maintenance, offering a favorable balance between prediction accuracy and computational efficiency. By replacing an LSTM with an SRU, a one point increase in accuracy was obtained, as well as a 51.83% reduction in average latency and a model size 1.62 times smaller.
Future work could focus on further developing the minimal SRU architecture to propose a custom model with the aim of maintaining attractive embedded performance while enhancing prediction scores.
Acknowledgments
We would like to express our sincere appreciation to the institution ADVANS Group that made this work possible
through its subsidiary ADVANS Lab, the entity which manages the group’s internal and external R&D.
References
[1]
Xanthi Bampoula, Georgios Siaterlis, Nikolaos Nikolakis, and Kosmas Alexopoulos. 2021. A Deep Learning Model for Predictive Maintenance in
Cyber-Physical Production Systems Using LSTM Autoencoders. Sensors 21, 3 (2021). https://doi.org/10.3390/s21030972
[2]
Dario Bruneo and Fabrizio De Vita. 2019. On the Use of LSTM Networks for Predictive Maintenance in Smart Industries. In 2019 IEEE International
Conference on Smart Computing (SMARTCOMP). 241–248. https://doi.org/10.1109/SMARTCOMP.2019.00059
[3]
Jean-Baptiste Chaudron and Arnaud Dion. 2023. Evaluation of Gated Recurrent Neural Networks for Embedded Systems Applications. In
Computational Intelligence, Jonathan Garibaldi, Christian Wagner, Thomas Bäck, Hak-Keung Lam, Marie Cottrell, Kurosh Madani, and Kevin
Warwick (Eds.). Springer International Publishing, Cham, 223–244.
[4]
Jeffrey Chen, Sehwan Hong, Warrick He, Jinyeong Moon, and Sang-Woo Jun. 2021. Eciton: Very Low-Power LSTM Neural Network Accelerator for Predictive Maintenance at the Edge. In 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). 1–8. https://doi.org/10.1109/FPL53798.2021.00009
[5]
Swarnava Dey, Arijit Mukherjee, Arijit Ukil, and Arpan Pal. 2024. Towards a Task-agnostic Distillation Methodology for Creating Edge Foundation
Models. In Proceedings of the Workshop on Edge and Mobile Foundation Models (Minato-ku, Tokyo, Japan) (EdgeFM ’24). Association for Computing
Machinery, New York, NY, USA, 10–15. https://doi.org/10.1145/3662006.3662061
[6]
Vidushi Goyal, Reetuparna Das, and Valeria Bertacco. 2022. Hardware-friendly User-specific Machine Learning for Edge Devices. ACM Trans.
Embed. Comput. Syst. 21, 5, Article 62 (Oct. 2022), 29 pages. https://doi.org/10.1145/3524125
[7]
Umberto Grio. 2018. Predictive Maintenance using LSTM. https://github.com/umbertogrio/Predictive-Maintenance-usingLSTM. GitHub
repository.
[8]
Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple Recurrent Units for Highly Parallelizable Recurrence. arXiv:1709.02755 [cs.CL]
https://arxiv.org/abs/1709.02755
[9] H. Lu. 2023. Supervised Algorithm for Predictive Maintenance. https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-325821
[10]
Michał Markiewicz, Maciej Wielgosz, Mikołaj Bocheński, Waldemar Tabaczyński, Tomasz Konieczny, and Liliana Kowalczyk. 2019. Predictive
Maintenance of Induction Motors Using Ultra-Low Power Wireless Sensors and Compressed Recurrent Neural Networks. IEEE Access 7 (2019),
178891–178902. https://doi.org/10.1109/ACCESS.2019.2953019
[11]
Irene Niyonambaza Mihigo, Marco Zennaro, Alfred Uwitonze, James Rwigema, and Marcelo Rovai. 2022. On-Device IoT-Based Predictive Maintenance
Analytics Model: Comparing TinyLSTM and TinyModel from Edge Impulse. Sensors 22, 14 (2022). https://doi.org/10.3390/s22145174
[12]
Arijit Mukherjee, Jayeeta Mondal, and Swarnava Dey. 2022. Accelerated Fire Detection and Localization at Edge. ACM Trans. Embed. Comput. Syst.
21, 6, Article 70 (Oct. 2022), 27 pages. https://doi.org/10.1145/3510027
[13]
Donghyun Park, Seulgi Kim, Yelin An, and Jae-Yoon Jung. 2018. LiReD: A Light-Weight Real-Time Fault Detection System for Edge Computing
Using LSTM Recurrent Neural Networks. Sensors 18, 7 (2018). https://doi.org/10.3390/s18072110
[14]
Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, and Wonyong Sung. 2018. Fully neural network based speech recognition on mobile and
embedded devices. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Montréal, Canada) (NIPS’18). Curran
Associates Inc., Red Hook, NY, USA, 10642–10653.
[15]
Zepeng Qin, Chen Cen, and Xu Guo. 2019. Prediction of Air Quality Based on KNN-LSTM. Journal of Physics: Conference Series 1237, 4 (jun 2019),
042030. https://doi.org/10.1088/1742-6596/1237/4/042030
[16]
Lei Ren, Tao Wang, Zidi Jia, Fangyu Li, and Honggui Han. 2023. A Lightweight and Adaptive Knowledge Distillation Framework for Remaining
Useful Life Prediction. IEEE Transactions on Industrial Informatics 19, 8 (2023), 9060–9070. https://doi.org/10.1109/TII.2022.3224969
[17]
Lutz Roeder. [n. d.]. Netron: Viewer for neural network, deep learning and machine learning models. https://github.com/lutzroeder/netron. GitHub
repository.
[18]
A. Saxena and K. Goebel. 2008. Turbofan Engine Degradation Simulation Data Set. https://www.nasa.gov/content/turbofan-engine-degradation-simulation-data-set NASA Prognostics Data Repository, NASA Ames Research Center, Moffett Field, CA.
[19]
Muhammad Shaque, Theocharis Theocharides, Hai Li, and Chun Jason Xue. 2022. Introduction to the Special Issue on Accelerating AI on the Edge
Part 1. ACM Trans. Embed. Comput. Syst. 21, 5, Article 47 (Dec. 2022), 5 pages. https://doi.org/10.1145/3558078
[20]
Apeksha Shewalkar, Deepika Nyavanandi, and Simone A Ludwig. 2019. Performance evaluation of deep neural networks applied to speech
recognition: RNN, LSTM and GRU. Journal of Artificial Intelligence and Soft Computing Research 9, 4 (2019), 235–245.
[21]
Izaak Stanton, Kamran Munir, Ahsan Ikram, and Murad El-Bakry. 2023. Predictive maintenance analytics and implementa-
tion for aircraft: Challenges and opportunities. Systems Engineering 26, 2 (2023), 216–237. https://doi.org/10.1002/sys.21651
arXiv:https://incose.onlinelibrary.wiley.com/doi/pdf/10.1002/sys.21651
[22]
Wonyong Sung and Jinhwan Park. 2018. Single Stream Parallelization of Recurrent Neural Networks for Low Power and Fast Inference. CoRR
abs/1803.11389 (2018). arXiv:1803.11389 http://arxiv.org/abs/1803.11389
[23]
Dechen Yao, Boyang Li, Hengchang Liu, Jianwei Yang, and Limin Jia. 2021. Remaining useful life prediction of roller bearings based on improved
1D-CNN and simple recurrent unit. Measurement 175 (2021), 109166. https://doi.org/10.1016/j.measurement.2021.109166
Received XX December 2024; revised Day Month Year; accepted Day Month Year
To overcome the shortcomings of traditional roller bearing remaining useful life prediction methods, which mainly focus on prediction accuracy improvement and ignore labor cost and time, the present work proposed a novel prediction method by combining an improved one-dimensional convolution neural network (1D-CNN) and a simple recurrent unit (SRU) network. For feature extraction, the proposed method uses the ability of the 1D-CNN to extract signal features. Moreover, use the global maximum pooling layer to replace the fully connected layer. In the prediction part, a parallel-input SRU network was established by reconstructing the serial operation mode of a traditional recurring neural network (RNN). Finally, experiments were carried out using the XJTU-SY dataset to verify. Results revealed that on the premise of ensuring prediction accuracy, the 1D-CNN-SRU method could reduce manual intervention and time cost to a certain extent and provide an intelligent method for roller bearing remaining useful life prediction.