Proceedings of the ASME 2021
International Mechanical Engineering Congress and Exposition
IMECE2021
November 1-5, 2021, Virtual, Online
IMECE2021-69395
ANOMALY DETECTION FOR CYBER-PHYSICAL SYSTEMS USING TRANSFORMERS
Yuliang Ma
University of Stuttgart
Germany
Andrey Morozov
University of Stuttgart
Germany
Sheng Ding
University of Stuttgart
Germany
ABSTRACT
Safety and reliability are two critical factors of modern
Cyber-Physical Systems (CPS). However, the increasing
structural and behavioral complexity of modern automation
systems significantly increases the possibility of system errors
and failures, which can easily lead to economic loss or even
hazardous events. Anomaly Detection (AD) techniques provide a
potential solution to this problem, and conventional methods,
e.g., Autoregressive Integrated Moving Average model (ARIMA),
are no longer the best choice for anomaly detection for modern
complex CPS. Recently, Deep Learning (DL) and Machine Learning (ML) anomaly detection methods have become more popular, and numerous practical applications have been
presented in many industrial scenarios. Most of the modern DL-
based anomaly detection methods use the prediction approach
and LSTM architecture. The Transformer is a new neural
network architecture that outperforms LSTM in natural
language processing.
In this paper, we show that the Transformer-based deep
learning model, which has received much attention recently, can be applied to anomaly detection in industrial automation
systems. Specifically, we collect time-series data from a system
of two industrial robots using a Simulink model. Then, we feed
these data into our Transformer-based model and train it to be a
time-series data predictor. The paper presents experimental results that compare the precision and speed of a Long Short-Term Memory (LSTM) predictor and our Transformer-based predictor.
Keywords: Anomaly Detection, Transformers, LSTM,
Machine Learning, Deep Learning, Time-Series Data, Industrial
Robots, Automation.
1. INTRODUCTION
Modern production Cyber-Physical Systems (CPS), e.g.,
industrial robots, are getting increasingly complex. The reason
behind this phenomenon is that there are numerous types of
sensors, actuators, network hardware, embedded computers, and
an enormous amount of software contributing to the complexity
of these systems. This increases the vulnerability of the systems and inevitably leads to serious errors. Furthermore, the occurrence of anomalous situations inside a CPS can lead to dangerous accidents, especially in safety-critical cases.
On the other hand, the high degree of structural variability and internal heterogeneity not only makes the system very susceptible to abnormal situations but also undermines conventional anomaly detection methods, such as the Autoregressive Integrated Moving Average model (ARIMA), so that they are no longer the best choice for modern automation systems. During the last few years, anomaly
detection methods based on Deep Learning (DL) or Machine
Learning (ML) technologies have attracted a lot of attention and
been applied effectively in numerous research and industrial
domains [1].
Anomaly Detection (AD) in industrial time-series data has
been achieved using classification, prediction, and
reconstruction methods. The basic principle of the prediction-based anomaly detection method is illustrated in Figure 1. The neural network is trained on error-free data, and the trained network is then used as a time-series data predictor. When real data that may contain errors is fed in, the residual between the predicted value and the actual value is calculated and compared with a preset threshold. Once the residual exceeds the threshold, it is regarded as an error, e.g., a bit flip or an offset in the data stream, which may originate from the controllers, sensors, or other components inside the CPS. In our previous paper, we have
discussed that the prediction approach is the most popular
nowadays for several reasons [2]. The two main tasks of
prediction-based anomaly detection are the prediction itself
followed by the detection. The detection-related problems of
CPS are well described in [2,3]. In this paper, we compare two
neural network architectures. The choice of the architecture influences only the prediction part. Therefore, we will focus only
on the prediction and leave different anomaly detection methods
such as dynamic threshold setting outside of the scope of this
paper.
FIGURE 1: THE PRINCIPLE OF PREDICTION-BASED
ANOMALY DETECTION METHOD.
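To make the residual check concrete, here is a minimal sketch of the detection step described above, assuming a fixed threshold (the paper itself leaves threshold selection, e.g., dynamic thresholding, out of scope); the variable names are ours.

```python
import numpy as np

def detect_anomalies(y_true, y_pred, threshold):
    """Flag time steps where the prediction residual exceeds a preset threshold.

    y_true: observed sensor values, shape (T,)
    y_pred: one-step-ahead predictions from the trained model, shape (T,)
    threshold: residual limit (a hypothetical fixed value; the paper leaves
               dynamic threshold setting out of scope)
    """
    residual = np.abs(y_true - y_pred)   # element-wise prediction error
    return residual > threshold          # boolean anomaly mask

# Example: a single offset error stands out against a small residual threshold.
y_true = np.array([0.10, 0.11, 0.95, 0.12])  # third sample carries an offset fault
y_pred = np.array([0.10, 0.12, 0.11, 0.12])
print(detect_anomalies(y_true, y_pred, threshold=0.5))  # [False False  True False]
```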
The time-series data generated from the sensors of the
automation systems is our main research object. According to
recent research papers, Long Short-Term Memory (LSTM)
networks are capable of learning the long-term dependencies of
time-series data and can be trained to be excellent predictors [2].
However, in 2017, the Google Brain Team proposed the
Transformer, a new type of efficient deep learning model, and
achieved great success in Natural Language Processing (NLP)
[4]. It has also been shown in [5] that the Transformer can be trained to be a time-series data predictor. In our paper, we will
employ a Transformer-based machine learning model and
demonstrate how it can predict time-series data for anomaly
detection. We have conducted a large number of prediction
experiments using both the LSTM model and the Transformer
model, comparing the prediction speed and the prediction accuracy. The experimental results confirm that our Transformer-based model has the potential to become a faster online anomaly
detector.
Contributions: There are two main contributions of this paper.
First, we apply a Transformer-based time-series prediction
model for anomaly detection that is able to predict industrial
time-series data. Compared to other Transformer-based predictors, our model is optimized for fast online prediction, which is crucial for anomaly detection. Second, our experiments
show that the Transformer-based model can give a comparable
prediction accuracy to that of the LSTM model. At the same
time, the prediction speed of the Transformer-based model is
nearly twice as fast as that of the LSTM model, which makes the
Transformer-based model more suitable for online anomaly
detection.
2. STATE OF THE ART
Time-series data reflect periodic behaviors in fields such as engineering, economics, and the social sciences [6]. In many
research fields related to time-series data, one of the most
important questions is how to obtain future tendencies through
prediction. In general, there are many time-series data prediction
methods, including traditional prediction methods as well as machine learning-based and deep learning-based methods. Some
traditional approaches, such as Moving Average (MA) and Auto-
Regressive Integrated Moving Average (ARIMA), normally
have relatively simple models, but they have obvious limitations in processing high-dimensional time-series data and in efficiently capturing complex features inside the data [7].
In recent years, the popularity of using deep learning models to predict time-series data has been gradually increasing, and the most representative of these models is the Long Short-Term Memory (LSTM) recurrent network.
The Long Short-Term Memory (LSTM) model [8] is a special variant of the Recurrent Neural Network (RNN). Standard RNNs suffer from vanishing and exploding gradient problems, so they are unable to learn dependencies over relatively long distances [9]. The LSTM network was introduced to overcome this vanishing gradient problem. Inside the LSTM network, a linear unit (the LSTM cell) is used to collect and add information at each timestep. At the same time, different kinds of 'gates' control the flow of information within a cell. Specifically, the cell takes in information through the input gate, the flow of information is allocated to the rest of the network through the output gate, and the forget gate determines which information from the previous timesteps is retained and which is discarded. As a result, the LSTM network is able to maintain temporal information in its state over long stretches of time-series data while avoiding the vanishing and exploding gradient problems. In practice, LSTM has been widely used in numerous time-series data tasks [10]-[12] in univariate and multivariate [13]-[14] domains.
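For reference, the gate computations described above can be written compactly; this is the standard LSTM formulation with the forget gate of [9], where $\sigma$ denotes the logistic sigmoid and $\odot$ the element-wise product:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{input gate}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{forget gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{candidate cell state}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{cell state (additive update)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{hidden state}
\end{aligned}
```

The additive cell-state update is what lets gradients flow over long distances without vanishing.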
The Transformer model [4] was originally introduced by Vaswani et al., and the framework was initially applied in the Natural Language Processing (NLP) research domain. Since then,
Transformers have been successfully applied in several other
domains, including time-series data forecasting [5]. Using a
standard encoder-decoder architecture fully based on the multi-
head attention mechanism, Transformer models are suitable for
time-series data analysis. First, based on the self-attention
mechanism, they can capture the relationship between two input
elements in a rather long distance. Second, the multi-head
attention mechanism provides a richer representation of these
relationships, thereby avoiding overfitting. Based on the above
reasons, Transformers have shown overwhelming advantages in
some practical time-series data forecasting tasks. The authors of
[15] demonstrate the excellent performance of the Transformer
model through prediction experiments on four public datasets,
and the experimental results show that the Transformer model
outperforms a variety of models. In another representative work [5], the authors forecast influenza prevalence with a Transformer model and similarly show that the Transformer outperforms the ARIMA method, an LSTM, and an attention-based GRU Seq2Seq model. Compared to other Transformer-based predictors, our model is optimized for fast online prediction, which is crucial for anomaly detection.
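As a minimal illustration of the mechanism behind these advantages, the following NumPy sketch shows scaled dot-product self-attention for a single head as defined in [4]; the array names and shapes are ours.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head [4].

    X:             input sequence, shape (seq_len, d_model)
    W_q, W_k, W_v: learned projection matrices, shape (d_model, d_k)
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Every position attends to every other position in one step,
    # regardless of distance, which is how long-range dependencies
    # between input elements are captured.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                               # shape (seq_len, d_k)
```

A multi-head layer runs several such heads in parallel with separate learned projections and concatenates their outputs, which yields the richer representation of pairwise relationships mentioned above.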
Time2Vec [16] is a model-agnostic vector representation
approach for time-series data that can be embedded into many
existing models. The experimental results in [16] demonstrate
that a range of standard models equipped with the Time2Vec
method, e.g., LSTM, show impressive performance. Thus, in our
experiments, we used the Transformer model with Time2Vec.
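For a scalar time step tau, Time2Vec [16] produces one linear component plus k periodic components; a minimal sketch with sine as the periodic function (the choice used in [16]), where the weights would normally be learned:

```python
import numpy as np

def time2vec(tau, omega, phi):
    """Time2Vec embedding of a scalar time step tau [16].

    omega, phi: learned frequency and phase vectors of length k+1.
    Component 0 is linear (captures trend); components 1..k are
    periodic (capture recurring patterns), here with sin as in [16].
    """
    v = omega * tau + phi
    return np.concatenate(([v[0]], np.sin(v[1:])))

# Hypothetical usage: embed time step 7 into a 4-dimensional vector.
rng = np.random.default_rng(0)
print(time2vec(7.0, rng.normal(size=4), rng.normal(size=4)))
```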
3. REFERENCE SIMULINK MODEL
In this paper, we use the same Simulink model as we used
in [2] for the LSTM based anomaly detection. The model
consists of two industrial robotic manipulators. The top level of
the Simulink model is shown in Figure 2. There are multiple sensors embedded in the two robotic arms. We take a sensor signal from the second robotic arm as our reference time-series data. The signal is the actual position signal of a joint of the second robotic arm. The working process of the case study system is as follows: the first robotic arm picks up the shared tool at position A and, after moving it through several random points, places it at position B. The second robotic arm moves similarly, except that it returns the shared tool from position B to position A.
FIGURE 2: THE TOP LAYER OF THE SIMULINK MODEL OF
TWO ROBOTIC MANIPULATORS. IMAGE SOURCE [2].
4. MODELS AND METHODS
In this paper, we compare the prediction performance of an
LSTM model and a Transformer-based model. For a fair comparison, both models receive the same input data, and the shared hyperparameters, e.g., the number of lookback time steps, are set to the same values. However, these two fundamentally different models also have their own specific hyperparameters; thus, we introduce the training and prediction process of both models in this section. Figure 3 sketches the workflow of our comparison method.
FIGURE 3: THE WORKFLOW OF THE COMPARISON
METHOD.
4.1 Dataset
Based on the Simulink model of the two manipulators, we continuously observe and record a sensor signal from the second robotic arm over 250 seconds. The obtained time-series data has a constant sampling rate of 10 samples per second. The original data is shown in Figure 4.
FIGURE 4: RECORDED DATASET FROM SENSOR 1 OF THE
SECOND MANIPULATOR.
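Both predictors consume this signal as sliding windows: 50 lookback samples mapped to 1 lookahead sample (cf. Tables 1 and 2), with an 80/10/10 train/validation/test split (Section 4.3). A minimal sketch of this preprocessing, using a synthetic stand-in for the recorded signal:

```python
import numpy as np

def make_windows(series, lookback=50, lookahead=1):
    """Slice a 1-D signal into (lookback -> lookahead) training pairs."""
    X, y = [], []
    for i in range(len(series) - lookback - lookahead + 1):
        X.append(series[i : i + lookback])
        y.append(series[i + lookback : i + lookback + lookahead])
    return np.array(X), np.array(y)

# 250 s at 10 samples/s gives 2500 samples; split 80/10/10 as in Section 4.3.
series = np.sin(np.linspace(0, 50, 2500))  # stand-in for the recorded signal
n = len(series)
train = series[: int(0.8 * n)]
val = series[int(0.8 * n) : int(0.9 * n)]
test = series[int(0.9 * n) :]
X_train, y_train = make_windows(train)
print(X_train.shape, y_train.shape)  # (1950, 50) (1950, 1)
```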
4.2 LSTM-based model
A standard LSTM network is trained to be a time-series data predictor. The structure of the LSTM cell is illustrated in Figure 5, and the training process of the LSTM network has been introduced in detail in [2]. The selected hyperparameters are listed in Table 1. The LSTM
network has 50 lookback steps and 1 lookahead step. Two
consecutive hidden recurrent layers are fully connected with a
dropout of 0.3 in-between. Both layers consist of 80 LSTM units.
Adam optimizer is selected during the training process and the
batch size is 70. The training is done for 35 epochs with early
stopping.
FIGURE 5: STRUCTURE OF LSTM CELL UNIT
TABLE 1: HYPERPARAMETERS OF THE LSTM MODEL.
Lookback steps: 50
Lookahead steps: 1
LSTM units: 80
Batch size: 70
Epochs: 35
Optimizer: Adam
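A minimal Keras sketch of the LSTM predictor implied by Table 1 and the description above; this is our reconstruction under those stated hyperparameters, not the authors' released code, and the early-stopping patience and loss function are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two stacked LSTM layers of 80 units, dropout 0.3 in between (Table 1).
model = keras.Sequential([
    layers.Input(shape=(50, 1)),             # 50 lookback steps, 1 signal
    layers.LSTM(80, return_sequences=True),  # first recurrent layer
    layers.Dropout(0.3),
    layers.LSTM(80),                         # second recurrent layer
    layers.Dense(1),                         # 1 lookahead step
])
# Loss choice is ours; the paper reports MAE as the accuracy metric.
model.compile(optimizer="adam", loss="mae")

# Training per Table 1: batch size 70, 35 epochs, early stopping
# (patience is our assumption).
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=70, epochs=35,
#           callbacks=[keras.callbacks.EarlyStopping(patience=3)])
```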
4.3 Transformer-based model
The structure of the Transformer-based model is shown in Figure 6. In this paper, we use a variant of the classical Transformer model. First, we keep the encoder part of the classical Transformer, but the decoder part is completely removed, and two dense layers at the end of the model predict the output elements. Second, since we are dealing with sequential scalar values rather than words or sentences, we use the Time2Vec approach instead of the classical Transformer's positional encoding for words. The Time2Vec method provides a time representation for the Transformer model, which cannot otherwise capture the positional information of time-series data. The process is analogous to the positional encoding that marks the position of each word in a sentence. According to [16], embedding the Time2Vec approach into many popular deep learning models improves their performance in time-series data analysis tasks.
FIGURE 6: THE STRUCTURE OF THE TRANSFORMER-BASED
MODEL. ADOPTED FROM [17].
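A minimal Keras sketch of such an encoder-only predictor: a Time2Vec-style embedding is concatenated to the raw input, one or more multi-head self-attention encoder blocks follow, and two dense layers produce the prediction. The wiring and the layer widths marked below are our illustrative assumptions; the authors' exact structure is in Figure 6 (adopted from [17]).

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_predictor(lookback=50, heads=1, att_layers=1, d_k=100, t2v_dim=8):
    """Encoder-only Transformer predictor with Time2Vec time features.

    heads, att_layers, d_k follow Table 2; the Time2Vec dimension and
    dense-layer width below are illustrative assumptions.
    """
    inp = keras.Input(shape=(lookback, 1))  # 50 past samples, 1 signal
    # Time2Vec-style embedding (cf. [16]): one linear component plus
    # t2v_dim-1 sine components per time step, concatenated to the input.
    proj = layers.Dense(t2v_dim)(inp)
    t2v = layers.Concatenate()(
        [proj[..., :1], layers.Lambda(tf.sin)(proj[..., 1:])]
    )
    x = layers.Concatenate()([inp, t2v])
    for _ in range(att_layers):  # encoder blocks with residual connections
        att = layers.MultiHeadAttention(num_heads=heads, key_dim=d_k)(x, x)
        x = layers.LayerNormalization()(layers.Add()([x, att]))
        ff = layers.Dense(x.shape[-1], activation="relu")(x)
        x = layers.LayerNormalization()(layers.Add()([x, ff]))
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)  # first dense layer
    out = layers.Dense(1)(x)                    # second dense layer: prediction
    return keras.Model(inp, out)

model = build_predictor()
model.compile(optimizer="adam", loss="mae")
```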
The sensor signal is shown in Figure 4, and we split the data as follows: the first 80% of the whole sequence is used as the training data, the next 10% as the validation data, and the last 10% as the test data for the prediction experiments. Matching the hyperparameters of the LSTM network, we also use 35 training epochs, a batch size of 70, the same optimizer (Adam), and the same numbers of lookback and lookahead steps. However, we do not fix the Transformer-specific hyperparameters, e.g., d_k (the dimension of the keys); instead, we test their influence on the prediction performance in the following experiments. The hyperparameters of the Transformer-based model are shown in Table 2.
TABLE 2: HYPERPARAMETERS OF THE TRANSFORMER MODELS.
Lookback steps: 50
Lookahead steps: 1
Number of heads: not fixed
Number of attention layers: not fixed
d_k: not fixed
Batch size: 70
Epochs: 35
Optimizer: Adam
5. EXPERIMENTS AND RESULTS
Using the last 10% of the data from Figure 4 as our test data, as sketched in Figure 7, we carry out prediction experiments with the LSTM model and the Transformer-based model. For each model, we input 250 data samples and let it predict 200 data samples. As described before, although we try to give the two models the same hyperparameters to make them comparable, each still has its specific hyperparameters, e.g., the number of heads or attention layers inside the Transformer-based model. Thus, we conduct several preliminary experiments (pre-experiments) and take their results as a preliminary reference, based on which we implement the prediction experiments in the next step.
FIGURE 7: TEST DATA.
5.1 Preliminary experiment results
First, through two sets of pre-experiments, we test the impact of the number of heads and the number of attention layers inside the Transformer-based model on the prediction speed and accuracy. Different d_k values are also taken into consideration. The parameters of the two sets are as follows: (i) in the first set, the number of heads is fixed at 1 while the number of attention layers ranges from 1 to 3; (ii) in the second set, the number of attention layers is fixed at 1 while the number of heads ranges from 1 to 3. For each set, d_k ranges from 1 to 200 with varying increments.
Taking all six configurations (numbers of heads and attention layers) into consideration, we train the models with different d_k values and run ten prediction experiments with each trained model. We then obtain the average prediction time and the Mean Absolute Error (MAE) for each model. The experimental results are shown in Figure 8 and Figure 9. It is worth mentioning that Transformer-based models with fewer heads and layers require significantly less training time than LSTM networks. However, for online anomaly detection, the model is usually trained offline and then applied to online detection. Therefore, in this work, the training time is not the key factor we consider. Faster prediction, on the other hand, is indeed needed for online anomaly detection; even a modest improvement in prediction speed can be meaningful and important. This is because most sensors of a modern CPS generate new samples at short intervals in online mode, usually 0.1 seconds or less, so a faster prediction, e.g., from 0.8 s to 0.1 s, directly benefits online anomaly detection.
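A sketch of how such measurements can be collected; this is our own instrumentation under the setup described above, not the authors' script:

```python
import time
import numpy as np

def evaluate(model, X_test, y_test, repeats=10):
    """Average wall-clock prediction time and MAE over repeated runs."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        y_pred = model.predict(X_test, verbose=0)
        times.append(time.perf_counter() - start)
    mae = float(np.mean(np.abs(y_pred.ravel() - y_test.ravel())))
    return np.mean(times), mae

# avg_time, mae = evaluate(model, X_test, y_test)
# print(f"avg prediction time: {avg_time:.3f} s, MAE: {mae:.4f}")
```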
FIGURE 8: RESULTS OF THE FIRST SET OF PRE-
EXPERIMENTS.
FIGURE 9: RESULTS OF THE SECOND SET OF PRELIMINARY
EXPERIMENTS.
These plots show the average prediction time, and from the results we can conclude: (i) the average prediction time increases significantly with the number of attention layers when the Transformers have the same number of heads; (ii) when the Transformers have the same number of attention layers, the average prediction time increases slightly with the number of heads; (iii) the value of d_k shows no consistent influence on the average prediction time but rather considerable randomness. From the perspective of prediction accuracy, neither the number of heads or layers nor the d_k value has a consistent influence on the MAE. However, guided by these preliminary results, we find that with d_k=100 all six Transformers achieve prediction accuracy comparable to or even better than the LSTM. Thus, in the next step, we use d_k=100 as the reference hyperparameter to train the models and conduct the subsequent prediction experiments.
5.2 Final experiment results
Next, we implement the final prediction experiments, training Transformers with different numbers of layers and heads separately. The experimental parameters are set as follows: the number of heads varies from 1 to 3, the number of attention layers also varies from 1 to 3, and d_k is fixed at 100. Therefore, we have a total of nine Transformer-based models with different structures.
Since the attention layers are randomly initialized with random weights before each training run, the trained Transformer-based predictors show varying accuracy across runs, which is a common phenomenon when training neural networks. To mitigate the effect of this randomness, we train each model ten times, giving ninety trained models in total, and conduct twenty prediction experiments with each of them. We then obtain the average prediction time and average MAE, as shown in Figure 10 (the point labeled '1 h, 1 al' denotes the Transformer with one head and one attention layer).
FIGURE 10: RESULTS OF PREDICTION EXPERIMENTS.
Based on the experimental results, we can conclude: (i) once again, the average prediction time increases with both the number of heads and the number of attention layers; (ii) the Transformer with one head and one attention layer shows an MAE comparable to the LSTM network, yet it predicts nearly twice as fast as the LSTM, which confirms that it is more suitable for online anomaly detection.
FIGURE 11: EXTRACTION OF EXPERIMENTAL RESULTS
WITH ONE HEAD AND ONE ATTENTION LAYER.
As described before, each Transformer structure is trained ten times, yielding ten different trained models. We extract the experimental results of the Transformer model with one head and one attention layer and plot them in Figure 11 (the point labeled 'M9' denotes the ninth trained model). Even with the same structure, the accuracy of the prediction model varies between training runs. Nevertheless, we believe that Transformer-based models have the potential to be trained as faster and more accurate predictors than LSTM; the 'M9' point provides evidence for this. The prediction curves of the LSTM and the M9 Transformer model are plotted in Figure 12 and Figure 13, respectively.
FIGURE 12: PREDICTION CURVE FROM LSTM.
FIGURE 13: PREDICTION CURVE FROM TRANSFORMER-
BASED MODEL WITH ONE HEAD AND ONE ATTENTION
LAYER.
6. CONCLUSIONS
In this paper, we have compared two deep learning models
for anomaly detection for time-series signals of an industrial
CPS. Both models use the prediction approach: We exploit the
deep learning model to predict the next value based on the
previous values and compare the prediction with the actual
value. It is assumed that the difference should be within a certain
threshold. The first model is based on the LSTM architecture.
The second model is based on the Transformer architecture. In
our experiments, we compared the prediction accuracy and
speed. The results show that the Transformer-based model with
one head and one attention layer can achieve very close
prediction accuracy to the LSTM. However, the prediction speed of the Transformer-based model is nearly two times faster than that of the LSTM-based model. Based on this result,
we conclude that the Transformer-based model is a better choice
for online, real-time anomaly detection. In the future, inspired by the Transformer's capacity for fast prediction, we have reason to believe that the Transformer-based model can be applied to anomaly detection for larger-scale, higher-dimensional time-series data and achieve better results, which is our next research step.
REFERENCES
[1] Chalapathy, R., and Chawla, S. "Deep Learning for Anomaly Detection: A Survey." 2019.
[2] Ding, S., et al. "Model-Based Error Detection for Industrial Automation Systems Using LSTM Networks." Springer, Cham, 2020.
[3] Ding, K., et al. "On-Line Error Detection and Mitigation for Time-Series Data of Cyber-Physical Systems Using Deep Learning Based Methods." 2019 15th European Dependable Computing Conference (EDCC), IEEE, 2019.
[4] Vaswani, A., et al. "Attention Is All You Need." 2017.
[5] Wu, N., et al. "Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case." 2020.
[6] Box, G., and Jenkins, G. Time Series Analysis: Forecasting and Control. Holden Day, San Francisco, 1976.
[7] Bengio, Y., and LeCun, Y. "Scaling Learning Algorithms Towards AI." Large-Scale Kernel Machines, vol. 34, no. 5, pp. 1–41, 2007.
[8] Hochreiter, S., and Schmidhuber, J. "Long Short-Term Memory." Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[9] Gers, F. A., Schmidhuber, J., and Cummins, F. "Learning to Forget: Continual Prediction with LSTM." Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Manno, Switzerland, Tech. Rep. IDSIA-01-99, 1999.
[10] Graves, A., Jaitly, N., and Mohamed, A.-R. "Hybrid Speech Recognition with Deep Bidirectional LSTM." Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2013, pp. 273–278.
[11] Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. "LSTM: A Search Space Odyssey." IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[12] Lipton, Z. C., et al. "Learning to Diagnose with LSTM Recurrent Neural Networks." 2015.
[13] Fu, R., Zhang, Z., and Li, L. "Using LSTM and GRU Neural Network Methods for Traffic Flow Prediction." Proc. Youth Academic Annual Conference of the Chinese Association of Automation (YAC), Nov. 2016, pp. 324–328.
[14] Filonov, P., Lavrentyev, A., and Vorontsov, A. "Multivariate Industrial Time Series with Cyber-Attack Simulation: Fault Detection Using an LSTM-based Predictive Data Model." 2016.
[15] Li, S., et al. "Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting." Advances in Neural Information Processing Systems, pp. 5243–5253, 2019.
[16] Kazemi, S. M., et al. "Time2Vec: Learning a Vector Representation of Time." 2019.
[17] Schmitz, J. "Stock Predictions with State-of-the-Art Transformer and Time Embeddings." Blog post, 2020.