A guide to recurrent neural networks and backpropagation
Mikael Bodén
mikael.boden@ide.hh.se
School of Information Science, Computer and Electrical Engineering
Halmstad University.
November 13, 2001
Abstract
This paper provides guidance to some of the concepts surrounding recurrent neural networks. Contrary to feedforward networks, recurrent networks are sensitive to, and can adapt to, past inputs. Backpropagation learning is described for feedforward networks, adapted to suit our (probabilistic) modeling needs, and extended to cover recurrent networks. The aim of this brief paper is to set the scene for applying and understanding recurrent neural networks.
1 Introduction
It is well known that conventional feedforward neural networks can be used to approximate any spatially finite function given a (potentially very large) set of hidden nodes. That is, for functions which have a fixed input space there is always a way of encoding these functions as neural networks. For a two-layered network, the mapping consists of two steps,

y(t) = G(F(x(t))).    (1)

We can use automatic learning techniques such as backpropagation to find the weights of the network (G and F) if sufficient samples from the function are available.
Recurrent neural networks are fundamentally different from feedforward architectures in the sense that they not only operate on an input space but also on an internal state space, a trace of what has already been processed by the network. This is equivalent to an Iterated Function System (IFS; see (Barnsley, 1993) for a general introduction to IFSs; (Kolen, 1994) for a neural network perspective) or a Dynamical System (DS; see e.g. (Devaney, 1989) for a general introduction to dynamical systems; (Tino et al., 1998; Casey, 1996) for neural network perspectives). The state space enables the representation (and learning) of temporally/sequentially extended dependencies over unspecified (and potentially infinite) intervals according to

y(t) = G(s(t))    (2)

s(t) = F(s(t-1), x(t)).    (3)
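As a concrete illustration, the state-space view of Equations 2 and 3 amounts to iterating a state update and reading the output off the state. The following is a minimal sketch in Python/NumPy; the functions F and G are placeholders (assumptions, not part of the paper) standing in for the learned mappings of a trained network.

    import numpy as np

    def run_recurrent(F, G, inputs, s0):
        # Iterate s(t) = F(s(t-1), x(t)) and collect y(t) = G(s(t)).
        s = s0
        outputs = []
        for x in inputs:
            s = F(s, x)           # new state from previous state and current input
            outputs.append(G(s))  # output depends only on the current state
        return outputs

    # Example with arbitrary (untrained) maps, just to show the calling convention:
    # F = lambda s, x: np.tanh(0.5 * s + 0.5 * x)
    # G = lambda s: s
    # run_recurrent(F, G, [0.1, 0.5, -0.3], s0=0.0)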
This document was mainly written while the author was at the Department of Computer Science, University of Skövde.
To limit the scope of this paper and simplify mathematical matters we will assume that the network operates in discrete time steps (it is perfectly possible to use continuous time instead). It turns out that if we further assume that weights are at least rational and continuous output functions are used, networks are capable of representing any Turing Machine (again assuming that any number of hidden nodes are available). This is important since we then know that all that can be computed, can be processed¹ equally well with a discrete time recurrent neural network. It has even been suggested that if real weights are used (the neural network is completely analog) we get super-Turing Machine capabilities (Siegelmann, 1999).
2 Some basic definitions
To simplify notation we will restrict equations to include two-layered networks, i.e. net-
works with two layers of nodes excluding the input layer (leaving us with one ’hidden’ or
’state’ layer, and one ’output’ layer). Each layer will have its own index variable: k for
output nodes, j (and h) for hidden, and i for input nodes. In a feedforward network, the input vector, x, is propagated through a weight layer, V,

y_j(t) = f(net_j(t))    (4)

net_j(t) = Σ_{i=1}^{n} x_i(t) v_{ji} + θ_j    (5)

where n is the number of inputs, θ_j is a bias, and f is an output function (of any differentiable type). A network is shown in Figure 1.
In a simple recurrent network, the input vector is similarly propagated through a weight layer, but also combined with the previous state activation through an additional recurrent weight layer, U,

y_j(t) = f(net_j(t))    (6)

net_j(t) = Σ_{i=1}^{n} x_i(t) v_{ji} + Σ_{h=1}^{m} y_h(t-1) u_{jh} + θ_j    (7)

where m is the number of 'state' nodes.
The output of the network is in both cases determined by the state and a set of output weights, W,

y_k(t) = g(net_k(t))    (8)

net_k(t) = Σ_{j=1}^{m} y_j(t) w_{kj} + θ_k    (9)

where g is an output function (possibly the same as f).
¹ I am intentionally avoiding the term 'computed'.
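To make the notation above concrete, the forward passes of Equations 4-9 can be sketched in matrix form as follows. This is only an illustration under the assumption of a logistic f and g; the matrices V, U, W and the bias vectors are hypothetical names for the weight layers described above.

    import numpy as np

    def logistic(net):
        return 1.0 / (1.0 + np.exp(-net))

    def feedforward_step(x, V, theta_j, W, theta_k):
        # Equations 4-5 and 8-9: input -> hidden -> output.
        y_j = logistic(V @ x + theta_j)       # net_j = sum_i x_i v_ji + theta_j
        y_k = logistic(W @ y_j + theta_k)     # net_k = sum_j y_j w_kj + theta_k
        return y_j, y_k

    def srn_step(x, y_prev, V, U, theta_j, W, theta_k):
        # Equations 6-9: the hidden net also receives the previous state through U.
        y_j = logistic(V @ x + U @ y_prev + theta_j)
        y_k = logistic(W @ y_j + theta_k)
        return y_j, y_k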
Figure 1: A feedforward network.
Figure 2: A simple recurrent network.
3 The principle of backpropagation
Any network structure can be trained with backpropagation when desired output patterns
exist and each function that has been used to calculate the actual output patterns is
differentiable. As with conventional gradient descent (or ascent), backpropagation works
by, for each modifiable weight, calculating the gradient of a cost (or error) function with
respect to the weight and then adjusting it accordingly.
The most frequently used cost function is the summed squared error (SSE). Each
pattern or presentation (from the training set), p, adds to the cost, over all output units,
k.
C = (1/2) Σ_{p=1}^{n} Σ_{k=1}^{m} (d_{pk} - y_{pk})^2    (10)

where d is the desired output, n is the total number of available training samples and m is the total number of output nodes.
According to gradient descent, each weight change in the network should be propor-
tional to the negative gradient of the cost with respect to the specific weight we are
interested in modifying.
Δw = -η ∂C/∂w    (11)

where η is a learning rate.
The weight change is best understood (using the chain rule) by distinguishing between an error component, δ = -∂C/∂net, and ∂net/∂w. Thus, the error for output nodes is

δ_{pk} = -(∂C/∂y_{pk})(∂y_{pk}/∂net_{pk}) = (d_{pk} - y_{pk}) g'(y_{pk})    (12)
and for hidden nodes

δ_{pj} = -(Σ_{k=1}^{m} (∂C/∂y_{pk})(∂y_{pk}/∂net_{pk})(∂net_{pk}/∂y_{pj})) (∂y_{pj}/∂net_{pj}) = Σ_{k=1}^{m} δ_{pk} w_{kj} f'(y_{pj}).    (13)
For a first-order polynomial, ∂net/∂w equals the input activation. The weight change is then simply

Δw_{kj} = η Σ_{p=1}^{n} δ_{pk} y_{pj}    (14)
for output weights, and

Δv_{ji} = η Σ_{p=1}^{n} δ_{pj} x_{pi}    (15)
for input weights. Adding a time subscript, the recurrent weights can be modified according to

Δu_{jh} = η Σ_{p=1}^{n} δ_{pj}(t) y_{ph}(t-1).    (16)
A common choice of output function is the logistic function

g(net) = 1 / (1 + e^{-net}).    (17)
The derivative of the logistic function can be written as

g'(y) = y(1 - y).    (18)
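Putting Equations 12-15 and 18 together, one backpropagation step for the feedforward case can be sketched as below. This assumes the SSE cost, logistic units throughout, and online (per-pattern) updates; the variable names and learning rate are illustrative only, and bias updates are added by analogy (treating each bias as a weight from a constant input of 1).

    import numpy as np

    def logistic(net):
        return 1.0 / (1.0 + np.exp(-net))

    def backprop_step(x, d, V, theta_j, W, theta_k, eta=0.1):
        # Forward pass (Equations 4-5 and 8-9) with logistic f and g.
        y_j = logistic(V @ x + theta_j)
        y_k = logistic(W @ y_j + theta_k)
        # Output deltas, Equation 12, using g'(y) = y(1 - y) from Equation 18.
        delta_k = (d - y_k) * y_k * (1.0 - y_k)
        # Hidden deltas, Equation 13: propagate the deltas back through W, then f'.
        delta_j = (W.T @ delta_k) * y_j * (1.0 - y_j)
        # Weight changes, Equations 14-15: delta times the presynaptic activation.
        # (The arrays are updated in place.)
        W += eta * np.outer(delta_k, y_j)
        V += eta * np.outer(delta_j, x)
        # Biases updated as weights from a constant input of 1 (an assumption).
        theta_k += eta * delta_k
        theta_j += eta * delta_j
        return y_k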
For obvious reasons most cost functions are 0 when each target equals the actual output
of the network. There are, however, more appropriate cost functions than SSE for guiding
weight changes during training (Rumelhart et al., 1995). The common assumptions of
the ones listed below are that the relationship between the actual and desired output is
probabilistic (the network is still deterministic) and has a known distribution of error.
This, in turn, puts the interpretation of the output activation of the network on a sound
theoretical footing.
If the output of the network is the mean of a Gaussian distribution (given by the
training set) we can instead minimize
C = Σ_{p=1}^{n} Σ_{k=1}^{m} (y_{pk} - d_{pk})^2 / (2σ^2)    (19)

where σ is assumed to be fixed. This cost function is indeed very similar to SSE.
With a Gaussian distribution (outputs are not explicitly bounded), a natural choice
of output function of the output nodes is
g(net) = net. (20)
The weight change then simply becomes

Δw_{kj} = η Σ_{p=1}^{n} (d_{pk} - y_{pk}) y_{pj}.    (21)
If a binomial distribution is assumed (each output value is a probability that the desired
output is 1 or 0, e.g. feature detection), an appropriate cost function is the so-called cross
entropy,
C = -Σ_{p=1}^{n} Σ_{k=1}^{m} [d_{pk} ln y_{pk} + (1 - d_{pk}) ln(1 - y_{pk})].    (22)
If outputs are distributed over the range 0 to 1 (as here), the logistic output function is
useful (see Equation 17). Again the output weight change is
Δw_{kj} = η Σ_{p=1}^{n} (d_{pk} - y_{pk}) y_{pj}.    (23)
If the problem is that of "1-of-n" classification, a multinomial distribution is appropriate. A suitable cost function is

C = -Σ_{p=1}^{n} Σ_{k=1}^{m} d_{pk} ln (e^{net_k} / Σ_{q} e^{net_q})    (24)
where q is yet another index of all output nodes. If the right output function is selected,
the so-called softmax function,
g(net_k) = e^{net_k} / Σ_{q} e^{net_q},    (25)
the now familiar update rule follows automatically,
Δw_{kj} = η Σ_{p=1}^{n} (d_{pk} - y_{pk}) y_{pj}.    (26)
As shown in (Rumelhart et al., 1995) this result occurs whenever we choose a probability
function from the exponential family of probability distributions.
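To illustrate the multinomial case, the sketch below evaluates the softmax output of Equation 25 and forms the output-weight change of Equation 26; combining the cost of Equation 24 with the softmax makes the error term collapse to (d - y). The function and variable names are, again, only illustrative.

    import numpy as np

    def softmax(net):
        e = np.exp(net - net.max())     # subtract the max for numerical stability
        return e / e.sum()

    def softmax_output_update(y_j, net_k, d, eta=0.1):
        # Equation 25: softmax output over all output nodes.
        y_k = softmax(net_k)
        # Equation 26: the error term reduces to (d - y) for the exponential family.
        delta_k = d - y_k
        return eta * np.outer(delta_k, y_j)   # the change to the output weights W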
Figure 3: A "tapped delay line" feedforward network.
4 Tapped delay line memory
The perhaps easiest way to incorporate temporal or sequential information into a training
situation is to make the temporal domain spatial and use a feedforward architecture.
Information available back in time is inserted by widening the input space according to
a fixed and pre-determined “window” size, X = x(t), x(t 1), x(t 2), ..., x(t ω) (see
Figure 3). This is often called a tapped delay line since inputs are put in a delayed buffer
and discretely shifted as time passes.
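A tapped delay line is easy to sketch in code: the sequence is cut into overlapping windows of ω + 1 consecutive inputs, which are then treated as ordinary (spatial) input vectors to a feedforward network. The function below is a minimal illustration for a one-dimensional signal; the window size is an arbitrary parameter chosen for the example.

    import numpy as np

    def tapped_delay_line(sequence, omega):
        # Build inputs X(t) = [x(t), x(t-1), ..., x(t-omega)] from a 1-D sequence.
        windows = []
        for t in range(omega, len(sequence)):
            # most recent value first, oldest value (t - omega) last
            windows.append(sequence[t - omega:t + 1][::-1])
        return np.array(windows)

    # Example with omega = 2:
    # tapped_delay_line(np.array([1., 2., 3., 4., 5.]), 2)
    # -> [[3., 2., 1.], [4., 3., 2.], [5., 4., 3.]]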
It is also possible to manually extend this approach by selecting certain intervals “back
in time” over which one uses an average or other pre-processed features as inputs which
may reflect the signal decay.
The classical example of this approach is the NETtalk system (Sejnowski and Rosen-
berg, 1987) which learns from example to pronounce English words displayed in text at
the input. The network accepts seven letters at a time of which only the middle one is
pronounced.
One disadvantage is that the user has to select the maximum number of time steps that is useful to the network. Moreover, the use of independent weights for processing the same components in different time steps harms generalization. In addition, the large number of weights requires a larger set of examples to avoid over-specialization.
5 Simple recurrent network
A strict feedforward architecture does not maintain a short-term memory. Any memory
effects are due to the way past inputs are re-presented to the network (as for the tapped
delay line).
Figure 4: A simple recurrent network.
A simple recurrent network (SRN; (Elman, 1990)) has activation feedback which embodies short-term memory. A state layer is updated not only with the external input of the network but also with activation from the previous forward propagation. The feedback is modified by a set of weights so as to enable automatic adaptation through learning (e.g. backpropagation).
5.1 Learning in SRNs: Backpropagation through time
In the original experiments presented by Jeff Elman (Elman, 1990) so-called truncated backpropagation was used. This basically means that y_j(t-1) was simply regarded as an additional input. Any error at the state layer, δ_j(t), was used to modify weights from this additional input slot (see Figure 4).
Errors can be backpropagated even further. This is called backpropagation through
time (BPTT; (Rumelhart et al., 1986)) and is a simple extension of what we have seen
so far. The basic principle of BPTT is that of “unfolding.” All recurrent weights can
be duplicated spatially for an arbitrary number of time steps, here referred to as τ.
Consequently, each node which sends activation (either directly or indirectly) along a
recurrent connection has (at least) τ number of copies as well (see Figure 5).
In accordance with Equation 13, errors are thus backpropagated according to

δ_{pj}(t-1) = Σ_{h=1}^{m} δ_{ph}(t) u_{hj} f'(y_{pj}(t-1))    (27)
where h is the index for the activation receiving node and j for the sending node (one time
step back). This allows us to calculate the error as assessed at time t, for node outputs
(at the state or input layer) calculated on the basis of an arbitrary number of previous
presentations.
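The backward pass through time can be sketched as below for the state layer of an SRN with logistic units: the delta at time t contributes a gradient for the recurrent weights at each unfolded step, and is pushed one step further back with Equation 27. The stored state activations and the names used are assumptions made for this illustration, not part of the paper.

    import numpy as np

    def bptt_recurrent_gradient(delta_t, states, U, tau):
        # Backpropagate an error delta_t at time t through tau earlier steps
        # (Equation 27) and accumulate the gradient for the recurrent weights U.
        # states[-1] is y(t-1), states[-2] is y(t-2), and so on.
        dU = np.zeros_like(U)
        delta = delta_t
        for step in range(1, tau + 1):
            y_prev = states[-step]                # y(t - step)
            dU += np.outer(delta, y_prev)         # Equation 16 with a time index
            # Equation 27: push the delta one step back through U and f'.
            delta = (U.T @ delta) * y_prev * (1.0 - y_prev)
        return dU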
7
Weights V
Weights W
Output
State/hidden
Input
Weights U
Weights V
State/hidden (t-1)
Input (t-1)
Weights U
Weights V
State/hidden (t-2)
Input (t-2)
State/hidden (t-3)
Weights U
Figure 5: The effect of unfolding a network for BPTT (τ = 3).
It is important to note, however, that after the error deltas have been calculated, the unfolded weight copies are folded back, adding their contributions up to one big change for each weight. Obviously, there is a greater memory requirement (both past errors and activations need to be stored away) the larger τ we choose.
In practice, a large τ is quite useless due to a "vanishing gradient effect" (see e.g. (Bengio et al., 1994)). For each layer the error is backpropagated through, the error gets smaller and smaller until it diminishes completely. Some have also pointed out that the instability caused by possibly ambiguous deltas (e.g. (Pollack, 1991)) may disrupt convergence. An opposing result has been put forward for certain learning tasks (Bodén et al., 1999).
6 Discussion
There are many variations of the architectures and learning rules that have been discussed (e.g. so-called Jordan networks (Jordan, 1986), fully recurrent networks and real-time recurrent learning (Williams and Zipser, 1989), etc.). Recurrent networks share, however, the property of being able to internally use and create states reflecting temporal (or even structural) dependencies. For simpler tasks (e.g. learning grammars generated by small finite-state machines) the organization of the state space straightforwardly reflects the component parts of the training data (e.g. (Elman, 1990; Cleeremans et al., 1989)).
The state space is, in most cases, real-valued. This means that subtleties beyond the component parts, e.g. statistical regularities, may influence the organization of the state space (e.g. (Elman, 1993; Rohde and Plaut, 1999)). For more difficult tasks (e.g. where a longer trace of memory is needed, and context-dependence is apparent) the highly non-linear, continuous space offers novel kinds of dynamics (e.g. (Rodriguez et al., 1999; Bodén and Wiles, 2000)). These are intriguing research topics but beyond the scope of this introductory paper. Analyses of learned internal representations and processes/dynamics are crucial for our understanding of what and how these networks process. Methods of analysis include hierarchical cluster analysis (HCA), and eigenvalue and eigenvector characterizations (of which Principal Components Analysis is one).
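As a small illustration of the latter kind of analysis, the hidden-state vectors recorded while a trained network processes a sequence can be projected onto their leading principal components. The sketch below uses a plain eigendecomposition of the state covariance matrix; it is not tied to any particular network or dataset and is offered only as an example of the technique.

    import numpy as np

    def principal_components(states, n_components=2):
        # Project a (time x hidden) matrix of state activations onto its
        # leading principal components for visualization or analysis.
        centered = states - states.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
        return centered @ eigvecs[:, order]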
References
Barnsley, M. (1993). Fractals Everywhere. Academic Press, Boston, 2nd edition.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
Bodén, M. and Wiles, J. (2000). Context-free and context-sensitive dynamics in recurrent
neural networks. Connection Science, 12(3).
Bodén, M., Wiles, J., Tonkes, B., and Blair, A. (1999). Learning to predict a context-
free language: Analysis of dynamics in recurrent hidden units. In Proceedings of the
International Conference on Artificial Neural Networks, pages 359–364, Edinburgh.
IEE.
Casey, M. (1996). The dynamics of discrete-time computation, with application to re-
current neural networks and finite state machine extraction. Neural Computation,
8(6):1135–1178.
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. (1989). Finite state automata
and simple recurrent networks. Neural Computation, 1(3):372–381.
Devaney, R. L. (1989). An Introduction to Chaotic Dynamical Systems. Addison-Wesley.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14:179–211.
Elman, J. L. (1993). Learning and development in neural networks: The importance of
starting small. Cognition, 48:71–99.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. (1992).
Learning and extracted finite state automata with second-order recurrent neural
networks. Neural Computation, 4(3):393–405.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential
machine. In Proceedings of the Eighth Conference of the Cognitive Science Society.
Kolen, J. F. (1994). Fool’s gold: Extracting finite state machines from recurrent network
dynamics. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in
Neural Information Processing Systems, volume 6, pages 501–508. Morgan Kaufmann
Publishers, Inc.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7:227.
Rodriguez, P., Wiles, J., and Elman, J. L. (1999). A recurrent neural network that learns
to count. Connection Science, 11(1):5–40.
Rohde, D. L. T. and Plaut, D. C. (1999). Language acquisition in the absence of explicit
negative evidence: How important is starting small? Cognition, 72:67–109.
Rumelhart, D. E., Durbin, R., Golden, R., and Chauvin, Y. (1995). Backpropagation:
The basic theory. In Chauvin, Y. and Rumelhart, D. E., editors, Backpropagation:
Theory, architectures, and applications, pages 1–34. Lawrence Erlbaum, Hillsdale,
New Jersey.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal represen-
tations by back-propagating errors. Nature, 323:533–536.
Sejnowski, T. and Rosenberg, C. (1987). Parallel networks that learn to pronounce English
text. Complex Systems, 1:145–168.
Siegelmann, H. T. (1999). Neural Networks and Analog Computation: Beyond the Turing
Limit. Birkhäuser.
Tino, P., Horne, B. G., Giles, C. L., and Collingwood, P. C. (1998). Finite state machines
and recurrent neural networks automata and dynamical systems approaches. In
Dayhoff, J. and Omidvar, O., editors, Neural Networks and Pattern Recognition,
pages 171–220. Academic Press.
Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully
recurrent neural networks. Neural Computation, 1(2):270–280.