Learning Phrase Representations using RNN Encoder–Decoder
for Statistical Machine Translation
Kyunghyun Cho
Université de Montréal
kyunghyun.cho@umontreal.ca

Bart van Merriënboer
Université de Montréal
University of Gothenburg
vanmerb@iro.umontreal.ca

Caglar Gulcehre
Université de Montréal
gulcehrc@iro.umontreal.ca

Fethi Bougares    Holger Schwenk
Université du Maine
firstname.lastname@lium.univ-lemans.fr

Yoshua Bengio
Université de Montréal, CIFAR Senior Fellow
find.me@on.the.web
Abstract
In this paper, we propose a novel neural
network model called RNN Encoder–Decoder
that consists of two recurrent neural networks
(RNN). One RNN encodes a sequence of sym-
bols into a fixed-length vector representation,
and the other decodes the representation into
another sequence of symbols. The encoder
and decoder of the proposed model are jointly
trained to maximize the conditional probability
of a target sequence given a source sequence.
The performance of a statistical machine
translation system is empirically found to
improve by using the conditional probabilities
of phrase pairs computed by the RNN
Encoder–Decoder as an additional feature in
the existing log-linear model. Qualitatively,
we show that the proposed model learns a
semantically and syntactically meaningful
representation of linguistic phrases.
1 Introduction
Deep neural networks have shown great success in various applications such as object recognition (see, e.g., (Krizhevsky et al., 2012)) and speech recognition (see, e.g., (Dahl et al., 2012)). Furthermore, many recent works have shown that neural networks can be successfully used in a number of tasks in natural language processing (NLP). These include, but are not limited to, paraphrase detection (Socher et al., 2011), word embedding extraction (Mikolov et al., 2013) and language modeling (Bengio et al., 2003). In the field of statistical machine translation (SMT), deep neural networks have begun to show promising results. (Schwenk, 2012) summarizes a successful usage of feedforward neural networks in the framework of a phrase-based SMT system.
Along this line of research on using neural networks
for SMT, this paper focuses on a novel neural network
architecture that can be used as a part of the conven-
tional phrase-based SMT system. The proposed neural
network architecture, which we will refer to as an RNN
Encoder–Decoder, consists of two recurrent neural net-
works (RNN) that act as an encoder and a decoder pair.
The encoder maps a variable-length source sequence to
a fixed-length vector, and the decoder maps the vector
representation back to a variable-length target sequence.
The two networks are trained jointly to maximize the
conditional probability of the target sequence given
a source sequence. Additionally, we propose to use
a rather sophisticated hidden unit in order to improve
both the memory capacity and the ease of training.
The proposed RNN Encoder–Decoder with a novel
hidden unit is empirically evaluated on the task of
translating from English to French. We train the model
to learn the translation probability of an English phrase
to a corresponding French phrase. The model is then
used as a part of a standard phrase-based SMT system
by scoring each phrase pair in the phrase table. The
empirical evaluation reveals that this approach of
scoring phrase pairs with an RNN Encoder–Decoder
improves the translation performance.
We qualitatively analyze the trained RNN Encoder–
Decoder by comparing its phrase scores with those
given by the existing translation model. The qualitative
analysis shows that the RNN Encoder–Decoder is
better at capturing the linguistic regularities in the
phrase table, indirectly explaining the quantitative
improvements in the overall translation performance.
The further analysis of the model reveals that the
RNN Encoder–Decoder learns a continuous space
representation of a phrase that preserves both the
semantic and syntactic structure of the phrase.
2 RNN Encoder–Decoder
2.1 Preliminary: Recurrent Neural Networks
A recurrent neural network (RNN) is a neural network that consists of a hidden state $\mathbf{h}$ and an optional output $\mathbf{y}$ which operates on a variable-length sequence $\mathbf{x} = (x_1, \ldots, x_T)$. At each time step $t$, the hidden state $\mathbf{h}_{\langle t \rangle}$ of the RNN is updated by
$$\mathbf{h}_{\langle t \rangle} = f\left(\mathbf{h}_{\langle t-1 \rangle}, x_t\right), \tag{1}$$
where $f$ is a non-linear activation function. $f$ may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each timestep $t$ is
the conditional distribution $p(x_t \mid x_{t-1}, \ldots, x_1)$. For example, a multinomial distribution (1-of-$K$ coding) can be output using a softmax activation function
$$p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp\left(\mathbf{w}_j \mathbf{h}_{\langle t \rangle}\right)}{\sum_{j'=1}^{K} \exp\left(\mathbf{w}_{j'} \mathbf{h}_{\langle t \rangle}\right)}, \tag{2}$$
for all possible symbols $j = 1, \ldots, K$, where $\mathbf{w}_j$ are the rows of a weight matrix $\mathbf{W}$. By combining these probabilities, we can compute the probability of the sequence $\mathbf{x}$ using
$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1). \tag{3}$$
From this learned distribution, it is straightforward
to sample a new sequence by iteratively sampling a
symbol at each time step.
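
As a concrete illustration of Eqs. (1)–(3), here is a minimal NumPy sketch (all names, shapes and the choice of tanh for f are our own assumptions, not taken from the paper) that runs a simple RNN over a symbol sequence and accumulates its log-probability; following the usual convention, the state computed from x_1, ..., x_{t-1} is used to score x_t.

```python
import numpy as np

def softmax(a):
    a = a - a.max()              # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def sequence_log_prob(x_indices, Wxh, Whh, Wout, K):
    """Log-probability of a symbol sequence under a simple RNN (Eqs. (1)-(3)).

    x_indices : list of symbol ids in {0, ..., K-1}
    Wxh, Whh  : input-to-hidden and hidden-to-hidden weights
    Wout      : K x n_hidden output weight matrix W of Eq. (2)
    """
    n_hidden = Whh.shape[0]
    h = np.zeros(n_hidden)
    log_p = 0.0
    for x_t in x_indices:
        # Eq. (2)-(3): score x_t with the state summarizing the prefix
        p_t = softmax(Wout @ h)
        log_p += np.log(p_t[x_t])
        # Eq. (1): update the hidden state with the consumed symbol
        x_onehot = np.zeros(K)
        x_onehot[x_t] = 1.0
        h = np.tanh(Wxh @ x_onehot + Whh @ h)
    return log_p

# toy usage
K, n_hidden = 5, 8
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.1, size=(n_hidden, K))
Whh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Wout = rng.normal(scale=0.1, size=(K, n_hidden))
print(sequence_log_prob([1, 3, 0, 2], Wxh, Whh, Wout, K))
```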
2.2 RNN Encoder–Decoder
In this paper, we propose a novel neural network
architecture that learns to encode a variable-length se-
quence into a fixed-length vector representation and to
decode a given fixed-length vector representation back
into a variable-length sequence. From a probabilistic
perspective, this new model is a general method to
learn the conditional distribution over a variable-length
sequence conditioned on yet another variable-length
sequence, e.g. $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$, where one should note that the input and output sequence lengths $T$ and $T'$ may differ.
The encoder is an RNN that reads each symbol of an input sequence $\mathbf{x}$ sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary $\mathbf{c}$ of the whole input sequence.
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $\mathbf{h}_{\langle t \rangle}$. However, unlike the RNN described in Sec. 2.1, both $y_t$ and $\mathbf{h}_{\langle t \rangle}$ are also conditioned on $y_{t-1}$ and on the summary $\mathbf{c}$ of the input sequence. Hence, the hidden state of the decoder at time $t$ is computed by
$$\mathbf{h}_{\langle t \rangle} = f\left(\mathbf{h}_{\langle t-1 \rangle}, y_{t-1}, \mathbf{c}\right),$$
and similarly, the conditional distribution of the next symbol is
$$P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \mathbf{c}) = g\left(\mathbf{h}_{\langle t \rangle}, y_{t-1}, \mathbf{c}\right),$$
for given activation functions $f$ and $g$ (the latter must produce valid probabilities, e.g. with a softmax).
See Fig. 1 for a graphical depiction of the proposed model architecture.
Figure 1: An illustration of the proposed RNN Encoder–Decoder. (The encoder reads the input symbols x_1, x_2, ..., x_T into the summary c, from which the decoder generates y_1, y_2, ..., y_{T'}.)
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood
$$\max_{\boldsymbol{\theta}} \frac{1}{N} \sum_{n=1}^{N} \log p_{\boldsymbol{\theta}}(\mathbf{y}_n \mid \mathbf{x}_n), \tag{4}$$
where $\boldsymbol{\theta}$ is the set of the model parameters and each $(\mathbf{x}_n, \mathbf{y}_n)$ is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, we can use a gradient-based algorithm to estimate the model parameters.
Once the RNN Encoder–Decoder is trained, the
model can be used in two ways. One way is to use
the model to generate a target sequence given an input
sequence. On the other hand, the model can be used
to score a given pair of input and output sequences,
where the score is simply a probability $p_{\boldsymbol{\theta}}(\mathbf{y} \mid \mathbf{x})$ from Eqs. (3) and (4).
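
To make the two uses of the trained model concrete, the following sketch (ours, not the authors' code) shows how the score p_θ(y | x) of a sequence pair decomposes into per-symbol conditionals and how Eq. (4) averages it over training pairs; the hypothetical `step_prob` callback stands in for the decoder computation g(h_⟨t⟩, y_{t-1}, c) of Sec. 2.2.

```python
import numpy as np

def log_p_y_given_x(y, x, step_prob):
    """log p(y | x) as a sum of per-symbol terms log p(y_t | y_<t, c),
    where the summary c of x is hidden inside `step_prob`.

    step_prob(prev_target_symbols, x) must return a probability vector
    over the next target symbol (a stand-in for the decoder's softmax)."""
    log_p = 0.0
    for t in range(len(y)):
        p_next = step_prob(y[:t], x)
        log_p += np.log(p_next[y[t]])
    return log_p

def joint_objective(pairs, step_prob):
    """Eq. (4): average conditional log-likelihood over training pairs."""
    return np.mean([log_p_y_given_x(y, x, step_prob) for x, y in pairs])

# toy usage with a uniform "model" over a 4-symbol target vocabulary
uniform = lambda prev, x: np.full(4, 0.25)
pairs = [([0, 1], [2, 3, 1]), ([2], [0, 0])]
print(joint_objective(pairs, uniform))   # 2.5 * log(0.25) on average
```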
2.3 Hidden Unit that Adaptively Remembers and Forgets

In addition to a novel model architecture, we also propose a new type of hidden unit ($f$ in Eq. (1)) that has been motivated by the LSTM unit but is much simpler to compute and implement.¹ Fig. 2 shows the graphical depiction of the proposed hidden unit.

Let us describe how the activation of the $j$-th hidden unit is computed. First, the reset gate $r_j$ is computed by
$$r_j = \sigma\left([\mathbf{W}_r \mathbf{x}]_j + [\mathbf{U}_r \mathbf{h}_{\langle t-1 \rangle}]_j\right), \tag{5}$$
¹ The LSTM unit, which has shown impressive results in several applications such as speech recognition, has a memory cell and four gating units that adaptively control the information flow inside the unit. For details on LSTM networks, see, e.g., (Graves, 2012).
Figure 2: An illustration of the proposed hidden activation function. The update gate $z$ selects whether the hidden state is to be updated with a new hidden state $\tilde{h}$. The reset gate $r$ decides whether the previous hidden state is ignored. See Eqs. (5)–(8) for the detailed equations of $r$, $z$, $h$ and $\tilde{h}$.
where $\sigma$ is the logistic sigmoid function, and $[\cdot]_j$ denotes the $j$-th element of a vector. $\mathbf{x}$ and $\mathbf{h}_{\langle t-1 \rangle}$ are the input and the previous hidden state, respectively. $\mathbf{W}_r$ and $\mathbf{U}_r$ are weight matrices which are learned.

Similarly, the update gate $z_j$ is computed by
$$z_j = \sigma\left([\mathbf{W}_z \mathbf{x}]_j + [\mathbf{U}_z \mathbf{h}_{\langle t-1 \rangle}]_j\right). \tag{6}$$
The actual activation of the proposed unit $h_j$ is then computed by
$$h_j^{\langle t \rangle} = z_j h_j^{\langle t-1 \rangle} + (1 - z_j) \tilde{h}_j^{\langle t \rangle}, \tag{7}$$
where
$$\tilde{h}_j^{\langle t \rangle} = f\left([\mathbf{W} \mathbf{x}]_j + r_j [\mathbf{U} \mathbf{h}_{\langle t-1 \rangle}]_j\right). \tag{8}$$
In this formulation, when the reset gate is close
to 0, the hidden state is forced to ignore the previous
hidden state and reset with the current input only.
This effectively allows the hidden state to drop any
information that is found to be irrelevant later in the
future, thus, allowing a more compact representation.
On the other hand, the update gate controls how
much information from the previous hidden state
will carry over to the current hidden state. This acts
similarly to the memory cell in the LSTM network and
helps the RNN to remember long-term information.
Furthermore, this may be considered an adaptive variant
of a leaky-integration unit (Bengio et al., 2013).
As each hidden unit has separate reset and update
gates, each hidden unit will learn to capture dependen-
cies over different time scales. Those units that learn to
capture short-term dependencies will tend to have reset
gates that are frequently active, but those that capture
longer-term dependencies will have update gates that
are mostly active.
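
A minimal NumPy sketch of one update of the proposed gated unit, written vector-wise over all hidden units and following Eqs. (5)–(8); the weight names and shapes are our own assumptions, biases are omitted, and f is taken to be tanh as in the experiments of Sec. 4.1.1. In a full network this update is applied at every time step of both the encoder and the decoder.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_hidden_update(x, h_prev, W, U, Wz, Uz, Wr, Ur):
    """One step of the proposed hidden unit (Eqs. (5)-(8)), vectorized
    over all hidden units; biases omitted for brevity."""
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate, Eq. (5)
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate, Eq. (6)
    h_tilde = np.tanh(W @ x + r * (U @ h_prev))    # candidate state, Eq. (8)
    return z * h_prev + (1.0 - z) * h_tilde        # new state, Eq. (7)

# toy usage: 3-dimensional input, 4 hidden units
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W, Wz, Wr = (rng.normal(scale=0.1, size=(n_hid, n_in)) for _ in range(3))
U, Uz, Ur = (rng.normal(scale=0.1, size=(n_hid, n_hid)) for _ in range(3))
h = np.zeros(n_hid)
h = gated_hidden_update(rng.normal(size=n_in), h, W, U, Wz, Uz, Wr, Ur)
print(h)
```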
3 Statistical Machine Translation
In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation $\mathbf{f}$ given a source sentence $\mathbf{e}$, which maximizes
$$\log p(\mathbf{f} \mid \mathbf{e}) \propto \log p(\mathbf{e} \mid \mathbf{f}) + \log p(\mathbf{f}),$$
where the first term on the right-hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model $\log p(\mathbf{f} \mid \mathbf{e})$ as a log-linear model with additional features and corresponding weights:
$$\log p(\mathbf{f} \mid \mathbf{e}) \propto \sum_{n=1}^{N} w_n f_n(\mathbf{f}, \mathbf{e}), \tag{9}$$
where $f_n$ and $w_n$ are the $n$-th feature and weight, respectively. The weights are often optimized to maximize the BLEU score on a development set.

In the phrase-based SMT framework introduced in (Koehn et al., 2003) and (Marcu and Wong, 2002), the translation model $\log p(\mathbf{e} \mid \mathbf{f})$ is factorized into the translation probabilities of matching phrases in the source and target sentences.² These probabilities are once again considered additional features in the log-linear model (see Eq. (9)) and are weighted accordingly to maximize the BLEU score.
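
In code, the log-linear model of Eq. (9) is simply a weighted sum of feature functions evaluated on a candidate translation; the sketch below uses two invented toy features purely for illustration and indicates where an RNN Encoder–Decoder phrase score would be added as one more feature.

```python
def log_linear_score(f, e, features, weights):
    """Eq. (9): model log p(f | e) (up to normalization) as a weighted
    sum of feature functions f_n(f, e) with tuned weights w_n."""
    return sum(w * feat(f, e) for w, feat in zip(weights, features))

# two invented toy features; a real system would use phrase translation
# probabilities, lexical weights, a language model score, and so on --
# the RNN Encoder-Decoder phrase score would simply enter as another feature
features = [
    lambda f, e: -abs(len(f) - len(e)),   # crude length-difference penalty
    lambda f, e: float(len(f)),           # word-count feature
]
weights = [0.3, -0.1]

print(log_linear_score("la maison bleue".split(),
                       "the blue house".split(),
                       features, weights))   # -0.3
```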
Since the neural net language model was proposed
in (Bengio et al., 2003), neural networks have been
used widely in SMT systems. In many cases, neural
networks have been used to rescore translation
hypotheses (n-best lists) proposed by the existing
SMT system or decoder using a target language model
(see, e.g.,
(Schwenk et al., 2006)
). Recently, however,
there has been interest in training neural networks to
score the translated sentence (or phrase pairs) using
a representation of the source sentence as an additional
input. See, e.g., (Schwenk, 2012), (Son et al., 2012)
and (Zou et al., 2013).
3.1 Scoring Phrase Pairs with RNN Encoder–Decoder
Here we propose to train the RNN Encoder–Decoder
(see Sec. 2.2) on a table of phrase pairs and use its
scores as additional features in the log-linear model
in Eq. (9) when tuning the SMT decoder.
When we train the RNN Encoder–Decoder, we
ignore the (normalized) frequencies of each phrase
pair in the original corpora. This measure was taken
in order (1) to reduce the computational expense of
randomly selecting phrase pairs from a large phrase
table according to the normalized frequencies and (2)
to ensure that the RNN Encoder–Decoder does not
simply learn to rank the phrase pairs according to their
numbers of occurrences. One underlying reason for
this choice was that the existing translation probability
in the phrase table already reflects the frequencies of the phrase pairs in the original corpus. With a fixed capacity of the RNN Encoder–Decoder, we try to ensure that most of the capacity of the model is focused toward learning linguistic regularities, i.e., distinguishing between plausible and implausible translations, or learning the "manifold" (region of probability concentration) of plausible translations.

² Without loss of generality, from here on, we refer to $p(\mathbf{e} \mid \mathbf{f})$ for each phrase pair as a translation model as well.
Once the RNN Encoder–Decoder is trained, we
add a new score for each phrase pair to the existing
phrase table. This allows the new scores to enter into
the existing tuning algorithm with minimal additional
overhead in computation.
As Schwenk pointed out in (Schwenk, 2012), it
is possible to completely replace the existing phrase
table with the proposed RNN Encoder–Decoder.
In that case, for a given source phrase, the RNN
Encoder–Decoder will need to generate a list of (good)
target phrases. This requires, however, an expensive
sampling procedure to be performed repeatedly. At
the moment we consider doing this efficiently enough
to allow integration with the decoder an open problem,
and leave this to future work.
3.2 Related Approaches: Neural Networks in Machine Translation
Before presenting the empirical results, we discuss
a number of recent works that have proposed to use
neural networks in the context of SMT.
Schwenk in (Schwenk, 2012) proposed a similar
approach of scoring phrase pairs. Instead of the RNN-
based neural network, he used a feedforward neural
network that has fixed-size inputs (7 words in his case,
with zero-padding for shorter phrases) and fixed-size
outputs (7 words in the target language). When it is used
specifically for scoring phrases for the SMT system,
the maximum phrase length is often chosen to be small.
However, as the length of phrases increases or as we
apply neural networks to other variable-length sequence
data, it is important that the neural network can handle
variable-length input and output. The proposed RNN
Encoder–Decoder is well-suited for these applications.
Although it is not exactly a neural network they train,
the authors of (Zou et al., 2013) proposed to learn a
bilingual embedding of words/phrases. They use the
learned embedding to compute the distance between
a pair of phrases which is used as an additional score
of the phrase pair in an SMT system.
In (Chandar et al., 2014), a feedforward neural
network was trained to learn a mapping from a
bag-of-words representation of an input phrase to
an output phrase. This is closely related to both the
proposed RNN Encoder–Decoder and the model
proposed in (Schwenk, 2012), except that their input
representation of a phrase is a bag-of-words. Earlier,
a similar encoder–decoder model using two recursive
neural networks was proposed in (Socher et al., 2011), but their model was restricted to a monolingual setting, i.e. the model reconstructs an input sentence.
One important difference between the proposed
RNN Encoder–Decoder and the approaches in
(Zou et al., 2013) and (Chandar et al., 2014) is that
the order of the words in source and target phrases
is taken into account. The RNN Encoder–Decoder
naturally distinguishes between sequences that have
the same words but in a different order, whereas the
aforementioned approaches effectively ignore order
information.
The closest approach related to the proposed
RNN Encoder–Decoder is the Recurrent Contin-
uous Translation Model (Model 2) proposed in
(Kalchbrenner and Blunsom, 2013). In their paper,
they proposed a similar model that consists of an
encoder and decoder. The difference with our model
is that they used a convolutional $n$-gram model (CGM) for the encoder and the hybrid of an inverse CGM and a recurrent neural network for the decoder. They, however, evaluated their model on rescoring the $n$-best list proposed by the conventional SMT system.
4 Experiments
We evaluate our approach on the English/French
translation task of the WMT’14 workshop.
4.1 Data and Baseline System
Large amounts of resources are available to build
an English/French SMT system in the framework of
the WMT’14 translation task. The bilingual corpora
include Europarl (61M words), news commentary
(5.5M), UN (421M), and two crawled corpora of 90M
and 780M words respectively. The last two corpora are
quite noisy. To train the French language model, about
712M words of crawled newspaper material is available
in addition to the target side of the bitexts. All the word
counts refer to French words after tokenization.
It is commonly acknowledged that training statistical
models on the concatenation of all this data does
not necessarily lead to optimal performance, and
results in extremely large models which are difficult to
handle. Instead, one should focus on the most relevant
subset of the data for a given task. We have done so
by applying the data selection method proposed in
(Moore and Lewis, 2010)
, and its extension to bitexts
(Axelrod et al., 2011). By these means we selected a
subset of 418M words out of more than 2G words for
language modeling and a subset of 348M out of 850M
words for training the RNN Encoder–Decoder. We used the test sets newstest2012 and newstest2013 for data selection and weight tuning with MERT, and newstest2014 as our test set. Each set has more than 70 thousand words and a single reference translation.
For training the neural networks, including the pro-
posed RNN Encoder–Decoder, we limited the source
and target vocabulary to the most frequent 15,000
words for both English and French. This covers approximately 93% of the dataset. All the out-of-vocabulary words were mapped to a special token ([UNK]).

Table 1: BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, where we penalize the number of words unknown to the neural networks.

Models             dev     test
Baseline           27.63   29.33
CSLM               28.33   29.58
RNN                28.48   29.96
CSLM + RNN         28.60   30.64
CSLM + RNN + WP    28.93   31.18
The baseline phrase-based SMT system was built
using Moses with default settings. The phrase table was
created using only the 2% highest scoring sentences
of the full dataset, according to Axelrod’s method (a
total of 15M French words). This was done in order to
keep decoding time with the CSLM reasonable. This
system achieves BLEU scores of 27.63 and 29.33 on the development and test sets, respectively (see Table 1).
4.1.1 RNN Encoder–Decoder
The RNN Encoder–Decoder used in the experiment
had 1000 hidden units with the proposed gates at
the encoder and at the decoder. The input matrix
between each input symbol $x_{\langle t \rangle}$ and the hidden unit
is approximated with two lower-rank matrices, and
the output matrix is approximated similarly. We
used rank-100 matrices, equivalent to learning an
embedding of dimension 100 for each word. The
activation function used for $\tilde{h}$ in Eq. (8) is a hyperbolic
tangent function. The computation from the hidden
state in the decoder to the output is implemented as
a deep neural network (Pascanu et al., 2014) with a
single intermediate layer having 500 maxout units each
pooling 2 inputs (Goodfellow et al., 2013).
All the weight parameters in the RNN Encoder–Decoder were initialized by sampling from an isotropic zero-mean (white) Gaussian distribution with its standard deviation fixed to 0.01, except for the recurrent weight parameters. For the recurrent weight matrices, we first sampled from a white Gaussian distribution and used its left singular vectors matrix multiplied with a small constant (0.01), following (Saxe et al., 2014).

We used Adadelta to train the RNN Encoder–Decoder with hyperparameters $\epsilon = 10^{-6}$ and $\rho = 0.95$ (Zeiler, 2012). At each update, we used 64 randomly selected phrase pairs from a phrase table (which was created from 348M words). The model was trained for approximately three days.
Details of the architecture used in the experiments are
explained in more depth in the supplementary material.
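
The low-rank approximation mentioned above can be read as a 100-dimensional word embedding followed by a projection; the sketch below (parameter names and the initialization scale are our own assumptions) shows the factorization and the resulting reduction in parameters for a 15,000-word vocabulary and 1000 hidden units.

```python
import numpy as np

# Illustrative shapes only (vocabulary size and hidden size as in Sec. 4.1.1).
K, n_hidden, rank = 15000, 1000, 100

rng = np.random.default_rng(0)
# The full input matrix would be K x n_hidden; instead use two low-rank factors.
E = rng.normal(scale=0.01, size=(K, rank))        # per-word 100-dim embedding
P = rng.normal(scale=0.01, size=(rank, n_hidden)) # projection into hidden space

def embed(word_id):
    """Low-rank replacement for multiplying a one-hot vector by a K x 1000
    matrix: look up the rank-100 embedding, then project."""
    return E[word_id] @ P

print(embed(42).shape)                  # (1000,)
full_params = K * n_hidden
factored_params = K * rank + rank * n_hidden
print(full_params, factored_params)     # 15,000,000 vs. 1,600,000
```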
4.1.2 Neural Language Model
In order to assess the effectiveness of scoring phrase
pairs with the proposed RNN Encoder–Decoder, we
also tried a more traditional approach of using a
neural network for learning a target language model
(CSLM) (Schwenk, 2007). In particular, the comparison
between the SMT system using CSLM and that using
the proposed approach of phrase scoring by RNN
Encoder–Decoder will clarify whether the contributions
from multiple neural networks in different parts of the
SMT system add up or are redundant.
We trained the CSLM model on 7-grams from the
target corpus. Each input word was projected into the embedding space $\mathbb{R}^{512}$, and the embeddings were concatenated to form a 3072-dimensional vector. The concatenated vector was fed through two rectified layers (of size 1536 and 1024) (Glorot et al., 2011). The output layer was a simple softmax layer (see Eq. (2)). All the weight parameters were initialized uniformly between $-0.01$ and $0.01$, and the model was trained
until the validation perplexity did not improve for 10
epochs. After training, the language model achieved
a perplexity of 45.80. The validation set was a random
selection of 0.1% of the corpus. The model was used to
score partial translations during the decoding process,
which generally leads to higher gains in BLEU score
than n-best list rescoring (Vaswani et al., 2013).
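
A sketch of the CSLM forward pass as described above (our own code, with invented parameter names and biases omitted): six context words are embedded in R^512, concatenated into a 3072-dimensional vector, passed through rectified layers of 1536 and 1024 units, and a softmax over the vocabulary predicts the seventh word.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def cslm_forward(context_ids, E, W1, W2, W_out):
    """7-gram CSLM as described in Sec. 4.1.2: embed six context words,
    concatenate (6 * 512 = 3072), apply two rectified layers, softmax."""
    x = np.concatenate([E[i] for i in context_ids])   # (3072,)
    h1 = relu(W1 @ x)                                  # (1536,)
    h2 = relu(W2 @ h1)                                 # (1024,)
    return softmax(W_out @ h2)                         # next-word distribution

# toy usage with a tiny vocabulary so the example runs quickly
V, d = 50, 512
rng = np.random.default_rng(0)
E = rng.uniform(-0.01, 0.01, size=(V, d))
W1 = rng.uniform(-0.01, 0.01, size=(1536, 6 * d))
W2 = rng.uniform(-0.01, 0.01, size=(1024, 1536))
W_out = rng.uniform(-0.01, 0.01, size=(V, 1024))
p = cslm_forward([3, 1, 4, 1, 5, 9], E, W1, W2, W_out)
print(p.shape, p.sum())   # (50,) 1.0
```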
To address the computational complexity of using a CSLM in the decoder, a buffer was used to aggregate n-grams during the stack-search performed by the decoder. Only when the buffer is full, or a stack is about to be pruned, are the n-grams scored by the CSLM. This allows us to perform fast
matrix-matrix multiplication on GPU using Theano
(Bergstra et al., 2010; Bastien et al., 2012). This
approach results in a significant speedup compared to
scoring each n-gram in isolation.
4.2 Quantitative Analysis
We tried the following combinations:
1. Baseline configuration
2. Baseline + CSLM
3. Baseline + RNN
4. Baseline + CSLM + RNN
5. Baseline + CSLM + RNN + Word penalty
The results are presented in Table 1. As expected,
adding features computed by neural networks con-
sistently improves the performance over the baseline
performance. Noticeably, the phrase pair scores
computed by the proposed RNN Encoder–Decoder
(RNN) were able to improve more, compared to the
more traditional approach of having only a target
language model (CSLM). This is a significant improve-
ment considering that the additional computational
complexity induced by having simply one additional
phrase pair score is minimal when tuning the decoder.
Table 2: The top scoring target phrases for a small set of source phrases according to the translation model (direct translation probability) and by the RNN Encoder–Decoder. Source phrases were randomly selected from phrases with 4 or more words.

(a) Long, frequent source phrases

Source: at the end of the
  Translation Model: [la fin de l'] [étaient à la fin de la] [équipement à la fin du]
  RNN Encoder–Decoder: [à la fin du] [à la fin de la] [à la suite de la]

Source: for the first time
  Translation Model: [taient pour la première fois] [lection , pour la première fois] [crit rapporte , pour la première fois .]
  RNN Encoder–Decoder: [pour la première fois] [pour une première fois] [pour la première]

Source: in the United States and
  Translation Model: [été constatées aux États-Unis et en] [été constatées aux États-Unis et] [à la fois des Etats-Unis et]
  RNN Encoder–Decoder: [aux États-Unis et] [des États-Unis et] [des États-Unis et de]

Source: , as well as
  Translation Model: [verres et aussi] [sa carrière ( tout comme Baresi] [récipients de capacités diverses , ainsi que]
  RNN Encoder–Decoder: [, ainsi que] [, ainsi qu'] [ainsi que]

Source: one of the most
  Translation Model: [une du plus] [une des émissions les plus] [une des vies nocturnes des plus]
  RNN Encoder–Decoder: [l' une des] [l' un des] [l' un des plus]

(b) Long, rare source phrases

Source: , Minister of Communications and Transport
  Translation Model: [aux communications et aux transports : deux] [aux communications et aux transports :] [aux communications et aux transports]
  RNN Encoder–Decoder: [aux communications et aux transports] [Secrétaire aux communications et aux transports] [aux communications et aux transports :]

Source: did not comply with the
  Translation Model: [ne se sont pas conformés à ces] [n' ont pas respecté les règles] [n' ont pas respecté les]
  RNN Encoder–Decoder: [n' ont pas respecté les] [n' ont pas respecté les règles] [vis-à-vis]

Source: parts of the world .
  Translation Model: [régions du monde .] [parties du monde .] [pays du monde .]
  RNN Encoder–Decoder: [parties du monde .] [pays du monde .] [.]

Source: the past few days .
  Translation Model: [ponte au cours des derniers jours .] [cours des tout derniers jours .] [au cours des derniers jours .]
  RNN Encoder–Decoder: [ces derniers jours .] [au cours des derniers jours .] [ponte au cours des derniers jours .]

Source: on Friday and Saturday
  Translation Model: [" À 18H] [les vendredis et samedis] [, vendredi et samedi ,]
  RNN Encoder–Decoder: [vendredi et samedi] [les vendredis et samedis] [, vendredi et samedi]
Table 3: Samples generated from the RNN Encoder–Decoder for each source phrase used in Table 2. We show the top-5 target phrases out of 50 samples. They are sorted by the RNN Encoder–Decoder scores.

(a) Long, frequent source phrases

Source: at the end of the
  Samples: [à la suite de la] [de la fin de la] [à la fin de l'] (×2) [, à la fin du]

Source: for the first time
  Samples: [pour la première fois] (×4) [de la première fois]

Source: in the United States and
  Samples: [dans les pays membres et] [dans les pays voisins et] [dans les États Unis et] [dans les Etats uni et] [dans les UNK et]

Source: , as well as
  Samples: [, ainsi que] (×2) [, ainsi que sur] [UNK , ainsi que les] [, ainsi qu']

Source: one of the most
  Samples: [de l' une des plus] [l' un des plus] [, un des] [sous l' une des] [le plus grand nombre]

(b) Long, rare source phrases

Source: , Minister of Communications and Transport
  Samples: [, ministre des communications et le transport] (×5)

Source: did not comply with the
  Samples: [n' ont pas respecté] [n' étaient pas respecté] [n' étaient pas UNK] [n' UNK pas] [ne UNK pas]

Source: parts of the world .
  Samples: [des parties du monde .] (×4) [.]

Source: the past few days .
  Samples: [les derniers jours .] [les rares jours .] [les derniers jours] [ces derniers jours .] [les sept jours .]

Source: on Friday and Saturday
  Samples: [de UNK et UNK] [UNK et UNK] [sur UNK et UNK] [de UNK et de UNK] (×2)
The best performance was achieved when we used
both CSLM and the phrase scores from the RNN
Encoder–Decoder. This suggests that the contributions
of the CSLM and the RNN Encoder–Decoder are not
too correlated and that one can expect better results by
improving each method independently. Furthermore,
we were able to improve the translation quality further
(+1.85 BLEU over baseline) by penalizing the number of words that are unknown to the neural networks (i.e. words which are not in the shortlist). We do so by simply adding the number of unknown words as an additional feature in the log-linear model in Eq. (9).³
³ To understand the effect of the penalty, consider the set of all words in the 15,000-word shortlist, SL. All words $x^i_t \notin \mathrm{SL}$ are replaced by a special token [UNK] before being scored by the neural networks. Hence, the conditional probability of any $x^i_t \notin \mathrm{SL}$ is actually given by the model as
$$p(x_t = [\mathrm{UNK}] \mid x_{<t}) = p(x_t \notin \mathrm{SL} \mid x_{<t}) = \sum_{x^j_t \notin \mathrm{SL}} p\left(x^j_t \mid x_{<t}\right) \geq p\left(x^i_t \mid x_{<t}\right),$$
where $x_{<t}$ is a shorthand notation for $x_{t-1}, \ldots, x_1$.
As a result, the probability of words not in the shortlist is always overestimated. For CSLMs this shortcoming can be addressed by using a separate back-off n-gram language model that only contains non-shortlisted words (see (Schwenk, 2007)). However, since there is no direct equivalent of this approach for the RNN Encoder–Decoder, we opt for introducing a word penalty instead, which counteracts the word probability overestimation.
Figure 3: The visualization of phrase pairs according to their scores (log-probabilities) by the RNN Encoder–Decoder and the translation model. (Axes: RNN Encoder–Decoder scores (log) against TM scores (log).)
Additionally, we performed an experiment showing
that the performance improvement is not solely due
to a larger set of phrase pairs used to train the RNN
Encoder–Decoder. In this experiment, we tuned the
SMT decoder with the full bitext instead of the reduced
one (348M words). With baseline + RNN, in this
case, we obtained 31.20 and 33.89 BLEU scores on
the development and test sets, while the baseline
scores are 30.62 and 33.30 respectively. This clearly
suggests that the proposed approach is applicable to
and improves a large-scale SMT system as well.
4.3 Qualitative Analysis
In order to understand where the performance
improvement comes from, we analyze the phrase
pair scores computed by the RNN Encoder–Decoder
against $p(\mathbf{f} \mid \mathbf{e})$, the so-called inverse phrase
translation probability from the translation model.
Since the existing translation model relies solely on
the statistics of the phrase pairs in the corpus, we
expect its scores to be better estimated for the frequent
phrases but badly estimated for rare phrases. Also, as
we mentioned earlier in Sec.
3.1
, we further expect the
RNN Encoder–Decoder which was trained without
any frequency information to score the phrase pairs
based rather on the linguistic regularities than on the
statistics of their occurrences in the corpus.
We focus on those pairs whose source phrase is long
(more than 3 words per source phrase) and frequent. For
each such source phrase, we look at the target phrases
that have been scored high either by the translation
probability $p(\mathbf{f} \mid \mathbf{e})$ or by the RNN Encoder–Decoder.
Similarly, we perform the same procedure with those
pairs whose source phrase is long but rare in the corpus.
Table 2 lists the top-3 target phrases per source
phrase favored either by the translation model or by
the RNN Encoder–Decoder. The source phrases were
randomly chosen among long ones having more than
4 or 5 words.
In most cases, the choices of the target phrases
by the RNN Encoder–Decoder are closer to actual
or literal translations. We can observe that the RNN
Encoder–Decoder prefers shorter phrases in general.
Interestingly, many phrase pairs were scored
similarly by both the translation model and the RNN
Encoder–Decoder, but there were as many other phrase pairs that were scored radically differently (see Fig. 3).
This could arise from the proposed approach of training
the RNN Encoder–Decoder on a set of unique phrase
pairs, discouraging the RNN Encoder–Decoder from
learning simply the frequencies of the phrase pairs
from the corpus, as explained earlier.
Furthermore, in Table 3, we show for each of the
source phrases in Table 2, the generated samples from
the RNN Encoder–Decoder. For each source phrase,
we generated 50 samples and show the top-five phrases according to their scores. We can see that the RNN
Encoder–Decoder is able to propose well-formed target
phrases without looking at the actual phrase table.
Importantly, the generated phrases do not overlap
completely with the target phrases from the phrase table.
This encourages us to further investigate the possibility
of replacing the whole or a part of the phrase table with
the proposed RNN Encoder–Decoder in the future.
4.4 Word and Phrase Representations
Since the proposed RNN Encoder–Decoder is not
specifically designed only for the task of machine
translation, here we briefly look at the properties of
the trained model.
It has been known for some time that continuous
space language models using neural networks are able
to learn semantically meaningful embeddings (See, e.g.,
(Bengio et al., 2003; Mikolov et al., 2013)). Since
the proposed RNN Encoder–Decoder also projects
to and maps back from a sequence of words into a
continuous space vector, we expect to see a similar
property with the proposed model as well.
The left plot in Fig. 4 shows the 2–D embedding
of the words using the word embedding matrix
learned by the RNN Encoder–Decoder. The projection
was done by the recently proposed Barnes-Hut-SNE (van der Maaten, 2013). We can clearly see that
semantically similar words are clustered with each
other (see the zoomed-in plots in Fig. 4).
The proposed RNN Encoder–Decoder naturally
generates a continuous-space representation of a
phrase. The representation (cin Fig. 1) in this case
is a 1000-dimensional vector. Similarly to the word
representations, we visualize the representations of the
phrases that consist of four or more words using the
Barnes-Hut-SNE in Fig. 5.
From the visualization, it is clear that the RNN
Encoder–Decoder captures both semantic and syntactic
structures of the phrases. For instance, in the top-right
plot, all the phrases are about the percentage of a
Figure 4: 2–D embedding of the learned word representation. The left one shows the full embedding space, while
the right one shows a zoomed-in view of one region (color–coded). For more plots, see the supplementary material.
Figure 5: 2–D embedding of the learned phrase representation. The top left one shows the full representation
space (5000 randomly selected points), while the other three figures show the zoomed-in view of specific regions
(color–coded).
following object, and importantly, the model is able
to correctly map various different ways to phrase %
into a similar representation (e.g., percent, per cent,
%). The bottom-left plot shows phrases that are related
to date or duration and at the same time are grouped
by syntactic similarities among themselves. A similar
trend can also be observed in the last plot.
5 Conclusion
In this paper, we proposed a new neural network archi-
tecture, called an RNN Encoder–Decoder that is able
to learn the mapping from a sequence of an arbitrary
length to another sequence, possibly from a different
set, of an arbitrary length. The proposed RNN Encoder–Decoder is able to either score a pair of sequences (in
terms of a conditional probability) or generate a target
sequence given a source sequence. Along with the
new architecture, we proposed a novel hidden unit that
includes a reset gate and an update gate that adaptively
control how much each hidden unit remembers or
forgets while reading/generating a sequence.
We evaluated the proposed model with the task
of statistical machine translation, where we used the
RNN Encoder–Decoder to score each phrase pair in the
phrase table. Qualitatively, we were able to show that
the new model is able to capture linguistic regularities
in the phrase pairs well and also that the RNN Encoder–Decoder is able to propose well-formed target phrases.
The scores by the RNN Encoder–Decoder were
found to improve the overall translation performance in
terms of BLEU scores. Also, we found that the contribu-
tion by the RNN Encoder–Decoder is rather orthogonal
to the existing approach of using neural networks in
the SMT system, so that we can improve further the
performance by using, for instance, the RNN Encoder–
Decoder and the neural net language model together.
Our qualitative analysis of the trained model shows
that it indeed captures the linguistic regularities at multiple levels, i.e., at the word level as well as at the phrase level. This suggests that there may be more natural
language related applications that may benefit from
the proposed RNN Encoder–Decoder.
The proposed architecture has large potential in
further improvement and analysis. One approach
that was not investigated here is to replace the whole,
or a part of the phrase table by letting the RNN
Encoder–Decoder propose target phrases. Also, noting
that the proposed model is not limited to being used
with written language, it will be an important future
research to apply the proposed architecture to other
applications such as speech transcription.
Acknowledgments
The authors would like to acknowledge the support
of the following agencies for research funding and
computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR.
References
[Axelrod et al.2011]
Amittai Axelrod, Xiaodong He, and Jianfeng
Gao. 2011. Domain adaptation via pseudo in-domain data
selection. In Proceedings of the ACL Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages
355–362. Association for Computational Linguistics.
[Bastien et al.2012] Frédéric Bastien, Pascal Lamblin, Razvan
Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron,
Nicolas Bouchard, and Yoshua Bengio. 2012. Theano:
new features and speed improvements. Deep Learning and
Unsupervised Feature Learning NIPS 2012 Workshop.
[Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal
Vincent, and Christian Janvin. 2003. A neural probabilistic
language model. J. Mach. Learn. Res., 3:1137–1155, March.
[Bengio et al.2013]
Y. Bengio, N. Boulanger-Lewandowski, and
R. Pascanu. 2013. Advances in optimizing recurrent networks.
In Proceedings of the 38th International Conference on
Acoustics, Speech, and Signal Processing (ICASSP 2013), May.
[Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frédéric
Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume
Desjardins, Joseph Turian, David Warde-Farley, and Yoshua
Bengio. 2010. Theano: a CPU and GPU math expression
compiler. In Proceedings of the Python for Scientific Computing
Conference (SciPy), June. Oral Presentation.
[Chandar et al.2014] Sarath Chandar, Stanislas Lauly, Hugo
Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas
Raykar, and Amrita Saha. 2014. An autoencoder approach to
learning bilingual word representations. arXiv:
1402.1454
[cs.CL], February.
[Dahl et al.2012] George E. Dahl, Dong Yu, Li Deng, and Alex
Acero. 2012. Context-dependent pre-trained deep neural net-
works for large vocabulary speech recognition. IEEE Transac-
tions on Audio, Speech, and Language Processing, 20(1):33–42.
[Glorot et al.2011] X. Glorot, A. Bordes, and Y. Bengio. 2011.
Deep sparse rectifier neural networks. In AISTATS.
[Goodfellow et al.2013]
Ian J. Goodfellow, David Warde-Farley,
Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013.
Maxout networks. In ICML’2013.
[Graves2012]
Alex Graves. 2012. Supervised Sequence Labelling
with Recurrent Neural Networks. Studies in Computational
Intelligence. Springer.
[Hochreiter and Schmidhuber1997]
S. Hochreiter and J. Schmid-
huber. 1997. Long short-term memory. Neural Computation,
9(8):1735–1780.
[Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil
Blunsom. 2013. Two recurrent continuous translation models.
In Proceedings of the ACL Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages 1700–1709.
Association for Computational Linguistics.
[Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel
Marcu. 2003. Statistical phrase-based translation. In
Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on
Human Language Technology - Volume 1, NAACL ’03, pages
48–54, Stroudsburg, PA, USA. Association for Computational
Linguistics.
[Koehn2005] P. Koehn. 2005. Europarl: A parallel corpus for
statistical machine translation. In Machine Translation Summit
X, pages 79–86, Phuket, Thailand.
[Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and
Geoffrey Hinton. 2012. ImageNet classification with deep
convolutional neural networks. In Advances in Neural
Information Processing Systems 25 (NIPS’2012).
[Marcu and Wong2002]
Daniel Marcu and William Wong. 2002.
A phrase-based, joint probability model for statistical machine
translation. In Proceedings of the ACL-02 Conference on
Empirical Methods in Natural Language Processing - Volume
10, EMNLP ’02, pages 133–139, Stroudsburg, PA, USA.
Association for Computational Linguistics.
[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen,
Greg Corrado, and Jeff Dean. 2013. Distributed representations
of words and phrases and their compositionality. In Advances in
Neural Information Processing Systems 26, pages 3111–3119.
[Moore and Lewis2010] Robert C. Moore and William Lewis.
2010. Intelligent selection of language model training data.
In Proceedings of the ACL 2010 Conference Short Papers,
ACLShort ’10, pages 220–224, Stroudsburg, PA, USA.
Association for Computational Linguistics.
[Pascanu et al.2014] R. Pascanu, C. Gulcehre, K. Cho, and
Y. Bengio. 2014. How to construct deep recurrent neural
networks. In Proceedings of the Second International
Conference on Learning Representations (ICLR 2014), April.
[Saxe et al.2014] Andrew M. Saxe, James L. McClelland, and
Surya Ganguli. 2014. Exact solutions to the nonlinear dynamics
of learning in deep linear neural networks. In Proceedings of the
Second International Conference on Learning Representations
(ICLR 2014), April.
[Schwenk et al.2006] Holger Schwenk, Marta R. Costa-Jussà, and José A. R. Fonollosa. 2006. Continuous space language models for the IWSLT 2006 task. In IWSLT, pages 166–173.
[Schwenk2007] Holger Schwenk. 2007. Continuous space
language models. Comput. Speech Lang., 21(3):492–518, July.
[Schwenk2012]
Holger Schwenk. 2012. Continuous space trans-
lation models for phrase-based statistical machine translation.
In Martin Kay and Christian Boitet, editors, Proceedings of the
24th International Conference on Computational Linguistics
(COLING), pages 1071–1080. Indian Institute of Technology
Bombay.
[Socher et al.2011] Richard Socher, Eric H. Huang, Jeffrey
Pennington, Andrew Y. Ng, and Christopher D. Manning.
2011. Dynamic pooling and unfolding recursive autoencoders
for paraphrase detection. In Advances in Neural Information
Processing Systems 24.
[Son et al.2012] Le Hai Son, Alexandre Allauzen, and François
Yvon. 2012. Continuous space translation models with neural
networks. In Proceedings of the 2012 Conference of the
North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, NAACL HLT
’12, pages 39–48, Stroudsburg, PA, USA. Association for
Computational Linguistics.
[van der Maaten2013] Laurens van der Maaten. 2013. Barnes-Hut-SNE. In Proceedings of the First International Conference
on Learning Representations (ICLR 2013), May.
[Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, Victoria
Fossum, and David Chiang. 2013. Decoding with large-scale
neural language models improves translation. Proceedings of
the Conference on Empirical Methods in Natural Language
Processing, pages 1387–1392.
[Zeiler2012]
Matthew D. Zeiler. 2012. ADADELTA: an adaptive
learning rate method. Technical report, arXiv 1212.5701.
[Zou et al.2013]
Will Y. Zou, Richard Socher, Daniel M. Cer, and
Christopher D. Manning. 2013. Bilingual word embeddings
for phrase-based machine translation. In Proceedings of the
ACL Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1393–1398. Association for
Computational Linguistics.
A RNN Encoder–Decoder
In this document, we describe in detail the architecture of the RNN Encoder–Decoder used in the experiments.
Let us denote a source phrase by $X = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$ and a target phrase by $Y = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M)$. Each phrase is a sequence of $K$-dimensional one-hot vectors, such that only one element of the vector is 1 and all the others are 0. The index of the active (1) element indicates the word represented by the vector.
A.1 Encoder
Each word of the source phrase is embedded in a 100-dimensional vector space: $e(\mathbf{x}_i) \in \mathbb{R}^{100}$. $e(\mathbf{x})$ is used in Sec. 4.4 to visualize the words.
The hidden state of the encoder consists of 1000 hidden units, and each one of them at time $t$ is computed by
$$h_j^{\langle t \rangle} = z_j h_j^{\langle t-1 \rangle} + (1 - z_j) \tilde{h}_j^{\langle t \rangle},$$
where
$$\tilde{h}_j^{\langle t \rangle} = \tanh\left([\mathbf{W} e(\mathbf{x}_t)]_j + r_j [\mathbf{U} \mathbf{h}_{\langle t-1 \rangle}]_j\right),$$
$$z_j = \sigma\left([\mathbf{W}_z e(\mathbf{x}_t)]_j + [\mathbf{U}_z \mathbf{h}_{\langle t-1 \rangle}]_j\right),$$
$$r_j = \sigma\left([\mathbf{W}_r e(\mathbf{x}_t)]_j + [\mathbf{U}_r \mathbf{h}_{\langle t-1 \rangle}]_j\right).$$
$\sigma$ is a logistic sigmoid function. To make the equations uncluttered, we omit biases. The initial hidden state $h_j^{\langle 0 \rangle}$ is fixed to 0.

Once the hidden state at the $N$-th step (the end of the source phrase) is computed, the representation of the source phrase $\mathbf{c}$ is
$$\mathbf{c} = \tanh\left(\mathbf{V} \mathbf{h}^{\langle N \rangle}\right).$$
Also, we collect the average word embeddings of the source phrase such that
$$\mathbf{m}_x = \frac{1}{N} \sum_{t=1}^{N} e(\mathbf{x}_t).$$
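
A NumPy sketch (ours; weight names are assumptions and biases are omitted as in the text) of the encoder just described: the gated hidden state is updated word by word over the embedded source phrase, the final state is squashed through V to give the summary c, and the mean word embedding m_x is collected alongside.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode_phrase(word_ids, Emb, W, U, Wz, Uz, Wr, Ur, V):
    """Encoder of Sec. A.1: run the gated unit over the embedded source
    words, then return the phrase summary c = tanh(V h^<N>) together with
    the mean word embedding m_x. Biases are omitted."""
    n_hid = U.shape[0]
    h = np.zeros(n_hid)                        # h^<0> fixed to 0
    for i in word_ids:
        e = Emb[i]                             # e(x_t), the word embedding
        r = sigmoid(Wr @ e + Ur @ h)           # reset gate
        z = sigmoid(Wz @ e + Uz @ h)           # update gate
        h_tilde = np.tanh(W @ e + r * (U @ h)) # candidate state
        h = z * h + (1.0 - z) * h_tilde
    c = np.tanh(V @ h)                         # phrase representation c
    m_x = Emb[word_ids].mean(axis=0)           # average word embedding
    return c, m_x

# toy usage: vocabulary of 20 words, 100-dim embeddings, 1000 hidden units
rng = np.random.default_rng(0)
K, d, n = 20, 100, 1000
Emb = rng.normal(scale=0.01, size=(K, d))
W, Wz, Wr = (rng.normal(scale=0.01, size=(n, d)) for _ in range(3))
U, Uz, Ur = (rng.normal(scale=0.01, size=(n, n)) for _ in range(3))
V = rng.normal(scale=0.01, size=(n, n))
c, m_x = encode_phrase([4, 7, 2, 11], Emb, W, U, Wz, Uz, Wr, Ur, V)
print(c.shape, m_x.shape)   # (1000,) (100,)
```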
A.1.1 Decoder
The decoder starts by initializing the hidden state with
$$\mathbf{h}'^{\langle 0 \rangle} = \tanh\left(\mathbf{V}' \mathbf{c}\right),$$
where we will use $\cdot'$ to distinguish parameters of the decoder from those of the encoder.

The hidden state at time $t$ of the decoder is computed by
$$h_j'^{\langle t \rangle} = z_j' h_j'^{\langle t-1 \rangle} + (1 - z_j') \tilde{h}_j'^{\langle t \rangle},$$
where
$$\tilde{h}_j'^{\langle t \rangle} = \tanh\left([\mathbf{W}' e(\mathbf{y}_{t-1})]_j + r_j' [\mathbf{U}' \mathbf{h}'_{\langle t-1 \rangle} + \mathbf{C} \mathbf{c}]_j\right),$$
$$z_j' = \sigma\left([\mathbf{W}_z' e(\mathbf{y}_{t-1})]_j + [\mathbf{U}_z' \mathbf{h}'_{\langle t-1 \rangle}]_j + [\mathbf{C}_z \mathbf{c}]_j\right),$$
$$r_j' = \sigma\left([\mathbf{W}_r' e(\mathbf{y}_{t-1})]_j + [\mathbf{U}_r' \mathbf{h}'_{\langle t-1 \rangle}]_j + [\mathbf{C}_r \mathbf{c}]_j\right),$$
and $e(\mathbf{y}_0)$ is an all-zero vector. Similarly to the case of the encoder, $e(\mathbf{y})$ is an embedding of a target word.

Unlike the encoder, which simply encodes the source phrase, the decoder is learned to generate a target phrase. At each time $t$, the decoder computes the probability of generating the $j$-th word by
$$p(y_{t,j} = 1 \mid \mathbf{y}_{t-1}, \ldots, \mathbf{y}_1, X) = \frac{\exp\left(\mathbf{g}_j \mathbf{s}_{\langle t \rangle}\right)}{\sum_{j'=1}^{K} \exp\left(\mathbf{g}_{j'} \mathbf{s}_{\langle t \rangle}\right)},$$
where the $i$-th element of $\mathbf{s}_{\langle t \rangle}$ is
$$s_i^{\langle t \rangle} = \max\left\{ s_{2i-1}'^{\langle t \rangle},\; s_{2i}'^{\langle t \rangle} \right\}$$
and
$$\mathbf{s}'^{\langle t \rangle} = \mathbf{O}_h \mathbf{h}'^{\langle t \rangle} + \mathbf{O}_y \mathbf{y}_{t-1} + \mathbf{O}_c \mathbf{c} + \mathbf{O}_w \mathbf{m}_x.$$
In short, $s_i^{\langle t \rangle}$ is a so-called maxout unit.

For computational efficiency, instead of a single-matrix output weight $\mathbf{G}$, we use a product of two matrices such that
$$\mathbf{G} = \mathbf{G}_l \mathbf{G}_r,$$
where $\mathbf{G}_l \in \mathbb{R}^{K \times 100}$ and $\mathbf{G}_r \in \mathbb{R}^{100 \times 1000}$.
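
A sketch (ours, with toy dimensions that only loosely follow those quoted in the text) of the decoder's output computation: the pre-activation s'_⟨t⟩ combines the decoder state, the previous target word embedding, the context c and the mean source embedding m_x; maxout pooling over non-overlapping pairs gives s_⟨t⟩, and a softmax through the factorized output matrix G = G_l G_r yields the next-word distribution.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def decoder_output(h_dec, y_prev_emb, c, m_x, Oh, Oy, Oc, Ow, Gl, Gr):
    """Output layer of the decoder (Sec. A.1.1): pre-activation s',
    maxout pooling over non-overlapping pairs, then a softmax through the
    factorized output matrix G = Gl Gr."""
    s_prime = Oh @ h_dec + Oy @ y_prev_emb + Oc @ c + Ow @ m_x   # (2 * n_s,)
    s = s_prime.reshape(-1, 2).max(axis=1)                       # maxout, (n_s,)
    logits = Gl @ (Gr @ s)                                       # (K,)
    return softmax(logits)

# toy shapes: 500 maxout units pooling 2 inputs each -> s' has 1000 entries
rng = np.random.default_rng(0)
K, n_dec, d, n_s = 30, 1000, 100, 500
Oh = rng.normal(scale=0.01, size=(2 * n_s, n_dec))
Oy = rng.normal(scale=0.01, size=(2 * n_s, d))
Oc = rng.normal(scale=0.01, size=(2 * n_s, n_dec))
Ow = rng.normal(scale=0.01, size=(2 * n_s, d))
Gl = rng.normal(scale=0.01, size=(K, 100))
Gr = rng.normal(scale=0.01, size=(100, n_s))
p = decoder_output(rng.normal(size=n_dec), rng.normal(size=d),
                   rng.normal(size=n_dec), rng.normal(size=d),
                   Oh, Oy, Oc, Ow, Gl, Gr)
print(p.shape, round(p.sum(), 6))   # (30,) 1.0
```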
B Word and Phrase Representations
Here, we show enlarged plots of the word and phrase representations in Figs. 4–5.
Figure 6: 2–D embedding of the learned word representation. The top left one shows the full embedding space, while the other three figures show the zoomed-in view
of specific regions (color–coded).
Figure 7: 2–D embedding of the learned phrase representation. The top left one shows the full representation space (5000 randomly selected points), while the other three
figures show the zoomed-in view of specific regions (color–coded).