Neural Machine Translation for English to Hindi

Sandeep Saini
Department of Electronics and Communication Engineering
The LNM Institute of Information Technology
Jaipur, India
Vineet Sahula
Department of Electronics and Communication Engineering
Malviya National Institute of Technology
Jaipur, India
Abstract—Language translation is one task in which machines
still clearly lag behind the cognitive powers of human beings.
Statistical Machine Translation (SMT) is one of the conventional
ways of solving the problem of machine translation. This method
requires huge datasets and performs well on language pairs with
similar grammatical structure. In recent years, Neural Machine
Translation (NMT) has emerged as an alternative way of addressing
the same issue. In this paper, we explore different configurations
for setting up a Neural Machine Translation system for the Indian
language Hindi. We have experimented with eight different
architecture combinations of NMT for English to Hindi and compared
our results with conventional machine translation techniques. We
have also observed in this work that NMT requires much less
training data and exhibits satisfactory translation even with a
few thousand training sentences.
Keywords—Cognitive linguistics, Neural Machine Translation,
Long Short-Term Memory, Indian Languages
Machine translation was one of the earliest tasks taken up
by computer scientists, and research in this field has been
going on for the last 50 years. Over these years, linguists and
computer engineers have worked together and made remarkable
progress towards the current state of machine translation. The
machine translation task was initially handled with dictionary
matching techniques and was gradually upgraded to rule-based
approaches. During the last two decades, most machine
translation systems have been based on the statistical machine
translation approach. In these systems [1], [2], [3], the
basic units of the translation process are phrases and sentences.
These phrases can be composed of one or more words. Most
conventional translation systems use Bayesian inference to
estimate translation probabilities for pairs of phrases. In
these pairs, one phrase belongs to the source language and the
other to the target language. Since the probability of any given
phrase pair is very low, predicting the correct pair is very
difficult in these systems. Increasing the size of the dataset
was one of the most feasible ways to improve the probability
estimate of a given pair of phrases. Given the limitations of
conventional machine translation systems [4] and their
dependence on huge datasets, there is a demand for alternative
methods of machine translation.
In recent years, Google has also shifted its translation
research focus towards Neural Machine Translation (NMT).
Sutskever et al. [5] proposed a sequence-to-sequence learning
mechanism using long short-term memory (LSTM) models. Google's
neural machine translation system (GNMT), which builds on this
approach, has eight encoder layers and eight decoder layers. The
core idea of NMT is to use deep learning and representation
learning. NMT models require only a fraction of the memory
needed by traditional statistical machine translation (SMT)
models. Furthermore, unlike conventional translation systems,
all parts of the neural translation model are trained jointly
(end-to-end) to maximize the translation
performance [5], [6]. In an NMT system, a bidirectional
recurrent neural network (RNN), known as an encoder, is used
by the neural network to encode a source sentence for a second
RNN, known as a decoder, that is used to predict words in
the target language. This encoder-decoder architecture can be
designed with multiple layers to increase the efficiency of the
Neural machine translation normally requires a lot of
computing power, so it is a good technique only when enough
time or computing resources are available. Another issue with
older NMT systems was inconsistency in handling rare words.
Since these inputs were sparsely represented in the network,
learning and inference were not efficient. By using LSTM models
with eight encoder and decoder layers, this system removes
these errors to a large extent. The third major issue with NMT
was that the system tended to forget earlier words in long
sentences. This issue is also addressed by the eight-layer
approach. Since 2014, the work of Sutskever et al. has
inspired many researchers, and NMT is developing as a good
alternative to conventional machine translation techniques.
Google has deployed GNMT on an eight-layer encoder-
decoder architecture. This system requires huge GPU compu-
tations for training the neural network. In this work, we explore
a simplified and shallow network that can be trained using a
regular GPU as well. We have explored different architectures
of the shallow network and shown satisfactory results for the
Indian language.
Neural Machine Translation (NMT) is a machine translation
system that uses an artificial neural network to increase the
fluency and accuracy of the machine translation process. NMT
is based on a simple encoder-decoder network. The type of
neural network used in NMT is the Recurrent Neural Network
(RNN) [7]. The reason for selecting an RNN for the task is its
basic architecture: an RNN has a cyclic structure, which makes
learning repeated sequences much easier than in other networks.
An RNN can be unrolled to store the sentences as sequences in
both the source and the target language. A typical structure of
an RNN is shown in Fig. 1. It shows how a single layer can be
unrolled into multiple layers and how information from the
previous time step can be stored in a single cell. RNNs can
easily map sequences to sequences when the alignment between
inputs and outputs is known ahead of time.
Fig. 1. Typical structure of an RNN
Let X and Y be the source and target sentences of a pair,
respectively. The encoder RNN converts the source sentence
$x_1, x_2, x_3, \ldots, x_n$ into vectors of fixed dimension. The
decoder produces one word at a time as its output, using the
conditional probability

$$P(Y \mid X) = P(Y \mid X_1, X_2, X_3, \ldots, X_m) \tag{1}$$

Here $X_1, X_2, \ldots, X_m$ are the fixed-size vectors produced by
the encoder. Using the chain rule, the above equation is converted
to the equation below, in which each next word is predicted from
the symbols predicted so far and the source sentence vectors:

$$P(Y \mid X) = \prod_{i} P(y_i \mid y_0, y_1, y_2, \ldots, y_{i-1}; X_1, X_2, X_3, \ldots, X_m) \tag{2}$$

Each term in the distribution is represented by a softmax
function over all the words in the vocabulary.
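As an illustration of equation (2) and the softmax step, the following minimal Python sketch scores a target sentence by summing the log-probabilities of its words. The per-step score lists stand in for decoder outputs and are made-up values, and the names `softmax` and `sequence_log_prob` are hypothetical; this is a sketch under these assumptions, not the code used in the experiments.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_scores, target_ids):
    """Sum log P(y_i | y_<i, X) over all target positions (chain rule)."""
    log_prob = 0.0
    for scores, word_id in zip(step_scores, target_ids):
        probs = softmax(scores)  # distribution over the vocabulary
        log_prob += math.log(probs[word_id])
    return log_prob

# Hypothetical decoder scores over a 3-word vocabulary, 2 target words.
step_scores = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.5]]
target_ids = [0, 2]
print(math.exp(sequence_log_prob(step_scores, target_ids)))
```

Working in log space avoids numerical underflow when many per-word probabilities are multiplied for a long sentence.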
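Although RNNs work very well in theory, there are
problems when training them on long sentences because
RNNs suffer from the "long-term dependency problem". The reason
is that in an RNN the influence of an early hidden state on the
state at time t is computed as a product of per-step terms:

$$\frac{\partial h_t}{\partial h_1} = \frac{\partial h_t}{\partial h_{t-1}} \cdots \frac{\partial h_3}{\partial h_2} \, \frac{\partial h_2}{\partial h_1} \tag{3}$$
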
Thus, because of this multiplicative effect, the contribution
of distant words in longer sentences becomes very small, which
results in inaccuracy. This can be explained by a simple example.
If the given task is to predict the next word in a sentence using
a language model, then in a sentence like "Potholes are located
on the ____", "road" is the obvious choice to fill in the blank,
and we do not need any further context because the gap between
the relevant information and the place where it is needed is
small. However, in a sentence like "I am a cricket player and I
bat well at the ____ position", it is not as straightforward,
because from the recent information we can only deduce that the
missing word should be a position in the batting order. There
can be multiple choices (e.g., opening, middle order and lower
order). This problem arises when the gap between the relevant
information and the place where it is needed is not small, and
it is entirely possible to have gaps much bigger than in this
example. In practice, it is difficult for RNNs to learn these
dependencies. Since typical sentences in any language contain
such complex context-dependent cases, plain RNNs are poorly
suited for encoder and decoder design. To overcome the
shortcomings of RNNs, we use Long Short-Term Memory (LSTM)
models for encoding and decoding.
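The multiplicative effect can be made concrete with a small numerical sketch. It assumes a scalar recurrent cell of the form h_t = tanh(w h_{t-1} + x_t) with an arbitrarily chosen weight w = 0.9; the function name `gradient_through_time` and all values are hypothetical and serve only to show how the product of per-step derivatives shrinks with sentence length.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar cell h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # one factor of the chain-rule product
    return grad

short_gap = gradient_through_time(0.9, [0.5] * 5)   # 5 time steps
long_gap = gradient_through_time(0.9, [0.5] * 50)   # 50 time steps
print(short_gap, long_gap)
```

Each factor has magnitude below 0.9 here, so the gradient over 50 steps is many orders of magnitude smaller than over 5 steps, which is why early words have little influence on learning.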
A. Long Short-Term Memory (LSTM) model
Long Short-Term Memory (LSTM) [8] is a variation of the RNN
that is known to learn problems with long-range temporal
dependencies, so RNNs are sometimes replaced with LSTMs
in MT networks. LSTMs also have a chain-like structure, but
the structure of the repeating module differs from that of an
RNN. In place of a single neural network layer, there are four
layers in a module. These layers interact within the same module
as well as with other modules for learning. A typical structure
of an LSTM module is shown in Fig. 2.
Fig. 2. The repeating module in an LSTM contains four interacting layers
In this module, there are four layers for four different
operations in the learning process. The first is the "forget
gate" ($f_t$), which decides which part of the previous
learning should be forgotten in this layer. The next is a sigmoid
layer called the input gate layer ($i_t$), which decides which
values the system will update. The third is a tanh layer that
creates a vector of new candidate values, $\tilde{C}_t$, that
could be added to the state in the same module. Finally, the
output is decided by the fourth layer, a sigmoid output gate
($o_t$); its result is multiplied by the tanh of the cell state
to generate the state for the next modules. The corresponding
equations for all these functions are as follows:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{4}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{5}$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{6}$$
$$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t \tag{7}$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{8}$$
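The gate equations can be read as one update step of the module. The sketch below is a plain-Python illustration of such a step, not the TensorFlow or Theano code used in our system. The cell-state update and the hidden-state output follow the standard LSTM formulation [8]; the helper names (`affine`, `lstm_step`, `constant_params`) and the toy sizes are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function used by the forget, input and output gates."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM step; returns the new hidden state h_t and cell state C_t."""
    concat = h_prev + x_t  # the vector [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(params["Wf"], params["bf"], concat)]
    i_t = [sigmoid(z) for z in affine(params["Wi"], params["bi"], concat)]
    c_hat = [math.tanh(z) for z in affine(params["WC"], params["bC"], concat)]
    o_t = [sigmoid(z) for z in affine(params["Wo"], params["bo"], concat)]
    # Standard LSTM state update: keep part of the old state, add candidates.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Hypothetical toy sizes: 2 hidden units and 1 input feature.
HIDDEN, INPUTS = 2, 1

def constant_params(value):
    """Parameter set with every weight equal to `value` and zero biases."""
    params = {}
    for gate in ("f", "i", "C", "o"):
        params["W" + gate] = [[value] * (HIDDEN + INPUTS) for _ in range(HIDDEN)]
        params["b" + gate] = [0.0] * HIDDEN
    return params

h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
for x in ([1.0], [0.5], [-0.3]):  # a three-step input sequence
    h, c = lstm_step(constant_params(0.1), h, c, x)
print(h, c)
```

In a trained network the weights are learned rather than constant; the constant values are used here only so that the step can be checked by hand.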
In Fig. 3, the model reads an input sentence ABC and
produces WXYZ as the output sentence. The model stops
making predictions after generating the end-of-sentence (EOS)
token as the output.
A. Encoder and Decoder
In LSTM-based NMT, we use a bidirectional encoder [5].
This encoder is based on the idea that the output at any
time instant may depend not only on past data but also on
future data. Using this idea, the LSTM is modified to connect
two hidden layers of opposite directions to the same output.
This modified version of the LSTM is called a Bidirectional LSTM
(Bi-LSTM) [9]. Bi-LSTMs were introduced to increase the
amount of input information available to the network. Unlike
LSTMs, Bi-LSTMs have access to future input from the
current state without the need for time delays. Fig. 4 shows
the architecture of the bidirectional encoder used in our NMT
system. The encoder presented in this figure is a single-layer
encoder. In GNMT, eight-layer encoder and decoder blocks
are used to process the information. For better efficiency in
the learning process, multiple layers of LSTMs are preferred
in encoder as well as decoder designs.
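The bidirectional idea can be sketched independently of the cell type. The example below assumes a scalar tanh recurrent cell as a stand-in for the LSTM, with arbitrarily chosen weights; `run_rnn` and `bidirectional_encode` are hypothetical names. It shows only how the forward and backward passes are paired at each source position.

```python
import math

def run_rnn(inputs, w_in, w_rec):
    """Run a scalar tanh recurrent cell over `inputs`, returning every state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_encode(inputs, fwd=(0.8, 0.5), bwd=(0.6, 0.4)):
    """Pair each position's forward (past) state with its backward (future) state."""
    forward = run_rnn(inputs, *fwd)
    # Read the sentence right to left, then re-align to source order.
    backward = run_rnn(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

encoded = bidirectional_encode([0.2, -0.1, 0.7])
print(encoded)
```

The backward state at the first position already summarizes the whole remaining sentence, which is the sense in which the encoder has access to future input.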
Fig. 3. Sentence modeling in LSTM network
Fig. 4. Bidirectional Encoder design using Bi-LSTM
The decoder is designed to decode the vectors back to
target language words. We have experimented with multi-layer
decoders in the system. A typical two-layer decoder is shown
in Fig. 5.
B. Attention in the model
The attention layer is the bridging layer between the encoder
and decoder of an NMT system. There are two kinds of attention
models: global and local. The idea of a global attention model
is to consider all the hidden states of the encoder when deriving
the context vector $c_t$. In the global attention model, $a_t$,
a variable-length alignment vector whose size equals the number
of time steps on the source side, is derived by comparing the
current target hidden state $h_t$ with each source hidden state
$h_s$. The local attention model takes a different approach. It
first predicts a single aligned position $p_t$ for the current
target word. A window centered around the source position $p_t$
is then used to compute the context vector $c_t$.
In our system, we have used the local attention model. A block
diagram showing the functionality of the attention layer is shown
in Fig. 6.
Fig. 5. 2 layer decoder architecture
Fig. 6. Local attention model.
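A minimal sketch of the windowed context computation is given below. It assumes that the aligned position p_t is already available as an integer index, that alignment scores are dot products between the target and source states, and that the window half-width D is chosen by hand; the scoring function, the name `local_attention` and the example vectors are illustrative assumptions rather than details of our system.

```python
import math

def local_attention(target_state, source_states, p_t, half_width):
    """Context vector from the source states inside [p_t - D, p_t + D]."""
    lo = max(0, p_t - half_width)
    hi = min(len(source_states), p_t + half_width + 1)
    window = source_states[lo:hi]
    # Dot-product alignment scores (an assumed scoring function).
    scores = [sum(a * b for a, b in zip(target_state, h_s)) for h_s in window]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over the window only
    dim = len(target_state)
    context = [sum(w * h_s[k] for w, h_s in zip(weights, window))
               for k in range(dim)]
    return context, weights, (lo, hi)

# Hypothetical 2-dimensional encoder states for a 5-word source sentence.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [9.0, 9.0]]
context, weights, span = local_attention([1.0, 0.5], source_states, 1, 1)
print(context, weights, span)
```

Because only the states inside the window are scored, the cost per target word depends on the window size and not on the source sentence length, unlike global attention.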
C. Residual connections and Bridges
The success of a neural network depends on its depth.
However, as the depth of the network increases, it
becomes more and more difficult to train, due to vanishing and
exploding gradients [10]. This problem has been addressed in
the past using the idea of modeling differences between an
intermediate layer's output and the targets. These are called
residual connections. With residual connections, the input of a
layer is added elementwise to its output before being fed to the
next layer. In Fig. 7, the output of LSTM1 is added to its input
and sent as the input to LSTM2. Residual connections are known to
greatly improve the gradient flow in the backward pass, which
allows us to train very deep networks.
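The elementwise addition described above can be sketched in a few lines. The layers here are arbitrary Python functions standing in for LSTM1 and LSTM2, and the name `residual_stack` is hypothetical; the sketch applies the skip connection around every layer in the stack, which is one possible arrangement.

```python
def residual_stack(x, layers):
    """Feed x through `layers`, adding each layer's input to its output."""
    for layer in layers:
        out = layer(x)
        x = [o + i for o, i in zip(out, x)]  # elementwise skip connection
    return x

def halve(vector):
    """Toy layer standing in for an LSTM layer: scales every value by 0.5."""
    return [0.5 * v for v in vector]

print(residual_stack([1.0, 2.0], [halve, halve]))
```

Even if a layer contributes almost nothing, its input still passes through unchanged, which is what keeps the gradient path open in deep stacks.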
An additional layer, the bridge, is needed between the encoder
and decoder layers. Fig. 8 shows the complete system consisting
of the encoder, decoder, residual connections and bridge. Fig. 9
shows a graphical representation of how sentences are
converted into vectors and associated with vectors in the target
language.
Fig. 7. Residual connections inside the Encoder
Fig. 8. Complete system block diagram for NMT
A. Dataset
The initial requirement for setting up a machine translation
system is the availability of a parallel corpus for the source
and target languages. Hindi is not as well-resourced a language
as its European counterparts in terms of the availability of
large datasets. Premier institutes in India and abroad have been
working for the past two decades on the development of parallel
corpora for Indian languages. We have considered the following
three datasets for the experiments.
1) English-Hindi parallel corpus from Institute for Lan-
guage, Cognition, and Computation, the University of
Edinburgh [11]
2) Institute of Formal and Applied Linguistics (UFAL)
at the Computer Science School, Faculty of Math-
ematics and Physics, Charles University in Prague,
Czech Republic [12]
3) Center for Indian Language Technology (CFILT), IIT
Bombay [13].
The dataset from ILCC, University of Edinburgh, contains
translated sentences from Wikipedia. The IIT Bombay and UFAL
datasets contain data from multiple disciplines. All these
datasets are exhaustive, with an abundant variety of words.
Table I gives the number of words and sentences in each
dataset.
TABLE I
                      ILCC      UFAL        CFILT
No. of sentences      41,396    237,885     1,492,827
No. of words          245,675   1,048,297   20,601,012
No. of unique words   8,342     21,673      250,619
B. Experimental setup
This NMT system is implemented by taking the core algorithm
from the tutorials by Graham Neubig [14] and from the resource
given in footnote 1. TensorFlow (footnote 2) and Theano
(footnote 3) are the platforms used in the system design. We
have set up the system on an NVIDIA Quadro K4200 graphics card.
This GPU has 24 stacks and a total of 1344 CUDA cores.
C. Training details
Once the dataset is preprocessed, the source and target
files are fed into the encoder layer to prepare vectors
from the sentences. We have used stochastic gradient descent
(SGD) [15] as the training algorithm. We have worked with two
different network depths: two-layer and four-layer networks are
trained for different combinations of encoder and decoder
architectures. We replace the LSTM with a Bi-LSTM and also
experiment with a deep Bi-LSTM. We also add residual connections
and an attention mechanism. Finally, we add a dense layer that
acts as a bridge between the encoder and decoder and compare its
performance with the other methods.
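For readers unfamiliar with SGD, the update rule can be illustrated on a toy problem. The sketch below fits a single weight w so that w * x approximates y, updating after every training example; the learning rate, the data and the name `sgd_fit` are hypothetical and unrelated to the hyperparameters of our translation models.

```python
import random

def sgd_fit(pairs, learning_rate=0.05, epochs=10, seed=0):
    """Fit y = w * x by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    data = list(pairs)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
            w -= learning_rate * grad     # step against the gradient
    return w

pairs = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
print(sgd_fit(pairs))
```

In an NMT system the same rule is applied to every weight matrix of the encoder, decoder and attention layers, with gradients obtained by backpropagation.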
TABLE II
Dataset   Total sentences   Training   Validation   Testing
ILCC      41,396            28,000     6,000        6,000
UFAL      237,885           70,000     15,000       15,000
CFILT     1,492,827         140,000    30,000       30,000
TABLE III
Training time (hh:mm:ss)
Dataset   No. of sentences   2 Layer LSTM + SGD   4 Layer LSTM + SGD   2 Layer (Bi-dir) LSTM + SGD   4 layers (Bi-dir) LSTM + SGD + Res
ILCC      28,000             02:58:35             06:34:54             03:28:34                      07:45:56
UFAL      70,000             07:34:28             13:46:25             08:31:24                      15:23:41
CFILT     140,000            16:38:24             28:38:12             15:43:23                      32:25:16
Since we have used a GPU, training the neural network for the
different datasets and architectures took a matter of hours.
For the experiments, we have used a different number of
sentences from each dataset. The number of sentences used for
training, validation and testing for each dataset is given in
Table II. The training time for each dataset for the selected
number of sentences is shown in Table III. Every dataset is
trained for 10 epochs.
D. BLEU Score
We have evaluated our system using the BLEU score [16]. The
BLEU score of the translations differs for each configuration,
and Table IV shows the BLEU score for each configuration.
The results obtained from NMT-based English-Hindi machine
translation are comparable with conventional statistical or
phrase-based machine translation systems. One of the earliest
SMT-based systems, Anusaaraka [17], lacks the capability
to handle complex sentences and does not perform on par with
the latest MT systems. In our work, we have also drawn
inspiration for the translation task from the human cognitive
ability to translate language. This system does not outperform
GNMT (with a BLEU score of 38.20 for En-Fr), but it shows
results comparable to Anusaaraka (21.18),
AnglaMT (22.21) and Anglabharati (20.66) [18].
TABLE IV
BLEU score for En-Hi
Configuration                         ILCC     UFAL     CFILT
2 Layer LSTM + SGD                    12.512   14.237   16.854
4 Layer LSTM + SGD                    13.534   16.895   17.124
2 Layer (Bi-dir) LSTM + SGD           12.854   15.785   18.100
4 layers (Bi-dir) LSTM + SGD + Res    13.863   17.987   18.215
Statistical phrase-based machine translation systems have
long faced problems of accuracy and the requirement for large
datasets. In this work, we have investigated the possibility of
using a shallow RNN- and LSTM-based neural machine translator
to address the problem of machine translation. We have used
quite a small amount of data and a small number of layers for
our experiments. The results indicate that NMT can provide much
better results with a larger dataset and a larger number of
layers in the encoder and decoder.
Compared to contemporary SMT and PBMT systems, NMT-based
MT performs much better. Future work would involve
fine-tuning the training on long and rare sentences using
smaller datasets. We would also like to explore NMT for Indian
language pairs. Since the grammatical structure of many
Indian languages is similar, we expect higher BLEU scores in
the future.
[1] Alon Lavie, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst,
Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard
Cohen. Experiments with a Hindi-to-English transfer-based MT system
under a miserly data scenario. 2(2):143–163, June 2003.
[2] S. Saini, U. Sehgal, and V. Sahula. Relative clause based text
simplification for improved english to hindi translation. In 2015
International Conference on Advances in Computing, Communications
and Informatics (ICACCI), pages 1479–1484, Aug 2015.
[3] S. Saini and V. Sahula. A survey of machine translation techniques and
systems for indian languages. In 2015 IEEE International Conference
on Computational Intelligence Communication Technology, pages 676–
681, Feb 2015.
[4] S. Chand. Empirical survey of machine translation tools. In 2016
Second International Conference on Research in Computational
Intelligence and Communication Networks (ICRCICN), pages 181–185,
Sept 2016.
[5] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence
learning with neural networks. CoRR, abs/1409.3215, 2014.
[6] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua
Bengio. On the properties of neural machine translation:
Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[7] LR Medsker and LC Jain. Recurrent neural networks. Design and
Applications, 5, 2001.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997.
[9] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural
networks. IEEE Transactions on Signal Processing, 45(11):2673–2681,
1997.
[10] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty
of training recurrent neural networks. In International Conference on
Machine Learning, pages 1310–1318, 2013.
[11] Institute for language, cognition and computation,
university of edinburgh, indic multi-parallel corpus, Technical report.
[12] Ondrej Bojar, Vojtech Diatka, Pavel Rychlý, Pavel Stranák,
Vít Suchomel, Ales Tamchyna, and Daniel Zeman. Hindencorp-hindi-english
and hindi-only corpus for machine translation. In LREC, pages
3550–3555, 2014.
[13] Resource centre for indian language technology solutions(cfilt) i.-b. h.
corpus, Technical report.
[14] Graham Neubig. Neural machine translation and sequence-to-sequence
models: A tutorial. arXiv preprint arXiv:1703.01619, 2017.
[15] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient
descent, weighted sampling, and the randomized kaczmarz algorithm.
In Advances in Neural Information Processing Systems, pages 1017–
1025, 2014.
[16] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu:
A method for automatic evaluation of machine translation. In Pro-
ceedings of the 40th Annual Meeting on Association for Computational
Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA, 2002.
Association for Computational Linguistics.
[17] Akshar Bharati, Vineet Chaitanya, Amba P. Kulkarni, and Rajeev
Sangal. Anusaaraka: Machine translation in stages. CoRR, cs.CL/0306130,
2003.
[18] Kunal Sachdeva, Rishabh Srivastava, Sambhav Jain, and Dipti Misra
Sharma. Hindi to english machine translation: Using effective selection
in multi-model smt. In LREC, pages 1807–1811, 2014.
Fig. 9. Graphical representation of vector relations between source and target sentences.