
Neural Machine Translation for English to Hindi

Sandeep Saini
Department of Electronics and Communication Engineering
The LNM Institute of Information Technology
Jaipur, India
Vineet Sahula
Department of Electronics and Communication Engineering
Malviya National Institute of Technology
Jaipur, India
Abstract—Language translation is one task in which machines definitely lag behind the cognitive powers of human beings. Statistical Machine Translation is one of the conventional ways of solving the problem of machine translation. This method requires huge datasets and performs well on language pairs with similar grammatical structure. In recent years, Neural Machine Translation (NMT) has emerged as an alternative way of addressing the same issue. In this paper, we explore different configurations for setting up a Neural Machine Translation system for the Indian language Hindi. We have experimented with eight different architecture combinations of NMT for English to Hindi and compared our results with conventional machine translation techniques. We have also observed in this work that NMT requires much less training data and thus produces satisfactory translations even with only a few thousand training sentences.
Keywords—Cognitive linguistics, Neural Machine Translation, Long Short-Term Memory, Indian Languages
Machine translation was one of the first tasks taken up by computer scientists, and research in this field has been going on for the last 50 years. During these years, remarkable progress has been made as linguists and computer engineers have worked together to reach the current status of machine translation. The machine translation task was initially handled with dictionary-matching techniques and was slowly upgraded to rule-based approaches. During the last two decades, most machine translation systems were based on the statistical machine translation approach. In these systems [1], [2], [3], the basic units of the translation process are phrases and sentences. These phrases can be composed of one or more words. Most conventional translation systems rely on Bayesian inference to estimate translation probabilities for pairs of phrases, where one phrase belongs to the source language and the other to the target language. Since the probability of any particular phrase pair is very low, pairing and predicting the correct pair is very difficult in these systems. To improve the probability estimate of a given phrase pair, increasing the size of the dataset was one of the most feasible solutions. Given the limitations of conventional machine translation systems [4] and their dependence on huge datasets, there is a demand for alternative methods of machine translation.
In recent years, Google has also shifted its translation research focus towards Neural Machine Translation (NMT). Sutskever et al. [5] proposed a sequence-to-sequence learning mechanism using long short-term memory (LSTM) models. This neural-network-based machine translation system had eight layers of encoder and eight layers of decoder. The core idea of NMT is to use deep learning and representation learning. NMT models require only a fraction of the memory needed by traditional statistical machine translation (SMT) models. Furthermore, unlike conventional translation systems, all parts of the neural translation model are trained jointly (end-to-end) to maximize translation performance [5], [6]. In an NMT system, a bidirectional recurrent neural network (RNN), known as the encoder, encodes a source sentence for a second RNN, known as the decoder, which predicts words in the target language. This encoder-decoder architecture can be designed with multiple layers to increase the efficiency of the system.
Neural machine translation normally tends to require a lot of computing power, which means it is a great technique only when enough time and computing resources are available. Another issue with older NMT systems was inconsistency in handling rare words: since such inputs appear only sparsely in the training data, learning and inference for them were not efficient. By using LSTM models with eight layers of encoder and decoder, this system removes these errors to a large extent. A third major issue with NMT was that the system used to forget words after a long gap; this issue is also resolved in the eight-layer approach. Since 2014, this work by Sutskever et al. has inspired many researchers, and NMT is developing into a good alternative to conventional machine translation techniques.
Google has deployed GNMT on an eight-layer encoder-decoder architecture. This system requires huge GPU computation for training the neural network. In this work, we explore a simplified, shallow network that can be trained on a regular GPU as well. We have explored different architectures of the shallow network and shown satisfactory results for an Indian language.
Neural Machine Translation (NMT) is a machine translation approach that uses an artificial neural network to increase the fluency and accuracy of the translation process. NMT is based on a simple encoder-decoder network. The type of neural network used in NMT is the Recurrent Neural Network (RNN) [7]. The reason for selecting RNNs for this task is their basic architecture: an RNN has a cyclic structure, which makes learning repeated sequences much easier than with other networks. An RNN can be unrolled to store the sentences as sequences in both the source and the target languages. A typical structure of an RNN is shown in Fig. 1. It illustrates how a single layer can be unrolled into multiple time steps, so that information from previous time periods can be stored in a single cell. RNNs can easily map sequences to sequences when the alignment between inputs and outputs is known ahead of time.
Fig. 1. Typical structure of an RNN
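The unrolling idea can be sketched in a few lines. The snippet below uses a scalar hidden state and hypothetical weights (w_h, w_x), purely for illustration; a real RNN uses weight matrices and vector-valued states:

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One unrolled RNN step: the new state mixes the previous state and the input."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def rnn_unroll(inputs, h0=0.0):
    """Process a sequence token by token, returning the hidden state at each step."""
    states = []
    h = h0
    for x in inputs:          # each step reuses the same weights (the "unrolled" cell)
        h = rnn_step(h, x)
        states.append(h)
    return states

# A spike at the first position still influences later states via the recurrence.
states = rnn_unroll([1.0, 0.0, 0.0, 0.0])
```

Note how `states[3]` is still nonzero even though the last three inputs are zero: the first input is carried forward through the recurrent state.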
Let X and Y be the source and target sentences, respectively. The encoder RNN converts the source sentence x_1, x_2, x_3, ..., x_n into vectors of fixed dimensions. The decoder then produces one word at a time as its output, using the conditional probability

P(Y|X) = P(Y | X_1, X_2, X_3, ..., X_m)                      (1)

Here X_1, X_2, ..., X_m are the fixed-size vectors produced by the encoder. Using the chain rule, the above equation is decomposed so that, while decoding, the next word is predicted from the symbols predicted so far together with the source sentence vectors. The expression then becomes

P(Y|X) = ∏_i P(y_i | y_0, y_1, y_2, ..., y_{i-1}; X_1, X_2, X_3, ..., X_m)      (2)

Each term in this distribution is represented by a softmax function over all the words in the vocabulary.
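As a concrete illustration of this chain-rule decomposition, the sketch below scores a short hypothetical Hindi output using per-step softmax distributions over a toy vocabulary. The logits are invented for the example, not taken from a trained model:

```python
import math

def softmax(logits):
    """Normalize raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary and per-step decoder logits (hypothetical values): at each step
# the decoder scores every vocabulary word given the words predicted so far.
vocab = ["<eos>", "main", "acchaa", "khelta", "hoon"]
step_logits = [
    [0.1, 2.0, 0.3, 0.2, 0.1],   # step 1: "main" is most likely
    [0.1, 0.2, 0.4, 2.5, 0.3],   # step 2: "khelta"
    [0.2, 0.1, 0.3, 0.2, 2.2],   # step 3: "hoon"
    [3.0, 0.1, 0.1, 0.1, 0.1],   # step 4: "<eos>" ends decoding
]

# Chain rule: P(Y|X) = prod_i P(y_i | y_<i; X); each factor is a softmax over vocab.
target = ["main", "khelta", "hoon", "<eos>"]
log_prob = 0.0
for logits, word in zip(step_logits, target):
    probs = softmax(logits)
    log_prob += math.log(probs[vocab.index(word)])
sentence_prob = math.exp(log_prob)
```

Greedy decoding simply picks the argmax of each step's distribution; here that recovers the target sequence.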
Although RNNs work very well in theory, there are problems when training them on long sentences, because RNNs suffer from the "long-term dependency problem." The reason is that in an RNN the final decision at time t depends on all earlier hidden states, so the gradient reaching the early time steps is a product of many factors:

∂h_t/∂h_1 = (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) ... (∂h_2/∂h_1)      (3)

Because of this multiplicative effect, the gradient for longer sentences becomes very small and results in inaccuracy. This can be explained by a simple example. If the given task is to predict the next word in a sentence using a language model, then in a sentence like "Potholes are located on the ____," the word "road" is the obvious choice to fill in the blank, and we do not need any further context, as the gap between the relevant information and the place where it is needed is small. However, in a sentence like "I am a cricket player and I bat well at the ____ position," it is not as straightforward, because from the recent information we can only deduce that the missing word should be a position in the batting order, and there are multiple choices (e.g., opening, middle order, and lower order). This problem is felt when the gap between the relevant information and the place where it is needed is not small, and it is entirely possible to have gaps much bigger than in this example. In practice, it is difficult for RNNs to learn these dependencies. Since typical sentences in any language contain such complex context-dependent cases, a plain RNN is not suitable for the encoder and decoder design. To overcome the shortcomings of RNNs, we use Long Short-Term Memory (LSTM) models for encoding and decoding.
A. Long Short-Term Memory (LSTM) model
Long Short-Term Memory [8] is a variation of the RNN that is known to learn problems with long-range temporal dependencies, so RNNs are often replaced with LSTMs in MT networks. LSTMs have the same chain-like structure, but the structure of the repeating module differs from that of a plain RNN: in place of a single neural network layer, there are four layers in each module. These layers interact within the same module as well as with other modules during learning. A typical LSTM module is shown in Fig. 2.
Fig. 2. The repeating module in an LSTM contains four interacting layers
In this module, there are four gates for four different operations in the learning process. The first gate is the "forget gate" (f_t); it helps decide which part of the previous learning should be forgotten in this layer. The next is a sigmoid layer called the "input gate" (i_t), which decides which values the system will update. The third is a tanh layer that creates a vector of new candidate values, C̃_t, that could be added to the cell state in the same module. Finally, the output is decided by the fourth layer, the "output gate" (o_t): a sigmoid layer selects which parts of the cell state, passed through a tanh function, form the state for the next module. The corresponding equations for these functions are as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)      (4)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)      (5)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   (6)
C_t = f_t * C_{t-1} + i_t * C̃_t          (7)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)      (8)
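A minimal sketch of one LSTM step following the gate equations above, including the standard cell-state update C_t = f_t * C_{t-1} + i_t * C̃_t. Scalar gates and hypothetical weights are used for readability; a real module applies weight matrices to the concatenated [h_{t-1}, x_t] vector:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W):
    """One LSTM module with scalar state for clarity.
    W holds (weight_h, weight_x, bias) per gate; the values are illustrative."""
    f_t = sigmoid(W["f"][0] * h_prev + W["f"][1] * x_t + W["f"][2])        # forget gate
    i_t = sigmoid(W["i"][0] * h_prev + W["i"][1] * x_t + W["i"][2])        # input gate
    C_tilde = math.tanh(W["C"][0] * h_prev + W["C"][1] * x_t + W["C"][2])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                                     # cell-state update
    o_t = sigmoid(W["o"][0] * h_prev + W["o"][1] * x_t + W["o"][2])        # output gate
    h_t = o_t * math.tanh(C_t)                                             # new hidden state
    return h_t, C_t

W = {g: (0.5, 1.0, 0.0) for g in ("f", "i", "C", "o")}  # hypothetical weights
h, C = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:
    h, C = lstm_step(x, h, C, W)
```

The additive update of C_t (rather than a purely multiplicative one) is what lets gradients survive over long gaps.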
In Fig. 3, the model reads an input sentence ABC and
produces WXYZ as the output sentence. The model stops
making predictions after generating the end-of-sentence (EOS)
token as the output.
A. Encoder and Decoder
In LSTM-based NMT, we use a bidirectional encoder [5]. This encoder is based on the idea that the output at any time instant may depend not only on past data but also on future data. Using this idea, the LSTM is tweaked by connecting two hidden layers of opposite directions to the same output. This tweaked version of the LSTM is called a Bidirectional LSTM (Bi-LSTM) [9]. Bi-LSTMs were introduced to increase the amount of input information available to the network. Unlike LSTMs, Bi-LSTMs have access to future input from the current state without the need for time delays. Fig. 4 shows the architecture of the bidirectional encoder used in our NMT system. The encoder presented in this figure is a single-layer encoder. In GNMT, eight-layer encoder and decoder blocks are used to process the information. For better efficiency in the learning process, multiple layers of LSTMs are preferred in both the encoder and decoder designs.
Fig. 3. Sentence modeling in LSTM network
Fig. 4. Bidirectional Encoder design using Bi-LSTM
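The bidirectional idea itself can be sketched independently of the cell type. The toy code below runs a cell left-to-right and right-to-left over the inputs and pairs the states per position, so each position sees both past and future context. A plain tanh cell stands in for the LSTM cells only to keep the example short:

```python
import math

def cell(h_prev, x, w_h=0.5, w_x=1.0):
    """Recurrent cell (plain tanh here; a Bi-LSTM uses LSTM cells in each direction)."""
    return math.tanh(w_h * h_prev + w_x * x)

def bidirectional_encode(inputs):
    """Run one pass left-to-right and one right-to-left, then pair the states.
    Each position's representation thus covers both past and future context."""
    fwd, h = [], 0.0
    for x in inputs:               # forward direction
        h = cell(h, x)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(inputs):     # backward direction
        h = cell(h, x)
        bwd.append(h)
    bwd.reverse()                  # realign backward states with input positions
    return list(zip(fwd, bwd))

states = bidirectional_encode([0.3, -0.8, 1.0])
```

In a full system the forward and backward states are typically concatenated before being passed to the next layer or to the attention mechanism.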
The decoder is designed to decode the vectors back to
target language words. We have experimented with multi-layer
decoders in the system. A typical two-layer decoder is shown
in Fig. 5.
Fig. 5. Two-layer decoder architecture

B. Attention in the model
The attention layer is the bridge between the encoder and decoder of an NMT system. There are two kinds of attention models: global and local. The idea of the global attention model is to consider all the hidden states of the encoder when deriving the context vector c_t. In the global attention model, a_t, a variable-length alignment vector whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state h_t with each source hidden state h_s. The local attention model models the language differently: the model first predicts a single aligned position p_t for the current target word, and a context vector c_t is then computed over a window centered around the source position p_t. In our system, we have used the local attention model. A block diagram showing the functionality of the attention layer is shown in Fig. 6.
Fig. 6. Local attention model.
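A simplified sketch of the local attention computation, with scalar encoder states and a Gaussian window around an assumed predicted position p_t. The window size and state values are illustrative, not taken from our system:

```python
import math

def local_attention(source_states, p_t, window=2):
    """Local attention sketch: weight the source states near the predicted
    position p_t with a Gaussian window and return the context value c_t.
    Scalar states keep the example short; real states are vectors."""
    sigma = window / 2.0
    lo = max(0, int(p_t) - window)
    hi = min(len(source_states) - 1, int(p_t) + window)
    # Alignment weight falls off with distance from p_t (uniform base score here;
    # a trained model would also compare decoder and encoder hidden states).
    weights = [(s, math.exp(-((s - p_t) ** 2) / (2 * sigma ** 2)))
               for s in range(lo, hi + 1)]
    total = sum(w for _, w in weights)
    return sum((w / total) * source_states[s] for s, w in weights)

source = [0.1, 0.5, 0.9, 0.4, 0.2]   # hypothetical encoder states
c_t = local_attention(source, p_t=2.0)
```

Because the weights peak at p_t, the context value is pulled toward the encoder state at the predicted position rather than the plain average of all states.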
C. Residual connections and Bridges
The success of a neural network depends on the depth of the network. However, as the depth increases, the network becomes more and more difficult to train due to vanishing and exploding gradients [10]. This problem has been addressed in the past by modeling the differences between an intermediate layer's output and the targets; these are called residual connections. With residual connections, the input of a layer is added elementwise to its output before being fed to the next layer. In Fig. 7, the output of LSTM1 is added to its input and sent as the input to LSTM2. Residual connections are known to greatly improve gradient flow in the backward pass, which allows us to train very deep networks.
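The elementwise residual addition can be sketched as follows; the layer function here is a stand-in for an LSTM layer's transformation, not an actual LSTM:

```python
import math

def layer(x):
    """Stand-in for an LSTM layer's transformation (illustrative only)."""
    return [math.tanh(0.5 * v) for v in x]

def residual_layer(x):
    """Residual connection: the layer's input is added elementwise to its
    output before being fed to the next layer, improving gradient flow."""
    y = layer(x)
    return [xi + yi for xi, yi in zip(x, y)]

out = residual_layer([1.0, -2.0, 0.0])
```

In the backward pass, the identity branch carries gradients around the layer unchanged, which is why stacks of such layers remain trainable at depth.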
An additional bridge layer is needed between the encoder and decoder layers. Fig. 8 shows the complete system consisting of the encoder, decoder, residual connections, and bridge. Fig. 9 shows a graphical representation of how sentences are converted into vectors and associated with the vectors in the target language.
Fig. 7. Residual connections inside the Encoder
Fig. 8. Complete system block diagram for NMT
A. Dataset
The initial requirement for setting up a machine translation system is the availability of a parallel corpus for the source and target languages. Hindi is not as resource-rich a language as its European counterparts in terms of the availability of large datasets. Premier institutes in India and abroad have been working for the past two decades on the development of parallel corpora for Indian languages. We have considered the following three datasets for our experiments.
1) English-Hindi parallel corpus from the Institute for Language, Cognition and Computation (ILCC), University of Edinburgh [11]
2) Institute of Formal and Applied Linguistics (UFAL) at the Computer Science School, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic [12]
3) Center for Indian Language Technology (CFILT), IIT Bombay [13].
The dataset from ILCC, University of Edinburgh, contains translated sentences from Wikipedia. The IIT Bombay and UFAL datasets contain data from multiple disciplines. All these datasets are exhaustive, with an abundant variety of words. Table I lists the number of sentences and words in each dataset.
TABLE I. Dataset statistics
                        ILCC        UFAL        CFILT
No. of sentences        41,396      237,885     1,492,827
No. of words            245,675     1,048,297   20,601,012
No. of unique words     8,342       21,673      250,619
B. Experimental setup
This NMT system is implemented by taking the core algorithm from the tutorial by Graham Neubig [14]. TensorFlow and Theano are the platforms used in the system design. We have set up the system on an NVIDIA GPU with a Quadro K4200 graphics card; this GPU has 24 stacks and a total of 1344 CUDA cores.
C. Training details
Once the dataset is preprocessed, the source and target files are fed into the encoder layer to prepare vectors from the sentences. We have used stochastic gradient descent (SGD) [15] as the training algorithm. We have worked with two different network depths: two-layer and four-layer networks are trained for different combinations of encoder and decoder architectures. We replace the LSTM with a Bi-LSTM and also experiment with a deep Bi-LSTM. We further add residual connections and an attention mechanism, as well as a dense layer that acts as a bridge between the encoder and decoder, and compare its performance with the other methods.
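The SGD update rule used during training can be illustrated on a one-parameter toy problem; the learning rate and data below are hypothetical, chosen only to show the update w ← w − lr · grad:

```python
def sgd_train(pairs, lr=0.1, epochs=10):
    """Minimal SGD: one update per training example, repeated for several epochs.
    The same rule drives NMT training, over millions of parameters instead of one."""
    w = 0.0
    for _ in range(epochs):
        for x, y in pairs:
            grad = 2 * (w * x - y) * x   # gradient of the squared error (w*x - y)^2
            w -= lr * grad               # step against the gradient
    return w

# Data generated from y = 3x; SGD should drive w toward 3.
w = sgd_train([(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)])
```

In the NMT setting the loss is the negative log-likelihood of the target sentence, but the per-example update has exactly this shape.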
TABLE II. Dataset splits
Dataset   Total sentences   Training   Validation   Testing
ILCC      41,396            28,000     6,000        6,000
UFAL      237,885           70,000     15,000       15,000
CFILT     1,492,827         140,000    30,000       30,000
TABLE III. Training time (hh:mm:ss)
Dataset   No. of sentences   2 Layer    4 Layer    2 Layer + SGD + …   4 layers + …
ILCC      28,000             02:58:35   6:34:54    3:28:34             7:45:56
UFAL      70,000             07:34:28   13:46:25   8:31:24             15:23:41
CFILT     140,000            16:38:24   28:38:12   15:43:23            32:25:16
Since we have used a GPU, training the neural network on the different datasets and architectures took only a few hours. For the experiments, we used a different number of sentences for each dataset. Details of the number of sentences used for training and testing of each dataset are given in Table II. The training time for each dataset, for the selected number of sentences, is shown in Table III. Every dataset is trained for 10 epochs.
D. BLEU Score
We have evaluated our system using the BLEU score [16]. The BLEU scores of the translations differ across configurations, and Table IV shows the BLEU score for each configuration.
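For reference, the arithmetic behind a sentence-level BLEU score can be sketched as below (single reference, uniform n-gram weights); actual evaluations use the standard corpus-level implementation, so this is only illustrative:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """BLEU sketch: geometric mean of clipped n-gram precisions times a
    brevity penalty. Candidate and reference are token lists."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference counts
        total = max(1, sum(cand.values()))
        if clipped == 0:
            return 0.0                      # any zero precision zeroes this simple score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "main accha khelta hoon".split()
score_perfect = bleu(ref, ref)   # identical sentences get the maximum score
```

An exact match scores 1.0; any missing or reordered n-grams reduce the precision terms, and overly short outputs are further penalized by the brevity factor.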
E. Discussions
The results obtained from NMT-based English-Hindi machine translation are comparable with conventional statistical or phrase-based machine translation systems. One of the earliest SMT-based systems, Anusaaraka [17], lacks the capability to handle complex sentences and does not perform on par with the latest MT systems. In our work, we have also drawn inspiration for the translation task from the human cognitive ability to translate language. Our system does not outperform GNMT (which reports a BLEU score of 38.20 for En-Fr), but it shows comparable results against Anusaaraka (21.18), AnglaMT (22.21), and Anglabharati (20.66) [18].
Statistical phrase-based machine translation systems have long faced problems of accuracy and the requirement of large datasets. In this work, we have investigated the possibility of using a shallow RNN- and LSTM-based neural machine translator to address the machine translation problem. We have used a fairly small dataset and fewer layers in our experiments.

TABLE IV. BLEU score for En-Hi
Configuration                           ILCC     UFAL     CFILT
2 Layer LSTM + SGD                      12.512   14.237   16.854
4 Layer LSTM + SGD                      13.534   16.895   17.124
2 Layer (Bi-dir) LSTM + SGD             12.854   15.785   18.100
4 Layer (Bi-dir) LSTM + SGD + Res       13.863   17.987   18.215

The results show that NMT can provide much better results with a larger dataset and a larger number of layers in the encoder and decoder. Compared to contemporary SMT and PBMT systems, NMT-based MT performs much better. Future work will involve fine-tuning the training of long and rare sentences using smaller datasets. We would also like to explore NMT for Indian language pairs. Since the grammar structures of many Indian languages are similar to each other, we expect higher BLEU scores in the future.
[1] Alon Lavie, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjós, Rachel Reynolds, Jaime Carbonell, and Richard Cohen. Experiments with a Hindi-to-English transfer-based MT system under a miserly data scenario. 2(2):143–163, June 2003.
[2] S. Saini, U. Sehgal, and V. Sahula. Relative clause based text simplification for improved English to Hindi translation. In 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 1479–1484, Aug 2015.
[3] S. Saini and V. Sahula. A survey of machine translation techniques and systems for Indian languages. In 2015 IEEE International Conference on Computational Intelligence Communication Technology, pages 676–681, Feb 2015.
[4] S. Chand. Empirical survey of machine translation tools. In 2016 Second International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pages 181–185, Sept 2016.
[5] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence
learning with neural networks. CoRR, abs/1409.3215, 2014.
[6] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[7] L. R. Medsker and L. C. Jain. Recurrent Neural Networks: Design and Applications, volume 5, 2001.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[9] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[10] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty
of training recurrent neural networks. In International Conference on
Machine Learning, pages 1310–1318, 2013.
[11] Institute for Language, Cognition and Computation, University of Edinburgh. Indic multi-parallel corpus. Technical report.
[12] Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel, Aleš Tamchyna, and Daniel Zeman. HindEnCorp – Hindi-English and Hindi-only corpus for machine translation. In LREC, pages 3550–3555, 2014.
[13] Resource Centre for Indian Language Technology Solutions (CFILT), I.-B. H. corpus. Technical report.
[14] Graham Neubig. Neural machine translation and sequence-to-sequence
models: A tutorial. arXiv preprint arXiv:1703.01619, 2017.
[15] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient
descent, weighted sampling, and the randomized kaczmarz algorithm.
In Advances in Neural Information Processing Systems, pages 1017–
1025, 2014.
[16] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu:
A method for automatic evaluation of machine translation. In Pro-
ceedings of the 40th Annual Meeting on Association for Computational
Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA, 2002.
Association for Computational Linguistics.
[17] Akshar Bharati, Vineet Chaitanya, Amba P. Kulkarni, and Rajeev Sangal. Anusaaraka: Machine translation in stages. CoRR, cs.CL/0306130, 2003.
Fig. 9. Graphical representation of vector relations between source and target sentences.
[18] Kunal Sachdeva, Rishabh Srivastava, Sambhav Jain, and Dipti Misra
Sharma. Hindi to english machine translation: Using effective selection
in multi-model smt. In LREC, pages 1807–1811, 2014.
... Only a few research works Agrawal and Sharma (2017), Parida and Ondřej (2018), Grundkiewicz and Heafield (2018), Saini and Sahula (2018), used the NMT method for English to Hindi translation. Agrawal and Sharma (2017) use a Recurrent Neural Network (RNN) to deal with variable-length input and output by employing Gated Recurrent Units, Long Short Term Memory Units (LSTM), Bidirectional LSTMs and Attention Mechanism. ...
... They learned to align and translate simultaneously. Saini and Sahula (2018) also replace LSTM with Bi-LSTM and Deep Bi-LSTM by adding a residual connection. There is a lack of fine-tuning the training of rare and long sentences using smaller datasets in their method. ...
... The translator acquired the knowledge required for translating in implicit form from the input pair of sentences. Saini and Sahula (2018) investigated the possibility of using shallow RNN (Jang et al., 2019) and Long-Short Term Memory (LSTM) based NMT for solving Machine Translation issues. They have used a small dataset and fewer layers for their experiment. ...
... Machine translation was one of the initial tasks taken by the computer scientists and the research in NLP field is going on for last 50 years [5]. During these years, it has seen a remarkable progress that linguists and computer engineers have worked together to achieve the current status of machine translation. ...
... Several studies and applications have been done for foreign languages using different methodologies and approaches. Most of the machine translation works have been done on language pair of English and other languages, such as Arabic [12], Japanese [13], India [5], Malayalam [14], Bangla [15] among others. ...
A language is a tool fashioned by a man; it is the only gift that identifies human beings from the rest of life. Language is a means of communication in our day-to-day activities to do various things, like giving commands, asking questions, and expressing feelings, but we use it especially to communicate information about the world. There are a lot of languages spoken by human beings. It is difficult to learn and speak all languages spoken in this world, for this reason not all peoples are communicated with each other. Usually, this communication gap is solved by using a human interpreter. However, the use of human interpreters is expensive and inconvenient. Many types of research are being done to resolve this problem using machine translation techniques. Machine translation is an automatic translation of a source language to a target language. This can be spoken to speech or text to text translation. In this work, a bi-directional text-based multilingual machine translation for English and Ometo languages pair using recurrent neural networks is proposed. We started our study with the objective of designing and developing multilingual machine translation for English to Ometo by making the translation bi-directional by applying a recurrent neural network on translations between these language pairs. In order to achieve our objective, we collected parallel corpus data from different sources and divided it into training, validation, and testing sets. We trained our model using four experiments; the first experiment was done by combining four language datasets to Ometo, the second experiment was done by combining Wolaita, Gamo, and Dawuro datasets for training and Gofa dataset for testing and tuning. The third experiment was done by combining Wolaita, Gamo, and Gofa datasets for training and Dawuro dataset for testing and tuning. The fourth experiment was done by combining Wolaita, Gofa, and Dawuro datasets for training and Gamo dataset for testing and tuning. 
After training and testing these systems on corresponding training and testing datasets, the first experiment shows a BLEU score of 5.7 and 11.7 for English – Ometo, and Ometo-English respectively. For experiments II, III, and IV the experiment shows a BLEU score of 4.5, 2.4, 3.4 for experiments II, III, and IV respectively.
... Work on these has progressed fast for the big languages. For example, Hindi, the highest-resourced South Asian language, has massive hand-annotated dependency treebanks (Bhatt et al., 2009), state-of-the-art neural distributional semantic transformer models (Jain et al., 2020;Khanuja et al., 2021), and machine translation models to and from English (Saini and Sahula, 2018). This is not to say that there are no resources at all for the languages Joshi et al. (2020) terms "the Left-Behinds". ...
Full-text available
South Asia is home to a plethora of languages, many of which severely lack access to new language technologies. This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics -- fields which necessitate the gathering of extensive data from many languages. We claim that data scatteredness (rather than scarcity) is the primary obstacle in the development of South Asian language technology, and suggest that the study of language history is uniquely aligned with surmounting this obstacle. We review recent developments in and at the intersection of South Asian NLP and historical-comparative linguistics, describing our and others' current efforts in this area. We also offer new strategies towards breaking the data barrier.
... AnglaBangla is a pseudointerlingua based English to Bangla Machine-Aided Translation (MAT) System, a project sponsored by MCIT,Govt. of India, based on Anglabharati technology, developed by IIT,Kanpur [10]. In [11], eight different architecture combinations with NMT for English to Hindi are compared and exhibits satisfactory translation for few thousands of training sentences as well. Now this paper describes a novel semantic transfer of technical English words for accessing above mentioned literature in Hindi. ...
Full-text available
The nearly all part of the India knows and understands Hindi language. Their mother tongue is also different. The study and observation says that on an average Indian student even though they do the learning formerly the English language they hardly able to grasp it. They don't read, write and speak fluently the English. That's it becomes difficult for these students to get through technical English digital e-books, e-tutorial and e-notes. That's why the AICTE (All India Council for Technical Education) has initiated to teach the technical education in the mother tongue. Nowadays the internet is the most widely used tool for distributing, accessing digital not only simple but technical texts. All Indian language versions in Digital English books are now must to go the knowledge to the end user i.e. to tribunal areas students, remote area students and mostly small villages and territorials. So this paper gives simple machine aided translation, the most important component of the framework is the mapping of single and multi-words in order to cover both simple as well as compound-complex sentences. Due to differences in style and structure between technical English and Indian language, this mapping should be non-trivial. In this paper, we present a practical framework for Semantic Transfer of Technical English Words for Accessing Technical English Text in Hindi as example. We also switch over to some problems in accessing of technical English words in Hindi.
... Hence, we use LSTM as an RNN unit. These LSTM in series form the encoder and decoder part which combines to form a Seq2Seq model [16,17]. ...
Full-text available
In this paper, we focus on the unidirectional translation of Kannada text to English text using Neural Machine Translation (NMT). From studies, we found that using Recurrent Neural Network (RNN) has been the most efficient way to perform machine translation. In this process we have used Sequence to Sequence (Seq2Seq) modelled dataset with the help of Encoder-Decoder Mechanism considering Long Short Term Memory (LSTM) as RNN unit. We have compared our result concerning to Statistical Machine Translation (SMT) and obtained a better Bi-Lingual Evaluation Study (BLEU) value, with an accuracy of 86.32%.
... In earlier times, the translation was handled by statically replacing words with the words from the target language. This dictionary look-up led technique led to inarticulate translation and hence was made obsolete by Rule-Based Machine Translation (RBMT) [1]. RBMT is a system based on linguistic information about the source and target languages derived from dictionaries and grammar including semantics and syntactic regularities of each language [14]. ...
Machine Translation (MT) is one of the most prominent tasks in Natural Language Processing (NLP) which involves the automatic conversion of texts from one natural language to another while preserving its meaning and fluency. Although the research in machine translation has been going on since multiple decades, the newer approach of integrating deep learning techniques in natural language processing has led to significant improvements in the translation quality. This paper has developed a Neural Machine Translation (NMT) system by training the Transformer model to translate texts from Indian Language Hindi to English. Hindi being a low resource language has made it difficult for neural networks to understand the language thereby leading to a slow growth in the development of neural machine translators. Thus, to address this gap, back-translation is implemented to augment the training data and for creating the vocabulary, it has been experimented with both word and subword level tokenization using Byte Pair Encoding (BPE) thereby ending up training the Transformer in 10 different configurations. This led us to achieve a state-of-the-art BLEU score of 24.53 on the test set of IIT Bombay English-Hindi Corpus in one of the configurations.
... NMT has achieved state-of-the-art results since it has overcome different issues of SMT and attained good accuracy in various languages [2,3,13,15,16]. In the context of low resource scenarios like the English to Mizo translation, the NMT system outperforms the conventional SMT system on grounds of BLEU score [14]. Encoder and decoder constitute key elements of the NMT system architecture. ...
Laskar, Sahinur Rahman; Pakray, Partha; Bandyopadhyay, Sivaji. Neural machine translation (NMT) is a state-of-the-art technique for the task of machine translation (MT), in which a source-language text is converted into a target-language text while preserving its meaning. NMT attracts attention because it handles sequence-to-sequence learning problems for variable-length source and target sentences. With the attention mechanism, the NMT system performs well in context-analyzing ability, but it needs a sufficient parallel training corpus, which is a challenge in low-resource language scenarios. To overcome the lack of a handy parallel corpus, there is increasing demand for direct translation between similar language pairs. This paper investigates an NMT system for direct translation of the low-resource similar language pair Assamese–Bengali. The main contribution of this work is an Assamese–Bengali parallel corpus. The NMT system achieved a bilingual evaluation understudy (BLEU) score of 7.20 for Assamese to Bengali translation and 10.10 for Bengali to Assamese translation.
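The BLEU metric used to report these scores can be sketched for the single-reference case; this is a simplified illustration (no smoothing, one reference), not the exact variant any particular evaluation toolkit implements:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    # Modified n-gram precision for n = 1..max_n, combined by a geometric
    # mean and multiplied by a brevity penalty for short candidates.
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:  # unsmoothed: any zero precision zeroes the score
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(bleu(hyp, ref))  # a perfect match scores 1.0
```

Corpus-level BLEU, as reported in papers, aggregates the n-gram counts over all test sentences before combining them rather than averaging sentence scores.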
Words are the core component of communication, expressed through speech, writing, or signals. It is important that the message conveys to the receiver the same meaning the sender intended. The evolution from manual translation to digital machine translation has helped greatly in finding, for each word, a meaning at least close to the intended one. To make machine translators feel more human-friendly, natural language processing (NLP) combined with machine learning (ML) is an excellent fit. The main challenges in machine-translated sentences involve ambiguities, lexical divergence, syntactic and lexical mismatches, semantic issues, etc., which surface in grammar, spelling, punctuation, spacing, and so on. After analyzing different algorithms, we implemented two machine translators using two different Long Short-Term Memory (LSTM) approaches and performed a comparative study of the quality of the translated text based on their respective accuracies. We used two different encoding-decoding training approaches on the same datasets, translating source English text to target Hindi text. To detect whether the entered text is English or Hindi, we used a sequential LSTM model, analyzed on the basis of its accuracy. As a result, the first trained LSTM model is 84% accurate and the second is 71% accurate in translating English to Hindi text, while the detection model is 78% accurate at detecting English text and 81% accurate at detecting Hindi text. This study helped us analyze the appropriate machine translation approach based on its accuracy. Keywords: Accuracy, Decoding, Machine Learning (ML), Detection System, Encoding, Long Short-Term Memory (LSTM), Machine Translation, Natural Language Processing (NLP), Sequential
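The LSTM cell underlying translators like those above can be illustrated with a scalar-state forward step; the weights here are arbitrary placeholders (a real model learns vector-valued parameters by backpropagation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    # One LSTM step with scalar input and state, showing the gating that
    # lets the cell keep or discard information across time steps.
    # W holds (input weight, recurrent weight, bias) per gate.
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])   # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])   # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])   # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2]) # candidate
    c = f * c_prev + i * g        # new cell state
    h = o * math.tanh(c)          # new hidden state
    return h, c

# Illustrative fixed weights; training would adjust these.
W = {k: (0.5, 0.1, 0.0) for k in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in (1.0, -0.5, 0.25):      # run a short input sequence
    h, c = lstm_step(x, h, c, W)
```

The additive cell-state update `c = f * c_prev + i * g` is what lets gradients flow over long sequences, unlike a plain RNN's repeated matrix multiplication.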
Machine Translation pertains to the translation of one natural language into another by automated computing. The main objective is to bridge the language gap between people, communities, or countries speaking different languages. India has multiple, hugely diverse languages and scripts, so the scope of and need for language translation is immense. In this paper, we focus on the current scenario of machine translation research in India. We review various important Machine Translation Systems (MTS) and present a preliminary comparison of the core methodologies they use.
Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks. The neural machine translation models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation. In this paper, we focus on analyzing the properties of the neural machine translation using two models; RNN Encoder--Decoder and a newly proposed gated recursive convolutional neural network. We show that the neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase. Furthermore, we find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically.
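The fixed-length representation described above can be illustrated with a vanilla RNN encoder in plain Python; the embeddings and weights here are random toy values (a real encoder is trained jointly with the decoder):

```python
import math
import random

def rnn_encode(tokens, emb, W_in, W_rec, hidden_size=4):
    # Vanilla RNN encoder: fold a variable-length token sequence into a
    # fixed-length hidden vector (the "summary" a decoder would consume).
    h = [0.0] * hidden_size
    for tok in tokens:
        x = emb[tok]
        h = [math.tanh(sum(W_in[j][k] * x[k] for k in range(len(x)))
                       + sum(W_rec[j][k] * h[k] for k in range(hidden_size)))
             for j in range(hidden_size)]
    return h

random.seed(0)
emb = {w: [random.uniform(-1, 1) for _ in range(3)]
       for w in ("i", "eat", "an", "apple")}
W_in = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W_rec = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)]

short = rnn_encode(["i", "eat"], emb, W_in, W_rec)
long_ = rnn_encode(["i", "eat", "an", "apple"], emb, W_in, W_rec)
# Both encodings have the same fixed length, regardless of input length.
```

Squeezing an arbitrarily long sentence into one fixed vector is exactly the bottleneck behind the degradation on long sentences that this paper analyzes.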
There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.
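The gradient norm clipping strategy proposed for the exploding-gradient problem can be sketched as follows (the threshold and gradient values are illustrative):

```python
import math

def clip_by_global_norm(grads, threshold):
    # Rescale the whole gradient vector when its L2 norm exceeds `threshold`:
    # the update direction is preserved, only its magnitude is capped.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads]
    return grads

exploding = [30.0, 40.0]                      # L2 norm 50
clipped = clip_by_global_norm(exploding, 5.0)
# the clipped vector's norm equals the threshold; small gradients pass through
```

Clipping addresses exploding gradients only; the paper pairs it with a soft constraint for the separate vanishing-gradient problem.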
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
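The source-reversal preprocessing reported in the final finding above can be sketched directly (the example sentence pair is illustrative):

```python
def reverse_source(pairs):
    # Reverse the token order of every SOURCE sentence while leaving the
    # target untouched, introducing short-term dependencies between the
    # start of the source and the start of the target that ease optimization.
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

data = [(["I", "eat", "an", "apple"],
         ["main", "ek", "seb", "khata", "hoon"])]
flipped = reverse_source(data)
# source becomes ["apple", "an", "eat", "I"]; target is unchanged
```

After reversal, the first source word the encoder saw last is the one the decoder needs first, which shortens the effective dependency length.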
This tutorial introduces a new and powerful set of techniques variously called "neural machine translation" or "neural sequence-to-sequence models". These techniques have been used in a number of tasks regarding the handling of human language, and can be a powerful tool in the toolbox of anyone who wants to model sequential data of some sort. The tutorial assumes that the reader knows the basics of math and programming, but does not assume any particular experience with neural networks or natural language processing. It attempts to explain the intuition behind the various methods covered, then delves into them with enough mathematical detail to understand them concretely, and culminates with a suggestion for an implementation exercise, where readers can test that they understood the content in practice.
We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning \((L/\mu )^2\) (where \(L\) is a bound on the smoothness and \(\mu \) on the strong convexity) to a linear dependence on \(L/\mu \) . Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence in the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.
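The randomized Kaczmarz algorithm, recast above as SGD with importance sampling proportional to squared row norms, can be sketched for a small consistent system (the matrix and iteration count are illustrative):

```python
import random

def randomized_kaczmarz(A, b, iters=500, seed=0):
    # Randomized Kaczmarz for a consistent system Ax = b, viewed as SGD on
    # the least-squares objective: pick row i with probability proportional
    # to ||a_i||^2 (importance sampling), then project x onto the hyperplane
    # a_i . x = b_i.
    rng = random.Random(seed)
    norms = [sum(a * a for a in row) for row in A]
    x = [0.0] * len(A[0])
    for _ in range(iters):
        i = rng.choices(range(len(A)), weights=norms)[0]
        a = A[i]
        resid = b[i] - sum(a[j] * x[j] for j in range(len(x)))
        step = resid / norms[i]
        x = [x[j] + step * a[j] for j in range(len(x))]
    return x

# Consistent 2x2 system with exact solution x = (1, 2).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [4.0, 7.0]
x = randomized_kaczmarz(A, b)
```

For a consistent system this converges linearly to the exact solution; the paper's point is that for inconsistent systems the same scheme converges to a *weighted* least-squares solution unless the sampling is partially de-biased.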