
Selecting Parallel In-domain Sentences for Neural Machine Translation (NMT) Using Monolingual Texts

Authors: Javad Pourmostafa

Abstract

These slides are for my talk at the 13th TAISIG meeting, where I presented one of my papers, titled "Selecting Parallel In-domain Sentences for NMT Using Monolingual Texts". For more information about the TAISIG talks, visit: https://www.tilburguniversity.edu/research/institutes-and-research-groups/taisig
Selecting Parallel In-domain Sentences for
Neural Machine Translation (NMT) Using
Monolingual Texts
Javad Pourmostafa, 21 April 2022
Why and where do we use Machine Translation?
Assimilation – the reader initiates the translation and wants to know the content
  users are tolerant of inferior quality
  the focus of the majority of research
Communication – participants don't speak the same language and rely on translation
  users can ask questions when something is unclear
  chat room translations, hand-held devices
  often combined with speech recognition
Dissemination – a publisher wants to make content available in other languages
  high demands on quality
  currently almost exclusively done by human translators
Reference: Philipp Koehn, 2020
Training an NMT Model
Goal: translate from English to Dutch
• Data:
  Source sentence: e.g., This is an English sentence
  Target sentence: e.g., Dit is een Engelse zin
• Steps:
  1) generate the input data and embed it
  2) forward pass
  3) backward pass
  4) repeat steps 2 and 3 until convergence (see the sketch below)
[Figure: encoder–decoder architecture. The source sentence "this is an English sentence" passes through the src embedding layer into the encoder; the shifted target "<s> Dit is een Engelse zin" passes through the tgt embedding layer into the decoder, which predicts the next target token.]
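To make the four training steps concrete, here is a minimal sketch in PyTorch of one possible encoder–decoder training loop. The tiny GRU model, vocabulary sizes, hidden dimensions, and integer token ids are illustrative assumptions, not the architecture or data used in this work.

import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 16, 16, 32, 64  # toy sizes, purely illustrative

class TinyNMT(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)   # src embedding layer
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)   # tgt embedding layer
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, src_ids, tgt_in_ids):
        _, h = self.encoder(self.src_emb(src_ids))               # encode the source sentence
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in_ids), h)   # decode given the shifted target
        return self.out(dec_out)                                 # logits over the target vocabulary

# Step 1: generate the input data and embed it (the embedding layers sit inside the model).
src = torch.tensor([[1, 2, 3, 4, 5]])       # "this is an English sentence"
tgt_in = torch.tensor([[0, 6, 7, 8, 9]])    # "<s> Dit is een Engelse"
tgt_out = torch.tensor([[6, 7, 8, 9, 10]])  # "Dit is een Engelse zin"

model = TinyNMT()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Step 4: repeat steps 2 and 3 until convergence (a fixed number of epochs here).
for epoch in range(100):
    logits = model(src, tgt_in)                                   # Step 2: forward pass
    loss = loss_fn(logits.view(-1, TGT_VOCAB), tgt_out.view(-1))
    optimizer.zero_grad()
    loss.backward()                                               # Step 3: backward pass
    optimizer.step()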
Domain-specific Translation
more training data is better, but training MT systems on large data uses substantial amounts of resources (memory and time, for instance)
parallel data are not available for all language pairs, especially not for the domain of interest (e.g., COVID-19)
data-driven MT paradigms may perform poorly for domain-specific translation, regardless of the size of the given data
A two-sided challenge:
  What is high-quality, in-domain (ID) data?
  What amount of parallel ID data is necessary to achieve state-of-the-art (SOTA) MT quality at low computational and data capacities?
Our Contribution
The research community has made many efforts through Domain Adaptation (DA).
Our data selection method:
  Goal: improving ID translation in low-resource scenarios
  How? by selecting ID sentences from an Out-of-Domain (OOD) parallel corpus
  To what end? possibly to employ Domain Adaptation for NMT
Our main contribution:
  a language-agnostic data selection method for generating a parallel ID corpus
  using only a monolingual domain-specific corpus
  without requiring any translation
An Overview of the Proposed Methodology
[Figure: pipeline overview. A monolingual in-domain corpus (Source / Target, 179K sentences) and an out-of-domain parallel corpus (Source–Target, 31M sentence pairs) are embedded with S-BERT. A data selection algorithm then produces new parallel in-domain data sets (1 .. n; 179K Source–Target pairs), and NMT models are trained with these new in-domain data sets.]
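As a rough sketch of the embedding stage, assuming the sentence-transformers library, the two corpora could be encoded as follows; the S-BERT checkpoint name and file paths are placeholders, not necessarily those used in the paper.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any S-BERT checkpoint; the name is an assumption

# Monolingual in-domain corpus (~179K sentences) and the source side of the
# out-of-domain parallel corpus (~31M sentences); the paths are illustrative.
with open("indomain.mono.txt", encoding="utf-8") as f:
    indomain_sents = [line.strip() for line in f if line.strip()]
with open("outofdomain.src.txt", encoding="utf-8") as f:
    ood_src_sents = [line.strip() for line in f if line.strip()]

# encode() returns one dense vector per sentence; normalizing the embeddings
# makes cosine similarity a plain dot product in the selection step.
indomain_emb = model.encode(indomain_sents, batch_size=256,
                            convert_to_tensor=True, normalize_embeddings=True)
ood_emb = model.encode(ood_src_sents, batch_size=256,
                       convert_to_tensor=True, normalize_embeddings=True)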
Our Data Selection Algorithm
[Figure: the algorithm illustrated for one in-domain sentence.
step i) compute an embedding vector ([d1, d2, d3, d4, d5, …, d32]) for each of the 179K in-domain sentences and for each of the 31M out-of-domain sentences, then compute cosine similarity scores between the in-domain sentence and every out-of-domain sentence, e.g., cos(1, 1) = 30, cos(1, 2) = 70, …, cos(1, 31M) = 90.
step ii) sort the 31M out-of-domain sentences by similarity score in descending order, e.g., cos(1, 31M) = 90, cos(1, 2) = 70, …, cos(1, 1) = 30.
step iii) select the top n (e.g., n = 6) out-of-domain source–target pairs as the selected in-domain sentences.]
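A minimal sketch of steps i)–iii), assuming the sentence-transformers library and continuing from the embedding sketch above (indomain_emb, ood_emb, ood_src_sents). util.semantic_search scores every pair with cosine similarity and returns the hits already sorted, so ranking and top-n selection collapse into one call; the target-side file path, TOP_N value, and variable names are illustrative assumptions.

from sentence_transformers import util

# Target side of the out-of-domain parallel corpus, aligned line by line with ood_src_sents.
with open("outofdomain.tgt.txt", encoding="utf-8") as f:
    ood_tgt_sents = [line.strip() for line in f if line.strip()]

TOP_N = 6  # e.g., keep the 6 most similar out-of-domain pairs per in-domain sentence

# steps i) and ii): cosine-similarity scoring against all out-of-domain embeddings,
# returned in descending order of score for each in-domain query sentence.
hits_per_query = util.semantic_search(indomain_emb, ood_emb, top_k=TOP_N)

# step iii): map the ranked hits back to parallel out-of-domain sentence pairs,
# which together form the new parallel in-domain corpus.
new_indomain = []
for hits in hits_per_query:
    for hit in hits:
        idx = hit["corpus_id"]
        new_indomain.append((ood_src_sents[idx], ood_tgt_sents[idx], hit["score"]))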
An Example of Data Selection Output

Monolingual in-domain query: It can be a very complicated thing, the ocean.

Top1 – parallel in-domain data (score: 90.10/100)
EN: Ocean affairs are sensitive and complex.
FR: Les affaires maritimes sont délicates et complexes.

Top2 – parallel in-domain data (score: 86.60/100)
EN: Rip currents and undertow are common, dangerous conditions along ocean beaches.
FR: Déchirez les courants et les baïnes sont des conditions communes et dangereuses le long des plages d'océan.

Top3 – parallel in-domain data (score: 85.96/100)
EN: Moving with the waves can be dangerous.
FR: Il est dangereux de progresser avec la vague.

Top4 – parallel in-domain data (score: 85.76/100)
EN: Obstacles in the water are particularly dangerous when coupled with currents.
FR: Les obstacles dans l'eau sont avant tout dangereux par rapport au courant.

…
Results and Analysis
We showed:
  even though the NMT model employed enormous amounts of parallel data, it could not perform well for ID translation; more training data is not always sufficient for ID translation!
  mixing ID with OOD data and training from scratch yielded only a minor improvement.
  our work outperformed state-of-the-art methods in terms of translation quality.
  our generated in-domain data sets are relatively small, resulting in less training time.
Conclusion
We proposed a method to help the MT community mitigate the lack of parallel in-domain corpora:
  it can be used for any language pair
  the selected data can be applied directly to domain-specific NMT systems
The proposed selection pipeline:
  (i) contextual sentence embedding
  (ii) semantic search
  (iii) ranking of the in-domain data
Future work:
  employing the generated corpora in the context of DA.
Selecting Parallel In-domain Sentences for Neural Machine Translation
Using Monolingual Texts
Javad Pourmostafa, Dimitar Shterionov, Pieter Spronck
Computational Linguistics in the Netherlands Journal (CLIN), Vol. 11, 2022
Access to the Paper
Thank you!
https://ilk.uvt.nl/~pourmost/