Selecting Parallel In-domain Sentences for
Neural Machine Translation (NMT) Using
Monolingual Texts
Javad Pourmostafa, 21 April 2022
Why and where do we use Machine Translation?
• Assimilation – the reader initiates translation and wants to know the content
  • users are tolerant of inferior quality
  • focus of the majority of research
• Communication – participants don't speak the same language and rely on translation
  • users can ask questions when something is unclear
  • chat room translations, hand-held devices
  • often combined with speech recognition
• Dissemination – a publisher wants to make content available in other languages
  • high demands for quality
  • currently almost exclusively done by human translators
Reference: Philipp Koehn, 2020
Training an NMT Model
[Figure: an encoder–decoder NMT architecture. The source sentence "this is an English sentence" passes through the source embedding layer into the encoder; the target sentence "<s> Dit is een Engelse zin" passes through the target embedding layer into the decoder.]
• Goal: translate from English to Dutch
• Data:
  • Source sentence: e.g., This is an English sentence
  • Target sentence: e.g., Dit is een Engelse zin
• Steps (see the sketch below):
  • 1) generate input data and embed them
  • 2) forward pass
  • 3) backward pass
  • 4) repeat steps 2 and 3 until convergence
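As a concrete illustration of these four steps, here is a minimal sketch of a training loop, assuming a toy GRU encoder–decoder in PyTorch. The vocabularies, dimensions, hyperparameters, and the single English–Dutch sentence pair are illustrative assumptions, not the actual models or data used in this work.

import torch
import torch.nn as nn

# Step 1) generate input data and embed them: toy vocabularies and one sentence pair
src_vocab = {"<pad>": 0, "this": 1, "is": 2, "an": 3, "english": 4, "sentence": 5}
tgt_vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "dit": 3, "is": 4, "een": 5, "engelse": 6, "zin": 7}

src_ids = torch.tensor([[1, 2, 3, 4, 5]])        # "this is an english sentence"
tgt_in  = torch.tensor([[1, 3, 4, 5, 6, 7]])     # decoder input:  "<s> dit is een engelse zin"
tgt_out = torch.tensor([[3, 4, 5, 6, 7, 2]])     # decoder target: "dit is een engelse zin </s>"

emb_dim, hid_dim = 32, 64
src_emb = nn.Embedding(len(src_vocab), emb_dim)  # src embedding layer
tgt_emb = nn.Embedding(len(tgt_vocab), emb_dim)  # tgt embedding layer
encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
proj    = nn.Linear(hid_dim, len(tgt_vocab))     # maps decoder states to target-vocabulary logits

params = (list(src_emb.parameters()) + list(tgt_emb.parameters()) +
          list(encoder.parameters()) + list(decoder.parameters()) + list(proj.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=tgt_vocab["<pad>"])

# Step 4) repeat steps 2 and 3 until convergence (a fixed number of epochs here)
for epoch in range(100):
    # Step 2) forward pass: encode the source, decode with teacher forcing
    _, enc_state = encoder(src_emb(src_ids))
    dec_states, _ = decoder(tgt_emb(tgt_in), enc_state)
    logits = proj(dec_states)                                        # (batch, tgt_len, |tgt_vocab|)
    loss = loss_fn(logits.reshape(-1, len(tgt_vocab)), tgt_out.reshape(-1))

    # Step 3) backward pass: compute gradients and update all parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()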
Domain-specific Translation
• More training data is better
  • but training MT systems on large data uses substantial resources (e.g., memory and time)
• Parallel data are not available for all language pairs
  • especially not for the domain of interest (e.g., COVID-19)
• Data-driven MT paradigms may perform poorly for domain-specific translation
  • regardless of the size of the given data
• A two-sided challenge:
  • What is high-quality, in-domain (ID) data?
  • What amount of parallel ID data is necessary to achieve state-of-the-art (SOTA) MT quality at low computational and data capacities?
Our Contribution
• The research community has made many efforts to address this through Domain Adaptation (DA).
• Our data selection method:
  • Goal: improving ID translation in low-resource scenarios
  • How? by selecting ID sentences from an out-of-domain (OOD) parallel corpus
  • To what end? the selected data can then be used for Domain Adaptation of NMT
• Our main contribution:
  • a language-agnostic data selection method for generating a parallel ID corpus
  • using only a monolingual domain-specific corpus
  • without requiring any translation
An Overview of the Proposed Methodology
[Figure: the monolingual in-domain corpus (source or target side, 179K sentences) and the out-of-domain parallel corpus (source and target, 31M sentence pairs) are embedded with S-BERT; a data selection algorithm then produces new parallel in-domain source–target data sets 1..n (179K); NMT models are trained with the new in-domain data sets.]
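As an illustration, the embedding stage of this pipeline could be sketched as below. It assumes the sentence-transformers (S-BERT) library; the model name and file names are illustrative placeholders rather than the exact configuration used in this work.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # any pre-trained S-BERT model

# Monolingual in-domain sentences (e.g., 179K) and the source side of the
# out-of-domain parallel corpus (e.g., 31M), one sentence per line.
with open("in_domain.txt", encoding="utf-8") as f:
    id_sentences = [line.strip() for line in f]
with open("out_of_domain.src", encoding="utf-8") as f:
    ood_sentences = [line.strip() for line in f]

# One contextual embedding vector per sentence; normalising the vectors lets
# cosine similarity later be computed as a plain dot product.
id_emb = model.encode(id_sentences, batch_size=256, convert_to_numpy=True,
                      normalize_embeddings=True, show_progress_bar=True)
ood_emb = model.encode(ood_sentences, batch_size=256, convert_to_numpy=True,
                       normalize_embeddings=True, show_progress_bar=True)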
Our Data Selection Algorithm
[Figure: each of the 179K in-domain sentences and each of the 31M out-of-domain sentences is represented by an embedding vector [d1, d2, d3, ..., d32].
step i) for an in-domain sentence, compute a cosine similarity score against every out-of-domain sentence, e.g., cos(1, 1) = 30, cos(1, 2) = 70, ..., cos(1, 31M) = 90 (scores out of 100);
step ii) sort the 31M similarity scores in descending order, e.g., cos(1, 31M) = 90, cos(1, 2) = 70, cos(1, 1) = 30;
step iii) select the top n (e.g., n = 6) highest-scoring out-of-domain source–target pairs as the selected in-domain sentences.]
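Below is a minimal sketch of steps i)–iii), continuing from the embedding sketch above (it reuses id_emb, ood_emb, and ood_sentences defined there). The top-n value and the target-side file name are illustrative assumptions, and the paper's exact selection strategy may differ.

import numpy as np

# Target side of the out-of-domain parallel corpus, aligned line-by-line
# with ood_sentences (file name is an assumption).
with open("out_of_domain.tgt", encoding="utf-8") as f:
    ood_targets = [line.strip() for line in f]

top_n = 6            # e.g., keep the 6 best matches per in-domain sentence
selected = set()     # indices into the out-of-domain parallel corpus

for id_vec in id_emb:
    # step i) similarity scores: with normalised embeddings, cosine similarity
    # against all out-of-domain vectors is a single matrix-vector product
    scores = ood_emb @ id_vec

    # step ii) rank the out-of-domain sentences by score, highest first
    ranked = np.argsort(-scores)

    # step iii) keep the indices of the top-n matches for this in-domain sentence
    selected.update(ranked[:top_n].tolist())

# The selected indices point at aligned source/target pairs in the
# out-of-domain corpus; together they form the new parallel in-domain data set.
new_in_domain = [(ood_sentences[j], ood_targets[j]) for j in sorted(selected)]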
An Example of Data Selection Output
Monolingual in-domain sentence: It can be a very complicated thing, the ocean.

Top1 – parallel in-domain data (score: 90.10 / 100)
  EN: Ocean affairs are sensitive and complex.
  FR: Les affaires maritimes sont délicates et complexes.

Top2 – parallel in-domain data (score: 86.60 / 100)
  EN: Rip currents and undertow are common, dangerous conditions along ocean beaches.
  FR: Déchirez les courants et les baïnes sont des conditions communes et dangereuses le long des plages d'océan.

Top3 – parallel in-domain data (score: 85.96 / 100)
  EN: Moving with the waves can be dangerous.
  FR: Il est dangereux de progresser avec la vague.

Top4 – parallel in-domain data (score: 85.76 / 100)
  EN: Obstacles in the water are particularly dangerous when coupled with currents.
  FR: Les obstacles dans l'eau sont avant tout dangereux par rapport au courant.

…
Results and Analysis
• We showed:
  • even though the NMT model was trained on an enormous amount of parallel data, it could not perform well for ID translation
  • more training data is not always sufficient for ID translation!
  • mixing ID with OOD data and training from scratch gave only a minor improvement
  • our work outperformed state-of-the-art (SOTA) methods in terms of translation quality
  • our generated in-domain data sets are relatively small, resulting in less training time
Conclusion
• We proposed a method that helps the MT community mitigate the lack of parallel in-domain corpora
  • it can be used for any language pair
  • the selected data can be applied directly to domain-specific NMT systems
• The proposed selection pipeline consists of:
  • (i) contextual sentence embedding
  • (ii) semantic search
  • (iii) ranking of in-domain data
• Future work
  • employing the generated corpora in the context of Domain Adaptation (DA)