
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences

Devansh Gautam, Prashant Kodali, Kshitij Gupta, Anmol Goel††, Manish Shrivastava, Ponnurangam Kumaraguru†
International Institute of Information Technology Hyderabad
†Indraprastha Institute of Information Technology Delhi
††Guru Gobind Singh Indraprastha University, Delhi
{devansh.gautam,prashant.kodali,kshitij.gupta}@research.iiit.ac.in,
agoel00@gmail.com, m.shrivastava@iiit.ac.in, pk@iiitd.ac.in

Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 47–55
June 11, 2021. ©2021 Association for Computational Linguistics
https://doi.org/10.26615/978-954-452-056-4_007
Abstract

Code-mixed languages are very popular in multilingual societies around the world, yet the resources lag behind to enable robust systems on such languages. A major contributing factor is the informal nature of these languages, which makes it difficult to collect code-mixed data. In this paper, we propose our system for Task 1 of CALCS 2021 (https://code-switching.github.io/2021) to generate a machine translation system for English to Hinglish in a supervised setting. Translating in the given direction can help expand the set of resources for several tasks by translating valuable datasets from high-resource languages. We propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and fully utilize the pre-training of the model by transliterating the Roman Hindi words in the code-mixed sentences to Devanagari script. We evaluate how expanding the input by concatenating Hindi translations of the English sentences improves mBART's performance. Our system gives a BLEU score of 12.22 on the test set. Further, we perform a detailed error analysis of our proposed systems and explore the limitations of the provided dataset and metrics.
1 Introduction

Code-mixing is the mixing of two or more languages where words from different languages are interleaved with each other in the same conversation. (Code-switching is another term that slightly differs in its meaning but is often used interchangeably with code-mixing in the research community; we follow the same convention and use the two terms interchangeably in this paper.) It is a common phenomenon in multilingual societies across the globe. In the last decade, due to the increase in the popularity of social media and various online messaging platforms, there has been an increase in various forms of informal writing, such as emojis, slang, and the usage of code-mixed languages.

Due to the informal nature of code-mixing, code-mixed languages do not follow a prescriptively defined structure, and the structure often varies with the speaker. Nevertheless, some linguistic constraints (Poplack, 1980; Belazi et al., 1994) have been proposed that attempt to determine how languages mix with each other.
Given the increasing use of code-mixed languages by people around the globe, there is a growing need for research related to code-mixed languages. A significant challenge to research is that there are no formal sources like books or news articles in code-mixed languages, and studies have to rely on sources like Twitter or messaging platforms. Another challenge with Hinglish, in particular, is that there is no standard system of transliteration for Hindi words: individuals provide a rough phonetic transcription of the intended word, which often varies from person to person.

In this paper, we describe our systems for Task 1 of CALCS 2021, which focuses on translating English sentences to English-Hindi code-mixed sentences. The code-mixed language is often called Hinglish. It is commonly used in India because many bilingual speakers use both Hindi and English frequently in their personal and professional lives. The translation systems could be used to augment datasets for various Hinglish tasks by translating datasets from English to Hinglish. An example of a Hinglish sentence from the provided dataset (with small modifications) is shown below:

Hinglish Sentence: Bahut strange choice thi ye.
Gloss of Hinglish Sentence: Very [strange choice] was this.
English Sentence: This was a very strange choice.
We propose to fine-tune mBART for the given task by first transliterating the Hindi words in the target sentences from Roman script to Devanagari script to utilize its pre-training. We further translate the English input to Hindi using pre-existing models and show improvements in the translation using parallel sentences as input to the mBART model. The code for our systems, along with error analysis, is public (https://github.com/devanshg27/cm_translation).

The main contributions of our work are as follows:

• We explore the effectiveness of fine-tuning mBART to translate to code-mixed sentences by utilizing the Hindi pre-training of the model in Devanagari script. We further explore the effectiveness of using parallel sentences as input.

• We propose a normalized BLEU score metric to better account for the spelling variations in the code-mixed sentences.

• Along with BLEU scores, we analyze the code-mixing quality of the reference translations along with the generated outputs and propose that, for assessing code-mixed translations, measures of code-mixing should be part of evaluation and analysis.
The rest of the paper is organized as follows. We discuss prior work related to code-mixed language processing, machine translation, and synthetic generation of code-mixed data. We describe our translation systems and compare the performances of our approaches. We discuss the amount of code-mixing in the translations predicted by our systems and discuss some issues present in the provided dataset. We conclude with a direction for future work and highlight our main findings.
2 Background

Code-mixing occurs when a speaker switches between two or more languages in the context of the same conversation. It has become popular in multilingual societies with the rise of social media applications and messaging platforms.

In attempts to progress the field of code-mixed data, several code-switching workshops (Diab et al., 2014, 2016; Aguilar et al., 2018b) have been organized at notable conferences. Most of the workshops include shared tasks on various language understanding tasks like language identification (Solorio et al., 2014; Molina et al., 2016), NER (Aguilar et al., 2018a; Rao and Devi, 2016), IR (Roy et al., 2013; Banerjee et al., 2018), PoS tagging (Jamatia et al., 2016), sentiment analysis (Patra et al., 2018; Patwa et al., 2020), and question answering (Chandu et al., 2018).

Although these workshops have gained traction, the field lacks standard datasets to build robust systems. The small size of the datasets is a major factor that limits the scope of code-mixed systems.
Machine Translation refers to the use of software to translate text from one language to another. In the current state of globalization, translation systems have widespread applications and are consequently an active area of research.

Neural machine translation has gained popularity only in the last decade, while earlier works focused on statistical or rule-based approaches. Kalchbrenner and Blunsom (2013) first proposed a DNN model for translation, following which transformer-based approaches (Vaswani et al., 2017) have taken the stage. Some approaches utilize multilingual pre-training (Song et al., 2019; Conneau and Lample, 2019; Edunov et al., 2019; Liu et al., 2020); however, these works focus only on monolingual language pairs.

Although a large number of multilingual speakers in a highly populous country like India use English-Hindi code-mixed language, only a few studies (Srivastava and Singh, 2020; Singh and Solorio, 2018; Dhar et al., 2018) have attempted the problem. Enabling translation systems for this language pair can bridge the communication gap between many people and further improve the state of globalization in the world.
Synthetic code-mixed data generation is a plausible option to build resources for code-mixed language research and is a task very similar to translation. While translation focuses on retaining the meaning of the source sentence, generation is a simpler task requiring focus only on the quality of the synthetic data generated.

Pratapa et al. (2018) started by exploring linguistic theories to generate code-mixed data. Later works attempt the problem using several approaches including Generative Adversarial Networks (Chang et al., 2019), an encoder-decoder framework (Gupta et al., 2020), pointer-generator networks (Winata et al., 2019), and a two-level variational autoencoder (Samanta et al., 2019). Recently, Rizvi et al. (2021) released a tool to generate code-mixed data using parallel sentences as input.

                                         Train     Valid    Test
# of sentences                           8,060     942      960
# of tokens in source sentences          98,080    12,275   12,557
# of tokens in target sentences          101,752   12,611   -
# of Hindi tokens in target sentences    68,054    8,310    -
# of English tokens in target sentences  21,502    2,767    -
# of 'Other' tokens in target sentences  12,196    1,534    -

Table 1: The statistics of the dataset. We use the language tags predicted by the CSNLI library. Since the target sentences of the test set are not public, we do not provide its statistics.
3 System Overview

In this section, we describe our proposed systems for the task, which use mBART (Liu et al., 2020) to translate English to Hinglish.

3.1 Data Preparation

We use the dataset provided by the task organizers for our systems; the statistics of the dataset are provided in Table 1. Since the target sentences in the dataset contain Hindi words in Roman script, we use the CSNLI library (Bhat et al., 2017, 2018; https://github.com/irshadbhat/csnli) as a preprocessing step. It transliterates the Hindi words to Devanagari and also performs text normalization. We use the provided train:validation:test split, which is in the ratio 8:1:1.
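The preprocessing step can be summarized by the following minimal sketch. The helper `to_devanagari` is a hypothetical stand-in for the CSNLI pipeline (whose actual API we do not reproduce here); CSNLI additionally performs normalization internally.

```python
def preprocess_target(sentence: str, to_devanagari) -> str:
    """Transliterate romanized Hindi tokens to Devanagari; keep other tokens.

    `to_devanagari` is a hypothetical stand-in for the CSNLI pipeline: it is
    assumed to return the Devanagari form for a Hindi token and None for
    English/'Other' tokens.
    """
    out = []
    for token in sentence.split():
        devanagari = to_devanagari(token)
        out.append(devanagari if devanagari is not None else token)
    return " ".join(out)
```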
3.2 Model

We fine-tune mBART, a multilingual sequence-to-sequence denoising auto-encoder pre-trained using the BART (Lewis et al., 2020) objective on large-scale monolingual corpora of 25 languages, including English and Hindi. It uses a standard sequence-to-sequence Transformer architecture (Vaswani et al., 2017) with 12 encoder and 12 decoder layers, a model dimension of 1024, and 16 attention heads, resulting in roughly 680 million parameters. To train our systems efficiently, we prune mBART's vocabulary by removing the tokens which are not present in the provided dataset or in the dataset released by Kunchukuttan et al. (2018), which contains 1,612,709 parallel sentences for English and Hindi.
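A minimal sketch of the vocabulary pruning step is shown below. It keeps only the dictionary entries observed in the fine-tuning corpora, assuming the corpus files are already tokenized with mBART's SentencePiece model so their tokens match the dictionary entries; the file names in the usage comment are illustrative only.

```python
from collections import Counter

def prune_vocabulary(vocab: list[str], corpus_paths: list[str]) -> list[str]:
    """Keep only vocabulary entries that occur in the given corpora."""
    seen = Counter()
    for path in corpus_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                seen.update(line.split())
    return [token for token in vocab if seen[token] > 0]

# Illustrative usage with hypothetical file names:
# pruned = prune_vocabulary(mbart_vocab, ["calcs_train.spm.en", "iitb_parallel.spm"])
```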
We compare the following two strategies for fine-tuning mBART:

mBART-en: We fine-tune mBART on the train set, feeding the English sentences to the encoder and decoding Hinglish sentences. We use beam search with a beam size of 5 for decoding.

mBART-hien: We fine-tune mBART on the train set, feeding the English sentences along with their parallel Hindi translations to the encoder and decoding Hinglish sentences. To feed the data to the encoder, we concatenate the Hindi translation, followed by a separator token '##', followed by the English sentence (a sketch of this input construction is shown below). We use the Google NMT system (Wu et al., 2016; https://cloud.google.com/translate) to translate the English source sentences to Hindi. We again use beam search with a beam size of 5 for decoding.
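The encoder input for mBART-hien can be built as in the following sketch; the Hindi side is assumed to come from an external MT system (the Google NMT API call itself is not shown).

```python
SEP = "##"  # separator token between the two parallel inputs

def build_hien_input(english: str, hindi_translation: str) -> str:
    """Concatenate the Hindi translation, the separator, and the English source."""
    return f"{hindi_translation} {SEP} {english}"

# build_hien_input("This was a very strange choice.",
#                  "यह बहुत अजीब विकल्प था।")
# -> "यह बहुत अजीब विकल्प था। ## This was a very strange choice."
```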
3.3 Post-Processing

We transliterate the Hindi words in our predicted translations from Devanagari to Roman script. To transliterate a given Devanagari token, we try the following methods in order and use the first one that provides a transliteration (a sketch of this cascade follows the list):

1. When we transliterate the Hindi words in the target sentences from Roman to Devanagari (as discussed in Section 3.1), we store the most frequent Roman transliteration for each Hindi word in the train set. If the current Devanagari token's transliteration is available, we use it directly.

2. We use the publicly available Dakshina dataset (Roark et al., 2020), which has 25,000 Hindi words in Devanagari script along with their attested romanizations. If the current Devanagari token is available in the dataset, we use the transliteration with the maximum number of attestations.

3. We use the indic-trans library (Bhat et al., 2015; https://github.com/libindic/indic-trans) to transliterate the token from Devanagari to Roman script.
4 Experimental Setup

4.1 Implementation

We use the implementation of mBART available in the fairseq library (Ott et al., 2019; https://github.com/pytorch/fairseq). We fine-tune on 4 Nvidia GeForce RTX 2080 Ti GPUs with an effective batch size of 1024 tokens per GPU. We use the Adam optimizer (ε = 10⁻⁶, β₁ = 0.9, β₂ = 0.98) (Kingma and Ba, 2015) with 0.3 dropout, 0.1 attention dropout, 0.2 label smoothing, and polynomial decay learning rate scheduling. We fine-tune the model for 10,000 steps with 2,500 warm-up steps and a learning rate of 3 × 10⁻⁵. We validate the model after every epoch and select the checkpoint with the best BLEU score on the validation set. To train our systems efficiently, we prune mBART's vocabulary by removing the tokens which are not present in any of the datasets mentioned in the previous section.
4.2 Evaluation Metrics

We use the following two evaluation metrics for comparing our systems (a sketch of both computations follows the list):

1. BLEU: The BLEU score (Papineni et al., 2002) is the official metric used in the leader board. We calculate the score using the SacreBLEU library (Post, 2018; https://github.com/mjpost/sacrebleu) after lowercasing and tokenizing with the TweetTokenizer available in the NLTK library (Bird et al., 2009; https://www.nltk.org/).

2. BLEUnormalized: Instead of calculating the BLEU scores on texts where the Hindi words are transliterated to Roman script, we calculate the score on texts where the Hindi words are in Devanagari and the English words in Roman script. We transliterate the target sentences using the CSNLI library, and we use the outputs of our system before the post-processing step (Section 3.3). We again use the SacreBLEU library after lowercasing and tokenizing with the TweetTokenizer.
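The following sketch computes both metrics under the assumption that BLEUnormalized differs only in which version of the texts is passed in (Devanagari-normalized instead of romanized). SacreBLEU's internal tokenizer is disabled because the inputs are pre-tokenized.

```python
import sacrebleu
from nltk.tokenize import TweetTokenizer

_tokenizer = TweetTokenizer()

def _prepare(text: str) -> str:
    # Lowercase and tokenize with NLTK's TweetTokenizer, as described above.
    return " ".join(_tokenizer.tokenize(text.lower()))

def corpus_bleu(hypotheses: list[str], references: list[str]) -> float:
    """Corpus BLEU over lowercased, tweet-tokenized text. For BLEUnormalized,
    pass the Devanagari-normalized hypotheses and references instead."""
    hyps = [_prepare(h) for h in hypotheses]
    refs = [[_prepare(r) for r in references]]
    return sacrebleu.corpus_bleu(hyps, refs, tokenize="none").score
```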
Figure 1: Multiple Roman spellings for the same Hindi word. These spelling variations can cause the BLEU score to be low, even if the correct Hindi word is predicted.
5 Results

Model          Validation Set              Test Set
               BLEU   BLEUnormalized       BLEU   BLEUnormalized
mBART-en       15.3   18.9                 12.22  -
mBART-hien     14.6   20.2                 11.86  -

Table 2: Performance of our systems on the validation set and test set of the dataset. Since the target sentences of the test set are not public, we do not calculate the test scores ourselves; we report the BLEU scores of our systems on the test set from the official leader board.

Table 2 shows the BLEU scores of the outputs generated by our models described in Section 3.2. In Hinglish sentences, Hindi tokens are often transliterated to Roman script, which results in spelling variation. Since the BLEU score compares token/n-gram overlap between the candidate and reference translations, the lack of a canonical spelling for transliterated words reduces the BLEU score and can mischaracterize the quality of a translation. To estimate the variety in Roman spellings for a Hindi word, we perform normalization by back-transliterating the Hindi words in a code-mixed sentence to Devanagari and aggregating the number of different spellings for a single Devanagari token. Figure 1 shows the extent of this phenomenon in the dataset released as part of this shared task, and it is evident that there are Hindi words that have multiple Roman spellings. Thus, even if the model generates the correct Devanagari token, the BLEU score will be understated due to the spelling variation in the transliterated reference sentence. By back-transliterating Hindi tokens to Devanagari, the BLEUnormalized score thus provides a better representation of translation quality.
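The spelling-variation analysis behind Figure 1 can be reproduced with a sketch like the following, given (Roman, Devanagari) token pairs obtained from the back-transliteration step; the example tokens in the comment are illustrative.

```python
from collections import defaultdict

def count_spelling_variants(token_pairs):
    """Map each Devanagari word to the number of distinct Roman spellings
    observed for it. `token_pairs` yields (roman, devanagari) tuples."""
    variants = defaultdict(set)
    for roman, devanagari in token_pairs:
        variants[devanagari].add(roman.lower())
    return {word: len(spellings) for word, spellings in variants.items()}

# e.g. pairs like ("nahi", "नहीं"), ("nahin", "नहीं"), ("nhi", "नहीं")
# would yield {"नहीं": 3}.
```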
5.1 Error Analysis of Translations of the Test Set

                                                  mBART-en  mBART-hien
Mistranslated/Partially Translated                28        23
MWE/NER mistranslation                            7         4
Morphology/Case Marking/Agreement/Syntax Issues   13        2
No Error                                          52        71

Table 3: Error analysis of 100 randomly sampled translations from the test set for both the mBART-en and mBART-hien models.

Since the BLEU score primarily looks at n-gram overlaps, it does not provide any insight into the quality of the generated output or the errors therein. To analyse the quality of translations on the test set, we randomly sampled 100 sentences (> 10% of the test set) from the outputs generated by the two models, mBART-en and mBART-hien, and bucketed them into various categories. Table 3 shows the categories of errors and their corresponding frequencies. The Mistranslated/Partially Translated category indicates that the generated translation has little or no semantic resemblance to the source sentence. Sentences where Multi-Word Expressions/Named Entities are wrongly translated form the second category. The Morphology/Case Marking/Agreement/Syntax Issues category indicates sentences where most of the semantic content is faithfully captured in the generated output; however, errors on a grammatical level render the output less fluent. mBART-hien makes fewer errors when compared to mBART-en, but that can possibly be attributed to the fact that this model generates a higher number of Hindi tokens while being low in code-mixing quality, and thus makes fewer grammatical errors. A more extensive and fine-grained analysis of these errors will undoubtedly help improve the models' characterization, and we leave it for future improvements.
                 Avg. CMI Score   % of Sents. with CMI = 0
Train Gold       19.4             26.1%
Dev Gold         21.6             19.3%
mBART-en Dev     21.8             19.4%
mBART-hien Dev   16.9             30.0%
mBART-en Test    21.8             20.0%
mBART-hien Test  16.7             31.4%

Table 4: Avg. CMI scores and the percentage of sentences with CMI = 0. Train Gold and Dev Gold are calculated on the target sentences given in the dataset; the rest are calculated on the outputs generated by our models.

                       Validation Set   Test Set
mBART-en
# of English tokens    3,282 (25.5%)    3,571 (27.6%)
# of Hindi tokens      8,155 (63.4%)    8,062 (62.3%)
# of 'Other' tokens    1,435 (11.1%)    1,302 (10.1%)
mBART-hien
# of English tokens    2,462 (18.5%)    2,519 (18.8%)
# of Hindi tokens      9,471 (71.3%)    9,616 (72.0%)
# of 'Other' tokens    1,356 (10.2%)    1,233 (9.2%)

Table 5: The number of tokens of each language in our predicted translations. The language tags are based on the script of the token.
5.2 Code-Mixing Quality of Generated Translations

In the code-mixed machine translation setting, it is essential to observe the quality of the code-mixing in the generated translations. While BLEU scores indicate how close we are to the target translation in terms of n-gram overlap, a measure like the Code-Mixing Index (CMI) (Gambäck and Das, 2016) provides a means to assess whether the generated output is a mix of two languages or not. Relying on just the BLEU score for assessing translations can misrepresent the quality of translations, as models could generate monolingual outputs and still achieve a non-trivial BLEU score due to n-gram overlap. If a measure of code-mixing intensity, like CMI, is also part of the evaluation regime, we can assess the code-mixing quality of the generated outputs as well (a sketch of the CMI computation is shown below). Figure 2 shows the distribution of CMI for outputs generated by our models (mBART-en and mBART-hien) for both the validation and test sets.

Figure 2: Code-Mixing Index (CMI) for the generated translations of the dev and test sets.
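A sketch of the utterance-level CMI of Gambäck and Das (2016), computed from per-token language tags; tokens tagged 'Other' are treated as language-independent.

```python
def cmi(lang_tags: list[str], other: str = "Other") -> float:
    """Code-Mixing Index: 0 for monolingual utterances, higher for more
    mixed ones. `lang_tags` holds one language tag per token."""
    n = len(lang_tags)
    u = sum(1 for tag in lang_tags if tag == other)
    if n == u:  # no language-tagged tokens at all
        return 0.0
    counts: dict[str, int] = {}
    for tag in lang_tags:
        if tag != other:
            counts[tag] = counts.get(tag, 0) + 1
    return 100.0 * (1.0 - max(counts.values()) / (n - u))

# cmi(["hi", "en", "en", "hi", "Other"]) == 50.0; cmi(["en", "en"]) == 0.0
```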
Figure 2 and Table 4 show that the code-mixing quality of the two models is more or less similar across the validation and test sets. The high percentage of sentences having a CMI score of 0 shows that in a lot of sentences, the model does not actually perform code-mixing. We also find that even though the outputs generated by the mBART-hien model have a higher BLEUnormalized score, the average CMI is lower and the percentage of sentences with a CMI score of 0 is higher. This suggests that mBART-hien produces sentences with a lower amount of code-mixing. This observation, we believe, can be attributed to the mBART-hien model's propensity to generate a higher percentage of Hindi words, as shown in Table 5. We also find that in the train set, more than 20% of the sentences have a CMI score of 0. Replacing such samples with sentence pairs which have a higher degree of code-mixing will help train the model to generate better code-mixed outputs. Further analysis using different measures of code-mixing can provide deeper insights; we leave this for future work.
5.3 Erroneous Reference Translations in the Dataset

We randomly sampled 10% (900 sentence pairs) of the parallel sentences from the train and validation sets and annotated them for translation errors. For annotation, we classified the sentence pairs into one of two classes: 1) Error - the semantic content of the target is distorted as compared to the source; 2) No Error - the semantic content of the source and target is similar, though the target might have minor errors. Minor errors in translations that are attributable to agreement issues, case marker issues, pronoun errors, etc., were classified into the No Error bucket. Out of the 900 samples that were manually annotated, 141 samples, i.e., 15% of the annotated pairs, had targets whose meaning was distorted as compared to the source sentence. One such example is shown below:

English Sentence: I think I know the football player it was based on.
Hinglish Sentence: Muje lagtha ki yeh football player ke baare mein hein.
Translation of Hinglish Sentence: I thought that this is about the football player.

                                                  Num. of Pairs
Meaning of target similar to source               759
Meaning of target distorted compared to source    141
Total                                             900

Table 6: Statistics of the errors in the randomly sampled subset of train + dev.

Table 6 shows the analysis of this annotated subset. The annotated file with all 900 examples can be found in our code repository. Filtering such erroneous examples from the training and validation datasets, and augmenting the dataset with better quality translations, will certainly help in improving the translation quality.
6 Discussion

In this paper, we presented our approaches for English to Hinglish translation using mBART. We analyse our models' outputs and show that the translation quality can be improved by including parallel Hindi translations, along with the English sentences, while translating English sentences to Hinglish. We also discuss the limitations of using BLEU scores for evaluating code-mixed outputs and propose using BLEUnormalized, a slightly modified version of BLEU. To understand the code-mixing quality of the generated translations, we propose that a code-mixing measure, like CMI, should also be part of the evaluation process. Along with the working models, we have analysed the models' shortcomings by doing error analysis on the generated outputs. Further, we have also presented an analysis of the shared dataset: the percentage of sentences in the dataset which are not code-mixed, and the erroneous reference translations. Removing such pairs and replacing them with better samples will help improve the translation quality of the models.

As part of future work, we would like to improve our translation quality by augmenting the current dataset with parallel sentences that have a higher degree of code-mixing and good reference translations. We would also like to further analyse the nature of code-mixing in the generated outputs, and study the possibility of constraining the models to generate translations with a certain degree of code-mixing.
References
Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona
Diab, Julia Hirschberg, and Thamar Solorio. 2018a.
Named entity recognition on code-switched data:
Overview of the CALCS 2018 shared task. In
Proceedings of the Third Workshop on Compu-
tational Approaches to Linguistic Code-Switching,
pages 138–147, Melbourne, Australia. Association
for Computational Linguistics.
Gustavo Aguilar, Fahad AlGhamdi, Victor Soto,
Thamar Solorio, Mona Diab, and Julia Hirschberg,
editors. 2018b. Proceedings of the Third Workshop
on Computational Approaches to Linguistic Code-
Switching. Association for Computational Linguis-
tics, Melbourne, Australia.
Somnath Banerjee, Kunal Chakma, Sudip Kumar
Naskar, Amitava Das, Paolo Rosso, Sivaji Bandy-
opadhyay, and Monojit Choudhury. 2018. Overview
of the mixed script information retrieval (msir) at
fire-2016. In Text Processing, pages 39–49, Cham.
Springer International Publishing.
Hedi M. Belazi, Edward J. Rubin, and Almeida Jacqueline Toribio. 1994. Code switching and X-bar theory: The functional head constraint. Linguistic Inquiry, 25(2):221–237.
Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and
Dipti Sharma. 2017. Joining hands: Exploiting
monolingual treebanks for parsing of code-mixing
data. In Proceedings of the 15th Conference of the
European Chapter of the Association for Computa-
tional Linguistics: Volume 2, Short Papers, pages
324–330, Valencia, Spain. Association for Computa-
tional Linguistics.
Irshad Bhat, Riyaz A. Bhat, Manish Shrivastava, and
Dipti Sharma. 2018. Universal Dependency parsing
for Hindi-English code-switching. In Proceedings
of the 2018 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long Pa-
pers), pages 987–998, New Orleans, Louisiana. As-
sociation for Computational Linguistics.
Irshad Ahmad Bhat, Vandan Mujadia, Aniruddha Tam-
mewar, Riyaz Ahmad Bhat, and Manish Shrivastava.
2015. Iiit-h system submission for fire2014 shared
task on transliterated search. In Proceedings of the
Forum for Information Retrieval Evaluation, FIRE
’14, pages 48–53, New York, NY, USA. ACM.
Steven Bird, Ewan Klein, and Edward Loper.
2009. Natural Language Processing with Python.
O’Reilly Media.
Khyathi Chandu, Ekaterina Loginova, Vishal Gupta,
Josef van Genabith, Günter Neumann, Manoj Chin-
nakotla, Eric Nyberg, and Alan W. Black. 2018.
Code-mixed question answering challenge: Crowd-
sourcing data and techniques. In Proceedings of
the Third Workshop on Computational Approaches
to Linguistic Code-Switching, pages 29–38, Mel-
bourne, Australia. Association for Computational
Linguistics.
Ching-Ting Chang, Shun-Po Chuang, and Hung-Yi
Lee. 2019. Code-Switching Sentence Generation
by Generative Adversarial Networks and its Appli-
cation to Data Augmentation. In Proc. Interspeech
2019, pages 554–558.
Alexis Conneau and Guillaume Lample. 2019. Cross-
lingual language model pretraining. In Advances in
Neural Information Processing Systems, volume 32.
Curran Associates, Inc.
Mrinal Dhar, Vaibhav Kumar, and Manish Shrivastava.
2018. Enabling code-mixed translation: Parallel cor-
pus creation and MT augmentation approach. In
Proceedings of the First Workshop on Linguistic
Resources for Natural Language Processing, pages
131–140, Santa Fe, New Mexico, USA. Association
for Computational Linguistics.
Mona Diab, Pascale Fung, Mahmoud Ghoneim, Ju-
lia Hirschberg, and Thamar Solorio, editors. 2016.
Proceedings of the Second Workshop on Computa-
tional Approaches to Code Switching. Association
for Computational Linguistics, Austin, Texas.
Mona Diab, Julia Hirschberg, Pascale Fung, and
Thamar Solorio, editors. 2014. Proceedings of the
First Workshop on Computational Approaches to
Code Switching. Association for Computational Lin-
guistics, Doha, Qatar.
Sergey Edunov, Alexei Baevski, and Michael Auli.
2019. Pre-trained language model representations
for language generation. In Proceedings of the 2019
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short
Papers), pages 4052–4059, Minneapolis, Minnesota.
Association for Computational Linguistics.
Björn Gambäck and Amitava Das. 2016. Comparing
the level of code-switching in corpora. In Proceed-
ings of the Tenth International Conference on Lan-
guage Resources and Evaluation (LREC’16), pages
1850–1855, Portorož, Slovenia. European Language
Resources Association (ELRA).
Deepak Gupta, Asif Ekbal, and Pushpak Bhattacharyya.
2020. A semi-supervised approach to generate the
code-mixed text using pre-trained encoder and trans-
fer learning. In Findings of the Association for Com-
putational Linguistics: EMNLP 2020, pages 2267–
2280, Online. Association for Computational Lin-
guistics.
Anupam Jamatia, Björn Gambäck, and Amitava Das.
2016. Collecting and annotating indian social me-
dia code-mixed corpora. In International Confer-
ence on Intelligent Text Processing and Computa-
tional Linguistics, pages 406–417. Springer.
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent
continuous translation models. In Proceedings of
the 2013 Conference on Empirical Methods in Natu-
ral Language Processing, pages 1700–1709, Seattle,
Washington, USA. Association for Computational
Linguistics.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhat-
tacharyya. 2018. The IIT Bombay English-Hindi
parallel corpus. In Proceedings of the Eleventh In-
ternational Conference on Language Resources and
Evaluation (LREC 2018), Miyazaki, Japan. Euro-
pean Language Resources Association (ELRA).
Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
jan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer.
2020. BART: Denoising sequence-to-sequence pre-
training for natural language generation, translation,
and comprehension. In Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 7871–7880, Online. Association
for Computational Linguistics.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey
Edunov, Marjan Ghazvininejad, Mike Lewis, and
Luke Zettlemoyer. 2020. Multilingual denoising
pre-training for neural machine translation.
Giovanni Molina, Fahad AlGhamdi, Mahmoud
Ghoneim, Abdelati Hawwari, Nicolas Rey-
Villamizar, Mona Diab, and Thamar Solorio.
2016. Overview for the second shared task on
language identification in code-switched data. In
Proceedings of the Second Workshop on Computa-
tional Approaches to Code Switching, pages 40–49,
Austin, Texas. Association for Computational
Linguistics.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela
Fan, Sam Gross, Nathan Ng, David Grangier, and
Michael Auli. 2019. fairseq: A fast, extensible
toolkit for sequence modeling. In Proceedings of
the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics
(Demonstrations), pages 48–53, Minneapolis, Min-
nesota. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic eval-
uation of machine translation. In Proceedings of
the 40th Annual Meeting of the Association for Com-
putational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.
Braja Gopal Patra, Dipankar Das, and Amitava Das.
2018. Sentiment analysis of code-mixed indian lan-
guages: An overview of sail_code-mixed shared task
@icon-2017.
Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj
Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy
Chakraborty, Thamar Solorio, and Amitava Das.
2020. SemEval-2020 task 9: Overview of senti-
ment analysis of code-mixed tweets. In Proceed-
ings of the Fourteenth Workshop on Semantic Eval-
uation, pages 774–790, Barcelona (online). Interna-
tional Committee for Computational Linguistics.
Shana Poplack. 1980. Sometimes I'll start a sentence in Spanish y termino en español: toward a typology of code-switching. Linguistics, 18:581–618.
Matt Post. 2018. A call for clarity in reporting BLEU
scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, pages 186–
191, Brussels, Belgium. Association for Computa-
tional Linguistics.
Adithya Pratapa, Gayatri Bhat, Monojit Choudhury,
Sunayana Sitaram, Sandipan Dandapat, and Kalika
Bali. 2018. Language modeling for code-mixing:
The role of linguistic theory based synthetic data. In
Proceedings of the 56th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 1543–1553, Melbourne, Aus-
tralia. Association for Computational Linguistics.
Pattabhi R. K. Rao and S. Devi. 2016. Cmee-il: Code
mix entity extraction in indian languages from social
media text @ fire 2016 - an overview. In FIRE.
Mohd Sanad Zaki Rizvi, Anirudh Srinivasan, Tanuja
Ganu, Monojit Choudhury, and Sunayana Sitaram.
2021. GCM: A toolkit for generating synthetic
code-mixed text. In Proceedings of the 16th Con-
ference of the European Chapter of the Association
for Computational Linguistics: System Demonstra-
tions, pages 205–211, Online. Association for Com-
putational Linguistics.
Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov,
Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, and
Keith Hall. 2020. Processing South Asian languages
written in the Latin script: the dakshina dataset.
In Proceedings of the 12th Language Resources
and Evaluation Conference, pages 2413–2423, Mar-
seille, France. European Language Resources Asso-
ciation.
Rishiraj Saha Roy, Monojit Choudhury, Prasenjit Ma-
jumder, and Komal Agarwal. 2013. Overview of
the fire 2013 track on transliterated search. In Post-
Proceedings of the 4th and 5th Workshops of the Fo-
rum for Information Retrieval Evaluation, FIRE ’12
& ’13, New York, NY, USA. Association for Com-
puting Machinery.
Bidisha Samanta, Sharmila Reddy, Hussain Jagirdar,
Niloy Ganguly, and Soumen Chakrabarti. 2019.
A deep generative model for code switched text.
In Proceedings of the Twenty-Eighth International
Joint Conference on Artificial Intelligence, IJCAI-
19, pages 5175–5181. International Joint Confer-
ences on Artificial Intelligence Organization.
Thoudam Doren Singh and Thamar Solorio. 2018. To-
wards translating mixed-code comments from social
media. In Computational Linguistics and Intelligent
Text Processing, pages 457–468, Cham. Springer In-
ternational Publishing.
Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, and Pascale Fung. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 62–72, Doha, Qatar. Association for Computational Linguistics.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-
Yan Liu. 2019. MASS: Masked sequence to se-
quence pre-training for language generation. In Pro-
ceedings of the 36th International Conference on
Machine Learning, volume 97 of Proceedings of Ma-
chine Learning Research, pages 5926–5936. PMLR.
Vivek Srivastava and Mayank Singh. 2020. PHINC:
A parallel Hinglish social media code-mixed cor-
pus for machine translation. In Proceedings of the
Sixth Workshop on Noisy User-generated Text (W-
NUT 2020), pages 41–49, Online. Association for
Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Genta Indra Winata, Andrea Madotto, Chien-Sheng
Wu, and Pascale Fung. 2019. Code-switched lan-
guage models using neural based synthetic data from
parallel sentences. In Proceedings of the 23rd Con-
ference on Computational Natural Language Learn-
ing (CoNLL), pages 271–280, Hong Kong, China.
Association for Computational Linguistics.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.