Script-Agnostic Language Identification
Milind Agarwal, Joshua Otten, and Antonios Anastasopoulos
George Mason University
{magarwa,jotten4,antonis}@gmu.edu
Abstract
Language identification is used as the first step
in many data collection and crawling efforts
because it allows us to sort online text into
language-specific buckets. However, many
modern languages, such as Konkani, Kash-
miri, Punjabi etc., are synchronically written
in several scripts. Moreover, languages with
different writing systems do not share signif-
icant lexical, semantic, and syntactic proper-
ties in neural representation spaces, which is
a disadvantage for closely related languages
and low-resource languages, especially those
from the Indian Subcontinent. To counter
this, we propose learning script-agnostic rep-
resentations using several different experimen-
tal strategies (upscaling, flattening, and script
mixing) focusing on four major Dravidian lan-
guages (Tamil, Telugu, Kannada, and Malay-
alam). We find that word-level script ran-
domization and exposure to a language writ-
ten in multiple scripts is extremely valuable for
downstream script-agnostic language identifi-
cation, while also maintaining competitive per-
formance on naturally occurring text.1
1 Introduction
In many natural language processing (NLP) tasks
or data creation efforts, we often need to first iden-
tify the source language of a particular text. For
instance, automated translation, part-of-speech
(POS) tagging, and web scraping for data collec-
tion must typically identify the text’s language be-
fore performing the given task. The languages in-
volved might occur in non-standard scripts, but
as we show in this paper, modern systems are
heavily script-dependent in language identification
(langID). The result is that most current methods
are unable to account for languages written in
non-standard scripts. Moreover, script diversity
is especially common in low-resource languages.
1Data, models and code available here:
https://github.com/Joshua-Otten/Script-Agnostic-Lang-ID
Many bilingual communities choose to write their
minority language in the region’s dominant sys-
tem (such as those in Pakistan, Iran, China), in-
stead of their language’s traditional writing sys-
tem (Ahmadi et al.,2023a). It is also common
for larger standardized languages to be roman-
ized on the internet and in social media. Finally,
some languages simply do not possess one stan-
dard script, and are written in multiple writing sys-
tems. For instance, the Western-Indian Konkani
language is actively written in up to 5 scripts: De-
vanagari, Romi, Kannada, Malayalam, and Perso-
Arabic (Lehal and Saini,2014;Rajan,2014). How-
ever, most Konkani systems only support Devana-
gari and Romi scripts, and would not recognize the
language if written in the other three. This illus-
trates the need to have script-agnosticism so we
can collect high-quality data for low-resource lan-
guages, and support their script-diverse nature in
NLP applications.
Script-agnostic langID is expected to be most
useful for closely related languages that currently
do not use the same script and where languages
often have unique scripts - a scenario most com-
monly occurring in the Indian Subcontinent. In
this paper, we conduct a case study on script-
agnosticism for language identification by focus-
ing on the four major Dravidian languages: Tamil,
Telugu, Kannada, and Malayalam. We explore
three different methods of training script-agnostic
embeddings, evaluate on the langID task across do-
mains, and offer insights for future work. Broadly,
we attempt to answer the following research ques-
tions:
1. What impact does training on transliterated
corpora have on downstream langID?
2. How does projecting to one script or upscal-
ing to multiple scripts impact performance?
3. What impact does intra-sentence script mix-
ing have on language identification?
[Figure 1: three panels illustrating Upscaling (all scripts for each language), Flattening (only one script for the entire corpus), and Mixing (more than one script within a sentence), with example sentences such as a Telugu sentence ("telugulo") rendered in all 4 scripts, Kannada ("kannadadallide") and Tamil ("ithu tamizhil") rendered only in Telugu script, and a Telugu sentence in mixed scripts.]
Figure 1: In upscaling, we transliterate each sentence into the other scripts to expose the model to data in that language
in all 4 writing systems. For flattening, we aim to reduce this potential vocabulary overload and project all scripts
into 1 script per experiment. The goal is to identify whether any of the 4 scripts is a suitable target script for all languages.
Each writing system has a unique number of total letters even though there is large overlap (Table 1), and we think
that this may make one script or another a better target for projection. For the final mixing setup, we
transliterate at the word level instead (at different noise levels) and allow multiple scripts per sentence.
2 Methods
Script Flattening Under this setup, we want to
explore whether the embedding space will bene-
fit from seeing all the languages in only one com-
mon script (Figure 1). The idea behind flattening
the script space from four to one is that with only
one script, the embedding space (and consequently
the classification system) can focus on finding dis-
criminative features between the languages. It is
worth noting that training word representations in a
single script may perform poorly in real-world set-
tings and may not be a practical choice since text
will naturally appear in different scripts and will
require transliteration as preprocessing. However,
this experiment is useful to quantify the role that
script plays in language identification, compared
to the non-visual distinguishing features of the lan-
guages.
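To make the preprocessing concrete, the following is a minimal sketch of how a flattened training file could be assembled; the __label__ format matches fastText's supervised input, while transliterate_to() and the script names are placeholders for the Aksharamukha-based conversion described in Section 3, not our exact pipeline.

    # Minimal sketch: project every language into one target script before training.
    LANGS = {"tam": "Tamil", "kan": "Kannada", "mal": "Malayalam", "tel": "Telugu"}
    TARGET_SCRIPT = "Telugu"  # one flattening experiment per target script

    def flatten(sentences_by_lang, transliterate_to):
        """Return fastText-style lines with every sentence rendered in TARGET_SCRIPT."""
        lines = []
        for iso, native_script in LANGS.items():
            for sent in sentences_by_lang[iso]:
                text = transliterate_to(native_script, TARGET_SCRIPT, sent)
                lines.append(f"__label__{iso} {text}")
        return lines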
Script Upscaling This method takes a given
training example written in one script and “up-
scales” it into all 4 scripts (Figure 1). Our intu-
ition is that seeing every example in each script
will prevent a model from giving weight to any
one writing system in its decision-making, forcing
it to rely on inherent features of the language. In
other words, we teach the model that a sentence of
a given language could be written in any script, so
that it learns not to discriminate on the basis of writ-
ing system. This contrasts with the approach taken
in Brown (2012), where each language-script pair
is given a unique language model and their scores
are used to make the final classification decision.
For our setup, we first created four training files,
one per language, where each file includes all of
that language's training examples four times, once
in each script. Then we concatenated all of these
files into one training set. Therefore, our model
assumes that a sentence may appear in any of the
four writing systems with the same likelihood.
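As an illustration, here is a minimal sketch of the upscaling augmentation under the same assumptions as above (transliterate_to() standing in for the Aksharamukha call, fastText-style __label__ lines); it is a sketch rather than the exact training script.

    SCRIPTS = ["Tamil", "Kannada", "Malayalam", "Telugu"]

    def upscale(iso, native_script, sentences, transliterate_to):
        # Emit each sentence once per script, so the model sees every example
        # in all four writing systems with equal likelihood.
        lines = []
        for sent in sentences:
            for target in SCRIPTS:
                text = sent if target == native_script else transliterate_to(native_script, target, sent)
                lines.append(f"__label__{iso} {text}")
        return lines  # 4x the original number of examples for this language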
Noisy Multi-Script Setup In the final setup, we
create synthetic sentences following Algorithm 1
(Appendix C) and Figure 1 for both FLORES200
data splits. Under this approach, for each noise
level n, language lang, and sentence sent, we
choose a base script and then randomly pick n%
words to transform to new non-base scripts. We
train separate text classification fastText models
on each of these noisy datasets and evaluate them
on test sets with clean, noisy, and merged datasets.
This is to evaluate out-of-distribution generaliza-
tion and robustness, and the potential usefulness
of including noise during the training process. We
perform this experiment with permutations of 25%,
50%, 75%, and 100% script-noise levels in the
training data as done in Ahmadi et al. (2023b). Fi-
nally, we train an “All-Noise” model on merged
data from all these script-noise levels.
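A hedged sketch of the word-level script-mixing step (corresponding to Algorithm 1 in Appendix C) is shown below; transliterate_to() is again a placeholder for the transliteration call, and the sampling details of our actual implementation may differ.

    import random

    SCRIPTS = ["Tamil", "Kannada", "Malayalam", "Telugu"]

    def mix_scripts(sentence, base_script, noise, transliterate_to, rng=random):
        # Transliterate roughly `noise` percent of the words into randomly
        # chosen non-base scripts, keeping the rest in the base script.
        words = sentence.split()
        k = round(len(words) * noise / 100)
        for i in rng.sample(range(len(words)), k):
            target = rng.choice([s for s in SCRIPTS if s != base_script])
            words[i] = transliterate_to(base_script, target, words[i])
        return " ".join(words)

    # One noisy copy per noise level (25, 50, 75, 100); concatenating all levels
    # gives the merged training set used for the "All-Noise" model.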
Language 639-3 Family Script Script Code Vowels Consonants
Tamil tam Southern தமிழ் Taml 12 18
Kannada kan Southern ಕನ್ನಡ Knda 16 35
Telugu tel South-Central తెలుగు Telu 16 36
Malayalam mal Southern മലയാളം Mlym 15 42
Table 1: A summary of the characteristics of the four Dravidian languages we study in our experiments. All four
languages use abugidas (alphasyllabaries) for writing and are written from left to right with diacritics. Note that
Tamil has the fewest overall graphemes whereas Malayalam has the most. The last two columns indicate common
vowels and consonants in the language, but each script comes with extended grapheme sets to accommodate other
Indian-language phonemes.
IPA ISO Tamil
/ka/ ka க
/kha/ kha க₂
/ga/ ga க₃
/gha/ gha க₄
Table 2: Tamil has only one letter to represent the above-
mentioned 4 sounds common in the other 3 Dravidian
languages. So, the transliterator introduces subscripts
to differentiate the four sounds in the source script.
There are 5 such character series but we only show the
velar phonemes’ series.
3 Experiments
Dataset and Languages We use the FLO-
RES200 dataset (Costa-jussà et al.,2024;Goyal
et al.,2022;Guzmán et al.,2019) for training and
in-domain testing in all our experiments. In order
to ensure that our models would work well on test
data that was not simply from FLORES200, we
also tested on three out-of-domain sets: GlotStory-
Books (Kargaran et al.,2023), UDHR (Kargaran
et al.,2023), and MCS-350 (Agarwal et al.,2023).
We do not transliterate these datasets since the goal
is to measure potential performance drops on natu-
rally occurring text compared to traditional models.
We also use a subset of monolingual data from In-
dicCorp (Kakwani et al.,2020) for an experiment
involving non-parallel training in §4.2. For this
paper, we explore script-agnosticism for 4 major
languages (Table 1) that fall within the same lan-
guage family and use four distinct writing systems.
Details about each of the datasets are available in
Appendix A and language profiles in Appendix B.
Transliteration We use the Aksharamukha2
python package to transliterate between our four
2https://pypi.org/project/aksharamukha/
Model Acc N
CLD3 (Salcianu et al.,2020) 0.98 101
langid.py (Lui and Baldwin,2012) 1.0 97
Franc 1.0 419
fastText (Joulin et al.,2017) 1.0 176
HeLI-OTS (Jauhiainen et al.,2022) 0.99 200
Table 3: This table shows different popular language
identification systems, their accuracies on FLORES-200
and the number of supported languages (N). We chose
fastText as our root model since it achieves a high
accuracy, supports many languages, and can be easily
trained from scratch.
writing systems. Since the library is primarily de-
signed for Indic writing systems, it provides an ex-
tremely low-loss 1:1 transliteration, which is suit-
able for our purposes. This 1:1 mapping is pos-
sible because Indic writing systems descend from
a shared ancestor - Brahmi script, and they all
have unique and mappable graphemes for different
phonemes. The only exception (across all Indian
writing systems) is Tamil script, which also de-
scends from Brahmi, but in its modern form, it uses
one grapheme to represent aspirated and unaspi-
rated or voiced and unvoiced versions of a sound.
Aksharamukha adds subscripts (see Appendix Ta-
ble 2) to differentiate these sounds, but we remove
them during preprocessing as they are only found
in Tamil writing in literary and classical settings.
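For illustration, a minimal sketch of this step is shown below, assuming the aksharamukha package's transliterate.process(source, target, text) interface and the subscript characters shown in Table 2; the input sentence and the details of our actual preprocessing are illustrative.

    import re
    from aksharamukha import transliterate

    telugu_sentence = "తెలుగులో ఒక వాక్యం"  # hypothetical input sentence
    # Convert Telugu script to Tamil script; Aksharamukha marks sounds Tamil does
    # not distinguish with subscript digits (e.g., க₂, க₃, க₄).
    tamil_text = transliterate.process("Telugu", "Tamil", telugu_sentence)
    # Strip the subscripts during preprocessing, as described above.
    tamil_text = re.sub(r"[₂₃₄]", "", tamil_text)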
Training Model Choice Table 3 shows the per-
formance of commonly used off-the-shelf langID
models on the FLORES200 dev set. Out of the
three highest performing models (with F1 score of
1.0) in Table 3, only fastText (Bojanowski et al.,
2017) and langid.py can be trained with custom
files. fastText, trained on data from Wikipedia,
Tatoeba and SETimes, supports a wider number
of languages in its base model, is known to work
well with unknown words, and is very easy to
train. Therefore, fastText will serve as the train-
ing model for our experiments. Macro F1 score is
computed across all 4 languages to identify a sys-
tem with the best overall coverage and accuracy.
Moreover, fastText provides an efficient way
to glean subword information and is known to better
handle out-of-vocabulary words. Like all other
language identification systems, it does not come
with script-agnosticism support. All experiments
were run on CPU due to fastText’s optimizations.
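As a reference point, a minimal sketch of training and querying a fastText classifier on one of our preprocessed files is shown below; the file name and hyperparameters are illustrative assumptions, not the exact settings used in our experiments.

    import fasttext

    # Each line of the training file: "__label__<iso-code> <sentence>"
    model = fasttext.train_supervised(
        input="flores200_upscaled.train",  # hypothetical preprocessed file
        epoch=25, lr=0.1, wordNgrams=2,
        minn=2, maxn=5,                    # character n-gram (subword) range
    )
    labels, probs = model.predict("ಇದು ಕನ್ನಡದಲ್ಲಿದೆ")  # top-1 prediction
    print(labels[0], round(float(probs[0]), 3))
    model.save_model("script_agnostic_langid.bin")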
Training and Evaluation We obtain our results
based on the original versions and transliterations
of the test sets provided by FLORES200, us-
ing fastText skipgram models on a downstream
language identification task (extrinsic evaluation).
For evaluation, while F1 scores are popular in
langID studies, they are harder to interpret and mainly
offer advantages when there is a class imbalance
in the data distribution. Our training and test
sets are evenly distributed across the four languages,
so we report top-1 accuracy, which is appropriate
here and easier to interpret.
Three Baseline Models Our first baseline model
(FLORES200) was trained on the raw language
.dev files from FLORES200. We chose this as a
baseline, given that it represents an easy and intu-
itive approach to training a language classification
model, without any augmentation or modifications.
Our second baseline (SEPARATE) keeps all script-
language pairs separate during training and classi-
fication (Brown,2012). That is, for 4 languages
and 4 unique scripts, we’ll end up with 16 total pre-
diction classes. For reporting accuracies, we pool
together results from all scripts for each language.
Our third baseline (WIKI) is a language identifi-
cation model pre-trained on Wikipedia, SETimes,
and Tatoeba, boasting support for 176 languages
(Joulin et al.,2017). Note that due to this large dis-
crepancy in training data size, its performance will
not be directly comparable to other models.
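For the SEPARATE baseline, predictions over the 16 script-language classes are pooled back to the 4 languages before scoring; a minimal sketch is below, with the label format (e.g., __label__tam_Knda) assumed purely for illustration.

    def pool_to_language(label):
        # e.g. "__label__tam_Knda" -> "__label__tam" (hypothetical label scheme)
        return label.rsplit("_", 1)[0]

    def pooled_accuracy(predicted_labels, gold_language_labels):
        correct = sum(pool_to_language(p) == g
                      for p, g in zip(predicted_labels, gold_language_labels))
        return correct / len(gold_language_labels)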
4 Results
We present our results for the Baseline, Flatten-
ing, Upscaling, and Noisy models here. In gen-
eral, our script-agnostic models demonstrate good
performance above the baselines on the transliterated
test sets, and our methods are comparable to
traditional approaches on clean data.
4.1 Script Flattening
Under the Flattening experimental setup, even
though certain languages have higher accuracies
than others, each language appears to have com-
parable performance across scripts (Table 4). For
instance, Tamil sees 80% accuracy on all flattened
tests; in fact, each language’s scores vary less than
one percent when flattening to any given script.
The uniformity across scripts suggests that any par-
ticular script does not play a major role in the mod-
els’ decision-making. This matches and confirms
our initial hypotheses, since there is no alternative
script for the model to consider when evaluating
language identity. Upon comparison with the base-
line, our flattened models are far superior both in
unconventional script scenarios, and when aver-
aged across the four languages. In some cases, the
baseline only classifies correctly 25% of the time,
while our models consistently perform with over
90% average accuracy on the transliterated FLO-
RES200 test set. With respect to individual lan-
guage scores, the baseline classifies with slightly
more accuracy when language and writing system
match, but this is merely due to its heavy reliance
on script, and does not speak to its overall perfor-
mance. When script and language are not the same,
the baseline is easily fooled; for example, in many
cases it cannot classify even a single example cor-
rectly for certain languages.
Interpretability Analysis Interestingly, there is
a difference in performance across the individ-
ual language scores for both models, where they
correctly identify certain languages more often
than others. For example, Malayalam scores
near 100%, while Tamil is only correctly classi-
fied 80% of the time. To interpret differences
in accuracy scores across languages, we utilize
a game-theoretic metric, Shapley Additive Expla-
nations, or SHAP (Lundberg and Lee,2017), to
compute global-level explanations across the train-
ing dataset. We focus on finding explanations
for false positive features in Tamil sentences that
have been predicted as Malayalam. We obtained
translations for Tamil using Agarathi4 and Google
Translate, and for Malayalam using Google Trans-
late and Olam5. Appendix Table 10 displays all
the relevant words and characters in mispredicted
Tamil sentences. While not all positively weighted
4https://agarathi.com. அகராதி/agarathi means
dictionary in Tamil
5Malayalam Dictionary - https://olam.in/
Scripts Taml Knda Mlym Telu
Languages Baseline Flatten Baseline Flatten Baseline Flatten Baseline Flatten
Tamil 94.37 80.43 - 80.63 - 80.93 - 80.73
Kannada - 91.60 92.59 92.19 - 91.60 - 91.70
Malayalam 69.27 99.31 88.93 98.32 100.00 98.42 88.93 98.91
Telugu - 93.68 - 93.77 - 93.08 94.07 93.77
Average 40.91 91.25 45.28 91.23 25.00 91.01 45.75 91.28
Table 4: We find that no particular script is best suited to the flattening task, and each script allows for identification
of the four Dravidian languages relatively faithfully. Although only marginally, the Telugu-script Flatten model
performs best, so we include it in the cross-domain experiments in §4.4. There is also a noticeable drop in
performance for the Tamil language, regardless of script (see Appendix D for interpretability analysis). Columns show
scripts and rows indicate languages. Baseline models are trained on all four languages in their original scripts and
then tested on the transliterated flatten setups. We expect them to only predict the corresponding language for each
script (and have -, meaning 0, for the others), but we observe that they sometimes predict other languages too, despite
not seeing them in the training corpus.
Model WIKI FLORES200 train - 25% train - 50% train - 75% train - 100%
Train size (sentences) n/a 3,988 3,984 7,968 11,952 15,952
Language ORI TRA ORI TRA ORI TRA ORI TRA ORI TRA ORI TRA
Tamil 100 25 94.37 23.59 48.02 48.84 77.96 78.04 91.8 92.02 95.26 95.16
Kannada 100 25 92.59 23.15 74.41 74.18 89.62 90.02 92.69 92.76 95.06 95.06
Malayalam 100 25 86.78 95.85 95.41 99.11 97.83 99.7 99.68 99.7 99.65 99.65
Telugu 100 25 94.07 23.52 47.23 46.89 92.49 92.86 94.37 94.47 95.36 95.41
Average 100 25 95.26 39.26 66.38 66.33 89.80 89.69 94.64 94.73 96.35 96.32
Table 5: Transliteration of at least 75% of the data is required for Upscale models to perform on par with comparable
baselines (FLORES200) on naturally occurring text. Additionally, these Upscale models also show high performance
on transliterated test sets. The first two columns evaluate the fastText baselines on the WIKI and FLORES200
datasets. The next four columns show Upscale models trained with 25%, 50%, 75%, and 100% of the original training
examples transliterated. The row underneath the header displays the amount of training data. Each model was tested on
the original test set (ORI), without any transliterations, and a test set (TRA) with all examples transliterated to all
scripts. Rows show language-specific langID performance.
words may have exact parallels in Malayalam, we
think the score may come from positively corre-
lated morphological features within the word itself,
since Tamil and Malayalam share many word suf-
fixes, prefixes, pluralization rules, prepositions etc.
Our interpretability investigation revealed that this
is due to the presence of some positive Malayalam signal in
Tamil sentences, owing to the lexical, semantic, and
phylogenetic similarity of the two languages. This
overlap causes a small number of sentences to be
assigned a high probability of both Tamil and Malayalam,
with Malayalam winning by a slight margin. For more detailed
results, plots, and explanations, please refer
to Appendix D.
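A hedged sketch of how such explanations can be computed for a fastText classifier with the shap library is shown below; the model path, class order, tokenizer regex, and example sentence are illustrative assumptions rather than our exact configuration.

    import numpy as np
    import shap
    import fasttext

    CLASSES = ["__label__tam", "__label__kan", "__label__mal", "__label__tel"]
    model = fasttext.load_model("flatten_taml.bin")  # hypothetical flattened model

    def predict_proba(texts):
        # Return an (n_texts, n_classes) probability matrix in a fixed class order.
        out = np.zeros((len(texts), len(CLASSES)))
        for row, text in enumerate(texts):
            labels, probs = model.predict(text.replace("\n", " "), k=len(CLASSES))
            for label, p in zip(labels, probs):
                out[row, CLASSES.index(label)] = p
        return out

    masker = shap.maskers.Text(r"\W+")  # word-level masking
    explainer = shap.Explainer(predict_proba, masker, output_names=CLASSES)
    shap_values = explainer(["இது தமிழில் உள்ளது"])  # a Tamil example sentence
    shap.plots.text(shap_values[0])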
4.2 Upscale
Our upscaled model performs quite well on the test
sets, with over 96% accuracy (Table 5). Moreover,
while it drastically outperformed the baseline on
transliterated data, it scores higher on native script
sentences as well. These results demonstrate that
the model was able to correctly disentangle script
and language using data augmentation.
Comparison with Flattening When comparing
the Flattening results to Upscale, it is important
to recognize that the latter model was trained on
four times the amount of data, since we transliter-
ated to all four scripts as opposed to flattening to
a single script. Granted, the task was more com-
plex as the model needs to handle 4 different writ-
ing systems per language. But, in order to nor-
FLORES-TRA FLORES-ORI GLOT UDHR MCS350 Avg
FLORES200 baseline (reference) 39.26 95.26 82.41 79.00 45.34 68.25
Upscale (4-way parallel) 96.32 96.35 81.67 77.54 44.79 79.33
Upscale (non-parallel, IndicCorp) 94.39 94.37 84.61 83.86 51.76 81.80
Table 6: This table compares two Upscaled models, each trained on 997 examples per language, which are then
transliterated to all scripts. One is trained on 4-way parallel data, and the other on examples that are not parallel,
from IndicCorp; the FLORES200 baseline is included in the first row for reference. The slight discrepancy in
performance is likely a result of data from different domains. TRA for FLORES represents the test set that contains
transliterations and ORI represents the default FLORES test set.
malize the effect of the number of examples, we
also trained it using three variations of our training
data: 25%, 50%, and 75% of the original exam-
ples transliterated. As expected, the 25% model
performed much worse than the 100% model, and
we saw improvements as we included more of the
data. Interestingly, the results were only compara-
ble to the Flattening model once we trained with
at least 75% of the original examples. We suspect
this is due to the difference between the number of
cross-language examples and the number of cross-
script examples. For instance, even though the
25% Upscaled model has nearly the same number
of training examples as any of the Flattening mod-
els, many of these sentences are merely transliter-
ated versions of each other, rather than full trans-
lations or original examples. This distribution ap-
pears to allow the model to become script-agnostic,
but sacrifices the ability to identify languages in
the process. This suggests that although Upscaling
may perform better than Flattening overall, Flatten-
ing can perform similarly with fewer examples.
Learning without n-way parallel data Upscale
models appear to correctly ignore script in their
decision-making process, but so far they have
been trained on n-way parallel data, which
could be a potential confounder. Therefore, we
compare the performance of two script-upscaled
models –one trained on 4-way parallel data, the
other on non-parallel data– keeping the number of
training examples per language constant for fair-
ness. For non-parallel data, we use subsets of the
monolingual corpora from IndicCorp for Telugu,
Tamil, and Malayalam. We reuse the FLORES200
examples for Kannada, since these are not parallel
to the data for the other three languages.
Our evaluation on the FLORES transliterated
and clean test sets as well as all out-of-domain
sets is in Table 6. The two models have largely
similar results. The original 4-way parallel model
does somewhat better on the FLORES test sets,
and the non-parallel model has the better accuracy
on average; however, these discrepancies can be
expected due to the domain differences in data
sources. Overall, it appears that both models are
comparable and therefore using explicitly parallel
data has a negligible effect.
4.3 Noisy Multi-Script
In the intra-sentence noise setup, performance
varies to a large degree between the models, but ac-
curacy distributions for each model stay relatively
constant across test sets (Table 7). Our Script-
Upscaled model is the best on average with over
99% accuracy, and the All-Noise model follows
closely behind with a 98.82% score. Beyond these
two, scores drop significantly to the 50-65% range,
which is undesirable for a 4-class langID task.
This is likely explained by the size of the train-
ing sets. The Baseline, as well as models with noise
settings from 25 to 100, used data from four sets
(one for each language) with varying script per-
mutations. However, our All-Noise model was
trained on a merged dataset consisting of sentences
at all noise levels (i.e. four times the data). This
is similar to the Script-Upscaled model that had
access to each language’s sentences transliterated
to the four different scripts, and is likely what al-
lowed the two models to perform so well. We be-
lieve that the Script-Upscaled model performed the
best because it was consistently shown the same
sentence in all four scripts, forcing it to become
truly script-agnostic. The All-Noise model was
able to do this to a large degree, but due to random-
ness and slight inconsistencies in permutations, it
likely was not able to completely disregard script
in its decision-making process. Therefore, script-
mixing within sentences seems to be an extremely
challenging setup for models and requires data aug-
mentation for reasonable performance.
Data Language Baseline N@25 N@50 N@75 N@100 N@all Upscale

Tamil 23.59 40.19 14.95 42.81 26.75 93.08 95.26
Kannada 23.15 76.38 58.75 77.32 67.27 93.33 95.16
Malayalam 86.78 94.54 99.93 95.11 99.51 99.63 99.70
Telugu 23.52 51.63 40.07 44.64 51.14 94.89 95.45
all
Tamil 40.77 36.86 14.82 39.66 25.55 99.77 100.00
Kannada 39.72 77.02 56.24 78.59 65.25 99.02 99.14
Malayalam 86.94 96.34 99.97 96.24 99.57 99.90 99.95
Telugu 42.40 52.70 38.71 43.32 52.47 99.47 99.77
 * 50.27 65.72 52.60 64.52 60.79 98.82 99.16
Table 7: Even after introducing transliteration noise at different levels within sentences, the N@all and Upscale
models are competitive implying that we can use word-level script-mixing without sacrificing performance. The
table has been abridged due to space constraints, but an extended version with results for 25, 50, 75, and 100%
noise-levels is in Appendix Table 9. N@25,50,75,100 and the baseline models were trained with 3988 sentences
per class. The Upscale and N@all models (last two columns) were trained with 15952 sentences per class and are
therefore more comparable with each other. The baseline was trained on original FLORES200 data.
Test Dataset FLORES200 GLOT UDHR MCS350 AVERAGE
Test Set Size 4048 3934 285 15000 5817
baseline wiki 100.00 99.96 100.00 71.75 92.93
baseline separate 25.00 24.92 20.35 25.00 23.81
baseline flores200 95.26 82.41 79.00 45.34 75.50
flatten (Telugu) 91.28 43.18 44.56 33.95 53.24
upscale 96.35 81.67 77.54 44.79 75.09
noise (all) 95.41 80.19 76.14 43.41 73.79
Table 8: We share three fastText-based baseline models (trained on FLORES200, on separate language-script
classes, and on Wikipedia) along with the best model from each of our 3 experimental setups (upscale, flatten, noise).
We test them on out-of-domain data to measure domain transfer of the learned embeddings. Overall, the noise (all)
and upscale models have performance comparable to the FLORES200 baseline, demonstrating that the multi-script
training does not lead to a significant degradation in performance on the languages' naturally occurring native
scripts. Note that the WIKI model is trained on all of Wikipedia, and therefore its performance is not directly
comparable to any of the other models. The SEPARATE baseline performs the poorest, likely due to the small amount
of per-class data available for a 16-way classification task.
4.4 Cross-Domain Performance
A comparison of our models on the clean FLO-
RES200 test set, as well as out-of-domain sets
is in Table 8. The FLORES200 baseline performs
well in-distribution and on the similar long-length
GLOT and UDHR datasets, but poorly on
MCS350 (children's-stories domain and shorter
sentences). The WIKI baseline is better than the
FLORES200 baseline across all datasets, showing
that it has built a better representation space for
the languages. The noise (all) and upscale
models have comparable performance to the
FLORES200 baseline, demonstrating that the multi-script
training does not lead to a significant degradation
in performance on the languages' conventional/native
scripts. The flatten model naturally performs poorly
compared to the other models in this setting since it
is only exposed to one script. Therefore, it may not
be a practical choice for real-world language
identification.
5 Discussion
The results demonstrate that all of our script-
agnostic language identification models (Flatten-
ing, Noise, and Script-Upscaled) perform well
above the baselines on examples that utilize a non-
standard script. In certain cases where data is in na-
tive script, our baseline models can surpass some
script-agnostic ones; this is likely because the base-
lines use script as a basis for determining language
ID. The All-noise model showed very good perfor-
mance, and we suspect it remains second to the
Upscaled setting primarily due to the variability
of the training data. Unlike the Upscaled model,
it may not see every example transliterated to all
scripts, and thus may not become completely ag-
nostic of script. However, it is a strong contender
and its performance on other downstream tasks and
the quality of its learned representations should be
evaluated in future work when scaling to a larger
number of languages and scripts.
In the practical setting, our models –especially
Script-Upscaled– appear to be a reasonable alter-
native to current language identification systems.
Additionally, it is likely that had we trained an Up-
scaled model on Wikipedia, we would have seen re-
sults that matched the WIKI baseline on noiseless
data. The large amount of storage and computational
power required for this endeavor, in addition to
potential challenges in transliterating to so many scripts,
would have been beyond the scope of our current
work. However, future work should carefully ex-
plore creation of script-agnostic WIKI langID mod-
els as well. Our Upscaling approach is relatively
straightforward, and requires no more examples
than for a standard language identification system.
Since transliteration can be done automatically and
cheaply, our final proposal is a script-based data-
augmentation process for complete sentences and
within sentences. When expanding to other lan-
guages and scripts, lossy transliteration quality in
non-Indic systems may be a challenge, and we rec-
ommend using the International Phonetic Alpha-
bet (IPA) as a bridge for high-quality and natural
transliteration.
6 Related Work
Previous work has demonstrated that script bar-
riers discourage transfer learning from high-
resource languages into low-resource languages’
representation spaces, especially for Neural Ma-
chine Translation (Muller et al.,2021;Anasta-
sopoulos and Neubig,2019). Moreover, script di-
versity negatively impacts low-resource languages
disproportionately because their training data is
often of poor quality and smaller in size (Pfeif-
fer et al.,2021). Consequently, researchers have
focused on transliteration, romanization, phonetic
representation etc. to reduce vocabulary sizes
and allow lexical sharing between languages with
different writing systems (Amrhein and Sennrich,
2020).
Another common approach relies on existing
pre-trained models and fine-tuning them with dif-
ferent transliterated versions of the originally sup-
ported languages (Muller et al.,2021;Dhamecha
et al.,2021). This is an instance of the com-
mon hierarchical pipeline (Goutte et al.,2014;Lui
et al.,2014;Bestgen,2017) or fine-tuning-based
approach for language identification (Jauhiainen
et al.,2018;Agarwal et al.,2023;Ahmadi et al.,
2023a). Most recently, Moosa et al. (2023) con-
ducted a study on effects of transliteration on multi-
lingual language modeling, which focused on two
kinds of models: a multi-script model with the native
script of each language (matching our baseline
setup) and a uni-script model with only one script
for all languages (similar to our flatten setup).
As a natural extension of their work, we also
consider  and  setups for Dravidian
languages, as described in §3. Unlike their work,
we do not fine-tune on downstream tasks, but in-
stead focus on including the transliteration in the
original training data to give the model the ability
to handle non-native scripts without losing perfor-
mance on the original script. Moreover, our work
is not only motivated from a lexical-sharing and
transfer-learning perspective, but is grounded with
the aim of supporting synchronic and diachronic di-
graphia adequately in NLP applications and tasks.
7 Conclusion
We introduce and evaluate three new kinds of
language identification models that are script-
agnostic. All of our systems have been shown to
outperform the baseline on examples that are not
written in the standard script. Two of our models
(Upscaled and All-Noise) perform especially well
on both clean and transliterated data. Our meth-
ods may provide a reasonable alternative to train-
ing language identifiers that can correctly classify
text based on the language used, rather than the
script in which it is written. Future work should ex-
pand to include more languages and scripts, as well
as perform thorough intrinsic evaluation on the
learned embeddings to determine if these would be
effective on other downstream tasks.
Limitations
Extending to a larger set of languages We note
that our models were only trained and evaluated
using the four major Dravidian languages - Tamil,
Telugu, Malayalam, and Kannada. Extending the
successful experiments (upscale and all-noise) to a
larger number of writing systems may prove chal-
lenging in terms of computational resources and
dataset sizes. Data loss associated with script con-
version and non-phonetic scripts is a likely chal-
lenge (and potential limitation) when we scale our
approach to more scripts.
Unknown Scripts Note that our approach helps
bring script-agnosticism to scripts included during
training time. The model will still struggle with un-
known writing systems, and for this, we will need
to scale to an extremely large number of writing
systems, which we leave for future work.
Data loss due to script-conversion Most In-
dic scripts have a 1:1 phonetic mapping between
graphemes, but there may still be letters that are
not mapped accurately (truly unique sounds in cer-
tain languages). In our study, three of the four
scripts have direct phonetic mappings, while only
one (Tamil) lacks distinct graphemes for the aspirated
and voiced sounds of the other three, so these are not
directly translatable into the Tamil writing system. This
means that two different scripts representing the
same word can have two different character distri-
butions.
Ethics Statement
Languages may be written in non-native scripts to
obfuscate their presence on the internet, and the
use of script-agnostic embeddings could make it possible to
discover and accurately identify such text during
web crawls. This may have some downstream
privacy and surveillance related concerns that are
out of scope for this work. Currently, our pilot
study uses the FLORES200 dataset to train em-
beddings, but in the future, larger corpora such
as Wikipedia, CommonCrawl, or other publicly
crawled data can be used, which may bring with it
several concerns around data ownership and copy-
right.
Acknowledgements
This work was generously supported by the Na-
tional Endowment for the Humanities under award
PR-276810-21, by the National Science Founda-
tion under award IIS-2327143, and by a Sponsored
Research Award from Meta. Joshua Otten is also
supported by the Presidential Scholarship awarded
by the George Mason University Graduate Divi-
sion. Computational resources for experiments
were provided by the Office of Research Comput-
ing at George Mason University (URL: https:
//orc.gmu.edu).
References
Milind Agarwal, Md Mahfuz Ibn Alam, and Antonios
Anastasopoulos. 2023. LIMIT: Language identifica-
tion, misidentification, and translation using hierar-
chical models in 350+ languages. In Proceedings of
the 2023 Conference on Empirical Methods in Natu-
ral Language Processing, pages 14496–14519, Sin-
gapore. Association for Computational Linguistics.
Sina Ahmadi, Milind Agarwal, and Antonios Anasta-
sopoulos. 2023a. PALI: A language identification
benchmark for Perso-Arabic scripts. In Tenth Work-
shop on NLP for Similar Languages, Varieties and
Dialects (VarDial 2023), pages 78–90, Dubrovnik,
Croatia. Association for Computational Linguistics.
Sina Ahmadi, Milind Agarwal, and Antonios Anasta-
sopoulos. 2023b. PALI: A language identification
benchmark for Perso-Arabic scripts. In Tenth Work-
shop on NLP for Similar Languages, Varieties and
Dialects (VarDial 2023), pages 78–90, Dubrovnik,
Croatia. Association for Computational Linguistics.
Chantal Amrhein and Rico Sennrich. 2020. On Roman-
ization for model transfer between scripts in neural
machine translation. In Findings of the Association
for Computational Linguistics: EMNLP 2020, pages
2461–2469, Online. Association for Computational
Linguistics.
Antonios Anastasopoulos and Graham Neubig. 2019.
Pushing the limits of low-resource morphological in-
flection. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
984–996, Hong Kong, China. Association for Com-
putational Linguistics.
Yves Bestgen. 2017. Improving the character ngram
model for the DSL task with BM25 weighting and
less frequently used feature sets. In Proceedings
of the Fourth Workshop on NLP for Similar Lan-
guages, Varieties and Dialects (VarDial), pages 115–
123, Valencia, Spain. Association for Computational
Linguistics.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and
Tomas Mikolov. 2017. Enriching word vectors with
subword information. Transactions of the Associa-
tion for Computational Linguistics, 5:135–146.
Ralf D. Brown. 2012. Finding and identifying text in
900+ languages. Digital Investigation, 9:S34–S43.
The Proceedings of the Twelfth Annual DFRWS
Conference.
Marta R Costa-jussà, James Cross, Onur Çelebi,
Maha Elbayad, Kenneth Heafield, Kevin Heffer-
nan, Elahe Kalbassi, Janice Lam, Daniel Licht,
Jean Maillard, Anna Sun, Skyler Wang, Guillaume
Wenzek, Al Youngblood, Bapi Akula, Loic Bar-
rault, Gabriel Mejia Gonzalez, Prangthip Hansanti,
John Hoffman, Semarley Jarrett, Kaushik Ram
Sadagopan, Dirk Rowe, Shannon Spruit, Chau
Tran, Pierre Andrews, Necip Fazil Ayan, Shruti
Bhosale, Sergey Edunov, Angela Fan, Cynthia
Gao, Vedanuj Goswami, Francisco Guzmán, Philipp
Koehn, Alexandre Mourachko, Christophe Ropers,
Safiyyah Saleem, Holger Schwenk, Jeff Wang, and
NLLB Team. 2024. Scaling neural machine transla-
tion to 200 languages. Nature.
Tejas Dhamecha, Rudra Murthy, Samarth Bharad-
waj, Karthik Sankaranarayanan, and Pushpak Bhat-
tacharyya. 2021. Role of Language Relatedness in
Multilingual Fine-tuning of Language Models: A
Case Study in Indo-Aryan Languages. In Proceed-
ings of the 2021 Conference on Empirical Methods
in Natural Language Processing, pages 8584–8595,
Online and Punta Cana, Dominican Republic. Asso-
ciation for Computational Linguistics.
Cyril Goutte, Serge Léger, and Marine Carpuat. 2014.
The NRC system for discriminating similar lan-
guages. In Proceedings of the First Workshop on
Applying NLP Tools to Similar Languages, Varieties
and Dialects, pages 139–145, Dublin, Ireland. As-
sociation for Computational Linguistics and Dublin
City University.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-
Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Kr-
ishnan, Marc’Aurelio Ranzato, Francisco Guzmán,
and Angela Fan. 2022. The Flores-101 evaluation
benchmark for low-resource and multilingual ma-
chine translation. Transactions of the Association for
Computational Linguistics, 10:522–538.
Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan
Pino, Guillaume Lample, Philipp Koehn, Vishrav
Chaudhary, and Marc’Aurelio Ranzato. 2019. The
FLORES evaluation datasets for low-resource ma-
chine translation: Nepali–English and Sinhala–
English. In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Process-
ing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP),
pages 6098–6111, Hong Kong, China. Association
for Computational Linguistics.
Tommi Jauhiainen, Heidi Jauhiainen, and Krister
Lindén. 2022. HeLI-OTS, off-the-shelf language
identifier for text. In Proceedings of the Thirteenth
Language Resources and Evaluation Conference,
pages 3912–3922, Marseille, France. European Lan-
guage Resources Association.
Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Tim-
othy Baldwin, and Krister Lindén. 2018. Automatic
language identification in texts: A survey. CoRR,
abs/1804.08186.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and
Tomas Mikolov. 2017. Bag of tricks for efficient
text classification. In Proceedings of the 15th Con-
ference of the European Chapter of the Association
for Computational Linguistics: Volume 2, Short Pa-
pers, pages 427–431, Valencia, Spain. Association
for Computational Linguistics.
Divyanshu Kakwani, Anoop Kunchukuttan, Satish
Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M.
Khapra, and Pratyush Kumar. 2020. IndicNLPSuite:
Monolingual corpora, evaluation benchmarks and
pre-trained multilingual language models for Indian
languages. In Findings of the Association for Com-
putational Linguistics: EMNLP 2020, pages 4948–
4961, Online. Association for Computational Lin-
guistics.
Amir Hossein Kargaran, Ayyoob Imani, François Yvon,
and Hinrich Schütze. 2023. GlotLID: Language
identification for low-resource languages. In The
2023 Conference on Empirical Methods in Natural
Language Processing.
Gurpreet Singh Lehal and Tejinder Singh Saini. 2014.
Sangam: A perso-Arabic to indic script machine
transliteration model. In Proceedings of the 11th
International Conference on Natural Language Pro-
cessing, pages 232–239, Goa, India. NLP Associa-
tion of India.
Marco Lui and Timothy Baldwin. 2012. langid.py: An
off-the-shelf language identification tool. In Pro-
ceedings of the ACL 2012 System Demonstrations,
pages 25–30, Jeju Island, Korea. Association for
Computational Linguistics.
Marco Lui, Ned Letcher, Oliver Adams, Long Duong,
Paul Cook, and Timothy Baldwin. 2014. Exploring
methods and resources for discriminating similar lan-
guages. In Proceedings of the First Workshop on
Applying NLP Tools to Similar Languages, Varieties
and Dialects, pages 129–138, Dublin, Ireland. As-
sociation for Computational Linguistics and Dublin
City University.
Scott M Lundberg and Su-In Lee. 2017. A unified
approach to interpreting model predictions. In Ad-
vances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.
Ibraheem Muhammad Moosa, Mahmud Elahi Akhter,
and Ashfia Binte Habib. 2023. Does translitera-
tion help multilingual language modeling? In Find-
ings of the Association for Computational Linguis-
tics: EACL 2023, pages 670–685, Dubrovnik, Croa-
tia. Association for Computational Linguistics.
Benjamin Muller, Antonios Anastasopoulos, Benoît
Sagot, and Djamé Seddah. 2021. When being un-
seen from mBERT is just the beginning: Handling
new languages with multilingual language models.
In Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
pages 448–462, Online. Association for Computa-
tional Linguistics.
Odunayo Ogundepo, Tajuddeen Gwadabe, Clara
Rivera, Jonathan Clark, Sebastian Ruder, David
Adelani, Bonaventure Dossou, Abdou Diop, Clay-
tone Sikasote, Gilles Hacheme, Happy Buzaaba,
Ignatius Ezeani, Rooweither Mabuya, Salomey
Osei, Chris Emezue, Albert Kahira, Shamsud-
deen Muhammad, Akintunde Oladipo, Abraham
Owodunni, Atnafu Tonja, Iyanuoluwa Shode, Akari
Asai, Anuoluwapo Aremu, Ayodele Awokoya,
Bernard Opoku, Chiamaka Chukwuneke, Christine
Mwase, Clemencia Siro, Stephen Arthur, Tunde
Ajayi, Verrah Otiende, Andre Rubungo, Boyd
Sinkala, Daniel Ajisafe, Emeka Onwuegbuzia,
Falalu Lawan, Ibrahim Ahmad, Jesujoba Alabi,
Chinedu Mbonu, Mofetoluwa Adeyemi, Mofya
Phiri, Orevaoghene Ahia, Ruqayya Iro, and Sonia
Adhiambo. 2023. Cross-lingual open-retrieval ques-
tion answering for African languages. In Findings
of the Association for Computational Linguistics:
EMNLP 2023, pages 14957–14972, Singapore.
Association for Computational Linguistics.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebas-
tian Ruder. 2021. UNKs everywhere: Adapting mul-
tilingual language models to new scripts. In Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 10186–
10203, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
Vinodh Rajan. 2014. Konkanverter - a finite state
transducer based statistical machine transliteration
engine for Konkani language. In Proceedings of the
Fifth Workshop on South and Southeast Asian Natu-
ral Language Processing, pages 11–19, Dublin, Ire-
land. Association for Computational Linguistics and
Dublin City University.
Alex Salcianu, Andy Golding, Anton Bakalov, Chris
Alberti, Daniel Andor, David Weiss, Emily Pitler,
Greg Coppola, Jason Riesa, Kuzman Ganchev,
Michael Ringgaard, Nan Hua, Ryan McDonald, Slav
Petrov, Stefan Istrate, and Terry Koo. 2020. Com-
pact language detector v3 (cld3). Last accessed:
2023-06-23.
Lisa Yankovskaya, Maali Tars, Andre Tättar, and Mark
Fishel. 2023. Machine translation for low-resource
Finno-Ugric languages. In Proceedings of the 24th
Nordic Conference on Computational Linguistics
(NoDaLiDa), pages 762–771, Tórshavn, Faroe Is-
lands. University of Tartu Library.
A Out-of-Domain Datasets
1. FLORES200: Open-source n-way parallel
dataset consisting of sentences from 842 web
articles, translated into a large number of lan-
guages (Costa-jussà et al.,2024;Goyal et al.,
2022;Guzmán et al.,2019). Each language’s
examples are in the same order, and are sepa-
rated into .dev and .devtest files, contain-
ing 997 and 1012 sentences, respectively.
2. GlotStoryBooks6: Open-licensed curated
library of books (Kargaran et al.,2023)
from a variety of sources in 176 languages
(Yankovskaya et al.,2023;Ogundepo et al.,
2023). Each sample contains a sentence along
with its language identifier and script.
3. UDHR (Universal Declaration of Human
Rights): We use Kargaran et al. (2023)’s
public domain preprocessed version of the
UDHR dataset, where each sample is a para-
graph along with a language identifier. The
authors removed errors and formatting issues
in the original UDHR data and made this
clean version available7.
4. MCS-350: Multilingual Children’s Stories
dataset, released by Agarwal et al. (2023),
contains over 50K children’s stories cu-
rated primarily from two sources - African
Storybooks Initiative and Pratham Story-
weaver, both open-source story repositories
for African and Indian languages respectively.
For our experiments, we use the monolingual
data files available on the authors’ GitHub
repository8for Tamil, Malayalam, Kannada,
and Telugu. Compared to UDHR, the sen-
tences are relatively shorter in length since
they are not from the legal domain, and unlike
GlotStoryBooks, the authors don’t apply any
length-based filtering to the curated stories.
5. IndicCorp9: Monolingual, sentence-level
corpora for English and 11 Indian languages
from the Dravidian and Indo-Aryan fami-
lies (Kakwani et al.,2020). It consists of 8.8
6https://huggingface.co/datasets/cis-lmu/
GlotStoryBook
7https://huggingface.co/datasets/cis-lmu/
udhr-lid
8https://github.com/magarw/limit
9https://paperswithcode.com/dataset/
indiccorp
billion tokens and is sourced mostly from In-
dian news crawls (articles, blog posts, maga-
zines), though it also takes data from the OS-
CAR corpus.
B Brief Language Profiles
1. Tamil (tam), a Southern-Dravidian language,
is spoken by over 80 million people and is
an official language in Sri Lanka, the Indian
states of Tamil Nadu and Puducherry, and of
the Indian Constitution's Eighth Schedule. It
is currently most widely written in the Tamil
abugida - தமிழ் எழுத்து (tamizh ezhuttu).
2. Telugu (tel), a South-Central Dravidian lan-
guage, is spoken by about 100 million peo-
ple and is the most spoken Dravidian lan-
guage. It is also an Eighth Schedule language
of the Indian Constitution and is official in the
Indian states of Andhra Pradesh, Telangana,
and Puducherry (Yanam). It is written in the
Telugu abugida - తెలుగు లిపి (telugu lipi).
3. Malayalam (mal), another Southern-Dravidian
language, is the smallest language
from our selection, spoken by about 40
million people in Southern India. It is an
Eighth Schedule language and is official
in the southernmost Indian state of Kerala.
It is written in the Malayalam abugida -
മലയാളം അക്ഷരങ്ങൾ (malayalam
aksharangal).
4. Kannada (kan), also a member of the
Southern-Dravidian language subfamily, is
spoken by about 60 million people, mostly
within India. It is an official language of the
Indian Constitution's Eighth Schedule and is
the sole official language of Karnataka state.
It is widely written in Kannada script, which
is closely related to the Telugu script and is
also an abugida, but diverged around 1300 CE
- ಕನ್ನಡ ಅಕ್ಷರಮಾಲೆ (kannada aksharamale).
C Noise-Experiments Extended Results
D Interpreting Flattening Results
The default baseline (in-distribution) is a
fastText model trained on FLORES200 data,
keeping the languages in their original scripts
without any transliterations. For the flattening
experiments, we project all data to one script at
a time. Since the test data is flattened to a single
Data Language Baseline N@25 N@50 N@75 N@100 N@all Upscale

Tamil 23.59 40.19 14.95 42.81 26.75 93.08 95.26
Kannada 23.15 76.38 58.75 77.32 67.27 93.33 95.16
Malayalam 86.78 94.54 99.93 95.11 99.51 99.63 99.70
Telugu 23.52 51.63 40.07 44.64 51.14 94.89 95.45
25
Tamil 35.05 36.25 14.20 39.98 25.08 99.90 100.00
Kannada 31.38 77.21 56.93 78.82 64.76 99.40 99.60
Malayalam 85.74 95.58 99.90 96.18 99.80 99.90 100.00
Telugu 36.18 52.66 38.29 43.82 51.96 99.10 99.90
50
Tamil 38.87 36.86 14.30 39.68 26.38 99.70 100.00
Kannada 41.84 77.30 55.83 79.23 66.67 99.29 99.59
Malayalam 86.00 96.48 100.00 96.17 96.17 99.90 99.90
Telugu 43.50 52.47 38.67 42.50 52.77 99.40 99.70
75
Tamil 45.01 37.34 14.83 39.35 25.93 99.80 100.00
Kannada 44.08 76.34 56.22 78.77 64.91 98.79 98.89
Malayalam 88.56 96.86 100.00 95.85 99.39 100.00 100.00
Telugu 47.61 52.59 39.19 43.65 52.79 99.59 99.70
100
Tamil 44.21 36.99 15.96 39.63 24.80 99.70 100.00
Kannada 41.19 77.23 55.97 77.53 64.68 98.58 98.48
Malayalam 87.46 96.43 100.00 96.74 99.39 99.80 99.90
Telugu 42.35 53.09 38.70 42.96 52.38 99.80 99.80
all
Tamil 40.77 36.86 14.82 39.66 25.55 99.77 100.00
Kannada 39.72 77.02 56.24 78.59 65.25 99.02 99.14
Malayalam 86.94 96.34 99.97 96.24 99.57 99.90 99.95
Telugu 42.40 52.70 38.71 43.32 52.47 99.47 99.77
 * 50.27 65.72 52.60 64.52 60.79 98.82 99.16
Table 9: Even after introducing noise at all levels, the N@all and Upscale models are competitive, implying that
we can use word-level script-mixing without sacrificing performance on clean or noisy data. Among the
noise@25,50,75,100 settings, we observe that 50% and 100% noise have a drastic impact on classification accuracy for
2 languages. N@25,50,75,100 and the baseline models were trained with 3988 sentences per class. The Upscale
and N@all models were trained with 15952 sentences per class and are therefore more comparable with each other.
The baseline was trained on FLORES200 data.
Algorithm 1 Synthetic Noise Within Sentences
1: for noise = 25,50,75,100 do
2: for lang =tam, kan, mal, tel do
3: for sent = 0,1, . . . ..N do
4: Choose 1 base script
5: Choose noise% words to transform
6: for index in chosen indices do
7: nonbase = Choose a new script
8: Transform word into nonbase
9: Save transformed data at noise-level
10: Merge-save sentences at all noise levels into a
new file for the all-noise setting
script, we would expect the model to only predict
the language that is representative of the writing
system. For instance, the baseline model would
predict Tamil when it’s shown data from any
language in the Tamil script. But, we find that
the models (trained on data in 4 different scripts
and languages) tend to default to a Malayalam
prediction for sentences that they know are not
Tamil (Table 4). This can be seen from the presence
of a Malayalam signal across experiments for all
4 projection scripts. It also seems that several
Malayalam sentences are being misclassified as
Tamil (as evident by the less-than-100% accu-
racy for the Malayalam row for non-Malayalam
Figure 2: Example: Sentence 0’s SHAP visualization for gold  sentence and weights when predicted class is
. Red indicates positive signal for  (unwanted) and blue indicates negative signal for  (wanted).
scripts).
For the Upscale experiments (Table 5), we find
that the Wikipedia pre-trained model does not have
the same bias towards Malayalam as our model,
and instead is perfectly fit to each language’s writ-
ing system (100% and 25% accuracy on Origi-
nal and Transliterated data). The custom-trained
FLORES200 baseline, on the other hand, has sim-
ilar performance (between 86-94% for Original
and 23% for Transliterated). We observe the
Malayalam-defaulting phenomenon here as well,
and it is likely that the model is over-predicting
Malayalam, treating it as an “other” prediction
bucket.
For Noise experiments (Table 7), we observe
similar performance by the FLORES200 baseline
as on the Upscaling experiments. However, the
accuracy for non-Malayalam languages seems to
increase as we increase the amount of noise. To
interpret differences in accuracy scores across lan-
guages, we utilize a game-theoretic metric, Shap-
ley Additive Explanations, or SHAP (Lundberg
and Lee,2017), to compute global-level explana-
tions across the training dataset for all 4 languages.
As discovered in §4.1, we find that Tamil receives
a significantly lower accuracy (around 80%) com-
pared to the other 3 languages, especially com-
pared to Malayalam (95%+). Therefore, we fo-
cus on finding explanations for false positive fea-
tures in Tamil sentences that have been predicted
as Malayalam. Readers should note that Tamil
and Malayalam are closely related since they were
the most recent to diverge from each other among
the four major Dravidian languages (around the
9th century CE). Therefore, there are substantial
vocabulary and grammatical similarities between
them.
Table 10 displays all the relevant words and
characters in mispredicted Tamil sentences. We
obtained translations for  using Agarathi10 and
Google Translate, and for  using Google Trans-
10https://agarathi.com. அகராதி/agarathi means
dictionary in Tamil
late and Olam11. While not all positively weighted
words may have exact parallels in Malayalam, we
think the score may come from positively corre-
lated morphological features within the word itself,
since Tamil and Malayalam share many word suf-
fixes, prefixes, pluralization rules, prepositions etc.
It is worth noting that our interpretability study re-
vealed that for the flattened script condition, the
fastText-trained models always predict Malayalam as
the default. This is not inherently bad because we
still receive over 90% accuracy for Kannada and
Telugu, indicating that the models find sufficient non-
Malayalam signal in the sentence when it is present. How-
ever, for Tamil, we saw that there was a 10% gap in
performance (i.e., Tamil prediction accuracy stayed
around 80%). Our interpretability investigation re-
vealed that this is due to the presence of some positive
Malayalam signal in Tamil sentences, owing to the lexical,
semantic, and phylogenetic similarity of the two
languages. This overlap causes a small number of
sentences to be assigned a high probability of both
Tamil and Malayalam, with Malayalam having the maximum since
it is the default prediction, with Tamil being downscored.
Results and all graphs from the interpretability
Jupyter notebook are attached below. They
show the sentence-level explanations for each of
the Tamil sentences that were misclassified in the
training set by a small margin.
11Malayalam Dictionary - https://olam.in/
Sent  in  script Weight Transliteration  

0ఇ 0.039 itaiyil during in between
0ఒరు 0.025 oru a, an a, an
0వవకపటు ళన 0.021 vativamaikkuppattullana shaped are designed
1వఴంకపటతు 0.041 vazhaankappattathu indulgence provided
2ఇ 0.021 illai no, not no, not, ain’t
3య!0.031 chizhiyavai small ones small ones
4ఱువపటతు 0.039 nizhuuvappattathu established
5వరుకు 0.059 varukkaikku to visit
5ఒరు 0.058 oru a, an a, an
5వఴంతు.0.029 vazhankaathu don’t give in doesn’t provide
6పన. 0.033 pativaakina regularly were recorded.
7ఒరు 0.035 oru a, an a, an
7చకపటుఱతు. 0.024 chamaikkappatukizhathu is being cooked
8ఆతరవక. 0.043 aatharavalikkavillai not supported

0వవకపటు ళన 0.052 vativamaikkappattullana
0ఒరు_ 0.037 oru_ a, an one
0కుఱంక 0.037 kutiyezzhankalai above
0 0.035 itaiyil in
1వఴంకపత 0.102 vazhankappattatha suffix suffix
1కు 0.033 kutiniir above
1అవరళకు 0.024 avarkulukku to them they
2కురును 0.152 kutiyiruppinul above
3యవ 0.125 chizhiyava small ones small ones
4ఉరుకుం 0.033 uruvaakkum emerge create
5ఱువపట0.112 nizhuvappattatha
6కు_ 0.079 kku_
6వఴం0.045 vazhankaatha
6_ఒరు 0.037 _oru a, an one
7ప0.119 pativaakina
7మ0.022 malaiyin
8చకపటుఱత 0.048 chamaikkappatukizhatha
8కు 0.035 kuzhi pit pit
9ర ( ) 0.035 cheerppathai (r)
Table 10: Words and characters that have a positive Malayalam explanation weight of >0.02 for ground-truth
Tamil sentences. All sentences under consideration had a difference of >0.15 between the Tamil and Malayalam
classes. We pick this threshold since it gives us Tamil sentences that have a high-enough Malayalam signal (or low
Tamil signal) causing the classifier to mispredict.
[Notebook output omitted: SHAP character-level explanation plots (threshold > 0.15) for the mispredicted Tamil training sentences with indices ix_array = [906, 1113, 1395, 1687, 2080, 2108, 2224, 2270, 2801], produced with shap.plots.text(shap_values[i]) and shap.plots.bar(shap_values[i][:, 2], max_display=20); each force plot shows the base value and the f(__label__mal) score over the four output classes (__label__tam, __label__kan, __label__mal, __label__tel).]