LIMIT: Language Identification, Misidentification, and Translation using
Hierarchical Models in 350+ Languages
Milind Agarwal Md Mahfuz Ibn Alam Antonios Anastasopoulos
Department of Computer Science, George Mason University
{magarwa, malam21, antonis}@gmu.edu
Abstract

Knowing the language of an input text/audio is a necessary first step for using almost every natural language processing (NLP) tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, most of the world's 7000 languages are not supported by current systems. This lack of representation affects large-scale data mining efforts and further exacerbates data shortage for low-resource languages. We take a step towards tackling the data bottleneck by compiling a corpus of over 50K parallel children's stories in 350+ languages and dialects, and the computation bottleneck by building lightweight hierarchical models for language identification. Our data can serve as benchmark data for language identification of short texts and for understudied translation directions such as those between Indian or African languages. Our proposed method, Hierarchical LIMIT, uses limited computation to expand coverage into excluded languages while maintaining prediction quality.1
1 Introduction
Building natural language processing (NLP) tools like machine translation, language identification, part-of-speech (POS) taggers, etc., increasingly requires more data and computational resources. To attain good performance on a large number of languages, model complexity and data quantity must be substantially increased. However, for low-resource languages, large amounts of data are often unavailable, which creates a high barrier to entry for a majority of the world's 7000 languages. Increasing model complexity for large-scale models also requires a disproportionate amount of computational resources, further disincentivizing researchers from working on including these languages in modern NLP systems.
1 Data and code are available on https://github.com/magarw/limit
Figure 1: Most languages in our dataset are from the Indian Subcontinent and Sub-Saharan Africa, with significant minorities from Europe (primarily in the role of the high-resource language parallel translation available for each story). Color broadly indicates continent or region (North America, South America, Africa, Europe, Asia, Oceania) and size indicates the number of languages per country in our dataset.
A popular data collection approach is large-scale web mining (Tiedemann and Nygaard, 2004; Bañón et al., 2020; Schwenk et al., 2021b), where large parts of the internet are scoured to find training data for data-hungry NLP algorithms. When faced with a piece of text (e.g., a sentence or a phrase), such algorithms must know how to reliably sort it into the appropriate language bucket. Since the web is replete with content in a variety of languages, a model needs to recognize text in a sufficiently large number of these languages with high accuracy. Identifying parallel bitext is even more demanding, as translation models must also be available to correctly identify and align parallel data (Vegi et al., 2022; Kunchukuttan et al., 2018). This data-collection paradigm becomes inaccessible for low-resource languages because high-quality translation models usually require substantial amounts of parallel data for training, which is often unavailable. Without good quality language identification and translation tools, it becomes impractical to mine the internet for relevant text during such collection efforts.
Low-quality language identification and machine translation plague low-resource languages disproportionately, even though these are the languages that need large-scale resource creation efforts the most (Jauhiainen et al., 2019; Schwenk et al., 2021a). Additionally, mispredictions by language identification and data collection algorithms can increase inter-class noise, reducing the crawled data's quality and harming performance in downstream tasks without strong quality evaluation metrics (Kocyigit et al., 2022). How can we better understand the errors made by such models? How can we correct mispredictions to improve accuracy in supported languages with limited data and sustainably trained models? Can we expand the language coverage of current models without complete retraining? And how can this be done without compromising performance on already supported languages?
To tackle data scarcity in low-resource languages, we share a parallel children's stories dataset created using two resources: the African Storybooks Initiative2 and the Indian non-profit publisher Pratham Books' digital repository Storyweaver3 (data available under appropriate permissive Creative Commons licenses). The combined dataset includes original and human-translated parallel stories in over 350 languages (visualized in Figure 1), and we merge, preprocess, and structure it so it is easily utilizable by NLP researchers for training and benchmarking (Section 2).
Armed with parallel stories in many low-resource African and Indian languages, we utilize a pre-trained multilingual translation model (Alam and Anastasopoulos, 2022) and continue training with hierarchical language-level and language-family-level adapter units to translate children's stories at the page level (Section 3). By leveraging hierarchically organized adapter units on top of a root translation model, we save computational resources while expanding machine translation into many new and understudied language pairs (especially those between two low-resource languages), creating new benchmarks for the story translation domain, as well as evaluating our models on the FLORES (NLLB Team et al., 2022) benchmark.
To use this diverse linguistic data judiciously, we also propose hierarchical models to resolve confusion in language identification systems. The proposed approach is exciting because, unlike previously published language identification models like AfroLID (Adebara et al., 2022), CLD3 (Salcianu et al., 2020), and Franc4, it avoids training large multilingual models for a new set of languages and still outperforms existing systems. In contrast with other recent work in hierarchical language identification (Goutte et al., 2014; Lui et al., 2014; Bestgen, 2017; Jauhiainen et al., 2019), our work stands out because it accounts for mispredictions made by existing trained models. It does not predict a group/language family first, but rather directly learns confusion relationships between language pairs (which may not be from the same language family).

2 https://www.africanstorybook.org/
3 https://storyweaver.org.in/
We leverage lightweight, hierarchical classification units to improve the linguistic diversity and performance of a root system. This is made possible by analyzing the root model's mispredictions and identifying commonly confused language clusters. Furthermore, this confusion-based hierarchical approach is applicable both to a model's supported languages and to unsupported languages, subject to the availability of some training data (Section 4).
To summarize, our main contributions are:

1. We compile a dataset of 50K+ parallel children's stories from the African Storybooks Initiative and Storyweaver in 350+ languages.

2. We perform machine translation experiments with hierarchical adapter-based multilingual translation models. Our benchmark data enables translation evaluation in more than 1400 new translation directions.

3. We propose a misidentification-based hierarchical model whose units act as an alternative to large and expensive multilingual models for low-resource languages. Using these, we expand language identification coverage without retraining entire models from scratch.
2 Data Curation
We identify two large-scale parallel repositories, the African Storybooks Initiative and Pratham Books' Storyweaver, both under permissive Creative Commons licenses, with their storybooks available for non-commercial and research use.

4 https://github.com/wooorm/franc/
Family Languages Sentences
Niger-Congo 129 142605
Indo-European 84 169823
Nilo-Saharan 22 23204
Sino-Tibetan 21 19264
Austronesian 18 28096
Afro-Asiatic 15 20266
Dravidian 13 35638
Austro-Asiatic 10 22989
Otomanguean 9 6761
Creole 8 1037
Mayan 7 1379
Turkic 5 5970
Uto-Aztecan 4 7245
Mixe-Zoquean 3 2005
Table 1: Some key language families with 1000+ sen-
tences across languages in the combined African Story-
books and Storyweaver data.
As the name suggests, the African Storybooks Initiative focuses on children's stories in languages and dialects from Africa, and hosts parallel translated and human-verified children's stories in over 200 African languages. Pratham Books is a non-profit Indian publisher that aims to increase literacy of children and adults alike in Indian languages. Their digital repository, Storyweaver, publishes parallel translated stories in 300+ languages. This includes not only Indian languages but also African, European, and Indigenous languages from the Americas.
2.1 Parallel Dataset
We collect stories through a mix of web scraping and public APIs, preprocess them to remove mismatched/incorrect text, and extract monolingual text for language identification and parallel text for machine translation. We maintain metadata about authors, translators, illustrators, reading level, parallel translations, and copyrights for each story. We remove stories that are either empty or that come from non-English languages yet have over 50% of pages classified as containing English text with 90% confidence using langdetect (Nakatani, 2010). This leaves us with 52K stories.
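As a concrete illustration of this filtering step, the sketch below flags pages that langdetect labels English with at least 90% confidence and drops non-English stories where such pages exceed half the total; the helper names (is_english_page, keep_story) and the page-list input format are our assumptions, not the released pipeline.

```python
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # langdetect is randomized; fix the seed for stability

def is_english_page(text: str, min_prob: float = 0.90) -> bool:
    """True if langdetect assigns English probability >= min_prob."""
    try:
        return any(c.lang == "en" and c.prob >= min_prob for c in detect_langs(text))
    except Exception:  # langdetect raises on empty/featureless input
        return False

def keep_story(pages: list[str], language: str) -> bool:
    """Drop empty stories, and non-English stories in which over 50% of
    pages are confidently classified as English (likely mislabeled)."""
    pages = [p for p in pages if p.strip()]
    if not pages:
        return False
    if language == "eng":
        return True
    return sum(is_english_page(p) for p in pages) / len(pages) <= 0.5
```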
2.2 Multilingual Documents
The dataset also contains multilingual stories, with language identifiers denoted by L1_L2 for a story multilingual in L1 and L2. Such stories include text in multiple languages within the same page; this text may be code-mixed or presented consecutively.
Script Languages Examples
Devanagari 38 Hindi, Marathi
Cyrillic 14 Russian, Bulgarian
Arabic 8 Arabic, Persian
Tibetan 3 Tibetan, Ladakhi
Telugu 3 Telugu, Konda
Odia 3 Odia, Ho, Kui
Table 2: Some prominent non-Latin writing systems in the combined African Storybooks and Storyweaver data.
In order to extract as many parallel sentences as possible to support vulnerable languages and also create new translation directions, we employ string-similarity-based matching to identify the segments corresponding to the high-resource language in the pair, and thereby automatically generate over 10K new parallel pages across 52 languages. This was facilitated by the highly parallel nature of the dataset and the guaranteed occurrence of high-resource-language translations for each story. For example, through this process, we extracted 1000+ sentences in Kui (0 sentences pre-extraction), a minority Dravidian language with about 900K native speakers. These extracted sentences can be used for language identification training as a monolingual seed corpus, and for translation, since the sentences are parallel with Odia (the official language in Odisha, where Kui is spoken).
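The paper leaves the exact matching procedure unspecified beyond "string-similarity based matching", so the following is only a minimal sketch of the idea using difflib; the segmentation granularity and the 0.7 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def split_multilingual_page(page_segments, hi_res_segments, threshold=0.7):
    """Assign each segment of a bilingual (L1_L2) page to the high-resource
    or low-resource side by similarity against the story's guaranteed
    high-resource translation; unmatched segments are the low-resource text."""
    hi_side, lo_side = [], []
    for seg in page_segments:
        best = max((SequenceMatcher(None, seg, ref).ratio()
                    for ref in hi_res_segments), default=0.0)
        (hi_side if best >= threshold else lo_side).append(seg)
    return hi_side, lo_side

# The low-resource segments remain aligned with the high-resource translation
# at the page level, yielding new parallel pages (e.g., Kui-Odia).
```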
2.3 Language Varieties and Dialects
We attempt to separate language varieties into unique prediction classes if there is sufficient training data for them, setting a cutoff at 1000 sentences. If a language ISO code is available for the variety, it is used. Otherwise, we assign a class name with the ISO code and the subdivision specified as ISO_subdivision. For instance, we separated Gondi's South Bastar variety (gon_bastar, 4000+ sentences) from the generic language code for Gondi (gon). For fair evaluation and comparison, we provide manual mappings from various language identification tools' (CLD3, Franc, langid.py (Lui and Baldwin, 2012)) outputs to our output class space during inference.
Language varieties/dialects with no unique ISO code and with little data are naturally merged according to their parent language's ISO code. For example, "Bangla (Bangladesh)" and "Bengali" are merged, since Bangladesh's Bengali variety doesn't have a unique ISO code and doesn't have over 1000 sentences in the dataset.
Dataset New languages New pairs
Microsoft 67 2835
FLORES-200 51 1449
OPUS 82 2853
Table 3: Additional languages and pairs (with test data)
in our corpus compared to other benchmarks.
Here, "Bengali" doesn't specifically refer to Indian Bengali but to broader Bengali text with many dialects included (i.e., where the author/translator didn't specify a particular dialect). Both are assigned the ISO code ben and their stories merged. Note that we perform such merges as the last step of the data processing pipeline, and unmerged stories with complete metadata are made available. A full list of these transformations with explanations is located in our GitHub repository.
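The class-assignment policy of this section can be distilled into a few lines; the helper below is a hypothetical summary of the described rules, not code from the repository.

```python
def lid_class(parent_iso, variety_iso, subdivision, n_sentences, cutoff=1000):
    """Varieties with >= cutoff sentences become their own class, using
    their own ISO code if one exists and ISO_subdivision otherwise;
    low-data varieties without a unique ISO code merge into the parent."""
    if n_sentences >= cutoff:
        return variety_iso or f"{parent_iso}_{subdivision}"
    return parent_iso

assert lid_class("gon", None, "bastar", 4000) == "gon_bastar"  # South Bastar Gondi
assert lid_class("ben", None, "bangladesh", 800) == "ben"      # merged into Bengali
```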
2.4 Data Overview
The combined data covers over 350 languages from a diverse pool of language families. In Table 1, we share the number of languages and the number of sentences in each language family in the dataset. The data is roughly evenly split between stories from the large Niger-Congo and Indo-European language families, with a sizeable minority in other language families like Nilo-Saharan, Sino-Tibetan, Austronesian, Dravidian, Creole, etc. An exhaustive list of all ISO codes and language-variety-specific codes used for language identification and machine translation tasks is available in our GitHub repository.
Compared to the most multilingual existing translation benchmarks like NTREX (parallel data of 128 languages with English; Federmann et al., 2022), FLORES-200 (n-way, 200 languages; NLLB Team et al., 2022), or OPUS-100 (parallel data for 99 languages to/from English; Aharoni et al., 2019), our benchmark introduces up to 82 new languages, leading to more than 1400 new language pairs (see Table 3).
About 70% of the dataset's languages use the Latin script or its extended variants with diacritics, in line with the global adoption and usage of the Latin script. However, the data is quite typographically rich, and stories in non-Latin scripts are abundant; prominent writing systems are enumerated in Table 2. Details to reproduce the raw data, intermediate preprocessing, and the merged data can be found in Appendix A.1.
3 Machine Translation Benchmark
To test whether our dataset can improve machine translation performance, we perform experiments with hierarchical adapter units and provide new baselines among low-resource African languages.
3.1 Experimental Settings
As our baseline, we used the model from Alam and Anastasopoulos (2022), which is the best-performing publicly available model from the WMT Shared Task on Large-Scale Evaluation for African Languages (Adelani et al., 2022).5 They first fine-tuned the DeltaLM6 model (Ma et al., 2021) on 26 languages. After that, they added lightweight language-specific adapter layers (Pfeiffer et al., 2022) and fine-tuned only the adapters in those 26 languages. We can either use a single adapter per language (L-Fine) or organize the adapters in a phylogenetically-informed hierarchy (F-Fine) so that similar languages share language-family- and genus-level adapters (Faisal and Anastasopoulos, 2022). See Appendix A.3 for details on the phylogenetic trees we used in our experiments.
We perform both L-Fine and F-Fine experiments using the publicly available code7 and also share an additional baseline by fine-tuning the DeltaLM model without adapters. Details to reproduce our machine translation experiments, baselines, and results can be found in Appendix A.3.
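To make the L-Fine/F-Fine distinction concrete, the sketch below shows how a phylogeny can determine which adapters are active for a language; the tree fragment and adapter naming are illustrative assumptions rather than the data structures of the released code.

```python
# Hypothetical phylogenetic paths: family -> genus -> language.
PHYLO = {
    "kin": ("Niger-Congo", "Bantu", "kin"),
    "lug": ("Niger-Congo", "Bantu", "lug"),
    "hau": ("Afro-Asiatic", "Chadic", "hau"),
}

def adapter_stack(lang, mode="F-Fine"):
    """L-Fine fine-tunes a single adapter per language; F-Fine additionally
    routes through shared family- and genus-level adapters, so that, e.g.,
    kin and lug share their Niger-Congo and Bantu adapters."""
    if mode == "L-Fine":
        return [f"adapter:{lang}"]
    return [f"adapter:{node}" for node in PHYLO[lang]]

print(adapter_stack("kin"))  # ['adapter:Niger-Congo', 'adapter:Bantu', 'adapter:kin']
```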
3.2 Train-Test Split
We shuffle all stories and split them to obtain at least 1000 pages for test sets. All excess stories are kept for training and are used for fine-tuning. We ensure that no sentences from the same story appear on both sides of the train-test split, i.e., all training stories are separate from test stories. This is done to get a more realistic estimate of translation quality on new stories. For languages with 1000 or fewer pages, we use 500-page test sets.
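A minimal sketch of this story-level split, assuming each story is a dict with a "pages" list; the function and field names are illustrative.

```python
import random

def story_level_split(stories, test_pages=1000, seed=0):
    """Shuffle whole stories and fill the test set until it holds at least
    `test_pages` pages; every remaining story goes to training, so no story
    (and hence no sentence) ever appears on both sides of the split."""
    stories = stories[:]
    random.Random(seed).shuffle(stories)
    train, test, n_test = [], [], 0
    for story in stories:
        if n_test < test_pages:
            test.append(story)
            n_test += len(story["pages"])
        else:
            train.append(story)
    return train, test
```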
3.3 Results: Machine Translation
In Table 5, we show the performance of our L-Fine and F-Fine models compared to the baseline on our test set.
5 That system ranked third in the Shared Task, but the top two systems were industry submissions that are not publicly available.
6 https://aka.ms/deltalm
7 https://github.com/mahfuzibnalam/large-scale_MT_African_languages
Pair spBLEU Pair spBLEU
eng-xho 20.1 eng-hau 18.8
fra-lug 3.6 nso-lug 3.0
lug-kin 2.9 kin-lug 2.4
nya-lug 2.1 eng-kam 1.8
ibo-lug 1.7 eng-lug 1.5
zul-lug 1.5 fra-tso 1.3
xho-lug 1.2 fra-yor 1.1
nso-tso 1.0 amh-lug 1.0
Table 4: Example language pairs with performance
gains for the F-Fine model over the baseline one.
We evaluate using three well-known MT metrics: BLEU (Papineni et al., 2002), chrF++ (Popović, 2017), and spBLEU (NLLB Team et al., 2022). For spBLEU, we use the FLORES-200 SPM model to create subwords. In all three metrics, we see the same trend across different averages. Our L-Fine model outperforms the Baseline model by 4.0-11.5 spBLEU points by fine-tuning only the language-specific adapters on our training set. Our F-Fine model outperforms the L-Fine model by 5.0-7.5 spBLEU points by fine-tuning only some shared parameters among languages and language-specific adapters.
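For reference, the three metrics can be computed with sacrebleu roughly as follows; `tokenize="flores200"` assumes a sacrebleu release that ships the FLORES-200 SPM tokenizer, and the hypothesis/reference strings are placeholders.

```python
from sacrebleu.metrics import BLEU, CHRF

hyps = ["a system translation"]        # placeholder system outputs
refs = [["a reference translation"]]   # one reference stream

bleu   = BLEU()                        # standard BLEU (13a tokenization)
spbleu = BLEU(tokenize="flores200")    # spBLEU: BLEU over FLORES-200 SPM subwords
chrfpp = CHRF(word_order=2)            # chrF++ = chrF with word bigrams

for metric in (bleu, spbleu, chrfpp):
    print(metric.corpus_score(hyps, refs))
```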
We also test our models on the FLORES-200 benchmark (Appendix B) and observe that our L-Fine and F-Fine models under-perform the Baseline model except in the Avg eng→X directions, across all three evaluation metrics. This is likely due to domain adaptation of the L-Fine and F-Fine models to the story domain upon fine-tuning. Since the dataset consists of children's stories, which are usually written in simpler language, it may also be a slightly easier domain than FLORES. Even then, there are low-resource language pairs that benefit from fine-tuning using adapters across domains. We report these language pairs and their respective spBLEU gains for the F-Fine model in Table 4. We get the highest gains for English-Xhosa (20.1 points) and English-Hausa (18.8 points) across domains, both of which had poor performance from the Baseline model, with spBLEU of 3.5 and 4.5, respectively. Exhaustive results for other language pairs can be found in Appendix B.
4 Language (Mis)Identification
Language identification severely affects resource creation efforts for low-resource languages (Jauhiainen et al., 2019; Schwenk et al., 2021a): to collect data, we need accurate language identifiers, which themselves need data to train, creating a vicious cycle. Low-quality language identification systems often make mispredictions that increase inter-class noise and reduce the crawled data's quality (Kocyigit et al., 2022), both for the predicted language and the true language. To correct mispredictions and improve accuracy in supported languages with limited data and sustainably trained models, we propose a hierarchical modeling approach.
Hierarchical modeling is an extremely popular choice for a wide variety of algorithmic tasks, and it has been explored for language identification as well (Goutte et al., 2014; Lui et al., 2014; Bestgen, 2017; Jauhiainen et al., 2019). However, previous work has focused on predicting a language group/family first, followed by finer-grained predictions over a smaller set of classes. Our work departs from this paradigm in two ways: first, we focus on expanding language identification coverage in pre-trained or off-the-shelf systems without retraining, and second, we predict a prior and a posterior language based directly on the confusion and misprediction patterns of the model (without predicting a language family/group first).
First, we choose a well-performing root model with high coverage that provides us with the base/prior prediction. Such base predictions are obtained for a sample of data (e.g., our benchmark training set), allowing us to identify systemic confusion patterns embedded within the model using a confusion matrix. Based on the identified misprediction patterns (which may or may not be between languages in the same family), we train lightweight confusion-resolution subunits that can be attached onto the root model to make the posterior prediction. Our results showcase that a sample of data can be used to investigate a pretrained, off-the-shelf, or even blackbox commercial model, identify systemic misprediction patterns, and resolve them with hierarchical models.
4.1 Experimental Settings
To establish a root system for the hierarchical model architecture, we train our own Multinomial Naive Bayes model with training data from 355 classes (355 languages + unknown class), using character-level n-gram features. We withhold 50 pages from randomly selected stories for each language to create the test set. As is common in extremely low-resource settings, 123 languages in our selection had fewer than 200 sentences.
Metric   Model      Avg all   Avg Afr→Afr   Avg X→eng   Avg eng→X   Avg Y→fra   Avg fra→Y
spBLEU   Baseline   11.87     10.19         18.79       13.20       15.64       12.55
spBLEU   L-Fine     19.52     18.21         30.38       17.46       21.93       17.86
spBLEU   F-Fine     24.93     23.58         35.66       25.26       27.06       21.36

Table 5: Evaluation results on our test set of 176 language directions. Avg all denotes the average result over all 176 translation directions; Avg Afr→Afr denotes the average score over directions between African languages; Avg X→eng denotes the average score for translating into English, and Avg eng→X for translating out of English (similarly, Avg Y→fra and Avg fra→Y for French).
Therefore, we used synthetic minority oversampling (Chawla et al., 2002) and tested on 10 human-verified sentences per language. To condense the large number of features, improve inference speed, and keep the model size low, we use Incremental PCA (always preserving at least 90% of the variance of the original features). As recommended in Chawla et al. (2002), minority-class upsampling is done after feature extraction.
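A compact sketch of this root classifier, assuming train_texts/train_labels lists; the exact n-gram range and SMOTE neighbor count are our assumptions. One caveat: scikit-learn's MultinomialNB requires non-negative inputs, so the Incremental PCA step is noted in a comment rather than composed naively with it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import SMOTE   # Chawla et al. (2002)
from imblearn.pipeline import Pipeline     # sampler-aware pipeline

root_model = Pipeline([
    # character n-grams (range assumed here)
    ("ngrams", CountVectorizer(analyzer="char", ngram_range=(2, 4))),
    # minority-class upsampling after feature extraction, as recommended
    ("smote", SMOTE(k_neighbors=3)),
    ("nb", MultinomialNB()),
])
# The paper additionally condenses features with Incremental PCA (preserving
# >= 90% variance); it is omitted here since MultinomialNB needs non-negative
# features, so composing the two requires extra care.

# root_model.fit(train_texts, train_labels)     # 355 classes + unknown
# prior_predictions = root_model.predict(test_texts)
```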
For the confusion-resolving classification units, we again train simple Multinomial Naive Bayes models with up to 1000 sentences per language, using character-level n-grams (2-4 grams) and word-level n-grams (1-2 grams) as features. We use Multinomial Naive Bayes models over other methods such as transformers to keep model complexity low and model sizes lean, and to show that reasonable performance on low-resource languages is possible even with limited computation, space, and training data. Similarly, we rely on character- and word-level n-grams since they can be universally computed and do not share the low-coverage disadvantages of pre-trained models like BERT (Devlin et al., 2019), which are not trained on sufficiently wide low-resource language data.
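A single confusion-resolution unit can then be sketched as follows, combining the two feature families named above; `make_unit` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

def make_unit():
    """Multinomial Naive Bayes over combined character 2-4 grams and word
    1-2 grams, trained on up to 1000 sentences per language in one cluster."""
    features = FeatureUnion([
        ("char", CountVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("word", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])
    return Pipeline([("features", features), ("nb", MultinomialNB())])

# amh_tir_stv_unit = make_unit().fit(cluster_texts, cluster_labels)
```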
4.2 Misidentification and Confusion Resolution

To resolve high-confidence incorrect predictions in the multilingual root model, we inspect its confusion matrix (a representative example is in Figure 2). For each test language, we divide the root model's predictions by the total number of tested examples, giving us a hit ratio for each pair. For example, (Gujarati, Kutchi) would represent the ratio of Kutchi sentences that were confused with Gujarati. We select the 9 clusters (given below) with a confusion ratio > 0.7 and train a hierarchical LIMIT model; cluster selection is sketched in code after the list.
1. Gujarati, Kutchi, Bhilori
2. Amharic, Tigrinya, Silt’e
3. Koda, Bengali, Assamese
4. Mandarin, Yue Chinese
5. Konda, Telugu
6. Kodava, Kannada
7. Tsonga, Tswa
8. Dagaare, Mumuye
9. Bats, Georgian
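The cluster selection sketched below assumes a confusion-count matrix over the test sample; reading multi-language clusters off pairwise confusions via union-find-style merging is our illustrative interpretation.

```python
import numpy as np

def confused_clusters(conf_counts, labels, threshold=0.7):
    """conf_counts[i, j] = how many test examples of true language j the root
    model predicted as language i. Column-normalizing gives the hit ratio;
    pairs above the threshold are merged into clusters."""
    totals = conf_counts.sum(axis=0, keepdims=True)
    ratios = conf_counts / np.maximum(totals, 1)

    parent = {lang: lang for lang in labels}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for i, pred in enumerate(labels):
        for j, true in enumerate(labels):
            if pred != true and ratios[i, j] > threshold:
                parent[find(true)] = find(pred)   # e.g., (Gujarati, Kutchi)

    clusters = {}
    for lang in labels:
        clusters.setdefault(find(lang), set()).add(lang)
    return [c for c in clusters.values() if len(c) > 1]
```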
As shown in Figure 2, we can identify that Amharic and Tigrinya (both supported languages) are often confused, with Amharic frequently misidentified as Tigrinya. Another kind of misidentification is when the source language is not supported by the model at all, e.g., Silt'e being misidentified as Tigrinya. To resolve this, we train a small unit to distinguish between Amharic, Tigrinya, and Silt'e. When the root model predicts Amharic or Tigrinya, the example gets passed down to the unit for a finer-grained prediction. This increases the model's coverage and resolves confusion without needing to retrain the root model. This routing is sketched below.
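A minimal sketch of the routing step, assuming scikit-learn-style classifiers and a `units` map from each covered language to its unit.

```python
def hierarchical_predict(text, root_model, units):
    """Two-stage prediction: the root model supplies the prior; if that
    language belongs to a confused cluster, the cluster's unit supplies the
    posterior. `units` maps e.g. both "amh" and "tir" to the amh/tir/stv unit."""
    prior = root_model.predict([text])[0]
    unit = units.get(prior)
    if unit is None:           # no known confusion: keep the root prediction
        return prior
    return unit.predict([text])[0]
```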
4.3 Evaluation
All models are evaluated on held-out test sets as per Section 3.1. To evaluate the confusion-resolution units, we report language-level scores as well as aggregates. For root model selection, we report macro F1 scores. Details to reproduce all experiments, models, and results can be found in Appendix A.2.
4.4 Results: Language Identification at Scale
In Table 6, we show macro F1 scores for all 4 systems: Google's CLD3, langid.py, Franc, and our baseline system, LIMIT. Scores are reported across all 355 languages in the test set to better compare model performances on large multi-class classification tasks with limited data. Our system, although trained with very limited data and a simple Multinomial Naive Bayes classifier (with 2-5x the number of classes compared to the other models), still performs on par with CLD3 and langid.py.
Figure 2: Subset of the multilingual root model's (Franc) confusion matrix (6 languages: Gujarati, Kutchi, Bhilori, Amharic, Tigrinya, and Silt'e). Using the confusion matrix, clusters of highly confused languages are identified and confusion-resolution units (e.g., a Gujarati-Kutchi-Bhilori unit and an Amharic-Tigrinya-Silt'e unit) are trained according to the tree shown on the right. The tree, for demonstration purposes, is a subset of the entire tree, which has 9 confusion-resolution units.
Metric     CLD3   langid.py   Franc   LIMIT
Macro F1   0.11   0.09        0.18    0.11

Table 6: Our baseline multilingual language identification model (LIMIT) places second when compared to the state of the art (aggregated F1 score on our test set). Based on this macro F1 score, we choose Franc as our root multilingual language identification model.
Franc, built using the Universal Declaration of Human Rights (UDHR) data, comes out to be the best model, covering 30% of our languages (105/356). It is derived from guess-language8, which uses a mix of writing-system detection and character-level trigrams. Our baseline model, LIMIT, is trained to identify 250 additional languages that Franc doesn't support, but due to limited data coupled with a large number of languages, it places second. Hence, we use Franc as the root system for our confusion-resolution and coverage-expansion experiments.

8 https://github.com/kent37/guess-language
4.5 Results: Language Misidentification
In Table 7, we report F1 scores for each of the 9 clusters highly confused by Franc. We observe that languages within each cluster share a single writing system and are phylogenetically related. Below, we analyze some highlights from Table 7.
Gujarati, Kutchi, and Bhilori are Western Indo-Aryan languages spoken primarily in Gujarat and written in the Gujarati script. Franc doesn't support low-resource languages like Kutchi and Bhilori and confuses them with Gujarati (Figure 2). Our confusion-resolution unit resolves these to produce competitive Kutchi and Bhilori F1 scores, with only a minor drop for Gujarati.
Amharic, Tigrinya, and Silt'e are all Ethiopic languages that use the Ge'ez script. Franc supports language identification for Amharic and Tigrinya, while it doesn't support Silt'e. Our confusion-resolution unit improves Amharic's F1 score while introducing a new language, Silt'e, at a reasonable baseline F1 score, with a minor drop in performance for Tigrinya.
Bengali and Assamese are Eastern Indo-Aryan languages, whereas Koda is an endangered Munda language. All three languages use the Bengali-Assamese script. With our confusion-resolution unit, we improve performance on all three languages and successfully introduce Assamese and Koda language identification.
Our hierarchical, confusion-resolution approach improves the aggregate F1 score from 0.20 to 0.55, a 175% increase in performance, while providing novel language identification for 13 new low-resource and endangered languages.
4.6 Computational and Space Complexity
Each trained model has two components: the classifier and a projection model, which projects test-time examples into the train-feature embedding space.
Language Franc Hier. LIMIT
Gujarati (guj) 0.50 0.48
Kutchi (kfr) 0.48
Bhilori 0.43
Amharic (amh) 0.21 0.47
Tigrinya (tir) 0.43 0.28
Silt’e (stv) 0.48
Koda (cdz) 0.32
Bengali (ben) 0.48 0.52
Assamese (asm) 0.25
Mandarin (zho) 0.36
Yue (yue) 0.68
Konda (kfc) 0.66
Telugu (tel) 0.64 0.69
Kodava (kfa) 0.55
Kannada (kan) 0.71 0.77
Tsonga (tso) 0.49 0.41
Tswa (tsc) 0.30
Dagaare 0.84
Mumuye (mzm) 0.86
Bats (bbl) 0.91
Georgian (kat) 0.67 0.89
aggregate 0.20 0.55
Table 7: Our Hierarchical LIMIT approach improves LID F1 over Franc in highly confused languages (over 70% confusion) across language families and with very limited data. Empty Franc entries indicate languages unsupported by Franc.
The traditional approach of training one large multilingual model takes 500MB of space, with a 15MB Naive Bayes classifier and a 450MB projection model. In contrast, our lightweight confusion-resolution approach creates units of size 7-10KB (0.06% of the base model) and a projection model of <100MB (33.34% of the base model). All reported sizes are uncompressed. The traditional large multilingual model with 365+ languages and only 1000 training examples per language takes 7-8 hours to train on CPU. In contrast, the hierarchical LIMIT units take 1-2 minutes to train (0.4% of the base time).
5 Related Work
Parallel Datasets Language identification models tend to use popular training datasets like Vatanen et al. (2010) (the UDHR data used by Franc), Blodgett et al. (2017) for social media (70 languages), King and Abney (2013) (web crawl in 30 languages), FLORES (200 languages), etc. Another recently published dataset, BLOOM (Leong et al., 2022), leverages text and audio in children's stories from similar sources (African Storybooks, The Asia Foundation, Little Zebra Books, etc.) to create benchmarks for image captioning and speech recognition. However, their data is monolingual and unaligned, and cannot be used for machine translation. We leveraged the highly parallel nature of the collected storybooks (5x the number of stories in BLOOM) and created test sets and baselines for understudied translation directions.
Machine Translation As a result of its ability to produce translations between multiple languages, multilingual neural machine translation (Dong et al., 2015; Johnson et al., 2017; Arivazhagan et al., 2019; Dabre et al., 2020; Philip et al., 2020; Lin et al., 2021) has become a popular architecture. Thousands of languages are spoken worldwide, so representing them with bilingual models would require thousands of models; neither scalability nor adaptability makes this an ideal solution. Through various training methods (Aharoni et al., 2019; Wang et al., 2020), model structures (Wang et al., 2018; Gong et al., 2021; Zhang et al., 2021), and data augmentation (Tan et al., 2019; Pan et al., 2021), a variety of research has attempted to improve multilingual translation models. Adapter units were initially proposed for lightweight domain adaptation in MT (Vilar, 2018) and then also for extending a large pre-trained model to a downstream task (Houlsby et al., 2019). Bapna and Firat (2019) improved pre-trained multilingual machine translation models for domain adaptation using bilingual adapters.
Language Identification Text-based language identification is usually modelled as a classification task. Similar to our featurization approach, other popular language identification models utilize byte-, character-, and word-level n-gram features, followed by some dimensionality reduction and classifiers such as SVMs (Ciobanu et al., 2018; Malmasi and Dras, 2015), Naive Bayes (King et al., 2014; Mathur et al., 2017), and neural networks (Medvedeva et al., 2017; Criscuolo and Aluísio, 2017; Eldesouki et al., 2016), for their straightforward modeling and high performance. As the number of classes/languages a classifier handles increases, accuracy tends to decrease (Jauhiainen et al., 2017), a problem we propose to tackle by leveraging a confusion-informed hierarchical approach. To distinguish between closely related languages, a lot of exciting research has been published at various editions of VarDial, the Workshop on NLP for Similar Languages, Varieties and Dialects (Aepli et al., 2022; Scherrer et al., 2022; Chakravarthi et al., 2021; Zampieri et al., 2020, 2014). But even at the workshop, a large number of ongoing tasks and papers are restricted to European languages, with very little space in the agenda for Indian, African, or other Indigenous languages. Over the last three iterations of VarDial, from 2019 to 2022, many new datasets and techniques were published to identify Romance languages such as Italian (Jauhiainen et al., 2022b; Camposampiero et al., 2022; Zugarini et al., 2020) or Romanian (Jauhiainen et al., 2021a; Zaharia et al., 2021; Ceolin and Zhang, 2020; Zaharia et al., 2020), Nordic languages (Mæhlum et al., 2022; Haas and Derczynski, 2021), Uralic languages (Jauhiainen et al., 2020; Bernier-Colborne et al., 2021), German varieties (Mihaela et al., 2021; Nigmatulina et al., 2020; Gaman and Ionescu, 2020; Siewert et al., 2020), and the Slavic language continuum (Popović et al., 2020; Abdullah et al., 2020). In contrast, we see a very small number of papers or tasks on Indian languages at the venue, with two focusing on Indo-Aryan and two focusing on Dravidian languages (Nath et al., 2022; Bhatia et al., 2021; Jauhiainen et al., 2021b; Chakravarthi et al., 2020), and no papers or tasks, to our knowledge, on African languages or varieties.
Hierarchical Modeling Hierarchical approaches have proved successful in solving a myriad of computational problems and have previously proved useful in language identification. The widely used approach first predicts a preliminary language group/family that a given input may belong to, and then makes another fine-grained prediction over the smaller set of output classes contained within that language group/family (Goutte et al., 2014; Lui et al., 2014; Bestgen, 2017; Jauhiainen et al., 2019). In contrast, our work extends this commonly accepted hierarchical modeling architecture to account for mispredictions made by existing trained models; it does not predict a group/language family first, but rather directly learns confusion relationships between language pairs. Then, similar to Bestgen (2017) and Goutte et al. (2014), we train smaller classifiers for a fine-grained prediction, but in contrast, our classifiers distinguish between highly confused languages (which may not be part of the same language group/family) and map a first-pass language prediction (not a family/group) into another, refined language prediction.
6 Conclusion
We introduce Hier-LIMIT, a hierarchical, confusion-based approach to counter the misidentifications in pretrained language identification systems while increasing language coverage without retraining large multilingual models for text classification. We release a large, massively parallel children's stories dataset covering languages from diverse language families, writing systems, and reading levels. We utilize this parallel dataset to create new translation directions for vulnerable and low-resource languages. We train adapter-based networks fine-tuned on language- and family/sub-family-level information and demonstrate improvements in the children's story domain as well as cross-domain improvements for several languages (on the FLORES benchmark dataset).

Our dataset also includes monolingual text extracted from multilingual stories to enable the creation of language identification tools for low-resource languages that don't have such tools available. We perform experiments demonstrating the performance of pretrained language identification models on these languages, highlight their high-confidence incorrect predictions, and offer a lightweight hierarchical solution. In the future, we hope to use this children's story data to investigate better architectures, feature selection, and training setups to further improve our baselines. Armed with high-quality, wide-coverage language identification systems, we will also experiment with large-scale data mining efforts for these under-resourced languages.
Limitations
While our hierarchical model approach is efficient, we were limited in the data with which we trained the subunits. All language identification training and testing data were obtained from the parallel children's story dataset. We believe that if more diverse training data can be collected in low-resource languages, the hierarchical subunits will be performant across domains, since the identified confusion will also be domain-independent.
Our dataset covers over 350 languages, and we build high-quality language identification models for these languages. However, we restrict ourselves to text-based language identification and translation. Out of the 7000 languages in the world, many are primarily spoken and do not have an online or offline textual presence in the form of articles, textbooks, stories, etc. Therefore, understanding and studying speech is crucial, and we plan on tackling speech-based language identification/recognition and machine translation in future work.
Language identification performance varies with domain, length of text, and language. We acknowledge that our system, like other state-of-the-art systems, is not perfect and may make classification errors due to such factors. We hope that readers will understand this risk and its potential downstream effects well before using our dataset, language identification models, or machine translation results in their work.
There are many more off-the-shelf systems other than the ones we used, such as HeLI-OTS (Jauhiainen et al., 2022a) and fastText (Joulin et al., 2016), as well as methods to transform the feature space (Brown, 2014) and techniques to improve dataset precision for low-resource languages for better crawls (Caswell et al., 2020), which we hope to include in the future to produce a stronger benchmark for the community.
Ethics Statement
Data used, compiled, and preprocessed in this
project is freely available online under Creative
Commons licenses (CC BY 4.0). Stories from the
African Storybooks Initiative (ASI) are openly li-
censed, can be used without asking for permission,
and without paying any fees. We acknowledge the
writers, authors, translators, illustrators of each of
the books and the ASI team for creating such a valu-
able repository of parallel storybooks in African
languages. Stories from the Pratham Storybooks’
Storyweaver portal are available under open licens-
ing as well, and we preserve metadata for the au-
thor, illustrator, translator (where applicable), pub-
lisher, copyright information, and donor/funder for
each book, in accordance with Storyweaver’s guide-
lines. Since stories hosted on African Storybooks
Initiative and Pratham Books’ Storyweaver are in-
tended for children and most of them are vetted
or human-verified we do not explicitly check for
offensive content.
Our language identification models, by design,
are meant to provide an alternative to training
resource-hungry large-scale multilingual models
that require a lot of training data. Such models are
inaccessible to many researchers since they require
access to specialized computing hardware. Our
models are built with sustainability and equity in
mind, and can be trained in a matter of minutes on
CPU on standard laptops.
Acknowledgments
This work was generously supported by the National Endowment for the Humanities under award PR-276810-21 and by the National Science Foundation under award FAI-2040926. Computational resources for experiments were provided by the Office of Research Computing at George Mason University (URL: https://orc.gmu.edu) and funded in part by grants from the National Science Foundation (Award Numbers 1625039 and 2018631).
References
Badr M. Abdullah, Jacek Kudera, Tania Avgustinova,
Bernd Möbius, and Dietrich Klakow. 2020. Redis-
covering the Slavic continuum in representations
emerging from neural models of spoken language
identification. In Proceedings of the 7th Workshop
on NLP for Similar Languages, Varieties and Di-
alects, pages 128–139, Barcelona, Spain (Online).
International Committee on Computational Linguis-
tics (ICCL).
Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, and Alcides Alcoba Inciarte. 2022. AfroLID: A neural language identification tool for African languages.
David Adelani, Md Mahfuz Ibn Alam, Antonios Anastasopoulos, Akshita Bhagia, Marta R. Costa-jussà, Jesse Dodge, Fahim Faisal, Christian Federmann, Natalia Fedorova, Francisco Guzmán, Sergey Koshelev, Jean Maillard, Vukosi Marivate, Jonathan Mbuya, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk, and Guillaume Wenzek. 2022. Findings of the WMT'22 shared task on large-scale machine translation evaluation for African languages. In Proceedings of the Seventh Conference on Machine Translation, pages 773–800, Abu Dhabi. Association for Computational Linguistics.
Noëmi Aepli, Antonios Anastasopoulos, Adrian-Gabriel
Chifu, William Domingues, Fahim Faisal, Mihaela
Gaman, Radu Tudor Ionescu, and Yves Scherrer.
2022. Findings of the VarDial evaluation campaign
2022. In Proceedings of the Ninth Workshop on NLP
for Similar Languages, Varieties and Dialects, pages
1–13, Gyeongju, Republic of Korea. Association for
Computational Linguistics.
Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019.
Massively multilingual neural machine translation.
In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 3874–3884,
Minneapolis, Minnesota. Association for Computa-
tional Linguistics.
Md Mahfuz Ibn Alam and Antonios Anastasopoulos.
2022. Language adapters for large-scale mt: The
gmu system for the wmt 2022 large-scale machine
translation evaluation for african languages shared
task. In Proceedings of the Seventh Conference on
Machine Translation, pages 1015–1033, Abu Dhabi.
Association for Computational Linguistics.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat,
Dmitry Lepikhin, Melvin Johnson, Maxim Krikun,
Mia Xu Chen, Yuan Cao, George F. Foster, Colin
Cherry, Wolfgang Macherey, Zhifeng Chen, and
Yonghui Wu. 2019. Massively multilingual neural
machine translation in the wild: Findings and chal-
lenges. CoRR, abs/1907.05019.
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth
Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L.
Forcada, Amir Kamran, Faheem Kirefu, Philipp
Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere,
Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec,
Brian Thompson, William Waites, Dion Wiggins, and
Jaume Zaragoza. 2020. ParaCrawl: Web-scale acqui-
sition of parallel corpora. In Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 4555–4567, Online. Association
for Computational Linguistics.
Ankur Bapna and Orhan Firat. 2019. Simple, scal-
able adaptation for neural machine translation. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 1538–
1548, Hong Kong, China. Association for Computa-
tional Linguistics.
Gabriel Bernier-Colborne, Serge Leger, and Cyril
Goutte. 2021. N-gram and neural models for Uralic
language identification: NRC at VarDial 2021. In
Proceedings of the Eighth Workshop on NLP for Sim-
ilar Languages, Varieties and Dialects, pages 128–
134, Kiyv, Ukraine. Association for Computational
Linguistics.
Yves Bestgen. 2017. Improving the character ngram
model for the DSL task with BM25 weighting and
less frequently used feature sets. In Proceedings of
the Fourth Workshop on NLP for Similar Languages,
Varieties and Dialects (VarDial), pages 115–123, Va-
lencia, Spain. Association for Computational Lin-
guistics.
Kushagra Bhatia, Divyanshu Aggarwal, and Ashwini
Vaidya. 2021. Fine-tuning distributional semantic
models for closely-related languages. In Proceed-
ings of the Eighth Workshop on NLP for Similar Lan-
guages, Varieties and Dialects, pages 60–66, Kiyv,
Ukraine. Association for Computational Linguistics.
Su Lin Blodgett, Johnny Wei, and Brendan O’Connor.
2017. A dataset and classifier for recognizing social
media English. In Proceedings of the 3rd Workshop
on Noisy User-generated Text, pages 56–61, Copen-
hagen, Denmark. Association for Computational Lin-
guistics.
Ralf Brown. 2014. Non-linear mapping for improved
identification of 1300+ languages. In Proceedings
of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 627–
632, Doha, Qatar. Association for Computational
Linguistics.
Giacomo Camposampiero, Quynh Anh Nguyen, and
Francesco Di Stefano. 2022. The curious case of
logistic regression for Italian languages and dialects
identification. In Proceedings of the Ninth Work-
shop on NLP for Similar Languages, Varieties and
Dialects, pages 86–98, Gyeongju, Republic of Korea.
Association for Computational Linguistics.
Isaac Caswell, Theresa Breiner, Daan van Esch, and
Ankur Bapna. 2020. Language ID in the wild: Unex-
pected challenges on the path to a thousand-language
web text corpus. In Proceedings of the 28th Inter-
national Conference on Computational Linguistics,
pages 6588–6608, Barcelona, Spain (Online). Inter-
national Committee on Computational Linguistics.
Andrea Ceolin and Hong Zhang. 2020. Discriminating
between standard Romanian and Moldavian tweets
using filtered character ngrams. In Proceedings of the
7th Workshop on NLP for Similar Languages, Vari-
eties and Dialects, pages 265–272, Barcelona, Spain
(Online). International Committee on Computational
Linguistics (ICCL).
Bharathi Raja Chakravarthi, Gaman Mihaela, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Ruba Priyadharshini, Christoph Purschke, Eswari Rajagopal, Yves Scherrer, and Marcos Zampieri. 2021. Findings of the VarDial evaluation campaign 2021. In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 1–11, Kiyv, Ukraine. Association for Computational Linguistics.
Bharathi Raja Chakravarthi, Navaneethan Ra-
jasekaran, Mihael Arcan, Kevin McGuinness, Noel
E. O’Connor, and John P. McCrae. 2020. Bilingual
lexicon induction across orthographically-distinct
under-resourced Dravidian languages. In Pro-
ceedings of the 7th Workshop on NLP for Similar
Languages, Varieties and Dialects, pages 57–69,
Barcelona, Spain (Online). International Committee
on Computational Linguistics (ICCL).
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall,
and W. Philip Kegelmeyer. 2002. Smote: Synthetic
minority over-sampling technique. J. Artif. Int. Res.,
16(1):321–357.
Alina Maria Ciobanu, Marcos Zampieri, Shervin Mal-
masi, Santanu Pal, and Liviu P. Dinu. 2018. Discrim-
inating between Indo-Aryan languages using SVM
ensembles. In Proceedings of the Fifth Workshop on
NLP for Similar Languages, Varieties and Dialects
(VarDial 2018), pages 178–184, Santa Fe, New Mex-
ico, USA. Association for Computational Linguistics.
Marcelo Criscuolo and Sandra Maria Aluísio. 2017.
Discriminating between similar languages with word-
level convolutional neural networks. In Proceedings
of the Fourth Workshop on NLP for Similar Lan-
guages, Varieties and Dialects (VarDial), pages 124–
130, Valencia, Spain. Association for Computational
Linguistics.
Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan.
2020. A comprehensive survey of multilingual neural
machine translation. CoRR, abs/2001.01115.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
4171–4186, Minneapolis, Minnesota. Association for
Computational Linguistics.
Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and
Haifeng Wang. 2015. Multi-task learning for multi-
ple language translation. In Proceedings of the 53rd
Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Confer-
ence on Natural Language Processing (Volume 1:
Long Papers), pages 1723–1732, Beijing, China. As-
sociation for Computational Linguistics.
Mohamed Eldesouki, Fahim Dalvi, Hassan Sajjad, and
Kareem Darwish. 2016. QCRI @ DSL 2016: Spoken
Arabic dialect identification using textual features. In
Proceedings of the Third Workshop on NLP for Sim-
ilar Languages, Varieties and Dialects (VarDial3),
pages 221–226, Osaka, Japan. The COLING 2016
Organizing Committee.
Fahim Faisal and Antonios Anastasopoulos. 2022.
Phylogeny-inspired adaptation of multilingual mod-
els to new languages. In Proceedings of the 2nd
Conference of the Asia-Pacific Chapter of the Asso-
ciation for Computational Linguistics and the 12th
International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages 434–452,
Online only. Association for Computational Linguis-
tics.
Christian Federmann, Tom Kocmi, and Ying Xin. 2022.
NTREX-128 news test references for MT evalua-
tion of 128 languages. In Proceedings of the First
Workshop on Scaling Up Multilingual Evaluation,
pages 21–24, Online. Association for Computational
Linguistics.
Mihaela Gaman and Radu Tudor Ionescu. 2020. Com-
bining deep learning and string kernels for the lo-
calization of Swiss German tweets. In Proceedings
of the 7th Workshop on NLP for Similar Languages,
Varieties and Dialects, pages 242–253, Barcelona,
Spain (Online). International Committee on Compu-
tational Linguistics (ICCL).
Hongyu Gong, Xian Li, and Dmitriy Genzel. 2021.
Adaptive sparse transformer for multilingual transla-
tion.
Cyril Goutte, Serge Léger, and Marine Carpuat. 2014.
The NRC system for discriminating similar lan-
guages. In Proceedings of the First Workshop on
Applying NLP Tools to Similar Languages, Varieties
and Dialects, pages 139–145, Dublin, Ireland. As-
sociation for Computational Linguistics and Dublin
City University.
René Haas and Leon Derczynski. 2021. Discriminating
between similar Nordic languages. In Proceedings of
the Eighth Workshop on NLP for Similar Languages,
Varieties and Dialects, pages 67–75, Kiyv, Ukraine.
Association for Computational Linguistics.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski,
Bruna Morrone, Quentin De Laroussilhe, Andrea
Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019.
Parameter-efficient transfer learning for NLP. In
Proceedings of the 36th International Conference
on Machine Learning, volume 97 of Proceedings
of Machine Learning Research, pages 2790–2799.
PMLR.
Tommi Jauhiainen, Heidi Jauhiainen, and Krister
Lindén. 2021a. Naive Bayes-based experiments in
Romanian dialect identification. In Proceedings of
the Eighth Workshop on NLP for Similar Languages,
Varieties and Dialects, pages 76–83, Kiyv, Ukraine.
Association for Computational Linguistics.
Tommi Jauhiainen, Heidi Jauhiainen, and Krister
Lindén. 2022a. HeLI-OTS, off-the-shelf language
identifier for text. In Proceedings of the Thirteenth
Language Resources and Evaluation Conference,
pages 3912–3922, Marseille, France. European Lan-
guage Resources Association.
Tommi Jauhiainen, Heidi Jauhiainen, and Krister
Lindén. 2022b. Italian language and dialect iden-
tification and regional French variety detection using
adaptive naive Bayes. In Proceedings of the Ninth
Workshop on NLP for Similar Languages, Varieties
and Dialects, pages 119–129, Gyeongju, Republic of
Korea. Association for Computational Linguistics.
Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen,
and Krister Lindén. 2020. Uralic language identifi-
cation (ULI) 2020 shared task dataset and the wanca
2017 corpora. In Proceedings of the 7th Workshop
on NLP for Similar Languages, Varieties and Di-
alects, pages 173–185, Barcelona, Spain (Online).
International Committee on Computational Linguis-
tics (ICCL).
Tommi Jauhiainen, Krister Lindén, and Heidi Jauhi-
ainen. 2017. Evaluation of language identification
methods using 285 languages. In Proceedings of the
21st Nordic Conference on Computational Linguis-
tics, pages 183–191, Gothenburg, Sweden. Associa-
tion for Computational Linguistics.
Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timo-
thy Baldwin, and Krister Lindén. 2019. Automatic
language identification in texts: A survey. J. Artif. Int. Res., 65(1):675–682.
Tommi Jauhiainen, Tharindu Ranasinghe, and Marcos
Zampieri. 2021b. Comparing approaches to Dravid-
ian language identification. In Proceedings of the
Eighth Workshop on NLP for Similar Languages, Va-
rieties and Dialects, pages 120–127, Kiyv, Ukraine.
Association for Computational Linguistics.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim
Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat,
Fernanda Viégas, Martin Wattenberg, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. 2017. Google’s
multilingual neural machine translation system: En-
abling zero-shot translation. Transactions of the As-
sociation for Computational Linguistics, 5:339–351.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and
Tomas Mikolov. 2016. Bag of tricks for efficient text
classification. arXiv preprint arXiv:1607.01759.
Ben King and Steven Abney. 2013. Labeling the lan-
guages of words in mixed-language documents us-
ing weakly supervised methods. In Proceedings of
the 2013 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, pages 1110–1119,
Atlanta, Georgia. Association for Computational Lin-
guistics.
Ben King, Dragomir Radev, and Steven Abney. 2014.
Experiments in sentence language identification with
groups of similar languages. In Proceedings of the
First Workshop on Applying NLP Tools to Similar
Languages, Varieties and Dialects, pages 146–154,
Dublin, Ireland. Association for Computational Lin-
guistics and Dublin City University.
Muhammed Kocyigit, Jiho Lee, and Derry Wijaya. 2022.
Better quality estimation for low resource corpus
mining. In Findings of the Association for Com-
putational Linguistics: ACL 2022, pages 533–543,
Dublin, Ireland. Association for Computational Lin-
guistics.
Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhat-
tacharyya. 2018. The IIT Bombay English-Hindi
parallel corpus. In Proceedings of the Eleventh In-
ternational Conference on Language Resources and
Evaluation (LREC 2018), Miyazaki, Japan. European
Language Resources Association (ELRA).
Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna
Filighera, Abraham Owodunni, and Daniel White-
nack. 2022. Bloom library: Multimodal datasets in
300+ languages for a variety of downstream tasks.
In Proceedings of the 2022 Conference on Empiri-
cal Methods in Natural Language Processing, pages
8608–8621, Abu Dhabi, United Arab Emirates. As-
sociation for Computational Linguistics.
Zehui Lin, Liwei Wu, Mingxuan Wang, and Lei Li.
2021. Learning language specific sub-network for
multilingual machine translation. In Proceedings
of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 293–305, Online.
Association for Computational Linguistics.
Marco Lui and Timothy Baldwin. 2012. langid.py: An
off-the-shelf language identification tool. In Proceed-
ings of the ACL 2012 System Demonstrations, pages
25–30, Jeju Island, Korea. Association for Computa-
tional Linguistics.
Marco Lui, Ned Letcher, Oliver Adams, Long Duong,
Paul Cook, and Timothy Baldwin. 2014. Exploring
methods and resources for discriminating similar lan-
guages. In Proceedings of the First Workshop on
Applying NLP Tools to Similar Languages, Varieties
and Dialects, pages 129–138, Dublin, Ireland. As-
sociation for Computational Linguistics and Dublin
City University.
Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. DeltaLM: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. CoRR, abs/2106.13736.
Petter Mæhlum, Andre Kåsen, Samia Touileb, and
Jeremy Barnes. 2022. Annotating Norwegian lan-
guage varieties on Twitter for part-of-speech. In
Proceedings of the Ninth Workshop on NLP for Simi-
lar Languages, Varieties and Dialects, pages 64–69,
Gyeongju, Republic of Korea. Association for Com-
putational Linguistics.
Shervin Malmasi and Mark Dras. 2015. Language iden-
tification using classifier ensembles. In Proceedings
of the Joint Workshop on Language Technology for
Closely Related Languages, Varieties and Dialects,
pages 35–43, Hissar, Bulgaria. Association for Com-
putational Linguistics.
Priyank Mathur, Arkajyoti Misra, and Emrah Budur. 2017. LIDE: Language identification from text documents. CoRR, abs/1701.03682.
Maria Medvedeva, Martin Kroon, and Barbara Plank.
2017. When sparse traditional models outperform
dense neural networks: the curious case of discrimi-
nating between similar languages. In Proceedings of
the Fourth Workshop on NLP for Similar Languages,
Varieties and Dialects (VarDial), pages 156–163, Va-
lencia, Spain. Association for Computational Lin-
guistics.
Mihaela Gaman, Sebastian Cojocariu, and Radu Tudor
Ionescu. 2021. UnibucKernel: Geolocating Swiss
German jodels using ensemble learning. In Proceed-
ings of the Eighth Workshop on NLP for Similar Lan-
guages, Varieties and Dialects, pages 84–95, Kiyv,
Ukraine. Association for Computational Linguistics.
Shuyo Nakatani. 2010. Language detection library for Java.
Abhijnan Nath, Rahul Ghosh, and Nikhil Krishnaswamy.
2022. Phonetic, semantic, and articulatory features in
Assamese-Bengali cognate detection. In Proceedings
of the Ninth Workshop on NLP for Similar Languages,
Varieties and Dialects, pages 41–53, Gyeongju, Re-
public of Korea. Association for Computational Lin-
guistics.
Iuliia Nigmatulina, Tannon Kew, and Tanja Samardzic.
2020. ASR for non-standardised languages with
dialectal variation: the case of Swiss German. In
Proceedings of the 7th Workshop on NLP for Simi-
lar Languages, Varieties and Dialects, pages 15–24,
Barcelona, Spain (Online). International Committee
on Computational Linguistics (ICCL).
NLLB Team, Marta R. Costa-jussà, James Cross, Onur
Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Hef-
fernan, Elahe Kalbassi, Janice Lam, Daniel Licht,
Jean Maillard, Anna Sun, Skyler Wang, Guillaume
Wenzek, Al Youngblood, Bapi Akula, Loic Bar-
rault, Gabriel Mejia Gonzalez, Prangthip Hansanti,
John Hoffman, Semarley Jarrett, Kaushik Ram
Sadagopan, Dirk Rowe, Shannon Spruit, Chau
Tran, Pierre Andrews, Necip Fazil Ayan, Shruti
Bhosale, Sergey Edunov, Angela Fan, Cynthia
Gao, Vedanuj Goswami, Francisco Guzmán, Philipp
Koehn, Alexandre Mourachko, Christophe Ropers,
Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. 2021.
Contrastive learning for many-to-many multilingual
neural machine translation. In Proceedings of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 244–258, Online. Asso-
ciation for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu-
ation of machine translation. In Proceedings of the
40th Annual Meeting of the Association for Compu-
tational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.
Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James
Cross, Sebastian Riedel, and Mikel Artetxe. 2022.
Lifting the curse of multilinguality by pre-training
modular transformers. In Proceedings of the 2022
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, pages 3479–3495, Seattle,
United States. Association for Computational Lin-
guistics.
Jerin Philip, Alexandre Berard, Matthias Gallé, and
Laurent Besacier. 2020. Monolingual adapters for
zero-shot neural machine translation. In Proceed-
ings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages
4465–4470, Online. Association for Computational
Linguistics.
Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.
Maja Popović, Alberto Poncelas, Marija Brkic, and Andy Way. 2020. Neural machine translation for translating into Croatian and Serbian. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 102–113, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Alex Salcianu, Andy Golding, Anton Bakalov, Chris Al-
berti, Daniel Andor, David Weiss, Emily Pitler, Greg
Coppola, Jason Riesa, Kuzman Ganchev, Michael
Ringgaard, Nan Hua, Ryan McDonald, Slav Petrov,
Stefan Istrate, and Terry Koo. 2020. Compact lan-
guage detector v3 (cld3).
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, and Marcos Zampieri, editors. 2022. Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects. Association for Computational Linguistics, Gyeongju, Republic of Korea.
Holger Schwenk, Vishrav Chaudhary, Shuo Sun,
Hongyu Gong, and Francisco Guzmán. 2021a. Wiki-
Matrix: Mining 135M parallel sentences in 1620 lan-
guage pairs from Wikipedia. In Proceedings of the
16th Conference of the European Chapter of the Asso-
ciation for Computational Linguistics: Main Volume,
pages 1351–1361, Online. Association for Computa-
tional Linguistics.
Holger Schwenk, Guillaume Wenzek, Sergey Edunov,
Edouard Grave, Armand Joulin, and Angela Fan.
2021b. CCMatrix: Mining billions of high-quality
parallel sentences on the web. In Proceedings of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 6490–6500, Online. As-
sociation for Computational Linguistics.
Janine Siewert, Yves Scherrer, Martijn Wieling, and
Jörg Tiedemann. 2020. LSDC - a comprehensive
dataset for Low Saxon dialect classification. In
Proceedings of the 7th Workshop on NLP for Sim-
ilar Languages, Varieties and Dialects, pages 25–35,
Barcelona, Spain (Online). International Committee
on Computational Linguistics (ICCL).
Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu.
2019. Multilingual neural machine translation with
knowledge distillation. In International Conference
on Learning Representations.
Jörg Tiedemann and Lars Nygaard. 2004. The OPUS corpus - parallel and free: http://logos.uio.no/opus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA).
Tommi Vatanen, Jaakko J. Väyrynen, and Sami Vir-
pioja. 2010. Language identification of short text
segments with n-gram models. In Proceedings of the
Seventh International Conference on Language Re-
sources and Evaluation (LREC’10), Valletta, Malta.
European Language Resources Association (ELRA).
Pavanpankaj Vegi, Sivabhavani J, Biswajit Paul, Abhinav Mishra, Prashant Banjare, Prasanna Kumar K R, and Chitra Viswanathan. 2022. WebCrawl African: A multilingual parallel corpora for African languages. In Proceedings of the Seventh Conference on Machine Translation, pages 1076–1089, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
David Vilar. 2018. Learning hidden unit contribution
for adapting neural machine translation models. In
Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 500–505, New Or-
leans, Louisiana. Association for Computational Lin-
guistics.
Xinyi Wang, Yulia Tsvetkov, and Graham Neubig. 2020.
Balancing training for multilingual neural machine
translation. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 8526–8537, Online. Association for Computa-
tional Linguistics.
Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu,
and Chengqing Zong. 2018. Three strategies to im-
prove one-to-many multilingual translation. In Pro-
ceedings of the 2018 Conference on Empirical Meth-
ods in Natural Language Processing, pages 2955–
2960, Brussels, Belgium. Association for Computa-
tional Linguistics.
George-Eduard Zaharia, Andrei-Marius Avram,
Dumitru-Clementin Cercel, and Traian Rebedea.
2020. Exploring the power of Romanian BERT
for dialect identification. In Proceedings of the 7th
Workshop on NLP for Similar Languages, Varieties
and Dialects, pages 232–241, Barcelona, Spain
(Online). International Committee on Computational
Linguistics (ICCL).
George-Eduard Zaharia, Andrei-Marius Avram,
Dumitru-Clementin Cercel, and Traian Rebedea.
2021. Dialect identification through adversarial
learning and knowledge distillation on Romanian
BERT. In Proceedings of the Eighth Workshop on
NLP for Similar Languages, Varieties and Dialects,
pages 113–119, Kiyv, Ukraine. Association for
Computational Linguistics.
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, and Yves Scherrer, editors. 2020. Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects. International Committee on Computational Linguistics (ICCL), Barcelona, Spain (Online).
Marcos Zampieri, Liling Tan, Nikola Ljubešić, and Jörg Tiedemann, editors. 2014. Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects. Association for Computational Linguistics and Dublin City University, Dublin, Ireland.
Biao Zhang, Ankur Bapna, Rico Sennrich, and Orhan
Firat. 2021. Share or not? learning to schedule
language-specific capacity for multilingual transla-
tion. In International Conference on Learning Rep-
resentations.
Andrea Zugarini, Matteo Tiezzi, and Marco Maggini.
2020. Vulgaris: Analysis of a corpus for middle-age
varieties of Italian language. In Proceedings of the
7th Workshop on NLP for Similar Languages, Vari-
eties and Dialects, pages 150–159, Barcelona, Spain
(Online). International Committee on Computational
Linguistics (ICCL).
Family          Genus (Group)     Language(s)
Indo-European   Germanic          English, Afrikaans
                Romance           French
Afro-Asiatic    Hausa             Hausa
                Amharic           Amharic
                Cushitic          Oromo, Somali
Nilo-Saharan    Luo               Luo
Senegambian     Wolof             Wolof
                Fula              Nigerian Fulfulde
Volta-Niger     Igboid            Igbo
                Yoruboid          Yoruba
Bantu           Bangi             Lingala
                Shona             Shona
                Nyasa             Chichewa
                Umbundu           Umbundu
                Sotho-Tswana      Tswana, Northern Sotho
                Nguni-Tsonga      Zulu, Xhosa, Swati, Xitsonga
                Northeast-Bantu   Kamba, Swahili, Kinyarwanda, Luganda

Table 8: The phylogeny-informed tree hierarchy.
A Reproducibility
This section contains information about the reproducibility of all three components of this paper: data collection and preprocessing, language identification, and machine translation.

A.1 Data Curation
All data can be replicated and reproduced through code/data-collection. Intermediate preprocessing steps can be applied through code/preprocessing, merged through code/merging, and summary statistics can be produced through code/summary-stats. Data paths are set up so that any retrieved, preprocessed, or merged data is located in data/. A minimal driver for this pipeline is sketched below.
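The sketch below assumes each stage directory exposes a run.py entry point; that entry-point name is an illustrative assumption, not a confirmed detail of the repository.

```python
# Hypothetical driver for the data pipeline described in A.1.
# Directory names match the repository layout above; the run.py
# entry points are assumptions and may differ in the actual repo.
import subprocess

STAGES = [
    "code/data-collection",  # retrieve the raw parallel stories
    "code/preprocessing",    # clean and normalize the retrieved text
    "code/merging",          # merge per-source files into one corpus
    "code/summary-stats",    # produce corpus summary statistics
]

for stage in STAGES:
    # check=True stops the pipeline if any stage fails;
    # by convention, all outputs are written under data/.
    subprocess.run(["python", f"{stage}/run.py"], check=True)
```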
A.2 Language ID
All language identification experiments and results can be replicated through code/language-id/. All relevant data is decoupled from the code directory and can be found in data/language-id, and trained language models can be found in data/lms. A two-stage prediction sketch follows.
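The snippet below is a minimal sketch of hierarchical prediction over fastText-format classifiers stored under data/lms; the file names and group label scheme are illustrative assumptions, so refer to code/language-id/ for the actual implementation.

```python
# Minimal two-stage (hierarchical) language ID sketch, assuming
# fastText models under data/lms/. File names and the label scheme
# are illustrative assumptions, not the repository's actual names.
import fasttext

parent = fasttext.load_model("data/lms/group-classifier.bin")

def predict_language(text: str) -> str:
    line = text.replace("\n", " ")  # fastText predicts on single lines
    # Stage 1: route the input to a group of confusable languages.
    (group_label,), _ = parent.predict(line)
    group = group_label.replace("__label__", "")
    # Stage 2: a small per-group model separates the group's languages.
    child = fasttext.load_model(f"data/lms/{group}-classifier.bin")
    (lang_label,), _ = child.predict(line)
    return lang_label.replace("__label__", "")
```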
A.3 Machine Translation
Our machine translation experiments are performed using publicly available code from https://github.com/mahfuzibnalam/large-scale_MT_African_languages. To produce results for the novel translation directions enabled by our data, please refer to code/new_lang_pairs. Table 8 shows the phylogeny configuration we use to fine-tune the MT system; a transcription of that configuration as a data structure is sketched below.
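The snippet below is a direct transcription of Table 8 as a nested mapping, useful for grouping languages when preparing family-level fine-tuning; it is an illustration, not code from the repository.

```python
# Table 8's phylogeny as a nested mapping: family -> genus -> languages.
PHYLOGENY = {
    "Indo-European": {"Germanic": ["English", "Afrikaans"],
                      "Romance": ["French"]},
    "Afro-Asiatic": {"Hausa": ["Hausa"], "Amharic": ["Amharic"],
                     "Cushitic": ["Oromo", "Somali"]},
    "Nilo-Saharan": {"Luo": ["Luo"]},
    "Senegambian": {"Wolof": ["Wolof"], "Fula": ["Nigerian Fulfulde"]},
    "Volta-Niger": {"Igboid": ["Igbo"], "Yoruboid": ["Yoruba"]},
    "Bantu": {"Bangi": ["Lingala"], "Shona": ["Shona"],
              "Nyasa": ["Chichewa"], "Umbundu": ["Umbundu"],
              "Sotho-Tswana": ["Tswana", "Northern Sotho"],
              "Nguni-Tsonga": ["Zulu", "Xhosa", "Swati", "Xitsonga"],
              "Northeast-Bantu": ["Kamba", "Swahili",
                                  "Kinyarwanda", "Luganda"]},
}

def family_of(language: str) -> str:
    """Return the family under which Table 8 groups a language."""
    for family, genera in PHYLOGENY.items():
        for languages in genera.values():
            if language in languages:
                return family
    raise KeyError(language)
```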
B Supplementary Machine Translation Benchmarks
On the following pages, we report the aggregate evaluation results of our MT models on the FLORES200 devtest of 176 languages (BLEU, CHRF++, spBLEU). We also report baseline, language-fine, and family-fine BLEU, CHRF++, and spBLEU scores for all language pairs for which we perform machine translation experiments (the African focus languages from the WMT shared tasks). A sketch of how these metrics can be computed follows.
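The sketch below computes the three metrics with the sacrebleu library. Treating spBLEU as BLEU over the FLORES-101 SentencePiece tokenization is the common convention; we state it here as an assumption about the setup rather than a confirmed implementation detail.

```python
# BLEU, chrF++, and spBLEU with sacrebleu. chrF++ is chrF with
# word_order=2; spBLEU is BLEU over SentencePiece tokens
# (tokenize="flores101"), an assumed but conventional choice.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the cat sat on the mat"]           # system outputs
references = [["the cat is sitting on the mat"]]  # one reference stream

bleu = BLEU()
chrfpp = CHRF(word_order=2)
spbleu = BLEU(tokenize="flores101")

print(bleu.corpus_score(hypotheses, references).score)
print(chrfpp.corpus_score(hypotheses, references).score)
print(spbleu.corpus_score(hypotheses, references).score)
```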
Metrics Models Avg_all Avg_X→eng Avg_eng→X Avg_African→African Avg_Y→fra Avg_fra→Y
BLEU
Baseline 9.59 17.84 10.56 7.83 13.42 9.79
L-Fine 16.57 28.55 13.68 15.28 19.16 14.24
F-Fine 21.52 33.77 21.79 20.01 23.99 17.28
CHRF++
Baseline 29.59 35.47 30.88 28.05 32.54 31.25
L-Fine 37.04 45.54 35.24 35.89 39.11 36.82
F-Fine 41.33 49.83 41.28 40.18 43.27 39.28
spBLEU
Baseline 11.87 18.79 13.20 10.19 15.64 12.55
L-Fine 19.52 30.38 17.46 18.21 21.93 17.86
F-Fine 24.93 35.66 25.26 23.58 27.06 21.36
Table 9: Evaluation results on our test set of 176 language directions. Avg_X→eng denotes the average score over directions from other languages into English, and Avg_eng→X over directions from English into other languages. Avg_African→African denotes the average score over directions between African languages. Avg_Y→fra denotes the average score over directions from other languages into French, and Avg_fra→Y over directions from French into other languages. Avg_all denotes the average over all translation directions.
Metrics Models Avg_all Avg_X→eng Avg_eng→X Avg_African→African Avg_Y→fra Avg_fra→Y
BLEU
Baseline 14.01 28.21 13.68 23.62 11.66 11.23
L-Fine 12.91 25.98 14.19 22.21 10.98 10.05
F-Fine 12.25 25.15 13.99 20.98 10.88 9.34
CHRF++
Baseline 39.16 49.54 37.16 46.24 38.03 37.29
L-Fine 37.89 47.28 39.86 44.80 37.26 35.57
F-Fine 37.03 46.57 39.64 43.68 37.41 34.50
spBLEU
Baseline 18.23 30.77 17.22 28.01 16.69 15.64
L-Fine 17.01 28.23 18.91 26.46 15.75 14.22
F-Fine 16.15 27.41 18.69 25.25 15.69 13.21
Table 10: Evaluation results on the FLORES200 devtest of 176 language directions. Column notation is the same as in Table 9.
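Each aggregate column in Tables 9 and 10 is a plain mean over a subset of the per-pair scores reported in Tables 11-13. The sketch below recomputes a few of these subsets over an illustrative toy frame; the data is invented for the example.

```python
# Recomputing the Avg columns from per-pair scores (toy data).
import pandas as pd

df = pd.DataFrame({
    "pair": ["lug-eng", "eng-swa", "fra-kin", "tso-swa"],
    "bleu": [10.1, 16.8, 5.0, 13.4],
})
df[["src", "tgt"]] = df["pair"].str.split("-", expand=True)
pivots = {"eng", "fra"}  # the two non-African pivot languages

avg_all = df["bleu"].mean()                            # Avg_all
avg_x_eng = df.loc[df["tgt"] == "eng", "bleu"].mean()  # Avg_X→eng
avg_eng_x = df.loc[df["src"] == "eng", "bleu"].mean()  # Avg_eng→X
avg_afr = df.loc[~df["src"].isin(pivots)               # Avg_African→African
                 & ~df["tgt"].isin(pivots), "bleu"].mean()
```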
                 BLEU                        CHRF++                      spBLEU
Pair     Baseline  L-Fine  F-Fine   Baseline  L-Fine  F-Fine   Baseline  L-Fine  F-Fine
lug-eng 10.1 17.3 23.4 25.1 33.8 39.6 10.9 18.2 24.5
yor-eng 16.2 22.2 26.3 33.6 40.1 43.8 16.3 22.1 26.5
hau-eng 15.9 21.5 24.6 33.6 39 40.7 17.2 22.7 25.7
amh-eng 21.4 27.8 30.1 41.3 45.8 47.6 23.2 29 31.3
swa-eng 22.5 26.6 33.3 41.1 44.1 49.4 23.4 27.5 34.5
ibo-eng 15.1 20.9 22.8 34 40.2 41.8 15.8 22.3 24.1
nya-eng 17.9 25.2 33.4 37.3 44.5 51.4 18.7 25.5 35.2
orm-eng 7.9 11.3 15 24.1 30 33.3 9.4 12.6 16.5
nso-eng 22.5 46.7 53.5 39.5 61.7 66.5 23.4 51.1 57.1
xho-eng 21.9 36.5 42.2 39 52.1 56.9 22.4 38.2 44.7
tso-eng 20.5 48.3 57.1 37.1 62 68.7 21.1 52.1 60.3
kin-eng 13.4 22.5 29.1 31.1 40.3 45.8 13.9 23.1 30
kam-eng 5.3 8 11.7 21.8 24.6 29.9 6.1 8.9 12.9
zul-eng 19.5 34.8 41.3 37.9 51.9 57.1 20.7 39.1 45.5
ssw-eng 17.4 41.9 48.8 33.8 56.6 61.9 17.9 45.9 52.5
afr-eng 38 45.3 47.7 57.2 61.9 62.9 40.3 47.7 49.3
eng-swa 16.8 18.1 24.5 41.4 43 47 20 22.1 27.8
eng-ibo 13.6 17.6 20.9 33.4 37.4 39.7 17.1 21 24.3
eng-nya 10.4 11.7 15.6 33.5 36.3 39.4 12 14.3 18.4
eng-orm 1.1 3.1 2.9 14.1 18.5 17.7 1.7 4.4 4.4
eng-nso 19.5 25.1 43.2 37.7 43.6 59.9 19.8 26 46.4
eng-tso 16 23.6 44.6 36.8 43.8 61.2 16.9 25.6 47.8
eng-kin 5.3 7.8 10.1 26.3 31.1 33.5 7.7 11.2 13.9
eng-kam 1.6 2.1 2.9 16.3 18.6 20.2 2.4 3.1 4.3
eng-zul 9 11.2 27.8 36.4 38.9 49.3 16 18.7 32.9
eng-ssw 5.7 9.9 36.2 29.3 36.8 53.5 10.2 16.7 37.9
eng-afr 33.1 35.1 40.3 52.8 54.6 57.7 35.6 37.4 42.4
eng-xho 4.5 12.4 28.2 26.5 37.7 48.2 8.1 18.1 31.5
eng-lug 4.2 5.9 8.1 25.7 29.8 31.9 7 9.8 12.3
eng-yor 13.3 15.8 20.4 30.9 32.5 35.7 14.8 17.3 20.8
eng-hau 10.4 12.4 15 31.3 34.9 37.8 8.7 14.2 17.8
eng-amh 4.4 7 7.9 21.7 26.3 27.7 13.2 19.4 21.3
fra-swa 8.9 12.9 17.6 34 39 42.6 11.1 16 21.3
fra-kin 5 7.6 9.9 28.4 32 34.5 7.6 11.6 14.5
fra-hau 6.5 8.9 11.8 29.5 33.6 36.1 7.8 10.7 14
fra-nso 15.2 23.9 29.2 34.2 45.4 49.9 16 26.3 32.3
fra-amh 2.9 5 5.9 17.6 21.7 21.9 10.2 14.3 15.8
fra-xho 9.2 12.2 15.8 35.8 40.1 42.1 14.5 19 22.2
fra-zul 6.9 9.4 12.7 35.3 38.8 41.1 13 16.3 20
fra-lug 2.3 5.3 7.3 21.9 29.2 31.3 4.3 8.7 11.3
fra-ibo 13 18.1 20.5 32 37.1 38.6 16 21.2 23.5
fra-afr 27.1 30.1 31.6 47.3 49 50.3 28.5 31.3 33.3
fra-nya 9 11.8 14.9 33.2 36.3 39.1 10.6 13.8 17.4
fra-ssw 5.7 12.4 17.7 29.8 39.8 44.2 9.7 18.2 24.1
fra-yor 9.6 14.1 15.5 23.9 28 29.5 9.9 13.8 16
fra-tso 15.8 27.6 31.5 34.6 45.5 48.7 16.5 28.9 33.3
hau-fra 9 13.9 17.3 26.2 32.5 35.8 11.4 16.8 20.7
nso-fra 15.2 23.3 29.9 35.8 44.5 49.7 18 26.3 33
amh-fra 12 16.7 19.1 30.2 35.6 37.4 13.4 18.9 21.2
xho-fra 16.7 22.1 26.6 37.1 42.9 46.2 18.2 23.9 28.8
zul-fra 14.4 19.9 24.5 36.2 41.5 45.5 16.9 23.1 28.3
lug-fra 6.4 14.1 19.5 21.6 32.6 38.5 8.4 16.5 22.2
ibo-fra 10.9 15.1 17.6 31.5 36.8 38.9 13.6 18.5 21.5
afr-fra 26 28.5 31.5 47.8 50.3 51.8 29.6 32.4 35.2
nya-fra 14 20.4 27.2 32.7 39.5 45.6 15.7 22.5 29.6
ssw-fra 14.7 21.4 28.8 34.2 42 48 16.7 24.4 32.3
yor-fra 10.4 16 18.2 30.1 35.6 38 11.9 18.2 21
tso-fra 18.3 26 35.5 36.1 43.9 51.3 20.4 29 38.5
swa-fra 12 15.3 19.2 30.5 34.5 38.4 14.1 18 22.2
kin-fra 7.9 15.5 21 25.5 35.4 40.7 10.6 18.5 24.3
tso-swa 13.4 17.7 27.2 37.6 41.8 48.8 17.4 22.8 31.9
ssw-tso 13.7 35.6 40.9 33.5 54.3 58.3 13.8 38.9 44.1
amh-kin 2.6 5.8 7.1 21 26.2 27.3 4.6 8.2 9.7
tso-nya 6.4 8.7 17 31.5 36.6 43.7 9.4 12.5 22.7
tso-nso 18.6 37 43.6 36.4 55.4 60.2 17.8 40.7 46.8
nso-kin 4.8 9.2 14.2 27.9 33.9 37.7 8.2 13.7 19.7
yor-ibo 13.3 24.9 26.7 28.4 39.3 41.5 15.6 26.2 28.4
ssw-swa 8.6 13.2 23.2 34.1 39.8 48.4 11.7 16.8 28.6
nya-swa 9.8 14 22.9 35.1 38.6 45.6 12.2 17.2 27.4
yor-swa 10.4 17.8 21.5 30.3 39.1 42.8 12.2 18.5 23.4
ssw-nso 13.8 38.2 44.9 31.6 54.9 60.4 13.8 40.6 47.5
ssw-nya 5.9 9.3 17.7 28.1 35.5 42.1 7.9 13.5 22.5
afr-swa 11.8 15.5 23.7 38.7 41.7 46 16.1 19.8 28.6
xho-tso 13.1 35.6 42.1 34 53.7 59.7 14.1 37.1 44.4
lug-nya 2.3 7.4 12.2 20.2 30 34.2 4 9.7 15.3
amh-afr 10.4 16.9 19.5 28.6 34.1 36.8 11.9 18.1 21
lug-nso 5.7 14.6 18.8 20.7 31.9 36.1 5.5 14.8 19.3
nso-afr 20.6 30.9 43.3 40 50.9 58.9 21.9 33.4 45.2
hau-kin 2.6 5 6.1 21.7 26.8 27.6 5.2 8.6 9.5
ibo-swa 14.1 19.9 26.7 35 39.8 44.6 15.6 21 27.6
amh-zul 3.1 6.7 7.9 26.2 31.2 32.5 7.5 11.8 13.5
lug-swa 6.3 11.4 17.7 26.5 34.8 40 8.2 14.7 21.8
lug-ibo 4.8 13.7 17.3 18 28.6 33 6.8 15.6 19.7
nso-zul 7.8 23.9 27 34.3 45.6 48.8 13.6 27 30.9
zul-swa 9.5 13.7 20.9 34.9 37.9 43.4 13.3 17.7 25.8
xho-swa 13 16.6 23.4 37.1 40.2 45.3 16.2 20.1 27.4
lug-xho 3.7 7.7 12.4 22.1 29.7 34.9 5.5 11.2 16.8
xho-nso 15.1 32.8 40.5 34.5 51.6 57.9 15.5 35.5 43.5
zul-nya 6.1 7.7 11.2 31.2 33.6 36.8 8.8 10.9 15.5
kam-swa 4.1 4.4 7.2 22.3 21.2 25.8 5.8 5.8 9.5
xho-nya 6.8 10 18 31.6 36.4 43.2 9.7 13.5 22.6
tso-kin 4.4 7.1 13 26.2 29.4 36.1 6.8 10.6 18.7
nso-nya 5.6 8.1 15 30.2 35.6 40.5 8.1 12.2 19
lug-afr 9.7 14.2 22.3 27.9 32.7 39.9 10.6 15.1 23.4
amh-orm 0.8 2 3.1 14.3 19.8 22.9 1.3 3.8 5.1
amh-swa 5.7 14.8 18.3 27 38 41.1 7.8 19 22.2
swa-kin 4.3 5.3 7.3 23.3 24.2 27.5 6.3 7.5 10.5
lug-zul 3.3 5.7 8.6 22.9 28.1 32.2 5.7 9.3 13.3
nso-swa 13.5 19.5 27.9 36.8 42.9 48.4 16.9 24.3 32.4
xho-ssw 4.5 36.2 40.9 27.4 51.9 56.9 8.9 37.4 43.2
ssw-kin 5.3 10 15.3 26.9 33.5 39 8.2 13.9 20.7
nya-kin 4.9 9.2 13.3 26.7 31.8 36.9 7 12.4 18
yor-lug 2.1 5.7 7.9 18.7 26.8 27.9 3 7.5 9.7
xho-zul 9.1 32.4 36.1 33.9 49.7 53.5 14.9 33.5 38.4
xho-afr 19.7 24.5 33.9 39.8 44.4 51.8 21.4 26.1 36
zul-afr 18 22.4 30.2 38.4 42.8 49.1 20.1 24.4 32.8
tso-zul 6.3 39.4 41.8 31.4 53.7 56 11.2 38 41.5
afr-kin 5 7.6 11.3 29.7 32.9 37.6 7.9 11.6 16.6
hau-swa 8.3 10.5 13.5 29.3 32.6 34.5 10 13.6 16.2
orm-swa 5.6 8.4 9.6 23.4 27.7 28.1 6.4 10.9 11.6
tso-afr 19.9 25.9 36.3 38.4 44.3 52.8 21.1 27.1 38.1
lug-kin 1.1 6.3 7.8 15 27.2 29.5 2 8.7 11
zul-kin 4.8 6.8 10.4 27.4 30.6 33.4 7 10.1 14.6
ssw-zul 8.5 31 33.3 33.8 50.6 52.2 13.9 33.3 36
xho-kin 4.8 7.6 11.1 27.9 31.3 35.2 7.6 11.6 16.3
lug-amh 1 2.4 4.7 8.5 13.6 16.7 3.2 8.2 11.4
ssw-afr 17.1 23.5 36.3 35 42.1 51.8 18.2 25.1 38.1
nya-afr 15.4 20.8 29.7 33 38.6 45.9 16.7 21.8 31.3
swa-tso 13.8 21.1 29.7 35.6 41.6 49.5 15.3 22.7 32.4
tso-ssw 6.3 31.1 35 31 50 54.2 10.5 33.1 38.3
kin-amh 0.7 1.8 2.9 9.7 14.8 16.9 3.5 8.4 11
nya-tso 13.3 24.7 31.5 33.2 45.1 50.6 14.2 27.2 34.3
nso-tso 15.9 37.5 42.2 36.2 56.5 60.3 16.3 41.3 46.2
kin-nso 6.4 17.8 24.2 22.7 36.1 41.8 6.1 17.9 25
ibo-yor 12.9 17.1 18.3 26.2 29.9 30.3 13 16.7 17.4
swa-ssw 3.9 6.5 11.4 23.5 28.7 34.4 5.8 10.2 16.1
swa-nya 6.5 7.4 11.4 27.6 29 33.4 8.3 9.8 14.3
swa-yor 9.3 13.3 15.7 21.3 24.7 28 9.4 13.8 15.5
nso-ssw 5.7 36.3 40.2 29.7 52.8 56.6 10 36.9 41.4
nya-ssw 5.1 10.5 19.2 27.3 36.3 44.6 7.7 16 25.9
swa-afr 15.2 18.4 25 34 36.2 41.9 16.8 19.5 26.8
tso-xho 8.8 33.3 37.2 34.9 51.3 54.9 13.8 33.2 38.4
nya-lug 2.5 4.7 8.1 22.9 26.4 29.7 4.5 7.5 11.3
afr-amh 2.6 3.8 5.3 17.4 21.1 23.8 9.2 12.7 15.6
nso-lug 2.6 4 5.1 23.3 27.4 28.1 4.6 7 8.9
afr-nso 18.6 28.5 34.7 38.3 48.7 53 18.8 29.5 36
kin-hau 6.3 8.3 11.1 26 28.4 32.8 7.6 10.3 13.9
swa-ibo 18.4 23.5 27.3 32.5 37.4 41 20.1 24.7 28.5
zul-amh 2 4.2 5.4 15.9 19.4 22.8 7.4 11.8 15.8
swa-lug 3.5 5.1 7.2 23.2 25.2 28.3 5.8 7.8 10.4
ibo-lug 1.6 5.5 7.7 20.3 28.2 29.9 2.9 8.2 10.8
zul-nso 15.7 29.1 34.6 36 47.8 52.8 15.9 30.6 36.6
swa-zul 7.2 8.3 11.6 30.2 31.5 34.9 10.9 12.5 16.6
swa-xho 8.8 9.8 15.7 33.9 35.1 40.6 13.4 14.8 21.6
xho-lug 3.7 5.4 8.1 26.6 30.4 31.9 5.6 8.6 11.9
nso-xho 8 29.9 33.9 34.1 49.2 53 13.5 32.1 36.7
nya-zul 6.9 8.8 14 31.3 34.5 39.4 11.2 14.2 20.1
swa-kam 1.5 1.3 1.8 16.2 16.3 17.8 2.3 2 3
nya-xho 8.1 11.3 20.5 33.2 38.9 46.1 12.8 18.2 27.5
kin-tso 6.2 15.8 22.2 22.9 34 39.8 6.6 17 24.1
nya-nso 14.9 24.2 32.9 34 44.2 50.5 15.1 25.8 34.5
afr-lug 4 5.4 7 28.3 31.2 32.6 6.9 9.3 11.4
orm-amh 1 1.8 2.8 10.5 15.7 16.3 3.2 9.9 10.8
swa-amh 2.1 4.4 5.8 14.8 19.4 22.5 7.4 12.9 16.9
kin-swa 6.3 11.2 18.7 28.4 34.2 40.1 9 14.8 23.2
zul-lug 3.2 4 5.2 24 25.2 26.3 6 7.3 8.2
swa-nso 15.2 20.1 29.4 33.9 39.6 46.8 15.2 21.3 31
ssw-xho 6.1 36.4 40.5 29.1 51.4 55.2 10.6 36.6 41.3
kin-ssw 1.9 10 16.9 19.9 33.6 40.8 4 15.7 24.2
kin-nya 2.4 6.2 11.4 22.1 31.5 36.7 3.4 8.4 15.3
lug-yor 3.2 8 13.1 12.6 19.1 25.2 3.3 8.4 13.5
zul-xho 8.9 32.6 35.7 34.4 49.8 52.7 14.5 33.4 37.2
afr-xho 10.1 12.3 16.2 37.7 40.3 43.2 16.3 19.2 23.4
afr-zul 9.5 11.2 15.1 36.5 38.8 42 15.3 18 22.3
zul-tso 13 39.6 44 34 56.6 59.9 13.8 42.1 46.6
kin-afr 10.9 15.3 23.6 29 33.6 41.3 11.3 16 25.3
swa-hau 7.8 10.7 13 30.4 31.1 35.1 9.3 12 14.8
swa-orm 2 2.2 3.3 17.9 20.1 22.2 3.3 3.6 5.2
afr-tso 18.2 24.9 29.5 39.6 48.2 52.3 19.6 28 33
kin-lug 0.3 4.6 6.2 14.6 26.6 28.4 1.4 8 9.6
kin-zul 3.9 6.6 9.5 24.2 30.3 33.2 6.6 11.2 14.8
zul-ssw 6.4 30.4 33.8 33 51.2 53.9 12.4 32.8 37.2
kin-xho 3.4 8.7 12.1 24 31.3 34.3 5.8 12.8 16.8
amh-lug 1.4 4.2 4.6 18.4 24.7 24.9 3.1 6.2 7.2
afr-ssw 6.4 11.9 18 28.9 37.8 43.4 10 18 25.1
afr-nya 6.4 8.4 13 30.1 33.5 37.7 9.6 12.1 17.5
Table 11: Results of all language pairs on our test set.
                 BLEU                        CHRF++                      spBLEU
Pair     Baseline  L-Fine  F-Fine   Baseline  L-Fine  F-Fine   Baseline  L-Fine  F-Fine
lug-eng 16.1 15.5 15.5 36.6 36 36.2 18.3 17.5 17.7
yor-eng 16.7 15.6 16.3 38.6 37.2 38.2 18.9 17.7 18.4
hau-eng 27.8 27.2 25.9 50.2 49.1 48 31.1 29.2 27.9
amh-eng 31.4 27.8 27.1 55.5 52.1 51.3 34 30.1 29.3
swa-eng 41.6 34.2 36.6 62.5 56.5 58.4 43.5 36.2 38.5
ibo-eng 25.6 24 23.8 48.3 45.3 45.4 28.6 26.3 26.1
nya-eng 25.2 23.3 22.9 47.8 45.5 44.9 28.7 26.3 25.7
orm-eng 13.2 11.7 10.3 34.6 32.2 29.6 14.5 12.7 11.1
nso-eng 34.6 32.5 29.5 55 53 50.2 36.8 34.5 31.5
xho-eng 35.1 33.4 31.7 56.2 55.1 53.3 37.9 36.2 34.3
tso-eng 28.1 25.8 24.4 49.6 47.2 46.2 30.8 28.2 26.9
kin-eng 28.1 24.2 23.3 50 46.4 45.5 30.2 26.2 25.4
kam-eng 9.5 10 9.1 28.3 28.2 28.7 12.2 12.3 12.2
zul-eng 35.8 32.3 31.1 57.5 53.8 52.8 38.7 34.6 33.3
ssw-eng 26.1 25.6 24.1 47.6 47.1 45.7 28.5 27.9 26.3
afr-eng 56.5 52.6 50.8 74.4 71.8 70.7 59.6 55.7 53.9
eng-swa 33.8 30.8 29.8 59.4 57.3 56.4 38 35.3 34.4
eng-ibo 15.8 16.1 16.3 39.5 40.1 40.2 18.6 19 19.2
eng-nya 14.2 13.8 13.4 44.5 44.5 43.6 18.1 17.7 16.9
eng-orm 1.3 1 0.7 18.2 17.1 15.4 2.4 1.7 1.2
eng-nso 23.1 19.1 19.4 47.9 44.9 45.7 24.4 21 21.5
eng-tso 16.4 15.6 16.8 43.7 42.5 43.9 19.6 18.2 19.4
eng-kin 12.5 11 11.3 37.9 38.1 38.2 15.9 14.5 14.7
eng-kam 2.8 3.9 4.2 19.3 22.4 22.8 3.8 5.4 5.6
eng-zul 16.1 15.3 14.3 50.2 49.5 48.5 27.2 26.2 24.7
eng-ssw 7.6 7 7 39 38.6 38.8 14.7 14.6 14.3
eng-afr 40.4 37.5 35.8 65.7 63.6 62.4 46.1 43.4 41.7
eng-xho 1.4 12.8 13.9 15.7 46.6 47.6 3.5 22.5 23.6
eng-lug 5.4 5.8 6.1 29.8 30.9 31.2 7 8 8.5
eng-yor 3.3 3.3 3.2 19.5 19.2 19.2 5.1 4.6 4.6
eng-hau 13.1 22.3 20.7 27.7 46.9 45.6 4.5 24.2 23.3
eng-amh 11.6 11.8 10.9 36.6 35.5 34.7 26.8 26.2 25.4
fra-swa 23.6 20.5 20.1 50.9 48.6 47.5 28.1 25.1 24.3
fra-kin 9.6 8.6 9.1 36.4 34.6 35.6 13.4 11.8 12.2
fra-hau 15.4 15.3 15.1 41 40.5 40.7 18.1 17.8 17.7
fra-nso 12.8 12.3 13.1 38.6 38 39.4 14.9 14.4 15.3
fra-amh 8.5 6.9 6.5 31.7 28.5 27.8 22.2 19.6 19.1
fra-xho 10.3 9.1 9 43.1 41.1 41.2 19.4 17 17.3
fra-zul 11.1 10 9.6 44.9 43.5 42.9 21.6 19.9 19.1
fra-lug 2.2 4.3 4.2 22 28 28.7 3 6.4 6.6
fra-ibo 13.1 12.1 12.5 36.9 35.6 36.4 16.1 14.9 15.3
fra-afr 26.7 24.6 22.6 54.9 52.6 50.8 32.7 30.1 28.2
fra-nya 11.6 10.1 10.1 42.2 39.7 39.6 15.6 13.4 13.5
fra-ssw 4.9 4.8 4.9 33.7 34.2 35.1 11.3 11.2 11.4
Table 12: Results of all language pairs on FLORES200 devtest (continued in Table 13). The caption in the original draft read "our test set"; these scores match the FLORES200 aggregates in Table 10.
                 BLEU                        CHRF++                      spBLEU
Pair     Baseline  L-Fine  F-Fine   Baseline  L-Fine  F-Fine   Baseline  L-Fine  F-Fine
fra-yor 2.4 3.2 3 18.2 18.5 18.7 3.6 4.8 4.7
fra-tso 11 11.9 12.5 37.9 38.2 39.4 13.6 14.1 14.9
hau-fra 23.2 21.8 21.2 45.7 44.1 43.6 27.4 25.8 25.2
nso-fra 24.4 23.2 20.8 46.7 45.5 43.6 29 27.3 25.1
amh-fra 25.2 22.7 22.6 49.3 47.3 46.7 29.9 27.4 27.1
xho-fra 26.6 25.9 23.4 49.2 48.4 46 31.3 30.2 27.9
zul-fra 28.1 25.3 22.7 51 48.4 46.1 32.5 29.6 27.1
lug-fra 13.1 13.6 12.5 33.8 34.3 33.6 16.2 16.6 15.6
ibo-fra 20.2 18.8 18.1 43.1 41 40.7 24.2 22.5 22.4
afr-fra 37.9 37.1 35.8 60.8 59.9 58.9 43.6 42.7 41.5
nya-fra 20.5 19.7 18.5 44 42.5 41.2 25.4 24.3 23
ssw-fra 19.7 20.4 18.3 41.9 43.1 40.9 24.1 24.5 22.4
yor-fra 15 13.5 13.5 37 35.4 35.3 18.9 17.5 17.5
tso-fra 22.4 20.9 19 44.8 43.5 41.5 26.7 25.2 22.8
swa-fra 31.7 27.3 27.7 54.4 50.4 50.8 36.1 31.9 32.2
kin-fra 22.7 20.7 19.6 45.7 43.4 42.6 26.8 24.9 23.7
tso-swa 19.3 16.3 13.8 45.7 42.8 39.3 22.9 20 17.1
ssw-tso 12.1 13 12.4 38.6 39.4 38.2 15.1 15.5 14.6
amh-kin 8.2 6.7 6.5 35.1 32 31.4 11.6 9.5 8.9
tso-nya 10.3 9.9 8.3 39.2 38.4 35.1 13.8 13.2 11.2
tso-nso 17.3 15.5 14.1 42 40.6 39.2 18.9 17.3 15.9
nso-kin 9.6 9.4 7.9 35 34.6 31.9 12.9 12.4 10.3
yor-ibo 8.4 7.9 8.2 29.7 29.2 29.4 11.1 10.6 10.9
ssw-swa 17.4 16.4 13.1 43.5 42.6 38.6 20.6 19.5 16.2
nya-swa 17.8 15.1 13.3 44.9 41.4 39 21.6 18.9 16.7
yor-swa 11.9 9.4 9.9 37.6 33.4 34.2 14.6 11.9 12.4
ssw-nso 16.1 14.9 13.8 40.8 39.9 38.1 17.7 16.7 15.3
ssw-nya 9.2 10.2 8.3 37.4 38.5 35.2 12.5 13.1 11
afr-swa 27.6 24.2 20.9 54.8 51.5 48.4 32.1 28.7 25.7
xho-tso 13.8 13.7 13.7 40.5 40 40.1 16.8 16.4 16.2
lug-nya 7.2 7 6 32.5 32.4 30.4 9.4 9.4 8.1
amh-afr 19.4 17.3 16.3 46.5 44.1 43.1 23.3 20.8 19.6
lug-nso 9.9 10.7 9.5 32 33.7 32.8 10.9 12.1 11
nso-afr 20.9 18.8 16.8 45.8 43.4 41.1 24.2 21.3 19.2
hau-kin 10.1 8.4 7.6 36.5 32.7 31.5 13.7 10.8 9.9
ibo-swa 16.6 15 14.7 44.4 40.5 40.7 20.8 18.1 17.7
amh-zul 9 8.3 7.5 42.1 40.5 39.4 18.1 16.7 15.5
lug-swa 11.9 9.9 8.7 36.7 33.9 32.3 14.2 12.2 10.9
lug-ibo 7.5 8 7.1 26.6 28.1 27.4 9.8 10.3 9.8
nso-zul 12.7 10.8 9.3 44.9 42.5 40.2 22 19.7 17.4
zul-swa 25 20.9 17.7 51.6 47.2 43.1 29 24.6 21
xho-swa 22.5 20.6 17.8 49.4 47.3 43.5 26.6 24.7 21.4
lug-xho 5.1 4.6 4.5 31.8 31.1 30.7 10.3 9.8 8.9
xho-nso 18.2 17.5 16.1 43.2 42.2 40.7 19.7 18.8 17.6
zul-nya 12.8 11.6 9.1 42.9 40.8 36.7 16.6 15 12.2
kam-swa 8.4 7.1 5.6 30.3 27.5 25.6 10.5 9.2 7.3
xho-nya 12.3 11.7 10.2 42.1 40.8 37.7 16 15.2 13.1
tso-kin 10 9 6.8 36.3 34.2 30.6 14.1 12.3 9.2
nso-nya 11.1 10.6 9.4 39.6 38.9 36.5 14.3 13.5 12
lug-afr 11.6 10.6 9.9 34 32.9 31.8 14 13 11.9
amh-orm 1.2 0.8 0.9 19.5 16.7 17.9 2.5 1.5 1.6
amh-swa 20.2 17.1 16.4 48.4 44.8 43.3 24.3 20.9 19.7
swa-kin 12.2 7.4 7.9 39.9 32.1 32.9 16.1 10.2 10.6
lug-zul 5.4 5.3 4.7 33.1 32.6 31.1 11.9 11.4 10.2
nso-swa 21.7 19 15.3 48.2 45.1 40.6 25.2 22.3 18.4
xho-ssw 7.2 7 6.3 37.7 37.2 35.6 14.3 14.1 12.7
ssw-kin 7.9 8.7 6.6 32.4 33.5 30 10.7 11.3 8.6
nya-kin 9.2 7.7 7.1 35 32.2 31 12.5 10.7 9.5
yor-lug 3.6 3.9 3.3 25.2 24.9 24.6 4.9 5.6 5.2
xho-zul 12.9 12.1 11.1 45.8 44.9 42.8 23.2 22.3 20.1
xho-afr 20.7 19.8 17.2 46.7 45.3 42.7 24.8 23.3 20.5
zul-afr 22.4 18.8 17.4 48.3 44.3 42.4 26.3 22.2 20.3
tso-zul 10.5 9.4 8.4 42.7 41 39.1 20.3 18.1 16.3
afr-kin 11.5 9 8.8 38.9 35.2 34.9 15.9 12.2 12
hau-swa 20.9 17.1 15.2 47.4 42.7 40.9 24.2 20.3 18.3
orm-swa 10.5 7.2 6.3 34.7 28.8 26.5 12.2 8.6 7.3
tso-afr 17.1 16 14 42.2 41 38.4 20.9 19.3 17.1
lug-kin 2.9 6.3 5.3 19.9 29 27.5 4.2 8.5 7.1
zul-kin 10.9 9.2 7.4 37.9 34.5 31.9 14.8 12.4 10
ssw-zul 11.1 10.2 9.1 43 42.2 40.4 20.5 19.2 17.5
xho-kin 10.9 9.8 8.2 37.1 35.3 32.4 14.4 12.9 10.5
lug-amh 3.5 2.6 2.4 18.1 17.2 16.5 10.7 10 9.1
ssw-afr 16 15.6 13.2 41.3 40.4 37.7 19.7 18.6 16
nya-afr 16.4 14.7 13.1 42.4 39.8 37.7 20.7 18.2 16.4
swa-tso 14.8 12.1 13.5 41.5 37 39.5 17.7 13.9 15.4
tso-ssw 6.6 5.9 5.2 36.8 35.4 33.4 13.1 12.2 10.6
kin-amh 6 4 4.1 25.7 21.9 22 16.4 13.5 13.4
nya-tso 11.7 10.7 10.7 37 36.1 36.3 14.5 13.5 13.3
nso-tso 12.8 13.9 14 39.4 39.7 39.8 15.3 16.5 16.3
kin-nso 14.6 12.1 12.3 39.1 36.7 36.7 16.2 14.1 13.9
ibo-yor 2.3 2.9 2.5 17.5 18 17.4 3.8 5.2 4.2
swa-ssw 5.9 4.3 4.7 36.2 31.9 34.1 12.4 9.5 10.5
swa-nya 12.4 9.4 10 43.6 38.6 39.3 16.5 12.7 13.3
swa-yor 2.7 3.7 3.1 18.4 18.8 18.5 3.8 6.6 4.5
nso-ssw 7.1 6.4 6.1 37.2 35.8 35.1 13 12.4 12
nya-ssw 4.7 4.6 4.7 33 33 32.6 10.4 10.6 10
swa-afr 25.3 20 19.9 51.8 46.2 45.9 29.1 23.5 23.2
tso-xho 9.2 8.6 8 40.1 39.1 38.1 16.9 15.6 15.1
nya-lug 3.2 4.6 4.6 24.2 27.4 26.9 4.5 6.9 6.6
afr-amh 9.3 7.6 8.2 33.1 30 30.7 23.1 21.1 21.9
nso-lug 3.4 5.3 5.3 24.6 29.3 28.5 4.4 7.6 7.4
afr-nso 18.7 14.9 15.4 44.8 41.1 42 20.7 17.3 17.8
kin-hau 14.9 12.9 12.3 39.6 36.4 36.3 17.5 15.3 14.6
swa-ibo 14.3 11.8 12.7 38.3 34.9 36.4 17.2 14.9 15.7
zul-amh 7.7 6.5 5.7 30.3 27.6 25.5 20.7 18.7 16.8
swa-lug 4.5 4.2 4.8 29.1 27.3 27.8 6.1 6.2 6.6
ibo-lug 3.1 4.1 3.9 24.6 26.8 26.4 4.3 6.1 6
zul-nso 18.8 16.7 15.8 44.5 41.6 41.1 20.8 18.3 17.5
swa-zul 13.1 9.8 10.1 47.3 42.3 42.7 24 18.7 18.9
swa-xho 11.1 7.9 9.3 44.5 39 40.8 20.2 15 16.6
xho-lug 4 5.7 5 26.7 29.5 27.6 5.4 7.9 6.6
nso-xho 10.3 9.2 8.7 41.9 40.2 38.8 17.4 16.5 15.8
nya-zul 8.8 7.8 7.2 40.6 38.6 37.1 17.8 16.2 14.8
swa-kam 2.7 2.8 2.8 19.5 20.7 20 4 4.3 4.1
nya-xho 7.7 6.7 6.4 38.7 36.4 35.6 15.7 13.7 13.2
kin-tso 13.1 10.7 10.9 39.3 35.8 36.2 15.9 12.9 12.8
nya-nso 12.4 12.1 12.5 36.6 36.6 37.4 14.2 14.1 14.6
afr-lug 4.8 4.7 4.7 28.8 28.5 28.9 6.3 6.7 7.1
orm-amh 3.8 2.8 2.2 21.1 18.4 16.4 11.8 10 8.4
swa-amh 8.5 5.5 7 31.9 26.4 28.5 21.8 17.5 19.1
kin-swa 19.6 15.6 13.8 46.2 41.6 39.1 23.1 19 16.7
zul-lug 3.6 5.3 4.7 26.1 28.7 27.2 5 7.6 6.5
swa-nso 17.4 14.4 15.1 43.1 39.4 40.4 19.1 16 16.7
ssw-xho 8.9 8.9 7.8 39.8 39.7 37.5 16.3 16.5 14.8
kin-ssw 5.1 4.6 3.9 34.1 31.7 31.4 11.3 9.8 9
kin-nya 10.4 9 8.4 39.9 37.2 35.2 14 12.1 10.9
lug-yor 2.7 3.1 2.8 16.1 17.2 16.5 4.8 5.8 5
zul-xho 12 10.4 10 45.1 42.9 41.5 21.3 19.2 18.1
afr-xho 11.1 9.7 9.5 44.6 42.4 42.1 20.5 18.3 18.1
afr-zul 13 11.4 10.7 47.2 45.1 44.6 23.9 21.6 20.8
zul-tso 14.3 14.1 14 41.9 40 40.3 17.4 16.9 16.7
kin-afr 17.2 15.3 14 42.3 39.7 37.8 20.4 17.8 16.2
swa-hau 19.5 14.8 16.5 45.8 38.9 41.6 22.3 17.2 19
swa-orm 1.1 0.7 0.6 18.3 15 14.6 2.3 1.1 0.9
afr-tso 15.4 13.5 13.8 43.2 40.1 41.2 18.9 16.2 16.6
kin-lug 1.9 4.1 4.1 19.3 26.8 26.5 3.4 6.2 5.8
kin-zul 9.8 8.2 7.3 41.5 39.1 37.2 18.6 16.2 14.4
zul-ssw 7.3 7.2 6.6 39.3 38.2 37.1 14.9 15.1 13.7
kin-xho 8.6 6.9 6.8 39.1 36.6 35.7 15.7 13.5 12.6
amh-lug 2.6 3.2 3 24.7 25.8 25.4 3.6 5 4.6
afr-ssw 6.3 5.4 5 37.7 36.1 35.9 13.6 12.5 11.9
afr-nya 12.3 11 10.3 43.1 41.1 40 16.4 14.6 13.9
Table 13: Results of all language pairs on FLORES200 devtest.