Leyzer: A Dataset for Multilingual Virtual Assistants
Marcin Sowański1,2 [0000-0002-9360-1395] and Artur Janicki2 [0000-0002-9937-4402]
1Samsung R&D Institute Poland, Warsaw, Poland
2Warsaw University of Technology, Warsaw, Poland
Abstract. In this article we present the Leyzer dataset, a multilingual text corpus designed to study multilingual and cross-lingual natural language understanding (NLU) models and the strategies of localization of virtual assistants. The proposed corpus consists of 20 domains across three languages: English, Spanish and Polish, with 186 intents and a wide range of samples, ranging from 1 to 672 sentences per intent. We describe the data generation process, including creation of grammars and forced parallelization. We present a detailed analysis of the created corpus. Finally, we report the results for two localization strategies: train-on-target and zero-shot learning using multilingual BERT models.
Keywords: virtual assistant, multilingual natural language understanding, text
corpus, machine translation
1 Introduction
Virtual assistants (VAs) have been available since the 1960s, but the release of their recent generation on smartphones and embedded devices has opened them to a broader audience. The most popular development approach for such systems is to release an initial set of languages, usually English first, and then the following languages. Although there might be various reasons for choosing such an approach, it is clear that adding support for new languages is a time- and cost-consuming process.
There are over 6,900 living languages in the world, of which more than 91 have over 10 million users. If we want to build an unfragmented e-society, we have to develop methods that will allow us to create multilingual VAs also for so-called low-resource languages.
In this work we present Leyzer3, a dataset containing a large number of utterances created for the purpose of investigating cross-lingual transfer learning in natural language understanding (NLU) systems. We believe that Leyzer presents great opportunities to investigate multilingual and cross-lingual NLU models and localization strategies, which allow translating and adapting an NLU system to a specific country or region. While creating our dataset, we focused particularly on testing localization strategies that use machine translation (MT) and multilingual word embeddings. The first localization strategy that we tested was the so-called train-on-target, where the training corpus of a system is translated from one language to another and the model trained
3 Named after Ludwik Lejzer Zamenhof, a Polish linguist and the inventor of the international language Esperanto, the most widely used constructed international auxiliary language in the world.
from this corpus is tested on a parallel testset that was created manually by language experts (LEs). The second localization strategy tested was zero-shot learning, where a system that uses multilingual embeddings is supposed to generalize from the language it was trained on to new languages that it will later be tested on. Finally, we report results for two types of baseline models that were trained either on single-language data only or on all data available in three languages at once.
To the best of our knowledge, Leyzer is the largest dataset in terms of the number of domains, intents (where an intent is understood as an utterance-level concept representing a system functionality available to the user) and slots (where a slot is defined as a word-level concept representing the parameters of a given intent) in the area of multilingual datasets focused on problems of the localization of VAs. It has been publicly released, together with the code to allow reproduction of the experiments, and is available at https://github.com/cartesinus/leyzer.
2 Related Datasets
There exist a couple of text corpora that are often used in the context of VAs, and they can be divided into two groups.
Table 1. Statistics of existing corpora compared to Leyzer, proposed in this work. The first group consists of resources designed to train and test VAs without focusing on a multilingual setup. The second group concerns multilingual VAs.
Dataset Languages # Utterances # Domains # Intents # Slots
ATIS [11] en 5871 1 26 83
Larson et al. [8] en 23,700 10 150 0
Liu et al. [9] en 25,716 19 64 54
Snips [5] en, fr 2,943/1,136 - 7 72
Schuster et al. [12] en, es, th 43,323/8,643/5,083 3 12 11
Leyzer (this work) en, es, pl 3779/5425/7053 20 186 86
The first group is represented by the Air Travel Information System (ATIS) [11] dataset, which consists of spoken queries from the flight domain in English. ATIS has a small number of intents and is heavily unbalanced, with most utterances belonging to three intents. Larson et al. [8] created a dataset to study out-of-scope queries that do not fall into any of the system's supported intents. The presented corpus consists of 23,700 queries equally distributed among 150 intents, which can be grouped into 10 general domains. Yet another corpus for English is the one created by Liu et al. [9]. Their dataset, created as a use case of a home robot, can be used to train and compare multiple NLU platforms (Rasa, Dialogflow, LUIS and Watson). The dataset consists of 25,716 English sentences from 21 domains that can be divided into 64 intents and 54 slot types.
The Snips [5] dataset represents the second category of VA datasets, those designed to train and evaluate multilingual VAs. The dataset has a small number of intents;
each intent, however, has a large number of sentences. Schuster et al. [12] proposed a multilingual dataset for English, Spanish and Thai to study various cross-lingual transfer scenarios. The dataset consists of 3 domains: Alarm, Reminder and Weather, with a small number of intents and slots (12 intents and 11 slots in total). Different languages have different numbers of sentences: English has 43,323, Spanish 8,643 and Thai 5,083. It follows that there is a large number of sentences per intent and per slot type.
Table 1 summarizes the existing corpora used to test VAs, and compares them with
our dataset, proposed in this article. There are many multi-domain and multi-intent
resources for English from which to choose. However, to the best of our knowledge,
there exist no multilingual resources with many domains, intents and slot types.
3 Our Dataset
We designed our dataset to be useful mostly in the following two areas related to VAs:
– development and evaluation of VAs, and
– creation and localization of the dataset into other languages in order to have a parallel multilingual dataset.
Commercial VA systems often face multiple challenges:
1. Number of languages and their linguistic phenomena, which represents the challenge of building a multilingual system and handling phenomena such as flexion, which has an impact on slot recognition,
2. Number of domains and their distribution, which introduces two major challenges:
(a) how to train a model to equally represent each domain, even if our trainset is not balanced in terms of the number of sentences per domain,
(b) how to treat sentences that are similar or identical in more than one domain,
3. Number of intents and how they differ. This introduces the problem of having multiple intents that differ only by one parameter or word,
4. Number of slots and their values, which introduces the challenge of how to train a model that will recognize slots not by their values but rather by their syntactic function in the sentence.
We approached these typical problems by creating a dataset that consists of a large
number of intent classes (186), yet also contains a wide range of samples per intent
class, ranging from 1 to 672 sentences per intent. We selected three languages that
represent separate language families (Germanic, Romance, Slavic) to address problems
typical for multilingual systems.
There is no easy mapping between the intents in Leyzer and those of Larson et al. [8]; some intents, however, overlap. When comparing the intents of Leyzer with the intents in the corpus created by Liu et al. [9], we found out that of their 18 domains (called scenarios in [9]) we could match seven domains in Leyzer. Similarly to Schuster et al. [12], in our paper we tested the train-on-target and zero-shot scenarios. When compared to Schuster et al., our dataset consists of more intents and slots, which, we
believe, may have a significant impact on the results, especially for the train-on-target scenario. If an NLU system has hundreds of closely related intents, MT systems may easily fail to properly distinguish them, which, as a consequence, may lead to a large number of errors.
Leyzer differs from corpora such as MultiWOZ [1], because our dataset contains isolated utterances instead of dialogues. We wanted to create a resource that is controllable and cheap in terms of the time needed to create or modify it. We also wanted to demonstrate that VAs able to handle hundreds of intents and slots are still a challenging problem.
3.1 Creation of Corpus
Generation of Leyzer consisted of five steps: creating base grammars, creating target grammars, applying forced parallelization, slot expansion, and splitting data into train-, dev- and testsets. They are briefly characterized below.
In contrast to approaches such as MultiWOZ, where utterances are usually gathered with the use of crowdsourcing, we decided to use grammars written by qualified LEs. We believe that all concerns about grammar-generated text, namely about its lack of naturalness, can be eliminated if a quality control procedure is implemented. We think that grammar-based corpora have two noteworthy advantages: they are cheap to generate and remodel, and they can cover all possible ways to express a given intent, which crowdsourced approaches can easily miss.
Base Grammars Creation Starting with English, we created 20 grammars with sentence patterns in the JSpeech Grammar Format (JSGF). The initial set of intents in each domain was inspired by the example commands available in the Almond Virtual Assistant [2]. Slot values were crawled from the Internet or created manually. Depending on the slot type, we gathered from a few to a few hundred values for each slot.
All sentence patterns in the corpus were generated from grammars. Each such pattern represents a possible way to utter a sentence without explicitly giving the content of the slots. Later on, the grammars were filled with the slot values. Since sentences generated in such a fashion might contain some unnatural expressions or grammatical errors, we requested verification by LEs. Wherever possible, incorrect sentences were fixed, and if that was not possible, the sentences were removed.
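The pattern-generation step can be illustrated with a small sketch. The grammar below is a hypothetical toy in the spirit of the base grammars, not an actual Leyzer grammar: JSGF-style alternatives are expanded into sentence patterns while slot placeholders such as $SONG are left intact.

```python
from itertools import product

# Toy JSGF-style rules: each non-terminal maps to its alternatives.
# The rule names and alternatives here are illustrative only.
RULES = {
    "<play>": ["play", "start playing"],
    "<by>": ["by", "from"],
}

PATTERN = "<play> $SONG <by> $ARTIST"

def expand(pattern, rules):
    """Expand every non-terminal into all of its alternatives,
    yielding sentence patterns with slot placeholders intact."""
    tokens = pattern.split()
    choices = [rules.get(tok, [tok]) for tok in tokens]
    return [" ".join(combo) for combo in product(*choices)]

patterns = expand(PATTERN, RULES)
# 2 <play> alternatives x 2 <by> alternatives = 4 sentence patterns
```

The Cartesian product over rule alternatives is what lets a small grammar cover all intended phrasings of an intent, which a crowdsourced collection could miss.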
Target Grammars Creation The same procedures were used to create the target grammars. To have intents and slots with the same meaning in all languages, the LEs were asked to create grammars with intents that represent the same meaning as in English but, at the same time, represent the most natural way of expressing such an intent in the target language. Slot values were either crawled or created manually.
Forced Parallelization Although, as discussed in the previous step, the same intents will have the same meaning in all languages, there is no sentence-to-sentence mapping between the languages. This is because intents can be expressed differently across languages and our creation procedure did not imply parallel translations. To mitigate this problem, we decided to create a parallel subset of our corpus that can be used as a testset for cross-lingual experiments. All English patterns were machine translated into Polish and Spanish with Google Translate and then verified and fixed by the LEs in the OmegaT4 tool.
Slot Expansion Patterns for all languages, as presented in Table 3, were expanded with slot values that were previously crawled or manually created. We paid a lot of attention to gathering enough slot values so that during expansion each pattern, if possible, had a different slot value. This way, we were able to avoid the systematic error of a system that memorizes slots on the basis of their values. Once the patterns were expanded, the LEs verified them and changed them if needed.
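A minimal sketch of such slot expansion, under the assumption of toy slot inventories (the slot names and values below are illustrative, not taken from the corpus): rotating through each value list means consecutive patterns receive different slot values wherever possible, which is the anti-memorization property described above.

```python
from itertools import cycle

# Hypothetical slot inventories; the real corpus crawled these values.
SLOT_VALUES = {
    "$SONG": ["Hey Jude", "Desafinado", "Hej sokoly"],
    "$ARTIST": ["The Beatles", "Stan Getz", "Maryla Rodowicz"],
}

def expand_slots(patterns, slot_values):
    """Fill slot placeholders, rotating through the value lists so
    that successive patterns receive different slot values."""
    iters = {slot: cycle(vals) for slot, vals in slot_values.items()}
    sentences = []
    for pattern in patterns:
        sent = pattern
        for slot, it in iters.items():
            if slot in sent:
                sent = sent.replace(slot, next(it))
        sentences.append(sent)
    return sentences

sents = expand_slots(["play $SONG by $ARTIST", "play $SONG"], SLOT_VALUES)
```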
Data Split The last step of corpus creation was splitting it into three parts: trainset, testset and development set. To create the testset, we first created parallel sentences, as described in the Forced Parallelization step, and later expanded the slots. Then, we selected at least one sentence from each intent that at the same time was available in all three languages. This way it is possible to test cross-lingual scenarios. The training and development parts of the corpus were taken from the target grammar patterns that were expanded with the slots. Up to 10% of such expansions formed the development set, while the remaining part formed the trainset.
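The train/dev part of this procedure can be sketched as follows; the function and variable names are ours, and the real pipeline additionally reserves the parallel sentences for the testset.

```python
import random
from collections import defaultdict

def split_corpus(utterances, dev_fraction=0.1, seed=0):
    """Split expanded (intent, sentence) pairs into train- and devsets,
    holding out up to dev_fraction of each intent's sentences for dev."""
    by_intent = defaultdict(list)
    for intent, sent in utterances:
        by_intent[intent].append(sent)
    rng = random.Random(seed)
    train, dev = [], []
    for intent, sents in by_intent.items():
        rng.shuffle(sents)
        n_dev = int(len(sents) * dev_fraction)  # up to 10% by default
        dev += [(intent, s) for s in sents[:n_dev]]
        train += [(intent, s) for s in sents[n_dev:]]
    return train, dev

# Toy usage: 20 expansions of one intent -> 2 dev, 18 train
data = [("PlaySongByArtist", "play song %d" % i) for i in range(20)]
train, dev = split_corpus(data)
```

Splitting per intent, rather than globally, keeps every intent represented in the trainset even when some intents have very few sentences.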
3.2 Domain Selection
Following [2], we used 20 domains, which represent popular applications that can be used on mobile devices, computers or embedded devices. We can categorize them into groups with similar functions:
– Communication, with the Email, Facebook, Phone, Slack and Twitter domains in this group. All these domains contain some kind of command to send a message.
– Internet, with Web Search and Wikipedia. The aim of these domains is to search for information on the web and, therefore, these domains will have a lot of open-title slots.
– Media and Entertainment, with the Spotify and YouTube domains in this group. The root function of these applications is to find content with named entities connected with artists or titles.
– Devices, with the Air Conditioner and Speaker domains. These domains represent simple physical devices that can be controlled by voice.
– Self-management, with Calendar and Contacts. These domains consist of actions that involve time planning and people.
– Other, non-categorized domains, which represent functions and language not common to the other categories. In that sense, the remaining domains can be represented as intentionally not matching other domains.
4 A computer-assisted translation tool: https://omegat.org/
Table 2. Statistics of sentences, intents and slots across domains and languages in Leyzer dataset.
Domain # Intents # Slots # English Utt. # Spanish Utt. # Polish Utt.
Airconditioner 13 3 48 61 52
Calendar 8 5 69 120 190
Contacts 12 4 306 481 615
Email 11 7 294 315 301
Facebook 7 4 48 581 193
Fitbit 5 3 89 116 263
Google Drive 11 5 55 241 305
Instagram 10 6 144 471 579
News 4 3 31 30 42
Phone 5 4 192 283 130
Slack 13 8 268 268 295
Speaker 7 2 73 72 43
Spotify 18 7 633 827 823
Translate 9 6 462 109 452
Twitter 6 3 147 270 122
Weather 10 5 154 159 123
Websearch 7 2 167 291 1498
Wikipedia 8 1 200 234 162
Yelp 12 5 222 142 326
Youtube 10 3 177 354 539
Total 186 86 3779 5425 7053
As mentioned above, several domains differ in size to better reflect proportions from real-world problems, where some applications will have only a few possible ways to express commands, while others will have an almost infinite number of valid commands.
3.3 Intent and Slot Selection
There is a close relationship between intents and slots in our corpus, as the intents represent functions or actions that users want to perform, while the slots are the parameters of these intents. In many cases intents represent the same action but have been distinguished on the basis of the number of parameters. During the creation of intents, our principle was that intents must differ from each other either by the language (different important keywords) or by the number of slots they have. The reason for this is purely pragmatic: to avoid system instability, we cannot have two identical sentences with different intents. The model input is a sentence and its output is the intent, so if the training corpus contained two identical sentences pointing to different intents, the model would not be able to learn to which intent such a sentence should be assigned.
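This consistency constraint can be checked mechanically. The sketch below is our own helper with invented example sentences; it flags any sentence that appears under more than one intent label, which would make intent classification ill-posed.

```python
from collections import defaultdict

def conflicting_intents(dataset):
    """Return sentences that occur with more than one intent label."""
    intents_of = defaultdict(set)
    for intent, sentence in dataset:
        intents_of[sentence.lower()].add(intent)
    return {s: sorted(i) for s, i in intents_of.items() if len(i) > 1}

# Toy training data with one deliberate conflict
data = [
    ("PlaySong", "play Desafinado"),
    ("PlaySongByArtist", "play Desafinado by Stan Getz"),
    ("OpenSpotify", "play Desafinado"),  # same sentence, different intent
]
conflicts = conflicting_intents(data)
```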
The slots in our corpus can be categorized into two groups:
Table 3. Representative patterns from selected domains of the corpus.
Domain Intent Sentence Pattern
Calendar AddEventWithName add an event called $EVENT_NAME
Email ShowEmailWithLabel show me my emails with label $LABEL
Facebook ShowAlbumWithName show photos in my album $ALBUM
Slack SendMessageToChannel send $MESSAGE to $CHANNEL on slack
Spotify PlaySongByArtist play $SONG by $ARTIST
Translate TranslateTextToLanguage translate $TEXT to $TRG_LANG
Weather OpenWeather what's the weather like
Websearch SearchTextOnEngine google $TXT_QUERY
– Open-titled – where the values of the slot can be treated as infinite and therefore cannot be listed. Open-titled slots are challenging for NLU systems because they force them to generalize to unseen data.
– Close-titled – where the values of the slot can be listed.
4 Experiments
4.1 Experimental Setup
As the architecture for all of our experiments we used the Joint BERT architecture [4] implemented in the NeMo toolkit [7]. We used the pre-trained multilingual cased BERT model consisting of 12 layers and 110M parameters. If not stated otherwise, we trained models for 100 epochs and saved checkpoints for each one. All checkpoints were evaluated on the test part of the corpus. The reported results come from the checkpoint that achieved the highest score in the tests. The batch size was 128. Adam [6] was used for optimization with an initial learning rate of 2e-5. The dropout probability was set to 0.1. We trained each model independently with all data available in the training part of the corpus. In all of our experiments we used the first version of our corpus (0.1.0).
4.2 Testing Scenarios
We evaluated the proposed corpus using the following four scenarios:
– Single-language Models – here we trained each language independently on all sentences available in the trainset and evaluated the model on the testset.
– Multi-language Model – in this experiment we trained one model using all training data available for all three languages and evaluated it independently for each language.
– Train-on-target – similar to the strategy proposed by Cettolo et al. [3], we used Google Translate to translate English patterns into Polish and Spanish, and expanded them with target slot values.
– Zero-shot Learning – to test this scenario, we trained an English model with multilingual cased BERT on the English part of the Leyzer trainset and tested it on the Polish and Spanish testsets.
We used accuracy to evaluate the performance of intent prediction and the standard BIO structure to calculate the macro F1-score, which does not take label imbalance into account. We used the evaluation metric implemented in scikit-learn [10] and provided in the NeMo evaluation script. Using this script, we tested each model epoch, and the results for the ones that scored best on both the intent and the slot level are presented in Table 4.
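For illustration, the two metrics can be reproduced in a few lines of plain Python. This sketch mirrors scikit-learn's accuracy and f1_score(average='macro'); the BIO slot tags below are invented examples, not corpus data.

```python
def intent_accuracy(gold, pred):
    """Fraction of utterances with a correctly predicted intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Macro F1: unweighted mean of per-label F1 scores, so rare
    labels count as much as frequent ones (no imbalance weighting)."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy BIO slot tags for one short utterance
gold = ["O", "B-SONG", "I-SONG", "O", "B-ARTIST"]
pred = ["O", "B-SONG", "O", "O", "B-ARTIST"]
```

Because macro averaging gives every slot label equal weight, a single rarely occurring slot type that the model never predicts can pull the score down sharply.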
4.3 Results and Discussion
The Single-BERT models scored relatively low on both the intent and the slot level, yielding 47%, 52% and 69% intent accuracy for English, Polish and Spanish, respectively. We believe that the reason for this is the large number of intent classes in our corpus, which, incidentally, was a motivation to create such a corpus.
To give some perspective to our experiments, we trained the model on the training part of the ATIS dataset with the same parameters as in the Single-BERT scenario. When evaluated on the test part of ATIS, we obtained 97.31% on the intent level and 55.23% F1-macro on the slot level (97.11% for F1-micro). These results suggest that easier problems, such as ATIS, can be easily learned by a Single-BERT model.
Table 4. Results for NeMo models trained on various configurations of the Leyzer corpus.
Model Type Language Intent acc. Slot F1 macro
Single-BERT English 46.58 45.07
Polish 51.66 54.56
Spanish 68.88 67.79
Multi-BERT English 62.80 76.48
Polish 64.17 74.83
Spanish 72.26 84.66
Train-on-target Polish 41.67 40.70
Spanish 46.42 52.38
Zero-shot Polish 13.82 15.39
Spanish 30.21 24.13
The Multi-BERT experiment scored better than the Single-BERT models on both the intent and the slot level. We believe that the reason for this is that the multilingual model had more data to learn how to separate intent classes and eliminate inconsistencies. The presented results suggest that multilingual models might benefit from joint learning on multiple languages, at least for problems formulated as in this paper.
The train-on-target models scored low when compared to the Single-BERT models. We think that MT errors, especially in the most important components of the sentence (usually verbs), led to a drastic performance drop. On the intent classification level, the accuracy for Polish and Spanish was 9.9 and 22.5 percentage points lower than the baseline, respectively.
The zero-shot scenario scored very low when compared to the Single-BERT or train-on-target experiments. The large number of intent classes, combined with different slot values in each language, makes this a non-trivial problem, and, apparently, more sophisticated methods are needed.
The results presented in this article may seem unsatisfactory, especially if we compare them to other VA publications. However, it is noteworthy that a search for the best architecture and parameters was not the aim of this work; rather, we wanted to set the baselines and show the complexity of the MT problem for the proposed data. We aimed to create a challenging corpus that can be the subject of future works, such as the localization of VAs with the use of the train-on-target and zero-shot learning scenarios.
5 Conclusions and Future Work
In our work we introduced a new dataset, named Leyzer, designed to study multilingual and cross-lingual NLU models and localization strategies in VAs. We also presented results for models trained on our corpus that can set the baseline for further work.
In the future we plan to extend our dataset to new languages and increase the number of sentences per intent. Another line of work that we consider is adding follow-up intents, as this would allow building a fully autonomous VA from our dataset.
The Leyzer dataset, the translation memories and the detailed experiment results presented in this paper are available at https://github.com/cartesinus/leyzer. We hope that this way we will foster further research in machine translation for virtual assistants.
Acknowledgments
We thank Małgorzata Misiaszek for her help in verifying the quality of our corpus and improving its consistency.
References
1. Budzianowski, P., Wen, T.H., Tseng, B.H., Casanueva, I., Ultes, S., Ramadan, O., Gašić, M.: MultiWOZ – a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 5016–5026. Association for Computational Linguistics, Brussels, Belgium (2018), https://www.aclweb.org/anthology/D18-1547
2. Campagna, G., Ramesh, R., Xu, S., Fischer, M., Lam, M.S.: Almond: The architecture of
an open, crowdsourced, privacy-preserving, programmable virtual assistant. In: Proc. of the
26th International Conference on World Wide Web. pp. 341–350 (2017)
3. Cettolo, M., Corazza, A., De Mori, R.: Language portability of a speech understanding sys-
tem. Computer Speech & Language 12(1), 1–21 (1998)
4. Chen, Q., Zhuo, Z., Wang, W.: BERT for joint intent classification and slot filling (2019)
5. Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A., Leroy, D., Doumouro, C., Gisselbrecht, T., Caltagirone, F., Lavril, T., et al.: Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint
6. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proc. of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA (2015)
7. Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., Kriman, S., Beliaev,
S., Lavrukhin, V., Cook, J., Castonguay, P., Popova, M., Huang, J., Cohen, J.M.: NeMo: a
toolkit for building AI applications using neural modules (2019)
8. Larson, S., Mahendran, A., Peper, J., Clarke, C., Lee, A., Hill, P., Kummerfeld, J.K., Leach, K., Laurenzano, M., Tang, L., Mars, J.: An evaluation dataset for intent classification and out-of-scope prediction. In: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China (2019)
9. Liu, X., Eshghi, A., Swietojanski, P., Rieser, V.: Benchmarking natural language understanding services for building conversational agents. arXiv preprint arXiv:1903.05566 (2019)
10. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research 12, 2825–2830 (2011)
11. Price, P.: Evaluation of spoken language systems: The ATIS domain. In: Proc. of the Speech
and Natural Language Workshop, Hidden Valley, PA (1990)
12. Schuster, S., Gupta, S., Shah, R., Lewis, M.: Cross-lingual transfer learning for multilingual
task oriented dialog. In: Proc. of the 2019 Annual Conference of the North American Chapter
of the Association for Computational Linguistics (NAACL-HLT 2019), Minneapolis, MN
13. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf,
R., Funtowicz, M., Brew, J.: Huggingface’s transformers: State-of-the-art natural language
processing. ArXiv abs/1910.03771 (2019)