Leyzer: A Dataset for Multilingual Virtual Assistants


Marcin Sowański1,2 [0000-0002-9360-1395] and Artur Janicki2 [0000-0002-9937-4402]
1Samsung R&D Institute Poland, Warsaw, Poland
2Warsaw University of Technology, Warsaw, Poland
Abstract. In this article we present the Leyzer dataset, a multilingual text corpus
designed to study multilingual and cross-lingual natural language understanding
(NLU) models and the strategies of localization of virtual assistants. The proposed
corpus consists of 20 domains across three languages: English, Spanish and Polish,
with 186 intents and a wide range of samples, ranging from 1 to 672 sentences per
intent. We describe the data generation process, including creation of grammars and
forced parallelization. We present a detailed analysis of the created corpus. Finally,
we report the results for two localization strategies: train-on-target and zero-shot
learning using multilingual BERT models.
Keywords: virtual assistant, multilingual natural language understanding, text
corpus, machine translation
1 Introduction
Virtual assistants (VAs) have been available since the 1960s, but the release of their
recent generation on smartphones and embedded devices has opened them to a broader
audience. The most popular development approach for such systems is to release an
initial set of languages, usually with English first, and then add further languages.
Although there might be various reasons for choosing such an approach, it is clear that
adding support for new languages is a time-consuming and costly process.
There are over 6900 living languages in the world, of which more than 91 have over
10 million users. If we want to build an unfragmented e-society, we have to develop
methods that will allow us to create multilingual VAs also for so-called low-resource
languages.
In this work we present Leyzer3, a dataset containing a large number of utterances
created for the purpose of investigating cross-lingual transfer learning in natural
language understanding (NLU) systems. We believe that Leyzer presents great
opportunities to investigate multilingual and cross-lingual NLU models and localization
strategies, which allow translating and adapting an NLU system to a specific country
or region. While creating our dataset, we focused particularly on testing localization
strategies that use machine translation (MT) and multilingual word embeddings. The
first localization strategy that we tested was the so-called train-on-target, where the
training corpus of a system is translated from one language to another and the model
trained on this corpus is tested on a parallel testset that was created manually by
language experts (LEs). The second localization strategy tested was zero-shot learning,
where a system that uses multilingual embeddings is supposed to generalize from the
language it was trained on to new languages that it will later be tested on. Finally,
we report results for two types of baseline models that were trained either on
single-language data only or on all data available in the three languages at once.
3 Named after Ludwik Lejzer Zamenhof, a Polish linguist and the inventor of the
international language Esperanto, the most widely used constructed international
auxiliary language in the world.
The final authenticated publication is available online
To the best of our knowledge, Leyzer is the largest dataset in terms of the number of
domains, intents (where an intent is understood as an utterance-level concept
representing system functionality available to the user) and slots (where a slot is
defined as a word-level concept representing the parameters of a given intent) among
multilingual datasets focused on the problem of localizing VAs. It has been publicly
released, together with the code that allows reproduction of the experiments, and is
available at
2 Related Datasets
Several text corpora are often used in the context of VAs. They can be divided into
two groups.
Table 1. Statistics of existing corpora compared to Leyzer, proposed in this work. The first
group consists of resources designed to train and test VAs without focusing on a multilingual
setup. The second group concerns multilingual VAs.
Dataset               Languages    # Utterances         # Domains   # Intents   # Slots
ATIS [11]             en           5,871                1           26          83
Larson et al. [8]     en           23,700               10          150         0
Liu et al. [9]        en           25,716               19          64          54
Snips [5]             en, fr       2,943/1,136          -           7           72
Schuster et al. [12]  en, es, th   43,323/8,643/5,083   3           12          11
Leyzer (this work)    en, es, pl   3,779/5,425/7,053    20          186         86
The first group includes the Air Travel Information System (ATIS) [11] dataset, which
consists of spoken queries from the flight domain in English. ATIS has a small number
of intents and is heavily unbalanced, with most utterances belonging to three intents.
Larson et al. [8] created a dataset to study out-of-scope queries that do not fall into
any of the system’s supported intents. Their corpus consists of 23,700 queries equally
distributed among 150 intents, which can be grouped into 10 general domains. Yet
another corpus for English is the one created by Liu et al. [9]. Their dataset, created
for the use case of a home robot, can be used to train and compare multiple NLU
platforms (Rasa, Dialogflow, LUIS and Watson). The dataset consists of 25,716 English
sentences from 21 domains that can be divided into 64 intents and 54 slot types.
The Snips [5] dataset represents the second category of VA datasets, which were
designed to train and evaluate multilingual VAs. The dataset has a small number of
intents; each intent, however, has a large number of sentences. Schuster et al. [12]
proposed a multilingual dataset for English, Spanish and Thai to study various
cross-lingual transfer scenarios. The dataset consists of 3 domains: Alarm, Reminder
and Weather, with a small number of intents and slots (12 intents and 11 slots in
total). The languages differ in the number of sentences: English has 43,323, Spanish
8,643 and Thai 5,083. It follows that there is a large number of sentences per intent
and per slot type.
Table 1 summarizes the existing corpora used to test VAs, and compares them with
our dataset, proposed in this article. There are many multi-domain and multi-intent
resources for English from which to choose. However, to the best of our knowledge,
there exist no multilingual resources with many domains, intents and slot types.
3 Our Dataset
We designed our dataset to be useful mostly in the following two areas related to VAs:
– development and evaluation of VAs, and
– creation and localization of the dataset into other languages in order to have a
parallel multilingual dataset.
Commercial VA systems often face multiple challenges:
1. Number of languages and their linguistic phenomena, which poses the challenge of
building a multilingual system and handling phenomena such as inflection, which
has an impact on slot recognition,
2. Number of domains and their distribution, which introduce two major challenges:
(a) how to train a model to represent each domain equally, even if the trainset is
not balanced in terms of the number of sentences per domain,
(b) how to treat sentences that are similar or identical in more than one domain,
3. Number of intents and how they differ, which introduces the problem of having
multiple intents that differ only by one parameter or word,
4. Number of slots and their values, which introduces the challenge of how to train a
model that will recognize slots not by their values but rather by their syntactic
function in the sentence.
We approached these typical problems by creating a dataset that consists of a large
number of intent classes (186), yet also contains a wide range of samples per intent
class, ranging from 1 to 672 sentences per intent. We selected three languages that
represent separate language families (Germanic, Romance, Slavic) to address problems
typical for multilingual systems.
There is no easy mapping between the intents in Leyzer and those of Larson et
al. [8]; some intents, however, overlap. When comparing the intents of Leyzer with the
intents in the corpus created by Liu et al. [9], we found that out of their 18 domains
(called scenarios in [9]) we could match seven domains in Leyzer. Similarly to Schuster
et al. [12], in our paper we tested the train-on-target and zero-shot scenarios. When
compared to Schuster et al., our dataset consists of more intents and slots, which, we
believe, may have a significant impact on the results, especially for the
train-on-target scenario. If an NLU system has hundreds of closely related intents, MT
systems may easily fail to distinguish them properly, which, as a consequence, may lead
to a lot of
Leyzer differs from corpora such as MultiWOZ [1], because our dataset contains
isolated utterances instead of dialogues. We wanted to create a resource that is
controllable and cheap in terms of the time needed to create or modify it. We also
wanted to demonstrate that VAs able to handle hundreds of intents and slots are still a
challenging
3.1 Creation of Corpus
Generation of Leyzer consisted of five steps: creating base grammars, creating target
grammars, applying forced parallelization, slot expansion and splitting data into
train-, dev- and testsets. They are briefly characterized below.
In contrast to approaches such as MultiWOZ, where utterances are usually gathered
with the use of crowdsourcing, we decided to use grammars that are written by qualified
LEs. We believe that concerns about grammar-generated text, namely its lack of
naturalness, can be eliminated if a quality control procedure is implemented. We think
that grammar-based corpora have two noteworthy advantages: they are cheap to
generate and modify, and they can cover all possible ways to express a given intent,
which crowdsourced approaches can easily miss.
Base Grammars Creation. Starting with English, we created 20 grammars with sentence
patterns in the JSpeech Grammar Format (JSGF). The initial set of intents in each
domain was inspired by example commands available in the Almond Virtual Assistant [2].
Slot values were crawled from the Internet or created manually. Depending on the slot
type, we gathered from a few to a few hundred values for each slot.
All sentence patterns in the corpus were generated from grammars. Each such pattern
represents a possible way to utter a sentence without explicitly giving the content of
the slots. Later on, the patterns were filled with the slot values. Since sentences
generated in such a fashion might contain some unnatural expressions or grammatical
errors, we requested verification by LEs. Wherever possible, incorrect sentences
were fixed, and if that was not possible, sentences were removed.
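To make the idea of generating sentence patterns from a grammar more concrete, the following minimal Python sketch expands a toy rule into all patterns it licenses. It is only an illustration under stated assumptions: the nested list/tuple encoding stands in for JSGF alternatives and optional parts, and the function name `expand` and the example rule are hypothetical, not the authors' grammars or tooling.

```python
from itertools import product


def expand(rule):
    """Return every sentence pattern licensed by a toy grammar rule.

    A rule is a list of items. Each item is either a string (a fixed
    fragment) or a tuple of alternative fragments. An empty string inside
    a tuple makes that item optional.
    """
    # Wrap fixed fragments in 1-tuples so every item is a set of choices.
    choices = [(item,) if isinstance(item, str) else item for item in rule]
    patterns = []
    for combo in product(*choices):
        # Skip empty fragments that come from optional items.
        patterns.append(" ".join(fragment for fragment in combo if fragment))
    return patterns


# Hypothetical rule; the $SONG and $ARTIST placeholders are left unfilled.
rule = [("play", "put on"), "$SONG", ("by", "from"), "$ARTIST"]
patterns = expand(rule)

# Hypothetical rule with an optional leading word.
optional_patterns = expand([("please", ""), "play", "$SONG"])
```

Here `patterns` holds four patterns, the first being `play $SONG by $ARTIST`, and the slots stay as placeholders until slot values are substituted in a later step.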
Target Grammars Creation. The same procedure was used to create target grammars.
To have intents and slots with the same meaning in all languages, LEs were asked to
create grammars with intents that carry the same meaning as in English but, at
the same time, represent the most natural way of expressing such an intent in the
target language. Slot values were either crawled or created manually.
Forced Parallelization. Although, as discussed in the previous step, the same intents
have the same meaning in all languages, there is no sentence-to-sentence mapping
between different languages. This is because intents can be expressed differently
across languages and our creation procedure did not imply parallel translations. To
mitigate this problem, we decided to create a parallel subset of our corpus that can be
used as a testset for cross-lingual experiments. All English patterns were machine
translated into Polish and Spanish with Google Translate and then verified and fixed by
the LEs in the
Slot Expansion. Patterns for all languages, such as those presented in Table 3, were
expanded with slot values that were previously crawled or manually created. We paid a
lot of attention to gathering enough slot values so that during expansion each pattern,
if possible, has a different slot value. This way, we were able to avoid the systematic
error of a system memorizing slots on the basis of their values. Once the patterns were
expanded, the LEs verified them and changed them, if needed.
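The slot expansion step can be sketched in a few lines of Python. The sketch below rotates through the available values of each slot so that consecutive patterns receive different values whenever enough values exist. The `$NAME` placeholder syntax follows the patterns quoted in the paper, while the function name `expand_slots`, the rotation policy and the example values are assumptions made for illustration only.

```python
import itertools
import re

# Matches placeholders such as $SONG or $ARTIST.
SLOT = re.compile(r"\$([A-Z_]+)")


def expand_slots(patterns, slot_values):
    """Fill every $SLOT placeholder in each pattern with a concrete value.

    Values of each slot are consumed in rotation, so two patterns only
    share a value once the list of values for that slot is exhausted.
    """
    cursors = {name: itertools.cycle(values) for name, values in slot_values.items()}
    sentences = []
    for pattern in patterns:
        sentence = SLOT.sub(lambda match: next(cursors[match.group(1)]), pattern)
        sentences.append(sentence)
    return sentences


patterns = ["play $SONG by $ARTIST", "i want to hear $SONG", "put on $ARTIST"]
slot_values = {  # hypothetical values, not taken from the corpus
    "SONG": ["yesterday", "imagine"],
    "ARTIST": ["the beatles", "john lennon"],
}
sentences = expand_slots(patterns, slot_values)
```

Rotating through values rather than sampling them at random keeps the sketch deterministic; a real pipeline could equally shuffle the value lists first.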
Data Split. The last step of corpus creation was splitting it into three parts:
trainset, testset and development set. To create the testset, we first created parallel
sentences, as described in the Forced Parallelization step, and later expanded the
slots. Then, we selected at least one sentence from each intent that was also available
in all three languages. This makes it possible to test cross-lingual scenarios. The
training and development parts of the corpus were taken from the target grammar
patterns that were expanded with the slots. Up to 10% of the expanded sentences formed
the development set, while the remaining part formed the trainset.
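The train/development split described above can be sketched as follows. This is a minimal illustration, assuming a single shuffled pool of expanded sentences; the paper does not state whether the split was made per intent or over the whole pool, and the function name `split_train_dev` and the fixed seed are hypothetical.

```python
import random


def split_train_dev(sentences, dev_fraction=0.10, seed=0):
    """Split expanded sentences into a trainset and a development set.

    The development set receives at most `dev_fraction` of the data; the
    count is rounded down so it never exceeds that share.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n_dev = int(len(shuffled) * dev_fraction)
    dev = shuffled[:n_dev]
    train = shuffled[n_dev:]
    return train, dev


all_sentences = [f"sentence {i}" for i in range(25)]
train, dev = split_train_dev(all_sentences)
```

With 25 sentences the development set gets 2 of them and the trainset the remaining 23.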
3.2 Domain Selection
Following [2], we used 20 domains, which represent popular applications that can be
used on mobile devices, computers or embedded devices. We can categorize them into
groups with similar functions:
– Communication, with the Email, Facebook, Phone, Slack and Twitter domains. All
these domains contain a kind of command to send a message.
– Internet, with Web Search and Wikipedia. The aim of these domains is to search for
information on the web and, therefore, these domains have a lot of open-titled slots.
– Media and Entertainment, with the Spotify and YouTube domains. The root function
of these applications is to find content with named entities connected with artists
or titles.
– Devices, with the Air Conditioner and Speaker domains. These domains represent
simple physical devices that can be controlled by voice.
– Self-management, with Calendar and Contacts. These domains consist of actions
that involve time planning and people.
– Other, uncategorized domains, which represent functions and language not common
to the other categories. In that sense, the remaining domains intentionally do not
match the other domains.
4A computer-assisted translation tool:
Table 2. Statistics of sentences, intents and slots across domains and languages in the Leyzer dataset.
Domain           # Intents   # Slots   # English Utt.   # Spanish Utt.   # Polish Utt.
Airconditioner   13          3         48               61               52
Calendar         8           5         69               120              190
Contacts         12          4         306              481              615
Email            11          7         294              315              301
Facebook         7           4         48               581              193
Fitbit           5           3         89               116              263
Google Drive     11          5         55               241              305
Instagram        10          6         144              471              579
News             4           3         31               30               42
Phone            5           4         192              283              130
Slack            13          8         268              268              295
Speaker          7           2         73               72               43
Spotify          18          7         633              827              823
Translate        9           6         462              109              452
Twitter          6           3         147              270              122
Weather          10          5         154              159              123
Websearch        7           2         167              291              1498
Wikipedia        8           1         200              234              162
Yelp             12          5         222              142              326
Youtube          10          3         177              354              539
Total            186         86        3779             5425             7053
As mentioned above, the domains differ in size to better reflect the proportions found
in real-world problems, where some applications have only a few possible ways to
express commands, while others have an almost infinite number of valid
3.3 Intent and Slot Selection
There is a close relationship between intents and slots in our corpus, as the intents
represent functions or actions that users want to perform, while the slots are the
parameters of these intents. In many cases intents represent the same action, but they
have been distinguished on the basis of the number of parameters. During the creation
of intents our principle was that intents must differ from each other either by the
language (different important keywords) or by the number of slots they have. The reason
for that is purely pragmatic: in order to avoid instability of the system, we cannot
have two identical sentences with different intents. The model input is a sentence and
its output is the intent, so if in the training corpus we had two identical sentences
pointing to different intents, then the model would not be able to learn to which
intent this sentence should be
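The constraint that identical sentences must not carry different intents is easy to check automatically. The following Python sketch flags such conflicts in a list of (sentence, intent) pairs. It is an illustrative check only: the function name `find_conflicts`, the lowercasing and whitespace normalization, and the example intents are assumptions and are not described in the paper.

```python
from collections import defaultdict


def find_conflicts(examples):
    """Return sentences that are labeled with more than one intent.

    `examples` is an iterable of (sentence, intent) pairs. Sentences are
    compared after lowercasing and stripping surrounding whitespace.
    """
    intents_by_sentence = defaultdict(set)
    for sentence, intent in examples:
        intents_by_sentence[sentence.strip().lower()].add(intent)
    return {
        sentence: sorted(intents)
        for sentence, intents in intents_by_sentence.items()
        if len(intents) > 1
    }


examples = [  # hypothetical training examples
    ("play yesterday", "PlaySong"),
    ("Play yesterday", "PlayAlbum"),
    ("play yesterday by the beatles", "PlaySongByArtist"),
]
conflicts = find_conflicts(examples)
```

In this example only `play yesterday` is reported, because it appears under both `PlayAlbum` and `PlaySong`.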
The slots in our corpus can be categorized into two groups:
– Open-titled – where the value of the slot can be treated as infinite and therefore
cannot be listed. Open-titled slots are challenging for NLU systems because they
force them to generalize to unseen data.
– Close-titled – where the values of the slots can be listed.
Table 3. Representative patterns from selected domains of the corpus.
Domain      Intent                    Sentence Pattern
Calendar    AddEventWithName          add an event called $EVENT_NAME
Email       ShowEmailWithLabel        show me my emails with label $LABEL
Facebook    ShowAlbumWithName         show photos in my album $ALBUM
Slack       SendMessageToChannel      send $MESSAGE to $CHANNEL on slack
Spotify     PlaySongByArtist          play $SONG by $ARTIST
Translate   TranslateTextToLanguage   translate $TEXT to $TRG_LANG
Weather     OpenWeather               what’s the weather like
Websearch   SearchTextOnEngine        google $TXT_QUERY
4 Experiments
4.1 Experimental Setup
As the architecture for all of our experiments we used Joint BERT [4] as implemented
in the NeMo toolkit [7]. We used the pre-trained multilingual cased BERT model [13]
consisting of 12 layers and 110M parameters. If not stated otherwise, we trained
models for 100 epochs and saved a checkpoint after each one. All checkpoints were
evaluated on the test part of the corpus. Reported results come from the checkpoint
which achieved the highest score in the tests. The batch size was 128. Adam [6] was
used for optimization with an initial learning rate of 2e-5. The dropout probability
was set to 0.1. We trained each model independently with all data available in the
training part of the corpus. In all of our experiments we used the first version of
our corpus (0.1.0).
4.2 Testing Scenarios
We evaluated the proposed corpus using the following four scenarios:
– Single-language Models – here we trained a model for each language independently
on all sentences available in the trainset and evaluated it on the testset.
– Multi-language Model – in this experiment we trained one model using all training
data available for all three languages and independently evaluated it for each
language.
– Train-on-target – similarly to the strategy proposed by Cettolo et al. [3], we used
Google Translate to translate English patterns into Polish and Spanish, and expanded
them with target slot values.
– Zero-shot Learning – to test this scenario we trained an English model with
multilingual cased BERT on the English part of the Leyzer trainset and tested it on
the Polish and Spanish testsets.
We used accuracy to evaluate the performance of intent prediction. For slots, we used
the standard BIO structure to calculate the macro F1-score, which does not take label
imbalance into account. We used the evaluation metrics implemented in scikit-learn [10]
and provided in the NeMo evaluation script. Using this script, we tested the checkpoint
from each epoch, and the results for the ones that scored best on both the intent and
the slot level are presented in Table 4.
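The difference between macro and micro averaging matters for reading the slot scores, so a small Python sketch of both is given here. It restates the standard definitions from scratch (the same quantities that scikit-learn's `f1_score` reports with `average="macro"` and `average="micro"`) and is not the NeMo evaluation script. Treating every token label, including `O`, as a class is an assumption, and the label sequences are hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) over token-level labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    labels = sorted(set(gold) | set(pred))
    per_label = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_label.append(2 * tp[label] / denom if denom else 0.0)
    # Macro: unweighted mean over labels, so rare labels count as much as O.
    macro = sum(per_label) / len(labels)
    # Micro: pool all decisions before computing F1.
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return macro, micro


gold = ["O", "O", "O", "O", "O", "O", "B-SONG", "I-SONG", "B-ARTIST", "O"]
pred = ["O", "O", "O", "O", "O", "O", "B-SONG", "O", "O", "O"]
macro_f1, micro_f1 = f1_scores(gold, pred)
```

In this toy example two rare labels are missed entirely, which pulls the macro score well below the micro score even though most tokens are labeled correctly. A gap of this kind is what macro averaging is designed to expose.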
4.3 Results and Discussion
The Single-BERT models scored relatively low on both the intent and the slot level,
yielding 47%, 52% and 69% intent accuracy for English, Polish and Spanish,
respectively. We believe that the reason for that is the large number of intent classes
in our corpus, which was one of the motivations for creating it.
In order to give some perspective to our experiments, we trained the model on the
training part of the ATIS dataset with the same parameters as in the Single-BERT
scenario. When evaluated on the test part of ATIS, we obtained 97.31% on the intent
level and 55.23% on the F1-macro slot level (and 97.11% for F1-micro). Those results
suggest that easier problems, such as ATIS, can be easily learned by a Single-BERT
model.
Table 4. Results for NeMo models trained on various configurations of the Leyzer corpus.
Model Type        Language   Intent acc.   Slot F1 macro
Single-BERT       English    46.58         45.07
                  Polish     51.66         54.56
                  Spanish    68.88         67.79
Multi-BERT        English    62.80         76.48
                  Polish     64.17         74.83
                  Spanish    72.26         84.66
Train-on-target   Polish     41.67         40.70
                  Spanish    46.42         52.38
Zero-shot         Polish     13.82         15.39
                  Spanish    30.21         24.13
The Multi-BERT model scored better than the Single-BERT models on both the intent
and the slot level. We believe that the reason for this is that the multilingual model
had more data from which to learn how to separate intent classes and eliminate
inconsistencies. The presented results suggest that multilingual models might benefit
from joint learning on multiple languages, at least for problems formulated as in this
paper.
The train-on-target models scored low when compared to the Single-BERT models.
We think that the MT errors, especially in the most important components of the
sentence (usually verbs), led to a drastic performance drop. On the intent
classification level the accuracy for Polish and Spanish was, respectively, about 10.0
and 22.5 percentage points (19.3% and 32.6% relative) lower than the Single-BERT
baseline.
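The gaps to the Single-BERT baseline can be recomputed directly from the intent accuracies reported in Table 4. The short Python sketch below does this arithmetic and reports each gap both in percentage points and as a relative drop; the dictionary and variable names are illustrative only.

```python
# Intent accuracies (%) copied from Table 4.
single_bert = {"Polish": 51.66, "Spanish": 68.88}
train_on_target = {"Polish": 41.67, "Spanish": 46.42}

drops = {}
for language, baseline in single_bert.items():
    absolute = baseline - train_on_target[language]  # percentage points
    relative = 100 * absolute / baseline             # percent of the baseline
    drops[language] = (absolute, relative)

for language, (absolute, relative) in drops.items():
    print(f"{language}: -{absolute:.2f} points, -{relative:.1f}% relative")
```

Keeping the two measures apart avoids ambiguity: a drop of about 10 percentage points from a baseline near 52% corresponds to a relative drop of roughly 19%.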
The zero-shot scenario scored very low when compared to the Single-BERT or the
train-on-target experiments. A large number of intent classes, combined with different
slot values in each language, is a non-trivial problem, and, apparently, more
sophisticated methods are needed.
The results presented in this article may seem unsatisfactory, especially if we
compare them to other VA publications. However, it is noteworthy that a search for the
best architecture and parameters was not the aim of this work – we rather wanted to set
the baselines and to show the complexity of the MT problem for the proposed data. We
aimed to create a challenging corpus which can be a subject of future work, such as
the localization of VAs with the use of train-on-target and zero-shot learning
scenarios.
5 Conclusions and Future Work
In our work we introduced a new dataset, named Leyzer, designed to study multilingual
and cross-lingual NLU models and localization strategies in VAs. We also presented
results for models trained on our corpus, which can set the baseline for further work.
In the future we plan to extend our dataset to new languages and increase the number
of sentences per intent. Another line of work that we consider is to add follow-up
intents, as this would allow building a fully autonomous VA from our dataset.
The Leyzer dataset, the translation memories and the detailed experiment results
presented in this paper are available at
leyzer. We hope that this way we will foster further research in machine translation
for the virtual assistants.
6 Acknowledgements
We thank Małgorzata Misiaszek for her help in verifying the quality of our corpus and
improving its consistency.
References
1. Budzianowski, P., Wen, T.H., Tseng, B.H., Casanueva, I., Ultes, S., Ramadan, O., Gašić,
M.: MultiWOZ – a large-scale multi-domain wizard-of-Oz dataset for task-oriented
dialogue modelling. In: Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing. pp. 5016–5026. Association for Computational Linguistics, Brussels,
Belgium (2018)
2. Campagna, G., Ramesh, R., Xu, S., Fischer, M., Lam, M.S.: Almond: The architecture of
an open, crowdsourced, privacy-preserving, programmable virtual assistant. In: Proc. of the
26th International Conference on World Wide Web. pp. 341–350 (2017)
3. Cettolo, M., Corazza, A., De Mori, R.: Language portability of a speech understanding
system. Computer Speech & Language 12(1), 1–21 (1998)
4. Chen, Q., Zhuo, Z., Wang, W.: BERT for joint intent classification and slot filling (2019)
5. Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A., Leroy, D., Doumouro, C.,
Gisselbrecht, T., Caltagirone, F., Lavril, T., et al.: Snips voice platform: an embedded
spoken language understanding system for private-by-design voice interfaces. arXiv preprint
arXiv:1805.10190 (2018)
6. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proc. of the 3rd
International Conference on Learning Representations (ICLR 2015), San Diego, CA (2015)
7. Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., Kriman, S., Beliaev,
S., Lavrukhin, V., Cook, J., Castonguay, P., Popova, M., Huang, J., Cohen, J.M.: NeMo: a
toolkit for building AI applications using neural modules (2019)
8. Larson, S., Mahendran, A., Peper, J., Clarke, C., Lee, A., Hill, P., Kummerfeld, J.K., Leach,
K., Laurenzano, M., Tang, L., Mars, J.: An evaluation dataset for intent classification and
out-of-scope prediction. In: Proc. of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP 2019), Hong Kong, China (2019)
9. Liu, X., Eshghi, A., Swietojanski, P., Rieser, V.: Benchmarking natural language
understanding services for building conversational agents. arXiv preprint arXiv:1903.05566 (2019)
10. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research 12, 2825–2830 (2011)
11. Price, P.: Evaluation of spoken language systems: The ATIS domain. In: Proc. of the Speech
and Natural Language Workshop, Hidden Valley, PA (1990)
12. Schuster, S., Gupta, S., Shah, R., Lewis, M.: Cross-lingual transfer learning for multilingual
task oriented dialog. In: Proc. of the 2019 Annual Conference of the North American Chapter
of the Association for Computational Linguistics (NAACL-HLT 2019), Minneapolis, MN (2019)
13. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf,
R., Funtowicz, M., Brew, J.: Huggingface’s transformers: State-of-the-art natural language
processing. ArXiv abs/1910.03771 (2019)
... We use subsection of Leyzer [37] that consists of 5 VA domains with 51 intents and 29 slots. Subsection we selected consists of 5200 sentences total. ...
... To observe train dataset size impact on model knowledge interpretations we prepared multiple model versions each finetuned on dataset of increasing size. In this study we used patterns from Leyzer corpus [37] to automatically create extended version of dataset. ...
... Task Dataset CLINC150 (Larson et al., 2019) Redwood (Larson and Leach, 2022) GOOGLE-DSTC8 Leyzer (Sowański and Janicki, 2020) HINT3 (Arora et al., 2020) NLU Chatbot-Corpus (Braun et al., 2017) MultiWOZ BANKING77 (Casanueva et al., 2020) FEWSHOTWOZ (Peng et al., 2020) ATIS (Tur et al., 2010) Schema (Rastogi et al., 2019) CrossNER WNUT17 (Derczynski et al., 2017) NER CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) CoNLL-2004 (Carreras and Màrquez, 2004) IE OntoNotes (Weischedel et al., 2013) SCIERC (Luan et al., 2018) (Tomasello et al., 2022). SLURP is substantially larger and linguistically more diverse than previous SLU datasets. ...
... Przygotowany syntezator można także wykorzystać jako generator danych do problemu rozumienia mowy (ang. spoken language understanding), na przykład w połączeniu ze zbiorem Leyzer[57], który zawiera dane tekstowe do zagadnienia rozumienia języka (ang. natural language understanding).Możliwe jest także rozszerzenie eksperymentów związanych z syntezą mowy z wykorzystaniem przygotowanych danych. ...
Full-text available
Niniejsza praca skupia się na omówieniu zagadnienia syntezy mowy w języku polskim. W jej ramach stworzony został korpus nagrań mowy w języku polskim składający się z ponad 24 tysięcy próbek, co łącznie daje ponad 19 godzin nagrań. Przygotowany zbiór został dokładnie przeanalizowany oraz porównany z innymi dostępnymi zasobami tego typu pod kątem różnych właściwości istotnych ze względu na zastosowanie w syntezie mowy. Z wykorzystaniem analizowanych korpusów mowy, przygotowano rzeczywiste systemy syntezy mowy, których stabilność i jakość zostały porównane. Na podstawie tych badań wykazano, że korpus stworzony w ramach tej pracy sprawdza się najlepiej, ze wszystkich analizowanych, w zastosowaniu do budowy systemów syntezy mowy. Jego jakość potwierdzono dodatkowo za pomocą badania MOS, w którym syntezator zbudowany z wykorzystaniem przygotowanego zbioru danych oceniono na 4,11 w skali od 1 do 5 (rzeczywiste nagrania zostały ocenione na 4,23). Stworzony korpus zostanie udostępniony publicznie na otwartej licencji (CC0). Poza tym dokładnie omówione zostało samo zagadnienie syntezy mowy. W szczególności skupiono się na aktualnie stosowanych podejściach neuronowych. Opisano architektury modeli akustycznych Tacotron2 i TransformerTTS oraz wokoderów WaveRNN, MelGAN i SqueezeWave. Poruszono także złożony problem oceny jakości systemów syntezy mowy. Z wykorzystaniem przygotowanego korpusu nagrań mowy podjęto także próbę odtworzenia modelu TransformerTTS opisanego w artykule Neural Speech Synthesis with Transformer Network. Jednak ze względu na ograniczone zasoby obliczeniowe, próba ta została zakończona tylko częściowym sukcesem. Wytrenowany model wykazywał się niestabilnością ze względu na użycie w trakcie treningu zbyt małej wielkości porcji danych. Podjęto jednak próbę usprawnienia tego modelu, która zakończyła się zmniejszeniem wartości metryki CER z 0,200 do 0,156 bez znaczącego zwiększenia wymagań obliczeniowych.
... In this work we describe a case study using the Leyzer corpus [22], the largest open-source dataset for VA-oriented intent and domain utterance classification. We show how the performance of FedProx is affected by different levels of heterogeneous data settings in the context of natural language understanding (NLU). ...
Due to recent increased interest in data privacy, it is important to consider how personal virtual assistants (VA) handle data. The established design of VAs makes data sharing mandatory. Federated learning (FL) appears to be the most optimal way of increasing data privacy of data processed by VAs, as in FL, models are trained directly on users’ devices, without sending them to a centralized server. However, VAs operate in a heterogeneous environment – they are installed on various devices and acquire various quantities of data. In our work, we check how FL performs in such heterogeneous settings. We compare the performance of several optimizers for data of various levels of heterogeneity and various percentages of stragglers. As a FL algorithm, we used FedProx, proposed by Sahu et al. in 2018. For a test database, we use a publicly available Leyzer corpus, dedicated to VA-related experiments. We show that skewed quantity and label distributions affect the quality of VA models trained to solve intent classification problems. We conclude by showing that a carefully selected local optimizer can successfully mitigate this effect, yielding 99% accuracy for the ADAM and RMSProp optimizers even for highly skewed distributions and a high share of stragglers.Keywordsfederated learningFedProxnatural language understandingvirtual assistantsADAMPGD
Full-text available
In this paper, I argue that AI-powered voice assistants, just as all technologies, actively mediate our interpretative structures, including values. I show this by explaining the productive role of technologies in the way people make sense of themselves and those around them. More specifically, I rely on the hermeneutics of Gadamer and the material hermeneutics of Ihde to develop a hermeneutic lemniscate as a principle of technologically mediated sense-making. The lemniscate principle links people, technologies and the sociocultural world in the joint production of meaning and explicates the feedback channels between the three counterparts. When people make sense of technologies, they necessarily engage their moral histories to comprehend new technologies and fit them in daily practices. As such, the lemniscate principle offers a chance to explore the moral dynamics taking place during technological appropriation. Using digital voice assistants as an example, I show how these AI-guided devices mediate our moral inclinations, decisions and even our values, while in parallel suggesting how to use and design them in an informed and critical way.
This paper presents the architecture of Almond, an open, crowdsourced, privacy-preserving and programmable virtual assistant for online services and the Internet of Things (IoT). Included in Almond is Thingpedia, a crowdsourced public knowledge base of natural language interfaces and open APIs. Our proposal addresses four challenges in virtual assistant technology: generality, interoperability, privacy, and usability. Generality is addressed by crowdsourcing Thingpedia, while interoperability is provided by ThingTalk, a high-level domain-specific language that connects multiple devices or services via open APIs. For privacy, user credentials and user data are managed by our open-source ThingSystem, which can be run on personal phones or home servers. Finally, we address usability by providing a natural language interface, whose capability can be extended via training with the help of a menu-driven interface. We have created a fully working prototype, and crowdsourced a set of 187 functions across 45 different kinds of devices. Almond is the first virtual assistant that lets users specify trigger-action tasks in natural language. Despite the lack of real usage data, our experiment suggests that Almond can understand about 40% of the complex tasks when uttered by a user familiar with its capability.
We have recently seen the emergence of several publicly available Natural Language Understanding (NLU) toolkits, which map user utterances to structured, but more abstract, Dialogue Act (DA) or Intent specifications, while making this process accessible to the lay developer. In this paper, we present the first wide-coverage evaluation and comparison of some of the most popular NLU services, on a large, multi-domain (21 domains) dataset of 25K user utterances that we have collected and annotated with Intent and Entity Type specifications and which will be released as part of this submission. The results show that on Intent classification Watson significantly outperforms the other platforms, namely Dialogflow, LUIS and Rasa, though these also perform well. Interestingly, on Entity Type recognition, Watson performs significantly worse due to its low Precision. (At the time of producing the camera-ready version of this paper, we noticed the seemingly recent addition of a ‘Contextual Entity’ annotation tool to Watson, much like e.g. in Rasa. We’d therefore like to stress that this paper does not include an evaluation of this feature in Watson NLU.) Again, Dialogflow, LUIS and Rasa perform well on this task.
Attention-based recurrent neural network models for joint intent detection and slot filling have achieved a state-of-the-art performance. Most previous works exploited semantic level information to calculate the attention weights. However, few works have taken the importance of word level information into consideration. In this paper, we propose WAIS, word attention for joint intent detection and slot filling. Considering that intent detection and slot filling have a strong relationship, we further propose a fusion gate that integrates the word level information and semantic level information together for jointly training the two tasks. Extensive experiments show that the proposed model has robust superiority over its competitors and sets the state-of-the-art.
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
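To make the phrase "adaptive estimates of lower-order moments of the gradients" concrete, the following is a minimal sketch of the Adam update rule for a single scalar parameter. The function name `adam_minimize`, the toy objective f(x) = (x - 3)^2, and the chosen step count and learning rate are illustrative assumptions and are not taken from the abstract; the default values of `beta1`, `beta2` and `eps` follow commonly used settings.

```python
import math


def adam_minimize(grad, x0, steps=2000, lr=0.01,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient `grad`."""
    x = float(x0)
    m = 0.0  # running estimate of the first moment (mean of gradients)
    v = 0.0  # running estimate of the second raw moment (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: both estimates start at zero and are biased
        # towards zero during the first iterations.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # Step size is scaled per parameter by the root of the second moment.
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Hypothetical toy objective: f(x) = (x - 3)^2, with gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Because the step is divided by the root of the second-moment estimate, multiplying the gradient by a constant factor leaves the updates essentially unchanged, which is the rescaling invariance the abstract mentions.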
Abstract: Progress can be measured and encouraged via standards for comparison and evaluation. Though qualitative assessments can be useful in initial stages, quantifiable measures of systems under the same conditions are essential for comparing results and assessing claims. This paper will address the emerging standards for evaluation of spoken language systems. Introduction and Background: Numbers are meaningless unless it is clear where they come from. The evaluation of any technology is greatly enhanced in usefulness if accompanied by documented standards for assessment. There has been a growing appreciation in the speech recognition community of the importance of standards for reporting performance. The availability of standard databases and protocols for evaluation has been an important component in progress in the field and in the sharing of new ideas. Progress toward evaluating spoken language systems, like the technology itself, is beginning to emerge. This paper presents some background on the problem and outlines the issues and initial experiments in evaluating spoken language systems in the "common" task domain, known as ATIS (Air Travel Information Service). The speech recognition community has reached agreement on some standards for evaluating speech recognition systems, and is beginning to evolve a mechanism for revising these standards as the needs of the community change (e.g., as new systems require new kinds of data, as new system capabilities emerge, or as refinements in ex-
An important problem in automatic speech understanding is the transport of an existing application system to a new language. Design choices are required to keep the cost and implementation time of the porting as low as possible. One of the bottlenecks in spoken language system development is represented by data collection. When moving an application from the original language to a new one, it is very important to exploit, as much as possible, the data collected in the first language. This paper discusses the construction of a speech-based Air Travel Information Service (ATIS) for Italian, starting from (American) English. The aim of the work is to maximize the use of data available in English for the ATIS task in the construction of the Italian system. All components will be examined and proposed strategies will be evaluated by experimental tests. By just using a small Italian corpus and the ATIS data available in English, an understanding error rate of 6·7% was obtained on Italian written transcriptions. By adapting to speakers acoustic models and by training a language model on translations of transcriptions, an understanding error rate of 17·9% was obtained by considering read as well as spontaneous Italian sentences.