Exploring a Bilingual Next Word Predictor for a Federated Learning Mobile Application
This paper was downloaded from TechRxiv (https://www.techrxiv.org).
LICENSE
CC BY 4.0
SUBMISSION DATE / POSTED DATE
08-01-2022 / 14-01-2022
CITATION
Burgos, Natali Alfonso; Kiš, Karol; Bakarac, Peter; Kvasnica, Michal; Licitra, Giovanni (2022): Exploring a
Bilingual Next Word Predictor for a Federated Learning Mobile Application. TechRxiv. Preprint.
https://doi.org/10.36227/techrxiv.18058682.v1
DOI
10.36227/techrxiv.18058682.v1
Exploring a Bilingual Next Word Predictor for a Federated Learning Mobile
Application
N. Alfonso Burgos^a, K. Kiš^b, P. Bakarac^b, M. Kvasnica^b, G. Licitra^a
^a Neurocast B.V., Amsterdam, the Netherlands
^b Slovak University of Technology in Bratislava, Bratislava, Slovakia
Abstract
We explore a bilingual next-word predictor (NWP) under federated optimization for a mobile application. A character-based LSTM is server-trained on English and Dutch texts from a custom parallel corpus and used as the target performance. We simulate a federated learning environment to assess the feasibility of distributed training for the same model. The popular Federated Averaging (FedAvg) algorithm is used as the aggregation method. We show that the federated LSTM achieves decent performance, yet it remains sub-optimal, and we suggest possible next steps to bridge this performance gap. Furthermore, we explore the effects of language imbalance by varying the ratio of English and Dutch training texts (or clients). We show that the model upholds the performance of the balanced case up to an 80/20 imbalance before decaying rapidly. Lastly, we describe the implementation of local client training, word prediction and client-server communication in a custom virtual keyboard for Android platforms. Additionally, homomorphic encryption is applied to provide secure aggregation, guarding the user against malicious servers.
Keywords: Federated Learning, Bilingual Word Prediction, Character-based LSTM, Homomorphic Encryption
1. INTRODUCTION
In 2019, the Global Mobile Consumer Trends survey reported smartphones to be the most ubiquitous electronic device in developed markets: about 90% of consumers own one and use it daily [1]. What's more, text and instant messaging apps are consumers' favorites, stirring a fair share of efforts into creating smoother and faster user texting experiences, i.e., auto-complete and auto-correction features, word and emoji suggestions, in-app speech-to-text, etc. For the first time, access to continuous, uninterrupted and vast amounts of data is attainable, allowing AI-driven, data-hungry algorithms to power these applications which, in turn, promote more usage and more data to be created.
Yet, most of this data is private in nature. Data privacy protection laws in the European Union, the General Data Protection Regulation (GDPR), restrict the application of AI algorithms and dictate that personal data "[...] shall be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed" (data minimisation principle) [2]. Federated Learning (FL) eliminates the need for data collection altogether and helps develop GDPR-compliant AI systems. More specifically, FL refers to the approach of learning a task by a federation of user devices (clients) orchestrated by a central server [3]. This technique allows users to benefit from a shared global model, without the need to share data or to store it centrally. Oftentimes, data privacy methods like differential privacy [4, 5] or homomorphic encryption [6] are embedded in the system to further improve security. FL provides a promising alternative to server-based data collection and model training in commercial settings.

Corresponding author.
Email addresses: natali@neurocast.nl (N. Alfonso Burgos), karol.kis@stuba.sk (K. Kiš), peter.bakarac@stuba.sk (P. Bakarac), michal.kvasnica@stuba.sk (M. Kvasnica), giovanni@neurocast.nl (G. Licitra)
However, privacy doesn't come without costs and limitations. In commercial mobile keyboards, on-device training of language models is greatly limited. CPU usage, memory footprint, battery consumption and network bandwidth must be carefully considered when locally training and updating models in FL systems. Models are usually constrained to tens of megabytes in size to be able to run, even on high-end devices, while delivering predictions at a reasonable latency (within 20 milliseconds of an input event) [7]. More often than not, these limitations come at the cost of prediction accuracy, the size of vocabularies and multilingual capabilities.
In this paper, we investigate the possibility of a one-shot bilingual Next-Word Predictor (NWP) for a mobile application in a federated learning fashion. We simulate an FL environment where we train an RNN-based model from scratch on two languages (English and Dutch). We investigate the effects of language-specific sample imbalance on learning more than a single language.
Additionally, we provide the implementation details of the custom mobile application of our federated NWP. Here, model updates are performed on encrypted parameters using homomorphic encryption.
1.1. Related Work
Besides the data privacy benefits derived from FL by design, others also emerge in terms of resource consumption. Offloading computationally-intensive operations like the training of Deep Learning (DL) models to edge devices is an incredible perk. However, due to hardware and data transfer limitations, models must be kept small and concise. State-of-the-art Language Models (LMs) are mostly self-attention based, so-called transformers, with tens of millions of parameters, and many have multilingual capabilities (e.g., BERT is trained on 104 languages) [8]. Even after optimizing for size and inference latency with methods like network pruning, weight sharing, Knowledge Distillation, quantization, etc., models still remain on the heavy side [9]. They are suitable for on-device, offline inference, but not for FL settings.
Applications of FL to Natural Language Processing (NLP) tasks have gained considerable attention in recent years [10]. Access to real-world text data, and the decentralization of computation and storage, all while preserving privacy, acted as great incentives to this motive. Emoji prediction [11], query suggestions [12] and next-word prediction [13, 14, 7, 15] are instances of NLP tasks solved in FL environments (on-device and simulated), to name a few.
Here, we focus on next-word prediction in an FL setting. Like in [13], we train a character-level RNN-based language model, except with bilingual capabilities.
1.2. Federated Learning
FL is an approach to distributed machine learning that builds privacy into the infrastructure by design. It consists of a series of training rounds, where a central server coordinates the clients that participate in a round. Each client starts off with the same global model and computes a number of local updates to it, e.g., an epoch of mini-batch stochastic gradient descent. Model updates are performed using local, client-generated training data, which is never sent to or collected centrally in the server. After training is terminated, updates are communicated to the server. The server aggregates the contributions and incorporates them into the global model. Note that updates are ephemeral, and only live until immediately after aggregation. The new global model is sent back to the clients, along with the minimal information necessary for model training [3, 10, 12]. A graphical representation of FL is shown in Figure 1.
Aggregation methods have been actively researched in recent years (an exhaustive overview of existing methods can be found in Tables 1 and 2 in [10]). The FedAvg algorithm is currently the most widespread method in FL settings, reported in over 58 publications by the end of 2020 [10]. FedAvg updates the server-side model using the parameters (or weights) of the participating client models through a weighted average defined as follows

w_{t+1} = Σ_{k=1}^{K} (n_k / n) · w_{t+1}^{k}    (1)

where w_{t+1}^{k} are the model parameters of the k-th client, n_k is the number of local training samples of the k-th client, and n is the total number of samples across the K participating clients.

Figure 1: Federated Learning Diagram. (i) Initialize the global model and send it to the client devices; (ii) train the model on each client device with client data; (iii) each client sends the trained model back to the server once training is complete; (iv) once the number of client models for a training round is met, the server aggregates them; (v) the global model is updated and a new round of training begins.
In comparison to its predecessor, FedAvg has been shown to converge faster, reducing the number of training rounds by 90%, and to work surprisingly well when client models share the same initialization [3]. Because of its effectiveness and simplicity, we consider FedAvg as the FL algorithm.
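To make Equation (1) concrete, the minimal sketch below shows the FedAvg weighted average in Python; the function name and the list-of-arrays representation of a model are illustrative assumptions, not the paper's codebase.

```python
import numpy as np

def fedavg(client_weights, client_num_samples):
    """Weighted average of client model parameters (Equation 1).

    client_weights: one list of layer arrays per participating client (w^k_{t+1}).
    client_num_samples: local sample counts n_k, one per client.
    Returns the aggregated global parameters w_{t+1}.
    """
    n = float(sum(client_num_samples))
    new_global = [np.zeros_like(layer) for layer in client_weights[0]]
    for w_k, n_k in zip(client_weights, client_num_samples):
        for i, layer in enumerate(w_k):
            new_global[i] += (n_k / n) * layer
    return new_global

# Toy usage: three clients, a single 2x2 "layer" each.
clients = [[np.full((2, 2), v)] for v in (1.0, 2.0, 3.0)]
samples = [10, 30, 60]
print(fedavg(clients, samples)[0])   # weighted mean, biased toward the largest client
```

Clients with more local data pull the average toward their parameters, which is what makes FedAvg robust to sample-size imbalance, as discussed in Section 3.3.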
Lastly, and more generally, FL is a distributed approach to an optimization problem. The implicit problem to solve in an FL setting was coined by [3] as federated optimization. Alongside communication constraints and many other practical issues, they emphasized the unbalanced and non Independent and Identically Distributed (IID) nature of data distributions in federated optimization. FedAvg applications have been shown to be robust under these conditions. In the present work, we are especially interested in non-IID distributions, not of target symbols, but across input languages. Statistical distributions of characters from English and Dutch texts are comparable, yet the effect of unequally distributed, language-specific texts across clients on performance accuracy and speed of convergence is yet to be determined.
1.3. Homomorphic Encryption
An alternative is to provide the central averaging server with a secure aggregation mechanism that is capable of computing, at each iteration t, the average model w_t in Equation (1) without explicit knowledge (and hence interpretation) of the individual client model parameters w_t^k. This can be achieved by employing the concept of homomorphic encryption [16, 17]. In short, homomorphic encryption allows mathematical operations to be performed over ciphers of plaintext data. Consider an encryption mechanism E: N_0 → N_0 and a decryption function D: N_0 → N_0, whereby c = E(x) is the cipher of a non-negative integer x, and D(c) = x is the decryption operation. Naturally, D(E(x)) = x. It is important to note that the functions E and D are asymmetric, in the sense that encryption is performed using a public key that allows any interested party to generate the cipher E(x), yet the decryption key is kept private, granting permission only to its legitimate owner to apply the decryption operation D(E(x)).
Remark 1. Although the encryption and decryption functions are defined for non-negative integers, they can easily be extended to support floating-point numbers by first shifting them to the range [0, ∞) and subsequently converting them to integers, e.g., by applying quantization. In what follows we assume that such a conversion, denoted by i(x), was performed whenever x is a floating-point number.
Unlike conventional encryption methods, homomorphic encryption allows operations to be performed over ciphers. Take two non-negative integers as an example, say a and b. Then the encryption/decryption functions E and D exhibit a homomorphic property if, for some operation ∘ ∈ {+, −, ∗, /} (where '+' stands for addition, '−' is subtraction, '∗' denotes multiplication, and '/' represents division), we have that D(E(a) ∘ E(b)) = a • b for some • ∈ {+, −, ∗, /}. If, for instance, we have ∘ = ∗ and • = +, the product of two ciphers yields the encrypted sum of the plaintexts, i.e., E(a) ∗ E(b) = E(a + b), which, after decryption D(E(a + b)), gives a + b. The obvious advantage is that one can calculate the sum a + b based on the ciphers E(a) and E(b) without having to decrypt them in the first place.
Various homomorphic encryption algorithms (i.e., functions E and D) exist, differing in the subset of mathematical operations ∘ and • they support. The so-called fully homomorphic schemes, such as SEAL [18], allow any of the operations {+, −, ∗, /} to be performed on ciphers. On the other hand, partially homomorphic algorithms are only capable of performing a subset of operations. For instance, the popular RSA asymmetric encryption algorithm [19] only provides the multiplicative homomorphic property, i.e., E(a) ∗ E(b) = E(a ∗ b). For the purpose of applying homomorphic encryption to achieve secure central averaging, the Paillier [20] and Benaloh [21] cryptosystems are beneficial, as both allow the encrypted sum E(a + b) to be calculated by multiplying the individual ciphers E(a) and E(b). In addition, the multiplication of a cipher E(x) by a public (i.e., not encrypted) constant m can be achieved by E(x)^m. Then, calculating (a + b)/2 (i.e., the average of a and b) can be achieved by considering their encrypted counterparts E(a) and E(b) and performing (E(a) ∗ E(b))^{i(1/2)}, where i(1/2) is a suitable conversion of the floating-point number to an integer.
The application of the Paillier or Benaloh partially homomorphic encryption as a secure way of performing the central averaging of model parameters is self-evident. Instead of computing the average model w_t in Equation (1) using the plain-text values of w_t^k, each participating client encrypts its model parameters, i.e., generates E(w_t^k), prior to sending them to the server. The server then exploits the homomorphic property that E(1/n · Σ_k w_t^k) = (Π_k E(w_t^k))^{i(1/n)}; namely, it calculates the average of the encrypted model parameters while being unable to decrypt them (since it does not possess the decryption keys). The encrypted value of the average is then sent back to the clients, which are able to recover the plain-text value of the average using their respective private keys. In such a way, privacy of the proposed method against malicious or dishonest central averaging servers is achieved.
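As an illustration of the additive homomorphism used above, the toy sketch below implements a textbook Paillier cryptosystem in pure Python and lets an untrusted server sum two quantized, encrypted client values without decrypting them. The tiny key size, the fixed-point scale, and dividing by the number of clients on the client side (rather than exponentiating by i(1/n) on the server) are simplifying assumptions for illustration; this is not the scheme parameters or code used in the paper's implementation.

```python
import math
import random

def paillier_keygen(p=10007, q=10009):
    # Toy primes for illustration only; real deployments use >= 2048-bit moduli.
    n = p * q
    g = n + 1                                            # common simplification g = n + 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
    # mu = (L(g^lam mod n^2))^-1 mod n, with L(u) = (u - 1) / n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (n, lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    n, lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

def add_encrypted(pub, c1, c2):
    # Additive homomorphism: E(a) * E(b) mod n^2 = E(a + b)
    n, _ = pub
    return (c1 * c2) % (n * n)

pub, priv = paillier_keygen()
scale = 1000                                    # i(x): shift and fixed-point quantization
w_client1, w_client2 = 0.231, -0.117            # two clients' values of one model weight
c1 = encrypt(pub, round((w_client1 + 1.0) * scale))
c2 = encrypt(pub, round((w_client2 + 1.0) * scale))
c_sum = add_encrypted(pub, c1, c2)              # server side: sums without decrypting
avg = decrypt(priv, c_sum) / (2 * scale) - 1.0  # client side: decrypt, rescale, average
print(round(avg, 3))                            # ~ (0.231 - 0.117) / 2 = 0.057
```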
1.4. Language Model
RNN models have been extensively used for language modeling [22]. Most work, however, is done on Long Short-Term Memory (LSTM) models, a type of RNN that overcomes the inability of vanilla RNNs to retain long-range dependencies. LSTMs use purpose-built, gated memory cells that store historical information of a sequence and ensure correct propagation of information through many time steps [23, 24, 25].
Albeit deterministic, LSTMs can be used to learn probability distributions over a sequence of language symbols. The joint probability over symbols is defined using the chain rule,

Pr(S = {s_1, ..., s_N}) = Π_{i=1}^{N} Pr(s_i | s_1, ..., s_{i−1})    (2)

where the current symbol is conditioned on the previous ones, and where the context of previous symbols is encoded in the hidden states of the LSTM. The conditional probability Pr(s_i | s_1, ..., s_{i−1}) is, therefore, a multinomial distribution parameterized by a softmax function.
For LMs, the goal is to maximize the log-likelihood of a given sequence of symbols by optimizing the cross-entropy (CE) between the predicted and the target probabilities. However, computing the joint probability over a very large number of symbols becomes prohibitively slow during training [26, 27]. Character-level language modelling is an interesting alternative to word-level or subword-level next-word prediction. Not only is the set of symbols much smaller in size, but it also allows for the inclusion of Out-Of-Vocabulary (OOV) words.
To this end, we employ a character-level LSTM with multiple layers to generate the next word.
Figure 2: Language Character Distributions. Character frequency distributions of (a) English and (b) Dutch texts in the training set.
2. METHODOLOGY
2.1. Dataset
The dataset is a collection of parallel corpora^1 (English-Dutch) accessible from the OPUS open parallel corpus website^2. Corpora come in Moses format^3 and proceed from various sources:
OpenSubtitles (37.2M): a collection of translated movie subtitles from OpenSubtitles.org^4 [28].
TED2020 (319.9K): a crawl of nearly 4000 TED and TEDx talk transcripts from July 2020. The transcripts were translated by a global community of volunteers into more than 100 languages [29].
Wikipedia (797.1K): a corpus of parallel sentences extracted from Wikipedia by K. Wołk and K. Marasek [30].
QED (411.3K): an open multilingual collection of subtitles for educational videos and lectures, collaboratively transcribed and translated on the AMARA web-based platform [29].
Books (38.6K): a collection of copyright-free books aligned by A. Farkas^5 [29].
Corpora are stripped of HTML tags, URLs, citations, titles, subtitles, etc., and the remaining clean text is split into sentences (one per line). The result is 69.7M lines of text data. Character frequency distributions of both languages are comparable (Figure 2). There are 76.9 ± 69.3 characters per sentence in English, and 92.5 ± 72.3 in Dutch. Sentence length variability is large and right-skewed.
2.2. Data Partition and Distribution
We hold out 5% of each parallel corpus as the test set. Namely, parallel sentences pertain to the same set to avoid possible lexical leakage. This results in 69.7M training lines and 7.7M testing lines. Note that both sets contain the same number of Dutch and English sentences.
Provided that the data is not user-tagged (at least not all of it), training samples are sharded at random and unequally into K client shards to simulate an unbalanced and non-IID client dataset. This is performed for every language in the dataset. Each client shard is further partitioned into R equally-sized parts to mimic the local client cache consumed in every training round (Figure 3); see the sketch after the footnotes below.
To study the impact of language imbalance on federated optimization, we vary the fraction F of language-specific clients. More specifically, F is the fraction of English-speaking clients (EN clients) in the training data, and thus 1 − F is the fraction of Dutch-speaking clients (NL clients). This results in K × F EN clients and K × (1 − F) NL clients (Figure 8 in the Appendix).
^1 A parallel corpus is a large and structured set of translated texts between two languages.
^2 https://opus.nlpl.eu/
^3 Two aligned plain-text files.
^4 http://www.opensubtitles.org/
^5 Available at http://www.farkastranslations.com/bilingual_books.php
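A minimal simulation of the client/cache sharding described above, assuming the corpus has already been split into per-language sentence lists; the function and variable names are illustrative, not the paper's code.

```python
import random

def shard_clients(sentences, num_clients, num_rounds, seed=0):
    """Randomly and unequally shard sentences into client sets, then split
    each client's set into equally-sized per-round caches."""
    rng = random.Random(seed)
    rng.shuffle(sentences)
    # Unequal client sizes: random cut points over the sentence list
    # (assumes len(sentences) >> num_clients).
    cuts = sorted(rng.sample(range(1, len(sentences)), num_clients - 1))
    shards = [sentences[a:b] for a, b in zip([0] + cuts, cuts + [len(sentences)])]
    # Each client shard is split into R equally-sized round caches.
    caches = []
    for shard in shards:
        per_round = max(1, len(shard) // num_rounds)
        caches.append([shard[r * per_round:(r + 1) * per_round]
                       for r in range(num_rounds)])
    return caches   # caches[k][r] = local cache of client k at round r

# Example: F = 0.8 gives 480 EN clients and 120 NL clients out of K = 600.
K, R, F = 600, 1000, 0.8
# en_caches = shard_clients(en_sentences, int(K * F), R)
# nl_caches = shard_clients(nl_sentences, K - int(K * F), R)
```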
Figure 3: Client Cache Sharding. The client dataset consists of 69.7M English and Dutch training sentences (yellow and green), found in equal parts, partitioned into K unequally-distributed client sets (yellow). Note that a set {k_i}^ℓ_{i=1:K} exists per language ℓ. In turn, the local cache is simulated by sharding each client k_i ∈ {k_i}^ℓ_{i=1:K} into R equally-sized parts (blue). Each client local cache k_i^r ∈ {k_i^r}^ℓ_{r=1:R} is consumed by client model i at round r. Glossary: number of languages (L), number of clients (K), number of training rounds (R).
2.3. Data Processing
Given that the Dutch and English languages employ the same (Latin) alphabet, the size of the symbol set M = |S| (or the vocabulary size) is reduced to M = 69 unique characters: lowercase Latin characters (26), numbers (10), punctuation symbols (32), white-space and the <UNK> special tag. Characters in sentences are standardized to lower-cased ASCII characters, and unknown characters are replaced with <UNK>. Sentences are truncated at a length of 100 characters, as this has repeatedly shown the best results. Shorter sentences are pre-padded with zeros, as suggested in [31]. Lastly, characters are replaced with their index representations. Indices are obtained by enumerating the set S, such that f: S → J, where J = {j_1, ..., j_M} is the index set and f is the particular enumeration of S.
Sentences are fed to the model as sequences of one-hot character vectors. Namely, every character index j is represented by a zero-valued M-dimensional vector whose element at index j equals one. A language code is concatenated to the character vector. This is encoded as a binary vector of length ⌈log_2(ℓ)⌉, where ℓ is the number of languages. In a bilingual setting, ⌈log_2(ℓ)⌉ = 1. The resulting input vector is of size M + ⌈log_2(ℓ)⌉ = 69 + 1 = 70.
Last but not least, targets are built by forward-shifting the input sequences. Essentially, any given character acts as the target of the previous character in the sequence.
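A minimal sketch of this encoding (one-hot characters plus a single language bit and forward-shifted targets); the exact symbol inventory below is an illustrative assumption (string.punctuation happens to give 32 symbols, while the paper reports M = 69 in total), and the helper names are not the paper's preprocessing code.

```python
import string
import numpy as np

# Illustrative vocabulary: lowercase letters, digits, punctuation, space, <UNK>.
VOCAB = list(string.ascii_lowercase) + list(string.digits) \
        + list(string.punctuation) + [" ", "<UNK>"]
CHAR2IDX = {c: i for i, c in enumerate(VOCAB)}
M, MAX_LEN = len(VOCAB), 100

def encode(sentence, lang_bit):
    """Return (inputs, targets): inputs are one-hot vectors of size M + 1
    (character + language code), targets are the forward-shifted indices."""
    idx = [CHAR2IDX.get(c, CHAR2IDX["<UNK>"]) for c in sentence.lower()[:MAX_LEN]]
    pad = MAX_LEN - len(idx)
    x = np.zeros((MAX_LEN, M + 1), dtype=np.float32)   # pre-padded rows stay all-zero
    x[np.arange(pad, MAX_LEN), idx] = 1.0              # one-hot character
    x[:, M] = lang_bit                                 # language code (0 = EN, 1 = NL)
    # Forward-shift: each character is the target of the previous one.
    y = np.full(MAX_LEN, CHAR2IDX[" "], dtype=np.int64)
    y[pad:MAX_LEN - 1] = idx[1:]
    return x, y

x, y = encode("hello world", lang_bit=0)
print(x.shape, y.shape)   # (MAX_LEN, M + 1) inputs and (MAX_LEN,) targets
```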
2.4. Model Architecture
The model is a multi-layer LSTM, with six (6) stacked LSTM layers of 128 hidden neurons each. It has over 771K parameters and is 3 MB in size. If weights are quantized, the model size drops as low as 0.8 MB. Very wide models (1024+ neurons per layer) with about half the depth performed better in some instances, but at many times the size. A deep but moderately wide model results in a more convenient trade-off.
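A possible PyTorch definition matching the stated architecture (six stacked LSTM layers of 128 units over the 70-dimensional input, with a linear head over the 69 output symbols); class and layer names are illustrative assumptions, not the paper's actual model code.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level next-symbol model: 6 stacked LSTM layers, 128 units each."""

    def __init__(self, input_size=70, hidden_size=128, num_layers=6, vocab_size=69):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)          # (batch, seq, hidden)
        return self.head(out), state              # logits over the 69 symbols

model = CharLSTM()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e3:.0f}K parameters")        # on the order of the reported ~771K
```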
2.5. Federated Optimization
The multi-layer LSTM is at the core of the federated next-word predictor. The Federated LSTM, or F-LSTM, trains a global LSTM model distributively over K clients, where only a fraction C of clients is considered in each training round by the FedAvg algorithm. The global model is trained for a maximum of R rounds. Clients update the global model after a training round is closed by averaging their model parameters (see Equation 1). Local training at client nodes consists of E epochs of n_k/B mini-batch stochastic gradient descent (SGD) optimizations with momentum, where n_k is the number of local samples generated by the k-th client and B is the size of the mini-batch. Momentum is computed following Nesterov's accelerated gradient method [32]. Momentum-based optimizers have been shown to work best on fully non-IID datasets in federated learning settings [33]. Additionally, to avoid exploding gradients in backpropagation through time (BPTT), gradients are clipped when their L2-norm exceeds a maximum value of 2. This is meant to increase the robustness of model convergence.
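One simulated training round, as just described, might look like the sketch below (client sampling, E local epochs of Nesterov SGD with gradient clipping, then the weighted average of Equation 1 over the returned parameters). The function names and the serial loop are illustrative simplifications; details such as F-ratio-preserving client sampling are omitted.

```python
import copy
import random
import torch

def local_train(global_model, cache_loader, lr, epochs=1, momentum=0.9):
    """E epochs of Nesterov mini-batch SGD on one client's round cache."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum, nesterov=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    n_k = 0
    for _ in range(epochs):
        for x, y in cache_loader:
            opt.zero_grad()
            logits, _ = model(x)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
            opt.step()
            n_k += x.size(0)
    return model.state_dict(), n_k

def run_round(global_model, client_loaders, lr, participation=0.1):
    """One FedAvg round: sample C*K clients, train locally, weighted-average."""
    selected = random.sample(client_loaders, int(len(client_loaders) * participation))
    updates = [local_train(global_model, loader, lr) for loader in selected]
    n = sum(n_k for _, n_k in updates)
    new_state = {key: sum((n_k / n) * sd[key] for sd, n_k in updates)
                 for key in updates[0][0]}
    global_model.load_state_dict(new_state)
```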
In FedAvg, client and server learning rates can be decoupled by formulating the server update as if applying SGD to the "pseudo-gradient" Δ_t = Σ_{i=1}^{K} (w_t^i − w_t) [34]. This formulation is often used to increase the degrees of freedom of the system for better convergence rates. However, as mentioned in [34], setting the client learning rate has a greater impact on convergence. For that reason, and for simplicity's sake, we opt not to decouple the learning rates (i.e., η_server = 1.0) and tune a single (client) learning rate η. The client learning rate is decayed by

η = η_0 / (1 + t_c)    (3)

where η_0 is the initial value of η, and t_c is the number of times η has been decayed. The counter t_c increases by one unit if the following condition is met

ℓ_t ≥ min_{j<t} ℓ_j − Δ    (4)

where ℓ_t is the average training loss in round t, and min_{j<t} ℓ_j is the best average training loss (relative to Δ) across an arbitrary number of previous rounds [35, 36].
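The decay rule in Equations (3) and (4) can be sketched as a small server-side scheduler. The default Δ and the 5-round patience follow the settings reported in Section 3.1; the class and attribute names are illustrative assumptions.

```python
class PlateauDecay:
    """Decay eta = eta0 / (1 + t_c) when the round loss stops improving by delta."""

    def __init__(self, eta0=1.5, delta=1e-3, patience=5):
        self.eta0, self.delta, self.patience = eta0, delta, patience
        self.t_c = 0              # number of decays applied so far
        self.best = float("inf")  # best average train loss seen so far
        self.bad_rounds = 0

    def step(self, round_loss):
        if round_loss >= self.best - self.delta:     # Equation (4): no progress
            self.bad_rounds += 1
            if self.bad_rounds >= self.patience:
                self.t_c += 1                        # decay once per plateau
                self.bad_rounds = 0
        else:
            self.bad_rounds = 0
        self.best = min(self.best, round_loss)
        return self.eta0 / (1 + self.t_c)            # Equation (3)

sched = PlateauDecay()
for loss in [3.1, 3.0, 2.99, 2.99, 2.99, 2.99, 2.99, 2.99]:
    eta = sched.step(loss)
print(eta)   # 0.75 after the first decay
```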
2.6. Prediction Sampling
During inference, the next character is picked based on the joint probability over characters defined in Equation 2. Rather than selecting the character with the highest probability, one is drawn from the top-3 most probable characters, with the intention of adding a higher degree of flexibility to the language model. This is a customary strategy in language modeling (see Figure 4 for an example).
Figure 4: Top-k Prediction Sampling. The model samples from the top-k most probable symbols at every time step t.
The sampling process begins with a character that acts as a start token. Start tokens can also be a sequence of characters. At each step t, the model samples a character based on Pr(s_t | s_0:t−1). This is done iteratively until a stop symbol is encountered (e.g., white-space, <UNK>, <PAD>, <EOS>) or the sequence reaches a maximum number of characters.
During testing, this process is applied to every testing
sentence independently. That is, hidden cell states are
initialized at the beginning of every sentence prediction.
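A minimal top-k sampling loop in the spirit of the description above, assuming a model with the interface of the earlier CharLSTM sketch and a user-supplied encode_step helper that turns a character id into the 70-dimensional input vector; both of these, and the function name, are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_word(model, start_ids, encode_step, k=3, max_chars=20,
                stop_ids=frozenset()):
    """Iteratively draw the next character from the top-k most probable symbols."""
    state, generated = None, []
    ids = list(start_ids)
    for _ in range(max_chars):
        x = encode_step(ids[-1]).unsqueeze(0).unsqueeze(0)   # (1, 1, input_size)
        logits, state = model(x, state)
        probs = torch.softmax(logits[0, -1], dim=-1)
        top_p, top_i = probs.topk(k)
        next_id = top_i[torch.multinomial(top_p / top_p.sum(), 1)].item()
        if next_id in stop_ids:                              # e.g. space, <UNK>, <EOS>
            break
        generated.append(next_id)
        ids.append(next_id)
    return generated
```

Carrying the LSTM state across steps keeps the sampled characters conditioned on the whole prefix, matching the chain rule in Equation (2).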
2.7. Performance Evaluation
The quality of the federated model is tested against the hold-out set described in Section 2.2. This is carried out after every training round is closed. Client models are not tested against the hold-out set: simulations are run serially on a single-GPU machine, and thus testing client models becomes highly time-consuming. Nonetheless, their training performance is reported.
At character-level prediction, we are interested in assessing the learning process in relation to a target performance. The Central LSTM, or C-LSTM, is an LSTM trained centrally on a server, where optimization strategies like training for multiple epochs, online validation, early stopping, adaptive learning rate methods and scheduling, etc. can be applied out-of-the-box. C-LSTM yields a Cross Entropy (CE) loss of 1.37, and an overall/top-3 accuracy of 77.61% and 88.42%, respectively. We use C-LSTM as the target performance for F-LSTM training. Note that C-LSTM and F-LSTM share the same model definition. In addition, we report accuracy, precision, recall and bits-per-character (BPC).
3. EXPERIMENTS
3.1. Setup
We simulate an F-LSTM following the methodology described above in a two-stage process: (A) the optimization of the system, which comprises the hyper-parameter tuning of the federated system to achieve the best performance and speed of convergence possible, and (B) experimentation with various ratios of EN/NL clients, or values of F, and their effect on performance and convergence. In this series of simulations, F takes on the following ratios: F = {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. Note that the default language imbalance of the training data is F = 0.5. In any case, client training and the evaluation of the global model are executed concurrently and independently.
Some parameter settings were shared across all experiments to adapt to hardware and time constraints. We simulate K = 600 clients for a maximum of R = 1000 rounds with a client participation ratio of C = 0.1. Namely, 10% of clients, selected at random, are allowed to participate in every training round^6. The learning rate η is decayed following Equation 3 if no progress is made (relative to Δ = 0.001 in Equation 4) after 5 consecutive training rounds. Early stopping is implemented under the same criterion, except that training is interrupted after 15 consecutive rounds. The same seed is used for all experiments.
3.2. Environment
We employ an AWS EC2 instance (p2.xlarge) equipped with a single NVIDIA Tesla K80 GPU with 12 GiB of memory, 4 vCPUs, 61 GiB of RAM and 100 GB of EBS-backed storage. Client simulations were run serially given that we are limited to a single GPU instance.
3.3. Results
                      C-LSTM               F-LSTM
Performance Metric    EN        NL         EN        NL
CE Loss               1.44      1.50       2.69      2.61
Accuracy              87.21 %   68.12 %    66.02 %   67.50 %
Top-3 Accuracy        93.54 %   81.75 %    70.52 %   72.82 %
Precision             87.21 %   75.54 %    66.02 %   67.50 %
Top-3 Precision       31.18 %   27.96 %    23.50 %   24.27 %
Recall                87.21 %   68.12 %    66.02 %   67.50 %
Top-3 Recall          93.54 %   81.75 %    70.52 %   72.82 %
BPC                   2.08      2.16       17.57     17.05

Table 1: Table of Results. Performance metrics of the best performing F-LSTM and of C-LSTM with F = 0.5. Results are reported for English (EN) and Dutch (NL) separately.
Table 1 shows various goodness-of-fit measurements of F-LSTM and C-LSTM for the English and Dutch languages, separately. A performance gap stands out between centralized and federated optimization of our NWP model.
^6 Note that this sampling is performed keeping the ratio F unaltered.
Figure 5: Train & Test CE Loss over Training Rounds. CE loss of the best performing F-LSTM for the English (EN) and Dutch (NL) languages, separately. Top: train CE loss across clients per round; dashed lines report the average train CE loss per round. Bottom: test CE loss per round; the overall test CE loss is outlined in solid red.
Hyper-parameter   Value
K                 600
C                 0.1
R                 1000
R_t               42
E                 1
B                 16
η_0               1.5
η_t               0.15
m                 0.9

Table 2: Hyper-parameter Setting. Parameter setting of the best performing F-LSTM for F = 0.5. Glossary: (K) number of clients, (C) fraction of participating clients, (R) number of maximum rounds, (R_t) convergence round, (E) number of epochs, (B) batch size, (η_0) initial learning rate, (η_t) convergence learning rate, (m) momentum hyperparameter.
C-LSTM outperforms its federated counterpart by 13.98% in overall accuracy. Complementary to Table 1, the train and test CE loss progression over training rounds is illustrated in Figure 5. It shows that, on average, F-LSTM converges to a test CE loss of 2.82 and a top-3 accuracy of 76.2%. Convergence is achieved at round R_t = 42 with learning rate η_t = 0.15. Hyper-parameter settings are detailed in Table 2. These settings return the best model performance among many experimental iterations. In every case, exponential learning rate decay and momentum lead to a faster and more stable convergence rate, respectively. Moreover, Table 1 shows high rates of top-3 accuracy and recall (70%-95%) in both optimization cases, while precision rates remain noticeably lower (20%-35%). In a multi-class classification problem, high recall and low precision suggest that there is a preference for the dominant classes (whitespace, e, a, ...; see Figure 2), and thus that the model is under-fitting the hold-out set. However, when considering the overall distribution, the model is more precise, yet at the cost of accuracy.
In terms of model diagnostics, Figure 5 discloses the relation between train and test CE loss. They go hand-in-hand over training rounds, suggesting that F-LSTM either (1) needs further training to reach the target performance, or (2) has converged to a sub-optimal local minimum of the global cost function.
Figure 4 in [13] showcases the CE loss of a character-level NWP over 3000 training rounds during a live client evaluation of various FL settings. Already in the first rounds, it becomes evident that FedAvg plateaus, showing hardly any further improvements. Yet, very high recall and precision rates were reported. It is well known that CE loss is sensitive to label imbalance (e.g., [37]). Language character datasets for NWP are naturally and purposely imbalanced, as depicted in Figure 2. CE loss penalizes unlikely events, even if these events are legitimately so. Hence, even under a similarly large CE loss, many more training rounds could lead to qualitatively better models.
Alternatively, and most likely, F-LSTM has converged to an inferior solution. Theoretical understanding of the (non-)convergence of real-world applications of FedAvg is, to date, a work in progress. Despite many efforts to tackle or minimize sub-optimality in FedAvg ([38, 35, 39, 40, 36] to name a few), there is no theoretical guarantee that it consistently improves upon mini-batch SGD local updates for non-convex objectives and heterogeneous (non-IID) settings. This is exacerbated by practicalities of FL networks, for instance, the number and selection strategy of effective devices contributing to training [36, 35].
Empirically, FedAvg has been shown to be a promising estimator of the global objective [3]. Nonetheless, the performance gap between centralized and distributed training with FedAvg is particularly large for NLP tasks. This was pointed out by [38], where the performance of various distributed NLP tasks with different FL algorithms is compared to their centralized training. They report large performance gaps in accuracy across FL algorithms (Table 2 in [38]). They also noted that FedOpt [41], where the client (ClientOpt) and server (ServerOpt) optimizers are decoupled and ClientOpt is AdamW, shows the best accuracy across the board. This is in congruence with our centralized training: adaptive optimizers worked best in comparison to vanilla and momentum SGD. Lastly, they studied the impact of different degrees of data heterogeneity in terms of label (statistical heterogeneity, non-IIDness) and quantity (sample imbalance) distribution. They showed
that statistical heterogeneity has important consequences for the convergence of FedAvg. Yet, for a uniform distribution of labels across clients, the convergence remains sub-optimal for some NLP tasks (particularly, for Language Modeling tasks).
Figure 6: Test CE loss of F-LSTM with different language ratios. Test CE loss of F-LSTM for F = {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. F-LSTM is tested against the overall (left), English-only (upper right) and Dutch-only (bottom right) hold-out sets.
On the one hand, it is generally preferred to offload computation to client nodes, i.e., adding more local SGD updates per round, and to reduce the number of communication rounds needed to reach a target performance. This is achieved by decreasing B, increasing E or C, or any combination of the three. Yet, allowing for more local SGD steps per round directs the convergence of FedAvg towards an inferior optimum in non-IID settings. This is known as "client drift", and it refers to the notion that local objectives may differ drastically from the global objective. Thus, over-fitting local objectives can result in biased global model updates. Figure 7, in the Appendix, illustrates the degree of statistical heterogeneity (Jensen-Shannon distance) among participating clients in the first round. Overall, clients show low statistical heterogeneity, except for a few highly heterogeneous pairs of clients (in blue). We believe that these clients may be driving the drift. Mitigation strategies such as regularization [39, 40], learning rate decay schemas [36] or control variates [42, 43] can help minimize client drift, and in some cases, drive FedAvg to optimal convergence. All the same, in the stochastic non-convex case, convergence is not guaranteed.
We set E = 1, B = 16, C = 0.1 and R = 1000 in an attempt to find an optimal trade-off between client drift and computational efficiency, as suggested in [3]. These settings returned K × C = 60 clients per round, and 3.39 ± 2.2 local steps on average (61.84 ± 35.11 average samples per round). Larger batch sizes, B > 16, cause FedAvg to plateau or diverge. This is in line with the intuition that smaller local step sizes have a regularizing effect while increasing gradient variance [44, 36], in turn mitigating client drift. At the risk of over-fitting, increasing E > 1 instead accelerated convergence to the same sub-optimal point. As pointed out in [35], as long as we decay the learning rate, FedAvg can converge to the optimum even if E > 1. The same holds for C: FedAvg converges to the same point whatever the value of C, which allowed for low participation and quicker training rounds. However, for large enough C, the convergence is faster (similarly to [35]).
In contrast, FedAvg is known to be robust against sample size imbalance. Recall that FedAvg computes the weighted sum of client models, where the weights are the normalized numbers of training samples, n_k/n. This allows clients to contribute to the global model proportionally to their local training data. In [38], they show that a skewed sample size distribution over clients (drawn from a Dirichlet distribution for a range of β concentration factors) does not pose a great challenge to FedAvg. This was the case for F-LSTM, too.
Ratio of EN clients (F)   Accuracy EN   Accuracy NL
0.5 (50 %)                66.02 %       67.05 %
0.6 (60 %)                65.95 %       66.94 %
0.7 (70 %)                67.77 %       66.94 %
0.8 (80 %)                66.01 %       67.50 %
0.9 (90 %)                 5.39 %        4.90 %
1.0 (100 %)                5.39 %        5.38 %

Table 3: Language Imbalance Results. Accuracy of the best performing F-LSTM for each ratio F of English clients. Results are reported for English (EN) and Dutch (NL) separately.
Last but not least, we repeatedly increase the ratio of English training clients by 10% and observe the effect on the convergence and performance of F-LSTM. Model accuracy for every case is collected in Table 3, and the test CE losses over training rounds are depicted in Figure 6. The latter shows a decline in model performance (increase in overall CE loss) proportional to F (left). To further assess this behaviour, we tested F-LSTM against each language set independently (bottom right: NL CE loss; top right: EN CE loss). As expected, NL CE loss monotonically increases with F, driving the overall performance decrease, while EN CE loss behaves in the opposite direction. Nonetheless, this effect is more severe for the NL CE loss. This suggests that there might exist a minimum effective number of training samples or clients from which the performance stops decaying rapidly and starts improving slowly. Findings in Table 3 somewhat support this idea. Accuracy rates seem rather unaltered up until a 20% increase in F, when performance drastically plummets. Unlike the CE loss comparatives in Figure 6, changes in accuracy are not gradual and complementary, but instead abrupt and mirrored. An explanation is that the model is trained with the overall train CE loss, and so a decline or gain in performance in either language sub-task unavoidably affects the other. The sensitivity of performance accuracy to small changes in CE loss reflects the non-linear nature of their relationship for unbalanced datasets. An observation on CE loss for performance assessment in Language Modeling tasks was mentioned earlier in this section.
4. CLIENT-SIDE SYSTEM IMPLEMENTATION
In this section, we provide a brief description of the proposed client-side system implementation. Inference, training and client-server communication are built within a custom soft keyboard in Android Studio.
The main feature of the soft keyboard is to test the entire framework by serving word predictions as the user types. On-device, prediction is supported by PyTorch Mobile [45] and, as the user sends or confirms the text input, the typed sequences are saved into a local cache. Predictions and text collection are disabled when the user types URIs, e-mail addresses and passwords, to protect the user's personal information. The reported time between a key-press and a word prediction is 11 ± 4 ms.
To participate in training, client devices must meet several requirements. The locally cached data must contain a minimal number of characters, and the device must be charging, idle and connected to an unmetered network. After the fulfilment of these criteria, the device is ready to train the local model. Using Android's JobScheduler workers, we train the model and communicate with the server in the background. Training and encryption of the model are handled by Chaquopy [46]. Chaquopy is a Python SDK for Android that enables the use of Python training scripts built with PyTorch within an Android device.
A first worker executes the local training and performs a single epoch of mini-batch stochastic gradient descent. Model parameters are encrypted and stored on the device. After local training is completed, a second worker sends the encrypted model parameters to the server via a POST request. Immediately after, a third worker sends a GET request to obtain the global model and, once received, decrypts it.
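As a rough sketch of how the three workers might look on the Chaquopy (Python) side: the endpoint URLs, the payload format and the encrypt_state/decrypt_state helpers are illustrative assumptions, not the actual app code.

```python
import pickle
import requests
import torch

SERVER = "https://example.com/federated"     # placeholder endpoint
CACHE = "encrypted_update.bin"               # local file holding the pending update

def training_worker(model, cache_loader, encrypt_state, lr):
    """Worker 1: one epoch of mini-batch SGD, then encrypt and store the update."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, nesterov=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in cache_loader:
        opt.zero_grad()
        logits, _ = model(x)
        loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1)).backward()
        opt.step()
    with open(CACHE, "wb") as f:
        pickle.dump(encrypt_state(model.state_dict()), f)

def upload_worker():
    """Worker 2: POST the encrypted parameters to the aggregation server."""
    with open(CACHE, "rb") as f:
        requests.post(f"{SERVER}/upload", data=f.read(), timeout=60)

def download_worker(decrypt_state):
    """Worker 3: GET the new (still encrypted) global model and decrypt it on-device."""
    response = requests.get(f"{SERVER}/global", timeout=60)
    return decrypt_state(pickle.loads(response.content))
```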
5. CONCLUDING REMARKS
In this paper, we explore a bilingual next-word predictor mobile application using federated learning. First, we centrally trained a single mobile-size character-level model to predict the next word in either Dutch or English. We show that we can successfully generate language-specific words given a language ID by continuously sampling from the model. Next, we simulate a federated learning environment to assess the feasibility of distributed training of the same model. Our results show that the federated model converges to a sub-optimal point, affecting the quality of the model predictions. Longer training runs and/or decoupling client and server optimizers and learning rates are promising options to bridge the performance gap. Furthermore, varying the ratio of English and Dutch training clients disclosed how acute language imbalance deteriorates the performance of not only one, but both sub-tasks.
Lastly, we describe the client-side implementation of the federated NWP using PyTorch Mobile, Chaquopy and Android Studio in a custom soft keyboard. We apply homomorphic encryption as a way to securely perform the central averaging of client models, protecting against malicious or dishonest central averaging servers.
References
[1] Deloitte, Global mobile consumer trends (2017).
[2] General Data Protection Regulation (GDPR), https://gdpr-info.eu/ (accessed on 18/02/2021).
[3] H. Brendan McMahan, E. Moore, D. Ramage, S. Hampson, B. Agüera y Arcas, Communication-efficient learning of deep networks from decentralized data, arXiv e-prints (2016), arXiv:1602.05629.
[4] Y. Lu, X. Huang, Y. Dai, S. Maharjan, Y. Zhang, Differentially private asynchronous federated learning for mobile edge computing in urban informatics, IEEE Transactions on Industrial Informatics 16 (3) (2019) 2134–2143.
[5] R. C. Geyer, T. Klein, M. Nabi, Differentially private federated learning: A client level perspective, arXiv preprint arXiv:1712.07557.
[6] J. Zhang, B. Chen, S. Yu, H. Deng, PEFL: A privacy-enhanced federated learning scheme for big data analytics, in: 2019 IEEE Global Communications Conference (GLOBECOM), IEEE, 2019, pp. 1–6.
[7] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, D. Ramage, Federated learning for mobile keyboard prediction, arXiv preprint arXiv:1811.03604.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805.
[9] Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, D. Zhou, MobileBERT: A compact task-agnostic BERT for resource-limited devices (2020), arXiv:2004.02984.
[10] Y. Liu, L. Zhang, N. Ge, G. Li, A systematic literature review on federated learning: From a model quality perspective, arXiv preprint arXiv:2012.01973.
[11] S. Ramaswamy, R. Mathews, K. Rao, F. Beaufays, Federated learning for emoji prediction in a mobile keyboard, arXiv preprint arXiv:1906.04329.
[12] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, F. Beaufays, Applied federated learning: Improving Google keyboard query suggestions, arXiv preprint arXiv:1812.02903.
[13] M. Chen, R. Mathews, T. Ouyang, F. Beaufays, Federated learning of out-of-vocabulary words, arXiv preprint arXiv:1903.10635.
[14] M. Chen, A. T. Suresh, R. Mathews, A. Wong, C. Allauzen, F. Beaufays, M. Riley, Federated learning of n-gram language models, arXiv preprint arXiv:1910.03432.
[15] J. Stremmel, A. Singh, Pretraining federated text models for next word prediction, arXiv e-prints (2020), arXiv–2005.
[16] C. Fontaine, F. Galand, A survey of homomorphic encryption for nonspecialists, EURASIP Journal on Information Security 2007 (2007) 1–10.
[17] X. Yi, R. Paulet, E. Bertino, Homomorphic encryption, in: Homomorphic Encryption and Applications, Springer, 2014, pp. 27–46.
[18] K. Laine, R. Player, Simple Encrypted Arithmetic Library - SEAL (v2.0), Technical report.
[19] H. Williams, A modification of the RSA public-key encryption procedure (Corresp.), IEEE Transactions on Information Theory 26 (6) (1980) 726–729.
[20] P. Paillier, Public-key cryptosystems based on composite degree residuosity classes, in: International Conference on the Theory and Applications of Cryptographic Techniques, Springer, 1999, pp. 223–238.
[21] J. D. C. Benaloh, Verifiable secret-ballot elections.
[22] W. De Mulder, S. Bethard, M.-F. Moens, A survey on the application of recurrent neural networks to statistical language modeling, Computer Speech & Language 30 (1) (2015) 61–98.
[23] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
[24] F. A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: Continual prediction with LSTM.
[25] R. J. Williams, D. Zipser, Gradient-based learning algorithms for recurrent networks, Backpropagation: Theory, architectures, and applications 433 (1995) 17.
[26] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, Y. Wu, Exploring the limits of language modeling, arXiv preprint arXiv:1602.02410.
[27] A. Graves, Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850.
[28] P. Lison, J. Tiedemann, OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles.
[29] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: LREC, Vol. 2012, 2012, pp. 2214–2218.
[30] K. Wołk, K. Marasek, Building subject-aligned comparable corpora and mining it for truly parallel sentence pairs, Procedia Technology 18 (2014) 126–132.
[31] M. Dwarampudi, N. Reddy, Effects of padding on LSTMs and CNNs, arXiv preprint arXiv:1903.07288.
[32] Y. E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k^2), in: Dokl. Akad. Nauk SSSR, Vol. 269, 1983, pp. 543–547.
[33] V. Felbab, P. Kiss, T. Horváth, Optimization in federated learning, in: ITAT, 2019, pp. 58–65.
[34] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, H. B. McMahan, Adaptive federated optimization (2020), arXiv:2003.00295.
[35] X. Li, K. Huang, W. Yang, S. Wang, Z. Zhang, On the convergence of FedAvg on non-IID data (2020), arXiv:1907.02189.
[36] Z. Charles, J. Konečný, On the outsized importance of learning rates in local update methods (2020), arXiv:2007.00878.
[37] Y. Ho, S. Wookey, The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling, CoRR abs/2001.00570, arXiv:2001.00570. URL http://arxiv.org/abs/2001.00570
[38] B. Y. Lin, C. He, Z. Zeng, H. Wang, Y. Huang, M. Soltanolkotabi, X. Ren, S. Avestimehr, FedNLP: A research platform for federated learning in natural language processing, arXiv preprint arXiv:2104.08815.
[39] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, V. Smith, Federated optimization in heterogeneous networks, arXiv preprint arXiv:1812.06127.
[40] J. Wang, Q. Liu, H. Liang, G. Joshi, H. V. Poor, Tackling the objective inconsistency problem in heterogeneous federated optimization, arXiv preprint arXiv:2007.07481.
[41] S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, H. B. McMahan, Adaptive federated optimization, in: International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=LkFG3lB13U5
[42] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, A. T. Suresh, SCAFFOLD: Stochastic controlled averaging for federated learning, in: International Conference on Machine Learning, PMLR, 2020, pp. 5132–5143.
[43] S. P. Karimireddy, M. Jaggi, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, A. T. Suresh, Mime: Mimicking centralized stochastic algorithms in federated learning, arXiv preprint arXiv:2008.03606.
[44] X. Qian, D. Klabjan, The impact of the mini-batch size on the variance of gradients in stochastic gradient descent (2020), arXiv:2004.13146.
[45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.
[46] Chaquopy: Python SDK for Android, https://chaquo.com/chaquopy/ (accessed: 2021-06-10).
6. APPENDIX
Figure 7: Statistical Heterogeneity between Clients. Jensen-Shannon distance matrix between pairs of (a) 100 English clients and (b) 100 Dutch clients available in the first training round. Color intensity (more blue) represents a higher level of heterogeneity.
Figure 8: Language-Domain Partition. For L = 2, where {k}^1 and {k}^2 are the sets of English and Dutch clients of size K, respectively, and {k} = {k}^1 ∪ {k}^2 is also of size K, we assign F × K English clients and (1 − F) × K Dutch clients to the set {k} for any given F. Here, we illustrate this partition strategy for K = 600 and different values of F. Glossary: (K) number of clients, (F) fraction of English-speaking clients.