Highlights

Lightweight Transformer in Federated Setting for Human Activity Recognition in Home Healthcare Applications

Ali Raza, Kim Phuc Tran, Ludovic Koehl, Shujun Li, Xianyi Zeng, Khaled Benzaidi, Sarah Hotham

• We proposed a novel lightweight transformer for HAR classification in home healthcare applications. We show that the proposed transformer outperformed other state-of-the-art HAR classification methods based on CNNs and RNNs when trained and tested on a public dataset as well as a dataset we constructed.
• In order to address challenges related to privacy and communication costs, we introduce TransFed, the first HAR classification framework based on federated learning and transformers.
• We designed a prototype to collect human activity data using three different types of body sensors: accelerometer, gyroscope and magnetometer. We also tested different locations of each type of sensor on the human body to find the points of maximum impulse (PMIs) and evaluated the performance of the data for each location. We then constructed a new dataset for evaluating HAR classifiers, which will be released publicly as a new research dataset.
• We evaluated the performance of TransFed on non-independent and identically distributed (non-IID) data. Based on the data distribution, we analyze how the non-IID data of clients can affect the performance of TransFed.
Lightweight Transformer in Federated Setting for Human Activity Recognition in Home Healthcare Applications

Ali Raza^(a,b,*), Kim Phuc Tran^(b,*), Ludovic Koehl^(b), Shujun Li^(a,*), Xianyi Zeng^(b), Khaled Benzaidi^(b) and Sarah Hotham^(c)

(a) School of Computing & Institute of Cyber Security for Society (iCSS), University of Kent, Canterbury, CT2 7NP, UK
(b) University of Lille, ENSAIT, GEMTEX–Laboratoire de Génie et Matériaux Textiles, Lille, F-59000, France
(c) Centre for Health Services Studies (CHSS), School of Social Policy, Sociology and Social Research (SSPSSR), University of Kent, Canterbury, CT2 7NF, UK

* Corresponding co-authors.
Emails: ali.raza@ensait.fr (A. Raza); kim-phuc.tran@ensait.fr (K.P. Tran); ludovic.koehl@ensait.fr (L. Koehl); s.j.li@kent.ac.uk (S. Li); xianyi.zeng@ensait.fr (X. Zeng); khaled.benzaidi@ensait.fr (K. Benzaidi); S.Hotham@kent.ac.uk (S. Hotham)
URL: http://www.hooklee.com/ (S. Li)
ORCID(s): 0000-0001-8326-8325 (A. Raza); 0000-0002-6005-1497 (K.P. Tran); 0000-0002-3404-8462 (L. Koehl); 0000-0001-5628-7328 (S. Li); 0000-0002-3236-6766 (X. Zeng)
ARTICLE INFO
Keywords:
Human activity recognition
HAR
healthcare
home
machine learning
transformers
federated learning
ABSTRACT
Human activity recognition (HAR) is a machine learning task with important applications in healthcare, especially in the context of home care of patients and older adults. HAR is often based on data collected from smart sensors, particularly smart home IoT devices such as smartphones, wearables and other body sensors. Deep learning techniques like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used for HAR, both in centralized and federated settings. However, these techniques have certain limitations: RNNs cannot be easily parallelized, CNNs have the limitation of sequence length, and both are computationally expensive. Moreover, in home healthcare applications the centralized approach can raise serious privacy concerns, since the sensors used by a HAR classifier collect a lot of highly personal and sensitive data about people in the home. In this paper, to address some of these challenges facing HAR, we propose a novel lightweight (one-patch) transformer, which can combine the advantages of RNNs and CNNs without their major limitations, and also TransFed, a more privacy-friendly, federated learning-based HAR classifier using our proposed lightweight transformer. We designed a testbed to construct a new HAR dataset from five recruited human participants, and used the new dataset to evaluate the performance of the proposed HAR classifier in both federated and centralized settings. Additionally, we used another public dataset to evaluate the performance of the proposed HAR classifier in a centralized setting, to compare it with existing HAR classifiers. The experimental results showed that our proposed new solution outperformed state-of-the-art HAR classifiers based on CNNs and RNNs, while being more computationally efficient.
1. Introduction
Human activity recognition (HAR) is a classification
task to learn which activity is performed by a certain person
in a given period of time, which is normally achieved using
supervised machine learning. Activities can be of different
kinds such as sitting, standing, walking, running, eating,
going upstairs and downstairs in a home environment. The
rapid development of mobile computing, smart sensing and
IoT technologies has led to a rich set of health-related data
that can be used for various healthcare applications including
HAR classifiers [1]. For HAR, sensors such as wireless
cameras, accelerometers, gyroscope sensors, wearables and
other body sensors are often used.
HAR has important implications across a wide range
of healthcare settings and contexts. For instance, it has
a role in the prevention of diseases and maintenance of
health (e.g., for monitoring rehabilitation, fall identification
and prevention, tracking health behaviors) and clinician- and patient-focused tools (e.g., point-of-care diagnostics and
early detection of diseases) that facilitate management of
health conditions and prompt behavioral changes with real-
time feedback [2]. It can also be used for fitness and exercise
monitoring, e.g., by mobile apps on smartphones and wearables that count steps and are context-aware, and for
safety purposes, e.g., the “Do Not Disturb while Driving”
feature on iOS version 11 [3], which requires real-time
detection of the specific human activity “driving”.
Recently, by utilizing deep learning, researchers have
made substantial progress on high-accuracy HAR classi-
fiers [4,5,6]. For instance, Yao et al. [4] proposed a deep
learning model based on data collected from accelerometer
and gyroscope sensors on mobile devices, which involves
a hybrid model combining a convolutional neural network
(CNN) and a recurrent neural network (RNN). However,
training such models on real-world data collected from smart
devices leads to three major challenges. First, deep learning requires a large amount of data for training [4], which incurs substantial communication between the centralized server and the clients (data owners). Second, collecting data from a home
environment can raise privacy concerns since a lot of the
sensor data include or can infer personal and sensitive data
about people in the home [7,8,9]. Third, even though
for healthcare applications there are often a lot of data
and sensors available, the relevant data owners and local
healthcare providers may not be willing to share the data
for legitimate business interests, therefore limiting the data
available for training HAR classifiers.
The above challenges can be addressed by a new machine
learning concept called federated learning [10], which al-
lows multiple clients to collaboratively build a global model
without sharing their local data with each other or with a
global server, but by sharing only trained parameters of local
models. This helps improve privacy of local data, reduce
unnecessary communications between the global server and
the clients, and meet the business needs of data owners
and local healthcare providers who would not share their
data. Therefore, many researchers have considered using
federated learning for HAR and other healthcare applica-
tions [11]. For instance, Liu et al. [12] developed a classifier
in a federated setting to address the aforementioned chal-
lenges by training a classifier using federated learning [13],
and Sozinov et al. [14] compared HAR classifiers trained using centralized learning with those trained using federated learning for different data distributions among
clients. Their results show that federated learning-based classifiers achieved a comparable accuracy to deep learning
classifiers in a centralized setting.
In regards to performance, commonly used deep learn-
ing techniques including CNNs and RNNs have their lim-
itations. CNNs have an advantage over RNNs (including
LSTMs) in that they are easy to parallelize, while RNNs have recurrent connections and hence parallelization cannot be easily achieved. However, in long-term sequences like time series, capturing the dependencies can be cumbersome and impractical using CNNs [15]. To address these challenges, transformers have been introduced recently [16]. The transformer technique is an attempt to capture the best of both worlds (CNNs and RNNs): transformers can model dependencies over the whole range of a sequence and have no recurrent connections, so the whole model can be computed in a
very efficient feedforward fashion. Since their introduction, transformers have been widely studied in various applications, for example, in natural language processing [17] and healthcare [18]. However, transformers have not been developed for HAR, and developing transformers in a federated setting can potentially boost HAR performance. Nevertheless, there are open questions about the implementation details, such as what the limits of the algorithm are and whether transformers really perform well in HAR when coupled with federated learning.
In this study, to answer all these questions, we first develop a novel lightweight transformer for HAR classification and show that it can provide high performance in terms of accuracy and computational cost compared to existing deep models such as RNNs and CNNs. We trained and tested the proposed baseline transformer using different open-source
sensor data, as well as on some data that we collected using
a prototype that we developed to collect the human activity
data using three different types of sensors: accelerometer,
gyroscope and magnetometer. While collecting the data
we tested different sensor locations on the human body: hip,
chest and upper arm (further details about the prototype
and data collected will be provided in later sections). Fur-
thermore, to address the aforementioned challenges, such as
privacy concerns and additional communication costs, we
propose a novel transformer framework in a federated setting called TransFed, the first transformer-based classifier for HAR in federated settings. Moreover, we evaluated the performance of federated learning using the proposed transformer and showed that federated learning can be used instead of centralized learning for HAR classifiers. We compared
the proposed transformer-based HAR classifier under two training settings: centralized learning and federated learning. In the federated setting, we use skewed data based on a non-independent and identically distributed (non-IID) data distribution among clients, and in the centralized setting we use the whole dataset, which contains only slightly unbalanced classes. The results showed that the proposed TransFed outperformed existing state-of-the-art methods.
The main contributions of the paper are as follows:
1. We proposed a novel lightweight transformer for HAR
classification. We show that the proposed transformer
outperformed other state-of-the-art HAR classifica-
tion methods based on CNNs and RNNs when trained
and tested on a public dataset as well as a dataset we
constructed.
2. In order to address challenges related to privacy and
communication costs, we introduce TransFed, the first
framework for HAR classification based on federated
learning and transformers.
3. We designed a prototype to collect human activity data using three different types of body sensors: accelerometer, gyroscope and magnetometer. We also tested different locations of each type of sensor on the
human body to find the points of maximum impulse
(PMIs) and evaluated the performance of the data for
each location. We then constructed a new dataset for
evaluating HAR classifiers, which will be released
publicly as a new research dataset.
4. We evaluated the performance of TransFed using non-independent and identically distributed (non-IID) data.
Based on the data distribution we analyze how the
non-IID data of clients can affect the performance of
TransFed.
The rest of the paper is organized as follows. In the
next section, we briefly introduce some related work on
HAR, federated learning and transformers. Section 3 explains our proposed transformer and the federated learning framework TransFed in detail. The experimental setup and the experimental results on performance analysis are given in Sections 4 and 5, respectively. Some further discussions
including limitations and some future research directions are
presented in Section 6. The last section concludes the paper.
2. Related Work
2.1. Human Activity Recognition
Based on data from one or more body sensors, HAR classification uses a classifier to predict human activities.
Generally, the data contains tri-axial data from different sen-
sors like accelerometer, gyroscope and magnetometer. Most
modern smart devices such as smartphones and wearables
have such sensors. Initially, researchers used various hand-crafted features for training HAR classifiers.
Generally speaking, the hand-crafted features can be di-
vided into three main types: frequency domain, time domain,
and time-frequency analysis. Classical machine learning
models like 𝑘-means, probabilistic methods (naive Bayes)
and support vector machines have been proposed for HAR
classification [19].
On the other hand, deep learning exploits the benefits of having a huge amount of data and highly non-linear deep models to outperform classical models. With deep learning, explicit feature extraction can simply be omitted, whereas in classical machine learning it is a laborious and important task. Raw data in the form of a sliding window or simple window can be directly fed into deep learning based classifiers. CNNs, long short-term memory (LSTM) RNNs, and hybrid models combining RNNs and CNNs are the dominating approaches proposed for HAR [20,4,21,22,14,23,24,25,26]. Figure 1 shows the pipeline of a typical HAR classification process. First, the raw data from sensors is transformed into windows of a fixed-length size, which are fed directly to the classifier. During prediction, data is collected using
the same window length and then, again depending on the
model selected for a HAR classifier, features are extracted
or raw windowed data is fed into the classifier that predicts
the target human activity, such as walking, sitting, etc.
2.2. Federated Learning
Federated learning (FL) [27] is a new concept of dis-
tributed machine learning, where peers in a network train
a global model collaboratively, without sharing their local
training data with the central node. The main goal of FL
is to train a global model $\mathbf{GM}$ by using the shared parameters of locally trained models at $K > 1$ clients. This ensures that the local training data remains at each client. At the $r$-th global round, each client $P_k$ has its own local data $D_k^r$ and trains a local model $\mathbf{LM}_k^r$. After training, each client sends the parameters of $\mathbf{LM}_k^r$ to the central server, which aggregates the local parameters from all clients to produce an updated global model $\mathbf{GM}^r$. There are mainly two different approaches for making global updates: i) federated averaging, where clients send the updates to the global server after training the local model for multiple training epochs, and ii) federated stochastic gradient descent (SGD), where clients send updates to the global server after each local training batch. McMahan et al. [27] compared the two approaches and showed that federated averaging can reduce communication costs by a factor of 10 to 100, compared to federated SGD. The merits of federated averaging make it popular in many applications [28].

Figure 1: Overview of the HAR classification pipeline.

Federated averaging can be described by the following equation:
$$\mathbf{LM}_k^{r+1} = \mathbf{LM}_k^r - \alpha g_k^r; \quad \mathbf{GM}^{r+1} = \sum_{k=1}^{K} \frac{n_k}{n} \mathbf{LM}_k^{r+1}, \tag{1}$$
where $g_k^r$ is the gradient of back-propagation at the $r$-th global round, $\alpha$ is the learning rate, $n$ is the sum of the data points of all the participating clients, and $n_k$ is the number of data points of the $k$-th client.
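As a quick numeric illustration of the weighted aggregation in Eq. (1), consider the following Python toy example (the client sizes and weight values are made up purely for illustration):

import numpy as np

# Toy illustration of Eq. (1) with K = 2 clients, each holding one
# locally updated weight matrix LM_k^{r+1}.
n_k = [600, 400]                 # data points per client
n = sum(n_k)                     # total data points across clients
LM = [np.full((2, 2), 1.0),      # client 1's locally updated weights
      np.full((2, 2), 3.0)]      # client 2's locally updated weights

# GM^{r+1} = sum_k (n_k / n) * LM_k^{r+1}
GM = sum((nk / n) * lm for nk, lm in zip(n_k, LM))
print(GM)                        # every entry is 0.6*1.0 + 0.4*3.0 = 1.8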
Federated learning has many advantages over the centralized approach. For instance, it can provide more privacy protection to the sensitive data. This is because the
global model is trained without requiring clients to share
their local (often sensitive) data. Moreover, it reduces the
communication costs because only trained parameters are
shared, instead of the often large amount of data from all
clients.
2.3. Transformers
Vaswani et al. [16] introduced a novel architecture called
Transformer for sequence-to-sequence learning. One of the key components of transformers is the attention mechanism. The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important. Similar to LSTMs, a transformer basically transforms one sequence into another with the help of two parts: an encoder and a decoder, but it differs from existing sequence-to-sequence methods in that it does not employ any recurrent networks (GRU, LSTM, etc.). The encoder and decoder consist of modules that can be stacked on top of
Figure 2: An illustrative diagram of the attention mechanism and its parallelization, regenerated from [16]. Left: the (scaled dot-product) attention mechanism; Right: its parallelization across h heads (multi-head attention).
each other multiple times. Each module mainly consists of multi-head attention and feed-forward layers. The input and output are first embedded into an $n$-dimensional space, since they cannot be used directly. Another part of the model is the positional encoding of the different words. Since there are no recurrent networks that can remember how a sequence is fed into the model, a relative position is encoded for every part of the input sequence. These positions are added to the embedded $n$-dimensional vector of each input sub-sequence.
The attention mechanism used in transformers can be described by the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \tag{2}$$
where $Q$ is the query matrix (the vector representation of an input sub-sequence), $K$ are all the keys (vector representations of the whole sequence) and $V$ are the values (vector representations of the whole sequence). For the encoder and the decoder multi-head attention modules, $V$ consists of the same word sequence as $Q$. However, for the attention module that takes in both the encoder and the decoder sequences, $V$ is different from the sequence represented by $Q$. In other words, $V$ is multiplied and summed with the attention weights $\gamma$, defined by the following equation:
$$\gamma = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right). \tag{3}$$
The self-attention mechanism is applied multiple times in parallel, along with linear projections of $Q$, $K$ and $V$. This helps the system learn from different representations of $Q$, $K$ and $V$. The weight matrices $W$ that are learned during training are multiplied by $Q$, $K$ and $V$ to compute these linear projections. Figure 2 gives an illustrative diagram of the
attention mechanism and its parallelization.
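As a concrete illustration of Eqs. (2) and (3), a single attention head can be sketched in a few lines of NumPy (self-attention, so Q = K = V here; the learned linear projections and multi-head concatenation are omitted):

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Eq. (2): Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V.
    d_k = K.shape[-1]
    gamma = softmax(Q @ K.T / np.sqrt(d_k))   # attention weights, Eq. (3)
    return gamma @ V

# Toy self-attention over a sequence of 4 vectors of dimension 8.
X = np.random.randn(4, 8)
out = scaled_dot_product_attention(X, X, X)   # shape (4, 8)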
Moreover, positional encoding is used to keep track of
the input and output sequence. Finally, transformers employ
feed-forward networks. These feed-forward networks have
identical parameters for each position of the input sequence,
which describes each element from a given sequence as a
separate but identical linear transformation.
Transformers have various applications because they use
attention mechanisms, which can help the machine learning
models to learn from data more effectively to improve the
performance of many machine learning tasks, such as natural language processing [29]. For example, a hybrid HAR classifier using CNNs and transformers was introduced in [30], which utilizes a two-streamed structure to capture both time-over-channel and channel-over-time features, and uses the multi-scale convolution augmented transformer to capture range-based patterns.
3. Proposed Methods
In this section, we explain our proposed methods, includ-
ing the proposed lightweight transformer and the FL-based
framework TransFed for addressing privacy concerns in a
federated setting. For the proposed transformer model, we
first describe the model itself and then move on to explain
two data formats we tested for the proposed transformer
model. Both the model itself and the data formats tailored
for the model help improve the transformer’s performance
in terms of computational complexity and classification
accuracy. The following three subsections will introduce the
transformer model, the data formats used, and the proposed
TransFed framework, respectively.
3.1. Proposed Lightweight Transformer Model
We designed a novel lightweight transformer, as shown
in Figure 3. From bottom to top, the first layer inputs the
raw human activity data after preprocessing it into certain
sized window (as discussed earlier). The input is then passed
through the transformer layer, which extracts discriminative
features from the input data. Finally, the output of trans-
former is fed into a prediction layer for the classification or
final output.
Unlike the traditional transformer model, we use a single patch encoding in our model. This is because we found that we can get state-of-the-art results even without using multiple patches, which helps make our proposed transformer simpler and therefore more computationally efficient. The proposed transformer is virtually divided into two parts: the transformer layer and the prediction layer. Both layers are composed of many sub-layers. The transformer layer starts
with an augmentation sub-layer, which is used to increase
the diversity of the training set by applying random (but re-
alistic) transformations. The output of the augmentation sub-
layer is fed into a normalization sub-layer to normalize the
data. Following the normalization sub-layer, a multi-head
attention sub-layer applies the self-attention mechanism to
the input data, which is the fundamental mechanism of a
transformer. The self-attention mechanism is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let us call the input vectors $X_1, X_2, \ldots, X_l$ and the corresponding output vectors $Y_1, Y_2, \ldots, Y_l$. To produce the vector $Y_i$, the self-attention operation applies a weighted average over the input vectors as follows:
$$Y_i = \sum_j w_{ij} X_j, \tag{4}$$
where $j$ indexes over the whole sequence and the weights sum to one over all $j$. The weight $w_{ij}$ is not a parameter as in a normal neural network, but is derived from a function over $X_i$ and $X_j$. To make this operation more lightweight, we use the dot product:
$$w'_{ij} = X_i^T X_j. \tag{5}$$
Since the dot product gives a value between $-\infty$ and $+\infty$, we apply a softmax to map the values to the range of 0 to 1, and to ensure that they sum to 1 over the whole sequence:
$$w_{ij} = \frac{\exp w'_{ij}}{\sum_j \exp w'_{ij}}. \tag{6}$$
The self-attention operation defines the correlation among the input features with respect to the learning task. The output of the multi-head attention sub-layer is then added to the output of the previous normalization sub-layer, and the result is fed into a succeeding normalization sub-layer, which is then passed through a dense sub-layer. The dense sub-layer applies a non-linear transformation for further feature extraction, given as:
$$\mathrm{Output} = \mathrm{activation}(\mathrm{dot}(\mathrm{input}, \mathrm{kernel})), \tag{7}$$
where "activation" is the element-wise activation function passed as the activation argument, "kernel" is a weight matrix and "dot" is the dot product. The output of the dense sub-layer is added to the output of the previous addition sub-layer. The transformer layer can be applied multiple times. The output from the transformer layer is fed into a flattening and dense output sub-layer with softmax activation, which gives the final classification probability distribution over the pre-defined classes.
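For concreteness, a minimal TensorFlow/Keras sketch of this architecture is given below. The Gaussian-noise stand-in for the augmentation sub-layer and the exact layer sizes are our illustrative assumptions; only the overall structure follows Figure 3 and the hyper-parameters in Table 3 (two transformer layers, five attention heads, 140×9 input):

import tensorflow as tf
from tensorflow.keras import layers

def build_transformer(window=140, features=9, num_classes=15,
                      heads=5, transformer_layers=2):
    inputs = layers.Input(shape=(window, features))
    # Augmentation sub-layer: random (but realistic) perturbations of the
    # training windows; Gaussian noise is a simple stand-in used here.
    x = layers.GaussianNoise(0.01)(inputs)
    for _ in range(transformer_layers):
        # Normalization followed by single-patch multi-head self-attention.
        norm1 = layers.LayerNormalization()(x)
        attn = layers.MultiHeadAttention(num_heads=heads,
                                         key_dim=features)(norm1, norm1)
        add1 = layers.Add()([attn, norm1])
        # Second normalization, dense feature extraction (Eq. (7)) and a
        # residual addition with the previous add sub-layer's output.
        norm2 = layers.LayerNormalization()(add1)
        dense = layers.Dense(features, activation="relu")(norm2)
        x = layers.Add()([dense, add1])
    # Prediction layer: flatten, normalize and classify with softmax.
    x = layers.Flatten()(x)
    x = layers.LayerNormalization()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_transformer()
model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])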
Figure 3: The proposed transformer model.
Figure 4: Image-based format for the proposed transformer model.
3.2. Data Formats for the Proposed Transformer Model
Since the input data format plays an important role in the
classification and computational efficiency of the proposed
transformer model, we experimented with two possible for-
mats of the input data as explained below.
3.2.1. Image-based Format
Since transformers were initially introduced for computer vision tasks, they work with 2-D images by splitting an image into a vector of small sub-images (patches). This vector is used as the input. In order to follow this convention, we created an image of size $N \times M$ from the input samples of an activity in a time frame of 2 seconds, as a time frame of 2 seconds can give enough information about the activity, as shown in Figure 4. After this, patches of size $n \times n$ were created, where $N, M > n$. We tried images and patches of different sizes. Unfortunately, when we trained and tested the proposed transformer using the above-mentioned data format, the results were not good enough. This is because reshaping consecutive 1-D samples into 2-D images breaks their natural boundaries, which makes it hard for the transformer to capture all features effectively while suppressing the random noise in the images, since no 2-D filter of a reasonable size would cover distant pixels that are neighboring samples in the original 1-D signal. We observed that with an increase in size (the number of transformer layers), the classification performance improved, but this setting is not suitable for edge devices with a relatively low computation power. Nevertheless, this approach can be utilized in settings where the computational power is high enough. Hence, we decided not to use the image-based format.
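For reference, the patch extraction described above can be sketched as follows (the dimensions are illustrative):

import numpy as np

def to_patches(image, n):
    # Split an N x M "image" of stacked sensor samples into a vector of
    # n x n patches (N and M assumed divisible by n for simplicity).
    N, M = image.shape
    patches = [image[i:i + n, j:j + n]
               for i in range(0, N, n)
               for j in range(0, M, n)]
    return np.stack(patches)          # shape: (num_patches, n, n)

# Toy usage: a 20 x 18 image built from sensor samples, 2 x 2 patches.
image = np.random.randn(20, 18)
patches = to_patches(image, 2)        # (90, 2, 2)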
3.2.2. Averaged-Window Format
In this approach, we propose to reshape the raw input data into fixed sliding windows of shape $W \times F$, where $W$ is the number of samples in a window and $F$ is the number of features. In our experiments, we varied $W$ as well as the number of features. Since a human activity is an action performed over a certain window of time, to capture a specific activity we need to optimize the window size so that the model can map the information contained in that window to a specific class. For example, going upstairs is an activity that can be recognized within a suitable window size such as 2 or 3 seconds. Hence, $W$ (the window size) is very critical in HAR because of the following trade-off: if $W$ is too small, the classifier cannot distinguish between activities; and if $W$ is too large, the classifier will require more data and more computational resources. Let us take going upstairs as an example. If $W$ is below one second, it would not be suitable for recognizing the activity, as no human being can finish going upstairs within just one second (for HAR tasks, a single temporal point is too small to be informative). In regards to $F$, we noticed that when there are a large number of candidate classes, a large $F$ significantly improves the classification performance, while for a small number of classes, a small (such as 6) or large (such as 9) value of $F$ gives almost the same classification performance. Moreover, we found that combining tri-axial data from the accelerometer and gyroscope provides a significant improvement over combining accelerometer and magnetometer data, or combining magnetometer and gyroscope data. After trying many different values of the window size with grid search, we found that a window size of $140 \times 9$ provided optimized results for all human activities we considered, when both accuracy and computational costs are taken into account. To summarize, we take samples of the input over a time frame of 2 seconds with 9 features and average them feature-wise to generate a new sample, as shown in Figure 5. This format works because averaging cancels out random noise.
Figure 5: Averaged-window format for the proposed transformer model.
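The averaged-window preprocessing can be sketched as follows (the 50% stride is an illustrative assumption; the window and feature counts follow the 140×9 format described above):

import numpy as np

WINDOW = 140    # samples per window (the 140 x 9 format described above)
FEATURES = 9    # tri-axial accelerometer + gyroscope + magnetometer

def make_windows(stream, window=WINDOW, stride=WINDOW // 2):
    # Cut a (T, FEATURES) sensor stream into (possibly overlapping)
    # fixed-length windows; the 50% stride is an illustrative choice.
    return np.stack([stream[i:i + window]
                     for i in range(0, len(stream) - window + 1, stride)])

def average_feature_wise(windows):
    # Average each window feature-wise to produce one smoothed sample
    # per window, as in the averaged-window format (Figure 5).
    return windows.mean(axis=1)               # shape: (num_windows, FEATURES)

# Toy usage with random data standing in for a real recording.
stream = np.random.randn(1150, FEATURES)
windows = make_windows(stream)                # (num_windows, 140, 9)
averaged = average_feature_wise(windows)      # (num_windows, 9)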
3.3. The Proposed FL Framework TransFed
Figure 6 shows the basic flowchart of the proposed TransFed. In our method, the federated setting is adopted in order to facilitate collaborative learning while preserving the privacy of the underlying data. In the federated setting, a central (global) server sends the compiled architecture of the model (which is a transformer in our case) to all edge or client devices (in this paper, we use the terms "edge" and "client" interchangeably). Each device trains its transformer locally using its local data. After all the local transformers are trained, each edge device sends the trained parameters of its transformer to the global server, which are then aggregated by the global server to construct the global model without a training process. After the global model is available, each edge device downloads the aggregated parameters of the global model and updates its local model according to its local needs. There are two expected key advantages of the federated setting: (i) it increases the overall accuracy and the generalization of the model, and (ii) it provides better privacy protection to the data owners. Algorithm 1 defines the workflow of the proposed TransFed framework.
4. Experimental Setup
In this section, we discuss the experimental setup that we designed to test the performance of the proposed transformer in a federated setting using real-time data. We first discuss a testbed that we designed to construct a new dataset and to support the experiments for evaluating the performance of the proposed TransFed framework.
4.1. Testbed for Data Collection
TransFed can in principle work with different types of
sensors from which data about human activities are col-
lected. For our testbed, we decided to use three types of sen-
sors available on most smart wearable devices: tri-axial ac-
celerometers, gyroscopes and magnetometers. These sensors
provide measurements at a sampling frequency of 115 Hz.
The frequency of 115 Hz establishes a sufficient condition
for a sample rate that permits a discrete sequence of samples to capture all the information from a continuous-time human activity signal. The testbed is shown in Figure 7.

Figure 6: An illustrative diagram of the proposed framework.
4.1.1. Sensor Locations and Data Collection
Since the quality of data being used by an ML model
can significantly impact its performance, we decided to use
a data-centric approach to ensure the performance of our
proposed model. One important aspect of acquiring high-quality data for HAR purposes is to identify an optimized location on the human body for each sensor used. We need to optimize the sensor locations in such a way that they provide both (i) data informative enough to be used in ML, and (ii) convenience and comfort to humans while they are wearing the sensors. Keeping both points in mind and considering what was commonly chosen in the literature [22,26], we tried different locations on the human body: upper arm, hip and chest. After placing each sensor on each body location on the recruited individuals, we recorded data for the following 15 activities: standing, sitting, walking, jogging, going upstairs, going downstairs, eating, writing, using laptop, washing face, washing hand, swiping, vacuuming, dusting a surface, and brushing teeth. Using the data collected from each location, we trained and tested the proposed transformer model, separately for the data of each location, in a centralized setting. The results showed that the model trained and tested using data collected from the hip outperformed models using data collected from the two other locations (chest and upper arm). Hence, in the following we report experimental results with data collected from the hip only.
To construct our new dataset, we recruited five human participants with different ethnic backgrounds, i.e., Pakistani, Algerian, French, Vietnamese and Moroccan. Before collecting the data, each participant was briefed about each activity to be conducted, the health hazards of the experimental setup, and how the data would be used by the researchers. No financial compensation was provided. Each participant performed the 15 activities as shown in Table 1. While performing each activity for a duration of around 3 minutes (the precise duration was determined based on the ability and willingness of each participant), the sensors placed on each participant generated activity data, which was sent to the ESP32 module using the I2C communication protocol. Note that the ESP32 and sensors were both located on the participant's body, powered by a lithium battery. The ESP32 module sent the data over Wi-Fi using the HTTP method POST to the edge (Raspberry Pi), where it was then stored in a MySQL database. The experiment was approved by ENSAIT, GEMTEX–Laboratoire de Génie et Matériaux Textiles, University of Lille, France, from which all participants were recruited.
Figure 7: Our prototype for real-time data collection.
4.2. Federated Learning Testbed
We built a testbed to support training and testing of the proposed HAR classifier in a federated setting (i.e., the TransFed framework). The testbed consists of one aggregation server (master node) and five client (edge) devices. Each client trains its local model for $e = 100$ epochs on the local dataset, updates the local model and sends the model back to the aggregation server. Once all clients have performed $e$ epochs, the master node updates the global model and sends it again to the client workers. The process continues for $r$ communication rounds. Algorithm 1 defines the whole implementation of the federated learning using TensorFlow and socket communication. Each client executes Lines 3-5, whereas the rest of the algorithm is executed by the aggregation server. In Algorithm 1, train() refers to TensorFlow's training (fit) function, and send() refers to the send function of the MLSocket library in Python.
Algorithm 1: The federated learning algorithm for training our proposed transformer-based HAR classifier.

Input: $\mathbf{GM}^r$ – the global transformer model for the $r$-th round; $\mathbf{LM}_k^r$ – the local model on the $k$-th client for the $r$-th round; $n$ – the number of data observations across all clients; $n_k$ – the number of observations on each client (edge); $\mathbf{LD}_k$ – the set of local datasets for training on each client; $r$ – the index of the current round; $e$ – the number of training epochs per round; $b$ – the batch size of training data; $\Delta W_k^r$ – the parameters of client $k$ at the $r$-th round; $K$ – the number of clients participating in federated learning.
Output: Trained, aggregated and updated model.

1: while $r \neq 0$ do
2:   for each client $k$ do
3:     $\mathbf{LM}_k^r = \mathbf{GM}^r$
4:     $\mathbf{LM}_k^{r+1} = \mathrm{Train}(\mathbf{LM}_k^r, \mathbf{LD}_k, e, b)$
5:     $\mathrm{send}(\Delta W_k^{r+1})$
6:   $\mathbf{GM}^{r+1} = \sum_{k=1}^{K} \frac{n_k}{n} \mathbf{LM}_k^{r+1}$
To simulate more realistic scenarios, we used the SSL (Secure Socket Layer) protocol for secure communications between the server and client devices. The transformer-based classifier was trained locally on five Raspberry Pi devices (Pi 3 Model B+ with a 1.4 GHz, 64-bit quad-core ARMv8 CPU and 1 GB LPDDR2 SDRAM) as edge devices. Furthermore, a workstation with an Intel Core i7-6700HQ CPU and 32 GB RAM was used as the aggregation server. This hardware setup for clients allowed us to simulate what typical home healthcare systems can provide in terms of computing resources.
4.3. Data Partitioning and Distribution
In order to analyze the performance of the proposed transformer in a federated setting, we used skewed data derived from a non-IID distribution. To achieve non-IID data over the clients in federated learning, we grouped the data by each user in the dataset and split it among 5 different clients, and selected one activity on each client to have 40-50% fewer samples. For example, Client 1 has 40-50% fewer samples of the standing activity compared to the others, and Client 2 has 40-50% fewer samples of the sitting activity compared to the others.

Table 1: Description of the 15 human activities covered in our experiments for constructing the new dataset.

Class Name | Class ID | Performed Activity | Number of Samples Collected
Standing | 0 | Standing still on the floor | 22,851
Sitting | 1 | Sitting still on a chair | 23,204
Walking | 2 | Walking at a normal pace | 23,982
Jogging | 3 | Running at a high speed | 21,594
Going Upstairs | 4 | Ascending a set of stairs at a normal pace | 23,832
Going Downstairs | 5 | Descending a set of stairs at a normal pace | 21,836
Eating | 6 | Eating lunch | 21,798
Writing | 7 | Writing on paper | 21,434
Using laptop | 8 | Using a laptop normally | 22,009
Washing face | 9 | Washing face while standing | 22,027
Washing hand | 10 | Washing hands while standing | 22,009
Swiping | 11 | Swiping a surface while walking and standing | 19,186
Vacuuming | 12 | Vacuuming a surface while walking and standing | 22,507
Dusting a surface | 13 | Dusting a surface while sitting | 21,513
Brushing Teeth | 14 | Brushing teeth while standing | 22,495

Table 2: Basic information of the activity classes in the WISDM dataset.

Class Name | Class ID | Number of Samples
Walking | 0 | 424,400
Jogging | 1 | 342,177
Going Upstairs | 2 | 122,869
Going Downstairs | 3 | 100,427
Sitting | 4 | 59,939
Standing | 5 | 48,395
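A sketch of this partitioning (the helper's signature and the exact downsampling fraction are illustrative assumptions):

import numpy as np

def make_non_iid_splits(x, y, users, num_clients=5, reduce_frac=0.5, seed=0):
    # Group the samples by user, assign each user's data to one client,
    # then downsample one chosen class per client by ~40-50%.
    rng = np.random.default_rng(seed)
    splits = []
    for client_id, user in enumerate(np.unique(users)[:num_clients]):
        cx, cy = x[users == user], y[users == user]
        minority = client_id               # e.g., Client 1 loses "standing"
        keep = np.ones(len(cy), dtype=bool)
        idx = np.where(cy == minority)[0]
        drop = rng.choice(idx, size=int(len(idx) * reduce_frac), replace=False)
        keep[drop] = False
        splits.append((cx[keep], cy[keep]))
    return splits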
4.4. Centralized Setting
In addition to testing the performance of our proposed
TransFed framework based on the proposed transformer,
we also tested the proposed transformer in a centralized
setting, which demonstrated that it is a general technique that
can work under both centralized and federated settings. We
tested the proposed transformer with both our new dataset and the well-known public WISDM dataset [31], and compared its performance against other state-of-the-art methods. Since most of the existing work used the WISDM dataset in a centralized setting, testing the proposed transformer-based classifier using a public dataset in a centralized setting gives a fair comparison. The details of WISDM can be found in Table 2.
5. Performance Analysis
In this section, we provide the performance analysis
of the proposed framework based on the above-mentioned
experimental setup.
5.1. Performance Metrics Used
To measure the classification performance of the proposed transformer-based classifier, we use the following four classification metrics widely used for evaluating machine learning models.

1. Accuracy is defined as the number of correctly predicted observations divided by the total number of observations:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \tag{8}$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive, and false negative samples, respectively.

2. Precision is defined as the number of true positives divided by the total number of true positives plus false positives for a given class:
$$\mathrm{Precision} = \frac{TP}{TP + FP}. \tag{9}$$

3. Recall is defined as the total number of true positives divided by the total number of true positives plus false negatives for a given class:
$$\mathrm{Recall} = \frac{TP}{TP + FN}. \tag{10}$$

4. F1-score is defined as the harmonic mean of precision and recall:
$$\text{F1-score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}. \tag{11}$$
Figure 8: Data derived from a non-IID quantity distribution among all clients in the federated setting, where each client contains 40-50% less data of a given class. The x-axis represents the class ID and the y-axis represents the number of samples.
Table 3: Hyper-parameters used by each client for federated learning.

Hyper-Parameter | Value
Learning Rate | 0.01
Number of Epochs | 100
Batch Size | 30
Weight Decay | 0.001
Transformer Layers | 2
Multi-attention Heads | 5
Input Shape | 140×9
All four of the above performance metrics are defined for binary classifiers only, so for multi-class HAR classifiers we used the one-vs-rest strategy to calculate the performance metrics for each class.
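For instance, a one-vs-rest report of these metrics can be produced with scikit-learn (y_true and y_pred below are placeholders for the test labels and model predictions):

from sklearn.metrics import accuracy_score, classification_report

y_true = [0, 1, 2, 2, 1, 0]   # placeholder ground-truth class IDs
y_pred = [0, 1, 2, 1, 1, 0]   # placeholder model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-class (one-vs-rest) precision, recall and F1-score, as in Tables 4-11.
print(classification_report(y_true, y_pred, digits=3))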
5.2. Accuracy and Loss (Training and validation)
Accuracy (one-vs-rest accuracy, where we split the multi-class classification into a binary classification problem per class) and loss (categorical cross-entropy) are often used to
measure how a machine learning classifier’s performance
evolves during the training process. The trend over time can
be used to determine whether the model is properly and
ideally trained, to detect anomalies in time (such as over-
or under-fitting), and to make necessary adjustments.
To evaluate the performance of the proposed transformer in a federated setting, we used a non-IID distribution as shown in Figure 8. We trained the transformer at each client for 100 epochs. The hyper-parameters used during the training process are given in Table 3. Each client used two transformer layers and a learning rate of 0.01 with an Adam optimizer.

Figures 9 and 10 show the accuracy and loss curves for each local model at the corresponding local client, respectively. For comparison purposes, we performed 100 iterations (epochs) for each model and plotted a point every 10 iterations, making the curves clearer while still reflecting the overall trend. Among the local models, the performance is almost similar on the non-IID dataset, which indicates that the proposed transformer is robust against imbalanced data caused by a non-IID distribution. Overall, each local model was able to achieve more than 98% training and validation accuracy using the non-IID dataset.
For the centralized setting, we trained the proposed
model using the public WISDM dataset as well as our col-
lected dataset. The hyper-parameters were kept the same as
mentioned earlier for the federated setting. Figures 11a and
11b present training and validation accuracy of the proposed
transformer in the centralized setting using the WISDM
dataset and our collected dataset, respectively. Figures 12a
and 12b present training and validation loss of the classifier
based on the proposed transformer in the centralized setting,
using the WISDM dataset and our collected dataset, respec-
tively.
5.3. Classification Performance
Tables 4-8 present the classification results obtained on all five clients using their local non-IID data in the federated setting. Table 9 shows the classification performance of the global model after performing federated averaging. We tested the global model using a test dataset that was not used to train any of the client models, and the proposed transformer achieved an overall accuracy of 98.74%.

Moreover, Tables 10 and 11 present the classification performance of the proposed transformer-based classifier in the centralized setting, using the WISDM dataset and our collected dataset, respectively. Overall, the classifier based on our proposed transformer achieved an accuracy of 99.14% and 98.89% with our collected dataset and the WISDM dataset, respectively.
Figure 9: Training and validation accuracy of the clients (a-e: Clients 1-5) using data derived from a non-IID distribution.

Figure 10: Training and validation loss of the clients (a-e: Clients 1-5) using data derived from a non-IID distribution.
Since the WISDM dataset has much more unbalanced class samples than our collected dataset, it is not surprising that its performance is (slightly) lower, since imbalanced data are harder to learn.
5.4. Confusion Matrices
A confusion matrix, also known as an error matrix, is an $n \times n$ matrix or table that shows how each class is classified into all the $n$ classes of a classifier. The diagonal elements of a confusion matrix show the correct classification results, and the other cells show the different misclassification rates. Hence, we evaluate the proposed transformer using confusion matrices to determine where exactly the transformer misclassifies during testing. Figure 13 presents the confusion matrix obtained with the updated global model in the federated setting, using our collected dataset. Figure 14 presents the confusion matrix obtained using the classifier based on our proposed transformer in the centralized setting, using our collected dataset. In both figures, the x-axis indicates the predicted class labels and the y-axis indicates the ground-truth class labels. We can see that in both settings the proposed HAR classifier worked very well, with similar performance across all the 15 classes.
Figure 11: Training and validation accuracy of the classifier based on the proposed transformer in the centralized setting: (a) with the WISDM dataset; (b) with our collected dataset.

Figure 12: Training and validation loss of the classifier based on the proposed transformer in the centralized setting: (a) with the WISDM dataset; (b) with our collected dataset.
Similarly, Figure 15 presents the confusion matrix obtained using the WISDM dataset in the centralized setting. It can be seen that the proposed transformer achieved almost perfect classification results for all classification activities.
From the confusion matrices, it can be observed that mis-
classifications occurred more between activities that involve
similar body movements, e.g., swiping and vacuuming. It
can also be observed that, even for these similar but different
Table 4
Classification performance of the clients using data derived
from a non-IID distribution: Client 1 with 50% less “Standing”
data.
Activity Precision Recall F1-score
Standing 100% 100% 100%
Sitting 100% 100% 100%
Walking 100% 100% 100%
Jogging 100% 100% 100%
Going Upstairs 100% 100% 100%
Going Downstairs 100% 100% 100%
Eating 100% 100% 100%
Writing 100% 100% 100%
Using Laptop 100% 100% 100%
Washing Face 100% 99.0% 99%
Washing Hand 99.0% 100% 99.0%
Swiping 95.0% 100% 98.0%
Vacuuming 100% 96.0% 98.0%
Dusting 100% 98.0% 99.0%
Brushing Teeth 100% 100% 100%
Table 5
Classification performance of the clients using data derived
from a non-IID distribution: Client 2 with 50% less “Sitting”
data.
Activity Precision Recall F1-score
Standing 100% 100% 100%
Sitting 98.0% 100% 99.0%
Walking 100% 98.0% 99.0%
Jogging 99.0% 99.0% 99.0%
Going Upstairs 98.0% 96.0% 97.0%
Going Downstairs 95.0% 99.0% 97.0%
Eating 98.0% 98.0% 97.0%
Writing 100% 96.0% 99.0%
Using Laptop 96.0% 100% 98.0%
Washing Face 99.0% 100% 99.0%
Washing Hand 97.0% 97.0% 97.0%
Swiping 96.0% 95.0% 96.0%
Vacuuming 93.0% 94.0% 93.0%
Dusting 96.0% 92.0% 94.0%
Brushing Teeth 97.0% 98.0% 97.0%
human activities, the proposed transformer achieved a very
good performance.
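For reference, a row-normalized confusion matrix like those in Figures 13-15 can be computed with scikit-learn (the labels below are placeholders):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2])   # placeholder ground-truth labels
y_pred = np.array([0, 0, 1, 2, 2, 2, 1])   # placeholder predictions

cm = confusion_matrix(y_true, y_pred)
# Row-normalize so each row gives the percentage breakdown of one true
# class, matching the per-cell percentages shown in the figures.
cm_pct = cm / cm.sum(axis=1, keepdims=True) * 100
print(cm)
print(np.round(cm_pct, 2))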
5.5. Comparison with State-of-the-Art Methods
In this subsection, we compare the proposed transformer
with existing state-of-the-art methods. Table 12 compares
two key features of our proposed transformer with methods
based on RNNs and CNNs. RNN-based models do not allow
parallelization during training because of their sequential
nature, which makes the model computationally slow and
expensive. CNN-based methods can perform parallel com-
putation, but they are computationally expensive because
of the convolution function. Our new method based on the
proposed transformer completely eliminates recurrence and
Figure 13: The confusion matrix obtained with the final global transformer model in a federated setting with five clients, using data derived from a non-IID distribution.
Figure 14: The confusion matrix obtained with the transformer model in a centralized setting using 5-fold cross-validation, using the balanced collected data.
Table 6
Classification performance of the clients using data derived
from a non-IID distribution: Client 3 with 50% less “Walking”
data.
Activity Precision Recall F1-score
Standing 100% 100% 100%
Sitting 100% 100% 100%
Walking 99.0% 100% 100%
Jogging 98.0% 98.0% 98.0%
Going Upstairs 98.0% 97.0% 98.0%
Going Downstairs 98.0% 98.0% 98.0%
Eating 95.0% 97.0% 96.0%
Writing 99.0% 98.0% 99.0%
Using Laptop 96.0% 97.0% 96.0%
Washing Face 96.0% 99.0% 97%
Washing Hand 98.0% 95.0% 96.0%
Swiping 96.0% 92.0% 94.0%
Vacuuming 95.0% 97.0% 96.0%
Dusting 96.0% 95.0% 96.0%
Brushing Teeth 97.0% 98.0% 98.0%
Table 7
Classification performance of the clients using data derived
from a non-IID distribution: Client 4 with 50% less “Jogging”
data.
Activity Precision Recall F1-score
Standing 100% 100% 100%
Sitting 100% 97.0% 99.0%
Walking 97.0% 100% 99.0%
Jogging 99.0% 100% 100%
Going Upstairs 99.0% 98.0% 98.0%
Going Downstairs 98.0% 99.0% 98.0%
Eating 97.0% 97.0% 97.0%
Writing 99.0% 98.0% 98.0%
Using Laptop 97.0% 99.0% 98.0%
Washing Face 98.0% 97.0% 98.0%
Washing Hand 96.0% 98.0% 97.0%
Swiping 96.0% 94.0% 95.0%
Vacuuming 93.0% 94.0% 94.0%
Dusting 95.0% 94.0% 95.0%
Brushing Teeth 99.0% 98.0% 98.0%
convolution and replaces them with a self-attention mech-
anism to establish dependencies between the input and the
output. It is the first type of architecture to rely entirely
on attention to calculate representations of the input and
the output. In addition, transformers leave more room for
parallelization. RNNs and CNNs use a large number of
parameters (usually hundreds of thousands or even more),
but the proposed transformer only uses 14,697 parame-
ters. Moreover, unlike traditional transformers, the proposed
transformer uses a single patch instead of multiple patches. Therefore, the proposed transformer is also much
more computationally efficient.
We also compared the performance of the TransFed global model and the centralized classifier based on the proposed
Table 8
Classification performance of the clients using data derived
from a non-IID distribution: Client 5 with 50% less “Going
Upstairs” data.
Activity Precision Recall F1-score
Standing 100% 100% 100%
Sitting 99.0% 97.0% 98.0%
Walking 97.0% 98.0% 98.0%
Jogging 99.0% 100% 99.0%
Going Upstairs 100% 100% 100%
Going Downstairs 99.0% 99.0% 99.0%
Eating 96.0% 96.0% 96.0%
Writing 100% 97.0% 98.0%
Using Laptop 96.0% 100% 98.0%
Washing Face 97.0% 97.0% 97.0%
Washing Hand 95.0% 95.0% 95.0%
Swiping 94.0% 95.0% 95.0%
Vacuuming 97.0% 96.0% 96.0%
Dusting 96.0% 97.0% 96.0%
Brushing Teeth 99.0% 98.0% 98.0%
Table 9
Classification performance of the global model after federated
averaging using data derived from a non-IID distribution.
Activity Precision Recall F1-score
Standing 100% 100% 100%
Sitting 100% 100% 100%
Walking 99.0% 100% 100%
Jogging 99.0% 99.0% 99.0%
Going Upstairs 96.0% 94.0% 95.0%
Going Downstairs 97.0% 97.0% 97.0%
Eating 99.0% 97.0% 98.0%
Writing 100% 99.0% 100%
Using Laptop 99.0% 100% 99.0%
Washing Face 100% 98.0% 99.0%
Washing Hand 93.0% 100% 97.0%
Swiping 97.0% 90.0% 94.0%
Vacuuming 90.0% 98.0% 94.0%
Dusting 99.0% 94.0% 97.0%
Brushing Teeth 96.0% 99.0% 97.0%
transformer with those of selected state-of-the-art HAR methods in the literature [14,24,25], including two working in a centralized setting and one in the federated setting. These three state-of-the-art methods were chosen because their performance results were reported using the WISDM dataset, which allows a direct comparison of the performance results. Table 13 shows the comparison with the three selected state-of-the-art methods. Our proposed methods achieved a substantial improvement in terms of accuracy when trained and tested using both our new dataset and the WISDM dataset. Specifically, our method achieved accuracies of 98.74% and 99.14% in the federated and centralized settings, respectively, using our collected dataset. Furthermore, using the WISDM dataset in the centralized setting, it achieved an overall accuracy of 98.89%.
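The global model reported in Table 9 is obtained by federated averaging of the clients' locally trained weights, in the spirit of FedAvg [27]. A minimal sketch of size-weighted parameter averaging follows; the function name, dummy weights and client sizes are illustrative.

```python
# Minimal sketch: size-weighted federated averaging (FedAvg-style) of
# client model parameters, as used for the global model in Table 9.
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average per-layer parameters, weighting clients by local data size."""
    total = float(sum(client_sizes))
    return [sum(w * (n / total) for w, n in zip(layer, client_sizes))
            for layer in zip(*client_weights)]

# Dummy example: 5 clients, each holding two parameter tensors.
rng = np.random.default_rng(0)
client_weights = [[rng.normal(size=(8, 4)), rng.normal(size=(4,))]
                  for _ in range(5)]
client_sizes = [151, 135, 134, 153, 139]  # illustrative local sample counts
global_weights = federated_average(client_weights, client_sizes)
# In a Keras workflow these would be loaded back with
# global_model.set_weights(global_weights).
```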
Table 10
Classification performance of the proposed transformer in the
centralized setting using the WISDM dataset [31].
Activity Precision Recall F1-score
Walking 100% 100% 100%
Jogging 100% 100% 100%
Going Upstairs 95.0% 96.0% 96.0%
Going Downstairs 98.0% 99.0% 99.0%
Sitting 100% 100% 100%
Standing 97.0% 97.0% 97.0%
Table 11
Classification performance of the proposed transformer in the
centralized setting using our collected dataset.
Activity Precision Recall F1-score
Standing 100% 100% 100%
Sitting 100% 99.0% 99.0%
Walking 99.0% 99.0% 99.0%
Jogging 98.0% 98.0% 98.0%
Going Upstairs 99.0% 95.0% 97.0%
Going Downstairs 97.0% 99.0% 98.0%
Eating 99.0% 98.0% 98.0%
Writing 98.0% 99.0% 98.0%
Using Laptop 97.0% 99.0% 98.0%
Washing Face 100% 97.0% 99.0%
Washing Hand 96.0% 100% 98.0%
Swiping 95.0% 90.0% 93.0%
Vacuuming 93.0% 98.0% 95.0%
Dusting 99.0% 93.0% 96.0%
Brushing Teeth 96.0% 100% 98.0%
[Figure omitted: 6-by-6 confusion matrix of counts and row-normalized percentages. Rows (true) and columns (predicted): standing, sitting, walking, jogging, going upstairs, going downstairs.]
Figure 15: The confusion matrix obtained with the transformer model in the centralized setting, using the WISDM dataset.
Hence, from Table 13 it can be seen that the proposed transformer outperforms the existing state-of-the-art methods in both centralized and federated settings.
Table 12
Comparison of our transformer-based approach with those
based on RNNs and CNNs, in terms of computation costs.
Method Parallelization Computationally Expensive
RNNs No Yes
CNNs Yes Yes
Transformers Yes No
Table 13
Comparison with selected state-of-the-art methods for HAR
classification.
Scheme Centralized or Federated Number of Activities Accuracy
[14]ᵃ Federated 6 89.00%
[24]ᵃ Centralized 6 97.63%
[25]ᵃ Centralized 6 96.70%
Proposedᵃ Centralized 6 98.89%
Proposedᵇ Federated 15 98.74%
Proposedᵇ Centralized 15 99.14%
ᵃ Using the WISDM dataset.
ᵇ Using our new dataset.
6. Further Discussions
In summary, our experimental results showed that the proposed method is a step forward in protecting user privacy through transformer-based FL, using very few parameters while providing high accuracy. However, although FL can provide better privacy protection of local data against the global server, various security and privacy attacks on simple FL architectures have been proposed [32]. One future research direction is therefore to investigate how the proposed method can be further hardened against such known security and privacy attacks.
In addition to the need to further improve data security and privacy, a major limitation of our work is that our newly constructed dataset was based on only five human participants and an artificially constructed home care scenario. It is therefore important to evaluate the proposed work in more realistic home healthcare settings, and ultimately to move the proposed HAR classification method into real-world usage. For such future work, the involvement of patients, carers and health professionals is vital at all stages of the research process, following well-established standard procedures and guidelines such as the UK Standards for Public Involvement [33]. One important aspect of this real-world-facing research is to carefully evaluate the acceptability and usability of the body sensors used, to ensure that they are the right ones for the target patients. This suggests that different sets of body sensors may have to be used for patients with different conditions or preferences, so we need to investigate how the proposed HAR classifier changes with respect to the sensor set. This includes scenarios where no body sensors can be placed on a patient, so computer-vision-based approaches relying on monitoring cameras and microphones will need investigation, involving very different machine learning models from those used in this paper.
7. Conclusions
In this paper, we proposed a novel lightweight single-patch transformer for HAR. The purpose of the lightweight transformer is to provide state-of-the-art classification performance while remaining computationally efficient. We examined the performance of the proposed transformer-based HAR classifier in both federated and centralized settings, under a non-IID data distribution. To test its performance in the federated setting, we developed a framework called TransFed and designed a testbed to collect data from five human participants who conducted 15 different activities in a simulated home environment.
Our extensive experimental results confirmed that the proposed transformer outperforms a number of state-of-the-art CNN- and RNN-based HAR classifiers, while providing a standardized and automated way to accomplish the feature learning step.
Furthermore, the federated setting used by our proposed
framework TransFed can help improve data privacy, which
is a major issue in centralized approaches.
Acknowledgments
This research work was supported by the I-SITE Univer-
sité Lille Nord-Europe 2021 of France under grant No. I-
COTKEN-20-001-TRAN-RAZA.
CRediT authorship contribution statement
Ali Raza: Conceptualization of this study, Methodol-
ogy, Software, Data analysis, Paper writing. Kim Phuc
Tran: Conceptualization of this study, Methodology, Paper
writing. Ludovic Koehl: Conceptualization of this study,
Methodology, Paper writing. Shujun Li: Conceptualization
of this study, Methodology, Paper writing. Xianyi Zeng:
Conceptualization of this study, Methodology, Paper writ-
ing. Khaled Benzaidi: Conceptualization of this study,
Methodology, Data collection. Sarah Hotham: Providing
domain knowledge, Paper writing.
References
[1] K. Adnan, R. Akbar, S. W. Khor, A. B. A. Ali, Role and challenges of
unstructured big data in healthcare, in: Data Management, Analytics
and Innovation: Proceedings of ICDMAI 2019, Volume 1, Springer,
2020, pp. 301–323. doi:10.1007/978-981-32-9949-8_22.
[2] M. Wu, J. Luo, Wearable technology applications in healthcare:
A literature review, Online Journal of Nursing Informatics 23 (3)
(2019).
URL https://www.himss.org/resources/wearable-technology-applications-healthcare-literature-review
[3] Apple Inc., Use Driving Focus on your iPhone to concentrate on the
road, Online document (2021).
URL https://support.apple.com/en-gb/HT208090
[4] S. Yao, S. Hu, Y. Zhao, A. Zhang, T. Abdelzaher, DeepSense: a
unified deep learning framework for time-series mobile sensing data
processing, in: Proceedings of the 26th International Conference
on World Wide Web, International World Wide Web Conference
Committee (IW3C2), 2017, pp. 351–360. doi:10.1145/3038912.3052577.
[5] F. J. Ordóñez, D. Roggen, Deep convolutional and LSTM recurrent
neural networks for multimodal wearable activity recognition, Sen-
sors 16 (1) (2016) 115:1–115:25. doi:10.3390/s16010115.
[6] N. Y. Hammerla, S. Halloran, T. Plötz, Deep, convolutional, and
recurrent models for human activity recognition using wearables,
arXiv:1604.08880 [cs.LG] (2016). doi:10.48550/arXiv.1604.08880.
[7] X. Liu, L. Xie, Y. Wang, J. Zou, J. Xiong, Z. Ying, A. V. Vasilakos,
Privacy and security issues in deep learning: A survey, IEEE Access
9 (2020) 4566–4593. doi:10.1109/ACCESS.2020.3045078.
[8] E. De Cristofaro, A critical overview of privacy in machine learning,
IEEE Security & Privacy 19 (4) (2021) 19–27. doi:10.1109/MSEC.2021.3076443.
[9] D. S. Char, N. H. Shah, D. Magnus, Implementing machine learning
in health care – addressing ethical challenges, New England Journal
of Medicine 378 (11) (2018) 981–983. doi:10.1056/NEJMp1714229.
[10] T. Li, A. K. Sahu, A. Talwalkar, V. Smith, Federated learning:
Challenges, methods, and future directions, IEEE Signal Processing
Magazine 37 (3) (2020) 50–60. doi:10.1109/MSP.2020.2975749.
[11] J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, F. Wang, Federated
learning for healthcare informatics, Journal of Healthcare Informatics
Research 5 (1) (2021) 1–19. doi:10.1007/s41666-020-00082-4.
[12] R. Liu, Y. Cao, M. Yoshikawa, H. Chen, FedSel: Federated SGD
under local differential privacy with top-k dimension selection, in:
Database Systems for Advanced Applications: 25th International
Conference, DASFAA 2020, Jeju, South Korea, September 24–27,
2020, Proceedings, Part I, Springer, 2020, pp. 485–501. doi:10.1007/978-3-030-59410-7_33.
[13] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman,
V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan,
T. Van Overveldt, D. Petrou, D. Ramage, J. Roselander, Towards
federated learning at scale: System design, arXiv:1902.01046 [cs.LG]
(2019). doi:10.48550/arXiv.1902.01046.
[14] K. Sozinov, V. Vlassov, S. Girdzijauskas, Human activity recognition
using federated learning, in: Proceedings of the 8th IEEE International
Conference on Big Data and Cloud Computing, IEEE, 2018, pp.
1103–1111. doi:10.1109/BDCloud.2018.00164.
[15] W. Yin, K. Kann, M. Yu, H. Schütze, Comparative study of CNN
and RNN for natural language processing, arXiv:1702.01923 [cs.CL]
(2017). doi:10.48550/arXiv.1702.01923.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in:
Advances in Neural Information Processing Systems 30 (NIPS
2017), NeurIPS, 2017, pp. 5998–6008.
URL https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[17] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi,
P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer,
P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger,
M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural
language processing, in: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing: System Demon-
strations, ACL, 2020, pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6.
[18] X. Yang, J. Bian, W. R. Hogan, Y. Wu, Clinical concept extraction
using transformers, Journal of the American Medical Informatics
Association 27 (12) (2020) 1935–1942. doi:10.1093/jamia/ocaa189.
[19] A. Mannini, A. M. Sabatini, Machine learning methods for classifying
human physical activity from on-body accelerometers, Sensors 10 (2)
(2010) 1154–1175. doi:10.3390/s100201154.
[20] Y. Chen, Y. Xue, A deep learning approach to human activity recogni-
tion based on single accelerometer, in: Proceedings of the 2015 IEEE
International Conference on Systems, Man, and Cybernetics, IEEE,
2015, pp. 1488–1492. doi:10.1109/SMC.2015.263.
[21] F. Attal, S. Mohammed, M. Dedabrishvili, F. Chamroukhi, L. Oukhel-
lou, Y. Amirat, Physical human activity recognition using wear-
able sensors, Sensors 15 (12) (2015) 31314–31338. doi:10.3390/s151229858.
[22] N. Sikder, A.-A. Nahid, KU-HAR: An open dataset for heterogeneous
human activity recognition, Pattern Recognition Letters 146 (2021)
46–54. doi:10.1016/j.patrec.2021.02.024.
[23] M. M. Hassan, M. Z. Uddin, A. Mohamed, A. Almogren, A robust
human activity recognition system using smartphone sensors and
deep learning, Future Generation Computer Systems 81 (2018) 307–
313. doi:10.1016/j.future.2017.11.029.
[24] A. Ignatov, Real-time human activity recognition from accelerometer
data using Convolutional Neural Networks, Applied Soft Computing
62 (2018) 915–922. doi:10.1016/j.asoc.2017.09.027.
[25] A. Murad, J.-Y. Pyun, Deep recurrent neural networks for human
activity recognition, Sensors 17 (11) (2017) 2556:1–2556:17. doi:10.3390/s17112556.
[26] X. Zhou, W. Liang, K. I.-K. Wang, H. Wang, L. T. Yang, Q. Jin,
Deep-learning-enhanced human activity recognition for internet of
healthcare things, IEEE Internet of Things Journal 7 (7) (2020) 6429–
6438. doi:10.1109/JIOT.2020.2985082.
[27] H. B. McMahan, E. Moore, D. Ramage, B. A. y Arcas, Federated
learning of deep networks using model averaging, arXiv:1602.05629
[cs.LG] (2016). doi:10.48550/arXiv.1602.05629.
[28] Q. Yang, Y. Liu, T. Chen, Y. Tong, Federated machine learning:
Concept and applications, ACM Transactions on Intelligent Systems
and Technology 10 (2) (2019) 12:1–12:19. doi:10.1145/3298981.
[29] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-
training of deep bidirectional transformers for language understand-
ing, arXiv:1810.04805 [cs.CL] (2018). doi:10.48550/arXiv.1810.04805.
[30] B. Li, W. Cui, W. Wang, L. Zhang, Z. Chen, M. Wu, Two-stream con-
volution augmented transformer for human activity recognition, in:
Proceedings of the 2021 AAAI Conference on Artificial Intelligence,
Vol. 35, AAAI, 2021, pp. 286–293.
URL https://ojs.aaai.org/index.php/AAAI/article/view/16103
[31] J. R. Kwapisz, G. M. Weiss, S. A. Moore, Activity recognition using
cell phone accelerometers, ACM SIGKDD Explorations Newsletter
12 (2) (2011) 74–82. doi:10.1145/1964897.1964918.
[32] J. J. Hathaliya, S. Tanwar, An exhaustive survey on security and
privacy issues in Healthcare 4.0, Computer Communications 153
(2020) 311–335. doi:10.1016/j.comcom.2020.02.018.
[33] Chief Scientist Office (Scotland), Health and Care Research Wales, the Public Health Agency (Northern Ireland) and the National Institute for Health Research (England), UK, UK standards for public involvement in research, Website (2016).
URL https://sites.google.com/nihr.ac.uk/pi-standards/home