EDITOR: Schahram Dustdar, dustdar@dsg.tuwien.ac.at
DEPARTMENT: INTERNET OF THINGS, PEOPLE, AND PROCESSES
Toward Distributed, Global, Deep Learning
Using IoT Devices
Bharath Sudharsan, National University of Ireland Galway, Galway H91 TK33, Ireland
Pankesh Patel, National University of Ireland Galway, Galway H91 TK33, Ireland
John Breslin, National University of Ireland Galway, Galway H91 TK33, Ireland
Muhammad Intizar Ali, Dublin City University, Dublin 9, Ireland
Karan Mitra, Luleå University of Technology, 97187 Luleå, Sweden
Schahram Dustdar, TU Wien, 1040 Vienna, Austria; Cardiff University, Cardiff CF10 3AT, U.K.
Omer Rana, Cardiff University, Cardiff CF24 3AA, U.K.
Prem Prakash Jayaraman, Swinburne University of Technology, Hawthorn VIC 3122, Australia
Rajiv Ranjan, Newcastle University, Newcastle upon Tyne NE1 7RU, U.K.
Deep learning (DL) using large-scale, high-quality IoT datasets can be computationally expensive. Utilizing such datasets to produce a problem-solving model within a reasonable time frame requires a scalable distributed training platform/system. We present a novel approach that trains a single DL model on the hardware of thousands of mid-sized IoT devices across the world, rather than using a GPU cluster available within a data center. We analyze the scalability and model convergence of the subsequently generated model and identify three bottlenecks: high computational load, time-consuming dataset loading I/O, and the slow exchange of model gradients. To highlight research challenges for globally distributed DL training and classification, we consider a case study from the video data processing domain. A need for a two-step deep compression method, which increases the training speed and scalability of the DL training process, is also outlined. Our initial experimental validation shows that the proposed method is able to improve the tolerance of the distributed training process to varying internet bandwidth, latency, and Quality of Service metrics.
IoT datasets are now being produced at an ever-increasing rate, as emerging IoT frameworks and libraries have simplified the process of continuous monitoring, real-time edge-level processing, and encrypted storage of the generated multimodal image, audio, and sensor data. Such data are generated by a variety of hardware systems operating in indoor and outdoor infrastructures, including smart factory floors, AR/VR experience centers, smart city sensors, etc. In order to complete training in a reasonable time when using such large-scale, high-quality IoT datasets that have been collected over decades, we need a scalable distributed training system that can efficiently harness the hardware resources of millions of IoT devices. In particular, such a system should take account of the current network connectivity between these devices and be able to collectively train to produce the final problem-solving deep learning (DL) models at very high speeds.
Instead of following the traditional approach that loads such datasets and trains a model locally within a GPU cluster or a data center, we utilize distributed training on multiple IoT devices because:

i) Considering the GPU-to-IoT-device ratio, IoT devices are far greater in number; market estimates show that roughly 50 billion microcontroller units (MCUs) and small CPU chips were shipped in 2020, which far exceeds other processors such as GPUs (only 100 million units sold);
ii) Not every modern household owns a GPU, yet a household typically has around a dozen medium-resource IoT devices which, when efficiently connected together within a home network, can train machine learning models without depending on Cloud or GPU servers that can perform the same training task at very high speeds, but at additional cost;
iii) In most real-life IoT scenarios, the training dataset used to produce a learned model can often be hard to source due to GDPR and privacy concerns. In such cases, we need an algorithm that directly utilizes the capability of the IoT device hardware without disturbing the routine operation of the device. This algorithm, when deployed across user devices, should make use of locally generated data to collectively train a model without storing live data on a central server, thus locally producing learned models from data without violating privacy protection regulations;
iv) Training advanced DL models on a single GPU might take days or even weeks to converge. Hence, if we design and use an intelligent algorithm that can tolerate high latency and low bandwidth constraints, we can collectively harness the idle hardware resources of thousands of mid-sized IoT devices and complete training at very high speeds. For example, at the time of writing, the latest GEFORCE RTX 2080 Ti GPU has 11 GB RAM but costs US $1500, whereas one Alexa smart speaker device has 2 GB RAM, and efficiently connecting 20 such devices can collectively pool 40 GB of RAM. In this way, we can complete training faster on such resources, if coordinated correctly, compared to an expensive GPU and at a comparatively smaller investment, particularly by utilizing the idle capacity of smart IoT devices that exist across the world.
The hardware of IoT devices is not designed for DL workloads. Resource-friendly model training algorithms like Edge2Train [1] could be used in distributed setups for training models on the MCUs and limited-capacity CPUs of IoT devices. We identify challenges involved with DL model training on the hardware of common IoT devices such as video doorbells, smart speakers, cameras, etc. To overcome some of these challenges, we also present a two-step deep compression method that increases the training speed and scalability of the DL training process.
Outline. For globally distributed DL model training scenarios, we present our bottleneck analysis in the section "Distributed Global Training: Research Challenges." The section "Proposed Two-Step Deep Compression Method and Initial Experimental Results" contains our solution to address the challenges highlighted there. We then conclude by providing greater context for future work.
DISTRIBUTED GLOBAL TRAINING:
RESEARCH CHALLENGES
In the large-scale distributed/collaborative learning domain, distributed training has seen limited adoption, especially when the target is to train a DL model that can perform video analytics tasks such as object detection (e.g., detecting FedEx or USPS vehicles for package theft prevention), detecting and recognizing unsafe objects such as a gun to reduce crime, and identifying known/unknown faces. This is because:

i) Models that can learn from video datasets have a dense (i.e., large number of parameters and layers) architecture design that requires significant computational resources when compared to models designed to learn from image or audio datasets. For example, the popular ResNet-50 model trained using a 2-D image dataset consumes around 4 GFLOPs, whereas a ResNet-50 Inflated 3-D model contains 3-D convolutional kernels to model temporal information in a video, consuming 30 GFLOPs, i.e., more than 7 times larger than the previous case;
ii) These datasets can be significant in size, hence consuming high internet bandwidth when loading video from a (central) data server to training devices that are geographically distributed. For example, the ImageNet dataset has 1.28M images, whereas the Kinetics-400 video dataset has 63M frames, i.e., 50 times larger; and
iii) Finally, complex models trained on such datasets can have millions of parameters and gradients that need to be quickly exchanged (with minimum latency) among devices during distributed training, which again increases internet traffic (and charges to consumers) and, more critically, can lead to slow convergence when devices involved in training suffer from network latency issues.

In short, the bottlenecks are due to the demand for high computational power, the time overhead associated with dataset loading I/O, and the slow exchange of model gradients. In the rest of this section, each of these three bottlenecks is explained in more detail.
High FLOPs Consumption
Unlike 2-D image recognition models, the input/activation of video analytics DL networks has five dimensions [N, T, C, H, W], where N is the batch size, T refers to temporal timestamps, C is the number of channels, and H and W are the spatial resolution. To reduce computational overhead and network congestion, we could train using the same target dataset by applying a 2-D CNN to each image frame of the video. Using such an approach, however, the temporal relationship between the frames cannot be modeled/learned, which is crucial to understanding the (labeled) scenes in the video datasets. Hence, inflating the 2-D to 3-D convolution layers produces an I3-D model, which grows the model size by k times. For distributed learning of spatio-temporal data, models with 3-D convolutions, in addition to their model size demands, also suffer from having a large number of parameters, which is the main reason that training and communication slow down, even within a GPU cluster, and in real-world networks. Consequently, training will stall when unexpected network issues are encountered.
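To make this cost gap concrete, the following back-of-the-envelope Python sketch counts the multiply-accumulate operations of a single convolution layer before and after inflation. The layer dimensions are illustrative assumptions, not exact ResNet-50 figures.

# Back-of-the-envelope FLOP count for one convolution layer, illustrating why
# inflating 2-D kernels to 3-D (I3-D) multiplies the compute cost.
# All layer sizes below are illustrative assumptions.

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    # one multiply-accumulate counted as 2 FLOPs
    return 2 * c_in * c_out * k * k * h_out * w_out

def conv3d_flops(c_in, c_out, k, k_t, t_out, h_out, w_out):
    # the inflated kernel adds a temporal kernel size k_t and a temporal output t_out
    return 2 * c_in * c_out * k * k * k_t * t_out * h_out * w_out

if __name__ == "__main__":
    f2d = conv2d_flops(c_in=256, c_out=256, k=3, h_out=14, w_out=14)
    f3d = conv3d_flops(c_in=256, c_out=256, k=3, k_t=3, t_out=8, h_out=14, w_out=14)
    print(f"2-D conv: {f2d / 1e9:.2f} GFLOPs, inflated 3-D conv: {f3d / 1e9:.2f} GFLOPs "
          f"({f3d / f2d:.0f}x)")

The ratio grows with both the temporal kernel size and the temporal output length, which is why a single inflation step can turn a comfortably sized 2-D workload into one that overwhelms IoT-class hardware.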
Expensive Dataset Loading I/O
Video network architectures available in ML hubs and marketplaces (Google AI Hub and TensorFlow Hub) usually sample many frames from video datasets and use them as input during learning (i.e., top models [2] sample 32 and 64 frames). Then, they progressively reduce the temporal resolution by applying temporal pooling techniques [3]. Another, orthogonal, approach is to design networks that sample and use fewer frames (e.g., eight frames) during learning and maintain the same temporal resolution to retain information from the video dataset. In both designs, the overall computational requirements are similar, but the former involves additional sampling and full dataset loading steps, increasing the dataset loading I/O at the data server, while making data loading on many distributed IoT devices challenging when considering the limited memory and internet bandwidth available in practice.
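The rough Python sketch below compares the per-clip data volume of the two sampling designs. The frame geometry, frame counts, and 8-bit storage are illustrative assumptions.

# Rough comparison of per-clip data volume for the two sampling designs above.
# Frame geometry and counts are illustrative assumptions.

def clip_bytes(num_frames, height=224, width=224, channels=3, bytes_per_value=1):
    return num_frames * height * width * channels * bytes_per_value

dense = clip_bytes(32)   # sample many frames, reduce later via temporal pooling
sparse = clip_bytes(8)   # sample few frames, keep the temporal resolution
print(f"dense sampling: {dense / 1e6:.1f} MB/clip, sparse: {sparse / 1e6:.1f} MB/clip, "
      f"I/O saving: {dense / sparse:.0f}x")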
Slow Exchange of Model Gradients
During training, maintaining good scalability, low latency, and a high-bandwidth internet connection is mandatory, at least during gradient exchange [3]. Existing large-scale distributed learning studies and frameworks require high-end InfiniBand network infrastructure where bandwidth ranges from 10 to 100 Gb/s, with a 1 ms latency. Even if we increase bandwidth by stacking (aggregating) hardware, latency improvements are still difficult to achieve. Contrary to such assumptions, latency in real-world scenarios can be further exacerbated by queueing delay in switches and indirect routing between service providers. This bottleneck makes distributed training scale poorly in real-world network conditions, particularly when transmitting datasets in addition to the gradients.
Handling Dataset I/O Efficiency

Video datasets are usually stored in a high-performance storage system (HPSS) or a central data server that is shared across all worker nodes; in our case these are IoT devices distributed across the world. Although HPSS systems have good sequential I/O performance, their random-access performance is inferior, causing bottlenecks for large data traffic. Most existing I3-D models sample at a high frame rate (within a video), then perform downsampling to reduce the overall data size. Given the distributed training scenario being considered, we argue that such designs waste bandwidth. Consequently, research needs to consider novel data approximation, sampling, and filtering methods. For example, in the context of video datasets, one can develop a method to identify videos that have multiple similar frames (i.e., nearby frames contain similar information), then load and share only the nonredundant frames during distributed training. Similarly, for other datasets associated with images and sensor readings, we recommend filtering or downsampling the data without losing information, then distributing it during training. Therefore, any approximation, sampling, and filtering method will need to be correctly parameterized while considering the resource-constrained nature of IoT devices.
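A minimal sketch of the nonredundant-frame idea is given below, assuming a simple mean-absolute-difference test between consecutive frames; the threshold and the synthetic clip are placeholders that a real deployment would tune and replace.

import numpy as np

# Keep a frame only when it differs enough from the last kept frame, so that only
# nonredundant frames are loaded and shared during distributed training.

def select_nonredundant_frames(frames, threshold=8.0):
    """frames: array of shape (T, H, W, C), uint8. Returns indices worth loading/sharing."""
    kept = [0]
    last = frames[0].astype(np.float32)
    for t in range(1, len(frames)):
        cur = frames[t].astype(np.float32)
        if np.mean(np.abs(cur - last)) > threshold:
            kept.append(t)
            last = cur
    return kept

# Usage: a synthetic clip in which only every 10th frame changes noticeably.
clip = np.zeros((50, 64, 64, 3), dtype=np.uint8)
clip[::10] += np.random.randint(0, 255, (5, 64, 64, 3), dtype=np.uint8)
print(select_nonredundant_frames(clip))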
Variable Training and Convergence Speed
Research has shown that naive synchronous stochastic gradient descent (SSGD) achieves poor scalability in real-world distributed DL model training, making training using 100 distributed GPUs slower than training on one GPU. Unlike SSGD, asynchronous SGD (ASGD) relaxes synchronization, enabling its use across many real-world applications. D2 [10] and AD-PSGD [11] perform only partial synchronization in each update to overcome latency issues. Such large-scale training takes advantage of data parallelism by increasing the number of contributing devices, but at the cost of data transfer between devices (e.g., exchange of parameters), which can be time consuming, especially when many devices are pooled. This can dwarf the savings in computation time, producing a low computation-to-communication ratio. Moreover, such distributed learning approaches do not scale well when network latency is high. Additionally, lower network bandwidth, expensive/limited mobile data plans, and intermittent network connections, which are all common for mobile devices, also impact our training scenarios. Hence, if we use SSGD, ASGD, D2, AD-PSGD (or any such native algorithms) across a large number of medium-resource IoT devices, the target DL model might never converge to a suitable level of accuracy. There is therefore a need for a method that can efficiently communicate with a large number of heterogeneous IoT devices, even under real-world internet latency and bandwidth constraints, and complete training at high speeds. As SSGD, ASGD, D2, and AD-PSGD can all be adapted to learn a globally distributed model, there is a need to develop benchmarking techniques that compare them against common evaluation metrics including average accuracy, training time, and convergence speed. Eventually, these evaluation metrics will need to be formulated as a unified distributed-training performance model. Metaheuristic techniques such as genetic programming and particle swarm optimization could be used to solve this model and find feasible (Pareto-optimal) solutions for improving performance.
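As a starting point, such a unified performance model could be as simple as a weighted score over the common metrics, as in the Python sketch below; the candidate numbers and weights are placeholders rather than measured results.

# Illustrative sketch of a unified performance model: score each candidate
# algorithm on common metrics and rank them. All numbers are placeholders.

candidates = {
    #  name       (accuracy, train_hours, epochs_to_converge)
    "SSGD":     (0.74, 40.0, 90),
    "ASGD":     (0.72, 18.0, 110),
    "AD-PSGD":  (0.73, 14.0, 100),
}

def score(acc, hours, epochs, w_acc=0.6, w_time=0.25, w_conv=0.15):
    # Higher is better: reward accuracy, penalize training time and slow convergence.
    return w_acc * acc - w_time * (hours / 100.0) - w_conv * (epochs / 1000.0)

ranked = sorted(candidates.items(), key=lambda kv: score(*kv[1]), reverse=True)
for name, metrics in ranked:
    print(f"{name:8s} score={score(*metrics):.3f}")

A metaheuristic search would then explore the weight space (or richer, nonlinear formulations) to find Pareto-optimal operating points rather than a single fixed weighting.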
Handling Network Uncertainties
Distributed learning can be impacted by the properties of the access links that connect IoT devices (sensors and actuators) to edge gateways and/or cloud nodes. These uncertainties include time-varying connectivity, network unavailability, and time-varying traffic patterns; research has indicated that wireless network bandwidth and availability fluctuate dramatically due to weather conditions, signal attenuation, and channel interference. For instance, consider the use of SSGD during a distributed training process, where only one gradient transmission occurs in one iteration. This aspect can worsen with an increase in the number of transmissions, and if previously sent gradients arrive late along with recent gradients (late arrival due to network congestion). The second issue we expect is a large variance in latency, which is common in real-world IoT networks, especially where devices have long-distance connections and communicate via a range of networks, e.g., long-range/low-power communications using LoRa-WAN and NarrowBand-IoT, and more powerful high-bandwidth WiFi and 4G/5G radios. While we can aim to maintain a low average latency by choosing and involving only IoT devices with a stable internet connection, changes in device network connectivity due to mobility (e.g., when the IoT device is placed in a car) can cause variable latency.
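One simple realization of this device-selection idea is to admit only devices whose recent round-trip-time samples are both low and stable, as in the sketch below; the sample measurements and cutoff values are illustrative assumptions.

import statistics

# Prefer devices whose measured round-trip latency is both low and stable.

def stable_devices(latency_samples_ms, max_mean=150.0, max_stdev=50.0):
    """latency_samples_ms: dict of device_id -> list of recent RTT measurements (ms)."""
    selected = []
    for device, samples in latency_samples_ms.items():
        if statistics.mean(samples) <= max_mean and statistics.pstdev(samples) <= max_stdev:
            selected.append(device)
    return selected

measurements = {
    "doorbell-eu-1": [40, 45, 42, 48],       # wired broadband, stable
    "speaker-us-3":  [120, 140, 110, 135],   # WiFi, acceptable
    "tracker-car-7": [90, 400, 60, 800],     # mobile device, highly variable
}
print(stable_devices(measurements))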
Handling Staleness Effects
Most popular distributed model training techniques (e.g., SSGD, ASGD, D2, AD-PSGD) adopt a nonsynchronous execution approach to alleviate the network communication bottleneck, which produces stale parameters, i.e., model parameters that arrive late and do not reflect the latest updates. Staleness not only slows down convergence but also degrades model performance. Despite notable contributions in distributed learning [12], [13], the effects of staleness during training can lead to model instability [14], because it is practically not feasible to monitor and control staleness in current complex IoT environments containing heterogeneous devices that use different network protocols. This challenge can be addressed by designing accuracy-guaranteeing dynamic error compensation and network coding techniques, primarily a lightweight technique that adopts a two-step process. In the first step, gradient synchronization is not performed; instead, each participating IoT device updates its part of the model with locally available gradients (i.e., local learning). In the second step, IoT devices perform gradient synchronization based on the computed averaged gradients, which takes account of the designed error compensations.
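The toy Python simulation below illustrates only the mechanics of this two-step idea under strong simplifying assumptions (a quadratic per-device objective, four simulated devices, and a residual that carries the uncovered part of each device's progress as the error-compensation term); it is not the accuracy-guaranteeing scheme itself.

import numpy as np

rng = np.random.default_rng(0)
num_devices, dim, lr = 4, 10, 0.1
w = rng.normal(size=dim)                      # shared/global model replica
residual = [np.zeros(dim) for _ in range(num_devices)]

def local_gradient(w_local, device_id):
    # Toy per-device objective: pull the model toward a device-specific target.
    target = np.full(dim, float(device_id))
    return w_local - target

for rnd in range(20):
    # Step 1: local learning only, no gradient synchronization.
    for d in range(num_devices):
        wd = w.copy()
        for _ in range(5):
            wd -= lr * local_gradient(wd, d)
        residual[d] += wd - w                 # accumulate this device's local progress
    # Step 2: synchronize using the averaged accumulated updates; whatever the
    # average does not cover stays in the residual as a simple error compensation.
    avg_update = np.mean(residual, axis=0)
    w += avg_update
    residual = [r - avg_update for r in residual]

print("global model after 20 rounds:", np.round(w, 2))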
PROPOSED TWO-STEP DEEP
COMPRESSION METHOD AND
INITIAL EXPERIMENTAL RESULTS
In this section, we present an initial approach to handle the network uncertainty and data staleness challenges in the context of distributed training of DNNs. In our distributed training scenario, we model the communication time t_c as

    t_c = latency + (model size / bandwidth).    (1)

Both latency and bandwidth are dynamic and depend on the network condition, which we cannot control. Instead, in the following, we present model size reduction techniques that can be applied to various parts of the DL model to save communication time and network traffic.
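As a quick illustration of (1), the Python sketch below estimates t_c for a few link profiles; the model size and link parameters are assumptions chosen for illustration, not measurements.

# Estimate the communication time t_c from (1) under a few link profiles.
# Model size and link parameters are illustrative assumptions.

def comm_time_seconds(latency_s, model_size_bits, bandwidth_bps):
    return latency_s + model_size_bits / bandwidth_bps

model_size_bits = 100e6 * 32          # e.g., ~100M parameters at 32-bit precision
links = {
    "data-center InfiniBand": (0.001, 50e9),
    "home broadband":         (0.030, 50e6),
    "4G uplink":              (0.060, 10e6),
}
for name, (lat, bw) in links.items():
    print(f"{name:24s} t_c = {comm_time_seconds(lat, model_size_bits, bw):8.2f} s")

The same model that crosses a data-center link in well under a second takes minutes over a constrained uplink, which is why shrinking what is transmitted matters more than any tuning of the links themselves.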
To reduce the communication bandwidth, we recommend quantizing the model gradients to low-precision values, then transmitting these to other IoT devices or servers. Popular methods include 1-bit SGD [4], which achieves a 10x speedup for speech datasets, and QSGD [5], which balances the tradeoff between model accuracy and gradient precision. Other work demonstrates the convergence of quantized-gradient training for various CNNs and RNNs. A few approaches quantize the entire model, including the gradients, before training, and a few studies use different bit sizes (e.g., DoReFa-Net [7] uses 2-bit gradients with 1-bit weights). The threshold quantization method [6] transmits gradients only when they exceed a set threshold, which in practice is hard to choose. To improve on this, a fixed proportion of positive and negative gradients was chosen [8] to update separately.
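The sketch below shows the general shape of such gradient quantization, using a sign-plus-scale (1-bit-style) scheme with error feedback; it is a minimal illustration in the spirit of the methods cited above, not a reimplementation of any of them.

import numpy as np

# 1-bit-style gradient quantization with error feedback: transmit only a sign per
# value plus one scale, and carry the quantization residual to the next step.

def quantize_with_error_feedback(grad, error):
    compensated = grad + error                   # fold in the residual from last step
    scale = np.mean(np.abs(compensated))
    quantized = scale * np.sign(compensated)     # 1 bit per value + one shared scale
    new_error = compensated - quantized          # residual carried to the next step
    return quantized, new_error

grad = np.random.randn(1000)
error = np.zeros_like(grad)
q, error = quantize_with_error_feedback(grad, error)
print("payload: 1 bit/value instead of 32, residual norm:",
      round(float(np.linalg.norm(error)), 3))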
Since quantization alone cannot compress gradients by more than a factor of 32 (from 32-bit to 1-bit values), gradient sparsification methods are also being applied and investigated in this distributed training setting. Among the studies that sparsify the gradients by gradient dropping, the method from Ba et al. [9] saved 99% of the gradient exchange while only compromising 0.3% of the BLEU score on a machine translation dataset. Some studies automatically tune this compression rate based on gradient activity and show 200x compression of fully connected layers for the ImageNet dataset.
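A minimal sketch of magnitude-based sparsification is shown below; the 1% keep ratio is an illustrative assumption, and a production scheme would also encode the surviving indices compactly.

import numpy as np

# Magnitude-based gradient sparsification: transmit only the top fraction of
# gradients by absolute value.

def sparsify_topk(grad, keep_ratio=0.01):
    k = max(1, int(keep_ratio * grad.size))
    threshold = np.partition(np.abs(grad), -k)[-k]
    mask = np.abs(grad) >= threshold
    return grad * mask, mask                     # dense-with-zeros form for clarity

grad = np.random.randn(100_000)
sparse_grad, mask = sparsify_topk(grad, keep_ratio=0.01)
print(f"transmitted {mask.sum()} of {grad.size} gradients "
      f"({100 * mask.sum() / grad.size:.1f}%)")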
From our discussions in the section "Distributed Global Training: Research Challenges," it is apparent that scalability is essential when connecting a large number of devices. To improve scalability, we need to significantly reduce the communication frequency, where the communication cost is determined by network bandwidth and latency [see (1)]. Conventional studies focus on reducing the bandwidth requirements, as the latency between GPUs inside a cluster or servers inside a data center is usually low. In contrast, in our use case, since we propose to perform the same training on IoT device hardware that is geographically distributed, latency remains an issue due to physical device separation. For instance, if we can achieve X times training speedup on Y machines, the overall distributed training scalability (defined as X/Y) increases. If we can also tolerate latency, the speedup will improve further, since high latency severely reduces scalability.
We propose a two-step method to improve live model compression during training, without altering the DL model architecture and without compromising model accuracy. Our two-step deep compression method jointly aims to increase the training speed and scalability. In particular, the first step aims to tolerate variation in real-world latency and bandwidth by sparsely transmitting only the important gradients. The second step aims to reduce the communication-to-computation ratio and improve scalability by locally accumulating gradients, then encoding them and performing a transmission only after the gradient threshold is crossed. In the rest of this section, we describe each of these steps.
In the first step, we identify the important gradients, using gradient magnitude as a simple heuristic (users can also choose other selection criteria). We accumulate these important gradients locally so as not to forget the learned information. Since this step reduces the gradient synchronization frequency by not transmitting all the gradients, as shown in Figure 1(a), the training process can tolerate latency (it does not reduce the dynamic real-world latency, since that is practically not possible). This results in increased training scalability, enabling the participation of more IoT devices to complete training at higher speeds.
In the second step, after the set threshold (dynamically derived for the model in use) for the accumulated gradients is crossed, we encode the gradients (rather than quantizing them as in previous works) and then transmit them to the other contributing devices involved in the training process or to the parameter server. As shown in Figure 1(b), this step improves scalability by reducing the communication-to-computation ratio, since the important gradients are sent not at defined intervals but only when required.

FIGURE 1. Comparing distributed training within a GPU cluster versus training using geographically distributed IoT devices. Our proposed two-step deep compression method can (a) tolerate latency and increase training speed, and (b) reduce the communication-to-computation ratio to improve scalability and reduce communication costs.

Briefly, during training, both steps work jointly to improve training speed and scalability by accumulating, encoding, and sparsely transmitting only the important gradients.
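The sketch below shows how the two steps could fit together on one participating device, under our assumptions: gradient magnitude as the importance heuristic, an L2-norm trigger as the send threshold, and a simple (index, value) pairing as a placeholder for the encoder.

import numpy as np

# Two-step compressor sketch for one participating device:
# Step 1 - select the important gradients (magnitude heuristic) and keep
#          accumulating locally so learned information is not lost;
# Step 2 - once the accumulated important gradients cross a threshold, encode
#          them as (index, value) pairs and transmit only those.
# The selection ratio, threshold, and encoding are illustrative assumptions.

class TwoStepCompressor:
    def __init__(self, keep_ratio=0.01, send_threshold=1.0):
        self.keep_ratio = keep_ratio
        self.send_threshold = send_threshold
        self.accumulated = None                  # locally accumulated gradients

    def step(self, grad):
        if self.accumulated is None:
            self.accumulated = np.zeros_like(grad)
        self.accumulated += grad                 # Step 1: accumulate locally
        k = max(1, int(self.keep_ratio * grad.size))
        cutoff = np.partition(np.abs(self.accumulated), -k)[-k]
        important = np.abs(self.accumulated) >= cutoff
        # Step 2: transmit only when the important gradients have grown enough.
        if np.linalg.norm(self.accumulated[important]) < self.send_threshold:
            return None                          # nothing sent this iteration
        indices = np.nonzero(important)[0].astype(np.uint32)
        values = self.accumulated[important].astype(np.float16)
        self.accumulated[important] = 0.0        # the rest keeps accumulating locally
        return indices, values                   # compact (index, value) encoding

compressor = TwoStepCompressor()
for it in range(5):
    packet = compressor.step(np.random.randn(100_000) * 0.01)
    print(f"iteration {it}: sent",
          "nothing" if packet is None else f"{len(packet[0])} gradients")

Running the sketch shows the intended behavior: early iterations send nothing while gradients accumulate, and later iterations transmit only a small, encoded fraction of the gradient vector.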
CONCLUSION
In this article, we presented an approach for training DL models on idle IoT devices, millions of which exist across the world. With an increase in mechanisms to connect such devices to a network, the potential for using such devices to support learning on locally collected data has increased. As data are maintained locally (and never transferred to a server), user privacy is also maintained, since the developed model can then be aggregated with other models (without the need to transfer raw data). We have identified and studied challenges associated with building such machine learning models, and presented a two-step deep compression method to improve distributed training speed and scalability.

The proposed approach can be used to interconnect DL frameworks executed on large-scale resources (such as TensorFlow on GPU clusters) with proposals from the TinyML community (studies that design resource-friendly models for embedded systems), since we enable distributed training of computationally demanding models on distributed idle IoT devices. TinyML and related approaches often only undertake inference on IoT devices and assume that a model is constructed at a data center. A learned model is subsequently modified (e.g., using quantization) to execute on a resource-constrained device (e.g., using TensorFlow Lite). Support for performing training on resource-limited devices is still limited at present, with general approaches provided in frameworks such as Federated Learning, where a surrogate model is constructed on each remote resource and the models are then aggregated on a cloud server. There is also an assumption within Federated Learning that each dataset (from a participating IoT device) follows an IID (independent and identically distributed) distribution.

Since our method can significantly compress gradients during the training of a wide range of NN architectures such as CNNs and RNNs, the proposed approach can also be utilized alongside TF-Lite and Federated Learning approaches, thereby providing the basis for a broad spectrum of decentralized and collaborative learning applications.
ACKNOWLEDGMENTS
This publication has emanated from research supported by grants from the European Union's Horizon 2020 research and innovation programme under grant agreement number 847577 (SMART 4.0 Marie Skłodowska-Curie Actions COFUND) and from Science Foundation Ireland (SFI) under Grants SFI/16/RC/3918 and SFI/12/RC/2289_P2, cofunded by the European Regional Development Fund.
REFERENCES
1. B. Sudharsan, J. G. Breslin, and M. I. Ali, "Edge2Train: A framework to train machine learning models (SVMs) on resource-constrained IoT edge devices," in Proc. 10th Int. Conf. Internet Things, 2020, Art. no. 6.
2. X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," 2018. [Online]. Available: https://arxiv.org/abs/1711.07971
3. J. Lin, C. Gan, and S. Han, "TSM: Temporal shift module for efficient video understanding," 2018. [Online]. Available: https://arxiv.org/abs/1811.08383
4. F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs," in Proc. 15th Annu. Conf. Int. Speech Commun. Assoc., 2014.
5. D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," 2016. [Online]. Available: https://arxiv.org/abs/1610.02132
6. N. Strom, "Scalable distributed DNN training using commodity GPU cloud computing," in Proc. 16th Annu. Conf. Int. Speech Commun. Assoc., 2015, pp. 1488-1492.
7. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," 2016. [Online]. Available: https://arxiv.org/abs/1606.06160
8. N. Dryden, T. Moon, S. A. Jacobs, and B. V. Essen, "Communication quantization for data-parallel training of deep neural networks," in Proc. Workshop Mach. Learn. High Perform. Comput. Environ., 2016, pp. 1-8.
9. J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016. [Online]. Available: https://arxiv.org/abs/1607.06450
10. H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, "D2: Decentralized training over decentralized data," 2018. [Online]. Available: https://arxiv.org/abs/1803.07068
11. X. Lian, W. Zhang, C. Zhang, and J. Liu, "Asynchronous decentralized parallel stochastic gradient descent," 2017. [Online]. Available: https://arxiv.org/abs/1710.06952
12. I. Mitliagkas, C. Zhang, S. Hadjis, and C. Ré, "Asynchrony begets momentum, with an application to deep learning," in Proc. 54th Annu. Allerton Conf. Commun., Control, Comput., 2016, pp. 997-1004.
13. B. Qian et al., "Orchestrating the development lifecycle of machine learning-based IoT applications: A taxonomy and survey," ACM Comput. Surv., vol. 53, 2020, Art. no. 82.
14. W. Dai, Y. Zhou, N. Dong, H. Zhang, and E. P. Xing, "Toward understanding the impact of staleness in distributed machine learning," 2018. [Online]. Available: https://arxiv.org/abs/1810.03264
BHARATH SUDHARSAN is currently working toward the Ph.D. degree with the CONFIRM SFI Centre for Smart Manufacturing, Data Science Institute, National University of Ireland Galway, Ireland. His research areas are resource-constrained IoT devices, edge intelligence and analytics, and real-time machine training. He is the corresponding author of this article. Contact him at bharath.sudharsan@insight-centre.org.

PANKESH PATEL is a Senior Researcher with the CONFIRM SFI Centre for Smart Manufacturing, Data Science Institute, National University of Ireland Galway, Ireland. His academic background and research work focus on building software development tools to easily develop applications at the intersection of the Internet of Things/Industry 4.0, artificial intelligence, edge computing, and cloud computing. Contact him at pankesh.patel@insight-centre.org.

JOHN BRESLIN is a Personal Professor (Personal Chair) in electronic engineering with the College of Science and Engineering, National University of Ireland Galway, Ireland, where he is the Director of the TechInnovate/AgInnovate programmes. Contact him at john.breslin@insight-centre.org.

MUHAMMAD INTIZAR ALI is an Assistant Professor with the School of Electronic Engineering, Dublin City University, Dublin, Ireland. His research interests include data analytics, the Internet of Things, stream query processing, data integration, distributed and federated machine learning, and knowledge graphs. Contact him at ali.intizar@dcu.ie.

KARAN MITRA is an Assistant Professor with Luleå University of Technology, Luleå, Sweden. His research interests include quality of experience modelling and prediction, context-aware computing, cloud computing, and mobile and pervasive computing systems. Contact him at karan.mitra@ltu.se.

SCHAHRAM DUSTDAR is a Professor of Computer Science and head of the Distributed Systems Group at TU Wien, Vienna, Austria. He was named Fellow of the Institute of Electrical and Electronics Engineers (IEEE) in 2016 for contributions to elastic computing for cloud applications. Contact him at dustdar@dsg.tuwien.ac.at.

OMER RANA is a Professor of Performance Engineering and previously led the Complex Systems research group, School of Computer Science and Informatics, Cardiff University, Cardiff, U.K. His research interests lie in the overlap between intelligent systems and high-performance distributed computing. He is particularly interested in understanding how intelligent techniques could be used to support resource management in distributed systems, and the use of these techniques in various application areas. Contact him at ranaof@cardiff.ac.uk.

PREM PRAKASH JAYARAMAN is a Senior Lecturer and Head of the Digital Innovation Lab in the Department of Computer Science and Software Engineering, Faculty of Science, Engineering and Technology at Swinburne University of Technology, Melbourne, Australia. Contact him at pjayaraman@swin.edu.au.

RAJIV RANJAN is an Australian-British computer scientist, of Indian origin, known for his research in Distributed Systems (Cloud Computing, Big Data, and the Internet of Things). He is the University Chair Professor for Internet of Things research with the School of Computing, Newcastle University, Newcastle upon Tyne, U.K. Contact him at raj.ranjan@ncl.ac.uk.