Preprint

AI-based Block Identification and Classification in the Blockchain Integrated IoT

This paper has been accepted and presented at the OCIT 2022 (IEEE) Conference and was selected for the "Outstanding Paper Award of OCIT 2022". Citation format: J. Dutta, D. Puthal and E. Damiani, "AI-based Block Identification and Classification in the Blockchain Integrated IoT," 2022 OITS International Conference on Information Technology (OCIT), Bhubaneswar, India, 2022, pp. 415-421, doi: 10.1109/OCIT56763.2022.00084.
AI-based Block Identification and Classification in
the Blockchain Integrated IoT
Joy Dutta, Deepak Puthal and Ernesto Damiani
Center for Cyber-Physical Systems and Department of EECS, Khalifa University, Abu Dhabi, UAE
Email: {joy.dutta, deepak.puthal, ernesto.damiani}
Abstract—Artificial Intelligence (AI) is gaining popularity in Internet of Things (IoT) based solution development, while Blockchain has become unavoidable in IoT for maintaining end-to-end processes in a decentralized manner. Combining these two current-age technologies, this paper details a brief comparative study with implementations and further analyzes the adaptability of AI-based solutions in the Blockchain-integrated IoT architecture. This work focuses on identifying the sensitivity of block data in the block validation stage using AI-based approaches. Several supervised, unsupervised, and semi-supervised learning algorithms are analyzed to determine a block's data sensitivity. It is identified that machine learning techniques can classify a block's data with very high accuracy. By utilizing this, the block's sensitivity can be identified, which can help the system reduce the energy consumption of the block validation stage by dynamically choosing an appropriate consensus mechanism.
Index Terms—IoT, Blockchain, Block Validation, Edge Devices,
Machine Learning, Energy Efficiency
I. INTRODUCTION
With the current technological advancements, IoT technology is in its golden age, transforming various industries by storing and processing data in a distributed manner and enhancing productivity. IoT holds great potential for the future, such as making networks more flexible, integrating artificial intelligence (AI), and enabling the deployment, automation, coordination, and security of different use cases, not just in the business world but also in everyday life. The potential of IoT-based applications is limitless. Statista predicts that by the end of 2030, there will be more than 29 billion IoT devices connected to the Internet [1]. The IT services market for IoT will be worth 58 billion dollars in 2025, growing at a 34% CAGR from 2020 [2].
IoT penetrates every sector, and data generated from different sectors require different attention. As a matter of fact, if the data generated from IoT applications is not securely handled, then the application is not usable. Thus, for IoT-based
applications to be well adopted across different industries,
applications must facilitate trust, security, transparency, and
the traceability of data shared across a network. This is
where blockchain technology comes into the picture. The main
reason for coming up with the idea of a blockchain was
to reduce interference and dependence on third parties. This
was done by introducing consensus-based updates in a shared
and distributed way [3]. Data stored in blockchains is special because, by the intrinsic nature of Blockchain, it is immutable, irreversible, decentralized, consistent, and anonymous. This makes it the perfect way to store IoT-based application data reliably. But
these two technologies, i.e., IoT and Blockchain, have their
limitations. For example, IoT devices are resource constrained.
This signifies they have limited computation power, storage,
and networking capability. On the other hand, blockchain technology faces issues of scalability, high latency, high energy consumption, reversal of confirmed transactions, etc. This shows how the basic needs of IoT, e.g., saving energy and making decisions and getting responses in real time, are at odds with the core workings of blockchain technology.
Current research focuses on the energy consumption perspective of blockchain technology. In IoT-Blockchain-based applications, the data stored in the Blockchain goes through the stages of transaction generation, transaction validation, block creation, block validation, and adding blocks to the Blockchain. After closely observing these steps, it is observed
that the compute-intensive steps are transaction validation,
block validation, and adding the new block to the overall
blockchain network [4], shown in Fig. 1. Out of these three,
the present research focuses on energy consumption in the
block validation stage. At this stage, after a block is created, its
validation is done via a proper consensus mechanism, and for a
network of nodes, the same consensus is applied uniformly for
all the subsequent blocks [4]. This ensures a collective agree-
ment about the present state of the Blockchain’s distributed
ledger. In this way, consensus algorithms make sure that the
blockchain network is reliable and build trust between peers
who don’t know each other.
In the present scenario, while validating a block, every block is treated with similar importance by applying the same consensus mechanism. In reality, the block
data’s sensitivity will be different at different times. Based on
this, the present research focuses on dynamically identifying
the sensitivity of a block’s data in the block validation stage.
Based on the sensitivity of the data present, a proper consensus
mechanism is selected for that block. Thus, with this concept,
consensus for a network of nodes is not static anymore; it
has become dynamic. This paper proposes a novel solution to
identify a block’s data sensitivity by proposing a new AI layer
after block creation and before block validation.
Here, our contribution is two-fold. Firstly, we have identified
that choosing consensus dynamically based on the block’s
data can help to reduce energy consumption at the block
Fig. 1. Process of making a block and adding it to the ledger with the Proposed AI Layer between the Block creation and Validation Stage (Server symbols
are used to show processes that require a lot of computing power.)
validation level. Secondly, we have introduced an AI-based
layer for this block-level data sensitivity identification after
block creation and mapped the problem in the ML domain as
a multiclass classification problem by categorizing the block
data into three types, e.g., normal, sensitive, and critical. Then,
proper ML algorithms are identified that are suitable to identify
the importance of the block data, which can help us identify
different consensus mechanisms that can be applied to achieve
energy efficiency at the block verification level.
II. BACKGROUND AND MOTIVATION
A consensus mechanism in Blockchain technology plays a crucial role in the block's validation stage, as this block validation is one of the most power-hungry stages of adding
transaction data to the blockchain ledger [4] (refer to Fig. 1).
In this stage, a specific consensus mechanism is applied to the
block to check its validity. This consensus mechanism is the
primary cause of energy consumption at this block validation
stage. Thus, blockchain networks cannot function properly
without consensus algorithms.
Now, a wide range of application scenarios exist, and
different consensus mechanisms are used to validate a block
there. Out of them, Proof of Work (PoW), Practical Byzantine
Fault Tolerance Algorithm (PBFT), Proof of Stake (PoS),
Proof of Activity (PoA) are some of the most popular ones
but not suitable for IoT-based applications [4]. Hence, there is
always a temptation in this research area to develop a new con-
sensus mechanism that resolves some specific contradictions
for the efficient integration of IoT-Blockchain. In recent years,
researchers have made significant efforts to develop a different
consensus algorithm that is more IoT-friendly, i.e., less power-
consuming and lighter. In this context, different consensus
algorithms, e.g., Proof of Elapsed Time (PoET), Practical
Byzantine Fault Tolerance (PBFT), Proof-of-Authentication
(PoAh) [4], Proof of Importance (POI), Tangle, Steller, Ripple,
etc. are suggested for diverse application environments [4], [5].
So, it is found that most of the research is happening at the consensus level, with researchers applying the same consensus to all the blocks uniformly, and this is the problem that we want to highlight here. Looking at the problem more specifically, we have noticed that, for any IoT-based application, the data generated from the application does not always have the same importance. Only occasionally do the nodes sense a different state of the data, because of certain changes in the application's environment, and it is these occasions that are of interest to us. Hence, the importance of data varies with time in a specific application. Since the importance of the generated data is not uniform, applying the same consensus uniformly to all the blocks is also not required. Block validation is a resource-intensive stage, and the applied consensus mechanism is the key part of it. Thus, by changing the consensus dynamically at the edge based on the content of the block's data, we can make this stage energy efficient.
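To make the idea concrete, dynamic selection can be as simple as a lookup from the predicted sensitivity class to a consensus mechanism. The sketch below is illustrative only; the specific class-to-consensus mapping is our assumption, not one prescribed by the paper:

```python
# Illustrative sketch: choosing a consensus mechanism per block based on the
# predicted sensitivity. The mapping below is a hypothetical assumption; the
# paper only argues that lighter consensus suffices for less sensitive blocks.

# Sensitivity classes used throughout the paper: 0 = normal, 1 = sensitive, 2 = critical.
CONSENSUS_BY_SENSITIVITY = {
    0: "PoAh",  # lightweight Proof-of-Authentication for normal data
    1: "PoET",  # middle-ground option for sensitive data
    2: "PBFT",  # stronger agreement for critical data
}

def choose_consensus(block_sensitivity: int) -> str:
    """Return the consensus mechanism for a block's predicted sensitivity."""
    return CONSENSUS_BY_SENSITIVITY[block_sensitivity]
```

With such a table in place, consensus is no longer static per network; each block pays only the validation cost its data warrants.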
III. PROBLEM FORMULATION
Now, to formulate the problem, let us assume that an incoming data block DB_i, where i = 1, 2, 3, ..., k represents the block numbers, is generated from the IoT network. Each block consists of multiple transactions. For data block DB_i, the j-th transaction can be represented as TX_ij, where [TX_i1, TX_i2, ..., TX_im] with j = 1, 2, 3, ..., m are the different transactions placed in a single block. Here, both i, j ∈ N. In the case of IoT-Blockchain-based applications, these transactions TXs consist of values for a set of selected features (here, n features) collected from IoT nodes, i.e., for the j-th transaction, TX_ij = [f_ij1, f_ij2, ..., f_ijn]. Here, k, m, n are non-negative integers, i.e., they belong to N.
Now, for this problem scenario, let us assume that we have data sensitivity information up to the p-th block DB_p. This implies that up to the p-th block, we know the value of DS_p for all p, where 0 ≤ p and p < i. Thus, the data sensitivity of the p-th block's data for a specific transaction j is represented here by DS_pj. The data sensitivity values that a block/transaction can take are fixed and depend on the problem. The values that DS_pj can take are non-negative integers, i.e., they belong to N. Thus, we have the information in the following tuple form: [TX_pj, DS_pj], where DS_pj is the data sensitivity label for transaction TX_pj (p-th data block).
Now, the problem can be stated as follows: based on the given previous transaction data (up to the p-th block) along with sensitivity information, i.e., [f_pj1, f_pj2, ..., f_pjn, DS_pj] for all p, j, n where p < i (the current block) and p, j, n ∈ N, we have to predict DS_ij for TX_ij, where the i-th block is the currently received block DB_i.
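The notation above maps naturally onto simple data structures. A minimal sketch (our illustration, not the authors' code): a block DB_i holds m transactions TX_ij, each an n-feature vector, with DS_ij known only for historical blocks:

```python
# Sketch of the formulation's objects: blocks of transactions, where each
# transaction is a feature vector [f_ij1, ..., f_ijn] and carries a
# sensitivity label DS_ij in {0, 1, 2} only when it belongs to the history.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Transaction:
    features: List[float]               # [f_ij1, ..., f_ijn]
    sensitivity: Optional[int] = None   # DS_ij; None for unlabeled transactions

@dataclass
class DataBlock:
    index: int                          # block number i
    transactions: List[Transaction]     # TX_i1 ... TX_im

# Labeled history up to the p-th block: tuples of the form [TX_pj, DS_pj]
history = DataBlock(index=1, transactions=[
    Transaction(features=[22.5, 50.0, 70.0], sensitivity=0),   # normal
    Transaction(features=[48.0, 30.0, 200.0], sensitivity=2),  # critical
])
```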
A. Mapping of the problem in the machine learning domain
To identify the sensitivity of the block's data, an AI layer is introduced after block creation (Fig. 1). In this AI layer, different ML models
can be applied. ML encompasses a diverse set of algorithms,
ranging from comprehensive problem-solving strategies to
particular feature extraction methodologies. ML algorithms are
often divided into four categories: 1) Supervised learning 2)
Unsupervised learning, 3) Semi-supervised learning, and 4)
Reinforcement learning [6].
In this study, when an edge node receives a block, the data in it gets our primary attention. Our current research focuses primarily on the transaction part TX_ij of the block's data. When a newly created block is received, its data part does not carry any data sensitivity information DS_ij for the transactions in it, nor for the overall block DS_i. Here, we want to predict DS_ij for a specific transaction TX_ij, which will provide us, in turn, with the data sensitivity of the i-th block DS_i. We can get the value of DS_i by considering the predictions DS_ij of all the transactions TX_ij, where j = 1, 2, 3, ..., m and j ∈ N, present in the block. Now, three kinds of scenarios might occur based on the availability of data sensitivity information in the previous blocks.
The first scenario is when we know the data sensitivity of
all the previous blocks and all the transactions in them. This is
a supervised machine learning scenario, where we predict the
new incoming data based on previous training data. The next scenario is when labels are not available for all the previous blocks but only for a small percentage of them. This is a use case of semi-supervised learning, where only a small percentage of the data is labeled. In the last scenario, on the contrary, it might
be possible that we don’t have any information (labeled data)
about any of the previous blocks’ sensitivity. It is unsupervised
because, without any labels, data groups are created based on
the features and come under clustering because similar data
are grouped here.
Now, being specific to the problem, the data sensitivity of the block DS_i can be put into three broad categories: normal data, sensitive data, and critical data, based on the data generation at the IoT node [7]. This is identified by looking at the percentage of the transaction data's sensitivity DS_ij. If a specific percentage of transaction data falls under a specific group of sensitivity levels, then the block's data sensitivity can take any of those values (refer to Algorithm 1). Therefore, our primary goal is to map the block's data to any of these categories. So, it can be considered a multiclass classification problem of machine learning, where we can predict, based on labeled training data, whether a block contains normal, sensitive, or critical information. But these category values, i.e., normal, sensitive, and critical, represent categorical data, and such data cannot be directly used in any kind of prediction problem. Hence, these categorical values are mapped to numerical data for mapping this problem to the domain of machine learning. In this problem, DS_ij ∈ {0, 1, 2}, where 0 represents normal data, 1 represents sensitive data, and 2 represents critical data; this makes it clear that these are three different outputs representing three different classes that we need to predict accurately. This is where we have applied categorical encoding. Here, the more accurately we can predict the nature of the block's data, the more efficient the block validation stage can become.
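The categorical encoding step above can be sketched in a few lines (illustrative; any standard label encoder would produce the same mapping):

```python
# Map the three class names to the integer codes used by the models,
# and keep the inverse mapping for reporting predictions back as names.
LABEL_TO_CODE = {"normal": 0, "sensitive": 1, "critical": 2}
CODE_TO_LABEL = {code: name for name, code in LABEL_TO_CODE.items()}

labels = ["normal", "critical", "sensitive", "normal"]
encoded = [LABEL_TO_CODE[name] for name in labels]  # → [0, 2, 1, 0]
```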
B. Machine Learning Model Selection for AI Layer
From the above discussion, one can understand that supervised, unsupervised, and semi-supervised algorithms are tested in this research from a practical applicability point of view. To be concise, real-time IoT systems do not support all branches of ML, as they need to satisfy real-time restrictions [8]. According to the survey [8], the most representative and widely used supervised learning algorithms in real-time IoT systems are KNN, Random Forest, SVM, Decision Tree, Naive Bayes, Boosting, etc. Now, to solve the present problem, the classification algorithms that we focus on here are Support Vector Machine (SVM), K-nearest neighbors (KNN), Random Forest (RF), and Gradient Boosting-based LightGBM [9]. Next, we examine why these machine learning algorithms are chosen here over the others.
The first model of consideration is SVM, which is primarily designed for binary classification but is capable of multiclass classification by applying the same principle, i.e., by breaking down the multiclass problem into multiple binary classification problems. The next simple yet efficient machine learning model of our choice is KNN, which is a non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. It can be used for multiclass classification as well. The drawback of this model is that it cannot handle high-dimensional data efficiently, as its complexity rises with dimensionality. However, in our case, the dimension of the data is not comparable to the input size; hence, we have considered it in our experiment. Further, the bagging-based Random Forest (RF) is considered, an ensemble learning approach for classification, regression, and other problems that works by building a large collection of decision trees during training.
Random forest is more suited to distributed computing than
boosting algorithms and is also easier to tune. Thus, this model
also fits our requirements.
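The proximity-based prediction that KNN performs can be sketched without any libraries. This toy version is an illustrative stand-in, not the implementation used in the paper's experiments:

```python
# Dependency-free KNN sketch: predict the majority label among the k nearest
# training points. Assumes numeric feature vectors and integer class labels.
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label of the k training points nearest to query."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny example: two clusters standing in for 'normal' (0) and 'critical' (2).
X = [[20.0, 50.0], [21.0, 52.0], [48.0, 30.0], [50.0, 28.0]]
y = [0, 0, 2, 2]
print(knn_predict(X, y, [20.5, 51.0], k=3))  # → 0
```

In production, an optimized library implementation with a spatial index would replace the linear scan, but the prediction logic is the same.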
The next machine learning model we have considered is
based on boosting algorithms, which combine a group of
weak learners into one strong learner to make predictive data
analysis more accurate. Out of the different boosting algo-
rithms present, gradient boosting based LightGBM, CatBoost,
and XGBoost are three algorithms that outperformed most
of the models in reality. Out of these three, histogram-based
LightGBM is the most efficient based on our experiment
scenario [10]. Because boosted models teach themselves to correct each other's mistakes, they can detect complicated patterns in data. Of the three mentioned, LightGBM is the lightest and quickest of the boosting algorithms. Though CatBoost and XGBoost also show a significant improvement over GBM, they still tend to lag behind LightGBM [11], which achieves its speed without much compromise in accuracy. Thus, for this study, LightGBM is considered.
When the amount of labeled data is extremely low or absent, the approach to identifying different data types is clustering. For the present study, the predefined number of clusters is 3. Based on this, the most suitable algorithm for this kind of scenario is K-Means clustering. It has many advantages, including guaranteed convergence, scalability to large data sets, and ease of implementation. It can also warm-start centroid locations and readily adjust to incoming instances. It can also generalize to clusters of various shapes and sizes, such as elliptical, oval, and circular clusters. Thus, K-Means is theoretically one of the best fits when clustering is applied to form groups of similar data based on the features present in the transaction data. But in practice, these outcomes are often inaccurate.
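The two alternating steps of K-Means (assign each point to its nearest centroid, then move each centroid to its cluster mean) can be sketched compactly. This 1-D, fixed-initialization version is illustrative only; real use would rely on K-Means++ initialization and multiple restarts:

```python
# Compact K-Means sketch with K = 3, matching the three sensitivity classes.
# Deterministic on purpose: fixed initial centroids and 1-D data.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# e.g., CO readings (PPM) loosely spanning the normal/sensitive/critical ranges
data = [5, 8, 10, 20, 22, 25, 40, 45, 50]
centroids, clusters = kmeans_1d(data, centroids=[0.0, 30.0, 60.0])
```

After convergence the three clusters can be shown to a domain expert, who names them normal, sensitive, or critical, as described in the working model below.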
Semi-supervised learning is the next category of learning algorithm, applicable when the available training data is only partially labeled, i.e., a small amount of the data is labeled and the rest is unlabeled. Here, the sensitivities of the unlabeled data are predicted using known supervised learning algorithms after training with the available small labeled set. The same algorithms mentioned above for the supervised case are also utilized for this category of learning. The main benefit of this learning approach is that it may obtain cutting-edge outcomes on major problems using just a fraction of the labeled data.
C. Working Model
After receiving the block at edge nodes, the AI layer starts
its function. First, the AI layer checks the computing capability
of the edge node. This edge capability includes processing
power in terms of primary memory (RAM), secondary mem-
ory (local storage), and power source type (battery / connected
power source). Based on the capability of the edge and the
requirements of IoT applications, an appropriate ML algorithm
is chosen.
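The capability check can be sketched as a simple policy function. The thresholds and the capability fields below are hypothetical assumptions for illustration; the paper does not specify exact cut-offs:

```python
# Illustrative sketch of the AI layer's capability check: pick an ML model
# family from the edge node's resources. Thresholds are assumed, not from
# the paper.
def choose_model(ram_gb: float, battery_powered: bool) -> str:
    """Pick an ML model family from the edge node's resources."""
    if battery_powered or ram_gb < 2:
        return "KNN"         # fastest predictor, lowest energy cost
    return "RandomForest"    # balanced accuracy/latency trade-off

edge = {"ram_gb": 8, "battery_powered": False}  # e.g., a Raspberry Pi 4 on mains
print(choose_model(**edge))  # → RandomForest
```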
Fig. 2. Procedure for checking the data validation for the block with the Proposed AI Layer
While considering transactions with data sensitivity labels, after the selection of the ML model (i.e., supervised or semi-supervised), the model is trained with the local data. So, after the ML model is trained with [TX, DS], the newly received block's transaction data sensitivity is predicted using this trained model. Now, in a single block, the count of transactions can range from a few hundred to thousands. The trained machine learning model predicts the sensitivity of all these transactions in real time. After identifying all the
according to Algorithm 1. Here, if within any block it is found
that more than 5% of the data is sensitive or critical, the
whole block's data gets that label. The reason for choosing this percentage threshold is that the outlier percentage will generally remain within 5%. However, if both critical and sensitive data are present within the same block, we label the block as critical, as critical data has more importance in real-world scenarios. Now, based on the sensitivity identified for the block
by the ML algorithm, the consensus is chosen appropriately.
The whole procedure for block validation and the role of the
AI layer is shown in Fig. 2. Here, different processes in the
diagram represent all the functionalities of the AI layer.
Now, if no transactions, or only a severely low percentage of the dataset, have a data sensitivity label, then the process is straightforward. Since the number of clusters is predefined in our case, i.e., 3, we can apply K-Means clustering with the value of K as 3. In this way, from all the available data, three separate clusters are created. With expert help, these three groups are identified as normal, sensitive, or critical. Following this, we have the data sensitivity values for previous transactions, and by utilizing these labels when a new block is created and applying Algorithm 1, we can identify that block's data sensitivity.
D. Performance Metrics
The prediction performance needs to be measured to assess the machine learning models' effectiveness. The accuracy metric is one of the most fundamental and widely used
Algorithm 1 Block Data Sensitivity Identification
Input: Predicted data sensitivity for all transactions, i.e., [TX_ij, DS_ij] for all (i, j), where (i, j) ∈ N
Output: Overall sensitivity of the current block, i.e., DS_i
1: DS_i ← 0 // sensitivity of the current block
2: sen_count ← 0 // sensitive transaction count
3: cri_count ← 0 // critical transaction count
4: FS ← 0 // flag for sensitive
5: FC ← 0 // flag for critical
6: th ← (m × 5)/100 // threshold set at 5% of the transaction count
7: for j ← 1 to m do // m is the number of transactions in the block
8:   if DS_ij == 1 then
9:     sen_count++
10:  end if
11:  if DS_ij == 2 then
12:    cri_count++
13:  end if
14: end for
15: if sen_count > th then
16:   FS ← 1
17: end if
18: if cri_count > th then
19:   FC ← 1
20: end if
21: if FC == 1 ∧ FS == 1 then
22:   DS_i ← 2
23: else if FC == 1 then
24:   DS_i ← 2
25: else if FS == 1 then
26:   DS_i ← 1
27: else
28:   DS_i ← 0
29: end if
30: return DS_i
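Algorithm 1 translates directly into Python. This is our rendering, taking the 5% threshold as m × 5/100 over the m per-transaction predictions:

```python
# Python rendering of Algorithm 1: count sensitive (1) and critical (2)
# predictions and promote the whole block's label when either count exceeds
# the 5% threshold, preferring 'critical' when both flags are raised.
def block_sensitivity(ds: list) -> int:
    """ds holds the predicted DS_ij for the m transactions of one block."""
    m = len(ds)
    th = (m * 5) / 100                       # 5% of the transaction count
    sen_count = sum(1 for d in ds if d == 1)
    cri_count = sum(1 for d in ds if d == 2)
    if cri_count > th:                       # critical wins, alone or mixed
        return 2
    if sen_count > th:
        return 1
    return 0

# 100 transactions, 6 of them critical (> 5%): the whole block is critical.
print(block_sensitivity([2] * 6 + [0] * 94))  # → 2
```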
ones. Here, we have used accuracy in percentage to measure the performance of our different supervised multiclass classification models. To further verify the accuracy metric's findings, we have measured each model's performance with 10-Fold cross-validation and its standard deviation [12]. We have also calculated the F1 Score, the harmonic mean of precision and recall, to understand the machine learning models' performance. Along with the F1 Score, the other two metrics considered here are the weighted average, which is the support-weighted mean per label, and the macro average, which is the unweighted mean per label.
This block data sensitivity can be theoretically compared with the framework for Data-Driven Global Sensitivity analysis to solidify the concept of the paper, which will be incorporated in the future for the transaction-data-level sensitivity analysis phase [13]. In the next section, the implementation is discussed in detail, and the results are analyzed.
IV. IMPLEMENTATION AND RESULTS

TABLE I. Feature data ranges per sensitivity class

Feature             | Normal     | Sensitive   | Critical
Temperature (°C)    | 18 - 30    | 30 - 45     | >45
Humidity (%)        | 35 - 65    | 35 - 65     | 20 - 50
AQI (US)            | 60 - 80    | 81 - 150    | 151 - 500
Noise (dB)          | 55 - 70    | 71 - 80     | >80
Occupancy (Binary)  | 0 / 1      | 0 / 1       | 0 / 1
CO (PPM)            | 0 - 15     | 16 - 30     | >30
CO2 (PPM)           | 600 - 1000 | 1001 - 3000 | 3001 - 10000
Flame Data (Binary) | 0          | 0           | 1
A. Data Set
We used the smart building context and associated data for this research, with our primary focus being indoor data. For measuring the indoor condition, the following features were considered: temperature, humidity, air quality index (AQI), noise, occupancy, carbon monoxide (CO), carbon dioxide (CO2), and flame data, according to [14], [12]. Based on our experience, domain knowledge, and the data used in [15], we have created a synthetic dataset for this experiment. As mentioned in [12], three months of real-world data were collected, and from there, the typical ranges of temperature, humidity, air quality index, noise level, occupancy, CO, CO2, and flame sensor data were determined. Our dataset consists of 10,000 training instances, each labeled with the nature of its sensitivity, i.e., normal as 0, sensitive as 1, and critical as 2. The detailed description of the data is summarized in Table I.
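A synthetic dataset of this kind could be generated by sampling each feature uniformly within its class range from Table I. The sketch below is illustrative, not the authors' exact generator; the upper caps on the open-ended critical ranges (">45", ">30") are our assumptions:

```python
# Illustrative synthetic-data generator from a subset of the Table I ranges.
# Classes: 0 = normal, 1 = sensitive, 2 = critical. Caps on the open-ended
# critical ranges (60 °C, 60 PPM) are assumed for sampling purposes.
import random

RANGES = {
    "temperature": {0: (18, 30), 1: (30, 45),  2: (45, 60)},
    "aqi":         {0: (60, 80), 1: (81, 150), 2: (151, 500)},
    "co":          {0: (0, 15),  1: (16, 30),  2: (30, 60)},
}

def make_instance(label: int, rng: random.Random) -> dict:
    """Sample one labeled row, each feature uniform within its class range."""
    row = {f: rng.uniform(*RANGES[f][label]) for f in RANGES}
    row["sensitivity"] = label
    return row

rng = random.Random(42)  # seeded for reproducibility
dataset = [make_instance(rng.randrange(3), rng) for _ in range(1000)]
```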
B. Implementation Setup
For the experiment, we have considered the latest Raspberry Pi 4 Model B as an edge node, which has a high-performance 64-bit quad-core Cortex-A72 processor, 8GB LPDDR4 RAM, and inbuilt 2.4 GHz and 5.0 GHz IEEE 802.11ac wireless support. The Raspberry Pi Foundation specifically selected Python as the main language because of its power, versatility, and ease of use. Thus, for implementing the machine learning algorithms, we have used Python version 3.9.
C. Results & Discussions
Here, in the AI layer, we have considered the problem of block data grouping as a multiclass classification problem and tried to solve it with suitable supervised machine learning options, i.e., K-SVM, KNN, RF, and LightGBM, when working with labeled transaction data. It is found that RF outperformed all the other models, even with a forest size of 10. The 10-Fold cross-validation score and F1 Score also support the same. The standard deviation of the 10-Fold CV is also the lowest for RF, which signifies that it is a consistent performer among its competitors with a 98.68% 10F-CV score; the detailed results are shown in Table II. Among all the classifiers, K-SVM could not perform well in this context, which is visible
Fig. 3. Performance Comparison of Different Supervised Machine Learning Models for Multiclass Classification
Fig. 4. Comparison of the Execution Time of different Supervised Machine Learning Algorithms
from its 10-Fold cross-validation and standard deviation scores. However, LightGBM's performance is very close to RF, and KNN also performs very well with a 10F-CV score of 94.42%, as shown in Fig. 3.
Now, since the whole experiment takes place at the edge, and edge nodes are generally resource-constrained devices, along with accuracy, the average prediction time is also important, as it not only tells us about the real-time application handling capability of the model but is also proportional to the power consumption of the node. Here, it is found that KNN takes the least time among all the algorithms, i.e., 0.029 seconds, which is the average over 100 predictions of different transaction data using the same model. Interestingly, KNN is also 7.5 times faster than LightGBM and two times faster than RF with the same resources. K-SVM turns out to be the most time-consuming in terms of prediction, requiring the most resources as well. The details are shown in Fig. 4 for a clear pictorial representation.
After supervised learning, we tried to solve the problem
using the unsupervised learning approach K-Means. Since
we have labels of the data with us, after clustering, when
we verified whether data points were in the proper cluster,
we found the average accuracy to be 20%, with an average
execution time of 0.107 seconds. Hence, performance was
abysmal in terms of accuracy.
To reflect real-world scenarios, based on our experience with the supervised models, we have considered the K-SVM, KNN, RF, and LightGBM models for semi-supervised learning. We have experimented with 10% of the training data having labels, and the remaining 90% of the labels are predicted using the initially trained model. This whole labeled dataset is then used to
Fig. 5. Performance comparison of different semi-supervised algorithms
(After Phase 2)
predict new, unseen data in real-time. The result for the same
is shown in Table III.
Now, utilizing the information from Table III, Phase 1, it is found that K-SVM is not worth considering, as it is effective neither from an execution-time viewpoint nor from an accuracy viewpoint. Thus, the algorithms to consider for next-level prediction are KNN, RF, and LightGBM. We have applied these three semi-supervised models to a new, unseen dataset. The experiment found that the accuracy of LightGBM, with a 10F-CV score of 99.26%, is very close to the 99.12% of RF, and KNN, being the fastest one, also gives 94.2% on the same 10F-CV metric. The results are shown in Table III and visually in Fig. 5.
D. Summary of the Results
From the above discussions, it is found that the AI layer
(Fig. 1) can identify a block’s data sensitivity with high
accuracy. If the machine learning model in the AI layer is
trained with labeled data, i.e., if all the previous data has
labels (for the supervised learning approach), KNN, RF, and
LightGBM are the options that can be considered to run at
the edge node. If the edge node has limited resources or a
fast prediction is required, KNN is the best option out of the
three. Otherwise, RF can be chosen as a balanced tradeoff after considering the model's accuracy and prediction time. Because of its low accuracy and considerably longer training time, K-SVM is discarded from the recommendation list.
Next, if labels are not available for all the data, i.e., only a small fraction of the data is labeled, then semi-supervised machine learning models are helpful. Here also, for faster performance or for managing resource issues, KNN is the choice. Otherwise, LightGBM marginally outperforms RF. Hence, if resource availability is not an issue, then LightGBM or RF can be chosen, as they are very close in terms of performance. However, RF gives results faster than LightGBM for semi-supervised learning as well.
If the data has no labels, cluster identification with unsuper-
vised learning does not help much. The reason is that even a
similar grouping of data, i.e., cluster formation, is not effective
in label prediction as the accuracy of the clustering model is
below standard in this kind of scenario.
TABLE II. Performance of the supervised classifiers

Classifier | Accuracy | 10F CV | 10F CV Std. Dev. | F1 Score | Macro Avg. | Weighted Avg.
K-SVM      | 71.2     | 70.97  | 2.24             | 67.3     | 51         | 67.3
KNN        | 94.5     | 94.42  | 0.91             | 94.59    | 93.7       | 94.6
RF         | 98.5     | 98.68  | 0.36             | 98.5     | 98.8       | 98.5
LightGBM   | 98.35    | 98.32  | 0.72             | 98.35    | 98.7       | 98.4

TABLE III. Performance of the semi-supervised classifiers

Phase 1
Classifier | Accuracy | 10F CV | 10F CV Std. Dev. | F1 Score | Macro Avg. | Weighted Avg.
K-SVM      | 59.4     | 57.5   | 4.18             | 54.76    | 41.9       | 54.8
KNN        | 90.53    | 86.5   | 3.61             | 90.64    | 89.4       | 90.6
RF         | 97.89    | 96.62  | 2.5              | 97.89    | 98.3       | 97.9
LightGBM   | 97.87    | 97.25  | 1.22             | 97.88    | 98.1       | 97.9

Phase 2
Classifier | Accuracy | 10F CV | 10F CV Std. Dev. | F1 Score | Macro Avg. | Weighted Avg.
KNN        | 88.38    | 94.2   | 0.72             | 89.17    | 86.5       | 87.6
RF         | 98.69    | 99.12  | 0.38             | 98.69    | 98.0       | 98.7
LightGBM   | 98.12    | 99.26  | 0.17             | 98.13    | 97.2       | 98.1
V. CONCLUSION
This paper proposes an AI layer between block creation
and block validation to identify the sensitivity of a block's
data in Blockchain-integrated IoT networks. This is achieved
by incorporating machine learning at the edge. Several su-
pervised, unsupervised, and semi-supervised machine learning
models are thoroughly investigated in this paper. It is found
that specific supervised and semi-supervised learning models are
suitable at the edge for determining the sensitivity of a block's
data with fairly high accuracy. Once a block's
data sensitivity is identified, an appropriate block validation technique can
be applied. Thus, instead of uniformly applying the same vali-
dation technique to all blocks regardless of the type of data they carry,
a validation technique can be chosen dynamically
based on the nature of the data in the block. This improves
the system's scalability.
ACKNOWLEDGMENT
The statements made herein are solely the responsibility of
the authors. This publication is based upon work supported by
Khalifa University under Award No. FSU-2022-018.
REFERENCES
[1] Statista, “Number of internet of things (iot) connected devices worldwide
from 2019 to 2021, with forecasts from 2022 to 2030,” 2022. [Online;
accessed 26-September-2022].
[2] Gartner Report, “Forecast it services for iot,” 2021. [Online].
[3] D. Puthal, N. Malik, S. P. Mohanty, E. Kougianos, and G. Das,
“Everything you wanted to know about the blockchain: Its promise,
components, processes, and problems,” IEEE Consumer Electronics
Magazine, vol. 7, no. 4, pp. 6–14, 2018.
[4] D. Puthal, S. P. Mohanty, P. Nanda, E. Kougianos, and G. Das, “Proof-
of-authentication for scalable blockchain in resource-constrained dis-
tributed systems,” in 2019 IEEE International Conference on Consumer
Electronics (ICCE), pp. 1–5, 2019.
[5] M. Salimitari, M. Chatterjee, and Y. P. Fallah, “A survey on consensus
methods in blockchain for resource-constrained iot networks,” Internet
of Things, vol. 11, p. 100212, 2020.
[6] F. Zantalis, G. Koulouras, S. Karabetsos, and D. Kandris, “A review
of machine learning and iot in smart transportation,” Future Internet,
vol. 11, no. 4, 2019.
[7] B. Sahoo, S. Rath, and D. Puthal, “Energy efficient protocols for wireless
sensor networks: A survey and approach,” International Journal of
Computer Applications, vol. 44, no. 18, pp. 43–48, 2012.
[8] J. Bian, A. A. Arafat, H. Xiong, J. Li, L. Li, H. Chen, J. Wang, D. Dou,
and Z. Guo, “Machine learning in real-time internet of things (iot)
systems: A survey,” IEEE Internet of Things Journal, vol. 9, no. 11,
pp. 8364–8386, 2022.
[9] S. Rajora, D.-L. Li, C. Jha, N. Bharill, O. P. Patel, S. Joshi, D. Puthal,
and M. Prasad, “A comparative study of machine learning techniques
for credit card fraud detection based on time variance,” in 2018 IEEE
Symposium Series on Computational Intelligence (SSCI), pp. 1958–
1963, IEEE, 2018.
[10] M. Osman, J. He, F. M. M. Mokbal, N. Zhu, and S. Qureshi, “Ml-lgbm:
A machine learning model based on light gradient boosting machine for
the detection of version number attacks in rpl-based networks,” IEEE
Access, vol. 9, pp. 83654–83665, 2021.
[11] T. Yanabe, H. Nishi, and M. Hashimoto, “Anomaly detection based on
histogram methodology and factor analysis using lightgbm for cooling
systems,” in 2020 25th IEEE International Conference on Emerging
Technologies and Factory Automation (ETFA), vol. 1, pp. 952–958, 2020.
[12] J. Dutta and S. Roy, “Occupancysense: Context-based indoor occupancy
detection prediction using catboost model,” Applied Soft Computing,
vol. 119, p. 108536, 2022.
[13] K. Ye, J. Zhao, C. Huang, N. Duan, Y. Zhang, and T. E. Field, “A data-
driven global sensitivity analysis framework for three-phase distribution
system with pvs,” IEEE Transactions on Power Systems, vol. 36, no. 5,
pp. 4809–4819, 2021.
[14] J. Dutta and S. Roy, “Indoorsense: context based indoor pollutant
prediction using sarimax model,” Multimedia Tools and Applications,
vol. 80, no. 13, p. 19989–20018, 2021.
[15] Y. Liang, C. Lu, Y. Zhao, and C. Sun, “Interference-based consensus
and transaction validation mechanisms for blockchain-based spectrum
management,” IEEE Access, vol. 9, pp. 90757–90766, 2021.
The success of blockchain as the underlying technology for cryptocurrencies has opened up possibilities for its use in other application domains as well. The main advantages of blockchain for its potential use in other areas are its inherent security mechanisms and immunity to data manipulation attacks. A blockchain relies on a consensus method for agreeing on any new data. Most of the consensus methods which are currently used for the blockchain of different cryptocurrencies require high computational power and thus are not suitable for resource-constrained systems. In this article, we survey the various blockchain-based consensus methods that are applicable to resource-constrained IoT devices and networks. In a typical IoT network, there exist several devices with limited computational and communication capabilities. Most often, these devices cannot perform intensive computations and are starved for bandwidth. Therefore, we discuss the possible measures that can be taken to reduce the computational power and convergence time for the underlying consensus methods. We also talk about some of the alternatives to the public blockchain, such as private blockchain and tangle, along with their potential adoption for IoT networks. Furthermore, we review the existing consensus methods that have been implemented and explore the possibility of utilizing them to realize a blockchain-based IoT network. Some of the open research challenges including AI-enabled blockchains are also put forward.