AI-based Block Identification and Classification in
the Blockchain Integrated IoT
Joy Dutta, Deepak Puthal and Ernesto Damiani
Center for Cyber-Physical Systems and Department of EECS, Khalifa University, Abu Dhabi, UAE
Email: {joy.dutta, deepak.puthal, ernesto.damiani}@ku.ac.ae
Abstract—Artificial Intelligence (AI) is gaining popularity in the development of Internet of Things (IoT) based application solutions, while Blockchain has become unavoidable in IoT for maintaining end-to-end processes in a decentralized manner. Combining these two current-age technologies, this paper presents a brief comparative study with implementations and further analyzes the adaptability of AI-based solutions in the Blockchain-integrated IoT architecture. This work focuses on identifying the sensitivity of block data in the block validation stage using AI-based approaches. Several supervised, unsupervised, and semi-supervised learning algorithms are analyzed to determine a block's data sensitivity. We find that machine learning techniques can classify a block's data with very high accuracy. By utilizing this, the block's sensitivity can be identified, which can help the system reduce the energy consumption of the block validation stage by dynamically choosing an appropriate consensus mechanism.
Index Terms—IoT, Blockchain, Block Validation, Edge Devices,
Machine Learning, Energy Efficiency
I. INTRODUCTION
With the current technological advancements, IoT technology is in its golden age, transforming various industries by storing and processing data in a distributed manner and enhancing productivity. IoT has a lot of potential for the future, like making networks more flexible, integrating artificial intelligence (AI), and being able to deploy, automate, coordinate, and secure different use cases, not just in the business world but also in everyday life. The potential of IoT-based applications is limitless. Statista predicts that by the end of the year 2030, there will be more than 29 billion IoT devices connected to the Internet [1]. The IT services market for the IoT will be worth 58 billion dollars in 2025, growing at a 34% CAGR from 2020 [2].
IoT penetrates every sector, and data generated from different sectors require different attention. As a matter of fact, if the data generated from IoT applications is not securely handled, then the application is not usable. Thus, for IoT-based applications to be well adopted across different industries, applications must facilitate trust, security, transparency, and the traceability of data shared across a network. This is where blockchain technology comes into the picture. The main reason for coming up with the idea of a blockchain was to reduce interference and dependence on third parties. This was done by introducing consensus-based updates in a shared and distributed way [3]. Data stored in a blockchain is special because, by the intrinsic nature of Blockchain, it is immutable, irreversible, decentralized, consistent, and anonymous with respect to ownership. This makes it the perfect way to store IoT-based application data to make it reliable. But these two technologies, i.e., IoT and Blockchain, have their limitations. For example, IoT devices are resource-constrained: they have limited computation power, storage, and networking capability. On the other hand, blockchain technology faces issues of scalability, high latency, high energy consumption, the irreversibility of confirmed transactions, etc. This shows how the basic needs of IoT, e.g., saving energy and making decisions and getting responses in real time, are at odds with the core characteristics of blockchain technology [3].
Current research focuses on the energy consumption perspective of blockchain technology. In IoT-Blockchain-based applications, the data stored in the Blockchain goes through the stages of transaction generation, transaction validation, block creation, block validation, and adding the block to the Blockchain. Close observation of these steps shows that the compute-intensive ones are transaction validation, block validation, and adding the new block to the overall blockchain network [4], as shown in Fig. 1. Out of these three, the present research focuses on energy consumption in the block validation stage. At this stage, after a block is created, it is validated via a proper consensus mechanism, and for a network of nodes, the same consensus is applied uniformly to all subsequent blocks [4]. This ensures a collective agreement about the present state of the Blockchain's distributed ledger. In this way, consensus algorithms make sure that the blockchain network is reliable and build trust between peers who do not know each other.
In the present scenario, while validating a block, every block is treated with the same importance by applying the same consensus mechanism. In reality, the sensitivity of the block data will differ at different times. Based on this, the present research focuses on dynamically identifying the sensitivity of a block's data in the block validation stage. Based on the sensitivity of the data present, a proper consensus mechanism is selected for that block. Thus, with this concept, the consensus for a network of nodes is no longer static; it has become dynamic. This paper proposes a novel solution to identify a block's data sensitivity by introducing a new AI layer after block creation and before block validation.
Here, our contribution is two-fold. First, we identify that choosing the consensus dynamically based on the block's data can help reduce energy consumption at the block validation level. Second, we introduce an AI-based layer for this block-level data sensitivity identification after block creation and map the problem into the ML domain as a multiclass classification problem by categorizing the block data into three types: normal, sensitive, and critical. Then, suitable ML algorithms are identified for determining the importance of the block data, which helps us choose among the different consensus mechanisms that can be applied to achieve energy efficiency at the block validation level.

Fig. 1. Process of making a block and adding it to the ledger, with the proposed AI layer between the block creation and validation stages. (Server symbols denote processes that require substantial computing power.)
II. BACKGROUND STUDY
A consensus mechanism in Blockchain technology plays
a crucial role in the block’s validation stage as this block
validation is one of the most power-hungry stages of adding
transaction data to the blockchain ledger [4] (refer to Fig. 1).
In this stage, a specific consensus mechanism is applied to the
block to check its validity. This consensus mechanism is the
primary cause of energy consumption at this block validation
stage. At the same time, blockchain networks cannot function properly without consensus algorithms.
Now, a wide range of application scenarios exists, and different consensus mechanisms are used to validate blocks in them. Among these, Proof of Work (PoW), Practical Byzantine Fault Tolerance (PBFT), Proof of Stake (PoS), and Proof of Activity (PoA) are some of the most popular, but they are not suitable for IoT-based applications [4]. Hence, there is always a temptation in this research area to develop a new consensus mechanism that resolves some specific contradiction for the efficient integration of IoT and Blockchain. In recent years, researchers have made significant efforts to develop consensus algorithms that are more IoT-friendly, i.e., lighter and less power-consuming. In this context, different consensus algorithms, e.g., Proof of Elapsed Time (PoET), Practical Byzantine Fault Tolerance (PBFT), Proof-of-Authentication (PoAh) [4], Proof of Importance (PoI), Tangle, Stellar, Ripple, etc., have been suggested for diverse application environments [4], [5].
So, it is found that most of the research is happening at the consensus level, with researchers applying the same consensus to all blocks uniformly, and this is the problem we want to highlight here. Looking at the problem more specifically, we have noticed that, for any IoT-based application, the data generated by the application does not always have the same importance. Only occasionally do the nodes sense a different state of the data, caused by certain changes in the application's environment, and it is precisely these cases that interest us. Hence, the importance of the data varies with time in a specific application. Since the importance of the data generated by the application is not uniform, applying the same consensus uniformly to all blocks is not required either. Block validation is a resource-intensive stage, and the applied consensus mechanism is the key part of this stage. Thus, by changing the consensus dynamically at the edge based on the content of the block's data, we can make this stage energy efficient.
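To make the idea concrete, the sketch below shows one hypothetical mapping from a block's predicted sensitivity to a consensus mechanism. The paper does not prescribe a specific mapping; the mechanisms named here are only placeholders of increasing validation cost, drawn from those mentioned above.

```python
# Hypothetical policy: map predicted block sensitivity to a consensus
# mechanism of increasing validation cost. The specific assignments are
# illustrative assumptions, not prescribed by the paper.
CONSENSUS_BY_SENSITIVITY = {
    0: "PoAh",  # normal data: lightweight proof-of-authentication
    1: "PBFT",  # sensitive data: stronger agreement among validators
    2: "PoW",   # critical data: most expensive, most robust validation
}

def choose_consensus(block_sensitivity: int) -> str:
    """Select the consensus mechanism for a block, given its predicted label."""
    return CONSENSUS_BY_SENSITIVITY[block_sensitivity]
```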
Now, to formulate the problem, let us assume an incoming data block $DB_i$, where $i = 1, 2, 3, \ldots, k$ indexes the blocks generated from the IoT network. Each block consists of multiple transactions. For data block $DB_i$, the $j$-th transaction is denoted $TX_{ij}$, and the block's transactions are $[TX_{i1}, TX_{i2}, \ldots, TX_{im}]$, where $j = 1, 2, 3, \ldots, m$ indexes the transactions placed in a single block; both $i, j \in \mathbb{N}$. In IoT-blockchain-based applications, each transaction $TX_{ij}$ consists of values for a set of $n$ selected features collected from the IoT nodes, i.e., for the $j$-th transaction, $TX_{ij} = [f_{ij1}, f_{ij2}, \ldots, f_{ijn}]$. Here, $k$, $m$, and $n$ are non-negative integers, i.e., $k, m, n \in \mathbb{N}$.

Now, for this problem scenario, let us assume that we have data sensitivity information up to the $p$-th block $DB_p$. This implies that up to the $p$-th block, we know the value of $DS_p$ for all $p$ with $0 \le p < i$. The data sensitivity of the $p$-th block's data for a specific transaction $j$ is denoted $DS_{pj}$. The data sensitivity values that a block or transaction can take are fixed and depend on the problem; $DS_{pj}$ takes non-negative integer values, i.e., $DS_{pj} \in \mathbb{N}$. Thus, we have the information in the following tuple form: $[TX_{pj}, DS_{pj}]$, where $DS_{pj}$ is the data sensitivity label for transaction $TX_{pj}$ of the $p$-th data block.

Now, the problem can be stated as follows: given the previous transaction data (up to the $p$-th block) along with its sensitivity information, i.e., $[f_{pj1}, f_{pj2}, \ldots, f_{pjn}, DS_{pj}]$ for all $p, j, n$ with $p < i$ (the current block) and $p, j, n \in \mathbb{N}$, we have to predict $DS_{ij}$ for $TX_{ij}$, where the $i$-th block is the currently received block $DB_i$.
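As a concrete illustration of this formulation, a minimal sketch of the data layout follows; the array names and sizes are hypothetical and only mirror the symbols defined above (they are reused in later sketches).

```python
import numpy as np

# Hypothetical shapes: each transaction TX_pj is a row of n feature values,
# and each labeled historical transaction is the tuple [TX_pj, DS_pj].
n_features = 8          # n: features per transaction (e.g., temperature, humidity, ...)
m_transactions = 500    # m: transactions in the currently received block

# Historical labeled data up to block p: features and sensitivity labels.
TX_history = np.random.rand(10_000, n_features)    # rows are transactions TX_pj
DS_history = np.random.randint(0, 3, size=10_000)  # labels DS_pj in {0, 1, 2}

# Currently received block DB_i: transactions whose labels DS_ij must be predicted.
TX_current_block = np.random.rand(m_transactions, n_features)
```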
III. PROPOSED SOLUTION
A. Mapping of the problem in the machine learning domain
To identify the sensitivity of the block's data, an AI layer is introduced after block creation (Fig. 1) to determine the sensitivity of the block. In this AI layer, different ML models can be applied. ML encompasses a diverse set of algorithms, ranging from comprehensive problem-solving strategies to particular feature extraction methodologies. ML algorithms are often divided into four categories: 1) supervised learning, 2) unsupervised learning, 3) semi-supervised learning, and 4) reinforcement learning [6].
In this study, when an edge node receives a block, the data in it gets our primary attention. Our current research focuses primarily on the transaction part $TX_{ij}$ of the block's data. When a newly created block is received, its data part carries no data sensitivity information $DS_{ij}$ for the transactions in it, nor for the overall block $DS_i$. Here, we want to predict $DS_{ij}$ for a specific transaction $TX_{ij}$, which in turn provides us with the data sensitivity of the $i$-th block, $DS_i$. We can obtain the value of $DS_i$ by considering the predictions $DS_{ij}$ of all the transactions $TX_{ij}$ present in the block, where $j = 1, 2, 3, \ldots, m$ and $j \in \mathbb{N}$. Now, three kinds of scenarios might occur based on the availability of data sensitivity information in the previous blocks.

The first scenario is when we know the data sensitivity of all the previous blocks and all the transactions in them. This is a supervised machine learning scenario, where we predict labels for the new incoming data based on previous training data. The next scenario is when labels are not available for all the previous blocks but only for a small percentage of them. This is a use case of semi-supervised learning, where only a small percentage of the data is labeled. In the last scenario, on the contrary, we may have no information (labeled data) about any of the previous blocks' sensitivity. This case is unsupervised: without any labels, data groups are created based on the features, and it falls under clustering because similar data are grouped together.
Now, being specific to the problem, the data sensitivity of a block $DS_i$ can be put into three broad categories: normal data, sensitive data, and critical data, based on the data generation at the IoT node [7]. This is identified by looking at the percentage of transactions with each sensitivity level $DS_{ij}$: if a specific percentage of the transaction data falls under a specific sensitivity level, the block's data sensitivity takes the corresponding value (refer to Algorithm 1). Therefore, our primary goal is to map the block's data to one of these categories. So, it can be considered a multiclass classification problem of machine learning, where we predict, based on labeled training data, whether a block contains normal, sensitive, or critical information. But these category values, i.e., normal, sensitive, and critical, are categorical data, and such data cannot be directly used in a prediction problem. Hence, the categorical data are mapped to numerical data to bring this problem into the domain of machine learning. In this problem, $DS_{ij} \in \{0, 1, 2\}$, where 0 represents normal data, 1 represents sensitive data, and 2 represents critical data; this makes it clear that these are three different outputs representing three different classes that we need to predict accurately. This is where we have applied categorical encoding. Here, the more accurately we can predict the nature of the block's data, the more efficient the block validation stage can become.
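A minimal sketch of this encoding step follows; an explicit mapping is used (rather than an alphabetical label encoder) so that the integers match the paper's convention.

```python
# Map the categorical sensitivity labels to the integers used in the paper:
# normal -> 0, sensitive -> 1, critical -> 2. An explicit dict keeps the
# ordering consistent with the paper rather than alphabetical order.
SENSITIVITY_CODES = {"normal": 0, "sensitive": 1, "critical": 2}

labels = ["normal", "critical", "sensitive", "normal"]
encoded = [SENSITIVITY_CODES[label] for label in labels]
print(encoded)  # [0, 2, 1, 0]
```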
B. Machine Learning Model Selection for AI Layer
From the above discussion, one can understand that supervised, unsupervised, and semi-supervised approaches are tested in this research from a practical-applicability point of view. To be concise, real-time IoT systems do not support all branches of ML, as they need to satisfy real-time restrictions [8]. According to the survey in [8], the most representative and widely used supervised learning algorithms in real-time IoT systems are KNN, Random Forest, SVM, Decision Tree, Naive Bayes, Boosting, etc. Now, to solve the present problem, the classification algorithms that we focus on here are Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), and Gradient Boosting-based LightGBM [9]. We will now examine why these machine learning algorithms are chosen over the others.
The first model under consideration is SVM, which is primarily designed for binary classification but is capable of multiclass classification by applying the same principle, i.e., by breaking the multiclass problem down into multiple binary classification problems. The next simple yet efficient machine learning model of our choice is KNN, a non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. It can be used for multiclass classification as well. The drawback of this model is that it cannot handle high-dimensional data efficiently, as its complexity rises with dimensionality. However, in our case, the dimensionality of the data is small compared to the input size; hence, we have included it in our experiment. Further, the bagging-based Random Forest (RF) is considered, an ensemble learning approach for classification, regression, and other problems that works by building a large collection of decision trees during training. Random Forest is more suited to distributed computing than boosting algorithms and is also easier to tune. Thus, this model also fits our requirements.
The next machine learning models we have considered are based on boosting algorithms, which combine a group of weak learners into one strong learner to make predictive data analysis more accurate. Because these models teach themselves to correct each other's mistakes, they can detect complicated patterns in data. Among the different boosting algorithms, the gradient boosting-based LightGBM, CatBoost, and XGBoost are three that outperform most other models in practice. Of these three, the histogram-based LightGBM is the most efficient in our experiment scenario [10]; it is the lightest and quickest of the boosting algorithms. Although CatBoost and XGBoost also show a significant improvement over GBM, they still tend to lag behind LightGBM [11], which achieves its speed without much compromise in accuracy. Thus, for this study, LightGBM is considered.
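A minimal sketch of instantiating and training the four supervised candidates with scikit-learn and the LightGBM Python package; the hyperparameters are illustrative defaults, not the paper's tuned settings, and TX_history, DS_history, and TX_current_block are the hypothetical arrays from the earlier sketch.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

# The four supervised candidates; all handle the 3-class problem directly
# (SVC decomposes multiclass into binary subproblems internally).
models = {
    "K-SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=10),  # forest size 10, as in Sec. IV
    "LightGBM": LGBMClassifier(),
}

for name, model in models.items():
    model.fit(TX_history, DS_history)         # labeled transactions from previous blocks
    preds = model.predict(TX_current_block)   # predicted DS_ij for the new block
```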
When the amount of labeled data is extremely low or labels are absent altogether, the approach to identifying the different data types is clustering. For the present study, the number of clusters is predefined as 3. Given this, the most suitable algorithm for this kind of scenario is K-Means clustering. It has many advantages, including guaranteed convergence, scalability to large data sets, and ease of implementation. It can also warm-start the centroids' locations and readily adjust to incoming instances, and it can generalize to clusters of various shapes and sizes, such as elliptical, oval, and circular clusters. Thus, K-Means is theoretically one of the best fits when clustering is applied to group similar data based on the features present in the transaction data. In practice, however, these outcomes are often inaccurate.
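A minimal sketch of this clustering step with scikit-learn, reusing the hypothetical arrays from the earlier sketch:

```python
from sklearn.cluster import KMeans

# Unsupervised case: no labels are available, so group the transactions into
# the three predefined sensitivity clusters (K = 3).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(TX_history)

# The cluster IDs are arbitrary integers; a domain expert must map each
# cluster to normal / sensitive / critical before it can serve as a label.
```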
Semi-supervised learning is the next category of learning algorithm, applicable when the available training labels are scarce, i.e., only a small amount of the data is labeled and the rest is unlabeled. Here, the sensitivities of the unlabeled data are predicted using known supervised learning algorithms after training on the small available labeled set. The same algorithms mentioned above for the supervised case are also utilized for this category of learning. The main benefit of this learning approach is that it can obtain state-of-the-art results on major problems using just a fraction of the labeled data.
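One way to realize this in code is scikit-learn's built-in self-training wrapper, sketched below under that library's convention that unlabeled samples carry the label -1; this is an assumption about tooling, not the paper's own procedure, which is sketched separately in Section IV-C.

```python
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Keep labels for only 10% of the historical data; mark the rest unlabeled (-1).
y_partial = DS_history.copy()
y_partial[int(0.10 * len(y_partial)):] = -1

# Wrap a supervised base learner in a self-training loop.
semi = SelfTrainingClassifier(KNeighborsClassifier(n_neighbors=5))
semi.fit(TX_history, y_partial)
pred = semi.predict(TX_current_block)   # predicted DS_ij for the new block
```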
C. Working Model
After receiving the block at an edge node, the AI layer starts its function. First, the AI layer checks the computing capability of the edge node. This capability includes processing power, primary memory (RAM), secondary memory (local storage), and power source type (battery or connected power source). Based on the capability of the edge node and the requirements of the IoT application, an appropriate ML algorithm is chosen.
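A hypothetical sketch of such a capability probe using the psutil library; the thresholds and the model-selection policy are illustrative assumptions, not values given in the paper.

```python
import psutil

# Probe the edge node's resources (illustrative; thresholds are assumptions).
ram_gb = psutil.virtual_memory().total / 2**30
storage_gb = psutil.disk_usage("/").total / 2**30
battery = psutil.sensors_battery()          # None on nodes without a battery
on_battery = battery is not None and not battery.power_plugged

# Illustrative policy: prefer the cheapest predictor on weak or battery nodes.
if on_battery or ram_gb < 2:
    chosen_model = "KNN"   # fastest prediction (see Sec. IV results)
else:
    chosen_model = "RF"    # balanced accuracy/latency tradeoff
```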
Fig. 2. Procedure for checking the data validity of the block with the proposed AI layer.

When transactions with data sensitivity labels are available, after the selection of the ML model (i.e., supervised or semi-supervised), the model is trained with the local data. Once the ML model has been trained on the labeled tuples $[TX, DS]$, the newly received block's transaction data sensitivities are predicted using this trained model. Now, in a single block, the number of transactions can range from a few hundred to thousands.
The trained machine learning model predicts the sensitivity of all these transactions in real time. After the sensitivity of all the transactions has been determined, the block's sensitivity is identified according to Algorithm 1. Here, if more than 5% of the data within a block is found to be sensitive or critical, the whole block's data gets that label; this percentage threshold is chosen because, in general, outliers account for no more than 5% of the data. However, if both critical and sensitive data are present within the same block, we label the data as critical, as it has more importance in real-world scenarios. Then, based on the sensitivity identified for the block by the ML algorithm, the consensus is chosen appropriately. The whole procedure for block validation and the role of the AI layer is shown in Fig. 2, where the different processes in the diagram represent the functionalities of the AI layer.
Now, if no transactions, or only a severely low percentage of the dataset, have a data sensitivity label, then the process is straightforward. Since the number of clusters is predefined in our case, i.e., 3, we can apply K-Means clustering with K = 3. In this way, three separate clusters are created from all the available data. With expert help, these three groups are identified as normal, sensitive, or critical. Following this, we have data sensitivity values for the previous transactions, and by utilizing these labels when a new block is created and applying Algorithm 1, we can identify that block's sensitivity.
D. Performance Metrics
The prediction performance needs to be measured to assess the machine learning models' effectiveness. The accuracy metric is one of the most fundamental and widely used ones.
Algorithm 1 Block Data Sensitivity Identification
Input: Predicted data sensitivity for all transactions, i.e., $[TX_{ij}, DS_{ij}]$, $\forall (i, j)$, where $(i, j) \in \mathbb{N}$
Output: Overall sensitivity of the current block, i.e., $DS_i$
 1: $DS_i \leftarrow 0$              // sensitivity of the current block
 2: $sen\_count \leftarrow 0$        // sensitive transaction count
 3: $cri\_count \leftarrow 0$        // critical transaction count
 4: $FS \leftarrow 0$                // flag for sensitive
 5: $FC \leftarrow 0$                // flag for critical
 6: $th \leftarrow (m \times 5)/100$ // threshold set at 5% of the transaction count
 7: for $j \leftarrow 1$ to $m$ do   // m is the number of transactions in the block
 8:   if $DS_{ij} = 1$ then
 9:     $sen\_count \leftarrow sen\_count + 1$
10:   end if
11:   if $DS_{ij} = 2$ then
12:     $cri\_count \leftarrow cri\_count + 1$
13:   end if
14: end for
15: if $sen\_count > th$ then
16:   $FS \leftarrow 1$
17: end if
18: if $cri\_count > th$ then
19:   $FC \leftarrow 1$
20: end if
21: if $FC = 1 \wedge FS = 1$ then
22:   $DS_i \leftarrow 2$
23: else if $FC = 1$ then
24:   $DS_i \leftarrow 2$
25: else if $FS = 1$ then
26:   $DS_i \leftarrow 1$
27: else
28:   $DS_i \leftarrow 0$
29: end if
30: return $DS_i$
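For reference, an equivalent rendering of Algorithm 1 in Python, a direct transcription with the 5% threshold computed from the block's transaction count:

```python
def block_sensitivity(ds_predictions):
    """Algorithm 1: derive the block label DS_i from the predicted per-
    transaction sensitivities DS_ij (0 = normal, 1 = sensitive, 2 = critical)."""
    m = len(ds_predictions)
    th = m * 5 / 100                                      # threshold: 5% of transactions
    sen_count = sum(1 for d in ds_predictions if d == 1)  # sensitive transactions
    cri_count = sum(1 for d in ds_predictions if d == 2)  # critical transactions
    if cri_count > th:   # critical dominates, whether or not sensitive is also flagged
        return 2
    if sen_count > th:
        return 1
    return 0

print(block_sensitivity([0] * 90 + [2] * 10))  # 10% critical -> block is critical (2)
```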
Here, we have used accuracy in percentage to measure the performance of our different supervised multiclass classification models. To further verify the accuracy metric's results, we have measured each model's performance with 10-fold cross-validation (10F-CV) and its standard deviation [12].
We have also calculated the F1 Score, the harmonic mean
of precision and recall, to understand the machine learning
model’s performance. Along with the F1 Score, the other
two metrics considered here are the weighted average, which
averages the support-weighted mean per label, and the macro
average, which averages the unweighted mean per label.
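A minimal sketch of computing these metrics with scikit-learn; here, model is any trained classifier from the earlier sketches, and X_test / y_test denote a hypothetical held-out split.

```python
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score

# 10-fold cross-validation score and its standard deviation.
cv_scores = cross_val_score(model, TX_history, DS_history, cv=10)
print(f"10F-CV: {cv_scores.mean():.4f} (std {cv_scores.std():.4f})")

# Accuracy plus macro- and weighted-averaged F1 on held-out data.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Macro-avg F1:", f1_score(y_test, y_pred, average="macro"))        # unweighted mean per label
print("Weighted-avg F1:", f1_score(y_test, y_pred, average="weighted"))  # support-weighted mean
```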
This block data sensitivity can be theoretically compared with the framework for data-driven global sensitivity analysis [13] to solidify the concept of the paper; this comparison will be incorporated in the future for the transaction-data-level sensitivity analysis phase. In the next section, the implementation is discussed in detail and the results are analyzed.
TABLE I
DATA DESCRIPTION FOR SYNTHETIC DATA USED IN THE IOT-BASED SMART BUILDING CASE STUDY

Feature Name          Normal       Sensitive    Critical
Temperature (°C)      18-30        30-45        >45
Humidity (%)          35-65        35-65        20-50
AQI (US)              60-80        81-150       151-500
Noise (dB)            55-70        71-80        >80
Occupancy (binary)    0 / 1        0 / 1        0 / 1
CO (PPM)              0-15         16-30        >30
CO2 (PPM)             600-1000     1001-3000    3001-10000
Flame data (binary)   0            0            1
IV. IMPLEMENTATION & RESULT ANALYSIS
A. Data Set
We used the smart building context and associated data for this research, with our primary focus being indoor data. For measuring the indoor condition, the following features were considered: temperature, humidity, air quality index (AQI), noise, occupancy, carbon monoxide (CO), carbon dioxide (CO2), and flame data, following [14], [12]. Based on our experience, domain knowledge, and the data used in [15], we created a synthetic dataset for this experiment. As mentioned in [12], three months of real-world data were collected, and from these, the typical ranges of temperature, humidity, air quality index, noise level, occupancy, CO, CO2, and flame sensor data were determined. Our dataset consists of 10,000 training instances labeled appropriately with the nature of the sensitivity, i.e., normal as 0, sensitive as 1, and critical as 2. A detailed description of the data is summarized in Table I.
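A sketch of how such a synthetic dataset could be generated from the ranges in Table I; the class proportions and the upper bounds substituted for the open-ended ">" ranges are illustrative assumptions, since the paper's generator is not published.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Value ranges per class from Table I; open-ended ">x" ranges are capped at
# assumed maxima (e.g., >45 °C sampled as 45-60) purely for sampling purposes.
RANGES = {
    0: {"temp": (18, 30), "hum": (35, 65), "aqi": (60, 80),   "noise": (55, 70),  "co": (0, 15),  "co2": (600, 1000),   "flame": 0},
    1: {"temp": (30, 45), "hum": (35, 65), "aqi": (81, 150),  "noise": (71, 80),  "co": (16, 30), "co2": (1001, 3000),  "flame": 0},
    2: {"temp": (45, 60), "hum": (20, 50), "aqi": (151, 500), "noise": (80, 110), "co": (30, 60), "co2": (3001, 10000), "flame": 1},
}

def sample_class(n, cls):
    """Draw n synthetic transactions for one sensitivity class."""
    r = RANGES[cls]
    return pd.DataFrame({
        "temperature": rng.uniform(*r["temp"], n),
        "humidity":    rng.uniform(*r["hum"], n),
        "aqi":         rng.uniform(*r["aqi"], n),
        "noise":       rng.uniform(*r["noise"], n),
        "occupancy":   rng.integers(0, 2, n),
        "co":          rng.uniform(*r["co"], n),
        "co2":         rng.uniform(*r["co2"], n),
        "flame":       r["flame"],
        "sensitivity": cls,
    })

# Assumed class mix summing to the paper's 10,000 instances.
data = pd.concat([sample_class(8000, 0), sample_class(1500, 1), sample_class(500, 2)],
                 ignore_index=True)
```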
B. Implementation Setup
For the experiment, we considered the latest Raspberry Pi 4 Model B as the edge node, which has a high-performance 64-bit quad-core Cortex-A72 processor, 8 GB of LPDDR4 RAM, and built-in 2.4 GHz and 5.0 GHz IEEE 802.11ac wireless support. The Raspberry Pi Foundation specifically selected Python as the main language because of its power, versatility, and ease of use. Thus, for implementing the machine learning algorithms, we used Python version 3.9.
C. Results & Discussions
Here, in the AI layer, we have considered the problem of block data grouping as a multiclass classification problem and tried to solve it with suitable supervised machine learning options, i.e., K-SVM, KNN, RF, and LightGBM, when working with labeled transaction data. It is found that RF outperformed all the other models even with a forest size of 10. The 10-fold cross-validation score and the F1 score also support this. The standard deviation of the 10F-CV score is also the lowest for RF, which signifies that it is a consistent performer among its competitors, with a 98.68% 10F-CV score; the detailed results are shown in Table II. Among all the classifiers, K-SVM could not perform well in this context, which is visible from its 10-fold cross-validation score and its standard deviation. However, LightGBM's performance is very close to that of RF, and KNN also performs very well, with a 10F-CV score of 94.42%, as shown in Fig. 3.

Fig. 3. Performance comparison of different supervised machine learning models for multiclass classification.

Fig. 4. Comparison of the execution times of different supervised machine learning algorithms.
Now, since the whole experiment takes place at the edge, and edge nodes are generally resource-constrained devices, the average prediction time is required along with accuracy: it not only tells us about the real-time handling capability of the model but is also proportional to the power consumption of the node. Here, it is found that KNN takes the least time among all the algorithms, i.e., 0.029 seconds, averaged over 100 predictions of different transaction data using the same model. Notably, KNN is 7.5 times faster than LightGBM and two times faster than RF with the same resources. K-SVM turns out to be the most time-consuming in terms of prediction, and the most resource-demanding as well. The details are shown in Fig. 4 for a clear pictorial representation.
After supervised learning, we tried to solve the problem using the unsupervised learning approach, K-Means. Since we have the data labels available, after clustering we verified whether the data points had landed in the proper clusters and found the average accuracy to be 20%, with an average execution time of 0.107 seconds. Hence, performance was abysmal in terms of accuracy.
To account for real-world scenarios, based on our experience with the supervised models, we considered the K-SVM, KNN, RF, and LightGBM models for semi-supervised learning. We experimented with 10% of the training data having labels; the labels of the remaining 90% were predicted using the initially trained model. This fully labeled dataset was then used to predict new, unseen data in real time. The results are shown in Table III.

Fig. 5. Performance comparison of different semi-supervised algorithms (after Phase 2).
Now, utilizing the Phase 1 information in Table III, it is found that K-SVM is not worth considering, as it is effective neither from an execution-time viewpoint nor from an accuracy viewpoint. Thus, the algorithms to consider for the next-level prediction are KNN, RF, and LightGBM. We applied these three semi-supervised models to a new, unseen dataset. The experiment found that LightGBM's 10F-CV score of 99.26% is very close to RF's 99.12%, and KNN, being the fastest one, also achieves 94.2% on the same 10F-CV metric. The results are shown in Table III and visualized in Fig. 5.
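A minimal sketch of this two-phase procedure as we read it (Phase 1: train on the 10% labeled slice and pseudo-label the rest; Phase 2: retrain on the fully labeled set), reusing the hypothetical arrays from Section II:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_labeled = int(0.10 * len(TX_history))     # 10% of transactions carry labels

# Phase 1: train on the small labeled slice, then pseudo-label the other 90%.
phase1 = RandomForestClassifier(n_estimators=10)
phase1.fit(TX_history[:n_labeled], DS_history[:n_labeled])
pseudo = phase1.predict(TX_history[n_labeled:])

# Phase 2: retrain on actual + predicted labels, then score new blocks.
full_labels = np.concatenate([DS_history[:n_labeled], pseudo])
phase2 = RandomForestClassifier(n_estimators=10)
phase2.fit(TX_history, full_labels)
new_block_preds = phase2.predict(TX_current_block)
```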
D. Summary of the Results
From the above discussions, it is found that the AI layer (Fig. 1) can identify a block's data sensitivity with high accuracy. If the machine learning model in the AI layer is trained with labeled data, i.e., if all the previous data has labels (the supervised learning approach), KNN, RF, and LightGBM are the options that can be considered for running at the edge node. If the edge node has limited resources or a fast prediction is required, KNN is the best option of the three. Otherwise, RF can be chosen as a balanced tradeoff between the model's accuracy and its prediction time. Because of its low accuracy and considerably longer training time, K-SVM is discarded from the recommendation list.

Next, if labels are not available for all the data, i.e., only a small fraction of the data has labels, then semi-supervised machine learning models are helpful. Here also, for faster performance or for managing resource constraints, KNN is the choice. Otherwise, LightGBM marginally outperforms RF; hence, if resource availability is not an issue, either LightGBM or RF can be chosen, as they are very close in terms of performance. However, RF delivers results faster than LightGBM in the semi-supervised setting as well.
If the data has no labels at all, cluster identification with unsupervised learning does not help much. The reason is that grouping similar data, i.e., cluster formation, is not effective for label prediction, as the accuracy of the clustering model is below standard in this kind of scenario.
TABLE II
DETAILED PERFORMANCE COMPARISON OF SUPERVISED MACHINE LEARNING MODELS FOR MULTICLASS CLASSIFICATION

Classifier   Accuracy   10F CV   10F CV Std. Dev.   F1 Score   Macro Avg.   Weighted Avg.
K-SVM        71.2       70.97    2.24               67.3       51           67.3
KNN          94.5       94.42    0.91               94.59      93.7         94.6
RF           98.5       98.68    0.36               98.5       98.8         98.5
LightGBM     98.35      98.32    0.72               98.35      98.7         98.4
TABLE III
PERFORMANCE OF SEMI-SUPERVISED LEARNING ON NEW TRANSACTION DATA. PHASE 1: DIFFERENT MACHINE LEARNING MODELS' PERFORMANCE WHEN 10% OF THE TRANSACTION DATA HAS LABELS. PHASE 2: ACTUAL AND PREDICTED LABELS ARE USED TO TRAIN SEMI-SUPERVISED MODELS TO PREDICT LABELS FOR NEW TRANSACTION DATA

Classifier   Accuracy   10F CV   10F CV Std. Dev.   F1 Score   Macro Avg.   Weighted Avg.
Phase 1
K-SVM        59.4       57.5     4.18               54.76      41.9         54.8
KNN          90.53      86.5     3.61               90.64      89.4         90.6
RF           97.89      96.62    2.5                97.89      98.3         97.9
LightGBM     97.87      97.25    1.22               97.88      98.1         97.9
Phase 2
KNN          88.38      94.2     0.72               89.17      86.5         87.6
RF           98.69      99.12    0.38               98.69      98.0         98.7
LightGBM     98.12      99.26    0.17               98.13      97.2         98.1
V. CONCLUSION
This paper proposes an AI layer between block creation
and block validation to identify the sensitivity of the block’s
data in Blockchain-integrated IoT networks. This is achieved
by incorporating machine learning at the edge. Several su-
pervised, unsupervised, and semi-supervised machine learning
models are thoroughly investigated in this paper. It is found
that specific supervised and semi-supervised learning models are
suitable at the edge to understand the sensitivity of the block’s
data with fairly high accuracy. Then, after identifying a block’s
data sensitivity, an appropriate block validation technique can
be applied. Thus, instead of uniformly applying the same vali-
dation technique to all the blocks with different types of data,
a different validation technique can be chosen dynamically
based on the nature of the data in the block. This facilitates
the system’s scalability.
ACKNOWLEDGMENT
The statements made herein are solely the responsibility of
the authors. This publication is based upon work supported by
Khalifa University under Award No. FSU-2022-018.
REFERENCES
[1] Statista, “Number of internet of things (iot) connected devices worldwide
from 2019 to 2021, with forecasts from 2022 to 2030,” 2022. [Online;
accessed 26-September-2022].
[2] Gartner Report, “Forecast it services for iot,” 2021. [Online; accessed
26-September-2022].
[3] D. Puthal, N. Malik, S. P. Mohanty, E. Kougianos, and G. Das,
“Everything you wanted to know about the blockchain: Its promise,
components, processes, and problems,” IEEE Consumer Electronics
Magazine, vol. 7, no. 4, pp. 6–14, 2018.
[4] D. Puthal, S. P. Mohanty, P. Nanda, E. Kougianos, and G. Das, “Proof-
of-authentication for scalable blockchain in resource-constrained dis-
tributed systems,” in 2019 IEEE International Conference on Consumer
Electronics (ICCE), pp. 1–5, 2019.
[5] M. Salimitari, M. Chatterjee, and Y. P. Fallah, “A survey on consensus
methods in blockchain for resource-constrained iot networks,” Internet
of Things, vol. 11, p. 100212, 2020.
[6] F. Zantalis, G. Koulouras, S. Karabetsos, and D. Kandris, “A review
of machine learning and iot in smart transportation,” Future Internet,
vol. 11, no. 4, 2019.
[7] B. Sahoo, S. Rath, and D. Puthal, “Energy efficient protocols for wireless
sensor networks: A survey and approach,” International Journal of
Computer Applications, vol. 44, no. 18, pp. 43–48, 2012.
[8] J. Bian, A. A. Arafat, H. Xiong, J. Li, L. Li, H. Chen, J. Wang, D. Dou,
and Z. Guo, “Machine learning in real-time internet of things (iot)
systems: A survey,” IEEE Internet of Things Journal, vol. 9, no. 11,
pp. 8364–8386, 2022.
[9] S. Rajora, D.-L. Li, C. Jha, N. Bharill, O. P. Patel, S. Joshi, D. Puthal,
and M. Prasad, “A comparative study of machine learning techniques
for credit card fraud detection based on time variance,” in 2018 IEEE
Symposium Series on Computational Intelligence (SSCI), pp. 1958–
1963, IEEE, 2018.
[10] M. Osman, J. He, F. M. M. Mokbal, N. Zhu, and S. Qureshi, “Ml-lgbm:
A machine learning model based on light gradient boosting machine for
the detection of version number attacks in rpl-based networks,” IEEE
Access, vol. 9, pp. 83654–83665, 2021.
[11] T. Yanabe, H. Nishi, and M. Hashimoto, “Anomaly detection based on
histogram methodology and factor analysis using lightgbm for cooling
systems,” in 2020 25th IEEE International Conference on Emerging
Technologies and Factory Automation (ETFA), vol. 1, pp. 952–958,
2020.
[12] J. Dutta and S. Roy, “Occupancysense: Context-based indoor occupancy
detection prediction using catboost model,” Applied Soft Computing,
vol. 119, p. 108536, 2022.
[13] K. Ye, J. Zhao, C. Huang, N. Duan, Y. Zhang, and T. E. Field, “A data-
driven global sensitivity analysis framework for three-phase distribution
system with pvs,” IEEE Transactions on Power Systems, vol. 36, no. 5,
pp. 4809–4819, 2021.
[14] J. Dutta and S. Roy, “Indoorsense: context based indoor pollutant
prediction using sarimax model,” Multimedia Tools and Applications,
vol. 80, no. 13, p. 19989–20018, 2021.
[15] Y. Liang, C. Lu, Y. Zhao, and C. Sun, “Interference-based consensus
and transaction validation mechanisms for blockchain-based spectrum
management,” IEEE Access, vol. 9, pp. 90757–90766, 2021.