
Machine Learning Challenges for IoT Device Fingerprints Identification

Journal of Physics: Conference Series
PAPER • OPEN ACCESS
To cite this article: Vian Adnan Ferman and Mohammed Ali Tawfeeq 2021 J. Phys.: Conf. Ser. 1963 012046
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
2nd International Conference on Physics and Applied Sciences (ICPAS 2021)
Journal of Physics: Conference Series 1963 (2021) 012046
IOP Publishing
doi:10.1088/1742-6596/1963/1/012046
Vian Adnan Ferman1* and Mohammed Ali Tawfeeq2
1,2 Computer Engineering Dept., College of Engineering, Mustansiriyah University, Baghdad, Iraq
*Corresponding author, e-mail: egma018@uomustansiriyah.edu.iq, drmatawfeeq@uomustansiriyah.edu.iq
Abstract. The rapid growth in the number of Internet of Things (IoT) devices in recent years has
increased the vulnerability of IoT networks and introduced new challenges for machine learning (ML)
algorithms that detect networked devices. A Device Fingerprint (DFP) can be created by
extracting the network traffic features related to the device, excluding the identities
assigned to it. In this paper, Device Fingerprints for 20 IoT devices are created by extracting 30
features during startup operation. Wireshark Network Protocol Analyzer is used to collect the
network traffic of 8 home IoT devices, while the traffic of the remaining devices is
taken from the publicly available captures_IoT-Sentinel dataset. Four supervised machine
learning algorithms were applied and tested to detect authorized devices and isolate unknown
devices, namely: Support Vector Machine (SVM), Decision Tree (DT), Ensemble Random
Forest (RF), and Gradient Boosting Classifier (GBC). The Random Forest model and the Gradient
Boosting Classifier both achieved the best results, about 98.8% average overall
accuracy, only slightly higher than the accuracy of the Decision Tree. A voting classifier
was then applied using the three estimators with the highest accuracy (DT, RF, and GBC),
achieving 99.5% average overall accuracy.
Keywords: Gradient Boosting Classifier, IoT device fingerprint, network traffic, Random
Forest, Voting classifier.
1. Introduction
The Internet of Things (IoT) builds on the basic concept of machine-to-machine connection and extends
it outward by creating large networks of devices that connect through cloud platforms.
An IoT device is therefore defined as any device with a dedicated address that allows it to
connect to the network via Wi-Fi, Bluetooth, or another technology, such as the smart devices in smart
homes, smart cities, smart industry, smart transportation, and medical environments [1]. Moreover, IoT
devices have other distinctive characteristics such as heterogeneity, inter-connectivity, ultra-reliable
communication with low latency, low-power and low-cost communication, dynamic network
adaptation, and intelligence [2, 3]. These features have increased people's inclination toward smart things in
all aspects of their lives and have enabled markets to support a massive number of IoT devices of
different types. Experts have predicted that the number of IoT devices will reach nearly 125 billion in 2030 [4].
As these devices spread, the risks to IoT networks increase. Besides, because
IoT devices are connected to each other, the protection of one device also depends on the security
of other devices, and its vulnerabilities can cascade to the entire IoT system [5]. In 2016,
the Mirai botnet launched a series of Distributed Denial of Service attacks with over 100k
compromised IoT devices. The first step in protecting IoT networks from such attacks is therefore to
examine the network traffic coming from real-world IoT devices and identify them [6][7][8][9].
For this reason, knowing which devices are connected to IoT networks has become an
active research topic.
A device fingerprint (DFP) is a unique signature created, either actively or passively, for each
device from network traffic data. The features used to create signatures should not be
open to tampering and should not change with the device's mobility. Moreover, these features must be
difficult to guess, so assigned addresses (e.g., International Mobile Equipment Identity, media access
control, and internet protocol addresses) should not be taken into consideration [10, 11].
In [12], a DFP is created using device-originated network traffic from two public datasets. Machine
learning (ML) classification algorithms are applied to categorize individual device types, achieving
average accuracies of 83.35% and 97.7% on the two datasets. In [13], ML
classification algorithms are used to classify devices' events and interactions such as
locking/unlocking and ON/OFF. In [14], multiclass classification is applied to classify the traffic
generated during user interaction with every IoT application (application-to-cloud connection).
In [15], a Bag of Words technique is presented that identifies 31 of the 33 IoT devices
tested. The proposed method extracts textual information about the devices
from IoT network traffic and then creates a vector of unique textual features for each device. To detect a
newly connected device, the similarity between vectors is checked. In [16], a network protocol
packet keyword query is used to recognize IoT devices. Traffic packets are analyzed to extract
network protocol data. The result is then filtered, and irrelevant invisible information and unrelated
visible characters are removed, to find IoT device identification features such as website, brand, type,
and model.
In [17], a two-stage classification technique is presented to identify IoT devices' traffic in a smart
city. In [18], a multi-stage ML approach is developed to classify 28 distinct IoT devices. A Naive Bayes
multinomial classifier is used in the first stage to classify the extracted textual attributes, and its
result, together with the statistical attributes, is fed to a random forest classifier in the last stage.
In addition, researchers have recently focused on ML algorithms, especially random forest
(RF), for identification purposes. In [19], a dynamic deduction method is presented for IoT device
detection, classification, anomaly discovery, and health monitoring. A set of ML algorithms was
used, and RF achieved the best accuracy. In [20], a framework is presented for identifying
IoT devices and detecting malicious traffic from network traffic. Various ML algorithms were applied,
but the RF model achieved the best accuracy: around 94.5% for device type identification and 97% for
detection of abnormal traffic. In [21], the RF algorithm is applied to identify 27 device types during
the setup phase, achieving 81.5% overall accuracy.
It is evident from the previous studies that most authors tend to use multi-stage classification when
their extracted features include textual elements. Moreover, few researchers have shed light on
identifying the traffic generated during the initialization process, although this is the most important
time to recognize newly set-up IoT devices. In this paper, network traffic data are collected from 20 IoT
devices during initialization (setup and startup). Eight of them are home devices, while the
rest are taken from the publicly available captures_IoT-Sentinel dataset [22]. The collected
packets are then filtered, a set of features is extracted from each of them, and statistical
operations are used to create a passive DFP of 30 features for each device.
Four ML algorithms are applied to identify the DFPs of the authorized devices, while
unknown devices are isolated. These algorithms are Support Vector Machine (SVM), Decision Tree
(DT), Ensemble Random Forest, and Gradient Boosting Classifier (GBC). The final step is to apply a
hard voting classifier (HVC) using the estimators with the best accuracies.
The contributions of this work are:
- Extracting the numeric and textual features from the IoT device traffic during the initialization
  process, then converting all textual features to numeric ones and creating a DFP for each device.
- Applying four ML algorithms after adjusting their parameters and verifying their ability to
  identify DFPs.
- Increasing the identification accuracy by applying the HVC model, using the tested models
  that give the best identification results as its estimators.
The remainder of this paper is organized as follows: traffic collection and data analysis are described
in section 2. The proposed DFP process is explained in section 3. The machine learning algorithms and
their tuning parameters are presented in section 4. The performance evaluation and results are discussed
in section 5. Finally, section 6 concludes the paper and outlines future work.
2. Traffic collection and data analysis
To collect IoT traffic data, a Raspberry Pi 3 Model B+ is configured to work as a router. Eight
IoT home devices are installed using the Raspberry Pi as the home network. The traffic generated by
these devices is collected using Wireshark Network Protocol Analyzer (an iPhone X is used for
installing all devices and controlling them via the devices' applications, so the traffic related to the
iPhone X is not considered). The traffic generated during setup is found to be almost identical to the
traffic generated during startup. When a device wants to connect to the home network,
Extensible Authentication Protocol over LAN (EAPOL) packets are sent from the access point to that
device. EAPOL is a protocol used for network authentication between the router and the connecting
device. Since an EAPOL packet is the first packet transmitted when a device tries to connect to the
network, it is a good starting point for creating a DFP. After network authentication with a 4-way
handshake (4 EAPOL packets) is completed, a set of protocol packets (e.g., DHCP, DNS, ARP,
ICMPv6, etc.) is exchanged to complete the network connection with the access point.
DNS carries a query name, which is the device's server name. It is stable, unlike the IP address, which
may change [9], so it is a strong feature and should be taken into consideration. DNS query names may be
the same for devices from the same manufacturer, but they are completely different for distinct
manufacturers. Sometimes a single device queries multiple DNS server names. Figure 1 shows a sample of
the traffic collected from multiple devices during initialization, filtered for DNS; it shows the
differences and similarities in the devices' DNS query names.
Figure 1. Sample of DNS collected packets for multiple devices during initialization time.
Since IoT devices are designed to perform a specific function, the first TCP session of some devices is
almost fixed in the number of packets and the type of protocol used, as indicated by the source and
destination port numbers (for example, dynamic ports or well-known ports such as 80 for HTTP and
443 for HTTPS), but it varies in TCP window size and sometimes in TCP segment size. Because the
connectivity of these devices depends on network quality, the packet count of the
first TCP session is sometimes increased by unreachable or retransmitted packets. These
differences, together with some protocol packet details, are sufficient to create a different DFP for each
IoT device. Figure 2 shows a sample of the first TCP session details of the SonoFF power strip with IP
address 192.168.100.95. As shown in the figure, 13 TCP packets are exchanged within about 1.5 sec.
The SonoFF power strip opened a dynamic port (30736), while the other endpoint used a well-known port
with the HTTPS protocol.
Figure 2. Sample of SonoFF power strip's first TCP session details.
Alongside the home IoT devices, the network traffic of 12 devices is selected from the 31 devices in the
publicly available captures_IoT-Sentinel dataset (the rest were checked and excluded because their
collected traffic either contains only a few packets captured in a very short time or contains no EAPOL
packets). All IoT devices used are listed in Table 1.
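Table 1. IoT devices used in this work.

| No. | Source | Manufacturer | Device Name | Status |
|---|---|---|---|---|
| 1 | IoT home devices | SonoFF | SonoFF Power Strip (SPStrip) | Authorized |
| 2 | IoT home devices | SonoFF | SonoFF Power Plug (SPPlug) | Authorized |
| 3 | IoT home devices | SonoFF | SonoFF Bulb (SPBulb) | Authorized |
| 4 | IoT home devices | SonoFF | SonoFF Smart Switch with Temperature Sensor (SSSwitch) | Authorized |
| 5 | IoT home devices | Google Assistance | Google Home Mini (GHMini) | Authorized |
| 6 | IoT home devices | Aswar | Aswar Camera (ACamera) | Authorized |
| 7 | IoT home devices | TEKIN | TEKIN-Plug (TPlug) | Authorized |
| 8 | IoT home devices | Google | Chromecast | Authorized |
| 9 | IoT-Sentinel public traffic | D-Link | D-LinkCam (DLCam) | Authorized |
| 10 | IoT-Sentinel public traffic | D-Link | D-LinkSensor (DLSensor) | Authorized |
| 11 | IoT-Sentinel public traffic | D-Link | D-LinkSwitch (DLSwitch) | Authorized |
| 12 | IoT-Sentinel public traffic | D-Link | D-LinkWaterSensor (DLWSensor) | Authorized |
| 13 | IoT-Sentinel public traffic | Edimax | EdimaxPlug1101W | Authorized |
| 14 | IoT-Sentinel public traffic | Ednet | EdnetGateway | Authorized |
| 15 | IoT-Sentinel public traffic | TP-Link | TP-LinkPlugHS100 | Authorized |
| 16 | IoT-Sentinel public traffic | TP-Link | TP-LinkPlugHS110 | Authorized |
| 17 | IoT-Sentinel public traffic | WeMo | WeMoSwitch | Authorized |
| 18 | IoT-Sentinel public traffic | iKettle | iKettle2 | Unknown |
| 19 | IoT-Sentinel public traffic | SmarterCoffee | SmarterCoffee | Unknown |
| 20 | IoT-Sentinel public traffic | Withings | Withings | Unknown |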
3. Proposed DFP generation
The proposed DFP generation consists of two stages. In the first stage, the traffic of the new device
(the new MAC address) is examined, and feature extraction begins at the first EAPOL
packet and ends with the closing of the first TCP session (FIN flag). Table 2 shows the 25 features
extracted from each packet during this stage using the Python Scapy library. To aggregate the features
of all packets, a matrix of 25 columns (each representing a specific feature) is created with a variable
number of rows (packets). The number of rows differs because the packets are not identical
each time they are collected. The features are represented as follows:
- Assign a value of 1 to each of the following twelve logical features if it is present in the packet
  (ARP, EAPOL, IGMPv2, IGMPv3, ICMPv6, DNS (or MDNS), DHCP, UDPD, TCPHTS, TCPH, TCPD, and
  VCI); otherwise set it to 0.
- Set the other thirteen features to their values: seven features with numerical values
  (PLen, UDPDLen, TCPSLen, TCPWS, TTL, DHCPMS, and PRLLen) and six features
  with text values (MACSrc, MACDst, IPSrc, IPDst, HN, and QN).
After the feature extraction process is complete, all duplicate rows are deleted. Equation (1) shows the
matrix created in the first stage, where n denotes the number of rows, which varies, and f refers to an
extracted feature. The first stage of DFP generation is summarized in Algorithm 1.
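$$
\text{Matrix} =
\begin{bmatrix} \text{row}_1 \\ \text{row}_2 \\ \vdots \\ \text{row}_n \end{bmatrix}
=
\begin{bmatrix}
f_{1,1} & f_{2,1} & \cdots & f_{25,1} \\
f_{1,2} & f_{2,2} & \cdots & f_{25,2} \\
\vdots & \vdots & \ddots & \vdots \\
f_{1,n} & f_{2,n} & \cdots & f_{25,n}
\end{bmatrix}
\tag{1}
$$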
In the second stage, the resulting matrix is converted to a vector of 30 elements by applying
statistical operations to the extracted features, taking into account increasing the weights of
important features and merging the DHCP features into a single one. The conversion counts
the number of 1s in each column of logical features and applies the following statistical operations to
the columns of numeric elements:
- Compute the MAX, MIN, and average of PLen, TCPWS, and TTL.
- Compute the MAX and MIN of UDPDLen.
Table 2. Extracted features in the first stage.

| Type | No. of features | Features detail |
|---|---|---|
| Data Link layer | 4 | Source MAC (MACSrc), Destination MAC (MACDst), ARP protocol (ARP), and packet length if TCP (PLen) |
| Network layer | 6 | Source IP (IPSrc), Destination IP (IPDst), EAPOL, IGMPv2, IGMPv3, ICMPv6 |
| Transport layer | 4 | UDP data (UDPD), UDP data length (UDPDLen), TCP segment length (TCPSLen), TCP window size (TCPWS) |
| Application layer protocol | 2 | DNS (or MDNS), DHCP |
| IP | 1 | Time To Live (TTL) |
| TCP | 3 | TCP with HTTPS protocol (TCPHTS), TCP with HTTP protocol (TCPH), TCP with dynamic source and destination ports (TCPD) |
| DHCP | 4 | Length of DHCP Parameter Request List (PRLLen), Maximum DHCP Message Size (DHCPMS), Vendor class identifier (VCI), Host Name (HN) |
| DNS or MDNS | 1 | Query Name (QN) |
Algorithm 1. First stage of DFP generation.
packets = collected traffic data
MACDst = 0                      // The destination MAC address of the new device
Matrix = []                     // The generated matrix
counter = 0
for pkt in packets:
    F = []                      // Variable used for saving the features of this packet
    if pkt.has(EAPOL) and MACDst == 0:
        MACDst = pkt.ether_addr_dst
    if MACDst == 0:
        continue                // Skip packets seen before the first EAPOL packet
    for feature in Table 2:     // Each feature is checked with a separate condition
        if pkt.has(feature):
            if feature is logical:
                F.append(1)
            else:               // Feature is textual or numeric
                F.append(feature value)
        else:
            F.append(0)
    Matrix.insert(counter, F)   // Insert row vector F into the Matrix at the row given by counter
    counter += 1
Matrix.delete(repetitive rows)
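As a rough illustration of Algorithm 1, the sketch below uses plain Python dictionaries in place of captured packets, so it does not depend on Scapy. The packet layout, the three-feature subset, and the helper name `build_matrix` are assumptions made for this example and are not the authors' implementation.

```python
# Hypothetical subset of Table 2: two logical features and one numeric feature.
LOGICAL = ["EAPOL", "DNS"]
VALUED = ["TTL"]


def build_matrix(packets):
    """Build the first-stage matrix: one row per packet, duplicate rows removed."""
    mac_dst = None  # destination MAC of the new device, set at the first EAPOL packet
    matrix = []
    for pkt in packets:
        if mac_dst is None:
            if "EAPOL" not in pkt:
                continue  # ignore traffic seen before the first EAPOL packet
            mac_dst = pkt["dst"]
        # Logical features become 1/0; valued features keep their value or 0.
        row = [1 if name in pkt else 0 for name in LOGICAL]
        row += [pkt.get(name, 0) for name in VALUED]
        if row not in matrix:  # drop repeated rows while keeping the order
            matrix.append(row)
    return matrix


packets = [
    {"dst": "aa:bb", "DNS": True, "TTL": 64},  # before EAPOL: skipped
    {"dst": "aa:bb", "EAPOL": True},
    {"dst": "aa:bb", "DNS": True, "TTL": 64},
    {"dst": "aa:bb", "DNS": True, "TTL": 64},  # duplicate row: dropped
]
print(build_matrix(packets))  # [[1, 0, 0], [0, 1, 64]]
```

Rows are kept as lists so that the number of rows can vary between captures, which matches the variable n in the description of the matrix.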
- Compute the average of TCPSLen.
- The DHCPMS and PRLLen columns each contain a single value, which is used as is.
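The numeric part of the second stage can be sketched in a few lines of standard-library Python. The column layout, the output key names such as `MAXPLen` and `ARPCount`, and the function name `numeric_summary` are hypothetical. The text does not say whether rows in which a feature is absent (value 0) are excluded from the statistics; this sketch simply uses every row.

```python
from statistics import mean


def numeric_summary(matrix, columns):
    """Summarise first-stage columns; `columns` maps a feature name to its column index."""

    def col(name):
        return [row[columns[name]] for row in matrix]

    summary = {}
    for name in ("PLen", "TCPWS", "TTL"):
        values = col(name)
        summary[f"MAX{name}"] = max(values)
        summary[f"MIN{name}"] = min(values)
        summary[f"AVG{name}"] = mean(values)
    summary["ARPCount"] = sum(col("ARP"))  # number of 1s in a logical column
    return summary


columns = {"ARP": 0, "PLen": 1, "TCPWS": 2, "TTL": 3}
matrix = [
    [1, 60, 1024, 64],
    [0, 70, 2048, 64],
    [1, 80, 3072, 64],
]
summary = numeric_summary(matrix, columns)
print(summary["MAXPLen"], summary["MINPLen"], summary["AVGPLen"], summary["ARPCount"])
```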
The columns of textual features are converted to numbers, each in a distinct way, as follows:
- Merge the elements of the MACSrc and MACDst columns, remove the repeated addresses, and count
  the rest.
- Process the IPSrc and IPDst columns with the same procedure as the MAC address
  columns.
- Convert the HN column to ASCII codes, then apply statistical operations: MAX
  (MAXHN), MIN (MINHN), average (AVGHN), and hostname length (HNLen).
- Create a lookup table for the query names of all devices and assign a specific number to all query
  names of each device. Unknown query names are given the value 0.
- As mentioned earlier, some devices of the same brand may share some or all query names.
  To strengthen the query name fields, another field is added that checks whether all query names
  are available in the lookup table: if any of them does not exist, the value 0 is stored; otherwise the
  query name count is stored.
Since some DHCP features are similar across devices and some of them, such as HN, may be
guessed by intruders, all DHCP features are merged into one feature by averaging
their values (average(DHCPMS, PRLLen, MAXHN, MINHN, AVGHN, HNLen)).
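A minimal sketch of the textual conversions and the DHCP merge described above, using only the standard library. The function names are hypothetical, and treating the value 0 as "address missing" is an assumption based on the 0 placeholders of the first stage.

```python
from statistics import mean


def count_distinct(src_column, dst_column):
    """Merge two address columns, drop repeats and missing entries (0), count the rest."""
    return len({addr for addr in src_column + dst_column if addr != 0})


def hostname_stats(hostname):
    """ASCII-based statistics of a DHCP host name: (MAXHN, MINHN, AVGHN, HNLen)."""
    codes = [ord(ch) for ch in hostname]
    return max(codes), min(codes), mean(codes), len(codes)


def merged_dhcp_feature(dhcpms, prllen, hostname):
    """Average DHCPMS, PRLLen and the four host-name statistics into one value."""
    return mean([dhcpms, prllen, *hostname_stats(hostname)])


print(count_distinct(["a", "b", "a"], ["b", "c", 0]))  # 3
print(hostname_stats("ab"))
print(merged_dhcp_feature(1500, 4, "ab"))
```

Averaging collapses six DHCP-related values into one element of the DFP, so a guessed host name alone shifts the fingerprint less than it would as a separate feature.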
During the analysis, it was found that some protocols, such as ICMPv6, appear in the traffic of
some devices more than others, especially in devices from the same
manufacturer, such as the D-Link devices. ICMPv6 appeared in the DLSensor, DLSwitch, and DLWSensor
traffic with different packet counts and order, but did not appear in the DLCam traffic. To increase the
differences between these devices, the first three positions of ICMPv6 packets are concatenated and
normalized by dividing the result by the largest position index (e.g., if the first three ICMPv6 packets
appear at packet indices A, B, and C, the new feature is ABC/C). At the end of this stage, the DFP vector
of 30 features is generated.
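The ABC/C feature can be illustrated as follows. The function name and the choice of returning 0 when a device sends no ICMPv6 packets are assumptions; the text only defines the case with three ICMPv6 positions.

```python
def icmpv6_position_feature(packet_indices):
    """Concatenate the first three ICMPv6 packet indices and divide by the largest one."""
    first_three = packet_indices[:3]
    if not first_three:
        return 0.0  # assumption: devices without ICMPv6 traffic get 0
    concatenated = int("".join(str(index) for index in first_three))
    return concatenated / max(first_three)


# ICMPv6 packets at positions 2, 5 and 8 give 258 / 8.
print(icmpv6_position_feature([2, 5, 8]))  # 32.25
```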
To create a more robust model, Gaussian noise with mean 0 and standard deviation 1 is added.
Moreover, since the generated DFP values contain both high and very low values, they are scaled up
and down by a factor of 10. After preprocessing the resulting dataset, 20 DFPs are generated for each
device (unknown devices are considered as one device).
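The noise step might look like the sketch below, which perturbs every element of each DFP with zero-mean, unit-variance Gaussian noise. The function name and the fixed seed are assumptions for reproducibility. The scaling step is not shown because the text does not specify which values are scaled up and which are scaled down.

```python
import random


def add_gaussian_noise(dfps, sigma=1.0, seed=0):
    """Return a noisy copy of each DFP vector (Gaussian noise, mean 0, std `sigma`)."""
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sigma) for value in dfp] for dfp in dfps]


clean = [[10.0] * 30, [200.0] * 30]  # two toy fingerprints of 30 features each
noisy = add_gaussian_noise(clean)
print(len(noisy), len(noisy[0]))  # 2 30
```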
4. Machine learning algorithms
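This section gives a brief overview of the four ML classification algorithms and how their
parameters are adjusted where needed. Figure 3 depicts the schematic diagram of DFPs identification.
Figure 3. Schematic diagram of the DFPs identification (blocks: known and unknown devices, traffic collection, feature extraction, DFPs, SVM/GBC/RF/DT models, HVC, IoT devices DFPs identification).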
4.1. Support vector machine
SVM is a versatile supervised ML algorithm that is inherently designed for binary classification.
SVM can be used for both linear and nonlinear classification. A linear SVM separates
the data domain linearly using a hyperplane, whereas a nonlinear SVM transforms a data
domain that cannot be separated linearly into a feature space where the data can be divided
linearly to isolate the classes [23]. To define the boundary between the two classes, two margin lines
are created that pass through the points nearest to the boundary.
Since the created DFPs dataset has 18 classes (17 for authorized devices, with the
remaining devices forming the "unknown" class), the One vs Rest (OvR) classifier is applied. The
OvR model involves M binary classifiers, where M is the number of classes. Each
binary classifier is trained with one class as class 1 and the remaining M-1 classes as class 0. Each
classifier predicts a class probability, and the class with the highest positive probability is taken as
the classification result. Equation (2) shows the OvR class prediction formula, where f_n(x) is the
output of the n-th binary classifier and f(x) is the predicted class.
$$
f(x) = \arg\max_{n} f_n(x)
\tag{2}
$$
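The OvR decision rule of equation (2) amounts to an argmax over the per-class binary scores. In the sketch below the three scorers are toy stand-ins (hypothetical constants), not trained SVMs.

```python
def ovr_predict(x, classifiers):
    """Return the class whose binary classifier gives the highest score for x."""
    return max(classifiers, key=lambda label: classifiers[label](x))


# Toy binary scorers: each returns a probability-like score for its own class.
classifiers = {
    "SPStrip": lambda x: 0.2,
    "TPlug": lambda x: 0.7,
    "unknown": lambda x: 0.1,
}
print(ovr_predict([0.0] * 30, classifiers))  # TPlug
```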
Grid search is an optimization method that finds the best model parameters using k-fold
cross-validation [24]. To determine the best kernel function and hyperparameters for SVM, grid search is
applied twice with fivefold cross-validation: first to choose the appropriate kernel function and second
to choose the hyperparameters. In this paper, the kernel functions checked by grid
search are polynomial, radial basis, and sigmoid. The polynomial kernel was found to be the
best for training on the proposed dataset. Grid search is then applied to find the best
polynomial degree and hyperparameters, with the degree ranging from 2 to 5,
C (the regularization parameter) from 0.001 to 0.01 with step 0.001, and gamma
(which determines the extent of the influence of an individual training sample) from 10 to 1000 with
step 10. Equation (3) illustrates the polynomial kernel function, where X and Y represent vectors in the
input space, γ represents the gamma parameter, and d represents the polynomial degree [25].
$$
\text{kernel}(X, Y) = (\gamma \, X \cdot Y + C)^{d}
\tag{3}
$$
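To make the polynomial kernel and the size of the search space concrete, the sketch below evaluates the kernel of equation (3) on two small vectors and enumerates the parameter ranges quoted above. The function and variable names are hypothetical; `c` is the constant written C in equation (3), while `c_regularization` is the SVM regularization parameter searched by the grid.

```python
from itertools import product


def polynomial_kernel(x, y, gamma, c, degree):
    """(gamma * <x, y> + c) ** degree for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + c) ** degree


# Search ranges quoted in the text.
degrees = range(2, 6)                                # 2 .. 5
c_regularization = [k / 1000 for k in range(1, 11)]  # 0.001 .. 0.01, step 0.001
gammas = range(10, 1001, 10)                         # 10 .. 1000, step 10
grid = list(product(degrees, c_regularization, gammas))

print(polynomial_kernel([1, 2], [3, 4], gamma=1, c=1, degree=2))  # 144
print(len(grid))  # 4000
```

With fivefold cross-validation, each of the 4000 combinations is fitted five times, which is why reusing stable grid-search results can noticeably shorten training.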
4.2. Decision tree
DT is one of the most common and versatile supervised ML algorithms. It is constructed by iteratively
splitting the training dataset into a sequence of subsets based on if-then conditions. Several DT
algorithms exist, but for this work the C4.5 and CART algorithms are the most suitable.
The splitting criterion of C4.5 is information gain, defined as the difference between the entropy of a
feature in the dataset and the entropy of the same feature after partitioning according to the threshold
value. The information gain is calculated for all features, and the attribute with the maximum
information gain is selected at the node. Equation (4) shows the entropy formula, where P_i is the
probability of the feature within class i and n represents the number of classes [26].
$$
\text{Entropy} = -\sum_{i=1}^{n} P_i \log_2(P_i)
\tag{4}
$$
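A small worked example of equation (4) and of information gain for one candidate split, written with the standard library only. The label lists are toy data, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, following equation (4)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, left, right):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(labels)
    after = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - after


labels = ["TPlug", "TPlug", "DLCam", "DLCam"]
print(entropy(labels))                                   # 1.0
print(information_gain(labels, labels[:2], labels[2:]))  # 1.0
```

A split that separates the two classes perfectly removes all of the entropy, so its information gain equals the entropy before the split.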
Gini impurity, the splitting criterion of the CART algorithm, measures the probability that a
randomly chosen variable is wrongly classified. For each variable, the weighted sum
of Gini indices is calculated, and the variable with the lowest value is taken as the node. Equation
(5) shows the Gini index formula [27].
$$
\text{Gini-index} = 1 - \sum_{i=1}^{n} P_i^{2}
\tag{5}
$$
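Equation (5) and the weighted sum used to compare candidate splits can be sketched as follows (toy labels, hypothetical function names).

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a list of class labels, following equation (5)."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted sum of the Gini indices of the two sides of a split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_index(left) + (len(right) / total) * gini_index(right)


print(gini_index(["TPlug", "TPlug", "DLCam", "DLCam"]))        # 0.5
print(weighted_gini(["TPlug", "TPlug"], ["DLCam", "DLCam"]))   # 0.0
```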
To select the appropriate criterion, a grid search with fivefold cross-validation is applied. Besides
the criterion, the max_depth parameter is also tuned to reduce overfitting, with a range
from 8 to 20.
4.3. Random forest
Random forest is an ensemble ML model that combines several decision trees. Training is
based on the bagging method, which stands for bootstrapping and aggregating. Bootstrapping means that
each tree is trained on randomly selected data samples and features. The results of the decision trees
are aggregated by majority voting. The advantage of the RF model is that it reduces the overfitting that
occurs in a single decision tree and increases model accuracy, since it builds several trees instead of
one tree that is prone to misclassification, and it searches for the most suitable feature among a random
subset of the dataset features [28].
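The two halves of bagging can be shown without training any trees: drawing a bootstrap sample and aggregating per-tree predictions by majority vote. The function names, the seed, and the toy predictions are assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bootstrapping step)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Aggregate the predictions of the individual trees for one sample."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
print(len(sample))                                   # 10
print(majority_vote(["TPlug", "SPBulb", "TPlug"]))   # TPlug
```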
4.4. Gradient boosting classifier
GBC is a boosting method that builds a strong model by iteratively adding
weak models. The three main components of GBC are the loss function, the weak learner, and
the additive model. The loss function estimates how accurately the model predicts the given data.
The weak learner is a learner with a high error rate, typically a DT. The additive model
refers to the iterative method of adding the weak learners. The inputs to the GBC model are the training
dataset {(x_i, c_i)}, i = 1, ..., n, the differentiable loss function L(c_i, F(x)), and the number of
iterations M, where x_i is the input variable and c_i is the observed value (0 or 1). The differentiable
loss function can be written as in equation (6), where P refers to the predicted probability.
$$
L = -\left[ c_i \log(P) + (1 - c_i) \log(1 - P) \right]
\tag{6}
$$
Training begins by finding the optimal initial prediction f0(x) as written in equation (7),
where γ is given by equation (8). The pseudo-residuals are then computed for each iteration
(m = 1 to M) as written in equation (9). Then γm is computed using equation (10), and finally,
fm(x), the prediction function for xi, is updated using equation (11), where v is the
learning rate and hm(x) is the additional model (regression tree) for the prediction function
[17].
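$$
f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(c_i, \gamma)
\tag{7}
$$
$$
\gamma = \log\left(\frac{P}{1 - P}\right)
\tag{8}
$$
$$
r_{im} = -\left[ \frac{\partial L(c_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{m-1}(x)}
\tag{9}
$$
$$
\gamma_m = \arg\min_{\gamma} \sum_{x_i \in R_{ij}} L(c_i, f_{m-1}(x_i) + \gamma)
\tag{10}
$$
$$
f_m(x) = f_{m-1}(x) + v \, h_m(x)
\tag{11}
$$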
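The sketch below walks through the first steps of this procedure for a binary target, assuming the log-loss: the initial prediction as log-odds (equations (7) and (8)), the pseudo-residuals of equation (9), which for the log-loss reduce to c_i minus the predicted probability, and the additive update of equation (11). Fitting the regression tree h_m is left out, and the function names and the constant tree outputs are placeholders.

```python
from math import exp, log


def sigmoid(z):
    """Convert a log-odds score into a probability."""
    return 1.0 / (1.0 + exp(-z))


def initial_log_odds(labels):
    """f0 = log(P / (1 - P)), with P the fraction of positive labels."""
    p = sum(labels) / len(labels)
    return log(p / (1.0 - p))


def pseudo_residuals(labels, scores):
    """For the log-loss, the negative gradient w.r.t. the score is c_i - P_i."""
    return [c - sigmoid(f) for c, f in zip(labels, scores)]


def update(scores, tree_outputs, learning_rate):
    """f_m(x) = f_{m-1}(x) + v * h_m(x), applied per sample."""
    return [f + learning_rate * h for f, h in zip(scores, tree_outputs)]


labels = [1, 1, 1, 0]
f0 = initial_log_odds(labels)                   # log(0.75 / 0.25) = log(3)
residuals = pseudo_residuals(labels, [f0] * 4)  # about [0.25, 0.25, 0.25, -0.75]
scores = update([f0] * 4, residuals, learning_rate=0.2)
print(f0, residuals, scores)
```

Here the residuals themselves are passed as the tree outputs only to exercise `update`; in an actual boosting round a regression tree fitted to the residuals would supply those values.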
5. Performance evaluation and results
The grid search algorithm is applied to the dataset to find the appropriate hyperparameters and the
polynomial kernel degree of the SVM. The optimization results were the same on every
run. So, to decrease the training time, the results are used directly: degree = 2, C =
0.001, gamma = 10. In contrast, the grid search results for the DT model differ between runs:
sometimes gini is selected as the best criterion and sometimes entropy, and the selected
max_depth also varies. Therefore, the grid search is kept as part of the DT model
to give the best accuracy.
During DFP generation, the value 0 is assigned to each device fingerprint with an unknown DNS query
name, which increases the probability of identifying unauthorized devices. The performance
of the models is measured by the F1-score and the accuracy of DFP
identification, as shown in the following equations, where TP denotes true positive, FN denotes false
negative, TN denotes true negative, and FP denotes false positive.
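$$
\text{F1-score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}
\tag{12}
$$
$$
\text{Recall} = \frac{TP}{TP + FN}
\tag{13}
$$
$$
\text{Precision} = \frac{TP}{TP + FP}
\tag{14}
$$
$$
\text{DFPs-Identification Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
\tag{15}
$$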
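Equations (12) to (15) translate directly into code. The counts in the example are made-up numbers chosen for easy arithmetic, not results from the paper, and the function name is hypothetical.

```python
def evaluation_scores(tp, fp, tn, fn):
    """Recall, precision, F1-score and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}


result = evaluation_scores(tp=8, fp=2, tn=88, fn=2)
print(result)
```

With many negative samples the accuracy (0.96 here) can stay high while the F1-score (0.8 here) is lower, which is the same pattern as a high accuracy with a lower F1-score for a single class.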
The overall accuracy of DFP identification is approximately 95.1% using SVM. Only four devices
(DLSensor, DLSwitch, EdnetGateway, and WeMoSwitch) were identified with 100% accuracy.
Classification errors typically occur between devices from the same manufacturer and sometimes
between devices from different manufacturers. Unknown devices are identified well, with about 99.6%
accuracy but with a 95.9% F1-score, which means some known devices are identified as unknown.
The confusion matrix of the SVM model is shown in Figure 4.a. It is clear that only one DFP of
ACamera is classified as unknown and there is only one mistake among devices within different
manufactures (on DFP of DLCam is identified as Chromecast) and three mistakes among devices
within the same manufactures (3 DFPs of SPBulb are identified as SPStrip).
For the DT model, the overall accuracy of identifying DFPs is approximately 97.9%. Since the DT
model is based on if-then conditions, there have been instances of misclassifications between devices
regardless of their manufactures. Accuracy of unknown devices tends to be as SVM (99.8%).
Meanwhile, the F1-score is slightly higher (97.9%). The confusion matrix for DT is shown in figure
4.b. There is only one error defining DLCam, and DT having the same error as SVM for identifying
ACamera.
In the RF and GBC models, the overall accuracy of DFP identification is about 98.8%. The
hyperparameters of RF are entropy as the criterion, max_depth=20, and min_samples_leaf=10, while
those of GBC are min_samples_leaf=10, max_depth=20, learning_rate=0.2, max_features='sqrt', and
n_estimators=100. The two models differ slightly in per-device accuracy: GBC achieved 100% accuracy
and F1-score in identifying the fingerprints of unknown devices, while the RF results are slightly
lower. Figures 4.c and 4.d show the confusion matrices for RF and GBC, respectively. Both models
made only one mistake in identifying Chromecast DFPs.
To increase the identification accuracy, a hard voting classifier (HVC) is applied using the best
three tested models (DT, RF, and GBC) as its estimators. Each estimator votes for one class, and HVC
takes the majority vote. So, if a misclassification occurs in only one model, HVC still classifies
the DFP correctly. Moreover, even when the three models achieve the same overall results, their
errors need not fall on the same observation (e.g., of three tested DFPs, DT misclassifies only the
first, RF only the second, and GBC only the third), so HVC improves the identification accuracy,
here reaching about 99.5% overall accuracy. The confusion matrix of the HVC is shown in Figure 4.e.
Table 3 shows the average F1-scores and accuracies of all devices for the five applied models.
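The three-DFP example of majority voting can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the paper: the helper `hard_vote` and the per-model predictions are hypothetical, chosen so that each model errs on a different fingerprint. In scikit-learn, the same behaviour is provided by `VotingClassifier` with `voting="hard"`.

```python
from collections import Counter


def hard_vote(votes):
    """Return the label predicted by most estimators for one fingerprint."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical predictions for three DFPs; the labels are illustrative only.
true_labels = ["Chromecast", "DLCam", "SPBulb"]
dt_pred = ["Unknown", "DLCam", "SPBulb"]        # DT errs only on the first DFP
rf_pred = ["Chromecast", "Chromecast", "SPBulb"]  # RF errs only on the second
gbc_pred = ["Chromecast", "DLCam", "SPStrip"]   # GBC errs only on the third

# One vote per estimator for each DFP, then take the majority.
hvc_pred = [hard_vote(votes) for votes in zip(dt_pred, rf_pred, gbc_pred)]
print(hvc_pred)  # ['Chromecast', 'DLCam', 'SPBulb']
```

Each single-model error is outvoted by the other two estimators, so all three DFPs are identified correctly. In this sketch a three-way tie falls back to the first estimator's vote, which is a simplification.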
6. Conclusion
This work demonstrated that IoT DFPs generated during device initialization can be identified
accurately. Expanding the DFPs to include first TCP features, DNS query names, DHCP features, and
other important features, together with the DFP formulation method, produces strong fingerprints
that are easier to identify using nonlinear machine learning algorithms. Even though ML algorithms
predict the DFP dataset differently, the hard voting classifier gathers their predictions and takes
the majority vote, which increases the identification accuracy. Four ML models were applied to the
proposed DFPs, and the three best-performing models were chosen as HVC estimators, achieving about
99.5% overall accuracy. Although the computational cost of HVC is high, this cost is acceptable as
long as the identification accuracy for unauthorized devices is 100%. In the near future, the
proposed method will be extended to identify IoT network traffic at each instant of time.
Figure 4. Confusion matrices of machine learning models: (a) SVM, (b) DT, (c) RF, (d) GBC, (e) HVC.
Table 3. Identification results of machine learning models.
                         SVM               DT                RF                GBC               HVC
Device Name              F1-score  ACC.    F1-score  ACC.    F1-score  ACC.    F1-score  ACC.    F1-score  ACC.
Aswar_Camera             0.941     0.995   0.963     0.996   0.979     0.997   0.982     0.998   0.982     0.998
Chromecast               0.952     0.993   0.933     0.992   0.972     0.997   0.981     0.998   0.989     0.999
D-LinkCam                0.939     0.993   0.989     0.999   1         1       1         1       1         1
D-LinkSensor             1         1       0.96      0.995   1         1       0.991     0.999   0.995     0.999
D-LinkSwitch             1         1       0.993     0.999   1         1       0.995     0.999   1         1
D-LinkWaterSensor        0.983     0.999   0.997     0.999   1         1       1         1       1         1
EdimaxPlug1101W          0.995     0.999   0.996     0.999   0.955     0.996   0.994     0.999   0.979     0.997
EdnetGateway             1         1       1         1       1         1       0.991     0.999   1         1
Google_Home_Mini         0.972     0.997   0.981     0.998   1         1       1         1       0.994     0.999
SonoFF_Power_Plug        0.964     0.996   0.989     0.999   0.969     0.995   0.964     0.997   1         1
SonoFF_Power_Strip       0.912     0.99    1         1       0.957     0.995   0.951     0.995   1         1
SonoFF_Smart_Light_Bulb  0.852     0.989   1         1       1         1       0.935     0.994   1         1
SonoFF_Smart_Switch      0.995     0.999   0.987     0.999   1         1       0.983     0.999   0.993     0.999
TEKIN-Plug               0.997     1       0.968     0.996   1         1       1         1       1         1
TP-LinkPlugHS100         0.791     0.979   0.984     0.998   1         1       1         1       1         1
TP-LinkPlugHS110         0.797     0.979   0.994     0.999   1         1       1         1       1         1
WeMoSwitch               1         1       0.963     0.994   1         1       1         1       1         1
Unknown                  0.959     0.996   0.979     0.998   0.985     0.998   1         1       1         1
Acknowledgments
We would like to thank Mustansiriyah University (www.uomustansiriyah.edu.iq) for its support in
completing this work.
References
[1] E. Ahmed, I. Yaqoob, A. Gani, M. Imran, and M. Guizani, “Internet-of-things-based smart
2nd International Conference on Physics and Applied Sciences (ICPAS 2021)
Journal of Physics: Conference Series 1963 (2021) 012046
IOP Publishing
doi:10.1088/1742-6596/1963/1/012046
13
environments: state of the art, taxonomy, and open research challenges,” IEEE Wirel.
Commun., vol. 23, no. 5, pp. 1016, 2016.
[2] S. Zeadally and M. Tsikerdekis, “Securing Internet of Things (IoT) with machine learning,”
Int. J. Commun. Syst., vol. 33, no. 1, p. e4169, 2020.
[3] F. Hussain, R. Hussain, S. A. Hassan, and E. Hossain, “Machine learning in IoT security:
Current solutions and future challenges,” IEEE Commun. Surv. Tutorials, vol. 22, no. 3, pp.
16861721, 2020.
[4] I. Alrashdi, A. Alqazzaz, E. Aloufi, R. Alharthi, M. Zohdy, and H. Ming, “Ad-iot: Anomaly
detection of iot cyberattacks in smart city using machine learning,” in 2019 IEEE 9th Annual
Computing and Communication Workshop and Conference (CCWC), 2019, pp. 305310.
[5] S. Bahizad, “Risks of Increase in the IoT Devices,” in 2020 7th IEEE International Conference
on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference
on Edge Computing and Scalable Cloud (EdgeCom), 2020, pp. 178181.
[6] H. Guo and J. Heidemann, “IP-based IoT device detection,” in Proceedings of the 2018
Workshop on IoT Security and Privacy, 2018, pp. 3642.
[7] H. Guo and J. Heidemann, “Detecting iot devices in the internet,” IEEE/ACM Trans. Netw.,
vol. 28, no. 5, pp. 23232336, 2020.
[8] C. Kelly, N. Pitropakis, S. McKeown, and C. Lambrinoudakis, “Testing And Hardening IoT
Devices Against the Mirai Botnet,” in 2020 International Conference on Cyber Security and
Protection of Digital Services (Cyber Security), 2020, pp. 18.
[9] G. Hu and K. Fukuda, “Toward Detecting IoT Device Traffic in Transit Networks,” in 2020
International Conference on Artificial Intelligence in Information and Communication
(ICAIIC), 2020, pp. 525530.
[10] A. K. Dalai et al., “A fingerprinting technique for identification of wireless devices,” in 2018
International Conference on Computer, Information and Telecommunication Systems (CITS),
2018, pp. 15.
[11] S. Aneja, N. Aneja, and M. S. Islam, “IoT device fingerprint using deep learning,” in 2018
IEEE International Conference on Internet of Things and Intelligence System (IOTAIS), 2018,
pp. 174179.
[12] R. R. Chowdhury, S. Aneja, N. Aneja, and E. Abas, “Network Traffic Analysis based IoT
Device Identification,” in Proceedings of the 2020 the 4th International Conference on Big
Data and Internet of Things, 2020, pp. 7989.
[13] B. Charyyev and M. H. Gunes, “Iot event classification based on network traffic,” in IEEE
INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM
WKSHPS), 2020, pp. 854859.
[14] A. Subahi and G. Theodorakopoulos, “Detecting IoT user behavior and sensitive information in
encrypted IoT-app traffic,” Sensors, vol. 19, no. 21, p. 4777, 2019.
[15] N. Ammar, L. Noirie, and S. Tixeuil, “Network-protocol-based IoT device identification,” in
2019 Fourth International Conference on Fog and Mobile Edge Computing (FMEC), 2019, pp.
204209.
[16] Z.-X. Xu, Q. Dai, G. Xu, H. Huang, X.-B. Chen, and Y.-X. Yang, “IoT Device Recognition
Framework Based on Network Protocol Keyword Query,” in International Conference on
Artificial Intelligence and Security, 2020, pp. 219231.
[17] A. Hameed and A. Leivadeas, “IoT traffic multi-classification using network and statistical
features in a smart environment,” in 2020 IEEE 25th International Workshop on Computer
Aided Modeling and Design of Communication Links and Networks (CAMAD), 2020, pp. 17.
[18] A. Sivanathan et al., “Classifying IoT devices in smart environments using network traffic
characteristics,” IEEE Trans. Mob. Comput., vol. 18, no. 8, pp. 17451759, 2018.
[19] A. Pashamokhtari, “PhD Forum Abstract: Dynamic Inference on IoT Network Traffic using
Programmable Telemetry and Machine Learning,” in 2020 19th ACM/IEEE International
Conference on Information Processing in Sensor Networks (IPSN), 2020, pp. 371372.
2nd International Conference on Physics and Applied Sciences (ICPAS 2021)
Journal of Physics: Conference Series 1963 (2021) 012046
IOP Publishing
doi:10.1088/1742-6596/1963/1/012046
14
[20] O. Salman, I. H. Elhajj, A. Chehab, and A. Kayssi, “A machine learning based framework for
IoT device identification and abnormal traffic detection,” Trans. Emerg. Telecommun.
Technol., p. e3743, 2019.
[21] M. Miettinen, S. Marchal, I. Hafeez, N. Asokan, A.-R. Sadeghi, and S. Tarkoma, “Iot sentinel:
Automated device-type identification for security enforcement in iot,” in 2017 IEEE 37th
International Conference on Distributed Computing Systems (ICDCS), 2017, pp. 21772184.
[22] “The kaggle website,” 2021. https://www.kaggle.com/drwardog/iot-device-captures.
[23] S. Suthaharan, “Support vector machine,” in Machine learning models and algorithms for big
data classification, Springer, 2016, pp. 207235.
[24] I. Syarif, A. Prugel-Bennett, and G. Wills, “SVM parameter optimization using grid search and
genetic algorithm to improve classification performance,” Telkomnika, vol. 14, no. 4, p. 1502,
2016.
[25] I. S. Al-Mejibli, J. K. Alwan, and H. Abd Dhafar, “The effect of gamma value on support
vector machine performance with different kernels,” Int. J. Electr. Comput. Eng., vol. 10, no. 5,
p. 5497, 2020.
[26] D. B. Seal, S. Saha, P. Mukherjee, M. Chatterjee, A. Mukherjee, and K. N. Dey, “Gene
ranking: An entropy & decision tree based approach,” in 2016 IEEE 7th Annual Ubiquitous
Computing, Electronics & Mobile Communication Conference (UEMCON), 2016, pp. 15.
[27] M. Shaheen, T. Zafar, and S. Ali Khan, “Decision tree classification: Ranking journals using
IGIDI,” J. Inf. Sci., vol. 46, no. 3, pp. 325339, 2020.
[28] T. Zhu, “Analysis on the Applicability of the Random Forest,” in Journal of Physics:
Conference Series, 2020, vol. 1607, no. 1, p. 12123.