Scientic Reports | (2022) 12:9074 | https://doi.org/10.1038/s41598-022-13061-z
www.nature.com/scientificreports
AutoML-ID: automated machine learning model for intrusion detection using wireless sensor network
Abhilash Singh1, J. Amutha2, Jaiprakash Nagar3, Sandeep Sharma4* & Cheng‑Chi Lee5,6*
A momentous increase in the popularity of explainable machine learning models, coupled with the dramatic increase in the use of synthetic data, enables us to develop a cost-efficient machine learning model for fast intrusion detection and prevention in frontier areas using Wireless Sensor Networks (WSNs). The performance of any explainable machine learning model is driven by its hyperparameters. Several approaches have been developed and implemented successfully for optimising or tuning these hyperparameters for skillful predictions. However, the major drawback of these techniques, including the manual selection of the optimal hyperparameters, is that they depend highly on the problem and demand application-specific expertise. In this paper, we introduce an Automated Machine Learning (AutoML) model that uses Bayesian optimisation to automatically select the machine learning model (among support vector regression, Gaussian process regression, binary decision tree, bagging ensemble learning, boosting ensemble learning, kernel regression, and linear regression) and to automate the hyperparameter optimisation for accurate prediction of the number of k-barriers for fast intrusion detection and prevention. To do so, we extracted four synthetic predictors, namely the area of the region, the sensing range of the sensors, the transmission range of the sensors, and the number of sensors, using Monte Carlo simulation. We used 80% of the dataset to train the models and the remaining 20% to test the performance of the trained model. We found that Gaussian process regression performs prodigiously and outperforms all the other considered explainable machine learning models, with correlation coefficient (R = 1), root mean square error (RMSE = 0.007), and bias = −0.006. Further, we also tested the AutoML performance on a publicly available intrusion dataset and observed a similar performance. This study will help researchers accurately predict the required number of k-barriers for fast intrusion detection and prevention.
Intrusion detection at border areas is of utmost importance and demands a high level of accuracy. Any failure in intrusion detection may result in havoc on the nation's security1. Each country shares international boundaries with its neighboring countries, extending to thousands of kilometers. Continuous monitoring of such a colossal borderline through occasional patrolling is a crucial problem. To overcome this problem, WSNs are generally deployed along the borderline for surveillance and monitoring2,3. A WSN is a widely adopted technology consisting of a group of sensors capable of sensing, processing, and transmitting processed information. It can be easily installed anywhere, even in hard-to-reach areas, because it does not require pre-installed infrastructure. The capability of detecting any event or environmental condition makes it prudent for intrusion detection applications4,5. Apart from intrusion detection, WSNs have found applications in precision agriculture, health monitoring, environment monitoring, hazards monitoring, and many more6–9.
1Indian Institute of Science Education and Research Bhopal, Fluvial Geomorphology and Remote Sensing Laboratory, Bhopal 462066, India. 2Gautam Buddha University, School of ICT, Greater Noida 201312, India. 3Indian Institute of Technology Kharagpur, Subir Chowdhury School of Quality and Reliability, Kharagpur 721302, India. 4Department of Electronics Engineering, Madhav Institute of Technology and Science, Gwalior 474005, India. 5Department of Library and Information Science, Research and Development, Center for Physical Education, Health, and Information Technology, Fu Jen Catholic University, New Taipei 242, Taiwan. 6Department of Computer Science and Information Engineering, Asia University, Taichung 41354, Taiwan. *email: sandeepsvce@gmail.com; cclee@mail.fju.edu.tw
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Border surveillance, intrusion detection, and prevention problems are addressed with two different approaches. In the first approach, researchers propose various algorithms and Internet of Things (IoT) solutions for intrusion detection and surveillance in border areas. In the second approach, they develop analytical models to estimate the intrusion detection probability in terms of k-coverage, k-barrier coverage, the number of k-barriers, and many other performance metrics. Yang et al.10 have proposed an energy-efficient intrusion detection method that is capable of identifying weak zones of the network deployment region that need to be repaired. After identifying the weak zones, they are repaired to achieve the desired quality of barrier coverage. Specifically, their proposed method focuses on one-directional coverage only, for single and multiple intruder scenarios. The authors have claimed that their proposed method and algorithms could enhance the network lifetime. In another work, presented in11, Raza et al. have analysed the impact of heterogeneous WSNs deployed following either a uniform or a Gaussian distribution. They have studied the impact of sensor density and the sensing range of sensor nodes on the intrusion detection probability. They found that heterogeneous WSNs provide better intrusion detection performance than homogeneous WSNs at a given sensing range and sensor node density. Similarly, Arfaoui et al.12 have rendered an analytical model that considers the notion of possible paths that an intruder can follow to cross a belt region in border areas. They have developed a model considering border area characteristics and the intrusion paths to estimate the time taken by an intruder to cross the border area. The authors conclude that their proposed model can detect the intrusion as soon as an intruder enters the restricted border area.
Further, Singh and Singh13 have presented a smart border surveillance system that uses a WSN able to identify and detect an intrusion and then alert the control center about the presence of an intruder. The proposed system is capable of differentiating between animals and persons. Further, the system uses Raspberry Pi boards integrated with infra-red, ultrasonic, and camera sensors and is found to be very effective and accurate in identifying any possible intruder. Again, Sharma and Kumar14 have proposed an ML-based smart surveillance and intrusion detection system for border regions. The proposed system is capable of detecting intruders during the day and at night, along with the kind of weapon carried by the intruder. The proposed system is made of a high-resolution camera with IR capabilities for day and night vision, a GPS module interfaced with a Raspberry Pi to extract the accurate location of the intruder, and a Bluetooth scanner to detect the Bluetooth signature of the intruder's device. The entire module is put into a climate-protected box that can be mounted on a high platform. Further, Mishra et al. in15 have provided a detailed literature review of various ML techniques for intrusion detection. They have also provided a comprehensive discussion of various types of attacks along with their respective features and security threats. With the help of specific features, ML techniques can identify and detect an intrusion quickly and accurately. Sun et al.16 have proposed a three-level intrusion detection model to minimise memory consumption, computational time, and cost. The proposed model is claimed to decrease memory consumption, time, and cost to a great extent. Further, in17, Ghosh et al. have proposed two routing schemes, namely KPS and Loop-Free (LP)-KPS, to enhance the lifetime of a WSN deployed for intrusion detection in border areas or for surveillance of crucial military establishments. On comparing the proposed algorithms with the LEACH and TEEN routing algorithms, they found that the proposed algorithms provide enhanced network lifetime. In18, Benahmed and Benahmed have proposed an optimal approach to achieve a fault-tolerant network for the surveillance of critical areas using WSNs. The proposed approach identifies the faulty sensors and replaces them with active sensors to fill the coverage gap. The proposed approach can provide a sufficient minimum number of sensors to cover the area under surveillance. Another work, presented by Arfaoui and Boudriga in19, provided an efficient surveillance system that can rapidly detect any intruder crossing border areas. In this work, the authors have incorporated the impact of obstacles present in the environment and the terrain of the border areas to derive the expression for the intrusion detection probability.
Further, Sharma and Nagar20 have obtained an analytical expression of the k-barrier coverage probability for intrusion detection in a rectangular belt region. They have considered all the possible paths an intruder may follow to cross the region. Further, they have also analysed the impact of various parameters such as the number of sensors, the sensing range, the sensor-to-intruder velocity ratio, and the intrusion path angle.
The analytical approaches discussed above effectively solve the intrusion detection problem. However, these approaches need validation through simulation, which is time-consuming. For example, a single iteration requires approximately 15 hours for a particular set of network parameters, increasing significantly as the network complexity increases. Various machine learning methods have been proposed to overcome the time-complexity issue associated with the simulations. Recently, Singh et al.21 proposed three machine learning methods based on GPR to map the k-barrier coverage probability for accurate and fast intrusion detection using WSNs. These methods are based on scaling the predictors: scale-GPR (S-GPR), center-mean-GPR (C-GPR), and GPR. They have used synthetic predictors derived from Monte Carlo simulations. They selected the number of sensors, the sensing range of the sensors, the sensor-to-intruder velocity ratio, the mobile-to-static node ratio, the angle of the intrusion path, and the required k-barriers as potential predictors. They found that the non-standardised method accurately maps the k-barrier coverage probability using the synthetic variables, with R = 0.85 and RMSE = 0.095. More recently, Singh et al.22 proposed a logarithmic predictor transformation and scaling-based algorithm coupled with SVR (i.e., LT-FS-ID) to map the number of required k-barriers for fast intrusion detection and prevention over a rectangular Region of Interest (RoI), considering a uniform sensor distribution. The dimension of the LT-FS-ID dataset is 182 × 5. They used four predictors to accurately predict the required k-barriers. They reported that the proposed approach accurately predicts the k-barriers with R = 0.98 and RMSE = 6.47. The feasibility of deep learning algorithms for intrusion detection has been investigated by Otoum et al. in23. They have presented a restricted Boltzmann machine-based clustered IDS (RBC-IDS) for monitoring critical infrastructures using WSNs. Further, they have compared the performance of RBC-IDS with the adaptively supervised and clustered hybrid IDS (ASCH-IDS) and found that both provide the same detection and accuracy rates, but the detection time of RBC-IDS is approximately twice that of ASCH-IDS.
The machine learning methods discussed above involve manual selection of the best performing algorithm, which may lead to biased results if the results are not compared with a benchmark algorithm. In addition, the optimisation of the hyperparameters associated with each algorithm is treated differently. To solve this problem, in this paper, we introduce an automated machine learning (AutoML) model to automate the model selection and hyperparameter optimisation tasks. In doing so, we synthetically extracted potential predictors (i.e., the area of the region, the sensing range of the sensors, the transmission range of the sensors, and the number of sensors) through Monte Carlo simulation. We then evaluated the predictor importance and predictor sensitivity through the regression tree ensemble approach. Subsequently, we applied AutoML to the training dataset to get the best optimised model. We evaluated the performance of the best performing algorithm over the testing data using R, RMSE, and bias as performance metrics.
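These three metrics follow their usual definitions (Pearson correlation for R, root mean square error, and mean error for bias); a minimal sketch in plain Python is given below. The arrays are hypothetical, and taking bias as the mean of predicted minus observed is an assumption about the paper's sign convention.

```python
import math

def pearson_r(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def bias(obs, pred):
    """Mean error (predicted minus observed)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)

# Hypothetical observed and predicted k-barrier counts
obs = [10.0, 20.0, 30.0, 40.0]
pred = [11.0, 19.0, 31.0, 39.0]
print(pearson_r(obs, pred), rmse(obs, pred), bias(obs, pred))
```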
Material and methods
Predictor generation. The quality of the predictions of a machine learning model depends on the quality of the predictors and the model hyperparameters24. These predictors can be categorised into real and synthetic, based on the data acquisition process. Real data can be obtained through direct measurements with instruments or sensors. However, the generation of real data involves intensive cost and labor. In contrast, synthetic data can be obtained through mathematical rules, statistical models, and simulations25. In comparison to real data, acquiring synthetic data is efficient and cost-effective. Due to this, the use of synthetic datasets to train machine learning models has increased in the past lustrum21,26–29.
We adopted the synthetic method to extract the predictor datasets using Monte Carlo simulations. In doing so, we have used the network simulator NS-2.35 to generate the entire dataset. To achieve this, a finite number of homogeneous sensor nodes (i.e., the sensing, transmission, and computational capabilities are identical for each sensor) are deployed in a rectangular RoI according to a Gaussian distribution, also known as a normal distribution. The Gaussian distribution is considered in this study since it can improve the intrusion detection capability and is preferred for realistic applications. In a Gaussian distributed network, the probability that a sensor node is located at a point (x, y) with reference to the deployed location (x0, y0)30,31 is given by:
$$f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y} \, e^{-\left[ \frac{(x - x_0)^2}{2\sigma_x^2} + \frac{(y - y_0)^2}{2\sigma_y^2} \right]} \tag{1}$$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of the x and y location coordinates, respectively.

To evaluate the performance of WSNs, we have considered the Binary Sensing Model (BSM)32, which is the most extensively used sensing range model. Each sensor (Si) is assumed to have a sensing range (Rs) and is deployed at an arbitrary point P(xi, yi). As per the BSM, the target is detected by any random sensor with 100% probability if the target lies within the sensing range of the sensor. Otherwise, the target detection probability is zero, represented mathematically as:

$$P(S_i) = \begin{cases} 1, & \text{if } d(S_i, P) \le R_s \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $d(S_i, P) = \sqrt{(x_i - x)^2 + (y_i - y)^2}$ is the Euclidean distance between Si and the target point P. In addition, we have considered that any two sensors can communicate if they satisfy the criterion $R_{tx} \ge 2R_s$, where Rtx and Rs represent the transmission range and the sensing range, respectively. A barrier is constructed by joining a cluster of sensor nodes across the RoI to detect the presence of intruders. Furthermore, to assure barrier coverage, it is required to identify a Barrier Path (BP) in the RoI; in this scenario, the sensor nodes detect each intruder on the path. Thus, to ensure guaranteed k-barrier coverage in the rectangular RoI, the number of required nodes is computed as $k = \lceil \frac{L}{2R_s} \rceil$, and the maximum number of BPs can be computed as $BP_{max} = \lfloor \frac{N}{k} \rfloor$33, where L is the length of the rectangular RoI, Rs is the sensing range of the nodes, and N is the number of sensor nodes. Table 1 lists the various network parameters and their values that have been used to obtain the simulation results.

Table 1. Simulation parameters.

  Parameters                 Values
  Network simulator          NS-2.35
  Network region             Rectangular RoI
  Network area (m²)          100 × 50 to 250 × 200
  Sensor nodes (N)           100–400
  Sensing range (Rs)         15–40 m
  Transmission range (Rtx)   30–80 m
  Node distribution          Gaussian distribution
  Sensing model              Binary sensing model (BSM)

Relative predictor importance. In machine learning, the choice of input predictors has a substantial control on performance28. Predictor importance analysis is not restricted to any particular representations, techniques, or measures, and can be used in any situation where predictive models are required. It is used to express how significant a predictor was for the model's predictive performance, irrespective of the structure (linear or nonlinear) or the direction of the predictor effect. We calculated the relevancy of the selected predictors in estimating the k-barriers by estimating each predictor's relative importance score. To do so, we have used the regression tree ensemble technique21,34. It is an inbuilt class with a tree-based classifier that assigns a relative score to every predictor or attribute of the data. The higher the score, the more important the predictor.
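The deployment model and the barrier-count relations of the previous subsection can be sketched as a small Monte Carlo routine. The sketch below is plain Python, not the NS-2.35 setup; the parameter values and deployment standard deviations are hypothetical, chosen within the Table 1 ranges.

```python
import math
import random

random.seed(42)

# Hypothetical network parameters within the Table 1 ranges
L, W = 100.0, 50.0          # rectangular RoI (length x width, m)
N = 200                     # number of sensor nodes
Rs = 25.0                   # sensing range (m)
x0, y0 = L / 2, W / 2       # mean of the Gaussian deployment (Eq. 1)
sx, sy = L / 6, W / 6       # assumed standard deviations

# Gaussian-distributed sensor deployment
sensors = [(random.gauss(x0, sx), random.gauss(y0, sy)) for _ in range(N)]

def detected(sensor, target, rs=Rs):
    """Binary sensing model (Eq. 2): 1 if the target is within range."""
    return 1 if math.dist(sensor, target) <= rs else 0

# Is a target at the RoI centre covered by at least one sensor?
target = (L / 2, W / 2)
covered = any(detected(s, target) for s in sensors)

# Nodes required per barrier and maximum number of barrier paths
k = math.ceil(L / (2 * Rs))
BP_max = N // k
print(covered, k, BP_max)
```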
Initially, we trained a regression tree ensemble model by boosting one hundred regression trees (i.e., t = 100), each with a learning rate of one (i.e., δ = 1), using the Least Squares gradient Boosting (LSBoost) ensemble aggregation method. Boosting an ensemble of regression algorithms has several advantages, such as handling missing data, representing nonlinear patterns, and yielding better generalisation when weak learners are combined into a single meta learner. In addition, the LSBoost ensemble minimises the mean square error by combining individual regression trees, often known as weak learners. The LSBoost technique successively trains weak learners on the training data set, fitting residual errors and detecting its weak points. Based on such weak points, it generates a new weak learner (li) during every iteration. It evaluates its weight (ωi) in order to reduce the difference between the response value and the aggregated predicted value, hence increasing the prediction accuracy. Finally, the algorithm updates the current model (Mi) by emphasising the prior weak learner's (Mi−1) weak points according to Eq. (3). It then integrates the weak learner into the existing model after training and iteratively generates a single strong learner (Mn, i.e., an ensemble of weak learners):

$$M_i = M_{i-1} + \delta \cdot \omega_i \cdot l_i \qquad (i = 1, 2, 3, \ldots, n) \tag{3}$$

To explore the predictor importance further, we estimated the coefficients indicating the relative importance of each predictor within the trained model by computing the total variation in the node risk ($\Delta R$) due to splits on each predictor, normalised by the total number of branch nodes ($R_{BN}$), mathematically represented as:

$$\Delta R = \frac{R_P - (R_{CH1} + R_{CH2})}{R_{BN}} \tag{4}$$

where $R_P$ indicates the node risk of the parent and $R_{CH1}$ and $R_{CH2}$ indicate the node risks of the two children. The node risk at an individual node ($R_i$) is mathematically represented as in Eq. (5):

$$R_i = P_i \cdot E_i \tag{5}$$

where $P_i$ denotes the probability of node i and $E_i$ denotes the mean square error of node i.

Predictor sensitivity. We have performed the sensitivity analysis of the predictors using the Partial Dependence Plot (PDP)21,35. A PDP depicts how a model's predicted response (outcome) changes as a single explanatory variable varies. These plots have the advantage of exhibiting the form of the relationship that exists between the variable and the response36. Moreover, a PDP depicts the marginal effect of one or more variables on the predicted response of the model37. In this study, we have considered the combined impact of two predictors simultaneously from the input predictor set (i.e., υ) on the predictand by marginalising the impact of the remaining predictors. To accomplish this, a subset $\upsilon_s$ and a complementary set $\upsilon_c$ of $\upsilon_s$ are extracted from the predictor set $\upsilon = \{z_1, z_2, \ldots, z_n\}$, where n represents the total number of predictors. Any prediction on υ is determined by Eq. (6), and the partial dependence of the predictors in $\upsilon_s$ is inferred by computing the expectation ($E_c$) of Eq. (6):

$$f(\upsilon) = f(\upsilon_s, \upsilon_c) \tag{6}$$

$$f_s(\upsilon_s) = E_c\left[ f(\upsilon_s, \upsilon_c) \right] = \int f(\upsilon_s, \upsilon_c) \, \rho_c(\upsilon_c) \, d\upsilon_c \tag{7}$$

where $\rho_c(\upsilon_c)$ indicates the marginal probability of $\upsilon_c$, represented as in Eq. (8):

$$\rho_c(\upsilon_c) = \int p(\upsilon_s, \upsilon_c) \, d\upsilon_s \tag{8}$$

Then, the partial dependence of the predictors in $\upsilon_s$ can be determined by:

$$f_s(\upsilon_s) \approx \frac{1}{U} \sum_{i=1}^{U} f(\upsilon_s, \upsilon_c^{(i)}) \tag{9}$$

where U represents the total number of observations.

Automated machine learning model. AutoML is used to automate machine learning processes such as data pre-processing, predictor or feature engineering, best algorithm selection, and hyperparameter optimisation38–40. In the past few years, it has been widely used in industry and academia to solve real and near real-time problems41–43. In this study, we first performed predictor standardisation using Z-score scaling44. Afterward, we divided the complete dataset randomly, using the Mersenne Twister (MT) random generator, in an 80:20 ratio for training and testing the AutoML model. The dimension of the complete dataset is 182 × 5, where 182 is the number of observations and 5 is the number of predictors (i.e., the area of the region, the sensing range of the sensors, the transmission range of the sensors, and the number of sensors) plus the response variable (i.e., k-barriers). The dimension of the training dataset is 145 × 5, and the dimension of the testing dataset is 37 × 5. After data division, we automated the algorithm selection and hyperparameter optimisation steps and investigated the performance. Various explainable machine learning models participate in the algorithm selection process, as discussed in the upcoming subsections.
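The partial-dependence average of Eq. (9) is simple to compute: fix the predictors of interest and average the model output over the observed values of the complementary predictors. A minimal sketch (plain Python, with a hypothetical fitted model `f` standing in for the trained ensemble) is:

```python
# Partial dependence via Eq. (9): average the model output over the
# observed values of the complementary predictors, holding v_s fixed.

def partial_dependence(f, vs_value, vc_samples):
    """Average f(vs, vc) over the U observed complementary samples."""
    return sum(f(vs_value, vc) for vc in vc_samples) / len(vc_samples)

# Hypothetical fitted model: response depends on both predictor groups
f = lambda vs, vc: 2.0 * vs + vc

# Observed values of the marginalised predictor (U = 4 observations)
vc_samples = [1.0, 2.0, 3.0, 4.0]

# Partial dependence of the response on vs at vs = 5
print(partial_dependence(f, 5.0, vc_samples))  # 2*5 + mean(vc) = 12.5
```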
Support vector regression model. The Support Vector Regression (SVR) model was introduced by Vapnik et al.45, and it was developed primarily from the Support Vector Machine (SVM) classifiers. The SVR model has the benefit of being able to optimise the nominal margin using regression task analysis and is a popular choice for prediction and curve-fitting for both linear and nonlinear regression46. The relationship between the input and output variables for nonlinear mapping47 is determined by:

$$y_i = w\varphi(p) + q \tag{10}$$

where $p = (p_1, p_2, \ldots, p_n)$ indicates the input, $y_i \in R^l$ indicates the output, $w \in R^n$ indicates the weight vector, $q \in R$ indicates a constant, n indicates the number of training samples, and $\varphi(p)$ indicates a nonlinear function that maps the input to the predictor. To determine w and q, Eq. (11) is used, where $\chi_i$ and $\chi_i^*$ indicate the slack variables:

$$\begin{aligned} \text{Minimise}: \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\chi_i + \chi_i^*) \\ \text{Subject to}: \quad & y_i - (w\varphi(p_i) + q) \le \epsilon + \chi_i \\ & (w\varphi(p_i) + q) - y_i \le \epsilon + \chi_i^* \\ & \chi_i, \chi_i^* \ge 0 \end{aligned} \tag{11}$$

In the SVR model, the three basic hyperparameters are the insensitive loss function ($\epsilon$), which specifies the tolerance margin; the capacity parameter, penalty coefficient, or box constraint (C), which specifies the error weight; and the Gaussian width parameter or kernel scale ($\gamma$)48,49. A high value of C lets the SVR memorise the training data. A smaller $\epsilon$ value implies noiseless data. The $\gamma$ value, in turn, is responsible for the under-adjustment or over-adjustment of the prediction. Mathematically, the kernel is represented as:

$$K(p_i, p) = e^{-\gamma \|p_i - p\|^2} \tag{12}$$

where K represents the kernel function and $\gamma$ represents the kernel scale that manages the influence of predictor variation on kernel variation.

Gaussian process regression model. Gaussian Process Regression (GPR), also known as kriging50, is based on Bayesian theory51 and is used to solve complex regression problems (high dimension, nonlinearity); it facilitates adaptive acquisition of the hyperparameters, is easy to implement, and is used with no loss of performance. The fundamental and extensively used GPR mainly comprises a simple zero mean and a squared exponential covariance function52, as represented in Eq. (13):

$$K(x, x') = \varpi_f^2 \exp\left(-\frac{r}{2}\right) \tag{13}$$

where

$$r = \frac{|x - x'|^2}{g^2} \tag{14}$$

and $K(x, x')$ represents the covariance function, or kernel, that provides the expected correlation among observations. In the GPR model, two hyperparameters are used: the model noise ($\varpi_f$) and the length scale (g), which regulate the vertical scale and the horizontal scale of the function change, respectively.

Binary decision tree regression. A Binary Decision Tree (BDT) regression is formed by performing consecutive recursive binary splits on the variables, that is, splits of the form $y_i \le v$, $y_i > v$, where $v \in R$ are observed values in a binary regression tree53, represented as:

$$T(y) = \sum_{m=1}^{M} \beta_m \cdot B_m(y) \tag{15}$$

where T(y) indicates the regression tree, M indicates the number of the tree's terminal nodes, $\beta_m$ indicates the coefficient of the m-th base function, and $B_m(y)$ indicates the base function, which is determined by:

$$B_m(y) = \prod_{i=1}^{L_m} \mathbb{1}\left[ y_{i(m)} \le v_{im} \right] \tag{16}$$

where $L_m$ indicates the total number of splits, $y_i$ indicates the involved variable, and $v_{im}$ indicates the splitting value. Moreover, the decision tree extends its rules until the samples in a leaf fall under a specified size, i.e., the minimum leaf (min-leaf) size54. Since the min-leaf size defines when splitting must be terminated, it is considered a vital parameter that must be fine-tuned.
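Returning to the GPR model: its posterior mean at a test point is a kernel-weighted combination of the training responses. The sketch below (plain Python, not the paper's MATLAB implementation) uses the squared exponential kernel of Eqs. (13)–(14) on a tiny hypothetical 1-D dataset; the amplitude, length scale, and noise values are assumptions.

```python
import math

def sq_exp_kernel(x1, x2, amp=1.0, g=1.0):
    """Squared exponential covariance (Eqs. 13-14): amp^2 * exp(-r/2)."""
    r = (x1 - x2) ** 2 / g ** 2
    return amp ** 2 * math.exp(-r / 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gpr_mean(xs, ys, x_star, noise=1e-6):
    """Posterior mean at x_star: k_*^T (K + noise*I)^{-1} y."""
    n = len(xs)
    K = [[sq_exp_kernel(xs[i], xs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, ys)
    return sum(sq_exp_kernel(x_star, xs[i]) * alpha[i] for i in range(n))

# Hypothetical 1-D training data
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.8, 0.9, 0.1]
print(gpr_mean(xs, ys, 1.5))  # interpolates between the observations
```

With near-zero noise, the posterior mean passes almost exactly through the training points, which is why GPR can fit the simulated k-barrier data so tightly.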
Ensemble regression model. Perrone and Cooper55 proposed a general conceptual framework for obtaining
considerably better regression estimates using ensemble methods. Ensemble Learning (EL) enhances perfor-
mance by building and combining several base learners with specic approaches. It is mainly used when there is
a limited amount of training data. It is challenging to choose a suitable classier with this limited available data.
Ensemble algorithms minimise the risk of selecting a poor classier by averaging the votes of individual classi-
ers. is study has applied bagging and boosting EL methods due to their widespread usage and eectiveness
for building ensemble learning algorithms.
Bagging (Breiman56,57), also known as bootstrap aggregation or Random Forest (RF), is one of the most promi-
nent approach for building ensembles, that uses a bootstrap sampling technique to generate multiple dierent
training sets. Subsequently, the base learners are trained on every training set, and then combining those base
learners to create the nal model. Hence, bagging works for a regression problem as follows: Consider a training
set, S that comprises of data
{(Xi,Yi),i=1, 2, ...,m}
, where X
i
and Y
i
represents the realisation of a multi-
dimensional estimator and a real valued variable respectively. A predictor P(Y|X = x) = f(x)58 is represented as:
At rst, create a bootstrapped sample Eq. (18) based on the empirical distribution of the pairs S
i
= (X
i
, Y
i
),
next, using the plug-in concept, estimate the bootstrapped predictor as shown in Eq. (19). Finally, the bagged
estimator is represented by Eq. (20).
Moreover, the three hyperparameters used in bagging are the MinLeafSize (minimum number of observations
per leaf), NumVariablesToSample (number of predictors to sample at every node), and the NumLearningCycles
(number of trees). e rst two parameters determine the tree’s structure, while tuning the nal parameter helps
balance eciency and accuracy.
Boosting (Freund59) is another ensemble method that aims to boost the eciency of a given learning algo-
rithm. e Least-Squares Boosting (LSBoost) ensemble method is used in this study because it is suited for
regression and forecasting problems. LSBoost aims to reduce the Mean Squared Error (MSE) between the target
variable (Y) and the weak learners’ aggregated prediction (Y
p
). At rst, median of (Y), represented as
(
Y
) is com-
puted. Next, to enhance the model accuracy, several regression trees (r
1
, r
2
,
...
, r
m
) are integrated in a weighted
manner. Individual regression trees are determined by the following predictor variables (X)60:
where (w
m
) represents the weight for the m model, d represents the weak learners, and
η
with
0
1 repre-
sents the learning rate.
Kernel regression model. Kernel regression (Nadaraya61) is the most used non-parametric method on account
of the virtue of kernel and is undoubtedly known as univariate kernel smoother. In order to achieve a kernel
regression, a collection of kernels are locally placed at every observational point. e kernel is set a weight to
every location depending on its distance from the observational point. A multivariate kernel regression62 deter-
mines how the response parameter, y
i
is dependent on the explanatory parameter, x
i
, as in Eqs.(22) and(23).
and
where
E[ψi]=Cov[m(xi),ψi]=0
, m(.) represents a non-linear function, and
ψi
is random with mean zero
and variance
σ2
. It describes the way that y
i
varies around its mean, m(x
i
). e mean can be represented as the
probability density function f:
(17) ζ_m(x) = h_m(S_1, S_2, ..., S_m)(x)
(18) S_i = (Y_i, X_i)
(19) ζ*_m(x) = h_m(S*_1, S*_2, ..., S*_m)(x)
(20) ζ_{m;B}(x) = E_P[ζ*_m(x)]
(21) Y_p(X) = Ȳ(X) + η Σ_{m=1}^{d} w_m × r_m(X)
(22) E(y_i | x_i) = m(x_i) + ψ_i
(23) y_i = m(x_i) + ψ_i
(24) m(x_i) = E[Y_i | x_i = x] = ∫ y f(x, y) dy / ∫ f(x, y) dy = ∫ y (f(x, y) / f(x)) dy
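Equation (24) is the Nadaraya–Watson form of the kernel estimator: once a kernel density estimate is substituted for f, the conditional mean reduces to a kernel-weighted average of the observed responses. A minimal sketch with a Gaussian kernel (the bandwidth h is an illustrative choice, not a value from the paper) could be:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, h=0.3):
    """m(x) = sum_i K((x - x_i)/h) * y_i / sum_i K((x - x_i)/h), as in Eq. (24)."""
    d = (x_query[:, None] - x_train[None, :]) / h   # pairwise scaled distances
    w = np.exp(-0.5 * d ** 2)                       # Gaussian kernel weights
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)
```

Each query point receives a locally weighted average of y, with weights decaying as the distance from the observational points grows, exactly as described above.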
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Scientific Reports | (2022) 12:9074 | https://doi.org/10.1038/s41598-022-13061-z | www.nature.com/scientificreports/
Linear regression model. A linear regression model63 examines the relationship between different influential predictors and an outcome variable. The basic linear regression model, which represents the universal set of two-variable and multiple regression as complementary subsets, can be expressed as in Eq. (25), where Y represents the dependent variable, X_1, X_2, ..., X_n represent the n independent variables, a and b_i represent the regression coefficients, and u represents the stochastic disturbance term that could be caused by an undefined independent variable.
Bayesian optimisation. Bayesian Optimisation (BO)64,65 is an efficient approach for addressing optimisation problems characterised by expensive evaluations. It keeps track of the previous observations and forms a probabilistic mapping (or model) between the hyperparameters and a probabilistic score on the objective function that is to be optimised. The probabilistic model is known as a surrogate of the objective function. The surrogate function is much easier to optimise, and with the help of the acquisition function, the next set of hyperparameters is selected for evaluation on the actual objective function based on its performance on the surrogate function. Hence, BO comprises a surrogate function for approximating the objective function and an acquisition function for sampling the next observation. In BO, the objective function (f) is modelled with a Gaussian Process (GP) as described in Eq. (26), where µ and ϑ are calculated from the observations of x66.
We select the best-performing algorithm among the above-discussed models with the optimised hyperparameters. Lastly, we evaluated the performance of the best-performing algorithm using the test dataset. A flowchart of the detailed methodology is illustrated in Fig. 1.
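A toy sketch of this loop for a single hyperparameter, with a GP surrogate (RBF kernel) and a lower-confidence-bound acquisition function standing in for whichever acquisition the toolbox actually uses, might look as follows. The kernel settings, candidate grid, and objective are illustrative assumptions:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel matrix between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-4):
    """Posterior mean and std of f ~ GP(mu, K) at candidate points (Eq. 26)."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_cand)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y_obs
    var = 1.0 - np.sum(Ks * sol, axis=0)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def bayes_opt(objective, bounds=(0.0, 1.0), n_init=3, n_iter=15, kappa=2.0):
    """Minimise an expensive objective via a GP surrogate and LCB acquisition."""
    rng = np.random.default_rng(0)
    x_obs = rng.uniform(bounds[0], bounds[1], n_init)
    y_obs = np.array([objective(x) for x in x_obs])
    cand = np.linspace(bounds[0], bounds[1], 201)
    for _ in range(n_iter):
        mu, sd = gp_posterior(x_obs, y_obs, cand)
        x_next = cand[np.argmin(mu - kappa * sd)]  # explore/exploit trade-off
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    return x_obs[np.argmin(y_obs)]
```

The surrogate is cheap to query everywhere, so the acquisition step scans the whole candidate grid before the expensive objective is evaluated once more.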
Results
Predictor importance and sensitivity. We plotted the relative predictor importance score of each predictor along with their respective box plots for a better visual representation of the datasets (Fig. 2). We found that the relative predictor importance score ranges approximately from 9 to 152. The higher the value of the relative estimate, the more relevant the predictor is in estimating the response variable (i.e., k-barriers). We found that out of these four predictors, the transmission range of the sensor emerges as the most relevant predictor in predicting the required number of k-barriers for fast intrusion detection and prevention considering Gaussian node distribution over a rectangular region. The number of sensors also shows good relevancy in predicting the response variable and ranks second. The area of the region of interest and the sensing range of the sensor show fair relevancy and rank third and fourth, respectively.
We also evaluated the impact of each predictor on the response variable. We plotted the partial dependence plot for each possible pair of predictors (Fig. 3a–f). For a better visual inspection, we also plotted the
(25) Y = a + Σ_{i=1}^{n} b_i X_i + u
(26) f(x) ~ GP(µ(x), ϑ(x_i, x_j))
Figure 1. Flowchart of the proposed methodology.
three-dimensional plot and its two-dimensional illustration. We observed that the area of the RoI has a slightly negative impact on the target variable, i.e., the response variable decreases with an increase in the area of the RoI. However, the opposite relationship is observed for all the other predictors. The sensing range of the sensor, the transmission range of the sensor, and the number of sensors have a positive impact on the response variable, i.e., the response variable increases with an increase in these predictors.
Model performance. We iteratively selected the best machine learning model with optimised hyperparameter values using Bayesian optimisation67–69 on 80% of the datasets (Fig. 4). We used Eq. (27) as the objective function (Obj) to select the best machine learning model with optimised hyperparameters, where valLoss is the cross-validation mean square error (CV-MSE). At each iteration, the value of the objective function is computed for one of the participating models. The model (with optimised hyperparameters) which returns the minimum observed loss (i.e., the smallest value of the objective function so far) is considered the best model. After 120 iterations, the AutoML algorithm returned the GPR model as the best model along with the optimal hyperparameters (i.e., for the GPR model, sigma = 0.98). Before returning the model, the AutoML algorithm retrains the GPR model on the entire training dataset.
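The selection loop described above can be sketched as follows. The two toy candidate learners stand in for the seven models considered in this study, and valLoss is a k-fold CV-MSE scored through the objective of Eq. (27); all names are illustrative:

```python
import numpy as np

def cv_mse(fit, predict, X, y, k=5):
    """k-fold cross-validation mean squared error (valLoss)."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        model = fit(X[train], y[train])
        errs.append(np.mean((predict(model, X[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

# Two toy candidate learners standing in for the paper's seven models.
def fit_mean(X, y):
    return y.mean()

def pred_mean(m, X):
    return np.full(len(X), m)

def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def pred_ols(c, X):
    return np.column_stack([np.ones(len(X)), X]) @ c

def select_model(candidates, X, y):
    """Keep the model with the smallest Obj = log(1 + valLoss), as in Eq. (27)."""
    best_name, best_obj = None, np.inf
    for name, fit, predict in candidates:
        obj = np.log(1 + cv_mse(fit, predict, X, y))
        if obj < best_obj:
            best_name, best_obj = name, obj
    return best_name, best_obj
```

The incumbent is simply the model with the minimum observed objective so far; in the full pipeline each candidate would also carry its own hyperparameter search.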
Once we get the trained GPR model, we evaluate its performance on the training datasets to estimate the training accuracy. We found that the model performed well on the training datasets with a correlation coefficient (R = 1), root mean square error (RMSE = 0.003), and bias = 0. However, for an unbiased evaluation, we evaluated the performance of the trained model on the test datasets (i.e., 20% of the total datasets). In doing so, we fed the testing predictors into the trained GPR model and obtained the predicted response. We then compared the GPR-predicted k-barriers with the observed values (Fig. 5a). We found that the GPR model performs prodigiously, with R = 1, RMSE = 0.007, and bias = −0.006. All the data points are aligned along the regression line and lie well inside the 95% Confidence Interval (C.I.).
Further, to assess the appropriateness of the plotted linear regression, we performed residual analysis. We plotted the time series of the observed and the predicted values along with the corresponding residual values (Fig. 5b). We found that the residuals are significantly low and do not follow any pattern, which indicates a good linear fit.
To understand the distribution of the error (i.e., the difference between the predicted and observed values), we performed error analysis using an error histogram (Fig. 6). To do so, we plotted the error histogram using ten bins. The error ranges from −0.00997 on the left to 0.00356 on the right of the histogram plot. We found that the error follows a right-skewed Gaussian distribution. The peak of the distribution lies in the underestimated region. Lastly, we presented the results of the remaining algorithms of the AutoML (i.e., SVR, BDT, bagging ensemble learning, boosting ensemble learning, kernel, and linear regression) in Table 2. We found that the best-performing AutoML algorithm (i.e., GPR) outperforms all the other algorithms.
(27) Obj = log(1 + valLoss)
Figure2. Graph showing the relative predictor importance score for all four predictors. e estimates for the
area of the RoI, sensing range of the sensor, transmission range of the sensor, and the number of sensors are
46.0, 9.3, 152.0, and 128.9, respectively.
Figure3. Two-dimensional and three-dimensional partial dependency plots show the predictor sensitivity
of all possible predictor pairs. e histogram along the x and y-axis of the two-dimensional plot shows the
distribution of the predictor and the response variable, respectively.
Figure4. Curve illustrating the Bayesian optimisation process for the selection of the best machine learning
model with optimal hyperparameters.
Figure5. e le panel shows the linear regression plot between the predicted and observed responses. e
top plot on the right panel shows the time series plot of the predicted and observed. e bottom panel shows the
corresponding residuals. e dashed line in the residual plot shows the RMSE value.
Figure6. Error analysis using error histogram of 10 bins. e line in red shows the zero error line. e area to
the le of the zero error line shows the underestimated region, and the area right to the zero error line shows the
overestimated region.
Table 2. Performance of the other AutoML algorithms.

Performance metrics | SVR   | BDT   | Bagging EL (random forest) | Boosting EL (LSBoost) | Kernel regression | Linear regression
R                   | 0.93  | 0.81  | 0.93                       | 0.73                  | 0.91              | 0.94
RMSE                | 63.61 | 73.07 | 81.84                      | 118.03                | 32.29             | 33.68
Bias                | 53.59 | 56.99 | 67.31                      | 89.81                 | 31.28             | 31.81
t (s)               | 95.3  | 111.3 | 103.4                      | 107.7                 | 43.01             | 36.7
Discussion
We observed that the AutoML approach successfully selects the best machine learning model among a group of explainable machine learning algorithms (i.e., among SVR, GPR, BDT, bagging ensemble learning, boosting ensemble learning, kernel regression, and linear regression models) and optimises its hyperparameters. Further, we have compared the AutoML-derived results with the benchmark algorithms for an unbiased and fair evaluation of the proposed approach. We selected Feed-Forward Neural Network (FFNN)70, Recurrent Neural Network (RNN)71, Radial Basis Neural Networks (RBN)72, Exact RBN73, and Generalised Regression Neural Network (GRNN)74 as the benchmark algorithms. We selected these algorithms because they are frequently used in diverse applications such as remote sensing, blockchain, cancer diagnosis, precision medicine, disease prediction, self-driving cars, streamflow forecasting, and speech recognition; hence they have high generalisation capabilities37,75–77. In doing so, we trained these algorithms over the same datasets. We found that the AutoML approach outperforms all the deep learning benchmark algorithms (Table 3). Among the benchmark algorithms, GRNN performs the best (with R = 0.97, RMSE = 64.61, bias = 60.18, and computational time, t = 2.23 s). Surprisingly, all the benchmark algorithms have a high positive bias value. It indicates that these models highly overestimate the number of required k-barriers. We have also compared the performance of the AutoML with previous studies21,22 for the prediction of k-barriers and k-barrier coverage probability (Table 4).
Further, we also tested the performance of the AutoML approach over the publicly available intrusion detection dataset22. In a recent study, Singh et al.22 proposed a log-transformed feature-scaling-based algorithm (i.e., LT-FS-ID) for intrusion detection considering a uniform node distribution scenario. We downloaded the datasets and applied the proposed AutoML approach to them. In doing so, we iterated the AutoML for 120 iterations using Bayesian optimisation to obtain the best optimised machine learning model. We found that the AutoML approach performs well over the dataset (with R = 0.92, RMSE = 30.59, and bias = 18.13). Interestingly, the same GPR algorithm emerges as the best learner algorithm with an optimised sigma = 0.33. It highlights the potential of the GPR algorithm for intrusion detection, which becomes more apparent from the recently published literature21,78.
The proposed AutoML approach for estimating the k-barriers for fast intrusion detection and prevention is highly user-friendly and provides a fast solution. It reduces the confusion of selecting the best-performing algorithm by automating the process. Further, it also overcomes the limitation of the LT-FS-ID algorithm22, which only works if the input predictors are positive real numbers; it will not work if any input predictor contains zero (or negative values). Although the AutoML approach gives the best result, its performance will degrade with sensor ageing. In other words, with the ageing effect in the sensors, the quality of the data recorded by the sensors may change drastically (i.e., the datasets become dynamic), resulting in performance degradation. In such a situation, retraining the proposed model will solve the problem.
Conclusion
In this study, we proposed a robust AutoML approach to estimate the accurate number of k-barriers required for fast intrusion detection and prevention using WSNs over a rectangular RoI considering the Gaussian distribution of the node deployment. We found that the synthetic predictors (i.e., the area of the RoI, sensing range of the sensor node, transmission range of the sensor node, and the number of sensors) extracted through Monte Carlo simulations successfully mapped to the k-barriers. Among these predictors, the transmission range of the sensor emerges as the most relevant predictor, and the sensing range of the sensor emerges as the least relevant predictor. In addition, we observed that only the area of the RoI has a slightly negative impact on the response variable. We then iteratively ran the AutoML algorithm to obtain the best machine learning model among the explainable machine learning models using Bayesian optimisation techniques. We found that
Table 3. Comparing the performance of the AutoML with the deep learning models.

Performance metrics | FFNN  | RNN   | Exact RBN | RBN    | GRNN
R                   | 0.47  | 0.95  | 0.30      | 0.41   | 0.97
RMSE                | 36.96 | 14.92 | 107.95    | 161.11 | 64.61
Bias                | 21.47 | 71.06 | 86.21     | 139.23 | 60.18
t (s)               | 2.5   | 13.51 | 2.90      | 3.98   | 2.23
Table 4. Comparing the results of AutoML with previous studies.

                    | k-barriers                       | k-barrier coverage probability21
Performance metrics | AutoML (this study) | LT-FS-ID22 | GPR   | S-GPR | C-GPR
R                   | 1                   | 0.98       | 0.85  | 0.64  | 0.79
RMSE                | 0.007               | 6.47       | 0.095 | 0.137 | 0.108
t (s)               | 0.73                | 0.65       | 8.16  | 7.79  | 9.51
the AutoML algorithm selects the GPR algorithm as the best machine learning model to map the required k-barriers accurately. We evaluated the potential of the GPR algorithm over unseen test datasets. We found that the AutoML-selected algorithm performs exceptionally well on the test datasets.
We further compared the AutoML results with the benchmark algorithms for a more reliable and robust conclusion. We found that AutoML outperforms all the benchmark algorithms in terms of accuracy. To further generalise this approach, we tested the efficacy of the AutoML over the publicly available datasets on intrusion detection using WSNs, and we found a similar performance. This study is a step towards a cost-efficient approach for fast intrusion detection and prevention using explainable machine learning models.
Data availability
The datasets generated and/or analysed during the current study can be made available from the corresponding author on reasonable request.
Code availability
The computer algorithms originated during the current study can be made available from the corresponding author on reasonable request.
Received: 30 January 2022; Accepted: 18 May 2022
References
1. Chaabouni, N., Mosbah, M., Zemmari, A., Sauvignac, C. & Faruki, P. Network intrusion detection for iot security based on learning
techniques. IEEE Commun. Surv. Tutor. 21, 2671–2701 (2019).
2. Wang, Y., Wang, X., Xie, B., Wang, D. & Agrawal, D. P. Intrusion detection in homogeneous and heterogeneous wireless sensor
networks. IEEE Trans. Mob. Comput. 7, 698–711 (2008).
3. Abduvaliyev, A., Pathan, A.-S.K., Zhou, J., Roman, R. & Wong, W.-C. On the vital areas of intrusion detection systems in wireless
sensor networks. IEEE Commun. Surv. Tutor. 15, 1223–1237 (2013).
4. Butun, I., Morgera, S. D. & Sankar, R. A survey of intrusion detection systems in wireless sensor networks. IEEE Commun. Surv.
Tutor. 16, 266–282 (2013).
5. Resende, P. A. A. & Drummond, A. C. A survey of random forest based methods for intrusion detection systems. ACM Comput.
Surv. 51, 1–36 (2018).
6. Ali, A., Ming, Y., Chakraborty, S. & Iram, S. A comprehensive survey on real-time applications of wsn. Future Internet 9, 77 (2017).
7. Singh, A., Sharma, S. & Singh, J. Nature-inspired algorithms for wireless sensor networks: A comprehensive survey. Comput. Sci.
Rev. 39, 100342 (2021).
8. Amutha, J., Sharma, S. & Nagar, J. WSN strategies based on sensors, deployment, sensing models, coverage and energy efficiency: Review, approaches and open issues. Wirel. Pers. Commun. 111, 1089–1115 (2020).
9. Nagar, J., Chaturvedi, S. K. & Soh, S. An analytical model to estimate the performance metrics of a finite multihop network deployed in a rectangular region. J. Netw. Comput. Appl. 149, 102466 (2020).
10. Yang, T., Mu, D., Hu, W. & Zhang, H. Energy-efficient border intrusion detection using wireless sensors network. EURASIP J. Wirel. Commun. Netw. 2014, 1–12 (2014).
11. Raza, F., Bashir, S., Tauseef, K. & Shah, S. Optimizing nodes proportion for intrusion detection in uniform and gaussian distributed
heterogeneous wsn. In 2015 12th International Bhurban Conference on Applied Sciences and Technology (IBCAST), 623–628 (IEEE,
2015).
12. Arfaoui, I., Boudriga, N., Trimeche, K. & Abdallah, W. Wsn-based border surveillance systems using estimated known crossing
paths. In Proceedings of the 15th International Conference on Advances in Mobile Computing and Multimedia, 182–190 (2017).
13. Singh, R. & Singh, S. Smart border surveillance system using wireless sensor networks. Int. J. Syst. Assur. Eng. Manage. 20, 1–15
(2021).
14. Sharma, M. & Kumar, C. Machine learning-based smart surveillance and intrusion detection system for national geographic borders. In Artificial Intelligence and Technologies 165–176 (Springer, 2022).
15. Mishra, P., Varadharajan, V., Tupakula, U. & Pilli, E. S. A detailed investigation and analysis of using machine learning techniques
for intrusion detection. IEEE Commun. Surv. Tutor. 21, 686–728 (2018).
16. Sun, Z., Xu, Y., Liang, G. & Zhou, Z. An intrusion detection model for wireless sensor networks with an improved v-detector
algorithm. IEEE Sens. J. 18, 1971–1984 (2017).
17. Ghosh, K., Neogy, S., Das, P. K. & Mehta, M. Intrusion detection at international borders and large military barracks with multi-sink wireless sensor networks: An energy efficient solution. Wirel. Pers. Commun. 98, 1083–1101 (2018).
18. Benahmed, T. & Benahmed, K. Optimal barrier coverage for critical area surveillance using wireless sensor networks. Int. J. Commun. Syst. 32, e3955 (2019).
19. Arfaoui, I. & Boudriga, N. A border surveillance system using wsn under various environment characteristics. Int. J. Sens. Netw. 30, 263–278 (2019).
20. Sharma, S. & Nagar, J. Intrusion detection in mobile sensor networks: A case study for different intrusion paths. Wirel. Pers. Commun. 115, 2569–2589 (2020).
21. Singh, A., Nagar, J., Sharma, S. & Kotiyal, V. A gaussian process regression approach to predict the k-barrier coverage probability for intrusion detection in wireless sensor networks. Expert Syst. Appl. 172, 114603 (2021).
22. Singh, A., Amutha, J., Nagar, J., Sharma, S. & Lee, C.-C. LT-FS-ID: Log-transformed feature learning and feature-scaling-based machine learning algorithms to predict the k-barriers for intrusion detection using wireless sensor network. Sensors https://doi.org/10.3390/s22031070 (2022).
23. Otoum, S., Kantarci, B. & Mouftah, H. T. On the feasibility of deep learning in sensor network intrusion detection. IEEE Netw. Lett. 1, 68–71 (2019).
24. Schmidt, J., Marques, M. R., Botti, S. & Marques, M. A. Recent advances and applications of machine learning in solid-state materials science. NPJ Comput. Mater. 5, 1–36 (2019).
25. Nikolenko, S. I. et al. Synthetic data for deep learning. arXiv:1909.11512 (arXiv preprint) (2019).
26. Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare.
Nat. Biomed. Eng. 20, 201–5 (2021).
27. Rankin, D. et al. Reliability of supervised machine learning using synthetic data in health care: Model to preserve privacy for data sharing. JMIR Med. Inform. 8, e18910 (2020).
28. Singh, A., Kotiyal, V., Sharma, S., Nagar, J. & Lee, C.-C. A machine learning approach to predict the average localization error with
applications to wireless sensor networks. IEEE Access 8, 208253–208263 (2020).
29. Abay, N. C., Zhou, Y., Kantarcioglu, M., Thuraisingham, B. & Sweeney, L. Privacy preserving synthetic data release using deep learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 510–526 (Springer, 2018).
30. Wang, D., Xie, B. & Agrawal, D. P. Coverage and lifetime optimization of wireless sensor networks with gaussian distribution. IEEE
Trans. Mob. Comput. 7, 1444–1458 (2008).
31. Wang, Y., Fu, W. & Agrawal, D. P. Gaussian versus uniform distribution for intrusion detection in wireless sensor networks. IEEE
Trans. Parallel Distrib. Syst. 24, 342–355 (2012).
32. Zou, Y. & Chakrabarty, K. Sensor deployment and target localization in distributed sensor networks. ACM Trans. Embed. Comput. Syst. 3, 61–91 (2004).
33. Mostafaei, H., Chowdhury, M. U. & Obaidat, M. S. Border surveillance with wsn systems in a distributed manner. IEEE Syst. J. 12,
3703–3712 (2018).
34. Torres-Barrán, A., Alonso, Á. & Dorronsoro, J. R. Regression tree ensembles for wind energy and solar radiation prediction.
Neurocomputing 326, 151–160 (2019).
35. Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
36. Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. J. Comput. Graph. Stat. 24, 44–65 (2015).
37. Singh, A., Gaurav, K., Rai, A. K. & Beg, Z. Machine learning to estimate surface roughness from satellite images. Remote Sens. 13,
3794 (2021).
38. Guyon, I. et al. Design of the 2015 ChaLearn AutoML challenge. In 2015 International Joint Conference on Neural Networks (IJCNN), 1–8 (IEEE, 2015).
39. Guyon, I. et al. AutoML challenge 2015: Design and first results. In Proceedings of AutoML (2015).
40. Guyon, I. et al. A brief review of the ChaLearn AutoML challenge: Any-time any-dataset learning without human intervention. In Workshop on Automatic Machine Learning, 21–30 (PMLR, 2016).
41. He, Y. et al. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 784–800 (2018).
42. Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M. & Hutter, F. Practical automated machine learning for the automl challenge
2018. In International Workshop on Automatic Machine Learning at ICML, 1189–1232 (2018).
43. He, X., Zhao, K. & Chu, X. Automl: A survey of the state-of-the-art. Knowl.-Based Syst. 212, 106622 (2021).
44. Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell rna-seq based
on a multinomial model. Genome Biol. 20, 1–16 (2019).
45. Vapnik, V. et al. Support vector method for function approximation, regression estimation, and signal processing. Adv. Neural Inf. Process. Syst. 281–287 (1997).
46. Saha, A. et al. Flood susceptibility assessment using novel ensemble of hyperpipes and support vector regression algorithms. Water 13, 241 (2021).
47. Arifuzzaman, M., Aniq Gul, M., Khan, K. & Hossain, S. Application of artificial intelligence (AI) for sustainable highway and road system. Symmetry 13, 60 (2021).
48. da Silva Santos, C. E., dos Santos Coelho, L. & Llanos, C. H. Nature inspired optimization tools for SVMs - NIOTS. MethodsX 8, 101574 (2021).
49. Zaghloul, M. S., Hamza, R. A., Iorhemen, O. T. & Tay, J. H. Comparison of adaptive neuro-fuzzy inference systems (ANFIS) and support vector regression (SVR) for data-driven modelling of aerobic granular sludge reactors. J. Environ. Chem. Eng. 8, 103742 (2020).
50. César de Sá, N., Baratchi, M., Hauser, L. T. & van Bodegom, P. Exploring the impact of noise on hybrid inversion of PROSAIL RTM on Sentinel-2 data. Remote Sens. 13, 648 (2021).
51. Rasmussen, C. E. Gaussian processes in machine learning. In Summer School on Machine Learning 63–71 (Springer, 2003).
52. Asante-Okyere, S., Shen, C., Yevenyo Ziggah, Y., Moses Rulegeya, M. & Zhu, X. Investigating the predictive performance of Gaussian process regression in evaluating reservoir porosity and permeability. Energies 11, 3261 (2018).
53. Artime Ríos, E. M., Sánchez Lasheras, F., Suárez Sánchez, A., Iglesias-Rodríguez, F. J. & Seguí Crespo, M. D. M. Prediction of
computer vision syndrome in health personnel by means of genetic algorithms and binary regression trees. Sensors 19, 2800 (2019).
54. Kim, S.-H., Moon, I.-J., Won, S.-H., Kang, H.-W. & Kang, S. K. Decision-tree-based classification of lifetime maximum intensity of tropical cyclones in the tropical western North Pacific. Atmosphere 12, 802 (2021).
55. Perrone, M. P. & Cooper, L. N. When networks disagree: Ensemble methods for hybrid neural networks. Tech. Rep., Brown Univ
Providence RI Inst for Brain and Neural Systems (1992).
56. Breiman, L. Bagging Predictors (Technical Report 421) (University of California, 1994).
57. Breiman, L. Stacked regressions. Mach. Learn. 24, 49–64 (1996).
58. Erdal, H. & Karahanoğlu, İ. Bagging ensemble models for bank profitability: An empirical research on Turkish development and investment banks. Appl. Soft Comput. 49, 861–867 (2016).
59. Freund, Y. et al. Experiments with a new boosting algorithm. In icml Vol. 96 148–156 (Citeseer, 1996).
60. Jung, C. High spatial resolution simulation of annual wind energy yield using near-surface wind speed time series. Energies 9, 344
(2016).
61. Watson, G. S. Smooth regression analysis. Sankhyā Indian J. Stat. Ser. A 26, 359–372 (1964).
62. Heo, G.-Y. Condition monitoring using empirical models: Technical review and prospects for nuclear applications. Nucl. Eng. Technol. 40, 49–68 (2008).
63. Poole, M. A. & O’Farrell, P. N. The assumptions of the linear regression model. Trans. Inst. Brit. Geograph. 52, 145–158 (1971).
64. Močkus, J. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference 400–404
(Springer, 1975).
65. Feurer, M. etal. Methods for improving Bayesian optimization for automl. In Proceedings of the International Conference on Machine
Learning (2015).
66. Savaia, G. et al. Experimental automatic calibration of a semi-active suspension controller via Bayesian optimization. Control. Eng.
Pract. 112, 104826 (2021).
67. Pelikan, M., Goldberg, D. E., Cantú-Paz, E. et al. BOA: The Bayesian optimization algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, Vol. 1, 525–532 (Citeseer, 1999).
68. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & De Freitas, N. Taking the human out of the loop: A review of Bayesian opti-
mization. Proc. IEEE 104, 148–175 (2015).
69. Frazier, P.I. A tutorial on Bayesian optimization. arXiv: 1807. 02811 (arXiv preprint) (2018).
70. Fine, T. L. Feedforward Neural Network Methodology (Springer Science & Business Media, 2006).
71. Zaremba, W., Sutskever, I. & Vinyals, O. Recurrent neural network regularization. arXiv: 1409. 2329 (arXiv preprint) (2014).
72. Karayiannis, N. B. Reformulated radial basis neural networks trained by gradient descent. IEEE Trans. Neural Netw. 10, 657–671
(1999).
73. Çivicioğlu, P., Alçı, M. & Beşdok, E. Using an exact radial basis function artificial neural network for impulsive noise suppression from highly distorted image databases. In International Conference on Advances in Information Systems, 383–391 (Springer, 2004).
74. Specht, D. F. et al. A general regression neural network. IEEE Trans. Neural Netw. 2, 568–576 (1991).
75. Guidotti, R. et al. A survey of methods for explaining black box models. ACM Comput. Surv. 51, 1–42 (2018).
76. Xie, M., Li, H. & Zhao, Y. Blockchain financial investment based on deep learning network algorithm. J. Comput. Appl. Math. 372, 112723 (2020).
77. Shrestha, A. & Mahmood, A. Review of deep learning algorithms and architectures. IEEE Access 7, 53040–53065 (2019).
78. Nwakanma, C. I., Ahakonye, L. A. C., Lee, J.-M. & Kim, D.-S. Selecting Gaussian process regression kernels for IoT intrusion detection and classification. In 2021 International Conference on Information and Communication Technology Convergence (ICTC), 462–465 (IEEE, 2021).
Acknowledgements
We want to acknowledge IISER Bhopal, Madhya Pradesh, India; Gautam Buddha University, Uttar Pradesh,
India; IIT Kharagpur, West Bengal, India; MITS Gwalior, Madhya Pradesh, India; Fu Jen Catholic University,
Taiwan; and Asia University, Taiwan, for providing institutional support.
Author contributions
A.S. developed the models, J.N. and J.A. extracted the datasets, S.S. and C.C.L. analysed the results. All the authors
contributed to the writing and reviewed the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to S.S. or C.-C.L.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access is article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. e images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.
© e Author(s) 2022
Content courtesy of Springer Nature, terms of use apply. Rights reserved
1.
2.
3.
4.
5.
6.
... Border protection interests all nations, especially those involved in armed conflicts or located close to areas with a high flow of illegal goods (drugs, smuggled weapons, and illegal immigrants, among others) [13]. To help identify possible intruders, several sensors are deployed to detect suspicious movements at country borders [14] (Figure 1). This technology is relatively inexpensive and can help monitor existing activities in these locations. ...
... This technology is relatively inexpensive and can help monitor existing activities in these locations. These sensors use machine-learning concepts to facilitate the identification of adverse situations [14]. This type of technology also helps reduce the human resources needed to guard countries with long borders. ...
... (13) Calculate the regression vector z(x_t). (14) Update the weights of the output layer by (5) using y(t). ...
Article
Full-text available
Evolving fuzzy neural networks have the adaptive capacity to solve complex problems while remaining interpretable: they extract knowledge from the investigated data and thereby provide valuable insights into the behavior of the problem being analyzed. This work therefore proposes applying an evolving fuzzy neural network capable of solving data stream regression problems with considerable interpretability. The dataset is based on predicting the number of k-barriers formed by wireless sensors to identify unauthorized persons entering a protected territory. Our method was empirically compared with state-of-the-art evolving methods, showing significantly lower RMSE values on separate test datasets and lower accumulated mean absolute errors (MAEs) when evaluating the methods in a stream-based interleaved-predict-and-then-update procedure. In addition, the model offers relevant information in the form of interpretable fuzzy rules, allowing an explainable evaluation of the regression problems contained in the data streams.
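The stream-based interleaved-predict-and-then-update (prequential) evaluation mentioned above can be sketched as follows. The `RunningMean` model and its `predict`/`update` interface are hypothetical stand-ins for illustration only, not the paper's evolving fuzzy neural network:

```python
import numpy as np

class RunningMean:
    """Trivial stand-in for an evolving model: predicts the mean of all
    targets seen so far (hypothetical, just to exercise the loop)."""
    def __init__(self):
        self.n, self.total = 0, 0.0
    def predict(self, x):
        return self.total / self.n if self.n else 0.0
    def update(self, x, y):
        self.n += 1
        self.total += y

def prequential_mae(model, stream):
    """Interleaved predict-then-update: test on each sample first, then
    train on it, accumulating the mean absolute error over the stream."""
    abs_errors = []
    for x, y in stream:
        abs_errors.append(abs(model.predict(x) - y))  # test first
        model.update(x, y)                            # then train
    return np.cumsum(abs_errors) / np.arange(1, len(abs_errors) + 1)

mae_curve = prequential_mae(RunningMean(), [(0, 1.0), (0, 1.0), (0, 1.0)])
```

Because every sample is scored before the model sees it, the accumulated MAE curve honestly reflects out-of-sample performance on the stream.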
... Deep learning is a subset of machine learning that has been applied to intrusion detection using WSNs (Amutha et al., 2021b; Singh et al., 2022a; Sood et al., 2022; Singh et al., 2022b). It is also employed for pattern matching and network security, where it identifies malicious activities occurring in the network and is termed a Network Intrusion Detection System (NIDS). ...
Preprint
Full-text available
Wireless Sensor Networks (WSNs) are a promising technology with enormous applications in almost every walk of life. One of the crucial applications of WSNs is intrusion detection and surveillance at border areas and in defense establishments. Border areas stretch over hundreds to thousands of miles; hence, it is not possible to patrol the entire border region. As a result, an enemy may enter at any point in the absence of surveillance and cause loss of lives or destroy military establishments. WSNs can be a feasible solution to the problem of intrusion detection and surveillance at border areas. Detection of an enemy at border areas and nearby critical areas, such as military cantonments, is a time-sensitive task, as a delay of a few seconds may have disastrous consequences. Therefore, it becomes imperative to design systems that are able to identify and detect the enemy as soon as it comes within the range of the deployed system. In this paper, we have proposed a deep learning architecture based on a fully connected feed-forward Artificial Neural Network (ANN) for the accurate prediction of the number of k-barriers for fast intrusion detection and prevention. We have trained and evaluated the feed-forward ANN model using four potential features, namely the area of the circular region, the sensing range of the sensors, the transmission range of the sensors, and the number of sensors, for Gaussian and uniform sensor distributions. These features are extracted through Monte Carlo simulation. In doing so, we found that the model accurately predicts the number of k-barriers for both Gaussian and uniform sensor distributions, with a correlation coefficient (R = 0.78) and Root Mean Square Error (RMSE = 41.15) for the former and R = 0.79 and RMSE = 48.36 for the latter. Further, the proposed approach outperforms the other benchmark algorithms in terms of accuracy and computational time complexity.
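A minimal NumPy-only sketch of a fully connected feed-forward network trained on four synthetic features: the data-generating function, layer sizes, learning rate, and iteration count here are illustrative assumptions, not the paper's architecture or Monte Carlo dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the four features named in the abstract: area of
# the region, sensing range, transmission range, and number of sensors
# (values scaled to [0, 1] purely for illustration).
X = rng.uniform(0.0, 1.0, size=(200, 4))
# Hypothetical target: barriers grow with sensing range and sensor count.
y = 2.0 * X[:, 1] + 3.0 * X[:, 3] + rng.normal(0.0, 0.01, 200)

# One hidden tanh layer trained with plain full-batch gradient descent.
W1 = rng.normal(0.0, 0.5, (4, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)
lr = 0.05
for _ in range(3000):
    H = np.tanh(X @ W1 + b1)                 # hidden activations
    err = (H @ W2 + b2).ravel() - y          # prediction error
    g_out = 2.0 * err[:, None] / len(X)      # MSE gradient at the output
    g_hid = (g_out @ W2.T) * (1.0 - H ** 2)  # backpropagated to hidden layer
    W2 -= lr * H.T @ g_out; b2 -= lr * g_out.sum(0)
    W1 -= lr * X.T @ g_hid; b1 -= lr * g_hid.sum(0)

rmse = float(np.sqrt(np.mean(((np.tanh(X @ W1 + b1) @ W2 + b2).ravel() - y) ** 2)))
```

After training, the network's RMSE should fall well below the standard deviation of the target, i.e., it learns more than a constant-mean predictor would.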
... Deep learning is a subset of machine learning that has been applied to intrusion detection using WSNs (Lee et al., 2021; Singh, Amutha, Nagar, Sharma, & Lee, 2022a, 2022b; Sood, Prakash, Sharma, Singh, & Choubey, 2022). It is also employed for pattern matching and network security, where it identifies malicious activities occurring in the network and is termed a Network Intrusion Detection System (NIDS). ...
Article
Full-text available
Wireless Sensor Networks (WSNs) are a promising technology with enormous applications in almost every walk of life. One of the crucial applications of WSNs is intrusion detection and surveillance at border areas and in defense establishments. Border areas stretch over hundreds to thousands of miles; hence, it is not possible to patrol the entire border region. As a result, an enemy may enter at any point in the absence of surveillance and cause loss of lives or destroy military establishments. WSNs can be a feasible solution to the problem of intrusion detection and surveillance at border areas. Detection of an enemy at border areas and nearby critical areas, such as military cantonments, is a time-sensitive task, as a delay of a few seconds may have disastrous consequences. Therefore, it becomes imperative to design systems that are able to identify and detect the enemy as soon as it comes within the range of the deployed system. In this paper, we have proposed a deep learning architecture based on a fully connected feed-forward Artificial Neural Network (ANN) for the accurate prediction of the number of k-barriers for fast intrusion detection and prevention. We have trained and evaluated the feed-forward ANN model using four potential features, namely the area of the circular region, the sensing range of the sensors, the transmission range of the sensors, and the number of sensors, for Gaussian and uniform sensor distributions. These features are extracted through Monte Carlo simulation. In doing so, we found that the model accurately predicts the number of k-barriers for both Gaussian and uniform sensor distributions, with a correlation coefficient (R = 0.78) and Root Mean Square Error (RMSE = 41.15) for the former and R = 0.79 and RMSE = 48.36 for the latter. Further, the proposed approach outperforms the other benchmark algorithms in terms of accuracy and computational time complexity.
Article
Full-text available
Low-cost sensors are widely used in IoT in place of high-cost devices because they are less expensive. However, these low-cost sensors have their own limitations, such as the accuracy, quality, and reliability of the data collected. Fog computing offers solutions to those limitations; nevertheless, owing to its intrinsically distributed architecture, it faces challenges in the form of the security of fog devices, secure authentication, and privacy. Blockchain technology has been utilised to offer solutions for the authentication and security challenges in fog systems. This paper proposes an authentication system that utilises the characteristics and advantages of blockchain and smart contracts to authenticate users securely. The implemented system uses the email address, username, Ethereum address, password, and data from a biometric reader to register and authenticate users. Experiments showed that the proposed method is secure and achieved performance improvements when compared to existing methods. A comparison with the state of the art showed that the proposed authentication system consumed up to 30% fewer resources in transaction and execution cost; however, there was an increase of up to 30% in miner fees.
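The register/authenticate flow can be illustrated with a standard salted password hash. This is only a generic sketch of the off-chain credential check; the paper's actual scheme additionally binds identity to an Ethereum address, smart contracts, and biometric data:

```python
import hashlib
import hmac
import os

def register(password: str):
    """Store a random salt and a slow salted hash, never the raw password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def authenticate(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the hash and compare in constant time to resist timing attacks."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

salt, digest = register("correct horse")
ok = authenticate("correct horse", salt, digest)
```

`hmac.compare_digest` is used instead of `==` so the comparison time does not leak how many leading bytes matched.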
Article
Full-text available
The dramatic increase in computational facilities, integrated with explainable machine learning algorithms, allows us to perform fast intrusion detection and prevention at border areas using Wireless Sensor Networks (WSNs). This study proposed a novel approach to accurately predict the number of barriers required for fast intrusion detection and prevention. To do so, we extracted four features through Monte Carlo simulation: the area of the Region of Interest (RoI), the sensing range of the sensors, the transmission range of the sensors, and the number of sensors. We evaluated feature importance and feature sensitivity to measure the relevancy and riskiness of the selected features. We applied log transformation and feature scaling on the feature set and trained the tuned Support Vector Regression (SVR) model (i.e., the LT-FS-SVR model). We found that the model accurately predicts the number of barriers with a correlation coefficient (R) = 0.98, Root Mean Square Error (RMSE) = 6.47, and bias = 12.35. For a fair evaluation, we compared the performance of the proposed approach with the benchmark algorithms, namely Gaussian Process Regression (GPR), Generalised Regression Neural Network (GRNN), Artificial Neural Network (ANN), and Random Forest (RF). We found that the proposed model outperforms all the benchmark algorithms.
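The LT-FS preprocessing (log transformation followed by min-max feature scaling) can be sketched as below. The synthetic features, the toy target, and the RBF kernel ridge regressor standing in for the paper's tuned SVR are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical raw features (strictly positive so the log transform is valid).
X = rng.uniform(1.0, 100.0, size=(150, 4))
y = np.log(X[:, 0]) + 0.5 * np.log(X[:, 3])      # toy target for illustration

# LT: log transformation compresses the dynamic range of the features.
X_lt = np.log(X)
# FS: min-max feature scaling brings every feature to [0, 1].
X_fs = (X_lt - X_lt.min(0)) / (X_lt.max(0) - X_lt.min(0))

def rbf(A, B, gamma=1.0):
    """RBF (Gaussian) kernel matrix between two sample sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Stand-in regressor: RBF kernel ridge regression (not the paper's tuned SVR),
# with an 80/20 train/test split as in the paper.
tr, te = slice(0, 120), slice(120, 150)
K = rbf(X_fs[tr], X_fs[tr])
alpha = np.linalg.solve(K + 1e-3 * np.eye(120), y[tr])
pred = rbf(X_fs[te], X_fs[tr]) @ alpha
rmse = float(np.sqrt(np.mean((pred - y[te]) ** 2)))
```

The scaling step matters for any kernel method: without it, the feature with the largest raw range would dominate every pairwise distance inside the kernel.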
Article
Full-text available
Support Vector Machines (SVMs) are a technique for building classifiers and regressors. However, to obtain models with high accuracy and low complexity, it is necessary to define the kernel parameters as well as the parameters of the training model, which are called hyperparameters. The challenge of finding the most suitable hyperparameter values is called the Parameter Selection Problem (PSP). Minimizing the complexity and maximizing the generalization capacity of SVMs are conflicting criteria. Therefore, we propose the Nature Inspired Optimization Tools for SVMs (NIOTS), which offers a method to automate the search for the best possible solution to the PSP, allowing the user to quickly obtain several sets of good solutions and choose the one most appropriate for their specific problem. •The PSP is modeled as a Multiobjective Optimization Problem (MOP) with two objectives: (1) good precision and (2) low complexity (a low number of support vectors). •The user can evaluate multiple solutions included in the Pareto front in terms of precision and model complexity. •Apart from the Adaptive Parameter with Mutant Tournament Multiobjective Differential Evolution (APMT-MODE), the user can choose other metaheuristics and also among several kernel options.
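The two-objective trade-off (precision versus number of support vectors) reduces to keeping the non-dominated candidates. A minimal Pareto-front filter, applied to hypothetical accuracy/support-vector counts for a few values of the SVM cost parameter C:

```python
def pareto_front(candidates):
    """Keep the non-dominated candidates: a candidate is dominated if some
    other candidate is at least as accurate AND uses no more support
    vectors, and is strictly better on at least one of the two."""
    front = []
    for s in candidates:
        dominated = any(
            o["acc"] >= s["acc"] and o["n_sv"] <= s["n_sv"]
            and (o["acc"] > s["acc"] or o["n_sv"] < s["n_sv"])
            for o in candidates
        )
        if not dominated:
            front.append(s)
    return front

# Hypothetical evaluation results for three values of the cost parameter C.
candidates = [
    {"C": 1,   "acc": 0.90, "n_sv": 40},
    {"C": 10,  "acc": 0.95, "n_sv": 80},
    {"C": 100, "acc": 0.95, "n_sv": 120},  # dominated: same accuracy, more SVs
]
front = pareto_front(candidates)
```

The user then picks one point from the front by hand, trading a little accuracy for a simpler (faster-to-evaluate) model or vice versa.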
Article
Full-text available
We apply the Support Vector Regression (SVR) machine learning model to estimate surface roughness on a large alluvial fan of the Kosi River in the Himalayan Foreland from satellite images. To train the model, we used input features such as radar backscatter values in Vertical–Vertical (VV) and Vertical–Horizontal (VH) polarisation, the incidence angle from Sentinel-1, the Normalised Difference Vegetation Index (NDVI) from Sentinel-2, and surface elevation from the Shuttle Radar Topographic Mission (SRTM). We generated additional features (VH/VV and VH–VV) through a linear data fusion of the existing features. For the training and validation of our model, we conducted a field campaign during 11–20 December 2019. We measured surface roughness at 78 different locations over the entire fan surface using an in-house-developed mechanical pin-profiler. We used the regression tree ensemble approach to assess the relative importance of each input feature in predicting surface soil roughness with the SVR model. We eliminated the irrelevant input features using an iterative backward elimination approach. We then performed feature sensitivity analysis to evaluate the riskiness of the selected features. Finally, we applied dimension reduction and scaling to minimise data redundancy and bring the features to a similar level. Based on these, we proposed five SVR methods (PCA-NS-SVR, PCA-CM-SVR, PCA-ZM-SVR, PCA-MM-SVR, and PCA-S-SVR). We trained and evaluated the performance of all variants of SVR with a 60:40 ratio using the input features and the in-situ surface roughness. We compared the performance of the SVR models with six benchmark machine learning models (i.e., Gaussian Process Regression (GPR), Generalised Regression Neural Network (GRNN), Binary Decision Tree (BDT), Bagging Ensemble Learning, Boosting Ensemble Learning, and Automated Machine Learning (AutoML)).
We observed that PCA-MM-SVR performs best, with a coefficient of correlation (R = 0.74), Root Mean Square Error (RMSE = 0.16 cm), and Mean Square Error (MSE = 0.025 cm²). To ensure a fair selection of the machine learning model, we evaluated Akaike's Information Criterion (AIC), corrected AIC (AICc), and the Bayesian Information Criterion (BIC). We observed that SVR exhibits the lowest values of AIC, AICc, and BIC among all the methods, indicating the best goodness-of-fit. Eventually, we also compared the result of PCA-MM-SVR with the surface roughness estimated from different empirical and semi-empirical radar backscatter models. The accuracy of the PCA-MM-SVR model is better than that of the backscatter models. This study provides a robust approach to measuring surface roughness at high spatial and temporal resolutions solely from satellite data. Keywords: surface roughness; Sentinel-1; Sentinel-2; machine learning models; AutoML; backscatter models
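The PCA dimension-reduction step shared by the five SVR variants can be sketched with an SVD; the two-feature example data here are illustrative, not the study's satellite features:

```python
import numpy as np

def pca_transform(X, n_components):
    """Centre the features and project them onto the top principal
    components found by SVD (the dimension-reduction step)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]       # rows are principal directions
    return Xc @ components.T, components

# Two perfectly correlated features collapse onto a single component
# with no loss of information (the redundancy PCA is meant to remove).
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
Z, components = pca_transform(X, 1)
```

Because the example data are rank one, projecting onto a single component and projecting back reconstructs the centred data exactly; with real, noisy features the reconstruction is only approximate and the discarded components carry the redundancy.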
Article
Full-text available
The National Typhoon Center of the Korea Meteorological Administration developed a statistical-dynamical typhoon intensity prediction model for the western North Pacific, the CSTIPS-DAT, using a track-pattern clustering technique. The model led to significant improvements in the prediction of the intensity of tropical cyclones (TCs). However, relatively large errors have been found in a cluster located in the tropical western North Pacific (TWNP), mainly because of the large predictand variance. In this study, a decision-tree algorithm was employed to reduce the predictand variance for TCs in the TWNP. The tree predicts the likelihood of a TC reaching a maximum lifetime intensity greater than 70 knots at its genesis. The developed four rules suggest that the pre-existing ocean thermal structures along the track and the latitude of a TC's position play significant roles in the determination of its intensity. The developed decision-tree classification exhibited 90.0% and 80.5% accuracy in the training and test periods, respectively. These results suggest that intensity prediction with the CSTIPS-DAT can be further improved by developing independent statistical models for TC groups classified by the present algorithm.
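The elementary operation behind such a decision-tree classification is searching a single predictor for the threshold that best separates the two classes. A minimal stump learner on hypothetical ocean-heat-content values (the numbers and the single-predictor rule are illustrative, not the paper's four data-driven rules):

```python
import numpy as np

def best_stump(x, labels):
    """Exhaustively search one predictor for the split threshold that best
    separates the two classes -- the elementary step a decision-tree
    learner repeats recursively. Considers only the rule 'x >= t -> True'."""
    best_t, best_acc = None, 0.0
    for t in np.unique(x):
        acc = float(np.mean((x >= t) == labels))
        if acc > best_acc:
            best_t, best_acc = float(t), acc
    return best_t, best_acc

# Hypothetical pre-existing ocean heat content (kJ/cm^2) at genesis, and
# whether the TC later exceeded 70 kt -- illustrative numbers only.
heat = np.array([20.0, 35.0, 45.0, 80.0, 95.0, 110.0])
intense = np.array([False, False, False, True, True, True])
threshold, accuracy = best_stump(heat, intense)
```

A full tree learner would apply this search to every predictor, pick the best split, and recurse on each side; combining a few such splits yields human-readable rules like those reported in the study.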
Article
A good border surveillance system acts as a deterrent. When enemies know of the existence of an effective border surveillance system, they do not try to breach it, leading to peace in the region. A good surveillance system should have as many smart features as possible, such as intrusion detection and raising an alert upon detection. This proposal is a step forward in the design of a good, smart border surveillance system. In this paper, we have proposed a smart border surveillance system that uses wireless sensor nodes capable of identifying an intrusion and forwarding alert messages when an intruder is present. The system is also able to differentiate between animals and persons. A prototype of this smart border surveillance system was developed using Raspberry Pi boards integrated with passive infrared, ultrasonic, and camera sensors. For detection, Tiny YOLO, TensorFlow, and OpenCV modules were utilized. Communication between the Raspberry Pi-based wireless sensor nodes in the prototype is done using Zigbee serial communication. The prototype system is capable of forwarding video streams to the control station, where further analysis may be done or action may be taken. The results obtained indicate that the system is able to fulfill the defined objectives.
Book
This is the first book on synthetic data for deep learning, and its breadth of coverage may render this book as the default reference on synthetic data for years to come. The book can also serve as an introduction to several other important subfields of machine learning that are seldom touched upon in other books. Machine learning as a discipline would not be possible without the inner workings of optimization at hand. The book includes the necessary sinews of optimization though the crux of the discussion centers on the increasingly popular tool for training deep learning models, namely synthetic data. It is expected that the field of synthetic data will undergo exponential growth in the near future. This book serves as a comprehensive survey of the field. In the simplest case, synthetic data refers to computer-generated graphics used to train computer vision models. There are many more facets of synthetic data to consider. In the section on basic computer vision, the book discusses fundamental computer vision problems, both low-level (e.g., optical flow estimation) and high-level (e.g., object detection and semantic segmentation), synthetic environments and datasets for outdoor and urban scenes (autonomous driving), indoor scenes (indoor navigation), aerial navigation, and simulation environments for robotics. Additionally, it touches upon applications of synthetic data outside computer vision (in neural programming, bioinformatics, NLP, and more). It also surveys the work on improving synthetic data development and alternative ways to produce it such as GANs. The book introduces and reviews several different approaches to synthetic data in various domains of machine learning, most notably the following fields: domain adaptation for making synthetic data more realistic and/or adapting the models to be trained on synthetic data and differential privacy for generating synthetic data with privacy guarantees. 
This discussion is accompanied by an introduction into generative adversarial networks (GAN) and an introduction to differential privacy.
Article
The proliferation of synthetic data in artificial intelligence for medicine and healthcare raises concerns about the vulnerabilities of the software and the challenges of current policy.