Conference PaperPDF Available

A Probabilistic-driven Ensemble Approach to Perform Event Classification in Intrusion Detection System

Authors:

Abstract and Figures

Nowadays, it is clear how the network services represent a widespread element, which is absolutely essential for each category of users, professional and non-professional. Such a scenario needs a constant research activity aimed to ensure the security of the involved services, so as to prevent any fraudulently exploitation of the related network resources. This is not a simple task, because day by day new threats arise, forcing the research community to face them by developing new specific countermeasures. The Intrusion Detection System (IDS) covers a central role in this scenario, as its main task is to detect the intrusion attempts through an evaluation model designed to classify each new network event as normal or intrusion. This paper introduces a Probabilistic-Driven Ensemble (PDE) approach that operates by using several classification algorithms, whose effectiveness has been improved on the basis of a probabilistic criterion. A series of experiments, performed by using real-world data, show how such an approach outperforms the state-of-the-art competitors, proving its better capability to detect intrusion events with regard to the canonical solutions.
Content may be subject to copyright.
A Probabilistic-driven Ensemble Approach to Perform Event
Classification in Intrusion Detection System
Roberto Saia, Salvatore Carta and Diego Reforgiato Recupero
Department of Mathematics and Computer Science
University of Cagliari, Via Ospedale 72 - 09124 Cagliari, Italy
{roberto.saia, salvatore, diego.reforgiato}@unica.it
Keywords: Intrusion Detection, Pattern Mining, Machine Learning, Ensemble Approach, Probabilistic Model.
Abstract: Nowadays, it is clear how the network services represent a widespread element, which is absolutely essential
for each category of users, professional and non-professional. Such a scenario needs a constant research
activity aimed to ensure the security of the involved services, so as to prevent any fraudulently exploitation
of the related network resources. This is not a simple task, because day by day new threats arise, forcing
the research community to face them by developing new specific countermeasures. The Intrusion Detection
System (IDS) covers a central role in this scenario, as its main task is to detect the intrusion attempts through an
evaluation model designed to classify each new network event as normal or intrusion. This paper introduces a
Probabilistic-Driven Ensemble (PDE) approach that operates by using several classification algorithms, whose
effectiveness has been improved on the basis of a probabilistic criterion. A series of experiments, performed
by using real-world data, show how such an approach outperforms the state-of-the-art competitors, proving its
better capability to detect intrusion events with regard to the canonical solutions.
1 INTRODUCTION
Cybersecurity is an increasingly crucial aspect in our
times that is greatly dependent on network services,
since they affect the everyday life, involving areas
such as industry, education, finance, medicine, and so
on. In such a scenario the Intrusion Detection Systems
(IDSs) (Buczak and Guven, 2016) covered an impor-
tant role, because their main task is to face the risks
related to unauthorized use of such services by ma-
licious people. Such systems have been designed in
order to overcome the limits of the canonical appro-
aches, such as, for instance, those based on authenti-
cation procedures, on data encryption, or on specific
rules (e.g., firewalls).
They operate by exploiting a number of techni-
ques aimed to detect anomalous activities related to a
single machine or to an entire network. Examples of
such techniques are the Neural Networks, the Fuzzy
Logic, the Genetic Algorithms, the Clustering Algo-
rithms, and the Support Vector Machines. In addi-
tion, hybrid techniques that combine several techni-
ques can be used.
It should be observed how, regardless of the used
approach, the intrusion detection represents a hard
task, due to the high dynamism and heterogeneity of
the involved scenarios, both in terms of network infra-
structure and possible attacks. Considering that many
attacks are quite similar to the behavior of the normal
activities, it is clear that it is difficult for a classifica-
tion model to distinguish one from another.
The proposed classification approach is based on
several state-of-the-art techniques and the event clas-
sification is performed on the basis of the probabilities
of possible outcomes of each classification. The aim
is to improve the capability of an IDS to detect intru-
sions. The scientific contribution given by this paper
is the definition of an ensemble approach of classifi-
cation able to evaluate a new event by exploiting se-
veral state-of-the-art classification algorithms, which
are driven by a probabilistic model.
2 BACKGROUND AND RELATED
WORK
The concept of intrusion has been well discussed
in (Bace and Mell, 2001), where the authors des-
cribe it as an activity aimed to compromise or by-
pass the security policies that regulate a network or
a single machine. As defined in many authoritative
studies (Pfleeger and Pfleeger, 2012), computer secu-
rity attempts to ensure three main aspects: confiden-
tiality,integrity, and availability, which are compro-
mised (one or more) by an intrusion.
Confidentiality: it grants that the resources, provi-
ded by a single machine or a network, are acces-
sed only by the authorized users.
Integrity: it grants that such resources can be mo-
dified only by the authorized users, according to
the permissions with which they have been made
available.
Availability: it grants that such resources are avai-
lable to authorized users in accordance with the
established times, without any unwanted interrup-
tion or limitation.
An Intrusion Detection System (IDS) is aimed to
analyze the network traffic or the activity of a single
machine in order to discover non-authorized activi-
ties. Such activities can be originated by a malware
(e.g., a virus) or can be related to an human attack
(e.g., the attempt to gain an access to a resource) ope-
rated locally or remotely.
According to the literature (Ghorbani et al., 2010),
there are several aspects that can be used to classify
the IDSs, as reported in Table 1. Such aspects cove-
red two different areas, functional and non-functional.
Functional because they refer to intrinsic IDS charac-
teristics, whereas non-functional because they refer to
external characteristics that are not strictly related to
the IDS.
Table 1: IDSs Classification.
Aspect Possibilities Area
Detection Approach Anomaly, Misuse, or Specification-based Functional
System Modality Host, Network, Network-Node, or Hybrid based Functional
System Response Passive or Active Functional
System Activity Continuous or Periodic Non-functional
On the basis of the adopted detection approach,
the IDSs can be classified into three categories (Ghor-
bani et al., 2010):
1. Anomaly Detection approach (Bronte et al., 2016;
Holm, 2014), also known as Signature-based De-
tection approach, operates by comparing the new
events to a set of signatures related to a series of
known intrusion events;
2. Misuse Detection approach (Garcia-Teodoro
et al., 2009; Viegas et al., 2017), also known
as Anomaly-based Detection approach, works
on the basis of a model able to characterize the
normal activities, which is used to detect intrusion
attempts in the new events;
3. Specification-based Detection approach (Uppu-
luri and Sekar, 2001; Sekar et al., 2002) operates
by defining the allowed behavior of the systems in
the IDS area, considering as intrusion the events
that does not respect it. This approach has been
designed to combine the strengths of the Anomaly
Detection and Misuse Detection approaches.
On the basis of its operative modality an IDS can
operate as follows:
Host-based modality (Bukac et al., 2012): it ope-
rates by exploiting a number of agents installed
on a single machine that belongs to the network
to be monitored. In this case, all events on the
machine (e.g., active processes, log files, configu-
ration alteration, etc.) are analyzed by the agents
and the result is compared to a series of known at-
tack patterns stored in a database. When an attack
is detected, the system can send an alert or can
activate appropriate countermeasures. Such a mo-
dality allows a system to use many agents in order
to gain a higher level of protection, with the possi-
bility to perform a more detailed configuration of
them. A system that adopts this modality is named
as Host-based Intrusion Detection System (HIDS)
and its main disadvantages are the excessive num-
ber of false positives and false negatives, and the
latency between when the event occurs and when
it is processed.
The Network-based modality (Das and Sarkar,
2014): it operates by monitoring and analyzing
the network traffic in order to detect any attempt
of intrusion. In this case, a twofold approach is
followed: first a new event is compared with a
database of known patterns (signature matching)
and, when this operation fails, it is analyzed in
order to detect anomalies (network analysis). It
allows a system to detect also unknown attacks
and the result can be a simple notification or the
activation of automatic countermeasures, such as,
for instance, the block of the IP address of the
machine that is performing the attack. A system
that adopts this modality is named Network In-
trusion Detection System (NIDS) and the related
disadvantages are the difficulty to well operate in
networks characterized by a high level of traffic,
since they have to analyze all the involved pac-
kets, and its inability to analyze encrypted data.
Such a system can not operate proactively, be-
cause an attack can be detected only after it be-
gins.
The Network-Node-based modality (Potluri and
Diedrich, 2016): it operates by analyzed the net-
work traffic of a single network node. It is dif-
ferent from the HIDS and NIDS modalities, since
in this case the traffic taken into account does not
belong to a single machine or the entire network,
but it is related to a precise node of the network,
usually chosen for its strategical position. It can
be considered a mix of the aforementioned HIDS
and NIDS modalities, and a system that adopts it
is defined Network Node Intrusion Detection Sy-
stem (NNIDS).
The Hybrid-based modality (Amrita and Ravu-
lakollu, 2018): it combines yhe aforementioned
modalities to improve the system performance. In
such a context, the systems are usually defined as
Hybrid Intrusion Detection Systems or as Distri-
buted Intrusion Detection Systems.
An intrusion event detected by an IDS can lead
toward a passive or active response (Bijone, 2016), as
detailed in the following.
Passive response: an IDS only sends alerts to se-
curity managers, without triggering any automatic
countermeasure.
Active response: an IDS operates by performing
specific countermeasures, e.g., by closing a net-
work connection or a process.
The system activity is another criterion adopted to
classify the IDSs, as reported in the following.
Continuous Activity: systems included in this
group are those that collect and process the events
in real-time. It means that such systems con-
stantly monitor the area in which they operate-
Periodic Activity:IDSs that not process the col-
lected events in real-time but periodically are in-
cluded in this group.
Ensemble Methods: Literature reports that the com-
bined approaches are usually adopted in order to im-
prove the classification performance. It means that
an ensemble strategy commonly gets better perfor-
mance than that based on a single approach (Diette-
rich, 2000; Zainal et al., 2008). However, it should be
noted that it is not the norm, since a combination of
approaches can also worsen the classification perfor-
mance (Gomes et al., 2017).
The ensemble process is performed by using a de-
pendent framework (i.e., the construction of each ap-
proach depends on the output of the previous one)
or by using an independent framework (i.e., the con-
struction of each approach is independent of the other
ones) (Sagi and Rokach, 2018). In this paper we take
into account only this last type of framework.
In the context of the intrusion detection the ensem-
ble solutions are aimed to improve the detection of the
intrusion attempts, reducing at the same time the num-
ber of misclassification. Such solutions are imple-
mented by following two steps: in the first step a set of
approaches is defined on the basis of their result com-
plementarity (i.e., different correct and wrong classi-
fications); in the second step the single results of the
approaches are combined into one by using a strategy
(e.g., full agreement,majority voting, or weighted vo-
ting).
It should be noted how some approaches (e.g.,
Gradient Boosting and AdaBoost) work by following
ensemble criteria, by exploiting an ensemble of weak
prediction approaches (Guo et al., 2017).
Evaluation Metrics: Literature reports a number of
metrics adopted in order to evaluate the IDS perfor-
mance, as well as several datasets used in this pro-
cess (Munaiah et al., 2016). Considering that the main
objective of an IDS is the correct classification of the
events in two classes (i.e., normal and intrusion), me-
trics aimed to evaluate the binary classifiers, such as
those based on the confusion matrix1(i.e., F-score,
Accuracy,Sensitivity, and Specificity), or those able
to assess the effectiveness of the adopted classifica-
tion model (i.e., ROC and AUC2) are usually used.
In addition, many works take also into account ot-
her aspects, such as those related to the IDS perfor-
mance, by evaluating, for instance, the computational
load, the computational time, or the memory usage.
Together with the aforementioned functional aspects,
further metrics can be taken into account to evaluate
some financial aspects, such as, for instance, the costs
related to the needed hardware and software or those
related to the IDS administration (Sommer, 1999).
The systematic literature review performed
in (Munaiah et al., 2016) shows that there is not an
unique criterion to evaluate the performance of an
IDS, whereas it is fairly common to use more than
one metric to evaluate the proposed approaches.
Open Problems: We have previously highlighted
how the intrusion detection task is not easy, mainly for
the characteristics of the domain taken into account,
which presents an high degree of dynamism and he-
terogeneity. In addition, we underlined how many in-
trusion attempts follow a behavior similar to that of
the normal activities, therefore it is not simple for an
IDS to correctly detect the intrusions. Some major is-
sues related to the state-of-the-art solutions have been
summarized in the following:
one of the most important issues is related to the
Anomaly Detection approach. It mainly depends
on the difficulty to define, exhaustively, a signa-
ture for each possible intrusion event. It leads to-
wards a huge number of false alarms;
another issue arises in the context of the Misuse
Detection approach. It is related to the inability
for the system to recognize unknown intrusion
1A matrix 2x2 where are reported the number of True
Negatives (TN), False Negatives (FN), True Positives (TP),
and False Positives (FP).
2Respectively, Receiver Operating Characteristic and
Area Under the Receiver Operating Characteristic.
patterns, and this prevents the detection of novel
type of intrusion attempts;
considering that the Specification-based approach
combines the Anomaly Detection and Misuse De-
tection approaches, it inherits the aforementioned
issues.
As far as the aforementioned problems are con-
cerned, there is the issue related to the observation
that many intrusion detection approaches are desig-
ned to operate with a specific class of attacks. This
goes against the real-world scenarios, which are cha-
racterized by a various classes of attacks, also by con-
sidering the zero-day ones.
For this reason the ensemble approach proposed
in this paper has been designed to operate in a hete-
rogeneous scenario in terms of classes of attacks, and
therefore, its performance has been evaluated both in
terms of single class of attacks and in terms of all clas-
ses of attacks, by using a real-world dataset.
3 NOTATION AND PROBLEM
DEFINITION
This section reports the formal notation adopted in
this paper together with the formalization of the fa-
ced problem.
Formal Notation: Given a set of classified events
E={e1,e2,...,eN}, we denote as E+the subset of
legitimate ones (then E+E), and as Ethe subset
of anomalous ones (then EE).
Each event eEis composed by a set of values
V={v1,v2,...,vM}and each event can belong only
to one class cC, where C={normal ,intrusion}.
We also denote a set of of unclassified events ˆ
E=
{ˆe1,ˆe2,..., ˆeU}and a set of classification algorithms
A={a1,a2,...,aZ}.
Problem Definition: First, we denote as Ξthe
event classification process performed by our appro-
ach. Second, we define a function EVAL(ˆe,Ξ)able to
evaluate the correctness of the classification by the re-
turned boolean value β(0=misclassification,1=cor-
rect classification). Finally, our problem can be for-
malized as maximization of the sum of the returned
values, as shown in Equation 1.
max
0β≤| ˆ
E|
β=
|ˆ
E|
u=1
EVAL(ˆeu,Ξ)(1)
4 IMPLEMENTATION
The probabilistic model used to drive the classifica-
tion performed by our ensemble approach is based on
the Logistic Regression algorithm. In more detail, a
logistic model has been used in order to evaluate the
probability of a binary response based on more inde-
pendent predictors (i.e., the adopted state-of-the-art-
approaches). More formally, we evaluate the proba-
bility that an unevaluated event ˆeˆ
Ebelongs to a
predicted class cCby mapping the predicted values
to probabilities by using the sigmoid σfunction3.
Such a function is able to map any real value into
[0,1] and it is largely used in machine learning to map
predictions to probabilities. It is shown in Equation 2,
where σ(az(p)) denotes the probability estimate for
the prediction pmade by the algorithm azand expres-
sed in the range [0,1], and eis the base of natural log.
σ(az(p)) = 1
1+ep(2)
According to Equation 2, we define our probabi-
listic model by following the criterion indicated in
Equation 3, where Zdenotes the number of algo-
rithms (i.e., Z=|A|) used to classify an event ˆeˆ
E
by using our ensemble approach, and c(ˆe)denotes the
classification given to the event ˆe.
σ=1
Z·Z
z=1
σ(az(p))
wn=Z
z=1
1i f σ(az(p)) >σaz(p) = normal
wi=Z
z=1
1i f σ(az(p)) <σaz(p) = intrusion
c(ˆe) = (normal,if wn>wi
intrusion,otherwise
(3)
Our model is aimed to exclude from the classifica-
tion process those algorithms characterized by a low
level of probability in their predictions, combining the
effectiveness of the most performing ones through a
weighted probabilistic criterion.
On the basis of the probabilistic model, previously
formalized, the unevaluated events in ˆ
Eare classified
by using the Algorithm 1.
It takes as input the set of adopted evaluation al-
gorithms A, the set of past classified events E, and the
event to evaluate ˆe. It returns the classification (i.e.,
as normal or intrusion) of the event ˆe.
5 EXPERIMENTS
This section describes the environment used to per-
form the experiments, the adopted real-world dataset,
the metrics used for the performance evaluation, the
experimental strategy, and the obtained results, which
are discussed at the end.
3A mathematical function having a characteristic sig-
moid curve.
Algorithm 1:T ransaction classi f icat ion.
Input: A=Set of algorithms, E=Past classified events, ˆe=Event to evaluate
Output: result=Event ˆeclassification
1: procedure CLASSIFICATION(A,E, ˆe)
2: w10
3: w20
4: models =t rainingModel s(A,E)
5: predictions =get Predictions(model s,ˆe)
6: µgetProbabilityAverage(predict ions)
7: for each pin predict ions do
8: if getProbability(p)>µp== nor mal then
9: w1w1+1
10: else
11: w2w2+1
12: end if
13: end for
14: if w1>w2then
15: result normal
16: else
17: result intrusion
18: end if
19: return result
20: end procedure
5.1 Environment
The approach proposed in this paper has been deve-
loped in Python, as well as the implementation of
the state-of-the-art classification techniques used to
define our ensemble approach, which are based on
scikit-learn 4. In order to make our experimental re-
sults reproducible, we have set the seed of the pseudo-
random number generator used by the scikit-learn
evaluation algorithms.
5.2 Dataset
The real-world dataset used to evaluate the propo-
sed approach is NSL-KDD5. It represents an improved
version of the old KDD-CUP99, which in addition to
being dated it presents some problems (Wang et al.,
2014) that do not make it suitable for the evaluation
of the recent IDSs. The most significant issue related
to the former version of the dataset is the data redun-
dancy. It means that both the training and test sets
include a big number of duplicate events and these
influence the machine learning approaches, reducing
their capability to detect the less frequent events cor-
rectly.
These issues have been addressed in the NSL-
KDD as follows: (i) there are not redundant events
in the train and test sets; (ii) the number of events
that belong to each difficulty level group is inversely
proportional to the percentage of events present in the
4http://scikit-learn.org
5https://github.com/defcom17/NSL KDD
former dataset; (iii) the events in the train and test
sets are enough to perform the experiments without
the need to use a subset of them (operation usually
done in the former dataset in order to reduce the com-
putational complexity). Such improvements mitigate
the aforementioned issues and are able to reproduce
a typical real-world scenario that allow us to evaluate
the performance of the proposed approach and that of
the competitors.
An overview of the dataset characteristics is
shown in Table 2, where we can observe its compo-
sition in terms of normal (i.e., set |E+|) and intrusion
(i.e., set |E|) events. It should be observed how the
cardinality of classes (i.e., set |C|) is different in the
training and test sets, since some classes of events are
present in a dataset and not in the other one, and vice
versa.
Table 2: Dataset Overview.
Dataset Total events Normal Intrusion Values Classes
|E| |E+| |E| |V| |C|
Training 125,973 67,343 58,630 41 23
Test 22,543 9,710 12,833 41 38
Total 148,516 77,053 71,463
An analysis of the distribution of event types is
shown in Table 3, where we have grouped all types of
events into the following categories:
-Privilege Escalation Attack (PEA): this category
contains all the attacks aimed to obtain a privileged
access by starting from an unprivileged status, as it
happens, for instance, in a buffer overflow attack;
-Denial of Service Attack (DSA): this category inclu-
des all the attacks aimed to stop a service/system
by using a very high number of legitimate requests,
e.g., as it happens in a syn flooding attack;
-Remote Scanning Attack (RSA): this category clas-
sifies the attacks conducted to obtain information
about one or more services/systems, by adopting
non-invasive and invasive techniques, e.g., as it hap-
pens during a port scanning activity;
-Remote Access Attack (RAA): this category clas-
sifies all the attacks where the goal is to gain an
access to a remote system, without using sophisti-
cate techniques, e.g., by using a brute-force attack
to guess the user credentials;
-Normal Network Activity (NNA): this last category
contains the normal activity, then all the legitimate
network traffic not related to intrusion attempts.
In more detail, in the context of the NSL-KDD da-
taset, the aforementioned four categories of intrusion
events are related to the following major attacks:
-PEA: Buffer overflow,Loadmodule,Rootkit,Perl,
Sqlattack,Xterm, and Ps;
-DSA: Back,Land,Neptune,Pod,Smurf,Teardrop,
Mailbomb,Processtable,Udpstorm,Apache2, and
Worm;
-RSA: Satan,IPsweep,Nmap,Portsweep,Mscan,
and Saint;
-RAA: Guess password,Ftp write,Imap,Phf,Mul-
tihop,Warezmaster,Xlock,Xsnoop,Snmpguess,
Snmpgetattack,Httptunnel,Sendmail, and Named.
Table 3: Events Distribution.
Event Training Test Type Event Training Test Type
01 apache2 0 737 DSA 21 processtable 0 685 DSA
02 back 956 359 DSA 22 ps 0 15 PEA
03 buffer overflow 30 20 PEA 23 rootkit 10 13 PEA
04 ftp write 8 3 RAA 24 saint 0 319 RSA
05 guess passwd 52 1231 RAA 25 satan 3633 735 RSA
06 httptunnel 0 133 RAA 26 sendmail 0 14 RAA
07 imap 11 1 RAA 27 smurf 2646 665 DSA
08 ipsweep 3599 141 RSA 28 snmpgetattack 0 178 RAA
09 land 18 7 DSA 29 snmpguess 0 331 RAA
10 loadmodule 9 2 PEA 30 sqlattack 0 2 PEA
11 mailbomb 0 293 DSA 31 spy 2 0 RAA
12 mscan 0 996 RSA 32 teardrop 892 12 DSA
13 multihop 7 18 RAA 33 udpstorm 0 2 DSA
14 named 0 17 RAA 34 warezclient 890 0 RAA
15 neptune 41214 4657 DSA 35 warezmaster 20 944 RAA
16 nmap 1493 73 RSA 36 worm 0 2 DSA
17 perl 3 2 PEA 37 xlock 0 9 RAA
18 phf 4 2 RAA 38 xsnoop 0 4 RAA
19 pod 201 41 DSA 39 xterm 0 13 PEA
20 portsweep 2931 157 RSA 40 normal 67343 9710 NNA
Table 4 summarises the information previously
presented in Table 3. It gives us a measure about the
relevance of each class of attacks in the dataset.
Table 4: Events Overview.
Dataset PEA DSA RSA RAA NNA
Training 52 45,927 11,656 994 67,343
Test 67 7,460 2,421 2,885 9,710
Total 119 53,387 14,077 3,879 77,053
%0.08 35.95 9.48 2.61 51.88
5.3 Strategy
Algorithms Selections: The five competitor algo-
rithms Multilayer Perceptron (MLP), Decision Tree
(DTC), Adaptive Boosting (ADA), Gradient Boosting
(GBC), and Random Forests (RFC), have been se-
lected in order to have different classification techni-
ques and they include the most performing ones. Con-
sidering that the differences between the results given
by our PDE approach and those given by the single
competitor approaches were almost the same even af-
ter a parameter tuning, all the experiments have been
conducted by using the default scikit-learn values.
Data Preprocessing: Before the experiments, we
converted all the categorical dataset features into a nu-
merical form and, in addition, in order to be able to
perform a binary classification task, we added a new
feature (i.e., named class) that represents the two pos-
sible categories an event can be categorized (i.e., 1=
normal and 2=intrusion).
Evaluation Criterion: Considering that the canoni-
cal k-fold cross-validation criterion used to reduce the
Table 5: Events Detection Performance.
Ap proach Events T NR F -score AUC Average
Multilayer Perceptron (MLP) PE A 0.178 0.187 0.589 0.318
Decision Tree (DTC) PEA 0.484 0.468 0.742 0.565
Adaptive Boosting (ADA) PE A 0.496 0.565 0.748 0.603
Gradient Boosting (GBC) PE A 0.369 0.410 0.684 0.488
Random Forests (RFC) PEA 0.288 0.374 0.644 0.435
PDE PEA 0.610 0.516 0.805 0.644
Multilayer Perceptron (MLP) DSA 0.964 0.969 0.974 0.969
Decision Tree (DTC) DSA 0.991 0.994 0.994 0.993
Adaptive Boosting (ADA) DSA 0.990 0.993 0.994 0.992
Gradient Boosting (GBC) DSA 0.991 0.994 0.994 0.993
Random Forests (RFC) DSA 0.988 0.993 0.993 0.991
PDE DSA 0.993 0.993 0.994 0.993
Multilayer Perceptron (MLP) RSA 0.897 0.848 0.929 0.891
Decision Tree (DTC) RSA 0.977 0.980 0.987 0.981
Adaptive Boosting (ADA) RSA 0.974 0.977 0.985 0.979
Gradient Boosting (GBC) RSA 0.971 0.976 0.984 0.977
Random Forests (RFC) RSA 0.971 0.980 0.985 0.979
PDE RSA 0.988 0.971 0.989 0.983
Multilayer Perceptron (MLP) RAA 0.345 0.303 0.663 0.437
Decision Tree (DTC) RAA 0.851 0.858 0.924 0.878
Adaptive Boosting (ADA) RAA 0.803 0.846 0.901 0.850
Gradient Boosting (GBC) RAA 0.836 0.858 0.917 0.870
Random Forests (RFC) RAA 0.846 0.876 0.923 0.882
PDE RAA 0.885 0.834 0.940 0.886
Multilayer Perceptron (MLP) N NA 0.885 0.900 0.908 0.898
Decision Tree (DTC) NNA 0.969 0.979 0.981 0.976
Adaptive Boosting (ADA) NN A 0.947 0.965 0.967 0.960
Gradient Boosting (GBC) NN A 0.966 0.978 0.979 0.974
Random Forests (RFC) NN A 0.964 0.978 0.980 0.974
PDE NNA 0.980 0.976 0.977 0.978
impact of the data dependency during the experiments
does not work in the case of time series data, since it
does not take into account the event chronology, we
adopted a different approach based on the TimeSe-
riesSplit scikit-learn functionalities. It is a time se-
ries cross-validation criterion able to divide a dataset
into n splits training and test sets, respecting the data
chronology. In our case we used TimeSeriesSplit with
n splits=10.
5.4 Metrics
According to the considerations made in Section 2,
setting the focus on the functional aspect, we evalua-
ted the performance of the proposed approach on the
basis of three metrics. The first considered metric is
the Specificity (true negative rate) as it is more cru-
cial to perform a correct classification of the intrusion
events rather than normal events, and, therefore, it is
better to adopt a prudent policy preferring to get more
false negative than false positives. The second con-
sidered metric is the F-score metric, since it offers
an overview in terms of Accuracy and Recall perfor-
mance. The last metric taken into account is AUC,
since it is able to evaluate the capability of an evalua-
tion model to distinguish between two different clas-
ses of events (i.e., normal and intrusion).
Table 6: Ensemble Strategies Performance.
Strategy Event s TNR F-score AUC Average
Full Agreement PEA 0.048 0.081 0.524 0.218
Majority Voting PEA 1.000 0.003 0.500 0.501
Weighted voting PEA 0.409 0.521 0.705 0.545
PDE PEA 0.610 0.516 0.805 0.644
Full Agreement DSA 0.972 0.980 0.982 0.979
Majority Voting DSA 1.000 0.581 0.500 0.694
Weighted voting DSA 0.990 0.994 0.994 0.993
PDE DSA 0.993 0.993 0.994 0.993
Full Agreement RSA 0.900 0.927 0.946 0.924
Majority Voting RSA 1.000 0.269 0.500 0.590
Weighted voting RSA 0.974 0.979 0.985 0.979
PDE RSA 0.989 0.971 0.990 0.983
Full Agreement RAA 0.312 0.409 0.656 0.454
Majority Voting RAA 1.000 0.089 0.500 0.530
Weighted voting RAA 0.837 0.871 0.918 0.878
PDE RAA 0.885 0.837 0.940 0.886
Full Agreement NNA 0.931 0.946 0.950 0.935
Majority Voting NNA 1.000 0.650 0.500 0.717
Weighted voting NNA 0.963 0.977 0.978 0.972
PDE NNA 0.980 0.975 0.976 0.978
Table 7: F-score and AUC Differences.
Events F-score AUC
PEA 0.005 +0.100
DSA 0.001 +0.000
RSA 0.007 +0.004
RAA 0.039 +0.020
NNA 0.000 0.001
Average 0.010 +0.025
5.5 Results
The performed experiments were aimed to evaluate
the performance of several state-of-the-art approaches
when they are used in order to distinguish the intru-
sion events from the normal ones.
We performed this operation in two phases, initi-
ally by taking into account all normal events together
with those related to a single class of intrusions, then
by using all normal events together with those related
to all classes of intrusions.
The experimental results are reported in Table 5
and Table 6, where, respectively, our approach has
been compared to each single approach, then it has
been compared to the canonical ensemble strategies.
In such tables, the best average performances are
highlighted in bold.
Premising that all the experiments have been made
by following the time series cross-validation criterion
described in Section 5.3, the analysis of the obtained
results leads towards the following considerations:
the results shown in Table 5 indicate that our PDE
approach outperforms each single state-of-the-art
approaches in terms of average performance (i.e.,
TNR,F-score, and AUC), except in the case of
the DSA events, where it reaches the same per-
formance of the best competitors;
the results shows in Table 6 indicate that our PDE
approach also outperforms all canonical ensemble
strategies in terms of average performance, except
in the case of the DSA events, where it reaches the
same performance of the best competitors;
the results shown in Table 7 indicate that such an
improvement in terms of TNR (i.e., Specificity)
does not lead to significant reduction of the clas-
sifier performance in terms of F-score and AUC,
since although we have a moderate worsening in
terms of average F-score (i.e., 0.01), we get
an improvement in terms of average AUC (i.e.,
+0.02);
considering that the adopted real-world dataset
contains a large number of events related to intru-
sion events, as reported in Table 2, the obtained
improvement allows a system to detect a signifi-
cant number of attacks that otherwise would not
have been detected;
the last consideration should be made about the
False Negative Rate (FNR) (i.e., the Miss Rate)
related to our approach, in order to verify that the
increase of the True Negative Rate (i.e., the Spe-
cificity) is not correlated with an equal increase
of this value (i.e., the FNR): the experimental re-
sults report an average FNR of 0.007, whereas our
average TNR is 0.056;
the TPR and FNR rates indicate that a very low
percentage of the new detected intrusions leads to-
wards false alarms, also by considering that, pru-
dentially, in the context taken into account the
false negative classifications are preferable to the
false positive ones.
6 CONCLUSIONS
The ensemble approach proposed in this paper has
been designed to maximize the capability to detect in-
trusion events, independently from the operative sce-
nario, exploiting several evaluation models in order
to differentiate the normal events from those related
to all classes of attacks.
The results show how the proposed ensemble ap-
proach outperforms the single classifiers in terms of
Specificity, without being significantly penalized in
other aspects, because it gets a slight deterioration in
terms of average F-score but a positive average AUC
(wrt the competitor approaches), demonstrating the
effectiveness of our evaluation model.
Some ongoing efforts we are already focusing on
are related to the employment of deep learning techni-
ques and frameworks (e.g. TensorFlow, Keras) on ad-
hoc hardware (e.g. NVidia GPUs) in order to further
improve the proposed approach.
ACKNOWLEDGEMENTS
This research is partially funded by Regione Sar-
degna,Next generation Open Mobile Apps Develop-
ment (NOMAD) project, Pacchetti Integrati di Agevo-
lazione (PIA) - Industria Artigianato e Servizi (2013).
REFERENCES
Amrita and Ravulakollu, K. K. (2018). A hybrid intrusion
detection system: Integrating hybrid feature selection
approach with heterogeneous ensemble of intelligent
classifiers. I. J. Network Security, 20(1):41–55.
Bace, R. and Mell, P. (2001). Nist special publication on
intrusion detection systems. Technical report, BOOZ-
ALLEN AND HAMILTON INC MCLEAN VA.
Bijone, M. (2016). A survey on secure network: Intrusion
detection & prevention approaches. American Journal
of Information Systems, 4(3):69–88.
Bronte, R., Shahriar, H., and Haddad, H. M. (2016). A
signature-based intrusion detection system for web
applications based on genetic algorithm. In Procee-
dings of the 9th International Conference on Security
of Information and Networks, Newark, NJ, USA, July
20-22, 2016, pages 32–39. ACM.
Buczak, A. L. and Guven, E. (2016). A survey of data mi-
ning and machine learning methods for cyber security
intrusion detection. IEEE Communications Surveys
and Tutorials, 18(2):1153–1176.
Bukac, V., Tucek, P., and Deutsch, M. (2012). Advances
and challenges in standalone host-based intrusion de-
tection systems. In TrustBus, volume 7449 of Lecture
Notes in Computer Science, pages 105–117. Springer.
Das, N. and Sarkar, T. (2014). Survey on host and net-
work based intrusion detection system. Internatio-
nal Journal of Advanced Networking and Applicati-
ons, 6(2):2266.
Dietterich, T. G. (2000). Ensemble methods in machine le-
arning. In Multiple Classifier Systems, volume 1857
of Lecture Notes in Computer Science, pages 1–15.
Springer.
Garcia-Teodoro, P., D´
ıaz-Verdejo, J. E., Maci´
a-Fern´
andez,
G., and V´
azquez, E. (2009). Anomaly-based network
intrusion detection: Techniques, systems and challen-
ges. Computers & Security, 28(1-2):18–28.
Ghorbani, A. A., Lu, W., and Tavallaee, M. (2010). Net-
work Intrusion Detection and Prevention - Concepts
and Techniques, volume 47 of Advances in Informa-
tion Security. Springer.
Gomes, H. M., Barddal, J. P., Enembreck, F., and Bifet, A.
(2017). A survey on ensemble learning for data stream
classification. ACM Comput. Surv., 50(2):23:1–23:36.
Guo, H., Li, Y., Shang, J., Mingyun, G., Yuanyue, H.,
and Bing, G. (2017). Learning from class-imbalanced
data: Review of methods and applications. Expert
Syst. Appl., 73:220–239.
Holm, H. (2014). Signature based intrusion detection for
zero-day attacks: (not) A closed chapter? In 47th
Hawaii International Conference on System Sciences,
HICSS 2014, Waikoloa, HI, USA, January 6-9, 2014,
pages 4895–4904. IEEE Computer Society.
Munaiah, N., Meneely, A., Wilson, R., and Short, B. (2016).
Are intrusion detection studies evaluated consistently?
a systematic literature review.
Pfleeger, C. P. and Pfleeger, S. L. (2012). Security in Com-
puting, 4th Edition. Prentice Hall.
Potluri, S. and Diedrich, C. (2016). High performance in-
trusion detection and prevention systems: A survey. In
ECCWS2016-Proceedings fo the 15th European Con-
ference on Cyber Warfare and Security, page 260.
Academic Conferences and publishing limited.
Sagi, O. and Rokach, L. (2018). Ensemble learning: A sur-
vey. Wiley Interdisc. Rew.: Data Mining and Know-
ledge Discovery, 8(4).
Sekar, R., Gupta, A., Frullo, J., Shanbhag, T., Tiwari, A.,
Yang, H., and Zhou, S. (2002). Specification-based
anomaly detection: a new approach for detecting net-
work intrusions. In ACM Conference on Computer
and Communications Security, pages 265–274. ACM.
Sommer, P. (1999). Intrusion detection systems as evidence.
Computer Networks, 31(23-24):2477–2487.
Uppuluri, P. and Sekar, R. (2001). Experiences with
specification-based intrusion detection. In Recent Ad-
vances in Intrusion Detection, volume 2212 of Lecture
Notes in Computer Science, pages 172–189. Springer.
Viegas, E., Santin, A. O., and Oliveira, L. S. (2017). Toward
a reliable anomaly-based intrusion detection in real-
world environments. Computer Networks, 127:200–
216.
Wang, Y., Yang, K., Jing, X., and Jin, H. L. (2014). Pro-
blems of kdd cup 99 dataset existed and data prepro-
cessing. In Applied Mechanics and Materials, volume
667, pages 218–225. Trans Tech Publ.
Zainal, A., Maarof, M. A., Shamsuddin, S. M. H., and
Abraham, A. (2008). Ensemble of one-class classi-
fiers for network intrusion detection system. In IAS,
pages 180–185. IEEE Computer Society.
... The authors of [27] introduced a probabilistic-driven ensemble (PDE)-based approach. This approach operates with several classification algorithms, wherein the effectiveness of these algorithms has been improved by applying a probabilistic criterion. ...
... Machine learning algorithm (MLA) efficiency for cyberattack detection in the Internet of Things infrastructure.Table 1. Cont. Khan, M.A.; Khan Khattk, M.A.; Latif, S.; Shah, A.A.; Saia, R.; Carta, S.; Recupero, D.R.[27] To summarize, there is a strong need to evolve new methods for cyberattack detection in the IoT infrastructure. To do this, we are to eliminate technique drawbacks and increase the detection efficiency of detecting known and unknown cyberattacks in the IoT infrastructure. ...
Article
Full-text available
Cybersecurity is a common Internet of Things security challenge. The lack of security in IoT devices has led to a great number of devices being compromised, with threats from both inside and outside the IoT infrastructure. Attacks on the IoT infrastructure result in device hacking, data theft, financial loss, instability, or even physical damage to devices. This requires the development of new approaches to ensure high-security levels in IoT infrastructure. To solve this problem, we propose a new approach for IoT cyberattack detection based on machine learning algorithms. The core of the method involves network traffic analyses that IoT devices generate during communication. The proposed approach deals with the set of network traffic features that may indicate the presence of cyberattacks in the IoT infrastructure and compromised IoT devices. Based on the obtained features for each IoT device, the feature vectors are formed. To conclude the possible attack presence, machine learning algorithms were employed. We assessed the complexity and time of machine learning algorithm implementation considering multi-vector cyberattacks on IoT infrastructure. Experiments were conducted to approve the method’s efficiency. The results demonstrated that the network traffic feature-based approach allows the detection of multi-vector cyberattacks with high efficiency.
... The advantage of this is that each base learner can learn on a different feature subset or in a different model space, thereby reducing the generalization error of the model. Saia et al. [26] proposed a probability-driven ensemble (PDE) method that uses several classification algorithms and improves the effectiveness of classification algorithms based on probability criteria. The proposed ensemble method exceeds the specificity of a single classifier without suffering a significant penalty in other aspects. ...
Article
Full-text available
Today, the development of the Internet of Things has grown, and the number of related IoT devices has reached the order of tens of billions. Most IoT devices are vulnerable to attacks, especially DdoS (Distributed Denial of Service attack) attacks. DDoS attacks can easily cause damage to IoT devices, and LDDoS is an attack launched against hardware resources through a small string of very slow traffic. Compared with traditional large-scale DDoS, their attacks require less bandwidth and generate traffic similar to that of normal users, making them difficult to distinguish when identifying them. This article uses the CICIoT2023 dataset combined with behavioral features and stacking mechanisms to extract information from the attack behavior of low-rate attacks as features and uses the stacking mechanism to improve the recognition effect. A method of behavioral characteristics and stacking mechanism is proposed to detect DDoS attacks. This method can accurately detect LDDoS. Experimental results show that the recognition rate of low-rate attacks of this scheme reaches 0.99, and other indicators such as accuracy, recall, and F1 score are all better than other LDDoS detection methods. Thus, the method model proposed in this paper can effectively detect LDDoS attacks. At present, DDoS attacks are relatively mature, and there are many related results, but there is less research on LDDoS detection alone. This paper focuses on the investigation and analysis of LDDoS attacks in DDoS attacks and deduces feasible LDDoS detection methods.
... To face these problems, researchers are constantly looking for more and more efficient Intrusion Detection Systems (IDSs) [6], which are designed using various techniques such as, just to name a few, those based on Machine Learning and Deep Learning [7,8], Artificial Intelligence [9], Artificial Neural Networks [10][11][12][13], Fuzzy Logic [14], often combining more than one to define hybrid solutions [15]. Starting from the consideration that most of the approaches and strategies in the literature related to the IDS domain exploit the entire training set to define the classification model [5,[16][17][18], we have trivially observed that a training dataset refers to single events in terms of data rows and to the different features that characterize each event in terms of data columns. ...
... To face these problems, researchers are constantly looking for more and more efficient Intrusion Detection Systems (IDSs) [6], which are designed using various techniques such as, just to name a few, those based on Machine Learning and Deep Learning [7,8], Artificial Intelligence [9], Artificial Neural Networks [10][11][12][13], Fuzzy Logic [14], often combining more than one to define hybrid solutions [15]. Starting from the consideration that most of the approaches and strategies in the literature related to the IDS domain exploit the entire training set to define the classification model [5,[16][17][18], we have trivially observed that a training dataset refers to single events in terms of data rows and to the different features that characterize each event in terms of data columns. ...
Conference Paper
Full-text available
The ever-increasing use of services based on computer networks, even in crucial areas unthinkable until a few years ago, has made the security of these networks a crucial element for anyone, also in consideration of the increasingly sophisticated techniques and strategies available to attackers. In this context, Intrusion Detection Systems (IDSs) play a very important role since they are responsible for analyzing and classifying each network activity as legitimate or illegitimate, allowing us to take the necessary countermeasures at the appropriate time. However, these systems are not infallible and this is due to several reasons, the most important of which are the constant evolution of the attacks (e.g., zero-day attacks) and the problem that many attacks have behavior similar to those of the legitimate activities, and therefore they are very hard to identify. This work relies on the hypothesis that the subdivision of the training data used for the definition of the IDS classification model into a certain number of partitions, in terms of events and features, can improve the characterization of the network events, improving the system performance. All the non-overlapping data partitions train independent classification models and the event is classified according to a majority-voting rule. A series of experiments conducted on a benchmark real-world dataset support the initial hypothesis, showing a performance improvement, compared to a canonical training approach.
... The trained models then performed majority voting during classification to boost prediction accuracy by exploiting the idea of wisdom of the crowd. The ensemble approach, with a probabilistic framework, has also been previously used to improve intrusion detection by combining results from multiple ML algorithms [52]. Alternatively, in our work, we train an ensemble of tensors with different ranks and apply p-value fusion over their predictions to capture the hypothesis carried by each rank. ...
Article
Full-text available
Distinguishing malicious anomalous activities from unusual but benign activities is a fundamental challenge for cyber defenders. Prior studies have shown that statistical user behavior analysis yields accurate detections by learning behavior profiles from observed user activity. These unsupervised models are able to generalize to unseen types of attacks by detecting deviations from normal behavior, without knowledge of specific attack signatures. However, approaches proposed to date based on probabilistic matrix factorization are limited by the information conveyed in a two-dimensional space. Non-negative tensor factorization, on the other hand, is a powerful unsupervised machine learning method that naturally models multi-dimensional data, capturing complex and multi-faceted details of behavior profiles. Our new unsupervised statistical anomaly detection methodology matches or surpasses state-of-the-art supervised learning baselines across several challenging and diverse cyber application areas, including detection of compromised user credentials, botnets, spam e-mails, and fraudulent credit card transactions.
... Many works in the literature [23] evidence that wireless-based technologies evolution did not keep up with the security one. It means that a series of problems that affect security in a broad sense jeopardize the significant opportunities offered by these new technologies. ...
Article
Full-text available
In the last decades, modern societies are experiencing an increasing adoption of interconnected smart devices. This revolution involves not only canonical devices such as smartphones and tablets, but also simple objects like light bulbs. Named as the Internet of Things (IoT), this ever-growing scenario offers enormous opportunities in many areas of modern society, especially if joined by other emerging technologies such as, for example, the blockchain. Indeed, the latter allows users to certify transactions publicly, without relying on central authorities or intermediaries. This work aims to exploit the scenario above by proposing a novel blockchain-based distributed paradigm to secure localization services, here defined as Internet of Entities (IoE). It represents a mechanism for the reliable localization of people and things, and it exploits the increasing number of existing wireless devices and the blockchain-based distributed ledger technology. Moreover, unlike most of the canonical localization approaches, it is strongly oriented towards the protection of the users' privacy. Finally, its implementation requires minimal efforts since it employs the existing infrastructures and devices, thus giving life to a new and wide data environment, exploitable in many domains, such as e-health, smart cities, and smart mobility.
... The idea on which the proposed strategy revolves was born according to many literature works in this field (Carta et al., 2020b;Saia et al., 2018b;Saia et al., 2019;Saia et al., 2020), where we trivially observed that the data used to train an evaluation model refers to different period of time, as well as the features that characterize each event refer to different aspect of it. ...
Conference Paper
Full-text available
Anyone working in the field of network intrusion detection has been able to observe how it involves an ever-increasing number of techniques and strategies aimed to overcome the issues that affect the state-of-the-art solutions. Data unbalance and heterogeneity are only some representative examples of them, and each misclassification operates in this context could have enormous repercussions in different crucial areas such as, for instance, financial, privacy, and public reputation. This happens because the current scenario is characterized by a huge number of public and private network-based services. The idea behind the proposed work is decomposing the canonical classification process into several sub-processes, where the final classification depends on all the sub-processes results, plus the canonical one. The proposed Training Data Decomposition (TDD) strategy is applied on the training datasets, where it applies a decomposition into regions, according to a defined number of events and features. The ratio that leads this process is related to the observation that the same network event could be evaluated in a different manner, when it is evaluated in different time periods and/or when it involves different features. According to this observation, the proposed approach adopts different classification models, each of them trained in a different data region characterized by different time periods and features, classifying the event both on the basis of all model results, and on the basis of the canonical strategy that involves all data.
Conference Paper
Today's computer system security is critical at every operational level and device, as the compromise of a single element can propagate through connected other network elements, causing unpredictable and dangerous effects. To face unauthorized access and evolving malicious strategies, researchers have intensified efforts to develop effective Intrusion Detection Systems (IDSs) that monitor and analyze network traffic to detect illegitimate activities. This is a difficult challenge given the growing sophistication of malicious tactics that often mimic legitimate behavior. In such a context, this work proposes the HYDRA-LNNE (Hybrid Data Real and Artificial LSTM Neural Network Ensemble) approach, which involves feature selection and data quantization to reduce data complexity, and an ensemble of three Long Short-Term Memory (LSTM) neural networks trained on real data, GAN-generated synthetic data, and a combination of both, with the aim to maximize the strengths of each data type, effectively discriminating normal from malicious network activities. The validation process performed on the UNSW-NB15 dataset, well known for its comprehensive representation of modern cyber threats, shows that our approach outperforms state-of-the-art solutions across multiple metrics.
Conference Paper
Full-text available
An Intrusion Detection System (IDS) is an essential feature, aims to defend the integrity, availability, the confidentiality of the data utilized in the networks against attacks. An IDS observes the network activities to examine the invasive patterns. In the case of an attack, the system should have a proper response. Different Machine Learning (ML) techniques are being used over the past several years. Since an algorithm can be evaluated on various parameters, no single algorithm is said to be more accurate. Ensemble learning combines several weak learners and are useful to improve the performance of intrusion detection. Deep learning methods are gaining traction as useful techniques, if the input data is huge and it suits real-time applications. Transfer Learning (TL) methods suits the applications that have limited data. These methods are useful in detecting zero day attacks. This paper presents the recent works carried out on intrusion detection. The study enables the readers to understand research foundation, research status, challenges, and the future scope of research.
Article
Full-text available
In this work, homogeneous ensemble techniques, namely bagging and boosting were employed for intrusion detection to determine the intrusive activities in network by monitoring the network traffic. Simultaneously, model diversity was enhanced as numerous algorithms were taken into account, thereby leading to an increase in the detection rate Several classifiers, i.e., SVM, KNN, RF, ETC and MLP) were used in case of bagging approach. Likewise, tree-based classifiers have been employed for boosting. The proposed model was tested on NSL-KDD dataset that was initially subjected to preprocessing. Accordingly, ten most significant features were identified using decision tree and recursive feature elimination method. Furthermore, the dataset was divided into five subsets, each one them being subjected to training, and the final results were obtained based on majority voting. Experimental results proved that the model was effective for detecting intrusive activities. Bagged ETC and boosted RF outperformed all the other classifiers with an accuracy of 99.123% and 99.309%, respectively.
Article
Full-text available
A popular approach for detecting network intrusion attempts is to monitor the network traffic for anomalies. Extensive research effort has been invested in anomaly-based network intrusion detection using machine learning techniques; however, in general these techniques remain a research topic, rarely being used in real-world environments. In general, the approaches proposed in the literature lack representative datasets and reliable evaluation methods that consider real-world network properties during the system evaluation. In general, the approaches adopt a set of assumptions about the training data, as well as about the validation methods, rendering the created system unreliable for open-world usage. This paper presents a new method for creating intrusion databases. The objective is that the databases should be easy to update and reproduce with real and valid traffic, representative, and publicly available. Using our proposed method, we propose a new evaluation scheme specific to the machine learning intrusion detection field. Sixteen intrusion databases were created, and each of the assumptions frequently adopted in studies in the intrusion detection literature regarding network traffic behavior was validated. To make machine learning detection schemes feasible, we propose a new multi-objective feature selection method that considers real-world network properties. The results show that most of the assumptions frequently applied in studies in the literature do not hold when using a machine learning detection scheme for network-based intrusion detection. However, the proposed multi-objective feature selection method allows the system accuracy to be improved by considering real-world network properties during the model creation process.
Article
Full-text available
This paper proposes Hybrid Feature Selection Approach { Heterogeneous Ensemble of Intelligent Classi�ers (HyFSA-HEIC) for intelligent lightweight network intrusion detection system (NIDS). The purpose is to classify for anomaly from the incoming tra�c. This system hierarchically integrates HyFSA and HEIC. The HyFSA will obtain the optimal number of features and then HEIC is built using these optimal features. HyFSA helps to decrease the computation time of the system and make it lightweight to work in real time. The aim of HEIC is to obtain accurate and robust classi�er and enhance overall performance of the system. The results demonstrate that proposed system outperforms other ensemble and single classi�er methods used in this paper. It has true positive rate (99.9%), accuracy (99.91%), precision (99.9%), receiver operating characteristics (99.9%), low false positive rate (0.1%) and lower root mean square error rate (3.06%) with a minimum number of selected 6 features. It also reduces time to build and time to test the model by 50.79% and 55.30% respectively on reduced features set. The results evince that detection rate, accuracy and precision of the system is increased by incorporating feature selection approach with heterogeneous ensemble of intelligent classi�ers and signi�cantly reduce the computation time.
Article
Full-text available
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integrated with drift detection algorithms and incorporate dynamic updates, such as selective removal or addition of classifiers. This work proposes a taxonomy for data stream ensemble learning as derived from reviewing over 60 algorithms. Important aspects such as combination, diversity, and dynamic updates, are thoroughly discussed. Additional contributions include a listing of popular open-source tools and a discussion about current data stream research challenges and how they relate to ensemble learning (big data streams, concept evolution, feature drifts, temporal dependencies, and others).
Article
Full-text available
With invent of new technologies and devices, Intrusion has become an area of concern because of security issues, in the ever growing area of cyber-attack. An intrusion detection system (IDS) is defined as a device or software application which monitors system or network activities for malicious activities or policy violations. It produces reports to a management station [1]. In this paper we are mainly focused on different IDS concepts based on Host and Network systems.
Article
Ensemble methods are considered the state‐of‐the art solution for many machine learning challenges. Such methods improve the predictive performance of a single model by training multiple models and combining their predictions. This paper introduce the concept of ensemble learning, reviews traditional, novel and state‐of‐the‐art ensemble methods and discusses current challenges and trends in the field. This article is categorized under: • Algorithmic Development > Model Combining • Technologies > Machine Learning • Technologies > Classification
Article
Rare events, especially those that could potentially negatively impact society, often require humans’ decision-making responses. Detecting rare events can be viewed as a prediction task in data mining and machine learning communities. As these events are rarely observed in daily life, the prediction task suffers from a lack of balanced data. In this paper, we provide an in depth review of rare event detection from an imbalanced learning perspective. Five hundred and seventeen related papers that have been published in the past decade were collected for the study. The initial statistics suggested that rare events detection and imbalanced learning are concerned across a wide range of research areas from management science to engineering. We reviewed all collected papers from both a technical and a practical point of view. Modeling methods discussed include techniques such as data preprocessing, classification algorithms and model evaluation. For applications, we first provide a comprehensive taxonomy of the existing application domains of imbalanced learning, and then we detail the applications for each category. Finally, some suggestions from the reviewed papers are incorporated with our experiences and judgments to offer further research directions for the imbalanced learning and rare event detection fields.
Conference Paper
Web application attacks are an extreme threat to the world's information technology infrastructure. A web application is generally defined as a client-server software application where the client uses a user interface within a web browser. Most users are familiar with web application attacks. For instance, a user may have received a link in an email that led the user to a malicious website. The most widely accepted solution to this threat is to deploy an Intrusion Detection System (IDS). Such a system currently relies on signatures of the predefined set of events matching with attacks. Issues still arise as all possible attack signatures may not be defined before deploying an IDS. Attack events may not fit with the pre-defined signatures. Thus, there is a need to detect new types of attacks with a mutated signature based detection approach. Most traditional literature works describe signature based IDSs for application layer attacks, but several works mention that not all attacks can be detected. It is well known that many security threats can be related to software or application development and design or implementation flaws. Given that fact, this work expands a new method for signature based web application layer attack detection. We apply a genetic algorithm to analyze web server and database logs and the log entries. The work contributes to the development of a mutated signature detection framework. The initial results show that the suggested approach can detect specific application layer attacks such as Cross-Site Scripting, SQL Injection and Remote File Inclusion attacks.
Article
KDD Cup 99 dataset is not only the most widely used dataset in intrusion detection, but also the de facto benchmark on evaluating the performance merits of intrusion detection system. Nevertheless there are a lot of issues in this dataset which cannot be omitted. In order to establish good data mining models in intrusion detection and find the appropriate network intrusion attack types’ features, researchers should have a well-known understanding on this dataset. In this paper, first and foremost we have made an in-depth analysis on the problems which the dataset are existed, and given the related solutions. Secondly, we also have carried out plenty data preprocessing on the 10% subset of KDD Cup 99 dataset’s training set, giving better results to the following process. What’s more, by comparing 10 common kinds of data mining algorithms in our experiment, we have analyzed and summarized that data preprocessing plays a vital role on the performance and importance to data mining algorithms.
Article
This survey paper describes a focused literature survey of machine learning (ML) and data mining (DM) methods for cyber analytics in support of intrusion detection. Short tutorial descriptions of each ML/DM method are provided. Based on the number of citations or the relevance of an emerging method, papers representing each method were identified, read, and summarized. Because data are so important in ML/DM approaches, some well-known cyber data sets used in ML/DM are described. The complexity of ML/DM algorithms is addressed, discussion of challenges for using ML/DM for cyber security is presented, and some recommendations on when to use a given method are provided.