Conference Paper

Using genetic programming for combining an ensemble of local and global outlier algorithms to detect new attacks


Abstract

Modern intrusion detection systems must be able to discover new types of attacks in real time. To this end, automatic or semi-automatic techniques can be used; outlier detection algorithms are particularly well suited to this task, as they can work in an unsupervised way. However, because attacks differ in nature and behavior, the performance of different outlier detection algorithms varies widely. In this ongoing work, we describe an approach aimed at understanding whether an ensemble of outlier algorithms can be used to effectively detect new types of attacks in intrusion detection systems. In particular, Genetic Programming (GP) is adopted to build the combining function of an ensemble of local and global outlier detection algorithms, which are used to detect different types of attack. Preliminary experiments, conducted on the well-known NSL-KDD dataset, are encouraging: they confirm that, depending on the type of attack, it can be better to use only local or only global detection algorithms, and that the GP-based ensemble improves performance compared with commonly used combining functions.
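The idea of a combining function for outlier scores can be illustrated with a minimal Python sketch. It contrasts two commonly used fixed combiners (mean and maximum) with a GP-style expression tree evaluated over the same scores. The detector names, the score values, the function set (max, min, avg), and the tuple encoding of the tree are all assumptions made for illustration; the abstract does not specify the primitives or the representation actually used.

```python
from statistics import mean

# Hypothetical, already-normalized outlier scores in [0, 1]: one list per
# detector, one entry per connection record. Names and values are made up.
scores = {
    "local_1": [0.10, 0.90, 0.20],
    "local_2": [0.20, 0.80, 0.10],
    "global_1": [0.05, 0.30, 0.70],
}
# One tuple of detector scores per record.
rows = list(zip(*scores.values()))


def combine_mean(row):
    """Fixed combiner: average of all detector scores."""
    return mean(row)


def combine_max(row):
    """Fixed combiner: the most suspicious detector wins."""
    return max(row)


# Assumed function set for the evolved combiner; not taken from the paper.
OPS = {
    "max": max,
    "min": min,
    "avg": lambda a, b: (a + b) / 2,
}


def evaluate(tree, row):
    """Evaluate a GP-style tree: an int leaf selects one detector's score,
    a tuple (op, left, right) applies a binary operator to two subtrees."""
    if isinstance(tree, int):
        return row[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, row), evaluate(right, row))


# Example individual: average the two local detectors, then take the
# maximum of that value and the global detector's score.
tree = ("max", ("avg", 0, 1), 2)
combined = [evaluate(tree, row) for row in rows]
```

In a full GP run, many such trees would be generated, scored by a fitness function, and recombined; the sketch only shows how a single candidate tree turns per-detector scores into one combined score per record.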

Article
Machine learning algorithms such as genetic programming (GP) can evolve biased classifiers when data sets are unbalanced. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria in the fitness function. This paper aims to both highlight the limitations of the current GP approaches in this area and develop several new fitness functions for binary classification with unbalanced data. Using a range of real-world classification problems with class imbalance, we empirically show that these new fitness functions evolve classifiers with good performance on both the minority and majority classes. Our approaches use the original unbalanced training data in the GP learning process, without the need to artificially balance the training examples from the two classes (e.g., via sampling).
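The bias described above can be seen in a small Python sketch that compares overall accuracy with the average of the per-class accuracies on an unbalanced binary problem. The average of per-class accuracies is a widely used imbalance-aware measure; whether it is one of the fitness functions proposed in that article is not stated in the abstract, so it is used here only as an assumed example. The data and function names are hypothetical.

```python
def confusion(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def overall_accuracy(y_true, y_pred):
    """Traditional criterion: fraction of all examples classified correctly."""
    tp, fn, tn, fp = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fn + tn + fp)


def average_class_accuracy(y_true, y_pred):
    """Imbalance-aware criterion: mean of the accuracy on each class."""
    tp, fn, tn, fp = confusion(y_true, y_pred)
    minority_acc = tp / (tp + fn) if (tp + fn) else 0.0
    majority_acc = tn / (tn + fp) if (tn + fp) else 0.0
    return (minority_acc + majority_acc) / 2


# Hypothetical data: 95 majority examples (0) and 5 minority examples (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
always_majority = [0] * 100

acc = overall_accuracy(y_true, always_majority)
bal = average_class_accuracy(y_true, always_majority)
```

Here the degenerate classifier scores 0.95 on overall accuracy but only 0.5 on the average of per-class accuracies, which is why the choice of criterion inside the fitness function matters.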
Article
A new parallel implementation of genetic programming based on the cellular model is presented and compared with both canonical genetic programming and the island model approach. The method adopts a load balancing policy that avoids the unequal utilization of the processors. Experimental results on benchmark problems of different complexity show the superiority of the cellular approach with respect to the canonical sequential implementation and the island model. A theoretical performance analysis reveals the high scalability of the implementation and makes it possible to predict the size of the population when the number of processors and their efficiency are fixed.
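As a rough illustration of what a cellular model means, the Python sketch below places individuals on a two-dimensional toroidal grid and restricts mate selection to each cell's immediate neighborhood. The grid shape, the eight-cell (Moore) neighborhood, the best-neighbor selection rule, and the convention that higher fitness is better are assumptions for illustration; the abstract does not describe these details.

```python
def moore_neighbors(row, col, rows, cols):
    """Return the 8 cells surrounding (row, col) on a grid that wraps
    around at the edges (a torus)."""
    return [
        ((row + dr) % rows, (col + dc) % cols)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]


def best_neighbor(fitness, row, col):
    """Pick the mate for the individual at (row, col): the neighbor with
    the highest fitness (assumes higher is better)."""
    rows, cols = len(fitness), len(fitness[0])
    return max(
        moore_neighbors(row, col, rows, cols),
        key=lambda rc: fitness[rc[0]][rc[1]],
    )


# Hypothetical fitness values for a 3 x 3 population grid.
fitness = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
mate = best_neighbor(fitness, 0, 0)
```

Because each individual interacts only with nearby cells, the grid can be split into blocks assigned to different processors, with communication needed only along block borders.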
Article
During the last decade, anomaly detection has attracted the attention of many researchers seeking to overcome the weakness of signature-based IDSs in detecting novel attacks, and KDDCUP'99 is the most widely used data set for the evaluation of these systems. Having conducted a statistical analysis of this data set, we found two important issues that strongly affect the performance of the evaluated systems and result in a very poor evaluation of anomaly detection approaches. To solve these issues, we have proposed a new data set, NSL-KDD, which consists of selected records of the complete KDD data set and does not suffer from any of the mentioned shortcomings.
Conference Paper
Modern intrusion detection systems must handle many complicated issues in real time, as they have to cope with a real data stream: for the classification task, the classes are typically unbalanced, and the systems must also cope with distributed attacks and react quickly to changes in the data. Data mining techniques and, in particular, ensembles of classifiers make it possible to combine different classifiers that together provide complementary information and can be built incrementally. This paper introduces the architecture of a distributed intrusion detection framework and, in particular, the detector module based on a meta-ensemble, which is used to cope with the problem of detecting intrusions, in which the number of attacks is typically smaller than the number of normal connections. To this end, we explore the usage of ensembles specialized to detect particular types of attack or normal connections, and Genetic Programming is adopted to generate a non-trainable function to combine each specialized ensemble. Non-trainable functions can be evolved without any extra phase of training and, therefore, they are particularly well suited to handling concept drift, also under real-time constraints. Preliminary experiments, conducted on the well-known KDD dataset and on a more up-to-date dataset, ISCX IDS, show the effectiveness of the approach.
Article
Cyber security classification algorithms usually operate with datasets presenting many missing features and strongly unbalanced classes. In order to cope with these issues, we designed a distributed Genetic Programming (GP) framework, named CAGE-MetaCombiner, which adopts a meta-ensemble model to operate efficiently with missing data. Each ensemble evolves a function for combining the classifiers, which does not need any extra phase of training on the original data. Therefore, in the case of changes in the data, the function can be recomputed incrementally with moderate computational effort; this aspect, together with the advantages of running on parallel/distributed architectures, makes the algorithm suitable for operating under the real-time constraints typical of a cyber security problem. In addition, an important cyber security problem that concerns the classification of the users or the employers of an e-payment system is illustrated, in order to show the relevance of the case in which entire sources of data or groups of features are missing. Finally, the capacity of the approach to handle groups of missing features and unbalanced datasets is validated on many artificial datasets and on two real datasets, and it is compared with some similar approaches.
Article
The challenges associated with handling uncertain data, in particular with querying and mining, are attracting increasing attention in the research community. Here we focus on clustering uncertain data and describe a general framework for this purpose that also makes it possible to visualize and understand the impact of uncertainty (using different uncertainty models) on the data mining results. Our framework constitutes release 0.7 of ELKI (http://elki.dbs.ifi.lmu.de/) and thus comes along with a plethora of implementations of algorithms, distance measures, indexing techniques, evaluation measures and visualization components.
Article
Ensemble analysis is a widely used meta-algorithm for many data mining problems such as classification and clustering. Numerous ensemble-based algorithms have been proposed in the literature for these problems. Compared to the clustering and classification problems, ensemble analysis has been studied in a limited way in the outlier detection literature. In some cases, ensemble analysis techniques have been implicitly used by many outlier analysis algorithms, but the approach is often buried deep within the algorithm and not formally recognized as a general-purpose meta-algorithm. This is in spite of the fact that this problem is rather important in the context of outlier analysis. This paper discusses the various methods that are used in the literature for outlier ensembles and the general principles by which such analysis can be made more effective. A discussion is also provided on how outlier ensembles relate to the ensemble techniques commonly used for other data mining problems.
Gianluigi Folino, Francesco Sergio Pisani, and Pietro Sabatino. 2016. A Distributed Intrusion Detection Framework Based on Evolved Specialized Ensembles of Classifiers.