Article

Active learning approach to label network traffic datasets


Abstract

In the field of network security, the process of labeling a network traffic dataset is especially expensive, since expert knowledge is required to perform the annotations. With the aid of visual analytic applications such as RiskID, the effort of labeling network traffic is considerably reduced. However, since the label assignment still requires an expert weighing several factors, the annotation process remains a difficult task. The present article introduces a novel active learning strategy that builds a random forest model from the connections the user has previously labeled. The resulting model provides the user with an estimate of the class probability for each remaining unlabeled connection, helping in the traffic annotation task. The article describes the active learning strategy, the interfaces with the RiskID system, the algorithms used to predict botnet behavior, and a proposed evaluation framework. The evaluation framework includes studies to assess not only the prediction performance of the active learning strategy but also its learning rate and resilience against noise, as well as the improvements over other well-known labeling strategies. The framework represents a complete methodology for evaluating the performance of any active learning solution. The evaluation results showed that the proposed approach is a significant improvement over previous labeling strategies.
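The full text is not available here, but the loop the abstract describes is easy to picture: a random forest is retrained on the connections the user has labeled so far, and its class probabilities guide the next annotations. The following is a minimal, hypothetical sketch of that kind of pool-based loop; the synthetic features, the seed size, and the uncertainty-based query rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a pool-based active-learning loop in the spirit of the
# abstract: a RandomForest is retrained on user-labelled connections and its
# class probabilities are used to suggest which connection to label next.
# Feature construction and the query criterion are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_pool = rng.random((1000, 10))             # stand-in for per-connection features
y_true = (X_pool[:, 0] > 0.5).astype(int)   # hidden "botnet"/"normal" labels

labeled_idx = list(rng.choice(len(X_pool), size=20, replace=False))  # seed labels
unlabeled_idx = [i for i in range(len(X_pool)) if i not in labeled_idx]

for _ in range(10):  # ten user-interaction rounds
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_pool[labeled_idx], y_true[labeled_idx])

    proba = clf.predict_proba(X_pool[unlabeled_idx])[:, 1]
    # Uncertainty sampling: suggest the connection the model is least sure about.
    query = unlabeled_idx[int(np.argmin(np.abs(proba - 0.5)))]

    labeled_idx.append(query)               # the expert would provide this label
    unlabeled_idx.remove(query)
```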


... Otherwise, Guerra et al. present RiskID (Guerra et al., 2019; Torres et al., 2019), a modern application focused on the labeling of real traffic. Specifically, RiskID aims to create labeled datasets based on botnet and normal behaviors. ...
... On the other hand, some assisted labeling strategies seem to be more adaptable to new behaviors, as they depend on the generalization of their prediction model. Both Beaugnon et al. (2017, 2018) and Torres et al. (2019) published the source code of their visualization tools together with the AL prediction model. However, model performance can often decay, since predictions are biased toward specific network behavior. ...
... However, the area has not sufficiently explored the benefits of the human-in-the-loop approach. The articles surveyed in the present work (Beaugnon et al., 2018; Fan et al., 2019; Torres et al., 2019) mainly focused on the results of the usual machine learning metrics. They should stop focusing on this kind of metric and start analyzing other aspects, such as the confidence and uncertainty of annotations, or measuring the reduction of human effort while keeping human-level precision. ...
Article
In contrast to previous surveys, the present work is not focused on reviewing the datasets used in the network security field. The fact is that many of the available public labeled datasets represent the network behavior only for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work focuses on the analysis of current labeling methodologies applied to network-based data. In the field of network security, the process of labeling a representative network traffic dataset is particularly challenging and costly, since very specialized knowledge is required to classify network traces. Consequently, most of the current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the essential aspects necessary for a correct differentiation between normal and malicious behavior. Alternatively, a few other methods incorporate non-expert users in the labeling process of real traffic with the help of visual and statistical tools. However, after conducting an in-depth analysis, it seems that all current labeling methods suffer from fundamental drawbacks regarding the quality, volume, and speed of the resulting dataset. This lack of consistent methods for continuously generating a representative dataset with an accurate and validated methodology must be addressed by the network security research community. Moreover, a consistent labeling methodology is a fundamental condition for helping in the acceptance of novel detection approaches based on statistical and machine learning techniques.
... In addition to the information provided by the flow-based predictors, the CTU19 dataset includes flow-related information such as source IP, destination IP, protocol, port, and the source linked with each capture. However, the present study will focus only on the information provided by the flow-based predictors, as discussed in [2,8]. Fig. 4 describes the process used for the creation of the training and testing sets according to each splitting strategy. ...
Preprint
Full-text available
Even though a randomly performed train/test split of the dataset is common practice, it may not always be the best approach for estimating performance generalization under some scenarios. The fact is that the usual machine learning methodology can sometimes overestimate the generalization error when a dataset is not representative, or when rare and elusive examples are a fundamental aspect of the detection problem. In the present work, we analyze strategies based on the predictors' variability for splitting the data into training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness, and provide a more accurate estimate of the generalization error when the dataset is not representative. Two baseline classifiers based on decision trees were used for testing the four splitting strategies considered. Both classifiers were applied to CTU19, a low-representativeness dataset for a network security detection problem. Preliminary results showed the importance of applying the three strategies alternative to the Monte Carlo splitting strategy in order to get a more accurate error estimation in different but feasible scenarios.
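As a rough illustration of why the splitting strategy matters (not the preprint's predictor-variability strategies, which are not reproduced here), the sketch below contrasts a plain Monte Carlo split with a stratified split on a synthetic dataset containing a rare class; the class proportions and sizes are arbitrary assumptions.

```python
# Illustrative contrast between a plain Monte Carlo split and a stratified
# split that guarantees rare examples appear on both sides, which matters
# when a dataset is not representative. Not the preprint's exact strategies.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((1000, 5))
y = np.zeros(1000, dtype=int)
y[rng.choice(1000, size=20, replace=False)] = 1   # rare, elusive class (~2%)

# Plain random split: the rare class may be badly under-represented in the test set.
_, _, _, y_test_mc = train_test_split(X, y, test_size=0.3, random_state=1)

# Stratified split: both sets keep roughly the original class proportions.
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

print("rare examples in Monte Carlo test set:", int(y_test_mc.sum()))
print("rare examples in stratified test set:", int(y_test_strat.sum()))
```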
... Otherwise, Guerra et al. present RiskID [79,80], a modern application focused on the labeling of real traffic. Specifically, RiskID aims to create labeled datasets based on botnet and normal behaviors. ...
Preprint
Full-text available
In contrast to previous surveys, the present work is not focused on reviewing the datasets used in the network security field. The fact is that many of the available public labeled datasets represent the network behavior only for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work focuses on the analysis of current labeling methodologies applied to network-based data. In the field of network security, the process of labeling a representative network traffic dataset is particularly challenging and costly, since very specialized knowledge is required to classify network traces. Consequently, most of the current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the essential aspects necessary for a correct differentiation between normal and malicious behavior. Alternatively, a few other methods incorporate non-expert users in the labeling process of real traffic with the help of visual and statistical tools. However, after conducting an in-depth analysis, it seems that all current labeling methods suffer from fundamental drawbacks regarding the quality, volume, and speed of the resulting dataset. This lack of consistent methods for continuously generating a representative dataset with an accurate and validated methodology must be addressed by the network security research community. Moreover, a consistent labeling methodology is a fundamental condition for helping in the acceptance of novel detection approaches based on statistical and machine learning techniques.
... This strategy uses the SVM kernel to reduce the model's generalization loss while labeling the queried instances. Similarly, Torres et al. [77] used expected error reduction for the selection of informative instances, using random forests as their base model. Hua Zhu [94] applied multiple shallow learning methods, such as SVM, logistic regression, decision trees, and naïve Bayes, within an active learning framework as the meta-learning technique to label the queried instances. ...
Article
Full-text available
Edge devices are extensively used as intermediaries between the device and the service layer in an industrial Internet of Things (IIoT) environment. These devices are quite vulnerable to malware attacks. Existing studies have worked on designing complex learning algorithms or deep architectures to accurately classify malware, assuming that a sufficient number of labeled examples is provided. In the real world, getting labeled examples is one of the major issues for training any classification algorithm. Recent advances have allowed researchers to use active learning strategies that are trained on a handful of labeled examples to perform the classification task, but they are based on the selection of informative instances. This study integrates Q-learning characteristics into an active learning framework, which allows the network to either request or predict a label during the training process. We propose the use of phase space embedding, a sparse autoencoder, and an LSTM with the action-value function to classify malware applications while using a handful of labeled examples. The network relies on its uncertainty to either request or predict a label. The experimental results show that the proposed method can achieve better accuracy than the supervised learning strategy while using few label requests. The results also show that the trained network is resilient to adversarial attacks, which proves the robustness of the proposed method. Additionally, this study explores the tradeoff between classification accuracy and the number of label requests via the choice of rewards and the use of decision-level fusion strategies to boost the classification performance. Furthermore, we also provide a hypothetical framework as an implication of the proposed method.
Article
One of the reasons that the deployment of network intrusion detection methods falls short is the lack of realistic labeled datasets, which makes it challenging to develop and compare techniques. This is caused by the large amount of effort it takes for a cyber expert to classify network connections. It has raised the need for methods that learn, from both labeled and unlabeled data, which observations are best to present to the human expert. Hence, Active Learning (AL) methods are of interest. In this paper, we propose a new hybrid AL method called Jasmine. Firstly, it uses an uncertainty score and an anomaly score to determine how suitable each observation is for querying, i.e., how likely it is to enhance classification. Secondly, Jasmine introduces dynamic updating. This allows the model to adjust the balance between querying uncertain, anomalous, and randomly selected observations. To this end, Jasmine is able to learn the best query strategy during the labeling process. This is in contrast to the other AL methods in cybersecurity, which all have static, predetermined query functions. We show that dynamic updating, and therefore Jasmine, is able to consistently obtain good and more robust results than querying only uncertainties, only anomalies, or a fixed combination of the two.
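Jasmine's exact scoring and dynamic-update rules are not reproduced here, but the general shape of such a hybrid query function can be sketched: each unlabeled observation receives a weighted mix of an uncertainty score and an anomaly score, and the weight alpha is what a dynamic scheme would adjust between rounds. The models, the normalization, and the fixed alpha below are illustrative assumptions.

```python
# Generic sketch of a hybrid query score in the spirit of Jasmine: a mix of an
# uncertainty score and an anomaly score per unlabeled observation. The
# concrete scores, weights and models here are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

def query_scores(clf, iso, X_unlabeled, alpha):
    """Weighted mix of an uncertainty score and an anomaly score (both in [0, 1])."""
    proba = clf.predict_proba(X_unlabeled)[:, 1]
    uncertainty = 1.0 - 2.0 * np.abs(proba - 0.5)        # 1 = maximally unsure
    anomaly = -iso.score_samples(X_unlabeled)            # higher = more anomalous
    anomaly = (anomaly - anomaly.min()) / (anomaly.max() - anomaly.min() + 1e-9)
    return alpha * uncertainty + (1.0 - alpha) * anomaly

rng = np.random.default_rng(2)
X_lab, y_lab = rng.random((50, 8)), rng.integers(0, 2, 50)
X_unl = rng.random((500, 8))

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_lab, y_lab)
iso = IsolationForest(random_state=0).fit(np.vstack([X_lab, X_unl]))

scores = query_scores(clf, iso, X_unl, alpha=0.6)  # alpha is what dynamic updating would tune
to_query = np.argsort(scores)[-10:]                # ten most query-worthy observations
print("suggested indices:", to_query)
```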
Article
Data labeling is crucial in various areas, including network security, and is a prerequisite for applying statistical classification and supervised learning techniques. Therefore, developing labeling methods that ensure good performance is important. We propose a human-guided auto-labeling algorithm involving the self-supervised learning concept, with the purpose of labeling data quickly, accurately, and consistently. It consists of three processes: auto-labeling, validation, and update. A labeling scheme is proposed that considers weighted features in the auto-labeling, while the generalized extreme learning machine (GELM), which enables fast training, is applied to validate assigned labels. Two different approaches are considered in the update step to label new data, in order to investigate labeling speed and accuracy. We experiment to verify the suitability and accuracy of the algorithm for network traffic, applying it to five traffic datasets, some including distributed denial of service (DDoS), DoS, BruteForce, and PortScan attacks. Numerical results show that the algorithm labels unlabeled datasets quickly, accurately, and consistently, and that the GELM's learning speed enables labeling data in real time. They also show that the performance obtained with auto-generated and conventional labels is nearly identical on datasets containing only DDoS attacks, which implies the algorithm is quite suitable for such datasets. However, the performance differences between the two kinds of labels are not negligible on datasets that include various attacks. Several reasons that require further investigation can be considered, including the selected features and the reliability of the conventional labels. Even with this limitation of the current study, the algorithm will provide a criterion for labeling data that occurs in real time in many areas.
Chapter
Malware is an executable file stored on a target computer that, when executed, might harm that computer. There has been drastic growth in the volume of malicious software in recent years, which compromises the digital security of individuals, financial institutions, businesses, and government firms. The malware is classified into nine different families. The aim of this paper is to identify the class of malware as per the given convention. This is a multi-class classification problem, and our objective is to minimize the multi-class log-loss error and to predict the probability estimates for each class for a given file, in order to determine which class the file belongs to. The proposed classifier produced a log-loss of 0.031% on the Microsoft dataset, which is divided randomly into three parts: train, cross-validation, and test. Keywords: Malware, Multi-class classification, Machine learning
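For readers unfamiliar with the evaluation metric mentioned above, the sketch below shows how per-class probability estimates are scored with the multi-class log-loss using scikit-learn; the random features and the forest classifier are placeholders, not the Microsoft dataset or the authors' model.

```python
# Minimal sketch of the evaluation described above: predicting per-class
# probabilities for a nine-family problem and scoring them with the
# multi-class log-loss. Data and model are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.random((2000, 30))               # stand-in for per-file features
y = rng.integers(0, 9, 2000)             # nine malware families

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)
clf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)          # probability estimate for each family
print("multi-class log-loss:", log_loss(y_te, proba, labels=np.arange(9)))
```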
Article
Full-text available
Network Traffic Classification (NTC) has become an important feature in various network management operations, e.g., Quality of Service (QoS) provisioning and security services. Machine Learning (ML) algorithms, as a popular approach for NTC, can promise reasonable accuracy in classification and deal with encrypted traffic. However, ML-based NTC techniques suffer from the shortage of labeled traffic data, which is the case in many real-world applications. This study investigates the applicability of an active form of ML, called Active Learning (AL), in NTC. AL reduces the need for a large number of labeled examples by actively choosing the instances that should be labeled. The study first provides an overview of NTC and its fundamental challenges, along with surveying the literature on ML-based NTC methods. Then, it introduces the concepts of AL, discusses them in the context of NTC, and reviews the literature in this field. Further, challenges and open issues in AL-based classification of network traffic are discussed. Moreover, as a technical survey, some experiments are conducted to show the broad applicability of AL in NTC. The simulation results show that AL can achieve high accuracy with a small amount of data.
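The query strategies most often compared in such AL surveys can be written in a few lines. The following generic sketch (not the study's experimental setup) shows least-confidence, margin, and entropy sampling applied to a small matrix of class probabilities.

```python
# Three standard pool-based query strategies often compared in AL surveys:
# least-confidence, smallest-margin and maximum-entropy sampling.
import numpy as np

def least_confidence(proba):
    return 1.0 - proba.max(axis=1)                 # high when the top class is weak

def margin(proba):
    part = np.sort(proba, axis=1)
    return -(part[:, -1] - part[:, -2])            # high when the top two are close

def entropy(proba):
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# proba: (n_samples, n_classes) class probabilities from any classifier.
proba = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.34, 0.33, 0.33]])
for strategy in (least_confidence, margin, entropy):
    print(strategy.__name__, "picks sample", int(np.argmax(strategy(proba))))
```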
Article
The increasing digitization and interconnection of legacy Industrial Control Systems (ICSs) open new vulnerability surfaces, exposing such systems to malicious attackers. Furthermore, since ICSs are often employed in critical infrastructures (e.g., nuclear plants) and manufacturing companies (e.g., chemical industries), attacks can lead to devastating physical damages. In dealing with this security requirement, the research community focuses on developing new security mechanisms such as Intrusion Detection Systems (IDSs), facilitated by leveraging modern machine learning techniques. However, these algorithms require a testing platform and a considerable amount of data to be trained and tested accurately. To satisfy this prerequisite, Academia, Industry, and Government are increasingly proposing testbed (i.e., scaled-down versions of ICSs or simulations) to test the performances of the IDSs. Furthermore, to enable researchers to cross-validate security systems (e.g., security-by-design concepts or anomaly detectors), several datasets have been collected from testbeds and shared with the community. In this paper, we provide a deep and comprehensive overview of ICSs, presenting the architecture design, the employed devices, and the security protocols implemented. We then collect, compare, and describe testbeds and datasets in the literature, highlighting key challenges and design guidelines to keep in mind in the design phases. Furthermore, we enrich our work by reporting the best performing IDS algorithms tested on every dataset to create a baseline in state of the art for this field. Finally, driven by knowledge accumulated during this survey’s development, we report advice and good practices on the development, the choice, and the utilization of testbeds, datasets, and IDSs.
Article
Full-text available
Automatic data annotation eliminates most of the challenges we faced due to the manual methods of annotating sensor data. It significantly improves users’ experience during sensing activities since their active involvement in the labeling process is reduced. An unsupervised learning technique such as clustering can be used to automatically annotate sensor data. However, the lingering issue with clustering is the validation of generated clusters. In this paper, we adopted the k-means clustering algorithm for annotating unlabeled sensor data for the purpose of detecting sensitive location information of mobile crowd sensing users. Furthermore, we proposed a cluster validation index for the k-means algorithm, which is based on Multiple Pair-Frequency. Thereafter, we trained three classifiers (Support Vector Machine, K-Nearest Neighbor, and Naïve Bayes) using cluster labels generated from the k-means clustering algorithm. The accuracy, precision, and recall of these classifiers were evaluated during the classification of “non-sensitive” and “sensitive” data from motion and location sensors. Very high accuracy scores were recorded from Support Vector Machine and K-Nearest Neighbor classifiers while a fairly high accuracy score was recorded from the Naïve Bayes classifier. With the hybridized machine learning (unsupervised and supervised) technique presented in this paper, unlabeled sensor data was automatically annotated and then classified.
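A minimal sketch of the hybrid (unsupervised, then supervised) pipeline described above: k-means produces pseudo-labels that are then used to train the three classifiers mentioned. The feature matrix, the number of clusters, and the omission of the proposed validation index are simplifying assumptions.

```python
# Sketch of the hybrid idea: cluster unlabeled sensor data with k-means, treat
# the cluster assignments as labels, then train supervised classifiers on them.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(4)
X = rng.random((600, 6))                       # stand-in for motion/location features

kmeans = KMeans(n_clusters=2, n_init=10, random_state=4).fit(X)
pseudo_labels = kmeans.labels_                 # cluster ids used as annotations

for model in (SVC(), KNeighborsClassifier(), GaussianNB()):
    model.fit(X, pseudo_labels)
    print(type(model).__name__, "train accuracy:", model.score(X, pseudo_labels))
```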
Article
Full-text available
Anomaly-based approaches in network intrusion detection suffer from evaluation, comparison, and deployment issues that originate from the scarcity of adequate publicly available network trace datasets. Also, publicly available datasets are either outdated or generated in a controlled environment. Due to the ubiquity of cloud computing environments in commercial and government internet services, there is a need to assess the impact of network attacks in cloud data centers. To the best of our knowledge, there is no publicly available dataset which captures the normal and anomalous network traces in the interactions between cloud users and cloud data centers. In this paper, we present an experimental platform designed to represent a practical interaction between cloud users and cloud services and to collect network traces resulting from this interaction to conduct anomaly detection. We use the Amazon Web Services (AWS) platform for conducting our experiments.
Conference Paper
Full-text available
Correctly labelled datasets are commonly required. Three particular scenarios are highlighted, which showcase this need. When using supervised Intrusion Detection Systems (IDSs), these systems need labelled datasets to be trained. Also, the real nature of the analysed datasets must be known when evaluating the efficiency of the IDSs when detecting intrusions. Another scenario is the use of feature selection that works only if the processed datasets are labelled. In normal conditions, collecting labelled datasets from real networks is impossible. Currently, datasets are mainly labelled by implementing off-line forensic analysis, which is impractical because it does not allow real-time implementation. We have developed a novel approach to automatically generate labelled network traffic datasets using an unsupervised anomaly based IDS. The resulting labelled datasets are subsets of the original unlabelled datasets. The labelled dataset is then processed using a Genetic Algorithm (GA) based approach, which performs the task of feature selection. The GA has been implemented to automatically provide the set of metrics that generate the most appropriate intrusion detection results.
Conference Paper
Full-text available
Anomaly detection for network intrusion detection is usually considered an unsupervised task. Prominent techniques, such as one-class support vector machines, learn a hypersphere enclosing network data, mapped to a vector space, such that points outside of the ball are considered anomalous. However, this setup ignores relevant information such as expert and background knowledge. In this paper, we rephrase anomaly detection as an active learning task. We propose an effective active learning strategy to query low-confidence observations and to expand the data basis with minimal labeling effort. Our empirical evaluation on network intrusion detection shows that our approach consistently outperforms existing methods in relevant scenarios.
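One simple way to realize the idea of querying low-confidence observations with a one-class SVM, shown here as a generic sketch rather than the authors' strategy, is to rank points by the absolute value of the decision function and send the smallest values to the expert.

```python
# Sketch of rephrasing anomaly detection as active learning: train a one-class
# SVM, then query the observations whose decision-function value is closest to
# the boundary (lowest confidence) for expert labeling.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 4))                 # stand-in for vectorized network data

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)
dist = np.abs(ocsvm.decision_function(X))      # distance to the enclosing boundary

budget = 20
to_label = np.argsort(dist)[:budget]           # least-confident points for the expert
print("indices to send to the expert:", to_label)
```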
Conference Paper
Full-text available
Flow-based intrusion detection has recently become a promising security mechanism in high speed networks (1-10 Gbps). Despite the richness of contributions in this field, benchmarking of flow-based IDSs is still an open issue. In this paper, we propose the first publicly available, labeled data set for flow-based intrusion detection. The data set aims to be realistic, i.e., representative of real traffic and complete from a labeling perspective. Our goal is to provide such an enriched data set for tuning, training and evaluating ID systems. Our setup is based on a honeypot running widely deployed services and directly connected to the Internet, ensuring attack exposure. The final data set consists of 14.2M flows, and more than 98% of them have been labeled.
Article
Full-text available
Despite the flurry of anomaly-detection papers in recent years, effective ways to validate and compare proposed solutions have remained elusive. We argue that evaluating anomaly detectors on manually labeled traces is both important and unavoidable. In particular, it is important to evaluate detectors on traces from operational networks because it is in this setting that the detectors must ultimately succeed. In addition, manual labeling of such traces is unavoidable because new anomalies will be identified and characterized from manual inspection long before there are realistic models for them. It is well known, however, that manual labeling is slow and error-prone. In order to mitigate these challenges, we present WebClass, a web-based infrastructure that adds rigor to the manual labeling process. WebClass allows researchers to share, inspect, and label traffic time-series through a common graphical user interface. We are releasing WebClass to the research community in the hope that it will foster greater collaboration in creating labeled traces and that the traces will be of higher quality because the entire community has access to all the information that led to a given label.
Article
Full-text available
The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
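A toy version of uncertainty sampling on text, assuming a TF-IDF representation and a logistic regression classifier rather than the original newswire setup, looks like this:

```python
# Tiny illustration of uncertainty sampling for text: fit a probabilistic
# classifier on a few labelled documents and ask for labels on the unlabelled
# documents whose predicted probability is closest to 0.5.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_docs = ["stock prices fell sharply", "the team won the final match",
                "markets rallied after the report", "the striker scored twice"]
labels = [0, 1, 0, 1]                        # 0 = finance, 1 = sports
unlabeled_docs = ["quarterly earnings beat expectations",
                  "the coach praised the defense",
                  "bond yields and the league table"]

vec = TfidfVectorizer().fit(labeled_docs + unlabeled_docs)
clf = LogisticRegression().fit(vec.transform(labeled_docs), labels)

proba = clf.predict_proba(vec.transform(unlabeled_docs))[:, 1]
query = int(np.argmin(np.abs(proba - 0.5)))  # most uncertain document
print("ask the annotator about:", unlabeled_docs[query])
```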
Article
Full-text available
The aim of this study is to compare two supervised classification methods on a crucial meteorological problem. The data consist of satellite measurements of cloud systems which are to be classified as either convective or non-convective systems. Convective cloud systems correspond to lightning, and detecting such systems is of main importance for thunderstorm monitoring and warning. Because the problem is highly unbalanced, we consider specific performance criteria and different strategies. This case study can be used in an advanced course on data mining in order to illustrate the use of logistic regression and random forest on a real data set with unbalanced classes.
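The comparison described above can be reproduced in spirit on synthetic unbalanced data; the sketch below (with arbitrary class weights and a balanced-accuracy criterion, not the study's specific performance criteria) contrasts logistic regression and a random forest.

```python
# Sketch of the comparison on a synthetic, highly unbalanced binary problem:
# logistic regression vs. random forest, evaluated with a class-sensitive
# metric rather than plain accuracy. Data and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.97, 0.03],
                           random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=6)

for model in (LogisticRegression(max_iter=1000, class_weight="balanced"),
              RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                     random_state=6)):
    model.fit(X_tr, y_tr)
    score = balanced_accuracy_score(y_te, model.predict(X_te))
    print(type(model).__name__, "balanced accuracy:", round(score, 3))
```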
Book
With the growth of public and private data stores and the emergence of off-the-shelf data-mining technology, recommendation systems have emerged that specifically address the unique challenges of navigating and interpreting software engineering data. This book collects, structures, and formalizes knowledge on recommendation systems in software engineering. It adopts a pragmatic approach with an explicit focus on system design, implementation, and evaluation. The book is divided into three parts: “Part I – Techniques” introduces basics for building recommenders in software engineering, including techniques for collecting and processing software engineering data, but also for presenting recommendations to users as part of their workflow. “Part II – Evaluation” summarizes methods and experimental designs for evaluating recommendations in software engineering. “Part III – Applications” describes needs, issues, and solution concepts involved in entire recommendation systems for specific software engineering tasks, focusing on the engineering insights required to make effective recommendations. The book is complemented by the webpage rsse.org/book, which includes free supplemental materials for readers of this book and anyone interested in recommendation systems in software engineering, including lecture slides, data sets, source code, and an overview of people, groups, papers, and tools with regard to recommendation systems in software engineering. The book is particularly well-suited for graduate students and researchers building new recommendation systems for software engineering applications or in other high-tech fields. It may also serve as the basis for graduate courses on recommendation systems, applied data mining, or software engineering. Software engineering practitioners developing recommendation systems or similar applications with predictive functionality will also benefit from the broad spectrum of topics covered.
Conference Paper
Acquiring a representative labelled dataset is a hurdle that has to be overcome to learn a supervised detection model. Labelling a dataset is particularly expensive in computer security as expert knowledge is required to perform the annotations. In this paper, we introduce ILAB, a novel interactive labelling strategy that helps experts label large datasets for intrusion detection with a reduced workload. First, we compare ILAB with two state-of-the-art labelling strategies on public labelled datasets and demonstrate it is both an effective and a scalable solution. Second, we show ILAB is workable with a real-world annotation project carried out on a large unlabelled NetFlow dataset originating from a production environment. We provide an open source implementation (https://github.com/ANSSI-FR/SecuML/) to allow security experts to label their own datasets and researchers to compare labelling strategies.
Conference Paper
This paper presents a graphical interface to identify hostile behavior in network logs. The problem of identifying and labeling hostile behavior is well known in the network security community. There is a lack of labeled datasets, which make it difficult to deploy automated methods or to test the performance of manual ones. We describe the process of searching and identifying hostile behavior with a graphical tool derived from an open source Intrusion Prevention System, which graphically encodes features of network connections from a log-file. A design study with two network security experts illustrates the workflow of searching for patterns descriptive of unwanted behavior and labeling occurrences therewith.
Conference Paper
The Visualization for Cyber Security research community (VizSec) addresses longstanding challenges in cyber security by adapting and evaluating information visualization techniques with application to the cyber security domain. This research effort has created many tools and techniques that could be applied to improve cyber security, yet the community has not yet established unified standards for evaluating these approaches to predict their operational validity. In this paper, we survey and categorize the evaluation metrics, components, and techniques that have been utilized in the past decade of VizSec research literature. We also discuss existing methodological gaps in evaluating visualization in cyber security, and suggest potential avenues for future research in order to help establish an agenda for advancing the state-of-the-art in evaluating cyber security visualizations.
Article
With exponential growth in the number of computer applications and the sizes of networks, the potential damage that can be caused by attacks launched over the Internet keeps increasing dramatically. A number of network intrusion detection methods have been developed, with respective strengths and weaknesses. The majority of network intrusion detection research and development is still based on simulated datasets due to the non-availability of real datasets. A simulated dataset cannot represent a real network intrusion scenario. It is important to generate real and timely datasets to ensure accurate and consistent evaluation of detection methods. In this paper, we propose a systematic approach to generate unbiased, full-feature, real-life network intrusion datasets to compensate for the crucial shortcomings of existing datasets. We establish the importance of an intrusion dataset in the development and validation process of detection mechanisms, identify a set of requirements for effective dataset generation, and discuss several attack scenarios and their incorporation in generating datasets. We also establish the effectiveness of the generated dataset in the context of several existing datasets.
Article
Noise is common in any real-world data set and may adversely affect classifiers built under the effect of such disturbance. Some of these classifiers are widely recognized for their good performance when dealing with imperfect data. However, the noise robustness of classifiers is an important issue in noisy environments and it must be carefully studied. Performance and robustness are two independent concepts that are usually considered separately, but the conclusions reached with one of these metrics do not necessarily imply the same conclusions with the other. Therefore, involving both concepts seems to be crucial in order to determine the expected behavior of classifiers against noise. Existing measures fail to properly integrate these two concepts, and they are also not well suited to compare different techniques over the same data. This paper proposes a new measure to establish the expected behavior of a classifier with noisy data, trying to minimize the problems of considering performance and robustness individually: the Equalized Loss of Accuracy (ELA). The advantages of ELA over other robustness metrics are studied and all of them are also compared. Both the analysis of the distinct measures and the empirical results show that ELA is able to overcome the aforementioned problems that the rest of the robustness metrics may produce, having a better behavior when comparing different classifiers over the same data set.
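As a hedged illustration of how such a measure combines performance and robustness, the form of ELA commonly reported in the literature relates the accuracy obtained under label noise to the accuracy on clean data; the exact definition should be checked against the paper itself.

```python
# Hedged sketch of the ELA idea: relate the accuracy under x% label noise to
# the accuracy on clean data, so that performance and robustness are judged
# together. The formula below is the commonly reported form (accuracies in
# [0, 100]); verify against the original paper before relying on it.
def equalized_loss_of_accuracy(acc_noise_pct, acc_clean_pct):
    """ELA at a given noise level; lower values indicate better expected behavior."""
    return (100.0 - acc_noise_pct) / acc_clean_pct

# Example: two classifiers with the same noisy accuracy but different clean accuracy.
print(equalized_loss_of_accuracy(80.0, 95.0))   # ~0.211
print(equalized_loss_of_accuracy(80.0, 85.0))   # ~0.235
```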
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ∗∗∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
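A short sketch of these ideas using scikit-learn's implementation: bootstrap sampling plus a random subset of features at each split, with the out-of-bag samples providing the internal error estimate and variable importances mentioned in the abstract.

```python
# Each tree is grown on a bootstrap sample with a random subset of features at
# every split; the out-of-bag (OOB) samples give an internal error estimate and
# feature importances without a separate test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=25, n_informative=8,
                           random_state=7)
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                                bootstrap=True, oob_score=True, random_state=7)
forest.fit(X, y)

print("OOB accuracy estimate:", round(forest.oob_score_, 3))
print("top features by importance:", forest.feature_importances_.argsort()[-3:][::-1])
```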
Article
Automatic network intrusion detection has been an important research topic for the last 20 years. In that time, approaches based on signatures describing intrusive behavior have become the de-facto industry standard. Alternatively, other novel techniques have been used for improving automation of the intrusion detection process. In this regard, statistical methods, machine learning and data mining techniques have been proposed arguing higher automation capabilities than signature-based approaches. However, the majority of these novel techniques have never been deployed on real-life scenarios. The fact is that signature-based still is the most widely used strategy for automatic intrusion detection. In the present article we survey the most relevant works in the field of automatic network intrusion detection. In contrast to previous surveys, our analysis considers several features required for truly deploying each one of the reviewed approaches. This wider perspective can help us to identify the possible causes behind the lack of acceptance of novel techniques by network security experts.
Article
Random forest (RF) has been proposed on the basis of classification and regression trees (CART) with an "ensemble learning" strategy by Breiman in 2001. In this paper, RF is introduced and investigated for electronic tongue (E-tongue) data processing. The experiments were designed for type and brand recognition of orange beverage and Chinese vinegar by an E-tongue with seven potentiometric sensors and an Ag/AgCl reference electrode. Principal component analysis (PCA) was used to visualize the distribution of the total samples of each data set. Back propagation neural network (BPNN) and support vector machine (SVM), as comparative methods, were also employed to deal with the four data sets. Five-fold cross-validation (CV) with twenty replications was applied during modeling, and an external testing set was employed to validate the prediction performance of the models. The average correct rates (CR) on the CV sets of the four data sets achieved by BPNN, SVM and RF were 86.68%, 66.45% and 99.07%, respectively. RF has been proved to outperform BPNN and SVM, and has some advantages in such cases, because it can deal with classification problems of unbalanced, multiclass and small sample data without data preprocessing procedures. These results suggest that RF may be a promising pattern recognition method for E-tongues.
Conference Paper
In network intrusion detection research, one popular strategy for finding attacks is monitoring a network's activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational "real world" settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.
Conference Paper
A useful starting point for designing advanced graphical user interfaces is the visual information seeking Mantra: overview first, zoom and filter, then details on demand. But this is only a starting point in trying to understand the rich and varied set of information visualizations that have been proposed in recent years. The paper offers a task by data type taxonomy with seven data types (one, two, three dimensional data, temporal and multi dimensional data, and tree and network data) and seven tasks (overview, zoom, filter, details-on-demand, relate, history, and extracts)
Article
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms are iterative and can be divided into two types based on whether the parameters are updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms that includes both a sequential- and a parallel-update algorithm as special cases, thus showing how the sequential and parallel approaches can themselves be unified. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with the iterative scaling algorithm. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
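The parallel the abstract draws can be summarized, in its most familiar (non-Bregman) form, by the two losses the algorithms minimize for labels y_i in {-1, +1} and a real-valued score f(x_i); the Bregman-distance derivation itself is not reproduced here.

```latex
% AdaBoost minimizes the exponential loss, while logistic regression minimizes
% the logistic loss; the paper casts both as Bregman-distance optimizations.
\[
  L_{\mathrm{exp}}(f) \;=\; \sum_{i} \exp\!\bigl(-y_i f(x_i)\bigr),
  \qquad
  L_{\mathrm{log}}(f) \;=\; \sum_{i} \ln\!\bigl(1 + \exp(-y_i f(x_i))\bigr).
\]
```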
Stratosphere Research Laboratory
  • G. Sebastian
Providing SCADA network data sets for intrusion detection research
  • Lemay
  • Kuncheva