Article

Abstract

In the field of network security, labeling a network traffic dataset is especially expensive because expert knowledge is required to perform the annotations. Visual analytics applications such as RiskID considerably reduce the effort of labeling network traffic. However, since each label assignment still requires an expert to weigh several factors, the annotation process remains a difficult task. This article introduces a novel active learning strategy that builds a random forest model from the connections the user has already labeled. The resulting model provides the user with an estimate of the class probability of each remaining unlabeled connection, assisting in the traffic annotation task. The article describes the active learning strategy, the interfaces with the RiskID system, the algorithms used to predict botnet behavior, and a proposed evaluation framework. The evaluation framework includes studies that assess not only the prediction performance of the active learning strategy but also its learning rate and resilience against noise, as well as the improvements over other well-known labeling strategies. The framework represents a complete methodology for evaluating the performance of any active learning solution. The evaluation results show that the proposed approach is a significant improvement over previous labeling strategies.
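The abstract's core idea — a random forest fitted to the analyst's labels that scores the remaining unlabeled connections so the least certain ones can be annotated next — can be sketched as a generic uncertainty-sampling loop. This is an illustrative reconstruction, not the authors' exact algorithm; the data, seed labels, and query budget below are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy "connection" features: two Gaussian blobs standing in for
# normal (class 0) and botnet (class 1) traffic.
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(2, 1, (200, 4))])
y = np.array([0] * 200 + [1] * 200)

# Seed labels the analyst has already provided (both classes represented).
labeled = [0, 50, 100, 150, 199, 200, 250, 300, 350, 399]
unlabeled = [i for i in range(len(X)) if i not in labeled]

for _ in range(20):  # query budget
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])[:, 1]
    # Least-confidence query: the connection the forest is least sure about.
    query = unlabeled[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)      # here the ground truth stands in for the analyst
    unlabeled.remove(query)

final_acc = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    X[labeled], y[labeled]).score(X, y)
```

In an interactive tool such as RiskID, the queried connection would be shown to the analyst for labeling rather than read from ground truth, and the remaining probabilities would be displayed as annotation hints.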


... In network traffic classification, an active learning method can use new expert knowledge to overcome the problem of overlearning majority-class samples and underlearning minority-class samples. In recent years, several active learning methods have been used for network traffic classification (Deka et al., 2019; Dong, 2021; Liu and Sun, 2014; Shahraki et al., 2021; Trittenbach et al., 2021; Wassermann et al., 2020; Yi-peng, 2013); however, most of these methods focus on concept drift problems (Wassermann et al., 2020) or imbalance problems (Dong, 2021; Torres et al., 2019; Trittenbach et al., 2021) and devote little attention to their combination. ...
... In recent years, an increasing number of active learning methods have been applied to classification of network traffic data streams (Deka et al., 2019;Dong, 2021;Liu and Sun, 2014;Shahraki et al., 2021;Torres et al., 2019;Trittenbach et al., 2021;Wassermann et al., 2020;Yi-peng, 2013). Yi-peng (2013) proposed an active learning network traffic classification method based on a support vector machine (SVM). ...
... However, ADAM and RAL were designed and evaluated separately. Torres et al. (2019) proposed an active learning approach that used random forests to interactively assist the user in labelling botnet network traffic datasets. Khanchi et al. (2018) described an incremental active learning framework for streaming botnet traffic analysis. ...
Article
Full-text available
The complex problems of multiclass imbalance, virtual or real concept drift, concept evolution, high-speed traffic streams and limited label cost budgets pose severe challenges in network traffic classification tasks. In this paper, we propose a multiclass imbalanced and concept drift network traffic classification framework based on online active learning (MicFoal), which includes a configurable supervised learner for the initialization of a network traffic classification model, an active learning method with a hybrid label request strategy, a label sliding window group, a sample training weight formula and an adaptive adjustment mechanism for the label cost budget based on a periodic performance evaluation. In addition, a novel uncertain label request strategy based on a variable least confidence threshold vector is designed to address the problems of a variable multiclass imbalance ratio or even the number of classes changing over time. Experiments performed based on eight well-known real-world network traffic datasets demonstrate that MicFoal is more effective and efficient than several state-of-the-art learning algorithms.
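MicFoal's uncertain label request strategy, as described above, keeps a separate least-confidence threshold per class so that minority classes can be queried more aggressively. The following is a toy sketch of that idea only; the threshold values, posteriors, and query rule are invented for illustration and are not taken from the paper.

```python
import numpy as np

def should_query(proba, thresholds):
    """Request a label when confidence in the predicted class falls below
    that class's own least-confidence threshold."""
    pred = int(np.argmax(proba))
    return bool(proba[pred] < thresholds[pred])

# Three-class example: the minority class (index 2) gets a stricter (higher)
# threshold, so its borderline samples are queried more often.
thresholds = np.array([0.6, 0.6, 0.9])

posteriors = [
    np.array([0.90, 0.05, 0.05]),  # confident majority-class prediction
    np.array([0.40, 0.35, 0.25]),  # uncertain prediction
    np.array([0.10, 0.10, 0.80]),  # minority class, below its threshold
]
queried = [should_query(p, thresholds) for p in posteriors]
```

In the full framework the threshold vector would be adapted over time as the class imbalance ratio, or even the set of classes, drifts.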
... In this section, we review existing work on the application of AL in NTC. 1 https://www.ietf.org/blog/reflections-ietf-97/ Torres et al. [80] proposed a botnet detection technique based on AL. The authors provided a novel AL strategy to label network traffic that contains normal and botnet traffic. ...
... [80] — NTC challenge: botnet detection; method: active learning + random forest; technical contribution: proposes a new AL strategy to facilitate the network traffic labeling process, especially for security purposes. ...
Article
Full-text available
Network Traffic Classification (NTC) has become an important feature in various network management operations, e.g., Quality of Service (QoS) provisioning and security services. Machine Learning (ML) algorithms as a popular approach for NTC can promise reasonable accuracy in classification and deal with encrypted traffic. However, ML-based NTC techniques suffer from the shortage of labeled traffic data which is the case in many real-world applications. This study investigates the applicability of an active form of ML, called Active Learning (AL), in NTC. AL reduces the need for a large number of labeled examples by actively choosing the instances that should be labeled. The study first provides an overview of NTC and its fundamental challenges along with surveying the literature on ML-based NTC methods. Then, it introduces the concepts of AL, discusses it in the context of NTC, and reviews the literature in this field. Further, challenges and open issues in AL-based classification of network traffic are discussed. Moreover, as a technical survey, some experiments are conducted to show the broad applicability of AL in NTC. The simulation results show that AL can achieve high accuracy with a small amount of data.
... Several AL methods have been proposed in intrusion detection research. Many of them focus on querying uncertain data, i.e., requesting the label of observations that the model is unsure how to classify [10,11,12]. Adding these observations with their correct label is expected to enhance classification performance more quickly than randomly selecting observations. ...
... Guerra Torres et al. [11] make use of Random Forests for prediction and query uncertain observations. Görnitz et al. [12] use a Support Vector Domain Description (SVDD) for anomaly detection with uncertainty sampling as the AL component. All these studies show that a method with AL obtains better results than one without it, or that the proposed query strategy performs better than randomly presenting observations to the oracle. ...
Preprint
Full-text available
Over the past decade, the advent of cybercrime has accelerated the research on cybersecurity. However, the deployment of intrusion detection methods falls short. One of the reasons for this is the lack of realistic evaluation datasets, which makes it a challenge to develop techniques and compare them. This is caused by the large amount of effort it takes for a cyber analyst to classify network connections. This has raised the need for methods (i) that can learn from small sets of labeled data, (ii) that can make predictions on large sets of unlabeled data, and (iii) that request the label of only specially selected unlabeled data instances. Hence, Active Learning (AL) methods are of interest. These approaches choose specific unlabeled instances by a query function that are expected to improve overall classification performance. The resulting query observations are labeled by a human expert and added to the labeled set. In this paper, we propose a new hybrid AL method called Jasmine. Firstly, it determines how suitable each observation is for querying, i.e., how likely it is to enhance classification. These properties are the uncertainty score and anomaly score. Secondly, Jasmine introduces dynamic updating. This allows the model to adjust the balance between querying uncertain, anomalous and randomly selected observations. To this end, Jasmine is able to learn the best query strategy during the labeling process. This is in contrast to the other AL methods in cybersecurity, which all have static, predetermined query functions. We show that dynamic updating, and therefore Jasmine, is able to consistently obtain good and more robust results than querying only uncertainties, only anomalies or a fixed combination of the two.
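The hybrid query function described for Jasmine — a mixture of uncertainty, anomaly, and random components whose weights are updated dynamically — can be sketched as follows. The scores, weights, and update step here are illustrative stand-ins, not Jasmine's actual formulas.

```python
import numpy as np

rng = np.random.default_rng(1)

def query_scores(p_malicious, anomaly, w):
    """w = (w_uncertain, w_anomaly, w_random); higher score => query first."""
    uncertainty = 1.0 - np.abs(2.0 * p_malicious - 1.0)  # peaks at p = 0.5
    randomness = rng.random(len(p_malicious))
    return w[0] * uncertainty + w[1] * anomaly + w[2] * randomness

# Four candidate observations: classifier posteriors and anomaly scores.
p = np.array([0.95, 0.52, 0.10, 0.50])
anom = np.array([0.1, 0.2, 0.9, 0.3])

w = np.array([0.5, 0.3, 0.2])               # initial mixture weights
best = int(np.argmax(query_scores(p, anom, w)))

# Dynamic updating (toy version): if anomaly-driven queries improved the
# classifier last round, shift weight toward the anomaly component.
w = np.array([0.3, 0.5, 0.2])
best_after = int(np.argmax(query_scores(p, anom, w)))
```

The point of the dynamic step is that which component wins the argmax changes with the weights, so the strategy itself is learned during labeling rather than fixed in advance.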
... Torres et al. [80] proposed a botnet detection technique based on AL. The authors provided a novel AL strategy to label network traffic that contains normal and botnet traffic. ...
... To this end, two examples of using the active form of learning for NTC are studied. [80] — NTC challenge: botnet detection; method: active learning + random forest; technical contribution: proposes a new AL strategy to facilitate the network traffic labeling process, especially for security purposes. ...
Preprint
Full-text available
Network Traffic Classification (NTC) has become an important component in a wide variety of network management operations, e.g., Quality of Service (QoS) provisioning and security purposes. Machine Learning (ML) algorithms as a common approach for NTC methods can achieve reasonable accuracy and handle encrypted traffic. However, ML-based NTC techniques suffer from the shortage of labeled traffic data which is the case in many real-world applications. This study investigates the applicability of an active form of ML, called Active Learning (AL), which reduces the need for a large number of labeled examples by actively choosing the instances that should be labeled. The study first provides an overview of NTC and its fundamental challenges along with surveying the literature in the field of using ML techniques in NTC. Then, it introduces the concepts of AL, discusses it in the context of NTC, and reviews the literature in this field. Further, challenges and open issues in the use of AL for NTC are discussed. Additionally, as a technical survey, some experiments are conducted to show the broad applicability of AL in NTC. The simulation results show that AL can achieve high accuracy with a small amount of data.
... al [47] use an active anomaly detector based on Isolation Forests, incorporating explanations to guide the expert's investigation. Several works studied the application of active learning in adversarial scenarios [5,15,18,31,43,44,54,56,59]. Gornitz et. ...
Preprint
Full-text available
In recent years, enterprises have been targeted by advanced adversaries who leverage creative ways to infiltrate their systems and move laterally to gain access to critical data. One increasingly common evasive method is to hide the malicious activity behind a benign program by using tools that are already installed on user computers. These programs are usually part of the operating system distribution or another user-installed binary, therefore this type of attack is called "Living-Off-The-Land". Detecting these attacks is challenging, as adversaries may not create malicious files on the victim computers and anti-virus scans fail to detect them. We propose the design of an Active Learning framework called LOLAL for detecting Living-Off-the-Land attacks that iteratively selects a set of uncertain and anomalous samples for labeling by a human analyst. LOLAL is specifically designed to work well when a limited number of labeled samples are available for training machine learning models to detect attacks. We investigate methods to represent command-line text using word-embedding techniques, and design ensemble boosting classifiers to distinguish malicious and benign samples based on the embedding representation. We leverage a large, anonymized dataset collected by an endpoint security product and demonstrate that our ensemble classifiers achieve an average F1 score of 0.96 at classifying different attack classes. We show that our active learning method consistently improves the classifier performance, as more training data is labeled, and converges in less than 30 iterations when starting with a small number of labeled instances.
... Most of these measures are derived from the confusion matrix, which consists of two columns displaying predicted values and two rows displaying the actual values. In NIDS, predicted or actual values are positive if the NW traffic is an attack, and negative if it is normal, as shown in Figure 5. Recall Value (Re-V) (detection rate) measures NIDS sensitivity [12], [36]. ...
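The confusion-matrix metrics mentioned in this excerpt can be written out directly. The TP/FP/FN/TN counts below are illustrative.

```python
def nids_metrics(tp, fp, fn, tn):
    """Standard NIDS evaluation metrics from the four confusion-matrix cells."""
    recall = tp / (tp + fn)                     # detection rate / sensitivity
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}

# Example: 100 attacks in 1000 flows; the detector catches 90 of them
# but also flags 20 normal flows.
m = nids_metrics(tp=90, fp=20, fn=10, tn=880)
```

Note how accuracy (0.97 here) can look flattering on imbalanced traffic while recall and F1 tell a more cautious story, which is why the detection rate is singled out in the excerpt.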
Article
Full-text available
Protecting the confidentiality, integrity, and availability of cyberspace and network (NW) assets has become an increasing concern. The rapid increase in the size of the Internet and the presence of new computing systems (like the Cloud) are creating great incentives for intruders. Therefore, security engineers have to develop new technologies to match the growing threats to NWs. New and advanced technologies have emerged to create more efficient intrusion detection systems using machine learning (ML) and dimensionality reduction techniques, helping security engineers build more effective NW Intrusion Detection Systems (NIDS). This systematic review provides a comprehensive survey of the most recent NIDS using supervised ML classification and dimensionality reduction techniques; it shows how the ML classifiers, dimensionality reduction techniques, and evaluation metrics used have improved NIDS construction. The key point of this study is to provide up-to-date knowledge for newly interested researchers. [JJCIT 2021; 7(4): 373-390]
... Semi-supervised approaches provide a trade-off that makes efficient use of an expert's effort, supported by a visualization platform such as RiskID [245]. Documentation. ...
Preprint
Full-text available
The increasing digitization and interconnection of legacy Industrial Control Systems (ICSs) open new vulnerability surfaces, exposing such systems to malicious attackers. Furthermore, since ICSs are often employed in critical infrastructures (e.g., nuclear plants) and manufacturing companies (e.g., chemical industries), attacks can lead to devastating physical damage. In dealing with this security requirement, the research community focuses on developing new security mechanisms such as Intrusion Detection Systems (IDSs), facilitated by leveraging modern machine learning techniques. However, these algorithms require a testing platform and a considerable amount of data to be trained and tested accurately. To satisfy this prerequisite, Academia, Industry, and Government are increasingly proposing testbeds (i.e., scaled-down versions of ICSs or simulations) to test the performance of IDSs. Furthermore, to enable researchers to cross-validate security systems (e.g., security-by-design concepts or anomaly detectors), several datasets have been collected from testbeds and shared with the community. In this paper, we provide a deep and comprehensive overview of ICSs, presenting the architecture design, the employed devices, and the security protocols implemented. We then collect, compare, and describe the testbeds and datasets in the literature, highlighting key challenges and design guidelines to keep in mind during the design phases. Furthermore, we enrich our work by reporting the best-performing IDS algorithms tested on every dataset to create a baseline in the state of the art for this field. Finally, driven by the knowledge accumulated during this survey's development, we report advice and good practices on the development, choice, and utilization of testbeds, datasets, and IDSs.
Article
Full-text available
Automatic data annotation eliminates most of the challenges we faced due to the manual methods of annotating sensor data. It significantly improves users’ experience during sensing activities since their active involvement in the labeling process is reduced. An unsupervised learning technique such as clustering can be used to automatically annotate sensor data. However, the lingering issue with clustering is the validation of generated clusters. In this paper, we adopted the k-means clustering algorithm for annotating unlabeled sensor data for the purpose of detecting sensitive location information of mobile crowd sensing users. Furthermore, we proposed a cluster validation index for the k-means algorithm, which is based on Multiple Pair-Frequency. Thereafter, we trained three classifiers (Support Vector Machine, K-Nearest Neighbor, and Naïve Bayes) using cluster labels generated from the k-means clustering algorithm. The accuracy, precision, and recall of these classifiers were evaluated during the classification of “non-sensitive” and “sensitive” data from motion and location sensors. Very high accuracy scores were recorded from Support Vector Machine and K-Nearest Neighbor classifiers while a fairly high accuracy score was recorded from the Naïve Bayes classifier. With the hybridized machine learning (unsupervised and supervised) technique presented in this paper, unlabeled sensor data was automatically annotated and then classified.
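The hybridized pipeline described above — clustering unlabeled sensor data, using the cluster assignments as labels, then training supervised classifiers on them — can be sketched in a few lines. The data, feature layout, and choice of K-Nearest Neighbors (one of the three classifiers the abstract mentions) are illustrative; the paper's cluster-validation index is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two synthetic "sensor" modes standing in for non-sensitive vs. sensitive
# readings (motion/location features abstracted to three columns).
X = np.vstack([rng.normal(0, 0.5, (100, 3)), rng.normal(3, 0.5, (100, 3))])

# Step 1: annotate automatically, using k-means cluster ids as surrogate labels.
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: train a supervised classifier on the cluster-generated labels.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, pseudo_labels)
agreement = clf.score(X, pseudo_labels)  # how well it reproduces the clusters
```

In practice the generated clusters would first be validated (the paper proposes a Multiple Pair-Frequency index for this) before being trusted as training labels.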
Article
Full-text available
Anomaly based approaches in network intrusion detection suffer from evaluation, comparison and deployment issues, which originate from the scarcity of adequate publicly available network trace datasets. Moreover, publicly available datasets are either outdated or generated in a controlled environment. Due to the ubiquity of cloud computing environments in commercial and government internet services, there is a need to assess the impact of network attacks on cloud data centers. To the best of our knowledge, there is no publicly available dataset which captures the normal and anomalous network traces in the interactions between cloud users and cloud data centers. In this paper, we present an experimental platform designed to represent a practical interaction between cloud users and cloud services, and we collect the network traces resulting from this interaction to conduct anomaly detection. We use the Amazon Web Services (AWS) platform for conducting our experiments.
Conference Paper
Full-text available
Correctly labelled datasets are commonly required. Three particular scenarios are highlighted, which showcase this need. When using supervised Intrusion Detection Systems (IDSs), these systems need labelled datasets to be trained. Also, the real nature of the analysed datasets must be known when evaluating the efficiency of the IDSs when detecting intrusions. Another scenario is the use of feature selection that works only if the processed datasets are labelled. In normal conditions, collecting labelled datasets from real networks is impossible. Currently, datasets are mainly labelled by implementing off-line forensic analysis, which is impractical because it does not allow real-time implementation. We have developed a novel approach to automatically generate labelled network traffic datasets using an unsupervised anomaly based IDS. The resulting labelled datasets are subsets of the original unlabelled datasets. The labelled dataset is then processed using a Genetic Algorithm (GA) based approach, which performs the task of feature selection. The GA has been implemented to automatically provide the set of metrics that generate the most appropriate intrusion detection results.
Thesis
Full-text available
Botnets are the technological backbone supporting a myriad of attacks, including identity theft, organizational spying, DoS, SPAM, government-sponsored attacks and the spying of political dissidents, among others. The research community works hard on creating algorithms to detect botnet network traffic. These algorithms have been partially successful, but are difficult to reproduce and verify, being often commercialized. However, advances in machine learning algorithms and access to better botnet datasets are starting to show promising results. The shift of detection techniques to behavioral-based models has proved to be a better approach to the analysis of botnet patterns. However, the current knowledge of botnet actions and patterns does not seem to be deep enough to create adequate traffic models that could be used to detect botnets in real networks. This thesis proposes three new botnet detection methods and a new model of botnet behavior that are based on a deep understanding of botnet behaviors in the network. First, the SimDetect method, which analyzes the structural similarities of clustered botnet traffic. Second, the BClus method, which clusters traffic according to its connection patterns and uses decision rules to detect unknown botnets in the network. Third, the CCDetector method, which uses a novel state-based behavioral model of known Command and Control channels to train a Markov Chain and to detect similar traffic in unknown real networks. The BClus and CCDetector methods were compared with third-party detection methods, showing their use in real environments. The core of the CCDetector method is our state-based behavioral model of botnet actions. This model is capable of representing changes in behavior over time. To support the research we use a huge dataset of botnet traffic that was captured in our Malware Capture Facility Project. The dataset is varied, large, public, real and has Background, Normal and Botnet labels.
The tools, dataset and algorithms were released as free software. Our algorithms give a new high-level interface to identify, visualize and block botnet behaviors in the networks.
Conference Paper
Full-text available
Anomaly detection for network intrusion detection is usually considered an unsupervised task. Prominent techniques, such as one-class support vector machines, learn a hypersphere enclosing network data, mapped to a vector space, such that points outside of the ball are considered anomalous. However, this setup ignores relevant information such as expert and background knowledge. In this paper, we rephrase anomaly detection as an active learning task. We propose an effective active learning strategy to query low-confidence observations and to expand the data basis with minimal labeling effort. Our empirical evaluation on network intrusion detection shows that our approach consistently outperforms existing methods in relevant scenarios.
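The strategy above — learn an unsupervised boundary around the data, then actively query the observations the detector is least confident about — can be sketched as follows. The paper uses an SVDD; scikit-learn's closely related one-class SVM is used here as a stand-in, and the data, `nu` value, and query budget are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (300, 2))   # mostly "normal" traffic in feature space
X[:5] += 6                       # a few clear outliers

# Unsupervised boundary enclosing the bulk of the data.
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X)
scores = ocsvm.decision_function(X)   # ~0 for points near the learned boundary

# Active-learning step: query the k observations with the least confident
# verdicts, i.e., those closest to the boundary.
k = 10
query_idx = np.argsort(np.abs(scores))[:k]
```

Labels obtained for these boundary points are exactly the ones expected to move the decision surface the most, which is the intuition behind querying low-confidence observations.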
Conference Paper
Full-text available
Flow-based intrusion detection has recently become a promising security mechanism in high speed networks (1-10 Gbps). Despite the richness of contributions in this field, benchmarking of flow-based IDS is still an open issue. In this paper, we propose the first publicly available, labeled data set for flow-based intrusion detection. The data set aims to be realistic, i.e., representative of real traffic and complete from a labeling perspective. Our goal is to provide such an enriched data set for tuning, training and evaluating ID systems. Our setup is based on a honeypot running widely deployed services and directly connected to the Internet, ensuring attack exposure. The final data set consists of 14.2M flows, and more than 98% of them have been labeled.
Article
Full-text available
Despite the flurry of anomaly-detection papers in recent years, effective ways to validate and compare proposed solutions have remained elusive. We argue that evaluating anomaly detectors on manually labeled traces is both important and unavoidable. In particular, it is important to evaluate detectors on traces from operational networks because it is in this setting that the detectors must ultimately succeed. In addition, manual labeling of such traces is unavoidable because new anomalies will be identified and characterized from manual inspection long before there are realistic models for them. It is well known, however, that manual labeling is slow and error-prone. In order to mitigate these challenges, we present WebClass, a web-based infrastructure that adds rigor to the manual labeling process. WebClass allows researchers to share, inspect, and label traffic time-series through a common graphical user interface. We are releasing WebClass to the research community in the hope that it will foster greater collaboration in creating labeled traces and that the traces will be of higher quality because the entire community has access to all the information that led to a given label.
Article
Full-text available
The aim of this study is to compare two supervised classification methods on a crucial meteorological problem. The data consist of satellite measurements of cloud systems which are to be classified as either convective or non-convective systems. Convective cloud systems correspond to lightning, and detecting such systems is of major importance for thunderstorm monitoring and warning. Because the problem is highly unbalanced, we consider specific performance criteria and different strategies. This case study can be used in an advanced course on data mining to illustrate the use of logistic regression and random forest on a real data set with unbalanced classes.
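The comparison described above can be sketched generically: logistic regression versus a random forest on an imbalanced two-class problem, judged on recall for the rare class rather than raw accuracy. The synthetic data, class-weighting choice, and 5% imbalance ratio below are illustrative, not the meteorological data of the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# 950 "non-convective" (0) vs. 50 "convective" (1) samples: ~5% positives.
X = np.vstack([rng.normal(0, 1, (950, 5)), rng.normal(1.5, 1, (50, 5))])
y = np.array([0] * 950 + [1] * 50)

results = {}
for name, clf in [
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                  random_state=0)),
]:
    clf.fit(X, y)
    # Recall on the rare class is the criterion of interest, not accuracy:
    # predicting "non-convective" everywhere would already score 95% accuracy.
    results[name] = recall_score(y, clf.predict(X))
```

A fuller comparison would of course evaluate on held-out data and consider precision-recall trade-offs, as the unbalanced setting demands.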
Book
With the growth of public and private data stores and the emergence of off-the-shelf data-mining technology, recommendation systems have emerged that specifically address the unique challenges of navigating and interpreting software engineering data. This book collects, structures, and formalizes knowledge on recommendation systems in software engineering. It adopts a pragmatic approach with an explicit focus on system design, implementation, and evaluation. The book is divided into three parts: “Part I – Techniques” introduces basics for building recommenders in software engineering, including techniques for collecting and processing software engineering data, but also for presenting recommendations to users as part of their workflow. “Part II – Evaluation” summarizes methods and experimental designs for evaluating recommendations in software engineering. “Part III – Applications” describes needs, issues, and solution concepts involved in entire recommendation systems for specific software engineering tasks, focusing on the engineering insights required to make effective recommendations. The book is complemented by the webpage rsse.org/book, which includes free supplemental materials for readers of this book and anyone interested in recommendation systems in software engineering, including lecture slides, data sets, source code, and an overview of people, groups, papers, and tools with regard to recommendation systems in software engineering. The book is particularly well-suited for graduate students and researchers building new recommendation systems for software engineering applications or in other high-tech fields. It may also serve as the basis for graduate courses on recommendation systems, applied data mining, or software engineering. Software engineering practitioners developing recommendation systems or similar applications with predictive functionality will also benefit from the broad spectrum of topics covered.
Conference Paper
Acquiring a representative labelled dataset is a hurdle that has to be overcome to learn a supervised detection model. Labelling a dataset is particularly expensive in computer security as expert knowledge is required to perform the annotations. In this paper, we introduce ILAB, a novel interactive labelling strategy that helps experts label large datasets for intrusion detection with a reduced workload. First, we compare ILAB with two state-of-the-art labelling strategies on public labelled datasets and demonstrate it is both an effective and a scalable solution. Second, we show ILAB is workable with a real-world annotation project carried out on a large unlabelled NetFlow dataset originating from a production environment. We provide an open source implementation (https://github.com/ANSSI-FR/SecuML/) to allow security experts to label their own datasets and researchers to compare labelling strategies.
Conference Paper
This paper presents a graphical interface to identify hostile behavior in network logs. The problem of identifying and labeling hostile behavior is well known in the network security community. There is a lack of labeled datasets, which make it difficult to deploy automated methods or to test the performance of manual ones. We describe the process of searching and identifying hostile behavior with a graphical tool derived from an open source Intrusion Prevention System, which graphically encodes features of network connections from a log-file. A design study with two network security experts illustrates the workflow of searching for patterns descriptive of unwanted behavior and labeling occurrences therewith.
Conference Paper
The Visualization for Cyber Security research community (VizSec) addresses longstanding challenges in cyber security by adapting and evaluating information visualization techniques with application to the cyber security domain. This research effort has created many tools and techniques that could be applied to improve cyber security, yet the community has not yet established unified standards for evaluating these approaches to predict their operational validity. In this paper, we survey and categorize the evaluation metrics, components, and techniques that have been utilized in the past decade of VizSec research literature. We also discuss existing methodological gaps in evaluating visualization in cyber security, and suggest potential avenues for future research in order to help establish an agenda for advancing the state-of-the-art in evaluating cyber security visualizations.
Article
With exponential growth in the number of computer applications and the sizes of networks, the potential damage that can be caused by attacks launched over the Internet keeps increasing dramatically. A number of network intrusion detection methods have been developed with respective strengths and weaknesses. The majority of network intrusion detection research and development is still based on simulated datasets due to non-availability of real datasets. A simulated dataset cannot represent a real network intrusion scenario. It is important to generate real and timely datasets to ensure accurate and consistent evaluation of detection methods. In this paper, we propose a systematic approach to generate unbiased fullfeature real-life network intrusion datasets to compensate for the crucial shortcomings of existing datasets. We establish the importance of an intrusion dataset in the development and validation process of detection mechanisms, identify a set of requirements for effective dataset generation, and discuss several attack scenarios and their incorporation in generating datasets. We also establish the effectiveness of the generated dataset in the context of several existing datasets.
Article
Noise is common in any real-world data set and may adversely affect classifiers built under the effect of such disturbance. Some of these classifiers are widely recognized for their good performance when dealing with imperfect data. However, the noise robustness of classifiers is an important issue in noisy environments and must be carefully studied. Performance and robustness are two independent concepts that are usually considered separately, but the conclusions reached with one of these metrics do not necessarily imply the same conclusions with the other. Therefore, involving both concepts seems to be crucial in order to determine the expected behavior of a classifier against noise. Existing measures fail to properly integrate these two concepts, and they are also not well suited to compare different techniques over the same data. This paper proposes a new measure to establish the expected behavior of a classifier with noisy data that tries to minimize the problems of considering performance and robustness individually: the Equalized Loss of Accuracy (ELA). The advantages of ELA over other robustness metrics are studied and all of them are also compared. Both the analysis of the distinct measures and the empirical results show that ELA is able to overcome the aforementioned problems that the rest of the robustness metrics may produce, exhibiting better behavior when comparing different classifiers over the same data set.
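The measure described above combines accuracy on noisy data with accuracy on clean data in a single score. A common published form is ELA_x = (100 − A_x) / A_0, where A_x is the accuracy (in %) at noise level x and A_0 the accuracy on clean data; the sketch below uses that form with illustrative accuracy values, so treat the exact formula as an assumption rather than a quotation from the paper.

```python
def ela(acc_noisy, acc_clean):
    """Equalized Loss of Accuracy, ELA_x = (100 - A_x) / A_0.
    Lower values indicate better expected behaviour with noisy data,
    rewarding both high initial accuracy and small degradation."""
    return (100.0 - acc_noisy) / acc_clean

# A classifier that barely degrades under noise...
robust = ela(acc_noisy=88.0, acc_clean=90.0)
# ...versus an initially more accurate but brittle one.
brittle = ela(acc_noisy=70.0, acc_clean=95.0)
```

Ranking by ELA favours the robust classifier here even though the brittle one started with higher clean accuracy, which is exactly the behaviour the measure is designed to capture.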
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
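The two sources of randomness described above (a bootstrap sample per tree, and a random feature subset per split) can be sketched from scratch; one-level trees (stumps) stand in for full CART trees here, and the function names and toy data are ours, purely for illustration.

```python
import random
from collections import Counter

def best_stump(X, y, feat_idx):
    """Pick the (feature, threshold) split over the given feature subset
    that minimizes misclassification, predicting the majority class on
    each side. Falls back to a constant majority predictor when no valid
    split exists (e.g. a single-class bootstrap sample)."""
    best = None
    for j in feat_idx:
        for t in sorted({x[j] for x in X}):
            left = [yi for x, yi in zip(X, y) if x[j] <= t]
            right = [yi for x, yi in zip(X, y) if x[j] > t]
            if not left or not right:
                continue
            lmaj = Counter(left).most_common(1)[0][0]
            rmaj = Counter(right).most_common(1)[0][0]
            errs = sum(v != lmaj for v in left) + sum(v != rmaj for v in right)
            if best is None or errs < best[0]:
                best = (errs, j, t, lmaj, rmaj)
    if best is None:
        maj = Counter(y).most_common(1)[0][0]
        return (feat_idx[0], float("inf"), maj, maj)
    return best[1:]

def fit_forest(X, y, n_trees=25, n_feats=1, seed=0):
    """Train each one-level tree on a bootstrap sample of the data,
    restricted to a random subset of the features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]        # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_feats)   # random feature subset
        forest.append(best_stump([X[i] for i in idx],
                                 [y[i] for i in idx], feats))
    return forest

def predict(forest, x):
    """Classify by majority vote over the trees."""
    votes = [l if x[j] <= t else r for j, t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy data: two well-separated classes, both features informative.
X = [[0, 0], [1, 1], [2, 0], [8, 9], [9, 8], [10, 9]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y, n_trees=25, n_feats=1, seed=0)
```

The majority vote over many weakly correlated trees is what drives the convergence result in the abstract: individual stumps may be noisy, but their errors partially cancel as the forest grows.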
Article
Automatic network intrusion detection has been an important research topic for the last 20 years. In that time, approaches based on signatures describing intrusive behavior have become the de-facto industry standard. Alternatively, other novel techniques have been used for improving automation of the intrusion detection process. In this regard, statistical methods, machine learning and data mining techniques have been proposed, claiming higher automation capabilities than signature-based approaches. However, the majority of these novel techniques have never been deployed in real-life scenarios. The fact is that signature-based detection is still the most widely used strategy for automatic intrusion detection. In the present article we survey the most relevant works in the field of automatic network intrusion detection. In contrast to previous surveys, our analysis considers several features required for truly deploying each of the reviewed approaches. This wider perspective can help us to identify the possible causes behind the lack of acceptance of novel techniques by network security experts.
Article
Random forest (RF) has been proposed on the basis of classification and regression trees (CART) with "ensemble learning" strategy by Breiman in 2001. In this paper, RF is introduced and investigated for electronic tongue (E-tongue) data processing. The experiments were designed for type and brand recognition of orange beverage and Chinese vinegar by an E-tongue with seven potentiometric sensors and an Ag/AgCl reference electrode. Principal component analysis (PCA) was used to visualize the distribution of total samples of each data set. Back propagation neural network (BPNN) and support vector machine (SVM), as comparative methods, were also employed to deal with four data sets. Five-fold cross-validation (CV) with twenty replications was applied during modeling and an external testing set was employed to validate the prediction performance of models. The average correct rates (CR) on CV sets of the four data sets performed by BPNN, SVM and RF were 86.68%, 66.45% and 99.07%, respectively. RF has been proved to outperform BPNN and SVM, and has some advantages in such cases, because it can deal with classification problems of unbalanced, multiclass and small sample data without data preprocessing procedures. These results suggest that RF may be a promising pattern recognition method for E-tongues.
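The evaluation protocol described above (five-fold cross-validation with twenty replications, reporting an average correct rate) can be sketched generically; `evaluate` below is a hypothetical callback that trains a model on the training indices and returns its score on the validation indices, and all names are ours rather than the paper's.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and yield (train, validation) index
    lists for k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

def repeated_cv_score(n, evaluate, k=5, reps=20):
    """Average the evaluation score over `reps` independent k-fold
    partitions, each produced with a different shuffle."""
    scores = [evaluate(train, val)
              for r in range(reps)
              for train, val in kfold_indices(n, k, seed=r)]
    return sum(scores) / len(scores)
```

Repeating the partition twenty times, as in the protocol above, reduces the variance that a single unlucky five-fold split would introduce into the reported correct rate.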
Conference Paper
In network intrusion detection research, one popular strategy for finding attacks is monitoring a network's activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational "real world" settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.
Article
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms are iterative and can be divided into two types based on whether the parameters are updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms that includes both a sequential- and a parallel-update algorithm as special cases, thus showing how the sequential and parallel approaches can themselves be unified. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with the iterative scaling algorithm. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
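The abstract treats AdaBoost as a sequential-update algorithm: each round fits one weak learner, then reweights the examples so the next round concentrates on current mistakes. Below is a minimal illustrative sketch of that loop on one-dimensional data with threshold stumps; it is standard AdaBoost, not the paper's Bregman-distance formulation, and all names and data are ours.

```python
import math

def adaboost(X, y, rounds=5):
    """Sequential-update boosting on 1-D data with threshold stumps;
    labels must be in {-1, +1}. Each round picks the stump minimizing
    weighted error, then exponentially reweights the examples."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in sorted(set(X)):
            for pol in (1, -1):
                preds = [pol if x <= t else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, t, pol, preds)
        err, t, pol, preds = best
        err = min(max(err, 1e-12), 1 - 1e-12)   # numerical guard
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this stump
        ensemble.append((alpha, t, pol))
        # Sequential update: misclassified points gain weight, then normalize.
        w = [wi * math.exp(-alpha * p * yi) for wi, p, yi in zip(w, preds, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def boost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (p if x <= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy separable data: positives below the threshold, negatives above.
X = [1, 2, 3, 4, 5, 6]
y = [1, 1, 1, -1, -1, -1]
ens = adaboost(X, y, rounds=5)
```

The one-at-a-time reweighting step is exactly what distinguishes the sequential family from the parallel-update algorithms discussed in the abstract, where all parameters would move at once.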
Article
A useful starting point for designing advanced graphical user interfaces is the Visual Information-Seeking Mantra: Overview first, zoom and filter, then details-on-demand. But this is only a starting point in trying to understand the rich and varied set of information visualizations that have been proposed in recent years. This paper offers a task by data type taxonomy with seven data types (1-, 2-, 3-dimensional data, temporal and multi-dimensional data, and tree and network data) and seven tasks (overview, zoom, filter, details-on-demand, relate, history, and extract). (From "The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations", Ben Shneiderman, Department of Computer Science, Human-Computer Interaction Laboratory, and Institute for Systems Research, University of Maryland, College Park.)
Stratosphere research laboratory
  • G Sebastian
Sebastian G. Stratosphere Research Laboratory. 2015. https://stratosphereips.org/, [Online; accessed Jun-2018].
Providing SCADA network data sets for intrusion detection research
  • A Lemay
  • J M Fernandez
Lemay A, Fernandez JM. Providing SCADA network data sets for intrusion detection research. In: Proceedings of the USENIX CSET; 2016.
Review on determining number of cluster in K-Means clustering
  • T Kodinariya
  • P Makwana
Kodinariya T, Makwana P. Review on determining number of cluster in K-Means clustering. International Journal of Advance Research in Computer Science and Management Studies 2013;1(6):90-95. URL www.ijarcsms.com
A sequential algorithm for training text classifiers
  • D D Lewis
  • W A Gale
Lewis DD, Gale WA. A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. New York, NY, USA: Springer-Verlag New York, Inc.; 1994. p. 3-12. ISBN 0-387-19889-X. http://dl.acm.org/citation.cfm?id=188490.188495.
The CTU-19 dataset, normal datasets
  • S Project
S. Project. The CTU-19 dataset, normal datasets. https://www.stratosphereips.org/datasets-normal/, [Online; accessed Jun-2018] (October 2013).
Recommendation systems in software engineering
  • I Avazpour
  • T Pitakrat
  • L Grunske
  • J Grundy
Avazpour I, Pitakrat T, Grunske L, Grundy J. Recommendation systems in software engineering; 2014. doi:10.1007/978-3-642-45135-5.