Article

# Datasets are not enough: Challenges in labeling network traffic


## Abstract

In contrast to previous surveys, the present work is not focused on reviewing the datasets used in the network security field. The fact is that many of the available public labeled datasets represent the network behavior just for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they become quickly obsolete. Therefore, this work is focused on the analysis of current labeling methodologies applied to network-based data. In the field of network security, the process of labeling a representative network traffic dataset is particularly challenging and costly, since very specialized knowledge is required to classify network traces. Consequently, most of the current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the essential aspects necessary for a correct differentiation between normal and malicious behavior. Alternatively, a few other methods incorporate non-expert users in the labeling process of real traffic with the help of visual and statistical tools. However, after conducting an in-depth analysis, it seems that all current methods for labeling suffer from fundamental drawbacks regarding the quality, volume, and speed of the resulting dataset. This lack of consistent methods for continuously generating a representative dataset with an accurate and validated methodology must be addressed by the network security research community. Moreover, a consistent labeling methodology is a fundamental condition for fostering the acceptance of novel detection approaches based on statistical and machine learning techniques.

## No full-text available

Article
The goal of this systematic and broad survey is to present and discuss the main challenges posed by the implementation of Artificial Intelligence and Machine Learning, in the form of Artificial Neural Networks, in Cybersecurity, specifically in Intrusion Detection Systems. Based on the results of a state-of-the-art analysis using a number of bibliographic methods, as well as their own implementations, the authors provide a survey of answers to the posed problems, along with effective, experimentally found solutions to those key issues. The issues include hyperparameter tuning, dataset balancing, increasing the effectiveness of an ANN, securing the networks from adversarial attacks, and a range of non-technical challenges of applying ANNs for IDS, such as societal, ethical and legal dilemmas, and the question of explainability. Thus, it is a systematic review and a summary of the body of knowledge amassed around implementations of Artificial Neural Networks in Network Intrusion Detection, guided by an actual, real-world implementation.
Article
Full-text available
In this paper, we addressed the problem of dataset scarcity for the task of network intrusion detection. Our main contribution was to develop a framework that provides a complete process for generating network traffic datasets based on the aggregation of real network traces. In addition, we proposed a set of tools for attribute extraction and labeling of traffic sessions. A new dataset with botnet network traffic was generated by the framework to assess our proposed method with machine learning algorithms suitable for unbalanced data. The performance of the classifiers was evaluated in terms of macro-averages of F1-score (0.97) and the Matthews Correlation Coefficient (0.94), showing a good overall performance average.
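The two metrics reported above suit unbalanced botnet data because macro-averaging weighs the minority attack class as much as the majority benign class, and MCC uses all four confusion-matrix cells. A minimal sketch of both, using hypothetical confusion counts (not the paper's actual results):

```python
from math import sqrt

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall; 0 when undefined.
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def macro_f1(tp, fp, fn, tn):
    # Average the F1 of the attack class and of the benign class, so the
    # minority class contributes equally to the final score.
    return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2

def mcc(tp, fp, fn, tn):
    # Matthews Correlation Coefficient: balanced even under heavy class skew.
    num = tp * tn - fp * fn
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical counts for an unbalanced trace: 100 botnet flows, 900 benign.
tp, fp, fn, tn = 90, 5, 10, 895
```

With these illustrative counts, the positive-class F1 alone (about 0.92) understates how well the benign class is handled, which is exactly what the macro-average and MCC expose.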
Article
Full-text available
Intrusion detection is one of the most common approaches for addressing security attacks in modern networks. However, given the increasing diversity of attack behaviors, efficient detection becomes more challenging. Machine learning (ML) has recently dominated as one of the most promising techniques to improve detection accuracy for intrusion detection systems (IDS). With ML-based approaches, a quality dataset for training holds the key to gaining high detection performance. Unfortunately, there are few methods to assess dataset quality, specifically for ML training. This work presents an automated toolchain, termed CREME (Configuration, REproduction, Multi-dataset, and Evaluation), to generate a dataset and measure its quality and efficiency. CREME integrates various tools to automate all stages of configuration, attack and benign behavior reproduction, data collection, feature extraction, data labeling, and evaluation. CREME can also automatically collect and generate a dataset from multiple sources such as accounting, network traffic, and system logs. Compared with the available datasets in the same category, experiment results show that the datasets generated by CREME contribute up to 20% better performance to ML-based IDS in terms of coverage. They also have significantly better efficiency than most other datasets. The CREME source code is available at https://github.com/buihuukhoi/CREME.
Article
Full-text available
The adoption of network traffic encryption is continually growing. Popular applications use encryption protocols to secure communications and protect the privacy of users. In addition, a large portion of malware is spread through the network traffic taking advantage of encryption protocols to hide its presence and activity. Entering into the era of completely encrypted communications over the Internet, we must rapidly start reviewing the state-of-the-art in the wide domain of network traffic analysis and inspection, to conclude if traditional traffic processing systems will be able to seamlessly adapt to the upcoming full adoption of network encryption. In this survey, we examine the literature that deals with network traffic analysis and inspection after the ascent of encryption in communication channels. We notice that the research community has already started proposing solutions on how to perform inspection even when the network traffic is encrypted and we demonstrate and review these works. In addition, we present the techniques and methods that these works use and their limitations. Finally, we examine the countermeasures that have been proposed in the literature in order to circumvent traffic analysis techniques that aim to harm user privacy.
Article
Full-text available
The Interactive Tree Of Life (https://itol.embl.de) is an online tool for the display, manipulation and annotation of phylogenetic and other trees. It is freely available and open to everyone. iTOL version 5 introduces a completely new tree display engine, together with numerous new features. For example, a new dataset type has been added (MEME motifs), while annotation options have been expanded for several existing ones. Node metadata display options have been extended and now also support non-numerical categorical values, as well as multiple values per node. Direct manual annotation is now available, providing a set of basic drawing and labeling tools, allowing users to draw shapes, labels and other features by hand directly onto the trees. Support for tree and dataset scales has been extended, providing fine control over line and label styles. Unrooted tree displays can now use the equal-daylight algorithm, providing much greater display clarity. The user account system has been streamlined and expanded with new navigation options and currently handles >1 million trees from >70 000 individual users.
Article
Full-text available
Most research in the field of network intrusion detection heavily relies on datasets. Datasets in this field, however, are scarce and difficult to reproduce. To compare, evaluate, and test related work, researchers usually need the same datasets, or at least datasets with similar characteristics, as the ones used in related work. In this work, we present concepts and the Intrusion Detection Dataset Toolkit (ID2T) to alleviate the problem of reproducing datasets with desired characteristics to enable an accurate replication of scientific results. ID2T facilitates the creation of labeled datasets by injecting synthetic attacks into background traffic. The injected synthetic attacks created by ID2T blend with the background traffic by mimicking the background traffic’s properties. This article has three core contributions. First, we present a comprehensive survey on intrusion detection datasets. In the survey, we propose a classification to group the negative qualities found in the datasets. Second, the architecture of ID2T is revised, improved, and expanded in comparison to previous work. The architectural changes enable ID2T to inject recent and advanced attacks, such as the EternalBlue exploit or a peer-to-peer botnet. ID2T’s functionality provides a set of tests, known as TIDED, that helps identify potential defects in the background traffic into which attacks are injected. Third, we illustrate how ID2T is used in different use-case scenarios to replicate scientific results with the help of reproducible datasets. ID2T is open source software and is made available to the community to expand its arsenal of attacks and capabilities.
Article
Full-text available
In recent years cybersecurity attacks have caused major disruption and information loss for online organisations, with high-profile incidents in the news. One of the key challenges in advancing the state of the art in intrusion detection is the lack of representative datasets. These datasets typically contain millions of time-ordered events (e.g. network packet traces, flow summaries, log entries), which are subsequently analysed to identify abnormal behavior and specific attacks (Duffield et al., April). Generating realistic datasets has historically required expensive networked assets, specialised traffic generators, and considerable design preparation. Even with advances in virtualisation it remains challenging to create and maintain a representative environment. Major improvements are needed in the design, quality and availability of datasets, to assist researchers in developing advanced detection techniques. With the emergence of new technology paradigms, such as intelligent transport and autonomous vehicles, it is also likely that new classes of threat will emerge (Kenyon, 2018). Given the rate of change in threat behavior (Ugarte-Pedrero et al., 2019) datasets become quickly obsolete, and some of the most widely cited datasets date back over two decades. Older datasets have limited value: often heavily filtered and anonymised, with unrealistic event distributions, and opaque design methodology. The relative scarcity of Intrusion Detection System (IDS) datasets is compounded by the lack of a central registry, and inconsistent information on provenance. Researchers may also find it hard to locate datasets or understand their relative merits. In addition, many datasets rely on simulation, originating from academic or government institutions. The publication process itself often creates conflicts, with the need to de-identify sensitive information in order to meet regulations such as the General Data Protection Regulation (GDPR) (Regulation, 2016).
A final issue for researchers is the lack of standardised metrics with which to compare dataset quality. In this paper we attempt to classify the most widely used public intrusion datasets, providing references to archives and associated literature. We illustrate their relative utility and scope, highlighting the threat composition, formats, special features, and associated limitations. We identify best practice in dataset design, and describe potential pitfalls of designing anomaly detection techniques based on data that may be either inappropriate, or compromised due to unrealistic threat coverage. The contributions made in this paper are expected to facilitate continuous research and development for effectively combating the constantly evolving cyber threat landscape. CCS CONCEPTS: Intrusion Detection; Intrusion Prevention; Anomaly Detection; Network Flow; Smart Cities
Article
Full-text available
The performance of anomaly-based intrusion detection systems depends on the quality of the datasets used to form normal activity profiles. Suitable datasets are expected to include high volumes of real-life data free from attack instances. On account of these requirements, obtaining quality datasets from collected data requires a process of data sanitization that may be prohibitive if done manually, or uncertain if fully automated. In this work, we propose a sanitization approach for obtaining datasets from HTTP traces suited for training, testing, or validating anomaly-based attack detectors. Our methodology has two sequential phases. In the first phase, we clean known attacks from data using a pattern-based approach that relies on tools to detect URI-based known attacks. In the second phase, we complement the result of the first phase by conducting assisted manual labeling in a systematic and efficient manner, setting the focus of expert examination not on the raw data (which would be millions of URIs), but on the set of words contained in three dictionaries learned from these URIs. This dramatically downsizes the volume of data that requires expert discernment, making manual sanitization of large datasets feasible. We have applied our method to sanitize a trace that includes 45 million requests received by the library web server of the University of Seville. We were able to generate clean datasets in less than 84 h with only 33 h of manual supervision. We have also applied our method to some public benchmark datasets, confirming that attacks unnoticed by signature-based detectors can be discovered in a reduced time span.
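The two-phase idea above (signature cleaning, then expert review of word dictionaries rather than raw URIs) can be sketched in a few lines. The attack patterns, URIs, and the exact dictionary split are illustrative assumptions, not the paper's actual rule set:

```python
import re
from collections import Counter
from urllib.parse import urlsplit, unquote, unquote_plus

# Phase 1: drop URIs matching known attack signatures (illustrative patterns).
ATTACK_PATTERNS = [re.compile(p, re.I)
                   for p in (r"\.\./", r"union\s+select", r"<script")]

def phase1_clean(uris):
    """Keep only URIs that match no known attack pattern."""
    return [u for u in uris
            if not any(p.search(unquote_plus(u)) for p in ATTACK_PATTERNS)]

def phase2_dictionaries(uris):
    """Split surviving URIs into path-word, parameter-key and parameter-value
    dictionaries, so an expert inspects thousands of words, not millions of URIs."""
    paths, keys, values = Counter(), Counter(), Counter()
    for u in uris:
        parts = urlsplit(u)
        paths.update(w for w in re.split(r"[/._-]+", parts.path) if w)
        for pair in (parts.query.split("&") if parts.query else []):
            k, _, v = pair.partition("=")
            if k:
                keys[k] += 1
            if v:
                values[unquote(v)] += 1
    return paths, keys, values

uris = ["/index.php?page=home", "/images/logo.png",
        "/index.php?page=../../etc/passwd", "/search?q=union+select+1"]
clean = phase1_clean(uris)
paths, keys, values = phase2_dictionaries(clean)
```

On this toy trace, the traversal and SQL-injection URIs are removed in phase 1, and phase 2 leaves only a handful of words (`index`, `php`, `page`, `home`, …) for manual inspection.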
Article
Full-text available
The threat of cyber-attacks is on the rise in the digital world today. As such, effective cybersecurity solutions are becoming increasingly important for detecting and combating cyber-attacks. The use of machine learning techniques for network intrusion detection is a growing area of research, as these techniques can potentially provide a means for automating the detection of attacks and abnormal traffic patterns in real-time. However, misclassification is a common problem in machine learning for intrusion detection, and the improvement of machine learning models is hindered by a lack of insight into the reasons behind such misclassification. This paper presents an interactive method of visualizing network intrusion detection data in three-dimensions. The objective is to facilitate the understanding of network intrusion detection data using a visual representation to reflect the geometric relationship between various categories of network traffic. This interactive visual representation can potentially provide useful insight to aid the understanding of machine learning results. To demonstrate the usefulness of the proposed visualization approach, this paper presents results of experiments on commonly used network intrusion detection datasets.
Article
Full-text available
In order to obtain high quality and large-scale labelled data for information security research, we propose a new approach that combines a generative adversarial network with the BiLSTM-Attention-CRF model to obtain labelled data from crowd annotations. We use the generative adversarial network to find common features in crowd annotations and then consider them in conjunction with the domain dictionary feature and sentence dependency feature as additional features to be introduced into the BiLSTM-Attention-CRF model, which is then used to carry out named entity recognition in crowdsourcing. Finally, we create a dataset to evaluate our models using information security data. The experimental results show that our model has better performance than the other baseline models.
Conference Paper
Full-text available
The growing Internet of Things (IoT) market introduces new challenges for network activity monitoring. Legacy network monitoring is not tailored to cope with the huge diversity of smart devices. New network discovery techniques are necessary in order to find out what IoT devices are connected to the network. In this context, data analysis techniques can be leveraged to find specific patterns that can help to recognize device types. Indeed, contrary to desktop computers, IoT devices perform very specific tasks, making their networking behavior very predictable. In this paper, we present a machine learning based approach to recognize the type of IoT devices connected to the network by analyzing streams of packets sent and received. We built an experimental smart home network to generate network traffic data. From the generated data, we have designed a model to describe IoT device network behaviors. By leveraging the t-SNE technique to visualize our data, we are able to differentiate the network traffic generated by different IoT devices. The data describing the network behaviors are then used to train six different machine learning classifiers to predict the IoT device that generated the network traffic. The results are promising, with an overall accuracy as high as 99.9% on our test set achieved by the Random Forest classifier.
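Because each IoT device performs a narrow task, even simple per-flow statistics separate device types well. A dependency-free sketch of the idea with synthetic feature values; note the paper trains six classifiers (Random Forest performing best), whereas a nearest-centroid classifier stands in here to keep the example self-contained:

```python
from math import dist
from statistics import mean

# Toy per-flow features: (mean packet size, packets per minute, distinct ports).
# All numbers are synthetic assumptions, not measurements from the paper.
TRAIN = {
    "camera":  [(900.0, 300.0, 2.0), (880.0, 310.0, 2.0), (910.0, 295.0, 3.0)],
    "sensor":  [(80.0, 4.0, 1.0), (75.0, 5.0, 1.0), (82.0, 4.0, 1.0)],
    "speaker": [(400.0, 60.0, 4.0), (420.0, 55.0, 5.0), (390.0, 62.0, 4.0)],
}

# One centroid per device type: the per-feature mean of its training flows.
CENTROIDS = {dev: tuple(map(mean, zip(*flows))) for dev, flows in TRAIN.items()}

def predict(flow):
    """Assign the device type whose feature centroid is closest (Euclidean)."""
    return min(CENTROIDS, key=lambda d: dist(CENTROIDS[d], flow))

print(predict((85.0, 5.0, 1.0)))  # a low-rate, small-packet flow
```

In practice, features such as inter-arrival times and TLS handshake metadata would be added, and a Random Forest would replace the centroid rule; the clustering visible here is what t-SNE makes apparent in two dimensions.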
Article
Full-text available
Over the past decades, researchers have been proposing different Intrusion Detection approaches to deal with the increasing number and complexity of threats for computer systems. In this context, Random Forest models have been providing a notable performance on their applications in the realm of the behaviour-based Intrusion Detection Systems. Specificities of the Random Forest model are used to provide classification, feature selection, and proximity metrics. This work provides a comprehensive review of the general basic concepts related to Intrusion Detection Systems, including taxonomies, attacks, data collection, modelling, evaluation metrics and commonly used methods. It also provides a survey of Random Forest based methods applied in this context, considering the particularities involved in these models. Finally, some open questions and challenges are posed combined with possible directions to deal with them, which may guide future works on the area.
Article
Full-text available
Publicly available labelled data sets are necessary for evaluating anomaly-based Intrusion Detection Systems (IDS). However, existing data sets are often not up-to-date or not yet published because of privacy concerns. This paper identifies requirements for good data sets and proposes an approach for their generation. The key idea is to use a test environment and emulate realistic user behaviour with parameterised scripts on the clients. Comprehensive logging mechanisms provide additional information which may be used for a better understanding of the inner dynamics of an IDS. Finally, the proposed approach is used to generate the flow-based CIDDS-002 data set.
Article
Full-text available
The evaluation of algorithms and techniques to implement intrusion detection systems heavily relies on the existence of well-designed datasets. In recent years, much effort has been devoted to building such datasets. Yet there is still room for improvement. In this paper, a comprehensive review of existing datasets is first done, placing emphasis on their main shortcomings. Then, we present a new dataset that is built with real traffic and up-to-date attacks. The main advantage of this dataset over previous ones is its usefulness for evaluating IDSs that consider long-term evolution and traffic periodicity. Models that consider differences in daytime/night or weekdays/weekends can also be trained and evaluated with it. We discuss all the requirements for a modern IDS evaluation dataset and analyze how the one presented here meets the different needs.
Article
Full-text available
Linguistic sequence labeling is a general modeling approach that encompasses a variety of problems, such as part-of-speech tagging and named entity recognition. Recent advances in neural networks (NNs) make it possible to build reliable models without handcrafted features. However, in many cases, it is hard to obtain sufficient annotations to train these models. In this study, we develop a novel neural framework to extract abundant knowledge hidden in raw texts to empower the sequence labeling task. Besides word-level knowledge contained in pre-trained word embeddings, character-aware neural language models are incorporated to extract character-level knowledge. Transfer learning techniques are further adopted to mediate different components and guide the language model towards the key knowledge. Compared to previous methods, this task-specific knowledge allows us to adopt a more concise model and conduct more efficient training. Different from most transfer learning methods, the proposed framework does not rely on any additional supervision. It extracts knowledge from the self-contained order information of training sequences. Extensive experiments on benchmark datasets demonstrate the effectiveness of leveraging character-level knowledge and the efficiency of co-training. For example, on the CoNLL03 NER task, model training completes in about 6 hours on a single GPU, reaching an F1 score of 91.71$\pm$0.10 without using any extra annotation.
Article
Full-text available
Prior to deploying any intrusion detection system, it is essential to obtain a realistic evaluation of its performance. However, the major problem currently faced by the research community is the lack of availability of any realistic evaluation dataset and of a systematic metric for assessing the quantified quality of realism of any intrusion detection system dataset. It is difficult to access and collect data from real-world enterprise networks due to business continuity and integrity issues. In response to this, in this paper, firstly, a metric using a fuzzy logic system based on the Sugeno fuzzy inference model for evaluating the quality of realism of existing intrusion detection system datasets is proposed. Secondly, based on the proposed metric results, a synthetically realistic next-generation intrusion detection system dataset is designed and generated, and a preliminary analysis conducted to assist in the design of future intrusion detection systems. This generated dataset consists of both normal and abnormal reflections of current network activities occurring at critical cyber infrastructure levels in various enterprises. Finally, using the proposed metric, the generated dataset is analyzed to assess the quality of its realism, and compared with publicly available intrusion detection system datasets to verify its superiority.
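A zero-order Sugeno inference, as referenced above, computes a crisp score as the firing-strength-weighted average of constant rule outputs. A minimal sketch; the input variables, membership shapes, and rule constants below are illustrative assumptions, not the paper's calibrated system:

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def realism_score(real_share, attack_div):
    """Hypothetical realism metric from two inputs in [0, 1]: the share of
    real (non-synthetic) traffic and the diversity of attack behaviours."""
    low_r, high_r = tri(real_share, -0.5, 0.0, 1.0), tri(real_share, 0.0, 1.0, 1.5)
    low_d, high_d = tri(attack_div, -0.5, 0.0, 1.0), tri(attack_div, 0.0, 1.0, 1.5)
    # Zero-order Sugeno rules: firing strength (product t-norm) -> constant output.
    rules = [
        (high_r * high_d, 1.0),  # real traffic, diverse attacks -> realistic
        (high_r * low_d,  0.6),
        (low_r  * high_d, 0.4),
        (low_r  * low_d,  0.1),  # synthetic and narrow -> unrealistic
    ]
    num = sum(w * z for w, z in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0
```

The defuzzified output is simply `num / den`, so a fully real, fully diverse dataset scores 1.0 and a fully synthetic, narrow one scores 0.1, with smooth interpolation in between.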
Article
Full-text available
Anomaly-based approaches in network intrusion detection suffer from difficulties in evaluation, comparison and deployment, which originate from the scarcity of adequate publicly available network trace datasets. Moreover, publicly available datasets are either outdated or generated in a controlled environment. Due to the ubiquity of cloud computing environments in commercial and government internet services, there is a need to assess the impacts of network attacks in cloud data centers. To the best of our knowledge, there is no publicly available dataset which captures the normal and anomalous network traces in the interactions between cloud users and cloud data centers. In this paper, we present an experimental platform designed to represent a practical interaction between cloud users and cloud services and collect network traces resulting from this interaction to conduct anomaly detection. We use the Amazon Web Services (AWS) platform for conducting our experiments.
Conference Paper
Full-text available
Visualization and interactive analysis can help network administrators and security analysts analyze the network flow and log data. The complexity of such an analysis requires a combination of knowledge and experience from more domain experts to solve difficult problems faster and with higher reliability. We developed an online visual analysis system called OCEANS to address this topic by allowing close collaboration among security analysts to create deeper insights in detecting network events. Loading the heterogeneous data source (netflow, IPS log and host status log), OCEANS provides a multi-level visualization showing temporal overview, IP connections and detailed connections. Participants can submit their findings through the visual interface and refer to others' existing findings. Users can gain inspiration from each other and collaborate on finding subtle events and targeting multi-phase attacks. Our case study confirms that OCEANS is intuitive to use and can improve efficiency. The crowd collaboration helps the users comprehend the situation and reduce false alarms.
Conference Paper
Full-text available
One of the major research challenges in this field is the unavailability of a comprehensive network-based data set which can reflect modern network traffic scenarios, vast varieties of low-footprint intrusions and depth-structured information about the network traffic. The KDD98, KDDCUP99 and NSL-KDD benchmark data sets, long used to evaluate network intrusion detection research efforts, were generated a decade ago. However, numerous current studies have shown that, for the current network threat environment, these data sets do not inclusively reflect network traffic and modern low-footprint attacks. To counter the unavailability of such network benchmark data sets, this paper examines the creation of the UNSW-NB15 data set. This data set has a hybrid of real modern normal and contemporary synthesized attack activities of the network traffic. Existing and novel methods are utilised to generate the features of the UNSW-NB15 data set. This data set is available for research purposes and can be accessed from the links: 1. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7348942&filter%3DAND%28p_IS_Number%3A7348936%29 2. https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-NB15-Datasets/
Article
Full-text available
This paper compares and contrasts the most widely used network security datasets, evaluating their efficacy in providing a benchmark for intrusion and anomaly detection systems. The antiquated nature of some of the most widely used datasets along with their inadequacies is examined and used as a basis for discussion of a new approach to analyzing network traffic data. Live network traffic is collected that consists of real normal traffic and both real and penetration testing attack data. Attack data is then inspected and labeled by means of manual analysis. While network attacks and anomaly features vary widely, they share some commonalities that are examined here. Among these are: self-similarity convergence, periodicity, and repetition. Further, the knowledge inherent in the definition of network boundaries and advertised services can provide crucial context that allows the network analyst to consider self-aware attributes when examining network traffic sessions. To these ends the Session Aggregation for Network Traffic Analysis (SANTA) dataset is proposed. The motivation and the methodology of collection, aggregation and evaluation of the raw data are presented, as well as the conceptualization of the SANTA attributes and advantages provided by this approach.
Conference Paper
Full-text available
Correctly labelled datasets are commonly required. Three particular scenarios are highlighted, which showcase this need. When using supervised Intrusion Detection Systems (IDSs), these systems need labelled datasets to be trained. Also, the real nature of the analysed datasets must be known when evaluating the efficiency of the IDSs when detecting intrusions. Another scenario is the use of feature selection that works only if the processed datasets are labelled. In normal conditions, collecting labelled datasets from real networks is impossible. Currently, datasets are mainly labelled by implementing off-line forensic analysis, which is impractical because it does not allow real-time implementation. We have developed a novel approach to automatically generate labelled network traffic datasets using an unsupervised anomaly based IDS. The resulting labelled datasets are subsets of the original unlabelled datasets. The labelled dataset is then processed using a Genetic Algorithm (GA) based approach, which performs the task of feature selection. The GA has been implemented to automatically provide the set of metrics that generate the most appropriate intrusion detection results.
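A GA for feature selection, as described above, evolves bitmasks over the feature set, scoring each mask by detection quality. A self-contained sketch with a toy fitness function; in real use the fitness would be an IDS score on labelled traffic, and the feature count, hidden "informative" set, and GA parameters here are illustrative assumptions:

```python
import random

random.seed(7)  # deterministic toy run

N_FEATURES = 12
INFORMATIVE = {0, 3, 5, 9}  # hidden ground truth for this toy fitness only

def fitness(mask):
    """Stand-in for detector quality: reward informative features and
    penalise mask size (favouring small metric sets, as in the paper)."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & INFORMATIVE) - 0.1 * len(chosen)

def evolve(pop_size=30, generations=40, p_mut=0.05):
    # Random initial population of feature bitmasks.
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]        # truncation selection (elitist)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_FEATURES)   # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < p_mut) for bit in child]  # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
```

Because the survivors are carried over unchanged, the best fitness is monotone across generations, and the search converges towards the small informative subset.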
Article
While there has been significant interest in understanding the cyber threat landscape of Internet of Things (IoT) networks and in the design of Artificial Intelligence (AI)-based security approaches, there is a lack of distributed architectures for generating heterogeneous datasets that contain the actual behaviours of real-world IoT networks and complex cyber threat scenarios to evaluate the credibility of new systems. This paper presents a new realistic testbed architecture of an IoT network deployed at the IoT lab of the University of New South Wales (UNSW) at Canberra. The platform NSX vCloud NFV was employed to facilitate the execution of Software-Defined Networking (SDN), Network Function Virtualization (NFV) and Service Orchestration (SO) to offer dynamic testbed networks, which allow the interaction of edge, fog and cloud tiers. While deploying the architecture, real-world normal and attack scenarios are executed to collect labeled datasets. The datasets are referred to as “ToN_IoT”, as they comprise heterogeneous data sources collected from telemetry datasets of IoT services, Windows and Linux-based datasets, and datasets of network traffic. The ToN_IoT network dataset is validated using four machine learning-based anomaly detection algorithms, Gradient Boosting Machine, Random Forest, Naive Bayes, and Deep Neural Networks, revealing high detection accuracy on the training and testing sets. These new datasets provide a realistic testbed design, a variety of normal and attack events, heterogeneous data sources, and a ground truth table of security events. A comparative summary of the ToN_IoT network dataset and other competing network datasets demonstrates its diverse legitimate and anomalous patterns that can be used to better validate new AI-based security solutions. The datasets can be publicly accessed from ADFA (2020).
Chapter
Training data labelling is financially expensive in domain-specific learning applications, which rely heavily on input from domain experts. Thus, under a budget constraint, it is important to judiciously select high-quality training data for labelling in order to prevent over-fitting. In this paper, we propose a learning-to-label (L2L) framework leveraging active learning and reinforcement learning to iteratively select data to label for the Named Entity Recognition (NER) task. Experimental results show that our approach is more effective than strong previous methods using heuristics and reinforcement learning. With the same number of labeled data, our approach improves the accuracy of NER by 11.91%. Our approach is superior to the state-of-the-art learning-to-label method, with an improvement in accuracy of 6.49%.
Conference Paper
Labeling a real network dataset is especially expensive in computer security, as an expert has to ponder several factors before assigning each label. This paper describes an interactive intelligent system to support the task of identifying hostile behaviors in network logs. The RiskID application uses visualizations to graphically encode features of network connections and promote visual comparison. In the background, two algorithms are used to actively organize connections and predict potential labels: a recommendation algorithm and a semi-supervised learning strategy. These algorithms, together with interactive adaptations to the user interface, constitute a behavior recommendation. A study is carried out to analyze how the algorithms for recommendation and prediction influence the workflow of labeling a dataset. The results of a study with 16 participants indicate that the behavior recommendation significantly improves the quality of labels. Analyzing interaction patterns, we identify a more intuitive workflow used when behavior recommendation is available.
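The label-prediction side of such a tool can be sketched as a nearest-exemplar recommender with a confidence score that decides whether to auto-suggest a label or route the connection to the analyst. The features, exemplar values, and confidence rule below are illustrative assumptions, not RiskID's actual algorithms:

```python
from math import dist

# Toy connection features: (bytes/s, duration s, packet count); synthetic values.
labeled = {
    (5000.0, 2.0, 40.0): "botnet",
    (4800.0, 1.5, 35.0): "botnet",
    (120.0, 30.0, 15.0): "normal",
    (150.0, 25.0, 12.0): "normal",
}
unlabeled = [(4900.0, 1.8, 38.0), (2500.0, 10.0, 25.0), (130.0, 28.0, 14.0)]

def recommend(conn):
    """Nearest-exemplar prediction with a distance-gap confidence: when the
    nearest 'botnet' and nearest 'normal' exemplars are about equally far,
    confidence drops towards 0 and the UI should ask the analyst instead."""
    by_class = {}
    for feat, lab in labeled.items():
        d = dist(feat, conn)
        by_class[lab] = min(d, by_class.get(lab, float("inf")))
    (best, d1), (_, d2) = sorted(by_class.items(), key=lambda kv: kv[1])
    confidence = 1.0 - d1 / d2  # 1.0 = unambiguous, ~0.0 = borderline
    return best, confidence

suggestions = [recommend(conn) for conn in unlabeled]
```

On this toy data, the first and third connections get confident suggestions, while the middle one (between both clusters) comes back with near-zero confidence, which is exactly the case a labeling UI would surface for manual review.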
Article
The proliferation of smart home devices has created new opportunities for empirical research in ubiquitous computing, ranging from security and privacy to personal health. Yet, data from smart home deployments are hard to come by, and existing empirical studies of smart home devices typically involve only a small number of devices in lab settings. To contribute to data-driven smart home research, we crowdsource the largest known dataset of labeled network traffic from smart home devices from within real-world home networks. To do so, we developed and released IoT Inspector, an open-source tool that allows users to observe the traffic from smart home devices on their own home networks. Between April 10, 2019 and January 21, 2020, 5,404 users have installed IoT Inspector, allowing us to collect labeled network traffic from 54,094 smart home devices. At the time of publication, IoT Inspector is still gaining users and collecting data from more devices. We demonstrate how this data enables new research into smart homes through two case studies focused on security and privacy. First, we find that many device vendors, including Amazon and Google, use outdated TLS versions and send unencrypted traffic, sometimes to advertising and tracking services. Second, we discover that smart TVs from at least 10 vendors communicated with advertising and tracking services. Finally, we find widespread cross-border communications, sometimes unencrypted, between devices and Internet services that are located in countries with potentially poor privacy practices. To facilitate future reproducible research in smart homes, we will release the IoT Inspector data to the public.
Article
The Internet of Things (IoT) has evolved in the last few years to become one of the hottest topics in the area of computer science research. This drastic increase in IoT applications across different disciplines, such as health-care and smart industries, comes with a considerable security risk. This is not limited to attacks on privacy; it can also extend to attacks on network availability and performance. Therefore, an intrusion detection system is essential to act as the first line of defense for the network. IDSs and their algorithms depend heavily on the quality of the dataset provided. Sadly, there has been a lack of work in evaluating and collecting intrusion detection datasets designed specifically for an IoT ecosystem. Most published studies focus on outdated and incompatible datasets such as the KDD98 dataset. Therefore, in this paper, we investigate the existing datasets and their applications in IoT environments. We then introduce a real-time data collection framework for building a dataset for intrusion detection system evaluation and testing. The main advantage of the proposed dataset is that it contains features explicitly designed for the 6LoWPAN/RPL network, the most widely used protocol in the IoT environment.
Article
Network attacks, intrusion detection, and intrusion prevention are important topics in cyber security. Network flows and system events generate big data, which often leads to challenges in achieving intrusion detection with high efficiency and good accuracy. This paper focuses on the ‘Volume’, ‘Veracity’, and ‘Variety’ characteristics of big data in network traffic and attacks. Datasets with various data types, including numerical data and categorical data (such as status or flag data), are analyzed with the help of the R language and its functions. Duplicate detection and removal, missing-value detection, and data quality analysis are also performed, along with an analysis of masquerades for various users. In addition, a correlation analysis of variables and a clustering analysis based on k-means are performed.
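The data quality steps named above (duplicate removal, missing-value detection, and a k-means pass) can be sketched as follows. The paper works in R; this is a stdlib Python sketch with invented flow records, and the fixed initial centers are an illustrative simplification of k-means initialization.

```python
# Hypothetical flow records (src_bytes, dst_bytes); None marks a missing value.
records = [(100, 20), (100, 20), (105, 25), (None, 30), (5000, 900), (5200, 880)]

# Duplicate detection and removal (keep the first occurrence).
seen, deduped = set(), []
for r in records:
    if r not in seen:
        seen.add(r)
        deduped.append(r)

# Missing-value detection: drop rows containing any None.
complete = [r for r in deduped if None not in r]

# A tiny 2-means clustering pass (fixed initial centers, a few Lloyd iterations).
centers = [complete[0], complete[-1]]
for _ in range(10):
    clusters = [[], []]
    for r in complete:
        d = [sum((a - b) ** 2 for a, b in zip(r, c)) for c in centers]
        clusters[d.index(min(d))].append(r)
    centers = [
        tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
        for cl, c in zip(clusters, centers)
    ]

print(len(deduped), len(complete), centers)
# -> 5 4 [(102.5, 22.5), (5100.0, 890.0)]
```

The two resulting centroids separate the low-volume and high-volume flows, which is the kind of structure a masquerade analysis would then inspect.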
Article
This survey focuses on intrusion detection systems (IDS) that leverage host-based data sources for detecting attacks on enterprise networks. The host-based IDS (HIDS) literature is organized by input data source, presenting targeted sub-surveys of HIDS research leveraging system logs, audit data, the Windows Registry, file systems, and program analysis. While system calls are generally included in audit data, several publicly available system call datasets have spawned a flurry of IDS research on this topic, which merits a separate section. To accommodate current researchers, a section describing publicly available datasets is included, outlining their characteristics and shortcomings when used for IDS evaluation. Related surveys are organized and described. All sections are accompanied by tables concisely organizing the literature and datasets discussed. Finally, challenges, trends, and broader observations are presented throughout the survey and in the conclusion, along with future directions of IDS research. Overall, this survey is designed to allow easy access to the diverse types of data available on a host for sensing intrusion, the progression of research using each, and the accessible datasets for prototyping in the area.
Article
In the field of network security, the process of labeling a network traffic dataset is especially expensive since expert knowledge is required to perform the annotations. With the aid of visual analytic applications such as RiskID, the effort of labeling network traffic is considerably reduced. However, since the label assignment still requires an expert pondering several factors, the annotation process remains a difficult task. The present article introduces a novel active learning strategy for building a random forest model based on the user's previously labeled connections. The resulting model provides the user with an estimate of the label probability for each remaining unlabeled connection, assisting in the traffic annotation task. The article describes the active learning strategy, the interfaces with the RiskID system, the algorithms used to predict botnet behavior, and a proposed evaluation framework. The evaluation framework includes studies to assess not only the prediction performance of the active learning strategy but also its learning rate and resilience against noise, as well as its improvements over other well-known labeling strategies. The framework represents a complete methodology for evaluating the performance of any active learning solution. The evaluation results showed that the proposed approach is a significant improvement over previous labeling strategies.
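The strategy summarized above, training a model on the connections labeled so far and surfacing probability estimates for the rest, can be sketched generically. The sketch below uses a simple inverse-distance-weighted probability as a stand-in for the article's random forest, and all connection features and labels are invented for illustration.

```python
def botnet_probability(conn, labeled):
    """Stand-in probability model: an inverse-distance-weighted vote over
    labeled connections (a real system would use, e.g., a random forest)."""
    num = den = 0.0
    for feats, label in labeled:
        d = sum((a - b) ** 2 for a, b in zip(conn, feats)) + 1e-9
        w = 1.0 / d
        num += w * (1.0 if label == "botnet" else 0.0)
        den += w
    return num / den

# Toy connection features [flow_duration, bytes_ratio]; illustrative only.
labeled = [([1.0, 0.1], "normal"), ([9.0, 0.9], "botnet")]
unlabeled = [[1.2, 0.15], [5.0, 0.5], [8.8, 0.85]]

# Rank unlabeled connections by uncertainty (probability closest to 0.5):
# these are the connections an annotator should inspect first.
ranked = sorted(unlabeled, key=lambda c: abs(botnet_probability(c, labeled) - 0.5))
print(ranked[0])  # -> [5.0, 0.5]; the point between the two classes
```

Each expert label added to `labeled` sharpens the probability estimates, which is what drives the learning-rate measurements described in the evaluation framework.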
Article
Network anomaly detection is an important means of safeguarding network security. On account of the difficulties encountered in traditional automatic detection methods, such as the lack of labeled data, expensive retraining costs for new data, and the lack of explainability, we propose a novel smart labeling method, which combines active learning and visual interaction, to detect network anomalies through an iterative labeling process driven by the users. The algorithms and the visual interfaces are tightly integrated. The network behavior patterns are first learned using a self-organizing incremental neural network. Then, the model uses a fuzzy c-means-based algorithm to perform classification on the basis of user feedback. After that, the visual interfaces are updated to present the improved results of the model, which can help users choose meaningful candidates, judge anomalies, and understand the model results. The experiments show that, compared to labeling without our visualizations, our method can achieve a high anomaly detection accuracy rate with fewer labeled samples.
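The fuzzy c-means step mentioned above assigns each point a degree of membership in every cluster rather than a hard label, which is what lets the interface present graded candidates to the user. A minimal membership computation, with invented points and fixed centers (the full algorithm also iteratively updates the centers, omitted here):

```python
def fcm_memberships(points, centers, m=2.0):
    """Fuzzy c-means membership: u[i][k] is the degree to which point k
    belongs to cluster i; memberships for each point sum to 1."""
    u = []
    for ci in centers:
        row = []
        for p in points:
            d_i = sum((a - b) ** 2 for a, b in zip(p, ci)) ** 0.5 + 1e-12
            total = 0.0
            for cj in centers:
                d_j = sum((a - b) ** 2 for a, b in zip(p, cj)) ** 0.5 + 1e-12
                total += (d_i / d_j) ** (2.0 / (m - 1.0))
            row.append(1.0 / total)
        u.append(row)
    return u

points = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
u = fcm_memberships(points, centers)
# Memberships across the two clusters sum to 1 for every point.
print([round(u[0][k] + u[1][k], 6) for k in range(len(points))])  # -> [1.0, 1.0, 1.0]
```

A point near a cluster boundary gets memberships close to 0.5/0.5; those are exactly the ambiguous candidates worth routing to the user for feedback.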
Article
The proliferation of IoT systems has seen them targeted by malicious third parties. To address this challenge, realistic protection and investigation countermeasures, such as network intrusion detection and network forensic systems, need to be effectively developed. For this purpose, a well-structured and representative dataset is paramount for training and validating the credibility of such systems. Although several network datasets exist, in most cases little information is given about the Botnet scenarios that were used. This paper proposes a new dataset, called Bot-IoT, which incorporates legitimate and simulated IoT network traffic, along with various types of attacks. We also present a realistic testbed environment that addresses the drawbacks of existing datasets: capturing complete network information, accurate labeling, and recent and complex attack diversity. Finally, we evaluate the reliability of the Bot-IoT dataset using different statistical and machine learning methods for forensics purposes, compared with benchmark datasets. This work provides the baseline for allowing botnet identification across IoT-specific networks. The Bot-IoT dataset can be accessed at Bot-iot (2018) [1].
Article
The number of unique malware samples is growing out of control. Over the years, security companies have designed and deployed complex infrastructures to collect and analyze this overwhelming number of samples. As a result, a security company can collect more than 1M unique files per day from its different feeds alone. These are automatically stored and processed to extract actionable information derived from static and dynamic analysis. However, only a tiny fraction of this data is interesting to security researchers and attracts the attention of a human expert. To the best of our knowledge, nobody has systematically dissected these datasets to precisely understand what they really contain. The security community generally discards the problem because of the alleged prevalence of uninteresting samples. In this article, we guide the reader through a step-by-step analysis of the hundreds of thousands of Windows executables collected in one day from these feeds. Our goal is to show how a company can employ existing state-of-the-art techniques to automatically process these samples and then perform manual experiments to understand and document the real content of this gigantic dataset. We present the filtering steps, and we discuss in detail how samples can be grouped together according to their behavior to support manual verification. Finally, we use the results of this measurement experiment to provide a rough estimate of both the human and computer resources required to get to the bottom of the catch of the day.
Article
Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data.
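The core preprocessing problem described above, turning categorical flow attributes such as IP addresses and ports into continuous values a GAN can consume, can be illustrated with one plausible encoding. This sketch scales IP octets and the port into [0, 1] and inverts the transformation; it shows the general idea only, not the paper's three specific approaches.

```python
def flow_to_continuous(src_ip, src_port, duration):
    """Encode a flow record as floats in [0, 1] so it can be fed to a GAN.
    Octet/port scaling is an illustrative choice; duration is assumed to be
    pre-normalized elsewhere."""
    octets = [int(o) / 255.0 for o in src_ip.split(".")]
    return octets + [src_port / 65535.0, duration]

def continuous_to_flow(vec):
    """Invert the encoding back to a readable flow record (e.g., for the
    domain-knowledge quality tests applied to generated traffic)."""
    ip = ".".join(str(round(v * 255)) for v in vec[:4])
    return ip, round(vec[4] * 65535), vec[5]

v = flow_to_continuous("192.168.0.1", 443, 0.25)
print(continuous_to_flow(v))  # -> ('192.168.0.1', 443, 0.25)
```

The inverse mapping matters: generated vectors must be decoded back into flow records before domain-knowledge tests (valid IPs, plausible port/protocol pairs) can judge their quality.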
Conference Paper
Research in network traffic measurement and analysis is a long-lasting field with growing interest from both scientists and the industry. However, even after so many years, results replication, criticism, and review are still rare. We face not only a lack of research standards, but also inaccessibility of appropriate datasets that can be used for methods development and evaluation. Therefore, a lot of potentially high-quality research cannot be verified and is not adopted by the industry or the community. The aim of this paper is to overcome this controversy with a unique solution based on a combination of distinct approaches proposed by other research works. Unlike these studies, we focus on the whole issue covering all areas of data anonymization, authenticity, recency, publicity, and their usage for research provability. We believe that these challenges can be solved by utilization of semi-labeled datasets composed of real-world network traffic and annotated units with interest-related packet traces only. In this paper, we outline the basic ideas of the methodology from unit trace collection and semi-labeled dataset creation to its usage for research evaluation. We strive for this proposal to start a discussion of the approach and help to overcome some of the challenges the research faces today.
Article
Although the aggregated nature of exported flow data provides many advantages in terms of privacy and scalability, flow data may contain artifacts that impair data analysis. In this article, we investigate the differences between flow data analysis in theory and practice—that is, in lab environments and production networks.
Conference Paper
With the exponential growth in the size of computer networks and the applications developed for them, the significant increase in the potential damage that can be caused by launching attacks is becoming obvious. Meanwhile, Intrusion Detection Systems (IDSs) and Intrusion Prevention Systems (IPSs) are among the most important defense tools against sophisticated and ever-growing network attacks. Due to the lack of adequate datasets, anomaly-based approaches in intrusion detection systems suffer in their accurate deployment, analysis, and evaluation. A number of such datasets, including DARPA98, KDD99, ISC2012, and ADFA13, have been used by researchers to evaluate the performance of their proposed intrusion detection and intrusion prevention approaches. Based on our study of eleven datasets available since 1998, many such datasets are out of date and unreliable. Some suffer from a lack of traffic diversity and volume, some do not cover the variety of attacks, while others anonymize packet information and payloads, which cannot reflect current trends, or lack feature sets and metadata. This paper produces a reliable dataset that contains benign traffic and seven common attack network flows, meets real-world criteria, and is publicly available. The paper then evaluates the performance of a comprehensive set of network traffic features and machine learning algorithms to indicate the best set of features for detecting certain attack categories.
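Scoring which traffic features best separate attack categories, as the evaluation above does, is often done with measures such as information gain. A minimal sketch, with invented binary features and toy flows (the paper's actual feature set and ranking method are not reproduced here):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Toy flows: (uses_uncommon_port, high_byte_rate) -> label; illustrative only.
flows = [
    ((0, 0), "benign"), ((0, 0), "benign"), ((0, 1), "benign"),
    ((1, 1), "attack"), ((1, 1), "attack"), ((1, 0), "attack"),
]

def split_gain(idx):
    """Information gain of splitting on feature idx."""
    labels = [lab for _, lab in flows]
    gain = entropy(labels)
    for val in {f[idx] for f, _ in flows}:
        subset = [lab for f, lab in flows if f[idx] == val]
        gain -= len(subset) / len(flows) * entropy(subset)
    return gain

gains = {name: split_gain(i)
         for i, name in enumerate(["uncommon_port", "high_byte_rate"])}
print(max(gains, key=gains.get))  # -> uncommon_port (perfectly separates classes)
```

Ranking features by such a score per attack category is what yields a "best set of features" recommendation like the one the paper reports.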
Conference Paper
Acquiring a representative labelled dataset is a hurdle that has to be overcome to learn a supervised detection model. Labelling a dataset is particularly expensive in computer security as expert knowledge is required to perform the annotations. In this paper, we introduce ILAB, a novel interactive labelling strategy that helps experts label large datasets for intrusion detection with a reduced workload. First, we compare ILAB with two state-of-the-art labelling strategies on public labelled datasets and demonstrate it is both an effective and a scalable solution. Second, we show ILAB is workable with a real-world annotation project carried out on a large unlabelled NetFlow dataset originating from a production environment. We provide an open source implementation (https://github.com/ANSSI-FR/SecuML/) to allow security experts to label their own datasets and researchers to compare labelling strategies.
Article
Labeling data instances is an important task in machine learning and visual analytics. Both fields provide a broad set of labeling strategies, whereby machine learning (and in particular active learning) follows a rather model-centered approach and visual analytics employs rather user-centered approaches (visual-interactive labeling). Both approaches have individual strengths and weaknesses. In this work, we conduct an experiment with three parts to assess and compare the performance of these different labeling strategies. In our study, we (1) identify different visual labeling strategies for user-centered labeling, (2) investigate strengths and weaknesses of labeling strategies for different labeling tasks and task complexities, and (3) shed light on the effect of using different visual encodings to guide the visual-interactive labeling process. We further compare labeling of single versus multiple instances at a time, and quantify the impact on efficiency. We systematically compare the performance of visual interactive labeling with that of active learning. Our main findings are that visual-interactive labeling can outperform active learning, given the condition that dimension reduction separates well the class distributions. Moreover, using dimension reduction in combination with additional visual encodings that expose the internal state of the learning model turns out to improve the performance of visual-interactive labeling.
Conference Paper
Botnets have been a serious threat to Internet security. With their constant sophistication and resilience, a new trend has emerged, shifting botnets from the traditional desktop to the mobile environment. As in the desktop domain, detecting mobile botnets is essential to minimize the threat they pose. Among the diverse set of strategies applied to detect these botnets, the ones that show the best and most generalizable results involve discovering patterns in their anomalous behavior. In the mobile botnet field, one way to detect these patterns is by analyzing the operational parameters of this kind of application. In this paper, we present an anomaly-based and host-based approach to detect mobile botnets. The proposed approach uses machine learning algorithms to identify anomalous behaviors in statistical features extracted from system calls. Using a self-generated dataset containing 13 families of mobile botnets and legitimate applications, we were able to test the performance of our approach in a close-to-reality scenario. The proposed approach achieved strong results, including low false positive rates and high true detection rates.
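One common form of the "statistical features extracted from system calls" mentioned above is a relative-frequency vector per trace. A minimal sketch with an invented call vocabulary and toy traces (the paper's actual feature set is richer than this):

```python
from collections import Counter

# Hypothetical system-call vocabulary; a real feature extractor would use
# the platform's full syscall list.
SYSCALLS = ["open", "read", "write", "connect", "send"]

def syscall_features(trace):
    """Relative frequency of each system call in a trace: a simple
    statistical feature vector for behaviour-based classification."""
    counts = Counter(trace)
    total = len(trace)
    return [counts[s] / total for s in SYSCALLS]

benign = ["open", "read", "read", "write"]
botnet = ["connect", "send", "send", "connect", "send", "read"]

print(syscall_features(benign))  # -> [0.25, 0.5, 0.25, 0.0, 0.0]
print(syscall_features(botnet))  # network-heavy profile: connect/send dominate
```

Vectors like these, one per application run, are what a classifier is trained on; the network-heavy profile of the botnet trace is the kind of anomaly the approach detects.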
Conference Paper
Intrusion detection is an important method for identifying attacks and compromises of computer systems, but it is complicated by rapid changes in technology, the increasing interconnectedness of devices on the internet, the growing use of cyberattacks, and more sophisticated and automated attack methods and tools used by adversaries. The challenge of intrusion detection is further complicated because, as advances are made in the ability to detect attacks, similar advances are made by adversaries to thwart those detective measures. Although numerous machine learning algorithms and approaches have proven effective in detecting cyberattacks, these algorithms have limitations, especially in adversarial environments. This study addresses the lack of an effective machine learning algorithm that minimizes the human interaction needed to train and evolve the learner so that it adapts to changing cyberattacks and evasive tactics. This research concludes that selective sampling of unlabeled data for classification by a human expert can result in more efficient labeling for large datasets. It demonstrates a more resilient approach to machine learning that uses active learning to interact with human subject matter experts and adapts to changing data, thus addressing issues related to data tampering and evasion. Full text available at: http://ieeexplore.ieee.org/document/7925383/
Article
Internet of Things (IoT) is a new paradigm that integrates the Internet and physical objects belonging to different domains such as home automation, industrial process, human health and environmental monitoring. It deepens the presence of Internet-connected devices in our daily activities, bringing, in addition to many benefits, challenges related to security issues. For more than two decades, Intrusion Detection Systems (IDS) have been an important tool for the protection of networks and information systems. However, applying traditional IDS techniques to IoT is difficult due to its particular characteristics such as constrained-resource devices, specific protocol stacks, and standards. In this paper, we present a survey of IDS research efforts for IoT. Our objective is to identify leading trends, open issues, and future research possibilities. We classified the IDSs proposed in the literature according to the following attributes: detection method, IDS placement strategy, security threat and validation strategy. We also discussed the different possibilities for each attribute, detailing aspects of works that either propose specific IDS schemes for IoT or develop attack detection strategies for IoT threats that might be embedded in IDSs.
Conference Paper
This paper presents a graphical interface to identify hostile behavior in network logs. The problem of identifying and labeling hostile behavior is well known in the network security community. There is a lack of labeled datasets, which makes it difficult to deploy automated methods or to test the performance of manual ones. We describe the process of searching for and identifying hostile behavior with a graphical tool derived from an open-source Intrusion Prevention System, which graphically encodes features of network connections from a log file. A design study with two network security experts illustrates the workflow of searching for patterns descriptive of unwanted behavior and labeling occurrences of them.
Article
In recent years, Mobile Ad hoc NETworks (MANETs) have generated great interest among researchers in the development of theoretical and practical concepts, and their implementation under several computing environments. However, MANETs are highly susceptible to various security attacks due to their inherent characteristics. In order to provide adequate security against multi-level attacks, the researchers are of the opinion that detection-based schemes should be incorporated in addition to traditionally used prevention techniques because prevention-based techniques cannot prevent the attacks from compromised internal nodes. Intrusion detection system is an effective defense mechanism that detects and prevents the security attacks at various levels. This paper tries to provide a structured and comprehensive survey of most prominent intrusion detection techniques of recent past and present for MANETs in accordance with technology layout and detection algorithms. These detection techniques are broadly classified into nine categories based on their primary detection engine/(s). Further, an attempt has been made to compare different intrusion detection techniques with their operational strengths and limitations. Finally, the paper concludes with a number of future research directions in the design and implementation of intrusion detection systems for MANETs.
Article
This survey paper describes a focused literature survey of machine learning (ML) and data mining (DM) methods for cyber analytics in support of intrusion detection. Short tutorial descriptions of each ML/DM method are provided. Based on the number of citations or the relevance of an emerging method, papers representing each method were identified, read, and summarized. Because data are so important in ML/DM approaches, some well-known cyber data sets used in ML/DM are described. The complexity of ML/DM algorithms is addressed, discussion of challenges for using ML/DM for cyber security is presented, and some recommendations on when to use a given method are provided.