Article

... Several studies state that only a small number of features is needed to fully capture the patterns that differentiate one application from another. The works in [152], [153], [154] study the most relevant statistical features for traffic classification. In [153], several FS techniques are used to obtain the most important features, while a newly proposed method selects the smallest set. ...
... The works in [152], [153], [154] study the most relevant statistical features for traffic classification. In [153], several FS techniques are used to obtain the most important features, while a newly proposed method selects the smallest set. The results were cross-validated on three datasets, measuring the goodness, similarity, and stability of each feature; the result is a small set, of between 6 and 14 statistical features, that offers the best performance as measured by accuracy. ...
... The work in [153] studies the importance of FS and FR for traffic classification using ML. Ten network traffic datasets were used to show the advantages and disadvantages of different well-known FS techniques, such as Information Gain, Gain Ratio, Principal Component Analysis (PCA), and Correlation-based Feature Selection, among others. ...
Article
Full-text available
Traffic analysis is a compound of strategies intended to find relationships, patterns, anomalies, and misconfigurations, among other things, in Internet traffic. In particular, traffic classification is a subgroup of strategies in this field that aims at identifying the application’s name or the type of Internet traffic. Nowadays, traffic classification has become a challenging task due to the rise of new technologies, such as traffic encryption and encapsulation, which decrease the performance of classical traffic classification strategies. Machine Learning gains interest as a new direction in this field, showing signs of future success, such as knowledge extraction from encrypted traffic and more accurate Quality of Service management. Machine Learning is fast becoming a key tool to build traffic classification solutions in real network traffic scenarios; in this sense, the purpose of this investigation is to explore the elements that allow this technique to work in the traffic classification field. Therefore, a systematic review is introduced based on the steps to achieve traffic classification by using Machine Learning techniques. The main aim is to understand and to identify the procedures followed by the existing works to achieve their goals. As a result, this survey paper finds a set of trends derived from the analysis performed on this domain; in this manner, the authors expect to outline future directions for Machine Learning based traffic classification.
... • Identify the optimal and stable features in the temporal domain and the spatial domain for accurate and effective network traffic classification. The issue of improving the accuracy of network classification in both the temporal domain (across different periods of time) and the spatial domain (across different network locations) has been the subject of recent studies [Li et al., 2009; Fahad et al., 2013]. However, many of these classical studies in this area neglect the insensitivity of feature selection techniques when selecting the representative set in temporal-domain and spatial-domain traffic data. ...
... A key issue with many feature selection techniques [Almuallim and Dietterich, 1994; Duda and Hart, 1996; Hall, 2000; Liu and Motoda, 1998] motivates the approach in [Fahad et al., 2013], which is proposed to address the limitations of existing feature selection techniques and generate a highly discriminant set of features. ...
... Nevertheless, in order to counter the emergence of new applications and patterns, a number of network classifiers and intrusion detection systems (IDSs) based on machine learning techniques [Linda et al., 2009; Fahad et al., 2013; Tsang and Kwong, 2005; Almalawi et al., 2013a; Kim et al., 2011a; Tze-Haw et al., 2011] have been proposed to assist network experts in analysing the security risks and detecting attacks against their systems. However, a key problem in the research and development of such efficient and accurate network traffic classification and intrusion detection systems (based on machine learning) is the lack of sufficient traffic data, especially for industrial networks (e.g. ...
... With regard to the use of FS in [48], Fahad et al. compared six different techniques: Information Gain (IG), Gain Ratio (GR), Principal Component Analysis (PCA), Correlation-based Feature Selection (CBF), Chi-square, and Consistency-based search (CBC). For the evaluation of the methods, three measures were chosen: i) goodness, which corresponds to the detection accuracy; ii) stability, which evaluates the robustness of the subset to variation in the traffic data; and iii) similarity, to compare the behaviour of different FS techniques on the same dataset. ...
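A minimal sketch of how two of these FS techniques can be compared on the same data; the dataset is synthetic, and the Jaccard overlap here merely stands in for the "similarity" measure, whose exact definition in [48] may differ:

```python
# Compare Information Gain (via mutual information) and Chi-square selection,
# then measure how similar their selected feature subsets are.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=6,
                           random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values

k = 6
ig_idx = set(SelectKBest(mutual_info_classif, k=k).fit(X, y).get_support(indices=True))
chi_idx = set(SelectKBest(chi2, k=k).fit(X, y).get_support(indices=True))

# "Similarity" between two FS techniques: Jaccard overlap of the selected sets
similarity = len(ig_idx & chi_idx) / len(ig_idx | chi_idx)
print("IG picks:", sorted(ig_idx), "chi2 picks:", sorted(chi_idx),
      "overlap:", round(similarity, 2))
```

The same overlap computed across resampled datasets would approximate the "stability" measure.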
... Moreover, similarly to [50] and [52], RMHC constructs one subset for each attack and normal class. However, unlike [48], [49] and [50], the method only uses one evaluation criterion instead of multiple ones. ...
... The following conclusions arise from the study of the literature: 1) One feature selection technique is not enough to achieve stability across the different datasets, as the behaviour of network traffic is shifting ([48], [49], [50]). 2) For each of the five classes of intrusion detection, one optimal subset should be obtained, as a single global feature subset is not capable of describing all the different classes satisfactorily ([50], [52], [55]). 3) FS can considerably improve not only the detection rate, but also the computational performance. ...
Article
Full-text available
Over the last five years there has been an increase in the frequency and diversity of network attacks. This holds true, as more and more organisations admit compromises on a daily basis. Many misuse and anomaly based Intrusion Detection Systems (IDSs) that rely on either signatures, supervised or statistical methods have been proposed in the literature, but their trustworthiness is debatable. Moreover, as this work uncovers, the current IDSs are based on obsolete attack classes that do not reflect the current attack trends. For these reasons, this paper provides a comprehensive overview of unsupervised and hybrid methods for intrusion detection, discussing their potential in the domain. We also present and highlight the importance of feature engineering techniques that have been proposed for intrusion detection. Furthermore, we argue that current IDSs should evolve from simple detection to correlation and attribution. We discuss how IDS data could be used to reconstruct and correlate attacks to identify attackers, with the use of advanced data analytics techniques. Finally, we argue how the present IDS attack classes can be extended to match the modern attacks and propose three new classes regarding the outgoing network communication.
... Several studies state that only a small number of features is needed to fully capture the patterns that differentiate one application from another. The works in [188,56,55] study the most relevant statistical features for traffic classification. In [56], several FS techniques are used to obtain the most important features, while a newly proposed method selects the smallest set. ...
... The works in [188,56,55] study the most relevant statistical features for traffic classification. In [56], several FS techniques are used to obtain the most important features, while a newly proposed method selects the smallest set. The results were cross-validated on three datasets, measuring the goodness, similarity, and stability of each feature; the result is a small set, of between 6 and 14 statistical features, that offers the best performance as measured by accuracy. ...
... To treat abnormalities, feature reduction and selection are deployed to discard noisy statistical-based features. For instance, the works in [188,56,55] study the most relevant statistical features for Internet traffic classification. In [56], several Feature Selection techniques are used to obtain the most important features, while a newly proposed method selects the smallest set. ...
Thesis
The Internet has become indispensable for the daily activities of human beings. Nowadays, this network system serves as a platform for communication, transactions, and entertainment, among others. This communication system is characterized by terrestrial and satellite components that interact with each other to provide transmission paths for information between endpoints. In particular, Satellite Communication providers are interested in improving customer satisfaction by optimally exploiting the available on-demand resources and offering Quality of Service (QoS). Improving the QoS implies reducing the errors linked to information loss and delays of Internet packets in Satellite Communications. In this sense, according to the Internet traffic (Streaming, VoIP, Browsing, etc.) and those error conditions, Internet flows can be classified into different sensitive and non-sensitive classes. Following this idea, this thesis project aims at finding new Internet traffic classification approaches to improve customer satisfaction by improving the QoS. Machine Learning (ML) algorithms will be studied and deployed to classify Internet traffic. All the elements necessary to couple an ML solution with a well-known Satellite Communication and QoS management architecture will be evaluated. In this architecture, one or more monitoring points will intercept Satellite Internet traffic, which in turn will be treated and marked with predefined classes by ML-based classification techniques. The marked traffic will be interpreted by a QoS management architecture that will take actions according to the class type. To develop this ML-based solution, a rich and complete set of Internet traffic is required; however, historical labeled data is hardly publicly available. In this context, binary packets should be monitored and stored to generate historical data.
To do so, an emulated cloud platform will serve as a data generation environment in which different Internet communications will be launched and captured. This study is then extended to a Satellite Communication architecture. Moreover, statistical-based features are extracted from the packet flows. Some statistical-based computations will be adapted to achieve accurate Internet traffic classification for encrypted and unencrypted packets in the historical data. Afterward, a proposed classification system will deal with different Internet communications (encrypted, unencrypted, and tunneled). This system will process the incoming traffic hierarchically to achieve high classification performance. Besides, to cope with the evolution of Internet applications, a new method is presented to induce updates over the original classification system. Finally, experiments on the emulated cloud platform validate our proposal and set guidelines for its deployment over a Satellite architecture.
... Previously, the role of the FS process was limited to the selection of a subset of the original feature set. FS is one of the most important data preprocessing stages and is used in various fields such as pattern recognition, ML, and DM. Furthermore, FS approaches can be divided into three main types: wrapper techniques, embedded techniques, and filter techniques [28][29][30]. ...
... It is worth mentioning that the SSPLR method behaves as an optimization technique with an iterative procedure, where the computational cost of each iteration is O(N × P). Here, N is the total number of training records and P is the total number of features. [Table residue: selected feature-index subsets and accuracy values (95.89/93.65/92.15, 95.91/90.24/87.83, 95.78/91.51/87.84) for NSL-KDD attack classes, including U2R and R2L.] Convergence depends on various aspects such as step size, data set, and parameter settings, whereas in most cases the SSPLR convergence rate is very fast, especially in the first several iterations. ...
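As a rough illustration of the sparse-modeling idea, and not the authors' exact SSPLR formulation, an L1-regularized logistic regression performs feature selection and classification in one model, with a per-iteration cost on the order of O(N × P); the data and regularization strength below are placeholder assumptions:

```python
# L1 (lasso) penalty drives the coefficients of irrelevant features to exactly
# zero, so the surviving nonzero coefficients are the selected features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=40, n_informative=8,
                           random_state=1)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])  # indices of features kept by the model
print(f"{len(selected)} of {X.shape[1]} features kept:", selected)
```

Smaller `C` means stronger sparsity, i.e. fewer selected features.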
Article
Full-text available
With the rapid advancement in technology, network systems are becoming prone to more sophisticated types of intrusions. However, machine learning (ML) based strategies are among the most efficient and popular methods to identify network intrusions or attacks. In this study, we examined the important and discriminative features in order to recognize the various attacks, by applying the Structural Sparse Logistic Regression (SSPLR) and Support Vector Machine (SVM) methods. The SVMs are standard ML-based techniques, which provide reasonable performance; however, they have a few shortcomings, such as limited interpretability and a huge computational cost. On the other hand, sparse modeling (SSPLR) is considered an advanced method for data examination and processing through regularization. Structural sparse modeling can be used to simultaneously select distinct features or groups of discriminative features from the repository of the data set to determine the coefficients of the linear classifier, where prior information on the features’ structure can be mapped onto various sparsity-inducing regularizations. In this way, the particular groups of features yielded by the most significant network attacks are selected and potentially identified. The experiments and discussion show that the proposed techniques have improved performance compared to the most state-of-the-art techniques used for the Intrusion Detection System (IDS).
... Nowadays, the amount of video streaming data is increasing rapidly. This data is highly dimensional because it comes from a variety of sources such as surveillance cameras and sensors [53]. Many feature selection techniques are not suitable to deal with highly dimensional labeled or unlabeled sparse data sets. ...
... As such, the retrieval of feature selection from such data sets is a difficult and time-consuming task. Therefore, it is the need of the hour to develop efficient algorithms for the feature selection process that can be used for sparse data sets [53]. ...
Article
Full-text available
For the last three decades, the World Wide Web (WWW) has been one of the most widely used platforms, generating an immense amount of heterogeneous data every day. Presently, many organizations aim to process their domain data to take quick decisions and improve their organizational performance. However, high dimensionality in datasets is a major obstacle for researchers and domain engineers trying to achieve the desired performance from their selected machine learning (ML) algorithms. In ML, feature selection is a core concept used for selecting the most relevant features from high-dimensional data and thus improving the performance of the trained learning model. Moreover, the feature selection process also provides an effective way of eliminating inappropriate and redundant features, which ultimately shrinks the computational time. Due to the significance and applications of feature selection, it has become a well-researched area of ML. Nowadays, feature selection plays a vital role in most effective spam detection systems, pattern recognition systems, automated organization and management of documents, and information retrieval systems. In order to do accurate classification, relevant feature selection is the most important task; to achieve its objectives, this study starts with an overview of text classification. This overview is then followed by a survey covering the popular feature selection methods commonly used for text classification, which also sheds light on their applications. The focus of this study is three feature selection algorithms, i.e., Principal Component Analysis (PCA), Chi-Square (CS), and Information Gain (IG). This study is helpful for researchers looking for a suitable criterion to decide which technique to use for a better understanding of classifier performance. To conduct the experiments, the web spam uk2007 dataset is considered.
Ten, twenty, thirty, and forty features were selected as optimal subsets from the web spam uk2007 dataset. Among the three feature selection algorithms, CS and IG had the highest F1 score (F-measure = 0.911) but at the same time suffered in model-building time.
... Similarly, the author in [9] proposes an efficient and scalable feature selection approach that selects features based on user-defined interests to perform traffic classification. The proposed approach was aimed at achieving high accuracy in traffic classification by selecting only relevant features. ...
... When building an algorithm, it should be designed in such a way that it produces no false positives. Similarly, we used Equation (9) for the sensitivity metric evaluation. ...
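The sensitivity metric referenced here as Equation (9) is conventionally TP / (TP + FN); a quick check of that definition against scikit-learn's `recall_score`, on made-up labels:

```python
# Sensitivity (a.k.a. recall, true positive rate) = TP / (TP + FN)
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
sensitivity = tp / (tp + fn)
print(sensitivity)  # 3 / (3 + 1) = 0.75
assert sensitivity == recall_score(y_true, y_pred)
```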
Article
Full-text available
Internet of Things (IoT) refers to the interconnection via the Internet of computing devices embedded in everyday objects, enabling them to send and receive data. These devices can be controlled remotely, which makes them susceptible to exploitation or even takeover by an attacker. The lack of security features on many IoT devices makes it easy for attackers to access confidential information, issue commands from a distance, or even use the compromised device as part of a DDoS attack against another network. Feature selection is an important part of problem formulation in machine learning. To overcome the above problems, this paper proposes a novel feature selection framework, RFS, for IoT attack detection using machine learning (ML) techniques. The RFS is based on the concept of effective feature selection and consists of three main stages: feature selection, modeling, and attack detection. For feature selection, three different models are proposed. Based on these approaches, three different algorithms are proposed. A set of 40 features was included in the model, derived from combinatorial optimization and statistical analysis methods. Our experimental study shows that the proposed framework significantly improves over state-of-the-art cyberattack detection techniques for time series data with outliers.
... so the overall density of a point is analyzed to determine the features of the datasets that influence a particular data point. DBSCAN, OPTICS, DBCLASD, and DENCLUE are algorithms that use such a method to filter out noise (outliers) and discover clusters of arbitrary shape [10]. ...
... It also yields a way of automatically determining the number of clusters based on standard statistics, taking noise (outliers) into account and thus producing a robust clustering method. The model-based method uses statistical and neural network approaches [10]. MCLUST is probably the best-known model-based algorithm, but there are other good algorithms, such as EM (which uses a mixture density model), conceptual clustering (such as COBWEB), and neural network approaches (such as self-organizing feature maps). ...
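A minimal DBSCAN example of the density-based behaviour described above: clusters of arbitrary shape are recovered and an isolated point is labeled as noise (-1). The data is synthetic, and the `eps`/`min_samples` values are illustrative choices:

```python
# DBSCAN on two interleaved half-moons plus one far-away outlier.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = np.vstack([X, [[5.0, 5.0]]])  # append one obvious outlier

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
print("clusters:", n_clusters, "noise points:", int((labels == -1).sum()))
```

A centroid-based method like k-means would force the outlier into one of the clusters; DBSCAN simply leaves it unassigned.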
Article
Full-text available
Clustering is an important data mining tool for analysing big data. There are difficulties in applying clustering techniques to big data because of the new challenges it raises: as big data refers to terabytes and petabytes of information, and clustering algorithms come with high computational costs, the question is how to cope with this problem and how to deploy clustering techniques on big data and get results within a reasonable time. This study aims to review the design and progress of clustering algorithms for coping with big-data challenges, from the first proposed algorithms up to modern novel solutions. The algorithms, and the challenges targeted when designing improved clustering algorithms, are introduced and analyzed, and possible future directions for more advanced algorithms are then discussed based on computational complexity. In this paper we discuss clustering algorithms and big data applications for real-world things.
... These are classified into supervised and unsupervised techniques [2] [7] [8]. The supervised methods, such as SVM [8] and Naive Bayes [9], include two main steps. Firstly, the learning step (where classification algorithms aim to analyse the supervised training data and build an inferred model). ...
... Meanwhile, while processing any of those transactions, intrusive actions could happen; hence it is still a controversial area of research to find methods that provide privacy preservation for SCADA data [1]. Figure 2. Basic architecture of SCADA systems [9] ...
Article
Full-text available
Supervisory Control and Data Acquisition (SCADA) systems face the absence of a protection technique that can beat different types of intrusions and protect their data from disclosure while this data is handled by other applications, specifically an Intrusion Detection System (IDS). The SCADA system can manage the critical infrastructure of industrial control environments. Protecting sensitive information is a difficult task to achieve in reality with the connection of physical and digital systems. Hence, privacy preservation techniques have become effective in protecting sensitive/private information and detecting malicious activities, but they are not accurate in terms of error detection and the sensitivity percentage of data disclosure. In this paper, we propose a new Privacy Preservation Intrusion Detection (PPID) technique based on the correlation coefficient and Expectation Maximisation (EM) clustering mechanisms for selecting important portions of data and recognizing intrusive events. This technique is evaluated on the power system datasets for multiclass attacks to measure its reliability for detecting suspicious activities. The experimental results show that the proposed technique outperforms three existing techniques in the above terms, demonstrating its efficiency and effectiveness for current SCADA systems.
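The two building blocks named in this abstract, a correlation-coefficient filter and EM clustering, can be sketched roughly as below; the data, the number of kept features, and the component count are placeholder assumptions, not the paper's power-system setup (scikit-learn's `GaussianMixture` is fitted by EM):

```python
# Keep the features most correlated with the label, then cluster the reduced
# data with an EM-fitted Gaussian mixture.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=2)

# Absolute Pearson correlation of each feature with the label; keep the top 3
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = np.argsort(corr)[-3:]

gm = GaussianMixture(n_components=2, random_state=0).fit(X[:, keep])
clusters = gm.predict(X[:, keep])
print("kept features:", sorted(keep.tolist()),
      "cluster sizes:", np.bincount(clusters))
```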
... In the same year, Moore and Zuev [24] applied a Naive Bayes kernel estimator using their discriminators to categorize network traffic. Fahad et al. [25] investigated the task of feature selection (FS). They chose five well-known FS techniques and proposed an integrated FS approach, which used all the others, to obtain an optimal feature set. ...
... We used the python scikit-learn package [50] to run the ML models with default parameters. We evaluate each of the ML models using the following 10 statistical flow-based features as input since these features have been widely used in previous works [7], [23], [24], [25], [26], [36]: ...
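A rough reconstruction of the setup this snippet describes, with scikit-learn defaults; the ten "flow statistics" here are fabricated random values standing in for features (packet sizes, inter-arrival times, etc.) that real work would extract from flows:

```python
# Cross-validate a default-parameter classifier on 10 statistical flow features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 10))            # fabricated per-flow statistics
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # toy label driven by two features

clf = RandomForestClassifier(random_state=0)  # default hyperparameters
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV accuracy:", round(float(scores.mean()), 3))
```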
Article
Identifying the type of a network flow or a specific application has many advantages, such as traffic engineering, or detecting and preventing applications or application types that violate the organization’s security policy. The use of encryption, such as VPN, makes such identification challenging. Current solutions rely mostly on handcrafted features and then apply supervised learning techniques for the classification. We introduce a novel approach for encrypted Internet traffic classification and application identification by transforming basic flow data into an intuitive picture, a FlowPic, and then using known image classification deep learning techniques, CNNs, to identify the flow category (browsing, chat, video, etc.) and the application in use. We show that our approach can classify traffic with high accuracy, both for a specific application, or a flow category, even for VPN and Tor traffic. Our classifier can even identify with high success new applications that were not part of the training phase for a category, thus, new versions or applications can be categorized without additional training.
... Many works focused on the process of feature generations [18][19][20]. These methods usually created a long list of handcrafted features extracted from bidirectional flows, such as RTT (round-trip delay time) statistics, packet size statistics, inter-arrival time statistics, frequencies, and so on. ...
... These methods usually created a long list of handcrafted features extracted from bidirectional flows, such as RTT (round-trip delay time) statistics, packet size statistics, inter-arrival time statistics, frequencies, and so on. Then some [20] applied feature selection techniques to obtain an optimal feature set. Based on the obtained features, they applied machine-learning classifiers such as Naive Bayes Kernel estimator, SVM, decision trees, etc. ...
Article
Full-text available
Identifying the type of a network flow or a specific application has many advantages but becomes harder in recent years due to the use of encryption, e.g., by VPN. As a result, there is a recent wave of solutions that harness deep learning for traffic classification. These solutions either require a rather long time (15-60 seconds) of flow data or rely on handcrafted features for solutions that classify flows faster. In this work, we suggest a novel approach for classification that extracts the most out of the two simple yet defining features of a flow: packet sizes and inter-arrival times. We employ a model that uses the inter-arrival times to parameterize the derivative of the flow hidden-state using a neural network (Neural ODE). We compare our results with a solution that uses the same data without the ODE solver and show the benefit of this approach. Our results can classify flows based on 20 or 30 consecutive packets taken from anywhere in one direction of a flow. This reduces the amount of traffic between the sampling point and the analyzer and does not require matching between two directions of the flow. As a result, our solution can classify traffic with good accuracy within a few seconds, and we show how to combine it with a more accurate (and a slower) classifier to achieve (mostly) fast and accurate classifications.
... The machine learning-based traffic classification requires having a "good" set of features to enhance accuracy. Identifying these features is an important challenge because (1) it requires specialized knowledge in this area to understand the important features, (2) the data set may contain irrelevant and redundant features that greatly reduce accuracy and (3) the efficiency of classifiers (based on machine learning techniques) decreases when a large number of features are analyzed [11]. Some studies have shown that unrelated and redundant features can reduce the accuracy and validity of the classification model. ...
... Some studies have shown that unrelated and redundant features can reduce the accuracy and validity of the classification model. Therefore, finding appropriate features in network traffic is a significant challenge in machine learning techniques [11]. Detecting encrypted traffic and providing automated and accurate systems is an important problem in network traffic classification. ...
Conference Paper
Network traffic classification has considerable importance with the rapid growth of current Internet networks and their online applications. In the research conducted in this area, machine learning algorithms have been widely used due to the importance of accuracy in traffic classification. One of the most essential and useful steps in machine learning algorithms is feature selection, because it reduces redundant features and affects accuracy. Given the importance of this issue, this paper examines the impact of feature selection methods on traffic classification. Three strategies are presented for feature selection, namely gain ratio, information gain, and weight-by-SVM models, applied to SVM and Naïve Bayes machine learning algorithms. Also, considering the importance of encrypted data and its growing trend, this work separately explores the impact of the above-mentioned methods on encrypted and non-encrypted data from the NIMS data set. The results of the experiments show that the gain ratio obtained a better accuracy than the other methods: 97.30% on encrypted data, as well as 99.90% on non-encrypted data.
... Nowadays, in the domain of network security (IDS), machine learning (ML), data mining (DM), and feature selection (FS) play significant roles, because many researchers are working in this domain to improve the performance of learning algorithms before applying them in different fields such as text mining, computer vision, and image processing [17]. Feature selection is usually used for many reasons, such as increasing the efficiency of the learning algorithm, achieving a high accuracy rate, and simplifying classification problems [18]. Moreover, FS determines an appropriate subset of the original dataset in order to minimize the impact of irrelevant and redundant features without greatly decreasing the accuracy of the classifier. ...
Article
Full-text available
Intrusion detection system (IDS) is a well-known and effective component of network security that provides transactions upon the network systems with security and safety. Most earlier research has addressed difficulties such as overfitting, feature redundancy, high-dimensional features, and a limited number of training samples, but not feature selection. We approach the problem of feature selection via sparse logistic regression (SPLR). In this paper, we propose a discriminative feature selection and intrusion classification based on SPLR for IDS. The SPLR is a recently developed technique for data analysis and processing via sparse regularized optimization, which selects a small subset of the original feature variables to model the data for the purpose of classification. A linear SPLR model aims to select the discriminative features from the repository of datasets and learns the coefficients of the linear classifier. Compared with feature selection approaches like filter (ranking) and wrapper methods, which separate the feature selection and classification problems, SPLR can combine feature selection and classification into a unified framework. The experiments in this correspondence demonstrate that the proposed method has better performance than most of the well-known techniques used for intrusion detection.
... The first step in reducing the dimension of the data is the extraction of significant indicators [7], which is used in most similar problems [8], although, as a rule, the emphasis is placed on high-level characteristic extraction [9] rather than primary traffic sifting. The extraction of high-level characteristics is not sufficient when processing large-scale data, especially taking into account the data storage requirements for further regression analysis and incident investigation. ...
Conference Paper
Full-text available
The security of modern large digital systems poses the urgent task of detecting network attacks in the backbone high-speed traffic flow. To solve this problem, one needs to preprocess, prepare, and aggregate data from network packets. The network traffic analysis module aggregates big data from the traffic flow into time series for mathematical analysis. The authors propose a new hierarchical method of data aggregation, which allows the data size to be reduced more effectively and speeds up the processing of each separate new fragment. The method consists in introducing parent-child relation links between the analyzed parameter time series and using data accumulation and shifting based on these relationships. The paper includes an estimate of the proposed method's effectiveness.
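The hierarchical aggregation idea — child windows summarize raw packets, and parent-level series are then updated from the child aggregates alone, never from the raw packets again — can be sketched as follows. Window sizes, the parent-child fan-out, and the chosen statistics are illustrative, not taken from the paper.

```python
import numpy as np

# One simulated packet-size sample per tick (sizes in bytes).
rng = np.random.default_rng(1)
packet_sizes = rng.integers(40, 1500, size=600)

child_win, parent_fan = 10, 6          # 10 packets per child, 6 children per parent
children = packet_sizes.reshape(-1, child_win)
child_bytes = children.sum(axis=1)     # per-child aggregate: total bytes
child_peak = children.max(axis=1)      # per-child aggregate: largest packet

# The parent level is computed from child aggregates only -- each new child
# window can update its parent in O(1) without revisiting raw packets.
parent_bytes = child_bytes.reshape(-1, parent_fan).sum(axis=1)
parent_peak = child_peak.reshape(-1, parent_fan).max(axis=1)
```

Because sums and maxima compose across levels, no information needed at the parent level is lost by discarding the raw packets, which is what makes the hierarchy reduce storage.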
... These attacks are also called "zero-day misuses" [5], [6]. As just mentioned, zero-day attacks have proved hard to mitigate because of the absence of prior information about them [7], [8]. Consequently, there is always a need to protect against these zero-day attacks before they cause tremendous harm to the systems. Data mining is a method that can be used with intrusion detection to distinguish characteristic patterns in the data that describe system and user conduct [9], [10], and, ideally, instances of malicious activity. ...
... For each score function f, f is transformed to f' by

f'(x) = (f(x) - min_f) / (max_f - min_f)    (3)

Here, min_f and max_f are the minimum and maximum values of f, respectively. In [4], it was shown that there is a very weak correlation between the goodness rates of CHI and IG, so we need to normalize them before combining them. Let the maximum IG score among all attributes be denoted by max_IG. ...
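The min-max rescaling of Eq. (3) maps each score function onto [0, 1], which is what makes IG and CHI scores comparable before they are merged. A small sketch with made-up per-attribute scores:

```python
import numpy as np

# Invented per-attribute scores: raw IG and chi-square are on very
# different scales, so neither can be added to the other directly.
ig = np.array([0.02, 0.40, 0.10, 0.90])    # Information Gain per attribute
chi = np.array([1.5, 120.0, 30.0, 260.0])  # chi-square per attribute

def minmax(f):
    # Eq. (3): rescale a score function to the [0, 1] interval.
    return (f - f.min()) / (f.max() - f.min())

# One simple way to merge the two normalized rankings.
combined = minmax(ig) + minmax(chi)
best = int(np.argmax(combined))   # attribute with the highest joint score
```

Summation is only one possible combination rule; the point is that after Eq. (3) both scores live on the same scale.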
Article
Full-text available
The technique used for feature selection immensely affects the performance of classification on high-dimensional datasets. The bag-of-words model is often used for sentiment classification with machine learning. The set of unique words in a text-based dataset constitutes the feature vector, which has a high dimension. In this work, a new feature selection method has been proposed which chooses features that are highly relevant to the class, using the Information Gain (IG) score, and least redundant with respect to the selected feature set, using the chi-square (CHI) statistic. The two scores were normalized to make them comparable. The proposed method was applied to lexicon-based sentiment analysis using the lexicon SentiWordNet. Previously, IG and mRMR have been shown to be the best filter feature selection methods for sentiment term selection. The performance of our proposed method in terms of classification accuracy in sentiment analysis is significantly higher than that of the IG and mRMR methods. Experiments were performed on three datasets in different domains. Also, the feature subset obtained by removing words with a zero polarity score in SentiWordNet led to a significant reduction in feature vector size with no effect on performance.
... Meanwhile, the diversity of network users and terminals makes the volume, velocity, and variety of network flow information rise at an exponential rate. Understanding network flow behaviors has become a significant topic in network monitoring and management, as it helps reveal and predict the occurrence of network events [3][4][5][6][7]. Therefore, accurately analyzing and comprehensively mining network flow behaviors is an essential condition for establishing a secure, stable, and reliable network environment, and it has attracted a wide range of attention from both academia and industry. ...
Article
Graph-based approaches have been widely employed to facilitate the analysis of network flow connectivity behaviors, which aim to understand the impacts and patterns of network events. However, existing approaches suffer from a lack of connectivity-behavior information and a loss of network event identification. In this paper, we propose network flow connectivity graphs (NFCGs) to capture network flow behavior for modeling the social behaviors of network entities. Given a set of flows, the edges of an NFCG are generated by connecting pairwise hosts that communicate with each other. To preserve more information about network flows, we also embed node-ranking values and edge-weight vectors into the original NFCG. After that, a network flow connectivity behavior analysis framework is presented based on NFCGs. The proposed framework consists of three modules: a graph simplification module based on diversified filtering rules, a graph feature analysis module based on quantitative or semiquantitative analysis, and a graph structure analysis module based on several graph mining methods. Furthermore, we evaluate our NFCG-based framework using real network traffic data. The results show that NFCGs and the proposed framework not only achieve good performance on network behavior analysis but also exhibit excellent scalability for further algorithmic implementations.
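The basic NFCG construction — hosts as nodes, one undirected edge per communicating pair, with an edge-weight vector accumulating per-flow statistics and a simple node-ranking value — can be sketched with the standard library. The flow records, IP addresses, and the two weight fields (flow count, total bytes) are invented for illustration.

```python
from collections import defaultdict

# Toy flow records: (source host, destination host, bytes transferred).
flows = [
    ("10.0.0.1", "10.0.0.2", 1200),
    ("10.0.0.1", "10.0.0.2", 300),
    ("10.0.0.1", "10.0.0.3", 80),
]

edges = defaultdict(lambda: [0, 0])   # (a, b) -> [n_flows, total_bytes]
degree = defaultdict(int)             # a simple node-ranking value

for src, dst, nbytes in flows:
    key = tuple(sorted((src, dst)))   # undirected pairwise edge
    if edges[key] == [0, 0]:          # first flow on this edge: update degrees
        degree[src] += 1
        degree[dst] += 1
    edges[key][0] += 1                # edge-weight vector, component 1
    edges[key][1] += nbytes           # edge-weight vector, component 2
```

Repeated flows between the same pair enrich the edge-weight vector rather than adding parallel edges, which mirrors how the abstract describes preserving flow information in a single graph.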
... There are three types of feature selection methods: filter, wrapper, and embedded methods [3,4]. The filter method is simple and effective, and typical algorithms include correlation-based feature selection (CFS), consistency-based search, Information Gain Ratio, ReliefF, and so on [5,6]. The wrapper method can generate results with higher accuracy than the filter methods, but it involves higher computational complexity. ...
Article
Full-text available
The paper proposes a set of features suitable for fine-grained traffic classification of network video, with data collected from a real network. These features are parameters related to quality of experience (QoE), which reflects the user's perception. The QoE value is calculated based on the ITU-T P.1201/Amd2 standard. Under this standard, a QoE value and its probability distribution can be calculated for each video flow. One innovative aspect of the paper is that the characteristics of the QoE value and its probability distribution are extracted as discriminating features suitable for video traffic classification. The extracted features of the QoE distribution are typically the mean, variance, maximum, and minimum statistical characteristics, from which the probability distribution of the features can be obtained. Different from previous work, in our method we obtain for the first time the discrete probability distribution with five values and use them directly as independent features in feature selection and classification. The experimental results demonstrate that the proposed new features can significantly improve classification accuracy compared with an existing method. © 2018 Springer Science+Business Media, LLC, part of Springer Nature
... Gupta et al. [5] and Hink et al. [31] demonstrated the importance of feature selection to fast intrusion detection in a network with heavy information traffic. A feature selection approach was proposed in [32] to identify the optimal and smallest set of features that leads to high accuracy with timely and expected classification results. Another intrusion detection model [33] that can provide high detection reliability is also based on feature selection. ...
Article
Full-text available
The smart grid is a revolutionary, intelligent, next-generation power system. Due to its cyber infrastructure nature, it must be able to accurately detect potential cyber-attacks and take appropriate actions in a timely manner. This paper creates a new intrusion detection model, which is able to classify binary-class, triple-class, and multi-class cyber-attacks and power-system incidents. The intrusion detection model is based on a whale optimization algorithm (WOA)-trained artificial neural network (ANN). The WOA is applied to initialize and adjust the weight vector of the ANN to achieve the minimum mean square error. The proposed WOA-ANN model can address the challenges of attacks, failure prediction, and failure detection in a power system. We utilize the Mississippi State University and Oak Ridge National Laboratory databases of power-system attacks to demonstrate the proposed model and show the experimental results. The WOA is able to train the ANN to find the optimal weights. We compare the proposed model with other commonly used classifiers. The comparison results show the superiority of the proposed WOA-ANN model.
... In the area of system security, machine learning, data mining, and feature selection play key roles, because numerous researchers are using them to enhance the performance of learning algorithms in various fields, for example text mining, computer vision, and image processing [12]. Feature selection is generally utilized for several reasons, for example increasing the efficiency of the learning algorithm, achieving a high precision rate, and simplifying classification problems [13]. Also, feature selection determines a suitable subset of the original dataset so as to limit the effect of superfluous and redundant features without reducing the precision of the classifier. ...
... They applied a Naive Bayes Kernel estimator to categorize network traffic. Fahad et al. [14] chose five well known feature selection (FS) techniques and proposed an integrated FS approach, to obtain an optimal feature set, using the descriptors list of Moore et al. [12]. There are many works that use only flow-based features (i.e. ...
... In [16], a mutual-information-based feature selection method that automatically determines the number of relevant features was proposed. In [17], a number of new evaluation metrics, namely goodness, stability, and similarity, were used to assess the advantages and defects of existing feature selection methods. Six useful feature selection methods were integrated with the aim of combining their strengths. ...
Article
Full-text available
The challenges faced by networks nowadays can be solved to a great extent by the application of accurate network traffic classification. Internet traffic classification is responsible for associating network traffic with the applications generating it and helps in network monitoring and Quality of Service management, among other areas. Traditional methods of traffic classification, including port-based, payload-based, host-based, and behavior-based methods, exhibit a number of limitations that range from high computational cost to the inability to access encrypted packets for the purpose of classification. Machine learning techniques based on statistical properties are now being employed to overcome the limitations of existing techniques. However, the high number of flow features that serve as input to the learning machine poses a great challenge that requires the application of a pre-processing stage known as feature selection. Too many irrelevant and redundant features affect the predictive accuracy and performance of the learning machine. This work analyses experimentally the effect of a collection of ranking-based filter feature selection methods on a multi-class dataset for traffic classification. In the first stage, the proposed Top-N criterion is applied to the feature sets obtained, while in the second stage we generate, for each Top-N set of features, a new dataset which is applied as input to a set of four machine learning algorithms (classifiers). Experimental results show the viability of our model as a tool for selecting the optimal subset of features which, when applied, leads to improvement in the accuracy and performance of the traffic classification process.
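The two-stage pipeline described above — rank features with a filter score, keep the Top-N, then hand the reduced dataset to a classifier — can be sketched on synthetic data. The filter score (absolute correlation with the class) and the classifier (nearest centroid) are simple stand-ins, not the exact methods of the paper.

```python
import numpy as np

# Synthetic flow dataset: 8 features, only feature 0 carries class signal.
rng = np.random.default_rng(2)
n, d = 300, 8
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * y

# Stage 1: filter ranking, then the Top-N criterion (here N = 3).
score = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
top_n = np.argsort(score)[::-1][:3]

# Stage 2: feed the reduced dataset to a (very simple) classifier.
def nearest_centroid_acc(Xs, ys):
    c0, c1 = Xs[ys == 0].mean(0), Xs[ys == 1].mean(0)
    pred = (np.linalg.norm(Xs - c1, axis=1)
            < np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return (pred == ys).mean()

acc_topn = nearest_centroid_acc(X[:, top_n], y)
```

In a full study one would sweep N and compare several classifiers per Top-N subset, as the abstract describes; this sketch shows only a single (N, classifier) cell of that grid.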
... Among its benefits are network monitoring, network security, and network resource management. Traditional methods of traffic classification, which include port-based, payload-based, host-based, and behavior-based methods, are limited in a number of ways, such as dynamic port assignment to applications, encryption of application contents, and privacy issues, to mention a few [4,5]. Nowadays, it is common to use machine learning techniques that are based on statistical properties of traffic flows, such as maximum packet length and packet inter-arrival time, for identifying the traffic flow [2]. ...
Article
Full-text available
In this paper, we compare two validation methods that are used to estimate the performance of classification algorithms in a non-problem-specific knowledge scenario. One way to measure the performance of a classification algorithm is to determine its prediction error rate. However, this value cannot be calculated exactly, only estimated. In this work, we apply and compare two common estimation methods, namely test data and cross-validation. Precisely, we analyze and compare the statistical properties of the K-fold cross-validation and test data estimators of the prediction error rates of six classifiers, namely Naïve Bayes, KNN, Random Forest, SVM, J48, and OneR. From the study, the statistical property of repeated cross-validation tends to stabilize the prediction error estimate, which in turn reduces the variance of the prediction error estimator when compared with test data. The NIMS dataset collected over a network was employed in the experimental study.
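K-fold cross-validation as an estimator of the prediction error rate can be sketched end to end with a hand-rolled 1-nearest-neighbour classifier: every sample is scored exactly once by a model trained on the other folds. Data, K, and the classifier are illustrative choices, not those of the paper.

```python
import numpy as np

# Two well-separated synthetic classes.
rng = np.random.default_rng(3)
n = 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 2)) + 3.0 * y[:, None]

def one_nn_errors(Xtr, ytr, Xte, yte):
    # Count 1-NN misclassifications of the test fold against the train folds.
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    return (ytr[d.argmin(axis=1)] != yte).sum()

K = 5
idx = rng.permutation(n)
folds = np.array_split(idx, K)
errors = 0
for k in range(K):
    te = folds[k]
    tr = np.concatenate([folds[j] for j in range(K) if j != k])
    errors += one_nn_errors(X[tr], y[tr], X[te], y[te])
cv_error_rate = errors / n   # K-fold estimate of the prediction error rate
```

Repeating this whole procedure over several random permutations and averaging is the "repeated cross-validation" whose variance-reducing effect the abstract discusses.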
... The definition of feature quality metrics is closely related to feature selection. In general, feature selection is a dimensionality reduction technique that improves data-driven modeling accuracy and reduces computational complexity by eliminating redundant and irrelevant features while maintaining the original characteristics of the data (Almusallam, Tari, Bertok, & Zomaya, 2017;Fahad, Tari, Khalil, Habib, & Alnuweiri, 2013). This definition identifies the key properties of feature selection: ...
Thesis
Full-text available
Maintaining health model robustness has always been a challenge in prognostics and health management. Research on developing advanced machine learning algorithms has shown great promise, but prognostic performance is limited when the feature quality is poor. This thesis proposes an extensible preprocessing methodology that applies time series pattern recognition to transient-rich and background-rich systems for robust prognostics and health monitoring. This method recognizes patterns-of-interest accurately to facilitate the exact extraction of diagnostic information, namely, features. It takes three phases to realize exact feature extraction. First, hierarchical time series classifiers filter out the signals with few critical patterns and prepare the pattern recognition tools for segmentation. Second, time series pattern recognition identifies and segments the patterns-of-interest. Third, pattern-specific features are extracted as the input for health modeling. The developed exact feature extraction method is validated on two case studies: semiconductor etching process health monitoring and gas type classification using uncalibrated chemical sensors in a complex environment. The proposed method is shown to outperform conventional feature extraction, such as summary statistics and observation, in both studies. The benefits of exact feature extraction include accuracy, consistency, generality, and extensibility. The recognition of patterns enables accurate description of critical process properties and accelerates segmentation compared to human observation. The extracted features are more consistent in healthy conditions and more sensitive to faults. Also, the pattern recognition tools are designed for general engineering systems and can be applied to a wide range of industries. Besides, the semi-automated process allows human intervention to include additional patterns for an extensible and customized solution.
This thesis embraces domain knowledge and attempts to generalize them and build engineering syntax and semantics at the fundamental level in the PHM system with the assistance of pattern recognition. Instead of making a decisive conclusion, this study hopes to usher in more research on feature quality and broaden the research frontier for prognostics and health management.
... The Balanced Feature Selection (BFS) method achieved 90% classification accuracy using a Naive Bayes classifier [22]. To investigate the merits and demerits of well-known feature selection methods and to identify consistent features, new metrics such as goodness, stability, and similarity are proposed in [23]. Results show that none of the feature selection methods performed well on all three metrics. ...
Conference Paper
Full-text available
Identification and classification of network applications is a key area of network management and network security, due to the exponential growth of Internet users, which in turn increases the growth of Internet traffic. As the Internet grows, the different types of traffic generated over the network also grow. This motivates proposing new methods to identify and classify network traffic. In this paper, we propose to embed Fisher's Discriminant Ratio (FDR) into the Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), and Plus-L-Minus-R feature selection methods to analyze and classify Internet traffic. To evaluate the proposed method, we used the publicly available KDDcup99 dataset. Experimental results show that the proposed embedding method outperforms the existing methods.
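Embedding Fisher's Discriminant Ratio into a forward search can be sketched very compactly: at each SFS step, the unselected feature with the highest FDR is added. The synthetic data below (one strongly and one weakly discriminative feature) is invented; the paper itself evaluates on KDDcup99.

```python
import numpy as np

# Synthetic two-class data: feature 2 is strongly, feature 4 weakly informative.
rng = np.random.default_rng(4)
n, d = 200, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 2] += 3.0 * y
X[:, 4] += 1.0 * y

def fdr(x, y):
    # Fisher's Discriminant Ratio for a single feature and two classes.
    x0, x1 = x[y == 0], x[y == 1]
    return (x0.mean() - x1.mean()) ** 2 / (x0.var() + x1.var())

# Sequential Forward Selection with FDR as the embedded criterion.
selected, remaining = [], list(range(d))
for _ in range(2):
    best = max(remaining, key=lambda j: fdr(X[:, j], y))
    selected.append(best)
    remaining.remove(best)
```

SBS works the same way in reverse (start from all features, greedily drop the lowest-FDR one), and Plus-L-Minus-R alternates the two moves.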
... This paper does not collect the descriptions of the datasets due to space restrictions. Thus, we recommend that readers consult the original references [3], [4], [10], [11], [36] for more complete details about the characteristics of the datasets. ...
Article
Full-text available
Clustering algorithms have emerged as an alternative, powerful meta-learning tool to accurately analyze the massive volume of data generated by modern applications. In particular, their main goal is to categorize data into clusters such that objects are grouped in the same cluster when they are similar according to specific metrics. There is a vast body of knowledge in the area of clustering, and there have been attempts to analyze and categorize it for a larger number of applications. However, one of the major issues in using clustering algorithms for big data, which causes confusion amongst practitioners, is the lack of consensus in the definition of their properties as well as a lack of formal categorization. With the intention of alleviating these problems, this paper introduces concepts and algorithms related to clustering, a concise survey of existing (clustering) algorithms, as well as a comparison from both a theoretical and an empirical perspective. From a theoretical perspective, we developed a categorizing framework based on the main properties pointed out in previous studies. Empirically, we conducted extensive experiments in which we compared the most representative algorithm from each of the categories using a large number of real (big) datasets. The effectiveness of the candidate clustering algorithms is measured through a number of internal and external validity metrics, stability, runtime, and scalability tests. In addition, we highlighted the set of clustering algorithms that are the best performing for big data. Index terms: clustering algorithms, unsupervised learning, big data.
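One of the internal validity metrics such surveys rely on can be illustrated with a simplified silhouette score computed from scratch: for each point, compare its mean distance to its own cluster against its mean distance to the other cluster. The two-cluster data and labelings below are synthetic.

```python
import numpy as np

# Two tight, well-separated synthetic clusters.
rng = np.random.default_rng(8)
A = rng.normal(0.0, 0.5, size=(30, 2))
B = rng.normal(5.0, 0.5, size=(30, 2))
X = np.vstack([A, B])
labels = np.repeat([0, 1], 30)

def silhouette(X, labels):
    # Simplified silhouette for exactly two clusters: s = (b - a) / max(a, b).
    s = []
    for i, x in enumerate(X):
        same = X[labels == labels[i]]
        other = X[labels != labels[i]]
        a = np.linalg.norm(same - x, axis=1).sum() / max(len(same) - 1, 1)
        b = np.linalg.norm(other - x, axis=1).mean()
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

good = silhouette(X, labels)                        # true clustering
bad = silhouette(X, rng.integers(0, 2, size=60))    # random labeling
```

An internal metric like this needs no ground truth, which is why surveys pair it with external metrics (which do) when ranking clustering algorithms.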
... In the case of large-scale learning problems, the trade-off is more complex because it involves not only the accuracy of the selection but also other aspects. Stability, which is the sensitivity of the results to training set variations, is one such factor, with a few studies published regarding the behavior of filters when the training set is small but the number of features can be high [8,11,13]. The other important aspect, scalability, that is, the behavior of the algorithms as the training set becomes increasingly large, is even scarcer in the scientific literature [35], and the studies are mainly concentrated on obtaining scalability in a particular application [30], modifying certain previously existing approaches [43], or adopting online [19] or parallel [51] approaches. ...
Article
Full-text available
Lately, driven by the explosion of high dimensionality, researchers in machine learning have become interested not only in accuracy but also in scalability. Although the scalability of learning methods is a trending issue, the scalability of feature selection methods has not received the same amount of attention. This research analyzes the scalability of state-of-the-art feature selection methods belonging to the filter, embedded, and wrapper approaches. For this purpose, several new measures are presented, based not only on accuracy but also on execution time and stability. The results on seven classical artificial datasets are presented and discussed, as well as two case studies analyzing the particularities of microarray data and the effect of redundancy. To check whether the results can be generalized, we included some experiments with two real datasets. As expected, filters are the most scalable feature selection approach, with INTERACT, ReliefF, and mRMR being the most accurate methods (full article available at: http://rdcu.be/BQFA).
... Utilizing feature selection for intrusion detection can be applied on benchmark datasets such as works done by (Aghdam and Kabiri, 2016;Gharaee and Hosseinvand, 2016) and (Ravale et al., 2015). Another application of feature selection with IDS is to select features from the network traffic data directly such as the works done by (Beigi et al., 2014;Fahad et al., 2013) and . ...
Article
Network and Internet security is a critical universal issue. The increased rate of cyber terrorism has put national security at risk. In addition, Internet attacks have caused severe damage to different sectors (i.e., individuals, the economy, enterprises, organizations, and governments). Network Intrusion Detection Systems (NIDS) are one of the solutions against these attacks. However, NIDS always need to improve their performance in terms of increasing accuracy and decreasing false alarms. Integrating feature selection with intrusion detection has been shown to be a successful approach, since feature selection can help in selecting the most informative features from the entire set of features. Usually, for stealthy and low-profile attacks (zero-day attacks), there are a few neatly concealed packets distributed over a long period of time to mislead firewalls and NIDS. Besides, there are many features extracted from those packets, which may make some machine learning-based feature selection methods suffer from overfitting, especially when the data have large numbers of features and relatively small numbers of examples. In this paper, we propose a NIDS based on a feature selection method called Recursive Feature Addition (RFA) and a bigram technique. The system has been designed, implemented, and tested. We tested the model on the ISCX 2012 dataset, which is one of the most well-known and recent datasets for intrusion detection purposes. Furthermore, we propose a bigram technique to encode payload string features into a useful representation that can be used in feature selection. In addition, we propose a new evaluation metric, called combined, that combines accuracy, detection rate, and false alarm rate in a way that helps in comparing different systems and selecting the best among them. The designed feature selection-based system has shown a noticeable improvement in performance on different metrics.
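The bigram idea for payloads — slide a two-character window over the string and count the resulting units, yielding a numeric encoding that feature selection can consume — can be sketched with the standard library. The two payload strings are invented examples, not taken from ISCX 2012.

```python
from collections import Counter

def bigrams(payload):
    # Overlapping two-character units of a payload string, with counts.
    return Counter(payload[i:i + 2] for i in range(len(payload) - 1))

benign = bigrams("GET /index.html")
attack = bigrams("GET /../../etc/passwd")   # a toy path-traversal payload

# A shared vocabulary maps every payload onto one comparable count vector.
vocab = sorted(set(benign) | set(attack))
vec_benign = [benign[b] for b in vocab]
vec_attack = [attack[b] for b in vocab]
```

Telltale units such as ".." appear only in the traversal payload, which is exactly the kind of signal a downstream method like RFA can then pick out of the vectors.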
... Also, identifying the metrics required to provide quality result is important in any application. In a similar way, authors in [28] proposed metrics such as goodness, stability and similarity to select features from the feature set. These metrics are applied on various feature selection methods on all 10 datasets of Cambridge University. ...
... The quality of the feature set directly affects the detection performance. Although many researchers have worked on this problem in recent years [6]-[8], designing an appropriate traffic feature set is still an unresolved research topic. The emergence of deep learning is an effective way to solve the feature design problem of traditional machine learning, because it can automatically learn features directly from the original data, avoiding the problem of artificially designed features [9]. ...
Article
Full-text available
As an essential part of network-based intrusion detection systems (IDS), malicious traffic detection using deep learning methods has become a research focus in network intrusion detection. However, even the most advanced available IDS struggle to satisfy real-time detection, because they usually need to accumulate packets into particular flows and then extract features, causing processing delays. In this paper, using a deep learning approach, we propose a deep hierarchical network for malicious traffic detection at the packet level, capable of learning traffic features from raw packet data. It uses a one-dimensional convolutional layer to extract the spatial features of raw packets and a Gated Recurrent Unit (GRU) structure to extract the temporal features. To evaluate the performance of our approach, experiments were conducted to examine the efficiency of the proposed deep hierarchical network on the ISCX2012, USTC-TFC2016, and CICIDS2017 datasets, respectively. Accuracy (ACC), detection rate (DR), and false alarm rate (FAR) are the evaluation metrics. On the ISCX2012 dataset, our approach achieved 99.42%, 99.74%, and 1.77% on ACC, DR, and FAR, respectively. On USTC-TFC2016, the corresponding values were 99.94%, 99.99%, and 0.99%; on CICIDS2017, they were 100%, 100%, and 0%. Furthermore, we discussed the impact of data balancing on classification performance and the time efficiency of the Long Short-Term Memory (LSTM) model versus the GRU model. Experiments show that our approach can effectively detect malicious traffic and outperforms many other state-of-the-art methods in terms of ACC and DR.
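The "conv-then-GRU" hierarchy can be illustrated with a bare-bones forward pass in NumPy: a 1-D convolution slides over raw packet bytes to produce per-position spatial features, and a GRU cell then summarizes those positions into one state vector. Weights are random and all sizes are illustrative, so this only demonstrates the data flow, not the trained model of the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
T, C = 64, 8                 # packet length (bytes) and conv output channels
x = rng.random(T)            # one "raw packet", byte values scaled to [0, 1)

# 1-D convolution (valid padding) + ReLU: spatial features per position.
k = 5
W_conv = rng.normal(size=(C, k)) * 0.1
conv = np.array([[W_conv[c] @ x[t:t + k] for t in range(T - k + 1)]
                 for c in range(C)])          # shape (C, T - k + 1)
feat = np.maximum(conv, 0.0).T                # one C-dim vector per position

# GRU cell unrolled over the conv positions: temporal summary.
H = 4
Wz, Wr, Wn = (rng.normal(size=(H, C + H)) * 0.1 for _ in range(3))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

h = np.zeros(H)
for f in feat:
    zr_in = np.concatenate([f, h])
    z = sigmoid(Wz @ zr_in)                   # update gate
    r = sigmoid(Wr @ zr_in)                   # reset gate
    n_ = np.tanh(Wn @ np.concatenate([f, r * h]))
    h = (1 - z) * h + z * n_                  # new hidden state
```

The final `h` would feed a classification head; a real implementation would of course use a trained framework model rather than random NumPy weights.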
... A rough set method is proposed in [16] which not only reduces the dimensionality but also improves the accuracy and computing performance of the classifier. Weighted Symmetrical Uncertainty with Area Under the ROC Curve (WSU_AUC) is proposed in [17] to select stable and robust features. To study the merits and demerits of proposed feature selection methods, different metrics such as goodness, stability, and similarity are proposed in [18]. ...
Article
Full-text available
Network traffic classification is a core part of network traffic management. Network management is a critical task, since new applications are emerging every moment and the number of Internet users keeps increasing. Because of this, Internet traffic classification is needed for smooth management of the Internet by Internet service providers (ISPs). Network traffic can be classified based on port, payload, and statistical approaches. In the proposed work, a novel method to represent Internet traffic data is proposed, based on clustering of feature vectors using Multiple Kernel Fuzzy C-Means (MKFCM). Further, the feature vector of each cluster is used to build an interval-valued (symbolic) representation using the mean and standard deviation. These interval-valued features are stored in a knowledge base as representatives of the clusters. Further, to classify the symbolic interval data, we use a symbolic classifier. To validate the effectiveness of the proposed model, experimentation is conducted on the standard Cambridge University Internet traffic dataset. The proposed symbolic classifier is compared with other existing classifiers such as the Naïve Bayes, KNN, and SVM classifiers. The experimental results show that the proposed symbolic-representation classifier performs better than the other classifiers.
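The interval-valued (symbolic) representation is simple to sketch: each cluster is summarized per feature as [mean - std, mean + std], and a sample is assigned to the cluster whose intervals contain the most of its feature values. For brevity the clusters below are given directly rather than produced by fuzzy C-means, and all numbers are invented.

```python
import numpy as np

# Two toy clusters of 3-dimensional feature vectors.
rng = np.random.default_rng(6)
cluster_a = rng.normal(0.0, 1.0, size=(50, 3))
cluster_b = rng.normal(5.0, 1.0, size=(50, 3))

def intervals(c):
    # Symbolic representative: per-feature interval [mean - std, mean + std].
    mu, sd = c.mean(0), c.std(0)
    return np.stack([mu - sd, mu + sd])       # shape (2, n_features)

reps = [intervals(cluster_a), intervals(cluster_b)]   # the "knowledge base"

def classify(x):
    # Symbolic classifier: count how many features fall inside each
    # cluster's intervals and pick the cluster with the most hits.
    hits = [np.sum((lo <= x) & (x <= hi)) for lo, hi in reps]
    return int(np.argmax(hits))

label = classify(np.array([5.1, 4.8, 5.3]))
```

The appeal of this representation is that each cluster collapses to a tiny pair of vectors, so classification against the knowledge base is cheap regardless of cluster size.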
Article
Full-text available
K-Nearest Neighbour (K-NN) is one of the popular classification algorithms; in this research, K-NN is used to classify Internet traffic. K-NN is appropriate for huge amounts of data and gives accurate classification, but it has a disadvantage in the computation process because it calculates the distance to all existing data in the dataset. Clustering is one solution to overcome this K-NN weakness; the clustering process should be done before the K-NN classification process, and grouping data with the same characteristics does not require high computing time. Fuzzy C-Means is the clustering algorithm used in this research. The Fuzzy C-Means algorithm does not require the number of clusters to be determined first; the clusters form naturally based on the dataset entered. Fuzzy C-Means has the weakness that the clustering results obtained are frequently not the same even when the input dataset is the same, because the initial dataset given to Fuzzy C-Means is less than optimal; to optimize the initial dataset, a feature selection algorithm is needed. Feature selection is a method to produce an optimal initial dataset for Fuzzy C-Means. The feature selection algorithm in this research is Principal Component Analysis (PCA). PCA can reduce non-significant attributes or features to create an optimal dataset and can improve the performance of clustering and classification algorithms. The result of this research is that the combined method of classification, clustering, and feature selection on the Internet traffic dataset successfully modeled an Internet traffic classification method with higher accuracy and faster performance. © 2017, Institute of Advanced Engineering and Science. All rights reserved.
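The PCA step of such a pipeline — project centered data onto the top principal components before clustering and K-NN — can be sketched via SVD in NumPy. The data, dimensions, and variance split are illustrative, not from the Internet traffic dataset.

```python
import numpy as np

# Synthetic dataset with one dominant high-variance direction.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))
X[:, 0] *= 10.0

# PCA via SVD of the centered data.
Xc = X - X.mean(0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 2
X_reduced = Xc @ Vt[:n_components].T   # reduced inputs for FCM / K-NN
explained = (S ** 2) / (S ** 2).sum()  # variance ratio per component
```

`X_reduced` would then be fed to Fuzzy C-Means and K-NN as the "optimized initial dataset"; the `explained` ratios guide how many components to keep.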
Chapter
This chapter reveals the new challenges that the researchers are finding in ensemble feature selection, most of them related with “Big Data” and some of its consequences, as the important rise in unsupervised learning, because unlabelled samples is the most common situation in large datasets; or the need for visualization, that is a challenge also shared between ensemble learning and feature selection. Although feature selection is a well-established preprocessing technique, during the last years it has experimented certain renaissance due to the fact that is almost mandatory for the new scenarios in which large and/or high-dimensional datasets are present. Thus, feature selection has been successfully applied lately in areas such as DNA microarray analysis, image classification, face recognition, and text classification. Ensemble feature selection is one of the new approaches to the field, in an attempt to obtain better performances and also design distributed FS schemes that allow for more effective process and higher efficiencies. This chapter outlines some of the latest challenges in the field of ensemble feature selection, aiming researchers at following the new paths that are opened for exploration. In Sect. 10.1 a brief Introduction to the need for ensemble feature selection is outlined. Section 10.2 reviews some of the fields in which feature selection, and more specifically feature selection ensembles have been used. To end the chapter, Sect. 10.3 enumerates some of the challenges that lie ahead for feature selection, and thus for the use of ensembles in this preprocessing step.
Article
To reduce the number of packets used in categorizing flows, we propose a new traffic classification method by investigating the relationships between flows instead of considering them individually. Based on the flow identities, we introduce seven types of relationships for a flow and a further Expanding Vector (EV) by searching relevant flows in a particular time window. The proposed Traffic Classification method based on Expanding Vector (TCEV) does not require an inspection of the detailed flow properties, and thus, it can be conducted with a linear complexity of the flow number. The experiments performed on real-world traffic data verify that our method (1) attains as high a performance as the representative methods, while significantly reducing the number of processed packets; (2) is robust against packet loss and the absence of flow direction; and (3) is capable of reaching higher accuracy in the recognition of TCP mice flows.
Article
Feature selection is the process of identifying and removing irrelevant and redundant features. Irrelevant features, along with redundant ones, severely affect the accuracy of learning machines. In high-dimensional spaces, finding clusters of data objects is challenging due to the curse of dimensionality: as dimensionality increases, data in the irrelevant dimensions may produce much noise. Moreover, time complexity is a major issue in existing approaches. To address these issues, the proposed method performs efficient feature subset selection on high-dimensional data, here a high-dimensional microarray dataset. First, the optimal features are selected using a Modified Social Spider Optimization (MSSO) algorithm, in which the traditional Social Spider Optimization is modified with the help of the fruit fly optimization algorithm. The selected features are then fed to the classifier: an Optimized Radial Basis Function Neural Network (ORBFNN) that classifies the microarray data as normal or abnormal, with the effectiveness of the RBFNN optimized by means of an artificial bee colony (ABC) algorithm. Experimental results indicate that the proposed classification framework outperforms existing techniques, achieving accuracies of 93.66%, 97.09%, 98.66%, 98.28%, and 98.93% on five benchmark datasets. The proposed method is implemented on the MATLAB platform.
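The select-then-classify pipeline described above can be approximated with standard components. The sketch below is a stand-in only: it uses univariate filtering in place of MSSO and an RBF-kernel SVM in place of the ORBFNN, on synthetic "microarray-like" data (few samples, many features):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic "microarray-like" data: few samples, many features.
X, y = make_classification(n_samples=120, n_features=500, n_informative=10,
                           random_state=0)

# Stage 1: filter down to a small feature subset (stand-in for MSSO).
# Stage 2: RBF-kernel classifier (stand-in for the ORBFNN).
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="rbf"))
acc = cross_val_score(pipe, X, y, cv=5).mean()
print(round(acc, 3))
```

Putting the selector inside the pipeline keeps the feature scoring inside each cross-validation fold, avoiding selection bias in the reported accuracy.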
Article
Full-text available
There are major discussions about the vulnerability of protocols based on Real Time Ethernet (RTE) and about techniques for detecting anomalies. This work therefore proposes a methodology for detecting anomalies by optimizing data extraction and by classifying traffic-related features. To this end, an ANN-based classifier is trained using selected relevant features. These features are extracted using a variable-sized sliding window and selected according to their correlation with the other features and with the expected output of the ANN. The number of relevant features can vary according to the performance indicators of the classifier. The proposed methodology was applied to identify four different events in PROFINET networks, and the performance of the classifier was considered successful in all cases. This outcome suggests that the proposed methodology may succeed at anomaly detection in any PROFINET network; its application to other RTE protocols is also foreseen.
Article
Despite increasing awareness of cyber-attacks against Critical Infrastructure (CI), safeguarding Supervisory Control and Data Acquisition (SCADA) systems remains inadequate. Designing an efficient SCADA Intrusion Detection System (IDS) to counter these attacks has therefore become a significant research topic. Most existing works present statistical and machine learning approaches to protect the SCADA network from cyber-attacks; however, these approaches fail to address the most common challenge, the "curse of dimensionality". This scenario accentuates the need for an efficient feature selection algorithm in a SCADA IDS, one that identifies the relevant features and eliminates the redundant ones without any loss of information. Hence, this paper proposes a novel filter-based feature selection approach for the identification of informative features based on Rough Set Theory and a Hyper-clique-based Binary Whale Optimization Algorithm (RST-HCBWoA). Experiments were carried out on a power system attack dataset, and the performance of RST-HCBWoA was evaluated in terms of reduct size, precision, recall, classification accuracy, and time complexity.
Article
With the increase of multimedia traffic, fast and accurate classification has become an important issue. Moreover, a manually captured dataset contains a certain amount of noise and mislabeled instances, which influences the accuracy of the classifier to some extent. Motivated by these observations, a novel feature selection and instance purification (FS&IP) method based on a consistency measure is proposed. It utilizes a linear consistency-constrained algorithm for feature selection and, in each round of iteration, removes the instances with minority labels in every pattern subset. The method has three desirable properties: 1) it can simultaneously achieve feature selection and data purification; 2) when purifying instances, it does not need to annotate the noisy instances with learned labels, because it is unsupervised in terms of data purification; 3) through data purification, it is able to obtain a minimal feature subset while maintaining accuracy. In addition, the proposed method can be used to discover a new discriminative feature based on linking behaviors, called the flow fragment (F−Frag), which can reflect important information among the complex and multitudinous packet communication behaviors. Experimental results over six different datasets demonstrate the advantages of the proposed technique compared to six existing methods, as well as the discriminative power of the new flow fragment feature.
Thesis
Full-text available
The fast-paced evolution of the Internet is drawing a complex context which imposes demanding requirements to assure Quality of Service. The development of advanced intelligent approaches in networking envisions features that include autonomous resource allocation, fast reaction against unexpected network events, and so on. Internet Network Traffic Classification constitutes a crucial source of information in Network Management, and its importance in assisting these novel paradigms is apparent. Monitoring traffic flowing through network devices facilitates tasks such as network orchestration, traffic prioritization, network arbitration, and cyberthreat detection, amongst others. Traditional traffic classifiers have become obsolete owing to the rapid evolution of the Internet: port-based classifiers suffer significant accuracy losses due to port masking, while Deep Packet Inspection approaches have severe user-privacy limitations. The advent of Machine Learning has propelled the application of advanced algorithms in many diverse research areas, and some learning approaches have proved an interesting alternative to the former traffic classification approaches. Addressing Network Traffic Classification from a Machine Learning perspective implies numerous challenges requiring research efforts to achieve feasible traffic classifiers. In this dissertation, we endeavor to formulate and solve important research questions in Machine-Learning-based Network Traffic Classification. As a result of numerous experiments, the knowledge provided in this research constitutes an engaging case study in which network traffic data from two different environments are successfully collected, processed, and modeled. First, we approached the Feature Extraction and Selection processes, providing our own contributions.
A Feature Extractor was designed to create Machine-Learning-ready datasets from real traffic data, and a Feature Selection Filter based on fast correlation is proposed and tested on several classification datasets. The original Network Traffic Classification datasets are then reduced using our Selection Filter to provide efficient classification models. Many classification models based on CART Decision Trees were analyzed, exhibiting excellent outcomes in identifying Internet traffic. The experiments presented in this research comprise a comparison among ensemble learning schemes, an exploratory study on Class Imbalance and its solutions, and an analysis of IP-header predictors for early traffic classification. This thesis is presented as a compendium of JCR-indexed scientific manuscripts, and one conference paper is also included. Through the present work, we have studied a wide number of learning approaches, adopting the most advanced methodology in Machine Learning. As a result, we were able to identify the strengths and weaknesses of these algorithms, providing our own solutions to overcome the observed limitations. In short, this thesis shows that Machine Learning offers interesting advanced techniques that open prominent prospects in Internet Network Traffic Classification.
Article
This paper presents a way to support knowledge building, analytical reasoning, and exploratory analysis methods in visual text analysis. To this end, a parameterized semantic network model is derived automatically from unstructured text data. Semantic network analysis methods are used to obtain quantitative and qualitative insights, and qualitative analysis combined with quantitative indicators supports exploration of the semantic structure. The basic theoretical assumptions about text modelling for semantic network analysis are discussed. For a systematic overview, the essential network elements and their qualitative meaning are demonstrated, supporting the analyst in understanding the meaning of a given network, along with possible exploration strategies. As a proof of concept, the proposed method is applied to a visual survey and analysis of the semantic network of a typical Wikipedia article, using a visual text analysis system.
Article
Traffic classification groups similar or related traffic data, and is a mainstream data fusion technique in the field of network management and security. With the rapid growth of network users and the emergence of new networking services, network traffic classification has attracted increasing attention, and many new traffic classification techniques have been developed and widely applied. However, the existing literature lacks a thorough survey that summarizes, compares, and analyzes the recent advances in network traffic classification from a holistic perspective. This paper carefully reviews existing network traffic classification methods from a new and comprehensive perspective by classifying them into five categories based on representative classification features: statistics-based, correlation-based, behavior-based, payload-based, and port-based classification. A series of criteria are proposed for evaluating the performance of existing traffic classification methods. For each category, we analyze and discuss the details, advantages, and disadvantages of its existing methods, and also present the traffic features commonly used. Summaries of the investigation are offered to provide a holistic and specialized view of the state of the art. For convenience, we also include a discussion of the most commonly used datasets and the traffic features adopted for traffic classification. Finally, we identify a list of open issues and future directions in this research field.
Article
In the Industrial Internet of Things (IIoT) in the 5G era, the growth of smart devices will generate a large amount of data traffic, bringing a huge challenge for network traffic classification, which is a prerequisite of IIoT traffic engineering, quality of service (QoS), cyberspace security, etc. It is difficult for current traffic classification methods to distinguish encrypted dataflows and to design effective handcrafted features. In this paper, a novel identification scheme for encrypted traffic, TSCRNN, is proposed to automatically extract features for efficient traffic classification based on spatiotemporal features. TSCRNN comprises a preprocessing phase and a classification phase. In the preprocessing phase, raw traffic data are processed with flow segmentation, sampling, vectorization, etc.; to handle long-lived flows, sampling strategies collect samples from the middle of the flow. In the classification phase, TSCRNN extracts abstract spatial features with a CNN and then introduces a stacked bidirectional LSTM to learn the temporal characteristics. Experiments were performed on the dataset ISCXTor2016. The results show that TSCRNN outperforms other typical methods in all scenarios, achieving accuracies of up to 99.4% and 95.0% in the Tor/non-Tor binary classification task and the sixteen-class task, respectively. Furthermore, TSCRNN applied to other real network datasets obtained satisfactory performance, which validates its feasibility and universality. This means that TSCRNN can effectively identify encrypted and anonymous traffic and provide a fine-grained traffic characterization mechanism, which will support the development of core technologies in the Industrial Internet of Things.
Article
Intrusion detection has drawn considerable interest as researchers endeavor to produce efficient models that offer high detection accuracy. Nevertheless, the challenge remains in developing a reliable and efficient Intrusion Detection System (IDS) that is capable of handling large amounts of data, with trends evolving in real-time circumstances. The design of such a system relies on the detection methods used, particularly the feature selection techniques and machine learning algorithms. Thus motivated, this paper presents a review of feature selection and ensemble techniques used in anomaly-based IDS research. Dimensionality reduction methods are reviewed, followed by a categorization of feature selection techniques to illustrate their effectiveness on the training phase and on detection. Selecting the most relevant features in the data has been proven to increase detection efficiency in terms of accuracy and computational cost, hence its important role in the design of an anomaly-based IDS. We then analyze and discuss a variety of IDS-oriented machine learning techniques with various detection models (single-classifier-based or ensemble-based), to illustrate their significance and success in the intrusion detection area. Besides the supervised and unsupervised learning methods of machine learning, ensemble methods combine several base models to produce one optimal predictive model and improve the accuracy of the IDS. The review consequently focuses on ensemble techniques employed in anomaly-based IDS models and illustrates how their use improves performance. Finally, the paper discusses open issues in the area and offers research trends to be considered in designing efficient anomaly-based IDSs.
Article
With the rapid growth of network bandwidth, traffic identification is an important challenge for network management and security. In recent years, packet sampling has been widely used in most network management systems. In this paper, in order to improve the accuracy of network traffic identification, sampled NetFlow data are applied to traffic identification, and the impact of packet sampling on the accuracy of the identification method is studied. This study includes feature selection, a metric correlation analysis of application behavior, and a traffic identification algorithm. Theoretical analysis and experimental results show that the significance of behavior characteristics becomes lower in a packet sampling environment, and that the correlation analysis exhibits different trends for different features. However, as long as the number of flows meets the statistical requirement, the feature selection and the correlation degree are independent of the sampling ratio. At high sampling ratios, where less effective information is available, identification accuracy is much lower than with unsampled packets. Finally, to improve identification accuracy, we propose a Deep Belief Networks Application Identification (DBNAI) method, which achieves better classification performance than other state-of-the-art methods.
Article
Full-text available
We develop a face recognition algorithm which is insensitive to large variation in lighting direction and facial expression. Taking a pattern classification approach, we consider each pixel in an image as a coordinate in a high-dimensional space. We take advantage of the observation that the images of a particular face, under varying illumination but fixed pose, lie in a 3D linear subspace of the high-dimensional image space, provided the face is a Lambertian surface without shadowing. However, since faces are not truly Lambertian surfaces and do indeed produce self-shadowing, images will deviate from this linear subspace. Rather than explicitly modeling this deviation, we linearly project the image into a subspace in a manner which discounts those regions of the face with large deviation. Our projection method is based on Fisher's linear discriminant and produces well-separated classes in a low-dimensional subspace, even under severe variation in lighting and facial expressions. The eigenface technique, another method based on linearly projecting the image space to a low-dimensional subspace, has similar computational requirements. Yet, extensive experimental results demonstrate that the proposed "Fisherface" method has error rates that are lower than those of the eigenface technique for tests on the Harvard and Yale face databases.
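The projection step of the Fisherface approach, Fisher's linear discriminant mapping high-dimensional pixel vectors into at most (number of classes − 1) dimensions, can be sketched as follows. This uses scikit-learn's digits images as a convenient stand-in for a face database:

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Each image is flattened so every pixel is one coordinate in a
# high-dimensional space (64-D here; face images would be far larger).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fisher's linear discriminant projects into at most (n_classes - 1)
# dimensions while keeping the classes well separated.
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
Z = lda.transform(X_tr)
print(Z.shape[1], round(lda.score(X_te, y_te), 3))
```

With 10 classes the projected space has 9 dimensions, a drastic reduction from the 64-D pixel space, while classification in the subspace remains accurate.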
Conference Paper
Full-text available
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment, over half a million runs of C4.5 and a Naive-Bayes algorithm, to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
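The two estimators compared in the paper can be sketched side by side. This is a minimal illustration on synthetic data, not a reproduction of the half-million-run experiment:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import resample

X, y = make_classification(n_samples=200, random_state=0)
clf = GaussianNB()

# Ten-fold stratified cross-validation (the estimator the paper recommends).
cv_acc = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10)).mean()

# One bootstrap round: train on a resample with replacement, then test on
# the out-of-bag points that the resample missed.
idx = resample(np.arange(len(y)), random_state=0)
oob = np.setdiff1d(np.arange(len(y)), idx)
boot_acc = clf.fit(X[idx], y[idx]).score(X[oob], y[oob])
print(round(cv_acc, 3), round(boot_acc, 3))
```

Stratification keeps the class proportions equal across folds, which is the property the paper finds most helpful for model selection.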
Article
Full-text available
Traffic classification technology has increased in relevance this decade, as it is now used in the definition and implementation of mechanisms for service differentiation, network design and engineering, security, accounting, advertising, and research. Over the past 10 years the research community and the networking industry have investigated, proposed and developed several classification approaches. While traffic classification techniques are improving in accuracy and efficiency, the continued proliferation of different Internet application behaviors, in addition to growing incentives to disguise some applications to avoid filtering or blocking, are among the reasons that traffic classification remains one of many open problems in Internet research. In this article we review recent achievements and discuss future directions in traffic classification, along with their trade-offs in applicability, reliability, and privacy. We outline the persistently unsolved challenges in the field over the last decade, and suggest several strategies for tackling these challenges to promote progress in the science of Internet traffic classification.
Article
Full-text available
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines for selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta-algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing the details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
Article
Full-text available
Today, the most accurate steganalysis methods for digital media are built as supervised classifiers on feature vectors extracted from the media. The tool of choice for the machine learning seems to be the support vector machine (SVM). In this paper, we propose an alternative and well-known machine learning tool, ensemble classifiers, and argue that they are ideally suited for steganalysis. Ensemble classifiers scale much more favorably w.r.t. the number of training examples and the feature dimensionality, with performance comparable to the much more complex SVMs. The significantly lower training complexity opens up the possibility for the steganalyst to work with rich (high-dimensional) cover models and train on larger training sets, two key elements that appear necessary to reliably detect modern steganographic algorithms. Ensemble classification is portrayed here as a powerful developer tool that allows fast construction of steganography detectors with markedly improved detection accuracy across a wide range of embedding methods. The power of the proposed framework is demonstrated on two steganographic methods that hide messages in JPEG images.
Chapter
Full-text available
We develop a face recognition algorithm which is insensitive to gross variation in lighting direction and facial expression. Taking a pattern classification approach, we consider each pixel in an image as a coordinate in a high-dimensional space. We take advantage of the observation that the images of a particular face under varying illumination direction lie in a 3-D linear subspace of the high dimensional feature space — if the face is a Lambertian surface without self-shadowing. However, since faces are not truly Lambertian surfaces and do indeed produce self-shadowing, images will deviate from this linear subspace. Rather than explicitly modeling this deviation, we project the image into a subspace in a manner which discounts those regions of the face with large deviation. Our projection method is based on Fisher's Linear Discriminant and produces well separated classes in a low-dimensional subspace even under severe variation in lighting and facial expressions. The Eigenface technique, another method based on linearly projecting the image space to a low dimensional subspace, has similar computational requirements. Yet, extensive experimental results demonstrate that the proposed Fisherface method has error rates that are significantly lower than those of the Eigenface technique when tested on the same database.
Article
Full-text available
This paper presents a novel cluster-oriented ensemble classifier. The proposed ensemble classifier is based on original concepts such as learning of cluster boundaries by the base classifiers and mapping of cluster confidences to class decision using a fusion classifier. The categorized data set is characterized into multiple clusters and fed to a number of distinctive base classifiers. The base classifiers learn cluster boundaries and produce cluster confidence vectors. A second level fusion classifier combines the cluster confidences and maps to class decisions. The proposed ensemble classifier modifies the learning domain for the base classifiers and facilitates efficient learning. The proposed approach is evaluated on benchmark data sets from UCI machine learning repository to identify the impact of multicluster boundaries on classifier learning and classification accuracy. The experimental results and two-tailed sign test demonstrate the superiority of the proposed cluster-oriented ensemble classifier over existing ensemble classifiers published in the literature.
Article
Full-text available
Temporal data clustering provides underpinning techniques for discovering the intrinsic structure and condensing information over temporal data. In this paper, we present a temporal data clustering framework via a weighted clustering ensemble of multiple partitions produced by initial clustering analysis on different temporal data representations. In our approach, we propose a novel weighted consensus function guided by clustering validation criteria to reconcile initial partitions to candidate consensus partitions from different perspectives, and then introduce an agreement function to further reconcile those candidate consensus partitions to a final partition. As a result, the proposed weighted clustering ensemble algorithm provides an effective enabling technique for the joint use of different representations, which cuts the information loss in a single representation and exploits various information sources underlying temporal data. In addition, our approach tends to capture the intrinsic structure of a data set, e.g., the number of clusters. Our approach has been evaluated with benchmark time series, motion trajectory, and time-series data stream clustering tasks. Simulation results demonstrate that our approach yields favorable results for a variety of temporal data clustering tasks. As our weighted cluster ensemble algorithm can combine any input partitions to generate a clustering ensemble, we also investigate its limitations by formal analysis and empirical studies.
Article
Full-text available
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and Naive-Bayes.
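The wrapper idea, searching feature subsets by repeatedly scoring them with the induction algorithm itself, can be sketched with scikit-learn's sequential forward selection and a decision tree. This is a generic stand-in for the paper's search strategies, on the UCI wine dataset:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Wrapper search: candidate subsets are scored by cross-validating the
# induction algorithm itself, so the subset is tailored to this learner.
sfs = SequentialFeatureSelector(tree, n_features_to_select=4, cv=5).fit(X, y)
X_sub = sfs.transform(X)

full = cross_val_score(tree, X, y, cv=5).mean()
wrapped = cross_val_score(tree, X_sub, y, cv=5).mean()
print(round(full, 3), round(wrapped, 3))
```

Because the subset is chosen with the same learner that will use it, a wrapper captures the algorithm/training-set interaction the paper emphasizes, at the cost of many more model fits than a filter.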
Conference Paper
Full-text available
Accurate traffic classification is of fundamental importance to numerous other network activities, from security monitoring to accounting, and from Quality of Service to providing operators with useful forecasts for long-term provisioning. We apply a Naïve Bayes estimator to categorize traffic by application. Uniquely, our work capitalizes on hand-classified network data, using it as input to a supervised Naïve Bayes estimator. In this paper we illustrate the high level of accuracy achievable with the Naïve Bayes estimator, and further illustrate the improved accuracy of refined variants of this estimator. Our results indicate that with the simplest Naïve Bayes estimator we are able to achieve about 65% accuracy on per-flow classification, and with two powerful refinements we can improve this value to better than 95%; this is a vast improvement over traditional techniques that achieve 50-70%. While our technique uses training data, with categories derived from packet content, all of our training and testing was done using header-derived discriminators. We emphasize this as a powerful aspect of our approach: using samples of well-known traffic to allow the categorization of traffic using commonly available information alone.
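The core of the approach, a supervised Naïve Bayes estimator fed with header-derived discriminators, can be sketched on synthetic flow features. The two "application" classes and their feature distributions below are hypothetical, invented only to make the example self-contained:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 500

# Hypothetical header-derived discriminators (mean packet size in bytes,
# mean inter-arrival time in seconds) for two invented application classes.
bulk = np.column_stack([rng.normal(1400, 100, n), rng.normal(0.01, 0.005, n)])
chat = np.column_stack([rng.normal(200, 80, n), rng.normal(0.5, 0.2, n)])
X = np.vstack([bulk, chat])
y = np.array([0] * n + [1] * n)

# Supervised Naïve Bayes: train on "hand-classified" flows, predict per flow.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
print(round(acc, 3))
```

The Gaussian class-conditional assumption here corresponds to the kernel-density refinements the paper discusses; real flow features are far less cleanly separated than this toy data.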
Conference Paper
Full-text available
We investigate potential simulation artifacts and their effects on the evaluation of network anomaly detection systems in the 1999 DARPA/MIT Lincoln Laboratory off-line intrusion detection evaluation data set. A statistical comparison of the simulated background and training traffic with real traffic collected from a university departmental server suggests the presence of artifacts that could allow a network anomaly detection system to detect some novel intrusions based on idiosyncrasies of the underlying implementation of the simulation, with an artificially low false alarm rate. The evaluation problem can be mitigated by mixing real traffic into the simulation. We compare five anomaly detection algorithms on simulated and mixed traffic. On mixed traffic they detect fewer attacks, but the explanations for these detections are more plausible.
Conference Paper
Full-text available
Feature selection, as a preprocessing step to machine learning, has been effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving comprehensibility. However, the recent increase of dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this work, we introduce a novel concept, predominant correlation, and propose a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of our method is demonstrated through extensive comparisons with other methods using real-world data of high dimensionality.
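The predominant-correlation idea builds on symmetrical uncertainty, SU(X, Y) = 2·[H(X) + H(Y) − H(X, Y)] / [H(X) + H(Y)]. A minimal sketch of that measure on toy discrete features (not the full FCBF redundancy-elimination loop) might look like:

```python
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy of a discrete sample, in bits.
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    # SU(X, Y) = 2 * [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)], in [0, 1].
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    return 2 * (hx + hy - hxy) / (hx + hy) if hx + hy else 0.0

# Toy discrete features: f1 determines the class, f2 is pure noise.
cls = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 0, 1, 1, 1, 1]   # perfectly relevant
f2 = [0, 1, 0, 1, 0, 1, 0, 1]   # irrelevant
su_rel = symmetrical_uncertainty(f1, cls)
su_irr = symmetrical_uncertainty(f2, cls)
print(su_rel, su_irr)   # relevant feature scores 1.0, noise scores 0.0
```

In the full method, a feature is kept only when its SU with the class is not dominated by its SU with an already-selected feature, which is what removes redundancy without pairwise analysis of all feature pairs.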
Conference Paper
Full-text available
Recent research on Internet traffic classification algorithms has yielded a flurry of proposed approaches for distinguishing types of traffic, but no systematic comparison of the various algorithms. This fragmented approach to traffic classification research leaves the operational community with no basis for consensus on what approach to use when, and how to interpret results. In this work we critically revisit traffic classification by conducting a thorough evaluation of three classification approaches, based on transport-layer ports, host behavior, and flow features. A strength of our work is the broad range of data against which we test the three classification approaches: seven traces with payload collected in Japan, Korea, and the US. The diverse geographic locations, link characteristics, and application traffic mix in these data allowed us to evaluate the approaches under a wide variety of conditions. We analyze the advantages and limitations of each approach, evaluate methods to overcome the limitations, and extract insights and recommendations for both the study and practical application of traffic classification. We make our software, classifiers, and data available for researchers interested in validating or extending this work.
Conference Paper
Full-text available
We present a fundamentally different approach to classifying traffic flows according to the applications that generate them. In contrast to previous methods, our approach is based on observing and identifying patterns of host behavior at the transport layer. We analyze these patterns at three levels of increasing detail (i) the social, (ii) the functional and (iii) the application level. This multilevel approach of looking at traffic flow is probably the most important contribution of this paper. Furthermore, our approach has two important features. First, it operates in the dark, having (a) no access to packet payload, (b) no knowledge of port numbers and (c) no additional information other than what current flow collectors provide. These restrictions respect privacy, technological and practical constraints. Second, it can be tuned to balance the accuracy of the classification versus the number of successfully classified traffic flows. We demonstrate the effectiveness of our approach on three real traces. Our results show that we are able to classify 80%-90% of the traffic with more than 95% accuracy.
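The "social" and "functional" levels of host behavior can be illustrated by aggregating flow records per host. The flow records below are toy values invented for the sketch:

```python
from collections import defaultdict

# Toy flow records (src_host, dst_host, dst_port), invented for this sketch.
flows = [
    ("A", "S1", 80), ("A", "S2", 80), ("A", "S3", 80),      # web-like client
    ("B", "P1", 411), ("B", "P2", 622), ("B", "P3", 8081),  # p2p-like spread
]

peers = defaultdict(set)   # social level: who does each host talk to?
ports = defaultdict(set)   # functional level: on which service ports?
for src, dst, dport in flows:
    peers[src].add(dst)
    ports[src].add(dport)

# Behavior-level summary: many peers over a single port suggests a client
# of one well-defined service; many peers over many ports suggests p2p.
for host in sorted(peers):
    print(host, len(peers[host]), len(ports[host]))
```

Note that only flow-record identities are used, no payload and no port semantics, which matches the "in the dark" constraint the abstract describes.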
Conference Paper
Full-text available
Traffic classification is the ability to identify and categorize network traffic by application type. In this paper, we consider the problem of traffic classification in the network core. Classification at the core is challenging because only partial information about the flows and their contributors is available. We address this problem by developing a framework that can classify a flow using only unidirectional flow information. We evaluated this approach using recent packet traces that we collected and pre-classified to establish a "base truth". From our evaluation, we find that flow statistics for the server-to-client direction of a TCP connection provide greater classification accuracy than the flow statistics for the client-to-server direction. Because collection of the server-to-client flow statistics may not always be feasible, we developed and validated an algorithm that can estimate the missing statistics from a unidirectional packet trace.
Conference Paper
Full-text available
An accurate mapping of traffic to applications is important for a broad range of network management and measurement tasks. Internet applications have traditionally been identified using well-known default server network-port numbers in the TCP or UDP headers. However this approach has become increasingly inaccurate. An alternate, more accurate technique is to use specific application-level features in the protocol exchange to guide the identification. Unfortunately deriving the signatures manually is very time consuming and difficult.In this paper, we explore automatically extracting application signatures from IP traffic payload content. In particular we apply three statistical machine learning algorithms to automatically identify signatures for a range of applications. The results indicate that this approach is highly accurate and scales to allow online application identification on high speed links. We also discovered that content signatures still work in the presence of encryption. In these cases we were able to derive content signature for unencrypted handshakes negotiating the encryption parameters of a particular connection.
Conference Paper
Full-text available
Classification of network traffic using port-based or payload-based analysis is becoming increasingly difficult with many peer-to-peer (P2P) applications using dynamic port numbers, masquerading techniques, and encryption to avoid detection. An alternative approach is to classify traffic by exploiting the distinctive characteristics of applications when they communicate on a network. We pursue this latter approach and demonstrate how cluster analysis can be used to effectively identify groups of traffic that are similar using only transport layer statistics. Our work considers two unsupervised clustering algorithms, namely K-Means and DBSCAN, that have previously not been used for network traffic classification. We evaluate these two algorithms and compare them to the previously used AutoClass algorithm, using empirical Internet traces. The experimental results show that both K-Means and DBSCAN work very well and much more quickly than AutoClass. Our results indicate that although DBSCAN has lower accuracy compared to K-Means and AutoClass, DBSCAN produces better clusters.
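The clustering setup this abstract describes can be sketched with scikit-learn. Everything below is illustrative: the two synthetic "applications", the transport-layer feature columns, and the K-Means/DBSCAN parameter values are assumptions, not the paper's traces or settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic "applications" described by transport-layer statistics
# (columns: mean packet size, flow duration, packets per flow).
web  = rng.normal([ 500,  1.0,  10], [50, 0.2,  2], size=(100, 3))
bulk = rng.normal([1400, 30.0, 500], [50, 5.0, 50], size=(100, 3))
X = StandardScaler().fit_transform(np.vstack([web, bulk]))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dbscan = DBSCAN(eps=0.8, min_samples=5).fit(X)

print("K-Means cluster sizes:", np.bincount(kmeans.labels_))
print("DBSCAN clusters found:", len(set(dbscan.labels_) - {-1}))
```

One practical difference visible even in this toy: K-Means must be told the number of clusters, while DBSCAN discovers it from density and labels outliers as noise (-1).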
Conference Paper
Full-text available
Well-known port numbers can no longer be used to reliably identify network applications. There is a variety of new Internet applications that either do not use well-known port numbers or use other protocols, such as HTTP, as wrappers in order to go through firewalls without being blocked. One consequence of this is that a simple inspection of the port numbers used by flows may lead to the inaccurate classification of network traffic. In this work, we look at these inaccuracies in detail. Using a full payload packet trace collected from an Internet site we attempt to identify the types of errors that may result from port-based classification and quantify them for the specific trace under study. To address this question we devise a classification methodology that relies on the full packet payload. We describe the building blocks of this methodology and elaborate on the complications that arise in that context. A classification technique approaching 100% accuracy proves to be a labor-intensive process that needs to test flow characteristics against multiple classification criteria in order to gain sufficient confidence in the nature of the causal application. Nevertheless, the benefits gained from a content-based classification approach are evident. We are capable of accurately classifying what would otherwise be classified as unknown as well as identifying traffic flows that could otherwise be classified incorrectly. Our work opens up multiple research issues that we intend to address in future work.
Conference Paper
Full-text available
Accurate traffic classification is the keystone of numerous network activities. Our work capitalises on hand-classified network data, used as input to a supervised Naïve Bayes estimator. We illustrate the high level of accuracy achieved with a supervised Naïve Bayes estimator; with the simplest estimator we are able to achieve better than 83% accuracy on both a per-byte and a per-packet basis.
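A minimal supervised Naïve Bayes flow classifier in this spirit can be sketched with scikit-learn's GaussianNB. The per-flow features (mean inter-arrival time, mean packet size), the class labels, and the synthetic data are all stand-ins for the paper's hand-classified traces.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
# Synthetic flows: (mean inter-arrival time in s, mean packet size in bytes).
www = np.column_stack([rng.normal(0.05, 0.01, 300), rng.normal(600, 80, 300)])
p2p = np.column_stack([rng.normal(0.50, 0.10, 300), rng.normal(1200, 100, 300)])
X = np.vstack([www, p2p])
y = np.array(["WWW"] * 300 + ["P2P"] * 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = GaussianNB().fit(X_tr, y_tr)          # supervised training on labeled flows
print(f"per-flow accuracy: {clf.score(X_te, y_te):.2f}")
```

Note that this computes per-flow accuracy; the paper also reports per-byte and per-packet accuracy, which would require weighting each flow by its volume.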
Article
Full-text available
The identification of network applications through observation of associated packet traffic flows is vital to the areas of network management and surveillance. Currently popular methods such as port number and payload-based identification exhibit a number of shortfalls. An alternative is to use machine learning (ML) techniques and identify network applications based on per-flow statistics, derived from payload-independent features such as packet length and inter-arrival time distributions. The performance impact of feature set reduction, using Consistency-based and Correlation-based feature selection, is demonstrated on Naïve Bayes, C4.5, Bayesian Network and Naïve Bayes Tree algorithms. We then show that it is useful to differentiate algorithms based on computational performance rather than classification accuracy alone, as although classification accuracy between the algorithms is similar, computational performance can differ significantly.
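The feature-set-reduction experiment described here can be sketched in miniature: compare a classifier's cross-validated accuracy on all features versus a filtered subset. scikit-learn does not ship Consistency-based or Correlation-based feature selection, so `SelectKBest(f_classif)` stands in as the filter, and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# 40 features, only 5 informative — mimics redundant per-flow statistics.
X, y = make_classification(n_samples=600, n_features=40, n_informative=5,
                           n_redundant=10, random_state=0)

full = cross_val_score(GaussianNB(), X, y, cv=5).mean()
X_sel = SelectKBest(f_classif, k=8).fit_transform(X, y)
reduced = cross_val_score(GaussianNB(), X_sel, y, cv=5).mean()
print(f"all 40 features: {full:.3f}  |  top 8 features: {reduced:.3f}")
```

The point the abstract makes survives the simplification: with far fewer features, accuracy is usually comparable while training and classification cost drops, which is why computational performance is a useful axis for differentiating algorithms.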
Article
Full-text available
Recent research on Internet traffic classification has produced a number of approaches for distinguishing types of traffic. However, a rigorous comparison of such proposed algorithms still remains a challenge, since every proposal considers a different benchmark for its experimental evaluation. A lack of clear consensus on an objective and scientific way for comparing results has made researchers uncertain of fundamental as well as relative contributions and limitations of each proposal. In response to the growing necessity for an objective method of comparing traffic classifiers and to shed light on scientifically grounded traffic classification research, we introduce an Internet traffic classification benchmark tool, NeTraMark. Based on six design guidelines (Comparability, Reproducibility, Efficiency, Extensibility, Synergy, and Flexibility/Ease-of-use), NeTraMark is the first Internet traffic classification benchmark where eleven different state-of-the-art traffic classifiers are integrated. NeTraMark allows researchers and practitioners to easily extend it with new classification algorithms and compare them with other built-in classifiers, in terms of three categories of performance metrics: per-whole-trace flow accuracy, per-application flow accuracy, and computational performance.
Article
Full-text available
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Article
Full-text available
Recent studies have shown that microarray gene expression data are useful for phenotype classification of many diseases. A major problem in this classification is that the number of features (genes) greatly exceeds the number of instances (tissue samples). It has been shown that selecting a small set of informative genes can lead to improved classification accuracy. Many approaches have been proposed for this gene selection problem. Most of the previous gene ranking methods typically select 50-200 top-ranked genes and these genes are often highly correlated. Our goal is to select a small set of non-redundant marker genes that are most relevant for the classification task. To achieve this goal, we developed a novel hybrid approach that combines gene ranking and clustering analysis. In this approach, we first applied feature filtering algorithms to select a set of top-ranked genes, and then applied hierarchical clustering on these genes to generate a dendrogram. Finally, the dendrogram was analyzed by a sweep-line algorithm and marker genes were selected by collapsing dense clusters. Empirical study using three public datasets shows that our approach is capable of selecting relatively few marker genes while offering the same or better leave-one-out cross-validation accuracy compared with approaches that use top-ranked genes directly for classification. The HykGene software is freely available at http://www.cs.dartmouth.edu/~wyh/software.htm. Contact: wyh@cs.dartmouth.edu. Supplementary material is available from http://www.cs.dartmouth.edu/~wyh/hykgene/supplement/index.htm.
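The two-stage rank-then-cluster idea can be sketched as follows. The filter score (`f_classif`), the correlation-based distance, and the dendrogram cut threshold are illustrative choices, not HykGene's exact sweep-line algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=300, n_features=50, n_informative=6,
                           n_redundant=20, random_state=0)

# Stage 1: filter ranking — keep the 20 top-scored features.
scores, _ = f_classif(X, y)
top = np.argsort(scores)[::-1][:20]

# Stage 2: hierarchically cluster the survivors by |correlation| and keep
# the best-ranked feature of each cluster as a non-redundant "marker".
dist = np.clip(1.0 - np.abs(np.corrcoef(X[:, top], rowvar=False)), 0.0, None)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
markers = [int(top[np.argmax(scores[top] * (labels == c))])
           for c in np.unique(labels)]
print("selected marker features:", sorted(markers))
```

Because highly correlated features collapse into one cluster, the marker set is much smaller than the top-ranked list while still covering the informative directions.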
Article
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
Article
We present a fundamentally different approach to classifying traffic flows according to the applications that generate them. In contrast to previous methods, our approach is based on observing and identifying patterns of host behavior at the transport layer. We analyze these patterns at three levels of increasing detail (i) the social, (ii) the functional and (iii) the application level. This multilevel approach of looking at traffic flow is probably the most important contribution of this paper. Furthermore, our approach has two important features. First, it operates in the dark, having (a) no access to packet payload, (b) no knowledge of port numbers and (c) no additional information other than what current flow collectors provide. These restrictions respect privacy, technological and practical constraints. Second, it can be tuned to balance the accuracy of the classification versus the number of successfully classified traffic flows. We demonstrate the effectiveness of our approach on three real traces. Our results show that we are able to classify 80%-90% of the traffic with more than 95% accuracy.
Article
Malware and phishing website detection have become Internet security topics of great interest because of the damage both inflict. Compared with malware attacks, phishing website fraud is a relatively new Internet crime. However, they share some common properties: 1) both malware samples and phishing websites are created at a rate of thousands per day driven by economic benefits; and 2) phishing websites represented by the term frequencies of the webpage content share similar characteristics with malware samples represented by the instruction frequencies of the program. Over the past few years, many clustering techniques have been employed for automatic malware and phishing website detection. In these techniques, the detection process is generally divided into two steps: 1) feature extraction, where representative features are extracted to capture the characteristics of the file samples or the websites; and 2) categorization, where intelligent techniques are used to automatically group the file samples or websites into different classes based on computational analysis of the feature representations. However, few have been applied in real industry products. In this paper, we develop an automatic categorization system to automatically group phishing websites or malware samples using a cluster ensemble by aggregating the clustering solutions that are generated by different base clustering algorithms. We propose a principled cluster ensemble framework to combine individual clustering solutions that are based on the consensus partition, which can not only be applied for malware categorization, but also for phishing website clustering. In addition, the domain knowledge in the form of sample-level/website-level constraints can be naturally incorporated into the ensemble framework.
The case studies on large and real daily phishing websites and malware collection from the Kingsoft Internet Security Laboratory demonstrate the effectiveness and efficiency of our proposed method.
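A generic cluster-ensemble consensus along these lines can be sketched with a co-association matrix: several base clusterings vote on whether two samples belong together, and a final partition is extracted from the resulting similarity. This is an illustrative scheme on synthetic data, not the paper's constrained framework.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)

# Base clusterings: K-Means runs with different k and seeds all cast votes
# into a co-association matrix (fraction of runs grouping i and j together).
co = np.zeros((len(X), len(X)))
for k, seed in [(2, 0), (3, 1), (4, 2), (3, 3)]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= 4.0

# Consensus partition: hierarchical clustering on the co-association distance.
cond = squareform(1.0 - co, checks=False)
final = fcluster(linkage(cond, method="average"), t=3, criterion="maxclust")
print("consensus cluster sizes:", np.bincount(final)[1:])
```

The co-association step is what lets heterogeneous base algorithms be combined, since only their partition labels are used, never their internal models.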
Article
This paper presents a novel traffic classification scheme to improve classification performance when few training data are available. In the proposed scheme, traffic flows are described using the discretized statistical features and flow correlation information is modeled by bag-of-flow (BoF). We solve the BoF-based traffic classification in a classifier combination framework and theoretically analyze the performance benefit. Furthermore, a new BoF-based traffic classification method is proposed to aggregate the naive Bayes (NB) predictions of the correlated flows. We also present an analysis on prediction error sensitivity of the aggregation strategies. Finally, a large number of experiments are carried out on two large-scale real-world traffic datasets to evaluate the proposed scheme. The experimental results show that the proposed scheme can achieve much better classification performance than existing state-of-the-art traffic classification methods.
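The bag-of-flow (BoF) aggregation idea can be sketched as follows: flows believed to share an origin form a bag, per-flow Naive Bayes posteriors are aggregated, and the bag receives one label. Averaging posteriors is only one of the aggregation strategies the paper analyzes, and the data below are a toy illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
# Train a per-flow NB classifier on two synthetic classes of flow features.
X_train = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(3, 1, (200, 3))])
y_train = np.array([0] * 200 + [1] * 200)
clf = GaussianNB().fit(X_train, y_train)

# A "bag" of 5 correlated flows drawn from class 1.
bag = rng.normal(3, 1, (5, 3))
proba = clf.predict_proba(bag)           # per-flow posteriors
bag_label = proba.mean(axis=0).argmax()  # aggregate first, then decide once
print("per-flow votes:", clf.predict(bag), "-> bag label:", bag_label)
```

Even if one flow in the bag were individually misclassified, the averaged posterior would usually still recover the correct bag label, which is the performance benefit the paper analyzes.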
Article
It has been over 16 years since Cisco's NetFlow was patented in 1996. Extensive research has been conducted since then and many applications have been developed. In this survey, we review an extensive number of studies with emphasis on network flow applications. First, we provide a brief introduction to sFlow, NetFlow and network traffic analysis. Then, we review the state of the art in the field by presenting the main perspectives and methodologies. Our analysis reveals that network security has been an important research topic since the beginning, and that advanced methodologies, such as machine learning, are very promising. We provide a critique of the surveyed studies regarding datasets, perspectives, methodologies, challenges, future directions and ideas for potential integration with other Information Technology infrastructure and methods. Finally, we conclude the survey.
Article
Any assessment of classification techniques requires data. This document describes sets of data intended to aid in the assessment of classification work. A number of data sets are described; each data set consists of a number of objects, and each object is described by a group of features (also referred to as discriminators). Leveraged by a quantity of hand-classified data, each object within each data set represents a single flow of TCP packets between client and server. The features for each object consist of the (application-centric) classification derived elsewhere and a number of features derived as input to probabilistic classification techniques. In addition to describing the features, we also provide information allowing interested parties to retrieve these data sets for use in their own work. The data sets contain no site-identifying information; each object is only described by a set of statistics and a class that defines the causal application.
Article
The reactions of trimethylaluminum (TMAl) and ammonia (NH3) on γ-alumina at 600 K were studied by Fourier-transform infrared spectroscopy (FTIR) and X-ray photoelectron spectroscopy (XPS) to explore fundamental issues of AlN film growth. FTIR and XPS assignments for surface species present after the deposition process have been directly correlated from measurements taken in the same instrument. Sequential exposure of TMAl and NH3 leads to self-limiting adsorption of TMAl and site-selective reaction with NH3 to form a thin layer of AlN along with dinitrogen and NH2 species. The coexposure of TMAl and NH3 leads to continuous deposition resulting in a thick film of AlN with NH2 and dinitrogen species present at the AlN/alumina interface and NH terminating the AlN/vacuum interface. These processes are discussed in terms of AlN thin film growth strategies.
Article
Feature selection has been the focus of interest for quite some time and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970s to the present. It identifies four steps of a typical feature selection method, categorizes the different existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of different methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteristics.
Article
Identifying and categorizing network traffic by application type is challenging because of the continued evolution of applications, especially of those with a desire to be undetectable. The diminished effectiveness of port-based identification and the overheads of deep packet inspection approaches motivate us to classify traffic by exploiting distinctive flow characteristics of applications when they communicate on a network. In this paper, we explore this latter approach and propose a semi-supervised classification method that can accommodate both known and unknown applications. To the best of our knowledge, this is the first work to use semi-supervised learning techniques for the traffic classification problem. Our approach allows classifiers to be designed from training data that consists of only a few labeled and many unlabeled flows. We consider pragmatic classification issues such as longevity of classifiers and the need for retraining of classifiers. Our performance evaluation using empirical Internet traffic traces that span a 6-month period shows that: (1) high flow and byte classification accuracy (i.e., greater than 90%) can be achieved using training data that consists of a small number of labeled and a large number of unlabeled flows; (2) presence of “mice” and “elephant” flows in the Internet complicates the design of classifiers, especially of those with high byte accuracy, and necessitates the use of weighted sampling techniques to obtain training flows; and (3) retraining of classifiers is necessary only when there are non-transient changes in the network usage characteristics. As a proof of concept, we implement prototype offline and realtime classification systems to demonstrate the feasibility of our approach.
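The semi-supervised setup this abstract describes — a few labeled flows plus many unlabeled ones — can be sketched with scikit-learn, where unlabeled samples are marked with `-1`. `SelfTrainingClassifier` here is a stand-in for the paper's method, and the data and threshold are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(2, 0.5, (200, 2))])
y_true = np.array([0] * 200 + [1] * 200)

# Only 5 labeled flows per class; the remaining 390 are unlabeled (-1).
y = np.full(400, -1)
labeled = np.r_[0:5, 200:205]
y[labeled] = y_true[labeled]

# Self-training: iteratively pseudo-label unlabeled flows the base
# classifier is confident about (posterior above the threshold).
clf = SelfTrainingClassifier(GaussianNB(), threshold=0.9).fit(X, y)
acc = (clf.predict(X) == y_true).mean()
print(f"accuracy with 10 labeled flows: {acc:.2f}")
```

This mirrors the abstract's first finding: high flow accuracy from a handful of labeled flows, provided the unlabeled mass carries the cluster structure.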
Article
Feature selection is an effective technique in dealing with dimensionality reduction. For classification, it is used to find an “optimal” subset of relevant features such that the overall accuracy of classification is increased while the data size is reduced and the comprehensibility is improved. Feature selection methods contain two important aspects: evaluation of a candidate feature subset and search through the feature space. Existing algorithms adopt various measures to evaluate the goodness of feature subsets. This work focuses on inconsistency measure according to which a feature subset is inconsistent if there exist at least two instances with same feature values but with different class labels. We compare inconsistency measure with other measures and study different search strategies such as exhaustive, complete, heuristic and random search, that can be applied to this measure. We conduct an empirical study to examine the pros and cons of these search methods, give some guidelines on choosing a search method, and compare the classifier error rates before and after feature selection.
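The inconsistency measure is simple enough to state directly in code: a feature subset is inconsistent when two instances agree on every selected feature but carry different class labels, and the rate sums, over each group of identical feature patterns, the group size minus the majority class count, divided by the number of instances. The toy data below are illustrative.

```python
from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    """Inconsistency rate of a feature subset on discrete data."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[tuple(row[i] for i in subset)].append(label)
    # Each group contributes (size - majority count) inconsistent instances.
    count = sum(len(g) - max(Counter(g).values()) for g in groups.values())
    return count / len(y)

X = [(1, 0, 0), (1, 0, 1), (0, 1, 0), (0, 1, 1)]
y = ["A", "B", "A", "A"]
print(inconsistency_rate(X, y, [0]))     # feature 0 alone cannot separate A/B
print(inconsistency_rate(X, y, [0, 2]))  # adding feature 2 resolves it
```

A search strategy (exhaustive, heuristic, random) then looks for the smallest subset whose inconsistency rate stays at or below that of the full feature set.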
Article
In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area.
Conference Paper
When many flows are multiplexed on a non-saturated link, their volume changes over short timescales tend to cancel each other out, making the average change across flows close to zero. This equilibrium property holds if the flows are nearly independent, and it is violated by traffic changes caused by several, potentially small, correlated flows. Many traffic anomalies (both malicious and benign) fit this description. Based on this observation, we exploit equilibrium to design a computationally simple detection method for correlated anomalous flows. We compare our new method to two well known techniques on three network links. We manually classify the anomalies detected by the three methods, and discover that our method uncovers a different class of anomalies than previous techniques do.
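The equilibrium observation lends itself to a very small detector sketch: under independence the per-bin average of flow volume changes stays near zero, and a batch of correlated flows shifting together breaks that. The data, surge size, and 4-sigma threshold below are illustrative, not the paper's method verbatim.

```python
import numpy as np

rng = np.random.default_rng(4)
n_flows, n_bins = 1000, 50
deltas = rng.normal(0, 1, (n_bins, n_flows))  # independent per-flow changes
deltas[30, :40] += 8.0                        # bin 30: 40 flows surge together

mean_change = deltas.mean(axis=1)             # near zero under equilibrium
threshold = 4 * mean_change[:25].std()        # baseline from early bins
anomalous = np.flatnonzero(np.abs(mean_change) > threshold)
print("anomalous time bins:", anomalous)
```

Note the detector needs no per-flow model: the cross-flow cancellation does the work, which is why the method stays computationally simple.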
Conference Paper
Traditionally, the best number of features is determined by the so-called “rule of thumb”, or by using a separate validation dataset. We can neither find any explanation why these lead to the best number nor do we have any formal feature selection model to obtain this number. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest scores approach will turn many documents into zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual objective optimization problem, and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate our proposed framework is effective.
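The failure mode the paper highlights — globally top-scored feature selection emptying some documents — is easy to demonstrate. The term-count matrix and the collection-frequency score below are illustrative stand-ins for a real corpus and scoring function.

```python
import numpy as np

# Rows = documents, columns = term counts (toy corpus).
X = np.array([[3, 0, 0, 1],
              [0, 2, 0, 0],
              [0, 0, 4, 0],
              [1, 1, 0, 2]])
scores = X.sum(axis=0)               # stand-in score: collection frequency
keep = np.argsort(scores)[::-1][:2]  # keep the global top-2 terms

reduced = X[:, keep]
empty = int((reduced.sum(axis=1) == 0).sum())
print(f"documents emptied by top-2 selection: {empty}")
```

A document with zero remaining terms contributes nothing to training, which is why the paper reframes selection as a dual-objective problem that also guarantees per-document coverage.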