Article

NCD based masquerade detection using enriched command lines


Abstract

This paper extends a series of experiments performed by Schonlau et al. [1], Maxion [2] and Bertacchini et al. [3] on the detection of computer masqueraders (i.e. illegitimate users trying to impersonate legitimate ones). A compression-based classification algorithm called Normalized Compression Distance or NCD, developed by Vitányi et al. [4], is applied on truncated and enriched command-line data. It is shown that the use of enriched data significantly improves the NCD-based detection performance compared with using truncated data sets. Future work, possible enhancements and directions of further research on this topic are presented as well.
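
A minimal Python sketch of the NCD-based classification idea described above, using bzip2 as the compressor (one of the compressors used in this line of work); the sample data, threshold and decision rule are illustrative stand-ins, not the authors' exact pipeline:

    import bz2

    def compressed_len(s: bytes) -> int:
        # Compressed size of s; the compressor stands in for the
        # (noncomputable) Kolmogorov complexity in the NCD framework.
        return len(bz2.compress(s))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy = compressed_len(x), compressed_len(y)
        return (compressed_len(x + y) - min(cx, cy)) / max(cx, cy)

    # Toy data: a user's training commands and a fresh test block.
    # "Enriched" lines keep flags and arguments; "truncated" ones would
    # reduce each line to the bare command name.
    profile = b"cd /home/alice/src\nls -la\nvim main.c\ngcc -o main main.c\n" * 20
    test_block = b"cd /home/alice/src\nls -la\nmake\n./main --verbose\n" * 5

    THRESHOLD = 0.5  # hypothetical value; in practice tuned on validation data
    score = ncd(profile, test_block)
    print("masquerader" if score > THRESHOLD else "legitimate", round(score, 3))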


... There are different aspects of masquerader behavior which may be taken into account to detect her, e.g., the list of directories she accessed, the programs she executed, the websites she visited, etc. Among all these aspects, the sequence of commands typed by the masquerader has been one of the most popular choices in the literature (see, e.g., [9] and [10]). The usual approach has been to monitor the command-line behavior of legitimate users for a long period in order to build a sufficiently large data corpus, and to apply a technique that looks for anomalies when new commands are typed. ...
... In the present work, only Schonlau's statistical test and Naive Bayes are used to compute masquerader detection efficiency using different subsets of extracted commands as input. To validate the dimension reduction techniques introduced here, other detection algorithms, such as Normalized Compression Distance ([8], [9]), Grammar Extraction ([13]), Bayes One-Step Markov and Hybrid Multistep Markov ([1]), should also be used. This is a matter of future work. ...
... Dimension reduction analysis has been performed using truncated command-line data, but Maxion [3] and Bertacchini and Benitez [9] showed that command-line data enriched with command parameters yields better masquerader detection performance. Therefore, work is needed to extend the ideas of this paper to the case of enriched commands. ...
Article
Full-text available
We deal with the problem of dimension reduction in masquerader detection through command-line behavior. Although it has been previously suggested that unpopular commands are more relevant for this task, it is shown that there is no conclusive evidence in favor of this hypothesis. Moreover, it is also shown that selection of a fraction of the most popular or frequently used commands leads to a smooth degradation of a masquerader detection algorithm, while the selection of the most unpopular or infrequent commands produces a degradation which is worse than that of simple random selection. Some evidence is provided that the best performance of a masquerader detection algorithm may not necessarily correspond to accounting for all commands in the training data, but for a smaller, adequately chosen fraction of them. We verify this conclusion using two different datasets and two different masquerader detection algorithms. Finally, the empirical evidence provided in this paper suggests that, for many masquerader detection techniques, it may be convenient to work with a small fraction of the most popular or frequently used commands.
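
The popularity-based selection the abstract describes can be sketched in a few lines of Python; the fraction parameter and the choice to drop out-of-vocabulary commands (rather than map them to a single OTHER token) are illustrative assumptions:

    from collections import Counter

    def most_popular_commands(training_seq, fraction=0.1):
        # Rank distinct commands by training frequency and keep the top
        # `fraction` of them (illustrative parameter).
        counts = Counter(training_seq)
        k = max(1, int(len(counts) * fraction))
        return {cmd for cmd, _ in counts.most_common(k)}

    training = ["ls", "cd", "vim", "ls", "gcc", "cd", "ls", "make", "cd", "vim"]
    vocabulary = most_popular_commands(training, fraction=0.5)

    # Project any sequence onto the reduced vocabulary before feeding it
    # to a detector; rare commands are simply dropped here.
    reduced = [c for c in training if c in vocabulary]
    print(vocabulary, reduced)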
... Results were better than the Compression approach, and similar to Schonlau's best results. A later work by the same authors [13] showed improved detection by applying the NCD to Greenberg's enriched command-line data. ...
... Their diversity makes comparisons hard or even impossible to achieve. A unified and coherent framework for testing and comparing different algorithms in a consistent, reliable and reproducible way could benefit the entire community.

    Method           Hits  FAR  Cost(1)  Config.    Upd.
    [13]             60.0  1.0  46.0     enriched   No
    SVM [41]         87.3  6.4  51.1     enriched   No
    Naïve Bayes [6]  82.1  5.7  52.1     enriched   Yes
    Naïve Bayes [6]  70.9  4.7  57.3     truncated  Yes
    SVM [41]         71.1  6.0  64.9     truncated  No
    NCD(bzip) [13]   38.0  1.0  68.0     truncated  No

    PU dataset [7]   Hits  FAR  Cost(1)  Config.    Upd. ...
Article
Full-text available
This paper presents a survey on the area of masquerader detection. The three most popular publicly available UNIX command-line datasets are presented and their features are compared. Several different masquerader detection approaches are reviewed and their results are compared using the most popular measures of detection effectiveness in this area, introducing the most extensive quantitative comparison of results in the literature. Possible ways for future work in this area are proposed as well.
... The survey article [1] cites approximately forty relevant papers, most of which have had their performance validated using the Schonlau dataset [2]. In that survey, the authors identified several general approaches to masquerade detection: Naïve Bayes [3], [4], information-theoretic [5], [6], support vector machine (SVM) [7], [8], text mining [9], [10], sequences and bioinformatics [11], [12], hidden Markov models (HMMs) [13], [14], [15], [16], and other approaches [17], [18]. ...
... The NCD has been successfully applied to the detection of masqueraders, based on user command-line and enriched command-line inputs, obtaining accuracy comparable to other statistical methods. However, the NCD has the advantage of not requiring prior knowledge about the data [51,52]. In web attack detection, the NCD can successfully measure the similarities between URLs. ...
Conference Paper
Log integration is one of the most challenging concerns in current security systems. Indeed, accurate identification of security events requires handling and merging highly heterogeneous sources of information. As a result, there is a pressing need for general codification and classification procedures that can be applied to any type of security log. This work is focused on defining such a method using the so-called Normalised Compression Distance (NCD). NCD is parameter-free and can be applied to determine the distance between events expressed as strings. On the grounds of the NCD, we propose an anomaly-based procedure for identifying web attacks from web logs. Given a web query as stored in a security log, an NCD-based feature vector is created and classified using a Support Vector Machine (SVM). The method is tested using the CSIC-2010 dataset, and the results are analysed with respect to similar proposals.
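
A rough sketch of the pipeline this abstract outlines, using scikit-learn; the prototype queries and the way the NCD feature vector is assembled are assumptions for illustration, not the authors' exact construction:

    import bz2
    from sklearn.svm import SVC

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda s: len(bz2.compress(s))
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    # Hypothetical prototype queries; in the paper the training material
    # comes from the CSIC-2010 dataset.
    prototypes = [b"GET /index.jsp?id=1",
                  b"GET /login.jsp?user=bob",
                  b"GET /index.jsp?id=1%27%20OR%20%271%27=%271"]

    def features(query: bytes):
        # One NCD value per prototype forms the feature vector.
        return [ncd(query, p) for p in prototypes]

    X = [features(q) for q in prototypes]
    y = [0, 0, 1]  # toy labels: 0 = normal, 1 = attack
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict([features(b"GET /index.jsp?id=2")]))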
... However, we observed that time-stamps and arguments of UNIX commands provide useful information when UNIX commands are used as behavioral biometrics. For two pieces of work, based on compression [60] and frequency distribution [38,61], as described in the previous section, the same detection mechanisms were applied to both the original and truncated versions of the Greenberg dataset. Detection accuracies were consistently higher for the original dataset. ...
Article
Full-text available
The ability to detect insider threats is important for many organisations. However, the field of insider threat detection is not well understood. In this paper, we survey existing insider threat detection mechanisms to provide a better understanding of the field. We identify and categorise insider behaviours into four classes: biometric behaviours, cyber behaviours, communication behaviours, and psychosocial behaviours. Each class further comprises several independent research fields of anomaly detection. Our survey reveals that there is significant scope for further research in many of those fields, with many machine learning algorithms and features that have not been explored. We identify and summarise the unexplored areas as future directions.
... For example, Ishihara and Sato [15] proposed to use the Normalized Compression Distance (NCD) for authorship attribution in Japanese novels. NCD was proposed by Cilibrasi and Vitányi [16], and it has been used in various other fields such as bioinformatics [17], music and literature [16], image processing [16, 18], and masquerader detection [19]. The NCD of two information sources a and b is defined as follows. ...
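
The snippet cuts off before the formula; the standard definition, with C(.) denoting the compressed size of its argument under a fixed compressor and ab the concatenation of a and b, is

    \[
      \mathrm{NCD}(a,b) = \frac{C(ab) - \min\{C(a),\, C(b)\}}{\max\{C(a),\, C(b)\}}.
    \]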
Article
Full-text available
We propose a new method of content-based document recommendation using data compression. Though previous studies mainly used bags-of-words to calculate the similarity between the profile and target documents, users in fact focus on larger units than words when searching for information in documents. To take this point into consideration, we propose a method of document recommendation using data compression. Experimental results using Japanese newspaper corpora showed that (a) data compression performed better than the bag-of-words method, especially when the number of topics was large; (b) our new method outperformed the previous data compression method; and (c) a combination of data compression and bag-of-words can also improve performance. We conclude that our method better captures users' profiles and thus contributes to making a better document recommendation system.
... Previous work by Bertacchini et al. [4], [5] focuses on masquerader detection and extends a series of experiments performed by Schonlau et al. [6] and Maxion [7]. Commands, keystroke dynamics and other features are exploited to find behavior patterns. ...
Conference Paper
Full-text available
We propose an advanced solution to track persistent computer intruders inside a UNIX-based system by clustering sessions into groups bearing similar characteristics according to expertise and type of work. Our semi-supervised method, based on Self-Organizing Maps (SOM), accomplishes classification of four types of users: computer scientists, experienced programmers, non-programmers, and novice programmers. Our evaluation on a range of biometrics shows that using working directories yields better accuracy (>98.5%) than using the most popular parameters like command use or keystroke patterns.
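
A minimal sketch of SOM-based session classification in the spirit of the abstract, using the third-party MiniSom package; the session features (small command-frequency vectors) and the cell-labeling step are assumptions for illustration, not the authors' method:

    import numpy as np
    from minisom import MiniSom  # third-party: pip install minisom

    # Hypothetical per-session feature vectors, e.g. normalized usage
    # frequencies of a few commands or directories.
    sessions = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],   # programmers
                         [0.1, 0.8, 0.1], [0.2, 0.7, 0.1]])  # non-programmers
    labels = ["programmer", "programmer", "non-programmer", "non-programmer"]

    som = MiniSom(4, 4, input_len=3, sigma=1.0, learning_rate=0.5, random_seed=1)
    som.train_random(sessions, 500)

    # Semi-supervised step: tag each map cell with the labeled sessions
    # that land on it, then classify new sessions by their winning cell.
    cell_labels = {som.winner(s): lab for s, lab in zip(sessions, labels)}
    new_session = np.array([0.75, 0.15, 0.1])
    print(cell_labels.get(som.winner(new_session), "unknown"))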
Article
In cybersecurity, there is a call for adaptive, accurate and efficient procedures for identifying performance shortcomings and security breaches. The increasing complexity of both Internet services and traffic creates a scenario that in many cases impedes the proper deployment of intrusion detection and prevention systems. Although it is common practice to monitor network and application activity, there is no general methodology to codify and interpret the recorded events. Moreover, this lack of methodology somehow erodes the possibility of diagnosing whether event detection and recording are adequately performed. As a result, there is a pressing need for general codification and classification procedures that can be applied to any type of security event in any activity log. This work focuses on defining such a method using the so-called normalized compression distance (NCD). NCD is parameter-free and can be applied to determine the distance between events expressed as strings. As a first step towards a methodology for the integral interpretation of security events, this work is devoted to the characterization of web logs. On the grounds of the NCD, we propose an anomaly-based procedure for identifying web attacks from web logs. Given a web query as stored in a security log, an NCD-based feature vector is created and classified using a support vector machine. The method is tested using the CSIC-2010 data set, and the results are analyzed with respect to similar proposals.
Article
Full-text available
Keystroke Dynamics is a powerful technique which makes it possible to detect and identify intruders in computer systems. In order to test keystroke-data pattern matching and clustering algorithms, user data collection is a mandatory task. Si6 Labs developed a web application named k-profiler with the purpose of collecting the typing-rhythm data of volunteer users. This paper describes the experiment design criteria as well as the format of the collected data, which will be used for Si6 projects and will be publicly available.
Article
Recently, researchers have proposed efficient detection mechanisms for masquerade attacks. Most of these techniques use machine learning methods to learn the behavioral patterns of users and to check if an observed behavior conforms to the learnt behavior of a user. A masquerade attack is detected when the observed behavior, reportedly of a specific user, does not match the learnt pattern of that user's past data. A major shortcoming of this process is that the user may legitimately deviate temporarily from their past behavior. If the deviation is large and near-permanent, it is desirable that such deviations are captured by a detection mechanism. We propose, in this paper, a method that takes this aspect of user behavior into consideration while detecting masquerade attacks. Our scheme is based on the premise that the commands used by a legitimate user or an attacker may differ from the trained signature, but the deviation of the legitimate user is momentary whereas that of an attacker persists longer. By introducing this novel concept in the detection mechanism, the performance improves. We show this empirically using several benchmark datasets. Copyright © 2010 John Wiley & Sons, Ltd.
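
The persistence premise can be sketched as a decision rule layered on top of any per-block anomaly score; the window length and threshold below are illustrative, not the paper's tuned values:

    def persistent_deviation(scores, threshold=0.7, window=5):
        # Alarm only when the anomaly score stays above `threshold` for
        # `window` consecutive blocks, so a brief, legitimate change of
        # behavior does not trigger a detection.
        run = 0
        for i, s in enumerate(scores):
            run = run + 1 if s > threshold else 0
            if run >= window:
                return i  # block index at which the alarm fires
        return None

    legitimate = [0.2, 0.9, 0.3, 0.8, 0.2, 0.4]   # brief spikes only
    attacker = [0.3, 0.8, 0.9, 0.85, 0.9, 0.95]   # sustained deviation
    print(persistent_deviation(legitimate, window=4))  # None
    print(persistent_deviation(attacker, window=4))    # 4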
Article
Full-text available
This paper extends a series of experiments performed by M. Schonlau, W. DuMouchel, W. Ju, A. Karr, M. Theus and Y. Vardi [Stat. Sci. 16, No. 1, 58–74 (2001; Zbl 1059.62758)] on the detection of computer masqueraders (i.e. illegitimate users trying to impersonate legitimate ones). A compression-based classification algorithm called Normalized Compression Distance or NCD, developed by R. Cilibrasi and P. Vitányi [IEEE Trans. Inf. Theory 51, No. 4, 1523–1545 (2005)] is applied on the same data set. It is shown that the NCD-based approach performs as well as the methods previously tried by Schonlau et al. Future work, possible enhancements and directions of further research on this topic are presented as well.
Article
Full-text available
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.
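
Written out, with K(.) the Kolmogorov complexity and K(. | .) its conditional version, the proposed distance takes the form (up to the paper's precise conditioning conventions)

    \[
      \mathrm{NID}(x,y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}.
    \]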
Article
Full-text available
We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
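
A compact sketch of clustering by compression as described above: compute the pairwise NCD matrix and hand it to an off-the-shelf hierarchical clusterer (SciPy's average linkage here, rather than the quartet-tree method the authors develop):

    import bz2
    from itertools import combinations
    from scipy.cluster.hierarchy import linkage, fcluster

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda s: len(bz2.compress(s))
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    objects = [b"aaaaabbbbb" * 50, b"aaaaabbbbb" * 49 + b"c",
               b"ababababab" * 50, b"the quick brown fox " * 25]

    # Condensed (upper-triangle) distance vector, the format linkage expects.
    dists = [ncd(a, b) for a, b in combinations(objects, 2)]
    tree = linkage(dists, method="average")  # dendrogram, not a quartet tree
    print(fcluster(tree, t=2, criterion="maxclust"))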
Article
Computer intruders are modern-day burglars: some of them steal information, some wreak havoc on the system, some just want to prove they can break in. Computer intrusion detection is concerned with designing alarm systems to prevent break-ins. This paper presents a method for detecting intruders/users masquerading as other users. We examine UNIX command streams of users and search for anomalies. We identify anomalies based on unpopular and uniquely used commands.
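
A toy rendering of the unpopular-and-unique-commands idea; the scoring function below is a deliberately simplified illustration, not Schonlau's exact uniqueness statistic:

    def unpopularity_score(test_block, commands_by_user):
        # commands_by_user: {user: set of commands in that user's training data}.
        # Commands shared by few users score high; commands nobody has
        # used score highest.
        n_users = len(commands_by_user)
        score = sum(1.0 - sum(cmd in cmds for cmds in commands_by_user.values()) / n_users
                    for cmd in test_block)
        return score / len(test_block)

    usage = {"alice": {"ls", "cd", "vim"},
             "bob": {"ls", "cd", "gcc"},
             "carol": {"ls", "cd", "mail"}}
    print(unpopularity_score(["ls", "cd"], usage))    # popular -> 0.0
    print(unpopularity_score(["nmap", "nc"], usage))  # unseen -> 1.0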
Conference Paper
A masquerade attack, in which one user impersonates another, is among the most serious forms of computer abuse, largely because such attacks are often mounted by insiders and can be very difficult to detect. Automatic discovery of masqueraders is sometimes undertaken by detecting significant departures from normal user behavior, as represented by user profiles based on users' command histories. A series of experiments performed by Schonlau et al. [12] achieved moderate success in masquerade detection based on a data set comprised of truncated command lines, i.e., single commands stripped of any accompanying flags, arguments or elements of shell grammar such as pipes or semicolons. Using the same data, Maxion and Townsend [8] improved on the Schonlau et al. results by 56%, raising the detection rate from 39.4% to 61.5% at false-alarm rates near 1%. The present paper extends this work by testing the hypothesis that a limitation of these approaches is the use of truncated command-line data, as opposed to command lines enriched with flags, shell grammar, arguments and information about aliases. Enriched command lines were found to facilitate correct detection at the 82% level, far exceeding previous results, with a corresponding 30% reduction in the overall cost of errors and only a small increase in false alarms. Descriptions of pathological cases illustrate strengths and limitations of both the data and the detection algorithm.
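
Maxion and Townsend's detector is a naive Bayes classifier over command frequencies; the sketch below shows that general setup on toy data (the whitespace tokenization keeps flags such as -la, loosely mirroring enriched input; the real experiments train per user on the Schonlau and Greenberg corpora):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy training blocks: "self" = the monitored user, "non-self" = others.
    self_blocks = ["cd src ls -la vim main.c gcc -o main main.c"] * 5
    nonself_blocks = ["telnet host cat /etc/passwd ftp -n host"] * 5

    vec = CountVectorizer(token_pattern=r"\S+")  # split on whitespace only
    X = vec.fit_transform(self_blocks + nonself_blocks)
    y = [0] * 5 + [1] * 5  # 0 = self, 1 = non-self (possible masquerader)

    clf = MultinomialNB().fit(X, y)
    test = vec.transform(["cd src ls -la make"])
    print(clf.predict(test))  # expect 0: consistent with the user's profile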
Article
The prediction by partial matching (PPM) data compression algorithm developed by J. Cleary and I. Witten (1984) is capable of very high compression rates, encoding English text in as little as 2.2 b/character. It is shown that the estimates made by Cleary and Witten of the resources required to implement the scheme can be revised to allow for a tractable and useful implementation. In particular, a variant is described that encodes and decodes at over 4 kB/s on a small workstation and operates within a few hundred kilobytes of data space, but still obtains compression of about 2.4 b/character for English text
Article
The recently developed technique of arithmetic coding, in conjunction with a Markov model of the source, is a powerful method of data compression in situations where a linear treatment is inappropriate. Adaptive coding allows the model to be constructed dynamically by both encoder and decoder during the course of the transmission, and has been shown to incur a smaller coding overhead than explicit transmission of the model's statistics. But there is a basic conflict between the desire to use high-order Markov models and the need to have them formed quickly as the initial part of the message is sent. This paper describes how the conflict can be resolved with partial string matching, and reports experimental results which show that mixed-case English text can be coded in as little as 2.2 bits/character with no prior knowledge of the source.
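
For scale, against an 8-bit ASCII baseline, 2.2 bits/character corresponds to a compression factor of roughly

    \[
      \frac{8\ \text{bits/char}}{2.2\ \text{bits/char}} \approx 3.6.
    \]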
Schonlau, M., DuMouchel, W., Ju, W., Karr, A., Theus, M., Vardi, Y.: Computer Intrusion: Detecting Masquerades. Statistical Science 16(1), 58-74 (2001).