Conference Paper

A Comprehensive Review of Anomaly Detection in Web Logs

Chapter
Full-text available
Cybercrime is one of the fastest-growing crimes worldwide, and attacks are increasing in volume, sophistication, and cost. According to numerous reports, such as those by Cybersecurity Ventures, a cyber attacker penetrates a system roughly every seven seconds. As a result, the log system, which stores and handles all events, is an essential part of any system. However, log systems are not robust, and detecting anomalies in logs is challenging because log events are continuous, ever-changing, and mutable. Attackers attempt to modify logs to avoid discovery, which extends the time between detection and triage. In this work, we propose a novel model that applies Blockchain to the problem of log analysis through two modules: anomaly detection using different machine learning models, and a distributed immutable storage system for securely storing the logs. We also present a descriptive and user-friendly web application that integrates all modules using HTML, CSS, and the Flask framework on the Heroku cloud environment. Using the proposed hybrid machine learning model, we achieve 99.7% accuracy for detecting network anomalies.
Article
Full-text available
Industrial Information Technology infrastructures are often vulnerable to cyberattacks. To ensure the security of computer systems in an industrial environment, effective intrusion detection systems are required to monitor the cyber-physical systems (e.g., computer networks) in the industry for malicious activities. This article aims to build such intrusion detection systems to protect the computer networks from cyberattacks. More specifically, we propose a novel unsupervised machine learning approach that combines the K-Means algorithm with the Isolation Forest for anomaly detection in industrial big data scenarios. Since our objective is to build the intrusion detection system for the big data scenario in the industrial domain, we utilize the Apache Spark framework to implement our proposed model, which was trained on large network traffic data (about 123 million instances of network traffic) stored in Elasticsearch. Moreover, we evaluate our proposed model on live streaming data and find that our proposed system can be used for real-time anomaly detection in the industrial setup. In addition, we address different challenges that we faced while training our model on large datasets and explicitly describe how these issues were resolved. Based on our empirical evaluation in different use cases for anomaly detection in real-world network traffic data, we observe that our proposed system is effective at detecting anomalies in big data scenarios. Finally, we evaluate our proposed model on several academic datasets to compare it with other models and find that it provides performance comparable to other state-of-the-art approaches.
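The abstract above combines K-Means clustering with an Isolation Forest. As a rough illustration of how such a combination can work, here is a minimal scikit-learn sketch (not the authors' Spark/Elasticsearch pipeline); the synthetic data, cluster count, and contamination rate are invented assumptions.

```python
# Minimal sketch: cluster traffic with K-Means, then fit an Isolation Forest
# per cluster and flag low-scoring points as anomalies. Not the authors' code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # stand-in for numeric network-traffic features

X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

anomaly = np.zeros(len(X_scaled), dtype=bool)
for c in np.unique(clusters):
    idx = np.where(clusters == c)[0]
    iso = IsolationForest(contamination=0.01, random_state=0).fit(X_scaled[idx])
    anomaly[idx] = iso.predict(X_scaled[idx]) == -1   # -1 marks outliers

print(f"flagged {anomaly.sum()} of {len(X_scaled)} records as anomalous")
```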
Article
Full-text available
Enterprise systems typically produce a large number of logs to record runtime states and important events. Log anomaly detection is efficient for business management and system maintenance. Most existing log-based anomaly detection methods use a log parser to get log event indexes or event templates and then utilize machine learning methods to detect anomalies. However, these methods cannot handle unknown log types and do not take advantage of the log semantic information. In this article, we propose ConAnomaly, a log-based anomaly detection model composed of a log sequence encoder (log2vec) and a multi-layer Long Short-Term Memory network (LSTM). We design log2vec based on the Word2vec model: it first vectorizes the words in the log content, then removes invalid words through part-of-speech tagging, and finally obtains the sequence vector by a weighted-average method. In this way, ConAnomaly not only captures semantic information in the log but also leverages log sequential relationships. We evaluate our proposed approach on two log datasets. Our experimental results show that ConAnomaly has good stability, can deal with unseen log types to a certain extent, and provides better performance than most log-based anomaly detection methods.
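To make the log2vec encoding step concrete, the sketch below trains Word2vec on tokenized log messages and builds each message vector as a weighted average of its word vectors. It is an illustrative assumption only: the toy log lines are invented, TF-IDF weights stand in for whatever weighting scheme the authors use, and the part-of-speech filtering step is omitted.

```python
# Illustrative log2vec-style encoding (not the authors' implementation).
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

logs = [
    "Received block blk_123 of size 67108864 from 10.0.0.1",
    "PacketResponder 1 for block blk_123 terminating",
    "Exception in receiveBlock for block blk_456",
]
tokenized = [line.lower().split() for line in logs]

w2v = Word2Vec(sentences=tokenized, vector_size=32, window=5, min_count=1, epochs=50)
tfidf = TfidfVectorizer(lowercase=True, token_pattern=r"\S+").fit(logs)
weights = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def message_vector(tokens):
    vecs, ws = [], []
    for tok in tokens:
        if tok in w2v.wv:
            vecs.append(w2v.wv[tok])
            ws.append(weights.get(tok, 1.0))
    return np.average(vecs, axis=0, weights=ws) if vecs else np.zeros(32)

sequence = np.stack([message_vector(t) for t in tokenized])  # input for an LSTM
print(sequence.shape)  # (3, 32)
```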
Article
Full-text available
Performance optimization and analysis are key technologies for Web servers. By analyzing the workflow and working principles of the Web server, several influencing factors that suit the current environment and are feasible to tune are selected for targeted optimization, such as the number of worker threads or processes, keep-alive time, and cache size. Through this optimization, the overall performance of an Nginx-based Web server is greatly improved.
Article
Full-text available
Network anomaly detection systems (NADSs) play a significant role in every network defense system as they detect and prevent malicious activities. Therefore, this paper offers an exhaustive overview of different aspects of anomaly-based network intrusion detection systems (NIDSs). Additionally, contemporary malicious activities in network systems and the important properties of intrusion detection systems are discussed as well. The present survey explains important phases of NADSs, such as pre-processing, feature extraction and malicious behavior detection and recognition. In addition, with regard to the detection and recognition phase, recent machine learning approaches including supervised, unsupervised, new deep and ensemble learning techniques have been comprehensively discussed; moreover, some details about currently available benchmark datasets for training and evaluating machine learning techniques are provided by the researchers. In the end, potential challenges together with some future directions for machine learning-based NADSs are specified.
Article
Full-text available
Security devices produce a huge number of logs, far beyond the processing capacity of human analysts. This paper introduces an unsupervised approach to detecting anomalous behavior in large-scale security logs. We propose a novel feature extraction mechanism that precisely characterizes malicious behaviors, and we design an LSTM-based anomaly detection approach that successfully identifies attacks on two widely used datasets. Our approach outperforms three popular anomaly detection algorithms, one-class SVM, GMM, and Principal Component Analysis, in terms of accuracy and efficiency.
Chapter
Full-text available
In the area of sentiment analysis and classification, the performance of the classification tasks can vary based on the text vectorization and feature extraction methods used. This paper presents a detailed investigation and analysis of the impact of feature extraction methods on attaining the highest classification accuracy for sentiment in user reviews. Unigram, bigram, and trigram n-gram vectorization models are each applied with the TF-IDF feature extraction method. Accuracy, misclassification rate, Receiver Operating Characteristic (ROC) curves, and precision-recall, which are among the most important performance measures in machine-learning-based approaches, are used for evaluation in this study. These measures are computed from the output of Bagged Decision Tree (BDT), Random Forest (RF), AdaBoost (ADA), Gradient Boosting (GB), and Extra Trees (ET) classifiers. The outcome of this study is the best-fitted combination of term frequency-inverse document frequency (TF-IDF) and n-grams for different data sizes.
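A compact sketch of this kind of comparison appears below: TF-IDF vectors with different n-gram ranges feed ensemble classifiers, and cross-validated accuracy is compared. The review texts and labels are invented placeholders, and only two of the paper's five classifiers are shown.

```python
# Hedged sketch of n-gram TF-IDF + ensemble classifier comparison (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

reviews = ["great product, works well", "terrible, broke after a day",
           "really love it", "waste of money", "excellent quality", "very poor"]
labels  = [1, 0, 1, 0, 1, 0]

for ngram in [(1, 1), (1, 2), (1, 3)]:           # unigram, bigram, trigram ranges
    for clf in [RandomForestClassifier(n_estimators=100, random_state=0),
                GradientBoostingClassifier(random_state=0)]:
        pipe = make_pipeline(TfidfVectorizer(ngram_range=ngram), clf)
        acc = cross_val_score(pipe, reviews, labels, cv=3).mean()
        print(ngram, type(clf).__name__, round(acc, 3))
```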
Article
Full-text available
Anomaly detection in high dimensional data is becoming a fundamental research problem that has various applications in the real world. However, many existing anomaly detection techniques fail to retain sufficient accuracy due to so-called "big data" characterised by high-volume and high-velocity data generated by a variety of sources. This phenomenon of having both problems together can be referred to as the "curse of big dimensionality," which affects existing techniques in terms of both performance and accuracy. To address this gap and to understand the core problem, it is necessary to identify the unique challenges brought by anomaly detection with both high dimensionality and big data problems. Hence, this survey aims to document the state of anomaly detection in high dimensional big data by representing the unique challenges using a triangular model of vertices: the problem (big dimensionality), techniques/algorithms (anomaly detection), and tools (big data applications/frameworks). Works that fall directly into any of these vertices, or are closely related to them, are taken into consideration for review. Furthermore, the limitations of traditional approaches and current strategies for high dimensional data are discussed along with recent techniques and applications on big data required for the optimization of anomaly detection.
Article
Full-text available
Log files give insight into the state of a computer system and enable the detection of anomalous events relevant to cyber security. However, automatically analyzing log data is difficult since it contains massive amounts of unstructured and diverse messages collected from heterogeneous sources. Therefore, several approaches that condense or summarize log data by means of clustering techniques have been proposed. Picking the right approach for a particular application domain is, however, non-trivial, since algorithms are designed towards specific objectives and requirements. This paper therefore surveys existing approaches. It thereby groups approaches by their clustering techniques, reviews their applicability and limitations, discusses trends and identifies gaps. The survey reveals that approaches usually pursue one or more of four major objectives: overview and filtering, parsing and signature extraction, static outlier detection, and sequences and dynamic anomaly detection. Finally, this paper also outlines a concept and tool that support the selection of appropriate approaches based on user-defined requirements.
Conference Paper
Full-text available
Deep Neural Networks are emerging as effective techniques to detect sophisticated cyber-attacks targeting Industrial Control Systems (ICSs). In general, these techniques focus on learning a "normal" behavior of the system, to be then able to label noteworthy deviations from it as anomalies. However, during operations, ICSs inevitably and continuously evolve their behavior, due to, e.g., replacement of devices, workflow modifications, or other reasons. As a consequence, the quality of the anomaly detection process may be dramatically affected, with a considerable number of false alarms being generated. This paper presents AADS (Adaptive Anomaly Detection in industrial control Systems), a novel framework based on neural networks and greedy algorithms that tailors the learning-based anomaly detection process to the changing nature of ICSs. AADS efficiently adapts a pre-trained model to learn new changes in the system behavior with a small number of data samples (i.e., time steps) and a few gradient updates. The performance of AADS is evaluated using the Secure Water Treatment (SWaT) dataset, and its sensitivity to additive noise is investigated. Our results show an increased detection rate compared to state-of-the-art approaches, as well as more robustness to additive noise.
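The core adaptation idea, updating a pre-trained model with only a handful of samples and gradient steps, can be illustrated with a minimal PyTorch sketch. This is not AADS itself: the reconstruction model, layer sizes, sample counts, and the synthetic readings are all assumptions made for the example.

```python
# Minimal sketch of few-step adaptation of a pre-trained anomaly detector.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 4), nn.ReLU(), nn.Linear(4, 10))
loss_fn = nn.MSELoss()

# Pre-training on the original "normal" behavior.
normal = torch.randn(512, 10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(normal), normal).backward()
    opt.step()

# Adaptation: a few gradient updates on a small batch from the evolved system.
new_behavior = torch.randn(16, 10) + 0.5      # stand-in for post-change readings
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):                            # "a few gradient updates"
    opt.zero_grad()
    loss_fn(model(new_behavior), new_behavior).backward()
    opt.step()

score = ((model(new_behavior) - new_behavior) ** 2).mean(dim=1)  # reconstruction error
print(score.detach().numpy().round(3))
```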
Conference Paper
Full-text available
Recording runtime status via logs is common for almost every computer system, and detecting anomalies in logs is crucial for timely identification of system malfunctions. However, manually detecting anomalies in logs is time-consuming, error-prone, and infeasible. Existing automatic log anomaly detection approaches, using indexes rather than the semantics of log templates, tend to cause false alarms. In this work, we propose LogAnomaly, a framework that models an unstructured log stream as a natural-language sequence. Empowered by template2vec, a novel, simple yet effective method to extract the semantic information hidden in log templates, LogAnomaly can detect both sequential and quantitative log anomalies simultaneously, which no previous work has achieved. Moreover, LogAnomaly can avoid the false alarms caused by log templates newly appearing between periodic model retrainings. Our evaluation on two public production log datasets shows that LogAnomaly outperforms existing log-based anomaly detection methods.
Article
Full-text available
Technological advances and increased interconnectivity have led to a higher risk of previously unknown threats. Cyber security therefore employs intrusion detection systems that continuously monitor log lines in order to protect systems from such attacks. Existing approaches use string metrics to group similar lines into clusters and detect dissimilar lines as outliers. However, such methods only produce static views of the data and do not sufficiently incorporate the dynamic nature of logs. Changes to the technological infrastructure therefore frequently require cluster reformation. Moreover, such approaches are not suited for detecting anomalies related to frequencies, periodic alterations, and interdependencies of log lines. We therefore propose a dynamic log file anomaly detection methodology that incrementally groups log lines within time windows. Thereby, a novel clustering mechanism establishes links between otherwise isolated collections of clusters. Cluster evolution techniques analyze clusters from neighboring time windows and determine transitions such as splits or merges. A self-learning algorithm then detects anomalies in the temporal behavior of these evolving clusters by analyzing metrics derived from their developments. We apply a prototype in an illustrative scenario consisting of a log file containing known anomalies. We thereby investigate the influence of certain parameters on the detection ability and the runtime. The evaluation of this scenario shows that 61.8% of the dynamic changes of log line clusters are correctly identified, while the false alarm rate is only 0.7%. The ability to efficiently detect these anomalies while self-adjusting to changes in the system environment suggests the applicability of the introduced approach.
Article
Full-text available
Application layer distributed denial of service (App-DDoS) attacks have posed a great threat to the security of the Internet. Since these attacks occur in the application layer, they can easily evade traditional network layer and transport layer detection methods. In this paper, we extract a group of user behavior attributes from our intercept program instead of web server logs and construct a behavior feature matrix based on nine user behavior features to characterize user behavior. Subsequently, principal component analysis (PCA) is applied to profile the user browsing behavior pattern in the feature matrix, and outliers from the pattern are used to distinguish normal users from attackers. Experimental results show that the proposed method distinguishes normal users and attackers well. Finally, we implement three machine learning algorithms (K-means, DBSCAN and SVM) to further validate the effectiveness of the proposed attributes and features.
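The PCA-based profiling step can be sketched as follows: project the behavior feature matrix onto a few principal components and flag sessions with large reconstruction error as potential attackers. The synthetic features, component count, and percentile cutoff are assumptions for illustration, not the paper's settings.

```python
# Sketch of PCA-based outlier flagging on a user-behavior feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
normal_users = rng.normal(size=(200, 9))                 # nine behavior features
attackers = rng.normal(loc=4.0, size=(5, 9))             # abnormal browsing behavior
X = StandardScaler().fit_transform(np.vstack([normal_users, attackers]))

pca = PCA(n_components=3).fit(X)
reconstructed = pca.inverse_transform(pca.transform(X))
error = ((X - reconstructed) ** 2).sum(axis=1)

threshold = np.percentile(error, 97)                      # illustrative cutoff
print("flagged indices:", np.where(error > threshold)[0])
```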
Conference Paper
Full-text available
Recent studies have revealed that cyber criminals tend to exchange knowledge about cyber attacks in online social networks (OSNs). Cyber security experts are another set of information providers on OSNs who frequently share information about cyber security incidents and their personal opinions and analyses. Therefore, in order to improve our knowledge about evolving cyber attacks and the underlying human behavior for different purposes (e.g., crime investigation, understanding career development of cyber criminals and cyber security professionals, detection of impending cyber attacks), it will be very useful to detect cyber-security-related accounts on OSNs automatically and monitor their activities. This paper reports our preliminary work on automatic detection of cyber-security-related accounts on OSNs, using Twitter as an example. Three machine learning based classification algorithms were applied and compared: decision trees, random forests, and SVM (support vector machines). Experimental results showed that both decision trees and random forests performed well, with an overall accuracy over 95%; when random forests were used with behavioral features, the accuracy reached as high as 97.877%.
Article
Full-text available
Cyber attacks are increasingly detrimental to networks, systems, and users, and are growing in number and severity globally. To better predict system vulnerabilities, cybersecurity researchers are developing new and more holistic approaches to characterizing cybersecurity system risk. The process must include characterizing the human factors that contribute to cyber security vulnerabilities and risk. Rationality, expertise, and maliciousness are key human characteristics influencing cyber risk within this context, yet maliciousness is poorly characterized in the literature. There is a clear absence of literature pertaining to human factor maliciousness as it relates to cybersecurity, and only limited literature relating to aspects of maliciousness in other disciplines, such as psychology, sociology, and law. In an attempt to characterize human factors as a contribution to cybersecurity risk, the Cybersecurity Collaborative Research Alliance (CSec-CRA) has developed a Human Factors risk framework. This framework identifies the characteristics of an attacker, user, or defender, all of whom may be adding to or mitigating against cyber risk. The maliciousness literature and the proposed maliciousness assessment metrics are discussed within the context of the Human Factors Framework and Ontology. Maliciousness is defined as the intent to harm. Most cyber maliciousness research to date has focused on detecting malicious software but fails to analyze an individual's intent to do harm to others by deploying malware or performing malicious attacks. Recent efforts to identify malicious human behavior as it relates to cybersecurity include analyzing motives driving insider threats as well as user profiling analyses. However, cyber-related maliciousness is neither well studied nor well understood because individuals are not forced to expose their true selves to others while performing malicious attacks. Given the difficulty of interviewing malicious-behaving individuals and the potentially untrustworthy nature of their responses, we aim to explore maliciousness as a human factor through the observable behaviors and attributes of an individual, drawn from their actions and interactions with society and networks; to do so we will need to develop a set of analyzable metrics. The purpose of this paper is twofold: (1) to review human maliciousness-related literature in diverse disciplines (sociology, economics, law, psychology, philosophy, informatics, terrorism, and cybersecurity); and (2) to identify an initial set of proposed assessment metrics and instruments that might be drawn upon in a future effort to characterize human maliciousness within the cyber realm. The future goal is to integrate these assessment metrics into holistic cybersecurity risk analyses to determine the risk an individual poses to themselves as well as other networks, systems, and/or users.
Article
Full-text available
A web application can be visited for different purposes. A web site may be visited by a regular user as a normal (natural) visit, viewed by crawlers, bots, spiders, etc. for indexing purposes, or scanned in an exploratory way by malicious users prior to an attack. An attack-targeted web scan can be viewed as a phase of a potential attack, and detecting it can lead to more attack detections than traditional detection methods allow. In this work, we propose a method to detect attack-oriented scans and to distinguish them from other types of visits. In this context, we use access log files of Apache (or IIS) web servers and try to determine attack situations through examination of past data. In addition to web scan detection, we add a rule set to detect SQL injection and XSS attacks. Our approach has been applied on sample data sets and the results have been analyzed in terms of performance measures to compare our method with other commonly used detection techniques. Furthermore, various tests have been made on log samples from real systems. Lastly, several suggestions about further development have also been discussed.
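A rule set of this kind can be as simple as a handful of regular expressions applied to the request field of each access-log line, as in the sketch below. The patterns are a small, non-exhaustive set of assumptions for illustration, not the authors' rule set.

```python
# Toy rule-based check for SQLi/XSS/scan signatures in Apache access-log lines.
import re

RULES = {
    "sqli": re.compile(r"(union\s+select|or\s+1=1|sleep\(|information_schema)", re.I),
    "xss":  re.compile(r"(<script|%3Cscript|onerror\s*=|javascript:)", re.I),
    "scan": re.compile(r"(/etc/passwd|\.\./\.\.|wp-login\.php|phpmyadmin)", re.I),
}

def inspect(access_log_line: str):
    # Apache combined log format: the request is the quoted "METHOD /path HTTP/x.x" field.
    m = re.search(r'"[A-Z]+ (\S+) HTTP/[^"]+"', access_log_line)
    if not m:
        return []
    path = m.group(1)
    return [name for name, pattern in RULES.items() if pattern.search(path)]

line = '10.0.0.7 - - [10/Oct/2023:13:55:36 +0000] "GET /search?q=%3Cscript%3Ealert(1)%3C/script%3E HTTP/1.1" 200 512'
print(inspect(line))   # ['xss']
```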
Article
Full-text available
Anomaly detection is the process of identifying unexpected items or events in datasets which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied on unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection, as well as in the life science and medical domains. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, the computational effort, the impact of parameter settings, and the global/local anomaly detection behavior are outlined. In conclusion, we give advice on algorithm selection for typical real-world tasks.
Article
Web application security and protection against attackers are highly topical problems. Queries that users send to a web application via the Internet are registered in the web server's log files. Analyzing these log files makes it possible to detect anomalous changes that take place on the web server and to identify attacks. In this work, different methods are used to analyze log files and detect anomalies. The proposed methods detect anomalous queries received from malicious users in the web server's log files.
Article
Nowadays, in almost every computer system, log files are used to keep records of occurring events. Those log files are then used for analyzing and debugging system failures. Due to this important utility, researchers have worked on finding fast and efficient ways to detect anomalies in a computer system by analyzing its log records. Research in log-based anomaly detection can be divided into two main categories: batch log-based anomaly detection and streaming log-based anomaly detection. Batch log-based anomaly detection is computationally heavy and does not allow us to detect anomalies instantaneously. On the other hand, streaming anomaly detection allows for immediate alerts. However, current streaming approaches are mainly supervised. In this work, we propose a fully unsupervised framework which can detect anomalies in real time. We test our framework on HDFS log files and successfully detect anomalies with an F1 score of 83%.
Article
In the face of escalating global cybersecurity threats, having an automated forewarning system that can find suspicious user profiles is paramount. It can work as a prevention technique for planned attacks or ultimate security breaches. Significant research has been conducted on attack prevention and detection, but existing work typically draws on only one or a few data sources with a short list of features. The main goals of this paper are, first, to review previous user profiling models and analyze them to find their advantages and disadvantages; second, to provide a comprehensive overview of previous research to gather available features and data sources for user profiling; third, based on the deficiencies of the previous models, to propose a new user profiling model that can cover all available sources and related features from a cybersecurity perspective. The proposed model includes seven profiling criteria for gathering a user's information and more than 270 features to parse and generate the security profile of a user.
Article
Unprotected Web applications are vulnerable points through which hackers can attack an organization's network. Statistics show that 42% of Web applications are exposed to threats and hackers. The Web requests that users send to Web applications are manipulated by hackers to take control of Web servers, so malicious Web queries must be detected to prevent such manipulation. Web attack detection has been extremely important for information distribution over the past decades, and anomaly-based methods built on machine learning are preferred for Web application security. This study proposes an anomaly-based Web attack detection architecture for Web applications using deep learning methods. The architecture consists of data preprocessing and Convolutional Neural Network (CNN) steps. To prove the suitability and success of the proposed CNN architecture, the CSIC2010v2 dataset was used. The proposed architecture performs anomaly-based detection of Web attacks. Based on the experimental results of the study, the proposed CNN deep learning architecture yields successful outcomes.
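A character-level CNN over raw request strings is one plausible instantiation of such an architecture. The PyTorch sketch below is a hedged illustration: the layer sizes, maximum length, and toy requests are assumptions, not the paper's configuration.

```python
# Hedged sketch of a character-level CNN for request classification.
import torch
import torch.nn as nn

class RequestCNN(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 32, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(32, 2)            # normal vs. attack

    def forward(self, x):                     # x: (batch, max_len) of char codes
        e = self.embed(x).transpose(1, 2)     # -> (batch, embed_dim, max_len)
        h = torch.relu(self.conv(e))
        return self.fc(self.pool(h).squeeze(-1))

def encode(request, max_len=200):
    codes = [min(ord(c), 127) for c in request[:max_len]]
    return torch.tensor(codes + [0] * (max_len - len(codes)))

model = RequestCNN()
batch = torch.stack([encode("GET /index.html HTTP/1.1"),
                     encode("GET /item?id=1' OR '1'='1 HTTP/1.1")])
print(model(batch).shape)   # torch.Size([2, 2]) -> logits per request
```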
Article
The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it suffers from some limitations that significantly hinder its performance level, under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation where its k-nearest neighbour distance with the maximum gap is significantly different from what we would expect if the distribution of k-nearest neighbours with the maximum gap is in the maximum domain of attraction of the Gumbel distribution. An approach based on extreme value theory is used for the anomalous threshold calculation. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies present in other data structures using feature engineering. We show the situations where the stray algorithm outperforms the HDoutliers algorithm both in accuracy and computational time. This framework is implemented in the open source R package stray.
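The scoring idea behind stray and HDoutliers can be approximated with a few lines of numpy and scikit-learn: score each point by the k-nearest-neighbour distance just beyond the largest gap in its sorted neighbour distances, then threshold the scores. In this simplified sketch the extreme-value-theory threshold of the real algorithm is replaced by a plain quantile cutoff, and the data and parameters are invented.

```python
# Simplified kNN-gap anomaly scoring, loosely inspired by stray/HDoutliers.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(300, 8)), rng.normal(loc=6.0, size=(3, 8))])

k = 10
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
dists = dists[:, 1:]                         # drop the zero distance to itself

# Score a point by the neighbour distance just after the largest gap
# in its sorted k-NN distances.
gaps = np.diff(dists, axis=1)
score = dists[np.arange(len(X)), gaps.argmax(axis=1) + 1]

cutoff = np.quantile(score, 0.99)            # EVT-based threshold in the real algorithm
print("anomalies:", np.where(score > cutoff)[0])
```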
Article
We critically appraise the recent interest in out-of-distribution (OOD) detection and question the practical relevance of existing benchmarks. While the currently prevalent trend is to consider different datasets as OOD, we argue that out-distributions of practical interest are ones where the distinction is semantic in nature for a specified context, and that evaluative tasks should reflect this more closely. Assuming a context of object recognition, we recommend a set of benchmarks, motivated by practical applications. We make progress on these benchmarks by exploring a multi-task learning based approach, showing that auxiliary objectives for improved semantic awareness result in improved semantic anomaly detection, with accompanying generalization benefits.
Article
The emergence of artificial intelligence technology has promoted the development of the Internet of Things (IoT). However, this promising cyber technology can encounter serious security problems while accessing the internet. A malicious website can disguise itself as a normal website, and obtain users' private information. Thus, it is very important to detect malicious websites using tools such as machine learning algorithms, as these algorithms can help us to identify abnormal information hidden in the mass traffic more easily. Accordingly, many feature engineering tasks must be performed from memory, as a strong machine learning model is greatly improved with good features. In this paper, we propose an unsupervised learning algorithm that learns URL embedding. We also explore some key parameters regarding a domain embedding model to obtain a good effect on domain features.
Article
This study proposes a novel methodology to detect malicious URLs using a simulated expert (SE) and a knowledge base system (KBS). The proposed approach not only efficiently detects known malicious URLs, but also adapts countermeasures against newly generated malicious URLs. Moreover, this study also explores which lexical features contribute most to the final decision using a factor analysis method, thus helping to avoid the involvement of a human expert. Further, we applied the following state-of-the-art ML algorithms, i.e., Naïve Bayes (NB), Decision Tree (DT), Gradient Boosted Trees (GBT), Generalized Linear Model (GLM), Logistic Regression (LR), Deep Learning (DL), and Random Forest (RF), and evaluated their performance on a large-scale real data set from a data-driven Web application. The experimental results clearly demonstrate the efficiency of NB in the proposed model: NB outperformed the other algorithms in terms of average minimum execution time (i.e., 3 seconds) and accurately classified the 107,586 URLs with a 0.2% error rate and 99.8% accuracy.
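The lexical-feature idea can be illustrated with a toy sketch: extract simple string-level features from each URL and feed them to a Naive Bayes classifier. The feature list, example URLs, and labels below are invented and are not the study's feature set.

```python
# Toy lexical URL features + Naive Bayes (illustrative only).
import numpy as np
from urllib.parse import urlparse
from sklearn.naive_bayes import GaussianNB

def lexical_features(url):
    parsed = urlparse(url)
    return [
        len(url),                          # overall length
        url.count("."),                    # dot count
        url.count("-") + url.count("@"),   # suspicious separators
        sum(c.isdigit() for c in url),     # digit count
        int(parsed.netloc.startswith("xn--")),  # punycode hint
        len(parsed.path) + len(parsed.query),
    ]

urls = ["http://example.com/index.html",
        "http://secure-login.paypa1.com.verify-account.xyz/update?id=391",
        "https://github.com/user/repo",
        "http://198.51.100.23/free-gift/claim.php?token=123456"]
labels = [0, 1, 0, 1]                       # 0 = benign, 1 = malicious (toy labels)

X = np.array([lexical_features(u) for u in urls])
clf = GaussianNB().fit(X, labels)
print(clf.predict([lexical_features("http://bank-update.example-login.top/verify?acc=77")]))
```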
Conference Paper
Conventional attacks by insider employees and emerging APTs are both major threats to organizational information systems. Existing detection methods mainly concentrate on users' behavior and usually analyze logs recording their operations in an information system. In general, most of these methods consider the sequential relationship among log entries and model users' sequential behavior. However, they ignore other relationships, inevitably leading to unsatisfactory performance in various attack scenarios. We propose log2vec, a modularized method based on heterogeneous graph embedding. First, it involves a heuristic approach that converts log entries into a heterogeneous graph in light of the diverse relationships among them. Next, it utilizes an improved graph embedding appropriate to the above heterogeneous graph, which can automatically represent each log entry as a low-dimensional vector. The third component of log2vec is a practical detection algorithm capable of separating malicious and benign log entries into different clusters and identifying the malicious ones. We implement a prototype of log2vec. Our evaluation demonstrates that log2vec remarkably outperforms state-of-the-art approaches, such as deep learning and the hidden Markov model (HMM). Besides, log2vec shows its capability to detect malicious events in various attack scenarios.
Conference Paper
Logs are widely used by large and complex software-intensive systems for troubleshooting. There have been a lot of studies on log-based anomaly detection. To detect the anomalies, the existing methods mainly construct a detection model using log event data extracted from historical logs. However, we find that the existing methods do not work well in practice. These methods have the close-world assumption, which assumes that the log data is stable over time and the set of distinct log events is known. However, our empirical study shows that in practice, log data often contains previously unseen log events or log sequences. The instability of log data comes from two sources: 1) the evolution of logging statements, and 2) the processing noise in log data. In this paper, we propose a new log-based anomaly detection approach, called LogRobust. LogRobust extracts semantic information of log events and represents them as semantic vectors. It then detects anomalies by utilizing an attention-based Bi-LSTM model, which has the ability to capture the contextual information in the log sequences and automatically learn the importance of different log events. In this way, LogRobust is able to identify and handle unstable log events and sequences. We have evaluated LogRobust using logs collected from the Hadoop system and an actual online service system of Microsoft. The experimental results show that the proposed approach can well address the problem of log instability and achieve accurate and robust results on real-world, ever-changing log data.
Conference Paper
Web log data analysis is important in intrusion detection, and various machine learning techniques have been applied to it. However, compared to the abundant research on machine learning, ways to extract features from log data remain under-explored. In this paper, we present an effective feature extraction approach that leverages Byte Pair Encoding (BPE) and Term Frequency-Inverse Document Frequency (TF-IDF). We have applied this approach to various downstream machine learning algorithms and demonstrated its usefulness.
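A minimal sketch of such a BPE-plus-TF-IDF pipeline is shown below, using the Hugging Face `tokenizers` package and scikit-learn; the sample requests, vocabulary size, and tokenizer settings are illustrative assumptions, not the authors' setup.

```python
# Hedged sketch: learn BPE subwords over raw requests, then TF-IDF over the tokens.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from sklearn.feature_extraction.text import TfidfVectorizer

requests = [
    "GET /index.php?id=12 HTTP/1.1",
    "GET /index.php?id=12+union+select+password+from+users HTTP/1.1",
    "POST /login.php user=admin&pass=admin HTTP/1.1",
]

# Learn a small byte-pair-encoding vocabulary over the raw request strings.
bpe = Tokenizer(BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = Whitespace()
bpe.train_from_iterator(requests, trainer=BpeTrainer(vocab_size=200, special_tokens=["[UNK]"]))

# Feed BPE subword tokens into TF-IDF to get fixed-length feature vectors.
vectorizer = TfidfVectorizer(tokenizer=lambda s: bpe.encode(s).tokens,
                             lowercase=False, token_pattern=None)
X = vectorizer.fit_transform(requests)
print(X.shape)   # (3, vocabulary size) -> input for any downstream classifier
```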
Conference Paper
The latest threat intelligence platforms use structured protocols to share and analyze cyber-security data. However, most of this data is reported to the platform in the form of unstructured text such as social media posts, emails, and news articles, which then requires manual conversion to structured form. In order to bridge the gap between unstructured and structured data, we propose to implement a natural-language-processing (NLP)-based information extraction (IE) system that takes texts within the cyber-security domain and parses them into a structured format. Our approach targets the VERIS format and makes use of the VERIS Community Database as a source of unstructured texts (primarily consisting of news articles) and their structured counterparts (VERIS reports). We propose first to use a supervised machine learning (ML) classifier to discriminate between cyber-related and non-cyber-related texts, and then to use ML classifiers to decide which VERIS parameters are relevant in a given text. Then, we propose to use NLP and IE techniques to extract tuples of grammatically co-dependent words. Finally, these tuples will be passed to domain- and field-specific IE components to fill in different fields of an output VERIS report.
Article
Web request query strings (queries), which pass parameters to a referenced resource, are always manipulated by attackers to retrieve sensitive data and even take full control of victim web servers and web applications. However, existing malicious query detection approaches in the literature cannot cope with changing web attacks. In this paper, we introduce a novel adaptive system (AMOD) that can adaptively detect web-based code injection attacks, which are the majority of web attacks, by analyzing queries. We also present a new adaptive learning strategy, called SVM HYBRID, leveraged by our system to minimize manual work. In the evaluation, an up-to-date detection model is trained on a ten-day query dataset collected from an academic institute's web server logs. The evaluation shows that our approach surpasses existing approaches in two respects. Firstly, AMOD outperforms existing web attack detection methods with an F-value of 99.50% and an FP rate of 0.001%. Secondly, the total number of malicious queries obtained by SVM HYBRID is 3.07 times that obtained by the popular support vector machine adaptive learning (SVM AL) method. The malicious queries obtained can be used to update the web application firewall (WAF) signature library.
Conference Paper
Anomaly detection is a critical step towards building a secure and trustworthy system. The primary purpose of a system log is to record system states and significant events at various critical points to help debug system failures and perform root cause analysis. Such log data is universally available in nearly all computer systems. Log data is an important and valuable resource for understanding system status and performance issues; therefore, the various system logs are naturally excellent source of information for online monitoring and anomaly detection. We propose DeepLog, a deep neural network model utilizing Long Short-Term Memory (LSTM), to model a system log as a natural language sequence. This allows DeepLog to automatically learn log patterns from normal execution, and detect anomalies when log patterns deviate from the model trained from log data under normal execution. In addition, we demonstrate how to incrementally update the DeepLog model in an online fashion so that it can adapt to new log patterns over time. Furthermore, DeepLog constructs workflows from the underlying system log so that once an anomaly is detected, users can diagnose the detected anomaly and perform root cause analysis effectively. Extensive experimental evaluations over large log data have shown that DeepLog has outperformed other existing log-based anomaly detection methods based on traditional data mining methodologies.
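The core DeepLog idea, learning to predict the next log key from a window of previous keys and treating keys outside the model's top-g predictions as anomalous, can be illustrated with a small PyTorch sketch. The window size, dimensions, top-g value, and the toy workflow data are assumptions made for the example, not DeepLog's actual configuration.

```python
# Minimal sketch of next-log-key prediction for anomaly detection (DeepLog-style).
import torch
import torch.nn as nn

NUM_KEYS, WINDOW, TOP_G = 20, 5, 3
torch.manual_seed(0)

class NextKeyLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_KEYS, 16)
        self.lstm = nn.LSTM(16, 32, batch_first=True)
        self.fc = nn.Linear(32, NUM_KEYS)

    def forward(self, x):                 # x: (batch, WINDOW) of log-key ids
        out, _ = self.lstm(self.embed(x))
        return self.fc(out[:, -1])        # logits for the next key

# Toy "normal" executions: a repeating workflow 0,1,2,3,4,0,1,...
sequence = torch.tensor([i % 5 for i in range(200)])
windows = torch.stack([sequence[i:i + WINDOW] for i in range(len(sequence) - WINDOW)])
targets = sequence[WINDOW:]

model = NextKeyLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):
    opt.zero_grad()
    loss_fn(model(windows), targets).backward()
    opt.step()

# A key is anomalous if it is not among the TOP_G most likely next keys.
window = torch.tensor([[0, 1, 2, 3, 4]])
observed_key = 17                          # never seen after this pattern
topg = model(window).topk(TOP_G, dim=1).indices[0]
print("anomaly" if observed_key not in topg else "normal")
```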
Article
Visualizing outliers in massive datasets requires statistical pre-processing in order to reduce the scale of the problem to a size amenable to rendering systems like D3, Plotly or analytic systems like R or SAS. This paper presents a new algorithm, called hdoutliers, for detecting multidimensional outliers. It is unique for a) dealing with a mixture of categorical and continuous variables, b) dealing with big-p (many columns of data), c) dealing with big-n (many rows of data), d) dealing with outliers that mask other outliers, and e) dealing consistently with unidimensional and multidimensional datasets. Unlike ad hoc methods found in many machine learning papers, hdoutliers is based on a distributional model that allows outliers to be tagged with a probability. This critical feature reduces the likelihood of false discoveries.
Book
Regular expressions are a central element of UNIX utilities like egrep and programming languages such as Perl. But whether you're a UNIX user or not, you can benefit from a better understanding of regular expressions since they work with applications ranging from validating data-entry fields to manipulating information in multimegabyte text files. Mastering Regular Expressions quickly covers the basics of regular-expression syntax, then delves into the mechanics of expression-processing, common pitfalls, performance issues, and implementation-specific differences. Written in an engaging style and sprinkled with solutions to complex real-world problems, Mastering Regular Expressions offers a wealth of information that you can put to immediate use. Regular expressions are an extremely powerful tool for manipulating text and data. They are now standard features in a wide range of languages and popular tools, including Perl, Python, Ruby, Java, VB.NET and C# (and any language using the .NET Framework), PHP, and MySQL. If you don't use regular expressions yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a stunning eye-opener. As this book shows, a command of regular expressions is an invaluable skill. Regular expressions allow you to code complex and subtle text processing that you never imagined could be automated. Regular expressions can save you time and aggravation. They can be used to craft elegant solutions to a wide range of problems. Once you've mastered regular expressions, they'll become an invaluable part of your toolkit. You will wonder how you ever got by without them. Yet despite their wide availability, flexibility, and unparalleled power, regular expressions are frequently underutilized, and what is power in the hands of an expert can be fraught with peril for the unwary. Mastering Regular Expressions will help you navigate the minefield to becoming an expert and help you optimize your use of regular expressions. Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced PHP coverage in the central "core" chapters. Furthermore, this edition has been updated throughout to reflect advances in other languages, including expanded in-depth coverage of Sun's java.util.regex package, which has emerged as the standard Java regex implementation. Topics include: a comparison of features among different versions of many languages and tools; how the regular expression engine works; optimization (major savings available here!); matching just what you want, but not what you don't want; and sections and chapters on individual languages. Written in the lucid, entertaining tone that makes a complex, dry topic become crystal-clear to programmers, and sprinkled with solutions to complex real-world problems, Mastering Regular Expressions, Third Edition offers a wealth of information that you can put to immediate use. Reviews of this new edition and the second edition: "There isn't a better (or more useful) book available on regular expressions."
--Zak Greant, Managing Director, eZ Systems "A real tour-de-force of a book which not only covers the mechanics of regexes in extraordinary detail but also talks about efficiency and the use of regexes in Perl, Java, and .NET...If you use regular expressions as part of your professional work (even if you already have a good book on whatever language you're programming in) I would strongly recommend this book to you." --Dr. Chris Brown, Linux Format "The author does an outstanding job leading the reader from regex novice to master. The book is extremely easy to read and chock full of useful and relevant examples...Regular expressions are valuable tools that every developer should have in their toolbox. Mastering Regular Expressions is the definitive guide to the subject, and an outstanding resource that belongs on every programmer's bookshelf. Ten out of Ten Horseshoes." --Jason Menard, Java Ranch
Article
Network security problems, such as network failures and malicious attacks, are significant. Monitoring network traffic and detecting anomalies in it is an effective way to ensure network security. In this paper, we propose a hybrid method for network traffic prediction and anomaly detection. Specifically, the original network traffic data is decomposed into high-frequency and low-frequency components. Then, the non-linear Relevance Vector Machine (RVM) model and an ARMA (Auto-Regressive Moving Average) model are employed for prediction, respectively. After combining the predictions, a self-adaptive threshold method based on the Central Limit Theorem (CLT) is introduced for anomaly detection. Moreover, our extensive experiments evaluate the efficiency of the proposed method.
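A loose sketch of this prediction-plus-threshold pattern is shown below. Several substitutions are assumptions: an SVR stands in for the RVM (which is not available in scikit-learn), a simple moving average provides the low/high-frequency split, and the threshold is a plain three-sigma cutoff over the residuals rather than the paper's self-adaptive method.

```python
# Loose sketch of hybrid traffic prediction with a residual-based anomaly threshold.
import numpy as np
from sklearn.svm import SVR
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
t = np.arange(300)
traffic = 100 + 10 * np.sin(t / 12) + rng.normal(scale=2, size=300)
traffic[250] += 25                                    # injected traffic spike

low = np.convolve(traffic, np.ones(10) / 10, mode="same")   # low-frequency component
high = traffic - low                                         # high-frequency component

low_pred = ARIMA(low, order=(2, 0, 1)).fit().predict()      # in-sample one-step predictions

k = 3                                                 # lag order for the non-linear model
X = np.column_stack([high[i:len(high) - k + i] for i in range(k)])
y = high[k:]
high_pred = np.r_[high[:k], SVR(C=10.0).fit(X, y).predict(X)]

residual = traffic - (low_pred + high_pred)
threshold = residual.mean() + 3 * residual.std()      # simple 3-sigma cutoff
print("anomalies at:", np.where(residual > threshold)[0])
```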
Conference Paper
Logs play an important role in the maintenance of large-scale online service systems. When an online service fails, engineers need to examine recorded logs to gain insights into the failure and identify the potential problems. Traditionally, engineers perform simple keyword search (such as "error" and "exception") of logs that may be associated with the failures. Such an approach is often time consuming and error prone. Through our collaboration with Microsoft service product teams, we propose LogCluster, an approach that clusters the logs to ease log-based problem identification. LogCluster also utilizes a knowledge base to check if the log sequences occurred before. Engineers only need to examine a small number of previously unseen, representative log sequences extracted from the clusters to identify a problem, thus significantly reducing the number of logs that should be examined, meanwhile improving the identification accuracy. Through experiments on two Hadoop-based applications and two large-scale Microsoft online service systems, we show that our approach is effective and outperforms the state-of-the-art work proposed by Shang et al. in ICSE 2013. We have successfully applied LogCluster to the maintenance of many actual Microsoft online service systems. In this paper, we also share our success stories and lessons learned.