Article

HEOD: Human-assisted Ensemble Outlier Detection for cybersecurity

Authors:
  • German University of Digital Science
Article
Full-text available
Undoubtedly, the evolution of Generative AI (GenAI) models has been the highlight of digital transformation in the year 2022. As different GenAI models like ChatGPT and Google Bard continue to grow in complexity and capability, it is critical to understand their consequences from a cybersecurity perspective. Several recent instances have demonstrated the use of GenAI tools on both the defensive and offensive sides of cybersecurity and have drawn attention to the social, ethical and privacy implications this technology possesses. This research paper highlights the limitations, challenges, potential risks, and opportunities of GenAI in the domain of cybersecurity and privacy. The work presents the vulnerabilities of ChatGPT, which can be exploited by malicious users to extract harmful information by bypassing the ethical constraints on the model. The paper demonstrates successful example attacks like jailbreaks, reverse psychology, and prompt injection on ChatGPT. It also investigates how cyber offenders can use GenAI tools to develop cyber attacks and explores scenarios where ChatGPT can be used by adversaries to create social engineering attacks, phishing attacks, automated hacking, attack payload generation, malware creation, and polymorphic malware. The paper then examines defense techniques that use GenAI tools to improve security measures, including cyber defense automation, reporting, threat intelligence, secure code generation and detection, attack identification, developing ethical guidelines, incident response plans, and malware detection. We also discuss the social, legal, and ethical implications of ChatGPT. In conclusion, the paper highlights open challenges and future directions for making GenAI secure, safe, trustworthy, and ethical as the community comes to understand its cybersecurity impacts.
Article
Full-text available
Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
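The mechanics described above are compact enough to sketch directly. Below is a minimal Python/NumPy illustration of the idea, not the authors' released implementation; it simplifies ECOD by scoring each dimension with only the rarer of its two empirical tails and summing negative log tail probabilities:
```python
import numpy as np

def ecod_scores(X):
    """Minimal ECOD-style outlier scores: per-dimension empirical tail
    probabilities aggregated into one score (higher = more outlying)."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        x = X[:, j]
        # Empirical CDF evaluated at each point, from the left and the right.
        ecdf_left = np.searchsorted(np.sort(x), x, side="right") / n
        ecdf_right = 1.0 - np.searchsorted(np.sort(x), x, side="left") / n
        # Tail probability: take the rarer tail; clip to avoid log(0).
        tail = np.clip(np.minimum(ecdf_left, ecdf_right), 1.0 / n, 1.0)
        scores += -np.log(tail)  # aggregate negative log tail probabilities
    return scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 3)), rng.normal(6, 1, (5, 3))])
print(np.argsort(ecod_scores(X))[-5:])  # the 5 injected outliers rank highest
```
Because the score is a plain sum of per-dimension terms, each dimension's contribution can be read off directly, which is where the method's interpretability comes from.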
Research
Full-text available
Anomaly detection from big cybersecurity datasets is very important; however, it is a very challenging and computationally expensive task. Feature selection (FS) is an approach to remove irrelevant and redundant features and select a subset of features that can improve the performance of machine learning algorithms. In fact, FS is an effective preprocessing step for anomaly detection techniques. This article's main objective is to improve and quantify the accuracy and scalability of both supervised and unsupervised anomaly detection techniques. In this effort, a novel anomaly detection approach using FS, called Anomaly Detection Using Feature Selection (ADUFS), has been introduced. Experimental analysis was performed on five different benchmark cybersecurity datasets with and without feature selection, and the performance of both supervised and unsupervised anomaly detection techniques was investigated. The experimental results indicate that instead of using the original dataset, a dataset with a reduced number of features yields better performance in terms of true positive rate (TPR) and false positive rate (FPR) than the existing techniques for anomaly detection. For example, with FS, a supervised anomaly detection technique, the multilayer perceptron, increased the TPR by over 200% and decreased the FPR by about 97% for the KDD99 dataset. Similarly, with FS, an unsupervised anomaly detection technique, the local outlier factor, increased the TPR by more than 40% and decreased the FPR by 15% and 36% for the Windows 7 and NSL-KDD datasets, respectively. In addition, all anomaly detection techniques require less computational time when using datasets with a suitable subset of features rather than entire datasets. Furthermore, the performance results have been compared with six other state-of-the-art techniques based on a decision tree (J48).
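The abstract does not spell out ADUFS's selection procedure, but the effect it reports, that pruning irrelevant features improves TPR and FPR, can be illustrated in spirit. The sketch below uses synthetic data and a generic scikit-learn selector (both are assumptions, not the paper's method) around the local outlier factor:
```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
# Toy stand-in for a labeled cybersecurity dataset: 2 informative features,
# 18 irrelevant ones, ~3% anomalies shifted in the informative features.
X = rng.normal(size=(2000, 20))
y = (rng.random(2000) < 0.03).astype(int)
X[y == 1, :2] += 4.0

def tpr_fpr(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / max(y_true.sum(), 1), fp / max((y_true == 0).sum(), 1)

selected = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
for name, Xv in [("all features", X), ("selected", selected)]:
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
    pred = (lof.fit_predict(Xv) == -1).astype(int)  # -1 marks outliers
    print(name, "TPR=%.2f FPR=%.2f" % tpr_fpr(y, pred))
```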
Article
Full-text available
Outlier algorithms are becoming increasingly complex. Thereby, they become much less interpretable to the data scientists applying the algorithms in real-life settings and to end-users using their predictions. We argue that outliers are context-dependent and, therefore, can only be detected via domain knowledge, algorithm insight, and interaction with end-users. As outlier detection is equivalent to unsupervised semantic binary classification, at the core of interpreting an outlier algorithm we find the semantics of the classes, i.e., the algorithm’s conceptual outlier definition. We investigate current interpretable and explainable outlier algorithms: what they are, for whom they are, and what their value proposition is. We then discuss how interpretation and explanation and user involvement have the potential to provide the missing link to bring modern complex outlier algorithms from computer science labs into real-life applications and the challenges they induce.
Article
Full-text available
Heart disease, one of the main reasons behind the high mortality rate around the world, requires a sophisticated and expensive diagnosis process. In the recent past, much literature has demonstrated machine learning approaches as an opportunity to efficiently diagnose heart disease patients. However, challenges associated with datasets such as missing data, inconsistent data, and mixed data (containing inconsistent missing data both as numerical and categorical) are often obstacles in medical diagnosis. This inconsistency leads to a higher probability of misprediction and misleading results. Data preprocessing steps like feature reduction, data conversion, and data scaling are employed to form a standard dataset; such measures play a crucial role in reducing inaccuracy in the final prediction. This paper aims to evaluate eleven machine learning (ML) algorithms (Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost (AB), Extra Tree Classifier (ET)) and six different data scaling methods (Normalization (NR), Standard Scaler (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT)) on a dataset comprising information on patients with heart disease. The results show that CART, along with RS or QT, outperforms all other ML algorithms with 100% accuracy, 100% precision, 99% recall, and a 100% F1 score. The study outcomes demonstrate that a model's performance varies depending on the data scaling method.
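As a concrete illustration of the scaling comparison, the sketch below pairs each of the six scalers with a K-nearest-neighbors classifier, one of the eleven algorithms evaluated and a distance-based one whose accuracy genuinely depends on scaling. The heart disease dataset itself is not available here, so a bundled scikit-learn dataset stands in for it (an assumption); only the varying-performance pattern is the point:
```python
from sklearn.datasets import load_breast_cancer   # stand-in for the heart dataset
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   QuantileTransformer, RobustScaler, StandardScaler)

X, y = load_breast_cancer(return_X_y=True)
scalers = {"NR": Normalizer(), "SS": StandardScaler(), "MM": MinMaxScaler(),
           "MA": MaxAbsScaler(), "RS": RobustScaler(),
           "QT": QuantileTransformer(n_quantiles=100)}
for name, scaler in scalers.items():
    # Cross-validated accuracy of the same classifier under each scaler.
    pipe = make_pipeline(scaler, KNeighborsClassifier(n_neighbors=5))
    print(name, round(cross_val_score(pipe, X, y, cv=5).mean(), 3))
```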
Article
Full-text available
Anomaly detection, a.k.a. outlier detection or novelty detection, has been a lasting yet active research area in various research communities for several decades. There are still some unique problem complexities and challenges that require advanced approaches. In recent years, deep learning-enabled anomaly detection, i.e., deep anomaly detection, has emerged as a critical direction. This article surveys the research of deep anomaly detection with a comprehensive taxonomy, covering advancements in 3 high-level categories and 11 fine-grained categories of the methods. We review their key intuitions, objective functions, underlying assumptions, advantages, and disadvantages and discuss how they address the aforementioned challenges. We further discuss a set of possible future opportunities and new perspectives on addressing the challenges.
Article
Full-text available
The evolution of attack scenarios has been such that finding efficient and optimal Network Intrusion Detection Systems (NIDS) with frequent updates has become a big challenge. NIDS implementation using machine learning (ML) techniques and updated intrusion datasets is one solution for effective modeling of NIDS. This article presents a brief description of publicly available labeled intrusion datasets and ML techniques. It then briefly reviews works in which machine learning techniques are applied for NIDS implementation in different networking scenarios, such as traditional networks, cloud networks, ad-hoc networks, WSNs, and IoT networks. Hence, this article brings together publicly available intrusion datasets and the machine learning techniques utilized in recent intrusion detection systems to reveal present-day challenges and future directions. This article also explains problems associated with NIDS. This will help researchers to enhance existing NIDS models as well as to develop new, effective models.
Conference Paper
Full-text available
Large enterprises are increasingly relying on threat detection software (e.g., Intrusion Detection Systems) to spot suspicious activities. This software generates alerts which must be investigated by cyber analysts to figure out whether they are true attacks. Unfortunately, in practice, there are more alerts than cyber analysts can properly investigate. This leads to a “threat alert fatigue” or information overload problem where cyber analysts miss true attack alerts in the noise of false alarms. In this paper, we present NoDoze to combat this challenge using contextual and historical information about generated threat alerts in an enterprise. NoDoze first generates a causal dependency graph of an alert event. Then, it assigns an anomaly score to each event in the dependency graph based on the frequency with which related events have happened before in the enterprise. NoDoze then propagates those scores along the edges of the graph using a novel network diffusion algorithm and generates a subgraph with an aggregate anomaly score which is used to triage alerts. Evaluation on our dataset of 364 threat alerts shows that NoDoze decreases the volume of false alarms by 86%, saving more than 90 hours of analysts' time that would otherwise have been required to investigate those false alarms. Furthermore, the dependency graphs NoDoze generates for true alerts are two orders of magnitude smaller than those generated by traditional tools, without sacrificing the vital information needed for investigation. Our system has a low average runtime overhead and can be deployed with any threat detection software.
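The frequency-based scoring and diffusion step can be caricatured in a few lines. The sketch below is a simplified stand-in, not NoDoze's actual algorithm: hypothetical event frequencies seed rarity scores, which are then aggregated down a toy dependency graph with a decay factor:
```python
import math

# Hypothetical dependency graph and historical event frequencies: rarer
# events get higher initial anomaly scores (the paper's core intuition).
edges = {"alert": ["powershell.exe"], "powershell.exe": ["evil.dll", "kb.log"]}
freq = {"alert": 0.20, "powershell.exe": 0.10, "evil.dll": 0.001, "kb.log": 0.5}

score = {event: -math.log(f) for event, f in freq.items()}  # rarity seeds

def propagate(node, decay=0.7):
    """Aggregate anomaly score over the dependency subgraph rooted at node,
    decaying contributions along edges (a simplified diffusion)."""
    total = score[node]
    for child in edges.get(node, []):
        total += decay * propagate(child, decay)
    return total

print(propagate("alert"))  # aggregate score used to triage this alert
```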
Conference Paper
Full-text available
Outlier detection refers to the identification of rare items that deviate from the general data distribution. Existing approaches suffer from high computational complexity, low predictive capability, and limited interpretability. As a remedy, we present a novel outlier detection algorithm called COPOD, which is inspired by copulas for modeling multivariate data distributions. COPOD first constructs an empirical copula, and then uses it to predict the tail probabilities of each given data point to determine its level of "extremeness". Intuitively, we think of this as calculating an anomalous p-value. This makes COPOD parameter-free, highly interpretable, and computationally efficient. In this work, we make three key contributions: 1) we propose a novel, parameter-free outlier detection algorithm with both great performance and interpretability, 2) we perform extensive experiments on 30 benchmark datasets to show that COPOD outperforms in most cases and is also one of the fastest algorithms, and 3) we release an easy-to-use Python implementation for reproducibility.
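A COPOD implementation ships with the PyOD Python library (assuming pyod is installed; whether this is the exact release mentioned above is an assumption). A minimal usage sketch on synthetic data:
```python
# pip install pyod
import numpy as np
from pyod.models.copod import COPOD

rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0, 1, (950, 8)), rng.normal(5, 1, (50, 8))])

clf = COPOD()  # detector is parameter-free; contamination only thresholds labels_
clf.fit(X_train)
print(clf.decision_scores_[:5])  # higher score = more "extreme" tail probability
print(clf.labels_.sum(), "points flagged as outliers")
```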
Article
Full-text available
According to a study by Cybersecurity Ventures, cybercrime is expected to cost $6 trillion annually by 2021. Most cybersecurity threats access internal networks through infected endpoints. Recently, various endpoint environments such as smartphones, tablets, and Internet of Things (IoT) devices have come into use, and security issues caused by malware targeting them are intensifying. Event-log-based detection technology for endpoint security relies on rules or patterns; it can therefore respond to known attacks, but responding immediately to unknown attacks is difficult. To solve this problem, this paper uses the local outlier factor (LOF) and an autoencoder to detect suspicious behavior that deviates from normal behavior. It also flags threats when suspicious events matching rules created from attack profiles occur repeatedly. Experimental results detected eight new suspicious processes that had not previously been detected; four malicious processes and one suspicious process were confirmed using Hybrid Analysis and VirusTotal. Based on the experimental results, the use of operational policies such as allowlists in the proposed model is expected to significantly improve performance by minimizing false positives.
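Of the two detectors mentioned, LOF is the simpler to demonstrate. The sketch below applies scikit-learn's LocalOutlierFactor to hypothetical per-process behavior features (synthetic numbers, not the paper's data); the autoencoder half of the approach is omitted:
```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
# Hypothetical per-process features, e.g. (events/min, distinct files touched).
normal = rng.normal([10, 5], [2, 1], size=(300, 2))
suspicious = np.array([[40.0, 25.0], [35.0, 1.0]])  # deviant behavior
X = np.vstack([normal, suspicious])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)           # -1 marks local-density outliers
print(np.where(labels == -1)[0])      # indices of flagged processes
```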
Article
Full-text available
The A* algorithm is mostly used to find the shortest path in grid- and map-based settings such as road and rail networks. A* finds a path using a heuristic technique; its main idea is to compute the path from start to destination very quickly. The main contribution of this paper is a study of two distance metrics, Euclidean and Manhattan, used as heuristics in A*. Experiments with both metrics in the A* algorithm validate the study, and the results present a comparative analysis of the two metrics.
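A minimal A* implementation makes the comparison concrete. The grid, wall, and unit step costs below are illustrative assumptions, with both heuristics plugged into the same search:
```python
import heapq

def astar(grid, start, goal, heuristic):
    """A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle.
    Returns the cost of the shortest path, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    open_set = [(heuristic(start, goal), 0, start)]
    g = {start: 0}
    while open_set:
        _, cost, node = heapq.heappop(open_set)
        if node == goal:
            return cost
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < g.get((nr, nc), float("inf")):
                    g[(nr, nc)] = new_cost
                    f = new_cost + heuristic((nr, nc), goal)
                    heapq.heappush(open_set, (f, new_cost, (nr, nc)))
    return None

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
euclidean = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

grid = [[0] * 6 for _ in range(6)]
grid[2][1:5] = [1, 1, 1, 1]  # a wall the search must route around
for h in (manhattan, euclidean):
    print(astar(grid, (0, 0), (5, 5), h))
```
On a 4-connected grid both heuristics are admissible and return the same optimal cost; since Manhattan distance is never smaller than Euclidean distance here, it guides the search more tightly and typically expands fewer nodes.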
Article
Full-text available
An advanced persistent threat (APT) can be defined as a targeted and very sophisticated cyber attack. IT administrators need tools that allow for the early detection of these attacks. Several approaches based on the attack life cycle have been proposed to solve this problem, and machine learning techniques have recently been incorporated into them to improve detection. This paper proposes a new approach to APT detection that uses machine learning techniques and is based on the life cycle of an APT attack. The proposed model is organised into two passive stages and three active stages, adapting the mitigation techniques with machine learning.
Conference Paper
Full-text available
To subvert recent advances in perimeter and host security, the attacker community has developed and employed various attack vectors to make malware much stealthier than before, so that it can penetrate the target system and prolong its presence there. Such advanced malware, or stealthy malware, impersonates or abuses benign applications and legitimate system tools to minimize its footprint in the target system. One example of such stealthy malware is fileless malware, which keeps its malicious logic mostly in the memory of well-trusted processes. It is difficult for traditional detection tools, such as malware scanners, to detect it, as the malware normally does not expose its malicious payload in a file and hides its malicious behaviors among the benign behaviors of the processes. In this paper, we present PROVDETECTOR, a provenance-based approach for detecting stealthy malware. The intuition behind PROVDETECTOR is that although a stealthy malware may impersonate or abuse a benign process, it still exposes its malicious behaviors in OS (operating system) level provenance. Based on this intuition, PROVDETECTOR first employs a novel selection algorithm to identify possibly malicious parts in the OS-level provenance data of a process. Then, it applies a neural embedding and machine learning pipeline to automatically detect any behavior that deviates significantly from normal behaviors. We evaluate our approach on a large provenance dataset from an enterprise network and demonstrate that it achieves very high detection performance (an average F1 score of 0.974) on stealthy malware. Further, we conduct thorough interpretability studies to understand the internals of the learned machine learning models.
Article
Full-text available
With the evolution of cybersecurity countermeasures, the threat landscape has also evolved, especially malware, from traditional file-based malware to sophisticated and multifarious fileless malware. Fileless malware does not use traditional executables to carry out its activities, so it does not touch the file system and thereby evades signature-based detection systems. A fileless malware attack is catastrophic for any enterprise because of its persistence and its power to evade anti-virus solutions. Such malware leverages operating system features and trusted tools to accomplish its malicious intent. To analyze such malware, security professionals use forensic tools to trace the attacker, whereas the attacker might use anti-forensics tools to erase their traces. This survey provides a comprehensive analysis of fileless malware and the detection techniques available in the literature. We present a process model for handling fileless malware attacks in the incident response process. In the end, the specific research gaps present in the proposed process model are identified, and the associated challenges are highlighted.
Chapter
Full-text available
Deep learning has made a significant contribution to the recent progress in artificial intelligence. In comparison to traditional machine learning methods such as decision trees and support vector machines, deep learning methods have achieved substantial improvement in various prediction tasks. However, deep neural networks (DNNs) are comparatively weak at explaining their inference processes and final results, and they are typically treated as a black box by both developers and users. Some even consider DNNs at the current stage to be more alchemy than real science. In many real-world applications such as business decisions, process optimization, medical diagnosis, and investment recommendation, the explainability and transparency of AI systems is particularly essential for their users, for the people who are affected by AI decisions, and, furthermore, for the researchers and developers who create the AI solutions. In recent years, explainability and explainable AI have received increasing attention from both the research community and industry. This paper first introduces the history of explainable AI, starting from expert systems and traditional machine learning approaches and continuing to the latest progress in the context of modern deep learning, and then describes the major research areas and state-of-the-art approaches of recent years. The paper ends with a discussion of the challenges and future directions.
Article
Full-text available
Detecting outliers is a significant problem that has been studied in various research and application areas. Researchers continue to design robust schemes to detect outliers efficiently. In this survey, we present a comprehensive and organized review of the progress of outlier detection methods from 2000 to 2019. First, we present the fundamental concepts of outlier detection and then categorize existing methods into distance-, clustering-, density-, ensemble-, and learning-based techniques. In each category, we introduce some state-of-the-art outlier detection methods and discuss them in detail in terms of their performance. Second, we delineate their pros, cons, and challenges to provide researchers with a concise overview of each technique and recommend solutions and possible research directions. This paper presents the current progress of outlier detection techniques and provides a better understanding of the different outlier detection methods. The open research issues and challenges at the end will provide researchers with a clear path for the future of outlier detection methods.
Article
Full-text available
Cyber-attacks are becoming more sophisticated and thereby presenting increasing challenges in accurately detecting intrusions. Failure to prevent the intrusions could degrade the credibility of security services, e.g. data confidentiality, integrity, and availability. Numerous intrusion detection methods have been proposed in the literature to tackle computer security threats, which can be broadly classified into Signature-based Intrusion Detection Systems (SIDS) and Anomaly-based Intrusion Detection Systems (AIDS). This survey paper presents a taxonomy of contemporary IDS, a comprehensive review of notable recent works, and an overview of the datasets commonly used for evaluation purposes. It also presents evasion techniques used by attackers to avoid detection and discusses future research challenges to counter such techniques so as to make computer systems more secure.
Article
Full-text available
Anomaly detection has numerous applications in diverse fields. For example, it has been widely used for discovering network intrusions and malicious events. It has also been used in numerous other applications such as identifying medical malpractice or credit fraud. Detection of anomalies in quantitative data has received considerable attention in the literature and has a venerable history. By contrast, and despite the widespread use of categorical data in practice, anomaly detection in categorical data has received relatively little attention compared to quantitative data. This is because detection of anomalies in categorical data is a challenging problem. Some anomaly detection techniques depend on identifying a representative pattern and then measuring distances between objects and this pattern. Objects that are far from this pattern are declared anomalies. However, identifying patterns and measuring distances are not as easy in categorical data as in quantitative data. Fortunately, several papers focussing on the detection of anomalies in categorical data have been published in the recent literature. In this article, we provide a comprehensive review of the research on the anomaly detection problem in categorical data. Previous review articles focus on either the statistics literature or the machine learning and computer science literature; this review combines both. We review 36 methods for the detection of anomalies in categorical data in both literatures and classify them into 12 different categories based on the conceptual definition of anomalies they use. For each approach, we survey the anomaly detection methods and then show the similarities and differences among them. We emphasize two important issues: the number of parameters each method requires and its time complexity. The first issue is critical, because the performance of these methods is sensitive to the choice of these parameters. Time complexity is also very important in real applications, especially in big data applications. We report the time complexity if it is reported by the authors of a method; if it is not, we derive it ourselves and report it in this article. In addition, we discuss the common problems and future directions of anomaly detection in categorical data.
Article
Full-text available
Nowadays, there is a huge and growing concern about security in information and communication technology among the scientific community because any attack or anomaly in the network can greatly affect many domains such as national security, private data storage, social welfare, economic issues, and so on. Therefore, the anomaly detection domain is a broad research area, and many different techniques and approaches for this purpose have emerged through the years. In this study, the main objective is to review the most important aspects pertaining to anomaly detection, covering an overview of a background analysis as well as a core study on the most relevant techniques, methods, and systems within the area. Therefore, in order to ease the understanding of this survey’s structure, the anomaly detection domain was reviewed under five dimensions: (1) network traffic anomalies, (2) network data types, (3) intrusion detection systems categories, (4) detection methods and systems, and (5) open issues. The paper concludes with an open issues summary discussing presently unsolved problems, and final remarks.
Article
Full-text available
This paper presents a review of three datasets, namely the KDD Cup ‘99, NSL-KDD and Kyoto 2006+ datasets, which are widely used in research on intrusion detection in computer networks. The KDD Cup ‘99 dataset consists of five million records, each containing 41 features, which can classify malicious attacks into four classes: Probe, DoS, U2R and R2L. The KDD Cup ‘99 dataset cannot reflect real traffic data since it was generated by simulation over a virtual computer network. In the NSL-KDD dataset, redundant and duplicate records from the KDD Cup ‘99 dataset are removed from the training and test sets, respectively. The Kyoto 2006+ dataset is built on three years of real network traffic data, labeled as normal (no attack), attack (known attack) and unknown attack. The Kyoto 2006+ dataset contains 14 statistical features derived from the KDD Cup ‘99 dataset and 10 additional features.
Conference Paper
Full-text available
Deep learning has recently demonstrated state-of-the-art performance on key tasks related to the maintenance of computer systems, such as intrusion detection, denial-of-service attack detection, hardware and software system failures, and malware detection. In these contexts, model interpretability is vital for administrators and analysts to trust and act on the automated analysis of machine learning models. Deep learning methods have been criticized as black-box oracles which allow limited insight into decision factors. In this work we seek to bridge the gap between the impressive performance of deep learning models and the need for interpretable model introspection. To this end we present recurrent neural network (RNN) language models augmented with attention for anomaly detection in system logs. Our methods are generally applicable to any computer system and logging source. By incorporating attention variants into our RNN language models we create opportunities for model introspection and analysis without sacrificing state-of-the-art performance. We demonstrate model performance and illustrate model interpretability on an intrusion detection task using the Los Alamos National Laboratory (LANL) cyber security dataset, reporting upward of 0.99 area under the receiver operating characteristic curve despite being trained only on a single day's worth of data.
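The language-model scoring idea can be sketched compactly: train on normal log lines, then score a line by its per-token negative log-likelihood. The PyTorch sketch below is a simplified stand-in that omits the paper's attention variants and uses made-up vocabulary sizes and token batches:
```python
import torch
import torch.nn as nn

class LogLM(nn.Module):
    """Tiny LSTM language model over tokenized log lines; the anomaly score
    of a line is its mean per-token negative log-likelihood."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)                     # logits: (batch, seq_len, vocab)

def anomaly_score(model, tokens):
    logits = model(tokens[:, :-1])             # predict each next token
    nll = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1), reduction="none")
    return nll.view(tokens.size(0), -1).mean(dim=1)  # higher = more anomalous

model = LogLM(vocab_size=1000)
batch = torch.randint(0, 1000, (4, 12))        # 4 fake tokenized log lines
print(anomaly_score(model, batch))             # untrained: roughly ln(1000)
```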
Article
Full-text available
An intrusion detection system (IDS) is hardware, software, or a combination of the two that monitors network or system activities to detect malicious signs. In computer security, designing a robust intrusion detection system is one of the most fundamental and important problems. The primary function of such a system is to detect intrusions and raise alerts in a timely manner; when the IDS finds an intrusion, it sends an alert message to the system administrator. Anomaly detection is an important problem that has been researched within diverse research areas and application domains. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. Each existing anomaly detection technique has relative strengths and weaknesses. The current state of experimental practice in the field of anomaly-based intrusion detection is reviewed, together with recent studies in this area. This survey provides a study of existing anomaly detection techniques and of how the techniques used in one area can be applied in another application domain.
Article
Full-text available
With the ubiquity of services and applications available anywhere and at any time, cloud computing is the best option as it offers flexible, pay-per-use services to its customers. Nevertheless, security and privacy are the main challenges to its success due to its dynamic and distributed architecture, which generates big data that should be carefully analysed for detecting network vulnerabilities. In this paper, we propose a Collaborative Anomaly Detection Framework (CADF) for detecting cyber attacks in cloud computing environments. We describe the technical functions and deployment of the framework to illustrate its methodology of implementation and installation. The framework is evaluated on the UNSW-NB15 dataset to check its credibility for deployment in cloud computing environments. The experimental results showed that this framework can easily handle large-scale systems, as its implementation requires only the estimation of statistical measures from network observations. Moreover, the evaluated performance of the framework outperforms three state-of-the-art techniques in terms of false positive rate and detection rate.
Article
Full-text available
Intrusion detection plays an important role in ensuring information security, and the key technology is to accurately identify various attacks in the network. In our study, we explore how to model an intrusion detection system based on deep learning, and we propose a deep learning approach for intrusion detection using recurrent neural networks (RNN-IDS). Moreover, we study the performance of the model in binary classification and multiclass classification, and how the number of neurons and different learning rates impact the performance of the proposed model. We compare it with J48, Artificial Neural Network, Random Forest, Support Vector Machine and other machine learning methods proposed by previous researchers on the benchmark dataset. The experimental results show that RNN-IDS is very suitable for modelling a classification model with high accuracy and that its performance is superior to that of traditional machine learning classification methods in both binary and multiclass classification. The RNN-IDS model improves the accuracy of intrusion detection and provides a new research method for intrusion detection.
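A simplified PyTorch stand-in for the RNN-IDS setup is sketched below; the 41-feature records, GRU cell, sequence length, and learning rate are illustrative assumptions rather than the paper's exact configuration:
```python
import torch
import torch.nn as nn

class RNNIDS(nn.Module):
    """Simplified RNN intrusion classifier in the spirit of RNN-IDS:
    a sequence of per-connection feature vectors -> attack / normal."""
    def __init__(self, n_features=41, hidden=80, n_classes=2):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, seq_len, n_features)
        _, h = self.rnn(x)            # h: (1, batch, hidden), final state
        return self.fc(h.squeeze(0))  # class logits

model = RNNIDS()                      # 41 features as in KDD-style records
opt = torch.optim.Adam(model.parameters(), lr=0.01)  # learning rate to tune
x = torch.randn(32, 5, 41)            # toy batch of short sequences
y = torch.randint(0, 2, (32,))        # binary labels; n_classes covers multiclass
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
print(float(loss))
```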
Article
Full-text available
A great deal of attention has been given to deep learning over the past several years, and new deep learning techniques are emerging with improved functionality. Many computer and network applications actively utilize such deep learning algorithms and report enhanced performance through them. In this study, we present an overview of deep learning methodologies, including restricted Boltzmann machine-based deep belief networks, deep neural networks, and recurrent neural networks, as well as the machine learning techniques relevant to network anomaly detection. In addition, this article introduces the latest work that employs deep learning techniques for network anomaly detection through an extensive literature survey. We also discuss our local experiments showing the feasibility of the deep learning approach to network traffic analysis.
Article
Full-text available
Discretization is a data pre-processing task that transforms continuous variables into discrete ones in order to apply data mining algorithms such as association rule extraction and classification trees. In this study we empirically compared the performance of the equal width intervals (EWI), equal frequency intervals (EFI) and K-means clustering (KMC) methods for discretizing 14 continuous variables in a chicken egg quality traits dataset. We show that these unsupervised discretization methods can decrease the training error rates and increase the test accuracies of classification tree models. By comparing the training errors and test accuracies of models built with the C5.0 classification tree algorithm, we also found that the EWI, EFI and KMC methods produced more or less similar results. Among the rules used for estimating the number of intervals, the Rice rule gave the best result with EWI but not with EFI. The Freedman-Diaconis rule with EFI, and the Doane rule with EFI and EWI, performed slightly better than the other rules.
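The three discretization methods are easy to sketch side by side. Below, NumPy and scikit-learn discretize one synthetic skewed variable (an assumption standing in for the egg-quality traits), with the interval count set by the Rice rule, k = ⌈2·n^(1/3)⌉:
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
x = rng.gamma(2.0, 2.0, 500)                 # a skewed continuous trait
k = int(np.ceil(2 * len(x) ** (1 / 3)))      # Rice rule: 16 intervals for n=500

# EWI: equal-width bin edges; EFI: equal-frequency (quantile) edges.
ewi = np.digitize(x, np.linspace(x.min(), x.max(), k + 1)[1:-1])
efi = np.digitize(x, np.quantile(x, np.linspace(0, 1, k + 1))[1:-1])
# KMC: 1-D K-means cluster labels act as the discrete categories.
kmc = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x.reshape(-1, 1))

for name, bins in [("EWI", ewi), ("EFI", efi), ("KMC", kmc)]:
    print(name, np.bincount(bins, minlength=k))  # points per interval/cluster
```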
Article
Full-text available
Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large datasets. In this paper, we present a survey of outlier detection techniques to reflect the recent advancements in this field. The survey will not only cover the traditional outlier detection methods for static and low dimensional datasets but also review the more recent developments that deal with more complex outlier detection problems for dynamic/streaming and high-dimensional datasets.
Conference Paper
Full-text available
One of the major research challenges in this field is the unavailability of a comprehensive network-based data set that reflects modern network traffic scenarios, a vast variety of low-footprint intrusions, and in-depth structured information about the network traffic. The KDD98, KDDCUP99 and NSLKDD benchmark data sets used to evaluate network intrusion detection research were generated a decade ago. However, numerous current studies have shown that, for the current network threat environment, these data sets do not inclusively reflect network traffic and modern low-footprint attacks. To counter the unavailability of a network benchmark data set, this paper examines the creation of the UNSW-NB15 data set. This data set contains a hybrid of real modern normal activities and contemporary synthesized attack activities in the network traffic. Existing and novel methods are utilised to generate the features of the UNSW-NB15 data set. This data set is available for research purposes and can be accessed from the links: 1. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7348942&filter%3DAND%28p_IS_Number%3A7348936%29 2. https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-NB15-Datasets/
Article
With the rise of technology and the continued economic growth evident in modern society, acts of fraud have become much more prevalent in the financial industry, costing institutions and consumers hundreds of billions of dollars annually. Fraudsters are continuously evolving their approaches to exploit the vulnerabilities of the current prevention measures in place, many of whom are targeting the financial sector. These crimes include credit card fraud, healthcare and automobile insurance fraud, money laundering, securities and commodities fraud and insider trading. On their own, fraud prevention systems do not provide adequate security against these criminal acts. As such, the need for fraud detection systems to detect fraudulent acts after they have already been committed and the potential cost savings of doing so is more evident than ever. Anomaly detection techniques have been intensively studied for this purpose by researchers over the last couple of decades, many of which employed statistical, artificial intelligence and machine learning models. Supervised learning algorithms have been the most popular types of models studied in research up until recently. However, supervised learning models are associated with many challenges that have been and can be addressed by semi-supervised and unsupervised learning models proposed in recently published literature. This survey aims to investigate and present a thorough review of the most popular and effective anomaly detection techniques applied to detect financial fraud, with a focus on highlighting the recent advancements in the areas of semi-supervised and unsupervised learning.
Chapter
Within today’s organizations, a Security Information and Event Management (SIEM) system is the centralized repository expected to aggregate all security-relevant data. While the primary purpose of SIEM solutions has been regulatory compliance, more and more organizations recognize the value of these systems for threat detection due to their holistic view of the entire enterprise. Today’s mature Security Operation Centers dedicate several teams to threat hunting, pattern/correlation rule creation, and alert monitoring. However, traditional SIEM systems lack the capability for advanced analytics as they were designed for different purposes using technologies that are now more than a decade old. In this paper, we discuss the requirements for a next-generation SIEM system that emphasizes analytical capabilities to allow advanced data science and engineering. Next, we propose a reference architecture that can be used to design such systems. We describe our experience in implementing a next-gen SIEM with advanced analytical capabilities, both in academia and industry. Lastly, we illustrate the importance of advanced analytics within today’s SIEM with a simple yet complex use case of beaconing detection.
Conference Paper
In this paper, we formulate threat detection in SIEM environments as a large-scale graph inference problem. We introduce a SIEM-based knowledge graph which models global associations among entities observed in proxy and DNS logs, enriched with related open source intelligence (OSINT) and cyber threat intelligence (CTI). Next, we propose MalRank, a graph-based inference algorithm designed to infer a node maliciousness score based on its associations to other entities presented in the knowledge graph, e.g., shared IP ranges or name servers. After a series of experiments on real-world data captured from a global enterprise's SIEM (spanning over 3TB of disk space), we show that MalRank maintains a high detection rate (AUC = 96%) outperforming its predecessor, Belief Propagation, both in terms of accuracy and efficiency. Furthermore, we show that this approach is effective in identifying previously unknown malicious entities such as malicious domain names and IP addresses. The system proposed in this research can be implemented in conjunction with an organization's SIEM, providing a maliciousness score for all observed entities, hence aiding SOC investigations.
Conference Paper
Online retailers execute a very large number of price updates when compared to brick-and-mortar stores. Even a few mis-priced items can have a significant business impact and result in a loss of customer trust. Early detection of anomalies in an automated real-time fashion is an important part of such a pricing system. In this paper, we describe unsupervised and supervised anomaly detection approaches we developed and deployed for a large-scale online pricing system at Walmart. Our system detects anomalies both in batch and real-time streaming settings, and the items flagged are reviewed and actioned based on priority and business impact. We found that having the right architecture design was critical to facilitate model performance at scale, and business impact and speed were important factors influencing model selection, parameter choice, and prioritization in a production environment for a large-scale system. We conducted analyses on the performance of various approaches on a test set using real-world retail data and fully deployed our approach into production. We found that our approach was able to detect the most important anomalies with high precision.
Chapter
Ransomware has become a significant global threat with the ransomware-as-a-service model enabling easy availability and deployment, and the potential for high revenues creating a viable criminal business model. Individuals, private companies or public service providers e.g. healthcare or utilities companies can all become victims of ransomware attacks and consequently suffer severe disruption and financial loss. Although machine learning algorithms are already being used to detect ransomware, variants are being developed to specifically evade detection when using dynamic machine learning techniques. In this paper we introduce NetConverse, a machine learning evaluation study for consistent detection of Windows ransomware network traffic. Using a dataset created from conversation-based network traffic features we achieved a True Positive Rate (TPR) of 97.1% using the Decision Tree (J48) classifier.
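The conversation-based classification pipeline reduces to a standard supervised recipe. The sketch below substitutes scikit-learn's DecisionTreeClassifier for Weka's J48 (both are C4.5-style trees) and fabricates illustrative conversation features, so the numbers it prints are not the paper's results:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
# Hypothetical conversation-level features: duration, packets, bytes, ports.
X_benign = rng.normal([30, 50, 4000, 2], [10, 20, 1500, 1], size=(900, 4))
X_ransom = rng.normal([5, 200, 500, 20], [2, 50, 200, 5], size=(100, 4))
X = np.vstack([X_benign, X_ransom])
y = np.array([0] * 900 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # J48 analogue
tpr = (clf.predict(X_te)[y_te == 1] == 1).mean()
print(f"TPR on held-out ransomware conversations: {tpr:.3f}")
```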
Chapter
Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. In most applications, the data is created by one or more generating processes, which could either reflect activity in the system or observations collected about entities. When the generating process behaves unusually, it results in the creation of outliers. Therefore, an outlier often contains useful information about abnormal characteristics of the systems and entities that impact the data generation process. The recognition of such unusual characteristics provides useful application-specific insights.
Conference Paper
The increase in the number and variety of malware samples amplifies the need for improvements in the automatic detection and classification of malware variants. Machine learning is a natural choice to cope with this increase, because it addresses the need to discover underlying patterns in large-scale datasets. Nowadays, neural network methodology has grown to a state where it can surpass the limitations of previous machine learning methods, such as Hidden Markov Models and Support Vector Machines. As a consequence, neural networks can now offer superior classification accuracy in many domains, such as computer vision or natural language processing. This improvement comes from the possibility of constructing neural networks with a higher number of potentially diverse layers and is known as deep learning. In this paper, we attempt to transfer these performance improvements to the modeling of malware system call sequences for the purpose of malware classification. We construct a neural network based on convolutional and recurrent network layers in order to obtain the best features for classification. This way we get a hierarchical feature extraction architecture that combines convolution of n-grams with full sequential modeling. Our evaluation results demonstrate that our approach outperforms previously used methods in malware classification, achieving an average of 85.6% precision and 89.4% recall using this combined neural network architecture.
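The combined architecture can be sketched in PyTorch: an embedding of system-call IDs, a 1-D convolution acting as an n-gram feature extractor, and an LSTM for sequential modeling. Layer sizes, vocabulary size, and the number of malware families below are illustrative assumptions:
```python
import torch
import torch.nn as nn

class ConvRNNMalware(nn.Module):
    """Sketch of a conv+recurrent classifier over system-call sequences:
    1-D convolutions extract n-gram features, an LSTM models their order."""
    def __init__(self, n_syscalls=300, emb=32, channels=64, n_families=9):
        super().__init__()
        self.emb = nn.Embedding(n_syscalls, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)  # ~3-grams
        self.lstm = nn.LSTM(channels, channels, batch_first=True)
        self.fc = nn.Linear(channels, n_families)

    def forward(self, calls):                        # calls: (batch, seq_len) ids
        e = self.emb(calls).transpose(1, 2)          # (batch, emb, seq_len)
        c = torch.relu(self.conv(e)).transpose(1, 2) # (batch, seq_len, channels)
        _, (h, _) = self.lstm(c)                     # final hidden state
        return self.fc(h.squeeze(0))                 # malware-family logits

model = ConvRNNMalware()
print(model(torch.randint(0, 300, (8, 100))).shape)  # torch.Size([8, 9])
```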
Article
Traditional anomaly detection on social media mostly focuses on individual point anomalies while anomalous phenomena usually occur in groups. Therefore, it is valuable to study the collective behavior of individuals and detect group anomalies. Existing group anomaly detection approaches rely on the assumption that the groups are known, which can hardly be true in real world social media applications. In this article, we take a generative approach by proposing a hierarchical Bayes model: Group Latent Anomaly Detection (GLAD) model. GLAD takes both pairwise and point-wise data as input, automatically infers the groups and detects group anomalies simultaneously. To account for the dynamic properties of the social media data, we further generalize GLAD to its dynamic extension d-GLAD. We conduct extensive experiments to evaluate our models on both synthetic and real world datasets. The empirical results demonstrate that our approach is effective and robust in discovering latent groups and detecting group anomalies.
Conference Paper
Program anomaly detection analyzes normal program behaviors and discovers aberrant executions caused by attacks, misconfigurations, program bugs, and unusual usage patterns. The merit of program anomaly detection is its independence from attack signatures, which enables proactive defense against new and unknown attacks. In this paper, we formalize the general program anomaly detection problem and point out two of its key properties. We present a unified framework to present any program anomaly detection method in terms of its detection capability. We prove the theoretical accuracy limit for program anomaly detection with an abstract detection machine. We show how existing solutions are positioned in our framework and illustrate the gap between state-of-the-art methods and the theoretical accuracy limit. We also point out some potential modeling features for future program anomaly detection evolution.
Article
Purpose – Among the growing number of data mining (DM) techniques, outlier detection has gained importance in many applications and has attracted much attention in recent times. Past outlier detection research in safety care can be viewed as searching for needles in a haystack. However, outliers are not always erroneous. Therefore, the purpose of this paper is to investigate the role of outliers in healthcare services in general and in patient safety care in particular. Design/methodology/approach – The paper uses a combined DM technique (clustering and the nearest neighbor) for outlier detection, which provides a clear understanding and meaningful insights for visualizing data behaviors relevant to healthcare safety. The knowledge implicit in the outcomes is vitally essential to a proper clinical decision-making process, and the novel treatment of patients’ events and situations proves to play a significant role in patient safety care and medication. Findings – The paper discusses a novel and integrated methodology that can be applied to the analysis of different kinds of biological data, using integrated DM techniques to optimize performance in the field of health and medical science. Outliers are detected as clusters and point events, and novel ideas are proposed to empower clinical services with customer satisfaction in mind. The integrated outlier detection method can be extended to search for valuable implicit information and knowledge based on selected patient factors, and it can serve as a baseline for further healthcare strategic development and research work. Research limitations/implications – This paper focusses mainly on outlier detection. Outlier isolation, which is essential for investigating why an outlier happened and for communicating how to mitigate it, is not addressed. The research can therefore be extended toward the hierarchy of patient problems. Originality/value – DM is a dynamic and successful gateway for discovering useful knowledge that enhances healthcare performance and patient safety. Clinical-data-based outlier detection is a basic task in achieving a healthcare strategy. Therefore, this paper focusses on combined DM techniques for a deep analysis of clinical data, supporting an optimal clinical decision-making process. Proper clinical decisions can be obtained through attribute selection, which is important for identifying the influential factors or parameters of healthcare services; integrating clustering and nearest-neighbor techniques makes the search for outliers in such complex data more effective and can be fundamental to further analysis of healthcare and patient safety situations.