Sampling High Throughput Data for Anomaly Detection of Data-Base Activity
Research in Progress
Hagit Grushka-Cohen
Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel {hgrushka@post.bgu.ac.il}
Oded Sofer
IBM Security Division, Israel {odedso@il.ibm.com}
Ofer Biller
IBM Cyber Security Center of Excellence, Beer Sheva, Israel {ofer.biller@il.ibm.com}
Michael Dymshits
Software and Information Systems Engineering, Ben-Gurion University of the Negev {m.dmshts@gmail.com}
Lior Rokach
Software and Information Systems Engineering, Ben-Gurion University of the Negev {liorrk@bgu.ac.il}
Bracha Shapira
Software and Information Systems Engineering, Ben-Gurion University of the Negev {bshapira@bgu.ac.il}
ABSTRACT
Data leakage and theft from databases is a dangerous threat to organizations. Data Security and Data
Privacy protection systems (DSDP) monitor data access and usage to identify leakage or suspicious
activities that should be investigated. Because of the high-velocity nature of database systems, such
systems audit only a portion of the vast number of transactions that take place. Anomalies are
investigated by a Security Officer (SO) in order to choose the proper response. In this paper we
investigate the effect of sampling methods based on the risk the transaction poses and propose a new "combination sampling" method for capturing a more varied sample.
Keywords
sampling; anomaly detection; risk assessment; insider threats; intrusion detection; high throughput;
cyber
1. INTRODUCTION
Databases lie at the heart of organizations' IT infrastructure. Organizations monitor database
operations in real time to prevent data leakage. Data security and data privacy protection (DSDP)
systems are widely used to help implement security policies and detect attacks and data abuse. DSDP
systems monitor database (DB) activity and enforce a predefined policy in order to issue alerts about
policy violations and discover vulnerabilities such as weak passwords or out-of-date software. DSDP
systems also apply anomaly detection algorithms in an attempt to detect data misuse, data leakage,
impersonation, and attacks on database systems [1, 2].
These systems generate alerts when policy rules are violated or anomalous activities are performed
[3,4,5]. Each alert demands the attention of a security officer (SO) [6], who must decide whether the alert represents a threat that should be investigated or can be dismissed. For investigation purposes, the SO needs the original log data to be as informative as possible, as she needs to assemble the puzzle in order to evaluate the whole picture. Organizations also save this data for future investigation in case a breach is discovered later. To address these needs, some industrial
DSDP systems maintain an archive of log data describing past transactions, and these logs are then
passed on to an anomaly detection system to identify suspicious activity.
Unlike the network domain, in the database security domain organizations prefer not to disclose attacks; hence, there is no indication of which, or what portion of, the anomalies are "true positives".
The databases monitored by DSDP systems serve thousands of users, making up to a hundred
thousand transactions per second. Storage is limited and can become a prohibitive cost for these
systems, yet the amount of data that is logged affects the quality of anomaly detection and the ability
to investigate historical data when a breach is discovered. Current solutions for the limited storage
space are based on defining a policy to govern which portion of the transactions will be saved and
audited.
Most processing solutions for fast streams of log data focus on compressing the data. These include
various dimensionality reduction techniques such as PCA and deep learning auto-encoders, approximation methods (including sketching), and Bloom filters [7,8]. These methods allow
extracting features without consuming excess memory and disk space. However, saving compressed
data does not provide SOs with the information required to decide how to respond to anomalies and
does not facilitate the investigation of past behavior.
Another approach to reducing the amount of data saved while preserving the full attributes is to
sample transactions or activities. Techniques for sampling and their effects on anomaly detection
have been studied in the domain of network traffic flow [9,10,11]. That domain is quite different from the domain of database transactions, where the data is richer, containing more features, and the damage from a single transaction can be greater than the damage from a network packet.
We propose using a sampling strategy based on the perceived risk posed by each transaction to the
organization. The risk can be estimated using a manually calibrated policy or a machine learning ranking algorithm such as CyberRank [12].
We seek a smart monitoring policy, based on information-theoretic sampling, that is economical (storage-wise) without compromising our ability to discover the same anomalies, and that also enables us to investigate an anomaly once it is discovered.
The main contribution of our work is introducing a new feature representing "subject matter expert knowledge" for reducing storage and improving the performance of DSDP systems, while maintaining the anomaly detection results and the investigative capabilities required by the SO and the regulator. We present a sampling strategy for incorporating the risk associated with transactions/events, along with preliminary results on anomaly detection using simulated data.
2. DATA EXPLORATION
We collected 24 hours of user activity data from a production DSDP system. The system's data is made up of aggregated transaction details such as the IP address, the time of day, and a description of the performed query. A user can be either an application database user or a real (human) user. As described in our CyberRank work [12], the user is an important entity whose behavior and activity is useful for identifying risk and controlling database transactions. When a transaction occurs, it is compared to the user's history to detect anomalies.
The data collected consists of 24 hours of monitored DB transactions made by 1,901 unique users.
We identified three levels of user activity: most users have very little activity, with fewer than a thousand queries each; a second group is highly active, issuing thousands of queries; and a small group of very active users issues over ten thousand transactions. User activity is described in the histogram in Figure 1.
Figure 1. Histogram of user activity. Most very active users (those with over ten thousand transactions) are not depicted; the majority of users had fewer than 1,000 queries in the time frame inspected.
In this work we concentrate on the largest group of users, those with a lower number of queries. These users are most affected by the sampling strategy, as very little information is present for each user.
3. EXPERIMENTAL SETTING
Our objective was to compare different strategies for sampling transaction data. We look at three sampling methods (a minimal sketch follows the list):
(i) vanilla: keeping a fixed portion of the transactions, regardless of risk
(ii) risky only: sampling only from the transactions labeled as risky
(iii) combination sampling: sampling from both risky and non-risky transactions, using the Gibbs sampling approach to define the proportion of each transaction class in the resulting data
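To make these strategies concrete, the following Python sketch illustrates one possible implementation. Here risky denotes the per-transaction boolean risk label described in Section 3.1, budget is the number of transactions to keep, and the function names, the NumPy usage, and the per-user budget formulation are illustrative assumptions rather than the production system's code.

import numpy as np

def vanilla_mask(risky, budget, rng):
    # Keep a random subset of all observations, ignoring the risk label.
    n = len(risky)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=min(budget, n), replace=False)] = True
    return mask

def risky_only_mask(risky, budget, rng):
    # Spend the whole audit budget on observations labeled as risky.
    risky_idx = np.flatnonzero(risky)
    mask = np.zeros(len(risky), dtype=bool)
    mask[rng.choice(risky_idx, size=min(budget, len(risky_idx)), replace=False)] = True
    return mask

def combination_mask(risky, budget, rng, risky_share=0.8):
    # Split the budget between the risky and non-risky classes.
    # risky_share is the target proportion of risky observations in the sample
    # (0.8 in our experiment); risky_share=1.0 essentially recovers "risky only".
    risky_idx = np.flatnonzero(risky)
    benign_idx = np.flatnonzero(~risky)
    n_risky = min(int(round(budget * risky_share)), len(risky_idx))
    n_benign = min(budget - n_risky, len(benign_idx))
    mask = np.zeros(len(risky), dtype=bool)
    mask[rng.choice(risky_idx, size=n_risky, replace=False)] = True
    mask[rng.choice(benign_idx, size=n_benign, replace=False)] = True
    return mask

In this simplified formulation the proportion of risky observations in the resulting sample is controlled directly, rather than derived from a full Gibbs sampler; this is a deliberate simplification for illustration.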
The DSDP system we investigated recorded 20 features per transaction. Detecting anomalies with such complex features is not a trivial task, especially as it is difficult to determine
what counts as a true positive anomaly without consulting the SO about each transaction (the data is
unlabeled).
Since we want to investigate the impact of a sampling strategy based on the new risk feature on the results of anomaly detection, we concentrate on producing low-complexity data to simulate the data sampled for audit. Anomalies are introduced randomly into the data so that the anomaly detection system can be evaluated in a controlled environment (the experiment data is labeled).
3.1 Data creation
We represent each user as a Gaussian distribution which produces a series of observations. An anomaly is created by sampling a data point from a distribution different from that of the user. We generated 1,000 users and, for each user, a series of observations. The generated observations contained 1% anomalies, i.e., observations generated from a different Gaussian. We added a new feature to the data, a risk label, generated randomly and independently for each observation; 30% of the observations were labeled as risky.
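A minimal Python sketch of this data-generation process follows; the series length per user, the base Gaussian parameters, and the offset of the anomalous distribution are not specified above and are assumptions made for illustration.

import numpy as np

N_USERS = 1_000
ANOMALY_RATE = 0.01   # 1% of observations come from a different Gaussian
RISKY_RATE = 0.30     # 30% of observations carry the (independent) risk label

def generate_user(rng, n_obs=1_000):
    # One simulated user: a base Gaussian plus a shifted Gaussian for anomalies.
    mu, sigma = rng.normal(0.0, 10.0), rng.uniform(1.0, 3.0)   # assumed parameters
    is_anomaly = rng.random(n_obs) < ANOMALY_RATE
    values = np.where(
        is_anomaly,
        rng.normal(mu + 10.0 * sigma, sigma, n_obs),  # anomalous distribution (assumed offset)
        rng.normal(mu, sigma, n_obs),                 # normal behavior
    )
    risky = rng.random(n_obs) < RISKY_RATE            # risk label, independent of the anomaly
    return values, is_anomaly, risky

rng = np.random.default_rng(seed=1)
users = [generate_user(rng) for _ in range(N_USERS)]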
3.2 Anomaly detection
We simulated the following anomaly detection system for each user: (1) Each observation may be
sampled for audit. (2) If it is sampled, it is compared to the previous series of sampled observations.
(3) The user's Gaussian is estimated using the mean and standard deviation of the sampled observations. (4) An observation is marked as an anomaly by the mock system if its difference from the user's mean is greater than three standard deviations.
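The mock detector can be sketched as follows; the minimum history length needed before flagging and whether flagged observations remain in the history are not specified above and are assumptions.

import numpy as np

def detect(values, sampled_mask, k=3.0):
    # Flag a sampled observation if it lies more than k standard deviations
    # from the mean of the previously sampled observations for this user.
    flagged = np.zeros(len(values), dtype=bool)
    history = []
    for i, v in enumerate(values):
        if not sampled_mask[i]:
            continue                      # unsampled observations are never audited
        if len(history) >= 2:             # need at least two points to estimate the Gaussian
            mean, std = np.mean(history), np.std(history)
            if std > 0 and abs(v - mean) > k * std:
                flagged[i] = True
        history.append(v)                 # flagged points are kept in the history (assumption)
    return flagged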
Each sampling method was applied three times with different random seeds, and the results were
averaged. We sampled at four proportions: 35%, 30%, 25%, and 20%. The sampling posteriors
(proportion of each class in the sample) for the "combination sampling" method were set at 80% for
the risky class and 20% for the non-risky class.
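Putting the pieces together, the experiment can be sketched as the loop below, reusing the hypothetical generate_user, detect, and *_mask functions from the earlier sketches. For brevity it simulates a single long user series per seed rather than 1,000 separate users, and recall is averaged over the seeds as in our setup.

import numpy as np

SAMPLE_RATES = [0.35, 0.30, 0.25, 0.20]
SEEDS = [0, 1, 2]
SAMPLERS = {"vanilla": vanilla_mask,
            "risky only": risky_only_mask,
            "combination": combination_mask}

def recall_by_class(flagged, is_anomaly, risky):
    # Recall of the injected anomalies, split into high-risk and low-risk classes.
    out = {}
    for name, cls in (("high-risk", risky), ("low-risk", ~risky)):
        truth = is_anomaly & cls
        out[name] = float((flagged & truth).sum()) / max(int(truth.sum()), 1)
    return out

for method, sampler in SAMPLERS.items():
    for rate in SAMPLE_RATES:
        per_seed = []
        for seed in SEEDS:
            rng = np.random.default_rng(seed)
            values, is_anomaly, risky = generate_user(rng, n_obs=10_000)
            mask = sampler(risky, budget=int(rate * len(values)), rng=rng)
            per_seed.append(recall_by_class(detect(values, mask), is_anomaly, risky))
        avg = {c: np.mean([r[c] for r in per_seed]) for c in ("high-risk", "low-risk")}
        print(f"{method:12s} rate={rate:.0%}  high-risk recall={avg['high-risk']:.2f}  "
              f"low-risk recall={avg['low-risk']:.2f}")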
4. RESULTS
Without sampling and when applied to all of the data generated, the simulated anomaly detection
system detected 57% of the anomalies with a precision of 64%, with no significant difference between
the observations labeled as high-risk or low-risk.
We compare the recall of the generated anomalies across the sampling methods, with the results divided by class (high-risk / low-risk). The vanilla sampling method produced the same recall for both classes; as the high-risk class is less frequent, the recall for this class is low. The "risky only" sampling method achieved high recall for the high-risk class at the price of zero recall for the low-risk class. The "combination sampling" method achieves recall slightly lower than "risky only" on the high-risk class, while still capturing some of the anomalies in the low-risk class. See Figure 2 for a comparison of the recall for the two classes.
5. DISCUSSION
Choosing the right sampling strategy is an important decision when designing and implementing a
DSDP system. We have shown that introducing transaction risk into the anomaly detection sampling
process can significantly affect the results. Although in our experiment there was no underlying difference between the high-risk and low-risk classes, we expect these transactions to behave differently in real life, amplifying the effect of the sampling algorithm on the observed anomalies.
The risk captures the likelihood that an SO would investigate the anomaly [12]; however, the investigation will be more thorough if low-risk transactions are also captured for the suspect user.
The "risky only" method would not provide this level of resolution, while the naïve vanilla sampling
would not detect most of the high-risk anomalies. Using a Gibbs sampling approach to provide "combination sampling," which guarantees sampling mainly from the high-risk class, provides a middle ground. The proportion of each class can be tuned by the SO to fit the organization's needs (the "risky only" method is simply "combination sampling" with the proportion of high-risk samples set to one).
Figure 2. (a) Recall for low-risk anomalies: vanilla sampling produces the best result. (b) Recall for high-risk anomalies: "risky only" has the highest recall.
We are working on applying a similar type of analysis to a sample of real data, as well as more
complex anomaly detection methods. Other future work should include a usage study of the effect of
different sampling strategies on adoption and use by security officers. Another avenue for future work is to combine data compression methods with sampling to improve both anomaly detection and the ability to investigate afterwards.
6. REFERENCES
[1] Veeramachaneni, K. and Arnaldo, I., AI2: Training a big data machine to defend.
[2] Harel, A., Shabtai, A., Rokach, L. and Elovici, Y., 2012. M-score: A misuseability weight measure. IEEE Transactions on Dependable and Secure Computing, 9(3), pp.414-428.
[3] Sallam, A., Bertino, E., Hussain, S.R., Landers, D., Lefler, R.M. and Steiner, D., 2015. DBSAFE: An Anomaly Detection System to Protect Databases From Exfiltration Attempts. IEEE Systems Journal.
[4] Kim, G., Lee, S. and Kim, S., 2014. A novel hybrid intrusion detection method integrating anomaly detection with
misuse detection. Expert Systems with Applications, 41(4), pp.1690-1700.
[5] Chandola, V., Banerjee, A. and Kumar, V., 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), p.15.
[6] Veeramachaneni, K. and Arnaldo, I., AI2: Training a big data machine to defend.
[7] Lall, A., Ogihara, M. and Xu, J., 2009, April. An efficient algorithm for measuring medium- to large-sized flows in network traffic. In INFOCOM 2009, IEEE (pp. 2711-2715). IEEE.
[8] Feldman, D., Schmidt, M. and Sohler, C., 2013, January. Turning big data into tiny data: Constant-size coresets for
k-means, pca and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on
Discrete Algorithms (pp. 1434-1453). Society for Industrial and Applied Mathematics.
[9] Jadidi, Z., Muthukkumarasamy, V., Sithirasenan, E. and Singh, K., 2016. Intelligent Sampling Using an Optimized
Neural Network. Journal of Networks, 11(01), pp.16-27.
[10] Mai, J., Chuah, C.N., Sridharan, A., Ye, T. and Zang, H., 2006, October. Is sampled data sufficient for anomaly
detection?. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement (pp. 165-176). ACM.
[11] Juba, B., Musco, C., Long, F., Sidiroglou-Douskos, S. and Rinard, M.C., 2015, February. Principled Sampling for
Anomaly Detection. In NDSS.
[12] Grushka-Cohen, H., Sofer, O., Biller, O., Shapira, B. and Rokach, L., 2016, October. CyberRank: Knowledge
Elicitation for Risk Assessment of Database Security. In Proceedings of the 25th ACM International on Conference
on Information and Knowledge Management (pp. 2009-2012). ACM.