Article

TM-Score: A Misuseability Weight Measure for Textual Content


Abstract

In recent years, data leakage prevention solutions have become an inherent component of organizations' security suites. These solutions focus mainly on the data and its sensitivity level, and on preventing it from reaching an unauthorized entity. They ignore, however, the fact that an insider is gradually exposed to more and more sensitive data that she is authorized to access. Such data may cause great damage to the organization when leaked or misused. In this research, we propose an extension to the misuseability weight concept. Our main goal is to define a misuseability measure called TM-Score for textual data. Using this measure, the organization can estimate the extent of damage that can be caused by an insider who is continuously and gradually exposed to textual content (e.g., documents and emails). The extent of damage is determined by the amount, type, and quality of information to which the insider was exposed. We present a two-step method for the continuous assignment of a misuseability score to a set of documents and evaluate the proposed method using the Enron email data set.
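The abstract does not spell out how the score is computed, so the following is only a minimal illustrative Python sketch of how a misuseability weight for accumulated textual exposure could be tracked. It assumes documents have already been mapped to sensitive topics with per-topic quality values, and combines topic coverage (type and quality) with the volume of exposed material (amount). The `Exposure` class, `tm_score_sketch` function, topic names, and weights are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Exposure:
    """Running record of the textual content an insider has been exposed to."""
    topic_quality: dict = field(default_factory=dict)  # topic -> best quality seen, in [0, 1]
    doc_count: int = 0

    def add_document(self, doc_topics):
        """doc_topics: {topic: quality in [0, 1]} for one document."""
        self.doc_count += 1
        for topic, quality in doc_topics.items():
            self.topic_quality[topic] = max(self.topic_quality.get(topic, 0.0), quality)

def tm_score_sketch(exposure, topic_weights, amount_factor=0.05):
    """Toy score: weighted coverage of sensitive topics (type and quality),
    mildly amplified by the amount of exposed material."""
    coverage = sum(topic_weights.get(t, 0.0) * q
                   for t, q in exposure.topic_quality.items())
    return coverage * (1.0 + amount_factor * exposure.doc_count)

# Usage: the score grows as the insider reads more, and more sensitive, documents.
weights = {"merger_plans": 1.0, "salaries": 0.6, "cafeteria_menu": 0.0}
e = Exposure()
e.add_document({"salaries": 0.4})
e.add_document({"merger_plans": 0.9, "salaries": 0.7})
print(round(tm_score_sketch(e, weights), 3))  # 1.452
```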


... The ITS expresses the likelihood that a given entity is leaking sensitive data. Furthermore, the ITS may be combined with a misuseability score, such as the M-Score/TM-Score [2], to provide a better estimate of the risk represented by each user/machine by invoking the relationship between risk, probability and consequence: ...
... about the likelihood of the entity actually leaking data. In a context where the data objects are textual documents, the "consequence" factor is represented by a misuseability metric such as the TM-Score, which can be readily computed [2], while the "likelihood" factor corresponds to the ITS, whose derivation and discussion are the topic of this paper. ...
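The two snippets above invoke the classic decomposition of risk into likelihood and consequence, with the ITS playing the likelihood role and the TM-Score the consequence role. A one-line sketch of that combination, with purely hypothetical per-user values:

```python
def risk(likelihood_its, consequence_tm_score):
    """risk = likelihood of leakage (ITS) x consequence if leaked (TM-Score)."""
    return likelihood_its * consequence_tm_score

# Hypothetical per-user values: (ITS, TM-Score).
users = {"alice": (0.05, 8.2), "bob": (0.40, 3.1)}
ranked = sorted(users, key=lambda u: risk(*users[u]), reverse=True)
print(ranked)  # ['bob', 'alice'] -- moderate exposure but high leak likelihood ranks first
```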
... However, our work is influenced by related concepts such as misuseability weights. The M-Score [1] and the TM-Score [2] algorithms provide a means of quantifying the damage an organization would incur if the data a user has accessed were to leak to third parties. In order to assign a score, domain experts are consulted to manually assess the sensitivity at a per-record or per-document level for a subset of the corpora. ...
Conference Paper
In recent years there has been an increased focus on preventing and detecting insider attacks and data thefts. A promising approach has been the construction of data loss prevention (DLP) systems that scan outgoing traffic for sensitive data. However, these automated systems are plagued with a high false positive rate. In this paper we introduce the concept of a meta-score that uses the aggregated output from DLP systems to detect and flag behavior indicative of data leakage. The proposed internal/insider threat score is built on the idea of detecting discrepancies between the user-assigned sensitivity level and the sensitivity level inferred by the DLP system, and captures the likelihood that a given entity is leaking data. The practical usefulness of the proposed score is demonstrated on the task of identifying likely internal threats.
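As a rough illustration of the discrepancy idea described in this abstract (not the paper's actual scoring function), a meta-score could average how far the DLP-inferred sensitivity exceeds the user-assigned level across a user's outgoing items; the function name and level scale below are assumptions:

```python
def insider_threat_score(events):
    """Toy meta-score: average amount by which the DLP-inferred sensitivity
    exceeds the user-assigned level on outgoing items.
    events: iterable of (user_assigned, dlp_inferred) levels, e.g. on a 0-3 scale."""
    gaps = [max(0, inferred - assigned) for assigned, inferred in events]
    return sum(gaps) / len(gaps) if gaps else 0.0

# A user who repeatedly labels sensitive material as public scores high.
print(insider_threat_score([(0, 3), (0, 2), (1, 1)]))  # ~1.67
```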
... Large datasets containing personal and privacy-sensitive documents have also been created to evaluate data loss prevention (DLP) methods (Vartanian and Shabtai 2014; Hart, Manadhata, and Johnson 2011; Trieu et al. 2017). Even though DLP methods do assess the sensitivity of the information contained in textual documents, they only do it at the document level. ...
Article
Full-text available
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored towards measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts and baseline models are available on: https://github.com/NorskRegnesentral/text-anonymization-benchmark.
... However, the achieved parameters are not compared with the baseline state-of-the-art distribution model [103]. For textual data, a misuseability evaluator named TM-Score is defined in [107], which is an extension of the misuseability weight concept [104]. Using this evaluator, enterprises can estimate the extent of damage that results from an insider's gradual and continuous exposure to textual content such as emails and documents. ...
Article
Full-text available
A large number of researchers, academic institutions, government sectors, and business enterprises are adopting the cloud environment due to its minimal upfront capital investment, high scalability, and several other features. Despite the multiple features supported by the cloud environment, it also suffers from several challenges. Data protection is the primary concern in the area of information security and cloud computing. Numerous solutions have been developed to address this challenge. However, there is a lack of comprehensive analysis among the existing solutions, and a necessity emerges to explore, classify, and analyze the significant existing work in order to investigate the applicability of these solutions to different requirements. This article presents a comparative and systematic study and in-depth analysis of leading techniques for securely sharing and protecting data in the cloud environment. The discussion of each technique includes how it protects data and the notable solutions in the domain, with core information about each solution such as workflow, achievements, scope, gaps, and future directions. Furthermore, a comprehensive and comparative analysis of the discussed techniques is presented. Afterward, the applicability of the techniques to different requirements is discussed, and the research gaps along with future directions in the field are reported. The authors believe that this article's contribution will act as a catalyst for potential researchers to carry out work in the area.
... To address the limitations of data-centric approaches, recent research focuses on so-called misuseability-score approaches (Barthel and Schallehn 2013; Vartanian and Shabtai 2014; Vavilis et al. 2014). These approaches evaluate the possible misuseability of the data a database user queries. ...
Article
The last few decades have witnessed technology progress by leaps and bounds, leading to the creation and global adoption of different types of digital devices and platforms that can make personalized recommendations to an individual. One of the consequences of the ubiquity of these devices is the daily generation of data in large quantities. This data includes the sensitive data of individuals as well as multinational organizations and must therefore always be kept confidential to prevent its theft and misuse by malicious parties. However, the large volumes of data generated make it difficult to create a robust security solution to safeguard the data from different types of cyberattacks. The paper contributes to the data security industry by consolidating numerous data security algorithms that vary with the infrastructure surrounding the data, and it also outlines the regulations that apply. This review paper also aims to highlight the benefits and drawbacks of the security algorithms proposed by researchers, to encourage further discussion and consideration of the algorithms for potential implementation in appropriate domains, and to inspire the development of more powerful and robust security algorithms that address the drawbacks of existing ones.
Article
Several privacy protection technologies have been designed to protect individuals' privacy information in data publishing. Without measuring the strength of privacy protection a dataset actually requires, it is easy to incur unnecessary information loss. To apply an appropriate strength of privacy preservation, the authors put forward the privacy score, a new metric for making a comprehensive evaluation of the privacy information contained in the pre-published dataset. Using this measure, publishers can apply privacy techniques to the pre-published dataset in accordance with the privacy level it belongs to. The privacy score is determined by the amount as well as the quality of privacy information contained in the pre-published dataset. Furthermore, the authors present a data sensitivity model based on the analytic hierarchy process (AHP) for assigning a sensitivity score to each possible value of a sensitive attribute. The reasonableness and effectiveness of the proposed approach are verified using the Adult dataset.
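A toy sketch of the idea, assuming per-value sensitivity scores (e.g., elicited with AHP) are already available; the function and example values are hypothetical, not the authors' formula:

```python
def privacy_score_sketch(records, value_sensitivity):
    """Toy dataset-level privacy score: sum the sensitivities of all sensitive
    cells and normalise by the number of records.
    value_sensitivity: {value: sensitivity in [0, 1]}, e.g. elicited via AHP."""
    total = sum(value_sensitivity.get(v, 0.0) for record in records for v in record)
    return total / max(len(records), 1)

sensitivity = {"HIV": 1.0, "flu": 0.2, "none": 0.0}
dataset = [["HIV"], ["flu"], ["none"], ["flu"]]
print(privacy_score_sketch(dataset, sensitivity))  # 0.35
```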
Article
Full-text available
A data breach is the intentional or inadvertent exposure of confidential information to unauthorized parties. In the digital era, data has become one of the most critical components of an enterprise. Data leakage poses serious threats to organizations, including significant reputational damage and financial losses. As the volume of data is growing exponentially and data breaches are happening more frequently than ever before, detecting and preventing data loss has become one of the most pressing security concerns for enterprises. Despite a plethora of research efforts on safeguarding sensitive information from being leaked, it remains an active research problem. This review helps interested readers to learn about enterprise data leak threats, recent data leak incidents, various state-of-the-art prevention and detection techniques, new challenges, and promising solutions and exciting opportunities. WIREs Data Mining Knowl Discov 2017, 7:e1211. doi: 10.1002/widm.1211
Article
Today, organizations have limited resources available to allocate to the detection of complex cyber-attacks. In order to optimize their resource allocation, organizations must conduct a thorough risk analysis process so as to focus their efforts and resources on the protection of the organization's important assets. In this study we propose a framework that automatically and dynamically derives a misuseability score for every IT component (e.g., PC, laptop, server, router, smartphone, and user). The misuseability score encapsulates the potential damage that can be caused to the organization when its assets are compromised and misused.
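The paper's derivation is not reproduced here, but one plausible roll-up, shown only as an assumption-laden sketch, is to derive a component's score from the misuseability scores of the data assets reachable from it:

```python
def component_misuseability(assets_on_component, asset_scores, combine=max):
    """Toy roll-up: a component's misuseability is derived from the scores of the
    data assets it can reach (here simply the maximum; a sum or a
    diminishing-returns combination would be equally plausible)."""
    scores = [asset_scores[a] for a in assets_on_component if a in asset_scores]
    return combine(scores) if scores else 0.0

asset_scores = {"hr_db": 0.9, "payroll_share": 0.7, "wiki": 0.1}
print(component_misuseability({"wiki", "payroll_share"}, asset_scores))  # 0.7
```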
Article
Security labels are utilized for several applications. For instance, cross-domain information exchange can be enabled by associating security labels with data objects and enforcing cross-domain information flow control based on these labels (e.g., using guards). The correctness of the security labels is critical to the overall security of such solutions. To assure the correctness of security labels, this paper proposes a flexible framework for trusted information labelling. The proposed solution represents a novel application of attribute-based access control (a.k.a. policy-based access control) principles to data labelling. The proposed framework can utilize content verification/analysis, user/application input, information flow monitoring, and contextual information as a basis for its policy-based labelling decisions.
Article
Full-text available
Detecting and preventing data leakage and data misuse poses a serious challenge for organizations, especially when dealing with insiders with legitimate permissions to access the organization's systems and its critical data. In this paper, we present a new concept, Misuseability Weight, for estimating the risk emanating from data exposed to insiders. This concept focuses on assigning a score that represents the sensitivity level of the data exposed to the user and thereby predicts the ability of the user to maliciously exploit this data. Then, we propose a new measure, the M-score, which assigns a misuseability weight to tabular data, discuss some of its properties, and demonstrate its usefulness in several leakage scenarios. One of the main challenges in applying the M-score measure is in acquiring the required knowledge from a domain expert. Therefore, we present and evaluate two approaches toward eliciting misuseability conceptions from the domain expert.
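A much-simplified sketch of a tabular misuseability weight in the spirit of the M-score (quality of the most sensitive exposed record combined with the quantity of records); the published measure also accounts for how easily records can be tied to specific individuals, which is omitted here, and all names and weights below are hypothetical:

```python
def m_score_sketch(result_table, cell_sensitivity, quantity_weight=0.01):
    """Simplified sketch of a tabular misuseability weight: each record contributes
    its most sensitive cell (quality) and the score grows with the number of
    exposed records (quantity)."""
    if not result_table:
        return 0.0
    record_quality = [max(cell_sensitivity.get(c, 0.0) for c in row) for row in result_table]
    return max(record_quality) * (1.0 + quantity_weight * len(result_table))

sensitivity = {"public": 0.0, "internal": 0.4, "secret": 1.0}
rows = [["public", "internal"], ["secret", "public"]]
print(m_score_sketch(rows, sensitivity))  # 1.02
```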
Article
Full-text available
Protecting sensitive information from unauthorized disclosure is a major concern of every organization. As an organization's employees need to access such information in order to carry out their daily work, data leakage detection is both an essential and challenging task. Whether caused by malicious intent or an inadvertent mistake, data loss can result in significant damage to the organization. Fingerprinting is a content-based method used for detecting data leakage. In fingerprinting, signatures of known confidential content are extracted and matched with outgoing content in order to detect leakage of sensitive content. Existing fingerprinting methods, however, suffer from two major limitations. First, fingerprinting can be bypassed by rephrasing (or minor modification) of the confidential content, and second, usually the whole content of a document is fingerprinted (including non-confidential parts), resulting in false alarms. In this paper we propose an extension to the fingerprinting approach that is based on sorted k-skip-n-grams. The proposed method is able to produce a fingerprint of the core confidential content which ignores non-relevant (non-confidential) sections. In addition, the proposed fingerprint method is more robust to rephrasing and can also be used to detect a previously unseen confidential document, and therefore provides better detection of intentional leakage incidents.
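A minimal sketch of sorted k-skip-n-gram fingerprinting as described above, assuming whitespace tokenization and SHA-1 hashes; the exact gram construction and matching rules of the paper may differ:

```python
import hashlib
from itertools import combinations

def sorted_skip_ngrams(tokens, n=3, k=2):
    """All n-token combinations drawn from windows of n+k consecutive tokens
    (i.e., allowing up to k skips), with each gram sorted so that local
    re-ordering or light rephrasing still yields the same gram."""
    grams, window = set(), n + k
    for i in range(max(len(tokens) - window + 1, 1)):
        for combo in combinations(tokens[i:i + window], n):
            grams.add(tuple(sorted(combo)))
    return grams

def fingerprint(text, n=3, k=2):
    tokens = text.lower().split()
    return {hashlib.sha1(" ".join(g).encode()).hexdigest()
            for g in sorted_skip_ngrams(tokens, n, k)}

def leak_score(outgoing_text, confidential_fp, n=3, k=2):
    """Fraction of the confidential fingerprint found in the outgoing content."""
    return len(fingerprint(outgoing_text, n, k) & confidential_fp) / max(len(confidential_fp), 1)

conf_fp = fingerprint("the acquisition of acme corp closes in march")
# Positive overlap despite the rephrased word order in the outgoing message.
print(leak_score("acme corp acquisition closes in march", conf_fp))
```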
Article
Full-text available
With the rapid development of web content, retrieving relevant information is a difficult task. Efficient clustering algorithms are needed to improve retrieval results. Document clustering is a process of recognizing the similarity or dissimilarity among the given objects and forming subgroups sharing common characteristics. In this paper, we propose a semantic text document clustering approach that uses the WordNet lexical database and Self-Organizing Maps (SOM). The proposed approach uses WordNet to identify the importance of the concepts in each document. The SOM is used to cluster the documents. We use this approach to enhance the effectiveness of document clustering algorithms. The approach takes advantage of the semantics available in the knowledge base and the relationships between the words in the input documents. Experiments are performed to compare the efficiency of the proposed approach with recently reported approaches, and they show the advantage of the proposed approach over the others.
Conference Paper
Full-text available
Masquerade attacks are a common security problem that is a consequence of identity theft. This paper extends prior work by modeling user search behavior to detect deviations indicating a masquerade attack. We hypothesize that each individual user knows their own file system well enough to search in a limited, targeted and unique fashion in order to find information germane to their current task. Masqueraders, on the other hand, will likely not know the file system and layout of another user's desktop, and would likely search more extensively and broadly in a manner that is different than the victim user being impersonated. We identify actions linked to search and information access activities, and use them to build user models. The experimental results show that modeling search behavior reliably detects all masqueraders with a very low false positive rate of 1.1%, far better than prior published results. The limited set of features used for search behavior modeling also results in large performance gains over the same modeling techniques that use larger sets of features.
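As a rough illustration (not the authors' feature set or model), search-behavior features such as search volume and directory breadth can be extracted per session and compared against a per-user baseline:

```python
def search_features(events):
    """Toy per-session features from (action, path) events: volume of search
    actions and breadth of distinct directories touched."""
    search_count = sum(1 for action, _ in events if action == "search")
    dir_breadth = len({path.rsplit("/", 1)[0] for _, path in events})
    return {"search_count": search_count, "dir_breadth": dir_breadth}

def is_anomalous(features, baseline, factor=3.0):
    """Flag sessions whose search volume or breadth far exceeds the user's baseline."""
    return any(features[k] > factor * baseline[k] for k in baseline)

baseline = {"search_count": 4, "dir_breadth": 3}
session = [("search", f"/home/u/dir{i}/file") for i in range(20)]
print(is_anomalous(search_features(session), baseline))  # True: broad, heavy searching
```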
Conference Paper
Full-text available
Document clustering generates clusters from the whole document collection automatically and is used in many fields, including data mining and information retrieval. In the traditional vector space model, the unique words occurring in the document set are used as the features. But because of the synonymy and polysemy problems, such a bag of original words cannot represent the content of a document precisely. In this paper, we investigate using the sense disambiguation method to identify the sense of words to construct the feature vector for document representation. Our experimental results demonstrate that in most conditions, using senses can improve the performance of our document clustering system. But the comprehensive statistical analysis performed indicates that the differences between using original single words and using senses of words are not statistically significant. In this paper, we also provide an evaluation of several basic clustering algorithms for algorithm selection.
Conference Paper
Full-text available
We present an overview of anomaly detection used in computer security, and provide a detailed example of a host-based Intrusion Detection System that monitors file systems to detect abnormal accesses. The File Wrapper Anomaly Detector (FWRAP) has two parts, a sensor that audits file systems, and an unsupervised machine learning system that computes normal models of those accesses. FWRAP employs the Probabilistic Anomaly Detection (PAD) algorithm previously reported in our work on Windows Registry Anomaly Detection. FWRAP represents a general approach to anomaly detection. The detector is first trained by operating the host computer for some amount of time, and a model specific to the target machine is automatically computed by PAD. The model is then deployed to a real-time detector. In this paper we describe the feature set used to model file system accesses, and the performance results of a set of experiments using the sensor while attacking a Linux host with a variety of malware exploits. The PAD detector achieved impressive detection rates, in some cases over 95%, and about a 2% false positive rate when alarming on anomalous processes.
Article
In this paper, we present the architecture of a document control system for monitoring leakage of important documents related to military information. Our proposed system inspects all documents when they are downloaded and sent. It consists of three modules: an authentication module, an access control module, and a watermarking module. The authentication module checks insider credentials to allow logging on to the system. The access control module controls insiders' authorization to perform operations according to their role and security level. The watermarking module is used to track the transmission path of documents. The document control system controls illegal information flow by insiders and does not allow access to documents that are not related to the insider's duties.
Article
A new context-based model (CoBAn) for accidental and intentional data leakage prevention (DLP) is proposed. Existing methods attempt to prevent data leakage by either looking for specific keywords and phrases or by using various statistical methods. Keyword-based methods are not sufficiently accurate since they ignore the context of the keyword, while statistical methods ignore the content of the analyzed text. The context-based approach we propose leverages the advantages of both these approaches. The new model consists of two phases: training and detection. During the training phase, clusters of documents are generated and a graph representation of the confidential content of each cluster is created. This representation consists of key terms and the context in which they need to appear in order to be considered confidential. During the detection phase, each tested document is assigned to several clusters and its contents are then matched to each cluster’s respective graph in an attempt to determine the confidentiality of the document. Extensive experiments have shown that the model is superior to other methods in detecting leakage attempts, where the confidential information is rephrased or is different from the original examples provided in the learning set.
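A heavily simplified sketch of the context idea, assuming a flat mapping from key terms to expected context words rather than CoBAn's per-cluster graph representation:

```python
def confidential_hits(tokens, term_contexts, window=5):
    """Simplified context check: a key term only counts as confidential evidence
    if one of its expected context words appears within a small window around it.
    term_contexts: {key_term: set of expected context words}."""
    hits = 0
    for i, tok in enumerate(tokens):
        ctx_words = term_contexts.get(tok)
        if not ctx_words:
            continue
        neighbourhood = set(tokens[max(0, i - window): i + window + 1])
        if neighbourhood & ctx_words:
            hits += 1
    return hits

terms = {"merger": {"acme", "confidential", "q3"}}
doc = "the merger with acme is still confidential".split()
benign = "a merger of two galaxies was observed".split()
print(confidential_hits(doc, terms), confidential_hits(benign, terms))  # 1 0
```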
Article
One-to-many data linkage is an essential task in many domains, yet only a handful of prior publications have addressed this issue. Furthermore, while traditionally data linkage is performed among entities of the same type, it is also necessary to develop techniques that link matching entities of different types. In this paper, we propose a new one-to-many data linkage method that links entities of different natures. The proposed method is based on a one-class clustering tree (OCCT) that characterizes the entities that should be linked together. The tree is built such that it is easy to understand and transform into association rules, i.e., the inner nodes consist only of features describing the first set of entities, while the leaves of the tree represent features of their matching entities from the second data set. We propose four splitting criteria and two different pruning methods which can be used for inducing the OCCT. The method was evaluated using data sets from three different domains. The results affirm the effectiveness of the proposed method and show that the OCCT yields better performance in terms of precision and recall (in most cases it is statistically significant) when compared to a C4.5 decision tree-based linkage method.
Article
Nowadays, the number of text documents is increasing rapidly across the internet, e-mail, and web pages, and these documents are stored in electronic database format, which makes them difficult to organize and browse. To overcome this problem, document preprocessing, term selection, attribute reduction, and maintaining the relationships between important terms using background knowledge (WordNet) become important steps in data mining. In this paper, several stages are performed: first, documents are preprocessed by removing stop words, stemming is performed using the Porter stemmer algorithm, the WordNet thesaurus is applied to maintain the relationships between important terms, and global unique words and frequent word sets are generated; second, a data matrix is formed; third, terms are extracted from the documents using the term selection approaches tf-idf, tf-df, and tf2, based on their minimum threshold value. Further, each document's terms are preprocessed, and the frequency of each term within the document is counted for representation. The purpose of this approach is to reduce the attributes and find the effective term selection method using WordNet for better clustering accuracy. Experiments are evaluated on Reuters transcription subsets (wheat, trade, money grain, ship), Reuters-21578, Classic 30, 20 Newsgroups (atheism), 20 Newsgroups (hardware), 20 Newsgroups (computer graphics), etc.
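A toy sketch of threshold-based term selection; the tf-df and tf2 scoring functions below are plausible interpretations, not necessarily the exact definitions used in the paper:

```python
import math
from collections import Counter

def select_terms(docs, method="tf-idf", min_score=0.1):
    """Toy term selection: score each unique term over the corpus and keep those
    above a minimum threshold. docs: list of token lists."""
    n_docs = len(docs)
    tf = Counter(t for d in docs for t in d)        # corpus term frequency
    df = Counter(t for d in docs for t in set(d))   # document frequency
    total = sum(tf.values())

    def score(t):
        if method == "tf-idf":
            return (tf[t] / total) * math.log(n_docs / df[t])
        if method == "tf-df":
            return (tf[t] / total) * (df[t] / n_docs)
        return (tf[t] / total) ** 2                  # "tf2"

    return {t for t in tf if score(t) >= min_score}

docs = [["wheat", "price", "export"], ["wheat", "ship", "grain"], ["trade", "price"]]
print(select_terms(docs, method="tf-df", min_score=0.08))  # {'wheat', 'price'}
```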
Article
A unified narrative exposition of the ESD/MITRE computer security model is presented. A suggestive interpretation of the model in the context of Multics and a discussion of several other important topics (such as communications paths, sabotage and integrity) conclude the report. A full, formal presentation of the model is included in the Appendix.
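The model presented in this report is the Bell-LaPadula model; its two core mandatory access control rules can be sketched directly (the level names below are illustrative):

```python
LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2, "top_secret": 3}

def can_read(subject_level, object_level):
    """Simple-security property: no read up."""
    return LEVELS[subject_level] >= LEVELS[object_level]

def can_write(subject_level, object_level):
    """*-property: no write down."""
    return LEVELS[subject_level] <= LEVELS[object_level]

print(can_read("secret", "confidential"), can_write("secret", "confidential"))  # True False
```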
Article
Clustering, in data mining, is useful for discovering distribution patterns in the underlying data. Clustering algorithms usually employ a distance-based (e.g., Euclidean) similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions. In this paper, we study clustering algorithms for data with boolean and categorical attributes. We show that traditional clustering algorithms that use distances between points for clustering are not appropriate for boolean and categorical attributes. Instead, we propose a novel concept of links to measure the similarity/proximity between a pair of data points. We develop a robust hierarchical clustering algorithm, ROCK, that employs links and not distances when merging clusters. Our methods naturally extend to non-metric similarity measures that are relevant in situations where a domain expert/similarity table is the only source of knowledge. In addition to presenting detailed complexity results for ROCK, we also conduct an experimental study with real-life as well as synthetic data sets to demonstrate the effectiveness of our techniques. For data with categorical attributes, our findings indicate that ROCK not only generates better quality clusters than traditional algorithms, but also exhibits good scalability properties.
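A short sketch of the link computation that ROCK clusters on: two points are neighbours if their similarity (here Jaccard) reaches a threshold theta, and the link between two points is their number of common neighbours; the threshold and data are illustrative only:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def links(points, theta=0.5):
    """ROCK-style links: two points are neighbours if their Jaccard similarity is
    at least theta; link(p, q) = number of common neighbours of p and q."""
    n = len(points)
    neigh = [{j for j in range(n) if j != i and jaccard(points[i], points[j]) >= theta}
             for i in range(n)]
    return {(i, j): len(neigh[i] & neigh[j]) for i in range(n) for j in range(i + 1, n)}

# Baskets 0 and 2 are not similar enough to be neighbours themselves, yet they
# share a common neighbour (basket 1), so they still receive a positive link.
baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk", "eggs"}, {"beer", "chips"}]
print(links(baskets, theta=0.4))
```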
Conference Paper
With rapid advances in online technologies, organizations are migrating from paper-based resources to digital documents to achieve high responsiveness and ease of management. These digital documents are the most important asset of an organization and are hence the chief target of insider abuse. Security policies provide the first step to prevent abuse by defining proper and improper usage of resources. Coarse-grained security policies that operate on the "principle of least privilege" [J. H. Saltzer et al., 1974] alone are not enough to address the insider threat, since the typical insider possesses a wide range of privileges to start with. In this paper, we propose a security policy that is tailored to prevent insider abuse. We define the concepts of subject, object, actions, rights, context, and information flow as applicable to the document control domain. Access is allowed based on the principles of "least privilege and minimum requirements", subject to certain constraints. Unlike existing techniques, the proposed policy engine considers, among other factors, the context of a document request and the information flow between such requests to identify potential malicious insiders. Enforcing these fine-grained access control policies gives us a better platform to prevent insider abuse. Finally, for demonstration purposes, we present a framework that can be used to specify and enforce these policies on Microsoft Word documents, one of the most popular document formats.
Conference Paper
One approach to detecting insider misbehavior is to monitor system call activity and watch for danger signs or unusual behavior. We describe an experimental system designed to test this approach. We tested the system's ability to detect common insider misbehavior by examining file system and process-related system calls. Our results show that this approach can detect many such activities.