Article

The Rules of Redaction: Identify, Protect, Review (and Repeat)

Authors: Eric A. Bier, Richard Chow, Philippe Golle, Tracy Holloway King, Jessica Staddon

Abstract

Frequent data leak reports in the press attest to the difficulty of identifying and protecting sensitive content. Redaction is particularly challenging because it seeks to protect documents by selectively removing sensitive portions of them, rather than by quarantining or encrypting the whole document. The authors review current redaction practice and technology and describe a prototype system that supports the natural redaction workflow and addresses some limitations of current technology. Their system supports all phases of the redaction process through the use of linguistic content analysis, an interactive user interface, and inference detection algorithms.


... Furthermore, deciding what is a quasi-identifier or a confidential value is much more complex than with structured data: for each piece of text we need to judge whether it can be used for re-identification or may disclose sensitive values. Such a judgment is not easy for a human expert [8], let alone for a computer program. ...
... In general, accurate protection of textual documents remains a largely manual process [8]. At most, (semi)automatic tools based on named entity recognition (NER) have been designed to remove some of the burden from the human experts. ...
... The task of protecting the private information of the individuals mentioned in text documents is referred to in the literature as document redaction [8], sanitization [46] or anonymization [5]. Whatever the name, it consists of two steps: (i) detecting (potentially) disclosive pieces of text, and (ii) masking those pieces appropriately. ...
Article
A great variety of mechanisms have been proposed to protect structured databases with numerical and categorical attributes; however, little attention has been devoted to unstructured textual data. Textual data protection requires first detecting sensitive pieces of text and then masking those pieces via suppression or generalization. Current solutions rely on classifiers that can recognize a fixed set of (allegedly sensitive) named entities. Yet, such approaches fall short of providing adequate protection because in reality references to sensitive information are not limited to a predefined set of entity types, and not all appearances of a certain entity type result in disclosure. In this work we propose a more general and flexible approach based on the notion of word embeddings. By means of word embeddings we build vectors that numerically capture the semantic relationships of the textual terms. Then we evaluate the disclosure caused by the terms on the entity to be protected according to the similarity between their vector representations. Our method also preserves the semantics (and, therefore, the utility) of the document by replacing risky terms with privacy-preserving generalizations. Empirical results show that our approach offers much more robust protection and greater utility preservation than methods based on named entity recognition.
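A minimal sketch of the kind of embedding-based check this abstract describes. The toy vectors, the similarity threshold and the generalization map are hypothetical stand-ins for pre-trained embeddings and tuned parameters, not the authors' implementation:

```python
import numpy as np

# Hypothetical toy vectors; a real system would load pre-trained
# word2vec/GloVe embeddings as described in the abstract.
embeddings = {
    "aids":      np.array([0.9, 0.1, 0.3]),
    "hiv":       np.array([0.8, 0.2, 0.35]),
    "disease":   np.array([0.5, 0.4, 0.3]),
    "barcelona": np.array([0.1, 0.9, 0.2]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sanitize(tokens, protected_entity, generalizations, threshold=0.7):
    """Replace any term whose vector is too similar to the protected
    entity with a coarser generalization (or a placeholder)."""
    target = embeddings[protected_entity]
    out = []
    for t in tokens:
        vec = embeddings.get(t.lower())
        if vec is not None and cosine(vec, target) >= threshold:
            out.append(generalizations.get(t.lower(), "[REDACTED]"))
        else:
            out.append(t)
    return out

print(sanitize(["He", "was", "treated", "for", "hiv", "in", "barcelona"],
               "aids", generalizations={"hiv": "disease"}))
# -> ['He', 'was', 'treated', 'for', 'disease', 'in', 'barcelona']
```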
... Alternatively, when sensitive terms are replaced (instead of removed) by generalisations (e.g. AIDS → disease), the process is more generically referred to as document sanitisation (Bier et al. 2009). Document sanitisation is more desirable than pure redaction, since the former better preserves the utility of the protected output. ...
... Document sanitisation is more desirable than pure redaction, since the former better preserves the utility of the protected output. Moreover, in document redaction, the existence of blacked-out parts in the released document can raise awareness of the document's sensitivity to potential attackers (Bier et al. 2009), whereas sanitisation gives no such clues. ...
... This requires a human expert who applies certain standard guidelines that detail the correct procedures to sanitise sensitive entities (National Security Agency 2005). Manual redaction has proven to be quite time-consuming and it does not scale to currently required levels of information outsourcing (Chakaravarthy et al. 2008;Bier et al. 2009). ...
Chapter
Full-text available
This chapter introduces privacy and data protection by design, and reviews privacy-enhancing techniques (PETs). Although privacy by design includes both technical and operational measures, the chapter focuses on the technical measures. First, it enumerates design strategies. Next, it considers privacy-enhancing techniques that directly address the hide strategy, but also aid in implementing the separate, control and enforce strategies. Specifically, it addresses PETs for: (1) identification, authentication and anonymity; (2) private communications; (3) privacy-preserving computations; (4) privacy in databases; and (5) discrimination prevention in data mining.
Chapter
Full-text available
Private sector companies are becoming increasingly frustrated over the lack of effective solutions to growing criminal threats in cyberspace, leading to calls by security experts for a more active cyber-defence, including offensive actions in cyberspace taken with defensive purposes in mind. However, should private companies use active cyber-defence measures, or would such an act implicate them in illegal actions? As long as there is no specific regulation defining the legal grounds for active cyber-defence, the conventional doctrine of a right to self-defence may be the closest analogy within the physical realm. This chapter examines cyber-defence along the lines of a right to self-defence and concludes that the categorisation of passive and active does not allow for a thorough analysis of the legal and ethical justification of a specific defensive measure. Instead, a categorisation based on the effects of a specific measure is suggested. Along the lines of this effect-based categorisation—and considering the capabilities as well as the limits of the application of a right to self-defence to cyberspace—this chapter proposes some concrete recommendations for companies on how to define ethical cyber-defence within their security strategy.
Chapter
Full-text available
This chapter provides an overview of state actors' opinions and strategies relating to cybersecurity matters. These are addressed on the EU level as well as on the level of the individual European Member States, while the focus is on legislation, policy and political approaches to cybersecurity. In this context, many different measures and approaches are taken both in the Union and nationally to streamline knowledge, resources, and measures to combat cybercrime. Furthermore, the role of the new European data protection framework is addressed, and it is explained why data protection has a close relationship to security matters. The main tensions and conflicts in relation to IT and cybersecurity are depicted, which revolve primarily around the frequently negative effect of IT and cybersecurity measures on the rights of data subjects. However, the issue of governmental surveillance is also addressed, with its implications for the fundamental rights of European citizens. Solution approaches to align the two domains of data protection and cybersecurity are explored, since cybersecurity incidents often involve the loss or compromise of an individual's personal information. To this end, overlaps and synergies are examined that seem promising for a more holistic approach to cyber threats. For instance, this could be achieved by applying principles such as data protection by design and by default in IT more thoroughly. In addition, methodologies of data protection impact assessments, as well as a broader deployment of technical and organisational measures using well-known information security best practices and standards, can help to enhance cybersecurity across the European Union.
Chapter
Full-text available
Cyberspace has witnessed a ‘militarisation’ as a growing number of states engage in a variety of cyber operations directed against foreign entities. The rate of this militarisation has outstripped the diplomatic efforts undertaken to provide this unique environment with some ‘rules of the road’. The primary mechanism for discussing possible norms of responsible state behaviour has been a series of UN Groups of Governmental Experts, which have produced three consensus reports over the last decade. The 2015 report recommended a series of principles and confidence-building measures to prevent conflict, but prospects for its implementation have receded as differences amongst states persist over how security concepts should be applied to cyberspace. Renewed efforts to promote responsible state behaviour will require greater engagement on the part of the private sector and civil society, both of which have a huge stake in sustaining cyber peace.
Chapter
Full-text available
This chapter outlines some concrete best practices and recommendations for cybersecurity service providers, with a focus on data sharing, data protection and penetration testing. Based on a brief outline of dilemmas that cybersecurity service providers may experience in their daily operations, it discusses data handling policies and practices of cybersecurity vendors along the following five topics: customer data handling; information about breaches; threat intelligence; vulnerability-related information; and data involved when collaborating with peers, CERTs, cybersecurity research groups, etc. There is, furthermore, a discussion of specific issues of penetration testing such as customer recruitment and execution as well as the supervision and governance of penetration testing. The chapter closes with some general recommendations regarding improving the ethical decision-making procedures of private cybersecurity service providers.
... As a result, they cannot be applied to unstructured data, such as documents, and may not scale well with large datasets because of the human intervention needed to define the risky combination of attributes for each dataset. Moreover, it has been acknowledged that defining such risky combinations is a costly and difficult task even for expert human sanitizers [19]. ...
... Finally, contrary to methods focusing on structured databases [15,17,18], our approach is solely based on data semantics, and thus, it can be applied to any kind of data, regardless of whether they are structured or not. To show the potential and the generality of our approach, in this paper we focus on the storage of raw textual documents (e.g., e-mails, Word files, etc.), which are the most challenging to protect because of their lack of structure [19] while, at the same time, they constitute the most common way to exchange -potentially sensitive-information between human actors. ...
... The algorithm then iterates over the set of data chunks created so far (lines 13-23) and evaluates whether the terms already in the chunk p, plus the new term to be allocated (IT_i), fulfill the privacy criterion (line 15). If this is the case, IT_i is stored in the chunk and removed from the set of risky terms RT_i to which it belonged (lines 18-19). Since we have removed IT_i from RT_i, by definition, the remaining terms in RT_i would not cause disclosure (i.e., a set of quasi-identifiers only causes a risk when all of them are evaluated in aggregate); thus, in the next iteration, we try to allocate all the remaining terms in RT_i at once (line 19) in order to minimize the allocation effort that an individual allocation of terms would represent (mainly in terms of web queries). ...
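A minimal sketch of the greedy allocation loop excerpted above, under the simplifying assumption that the privacy criterion is purely set-based ("no chunk may gather every term of a risky set"); the criterion in the paper is semantic and query-driven, so this is only an illustration:

```python
def violates(chunk, risky_sets):
    """Simplified privacy criterion: a chunk is unsafe if it gathers
    every term of some risky set, i.e., the quasi-identifiers could be
    aggregated within a single storage location."""
    return any(rs <= chunk for rs in risky_sets)

def allocate(risky_sets):
    """Greedily place each term into the first chunk that remains safe,
    opening a new chunk when no existing one qualifies."""
    chunks = []
    for rs in risky_sets:
        for term in sorted(rs):
            for chunk in chunks:
                if not violates(chunk | {term}, risky_sets):
                    chunk.add(term)
                    break
            else:
                chunks.append({term})
    return chunks

# Two risky sets of quasi-identifiers that must never end up together.
print(allocate([{"surgeon", "barcelona", "male"}, {"hiv", "lawsuit"}]))
# e.g. [{'barcelona', 'male', 'hiv'}, {'surgeon', 'lawsuit'}]
```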
Article
Full-text available
Even though cloud computing provides many intrinsic benefits (e.g., cost savings, availability, scalability, etc.), privacy concerns related to the lack of control over the storage and management of the outsourced (confidential) data still prevent many customers from migrating to the cloud. In this respect, several privacy-protection mechanisms based on a prior encryption of the data to be outsourced have been proposed. Data encryption offers robust security, but at the cost of hampering the efficiency of the service and limiting the functionalities that can be applied over the (encrypted) data stored on cloud premises. Because both efficiency and functionality are crucial advantages of cloud computing, especially in SaaS, in this paper we aim at retaining them by proposing a privacy-protection mechanism that relies on splitting (clear) data, and on the distributed storage offered by the increasingly popular notion of multi-clouds. Specifically, we propose a semantically-grounded data splitting mechanism that is able to automatically detect pieces of data that may cause privacy risks and split them on local premises, so that each chunk does not incur those risks; then, chunks of clear data are independently stored in the separate locations of a multi-cloud, so that external entities (cloud service providers and attackers) cannot have access to the whole of the confidential data. Because partial data are stored in the clear on cloud premises, outsourced functionalities are seamlessly and efficiently supported by just broadcasting queries to the different cloud locations. To enforce a robust privacy notion, our proposal relies on a privacy model that offers a priori privacy guarantees; to ensure its feasibility, we have designed heuristic algorithms that minimize the number of cloud storage locations we need; to show its potential and generality, we have applied it to the least structured and most challenging data type: plain textual documents.
... Because of the enormous amount of data to be managed and the burden and cost of manual data protection [3], many automated methods have been proposed in recent years under the umbrella of Statistical Disclosure Control (SDC) [4]. These methods aim at masking input data in a way that either identity or confidential attribute disclosure are minimized. ...
... In general, plain textual data protection consists of two main tasks: i) identify textual terms that may disclose sensitive information according to a privacy criterion (e.g., names, addresses, authorship, personal features, etc.); and ii) mask these terms to minimize disclosure by means of an appropriate protection mechanism (e.g., removal, generalization, etc.). The community refers to the act of removing or blacking-out sensitive terms as redaction, whereas sanitization usually consists in coarsening them via generalization (e.g., AIDS can be replaced by a less detailed generalization such as disease) [3]. The latter approach, which we use in this paper, better preserves the utility of the output. ...
... The enforcement of C-sanitization relies on the foundations of information theory to assess and quantify the semantics to be protected (defined by C) and those disclosed by the terms appearing in the document to be protected, much like human experts do [3]. The implementation of C-sanitization can provide the following advantages over the above-described works: i) automatic detection of terms that may cause disclosure of sensitive data via semantic inferences, a task that has been identified as one of the most difficult and time-consuming for human experts [3,16]; ii) utility-preserving sanitization based on accurate term generalization; iii) intuitive definition of the a priori privacy guarantees by means of linguistic labels (i.e., the set C of entities to be protected), instead of the abstract numbers used in all the former privacy models; and iv) individual and independent protection of documents (rather than homogeneous document collections), regardless of their content or structure. ...
Article
Full-text available
Privacy has become a serious concern for modern Information Societies. The sensitive nature of much of the data that are daily exchanged or released to untrusted parties requires that responsible organizations undertake appropriate privacy protection measures. Nowadays, much of these data are texts (e.g., emails, messages posted in social media, healthcare outcomes, etc.) that, because of their unstructured and semantic nature, constitute a challenge for automatic data protection methods. In fact, textual documents are usually protected manually, in a process known as document redaction or sanitization. To do so, human experts identify sensitive terms (i.e., terms that may reveal identities and/or confidential information) and protect them accordingly (e.g., via removal or, preferably, generalization). To relieve experts from this burdensome task, in a previous work we introduced the theoretical basis of C-sanitization, an inherently semantic privacy model that provides the basis for the development of automatic document redaction/sanitization algorithms and offers clear and a priori privacy guarantees on data protection; despite its potential benefits, C-sanitization still presents some limitations when applied in practice (mainly regarding flexibility, efficiency and accuracy). In this paper, we propose a new, more flexible model, named (C, g(C))-sanitization, which enables an intuitive configuration of the trade-off between the desired level of protection (i.e., controlled information disclosure) and the preservation of the utility of the protected data (i.e., amount of semantics to be preserved). Moreover, we also present a set of technical solutions and algorithms that provide an efficient and scalable implementation of the model and improve its practical accuracy, as we also illustrate through empirical experiments.
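The information-theoretic disclosure assessment underlying C-sanitization can be illustrated with toy numbers. Here term probabilities come from a hypothetical count table standing in for the web-scale statistics used in the papers, and disclosure is measured as pointwise mutual information normalized by the entity's information content; treat both the numbers and the exact normalization as assumptions:

```python
import math

# Hypothetical (co-)occurrence counts standing in for web hit counts.
N = 1_000_000
counts = {"aids": 5_000, "hiv": 4_000, "antiretroviral": 800, "city": 90_000}
co_counts = {("hiv", "aids"): 3_500, ("antiretroviral", "aids"): 500,
             ("city", "aids"): 400}

def ic(term):
    # Information content: IC(t) = -log2 p(t)
    return -math.log2(counts[term] / N)

def disclosure(term, entity):
    """Pointwise mutual information of a term and the protected entity,
    normalized by the entity's IC: values near 1 mean the term alone
    (almost) reveals the entity and should be masked or generalized."""
    p_t, p_c = counts[term] / N, counts[entity] / N
    p_tc = co_counts.get((term, entity), 0) / N
    if p_tc == 0:
        return 0.0
    return math.log2(p_tc / (p_t * p_c)) / ic(entity)

for t in ("hiv", "antiretroviral", "city"):
    print(t, round(disclosure(t, "aids"), 2))
# hiv 0.97, antiretroviral 0.91, city -0.02 (with these toy counts)
```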
... However, this process significantly reduces the utility of the redacted document [5]. Moreover, the presence of blacked-out parts can raise potential attackers' awareness of the document's sensitivity [6]. Document sanitisation, on the other hand, opts for generalising the sensitive content instead of removing it [6]. ...
... Moreover, the presence of blacked-out parts can raise potential attackers' awareness of the document's sensitivity [6]. Document sanitisation, on the other hand, opts for generalising the sensitive content instead of removing it [6]. Generalisation preserves more utility than redaction because a less detailed but still useful document is produced [5]. ...
... Document redaction/sanitisation is usually done manually, in a process that is costly, time-consuming and prone to disclosure risks [6]. ...
Conference Paper
Full-text available
With the adoption of ITs, a large number of patient-related documents are compiled by healthcare organisations. Quite often, these data need to be released to third parties for research or business purposes. The inherent sensitivity of patients' information has led to the definition of legislation to protect the privacy of individuals. To comply with this legislation, redaction or sanitisation of patient-related documents is needed before their release. This is usually done manually, which is costly and time-consuming, or by means of ad-hoc solutions that just protect structured types of sensitive information (e.g. social security numbers), or that are based on removing sensitive terms, which hampers the utility of the output. In this paper, we propose an automatic sanitisation method for textual medical documents that is able to protect sensitive terms and those that are semantically related to them, while retaining the utility of the output as much as possible. Unlike redaction schemes, which are based on term removal, our method improves the utility of the protected output by replacing sensitive terms with appropriate generalisations retrieved from medical and general-purpose knowledge bases. Experiments conducted on highly sensitive documents, and in coherence with current regulations on healthcare data privacy, show promising results in terms of the output's privacy and utility.
... The main side effect of document redaction is that it significantly reduces the utility of the protected content [8]. Another important drawback is that the existence of obscured or blacked-out parts can raise potential attackers' awareness of the document's sensitivity [4]. This is especially problematic in documents linked to a specific knowledge area such as healthcare, because the number of different textual elements that usually appear in the documents is relatively limited. ...
... A more suitable alternative to document redaction consists of generalising sensitive content instead of removing it, a measure that preserves more content utility [8]. This is usually referred to as document sanitisation [4]. As a result of content generalisation, a less detailed but still useful document is obtained, while no explicit clues about the document's sensitivity are given. ...
... This task has proven to be burdensome, very time-consuming [16] and prone to disclosure risks [6]. For example, the authors in [4] interviewed the medical records manager for a 10,000+ patient healthcare provider in California, who stated that the act of redacting records takes approximately 20 per cent of her time while the remaining 80 per cent is consumed by the more difficult task of deciding what to redact. In order to deal with textual terms and their potential semantic relationships, this expert maintains lists of names of medications, treatments, etc., which are related to sensitive diseases to be protected, such as HIV. ...
Article
Full-text available
The adoption of ITs by medical organisations makes possible the compilation of large amounts of healthcare data, which quite often need to be released to third parties for research or business purposes. Much of these data are of a sensitive nature, because they may include patient-related documents such as electronic healthcare records. In order to protect the privacy of individuals, several pieces of legislation on healthcare data management, which state the kind of information that should be protected, have been defined. Traditionally, to comply with current legislation, a manual redaction process is applied to patient-related documents in order to remove or black out sensitive terms. This process is costly and time-consuming and has the undesired side effect of severely reducing the utility of the released content. Automatic methods available in the literature usually propose ad-hoc solutions that are limited to protecting specific types of structured information (e.g. e-mail addresses, social security numbers, etc.); as a result, they are hardly applicable to the sensitive entities stated in current regulations that do not present those structural regularities (e.g. diseases, symptoms, treatments, etc.). To tackle these limitations, in this paper we propose an automatic sanitisation method for textual medical documents (e.g. electronic healthcare records) that is able to protect, regardless of their structure, sensitive entities (e.g. diseases) and also those semantically related terms (e.g. symptoms) that may disclose the former ones. Contrary to redaction schemes based on term removal, our approach improves the utility of the protected output by replacing sensitive terms with appropriate generalisations retrieved from several medical and general-purpose knowledge bases. Experiments conducted on highly sensitive documents, and in coherence with current regulations on healthcare data privacy, show promising results in terms of the practical privacy and utility of the protected output.
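As a rough illustration of generalisation-based sanitisation, the sketch below replaces a sensitive term with a WordNet hypernym. The papers also draw on dedicated medical knowledge bases, so the synset choice, the climbing depth and the printed outputs (which depend on the WordNet version) are placeholder assumptions:

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def generalize(term, steps=1):
    """Climb `steps` hypernym links above the term's first noun synset
    and return the coarser surface form; fall back to removal if the
    term is not covered by WordNet."""
    synsets = wn.synsets(term, pos=wn.NOUN)
    if not synsets:
        return "[REDACTED]"
    node = synsets[0]
    for _ in range(steps):
        hypernyms = node.hypernyms()
        if not hypernyms:
            break
        node = hypernyms[0]
    return node.lemma_names()[0].replace("_", " ")

# Outputs depend on the WordNet version; larger `steps` gives coarser
# (safer but less informative) generalisations.
print(generalize("aids", steps=1))
print(generalize("aids", steps=3))
```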
... Its goal is to mimic and, hence, automate the reasoning of human sanitizers with regard to semantic inferences, disclosure analysis and protection of textual documents. To achieve that, our proposal relies on an assessment and quantification of the data semantics that human experts usually consider in document sanitization (Bier et al., 2009; Gordon, 2013). Our proposal provides the following contributions over the state of the art: ...
... • In comparison with available models (Anandan et al., 2012; Cumby & Ghani, 2011), which assume that all risky terms (sensitive entities or related terms) have been identified a priori, our proposal automates both the detection of terms that can disclose sensitive data via semantic inferences and their protection. This relieves human sanitizers from manually identifying related terms, which has been identified as one of the most difficult and time-consuming challenges (Bier et al., 2009; Gordon, 2013). To do so, our model considers, as human sanitizers do, the semantic relationships by which terms or combinations of terms appearing in a document would disclose sensitive information via semantic inferences. ...
... More specifically, within the context of unstructured data protection, our goal is to mimic human judgments with regard to document sanitization, so that human experts can be relieved from such a time-consuming task. As stated in the introduction, because human sanitizers analyze data from a semantic perspective and evaluate disclosure according to semantic inferences (Bier et al., 2009; Gordon, 2013), we interpret disclosure in terms of the semantics of the data. Informally, given the whole domain knowledge, this can be assessed by answering the following question: does a released term or a combination of terms in a document allow one to unequivocally infer (via semantic inference) and, thus, disclose a sensitive entity? By unequivocal we refer to the fact that there is no semantic ambiguity in the correct inference given the domain knowledge. ...
Article
Full-text available
Within the current context of Information Societies, large amounts of information are daily exchanged and/or released. The sensitive nature of much of this information causes a serious privacy threat when documents are uncontrollably made available to untrusted third parties. In such cases, appropriate data protection measures should be undertaken by the responsible organization, especially under the umbrella of current legislation on data privacy. To do so, human experts are usually requested to redact or sanitize document contents. To relieve experts of this burdensome task, this paper presents a privacy model for document redaction/sanitization, which offers several advantages over other models available in the literature. Based on the well-established foundations of data semantics and information theory, our model provides a framework to develop and implement automated and inherently semantic redaction/sanitization tools. Moreover, contrary to ad-hoc redaction methods, our proposal provides a priori privacy guarantees which can be intuitively defined according to current legislation on data privacy. Empirical tests performed within the context of several use cases illustrate the applicability of our model and its ability to mimic the reasoning of human sanitizers.
... This approach requires a human expert who applies certain standard guidelines that detail the correct procedures to sanitize sensitive entities [18]. Manual redaction has proven to be a quite time-consuming process and does not scale with regard to the requirements of the Information Society [3,6]. Industry and academia first tried to mitigate this issue by providing semi-automatic proposals that highlight potentially sensitive elements in the input text and leave the final decision about erasing, generalizing or keeping them to the human expert [5,8,19]. ...
... This obviously minimizes the disclosure risk of s_ij, but also hampers the utility of the output, since the reader would gain no information about s_ij. Moreover, the removal of certain pieces of text also raises potential attackers' awareness of the document's sensitivity [3]. On the contrary, when the mechanism ζ_1 proposes replacing the sensitive term s_ij by an appropriate generalization g(s_ij) (e.g. ...
... An alternative to WordNet is the Open Directory Project (ODP). ODP is the largest, most comprehensive human-edited directory of the Web. ...
Article
Full-text available
Traditionally, redaction has been the method chosen to mitigate the privacy issues related to the declassification of textual documents containing sensitive data. This process is based on removing sensitive words in the documents prior to their release and has the undesired side effect of severely reducing the utility of the content. Document sanitization is a recent alternative to redaction, which avoids utility issues by generalizing the sensitive terms instead of eliminating them. Some (semi-)automatic redaction/sanitization schemes can be found in the literature; however, they usually neglect the importance of semantic correlations between the terms of the document, even though these may disclose sanitized/redacted sensitive terms. To tackle this issue, this paper proposes a theoretical framework grounded in information theory, which offers a general model capable of measuring the disclosure risk caused by semantically correlated terms, regardless of whether they are proposed for removal or generalization. The new method specifically focuses on generating sanitized documents that retain as much utility (i.e., semantics) as possible while fulfilling the privacy requirements. The implementation of the method has been evaluated in a practical setting, showing that the new approach improves the output's utility in comparison to the previous work, while retaining a similar level of accuracy.
... In particular, rather than merely annotating text spans of certain semantic types (e.g., names, locations, organizations), as done in virtually all previous works (see Section 3), the annotation was explicitly focused on identifying textual elements that may affect the disclosure risk of the individual to protect. This annotation process seeks to reflect the way in which human experts approach manual document sanitization in practice (Bier et al. 2009). Each text span annotated in the TAB corpus is associated with a masking decision that expresses whether, according to the annotator, the span ought to be masked in order to conceal the identity of the person in question. ...
... Compared to the approaches discussed so far, which focus on NER-oriented annotations and de-identification rather than anonymization, these works operate on all terms that may cause direct or indirect re-identification of the individual to be protected, regardless of their semantic type. These privacy-oriented annotations are more accurate than NER-based annotations (Lison et al. 2021), and better capture the way in which manual document redaction is done by human experts (Bier et al. 2009). However, annotations of text spans are just binary (either sensitive or not), and neither describe the entity type nor the type of incurred risk (identity or attribute disclosure). ...
Article
Full-text available
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored towards measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts and baseline models are available on: https://github.com/NorskRegnesentral/text-anonymization-benchmark.
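A simplified sketch of the kind of privacy-oriented metric the benchmark proposes: an annotated individual counts as protected only if every gold span that reveals them is covered by a system mask. The span format and scoring details are assumptions, not the TAB implementation:

```python
def entity_recall(gold_entities, system_masks):
    """gold_entities: {entity_id: [(start, end), ...]} spans that must all
    be masked to conceal that entity; system_masks: (start, end) spans
    masked by the anonymizer. Returns the fraction of entities whose
    every span is covered by some system mask."""
    def covered(span):
        return any(m[0] <= span[0] and span[1] <= m[1] for m in system_masks)
    protected = sum(all(covered(s) for s in spans)
                    for spans in gold_entities.values())
    return protected / len(gold_entities)

gold = {"applicant": [(0, 10), (57, 67)], "employer": [(120, 135)]}
masks = [(0, 10), (55, 70)]
print(entity_recall(gold, masks))   # 0.5: the employer mention was missed
```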
... Document data such as judicial records and medical records also contain private information. Previous works [13], [14], [15] attempted to evaluate the risk of or sanitize document data. In one study [14], documents were defined as being composed of entities and terms, and each entity was characterized by its related terms. ...
... The anonymization algorithm sanitizes the terms so that more than K entities can be inferred from the terms. In another work [15], a sanitization tool was proposed with functions for understanding the content of a sentence by leveraging linguistic content analysis and for assessing the general sensitivity of the content by using inference detection algorithms. Through their interviews and feedback, the authors claimed that sanitization is an alternative approach in which a document is revised to hide sensitive content while retaining as much cohesion and utility as possible. ...
Article
A huge number of documents such as news articles, public reports, and personal essays have been released on websites and social media. Once documents containing privacy-sensitive information are published, the risk of privacy breaches increases, thus requiring very careful review of documents prior to publication. In many cases, human experts redact or sanitize documents before publishing them; however, this approach can be inefficient with regard to cost and accuracy. Furthermore, such measures do not guarantee that critical privacy risks are eliminated from the documents. In this paper, we present a generalized adversary model and apply it to document data. This work devises an attack algorithm for documents using a web search engine, and then proposes a privacy-preserving framework against the attacks. We evaluate the privacy risks for actual accident reports from schools and court documents. In experiments using these reports, we show that human-sanitized documents still contain privacy risks and that our proposed approach can contribute to risk reduction.
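The attack pipeline can be sketched as follows, with search_hit_count a hypothetical stub standing in for a call to a web search engine; the paper's actual query construction and thresholds are not reproduced here:

```python
from itertools import combinations

def search_hit_count(query_terms):
    """Hypothetical stub for a web search call; a real attacker would
    submit the terms to a search engine and read the result count."""
    fake_index = {("broken arm", "pasadena", "teacher"): 3}
    return fake_index.get(tuple(sorted(query_terms)), 10_000)

def risky_combinations(terms, k=3, threshold=10):
    """Flag term combinations that narrow the candidate set enough to
    re-identify the person behind a 'sanitized' report."""
    return [combo for combo in combinations(sorted(terms), k)
            if search_hit_count(combo) <= threshold]

terms = ["pasadena", "teacher", "broken arm", "tuesday"]
print(risky_combinations(terms))
# -> [('broken arm', 'pasadena', 'teacher')]: these innocuous-looking
#    terms jointly single out one person, so at least one must be masked.
```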
... Furthermore, deciding what is a quasi-identifier is much more complex than with structured data: for each piece of information in the text we need to judge whether it can be used for re-identification. Such judgment is not easy for a human expert [3], let alone for a computer program. ...
... Document anonymization is a process that consists of the following two steps: (i) detecting pieces of information that can be used to re-identify the entity to be protected, and (ii) masking such pieces of information appropriately (that is, by applying the least amount of masking required to keep the risk of re-identification within an acceptable level). In this work, we focus on the first step, which is substantially more difficult than the second [3]. ...
... Traditionally, data protection has been performed manually, in a process by which human experts detect and prevent information disclosure risks (Bier et al., 2009), that is, the chance of discovering sensitive data (identities and/or confidential information) by means of direct or indirect semantic inferences. For example, SS numbers, salaries, or disease names, explicitly reveal sensitive data by direct inference; on the other hand, treatments or drugs that are (semantically) related to a sensitive disease, readings that may suggest political preferences, or personal habits that can be associated to religious or sexual orientations may reveal sensitive information via indirect inference (Chow et al., 2008). ...
... Data protection methods require an expert to manually state all the data pieces in each data set that may cause disclosure and should be protected. For example, in a structured database, the expert is required to specify which attributes can, individually or in aggregate, re-identify an individual (Samarati, 2001); whereas, in a textual document or a document collection, the expert should tag which textual terms may cause disclosure, either directly or via semantic inferences (Bier et al., 2009). Privacy models use quantitative privacy parameters to express privacy guarantees: with k-anonymity, k states the minimum probability of re-identification resulting from making records indistinguishable (Samarati, 2001); with t-closeness, t defines the difference between the distributions of confidential attributes in the original and protected data sets (Li and Li, 2007); and with ε-differential privacy, ε states the probability that the protected outcome is insensitive to changes in one input record (Dwork, 2006). ...
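For instance, the k of k-anonymity mentioned above is simply the size of the smallest group of records sharing the same quasi-identifier values, which can be checked directly (a minimal sketch with illustrative column names):

```python
import pandas as pd

records = pd.DataFrame({
    "age_range": ["30-40", "30-40", "30-40", "40-50"],
    "zip":       ["080**", "080**", "080**", "081**"],
    "diagnosis": ["flu", "hiv", "flu", "asthma"],
})

# k is the size of the smallest equivalence class over the quasi-identifiers.
k = records.groupby(["age_range", "zip"]).size().min()
print(k)   # 1 -> the last record is unique on its quasi-identifiers,
           # so the table is only 1-anonymous and re-identifiable
```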
Article
Full-text available
Purpose: To overcome the limitations of purely statistical approaches to data protection, the purpose of this paper is to propose Semantic Disclosure Control (SeDC): an inherently semantic privacy protection paradigm that, by relying on state-of-the-art semantic technologies, rethinks privacy and data protection in terms of the meaning of the data.
Design/methodology/approach: The need for data protection mechanisms able to manage data from a semantic perspective is discussed and the limitations of statistical approaches are highlighted. Then, SeDC is presented by detailing how it can be enforced to detect and protect sensitive data.
Findings: So far, data privacy has been tackled from a statistical perspective; that is, available solutions focus just on the distribution of the data values. This contrasts with the semantic way in which humans understand and manage (sensitive) data. As a result, current solutions present limitations both in preventing disclosure risks and in preserving the semantics (utility) of the protected data.
Practical implications: SeDC captures more general, realistic and intuitive notions of privacy and information disclosure than purely statistical methods. As a result, it is better suited to protect heterogeneous and unstructured data, which are the most common in current data release scenarios. Moreover, SeDC preserves the semantics of the protected data better than statistical approaches, which is crucial when using protected data for research.
Social implications: Individuals are increasingly aware of the privacy threats that the uncontrolled collection and exploitation of their personal data may produce. In this respect, SeDC offers an intuitive notion of privacy protection that users can easily understand. It also naturally captures the (non-quantitative) privacy notions stated in current legislation on personal data protection.
Originality/value: Contrary to statistical approaches to data protection, SeDC assesses disclosure risks and enforces data protection from a semantic perspective. As a result, it offers more general, intuitive, robust and utility-preserving protection of data, regardless of their type and structure.
... Document redaction consists of removing or blacking out sensitive terms in plain textual documents. Alternatively, when sensitive terms are replaced (instead of removed) by generalizations (e.g., AIDS → disease), the process is more generically referred to as document sanitization 55 . Document sanitization is more desirable than pure redaction, since the former better preserves the utility of the protected output. ...
... This requires a human expert who applies certain standard guidelines that detail the correct procedures to sanitize sensitive entities 56 . Manual redaction has proven to be quite time-consuming and does not scale to currently required levels of information outsourcing 55,57 . ...
... One disadvantage of redaction is the loss of data utility. The other disadvantage is that the existence of blacked-out parts in the released document can raise awareness of the document's sensitivity to potential attackers [5]. It is easier to perform redaction than obfuscation. ...
... For Add/Delete, Random is to randomly pick a term in the user tweets to add/delete and then solve Eqn. (5) to obtain the number of the term's occurrences to add/delete. On the other hand, the Minimum operation is to pick the term that causes the minimum number of changes to the tweets, by solving Eqn. (5) for each vocabulary term and finding the minimum of x_k. ...
Conference Paper
The prodigious amount of user-generated content continues to grow at an enormous rate. While it greatly facilitates the flow of information and ideas among people and communities, it may pose great threat to our individual privacy. In this paper, we demonstrate that the private traits of individuals can be inferred from user-generated content by using text classification techniques. Specifically, we study three private attributes on Twitter users: religion, political leaning, and marital status. The ground truth labels of the private traits can be readily collected from the Twitter bio field. Based on the tweets posted by the users and their corresponding bios, we show that text classification yields a high accuracy of identification of these personal attributes, which poses a great privacy risk on user-generated content. We further propose a constrained utility maximization framework for preserving user privacy. The goal is to maximize the utility of data when modifying the user-generated content, while degrading the prediction performance of the adversary. The KL divergence is minimized between the prior knowledge about the private attribute and the posterior probability after seeing the user-generated data. Based on this proposed framework, we investigate several specific data sanitization operations for privacy preservation: add, delete, or replace words in the tweets. We derive the exact transformation of the data under each operation. The experiments demonstrate the effectiveness of the proposed framework.
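The privacy side of the proposed objective can be illustrated numerically: the divergence between the adversary's prior over the private attribute and the classifier's posterior after seeing the (modified) tweets shrinks as sanitization succeeds. The distributions below are made-up numbers, and the direction of the KL divergence is chosen here purely for illustration:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q) in nats

# Adversary's prior over a private attribute (e.g., political leaning).
prior = np.array([0.5, 0.5])

# Classifier posteriors after seeing the original vs. the sanitized tweets.
posterior_original  = np.array([0.9, 0.1])
posterior_sanitized = np.array([0.55, 0.45])

# Smaller divergence from the prior means less leaked information.
print(entropy(posterior_original, prior))    # ~0.368
print(entropy(posterior_sanitized, prior))   # ~0.005
```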
... Previous work on building anonymous environments focused on amnesiac behavior [26], homogeneous environments [26,30], enforced anonymous communication [30], and even scrubbing of file types [1,3,25,29]. Amnesiac behavior provides plausible deniability but requires the user to remember and reproduce state. ...
... Naively transferring these files will likely leak the user's or some confidant's identity [5,6]. WiNoN plans to eliminate this possibility by automatically stripping files of potentially identifying material by using appropriate tools [1,3,25,29]. Such an approach comes with caveats. ...
Article
Full-text available
Despite the attempts of well-designed anonymous communication tools to protect users from being tracked or identified, other artifacts, such as a user's environment and behavior, may leak a user's identity. Plugging this leaky boat of web anonymity requires a "top-to-bottom" whole-system approach, rather than focusing on specific protocols or layers. As an initial step, we present WiNoN, a general purpose anonymity-centric system architecture that plugs identity leaks from the network layer up. The core concept powering WiNoN - nym-browse or pseudonym browsing mode - gives the user a web-browsing environment in which each of the user's contextually independent web activities runs in a unique, yet homogeneous container. This enforced isolation limits accidental leakage of private information regardless of misconfigured anonymity tools, trojan applications, or bugged files. This paper explores the WiNoN design space and presents an early prototype supporting Tor, Dissent, and SWEET communication tools as well as an incognito mode.
... However, sending patients' medical records has to go through a sanitization process (Bier et al., 2009) in which sensitive data or generally any PII information is removed and replaced with meaningless information. The aim is to prevent the exposure of patients' identities. ...
Article
Full-text available
In this work, we present a framework that facilitates the sharing of EHRs among the community of health-care providers (HCPs). However, the sharing might be obstructed by patients' privacy and the controlling legislation. Nevertheless, our sharing scheme for EHRs strives to respect patients' privacy and comply with relevant legislative guidelines, e.g., HIPAA. The proposed work introduces two services while sharing the EHRs: privacy and non-repudiation. To this end, we introduce the partners and the role of each during the course of exchanging EHRs. The principle of sharing EHRs among HCPs has to be reinforced to save patients' lives, while cryptographic primitives have to be employed to serve this purpose. In this paper, we are motivated to introduce the notion of a non-repudiation private membership test (NR-PMT). With NR-PMT, we help patients receive medical services with great flexibility while maintaining their privacy and thwarting all possible threats that could disclose their identities. In addition, a formal security analysis based on Kailar's accountability framework has been used to analyze the proposed framework. Also, a complexity analysis has been conducted.
... This special letter, denoted here by # for consistency, represents some information that is missing (i.e., a missing value) from these datasets. For instance, in genome sequencing data, # corresponds to an unknown DNA base [24]; in databases, # represents a value that has not been recorded [25], [26]; and in masked data outputted by other privacy-preserving methods [27], # is introduced deliberately to achieve their privacy goal. ...
Article
Full-text available
Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the process of obtaining all patterns occurring frequently enough in a given dataset. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We also complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well. We show that, unlike other popular approaches, our methods can fill missing values in genomic sequences, while preserving the accuracy of frequent pattern mining.
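The core sanitization operation, overwriting occurrences of a confidential pattern with the special letter #, and the spurious-pattern side effect described in the abstract can be reproduced on a toy string (dropping the #s before release is a simplifying assumption made here for illustration):

```python
from collections import Counter

def hide(text, pattern, mask="#"):
    """Overwrite every occurrence of a confidential pattern with #s."""
    return text.replace(pattern, mask * len(pattern))

def kgrams(text, k=2):
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

original = "xabyxabyzy"
masked   = hide(original, "ab")      # 'x##yx##yzy'
released = masked.replace("#", "")   # 'xyxyzy': #s dropped before release

print(sorted(kgrams(original).items()))
print(sorted(kgrams(released).items()))
# 'xy' appears twice in the released string although 'x' and 'y' were never
# adjacent in the original: a spurious frequent pattern of the kind the
# paper's ILP formulations are designed to avoid.
```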
... These privacy models and their implementations are, however, difficult to apply to unstructured data such as texts. In fact, text anonymisation has been traditionally enforced manually, a process that is costly, time-consuming and prone to errors (Bier et al., 2009). These limitations led to the development of various computational frameworks designed to extend automated or semi-automated anonymisation to the text domain (Meystre et al., 2010; Sánchez and Batet, 2016; Dernoncourt et al., 2017). ...
... This special letter, denoted here by # for consistency, represents some information that is missing from these datasets. For instance, in genome sequencing data, # corresponds to an unknown DNA base [18]; in databases, # represents a value that has not been recorded [7], [12]; and in masked data outputted by other privacy-preserving methods [6], # is introduced deliberately to achieve their privacy goal. ...
Conference Paper
Full-text available
We initiate a study on the fundamental relation between data sanitization (i.e., the process of hiding confidential information in a given dataset) and frequent pattern mining, in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns introducing, however, a number of spurious patterns that may harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is twofold. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under certain realistic assumptions on the problem parameters.
... Modern applications collect large amounts of data to improve security, customer experience and business analytics. However, if some data are not needed beyond their immediate use, or only part of the data is needed, it is best not to store the data, or to store only the needed part [26]. For example, storing only the last four digits of Social Security Numbers (SSNs) instead of the whole SSN for verification, or storing only the year of birth when it suffices to identify senior citizens, can reduce the risk. ...
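As a toy illustration of the data-minimization advice in the excerpt (our own sketch; field and function names are hypothetical):

    def minimize_record(record):
        # Keep only the parts of sensitive fields the application needs:
        # the last four SSN digits for verification, the year of birth
        # for age checks.
        return {
            "ssn_last4": record["ssn"][-4:],
            "birth_year": record["birth_date"].split("-")[0],
        }

    stored = minimize_record({"ssn": "123-45-6789", "birth_date": "1950-07-21"})
    print(stored)   # {'ssn_last4': '6789', 'birth_year': '1950'}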
Article
Full-text available
Advances in data analytics have made data valuable for gaining business intelligence about sales, marketing and customer service. Enterprises are increasingly collecting and storing more and more data about customers and their behavior. Consequently, data breaches have grown drastically in recent years and have become one of the key challenges faced by organizations. A data breach typically indicates that other defense mechanisms and security practices have failed and attackers have succeeded in stealing data by exploiting vulnerabilities. Software bugs, misconfigurations, unpatched security bugs or employees not following security practices properly can cause these failures. An in-depth understanding of the points of failure, and of the effectiveness of the available defense methods, is crucial to fighting data breaches and minimizing their aftermath. This paper studies the characteristics of the exploits behind recent major data breaches, evaluates the available mitigations and their effectiveness, and then explores data-centric security strategies and the challenges of implementing them in enterprises. Index Terms: data security, data breach, defense-in-depth, data encryption, security at rest, challenges in data security.
... A way to anonymize documents is document redaction, which consists of removing or blacking out sensitive terms in the unstructured text of documents. Alternatively, one may resort to the more general technique of document sanitization (Bier et al. 2009), whereby sensitive terms are replaced (rather than removed) with generalizations (e.g., the last byte or bytes of IP addresses can be deleted). Document sanitization is more desirable than pure redaction, because it better preserves the utility of the protected output. ...
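The IP-address example can be made concrete with a short sketch (ours, with hypothetical names): generalizing an IPv4 address by zeroing its trailing octets maps it to a subnet instead of a single host:

    def generalize_ipv4(addr, keep_octets=3):
        # Zero the trailing octets so the address denotes a subnet
        # rather than a single host.
        octets = addr.split(".")
        return ".".join(octets[:keep_octets] + ["0"] * (4 - keep_octets))

    print(generalize_ipv4("192.168.42.17"))       # 192.168.42.0
    print(generalize_ipv4("192.168.42.17", 2))    # 192.168.0.0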
Article
Full-text available
Our society is being shaped in a non-negligible way by the technological advances of recent years, especially in information and communications technologies (ICTs). The pervasiveness and democratization of ICTs have allowed people from all backgrounds to access and use them, which has resulted in new information-based assets. At the same time, this phenomenon has brought a new class of problems, in the form of activists, criminals and state actors that target the new assets to achieve their goals, legitimate or not. Cybersecurity includes the research, tools and techniques to protect information assets. However, some cybersecurity measures may clash with the ethical values of citizens. We analyze the synergies and tensions between some of these values, namely security, privacy, fairness and autonomy. From this analysis, we derive a value graph, and then we set out to identify those paths in the graph that lead to satisfying all four aforementioned values in the cybersecurity setting, by taking advantage of their synergies and avoiding their tensions. We illustrate our conceptual discussion with examples of enabling technologies. We also sketch how our methodology can be generalized to any setting where several potentially conflicting values have to be satisfied.
... Existing technologies filter information in different ways, including privacy-enhanced access control (Ardagna et al., 2010), Data Loss Prevention techniques, redaction (Bier et al., 2009), various privacy enhancing technologies and database proxies like Informatica's dynamic data masking tool (Informatica, 2013). Existing auditing frameworks typically verify the adequacy of the data handling controls manually (American Institute of Certified Public Accountants, 2016; ISO, 2013). ...
Article
This article focuses on the role of accountability within information management, particularly in cloud computing contexts. Key to this notion is that an accountable Cloud Provider must demonstrate both willingness and capacity to be a responsible steward of other people's data. More generally, the notion of accountability is defined as it applies to the cloud, and a conceptual model is presented related to the provision of accountability of cloud services. This allows accountability to be considered at various levels of abstraction, including its operationalisation. It is underpinned by fundamental requirements for strong accountability, which in particular are aimed at avoiding risks in the provision and verification of accounts (which include different types of accountability evidence and notifications that may need to be provided to other cloud actors, including data subjects, cloud customers and regulators). In addition, the article sketches what kinds of tools, mechanisms and guidelines support this in practice, and discusses them in the light of the upcoming European Data Protection Regulation.
Article
Purpose: This paper proposes a privacy-preserving paradigm for open data sharing based on the following foundations: subjects have unique privacy requirements; personal data are usually published incrementally in different sources; and privacy has a time-dependent element.
Design/methodology/approach: This study first discusses the privacy threats related to open data sharing. Next, these threats are tackled by proposing a new privacy-preserving paradigm. The main challenges related to the enforcement of the paradigm are discussed, and some suitable solutions are identified.
Findings: Classic privacy-preserving mechanisms are ineffective against observers constantly monitoring and aggregating pieces of personal data released through the internet. Moreover, these methods do not consider individual privacy needs.
Research limitations/implications: This study characterizes the challenges to be tackled by the new paradigm and identifies some promising works, but further research proposing specific technical solutions is needed.
Practical implications: This work provides a natural solution to dynamic and heterogeneous open data sharing scenarios that require user-controlled personalized privacy protection.
Social implications: There is an increasing social understanding of the privacy threats that the uncontrolled collection and exploitation of personal data may produce. The new paradigm allows subjects to be aware of the risks inherent to their data and to control their release.
Originality/value: Contrary to classic data protection mechanisms, the new proposal centers privacy protection on the individuals and considers the privacy risks throughout the whole life cycle of the data release.
Chapter
Cloud computing refers to the underlying infrastructure for an emerging model of service provision that has the advantage of reducing cost by sharing computing and storage resources, combined with an on-demand provisioning mechanism relying on a pay-per-use business model. These new features have a direct impact on information technology (IT) budgeting but also affect traditional security, trust and privacy mechanisms. The advantages of cloud computing (its ability to scale rapidly, store data remotely and share services in a dynamic environment) can become disadvantages in maintaining a level of assurance sufficient to sustain the confidence of potential customers. Some core traditional mechanisms for addressing privacy (such as model contracts) are no longer flexible or dynamic enough, so new approaches need to be developed to fit this new paradigm. In this chapter, we assess how security, trust and privacy issues arise in the context of cloud computing and discuss ways in which they may be addressed.
Article
Full-text available
Deep grammars that include tokenization, morphology, syntax, and semantic layers have obtained broad coverage in conjunction with high efficiency. This allows them to play a crucial role in applications. However, these grammars are often developed as general purpose grammars, expecting "standard" input, and have to be specialized for the application domain. This paper discusses some engineering tools that are used in the XLE grammar development platform to allow for domain specialization. It provides examples of techniques used to allow specialization via overlay grammars at the level of tokenization, morphology, syntax, the lexicon, and semantics. As an example, the paper focuses on the use of the broad coverage, general purpose ParGram English grammar and semantics in the context of an Intelligent Document Security Solutions (IDSS) system. Within this system, the grammar is used to automatically identify sensitive entities and relations among entities, which can then be redacted to protect the content.
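The XLE/ParGram pipeline itself is not something we can reproduce here, but the redaction step it enables can be approximated with an off-the-shelf named entity recognizer. The following sketch (our own, using spaCy as a stand-in for the grammar-based analysis; it assumes the en_core_web_sm model is installed) blacks out spans whose entity type is treated as sensitive:

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    SENSITIVE = {"PERSON", "ORG", "GPE", "DATE"}  # types treated as sensitive here

    def redact(text, nlp):
        # Replace spans recognized as sensitive entity types with a marker.
        doc = nlp(text)
        out, last = [], 0
        for ent in doc.ents:
            if ent.label_ in SENSITIVE:
                out.append(text[last:ent.start_char])
                out.append("[REDACTED]")
                last = ent.end_char
        out.append(text[last:])
        return "".join(out)

    nlp = spacy.load("en_core_web_sm")
    print(redact("Alice Smith met the auditors in Geneva on 4 May.", nlp))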
Conference Paper
Full-text available
An intelligence analyst often needs to keep track of more facts than can be held in human memory. As a result, analysts use a notebook or evidence file to record facts learned so far. In practice, the evidence file is often an electronic document into which text snippets and hand-typed notes are placed. While this kind of evidence file is easy to read and edit, it provides little help for making sense of the captured information. We describe Entity Workspace, a tool designed to be used in place of a traditional evidence file. Entity Workspace combines user interface and entity extraction technologies to build up an explicit model of important entities (people, places, organizations, phone numbers, etc.) and their relationships. Using this model, it helps the analyst find and re-find facts rapidly, notice connections between entities, and identify good documents and entities to explore next.
Conference Paper
Full-text available
Software tools, including Web browsers, e-books, electronic document formats, search engines, and digital libraries are changing the way people read, making it easier for them to find and view documents. However, while these tools provide significant help with short-term reading projects involving small numbers of documents, they provide less help with longer-term reading projects, in which a topic is to be understood in depth by reading many documents. For such projects, readers must find and manage many documents and citations, remember what has been read, and prioritize what to read next. We describe three integrated software tools that facilitate in-depth reading. The first tool extracts citation information from documents. The second finds on-line documents from their citations. The last is a document corpus browser that uses a zoomable user interface to show a corpus at multiple granularities while supporting reading tasks that take days, weeks, or longer. We describe these tools and the design principles that motivated them.
Article
Multilevel relational database systems store information at different security classifications. An inference problem exists if it is possible for a user with a low-level clearance to draw conclusions about information at higher classifications. We are developing DISSECT, a tool for analyzing multilevel relational database schemas to assist in the detection and elimination of inference problems. A translation is defined from schemas to an equivalent graph representation, which can be presented graphically in DISSECT. The initial focus is on detection of inference problems that depend only on information all of which is stored in the database. In particular, we identify as potential inference problems different sequences of foreign key relationships that connect the same entities. Inferences can be blocked by upgrading the security classification of some of the foreign key relationships. We suggest a global optimization approach to upgrading that blocks a set of inference problems, allows upgrade costs to be considered, and supports security categories as well as levels.
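The core detection idea, finding more than one chain of foreign key relationships between the same pair of entities, can be sketched with plain graph search (our own simplification, not DISSECT's actual algorithm; the schema below is hypothetical):

    from itertools import combinations

    def simple_paths(graph, src, dst, path=None):
        # Enumerate all simple paths from src to dst in a graph given
        # as {node: set_of_neighbours}.
        path = (path or []) + [src]
        if src == dst:
            yield path
            return
        for nxt in graph[src]:
            if nxt not in path:
                yield from simple_paths(graph, nxt, dst, path)

    # Hypothetical schema graph: edges are foreign key relationships.
    fk_graph = {
        "Employee":   {"Department", "Project"},
        "Department": {"Employee", "Project"},
        "Project":    {"Employee", "Department"},
    }

    for a, b in combinations(fk_graph, 2):
        paths = list(simple_paths(fk_graph, a, b))
        if len(paths) > 1:   # multiple FK chains link the same entities
            print(a, "<->", b, ":", paths)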
Article
Newly published data, when combined with existing public knowledge, allows for complex and sometimes unintended inferences. We propose semi-automated tools for detecting these inferences prior to releasing data. Our tools give data owners a fuller understanding of the implications of releasing data and help them adjust the amount of data they release to avoid unwanted inferences. Our tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much of relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review. We call this new technology Web-based inference control. The paper reports on two experiments which demonstrate early successes of this technology. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.
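A rough sketch of this keyword-expansion loop (our own simplification; `search` stands in for a real search-engine API and is hypothetical):

    from itertools import combinations

    def detect_inferences(private_keywords, search, threshold=0.5):
        # private_keywords: set of salient terms extracted from the private data.
        # search(query): hypothetical search-engine stand-in returning, for each
        # matching document, that document's set of keywords.
        flagged = set()
        for pair in combinations(sorted(private_keywords), 2):
            hits = list(search(" ".join(pair)))
            counts = {}
            for doc_keywords in hits:
                for kw in set(doc_keywords) - private_keywords:
                    counts[kw] = counts.get(kw, 0) + 1
            # Terms co-occurring with this keyword pair in enough of the
            # returned documents are candidates flagged for manual review.
            flagged.update(kw for kw, n in counts.items()
                           if n / len(hits) >= threshold)
        return flagged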
Conference Paper
Detecting inferences in documents is critical for ensuring privacy when sharing information. In this paper, we propose a refined and practical model of inference detection using a reference corpus. Our model is inspired by association rule mining: inferences are based on word co-occurrences. Using the model and taking the Web as the reference corpus, we can find inferences and measure their strength through web-mining algorithms that leverage search engines such as Google or Yahoo!. Our model also includes the important case of private corpora, to model inference detection in enterprise settings in which there is a large private document repository. We find inferences in private corpora by using analogues of our Web-mining algorithms, relying on an index for the corpus rather than a Web search engine. We present results from two experiments. The first experiment demonstrates the performance of our techniques in identifying all the keywords that allow for inference of a particular topic (e.g. "HIV") with confidence above a certain threshold. The second experiment uses the public Enron e-mail dataset. We postulate a sensitive topic and use the Enron corpus and the Web together to find inferences for the topic. These experiments demonstrate that our techniques are practical, and that our model of inference based on word co-occurrence is well-suited to efficient inference detection.
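The co-occurrence-based strength measure can be sketched as an association-rule confidence computed from corpus hit counts (our own formulation; `hit_count` is a hypothetical stand-in for a search engine or corpus index, and the counts below are toy numbers):

    def inference_strength(a, b, hit_count):
        # Association-rule-style confidence that term a suggests term b,
        # estimated from corpus hit counts: conf(a -> b) = N(a AND b) / N(a).
        both = hit_count(f'"{a}" "{b}"')
        alone = hit_count(f'"{a}"')
        return both / alone if alone else 0.0

    corpus_counts = {'"haart"': 120, '"haart" "HIV"': 90}         # toy numbers
    print(inference_strength("haart", "HIV",
                             lambda q: corpus_counts.get(q, 0)))  # 0.75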
Conference Paper
Retyping text phrases can be time consuming. As a result, techniques for copying text from one software application to another, such as copy-and-paste and drag-and-drop, are now commonplace. However, even these techniques can be too slow in situations where many phrases need to be copied. In the special case where the phrases to be copied represent syntactically identifiable entities, such as person names, company names, telephone numbers, or street addresses, much faster phrase copying is possible. We describe entity quick click, an approach that reduces both the amount of cursor travel and the number of button presses needed to copy a phrase.
Conference Paper
According to a famous study (10) of the 1990 census data, 87% of the US population can be uniquely identified by gender, ZIP code and full date of birth. This short paper revisits the uniqueness of simple demographics in the US population based on the most recent census data (the 2000 census). We offer a detailed, comprehensive and up-to-date picture of the threat to privacy posed by the disclosure of simple demographic information. Our results generally agree with the findings of (10), although we find that disclosing one's gender, ZIP code and full date of birth allows for unique identification of fewer individuals (63% of the US population) than reported in (10). We hope that our study will be a useful reference for privacy researchers who need simple estimates of the comparative threat of disclosing various demographic data.
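A back-of-the-envelope version of this uniqueness estimate can be coded directly (our own sketch, assuming birthdates uniform over 80 years and two gender values, which is far cruder than the paper's use of actual census counts):

    def expected_unique_fraction(zip_populations, cells=2 * 365 * 80):
        # Probability that a person shares no (gender, birthdate) cell with
        # any of the other n-1 residents of their ZIP code, assuming the
        # n residents fall uniformly into `cells` combinations.
        total = unique = 0.0
        for n in zip_populations:
            unique += n * (1 - 1 / cells) ** (n - 1)
            total += n
        return unique / total

    # e.g. a population living in ZIP codes of ~10,000 residents each:
    print(round(expected_unique_fraction([10_000] * 100), 2))   # ~0.84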
Conference Paper
When a large number of citizens are at risk to identity theft, national security and economic prosperity are threatened. This work shows that thousands of Americans are at such risk, and introduces technology, named "Identity Angel," to help. Identity Angel's goal is to crawl through information available on the World Wide Web ("Web") and notify people for whom information, freely available on the Web, can be combined sufficiently to impersonate them in financial or credentialing transactions. This is an ambitious goal due to the various types of available information and the many inferences that relate disparate pieces of data. Therefore, the work presented herein focuses specifically on acquiring information sufficient to fraudulently acquire a new credit card using on-line resumes. An imposter needs to learn the {name, Social Security Number, address, date of birth} of a subject. Results show how resumes containing needed information can automatically be found and values harvested, and how many subjects removed such information from the Web once notified.
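The harvesting step can be approximated with simple pattern matching (our own defensive sketch; the regular expressions are illustrative, not Identity Angel's actual patterns):

    import re

    # Illustrative patterns for two of the attributes an imposter needs.
    SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    DOB_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

    def at_risk(resume_text):
        # Flag a resume exposing both an SSN and a date of birth.
        return bool(SSN_RE.search(resume_text)) and bool(DOB_RE.search(resume_text))

    print(at_risk("Jane Doe, SSN 123-45-6789, born 7/21/1980"))   # True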
Conference Paper
Protecting identity in the Internet age requires the ability to go beyond the identification of explicitly identifying information like social security numbers, to also find the broadly-held attributes that, when taken together, are identifying. We present a system that can work in conjunction with natural language processing algorithms or user-generated tags, to protect identifying attributes in text. The system uses a new attribute-based encryption protocol to control access to such identifying attributes and thus protects identity. The system supports the definition of user access rights based on role or identity. We extend the existing model of attribute-based encryption to support threshold access rights and provide a heuristic instantiation of revocation.
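A threshold access right itself is easy to state, even though enforcing it cryptographically is the hard part the paper addresses. A toy sketch of the access decision only (our own; this is not the paper's encryption protocol, and the attribute names are hypothetical):

    def satisfies_threshold_policy(user_attributes, policy_attributes, t):
        # Access is granted when the user holds at least t of the
        # attributes named in the policy. Real attribute-based encryption
        # enforces this cryptographically; this models only the decision.
        return len(set(user_attributes) & set(policy_attributes)) >= t

    print(satisfies_threshold_policy(
        {"analyst", "clearance:secret"},
        {"analyst", "clearance:secret", "project:x"},
        t=2))   # True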
Conference Paper
For privacy reasons, sensitive content may be revised before it is released. The revision often consists of redaction, that is, the "blacking out" of sensitive words and phrases. Redaction has the side effect of reducing the utility of the content, often so much that the content is no longer useful. Consequently, government agencies and others are increasingly exploring the revision of sensitive content as an alternative to redaction that preserves more content utility. We call this practice sanitization. In a sanitized document, names might be replaced with pseudonyms and sensitive attributes might be replaced with hypernyms. Sanitization adds to redaction the challenge of determining which words and phrases reduce the sensitivity of content. We have designed and developed a tool to assist users in sanitizing sensitive content. Our tool leverages the Web to automatically identify sensitive words and phrases and quickly evaluates revisions for sensitivity. The tool, however, does not identify all sensitive terms and mistakenly marks some innocuous terms as sensitive. This is unavoidable because of the difficulty of the underlying inference problem and is the main reason we have designed a sanitization assistant as opposed to a fully-automated tool. We have conducted a small study of our tool in which users sanitize biographies of celebrities to hide the celebrity's identity, both with and without our tool. The user study suggests that while the tool is very valuable in encouraging users to preserve content utility and can preserve privacy, its usefulness and apparent authoritativeness may lead to a "slippery slope" in which users neglect their own judgment in favor of the tool's.
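A minimal sketch of the pseudonym/hypernym substitutions described above (our own illustration with a hand-made table; the paper's tool derives and evaluates candidates using the Web):

    # Hand-made substitution tables (hypothetical; a real assistant would
    # propose hypernyms and pseudonyms and check them for sensitivity).
    PSEUDONYMS = {"Alice Smith": "Person A"}
    GENERALIZATIONS = {"leukemia": "an illness", "Geneva": "a European city"}

    def sanitize_text(text):
        # Replace names with pseudonyms, then sensitive terms with hypernyms.
        for name, alias in PSEUDONYMS.items():
            text = text.replace(name, alias)
        for term, hypernym in GENERALIZATIONS.items():
            text = text.replace(term, hypernym)
        return text

    print(sanitize_text("Alice Smith was treated for leukemia in Geneva."))
    # -> 'Person A was treated for an illness in a European city.'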
D. Shea, "Military Gaffe Results in Classified Data Leak," Planet PDF, 6 May 2005; www.planetpdf.com/enterprise/article.asp?ContentID=7049.
A. Kaplan, "A New Generation of Redacting Tools," Nat'l Law J., 14 Nov. 2002; www.law.com/jsp/article.jsp?id=1036630382605.
P. Golle, "Revisiting the Uniqueness of Simple Demographics in the US Population," Workshop on Privacy in the Electronic Society, ACM Press, 2006; http://crypto.stanford.edu/~pgolle/papers/census.pdf.
J. Markoff, "Illuminating Blacked-Out Words," The New York Times, 10 May 2004; www.nytimes.com/2004/05/10/technology/10crypto.html.