Article

Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization

Authors:
Paul Ohm

Abstract

Computer scientists have recently undermined our faith in the privacy-protecting power of anonymization, the name for techniques for protecting the privacy of individuals in large databases by deleting information like names and social security numbers. These scientists have demonstrated they can often 'reidentify' or 'deanonymize' individuals hidden in anonymized data with astonishing ease. By understanding this research, we will realize we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed. This mistake pervades nearly every information privacy law, regulation, and debate, yet regulators and legal scholars have paid it scant attention. We must respond to the surprising failure of anonymization, and this Article provides the tools to do so.


... Often referred to as quasi-identifiers, IP addresses and computer host or account names do not directly identify a specific individual, but aggregate analysis and linking with other data can lead to identification (Garfinkel, 2015; Ohm, 2010; Sweeney, 1997). Although Sweeney and Ohm were primarily focused on personal health care records, GDPR Articles 4 and 11 require that electronic identifiers like account names or IP addresses within datasets available to processors must not reidentify individuals (GDPR, 2018a; Hintze, 2018). ...
... The literature confirms that balancing privacy with utility (Cavoukian & Castro, 2014; Hintze, 2018; Ohm, 2010; Sweeney, 1997) will continue to factor into data-sharing conversations. European privacy law has been attempting to regulate this balance in relation to online electronic identities, similar to the record data analyzed in this project, placing limits on reidentification protection measures, including measures that destroy the intended function of the data (Article 29 Working Party, 2018). ...
... Acceptable privacy levels are achieved through data sharing agreements and software controls, but the authors acknowledge the solution's complexity and inability to share between different trust levels. Ohm (2010) summarized this tension a decade earlier as "Utility and privacy are, at bottom, two goals at war with one another. In order to be useful, anonymized data must be imperfectly anonymous". ...
Thesis
Full-text available
This thesis explores a programmatic method for deidentifying organizational cyber event data into functional training sets suitable for cyber security research and skills development. A 4-phase methodology was developed to support repeatable extraction, deidentification, and replacement of sensitive values such as IP addresses, computer, domain, and account names. A proof-of-concept was implemented via software modules integrated into a customized Logstash pipeline, deidentifying data inline to create replica logfiles. Record formats and content variance necessitated multiple parsing approaches enforced by conditional software logic, because ad-hoc search and replace proved inadequate for privacy preservation or reliably formatted output. The second finding reaffirmed the privacy-utility continuum, i.e., reidentification risk should be weighed against the opportunity costs of not supporting cyber security and observability research data needs. Testing in a corporate network lab replica generating more than 100,000 events per day confirmed Windows, Linux, and firewall event data deidentification at 130 to 200+ events per second, with cyber intrusion data successfully identified in an Elasticsearch analytics platform. Next steps are cyber specialist field testing to determine the viability of threat hunting and detection engineering using deidentified data.
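To make the excerpted approach concrete, here is a minimal Python sketch of inline log de-identification: IPv4 addresses and account names are replaced with consistent keyed-hash pseudonyms so records remain linkable without exposing the original values. The regexes, the `user=` field name, and the secret are illustrative assumptions; the thesis itself implements this in a customized Logstash pipeline rather than Python.

```python
# Minimal sketch of inline log de-identification: replace IPv4 addresses and
# account names with consistent pseudonyms so records stay linkable but are
# no longer directly identifying. Illustrative only, not the thesis's pipeline.
import hashlib
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
ACCOUNT = re.compile(r"\buser=(\w+)")  # hypothetical field name for illustration

def pseudonym(value: str, prefix: str, secret: str = "rotate-me") -> str:
    # Keyed hash -> stable replacement token; same input always maps to same token.
    digest = hashlib.sha256((secret + value).encode()).hexdigest()[:8]
    return f"{prefix}-{digest}"

def deidentify(line: str) -> str:
    line = IPV4.sub(lambda m: pseudonym(m.group(0), "ip"), line)
    line = ACCOUNT.sub(lambda m: "user=" + pseudonym(m.group(1), "acct"), line)
    return line

print(deidentify("2024-01-02 login ok user=jsmith src=10.1.2.3"))
```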
... AI integration into cybersecurity increases privacy risks, especially for data leakage and unauthorized access during AI-based threat surveillance. Structurally, information-driven AI systems must work with large data sets, often containing private user data (Ohm, 2010). Collection and processing of such information increase the likelihood of releasing PII and of infringing individuals' privacy. ...
... Paul Ohm (2010) analyzes the "surprising failure of anonymization," demonstrating that once data is disclosed, it can be linked to other information sets. This re-identification is a significant risk where AI programs for cybersecurity process extensive data (Ohm, 2010). ...
... These privacy issues make it imperative that significant data protection mechanisms be implemented for AI-operated cybersecurity. This is why measures such as access control, encryption, and other important privacy-protective measures should be implemented to prevent risks connected with data leakage and unauthorized access (Ohm, 2010). Still, a key objective of recent advancements has been to resolve the trade-off between fast, efficient threat detection and proper protection of users' private information. ...
Article
This paper explores the case for privacy-preserving artificial intelligence in cybersecurity by analyzing the importance of effective threat intelligence in countering potential intrusions while upholding high user data protection standards. With the growing sophistication of cyber threats, AI is crucial in fortifying detection, reaction, and prevention measures for cyber threats in CSFs. However, the large-scale data feeding these systems raises many privacy issues, and hence strong privacy preservation mechanisms that ensure user anonymity and protect the information from misuse are needed. This study reveals how AI threat detection accuracy can be preserved while protecting users' privacy through data obfuscation, differential privacy, and federated learning. Furthermore, the article highlights the need to apply privacy-enhancing patterns, including Privacy by Design, as new patterns in cybersecurity lifecycles. The recommendations derived here are intended to help researchers and practitioners achieve strong data protection and threat intelligence efficiency alike when employing AI models. This approach promotes a secure and privacy-sensitive environment for disseminating AI-assisted cybersecurity innovations.
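As an illustration of one of the techniques listed above, the following is a minimal federated-averaging sketch: clients fit a toy linear model locally and share only weights, which a central server averages, so raw data never leaves the clients. The model, data, learning rate, and round counts are invented for the example and are not from the cited article.

```python
# Minimal sketch of federated averaging: local training, server-side weight
# averaging. Toy linear-regression clients; illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, steps=20):
    w = weights.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # squared-error gradient
        w -= lr * grad
    return w

true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(4):  # four clients with private local datasets
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(0, 0.1, 50)
    clients.append((X, y))

global_w = np.zeros(3)
for _ in range(10):  # communication rounds
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_ws, axis=0)  # raw data never leaves the clients
print(np.round(global_w, 2))
```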
... Examples of types of disclosure risks and privacy threats include membership inference, attribute inference, and reconstruction attacks. In addition, financial regulators' data disclosure is not exclusive to trusted parties, so there is an additional burden in which it must be assumed that the information might be used for exploitation (Ohm, 2010). ...
... Historically, most data sanitation methods revolved around subtractive techniques where sensitive variables and identifiers are masked (or suppressed) to maintain privacy. For instance, release-and-forget anonymisation includes a collection of anonymisation techniques that are applied before a public data release (Ohm, 2010). ...
... Notable cases include user behaviour data in massive open online courses, information leakage in the Netflix Prize competition dataset using Amazon review data, and the identification of individuals using AOL's published web search query logs (Ohm, 2010). ...
Preprint
Full-text available
Financial regulators such as central banks collect vast amounts of data, but access to the resulting fine-grained banking microdata is severely restricted by banking secrecy laws. Recent developments have resulted in mechanisms that generate faithful synthetic data, but current evaluation frameworks lack a focus on the specific challenges of banking institutions and microdata. We develop a framework that considers the utility and privacy requirements of regulators, and apply this to financial usage indices, term deposit yield curves, and credit card transition matrices. Using the Central Bank of Paraguay's data, we provide the first implementation of synthetic banking microdata using a central bank's collected information, with the resulting synthetic datasets for all three domain applications being publicly available and featuring information not yet released in statistical disclosure. We find that applications less susceptible to post-processing information loss, which are based on frequency tables, are particularly suited for this approach, and that marginal-based inference mechanisms outperform generative adversarial network models for these applications. Our results demonstrate that synthetic data generation is a promising privacy-enhancing technology for financial regulators seeking to complement their statistical disclosure, while highlighting the crucial role of evaluating such endeavors in terms of utility and privacy requirements.
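For readers unfamiliar with marginal-based mechanisms, a toy sketch of frequency-table synthesis follows: Laplace noise is added to a small contingency table and synthetic records are sampled from the noisy, renormalized table. The attribute values and epsilon are hypothetical and do not reflect the paper's evaluation framework or the Central Bank of Paraguay data.

```python
# Toy sketch of frequency-table (marginal-based) synthesis: noise the counts,
# renormalize, then sample synthetic records from the noisy distribution.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
records = [("retail", "low"), ("retail", "high"), ("agro", "low"),
           ("agro", "low"), ("finance", "high")] * 20  # invented microdata

cells = sorted(set(records))
counts = np.array([Counter(records)[c] for c in cells], dtype=float)

epsilon = 1.0  # illustrative privacy budget
noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
probs = np.clip(noisy, 0, None)
probs = probs / probs.sum()

synthetic = [cells[i] for i in rng.choice(len(cells), size=100, p=probs)]
print(Counter(synthetic))
```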
... The potential of data to benefit society has led to calls for its broader access and sharing (4,5). On the other hand, the collection and sharing of personal data at scale could jeopardize the fundamental right to privacy, opening the way to concrete privacy harms to individuals and society, such as mass surveillance, increased risks of blackmail, harassment, or identity theft (6,7). ...
... Moreover, we focus mainly on the works published within the field of computer science since the late 1990s, although we occasionally mention articles outside this scope when necessary. This choice allows us to focus on current technical challenges and avenues for future research, as the past 20 years have seen a shift in the language and tools used to study anonymization (7,22,23). ...
... These results, together with the empirical studies showing the vulnerability of high-dimensional pseudonymous data to linkage attacks, strongly suggest that it is generally hard to achieve a good privacy-utility trade-off with de-identification for high-dimensional data. Most researchers no longer consider de-identification a valid or promising approach to anonymization in practice (7, 22, 94-97). ...
Article
Full-text available
Information about us, our actions, and our preferences is created at scale through surveys or scientific studies or as a result of our interaction with digital devices such as smartphones and fitness trackers. The ability to safely share and analyze such data is key for scientific and societal progress. Anonymization is considered by scientists and policy-makers as one of the main ways to share data while minimizing privacy risks. In this review, we offer a pragmatic perspective on the modern literature on privacy attacks and anonymization techniques. We discuss traditional de-identification techniques and their strong limitations in the age of big data. We then turn our attention to modern approaches to share anonymous aggregate data, such as data query systems, synthetic data, and differential privacy. We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today.
... The economic potential of open data is enormous, but challenges related to sustainability and standards require constant attention. Balancing openness and security is critical for preserving the benefits of open data [15-18]. ...
... Case studies and best practices -Success on open data platforms requires the efficient adoption of best practices and case study findings [16]. Practical experience analysis is critical for providing valuable insights on efforts to enhance open data platforms. ...
... Practical experience analysis is critical for providing valuable insights on efforts to enhance open data platforms. For data security, case studies emphasize the crucial need for safe access restrictions, strong user authentication, and frequent security assessments [16]. Successful open data initiatives with robust cybersecurity measures safeguard proprietary data and consumers in the commercial sector, emphasizing continuous threat assessment and proactive incident response [21]. ...
Article
Full-text available
Cybersecurity is critical for protecting open data. Transparency and innovation are facilitated by open data platforms; however, concerns about cybersecurity and privacy persist. This study examines the role of cybersecurity in public institutions in the Republic of Kosovo to determine methods of safeguarding data integrity. The main aim of this study was to examine the role of cybersecurity in securing open data in public organizations in the Republic of Kosovo. The study aimed to identify optimal cybersecurity practices in the context of open data and provide a comprehensive overview of the implementation of cybersecurity measures. This study employed a structured and methodical approach to assess cybersecu-rity and the effectiveness of open data platforms in public organizations in the Republic of Kosovo. Results: The study provides an overview of the status of open data platforms in the Republic of Kosovo and highlights the importance of cybersecurity, data privacy, and data integrity. Despite the stated concerns, such as enhancing security measures and increasing user knowledge, it is evident that public institutions have made significant progress in securing and enhancing their open data platforms. It is suggested that institutions in the Republic of Kosovo continue to invest in cybersecurity, promote privacy protection measures, and focus on enhancing the quality of open data to develop in this sector. Furthermore, collaboration and coordination across institutions and government agencies are required to enhance the efficiency and effectiveness of these platforms.
... (1) De-anonymization Threats [110,193,204,263,269]; (2) Inference Attacks [29,48,87,149,384]; (3) Poisoning Attacks [106,156,248,325,443]. Protection Techniques: (1) Anonymization Techniques [63,110,204,211,243,263,338]; (2) Perturbation Techniques [44,49,172,262,274,388]; (3) Encryption Techniques [19,35,265,286,384]; (4) Adversarial Techniques [16,25,54,152,171]. ... significant societal risks [88,138,261]. These risks encompass the spread of toxic content [128], reinforcement of discriminatory biases [148], and the proliferation of misinformation and privacy violations [225]. ...
... • De-anonymization. De-anonymization involves the re-identification of anonymized user data through correlation with external information sources [193,269]. Even when Personally Identifiable Information (PII) is removed, user identities may be exposed through cross-referencing external data sources or inferring missing attributes. ...
Preprint
Full-text available
In recent years, the integration of Large Language Models (LLMs) and Recommender Systems (RS) has ushered in a new era of personalized and intelligent user experiences. This survey provides a comprehensive overview of the current landscape and future directions of the collaboration between LLM-based AI agents (LLM Agent) and recommender systems. We begin with an introduction to the foundational knowledge, exploring the components of LLM agents and the applications of LLMs in recommender systems. The survey then delves into the symbiotic relationship between LLM agents and recommender systems, illustrating how LLM agents enhance recommender systems and how recommender systems support better LLM agents. Specifically, we begin by discussing the overall architectures for designing LLM agents for recommendation, encompassing profile, memory, planning, and action components, along with multi-agent collaboration. Conversely, we also explore how recommender systems enhance LLM agents, focusing on areas such as memory recommendation, plan recommendation, tool recommendation, agent recommendation, and personalized LLMs and LLM agents. A critical analysis of trustworthy AI agents and recommender systems follows, addressing key issues of safety, explainability, fairness, and privacy. Finally, we discuss potential future research directions, highlighting emerging trends and opportunities in the intersection of AI agents and recommender systems. This survey concludes by summarizing the current research and outlining promising avenues for future exploration in this dynamic and rapidly evolving field. We have collected relevant papers to a GitHub repository for this survey: https://github.
... As big data analytics becomes increasingly central to our digital economy and society, getting privacy right is both an ethical imperative and a source of competitive advantage [15]. Through a commitment to privacy-by-design principles and the judicious application of techniques covered in this paper, forward-looking organizations can harness the full power of big data without putting individual rights at risk [16]. ...
... However, it is not an infallible solution. Numerous studies have demonstrated techniques for re-identifying individuals in anonymized datasets by cross-referencing them with auxiliary information [16]. In one prominent example, researchers were able to uniquely pinpoint 87% of individuals in an anonymized credit card transaction database simply by knowing a handful of their purchases [17]. ...
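The kind of unicity result described above can be illustrated with a back-of-the-envelope Python sketch: on synthetic (merchant, day) transaction points, it measures what fraction of individuals is uniquely pinned down by knowing a few of their own points. The data generator and parameters are invented; this does not reproduce the cited credit-card study.

```python
# Back-of-envelope unicity check: what fraction of individuals is uniquely
# identified by knowing k of their transaction points? Synthetic toy data.
import random

random.seed(1)
people = {f"p{i}": {(random.randrange(20), random.randrange(30)) for _ in range(12)}
          for i in range(500)}  # (merchant_id, day) pairs per person

def unique_fraction(k: int) -> float:
    unique = 0
    for pid, points in people.items():
        known = set(random.sample(sorted(points), k))   # adversary's side knowledge
        matches = [q for q, pts in people.items() if known <= pts]
        unique += (matches == [pid])
    return unique / len(people)

for k in (1, 2, 4):
    print(k, unique_fraction(k))
```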
... Moreover, key datasets related to healthcare pathways require detailed information for effective linking. Yet, modern technology enables the identification of specific individuals through statistical inference by combining various datasets [9]. This dilemma poses a challenge: while detailed datasets increase their value, they also heighten the risk of compromising their privacy [10]. ...
... Additionally, covered entities may use or disclose a limited dataset of PHI for research purposes without individual authorization, provided that the recipient signs a data usage agreement and commits to implementing specific safeguard measures for the protected health information in the limited dataset. However, it is important to note that if HIPAA permits disclosure of data without individual consent, but state law mandates individual consent, then state law prevails. Therefore, determining the legal protection requirements for personal health data involving multiple states presents a challenge for the country. ...
Article
Full-text available
This study delves into the four characteristics of data accessibility across seven APEC countries, identifying key regulatory disparities in utilizing health-related data for medical research, including the potential reuse of health-related data, inconsistencies in anonymization or de-identification standards, and varied legal protection requirements based on the data’s origin. To foster cross-jurisdictional sharing of personal health data and fortify privacy security in China, this paper proposes the following strategies: implementing specialized legislation concerning health-related data, developing a Chinese version of the De-Identification Decision-Making Framework, leveraging advanced cryptographic privacy-preserving techniques in biomedical research, and employing data transfer and usage agreements.
... Second, while data reconstruction attacks focus on identifying individuals with a certain probability, such as "John Doe can be re-identified with probability x", our research estimates anonymity within a group, such as "a person with these attributes has an anonymity set of 5". Data reconstruction attacks have been studied extensively [25,28,32]. Recently, the work by Rocher et al. [31] demonstrates the application of machine learning to these attacks. ...
... Considering the ten most populous counties in the US, young women (25-30) usually constitute the biggest anonymity sets, which is positive in light of our case study. In Miami-Dade, however, this is not the case; instead, women between 55 and 60 make up the largest CAS, exposing young women to a greater risk of deanonymization. ...
Article
Full-text available
The realm of digital health is experiencing a global surge, with mobile applications extending their reach into various facets of daily life. From tracking daily eating habits and vital functions to monitoring sleep patterns and even the menstrual cycle, these apps have become ubiquitous in their pursuit of comprehensive health insights. Many of these apps collect sensitive data and promise users to protect their privacy - often through pseudonymization. We analyze the real anonymity that users can expect by this approach and report on our findings. More concretely: We introduce the notion of conditional anonymity sets derived from statistical properties of the population; We measure anonymity sets for two real-world applications and present overarching findings from 39 countries; We develop a graphical tool for people to explore their own anonymity set. One of our case studies is a popular app for tracking the menstruation cycle. Our findings for this app show that, despite their promise to protect privacy, the collected data can be used to identify users up to groups of 5 people in 97% of all the US counties, allowing the de-anonymization of the individuals. Given that the US Supreme Court recently overturned abortion rights, the possibility of determining individuals is a calamity.
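A minimal sketch of the anonymity-set idea used in this study: group a population table by the quasi-identifiers an app collects and report the size of each resulting group. The rows below are toy placeholders; the paper derives its conditional anonymity sets from real population statistics across 39 countries.

```python
# Sketch of anonymity-set sizes: group a population by the quasi-identifiers
# an app collects; each group size is the anonymity set for those attributes.
from collections import Counter

population = [  # invented census-like rows for illustration
    {"county": "A", "sex": "F", "age_band": "25-30"},
    {"county": "A", "sex": "F", "age_band": "25-30"},
    {"county": "A", "sex": "F", "age_band": "55-60"},
    {"county": "A", "sex": "M", "age_band": "25-30"},
    {"county": "B", "sex": "F", "age_band": "25-30"},
]

quasi_identifiers = ("county", "sex", "age_band")
sets = Counter(tuple(row[q] for q in quasi_identifiers) for row in population)

for combo, size in sets.items():
    print(combo, "anonymity set size =", size)
```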
... New techniques, such as differential privacy, are emerging to protect individual data while allowing for meaningful analysis (Abadi et al., 2016). Researchers must also consider the implications of re-identification risks when working with large datasets (Ohm, 2010). Research involving artificial intelligence (AI) and machine learning (ML) often raises unique ethical considerations (Ohm, 2010). ...
... Researchers must also consider the implications of re-identification risks when working with large datasets (Ohm, 2010). Research involving artificial intelligence (AI) and machine learning (ML) often raises unique ethical considerations (Ohm, 2010). For instance, researchers must consider the potential biases in algorithms and how these biases might impact individuals or groups (Suresh & Guttag, 2019). ...
Book
Full-text available
The ability to conduct rigorous research, articulate findings effectively, and navigate the complexities of publication are foundational skills for scholars across disciplines. In an era where information proliferates at an unprecedented rate, mastering these skills is more critical than ever. This book, Basics of Research Methodology, Writing, and Publication, aims to serve as a comprehensive guide for students, early-career researchers, and professionals who seek to enhance their research capabilities and communication proficiency. The journey of writing this book was inspired by the recognition of common challenges faced by many in the research community. Whether it is formulating a research question, choosing the appropriate methodology, analyzing data, or crafting a manuscript for publication, each step of the research process presents unique obstacles. Our goal is to demystify these processes and provide practical, actionable advice that can be applied across various fields of study. In the initial chapters, we delve into the fundamentals of research methodology. Here, readers will find detailed explanations of qualitative, quantitative, and mixed-method approaches, along with guidance on designing robust studies and avoiding common pitfalls. By grounding our discussion in real-world examples, we aim to bridge the gap between theory and practice. The middle section of the book focuses on writing. Effective communication is an art as much as it is a science. We explore strategies for clear and compelling writing, structuring a research paper, and adhering to the conventions of academic writing. Special emphasis is placed on the revision process, highlighting the importance of drafting, feedback, and editing in producing high-quality work. Finally, the book addresses the intricacies of publication. Understanding the landscape of academic publishing, selecting the right journal, navigating the peer review process, and responding to reviewers’ comments are critical skills for any researcher. We provide insights into these areas, along with tips for avoiding common ethical issues such as plagiarism and authorship disputes. Throughout this book, we have strived to present information in a clear, accessible manner. Each chapter includes practical tips, illustrative examples, and exercises designed to reinforce key concepts. Whether you are embarking on your first research project or seeking to refine your existing skills, we hope this book will be a valuable resource in your academic journey. We would like to extend our gratitude to the numerous colleagues, mentors, and students who have contributed to the development of this book. Their insights and feedback have been invaluable in shaping its content. It is our sincere hope that this book will inspire and empower you to conduct meaningful research, communicate your findings effectively, and contribute to the advancement of knowledge in your field.
... Conventionally, the most direct approach to protecting data subjects from re-identification has been to anonymize the data by removing or obscuring the parts containing personal information (Ohm, 2009). This process aims to make it possible to release the data to the public. ...
... Security Measures to Ensure Authorized Access. Restricting access to the FinDARC corpus as described in this paper is in line with the current literature on personal data sharing (Elliot et al., 2018, 2020; Ohm, 2009; Rubinstein & Hartzog, 2016; Stalla-Bourdillon & Knight, 2016), which adheres to the FAIR principles while acknowledging the limitations of data anonymization/reduction and encouraging the use of user group limitations. ...
Conference Paper
Full-text available
We discuss the archiving procedure of a corpus comprising posts submitted to Torilauta, a Finnish dark web marketplace website. The site was active from 2017 to 2021 and during this time one of the most prominent online illegal narcotics markets in Finland. A reduced version of the corpus, Finnish Dark Web Marketplace Corpus (FINDarC), has been archived in the Language Bank of Finland. In the current work, we focus on the protective measures for storing the data and how researchers can apply for access rights to the corpus under the CLARIN RES licence.
... The LEI code field offers interoperability with company data published by GLEIF. Here we provide an example of a basic compliant company representation in RDF Turtle, including legal name, identifier, registration date, and registered address. In the following sections, we describe how, through the use of common vocabularies, the location of the company, encoded by administrative units, and the economic activity of the company can be seen as two core, hierarchically structured categorisation dimensions for companies. Finally, we show how, through guidance in metadata description, the discoverability of compliant datasets can be achieved. ...
... In such a case, there is no choice but to anonymize the data. Here, however, it is necessary to emphasize that the anonymization must be thorough, because the concept of "personal data" is interpreted very broadly and there is a risk of deanonymisation [9]. ...
Article
Full-text available
The European Commission has published a list of high-value datasets (HVDs) that public sector bodies must make available as open data as part of the Open Data Directive. One of the HVD topics is company data. Although the HVD description contains items that must be included in these datasets, it does not prescribe any technical means of how the data should be published. This is a major obstacle to the interoperability of the datasets once they are published. In this extended paper, we elaborate on the results of STIRData, a project co-financed by the Connecting Europe Facility Programme of the European Union, focusing on various aspects of data interoperability of open data from business registries, covering the company data HVDs topic. These aspects include the semantic, technical, and legal interoperability of this data. The results include a data architecture and a data specification to make the published data technically and semantically interoperable. In addition, we present basic legal interoperability guidelines to ensure legal interoperability of the published data, which is a topic often neglected by technically focused data experts. The project results include proof-of-concept transformations of data from selected European business registries using open source tools and in accordance with the data specification. Moreover, a user-orientated platform for browsing and analysing the data is presented as an example of the possibilities of using the data published in an interoperable way. Finally, we present an example of how compliant data can be processed by data experts for further analysis.
... It is a well-known problem that machine learning algorithms are very good at de-anonymizing data sets [132]. ...
Article
Full-text available
The call for XAI rests on a normative claim: ‘Good AI is explainable AI’ or even the stronger claim: ‘Only explainable AI is good AI.’ However, this valorization runs the risk of being overgeneralized because explanations are not per se useful, appropriate, or demanded. Explainability should not be seen as a value in itself but as a means to certain ends. In this paper, we put the valorization of explainability into question, which is discursively connected to the idea of ‘users’ needs’ and the will to design and develop ethically aligned AI systems. By making the instrumental character of the value of explainability explicit, we address two key issues that necessitate more theoretical attention: (i) to analyze the link between explainability and its presumed purpose; and (ii) to clarify the conceptions of these presumed purposes, namely users’ needs and ethical principles XAI is meant to promote. From a philosophical and from a psychological perspective, we constructively criticize the undertheorized and undercomplex way of talking about ‘users’ needs’ and ethical demands. We plea to carefully differentiate the value of explainable AI in social contexts and signal further need for research.
... Latanya Sweeney proved this by re-identifying the governor of Massachusetts in an "anonymized" medical database. We briefly recall her attack, as reported in Ohm (2010), which will be helpful in understanding our approach. Sweeney first leveraged the common knowledge that the governor of Massachusetts had collapsed unconscious during a public event and was taken to the Deaconess Waltham Hospital. ...
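A toy Python sketch of this style of linkage attack, under the assumption that both tables share ZIP code, birth date, and sex as quasi-identifiers; all rows are invented and the logic is a simplification of Sweeney's actual attack.

```python
# Minimal illustration of a Sweeney-style linkage attack: join an "anonymized"
# medical table with a public voter roll on shared quasi-identifiers.
medical = [  # names removed, quasi-identifiers kept; invented rows
    {"zip": "02138", "dob": "1950-07-31", "sex": "M", "diagnosis": "X"},
    {"zip": "02139", "dob": "1962-01-15", "sex": "F", "diagnosis": "Y"},
]
voter_roll = [  # public record with names; invented rows
    {"name": "A. Smith", "zip": "02138", "dob": "1950-07-31", "sex": "M"},
    {"name": "B. Jones", "zip": "02139", "dob": "1980-03-02", "sex": "F"},
]

key = lambda r: (r["zip"], r["dob"], r["sex"])
names_by_key = {key(v): v["name"] for v in voter_roll}
for rec in medical:
    name = names_by_key.get(key(rec))
    if name:
        print(f"re-identified: {name} -> diagnosis {rec['diagnosis']}")
```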
Conference Paper
This paper introduces and investigates k-unmatchability, a counterpart of k-anonymity for knowledge graphs. Like k-anonymity, k-unmatchability enhances privacy by ensuring that any individual in any external source can always be matched to either none or at least k different anonymized individuals. The tradeoff between privacy protection and information loss can be controlled with the parameter k. We analyze the data complexity of k-unmatchability under different notions of anonymization.
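The paper's k-unmatchability is defined for knowledge graphs; as a point of reference, the sketch below checks the analogous k-anonymity condition on a flat table, namely that every quasi-identifier combination occurs in at least k rows. The table and the value of k are illustrative.

```python
# Sketch of the flat-table analogue: a table is k-anonymous if every
# quasi-identifier combination appears in at least k rows.
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(c >= k for c in counts.values())

rows = [  # invented, generalized records
    {"zip": "021*", "age": "40-49", "dx": "X"},
    {"zip": "021*", "age": "40-49", "dx": "Y"},
    {"zip": "021*", "age": "50-59", "dx": "Z"},
]
print(is_k_anonymous(rows, ("zip", "age"), k=2))  # False: one group has size 1
```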
... This refers to masking some or all private information in order to make it anonymous and untraceable. Companies and data controllers rely on such techniques to maintain the privacy of their large-scale information flows, specifically because this implementation requires less expertise and fewer computational resources (though it is not always guaranteed to provide fool-proof security [23]). Despite the concerns around some failure incidents, this technique has found widespread acceptance globally at various institutional levels. ...
Preprint
Full-text available
This paper aims to cover and summarize the field of IoT and related privacy concerns through the lens of privacy by design. With the ever-increasing incorporation of technology within our daily lives and an ever-growing active research into smart devices and technologies, privacy concerns are inevitable. We intend to briefly cover the broad topic of privacy in the IoT space, the inherent challenges and risks in such systems, and a few recent techniques that intend to resolve these issues on the subdomain level and a system scale level. We then proceed to approach this situation through design thinking and privacy-by-design, given that most of the prior efforts are based on resolving privacy concerns on technical grounds with system-level design. We participated in a co-design workshop for the privacy of a content creation platform and used those findings to deploy a survey-based mechanism to tackle some key concern areas for user groups and formulate design principles for privacy that promote transparent, user-centered, and awareness-provoking privacy design.
... Additionally, there are concerns about how AI systems handle data anonymization. In some cases, even anonymized data can be re-identified when combined with other datasets, raising concerns about the privacy of individuals whose data is used for AI training and analysis (Ohm, 2010). Ensuring robust anonymization techniques and data minimization practices can help reduce the risk of re-identification. ...
... Genomic data is highly sensitive because it can reveal an individual's unique genetic makeup, including disease predispositions and other personal traits. Despite efforts to anonymize genomic data, studies have shown that it is possible to re-identify individuals by cross-referencing with other publicly available information, such as health records and social media information [61,71,48]. For example, Gymrek et al. demonstrated that surnames could be recovered from personal genomes by combining genetic data with publicly accessible genealogy databases, thus re-identifying individuals within supposedly anonymized datasets [61]. ...
Preprint
Full-text available
The public availability of genome datasets, such as The Human Genome Project (HGP), The 1000 Genomes Project, The Cancer Genome Atlas, and the International HapMap Project, has significantly advanced scientific research and medical understanding. Here our goal is to share such genomic information for downstream analysis while protecting the privacy of individuals through Differential Privacy (DP). We introduce synthetic DNA data generation based on pangenomes in combination with Pretrained Language Models (PTLMs). We introduce two novel tokenization schemes based on pangenome graphs to enhance the modeling of DNA. We evaluated these tokenization methods and compared them with classical single-nucleotide and k-mer tokenizations, finding that our tokenization schemes boost the model's performance consistently with long effective context length (covering longer sequences with the same number of tokens). Additionally, we propose a method to utilize the pangenome graph and make it comply with DP privacy standards. We assess the performance of DP training on the quality of generated sequences, with discussion of the trade-offs between privacy and model accuracy. The source code for our work will be published under a free and open source license soon.
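For context on the baseline mentioned above, a minimal k-mer tokenizer sketch follows; the pangenome-graph tokenizers proposed in the preprint are more involved and are not reproduced here.

```python
# Classical k-mer tokenization baseline: slide a window of length k over the
# sequence. Parameters are illustrative.
def kmer_tokenize(sequence: str, k: int = 3, stride: int = 3):
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ACGTACGGTA", k=3))            # non-overlapping 3-mers
print(kmer_tokenize("ACGTACGGTA", k=3, stride=1))  # overlapping 3-mers
```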
... In addition, another persistent challenge in implementing PETs is securing information while maximizing the utility of the data. Ultimately, as PETs continue to evolve and be employed in combination with other technologies, questions about their efficacy remain, especially in light of the present and emerging instruments employed for re-identification purposes (Ohm, 2010). ...
Article
Full-text available
Enabling private sector trust stands as a critical policy challenge for the success of the EU Data Governance Act and Data Act in promoting data sharing to address societal challenges. This paper attributes the widespread trust deficit to the unmanageable uncertainty that arises from businesses’ limited usage control to protect their interests in the face of unacceptable perceived risks. For example, a firm may hesitate to share its data with others in case it is leaked and falls into the hands of business competitors. To illustrate this impasse, competition, privacy, and reputational risks are introduced, respectively, in the context of three suboptimal approaches to data sharing: data marketplaces, data collaboratives, and data philanthropy. The paper proceeds by analyzing seven trust-enabling mechanisms comprised of technological, legal, and organizational elements to balance trust, risk, and control and assessing their capacity to operate in a fair, equitable, and transparent manner. Finally, the paper examines the regulatory context in the EU and the advantages and limitations of voluntary and mandatory data sharing, concluding that an approach that effectively balances the two should be pursued.
... Consequently, in a workplace, the proportion of individuals identifying as LGBTIQ+ is likely small. If data processing lacks robust security measures and careful pseudonymization or anonymization where necessary, there is heightened risk of data becoming identifiable, especially concerning intersectional minoritized identities (Ohm, 2009). Additionally, despite the option for employees to decline participation in SOGI data collection, the hierarchical power structures prevalent in workplaces pose a challenge to ensuring true voluntary participation. ...
Article
Full-text available
This article addresses the complex issue of sexual orientation and gender identity (SOGI) data collection in workplaces, highlighting the intricate balance between fostering inclusion and mitigating potential harm and exclusion. This tension manifests uniquely across diverse cultural, legal, and organizational settings. We review existing literature, offer practical guidance for decision‐makers, and outline future research avenues. While SOGI data collection in workplaces can enhance diversity, equity, and inclusion (DEI) initiatives and elevate the visibility of lesbian, gay, bisexual, transgender, intersex, and queer (LGBTIQ+) employees, challenges include the risk of discrimination, privacy concerns, and linguistic complexities. To address these, researchers and practitioners must consider the purpose, language, and cultural context of data collection, involving LGBTIQ+ stakeholders, and conducting reconnaissance studies. Future research opportunities lie in understanding employee willingness to share SOGI data, motivations of human resource (HR) and DEI professionals, and the impact on organizational culture. Reimagining LGBTIQ+ research to ease the tension between inclusion and protection, we conclude that responsible SOGI data collection demands a nuanced approach that prioritizes inclusion and equity while addressing privacy concerns and potential harm.
... Historically, different methods to anonymize data have been used to protect the privacy of individuals sharing their information. However, computer scientists have consistently shown how even anonymized data is vulnerable to privacy attacks (Sweeney, 2000; Narayanan and Shmatikov, 2008; Ohm, 2010) and that anonymization methods typically do not provide a mathematical measure of privacy loss. Since some amount of privacy loss is inevitable for any publicly available dataset (Dinur and Nissim, 2003), understanding the amount of privacy loss incurred is crucial in effectively protecting individual privacy. ...
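A minimal sketch of how differential privacy quantifies that loss, using the standard Laplace mechanism on a count query of sensitivity 1; the dataset and epsilon values are illustrative.

```python
# Laplace mechanism sketch: answer a count query (sensitivity 1) with noise
# scaled to 1/epsilon, so epsilon quantifies the privacy loss of the release.
import numpy as np

rng = np.random.default_rng(7)

def dp_count(values, predicate, epsilon):
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(scale=1.0 / epsilon)

ages = [23, 35, 41, 29, 62, 57, 33]  # invented records
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_count(ages, lambda a: a >= 40, eps), 2))
```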
Article
Full-text available
Differential privacy (DP) has been in the public spotlight since the announcement of its use in the 2020 U.S. Census. While DP algorithms have substantially improved the confidentiality protections provided to Census respondents, concerns have been raised about the accuracy of the DP-protected Census data. The extent to which the use of DP distorts the ability to draw inferences that drive policy about small-populations, especially marginalized communities, has been of particular concern to researchers and policy makers. After all, inaccurate information about marginalized populations can often engender policies that exacerbate rather than ameliorate social inequities. Consequently, computer science experts have focused on developing mechanisms that help achieve equitable privacy, i.e., mechanisms that mitigate the data distortions introduced by privacy protections to ensure equitable outcomes and benefits for all groups, particularly marginalized groups. Our paper extends the conversation on equitable privacy by highlighting the importance of inclusive communication in ensuring equitable outcomes for all social groups through all the stages of deploying a differentially private system. We conceptualize Equitable DP as the design, communication, and implementation of DP algorithms that ensure equitable outcomes. Thus, in addition to adopting computer scientists' recommendations of incorporating equity parameters within DP algorithms, we suggest that it is critical for an organization to also facilitate inclusive communication throughout the design, development, and implementation stages of a DP algorithm to ensure it has an equitable impact on social groups and does not hinder the redressal of social inequities. To demonstrate the importance of communication for Equitable DP, we undertake a case study of the process through which DP was adopted as the newest disclosure avoidance system for the 2020 U.S. Census. Drawing on the Inclusive Science Communication (ISC) framework, we examine the extent to which the Census Bureau's communication strategies encouraged engagement across the diverse groups of users that employ the decennial Census data for research and policy making. Our analysis provides lessons that can be used by other government organizations interested in incorporating the Equitable DP approach in their data collection practices.
... It is necessary to create a standard to evaluate the adequacy, quality, and privacy protection of anonymized photos. Masking human identity is very important in many domains, including surveillance, social networks, medical imagery, research, and education [16,17]. The remarkable advances have improved the performance and efficiency of such processes [18]. ...
Article
Full-text available
Effective anonymous facial registration techniques are critical to address privacy concerns arising from facial recognition technology. This study presents an intelligent anonymity platform that incorporates blockchain with advanced privacy and uses a CIAGAN-powered approach. This solution addresses the immediate need for privacy in facial recognition technology. The proposed system uses advanced techniques to anonymously generate highly realistic and effective facial images. The widespread use of facial recognition systems places greater emphasis on privacy concerns, emphasizing the need for strong enrollment mechanisms. The proposed system uses CIAGAN to address this challenge and generate facial images while preserving important attributes. Blockchain storage ensures that data integrity and security are maintained. The process begins with detailed image preprocessing steps to improve data quality and eliminate unwanted noise. CIAGAN can generate anonymous face images with important facial attributes to complicate the recognition of specific objects. A dataset of 202,599 facial images was used. Performance metrics such as PSNR and SSIM indicate image quality and uniformity. The PSNR obtained was 35.0516, indicating a unique image anonymization process.
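For reference, PSNR, the first metric reported above, can be computed as follows; the images here are random toy arrays, so the output does not reproduce the study's 35.05 dB figure.

```python
# PSNR between an original and an anonymized 8-bit image:
# 10 * log10(MAX^2 / MSE). Toy arrays stand in for real face images.
import numpy as np

def psnr(original: np.ndarray, anonymized: np.ndarray, max_value: float = 255.0) -> float:
    mse = np.mean((original.astype(float) - anonymized.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_value ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(img + rng.normal(0, 4, img.shape), 0, 255).astype(np.uint8)
print(round(psnr(img, noisy), 2), "dB")
```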
... This gap underscores the need for more effective digital literacy programs and user-friendly privacy tools. [34,35] ...
Preprint
Full-text available
The digital age has ushered in unprecedented connectivity and convenience, but it has also given rise to escalating cyber threats and privacy concerns. This study examines the intricate balance between technological advancement and personal privacy in the face of evolving cyber risks. A comprehensive literature review was conducted, analyzing peer-reviewed articles, government reports, and industry white papers from the past decade. Additionally, case studies of major data breaches and privacy violations were examined. Surveys and interviews with cybersecurity experts, policymakers, and privacy advocates were conducted to gather diverse perspectives. The research reveals a significant increase in both the frequency and sophistication of cyber-attacks, coupled with a growing public awareness of privacy issues. However, there is a notable disconnect between individuals' stated privacy concerns and their online behaviors. Regulatory efforts to protect privacy often lag behind technological advancements and cyber threats. Maintaining equilibrium between technological progress and privacy protection requires a multifaceted approach involving enhanced cybersecurity measures, updated regulations, and improved digital literacy. The study underscores the need for ongoing collaboration between government, industry, and academia to address this complex challenge.
... However, even anonymized data can sometimes be re-identified, posing an additional challenge to ensuring privacy (Narayanan & Shmatikov, 2008) [27]. These anonymization practices must continually evolve to address new technological challenges (Ohm, 2010) [28]. ...
Article
Full-text available
Ethical considerations in Artificial Intelligence (AI) have become increasingly crucial as AI technologies permeate various facets of society. This paper explores key ethical dimensions related to AI, focusing on algorithmic bias, data privacy, transparency, social impact, and regulatory challenges. Each of these areas presents unique ethical dilemmas, from the unintended biases embedded in AI algorithms to the ethical implications of AI-driven decision-making in critical domains. The synthesis of current literature reveals ongoing efforts to address these challenges through regulatory frameworks, technological advancements, and ethical guidelines. However, significant challenges remain, including the need for robust regulatory frameworks that balance innovation with ethical considerations, and the imperative to ensure AI systems are accountable and transparent in their operations. Keywords: algorithmic bias in AI, data privacy in AI, AI transparency and accountability, social impact of AI, AI governance and regulation.
... Since the statutory definition of personal information is quite broad (e.g., "information about an identifiable individual"), this can significantly limit the availability of some datasets for release (Scassa, 2010). Identifiability is assessed in terms of any other available data, and in an era of big data, reidentification risk can be high (Rocher et al., 2019;Ohm, 2010;Sweeney, 2010). Lest privacy concerns become an overwhelming barrier to open data, activists as well as governments have been working to develop guidance and strategies for opening data in ways that protect privacy rights and avoid privacy harms (Green et al., 2017;Simperl et al., 2016;Scassa & Conroy, 2016;Borgesius et al., 2015). ...
... Work related to this paper can largely be divided between that which aims to explain what privacy means and that which aims to explain how privacy can be retained. The first kind is more widely practiced by legal scholars [8,18,35], sociologists [7,29], economists [21], and a few exceptional scholars who thrive in more than one of those fields [30-32]. ...
Preprint
Full-text available
The contextual integrity model is a widely accepted way of analyzing the plurality of norms that are colloquially called "privacy norms". Contextual integrity systematically describes such norms by distinguishing the type of data concerned, the three social agents involved (subject, sender, and recipient) and the transmission principle governing the transfer of information. It allows analyzing privacy norms in terms of their impact on the interaction of those agents with one another. This paper places contextual integrity in a strict game theoretic framework. When such a description is possible, it has three key advantages: Firstly, it allows indisputable utilitarian justification of some privacy norms. Secondly, it better relates privacy to topics which are well understood by stakeholders whose education is predominantly quantitative, such as engineers and economists. Thirdly, it is an absolute necessity when describing ethical constraints to machines such as AI agents. In addition to describing games which capture paradigmatic informational norms, the paper also analyzes cases in which the game, per se, does not encourage normative behavior. The paper discusses two main forms of mechanisms which can be applied to the game in such cases, and shows that they reflect accepted privacy regulation and technologies.
... See also ICDPPC and OECD 2015; e.g., Narayanan et al. 2016; Ohm 2010. ... proposed solution. ...
Book
Full-text available
This open access book focuses on the impact of Artificial Intelligence (AI) on individuals and society from a legal perspective, providing a comprehensive risk-based methodological framework to address it. Building on the limitations of data protection in dealing with the challenges of AI, the author proposes an integrated approach to risk assessment that focuses on human rights and encompasses contextual social and ethical values. The core of the analysis concerns the assessment methodology and the role of experts in steering the design of AI products and services by business and public bodies in the direction of human rights and societal values. Taking into account the ongoing debate on AI regulation, the proposed assessment model also bridges the gap between risk-based provisions and their real-world implementation. The central focus of the book on human rights and societal values in AI and the proposed solutions will make it of interest to legal scholars, AI developers and providers, policy makers and regulators.
... Rather, the information falls within the realm of group privacy. Since my focus is on whether the privacy of a clustered group is violated when anonymised information is accessed by an agent, I do not discuss the issues of re-identification or de-anonymisation, which involve linking an anonymised dataset with a separate dataset containing identifying information (for more information on this matter, see Barocas and Nissenbaum [20] and Ohm [21]). I argue that, if the information uniquely identifies a person, then individual privacy is at risk due to the personal nature of the information. ...
Article
Full-text available
An issue about the privacy of the clustered groups designed by algorithms arises when attempts are made to access certain pieces of information about those groups that would likely be used to harm them. Therefore, limitations must be imposed regarding accessing such information about clustered groups. In the discourse on group privacy, it is argued that the right to privacy of such groups should be recognised to respect group privacy, protecting clustered groups against discrimination. According to this viewpoint, this right places a duty on others, for example, private companies, institutions, and governments, to refrain from accessing such information. To defend the idea that the right to privacy should be recognised for clustered groups, at least two requirements must be satisfied. First, clustered group privacy must be conceived of as either a collective good or a participatory good. Since these forms of good are of the type from which no member of a group can be excluded from benefiting, the right to them is defined as a group right. Second, there must be group interests on which to base a group right. Group interests can be either the interests of those members that are a result of their being in the group or the interests of the group as a whole that transcend the interests of its members. However, this paper argues that clustered group privacy cannot be conceived of as either a collective or a participatory good because it is possible for some individuals to be excluded from benefiting from it. Furthermore, due to the lack of awareness among individuals that they are members of a clustered group and the nature of a clustered group itself, such groups cannot have the group interests necessary to establish a group right. Hence, the group right to privacy cannot be recognised for these groups, implying that the group right cannot be considered a means to protect clustered groups against discrimination. Instead, this paper suggests that moral principles need to be articulated within an ethics of vulnerability to identify the moral obligations of protecting vulnerable clustered groups. The duty owed to the vulnerable should involve refraining from accessing certain information about clustered groups in specific contexts. This duty is not engendered by the right to privacy of such groups; it is the duty owed to the vulnerable. The findings highlight the need to articulate moral principles regarding privacy and data protection to protect clustered groups in contexts in which accessing information about them could constitute a reason for discriminatory targeting.
... Inferential analysis and linking of different records can also lead to discriminatory treatment [5], which limits user analysis [6]. The inability to anonymize data [7] and weak cybersecurity standards, often due to the limited computing power of identification technologies, exacerbate privacy risks. Taken together, these risks make free and informed consent in the IoT a challenge. ...
Article
As records are increasingly born digital – and thus, at least ostensibly, potentially much more accessible – archivists find themselves struggling to enable general access while providing appropriate privacy protections for the torrent of records being transferred to their care. In this article, the authors report the results of an integrative literature review study, examining the intersection of AI, archives, and privacy in terms of how archives are currently coping with these challenges and what role(s) AI might play in addressing privacy in archival records. The study revealed three major themes: 1) the challenges of – and possibilities beyond – defining “privacy” and “AI”; 2) the need for context-sensitive ways to manage privacy and access decisions; and 3) the lack of adequate “success measures” for ensuring the actual fitness for purpose of privacy AI solutions in the archival context.
Article
Full-text available
This study explores the complexities and challenges of implementing data privacy laws in the era of big data, where security, privacy, and innovation frequently intersect. The exponential growth of data collection, driven by advancements in technology and the widespread adoption of digital services, has intensified the need for effective data privacy regulations. However, balancing the protection of individual privacy with the demands of innovation and security presents considerable challenges for policymakers. Utilizing a qualitative approach, this study employs a literature review and library research methodology to analyze existing data privacy laws, regulatory frameworks, and scholarly discussions on the topic. Findings indicate that current data privacy laws often struggle to keep pace with rapid technological change, creating gaps that can be exploited by both private and public entities. Additionally, the study highlights conflicting priorities, as stringent privacy protections can inhibit technological innovation, while lenient policies may lead to significant privacy vulnerabilities. This analysis suggests that adaptive and scalable legal frameworks, alongside international cooperation, are essential to address these challenges effectively. Recommendations are provided for balancing privacy and innovation in a way that upholds data security without stifling technological growth. This study contributes to ongoing debates surrounding data privacy by offering insights that may guide future policy developments in the field.
Chapter
Cross-border data flows, which encompass a broad and diverse range of economic and non-economic dimensions, raise a number of new trade policy issues. However, as the status of data and data flows in International Economic Law remains ill-defined, no effective multilateral governance is currently exercised with respect to the digital transformation of trade. Notably, the proliferation of national data governance frameworks is a critical element for regulating trade in the digital economy, but one that receives only limited consideration under WTO law. As digital globalisation accelerates, a patchwork of country-specific data governance frameworks threatens to fragment the global data sphere and thus increase barriers to digital trade. The debate on transnational data governance is particularly pronounced with regard to data privacy laws, as these are a common element of domestic data governance and the global landscape of data privacy regulations is characterised by considerable heterogeneity. As a result, the impact of national data privacy laws on the cross-border flow of personal data is one of the most contentious issues associated with digital trade. This chapter provides an in-depth examination of the regulation of data flows through data privacy rules and explores the rationale behind a contemporary data privacy collision in digital trade.
Chapter
This chapter casts light on three normative challenges in terms of fundamental rights, in particular, the right to privacy and the right to the protection of personal data (Arts. 7 and 8 of the EU Charter of Fundamental Rights), brought about by the structural data and metadata sharing of the ubiquitous IoT. The first legal challenge hinges on how structural IoT data and metadata processing challenges the two different regimes of the ePrivacy Directive and the GDPR, in particular, taking into account (i) the lawfulness of the processing; (ii) lack of transparency and information asymmetry; (iii) information security risks. The second legal challenge casts light on the relationship between various cryptographic technical tools e.g., but not limited to, encryption, and the traditional legal disciplines of privacy and data protection against the background of IoT data and metadata processing. Notwithstanding the lightweight—in the case of resource-constrained devices—or ‘strong’ encryption protocols that may be adopted to secure (IoT) device communication, serious privacy concerns might nevertheless arise. Encrypted traffic analysis, or metadata analysis, which corresponds to the third legal challenge, is presented taking into account, in particular, hidden impacts on IoT users’ privacy and security.
Article
In urban areas, especially large metropolises, public transportation systems are a core component of sustainable urban development and quality of life. The technological advances of the digital age play a critical role in increasing the effectiveness and efficiency of these systems. In this context, Intelligent Transportation Systems (AUS) are becoming increasingly important in optimizing urban transport networks. Among the various modes of public transport, taxis fill an important gap for short-distance, door-to-door, and individualized travel demands. However, several problems with taxi services are observed in many large cities, chief among them safety concerns, payment disputes, and complaints about service quality. The system design was validated through physical testing and examined in detail as part of a master's thesis. The driver identification system developed is the first of its kind in Turkey and also has features that are unique on a global scale. The system's main objectives include improving driver safety, minimizing passenger-driver disputes, and raising the quality of taxi service. To achieve these goals, the system incorporates innovative features such as emergency notification and dynamic display of taxi status (vacant/occupied/reserved). The proposed Taxi Driver System Model is designed to meet the needs of metropolises with high taxi volumes such as New York, London, Tokyo, Paris, and Istanbul, aiming both to contribute to the academic literature and to guide industry practice. This study offers a new perspective on the integration of intelligent systems into public transportation and lays fertile ground for future research. Moreover, the adaptability of the proposed model to other modes of public transport broadens the potential impact of the research.
Article
Full-text available
The intersection of AI and personal privacy presents a landscape rife with both opportunities and challenges. As artificial intelligence (AI) technologies become increasingly integrated into various aspects of daily life, concerns about the protection of personal privacy have come to the forefront. This paper delves into the multifaceted dimensions of this issue, aiming to illuminate the threats posed by AI to personal privacy while also exploring potential avenues for protection. Through a comprehensive review of existing literature and case studies, we investigate the evolving landscape of AI-driven privacy threats and the countermeasures employed to safeguard individual privacy rights. Drawing on insights from ethics, law, and computer science, we analyze the mechanisms through which AI systems can compromise personal privacy, including data breaches, surveillance, and algorithmic biases. Additionally, we explore emerging techniques and regulatory frameworks designed to enhance privacy protections in an AI-driven world. By synthesizing research from diverse disciplines, this paper contributes to a deeper understanding of the complex interplay between AI and personal privacy, with the ultimate goal of fostering a future where individuals can harness the benefits of AI without sacrificing their privacy rights.
Chapter
Three major advances are driving fundamental changes in the field of remote sensing. New satellite sensors will provide global imaging with excellent spatial and spectral resolution. Technological advancements have rendered earlier limitations on data scale, resolution, location, and availability obsolete. Economic restructuring in the remote sensing community will shift control and dissemination of imagery and related information from the government to the private sector. The internet and other digital infrastructures will speed up information distribution to a global user base. The combined results of these advancements may have serious legal and ethical ramifications for all remote sensing experts. Remote sensing technology may soon be able to provide detailed information, potentially violating privacy and leading to legal and ethical implications. This chapter discusses the legal history of remote sensing, recent innovations in satellite surveillance and information technology, and potential legal and ethical concerns for the remote sensing community. Self-regulation of the profession is essential for balancing individual rights with the economic objectives of the remote sensing community and nation.
Preprint
Full-text available
Human mobility data is a crucial resource for urban mobility management, but it does not come without personal reference. The implementation of security measures such as anonymization is thus needed to protect individuals' privacy. Often, a trade-off arises as such techniques potentially decrease the utility of the data and limit its use. While much research on anonymization techniques exists, there is little information on the actual implementations by practitioners, especially outside the big tech context. Within our study, we conducted expert interviews to gain insights into practices in the field. We categorize purposes, data sources, analysis, and modeling tasks to provide a profound understanding of the context such data is used in. We survey privacy-enhancing methods in use, which generally do not comply with state-of-the-art standards of differential privacy. We provide groundwork for further research on practice-oriented research by identifying privacy needs of practitioners and extracting relevant mobility characteristics for future standardized evaluations of privacy-enhancing methods.
Chapter
Differential Privacy (DP) can provide strong guarantees that personal information is not disclosed in data sets. These guarantees rest on mathematical, theoretical, and relational proofs of privacy, which makes it important to understand the actual behavior of DP-based protection models. For this purpose, we review what kinds of frameworks or models are available to estimate how well an implemented differential privacy model works. Special attention is paid to how to assess that a certain level of privacy has been reached, what configurations were used, and how to estimate the privacy loss. Our goal is to locate a common framework that could help one decide, based on privacy requirements, which model and configuration should be used and how its protection can be ensured.
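To make the notion of a configurable privacy loss concrete, the following is a minimal sketch, in Python, of the Laplace mechanism for epsilon-differential privacy. The count, predicate, and epsilon values are purely illustrative, and the sketch ignores composition, sensitivity analysis, and floating-point subtleties that the evaluation frameworks reviewed here must address.

```python
# Minimal sketch of the Laplace mechanism for epsilon-differential privacy.
# Illustrative only: real DP deployments must also track composition,
# sensitivity analysis, and floating-point issues that this toy ignores.
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a noisy count; smaller epsilon means more noise and less privacy loss."""
    scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

# Example: releasing how many records in a dataset match some predicate.
true_count = 128
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(true_count, eps)
    print(f"epsilon={eps:<4} noisy count = {noisy:.1f}")
```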
Article
The re-identification or de-anonymization of users from anonymized data through matching with publicly available correlated user data has raised privacy concerns, leading to the complementary measure of obfuscation in addition to anonymization. Recent research provides a fundamental understanding of the conditions under which privacy attacks, in the form of database matching, are successful in the presence of obfuscation. Motivated by synchronization errors stemming from the sampling of time-indexed databases, this paper presents a unified framework considering both obfuscation and synchronization errors and investigates the matching of databases under noisy entry repetitions. By investigating different structures for the repetition pattern, replica detection and seeded deletion detection algorithms are devised and sufficient and necessary conditions for successful matching are derived. Finally, the impacts of some variations of the underlying assumptions, such as the adversarial deletion model, seedless database matching, and zero-rate regime, on the results are discussed. Overall, our results provide insights into the privacy-preserving publication of anonymized and obfuscated time-indexed data as well as the closely related problem of the capacity of synchronization channels.
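As a toy illustration of the database-matching threat model only (not the replica-detection or seeded deletion-detection algorithms devised in the paper), the sketch below matches rows of an obfuscated, anonymized table against a correlated public table by nearest-neighbour search over shared attributes; the sizes and noise level are invented.

```python
# Toy linkage attack: match rows of an anonymized, obfuscated database to a
# correlated public database by minimizing Hamming distance over shared columns.
# Generic illustration, not the detection algorithms analyzed in the paper.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_attrs = 200, 30

public = rng.integers(0, 4, size=(n_users, n_attrs))    # rows with known identities
perm = rng.permutation(n_users)                          # hidden relabeling of users
noise = rng.random((n_users, n_attrs)) < 0.1             # 10% of entries obfuscated
anonymized = public[perm].copy()
anonymized[noise] = rng.integers(0, 4, size=noise.sum())

# For each anonymized row, guess the public identity with the closest attributes.
correct = 0
for i, row in enumerate(anonymized):
    distances = (public != row).sum(axis=1)
    if distances.argmin() == perm[i]:
        correct += 1
print(f"re-identified {correct}/{n_users} users despite obfuscation")
```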
Article
Recommender systems (RS), serving at the forefront of Human-centered AI, are widely deployed in almost every corner of the web and facilitate the human decision-making process. However, despite their enormous capabilities and potential, RS may also lead to undesired effects on users, items, producers, platforms, or even the society at large, such as compromised user trust due to non-transparency, unfair treatment of different consumers, or producers, privacy concerns due to extensive use of user’s private data for personalization, just to name a few. All of these create an urgent need for Trustworthy Recommender Systems (TRS) so as to mitigate or avoid such adverse impacts and risks. In this survey, we will introduce techniques related to trustworthy recommendation, including but not limited to explainable recommendation, fairness in recommendation, privacy-aware recommendation, robustness in recommendation, user-controllable recommendation, as well as the relationship between these different perspectives in terms of trustworthy recommendation. Through this survey, we hope to deliver readers with a comprehensive view of the research area and raise attention to the community about the importance, existing research achievements, and future research directions on trustworthy recommendation.
Article
Full-text available
"This study analyzes legal and technical approaches to data de-identification and anonymization, motivated by the need to develop balanced standards that preserve privacy without stifling beneficial data uses. Doctrinal and technical literature review methods examine provisions in major data protection laws worldwide, including the EU's GDPR, US HIPAA, and emerging frameworks in China, India, and Uzbekistan, alongside mathematical models like differential privacy and k-anonymity. The legal analysis reveals common themes like flexible research exemptions for anonymized data and calibrating standards based on sensitivity, but also gaps such as ambiguities around pseudonymization. The technical review highlights the strengths and weaknesses of encryption, perturbation, generalization, and federation techniques, emphasizing the need to complement mathematical methods with governance controls. Key findings include the importance of allowing contextual optimization, providing detailed regulatory guidance, and addressing re-identification incentives. Recommendations are provided for advancing Uzbekistan's data protection laws and practices based on international experiences, such as enabling public oversight, conducting localized impact assessments, and promoting privacy-enhancing technologies. The study concludes that to anonymize data in a way that enables research while also protecting people's rights, we need a comprehensive approach that includes laws, organizational rules, technical safeguards, ethical decision-making, and public input. All of these parts working together is important for successful data anonymization."
Article
Full-text available
Data anonymization is crucial for protecting individuals' privacy and ensuring regulatory compliance in each country. The objective of this research is to analyze the degree to which database anonymization affects compliance with the Ley Orgánica de Protección de Datos Personales (Organic Law on Personal Data Protection) in Ecuador. That objective, combined with the methodology designed, yields findings such as that, with the emergence of the Organic Law on Personal Data Protection in Ecuador, people assume obligations toward the holders of data rights. Databases, from their role in decision-making to their contribution to operational efficiency and technological adaptability, are essential to effective information management. One of the fundamental aspects of anonymization is its capacity to allow data to be used ethically in research without compromising individuals' privacy. Anonymization is not an infallible process and should be treated as part of a broader information-security approach. The study concludes that only through a holistic and collaborative approach can the benefits of anonymization be fully realized.
Article
Full-text available
Releasing network measurement data, including packet traces, to the research community is a virtuous activity that promotes solid research. However, in practice, releasing anonymized packet traces for public use entails many more vexing considerations than just the usual notion of how to scramble IP addresses to preserve privacy. Publishing traces requires carefully balancing the security needs of the organization providing the trace with the research usefulness of the anonymized trace. In this paper we recount our experiences in (i) securing permission from a large site to release packet header traces of the site's internal traffic, (ii) implementing the corresponding anonymization policy, and (iii) validating its correctness. We present a general tool, tcpmkpub, for anonymizing traces, discuss the process used to determine the particular anonymization policy, and describe the use of metadata accompanying the traces to provide insight into features that have been obfuscated by anonymization.
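The sketch below shows, under stated assumptions, the general idea of consistent keyed pseudonymization of IP addresses in a trace log. It is not tcpmkpub or its anonymization policy; the key and output address range are placeholders, and a real policy (as the paper stresses) must also handle checksums, embedded addresses, and per-subnet rules.

```python
# Minimal sketch of keyed IP-address pseudonymization for a packet-trace log.
# Illustrates consistent address scrambling only; not tcpmkpub.
import hmac, hashlib, ipaddress

SECRET_KEY = b"rotate-me-before-use"   # illustrative placeholder key

def pseudonymize_ip(addr: str) -> str:
    """Map an IPv4 address to a stable pseudonym inside a reserved range."""
    digest = hmac.new(SECRET_KEY, addr.encode(), hashlib.sha256).digest()
    # Fold the MAC into a host id within 10.0.0.0/8 so the output is clearly synthetic.
    host_id = int.from_bytes(digest[:3], "big")
    return str(ipaddress.IPv4Address((10 << 24) | host_id))

print(pseudonymize_ip("192.0.2.17"))   # same input always yields the same pseudonym
print(pseudonymize_ip("192.0.2.18"))
```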
Article
Full-text available
The 9/11 terrorists, before their deadly attacks, sought invisibility through integration into the society they hoped to destroy. In a similar fashion, the terrorists who carried out subsequent attacks in Madrid and London attempted to blend into their host lands. This strategy has forced governments, including the United States, to rethink counterterrorism strategies and tools. One of the current favored strategies involves data mining. In its pattern-based variant, data mining searches select individuals for scrutiny by analyzing large data sets for suspicious data linkages and patterns. Because terrorists do not stand out, intelligence and law enforcement agents want to do more than rely exclusively on investigations of known suspects. The new goal is to search for a pattern or signature in massive amounts of transaction data. This Article begins by examining governmental data mining. In Part II, this Article reviews widely held views about the necessary safeguards for the use of data mining. In Part III, this Article considers dataveillance by private corporations and how they have compiled rich collections of information gathered online in the absence of a robust legal framework that might help preserve online privacy. This Article then discusses some of the techniques that individuals can employ to mask their online activity as well as existing and emerging technological approaches to preventing the private sector or government from linking their personal information and tracing their activities. This Article concludes by briefly considering three topics: (1) whether and how to regulate the potential impact of identity management systems on counterterrorism efforts; (2) the requirements of transparency and understanding of the underlying models used in either data mining or identity management systems as a necessary prelude to the creation of rules on appropriate access and use; and (3) the need for research in several further areas.
Article
Full-text available
In the United States, proposals for informational privacy have proved enormously controversial. On a political level, such proposals threaten powerful data processing interests. On a theoretical level, data processors and other data privacy opponents argue that imposing restrictions on the collection, use, and exchange of personal data would ignore established understandings of property, limit individual freedom of choice, violate principles of rational information use, and infringe data processors' freedom of speech. In this article, Professor Julie Cohen explores these theoretical challenges to informational privacy protection. She concludes that categorical arguments from property, choice, truth, and speech lack weight, and mask fundamentally political choices about the allocation of power over information, cost, and opportunity. Each debate, although couched in a rhetoric of individual liberty, effectively reduces individuals to objects of choices and trades made by others. Professor Cohen argues, instead, that the debate about data privacy protection should be grounded in an appreciation of the conditions necessary for individuals to develop and exercise autonomy in fact, and that meaningful autonomy requires a degree of freedom from monitoring, scrutiny, and categorization by others. The article concludes by calling for the design of both legal and technological tools for strong data privacy protection.
Article
Full-text available
The concept of k-anonymity, used in the literature to formally evaluate the privacy preservation of published tables, was introduced based on the notion of quasi-identifiers (or QI for short). The process of obtaining k-anonymity for a given private table is first to recognize the QIs in the table, and then to anonymize the QI values, the latter being called k-anonymization. While k-anonymization is usually rigorously validated, the definition of QI remains mostly informal, and different authors seem to have different interpretations of the concept of QI. The purpose of this paper is to provide a formal underpinning of QI and examine the correctness and incorrectness of various interpretations of QI in our formal framework. We observe that in cases where the concept has been used correctly, its application has been conservative; this note provides a formal understanding of the conservative nature in such cases.
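A minimal sketch, assuming a pandas DataFrame and an already-chosen quasi-identifier set, of how the k-anonymity level of a table can be measured as the size of the smallest equivalence class over the QI columns. The column names are illustrative; choosing the QI set correctly is exactly the step this paper argues is usually left informal.

```python
# Sketch: measure the k-anonymity level of a table for a chosen quasi-identifier set.
import pandas as pd

def k_anonymity_level(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest equivalence class over the QI columns."""
    return int(df.groupby(quasi_identifiers).size().min())

records = pd.DataFrame({
    "zip":       ["1305*", "1305*", "1306*", "1306*", "1306*"],
    "age":       ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "sex":       ["F", "F", "M", "M", "M"],
    "diagnosis": ["flu", "asthma", "flu", "flu", "diabetes"],
})
print(k_anonymity_level(records, ["zip", "age", "sex"]))  # 1 here: one record is unique on its QIs
```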
Article
Full-text available
Decision makers of companies often face the dilemma of whether to release data for knowledge discovery, vis-à-vis the risk of disclosing proprietary or sensitive information. Among the various methods employed for "sanitizing" the data prior to disclosure, we focus in this paper on anonymization, given its widespread use in practice. We do due diligence to the question "just how safe is the anonymized data". We consider both the scenario when the hacker has no information and, more realistically, when the hacker may have partial information about items in the domain. We conduct our analyses in the context of frequent set mining and address the safety question at two different levels: (i) how likely are the identities of individual items cracked (i.e., reidentified by the hacker), and (ii) how likely are sets of items cracked. For capturing the prior knowledge of the hacker, we propose a belief function, which amounts to an educated guess of the frequency of each item. For various classes of belief functions, which correspond to different degrees of prior knowledge, we derive formulas for computing the expected number of cracks of single items and, for itemsets, the probability of cracking the itemsets. While obtaining the exact values for the more general situations is computationally hard, we propose a series of heuristics called the O-estimates. They are easy to compute and are shown to be fairly accurate, justified by empirical results on real benchmark datasets. Based on the O-estimates, we propose a recipe for the decision makers to resolve their dilemma. Our recipe operates at two different levels, depending on whether the data owner wants to reason in terms of single items or sets of items (or both). Finally, we present techniques using which a hacker's knowledge of correlation in terms of co-occurrence of items can be incorporated into our framework of disclosure risk analysis and present experimental results demonstrating how this knowledge affects the heuristic estimates we have developed.
Article
Often a data holder, such as a hospital or bank, needs to share person-specific records in such a way that the identities of the individuals who are the subjects of the data cannot be determined. One way to achieve this is to have the released records adhere to k-anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k-anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information. This paper provides a formal presentation of combining generalization and suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen), which is a theoretical algorithm presented herein, combines these techniques to provide k-anonymity protection with minimal distortion. The real-world algorithms Datafly and μ-Argus are compared to MinGen. Both Datafly and μ-Argus use heuristics to make approximations, and so, they do not always yield optimal results. It is shown that Datafly can over distort data and μ-Argus can additionally fail to provide adequate protection.
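The toy pass below illustrates generalization (recoding ZIP codes to prefixes) followed by suppression (dropping records whose equivalence class is still smaller than k). It is a simplistic sketch over invented data, not the MinGen, Datafly, or μ-Argus algorithms compared in the paper.

```python
# Toy generalization-and-suppression pass toward k-anonymity: recode ZIP codes to
# prefixes, then suppress any record still in an equivalence class smaller than k.
import pandas as pd

def generalize_and_suppress(df: pd.DataFrame, k: int) -> pd.DataFrame:
    out = df.copy()
    out["zip"] = out["zip"].str[:3] + "**"                        # generalization: 5-digit ZIP -> prefix
    sizes = out.groupby(["zip", "age_band"])["zip"].transform("size")
    return out[sizes >= k]                                         # suppression: drop residual outliers

records = pd.DataFrame({
    "zip":       ["13053", "13068", "13068", "14850", "14850"],
    "age_band":  ["30-39", "30-39", "30-39", "40-49", "30-39"],
    "diagnosis": ["flu", "flu", "asthma", "flu", "diabetes"],
})
# The two 148** records remain singletons after generalization and are suppressed.
print(generalize_and_suppress(records, k=2))
```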
Article
In Section 1 the nature of statistical confidentiality is explored, i.e., its essential role in the collection of data by statistical offices, its relationship to privacy and the need for increased attention to potential statistical disclosures because of the increased tabulation and dissemination capabilities of statistical offices. In Section 2 a definition of inadvertent direct disclosure is provided as well as a theorem concerning a test for residual disclosure of tabulations. In Section 3 different media and methods of data dissemination are considered from the point of view of potential for statistical disclosure.
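As a hedged illustration of residual disclosure (not the test or theorem presented in the article), the sketch below shows how a primary-suppressed small cell in a tabulation can be recovered by simple differencing when the row total is still published.

```python
# Toy illustration of residual disclosure in published tabulations: suppressing a
# small cell is not enough if the row total lets a reader recover it by subtraction.
import numpy as np

table = np.array([[12, 2, 30],     # counts by (row category, column category)
                  [25, 40, 18]])
row_totals = table.sum(axis=1)

threshold = 5
suppressed = table < threshold      # primary suppression of small cells

for r in range(table.shape[0]):
    hidden = np.flatnonzero(suppressed[r])
    if len(hidden) == 1:            # exactly one hidden cell in the row
        recovered = row_totals[r] - table[r, ~suppressed[r]].sum()
        print(f"row {r}: suppressed cell recoverable by differencing = {recovered}")
```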
Article
This article opens discussion of a starkly new approach for protecting the privacy of Americans' sensitive health information. Last year, Congress empowered the U.S. Food and Drug Administration (FDA) to oversee development of a major new national infrastructure: a large-scale data network, the Sentinel System, that aims to include health data for 100 million Americans by 2012. This marked the first time since the end of the New Deal that a wholly new infrastructure regulatory mandate had been issued at the federal level. This important development, buried in drug-safety provisions of the Food and Drug Administration Amendments Act of 2007 (FDAAA), went largely unnoticed, as did the fact that Congress cast medical privacy, a hot-button issue for many members of the American public, as an infrastructure regulatory problem. Individuals are not empowered to make autonomous decisions about permissible uses and disclosures of their health data. Instead, Congress authorized FDA to decide whether proposed disclosures meet a statutorily defined public-interest standard. If so, then the disclosures are lawful without individual privacy authorization or informed consent. Within limits that this article explores, FDA can approve the release of private health data, including data in identifiable form, to private operators of Sentinel System infrastructure and to outside data users, including academic and commercial entities. This article describes the new privacy model, which was implicit in the statute Congress passed but far from obvious on its face. The goal is not to oppose the new approach. Congress was responding to serious public concern about the safety of FDA-approved products. This article accepts that this new privacy model exists and explores directions for implementing it in a manner that will be least corrosive of public trust. The goal is to elicit ongoing dialogue about appropriate institutional protections for the 100 million Americans whose data soon will be in this vast data network. FDA is, in many respects, an accidental infrastructure regulator, thrust into a new role strikingly different from its longstanding product-safety mandate. Fortunately, the challenges FDA now faces are not new ones. U.S. infrastructure regulators, in a wide variety of industry contexts, have harnessed private capital to build new infrastructures to serve defined public interests while protecting vulnerable classes. Lessons from these other contexts can shed light on appropriate governance structures for the Sentinel System. For example, privacy protection may be enhanced by eschewing vertical integration in favor of segregating certain key infrastructure functions that require access to identifiable data. It may be better to establish core privacy protections via rulemaking rather than through contracts and to centralize certain key discretionary decisions rather than delegating them to private, commercial decision-makers. Public trust will require strong due-process protections, regulatory independence, and a well-funded system of regulatory oversight; approaches employed by other infrastructure regulators may help address these concerns. The single greatest threat to privacy will come as FDA faces pressure to approve wide ancillary sales of Sentinel System data to help defray costs of system development. 
To make this system financeable while enforcing strong privacy protections, FDA should deploy its limited available funds to support a well-thought-out infrastructure financing facility that backstops clear privacy policies with appropriate political risk guarantees for private infrastructure investors.
Article
Fear of the powerful computer user, "the Superuser," dominates debates about online conflict. This mythic figure is difficult to find, immune to technological constraints, and aware of legal loopholes. Policymakers, fearful of his power, too often overreact, passing overbroad, ambiguous laws intended to ensnare the Superuser, but which are used instead against inculpable, ordinary users. This response is unwarranted because the Superuser is often a marginal figure whose power has been greatly exaggerated. The exaggerated attention to the Superuser reveals a pathological characteristic of the study of power, crime, and security online, which springs from a widely-held fear of the Internet. Building on the social science fear literature, this Article challenges the conventional wisdom and standard assumptions about the role of experts. Unlike dispassionate experts in other fields, computer experts are as susceptible as lay-people to exaggerate the power of the Superuser, in part because they have misapplied Larry Lessig's ideas about code. The experts in computer security and Internet law have failed to deliver us from fear, resulting in overbroad prohibitions, harms to civil liberties, wasted law enforcement resources, and misallocated economic investment. This Article urges policymakers and partisans to stop using tropes of fear; calls for better empirical work on the probability of online harm; and proposes an anti-Precautionary Principle, a presumption against new laws designed to stop the Superuser.
Article
The article examines the government's growing appetite for collecting personal data. Often justified on the basis of protecting national security, government data mining programs sweep up data collected through hundreds of regulatory and administrative programs, and combine them with huge datasets obtained from industry. The result is an aggregation of personal data - the "digital footprints" of individual lives - never before seen. These data warehouses are then used to determine who can work and participate in Social Security programs, who can board airplanes and enter government buildings, and who is likely to pose a threat in the future, even though they have done nothing wrong to date. The article describes the extraordinary volume and variety of personal data to which the government has routine access, directly and through industry, and examines the absence of any meaningful limits on that access. So-called privacy statutes are often so outdated and inadequate that they fail to limit the government's access to our most personal data, or they have been amended in the post-9/11 world to reduce those limits. And the Fourth Amendment, the primary constitutional guarantee of individual privacy, has been interpreted by the Supreme Court to not apply to routine data collection, accessing data from third parties, or sharing data, even if illegally gathered. The result is not only that individual privacy goes unprotected, but that national security is compromised because it is increasingly based on data mining initiatives that are untested, ill focused, and rely on inaccurate or incomplete data. These shortcomings, and the urgent need for Congress to act to address them, have been widely recognized by numerous public and private commissions, but largely ignored by members of Congress - republicans and democrats alike. The article concludes that there is wide agreement about both the need to restore some limits on the government's use of personal data and the form that those limits should take. The problem is the unwillingness - or inability - of Congress to act.
Article
The practices of public surveillance, which include the monitoring of individuals in public through a variety of media (e.g., video, data, online), are among the least understood and controversial challenges to privacy in an age of information technologies. The fragmentary nature of privacy policy in the United States reflects not only the oppositional pulls of diverse vested interests, but also the ambivalence of unsettled intuitions on mundane phenomena such as shopper cards, closed-circuit television, and biometrics. This Article, which extends earlier work on the problem of privacy in public, explains why some of the prominent theoretical approaches to privacy, which were developed over time to meet traditional privacy challenges, yield unsatisfactory conclusions in the case of public surveillance. It posits a new construct, "contextual integrity," as an alternative benchmark for privacy, to capture the nature of challenges posed by information technologies. Contextual integrity ties adequate protection for privacy to norms of specific contexts, demanding that information gathering and dissemination be appropriate to that context and obey the governing norms of distribution within it. Building on the idea of "spheres of justice," developed by political philosopher Michael Walzer, this Article argues that public surveillance violates a right to privacy because it violates contextual integrity; as such, it constitutes injustice and even tyranny.
Article
A broad coalition, including companies formerly opposed to the enactment of privacy statutes, has now formed behind the idea of a national information privacy law. Among the benefits that proponents attribute to such a law is that it would harmonize the U.S. regulatory approach with that of the European Union (E.U.) and possibly minimize international regulatory conflicts about privacy. This essay argues, however, that it would be a mistake for the United States to enact a comprehensive or omnibus federal privacy law for the private sector that preempts sectoral privacy law. In a sectoral approach, a privacy statute regulates only a specific context of information use. An omnibus federal privacy law would be a dubious proposition because of its impact on experimentation in federal and state sectoral laws, and the consequences of ossification in the statute itself. In contrast to its skepticism about a federal omnibus statute, this essay views federal sectoral laws as a promising regulatory instrument. The critical question is the optimal nature of a dual federal-state system for information privacy law, and this essay analyzes three aspects of this topic. First, there are general circumstances under which federal sectoral consolidation of state law can bring benefits. Second, the choice between federal ceilings and floors is far from the only preemptive decision that regulators face. Finally, there are second-best solutions that become important should Congress choose to engage in broad sectoral preemption.
Article
Consider three questions. How would one decide if there was too much telecommunications surveillance in the United States, or too little? How would one know if law enforcement was using its surveillance capabilities in the most effective fashion? How would one assess the impact of this collection of information on civil liberties? In answering these questions, a necessary step, the logical first move, would be to examine existing data about governmental surveillance practices and their results. One would also need to examine and understand how the legal system generated these statistics about telecommunications surveillance. Ideally, the information structure would generate data sets that would allow the three questions posed above to be answered. Light might also be shed on other basic issues, such as whether or not the amount of telecommunications surveillance was increasing or decreasing. Such rational inquiry about telecommunications surveillance is, however, largely precluded by the haphazard and incomplete information that the government collects about it. This Article evaluates the main parts of telecommunications surveillance law and the statistics about their use. The critical statutory regulations are (1) the Wiretap Act, (2) the Pen Register Act, (3) the Stored Communications Act, and, for foreign intelligence, (4) the Foreign Intelligence Surveillance Act, and (5) the different provisions for National Security Letters (NSLs). Other parts of the surveillance landscape represent an even greater expanse of blank spaces on the legal map. There are a number of "semi-known unknowns" (to coin a phrase); these are kinds of telecommunications surveillance about which only limited public information exists - this surveillance also occurs outside a detailed legal framework. This Article concludes with the development of the concept of "privacy theater." Currently, the value of the collection of telecommunications statistics is largely ritualistic. It serves to create a myth of oversight. This Article proposes that we go beyond myth and re-dedicate ourselves to the task of creating a telecommunications surveillance law that minimizes the impact of surveillance on civil liberties and maximizes its effectiveness for law enforcement.
Article
This Article is divided into five parts. In Part II, we summarize the development of concepts of privacy and confidentiality, and examine how legal and statistical communities have reacted to two major changes in the privacy landscape of the last twenty years. The first is the challenge brought by new online and potentially privacy invasive technologies. The second is the privacy and security effects of September 11th. We argue that law's innovations, perhaps reactive to overwrought public concerns regarding privacy and the aftershocks of September 11th, represent real challenges to traditional notions of data access, confidentiality, and respondent trust. Finally, we review statistical societies' quick condemnation of many of the privacy- and trust-erosive measures contained in post-September 11th legislation and their lobbying to overturn the most egregious of these. In Part III, we provide a basic framework for enacting future privacy legislation to govern federal statistical agencies. We argue that federal agencies, and the laws that govern their conduct, should be framed with three broad principles in mind: (i) a commitment to bureaucratic justice; (ii) a level of regulatory coherence that justifies choices made; and (iii) a flexibility that allows anticipation of, rather than merely post hoc response to, challenges posed by future technology and methods. In so doing, we note that law's previous attempts to maneuver between these principles have often resulted in second-best solutions, changing with shifting public opinions that are, themselves, based on various cognitive processes that result in consistent and recognizable errors. We argue that laws that fail to consider methods available in computer science and statistics often underestimate the privacy threats posed by seemingly nonindividually identifiable data disclosures. Such laws also overlook, to the detriment of the trust-based statistical regime that has proven so successful for two centuries, the importance of statistical associations' ethical standards for determining appropriate disclosures. In Part IV, we set out various statistical methods that have been, and could be, employed to protect the confidentiality of data released to the public or government agencies. In describing these options, we discuss the limitations inherent in each and note the law's continuing role in enforcing these measures. Finally, in Part V, we conclude that increased dialogue between our disciplines can, and should, result in a better privacy regime that combines sound statistical methods designed to protect anonymity where appropriate, with legal enactments that promote legitimate sharing yet punish abuses. In so doing, we provide some specific approaches for legislating privacy that account for the trust needed to make statistical programs successful and for the data access and disclosure so important for the challenges currently facing this country.
Article
United States statistical programs and practices are among the best in the world. Lurking underneath this success, however, is a riddle - given the potential for abuse and harm, why do Americans willingly hand over their personal information to government data collection programs? In this article, we seek an answer to this riddle by examining the evolution of United States legal and statistical programs, with a particular focus on the United States Census of Population. In so doing, we explore the statistical programs, policies, regulations, and codes of ethics that have evolved in the United States over the past two centuries. We conclude that the willingness of individuals to disclose their personal information is not linked to programs of legal coercion or to simple cost/benefit analyses. Instead, we note that the intent of United States statistical programs has been to increase the level of trust and confidence that individuals have that their information will be kept strictly confidential. Various legal frameworks and the promulgation of statistical society codes of ethics buttress our basic conclusion that trust is an essential characteristic of a successful and efficient modern statistical program. We conclude by noting some recent developments that may threaten this trust program, including post 9/11 national security efforts, the rise of new data-gathering and analysis technologies, and the increasing use of private data collectors for government statistical programs.
Article
A series of major security breaches at companies with sensitive personal information has sparked significant attention to the problems with privacy protection in the United States. Currently, the privacy protections in the United States are riddled with gaps and weak spots. Although most industrialized nations have comprehensive data protection laws, the United States has maintained a sectoral approach where certain industries are covered and others are not. In particular, emerging companies known as "commercial data brokers" have frequently slipped through the cracks of U.S. privacy law. In this article, the authors propose a Model Privacy Regime to address the problems in the privacy protection in the United States, with a particular focus on commercial data brokers. Since the United States is unlikely to shift radically from its sectoral approach to a comprehensive data protection regime, the Model Regime aims to patch up the holes in existing privacy regulation and improve and extend it. In other words, the goal of the Model Regime is to build upon the existing foundation of U.S. privacy law, not to propose an alternative foundation. The authors believe that the sectoral approach in the United States can be improved by applying the Fair Information Practices - principles that require the entities that collect personal data to extend certain rights to data subjects. The Fair Information Practices are very general principles, and they are often spoken about in a rather abstract manner. In contrast, the Model Regime demonstrates specific ways that they can be incorporated into privacy regulation in the United States.
Article
Nothing in society poses as grave a threat to privacy as the Internet Service Provider (ISP). ISPs carry their users' conversations, secrets, relationships, acts, and omissions. Until the very recent past, they had left most of these alone because they had lacked the tools to spy invasively, but with recent advances in eavesdropping technology, they can now spy on people in unprecedented ways. Meanwhile, advertisers and copyright owners have been tempting them to put their users' secrets up for sale, and judging from a recent flurry of reports, ISPs are giving in to the temptation and experimenting with new forms of spying. This is only the leading edge of a coming storm of unprecedented and invasive ISP surveillance. This Article proposes an innovative new theory of communications privacy to help policymakers strike the proper balance between user privacy and ISP need. We cannot simply ban aggressive monitoring, because ISPs have legitimate reasons for scrutinizing communications on an Internet teeming with threats. Using this new theory, policymakers will be able to distinguish between an ISP's legitimate needs and mere desires. In addition, this Article injects privacy into the network neutrality debate - a debate about who gets to control innovation on the Internet. Despite the thousands of pages that have already been written about the topic, nobody has recognized that we already enjoy mandatory network neutrality in the form of expansive wiretapping laws. The recognition of this idea will flip the status quo and reinvigorate a stagnant debate by introducing privacy and personal autonomy into a discussion that has only ever been about economics and innovation.
Conference Paper
Re-identification is a major privacy threat to public datasets containing individual records. Many privacy protection algorithms rely on generalization and suppression of "quasi-identifier" attributes such as ZIP code and birthdate. Their objective is usually syntactic sanitization: for example, k-anonymity requires that each "quasi-identifier" tuple appear in at least k records, while l-diversity requires that the distribution of sensitive attributes for each quasi-identifier have high entropy. The utility of sanitized data is also measured syntactically, by the number of generalization steps applied or the number of records with the same quasi-identifier. In this paper, we ask whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any privacy. By contrast, we measure the tradeoff between privacy (how much can the adversary learn from the sanitized records?) and utility, measured as accuracy of data-mining algorithms executed on the same sanitized records. For our experimental evaluation, we use the same datasets from the UCI machine learning repository as were used in previous research on generalization and suppression. Our results demonstrate that even modest privacy gains require almost complete destruction of the data-mining utility. In most cases, trivial sanitization provides equivalent utility and better privacy than k-anonymity, l-diversity, and similar methods based on generalization and suppression.
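A rough sketch of the utility-measurement idea, on synthetic data rather than the UCI datasets used in the paper: the accuracy of a classifier trained on generalized quasi-identifiers is compared against the trivial baseline that ignores them entirely. The generalization step and the outcome variable are invented for illustration.

```python
# Sketch of measuring utility as classifier accuracy on generalized quasi-identifiers
# versus a trivial baseline. Synthetic data; not the paper's experiments.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
age = rng.integers(18, 90, size=2000)
zipcode = rng.integers(10000, 99999, size=2000)
sensitive = (age > 60).astype(int)             # outcome correlated with a quasi-identifier

# "Sanitized" view: ages banded to decades, ZIPs truncated to 3-digit prefixes.
X = np.column_stack([age // 10, zipcode // 100])
X_train, X_test, y_train, y_test = train_test_split(X, sensitive, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("accuracy on generalized QIs:", model.score(X_test, y_test))
print("trivial baseline (majority class):", max(y_test.mean(), 1 - y_test.mean()))
```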
Conference Paper
According to a famous study (10) of the 1990 census data, 87% of the US population can be uniquely identified by gender, ZIP code and full date of birth. This short paper revisits the uniqueness of simple demographics in the US population based on the most recent census data (the 2000 census). We offer a detailed, comprehensive and up-to-date picture of the threat to privacy posed by the disclosure of simple demographic information. Our results generally agree with the findings of (10), although we find that disclosing one's gender, ZIP code and full date of birth allows for unique identification of fewer individuals (63% of the US population) than reported in (10). We hope that our study will be a useful reference for privacy researchers who need simple estimates of the comparative threat of disclosing various demographic data.
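For readers who want the mechanics behind such uniqueness estimates, here is a minimal sketch that computes the share of synthetic records whose (gender, ZIP code, date of birth) combination occurs exactly once. The population size and ZIP distribution are invented; the cited studies work from census tabulations rather than microdata like this.

```python
# Sketch: fraction of a population uniquely identified by gender + ZIP + date of birth.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 100_000
people = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "zip": rng.integers(10000, 11000, size=n).astype(str),   # 1,000 synthetic ZIP codes
    "dob": pd.to_datetime("1950-01-01")
           + pd.to_timedelta(rng.integers(0, 20000, size=n), unit="D"),
})

counts = people.groupby(["gender", "zip", "dob"]).size()
unique_share = (counts == 1).sum() / n
print(f"uniquely identified by gender+ZIP+DOB: {unique_share:.1%}")
```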
Article
Research that utilizes human subjects is a large and growing enterprise. Tens of millions of individuals have already participated as subjects in one or more research protocols, and millions more participate each year. Government and industry combined spend billions annually to support as many as 20,000 research studies, many of which are individually large and complex enterprises in their own right. These numbers are, if anything, likely to increase even further. Besides the growth in research, two other trends are apparent. First, research-related litigation is on the rise and appears likely to become even more widespread. Sparked at least in part by recent widely publicized instances of harm befalling research subjects, plaintiffs' attorneys are suing both more often and more creatively. Related to this is the second trend: public trust in research is declining and, as a result, at least some types of research are struggling to find adequate numbers of human subjects. As a result of these trends, exposure to potential liability and public perception are both increasingly important. Concomitant with all of this research is the discovery and generation of tremendous quantities of data specific to individual subjects, including, but not limited to, genetic information. Much of this data is irrelevant to subjects' interests because it lacks predictive value, has uncertain meaning, or is otherwise uninformative. Some, however, is different: some of the personal data learned during the course of research with human subjects bears directly on individuals' health. Despite the fact that much individual data has already been generated and that both the quantity and the quality of data generated seem likely to increase, there is a lack of clear guidance for researchers regarding whether and when such information should be divulged to the subjects on whom it bears. In this environment, the potential exists for litigation alleging that a researcher was negligent for failure to disclose to a subject an individual research finding of medical significance. Such litigation would raise a heretofore-unanswered question: should a researcher have a legal duty to disclose medically significant individual research findings to a subject?
Article
Information about an individual's place and date of birth can be exploited to predict his or her Social Security number (SSN). Using only publicly available information, we observed a correlation between individuals' SSNs and their birth data and found that for younger cohorts the correlation allows statistical inference of private SSNs. The inferences are made possible by the public availability of the Social Security Administration's Death Master File and the widespread accessibility of personal information from multiple sources, such as data brokers or profiles on social networking sites. Our results highlight the unexpected privacy consequences of the complex interactions among multiple data sources in modern information economies and quantify privacy risks associated with information revelation in public forums.
Article
In the trauma surrounding mass disasters, the need to identify victims accurately and as soon as possible is critical. DNA identification testing is increasingly used to identify human bodies and remains where the deceased cannot be identified by traditional means. This form of testing compares DNA taken from the body of the deceased with DNA taken from their personal items (e.g. hairbrush, toothbrush etc.) or from close biological relatives. DNA identification testing was used to identify the victims of the terrorist attack on the World Trade Center in New York on September 11, 2001, and of the victims of the Tsunami that hit Asia on December 26, 2004. Shortly after the 9/11 attack, police investigators asked the victims' families for personal items belonging to the missing, and for DNA samples from family members themselves. The New York medical examiner's office coordinated the DNA identification testing program; however, some of the identification work was contracted out to private laboratories.
Article
Large-scale genomic databases are becoming increasingly common. These databases, and the underlying biobanks, pose several substantial legal and ethical problems. Neither the usual methods for protecting subject confidentiality, nor even anonymity, are likely to protect subjects' identities in richly detailed databases. Indeed, in these settings, anonymity is itself ethically suspect. New methods of consent will need to be created to replace the blanket consent common to such endeavors, with a consent procedure that gives subjects some real control over what they might consider inappropriate use of their information and biological material. Through their use, these biobanks are also likely to yield information that will be of some clinical significance to the subjects, information that they should have access to. Failure to adjust to these new challenges is not only legally and ethically inappropriate, but puts at risk the political support on which biomedical research depends.