Article

Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization

Authors:
Paul Ohm

Abstract

Computer scientists have recently undermined our faith in the privacy-protecting power of anonymization, the name for techniques for protecting the privacy of individuals in large databases by deleting information like names and social security numbers. These scientists have demonstrated they can often 'reidentify' or 'deanonymize' individuals hidden in anonymized data with astonishing ease. By understanding this research, we will realize we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed. This mistake pervades nearly every information privacy law, regulation, and debate, yet regulators and legal scholars have paid it scant attention. We must respond to the surprising failure of anonymization, and this Article provides the tools to do so.


... With the exponential boom in smart devices and the growing popularity of big data, companies and institutions have been gathering more and more personal data from users, which is then either published or sold for research or commercial purposes. Although the published data is typically anonymized, i.e., explicit identifiers of the users, such as names and dates of birth, are removed, there has been growing concern over potential privacy leakage from anonymized data, approached from legal [1] and corporate [2] points of view. These concerns are also articulated in the respective literature through successful practical de-anonymization attacks on real data [3]-[17]. ...
... Furthermore, for each i, Y_{i,4}^{(1)} and Y_{i,4}^{(2)} are noisy replicas of X_{i,4}. Our goal is to estimate the row permutation Θ_n, which in this example is given by Θ_n(1) = 5, Θ_n(2) = 1, Θ_n(3) = 4, Θ_n(4) = 3, and Θ_n(5) = 2, by matching the rows of D^(1) and D^(2). Here the i-th row of D^(1) corresponds to the Θ_n(i)-th row of D^(2). ...
... Our goal is to estimate the row permutation Θ_n, which in this example is given by Θ_n(1) = 5, Θ_n(2) = 1, Θ_n(3) = 4, Θ_n(4) = 3, and Θ_n(5) = 2, by matching the rows of D^(1) and D^(2). Here the i-th row of D^(1) corresponds to the Θ_n(i)-th row of D^(2). ... a converse result using perfect recovery as the error criterion. ...
Preprint
The re-identification or de-anonymization of users from anonymized data through matching with publicly-available correlated user data has raised privacy concerns, leading to the complementary measure of obfuscation in addition to anonymization. Recent research provides a fundamental understanding of the conditions under which privacy attacks, in the form of database matching, are successful in the presence of obfuscation. Motivated by synchronization errors stemming from the sampling of time-indexed databases, this paper presents a unified framework considering both obfuscation and synchronization errors and investigates the matching of databases under noisy entry repetitions. By investigating different structures for the repetition pattern, replica detection and seeded deletion detection algorithms are devised and sufficient and necessary conditions for successful matching are derived. Finally, the impacts of some variations of the underlying assumptions, such as adversarial deletion model, seedless database matching and zero-rate regime, on the results are discussed. Overall, our results provide insights into the privacy-preserving publication of anonymized and obfuscated time-indexed data as well as the closely-related problem of the capacity of synchronization channels.
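The row-permutation example above can be made concrete with a small sketch. The snippet below is a minimal, hypothetical illustration (not the replica detection or seeded deletion detection algorithms analyzed in the preprint): it recovers Θ_n by assigning each row of the original database to its closest obfuscated counterpart.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 5, 4                      # 5 users (rows), 4 attributes (columns)
D1 = rng.normal(size=(n, m))     # original database D^(1)

# Ground-truth row permutation: row i of D^(1) corresponds to row theta[i] of D^(2).
theta = np.array([4, 0, 3, 2, 1])   # 0-indexed version of Theta_n(1..5) = 5, 1, 4, 3, 2

# D^(2): rows permuted and obfuscated with small additive noise
D2 = np.empty_like(D1)
D2[theta] = D1 + rng.normal(scale=0.1, size=(n, m))

# Estimate the permutation by matching each row of D^(1) to the nearest row of D^(2).
# For heavier noise, a one-to-one assignment (e.g. scipy.optimize.linear_sum_assignment)
# would be preferable to this greedy nearest-row rule.
dists = np.linalg.norm(D1[:, None, :] - D2[None, :, :], axis=2)   # n x n distance matrix
theta_hat = dists.argmin(axis=1)

print("true permutation:     ", theta)
print("estimated permutation:", theta_hat)
```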
... This user-level microdata is then published or shared for scientific and/or commercial purposes, after anonymization, which refers to the removal of any explicit identifiers. However, concerns over the insufficiency of simple anonymization have been articulated by the scientific [1] and corporate [2] communities. These concerns were further validated and amplified as researchers devised practical privacy attacks on real data [3]-[7] to show the vulnerability of anonymization on its own. ...
... The columns circled in red are deleted. Our goal is to estimate the row permutation Θ_n, which in this example is given by Θ_n(1) = 5, Θ_n(2) = 1, Θ_n(3) = 4, Θ_n(4) = 3, and Θ_n(5) = 2, by matching the rows of D^(1) and D^(2), under column deletions with I_del = (2, 5). Here the i-th row of D^(1) corresponds to the Θ_n(i)-th row of D^(2). ...
... Our goal is to estimate the row permutation Θ_n, which in this example is given by Θ_n(1) = 5, Θ_n(2) = 1, Θ_n(3) = 4, Θ_n(4) = 3, and Θ_n(5) = 2, by matching the rows of D^(1) and D^(2), under column deletions with I_del = (2, 5). Here the i-th row of D^(1) corresponds to the Θ_n(i)-th row of D^(2). ... been rigorously investigated from information-theoretic and statistical perspectives [13]-[21]. ...
Preprint
The de-anonymization of users from anonymized microdata through matching or aligning with publicly-available correlated databases has been of scientific interest recently. While most of the rigorous analyses of database matching have focused on random-distortion models, the adversarial-distortion models have been wanting in the relevant literature. In this work, motivated by synchronization errors in the sampling of time-indexed microdata, matching (alignment) of random databases under adversarial column deletions is investigated. It is assumed that a constrained adversary, which observes the anonymized database, can delete up to a $\delta$ fraction of the columns (attributes) to hinder matching and preserve privacy. Column histograms of the two databases are utilized as permutation-invariant features to detect the column deletion pattern chosen by the adversary. The detection of the column deletion pattern is then followed by an exact row (user) matching scheme. The worst-case analysis of this two-phase scheme yields a sufficient condition for the successful matching of the two databases, under the near-perfect recovery condition. A more detailed investigation of the error probability leads to a tight necessary condition on the database growth rate, and in turn, to a single-letter characterization of the adversarial matching capacity. This adversarial matching capacity is shown to be significantly lower than the "random" matching capacity, where the column deletions occur randomly. Overall, our results analytically demonstrate the privacy-wise advantages of adversarial mechanisms over random ones during the publication of anonymized time-indexed data.
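The two-phase idea (detect the deleted columns from permutation-invariant column histograms, then match rows exactly) can be sketched on a toy discrete-valued database. This is an illustrative simplification under assumed distinct column histograms, not the worst-case-optimal scheme analyzed in the preprint.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 20, 8
D1 = rng.integers(0, 10, size=(n, m))          # original database, discrete attributes

# The adversary deletes columns I_del = {2, 5}; the publisher permutes the rows.
deleted = {2, 5}
kept = [j for j in range(m) if j not in deleted]
theta = rng.permutation(n)                     # row i of D1 corresponds to row theta[i] of D2
D2 = np.empty((n, len(kept)), dtype=int)
D2[theta] = D1[:, kept]

# Phase 1: detect surviving columns via permutation-invariant histograms,
# represented here as sorted value multisets (assumed distinct for this alphabet size).
def histogram(col):
    return tuple(sorted(col.tolist()))

col_map, used = [], set()
for k in range(D2.shape[1]):
    j = next(j for j in range(m) if j not in used and histogram(D1[:, j]) == histogram(D2[:, k]))
    col_map.append(j)
    used.add(j)

# Phase 2: exact row matching on the detected surviving columns.
D1_kept = D1[:, col_map]
theta_hat = np.array([next(r for r in range(n) if (D2[r] == D1_kept[i]).all()) for i in range(n)])

print("detected surviving columns:", col_map)          # expect the kept columns, in order
print("row permutation recovered :", bool((theta_hat == theta).all()))
```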
... The ease of reidentification will also wreak havoc on our legal systems, because our faith in robust anonymization has thoroughly permeated our privacy laws and regulations (...) (OHM, 2010, pp. 1706-1707). The Brazilian General Data Protection Law placed personal data and anonymized data at opposite poles, dealing with them expressly in its Article 12, but with caveats contained in the paragraphs that follow the caput: ...
... It is in this context that the concept of data entropy is advanced: the term is used in physics to measure, in a well-defined thermodynamic system, its degree of irreversibility; for this reason, it seems to capture precisely the spirit of the "reasonableness" inscribed in the law. Its use is attributed to Paul Ohm (2010, p. 1760). The "corridor metaphor" cited in the excerpt above is didactically explained by the author by imagining that a given person stands at one end of a long corridor dedicated solely to her, while at the opposite end lies that person's ruinous fact, that is, some secret or detail that an enemy could use to harm her; along this corridor, however, there is a series of closed and locked doors, and each lock requires a different key, which represents the database attributes that must be reconnected, or the links in the chain of inferences that must be established, to connect the person to the fact. Ohm (2010, pp. 1760-1761) explains that many other people are the keepers of the keys to some of the doors, that is, each controls a database, and each key represents the inferences that person can make using the data they hold. ...
Article
Full-text available
Data anonymization is a topic of particular relevance under the Brazilian General Data Protection Law (Law No. 13,709/2018), since it removes the data from the law's scope of application and from all the consequences legally provided for the protection of situations involving personal data. What this brief study problematizes, however, is precisely the haziness of the existing anonymization processes and the legal uncertainty caused by their incomplete and unclear understanding. The research hypothesis takes "entropy" as a filter suited to delimiting anonymization processes, whether reversible or not, and to making the protective instruments contained in the governing law applicable to such processes. The research uses a deductive method of approach, moving from a broad concept of personal data to a more specific concept of anonymized data, and then weighing the impacts of reversal and the adequacy of the research hypothesis for resolving the legal uncertainty at issue. In addition, the work undertakes a bibliographic and doctrinal analysis and then presents its final considerations, from which it seeks to extract a more assertive understanding of the problem described.
... Although it seems fairly clear from Textbox 1 that the rule intends for investigators to rely on the use of 3-digit zip codes (compared with 5-digit zip codes), not all who read this stipulation see it that way. There are many reasons for this, including various misleading representations of the rule found in legal web-based documentation and in the literature on public health and disease mapping [3][4][5][6][7][8][9][10][11]. The following section explores how zip codes have come to play a key role in the safe harbor rule. ...
... • Broken promises of privacy: responding to the surprising failure of anonymization (Ohm [5]) ...
... This is likely because the privacy rule was designed with tabular data in mind, and much medical research involves working with data in its tabular form [9]. For these investigators, a zip code is primarily a 5-digit number that can be reduced to a 3-digit one [5]. For example, an analyst receives a spreadsheet of patient data from which to build a risk model. ...
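For illustration, the 3-digit reduction described above is mechanically simple; a minimal, hypothetical helper might look like the following. The set of restricted prefixes (three-digit areas containing 20,000 or fewer people, which the Safe Harbor rule requires to be replaced by "000") is left as a placeholder to be filled from current census data.

```python
# Hypothetical helper: reduce 5-digit ZIP codes to the 3-digit form permitted by
# the HIPAA Safe Harbor rule. RESTRICTED_PREFIXES must be populated from census
# data with the three-digit areas covering 20,000 or fewer people, which map to "000".
RESTRICTED_PREFIXES: set[str] = set()   # placeholder, e.g. loaded from a census file

def safe_harbor_zip(zip5: str) -> str:
    prefix = zip5.strip()[:3]
    return "000" if prefix in RESTRICTED_PREFIXES else prefix

print(safe_harbor_zip("20037"))   # -> "200"
```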
... Pseudonymization and anonymization techniques include altering and removing explicit identifiers, such as names, addresses, and national identity numbers, from a data set. However, in pseudonymization, a person can still be re-identified by data linking, leading to a reduction in k-anonymity [1,2], and anonymization techniques have failed multiple times in the past [3]. To further reduce the risk of re-identification, data synthesis can be used. The impacts of data synthesis were measured based on the general and specific utility and quality of the information in the synthetic data set compared to the original data set. ...
... Since a lower ε value means more noise is added to the data, questions of utility (the ability to use data for analysis) versus privacy (risk of disclosure) started to emerge [23]. Despite these efforts, there is a growing consensus that traditional anonymization techniques are insufficient, as they have failed multiple times in the past [1,3,23,24]. ...
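The ε-versus-noise tradeoff can be made concrete with the standard Laplace mechanism from differential privacy: smaller ε means stronger privacy but noisier answers. The sketch below is a generic illustration for a counting query of sensitivity 1, not code from the works cited above.

```python
import numpy as np

rng = np.random.default_rng(2)

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a counting query (sensitivity 1) via the Laplace mechanism."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

true_count = 1234
for eps in (0.1, 1.0, 10.0):
    noisy = np.array([laplace_count(true_count, eps) for _ in range(1000)])
    print(f"epsilon={eps:5.1f}  mean abs error={np.mean(np.abs(noisy - true_count)):6.2f}")
```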
Article
Full-text available
Clinical data analysis could lead to breakthroughs. However, clinical data contain sensitive information about participants that could be utilized for unethical activities, such as blackmailing, identity theft, mass surveillance, or social engineering. Data anonymization is a standard step during data collection, before sharing, to overcome the risk of disclosure. However, conventional data anonymization techniques are not foolproof and also hinder the opportunity for personalized evaluations. Much research has been done for synthetic data generation using generative adversarial networks and many other machine learning methods; however, these methods are either not free to use or are limited in capacity. This study evaluates the performance of an emerging tool named synthpop, an R package producing synthetic data as an alternative approach for data anonymization. This paper establishes data standards derived from the original data set based on the utilities and quality of information and measures variations in the synthetic data set to evaluate the performance of the data synthesis process. The methods to assess the utility of the synthetic data set can be broadly divided into two approaches: general utility and specific utility. General utility assesses whether synthetic data have overall similarities in the statistical properties and multivariate relationships with the original data set. Simultaneously, the specific utility assesses the similarity of a fitted model’s performance on the synthetic data to its performance on the original data. The quality of information is assessed by comparing variations in entropy bits and mutual information to response variables within the original and synthetic data sets. The study reveals that synthetic data succeeded at all utility tests with a statistically non-significant difference and not only preserved the utilities but also preserved the complexity of the original data set according to the data standard established in this study. Therefore, synthpop fulfills all the necessities and unfolds a wide range of opportunities for the research community, including easy data sharing and information protection.
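As a rough, library-agnostic illustration of the "general utility" idea (synthpop itself is an R package; the helper, column names, and toy data below are hypothetical), one might compare marginal statistics and pairwise correlations between the original and synthetic tables:

```python
import numpy as np
import pandas as pd

def general_utility_report(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Compare simple marginal statistics and pairwise correlations of numeric columns."""
    num = original.select_dtypes("number").columns
    report = pd.DataFrame({
        "orig_mean": original[num].mean(),
        "synth_mean": synthetic[num].mean(),
        "orig_std": original[num].std(),
        "synth_std": synthetic[num].std(),
    })
    corr_gap = (original[num].corr() - synthetic[num].corr()).abs().to_numpy()
    report.attrs["max_corr_gap"] = float(np.nanmax(corr_gap))
    return report

# Toy usage with random data standing in for a clinical table and its synthetic release
rng = np.random.default_rng(3)
orig = pd.DataFrame(rng.normal(size=(500, 3)), columns=["age", "bmi", "score"])
synth = orig + rng.normal(scale=0.05, size=orig.shape)   # stand-in for a synthesizer's output
rep = general_utility_report(orig, synth)
print(rep)
print("max correlation gap:", rep.attrs["max_corr_gap"])
```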
... Academic research into privacy looks at privacy either from the perspective of the individual and how they may perceive privacy, e.g., [12,13], or whether we have a right to expect privacy [14,15]. Privacy has also been considered through a technical lens looking, for example, at how technical manipulation of data can be achieved. ...
... Gavison defined this as the "extent to which we are known to others; the extent to which we are the subject of others' attention; and the extent to which others have physical access to us" [26]. The loss of data privacy can occur in many ways, including personal information being disclosed without knowledge [15,27] or consent [28], or through leakage, e.g., as a result of a data breach [29]. ...
Article
Full-text available
The introduction of Data Protection by Default and Design (DPbDD), brought in as part of the General Data Protection Regulation (GDPR) in 2018, has necessitated that businesses review how best to incorporate privacy into their processes in a transparent manner, so as to build trust and improve decisions around privacy best practice. To address this issue, this paper presents a 7-stage data lifecycle, supported by nine privacy goals that, together, will help practitioners manage data holdings throughout the data lifecycle. The resulting data lifecycle (7-DL) was created as part of the Ideal-Cities project, a Horizon-2020 Smart-city initiative that seeks to facilitate data re-use and/or repurposing. We evaluate 7-DL through peer review and an exemplar worked example that applies the data lifecycle to a real-time life-logging fire incident scenario, one of the Ideal-Cities use cases, to demonstrate the applicability of the framework.
... For example, in one study, data such as postal code, date of birth, and gender were used to uniquely identify 87% of Americans [10]. There are indications that movie preferences can also generate similar identifications [11]. This process of combining databases that were intended to be kept apart has been labelled reidentification [11], and is seen as a "game changer" in digital privacy protection because of its implications. ...
... There are indications that movie preferences can also generate similar identifications [11]. This process of combining databases that were intended to be kept apart has been labelled reidentification [11], and is seen as a "game changer" in digital privacy protection because of its implications. ...
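The uniqueness behind figures like the 87% estimate can be checked directly on any table: group records by the quasi-identifiers and count how many fall in a group of size one. A minimal pandas sketch, with hypothetical column names and toy data:

```python
import pandas as pd

def fraction_unique(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Fraction of records that are unique on the given quasi-identifier combination."""
    counts = df[quasi_identifiers].value_counts(dropna=False)
    return float((counts == 1).sum()) / len(df)

# Toy usage: zip code, birth date, and sex as quasi-identifiers
df = pd.DataFrame({
    "zip": ["02139", "02139", "10001", "10001", "94105"],
    "birth_date": ["1980-01-01", "1975-06-30", "1975-06-30", "1975-06-30", "1990-12-12"],
    "sex": ["F", "M", "M", "M", "F"],
})
print(fraction_unique(df, ["zip", "birth_date", "sex"]))   # -> 0.6 (3 of 5 records are unique)
```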
Article
Full-text available
Concerns about digital privacy are so ubiquitous that they have become part of the wallpaper of life, but the implications of large data and predictive analytics on privacy merit serious scholarly attention. Recently a colleague recounted that he had purchased potato chips at a store with cash and was surprised the next day to be targeted with advertisements for the same chips on his home computing device. This anecdote encapsulates nicely the developments with digital privacy and surveillance in a world where the consumer is not aware of the hidden workings of corporate surveillance. North America in particular has entered into an era where the private human experience is being captured through digital devices, with or without permission, and sold for profit. The reality is that neither policy nor education has kept pace with these digital developments, to the point that vast amounts of data are collected, synthesized and sold without the consumer's express permission or cognisance. Data are captured continuously from smart devices and Closed Circuit Television (CCTV) footage, documenting individuals' locations and preferences. Many personal elements of life are voluntarily shared online such as heart rate and sleep habits. The "creep" of data collected with and without permission is greater than most people realize. The educational implications of this surveillance need to be explored. Parents, students, educational leaders and the general public have a right to know how digital surveillance works and the implications for predictive analytics on their futures and their decision-making in a democratic society. Policy gaps are evident surrounding digital privacy and education. More critical, interdisciplinary approaches to policy analysis are needed in education, guided by a critical policy analysis framework that interrogates all aspects of policy related to this emerging issue.
... Indeed, algorithms make it possible to coordinate "sensitive" data (for which the consent of the data subject is required) with non-sensitive data, reconstructing each person's entire individual profile, thereby injuring both their privacy (owing to the total invasion of their private sphere) and their dignity (rendering them "naked" and "transparent" before whoever comes into possession of their re-processed data), and exposing them to a serious risk of discrimination. This occurs all the more easily in the presence of large quantities of information, even if anonymized, for example through the cross-referencing of different databases, thus allowing the re-identification of individuals even starting from anonymous data (PELLECCHIA, 2020, p. 360; OHM, 2010, p. 1701). In such a context one can speak of the massive collection of personal data following the expression of a "knowingly uninformed consent" (ZUBOFF, cit.), given also out of fatigue, in view of the constant repetition, a sort of "stalking", of requests to accept the privacy notices found on the websites users browse. ...
Chapter
SUMMARY: 1. Introduction. – 2. The reality principle, its representation, and the massive collection of data. – 3. Representation and discrimination: a question that is not merely terminological. – 4. The formation of black boxes between privacy and the principle of non-discrimination. – 5. Brief conclusions: adhere to reality or manipulate it? – 6. Bibliography.
... For example, Latanya Sweeney established that the majority of people in the USA can be uniquely identified by the combination of ZIP code, birth date, and sex (Sweeney, 2000). The k-mer issue has been extensively discussed and potential remedies debated (Sweeney, 2002; Ohm, 2010; Sweeney et al., 2018; Goldacre and Morley, 2022). ...
... However, a large body of research has proven that the traditional de-identification model-which aims to anonymize individual records by modifying them-is generally inadequate to provide a good tradeoff between privacy and statistical utility for modern data (see e.g. Ohm [28] and Rocher et al. [34]). ...
Preprint
Personal data collected at scale from surveys or digital devices offers important insights for statistical analysis and scientific research. Safely sharing such data while protecting privacy is however challenging. Anonymization allows data to be shared while minimizing privacy risks, but traditional anonymization techniques have been repeatedly shown to provide limited protection against re-identification attacks in practice. Among modern anonymization techniques, synthetic data generation (SDG) has emerged as a potential solution to find a good tradeoff between privacy and statistical utility. Synthetic data is typically generated using algorithms that learn the statistical distribution of the original records, to then generate "artificial" records that are structurally and statistically similar to the original ones. Yet, the fact that synthetic records are "artificial" does not, per se, guarantee that privacy is protected. In this work, we systematically evaluate the tradeoffs between protecting privacy and preserving statistical utility for a wide range of synthetic data generation algorithms. Modeling privacy as protection against attribute inference attacks (AIAs), we extend and adapt linear reconstruction attacks, which have not been previously studied in the context of synthetic data. While prior work suggests that AIAs may be effective only on few outlier records, we show they can be very effective even on randomly selected records. We evaluate attacks on synthetic datasets ranging from 10^3 to 10^6 records, showing that even for the same generative model, the attack effectiveness can drastically increase when a larger number of synthetic records is generated. Overall, our findings prove that synthetic data is subject to privacy-utility tradeoffs just like other anonymization techniques: when good utility is preserved, attribute inference can be a risk for many data subjects.
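As a simplified stand-in for the attribute inference attacks studied in the preprint (the paper itself extends linear reconstruction attacks, which are not shown here), the toy sketch below infers a target's sensitive attribute from a synthetic release by nearest-neighbour lookup on known quasi-identifiers; all data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical synthetic release: columns 0-2 are quasi-identifiers, column 3 is sensitive.
synthetic = rng.normal(size=(1000, 4))
synthetic[:, 3] = synthetic[:, 0] + 0.1 * rng.normal(size=1000)   # sensitive value correlated with QI 0

def infer_sensitive(known_qis: np.ndarray, synthetic: np.ndarray, k: int = 5) -> float:
    """Predict a target's sensitive attribute as the mean over the k nearest synthetic records."""
    d = np.linalg.norm(synthetic[:, :3] - known_qis, axis=1)
    nearest = np.argsort(d)[:k]
    return float(synthetic[nearest, 3].mean())

target_qis = np.array([0.5, -1.2, 0.3])        # what the adversary already knows about the target
print("inferred sensitive value:", infer_sensitive(target_qis, synthetic))
```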
... In contrast to data anonymization, data re-identification involves matching previously anonymized data with its original owners. The general ease of re-identification means that the promised privacy of data anonymization is a weak commitment and that data privacy laws must also be applied to anonymized data (Ghinita et al., 2009; Ohm, 2010; Rubinstein & Hartzog, 2015). One way to address the risk of re-identification is to prevent the so-called mosaic effect (Czajka et al., 2014). ...
Chapter
Full-text available
Computational Social Science (CSS) has grown exponentially as the process of datafication and computation has increased. This expansion, however, is yet to translate into effective actions to strengthen public good in the form of policy insights and interventions. This chapter presents 20 limiting factors in how data is accessed and analysed in the field of CSS. The challenges are grouped into the following six categories based on their area of direct impact: Data Ecosystem, Data Governance, Research Design, Computational Structures and Processes, the Scientific Ecosystem, and Societal Impact. Through this chapter, we seek to construct a taxonomy of CSS governance and policy challenges. By first identifying the problems, we can then move to effectively address them through research, funding, and governance agendas that drive stronger outcomes.
... Re-identification attacks [49] combine data from multiple collections to re-identify a record from outsourced, public or open data records. A classic, well-known re-identification attack [50] makes use of a voter registration list to re-identify a government health record among the records released by a health insurance company. Impersonation attacks [51] monitor the communication pattern of a device and try to simulate it. ...
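The classic voter-list linkage can be sketched in a few lines: join the "anonymized" records with a public register on shared quasi-identifiers and keep the joins that resolve to a single person. The column names and data below are hypothetical.

```python
import pandas as pd

# Hypothetical "anonymized" health records (names removed, quasi-identifiers kept)
health = pd.DataFrame({
    "zip": ["02138", "02138", "94110"],
    "birth_date": ["1960-07-31", "1985-02-02", "1972-11-05"],
    "sex": ["F", "M", "F"],
    "diagnosis": ["hypertension", "asthma", "diabetes"],
})

# Hypothetical public voter registration list (identified)
voters = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Lee", "D. Kim"],
    "zip": ["02138", "02138", "94110", "94110"],
    "birth_date": ["1960-07-31", "1985-02-02", "1972-11-05", "1990-01-01"],
    "sex": ["F", "M", "F", "M"],
})

linked = health.merge(voters, on=["zip", "birth_date", "sex"], how="inner")
# Joins that resolve to exactly one voter re-identify that health record
hits = linked.groupby(["zip", "birth_date", "sex"]).filter(lambda g: len(g) == 1)
print(hits[["name", "diagnosis"]])
```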
Article
Full-text available
Recent advances in hardware and information technology have accelerated the proliferation of smart and interconnected devices, facilitating the rapid development of the Internet of Things (IoT). IoT applications and services are widely adopted in environments such as smart cities, smart industry, autonomous vehicles, and eHealth. As such, IoT devices are ubiquitously connected, transferring sensitive and personal data without requiring human interaction. Consequently, it is crucial to preserve data privacy. This paper presents a comprehensive survey of recent Machine Learning (ML)- and Deep Learning (DL)-based solutions for privacy in IoT. First, we present an in-depth analysis of current privacy threats and attacks. Then, for each ML architecture proposed, we present the implementations, details, and the published results. Finally, we identify the most effective solutions for the different threats and attacks.
... The attribution of such data to a natural person is therefore only possible through the use of additional information [81]. However, the subversion of anonymisation is well established [112] and is particularly worrying in the age of social networking [123]. ...
Thesis
The Internet of Things (IoT), especially for consumer applications, is often described in terms of either its great promise for new and improved services, or its wholesale invasion of user privacy. Investigating this dichotomy, describing the nature thereof, and proposing a remedy jointly constitute the core of the project and contribution presented herein. The IoT is characterised by relentless miniaturisation, cost reduction, and the continued inclusion of new market segments, all in aid of delivering on the promise of truly ubiquitous computing. As one of the most prominent areas for IoT implementation, networked consumer electronics shows a rapid pace of adoption, recasts legacy devices as connected "smart" devices, and presents an extensive list of privacy and security failures. Making use of connected devices at the edge, consumer IoT implementations supply data to more capable off-site systems for analysis and value extraction. This supplies the service provider with valuable data but also affords the customer new services and device functionality. Unfortunately, such devices and systems are all too often rolled out with little to no regard for privacy or regulatory compliance. We contend that the best option for addressing these issues is a new "by design" approach which is based on an investigation of current practice and theory and framed within modern industry best practice. We act on this contention by considering a wide range of related contemporary research and legislation, conducting testbed-based research and, finally, deriving a new domain extension for the Systems Modelling Language (SysML) connecting formerly discrete privacy- and compliance-focused elements. Consequently, this domain extension is called DISCREET: DomaIn extenSion for Compliance and pRivacy by dEsign in consumEr ioT.
... The use of this technological resource, with ever-increasing intensity, reveals a major shift in the axis of personal data protection. Even if one considers that the fundamental right, as such, tends to invite an anthropocentric interpretation, data analyses have not been carried out in an individualized way, that is, parameterized from one interlocutor directly to another. On the contrary, it is increasingly visible that decision-making based on the analysis of personal data takes place in a manner opaque to the data subject (OHM, 2010; RUBINSTEIN, 2013). The cross-referencing of databases and cross-tracking across multiple platforms (networks and peripherals) are factors and methods that lead to ...
Thesis
Full-text available
This study addresses the definition, treatment, and development of the concept of privacy and of the protection of personal data as related to personality rights. The methodology used to develop this study included documentary, legal-theoretical, and bibliographic-dogmatic research. In the documentary research, official documents of national and international bodies and conventions, constitutions, and abundant relevant legislation were consulted. In the bibliographic research, the theoretical premises of distinguished theorists were investigated, in books, articles, periodicals, dissertations, theses, manuscripts, and online sources dealing with the topic at hand. For the theoretical framework of the research, the pertinence of the concept of private autonomy was discussed, used as the basis for the argument about the formulation of variables and the interpretation of the relationships that may arise between them and the object of study of this thesis. The general objective was to study the regulation of personal data on the basis of autonomy and heteronomy, to provide greater legal certainty, and to improve the existing techniques for its protection. To that end, it was decided to discuss the regulation of personal data, a task supported by the discussion between autonomy and heteronomy, highlighting the right to privacy and to informational self-determination in contemporary society, their connection to fundamental rights, and the analysis of latent situations involving individual and collective protection, standards of civil liability, regulation, and self-regulation. As specific objectives, the improvement of legislative policy related to the protection of personal data and its structuring application were highlighted, along with privacy and informational self-determination in the structuring of business models. Previously defined theoretical premises were questioned, and the outcome of the discussion pointed to a new interpretive panorama for the right to privacy and to informational self-determination.
... However, as databases become larger and more integrated, these attempts increasingly prove to be ineffective. After reviewing the computer science and legal literatures on anonymity and reidentification, Paul Ohm concludes that "Data can be either useful or perfectly anonymous but never both" (Ohm 2010: 1704). This would appear to hold true in the educational context. ...
Chapter
Full-text available
In the late 1990s, schools in much of the developed world connected most, if not all, of their classrooms to the Internet in an attempt to provide students with universal access to networked communications technologies (Ginsberg and Foster 1998). For the most part, policy-makers, technology companies, and educators agreed that this connectivity would improve learning outcomes and prepare young people for their role as workers in the emerging information economy (Ginsberg and Foster 1998; Steeves 2010). The rhetoric of the day also celebrated the child as a natural technology user who would readily adapt to technology and use it to innovate and generate wealth (Shade and Dechief 2005). The jury is still out on whether or not the networked classroom has delivered on its initial promise. The track record of technology in improving learning outcomes has been mixed at best (see for example Young et al. 2003) and, although today's students may be better at uploading videos to social media than their parents, they are not particularly savvy when it comes to evaluating the veracity and reliability of online content (Steeves 2012b) or using networked resources to conduct simple research (Quarton 2003). Perhaps most
... Ultimately, increased datafication leads to greater regulation of borders and restricts freedom of movement (Ajana 2015: 13). Here, it is useful to draw on critical data studies, where questions are asked pertaining to power dynamics involved within data-driven processes that create and reproduce social and cultural divisions (Andrejevic, Hearn and Kennedy 2015: 385), limit privacy (Ohm 2010; Pasquale 2015), result in a lack of traditional informed consent (Fairfield and Shtein 2014; Metcalf and Crawford 2016), further discrimination (boyd and Crawford, 2012) and effectively enact exclusion and hostility. Specifically, datafied or 'smart' borders, Vukov suggests, create a new intersection between biopolitical and algorithmic forms of governance (2016: 81; see also De Genova 2013). ...
... One of the issues is to what extent de-identification or anonymization protects individuals, and how a sense of privacy is related to care. Re-identification techniques are improving; as has been known for some time, the more data that is combined, the better the chance of re-identification (Ohm 2010). Genomes are unique, and therefore identifiable. ...
Chapter
The field of consumer health informatics (CHI) is constantly evolving. The literature that supports CHI includes a broad scope of expertise and disciplines, which makes discovering relevant literature a challenge. Through a library and information science lens, we provide foundational familiarity with the structures of information discovery systems and considerations that impact the discovery of CHI literature. We outline the steps included in the design and execution phases of a CHI-related literature search. We also provide an example search using wearable technologies and a case in point that illustrates how terminologies differ across databases. We describe the importance of operationalizing elements of a research question and strategically combining search terms in a query to enhance the findability of CHI literature. The reader will gain a database-agnostic understanding of the structures and factors relevant to the retrieval of CHI literature, which should be particularly useful as the field of CHI and the tools for retrieving literature continuously change.
... Statistical disclosure control methods address this problem by altering values in the data to mask their true values and thus limit disclosure leaks. However, ensuring that the released data protects user privacy while maintaining its utility is challenging, especially for high-dimensional data (Ohm 2009). Rubin (1993) proposed using synthetically generated data as a tool to enable data analysis while mitigating disclosure risk. ...
Preprint
Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over what statistical patterns are captured, leading to concerns over privacy protection. While synthetic records are not linked to a particular real-world individual, they can reveal information about users indirectly which may be unacceptable for data owners. There is thus a need to empirically verify the privacy of synthetic data -- a particularly challenging task in high-dimensional data. In this paper we present a general framework for synthetic data generation that gives data controllers full control over which statistical properties the synthetic data ought to preserve, what exact information loss is acceptable, and how to quantify it. The benefits of the approach are that (1) one can generate synthetic data that results in high utility for a given task, while (2) empirically validating that only statistics considered safe by the data curator are used to generate the data. We thus show the potential for synthetic data to be an effective means of releasing confidential data safely, while retaining useful information for analysts.
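One very simple instance of "choose which statistics to preserve" is to fit only the mean vector and covariance matrix of the original numeric data and sample synthetic records from a Gaussian with those moments, so that nothing beyond second-order structure is carried over. This is an illustrative simplification with made-up data, not the framework proposed in the preprint.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for confidential numeric microdata
original = rng.gamma(shape=2.0, scale=1.5, size=(2000, 3))

# The data controller chooses the statistics to preserve: here, mean and covariance only.
mu = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# Synthetic records carry those second-order statistics (in expectation) and nothing else.
synthetic = rng.multivariate_normal(mu, cov, size=len(original))

print("mean gap:", np.abs(original.mean(axis=0) - synthetic.mean(axis=0)).max())
print("cov gap :", np.abs(np.cov(original, rowvar=False) - np.cov(synthetic, rowvar=False)).max())
```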
... Information is stripped of all personal identifiers such as names, and categories of data that may serve as personal identifiers are also modified. This leaves personal data deidentified and creates an opportunity for further disclosure and use. The appeal of data anonymisation is its capacity to balance the privacy interests of the data subject and the interest in free flow of information. ...
Article
Biometric identity systems have been adopted in the Global South, following the Global North's lead. The greatest discrepancy, however, is the existence of legal frameworks that govern the use, storage and processing of the data collected. The Kenyan government's roll-out of the Huduma Namba registration exercise in April 2019, with no existing data protection law in Kenya, exemplifies this. Thereafter, Parliament passed the Data Protection Act. Unfortunately, parts of this law are not strong enough to protect personal data. Deviating from the requirement for personal data to be directly collected from the data subject, section 28(2)(c) of the referenced Act permits indirect collection of personal data from a source other than the data subject themselves. Relying on desk-based research and using the Huduma Namba exercise as a case study, this paper examines this permission and the imminent danger it poses to the privacy of the personal data of Kenyans. Finding that section 28(2)(c) exposes personal data to the privacy violations of secondary use and exclusion, and thus threatens the right to privacy, this research suggests that the meta consent model as embraced by the healthcare sector emerges as a feasible solution. This model allows data subjects to determine their consent preferences, i.e., how and when they wish their consent to be sought for further collection and use, at the point of primary collection of personal data. Additionally, this paper recommends that the model should be embraced by the judiciary in its adjudication of matters and, finally, that an amendment incorporating the solution should be made.
... However, some participants voiced concerns about being identified even if "pseudonyms" were used. Therefore, researchers must consider the risk of re-identification from what appears to be fully anonymised data (Ford et al., 2021; Ohm, 2009). ...
Preprint
Full-text available
Background: Improved data sharing could have extensive benefits for mental health research and treatment. However, it is vital that data are shared in a way that aligns with the views of people with mental health conditions. Whilst previous research has examined public views of health data sharing, few studies have focused specifically on people with mental illness. Methods: Semi-structured online interviews were conducted with twelve people with a range of mental health conditions, including schizophrenia, anxiety, depression, eating disorder and addiction. Interview questions focussed on the risks and benefits of sharing mental health data, how data should be kept safe, and the sensitivity of different types of data. Results: The overarching themes identified were: benefits of sharing mental health data, concerns about sharing mental health data, safeguards, and data types. There was a high level of support for the use of data sharing to facilitate improved knowledge of and treatment for mental health conditions. Concerns included the potential for misuse of data, such as by insurance companies or employers, and the risk of mental health stigma from researchers and healthcare professionals who accessed the data. There was a focus on appropriate safeguards, such as secure storage and access procedures. Conclusions: There was a strong sense across participants that more should be done to combat the suffering caused by mental illness, and that appropriate health data sharing could facilitate this. The mental health research community could build on this generally positive attitude to mental health data sharing by ensuring that they follow rigorous best practice which accounts for the specific concerns of people with mental illness.
... The variable glossary is presented in Section 4.1 and Section 4.5.1, respectively. ... the principles established by the European Data Protection Board (EDPB) in their Guidelines on the application and setting of administrative fines for the purpose of Regulation 2016/679. Section 3 reports on existing calculation models proposed by DPAs. ...
Article
Before the adoption of the EU GDPR, researchers notably argued that the enforcement of personal data protection law was „toothless” and a "paper tiger". Almost three years after its enforcement date, GDPR fines are increasing, and the world is beginning to witness the effect of sizeable fines imposed on organizations. This analysis aims to discover potential correlations between GDPR fines, and equally the lack of them. Such correlations might help to tap into trends followed by Data Protection Authorities (DPAs) in their fining practices. This paper specifically describes the fines issued by the Romanian DPA, while also containing qualitative research findings extracted from discussions with interview subjects. The aim of this paper is to evaluate the possibility of constructing a prediction model based on linear regression analysis and to provide future directions for the field of legal data analysis.
... Anonymized is sometimes used synonymously with de-identified, or can have a stronger connotation of data being rendered incapable of being re-identified. I avoid anonymized because it is far too assured a word given the known cases of failure (Ohm 2010). And in public data contexts, there might not be personally identifiable information to speak of given the use of pseudonyms. ...
Article
Full-text available
Concerned researchers of online forums might implement what Bruckman (2002) referred to as disguise . Heavy disguise, for example, elides usernames and rewords quoted prose so that sources are difficult to locate via search engines. This can protect users (who might be members of vulnerable populations, including minors) from additional harms (such as harassment or additional identification). But does disguise work? I analyze 22 Reddit research reports: 3 of light disguise, using verbatim quotes, and 19 of heavier disguise, using reworded phrases. I test if their sources can be located via three different search services (i.e., Reddit, Google, and RedditSearch). I also interview 10 of the reports’ authors about their sourcing practices, influences, and experiences. Disguising sources is effective only if done and tested rigorously; I was able to locate all of the verbatim sources (3/3) and many of the reworded sources (11/19). There is a lack of understanding, among users and researchers, about how online messages can be located, especially after deletion. Researchers should conduct similar site-specific investigations and develop practical guidelines and tools for improving the ethical use of online sources.
... This can also be applied to identifying characteristics. In other words, in the case of anonymized data, it is not very hard to indicate to whom the data is related, effectively de-anonymizing it (Ohm, 2009; Brasher, 2018). For instance, on the basis of trivial data such as postal codes, it may be possible to predict all kinds of sensitive health characteristics. ...
Article
Full-text available
In healthcare, gender and sex considerations are crucial because they affect individuals' health and disease differences. Yet, most algorithms deployed in the healthcare context do not consider these aspects and do not account for bias detection. Missing these dimensions in algorithms used in medicine is a huge point of concern, as neglecting these aspects will inevitably produce far from optimal results and generate errors that may lead to misdiagnosis and potential discrimination. This paper explores how current algorithmic-based systems may reinforce gender biases and affect marginalized communities in healthcare-related applications. To do so, we bring together notions and reflections from computer science, queer media studies, and legal insights to better understand the magnitude of failing to consider gender and sex difference in the use of algorithms for medical purposes. Our goal is to illustrate the potential impact that algorithmic bias may have on inadvertent discriminatory, safety, and privacy-related concerns for patients in increasingly automated medicine. This is necessary because by rushing the deployment of AI technologies that do not account for diversity, we risk having an even more unsafe and inadequate healthcare delivery. By promoting the account for privacy, safety, diversity, and inclusion in algorithmic developments with health-related outcomes, we ultimately aim to inform the Artificial Intelligence (AI) global governance landscape and practice on the importance of integrating gender and sex considerations in the development of algorithms to avoid exacerbating existing or new prejudices.
... Data owners often practice the release-and-forget model where datasets, once released, are not analyzed further for potential privacy risks concerning the newly released datasets [43,47]. However, the risk of re-identification can be considered a temporal function [51], thus requiring proactive monitoring of the risks. ...
Preprint
Full-text available
Open data sets that contain personal information are susceptible to adversarial attacks even when anonymized. By performing low-cost joins on multiple datasets with shared attributes, malicious users of open data portals might get access to information that violates individuals' privacy. However, open data sets are primarily published using a release-and-forget model, whereby data owners and custodians have little to no cognizance of these privacy risks. We address this critical gap by developing a visual analytic solution that enables data defenders to gain awareness about the disclosure risks in local, joinable data neighborhoods. The solution is derived through a design study with data privacy researchers, where we initially play the role of a red team and engage in an ethical data hacking exercise based on privacy attack scenarios. We use this problem and domain characterization to develop a set of visual analytic interventions as a defense mechanism and realize them in PRIVEE, a visual risk inspection workflow that acts as a proactive monitor for data defenders. PRIVEE uses a combination of risk scores and associated interactive visualizations to let data defenders explore vulnerable joins and interpret risks at multiple levels of data granularity. We demonstrate how PRIVEE can help emulate the attack strategies and diagnose disclosure risks through two case studies with data privacy experts.
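The kind of "joinable data neighborhood" risk that PRIVEE surfaces can be approximated crudely by checking, for any two open datasets, which columns they share and how small the resulting join groups are; small groups on shared attributes mean a cheap join can single people out. The helper and toy datasets below are hypothetical, not PRIVEE's actual risk scores.

```python
import pandas as pd

def join_risk(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Crude disclosure-risk signal: shared columns and the smallest join-group size."""
    shared = [c for c in a.columns if c in b.columns]
    if not shared:
        return {"shared_columns": [], "min_group_size": None}
    joined = a.merge(b, on=shared, how="inner")
    sizes = joined.groupby(shared).size()
    return {"shared_columns": shared,
            "min_group_size": int(sizes.min()) if len(sizes) else None}

# Toy open datasets with hypothetical columns
permits = pd.DataFrame({"zip": ["60601", "60601", "60615"], "year": [2020, 2021, 2021], "fee": [100, 250, 80]})
inspections = pd.DataFrame({"zip": ["60601", "60615"], "year": [2021, 2021], "result": ["pass", "fail"]})
print(join_risk(permits, inspections))   # a min group size of 1 flags a risky join
```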
... Research has warned that establishing relationships among data, including data arising from the use of social networks, can lead to the inference or discovery of private information [Rocasolano 2022]. Even anonymous data can be re-identified and attributed to specific individuals [Ohm 2009]. Moreover, through different types of data such as text [Chen et al. 2014], images [Wang and Kosinski 2018], and even "likes" [Kosinski et al. 2013], information such as gender, sexuality, and more can be inferred by means of Artificial Intelligence (AI) techniques [Bindu 2017]. ...
Conference Paper
Research has warned about the risks of new technologies with enormous computational and processing power when combined with the large volume of data arising from the growing use of social networks. One of the risks concerns privacy and the ownership of personal data. This work investigated the communicability of Instagram with the aim of identifying this social network's communication strategies regarding informed consent. To that end, we applied the Semiotic Inspection Method, a method focused on evaluating the communicability of interactive systems. The results indicate violations of the ethical principle of privacy related to consent and the presence of Dark Patterns in Instagram's communication strategies.
Chapter
In this chapter we will look at existing literature on big data and public policymaking and see which assumptions and conclusions are reached on the impact big data has on the public policymaking process. We will follow the lines of reasoning we have made within the previous chapter and therefore we will structure the effects of big data on public policymaking along the lines of technological features, institutional factors and the policy arena itself.
Chapter
Full-text available
Define mHealth and mobile technologies • Discuss three mobile technology use cases in a clinical setting • Discuss the shortcomings of medical apps for smartphones • Enumerate the challenges of mHealth in low and middle-income countries • Identify the software development kits (SDKs) for the iPhone and Android OS
Chapter
Full-text available
The computer revolution, directly related to the Network society (CASTELLS, 1996, 2010), the Information Age (BYRON, 2010; FLORIDI, 2008), or the 4th Revolution (SCHWAB, 2017), has provided powerful but also misleading cross-field advances thanks to Artificial Intelligence. Moreover, the related algorithmic thinking does not express the reality and complexity of human thinking, which is opportunistic, multi-heuristic, and can be functionally defined as blended (VALLVERDÚ & MÜLLER, 2019). Such incompleteness in human thinking can be analyzed in all kinds of fields, and the judicial and legal spheres are not free from that characteristic. Therefore, new advances in such fields as computational law (HILDEBRANDT, 2017), legal informatics, legal analytics, computational legal theory, AI Law, or engineering law (HOWARTH, 2013), among others, are biased in a different set of possibilities. Although it is affirmed that such computational technologies will bring transparency, justice, equity, and clarity to citizens, I have identified several challenges for the fulfilment of such promises. I will explore them under four categories: Design Biases, Wrong causal models, Failed automatization, and Non-universality of Justice models.
Chapter
Full-text available
The internet as a technology not only revolutionized communication, it also enabled new forms of trade. Digital trade often involves personal data. Information about individuals now travels around the world on an unprecedented and rapidly growing scale. The key to understanding the implications of data protection in the EU for trade with the wider world is the Charter of Fundamental Rights of the EU (Charter, CFR). The Charter has the status of primary Union law and data protection is enshrined as a fundamental right in Article 8 CFR. The first section of this chapter traces the development of the right to data protection from the early data protection laws in Europe to the inclusion of Article 8 into the Charter. It identifies the driving forces behind this development and offers insights into the origins of this new fundamental right (Sect. 2.1). The second section addresses the substance of the right to data protection. It explains the underlying values for the interpretation of the new fundamental right and analyzes the six written constituent parts of Article 8 CFR. It shows that the right to data protection must be distinguished from the right to private life in Article 7 CFR. The second section also explains what counts as an interference with the right to data protection and addresses lawful limitations on the exercise of this new fundamental right (Sect. 2.2). The third section focuses on the extraterritorial dimension of the right to data protection. The jurisprudence of the ECJ reveals an unwritten constituent part of the new fundamental right: the right to continuous protection of personal data. Personal data cannot be exported to third states that do not provide a level of protection for the transferred personal data that is essentially equivalent to that guaranteed within the EU (Sect. 2.3). Certain practices in third states are of particular relevance for the extraterritorial dimension of Article 8 CFR. Foreign internet surveillance often targets personal data that is transferred from the EU to a third country. The fourth section analyzes the requirements for foreign internet surveillance practices emanating from the right to data protection in Article 8 CFR (Sect. 2.4).
Chapter
As precision and personalized medicine prove their worth, care shifts more towards treating representations of patients rather than patients’ persons and bodies. Something is gained and something is lost by virtualizing patients and mediating care through technology. Because benefits are clear, the chapter highlights ethical, legal, and social issues surrounding quality of care, privacy, bias, and fairness to consider what could be lost. I argue that virtualization reduces distinctions between individuals and reduces knowledge of each patient and patient’s body. That changes relationships between patients and clinicians and shifts the locus of care away from the patient. It also decontextualizes data on which treatment and algorithmic recommendations are based. The data and algorithms all lack transparency, yet their predictions influence care. Not only can care be compromised, but both patients’ and clinicians’ personhood and autonomy are threatened. Privacy, too, is endangered by the push to generate, collect, and aggregate data as all data become health data, used repeatedly and combined into multiple datasets. It is impossible to predict what those datasets will be, how data will be used, and what they will yield. Anonymity and consent both lose meaning. Privacy concerns can undermine confidentiality, which, in turn, can undermine trust, and therefore, can compromise care. Algorithmic predictions based on sorting patients into algorithmically derived groups can harm group members. Care influenced by algorithmic recommendations may not be appropriate for all patients in the group, and predictions may stem from, or result in, bias, stigmatization, negative profiling, or disparate services. The chapter concludes with a framework for analyzing ethical, legal, and social issues. It expands the scope of bioethics to more generally include information technologies in healthcare. To realize the promise of personalized medicine in ethical ways, individuals and their bodies should be central and personalization personal. Keywords: Personalized medicine, Precision medicine, Virtualized patients, Representations of patients, Algorithms, Big data, Ethical issues, Quality of care, Privacy, Bias, Ethics framework, ELSI
Article
Pandemic-tracking apps may form a future infrastructure for public health surveillance. Yet, there has been relatively little exploration of the potential societal implications of such an infrastructure. In semi-structured interviews with 23 participants from India, the Middle East and North Africa (MENA), and the United States, we discussed attitudes and preferences regarding the deployment of apps that support contact tracing to contain the spread of COVID-19. Through interpretive analysis, we examined the relationship between persistent discomfort and vulnerability when using such apps. Such an examination yielded three temporal forms of vulnerability: real, anticipatory, and speculative. By identifying and defining the temporalities of vulnerability through an analysis of people's pandemic-related thoughts and experiences, we develop the overlapping discourses of humanistic infrastructure studies and infrastructural speculation. In doing so, we explore the concept of vulnerability itself and present implications for the study of vulnerability in Human-Computer Interaction (HCI) and for the oversight of app-based public health surveillance.
Conference Paper
Open data sets that contain personal information are susceptible to adversarial attacks even when anonymized. By performing low-cost joins on multiple datasets with shared attributes, malicious users of open data portals might get access to information that violates individuals' privacy. However, open data sets are primarily published using a release-and-forget model, whereby data owners and custodians have little to no cognizance of these privacy risks. We address this critical gap by developing a visual analytic solution that enables data defenders to gain awareness about the disclosure risks in local, joinable data neighborhoods. The solution is derived through a design study with data privacy researchers, where we initially play the role of a red team and engage in an ethical data hacking exercise based on privacy attack scenarios. We use this problem and domain characterization to develop a set of visual analytic interventions as a defense mechanism and realize them in PRIVEE, a visual risk inspection workflow that acts as a proactive monitor for data defenders. PRIVEE uses a combination of risk scores and associated interactive visualizations to let data defenders explore vulnerable joins and interpret risks at multiple levels of data granularity. We demonstrate how PRIVEE can help emulate the attack strategies and diagnose disclosure risks through two case studies with data privacy experts.
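The core attack surface PRIVEE monitors, low-cost joins on shared attributes, can be illustrated with a small sketch. The following Python fragment is purely illustrative and is not part of PRIVEE; the column names and records are hypothetical, and it only shows how two separately "anonymized" open releases can be linked on quasi-identifiers they share.

import pandas as pd

# Hypothetical open-data releases: one carries a sensitive attribute, the other names.
health = pd.DataFrame({
    "zip": ["02139", "02139", "60614"],
    "birth_year": [1985, 1990, 1978],
    "sex": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})
voters = pd.DataFrame({
    "zip": ["02139", "60614"],
    "birth_year": [1985, 1978],
    "sex": ["F", "F"],
    "name": ["Alice Example", "Carol Example"],
})

# A cheap join on the shared quasi-identifiers re-attaches names to diagnoses
# whenever the attribute combination is unique in both releases.
linked = health.merge(voters, on=["zip", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])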
Article
News headlines about privacy invasions, discrimination, and biases discovered in the platforms of big technology companies are commonplace today, and big tech's reluctance to disclose how they operate counteracts ideals of transparency, openness, and accountability. This book is for computer science students and researchers who want to study big tech's corporate surveillance from an experimental, empirical, or quantitative point of view and thereby contribute to holding big tech accountable. As a comprehensive technical resource, it guides readers through the corporate surveillance landscape and describes in detail how corporate surveillance works, how it can be studied experimentally, and what existing studies have found. It provides a thorough foundation in the necessary research methods and tools, and introduces the current research landscape along with a wide range of open issues and challenges. The book also explains how to consider ethical issues and how to turn research results into real-world change.
Article
This book is about rights and powers in the digital age. It is an attempt to reframe the role of constitutional democracies in the algorithmic society. By focusing on the European constitutional framework as a lodestar, this book examines the rise and consolidation of digital constitutionalism as a reaction to digital capitalism. The primary goal is to examine how European digital constitutionalism can protect fundamental rights and democratic values against the charm of digital liberalism and the challenges raised by platform powers. Firstly, this book investigates the reasons leading to the development of digital constitutionalism in Europe. Secondly, it provides a normative framework analysing to what extent European constitutionalism provides an architecture to protect rights and limit the exercise of unaccountable powers in the algorithmic society. This title is also available as open access on Cambridge Core.
Chapter
The increasing attention on privacy has motivated the rapid design and implementation of multiple privacy-preserving methods. This chapter first presents a few modern approaches for privacy preservation, including anonymization, differential privacy, federated learning, homomorphic encryption, and secure multi-party computation. Then, a privacy-preserving anomaly detection scheme under the E-Health system is presented. In the era of modern communication technologies and big data, E-Health has emerged and become a popular paradigm. However, recent E-Health applications have raised serious concerns when detecting the anomalous behaviors in a person's sensitive medical records. Such records contain private data of health conditions, some of which are maliciously compromised and modified by adversaries. This privacy-preserving anomaly detection scheme is introduced to strengthen the understanding of detecting anomalous behaviors without compromising information confidentiality. This scheme is enabled by predicate encryption. Specifically, the novelty is the use of a session key as a message payload during the encryption operation of the predicate encryption system in order to achieve both information privacy and efficient cryptographic computations. Moreover, the system model and security model are defined. Detailed descriptions on the design of predicates and anomaly detection procedures are presented. Lastly, evaluations are performed in terms of computational overhead, communication overhead, and anomaly detection.
Article
Generation of massive data in the digital age leads to possible violations of individual privacy, and queries over personal data expose individuals with increasing frequency. The present work belongs to the area of differential privacy, which guarantees data confidentiality and robustness against invasive identification attacks. This area stands out in the literature for its rigorous mathematical basis capable of quantifying the loss of privacy. A differentially private method based on regression models was developed to prevent inversion attacks while retaining model efficacy. In this paper, we propose a novel approach to improve data privacy based on regression models under heteroscedasticity, a common but so far unstudied aspect of practical differential privacy settings. The influence of the privacy restriction on the statistical performance of the estimators of model parameters was evaluated using Monte Carlo simulations, including test rejection rates. The results of the numerical evaluation show high inferential distortion for stricter privacy restrictions. Empirical illustrations with data are presented to show potential applications.
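As a point of reference for the kind of mechanism such work builds on, the sketch below applies the standard Laplace mechanism as output perturbation on ordinary least-squares coefficients. It is not the paper's heteroscedastic method; the data are synthetic and the sensitivity bound is an assumed placeholder, since bounding the true sensitivity of regression coefficients is itself a nontrivial step.

import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value, sensitivity, epsilon):
    # Adds Laplace(sensitivity / epsilon) noise, the textbook way to make a
    # query with the stated global sensitivity epsilon-differentially private.
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))

# Fit ordinary least squares on toy data, then perturb the coefficients.
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=100)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

beta_private = laplace_mechanism(beta_hat, sensitivity=0.1, epsilon=1.0)  # assumed bound
print(beta_hat, beta_private)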
Chapter
This chapter examines the processing of a peculiar information type in the food safety risk assessment domain, namely food consumption data. The discussion of this data type is crucial to evaluate the numerous implications of personal data processing in the food safety domain. Moreover, this discussion contributes to the ongoing debate on certain aspects of data processing, including legal categories and group privacy. The chapter suggests that governance solutions need to be adopted in order to minimize the risks to fundamental rights and freedoms arising from the processing of such data.
Thesis
This thesis examines whether the GDPR affords individuals a right to control the online commodification of their personal data. The thesis investigates whether the intention that ‘Natural persons shall have control over their personal data’ in Recital 7 GDPR is given effect in the substantive provisions of the Regulation. Set against a backdrop of the many problems associated with the online commodification of personal data, the thesis posits that these are the issues individuals would like to address through a right to control. While these problems cover a wide gamut, the thesis focuses on the objections to online commodification, privacy invasions, and unfair bargains, which are systemic to the commercial surveillance which forms the basis for the commodification of personal data. The thesis examines the possibility of a right of control to be found in the data processing principles in Article 5 GDPR by way of the jurisprudential concept of informational self-determination. The thesis considers whether the use of consent as a legal basis of processing in Article 6 GDPR or the user rights in Chapter III GDPR give rise to a right to consent, and whether the information requirements in Articles 12-14 GDPR facilitate such a right. The thesis also examines whether the enforcement provisions in Articles 77-79 GDPR and 82 GDPR give effect to a right to control. In its findings, the thesis concludes that while many of the substantive provisions include elements of control, these provisions cannot be said to afford individuals the control they seek, as suggested by the wording of Recital 7 GDPR. The legal rights do not overlap with the rights sought; they are too limited and uncertain; and the technological environment has become so ubiquitous, complex, and interconnected that individuals cannot meaningfully exercise control. Thus, the thesis questions whether it is even possible for individuals to achieve control over the online commodification of their personal data.
Article
Full-text available
At the end of 2020, the European Commission published a new European data strategy, which aims to create a new legal framework to promote the development of a single European data market. In the scope of the new strategy the European Commission has already proposed the Data Governance Act proposal (hereinafter – the “DGA”), which aims to strengthen the mechanisms that facilitate data exchange. The proposal of the new legislative act has raised the question of the right balance between it and the existing EU legislative acts adopted to protect personal data, especially the General Data Protection Regulation (hereinafter – the “GDPR”). We argue that there are a number of inconsistencies between the DGA proposal and the GDPR that may prevent the full implementation of mechanisms that allow achieving the EU’s intended goals in this field within the framework of the new European strategy. The interaction of the DGA with the GDPR is characterized by conflict of laws and legal uncertainty, which can jeopardize the achievement of the objectives of the DGA itself and can reduce the level of personal data protection in the EU compared to the GDPR. In addition, the examination of the relevant case-law of the Court of Justice of the European Union (hereinafter – the “CJEU” or the “Court”) regarding the implementation of the GDPR identifies approaches that prevent the opening of data exchange as the DGA proposal requires.
Article
Full-text available
Human resources management (HRM) stakeholders must make strategic decisions about artificial intelligence (AI) or they may risk failure if competitors implement faster, objective methods to hire best-fit candidates. Big data is one of the major problems HRM staff face in finding relevant talent due to the post-pandemic paradigm shift towards remote work along with the subsequent exponential increase of online-based recruiting. In the current study, a unique action research experiment was constructed based on a pragmatic ideology. The purpose was to answer four research questions (RQ) and one hypothesis. The key RQ was whether AI could help, or possibly outperform, humans in recruiting new employees when there was a high volume of real-time internet-based job application big data. A pharmaceutical company was selected for the action research experiment. A job description was created using hierarchical cluster analysis in machine learning to identify 27 key skills. The analytical hierarchy process (AHP) was used to transform subjective and qualitative data from the hiring manager into objective, quantitative candidate selection criteria. An HRM AI was installed in the organization's human resources information system (HRIS). Participants were randomly selected from the action research study company (N = 10) to complete the experiment. The 10 participants competed against the AI to find the best candidate. The human recruiters and AI evaluated approximately the same real-time applicant big data population from the internet (not a test sample). Quality assurance was performed to ensure the evaluations were accurate and there were no outliers. The scores of the best candidates were compared using ANOVA and post hoc Tukey, with an effect size of 44%. The results demonstrated AI could outperform human recruiters, without discrimination, due to using AHP for prioritizing the hiring criteria and not training the HRM AI with best-in-class resumes.
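The AHP prioritization step mentioned above can be sketched in a few lines. The comparison matrix and skill names below are invented for illustration and do not come from the study; the snippet only shows the standard principal-eigenvector weighting and consistency check that AHP uses to turn pairwise judgments into quantitative criteria weights.

import numpy as np

skills = ["domain knowledge", "communication", "regulatory experience"]  # hypothetical
# Saaty-scale pairwise comparisons: A[i][j] = how much more important skill i is than j.
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

# Principal-eigenvector weights, the standard AHP prioritization.
eigvals, eigvecs = np.linalg.eig(A)
idx = np.argmax(np.real(eigvals))
w = np.abs(np.real(eigvecs[:, idx]))
w = w / w.sum()

# Consistency ratio check (random index RI = 0.58 for a 3x3 matrix).
lam_max = np.real(eigvals)[idx]
ci = (lam_max - len(A)) / (len(A) - 1)
print(dict(zip(skills, w.round(3))), "CR =", round(ci / 0.58, 3))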
Article
Full-text available
Human mobility data is a crucial resource for urban mobility management, but it inherently carries personal references. Security measures such as anonymization are thus needed to protect individuals’ privacy. Often, a trade-off arises, as such techniques potentially decrease the utility of the data and limit its use. While much research on anonymization techniques exists, there is little information on the actual implementations by practitioners, especially outside the big tech context. Within our study, we conducted expert interviews to gain insights into practices in the field. We categorize purposes, data sources, analysis, and modeling tasks to provide a profound understanding of the context in which such data is used. We survey privacy-enhancing methods in use, which generally do not comply with state-of-the-art standards of differential privacy. We lay the groundwork for further practice-oriented research by identifying the privacy needs of practitioners and extracting relevant mobility characteristics for future standardized evaluations of privacy-enhancing methods.
Book
Full-text available
The digital revolution has given rise to a new economic model in which data, and knowledge more generally, stand out as the most promising resource of the post-industrial era. The concept of knowledge harbors an underlying conflict. On one side is the private interest in capitalizing on its value, which makes intellectual property strategic as a form of control over knowledge. On the other is the general interest in reserving to the public domain, that is, to free collective use, a substantial part of intellectual and semantic capital and of cultural heritage. The network economy, founded on the sharing of knowledge, today suffers from a gap in public-domain content. Barely 10% of European cultural heritage has been digitized, and less than half of that is accessible online and available for reuse. Responsibility lies first of all with the law, which struggles to recognize protection for claims of open access to common goods. This monograph guides the reader through the various areas of law that contribute to narrowing the perimeter of the public domain. The volume ranges from copyright law to the regulation protecting personal data, and from the cultural heritage code to the recent proposal for a Data Governance Act. It concludes that, in a fair balancing of the conflicting interests, every member of society should hold a subjective right to the free use of the intangible goods that make up the public domain: a "right to the public domain".
Article
This paper explores whether the establishment of data markets based on individual data portability can result in better societal outcomes. The results suggest that markets where individuals can sell data generated through their online engagement to third parties, could result in pareto improving outcomes for subscribers to digital platforms and purchasers of targeted advertising services. Data markets would enable third parties to combine their own proprietary data with other individual level data and produce information for targeted advertising, reducing the market power of Big Tech firms. However, successful data markets require strong regulatory measures by governments that ensure: data collected by Big Tech firms are considered the property of individuals; the privacy of individuals; and the intellectual property protection of firms creating unique data products. Such policies have the potential to guarantee that the benefits of Big Data are not confined to a few large firms.
Article
Task assignment is a research focus of Mobile Crowd Sensing (MCS). When a sensing task arrives, most studies assign it through immediate recruitment, ignoring the impact of the task assignment time point. However, the arrival time and start time of a task are often different, so an early assignment decision relies on rough predictions of whether users will reach the task's destination, which reduces the task completion ratio and budget utilization ratio. To solve this problem, in this paper we propose a dynamic delayed-decision task assignment method. Firstly, we formalize a new task assignment problem that takes the time point of task assignment into account, in which each selected user can complete as many tasks as possible within the given spatial-temporal constraints, rather than only one task, and propose a method for selecting the decision time point using a delayed-decision strategy to handle tasks that arrive dynamically at the MCS platform. Secondly, we propose two mobility prediction methods to efficiently compute the probabilities of users reaching the destinations before the end time of the sensing tasks, and then propose the related task assignment algorithms, TDMar and TDMeta, which use the proposed semi-Markov prediction and meta-path prediction, respectively. Finally, using large-scale real data sets, we evaluate the two algorithms and compare them with baseline methods. The results show that the delayed-decision strategy clearly improves the task completion ratio and budget utilization ratio, and also reduces the proportion of users who complete only a single task.
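To make the mobility-prediction step concrete, the sketch below estimates the probability that a user reaches a task's destination region within a deadline under a plain first-order Markov mobility model. It is a simplified stand-in, not the paper's semi-Markov or meta-path predictors, and the regions and transition probabilities are invented.

import numpy as np

regions = ["A", "B", "C"]           # discretized city regions (hypothetical)
P = np.array([                      # assumed empirical transition probabilities per time slot
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
])

def reach_probability(P, start, dest, T):
    # Make the destination absorbing, so that the (start, dest) entry of P^T is
    # the probability of having visited dest at least once within T slots.
    Q = P.copy()
    Q[dest, :] = 0.0
    Q[dest, dest] = 1.0
    return np.linalg.matrix_power(Q, T)[start, dest]

# Probability that a user currently in region A reaches region C within 4 slots.
print(round(reach_probability(P, start=0, dest=2, T=4), 3))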
Article
Behavioral data, collected from our daily interactions with technology, have driven scientific advances. Yet, the collection and sharing of this data raise legitimate privacy concerns, as individuals can often be reidentified. Current identification attacks, however, require auxiliary information to roughly match the information available in the dataset, limiting their applicability. We here propose an entropy-based profiling model to learn time-persistent profiles. Using auxiliary information about a single target collected over a nonoverlapping time period, we show that individuals are correctly identified 79% of the time in a large location dataset of 0.5 million individuals and 65.2% for a grocery shopping dataset of 85,000 individuals. We further show that accuracy only slowly decreases over time and that the model is robust to state-of-the-art noise addition. Our results show that much more auxiliary information than previously believed can be used to identify individuals, challenging deidentification practices and what currently constitutes legally anonymous data.
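A toy sketch in the same spirit, not the authors' exact model, shows why information content matters for matching: rarer shared behaviors carry more identifying weight. All names and visit records below are fabricated.

import math
from collections import Counter

# Hypothetical historical visit records: person -> places visited in period 1.
history = {
    "u1": ["cafe", "gym", "cafe", "office"],
    "u2": ["cafe", "park", "park", "office"],
    "u3": ["museum", "gym", "museum", "office"],
}
# Auxiliary observations about one unknown target from a later, non-overlapping period.
aux = ["museum", "gym", "museum"]

# Global place frequencies give each place a surprisal (self-information) weight.
all_visits = Counter(v for visits in history.values() for v in visits)
total = sum(all_visits.values())
surprisal = {p: -math.log(c / total) for p, c in all_visits.items()}

def match_score(visits, aux):
    # Sum the surprisal of auxiliary places the candidate has also visited.
    seen = set(visits)
    return sum(surprisal.get(p, 0.0) for p in aux if p in seen)

best = max(history, key=lambda u: match_score(history[u], aux))
print(best)  # u3: it matches all auxiliary observations, including the rarer "museum"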
Article
Data management is the collection, processing, storing, and sharing of data. Today, data sharing and collaborative data processing are a necessity for multi-partner organizations, as they can lead to the discovery of new insights. Shared data is generally used for marketing, advertising, and other institutional decision-making purposes. A challenge, however, is that the collected data (mostly about individuals, known as data subjects or producers) is disseminated among organizations without meaningful consent from the data subjects. Hence, data subjects are not aware of what is happening to their data regarding use and misuse. Furthermore, data subjects can hardly determine which third-party institutions have access to their data. In this paper, we argue that data should be managed in a manner that earns the trust of data subjects. To achieve this, data subjects should have the right to be informed about the details of their data. Thus, this paper proposes a cloud-based data management and sharing platform that enables data subjects to control who can access their data and to consent to its collection and usage based on smart contracts. The proposed system leverages blockchain to enforce accountability, provenance, and auditability of all events. We implemented a dynamic consent management prototype on top of the Ethereum blockchain to demonstrate the feasibility of the proposed work.
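The sketch below is a deliberately minimal, plain-Python stand-in for the kind of auditable consent record the paper implements as an Ethereum smart contract; it is not the authors' code. Hash-chaining each consent event makes later tampering detectable, and the latest decision for a (subject, requester, purpose) triple governs access.

import hashlib, json, time

class ConsentLedger:
    def __init__(self):
        self.entries = []

    def record(self, subject, requester, purpose, granted):
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {
            "subject": subject, "requester": requester, "purpose": purpose,
            "granted": granted, "timestamp": time.time(), "prev": prev_hash,
        }
        # Hash the event together with the previous hash to chain the log.
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)

    def is_authorized(self, subject, requester, purpose):
        # The most recent matching consent decision wins.
        for e in reversed(self.entries):
            if (e["subject"], e["requester"], e["purpose"]) == (subject, requester, purpose):
                return e["granted"]
        return False

ledger = ConsentLedger()
ledger.record("alice", "marketing-co", "advertising", granted=True)
ledger.record("alice", "marketing-co", "advertising", granted=False)  # consent revoked
print(ledger.is_authorized("alice", "marketing-co", "advertising"))   # False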
Article
Vehicular Ad hoc Networks (VANETs), as the integration of mobile vehicles and intelligent technology, have attracted considerable attention. In VANETs, smart vehicles can obtain timely information from the surrounding environment, which brings great convenience to society. However, this highlights the need to improve data quality while protecting privacy for vehicular data providers. Thus, in this article, we propose an AI-based Trust-aware and Privacy-preserving System (ATPS) to preserve privacy for vehicular data providers while improving the quality of data collection in VANETs. Our proposed ATPS system mainly consists of two schemes: 1) a Partial Ordering based Trust Management (POTM) scheme and 2) a Trajectory Privacy Preserving (TPP) scheme jointly designed with Wasserstein Generative Adversarial Networks (WGAN) and differential privacy. The POTM scheme uses partial ordering relationships to accurately evaluate and manage trust for vehicular data providers with the assistance of fully trusted drones, and selects the vehicles with top ranks to collect data. The TPP scheme then incorporates WGAN and differential privacy to preserve trajectory privacy for vehicular data providers in VANETs, and also guarantees data availability by adding carefully designed noise to the original trajectories. Compared to existing schemes, extensive experiments conducted on real-world datasets demonstrate the efficiency of our ATPS in improving data quality by 45.76% to 52.57%, reducing malicious vehicle participants by 15.48% to 16.95%, preserving the privacy of vehicles, and guaranteeing data availability.
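For orientation only, the fragment below shows the simplest form of trajectory perturbation, independent Laplace noise per coordinate; it is a conceptual stand-in, not the paper's WGAN-plus-differential-privacy construction, and the noise scale is an arbitrary assumption rather than a calibrated privacy parameter.

import numpy as np

rng = np.random.default_rng(1)
trajectory = np.array([[41.88, -87.63], [41.89, -87.62], [41.90, -87.61]])  # made-up points

def perturb(points, scale=0.005):
    # Larger scale means more obfuscation but less utility for downstream analyses.
    return points + rng.laplace(0.0, scale, size=points.shape)

print(perturb(trajectory))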
Article
Full-text available
Releasing network measurement data---including packet traces---to the research community is a virtuous activity that promotes solid research. However, in practice, releasing anonymized packet traces for public use entails many more vexing considerations than just the usual notion of how to scramble IP addresses to preserve privacy. Publishing traces requires carefully balancing the security needs of the organization providing the trace with the research usefulness of the anonymized trace. In this paper we recount our experiences in (i) securing permission from a large site to release packet header traces of the site's internal traffic, (ii) implementing the corresponding anonymization policy, and (iii) validating its correctness. We present a general tool, tcpmkpub, for anonymizing traces, discuss the process used to determine the particular anonymization policy, and describe the use of metadata accompanying the traces to provide insight into features that have been obfuscated by anonymization
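One common building block of such anonymization policies, consistent keyed pseudonyms for IP addresses, can be sketched as follows. This is a hedged illustration, not tcpmkpub's actual policy, which additionally deals with prefix structure, scanners, checksums, and per-site exceptions; the key is hypothetical and would be held by the releasing site.

import hmac, hashlib, ipaddress

SECRET_KEY = b"site-held secret, never published"   # hypothetical per-trace key

def anonymize_ip(addr):
    # The same input always maps to the same pseudonym under one key, preserving
    # flow structure in the trace while hiding the real address.
    digest = hmac.new(SECRET_KEY, addr.encode(), hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(int.from_bytes(digest[:4], "big")))

print(anonymize_ip("192.0.2.17"), anonymize_ip("192.0.2.17"), anonymize_ip("192.0.2.18"))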
Article
Full-text available
The 9/11 terrorists, before their deadly attacks, sought invisibility through integration into the society they hoped to destroy. In a similar fashion, the terrorists who carried out subsequent attacks in Madrid and London attempted to blend into their host lands. This strategy has forced governments, including the United States, to rethink counterterrorism strategies and tools. One of the current favored strategies involves data mining. In its pattern-based variant, data mining searches select individuals for scrutiny by analyzing large data sets for suspicious data linkages and patterns. Because terrorists do not stand out, intelligence and law enforcement agents want to do more than rely exclusively on investigations of known suspects. The new goal is to search for a pattern or signature in massive amounts of transaction data. This Article begins by examining governmental data mining. In Part II, this Article reviews widely held views about the necessary safeguards for the use of data mining. In Part III, this Article considers dataveillance by private corporations and how they have compiled rich collections of information gathered online in the absence of a robust legal framework that might help preserve online privacy. This Article then discusses some of the techniques that individuals can employ to mask their online activity as well as existing and emerging technological approaches to preventing the private sector or government from linking their personal information and tracing their activities. This Article concludes by briefly considering three topics: (1) whether and how to regulate the potential impact of identity management systems on counterterrorism efforts; (2) the requirements of transparency and understanding of the underlying models used in either data mining or identity management systems as a necessary prelude to the creation of rules on appropriate access and use; and (3) the need for research in several further areas.
Article
Full-text available
In the United States, proposals for informational privacy have proved enormously controversial. On a political level, such proposals threaten powerful data processing interests. On a theoretical level, data processors and other data privacy opponents argue that imposing restrictions on the collection, use, and exchange of personal data would ignore established understandings of property, limit individual freedom of choice, violate principles of rational information use, and infringe data processors' freedom of speech. In this article, Professor Julie Cohen explores these theoretical challenges to informational privacy protection. She concludes that categorical arguments from property, choice, truth, and speech lack weight, and mask fundamentally political choices about the allocation of power over information, cost, and opportunity. Each debate, although couched in a rhetoric of individual liberty, effectively reduces individuals to objects of choices and trades made by others. Professor Cohen argues, instead, that the debate about data privacy protection should be grounded in an appreciation of the conditions necessary for individuals to develop and exercise autonomy in fact, and that meaningful autonomy requires a degree of freedom from monitoring, scrutiny, and categorization by others. The article concludes by calling for the design of both legal and technological tools for strong data privacy protection.
Article
Full-text available
The concept of k-anonymity, used in the literature to formally evaluate the privacy preservation of published tables, was introduced based on the notion of quasi-identifiers (or QI for short). The process of obtaining k-anonymity for a given private table is first to recognize the QIs in the table, and then to anonymize the QI values, the latter being called k-anonymization. While k-anonymization is usually rigorously validated, the definition of QI remains mostly informal, and different authors seem to have different interpretations of the concept of QI. The purpose of this paper is to provide a formal underpinning of QI and examine the correctness and incorrectness of various interpretations of QI in our formal framework. We observe that in cases where the concept has been used correctly, its application has been conservative; this note provides a formal understanding of the conservative nature in such cases.
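The definition being formalized supports a very simple operational check, sketched below with invented records: a table is k-anonymous with respect to a quasi-identifier set if every combination of QI values appears in at least k records.

from collections import Counter

rows = [
    {"zip": "021**", "age": "20-29", "sex": "F", "disease": "flu"},
    {"zip": "021**", "age": "20-29", "sex": "F", "disease": "asthma"},
    {"zip": "606**", "age": "40-49", "sex": "M", "disease": "diabetes"},
]
QI = ("zip", "age", "sex")

def is_k_anonymous(rows, qi, k):
    counts = Counter(tuple(r[a] for a in qi) for r in rows)
    return min(counts.values()) >= k

print(is_k_anonymous(rows, QI, k=2))   # False: the 606**/40-49/M class has only one record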
Article
Full-text available
Decision makers of companies often face the dilemma of whether to release data for knowledge discovery, vis a vis the risk of disclosing proprietary or sensitive information. Among the various methods employed for "sanitizing" the data prior to disclosure, we focus in this paper on anonymization, given its widespread use in practice. We do due diligence to the question "just how safe is the anonymized data". We consider both the scenarios when the hacker has no information, and more realistically, when the hacker may have partial information about items in the domain. We conduct our analyses in the context of frequent set mining and address the safety question at two different levels: (i) how likely are the identities of individual items cracked (i.e. reidentified by the hacker), and (ii) how likely are sets of items cracked. For capturing the prior knowledge of the hacker, we propose a belief function, which amounts to an educated guess of the frequency of each item. For various classes of belief functions, which correspond to different degrees of prior knowledge, we derive formulas for computing the expected number of cracks of single items and, for itemsets, the probability of cracking the itemsets. While obtaining the exact values for the more general situations is computationally hard, we propose a series of heuristics called the O-estimates. They are easy to compute, and are shown to be fairly accurate, justified by empirical results on real benchmark datasets. Based on the O-estimates, we propose a recipe for the decision makers to resolve their dilemma. Our recipe operates at two different levels, depending on whether the data owner wants to reason in terms of single items or sets of items (or both). Finally, we present techniques using which a hacker's knowledge of correlation in terms of co-occurrence of items can be incorporated into our framework of disclosure risk analysis and present experimental results demonstrating how this knowledge affects the heuristic estimates we have developed.
Article
Often a data holder, such as a hospital or bank, needs to share person-specific records in such a way that the identities of the individuals who are the subjects of the data cannot be determined. One way to achieve this is to have the released records adhere to k-anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k-anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information. This paper provides a formal presentation of combining generalization and suppression to achieve k-anonymity. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. The Preferred Minimal Generalization Algorithm (MinGen), which is a theoretical algorithm presented herein, combines these techniques to provide k-anonymity protection with minimal distortion. The real-world algorithms Datafly and μ-Argus are compared to MinGen. Both Datafly and μ-Argus use heuristics to make approximations, and so, they do not always yield optimal results. It is shown that Datafly can over-distort data and μ-Argus can additionally fail to provide adequate protection.
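A toy version of the generalization-and-suppression idea, with only a ZIP code as quasi-identifier, is sketched below. It is not MinGen, Datafly, or μ-Argus: it simply coarsens ZIP codes one digit at a time and suppresses the few records that still sit in an equivalence class smaller than k, with an arbitrary one-record suppression budget chosen for illustration.

from collections import Counter

def generalize_zip(z, level):
    # Replace the last `level` digits with '*' (level 0 leaves the value intact).
    return z[: len(z) - level] + "*" * level if level else z

def anonymize(records, k):
    for level in range(6):
        gen = [dict(r, zip=generalize_zip(r["zip"], level)) for r in records]
        counts = Counter(r["zip"] for r in gen)
        kept = [r for r in gen if counts[r["zip"]] >= k]
        if len(records) - len(kept) <= 1:      # tolerate at most one suppression
            return kept, level
    return [], 5

records = [{"zip": z} for z in ["02139", "02141", "02139", "60614"]]
print(anonymize(records, k=2))   # keeps three records generalized to "021**"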
Article
In Section 1 the nature of statistical confidentiality is explored, i.e., its essential role in the collection of data by statistical offices, its relationship to privacy and the need for increased attention to potential statistical disclosures because of the increased tabulation and dissemination capabilities of statistical offices. In Section 2 a definition of inadvertent direct disclosure is provided as well as a theorem concerning a test for residual disclosure of tabulations. In Section 3 different media and methods of data dissemination are considered from the point of view of potential for statistical disclosure.
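A one-line worked example of residual disclosure of the kind the theorem addresses: if a small cell is suppressed but its marginal total is published, simple differencing recovers the cell, which is why complementary suppression is needed. The figures are invented.

published = {"dept_A": 41, "dept_B": 17, "dept_C": None}   # dept_C suppressed
row_total = 59                                             # published margin

known = sum(v for v in published.values() if v is not None)
recovered_dept_C = row_total - known
print(recovered_dept_C)   # 1 -- the "protected" small cell is disclosed anyway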
Article
This article opens discussion of a starkly new approach for protecting the privacy of Americans' sensitive health information. Last year, Congress empowered the U.S. Food and Drug Administration (FDA) to oversee development of a major new national infrastructure: a large-scale data network, the Sentinel System, that aims to include health data for 100 million Americans by 2012. This marked the first time since the end of the New Deal that a wholly new infrastructure regulatory mandate had been issued at the federal level. This important development, buried in drug-safety provisions of the Food and Drug Administration Amendments Act of 2007 (FDAAA), went largely unnoticed, as did the fact that Congress cast medical privacy, a hot-button issue for many members of the American public, as an infrastructure regulatory problem. Individuals are not empowered to make autonomous decisions about permissible uses and disclosures of their health data. Instead, Congress authorized FDA to decide whether proposed disclosures meet a statutorily defined public-interest standard. If so, then the disclosures are lawful without individual privacy authorization or informed consent. Within limits that this article explores, FDA can approve the release of private health data, including data in identifiable form, to private operators of Sentinel System infrastructure and to outside data users, including academic and commercial entities. This article describes the new privacy model, which was implicit in the statute Congress passed but far from obvious on its face. The goal is not to oppose the new approach. Congress was responding to serious public concern about the safety of FDA-approved products. This article accepts that this new privacy model exists and explores directions for implementing it in a manner that will be least corrosive of public trust. The goal is to elicit ongoing dialogue about appropriate institutional protections for the 100 million Americans whose data soon will be in this vast data network. FDA is, in many respects, an accidental infrastructure regulator, thrust into a new role strikingly different from its longstanding product-safety mandate. Fortunately, the challenges FDA now faces are not new ones. U.S. infrastructure regulators, in a wide variety of industry contexts, have harnessed private capital to build new infrastructures to serve defined public interests while protecting vulnerable classes. Lessons from these other contexts can shed light on appropriate governance structures for the Sentinel System. For example, privacy protection may be enhanced by eschewing vertical integration in favor of segregating certain key infrastructure functions that require access to identifiable data. It may be better to establish core privacy protections via rulemaking rather than through contracts and to centralize certain key discretionary decisions rather than delegating them to private, commercial decision-makers. Public trust will require strong due-process protections, regulatory independence, and a well-funded system of regulatory oversight; approaches employed by other infrastructure regulators may help address these concerns. The single greatest threat to privacy will come as FDA faces pressure to approve wide ancillary sales of Sentinel System data to help defray costs of system development. 
To make this system financeable while enforcing strong privacy protections, FDA should deploy its limited available funds to support a well-thought-out infrastructure financing facility that backstops clear privacy policies with appropriate political risk guarantees for private infrastructure investors.
Article
Fear of the powerful computer user, "the Superuser," dominates debates about online conflict. This mythic figure is difficult to find, immune to technological constraints, and aware of legal loopholes. Policymakers, fearful of his power, too often overreact, passing overbroad, ambiguous laws intended to ensnare the Superuser, but which are used instead against inculpable, ordinary users. This response is unwarranted because the Superuser is often a marginal figure whose power has been greatly exaggerated. The exaggerated attention to the Superuser reveals a pathological characteristic of the study of power, crime, and security online, which springs from a widely-held fear of the Internet. Building on the social science fear literature, this Article challenges the conventional wisdom and standard assumptions about the role of experts. Unlike dispassionate experts in other fields, computer experts are as susceptible as lay-people to exaggerate the power of the Superuser, in part because they have misapplied Larry Lessig's ideas about code. The experts in computer security and Internet law have failed to deliver us from fear, resulting in overbroad prohibitions, harms to civil liberties, wasted law enforcement resources, and misallocated economic investment. This Article urges policymakers and partisans to stop using tropes of fear; calls for better empirical work on the probability of online harm; and proposes an anti-Precautionary Principle, a presumption against new laws designed to stop the Superuser.
Article
The article examines the government's growing appetite for collecting personal data. Often justified on the basis of protecting national security, government data mining programs sweep up data collected through hundreds of regulatory and administrative programs, and combine them with huge datasets obtained from industry. The result is an aggregation of personal data - the "digital footprints" of individual lives - never before seen. These data warehouses are then used to determine who can work and participate in Social Security programs, who can board airplanes and enter government buildings, and who is likely to pose a threat in the future, even though they have done nothing wrong to date. The article describes the extraordinary volume and variety of personal data to which the government has routine access, directly and through industry, and examines the absence of any meaningful limits on that access. So-called privacy statutes are often so outdated and inadequate that they fail to limit the government's access to our most personal data, or they have been amended in the post-9/11 world to reduce those limits. And the Fourth Amendment, the primary constitutional guarantee of individual privacy, has been interpreted by the Supreme Court to not apply to routine data collection, accessing data from third parties, or sharing data, even if illegally gathered. The result is not only that individual privacy goes unprotected, but that national security is compromised because it is increasingly based on data mining initiatives that are untested, ill focused, and rely on inaccurate or incomplete data. These shortcomings, and the urgent need for Congress to act to address them, have been widely recognized by numerous public and private commissions, but largely ignored by members of Congress - republicans and democrats alike. The article concludes that there is wide agreement about both the need to restore some limits on the government's use of personal data and the form that those limits should take. The problem is the unwillingness - or inability - of Congress to act.
Article
The practices of public surveillance, which include the monitoring of individuals in public through a variety of media (e.g., video, data, online), are among the least understood and controversial challenges to privacy in an age of information technologies. The fragmentary nature of privacy policy in the United States reflects not only the oppositional pulls of diverse vested interests, but also the ambivalence of unsettled intuitions on mundane phenomena such as shopper cards, closed-circuit television, and biometrics. This Article, which extends earlier work on the problem of privacy in public, explains why some of the prominent theoretical approaches to privacy, which were developed over time to meet traditional privacy challenges, yield unsatisfactory conclusions in the case of public surveillance. It posits a new construct, "contextual integrity," as an alternative benchmark for privacy, to capture the nature of challenges posed by information technologies. Contextual integrity ties adequate protection for privacy to norms of specific contexts, demanding that information gathering and dissemination be appropriate to that context and obey the governing norms of distribution within it. Building on the idea of "spheres of justice," developed by political philosopher Michael Walzer, this Article argues that public surveillance violates a right to privacy because it violates contextual integrity; as such, it constitutes injustice and even tyranny.
Article
A broad coalition, including companies formerly opposed to the enactment of privacy statutes, has now formed behind the idea of a national information privacy law. Among the benefits that proponents attribute to such a law is that it would harmonize the U.S. regulatory approach with that of the European Union (E.U.) and possibly minimize international regulatory conflicts about privacy. This essay argues, however, that it would be a mistake for the United States to enact a comprehensive or omnibus federal privacy law for the private sector that preempts sectoral privacy law. In a sectoral approach, a privacy statute regulates only a specific context of information use. An omnibus federal privacy law would be a dubious proposition because of its impact on experimentation in federal and state sectoral laws, and the consequences of ossification in the statute itself. In contrast to its skepticism about a federal omnibus statute, this essay views federal sectoral laws as a promising regulatory instrument. The critical question is the optimal nature of a dual federal-state system for information privacy law, and this essay analyzes three aspects of this topic. First, there are general circumstances under which federal sectoral consolidation of state law can bring benefits. Second, the choice between federal ceilings and floors is far from the only preemptive decision that regulators face. Finally, there are second-best solutions that become important should Congress choose to engage in broad sectoral preemption.
Article
Consider three questions. How would one decide if there was too much telecommunications surveillance in the United States, or too little? How would one know if law enforcement was using its surveillance capabilities in the most effective fashion? How would one assess the impact of this collection of information on civil liberties?In answering these questions, a necessary step, the logical first move, would be to examine existing data about governmental surveillance practices and their results. One would also need to examine and understand how the legal system generated these statistics about telecommunications surveillance. Ideally, the information structure would generate data sets that would allow the three questions posed above to be answered. Light might also be shed on other basic issues, such as whether or not the amount of telecommunications surveillance was increasing or decreasing.Such rational inquiry about telecommunications surveillance is, however, largely precluded by the haphazard and incomplete information that the government collects about it. This Article evaluates the main parts of telecommunications surveillance law and the statistics about their use. The critical statutory regulations are (1) the Wiretap Act, (2) the Pen Register Act, (3) the Stored Communications Act, and, for foreign intelligence, (4) the Foreign Intelligence Surveillance Act, and (5) the different provisions for National Security Letters (NSLs).Other parts of the surveillance landscape represent an even greater expanse of blank spaces on the legal map. There are a number of "semi-known unknowns" (to coin a phrase); these are kinds of telecommunications surveillance about which only limited public information exists - this surveillance also occurs outside a detailed legal framework.This Article concludes with the development of the concept of "privacy theater." Currently, the value of the collection of telecommunications statistics is largely ritualistic. It serves to create a myth of oversight. This Article proposes that we go beyond myth and re-dedicate ourselves to the task of creating a telecommunications surveillance law that minimizes the impact of surveillance on civil liberties and maximizes its effectiveness for law enforcement.
Article
This Article is divided into five parts. In Part II, we summarize the development of concepts of privacy and confidentiality, and examine how legal and statistical communities have reacted to two major changes in the privacy landscape of the last twenty years. The first is the challenge brought by new online and potentially privacy invasive technologies. The second is the privacy and security effects of September 11th. We argue that law's innovations, perhaps reactive to overwrought public concerns regarding privacy and the aftershocks of September 11th, represent real challenges to traditional notions of data access, confidentiality, and respondent trust. Finally, we review statistical societies' quick condemnation of many of the privacy- and trust-erosive measures contained in post-September 11th legislation and their lobbying to overturn the most egregious of these. In Part III, we provide a basic framework for enacting future privacy legislation to govern federal statistical agencies. We argue that federal agencies, and the laws that govern their conduct, should be framed with three broad principles in mind: (i) a commitment to bureaucratic justice; (ii) a level of regulatory coherence that justifies choices made; and (iii) a flexibility that allows anticipation of, rather than merely post hoc response to, challenges posed by future technology and methods. In so doing, we note that law's previous attempts to maneuver between these principles have often resulted in second-best solutions, changing with shifting public opinions that are, themselves, based on various cognitive processes that result in consistent and recognizable errors. We argue that laws that fail to consider methods available in computer science and statistics often underestimate the privacy threats posed by seemingly nonindividually identifiable data disclosures. Such laws also overlook, to the detriment of the trust-based statistical regime that has proven so successful for two centuries, the importance of statistical associations' ethical standards for determining appropriate disclosures. In Part IV, we set out various statistical methods that have been, and could be, employed to protect the confidentiality of data released to the public or government agencies. In describing these options, we discuss the limitations inherent in each and note the law's continuing role in enforcing these measures. Finally, in Part V, we conclude that increased dialogue between our disciplines can, and should, result in a better privacy regime that combines sound statistical methods designed to protect anonymity where appropriate, with legal enactments that promote legitimate sharing yet punish abuses. In so doing, we provide some specific approaches for legislating privacy that account for the trust needed to make statistical programs successful and for the data access and disclosure so important for the challenges currently facing this country.
Article
United States statistical programs and practices are among the best in the world. Lurking underneath this success, however, is a riddle - given the potential for abuse and harm, why do Americans willingly hand over their personal information to government data collection programs? In this article, we seek an answer to this riddle by examining the evolution of United States legal and statistical programs, with a particular focus on the United States Census of Population. In so doing, we explore the statistical programs, policies, regulations, and codes of ethics that have evolved in the United States over the past two centuries. We conclude that the willingness of individuals to disclose their personal information is not linked to programs of legal coercion or to simple cost/benefit analyses. Instead, we note that the intent of United States statistical programs has been to increase the level of trust and confidence that individuals have that their information will be kept strictly confidential. Various legal frameworks and the promulgation of statistical society codes of ethics buttress our basic conclusion that trust is an essential characteristic of a successful and efficient modern statistical program. We conclude by noting some recent developments that may threaten this trust program, including post 9/11 national security efforts, the rise of new data-gathering and analysis technologies, and the increasing use of private data collectors for government statistical programs.
Article
A series of major security breaches at companies with sensitive personal information has sparked significant attention to the problems with privacy protection in the United States. Currently, the privacy protections in the United States are riddled with gaps and weak spots. Although most industrialized nations have comprehensive data protection laws, the United States has maintained a sectoral approach where certain industries are covered and others are not. In particular, emerging companies known as "commercial data brokers" have frequently slipped through the cracks of U.S. privacy law. In this article, the authors propose a Model Privacy Regime to address the problems in the privacy protection in the United States, with a particular focus on commercial data brokers. Since the United States is unlikely to shift radically from its sectoral approach to a comprehensive data protection regime, the Model Regime aims to patch up the holes in existing privacy regulation and improve and extend it. In other words, the goal of the Model Regime is to build upon the existing foundation of U.S. privacy law, not to propose an alternative foundation. The authors believe that the sectoral approach in the United States can be improved by applying the Fair Information Practices - principles that require the entities that collect personal data to extend certain rights to data subjects. The Fair Information Practices are very general principles, and they are often spoken about in a rather abstract manner. In contrast, the Model Regime demonstrates specific ways that they can be incorporated into privacy regulation in the United States.
Article
Nothing in society poses as grave a threat to privacy as the Internet Service Provider (ISP). ISPs carry their users' conversations, secrets, relationships, acts, and omissions. Until the very recent past, they had left most of these alone because they had lacked the tools to spy invasively, but with recent advances in eavesdropping technology, they can now spy on people in unprecedented ways. Meanwhile, advertisers and copyright owners have been tempting them to put their users' secrets up for sale, and judging from a recent flurry of reports, ISPs are giving in to the temptation and experimenting with new forms of spying. This is only the leading edge of a coming storm of unprecedented and invasive ISP surveillance. This Article proposes an innovative new theory of communications privacy to help policymakers strike the proper balance between user privacy and ISP need. We cannot simply ban aggressive monitoring, because ISPs have legitimate reasons for scrutinizing communications on an Internet teeming with threats. Using this new theory, policymakers will be able to distinguish between an ISP's legitimate needs and mere desires.In addition, this Article injects privacy into the network neutrality debate - a debate about who gets to control innovation on the Internet. Despite the thousands of pages that have already been written about the topic, nobody has recognized that we already enjoy mandatory network neutrality in the form of expansive wiretapping laws. The recognition of this idea will flip the status quo and reinvigorate a stagnant debate by introducing privacy and personal autonomy into a discussion that has only ever been about economics and innovation.
Conference Paper
Re-identification is a major privacy threat to public datasets containing individual records. Many privacy protection algorithms rely on generalization and suppression of "quasi-identifier" attributes such as ZIP code and birthdate. Their objective is usually syntactic sanitization: for example, k-anonymity requires that each "quasi-identifier" tuple appear in at least k records, while l-diversity requires that the distribution of sensitive attributes for each quasi-identifier have high entropy. The utility of sanitized data is also measured syntactically, by the number of generalization steps applied or the number of records with the same quasi-identifier. In this paper, we ask whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any privacy. By contrast, we measure the tradeoff between privacy (how much can the adversary learn from the sanitized records?) and utility, measured as accuracy of data-mining algorithms executed on the same sanitized records. For our experimental evaluation, we use the same datasets from the UCI machine learning repository as were used in previous research on generalization and suppression. Our results demonstrate that even modest privacy gains require almost complete destruction of the data-mining utility. In most cases, trivial sanitization provides equivalent utility and better privacy than k-anonymity, l-diversity, and similar methods based on generalization and suppression.
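The "trivial sanitization" baseline the paper compares against can be sketched as follows, with fabricated records: the quasi-identifier columns and the sensitive column are published as two unlinked tables, so aggregate data mining over sensitive values remains possible while record-level linkage does not.

import random

records = [
    {"zip": "02139", "age": 34, "diagnosis": "asthma"},
    {"zip": "60614", "age": 47, "diagnosis": "diabetes"},
    {"zip": "02141", "age": 29, "diagnosis": "flu"},
]
QI, SENSITIVE = ("zip", "age"), "diagnosis"

qi_table = [{a: r[a] for a in QI} for r in records]
sensitive_table = [r[SENSITIVE] for r in records]
random.shuffle(sensitive_table)   # break the row-order link between the two tables

print(qi_table)
print(sensitive_table)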
Conference Paper
According to a famous study (10) of the 1990 census data, 87% of the US population can be uniquely identified by gender, ZIP code and full date of birth. This short paper revisits the uniqueness of simple demographics in the US population based on the most recent census data (the 2000 census). We offer a detailed, comprehensive and up-to-date picture of the threat to privacy posed by the disclosure of simple demographic information. Our results generally agree with the findings of (10), although we find that disclosing one's gender, ZIP code and full date of birth allows for unique identification of fewer individuals (63% of the US population) than reported in (10). We hope that our study will be a useful reference for privacy researchers who need simple estimates of the comparative threat of disclosing various demographic data.
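The statistic in question is just the share of people whose (gender, ZIP, date of birth) combination is unique. A toy computation on fabricated records is shown below; real estimates of course require census-scale counts.

from collections import Counter

people = [
    ("F", "02139", "1985-03-02"),
    ("M", "02139", "1985-03-02"),
    ("F", "02139", "1985-03-02"),   # shares its combination with the first entry
    ("F", "60614", "1978-11-30"),
]
counts = Counter(people)
unique_share = sum(1 for p in people if counts[p] == 1) / len(people)
print(unique_share)   # 0.5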
Article
Research that utilizes human subjects is a large and growing enterprise. Tens of millions of individuals have already participated as subjects in one or more research protocols, and millions more participate each year. Government and industry combined spend billions annually to support as many as 20,000 research studies, many of which are individually large and complex enterprises in their own right.These numbers are, if anything, likely to increase even further. Besides the growth in research, two other trends are apparent. First, research-related litigation is on the rise and appears likely to become even more widespread. Sparked at least in part by recent widely publicized instances of harm befalling research subjects, plaintiffs' attorneys are suing both more often and more creatively. Related to this is the second trend: public trust in research is declining and, as a result, at least some types of research are struggling to find adequate numbers of human subjects.As a result of these trends, exposure to potential liability and public perception are both increasingly important. Concomitant with all of this research is the discovery and generation of tremendous quantities of data specific to individual subjects, including--but not limited to--genetic information. Much of this data is irrelevant to subjects' interests because it lacks predictive value, has uncertain meaning, or is otherwise uninformative. Some, however, is different--some of the personal data learned during the course of research with human subjects bears directly on individuals' health. Despite the fact that much individual data has already been generated and that both the quantity and the quality of data generated seem likely to increase, there is a lack of clear guidance for researchers regarding whether and when such information should be divulged to the subjects on whom it bears.In this environment, the potential exists for litigation alleging that a researcher was negligent for failure to disclose to a subject an individual research finding of medical significance. Such litigation would raise a heretofore-unanswered question: should a researcher have a legal duty to disclose medically significant individual research findings to a subject?
Article
Information about an individual's place and date of birth can be exploited to predict his or her Social Security number (SSN). Using only publicly available information, we observed a correlation between individuals' SSNs and their birth data and found that for younger cohorts the correlation allows statistical inference of private SSNs. The inferences are made possible by the public availability of the Social Security Administration's Death Master File and the widespread accessibility of personal information from multiple sources, such as data brokers or profiles on social networking sites. Our results highlight the unexpected privacy consequences of the complex interactions among multiple data sources in modern information economies and quantify privacy risks associated with information revelation in public forums.
Article
In the trauma surrounding mass disasters, the need to identify victims accurately and as soon as possible is critical. DNA identification testing is increasingly used to identify human bodies and remains where the deceased cannot be identified by traditional means. This form of testing compares DNA taken from the body of the deceased with DNA taken from their personal items (e.g. hairbrush, toothbrush etc.) or from close biological relatives. DNA identification testing was used to identify the victims of the terrorist attack on the World Trade Center in New York on September 11, 2001, and of the victims of the Tsunami that hit Asia on December 26, 2004. Shortly after the 9/11 attack, police investigators asked the victims' families for personal items belonging to the missing, and for DNA samples from family members themselves. The New York medical examiner's office coordinated the DNA identification testing program; however, some of the identification work was contracted out to private laboratories.
Article
Large-scale genomic databases are becoming increasingly common. These databases, and the underlying biobanks, pose several substantial legal and ethical problems. Neither the usual methods for protecting subject confidentiality, nor even anonymity, are likely to protect subjects' identities in richly detailed databases. Indeed, in these settings, anonymity is itself ethically suspect. New methods of consent will need to be created to replace the blanket consent common to such endeavors, with a consent procedure that gives subjects some real control over what they might consider inappropriate use of their information and biological material. Through their use, these biobanks are also likely to yield information that will be of some clinical significance to the subjects, information that they should have access to. Failure to adjust to these new challenges is not only legally and ethically inappropriate, but puts at risk the political support on which biomedical research depends.