Article

The discombobulation of de-identification

Authors: Mark Phillips, Bartha M. Knoppers

... To protect participants from privacy risks that may arise as a result of the use of identifiable samples and data, research materials may be anonymized or irreversibly de-identified (Rothstein 2010). This is a process by which all identifying information connected with the materials is removed and destroyed (Phillips and Knoppers 2016). While in this document we will use the term "anonymized" to refer to this level of identifiability, it is important to note the inconsistency in terminology throughout the normative documents and literature on this topic. Notably, some normative documents use the term "de-identified" instead, particularly when the identifying information is irreversibly removed from the samples and data (Phillips and Knoppers 2016). ...
... Arguably, the current ambiguity in the normative documents has left institutions and researchers "without a framework for their deliberations and decisions" (de Vries, Munung et al. 2017). This inconsistency in the terminology surrounding identifiability has also been observed and critiqued in the literature and has led to a call for the "harmonization of terminology" (Phillips and Knoppers 2016). ...
Article
While the anonymization of biological samples and data may help protect participant privacy, there is still debate over whether this alone is a sufficient safeguard to ensure the ethical conduct of research. The purpose of this systematic review is to examine whether the review of an ethics committee is necessary in the context of anonymized research, and what the considerations in such an ethics review would be. The review of normative documents issued by both national- and international-level organizations reveals a growing concern over the ability of anonymization procedures to protect against reidentification. This is particularly true in the context of genomic research, where genetic material's uniquely identifying nature, along with advances in technology, has complicated previous standards of identifiability. Even where individuals may not be identifiable, there is a risk of group harm that anonymization alone may not prevent. We conclude that the majority of normative documents support the position that ethics committee review is necessary to address the concerns associated with the use of anonymized samples and data for research.
... Several principles and norms for governing health data research have been put forward. However, a lack of harmonization and widespread confusion about their meaning prevails (Kalkman et al., 2019a; Phillips and Knoppers, 2016). This can be witnessed in the many open-ended and ambiguous conditions guiding the local research sites and databases that participate in DHRNs and inform their governance (Kaye et al., 2018). ...
... This often results in the mere aggregation of conditions, principles and norms. However, subsequent effort to facilitate convergence, such as by harmonizing agreements and policies, is lacking (Kalkman et al., 2019a; Phillips and Knoppers, 2016). Similarly, in-context operationalization of conditions to guide collaborative practice requires further and far-reaching effort. ...
Article
Full-text available
Current challenges to sustaining public support for health data research have directed attention to the governance of data-intensive health research networks. Accountability is hailed as an important element of trustworthy governance frameworks for data-intensive health research networks. Yet the extent to which adequate accountability regimes in data-intensive health research networks are currently realized is questionable. Current governance of data-intensive health research networks is dominated by the limitations of a drawing board approach. As a way forward, we propose a stronger focus on accountability as learning to achieve accountable governance. As an important step in that direction, we provide two pathways: (1) developing an integrated structure for decision-making and (2) establishing a dialogue in ongoing deliberative processes. Suitable places for learning accountability to thrive are dedicated governing bodies as well as specialized committees, panels or boards which bear and guide the development of governance in data-intensive health research networks. A continuous accountability process which comprises learning and interaction accommodates the diversity of expectations, responsibilities and tasks in data-intensive health research networks to achieve responsible and effective governance.
... This could affect the sharing, security, and confidentiality of that data. Our findings support the notion of what Phillips and Knoppers have labelled a "Babel-like lexicon for de-identified data" [49]. While funders and research organisations push towards increased data sharing, there are legal duties that require 'de-identification' to protect privacy. ...
... For example, the terms 'anonymous', 'anonymised' and 'de-identified' seemed to be used interchangeably. Yet there is a moral difference between collecting data without direct identifiers and removing those direct identifiers later on [49]. We recommend that a governance framework (1) clearly defines the terms it uses and (2) goes beyond simply acknowledging the limits of anonymity and/or requiring de-identification at all costs (at the expense of data quality). ...
Article
Full-text available
Background Large-scale linkage of international clinical datasets could lead to unique insights into disease aetiology and facilitate treatment evaluation and drug development. To this end, multi-stakeholder consortia are currently designing several disease-specific translational research platforms to enable international health data sharing. Despite the recent adoption of the EU General Data Protection Regulation (GDPR), the procedures for how to govern responsible data sharing in such projects have yet to be spelled out. In search of a first, basic outline of an ethical governance framework, we set out to explore relevant ethical principles and norms. Methods We performed a systematic review of literature and ethical guidelines for principles and norms pertaining to data sharing for international health research. Results We observed an abundance of principles and norms with considerable convergence at the aggregate level of four overarching themes: societal benefits and value; distribution of risks, benefits and burdens; respect for individuals and groups; and public trust and engagement. However, at the level of principles and norms we identified substantial variation in the phrasing and level of detail, the number and content of norms considered necessary to protect a principle, and the contextual approaches in which principles and norms are used. Conclusions While providing some helpful leads for further work on a coherent governance framework for data sharing, the current collection of principles and norms prompts important questions about how to streamline terminology regarding de-identification and how to harmonise the identified principles and norms into a coherent governance framework that promotes data sharing while securing public trust.
... Moreover, Article 21-2 should not cover this type of usage of combined data because of its potential to be reidentified in such large-scale public datasets (de Montjoye et al. 2013; Rodriguez et al. 2013) and the inherent risk of re-identifiability of public biobanks linked with a geo-information system (Bovenberg et al. 2016). Arguably, consent exemption, even for academic research, should not apply, and separate consent from the data subject should be required to collect and use location information, particularly given the contentious use of the term 'de-identification' (Phillips and Knoppers 2016). ...
... Moreover, privacy regulations applicable to data de-identification or anonymization will need to be changed to better align with international data sharing standards (Phillips and Knoppers 2016;Kwon et al. 2016). Pseudonymization is an example of a concept that is not introduced in the PIPA. ...
Article
Full-text available
With rapid developments in genomic and digital technologies, genomic data sharing has become a key issue for the achievement of precision medicine in South Korea. The legal and administrative framework for data sharing and protection in this country is currently under intense scrutiny from national and international stakeholders. Policymakers are assessing the relevance of specific restrictions in national laws and guidelines for better alignment with international approaches. This manuscript will consider key issues in international genome data sharing in South Korea, including consent, privacy, security measures, compatible adequacy and oversight, and map out an approach to genomic data sharing that recognizes the importance of patient engagement and responsible use of data in South Korea.
... In fact, many techniques used in information technology today can reduce the level of identifiability of data. For instance, it is possible to replace directly identifiable information (e.g., name, address, health insurance number) with a code, add random "noise" to datasets, or use masking techniques such as data swapping, encryption, tokenization, aggregation or other methods (El Emam, 2007; El Emam, 2011; Phillips & Knoppers, 2016). As a result, information that was initially identifiable to an individual can become de-identified. ...
... Moreover, as illustrated in Table 1, there are significant terminological differences when referring to research data (Phillips & Knoppers, 2016). For example, the terms anonymous, anonymized, non-identifying, de-identified, coded and pseudonymized are often used in genomics, but their meaning is not equivalent from one jurisdiction to another. ...
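As a rough illustration of two of the techniques mentioned above, the Python sketch below replaces a direct identifier with a stable code (coding/pseudonymization) and perturbs a quasi-identifying attribute with random noise. The record fields, salt handling, and noise scale are illustrative assumptions, not details from the cited works.

```python
import hashlib
import random

# Hypothetical record; the field names are illustrative only.
record = {"name": "A. Tremblay", "health_insurance_no": "TREA12345678", "age": 47}

SECRET_SALT = "store-this-secret-separately-from-the-data"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a stable code. Whoever holds the
    salt (or an equivalent lookup table) can re-link; others cannot."""
    return hashlib.sha256((SECRET_SALT + value).encode()).hexdigest()[:12]

def add_noise(value: float, scale: float = 2.0) -> float:
    """Perturb a quasi-identifying numeric attribute with Gaussian noise."""
    return value + random.gauss(0, scale)

deidentified = {
    "subject_code": pseudonymize(record["health_insurance_no"]),
    "age": round(add_noise(record["age"])),
}
print(deidentified)  # e.g. {'subject_code': '9f3ac1...', 'age': 45}
```

Note that the output is de-identified rather than anonymous: the holder of the salt can regenerate the codes, which is precisely the distinction on which the terminology dispute turns.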
... Prior work ignores this nebulous 18th category. One reason the category is ill-defined is the existence of quasi-identifiers: pieces of personally identifiable information (PII) that do not fall under any single category and therefore are difficult to identify and label in the general case (Phillips and Knoppers, 2016). Even data that has all of the categories from Safe Harbor removed may still be reidentified through quasi-identifiers (Angiuli et al., 2015). ...
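The linkage risk that quasi-identifiers create can be made concrete with a k-anonymity check: the smallest group of records sharing the same quasi-identifier values determines how exposed the dataset is. A minimal sketch, with toy data standing in for a Safe Harbor-processed table:

```python
from collections import Counter

# Toy rows that retain quasi-identifiers (ZIP prefix, birth year, sex)
# even after the 18 Safe Harbor identifier categories are removed.
rows = [
    ("021*", 1975, "F"), ("021*", 1975, "F"),
    ("021*", 1980, "M"), ("100*", 1975, "F"),
]

def k_anonymity(rows):
    """Return k: the size of the smallest group of rows with identical
    quasi-identifier values. k == 1 means some row is unique and thus
    potentially re-identifiable by linkage with an outside dataset."""
    return min(Counter(rows).values())

print(k_anonymity(rows))  # 1 -> the two singleton rows are linkage risks
```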
Preprint
Deidentification seeks to anonymize textual data prior to distribution. Automatic deidentification primarily uses supervised named entity recognition from human-labeled data points. We propose an unsupervised deidentification method that masks words that leak personally identifying information. The approach utilizes a specially trained reidentification model to identify individuals from redacted personal documents. Motivated by K-anonymity-based privacy, we generate redactions that ensure a minimum reidentification rank for the correct profile of the document. To evaluate this approach, we consider the task of deidentifying Wikipedia Biographies, and evaluate using an adversarial reidentification metric. Compared to a set of unsupervised baselines, our approach deidentifies documents more completely while removing fewer words. Qualitatively, we see that the approach eliminates many identifying aspects that would fall outside the scope of common named-entity-based approaches.
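The preprint's objective can be pictured as a greedy masking loop that raises the rank of the document's true profile under a reidentification model until a K-anonymity-style threshold is met. In the sketch below, `rank_of_true_profile` is a hypothetical stand-in for the trained reidentification model described in the abstract, not the authors' actual interface:

```python
def redact_until_rank_k(words, rank_of_true_profile, k=10):
    """Greedily mask the word whose removal most increases the rank of the
    true profile, until that rank reaches at least k. `rank_of_true_profile`
    is an assumed callable: list of tokens -> rank of the correct profile."""
    words = list(words)
    while rank_of_true_profile(words) < k:
        best_i, best_rank = None, rank_of_true_profile(words)
        for i, w in enumerate(words):
            if w == "[MASK]":
                continue
            trial = words[:i] + ["[MASK]"] + words[i + 1:]
            r = rank_of_true_profile(trial)
            if r > best_rank:
                best_i, best_rank = i, r
        if best_i is None:  # no single mask improves the rank; give up
            break
        words[best_i] = "[MASK]"
    return words
```

The published method is more sophisticated, but the loop shows why such redaction can remove fewer words than blanket named-entity masking: only tokens that measurably help an adversary are touched.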
... Privacy. We envisaged that DMS would raise complex questions around participant privacy, including, for example, where the data are inherently identifiable (e.g., photographs of participants), where genomic data collected have potential relevance for genetic relatives, and where data linkage increases re-identification risk (Narayanan et al., 2016; Phillips & Knoppers, 2016). ...
Article
'Digital Mega-Studies' are entirely or extensively digitised, longitudinal, population-scale initiatives, collecting, storing, and making available individual-level research data of different types and from multiple sources, shaped by technological developments and unforeseeable risks over time. The Australian 'Gen V' project exemplifies this new research paradigm. In 2019, we undertook a multidisciplinary, multi-stakeholder process to map Digital Mega-Studies' key characteristics, legal and governance challenges, and likely solutions. We conducted large and small group processes within a one-day symposium and directed online synthesis and group prioritisation over subsequent weeks. We present our methods (including elicitation, affinity mapping and prioritisation processes) and findings, proposing six priority governance principles across three areas (data, participation, trust) to support future high-quality, large-scale digital research in health.
... Many legislatures require researchers to "de-identify" human subject data upon collection, by removing or deleting any sensitive personal information and contact details. However, when researchers have an imperfect understanding of the legal obligations and dispensations, they can end up over- or under-sharing, risking either exposing participants to identification or undermining the replicability of their own work and the potential reuse value of their data (Phillips and Knoppers [30]). Furthermore, it is important to note that the European GDPR prohibits storing "personal data" (information which can identify a natural person, whether directly or indirectly) on a server outside the EU unless it offers an "adequate level of protection". ...
Article
Adopting open science principles can be challenging, requiring conceptual education and training in the use of new tools. This paper introduces the Workflow for Open Reproducible Code in Science (WORCS): a step-by-step procedure that researchers can follow to make a research project open and reproducible. This workflow intends to lower the threshold for adoption of open science principles. It is based on established best practices, and can be used either in parallel to, or in absence of, top-down requirements by journals, institutions, and funding bodies. To facilitate widespread adoption, the WORCS principles have been implemented in the R package worcs, which offers an RStudio project template and utility functions for specific workflow steps. This paper introduces the conceptual workflow, discusses how it meets different standards for open science, and addresses the functionality provided by the R implementation, worcs. This paper is primarily targeted towards scholars conducting research projects in R that involve academic prose, analysis code, and tabular data. However, the workflow is flexible enough to accommodate other scenarios, and offers a starting point for customized solutions. The source code for the R package and manuscript, and a list of examples of WORCS projects, are available at https://github.com/cjvanlissa/worcs.
... Despite the problems with it [5], de-identification, in which health data are stripped of any information that could be used to identify the participant (such as name, social security number, address), has long been hailed as a way to protect people's privacy in research [6]. Yet because of conflicting terminology and gaps in understanding, researchers rarely know what standard they must meet for their data to be properly anonymized or 'pseudonymized' (in which a code enables individuals to be re-identified) [7]. What's more, laws are difficult to enforce in practice because it is often unclear how breaches of confidentiality occurred, or which organization or researcher was responsible [8,9]. ...
Article
Efforts to protect people's privacy in a massive international cancer project offer lessons for data sharing.
... As a consequence, the fundamental concepts remain either unknown to the research community or difficult to comprehend. Adding to the confusion, a well-documented and long-standing ambiguity exists in the vocabulary used by those who contribute to the practice [20,21]. In particular, the terms 'de-identification' and 'anonymization' have been used with different meanings by researchers. ...
Article
Full-text available
Background: The secondary use of health data is central to biomedical research in the era of data science and precision medicine. National and international initiatives, such as the Global Open Findable, Accessible, Interoperable, and Reusable (GO FAIR) initiative, are supporting this approach in different ways (eg, making the sharing of research data mandatory or improving the legal and ethical frameworks). Preserving patients' privacy is crucial in this context. De-identification and anonymization are the two most common terms used to refer to the technical approaches that protect privacy and facilitate the secondary use of health data. However, it is difficult to find a consensus on the definitions of the concepts or on the reliability of the techniques used to apply them. A comprehensive review is needed to better understand the domain, its capabilities, its challenges, and the ratio of risk between the data subjects' privacy on one side, and the benefit of scientific advances on the other. Objective: This work aims at better understanding how the research community comprehends and defines the concepts of de-identification and anonymization. A rich overview should also provide insights into the use and reliability of the methods. Six aspects will be studied: (1) terminology and definitions, (2) backgrounds and places of work of the researchers, (3) reasons for anonymizing or de-identifying health data, (4) limitations of the techniques, (5) legal and ethical aspects, and (6) recommendations of the researchers. Methods: Based on a scoping review protocol designed a priori, MEDLINE was searched for publications discussing de-identification or anonymization and published between 2007 and 2017. The search was restricted to MEDLINE to focus on the life sciences community. The screening process was performed by two reviewers independently. Results: Of the 7972 records that matched at least one search term, 135 publications were screened and 60 full-text articles were included. (1) Terminology: Definitions of the terms de-identification and anonymization were provided in less than half of the articles (29/60, 48%). When both terms were used (41/60, 68%), their meanings divided the authors into two equal groups (19/60, 32% each) with opposed views. The remaining articles (3/60, 5%) were equivocal. (2) Backgrounds and locations: Research groups were based predominantly in North America (31/60, 52%) and in the European Union (22/60, 37%). The authors came from 19 different domains; computer science (91/248, 36.7%), biomedical informatics (47/248, 19.0%), and medicine (38/248, 15.3%) were the most prevalent ones. (3) Purpose: The main reason declared for applying these techniques is to facilitate biomedical research. (4) Limitations: Progress is being made on specific techniques but, overall, limitations remain numerous. (5) Legal and ethical aspects: Differences exist between nations in the definitions, approaches, and legal practices. (6) Recommendations: The combination of organizational, legal, ethical, and technical approaches is necessary to protect health data. Conclusions: Interest is growing for privacy-enhancing techniques in the life sciences community. This interest crosses scientific boundaries, involving primarily computer science, biomedical informatics, and medicine. The variability observed in the use of the terms de-identification and anonymization emphasizes the need for clearer definitions as well as for better education and dissemination of information on the subject. The same observation applies to the methods. Several laws, such as the American Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR), regulate the domain. Using the definitions they provide could help address the variable use of these two concepts in the research community.
... If data are perturbed, redacted, or subject to other processing, and the outcome is that there are no means reasonably likely to be used to relate the data to an identifiable individual, then the data are more properly described as anonymized. This does not overlook the rich literature relating to the potential for re-identification of genetic data (e.g., Phillips and Knoppers 2016; Cai et al. 2015). That literature simply informs an understanding of what contextual controls would be necessary to establish that any risk of (re)identification, in the circumstances, has been sufficiently mitigated. ...
Article
Full-text available
In the United Kingdom (UK), transfer of genomic data to third countries is regulated by data protection legislation. This is a composite of domestic and European Union (EU) law, with EU law to be adopted as domestic law when Brexit takes place. In this paper we consider the content of data protection legislation and the likely impact of Brexit on transfers of genomic data from the UK to other countries. We examine the advice by regulators not to rely upon consent as a lawful basis for processing under data protection law, at least not when personal data is used for research purposes, and consider some of the other ways in which the research context can qualify an individual’s ability to exercise control over processing operations. We explain how the process of pseudonymisation is to be understood in the context of transfer of genomic data to third parties, as well as how adequacy of data protection in a third country is to be determined in general terms. We conclude with reflections on the future direction of UK data protection law post Brexit with the reclassification of the UK itself as a third country.
... It is possible, for example, to replace directly identifying data (e.g., name, address, health insurance number) with a code, to add "noise" to the data, or to use masking techniques such as data swapping, encryption, tokenization, aggregation, or other methods (El Emam, 2007; El Emam, 2011; Phillips & Knoppers, 2016). In this way, personal information initially considered identifying can now readily be de-identified. Genomic research raises particular concerns because a complete DNA sequence is (generally speaking) unique to an individual. ...
... One form of information suppression is de-identification (Uzuner, Luo, and Szolovits 2007), which may involve removing personal identifiers completely (anonymization) or replacing each identifier in the dataset with a unique key (pseudonymization or coding; Phillips and Knoppers 2016). The effectiveness of such approaches is vigorously debated, with some arguing that essentially any data can be reidentified via linkage with other datasets or knowledge (Barocas and Nissenbaum 2014; Ohm 2010), and others arguing that such attacks are often noisy and thus may not be revealing for more than a few individuals. ...
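The distinction drawn here is easy to state in code. Assuming identifier spans have already been detected (e.g., by a named-entity recognizer), anonymization discards them irreversibly, while pseudonymization keeps a key table that permits re-linkage; the example text and spans below are invented for illustration:

```python
text = "Jane Roe visited Dr. Smith on 3 May."
spans = [(0, 8), (17, 26)]  # (start, end) of detected identifiers, assumed given

def anonymize(text, spans):
    """Remove identifiers irreversibly (no way back to the names)."""
    out, prev = [], 0
    for s, e in spans:
        out.append(text[prev:s] + "[REDACTED]")
        prev = e
    return "".join(out) + text[prev:]

def pseudonymize(text, spans, key_table):
    """Replace each identifier with a stable key; key_table re-links."""
    out, prev = [], 0
    for s, e in spans:
        key = key_table.setdefault(text[s:e], f"PERSON_{len(key_table) + 1}")
        out.append(text[prev:s] + key)
        prev = e
    return "".join(out) + text[prev:]

keys = {}
print(anonymize(text, spans))           # [REDACTED] visited [REDACTED] on 3 May.
print(pseudonymize(text, spans, keys))  # PERSON_1 visited PERSON_2 on 3 May.
print(keys)                             # the re-identification key, held separately
```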
Article
To use administrative and other new data sources for the scientific study of human beings and human behavior, analysts need to be able both to connect to these new data and to deploy new methods for linking and analyzing the data. These efforts are often hindered by legal, technical, and operational difficulties. In this article, I examine how new digital research infrastructures can be used to reduce such barriers. Experience with data stewardship in other scientific domains shows that appropriate infrastructure can enable the efficient, secure, and collaborative integration of domain expertise, data, and analysis capabilities. I review the state of the art in these areas and argue for the use of cloud-hosted enclaves as a safe interaction point for analysts, data, and software, and as a means of automating and thus professionalizing data stewardship processes.
... The inevitable technical uncertainty has been needlessly exacerbated by a proliferation of technical terms to describe data identifiability and the adoption of contradictory definitions by a variety of entities (Phillips and Knoppers 2016). Among the still-all-too-common consequences are, on the one hand, patently inadequate "anonymization," such as merely removing names while leaving home addresses intact (McLean 2016), and, on the other, needless barriers to medical researchers' access to data sets for analysis (Council of Canadian Academies 2015). ...
Article
Full-text available
The collapse of confidence in anonymization (sometimes also known as de-identification) as a robust approach for preserving the privacy of personal data has incited an outpouring of new approaches that aim to fill the resulting trifecta of technical, organizational, and regulatory privacy gaps left in its wake. In the latter category, and in large part due to the growth of Big Data–driven biomedical research, falls a growing chorus of calls for criminal and penal offences to sanction wrongful re-identification of “anonymized” data. This chorus cuts across the fault lines of polarized privacy law scholarship that at times seems to advocate privacy protection at the expense of Big Data research or vice versa. Focusing on Big Data in the context of biomedicine, this article surveys the approaches that criminal or penal law might take toward wrongful re-identification of health data. It contextualizes the strategies within their respective legal regimes as well as in relation to emerging privacy debates focusing on personal data use and data linkage and assesses the relative merit of criminalization. We conclude that this approach suffers from several flaws and that alternative social and legal strategies to deter wrongful re-identification may be preferable.
Article
Background Pharmaceutical drug development rarely addresses a study participant's control of their genomic data, how to return individual incidental findings, and how to make use of genomic data more efficiently for exploratory research purposes. Mutually beneficial solutions to these issues are needed, as whole genome sequencing (WGS) is increasingly adopted in human research and as access to such information could provide impactful health-related information for a participant. Methods In this paper, we offer a novel framework to align a trial sponsor's need for broad exploratory research of the human genome with the study participant's right to data access and access control. The Exploratory Genetic Research Project (EGRP) aims to gather WGS on all participants of a sponsor's clinical trials. It is set up as a separate umbrella protocol to facilitate the consenting process, as well as the delineation between clinical trial endpoints versus exploratory future research. Conclusion This concept establishes a participant's autonomy regarding access to genomic data and the disclosure of actionable incidental findings. The feasibility of EGRP will be tested and reassessed as it is deployed over the next few years.
Article
Full-text available
Background We aimed to examine the ethical concerns Singaporeans have about sharing health-data for precision medicine (PM) and identify suggestions for governance strategies. Just as Asian genomes are under-represented in PM, the views of Asian populations about the risks and benefits of data sharing are under-represented in prior attitudinal research. Methods We conducted seven focus groups with 62 participants in Singapore from May to July 2019. They were conducted in three languages (English, Mandarin and Malay) and analysed with qualitative content and thematic analysis. Results Four key themes emerged: nuanced understandings of data security and data sensitivity; trade-offs between data protection and research benefits; trust (and distrust) in the public and private sectors; and governance and control options. Participants were aware of the inherent risks associated with data sharing for research. Participants expressed conditional support for data sharing, including genomic sequence data and information contained within electronic medical records. This support included sharing data with researchers from universities and healthcare institutions, both in Singapore and overseas. Support was conditional on the perceived social value of the research and appropriate de-identification and data security processes. Participants suggested that a data sharing oversight body would help strengthen public trust and comfort in data research for PM in Singapore. Conclusion Maintenance of public trust in data security systems and governance regimes can enhance participation in PM and data sharing for research. Contrary to themes in much prior research, participants demonstrated a sophisticated understanding of the inherent risks of data sharing, analysed trade-offs between risks and potential benefits of PM, and often adopted an international perspective.
Article
Precision medicine is profoundly more data-intensive than conventional medicine, and its distinctive informational needs present new challenges for healthcare management. Data protection and privacy law are key determinants in precision medicine's future. This article examines legal and regulatory barriers to the incorporation of precision medicine into healthcare. Specific attention is paid to analyzing recent health privacy laws, court cases, and medical device regulations. Considering the challenges identified, recommendations and guidance are crafted for health leaders with reference to domestic and international initiatives.
Article
Genetic data contain sensitive health and non-health-related information about individuals and their family members. Therefore, adopting adequate privacy safeguards is paramount when processing genetic data for research or clinical purposes. One of the major legal instruments for personal data protection in the EU is the new General Data Protection Regulation (GDPR), which entered into force in May 2016 and repealed Directive 95/46/EC, with the ultimate goal of enhancing the effectiveness and harmonization of personal data protection in the EU. This paper explores the major provisions of the new Regulation with regard to processing genetic data, and assesses the influence of such provisions on reinforcing the legal safeguards when sharing genetic data for research purposes. The new Regulation attempts to elucidate the scope of personal data by recognizing pseudonymized data as personal (identifiable) data and including genetic data in the catalog of special categories of data (sensitive data). Moreover, a set of new rules is laid out in the Regulation for processing personal data under the scientific research exemption. For instance, further use of genetic data for scientific research purposes without obtaining additional consent will be allowed if specific conditions are met. The new Regulation has already fueled concerns among various stakeholders, owing to the challenges that may emerge when implementing the Regulation across countries. Notably, the provided definition of pseudonymized data has been criticized because it leaves too much room for interpretation, and it might undermine the harmonization of data protection across countries.
Article
Full-text available
Genome-wide association studies (GWAS) are commonly applied to human genomic data to understand the causal gene combinations statistically connected to certain diseases. Homer et al. (2008) show that patients involved in these GWAS could be reidentified when the studies release statistical information on a large number of SNPs. Subsequent work, however, finds that such privacy attacks are only theoretically possible, and unsuccessful and unconvincing in real settings. We derive the first practical privacy attack algorithm that can successfully identify specific individuals from limited published associations from the Wellcome Trust Case Control Consortium (WTCCC) dataset. For GWAS results computed over 25 randomly selected loci, our algorithm always pinpoints at least one patient from the WTCCC dataset. Moreover, the number of re-identified patients grows rapidly with the number of published genotypes. Finally, we discuss prevention methods to disable the attack, thus providing a security solution enhancing patient privacy. Proofs of the theorems and additional experimental results are available in the supporting online documents. The attack algorithm code is publicly available at https://sites.google.com/site/zhangzhenjie/GWAS_attack.zip. The genomic dataset used in the experiments is available on request at http://www.wtccc.org.uk/.
Article
Full-text available
Author Summary In this report we describe a framework for accurately and robustly resolving whether individuals are in a complex genomic DNA mixture using high-density single nucleotide polymorphism (SNP) genotyping microarrays. We develop a theoretical framework for detecting an individual's presence within a mixture, show its limits through simulation, and finally demonstrate experimentally the identification of the presence of genomic DNA of individuals within a series of highly complex genomic mixtures. Our approaches demonstrate straightforward identification of trace amounts (<1%) of DNA from an individual contributor within a complex mixture. We show how probe-intensity analysis of high-density SNP data can be used, even given the experimental noise of a microarray. We discuss the implications of these findings in two fields: forensics and genome-wide association (GWA) genetic studies. Within forensics, resolving whether an individual is contributing trace amounts of genomic DNA to a complex mixture is a tremendous challenge. Within GWA studies, there is a considerable push to make experimental data publicly available so that the data can be combined with other studies. Our findings show that such an approach does not completely conceal identity, since it is straightforward to assess the probability that a person or relative participated in a GWA study.
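The core of the detection framework summarized above can be reduced to a simple distance statistic: for each SNP, compare the individual's allele frequency with the mixture and with a reference population, and sum the differences. Below is a simplified sketch of that test; the toy values are invented, and the published method additionally builds a null distribution over known non-members to assess significance:

```python
import numpy as np

def detection_statistic(y, mixture_freq, reference_freq):
    """Simplified Homer-style statistic, summed over SNPs.
    y: the individual's allele frequencies per SNP (0.0, 0.5, or 1.0).
    mixture_freq: allele frequencies measured in the DNA mixture.
    reference_freq: allele frequencies in a reference population.
    A positive sum indicates the individual tracks the mixture more
    closely than the reference, i.e., evidence of membership."""
    y, m, p = (np.asarray(a, dtype=float) for a in (y, mixture_freq, reference_freq))
    return float(np.sum(np.abs(y - p) - np.abs(y - m)))

# Toy example with 5 SNPs.
y = [0.5, 1.0, 0.0, 0.5, 1.0]
mix = [0.52, 0.95, 0.05, 0.48, 0.90]
ref = [0.30, 0.60, 0.40, 0.70, 0.50]
print(detection_statistic(y, mix, ref))  # ~1.46 > 0 -> consistent with membership
```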
Book
Massive amounts of data on human beings can now be analyzed. Pragmatic purposes abound, including selling goods and services, winning political campaigns, and identifying possible terrorists. Yet 'big data' can also be harnessed to serve the public good: scientists can use big data to do research that improves the lives of human beings, improves government services, and reduces taxpayer costs. In order to achieve this goal, researchers must have access to this data - raising important privacy questions. What are the ethical and legal requirements? What are the rules of engagement? What are the best ways to provide access while also protecting confidentiality? Are there reasonable mechanisms to compensate citizens for privacy loss? The goal of this book is to answer some of these questions. The book's authors paint an intellectual landscape that includes legal, economic, and statistical frameworks. The authors also identify new practical approaches that simultaneously maximize the utility of data access while minimizing information risk.
Chapter
Today, we are on the cusp of a revolution in healthcare delivery where clinicians will be able to access patients' genetic information cheaply, quickly, and entirely. As the much-anticipated $1000 genome comes closer to a clinical reality for patients and their healthcare providers, we must, with forethought, face the numerous ethical questions that arise. In light of the changes in healthcare posed by next-generation sequencing (NGS), the flowing tide of conversation regarding the ethical use of genetic information needs to be revisited in light of the principles of autonomy, beneficence, and justice. While particular recommendations and guidance may become obsolete quickly, this chapter examines the broader ethical issues underlying NGS to provide guiding recommendations for (1) returning results, (2) protecting privacy and confidentiality, and (3) informed consent. These recommendations reflect patterns of thought and acceptable behavior that can be useful to clinicians who find themselves practicing medicine in this new frontier.
Article
Major funding agencies should ensure that large biological data sets are stored in cloud services to enable easy access and fast analysis, say Lincoln D. Stein and colleagues.
Article
Anonymity Compromised The balance between maintaining individual privacy and sharing genomic information for research purposes has been a topic of considerable controversy. Gymrek et al. (p. 321; see the Policy Forum by Rodriguez et al.) demonstrate that the anonymity of participants (and their families) can be compromised by analyzing Y-chromosome sequences from public genetic genealogy Web sites that contain (sometimes distant) relatives with the same surname. Short tandem repeats (STRs) on the Y chromosome of a target individual (whose sequence was freely available and identified in GenBank) were compared with information in public genealogy Web sites to determine the shortest time to the most recent common ancestor and find the most likely surname, which, when combined with age and state of residency, identified the individual. When STRs from 911 individuals were used as the starting points, the analysis projected a success rate of 12% within the U.S. male population with Caucasian ancestry. Further analysis of detailed pedigrees from one collection revealed that families of individuals whose genomes are in public repositories could be identified with high probability.
Article
RNA profiling can be used to capture the expression patterns of many genes that are associated with expression quantitative trait loci (eQTLs). Employing published putative cis eQTLs, we developed a Bayesian approach to predict SNP genotypes that is based only on RNA expression data. We show that predicted genotypes can accurately and uniquely identify individuals in large populations. When inferring genotypes from an expression data set using eQTLs of the same tissue type (but from an independent cohort), we were able to resolve 99% of the identities of individuals in the cohort at P(adjusted) ≤ 1 × 10^-5. When eQTLs derived from one tissue were used to predict genotypes using expression data from a different tissue, the identities of 90% of the study subjects could be resolved at P(adjusted) ≤ 1 × 10^-5. We discuss the implications of deriving genotypic information from RNA data deposited in the public domain.
Article
The increasing integration of patient-specific genomic data into clinical practice and research raises serious privacy concerns. Various systems have been proposed that protect privacy by removing explicitly identifying information, such as name or social security number, or by encrypting it into pseudonyms. Though these systems claim to protect identity from being disclosed, they lack formal proofs. In this paper, we study the erosion of privacy when genomic data, either pseudonymous or data believed to be anonymous, are released into a distributed healthcare environment. Several algorithms are introduced, collectively called RE-Identification of Data In Trails (REIDIT), which link genomic data to named individuals in publicly available records by leveraging unique features in patient-location visit patterns. Algorithmic proofs of re-identification are developed, and we demonstrate, with experiments on real-world data, that susceptibility to re-identification is neither trivial nor the result of bizarre, isolated occurrences. We propose that such techniques can be applied as system tests of privacy protection capabilities.
Article
The importance of scientific databases for genetic research is clear. However, human genetic data can only be made public if we can protect the privacy of research subjects. Particularly in light of the HIPAA Privacy Rule, unrestricted access to comprehensive public human genomic data is inappropriate. Even without any direct identifiers or phenotype information attached, individual genomic data (such as SNPs) are quite 'identifiable' using common definitions. At the same time, genome-wide analysis of human genetic variation to understand disease susceptibility or therapeutic opportunity requires access to large genomic data sets. In this paper, we argue that stripping identifiable components, using established statistical disclosure control procedures, renders genomic data virtually useless for research. In particular, we show that even relatively small amounts of genomic data are identifiable, and yet current disclosure control methods do not offer a viable solution for protecting privacy while maintaining the research utility of this type of data. Therefore, the special privacy challenges posed by genomic data need to be addressed with new policies or creative technical approaches.