Data Sharing in the Post-Genomic World: The Experience of the International Cancer Genome Consortium (ICGC) Data Access Compliance Office (DACO)

PLOS
PLOS Computational Biology

Abstract and Figures

The scientific community, research funders, and governments have repeatedly recognized the importance of open access to genomic data for scientific research and medical progress [1]–[4]. Open access is becoming a well-established practice for large-scale, publicly funded, data-intensive community science projects, particularly in the field of genomics. Given this consensus, restrictions to open access should be regarded as exceptional and treated with caution. Yet, several developments [5] have led scientists and policymakers to investigate and implement open access restrictions [5]–[9]. Notably, there are privacy concerns within the genomics community and critiques from some researchers that open access, if left completely unregulated, could raise significant scientific, ethical, and legal issues (e.g., quality of the data, appropriate credit to data generators, relevance of the system for small and medium projects, etc.) [1]–[10]. A recent paper by Greenbaum and colleagues in this journal [11] identified protecting the privacy of study participants as the main challenge to open genomic data sharing. One possible way to reconcile open data sharing with privacy concerns is to use a tiered access system to separate access into “open” and “controlled.” Open access remains the norm for data that cannot be linked with other data to generate a dataset that would uniquely identify an individual. A controlled access mechanism, on the other hand, regulates access to certain, more sensitive data (e.g., detailed phenotype and outcome data, genome sequence files, raw genotype calls) by requiring third parties to apply to a body (e.g., custodian, original data collectors, independent body, or data access committee) and complete an access application that contains privacy safeguards.
This mechanism, while primarily designed to protect study participants, can also be used to protect investigators, database hosting institutions, and funders from perceptions or acts of favoritism or impropriety. The experience of controlled access bodies to date has been only minimally documented in the literature [9], [12]. To address this lacuna, we present the experience of the Data Access Compliance Office (DACO) of the International Cancer Genome Consortium (ICGC). The goal is to provide information on this increasingly important type of database governance body.
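The tiered access idea described in the abstract can be illustrated with a minimal sketch. This is not the ICGC or DACO implementation. The category names, the `Requester` type, the `approved_application` flag, and the choice to treat unknown categories as controlled are all assumptions made for illustration; only the three controlled examples are taken from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"


# Hypothetical open-tier category (not named in the abstract): data assumed
# not to be linkable into a dataset that uniquely identifies an individual.
OPEN_CATEGORIES = {"cancer_type_summary"}

# Controlled-tier examples named in the abstract; the identifiers are made up.
CONTROLLED_CATEGORIES = {
    "detailed_phenotype_and_outcome",
    "genome_sequence_files",
    "raw_genotype_calls",
}


@dataclass(frozen=True)
class Requester:
    name: str
    # Assumed flag: True once the access body has approved an application
    # that contains the required privacy safeguards.
    approved_application: bool = False


def tier_for(category: str) -> Tier:
    """Return the access tier; unknown categories default to CONTROLLED (an assumption)."""
    return Tier.OPEN if category in OPEN_CATEGORIES else Tier.CONTROLLED


def may_access(requester: Requester, category: str) -> bool:
    """Open data is available to anyone; controlled data needs an approved application."""
    if tier_for(category) is Tier.OPEN:
        return True
    return requester.approved_application
```

Defaulting unlisted categories to the controlled tier is a conservative design choice in this sketch, so that newly added data types are not exposed before someone has classified them.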
... In particular, diverse terms have been used to describe the procedures involved in granting or facilitating data access. Some of these include: controlled access [22–25], managed access [21,26,27], tiered access [22,24], unrestricted access, open access [24,27,28], registered access [21,24], authenticated, charged, exclusive and password access [29], a passport model of access [30] and more. There is a lack of clarity about the extent to which these terms are truly distinct, related, or interchangeable with one another. ...
... In contrast to the closed approach of CA, open access (OA) makes data publicly available without restriction [22,27]. The theory behind OA is that unfettered access assists in the verification and replication of data, broadens opportunities to pool data and generates results without the need to collect further data, leading to better quality results and establishment of community resources [27,32]. ...
... By imposing restrictions on data access [24], MA arrangements involve more regulatory action by the consortium. Conditions might include who can access, what they can access, and how and under what terms access may be granted (i.e. on an internal server or downloaded) [22]. The term MA is not used ubiquitously; 'controlled access' and 'restricted access' are also common [22,24,25]. ...
Article
One of the most common terms that is used to describe entities responsible for sharing genomic data for research purposes is ‘genomic research consortium’. However, there is a lack of clarity around the language used by consortia to describe their data sharing arrangements. Calls have been made for more uniform terminology. This article reports on a review of the genomic research consortium literature illustrating a wide diversity in the language that has been used over time to describe the access arrangements of these entities. The second component of this research involved an examination of publicly available information from a dataset of 98 consortia. This analysis further illustrates the wide diversity in the access arrangements adopted by genomic research consortia. A total of 12 different access arrangements were identified, including four simple forms (open, consortium, managed and registered access) and eight more complex tiered forms (for example, a combination of consortium, managed and open access). The majority of consortia utilised some form of tiered access, often following the policy requirements of funders like the US National Institutes of Health and the UK Wellcome Trust. It was not always easy to precisely identify the access arrangements of individual consortia. Greater consistency, clarity and transparency is likely to be of benefit to donors, depositors and accessors alike. More work needs to be done to achieve this end.
... With the increasing availability of these datasets and advancements in computational methods, a wide array of user-friendly primary and secondary data resources has been developed. These resources enable researchers to explore clinical and omics data to identify diagnostic and prognostic biomarkers across various cancer types (Barretina, et al., 2012; Cai, et al., 2019; Cancer Genome Atlas Research, et al., 2013; Cerami, et al., 2012; Deng, et al., 2017; Deng, et al., 2023; Grossman, et al., 2016; Joly, et al., 2012; Kaur, et al., 2020; Rhodes, et al., 2004; Tate, et al., 2019; Uhlen, et al., 2005; Yang, et al., ...). The surge in multi-omics data availability has profoundly impacted the field of cancer research, providing unprecedented opportunities for biomarker discovery and precision medicine (Bhalla, et al., 2017; Bhalla, et al., 2019; Deng, et al., 2023; Dhall, et al., 2020; Garg, et al., 2024; Han, et al., 2013; Hussein, Abou-Shanab and Badr, 2024; Kaur, Bhalla and Raghava, 2019; Oh, et al., 2020; Vasudevan and Murugesan, 2018; Xiao, et al., 2021; Xiao, et al., 2022; Yang, et al., 2024; Zhang and Liu, 2021). However, the inherent complexity and heterogeneity of multi-omics data pose significant challenges for analysis, interpretation, and subsequent biomarker identification (Brooks, et al., 2024; Lopez de Maturana, et al., 2019; Matthews, Hanison and Nirmalan, 2016; McDermott, et al., 2013; Mohr, et al., 2024; Ng, et al., 2023; Yamada, et al., 2021). ...
Preprint
Accurate survival prediction is vital for optimizing treatment strategies in clinical practice. The advent of high-throughput multi-omics data and computational methods has enabled machine learning (ML) models for survival analysis. However, handling high-dimensional omics data remains challenging. This study introduces the Cancer Patient Survival Model (CPSM), an R package developed to provide individualized survival predictions through a fully integrated and reproducible computational pipeline. The CPSM package encompasses nine modules that streamline the survival modeling workflow, organized into four key stages: (1) Data Preprocessing and Normalization, (2) Feature Selection, (3) Survival Prediction Model Development, and (4) Visualization. The visual tools facilitate the interpretation of survival predictions, enhancing clinical decision-making. By providing an end-to-end solution for multi-omics data integration and analysis, CPSM not only enhances the precision of survival predictions but also aids in discovering clinically relevant biomarkers.
... As such, safeguarding the privacy of individuals' genomic data becomes crucial (Greenbaum et al., 2011). Unauthorized access, misuse, or commercial exploitation of genetic information can lead to potential discrimination, stigmatization, and breaches of confidentiality (Joly et al., 2012). Researchers and policymakers must implement robust data protection measures to ensure the responsible handling and storage of genomic data while promoting transparency and informed consent in data sharing practices (Mittelstadt and Floridi, 2016). ...
... Related work. The US-based dbGaP [4] is one of the oldest deposition databases for subject-level genome/phenome data; as such, it has set a precedent for the controlled access model and various other data sharing initiatives [35]. dbGaP requires data submitters to delineate all data use conditions as data use limitation tags on datasets. ...
Article
The EU General Data Protection Regulation (GDPR) requirements have prompted a shift from centralised controlled access genome-phenome archives to federated models for sharing sensitive human data. In a data-sharing federation, a central node facilitates data discovery; meanwhile, distributed nodes are responsible for handling data access requests, concluding agreements with data users and providing secure access to the data. Research institutions that want to become part of such federations often lack the resources to set up the required controlled access processes. The DS-PACK tool assembly is a reusable, open-source middleware solution that semi-automates controlled access processes end-to-end, from data submission to access. Data protection principles are engraved into all components of the DS-PACK assembly. DS-PACK centralises access control management and distributes access control enforcement with support for data access via cloud-based applications. DS-PACK is in production use at the ELIXIR Luxembourg data hosting platform, combined with an operational model including legal facilitation and data stewardship.
... The majority of journals and funders now have data sharing policies. National and international data protection laws restrict data sharing by genomic researchers but a number of initiatives have been developed to promote successful data sharing including those hosted by the European Molecular Biology Laboratory's European Bioinformatics Institute [225], the International Cancer Genome Consortium's project [226], the Pan-Cancer Analysis of Whole Genomes (PCAWG) [227] and the Human Cell Atlas [228]. The researchers involved in setting up PCAWG have called for an international code of conduct to overcome issues with data protection and provide guidelines for researchers [229]. ...
Article
Aims: To identify differential expression of shorter non-coding RNA (ncRNA) genes associated with autism spectrum disorders (ASD). Background: ncRNA are functional molecules that derive from non-translated DNA sequence. The HUGO Gene Nomenclature Committee (HGNC) have approved ncRNA gene classes with alignment to the reference human genome. One subset is microRNA (miRNA), which are highly conserved, short RNA molecules that regulate gene expression by direct post-transcriptional repression of messenger RNA. Several miRNA genes are implicated in the development and regulation of the nervous system. Expression of miRNA genes in ASD cohorts has been examined by multiple research groups. Other shorter classes of ncRNA have been examined less. A comprehensive systematic review examining expression of shorter ncRNA gene classes in ASD is timely to inform the direction of research. Methods: We extracted data from studies examining ncRNA gene expression in ASD compared with non-ASD controls. We included studies on miRNA, piwi-interacting RNA (piRNA), small NF90 (ILF3) associated RNA (snaR), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), transfer RNA (tRNA), vault RNA (vtRNA) and Y RNA. The following electronic databases were searched: Cochrane Library, EMBASE, PubMed, Web of Science, PsycINFO, ERIC, AMED and CINAHL for papers published from January 2000 to May 2022. Studies were screened by two independent investigators with a third resolving discrepancies. Data was extracted from eligible papers. Results: Forty-eight eligible studies were included in our systematic review with the majority examining miRNA gene expression alone. Sixty-four miRNA genes had differential expression in ASD compared to controls as reported in two or more studies, but often in opposing directions. Four miRNA genes had differential expression in the same direction in the same tissue type in at least 3 separate studies.
Increased expression was reported in miR-106b-5p, miR-155-5p and miR-146a-5p in blood, post-mortem brain, and across several tissue types, respectively. Decreased expression was reported in miR-328-3p in blood samples. Seven studies examined differential expression from other classes of ncRNA, including piRNA, snRNA, snoRNA and Y RNA. No individual ncRNA genes were reported in more than one study. Six studies reported differentially expressed snoRNA genes in ASD. A meta-analysis was not possible because of inconsistent methodologies, disparate tissue types examined, and varying forms of data presented. Conclusion: There is limited but promising evidence associating the expression of certain miRNA genes and ASD, although the studies are of variable methodological quality and the results are largely inconsistent. There is emerging evidence associating differential expression of snoRNA genes in ASD. It is not currently possible to say whether the reports of differential expression in ncRNA may relate to ASD aetiology, a response to shared environmental factors linked to ASD such as sleep and nutrition, other molecular functions, human diversity, or chance findings. To improve our understanding of any potential association, we recommend improved and standardised methodologies and reporting of raw data. Further high-quality research is required to shine a light on possible associations, which may yet yield important information.
Article
Striving to build an exhaustive guidebook of the types and properties of human cells, the Human Cell Atlas’ (HCA) success relies on the sampling of diverse populations, developmental stages, and tissue types. Its open science philosophy preconizes the rapid, seamless sharing of data – as openly as possible. In light of the scope and ambition of such an international initiative, the HCA Ethics Working Group (EWG) has been working to build a solid foundation to address the complexities of data collection and sharing as part of Atlas development. Indeed, a particular challenge of the HCA is the diversity of sampling scenarios (e.g., living participants, deceased donors, pediatric populations, culturally diverse backgrounds, tissues from various developmental stages, etc.), and associated ethical and legal norms, which vary across countries contributing to the effort. Hence, to the extent possible, the EWG set out to provide harmonised, international and interoperable policies and tools, to guide its research community. This paper provides a high-level overview of the types of challenges and approaches proposed by the EWG.
Article
The Data Privacy Assessment Tool for Health (D-PATH) is a proof-of-concept online tool designed to help users intending to share biomedical data identify applicable legal obligations and relevant best practices. D-PATH provides a series of simple questions to assess important aspects of the data sharing task, such as the user’s legal jurisdiction and the types of entities involved. Based on the combination of answers that the user provides, D-PATH will generate a list of privacy obligations and security best practices, categorized into themes of 1) accountability, 2) lawfulness of storage, transfer, and protection, and 3) security and safeguards that will likely apply in the user’s scenario. Currently, D-PATH focuses on Canadian and European privacy laws and various global best-practice policies, but there are plans to extend this in later iterations of the tool. D-PATH was developed specifically to inform users about their legal privacy obligations and best practices and was written to facilitate compliant and ethical data sharing. As a proof-of-concept, D-PATH demonstrates the potential value of a tool in simplifying and translating complex concepts into more accessible formats. Such a tool can be adapted to, and prove valuable in, many different contexts, such as training core researchers in data sharing laws and practices.
Article
How does concern about genetic data privacy compare with other concerns? We conduct behavioral experiments to compare risk attitudes towards sharing genetic data with a healthcare provider with risk attitudes towards sharing financial data with a money manager. Both scenarios involve identical decisions and monetary stakes, permitting us to focus on how the framing of data sharing influences attitudes. To delve deeper into individual motivations to share data, we provide treatments that study how data sharers' altruism and trust affect their decisions. Our findings (with 162 subjects) indicate that individuals are more willing to risk a loss to privacy of genetic data (for an anticipated return framed as health benefits) than they are to risk loss of financial data (for an anticipated return in financial benefits). We also find that 50%–60% of data recipients choose to protect another person's data, with no significant differences between frames.
Article
Background: Gastric cancer develops as a malignant tumor in the mucosa of the stomach, and spreads through further layers. Early-stage diagnosis of gastric cancer is highly challenging because the patients either exhibit symptoms similar to stomach infections or show no signs at all. Biomarkers are active players in the cancer process by acting as indications of aberrant alterations due to malignancy. Objective: Though there have been significant advancements in the biomarkers and therapeutic targets, there are still insufficient data to fully eradicate the disease in its early phases. Therefore, it is crucial to identify particular biomarkers for detecting and treating stomach cancer. This review aims to provide a thorough overview of data analysis in gastric cancer. Methods: Text mining, network analysis, machine learning (ML), deep learning (DL), and structural bioinformatics approaches have been employed in this study. Results: We have built a huge interaction network in the current study to forecast new biomarkers for gastric cancer. The four putatively unique and potential biomarker genes have been identified via a large association network in this study. Conclusion: The molecular basis of the illness is well understood by computational approaches, which also provide biomarkers for targeted cancer therapy. These putative biomarkers may be useful in the early detection of disease. This study also shows that in H. pylori infection in early-stage gastric cancer, the top 10 hub genes constitute an essential component of the epithelial cell signaling pathways. These genes can further contribute to the future development of effective biomarkers.
Article
One of the core goals of Digital Health Technologies (DHT) is to transform healthcare services and delivery by shifting primary care from hospitals into the community. However, achieving this goal will rely on the collection, use and storage of large datasets. Some of these datasets will be linked to multiple sources, and may include highly sensitive health information that needs to be transferred across institutional and jurisdictional boundaries. The growth of DHT has outpaced the establishment of clear legal pathways to facilitate the collection, use and transfer of potentially sensitive health data. Our study aimed to address this gap with an ethical code to guide researchers developing DHT with international collaborative partners in Singapore. We generated this code using a modified Policy Delphi process designed to engage stakeholders in the deliberation of health data ethics and governance. This paper reports the outcomes of this process along with the key components of the code and identifies areas for future research. Supplementary Information The online version contains supplementary material available at 10.1186/s12910-023-00952-7.
Article
Recent years have witnessed a key development within biomedicine—namely, the move from genetic to genomic research. Genomic research, which operates at the level of the whole genome rather than individual genes, requires a powerful new set of research tools, resources and supporting technologies. Having moved from DNA sequence mapping to the use of haplotypes, the next advances in our understanding of disease risk and health may well be achieved through the study of “normal” genomic variation across whole populations. Such studies require not only samples and data, but also highly sophisticated, substantial database infrastructures to support them. Longitudinal and largely epidemiological in nature, these population-scale genomic database resources are designed to serve a multiplicity of specific research projects at both national and international levels. Current ethical guidance in the area of genetic research promotes the need for international collaboration. Yet, is international genomic research collaboration possible considering both the scientific and structural differences between national approaches to governing genomic databases and associated population biobanks? A review of existing norms at the international level—in particular, around benefit sharing and access to data—and their application in different countries, reveals areas of both convergence and divergence. But, most of all, it reveals the need for international harmonisation in order to secure interoperability and the public participation, trust and investment in such large initiatives that are crucial to their success.
Article
The objective of this study is to describe researchers', health-care providers' and other stakeholders' views of ethical review and research governance procedures. The study design involved qualitative semi-structured interviews. Participants included 60 individuals who either undertook research in the subspecialty of cancer genetics (n = 40) or were involved in biomedical research in other capacities (n = 20), e.g. research governance and oversight, patient support groups or research funding. While all interviewees observed that oversight is necessary to protect research participants, ethical review and research governance (ERG) arrangements were described negatively throughout these interviews. Interviewees identified a number of problems with ERG, including: over-bureaucratization, over-standardization of information requirements for different types of research, a lack of standardization in the types of information required by different committees for the same research and a lack of consistency in different committees' responses. A number of solutions were proposed including streamlining application procedures and harmonizing committees' responses and information requirements. Recent reports suggest that ethical review procedures and research governance arrangements threaten the possibility of undertaking clinical research in the UK, hence the introduction of the Integrated Research Application System (IRAS) is long overdue. However, while IRAS may solve some of the problems identified by interviewees, it remains to be seen to what extent it will impact upon the very negative perceptions of ethics and research governance procedures reported here.
Article
The Lancet, Volume 377, Issue 9765, Pages 537 - 539, 12 ...
Article
Several recent biomedical research initiatives have sought to make their data freely accessible to others, so as to stimulate innovation. Many of these initiatives have adopted the "open source" model that has achieved prominence in the computing industry. With respect to genomics research, open access models of data release have become common and most large funding bodies now require researchers to deposit their data in centralized repositories. In particular, biobanks, which are organized collections of biological samples and corresponding data, often created for the use of investigators who are not affiliated with the biobank, benefit from the implementation of open source principles. Several obstacles loom, however, as barriers to widespread implementation of open source principles in the field of biomedical research. These include the reluctance among researchers to share their data; the challenge of crafting appropriate publication and intellectual property policies; the difficulties in affording informed consent, privacy, and confidentiality to research participants when data is shared so widely; controversy surrounding the issues of commercialization and benefit-sharing; and the complexity of establishing a suitable infrastructure. This article will examine each of these challenges to implementation of an open source biotechnology model, and consider an alternative approach, “fair access” biobanks.
Article
Recent advances in genome-scale, system-level measurements of quantitative phenotypes (transcriptome, metabolome, and proteome) promise to yield unprecedented biological insights. In this environment, broad dissemination of results from genome-wide association studies (GWASs) or deep-sequencing efforts is highly desirable. However, summary results from case-control studies (allele frequencies) have been withdrawn from public access because it has been shown that they can be used for inferring participation in a study if the individual's genotype is available. A natural question that follows is how much private information is contained in summary results from quantitative trait GWAS such as regression coefficients or p values. We show that regression coefficients for many SNPs can reveal the person's participation and for participants his or her phenotype with high accuracy. Our power calculations show that regression coefficients contain as much information on individuals as allele frequencies do, if the person's phenotype is rather extreme or if multiple phenotypes are available as has been increasingly facilitated by the use of multiple-omics data sets. These findings emphasize the need to devise a mechanism that allows data sharing that will facilitate scientific progress without sacrificing privacy protection.
Article
RNA profiling can be used to capture the expression patterns of many genes that are associated with expression quantitative trait loci (eQTLs). Employing published putative cis eQTLs, we developed a Bayesian approach to predict SNP genotypes that is based only on RNA expression data. We show that predicted genotypes can accurately and uniquely identify individuals in large populations. When inferring genotypes from an expression data set using eQTLs of the same tissue type (but from an independent cohort), we were able to resolve 99% of the identities of individuals in the cohort at P(adjusted) ≤ 1 × 10⁻⁵. When eQTLs derived from one tissue were used to predict genotypes using expression data from a different tissue, the identities of 90% of the study subjects could be resolved at P(adjusted) ≤ 1 × 10⁻⁵. We discuss the implications of deriving genotypic information from RNA data deposited in the public domain.
Article
Technological advancements are rapidly propelling the field of genome research forward, while lawmakers attempt to keep pace with the risks these advances bear. Balancing the normative concerns of maximizing data utility and protecting human subjects, whose privacy is at risk due to the identifiability of DNA data, is central to policy decisions. Research on genome research participants making real-time data sharing decisions is limited; yet, these perspectives could provide critical information to ongoing deliberations. We conducted a randomized trial of 3 consent types affording varying levels of control over data release decisions. After debriefing participants about the randomization process, we invited them to a follow-up interview to assess their attitudes toward genetic research, privacy and data sharing. Participants were more restrictive in their reported data sharing preferences than in their actual data sharing decisions. They saw both benefits and risks associated with sharing their genomic data, but risks were seen as less concrete or happening in the future, and were largely outweighed by purported benefits. Policymakers must respect that participants' assessment of the risks and benefits of data sharing and their privacy-utility determinations, which are associated with their final data release decisions, vary. In order to advance the ethical conduct of genome research, proposed policy changes should carefully consider these stakeholder perspectives.
Article
Context: Federal regulations mandate independent review and approval by an "institutional review board" (IRB) before studies that involve human research subjects may begin. Although many researchers strongly support the need for IRB review, they also contend that it is burdensome when it imposes costs that do not add to the protections afforded to research participants and that this burden threatens the viability of research. The U.S. Department of Health and Human Services recently announced its intention to reform the regulations governing IRB review. Methods: We used a search of the PubMed database, supplemented by a bibliographic review, to identify all existing primary data on the costs of IRB review. "Costs" were broadly defined to include both expenditures of time or money and constraints imposed on the scope of the research. Burdensome costs were limited to those that did not contribute to greater protections for the participants. Findings: Evidence from a total of fifty-two studies shows that IRBs operate at different levels of efficiency; that waiting to obtain IRB approval has, in some instances, delayed project initiation; that IRBs presented with identical protocols sometimes asked for different and even competing revisions; and that some decisions made (and positions held) by IRBs are not in accord with federal policy guidance. Conclusions: While the evidence is sufficient to conclude that there is burden associated with IRB review, it is too limited to allow for valid estimates of its magnitude or to serve as the basis for formulating policies on IRB reform. The single exception is multicenter research, for which we found that review by several local IRBs is likely to be burdensome. No mechanism currently exists at the national level to gather systematic evidence on the intersection between research and IRB review. 
This gap is of concern in light of the changing nature of research and the increasingly important role that research is envisioned to play in improving the overall quality of health care.