ABSTRACT: Cathie Sudlow and colleagues describe the UK Biobank, a large population-based prospective study, established to allow investigation of the genetic and non-genetic determinants of the diseases of middle and old age.
ABSTRACT: Background:
DataSHIELD (Data Aggregation Through Anonymous Summary-statistics from Harmonised Individual levEL Databases) has been proposed to facilitate the co-analysis of individual-level data from multiple studies without physically sharing the data. In a previous paper, we investigated whether DataSHIELD could protect participant confidentiality in accordance with UK law. In this follow-up paper, we investigate whether DataSHIELD addresses a broader range of ethics-related data-sharing concerns.
Ethics-related data-sharing concerns of Institutional Review Boards, ethics experts, international research consortia and research participants were identified through a literature search and systematically examined at a multidisciplinary workshop to determine whether DataSHIELD proposes mechanisms which can address these concerns.
DataSHIELD addresses several ethics-related data-sharing concerns related to privacy, confidentiality, and the protection of the research participant's rights while sharing data and after the data have been shared. The data remain entirely under the direct management of the study that collected them. Data processing commands are strictly supervised, and the data are queried in a protected environment. Issues related to the return of individual research results when data are shared are eliminated; the responsibility for return remains at the study of origin.
DataSHIELD can provide an innovative and robust solution for addressing commonly encountered ethics-related data-sharing concerns.
Full-text · Article · Dec 2014 · Public Health Genomics
ABSTRACT: Background:
Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate and controversy relating to the UK's proposed 'care.data' initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data.
Commands are sent from a central analysis computer (AC) to several data computers (DCs) storing the data to be co-analysed. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands transmitted back and forth between the DCs and the AC. This paper describes the technical implementation of DataSHIELD using a modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is controlled through a standard R environment at the AC.
Based on this Opal/R implementation, DataSHIELD is currently used by the Healthy Obese Project and the Environmental Core Project (BioSHaRE-EU) for the federated analysis of 10 data sets across eight European countries, and this illustrates the opportunities and challenges presented by the DataSHIELD approach.
DataSHIELD facilitates important research in settings where: (i) a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prohibit the release or sharing of some of the required data, and/or render data access unacceptably slow; (ii) a research group (e.g. in a developing nation) is particularly vulnerable to loss of intellectual property: the researchers want to fully share the information held in their data with national and international collaborators, but do not wish to hand over the physical data themselves; and (iii) a data set is to be included in an individual-level co-analysis but the physical size of the data precludes direct transfer to a new site for analysis.
Full-text · Article · Sep 2014 · International Journal of Epidemiology
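The exchange described in the abstract above, in which an analysis computer (AC) sends commands to several data computers (DCs) and receives only non-disclosive summary statistics, can be illustrated with a minimal Python sketch. This is not the Opal/R implementation the paper describes: the class name, the `sum_and_count` command, and the `MIN_CELL_COUNT` threshold are all hypothetical and chosen only to show the idea of pooling per-study summaries without moving individual-level records.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical disclosure threshold; not a value taken from the paper.
MIN_CELL_COUNT = 5


@dataclass
class DataComputer:
    """Stands in for one study's server; raw values never leave this object."""
    name: str
    _values: List[float]

    def run(self, command: str) -> Tuple[float, int]:
        # Only a whitelisted aggregate command is accepted (hypothetical name).
        if command != "sum_and_count":
            raise ValueError(f"command not permitted: {command}")
        n = len(self._values)
        # Refuse to release a summary that is based on too few records.
        if n < MIN_CELL_COUNT:
            raise ValueError(f"{self.name}: too few records to release a summary")
        return sum(self._values), n


def pooled_mean(data_computers: List[DataComputer]) -> float:
    """Analysis-computer side: combine per-study summaries into one estimate."""
    total, count = 0.0, 0
    for dc in data_computers:
        study_sum, study_n = dc.run("sum_and_count")
        total += study_sum
        count += study_n
    return total / count


if __name__ == "__main__":
    studies = [
        DataComputer("study_a", [1.0, 2.0, 3.0, 4.0, 5.0]),
        DataComputer("study_b", [6.0, 7.0, 8.0, 9.0, 10.0]),
    ]
    print(pooled_mean(studies))
```

The pooled mean equals the mean that would be obtained from the combined individual-level data, yet each study only ever returns a sum and a count.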
ABSTRACT: The analysis of rich catalogues of genetic variation from population-based sequencing provides an opportunity to screen for functional effects. Here we report a rare variant in APOC3 (rs138326449-A, minor allele frequency ~0.25% (UK)) associated with plasma triglyceride (TG) levels (-1.43 s.d. per minor allele, s.e. = 0.27, P-value = 8.0 × 10⁻⁸) discovered in 3,202 individuals with low read-depth, whole-genome sequence. We replicate this in 12,831 participants from five additional samples of Northern and Southern European origin (-1.0 s.d. (s.e. = 0.173), P-value = 7.32 × 10⁻⁹). This is consistent with an effect between 0.5 and 1.5 mmol l⁻¹ dependent on population. We show that a single predicted splice donor variant is responsible for association signals and is independent of known common variants. Analyses suggest an independent relationship between rs138326449 and high-density lipoprotein (HDL) levels. This represents one of the first examples of a rare, large effect variant identified from whole-genome sequencing at a population scale.
ABSTRACT: Background
Asthma and chronic obstructive pulmonary disease (COPD) are heterogeneous diseases.
We sought to determine, in terms of their sputum cellular and mediator profiles, the extent to which they represent distinct or overlapping conditions supporting either the “British” or “Dutch” hypotheses of airway disease pathogenesis.
We compared the clinical and physiological characteristics and sputum mediators between 86 subjects with severe asthma and 75 with moderate-to-severe COPD. Biological subgroups were determined using factor and cluster analyses on 18 sputum cytokines. The subgroups were validated on independent severe asthma (n = 166) and COPD (n = 58) cohorts. Two techniques were used to assign the validation subjects to subgroups: linear discriminant analysis, or the best identified discriminator (single cytokine) in combination with subject disease status (asthma or COPD).
Discriminant analysis distinguished severe asthma from COPD completely using a combination of clinical and biological variables. Factor and cluster analyses of the sputum cytokine profiles revealed 3 biological clusters: cluster 1: asthma predominant, eosinophilic, high TH2 cytokines; cluster 2: asthma and COPD overlap, neutrophilic; cluster 3: COPD predominant, mixed eosinophilic and neutrophilic. Validation subjects were classified into 3 subgroups using discriminant analysis, or disease status with a binary assessment of sputum IL-1β expression. Sputum cellular and cytokine profiles of the validation subgroups were similar to the subgroups from the test study.
Sputum cytokine profiling can determine distinct and overlapping groups of subjects with asthma and COPD, supporting both the British and Dutch hypotheses. These findings may contribute to improved patient classification to enable stratified medicine.
Full-text · Article · Aug 2014 · The Journal of allergy and clinical immunology
ABSTRACT: Background: Errors, introduced through poor assessment of physical measurement or because of inconsistent or inappropriate standard operating procedures for collecting, processing, storing or analysing haematological and biochemistry analytes, have a negative impact on the power of association studies using the collected data. A dataset from UK Biobank was used to evaluate the impact of pre-analytical variability on the power of association studies.
Methods: First, we estimated the proportion of the variance in analyte concentration that may be attributed to delay in processing using variance component analysis. Then, we captured the proportion of heterogeneity between subjects that is due to variability in the rate of degradation of analytes, by fitting a mixed model. Finally, we evaluated the impact of delay in processing on the power of a nested case-control study using a power calculator that we developed and which takes into account uncertainty in outcome and explanatory variables measurements.
Results: The results showed that (i) the majority of the analytes investigated in our analysis were stable over a period of 36 h and (ii) some analytes were unstable and the resulting pre-analytical variation substantially decreased the power of the study, under the settings we investigated.
Conclusions: It is important to specify a limited delay in processing for analytes that are very sensitive to delayed assay. If the rate of degradation of an analyte varies between individuals, any delay introduces a bias which increases with increasing delay. If pre-analytical variation occurring due to delays in sample processing is ignored, it adversely affects the power of the studies that use the data.
Full-text · Article · Aug 2014 · International Journal of Epidemiology
ABSTRACT: Background
Severe refractory asthma is a heterogeneous disease. We sought to determine statistical clusters from the British Thoracic Society Severe refractory Asthma Registry and to examine cluster-specific outcomes and stability.
Factor analysis and statistical cluster modelling were undertaken to determine the number of clusters and their membership (N = 349). Cluster-specific outcomes were assessed after a median follow-up of 3 years. A classifier was programmed to determine cluster stability and was validated in an independent cohort of new patients recruited to the registry (n = 245).
Five clusters were identified. Cluster 1 (34%) were atopic with early onset disease, cluster 2 (21%) were obese with late onset disease, cluster 3 (15%) had the least severe disease, cluster 4 (15%) were eosinophilic with late onset disease and cluster 5 (15%) had significant fixed airflow obstruction. At follow-up, the proportion of subjects treated with oral corticosteroids increased in all groups with an increase in body mass index. Exacerbation frequency decreased significantly in clusters 1, 2 and 4 and was associated with a significant fall in the peripheral blood eosinophil count in clusters 2 and 4. Stability of cluster membership at follow-up was 52% for the whole group with stability being best in cluster 2 (71%) and worst in cluster 4 (25%). In an independent validation cohort, the classifier identified the same 5 clusters with similar patient distribution and characteristics.
Statistical cluster analysis can identify distinct phenotypes with specific outcomes. Cluster membership can be determined using a classifier, but when treatment is optimised, cluster stability is poor.
ABSTRACT: Data sharing is an essential element of research; however, recent scientific and social developments have challenged conventional methods for protecting privacy. Here we provide guidance for determining data sharing thresholds for human pluripotent stem cell research aimed at a wide range of stakeholders, including research consortia, biorepositories, policy-makers, and funders.
ABSTRACT: Background:
Data from individual collections, such as biobanks and cohort studies, are now being shared in order to create combined datasets which can be queried to ask complex scientific questions. But this sharing must be done with due regard for data protection principles. DataSHIELD is a new technology that queries nonaggregated, individual-level data in situ but returns query data in an anonymous format. This raises questions of the ability of DataSHIELD to adequately protect participant confidentiality.
An ethico-legal analysis was conducted that examined each step of the DataSHIELD process from the perspective of UK case law, regulations, and guidance.
DataSHIELD reaches agreed UK standards of protection for the sharing of biomedical data. All direct processing of personal data is conducted within the protected environment of the contributing study; participating studies have scientific, ethics, and data access approvals in place prior to the analysis; studies are clear that their consents conform with this use of data, and participants are informed that anonymisation for further disclosure will take place.
DataSHIELD can provide a flexible means of interrogating data while protecting the participants' confidentiality in accordance with applicable legislation and guidance.
Full-text · Article · Mar 2014 · Public Health Genomics
ABSTRACT: Not all obese subjects have an adverse metabolic profile predisposing them to developing type 2 diabetes or cardiovascular disease. The BioSHaRE-EU Healthy Obese Project aims to gain insights into the consequences of (healthy) obesity using data on risk factors and phenotypes across several large-scale cohort studies. The aim of this study was to describe the prevalence of obesity, metabolic syndrome (MetS) and metabolically healthy obesity (MHO) in ten participating studies.
Ten different cohorts in seven countries were combined, using data transformed into a harmonized format. All participants were of European origin, with age 18-80 years. They had participated in a clinical examination for anthropometric and blood pressure measurements. Blood samples had been drawn for analysis of lipids and glucose. Presence of MetS was assessed in those with obesity (BMI ≥ 30 kg/m²) based on the 2001 NCEP-ATPIII criteria, as well as an adapted set of less strict criteria. MHO was defined as obesity, having none of the MetS components, and no previous diagnosis of cardiovascular disease.
Data for 163,517 individuals were available; 17% were obese (11,465 men and 16,612 women). The prevalence of obesity varied from 11.6% in the Italian CHRIS cohort to 26.3% in the German KORA cohort. The age-standardized percentage of obese subjects with MetS ranged in women from 24% in CHRIS to 65% in the Finnish Health2000 cohort, and in men from 43% in CHRIS to 78% in the Finnish DILGOM cohort, with elevated blood pressure the most frequently occurring factor contributing to the prevalence of the metabolic syndrome. The age-standardized prevalence of MHO varied in women from 7% in Health2000 to 28% in NCDS, and in men from 2% in DILGOM to 19% in CHRIS. MHO was more prevalent in women than in men, and decreased with age in both sexes.
Through a rigorous harmonization process, the BioSHaRE-EU consortium was able to compare key characteristics defining the metabolically healthy obese phenotype across ten cohort studies. There is considerable variability in the prevalence of healthy obesity across the different European populations studied, even when unified criteria were used to classify this phenotype.
Full-text · Article · Feb 2014 · BMC Endocrine Disorders
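The MHO definition given in the abstract above (obesity at BMI ≥ 30 kg/m², none of the MetS components, no previous diagnosis of cardiovascular disease) can be written as a small rule. The sketch below assumes the number of MetS components has already been counted upstream for each participant; the `Participant` record and its field names are hypothetical, and the paper's adapted, less strict criteria are not reproduced. The `has_mets` helper uses the usual NCEP-ATPIII rule of three or more of five components.

```python
from dataclasses import dataclass

OBESITY_BMI_THRESHOLD = 30.0  # kg/m², as stated in the abstract


@dataclass
class Participant:
    """Hypothetical harmonized record; field names are illustrative only."""
    bmi: float             # body mass index in kg/m²
    mets_components: int   # number of MetS components present (0 to 5)
    prior_cvd: bool        # previous diagnosis of cardiovascular disease


def is_obese(p: Participant) -> bool:
    return p.bmi >= OBESITY_BMI_THRESHOLD


def has_mets(p: Participant) -> bool:
    # NCEP-ATPIII: metabolic syndrome when at least three of five components are present.
    return p.mets_components >= 3


def is_mho(p: Participant) -> bool:
    # Metabolically healthy obesity: obese, no MetS component, no prior CVD.
    return is_obese(p) and p.mets_components == 0 and not p.prior_cvd


if __name__ == "__main__":
    example = Participant(bmi=32.5, mets_components=0, prior_cvd=False)
    print(is_mho(example))
```

Note that under this definition an obese participant with one or two components is neither MHO nor classified as having MetS.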
ABSTRACT: Individual-level data pooling of large population-based studies across research centres in international research projects faces many hurdles. The BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) project aims to address these issues by building a collaborative group of investigators and developing tools for data harmonization, database integration and federated data analyses.
Eight population-based studies in six European countries were recruited to participate in the BioSHaRE project. Through workshops, teleconferences and electronic communications, participating investigators identified a set of 96 variables targeted for harmonization to answer research questions of interest. Using each study's questionnaires, standard operating procedures, and data dictionaries, harmonization potential was assessed. Whenever harmonization was deemed possible, processing algorithms were developed and implemented in an open-source software infrastructure to transform study-specific data into the target (i.e. harmonized) format. Harmonized datasets located on servers in each research centre across Europe were interconnected through a federated database system to perform statistical analysis.
Retrospective harmonization led to the generation of common format variables for 73% of matches considered (96 targeted variables across 8 studies). Authenticated investigators can now perform complex statistical analyses of harmonized datasets stored on distributed servers without actually sharing individual-level data using the DataSHIELD method.
New Internet-based networking technologies and database management systems are providing the means to support collaborative, multi-center research in an efficient and secure manner. The results from this pilot project show that, given a strong collaborative relationship between participating studies, it is possible to seamlessly co-analyse internationally harmonized research databases while allowing each study to retain full control over individual-level data. We encourage additional collaborative research networks in epidemiology, public health, and the social sciences to make use of the open source tools presented herein.
Full-text · Article · Nov 2013 · Emerging Themes in Epidemiology
ABSTRACT: Interindividual variation in mean leukocyte telomere length (LTL) is associated with cancer and several age-associated diseases. We report here a genome-wide meta-analysis of 37,684 individuals with replication of selected variants in an additional 10,739 individuals. We identified seven loci, including five new loci, associated with mean LTL (P < 5 × 10⁻⁸). Five of the loci contain candidate genes (TERC, TERT, NAF1, OBFC1 and RTEL1) that are known to be involved in telomere biology. Lead SNPs at two loci (TERC and TERT) associate with several cancers and other diseases, including idiopathic pulmonary fibrosis. Moreover, a genetic risk score analysis combining lead variants at all 7 loci in 22,233 coronary artery disease cases and 64,762 controls showed an association of the alleles associated with shorter LTL with increased risk of coronary artery disease (21% (95% confidence interval, 5-35%) per standard deviation in LTL, P = 0.014). Our findings support a causal role of telomere-length variation in some age-related diseases.
ABSTRACT: Gene-environment interaction studies offer the prospect of robust causal inference through both gene identification and instrumental variable approaches. As such they are a major and much needed development. However, conducting these studies using traditional methods, which require direct participant contact, is resource intensive. The ability to conduct gene-environment interaction studies remotely would reduce costs and increase capacity.
To develop a platform for the remote conduct of gene-environment interaction studies.
A random sample of 15,000 men and women aged 50+ years and living in Cardiff, South Wales, of whom 6,012 were estimated to have internet connectivity, was mailed an invitation to visit a website to join a study of successful ageing. Online consent was obtained for questionnaire completion, cognitive testing, re-contact, record linkage and genotyping. Cognitive testing was conducted using the Cardiff Cognitive Battery. Bio-sampling was randomised to blood spot, buccal cell or no request.
A heterogeneous sample of 663 (4.5% of mailed sample and 11% of internet connected sample) men and women (47% female) aged 50-87 years (median = 61 yrs) from diverse backgrounds (representing the full range of deprivation scores) was recruited. Bio-samples were donated by 70% of those agreeing to do so. Self-report questionnaires and cognitive tests showed comparable distributions to those collected using face-to-face methods. Record linkage was achieved for 99.9% of participants.
This study has demonstrated that remote methods are suitable for the conduct of gene-environment interaction studies. Up-scaling these methods provides the opportunity to increase capacity for large-scale gene-environment interaction studies.