The NIDDK Central Repository at 8 years—Ambition, Revision, Use and Impact

RTI International, Research Triangle Park, NC 27709, USA.
Database: The Journal of Biological Databases and Curation (Impact Factor: 3.37). 01/2011; 2011:bar043. DOI: 10.1093/database/bar043
Source: PubMed


The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Repository makes data and biospecimens from NIDDK-funded research available to the broader scientific community. It thereby facilitates: the testing of new hypotheses without new data or biospecimen collection; the pooling of data across several studies to increase statistical power; and informative genetic analyses using the Repository’s well-curated phenotypic data. This article describes the initial database plan for the Repository and its revision using a simpler model. Among the lessons learned were the trade-offs between the complexity of a database design and the costs in time and money of implementation; the importance of integrating consent documents into the basic design; the crucial need for linkage files that associate biospecimen IDs with the masked subject IDs used in deposited data sets; and the importance of standardized procedures to test the integrity of data sets prior to distribution. The Repository is currently tracking 111 ongoing NIDDK-funded studies, many of which include genotype data, and it houses over 5 million biospecimens of more than 25 types, including serum, plasma, stool, urine, DNA, red blood cells, buffy coat and tissue. Repository resources have supported a range of biochemical, clinical, statistical and genetic research (188 external requests for clinical data and 31 for biospecimens have been approved or are pending). Genetic research has included GWAS, validation studies, development of methods to improve the statistical power of GWAS and testing of new statistical methods for genetic research.
We anticipate that the future impact of the Repository’s resources on biomedical research will be enhanced by (i) cross-listing of Repository biospecimens in additional searchable databases and biobank catalogs; (ii) ongoing deployment of new applications for querying the contents of the Repository; and (iii) increased harmonization of procedures, data collection strategies, questionnaires, etc., both across research studies and within the vocabularies used by different repositories.
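The linkage files called out among the lessons learned can be pictured as a simple join keyed on masked subject IDs: one file maps each biospecimen ID to a masked subject ID, and the deposited data set is keyed on the same masked IDs. The sketch below is illustrative only; the file layouts, column names and IDs are hypothetical, not the Repository's actual formats:

```python
import csv
import io

# Hypothetical linkage file: biospecimen ID -> masked subject ID.
linkage_csv = """specimen_id,masked_subject_id
BS-0001,SUBJ-1001
BS-0002,SUBJ-1002
BS-0003,SUBJ-1001
"""

# Hypothetical deposited data set, keyed on the same masked subject IDs.
phenotype_csv = """masked_subject_id,hba1c
SUBJ-1001,7.2
SUBJ-1002,6.4
"""

# Index the deposited phenotype rows by masked subject ID.
phenotypes = {row["masked_subject_id"]: row
              for row in csv.DictReader(io.StringIO(phenotype_csv))}

# Join each biospecimen to its subject's phenotype record via the linkage file.
joined = [{**row, **phenotypes.get(row["masked_subject_id"], {})}
          for row in csv.DictReader(io.StringIO(linkage_csv))]

for rec in joined:
    print(rec["specimen_id"], rec["masked_subject_id"], rec["hba1c"])
```

Without such a linkage file, a requester who receives biospecimens and a deposited data set separately has no way to connect a vial to a phenotype record, which is why the abstract calls this linkage crucial.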



  • ABSTRACT: The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Data Repository (CDR) is a web-enabled resource available to researchers and the general public. The CDR warehouses clinical data and study documentation from NIDDK-funded research, including such landmark studies as the Diabetes Control and Complications Trial (DCCT, 1983–93) and the Epidemiology of Diabetes Interventions and Complications (EDIC, 1994–present) follow-up study, which has been ongoing for more than 20 years. The CDR also houses data from over 7 million biospecimens representing 2 million subjects. To help users explore the vast amount of data stored in the NIDDK CDR, we developed a suite of search mechanisms called the public query tools (PQTs). Five individual tools are available to search data from multiple perspectives: study search, basic search, ontology search, variable summary and sample by condition. The PQTs enable users to search for information across studies. Users can search for data such as the number of subjects, types of biospecimens and disease outcome variables without prior knowledge of the individual studies. This suite of tools will increase the use and maximize the value of the NIDDK data and biospecimen repositories as important resources for the research community.
    Full-text · Article · Jan 2013 · Database The Journal of Biological Databases and Curation
  • ABSTRACT: Standardization of sample collection, shipping and storage has been a major focus of biorepositories servicing large, multi-institute studies. The standardization of total protein concentration measurements may also provide an important metric for characterizing biospecimens. Measuring total protein concentration in urine is challenging because of the widely variable sample dilutions obtained in the clinic and the lack of a reference matrix for use with a standard curve and blank subtraction. Urinary proteins are therefore typically precipitated and reconstituted in a reference solution before quantitation. We tested three different methods for protein precipitation and evaluated them using the variability in total protein concentration measurement as a metric. The methods were tested on four urine samples ranging from very concentrated to very dilute. A method using a commercially available kit provided the most reproducible results, with average coefficients of variation <10%. Addition of a freeze/thaw cycle did not lead to significant protein loss or additional variability. Samples were titrated, and the measurements obtained appeared to be linearly correlated with sample starting volume. This method was applied to the analysis of 77 urine biorepository samples and provided reproducible results when the same sample was assayed on different microwell plates.
    No preview · Article · Oct 2014 · Journal of biomolecular techniques: JBT
  • ABSTRACT: The launch of the US BRAIN and European Human Brain Projects coincides with growing international efforts toward transparency and increased access to publicly funded research in the neurosciences. The need for data-sharing standards and neuroinformatics infrastructure is more pressing than ever. However, 'big science' efforts are not the only drivers of data-sharing needs, as neuroscientists across the full spectrum of research grapple with the overwhelming volume of data being generated daily and a scientific environment that is increasingly focused on collaboration. In this commentary, we consider the issue of sharing of the richly diverse and heterogeneous small data sets produced by individual neuroscientists, so-called long-tail data. We consider the utility of these data, the diversity of repositories and options available for sharing such data, and emerging best practices. We provide use cases in which aggregating and mining diverse long-tail data convert numerous small data sources into big data for improved knowledge about neuroscience-related disorders.
    Full-text · Article · Oct 2014 · Nature Neuroscience
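The reproducibility metric used in the urine total-protein abstract above, the coefficient of variation (CV), is the sample standard deviation of replicate measurements expressed as a percentage of their mean. A minimal sketch; the replicate readings below are hypothetical, not data from that study:

```python
import statistics

def coefficient_of_variation(replicates):
    """CV (%) = 100 * sample standard deviation / mean."""
    return 100 * statistics.stdev(replicates) / statistics.mean(replicates)

# Hypothetical replicate total-protein readings (mg/mL) for one urine sample.
readings = [0.52, 0.49, 0.55, 0.51]
cv = coefficient_of_variation(readings)
print(f"CV = {cv:.1f}%")  # -> CV = 4.8%
```

A CV under 10%, as reported for the kit-based precipitation method, means the spread of repeated measurements is less than a tenth of their average value.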