Chapter

Quality Assurance/Quality Control (QA/QC) for Biomarker Data Sets

Abstract

As with any scientific discipline, data quality is of paramount importance in assuring correct interpretation of findings. This is particularly true for data that will be used for risk assessments and, ultimately, societal decision making. Rapidly evolving molecular biomarker tests, such as the omics-based tests (genomics, proteomics, metabolomics/metabonomics), represent a special challenge because the tests themselves generate enormous quantities and types of data that themselves evolve over relatively short time intervals. While the general principles of quality assurance/quality control (QA/QC) remain in place, the QA/QC procedures for molecular biomarker data must evolve and be tailored to keep pace with rapid technological advances in these biomarker classes.

References

Article
In a recent National Research Council document, new strategies for risk assessment were described to enable more accurate and quicker assessments.(1) This report suggested that evaluating individual responses through increased use of biomonitoring could improve dose-response estimations. Identification of specific biomarkers may be useful for diagnostics or risk prediction, as they have the potential to improve exposure assessments. This paper discusses systems biology, biomarkers of effect, and computational toxicology approaches and their relevance to the occupational exposure limit setting process. The systems biology approach evaluates the integration of biological processes and how disruption of these processes by chemicals or other hazards affects disease outcomes. This type of approach could provide information used in delineating the mode of action of the response or toxicity, and may be useful in defining the low- and no-adverse-effect levels. Biomarkers of effect are changes measured in biological systems and are considered to be preclinical in nature. Advances in computational methods and in experimental -omics methods that allow the simultaneous measurement of families of macromolecules such as DNA, RNA, and proteins in a single analysis have made these systems approaches feasible for broad application. The utility of information from -omics approaches for risk assessments has shown promise, providing insight into mode of action and dose-response relationships. As these techniques evolve, estimation of internal dose and response biomarkers will be a critical test of these new technologies for application in risk assessment strategies. While proof-of-concept studies have provided evidence of their value, challenges with standardization and harmonization still need to be overcome before these methods are used routinely.
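
To make the dose-response thread above concrete, the following sketch (not drawn from the article) fits a Hill-type model to synthetic biomarker response data and derives a benchmark dose at a 10% effect level, the kind of quantity an omics-derived biomarker of effect could feed into exposure-limit derivation. The model choice, function names, and all values are illustrative assumptions.

```python
# Hedged illustration: Hill-model fit to simulated biomarker dose-response
# data, followed by a benchmark-dose (BMD10) calculation. Synthetic values only.
import numpy as np
from scipy.optimize import curve_fit

def hill(dose, bottom, top, ec50, slope):
    # Four-parameter Hill curve; dose of 0 is clamped to avoid division by zero.
    return bottom + (top - bottom) / (1 + (ec50 / np.maximum(dose, 1e-12)) ** slope)

dose = np.array([0.0, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0])
rng = np.random.default_rng(5)
resp = hill(dose, 1.0, 5.0, 2.0, 1.5) + rng.normal(scale=0.15, size=dose.size)

params, _ = curve_fit(hill, dose, resp, p0=[1.0, 5.0, 1.0, 1.0], maxfev=10000)
bottom, top, ec50, slope = params

# BMD10: dose producing a 10% change of the fitted response range above baseline.
target = bottom + 0.10 * (top - bottom)
bmd10 = ec50 / ((top - bottom) / (target - bottom) - 1) ** (1 / slope)
print(f"EC50 ~ {ec50:.2f}, BMD10 ~ {bmd10:.2f}")
```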
Article
Construction and validation of prognostic models for survival data in the clinical domain is still an active field of research. Nevertheless, there is no consensus on how to develop routine prognostic tests based on a combination of RT-qPCR biomarkers and clinical or demographic variables. In particular, estimating model performance requires properly accounting for the RT-qPCR experimental design. We present a strategy to build, select, and validate a prognostic model for survival data based on a combination of RT-qPCR biomarkers and clinical or demographic data, and we provide an illustration on a real clinical dataset. First, we compare two cross-validation schemes: a classical outcome-stratified cross-validation scheme and an alternative one that accounts for the RT-qPCR plate design, especially when samples are processed in batches. The latter is intended to limit the performance discrepancy, also called the validation surprise, between the training and test sets. Second, strategies for model building (covariate selection, functional relationship modeling, and statistical model) as well as estimation of performance indicators are presented. Since in practice several prognostic models can exhibit similar performance, complementary criteria for model selection are discussed: the stability of the selected variables, the model optimism, and the impact of omitted variables on model performance. On the training dataset, appropriate resampling methods are expected to prevent upward biases due to unaccounted technical and biological variability that may arise from the experimental and intrinsic design of the RT-qPCR assay. Moreover, the stability of the selected variables, the model optimism, and the impact of omitted variables on model performance are pivotal indicators for selecting the optimal model to be validated on the test dataset.
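
The contrast between the two cross-validation schemes can be sketched in a few lines. In the hypothetical illustration below (not the authors' code), a binary endpoint stands in for the survival outcome for brevity, a simulated plate effect is added to the RT-qPCR measurements, and a plate-aware grouped scheme is compared with classical stratified cross-validation; all names and values are assumptions.

```python
# Illustrative sketch: outcome-stratified CV versus a plate-aware scheme that
# keeps each RT-qPCR plate (batch) intact within a single fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_samples, n_markers, n_plates = 96, 5, 8
X = rng.normal(size=(n_samples, n_markers))            # simulated Cq values
plate = np.repeat(np.arange(n_plates), n_samples // n_plates)
X += rng.normal(scale=0.5, size=(n_plates, 1))[plate]  # additive plate effect
y = (X[:, 0] + rng.normal(size=n_samples) > 0).astype(int)  # binary endpoint

model = LogisticRegression(max_iter=1000)

# Classical outcome-stratified CV: plates are split across folds, so the plate
# effect "leaks" between training and test sets, inflating apparent performance.
strat = cross_val_score(model, X, y, scoring="roc_auc",
                        cv=StratifiedKFold(5, shuffle=True, random_state=0))

# Plate-aware CV: whole plates are held out, mimicking deployment on a new
# batch and limiting the "validation surprise" described above.
grouped = cross_val_score(model, X, y, scoring="roc_auc",
                          cv=GroupKFold(5), groups=plate)

print(f"stratified CV AUC: {strat.mean():.3f}")
print(f"plate-wise CV AUC: {grouped.mean():.3f}")
```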
Article
Introduction: The eMERGE (electronic MEdical Records and GEnomics) Network is an NHGRI-supported consortium of five institutions exploring the utility of DNA repositories coupled to electronic medical record (EMR) systems for advancing discovery in genome science. eMERGE also places special emphasis on the ethical, legal, and social issues related to these endeavors. Organization: The five sites are supported by an Administrative Coordinating Center. Setting of network goals is initiated by four working groups: (1) Genomics, (2) Informatics, (3) Consent & Community Consultation, which also includes active participation by investigators outside the eMERGE-funded sites, and (4) the Return of Results Oversight Committee. The Steering Committee, composed of site PIs, site representatives, and NHGRI staff, meets three times per year, once per year jointly with the External Scientific Panel. Current progress: The primary site-specific phenotypes for which samples have undergone genome-wide association study (GWAS) genotyping are cataract and HDL, dementia, electrocardiographic QRS duration, peripheral arterial disease, and type 2 diabetes. A GWAS is also being undertaken for resistant hypertension in approximately 2,000 additional samples identified across the network sites, to be added to the data available for samples already genotyped. Funded by ARRA supplements, secondary phenotypes have been added at all sites to leverage the genotyping data, and hypothyroidism is being analyzed as a cross-network phenotype. Results are being posted in dbGaP. Other key eMERGE activities include evaluation of the issues associated with cross-site deployment of common algorithms to identify cases and controls in EMRs, data privacy of genomic and clinically derived data, development of approaches for large-scale meta-analysis of GWAS data across the five sites, and a community consultation and consent initiative at each site. Future activities: Plans are underway to expand the network in the diversity of its populations and to incorporate GWAS findings into clinical care. Summary: By combining advanced clinical informatics, genome science, and community consultation, eMERGE represents a first step in developing data-driven approaches to incorporate genomic information into routine healthcare delivery.
Article
Proteomic profiling of complex biological mixtures by the ProteinChip technology of surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry (MS) is one of the most promising approaches in toxicological, biological, and clinical research. The reliable identification of protein expression patterns and associated protein biomarkers that differentiate disease from health, or that distinguish different stages of a disease, depends on developing methods for assessing the quality of SELDI-TOF mass spectra. The use of SELDI data for biomarker identification requires rigorous procedures to detect and discard low-quality spectra prior to data analysis. The systematic variability from plates, chips, and spot positions in SELDI experiments was evaluated using biological and technical replicates. No systematic biases from plates, chips, or spots were found. The reproducibility of SELDI experiments was demonstrated by the low coefficients of variation of five peaks present in all 144 spectra from quality control samples that were loaded randomly on different spots in the chips of six bioprocessor plates. We developed a method to detect and discard low-quality spectra prior to proteomic profiling data analysis, which uses a correlation matrix to measure the similarities among SELDI mass spectra obtained from similar biological samples. Application of the correlation matrix to our SELDI data from a liver cancer and liver toxicity study and a myeloma-associated lytic bone disease study confirmed this approach as an efficient and reliable method for detecting low-quality spectra. This report provides evidence that systematic variability between the plates, chips, and spots on which the samples were assayed using SELDI-based proteomic procedures did not exist. The reproducibility of the experiments in our studies was demonstrated to be acceptable, and the resulting profiling data are reliable for subsequent analysis. The correlation matrix was developed as a quality control tool to detect and discard low-quality spectra prior to data analysis. It proved to be a reliable method for measuring the similarities among SELDI mass spectra and can be used for quality control to decrease noise in proteomic profiling data prior to data analysis.
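
The correlation-matrix idea lends itself to a compact sketch. The hypothetical implementation below (not the authors') flags a spectrum as low quality when its mean Pearson correlation with peer spectra from similar samples falls below a threshold; the threshold and the simulated data are illustrative assumptions.

```python
# Sketch of correlation-matrix QC for mass spectra: spectra from comparable
# samples should correlate highly, so a spectrum with low mean correlation
# against its peers is flagged for discarding before profiling analysis.
import numpy as np

def flag_low_quality(spectra: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """spectra: (n_spectra, n_mz_bins) intensities on a common m/z grid.
    Returns a boolean mask, True where a spectrum should be discarded."""
    corr = np.corrcoef(spectra)            # pairwise Pearson correlations
    np.fill_diagonal(corr, np.nan)         # ignore self-correlation
    mean_corr = np.nanmean(corr, axis=1)   # each spectrum vs. all others
    return mean_corr < threshold

# Simulated check: nine replicate-like spectra plus one noise-dominated outlier.
rng = np.random.default_rng(1)
base = np.abs(rng.normal(size=2000))
good = base + rng.normal(scale=0.1, size=(9, 2000))
bad = np.abs(rng.normal(size=(1, 2000)))   # unrelated noise spectrum
mask = flag_low_quality(np.vstack([good, bad]))
print(np.where(mask)[0])                   # expected: [9]
```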
Article
Nasopharyngeal carcinoma (NPC) is a rare malignancy in most parts of the world and one of the most confusing, commonly misdiagnosed, and poorly understood diseases. It is an Epstein-Barr virus-associated malignancy with a remarkable racial and geographical distribution: it is highly prevalent in southern Asia, where it occurs at a prevalence roughly 100-fold higher than in populations not at risk. The etiology of NPC is thought to involve a complex interaction of genetic, viral, environmental, and dietary factors. Thanks to advances in genomics, proteomics, and bioinformatics in recent decades, more understanding of the disease's etiology, carcinogenesis, and progression has been gained. Research into these components may unravel the pathways of NPC development and potentially decipher the molecular characteristics of the malignancy. In the era of molecular medicine, treatments directed at specific targets, using technologies such as immunotherapy and RNA interference (RNAi), are moving from bench to bedside, making molecular biomarker discovery all the more meaningful for NPC management. In this article, the latest molecular biomarker discoveries and progress in NPC are reviewed with respect to the diagnosis, monitoring, treatment, and prognostication of the disease.
Article
LC-MS/MS-based proteomics studies rely on stable analytical system performance that can be evaluated by objective criteria. The National Institute of Standards and Technology (NIST) introduced the MSQC software to compute diverse metrics from experimental LC-MS/MS data, enabling quality analysis and quality control (QA/QC) of proteomics instrumentation. In practice, however, several attributes of the MSQC software prevent its use for routine instrument monitoring. Here, we present QuaMeter, an open-source tool that improves on MSQC in several respects. QuaMeter can directly read raw data from instruments manufactured by different vendors. The software can work with a wide variety of peptide identification tools for improved reliability and flexibility. Finally, the QC metrics implemented in QuaMeter are rigorously defined and tested. The source code and binary versions of QuaMeter are available under the Apache 2.0 License at http://fenchurch.mc.vanderbilt.edu.
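
Although QuaMeter's actual metrics and output schema are not reproduced here, the downstream use of such QC metrics for longitudinal instrument monitoring can be sketched generically. In the hypothetical example below, a run is flagged when any metric's robust (median/MAD-based) z-score exceeds a cutoff; the metric names, values, and cutoff are assumptions rather than either tool's behavior.

```python
# Generic sketch of longitudinal instrument monitoring on per-run QC metrics.
import numpy as np

def robust_z(values: np.ndarray) -> np.ndarray:
    """Median/MAD-based z-scores, less sensitive to the outliers being sought."""
    med = np.median(values, axis=0)
    mad = np.median(np.abs(values - med), axis=0)
    return (values - med) / (1.4826 * mad)  # 1.4826 scales MAD to sigma

# rows = LC-MS/MS runs over time; columns = hypothetical metrics such as
# MS1 total ion current, median peak width (s), and identification rate.
rng = np.random.default_rng(2)
metrics = rng.normal(loc=[1e9, 12.0, 0.35], scale=[5e7, 0.5, 0.02], size=(30, 3))
metrics[17] = [6e8, 18.0, 0.20]             # one degraded run

z = robust_z(metrics)
flagged = np.where(np.any(np.abs(z) > 4.0, axis=1))[0]
print(flagged)                              # expected to include run 17
```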
Article
Genome-wide association studies (GWAS) are a useful approach to studying the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient reuse of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute-supported consortium of five sites performing various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms covering a core set of 14 phenotypes for extraction of study samples from its DNA repository. Each site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or the Center for Inherited Disease Research using Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample and marker quality and of various batch effects. Upon completion of the genotyping and QC analyses for each site's primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset then reentered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here, we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion of the eMERGE QC pipeline for merged datasets. These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II, and they also serve as a starting point for investigators merging multiple genotype datasets accessible through the National Center for Biotechnology Information's database of Genotypes and Phenotypes. Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data, despite the new challenges that appear in the process.
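
A minimal sketch of the kind of sample- and marker-level checks applied around such a merge is given below, assuming genotypes coded as 0/1/2 dosages with NaN for missing calls; the thresholds are conventional placeholders, not the eMERGE pipeline's actual settings.

```python
# Sketch of pre-merge genotype QC: per-sample call rate, per-SNP missingness,
# and a cross-batch allele-frequency screen for batch effects.
import numpy as np

def qc_filters(G, batch, sample_call_rate=0.98, snp_call_rate=0.98,
               max_freq_diff=0.10):
    """G: (n_samples, n_snps) dosage matrix; batch: genotyping center per sample.
    Returns boolean keep-masks for samples and SNPs."""
    called = ~np.isnan(G)
    keep_samples = called.mean(axis=1) >= sample_call_rate
    keep_snps = called[keep_samples].mean(axis=0) >= snp_call_rate

    # Allele frequencies should agree across centers; large differences at a
    # SNP suggest strand flips or clustering problems in one batch.
    freqs = np.vstack([np.nanmean(G[batch == b], axis=0) / 2.0
                       for b in np.unique(batch)])
    keep_snps &= (freqs.max(axis=0) - freqs.min(axis=0)) <= max_freq_diff
    return keep_samples, keep_snps

# Toy data: two "centers", 200 samples, 1,000 SNPs, ~1% missingness.
rng = np.random.default_rng(3)
G = rng.binomial(2, 0.3, size=(200, 1000)).astype(float)
G[rng.random(G.shape) < 0.01] = np.nan
batch = np.repeat([0, 1], 100)
samples_ok, snps_ok = qc_filters(G, batch)
print(samples_ok.sum(), snps_ok.sum())
```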
Article
Genome-wide scans of nucleotide variation in human subjects are providing an increasing number of replicated associations with complex disease traits. Most of the variants detected have small effects and, collectively, they account for a small fraction of the total genetic variance. Very large sample sizes are required to identify and validate findings. In this situation, even small sources of systematic or random error can cause spurious results or obscure real effects. The need for careful attention to data quality has been appreciated for some time in this field, and a number of strategies for quality control and quality assurance (QC/QA) have been developed. Here we extend these methods and describe a system of QC/QA for genotypic data in genome-wide association studies (GWAS). This system includes some new approaches that (1) combine analysis of allelic probe intensities and called genotypes to distinguish gender misidentification from sex chromosome aberrations, (2) detect autosomal chromosome aberrations that may affect genotype calling accuracy, (3) infer DNA sample quality from relatedness and allelic intensities, (4) use duplicate concordance to infer SNP quality, (5) detect genotyping artifacts from dependence of Hardy-Weinberg equilibrium test P-values on allelic frequency, and (6) demonstrate sensitivity of principal components analysis to SNP selection. The methods are illustrated with examples from the "Gene Environment Association Studies" (GENEVA) program. The results suggest several recommendations for QC/QA in the design and execution of GWAS.
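
Check (5) in particular lends itself to a short sketch: compute a Hardy-Weinberg chi-square P-value per SNP, then inspect whether the P-values drift with minor allele frequency. The illustration below uses simulated, HWE-conforming genotypes, so the median P-value should sit near 0.5 in every frequency bin; in a real study, a trend across bins would point to a frequency-dependent calling artifact. All data here are synthetic.

```python
# Sketch of HWE-based artifact screening: per-SNP chi-square test, then
# median P-value inspected within minor-allele-frequency bins.
import numpy as np
from scipy.stats import chi2

def hwe_chisq_p(n_aa, n_ab, n_bb):
    """Vectorized 1-df chi-square HWE test from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)                  # frequency of allele A
    exp = np.vstack([n * p**2, 2 * n * p * (1 - p), n * (1 - p)**2]).T
    obs = np.vstack([n_aa, n_ab, n_bb]).T
    stat = ((obs - exp) ** 2 / exp).sum(axis=1)
    return chi2.sf(stat, df=1)

rng = np.random.default_rng(4)
n, m = 2000, 5000
maf = rng.uniform(0.05, 0.5, size=m)
geno = rng.binomial(2, maf, size=(n, m))             # HWE-conforming genotypes
counts = [(geno == g).sum(axis=0) for g in (2, 1, 0)]
pvals = hwe_chisq_p(*counts)

bins = np.digitize(maf, [0.1, 0.2, 0.3, 0.4])        # five MAF bins
for b in range(5):
    print(b, round(float(np.median(pvals[bins == b])), 3))
```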