A Method to Address Differential Bias in Genotyping in Large-Scale Association Studies

Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Department of Medical Genetics, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, United Kingdom.
PLoS Genetics, 06/2007; 3(5):e74. DOI: 10.1371/journal.pgen.0030074


In a previous paper we have shown that, when DNA samples for cases and controls are prepared in different laboratories prior to high-throughput genotyping, scoring inaccuracies can lead to differential misclassification and, consequently, to increased false-positive rates. Different DNA sourcing is often unavoidable in large-scale disease association studies of multiple case and control sets. Here, we describe methodological improvements to minimise such biases. These fall into two categories: improvements to the basic clustering methods for identifying genotypes from fluorescence intensities, and use of "fuzzy" calls in association tests in order to make appropriate allowance for call uncertainty. We find that the main improvement is a modification of the calling algorithm that links the clustering of cases and controls while allowing for different DNA sourcing. We also find that, in the presence of different DNA sourcing, biases associated with missing data can increase the false-positive rate. Therefore, we propose the use of "fuzzy" calls to deal with uncertain genotypes that would otherwise be labeled as missing.
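The "fuzzy" call idea can be illustrated with a minimal sketch (this is not the authors' implementation; the function name and the choice of an allelic chi-square test are assumptions made for illustration): each sample carries posterior genotype probabilities, and the association test is computed from expected allele counts, so uncertain genotypes contribute fractionally instead of being set to missing.

```python
import numpy as np
from scipy.stats import chi2

def fuzzy_allelic_test(case_probs, control_probs):
    """Allelic chi-square test using posterior genotype probabilities
    ("fuzzy" calls) instead of hard genotype assignments.

    Each row of the input arrays holds P(AA), P(AB), P(BB) for one sample;
    uncertain samples contribute fractionally rather than being dropped.
    """
    weights = np.array([0.0, 1.0, 2.0])  # expected B-allele count per genotype
    case_b = (np.asarray(case_probs) @ weights).sum()
    ctrl_b = (np.asarray(control_probs) @ weights).sum()
    n_case, n_ctrl = 2 * len(case_probs), 2 * len(control_probs)
    # 2x2 table of expected allele counts: rows case/control, columns A/B
    table = np.array([[n_case - case_b, case_b],
                      [n_ctrl - ctrl_b, ctrl_b]])
    expected = table.sum(axis=1, keepdims=True) @ table.sum(axis=0, keepdims=True) / table.sum()
    stat = ((table - expected) ** 2 / expected).sum()
    return stat, chi2.sf(stat, df=1)
```

With identical genotype-probability distributions in cases and controls the statistic is zero, as expected; a sample whose call is uncertain (say, 60% AB / 40% BB) adds 1.4 expected B alleles rather than being discarded.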

    • "Consequently, a large proportion of samples were misclassified when we attempted unsupervised clustering using bivariate finite mixture model approaches, first with PlatinumCNV [18], then with our own, mixture of beta-Gaussian distributions, approach. Finally, the clusters are in slightly different positions in cases and controls, reflecting the known sensitivity of genotyping chips to subtle differences in DNA preparation and storage conditions since they were prepared and processed in two different centers [19,20]. Instead, we used the qPCR copy numbers as training data to perform supervised classification with knn on the SNP signals, which does not explicitly rely on the identification of distinct clusters."
    ABSTRACT: Killer Immunoglobulin-like Receptors (KIRs) are surface receptors of natural killer cells that bind to their corresponding Human Leukocyte Antigen (HLA) class I ligands, making them interesting candidate genes for HLA-associated autoimmune diseases, including type 1 diabetes (T1D). However, allelic and copy number variation in the KIR region effectively mask it from standard genome-wide association studies: single nucleotide polymorphism (SNP) probes targeting the region are often discarded by standard genotype callers since they exhibit variable cluster numbers. Quantitative Polymerase Chain Reaction (qPCR) assays address this issue. However, their cost is prohibitive at the sample sizes required for detecting effects typically observed in complex genetic diseases. We propose a more powerful and cost-effective alternative, which combines signals from SNPs with more than three clusters found in existing datasets, with qPCR on a subset of samples. First, we showed that noise and batch effects in multiplexed qPCR assays are addressed through normalisation and simultaneous copy number calling of multiple genes. Then, we used supervised classification to impute copy numbers of specific KIR genes from SNP signals. We applied this method to assess copy number variation in two KIR genes, KIR3DL1 and KIR3DS1, which are suitable candidates for T1D susceptibility since they encode the only KIR molecules known to bind with HLA-Bw4 epitopes. We find no association between KIR3DL1/3DS1 copy number and T1D in 6744 cases and 5362 controls; a sample size twenty-fold larger than in any previous KIR association study. Due to our sample size, we can exclude odds ratios larger than 1.1 for the common KIR3DL1/3DS1 copy number groups at the 5% significance level. We found no evidence of association of KIR3DL1/3DS1 copy number with T1D, either overall or dependent on HLA-Bw4 epitope. Five other KIR genes, KIR2DS4, KIR2DL3, KIR2DL5, KIR2DS5 and KIR2DS1, in high linkage disequilibrium with KIR3DL1 and KIR3DS1, are also unlikely to be significantly associated. Our approach could potentially be applied to other KIR genes to allow cost effective assaying of gene copy number in large samples.
    Full-text · Article · Apr 2014 · BMC Genomics
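The supervised-classification step in the excerpt above can be sketched roughly as follows. This is a toy illustration with synthetic intensity data, not the study's actual pipeline: cluster positions and noise levels are invented, and scikit-learn's k-nearest-neighbours classifier stands in for whatever implementation the authors used. qPCR copy numbers label a training subset, and kNN then imputes copy number for the remaining samples from their SNP signals, without requiring distinct, well-separated clusters.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic two-dimensional SNP intensity signals for three copy-number
# classes (0, 1, 2 copies); cluster centres are invented for illustration.
centres = {0: (0.2, 1.0), 1: (0.6, 0.6), 2: (1.0, 0.2)}
X_train, y_train = [], []
for cn, centre in centres.items():
    X_train.append(rng.normal(centre, 0.05, size=(50, 2)))  # qPCR-typed subset
    y_train += [cn] * 50
X_train = np.vstack(X_train)

# qPCR copy numbers act as training labels; kNN then classifies the
# remaining samples directly from their SNP signals.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
predicted_cn = knn.predict([[0.21, 0.98], [0.99, 0.22]])
```

Because kNN classifies each point by its labeled neighbours, it tolerates clusters that shift between cases and controls, which is the failure mode the excerpt describes for unsupervised mixture-model clustering.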
    • "Although methods for performing imputation have been developed, methods of downstream analysis that can work with continuous probability values produced as imputation output are still in development. Substantial work has been done for case-control studies, including methods originally aimed at addressing differential data sources [Plagnol et al., 2007] and "fuzzy" genotype calls [Louis et al., 2010; Marchini et al., 2007]. However, no method has yet been developed for case-parent trio data."
    ABSTRACT: Genotype imputation has become a standard option for researchers to expand their genotype datasets to improve signal precision and power in tests of genetic association with disease. In imputations for family-based studies however, subjects are often treated as unrelated individuals: currently, only BEAGLE allows for simultaneous imputation for trios of parents and offspring; however, only the most likely genotype calls are returned, not estimated genotype probabilities. For population-based SNP association studies, it has been shown that incorporating genotype uncertainty can be more powerful than using hard genotype calls. We here investigate this issue in the context of case-parent family data. We present the statistical framework for the genotypic transmission-disequilibrium test (gTDT) using observed genotype calls and imputed genotype probabilities, derive an extension to assess gene-environment interactions for binary environmental variables, and illustrate the performance of our method on a set of trios from the International Cleft Consortium. In contrast to population-based studies, however, utilizing the genotype probabilities in this framework (derived by treating the family members as unrelated) can result in biases of the test statistics toward protectiveness for the minor allele, particularly for markers with lower minor allele frequencies and lower imputation quality. We further compare the results between ignoring relatedness in the imputation and taking family structure into account, based on hard genotype calls. We find that by far the least biased results are obtained when family structure is taken into account and currently recommend this approach in spite of its intense computational requirements.
    Full-text · Article · Apr 2012 · Genetic Epidemiology
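The core of "incorporating genotype uncertainty" rather than using hard calls can be shown in a few lines (an illustrative sketch, not the gTDT itself; the function name is invented): the expected minor-allele dosage is the probability-weighted genotype, so a confidently imputed genotype behaves like a hard call while an uncertain one contributes fractionally.

```python
import numpy as np

def expected_dosage(probs):
    """Expected minor-allele dosage from imputed genotype probabilities.

    probs: array of shape (n_samples, 3) holding P(AA), P(Aa), P(aa).
    A hard call corresponds to a probability vector like (0, 1, 0);
    an uncertain imputation contributes a fractional dosage instead.
    """
    return np.asarray(probs, dtype=float) @ np.array([0.0, 1.0, 2.0])

# A confidently imputed heterozygote and an uncertain genotype:
dosages = expected_dosage([[0.0, 1.0, 0.0], [0.1, 0.6, 0.3]])  # -> [1.0, 1.2]
```

The bias the abstract reports arises upstream of this step: if the probabilities are derived by treating related family members as unrelated, the dosages themselves are distorted, which no downstream weighting can repair.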
    • "Statistical methods have also been developed to identify, assess or incorporate genotyping errors in association studies (Gordon et al., 2001; Gordon and Ott, 2001; Hao and Wang, 2004; Rice and Holmans, 2003). Recently, Plagnol et al. (2007) introduced a calling algorithm to minimize the biases that occur when case and control DNA samples are from different sources and processed in different laboratories. Miyagawa et al. (2008) investigated appropriate cutoff values for each of the QC variables (MSP, MAF, HWE and confidence score of genotype calls) by dividing and reshuffling healthy samples."
    ABSTRACT: The quality control (QC) filtering of single nucleotide polymorphisms (SNPs) is an important step in genome-wide association studies to minimize potential false findings. SNP QC commonly uses expert-guided filters based on QC variables [e.g. Hardy-Weinberg equilibrium, missing proportion (MSP) and minor allele frequency (MAF)] to remove SNPs with insufficient genotyping quality. The rationale of the expert filters is sensible and concrete, but its implementation requires arbitrary thresholds and does not jointly consider all QC features. We propose an algorithm that is based on principal component analysis and clustering analysis to identify low-quality SNPs. The method minimizes the use of arbitrary cutoff values, allows a collective consideration of the QC features and provides conditional thresholds contingent on other QC variables (e.g. different MSP thresholds for different MAFs). We apply our method to the seven studies from the Wellcome Trust Case Control Consortium and the major depressive disorder study from the Genetic Association Information Network. We measured the performance of our method compared to the expert filters based on the following criteria: (i) percentage of SNPs excluded due to low quality; (ii) inflation factor of the test statistics (lambda); (iii) number of false associations found in the filtered dataset; and (iv) number of true associations missed in the filtered dataset. The results suggest that with the same or fewer SNPs excluded, the proposed algorithm tends to give a similar or lower value of lambda, a reduced number of false associations, and retains all true associations. The algorithm is available at http://www4.stat.ncsu.edu/jytzeng/software.php
    Full-text · Article · Jul 2010 · Bioinformatics
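The PCA-plus-clustering idea can be sketched as follows. The QC values, cluster counts, and separation here are synthetic and invented for illustration; this is not the authors' released software (available at the URL in the abstract). QC variables are standardized so each feature contributes comparably, projected onto principal components, and clustered; the minority cluster is flagged as low quality, avoiding hand-picked per-variable thresholds.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic QC variables per SNP: missing proportion (MSP), minor allele
# frequency (MAF) and a Hardy-Weinberg (HWE) test statistic.
good = np.column_stack([rng.uniform(0.0, 0.02, 900),   # low missingness
                        rng.uniform(0.05, 0.5, 900),   # common variants
                        rng.chisquare(1, 900)])        # HWE stat near null
bad = np.column_stack([rng.uniform(0.05, 0.3, 100),    # high missingness
                       rng.uniform(0.0, 0.05, 100),    # rare variants
                       rng.chisquare(1, 100) + 20])    # HWE deviation
qc = np.vstack([good, bad])

# Jointly consider the QC features: standardize, project onto principal
# components, cluster, and flag the smaller cluster as low quality.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(qc))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
low_quality = labels == np.argmin(np.bincount(labels))
```

Because the decision boundary comes from the joint distribution of the features, a SNP with moderate missingness can still be flagged if its MAF and HWE values are jointly unusual, which is the "conditional thresholds" behaviour the abstract describes.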