A genotype calling algorithm for the Illumina BeadArray platform

Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK.
Bioinformatics (Impact Factor: 4.98). 11/2007; 23(20):2741-6. DOI: 10.1093/bioinformatics/btm443
Source: PubMed


Large-scale genotyping relies on unsupervised automated calling algorithms to assign genotypes to hybridization data. A number of such calling algorithms have recently been established for the Affymetrix GeneChip genotyping technology. Here, we present a fast and accurate genotype calling algorithm for the Illumina BeadArray genotyping platforms. As the technology moves towards assaying millions of genetic polymorphisms simultaneously, there is a need for integrated, easy-to-use software for calling genotypes.
We have introduced a model-based genotype calling algorithm that neither relies on prior training data nor requires computationally intensive procedures. The algorithm can assign genotypes to hybridization data from thousands of individuals simultaneously and pools information across multiple individuals to improve the calling. The method can accommodate variations in hybridization intensities that result in dramatic shifts in the position of the genotype clouds by identifying the optimal coordinates to initialize the algorithm. By incorporating perturbation analysis, we obtain a quality metric that measures the stability of the assigned genotype calls. We show that this quality metric can be used to identify SNPs with low call rates and accuracy.
The C++ executable for the algorithm described here is available by request from the authors.
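
The abstract describes the approach only at a high level, so the following is a minimal illustrative sketch in Python and not the authors' implementation. It assumes each sample at a SNP is summarised by a single contrast value (for example (x - y) / (x + y) of the two allele intensities), fits a three-cluster Gaussian mixture with a shared, fixed standard deviation by EM across all samples, withholds a call when the best posterior falls below a threshold, and scores stability as the concordance of calls after adding small random noise and refitting. The function names, the contrast summary, the Gaussian model, the starting means, the noise level and the 0.95 default are all assumptions made for illustration.

```python
import math
import random

GENOTYPES = ("AA", "AB", "BB")
NO_CALL = "NC"

def posteriors(x, means, sd, weights):
    """Posterior probability of each cluster for one contrast value.
    The Gaussian normalising constant cancels because sd is shared."""
    dens = [w * math.exp(-0.5 * ((x - m) / sd) ** 2)
            for m, w in zip(means, weights)]
    total = sum(dens)
    if total == 0.0:  # point is numerically far from every cluster
        return [1.0 / len(means)] * len(means)
    return [d / total for d in dens]

def fit_clusters(contrasts, init_means=(0.8, 0.0, -0.8), sd=0.1, n_iter=50):
    """EM for cluster means and weights, pooling all samples at one SNP.
    The starting means are hypothetical, not the paper's initialisation."""
    means = list(init_means)
    weights = [1.0 / len(means)] * len(means)
    for _ in range(n_iter):
        resp = [posteriors(x, means, sd, weights) for x in contrasts]
        for k in range(len(means)):
            r_k = sum(r[k] for r in resp)
            weights[k] = r_k / len(contrasts)
            if r_k > 1e-12:  # keep the old mean for an empty cluster
                means[k] = sum(r[k] * x for r, x in zip(resp, contrasts)) / r_k
    return means, weights

def call_genotypes(contrasts, means, weights, sd=0.1, threshold=0.95):
    """Assign the most probable genotype, or NO_CALL below the threshold."""
    calls = []
    for x in contrasts:
        post = posteriors(x, means, sd, weights)
        best = max(range(len(post)), key=post.__getitem__)
        calls.append(GENOTYPES[best] if post[best] >= threshold else NO_CALL)
    return calls

def fit_and_call(contrasts, sd=0.1, threshold=0.95):
    means, weights = fit_clusters(contrasts, sd=sd)
    return call_genotypes(contrasts, means, weights, sd=sd, threshold=threshold)

def perturbation_stability(contrasts, noise_sd=0.02, n_rep=20, seed=0):
    """Fraction of calls unchanged after jittering the data and refitting."""
    rng = random.Random(seed)
    base = fit_and_call(contrasts)
    same = total = 0
    for _ in range(n_rep):
        noisy = [x + rng.gauss(0.0, noise_sd) for x in contrasts]
        for b, c in zip(base, fit_and_call(noisy)):
            same += (b == c)
            total += 1
    return same / total
```

Under these assumptions, a SNP with well-separated genotype clouds yields a stability close to 1, while a SNP whose calls flip under small perturbations scores lower and could be flagged for review.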



Available from: Taane G Clark, Jan 10, 2014
    • "A number of algorithms are available for processing the raw signal of paired allele intensities into discrete genotype calls (AA, AB, BB) for each SNP in each sample. Current methods include: GenCall [11], Illumina’s proprietary method implemented in the GenomeStudio software; GenoSNP [12]; Illuminus [13]; CRLMM [14-16]; Birdseed [17] and BeagleCall [18]. Three new methods have been proposed recently to meet the challenge of calling low frequency/rare variants on the Illumina platform (M3 [7], zCall [19] and OptiCall [8]). "
    ABSTRACT: Background SNP genotyping microarrays have revolutionized the study of complex disease. The current range of commercially available genotyping products contain extensive catalogues of low frequency and rare variants. Existing SNP calling algorithms have difficulty dealing with these low frequency variants, as the underlying models rely on each genotype having a reasonable number of observations to ensure accurate clustering. Results Here we develop KRLMM, a new method for converting raw intensities into genotype calls that aims to overcome this issue. Our method is unique in that it applies careful between sample normalization and allows a variable number of clusters k (1, 2 or 3) for each SNP, where k is predicted using the available data. We compare our method to four genotyping algorithms (GenCall, GenoSNP, Illuminus and OptiCall) on several Illumina data sets that include samples from the HapMap project where the true genotypes are known in advance. All methods were found to have high overall accuracy (> 98%), with KRLMM consistently amongst the best. At low minor allele frequency, the KRLMM, OptiCall and GenoSNP algorithms were observed to be consistently more accurate than GenCall and Illuminus on our test data. Conclusions Methods that tailor their approach to calling low frequency variants by either varying the number of clusters (KRLMM) or using information from other SNPs (OptiCall and GenoSNP) offer improved accuracy over methods that do not (GenCall and Illuminus). The KRLMM algorithm is implemented in the open-source crlmm package distributed via the Bioconductor project (http://www.bioconductor.org).
    BMC Bioinformatics 05/2014; 15(1):158. DOI:10.1186/1471-2105-15-158 · 2.58 Impact Factor
    • "The samples were genotyped using the Illumina HumanHap610Q array. The normalized intensity data was used by the Illuminus calling algorithm [31] to assign genotypes. No calls were assigned if an individual's most likely genotype was called with a posterior probability threshold of less than 0.95. "
    ABSTRACT: Background A newly-described syndrome called Aneurysm-Osteoarthritis Syndrome (AOS) was recently reported. AOS presents with early onset osteoarthritis (OA) in multiple joints, together with aneurysms in major arteries, and is caused by rare mutations in SMAD3. Because of the similarity of AOS to idiopathic generalized OA (GOA), we hypothesized that SMAD3 is also associated with GOA and tested the hypothesis in a population-based cohort. Methods Study participants were derived from the Chingford study. Kellgren-Lawrence (KL) grades and the individual features of osteophytes and joint space narrowing (JSN) were scored from radiographs of hands, knees, hips, and lumbar spines. The total KL score, osteophyte score, and JSN score were calculated and used as indicators of the total burden of radiographic OA. Forty-one common SNPs within SMAD3 were genotyped using the Illumina HumanHap610Q array. Linear regression modelling was used to test the association between the total KL score, osteophyte score, and JSN score and each of the 41 SNPs, with adjustment for patient age and BMI. Permutation testing was used to control the false positive rate. Results A total of 609 individuals were included in the analysis. All were Caucasian females with a mean age of 60.9±5.8. We found that rs3825977, with a minor allele (T) frequency of 20%, in the last intron of SMAD3, was significantly associated with total KL score (β = 0.14, Ppermutation = 0.002). This association was stronger for the total JSN score (β = 0.19, Ppermutation = 0.002) than for total osteophyte score (β = 0.11, Ppermutation = 0.02). The T allele is associated with a 1.47-fold increased odds for people with 5 or more joints to be affected by radiographic OA (Ppermutation = 0.046). Conclusion We found that SMAD3 is significantly associated with the total burden of radiographic OA. Further studies are required to reveal the mechanism of the association.
    PLoS ONE 05/2014; 9(5):e97786. DOI:10.1371/journal.pone.0097786 · 3.23 Impact Factor
    • "TwinsUK samples were genotyped using a combination of Illumina arrays (HumanHap300 [24,25], HumanHap610Q, 1M-Duo and 1.2MDuo 1M). For each dataset, the Illuminus calling algorithm [26] was used to assign genotypes (posterior probability ≥0.95) and applied the standardized data QC criteria based on: (1) call rate, heterozygosity, ethnicity, and relatedness (for sample exclusion); and (2) HWE, minor allele frequency, and call rate (for SNPs). After pair-wise concordance check and further visual inspection, the genotype datasets from different arrays were merged. "
    ABSTRACT: Emerging technologies based on mass spectrometry or nuclear magnetic resonance enable the monitoring of hundreds of small metabolites from tissues or body fluids. Profiling of metabolites can help elucidate causal pathways linking established genetic variants to known disease risk factors such as blood lipid traits. We applied statistical methodology to dissect causal relationships between single nucleotide polymorphisms, metabolite concentrations and serum lipid traits, focusing on 95 genetic loci reproducibly associated with the four main serum lipids (total-, low-density lipoprotein- and high-density lipoprotein- cholesterol and triglycerides). The dataset used included 2,973 individuals from two independent population-based cohorts with data for 151 small molecule metabolites and four main serum lipids. Three statistical approaches, namely conditional analysis, Mendelian Randomization and Structural Equation Modelling, were compared to investigate causal relationship at sets of a single nucleotide polymorphism, a metabolite and a lipid trait associated with one another. A subset of three lipid-associated loci (FADS1, GCKR and LPA) have a statistically significant association with at least one main lipid and one metabolite concentration in our data, defining a total of 38 cross-associated sets of a single nucleotide polymorphism, a metabolite and a lipid trait. Structural Equation Modelling provided sufficient discrimination to indicate that the association of a single nucleotide polymorphism with a lipid trait was mediated through a metabolite at 15 of the 38 sets, and involving variants at the FADS1 and GCKR loci. These data provide a framework for evaluating the causal role of components of the metabolome (or other intermediate factors) in mediating the association between established genetic variants and diseases or traits.
    Genome Medicine 03/2014; 6(3):25. DOI:10.1186/gm542 · 5.34 Impact Factor