Haplotype Inference for Population Data with Genotyping Errors

Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, P. R. China.
Biometrical Journal (Impact Factor: 0.95). 08/2009; 51(4):644-58. DOI: 10.1002/bimj.200800215
Source: PubMed


Inference of haplotypes is important in genetic epidemiology studies. However, nearly all large genotype data sets contain errors, caused by fallible, inexpensive genotyping machines and by shortcomings in genotype-scoring software, and such errors can have an enormous impact on haplotype inference. In this article, we propose two novel strategies to reduce the impact of genotyping errors on haplotype inference. The first method makes use of double sampling: for each individual, a "GenoSpectrum", consisting of all possible genotypes and their corresponding likelihoods, is computed. The second method is a genotype clustering algorithm based on multi-genotyping data, which likewise assigns a GenoSpectrum to each individual. We then describe two hybrid EM algorithms (called DS-EM and MG-EM) that perform haplotype inference based on the GenoSpectrum of each individual, obtained by double sampling and by multi-genotyping, respectively. Both simulated data sets and a quasi-real data set demonstrate that our proposed methods perform well in different situations and outperform the conventional EM algorithm and the hidden Markov model (HMM) algorithm of Sun, Greenwood, and Neal (2007, Genetic Epidemiology 31, 937-948) when the genotype data sets have errors.
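The EM machinery shared by DS-EM and MG-EM weights each possible phase resolution by both the current haplotype frequencies and the genotype likelihoods in the GenoSpectrum. Below is a minimal sketch of such a GenoSpectrum-weighted EM for biallelic SNPs coded as per-site minor-allele counts; the encoding, function names, and uniform initialization are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(geno):
    """Enumerate ordered haplotype pairs consistent with a multilocus
    genotype, given as a tuple of per-SNP minor-allele counts (0, 1, 2)."""
    per_snp = []
    for g in geno:
        if g == 0:
            per_snp.append([(0, 0)])
        elif g == 2:
            per_snp.append([(1, 1)])
        else:  # heterozygous site: two possible phases
            per_snp.append([(0, 1), (1, 0)])
    pairs = set()
    for choice in product(*per_snp):
        h1 = tuple(a for a, _ in choice)
        h2 = tuple(b for _, b in choice)
        pairs.add((h1, h2))
    return pairs

def gs_em(genospectra, n_iter=100):
    """EM haplotype-frequency estimation over GenoSpectrum data: each
    individual is a list of (genotype, likelihood) pairs rather than a
    single hard genotype call."""
    haps = set()
    for spectrum in genospectra:
        for geno, _ in spectrum:
            for h1, h2 in compatible_pairs(geno):
                haps.update((h1, h2))
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start
    for _ in range(n_iter):
        counts = defaultdict(float)
        for spectrum in genospectra:
            # E-step: joint posterior over (genotype, phase) for this
            # individual, weighting phases by genotype likelihood
            w = {}
            for geno, lik in spectrum:
                for h1, h2 in compatible_pairs(geno):
                    w[(h1, h2)] = w.get((h1, h2), 0.0) + \
                        lik * freq.get(h1, 0.0) * freq.get(h2, 0.0)
            z = sum(w.values())
            if z == 0.0:
                continue
            for (h1, h2), weight in w.items():
                counts[h1] += weight / z
                counts[h2] += weight / z
        # M-step: renormalize expected haplotype counts into frequencies
        total = sum(counts.values())
        freq = {h: c / total for h, c in counts.items()}
    return freq
```

Hard genotype calls are the special case of a one-element spectrum with likelihood 1, which recovers the conventional EM algorithm.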

  •
    ABSTRACT: Haplotype frequency estimation is indispensable in human genetics, since haplotype-based studies are likely to yield more information than those based on single SNP markers. However, most existing algorithms estimate haplotype frequencies under the assumption that the genotype data are correct. In practice, nearly all large genotype data sets contain errors, and studies have demonstrated that even a small quantity of genotyping errors can have an enormous impact on haplotype frequency estimation. Although the GenoSpectrum (GS)-EM algorithm, which estimates haplotype frequencies while incorporating genotyping uncertainty, has been presented recently [1], it is suitable only for independent individuals, not for dependent pedigree data. In this paper, we describe a new EM algorithm, called GS-PEM, that computes maximum likelihood estimates (MLEs) of haplotype frequencies based on all possible multilocus genotypes (the GenoSpectrum) of each member of a pedigree, making use of the dependence among relatives. We evaluate the performance of GS-PEM through simulation studies and find that it reduces the impact of genotyping errors on haplotype frequency estimation.
    Human Heredity 02/2007; 64(3):172-81. DOI: 10.1159/000102990 · 1.47 Impact Factor
  •
    ABSTRACT: The choice of genotyping families vs unrelated individuals is a critical factor in any large-scale linkage disequilibrium (LD) study. The use of unrelated individuals for such studies is promising, but in contrast to family designs, unrelated samples do not facilitate detection of genotyping errors, which have been shown to be of great importance for LD and linkage studies and may be even more important in genotyping collaborations across laboratories. Here we employ some of the most commonly used analysis methods to examine the relative accuracy of haplotype estimation using families vs unrelated individuals in the presence of genotyping error. The results suggest that even slight amounts of genotyping error can significantly decrease haplotype frequency estimation and reconstruction accuracy, and that the ability to detect such errors in large families is essential when the number and complexity of haplotypes is high (low LD/common alleles). In contrast, in situations of low haplotype complexity (high LD and/or many rare alleles), unrelated individuals offer such a high degree of accuracy that there is little reason for less efficient family designs. Moreover, parent-child trios, which comprise the most popular family design and the most efficient in terms of the number of founder chromosomes per genotype, contain little information for error detection; they offer little or no gain over unrelated samples in nearly all cases, and thus do not seem a useful sampling compromise between unrelated individuals and large families. The implications of these results are discussed in the context of large-scale LD mapping projects such as the proposed genome-wide haplotype map.
    European Journal of Human Genetics 11/2002; 10(10):616-22. DOI: 10.1038/sj.ejhg.5200855 · 4.35 Impact Factor
  •
    ABSTRACT: We present a new stochastic model for genotype generation. The model offers a compromise between rigid block structure and no structure altogether: It reflects a general blocky structure of haplotypes, but also allows for "exchange" of haplotypes at nonboundary SNP sites; it also accommodates rare haplotypes and mutations. We use a hidden Markov model and infer its parameters by an expectation-maximization algorithm. The algorithm was implemented in a software package called HINT (haplotype inference tool) and tested on 58 datasets of genotypes. To evaluate the utility of the model in association studies, we used biological human data to create a simple disease association search scenario. When comparing HINT to three other models, HINT predicted association most accurately.
    Journal of Computational Biology 01/2006; 12(10):1243-60. DOI: 10.1089/cmb.2005.12.1243 · 1.74 Impact Factor
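At the core of such hidden Markov models is the forward algorithm, which evaluates the likelihood of an observed genotype sequence by summing over hidden states site by site. The following is a generic, rescaled forward pass; the parameterization is a placeholder to illustrate the mechanism, not HINT's actual block model.

```python
import math

def forward_loglik(init, trans, emit, obs):
    """Scaled forward algorithm: log P(obs) under a discrete HMM.
    init:  length-S initial hidden-state distribution
    trans: S x S transition matrix, trans[i][j] = P(next = j | cur = i)
    emit:  S x O emission matrix, emit[s][o] = P(obs = o | state = s)
    obs:   sequence of observation indices (e.g. coded genotypes per SNP)"""
    S = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(S)]
    total = sum(alpha)
    loglik = math.log(total)
    alpha = [a / total for a in alpha]  # rescale to avoid underflow
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(S)) * emit[j][o]
                 for j in range(S)]
        total = sum(alpha)
        loglik += math.log(total)
        alpha = [a / total for a in alpha]
    return loglik
```

The accumulated log of the per-site scaling factors equals the sequence log-likelihood, so long genotype sequences can be scored without numerical underflow; the same forward-backward quantities drive the EM parameter updates.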