Two-stage designs for gene-disease association studies with sample size constraints

Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10021, USA.
Biometrics (Impact Factor: 1.52). 10/2004; 60(3):589-97. DOI: 10.1111/j.0006-341X.2004.00207.x
Source: PubMed

ABSTRACT Gene-disease association studies based on case-control designs are often used to identify candidate polymorphisms (markers) conferring disease risk. If a large number of markers are studied, genotyping all markers on all samples uses resources inefficiently. Here, we propose an alternative two-stage method to identify disease-susceptibility markers. In Stage 1, all markers are evaluated on a fraction of the available subjects. The most promising markers are then evaluated on the remaining individuals in Stage 2. This approach can be cost-effective because markers unlikely to be associated with the disease can be eliminated in Stage 1. Using simulations, we show that, both when the markers are independent and when they are correlated, the two-stage approach provides a substantial reduction in the total number of marker evaluations for a minimal loss of power. The power of the two-stage approach is evaluated both when a single marker is associated with the disease and in the presence of multiple disease-susceptibility markers. As a general guideline, simulations over a wide range of parameter configurations indicate that evaluating all the markers on 50% of the individuals in Stage 1 and evaluating the most promising 10% of the markers on the remaining individuals in Stage 2 provides near-optimal power while resulting in a 45% decrease in the total number of marker evaluations.
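The 45% figure in the guideline follows from simple counting, which the short Python sketch below reproduces. It is an illustration only, not the authors' code: the function names and the example study size (1,000 subjects, 1,000 markers) are assumptions chosen for convenience, and the sketch counts genotyping effort only, saying nothing about power.

```python
def marker_evaluations(n_subjects, n_markers, stage1_fraction, carry_fraction):
    """Count marker evaluations for a two-stage and a one-stage design.

    stage1_fraction: share of subjects genotyped on all markers in Stage 1.
    carry_fraction:  share of markers carried forward to Stage 2.
    All names here are hypothetical, introduced for this illustration.
    """
    n_stage1 = round(n_subjects * stage1_fraction)   # subjects used in Stage 1
    n_stage2 = n_subjects - n_stage1                 # remaining subjects
    m_stage2 = round(n_markers * carry_fraction)     # markers kept for Stage 2

    two_stage = n_stage1 * n_markers + n_stage2 * m_stage2
    one_stage = n_subjects * n_markers               # every marker on every subject
    return two_stage, one_stage


def relative_saving(n_subjects, n_markers, stage1_fraction, carry_fraction):
    """Fractional reduction in marker evaluations relative to one stage."""
    two_stage, one_stage = marker_evaluations(
        n_subjects, n_markers, stage1_fraction, carry_fraction
    )
    return 1 - two_stage / one_stage


if __name__ == "__main__":
    # Guideline from the abstract: 50% of subjects in Stage 1, top 10% of markers in Stage 2.
    saving = relative_saving(1000, 1000, stage1_fraction=0.5, carry_fraction=0.1)
    print(f"Relative saving: {saving:.2%}")
```

With half the subjects in Stage 1 and 10% of the markers in Stage 2, the two-stage design needs 0.5 + 0.5 × 0.1 = 0.55 of the one-stage evaluations, which is the 45% decrease quoted in the abstract.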

    • "Although the costs of whole-genome genotyping are decreasing with the high-throughput biological technology, the total costs for a GWAS are still very expensive due to the thousands of sampling units and huge amounts of single-nucleotide polymorphisms. In order to save the costs, the two-stage design and the corresponding statistical analysis where all the SNPs are genotyped in Stage 1 on a portion of the samples and the promising SNPs with small P-values (e.g., <0.001) based on some efficient tests are further screened on the remaining subjects, are often adopted in practice (e.g., [11] [12] [13] [14] [15])."
    • "Approach of minimization costs in two-stage design was proposed by Elston et al. [3] for linkage analysis. Later this approach was transferred to association analysis by Satagopan et al. [12][13][14]. Optimization of the design consists in choosing the proportion of samples between two stages and critical values in such a manner as to minimize the total cost for specified genome-wide significance level and power [9][15][21][5] [8][16][10]. The start point of present work was a paper of Nguyen et al [10], where an optimal robust two-stage design using the MAX3 test were considered. "
    ABSTRACT: Genome-wide association studies (GWAS) incur large phenotyping and genotyping costs. A two-stage design can reduce genotyping costs: in the first stage, some disease-associated SNPs are detected, and in the second stage these associations are checked at a reliable significance level. This procedure decreases the number of SNPs genotyped in the second stage, so the genotyping costs are lower than those of a one-stage design. Modern genotyping technologies use 96- and 384-well plates, so the number of individuals should be proportional to the well-plate size. Monte Carlo simulation was used to find the optimal number of well plates and the critical values for the first and second stages. We also found that the costs are inversely related to the Kullback-Leibler divergence between the case and control distributions under the alternative hypothesis.
    Applied Methods of Statistical Analysis. Simulations and Statistical Inference (AMSA 2013) International Conference, Novosibirsk, Russia; 09/2013
    • "False negative rates are increased by multiple factors that cause systematic biases, and such biases reduce statistical power [26]. The statistical power of 80% is used widely to avoid false negative associations and to determine a cost-effective sample size in large-scale association studies [7, 22, 23]. However, many researchers tend to overlook the importance of statistical power and sample size calculations. "
    ABSTRACT: A sample size with sufficient statistical power is critical to the success of genetic association studies to detect causal genes of human complex diseases. Genome-wide association studies require much larger sample sizes to achieve an adequate statistical power. We estimated the statistical power with increasing numbers of markers analyzed and compared the sample sizes that were required in case-control studies and case-parent studies. We computed the effective sample size and statistical power using Genetic Power Calculator. An analysis using a larger number of markers requires a larger sample size. Testing a single-nucleotide polymorphism (SNP) marker requires 248 cases, while testing 500,000 SNPs and 1 million markers requires 1,206 cases and 1,255 cases, respectively, under the assumption of an odds ratio of 2, 5% disease prevalence, 5% minor allele frequency, complete linkage disequilibrium (LD), 1:1 case/control ratio, and a 5% error rate in an allelic test. Under a dominant model, a smaller sample size is required to achieve 80% power than other genetic models. We found that a much lower sample size was required with a strong effect size, common SNP, and increased LD. In addition, studying a common disease in a case-control study of a 1:4 case-control ratio is one way to achieve higher statistical power. We also found that case-parent studies require more samples than case-control studies. Although we have not covered all plausible cases in study design, the estimates of sample size and statistical power computed under various assumptions in this study may be useful to determine the sample size in designing a population-based genetic association study.
    06/2012; 10(2):117-22. DOI:10.5808/GI.2012.10.2.117