Robust linear regression methods in association studies.
ABSTRACT: It is well known that data deficiencies, such as coding or rounding errors, outliers, or missing values, may lead to misleading results for many statistical methods. Robust statistical methods are designed to accommodate certain types of those deficiencies, allowing for reliable results under various conditions. We analyze the case of statistical tests to detect associations between individual genomic variations (single-nucleotide polymorphisms, SNPs) and quantitative traits when deviations from the normality assumption are observed. We consider the classical analysis of variance tests for the parameters of the appropriate linear model and a robust version of those tests based on M-regression. We then compare their empirical power and level using simulated data with several degrees of contamination.
Data normality is nothing but a mathematical convenience. In practice, experiments usually yield data with non-conforming observations. In the presence of such data, classical least squares methods perform poorly: they give biased estimates, raise the number of spurious associations, and often fail to detect true ones. We show, through a simulation study and a real data example, that the robust methodology can be more powerful, and thus more adequate for association studies, than the classical approach.
The code of the robustified version of function lmekin() from the R package kinship is provided as Supplementary Material.
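To make the contrast between least squares and M-regression concrete, the following is a minimal sketch and not the authors' supplementary code. It assumes a Huber loss with tuning constant 1.345, a MAD scale estimate, and an iteratively reweighted least squares fit; the abstract does not say which M-estimator or algorithm was used. The simulated data (an additive SNP coded 0/1/2, a true effect of 0.5, and outliers placed only among individuals with genotype 0) are hypothetical choices made so that the bias of least squares is easy to see, and all function names are illustrative.

```python
import random
import statistics


def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b


def huber_line(x, y, c=1.345, max_iter=100, tol=1e-10):
    """Huber M-estimate of y = a + b*x via iteratively reweighted least squares.

    Assumption: Huber weights with a MAD scale; other psi functions are possible.
    """
    a, b = wls_line(x, y, [1.0] * len(x))  # start from ordinary least squares
    for _ in range(max_iter):
        res = [yi - a - b * xi for xi, yi in zip(x, y)]
        med = statistics.median(res)
        # Median absolute deviation, rescaled to be consistent at the normal
        scale = statistics.median([abs(r - med) for r in res]) / 0.6745
        if scale == 0.0:  # exact fit for at least half the data: nothing to reweight
            break
        # Huber weights: 1 inside the band, shrinking as c*scale/|r| outside it
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in res]
        a_new, b_new = wls_line(x, y, w)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


def simulate(n=1000, maf=0.3, beta=0.5, eps=0.1, shift=15.0, seed=2024):
    """Hypothetical contaminated SNP/trait data (not the paper's design)."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        g = (rng.random() < maf) + (rng.random() < maf)  # minor-allele count 0/1/2
        trait = 1.0 + beta * g + rng.gauss(0.0, 1.0)
        if g == 0 and rng.random() < eps:  # gross errors in one genotype group
            trait += shift
        x.append(float(g))
        y.append(trait)
    return x, y


x, y = simulate()
_, ols_slope = wls_line(x, y, [1.0] * len(x))
_, huber_slope = huber_line(x, y)
print(f"true slope 0.5 | OLS {ols_slope:.3f} | Huber {huber_slope:.3f}")
```

In this setup the outliers pull the least squares slope away from the true effect, while the Huber fit downweights them and stays much closer. The sketch covers estimation only; the robust tests compared in the paper, and their empirical power and level, would require additional machinery.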
- ABSTRACT: The use, in association studies, of the forthcoming dense genomewide collection of single-nucleotide polymorphisms (SNPs) has been heralded as a potential breakthrough in the study of the genetic basis of common complex disorders. A serious problem with association mapping is that population structure can lead to spurious associations between a candidate marker and a phenotype. One common solution has been to abandon case-control studies in favor of family-based tests of association, such as the transmission/disequilibrium test (TDT), but this comes at a considerable cost in the need to collect DNA from close relatives of affected individuals. In this article we describe a novel, statistically valid, method for case-control association studies in structured populations. Our method uses a set of unlinked genetic markers to infer details of population structure, and to estimate the ancestry of sampled individuals, before using this information to test for associations within subpopulations. It provides power comparable with the TDT in many settings and may substantially outperform it if there are conflicting associations in different subpopulations. The American Journal of Human Genetics 08/2000; 67(1):170-81. · 11.20 Impact Factor. Source available from: ncbi.nlm.nih.gov
- ABSTRACT: Association between disease and genetic polymorphisms often contributes critical information in our search for the genetic components of common diseases. Devlin and Roeder [1999: Biometrics 55:997-1004] introduced genomic control, a statistical method that overcomes a drawback to the use of population-based samples for tests of association, namely spurious associations induced by population structure. In essence, genomic control (GC) uses markers throughout the genome to adjust for any inflation in test statistics due to substructure. To date, GC has been developed for binary traits and bi- or multiallelic markers. Tests of association using GC have been limited to single genes. In this report, we generalize GC to quantitative traits (QT) and multilocus models. Using statistical analysis and simulations, we show that GC controls spurious associations in reasonable settings of population substructure for QT models, including gene-gene interaction. Through simulations, we explore GC power for both random and selected samples, assuming the QT locus tested is causal and its specific heritability is 2.5-5%. We find that GC, combined with either random or selected samples, has good power in this setting, and that more complex models induce smaller GC corrections. The latter suggests greater power can be achieved by specifying more complex genetic models, but this observation only follows when such models are largely correct and specified a priori. Genetic Epidemiology 02/2002; 22(1):78-93. · 4.02 Impact Factor
- ABSTRACT: Great efforts and expense have been expended in attempts to detect genetic polymorphisms contributing to susceptibility to complex human disease. Concomitantly, technology for detection and scoring of single nucleotide polymorphisms (SNPs) has undergone rapid development, extensive catalogues of SNPs across the genome have been constructed, and SNPs have been increasingly used as a means for investigation of the genetic causes of complex human diseases. For many diseases, population-based studies of unrelated individuals--in which case-control and cohort studies serve as standard designs for genetic association analysis--can be the most practical and powerful approach. However, extensive debate has arisen about optimum study design, and considerable concern has been expressed that these approaches are prone to population stratification, which can lead to biased or spurious results. Over the past decade, a great shift has been noted, away from case-control and cohort studies, towards family-based association designs. These designs have fewer problems with population stratification but have greater genotyping and sampling requirements, and data can be difficult or impossible to gather. We discuss past evidence for population stratification on genotype-phenotype association studies, review methods to detect and account for it, and present suggestions for future study design and analysis. The Lancet 03/2003; 361(9357):598-604. · 39.06 Impact Factor