Robust linear regression methods in association studies
V.M. Lourenço1,∗, A.M. Pires2 and M. Kirst3
1Department of Mathematics, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa,
2829-516 Caparica, Portugal, and CEMAT, 2Department of Mathematics and CEMAT, Instituto
Superior Técnico (TULisbon), 1049-001 Lisboa, Portugal and 3School of Forest Resources and
Conservation, University of Florida, Gainesville, FL 32611, USA.
Motivation: It is well known that data deficiencies, such as
coding/rounding errors, outliers or missing values, may lead to
misleading results for many statistical methods. Robust statistical
methods are designed to accommodate certain types of those
deficiencies, allowing for reliable results under various conditions. We
analyze the case of statistical tests to detect associations between
genomic individual variations (SNP) and quantitative traits when
deviations from the normality assumption are observed. We consider
the classical ANOVA tests for the parameters of the appropriate linear
model and a robust version of those tests based on M-regression. We
then compare their empirical power and level using simulated data
with several degrees of contamination.
Results: Data normality is nothing but a mathematical convenience.
In practice, experiments usually yield data with nonconforming
observations. In the presence of this type of data, classical least
squares statistical methods perform poorly, giving biased estimates,
raising the number of spurious associations and often failing to detect
true ones. We show through a simulation study and a real data
example that the robust methodology can be more powerful and thus
more adequate for association studies than the classical approach.
Availability: The code of the robustified version of function lmekin()
from the R package kinship is provided as supplementary material.
Supplementary Information: Some extra tables and figures are
provided as supplementary material.
∗To whom correspondence should be addressed.

Genetic association studies aim to identify genetic polymorphisms
that cause phenotypic variation for a trait of interest, or that are in
linkage disequilibrium (LD) with the causative genetic variant. We focus
our analysis on biallelic genetic variants, such as single-nucleotide
polymorphisms (SNP). In this case, the unit of analysis can be
[...] A (adenine) and G (guanine) has categories AA, AG and GG.
We are interested in using a number of genotyped SNPs in a gene,
or region, to detect the genetic factors underlying a quantitative
trait of interest that does not follow simple Mendelian patterns
of inheritance. The most straightforward and still most favoured
approach in association studies, though it raises multiple testing
problems (Nyholt, 2004), is to perform a single SNP test for
every genotyped SNP via regression or ANOVA methods (Tao and
Boulding, 2003, used a linear model to test the association between
SNPs in 8 candidate genes and age-specific growth rate in the Arctic
charr; Martínez et al., 2007, used mixed linear models to test for
association between 57 SNPs from 20 candidate genes and some
wood properties in Pinus taeda; Weber et al., 2007, and Weber
et al., 2008, used a mixed random effects linear model to test for
the association between a collection of SNPs and some Teosinte
traits; Moe et al., 2009, used a linear model to test for association
between 151 SNPs from 57 candidate genes and several traits of
boar). Although the single-SNP approach may be adequate when we
are looking for a single causal variant, it is not very efficient
when the SNPs have limited LD with that causal variant, which
reduces power. Moreover, quantitative traits are usually controlled
by several and sometimes many genes. Thus, a joint analysis of
SNPs may be more adequate, being much more informative than
single-SNP analysis (Jannot et al., 2003). However, it also may
lose power due to the usually large number of degrees of freedom
involved. Ideally, one should make use of the information provided
by multiple SNPs, capturing as much of the genetic variance as
possible, without raising the degrees of freedom too much (Bureau
et al., 2005; Wang and Elston, 2006; Chapman and Whittaker, 2008;
Xiang et al., 2008; Kwee et al., 2008; Li et al., 2009) and thus not
[...] therefore implying that in a genome-wide association study context
a preliminary step of dimension reduction is necessary.
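The single-SNP regression approach discussed above, and its sensitivity to nonconforming observations, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the authors' procedure: the additive 0/1/2 genotype coding, the simple linear model, the Huber weight function with tuning constant 1.345, the MAD residual scale, the simulated data, and the helper names `weighted_line_fit` and `huber_line_fit`. The sketch only compares slope estimates; it does not compute the ANOVA or robust test statistics.

```python
import random
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def huber_line_fit(x, y, c=1.345, n_iter=50):
    """Huber M-estimate of (a, b) via iteratively reweighted least squares."""
    a, b = weighted_line_fit(x, y, [1.0] * len(x))  # start from least squares
    for _ in range(n_iter):
        res = [yi - a - b * xi for xi, yi in zip(x, y)]
        med = statistics.median(res)
        # Robust residual scale: normalised median absolute deviation.
        scale = 1.4826 * statistics.median([abs(r - med) for r in res])
        if scale == 0:
            break
        # Huber weights: 1 for small residuals, c/|u| for large ones.
        w = [1.0 if abs(r / scale) <= c else c / abs(r / scale) for r in res]
        a, b = weighted_line_fit(x, y, w)
    return a, b


rng = random.Random(0)
n, true_slope = 300, 1.0
# Hypothetical additive genotype code: copies of one allele (0, 1 or 2).
x = [rng.choice([0, 1, 1, 2]) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]
# Contaminate: shift the trait of the first 20 individuals with genotype 2.
outliers = [i for i, xi in enumerate(x) if xi == 2][:20]
for i in outliers:
    y[i] -= 20.0

_, b_ols = weighted_line_fit(x, y, [1.0] * n)
_, b_huber = huber_line_fit(x, y)
print(f"least squares slope: {b_ols:.2f}, Huber slope: {b_huber:.2f}")
```

In this simulated setting the contaminated observations all sit at one genotype, so they pull the least-squares slope strongly, while the bounded Huber weights limit their influence. A redescending weight function or a different scale estimate would be equally legitimate choices for such a sketch.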
There is an extensive literature on how two specific data problems,
LD and population structure (PS), may affect both the power to
detect true associations and the number of false positives,
therefore distorting the conclusions when testing for association
between a quantitative trait and a set of candidate SNPs in a
population-based study (Pritchard et al., 2000a; Cardon and Palmer, [...]
for overcoming these problems (Devlin and Roeder, 1999; Pritchard
et al., 2000b; Bacanu et al., 2002; Carlson et al., 2004; Price et al.,
2006; Yu et al., 2006; Li et al., 2008; Malo et al., 2008). Another
frequent data problem, which may have the same sort of undesirable
[...] data. This problem is far less studied than LD or PS. For instance, the
review paper by Balding, 2006, treats non-normality in one sentence
© The Author (2011). Published by Oxford University Press. All rights reserved. For Permissions, please email: email@example.com
Associate Editor: Dr. Alex Bateman
Bioinformatics Advance Access published January 7, 2011
Tan,S. et al. (2010) Large effects on body mass index and insulin resistance of fat
mass and obesity associated gene (FTO) variants in patients with polycystic ovary
syndrome (PCOS). BMC Medical Genetics, 11, 12.
Tao,W.J. and Boulding,E.G. (2003) Association between single nucleotide
polymorphisms in candidate gene and growth rate in the Arctic Charr (Salvelinus
alpinus). Heredity, 91, 60–69.
[...] to Probability and Statistics. Stanford Univ. Press, Stanford, CA.
[...] disequilibrium mapping. Am. J. Hum. Genet., 80, 353–360.
[...] in teosinte (Zea mays ssp. parviglumis). Genetics, 177, 2349–2359.
Weber,A.L. et al. (2008) The genetic architecture of complex traits in teosinte (Zea
mays ssp. parviglumis): new evidence from association mapping. Genetics, 180, [...]
Wu,R. et al. (2007) Statistical Genetics of Quantitative Traits: Linkage, Maps and QTL.
Springer, New York.
Xiang,Z. et al. (2007) An efficient algorithm for genome-wide association study. ACM
T. on Knowl. Disc. from Data, 3, 19.
Xu,C. et al. Integrative analysis of DNA copy number and gene expression in metastatic
oral squamous cell carcinoma identifies genes associated with poor survival.
Molecular Cancer 9(143).
Yu,J. et al. (2006) A unified mixed-model method for association mapping that accounts
for multiple levels of relatedness. Nat. Genet., 38, 203–208.
Zhao,K. et al. (2007) An Arabidopsis example of association mapping in structured
samples. PLoS Genet. 3, e4. doi:10.1371/journal.pgen.0030004
Zhao,W. et al. (2006) Panzea: a database and resource for molecular and functional
diversity in the maize genome. Nucleic Acids Res., 34, D752–D757.
Zou,F. et al. (2003) Rank-based statistical methodologies for quantitative trait locus
mapping. Genetics 165, 1599–1605.