
A Robust Statistical Method for Association-Based eQTL Analysis

Ning Jiang1,3, Minghui Wang1, Tianye Jia1, Lin Wang2, Lindsey Leach1, Christine Hackett3, David Marshall4, Zewei Luo1,2*

1 School of Biosciences, University of Birmingham, Birmingham, United Kingdom, 2 Laboratory of Population and Quantitative Genetics, Institute of Biostatistics, Fudan University, Shanghai, China, 3 BioSS, Invergowrie, Dundee, Scotland, United Kingdom, 4 Scottish Crop Research Institute, Invergowrie, Dundee, Scotland, United Kingdom

Abstract

Background: It is well established that the theoretical kernel of the recent surge in genome-wide association studies (GWAS) is statistical inference of linkage disequilibrium (LD) between a tested genetic marker and a putative locus affecting a disease trait. However, LD analysis is vulnerable to several confounding factors, of which population stratification is the most prominent. Whilst many methods have been proposed to correct for its influence, either by predicting the structure parameters or by correcting the inflation in the test statistic due to stratification, these may not be feasible or may impose further statistical problems in practical implementation.

Methodology: We propose here a novel statistical method to control spurious LD arising from population structure in GWAS by incorporating a control marker into the test for significance of genetic association of a polymorphic marker with phenotypic variation of a complex trait. The method avoids the need for structure prediction, which may be infeasible or inadequate in practice, and accounts properly for a varying effect of population stratification on different regions of the genome under study. Utility and statistical properties of the new method were tested through an intensive computer simulation study and an association-based genome-wide mapping of expression quantitative trait loci in genetically divergent human populations.

Results/Conclusions: The analyses show that the new method confers improved statistical power for detecting genuine genetic association in subpopulations and effective control of spurious associations stemming from population structure when compared with two other methods widely implemented in the GWAS literature.

Citation: Jiang N, Wang M, Jia T, Wang L, Leach L, et al. (2011) A Robust Statistical Method for Association-Based eQTL Analysis. PLoS ONE 6(8): e23192.

doi:10.1371/journal.pone.0023192

Editor: Momiao Xiong, University of Texas School of Public Health, United States of America

Received April 29, 2011; Accepted July 7, 2011; Published August 9, 2011

Copyright: © 2011 Jiang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The research was funded by the Biotechnology and Biological Sciences Research Council (RRAD11534) and the Leverhulme Trust (RCEJ14713). NJ was

also supported by a joint studentship between the University of Birmingham and Biomathematics and Statistics Scotland (BioSS). The funders had no role in study

design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: z.luo@bham.ac.uk

Introduction

Linkage disequilibrium (LD) based association mapping has received increasing attention in the recent literature [1–6] for its potential power and precision in detecting subtle phenotype-associated genetic variants when compared with traditional family-based linkage studies. Association mapping methods for the genetic dissection of complex traits utilize the decay of LD, the rate of which is determined by the genetic distance between loci and the number of generations since the LD arose [7]. Over multiple generations of segregation, only loci physically close to the quantitative trait loci (QTL) are likely to be significantly associated with the trait of interest in a randomly mating population, providing great efficiency at distinguishing between small recombination fractions [8]. Despite this potential, many reported association studies have not been replicated or have resulted in false positives [9–10], commonly caused by ‘cryptic’ structure in population-based samples. Population structure, or population stratification [11], arises from systematic variation in allele frequencies across subpopulations, which can result in statistical association between a disease phenotype and marker(s) that have no physical linkage to causative loci [12–13], i.e. false positive or spurious associations. This gives rise to an urgent need for methods of adjusting for both population structure and cryptic relatedness, the latter occurring due to distant relatedness among samples with no known family relationships.

To avoid the problems arising from population stratification, family-based association studies have been proposed, such as the transmission-disequilibrium test (TDT), which compares the frequencies of marker alleles transmitted from heterozygous parents to affected offspring against those that are not transmitted [14]. In this design the ethnic background of cases and controls is necessarily matched, conferring robustness to the presence of population structure. However, the TDT design requires samples from family trios, which are difficult to obtain compared to population-based designs, where a large sample is feasibly obtained. Moreover, increased genotyping efforts are required for the TDT design to achieve the same power as a population-based design [15–16].

Numerous methods have been proposed to overcome the

problems caused by population structure without the need for

family based samples. Among the most widely used are the

PLoS ONE | www.plosone.org1August 2011 | Volume 6 | Issue 8 | e23192


genomic control (GC) [17] and the structure association (SA) analysis [18–19]. In the former, inflation of the test statistic by population structure is estimated as a constant from unlinked markers in the genomic control group, and the test statistic is then adjusted by this estimate before being used to infer the association. In the latter, unlinked markers are used to estimate the number of subpopulations from which the samples are collected, and then to assign sample individuals to subpopulations. The former method considers an ideal but unrealistic situation of a constant inflation factor for all markers, while in reality the influence of population structure on statistical inference of marker-trait association varies over genome locations [20]. For the SA method, it is computationally intensive both to obtain accurate and reliable values for the number of subpopulations in real datasets and to assign individual population membership. Alternative methods have been adopted to infer the subpopulation number, including a Latent-Class model [21], a mixture model [22] and a Bayesian model, AdmixMap [23]. These methods share the assumption that associations among unlinked markers are the result of population structure, and subpopulations are allocated to minimize these associations. This step depends critically upon the correct selection of a panel of markers to reflect population structure information. Price et al. [24] proposed a principal component analysis (PCA) based method, EIGENSTRAT, to model the ancestral difference in allele frequency and correct for population stratification by adjusting genotypes through linear regression on continuous axes of variation. While EIGENSTRAT provides specific correction for candidate markers, how to choose appropriate markers to infer population structure remains in question. In fact, prediction of the population structure may fail whenever the key assumption behind the structure prediction methods is violated.

Rather than using a panel of unlinked markers to exploit the

cryptic population structure, a single null marker can be used to

correct for bias of the test statistic in association studies. Wang et al.

[25] suggested using a well-selected null marker to correct biases

from population stratification on odds ratio estimation for a

candidate gene within a logistic regression framework. They

assumed a simplistic situation that the null marker had the same

genotypic distribution as the candidate gene, which, however, was

unknown in practice.

Expression quantitative trait locus (eQTL) analyses have recently shown that variation in human gene expression levels among individuals and also among populations is influenced by polymorphic genetic variants [26–28]. The use of structured populations has meant that, to detect the genetic variants accounting for differences in gene expression between subpopulations, GWAS had to be carried out separately for each subpopulation and the results subsequently compared. We present here a simple regression model that uses only one ‘control’ marker to remove the population structure effect in detecting LD between a marker and a putative quantitative trait locus. We first established the theoretical basis for selection and use of a control marker to correct for population structure, and then developed a regression-based method for detecting LD that integrates information from the control marker. We investigated the efficiency of the method in testing for LD and in reducing false positives stemming from population structure through intensive computer simulation studies and re-analysis of gene expression (eQTL) datasets collected from genetically divergent populations. The new method (Method 1) was compared with two alternative methods: single marker regression without population structure correction (Method 2) and multiple regression analysis incorporating known individual ancestry information (Method 3).

Materials and Methods

Method 1 (Regression analysis with correcting

population structure)

The method analyzes a structured randomly mating population produced through instant admixture of two genetically divergent subpopulations. The proportion of subpopulation 1 in the mixed population is denoted by $m$. Let us consider three bi-allelic loci: one affects a quantitative trait (Q) while the other two are polymorphic markers devoid of direct effect on the trait. We call, for convenience, one of the markers the test marker (T), which is to be tested for association with the QTL, and the other the control marker (C), assumed to be associated with neither the QTL nor the test marker (i.e. the linkage disequilibrium $D$ equals 0). The two alleles are denoted by A and a at the putative QTL, T and t at the test marker, and C and c at the control marker. The three genotypes at the QTL, AA, Aa and aa, are assumed to affect the quantitative trait by $d$, $h$ and $-d$ respectively. The trait phenotype of an individual ($Y$) is assumed to be normally distributed with mean depending on its genotype at the QTL and residual variance $\sigma_e^2$. Genotypic values at the test marker and control marker are denoted by $X$ and $Z$, which are the numbers of alleles T and C respectively. In subpopulation $i$ ($i$ = 1 or 2), the allelic frequencies of the QTL, test marker and control marker are denoted by $p_Q^{(i)}$, $p_T^{(i)}$ and $p_C^{(i)}$ respectively, while the coefficients of linkage disequilibrium between any pair of the loci are denoted by $D_{TC}^{(i)}$, $D_{TQ}^{(i)}$ and $D_{CQ}^{(i)}$. Table 1 illustrates the probability distribution of joint genotypes at a test marker and a putative QTL in randomly mating populations together with genotypic values at the QTL, and details

Table 1. Probability distribution of joint genotypes at a test marker and a putative QTL and genotypic values at the QTL.

| Genotype at QTL | Marker genotype | Probability | Genotypic value at QTL |
|---|---|---|---|
| AA | TT | $(qQ)^2$ | $\mu + d$ |
| AA | Tt | $2q^2Q(1-Q)$ | $\mu + d$ |
| AA | tt | $q^2(1-Q)^2$ | $\mu + d$ |
| Aa | TT | $2q(1-q)QR$ | $\mu + h$ |
| Aa | Tt | $2q(1-q)(Q+R-2QR)$ | $\mu + h$ |
| Aa | tt | $2q(1-q)(1-Q)(1-R)$ | $\mu + h$ |
| aa | TT | $(1-q)^2R^2$ | $\mu - d$ |
| aa | Tt | $2(1-q)^2R(1-R)$ | $\mu - d$ |
| aa | tt | $(1-q)^2(1-R)^2$ | $\mu - d$ |

where A and a are segregating alleles at a putative QTL, T and t are alleles at the test marker locus. Allele frequency of A is $q$, allele frequency of T is $p$. $Q$ and $R$ are conditional probabilities of marker allele T given QTL allele A and a respectively, which are formulated as $Q = p + D/q$ and $R = p - D/(1-q)$, where $D$ is the coefficient of linkage disequilibrium between the marker and QTL. $\mu$, $d$ and $h$ are population mean, additive and dominance genic effects at the QTL.

doi:10.1371/journal.pone.0023192.t001

Robust LD-Based eQTL Mapping


for the parameterization can be found in Luo [29]. It is clear from

Table 1 that the marker-QTL distribution can be fully

characterized by the parameters defining population allele

frequencies at the two loci and the coefficient of linkage

disequilibrium between them. This provides the theoretical basis

for statistical analyses developed below.
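The parameterization in Table 1 can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the example values of $p$, $q$, $D$, $\mu$, $d$ and $h$ are illustrative assumptions. It builds the nine joint genotype probabilities from haplotype frequencies under random mating and confirms that the resulting marker-trait covariance equals $2D[d + (1-2q)h]$, the numerator of equation (2).

```python
from itertools import product

def joint_genotype_distribution(p, q, D):
    """Joint probabilities of (QTL genotype, marker genotype), cf. Table 1.

    p: frequency of marker allele T; q: frequency of QTL allele A;
    D: coefficient of LD between marker and QTL.
    Keys are (number of A alleles, number of T alleles).
    """
    Q = p + D / q          # Pr(T | A)
    R = p - D / (1 - q)    # Pr(T | a)
    # Haplotype frequencies implied by p, q and D.
    hap = {(1, 1): q * Q, (1, 0): q * (1 - Q),
           (0, 1): (1 - q) * R, (0, 0): (1 - q) * (1 - R)}
    dist = {}
    # Random mating: a genotype is a pair of independently drawn haplotypes.
    for (a1, x1), (a2, x2) in product(hap, repeat=2):
        key = (a1 + a2, x1 + x2)
        dist[key] = dist.get(key, 0.0) + hap[(a1, x1)] * hap[(a2, x2)]
    return dist

def marker_trait_covariance(dist, mu, d, h):
    """E(XY) - E(X)E(Y) with X the T-allele count and Y the genotypic value."""
    value = {2: mu + d, 1: mu + h, 0: mu - d}
    ex = sum(x * pr for (a, x), pr in dist.items())
    ey = sum(value[a] * pr for (a, x), pr in dist.items())
    exy = sum(x * value[a] * pr for (a, x), pr in dist.items())
    return exy - ex * ey

# Illustrative parameter values (assumptions, not taken from the paper).
p, q, D, mu, d, h = 0.4, 0.3, 0.05, 10.0, 1.0, 0.5
dist = joint_genotype_distribution(p, q, D)
cov_xy = marker_trait_covariance(dist, mu, d, h)
closed_form = 2 * D * (d + (1 - 2 * q) * h)
print(sum(dist.values()), cov_xy, closed_form)
```

Enumerating haplotype pairs rather than typing in the nine table entries keeps the check independent of the table itself.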

Regression analysis correcting effect of population structure. For phenotype of a quantitative trait and each of the test markers, we fitted the following model: the genotype $X_{ij}$ of individual $i$ at the given marker locus $j$ may be classified as one of three states: $X_{ij} = 0$, 1, or 2 for homozygous rare, heterozygous and homozygous common alleles, respectively. For this model, we fitted a linear regression of the form for each genetic marker:

$$Y_i = b_0 + b_1 X_{ij} + e_i \tag{1}$$

where $Y_i$ is phenotype for individual $i = 1, \ldots, n$, and $e_i$ are independent normally distributed random variables with mean 0 and variance $\sigma_e^2$. We have demonstrated that significance of the regression coefficient can be used to infer significance of LD between a polymorphic marker locus and a QTL in a single randomly mating population since the regression coefficient has a form of

$$b_1 = \frac{\sigma_{X,Y}}{\sigma_X^2} = \frac{E(XY) - E(X)E(Y)}{E(X^2) - E^2(X)} = \frac{2D_{TQ}[d + (1 - 2p_Q)h]}{2p_T(1 - p_T)} \tag{2}$$

[29]. However, in a structured population, we note that the LD between a marker and a QTL is given by

$$D_{TQ} = mD_{TQ}^{(1)} + (1-m)D_{TQ}^{(2)} + m(1-m)\delta_T\delta_Q \tag{3}$$

[30], where $m$ is the proportion of subpopulation 1 in the mixed sample, the superscripts (1) and (2) refer to the subpopulations, $\delta_T = p_T^{(1)} - p_T^{(2)}$ and $\delta_Q = p_Q^{(1)} - p_Q^{(2)}$. The covariance between the QTL and the test marker can be worked out as

$$\sigma_{X,Y} = 2mD_{TQ}^{(1)}(d + h - 2hp_Q^{(1)}) + 2(1-m)D_{TQ}^{(2)}(d + h - 2hp_Q^{(2)}) + 4m(1-m)\delta_T\delta_Q[d + h(1 - p_Q^{(1)} - p_Q^{(2)})] \tag{4}$$

Equations 3 and 4 show that the association between the QTL and

test marker in a mixed population is the summation of (i) a linear

combination of the associations between the two loci in each of the

subpopulations (i.e. the genuine association due to LD between the

two loci in each of the subpopulations), and (ii) a nonlinear

component of the differences in allele frequencies between the two

subpopulations (i.e. a spurious term of association). The objective

of our analysis is to remove the spurious term by using a control

marker ‘C’. If the control marker is in association neither with the QTL (i.e. $D_{CQ}^{(1)} = D_{CQ}^{(2)} = 0$) nor with the test marker ($D_{TC}^{(1)} = D_{TC}^{(2)} = 0$), then the covariance between control marker and QTL (or test marker) can be given by

$$\sigma_{Y,Z} = 4m(1-m)\delta_C\delta_Q[d + h(1 - p_Q^{(1)} - p_Q^{(2)})] \tag{5}$$

$$\sigma_{X,Z} = 4m(1-m)\delta_T\delta_C \tag{6}$$
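The split of equation (4) into a genuine and a spurious component can be verified against a first-principles calculation. The sketch below is illustrative only: the function names and parameter values are assumptions, and the population mean is set to zero without loss of generality. It compares the closed form with $E(XY) - E(X)E(Y)$ evaluated over the two-subpopulation mixture.

```python
def covariance_components(m, p_t, p_q, ld, d, h):
    """Genuine and spurious parts of sigma_{X,Y} as written in equation (4).

    p_t, p_q, ld hold the values for subpopulations 1 and 2.
    """
    delta_t = p_t[0] - p_t[1]
    delta_q = p_q[0] - p_q[1]
    genuine = (2 * m * ld[0] * (d + h - 2 * h * p_q[0])
               + 2 * (1 - m) * ld[1] * (d + h - 2 * h * p_q[1]))
    spurious = (4 * m * (1 - m) * delta_t * delta_q
                * (d + h * (1 - p_q[0] - p_q[1])))
    return genuine, spurious

def direct_covariance(m, p_t, p_q, ld, d, h):
    """E(XY) - E(X)E(Y) computed over the mixture of two subpopulations."""
    weights = (m, 1 - m)
    ex = [2 * p for p in p_t]                                  # E(X) within each
    ey = [d * (2 * q - 1) + 2 * q * (1 - q) * h for q in p_q]  # E(Y) within each
    cov_within = [2 * ld[i] * (d + (1 - 2 * p_q[i]) * h) for i in range(2)]
    exy = sum(w * (c + a * b) for w, c, a, b in zip(weights, cov_within, ex, ey))
    mean_x = sum(w * a for w, a in zip(weights, ex))
    mean_y = sum(w * b for w, b in zip(weights, ey))
    return exy - mean_x * mean_y

# Illustrative values: no LD inside either subpopulation, divergent frequencies.
args = dict(m=0.5, p_t=(0.8, 0.2), p_q=(0.7, 0.3), ld=(0.0, 0.0), d=1.0, h=0.5)
genuine, spurious = covariance_components(**args)
direct = direct_covariance(**args)
print(genuine, spurious, direct)
```

With these values the genuine part is zero, so the whole covariance seen in the mixed sample comes from the allele frequency differences.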

In an admixed population, the control marker’s allelic frequency is $p_C = mp_C^{(1)} + (1-m)p_C^{(2)}$. In a population with allelic frequency $p_C$ at the control marker locus, the expected and observed variances at the control marker are

$$E[\sigma_Z^2] = 2[mp_C^{(1)} + (1-m)p_C^{(2)}][1 - mp_C^{(1)} - (1-m)p_C^{(2)}] = 2p_C(1 - p_C) \tag{7}$$

$$\sigma_Z^2 = 2[mp_C^{(1)} + (1-m)p_C^{(2)}][1 - mp_C^{(1)} - (1-m)p_C^{(2)}] + 2m(1-m)\delta_C^2 \tag{8}$$

where $\delta_C = p_C^{(1)} - p_C^{(2)}$. Thus, the difference between the expected and observed variances at the control marker indicates the existence of population structure,

$$\sigma_Z^2 - E[\sigma_Z^2] = 2m(1-m)\delta_C^2 \tag{9}$$
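The extra term in equation (8) follows from the law of total variance. The derivation below is a sketch under the assumption of Hardy-Weinberg proportions within each subpopulation, so that $Z$ is binomial with two trials inside subpopulation $i$; the subpopulation label $S$ is a symbol introduced here only for the derivation.

```latex
\begin{align*}
\operatorname{Var}(Z) &= E[\operatorname{Var}(Z \mid S)] + \operatorname{Var}(E[Z \mid S]) \\
E[\operatorname{Var}(Z \mid S)] &= 2m\,p_C^{(1)}(1 - p_C^{(1)}) + 2(1-m)\,p_C^{(2)}(1 - p_C^{(2)})
  = 2p_C(1 - p_C) - 2m(1-m)\delta_C^2 \\
\operatorname{Var}(E[Z \mid S]) &= m(1-m)\,\bigl(2p_C^{(1)} - 2p_C^{(2)}\bigr)^2 = 4m(1-m)\delta_C^2 \\
\operatorname{Var}(Z) &= 2p_C(1 - p_C) + 2m(1-m)\delta_C^2
\end{align*}
```

As a worked case, $m = 0.5$, $p_C^{(1)} = 0.9$ and $p_C^{(2)} = 0.1$ give $p_C = 0.5$, an expected variance of 0.5, an observed variance of 0.82 and hence an excess of $2 \times 0.25 \times 0.64 = 0.32$.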

The spurious term in the covariance in equation (4) can be completely corrected using a single control marker, as follows:

$$\tilde{\sigma}_{X,Y} = \sigma_{X,Y} - \frac{\sigma_{X,Z}\,\sigma_{Y,Z}}{2\{\sigma_Z^2 - E[\sigma_Z^2]\}} = 2mD_{TQ}^{(1)}(d + h - 2hp_Q^{(1)}) + 2(1-m)D_{TQ}^{(2)}(d + h - 2hp_Q^{(2)}) \tag{10}$$

Therefore, the regression coefficient calculated from

$$b_1 = \frac{\tilde{\sigma}_{X,Y}}{\sigma_X^2} = \frac{\sigma_{X,Y} - \dfrac{\sigma_{X,Z}\,\sigma_{Y,Z}}{2\{\sigma_Z^2 - E[\sigma_Z^2]\}}}{\sigma_X^2} \tag{11}$$

would reflect correction for the population structure. Student’s t-test can be used to test for significance of the regression coefficient $b_1$. The standard error (se) of $b_1$ is given by

$$S_{b_1} = \sqrt{\frac{\sigma_X^2\sigma_Y^2 - \tilde{\sigma}_{X,Y}^2}{n\sigma_X^2}} \tag{12}$$
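To make the correction in equations (9) to (11) concrete, the sketch below applies it to a simulated admixed sample. It is a minimal illustration rather than the authors' implementation: the function names, sample size, allele frequencies, effect size and seed are all assumptions, and the three loci are drawn independently within each subpopulation so that no genuine LD exists.

```python
import random
from statistics import mean

def cov(u, v):
    """Sample covariance with an (n - 1) denominator."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def naive_slope(x, y):
    """Regression coefficient without any structure correction."""
    return cov(x, y) / cov(x, x)

def corrected_slope(x, y, z):
    """Regression coefficient after removing the spurious covariance via z."""
    p_c = mean(z) / 2                          # control-marker allele frequency
    excess = cov(z, z) - 2 * p_c * (1 - p_c)   # equation (9)
    s_xy = cov(x, y) - cov(x, z) * cov(y, z) / (2 * excess)   # equation (10)
    return s_xy / cov(x, x)                    # equation (11)

def simulate(n_per_pop, p_t, p_q, p_c, d=1.0, seed=1):
    """Two subpopulations, loci independent within each (no genuine LD)."""
    rng = random.Random(seed)

    def genotype(p):
        # Number of copies of the reference allele in a diploid individual.
        return (rng.random() < p) + (rng.random() < p)

    x, y, z = [], [], []
    for i in range(2):
        for _ in range(n_per_pop):
            a = genotype(p_q[i])
            x.append(genotype(p_t[i]))
            z.append(genotype(p_c[i]))
            y.append(d * (a - 1) + rng.gauss(0.0, 1.0))   # additive QTL, h = 0
    return x, y, z

x, y, z = simulate(5000, p_t=(0.8, 0.2), p_q=(0.8, 0.2), p_c=(0.9, 0.1))
naive = naive_slope(x, y)
corrected = corrected_slope(x, y, z)
print(naive, corrected)
```

Under these assumed parameters the uncorrected coefficient should sit near 0.36/0.68, while the corrected one should fluctuate around zero. The correction divides by the variance excess, so a control marker with little frequency divergence would make the estimate unstable.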

Given the regression coefficients and their variances, the power of the regression analysis can be predicted from the probability [31]

$$\rho_t = \Pr\{t_v(\delta_t) > t_{\alpha/2;v}\} \tag{13}$$

where $t_v(\delta_t)$ represents a random variable with non-central t-distribution with $v$ degrees of freedom and non-centrality parameter $\delta_t$, and $t_{\alpha/2;v}$ is the upper $\alpha/2$ point of a central t-variable with the same degrees of freedom. The value of $v$ equals $n - 3$ and the non-centrality parameter is given by [31] as

$$\delta_t = \frac{\Gamma[v/2]\,b_1}{\Gamma[(v-1)/2]\,S_{b_1}\sqrt{v/2}} \tag{14}$$

where $\Gamma(\cdot)$ stands for a gamma function.
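Equation (14) is straightforward to evaluate with log-gamma functions, which avoids overflow for large samples. The sketch below uses only the Python standard library; the function name and the example inputs are assumptions. Evaluating equation (13) itself would additionally need a non-central t distribution function, which the standard library does not provide, so the sketch stops at $\delta_t$.

```python
import math

def noncentrality(b1, se_b1, n):
    """Non-centrality parameter of equation (14) with v = n - 3."""
    v = n - 3
    # Gamma(v/2) / Gamma((v-1)/2), computed on the log scale for stability.
    log_ratio = math.lgamma(v / 2) - math.lgamma((v - 1) / 2)
    return math.exp(log_ratio) * b1 / (se_b1 * math.sqrt(v / 2))

# For large n the gamma factor approaches 1, so delta_t is close to b1 / se.
print(noncentrality(1.0, 0.25, 1003))
# For very small n the factor is well below 1.
print(noncentrality(1.0, 1.0, 5))
```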

Selection of the control marker. In practice, we propose the following procedure to select the control marker for a given test marker. Firstly, any marker but the test marker would be a candidate for the control marker if it has

- an autosomal location on a different chromosome from the test marker,
- less missing genotype data than a prior given proportion.


For each marker passing the above screening, one calculates the

expected and observed variances from

$$E[\sigma_Z^2] = 2p_C(1 - p_C) \tag{15}$$

$$\sigma_Z^2 = \sum_{i=1}^{n}(Z_i - \mu)^2/(n-1) \tag{16}$$

where $Z_i$ is the genotypic value of the candidate control marker (0, 1, 2) for individual $i = 1, \ldots, n$, and $\mu$ and $p_C$ are the mean genotypic value across all individuals ($\sum_{i=1}^{n} Z_i/n$) and the allelic frequency of this marker, respectively. It should be noted that equations (7) and (15) are the same and that equation (16) stands for the sampling variance of the control marker, whose expectation is given by equation (8) in the presence of population structure. The control marker is the one with the maximum difference between observed and expected variances, which has the maximum ability to remove the spurious term in mixed populations and does not introduce bias in a single population.
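The selection procedure above can be sketched in a few lines. This is an illustration under stated assumptions: the data layout (a dictionary of genotype lists with `None` for missing calls), the chromosome labels treated as non-autosomal, the 5% missingness threshold and all function names are hypothetical choices, not part of the original method description.

```python
from statistics import mean, variance

NON_AUTOSOMAL = {"X", "Y", "MT"}   # assumed chromosome labels

def variance_excess(z):
    """Observed minus expected variance, equations (16) and (15)."""
    p_c = mean(z) / 2
    return variance(z) - 2 * p_c * (1 - p_c)

def select_control_marker(genotypes, chromosome, test_marker, max_missing=0.05):
    """Return the screened marker with the largest variance excess, or None."""
    best, best_excess = None, float("-inf")
    for name, calls in genotypes.items():
        if name == test_marker:
            continue
        if chromosome[name] in NON_AUTOSOMAL:
            continue
        if chromosome[name] == chromosome[test_marker]:
            continue
        observed = [g for g in calls if g is not None]
        missing = 1 - len(observed) / len(calls)
        if missing >= max_missing or len(observed) < 2:
            continue
        excess = variance_excess(observed)
        if excess > best_excess:
            best, best_excess = name, excess
    return best

# Toy data: c1 is strongly differentiated, c2 is not, c3 shares the test
# marker's chromosome, c4 has too many missing calls, c5 is X-linked.
genotypes = {
    "t":  [2, 1] * 10,
    "c1": [2] * 10 + [0] * 10,
    "c2": [1] * 20,
    "c3": [2] * 10 + [0] * 10,
    "c4": [2] * 8 + [None] * 4 + [0] * 8,
    "c5": [2] * 10 + [0] * 10,
}
chromosome = {"t": "1", "c1": "2", "c2": "3", "c3": "1", "c4": "4", "c5": "X"}
print(select_control_marker(genotypes, chromosome, "t"))
```

`statistics.variance` uses the $n - 1$ denominator, matching equation (16).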

Method 2 (Regression analysis without correcting

population structure)

The method fits a simple regression model for detecting LD between the trait phenotype and a test marker as we proposed previously [29] and implemented in a recent population based eQTL analysis in [28], in which the regression coefficient has a form of

$$b_1 = \frac{\sigma^{*}_{X,Y}}{\sigma_X^2} \tag{17}$$

with a standard error equal to

$$S_{b_1} = \frac{\sigma_X^2\sigma_Y^2 - (\sigma^{*}_{X,Y})^2}{n\sigma_X^2} \tag{18}$$

where $\sigma^{*}_{X,Y}$ is the non-corrected covariance between test marker locus and the quantitative trait.

Method 3 (multiple regression analysis)

The method regresses the trait phenotype on genotypic value of a test marker ($X_{ij}$ = 0, 1, 2) and the probability of membership to each constituent population $P_i$ ($i$ = 1, 2 here) as described in the following multiple regression model

$$Y_i = b_0 + b_1 X_{ij} + b_2 P_i + e_i \tag{19}$$

where the $b_2 P_i$ term reflects the population structure effect in mixed populations.

The regression coefficients are given by

$$b_1 = \frac{\sigma_P^2\sigma_{X,Y} - \sigma_{X,P}\sigma_{P,Y}}{\sigma_X^2\sigma_P^2 - \sigma_{X,P}^2} \tag{20}$$

$$b_2 = \frac{\sigma_X^2\sigma_{P,Y} - \sigma_{X,P}\sigma_{X,Y}}{\sigma_X^2\sigma_P^2 - \sigma_{X,P}^2} \tag{21}$$

and standard errors of the regression coefficients are formulated as

$$S_{b_1} = \sqrt{\frac{\sigma_P^2\sigma_Y^2}{n\sigma_X^2\sigma_P^2 - \sigma_{X,P}^2}} \tag{22}$$

$$S_{b_2} = \sqrt{\frac{\sigma_X^2\sigma_Y^2}{n\sigma_X^2\sigma_P^2 - \sigma_{X,P}^2}} \tag{23}$$

according to [32]. Significance of association of the test marker with the quantitative trait can be tested through testing for significance of the regression coefficient $b_1$ by the Student t-test.
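Equations (20) and (21) are the usual two-predictor least-squares solution written in terms of sample moments. The sketch below is illustrative only: the function names and the toy data are assumptions. It shows that adjusting for a known membership variable recovers the marker effect, whereas the unadjusted slope absorbs the membership effect whenever marker genotype and membership are correlated.

```python
from statistics import mean

def cov(u, v):
    """Sample covariance with an (n - 1) denominator."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def partial_slopes(x, p, y):
    """Coefficients b1 and b2 of equations (20) and (21)."""
    sxx, spp, sxp = cov(x, x), cov(p, p), cov(x, p)
    sxy, spy = cov(x, y), cov(p, y)
    det = sxx * spp - sxp ** 2
    b1 = (spp * sxy - sxp * spy) / det
    b2 = (sxx * spy - sxp * sxy) / det
    return b1, b2

# Toy data: marker genotype x is correlated with membership p, and the
# phenotype is built exactly as y = 1 + 0.5 x + 2 p (no noise).
x = [2, 2, 1, 1, 0, 0]
p = [1, 1, 1, 0, 0, 0]
y = [1 + 0.5 * xi + 2 * pi for xi, pi in zip(x, p)]

b1, b2 = partial_slopes(x, p, y)
unadjusted = cov(x, y) / cov(x, x)
print(b1, b2, unadjusted)
```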

Results

Simulation study

To explore statistical properties and limitations of the methods described above, we developed and conducted a series of computer simulation studies. The simulation program mimics the segregation pattern of genes at multiple marker loci and QTL in randomly mating natural populations in terms of simulation parameters defining allele frequencies, linkage disequilibria and population structure, as illustrated in Table S1. The methods for simulating a population characterized by the joint genotypic distribution at two loci, and for sampling individuals from the simulated population, are detailed in [33]. Although the distribution involves only two loci, it is easy to extend to multiple loci because the two-locus joint distribution can be easily converted into the conditional (or transition) probability distribution of genotypes at one locus given those at another, and genotypes at multiple loci can be simulated as a Markov process governed by the conditional probability distribution. This does not undermine the flexibility to specify any required linkage disequilibrium pattern among any loci. Subpopulations were independently generated and merged to produce the admixed population. In the present study, we focused on 10 simulated populations defined by the simulation parameters listed in Table S1.
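The Markov construction described above can be sketched as follows. This is a simplified illustration, not the simulation program used in the study: it works at the haplotype level (a diploid genotype under random mating would be two independent draws), and the function names, allele frequencies, LD values and seed are assumptions.

```python
import random

def transition_probs(p_next, p_prev, ld):
    """Pr(allele 1 at the next locus | allele at the previous locus).

    These are the conditional probabilities Q and R of Table 1, applied
    to a pair of adjacent loci with LD coefficient ld.
    """
    given_1 = p_next + ld / p_prev
    given_0 = p_next - ld / (1 - p_prev)
    if not (0 <= given_1 <= 1 and 0 <= given_0 <= 1):
        raise ValueError("LD coefficient incompatible with allele frequencies")
    return given_1, given_0

def simulate_haplotype(freqs, lds, rng):
    """Draw one haplotype locus by locus as a first-order Markov chain.

    freqs[k] is the allele-1 frequency at locus k; lds[k] is the LD
    coefficient between loci k and k + 1.
    """
    hap = [int(rng.random() < freqs[0])]
    for k, ld in enumerate(lds):
        given_1, given_0 = transition_probs(freqs[k + 1], freqs[k], ld)
        prob = given_1 if hap[-1] == 1 else given_0
        hap.append(int(rng.random() < prob))
    return hap

rng = random.Random(0)
freqs, lds = [0.3, 0.4, 0.5], [0.05, 0.02]
sample = [simulate_haplotype(freqs, lds, rng) for _ in range(5)]
print(sample)
```

The transition probabilities preserve the marginal allele frequency at each locus and the specified LD between adjacent loci; LD between non-adjacent loci is then implied by the chain rather than set directly.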

Each simulation was repeated 100 times and the simulation data were analyzed using the three different methods described above. Table 2 tabulates the means and standard errors of the regression coefficients over the 100 replicates, together with the proportions of significant tests of the regression coefficients. It can be seen that Methods 1 and 2 predicted the regression coefficients adequately in all simulated populations, but Method 3 did so only when all individuals were allocated to their correct subpopulations. It should be stressed that the proportion of significant tests measures the rate of false positives when the test marker and QTL were in linkage equilibrium, such as in the first 4 simulated populations, whilst it provides an evaluation of the empirical statistical power for detecting the genetic association in populations 5 to 10. It is clear that the rate of false positives was properly controlled in association analysis with Method 1, and with Method 3 when all individuals were correctly allocated, and that LD between the test marker and QTL in populations 5–9 was detected as significant by these methods with high statistical power. In contrast, the simple regression analysis (Method 2) made a high proportion of false positive inferences of marker-QTL association when the LD was actually absent (populations 1–4) but failed to detect truly existing LD between the two loci (populations 5–9). The method is thus inappropriate for genetic association analysis when population structure is present. Performance of Method 3,


Table 2. Means and standard errors of regression coefficients ($b \pm se$) and proportions ($\rho$ or $\hat{\rho}$) of statistical tests for significance of the regression coefficients from three methods.

Column groups: M1, M2 and M3 denote Methods 1, 2 and 3; "sim." denotes simulated and "pred." denotes predicted values.

| Pop | $D_{TQ}$ | $D'_{TQ}$ | M1 sim. $b \pm se$ | M1 sim. $\hat{\rho}$ | M1 pred. $b$ | M1 pred. $\rho$ | M2 sim. $b \pm se$ | M2 sim. $\hat{\rho}$ | M2 pred. $b$ | M2 pred. $\rho$ | M3 sim. $b \pm se^{a}$ | M3 sim. $\hat{\rho}^{a}$ | M3 sim. $b \pm se^{b}$ | M3 sim. $\hat{\rho}^{b}$ | M3 pred. $b^{a}$ | M3 pred. $\rho^{a}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.04 | 0.00 | -0.078 ± 0.015 | 0.06 | 0.00 | 0.00 | 1.293 ± 0.006 | 0.98 | 1.278 | 1.00 | 0.006 ± 0.007 | 0.00 | 1.035 ± 0.006 | 0.84 | 0.00 | 0.00 |
| 2 | 0.04 | 0.00 | -0.087 ± 0.015 | 0.07 | 0.00 | 0.00 | 1.162 ± 0.006 | 0.97 | 1.163 | 0.98 | -0.008 ± 0.007 | 0.00 | 0.940 ± 0.007 | 0.74 | 0.00 | 0.00 |
| 3 | -0.09 | 0.00 | 0.015 ± 0.008 | 0.00 | 0.00 | 0.00 | -2.371 ± 0.005 | 1.00 | -2.368 | 1.00 | 0.006 ± 0.007 | 0.00 | -2.038 ± 0.006 | 1.00 | 0.00 | 0.00 |
| 4 | -0.09 | 0.00 | 0.005 ± 0.011 | 0.00 | 0.00 | 0.00 | -3.157 ± 0.007 | 1.00 | -3.157 | 1.00 | -0.007 ± 0.009 | 0.00 | -2.725 ± 0.008 | 1.00 | 0.00 | 0.00 |
| 5 | 0.02 | 0.05 | 0.965 ± 0.021 | 0.48 | 0.828 | 0.55 | -0.159 ± 0.007 | 0.00 | -0.166 | 0.00 | 0.997 ± 0.006 | 0.85 | 0.082 ± 0.007 | 0.00 | 0.994 | 0.91 |
| 6 | 0.04 | 0.07 | 1.086 ± 0.008 | 0.86 | 1.062 | 0.92 | 0.130 ± 0.007 | 0.00 | 0.125 | 0.00 | 1.280 ± 0.006 | 1.00 | 0.375 ± 0.007 | 0.01 | 1.274 | 1.00 |
| 7 | 0.05 | 0.08 | 1.341 ± 0.008 | 0.98 | 1.325 | 1.00 | 0.333 ± 0.007 | 0.01 | 0.331 | 0.01 | 1.593 ± 0.006 | 1.00 | 0.597 ± 0.007 | 0.14 | 1.59 | 1.00 |
| 8 | 0.05 | 0.08 | 1.260 ± 0.006 | 0.99 | 1.249 | 0.99 | 0.313 ± 0.007 | 0.01 | 0.312 | 0.01 | 1.503 ± 0.006 | 1.00 | 0.572 ± 0.007 | 0.13 | 1.499 | 1.00 |
| 9 | 0.04 | 0.08 | 1.307 ± 0.014 | 0.92 | 1.234 | 0.99 | -0.005 ± 0.006 | 0.00 | 0.00 | 0.00 | 1.698 ± 0.006 | 1.00 | 0.333 ± 0.007 | 0.02 | 1.704 | 1.00 |
| 10 | -0.04 | 0.00 | 0.008 ± 0.009 | 0.01 | 0.00 | 0.00 | -1.233 ± 0.006 | 0.99 | -1.234 | 0.99 | -0.003 ± 0.007 | 0.00 | -0.995 ± 0.007 | 0.80 | 0.00 | 0.00 |

$D_{TQ}$ and $D'_{TQ}$ are the coefficients of LD between the marker and QTL in the simulated mixed population before and after correction for population structure respectively.

a: predicted when all individuals were allocated to their correct subpopulations;

b: predicted when half of all individuals were correctly allocated to their subpopulations but the other half were randomly allocated to either of the two subpopulations. The predicted values were estimated from theoretical analysis, while the simulated values were estimated from the simulation studies.

doi:10.1371/journal.pone.0023192.t002
