Vol. 28 no. 19 2012, pages 2540–2542

doi:10.1093/bioinformatics/bts474

BIOINFORMATICS APPLICATIONS NOTE

Genetics and population analysis

Advance Access publication July 26, 2012

Estimation of pleiotropy between complex diseases using

single-nucleotide polymorphism-derived genomic relationships

and restricted maximum likelihood

S.H. Lee1,*, J. Yang2, M.E. Goddard3, P.M. Visscher1,2 and N.R. Wray1

1The University of Queensland, Queensland Brain Institute, Brisbane, QLD 4072, 2The University of Queensland
Diamantina Institute, Princess Alexandra Hospital, Brisbane, QLD 4102 and 3Department of Agriculture and
Food Systems, University of Melbourne, VIC 3010, Melbourne, Australia

Associate Editor: Jeffrey Barrett

ABSTRACT

Summary: Genetic correlations are the genome-wide aggregate

effects of causal variants affecting multiple traits. Traditionally, genetic

correlations between complex traits are estimated from pedigree stu-

dies, but such estimates can be confounded by shared environmental

factors. Moreover, for diseases, low prevalence rates imply that even if

the true genetic correlation between disorders was high, co-aggre-

gation of disorders in families might not occur or could not be distin-

guished from chance. We have developed and implemented statistical

methods based on linear mixed models to obtain unbiased estimates

of the genetic correlation between pairs of quantitative traits or pairs

of binary traits of complex diseases using population-based

case–control studies with genome-wide single-nucleotide polymorphism
data. The method is validated in a simulation study and applied to
estimate genetic correlation between various diseases from
Wellcome Trust Case Control Consortium data in a series of bivariate
analyses. We estimate a significant positive genetic correlation between
risk of Type 2 diabetes and hypertension of ~0.31 (SE 0.14,
P = 0.024).

Availability: Our methods, appropriate for both quantitative and binary
traits, are implemented in the freely available software GCTA
(http://www.complextraitgenomics.com/software/gcta/reml_bivar.html).

Contact: hong.lee@uq.edu.au

Supplementary Information: Supplementary data are available at
Bioinformatics online.

Received on May 20, 2012; revised on July 18, 2012; accepted on

July 20, 2012

1 INTRODUCTION

Recently, we have developed new methods to estimate the

proportion of variation in quantitative traits (Yang et al., 2010,

2011) or in liability to disease that is associated with single-

nucleotide polymorphisms (SNPs) (Lee et al., 2012, 2011). The

methods use very distant relationships between individuals so

that estimates are unlikely to be confounded with shared

family environment effects. The methodology can be extended

to estimation of the genetic covariance and hence genetic

correlation between different disorders that is tagged by SNPs

to provide estimates of genome-wide pleiotropy. Evidence for a

genetic correlation between disorders estimated directly by inter-

rogation of the genome could have an important impact on the

design of future genetic and functional studies for medical nos-

ology and may provide new insights for novel treatments across

disorders.

The aim of this study is to estimate genome-wide pleiotropy

using genome-wide association studies (GWAS) case–control

data for different diseases or disorders. For binary disease

traits, we derive valid statistical approaches to obtain unbiased

estimates of comorbidity interpretable on the scale of liability to

disease. We develop computationally efficient algorithms for

estimation. The method is applied to estimate the genetic correl-

ation between hypertension (HT) and type 2 diabetes (T2D),

bipolar disorder (BD) and rheumatoid arthritis (RA), BD

and T2D or HT and RA from Wellcome Trust Case Control

Consortium (WTCCC) GWAS data.

2 METHODS

2.1. Bivariate linear mixed model and efficient AIREML

We used a standard bivariate linear mixed model (Thompson, 1973). The
models can be written as

$$y_1 = X_1 b_1 + Z_1 g_1 + e_1 \ \text{for trait 1 and} \ y_2 = X_2 b_2 + Z_2 g_2 + e_2 \ \text{for trait 2,}$$

where $y_1$ and $y_2$ are vectors of observations for traits 1 and 2, $b_1$ and $b_2$ are vectors of fixed
effects, $g_1$ and $g_2$ are vectors of random polygenic effects for each individual
for traits 1 and 2, and $e_1$ and $e_2$ are residuals for traits 1 and 2,
respectively. X and Z are incidence matrices for the effects b and g,
respectively. The variance covariance matrix (V) is defined as

$$V = \begin{bmatrix}
Z_1 A Z_1' \sigma^2_{g_1} + I \sigma^2_{e_1} & Z_1 A Z_2' \sigma^2_{g_1 g_2} \\
Z_2 A Z_1' \sigma^2_{g_1 g_2} & Z_2 A Z_2' \sigma^2_{g_2} + I \sigma^2_{e_2}
\end{bmatrix},$$

where A is the genomic similarity relationship matrix based on SNP
information (Yang et al., 2010), I is an identity matrix, and $\sigma^2_g$, $\sigma^2_e$ and
$\sigma^2_{g_1 g_2}$ are the genetic variance, the residual variance and the covariance between
$g_1$ and $g_2$, respectively. Lee and Van der Werf (2006) showed that deriving the
average information (AI) matrix directly from V is much
more efficient computationally than the original AI algorithms (Gilmour
et al., 1995; Johnson and Thompson, 1995). Following equation (8) in
Lee and Van der Werf (2006), the AI matrix for the bivariate model can
be derived as

*To whom correspondence should be addressed.

© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

by guest on September 13, 2015

http://bioinformatics.oxfordjournals.org/

Downloaded from

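$$AI = \frac{1}{2}\begin{bmatrix}
y'PI_1PI_1Py & y'PI_1PI_2Py & y'PI_1PG_1Py & y'PI_1PG_2Py & y'PI_1PCPy \\
 & y'PI_2PI_2Py & y'PI_2PG_1Py & y'PI_2PG_2Py & y'PI_2PCPy \\
 & & y'PG_1PG_1Py & y'PG_1PG_2Py & y'PG_1PCPy \\
 & & & y'PG_2PG_2Py & y'PG_2PCPy \\
\text{sym.} & & & & y'PCPCPy
\end{bmatrix},$$

where $P = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}$, $y' = [y_1' \ y_2']$, $I_1 = \partial V/\partial\sigma^2_{e_1}$,
$I_2 = \partial V/\partial\sigma^2_{e_2}$, $G_1 = \partial V/\partial\sigma^2_{g_1}$, $G_2 = \partial V/\partial\sigma^2_{g_2}$ and $C = \partial V/\partial\sigma^2_{g_1 g_2}$.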
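As an illustration of the quantities defined in this section, the following is a minimal NumPy sketch that builds V and evaluates the AI matrix by dense linear algebra. It is not the GCTA implementation and does not reproduce the efficiency gains described by Lee and Van der Werf (2006). The function name `bivariate_ai`, the argument names, and the block-diagonal stacking of X1 and X2 into a single fixed-effect design matrix are assumptions made for the example.

```python
import numpy as np

def bivariate_ai(y1, y2, X1, X2, Z1, Z2, A,
                 var_g1, var_g2, cov_g12, var_e1, var_e2):
    """Return (V, AI) for the bivariate model at the given parameter values.

    Illustrative only: uses a dense inverse of V, which is feasible for
    small samples but not for genome-wide data sets.
    """
    n1, n2 = len(y1), len(y2)
    y = np.concatenate([y1, y2])
    # Assumption: fixed effects are trait-specific, so X is block-diagonal.
    X = np.block([[X1, np.zeros((n1, X2.shape[1]))],
                  [np.zeros((n2, X1.shape[1])), X2]])
    z12 = np.zeros((n1, n2))
    z11 = np.zeros((n1, n1))
    z22 = np.zeros((n2, n2))
    # Derivatives of V with respect to each variance component.
    I1 = np.block([[np.eye(n1), z12], [z12.T, z22]])
    I2 = np.block([[z11, z12], [z12.T, np.eye(n2)]])
    G1 = np.block([[Z1 @ A @ Z1.T, z12], [z12.T, z22]])
    G2 = np.block([[z11, z12], [z12.T, Z2 @ A @ Z2.T]])
    C = np.block([[z11, Z1 @ A @ Z2.T], [Z2 @ A @ Z1.T, z22]])
    V = (var_e1 * I1 + var_e2 * I2 + var_g1 * G1
         + var_g2 * G2 + cov_g12 * C)
    # P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    P = Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)
    Py = P @ y
    # AI[i, j] = 0.5 * y' P D_i P D_j P y, using symmetry of P and D_i.
    w = [D @ Py for D in (I1, I2, G1, G2, C)]
    AI = 0.5 * np.array([[wi @ P @ wj for wj in w] for wi in w])
    return V, AI

# Toy data: 8 individuals, trait 1 recorded on the first 4, trait 2 on the rest.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 50))      # stand-in for standardized genotypes
A = W @ W.T / 50                      # toy genomic relationship matrix
Z1, Z2 = np.eye(8)[:4], np.eye(8)[4:]
X1 = X2 = np.ones((4, 1))             # intercept only
y1, y2 = rng.standard_normal(4), rng.standard_normal(4)
V, AI = bivariate_ai(y1, y2, X1, X2, Z1, Z2, A, 0.5, 0.4, 0.1, 0.5, 0.6)
```

The rows and columns of `AI` follow the order (I1, I2, G1, G2, C) used in the matrix above. In an AI-REML iteration this matrix would be combined with the score vector to update the five variance components.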

2.2. Correlation on the scale of liability is approximately

the same as that on the observed risk scale

For disease traits when the y phenotype vectors contain only 1 for cases

and 0 for controls, a liability threshold model can be written to link

unobserved continuous liability to the observed discrete scale of disease

(Falconer, 1965)

$$l = g^* + e^* \qquad (1)$$

where l is a vector of liability phenotypes, distributed as N(0, 1)
in the population, g* is a vector of random additive genetic effects on the
liability scale, distributed as $N(0, \sigma^2_{g^*})$, and e* is a vector of random
residuals on the liability scale, distributed as $N(0, \sigma^2_{e^*})$. The probit link
function links liability to the probability of y = 1, and g* on the scale of
liability can be approximated by a linear function of g on the observed
0–1 scale (Dempster and Lerner, 1950). Using this linear approximation,
the correlation between two diseases is the same on both the observed and
liability scales (Gianola, 1982; Höschele et al., 1987). When samples are
ascertained (typical in case–control studies), the genetic value on the
observed scale can be defined with an ascertainment correcting factor
as (Lee et al., 2011)

$$g_{cc} \cong c + z\,\frac{P(1-P)}{K(1-K)}\,g^*, \qquad (2)$$

where $g_{cc}$ is the vector of genetic values on the observed scale in a case–control study,
c is a constant, K is the disease prevalence in the population, P is the
proportion of the sample that are cases and z is the height of the standard
normal probability density function at the threshold that truncates the proportion
K. From equation (2), the covariance between genetic values on the
observed scale with ascertained samples can be written as

$$\mathrm{cov}(g_{cc1}, g_{cc2}) \cong z_1\,\frac{P_1(1-P_1)}{K_1(1-K_1)}\; z_2\,\frac{P_2(1-P_2)}{K_2(1-K_2)}\;\mathrm{cov}(g_1^*, g_2^*). \qquad (3)$$

From equation (3) it is clear that even when samples are ascertained,
the correlation is the same on both observed and liability scales,

because an approximate linear relationship exists between the genetic

values on the different scales.
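The cancellation of the scaling factors can be checked numerically. The sketch below, which uses only the Python standard library, computes the factor zP(1 − P)/[K(1 − K)] from equation (2) for each trait and shows that the correlation is unchanged when the variances and covariance are rescaled. The function names and the numerical values for the variance components, K and P are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def ascertainment_factor(K, P):
    """Scaling factor z * P(1 - P) / (K(1 - K)) from equation (2)."""
    t = NormalDist().inv_cdf(1.0 - K)  # liability threshold for prevalence K
    z = NormalDist().pdf(t)            # height of the normal density at t
    return z * P * (1.0 - P) / (K * (1.0 - K))

def genetic_correlation(var_g1, var_g2, cov_g12):
    """Correlation implied by two variances and their covariance."""
    return cov_g12 / sqrt(var_g1 * var_g2)

# Hypothetical liability-scale genetic (co)variances.
var_g1, var_g2, cov_g12 = 0.5, 0.3, 0.12

# Hypothetical prevalence (K) and sample case proportion (P) per trait.
c1 = ascertainment_factor(K=0.05, P=0.5)
c2 = ascertainment_factor(K=0.10, P=0.4)

r_liability = genetic_correlation(var_g1, var_g2, cov_g12)
# Under the linear approximation, variances scale by c^2 and the
# covariance by c1 * c2, so the factors cancel in the ratio.
r_observed = genetic_correlation(c1**2 * var_g1, c2**2 * var_g2,
                                 c1 * c2 * cov_g12)
```

Here `r_liability` and `r_observed` agree up to floating-point error, whatever values of K and P are chosen, because the factors enter the numerator and denominator of the correlation identically.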

2.3. Simulation study

In order to confirm the derivation that the genetic correlation is approxi-

mately the same on both observed and liability scales when samples are

ascertained, we performed a simulation study. The simulation procedure

was similar to that in Lee et al. (2011) except that two traits were simu-

lated with a genetic correlation between them (described in the

Supplementary material).
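For readers who want a feel for this kind of simulation, the following is a minimal standard-library sketch of two binary traits generated under a liability threshold model with a specified genetic correlation. It is not the authors' procedure, which is described in the Supplementary material and simulates genotype data. In particular, this sketch draws genetic values directly rather than from SNPs, records both traits on the same individuals, and omits case–control ascertainment. The function name and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist

def simulate_two_traits(n, h2_1, h2_2, r_g, K1, K2, seed=1):
    """Simulate n individuals; return a list of (g1, g2, d1, d2).

    g1, g2: genetic values on the liability scale (variances h2_1, h2_2,
            correlation r_g); d1, d2: 0/1 disease status for each trait.
    """
    rng = random.Random(seed)
    t1 = NormalDist().inv_cdf(1.0 - K1)  # threshold for prevalence K1
    t2 = NormalDist().inv_cdf(1.0 - K2)  # threshold for prevalence K2
    records = []
    for _ in range(n):
        # Two standard normal deviates with correlation r_g.
        u = rng.gauss(0.0, 1.0)
        v = r_g * u + (1.0 - r_g**2) ** 0.5 * rng.gauss(0.0, 1.0)
        g1 = h2_1 ** 0.5 * u
        g2 = h2_2 ** 0.5 * v
        # Liability = genetic value + independent residual; total variance 1.
        l1 = g1 + rng.gauss(0.0, (1.0 - h2_1) ** 0.5)
        l2 = g2 + rng.gauss(0.0, (1.0 - h2_2) ** 0.5)
        records.append((g1, g2, int(l1 > t1), int(l2 > t2)))
    return records

records = simulate_two_traits(20000, h2_1=0.5, h2_2=0.5, r_g=0.4,
                              K1=0.1, K2=0.2)
```

A case–control sample could then be drawn from `records` by sampling cases and controls in the desired proportion P before fitting the bivariate model.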

2.4. Application to genome-wide genotype data

We applied our method to estimate genetic correlation between HT and

T2D, BD and RA, BD and T2D, or HT and RA using WTCCC GWAS

data (WTCCC, 2007), following stringent quality control (QC) as

described in the supplementary material. Since there are two control

groups in the WTCCC data, i.e. 1958 cohort controls and NBS controls,

we used 1958 cohort controls for the first trait, and NBS controls for the

second trait. In a confirmation study, we swapped the control groups i.e.

NBS for the first trait and 1958 cohort for the second trait.

We estimated a test statistic by dividing the square of the estimated

genetic correlation coefficient by its approximate sampling variance and

calculated a p-value from this test statistic assuming that it is distributed

as a chi-square with 1 degree of freedom.
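The test described above can be written in a few lines. The sketch below uses only the Python standard library and relies on the identity that the upper tail of a chi-square distribution with 1 degree of freedom at x equals erfc(sqrt(x/2)). The function name is hypothetical, and the inputs are the rounded estimate and standard error quoted in this paper, so the result only approximates the reported P-value, which was presumably computed from unrounded estimates.

```python
from math import erfc, sqrt

def wald_p_value(r_g, se):
    """Chi-square (1 df) test of r_g = 0 from an estimate and its SE.

    Returns (test statistic, p-value).
    """
    chi2 = (r_g / se) ** 2       # squared estimate over sampling variance
    p = erfc(sqrt(chi2 / 2.0))   # upper tail of chi-square with 1 df
    return chi2, p

# Rounded values for the HT and T2D analysis.
chi2, p = wald_p_value(0.31, 0.14)
```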

3 RESULTS

In simulations the estimated genetic correlation on the observed

scale was close to the true values when using various

combinations of true heritability and population prevalence

(Supplementary Table S1). This confirms that the estimated gen-

etic correlation is approximately the same on both observed and

liability scales [Equation (3)]. Previously we have shown that if

misdiagnosis occurs between the two disorders, then the expect-

ation of the estimate of the genetic correlation coefficient can be

non-zero even when the true genetic correlation is zero (Wray

et al., 2012).

The estimated genetic correlation between HT and T2D was

0.31 (SE = 0.14 and p-value = 0.023) (Supplementary Table S2),

indicating that genetic factors for HT and T2D are positively

correlated. However, estimates for genetic correlation between

BD and RA, BD and T2D, or HT and RA were not significantly

different from zero (Supplementary Table S2). In a confirmation

study switching control groups between the first and second trait

(Supplementary Table S2), the genetic correlation between

HT and T2D was 0.32 (SE = 0.14 and P = 0.024). Again, none
of the other analyses had significant genetic correlations
(Supplementary Table S2). None of the parameter estimates differed
significantly between our original and confirmation analyses.
We previously demonstrated that the application of our
stringent QC process resulted in an estimated genetic variance not
significantly different from zero when we conducted a dummy
case–control analysis using these two control sets but treating
one set as cases (Lee et al., 2011) ($h^2$ = 0.06, SE 0.11).

ACKNOWLEDGEMENTS

We thank the QBI IT team. S.H.L. acknowledges the use of the
Genetic Cluster Computer for carrying out part of the simulations.
The cluster is financially supported by the Netherlands Scientific
Organization (NWO 480-05-003). This study makes use of data

generated by the Wellcome Trust Case-Control Consortium. A

full list of the investigators who contributed to the generation of

the WTCCC data is available from www.wtccc.org.uk. Funding

for the WTCCC project was provided by the Wellcome Trust

under award 076113.

Funding: The Australian National Health and Medical Research

Council (613672, 613601, 613608 and 1011506), the Australian

Research Council (DP1093502 and FT0991360) and the US
National Institutes of Health (GM075091).

Conflict of Interest: none declared.

REFERENCES

Dempster,E.R. and Lerner,I.M. (1950) Heritability of threshold characters.

Genetics, 35, 212–236.

Falconer,D.S. (1965) The inheritance of liability to certain diseases, estimated from

the incidence among relatives. Ann. Hum. Genet., 29, 51–71.

Gianola,D. (1982) Theory and analysis of threshold characters. J. Anim. Sci., 54,

1079–1096.

Gilmour,A.R. et al. (1995) Average information REML: an efficient algorithm for

variance parameters estimation in linear mixed models. Biometrics, 51,

1440–1450.

Höschele,I. et al. (1987) Estimation of variance components with
quasi-continuous data using Bayesian methods. J. Anim. Breed. Genet., 104,
334–349.

Johnson,D.L. and Thompson,R. (1995) Restricted maximum likelihood estimation

of variance components for univariate animal models using sparse matrix

techniques and average information. J. Dairy Sci., 78, 449–456.

Lee,S.H. et al. (2012) Estimating the proportion of variation in susceptibility to

schizophrenia captured by common SNPs. Nat. Genet., 44, 247–250.

Lee,S.H. and Van der Werf,J.H.J. (2006) An efficient variance component

approach implementing an average information REML suitable for combined

LD and linkage mapping with a general complex pedigree. Genet. Sel. Evol., 38,

25–43.

Lee,S.H. et al. (2011) Estimating missing heritability for disease from genome-wide

association studies. Am. J. Hum. Genet., 88, 294–305.

Thompson,R. (1973) The estimation of variance and covariance components with

an application when records are subject to culling. Biometrics, 29, 527–550.

Wray,N.R. et al. (2012) Impact of diagnostic misclassification on estimation of

genetic correlations using genome-wide genotypes. Eur. J. Hum. Genet., 20,

668–674.

WTCCC (2007) Genome-wide association study of 14,000 cases of seven common

diseases and 3,000 shared controls. Nature, 447, 661–678.

Yang,J. et al. (2010) Common SNPs explain a large proportion of the heritability

for human height. Nat. Genet., 42, 565–569.

Yang,J. et al. (2011) GCTA: a tool for genome-wide complex trait analysis. Am. J.

Hum. Genet., 88, 76–82.
