
Am. J. Hum. Genet. 67:1515–1525, 2000


Family-Based Tests of Association in the Presence of Linkage

Stephen L. Lake,1 Deborah Blacker,2,3 and Nan M. Laird1

Departments of 1Biostatistics and 2Epidemiology, Harvard School of Public Health, and 3Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Harvard University, Boston

Linkage analysis may not provide the necessary resolution for identification of the genes underlying phenotypic variation. This is especially true for gene-mapping studies that focus on complex diseases that do not exhibit Mendelian inheritance patterns. One positional genomic strategy involves application of association methodology to areas of identified linkage. Detection of association in the presence of linkage localizes the gene(s) of interest to more-refined regions in the genome than is possible through linkage analysis alone. This strategy introduces a statistical complexity when family-based association tests are used: the marker genotypes among siblings are correlated in linked regions. Ignoring this correlation will compromise the size of the statistical hypothesis test, thus clouding the interpretation of test results. We present a method for computing the expectation of a wide range of association test statistics under the null hypothesis that there is linkage but no association. To standardize the test statistic, an empirical variance-covariance estimator that is robust to the sibling marker-genotype correlation is used. This method is widely applicable: any type of phenotypic measure or family configuration can be used. For example, we analyze a deletion in the A2M gene at the 5′ splice site of “exon II” of the bait region in Alzheimer disease (AD) discordant sibships. Since the A2M gene lies in a chromosomal region (chromosome 12p) that consistently has been linked to AD, association tests should be conducted under the null hypothesis that there is linkage but no association.

Introduction

Although linkage analysis has been applied successfully to the mapping of genes involved in the pathogenesis of diseases exhibiting Mendelian inheritance, its application in the setting of genetically complex diseases has been less fruitful (Risch and Merikangas 1996). With complex diseases, the resolution from linkage analysis is reduced, and extended segments of the genome containing large numbers of genes may be implicated in disease etiology (Hauser and Boehnke 1997; Roberts et al. 1999). Fine mapping of these linked regions may be accomplished through the use of allelic-association methods that are designed to jointly detect linkage and gametic-phase disequilibrium. Detecting association significantly refines the search for disease susceptibility genes, because linkage disequilibrium between a genetic marker and disease susceptibility polymorphisms is expected to exist only over relatively small genetic distances in most populations. The sequential approach of linkage-based genomic screening followed by dissection of linked regions with association methodology recently has been used to identify a susceptibility locus for human hypertension (Bray et al. 2000).

Received July 28, 2000; accepted for publication September 21, 2000; electronically published October 31, 2000.

Address for correspondence and reprints: Mr. Stephen Lake, Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115. E-mail: slake@hsph.harvard.edu

© 2001 by The American Society of Human Genetics. All rights reserved. 0002-9297/2000/6706-0017$02.00

Allelic association can be detected through traditional contingency-table analysis using cases and controls (Woolf 1955). Although straightforward to implement, tests based on this approach are sensitive to spurious association caused by population admixture (Ott 1989). Family-based association tests (FBATs) are a class of tests that utilize within- and between-family marker-inheritance patterns to test for association and that are safeguarded, by design, from confounding caused by admixture (Ewens and Spielman 1995). A widely used FBAT is the transmission/disequilibrium test (TDT; Terwilliger and Ott 1992; Spielman et al. 1993), which uses the marker genotypes of an affected child and those of his/her parents to test for association. FBATs have received much attention lately, with numerous extensions and generalizations of the TDT being proposed in the literature. Recently, Rabinowitz and Laird (2000) developed a unified approach to family-based association tests that puts tests of different genetic models, tests of different sampling designs, tests involving different disease phenotypes, tests with missing parents, and tests of different null hypotheses, all in the same framework. Algorithms for calculating the distribution of association test statistics for these many settings are also presented.

A distinction must be made between tests for linkage that use association methods and tests for association in the presence of linkage. Letting $\theta$ be the recombination parameter and $\delta$ be a measure of allelic association, the tests for linkage that use association methods have a composite null hypothesis (type I $H_0$) that can be expressed as $H_0\colon \delta = 0$ or $\theta = 1/2$. The null hypothesis for testing association in the presence of linkage (type II $H_0$) is $H_0\colon \delta = 0$ and $\theta < 1/2$. Both settings have the same alternative hypothesis, $H_a\colon \delta > 0$ and $\theta < 1/2$. Complications arise in tests addressing the type II $H_0$ setting, because sibling marker genotypes are correlated under $H_0$ (Martin et al. 1997; Lazzeroni and Lange 1998). Ignoring the correlation in the type II $H_0$ setting compromises the $\alpha$ level of the tests. In this article, we show that valid tests for association in the presence of linkage may be performed using the mean of the test statistic computed via the Rabinowitz-Laird (RL) algorithm for the type I $H_0$ setting and an empirical variance-covariance estimator that adjusts for the correlation among sibling marker genotypes. This provides a convenient means for testing allelic association in the presence of linkage that can be used with a wide range of test statistics and any pedigree configuration. For example, the nine strategies for testing the type I $H_0$ advocated by S. Horvath, X. Xu, and N. Laird (unpublished data), which include applications to binary, quantitative and time-to-onset phenotypes, can all be adapted to the type II $H_0$ setting with the method presented here. We note that in the biallelic setting and with a qualitative trait, the pedigree disequilibrium test (PDT; Martin et al. 2000c) is similar to the approach developed here.

As an illustration, we focus on the reported association between alleles of the A2M gene and late-onset Alzheimer disease. Blacker et al. (1998) reported a strong association between a deletion near the 5′ splice site of exon 18 of the A2M gene (A2M-18i) and AD in a sample of sibships from the National Institute of Mental Health (NIMH) Genetics Initiative (Blacker et al. 1997). During the course of the A2M association study, linkage to a nearby region on chromosome 12 was reported as part of a genome screen (Pericak-Vance et al. 1997). Subsequent linkage analyses revealed linkage peaks at or near the A2M gene (Rimmler et al. 1997; Rogaeva et al. 1998; Wu et al. 1998; Kehoe et al. 1999; Scott et al. 1999). The reported A2M association has been controversial, with further findings both confirmatory and nonconfirmatory (Dow et al. 1999; Rogaeva et al. 1999; Rudrasingham et al. 1999; Romas et al. 2000). In any case, A2M is useful as an illustration of association tests conducted in the presence of linkage. We use the NIMH data set, in which a strong A2M/AD association has been reported (Blacker et al. 1999), to illustrate our method.

FBATs

We assume that there are $N$ nuclear families, with $n_i$ children in each family. Let $m_{ij}$ be the marker genotype for the $j$th child in the $i$th family and $m_i$ be the vector of marker genotypes for the $n_i$ children in the $i$th family. In addition, the vector of parental marker genotypes will be denoted by $M_i$. Let $X(m_{ij})$ be an $h \times 1$ vector that codes for marker genotype. Depending on the coding scheme, $X(m_{ij})$ may be a scalar or a vector (see Schaid 1996; Laird et al. 2000; S. Horvath, X. Xu, and N. Laird [unpublished data]). Last, let $y_{ij}$ be the phenotype of the $j$th child in the $i$th family and $T(y_{ij})$ be some function of the phenotype. In what follows we will often abbreviate $X(m_{ij})$ with $X_{ij}$ and $T(y_{ij})$ with $T_{ij}$ and drop the subscript indicating family when dealing with data from only one family.

Association test statistics are constructed to detect correlation between genotype and phenotype. In this article, we restrict attention to the class of test statistics that can be expressed as

$$S = \sum_i S_i = \sum_i \sum_j T_{ij} X_{ij}, \tag{1}$$

where the summation is over all children in all families and $S_i$ is the contribution from the $i$th nuclear family, $i = 1,\ldots,N$. Test statistics in this general class constitute the majority of family-based association test statistics proposed in the literature, including tests in the multiallelic setting, tests using quantitative phenotypes, and tests that allow missing parental marker information (Laird et al. 2000; Rabinowitz and Laird 2000). For example, with simplex families, letting $T_{ij}$ be an indicator function for child disease status and $X_{ij}$ be the count of a particular marker allele, $S$ counts the total number of alleles in the affected child and is the same test statistic used in the TDT. Other types of test statistics are discussed in S. Horvath, X. Xu, and N. Laird (unpublished data).

Under the assumption that the $N$ families are unrelated, the distribution of the test statistic $S$ under $H_0$ depends on the distributions of the independent $S_i$, $i = 1,\ldots,N$. For the $i$th family, the general distribution of $S_i$ depends on the joint distribution of the observed children’s marker genotypes, children’s phenotypes, and parental marker genotypes $p(m_i, M_i, y_i)$. Under the type I $H_0$, $p(m_i, M_i, y_i)$ depends on allele frequencies and the genetic model; conditioning on the phenotypes and the parental genotypes eliminates these unknown nuisance parameters and makes the distribution of $S_i$ dependent only on the conditional distribution of the children’s marker genotypes (Lazzeroni and Lange 1998). When parental genotypes are unknown, the nuisance parameters can be eliminated by conditioning on the sufficient statistic for the parental genotypes $S(M)$, which is composed of the observed parental genotypes $M_{\mathrm{obs}}$ (when available) and the children’s genotype configuration $C_m$ (Rabinowitz and Laird 2000). The distribution under the type II $H_0$ is discussed in the next section.

Using the conditional distribution of the children’s marker genotypes, we take the approach of standardizing $S$ and using the large sample normal or $\chi^2$ approximation. In this case, the mean and variance of the $S_i$ are required. For the type I $H_0$, letting $\mathcal{F}_I = [S(M), y]$, S. Horvath, X. Xu, and N. Laird (unpublished data) show that $E(S_i \mid \mathcal{F}_I)$ can be computed with the univariate conditional distribution of the children’s marker genotype, and $\mathrm{Var}(S_i \mid \mathcal{F}_I)$ can be computed with the univariate and bivariate conditional distributions of the children’s marker genotypes, where $\mathrm{Var}(\cdot)$ refers to the variance-covariance matrix. That is, by using just the joint distributions of $(m_{ij}, m_{ik})$ (which, under the type I $H_0$, do not depend on $j$ and $k$), we can compute $\mathrm{Var}(S_i \mid \mathcal{F}_I)$. These distributions can be computed using the RL algorithm for the type I $H_0$.

Tests of Association in the Presence of Linkage

As discussed above, association tests performed in areas of known linkage may significantly refine gene-mapping studies. The challenge is that, among siblings, genetic markers that reside within linked regions are correlated even in the absence of association and after conditioning on $\mathcal{F}_I = [y, S(M)]$. The dependence exists because siblings with similar phenotypes are more likely to share the putative disease genes, even in the absence of allelic association. Linkage between a marker and the putative disease gene, therefore, induces positive correlation between the genetic markers of siblings with similar phenotypes. The opposite holds for siblings with disparate phenotypes. The correlation makes $p(m \mid \mathcal{F}_I)$ dependent on the recombination parameter and the genetic model for the phenotype.

Conditioning on the minimal sufficient statistic for $\theta$ and the phenotypes removes the dependence of the marker genotypes on $\theta$ and $y$ under the type II $H_0$. When the patterns of allele sharing among siblings can be unambiguously determined, they serve as the minimal sufficient statistic for $\theta$ (Rabinowitz and Laird 2000). With incomplete identification of the allele sharing patterns, the outcome space of the children’s marker genotypes given the minimal sufficient statistic under the type II $H_0$ may be computed using the RL algorithm (type II $H_0$ case). Therefore, under the type II $H_0$, the minimal sufficient statistic $\mathcal{F}_{II}$ consists of the minimal sufficient statistic for the recombination parameter $S(\theta)$, the minimal sufficient statistic for the parental marker genotypes $S(M)$, and the observed phenotypes $y$.

Since patterns of allele sharing are defined by the joint realization of sibling marker genotypes, the conditional outcome space consists of the various joint outcomes of sibling marker genotypes satisfying the constraints of the minimal sufficient statistic for the type II $H_0$ (Martin et al. 1997; Rabinowitz and Laird 2000). Therefore, after conditioning on $\mathcal{F}_{II}$, the convenient expression of $E(S_i \mid \mathcal{F}_{II})$ and $\mathrm{Var}(S_i \mid \mathcal{F}_{II})$, in terms of the univariate and bivariate conditional distribution of marker genotypes under the type I $H_0$, cannot be paralleled. Rather, under the type II $H_0$, expressions for $E(S_i \mid \mathcal{F}_{II})$ and $\mathrm{Var}(S_i \mid \mathcal{F}_{II})$ using the RL algorithm can be found with the multinomial distribution.

For a given family, assume that there are $p$ compatible realizations of the sibling marker genotypes, and let $r$ be a $p \times 1$ random vector, with the $k$th element being an indicator function that assumes the value 1, when the realization of the sibling marker genotypes corresponds to the $k$th element of the conditional outcome space, and 0 otherwise. The set of possible outcomes is given in tables 4–7 in Rabinowitz and Laird (2000) for nuclear families. Because, under the type II $H_0$ and conditional on $\mathcal{F}_{II}$, all outcomes are equally likely, with probability $1/p$, $r$ follows a multinomial distribution, with mean and variance given by

$$\mu_r = E(r \mid \mathcal{F}_{II}) = \frac{1}{p}\,1_p$$

and

$$\Sigma_r = \mathrm{Var}(r \mid \mathcal{F}_{II}) = \frac{1}{p}\left(I_p - \frac{1}{p}\,1_p 1_p'\right),$$

where $1_p$ is a $p \times 1$ vector of 1s and $I_p$ is a $p \times p$ dimensional identity matrix.

The moments of $S_i$ can be derived using the moments of $r$. Let $S_i^r$ be an $h \times p$ matrix with the $k$th column equal to $\sum_j T_{ij} X(m_{ij}^{(k)})$, where $m_i^{(k)} = (m_{i1}^{(k)},\ldots,m_{in_i}^{(k)})$ is the vector of sibling marker genotypes corresponding to the $k$th element of the conditional outcome space and $h$ is the length of the marker genotype coding vector $X$. The conditional mean and variance of $S_i$ are

$$\mu_{S_i} = E(S_i \mid \mathcal{F}_{II}) = S_i^r \mu_r$$

and

$$\Sigma_{S_i} = \mathrm{Var}(S_i \mid \mathcal{F}_{II}) = S_i^r\, \Sigma_r\, (S_i^r)'.$$

Under the type II $H_0$, the approximate distribution of $S - E(S \mid \mathcal{F}_{II})$ is $N_h\!\left(0, \sum_i \Sigma_{S_i}\right)$.
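As a worked illustration of the multinomial moments above, the following Python sketch computes the conditional mean and variance of one family's contribution for scalar coding (h = 1), so that the matrix of outcome values is a single row. It assumes the list of compatible outcomes has already been obtained (for example, from the tables in Rabinowitz and Laird 2000). The function name and the toy outcome list are hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple


def type2_moments(t: Sequence[int], outcomes: List[Sequence[int]]) -> Tuple[Fraction, Fraction]:
    """Mean and variance of S_i when the p listed outcomes are equally likely.

    t        -- T_ij for each child (scalar trait coding)
    outcomes -- one tuple of X codes per compatible joint sibling genotype
    """
    p = len(outcomes)
    # Row vector S_i^r: the value of S_i under each compatible outcome.
    s_r = [sum(Fraction(tj) * xj for tj, xj in zip(t, xs)) for xs in outcomes]
    # mu_r = (1/p) 1_p
    mu_r = [Fraction(1, p)] * p
    # Sigma_r = (1/p) * (I_p - (1/p) * 1_p 1_p')
    sigma_r = [[Fraction(1, p) * ((1 if k == l else 0) - Fraction(1, p))
                for l in range(p)] for k in range(p)]
    mean = sum(s * m for s, m in zip(s_r, mu_r))
    var = sum(s_r[k] * sigma_r[k][l] * s_r[l] for k in range(p) for l in range(p))
    return mean, var


# Hypothetical family: sib 1 affected (T = 1), sib 2 unaffected (T = 0), and
# three compatible joint outcomes written as allele counts (X_1, X_2).
mean, var = type2_moments([1, 0], [(2, 1), (1, 2), (1, 1)])
```

Exact fractions are used so that the result can be checked by hand: in the scalar case the variance equals the average of the squared outcome values minus the squared mean.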

Table 1

Nuclear Family Informativeness for Both Conditioning Approaches

                                        Family Informativeness
Parental        Children                            RL Algorithm,
Genotypes (a)   Configuration (b)   EV-FBAT         Type II H0
AA,AA           NA                  No              No
AA,AB           NA                  Yes             Yes
AA,BB           NA                  No              No
AB,AB           NA                  Yes             Yes
AA,?            {AA}                No              No
AA,?            {AA,AB}             Yes             Yes
AB,?            {AA}                No              No
AB,?            {AB}                No              No
AB,?            {AA,AB}             Yes             No when n > 2
AB,?            {AA,BB}             Yes             Yes
AB,?            {AA,AB,BB}          Yes             Yes
?,?             {AA}                No              No
?,?             {AB}                No              No
?,?             {AA,AB}             Yes             No when n > 2
?,?             {AA,BB}             Yes             Yes
?,?             {AA,AB,BB}          Yes             Yes

(a) ? = Not genotyped.
(b) NA = not applicable.

The last column of table 1 indicates which combinations of parental marker genotypes and children marker configurations are potentially informative in the biallelic setting with the RL algorithm applied to the type II $H_0$ setting. When parental data are missing (as is often the case for late-onset diseases), sibships with more than two sibs and $C_m = \{AA,AB\}$ or $C_m = \{BB,AB\}$ are not informative, because allele sharing cannot be discerned. The removal of these types of sibships may cause a substantial loss in the effective sample size, especially when one of the alleles is rare, because homozygotes of the rare allele will be infrequent. An alternative to conditioning on the allele sharing is to take advantage of the linear form of the test statistic (eq. [1]) and to use the RL algorithm for the type I $H_0$ to calculate the expectation, in conjunction with a robust variance-covariance estimator. The development of this approach follows.

Factorization of $p(m \mid \mathcal{F}_I)$ under Type II $H_0$

In view of the potentially severe loss of information caused by conditioning on sibling identical-by-descent (IBD) patterns, we here develop a method that employs the type I $H_0$ RL algorithm to compute $E(S_i \mid \mathcal{F}_I)$ and an empirical variance-covariance estimator that is robust to the correlation among the sibling marker genotypes. To show that $\sum_{i=1}^{N} [S_i - E(S_i \mid \mathcal{F}_I)]$ is a valid measure of association in the presence of linkage, we derive the marginal conditional distribution for the $k$th sibling marker genotype $p(m_k \mid \mathcal{F}_I)$ and show that this marginal distribution is the same under both the type I $H_0$ and the type II $H_0$ and does not depend on the recombination parameter $\theta$ or on the observed phenotypes $y$ for $k = 1,\ldots,n$ (see Appendix). Since the linear form of the test statistic (eq. [1]) permits its expectation to be found using $p(m_k \mid \mathcal{F}_I)$, the RL algorithm for the type I $H_0$ can be used to compute $E(S_i \mid \mathcal{F}_I)$. Therefore, without specification or estimation of $\theta$ and without parameterization of the phenotype distribution, $S - E(S \mid \mathcal{F}_I)$ can be used to construct an unbiased test for association in the presence of linkage. Since family-specific contributions comprise $S - E(S \mid \mathcal{F}_I)$, only the variances of these contributions are needed to compute $\mathrm{Var}[S - E(S \mid \mathcal{F}_I)]$; the correlation among children need not be addressed when finding $\mathrm{Var}[S_i - E(S_i \mid \mathcal{F}_I)]$.

The derivation in the Appendix employs an ordered notation similar to that of Thomson (1995), where $m_k^{*}$ is the marker genotype of the $k$th child, expressed in terms of the parental derived haplotypes (see Appendix). In particular, it is shown that under both the type I $H_0$ and the type II $H_0$, the joint conditional probability for a family can be factored into

$$\Pr(m \mid \mathcal{F}_I) = \sum_{M_u \in A} \left[ \sum_{m_k^{*} \in B} \Pr(m_{-k} \mid m_k^{*}, M, y) \times \Pr(m_k^{*}, M \mid S(M)) \right],$$

where $m_{-k}$ is the vector of sibling marker alleles with the $k$th sibling information omitted, $M_u$ is the unobserved parental marker genotypes, $A$ is the set of unobserved parental marker genotypes that coincide with $S(M)$, and $B$ corresponds to the set of paternal and maternal derived markers for parents with marker genotypes $M$ that result in the $k$th sibling’s observed marker genotype $m_k$. Marginalization of $\Pr(m \mid \mathcal{F}_I)$ with respect to $m_{-k}$ results in the marginal conditional probability for the $k$th sibling marker genotype with $\Pr(m_k \mid \mathcal{F}_I) = \Pr[m_k \mid S(M)]$. In addition, we show that $\Pr[m_k \mid S(M)]$ is not a function of $\theta$ and can be computed using the RL algorithm for the type I $H_0$. Although the factorization can be used to find the correct conditional expectation of the test statistic, it cannot be used to derive expressions for the covariance between sibling marker genotypes, because it marginalizes over the IBD relationships.

Since $S_i - E(S_i \mid \mathcal{F}_I)$ are independent mean 0 random vectors with unspecified variance-covariance matrices, we can apply the results of White (1980) to construct a robust variance-covariance estimator of $S - E(S \mid \mathcal{F}_I)$. Specifically, White (1980) addresses estimation of the variance-covariance matrix for estimated regression parameters in linear models with heteroscedastic errors. The test statistic $S - E(S \mid \mathcal{F}_I)$ can be couched as proportional to a vector of parameter estimates from a linear model and, therefore, the White empirical variance-covariance estimator, given by

$$\hat{\Sigma}_W = \widehat{\mathrm{Var}}\left\{ \sum_{i=1}^{N} [S_i - E(S_i \mid \mathcal{F}_I)] \right\} = \sum_{i=1}^{N} [S_i - E(S_i \mid \mathcal{F}_I)][S_i - E(S_i \mid \mathcal{F}_I)]', \tag{2}$$

provides a consistent estimate of the variance-covariance matrix of $S - E(S \mid \mathcal{F}_I)$. Alternatively, $\hat{\Sigma}_W$ can be derived using the results of Liang and Zeger (1986) on generalized estimating equations. When $S$ is vector-valued, $\hat{\Sigma}_W$ may not be full rank. In this case, the test statistic for the type II $H_0$ is $[S - E(S \mid \mathcal{F}_I)]'\, \hat{\Sigma}_W^{-}\, [S - E(S \mid \mathcal{F}_I)]$, where $\hat{\Sigma}_W^{-}$ is the generalized inverse of $\hat{\Sigma}_W$. It should be noted that the empirical variance-covariance estimator (2) reduces to a simple sum of squares for the biallelic case.

Extensions to more-complex pedigrees are straightforward. Assume that the $i$th pedigree can be split into $q_i$ nuclear families, for $i = 1,\ldots,F$, and let

$$S - E(S \mid \mathcal{F}_I) = \sum_{i=1}^{F} \sum_{j=1}^{q_i} [S_{ij} - E(S_{ij} \mid \mathcal{F}_I)],$$

where $S_{ij}$ is the test-statistic contribution from the $j$th nuclear family in the $i$th pedigree and $E(S_{ij} \mid \mathcal{F}_I)$ is computed using formulas by S. Horvath, X. Xu, and N. Laird (unpublished data). Although the contributions from nuclear families in the same pedigree are not independent, we can again appeal to White (1980) to construct a consistent estimate of the variance-covariance matrix of $S - E(S \mid \mathcal{F}_I)$:

$$\hat{\Sigma}_W = \sum_{i=1}^{F} \left\{ \sum_{j=1}^{q_i} [S_{ij} - E(S_{ij} \mid \mathcal{F}_I)] \right\} \left\{ \sum_{j=1}^{q_i} [S_{ij} - E(S_{ij} \mid \mathcal{F}_I)] \right\}'.$$

The advantage of the empirical variance-covariance approach is that more nuclear-family marker configurations are informative than is the case with the type II conditioning method. Table 1 indicates which nuclear family configurations are informative for the two approaches in the setting of a biallelic marker. In addition, since the conditioning is different for the two approaches, the expected values and variance-covariance terms are also not the same. We will refer to the empirical variance-covariance approach as “EV-FBAT.”

Table 2

A2M/Alzheimer Disease Association Test Results for Various Methods

                          No. of
                          Informative
Method                    Sibships      χ2       P
Type I RL algorithm       51            8.599    .0034
Type II RL algorithm      10            6.125    .0133
EV-FBAT                   44            8.631    .0033
Siegmund et al. (2000)    51            6.916    .0085
PDT                       50            8.387    .0038
SDT                       46            …        .0016
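For the biallelic case, where estimator (2) reduces to a sum of squares, the EV-FBAT calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-family expectations are taken as given inputs (in practice they would come from the RL algorithm for the type I null hypothesis, which is not implemented here), and the function names and toy numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def ev_fbat_biallelic(s: Sequence[float], e: Sequence[float]) -> Tuple[float, float]:
    """Chi-square statistic and P value from per-family S_i and E(S_i | F_I).

    s -- observed contribution S_i for each independent family
    e -- conditional expectation of S_i (assumed supplied by the caller)
    """
    if len(s) != len(e):
        raise ValueError("s and e must have one entry per family")
    u = [si - ei for si, ei in zip(s, e)]   # centered family contributions
    total = sum(u)                          # S - E(S | F_I)
    var_hat = sum(ui * ui for ui in u)      # empirical variance, eq. (2), h = 1
    if var_hat == 0:
        raise ValueError("no informative families")
    chi2 = total * total / var_hat
    return chi2, chi2_sf_1df(chi2)


# Hypothetical toy input: four families with made-up expectations.
chi2, p_value = ev_fbat_biallelic([2, 1, 3, 0], [1.0, 1.0, 1.5, 0.5])
```

Families whose observed and expected contributions coincide add nothing to either the numerator or the variance, which is why they are uninformative. For complex pedigrees, the same function could be applied after summing the centered contributions within each pedigree.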

Example: Testing for Association in the A2M Gene

As an example, we tested for association between the

A2M-18i deletion and AD in a set of sibships from the

National Institute of Mental Health (NIMH) Genetics

Initiative AD Sample. The ascertainment and assessment

of the AD families collected have been discussed else-

where (Blacker et al. 1997). The sample we used is com-

posed of 437 individuals in 120 sibships and is identical

to the sample analyzed by Blacker et al. (1999); 246 of

the siblings met the NINCDS/ADRDA criteria for AD

and/or had autopsy confirmation of the diagnosis.

Table 2 contains the results for testing the A2M-18i/AD association. The test statistic used in the applications of the RL algorithm is the sum of the A2M-1 alleles in AD-affected siblings. This corresponds to the following coding schemes:

$$T_{ij} = \begin{cases} 1 & \text{if sibling } j \text{ in the } i\text{th sibship is affected} \\ 0 & \text{otherwise} \end{cases}$$

and

$$X_{ij} = \begin{cases} 2 & \text{if } m_{ij} = \text{A2M-1/A2M-1} \\ 1 & \text{if } m_{ij} = \text{A2M-1/A2M-2} \\ 0 & \text{otherwise.} \end{cases}$$

Implementation of the RL algorithm consists of finding the expected value of $X_{ij}$ conditional on the minimal sufficient statistic corresponding to the null hypothesis. Variance estimation is accomplished through the procedures described above.
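The coding scheme above can be written as two small Python functions. The genotype string format ("A2M-1/A2M-2") and the function names are assumptions made for this sketch. The article does not specify a data format.

```python
from typing import List, Tuple


def code_trait(affected: bool) -> int:
    """T_ij: 1 if the sibling is affected, 0 otherwise."""
    return 1 if affected else 0


def code_genotype(genotype: str, allele: str = "A2M-1") -> int:
    """X_ij: number of copies of `allele` in a genotype such as 'A2M-1/A2M-2'."""
    return genotype.split("/").count(allele)


def sibship_statistic(sibs: List[Tuple[bool, str]]) -> int:
    """S_i: A2M-1 allele count summed over the affected siblings of one sibship."""
    return sum(code_trait(aff) * code_genotype(gt) for aff, gt in sibs)


# Hypothetical sibship: two affected sibs and one unaffected sib.
example = [(True, "A2M-1/A2M-1"), (True, "A2M-1/A2M-2"), (False, "A2M-2/A2M-2")]
s_i = sibship_statistic(example)
```

Counting alleles from the split string keeps the function independent of allele order within the genotype.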

Application of the RL algorithm to test for linkage and association (type I $H_0$) results in 51 informative sibships and a significant finding. As discussed above, the type I $H_0$ may not be appropriate in view of the reported linkage evidence in the region spanning the A2M gene. Conditioning on the type II $H_0$ minimal sufficient statistic results in a dramatic decrease in the effective sample size. With only 10 informative sibships, the test statistic is only marginally significant, and its large sample $\chi^2$ approximation may not be reliable (ta-