# Conditional likelihood score functions for mixed models in linkage analysis.

**ABSTRACT** In this paper, we develop a general strategy for linkage analysis, applicable for arbitrary pedigree structures and genetic models with one major gene, polygenes and shared environmental effects. Extending work of Whittemore (1996), McPeek (1999) and Hössjer (2003d), the efficient score statistic is computed from a conditional likelihood of marker data given phenotypes. The resulting semiparametric linkage analysis is very similar to nonparametric linkage based on affected individuals. The efficient score $S$ depends not only on identical-by-descent sharing and phenotypes, but also on a few parameters chosen by the user. We focus on (1) weak penetrance models, where the major gene has a small effect and (2) rare disease models, where the major gene has a possibly strong effect but the disease-causing allele is rare. We illustrate our results for a large class of genetic models with a multivariate Gaussian liability. This class incorporates one major gene, polygenes and shared environmental effects in the liability, and allows e.g. binary, Gaussian, Poisson distributed and life-length phenotypes. A detailed simulation study is conducted for Gaussian phenotypes. The performance of the two optimal score functions $S_{\text{wpairs}}$ and $S_{\text{normdom}}$ is investigated. The conclusion is that (i) inclusion of polygenic effects into the score function increases overall performance for a wide range of genetic models and (ii) score functions based on the rare disease assumption are slightly more powerful.


Biostatistics (2005), 6, 2, pp. 313–332

doi:10.1093/biostatistics/kxi012


OLA HÖSSJER

Department of Mathematics, Stockholm University, S-106 91 Stockholm, Sweden

ola@math.su.se


Keywords: Conditional likelihood; Founder alleles; Linkage analysis; Mixed model; Score functions.

1. INTRODUCTION

The goal of linkage analysis is to find the position (locus) along the genome of a gene which causes or increases the risk of development of a certain inheritable disease. Disease-related quantities, so-called phenotypes, and DNA samples are collected for a number of families with aggregation of the disease. The DNA samples are typed at a large number of genetic markers, distributed throughout the genome. Using information from the markers, loci are sought at which segregation of DNA is correlated with the inheritance pattern of the disease. Since each individual's phenotype is a blurred observation of the two copies of the disease gene (disease alleles) that he/she carries, DNA transmission is correlated with phenotype segregation in close vicinity of the disease locus.

If genetic model parameters (disease allele frequencies and penetrance parameters) are known, the lod score of Morton (1955) can be computed at each locus. For complex diseases, the genetic model is rarely known and alternative procedures have been proposed. One possibility is to regard the genetic model parameters as nuisance parameters and optimize the lod score with respect to them at each locus. This is the mod score approach of Risch (1984) and Clerget-Darpoux et al. (1986).

© The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oupjournals.org.

Downloaded from biostatistics.oxfordjournals.org at University of Portland on May 22, 2011


For binary traits, nonparametric linkage (NPL) is another method developed in situations when the genetic model parameters are unknown. The affected-pedigree-member (APM) method is the most commonly used version of NPL. A score function $S$ is used which does not require specification of a genetic model. Instead, it quantifies the extent to which the APMs share their alleles identical-by-descent (IBD) from the same founder alleles; see e.g. Penrose (1935), Weeks and Lange (1988), Fimmers et al. (1989), Whittemore and Halpern (1994) and Kruglyak et al. (1996).

The NPL method can be extended to more general phenotypes by first noting that the lod score is equivalent to a conditional likelihood $L$ of DNA marker data (MD) given phenotypes (Whittemore, 1996). By differentiating $\log L$ with respect to genetic model parameters a score function $S$ is obtained. Whittemore (1996) showed that allele sharing score functions $S$ used in NPL can be used to construct an appropriate $L$. The opposite route is to start from $L$, based on a biologically based genetic model with disease allele frequencies and penetrance parameters, and then to compute $S$. McPeek (1999) showed, for binary traits, arbitrary pedigree structures and affected-only phenotypes, that several allele sharing score functions $S$ could be derived in this way. McPeek's work was generalized in Hössjer (2003d) to arbitrary monogenic models, e.g. Gaussian, Cox proportional hazards and logistic regression models. The resulting score functions $S$ depend on IBD allele sharing in the pedigree, the observed phenotypes and a small number of parameters which have to be specified by the user. This approach was referred to as semiparametric linkage (SPL) in Hössjer (2003d).

The purpose of this paper is to extend the work of McPeek (1999) and Hössjer (2003d) to mixed genetic models with one major susceptibility locus and polygenes/shared environmental effects. In Section 2, we define basic genetic concepts. In Section 3, we provide a general formulation of the efficient score function approach. In Section 4, we define a broad class of genetic models which includes several existing models based on Gaussian, binary, life-length or Poisson phenotypes as special cases. In Section 5, we derive the efficient score functions $S$ for weak penetrance models, where the major gene has a small effect on the disease, and rare disease models, where the disease allele is rare. A simulation study in Section 6 for the Gaussian mixed model reveals that more powerful and robust procedures are obtained when polygenic effects are included in the score function $S$. Some conclusions and further recommendations are given in Section 7. In Section A of supplementary data available at Biostatistics online, henceforth denoted HS, we show that the SPL approach is asymptotically equivalent to mod scores under the null hypothesis of no linkage. In Section B of HS, we derive an orthogonal decomposition of functions of several genotypes. This is used in Section 5.1 to derive weak penetrance optimal score functions, but we believe it to be of independent interest. For instance, the classical variance components decomposition of genetic variance (Fisher, 1918; Kempthorne, 1955) can be obtained from these expansions. Finally, the proofs are collected in Section C of HS.

2. BASIC GENETIC CONCEPTS

Consider a pedigree with $n$ individuals of which $f$ are founders (without ancestors in the pedigree) and $n - f$ nonfounders. Assume that two forms of the disease gene exist: the normal allele (0) and the disease allele (1). Each individual has a pair of alleles (genotype), of which one is inherited from the father and one from the mother. The genotypes of all pedigree members can be collected into a vector $G = (G_1,\ldots,G_n) = (a_1,\ldots,a_{2n})$, where $G_k = (a_{2k-1}, a_{2k})$ is the genotype of the $k$th individual, with $a_{2k-1}$ and $a_{2k}$ the paternally and maternally transmitted alleles, respectively. We will use the convention that founders are numbered $k = 1,\ldots,f$. Since the gene of interest is unknown, we do not observe $G$, but rather a vector of disease phenotypes $Y = (Y_1,\ldots,Y_n)$. Here $Y_k$ is the phenotype of the $k$th individual, a quantity related to the disease. This could be binary (affected/unaffected) or a quantitative variable such as insulin concentration, body mass index, etc. Some $Y_k$ may also represent unknown phenotypes.


Alleles are transmitted from parents to children according to Mendel's law of segregation. At a certain locus $t$, allele transmission can be summarized through the inheritance vector, introduced by Donnelly (1983). It is defined as $v(t) = (v_1(t),\ldots,v_m(t))$, where $m = 2(n - f)$ is the number of meioses and $v_k(t)$ equals 0 or 1 depending on whether a grandpaternal or grandmaternal allele was transmitted during the $k$th meiosis. A priori, grandpaternal and grandmaternal alleles are equally likely to be transmitted during each meiosis, so that

$$P(v(t) = w) = 2^{-m} \tag{2.1}$$

for all binary vectors $w$ of length $m$.

If $\tau$ is the disease locus, the phenotype vector $Y$ will be correlated to $v(\tau)$ if the genetic component is strong enough. It is standard in linkage analysis to look at the conditional distribution

$$w \longrightarrow P(v(\tau) = w \mid Y), \tag{2.2}$$

which differs from the prior (2.1). Moreover, (2.2) is unaffected by the way we sampled the pedigree (ascertainment). The stronger the genetic component at $\tau$ is, the stronger is the discrepancy between $P(v(\tau) \mid Y)$ and the uniform distribution, because of co-inheritance of phenotypes and DNA at $\tau$. Even at loci around $\tau$, the conditional distribution of the inheritance vector given phenotypes differs from (2.1), although the amount of co-inheritance decays with the genetic distance from the disease locus. This is because of the occurrence of so-called crossovers: random points along the chromosome where, during meioses, segregation switches between grandmaternal and grandpaternal transmission.
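The uniform prior (2.1) can be illustrated by brute-force enumeration. The sketch below assumes the simplest pedigree of interest, two siblings with both parents as founders ($n = 4$, $f = 2$, $m = 4$), and an ordering of the meioses chosen here purely for illustration; the function name `sib_pair_ibd` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def sib_pair_ibd(w):
    """Number of alleles shared IBD by two siblings.

    w = (pat1, mat1, pat2, mat2): for each sibling, whether the father and
    the mother transmitted a grandpaternal (0) or grandmaternal (1) allele.
    """
    pat1, mat1, pat2, mat2 = w
    return int(pat1 == pat2) + int(mat1 == mat2)


m = 4  # m = 2(n - f) meioses for n = 4 individuals and f = 2 founders
vectors = list(product((0, 1), repeat=m))
prior = Fraction(1, 2 ** m)  # P(v(t) = w) = 2^{-m}, equation (2.1)

# Distribution of IBD sharing induced by the uniform prior
ibd_dist = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
for w in vectors:
    ibd_dist[sib_pair_ibd(w)] += prior

print(ibd_dist)
```

Under (2.1) the siblings share 0, 1 or 2 alleles IBD with probabilities 1/4, 1/2 and 1/4; linkage shows up as a departure from these values once we condition on $Y$ as in (2.2).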

In practice we do not observe the process $v(\cdot)$, but MD from all or a subset of the pedigree members give incomplete information about it. As we will see in the next section, up to a multiplicative constant the conditional probability $P_{t,\theta}(\mathrm{MD} \mid Y)$ serves as an approximation of (2.2). Here, $\theta$ is the set of genetic model parameters (disease allele frequencies and penetrance parameters) and $t$ corresponds to the hypothesis $t = \tau$.

3. CONDITIONAL LIKELIHOOD AND SCORE FUNCTIONS

Consider a chromosome of length $t_{\max}$ centimorgans (cM) with at most one disease locus $\tau$ located along $[0, t_{\max}]$. The hypothesis testing problem is

$$H_0: \tau \text{ is located on another chromosome}, \qquad H_1: \tau \in [0, t_{\max}].$$

The unconditional likelihood requires knowledge of MD, $Y$ and the sampling scheme. Since the latter is often (more or less) unknown, it is common in linkage analysis to consider the conditional likelihood of MD given $Y$. The reason is that nuisance parameters involved in the ascertainment scheme are not included in the conditional likelihood, see e.g. Ewens and Shute (1986). If $v = v(\tau)$ is the inheritance vector at the disease locus, the conditional likelihood can be written as

$$P_{t,\theta}(\mathrm{MD} \mid Y) = \sum_v P_t(\mathrm{MD} \mid v) P_\theta(v \mid Y) = 2^m P(\mathrm{MD}) \sum_v P_t(v \mid \mathrm{MD}) P_\theta(v \mid Y), \tag{3.1}$$

where $\sum_v P_t(\mathrm{MD} \mid v) P_\theta(v \mid Y)$ is short for $\sum_w P_t(\mathrm{MD} \mid v = w) P_\theta(v = w \mid Y)$. In the last equality of (3.1) we applied Bayes' rule and $P(v) = 2^{-m}$. The factor $P(\mathrm{MD})$ depends on marker allele frequencies and genetic distances between the markers. These are assumed to be known in linkage analysis. Since $2^m P(\mathrm{MD})$ is independent of $t$ and $\theta$, it is a fixed constant which we drop. Hence,

$$L(t,\theta; \mathrm{MD}) = \sum_v P_t(v \mid \mathrm{MD}) P_\theta(v \mid Y) = E_t(P_\theta(v \mid Y) \mid \mathrm{MD}) \tag{3.2}$$


is our conditional likelihood. Notice that we only include MD as an argument of $L$, because we condition on the phenotype vector $Y$ and consider it as fixed. Kruglyak et al. (1996) noticed that the conditional inheritance distribution $P_t(v \mid \mathrm{MD})$ could be used both for NPL and parametric linkage analysis based on lod scores.
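As a minimal sketch of how (3.2) combines the two conditional distributions, the following code evaluates the sum over inheritance vectors for a toy case with $m = 2$. The dictionaries `p_v_given_md` and `p_v_given_y`, and all numerical values in them, are hypothetical inputs made up for illustration; in practice $P_t(v \mid \mathrm{MD})$ would come from a multipoint algorithm and $P_\theta(v \mid Y)$ from the genetic model.

```python
from itertools import product


def conditional_likelihood(p_v_given_md, p_v_given_y):
    """L(t, theta; MD) = sum_v P_t(v | MD) * P_theta(v | Y), equation (3.2)."""
    return sum(p_v_given_md[w] * p_v_given_y[w] for w in p_v_given_md)


m = 2
vectors = list(product((0, 1), repeat=m))

# Uniform distribution 2^{-m}: no information about v
uniform = {w: 2.0 ** -m for w in vectors}

# Hypothetical phenotype-based distribution, tilted toward w = (0, 0)
p_v_given_y = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.2}

# Hypothetical fully informative markers that pin down v = (0, 0)
p_v_given_md = {w: (1.0 if w == (0, 0) else 0.0) for w in vectors}

L_informative = conditional_likelihood(p_v_given_md, p_v_given_y)
L_uninformative = conditional_likelihood(uniform, p_v_given_y)
L_no_effect = conditional_likelihood(p_v_given_md, uniform)
```

When either the markers are uninformative or the phenotypes carry no information about $v$, the likelihood reduces to the constant $2^{-m}$, so only the joint departure from uniformity contributes to the evidence for linkage.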

In linkage analysis, $t$ is the structural parameter of interest, whereas $\theta$ contains nuisance parameters. The lod score was originally formulated as the base ten logarithm of a likelihood ratio, but it is a linear transformation of, and hence equivalent to, $Z(t) = Z(t; \theta) = \log L(t,\theta)$, as noted by Whittemore (1996). We regard $Z(t)$ as a local test statistic for testing $H_0$ against the simple alternative hypothesis $\tau = t$. When testing $H_0$ against $H_1$, the global test statistic

$$Z_{\max} = \sup_{t \in \Omega} Z(t) \tag{3.3}$$

is used instead, where $\Omega = [0, t_{\max}]$ or, more generally, $\Omega$ may consist of several chromosomes. The null hypothesis is rejected when $Z_{\max}$ exceeds a given threshold $T$, which depends on the chosen significance level, $\Omega$ and $\theta$.

When $\theta$ is unknown, the profile conditional likelihood $Z(t; \hat\theta(t)) = \sup_\theta Z(t; \theta)$ can be computed at each $t$ and $H_0$ is rejected when $\max_t Z(t; \hat\theta(t))$ exceeds a given threshold. This procedure is equivalent to mod scores. The mod score is computationally demanding, especially for larger pedigrees, although a faster version is obtained by replacing the original $L(t,\theta)$, based on penetrance and disease allele parameters, by a simplified one with only one parameter, see Whittemore (1996) and Kong and Cox (1997).

For quantitative phenotypes, it is common to use variance components techniques, see e.g. Amos (1994) and Almasy and Blangero (1998). These can be interpreted as an approximate version of the mod score, where the original $L(t,\theta)$, based on penetrance and disease allele parameters, has been replaced by a simplified one, which is a ratio of two multivariate normal densities.

In this paper, we will not maximize with respect to $\theta$ at each locus. Instead, we take a local approach and assume that $\{\theta_\varepsilon\}$ is a one-dimensional trajectory of genetic model parameters such that $\theta_0$ corresponds to no genetic effect at the disease locus, i.e. $P_{\theta_0}(v \mid Y) = 2^{-m}$. In other words, under $\theta_0$ the prior distribution of $v$ equals the posterior distribution of $v \mid Y$. Our objective is to compute a likelihood score function by differentiating the log conditional likelihood w.r.t. $\varepsilon$ at each locus $t$.

Assume first complete MD. It corresponds to an infinitely dense set of markers genotyped for all pedigree members so that $P(v = v(t) \mid \mathrm{MD})$ may be determined unambiguously at all loci. We let $\mathrm{MD}_{\text{compl}}$ denote complete MD. The corresponding conditional likelihood is

$$L(t,\theta; \mathrm{MD}_{\text{compl}}) = P_\theta(v = v(t) \mid Y). \tag{3.4}$$

Define the score function

$$S(v) = \left.\frac{d^\rho \log P_{\theta_\varepsilon}(v \mid Y)}{d\varepsilon^\rho}\right|_{\varepsilon=0}, \tag{3.5}$$

where $\rho$ is the smallest positive integer such that the right-hand side of (3.5) is nonzero for at least one $v$. When $\rho \ge 2$ the estimation problem is singular (with zero Fisher information) for $\varepsilon$ at $\varepsilon = 0$. By means of the reparametrization

$$\eta = \varepsilon^\rho/\rho!, \tag{3.6}$$

$S$ is interpreted as a (conditional) likelihood score function for $\eta$ at $\eta = 0$. Originally, Whittemore (1996) used $\rho = 1$ in her definition of $S$, but $\rho = 2$ is also possible for weak penetrance models, see McPeek (1999) and Hössjer (2003d).

Our goal is to test $\varepsilon = 0$ against $\varepsilon \ne 0$ at each locus $t$, which can also be interpreted as testing $H_0$ against $\tau = t$ at each $t$. Depending on the application, there may or may not be a sign constraint on $\varepsilon$. In any case, the sign of $\varepsilon$ is not of interest, and for this reason we use $\eta$ instead of $\varepsilon$ as parameter when


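formulating a test statistic, also when $\rho$ is even. The Fisher information for $\eta$ is $I_{\text{compl}} = E_{\theta_0}(S^2(v) \mid Y) = 2^{-m} \sum_w S^2(w)$ at $\eta = 0$, with the sum ranging over all binary vectors $w$ of length $m$. The square root of the likelihood score statistic for testing $\eta = 0$ against $\eta \ne 0$ is $W_{\text{compl}}(t) = S(v(t))/\sqrt{I_{\text{compl}}}$ at locus $t$.

For incomplete MD, the observed conditional likelihood $L(t,\theta; \mathrm{MD}) = E(L(t,\theta; \mathrm{MD}_{\text{compl}}) \mid \mathrm{MD})$ is obtained by averaging the complete conditional likelihood, treating $\mathrm{MD}_{\text{compl}}$ as hidden data. Then

$$W(t) = \frac{\left[d^\rho \log L(t,\theta_\varepsilon; \mathrm{MD})/d\varepsilon^\rho\right]_{\varepsilon=0}}{\sqrt{I(t)}} = \frac{\sqrt{I_{\text{compl}}}}{\sqrt{I(t)}}\, E\left(W_{\text{compl}}(t) \mid \mathrm{MD}\right), \tag{3.7}$$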

where $I(t) = E_{\theta_0}\left(E^2(S(v(t)) \mid \mathrm{MD}) \mid Y\right)$ is the Fisher information. It depends on $t$ in a way that reflects positioning and informativity of markers. The outer expectation is taken over variations in MD and is typically computationally involved. It can be calculated exactly (Whittemore and Halpern, 1994), approximated as $I(t) \approx I_{\text{compl}}$ so that the first factor on the right-hand side of (3.7) reduces to one (Kruglyak et al., 1996), or approximated by a multiple imputation Monte Carlo algorithm (Clayton, 2001). Yet another method of handling incomplete MD was suggested by Kong and Cox (1997).

By definition of $W(t)$ we have

$$E_{\theta_0}(W(t)) = 0, \qquad V_{\theta_0}(W(t)) = 1, \tag{3.8}$$

where expectation is with respect to variations in MD.
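The standardization (3.8) can be checked numerically in the complete-data case, where $W_{\text{compl}}(t) = S(v(t))/\sqrt{I_{\text{compl}}}$. The sketch below is only an illustration: it assumes a sibling pair ($m = 4$) and a made-up one-parameter trajectory `p_v_given_y` that tilts the uniform distribution by IBD sharing, and it approximates the derivative in (3.5) with $\rho = 1$ by a central finite difference.

```python
from itertools import product
from math import log, sqrt

m = 4
vectors = list(product((0, 1), repeat=m))


def ibd(w):
    """IBD count for a sibling pair, w = (pat1, mat1, pat2, mat2)."""
    return int(w[0] == w[2]) + int(w[1] == w[3])


def p_v_given_y(w, eps):
    """Hypothetical trajectory: uniform at eps = 0, tilted by IBD sharing."""
    return 2.0 ** -m * (1.0 + eps * (ibd(w) - 1.0))


def score(w, h=1e-6):
    """Central finite difference for d log P / d eps at eps = 0 (rho = 1)."""
    return (log(p_v_given_y(w, h)) - log(p_v_given_y(w, -h))) / (2.0 * h)


S = {w: score(w) for w in vectors}

# Complete-data Fisher information: 2^{-m} * sum_w S(w)^2
I_compl = sum(s * s for s in S.values()) / 2 ** m

# Standardized complete-data score for every possible inheritance vector
W_compl = {w: S[w] / sqrt(I_compl) for w in vectors}

# Mean and variance under the null, i.e. under the uniform prior (2.1)
mean_W = sum(W_compl.values()) / 2 ** m
var_W = sum(x * x for x in W_compl.values()) / 2 ** m - mean_W ** 2
```

For this toy trajectory the score is simply the centered IBD count, which is the sibling-pair allele sharing statistic familiar from NPL analysis.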

So far, we have only discussed a single family. The extension to $N$ pedigrees with mutually independent phenotype/MD is straightforward. We allow the pedigree structures to vary arbitrarily and index quantities for the $i$th family with $i$. The overall conditional likelihood $L(t,\theta) = \prod_{i=1}^N L_i(t,\theta)$ is then simply the product of the familywise conditional likelihoods, and the total Fisher information is $I(t) = \sum_{i=1}^N I_i(t)$ at locus $t$. From this it follows that the linkage score function for $N$ families can be written as

$$W(t) = \frac{\left[d^\rho \log L(t,\theta_\varepsilon; \mathrm{MD})/d\varepsilon^\rho\right]_{\varepsilon=0}}{\sqrt{I(t)}} = \sum_{i=1}^N \gamma_i(t) W_i(t), \tag{3.9}$$

where $W_i(t)$ is the score (3.7) for family $i$ and $\gamma_i(t) = \sqrt{I_i(t)/\sum_{j=1}^N I_j(t)}$ are the locally optimal weights (McPeek, 1999). Since $\sum_{i=1}^N \gamma_i^2(t) = 1$, (3.8) also holds for the total linkage score with $N$ families. Whittemore (1996) noticed that (3.9) includes APM methods as a special case.

The local SPL test statistic at locus $t$ is defined as

$$Z(t) = \begin{cases} W(t)^2, & \text{no sign constraint on } \eta, \\ W(t), & \eta \ge 0, \end{cases} \tag{3.10}$$

and the corresponding global test statistic for testing $H_0$ against $H_1$ is obtained by inserting $Z(t)$ in (3.10) into (3.3). Two separate definitions of $Z(t)$ are needed, because $\eta = 0$ is at the boundary of the parameter space when $\eta \ge 0$. This is always the case when $\rho$ is even and sometimes, depending on the application, when $\rho$ is odd (in that case $\varepsilon \ge 0$ is imposed). A derivation of (3.10) is given in Section A of HS. There, it is also shown that the SPL and mod score approaches are asymptotically equivalent under the null hypothesis of no linkage when the genetic model parameters are restricted to a one-dimensional trajectory $\{\theta_\varepsilon\}$.
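The weighting in (3.9) and the two forms of the local statistic in (3.10) can be sketched as follows. The family scores and Fisher informations are hypothetical numbers used only to exercise the formulas, and the function names are not taken from any existing software.

```python
from math import sqrt


def combine_family_scores(W, I):
    """Total score W(t) = sum_i gamma_i * W_i, with gamma_i = sqrt(I_i / sum_j I_j)."""
    total_information = sum(I)
    gammas = [sqrt(info / total_information) for info in I]
    W_total = sum(g * w for g, w in zip(gammas, W))
    return W_total, gammas


def local_statistic(W_total, sign_constraint):
    """Z(t) in (3.10): W(t) under a sign constraint, W(t)^2 otherwise."""
    return W_total if sign_constraint else W_total ** 2


# Hypothetical standardized family scores W_i(t) and informations I_i(t)
W_fam = [1.2, -0.3, 2.0]
I_fam = [1.0, 4.0, 4.0]

W_total, gammas = combine_family_scores(W_fam, I_fam)
Z_one_sided = local_statistic(W_total, sign_constraint=True)
Z_two_sided = local_statistic(W_total, sign_constraint=False)
```

Because the squared weights sum to one, the combined score keeps unit variance under the null hypothesis whenever the family scores are independent and individually standardized.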

For quantitative phenotypes, the SPL approach is also asymptotically equivalent to variance component techniques when the parameters of the latter model are varied along a one-dimensional trajectory. This is briefly discussed in Example 6 of Section 5.


4. GENETIC MODELS

To begin with, it is mathematically more convenient to work with

$$P_\theta(Y \mid v) = 2^m P_\theta(v \mid Y) P_\theta(Y) \tag{4.1}$$

than with $P_\theta(v \mid Y)$. The reason is that

$$P_\theta(Y \mid v) = \sum_G P_\psi(Y \mid G) P_p(G \mid v) \tag{4.2}$$

can be expanded by summing over all possible genotype configurations in the pedigree. In (4.2), we split $\theta = (p,\psi)$ into $p$, the frequency (probability) of the disease allele, and $\psi$, the penetrance parameter(s). The latter describe the relationship between phenotypes and genotypes.

We will introduce a class of genetic models with penetrance factor of the form

$$P_\psi(Y \mid G) = \int P_{\psi_Y}(Y \mid x) P_{\psi_X}(x \mid G)\, dx \tag{4.3}$$

by integrating with respect to a vector $X = (X_1,\ldots,X_n)$ of liabilities. This vector contains influences from the major gene $G$ as well as polygenes and environmental components. Given $X$, the components of $Y$ are conditionally independent

$$P_{\psi_Y}(Y \mid X) = \prod_{k=1}^n P(Y_k \mid X_k, z_k), \tag{4.4}$$

where $z_k$ is a (possibly empty) set of observed covariates for $k$ and $\psi_Y$ is the (possibly empty) set of penetrance parameters involved in (4.4). In many cases $Y_k$ is just a deterministic function of $X_k$. If $k$ has unknown phenotype we put $P(Y_k \mid X_k, z_k) = 1$.

Conditional on $G$, we assume that the liability vector is multivariate normal,

$$X \mid G \in N(\mu(G) + \beta\Xi, \sigma^2\Sigma), \tag{4.5}$$

where $s \ge 0$ is the number of covariates, $\Xi$ is the $s \times n$ design matrix of observed covariates and $\beta$ is a $1 \times s$ vector of regression coefficients. The vector $\mu(G)$ depends on the major gene $G$ whereas the stochastic variation is due to polygenic and environmental effects. Therefore, neither the conditional variance $\sigma^2 = \mathrm{Var}(X_k \mid G)$ nor the conditional correlation matrix $\Sigma = \mathrm{Corr}(X \mid G)$ depends on $G$.

In order to describe $\mu(G)$ and $\Sigma$, we need some more definitions. Assume there are numbers $m_0$, $m_1$ and $m_2$ such that $X_k \mid G_k \in N(m_{|G_k|} + (\beta\Xi)_k, \sigma^2)$, with $|G_k| = a_{2k-1} + a_{2k}$ the number of disease alleles of $G_k$. Then put

$$\mu(G) = (m_{|G_1|},\ldots,m_{|G_n|}). \tag{4.6}$$

For instance, if large values of the liability indicate disease, a natural constraint is $m_0 \le m_1 \le m_2$.

Let $\mathrm{IBD}_{kl} = \mathrm{IBD}_{kl}(w)$ be the number of alleles that two individuals $k$ and $l$ ($1 \le k,l \le n$) share identical by descent for inheritance vector $w$. The coefficient of relationship between $k$ and $l$ is defined as $r_{kl} = E(\mathrm{IBD}_{kl}(w))/2$. We also put $\delta_{kl} = P(\mathrm{IBD}_{kl}(w) = 2)$, where expectation and probability are taken w.r.t. the uniform distribution (2.1). Following Fisher (1918) and Kempthorne (1955), the correlation matrix with polygenic and shared environmental effects is

$$\Sigma = (1 - h_a^2 - h_d^2 - h_s^2) I_n + h_a^2 R + h_d^2 \Delta + h_s^2 S, \tag{4.7}$$

where $I_n$ is an identity matrix of order $n$, $R = (r_{kl})$ and $\Delta = (\delta_{kl})$. Further, $h_a^2$ and $h_d^2$ are the additive and dominant polygenic heritabilities, respectively. These are the fractions of total environmental and polygenic liability variance ($\sigma^2$) due to additive and dominant genetic effects, respectively. The matrix $S = (s_{kl})$ models a shared environment. For instance, each entry $s_{kl}$ can be set to one or zero depending on whether or not $k$ and $l$ share the same household. The parameter $h_s^2$ is the fraction of $\sigma^2$ due to shared environment. The part of the penetrance vector $\psi$ involved in (4.5) is $\psi_X = (m_0, m_1, m_2, \sigma^2, h_a^2, h_d^2, h_s^2, \beta)$, with the first three components due to the major gene, and the next five caused by polygenic/environmental effects.

Summarizing, (4.3)–(4.5) is a penetrance model with multivariate Gaussian liability $X$ and penetrance parameters $\psi = (\psi_X, \psi_Y)$. A similar class of models has been suggested in the geostatistical literature by Diggle et al. (1999).
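To make (4.7) concrete, the sketch below assembles the correlation matrix for a nuclear family (father, mother, two siblings) using the standard values of $r_{kl}$ and $\delta_{kl}$ for unrelated founders, parent and child pairs, and full siblings. The heritabilities and the assumption that all four members share a household are hypothetical choices for illustration.

```python
def correlation_matrix(h2a, h2d, h2s, R, Delta, S):
    """Correlation matrix (4.7) as a list of rows:
    (1 - h2a - h2d - h2s) * I_n + h2a * R + h2d * Delta + h2s * S.
    """
    n = len(R)
    residual = 1.0 - h2a - h2d - h2s
    return [
        [
            residual * (1.0 if k == l else 0.0)
            + h2a * R[k][l]
            + h2d * Delta[k][l]
            + h2s * S[k][l]
            for l in range(n)
        ]
        for k in range(n)
    ]


# Order of individuals: father, mother, sibling 1, sibling 2
R = [  # coefficients of relationship r_kl = E(IBD_kl) / 2
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
]
Delta = [  # delta_kl = P(IBD_kl = 2)
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.25],
    [0.0, 0.0, 0.25, 1.0],
]
S = [[1.0] * 4 for _ in range(4)]  # assumption: everyone shares one household

# Hypothetical heritabilities
Sigma = correlation_matrix(h2a=0.3, h2d=0.1, h2s=0.2, R=R, Delta=Delta, S=S)
```

With these inputs the diagonal of `Sigma` equals one, as required for a correlation matrix, and the sibling correlation exceeds the parent and child correlation only through the dominance term.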

EXAMPLE 1 (GAUSSIAN MIXED MODEL) When liabilities are observed, we put $Y = X$. This is the Gaussian mixed model of Ott (1979). The name refers to $Y$ being a mixture of multivariate normal distributions $Y \mid G \in N(\mu(G) + \beta\Xi, \sigma^2\Sigma)$.

EXAMPLE 2 (LIABILITY THRESHOLD MODEL) When the phenotypes $Y_k$ are binary ($Y_k = 1$ affected, $Y_k = 0$ unaffected) but show no simple Mendelian inheritance pattern, it is common to model the distribution of $Y_k$ as a function of an underlying quantitative variable $X_k$ involving alleles from several loci. The liability threshold model was originally introduced by Pearson and Lee (1901). With $T$ a given threshold, the phenotypes are defined as $Y_k = 1_{\{X_k \ge T\}}$ and $\psi_Y = \{T\}$. For identifiability we assume $m = 0$ and $\sigma^2 = 1$, where

$$m = E(X_k) - (\beta\Xi)_k = q^2 m_0 + 2pq\, m_1 + p^2 m_2$$

and $q = 1 - p$. Usually this model does not include a covariate, but we may include a single covariate containing the liability class or age of each individual. For a recent review of binary liability models with various extensions, see Todorov and Suarez (2002).
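As an illustration of the threshold model, the following sketch computes the marginal penetrances $P(Y_k = 1 \mid G_k)$ for a single individual without covariates, using $X_k \mid G_k \in N(m_{|G_k|}, 1)$. The allele frequency, the genotype means and the threshold are hypothetical values, chosen so that the identifiability constraint $m = 0$ holds.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def penetrance(T, m_g):
    """P(Y_k = 1 | G_k) = P(X_k >= T) for X_k ~ N(m_g, 1)."""
    return 1.0 - normal_cdf(T - m_g)


p = 0.1                    # hypothetical disease allele frequency
q = 1.0 - p
means = (-0.2, 0.8, 1.8)   # hypothetical m_0, m_1, m_2 (additive major gene)
T = 2.0                    # hypothetical threshold

# Identifiability constraint: m = q^2 m_0 + 2pq m_1 + p^2 m_2 = 0
m_bar = q * q * means[0] + 2 * p * q * means[1] + p * p * means[2]

# Penetrances for 0, 1 and 2 disease alleles, and the population prevalence
f = [penetrance(T, m_g) for m_g in means]
prevalence = q * q * f[0] + 2 * p * q * f[1] + p * p * f[2]
```

The ordering $m_0 \le m_1 \le m_2$ translates directly into increasing penetrances, and the prevalence is their average over the Hardy-Weinberg genotype frequencies.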

EXAMPLE 3 (LOGISTIC REGRESSION) As in the previous example, we consider a binary trait with $Y_k = 1$ and $0$ corresponding to an affected or unaffected individual. We also include a design vector $\Xi = (t_1,\ldots,t_n)$, where $t_k$ is either the time of examination or time of onset of $k$ and $\beta$ is a nonnegative regression parameter. Then assume

$$P(Y_k \mid X_k) = \begin{cases} F(X_k)^{Y_k}(1 - F(X_k))^{1-Y_k}, & \text{if } t_k \text{ is age of examination}, \\ \beta f(X_k), & \text{if } Y_k = 1 \text{ and } t_k \text{ is age of onset}, \end{cases}$$

where $F(x) = e^x/(1 + e^x)$ is the logistic distribution function and $f(x) = F'(x)$ the corresponding density. In this case $\psi_Y = \{\beta\}$ (so that $\beta$ appears in $\psi_X$ and $\psi_Y$). A similar model has been considered by Bonney (1986) and Elston and George (1989), but these authors use a Markov rather than Gaussian liability model for $Y$.

EXAMPLE 4 (SURVIVAL ANALYSIS) In Example 3, an alternative is to use a Cox proportional hazards model with hazard rate $\lambda(t; X_k) = \lambda_0(t)\exp(X_k)$, baseline hazard $\lambda_0$ and distribution function $F(t; X_k) = 1 - \exp\left(-\int_0^t \lambda(u; X_k)\, du\right)$. In this case there are no covariates ($s = 0$) in (4.5). Instead, we include $t_k$ as covariate in (4.4) ($z_k = t_k$) and put

$$P(Y_k \mid X_k, t_k) = \begin{cases} F(t_k; X_k)^{Y_k}(1 - F(t_k; X_k))^{1-Y_k}, & \text{if } t_k \text{ is age of examination}, \\ f(t_k; X_k), & \text{if } Y_k = 1 \text{ and } t_k \text{ is age of onset}, \end{cases}$$

so that $\psi_Y = \{\lambda_0\}$. For identifiability we put $m = 0$. See Thomas and Gauderman (1996) for more details.


Another possibility is that $Y_k$ and $X_k$ are related through a generalized linear model (McCullagh and Nelder, 1989), i.e. $h(E(Y_k)) = X_k$ for some link function $h$. An example is Poisson distributed data $Y_k \in \mathrm{Po}(\exp(X_k))$.

5. CHOOSING SCORE FUNCTIONS

To start with, we establish the following simple but very useful result.

PROPOSITION 1 Let $\bar S(v) = d^\rho \log P_{\theta_\varepsilon}(Y|v)/d\varepsilon^\rho|_{\varepsilon=0}$ be the score function of $P_{\theta_\varepsilon}(Y|v)$ at $\varepsilon = 0$. Then $S$ is the centered version of $\bar S$, i.e.

$$S(v) = \bar S(v) - C,$$

where $C$ is a centering constant, ensuring that $E_{\theta_0}(S(v)|Y) = 2^{-m}\sum_w S(w) = 0$.

Proposition 1 implies that it suffices to consider score functions of $P_\theta(Y|v)$. In formula (4.2), $P_\theta(Y|v)$ is defined by summing over all possible genotype vectors $G$. This is equivalent to summing over all founder allele vectors $a = (a_1,\ldots,a_{2f})$. In fact, $J(w) = (j_1(w),\ldots,j_{2n}(w))$, the gene-identity state of the pedigree (Thompson, 1974), is a function of the inheritance vector $w$, such that $j_k(w) \in \{1,\ldots,2f\}$ is the number of the founder allele that has been transmitted to allele number $k$. Since

$$G = G(a,v) = a_{J(v)} = \left(a_{j_1(v)},\ldots,a_{j_{2n}(v)}\right),$$

we obtain

$$P_\theta(Y|v) = \sum_a P_\psi(Y|a,v)P_p(a) = E_p(P_\psi(Y|a,v)), \tag{5.1}$$

where the sum ranges over all $2^{2f}$ possible founder allele vectors $a$ and the last expectation is w.r.t. $a$. We assumed in (5.1) that $a$ and $v$ are independent (no segregation distortion), so that $P_p(G|v) = P_p(a)$. Under random mating, the components of $a$ are independent,

$$P_p(a) = p^{|a|}q^{2f-|a|}, \tag{5.2}$$

where $|a| = \sum_{j=1}^{2f} a_j$. Viewing $a$ as hidden data, $P_\psi(Y|a,v)$ is the complete likelihood corresponding to $P_\theta(Y|v)$. Formulas (5.1) and (5.2) will be used in the next two subsections for deriving score functions $S$.
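To make (5.1) and (5.2) concrete, the following minimal sketch enumerates all $2^{2f}$ binary founder allele vectors and averages a complete likelihood over them. The function names and the toy complete likelihood `toy_lik` are hypothetical and serve only as illustration; in practice $P_\psi(Y|a,v)$ would come from the genetic model of Section 4.

```python
from itertools import product


def founder_prior(a, p):
    """Random-mating prior (5.2): P_p(a) = p^|a| * q^(2f - |a|)."""
    q = 1.0 - p
    n_disease = sum(a)  # |a|, the number of disease alleles among the founders
    return p ** n_disease * q ** (len(a) - n_disease)


def marginal_likelihood(complete_lik, f, p):
    """P_theta(Y|v) as in (5.1): expectation of the complete likelihood over a."""
    return sum(
        complete_lik(a) * founder_prior(a, p)
        for a in product((0, 1), repeat=2 * f)
    )


def toy_lik(a):
    # Hypothetical stand-in for P_psi(Y|a,v) at fixed Y and v (illustration only):
    # it depends on a solely through the number of disease alleles.
    return 0.1 + 0.2 * sum(a)


f, p = 2, 0.05
total_prior = sum(founder_prior(a, p) for a in product((0, 1), repeat=2 * f))
lik = marginal_likelihood(toy_lik, f, p)
```

Because the toy likelihood is linear in $|a|$, the result can be checked by hand: $E|a| = 2fp = 0.2$, so the average equals $0.1 + 0.2 \cdot 0.2 = 0.14$. Full enumeration is only feasible for small numbers of founders.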

5.1 Local penetrance models

Assume the disease allele frequency $p$ is fixed whereas the penetrance parameters $\psi_\varepsilon$ vary with $\varepsilon$ so that $P_{\psi_0}(Y|G) = P_{\psi_0}(Y)$ is independent of $G$. This implies no genetic effect at the disease locus when $\varepsilon = 0$. In more detail, we consider penetrance functions of the form

$$P_\psi(Y|G) = f(Y; \mu), \tag{5.3}$$

with $\mu = \mu(G)$ as in (4.6). The Gaussian liability class of models (4.3) is included in (5.3). For brevity, we write $\psi = (m_0,m_1,m_2)$, since only these three penetrance parameters depend on $\varepsilon$ according to

$$\psi_\varepsilon = (m^*,m^*,m^*) + \varepsilon(u(0),u(1),u(2)). \tag{5.4}$$

Hence, it is only $\mu$ that depends on $\varepsilon$ in (5.3). Define $\sigma_g^2 = \mathrm{Var}(u(|G_k|))$. Then, in (4.5), the variance of the liability is $\mathrm{Var}(X_k) = \varepsilon^2\sigma_g^2 + \sigma^2$. The first term ($\varepsilon^2\sigma_g^2$) is genetic variance at the main locus and


the second term ($\sigma^2$) is variance due to polygenic and shared environmental effects. We may further split $\sigma_g^2 = \sigma_a^2 + \sigma_d^2$ into additive and dominant variance components, where

$$\sigma_a^2 = 2pq\,\bigl(p(u(2) - u(1)) + q(u(1) - u(0))\bigr)^2, \qquad \sigma_d^2 = (pq)^2\,(u(2) - 2u(1) + u(0))^2. \tag{5.5}$$

Let $c = \sigma_d^2/\sigma_g^2$ be the fraction of dominance variance at the main locus, put $\mu_0 = (m^*,\ldots,m^*)$ and introduce weights

$$\omega_k = \omega_k(Y) = \sigma_g\,\frac{\partial f(Y;\mu)/\partial\mu_k\,|_{\mu=\mu_0}}{f(Y;\mu_0)}, \qquad \omega_{kl} = \omega_{kl}(Y) = \sigma_g^2\,\frac{\partial^2 f(Y;\mu)/\partial\mu_k\partial\mu_l\,|_{\mu=\mu_0}}{f(Y;\mu_0)}, \tag{5.6}$$

assigned to individuals and pairs of individuals. Then, the following result holds.
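The variance split in (5.5) can be checked numerically. The sketch below, with hypothetical function names, computes $\sigma_a^2$, $\sigma_d^2$ and $c$ from $p$ and $(u(0),u(1),u(2))$, and compares their sum with the variance of $u(|G_k|)$ computed directly from genotype probabilities $(q^2, 2pq, p^2)$, which is what random mating gives for a non-inbred individual.

```python
def variance_components(p, u0, u1, u2):
    """Additive and dominance variance at the main locus, as in (5.5)."""
    q = 1.0 - p
    sigma2_a = 2 * p * q * (p * (u2 - u1) + q * (u1 - u0)) ** 2
    sigma2_d = (p * q) ** 2 * (u2 - 2 * u1 + u0) ** 2
    sigma2_g = sigma2_a + sigma2_d
    # c is the fraction of dominance variance; undefined if there is no genetic variance
    c = sigma2_d / sigma2_g if sigma2_g > 0 else float("nan")
    return sigma2_a, sigma2_d, sigma2_g, c


def genotype_variance(p, u0, u1, u2):
    """Var(u(|G_k|)) computed directly from genotype probabilities q^2, 2pq, p^2."""
    q = 1.0 - p
    probs = (q * q, 2 * p * q, p * p)
    values = (u0, u1, u2)
    mean = sum(pr * u for pr, u in zip(probs, values))
    return sum(pr * (u - mean) ** 2 for pr, u in zip(probs, values))


# Dominant effect pattern (illustrative values only)
s2a, s2d, s2g, c_dom = variance_components(0.3, 0.0, 1.0, 1.0)
direct = genotype_variance(0.3, 0.0, 1.0, 1.0)

# Additive effect pattern: the dominance component vanishes, so c = 0
_, _, _, c_add = variance_components(0.3, 0.0, 0.5, 1.0)
```

For the dominant pattern with $p = 0.3$ the two components are $0.2058$ and $0.0441$, which add up to the directly computed variance $0.2499$.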

THEOREM 1 Consider a weak penetrance model (4.6), (5.3), (5.4) and assume random mating (5.2). Then, for an inbred pedigree, $\rho = 1$ and the score function $S$ in (3.5) satisfies

$$S(v) = \sqrt{c}\,\sum_{k=1}^{n} \omega_k\,\mathrm{HBD}_k - C, \tag{5.7}$$

provided $c \neq 0$ and at least one $\omega_k \neq 0$. Here $\mathrm{HBD}_k = \mathrm{HBD}_k(v) = 1_{\{j_{2k-1}(v) = j_{2k}(v)\}}$ is the homozygosity by descent indicator of $k$ and $C$ is a centering constant. For an outbred pedigree, $\rho = 2$ and

$$S(v) = 2\sum_{1 \le k < l \le n} \omega_{kl}\left((1 - c)\,\mathrm{IBD}_{kl}/2 + c\,1_{\{\mathrm{IBD}_{kl} = 2\}}\right) - C, \tag{5.8}$$

provided $\omega_{kl}$ is nonzero for at least one pair $kl$ and, again, $C$ is a centering constant.

Notice that $\rho$ in Theorem 1 depends on the pedigree structure. On the other hand, the weights $\omega_k$ and $\omega_{kl}$ depend on the phenotypes and the genetic model. They determine how various individuals or pairs of individuals should be weighted in the optimal score function.
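As an illustration of how the pair weights enter (5.8), the sketch below evaluates the uncentered outbred score for given IBD counts and then centers it by averaging over equally likely inheritance configurations, in line with Proposition 1. The function names are hypothetical. The sib-pair list `[0, 1, 1, 2]` is an illustrative shortcut that stands for four equally likely IBD classes rather than the full set of inheritance vectors.

```python
def pair_score_uncentered(omega, ibd, c):
    """Uncentered version of (5.8).

    omega[k][l] : pair weight omega_kl
    ibd[k][l]   : number of alleles (0, 1 or 2) that k and l share IBD
    c           : fraction of dominance variance at the main locus
    """
    n = len(ibd)
    total = 0.0
    for k in range(n):
        for l in range(k + 1, n):
            both = 1.0 if ibd[k][l] == 2 else 0.0
            total += omega[k][l] * ((1.0 - c) * ibd[k][l] / 2.0 + c * both)
    return 2.0 * total


def center(scores):
    """Subtract C, the average over equally likely inheritance configurations."""
    C = sum(scores) / len(scores)
    return [s - C for s in scores]


# One sib pair with unit weight; IBD classes 0, 1, 1, 2 are equally likely under H0.
omega = [[0.0, 1.0], [1.0, 0.0]]
ibd_classes = [0, 1, 1, 2]

additive = center(
    [pair_score_uncentered(omega, [[2, i], [i, 2]], 0.0) for i in ibd_classes]
)
dominance_only = center(
    [pair_score_uncentered(omega, [[2, i], [i, 2]], 1.0) for i in ibd_classes]
)
```

With $c = 0$ the score is simply the centered IBD count, while $c = 1$ rewards only pairs sharing both alleles IBD.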

EXAMPLE 5 (MONOGENIC DISEASES) When there are no polygenic or shared environmental components, we assume

$$P(Y|G) = \prod_{k=1}^{n} P(Y_k|G_k). \tag{5.9}$$

This includes the Gaussian regime class of models with $\Sigma$ diagonal. McPeek (1999) derived (5.7)–(5.8) for binary traits and pedigrees whose members are affected or have unknown phenotype. In this case (5.7) simplifies to a score function which counts the number of affected individuals that are HBD, i.e. $\omega_k$ are identical for all affected individuals. For outbred pedigrees, (5.8) becomes a linear combination of the two score functions $S_{\mathrm{pairs}}$, which sums the number of alleles shared IBD over all pairs of affected individuals (Whittemore and Halpern, 1994) and $S_{\mathrm{g\text{-}prs}}$, which sums all pairs of affected individuals that have both alleles IBD. McPeek's results were generalized to arbitrary genetic models in Hössjer (2003d), where it was shown that

$$\omega_{kl} = \omega_k\omega_l, \qquad k \neq l. \tag{5.10}$$

For instance, for the Gaussian mixed model of Example 1 without polygenic effects, (5.8) equals the weighted pairwise correlation statistic $S_{\mathrm{WPC}}$ of Commenges (1994) when the main locus is additive, i.e. $c = 0$.


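EXAMPLE 6 (GAUSSIAN MIXED MODEL, OUTBRED PEDIGREE) Consider the Gaussian mixed model of Example 1. Assume $\sigma_g^2 = \sigma^2$. It is shown in HS that

$$\omega_k = (r\Sigma^{-1})_k, \qquad \omega_{kl} = (r\Sigma^{-1})_k(r\Sigma^{-1})_l - \Sigma^{-1}_{kl}, \tag{5.11}$$

where $r = (Y - E(Y))/\sigma$, with $E(Y)$ equal to $\mu_0$ plus the covariate term of (4.5), is the standardized vector of residuals and $\Sigma^{-1}_{kl}$ is the $(k,l)$th component of $\Sigma^{-1}$. If $Y_k$ is unknown, we put $\omega_k = \omega_{kl} = 0$.

For an outbred pedigree $\rho = 2$, and since $\sigma_g^2 = \sigma^2$ the quantity $\varepsilon^2/2$ can be written as

$$\frac{\varepsilon^2}{2} = \frac{h^2}{2(1 - h^2)},$$

where $h^2 = \mathrm{Var}(m_{|G_k|})/\mathrm{Var}(Y_k)$ is the heritability of the phenotype at the main locus; when $\varepsilon$ is small, $\varepsilon^2/2 \approx h^2/2$.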
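The weights in (5.11) are straightforward to evaluate once the standardized residuals and the inverse covariance matrix are available. The following sketch uses plain Python lists and hypothetical function names; it assumes that $r$ is a row vector and that $\Sigma^{-1}$ has already been computed by other means. The second function evaluates the relation between main-locus heritability and $\varepsilon^2/2$ stated above.

```python
def wpairs_weights(r, sigma_inv):
    """Individual and pair weights of (5.11).

    r         : standardized residuals, treated as a row vector
    sigma_inv : inverse covariance matrix Sigma^{-1} as a nested list
    """
    n = len(r)
    # (r Sigma^{-1})_k = sum_i r_i * Sigma^{-1}_{ik}
    w = [sum(r[i] * sigma_inv[i][k] for i in range(n)) for k in range(n)]
    w_pair = [
        [w[k] * w[l] - sigma_inv[k][l] for l in range(n)]
        for k in range(n)
    ]
    return w, w_pair


def half_eps_squared(h2):
    """epsilon^2 / 2 expressed through the main-locus heritability h^2."""
    return h2 / (2.0 * (1.0 - h2))


# Without polygenic or shared environmental effects Sigma is the identity,
# and the pair weights factorize as in (5.10).
identity = [[1.0, 0.0], [0.0, 1.0]]
r = [1.5, -0.5]
w, w_pair = wpairs_weights(r, identity)
```

In the identity case the off-diagonal pair weight reduces to the product of residuals $r_k r_l$, which is the building block of the weighted pairwise correlation statistic.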

We denote the score function obtained when inserting (5.11) into (5.8) by $S_{\mathrm{wpairs}}$. The name reflects that $S_{\mathrm{wpairs}}$ is a weighted sum of pairwise IBD sharing. The unknown parameters of $S_{\mathrm{wpairs}}$ are $(m^*,\sigma^2,c,h_a^2,h_d^2,h_s^2,\beta)$. For additive models we put $c = h_d^2 = 0$, and in absence of polygenic and shared environmental effects we reduce the parameter vector further by letting $h_a^2 = h_s^2 = 0$. This special case of $S_{\mathrm{wpairs}}$ is the weighted pairwise correlation statistic $S_{\mathrm{WPC}}$. It contains $\sigma^{-2}$ as a multiplicative constant which can be dropped. The only remaining parameters to estimate for $S_{\mathrm{WPC}}$ are $(m^*,\beta)$.

One may approximate the exact distribution (4.2) of $Y|v$ by a multivariate normal one. The VC techniques are based on this approximation. Tang and Siegmund (2001), Putter et al. (2002) and Wang and Huang (2002) have shown, for nuclear families, that $S_{\mathrm{wpairs}}$ is the score obtained with the multivariate normal approximation. Hence, the two approaches are locally equivalent for small $\varepsilon$.

For an inbred pedigree $\rho = 1$. If $q^2u(0) + 2pq\,u(1) + p^2u(2) = 0$, it follows that

$$E(Y_k) = m^* + F_k\sigma_d\varepsilon + (\text{covariate term of (4.5)})_k,$$

where $F_k = E(\mathrm{HBD}_k)$ is the inbreeding coefficient of $k$. This means that $\varepsilon$ is proportional to the change in phenotype mean for all inbred individuals compared to the null model $\psi_0$. The score function obtained by inserting (5.11) into (5.7) has $(m^*,h_a^2,h_d^2,h_s^2,\beta)$ as parameters that need to be estimated or put to prior values. If we ignore polygenic dominance and shared environmental effects and there are no covariates, we only need to choose $(m^*,h_a^2)$ a priori.

Inbred pedigrees are often used for recessive models. If $u(0) \le u(1) \le u(2)$, it is natural to put the constraint $\varepsilon \ge 0$ in (5.4) in order to maintain monotonicity of the three mean parameters $m_0 \le m_1 \le m_2$. Then $\theta_0$ is at the boundary of the parameter space and $Z(t) = W(t)$ is a natural test statistic at locus $t$. It is also possible to replace the $\varepsilon = 0$ model $(m^*,m^*,m^*)$ in (5.4) by an additive model $(m_0,(m_0 + m_2)/2,m_2)$ with $m_2 > m_0$. The argument leading to (5.7) carries over to this case. If any deviation from additivity is of interest, we put no sign constraint on $\varepsilon$ and use $Z(t) = W(t)^2$ as test statistic at locus $t$.

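EXAMPLE 7 (GAUSSIAN LIABILITY MODELS) It is shown in HS that for the Gaussian liability model (4.3),

$$\omega_k = \frac{\sigma_g}{\sigma}\int (x\Sigma^{-1})_k\,P(x|Y)\,dx, \qquad \omega_{kl} = \frac{\sigma_g^2}{\sigma^2}\int \left((x\Sigma^{-1})_k(x\Sigma^{-1})_l - \Sigma^{-1}_{kl}\right)P(x|Y)\,dx, \tag{5.12}$$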


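where $P(x|Y) \propto P(Y\,|\,X = \sigma x + \mu_0 + \text{covariate term})\,P(x)$ is the posterior density of the standardized liability $(X - \mu_0 - \text{covariate term})/\sigma$, the covariate term being that of (4.5), and $P(x)$ is the density of an $N(0,\Sigma)$-distribution.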

5.2 Rare disease models

In this subsection, we keep the penetrance parameter $\psi$ fixed whereas $p_\varepsilon = \varepsilon$ is a function of $\varepsilon$.

PROPOSITION 2 Assume random mating (5.2) and let $e_j$ and $0$ be binary vectors of length $2f$ with $e_j$ having a one in the $j$th position and zeros elsewhere and $0$ having zeros everywhere. Then $\rho = 1$ and

$$S(v) = \sum_{j=1}^{2f}\frac{P(Y|e_j,v)}{P(Y|0,v)} - C \tag{5.13}$$

whenever the right-hand side is a nonconstant function of $v$. The constant $C$ is chosen so that $E_{\theta_0}(S(v)|Y) = 0$.

Notice that $\theta_0$ is at the boundary of the parameter space because of the constraint $p \ge 0$ on the disease allele frequency. Hence, $Z(t) = W(t)$ is the appropriate test statistic to use.

McPeek (1999) derived (5.13) for binary traits and affected pedigree members. The resulting score function $S$ she referred to as $S_{\mathrm{robdom}}$, since it had good and robust performance over a wide range of dominant models. McPeek's result was extended in Hössjer (2003d) to the monogenic model (5.9), whilst (5.13) is a further extension to include polygenic and shared environmental effects.

EXAMPLE 8 (GAUSSIAN MIXED MODELS FOR RARE DISEASES) In Example 1, assume that the pedigree is outbred. Define $b_j = b_j(v) = (b_{j1},\ldots,b_{jn})$, where $b_{jk}$ is 1 iff individual $k$ receives the $j$th founder allele via one of its parents (either $j_{2k-1}(v)$ or $j_{2k}(v)$ equals $j$). Put $K = \exp((m_1 - m_0)/\sigma)$, and let $r = (Y - m_0 - \text{covariate term of (4.5)})/\sigma$ be a standardized residual vector in the absence of disease alleles ($m = m_0$). Then, inserting $Y|e_j,v \sim N(m_0\mathbf{1} + (m_1 - m_0)b_j + \text{covariate term},\ \sigma^2\Sigma)$ into (5.13) we arrive at

$$S_{\mathrm{normdom}}(v) = \sum_{j=1}^{2f} K^{\,b_j\Sigma^{-1}(r - 0.5\log(K)\,b_j)^{\top}} - C, \tag{5.14}$$

where $C$ is a centering constant. We use the score function name $S = S_{\mathrm{normdom}}$ introduced in Hössjer (2003d) for the special case $h_a^2 = h_d^2 = h_s^2 = 0$ of no polygenic effects. Notice that $m_2$ does not enter into $S_{\mathrm{normdom}}$, because for rare disease alleles it is very unlikely that there is more than one disease allele among the founders. Since the pedigree is assumed to have no loops, the disease allele can appear at most once in each individual. The unknown parameters of $S_{\mathrm{normdom}}$ are $(K,m_0,\sigma^2,h_a^2,h_d^2,h_s^2)$. Of these, $K$ is most important, since it measures the strength of the major genetic component. For rare disease alleles one has $E(Y_k) \approx m_0 + (\text{covariate term})_k$ and $V(Y_k) \approx \sigma^2$. This motivates why $m_0 + (\text{covariate term})_k$ and $\sigma$ are used for standardizing phenotypes.
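The sum in (5.14) can be evaluated directly once the carrier indicator vectors $b_j(v)$ are known for the inheritance vector at hand. The sketch below computes the uncentered score with plain Python lists; the function name is hypothetical, and the inputs `b`, `r` and `sigma_inv` are assumed to have been prepared elsewhere (the centering constant $C$ would be obtained by averaging over inheritance vectors).

```python
import math


def normdom_uncentered(b, r, sigma_inv, K):
    """Uncentered version of (5.14).

    b         : list of 2f indicator vectors b_j, each of length n
    r         : standardized residual vector of length n
    sigma_inv : inverse covariance matrix Sigma^{-1} as a nested list
    K         : exp((m1 - m0) / sigma), the strength of the major gene
    """
    log_k = math.log(K)
    n = len(r)
    total = 0.0
    for bj in b:
        # d = r - 0.5 * log(K) * b_j
        d = [r[k] - 0.5 * log_k * bj[k] for k in range(n)]
        # exponent = b_j Sigma^{-1} d^T (a scalar)
        exponent = sum(
            bj[i] * sigma_inv[i][k] * d[k] for i in range(n) for k in range(n)
        )
        total += K ** exponent
    return total


identity = [[1.0, 0.0], [0.0, 1.0]]
# Two founder alleles: the first carried by individual 1 only, the second by nobody.
b = [[1, 0], [0, 0]]
score = normdom_uncentered(b, [1.0, 0.2], identity, math.e)
no_effect = normdom_uncentered(b, [1.0, 0.2], identity, 1.0)
```

A founder allele carried by no phenotyped individual contributes the constant 1, and with $K = 1$ (no major gene effect) every term equals 1, so the score carries no information about $v$.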

6. A SIMULATION STUDY

In this section we investigate various score functions for the Gaussian mixed model of Example 1. For simplicity we do not include covariates and put $\beta = 0$ in (4.5) (with $Y = X$). We assume that the phenotype mean $E(Y_k) = m$ and total variance $V(Y_k) = \sigma_t^2 = \mathrm{Var}(m_{|G_k|}) + \sigma^2$ have been estimated from population data. Here $\mathrm{Var}(m_{|G_k|})$ is the total genetic variance of the major gene while $\sigma^2$, defined in (4.5) with $Y = X$, is the sum of all environmental and polygenic variance components. For simplicity, we


Fig. 1. 50·ANCP, where ANCP is the asymptotic noncentrality parameter, as function of true $h_a^2$ for optimal (thick solid line), Haseman–Elston (dashed line) and $S_{\mathrm{normdom}}$ score functions with $k = 1.5$ and different choices of assumed $h_a^2$: $h_a^2 = 0$ (∗), $h_a^2 = 0.2$ (+), $h_a^2 = 0.5$ (dotted line) and $h_a^2 = 0.8$ (o). The number of Monte Carlo iterates is 5000.

assume there are no dominant polygenic or shared environmental effects, i.e. $h_d^2 = h_s^2 = 0$ in (4.7). The four essential unknown genetic model characteristics are then $p$, $h_a^2$ and

$$\mathrm{Disp} = (m_2 - m_0)/\sigma, \qquad \mathrm{Dom} = (2m_1 - m_0 - m_2)/(m_2 - m_0).$$

The displacement Disp quantifies the strength and Dom the degree of dominance of the main locus genetic component. Under the mild restriction that $m_i$ are nondecreasing we have $\mathrm{Disp} \ge 0$ and $-1 \le \mathrm{Dom} \le 1$, with Dom taking values $-1$, 0 and 1 for recessive, additive and dominant models, respectively.
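The two summary quantities are simple functions of the genotype means. A small sketch (the function name is hypothetical) confirms the three reference values of Dom quoted above.

```python
def disp_dom(m0, m1, m2, sigma):
    """Displacement and degree of dominance of the main locus."""
    if m2 == m0:
        raise ValueError("Dom is undefined when m2 equals m0")
    disp = (m2 - m0) / sigma
    dom = (2 * m1 - m0 - m2) / (m2 - m0)
    return disp, dom


recessive = disp_dom(0.0, 0.0, 1.0, 2.0)  # m1 = m0
additive = disp_dom(0.0, 0.5, 1.0, 2.0)   # m1 halfway between m0 and m2
dominant = disp_dom(0.0, 1.0, 1.0, 2.0)   # m1 = m2
```

All three calls share the same displacement, $(m_2 - m_0)/\sigma = 0.5$, and differ only in Dom.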

We only consider outbred pedigrees, hence the linkage score function is $Z(t) = W(t)$, i.e. the second row of (3.10) is used. As performance criterion we use the noncentrality parameter, $\mathrm{NCP} = E(Z(\tau)|Y)$, the expected value w.r.t. MD and conditional on phenotypes of the linkage score function at the disease locus. This criterion is related to the power $P_{H_1}(Z_{\max} \ge T)$ to detect linkage (Feingold et al., 1993), but does not require specification of a threshold $T$, genome region or significance level $P_{H_0}(Z_{\max} \ge T)$. For a genomewide scan, an NCP of about 4 corresponds to significant linkage, although the exact value depends on the collection of pedigrees, the score function, marker informativeness and the genetic model (Lander and Kruglyak, 1995; Ängquist and Hössjer, 2004a).
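As a numerical companion to this criterion, the sketch below evaluates the single-pedigree expression (6.1) and the pooled expression (6.2). It assumes complete marker data, a centered score function tabulated over all $2^m$ inheritance vectors, and a posterior $P_\theta(v = w|Y)$ supplied by the user; the function names and the toy inputs are hypothetical.

```python
import math


def ncp_single(scores, posterior):
    """NCP of one pedigree as in (6.1).

    scores    : centered score S(w) for each of the 2^m inheritance vectors w
    posterior : P_theta(v = w | Y) for the same inheritance vectors
    """
    n_vectors = len(scores)  # equals 2^m
    numerator = sum(s * p for s, p in zip(scores, posterior))
    denominator = math.sqrt(sum(s * s for s in scores) / n_vectors)
    return numerator / denominator


def ncp_pooled(ncps, gammas):
    """NCP of N pedigrees with weights gamma_i as in (6.2)."""
    n_ped = len(ncps)
    mean_weighted = sum(g * x for g, x in zip(gammas, ncps)) / n_ped
    root_mean_sq = math.sqrt(sum(g * g for g in gammas) / n_ped)
    return math.sqrt(n_ped) * mean_weighted / root_mean_sq


scores = [-1.0, 0.0, 0.0, 1.0]
null_ncp = ncp_single(scores, [0.25, 0.25, 0.25, 0.25])  # uniform posterior
linked_ncp = ncp_single(scores, [0.0, 0.0, 0.0, 1.0])    # posterior on one vector
pooled = ncp_pooled([0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0])
```

With a uniform posterior the NCP is zero because the score is centered, and for identical pedigrees with equal weights the pooled value is $\sqrt{N}$ times the single-pedigree NCP.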

Assuming complete MD, one has

$$\mathrm{NCP} = \sum_w S(w)P_\theta(v = w|Y)\Big/\sqrt{2^{-m}\sum_w S(w)^2}, \tag{6.1}$$


Fig. 2. 50·ANCP as function of true $h_a^2$ for different score functions: optimal (thick solid line), $S_{\mathrm{normdom}}$ with assumed $h_a^2 = 0.5$ and $k = 1.5$ (dotted line), $S_{\mathrm{wpairs}}$ with assumed $h_a^2 = 0.5$ (thin solid line) and Haseman–Elston (dashed line). The four subplots correspond to different pedigree structures. The number of Monte Carlo iterates is 5000 (a,b) and 2000 (c,d).

for one pedigree and any centered score function $S$. Here $m$ is the number of meioses of the pedigree and $v = v(\tau)$. For a collection of $N$ pedigrees, the NCP grows at rate $\sqrt{N}$, since

$$\mathrm{NCP} = \sqrt{N}\,\frac{\sum_{i=1}^{N}\gamma_i\mathrm{NCP}_i/N}{\sqrt{\sum_{i=1}^{N}\gamma_i^2/N}}, \tag{6.2}$$

with $\mathrm{NCP}_i$ the NCP and $\gamma_i$ the weight of the $i$th pedigree. We choose $\gamma_i$ as in the denominator of (6.1). For the locally optimal score function (3.5), this weighting scheme is equivalent to (3.9).

If pedigrees (including their phenotypes) are drawn from a population, the second factor of (6.2) converges to $\mathrm{ANCP} = \int\gamma(Y)\mathrm{NCP}(Y)\,dP(Y)\big/\sqrt{\int\gamma^2(Y)\,dP(Y)}$ as $N$ grows, where $dP(Y)$ is the sampling


Fig. 3. 50·ANCP as function of true $h_a^2$ for different score functions: optimal (thick solid line), $S_{\mathrm{normdom}}$ with assumed $h_a^2 = 0.5$ and $k = 1.5$ (dotted line), $S_{\mathrm{wpairs}}$ with assumed $h_a^2 = 0.5$ (thin solid line) and Haseman–Elston (dashed line). The four subplots correspond to different strengths of the penetrance parameters (Disp). The number of Monte Carlo iterates is 5000.

distribution of pedigrees including their phenotype vectors $Y$ (Hössjer, 2003b,d). Hence,

$$\mathrm{NCP} \approx \sqrt{N}\,\mathrm{ANCP}$$

for large $N$. When sampling pedigrees, we consider one fixed pedigree structure with certain pedigree members having unknown phenotypes. For the remaining pedigree members, the phenotype vector $Y$ is drawn from the fraction $\alpha$ ($0 < \alpha \le 1$) of randomly sampled $Y$ ($P_\theta(Y) = \sum_G P_\psi(Y|G)P_p(G)$) with largest weights $\gamma(Y)$. For the locally optimal score function (3.5), this means that a fraction $\alpha$ of the most informative pedigrees are considered, because the weights $\gamma_i$ are then proportional to $\sqrt{I_i^{\mathrm{compl}}}$, the square roots of the Fisher informations.

Four score functions were included in the simulations, $S_{\mathrm{wpairs}}$, $S_{\mathrm{normdom}}$, $S_{\mathrm{HE}}$ and $S_{\mathrm{optimal}}$. Since $m$ and $\sigma_t$ are assumed to be known, we use the residual vector $r = (Y - m)/\sigma_t$ in the definition of $S_{\mathrm{wpairs}}$


Fig. 4. 50·ANCP as function of true $h_a^2$ for different score functions: optimal (thick solid line), $S_{\mathrm{normdom}}$ with assumed $h_a^2 = 0.5$ and $k = 1.5$ (dotted line), $S_{\mathrm{wpairs}}$ with assumed $h_a^2 = 0.5$ (thin solid line) and Haseman–Elston (dashed line). The four subplots correspond to different degrees of dominance (Dom). The number of Monte Carlo iterates is 5000.

in (5.11) and $S_{\mathrm{normdom}}$ in (5.14). We also put $c = h_d^2 = h_s^2 = 0$ in the definition of $S_{\mathrm{wpairs}}$ and $h_d^2 = h_s^2 = 0$ and $k = 1.5$ in the definition of $S_{\mathrm{normdom}}$. The value $k = 1.5$ yields good and robust performance for a wide range of genetic models. As a score function analogue of the classical Haseman–Elston regression method for quantitative traits we included

$$S_{\mathrm{HE}}(v) = \sum_{k<l}\left(2\sigma_t^2 - (Y_k - Y_l)^2\right)\mathrm{IBD}_{kl} - C,$$

see Haseman and Elston (1972) and Hössjer (2003d). Finally, as a benchmark, we also included the optimal (in terms of NCP) score function $S_{\mathrm{optimal}}$, which is the centered version of $P(v|Y)$ (Hössjer, 2003b).

In Figures 1–6 we have plotted 50 × ANCP for complete MD, all four score functions and various genetic models (Disp, Dom, $h_a^2$, $p$), pedigrees and sampling fractions $\alpha$. This corresponds to an NCP of a sample with $N = 2500$. We assume, for simplicity of interpretation, that all families in the populations
