
Original Contribution

Making the Most of Case-Mother/Control-Mother Studies

M. Shi1, D. M. Umbach1, S. H. Vermeulen2, and C. R. Weinberg1

1Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC.

2Departments of Endocrinology; Epidemiology, Biostatistics and HTA; and Human Genetics, Radboud University Nijmegen

Medical Centre, Nijmegen, the Netherlands.

Received for publication January 30, 2008; accepted for publication May 8, 2008.

The prenatal environment plays an important role in many conditions, particularly those with onset early in life,

such as childhood cancers and birth defects. Because both maternal and fetal genotypes can influence risk,

investigators sometimes use a case-mother/control-mother design, with mother-offspring pairs as the unit of

analysis, to study genetic factors. Risk models should account for both the maternal genotype and the correlated

fetal genotype to avoid confounding. The usual logistic regression analysis, however, fails to fully exploit the fact

that these are mothers and offspring. Consider an autosomal, diallelic locus, which could be related to disease

susceptibility either directly or through linkage with a polymorphic causal locus. Three nested levels of assumptions

are often natural and plausible. The first level simply assumes Mendelian inheritance. The second further assumes

parental mating symmetry for the studied locus in the source population. The third additionally assumes parental

allelic exchangeability. Those assumptions imply certain nonlinear constraints; the authors enforce those con-

straints by using Poisson regression together with the expectation-maximization algorithm. Calculations reveal that

improvements in efficiency over the usual logistic analysis can be substantial, even if only the Mendelian assump-

tion is honored. Benefits are even more marked if, as is typical, information on genotype is missing for some

individuals.

case-control studies; genetics; linear models; polymorphism, single nucleotide; risk

Abbreviation: EM, expectation-maximization.

When studying the etiology of complex conditions with

onset early in life, such as childhood cancers, certain psy-

chiatric illnesses, congenital malformations, and pregnancy

complications, both the maternal genome and the fetal ge-

nome may influence susceptibility, and both need to be con-

sidered. Case-parent triad designs, where one genotypes

cases and both of their parents, can enable the investigator

to differentiate fetal genetic effects from maternally medi-

ated genetic effects (1–3) and can bypass the practical prob-

lems imposed by the need to recruit population controls.

Triad designs also offer robustness against a potential source

of bias called ‘‘genetic population stratification,’’ which

may arise when the population consists of incompletely

mixed subpopulations that differ both in their baseline dis-

ease risk (i.e., risk in people who do not carry the variant

allele) and in the frequency of the genetic variant being

studied. Such a population structure can produce confounding

bias in a case-control study, but not in a triad study. Triad

designs also permit assessment of parent-of-origin effects,

where inheritance of a particular genetic variant can have

effects on risk that differ according to which parent trans-

mitted it to the offspring.

These advantages aside, triad designs suffer from some

important limitations. First, fathers may be hard to recruit,

and paternity is also inherently harder to be confident of

than is maternity. A more disturbing limitation is that the

Correspondence to Dr. Clarice R. Weinberg, Biostatistics Branch, National Institute of Environmental Health Sciences, Mail Drop A3-03 101/A315,

Research Triangle Park, NC 27709 (e-mail: weinber2@niehs.nih.gov).

541 Am J Epidemiol 2008;168:541–547

American Journal of Epidemiology

Published by the Johns Hopkins Bloomberg School of Public Health 2008.

Vol. 168, No. 5

DOI: 10.1093/aje/kwn149

Advance Access publication July 23, 2008


case-parent triad design does not permit estimation of main

effects of exposures.

An alternative design calls for comparing randomly sam-

pled mother-offspring pairs in which the offspring is healthy

with mother-offspring pairs in which the offspring has the

condition under study. We shall refer to this approach as the

case-mother/control-mother design. We assume that the dis-

ease is rare in the population under study and that, although

subpopulations might vary either in their baseline risks of

disease or in their frequencies of the genetic variant, the

covariance across subpopulations between the genotype fre-

quency and baseline risk is 0 (4). In effect, we are making

the usual assumption of no uncontrolled confounding; there-

fore, a case-control design is valid for this disease and

population.

One complication of the case-mother/control-mother de-

sign (5) is that the maternal genome is a confounder for

effects of the fetal genome, because of their correlation.

Consequently, naive analyses that use separate models to

estimate effects of fetal genotypes and effects of maternal

genotypes are vulnerable to confounding bias. One should

instead fit a single model that simultaneously includes as

predictors the fetal genotype and the maternal genotype.

What has not been appreciated, however, is that the parent-

child relationship implies certain linear relations among

parameters. Our purpose in this paper is to describe those

natural family-based constraints, to demonstrate a log-linear

approach implemented through the expectation-maximization

(EM) algorithm (6) that can honor them, and to document

the power advantages they confer. We also assess the extent

to which use of the family-based constraints can improve

analytic efficiency/precision when some genotypes are ran-

domly missing.

NATURAL CONSTRAINTS BASED ON FAMILY

RELATIONSHIPS

Suppose, for simplicity, we are considering a diallelic

single nucleotide polymorphism in an autosomal gene that

could be related to disease susceptibility either causally or

through linkage with a polymorphic causal locus. Let M and

C denote the number of copies of the variant allele (i.e., 0, 1,

or 2) carried by the mother and the child, respectively. It will

not matter which allele is considered the ‘‘variant,’’ but

usually the one designated as such is the less frequent

one, the ‘‘minor’’ allele. One obvious constraint that applies

to both case pairs and control pairs is that (M,C) cannot be

(2,0) or (0,2), because a homozygous mother has to pass on

one of her two identical alleles to her child. Thus, instead of

nine mother-child pairs being possible, only seven are

possible.

Considering the father, who is not directly studied in this

design, there are nine possible pairs of parental genotypes.

Let $\mu_{mf}$ denote the population frequency of pairs of parents
in which the mother has m copies and the father f copies of
the allelic variant. Suppose control mother-child pairs are selected
at random from the source population, where transmission
from mother to offspring follows Mendelian inheritance and
survival to the time of study is nondifferential by genotype.
If the disease is rare or unrelated to the variant under study,
then the population-based distribution of mother-child paired
genotypes among controls can be expressed in terms of the
$\mu_{mf}$ parameters and Mendelian proportions (table 1). We have
simply collapsed over the missing fathers. For example, the
(0,0) cell in table 1 consists of triads with (M,F,C) equal to
(0,0,0) and (0,1,0) with expected frequencies $\mu_{00}$ and half
of $\mu_{01}$, respectively.

With no additional assumptions about the population, the

M = 1 row already implies a constraint: The expected

counts for (1,0) and for (1,2) sum to the expected count

for (1,1). Thus, the family relationship alone specifies two

structural zeroes and also a constraint.
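The collapse over the unobserved father can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the mating-type frequencies in `mu` are hypothetical, chosen to be deliberately asymmetric so that only the Mendelian assumption is in play. Exact fractions are used so the structural zeroes and the M = 1 row constraint can be checked without rounding.

```python
from fractions import Fraction

def transmission(m, f):
    """Mendelian distribution of the child's variant-allele count C, given
    maternal count m and paternal count f (each parent passes one allele)."""
    probs = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
    for a, pa in ((0, Fraction(2 - m, 2)), (1, Fraction(m, 2))):      # allele from mother
        for b, pb in ((0, Fraction(2 - f, 2)), (1, Fraction(f, 2))):  # allele from father
            probs[a + b] += pa * pb
    return probs

def control_expected(mu):
    """Collapse mating-type frequencies mu[m][f] over the unobserved father
    to obtain expected counts for the mother-child cells (M, C)."""
    cells = {(m, c): Fraction(0) for m in range(3) for c in range(3)}
    for m in range(3):
        for f in range(3):
            for c, p in transmission(m, f).items():
                cells[(m, c)] += mu[m][f] * p
    return cells

# Hypothetical, deliberately asymmetric mating-type frequencies (they sum to 150).
mu = [[40, 30, 5],
      [20, 25, 10],
      [8, 7, 5]]
cells = control_expected(mu)

for m in range(3):
    print([str(cells[(m, c)]) for c in range(3)])
```

The printed rows correspond to M = 0, 1, 2 and the columns to C = 0, 1, 2, so the layout matches table 1.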

Next, suppose that in addition to Mendelian inheritance
we assume parental mating symmetry in the source population,
at the locus under study (i.e., $\mu_{mf} = \mu_{fm}$ for all m, f).
This additional assumption reduces the nine original $\mu_{mf}$
parameters in table 1 to only six. Adjusting the cell components
of table 1 accordingly, the family relationships then
imply a second constraint for the expected counts for
mother-child pairs, (M,C): The expected difference between
the count for (1,0) and the count for (0,1) equals the expected
difference between the count for (1,2) and the count
for (2,1), namely, $(1/4)\mu_{11} - \mu_{02}$, which is the same as
$(1/4)\mu_{11} - \mu_{20}$.
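The second constraint can be verified directly from the cell formulas in table 1. In the short derivation below, $\mu_{mf}$ is the mating-type frequency for a mother with m copies and a father with f copies, and $E_{mc}$ is our own shorthand (not the authors' notation) for the expected control count in cell (M,C) = (m,c).

```latex
\begin{align*}
E_{10} - E_{01} &= \tfrac12\mu_{10} + \tfrac14\mu_{11} - \tfrac12\mu_{01} - \mu_{02}
                 = \tfrac14\mu_{11} - \mu_{02}
                 && (\text{using } \mu_{10} = \mu_{01}),\\
E_{12} - E_{21} &= \tfrac14\mu_{11} + \tfrac12\mu_{12} - \mu_{20} - \tfrac12\mu_{21}
                 = \tfrac14\mu_{11} - \mu_{20}
                 && (\text{using } \mu_{12} = \mu_{21}).
\end{align*}
```

Because mating symmetry also gives $\mu_{02} = \mu_{20}$, the two differences coincide.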

TABLE 1. Expected frequencies of control mother-child pairs under Mendelian transmission of parental alleles*

         C = 0                              C = 1                                      C = 2
M = 0    $\mu_{00} + (1/2)\mu_{01}$         $(1/2)\mu_{01} + \mu_{02}$                 0
M = 1    $(1/2)\mu_{10} + (1/4)\mu_{11}$    $(1/2)[\mu_{10} + \mu_{11} + \mu_{12}]$    $(1/4)\mu_{11} + (1/2)\mu_{12}$
M = 2    0                                  $\mu_{20} + (1/2)\mu_{21}$                 $\mu_{22} + (1/2)\mu_{21}$

* Note that $\mu_{mf}$ is proportional to the underlying frequency in the source population of parental
pairs in which the mother carries m copies of the variant and the father carries f copies and where
$\sum_m \sum_f \mu_{mf} = N_0$, the total number of control-mother pairs.

Another constraint that is often plausible is parental allelic
exchangeability, which asserts that in the source population,
conditional on the set of four alleles carried by a pair of
parents, those alleles are randomly allocated to the two
individuals. This condition is a single-locus special case of
parental haplotype exchangeability (7). This assumption is

slightly stronger than mating symmetry, but it is much

weaker than Hardy-Weinberg equilibrium because it per-

mits the existence of genetically distinct subpopulations.

Under parental allelic exchangeability, because there are

four ways to assign one variant each to two parents,

$\mu_{11} = 4\mu_{02} = 4\mu_{20}$. This exchangeability assumption also

implies the other two assumptions, and it follows that the

expected difference between the count for (1,0) and the

count for (0,1) and the expected difference between the count

for (1,2) and the count for (2,1) are not just equal to each

other but are both equal to 0. Thus, with this slightly stronger

additional assumption, now three constraints can be imposed

on the expected counts for control pairs.
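A small numerical illustration may help show why allelic exchangeability is weaker than Hardy-Weinberg equilibrium. The sketch below assumes a hypothetical source population made of two equal-sized strata that do not intermarry, each in Hardy-Weinberg equilibrium with random mating; the strata, their allele frequencies, and the function names are assumptions made for illustration only. The pooled population is not in Hardy-Weinberg equilibrium, yet the mating-type frequencies $\mu_{mf}$ still satisfy $\mu_{11} = 4\mu_{02}$ and both expected differences vanish.

```python
def hwe_genotypes(p):
    """Genotype frequencies (0, 1, 2 copies) under Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)

def mating_types(subpops):
    """mu[m][f] for a population made of non-intermarrying strata, each in
    Hardy-Weinberg equilibrium with random mating.
    subpops: list of (weight, allele_frequency) with weights summing to 1."""
    mu = [[0.0] * 3 for _ in range(3)]
    for w, p in subpops:
        g = hwe_genotypes(p)
        for m in range(3):
            for f in range(3):
                mu[m][f] += w * g[m] * g[f]
    return mu

# Hypothetical stratified population: allele frequencies 0.1 and 0.5.
mu = mating_types([(0.5, 0.1), (0.5, 0.5)])

# Expected control cell frequencies, using the formulas of table 1.
e10 = 0.5 * mu[1][0] + 0.25 * mu[1][1]
e01 = 0.5 * mu[0][1] + mu[0][2]
e12 = 0.25 * mu[1][1] + 0.5 * mu[1][2]
e21 = mu[2][0] + 0.5 * mu[2][1]

# Pooled maternal genotype frequencies and the heterozygote frequency
# that Hardy-Weinberg equilibrium would predict from the pooled allele frequency.
g_pooled = [sum(mu[m]) for m in range(3)]
p_pooled = 0.5 * g_pooled[1] + g_pooled[2]
hwe_het = 2.0 * p_pooled * (1.0 - p_pooled)

print("mu11 - 4*mu02:", mu[1][1] - 4.0 * mu[0][2])
print("E(1,0) - E(0,1):", e10 - e01, " E(1,2) - E(2,1):", e12 - e21)
print("pooled heterozygotes:", g_pooled[1], " HWE prediction:", hwe_het)
```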

What about the distribution for case-mother pairs? Under

a multiplicative model for risk of a rare condition, the ex-

pected counts for case mother-child pairs can be expressed

in terms of the $\mu_{mf}$ parameters, Mendelian proportions, and

relative risks (table 2). Here $R_1$ and $R_2$ are the relative risks

for a child with one or two copies, respectively, relative to

a child with no copies, and $S_1$ and $S_2$ are the relative risks for

a child whose mother has one or two copies, respectively,

relative to a child whose mother has no copies. The param-

eter B is the normalizing constant included to ensure that the

expected counts sum to the total number of case-mother

pairs.
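The case-pair expectations of table 2 follow from the control-pair cells by multiplying each cell by the appropriate relative risks and rescaling. The sketch below is illustrative only: the helper names and the mating-type frequencies are hypothetical, and the relative risks are set to one of the scenarios considered later in the paper ($R_1 = 2$, $R_2 = 3$, $S_1 = S_2 = 1$).

```python
def transmit(c, m, f):
    """Mendelian probability that the child carries c copies given parents (m, f)."""
    total = 0.0
    for a in (0, 1):                # allele received from mother (1 = variant)
        for b in (0, 1):            # allele received from father
            if a + b == c:
                pa = m / 2 if a else 1 - m / 2
                pb = f / 2 if b else 1 - f / 2
                total += pa * pb
    return total

def case_expected(mu, R, S, n_cases):
    """Expected case mother-child counts under the multiplicative model.
    mu[m][f]: mating-type frequencies; R[c], S[m]: relative risks by child and
    maternal copy number, with R[0] = S[0] = 1. Returns (cells, B)."""
    raw = {}
    for m in range(3):
        for c in range(3):
            # Control-type cell of table 1, with the father collapsed out.
            base = sum(mu[m][f] * transmit(c, m, f) for f in range(3))
            raw[(m, c)] = R[c] * S[m] * base
    B = n_cases / sum(raw.values())     # normalizing constant
    return {cell: B * v for cell, v in raw.items()}, B

# Hypothetical mating-type frequencies (sum = 150).
mu = [[40, 30, 5],
      [20, 25, 10],
      [8, 7, 5]]
cases, B = case_expected(mu, R=(1.0, 2.0, 3.0), S=(1.0, 1.0, 1.0), n_cases=150)

for m in range(3):
    print([round(cases[(m, c)], 3) for c in range(3)])
```

Note that B depends on the relative risks, so it is a nuisance quantity to be estimated rather than a fixed scale factor.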

FITTING MODELS THAT ENFORCE THESE

CONSTRAINTS

In the usual logistic regression model, one conditions on

all the predictor variables and models the log odds of dis-

ease. This approach is wonderfully flexible because one

does not need to specify the distribution of covariates when

maximizing the relevant conditional likelihood. The down-

side is that one has no way to impose prior knowledge about

that covariate distribution. Imposing appropriate constraints

on the covariate distribution can improve statistical effi-

ciency (3, 8).

One way to impose constraints on the covariate distribu-

tion is to use log-linear Poisson regression. If no constraints

are imposed, the results of fitting logistic and Poisson re-

gression models are identical for the same data set. Using

logistic regression to carry out a case-mother/control-

mother analysis with the (M,C) count data corresponding

to tables 1 and 2, one would fit the model:

$$\ln\left[\frac{\Pr(D \mid M, C)}{1 - \Pr(D \mid M, C)}\right] = \mu + \beta_1 I(C=1) + \beta_2 I(C=2) + \alpha_1 I(M=1) + \alpha_2 I(M=2).$$

Here, I(expression) is an indicator function which is 1 when the
expression is true and 0 when it is false. The coefficients $\beta_1$
and $\beta_2$ are the natural logarithms of $R_1$ and $R_2$, while $\alpha_1$ and
$\alpha_2$ are the natural logarithms of $S_1$ and $S_2$.

To accomplish an equivalent analysis using Poisson regression,
let $N_{mcd}$ denote the observed number of families in
which M = m, C = c, and D = d, where d is 1 for case pairs
and 0 for control pairs, and let $E(N_{mcd})$ be the expected value
of that count. One uses the 14 observed cell counts to fit the
following Poisson regression model:

$$\ln[E(N_{mcd})] = \theta_{mc} + \delta d + \beta_1 d\,I(c=1) + \beta_2 d\,I(c=2) + \alpha_1 d\,I(m=1) + \alpha_2 d\,I(m=2).$$

The seven parameters $\theta_{mc}$, one for each (M,C) cell among
controls, by allowing complete flexibility for the control-mother
distribution (consider setting d = 0), ensure that the
covariate distribution is unconstrained. An advantage to
using the Poisson version of these two identical approaches
is that, by modeling the cell counts directly, the Poisson
approach provides a way to impose constraints on the $\theta_{mc}$
parameters describing the covariate distribution.
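The correspondence between the two formulations can be made explicit by subtracting the control row of the Poisson model from the case row within the same (M,C) cell. The lines below are a sketch of that step in the notation of the Poisson model, with $\theta_{mc}$ the control-cell parameters and $\delta$ the coefficient of the case indicator d.

```latex
\begin{align*}
\ln E(N_{mc0}) &= \theta_{mc},\\
\ln E(N_{mc1}) &= \theta_{mc} + \delta + \beta_1 I(c=1) + \beta_2 I(c=2)
                  + \alpha_1 I(m=1) + \alpha_2 I(m=2),\\
\ln\frac{E(N_{mc1})}{E(N_{mc0})} &= \delta + \beta_1 I(c=1) + \beta_2 I(c=2)
                  + \alpha_1 I(m=1) + \alpha_2 I(m=2).
\end{align*}
```

The ratio on the left is the odds of being a case within cell (m,c) of the sampled data, so $\delta$ plays the role of the logistic intercept while the $\beta$ and $\alpha$ coefficients are shared by the two models. The $\theta_{mc}$ terms cancel, which is why leaving them unconstrained reproduces the logistic fit.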

An additional difficulty is that the constraints we have

described are linear constraints on the cell counts or, equivalently,

on the $\mu_{mf}$ parameters, but they are nonlinear constraints

on the $\theta_{mc}$ parameters, because those parameters are

the natural logarithms of the cell counts. Imposition of such

nonlinear constraints is not straightforward in available soft-

ware packages like Stata or SAS. Other software, for exam-

ple, LEM (log-linear expectation maximization) by van den

Oord and Vermunt (9), easily handles such constraints.

TABLE 2. Expected frequencies of case mother-child pairs under a multiplicative model for risk*

         C = 0                                    C = 1                                            C = 2
M = 0    $B[\mu_{00} + (1/2)\mu_{01}]$            $BR_1[(1/2)\mu_{01} + \mu_{02}]$                 0
M = 1    $BS_1[(1/2)\mu_{10} + (1/4)\mu_{11}]$    $(1/2)BR_1S_1[\mu_{10} + \mu_{11} + \mu_{12}]$   $BR_2S_1[(1/4)\mu_{11} + (1/2)\mu_{12}]$
M = 2    0                                        $BR_1S_2[\mu_{20} + (1/2)\mu_{21}]$              $BR_2S_2[\mu_{22} + (1/2)\mu_{21}]$

* Note that $\mu_{mf}$ denotes the underlying frequency in the source population of parental pairs in
which the mother carries m copies of the variant and the father carries f copies. $R_1$ and $R_2$ denote
the relative risks for a child with one or two copies, respectively, relative to a child with no copies;
$S_1$ and $S_2$ denote the relative risks for a child whose mother has one or two copies, respectively,
relative to the child whose mother has no copies. B is a normalizing constant included to ensure
that the expected counts will sum to the total number of case-mother pairs.

For the constraints that we are considering, it is convenient
to imagine an idealized data structure with 15 cells for
case-parent triads (as in the article by Weinberg et al. (1))
and a similar data structure with 15 cells for control-parent
triads, but where the fathers’ genotypes are all missing. One

then can use the EM algorithm to maximize the fatherless

likelihood, and it becomes easy to impose these constraints.

The assumption required by the EM algorithm that the fa-

thers’ genotypes be noninformatively missing is trivially

satisfied because all are missing. The proposed construction

automatically satisfies the linear constraint (and the struc-

tural zeroes) that follows from Mendelianism and the family

relationships. One must, of course, use the observed-data

likelihood, rather than the pseudo-complete-data likelihood,

to compute the likelihood ratio $\chi^2$ statistic.

To impose parental mating-type symmetry, one collapses

each 15-cell multinomial to a 10-cell multinomial (1), be-

cause, for example, the triple genotype (0,1,C) is merged

with (1,0,C). One can then again use the EM algorithm to

maximize the appropriate likelihood. The additional con-

straint of parental allele exchangeability (7) can be honored

by using the same 10-cell multinomials but using a single

stratum parameter for both the {0,2} and {1,1} parental

strata and assigning offsets that are the logarithms of 2, 1, 2,

and 1 to the MFC triads (0,2,1), (1,1,0), (1,1,1), and (1,1,2),

respectively. (Here ‘‘(0,2,1)’’ includes ‘‘(2,0,1),’’ because

parental switches are treated as equivalent under mating

symmetry.)
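The idea of treating fathers as missing and letting the EM algorithm enforce the constraints can be sketched for control pairs alone. The code below is a simplified illustration under stated assumptions, not the authors' LEM scripts: it omits case pairs and relative risks, uses hypothetical observed counts, and imposes mating symmetry by averaging each expected parental-pair count with its mirror image in the M-step. All function names are our own.

```python
import math

def transmit(c, m, f):
    """Mendelian probability that the child carries c copies given parents (m, f)."""
    total = 0.0
    for a in (0, 1):                # allele from mother (1 = variant)
        for b in (0, 1):            # allele from father
            if a + b == c:
                total += (m / 2 if a else 1 - m / 2) * (f / 2 if b else 1 - f / 2)
    return total

# The seven mother-child cells that Mendelian inheritance allows.
CELLS = [(m, c) for m in range(3) for c in range(3) if abs(m - c) <= 1]

def fitted_cells(mu):
    """Expected (M, C) counts implied by mating-type frequencies mu[m][f]."""
    return {(m, c): sum(mu[m][f] * transmit(c, m, f) for f in range(3))
            for (m, c) in CELLS}

def em_step(mu, counts):
    # E-step: split each observed (m, c) count over the unobserved father genotype.
    pair = [[0.0] * 3 for _ in range(3)]
    for (m, c), n in counts.items():
        weights = [mu[m][f] * transmit(c, m, f) for f in range(3)]
        total = sum(weights)
        for f in range(3):
            pair[m][f] += n * weights[f] / total
    # M-step under mating symmetry: mu[m][f] = mu[f][m].
    return [[0.5 * (pair[m][f] + pair[f][m]) for f in range(3)] for m in range(3)]

def loglik(mu, counts):
    """Observed-data (fatherless) multinomial log-likelihood, up to a constant."""
    fit = fitted_cells(mu)
    scale = sum(fit.values())
    return sum(n * math.log(fit[cell] / scale) for cell, n in counts.items() if n > 0)

# Hypothetical observed control counts (150 pairs).
counts = {(0, 0): 52, (0, 1): 21, (1, 0): 24, (1, 1): 30,
          (1, 2): 10, (2, 1): 9, (2, 2): 4}

mu = [[sum(counts.values()) / 9.0] * 3 for _ in range(3)]   # flat starting values
history = []
for _ in range(500):
    history.append(loglik(mu, counts))
    mu = em_step(mu, counts)
fit = fitted_cells(mu)

for cell in CELLS:
    print(cell, counts[cell], round(fit[cell], 3))
```

Because the fitted cells are always computed from a symmetric set of $\mu_{mf}$ values, they satisfy the Mendelian and mating-symmetry constraints at every iteration. The individual $\mu_{mf}$ values need not be uniquely determined by seven cells, but the fitted cell counts are what enter the likelihood.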

In an actual case-parent/control-parent study, a proportion

of the genotypes will be missing, either because the individ-

ual was not studied (e.g., the baby did not survive, or um-

bilical cord blood but not maternal blood was retained) or

because the laboratory could not assign the genotype. Not

only does the Poisson approach together with the EM algo-

rithm (6) permit imposition of constraints, it facilitates the

use of partial data when genotypes are missing. For such an

approach to be valid, one must assume that missingness is

noninformative—that is, missingness is random conditional

on disease status and the observed genotypes. Thus, if some

offspring genotypes are missing due to failure to survive,

one must assume that survival is unrelated to the unobserved

genotype among case mother-offspring pairs and also un-

related to the unobserved genotype among control mother-

offspring pairs.

POWER COMPARISONS

To evaluate the power gains possible by exploiting vari-

ous constraints in the analysis of a case-mother/control-

mother study, we considered a study of 150 case-mother

pairs and 150 control-mother pairs. For convenience, we

employed a source population in which the single nucleo-

tide polymorphism was in Hardy-Weinberg equilibrium.

Hardy-Weinberg equilibrium is neither necessary nor as-

sumed in our analyses, but it simplifies power calculations

by allowing us to specify the $\mu_{mf}$ parameters as simple

functions of allele frequency. A source population in

Hardy-Weinberg equilibrium satisfies all three assumptions.

Assuming the model of tables 1 and 2, we examined several

risk scenarios defined by $R_1$, $R_2$, $S_1$, and $S_2$ over a range of

allele frequencies.
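For concreteness, one way to write the mating-type frequencies as functions of allele frequency is given below. This is our reading rather than a formula stated in the text: it assumes random mating in addition to Hardy-Weinberg genotype proportions, with p the variant allele frequency, $g_m$ the genotype frequencies, and $N_0$ the number of control pairs.

```latex
\begin{align*}
g_0 &= (1-p)^2, \qquad g_1 = 2p(1-p), \qquad g_2 = p^2,\\
\mu_{mf} &= N_0\, g_m\, g_f, \qquad m, f \in \{0, 1, 2\}.
\end{align*}
```

Under this form $\mu_{mf} = \mu_{fm}$ and $\mu_{11} = 4N_0p^2(1-p)^2 = 4\mu_{02}$, so mating symmetry and allelic exchangeability both hold.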

For complete data, we calculated power for the usual

logistic regression analysis (no constraints) and for our pro-

posed analysis under each of the three nested levels of con-

straints. We also calculated power for a case-parent triad

design with 150 cases. We repeated these calculations for

scenarios with 20 percent of the genotypes randomly miss-

ing. For these missing-genotype scenarios, we considered

two versions of the unconstrained logistic analysis: one re-

stricted to mother-offspring pairs with complete genotype

data and one where pairs with missing data were included

via Poisson regression and the EM algorithm. Constrained

analyses always included pairs with missing data.

We studied the noncentrality parameter, equivalently the
power, for the 4-df $\chi^2$ likelihood ratio test of the null
hypothesis that $R_1 = R_2 = S_1 = S_2 = 1$. We made use of the
fact that the noncentrality parameter of the likelihood ratio
test statistic under a specified alternative can be closely
approximated by the likelihood ratio statistic calculated by
treating the expected counts under that alternative as if they were

data (10). We calculated expected cell counts under various

scenarios using the formulae in tables 1 and 2 and employed

LEM software (9) to maximize the observed data likelihoods

under the null and alternative hypotheses. LEM software is

freely available, and the reader can download the LEM

‘‘scripts’’ that we used for maximizing the relevant observed

data likelihoods under any of our three sets of assumptions

from our website (http://www.niehs.nih.gov/research/atniehs/

labs/bb/staff/weinberg/index.cfm#downloads).
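The approximation to the noncentrality parameter can be written compactly. The expression below is a sketch in our own notation, assuming a correctly specified alternative model so that its fitted values reproduce the expected counts: $e_i$ is the expected count in observed cell i under the alternative, and $\hat{e}_i^{(0)}$ is the fitted count obtained by maximizing the observed-data likelihood under the null hypothesis with the $e_i$ treated as data.

```latex
\[
\nu \;\approx\; 2 \sum_i e_i \,\ln\!\frac{e_i}{\hat{e}_i^{(0)}} .
\]
```

The sum runs over all observed cells for case pairs and control pairs, and the null fit must respect whichever set of constraints is being imposed.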

We plotted the noncentrality parameters as a function of

allele frequency for analyses under different sets of assump-

tions and included horizontal reference lines corresponding

to specific power values for a 0.05-level 4-df likelihood ratio

test. When the noncentrality parameter exceeds a particular

cutpoint, the power exceeds the specified power. To modify

the number of cases studied to some other number, say

K, one can simply multiply these noncentrality values by

K/150. The ratio of any two of the case-mother/control-

mother curves corresponds to the relative efficiency of the

two analytic approaches, across allele frequencies—that is,

the approximate ratio of sample sizes required to achieve

any desired level of statistical power.
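The link between a noncentrality value and power can be sketched without special libraries. The code below is an illustration, not the authors' procedure: it relies on the standard representation of a noncentral chi-square as a Poisson mixture of central chi-squares and on the closed-form tail probability for even degrees of freedom, so it applies to the 4-df test discussed here but not to odd df. The function names are our own.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability of a central chi-square with even df (closed form)."""
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

def critical_value(alpha=0.05, df=4):
    """Critical value found by bisection on the tail probability."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf_even(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def power(ncp, alpha=0.05, df=4, terms=200):
    """Power of the chi-square test for noncentrality ncp, using a
    Poisson(ncp / 2) mixture of central chi-squares with df + 2j df."""
    crit = critical_value(alpha, df)
    weight = math.exp(-ncp / 2.0)       # Poisson probability of j = 0
    total = 0.0
    for j in range(terms):
        total += weight * chi2_sf_even(crit, df + 2 * j)
        weight *= (ncp / 2.0) / (j + 1)
    return total

def rescale_ncp(ncp_150, k_cases):
    """Rescale a noncentrality value computed for 150 cases to K cases."""
    return ncp_150 * k_cases / 150.0

for ncp in (5.0, 10.0, 15.0, 20.0):
    print(ncp, round(power(ncp), 3), round(power(rescale_ncp(ncp, 300)), 3))
```

The second printed column gives the power at the stated noncentrality and the third the power after doubling the number of cases, which shows how the K/150 rescaling translates into power.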

Consider first a scenario in which $R_1$, $R_2$, $S_1$, and $S_2$ are

2, 3, 1, and 1, respectively. This scenario includes a gene-

dose effect of the fetal genotype with no effects of the

maternal genotype. Under this scenario, noncentrality curves

for analyses that impose the constraints lie above the curve

for the usual logistic regression analysis (figure 1, panel A).

Simply imposing the fact that the mother-offspring pairs

reflect Mendelian proportions improved power, particularly

at allele frequencies below 0.5. Imposing two constraints or

all three together improved the power even more. For this

scenario, a case-parent design with 150 case triads provided

power comparable to that of an unconstrained case-mother/

control-mother logistic analysis with 150 case-mother pairs

and 150 control-mother pairs.

We obtained qualitatively similar results with two addi-

tional risk scenarios. In a scenario where the only effect is

a recessive effect of the fetal genotype ($R_1$, $R_2$, $S_1$, and $S_2$ are

1, 3, 1, and 1, respectively), increasing the number of im-

posed constraints again increased the power across all allele

frequencies (figure 1, panel B). The power advantage of

even the simple familial constraint was marked. For this


scenario, the power of the triad design exceeded that of the

unconstrained case-mother/control-mother analysis and

even exceeded that of some constrained analyses at low

allele frequency. In a scenario where the fetal genotype

has a recessive effect and the maternal genotype has a dominant

effect ($R_1$, $R_2$, $S_1$, and $S_2$ are 1, 3, 2, and 2, respectively),

power again increased as more constraints were

imposed on the analysis (figure 1, panel C). In this scenario,

FIGURE 1. Noncentrality parameter and power as a function of allele frequency for the case-mother/control-mother design and case-parent triad
design. The vertical axes show, in the left column, the $\chi^2$ noncentrality parameter for a 4-df likelihood ratio test, and, in the right column, the power of
a corresponding test with $\alpha$ = 0.05. Left column (panels A–C): no missing data; right column (panels D–F): 20% of genotypes missing. First row
(panels A and D): $R_1 = 2$, $R_2 = 3$, $S_1 = 1$, $S_2 = 1$; second row (panels B and E): $R_1 = 1$, $R_2 = 3$, $S_1 = 1$, $S_2 = 1$; third row (panels C and F): $R_1 = 1$,
$R_2 = 3$, $S_1 = 2$, $S_2 = 2$. Curves for a case-mother/control-mother design with 150 case-mother pairs and 150 control-mother pairs: logistic
regression using all pairs (solid line), logistic regression omitting pairs with missing genotypes (short-dashed line, panels D–F only),
log-linear Poisson regression using all pairs and imposing only the family relationship constraint (dotted line), similar analysis that additionally
imposes mating symmetry (dashed-dotted line), and similar analysis that additionally imposes parental allelic exchangeability (long-dashed
line). Curve for a case-parent triad design with 150 triads: log-linear Poisson regression using all triads (dashed-dotted-dotted line).
For panels A and D, curves for the model imposing mating symmetry (dashed-dotted line) and the model imposing parental allelic
exchangeability (long-dashed line) overlap.