
American Journal of Epidemiology

© The Author 2010. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of

Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Vol. 172, No. 12

DOI: 10.1093/aje/kwq332

Advance Access publication:

October 29, 2010

Original Contribution

Odds Ratios for Mediation Analysis for a Dichotomous Outcome

Tyler J. VanderWeele* and Stijn Vansteelandt

* Correspondence to Dr. Tyler J. VanderWeele, Departments of Epidemiology and Biostatistics, Harvard School of Public Health,

677 Huntington Avenue, Boston, MA 02115 (e-mail: tvanderw@hsph.harvard.edu).

Initially submitted November 23, 2009; accepted for publication August 26, 2010.

For dichotomous outcomes, the authors discuss when the standard approaches to mediation analysis used in

epidemiology and the social sciences are valid, and they provide alternative mediation analysis techniques when the

standard approaches will not work. They extend definitions of controlled direct effects and natural direct and indirect

effects from the risk difference scale to the odds ratio scale. A simple technique to estimate direct and indirect

effect odds ratios by combining logistic and linear regressions is described that applies when the outcome is rare and

the mediator continuous. Further discussion is given as to how this mediation analysis technique can be extended

to settings in which data come from a case-control study design. For the standard mediation analysis techniques

used in the epidemiologic and social science literatures to be valid, an assumption of no interaction between the

effects of the exposure and the mediator on the outcome is needed. The approach presented here, however, will

apply even when there are interactions between the effect of the exposure and the mediator on the outcome.

case-control studies; causal inference; decomposition; dichotomous response; epidemiologic methods; interac-

tion; logistic regression; odds ratio

Abbreviations: CDE, controlled direct effect; CI, confidence interval; ESE, empirical standard error; NDE, natural direct effect; NIE,

natural indirect effect; OR, odds ratio; SSE, estimated standard error; TE, total effect.

Editor’s note: Invited commentaries on this article ap-

pear on pages 1349 and 1352, and the authors’ response

is published on page 1355.

The causal inference literature has made a considerable

contribution to mediation analysis by providing definitions

for direct and indirect effects that allow for the effect de-

composition of a total effect into a direct and an indirect

effect even in settings involving nonlinearities and interac-

tions (1, 2), thereby circumventing an important limitation

to the concepts and methods for mediation that have been

used in the social sciences (2). The causal inference litera-

ture on mediation has focused on the risk difference scale.

Many analyses in epidemiology, however, use the odds ratio

scale because the outcome is dichotomous and the data arise

from a case-control study design.

In this paper, we consider the use of the odds ratio scale

for mediation analysis. The use of this scale has the advan-

tage that, when the outcome is rare and the mediator con-

tinuous, direct and indirect effects can be estimated through

very simple regressions, even with data arising from a case-

control study design. Under certain no-interaction assump-

tions, this technique reduces to the approach often used in

the epidemiologic literature of including an intermediate

variable in a logistic regression to assess mediation. How-

ever, when the no-interaction assumption does not hold, the

approach described in the present paper can still be used.

DIRECT AND INDIRECT EFFECTS ODDS RATIOS

We will let A denote an exposure of interest, Y a dichotomous outcome, and M a potential mediator. We let C denote

a set of baseline covariates not affected by the exposure. The

relations among these variables are depicted in Figure 1. For

example, A may denote estrogen therapy, M serum lipid

concentrations, and Y cardiovascular disease. A question

of interest may then be the extent to which the effect of

1339 Am J Epidemiol 2010;172:1339–1348


estrogen therapy A on cardiovascular disease Y is mediated

through serum lipid concentrations M and the extent to

which it is through other pathways (3, 4). For simplicity

in the example, we suppose treatment is binary and let

A = 1 denote estrogen therapy and A = 0 otherwise.

To address this and similar questions concerning media-

tion, we use the counterfactual framework (5, 6). We will let

$Y_a$ and $M_a$ denote, respectively, the values of the outcome and mediator that would have been observed had the exposure A been set, possibly contrary to fact, to level a. We will let $Y_{am}$ denote the value of the outcome that would have been

observed had the exposure, A, and the mediator, M, been set,

possibly contrary to fact, to levels a and m, respectively. We

also assume the technical assumptions called ‘‘consistency’’

and ‘‘composition’’ generally presupposed in the causal in-

ference literature and described elsewhere (7–9).

We extend the definitions of direct and indirect effects (1, 2) in causal inference from the risk difference to the odds ratio scale. On the risk difference scale, the total effect, conditional on C = c, comparing exposure level a with a*, is defined by $E[Y_a - Y_{a^*} \mid c]$ and compares the average outcome in stratum C = c if A had been set to a with the average outcome in stratum C = c if A had been set to a*. On the odds ratio (OR) scale, the total effect (TE), conditional on C = c, comparing exposure level a with a*, is defined by

$$\mathrm{OR}^{TE}_{a,a^*|c} = \frac{P(Y_a = 1 \mid c)/\{1 - P(Y_a = 1 \mid c)\}}{P(Y_{a^*} = 1 \mid c)/\{1 - P(Y_{a^*} = 1 \mid c)\}}$$

and compares the odds of outcome Y = 1 in stratum C = c if A had been a with the odds of outcome Y = 1 in stratum C = c if A had been a*. In the context of the cardiovascular example, if we let a = 1 denote the estrogen therapy and a* = 0 denote no therapy, then $\mathrm{OR}^{TE}_{1,0|c}$ would be the odds ratio for cardiovascular disease comparing estrogen therapy with no therapy for individuals with covariate values c.

As with the total causal effect, we can also define direct and indirect effects on either the risk difference or the odds ratio scale. We will adopt the definitions and nomenclature of Pearl (2) for the risk difference scale and extend these concepts to the odds ratio scale. On the risk difference scale, the controlled direct effect, conditional on C = c, comparing exposure level a with a* and fixing the mediator to level m, is defined by $E[Y_{am} - Y_{a^*m} \mid c]$ and captures the effect of exposure A on outcome Y, intervening to fix M to m. On the odds ratio scale, one could define the conditional controlled direct effect (CDE) as

$$\mathrm{OR}^{CDE}_{a,a^*|c}(m) = \frac{P(Y_{am} = 1 \mid c)/\{1 - P(Y_{am} = 1 \mid c)\}}{P(Y_{a^*m} = 1 \mid c)/\{1 - P(Y_{a^*m} = 1 \mid c)\}}.$$

If A is binary, this is

$$\frac{P(Y_{1m} = 1 \mid c)/\{1 - P(Y_{1m} = 1 \mid c)\}}{P(Y_{0m} = 1 \mid c)/\{1 - P(Y_{0m} = 1 \mid c)\}}.$$

Note that these conditional controlled direct effects may

vary with m when there is interaction between the effects

of A and M on the odds ratio scale. In the cardiovascular

example, $\mathrm{OR}^{CDE}_{1,0|c}(m)$ would denote the odds ratio for cardiovascular disease comparing therapy and no therapy with

serum lipid concentrations fixed at level m.

The so-called “natural direct effect” (2) or “pure direct effect” (1) differs from the controlled direct effect in that the intermediate M is set to the level $M_{a^*}$, the level it would have naturally been under some reference condition for the exposure, A = a*; the natural direct effect, conditional on C = c, on the risk difference scale thus takes the form $E[Y_{aM_{a^*}} - Y_{a^*M_{a^*}} \mid c]$. The natural direct effect thus captures the effect of the exposure, estrogen therapy, on the outcome, cardiovascular disease, intervening to set the mediator, serum lipid concentration, to the level it would have been under the reference exposure level (e.g., no estrogen therapy). The conditional natural direct effect (NDE) odds ratio can be defined analogously and takes the form

$$\mathrm{OR}^{NDE}_{a,a^*|c}(a^*) = \frac{P(Y_{aM_{a^*}} = 1 \mid c)/\{1 - P(Y_{aM_{a^*}} = 1 \mid c)\}}{P(Y_{a^*M_{a^*}} = 1 \mid c)/\{1 - P(Y_{a^*M_{a^*}} = 1 \mid c)\}}.$$

On the odds ratio scale, the conditional natural direct effect can be interpreted as comparing the odds, conditional on C = c, of the outcome Y if exposure had been a, but if the mediator had been fixed to $M_{a^*}$ (i.e., to what it would have been if exposure had been a*) to the odds, conditional on C = c, of the outcome Y if exposure had been a* but if the mediator had been fixed at the same level $M_{a^*}$. This would

capture the odds ratio for cardiovascular disease comparing

therapy with no therapy intervening to set the serum lipid

concentration to the level it would have been for each sub-

ject had they not had estrogen therapy.

One can similarly define a natural indirect effect. On the risk difference scale, the conditional natural indirect effect can be defined as $E[Y_{aM_a} - Y_{aM_{a^*}} \mid c]$, which compares, conditional on C = c, the effect of the mediator at levels $M_a$ and $M_{a^*}$ on the outcome when exposure A is set to a. The conditional natural indirect effect (NIE) can be defined analogously on the odds ratio scale as

$$\mathrm{OR}^{NIE}_{a,a^*|c}(a) = \frac{P(Y_{aM_a} = 1 \mid c)/\{1 - P(Y_{aM_a} = 1 \mid c)\}}{P(Y_{aM_{a^*}} = 1 \mid c)/\{1 - P(Y_{aM_{a^*}} = 1 \mid c)\}}.$$

On the odds ratio scale, the conditional natural indirect effect can be interpreted as comparing the odds, conditional on C = c, of the outcome Y if exposure had been a but if the mediator had been fixed to $M_a$ (i.e., to what it would have been if exposure had been a) to the odds, conditional on C = c, of the outcome Y if exposure had been a but if the mediator had been fixed to $M_{a^*}$ (i.e., to what it would have been if exposure had been a*). The natural indirect effect odds ratio thus captures the odds ratio for cardiovascular disease comparing serum lipid concentration under therapy and no therapy if the subject had in fact had estrogen therapy. As discussed elsewhere, controlled direct effects are often of greater interest in policy evaluation (2, 10), whereas natural direct and indirect effects are often of greater interest in evaluating the action of various mechanisms (10, 11). Note that throughout this paper we will consider all effects conditional on the covariates C, and we will thus use expressions such as “natural direct effect” and “conditional natural direct effect” interchangeably.

Figure 1. Example of mediation with exposure A, mediator M, outcome Y, and covariates C.

On the risk difference scale, natural direct and indirect effects have the property that the total effect $E[Y_a - Y_{a^*} \mid c]$ decomposes into a natural direct and indirect effect:

$$
\begin{aligned}
E[Y_a - Y_{a^*} \mid c] &= E[Y_{aM_a} - Y_{a^*M_{a^*}} \mid c] \\
&= E[Y_{aM_a} - Y_{aM_{a^*}} \mid c] + E[Y_{aM_{a^*}} - Y_{a^*M_{a^*}} \mid c].
\end{aligned}
$$

The decomposition holds even when there are nonlinearities and interactions. On the odds ratio scale, the natural direct and indirect effects also have a decomposition property. On the odds ratio scale, the odds ratio for the total effect decomposes into a product of odds ratios for the natural direct and indirect effect:

$$
\begin{aligned}
\mathrm{OR}^{TE}_{a,a^*|c} &= \frac{P(Y_a = 1 \mid c)/\{1 - P(Y_a = 1 \mid c)\}}{P(Y_{a^*} = 1 \mid c)/\{1 - P(Y_{a^*} = 1 \mid c)\}} \\
&= \frac{P(Y_{aM_a} = 1 \mid c)/\{1 - P(Y_{aM_a} = 1 \mid c)\}}{P(Y_{a^*M_{a^*}} = 1 \mid c)/\{1 - P(Y_{a^*M_{a^*}} = 1 \mid c)\}} \\
&= \frac{P(Y_{aM_a} = 1 \mid c)/\{1 - P(Y_{aM_a} = 1 \mid c)\}}{P(Y_{aM_{a^*}} = 1 \mid c)/\{1 - P(Y_{aM_{a^*}} = 1 \mid c)\}} \times \frac{P(Y_{aM_{a^*}} = 1 \mid c)/\{1 - P(Y_{aM_{a^*}} = 1 \mid c)\}}{P(Y_{a^*M_{a^*}} = 1 \mid c)/\{1 - P(Y_{a^*M_{a^*}} = 1 \mid c)\}},
\end{aligned}
$$

where the first expression in the product is the natural indirect effect odds ratio, $\mathrm{OR}^{NIE}_{a,a^*|c}(a)$, and the second expression is the natural direct effect odds ratio, $\mathrm{OR}^{NDE}_{a,a^*|c}(a^*)$. On the log scale, this is

$$\log(\mathrm{OR}^{TE}_{a,a^*|c}) = \log(\mathrm{OR}^{NIE}_{a,a^*|c}(a)) + \log(\mathrm{OR}^{NDE}_{a,a^*|c}(a^*)).$$

The ratio, $\log(\mathrm{OR}^{NIE}_{a,a^*|c}(a)) / \log(\mathrm{OR}^{TE}_{a,a^*|c})$, thus constitutes a measure of the proportion of the effect of the exposure mediated by the intermediate on the log odds scale. If the outcome is rare, one can use

$$\frac{\mathrm{OR}^{NDE}_{a,a^*|c}(a^*) \times \{\mathrm{OR}^{NIE}_{a,a^*|c}(a) - 1\}}{\mathrm{OR}^{NDE}_{a,a^*|c}(a^*) \times \mathrm{OR}^{NIE}_{a,a^*|c}(a) - 1}$$

as a measure of the proportion mediated on the risk difference scale. We have given formulas for the “pure natural direct effect” and the “total natural indirect effect” (1); refer to the Web Appendix, which is posted on the Journal’s Web site (http://aje.oxfordjournals.org/) for further discussion of these measures and for analogous formulas for the “total natural direct effect” and the “pure natural indirect effect” (1).
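As a minimal illustration of the product decomposition and the two proportion-mediated measures described above, the following Python sketch starts from three counterfactual risks for a single covariate stratum. The numerical risks and all function names are hypothetical and chosen only for illustration; in practice these risks are not observed directly and would have to be estimated under the identification assumptions.

```python
import math


def odds(p):
    """Convert a probability p in (0, 1) to odds."""
    return p / (1.0 - p)


def decompose(p_a_ma, p_a_mastar, p_astar_mastar):
    """Return (OR_TE, OR_NDE, OR_NIE) from three counterfactual risks.

    p_a_ma         : P(Y_{a M_a} = 1 | c)
    p_a_mastar     : P(Y_{a M_{a*}} = 1 | c)
    p_astar_mastar : P(Y_{a* M_{a*}} = 1 | c)
    """
    or_nie = odds(p_a_ma) / odds(p_a_mastar)
    or_nde = odds(p_a_mastar) / odds(p_astar_mastar)
    or_te = odds(p_a_ma) / odds(p_astar_mastar)
    return or_te, or_nde, or_nie


def proportion_mediated_log_odds(or_nde, or_nie):
    """Share of log(OR_TE) attributable to the natural indirect effect."""
    return math.log(or_nie) / (math.log(or_nde) + math.log(or_nie))


def proportion_mediated_risk_difference(or_nde, or_nie):
    """Rare-outcome approximation to the proportion mediated (risk difference scale)."""
    return or_nde * (or_nie - 1.0) / (or_nde * or_nie - 1.0)


# Hypothetical counterfactual risks for one covariate stratum (illustrative only).
or_te, or_nde, or_nie = decompose(0.030, 0.024, 0.020)
pm_log = proportion_mediated_log_odds(or_nde, or_nie)
pm_rd = proportion_mediated_risk_difference(or_nde, or_nie)
```

With these hypothetical risks the proportion mediated on the risk difference scale is (0.030 - 0.024)/(0.030 - 0.020) = 0.6, and because the outcome is rare the odds-ratio-based approximation `pm_rd` should be close to that value.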

Under certain assumptions that the set of covariates C

contains all relevant confounding variables, the direct and

indirect effects can be identified with observed data. We will

follow the exposition of VanderWeele (12) and VanderWeele

and Vansteelandt (9) on the identification assumptions pro-

posed by Pearl (2). These identification assumptions were

presented to identify direct and indirect effects on the risk

difference scale but they apply also to the odds ratio scale.

To identify total effects, it is generally assumed that, conditional on some set of measured covariates C, the effect of exposure A on outcome Y is unconfounded; in counterfactual notation, this is $Y_a \perp\!\!\!\perp A \mid C$, where we use the independence symbol $\perp\!\!\!\perp$ to denote that $Y_a$ is independent of A conditional on C. In practice, to make this assumption more plausible, a researcher will attempt to collect data on a sufficiently rich set of covariates C to try to control for confounding of the exposure-outcome relation. If this assumption holds, then the odds ratio for the total causal effect, $\mathrm{OR}^{TE}_{a,a^*|c}$, is identified and can be estimated from the data using

$$\frac{P(Y_a = 1 \mid c)/\{1 - P(Y_a = 1 \mid c)\}}{P(Y_{a^*} = 1 \mid c)/\{1 - P(Y_{a^*} = 1 \mid c)\}} = \frac{P(Y = 1 \mid a, c)/\{1 - P(Y = 1 \mid a, c)\}}{P(Y = 1 \mid a^*, c)/\{1 - P(Y = 1 \mid a^*, c)\}}.$$

The left-hand side is the odds ratio for the total causal effect, $\mathrm{OR}^{TE}_{a,a^*|c}$; the right-hand side is an expression that can be estimated from the data.

Controlled direct effects on the risk difference or risk ratio scale are identified if conditioning on the set of covariates C suffices to control for confounding of both the exposure-outcome and the mediator-outcome relations. In counterfactual notation, these 2 assumptions can, respectively, be written as that for all a and m,

$$Y_{am} \perp\!\!\!\perp A \mid C \qquad (1)$$

$$Y_{am} \perp\!\!\!\perp M \mid \{A, C\}. \qquad (2)$$

Assumption 1 is similar to the no-unmeasured-confounding assumption for total effects. Assumption 2 requires that, conditional on {A, C}, there is no unmeasured

confounding for the mediator-outcome relation. If assump-

tion 1 is satisfied but assumption 2 fails (i.e., if there is me-

diator-outcome confounding), then estimators for the direct

and indirect effect will in general be biased (1, 2, 13, 14).

Thus, in the cardiovascular example, if U denoted some as-

pect of diet that was associated with serum lipid levels and

was also associated with cardiovascular disease, then it would

be necessary to control for U in estimating the direct effect of

estrogen therapy on cardiovascular disease controlling for

serum lipid levels. If estrogen therapy were randomized, then

its effect on serum lipid concentrations only or on cardiovas-

cular disease only could be estimated without control for

U but, when the direct effect of estrogen therapy on


cardiovascular disease controlling for serum lipid concentra-

tions is of interest, data on U would be needed.

Unfortunately, in many studies using mediation analysis,

little attention is given to data collection for variables con-

founding the mediator-outcome relation. Effort is often

made to collect data on some set of covariates C that suffice

to control for confounding of the exposure-outcome relation

so that assumption 1 is satisfied, but this will not ensure that

assumption 2 is satisfied. As noted above, when there are

mediator-outcome confounding variables that are unmea-

sured or for which control has not been made, estimates

of direct and indirect effects will generally be biased. In

epidemiologic research for which questions of mediation

are of interest, greater effort should be made to collect data

on potential mediator-outcome confounders. When these

assumptions 1 and 2 do not hold, then sensitivity analysis

for mediation for violations of the no-unmeasured con-

founding assumptions should be used (15, 16). If assump-

tions 1 and 2 hold, then the controlled direct effect on the

risk difference scale and on the odds ratio scale is identified,

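and $\mathrm{OR}^{CDE}_{a,a^*|c}(m)$ is then given by

$$\frac{P(Y_{am} = 1 \mid c)/\{1 - P(Y_{am} = 1 \mid c)\}}{P(Y_{a^*m} = 1 \mid c)/\{1 - P(Y_{a^*m} = 1 \mid c)\}} = \frac{P(Y = 1 \mid a, m, c)/\{1 - P(Y = 1 \mid a, m, c)\}}{P(Y = 1 \mid a^*, m, c)/\{1 - P(Y = 1 \mid a^*, m, c)\}}.$$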

For the identification of natural direct and indirect effects, additional assumptions are needed. Natural direct and indirect effects will be identified if, in addition to assumptions 1 and 2, the following 2 assumptions hold, that for all a, a*, and m,

$$M_a \perp\!\!\!\perp A \mid C \qquad (3)$$

$$Y_{am} \perp\!\!\!\perp M_{a^*} \mid C. \qquad (4)$$

Assumption 3 can be interpreted as that, conditional on C,

there is no unmeasured confounding for the exposure-

mediator relation. Assumption 4 will hold if confounding

for the mediator-outcome relation can be controlled for by

some set of baseline covariates C, so that there is no effect of

exposure A that confounds the mediator-outcome relation

(i.e., no effect L of exposure A that itself affects both

M and Y). Thus, assumption 4 would be violated in the case

of Figure 2. In some settings, assumption 4 may be plausible

if the mediator M occurs shortly after the exposure A (9). If,

however, there is a variable L that is an effect of A and affects

both M and Y, then assumption 4 is violated and natural direct

and indirect effects will not in general be identified (17),

irrespective of whether data are available on L. In such set-

tings, it may still be possible to identify controlled direct

effect odds ratios, but alternative statistical approaches such

as marginal structural models (12, 18, 19) or structural nested

models (20–24) will generally be needed. Note that none of

assumptions 1–4 can be tested by using data; a researcher will

have to rely on subject matter knowledge in evaluating them.

In the next section, we will show how natural direct and in-

direct effects can be estimated in a relatively straightforward

manner using regression.

REGRESSION ANALYSIS FOR DIRECT AND INDIRECT

EFFECT ODDS RATIOS

In this section, we describe a simple regression technique

that can be used to estimate controlled direct effect and nat-

ural direct and indirect effect odds ratios when the assump-

tions above hold. The estimation technique for controlled

direct effect odds ratios will require only assumptions 1 and

2 and will make use of a single logistic regression. The esti-

mation technique for natural direct and indirect effect odds

ratios will require assumptions 1–4 above and will combine

the results of a linear and logistic regression to obtain the

effects of interest; the estimation technique for natural direct

and indirect effects will also require that the outcome Y is rare

so that odds ratios approximate risk ratios, which allows one

to obtain particularly simple formulae. We consider a setting

in which the mediator M is continuous and the outcome Y is

dichotomous. We have described a similar approach for con-

tinuous outcomes elsewhere (9). Derivations for the results

below are given in the Web Appendix.

Consider the use of the following 2 models, a logistic regression for the outcome Y (with no A × M product term) and a linear regression for the mediator M:

$$\mathrm{logit}\{P(Y = 1 \mid a, m, c)\} = \theta_0 + \theta_1 a + \theta_2 m + \theta_4' c \qquad (5)$$

and

$$E[M \mid a, c] = \beta_0 + \beta_1 a + \beta_2' c, \qquad (6)$$

where the error term for the linear regression for M is normally distributed with constant variance. If assumptions 1–4 hold and if regression models 5 and 6 are correctly specified, then the controlled and natural direct effect and natural indirect effect odds ratios are given by

$$\mathrm{OR}^{NDE}_{a,a^*|c}(a^*) \approx \mathrm{OR}^{CDE}_{a,a^*|c}(m) = \exp\{\theta_1 (a - a^*)\}$$

$$\mathrm{OR}^{NIE}_{a,a^*|c}(a) \approx \exp\{\theta_2 \beta_1 (a - a^*)\},$$

where the approximation holds to the extent the rare outcome assumption holds. These expressions essentially use $\theta_1$ for the direct effect and $\theta_2\beta_1$ for the indirect effect, and these expressions are also often used in the social science literature for mediation analysis with a dichotomous outcome (25, 26). The use of models 5 and 6 along with the expressions above is often referred to as the “Baron-Kenny” approach to mediation (26). A related approach, common in both the epidemiologic literature and the social science literature, consists of regressing Y on A, M, C as in model 5 and then examining whether the coefficient for A is different from that obtained when Y is regressed on A and C alone, such as the following:

$$\mathrm{logit}\{P(Y = 1 \mid a, c)\} = \phi_0 + \phi_1 a + \phi_2' c.$$

The difference between coefficients for A, $\phi_1 - \theta_1$, is sometimes interpreted as an indirect effect. The traditional “proportion explained” methods (27–30) are closely related and use $(\phi_1 - \theta_1)/\phi_1$ as the measure of interest, again effectively relying on the difference between the 2 coefficients. In the included Appendix, we in fact show that, under assumptions 1–4, correct specification of models 5 and 6, and a rare outcome, these 2 approaches to mediation analysis with a dichotomous outcome are essentially equivalent with $\phi_1 - \theta_1 \approx \theta_2\beta_1$. The results above provide a formal counterfactual interpretation of these various effect measures. An alternative measure of the “proportion explained” proposed by Wang et al. (31) is, under certain exchangeability assumptions, similar to a natural indirect effect (32).

Figure 2. Example of mediation with exposure A, mediator M, outcome Y, covariates C, and a mediator-outcome confounder L that is itself affected by the exposure.

However, a limitation of all of the standard approaches is

that they presuppose that there is no statistical interaction on

the odds ratio scale between A and M in the logistic model

for Y. When such A × M interactions are present and are ignored, the logistic regression model 5 will not be correctly specified, and the difference $\phi_1 - \theta_1$ does not carry

a straightforward interpretation as an indirect causal effect;

the definition of an indirect effect essentially breaks down

within the standard Baron-Kenny approach when such in-

teractions are present (33). Hafeman (34) has also recently

documented the biases that can arise with the traditional

‘‘proportion explained’’ methods when used in multiplica-

tive models for a dichotomous outcome in which interaction

terms are omitted. Here, we show how the regression ap-

proach can be extended to allow for interaction. Specifically,

suppose that, instead of model 5, the following model, which includes an A × M product term, is used:

$$\mathrm{logit}\{P(Y = 1 \mid a, m, c)\} = \theta_0 + \theta_1 a + \theta_2 m + \theta_3 a m + \theta_4' c. \qquad (7)$$

If assumptions 1–4 hold and if the regression models 6 and 7 are correctly specified and the outcome is rare, then the controlled direct effect and natural indirect effect odds ratios are given, respectively, by

$$\mathrm{OR}^{CDE}_{a,a^*|c}(m) = \exp\{(\theta_1 + \theta_3 m)(a - a^*)\} \qquad (8)$$

$$\mathrm{OR}^{NIE}_{a,a^*|c}(a) \approx \exp\{(\theta_2\beta_1 + \theta_3\beta_1 a)(a - a^*)\}. \qquad (9)$$

The formula for the controlled direct effect odds ratio requires that assumptions 1 and 2 hold and that model 7 is correctly specified; no rare outcome assumption is required. The formula for the natural indirect effect odds ratio requires that assumptions 1–4 hold, that models 6 and 7 are correctly specified, and that the outcome Y is rare. An estimator can also be given for the natural direct effect odds ratio (refer to the Web Appendix material) but is more complicated because, when there is interaction between A and M in the logistic model for Y, the natural direct effect will be different for subjects with different covariate values C. Model 7 and expressions 8 and 9 essentially generalize the Baron-Kenny approach to allow for exposure-mediator interactions.
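Expressions 8 and 9 can be evaluated directly once the coefficients of the logistic model for Y and the linear model for M are available. The Python sketch below is only illustrative: the function names and the coefficient values are hypothetical (they do not come from any fitted model), with `theta1`, `theta2`, `theta3` standing for the exposure, mediator, and product-term coefficients of the logistic model and `beta1` for the exposure coefficient of the mediator model.

```python
import math


def or_cde(theta1, theta3, m, a, a_star):
    """Expression 8: controlled direct effect odds ratio with the mediator fixed at m."""
    return math.exp((theta1 + theta3 * m) * (a - a_star))


def or_nie(theta2, theta3, beta1, a, a_star):
    """Expression 9: natural indirect effect odds ratio (rare-outcome approximation)."""
    return math.exp((theta2 * beta1 + theta3 * beta1 * a) * (a - a_star))


# Hypothetical coefficient values (not taken from any fitted model).
theta1, theta2, theta3, beta1 = 0.40, 0.30, -0.10, 0.50

cde_at_m2 = or_cde(theta1, theta3, m=2.0, a=1, a_star=0)
nie = or_nie(theta2, theta3, beta1, a=1, a_star=0)

# Setting the product-term coefficient to 0 recovers the no-interaction
# formulas exp{theta1 (a - a*)} and exp{theta2 beta1 (a - a*)}.
cde_no_interaction = or_cde(theta1, 0.0, m=2.0, a=1, a_star=0)
nie_no_interaction = or_nie(theta2, 0.0, beta1, a=1, a_star=0)
```

Note that the controlled direct effect depends on the chosen mediator level `m` only through the product-term coefficient, which is why it no longer varies with `m` in the no-interaction case.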

Ninety-five percent confidence intervals for the controlled direct effect odds ratio in expression 8 and the natural indirect effect odds ratio in expression 9 can be computed by using standard regression output and are given, respectively, by

$$\exp\left\{\log \mathrm{OR}^{CDE}_{a,a^*|c}(m) \pm 1.96\,(a - a^*)\sqrt{\sigma^{\theta}_{11} + 2\sigma^{\theta}_{13} m + \sigma^{\theta}_{33} m^2}\right\}$$

and

$$\exp\left\{\log \mathrm{OR}^{NIE}_{a,a^*|c}(a) \pm 1.96\,(a - a^*)\sqrt{(\theta_2 + \theta_3 a)^2 \sigma^{\beta}_{11} + \beta_1^2\left(\sigma^{\theta}_{22} + 2\sigma^{\theta}_{23} a + \sigma^{\theta}_{33} a^2\right)}\right\},$$

where $\sigma^{\beta}_{ij}$ is the covariance between $\hat{\beta}_i$ and $\hat{\beta}_j$ in model 6, and $\sigma^{\theta}_{ij}$ is the covariance between $\hat{\theta}_i$ and $\hat{\theta}_j$ in model 7; these covariances are given in the regression output of standard statistical software. Alternatively, standard errors for expressions 8 and 9 could be obtained by bootstrapping.

Expressions 8 and 9 generalize mediation analysis with a dichotomous outcome to settings in which there may be interactions on the odds ratio scale between the exposure and mediator of interest. The standard approach of omitting the $\theta_3 am$ product term in assessing mediation is highly problematic when correct specification of a logistic regression model for Y requires the product term. When there is in fact such interaction between A and M, ignoring this (as is often done) can result in highly misleading inferences concerning mediation. If, for example, the direction of the association between A and Y differs for different levels of m and if the $\theta_3 am$ term in model 7 is omitted, the resulting estimate of the exposure coefficient $\theta_1$ may be close to 0 because of averaging. This might result in a researcher’s concluding that the effect of A on Y is largely mediated by M, when in fact all that is the case is that there is an interaction between the effects of A and M on Y. At the very least, epidemiologists, before applying the standard approach, should test whether $\theta_3 = 0$ in the regression model 7 and should consider whether the no-unmeasured-confounding assumptions described above are satisfied. If there is evidence that $\theta_3 \neq 0$, then this standard approach of merely including the mediator in a regression for the outcome Y to obtain direct and indirect effects should not be used. The approach described above, however, of using both models 6 and 7 could still be used when there is an interaction between A and M in model 7.
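The confidence intervals for expressions 8 and 9 described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the function names, argument names, and all numerical inputs are hypothetical, and the variances and covariances would in practice be read from the estimated covariance matrices of the logistic model for Y and the linear model for M.

```python
import math


def ci_or_cde(theta1, theta3, s11, s13, s33, m, a, a_star, z=1.96):
    """Confidence interval for the controlled direct effect odds ratio (expression 8).

    s11, s13, s33 are the (co)variances of the estimated theta1 and theta3.
    """
    log_or = (theta1 + theta3 * m) * (a - a_star)
    # abs() is an added convenience so that the interval stays ordered when a < a*.
    se = abs(a - a_star) * math.sqrt(s11 + 2.0 * s13 * m + s33 * m ** 2)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def ci_or_nie(theta2, theta3, beta1, sb11, s22, s23, s33, a, a_star, z=1.96):
    """Confidence interval for the natural indirect effect odds ratio (expression 9).

    sb11 is the variance of the estimated beta1; s22, s23, s33 are the
    (co)variances of the estimated theta2 and theta3.
    """
    log_or = (theta2 * beta1 + theta3 * beta1 * a) * (a - a_star)
    variance = ((theta2 + theta3 * a) ** 2 * sb11
                + beta1 ** 2 * (s22 + 2.0 * s23 * a + s33 * a ** 2))
    se = abs(a - a_star) * math.sqrt(variance)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical estimates and (co)variances, for illustration only.
cde_lo, cde_hi = ci_or_cde(0.40, -0.10, s11=0.04, s13=-0.005, s33=0.01,
                           m=2.0, a=1, a_star=0)
nie_lo, nie_hi = ci_or_nie(0.30, -0.10, 0.50, sb11=0.01, s22=0.02, s23=-0.004,
                           s33=0.01, a=1, a_star=0)
```

Because the intervals are built on the log odds ratio scale and then exponentiated, they are symmetric around the point estimate only in the multiplicative sense.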


ODDS RATIOS FOR MEDIATION ANALYSIS IN CASE-

CONTROL STUDIES

In this section, we describe how the above approach can

be adapted when using case-control data. The case-control

setting is of particular importance in mediation analysis

with a dichotomous outcome because often, if the outcome

is rare, it will be infeasible to conduct a cohort study with

a sufficient number of individuals with the outcome. When

data are used from a case-control study design, the estimators of $(\theta_1, \theta_2, \theta_3, \theta_4)$ obtained from logistic regression 7

using case-control data will consistently estimate the same

parameters of a logistic regression using cohort data. This

well-known result is what justifies the use of logistic re-

gression when analyzing odds ratios in case-control studies

for total effects; when logistic regression is used, the case-

control study design can effectively be ignored. Note that

a logistic, not a log-linear model, is being used. In a case-

control study, estimation of model 7 is thus straight-

forward. However, when fitting the linear regression model

6 for the mediator M using case-control data, the case-

control study design cannot be ignored. It is nevertheless

possible to adapt the approach to the estimation of direct

and indirect effects described above in a relatively straightforward manner if the prevalence of the outcome Y is known. We will denote this prevalence by $\pi$. We assume it is known by design so that sampling variability for $\pi$ is negligible. Also, let p denote the proportion of cases in the case-control study (i.e., the ratio of the number of cases in the study to the sum of the numbers of cases and controls in the study). If we fit a linear regression of M on A and C using the case-control data but weighting each case by $\pi/p$ and each control by $(1 - \pi)/(1 - p)$, then the coefficients obtained in this weighted regression will give consistent estimators of $(\beta_0, \beta_1, \beta_2)$ obtained in a linear regression of M on A and C using data from a cohort study of the same population (35). Once $(\theta_1, \theta_2, \theta_3, \theta_4)$ are obtained from the logistic regression and $(\beta_0, \beta_1, \beta_2)$ are obtained from a weighted linear regression, the estimation of direct and indirect effects can then proceed using the formulas given in expressions 8 and 9 above.
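The prevalence-weighted regression for the mediator can be sketched as follows. This is a minimal pure-Python illustration, assuming a single covariate and a known population prevalence; the function names, the toy data, and the prevalence value of 0.05 are all hypothetical, and in applied work a standard weighted least squares routine from a statistical package would normally be used instead of the hand-written solver.

```python
def case_control_weights(y, prevalence):
    """Weight pi/p for cases and (1 - pi)/(1 - p) for controls.

    y          : list of 0/1 case indicators in the case-control sample
    prevalence : outcome prevalence pi in the source population (assumed known)
    """
    p = sum(y) / len(y)  # proportion of cases in the sample
    return [prevalence / p if yi == 1 else (1.0 - prevalence) / (1.0 - p) for yi in y]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def weighted_least_squares(rows, m, w):
    """Regress m on the columns of rows (intercept added) using weights w."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtwx = [[sum(wi * xi[r] * xi[c] for wi, xi in zip(w, X)) for c in range(k)]
            for r in range(k)]
    xtwm = [sum(wi * xi[r] * mi for wi, xi, mi in zip(w, X, m)) for r in range(k)]
    return solve(xtwx, xtwm)


# Hypothetical case-control sample: case indicator y, exposure a, covariate c.
y = [1, 1, 1, 0, 0, 0]
a = [1, 1, 0, 1, 0, 0]
c = [0.2, 1.1, 0.5, 0.9, 0.1, 1.4]
# Noise-free mediator values, used only so the recovered coefficients can be checked.
m = [2.0 + 0.5 * ai + 0.3 * ci for ai, ci in zip(a, c)]

w = case_control_weights(y, prevalence=0.05)
beta0, beta1, beta2 = weighted_least_squares(list(zip(a, c)), m, w)
```

The weights rescale cases and controls to their shares in the source population, so that the weighted sample mimics a cohort sample for the purpose of fitting the mediator model.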

ILLUSTRATION AND SIMULATIONS

As another example of mediation and to illustrate the

approach we have described, we reanalyzed a previously

reported study (36) with residence in a damp and moldy

dwelling as the exposure, depression as the outcome, and

perception of control over one’s home as the mediator. A

logistic regression model was fit for depression as a function

of perception of control, dampness or mold exposure, and

other individual and housing variables, as reported in

Shenassa et al. (36), and a linear regression model was fit

for perception of control as a function of the exposure and

the same individual and housing variables, each time using

generalized estimating equations to adjust for possible cor-

relation between measurements from residents sharing the

same dwelling. Allowing for the possibility of an interac-

tion, the natural indirect effect of an increase in dampness or

mold exposure from none to minimal, minimal to moderate,

and moderate to extensive on the risk of depression corre-

sponds to odds ratios of 1.03 (95% confidence interval (CI):

0.94, 1.14), 1.04 (95% CI: 0.95, 1.13), and 1.06 (95% CI:

0.93, 1.35). Standard analyses, ignoring such interactions,

gave corresponding natural indirect effect odds ratios of

1.04 (95% CI: 0.99, 1.10), 1.04 (95% CI: 0.99, 1.09), and

1.04 (95% CI: 0.99, 1.19), respectively. Considering that no

significant evidence of an interaction between dampness or

mold exposure and perception of control was found (P =

0.91, 0.89, and 0.22 for minimal, moderate, and extensive

dampness or mold exposure, respectively, relative to no ex-

posure), the fact that these results are very similar is not

surprising.

We also use data from this study as the basis for simu-

lation experiments exploring bias and coverage probabili-

ties when outcome prevalence is not rare or when

exposure-mediator interactions are ignored. Table 1 shows

(on the log odds scale) the bias, empirical standard error

(ESE), average of the estimated standard errors (SSEs),

and coverage of 95% confidence intervals for the natural

indirect effects log odds ratios with a = 1 and a* = 0, as

based on 1,000 simulated data sets. Outcome and mediator

data conditional on the observed exposure and covariates

in the study by Shenassa et al. (36) were generated by using

the data-generating models obtained in the previous anal-

ysis. In Table 1, the first 5 simulation experiments corre-

spond to varying outcome prevalence. They demonstrate

that the proposed estimates of the natural indirect effect

odds ratio, while theoretically valid only at low outcome

means, give good approximations even at larger preva-

lences for the data-generating mechanism underlying the

data of Shenassa et al. The next 4 experiments evaluate the

impact of exposure-mediator interactions. Here, the magnitude

$\theta_3 = -0.22$ was chosen to equal $-2\theta_2/3$ and thus to

generate a potentially substantial bias in the natural indirect

effect odds ratio at $a = 3$, which was the largest observed

exposure value. As theoretically expected, ignoring

exposure-mediator interactions when they are present can

generate a substantial bias in the indirect effect estimates.

In the final 4 experiments $\beta_1$ and $\sigma$ were increased 5 times

($W = 5$ in Table 1) to give indirect effects of a larger

magnitude; here, violations of the rare-outcome assump-

tion do lead to bias.
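The summary measures reported in the simulation tables can be computed from the replicate-level output as in the following minimal sketch. It assumes Wald-type 95% confidence intervals (estimate plus or minus 1.96 standard errors), which is an assumption of this illustration rather than a statement of how the intervals in the tables were constructed; the function name and the simulated inputs are hypothetical.

```python
import numpy as np


def summarize_simulation(estimates, std_errors, truth, z=1.96):
    """Bias, empirical SE, average estimated SE, and CI coverage (%).

    estimates, std_errors: one entry per simulated data set (log odds scale).
    truth: the true value of the log odds ratio being estimated.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - truth          # mean estimate minus truth
    ese = estimates.std(ddof=1)              # empirical standard error
    sse = std_errors.mean()                  # average estimated standard error
    lower = estimates - z * std_errors       # assumed Wald interval limits
    upper = estimates + z * std_errors
    coverage = 100.0 * np.mean((lower <= truth) & (truth <= upper))
    return {"bias": bias, "ESE": ese, "SSE": sse, "Cov": coverage}


# Hypothetical replicates: 1,000 estimates scattered around a true value
rng = np.random.default_rng(0)
true_value = 0.04
est = rng.normal(loc=true_value, scale=0.015, size=1000)
se = np.full(1000, 0.015)
print(summarize_simulation(est, se, true_value))
```

Coverage well below the nominal 95% together with a bias that is large relative to the ESE is the pattern that signals a biased estimator in the tables.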

Table 2 shows related results for the natural direct effects

log odds ratios. Results are similar to those for natural indirect

effects: Coverage is poor when an exposure-mediator inter-

action is present and ignored but reasonable when the ap-

proach with the interaction is used. With natural direct

effects, in the final 4 experiments, we see that the bias of

the proposed estimator due to failure of the rare-outcome

assumption can be more sizeable than that of the standard

approach in settings in which the exposure-mediator inter-

action is in fact negligible.

Results from simulations of case-control data with

prevalence-weighted regressions for the mediator followed

a pattern similar to that for the estimator of the natural indirect

effect: bias if one ignores a substantial exposure-mediator

interaction when present and bias when the rare-outcome

assumption is violated.
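A prevalence-weighted mediator regression of the kind referred to here might be sketched as follows. The weighting scheme shown (cases weighted by the population prevalence divided by the sample proportion of cases, controls by the complementary ratio) is one common choice and is stated here as an assumption of the illustration; the function names and the toy data are hypothetical.

```python
import numpy as np


def prevalence_weights(y, prevalence):
    """Weights that rescale a case-control sample to a known prevalence.

    Assumed scheme: cases get prevalence / p, controls get
    (1 - prevalence) / (1 - p), where p is the sample proportion of cases.
    """
    y = np.asarray(y)
    p_cases = y.mean()
    if not 0.0 < p_cases < 1.0:
        raise ValueError("sample must contain both cases and controls")
    return np.where(y == 1,
                    prevalence / p_cases,
                    (1.0 - prevalence) / (1.0 - p_cases))


def weighted_linear_fit(X, m, weights):
    """Weighted least squares for the mediator model E[M | a, c] = X @ beta."""
    X = np.asarray(X, dtype=float)
    m = np.asarray(m, dtype=float)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    # Scaling rows by sqrt(weight) turns weighted into ordinary least squares
    beta, *_ = np.linalg.lstsq(X * sw[:, None], m * sw, rcond=None)
    return beta


# Toy case-control sample: intercept and exposure columns, mediator values
y = np.array([1, 1, 0, 0])
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
m = np.array([1.0, 3.0, 5.0, 7.0])
w = prevalence_weights(y, prevalence=0.1)
print(weighted_linear_fit(X, m, w))
```

This sketch returns point estimates only; standard errors for weighted fits would need a robust (sandwich) or bootstrap approach, which is not shown.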


DISCUSSION

The 2 most common pitfalls with mediation analysis in

the epidemiologic literature are 1) ignoring possible

mediator-outcome confounding and 2) ignoring possible in-

teractions between the effects of exposure and mediator on

the outcome. Either pitfall can lead to severely biased esti-

mates and incorrect conclusions concerning mediation.

With regard to pitfall 1, we would recommend that, when

questions of mediation are of interest, greater attention be

paid to the collection of data on variables that may confound

the mediator-outcome relation and that sensitivity analysis

be used when it is not possible to control for such

confounders (15, 16). As noted above, the no-unmeasured-

confounding assumptions used for the identification of di-

rect and indirect effects cannot be verified with data, so

researchers need to carefully evaluate these using subject

matter knowledge and sensitivity analysis techniques. With

Table 2. Simulation Results for Natural Direct Effects for Bias, Empirical and Estimated Standard Errors, and Coverage Probabilities of 95% Confidence Intervals, With Varying Outcome Prevalence and Exposure-Mediator Interactions

| E(Y) | $\theta_3$ | W | Without Interaction: Bias | ESE | SSE | Cov | With Interaction: Bias | ESE | SSE | Cov |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.0035 | 1 | 0.020 | 0.15 | 0.13 | 94.7 | −0.0053 | 0.16 | 0.15 | 95.1 |
| 0.25 | | | 0.016 | 0.033 | 0.032 | 93.9 | 0.00029 | 0.036 | 0.036 | 96.6 |
| 0.5 | | | 0.015 | 0.033 | 0.030 | 92.1 | 0.0030 | 0.035 | 0.032 | 94.2 |
| 0.75 | | | 0.020 | 0.035 | 0.040 | 90.2 | 0.010 | 0.041 | 0.037 | 93.7 |
| 0.089 | | | −0.0026 | 0.053 | 0.047 | 95.0 | −0.019 | 0.057 | 0.052 | 93.3 |
| | −0.22 | | −0.071 | 0.043 | 0.060 | 78.6 | −0.0085 | 0.049 | 0.065 | 95.9 |
| | 0.22 | | 0.078 | 0.042 | 0.059 | 53.8 | 0.0051 | 0.050 | 0.066 | 93.9 |
| | −0.44 | | −0.11 | 0.042 | 0.074 | 72.1 | −0.0060 | 0.050 | 0.081 | 95.0 |
| | 0.44 | | 0.039 | 0.078 | 0.039 | 7.8 | 0.014 | 0.095 | 0.054 | 94.1 |
| 0.01 | 0.0035 | 5 | 0.039 | 0.18 | 0.14 | 94.6 | 0.0048 | 0.18 | 0.14 | 96.4 |
| 0.25 | | | −0.0093 | 0.040 | 0.035 | 93.8 | 0.072 | 0.055 | 0.044 | 64.7 |
| 0.5 | | | −0.0052 | 0.035 | 0.032 | 94.7 | 0.13 | 0.062 | 0.049 | 23.2 |
| 0.75 | | | −0.00094 | 0.039 | 0.038 | 95.3 | 0.18 | 0.087 | 0.068 | 21.9 |

Abbreviations: Cov, coverage probability; ESE, empirical standard error; E(Y), outcome prevalence; SSE, estimated standard error; $\theta_3$, exposure-mediator interaction; W, variance factor.

Table 1. Simulation Results for Natural Indirect Effects for Bias, Empirical and Estimated Standard Errors, and Coverage Probabilities of 95% Confidence Intervals, With Varying Outcome Prevalence and Exposure-Mediator Interactions

| E(Y) | $\theta_3$ | W | Without Interaction: Bias | ESE | SSE | Cov | With Interaction: Bias | ESE | SSE | Cov |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.0035 | 1 | 0.0011 | 0.014 | 0.015 | 95.6 | 0.00027 | 0.015 | 0.015 | 95.0 |
| 0.25 | | | 0.00024 | 0.0046 | 0.0048 | 94.0 | 0.00039 | 0.0047 | 0.0048 | 94.3 |
| 0.5 | | | −0.00021 | 0.0043 | 0.0043 | 94.7 | 0.00046 | 0.0044 | 0.0044 | 94.8 |
| 0.75 | | | −0.00057 | 0.0045 | 0.0047 | 92.5 | 0.00039 | 0.0047 | 0.0049 | 94.4 |
| 0.089 | | | 0.0055 | 0.0059 | 0.0059 | 95.3 | 0.00018 | 0.0059 | 0.0059 | 94.7 |
| | −0.22 | | 0.0032 | 0.0058 | 0.0058 | 93.6 | 0.000036 | 0.0058 | 0.0058 | 96.4 |
| | 0.22 | | 0.0045 | 0.0069 | 0.0069 | 93.7 | 0.00047 | 0.0066 | 0.0067 | 95.7 |
| | −0.44 | | 0.013 | 0.0061 | 0.0062 | 42.7 | −0.00010 | 0.0069 | 0.0069 | 95.0 |
| | 0.44 | | 0.0095 | 0.0088 | 0.0088 | 84.2 | 0.0010 | 0.0080 | 0.0083 | 95.1 |
| 0.01 | 0.0035 | 5 | 0.014 | 0.028 | 0.024 | 93.8 | 0.0059 | 0.028 | 0.024 | 95.9 |
| 0.25 | | | 0.013 | 0.019 | 0.016 | 87.7 | 0.015 | 0.019 | 0.016 | 85.6 |
| 0.5 | | | 0.019 | 0.017 | 0.016 | 79.1 | 0.023 | 0.018 | 0.016 | 72.9 |
| 0.75 | | | 0.021 | 0.017 | 0.015 | 73.7 | 0.026 | 0.018 | 0.017 | 64.5 |

Abbreviations: Cov, coverage probability; ESE, empirical standard error; E(Y), outcome prevalence; SSE, estimated standard error; $\theta_3$, exposure-mediator interaction; W, variance factor.


regard to pitfall 2, we would recommend that, before pro-

ceeding with what has become a routine approach of simply

including an intermediate variable in a regression to assess

mediation, investigators first examine whether there is in-

teraction between the effects of the exposure and the medi-

ator on the outcome. If there is interaction, then the routine

approach of omitting the product term from the regression

model should be avoided; instead, the product term can be

included and, provided that the outcome is rare, the ap-

proach we have described in this paper can be used.

Several further comments merit attention. First, we have

seen that, although mediation analysis is more difficult when

there is interaction between the exposure and the mediator

(1, 33, 37), this interaction can in fact be accommodated.

Our simple formulae did, however, assume no interaction

between the confounders and the treatment or mediator;

other estimation techniques (16) could be used if there are

confounder-exposure interactions; other identification ap-

proaches are also possible when such interactions are pres-

ent in their effects on the mediator (21, 38). Second, the

methods described above require a rare outcome; this was

necessary in the derivations and also circumvents collaps-

ibility issues with odds ratios (39); some existing work con-

siders or could be adapted for non-rare outcomes (16, 40);

future work will consider settings in which the outcome is

not rare and compare power, bias, and efficiency properties

of the estimators. Third, we have considered the setting of

a dichotomous outcome and a continuous mediator. When

the mediator M is dichotomous, rather than continuous,

a somewhat similar approach to the one described here

could potentially be used, but the analytic formulas for me-

diated effects no longer take quite as simple a form. Fourth,

in genetic epidemiology, the extent to which genetic variants

affect an outcome (e.g., lung cancer) through intermediate

phenotypes (e.g., nicotine addiction) has recently been

a topic of interest (41–43); the approach we have described

here for case-control studies can be applied to address such

questions in genetics research.

ACKNOWLEDGMENTS

Author affiliations: Departments of Epidemiology and

Biostatistics, Harvard School of Public Health, Boston,

Massachusetts (Tyler J. VanderWeele); and Department of

Applied Mathematics and Computer Sciences, Ghent Uni-

versity, Ghent, Belgium (Stijn Vansteelandt).

T. J. V. received funding from grants ES017876 and

HD060696 from the US National Institutes of Health. S.

V. was supported by Interuniversity Attraction Poles (IAP)

research network grant P06/03 from the Belgian govern-

ment (Belgian Science Policy).

The authors thank the World Health Organization’s Eu-

ropean Centre for Environment and Health, Bonn office, for

providing the Large Analysis and Review of European

Housing and Health Status (LARES) data set used in this

paper to illustrate the method and as the basis for simula-

tions.

Conflict of interest: none declared.

REFERENCES

1. Robins JM, Greenland S. Identifiability and exchangeability

for direct and indirect effects. Epidemiology. 1992;3(2):143–

155.

2. Pearl J. Direct and indirect effects. In: Proceedings of the

Seventeenth Conference on Uncertainty in Artificial Intelligence.

San Francisco, CA: Morgan Kaufmann; 2001:

411–420.

3. Mendelsohn ME, Karas RH. The protective effects of estrogen

on the cardiovascular system. N Engl J Med. 1999;340(23):

1801–1811.

4. Bush TL, Barrett-Connor E, Cowan LD, et al. Cardiovascular

mortality and noncontraceptive use of estrogen in women:

results from the Lipid Research Clinics Program Follow-up

Study. Circulation. 1987;75(6):1102–1109.

5. Rubin DB. Formal modes of statistical inference for causal

effects. J Statist Plan Inf. 1990;25(3):279–292.

6. Hernán MA. A definition of causal effect for epidemiological

research. J Epidemiol Community Health. 2004;58(4):

265–271.

7. Pearl J. Causality: Models, Reasoning, and Inference. 2nd ed.

Cambridge, United Kingdom: Cambridge University Press;

2009.

8. VanderWeele TJ. Concerning the consistency assumption in

causal inference. Epidemiology. 2009;20(6):880–883.

9. VanderWeele TJ, Vansteelandt S. Conceptual issues concern-

ing mediation, interventions and composition. Stat Interface.

2009;2(4):457–468.

10. Robins JM. Semantics of causal DAG models and the identi-

fication of direct and indirect effects. In: Green P, Hjort NL,

Richardson S, eds. Highly Structured Stochastic Systems.

New York, NY: Oxford University Press; 2003:70–81.

11. Joffe M, Small D, Hsu CY. Defining and estimating interven-

tion effects for groups that will develop an auxiliary outcome.

Stat Sci. 2007;22(1):74–97.

12. VanderWeele TJ. Marginal structural models for the estima-

tion of direct and indirect effects. Epidemiology. 2009;20(1):

18–26.

13. Judd CM, Kenny DA. Process analysis: estimating mediation

in treatment evaluations. Eval Rev. 1981;5(5):602–619.

14. Cole SR, Hernán MA. Fallibility in estimating direct effects.

Int J Epidemiol. 2002;31(1):163–165.

15. VanderWeele TJ. Bias formulas for sensitivity analysis for

direct and indirect effects. Epidemiology. 2010;21(4):540–

551.

16. Imai K, Keele L, Tingley D. A general approach to causal

mediation analysis. Psychol Methods. In press.

17. Avin C, Shpitser I, Pearl J. Identifiability of path-specific ef-

fects. In: Proceedings of the International Joint Conferences

on Artificial Intelligence. San Francisco, CA: Morgan

Kaufmann; 2005:357–363.

18. Robins JM, Hernán MA, Brumback B. Marginal structural

models and causal inference in epidemiology. Epidemiology.

2000;11(5):550–560.

19. van der Laan MJ, Petersen ML. Direct effect models. In: In-

ternational Journal of Biostatistics. Vol. 4, issue 1, article 23.

Berkeley, CA: Berkeley Electronic Press; 2008.

20. Robins JM. Testing and estimation of direct effects by repar-

ameterizing directed acyclic graphs with structural nested

models. In: Glymour C, Cooper GF, eds. Computation, Cau-

sation, and Discovery. Menlo Park, CA: AAAI Press/

Cambridge, MA: The MIT Press; 1999:349–405.

21. Ten Have TR, Joffe MM, Lynch KG, et al. Causal mediation

analyses with rank preserving models. Biometrics. 2007;63(3):

926–934.


22. Goetgeluk S, Vansteelandt S, Goetghebeur E. Estimation of

controlled direct effects. J R Stat Soc B. 2008;70(5):1049–

1066.

23. Joffe MM, Greene T. Related causal frameworks for surrogate

outcomes. Biometrics. 2009;65(2):530–538.

24. Vansteelandt S. Estimating direct effects in cohort and case-

control studies. Epidemiology. 2009;20(6):851–860.

25. MacKinnon DP. An Introduction to Statistical Mediation

Analysis. New York, NY: Lawrence Erlbaum Associates;

2008.

26. Baron RM, Kenny DA. The moderator-mediator variable dis-

tinction in social psychological research: conceptual, strategic,

and statistical considerations. J Pers Soc Psychol. 1986;51(6):

1173–1182.

27. Freedman LS, Graubard BI, Schatzkin A. Statistical validation

of intermediate endpoints for chronic diseases. Stat Med.

1992;11(2):167–178.

28. Lin DY, Fleming TR, De Gruttola V. Estimating the proportion

of treatment effect explained by a surrogate marker. Stat Med.

1997;16(13):1515–1527.

29. Li Z, Meredith MP, Hoseyni MS. A method to assess the

proportion of treatment effect explained by a surrogate end-

point. Stat Med. 2001;20(21):3175–3188.

30. Chen C, Wang H, Snapinn SM. Proportion of treatment effect

(PTE) explained by a surrogate marker. Stat Med. 2003;

22(22):3449–3459.

31. Wang Y, Taylor JM. A measure of the proportion of treatment

effect explained by a surrogate marker. Biometrics. 2002;

58(4):803–812.

32. Taylor JM, Wang Y, Thiébaut R. Counterfactual links to the

proportion of treatment effect explained by a surrogate marker.

Biometrics. 2005;61(4):1102–1111.

33. Kaufman JS, MacLehose RF, Kaufman S. A further critique of

the analytic strategy of adjusting for covariates to identify

biologic mediation [electronic article]. Epidemiol Perspect

Innov. 2004;1(1):4.

34. Hafeman DM. ‘‘Proportion explained’’: a causal interpretation

for standard measures of indirect effect? Am J Epidemiol.

2009;170(11):1443–1448.

35. van der Laan MJ. Estimation based on case-control designs

with known prevalence probability. In: International Journal

of Biostatistics. Vol. 4, issue 1, article 17. Berkeley, CA:

Berkeley Electronic Press; 2008.

36. Shenassa ED, Daskalakis C, Liebhaber A, et al. Dampness and

mold in the home and depression: an examination of mold-

related illness and perceived control of one’s home as possible

depression pathways. Am J Public Health. 2007;97(10):

1893–1899.

37. VanderWeele TJ. Mediation and mechanism. Eur J Epidemiol.

2009;24(5):217–224.

38. Dunn G, Bentall R. Modelling treatment-effect heterogeneity

in randomized controlled trials of complex interventions

(psychological treatments). Stat Med. 2007;26(26):4719–

4745.

39. Greenland S, Robins JM, Pearl J. Confounding and collaps-

ibility in causal inference. Stat Sci. 1999;14(1):29–46.

40. Huang B, Sivaganesan S, Succop P, et al. Statistical assess-

ment of mediational effects for logistic mediational models.

Stat Med. 2004;23(17):2713–2728.

41. Wacholder S, Chatterjee N, Caporaso N. Intermediacy and

gene-environment interaction: the example of CHRNA5-A3

region, smoking, nicotine dependence, and lung cancer. J Natl

Cancer Inst. 2008;100(21):1488–1491.

42. Chanock SJ, Hunter DJ. Genomics: when the smoke clears...

Nature. 2008;452(7187):537–538.

43. Vansteelandt S, Goetgeluk S, Lutz S, et al. On the adjustment

for covariates in genetic association analysis: a novel, simple

principle to infer direct causal effects. Genet Epidemiol.

2009;33(5):394–405.

APPENDIX

Comparison With Dichotomous Outcome Mediation

Analysis in the Social Science Literature

As noted in the text, the approach often used in the social

sciences (25) involves using regressions such as models 5

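and 6, along with a regression of Y on just A (and C):

$$\operatorname{logit}(P(Y = 1 \mid a, m, c)) = \theta_0 + \theta_1 a + \theta_2 m + \theta_4' c$$

$$E[M \mid a, c] = \beta_0 + \beta_1 a + \beta_2' c$$

$$\operatorname{logit}(P(Y = 1 \mid a, c)) = \phi_0 + \phi_1 a + \phi_2' c.$$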

Potential confounding variables are often ignored in many

of the analyses in the social sciences in which the exposure

is randomized (even though the mediator is not random-

ized), and thus the set C is sometimes assumed to be empty.

With these regression models, there are then 2 approaches to

estimation typically used for the mediated effect (i.e., indirect

effect). The first uses $\beta_1\theta_2$ as a measure of the mediated

effect, and the second uses $\phi_1 - \theta_1$ as a measure of the

mediated effect. The 2 measures will often not coincide. In

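the text, we showed that, under the assumptions of 1) a rare

outcome, 2) normally distributed error in regression 6,

3) identification conditions 1–4 holding, and 4) no interaction

between a and m in the regression model 5, the quantity

$\beta_1\theta_2$ is approximately equal to the log of the natural indirect

effect odds ratio, $\log \mathrm{OR}^{\mathrm{NIE}}_{a,a^*|c}(a)$. In fact, if the outcome is

rare and the error term for regression model 6 is normally

distributed (with constant variance $\sigma^2$), then it will be

the case that $\beta_1\theta_2 \approx \phi_1 - \theta_1$ since, under the rare-outcome

assumption, we must have $\phi_0 + \phi_1 a + \phi_2' c = \operatorname{logit}(P(Y = 1 \mid a, c)) \approx \log\{P(Y = 1 \mid a, c)\}$, and thus we have that

$$
\begin{aligned}
\exp\{\phi_0 + \phi_1 a + \phi_2' c\} &\approx P(Y = 1 \mid a, c) \\
&= E[P(Y = 1 \mid a, c, M) \mid a, c] \\
&\approx E[\exp\{\theta_0 + \theta_1 a + \theta_2 M + \theta_4' c\} \mid a, c] \\
&= \exp(\theta_0 + \theta_1 a + \theta_4' c)\, E[\exp(\theta_2 M) \mid a, c] \\
&= \exp(\theta_0 + \theta_1 a + \theta_4' c)\, \exp\{\theta_2(\beta_0 + \beta_1 a + \beta_2' c) + \tfrac{1}{2}\theta_2^2\sigma^2\} \\
&= \exp\{(\theta_0 + \tfrac{1}{2}\theta_2^2\sigma^2 + \beta_0\theta_2) + (\theta_1 + \theta_2\beta_1) a + (\theta_4 + \theta_2\beta_2)' c\}.
\end{aligned}
$$

Because this holds for all a, we must have that $\phi_1 \approx (\theta_1 + \theta_2\beta_1)$

and thus $\phi_1 - \theta_1 \approx \theta_2\beta_1$.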
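The approximate agreement of the product and difference measures under a rare outcome can be checked numerically. The sketch below is an illustration under simplifying assumptions, not part of the original analysis: the covariate set C is taken to be empty, there is no exposure-mediator interaction, the marginal risk is obtained by Gauss-Hermite quadrature over a normally distributed mediator, and the coefficient of a in the marginal logistic model is approximated by the logit contrast between a = 1 and a = 0. All function names and parameter values are hypothetical.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def logit(p):
    return np.log(p / (1.0 - p))


def marginal_risk(a, theta0, theta1, theta2, beta0, beta1, sigma, n_nodes=60):
    """P(Y=1|a) = E[expit(theta0 + theta1*a + theta2*M) | a],
    with M ~ Normal(beta0 + beta1*a, sigma^2), by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    m = beta0 + beta1 * a + np.sqrt(2.0) * sigma * nodes
    return np.sum(weights * expit(theta0 + theta1 * a + theta2 * m)) / np.sqrt(np.pi)


def difference_measure(theta0, theta1, theta2, beta0, beta1, sigma):
    """Difference measure, with the marginal coefficient of a approximated
    by the logit contrast between a = 1 and a = 0."""
    phi1 = (logit(marginal_risk(1, theta0, theta1, theta2, beta0, beta1, sigma))
            - logit(marginal_risk(0, theta0, theta1, theta2, beta0, beta1, sigma)))
    return phi1 - theta1


# Hypothetical coefficients; the product measure is theta2 * beta1 = 0.32
params = dict(theta1=0.5, theta2=0.4, beta0=0.0, beta1=0.8, sigma=1.0)
product = params["theta2"] * params["beta1"]
rare = difference_measure(theta0=-7.0, **params)    # outcome risk near 0.1%
common = difference_measure(theta0=0.0, **params)   # outcome risk near 50%
print(f"product: {product:.4f}, difference (rare): {rare:.4f}, "
      f"difference (common): {common:.4f}")
```

With the intercept set so that the outcome is rare, the two measures nearly coincide; with a common outcome the difference measure drifts away from the product, in line with the noncollapsibility of the odds ratio.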

If, however, the outcome is not rare or if the error term in

regression model 6 is heteroscedastic or not normally distributed,

then the 2 quantities $\beta_1\theta_2$ and $\phi_1 - \theta_1$ need not

be approximately equal. Furthermore, in that case, neither

$\beta_1\theta_2$ nor $\phi_1 - \theta_1$ may warrant an interpretation as an

indirect effect. Moreover, if the set C does not satisfy the

no-unmeasured-confounding assumptions described in

the text, then $\beta_1\theta_2$ and $\phi_1 - \theta_1$ may both be biased for


the true log natural indirect effect odds ratio even if the

outcome is rare and the error term in model 6 is normally

distributed with constant variance. Finally, this standard

approach in the social science literature applies only

if there are no interactions between A and M in

regression model 5; the approach described in the text,

however, can still be employed when such interactions

are present.
