
ORIGINAL ARTICLE

Instruments for Causal Inference

An Epidemiologist’s Dream?

Miguel A. Hernán* and James M. Robins*†

Abstract: The use of instrumental variable (IV) methods is attrac-

tive because, even in the presence of unmeasured confounding, such

methods may consistently estimate the average causal effect of an

exposure on an outcome. However, for this consistent estimation to

be achieved, several strong conditions must hold. We review the

definition of an instrumental variable, describe the conditions re-

quired to obtain consistent estimates of causal effects, and explore

their implications in the context of a recent application of the

instrumental variables approach. We also present (1) a description of

the connection between 4 causal models—counterfactuals, causal

directed acyclic graphs, nonparametric structural equation models,

and linear structural equation models—that have been used to

describe instrumental variables methods; (2) a unified presentation

of IV methods for the average causal effect in the study population

through structural mean models; and (3) a discussion and new

extensions of instrumental variables methods based on assumptions

of monotonicity.

(Epidemiology 2006;17: 360–372)

Can you guarantee that the results from your observational study are unaffected by unmeasured confounding? The only answer an epidemiologist can provide is "no." Regardless of how immaculate the study design and how perfect the measurements, the unverifiable assumption of no unmeasured confounding of the exposure effect is necessary for causal inference from observational data, whether confounding adjustment is based on matching, stratification, regression, inverse probability weighting, or g-estimation.

Now, imagine for a moment the existence of an alternative method that allows one to make causal inferences from observational studies even if the confounders remain unmeasured. That method would be an epidemiologist's dream. Instrumental variable (IV) estimators, as reviewed by Martens et al1 and applied by Brookhart et al2 in the previous issue of EPIDEMIOLOGY, were developed to fulfill such a dream.

Instrumental variables have been defined using 4 different representations of causal effects:

1. Linear structural equations models developed in econometrics and sociology3,4 and used by Martens et al1
2. Nonparametric structural equations models4
3. Causal directed acyclic graphs4–6
4. Counterfactual causal models7–9

Much of the confusion associated with IV estimators

stems from the fact that it is not obvious how these various

representations of the same concept are related. Because the

precise connections are mathematical, we will relegate them

to an Appendix. In the main text, we will describe the

connections informally.

Let us introduce IVs, or instruments, in randomized

experiments before we turn our attention to observational

studies. The causal diagram in Figure 1 depicts the structure

of a double-blind randomized trial. In this trial, Z is the

randomization assignment indicator (eg, 1 = treatment, 0 = placebo), X is the actual treatment received (1 = treatment, 0 = placebo), Y is the outcome, and U represents all factors

(some unmeasured) that affect both the outcome and the

decision to adhere to the assigned treatment. The variable Z is

referred to as an instrument because it meets 3 conditions: (i)

Z has a causal effect on X, (ii) Z affects the outcome Y only

through X (ie, no direct effect of Z on Y, also known as the

exclusion restriction), and (iii) Z does not share common

causes with the outcome Y (ie, no confounding for the effect

of Z on Y). Mathematically precise statements of these con-

ditions are provided in the Appendix.

A double-blind randomized trial satisfies these condi-

tions in the following ways. Condition (i) is met because trial

participants are more likely to receive treatment if they were

assigned to treatment, condition (ii) is ensured by effective

double-blindness, and condition (iii) is ensured by the ran-

dom assignment of Z. The intention-to-treat effect (the aver-

age causal effect of Z on Y) differs from the average treatment

effect of X on Y when some individuals do not comply with

the assigned treatment. The greater the rate of noncompliance

(eg, the smaller the effect of Z on X on the risk-difference

scale), the more the intention-to-treat effect and the average

treatment effect will tend to differ. Because the average

treatment effect reflects the effect of X under optimal condi-

tions (full compliance) and does not depend on local condi-

tions, it is often of intrinsic public health or scientific interest.

Submitted 30 January 2006; accepted 6 February 2006.

From the *Department of Epidemiology, Harvard School of Public Health

and †Department of Biostatistics, Harvard School of Public Health,

Boston, Massachusetts.

Editors’ note: A related article appears on page 373.

Correspondence: Miguel A. Hernán, Department of Epidemiology, Harvard School of Public Health, 677 Huntington Ave, Boston, MA 02115. E-mail: Miguel_hernan@post.harvard.edu.

Copyright © 2006 by Lippincott Williams & Wilkins

ISSN: 1044-3983/06/1704-0360

DOI: 10.1097/01.ede.0000222409.00878.37

Epidemiology • Volume 17, Number 4, July 2006

360


Unfortunately, the average effect of X on Y may be affected

by unmeasured confounding.

Instrumental variables methods promise that if you

collect data on the instrument Z and are willing to make some

additional assumptions (see below), then you can estimate the

average effect of X on Y, regardless of whether you measured

the covariates normally required to adjust for the confounding

caused by U. IV estimators bypass the need to adjust for the

confounders by estimating the average effect of X on Y in the

study population from 2 effects of Z: the average effect of Z

on Y and the average effect of Z on X. These 2 effects can be

consistently estimated without adjustment because Z is ran-

domly assigned. For example, consider this well-known IV

estimator: The estimated effect of X on Y is equal to an

estimate of the ratio

(E[Y|Z = 1] − E[Y|Z = 0]) / (E[X|Z = 1] − E[X|Z = 0])

of the effect of Z on Y divided by the effect of Z on X, all measured in the scale of difference of risks or means, where E[X|Z] = Pr[X = 1|Z] for the dichotomous variable X.

(Martens et al1 showed the derivation and a geometrical explanation of this IV estimator in the context of linear models, and Brookhart et al2 applied it to pharmacoepidemiologic data.) To obtain the average treatment effect, one

inflates the intention-to-treat effect in the numerator of the

estimator by dividing by a denominator, which decreases as

noncompliance increases. That is, the effect of X on Y will

equal the effect of Z on Y when X is perfectly determined by

Z (risk difference E[X|Z = 1] − E[X|Z = 0] = 1). The weaker

the association between Z and X (the closer the Z-X risk

difference is to zero), the more the intention-to-treat effect

will be inflated because of the shrinking denominator.
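The ratio estimator just described can be sketched in a few lines of code. The following is a minimal illustration (with hypothetical data, not the Brookhart et al data) of the usual IV estimator for dichotomous Z and X:

```python
# A minimal sketch of the usual IV (Wald) estimator for dichotomous Z and X:
# the Z-Y mean difference divided by the Z-X risk difference.
# All data below are hypothetical.

def wald_estimator(y, x, z):
    z1 = [i for i in range(len(z)) if z[i] == 1]
    z0 = [i for i in range(len(z)) if z[i] == 0]
    mean = lambda v, idx: sum(v[i] for i in idx) / len(idx)
    itt_y = mean(y, z1) - mean(y, z0)   # effect of Z on Y (intention-to-treat)
    itt_x = mean(x, z1) - mean(x, z0)   # effect of Z on X (the denominator)
    return itt_y / itt_x

z = [1, 1, 1, 1, 0, 0, 0, 0]   # assignment
x = [1, 1, 1, 0, 0, 0, 0, 1]   # treatment received (imperfect compliance)
y = [1, 1, 0, 0, 0, 1, 0, 0]   # outcome
estimate = wald_estimator(y, x, z)   # 0.25 / 0.5 = 0.5
```

Here the intention-to-treat difference of 0.25 is inflated by the compliance difference of 0.5, yielding an estimate of 0.5.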

This instrumental variables estimator can also be used

in observational settings. Investigators can estimate the aver-

age effect of an exposure X by identifying and measuring a

Z-like variable that meets conditions (ii) and (iii) as well as a

more general modified version of condition (i), which we

designate as condition (i*). Under condition (i*), the instru-

ment Z and exposure X are associated either because Z has a

causal effect on X, or because X and Z share a common

cause.4,10 Martens et al1 cite several articles that describe

some instruments used in observational studies. As these

examples show, the challenge of identifying and measuring

an instrument in an observational study is not trivial. The goal

of Brookhart et al2 is to compare the effect of prescribing 2 classes of drugs (cyclooxygenase 2 [COX-2] selective and nonselective nonsteroidal antiinflammatory drugs [NSAIDs])

on gastrointestinal bleeding. The authors propose the “phy-

sician’s prescribing preference” for drug class as the instru-

ment, arguing that it meets conditions (i), (ii), and (iii).

Because the proposed instrument is unmeasured, the authors

replace it in their main analysis by the (measured) surrogate

instrument “last prescription issued by the physician before

current prescription.”

Figure 2 shows a causal structure in which the instru-

ment Z (here, “last prescription issued by the physician before

current prescription”) is a surrogate for another unmeasured

instrument U* (here, “physician’s prescribing preference”).

Both Z and U* meet conditions (i*), (ii) and (iii) but, in

contrast to U*, Z does not satisfy the original condition (i).

The original condition (i) is equivalent to the second assumption of Martens and colleagues1 for the validity of an instrument. It follows that Martens et al's assumptions are too

restrictive and do not recognize that Z can be used as an

instrument. That is, under the Martens et al assumption that

the equations are structural (as defined in the Appendix), their

instrumental variables estimator is consistent for the effect of

X on Y provided the instrument Z is uncorrelated with the

error term E in the structural equation for the outcome Y

(which implies no confounding for the causal effect of Z on

Y), even when the instrument is correlated with the error term

F in the structural equation for the treatment X (which implies

confounding for the causal effect of Z on X).

The IV estimator described previously looks like an

epidemiologist’s dream come true: we can estimate the effect

of the X on Y, even if there is unmeasured confounding for the

effect of X on Y! Many sober readers, however, will suspect

any claim that an analytic method solves one of the major

problems in epidemiologic research. Indeed there are good

reasons for skepticism—as Martens et al1 explain, and as the example of Brookhart et al2 illustrates.

FIGURE 1. A double-blind randomized experiment with assignment Z, treatment X, outcome Y, and unmeasured factors U. Z is an instrument.

FIGURE 2. An observational study with unmeasured instrument U*, exposure X, outcome Y, and unmeasured factors U. Z is a surrogate instrument.

First, the IV effect estimate will be biased unless the proposed instrument meets

conditions (ii) and (iii), but these conditions are not empiri-

cally verifiable. Second, any biases arising from violations of

conditions (ii) and (iii), or from sampling variability, will be

amplified if the association between instrument and exposure

[condition (i*)] is weak. Third, our discussion so far may

have appeared to suggest that conditions (i*), (ii), and (iii) are

sufficient to guarantee that the IV estimate consistently esti-

mates the average effect of X on Y. In fact, additional

unverifiable assumptions are required, regardless of whether

the data were generated from a randomized experiment or an

observational study. Finally, most epidemiologic exposures

are time-varying, which standard IV methods are poorly

equipped to address.

We now briefly review these 4 reasons for skepticism

(see also Greenland11). To illustrate these ideas, we will take

the study by Brookhart et al2as an example because one can

indirectly validate their observational estimates by comparing

them with the estimates from a previous randomized trial that

addressed the same question. We will focus on the effect of

prescribing selective versus nonselective NSAIDs on gastro-

intestinal bleeding over a period of 60 days in patients with

arthritis. This effect was estimated to be −0.47 (in the scale

of risk difference multiplied by 100) in the randomized trial.

Violation of the Unverifiable Conditions (ii)

and (iii) Introduces Bias

Condition (ii), the absence of a direct effect of the

instrument on the outcome, will not hold if, as discussed by

Brookhart et al,2 doctors tend to prescribe selective NSAIDs together with gastroprotective medications (eg, omeprazole).

This direct effect of the instrument would introduce a down-

ward bias in the estimate, that is, the effect of prescribing

selective NSAIDs would look more protective than it really

is. However, the assumption cannot be verified from the data:

the unexpectedly strong inverse association between Z and Y

(−0.35, Table 3 in Brookhart et al) is consistent with a

violation of condition (ii) but also with a very strong protec-

tive effect of selective NSAIDs without a violation of con-

dition (ii).

Brookhart and colleagues2 also discuss the possibility

that physicians who prescribe selective NSAIDs frequently

see higher-risk patients. This potential violation of condition

(iii) is the result of unmeasured confounding for the instru-

ment and would introduce an upward bias in the estimate. To

deal with this potential problem—consistent with the associ-

ation between Z and the measured covariates (Table 2 in

Brookhart et al)—the authors made the unverifiable assump-

tion that, within levels of the measured covariates, there were

no other common causes of the instrument and the outcome.

These violations of conditions (ii) and (iii) can be

represented by including arrows from U* to Y and from U to

Z, respectively (Fig. 3).

A Weak Condition (i*) Amplifies The Bias

An instrument weakly associated with exposure leads

to a small denominator of the IV estimator. Therefore, biases

that affect the numerator of the IV estimator (eg, unmeasured

confounding for the instrument, a direct effect of the instru-

ment) or small sample bias in the denominator will be greatly

exaggerated, and may result in an IV estimate that is more

biased than the unadjusted estimate. The exaggeration of the

effect by IV estimators may occur even in large samples and

in the absence of model misspecification. In the study by

Brookhart et al,2 the overall Z–X risk difference was 0.228

(the corresponding number in patients with arthritis was not

reported). Therefore, any bias affecting the numerator of the

IV estimator would be multiplied by approximately 4.4 (1/

0.228), which might explain why the IV effect estimate

−1.81 was farther from the randomized estimate −0.47 than

the unadjusted estimate 0.10. The IV method might have

exaggerated the effect if the proposed instrument had a direct

effect due to, say, concomitant prescription of gastroprotec-

tive drugs. Alternatively, the instrument Z may satisfy con-

ditions (i*), (ii), and (iii). In that case, the difference between

the IV and the randomized estimates might not be due to bias

in the instrumental variable estimator but rather to sampling

variability or (as suggested by Brookhart et al) to the different

age distributions in the observational study and the random-

ized trial, along with strong effect-measure modification by

age. The latter hypothesis could be assessed by conducting an

analysis stratified by age.
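The amplification argument can be made concrete with a line of arithmetic. In this sketch, the 0.228 denominator is the Z–X risk difference reported by Brookhart et al, while the size of the numerator bias is purely hypothetical:

```python
# Bias amplification by a weak instrument: any bias b in the numerator
# (the Z-Y difference) is divided by the Z-X risk difference, so the bias
# of the IV estimate is b / 0.228 here. The 0.228 value is the one reported
# by Brookhart et al; the numerator bias below is a hypothetical illustration.

zx_risk_difference = 0.228                 # effect of Z on X (reported)
numerator_bias = 0.05                      # hypothetical bias in the Z-Y difference
amplification = 1 / zx_risk_difference     # about 4.4
iv_bias = numerator_bias * amplification   # about 0.22
```

A modest bias of 0.05 in the intention-to-treat contrast thus becomes a bias of roughly 0.22 in the IV estimate, on the same risk-difference scale.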

In the context of linear models, Martens et al1 demonstrate that instruments are guaranteed to be weakly correlated

with exposure in the presence of strong confounding because

a strong association between X and U leaves little residual

variation for X to be strongly correlated with the instrument

U* in Figure 2. This problem may be compounded by the use

of surrogate instruments Z.

When Treatment Effects Are Heterogeneous,

Conditions (i*) Through (iii) Are Insufficient to

Obtain Effect Estimates

Even when an instrument is available, additional assumptions are required to estimate the average causal effect of X in the

population. Examples of such assumptions are discussed in the

following paragraphs as well as in the Appendix. Conditions

FIGURE 3. An observational study with exposure X, outcome

Y, and unmeasured factors U in which the variables U* and Z

do not qualify as instruments.


(i*), (ii), and (iii) allow one to compute upper and lower bounds,

but not a point estimate, for the average causal effect. In a 1989

article, Robins8 derived the bounds that can be computed under conditions (i*) and (ii) plus a weak version of condition (iii), as well as under different sets of other unverifiable assumptions. Subsequently, Manski12 derived related results, and Balke and Pearl13 derived narrower bounds under a stronger version of

condition (iii) given in the Appendix; this holds, for example,

when the instrument is a randomized assignment indicator. In a

double-blind randomized trial, confidence intervals for the intention-to-treat effect of Z on Y that lie entirely above zero by a wide

margin show that a positive treatment effect is occurring in a

subset of the population. However, if noncompliance is large

(say, 50%), bounds for the average treatment effect may include

the null hypothesis of zero. This would happen if, for example,

the (unobserved) effect of treatment in the noncompliers were

larger in magnitude and opposite in sign to that in the compliers.
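To illustrate, one simple version of such bounds (a sketch in the spirit of the "natural" bounds of Robins8 and Manski,12 not the narrower Balke–Pearl13 bounds) can be computed by imputing worst-case values for the unobserved potential outcomes within each level of Z and then intersecting across levels. The data below are hypothetical:

```python
# A sketch of simple bounds for the average causal effect of X on Y with a
# dichotomous instrument Z, treatment X, and outcome Y. Within each arm of Z,
# the unobserved potential outcomes are set to their worst-case values (0 or 1);
# because Z is an instrument, the arm-specific bounds can be intersected.
# All data below are hypothetical.

def arm_bounds(y, x, z, treat):
    """Bounds on E[Y(treat)], computed in each instrument arm and intersected."""
    lows, highs = [], []
    for zval in (0, 1):
        idx = [i for i in range(len(z)) if z[i] == zval]
        n = len(idx)
        observed = sum(y[i] for i in idx if x[i] == treat) / n
        missing = sum(1 for i in idx if x[i] != treat) / n
        lows.append(observed)             # unobserved Y(treat) all equal to 0
        highs.append(observed + missing)  # unobserved Y(treat) all equal to 1
    return max(lows), min(highs)

def ate_bounds(y, x, z):
    l1, h1 = arm_bounds(y, x, z, 1)
    l0, h0 = arm_bounds(y, x, z, 0)
    return l1 - h0, h1 - l0

z = [1, 1, 1, 1, 0, 0, 0, 0]
x = [1, 1, 1, 0, 0, 0, 0, 1]
y = [1, 1, 0, 0, 0, 1, 0, 0]
lo, hi = ate_bounds(y, x, z)   # (0.0, 0.5) for these data
```

For these data the average causal effect is bounded between 0 and 0.5: an interval compatible with a sizable effect, yet one that still reaches the null.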

However, Martens et al1 and Brookhart et al2 do present

point estimates—not bounds—for the causal effect of X on Y.

What other assumptions did the authors make either explicitly

or implicitly? The linear structural equation model used by Martens et al assumes that the effect of X on Y on the

mean-difference scale is the same for all subjects. This

assumption of no between-subject heterogeneity in the treat-

ment effect combined with conditions (i*), (ii), and (iii) is

sufficient to identify the effect of X on Y. (A causal effect is

said to be identified if there exists an estimator based on the

observed data (Z, X, Y) that converges to [is consistent for]

the effect in large samples). This assumption will hold under

the sharp null hypothesis that the exposure X has no effect on

any subject’s outcome (in contrast with the “nonsharp” null

hypothesis in which the net effect is still zero but includes

positive effects for some and negative for others). It follows

that, when conditions (i*), (ii), and (iii) hold, the usual IV

estimator will correctly estimate the average treatment effect

of 0 whenever the sharp null hypothesis is true. However,

when the sharp null is false, the assumption of no treatment

effect heterogeneity is biologically implausible for continu-

ous outcomes and logically impossible for dichotomous

outcomes.

There is a weaker, more plausible assumption that,

combined with conditions (i*), (ii) and (iii), still implies the

effect of X on Y is the ratio of the effect of Z on Y to the effect

of Z on X. This is the assumption that the X–Y causal risk

difference is the same among treated subjects with Z = 1 as among treated subjects with Z = 0, and similarly among

untreated subjects.8,14In other words, this assumes that there

is no effect modification, on the additive scale, by Z of the

effect of X on Y in the subpopulations of treated and untreated

subjects (strictly speaking, any effect modification would be

due to the causal instrument U*). The identifying assumption

of no effect modification will not generally hold if the

unmeasured factors U on Figure 2 interact with X on an

additive scale to cause Y. Such effect modification would be

expected in many studies, including that by Brookhart et al.2

There might be effect modification, for example, if the risk

difference for the effect of selective NSAIDs (X) on gastro-

intestinal bleeding (Y) was modified by past history of gas-

tritis (U).

Another assumption that is commonly combined with

conditions (i*), (ii), and (iii) to identify the average effect of

X on Y is the monotonicity assumption. In the context of the

research by Brookhart et al,2 with dichotomous Z and U*,

monotonicity means that no doctor who prefers nonselective

NSAIDs would prescribe selective NSAIDs to any patient

unless all doctors who prefer selective NSAIDs would do so.

Clearly, in the substantive setting of the study by Brookhart

et al, monotonicity is unlikely to hold. In other settings,

monotonicity may be more likely. The monotonicity assump-

tion does not affect the bounds for the average effect of X on

Y in the population (our target parameter so far).8,13However,

in the Appendix, we extend a result by Imbens and Angrist15

to show that, if the assumptions encoded by the DAG in

Figure 2 and the assumption of monotonicity all hold, a

particular causal effect is identified and the usual IV estimator

based on Z consistently estimates this effect. The identified

causal effect is the average effect of X on Y in the subset of

the study population who would be treated (1) with selective

NSAIDs by all doctors whose “prescribing preference” is for

selective NSAIDs and (2) with nonselective NSAIDs by all

doctors whose preference is for nonselective NSAIDs.15 This

subset of the study population can be labeled as the “com-

pliers” because it is analogous to the subset of the population

in randomized experiments (in which the instrument is treat-

ment assignment) who would comply with whichever treat-

ment is assigned to them. A problem with this causal effect is

that we cannot identify the subset of the population (the

“compliers”) the effect estimate refers to. Further, this result

requires that a doctor’s unobserved “prescribing preference”

U* can be assumed to be dichotomous. In the Appendix we

argue that assumptions encoded by the DAG in Figure 2 are

more substantively plausible if U* is a continuous rather than

a dichotomous measure, although in that case a “complier” is

no longer well defined and the interpretation of the IV

estimator based on Z is different (see Appendix).
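The complier interpretation can be checked in a small simulation. In this hypothetical sketch, half the population are compliers and the effect of X on Y is 0.3 in compliers but only 0.1 in always-takers and never-takers; under monotonicity (no defiers), the usual IV estimator recovers the complier effect, not the population average effect:

```python
# A simulation sketch of the monotonicity result: with compliers,
# always-takers, and never-takers (no defiers), the usual IV (Wald)
# estimator converges to the average effect of X on Y among compliers.
# All numbers below are hypothetical.
import random
random.seed(0)

def simulate(n=200000):
    data = []
    types = ["complier", "always-taker", "never-taker"]
    for _ in range(n):
        t = random.choices(types, weights=[0.5, 0.25, 0.25])[0]
        z = random.randint(0, 1)
        x = z if t == "complier" else int(t == "always-taker")
        # baseline risk 0.2; treatment adds 0.3 in compliers, 0.1 otherwise
        risk = 0.2 + x * (0.3 if t == "complier" else 0.1)
        y = int(random.random() < risk)
        data.append((z, x, y))
    return data

def wald(data):
    arm = lambda zval, j: [row[j] for row in data if row[0] == zval]
    m = lambda v: sum(v) / len(v)
    return (m(arm(1, 2)) - m(arm(0, 2))) / (m(arm(1, 1)) - m(arm(0, 1)))

late_estimate = wald(simulate())   # close to 0.3, the effect in compliers
```

The population average effect in this simulation is 0.2 (half the subjects have effect 0.3, half have effect 0.1), so the IV estimate of roughly 0.3 pertains only to the unidentifiable subset of compliers.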

The assumptions of monotonicity and no effect modi-

fication by Z on an additive (risk difference) scale by no

means exhaust the list of assumptions that serve to identify

causal effects. Alternative identifying assumptions can result

in estimators of the average effect of X that differ from the

usual IV estimator. For example, in the Appendix, we show

that the assumption of no effect modification by Z on a multiplicative (risk ratio) scale within both levels of X identifies the average causal effect.8,10 However, under this assumption, the estimated ratio of the average effect of Z on Y

to the average effect of Z on X is now biased (inconsistent) for

the average causal effect of X on Y; in the Appendix we

provide a consistent (asymptotically normal) estimator for the

treatment effect.8,10
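As an illustration of how such an alternative estimator can work, the following sketch solves a g-estimation equation for a multiplicative structural mean model in closed form for dichotomous X: psi is chosen so that E[Y·exp(−psi·X) | Z] does not depend on Z. This is a simplified reading of the approach, not the authors' estimator from the Appendix, and the data are hypothetical:

```python
# A sketch of g-estimation under a multiplicative structural mean model:
# find psi such that E[Y * exp(-psi * X) | Z] does not depend on Z.
# For dichotomous X this has a closed form; exp(psi) is then interpreted
# as a causal risk ratio under the model. All data below are hypothetical.
from math import log

def msmm_psi(y, x, z):
    def m(f, zval):
        idx = [i for i in range(len(z)) if z[i] == zval]
        return sum(f(i) for i in idx) / len(idx)
    yx1 = m(lambda i: y[i] * x[i], 1)          # E[Y*X | Z=1]
    yx0 = m(lambda i: y[i] * x[i], 0)          # E[Y*X | Z=0]
    yu1 = m(lambda i: y[i] * (1 - x[i]), 1)    # E[Y*(1-X) | Z=1]
    yu0 = m(lambda i: y[i] * (1 - x[i]), 0)    # E[Y*(1-X) | Z=0]
    # E[Y e^{-psi X} | Z=1] = E[Y e^{-psi X} | Z=0]  =>  solve for e^{-psi}
    return -log((yu0 - yu1) / (yx1 - yx0))

z = [1, 1, 1, 1, 0, 0, 0, 0]
x = [1, 1, 1, 0, 1, 0, 0, 0]
y = [1, 1, 1, 0, 1, 1, 0, 0]
psi = msmm_psi(y, x, z)   # log(2) for these data: risk ratio exp(psi) = 2
```

Note that this psi generally differs from what the usual additive-scale IV estimator would return on the same data, which is the point of the passage above.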

Because all identifying assumptions are unverifiable,

Robins and Greenland16 argued that it is useful to estimate

upper and lower bounds for the effect, instead of (or in

addition to) point estimates and confidence intervals obtained

under various explicit unverifiable assumptions. Such esti-

mates help to make clear “the degree to which public health


decisions are dependent on merging the data with strong prior

beliefs.” As noted above, the problem with bounds is that the

resulting interval may be too wide and therefore not very

informative. (Further, there will be 95% confidence intervals

around the upper and the lower bound attributable to sam-

pling variation.)

In addition, when it is necessary to condition on con-

tinuous (or many discrete) preinstrument covariates to try to

ensure that the effect of Z on Y is unconfounded, the validity

of IV estimates based on parametric linear models for a

binary response Y also requires as usual both a correctly

specified functional form for the covariates effects and esti-

mated conditional probabilities that lie between zero and one.

The Standard IV Methodology Deals Poorly

With Time-Varying Exposures

Most epidemiologic exposures are time-varying. For

example, Brookhart et al2 compared the risks after prescription of either selective or nonselective NSAIDs, regardless of

whether patients stayed on the assigned drug during the

follow-up. In other words, the treatment variable was consid-

ered to not be time-varying, and the authors estimated an

observational analog of the intention-to-treat effect com-

monly estimated from randomized experiments. However, in

reality, patients may discontinue or switch their assigned

treatment over time. When this lack of adherence to the initial

treatment is not due to serious side effects, one could be more

interested in comparing the risks had the patients followed

their assigned treatment continuously during the follow-up.

In the presence of time-varying instruments, exposures,

and confounders, Robins’s g-estimation of nested structural

models10,17–19 can be used to estimate causal effects. Nested

structural models achieve identification by assuming a non-

saturated model for the treatment effect at each time t (mea-

sured on either an additive or multiplicative scale) as a

function of a subject’s treatment, instrument, and covariate

history through t. These models naturally allow the analyst

(1) to obtain asymptotically unbiased point estimates of the

treatment effect in the treated study population, (2) to characterize, through sensitivity analysis, the effect of violations of the model assumptions on one's inference, (3) to adjust

for baseline and time-varying continuous and discrete con-

founders of the instrument-outcome association, (4) to in-

clude continuous and multivariate instruments and treat-

ments, and (5) to use doubly-robust estimators. In the

Appendix we show that the linear structural equations of

Martens et al1are a simple case of a nested structural mean

model. Robins’s methods apply to continuous, count, failure

time, and rare dichotomous responses but not to nonrare

dichotomous responses.20 For nonrare dichotomous responses, a new extension due to Van der Laan et al21 can be

used. For treatments and instruments that are not time-

varying, Tan22 has shown how to achieve many of properties (1) through (5) under a model that achieves identification of

causal effects by assuming monotonicity.
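For a single time point, the additive version of such a structural mean model reduces to a simple estimating equation, which may help fix ideas: choose psi so that the "blipped-down" outcome Y − psi·X is uncorrelated with the instrument Z. The sketch below uses hypothetical data; with one time point the solution coincides with the usual IV estimator:

```python
# A minimal sketch of g-estimation for a one-time-point additive structural
# mean model: find psi such that H(psi) = Y - psi*X is uncorrelated with
# the instrument Z. With a single time point the solution has a closed form
# and equals the usual IV (Wald) estimate. Data are hypothetical.

def g_estimate(y, x, z):
    n = len(z)
    zbar = sum(z) / n
    # solve sum_i (z_i - zbar) * (y_i - psi * x_i) = 0 for psi
    cov = lambda a: sum((z[i] - zbar) * a[i] for i in range(n)) / n
    return cov(y) / cov(x)

z = [1, 1, 1, 1, 0, 0, 0, 0]
x = [1, 1, 1, 0, 0, 0, 0, 1]
y = [1, 1, 0, 0, 0, 1, 0, 0]
psi = g_estimate(y, x, z)   # 0.5 for these data
```

The time-varying case replaces this single psi with a blip function of treatment, instrument, and covariate history at each time, estimated by the analogous equations.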

CONCLUSION

We have reviewed how, in observational research, the

use of instrumental variables methods replaces the unverifi-

able assumption of no unmeasured confounding for the treat-

ment effect with other unverifiable assumptions such as “no

unmeasured confounding for the effect of the instrument” and

“no direct effect of the instrument.” Hence, the fundamental

problem of causal inference from observational data—the reliance on assumptions that cannot be empirically verified—is not solved but simply shifted to another realm. As

always, investigators must apply their subject-matter knowl-

edge to study design and analysis to enhance the plausibility

of the unverifiable assumptions.

Further, when conditions (i*), (ii), and (iii) do not hold,

the direction of bias of IV estimates may be counterintuitive

for epidemiologists accustomed to conventional approaches

for confounding adjustment. For example, Brookhart et al2

found a much bigger effect estimate using IV methods

(−1.81) than the effect estimated by the randomized trial (−0.47), whereas conventional methods were unable to detect a beneficial effect of selective NSAIDs. The conventional

unadjusted and adjusted estimates were quite close (0.10 and

0.07, respectively), despite careful adjustment for most of the

known indications and risk factors for the outcome. If the

assumptions required for the validity of the usual IV estima-

tor held and these differences were not the result of sampling

variability, the aforementioned estimates would imply that

the magnitude of unmeasured confounding (from 0.07 to −1.81) is much greater than the magnitude of the measured

confounding (from 0.10 to 0.07). An alternative explanation

is that the IV assumptions do not hold and the IV estimate is

biased in the apparently counterintuitive direction of exag-

gerating the protective effect.

In summary, Martens et al1 are right: IV methods are

not an epidemiologist’s dream come true. Nonetheless, they

certainly deserve greater attention in epidemiology, as shown

by the interesting application presented by Brookhart et al.2 But users of IV methods need to be aware of the limitations

of these methods. Otherwise, we risk transforming the meth-

odologic dream of avoiding unmeasured confounding into a

nightmare of conflicting biased estimates.

APPENDIX

This appendix is organized in 5 sections. The first

section describes 4 mathematical representations of causal

effects—counterfactuals, causal directed acyclic graphs, non-

parametric structural equation models, linear structural equa-

tions models—and their relations. The second section de-

scribes IV estimators that identify the average causal effect of

X on Y in the population by using no interaction assumptions.

We show that these estimators can be represented by param-

eters of particular structural mean models. The third section

describes IV estimators that identify the average causal effect

of X on Y in certain subpopulations by using monotonicity

assumptions. The fourth section contains important exten-

sions. The last section contains the proofs of the theorems

presented in the first 3 sections.

1. Representations of Causal Effects

As mentioned in the main text, IV estimators have been

defined using 4 different mathematical representations of
