
American Journal of Epidemiology

© The Author 2010. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Vol. 172, No. 7

DOI: 10.1093/aje/kwq198

Advance Access publication: August 17, 2010

Practice of Epidemiology

Treatment Effects in the Presence of Unmeasured Confounding: Dealing With

Observations in the Tails of the Propensity Score Distribution—A Simulation

Study

Til Stürmer*, Kenneth J. Rothman, Jerry Avorn, and Robert J. Glynn

* Correspondence to Dr. Til Stürmer, Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, McGavran-Greenberg Hall, CB 7435, Chapel Hill, NC 27599-7435 (e-mail: til.sturmer@post.harvard.edu).

Initially submitted February 23, 2010; accepted for publication May 26, 2010.

Frailty, a poorly measured confounder in older patients, can promote treatment in some situations and discour-

age it in others. This can create unmeasured confounding and lead to nonuniform treatment effects over the

propensity score (PS). The authors compared bias and mean squared error for various PS implementations under

PS trimming, thereby excluding persons treated contrary to prediction. Cohort studies were simulated with a binary

treatment T as a function of 8 covariates X. Two of the covariates were assumed to be unmeasured strong risk

factors for the outcome and present in persons treated contrary to prediction. The outcome Y was simulated as

a Poisson function of T and all X’s. In analyses based on measured covariates only, the range of PS’s was trimmed

asymmetrically according to the percentile of PS in treated patients at the lower end and in untreated patients at the

upper end. PS trimming reduced bias due to unmeasured confounders and mean squared error in most scenarios

assessed. Treatment effect estimates based on PS range restrictions do not correspond to a causal parameter but

may be less biased by such unmeasured confounding. Increasing validity based on PS trimming may be a unique

advantage of PS’s over conventional outcome models.

bias (epidemiology); causal inference; cohort studies; confounding factors (epidemiology); epidemiologic methods;

models, statistical; propensity score; research design

Abbreviations: IPTW, inverse probability of treatment weighting; PS, propensity score; RR, rate ratio.

Restriction of treatment comparisons to subjects with a

common range of covariates (e.g., age) or any summary

score of covariates (1) can improve the validity of effect

estimates regardless of the analytic technique used. Such

restriction provides a pragmatic focus on persons for whom

uncertainty regarding the value of treatment is most rele-

vant. In practice, implementation of such restrictions can be

complicated and is rarely done outside of propensity score

(PS) analyses (2, 3).

PS analyses offer some advantages in the context of non-

experimental treatment comparisons. These include a focus

on treatment assignment, improved control of confounding

with scarce outcomes, and the ability to easily match co-

horts on a large number of covariates (3). PS analyses do

not offer any advantages with respect to unmeasured con-

founders, however (4).

Frailty is a plausible explanation for paradoxical treat-

ment effects observed in the elderly (5). Frailty may reduce

the likelihood of a particular treatment if physicians focus

on a patient’s main medical problem and do not initiate

useful therapies for alternative conditions (6). The practi-

tioner may determine that in the presence of competing

risks, a new therapy offers little expected benefit (7). Con-

versely, in patients with short life expectancies, physicians

may be more willing to try therapies with potentially serious

side effects as a last resort. Thus, if mortality is the outcome

of interest, frailty can be a powerful confounder that is

difficult to measure and can either increase or decrease the

likelihood of treatment. Although we describe the problem

using the terminology of pharmacoepidemiology, the issues

are more general and the principles should apply broadly to

any type of epidemiologic study.

Am J Epidemiol 2010;172:843–854

Downloaded from http://aje.oxfordjournals.org/ by guest on April 13, 2012


Recent studies have provided examples of strong hetero-

geneity of treatment effect estimates over the PS that may be

explained by confounding due to unmeasured frailty (8, 9).

In one study of the effects of thrombolysis on all-cause in-

hospital mortality among patients with stroke, mortality was

much higher in the 17 stroke patients (out of a total of 212)

who received thrombolytic therapy despite having the low-

est PS for receiving it (41% mortality), in comparison with

the remaining 195 patients (14% mortality) (8). These 17

patients with a very low predicted probability of receiving

treatment may have received it because they were very

frail—that is, as a treatment of last resort.

In another study that addressed the effects of treatment

with biologics on all-cause mortality in patients with rheu-

matoid arthritis, mortality was much higher in the untreated

patients in the highest PS quintile (72/1,000 person-years)

than in the remainder of the untreated patients (11/1,000

person-years) (9). Frailty may also explain this difference,

if the high-risk untreated patients did not receive the treat-

ment that they might have received given their clinical con-

dition because they were deemed too frail by the treating

physician (treatment withheld).

If increases in mortality in a few patients who are treated

contrary to prediction are due to unmeasured frailty, then

treatment effects over the PS will appear heterogeneous, and

excluding some or all of the patients treated contrary to

prediction could reduce unmeasured confounding by this

frailty under the assumption of uniform effects (10). In

theory, if we excluded all patients with unmeasured frailty,

the resulting treatment effect estimate would not be biased

from unmeasured confounding by frailty. In practice, ex-

cluding all such patients will be impossible. Excluding in-

creasing proportions of those treated contrary to prediction,

however, would increase internal validity at the price of not

being able to describe precisely the population to which the

treatment effect estimate would apply (11, 12). In other

words, the treatment effect that is estimated would not be

a causal parameter even when implementing the PS in a way

that should produce a causal estimate (13). If, contrary to

this assumption, treatment effect heterogeneity is real and

not due to unmeasured confounding, then excluding some

patients will affect the generalizability of the results.

Our aim in this simulation study was to compare bias and

mean squared error in the treatment effect estimates for

varying degrees of asymmetric restriction of the PS distri-

bution under the assumption of the presence of unmeasured

frailty that leads to “last resort” treatment, “treatment withheld,” or both. We analyzed the data using a variety of

methods to control for confounding using PS’s.

MATERIALS AND METHODS

Simulation

To simulate confounding by frailty in persons who are treated contrary to prediction, we used a 2-step process to define covariates (see Figure 1). We started with 3 dichotomous covariates, X1, X2, and X3, each with a prevalence of 0.2, and 3 continuous covariates, X4, X5, and X6, each with a mean of 0 and unit variance. All covariates were independent of one another. We then calculated the predicted probability of the dichotomous intended treatment T based on these 6 “measured” covariates and the covariate-treatment associations presented in Table 1, using a logistic model:

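\[
p(T \mid X_1\text{–}X_6) = \left(1 + \exp\left(-\left(\alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \alpha_4 X_4 + \alpha_5 X_5 + \alpha_6 X_6\right)\right)\right)^{-1}. \tag{1}
\]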

We used this probability of intended treatment to assign 2 additional dichotomous covariates. One, X7, was defined as most likely to be present when the intended treatment was least likely. X7 was set to 1 (present) when a random uniform number was less than or equal to [c − p(T|X1–X6)] and set to 0 otherwise. Thus, observations with a probability of intended treatment close to 0 would be most likely to have X7 = 1. The second covariate, X8, was likely to be present when the intended treatment was most likely. X8 was set to 1 (present) when a random uniform number was less than or equal to [p(T|X1–X6) − d] and set to 0 (absent) otherwise. Thus, observations with a probability of intended treatment close to 1 would be most likely to have X8 = 1. The values for c and d were chosen to result in a low prevalence of both X7 and X8 (see Table 1). We assumed a low prevalence for persons treated contrary to prediction, in keeping with the empirical examples.
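The first simulation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the log odds ratios come from the ORT column of Table 1, but the intercept `ALPHA0` and the constants `C` and `D` are hypothetical placeholders (the paper tunes them to reach target prevalences), and the continuous covariates are assumed to be standard normal, which the text does not state.

```python
import math
import random

# Log odds ratios for X1..X6 as listed in Table 1 (alpha = log(ORT)).
ALPHAS = [math.log(v) for v in (2.0, 1.0, 0.2, 1.5, 1.0, 0.5)]
ALPHA0 = -1.5   # hypothetical intercept; would need calibration
C = 0.08        # hypothetical value of c; would need tuning
D = 0.60        # hypothetical value of d; would need tuning


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_covariates(n, rng):
    """Return a list of dicts holding X1..X8 and the intended-treatment probability."""
    rows = []
    for _ in range(n):
        # X1-X3: independent binary covariates with prevalence 0.2.
        x = [1 if rng.random() < 0.2 else 0 for _ in range(3)]
        # X4-X6: independent, mean 0, unit variance (normality is an assumption).
        x += [rng.gauss(0.0, 1.0) for _ in range(3)]
        # Equation 1: probability of intended treatment from measured covariates.
        p_intended = expit(ALPHA0 + sum(a * xi for a, xi in zip(ALPHAS, x)))
        # X7 ("last resort"): possible only when intended treatment is unlikely.
        x7 = 1 if rng.random() <= C - p_intended else 0
        # X8 ("treatment withheld"): possible only when intended treatment is likely.
        x8 = 1 if rng.random() <= p_intended - D else 0
        rows.append({"x": x + [x7, x8], "p_intended": p_intended})
    return rows


rng = random.Random(2010)
cohort = simulate_covariates(10_000, rng)
prev_x7 = sum(r["x"][6] for r in cohort) / len(cohort)
prev_x8 = sum(r["x"][7] for r in cohort) / len(cohort)
print(f"prevalence of X7 = {prev_x7:.4f}, prevalence of X8 = {prev_x8:.4f}")
```

Because a uniform number is never negative, X7 can equal 1 only when p(T|X1–X6) is at most c, and X8 only when it is at least d, which is what concentrates these covariates in the tails.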

Based on these 8 covariates (i.e., the “measured” covariates X1–X6 and the “unmeasured” covariates X7 and X8), we then recalculated the probability of actual treatment, again using a logistic model:

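\[
p(T \mid X_1\text{–}X_8) = \left(1 + \exp\left(-\left(\alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \alpha_4 X_4 + \alpha_5 X_5 + \alpha_6 X_6 + \alpha_7 X_7 + \alpha_8 X_8\right)\right)\right)^{-1}. \tag{2}
\]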

Figure 1. Conceptual diagram of a simulation study depicting treatment (T) and outcome (Y) as a function of measured covariates (X1–X6) and unmeasured covariates (X7 and X8). The solid lines represent causal associations, and the dashed lines represent noncausal associations used in the 2-step simulation process to mimic treatment contrary to prediction by measured covariates (X1–X6). [Diagram nodes: T, Y, X1–X8, and p(T|X1–X6).]


The actual treatment T was then assigned on the basis of this probability using a random uniform number. Finally, the expected number of disease outcomes Y over a fixed follow-up time interval was derived from all 8 covariates and the treatment T using a log-linear model:

\[
E(Y \mid T, X_1\text{–}X_8) = \exp\left(\beta_0 + \beta_1 X_1 + \cdots + \beta_8 X_8 + \beta_T T\right). \tag{3}
\]

The number of outcomes Y was assigned using a random number from a Poisson distribution based on this expected value. The Poisson outcome and the log-linear outcome model were chosen because the incidence rate ratios obtained are collapsible under exchangeability (14) and therefore allow direct comparisons between the analytic strategies (15).
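The second simulation step (equations 2 and 3) might look like the sketch below. The coefficients are the logs of the ORT and RRY values listed in Table 1, with X7 and X8 at their non-null settings; the intercepts `ALPHA0` and `BETA0` are hypothetical stand-ins for the calibrated values, and the Poisson sampler is a standard textbook method chosen here only to keep the example free of dependencies.

```python
import math
import random

# Log-scale coefficients for X1..X8 from the ORT and RRY columns of Table 1.
ALPHAS = [math.log(v) for v in (2.0, 1.0, 0.2, 1.5, 1.0, 0.5, 10.0, 0.1)]
BETAS = [math.log(v) for v in (1.0, 2.0, 0.2, 1.0, 1.5, 0.5, 10.0, 10.0)]
BETA_T = math.log(2.0)   # true treatment rate ratio of 2.0
ALPHA0 = -1.5            # hypothetical intercept (calibrated in the paper)
BETA0 = math.log(0.1)    # hypothetical intercept (calibrated in the paper)


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def poisson_draw(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def assign_treatment_and_outcome(x, rng, alpha0=ALPHA0, beta0=BETA0):
    """x holds X1..X8; returns (T, Y, expected Y) for one simulated person."""
    # Equation 2: probability of actual treatment from all 8 covariates.
    p_actual = expit(alpha0 + sum(a * xi for a, xi in zip(ALPHAS, x)))
    t = 1 if rng.random() < p_actual else 0
    # Equation 3: expected number of outcomes from the log-linear model.
    expected_y = math.exp(beta0 + sum(b * xi for b, xi in zip(BETAS, x)) + BETA_T * t)
    # Observed outcome count drawn from a Poisson distribution.
    y = poisson_draw(expected_y, rng)
    return t, y, expected_y


rng = random.Random(7)
x_example = [0, 1, 0, 0.3, -0.5, 1.2, 0, 0]   # an arbitrary covariate pattern
t, y, expected_y = assign_treatment_and_outcome(x_example, rng)
print(t, y, round(expected_y, 4))
```

For any fixed covariate pattern, the expected count in a treated person is exactly twice that in an untreated person, which is the true rate ratio the analyses try to recover.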

The range of values covered in the simulation study is presented in Table 1. The 6 measured covariates X1–X6 were associated only with treatment (X1 and X4), associated only with outcome (X2 and X5), or associated with both treatment and outcome (X3 and X6). X7 was strongly positively associated with both actual treatment and outcome (or not, in scenarios where the corresponding parameter was set to 1), mimicking frailty that leads to “last resort” treatment. X8 was strongly inversely associated with actual treatment and positively with outcome (or not), mimicking frailty that leads to “treatment withheld.” The parameter value for α0 in equation 2 was chosen to result in a prevalence of T of approximately 0.2 or 0.05, and the value for β0 in equation 3 was chosen to result in an incidence of approximately 0.1 per observation over a fixed follow-up time in the untreated. For each scenario or parameter constellation, we simulated 1,000 closed cohort studies with n = 10,000.
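The text does not say how the intercepts were chosen. One possible approach, shown here purely as an assumption, is to solve numerically for the intercept that makes the mean predicted probability equal the target prevalence. The function name `calibrate_intercept` and the synthetic linear predictors are illustrative only.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def calibrate_intercept(linear_predictors, target_prevalence,
                        lo=-20.0, hi=20.0, tol=1e-10):
    """Find a0 so that the mean of expit(a0 + lp) equals target_prevalence."""
    def mean_prob(a0):
        return sum(expit(a0 + lp) for lp in linear_predictors) / len(linear_predictors)

    # mean_prob increases strictly with a0, so bisection converges.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_prob(mid) < target_prevalence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


rng = random.Random(1)
# Hypothetical linear predictors (intercept excluded) for 2,000 subjects,
# built from one binary covariate (OR 2.0) and one continuous covariate (OR 1.5).
lps = [math.log(2.0) * (1 if rng.random() < 0.2 else 0)
       + math.log(1.5) * rng.gauss(0.0, 1.0)
       for _ in range(2000)]
alpha0 = calibrate_intercept(lps, target_prevalence=0.2)
achieved = sum(expit(alpha0 + lp) for lp in lps) / len(lps)
print(round(alpha0, 4), round(achieved, 4))
```

The same idea could be applied to the outcome intercept by replacing the inverse logit with the exponential function and targeting the incidence in the untreated.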

Analysis

PS estimation and implementation. We first estimated the PS based on the measured covariates X1–X6, denoted PSX1–X6, using logistic regression. The treatment-outcome incidence rate ratio controlling for confounding by the measured X’s was estimated using log-linear models and 5 different methods to implement PSX1–X6: modeling, stratification assuming uniform effects, stratification not assuming uniform effects, matching, and weighting (4).

We first estimated the rate ratio based on treatment and PSX1–X6 as a continuous covariate, not to encourage this PS implementation but because it is widely used in medical research (16). We then stratified the study population into 5 equal-sized strata of PSX1–X6 based on the overall (marginal) distribution of PSX1–X6. We used 5 strata because that number of strata has been shown to be sufficient to remove most confounding (17) and thus has become a widely used approach in stratifying PS’s (16). We estimated the rate ratio based on a model including treatment and 4 indicator variables for PSX1–X6 quintiles 2–5.

Table 1. Parameters Covered in the Simulation Study and Their Valuesᵃ

Variable  Prevalence                          ORTᵇ    Parameter  Equation No(s).ᶜ  RRYᵈ   Parameter  Equation No(s).ᶜ
--------  ----------------------------------  ------  ---------  ----------------  -----  ---------  ----------------
X1        0.2                                 2.0     α1         1, 2              1.0    β1         3
X2        0.2                                 1.0     α2         1, 2              2.0    β2         3
X3        0.2                                 0.2     α3         1, 2              0.2    β3         3
X4        Continuous (0, 1)                   1.5     α4         1, 2              1.0    β4         3
X5        Continuous (0, 1)                   1.0     α5         1, 2              1.5    β5         3
X6        Continuous (0, 1)                   0.5     α6         1, 2              0.5    β6         3
X7        0.01ᵉ,ᶠ                             1, 10   α7         2                 1, 10  β7         3
X8        0.01ᶠ,ᵍ                             1, 0.1  α8         2                 1, 10  β8         3
T         0.2, 0.05ᶠ                                  α0         2                 2.0    βT         3
Y         0.1ᶠ (incidence rate in untreated)                                              β0         3

Abbreviations: OR, odds ratio; RR, rate ratio.

ᵃ Parameter values are chosen to represent a study with both prevalent and rare treatment and a low incidence of outcomes over a fixed follow-up time. Covariates are either instruments (X1 and X4), risk factors for the outcome (X2 and X5), or confounders (X3, X6, X7, X8). X7 and X8 are strongly associated with both treatment and outcome but very rare, to mimic few patients treated contrary to prediction. Some parameter values are set to 1 (no association) for the tables separating “last resort” treatment from “treatment withheld.”

ᵇ Odds ratio for the relation between the covariate and treatment T; parameters are for log(ORT).

ᶜ For equations, see text.

ᵈ Rate ratio for the relation between the covariate and the outcome Y; parameters are for log(RRY).

ᵉ “Last resort” treatment if random uniform number ≤ [c − p(T|X1–X6)]; c was chosen so that the prevalence of X7 was close to 0.01.

ᶠ Approximate numbers.

ᵍ “Treatment withheld” if random uniform number ≤ [p(T|X1–X6) − d]; d was chosen so that the prevalence of X8 was close to 0.01.

In addition to these 2 PS implementations based on the assumption of uniform effects, we analyzed the data using 3 different approaches that do not rely on this assumption. First, we combined the 5 PSX1–X6 quintile-specific treatment effect estimates based on the standardized mortality ratio, that is, using weights that reflect the distribution of treated patients over the quintiles as the standard. We then tried to find untreated matches for every treated patient based on the estimated PSX1–X6 (1:1 individual matching). We used 5-digit to 1-digit matching: starting with a very narrow caliper of the PSX1–X6 (±0.000005) to find an untreated match for every treated observation without replacement and gradually increasing the width of the caliper up to ±0.05 if no match could be found (18). Within the matched data set, we estimated the unconfounded treatment effect without taking matching into account. This approach is commonly used and valid, though it is slightly less efficient than taking matching into account (19). Both the standardized mortality ratio method and matching as we implemented it result in an estimate of a causal treatment effect in the treated, in the presence of nonuniform treatment effects (13).
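The matching step with a gradually widening caliper can be illustrated with the sketch below. It is a simplified greedy version written for this example, not the macro cited in reference 18: the text gives only the narrowest (±0.000005) and widest (±0.05) calipers, so the intermediate tenfold steps are an assumption, as is the rule of taking the nearest available untreated observation within the caliper.

```python
def greedy_caliper_match(treated_ps, untreated_ps,
                         calipers=(0.000005, 0.00005, 0.0005, 0.005, 0.05)):
    """Greedy 1:1 matching without replacement.

    Returns a dict mapping treated index -> matched untreated index.
    Treated observations with no match within the widest caliper are dropped.
    """
    matches = {}
    available = set(range(len(untreated_ps)))
    for caliper in calipers:                      # narrowest caliper first
        for i, ps_t in enumerate(treated_ps):
            if i in matches:
                continue                          # already matched in an earlier pass
            best, best_dist = None, None
            for j in available:
                dist = abs(untreated_ps[j] - ps_t)
                if dist <= caliper and (best_dist is None or dist < best_dist):
                    best, best_dist = j, dist
            if best is not None:
                matches[i] = best
                available.remove(best)            # match without replacement
    return matches


# Tiny illustrative example with made-up propensity scores.
treated = [0.100001, 0.30, 0.90]
untreated = [0.10, 0.32, 0.10003, 0.50]
matches = greedy_caliper_match(treated, untreated)
print(matches)
```

In this toy example the first treated observation is matched in the narrowest pass, the second only in the widest pass, and the third (PS 0.90) stays unmatched, which is how matching ends up restricting the analysis to a common PS range. The double loop is quadratic and is meant for illustration rather than for cohorts of 10,000.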

Finally, we analyzed the data using inverse probability

of treatment weighting (IPTW). IPTW creates a pseudo-

population in which the association between covariates

and treatment is removed by weighting each observation

by the inverse of the probability of receiving the actual

treatment. To end up with a sum of weights close to the size

of the original study population, we used stabilized

weights—that is, we multiplied the IPTW weights by the

marginal prevalence of the treatment actually received (20).

We used a (conservative) robust variance estimation. IPTW

produces an estimate of a causal treatment effect in the

population in the presence of nonuniform treatment effects

(13, 20).
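As an illustration of the stabilized weights described above, the following sketch multiplies the inverse probability of the treatment actually received by the marginal prevalence of that treatment. The function name and the example values are hypothetical.

```python
def stabilized_iptw_weights(treatment, ps):
    """treatment: list of 0/1 flags; ps: estimated P(T = 1 | X) per observation."""
    prev_treated = sum(treatment) / len(treatment)
    weights = []
    for t, p in zip(treatment, ps):
        if t == 1:
            # Treated: marginal P(T = 1) divided by P(T = 1 | X).
            weights.append(prev_treated / p)
        else:
            # Untreated: marginal P(T = 0) divided by P(T = 0 | X).
            weights.append((1.0 - prev_treated) / (1.0 - p))
    return weights


# Made-up example: 2 treated and 8 untreated observations.
treatment = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ps = [0.02, 0.50, 0.10, 0.10, 0.20, 0.20, 0.20, 0.30, 0.10, 0.28]
weights = stabilized_iptw_weights(treatment, ps)
print([round(w, 3) for w in weights])
```

The first treated observation has a PS of only 0.02, so it receives a weight of about 10 while the other treated observation receives 0.4. This shows how a single person treated contrary to prediction can dominate a weighted analysis.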

PS trimming. All of the above analyses were first con-

ducted without any restriction of the PS range. We then

restricted the analysis to observations within a PS range that

was common to both treated and untreated persons—that is,

by excluding all patients in the nonoverlapping parts of the

PS distribution (see Figure 2). Individual matching on the

PS also effectively resulted in a PS range that is common

to treated and untreated persons.

We then applied additional asymmetric PS trimming in

order to exclude those patients who were treated most con-

trary to prediction. We assessed 3 different cutpoints corre-

sponding to the 1st and 99th percentiles, the 2.5th and 97.5th

percentiles, and the 5th and 95th percentiles of the PS dis-

tribution in the treated and untreated patients, respectively.

Stratification into quintiles and matching were performed

after trimming.
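The asymmetric trimming rule can be sketched as follows: the lower cutpoint is a low percentile of the PS among the treated, and the upper cutpoint is a high percentile of the PS among the untreated. The percentile definition (linear interpolation) and the function names are assumptions made for this example; the text does not specify how percentiles were computed.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in 0..100 (one common definition)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(s) - 1)
    return s[lower] + (s[upper] - s[lower]) * (pos - lower)


def asymmetric_trim(records, lower_pct=2.5, upper_pct=97.5):
    """records: list of (treated_flag, ps) pairs.

    Returns the records kept after trimming, plus the two cutpoints.
    """
    treated_ps = [ps for t, ps in records if t == 1]
    untreated_ps = [ps for t, ps in records if t == 0]
    # Lower cutpoint from the treated: drops those treated despite a very low PS.
    low_cut = percentile(treated_ps, lower_pct)
    # Upper cutpoint from the untreated: drops those untreated despite a very high PS.
    high_cut = percentile(untreated_ps, upper_pct)
    kept = [(t, ps) for t, ps in records if low_cut <= ps <= high_cut]
    return kept, low_cut, high_cut


# Made-up example: treated PS from 0.05 to 0.95, untreated PS from 0.01 to 0.81.
records = [(1, 0.05 + 0.009 * i) for i in range(101)]
records += [(0, 0.01 + 0.008 * i) for i in range(101)]
kept, low_cut, high_cut = asymmetric_trim(records, lower_pct=5, upper_pct=95)
print(len(records), len(kept), round(low_cut, 3), round(high_cut, 3))
```

Calling the function with `lower_pct=0` and `upper_pct=100` gives the minimum PS in the treated and the maximum PS in the untreated as cutpoints, which corresponds to the common-range restriction when the distributions are arranged as in Figure 2. Stratification and matching would then be carried out on `kept`.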

RESULTS

In Table 2, we present the mean number of observations and mean incidence rates in treated and untreated persons, as well as the corresponding rate ratio from the simulated data sets, according to a combination of the PSX1–X6 percentiles in treated and untreated patients. At the lower end of the PSX1–X6 range (up to the 5th percentile), percentiles are derived from the distribution of PSX1–X6 in the treated. All other percentiles, including those at the high end of the distribution, are derived from the distribution of PSX1–X6 in the untreated. This approach allows us to concentrate on the patients treated contrary to prediction (which would otherwise be swamped by patients treated according to prediction). It also leads to untreated patients below the 0th percentile and treated patients above the 100th percentile.

The first set of rows in Table 2 is based on the “last resort treatment” hypothesis, in which very sick patients receive a treatment contrary to the prediction of no treatment; it mimics the results presented in Table 2 of the Kurth et al. paper (8). Added to the monotonic decrease of incidence rates in both treated and untreated persons with decreasing PSX1–X6, the presence of the unmeasured covariate X7 leads to “abnormally” high incidence rates in the treated with low PSX1–X6. Because we know that the true rate ratio is 2.0, the higher rate ratios are confounded by X7. There is some residual confounding despite stratification on narrow PSX1–X6 strata, even at the high end of PSX1–X6 (e.g., for the 99th–100th percentile, rate ratio (RR) = 2.14). The maximum rate ratio in any stratum is less extreme than the most extreme stratum-specific rate ratio reported by Kurth et al. (8) (RR = 13).

Figure 2. Schematic of asymmetric range restriction. PS, propensity score. [Schematic: frequency (vertical axis) of the PS (horizontal axis) in treated and untreated persons, showing the nonoverlap regions at both ends and the range restriction cutpoints at the lowest percentiles of the treated and the highest percentiles of the untreated.]


The second set of rows in Table 2 is based on the “treatment withheld” hypothesis, in which a very frail patient does not receive a treatment as expected because of severe disability and/or multiple concurrent medical conditions. It mimics the results presented by Lunt et al. (9) in their Table 4. Added to the monotonic increase of incidence rates in both treated and untreated persons with increasing PSX1–X6, the presence of the unmeasured covariate X8 leads to “abnormally” high incidence rates in the untreated with high PSX1–X6. This pattern is more difficult to detect because the increase of incidence rates in the untreated over PSX1–X6 remains monotonic. High incidence rates in the untreated patients lead to enough confounding to reverse the direction of the association, resulting in apparently “protective” effect estimates confounded by X8. The minimum rate ratio in any stratum is less extreme than the most extreme stratum-specific rate ratio reported by Lunt et al. (9) (RR = 0.24).

The bottom set of rows in Table 2 combines the “last resort treatment” with the “treatment withheld” hypothesis. These simulations show both overestimated rate ratios in the lowest PSX1–X6 strata and underestimated rate ratios in the highest PSX1–X6 strata.

In Table 3, we present the treatment effect estimates obtained with various restrictions of the PSX1–X6 under the “last resort treatment” hypothesis. Note that the true rate ratio equals 2.0 and that all PS analyses presented rely exclusively on control for the measured covariates X1–X6. The main confounding is due to measured covariates (crude RR = 3.52 vs. RR = 2.13 based on the outcome model for a treatment prevalence of 0.2). There is, however, some uncontrolled confounding due to the unmeasured X7. The confounding by X7 is not strong despite its strong associations with both treatment and outcome because the prevalence of X7 = 1 is only 0.01 (Table 1).

Bias due to the unmeasured confounder X7 is reduced by asymmetric PS trimming in most implementations of the PS (the rate ratio moves closer to the true value of 2.0 and the mean squared error gets smaller). The exception is PS matching, where bias is constant (p(T = 1) = 0.2) or increases with increasing range restrictions (p(T = 1) = 0.05). PS matching provides the least bias without restriction, however, and remains among the least biased implementations with a 5–95 range restriction. With a lower prevalence of the treatment (p(T = 1) = 0.05), IPTW becomes most biased without range restriction. A lower prevalence of treatment leads to more extreme weights in the patients who receive treatment contrary to prediction. Given the increase in variance and the bias reduction following increasing trimming, the effect on the coverage of the 95% confidence interval is very pronounced for most implementations.
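The performance measures referred to above (bias, mean squared error, and coverage of the 95% confidence interval) could be computed across simulated studies along the following lines. This is a sketch under stated assumptions: the measures are calculated on the log rate ratio scale and the interval is a Wald interval, neither of which is spelled out in this section, and the input values are made up.

```python
import math


def summarize_simulations(log_rr_estimates, standard_errors, true_rr):
    """Bias and MSE on the log scale, plus 95% Wald interval coverage."""
    true_log_rr = math.log(true_rr)
    n = len(log_rr_estimates)
    errors = [est - true_log_rr for est in log_rr_estimates]
    bias = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    # Count the studies whose interval contains the true log rate ratio.
    covered = sum(
        1 for est, se in zip(log_rr_estimates, standard_errors)
        if est - 1.96 * se <= true_log_rr <= est + 1.96 * se
    )
    return {"bias": bias, "mse": mse, "coverage": covered / n}


# Made-up estimates from 3 hypothetical simulated studies; true RR = 2.0.
estimates = [math.log(2.0), math.log(2.2), math.log(1.9)]
std_errors = [0.03, 0.03, 0.03]
summary = summarize_simulations(estimates, std_errors, true_rr=2.0)
print({k: round(v, 4) for k, v in summary.items()})
```

In this toy input, the estimate of 2.2 lies more than 1.96 standard errors from the truth, so only 2 of the 3 intervals cover the true value. In the simulation study the same summary would be taken over the 1,000 cohorts of each scenario.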

In Table 4, we present the treatment effect estimates obtained with various restrictions of the PSX1–X6 under the “treatment withheld” hypothesis. The unmeasured confounding due to X8 is stronger than that due to X7. It leads to a rate ratio of 1.3 based on control for measured covariates. Consequently, the effects of trimming are more pronounced in this setting, monotonic, and similar for all PS implementations. The effects of unmeasured confounding due to X8 are most pronounced for PS matching with an unrestricted rate ratio of 1.2 (p(T = 1) = 0.2).

When combining the 2 hypotheses (Table 5), asymmetric

PS trimming again leads to reduction of bias caused by the

unmeasured confounders with all PS implementations. In-

terestingly enough, increasing restrictions lead to increasing

reduction of bias with all implementations except IPTW.

With IPTW, there is a reduction of bias with restriction up

to the 2.5–97.5 level, but further restriction to the 5–95 level

increases rather than decreases bias.

DISCUSSION

We simulated data sets to mimic treatment effect hetero-

geneity in 2 separate published clinical studies under the

assumption that such heterogeneity is due to unmeasured

confounding by patient frailty. Our simulation study shows

that under this assumption, increasing asymmetric PS trim-

ming can increase the validity of the treatment effect esti-

mates. This increase in validity was observed with most of

the different PS implementations and over all of the scenarios

assessed in the simulations.

How can we detect unmeasured confounding by frailty? Sensitivity of treatment effects to the approach of estimation, especially very different results from untrimmed IPTW, raised caution in the examples cited (8, 9). “Last resort treatment” and “treatment withheld” will lead to apparent heterogeneity of treatment effect estimates in the opposite ends of the overlapping PS distribution. This heterogeneity could easily be missed by stratifying the data into broad PS categories, such as quintiles. The heterogeneity becomes apparent, however, if one stratifies the data finely by PS at both ends of the PS distribution. Disadvantages of stratifying by broad percentile categories such as quintiles have been pointed out in other settings (21). Combining the lower percentiles from the treated patients and the higher percentiles from the untreated patients into a single “percentile” is an idea proposed previously by Stürmer et al. (10). Although some variability will occur by chance, trends such as those reported (8, 9) should raise caution.

We are aware of few published implementations of PS

range restrictions (8, 22, 23). Here we assessed the perfor-

mance of asymmetric PS trimming (10) when the treatment

effect is homogeneous. We observed some differences be-

tween different methods of using the PS to control con-

founding. PS matching was least affected by bias due to

unmeasured frailty that led to “last resort treatment.” This result can be explained by the fact that the treatment effect in the treated patients estimated by PS matching is based on few matched sets with very low PS’s. Estimating the treatment effect in the treated patients thus guards against major bias due to “last resort” treatment, even without trimming. Without trimming, however, estimating the treatment effect in the treated is more susceptible to bias due to “treatment withheld.” In the scenario with both “last resort treatment” and “treatment withheld,” trimming IPTW provided the

least correction of uncontrolled confounding and increasing

trimming did not monotonically lead to reduced bias. This

result is in contrast to all other PS methods and scenarios

that we assessed.
