Using Non-experimental Data to Estimate Treatment Effects

Elizabeth A. Stuart, PhD2,1, Sue M. Marcus, PhD3, Marcela V. Horvitz-Lennon, MD4, Robert D. Gibbons, PhD5, and Sharon-Lise T. Normand, PhD6

2 Johns Hopkins Bloomberg School of Public Health, Baltimore

3 Mount Sinai School of Medicine, New York

4 University of Pittsburgh School of Medicine, Department of Psychiatry, Pittsburgh

5 University of Illinois at Chicago, Chicago

6 Harvard Medical School and Harvard School of Public Health, Boston

Abstract

While much psychiatric research is based on randomized controlled trials (RCTs), where patients

are randomly assigned to treatments, sometimes RCTs are not feasible. This paper describes

propensity score approaches, which are increasingly used for estimating treatment effects in non-

experimental settings. The primary goal of propensity score methods is to create sets of treated and

comparison subjects who look as similar as possible, in essence replicating a randomized experiment,

at least with respect to observed patient characteristics. A study to estimate the metabolic effects of

antipsychotic medication in a sample of Florida Medicaid beneficiaries with schizophrenia illustrates

methods.

Introduction

While much psychiatric research is based on randomized controlled trials (RCTs), where

patients are randomly assigned to treatments, sometimes RCTs are not feasible. Ethical

concerns might preclude randomization, such as randomizing subjects to smoke, or it may be

impractical, such as when the treatment of interest is widely available and commonly used.

When RCTs are unethical or infeasible, a carefully constructed non-experimental study can be

used to estimate treatment effects. While non-experimental studies are disadvantaged by lack

of randomization, the study costs may be lower, the study sample may be broader, and follow-

up may be longer, as compared to an RCT (1,2).

The primary challenge for estimation of treatment effects is the identification of subjects who

are as similar as possible on all background characteristics other than the treatment of interest.

By virtue of randomization, RCTs ensure that, on average, the treatment and comparison groups

are similar on background characteristics, measured and unmeasured. In non-experimental

studies, there is no such guarantee. Treatment and comparison groups may systematically differ

on factors that also affect the outcome, a problem referred to as “selection bias.” Selection bias

leads to confounding, “a situation in which the estimated intervention effect is biased because

of some difference between the comparison groups apart from the planned interventions such

as baseline characteristics, prognostic factors, or concomitant interventions. For a factor to be

a confounder, it must differ between the comparison groups and predict the outcome of

interest” (3).

1Corresponding author contact information: Johns Hopkins Bloomberg School of Public Health, 624 N Broadway, 8th Floor, Baltimore,

MD, 21205; 410-502-6222 (phone); 410-955-9088 (fax); estuart@jhsph.edu.

NIH Public Access

Author Manuscript

Psychiatr Ann. Author manuscript; available in PMC 2010 July 1.

Published in final edited form as:

Psychiatr Ann. 2009 July 1; 39(7): 41451. doi:10.3928/00485713-20090625-07.

Numerous design and analytical strategies are available to account for measured confounders,

but the major limitation is the potential for unmeasured confounders. Well-designed non-

experimental studies make good use of measured confounders by creating treatment groups

that look as similar as possible on the measured characteristics. Researchers then assume that,

given comparability (or balance) between the groups on measured confounders, there are no

measured or unmeasured differences, other than treatment received. This assumption has many

names: “unconfounded treatment assignment,” “no hidden bias,” “ignorable treatment

assignment,” or “selection on observables” (4–6).

We describe approaches that, through the careful design and analysis of non-experimental

studies, create balance between treatment groups. The key idea is to use relatively recently

developed techniques, known as propensity score methods, to ensure that the treatment and

comparison subjects are as similar as possible. The goal is to replicate a randomized

experiment, at least with respect to the measured confounders, by making the treatment and

comparison groups look as if they could have been randomly assigned to the groups, in the

sense of having similar distributions of the confounders. We describe the five key stages to

this process (Table 1). A study that compares atypical and conventional antipsychotic

medications with regard to their effect on adverse metabolic outcomes (dyslipidemia, Type II

diabetes, and obesity) (16) illustrates the methods. The study uses data from Florida Medicaid

beneficiaries (18 to 64 years), diagnosed with schizophrenia and continuously enrolled from

1997 to 2001. Although the bulk of the evidence on the causal associations of antipsychotics

comes from studies using U.S. and U.K. administrative and medical databases, RCTs have

been used to assess the metabolic effects of antipsychotic drugs (e.g., 17,18). Findings of these

RCTs, however, are not regarded as representative of the adverse events of these drugs as used

in routine practice. A possible exception is the CATIE trial (19), an effectiveness trial in which,

apart from the randomization, every aspect of the care was naturalistic. Conducting this

type of trial is costly and generally infeasible.

I. Defining the Treatment and Comparison Groups

The first step involves clearly specifying the treatment of interest, and identifying individuals

who experienced that treatment. One way to address this is to consider what treatment would

be randomized if randomization were possible. For example, we could randomly assign patients

to receive an atypical medication. We then need to select an appropriate comparison condition.

Because this study investigates the metabolic effects of atypical antipsychotics, the relevant

question is whether the comparison of interest is another type of medication, no medication,

or either. Virtually all subjects with schizophrenia during this time frame are treated with some

type of antipsychotic agent and thus the key clinical question is not whether the patient should

receive an antipsychotic medication, but rather, which type of antipsychotic medication should

be used. We compare atypical antipsychotics (specifically, clozapine, olanzapine, quetiapine,

and risperidone) to conventional antipsychotics (specifically, chlorpromazine, trifluoperazine,

fluphenazine, perphenazine, thioridazine, haloperidol, and thiothixene). We use Medicaid

claims data so that atypical (conventional) antipsychotic users are those subjects who filled at

least one prescription for an atypical (conventional) antipsychotic. Prescribing information is

unavailable and so only subjects who were written an antipsychotic prescription and filled it

are included. As in an intent-to-treat analysis, we know only that the prescription was filled,

not whether the medication was actually taken.

The next consideration is identification of confounders: factors that have previously been found

to be associated with receipt of atypical antipsychotics and/or with metabolic outcomes. Key

confounders in the Medicaid study include demographic and clinical variables, listed in Table

2, such as sex, age, race, and medical comorbidities. A good study will have a large set of

measured confounders so that the assumption of no hidden bias is likely to be satisfied.

Once the treatment group, comparison group, and potential confounders are identified,

researchers need to identify data on those groups and the confounders. The particular data

elements necessary are: subjects, some of whom received the treatment (atypical

antipsychotics) and others the comparison condition (conventional antipsychotics), an

indicator for which subject is in which group, potential confounders, and outcomes.

Confounders are measured before treatment assignment to ensure that they are not affected by

the treatment (20,21) and outcomes are measured after treatment assignment, to ensure

temporal ordering. In the Medicaid study, we determined periods during which an individual

had some minimal exposure to an antipsychotic drug, at least 6 months of Medicaid enrollment

preceding treatment initiation (from which we obtained the covariate information), and a 12-

month follow-up period to examine incidence of metabolic outcomes. Often it is not possible

to have truly longitudinal data, and researchers instead use cross-sectional data where

assumptions regarding the time ordering of the variables being measured are made. We analyze

one measurement occasion for each subject, measured 12 months following antipsychotic

initiation. See the paper by Marcus et al. in this series for methods for estimating causal effects

with multiple outcome occasions (22).

II. Creating the Groups for Comparison

Table 2 (Columns 1–3) compares the means of the potential confounders between atypical and

conventional antipsychotic users. The differences in percentages (for binary variables) or

standardized differences (for continuous variables) are also reported. The standardized

difference is the difference in means divided by the standard deviation of the confounder among

the full set of conventional users (1,11,23). We then multiply by 100 to express the difference

as a percentage. The conventional users are older on average (by 26% of a standard deviation)

and more likely to be African American (34% vs. 24%), as compared to the atypical users.

Because of these differences between the groups, comparing the raw outcomes between the

two treatment groups would result in bias (24). Statistical adjustments are required to deal with

the differences in the observed confounders.
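
The standardized difference described above can be sketched in a few lines of Python. This is only an illustration: the ages are made-up values rather than the Medicaid data, the function name `standardized_difference` is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def standardized_difference(treated, comparison):
    """Difference in means divided by the comparison-group SD, times 100."""
    return 100.0 * (mean(treated) - mean(comparison)) / stdev(comparison)

# Hypothetical ages (not the Medicaid data): the conventional users are older.
atypical_age = [30, 35, 40, 45, 50]
conventional_age = [38, 43, 48, 53, 58]

diff = standardized_difference(atypical_age, conventional_age)
print(f"standardized difference in age: {diff:.1f}%")
```

A negative value here means the atypical users are younger on average, measured in units of the conventional users' standard deviation.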

Ideally we want to compare atypical and conventional users who have “exactly” the same

values for all the confounders. Assuming no unmeasured confounders, any difference in the

outcomes could then be attributed to the treatment. However, exact matching on all of the

covariates is often infeasible given the large number of covariates and relatively small number

of subjects available. In the Medicaid study, if we were to make each of our 11 confounders

binary, we would have 2,048 (= 2^11) distinct strata and need to have both atypical and

conventional antipsychotic users in each. Because this is not feasible, a reasonable strategy is

to make the “distributions” of the confounders similar between the atypical and conventional

antipsychotic users—e.g., similar age, similar race, similar chronic medical comorbidity status.

There are several general strategies to create comparable groups.

Regression adjustment—A common approach to adjusting for confounders is regression

adjustment, whereby the treatment effect is estimated by regressing the outcome of interest on

an indicator for the treatment received and the set of confounders. The coefficient on the

treatment indicator provides an estimate of the treatment effect (Table 3, Column 1).2 A

drawback of this approach is that if the atypical and conventional groups are very different on

the observed covariates (e.g., with over a 25% standard deviation difference on average age,

as seen in Table 2), the regression adjustment relies heavily on the particular model form and

extrapolates between the two groups (24,25). Why does this pose a problem? First, the

regression approach will provide a prediction of what would have happened to atypical users

had they instead used conventional antipsychotics using information from a set of conventional

users who are very different from, e.g., older than, those atypical users. Second, in most cases,

the regression approach assumes a linear relationship between the measured covariates and the

outcome of interest, an assumption that may not be true and is often difficult to test. Third,

the output of standard regression analysis provides no information regarding covariate balance

between the two treatment groups. Other approaches avoid these problems by ensuring that

the comparisons are made between groups that are similar.

2 Although our outcomes are binary, we present results from a linear regression model. This was for comparability with the analyses

described for the propensity score approaches with weights. If a logistic regression model is used, the difference in absolute risk can be

obtained by comparing predictions of the outcomes for the full sample under each of the treatment conditions. In this study the results

are virtually identical. Section IV provides more detail.
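
As a rough illustration of regression adjustment, the sketch below regresses an outcome on a treatment indicator and a single confounder (age) and reads the treatment effect off the coefficient on the indicator. Everything here is assumed for illustration: the data are invented, the outcome is constructed to be exactly linear in treatment and age, and the helper `ols` is a hypothetical least-squares solver written out so that the example needs no external libraries.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical (treated, age) pairs: treated subjects are younger on average.
subjects = [(1, 30), (1, 40), (1, 50), (0, 40), (0, 50), (0, 60)]
# Made-up outcome: +0.05 for treatment and +0.01 per year of age.
outcome = [0.10 + 0.05 * t + 0.01 * age for t, age in subjects]

treated_y = [y for (t, _), y in zip(subjects, outcome) if t == 1]
control_y = [y for (t, _), y in zip(subjects, outcome) if t == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)

X = [[1.0, t, age] for t, age in subjects]  # intercept, treatment, confounder
adjusted = ols(X, outcome)[1]               # coefficient on the treatment indicator
print(f"naive difference: {naive:.3f}, adjusted estimate: {adjusted:.3f}")
```

In this constructed example the raw difference in means has the wrong sign because the treated subjects are younger, while the adjusted coefficient recovers the built-in effect. That recovery depends on the outcome being exactly linear by construction, which is the model-form dependence the text warns about.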

Propensity score methods—A useful tool to achieve comparable confounder distributions

is the “propensity score,” defined as the probability of receiving the treatment given the

measured covariates (6). A property of the propensity score makes it possible to select subjects

based on their similarity with respect to the propensity score (a single number summary of the

covariates, similar to a comorbidity score) in order to achieve comparability on all the measured

confounders, rather than having to consider each confounder separately. If a group of subjects

have similar propensity scores, then they have similar probabilities of receiving the treatment,

given the measured confounders. Within a small range of propensity score values, the atypical

and conventional users should only differ randomly on the measured confounders, in essence

replicating a randomized experiment.

Because the true propensity score for each subject is unknown, it is estimated with a model,

such as a logistic regression, predicting treatment received given the measured confounders.

Each subject’s propensity score is their predicted probability of receiving the treatment,

generated from the model. The diagnostics for propensity score estimation are not the standard

logistic regression diagnostics, as concern is not with the parameter estimates or predictive

ability of the model. Rather, the success of a propensity score model (and subsequent matching

or stratification procedure) is determined by the covariate balance achieved.
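
A minimal sketch of propensity score estimation follows, assuming a logistic regression with a single standardized covariate. The data are invented, and the functions `fit_logistic` and `propensity_scores` are hypothetical helpers that fit the model by plain gradient ascent. In practice a statistical package would normally be used, and the model would include all measured confounders.

```python
import math

def fit_logistic(X, t, steps=5000, lr=0.1):
    """Fit a logistic regression by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, ti in zip(X, t):
            p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
            for j, x in enumerate(row):
                grad[j] += (ti - p) * x
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    """Predicted probability of treatment for each subject."""
    return [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
            for row in X]

# Hypothetical standardized ages and treatment indicators (1 = atypical).
age_z = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
treated = [1, 1, 0, 1, 0, 1, 0, 0]

X = [[1.0, a] for a in age_z]        # intercept plus one confounder
beta = fit_logistic(X, treated)
scores = propensity_scores(X, beta)
for a, s in zip(age_z, scores):
    print(f"age_z={a:+.1f}  propensity={s:.2f}")
```

As the text notes, the fitted coefficients themselves are not the point. What matters is whether matching, weighting, or subclassifying on `scores` yields balanced covariates.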

Nearest neighbor matching: One of the simplest ways of ensuring the comparability of groups

is to select for each treated individual the comparison individual with the closest propensity

score3 (26). We illustrate a 1:1 matching algorithm where one conventional antipsychotic user

is selected for each atypical antipsychotic user. Variations on this algorithm include selecting

multiple matches for each atypical user, matching atypical users to a variable number of

conventional users (27), and prioritizing certain variables (12). For example, if there are a large

number of potential control subjects relative to the number of treated, it may be possible to get

2 or 3 good matches for each treated individual, which will increase the precision of estimates

without sacrificing much balance (27,28). In our study, because the numbers of conventional

and atypical users are nearly equal, we used matching with replacement, meaning that each

conventional user could be used as a match multiple times (29).

Figure 1 Panel A illustrates the resulting matches in the Medicaid study, with 1,809

conventional users matched to the 3,384 atypical users. The x-axis reflects the propensity

scores; the y-axis is used to group the subjects into atypical (treated) vs. conventional (control),

and matched vs. unmatched; the vertical spread of the symbols within each grouping serves only

to show the symbols more clearly. The figure shows the relative weight different subjects

receive in the analyses of the outcomes, with the relative size of the symbols reflecting the

number of times a subject was matched. Thus, conventional users selected as a match multiple

times have larger symbols. The goal is to see good “overlap” between the propensity scores of

the atypical and conventional users, which we have. However, there are quite a few

conventional users with low propensity scores who are left unmatched. This illustrates a

common drawback of nearest neighbor matching, in that sometimes subjects are unmatched,

including some with propensity scores similar to those in the other group.

3 Often the matches are based on the logits (the log-odds of the predicted probabilities) because the logits have better statistical properties.
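
The 1:1 nearest neighbor matching with replacement described above might be sketched as follows. The propensity scores are invented, the function `match_with_replacement` is a hypothetical helper, and the choice to match on the logit of the score is an assumption of this sketch rather than a statement of what the study did.

```python
import math
from collections import Counter

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def match_with_replacement(treated_ps, comparison_ps):
    """For each treated subject, return the index of the closest comparison subject."""
    comp_logits = [logit(p) for p in comparison_ps]
    matches = []
    for p in treated_ps:
        target = logit(p)
        best = min(range(len(comp_logits)),
                   key=lambda j: abs(comp_logits[j] - target))
        matches.append(best)
    return matches

# Hypothetical propensity scores (probability of receiving an atypical).
treated_ps = [0.80, 0.75, 0.60, 0.55]       # atypical users
comparison_ps = [0.78, 0.58, 0.30, 0.10]    # conventional users

matches = match_with_replacement(treated_ps, comparison_ps)
times_used = Counter(matches)               # how often each comparison subject is reused
unmatched = [j for j in range(len(comparison_ps)) if j not in times_used]
print("matches:", matches)
print("times used:", dict(times_used))
print("unmatched comparison subjects:", unmatched)
```

In this toy example the two comparison subjects with the highest scores are each used twice, and the two with low scores are never selected, which mirrors the pattern of reused and unmatched conventional users described for Figure 1 Panel A.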

Weighting: A second approach, inverse probability of treatment weighting (IPTW), avoids

this problem by using data from all subjects (9,13,30). The idea of IPTW is similar to that of

survey sampling weights, where individuals in a survey sample are weighted by their inverse

probabilities of selection so that they then represent the full population from which the sample

was selected. In our setting we treat each of the treatment groups (the atypical users and the

conventional users) as a separate sample, and weight each up to the “population,” which in this

case is all study subjects. Each subject receives a weight that is the inverse probability of being

in the group they are actually in. However, instead of having known survey sampling

probabilities, we use the estimated propensity scores. In particular, atypical users are weighted

by one over their probability of receiving an atypical antipsychotic (the propensity score), and

conventional users are weighted by one over their probability of receiving a conventional

antipsychotic (one minus the propensity score). In the Medicaid study, the conventional users

with low probabilities of receiving a conventional antipsychotic will receive relatively large

weights, because they actually look more similar to the atypical users, thus providing good

information about what would happen to the atypical users if they had instead taken

conventional antipsychotics.
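
The weights just described can be written down directly. The sketch below uses invented propensity scores and outcomes, and the helpers `iptw_weights` and `iptw_effect` are hypothetical. Normalizing each group's weighted mean by the sum of its weights is one common choice and is an assumption here, not necessarily the estimator used in the study.

```python
def iptw_weights(treated, ps):
    """1/ps for treated subjects, 1/(1 - ps) for comparison subjects."""
    return [1.0 / p if t == 1 else 1.0 / (1.0 - p) for t, p in zip(treated, ps)]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def iptw_effect(outcome, treated, ps):
    """Difference in weighted outcome means, treated minus comparison."""
    w = iptw_weights(treated, ps)
    t_rows = [(y, wi) for y, ti, wi in zip(outcome, treated, w) if ti == 1]
    c_rows = [(y, wi) for y, ti, wi in zip(outcome, treated, w) if ti == 0]
    return (weighted_mean(*zip(*t_rows)) - weighted_mean(*zip(*c_rows)))

# Hypothetical data: 1 = atypical user; ps = estimated probability of an atypical.
treated = [1, 1, 0, 0]
ps = [0.8, 0.5, 0.8, 0.2]
outcome = [1, 0, 1, 0]

weights = iptw_weights(treated, ps)
print("weights:", [round(w, 2) for w in weights])
print("weighted difference:", round(iptw_effect(outcome, treated, ps), 3))
```

The third subject is a conventional user whose probability of receiving a conventional drug was only 0.2, so that subject gets the largest weight. This is the situation the text describes for conventional users who resemble atypical users.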

Subclassification: Subclassification, also called stratification, is a method that also uses all

subjects, by forming groups (subclasses) of individuals with similar propensity scores (31). In

the Medicaid study the subclasses were created to have approximately the same number of

subjects taking atypical antipsychotics (about 565); the number of conventional users in each

subclass ranges from 287 to 933 (Figure 1 Panel B; Table 4). Because of the properties of

propensity scores described above, within each subclass, the subjects look similar on the

measured confounders.
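
One way to form such subclasses is to cut at quantiles of the treated group's propensity scores, so that each subclass holds about the same number of treated subjects. The sketch below assumes that approach; the scores are invented, and `subclass_cutpoints` and `assign_subclass` are hypothetical helper names.

```python
from bisect import bisect_right
from collections import Counter

def subclass_cutpoints(treated_ps, n_subclasses):
    """Cutpoints at quantiles of the treated subjects' propensity scores."""
    s = sorted(treated_ps)
    return [s[len(s) * k // n_subclasses] for k in range(1, n_subclasses)]

def assign_subclass(p, cutpoints):
    """Index of the subclass whose score range contains p."""
    return bisect_right(cutpoints, p)

# Hypothetical propensity scores.
treated_ps = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
comparison_ps = [0.10, 0.15, 0.25, 0.35, 0.45, 0.65, 0.85]

cuts = subclass_cutpoints(treated_ps, 4)
treated_counts = Counter(assign_subclass(p, cuts) for p in treated_ps)
comparison_counts = Counter(assign_subclass(p, cuts) for p in comparison_ps)
for k in range(4):
    print(f"subclass {k}: {treated_counts[k]} treated, "
          f"{comparison_counts[k]} comparison")
```

As in the Medicaid study, the number of treated subjects is roughly constant across subclasses while the number of comparison subjects varies.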

Remarks: Is it better to match or to stratify/weight? The answer depends on whether the

investigator is more concerned about bias or about having enough power to detect an effect.

Matching approaches are often used when it is important to reduce differences between treatment

groups as much as possible; as a consequence, not all subjects are used, which reduces the

total sample size available to find differences. While subclassification and weighting retain all

subjects (generally yielding some efficiency gain), there is a risk of making comparisons

between individuals who are not as alike as desired.

III. Assessing Potential Confounding

How do we know if the atypical and conventional groups are “similar,” at least on the measured

covariates? After using one of the approaches described above, the crucial next step is to check

the resulting “balance:” the similarity of the confounders between the treatment and

comparison groups. Common (and sometimes misguided) measures used for balance checks

are standard hypothesis tests, such as t-tests. The danger in using test statistics is that they

conflate changes in balance with changes in the sample size; comparing p-values before and

after matching can be misleading, implying that balance has improved when in fact it has not

(1,11).
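
The following sketch illustrates, with invented numbers, how a test statistic can shrink simply because the sample gets smaller. The same imbalanced covariate pattern is repeated 40 times for a "full" sample and twice for a "reduced" sample. The functions `t_statistic` (a Welch-type two-sample statistic) and `std_diff` are hypothetical helpers, and the construction is an assumption chosen to isolate the sample-size effect.

```python
import math
from statistics import mean, stdev, variance

def t_statistic(a, b):
    """Two-sample t statistic with unpooled (Welch) standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def std_diff(a, b):
    """Standardized difference in means, as a percentage of the SD of b."""
    return 100.0 * (mean(a) - mean(b)) / stdev(b)

# Hypothetical covariate values with a fixed 2-unit gap between groups.
base_a = [48, 50, 52, 54, 56]
base_b = [46, 48, 50, 52, 54]

full_a, full_b = base_a * 40, base_b * 40    # 200 subjects per group
small_a, small_b = base_a * 2, base_b * 2    # 10 subjects per group

print(f"full:    t = {t_statistic(full_a, full_b):.2f}, "
      f"std diff = {std_diff(full_a, full_b):.1f}%")
print(f"reduced: t = {t_statistic(small_a, small_b):.2f}, "
      f"std diff = {std_diff(small_a, small_b):.1f}%")
```

The t statistic drops from about 7 to 1.5 while the standardized difference changes little, so a p-value comparison would suggest improved balance even though the groups are just as far apart.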

A good balance measure, and the one we suggest, is the standardized difference in means. This

is most appropriate for continuous variables. A general rule of thumb is that an acceptable

standardized difference is less than 10% (11). Differences larger than 10% roughly imply that

8% or more of the area covered by atypical and conventional users combined is not overlapping.4

For binary variables the absolute value of the difference in proportions is examined. These

measures are generally calculated both in the full dataset (Table 2, Column 3), as well as in
