How to Control for Many Covariates? Reliable Estimators Based on the Propensity Score
DISCUSSION PAPER SERIES
Forschungsinstitut zur Zukunft der Arbeit
Institute for the Study of Labor

IZA DP No. 5268

How to Control for Many Covariates?
Reliable Estimators Based on the Propensity Score

Martin Huber (SEW, University of St. Gallen)
Michael Lechner (SEW, University of St. Gallen, ZEW, CEPR, PSI, CESifo, IAB and IZA)
Conny Wunsch (SEW, University of St. Gallen, CESifo and IZA)

Discussion Paper No. 5268
IZA, P.O. Box 7240, Bonn, Germany
Any opinions expressed here are those of the author(s) and not those of IZA. Research published in
this series may include views on policy, but the institute itself takes no institutional policy positions.
The Institute for the Study of Labor (IZA) in Bonn is a local and virtual international research center
and a place of communication between science, politics and business. IZA is an independent nonprofit
organization supported by Deutsche Post Foundation. The center is associated with the University of
Bonn and offers a stimulating research environment through its international network, workshops and
conferences, data service, project support, research visits and doctoral program. IZA engages in (i)
original and internationally competitive research in all fields of labor economics, (ii) development of
policy concepts, and (iii) dissemination of research results and concepts to the interested public.
IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion.
Citation of such a paper should account for its provisional character. A revised version may be
available directly from the author.
IZA Discussion Paper No. 5268
How to Control for Many Covariates?
Reliable Estimators Based on the Propensity Score*
We investigate the finite sample properties of a large number of estimators for the average treatment effect on the treated that are suitable when adjustment for observable covariates is required, such as inverse probability weighting, kernel and other variants of matching, as well as different parametric models. The simulation design used is based on real data usually employed for the evaluation of labour market programmes in Germany. We vary several dimensions of the design that are of practical importance, such as sample size, the type of the outcome variable, and aspects of the selection process. We find that trimming individual observations with too much weight, as well as the choice of tuning parameters, is important for all estimators. The key conclusion from our simulations is that a particular radius matching estimator combined with regression performs best overall, in particular when robustness to misspecifications of the propensity score is considered an important property.
JEL Classification: C21
Keywords: propensity score matching, kernel matching, inverse probability weighting,
selection on observables, empirical Monte Carlo study, finite sample properties
Swiss Institute for Empirical Economic Research (SEW)
University of St. Gallen
CH-9000 St. Gallen
* This project received financial support from the Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nuremberg (contract 8104). We would like to thank Patrycja Scioch (IAB), Benjamin Schünemann and Darjusch Tafreschi (both SEW, St. Gallen) for their help in the early stages of data preparation. The paper has been presented at the annual meeting of the German Statistical Society in Dortmund and the Statistische Woche in Nuremberg, as well as at seminars at EIEF, Rome, at the Economics Department of the University of Mannheim, and at the Center for European Economic Research (ZEW), Mannheim. We thank participants, in particular Markus Frölich and Franco Peracchi, for helpful comments and suggestions. The usual disclaimer applies.
1 Introduction

Semiparametric estimators that use the propensity score to adjust in one way or another for covariate differences are now well-established, either for estimating causal effects in a selection-on-observables framework with discrete treatments or for simply purging the means of an outcome variable in two or more subsamples of differences due to observables.1 Compared to (non-saturated) parametric regressions, they have the advantage of allowing for effect heterogeneity and of including the covariates in a more flexible way without incurring a curse-of-dimensionality problem. The latter problem, which is highly relevant due to the usually large number of covariates that should be adjusted for, is avoided by collapsing the covariate information into a single parametric function, the so-called propensity score, which is defined as the probability of being observed in one of two subsamples conditional on the covariates. These methods originate from the pioneering work of Rosenbaum and Rubin (1983), who show that balancing two samples on the propensity score is sufficient to equalize their covariate distributions.

1 See for example the recent surveys by Blundell and Costa-Dias (2009), Imbens (2004), and Imbens and Wooldridge (2009) for a discussion of the properties of such estimators as well as a list of recent applications.
Although many of these propensity-score-based methods are not asymptotically efficient (see for example Heckman, Ichimura, and Todd, 1998, and Hahn, 1998),2 they are the workhorses of the literature on microeconometric programme evaluation and are now rapidly spreading to other fields. They are usually implemented as semiparametric estimators: the propensity score is based on a parametric model, but the relationship between the outcome variables and the propensity score is nonparametric. However, despite the popularity of propensity-score-based methods, the issue of which version of the many different estimators suggested in the literature should be used in a particular type of application is still unresolved, despite recent advances in important Monte Carlo studies by Frölich (2004) and Busso, DiNardo, and McCrary (2009a,b). In this paper we shall address this question and add further insights to it.

2 See the paper by Angrist and Hahn (2004) for an alternative justification of conditioning on the propensity score by using non-standard (panel) asymptotic theory.
Broadly speaking, the popular estimators can be subdivided into four classes: parametric estimators (like OLS or probit, or their so-called double-robust relatives, see Robins, Mark, and Newey, 1992), inverse (selection) probability weighting estimators (similar to Horvitz and Thompson, 1952), direct matching estimators (Rubin, 1974, Rosenbaum and Rubin, 1983), and kernel matching estimators (Heckman, Ichimura, and Todd, 1998).3 However, many variants of the estimators exist within each class, and several methods combine the principles underlying these main classes.
There are two strands of the literature that are relevant for our research question. First, the literature on the asymptotic properties of a subset of estimators provides some approximate guidance on their small sample properties. Therefore, the next section reviews this literature while discussing the various estimators. Unfortunately, such properties have not (yet?) been derived for all estimators that are used in practice, nor is it obvious how well these asymptotic properties approximate small sample behaviour. Furthermore, these results are usually not informative for the important choice of tuning parameters (e.g., the number of matched neighbours, or the bandwidth in kernel matching), on which almost all of these estimators critically depend.

3 There exists also the approach of stratifying the data along the values of the propensity score ('blocking'), but this approach did not receive much attention in the empirical economic literature and does not have very attractive theoretical properties. It is thus omitted (see, for example, Imbens, 2004, for a discussion of this approach).
The second strand of the literature provides Monte Carlo evidence. As one of the first papers investigating estimators from several classes simultaneously, Frölich (2004) found that a particular version of kernel matching based on local regressions with finite sample adjustments (local ridge regression) performs best. In contrast, Busso, DiNardo, and McCrary (2009a,b) conclude that inverse probability weighting (IPW) has the best properties (when using normalized weights for estimation).4 They explain the differences to the Frölich (2004) study by claiming (i) that he considers unrealistic data generating processes and (ii) that he does not use an IPW estimator with normalized weights. In other words, they point to the design dependence of the Monte Carlo results as well as to the requirement of having to use optimized variants of the estimators. Below, we argue that their work is subject to the same criticism. Indeed, it is this criticism that provides a major motivation for our study.

4 Further findings from more specific Monte Carlo studies will be discussed below.
We contribute to the literature on the properties of estimators based on adjusting for covariate differences in the following ways. First of all, we suggest a different approach to conducting simulations. This new approach is based on 'real' data; therefore, we call our approach an 'Empirical Monte Carlo Study'. The basic idea is to use the real data to simulate realistic 'placebo treatments' among the non-treated. Selection into treatment, which is potentially of key importance for the performance of the various estimators, is based on a selection process directly obtained from the data. The various estimators then use the remaining non-treated in different ways to estimate the (known) non-treatment outcome of the 'placebo-treated', exploiting the actual dependence of the outcome of interest on the covariates on which selection is based in the data. Thus, this approach is much less prone to the standard critique of simulation studies that the chosen data generating processes are irrelevant for real applications. Since our model for the propensity score mirrors specifications used in past applied work, it depends on many more covariates than the studies mentioned above. Although this makes the simulation results particularly plausible in our context, which is the context of labour market programme evaluation in Europe, it may also be seen as a limitation concerning the application of our results to other fields. Therefore, to help generalize the results outside our specific data situation, we further modify many features of the data generating process, like the type of the outcome variable as well as various aspects of the selection process.5
Secondly, we consider standard estimators as well as their modified (optimised?) versions based on different tuning parameters, such as the bandwidth or radius choice. This leads to a great number of estimators to evaluate, but it also provides us with more information on particularly important choices regarding the tuning parameters on which the various estimators depend. Such estimators may also consist of combinations of estimators, like combining matching with weighted regression, which have not been considered in any simulation so far.
Finally, we reemphasise the relevance of trimming. This issue has also been raised by Busso, DiNardo, and McCrary (2009a) to account for common support problems. However, they find that none of the remedies for poor support considered in their paper seems to work in a robust way, particularly in small samples. Therefore, we propose a different, data-driven trimming rule that (i) is easy to implement, (ii) is identical for all estimators, and (iii) avoids any asymptotic bias. We show that for all estimators considered, including the parametric ones, trimming based on this rule very effectively improves their performance (even when there is no common support problem).
5 Our results are also robust to arbitrary effect heterogeneity.
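To make the idea of weight-based trimming concrete, the sketch below caps the influence of any single control observation. It is only an illustration: the fixed ceiling `max_share` and the function name are hypothetical choices for this example, whereas the paper's own rule chooses the threshold in a data-driven way.

```python
import numpy as np

def trim_weights(w, max_share=0.05):
    """Drop observations whose share of the total weight exceeds
    `max_share` and renormalize the remaining weights to sum to one.
    A sketch of the principle only, not the paper's exact rule."""
    w = np.asarray(w, dtype=float)
    share = w / w.sum()
    keep = share <= max_share            # discard dominating observations
    w_trimmed = np.where(keep, w, 0.0)
    return w_trimmed / w_trimmed.sum()

# Toy example: one control initially carries 90% of the total weight.
w = np.array([0.90, 0.02, 0.03, 0.05])
print(trim_weights(w, max_share=0.50))   # [0.  0.2 0.3 0.5]
```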
Overall, we find that (i) trimming individual observations that have a 'too large' weight is important for all estimators (even without any common support problem); (ii) the choices of the various tuning parameters are important; (iii) simple matching estimators are inefficient and have considerable small sample bias; (iv) no estimator is superior in all designs; (v) particular bias-adjusted radius matching estimators perform best on average, but may have fat tails if the number of controls is not large enough; and finally, (vi) flexible but simple parametric approaches do almost as well in the smaller samples, because their gain in precision frequently overcompensates for their larger bias, which, however, dominates when samples become larger. One conclusion from these findings is that the choice of the broad class of estimators may be less important than using an optimised version.
The plan of the paper is as follows: In the next section we discuss the principles of the relevant estimators and their properties, as well as the issue of trimming, while relegating the technical details of the estimators to Appendix A. Section 3 describes our Monte Carlo design, again relegating many details as well as descriptive statistics to Appendix B. The main results are presented in Section 4, while the full set of results is given in Appendix C. Section 5 concludes. The website of this paper (www.sew.unisg.ch/lechner/matching) will contain additional material that has been removed from the paper for the sake of brevity, as well as the Gauss and Stata code for the preferred estimators.6

6 Until user-friendly versions of the estimators are made available on the website, readers are invited to send us an email indicating their interest in either the Gauss or Stata versions. We will inform them when the respective versions become available.
2.1 Notation and targets for the estimation
The outcome variable, Y, denotes earnings or employment. The group of treated units (treatment indicator D=1) consists of the participants in training in our empirical example. We are interested in comparing the mean value of Y in the group of treated (D=1) with the mean value of Y in the group of non-treated (D=0), the non-participants, free of any mean differences in outcomes that are due to differences in the observed covariates X across the groups:7
$$\theta = E(Y \mid D=1) - E\big[E(Y \mid X, D=0) \,\big|\, D=1\big] = E(Y \mid D=1) - \int_{\chi} E\big(Y \mid p(x), D=0\big)\, f_{X|D=1}(x)\, dx,$$

where $f_{X|D=1}(x)$ denotes the conditional density of $X$ given $D=1$ and $\chi$ its support. The propensity score is defined by $p(x) := P(D=1 \mid X=x)$. The second equality is shown in the seminal paper by Rosenbaum and Rubin (1983).
If there are no other (perhaps unobservable) covariates that influence both the choice of the different values of D and the outcomes that would be realised for a particular value of D (the so-called potential outcomes), this comparison of means yields a causal effect, namely the average treatment effect on the treated (ATET). This is the mean effect of D on individuals observed with D=1.8 The assumption required to interpret θ as a causal parameter is called either unconfoundedness, conditional independence assumption (CIA), or selection on observables (e.g., Imbens, 2004). The plausibility of the CIA depends on the particular empirical problem considered and on the richness of the data at hand. That is, labour market applications estimating the effects of training programmes on employment should include variables reflecting education, individual labour market history, age, family status, and local labour market conditions, among others, in order to plausibly justify the CIA (e.g., Gerfin and Lechner, 2002). Therefore, in applications exploiting the CIA, X is typically of high dimension, as in most cases many covariates are necessary to make this assumption plausible. However, whether θ has a causal interpretation or not does not matter for this paper. It is important to note that other semiparametric estimators also rely on propensity-score-based covariate adjustments, like, for example, the instrumental variable estimator proposed by Frölich (2007) and semi-parametric versions of the difference-in-difference estimator (e.g., Abadie, 2005, Blundell, Meghir, Costa Dias, and van Reenen, 2004, Lechner, 2010a).

7 As a convention, capital letters denote random variables, while small letters denote particular realisations of the random variables. If the small letters are indexed by another small letter, typically i or j, this is the value realised for sample unit i or j.

8 For reasons of computational cost, which is a severe restriction in our analysis due to the complexity of the design and the number of estimators, we focus entirely on reweighting the controls towards the distribution of X among the treated. Common alternatives are reweighting the treated towards the covariate distribution of the controls, or weighting the outcomes of both groups towards the covariate distribution of the population at large. The resulting parameters are called the average treatment effect on the non-treated (ATENT) and the average treatment effect (ATE). Estimating the ATENT is symmetric to the problem we consider (just recode D as 1-D) and thus not interesting in its own right. The ATE is obtained as a weighted average of the ATET and the ATENT, where the weight for the ATET is the share of treated and the weight of the ATENT is one minus this share. We conjecture that having a good estimate of the components of the ATE will lead to a good estimate of the ATE.
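For reference, the identifying assumption can be stated compactly. The following is the standard textbook formulation of the CIA for the ATET, together with the common support condition it is usually paired with, rather than a verbatim statement from this paper:

$$Y^0 \;\perp\!\!\!\perp\; D \mid X = x \quad \text{and} \quad p(x) < 1 \quad \text{for all } x \in \chi,$$

where $Y^0$ denotes the potential outcome in the non-treatment state. Under these conditions, $E(Y \mid p(x), D=0) = E(Y^0 \mid p(x), D=1)$, so that θ coincides with the ATET.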
2.2 General structure of the estimators considered
As discussed by Smith and Todd (2005), Busso, DiNardo, and McCrary (2009a), and Angrist and Pischke (2009), among many others, all estimators adjusting for covariates can be understood as different methods of weighting the observed outcomes using weights $\hat{w}_i$:

$$\hat{\theta} = \frac{1}{N_1}\sum_{i=1}^{N} d_i\,\hat{w}_i\,y_i \;-\; \frac{1}{N_0}\sum_{i=1}^{N} (1-d_i)\,\hat{w}_i\,y_i.$$

$N$ denotes the size of the i.i.d. sample, and $N_1$ ($N_0$) denotes the size of the treated (non-treated) subsample. Reweighting is required to make the non-treated comparable to the treated in terms of the propensity score. See, for example, the afore-mentioned references for the formulas of the weighting functions implied by the various estimators. In almost all cases we will set $\hat{w}_i = 1$ for the treated, i.e. we estimate the mean outcome under treatment for the treated by the sample mean of the outcomes in the treated subsample. Therefore, the different estimators discussed below represent different ways to estimate $E[E(Y \mid p(X), D=0) \mid D=1]$. Following Busso, DiNardo, and McCrary (2009a), we normalize the weights of all semi-parametric estimators such that $\frac{1}{N_0}\sum_{i=1}^{N}(1-d_i)\hat{w}_i = 1$.
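To make this common structure concrete, the following minimal sketch (in Python with numpy; the function name and the toy numbers are purely illustrative) computes an ATET estimate from an arbitrary vector of control weights, normalized as just described:

```python
import numpy as np

def atet_from_weights(y, d, w):
    """Generic weighting representation of ATET estimators: the mean
    outcome of the treated minus a weighted mean of the non-treated
    outcomes, with control weights normalized to average one."""
    y, d, w = map(np.asarray, (y, d, w))
    n0 = (d == 0).sum()
    w0 = w * (d == 0)                   # only control weights matter
    w0 = w0 / w0.sum() * n0             # normalization from the text
    return y[d == 1].mean() - (w0 * y).sum() / n0

# Toy example: uniform weights reproduce the raw difference in means.
y = np.array([3.0, 2.5, 1.0, 1.5, 2.0])
d = np.array([1, 1, 0, 0, 0])
print(atet_from_weights(y, d, np.ones(5)))  # 2.75 - 1.50 = 1.25
```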
Next, we will briefly introduce the estimators considered in this study, namely inverse
probability weighting, direct matching, kernel matching, linear and non-linear regressions as
well as combinations of direct matching and inverse probability weighting with regression.
All of these estimators, or at least similar versions of them, have been applied in empirical
studies,9 which is the motivation to analyse them in this paper.
2.3 Inverse probability weighting
As already mentioned, the idea of inverse-probability-of-selection weighting (henceforth abbreviated as IPW) goes back to Horvitz and Thompson (1952). IPW attains the semiparametric efficiency bound derived by Hahn (1998) when using the estimated propensity score based on the correct parametric model.10
9 For inverse probability weighting see DiNardo, Fortin, and Lemieux (1996), for one-to-one matching Rosenbaum and
Rubin (1983), for kernel matching see Heckman, Ichimura, and Todd (1998), for caliper matching see Dehejia and Wahba
(1999), and for double-robust estimation see Robins, Mark, and Newey (1992). Of course, many more studies than those
mentioned as (early) examples use these estimators in various applications.
10 Hirano, Imbens, and Ridder (2003) also prove that the efficiency bound is reached when the propensity score is estimated
non-parametrically by a particular series estimator. The results by Newey (1984) on two-step GMM estimators imply that
IPW estimators based on a parametric propensity score are consistent and asymptotically normally distributed (under
standard regularity conditions).
Several IPW estimators for the ATET have recently been analysed by Busso, DiNardo, and McCrary (2009a,b). In this Monte Carlo study we consider the following implementation:

$$\hat{\theta}_{IPW} = \frac{1}{N_1}\sum_{i=1}^{N} d_i\, y_i \;-\; \left. \sum_{i=1}^{N} (1-d_i)\, y_i\, \frac{\hat{p}(x_i)}{1-\hat{p}(x_i)} \right/ \sum_{j=1}^{N} (1-d_j)\, \frac{\hat{p}(x_j)}{1-\hat{p}(x_j)}.$$

Dividing by $\sum_{j=1}^{N} (1-d_j)\,\hat{p}(x_j)/(1-\hat{p}(x_j))$ ensures that the weights add up to one. This estimator directly reweights the non-treated outcomes to control for differences in the propensity scores between treated and non-treated observations. It is the estimator recommended by Busso, DiNardo, and McCrary (2009a).
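A minimal sketch of this normalized IPW estimator, assuming the propensity scores have already been estimated (the function name and toy numbers are illustrative only):

```python
import numpy as np

def ipw_atet(y, d, p):
    """Normalized IPW for the ATET: controls are weighted by the odds
    p/(1-p) of their estimated propensity score, and the weights are
    rescaled to add up to one, as in the formula above."""
    y, d, p = map(np.asarray, (y, d, p))
    w = (d == 0) * p / (1.0 - p)
    w = w / w.sum()                      # weights add up to one
    return y[d == 1].mean() - (w * y).sum()

# Toy data: two treated and three control observations.
y = np.array([4.0, 3.0, 1.0, 2.0, 3.0])
d = np.array([1, 1, 0, 0, 0])
p = np.array([0.7, 0.6, 0.2, 0.4, 0.6])
print(ipw_atet(y, d, p))
```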
Although this estimator is attractive from a computational as well as from an asymptotic efficiency point of view, there is also evidence that this or related IPW estimators may be sensitive to large values of $\hat{p}(x)$, which might lead to fat tails in its distribution (see, for example, Frölich, 2004, as well as the discussion in Busso, DiNardo, and McCrary, 2009b). Furthermore, as this estimator exploits the propensity score directly, there is a potential concern that it might be more sensitive to small misspecifications of the propensity score than other estimators that do not exploit the actual value of the propensity score, but instead compare treated and controls with the same value of the score, whatever that value is (e.g., Huber, 2010).
2.4 Direct matching
Pair or one-to-one matching is considered to be the prototype of a matching estimator
(with replacement).11 The pair matching estimator (PM) assigns each control observation $j$ the weight

$$\hat{w}_j = \sum_{i:\,d_i=1} \mathbf{1}\Big(\,|\hat{p}(x_i) - \hat{p}(x_j)| = \min_{k:\,d_k=0} |\hat{p}(x_i) - \hat{p}(x_k)|\,\Big), \qquad d_j = 0,$$

where $\mathbf{1}(\cdot)$ denotes the indicator function, which is one if its argument is true and zero otherwise. This estimator is not efficient, as only one non-treated observation is matched to each treated observation, independent of the sample size. All other control observations obtain a weight of zero even if they are very similar to the observations with positive weight.

11 'With replacement' means that a control observation can be used many times as a match, whereas in estimators 'without replacement' it is used only once. Since the latter principle works only when there are many more controls than treated, it is rarely used in econometrics and will be omitted from this study, in which we consider treatment shares of up to 90%. For matching without replacement, many more matching algorithms appeared in the literature that differ in how they use the scarce pool of good controls optimally (as they can only be used once). See, for example, Augurzky and Kluve (2007) for some discussion of these issues.
Despite its inefficiency, PM also has its merits. Firstly, using only the closest neighbour should reduce the bias (at the expense of additional variance). Secondly, PM is likely to be more robust to propensity score misspecification than IPW, as it remains consistent even if the misspecified propensity score model is a monotone transformation of the true model (see the simulation results in Drake, 1993, Zhao, 2008, Millimet and Tchernis, 2009, and Huber, 2010, suggesting some robustness of matching to over- and under-fitting of the propensity score).
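For illustration, a compact sketch of pair matching on the estimated propensity score (with replacement; breaking ties by the first minimizer is a simplification of this example, not a rule from the paper):

```python
import numpy as np

def pair_matching_atet(y, d, p):
    """One-to-one propensity score matching with replacement: each
    treated unit is matched to the control with the closest estimated
    score; the ATET estimate is the mean within-pair difference."""
    y, d, p = map(np.asarray, (y, d, p))
    yc, pc = y[d == 0], p[d == 0]
    diffs = [yi - yc[np.argmin(np.abs(pc - pi))]
             for yi, pi in zip(y[d == 1], p[d == 1])]
    return float(np.mean(diffs))

y = np.array([4.0, 3.0, 1.0, 2.0, 3.0])
d = np.array([1, 1, 0, 0, 0])
p = np.array([0.7, 0.6, 0.2, 0.4, 0.6])
print(pair_matching_atet(y, d, p))   # 0.5 with these toy numbers
```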
A direct extension of PM is the 1:M propensity score matching estimator which, instead of using just one control, uses several controls. Thus, increasing M increases the precision but also the bias of the estimator. This class of estimators has been analysed by Abadie and Imbens (2009) for the ATE and has been found to be consistent and asymptotically normal for a given value of M. Yet, it appears that no results exist on how to optimally choose M in a data-dependent way. Thus, we focus on 1:1 matching, which is the most frequently used variant in this class of estimators.
The third class of direct matching estimators considered is the one-to-many calliper matching algorithm as, for example, discussed by Rosenbaum and Rubin (1985) and used by Dehejia and Wahba (1999, 2002). Calliper or radius matching uses all comparison observations within a predefined distance around the propensity score of the respective treated observation. This allows for higher precision than fixed nearest-neighbour matching in regions of the χ-space in which many similar comparison observations are available. Also, it may lead to a smaller bias in regions where similar controls are sparse. In other words, instead of fixing M globally, M is determined in the local neighbourhood of each treated observation.
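The local nature of radius matching is easy to see in code. The sketch below assumes a fixed radius `r` (the tuning parameter) and, purely as a convention of this illustration, falls back to the single nearest control when no control lies inside the radius:

```python
import numpy as np

def radius_matching_atet(y, d, p, r=0.05):
    """Radius (calliper) matching: average the outcomes of all controls
    within distance `r` of each treated unit's estimated score, so the
    effective number of matches M varies locally."""
    y, d, p = map(np.asarray, (y, d, p))
    yc, pc = y[d == 0], p[d == 0]
    diffs = []
    for yi, pi in zip(y[d == 1], p[d == 1]):
        inside = np.abs(pc - pi) <= r
        if inside.any():
            diffs.append(yi - yc[inside].mean())
        else:                             # empty radius: nearest control
            diffs.append(yi - yc[np.argmin(np.abs(pc - pi))])
    return float(np.mean(diffs))

y = np.array([4.0, 3.0, 1.0, 2.0, 3.0])
d = np.array([1, 1, 0, 0, 0])
p = np.array([0.7, 0.6, 0.2, 0.4, 0.6])
print(radius_matching_atet(y, d, p, r=0.25))   # 0.75 here
```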
There are further matching estimators evaluated in the literature. For example, Rubin (1979) suggested combining PM with (parametric) regression adjustment to take into account the fact that treated and controls with exactly the same propensity score are usually very rare or non-existent.12 This idea has been taken up again by Abadie and Imbens (2006), who show that for a 1:M matching estimator (directly on X) nonparametric regression can be used to remove the bias from the asymptotic distribution that may occur when X is more than one-dimensional.
An additional suggestion to improve naïve propensity score matching estimators is to use a distance metric that not only includes the propensity score, but in addition those covariates that are particularly good predictors of the outcome (in addition to the treatment). Since this distance metric has many components, usually a Mahalanobis distance is used to compute the distance between the treated and the controls (again, see the discussion in Rosenbaum and Rubin, 1985). The simulation results obtained by Zhao (2004) suggest that this idea works.
12 This idea has been applied by Lechner (1999, 2000) in a programme evaluation study.
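To illustrate such a metric, the sketch below computes Mahalanobis distances between one treated unit and all controls, where the matching vector is assumed (for this example only) to stack the estimated score and one outcome-relevant covariate, scaled by the control-group covariance matrix:

```python
import numpy as np

def mahalanobis_distances(z_treated, Z_controls, cov):
    """Mahalanobis distance between a treated unit and every control,
    computed on a vector of the estimated propensity score plus
    additional covariates that predict the outcome."""
    diff = Z_controls - z_treated
    cov_inv = np.linalg.inv(cov)
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# Toy data: column 0 is the estimated score, column 1 a covariate.
Z_controls = np.array([[0.2, 1.0], [0.4, 2.0], [0.6, 1.5]])
z_treated = np.array([0.55, 1.4])
cov = np.cov(Z_controls, rowvar=False)   # control-group covariance
print(mahalanobis_distances(z_treated, Z_controls, cov))
```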
The estimator proposed by Lechner, Miquel, and Wunsch (2010) and used in several applications by these authors13 combines the features of calliper matching with additional predictors and linear or nonlinear regression adjustment. After the first step of distance-weighted calliper matching with predictors, this estimator uses the weights obtained from matching in a weighted linear or nonlinear regression in order to remove any bias due to mismatches. The matching protocol of this estimator is shown in Appendix A.
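The exact matching protocol is given in Appendix A; the sketch below only conveys the principle of the final regression step under simplifying assumptions: the matching weights are taken as given, the regression is linear in the estimated score alone, and all names and numbers are illustrative.

```python
import numpy as np

def regression_adjusted_counterfactual(y_c, p_c, w_c, p_t):
    """Weighted linear regression of matched control outcomes on the
    estimated score, using the matching weights; the fitted line is
    then evaluated at the treated units' scores to correct for
    remaining score differences (mismatches)."""
    X = np.column_stack([np.ones_like(p_c), p_c])
    W = np.diag(w_c)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_c)   # weighted OLS
    return float(np.mean(beta[0] + beta[1] * p_t))

# Toy inputs: matched controls with weights, and treated units' scores.
y_c = np.array([1.0, 2.0, 3.0])
p_c = np.array([0.2, 0.4, 0.6])
w_c = np.array([0.5, 1.0, 1.5])
p_t = np.array([0.6, 0.7])
print(regression_adjusted_counterfactual(y_c, p_c, w_c, p_t))  # 3.25
```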
2.5 Kernel matching
Propensity score kernel matching is based on the idea of consistently estimating the conditional expectation $m(p(x)) := E[Y \mid D=0, p(X)=p(x)]$ with the control observations and then averaging the estimated function over the empirical distribution of $\hat{p}(X)$ among the treated observations:

$$\hat{\theta}_{KM} = \frac{1}{N_1}\sum_{i:\,d_i=1}\Big(y_i - \hat{m}\big(\hat{p}(x_i)\big)\Big),$$

where $\hat{m}(\cdot)$ denotes the nonparametrically estimated conditional expectation function.
Heckman, Ichimura, and Todd (1998) provide an early analysis of the type of kernel regression estimators that could achieve Hahn's (1998) semiparametric efficiency bound if the covariates were used directly instead of the propensity score (see also Imbens, Newey, and Ridder, 2006). Due to the curse-of-dimensionality problem, the latter is of course not feasible in a typical application.
Considering a continuous outcome, Frölich (2004) investigated several kernel matching estimators and found the estimator that is based on ridge regressions to have the best finite sample properties.

13 See Wunsch and Lechner (2008), Lechner (2009), Lechner and Wunsch (2009a, b), Behncke, Frölich, and Lechner (2010a,b), and Huber, Lechner, and Wunsch (2010).
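To fix ideas about this class of estimators, here is a minimal Nadaraya-Watson version of kernel matching on the estimated score. It is a simpler relative of the local and ridge regression variants discussed here; the Gaussian kernel and the bandwidth `h` (the crucial tuning parameter) are choices of this illustration:

```python
import numpy as np

def kernel_matching_atet(y, d, p, h=0.05):
    """Kernel matching: estimate m(p) = E[Y | D=0, p(X)=p] by a
    Nadaraya-Watson regression over the controls and average
    y_i - m_hat(p_i) over the treated observations."""
    y, d, p = map(np.asarray, (y, d, p))
    yc, pc = y[d == 0], p[d == 0]
    diffs = []
    for yi, pi in zip(y[d == 1], p[d == 1]):
        k = np.exp(-0.5 * ((pc - pi) / h) ** 2)   # Gaussian kernel
        diffs.append(yi - (k * yc).sum() / k.sum())
    return float(np.mean(diffs))

y = np.array([4.0, 3.0, 1.0, 2.0, 3.0])
d = np.array([1, 1, 0, 0, 0])
p = np.array([0.7, 0.6, 0.2, 0.4, 0.6])
print(kernel_matching_atet(y, d, p, h=0.10))
```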