Hidden in Plain Sight:
Influential Sets in Linear Regression
Nikolas Kuschnig1,
Gregor Zens2
Jesús Crespo Cuaresma1,3,4,5,6
1Vienna University of Economics and Business
2Bocconi University
3International Institute for Applied Systems Analysis
4Wittgenstein Centre for Demography and Global Human Capital
5Austrian Institute of Economic Research
6CESifo
Abstract
The sensitivity of econometric results is central to their credibility. In this paper, we investigate the sensitivity of regression-based inference to influential sets of observations and show how to reliably identify and interpret them. We explore three algorithmic approaches to analyze influential sets, and assess the sensitivity of a number of earlier studies in the field of development economics to them. Many results hinge on small influential sets, and inspecting them can provide crucial insights. The analysis of influential sets may reveal omitted variable bias, unobserved heterogeneity, and a lack of external validity, and informs about technical limitations of the methodological approach used.
Keywords: sensitivity, robustness, regression diagnostics, masking, influence
Correspondence to Nikolas Kuschnig, at the Vienna University of Economics and Business,
Welthandelsplatz 1, 1020 Vienna (nikolas.kuschnig@wu.ac.at); Gregor Zens (gregor.zens@unibocconi.it);
Jesús Crespo Cuaresma (jesus.crespo.cuaresma@wu.ac.at). The authors gratefully acknowledge helpful
comments from Ryan Giordano, Daniel Peña, Anthony Atkinson, Lukas Vashold, Maximilian Kasy, as
well as Bernhard Kasberger, Jörg Stoye, Isaiah Andrews, Whitney Newey, and other participants of the
2022 Congress of the European Economic Association and the 2021 Congress of the Austrian Economic
Association.
1 Introduction
Econometric methods are an important instrument of scientific discovery, and are vital for the design of evidence-based policy. By approximating real-world phenomena, they provide us with empirical insights, allow us to test theories, and facilitate prediction. The sensitivity of these methods to modeling assumptions is a long-standing and active subject of research. The econometric literature tends to focus on sensitivity along the horizontal dimension of the data, which is related to the functional form of model specifications. Examples go back to extreme bounds analysis (Leamer, 1983, 1985), and include model averaging (Steel, 2020), elaborate research designs (Angrist and Pischke, 2010), and randomization (Athey and Imbens, 2017). The sensitivity of inference to certain sets of observations, i.e., the vertical dimension of the data, however, has received less attention in the modern econometric literature.

There is a long history of studies on the role of influential observations and outliers in the statistical literature (e.g. Huber and Ronchetti, 2009). It is well known that single influential observations may hold considerable sway over regression results (Cook, 1979), and there are many approaches to identify and account for sensitivity to these observations (see Chatterjee and Hadi, 1986). However, this is not the case for influential sets of observations, which are not as well understood theoretically and empirically. Exact analysis of influential sets is quickly intractable, and earlier papers tend to approximate influential sets using individual, full-sample influences (e.g. Broderick et al., 2020; Peña and Yohai, 1995). Such approximations suffer from masking, a phenomenon where certain observations obscure the influence of others. For these reasons, the literature has mostly sidestepped influential sets, focusing instead on robust estimation and resampling methods (see e.g. Stigler, 2010).
In this paper, we assess the sensitivity of regression-based findings to influential sets. We focus on minimal influential sets, which are sets of observations that nullify a result of interest when they are removed. For this, we explore three algorithms that are computationally tractable, straightforward to implement and use, and balance accuracy and precision against complexity. We revisit the empirical results of several earlier studies in the field of development economics, and analyze their sensitivity to influential sets. Established empirical results, including the effect of the slave trades on income per capita via rugged geography (Nunn and Puga, 2012), and the development impact of the Tsetse fly (Alsan, 2015), hinge on few observations. We show that analyzing these influential sets can provide crucial insights, revealing (inter alia) omitted variable bias, heterogeneous effects, and problems related to data limitations.
The remainder of this paper is structured as follows. In Section 2, we establish the theoretical framework and connect it to the relevant econometric and statistical literature. We present and illustrate the algorithmic approaches to identify and assess influential sets in Section 3. In Section 4, we investigate the sensitivity of empirical results in three studies on long-term development in Africa. We discuss the interpretation of sensitivity to influential sets, and illustrate with additional applied examples in Section 5. In Section 6, we conclude. All codes and data used for this paper are available online.¹
2 Inuential sets in linear regression models
In this section, we present the main concepts needed to assess the eects of inuential
sets of observations in our chosen framework. We also relate the issue of inuential sets
to the literature on inuential observations and outliers, robust estimation methods, and
other relevant approaches.
2.1 Concepts and denitions
Consider the linear regression model
y=Xβ+ε,(1)
where yis an N×1vector containing observations of the dependent variable, Xis an
N×Pmatrix with Pexplanatory variables, βis a P×1vector of coecients to be
estimated, and εis an N×1vector of independent error terms with zero mean and
unknown variance σ2. We denote the ith observation, i.e. row, of yand Xas yiand xi.
The deletion of an observation jis indicated with a subscript in parentheses, that is, y(j)
is the dependent vector without observation j. A set of observations, S, is dened as a
non-empty subset of the set of all observations, i.e., S ˜
S={s|sZ[1, N ]}. We use
the shorthand Nα=Nαto denote a fraction α[0,1] of the data, and indicate the
cardinality of a set using a subscript, i.e., |Sα|=Nα. The empty set is denoted by and
the set of all sets of cardinality Nαis referred to as [S]α.
Our interest lies in the sensitivity of λ, some quantity of interest, to removing influential sets of observations from the sample. We define influential sets as sets whose omission has a large impact on λ, when compared to the omission of most other sets of equal size (following Belsley et al., 1980), and measure the impacts of removing such sets with some (generalized) influence function (similar to Hampel et al., 2005). Consider, for example, the sensitivity of the full-sample, ordinary least squares (OLS) estimate of β to dropping a set of observations S. In this case, we are interested in a function that compares λ(∅) and λ(S), which are given by the following function of the removed set

$$ \lambda(\mathcal{S}) = \left( X_{(\mathcal{S})}' X_{(\mathcal{S})} \right)^{-1} X_{(\mathcal{S})}' y_{(\mathcal{S})}. \qquad (2) $$

¹The repository at https://github.com/nk027/influential_sets includes all scripts, data, and an R package that implements the different approaches described in the paper.
To assess the sensitivity of λ, we consider the minimal influential set, defined as the smallest set whose omission achieves a target impact on λ. We formalize this set by first defining the maximally influential set, S*_α, which achieves the maximal influence for a given size of the omitted set, as

$$ \mathcal{S}^*_\alpha = \arg\max_{\mathcal{S} \in [\mathcal{S}]_\alpha} \Delta(\mathcal{S}, \mathcal{T}, \lambda), \qquad (3) $$

where the influence function, ∆, measures the impact on λ when removing a set S compared to a set T. To ease notation, we will suppress the dependence on λ and the default value T = ∅. One example for ∆ measures the deviation of β_j from the full-sample OLS estimate, ∆(S) = λ(∅)_j − λ(S)_j, with λ(S) defined in Equation 2.

We can then define the minimal influential set, S**, as follows

$$ \mathcal{S}^{**} = \mathcal{S}^*_{\arg\min \alpha} \quad \text{s.t.} \quad \Delta(\mathcal{S}^*_\alpha) \geq \tau, \qquad (4) $$

where τ is a target value of choice. A relevant example is the minimal influential set that achieves a sign switch of coefficient β_j. This can be achieved by setting τ = 0 and letting ∆(S) = −sign(λ(∅)_j) × λ(S)_j. After obtaining a minimal influential set, we are interested in its size, both in absolute terms and relative to the full sample size, and the characteristics of its members.
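To make these definitions concrete, the following R sketch evaluates the exact influence ∆(S) = λ(∅)_j − λ(S)_j of dropping a given set S by refitting OLS without its members. The data, the chosen set, and the function name are purely illustrative and are not part of the accompanying package.

```r
# Minimal sketch: exact influence of dropping a set S on the coefficient of x,
# Delta(S) = lambda(empty)_j - lambda(S)_j, computed by refitting without S.
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 0.5 * d$x + rnorm(100)

influence_of_set <- function(S, data, coef_name = "x") {
  full <- coef(lm(y ~ x, data = data))[coef_name]        # lambda(empty set)
  drop <- coef(lm(y ~ x, data = data[-S, ]))[coef_name]  # lambda(S)
  unname(full - drop)                                    # Delta(S)
}

influence_of_set(S = c(3, 17, 58), data = d)
```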
Conclusively identifying a minimal influential set is computationally prohibitive, since we would need to evaluate ∆ for $\binom{N}{N_\alpha}$ potential sets. Instead, we rely on approximations for all but the most trivial settings. To assess the quality of these approximations, we introduce some concepts related to potential sources of error. First, members of a maximally influential set S*_α may not be identified in the estimated set Ŝ*_α. This phenomenon is referred to as masking, since particularly influential observations are masked by others that appear to be more influential. Masking is a problem related to the accuracy of influential set identification, and its severity can be quantified by the number (or percentage) of masked observations, and the difference in influence between the true and estimated maximally influential set. Second, the estimated influence of a given set may not equal its true influence. This occurs when an approximation of ∆ or λ is used, and can be considered a precision problem of an approximation.
The accuracy and precision of an approximation are tightly related to the existence of sets whose members are jointly influential. Consider the partial influence of an observation i, given by the shorthand δ_i = δ_{i|∅} = ∆({i}, ∅), where we will assume scalar influence values for simplicity. A jointly influential set, U, is one that satisfies δ_{i|J} ≫ δ_{i|∅} for all i ∈ U and J ⊂ U of some minimal size. In words, the influence of any member i increases substantially after the removal of other members. A corollary of this definition is that the influence of a jointly influential set exceeds the sum of full-sample influences of its individual members, i.e. ∆(S) > Σ_{i∈S} δ_{i|∅}. Joint influence is a major challenge for approximations. Since they rely on assessing a limited number of potential sets, jointly influential sets may remain hidden and their influence unaccounted for.
2.2 Inuential observations and sets in the literature
The statistics literature is rich in methods for identifying and dealing with single inu-
ential observations (see e.g. Huber and Ronchetti,2009;Hampel et al.,2005;Maronna
et al.,2019). Measuring inuence is central to this pursuit, and there is a wide variety of
interrelated statistics serving this purpose (see Chatterjee and Hadi,1986, for a review).
The residuals (e=yXˆ
β) and the leverage (the diagonal elements of the ‘hat matrix’,
H=X(XX)1X) are pivotal elements for most measures proposed in the literature. A
notable example directly measures the inuence of an observation ion the OLS estimate
of β, can be expressed as
δi=λ()λ({i})=(XX)1x
iei
1hi
,(5)
where eiand hiare the residual and leverage of observation i. This well-known re-
sult, termed DFBETAiby Belsley et al. (1980), facilitates the quick evaluation of the
individual inuences of all observations. Similar results are available for estimates of
σ2, coecient standard errors (Belsley et al.,1980), and two-stage least squares (2SLS)
estimates (Phillips,1977). Such convenient forms, together with ecient updating for-
mulae (to evaluate, e.g., (X
(i)X(i))1) facilitate computation, but evaluating a minimal
inuential set, however, remains infeasible for all but the simplest settings.2
The impacts of single inuential observations in regression models are well under-
stood, but generally remain limited. As a result, most existing approaches to assess and
account for sensitivities to observations are of a holistic nature, trying to go beyond sin-
gle observations. These include Cook’s distance (Cook,1979), the Welsch-Kuh distance
(Belsley et al.,1980), and a number of Bayesian approaches (e.g. Box and Tiao,1968;
Kass et al.,1989;Pettit and Young,1990). The most prominent approaches are based on
robust estimation methods, such as M- and S-estimators, which are resistant to a number
2Consider a total number of observations of N= 1,000 and all potential sets of size Nα= 10. Assume
that every calculation of λneeds one microsecond, very roughly the time needed to compute the cross-
product of a four-by-four matrix. Enumeration would require about 8.35 billion years, or 1.8 times the
age of the Earth, which is safely out of scope for non-tenured researchers.
5
Kuschnig, Zens, & Crespo Cuaresma
of arbitrarily inuential observations (see e.g. Hampel et al.,2005;Huber and Ronchetti,
2009;Maronna et al.,2019). However, these methods are rarely used in applied work.
Stigler (2010) notes two important drawbacks related to the potential loss of information
and the elusive notion of ‘robustness’ to arbitrary contamination.
Many methods for assessing other types of sensitivities account for the influence of certain sets of observations directly or indirectly. Resampling methods, such as the Jackknife (Efron and Stein, 1981) or the Bootstrap (Efron and Tibshirani, 1994), rely on samples of the data to produce estimates.³ Model averaging methods are mostly concerned with variable selection, but have also been used to assess data sensitivity (see Steel, 2020, for a review). Specifications that are saturated with dummy variables for all observations are conceptually similar to the Jackknife. Other notable methods include winsorizing or trimming the data, where observations with extreme values are replaced or removed, but also methods for outlier detection. Outliers are generally not well-defined (i.e. not influential with respect to an explicit quantity) and unsupervised clustering methods are commonly used (see e.g. Hautamaki et al., 2004; Kaufman and Rousseeuw, 2009; Shotwell and Slate, 2011). These methods show interesting parallels to the analysis of influential sets, but differ in their goals and in how they address computational concerns to reach them.
An important strand of the literature focuses on detecting and accounting for influential sets of observations. These studies often use methods that are based on individual, full-sample influences; this includes many statistics proposed by Belsley et al. (1980), the derived influence matrix of Peña and Yohai (1995), and the approach of Broderick et al. (2020). There are few adaptive procedures, with a notable exception in the works of Atkinson et al. (2010) and Riani et al. (2014). In (applied) econometrics, sensitivity checks to samples of the available data remain exceedingly rare, and usually limited to single observations or ad-hoc procedures. This highlights the need for a better understanding of the role of influential sets, how to conduct sensitivity checks, and how to interpret their results.
3 Algorithms to assess inuential sets
In this section, we formalize three simple algorithms for approximating minimal inu-
ential sets and their inuence. We start with an algorithm that is extremely cheap to
compute, but sacrices accuracy and precision, and proceed with two algorithms that
yield improved precision and accuracy at slightly increased computational cost. Then,
we discuss computational concerns, and illustrate using a simple example.
3The minimal inuential set can be understood as the worst-case scenario of a delete-NαJackknife.
3.1 Algorithm 0: Initial approximation
The first algorithm (Algorithm 0) builds on the full-sample influence of single observations, and generalizes the approach of Broderick et al. (2020).⁴ Maximally influential sets are proposed based on the order of individual, full-sample influences, and their influence is approximated by accumulating these individual influences.

Algorithm 0: Initial approximation.
    set the function ∆, the target τ, and the maximum size U; let S ← ∅;
    compute δ_i = ∆({i}) for all i ∈ S̃;
    while ∆(S) < τ do
        let S ← S ∪ arg max_j δ_j, for j ∉ S;
        let ∆(S) ← Σ_{k∈S} δ_k;
        if |S| ≥ U then return unsuccessful;
    end
    return S, ∆(S);
The algorithm works as follows. First, we set an influence function, a target value, and a maximum size for the minimal influential set. Next, we compute the initial influence δ_i for each i. As discussed above, this computation is relatively cheap in many interesting cases; otherwise, approximations (as in Broderick et al., 2020) could be used. The first iterated step is the proposal of a maximally influential set, which is based on the union of the observations with the largest individual influences, δ_i. The influence of this proposed set is then estimated by summing the individual influences of observations in the set. These two steps are repeated until the specified target or the maximum size is reached.
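A minimal R sketch of this procedure is given below; it targets a single coefficient, computes the individual influences via Equation 5, and uses simulated data. The function and variable names are ours for illustration and do not correspond to the accompanying package.

```r
# Sketch of Algorithm 0: propose sets by the ordering of full-sample influences
# and approximate a set's influence by accumulating them (no refitting).
algorithm0 <- function(delta_i, target, max_size) {
  ord <- order(delta_i, decreasing = TRUE)          # most influential observations first
  influence <- cumsum(delta_i[ord])                 # approximate Delta(S) by summation
  size <- which(influence >= target)[1]             # smallest size reaching the target
  if (is.na(size) || size > max_size) return(NULL)  # unsuccessful
  list(set = ord[seq_len(size)], influence = influence[size])
}

# Example: influences on the slope of x (Equation 5), target of half the estimate
set.seed(1)
x <- rnorm(50); y <- 0.3 * x + rnorm(50)
fit <- lm(y ~ x)
X <- model.matrix(fit)
delta_all <- t(solve(crossprod(X)) %*% t(X)) * residuals(fit) / (1 - hatvalues(fit))
algorithm0(delta_all[, "x"], target = coef(fit)["x"] / 2, max_size = 25)
```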
The method embodied in Algorithm 0 can yield striking results, as demonstrated by the findings of Broderick et al. (2020). However, the low constant computational complexity comes at the price of accuracy and precision. First, the proposal of sets based on the full-sample influences makes the algorithm prone to masking. The estimated influences are not updated, so influential observations may remain masked behind the influence of already removed observations. Second, approximating a set’s influence by summing individual influences suffers from a downward bias that is increasing with the influence of observations, and is particularly large for influential observations (see the Sections A1 and A3 in the Appendix for more information). This means that the algorithm performs badly in the presence of influential observations, let alone influential sets. As a result, sensitivity checks based on Algorithm 0 are prone to convey a false sense of robustness.

⁴See Section A2 in the Appendix for a discussion of the approach used in Broderick et al. (2020).
3.2 Algorithm 1: Initial binary search
The second algorithm (Algorithm 1) rectifies the critical precision issues of Algorithm 0, while retaining high computational efficiency. Similar to Algorithm 0, maximally influential sets are proposed based on the ordering of the individual, full-sample influences. The influence of these proposed sets, however, is calculated exactly. In order to guarantee low computational cost, the algorithm follows a binary search pattern when proposing sets.

Algorithm 1: Initial search.
    set the function ∆, the target τ, and the maximum size U; let L ← 0;
    compute δ_i = ∆({i}) for all i ∈ S̃;
    while L ≤ U do
        let M ← ⌊(L + U)/2⌋;
        let S be the union of the indices of the M largest δ_i;
        compute ∆(S);
        if ∆(S) < τ then let L ← M + 1;
        else let U ← M − 1;
    end
    if ∆(S) ≥ τ then return S, ∆(S);
    else return unsuccessful;
Algorithm 1 sets the size of proposed sets, M, by iteratively halving a search interval [L, U], instead of sequentially increasing it. In each step, the influence of the proposed set is computed exactly, and the bounds of the search interval are updated depending on the influence of the proposed set. If the target is reached, the upper bound is decreased to M − 1, otherwise the lower bound is increased to M + 1. If an approximate minimal influential set exists in the search interval, it is found after O(log U) steps. As a result, Algorithm 1 adds negligible computational overhead over Algorithm 0, making this divide-and-conquer approach practical for large-scale problems.
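The R sketch below illustrates the binary-search step under the (implicit) assumption that the influence of the proposed sets grows with their size; the exact influence is obtained by refitting without the proposed set, and all names and data are illustrative.

```r
# Sketch of Algorithm 1: propose sets by the full-sample influence ordering,
# compute their influence exactly, and binary-search over the set size.
algorithm1 <- function(delta_i, exact_influence, target, max_size) {
  ord <- order(delta_i, decreasing = TRUE)
  L <- 1; U <- max_size; best <- NULL
  while (L <= U) {
    M <- floor((L + U) / 2)
    S <- ord[seq_len(M)]                  # the M observations with the largest delta_i
    if (exact_influence(S) >= target) {   # target reached: try a smaller set
      best <- S; U <- M - 1
    } else {                              # target missed: try a larger set
      L <- M + 1
    }
  }
  best                                    # NULL if no proposed set reaches the target
}

# Example: exact influence on the slope of x, computed by refitting without S
set.seed(1)
d <- data.frame(x = rnorm(50)); d$y <- 0.3 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
exact <- function(S) unname(coef(fit)["x"] - coef(lm(y ~ x, data = d[-S, ]))["x"])
delta_i <- sapply(seq_len(nrow(d)), exact)      # individual, full-sample influences
algorithm1(delta_i, exact, target = coef(fit)["x"] / 2, max_size = 25)
```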
Algorithm 1 yields precise influence estimates of a given set, which are not directly affected by masking. However, its accuracy is not guaranteed. Since maximally influential sets are still based on the individual, full-sample influences, masking is still an issue. Influential observations are likely to remain hidden behind already removed observations, especially when they are part of a jointly influential set. The algorithm also relies on the (previously implicit) assumption that the influence of estimated maximally influential sets increases steadily with the size of the set. The next algorithm abstracts from this assumption and is designed to explicitly address masking.
3.3 Algorithm 2: Adaptive approximation
The third algorithm (Algorithm 2) uses a simple adaptive procedure for identifying the minimal influential set. Maximally influential sets are constructed iteratively, using updated influence estimates. This facilitates the discovery of masked observations, which improves accuracy. The adaptive nature of this algorithm allows for good accuracy and precision, while retaining computational tractability.

Algorithm 2: Adaptive approximation.
    choose the function ∆, the target τ, and the maximum size U; let S ← ∅;
    while ∆(S) < τ do
        compute ∆(S ∪ j), for all j ∈ S̃ \ S;
        let S ← S ∪ arg max_j ∆(S ∪ j);
        if |S| ≥ U then return unsuccessful;
    end
    return S, ∆(S);
Algorithm 2 starts by computing the influence of all individual observations. Maximally influential sets are proposed adaptively, by forming the union of the previous set (starting with the empty set) and the observation with the highest influence (at each step). This is repeated until a minimal influential set is found, or the maximal size is reached. The algorithmic complexity is linear in the cardinality of the set and falls well short of, e.g., a Jackknife approach. By computing individual influences after every removal, this approach reduces the risk of masking problems and allows us to more reliably investigate sensitivity to influential sets.
3.4 Approximations and computational concerns
The algorithms presented above are computationally straightforward, with complexities that are constant, as well as logarithmic and linear in the size of the minimal influential set. In our linear setting, most quantities of interest rely on matrix factorization, with a complexity of O(N³). This means that the computation of the influence function dominates the overall complexity in practice. Fortunately, many quantities of interest, such as coefficients and standard errors, can be computed efficiently using algebraic results (e.g. Equation 5) and updating formulae (for cross products, inverses, and factorizations). Together with optimization of the underlying linear algebra, this allows us to quickly assess the sensitivity to influential sets in a wide range of applications.
Further speed gains can be realized with approximate methods. When the number of covariates is large, it can be helpful to marginalize out nuisance covariates before computing influence measures. Iterative solvers are another option that can help speed up operations; suitable stopping criteria can make the results exact for our purposes (see e.g. Trefethen and Bau, 1997). It can also be helpful to fix or discard computationally intensive elements. For instance, there is no convenient method to update clustered standard errors, and their computation becomes prohibitive in large settings. By using non-clustered standard errors for proposing sets, we can facilitate sensitivity checks at large scales. However, this simplification can be problematic when a large part of the influence is driven by the simplified element.⁵
3.5 An illustration
Here, we illustrate the characteristics of the three algorithms using a simple example. Consider fitting an OLS regression line to the data depicted in Figure 1. The set marked ‘a’ in the top-right of the scatter plot is influential on the estimated positive slope, lowering it considerably when removed. This influence is reflected in standard diagnostics, and all three algorithms will identify this influential set. This will not be the case for the set of three observations in the center-right of the scatter plot, marked ‘b’. Its influence is initially masked by the first influential set, and neither Algorithm 0 nor Algorithm 1 identify any of its elements. By contrast, the adaptive nature of Algorithm 2 allows for the masked observations to be identified.
Figure 1: OLS regression line for a dataset (N = 60) with two sets of size three that are influential on the positive slope. The set marked ‘a’ masks the influence of the set marked ‘b’. The solid gray line indicates the full-sample estimates; the dashed dark gray line indicates the estimate after removing both sets.
⁵This is a problem of the linear approximation of Broderick et al. (2020) to Equation 5, which yields DFBETA_i ≈ (X'X)⁻¹ x_i' e_i, thus disregarding the effect of leverage (which is expensive to compute). This means that the influence of all observations is biased downwards, with the bias being particularly large for high-leverage observations. This is problematic since these observations tend to be particularly influential. See Section A2 of the Appendix for more information on the nature and size of the bias of this approximation.
Figure 2 depicts the maximally influential sets that are identified by the different algorithms at increasing size, as well as the implied regression lines after accounting for influential sets of size three and seven. Masking affects the sets identified by Algorithms 0 and 1 (see left panel in Figure 2), leading to relatively inconsequential observations being identified after the first three removals. The approximation in Algorithm 0 even fails to account for the full influence of the first (jointly influential) set, and considerably underestimates the slope starting after the first removal. Algorithm 2 does not suffer from these problems; the implied slopes are accurate and precise, and masking issues are avoided.⁶
Figure 2: Influential set analysis with Algorithms 0, 1, and Algorithm 2. The order of the identified observations is color-coded and indicated by a cross (×, first three) and a crosshair (+, next four). The solid gray line indicates the full-sample regression line. The dotted and dashed lines (labelled with the algorithm used) are the implied regression lines after accounting for influential sets of sizes three and seven.
4 Empirical applications
In this section, we investigate the sensitivity to influential sets of three empirical results on long-term economic development in Africa. We focus on the influence on the t values of the main effect of interest, and assess minimal influential sets that induce (1) a loss of significance (at a given level), (2) an estimate of the opposite sign, and (3) a significant estimate of the opposite sign.

⁶In terms of computation, all three algorithms executed in less than 0.1 seconds.
Identifying factors behind economic development in African countries has been a central endeavor in development economics over the last decades. Factors like the colonial experiences (Acemoglu et al., 2001), the slave trades (Nunn, 2008), and precolonial institutions and centralization (Gennaioli and Rainer, 2007; Michalopoulos and Papaioannou, 2013) are known to play a fundamental role. The drivers behind these factors, and the pathways through which they affect development today, are an important subject of research (Spolaore and Wacziarg, 2013). Three empirical studies on this topic find that the slave trades affect development via ruggedness (Nunn and Puga, 2012) and interpersonal trust (Nunn and Wantchekon, 2011), and that the Tsetse fly affects precolonial centralization (Alsan, 2015). Below, we assess the sensitivity of the empirical findings presented in these three studies to influential sets of observations.
4.1 Geography, development, and omitted variables
Nunn and Puga (2012) estimate linear regression models that relate GDP per capita across countries of the world to a measure of terrain ruggedness and other control variables. They allow for heterogeneity of African countries via an interaction term. Nunn and Puga (2012) find a significantly negative overall estimate of the effect of ruggedness, and a significantly positive estimate for African countries (cf. Table 1, Column 1). This differential effect is interpreted as rugged geography offering protection against the slave trades. The authors perform rigorous checks to address the sensitivity of these results to influential observations, and their findings appear to be robust. In particular, the results
are unchanged when omitting observations that exceed a threshold of |DFBETA_i| > 2/√N (following Belsley et al., 1980), as well as the ten smallest and most rugged observations. We find that robust estimates support this conclusion (see Table 1, Column 2).
We investigate the sensitivity of this differential effect of ruggedness in Africa to influential sets. In the row labeled ‘Thresholds’ of Table 1, we report the sizes of influential sets that would induce (1) an insignificant estimate, (2) an estimate of the opposite sign (in square brackets), and (3) a significant estimate of the opposite sign (in curly brackets). We find that an influential set of two observations overturns the significance of the differential effect. An influential set of five observations is enough to induce a sign-switch, and one of size eleven achieves a significant sign-switch. To provide more context for these observations, we visualize the five most influential ones in Figure 3, Panel A. Their influence is visualized in Panel B, which shows the effect of each subsequent observation (indexed by the ISO code of the country) on the coefficient’s t value. Starting at a value of 2.53, the t value decreases slightly with the removal of the most influential observation, which is that of the Seychelles.
Table 1: The dierential eect of ruggedness in Africa
Baseline
Robust
Population
Land area
Ruggedness, Africa0.321 0.325 0.190 0.215
(2.53) (2.46) (1.66) (1.63)
Ruggedness -0.231 -0.251 -0.231 -0.238
(-2.99) (-3.23) (-2.94) (-3.08)
Controls Yes Yes Yes Yes
Population in 1400 Yes
Land area Yes
Thresholds2[5]{11} –[3]{6} –[4]{8}
Observations 170 170 168 170
R20.537 0.533 0.571 0.554
The row labeled ‘Thresholds’ reports the sizes of inuential sets that induce a loss of signicance (at the 5% level), [a
sign ip], and {a signicant sign ip} of the coecient that captures the dierential eect of ruggedness in Africa (using
Algorithm 2). The ‘Baseline’ column reproduces the results of Nunn and Puga (2012, see Table 1, Column 6) using OLS
estimation, while ‘Robust’ uses robust M-estimation. The specications labeled ‘Population’ and ‘Land area’ add the
population level in the year 1400, and the land area of the country (both in logs) as covariates. Coecient estimates are
reported with tvalues based on HC1 robust standard errors in parentheses.
the inuence of subsequent removals increases considerably. This pattern is indicative of
a jointly inuential set, and remains hidden from Algorithm 0 (on the right), and calls
for closer inspection.
Analyzing inuential sets and the characteristics of their members can provide us
valuable insights. First, the importance of the Seychelles casts doubts on the interpre-
tation presented in Nunn and Puga (2012). The island nation has only been inhabited
permanently since the late 18th century (Fauvel,1909), and its ruggedness played no role
in mediating the eects of the slave trades. Next, the ve most inuential nations are
extraordinarily small in terms of land area and population. This may indicate a survivor-
ship bias, where the inclusion of countries in the dataset is (at least in part) determined
by land area, population, geography, and economic success. Moreover, past population
sizes are likely related to other geographical features that confound the eect of rugged-
ness, and would also play an important role in mediating impacts of the slave trades. In
Columns 3 and 4 of Table 1, we present the results of specications that include past
population size and land area as additional controls. There, no signicant dierential
eect of ruggedness on economic development can be found for the African continent.
Figure 3: Two nations drive the blessing of bad geography
Panel A: Influential nations and ruggedness. Panel B: Influence estimates. The five most influential observations are (1) the Seychelles, (2) Lesotho, (3) Rwanda, (4) Eswatini, and (5) the Comoros (Seychelles and Comoros not to scale).
Panel A visualizes ruggedness, the explanatory variable of interest, and the five most influential observations. Panel B shows the cumulative reduction of the t value as observations (indicated with their ISO codes) are removed (from 2.53 at the top, to the bottom of the respective observation) using Algorithms 2 (left) and 0 (right).
4.2 Slave trades and heterogeneous origins of mistrust
Nunn and Wantchekon (2011) analyze the role of the Atlantic and East African slave trades as determinants of interpersonal trust, using individual-level survey data and historical information on the slave trades. They regress measures of the trust of relatives and neighbors (among others) on a measure of slave trade intensity, a number of individual and district-level covariates, and country fixed effects. The design matrix of the regression model contains over 20,000 observations, spanning several countries in sub-Saharan Africa, and a total of 78 regressors. Uncertainty about the estimated effects is quantified using standard errors that are clustered along ethnicity and district. The authors find statistically and economically significant negative effects of the slave trades on interpersonal trust. Their findings are robust to removing Kenya and Mali (which were also impacted by the trans-Saharan and Red Sea slave trades) from the sample.
For this analysis, a sensitivity check based on the influential sets is computationally challenging, since there exists no updating formula for two-way clustered standard errors. To facilitate our analysis, we only cluster standard errors after the removal of an observation, and not for proposed removals. We reproduce the empirical results of Nunn and Wantchekon (2011) in Table 2.
Table 2: The origins of mistrust

                            Trust of relatives                Trust of neighbors
                         Pooled       West | East          Pooled       West | East
Exports/area             -0.133            -0.145          -0.159            -0.168
                        (-3.68)           (-3.84)         (-4.67)           (-4.48)
Exports/area, East                          0.053                             0.023
                                           (0.96)                            (0.32)
Individual controls         Yes               Yes             Yes               Yes
District controls           Yes               Yes             Yes               Yes
Country fixed effects       Yes               Yes             Yes               Yes
Thresholds      105 [380] {656}    78 [301] {532}  161 [425] {768}   133 [323] {527}
Observations             20,062    7,549 | 12,513          20,027    7,523 | 12,504
Ethnicity clusters          185          62 | 123             185          62 | 123
District clusters         1,257         628 | 651           1,257         628 | 651
R²                        0.133     0.199 | 0.097           0.156     0.228 | 0.117

The row labeled ‘Thresholds’ reports the sizes of influential sets that induce a loss of significance (at the 5% level), [a sign flip], and {a significant sign flip} of the coefficient that captures the effect of the slave trades (measured as exports of slaves per area) on trust (using Algorithm 2). The columns labeled ‘Pooled’ reproduce the results of Nunn and Wantchekon (2011, see Table 2, Columns 1 and 2). The columns labeled ‘West | East’ estimate separate models for observations to the west and east of the 20° Eastern meridian (thresholds refer to the Western subset). Coefficient estimates are reported with t values based on two-way clustered standard errors in parentheses.
We find that an influential set of size 105 (0.5% of the sample) induces a loss of significance of the estimated effect of the slave trades, and 380 removals (1.9% of the sample) lead to a sign-flip that becomes significant after 656 removals (3.3% of the sample) for the trust of relatives, with similar results for the trust of neighbors.
An analysis of these inuential sets shows that 536 of the 600 most inuential ob-
servations (89.3%) stem from three West African nations: Benin, Nigeria, and Ghana
(marked with black borders in Figure 3, also see Table A2 in the Appendix for more
details). These nations were major centers of the Atlantic slave trade, and their large
inuence may suggest dierences in impacts between the Atlantic and East-African slave
trades. When dividing the dataset into a Western and Eastern sample (with the 20° East-
ern meridian as the dividing line), we nd a signicantly negative eect for the Western
sample, and an insignicant eect for the Eastern sample (see Table 2, Columns 2 and
4). This result is consistent with the literature, which suggests larger impacts from the
Atlantic slave trade (see Nunn,2008).
4.3 The concentrated eects of the Tsetse y
The Tsetse y is an important vector of disease and considered a hindrance to early eco-
nomic development. Alsan (2015) investigates this detrimental eect, concentrating on
its role as a determinant of agricultural practices, urbanization, and institutions. The
empirical analysis in Alsan (2015) is based on regressing various outcomes at the eth-
nic group level on a Tsetse suitability index (TSI) and a number of controls, clustering
standard errors at the province level. Using ethnicity-level data, she nds signicant
detrimental eects of the TSI across the board, with precolonial centralization as the
major channel aecting present economic development. These results are robust to per-
turbations of the TSI and to corrections aimed at assessing the negative selection bias
that may occur due to more developed ethnic groups displacing less developed ones.
Table 3: The eects of the Tsetse y
Animals
Intensive
Plow
Female
Density
Slavery
Centralized
TSI-0.231 -0.09 -0.057 0.206 -0.745 0.101 -0.075
(-5.47) (-3.29) (-2.54) (3.41) (-3.25) (2.51) (-2.12)
Controls Yes Yes Yes Yes Yes Yes Yes
Thresholds33[58]79 7[25]41 3[12]17 12[30]48 9[27]42 4[22]35 1[16]30
Observations 484 485 484 315 398 446 467
Clusters 44 44 44 43 43 44 44
R20.296 0.268 0.462 0.285 0.254 0.178 0.139
The row labeled ‘Thresholds’ reports the sizes of inuential sets that induce a loss of signicance (at the 5% level), [a
sign ip], and {a signicant sign ip} of the coecient that captures the eect of the TSI on the respective dependent
variable. The dependent variables are binary and continuous, and measure whether a precolonial ethnic group (1) possessed
large domesticated ‘Animals’, (2) adopted ‘Intensive’ agriculture, (3) adopted the ‘Plow’, (4) had ‘Female’ participation in
agriculture, (5) log population ‘Density’, (6) practiced indigenous ‘Slavery’, and (7) had a ‘Centralized’ state. Coecient
estimates are reported with tvalues based on clustered standard errors in parentheses.
We reproduce these results in Table 3, and investigate the sensitivity of the estimated effects to the sample. First, we focus on the sensitivity to influential sets. We find that the significance of the results hinges on few observations, ranging from one (for the variable measuring whether a centralized state was present) to 33 (for the possession of large domesticated animals). Influential sets of sizes between twelve and 79 induce significant results of the opposite signs. The sensitivities of the results in Alsan (2015) are pronounced, with a possible exception for the effects on the domestication of large animals. The findings are supported by a relatively small amount of data, which may be due to a lack of variation in the outcome variables. In spite of the reasonable sample size, the six binary outcomes are relatively rare (the plow adoption, for example, only occurred for 37 of the 484 observations).
For comparison, we use robust M-estimation and find that five of the seven specifications are effectively unchanged (see Table A3 in the Appendix for the estimation results). Two specifications (those with the plow adoption and indigenous slavery as outcomes) yield a pathological fit. Next, we use a high-breakdown S-estimator. The effects on population density (which are not particularly robust to influential sets) remain unchanged. Meanwhile, the six other specifications yield a pathological fit. Between 106 and 286 (or 22 and 91 percent of) observations are flagged as outliers. Conclusions based on robust estimators depend heavily on the type of estimator, and can differ strongly from ones based on influential sets. While robust estimators also flag sets of impactful observations, these tend to be larger than the influential sets obtained with the methods presented here, and are not as immediately relevant to a particular inferential result of interest.
5 Discussion
In the three applications presented, the main results appear sensitive to influential sets of observations. Analysis of the nature of these influential sets can reveal omitted variable bias, heterogeneity, or limited support from the data. In general, however, the question of how to interpret minimal influential sets remains unanswered. In this section, we elaborate on this issue and provide some additional illustrations based on empirical examples.
Expert knowledge is vital to interpreting sensitivity to influential sets. The degree of sensitivity that we expect a priori for a given application provides important context for the interpretation. When searching for the proverbial needle in the haystack, we expect high sensitivity to little data, while we expect to find a good amount of needles when investigating a sewing kit. Analyses of rare events such as disease outbreaks or economic crises, as well as policy interventions that only affect a small part of the population, are inherently sensitive to small subsets of the data. Phenomena that appear more universally, such as convergence in growth or income levels across economies, should only be susceptible to larger fractions of the data. Having a good understanding of the distribution and intensity of the effects hypothesized (and of other underlying factors) allows us to gain valuable insights from analyzing influential sets.
First, consider the role of inuential sets when analyzing rare, concentrated phenom-
ena. The eects of the slave trades on mistrust, for instance, are likely to manifest only in
17
Kuschnig, Zens, & Crespo Cuaresma
a few impacted communities. Hence, the size of the minimal inuential sets that we nd
(e.g., 105 / 0.5% of the sample for a loss of signicance) do not appear to be particularly
worrying. Yet, these inuential sets may be insightful by highlighting strongly aected
subsamples. In the case of the eects of the Tsetse y, however, the sensitivity found
can be considered more problematic, despite potentially high concentration. The small
sizes of the identied minimal inuential sets imply that the results rely on deceptively
few observations. Since the distribution of impacts is not always obvious a priori, a small
minimal inuential set can bring dierences in the intensity of eects to our attention.
One example for this concerns the impacts of microcredits (see and Section A2 of the
Appendix and Broderick et al.,2020), which are sensitive to few observations in the heavy
tails of the outcome variable.
Second, consider minimal inuential sets in the context of more universal phenomena.
One notable example is cross-country convergence of poverty rates. Based on theoretical
considerations about income convergence and the link between mean income growth and
poverty dynamics (see Johnson and Papageorgiou,2020, for a survey), we would expect
poverty convergence at the global level to be present in the data. Ravallion (2012),
however, nds that the convergence rate in poverty rates is statistically insignicant in
a sample of 89 countries. As Crespo Cuaresma et al. (2016,2022) point out, this result
is likely to be driven by the idiosyncratic experience of Eastern European countries. An
analysis of inuential sets in this setting reveals exactly this result, acting as a substitute
for and complement to domain knowledge. The minimal inuential set of Belarus, Latvia,
Ukraine, and Poland is behind the lack of signicance of the convergence estimate. An
alternative specication (proposed by Crespo Cuaresma et al.,2016) leads to estimates
of the convergence rate that appear to be less sensitive (see Section B3 in the Appendix
for more information on the application results). Minimal inuential sets of observations
that appear small for the context can highlight misspecication issues, and an analysis
of these sets can help shed light on the nature of the issues.
Lastly, consider the technical circumstances that give rise to an inferential result. Here, it helps to think of three tightly connected cornerstones: the composition of the data, the identification of parameters, and the estimation method employed. Least squares estimates tend to be sensitive to observations with high leverage and large residuals, while shrinkage estimators will be less impacted by these. On the other hand, the compound nature of 2SLS estimators makes them more susceptible to small influential sets, and numerical instability of the estimators can quickly become a problem. We discuss these issues further in a simulation exercise and in the context of the long-term impacts of migration (as studied by Droller, 2018) in Sections A4 and B4 of the Appendix. Evidently, these issues are related to parameter identification and the variability of the data. For the effect of the Tsetse fly on plow adoption, there is little variation in the outcome that allows us to identify the effect. It is not surprising that estimates are sensitive to influential sets; removing a set comprising less than 10% of the data means that the effect can no longer be estimated. Such technical sensitivities are a steady feature of empirical analysis, and influential sets can bring them to our attention.
The interpretation of sensitivity to influential sets is highly dependent on context. Sensitivity to influential sets does not conclusively indicate a lack of robustness by itself. Instead, the analysis of influential sets presents an intuitive, novel tool for gaining deeper insights about data structures and inferential quantities in a wide range of settings. Reporting the absolute and relative size of minimal influential sets in regression analysis and analyzing their characteristics may help increase the transparency of research communication and significantly improve our understanding of socioeconomic phenomena.
6 Conclusions
In this paper, we investigated the sensitivity of inferential quantities to influential sets of observations. We explored three algorithms aimed at identifying minimal influential sets and quantifying inferential sensitivity to them. We showed how masking, where certain observations obscure the influence of others, complicates sensitivity checks by inducing false negatives. We investigated the sensitivity to influential sets in several empirical applications related to the determinants of economic development in Africa, demonstrating the practical relevance and utility of this sensitivity check. Our analysis showed that sensitivity to influential sets is prevalent in practice. A great deal of interesting phenomena are exceedingly rare, and many important insights may hinge on few observations. The proposed sensitivity check to influential sets is not necessarily to be interpreted as an indicator of a lack of validity, but analysis of these sets can be an important tool for identifying possible shortcomings of the model, and potentially fruitful extensions of the analysis.
In applied econometric practice, the sensitivity to sets of observations is still rarely assessed in a systematic manner. Reinforced by our findings in this paper, we believe that such sensitivity checks should play a more important role in the communication of the results of econometric analysis. Influential sets can deliver valuable insights, and the size of minimal influential sets can serve as an intuitive summary measure. Two important advantages as compared to existing measures, such as Cook’s distance or robust estimates, are their interpretability and salience. Summaries are directly tied to a statistic of interest, and influential sets can be analyzed in detail. Sensitivity checks to influential sets are necessarily approximate, and robustness cannot be attested conclusively. Nonetheless, many false negatives can be avoided at a reasonable computational cost, and approximations can unveil important insights that would otherwise remain hidden in plain sight.
There are several pathways for future work building upon this contribution. On the topic of influential sets, many potential improvements in terms of computational efficiency and accuracy are conceivable. In a wider sense, there is a lack of comprehensive indicators and summary measures for sensitivities to the data. Lastly, the approach presented here opens the door for further replication studies and sensitivity checks of documented empirical phenomena, which may deliver valuable insights for researchers and policymakers.
References
Daron Acemoglu, Simon Johnson, and James A. Robinson. The colonial origins of comparative
development: an empirical investigation. American Economic Review, 91(5):1369–1401, 2001.
ISSN 0002-8282. doi:10.1257/aer.91.5.1369.
Marcella Alsan. The eect of the TseTse y on African development. American Economic
Review, 105(1):382–410, 2015. ISSN 0002-8282. doi:10.1257/aer.20130604.
Manuela Angelucci, Dean Karlan, and Jonathan Zinman. Microcredit impacts: evidence from a
randomized microcredit program placement experiment by Compartamos Banco. American
Economic Journal: Applied Economics, 7(1):151–82, 2015. doi:10.1257/app.20130537.
Joshua D. Angrist and Jörn-Steen Pischke. The credibility revolution in empirical economics:
how better research design is taking the con out of econometrics. Journal of Economic
Perspectives, 24(2):3–30, 2010. doi:10.1257/jep.24.2.3.
Susan Athey and Guido W. Imbens. The state of applied econometrics: causality and policy
evaluation. Journal of Economic Perspectives, 31(2):3–32, 2017. doi:10.1257/jep.31.2.3.
Anthony C. Atkinson, Marco Riani, and Andrea Cerioli. The forward search: theory and data
analysis. Journal of the Korean Statistical Society, 39(2):117–134, 2010. ISSN 1226-3192.
doi:10.1016/j.jkss.2010.02.007.
Orazio Attanasio, Britta Augsburg, Ralph De Haas, Emla Fitzsimons, and Heike Harmgart.
The impacts of micronance: evidence from joint-liability lending in Mongolia. American
Economic Journal: Applied Economics, 7(1):90–122, 2015. doi:10.1257/app.20130489.
Britta Augsburg, Ralph De Haas, Heike Harmgart, and Costas Meghir. The impacts of mi-
crocredit: evidence from Bosnia and Herzegovina. American Economic Journal: Applied
Economics, 7(1):183–203, 2015. doi:10.1257/app.20130272.
Abhijit Banerjee, Esther Duo, Rachel Glennerster, and Cynthia Kinnan. The miracle of mi-
cronance? evidence from a randomized evaluation. American Economic Journal: Applied
Economics, 7(1):22–53, 2015. doi:10.1257/app.20130533.
David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression diagnostics: Identifying influential
data and sources of collinearity. John Wiley & Sons, 1980. doi:10.1002/0471725153.
Kirill Borusyak, Peter Hull, and Xavier Jaravel. Quasi-experimental shift-share re-
search designs. Review of Economic Studies, 89(1):181–213, 2022. ISSN 0034-6527.
doi:10.1093/restud/rdab030.
George E. P. Box and George C. Tiao. A Bayesian approach to some outlier problems.
Biometrika, 55(1):119–129, 1968. doi:10.1093/biomet/55.1.119.
Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample robustness
metric: can dropping a little data change conclusions?, 2020.
Samprit Chatterjee and Ali S. Hadi. Influential observations, high leverage points, and outliers
in linear regression. Statistical Science, 1(3):379–393, 1986. doi:10.1214/ss/1177013622.
Ralph Dennis Cook. Inuential observations in linear regression. Journal of the American
Statistical Association, 74(365):169–174, 1979. doi:10.2307/2286747.
Bruno Crépon, Florencia Devoto, Esther Duflo, and William Parienté. Estimating the
impact of microcredit on those who take it up: evidence from a randomized experi-
ment in Morocco. American Economic Journal: Applied Economics, 7(1):123–50, 2015.
doi:10.1257/app.20130535.
Jesús Crespo Cuaresma, Stephan Klasen, and Konstantin M. Wacker. There is poverty conver-
gence. SSRN Electronic Journal, 2016. doi:10.2139/ssrn.2718720.
Jesús Crespo Cuaresma, Stephan Klasen, and Konstantin M. Wacker. When do we see
poverty convergence? Oxford Bulletin of Economics and Statistics, 2022. ISSN 0305-9049.
doi:10.1111/obes.12492.
Francisco Cribari-Neto, Tatiene C. Souza, and Klaus L. P. Vasconcellos. Inference under het-
eroskedasticity and leveraged data. Communications in Statistics - Theory and Methods, 36
(10):1877–1888, 2007. doi:10.1080/03610920601126589.
Federico Droller. Migration, population composition and long run economic development: ev-
idence from settlements in the Pampas. The Economic Journal, 128(614):2321–2352, 2018.
doi:10.1111/ecoj.12505.
Bradley Efron and Charles Stein. The Jackknife estimate of variance. The Annals of Statistics,
9(3), 1981. doi:10.1214/aos/1176345462.
Bradley Efron and Robert J. Tibshirani. An introduction to the Bootstrap. CRC Press, 1994.
doi:10.1201/9780429246593.
A. A. Fauvel. Unpublished documents on the history of the Seychelles islands anterior to
1810. Government Printing Oce Mahe, Seychelles, 1909. URL www.loc.gov/item/
unk83018617/.
Nicola Gennaioli and Ilia Rainer. The modern impact of precolonial centralization in Africa.
Journal of Economic Growth, 12(3):185–234, 2007. ISSN 1573-7020. doi:10.1007/s10887-007-
9017-z.
Paul Goldsmith-Pinkham, Isaac Sorkin, and Henry Swift. Bartik instruments: what, when,
why, and how. American Economic Review, 110(8):2586–2624, 2020. ISSN 0002-8282.
doi:10.1257/aer.20181047.
Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust
statistics: the approach based on inuence functions, volume 196. John Wiley & Sons, 2005.
doi:10.1002/9781118186435.
Ville Hautamaki, Ismo Karkkainen, and Pasi Franti. Outlier detection using k-nearest neighbour
graph. In Proceedings of the 17th International Conference on Pattern Recognition, 2004.
ICPR 2004., volume 3, pages 430–433. IEEE, 2004. doi:10.1109/ICPR.2004.1334558.
Peter J. Huber and Elvezio M. Ronchetti. Robust statistics. John Wiley & Sons, Ltd., Chichester,
England, UK, January 2009. ISBN 978-0-47012990-6. doi:10.1002/9780470434697.
Paul Johnson and Chris Papageorgiou. What remains of cross-country convergence? Journal
of Economic Literature, 58(1):129–75, 2020. doi:10.1257/jel.20181207.
Dean Karlan and Jonathan Zinman. Microcredit in theory and practice: using ran-
domized credit scoring for impact evaluation. Science, 332(6035):1278–1284, 2011.
doi:10.1126/science.1200138.
Robert E. Kass, Luke Tierney, and Joseph B. Kadane. Approximate methods for as-
sessing inuence and sensitivity in Bayesian analysis. Biometrika, 76(4):663–674, 1989.
doi:10.1093/biomet/76.4.663.
Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: An introduction
to cluster analysis, volume 344. John Wiley & Sons, 2009. ISBN 978-0470316801.
doi:10.1002/9780470316801.
Stephan Klasen and Mark Misselhorn. Determinants of the growth semi-elasticity of poverty
reduction, 2008.
Edward E. Leamer. Let’s take the con out of econometrics. American Economic Review, 73(1):
31–43, 1983. URL https://www.jstor.org/stable/1803924.
Edward E. Leamer. Sensitivity analyses would help. American Economic Review, 75(3):308–313,
1985. URL https://www.jstor.org/stable/1814801.
Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai, and Matías Salibián-Barrera.
Robust statistics: theory and methods (with R). John Wiley & Sons, 2019.
doi:10.1002/9781119214656.
Stelios Michalopoulos and Elias Papaioannou. Pre-colonial ethnic institutions and con-
temporary African development. Econometrica, 81(1):113–152, 2013. ISSN 1468-0262.
doi:10.3982/ECTA9613.
Nathan Nunn. The long-term effects of Africa’s slave trades. Quarterly Journal of Economics,
123(1):139–176, 2008. ISSN 0033-5533. doi:10.1162/qjec.2008.123.1.139.
Nathan Nunn and Diego Puga. Ruggedness: the blessing of bad geography in Africa. Review
of Economics and Statistics, 94(1):20–36, 2012. doi:10.1162/REST_a_00161.
Nathan Nunn and Leonard Wantchekon. The slave trade and the origins of mistrust in Africa.
American Economic Review, 101(7):3221–52, 2011. doi:10.1257/aer.101.7.3221.
Daniel Peña and Victor J. Yohai. The detection of influential subsets in linear regression by
using an influence matrix. Journal of the Royal Statistical Society: Series B (Methodological),
57(1):145–156, 1995. doi:10.1111/j.2517-6161.1995.tb02020.x.
Lawrence I. Pettit and Karen D. S. Young. Measuring the effect of observations on Bayes
factors. Biometrika, 77(3):455–466, 1990. doi:10.1093/biomet/77.3.455.
Garry D. A. Phillips. Recursions for the two-stage least-squares estimators. Journal of Econo-
metrics, 6(1):65–77, 1977. doi:10.1016/0304-4076(77)90055-0.
Martin Ravallion. Why don’t we see poverty convergence? American Economic Review, 102
(1):504–23, 2012. doi:10.1257/aer.102.1.504.
Marco Riani, Andrea Cerioli, Anthony C. Atkinson, and Domenico Perrotta. Monitoring
robust regression. Electronic Journal of Statistics, 8(1):646–677, 2014. ISSN 1935-7524.
doi:10.1214/14-EJS897.
Matthew S. Shotwell and Elizabeth H. Slate. Bayesian outlier detection with Dirichlet process
mixtures. Bayesian Analysis, 6(4):665–690, 2011. doi:10.1214/11-BA625.
Enrico Spolaore and Romain Wacziarg. How deep are the roots of economic development? Jour-
nal of Economic Literature, 51(2):325–69, 2013. ISSN 0022-0515. doi:10.1257/jel.51.2.325.
Mark F. J. Steel. Model averaging and its use in economics. Journal of Economic Literature,
58(3):644–719, 2020. doi:10.1257/jel.20191385.
Stephen M. Stigler. The changing history of robustness. American Statistician, 64(4):277–281,
2010. ISSN 0003-1305. doi:10.1198/tast.2010.10159.
Alessandro Tarozzi, Jaikishan Desai, and Kristin Johnson. The impacts of microcredit: evi-
dence from Ethiopia. American Economic Journal: Applied Economics, 7(1):54–89, 2015.
doi:10.1257/app.20130475.
Lloyd N. Trefethen and David Bau. Numerical Linear Algebra. SIAM, 1997.
doi:10.1137/1.9780898719574.
Appendix for online publication
A Additional technical information
In this section, we provide additional technical details on the approximation of influential sets and their influence. First, we focus on limitations of the approach used in Algorithm 0 and by Broderick et al. (2020). Next, we conduct a simulation exercise to compare the three considered algorithms. Finally, we investigate the sensitivity of 2SLS estimation in a simulation exercise.
A1 Aggregating individual influences
Algorithm 0 approximates the influence of a set of observations by aggregating individual, full-sample influences of its members. When assessing the influence on coefficients, e.g., this method suffers from a downward bias. With every removal, the leverage and hence the influence of observations increases, and differences in leverage are exacerbated. The other driver of influence, residuals, decreases on average, complicating the analysis. We illustrate the underestimation with a mean-only example.
Consider the model $y = \mathbf{1}\theta + \varepsilon$, where we are interested in the influence on the estimate of $\theta$, which we are (without loss of generality) trying to decrease ($\Delta(S) = \hat{\theta}(\emptyset) - \hat{\theta}(S)$). We will show that the influence of a set of two influential observations $y_1$ and $y_2$ (i.e., they satisfy $y_1, y_2 > \sum_i y_i / N$) is greater than the sum of influences of its members. We need to show $\delta_1 + \delta_2 < \Delta(\{1,2\})$, which is equivalent to
$$\bigl(\hat{\theta} - \hat{\theta}(1)\bigr) + \bigl(\hat{\theta} - \hat{\theta}(2)\bigr) < \hat{\theta} - \hat{\theta}(\{1,2\}),$$
$$\frac{\sum_i y_i}{N} - \frac{\sum_{i \neq 1} y_i + \sum_{i \neq 2} y_i}{N-1} + \frac{\sum_{i \neq 1,2} y_i}{N-2} < 0.$$
Since $\sum_{i \neq 1} y_i + \sum_{i \neq 2} y_i = \sum_i y_i + \sum_{i \neq 1,2} y_i$, we can redistribute the second term for
$$\left(\frac{1}{N} - \frac{1}{N-1}\right) \sum_i y_i + \left(\frac{1}{N-2} - \frac{1}{N-1}\right) \sum_{i \neq 1,2} y_i < 0.$$
By assumption, we know that $\sum_{i \neq 1,2} y_i + \frac{2}{N-2} \sum_{i \neq 1,2} y_i < \sum_i y_i$, which implies that
$$\left(\frac{1}{N} - \frac{1}{N-1}\right) \sum_i y_i + \left(\frac{1}{N-2} - \frac{1}{N-1}\right) \sum_{i \neq 1,2} y_i < \left(\frac{1}{N} - \frac{1}{N-1}\right) \left(\sum_{i \neq 1,2} y_i + \frac{2}{N-2} \sum_{i \neq 1,2} y_i\right) + \left(\frac{1}{N-2} - \frac{1}{N-1}\right) \sum_{i \neq 1,2} y_i = 0,$$
where the terms on the right-hand side cancel out, completing the proof.
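The underestimation is easy to verify numerically. The following minimal sketch (in Python; the sample values are an arbitrary illustration, not part of the original analysis) compares the sum of individual influences with the joint influence in a mean-only model:

import numpy as np

# Hypothetical mean-only sample; observations 1 and 2 both exceed the mean.
y = np.array([5.0, 4.0, 1.0, 0.5, 0.5, 1.0, 0.0, 0.5, 1.0, 0.5])

def mean_without(y, idx):
    """Mean-only estimate after removing the observations in idx."""
    mask = np.ones(len(y), dtype=bool)
    mask[list(idx)] = False
    return y[mask].mean()

theta = y.mean()
delta_1 = theta - mean_without(y, [0])       # individual influence of y_1
delta_2 = theta - mean_without(y, [1])       # individual influence of y_2
delta_12 = theta - mean_without(y, [0, 1])   # joint influence of {y_1, y_2}

# Aggregating the individual influences underestimates the joint influence.
assert delta_1 + delta_2 < delta_12
print(round(delta_1 + delta_2, 3), "<", round(delta_12, 3))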
A2 Assessing influence approximations

Approximations of the influence of sets of observations that ignore (or approximate) certain elements enable computation in intensive settings. However, such computational shortcuts can become problematic if the approximated elements are important determinants of the influence themselves. In our setting, a notable example is the Approximate Maximum Influence Perturbation (AMIP, by Broderick et al., 2020), which we investigate in more detail.7 We illustrate the limitations of this approximation using a simulated example, and then discuss its performance in the context of seven studies on microcredit, an application featured in their contribution. In Section A3, we conduct a simulation exercise.
Illustration
Consider a univariate regression with a slope parameter β = 1 and zero intercept, where the innovations and the observations of the covariate are drawn from a t(8) distribution. This setup allows for moderately high leverage and residuals, without inducing sensitivity to influential sets. We draw N = 1,000 observations and construct the dependent variable; the data are visualized in Figure A1. First, we illustrate the behavior of AMIP when approximating influential sets and their influence with regard to the coefficient β. Influence is given by
$$\delta_i^{\mathrm{AMIP}} = (x'x)^{-1} x_i' e_i \neq \mathrm{DFBETA}_i = \frac{(x'x)^{-1} x_i' e_i}{1 - h_i},$$
where it can be seen that the difference between $\delta_i^{\mathrm{AMIP}}$ and $\mathrm{DFBETA}_i$ is given by the suppression of the leverage component of the measure, $h_i$.
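The difference is straightforward to compute directly. A minimal sketch (Python; assuming a no-intercept design as in the simulation above, with variable names of our choosing):

import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.standard_t(8, n)                  # covariate, t(8)
y = 1.0 * x + rng.standard_t(8, n)        # beta = 1, t(8) innovations

xtx_inv = 1.0 / (x @ x)                   # (x'x)^{-1} for a single regressor
beta_hat = xtx_inv * (x @ y)              # least squares slope
e = y - beta_hat * x                      # residuals
h = xtx_inv * x**2                        # leverage h_i

delta_amip = xtx_inv * x * e              # AMIP influence (leverage suppressed)
dfbeta = delta_amip / (1 - h)             # exact leave-one-out change (DFBETA)

# The approximation error tends to be largest for high-leverage observations.
err = np.abs(dfbeta - delta_amip)
top = np.argsort(h)[-5:]
print(err[top].mean(), err.mean())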
Figure A1: Simulated data and the least squares regression line. The covariate (on the horizontal axis) and the innovations are drawn from a t(8) distribution.
In Figure A2, we visualize the errors of approximation (when compared to the true influence). In the left panel, we relate the AMIP error to the leverage of the dataset.

7 Our analysis is based on the results in Broderick et al. (2020) and their implementation (kindly made available at https://github.com/rgiordan/zaminfluence) as of commit 2d29fbb from 4/20/2022.
Since leverage and residuals are the only two factors driving influence in this simple case, the use of AMIP leads to a downward bias (to $(N-1)^{-1}$ percent) in the estimation of the influence. In the right panel, we relate the error to the true influence. As can be seen, the magnitude of the error is particularly pronounced for highly influential observations.
Figure A2: Errors from using AMIP to estimate influence on coefficients (compared to the exact value), plotted against the exact leverage (left) and the exact influence (right) of observations.
Next, we investigate the influence on the standard error of the coefficient estimate. To showcase their relationship, we regress the exact influence on standard errors onto its AMIP estimate. We find an $R^2$ of 0.86, and a coefficient value of 1.005 (t = 78.8), indicating good accuracy on average. However, the quality of the approximation deteriorates considerably with increasing influence. Figure A3 visualizes the relationship between the (standardized) regression residuals and the fitted values, as well as leverage.
Figure A3: Diagnostic plots for the regression of exact observation influence on standard errors onto their AMIP estimates. Residuals are plotted against fitted values (left) and the leverage of the influence (right).
Last, we consider the influence on the significance of the parameter estimate. Broderick et al. (2020) do not explicitly track the influence on t values, but rather compute the influence on significance for a given t value by accumulating
$$\delta_i^{\mathrm{AMIP}} = \beta_{(i)}^{\mathrm{AMIP}} + t \, \mathrm{SE}_{(i)}^{\mathrm{AMIP}}, \qquad \text{(A1)}$$
where $\beta_{(i)}^{\mathrm{AMIP}}$ and $\mathrm{SE}_{(i)}^{\mathrm{AMIP}}$ are the AMIP estimates of the influence of observation $i$ on β and its standard error. A minimal influential set is found by choosing a direction for the change (i.e., significantly positive or negative), and setting the target value that needs to be exceeded to one of β ± t × SE. We investigate this approximation below using an empirical application.
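To make the accumulation explicit, the sketch below (Python; the per-observation influence estimates and the numerical inputs are hypothetical placeholders, not values from the paper) ranks observations by Equation A1 and accumulates them until the required change is reached, here targeting a significant sign switch of a positive estimate:

import numpy as np

def minimal_set(delta, needed_change):
    """Accumulate the largest per-observation influences until the required
    change is reached; returns the indices of the selected observations
    (or None if the change cannot be attained)."""
    order = np.argsort(delta)[::-1]              # most influential first
    cumulative = np.cumsum(delta[order])
    hits = np.flatnonzero(cumulative >= needed_change)
    return order[: hits[0] + 1] if hits.size else None

# Hypothetical AMIP influence estimates on beta and its standard error
# (signed as the full-sample value minus the value after removal).
rng = np.random.default_rng(1)
beta_infl = rng.normal(0, 0.01, 500)
se_infl = rng.normal(0, 0.002, 500)
beta, se, t_crit = 0.05, 0.04, 1.96

delta = beta_infl + t_crit * se_infl             # Equation (A1)
# A significant sign switch of a positive estimate requires pushing the upper
# confidence bound below zero, i.e. a change of at least beta + t_crit * se.
subset = minimal_set(delta, beta + t_crit * se)
print(None if subset is None else subset.size)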
Microcredits
Many recent large-scale experimental studies evaluate the efficacy of microcredit as a tool for alleviating poverty and facilitating economic development. Seven of these studies are analyzed by Broderick et al. (2020), who assess the sensitivity of the average treatment effect to removing observations. These are randomized controlled trials in Bosnia and Herzegovina (Augsburg et al., 2015), Mongolia (Attanasio et al., 2015), Ethiopia (Tarozzi et al., 2015), Mexico (Angelucci et al., 2015), Morocco (Crépon et al., 2015), the Philippines (Karlan and Zinman, 2011), and India (Banerjee et al., 2015). The underlying model is a simple treatment effect model with a single (randomized) treatment dummy.
Table A1: Sensitivity of the average treatment effect of microcredits

                  BIH      MON      ETH      MEX      MOR      PHI      IND
Estimate β      37.53    -0.34     7.29    -4.55    17.54    66.56    16.72
t              (1.90)  (-1.53)   (0.92)  (-0.77)   (1.54)   (0.85)   (1.41)
Sign-switch
  A2               13       15        1        1       11        9        6
  A0               14       16        1        1       11        9        6
  B0               14       16        1        1       11        9        6
Significance
  A2               35       34       10        9       29       38       28
  A0               68       58      387       41       42       89       76
  B0               40       38       66       15       30       58       32
Observations    1,195      961    3,113   16,560    5,498    1,113    6,863

The reported values are the sizes of influential sets that are needed to induce a sign-switch of the average treatment effect, and have this sign-flip become significant (at the 5% level). We use Algorithms 2 (labelled ‘A2’) and 0, using exact influences (‘A0’) and AMIP estimates (‘B0’). For the sign-switch, we explicitly target the influence on β, and not on the t values.
In Table A1, we present the full-sample estimates and the sizes of minimal influential sets that induce a sign-switch, and a significant sign-switch. We compare three approaches, based on (1) Algorithm 2, and Algorithm 0, using (2) exact influences (‘A0’) and (3) AMIP estimates (‘B0’, reproducing Broderick et al., 2020). The results for sign-switches are similar. By contrast, the sizes needed for a significant sign-switch differ strongly. When particularly influential observations are present, only Algorithm 2 properly accounts for the effects of subsequent removals. Notably, the AMIP estimates for inducing significance (following Equation A1) are considerably lower than one would expect.8
8 This is peculiar, considering the downward biases of (1) accumulating influences, (2) the $\beta_{(i)}^{\mathrm{AMIP}}$ estimate, and (3) the estimates $\beta_{(i)}^{\mathrm{AMIP}}$ and $\mathrm{SE}_{(i)}^{\mathrm{AMIP}}$ for influential observations.
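The distinction between the two exact approaches can be sketched as follows (Python; a stylized rendering of the descriptions above, not the authors' implementation): ‘A0’ ranks observations once by their full-sample influence on the coefficient and removes the top k, while ‘A2’ removes one observation at a time and recomputes all influences after every refit.

import numpy as np

def ols_coef(X, y, j=0):
    """OLS estimate of the coefficient of interest (column j of X)."""
    return np.linalg.lstsq(X, y, rcond=None)[0][j]

def influences(X, y, j=0):
    """Exact leave-one-out influences on coefficient j, signed so that larger
    values correspond to removals that lower the coefficient."""
    full = ols_coef(X, y, j)
    return np.array([full - ols_coef(np.delete(X, i, 0), np.delete(y, i), j)
                     for i in range(len(y))])

def remove_a0(X, y, k, j=0):
    """'A0'-style: rank once by full-sample influence and drop the top k."""
    drop = np.argsort(influences(X, y, j))[::-1][:k]
    return ols_coef(np.delete(X, drop, 0), np.delete(y, drop), j)

def remove_a2(X, y, k, j=0):
    """'A2'-style: iteratively drop the single most influential observation,
    refitting and recomputing all influences after every removal."""
    for _ in range(k):
        drop = np.argmax(influences(X, y, j))
        X, y = np.delete(X, drop, 0), np.delete(y, drop)
    return ols_coef(X, y, j)

# Hypothetical treatment-effect setting with heavy-tailed outcomes.
rng = np.random.default_rng(1)
n = 500
d = rng.integers(0, 2, n)                      # randomized treatment dummy
X = np.column_stack([d, np.ones(n)])           # treatment effect model
y = 0.2 * d + rng.standard_t(3, n)             # outcome
# Compare the two estimates after removing five observations each way.
print(remove_a0(X, y, 5), remove_a2(X, y, 5))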
Figure A4: Household income in thousands (vertical axis) and the treatment dummy (horizontal axis) for the studies on Ethiopia (‘ETH’, left panel) and Mexico (‘MEX’, right panel). The regression line is indicated in gray.
Thanks to randomization, the regression model underlying these results is remarkably simple, and Cook’s distance (or other standard checks) already indicates sensitivity issues. The cases of Ethiopia and Mexico are particularly striking (cf. Figure A4). Since leverage plays a limited role in this regression setting, it appears to be an ideal case for using methods based on an initial approximation (Algorithm 1 would be a preferred candidate for its improved precision).
A3 Approximations in the presence of influential sets

Approaches for identifying influential sets that are based on approximations may suffer precisely in the presence of influential sets. To showcase this issue, we consider the univariate regression from Section A2 (with β = 1, and innovations and covariates drawn from a t(8) distribution) with N = 100. We contaminate this setup with two influential sets. For the first set, $S_{0.02}$, we use a coefficient value of three and increase covariate values by seven. For the second one, $T_{0.03}$, the coefficient value is changed to two and covariate values are increased by five. One realization of this process is visualized in Figure A5.
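As a rough illustration of this contamination scheme (our own sketch; the exact data-generating details in the simulation may differ), a single draw could be constructed as follows:

import numpy as np

rng = np.random.default_rng(42)
n, beta = 100, 1.0

x = rng.standard_t(8, n)                     # covariate, t(8)
y = beta * x + rng.standard_t(8, n)          # clean outcome, t(8) innovations

s_idx = np.arange(0, 2)                      # S_0.02: 2% of observations
t_idx = np.arange(2, 5)                      # T_0.03: 3% of observations

# First influential set: coefficient of three, covariates shifted up by seven.
x[s_idx] += 7
y[s_idx] = 3.0 * x[s_idx] + rng.standard_t(8, s_idx.size)
# Second influential set: coefficient of two, covariates shifted up by five.
x[t_idx] += 5
y[t_idx] = 2.0 * x[t_idx] + rng.standard_t(8, t_idx.size)

print((x @ y) / (x @ x))                     # contaminated slope, well above one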
Figure A5: Simulated data and OLS regression lines for the full sample (in gray) and the uncontaminated sample (dashed, in black). Members of the influential sets are highlighted with color and indicated by a cross (×, for $S_{0.02}$) and a crosshair (+, for $T_{0.03}$).
We draw 1,000 realizations of the process described above, and use four approaches
to assess maximally influential sets (that lower coefficient values) of sizes one to ten.9 These are Algorithms 2 (labelled ‘A2’) and 1 (‘A1’), as well as Algorithm 0, with exact influences (‘A0’) and the AMIP estimate (‘B0’). The results are presented in Figure A6. Both variants of Algorithm 0 perform relatively poorly, and do not recognize the influence of the first influential set. Algorithm 1 performs better, but does not reliably account for the influence of the second set. Estimates are considerably spread out, and the target value (after five removals) of unity in the slope coefficient is missed on average. The impact of removals levels off afterwards. By contrast, Algorithm 2 reliably finds both sets, and continues to identify impactful observations after their removal.
Figure A6: Estimates (vertical axis) against the number of removals (horizontal axis), with one panel per approach. Transparent lines indicate individual runs, thick lines the average results of (from top to bottom) approach ‘B0’ (gray, dashed), ‘A0’ (green, solid), ‘A1’ (purple, dashed), and ‘A2’ (teal, solid).
A4 Method sensitivities for OLS and 2SLS
Sensitivity to inuential sets depends on the measure of inuence and the estimator used.
We illustrate the role of the particular estimator by comparing inuential sets for OLS and
2SLS coecient estimates. Again, we use the univariate regression from Section A2 (with
β= 1, and innovations and covariates drawn from a t(8) distribution) with N= 100, and
add a repeated layer of this setup to construct the instrumented variable. This gives us
the following setup for 2SLS estimation
xz+u,with ut(8) and zt(8),
yx+e,with et(8).
We use Algorithm 2 to nd and assess the inuence of maximally inuential sets from
size one to 95 and replicate the exercise 1,000 times.
9 The average estimates are $\beta_{LS} = 1.91$, $\hat{\beta}_{(S)} = 1.38$, and $\hat{\beta}_{(\{S,T\})} = 1.00$. Interestingly, robust M-estimation slightly undershoots the influence of the first set, at $\beta_{M} = 1.43$.
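A compact version of this exercise can be sketched as follows (Python; ours, using plain refitting instead of the updating formulae of Phillips (1977), and greedily removing the observation whose exact removal lowers the 2SLS estimate the most):

import numpy as np

def tsls(z, x, y):
    """Just-identified 2SLS slope for a univariate model without intercept."""
    return (z @ y) / (z @ x)

rng = np.random.default_rng(7)
n = 100
z = rng.standard_t(8, n)                      # instrument
x = z + rng.standard_t(8, n)                  # instrumented covariate
y = x + rng.standard_t(8, n)                  # outcome, beta = 1

keep, estimates = np.arange(n), []
for _ in range(95):
    beta = tsls(z[keep], x[keep], y[keep])
    estimates.append(beta)
    # Exact leave-one-out 2SLS estimates; remove the observation whose
    # deletion lowers the estimate the most.
    loo = np.array([tsls(np.delete(z[keep], i), np.delete(x[keep], i),
                         np.delete(y[keep], i)) for i in range(keep.size)])
    keep = np.delete(keep, np.argmin(loo))

print(estimates[0], estimates[-1])            # estimates destabilize as N shrinks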
The results of our simulation exercise are visualized in Figure A7. On the vertical axes, we depict the OLS (top) and 2SLS (bottom) estimates after removing influential sets of increasing size (horizontal axis). OLS estimates remain steady until about twenty observations are left in the sample. After this point, we observe a limited divergence of estimates. The mean and median of the estimates stay close to each other throughout, indicating no serious numerical problems. This is not the case for the 2SLS estimates, where estimates start to diverge after about twenty removals. At 24 removals, we have the first drop-out due to a lack of numerical stability, which we indicate with a green cross on top. Before this, we can see a short spike in the mean estimate, which then drifts off again. Overall, about three quarters of the simulated runs for 2SLS estimation terminate prematurely due to a lack of numerical stability. At around 50 removals, more than half of the remaining 2SLS estimates are pathological (see median estimate), with vast fluctuations afterwards.

These results indicate that even with heavy-tailed errors and leverage, the OLS estimates of an otherwise correctly specified linear regression model are not particularly sensitive to influential sets. However, the 2SLS estimates appear to be susceptible to influential sets and numerical issues, even with this simple setup.
Figure A7: OLS (top panel) and 2SLS (bottom panel) estimates on the vertical axes, plotted against the number of observations left (horizontal axes). Transparent lines indicate (250 samples of) individual runs, thick lines indicate the median (solid, blue), the 95% and 5% quantiles (dashed), and the average (dotted, pink) of the estimate. Crosses at the top of the 2SLS panel indicate drop-outs due to pathological numerical stability (within machine precision).
B Additional results and applications
In this section, we provide supporting material for the applications discussed in the paper.
B1 The origins of mistrust
Table A2: Distribution of the top 600 most influential observations on trust in relatives

               Benin   Nigeria   Ghana   Other
Top 100           54        12      19      15
Top 101–200       85         6       5       4
Top 201–300       80        12       7       1
Top 301–400       52        27      14       7
Top 401–500        3        67       6      24
Top 501–600        3        69      15      13
Top 600          277       193      66      64

Summary of the origin countries of the 600 most influential observations affecting the effects of slave trades on the trust of relatives (cf. Column 1 of Table 2).
B2 The eects of the Tsetse y
Table A3: The robust effects of the Tsetse fly

              Animals  Intensive   Plow    Female   Density  Slavery  Centralized
M-estimate     -0.259    -0.102    0.227   -0.761   -0.080
              (-5.53)   (-3.29)   (3.71)  (-3.00)  (-2.12)
S-estimate     -0.761
              (-4.63)
Observations     484       485      484      315      398      446      467

Robust M- and S-estimates for the effects of the Tsetse fly (cf. Table 3). Coefficient estimates are reported with t values based on clustered (for M-estimation) and classical (for S-estimation) standard errors in parentheses.
B3 Poverty convergence and influential sets
Ravallion (2012) examines the rates of convergence in a sample of 89 countries using the following linear regression model
$$T_i^{-1}\left(\ln H_{it} - \ln H_{i,t-1}\right) = \alpha + \beta \ln H_{i,t-1} + \varepsilon_{it}, \qquad \text{(A2)}$$
where $H_{it}$ denotes the poverty headcount ratio in country $i$ and time period $t$, and $T_i$ is the length of its observation period in years. In his study, Ravallion (2012) does not detect poverty convergence, instead obtaining a positive, statistically insignificant estimate of β, which we reproduce in Column 1 of Table A4.
Table A4: Sensitivity of poverty convergence

                                 Baseline   Eastern Europe   Alternative
Convergence                        0.006          -0.020        -0.019
                                  (0.84)         (-2.88)       (-9.68)
Convergence, Eastern Europe                        0.044
                                                  (2.12)
Thresholds                       –[1]{4}       3[10]{24}    26[32]{42}
Observations                          89              89           124
R²                                 0.375           0.375         0.607

The row labelled ‘Thresholds’ reports the sizes of influential sets that induce a loss of significance (at the 5% level), [a sign flip], and {a significant sign flip} of the coefficient that captures the poverty convergence effect (using Algorithm 2). The column labelled ‘Baseline’ reproduces the results of Ravallion (2012), the one labelled ‘Eastern Europe’ adds an interaction term for Eastern European countries. The column labelled ‘Alternative’ uses a different specification (proposed by Crespo Cuaresma et al., 2016) and an updated dataset. Coefficients are reported with t values in parentheses.
This nding may be due to idiosyncratic experiences in the former Eastern bloc. As
Crespo Cuaresma et al. (2022,2016) point out, these countries exhibit low initial poverty
headcount ratios, implying that small absolute changes translate into large growth rates.
This is due to the log-transformation applied, and makes observations of these countries
particularly inuential on OLS estimates. Crespo Cuaresma et al. (2016) consider two
alternative specications that yield empirical evidence of poverty convergence in the orig-
inal dataset. These are (1) an extension of Equation A2 that controls for the experience
of Eastern European countries, and (2) a model that is based on a semi-elastic relation-
ship between poverty reduction and growth (see Klasen and Misselhorn,2008), eectively
dropping the log-transformation from Equation A2. See Table A4, Columns 2 and 3 for
estimates.
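For concreteness, the two alternatives could be written roughly as follows (our rendering of the verbal description above; the exact specifications in Crespo Cuaresma et al. (2016) may include additional terms):
$$T_i^{-1}\left(\ln H_{it} - \ln H_{i,t-1}\right) = \alpha + \beta \ln H_{i,t-1} + \gamma\, \mathrm{EE}_i \ln H_{i,t-1} + \varepsilon_{it},$$
$$T_i^{-1}\left(H_{it} - H_{i,t-1}\right) = \alpha + \beta H_{i,t-1} + \varepsilon_{it},$$
where $\mathrm{EE}_i$ is an indicator for Eastern European countries.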
Figure A8: Data and regression line for Ravallion (2012) before (solid line) and after (dashed line) removing the influential set $\hat{S}_4$ (colored and marked with crosses, then crosshairs); its members are Belarus, Latvia, Ukraine, and Poland. The vertical axis holds annualized log differences of poverty headcount ratios; the horizontal axis the logarithm of the initial poverty headcount ratio.

We revisit this issue, and assess the sensitivity of poverty convergence to influential sets, starting with the model in Equation A2. Figure A8 presents the data and regression
line of Ravallion (2012), as well as the minimal influential set needed to attain significant poverty convergence (see ‘Thresholds’ in Table A4), as identified by Algorithm 2. The four members of this set are Belarus, Latvia, Ukraine, and Poland.10 In this setting, a sensitivity check with respect to influential sets can clearly compensate for (or augment) domain knowledge. We investigate the two additional specifications next. When accounting for the experience of Eastern Europe, we find a significance threshold of three observations. For the alternative specification, we source an updated dataset from PovCalNet,11 and find significant poverty convergence. Algorithm 2 indicates thresholds at 26 (insignificance, see Figure A9), 32 (sign-flip), and 42 (significant sign-flip) out of 124 observations. Algorithm 1 only indicates a loss of significance after 56 removals (cf. Figure A9).
B4 Migration, instrumental variables and influential sets
Migration reshapes global populations and, as a result, economic and cultural structures.
In a recent study, Droller (2018) investigates the long-term impacts of European migration
to Argentina in the late 19th and early 20th century. He uses a shift-share instrument to
identify the eect of the share of European migration on GDP per capita in 2000, for a
sample of 136 counties in the provinces of Buenos Aires, Santa Fe, Córdoba, and Entre
Rios. The results of Droller (2018) show considerable impacts of European migration on
GDP per capita, education, and skilled labor.
We reproduce these results in Table A5, and compute influential sets for the first-stage OLS, and the full 2SLS estimates (using the updating formulae of Phillips, 1977). The first-stage results are comparatively insensitive to influential sets. For the 2SLS estimates (of

10 Subsequent removals would be the Russian Federation, Lithuania, Estonia, and