Content uploaded by Nikolas Kuschnig

Author content

All content in this area was uploaded by Nikolas Kuschnig on Oct 13, 2022

Content may be subject to copyright.


Hidden in Plain Sight:

Influential Sets in Linear Regression

Nikolas Kuschnig1,∗

Gregor Zens2

Jesús Crespo Cuaresma1,3,4,5,6

1Vienna University of Economics and Business

2Bocconi University

3International Institute for Applied Systems Analysis

4Wittgenstein Centre for Demography and Global Human Capital

5Austrian Institute of Economic Research

6CESifo

Abstract

The sensitivity of econometric results is central to their credibility. In this paper, we investigate the sensitivity of regression-based inference to influential sets of observations and show how to reliably identify and interpret them. We explore three algorithmic approaches to analyze influential sets, and assess the sensitivity of a number of earlier studies in the field of development economics to them. Many results hinge on small influential sets, and inspecting them can provide crucial insights. The analysis of influential sets may reveal omitted variable bias, unobserved heterogeneity, or a lack of external validity, and informs about technical limitations of the methodological approach used.

Keywords: sensitivity, robustness, regression diagnostics, masking, influence

∗Correspondence to Nikolas Kuschnig, at the Vienna University of Economics and Business,

Welthandelsplatz 1, 1020 Vienna (nikolas.kuschnig@wu.ac.at); Gregor Zens (gregor.zens@unibocconi.it);

Jesús Crespo Cuaresma (jesus.crespo.cuaresma@wu.ac.at). The authors gratefully acknowledge helpful

comments from Ryan Giordano, Daniel Peña, Anthony Atkinson, Lukas Vashold, Maximilian Kasy, as

well as Bernhard Kasberger, Jörg Stoye, Isaiah Andrews, Whitney Newey, and other participants of the

2022 Congress of the European Economic Association and the 2021 Congress of the Austrian Economic

Association.


1 Introduction

Econometric methods are an important instrument of scientific discovery, and are vital for the design of evidence-based policy. By approximating real-world phenomena, they provide us with empirical insights, allow us to test theories, and facilitate prediction. The sensitivity of these methods to modeling assumptions is a long-standing and active subject of research. The econometric literature tends to focus on sensitivity along the horizontal dimension of the data, which is related to the functional form of model specifications. Examples go back to extreme bounds analysis (Leamer, 1983, 1985), and include model averaging (Steel, 2020), elaborate research designs (Angrist and Pischke, 2010), and randomization (Athey and Imbens, 2017). The sensitivity of inference to certain sets of observations, i.e., the vertical dimension of the data, however, has received less attention in the modern econometric literature.

There is a long history of studies on the role of influential observations and outliers in the statistical literature (e.g. Huber and Ronchetti, 2009). It is well known that single influential observations may hold considerable sway over regression results (Cook, 1979), and there are many approaches to identify and account for sensitivity to these observations (see Chatterjee and Hadi, 1986). However, this is not the case for influential sets of observations, which are not as well understood theoretically and empirically. Exact analysis of influential sets quickly becomes intractable, and earlier papers tend to approximate influential sets using individual, full-sample influences (e.g. Broderick et al., 2020; Peña and Yohai, 1995). Such approximations suffer from masking, a phenomenon where certain observations obscure the influence of others. For these reasons, the literature has mostly sidestepped influential sets, focusing instead on robust estimation and resampling methods (see e.g. Stigler, 2010).

In this paper, we assess the sensitivity of regression-based findings to influential sets. We focus on minimal influential sets, which are sets of observations that nullify a result of interest when they are removed. For this, we explore three algorithms that are computationally tractable, straightforward to implement and use, and balance accuracy and precision against complexity. We revisit the empirical results of several earlier studies in the field of development economics, and analyze their sensitivity to influential sets. Established empirical results, including the effect of the slave trades on income per capita via rugged geography (Nunn and Puga, 2012), and the development impact of the Tsetse fly (Alsan, 2015), hinge on few observations. We show that analyzing these influential sets can provide crucial insights, revealing (inter alia) omitted variable bias, heterogeneous effects, and problems related to data limitations.

The remainder of this paper is structured as follows. In Section 2, we establish the theoretical framework and connect it to the relevant econometric and statistical literature. We present and illustrate the algorithmic approaches to identify and assess influential sets in Section 3. In Section 4, we investigate the sensitivity of empirical results in three studies on long-term development in Africa. We discuss the interpretation of sensitivity to influential sets, and illustrate with additional applied examples in Section 5. In Section 6, we conclude. All codes and data used for this paper are available online.1

2 Influential sets in linear regression models

In this section, we present the main concepts needed to assess the effects of influential sets of observations in our chosen framework. We also relate the issue of influential sets to the literature on influential observations and outliers, robust estimation methods, and other relevant approaches.

2.1 Concepts and definitions

Consider the linear regression model

    y = Xβ + ε,    (1)

where y is an N × 1 vector containing observations of the dependent variable, X is an N × P matrix with P explanatory variables, β is a P × 1 vector of coefficients to be estimated, and ε is an N × 1 vector of independent error terms with zero mean and unknown variance σ². We denote the ith observation, i.e. row, of y and X as y_i and x_i. The deletion of an observation j is indicated with a subscript in parentheses, that is, y_(j) is the dependent vector without observation j. A set of observations, S, is defined as a non-empty subset of the set of all observations, i.e., S ⊂ S̃ = {s | s ∈ ℤ ∩ [1, N]}. We use the shorthand N_α = ⌈Nα⌉ to denote a fraction α ∈ [0, 1] of the data, and indicate the cardinality of a set using a subscript, i.e., |S_α| = N_α. The empty set is denoted by ∅, and the set of all sets of cardinality N_α is referred to as [S]_α.

Our interest lies in the sensitivity of λ, some quantity of interest, to removing influential sets of observations from the sample. We define influential sets as sets whose omission has a large impact on λ, when compared to the omission of most other sets of equal size (following Belsley et al., 1980), and measure the impacts of removing such sets with some (generalized) influence function ∆ (similar to Hampel et al., 2005). Consider, for example, the sensitivity of the full-sample, ordinary least squares (OLS) estimate of β

1 The repository at https://github.com/nk027/influential_sets includes all scripts, data, and an R package that implements the different approaches described in the paper.


to dropping a set of observations S. In this case, we are interested in a function ∆ that compares λ(∅) and λ(S), which are given by the following function of the removed set

    λ(S) = (X_(S)′ X_(S))⁻¹ X_(S)′ y_(S).    (2)

To assess the sensitivity of λ, we consider the minimal influential set, defined as the smallest set whose omission achieves a target impact on λ. We formalize this set by first defining the maximally influential set, S*_α, which achieves the maximal influence for a given size of the omitted set, as

    S*_α = arg max_{S ∈ [S]_α} ∆(S, T, λ),    (3)

where the influence function, ∆, measures the impact on λ when removing a set S compared to a set T. To ease notation, we will suppress the dependence on λ and the default value T = ∅. One example for ∆ measures the deviation of β_j from the full-sample OLS estimate, ∆(S) = λ(∅)_j − λ(S)_j, with λ(S) defined in Equation 2.

We can then define the minimal influential set, S**, as follows

    S** = S*_{α*},  where α* = arg min_α  s.t.  ∆(S*_α) ≥ ∆*,    (4)

where ∆* is a target value of choice. A relevant example is the minimal influential set that achieves a sign switch of coefficient β_j. This can be achieved by setting ∆* = 0 and letting ∆(S) = −sign(λ(∅)_j) × λ(S)_j. After obtaining a minimal influential set, we are interested in its size, both in absolute terms and relative to the full sample size, and the characteristics of its members.
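To make these definitions concrete, ∆(S) for the coefficient-deviation example can be computed by brute force, refitting OLS without the set (a minimal NumPy sketch; the toy data and all names are ours, not from the paper):

```python
import numpy as np

def ols(X, y):
    """Full-sample OLS estimate, (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def influence(X, y, S, j=0):
    """Delta(S): change in coefficient j when the rows in S are dropped."""
    keep = np.setdiff1d(np.arange(len(y)), S)
    return ols(X, y)[j] - ols(X[keep], y[keep])[j]

# toy data: one high-leverage point pulls the slope upwards
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(scale=0.1, size=50)
x[0], y[0] = 5.0, 10.0  # plant an influential observation
X = np.column_stack([np.ones(50), x])
d = influence(X, y, S=[0], j=1)
print(d)  # dropping the planted point changes the slope substantially
```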

Conclusively identifying a minimal influential set is computationally prohibitive, since we would need to evaluate ∆ for all (N choose N_α) potential sets. Instead, we rely on approximations for all but the most trivial settings. To assess the quality of these approximations, we introduce some concepts related to potential sources of error. First, members of a maximally influential set S*_α may not be identified in the estimated set Ŝ*_α. This phenomenon is referred to as masking, since particularly influential observations are masked by others that appear to be more influential. Masking is a problem related to the accuracy of influential set identification, and its severity can be quantified by the number (or percentage) of masked observations, and the difference in influence between the true and estimated maximally influential set. Second, the estimated influence of a given set may not equal its true influence. This occurs when an approximation of ∆ or λ is used, and can be considered a precision problem of the approximation.

The accuracy and precision of an approximation are tightly related to the existence of sets that are jointly influential. Consider the partial influence of an observation i, given by the shorthand δ_i = δ_{i|∅} = ∆({i}, ∅), where we will assume scalar influence values for simplicity. A jointly influential set, U, is one that satisfies δ_{i|J} ≫ δ_{i|∅} for all i ∈ U and J ⊂ U of some minimal size. In words, the influence of any member i increases substantially after the removal of other members. A corollary of this definition is that the influence of a jointly influential set exceeds the sum of the full-sample influences of its individual members, i.e., ∆(S) > Σ_{i ∈ S} δ_{i|∅}. Joint influence is a major challenge for approximations. Since they rely on assessing a limited number of potential sets, jointly influential sets may remain hidden and their influence unaccounted for.

2.2 Influential observations and sets in the literature

The statistics literature is rich in methods for identifying and dealing with single influential observations (see e.g. Huber and Ronchetti, 2009; Hampel et al., 2005; Maronna et al., 2019). Measuring influence is central to this pursuit, and there is a wide variety of interrelated statistics serving this purpose (see Chatterjee and Hadi, 1986, for a review). The residuals (e = y − Xβ̂) and the leverage (the diagonal elements of the 'hat matrix', H = X(X′X)⁻¹X′) are pivotal elements for most measures proposed in the literature. A notable example, which directly measures the influence of an observation i on the OLS estimate of β, can be expressed as

    δ_i = λ(∅) − λ({i}) = (X′X)⁻¹ x_i′ e_i / (1 − h_i),    (5)

where e_i and h_i are the residual and leverage of observation i. This well-known result, termed DFBETA_i by Belsley et al. (1980), facilitates the quick evaluation of the individual influences of all observations. Similar results are available for estimates of σ², coefficient standard errors (Belsley et al., 1980), and two-stage least squares (2SLS) estimates (Phillips, 1977). Such convenient forms, together with efficient updating formulae (to evaluate, e.g., (X_(i)′ X_(i))⁻¹), facilitate computation, but evaluating a minimal influential set remains infeasible for all but the simplest settings.2
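Equation 5 can be checked against a brute-force refit (a NumPy sketch of this standard result; data and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 40, 3
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=N)

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta                                 # residuals
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # leverages (hat-matrix diagonal)

# DFBETA_i via Equation 5, without refitting the model
i = 7
dfbeta = np.linalg.solve(X.T @ X, X[i]) * e[i] / (1 - h[i])

# brute force: refit without observation i
keep = np.delete(np.arange(N), i)
beta_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
assert np.allclose(dfbeta, beta - beta_i)
```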

The impacts of single influential observations in regression models are well understood, but generally remain limited. As a result, most existing approaches to assess and account for sensitivities to observations are of a holistic nature, trying to go beyond single observations. These include Cook's distance (Cook, 1979), the Welsch-Kuh distance (Belsley et al., 1980), and a number of Bayesian approaches (e.g. Box and Tiao, 1968; Kass et al., 1989; Pettit and Young, 1990). The most prominent approaches are based on robust estimation methods, such as M- and S-estimators, which are resistant to a number

2 Consider a total number of observations of N = 1,000 and all potential sets of size N_α = 10. Assume that every calculation of λ needs one microsecond, very roughly the time needed to compute the cross-product of a four-by-four matrix. Enumeration would require about 8.35 billion years, or 1.8 times the age of the Earth, which is safely out of scope for non-tenured researchers.
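The footnote's arithmetic is easy to verify (a quick back-of-the-envelope check in Python):

```python
import math

n_sets = math.comb(1_000, 10)           # number of candidate sets of size 10
seconds = n_sets * 1e-6                 # one microsecond per evaluation
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.2e} years")             # on the order of 8 billion years
```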


of arbitrarily influential observations (see e.g. Hampel et al., 2005; Huber and Ronchetti, 2009; Maronna et al., 2019). However, these methods are rarely used in applied work. Stigler (2010) notes two important drawbacks related to the potential loss of information and the elusive notion of 'robustness' to arbitrary contamination.

Many methods for assessing other types of sensitivities account for the influence of certain sets of observations directly or indirectly. Resampling methods, such as the Jackknife (Efron and Stein, 1981) or the Bootstrap (Efron and Tibshirani, 1994), rely on samples of the data to produce estimates.3 Model averaging methods are mostly concerned with variable selection, but have also been used to assess data sensitivity (see Steel, 2020, for a review). Specifications that are saturated with dummy variables for all observations are conceptually similar to the Jackknife. Other notable methods include winsorizing or trimming the data, where observations with extreme values are replaced or removed, but also methods for outlier detection. Outliers are generally not well-defined (i.e. not influential with respect to an explicit quantity), and unsupervised clustering methods are commonly used (see e.g. Hautamaki et al., 2004; Kaufman and Rousseeuw, 2009; Shotwell and Slate, 2011). These methods show interesting parallels to the analysis of influential sets, but differ in their goals and in how they address computational concerns to reach them.

An important strand of the literature focuses on detecting and accounting for influential sets of observations. These studies often use methods that are based on individual, full-sample influences; this includes many statistics proposed by Belsley et al. (1980), the derived influence matrix of Peña and Yohai (1995), and the approach of Broderick et al. (2020). There are few adaptive procedures, with notable exceptions in the works of Atkinson et al. (2010) and Riani et al. (2014). In (applied) econometrics, sensitivity checks with respect to samples of the available data remain exceedingly rare, and are usually limited to single observations or ad-hoc procedures. This highlights the need for a better understanding of the role of influential sets, how to conduct sensitivity checks, and how to interpret their results.

3 Algorithms to assess influential sets

In this section, we formalize three simple algorithms for approximating minimal influential sets and their influence. We start with an algorithm that is extremely cheap to compute, but sacrifices accuracy and precision, and proceed with two algorithms that yield improved precision and accuracy at slightly increased computational cost. Then, we discuss computational concerns, and illustrate using a simple example.

3 The minimal influential set can be understood as the worst-case scenario of a delete-N_α Jackknife.


3.1 Algorithm 0: Initial approximation

The first algorithm (Algorithm 0) builds on the full-sample influence of single observations, and generalizes the approach of Broderick et al. (2020).4 Maximally influential sets are proposed based on the order of individual, full-sample influences, and their influence is approximated by accumulating these individual influences.

Algorithm 0: Initial approximation.

    set the function ∆, the target ∆*, and the maximum size U; let S ← ∅
    compute δ_i = ∆({i}) for all i ∈ S̃
    while ∆(S) < ∆* do
        let S ← S ∪ arg max_j δ_j, for j ∈ S̃ − S
        let ∆(S) ← Σ_{k ∈ S} δ_k
        if |S| ≥ U then return unsuccessful
    end
    return S, ∆(S)

The algorithm works as follows. First, we set an influence function, a target value, and a maximum size for the minimal influential set. Next, we compute the initial influence δ_i for each observation i. As discussed above, this computation is relatively cheap in many interesting cases; otherwise, approximations (as in Broderick et al., 2020) could be used. The first iterated step is the proposal of a maximally influential set, which is based on the union of the observations with the largest individual influences, δ_i. The influence of this proposed set is then estimated by summing the individual influences of the observations in the set. These two steps are repeated until the specified target or the maximum size is reached.

The method embodied in Algorithm 0 can yield striking results, as demonstrated by the findings of Broderick et al. (2020). However, the low constant computational complexity comes at the price of accuracy and precision. First, proposing sets based on full-sample influences makes the algorithm prone to masking. The estimated influences are not updated, so influential observations may remain masked behind the influence of already removed observations. Second, approximating a set's influence by summing individual influences suffers from a downward bias that increases with the influence of observations, and is particularly large for influential observations (see Sections A1 and A3 in the Appendix for more information). This means that the algorithm performs badly in the presence of influential observations, let alone influential sets. As a result, sensitivity checks based on Algorithm 0 are prone to convey a false sense of robustness.
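A compact sketch of this procedure (our own Python rendering of Algorithm 0, assuming the individual influences δ_i have already been computed; the toy values are ours):

```python
import numpy as np

def algorithm0(delta_i, target, max_size):
    """Algorithm 0: rank full-sample influences and accumulate them.

    delta_i holds the individual (scalar) influences, computed once;
    a set's influence is *approximated* by the sum of its members'.
    Returns (set, approximate influence), or None if unsuccessful.
    """
    order = np.argsort(delta_i)[::-1]      # most influential first
    running = np.cumsum(delta_i[order])    # approximate set influences
    for size in range(1, max_size + 1):
        if running[size - 1] >= target:
            return order[:size], running[size - 1]
    return None

# toy influences: observations 1 and 3 dominate
deltas = np.array([0.05, 0.4, 0.01, 0.3, 0.02])
S0, infl0 = algorithm0(deltas, target=0.6, max_size=4)
print(S0, infl0)  # → observations 1 and 3, approximate influence ≈ 0.7
```

Note that `running` understates the true joint influence of the proposed set, which is exactly the downward bias discussed above.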

4 See Section A2 in the Appendix for a discussion of the approach used in Broderick et al. (2020).


3.2 Algorithm 1: Initial binary search

The second algorithm (Algorithm 1) rectifies the critical precision issues of Algorithm 0, while retaining high computational efficiency. As in Algorithm 0, maximally influential sets are proposed based on the ordering of the individual, full-sample influences. The influence of these proposed sets, however, is calculated exactly. To guarantee low computational cost, the algorithm follows a binary search pattern when proposing sets.

Algorithm 1: Initial search.

    set the function ∆, the target ∆*, and the maximum size U; let L ← 0
    compute δ_i = ∆({i}) for all i ∈ S̃
    while L ≤ U do
        let M ← ⌊(L + U)/2⌋
        let S be the union of the indices of the M largest δ_i
        compute ∆(S)
        if ∆(S) < ∆* then let L ← M + 1
        else let U ← M − 1
    end
    if ∆(S) ≥ ∆* then return S, ∆(S)
    else return unsuccessful

Algorithm 1 sets the size of proposed sets, M, by iteratively halving the search interval [L, U], instead of sequentially increasing it. In each step, the influence of the proposed set is computed exactly, and the bounds of the search interval are updated depending on the influence of the proposed set. If the target is reached, the upper bound is decreased to M − 1; otherwise, the lower bound is increased to M + 1. If an approximate minimal influential set exists in the search interval, it is found after O(log U) steps. As a result, Algorithm 1 adds negligible computational overhead over Algorithm 0, making this divide-and-conquer approach practical for large-scale problems.

Algorithm 1 yields precise influence estimates for a given set, which are not directly affected by masking. However, its accuracy is not guaranteed. Since maximally influential sets are still based on the individual, full-sample influences, masking remains an issue. Influential observations are likely to remain hidden behind already removed observations, especially when they are part of a jointly influential set. The algorithm also relies on the (previously implicit) assumption that the influence of estimated maximally influential sets increases steadily with the size of the set. The next algorithm abstracts from this assumption and is designed to explicitly address masking.
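The binary search can be sketched as follows (our own Python rendering; proposals are ranked by fixed full-sample influences, but each proposal is evaluated with an exact refit; the demo data and names are ours):

```python
import numpy as np

def exact_influence(X, y, S, j):
    """Delta(S): exact change in coefficient j from refitting without rows S."""
    full = np.linalg.solve(X.T @ X, X.T @ y)[j]
    keep = np.setdiff1d(np.arange(len(y)), S)
    part = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])[j]
    return full - part

def algorithm1(X, y, delta_i, target, max_size, j=0):
    """Algorithm 1: binary search over the size of the proposed set."""
    order = np.argsort(delta_i)[::-1]
    lo, hi, best = 1, max_size, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if exact_influence(X, y, order[:mid], j) >= target:
            best, hi = order[:mid], mid - 1   # target met: try a smaller set
        else:
            lo = mid + 1                      # target missed: need a larger set
    return best

# demo: one planted outlier is identified as the (approximate) minimal set
x = np.append(np.linspace(-2, 2, 41), 10.0)
y = np.append(0.5 * np.linspace(-2, 2, 41), 20.0)
X = np.column_stack([np.ones(42), x])
deltas = np.array([exact_influence(X, y, [i], 1) for i in range(42)])
best = algorithm1(X, y, deltas, target=0.5, max_size=5, j=1)
print(best)  # → [41]
```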


3.3 Algorithm 2: Adaptive approximation

The third algorithm (Algorithm 2) uses a simple adaptive procedure to identify the minimal influential set. Maximally influential sets are constructed iteratively, using updated influence estimates. This facilitates the discovery of masked observations, which improves accuracy. The adaptive nature of this algorithm allows for good accuracy and precision, while retaining computational tractability.

Algorithm 2: Adaptive approximation.

    choose the function ∆, the target ∆*, and the maximum size U; let S ← ∅
    while ∆(S) < ∆* do
        compute ∆(S ∪ {j}) for all j ∈ S̃ − S
        let S ← S ∪ arg max_j ∆(S ∪ {j})
        if |S| ≥ U then return unsuccessful
    end
    return S, ∆(S)

Algorithm 2 starts by computing the influence of all individual observations. Maximally influential sets are proposed adaptively, by forming the union of the previous set (starting with the empty set) and the observation with the highest influence (at each step). This is repeated until a minimal influential set is found, or the maximal size is reached. The algorithmic complexity is linear in the cardinality of the set and falls well short of, e.g., a Jackknife approach. By recomputing individual influences after every removal, this approach reduces the risk of masking problems and allows us to more reliably investigate sensitivity to influential sets.
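The greedy procedure can be sketched as follows (our own Python rendering; the demo plants two mutually masking points whose joint influence, roughly 1.1, far exceeds the sum of their individual influences, roughly 0.4, so the accumulation in Algorithm 0 would never reach the target):

```python
import numpy as np

def exact_influence(X, y, S, j):
    """Delta(S): exact change in coefficient j when the rows in S are dropped."""
    full = np.linalg.solve(X.T @ X, X.T @ y)[j]
    keep = np.setdiff1d(np.arange(len(y)), list(S))
    part = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])[j]
    return full - part

def algorithm2(X, y, target, max_size, j=0):
    """Algorithm 2: grow the set greedily, re-evaluating influence each step."""
    S = []
    while exact_influence(X, y, S, j) < target:
        if len(S) >= max_size:
            return None  # unsuccessful
        rest = [i for i in range(len(y)) if i not in S]
        # add the observation whose removal, jointly with S, is most influential
        S.append(max(rest, key=lambda i: exact_influence(X, y, S + [i], j)))
    return S

# demo: two points at x = 10 mask each other
x = np.append(np.linspace(-2, 2, 41), [10.0, 10.0])
y = np.append(0.5 * np.linspace(-2, 2, 41), [20.0, 19.0])
X = np.column_stack([np.ones(43), x])
print(algorithm2(X, y, target=1.0, max_size=5, j=1))  # → [41, 42]
```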

3.4 Approximations and computational concerns

The algorithms presented above are computationally straightforward, with complexities that are constant, logarithmic, and linear in the size of the minimal influential set, respectively. In our linear setting, most quantities of interest rely on matrix factorization, with a complexity of O(N³). This means that the computation of the influence function dominates the overall complexity in practice. Fortunately, many quantities of interest, such as coefficients and standard errors, can be computed efficiently using algebraic results (e.g. Equation 5) and updating formulae (for cross products, inverses, and factorizations). Together with optimization of the underlying linear algebra, this allows us to quickly assess the sensitivity to influential sets in a wide range of applications.
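As an example of such an updating formula, the inverse cross product after deleting one row can be obtained with the Sherman-Morrison identity instead of refactorizing (a sketch of this standard identity; not code from the paper's package):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
A_inv = np.linalg.inv(X.T @ X)

# deleting row i turns X'X into X'X - x_i x_i'; Sherman-Morrison updates
# the inverse with O(P^2) work instead of a fresh O(P^3) factorization
i = 5
xi = X[i]
u = A_inv @ xi
A_inv_dropped = A_inv + np.outer(u, u) / (1 - xi @ u)

# agrees with recomputing the inverse from scratch
keep = np.delete(np.arange(30), i)
assert np.allclose(A_inv_dropped, np.linalg.inv(X[keep].T @ X[keep]))
```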

Further speed gains can be realized with approximate methods. When the number of covariates is large, it can be helpful to marginalize out nuisance covariates before computing influence measures. Iterative solvers are another option that can help speed up operations; suitable stopping criteria can make the results exact for our purposes (see e.g. Trefethen and Bau, 1997). It can also be helpful to fix or discard computationally intensive elements. For instance, there is no convenient method to update clustered standard errors, and their computation becomes prohibitive in large settings. By using non-clustered standard errors for proposing sets, we can facilitate sensitivity checks at large scales. However, this simplification can be problematic when a large part of the influence is driven by the simplified element.5

3.5 An illustration

Here, we illustrate the characteristics of the three algorithms using a simple example. Consider fitting an OLS regression line to the data depicted in Figure 1. The set marked 'a' in the top-right of the scatter plot is influential on the estimated positive slope, lowering it considerably when removed. This influence is reflected in standard diagnostics, and all three algorithms will identify this influential set. This will not be the case for the set of three observations in the center-right of the scatter plot, marked 'b'. Its influence is initially masked by the first influential set, and neither Algorithm 0 nor Algorithm 1 identifies any of its elements. By contrast, the adaptive nature of Algorithm 2 allows the masked observations to be identified.

Figure 1: OLS regression line for a dataset (N = 60) with two sets of size three that are influential on the positive slope. The set marked 'a' masks the influence of the set marked 'b'. The solid gray line indicates the full-sample estimates; the dashed dark gray line indicates the estimate after removing both sets.

5 This is a problem of the linear approximation of Broderick et al. (2020) to Equation 5, which yields DFBETA_i ≈ (X′X)⁻¹ x_i′ e_i, thus disregarding the effect of leverage (which is expensive to compute). This means that the influence of all observations is biased downwards, with the bias being particularly large for high-leverage observations. This is problematic since these observations tend to be particularly influential. See Section A2 of the Appendix for more information on the nature and size of the bias of this approximation.


Figure 2 depicts the maximally influential sets that are identified by the different algorithms at increasing size, as well as the implied regression lines after accounting for influential sets of size three and seven. Masking affects the sets identified by Algorithms 0 and 1 (see the left panel in Figure 2), leading to relatively inconsequential observations being identified after the first three removals. The approximation in Algorithm 0 even fails to account for the full influence of the first (jointly influential) set, and considerably underestimates the slope starting after the first removal. Algorithm 2 does not suffer from these problems; the implied slopes are accurate and precise, and masking issues are avoided.6

Figure 2: Influential set analysis with Algorithms 0, 1, and 2. The order of the identified observations is color-coded and indicated by a cross (×, first three) and a crosshair (+, next four). The solid gray line indicates the full-sample regression line. The dotted and dashed lines (labelled with the algorithm used) are the implied regression lines after accounting for influential sets of sizes three and seven.

4 Empirical applications

In this section, we investigate the sensitivity to influential sets of three empirical results on long-term economic development in Africa. We focus on the influence on the t values of the main effect of interest, and assess minimal influential sets that induce (1) a loss of significance (at a given level), (2) an estimate of the opposite sign, and (3) a significant estimate of the opposite sign.

6 In terms of computation, all three algorithms executed in less than 0.1 seconds.


Identifying the factors behind economic development in African countries has been a central endeavor in development economics over the last decades. Factors like the colonial experiences (Acemoglu et al., 2001), the slave trades (Nunn, 2008), and precolonial institutions and centralization (Gennaioli and Rainer, 2007; Michalopoulos and Papaioannou, 2013) are known to play a fundamental role. The drivers behind these factors, and the pathways through which they affect development today, are an important subject of research (Spolaore and Wacziarg, 2013). Three empirical studies on this topic find that the slave trades affect development via ruggedness (Nunn and Puga, 2012) and interpersonal trust (Nunn and Wantchekon, 2011), and that the Tsetse fly affects precolonial centralization (Alsan, 2015). Below, we assess the sensitivity of the empirical findings presented in these three studies to influential sets of observations.

4.1 Geography, development, and omitted variables

Nunn and Puga (2012) estimate linear regression models that relate GDP per capita across countries of the world to a measure of terrain ruggedness and other control variables. They allow for heterogeneity of African countries via an interaction term. Nunn and Puga (2012) find a significantly negative overall estimate of the effect of ruggedness, and a significantly positive estimate for African countries (cf. Table 1, Column 1). This differential effect is interpreted as rugged geography offering protection against the slave trades. The authors perform rigorous checks to address the sensitivity of these results to influential observations, and their findings appear to be robust. In particular, the results are unchanged when omitting observations that exceed a threshold of |DFBETA_i| > 2/N (following Belsley et al., 1980), as well as the ten smallest and most rugged observations. We find that robust estimates support this conclusion (see Table 1, Column 2).

We investigate the sensitivity of this differential effect of ruggedness in Africa to influential sets. In the row labeled 'Thresholds' of Table 1, we report the sizes of influential sets that would induce (1) an insignificant estimate, (2) an estimate of the opposite sign (in square brackets), and (3) a significant estimate of the opposite sign (in curly brackets). We find that an influential set of two observations overturns the significance of the differential effect. An influential set of five observations is enough to induce a sign switch, and one of size eleven achieves a significant sign switch. To provide more context for these observations, we visualize the five most influential ones in Figure 3, Panel A. Their influence is visualized in Panel B, which shows the effect of each subsequent observation (indexed by the ISO code of the country) on the coefficient's t value. Starting at a value of 2.53, the t value decreases slightly with the removal of the most influential observation, which is the Seychelles. Its influence (as indicated by the height of its ISO code) appears limited, but when using Algorithm 2 (on the left) we can clearly see that


Table 1: The differential effect of ruggedness in Africa

                          Baseline    Robust    Population   Land area
    Ruggedness, Africa†     0.321      0.325       0.190       0.215
                           (2.53)     (2.46)      (1.66)      (1.63)
    Ruggedness             -0.231     -0.251      -0.231      -0.238
                          (-2.99)    (-3.23)     (-2.94)     (-3.08)
    Controls                 Yes        Yes         Yes         Yes
    Population in 1400        –          –          Yes          –
    Land area                 –          –           –          Yes
    Thresholds†           2 [5] {11}     —       – [3] {6}   – [4] {8}
    Observations             170        170         168         170
    R²                      0.537      0.533       0.571       0.554

The row labeled 'Thresholds' reports the sizes of influential sets that induce a loss of significance (at the 5% level), [a sign flip], and {a significant sign flip} of the coefficient that captures the differential effect of ruggedness in Africa (using Algorithm 2). The 'Baseline' column reproduces the results of Nunn and Puga (2012, see Table 1, Column 6) using OLS estimation, while 'Robust' uses robust M-estimation. The specifications labeled 'Population' and 'Land area' add the population level in the year 1400 and the land area of the country (both in logs) as covariates. Coefficient estimates are reported with t values based on HC1 robust standard errors in parentheses.

the influence of subsequent removals increases considerably. This pattern is indicative of a jointly influential set, remains hidden from Algorithm 0 (on the right), and calls for closer inspection.

Analyzing influential sets and the characteristics of their members can provide valuable insights. First, the importance of the Seychelles casts doubt on the interpretation presented in Nunn and Puga (2012). The island nation has only been inhabited permanently since the late 18th century (Fauvel, 1909), and its ruggedness played no role in mediating the effects of the slave trades. Next, the five most influential nations are extraordinarily small in terms of land area and population. This may indicate a survivorship bias, where the inclusion of countries in the dataset is (at least in part) determined by land area, population, geography, and economic success. Moreover, past population sizes are likely related to other geographical features that confound the effect of ruggedness, and would also play an important role in mediating impacts of the slave trades. In Columns 3 and 4 of Table 1, we present the results of specifications that include past population size and land area as additional controls. There, no significant differential effect of ruggedness on economic development can be found for the African continent.


Figure 3: Two nations drive the blessing of bad geography

[Figure: Panel A, "Influential nations and ruggedness", maps ruggedness (low to high) across Africa and marks the five most influential observations: 1. Seychelles, 2. Lesotho, 3. Rwanda, 4. Eswatini, 5. Comoros (Seychelles and Comoros not to scale). Panel B, "Influence estimates", plots the t value (from 2.53 down to -2.20) against the number of removals for Algorithm 2 (left) and Algorithm 0 (right), labeling each removed observation with its ISO code.]

Panel A visualizes ruggedness, the explanatory variable of interest, and the five most influential observations. Panel B shows the cumulative reduction of the t value as observations (indicated with their ISO codes) are removed (from 2.53 at the top, to the bottom of the respective observation) using Algorithms 2 (left) and 0 (right).

4.2 Slave trades and heterogeneous origins of mistrust

Nunn and Wantchekon (2011) analyze the role of the Atlantic and East African slave trades as determinants of interpersonal trust, using individual-level survey data and historical information on the slave trades. They regress measures of the trust of relatives and neighbors (among others) on a measure of slave trade intensity, a number of individual and district-level covariates, and country fixed effects. The design matrix of the regression model contains over 20,000 observations, spanning several countries in sub-Saharan Africa, and a total of 78 regressors. Uncertainty about the estimated effects is quantified using standard errors that are clustered along ethnicity and district. The authors find statistically and economically significant negative effects of the slave trades on interpersonal trust. Their findings are robust to removing Kenya and Mali (which were also impacted by the trans-Saharan and Red Sea slave trades) from the sample.

For this analysis, a sensitivity check based on influential sets is computationally challenging, since there exists no updating formula for two-way clustered standard errors. To facilitate our analysis, we only cluster standard errors after the removal of an observation, and not for proposed removals. We reproduce the empirical results of Nunn and Wantchekon (2011) in Table 2. We find that an influential set of size 105 (0.5% of the sample) induces a loss of significance of the estimated effect of the slave trades and


Table 2: The origins of mistrust

                           Trust of relatives           Trust of neighbors
                         Pooled     West | East       Pooled     West | East
Exports/area†            -0.133       -0.145          -0.159       -0.168
                        (-3.68)      (-3.84)         (-4.67)      (-4.48)
Exports/area, East                    0.053                        0.023
                                     (0.96)                       (0.32)
Individual controls         Yes          Yes             Yes          Yes
District controls           Yes          Yes             Yes          Yes
Country fixed effects       Yes          Yes             Yes          Yes
Thresholds†       105[380]{656}  78[301]{532}  161[425]{768}  133[323]{527}
Observations             20,062  7,549 | 12,513       20,027  7,523 | 12,504
Ethnicity clusters          185      62 | 123            185      62 | 123
District clusters         1,257     628 | 651          1,257     628 | 651
R²                        0.133  0.199 | 0.097         0.156  0.228 | 0.117

The row labeled 'Thresholds' reports the sizes of influential sets that induce a loss of significance (at the 5% level), [a sign flip], and {a significant sign flip} of the coefficient that captures the effect of the slave trades (measured as exports of slaves per area) on trust (using Algorithm 2). The columns labeled 'Pooled' reproduce the results of Nunn and Wantchekon (2011, see Table 2, Columns 1 and 2). The columns labeled 'West | East' estimate separate models for observations to the west and east of the 20° Eastern meridian (thresholds refer to the Western subset). Coefficient estimates are reported with t values based on two-way clustered standard errors in parentheses.

380 removals (1.9% of the sample) lead to a sign flip that becomes significant after 656 removals (3.3% of the sample) for the trust of relatives, with similar results for the trust of neighbors.

An analysis of these influential sets shows that 536 of the 600 most influential observations (89.3%) stem from three West African nations: Benin, Nigeria, and Ghana (marked with black borders in Figure 3, also see Table A2 in the Appendix for more details). These nations were major centers of the Atlantic slave trade, and their large influence may suggest differences in impacts between the Atlantic and East African slave trades. When dividing the dataset into a Western and Eastern sample (with the 20° Eastern meridian as the dividing line), we find a significantly negative effect for the Western sample, and an insignificant effect for the Eastern sample (see Table 2, Columns 2 and 4). This result is consistent with the literature, which suggests larger impacts from the Atlantic slave trade (see Nunn, 2008).


4.3 The concentrated effects of the Tsetse fly

The Tsetse fly is an important vector of disease and considered a hindrance to early economic development. Alsan (2015) investigates this detrimental effect, concentrating on its role as a determinant of agricultural practices, urbanization, and institutions. The empirical analysis in Alsan (2015) is based on regressing various outcomes at the ethnic group level on a Tsetse suitability index (TSI) and a number of controls, clustering standard errors at the province level. Using ethnicity-level data, she finds significant detrimental effects of the TSI across the board, with precolonial centralization as the major channel affecting present economic development. These results are robust to perturbations of the TSI and to corrections aimed at assessing the negative selection bias that may occur due to more developed ethnic groups displacing less developed ones.

Table 3: The effects of the Tsetse fly

              Animals  Intensive      Plow    Female   Density   Slavery  Centralized
TSI†           -0.231      -0.09    -0.057     0.206    -0.745     0.101       -0.075
              (-5.47)    (-3.29)   (-2.54)    (3.41)   (-3.25)    (2.51)      (-2.12)
Controls          Yes        Yes       Yes       Yes       Yes       Yes          Yes
Thresholds† 33[58]{79} 7[25]{41} 3[12]{17} 12[30]{48} 9[27]{42} 4[22]{35}   1[16]{30}
Observations      484        485       484       315       398       446          467
Clusters           44         44        44        43        43        44           44
R²              0.296      0.268     0.462     0.285     0.254     0.178        0.139

The row labeled 'Thresholds' reports the sizes of influential sets that induce a loss of significance (at the 5% level), [a sign flip], and {a significant sign flip} of the coefficient that captures the effect of the TSI on the respective dependent variable. The dependent variables are binary and continuous, and measure whether a precolonial ethnic group (1) possessed large domesticated 'Animals', (2) adopted 'Intensive' agriculture, (3) adopted the 'Plow', (4) had 'Female' participation in agriculture, (5) log population 'Density', (6) practiced indigenous 'Slavery', and (7) had a 'Centralized' state. Coefficient estimates are reported with t values based on clustered standard errors in parentheses.

We reproduce these results in Table 3, and investigate the sensitivity of the estimated effects to the sample. First, we focus on the sensitivity to influential sets. We find that the significance of results is induced by few observations, ranging from one (for the variable measuring whether a centralized state was present) to 33 (for the possession of large domesticated animals). Influential sets of sizes between twelve and 79 induce significant results of the opposite signs. The sensitivities of the results in Alsan (2015) are pronounced, with a possible exception for the effects on the domestication of large


animals. The findings are supported by a relatively small amount of data, which may be due to a lack of variation in the outcome variables. In spite of the reasonable sample size, the six binary outcomes are relatively rare (the plow adoption, for example, only occurred for 37 of the 484 observations).

For comparison, we use robust M-estimation and find that five of the seven specifications are effectively unchanged (see Table A3 in the Appendix for the estimation results). Two specifications (those with the plow adoption and indigenous slavery as outcomes) yield a pathological fit. Next, we use a high-breakdown S-estimator. The effects on population density (which are not particularly robust to influential sets) remain unchanged. Meanwhile, the six other specifications yield a pathological fit. Between 106 and 286 (or 22 and 91 percent of) observations are flagged as outliers. Conclusions based on robust estimators depend heavily on the type of estimator, and can differ strongly from ones based on influential sets. While robust estimators also flag sets of impactful observations, these tend to be larger than the influential sets obtained with the methods presented here, and are not as immediately relevant to a particular inferential result of interest.

5 Discussion

In the three applications presented, the main results appear sensitive to influential sets of observations. Analysis of the nature of these influential sets can reveal omitted variable bias, heterogeneity, or limited support from the data. In general, however, the question of how to interpret minimal influential sets remains unanswered. In this section, we elaborate on this issue and provide some additional illustrations based on empirical examples.

Expert knowledge is vital to interpreting sensitivity to influential sets. The degree of sensitivity that we expect a priori for a given application provides important context for the interpretation. When searching for the proverbial needle in the haystack, we expect high sensitivity to little data, while we expect to find a good amount of needles when investigating a sewing kit. Analyses of rare events such as disease outbreaks or economic crises, as well as policy interventions that only affect a small part of the population, are inherently sensitive to small subsets of the data. Phenomena that appear more universally, such as convergence in growth or income levels across economies, should only be susceptible to larger fractions of the data. Having a good understanding of the distribution and intensity of the effects hypothesized (and of other underlying factors) allows us to gain valuable insights from analyzing influential sets.

First, consider the role of influential sets when analyzing rare, concentrated phenomena. The effects of the slave trades on mistrust, for instance, are likely to manifest only in


a few impacted communities. Hence, the sizes of the minimal influential sets that we find (e.g., 105, or 0.5% of the sample, for a loss of significance) do not appear to be particularly worrying. Yet, these influential sets may be insightful by highlighting strongly affected subsamples. In the case of the effects of the Tsetse fly, however, the sensitivity found can be considered more problematic, despite potentially high concentration. The small sizes of the identified minimal influential sets imply that the results rely on deceptively few observations. Since the distribution of impacts is not always obvious a priori, a small minimal influential set can bring differences in the intensity of effects to our attention. One example of this concerns the impacts of microcredits (see Section A2 of the Appendix and Broderick et al., 2020), which are sensitive to few observations in the heavy tails of the outcome variable.

Second, consider minimal influential sets in the context of more universal phenomena. One notable example is cross-country convergence of poverty rates. Based on theoretical considerations about income convergence and the link between mean income growth and poverty dynamics (see Johnson and Papageorgiou, 2020, for a survey), we would expect poverty convergence at the global level to be present in the data. Ravallion (2012), however, finds that the convergence rate in poverty rates is statistically insignificant in a sample of 89 countries. As Crespo Cuaresma et al. (2016, 2022) point out, this result is likely to be driven by the idiosyncratic experience of Eastern European countries. An analysis of influential sets in this setting reveals exactly this result, acting as a substitute for and complement to domain knowledge. The minimal influential set of Belarus, Latvia, Ukraine, and Poland is behind the lack of significance of the convergence estimate. An alternative specification (proposed by Crespo Cuaresma et al., 2016) leads to estimates of the convergence rate that appear to be less sensitive (see Section B3 in the Appendix for more information on the application results). Minimal influential sets of observations that appear small for the context can highlight misspecification issues, and an analysis of these sets can help shed light on the nature of the issues.

Lastly, consider the technical circumstances that give rise to an inferential result. Here, it helps to think of three tightly connected cornerstones: the composition of the data, the identification of parameters, and the estimation method employed. Least squares estimates tend to be sensitive to observations with high leverage and large residuals, while shrinkage estimators will be less impacted by these. On the other hand, the compound nature of 2SLS estimators makes them more susceptible to small influential sets, and numerical instability of the estimators can quickly become a problem. We discuss these issues further in a simulation exercise and in the context of the long-term impacts of migration (as studied by Droller, 2018) in Sections A4 and B4 of the Appendix. Evidently, these issues are related to parameter identification and the variability


of the data. For the effect of the Tsetse fly on plow adoption, there is little variation in the outcome that allows us to identify the effect. It is not surprising that estimates are sensitive to influential sets: removing a set of less than 10% of the data means that the effect can no longer be estimated. Such technical sensitivities are a steady feature of empirical analysis, and influential sets can bring them to our attention.
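The plow-adoption case can be stylized in a few lines. This is a toy construction of our own, not the actual data: with only 37 positive outcomes in a sample of 484, removing a set well below 10% of the data leaves an outcome with no variation, and the effect collapses.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 484                         # sample size as in the plow-adoption regression
x = rng.normal(size=n)

# stylized rare binary outcome: the 37 lowest-x units are the "adopters"
y = np.zeros(n)
y[np.argsort(x)[:37]] = 1.0

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)

full = slope(x, y)              # clearly negative: adopters sit at low x
keep = y == 0                   # drop every positive outcome (37/484 < 10%)
reduced = slope(x[keep], y[keep])
print(full, reduced)            # without the rare cases, the slope is exactly 0
```

Any sensitivity check that can propose such a set will flag this regression, regardless of how strong the full-sample estimate looks.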

The interpretation of sensitivity to influential sets is highly dependent on context. Sensitivity to influential sets does not conclusively indicate a lack of robustness by itself. Instead, the analysis of influential sets presents an intuitive, novel tool for gaining deeper insights about data structures and inferential quantities in a wide range of settings. Reporting the absolute and relative size of minimal influential sets in regression analysis and analyzing their characteristics may help increase the transparency of research communication and significantly improve our understanding of socioeconomic phenomena.

6 Conclusions

In this paper, we investigated the sensitivity of inferential quantities to influential sets of observations. We explored three algorithms aimed at identifying minimal influential sets and quantifying inferential sensitivity to them. We showed how masking, where certain observations obscure the influence of others, complicates sensitivity checks by inducing false negatives. We investigated the sensitivity to influential sets in several empirical applications related to the determinants of economic development in Africa, demonstrating the practical relevance and utility of this sensitivity check. Our analysis showed that sensitivity to influential sets is prevalent in practice. A great deal of interesting phenomena are exceedingly rare, and many important insights may hinge on few observations. The proposed sensitivity check based on influential sets should not necessarily be interpreted as an indicator of a lack of validity, but analysis of these sets can be an important tool for identifying possible shortcomings of the model, and potentially fruitful extensions of the analysis.

In applied econometric practice, the sensitivity to sets of observations is still rarely assessed in a systematic manner. Reinforced by our findings in this paper, we believe that such sensitivity checks should play a more important role in the communication of the results of econometric analysis. Influential sets can deliver valuable insights, and the sizes of minimal influential sets can serve as intuitive summary measures. Two important advantages compared to existing measures, such as Cook's distance or robust estimates, are their interpretability and salience. Summaries are directly tied to a statistic of interest, and influential sets can be analyzed in detail. Sensitivity checks based on influential sets are necessarily approximate, and robustness cannot be attested conclusively.


Nonetheless, many false negatives can be avoided at a reasonable computational cost,

and approximations can unveil important insights that would otherwise remain hidden

in plain sight.

There are several pathways for future work building upon this contribution. On

the topic of inuential sets, many potential improvements in terms of computational

eciency and accuracy are conceivable. In a wider sense, there is a lack of comprehensive

indicators and summary measures for sensitivities to the data. Lastly, the approach

presented here opens the door for further replication studies and sensitivity checks of

documented empirical phenomena, which may deliver valuable insights for researchers

and policymakers.

References

Daron Acemoglu, Simon Johnson, and James A. Robinson. The colonial origins of comparative development: an empirical investigation. American Economic Review, 91(5):1369–1401, 2001. ISSN 0002-8282. doi:10.1257/aer.91.5.1369.

Marcella Alsan. The effect of the TseTse fly on African development. American Economic Review, 105(1):382–410, 2015. ISSN 0002-8282. doi:10.1257/aer.20130604.

Manuela Angelucci, Dean Karlan, and Jonathan Zinman. Microcredit impacts: evidence from a randomized microcredit program placement experiment by Compartamos Banco. American Economic Journal: Applied Economics, 7(1):151–82, 2015. doi:10.1257/app.20130537.

Joshua D. Angrist and Jörn-Steffen Pischke. The credibility revolution in empirical economics: how better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2):3–30, 2010. doi:10.1257/jep.24.2.3.

Susan Athey and Guido W. Imbens. The state of applied econometrics: causality and policy evaluation. Journal of Economic Perspectives, 31(2):3–32, 2017. doi:10.1257/jep.31.2.3.

Anthony C. Atkinson, Marco Riani, and Andrea Cerioli. The forward search: theory and data analysis. Journal of the Korean Statistical Society, 39(2):117–134, 2010. ISSN 1226-3192. doi:10.1016/j.jkss.2010.02.007.

Orazio Attanasio, Britta Augsburg, Ralph De Haas, Emla Fitzsimons, and Heike Harmgart. The impacts of microfinance: evidence from joint-liability lending in Mongolia. American Economic Journal: Applied Economics, 7(1):90–122, 2015. doi:10.1257/app.20130489.

Britta Augsburg, Ralph De Haas, Heike Harmgart, and Costas Meghir. The impacts of microcredit: evidence from Bosnia and Herzegovina. American Economic Journal: Applied Economics, 7(1):183–203, 2015. doi:10.1257/app.20130272.

Abhijit Banerjee, Esther Duflo, Rachel Glennerster, and Cynthia Kinnan. The miracle of microfinance? evidence from a randomized evaluation. American Economic Journal: Applied Economics, 7(1):22–53, 2015. doi:10.1257/app.20130533.

David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression diagnostics: identifying influential data and sources of collinearity. John Wiley & Sons, 1980. doi:10.1002/0471725153.

Kirill Borusyak, Peter Hull, and Xavier Jaravel. Quasi-experimental shift-share research designs. Review of Economic Studies, 89(1):181–213, 2022. ISSN 0034-6527. doi:10.1093/restud/rdab030.

George E. P. Box and George C. Tiao. A Bayesian approach to some outlier problems. Biometrika, 55(1):119–129, 1968. doi:10.1093/biomet/55.1.119.

Tamara Broderick, Ryan Giordano, and Rachael Meager. An automatic finite-sample robustness metric: can dropping a little data change conclusions?, 2020.

Samprit Chatterjee and Ali S. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1(3):379–393, 1986. doi:10.1214/ss/1177013622.

Ralph Dennis Cook. Influential observations in linear regression. Journal of the American Statistical Association, 74(365):169–174, 1979. doi:10.2307/2286747.

Bruno Crépon, Florencia Devoto, Esther Duflo, and William Parienté. Estimating the impact of microcredit on those who take it up: evidence from a randomized experiment in Morocco. American Economic Journal: Applied Economics, 7(1):123–50, 2015. doi:10.1257/app.20130535.

Jesús Crespo Cuaresma, Stephan Klasen, and Konstantin M. Wacker. There is poverty convergence. SSRN Electronic Journal, 2016. doi:10.2139/ssrn.2718720.

Jesús Crespo Cuaresma, Stephan Klasen, and Konstantin M. Wacker. When do we see poverty convergence? Oxford Bulletin of Economics and Statistics, 2022. ISSN 0305-9049. doi:10.1111/obes.12492.

Francisco Cribari-Neto, Tatiene C. Souza, and Klaus L. P. Vasconcellos. Inference under heteroskedasticity and leveraged data. Communications in Statistics - Theory and Methods, 36(10):1877–1888, 2007. doi:10.1080/03610920601126589.

Federico Droller. Migration, population composition and long run economic development: evidence from settlements in the Pampas. The Economic Journal, 128(614):2321–2352, 2018. doi:10.1111/ecoj.12505.

Bradley Efron and Charles Stein. The Jackknife estimate of variance. The Annals of Statistics, 9(3), 1981. doi:10.1214/aos/1176345462.

Bradley Efron and Robert J. Tibshirani. An introduction to the Bootstrap. CRC Press, 1994. doi:10.1201/9780429246593.

A. A. Fauvel. Unpublished documents on the history of the Seychelles islands anterior to 1810. Government Printing Office, Mahe, Seychelles, 1909. URL www.loc.gov/item/unk83018617/.

Nicola Gennaioli and Ilia Rainer. The modern impact of precolonial centralization in Africa. Journal of Economic Growth, 12(3):185–234, 2007. ISSN 1573-7020. doi:10.1007/s10887-007-9017-z.

Paul Goldsmith-Pinkham, Isaac Sorkin, and Henry Swift. Bartik instruments: what, when, why, and how. American Economic Review, 110(8):2586–2624, 2020. ISSN 0002-8282. doi:10.1257/aer.20181047.

Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2005. doi:10.1002/9781118186435.

Ville Hautamaki, Ismo Karkkainen, and Pasi Franti. Outlier detection using k-nearest neighbour graph. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 3, pages 430–433. IEEE, 2004. doi:10.1109/ICPR.2004.1334558.

Peter J. Huber and Elvezio M. Ronchetti. Robust statistics. John Wiley & Sons, Ltd., Chichester, England, UK, January 2009. ISBN 978-0-47012990-6. doi:10.1002/9780470434697.

Paul Johnson and Chris Papageorgiou. What remains of cross-country convergence? Journal of Economic Literature, 58(1):129–75, 2020. doi:10.1257/jel.20181207.

Dean Karlan and Jonathan Zinman. Microcredit in theory and practice: using randomized credit scoring for impact evaluation. Science, 332(6035):1278–1284, 2011. doi:10.1126/science.1200138.

Robert E. Kass, Luke Tierney, and Joseph B. Kadane. Approximate methods for assessing influence and sensitivity in Bayesian analysis. Biometrika, 76(4):663–674, 1989. doi:10.1093/biomet/76.4.663.

Leonard Kaufman and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons, 2009. ISBN 978-0470316801. doi:10.1002/9780470316801.

Stephan Klasen and Mark Misselhorn. Determinants of the growth semi-elasticity of poverty reduction, 2008.

Edward E. Leamer. Let's take the con out of econometrics. American Economic Review, 73(1):31–43, 1983. URL https://www.jstor.org/stable/1803924.

Edward E. Leamer. Sensitivity analyses would help. American Economic Review, 75(3):308–313, 1985. URL https://www.jstor.org/stable/1814801.

Ricardo A. Maronna, R. Douglas Martin, Victor J. Yohai, and Matías Salibián-Barrera. Robust statistics: theory and methods (with R). John Wiley & Sons, 2019. doi:10.1002/9781119214656.

Stelios Michalopoulos and Elias Papaioannou. Pre-colonial ethnic institutions and contemporary African development. Econometrica, 81(1):113–152, 2013. ISSN 1468-0262. doi:10.3982/ECTA9613.

Nathan Nunn. The long-term effects of Africa's slave trades. Quarterly Journal of Economics, 123(1):139–176, 2008. ISSN 0033-5533. doi:10.1162/qjec.2008.123.1.139.

Nathan Nunn and Diego Puga. Ruggedness: the blessing of bad geography in Africa. Review of Economics and Statistics, 94(1):20–36, 2012. doi:10.1162/REST_a_00161.

Nathan Nunn and Leonard Wantchekon. The slave trade and the origins of mistrust in Africa. American Economic Review, 101(7):3221–52, 2011. doi:10.1257/aer.101.7.3221.

Daniel Peña and Victor J. Yohai. The detection of influential subsets in linear regression by using an influence matrix. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):145–156, 1995. doi:10.1111/j.2517-6161.1995.tb02020.x.

Lawrence I. Pettit and Karen D. S. Young. Measuring the effect of observations on Bayes factors. Biometrika, 77(3):455–466, 1990. doi:10.1093/biomet/77.3.455.

Garry D. A. Phillips. Recursions for the two-stage least-squares estimators. Journal of Econometrics, 6(1):65–77, 1977. doi:10.1016/0304-4076(77)90055-0.

Martin Ravallion. Why don't we see poverty convergence? American Economic Review, 102(1):504–23, 2012. doi:10.1257/aer.102.1.504.

Marco Riani, Andrea Cerioli, Anthony C. Atkinson, and Domenico Perrotta. Monitoring robust regression. Electronic Journal of Statistics, 8(1):646–677, 2014. ISSN 1935-7524. doi:10.1214/14-EJS897.

Matthew S. Shotwell and Elizabeth H. Slate. Bayesian outlier detection with Dirichlet process mixtures. Bayesian Analysis, 6(4):665–690, 2011. doi:10.1214/11-BA625.

Enrico Spolaore and Romain Wacziarg. How deep are the roots of economic development? Journal of Economic Literature, 51(2):325–69, 2013. ISSN 0022-0515. doi:10.1257/jel.51.2.325.

Mark F. J. Steel. Model averaging and its use in economics. Journal of Economic Literature, 58(3):644–719, 2020. doi:10.1257/jel.20191385.

Stephen M. Stigler. The changing history of robustness. American Statistician, 64(4):277–281, 2010. ISSN 0003-1305. doi:10.1198/tast.2010.10159.

Alessandro Tarozzi, Jaikishan Desai, and Kristin Johnson. The impacts of microcredit: evidence from Ethiopia. American Economic Journal: Applied Economics, 7(1):54–89, 2015. doi:10.1257/app.20130475.

Lloyd N. Trefethen and David Bau. Numerical Linear Algebra. SIAM, 1997. doi:10.1137/1.9780898719574.


Appendix for online publication

A Additional technical information

In this section, we provide additional technical details on the approximation of influential sets and their influence. First, we focus on limitations of the approach used in Algorithm 0 and by Broderick et al. (2020). Next, we conduct a simulation exercise to compare the three considered algorithms. Finally, we investigate the sensitivity of 2SLS estimation in a simulation exercise.

A1 Aggregating individual influences

Algorithm 0 approximates the influence of a set of observations by aggregating individual, full-sample influences of its members. When assessing the influence on coefficients, e.g., this method suffers from a downward bias. With every removal, the leverage and hence the influence of observations increases, and differences in leverage are exacerbated. The other driver of influence, residuals, decreases on average, complicating the analysis. We illustrate the underestimation with a mean-only example.

Consider the model $y = \mathbf{1}\theta + \varepsilon$, where we are interested in the influence on the estimate of $\theta$, which we are (without loss of generality) trying to decrease ($\Delta(S) = \hat{\theta}(\emptyset) - \hat{\theta}(S)$). We will show that the influence of a set of two influential observations $y_1$ and $y_2$ (i.e., they satisfy $y_1 \geq y_2 > \sum_i y_i / N$) is greater than the sum of influences of its members. We need to show $\delta_1 + \delta_2 < \Delta(\{1,2\})$, which is equivalent to

$$\left(\hat{\theta} - \hat{\theta}(1)\right) + \left(\hat{\theta} - \hat{\theta}(2)\right) < \hat{\theta} - \hat{\theta}(\{1,2\}),$$

$$\frac{\sum_i y_i}{N} - \frac{\sum_{i \neq 1} y_i + \sum_{i \neq 2} y_i}{N-1} + \frac{\sum_{i \neq 1,2} y_i}{N-2} < 0.$$

Since $\sum_{i \neq 1} y_i + \sum_{i \neq 2} y_i = \sum_i y_i + \sum_{i \neq 1,2} y_i$, we can redistribute the second term for

$$\left(\frac{1}{N} - \frac{1}{N-1}\right) \sum_i y_i + \left(\frac{1}{N-2} - \frac{1}{N-1}\right) \sum_{i \neq 1,2} y_i < 0.$$

By assumption, we know that $\sum_{i \neq 1,2} y_i + \frac{2}{N-2} \sum_{i \neq 1,2} y_i < \sum_i y_i$, which implies that

$$\left(\frac{1}{N} - \frac{1}{N-1}\right) \sum_i y_i + \left(\frac{1}{N-2} - \frac{1}{N-1}\right) \sum_{i \neq 1,2} y_i < \left(\frac{1}{N} - \frac{1}{N-1}\right)\left(1 + \frac{2}{N-2}\right) \sum_{i \neq 1,2} y_i + \left(\frac{1}{N-2} - \frac{1}{N-1}\right) \sum_{i \neq 1,2} y_i = 0,$$

where the second term cancels out, completing the proof.
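The inequality is easy to verify numerically. A minimal sketch with made-up data (all values our own): two observations above the mean individually understate their joint influence on the estimate of $\theta$.

```python
import numpy as np

# sample with two large observations that mask one another
y = np.array([10.0, 9.0, 1.0, 1.5, 0.5, 1.2, 0.8, 1.1])

theta = y.mean()                                  # full-sample estimate
delta1 = theta - np.delete(y, 0).mean()           # influence of y_1 alone
delta2 = theta - np.delete(y, 1).mean()           # influence of y_2 alone
joint = theta - np.delete(y, [0, 1]).mean()       # influence of the set {1, 2}

# the assumption of the proof: both points exceed the sample mean
assert y[0] >= y[1] > theta
print(delta1 + delta2, joint)  # the joint influence strictly exceeds the sum
```

This is exactly the masking effect that makes Algorithm 0's sum of individual influences a lower bound in such configurations.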


A2 Assessing influence approximations

Approximations of the influence of sets of observations that ignore (or approximate) certain elements enable computation in intensive settings. However, such computational shortcuts can become problematic if the approximated elements are important determinants of the influence themselves. In our setting, a notable example is the Approximate Maximum Influence Perturbation (AMIP, by Broderick et al., 2020), which we investigate in more detail.⁷ We illustrate the limitations of this approximation using a simulated example, and then discuss its performance in the context of seven studies on microcredits, an application featured in their contribution. In Section A3, we conduct a simulation exercise.

Illustration

Consider a univariate regression with a slope parameter β = 1 and zero intercept, where innovations and observations of the covariate are drawn from a t(8) distribution. This setup allows for moderately high leverage and residuals, without inducing sensitivity to influential sets. We draw N = 1,000 observations and construct the dependent variable; the data are visualized in Figure A1. First, we illustrate the behavior of AMIP when approximating influential sets and their influence with regard to the coefficient β. Influence is given by

given by

δAMIP

i= (x′x)−1x′

iei≈DFBETAi=(x′x)−1x′

iei

1−hi

,

where it can be seen that the dierence between δAMIP

iand DFBETAiis given by the

suppression of the leverage component of the measure, hi.
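The suppressed leverage term is easy to isolate numerically. The sketch below mimics this illustration with simulated data (variable names and the smaller sample size are our own), comparing the AMIP influence approximation to the exact leave-one-out change in the coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.standard_t(df=8, size=n)
y = x + rng.standard_t(df=8, size=n)   # true slope beta = 1, no intercept

sxx = x @ x                            # x'x for the univariate model
beta = (x @ y) / sxx                   # OLS slope
e = y - beta * x                       # residuals
h = x**2 / sxx                         # leverage h_i of each observation

delta_amip = x * e / sxx               # AMIP influence: leverage term dropped
dfbeta = x * e / (sxx * (1 - h))       # exact leave-one-out change in beta

# since 0 < 1 - h_i <= 1, AMIP always understates the magnitude,
# most severely for high-leverage points
assert np.all(np.abs(delta_amip) <= np.abs(dfbeta) + 1e-15)
print(np.max(np.abs(dfbeta - delta_amip)))
```

The `dfbeta` line is the exact updating formula, so the gap printed at the end is the approximation error attributable entirely to dropping $h_i$.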

[Figure A1: Simulated data and the least squares regression line. The covariate (on the horizontal axis) and the innovations are drawn from a t(8) distribution.]

In Figure A2, we visualize the errors of approximation (when compared to the true influence). In the left panel, we relate the AMIP error to the leverage of the dataset.

⁷Our analysis is based on the results in Broderick et al. (2020) and their implementation (kindly made available at https://github.com/rgiordan/zaminfluence) as of commit 2d29fbb from 4/20/2022.


Since leverage and residuals are the only two factors driving influence in this simple case, the use of AMIP leads to a downward bias (to $(N-1)^{-1}$ percent) in the estimation of the influence. In the right panel, we relate the error to the true influence. As can be seen, the magnitude of the error is particularly pronounced for highly influential observations.

Figure A2: Errors from using AMIP to estimate influence on coefficients (compared to the exact value), plotted against the exact leverage (left) and the exact influence (right) of observations. The vertical axes show standardized errors.

Next, we investigate the influence on the standard error of the coefficient estimate. To showcase their relationship, we regress the exact influence on standard errors onto its AMIP estimate. We find an R² of 0.86, and a coefficient value of 1.005 (t = 78.8), indicating good accuracy on average. However, the quality of the approximation deteriorates considerably with increasing influence. Figure A3 visualizes the relationship between the (standardized) regression residuals and the fitted values, as well as leverage.
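The exact influence on the standard error used as the dependent variable here can be obtained by brute-force refitting. A minimal sketch (ours; the classical no-intercept standard error is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.standard_t(8, size=n)
y = x + rng.standard_t(8, size=n)

def se_beta(x, y):
    """Classical standard error of the slope in the no-intercept model."""
    beta = (x @ y) / (x @ x)
    e = y - x * beta
    s2 = (e @ e) / (len(x) - 1)              # one estimated parameter
    return np.sqrt(s2 / (x @ x))

se_full = se_beta(x, y)
idx = np.arange(n)
# Exact influence of observation i: change in the SE when i is removed.
se_influence = np.array([
    se_full - se_beta(x[idx != i], y[idx != i]) for i in range(n)
])
```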

Figure A3: Diagnostic plots for the regression of exact observation influence on standard errors onto their AMIP estimates. Standardized residuals are plotted against fitted values (left) and against leverage (right).

Last, we consider the influence on the significance of the parameter estimate. Broderick et al. (2020) do not explicitly track influence on t values, but rather compute the influence on significance for a given t value by accumulating

$$\delta_i^{\mathrm{AMIP}} = \beta_{(i)}^{\mathrm{AMIP}} + t\,\mathrm{SE}_{(i)}^{\mathrm{AMIP}}, \qquad (\mathrm{A1})$$

where $\beta_{(i)}^{\mathrm{AMIP}}$ and $\mathrm{SE}_{(i)}^{\mathrm{AMIP}}$ are the AMIP estimates of the influence of observation i on β and its standard error. A minimal influential set is found by choosing a direction for ∆ (i.e., significantly positive or negative), and setting the target value that needs to be exceeded to one of β ± t × SE. We investigate this approximation below using an empirical application.
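The accumulation step behind Equation A1 can be sketched as a greedy loop over fixed per-observation scores (function and variable names are ours; this illustrates the logic, not the implementation of Broderick et al.):

```python
import numpy as np

def minimal_set_size(delta, start, target):
    """Greedy size of a set whose accumulated scores push the estimate
    from `start` past `target`.  Removing observation i is taken to
    shift the estimate by -delta[i]; returns None if unreachable."""
    order = np.argsort(-delta) if target < start else np.argsort(delta)
    value = start
    for k, i in enumerate(order, start=1):
        value -= delta[i]
        if (value < target) if target < start else (value > target):
            return k
    return None

# Toy scores: four removals suffice to push the estimate from 1.0 below 0.0.
scores = np.array([0.4, 0.3, 0.2, 0.15, -0.1])
size = minimal_set_size(scores, start=1.0, target=0.0)  # -> 4
```

Because the scores are fixed, this procedure ignores how each removal changes the influence of the remaining observations, which is exactly where the exact refitting algorithms discussed in the paper differ.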


Kuschnig, Zens, & Crespo Cuaresma

Microcredits

Many recent large-scale experimental studies evaluate the efficacy of microcredit as a tool for alleviating poverty and facilitating economic development. Seven of these studies are analyzed by Broderick et al. (2020), who assess the sensitivity of the average treatment effect to removing observations. These are randomized control trials in Bosnia and Herzegovina (Augsburg et al., 2015), Mongolia (Attanasio et al., 2015), Ethiopia (Tarozzi et al., 2015), Mexico (Angelucci et al., 2015), Morocco (Crépon et al., 2015), the Philippines (Karlan and Zinman, 2011), and India (Banerjee et al., 2015). The underlying model is a simple treatment effect model with a single (randomized) treatment dummy.
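Since the specification contains only an intercept and a randomized dummy, the OLS estimate of the average treatment effect reduces to a difference in group means. A generic sketch with synthetic data (the effect size of 0.5 is a placeholder, not an estimate from any of the studies):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
d = rng.integers(0, 2, size=n).astype(float)     # randomized treatment dummy
y = 2.0 + 0.5 * d + rng.standard_t(8, size=n)    # placeholder effect of 0.5

X = np.column_stack([np.ones(n), d])             # intercept and dummy
intercept, ate = np.linalg.lstsq(X, y, rcond=None)[0]

# With a single dummy, the OLS slope is the difference in group means.
diff_means = y[d == 1].mean() - y[d == 0].mean()
```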

Table A1: Sensitivity of the average treatment effect of microcredits

                 BIH      MON      ETH     MEX      MOR     PHI     IND
Estimate β       37.53    -0.34    7.29    -4.55    17.54   66.56   16.72
  t              (1.90)   (-1.53)  (0.92)  (-0.77)  (1.54)  (0.85)  (1.41)
Sign-switch
  A2             13       15       1       1        11      9       6
  A0             14       16       1       1        11      9       6
  B0             14       16       1       1        11      9       6
Significance
  A2             35       34       10      9        29      38      28
  A0             68       58       387     41       42      89      76
  B0             40       38       66      15       30      58      32
Observations     1,195    961      3,113   16,560   5,498   1,113   6,863

The reported values are the sizes of influential sets that are needed to induce a sign-switch of the average treatment effect, and to have this sign-flip become significant (at the 5% level). We use Algorithms 2 (labelled 'A2') and 0, using exact influences ('A0') and AMIP estimates ('B0'). For the sign-switch, we explicitly target the influence on β, and not on the t values.

In Table A1, we present the full-sample estimates and the sizes of minimal influential sets that induce a sign-switch, and a significant sign-switch. We compare three approaches, based on (1) Algorithm 2, and Algorithm 0, using (2) exact influences ('A0') and (3) AMIP estimates ('B0', reproducing Broderick et al., 2020). The results for sign-switches are similar. By contrast, the sizes needed for a significant sign-switch differ strongly. When particularly influential observations are present, only Algorithm 2 properly accounts for the effects of subsequent removals. Notably, the AMIP estimates for inducing significance (following Equation A1) are considerably lower than one would expect.8

8 This is peculiar, considering the downward biases of (1) accumulating influences, (2) the $\beta_{(i)}^{\mathrm{AMIP}}$ estimate, and (3) the estimates $\beta_{(i)}^{\mathrm{AMIP}}$ and $\mathrm{SE}_{(i)}^{\mathrm{AMIP}}$ for influential observations.


Figure A4: Household income in thousands (vertical axis) and the treatment dummy for the studies on Ethiopia (ETH, left) and Mexico (MEX, right). The regression line is indicated in gray.

Thanks to randomization, the regression model underlying these results is remarkably simple, and Cook's distance (or other standard checks) already indicates sensitivity issues. The cases of Ethiopia and Mexico are particularly striking (cf. Figure A4). Since leverage plays a limited role in this regression setting, it appears to be an ideal case for methods based on initial approximation (Algorithm 1 would be a preferred candidate for its improved precision).

A3 Approximations in the presence of influential sets

Approaches for identifying influential sets that are based on approximations may suffer precisely in the presence of influential sets. To showcase this issue, we consider the univariate regression from Section A2 (with β = 1, and innovations and covariates drawn from a t(8) distribution) with N = 100. We contaminate this setup with two influential sets. For the first set, $S_{0.02}$, we use a coefficient value of three and increase covariate values by seven. For the second one, $T_{0.03}$, the coefficient value is changed to two and covariate values are increased by five. One realization of this process is visualized in Figure A5.

Figure A5: Simulated data and OLS regression lines for the full sample (in gray) and the uncontaminated sample (dashed, in black). Members of the influential sets are highlighted with color and indicated by a cross (×, for $S_{0.02}$) and a crosshair (+, for $T_{0.03}$).

We draw 1,000 realizations of the process described above, and use four approaches to assess maximally influential sets (that lower coefficient values) of sizes one to ten.9

These are Algorithms 2 (labelled 'A2') and 1 ('A1'), as well as Algorithm 0, with exact influences ('A0') and the AMIP estimate ('B0'). The results are presented in Figure A6. Both variants of Algorithm 0 perform relatively poorly, and do not recognize the influence of the first influential set. Algorithm 1 performs better, but does not reliably account for the influence of the second set. Estimates are considerably spread out, and the target value (after five removals) of unity in the slope coefficient is missed on average. The impact of removals levels off afterwards. By contrast, Algorithm 2 reliably finds both sets, and continues to identify impactful observations after their removal.
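The conceptual difference between the approaches can be sketched as follows: Algorithm 0 ranks observations once and removes the top k, whereas Algorithm 2 refits and re-ranks after every removal. A minimal illustration using exact influences (our own stripped-down version, not the paper's implementation):

```python
import numpy as np

def influence(x, y):
    """Exact single-deletion influence (DFBETA) on the slope."""
    beta = (x @ y) / (x @ x)
    e = y - x * beta
    h = x**2 / (x @ x)
    return x * e / ((x @ x) * (1.0 - h))

def algorithm_0(x, y, k):
    """One-shot: rank observations once, drop the k largest influences."""
    return np.argsort(-influence(x, y))[:k]

def algorithm_2(x, y, k):
    """Iterative: refit and re-rank after every single removal."""
    idx = np.arange(len(x))
    removed = []
    for _ in range(k):
        i = int(np.argmax(influence(x[idx], y[idx])))
        removed.append(int(idx[i]))
        idx = np.delete(idx, i)
    return np.array(removed)

rng = np.random.default_rng(7)
x = rng.standard_t(8, size=100)
y = x + rng.standard_t(8, size=100)
set_a0 = algorithm_0(x, y, 5)
set_a2 = algorithm_2(x, y, 5)
```

The iterative variant is what allows Algorithm 2 to detect observations whose influence is masked until other members of the set have been removed.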

Figure A6: Transparent lines indicate individual runs, thick lines the average results of (from top to bottom) approach 'B0' (gray, dashed), 'A0' (green, solid), 'A1' (purple, dashed), and 'A2' (teal, solid). The vertical axis indicates estimates, the horizontal one the number of removals.

A4 Method sensitivities for OLS and 2SLS

Sensitivity to influential sets depends on the measure of influence and the estimator used. We illustrate the role of the particular estimator by comparing influential sets for OLS and 2SLS coefficient estimates. Again, we use the univariate regression from Section A2 (with β = 1, and innovations and covariates drawn from a t(8) distribution) with N = 100, and add a repeated layer of this setup to construct the instrumented variable. This gives us the following setup for 2SLS estimation:

$$x \sim z + u, \quad \text{with } u \sim t(8) \text{ and } z \sim t(8),$$
$$y \sim x + e, \quad \text{with } e \sim t(8).$$

We use Algorithm 2 to find and assess the influence of maximally influential sets from size one to 95, and replicate the exercise 1,000 times.

9 The average estimates are $\beta_{LS} = 1.91$, $\hat{\beta}_{(S)} = 1.38$, and $\hat{\beta}_{(\{S,T\})} = 1.00$. Interestingly, robust M-estimation slightly undershoots the influence of the first set, at $\beta_M = 1.43$.


The results of our simulation exercise are visualized in Figure A7. On the vertical axes, we depict the OLS (top) and 2SLS (bottom) estimates after removing influential sets of increasing size (horizontal axis). OLS estimates remain steady until about twenty observations are left in the sample. After this point, we observe a limited divergence of estimates. The mean and median of the estimates stay close to each other throughout, indicating no serious numerical problems. This is not the case for the 2SLS estimates, where estimates start to diverge after about twenty removals. At 24 removals, we have the first drop-out due to a lack of numerical stability, which we indicate with a green cross on top. Before this, we can see a short spike in the mean estimate, which then drifts off again. Overall, about three quarters of the simulated runs for 2SLS estimation terminate prematurely due to a lack of numerical stability. At around 50 removals, more than half of the remaining 2SLS estimates are pathological (see median estimate), with vast fluctuations afterwards.

These results indicate that even with heavy-tailed errors and leverage, the OLS estimates of an otherwise correctly specified linear regression model are not particularly sensitive to influential sets. However, the 2SLS estimates appear to be susceptible to influential sets and numerical issues, even with this simple setup.


Figure A7: Transparent lines indicate (250 samples of) individual runs; thick lines indicate the median (solid, blue), the 95% and 5% quantiles (dashed), and the average (dotted, pink) of the estimate. Crosses at the top of the 2SLS panel indicate drop-outs due to a lack of numerical stability (within machine precision). The vertical axes indicate estimates for OLS (top) and 2SLS (bottom); the horizontal ones the number of observations left.


B Additional results and applications

In this section, we provide supporting material for the applications discussed in the paper.

B1 The origins of mistrust

Table A2: Distribution of the top 600 most influential observations on trust in relatives

                 Benin   Nigeria   Ghana   Other
Top 100          54      12        19      15
Top 101–200      85      6         5       4
Top 201–300      80      12        7       1
Top 301–400      52      27        14      7
Top 401–500      3       67        6       24
Top 501–600      3       69        15      13
Top 600          277     193       66      64

Summary of the origin countries of the 600 most influential observations affecting the effects of slave trades on the trust of relatives (cf. Column 1 of Table 2).

B2 The effects of the Tsetse fly

Table A3: The robust effects of the Tsetse fly

              Animals   Intensive   Plow   Female   Density   Slavery   Centralized
M-estimate    -0.259    -0.102      –      0.227    -0.761    –         -0.080
              (-5.53)   (-3.29)     –      (3.71)   (-3.00)   –         (-2.12)
S-estimate    –         –           –      –        -0.761    –         –
              –         –           –      –        (-4.63)   –         –
Observations  484       485         484    315      398       446       467

Robust M- and S-estimates for the effects of the Tsetse fly (cf. Table 3). Coefficient estimates are reported with t values based on clustered (for M-estimation) and classical (for S-estimation) standard errors in parentheses.


B3 Poverty convergence and influential sets

Ravallion (2012) examines the rates of convergence in a sample of 89 countries using the following linear regression model

$$T_i^{-1}(\ln H_{it} - \ln H_{i,t-1}) = \alpha + \beta \ln H_{i,t-1} + \varepsilon_{it}, \qquad (\mathrm{A2})$$

where $H_{it}$ denotes the poverty headcount ratio in country i and time period t, and $T_i$ is the length of its observation period in years. In his study, Ravallion (2012) does not detect poverty convergence, instead obtaining a positive, statistically insignificant estimate of β, which we reproduce in Column 1 of Table A4.
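The specification in Equation A2 amounts to a cross-country OLS regression of annualized log changes on initial levels. A sketch with synthetic placeholder data (not Ravallion's dataset; all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 89                                       # countries, as in the baseline
T = rng.integers(5, 30, size=n)              # observation period in years
h0 = rng.uniform(0.02, 0.6, size=n)          # initial headcount ratios
h1 = np.clip(h0 * np.exp(rng.normal(-0.3, 0.4, size=n)), 1e-4, 1.0)

growth = (np.log(h1) - np.log(h0)) / T       # annualized log difference
X = np.column_stack([np.ones(n), np.log(h0)])
alpha, beta = np.linalg.lstsq(X, growth, rcond=None)[0]
```

The log-transformation on the left-hand side is what makes countries with low initial headcount ratios (and hence large relative changes) so influential in this setup.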

Table A4: Sensitivity of poverty convergence

                               Baseline   Eastern Europe   Alternative
Convergence†                   0.006      -0.020           -0.019
                               (0.84)     (-2.88)          (-9.68)
Convergence, Eastern Europe    –          0.044            –
                               –          (2.12)           –
Thresholds†                    –[1]{4}    3[10]{24}        26[32]{42}
Observations                   89         89               124
R²                             0.375      0.375            0.607

The row labelled 'Thresholds' reports the sizes of influential sets that induce a loss of significance (at the 5% level), [a sign flip], and {a significant sign flip} of the coefficient that captures the poverty convergence effect (using Algorithm 2). The column labelled 'Baseline' reproduces the results of Ravallion (2012); the one labelled 'Eastern Europe' adds an interaction term for Eastern European countries. The column labelled 'Alternative' uses a different specification (proposed by Crespo Cuaresma et al., 2016) and an updated dataset. Coefficients are reported with t values in parentheses.

This finding may be due to idiosyncratic experiences in the former Eastern bloc. As Crespo Cuaresma et al. (2022, 2016) point out, these countries exhibit low initial poverty headcount ratios, implying that small absolute changes translate into large growth rates. This is due to the log-transformation applied, and makes observations of these countries particularly influential on OLS estimates. Crespo Cuaresma et al. (2016) consider two alternative specifications that yield empirical evidence of poverty convergence in the original dataset. These are (1) an extension of Equation A2 that controls for the experience of Eastern European countries, and (2) a model that is based on a semi-elastic relationship between poverty reduction and growth (see Klasen and Misselhorn, 2008), effectively dropping the log-transformation from Equation A2. See Table A4, Columns 2 and 3 for estimates.

We revisit this issue, and assess the sensitivity of poverty convergence to influential sets, starting with the model in Equation A2.

Figure A8: Data and regression line for Ravallion (2012) before (solid line) and after (dashed line) removing the influential set $\hat{S}^*_4$ (colored and marked with crosses, then crosshairs). The vertical axis holds annualized log differences of poverty headcount ratios; the horizontal axis the logarithm of the initial poverty headcount ratio.

Figure A8 presents the data and regression

line of Ravallion (2012), as well as the minimal influential set needed to attain significant poverty convergence (see 'Thresholds' in Table A4), as identified by Algorithm 2. The four members of this set are Belarus, Latvia, Ukraine, and Poland.10 In this setting, a sensitivity check for influential sets can clearly compensate for (or augment) domain knowledge. We investigate the two additional specifications next. When accounting for the experience of Eastern Europe, we find a significance threshold of three observations. For the alternative specification, we source an updated dataset from PovCalNet,11 and find significant poverty convergence. Algorithm 2 indicates thresholds at 26 (insignificance, see Figure A9), 32 (sign-flip), and 42 (significant sign-flip) out of 124 observations. Algorithm 1 only indicates a loss of significance after 56 removals (cf. Figure A9).

B4 Migration, instrumental variables and influential sets

Migration reshapes global populations and, as a result, economic and cultural structures. In a recent study, Droller (2018) investigates the long-term impacts of European migration to Argentina in the late 19th and early 20th century. He uses a shift-share instrument to identify the effect of the share of European migration on GDP per capita in 2000, for a sample of 136 counties in the provinces of Buenos Aires, Santa Fe, Córdoba, and Entre Rios. The results of Droller (2018) show considerable impacts of European migration on GDP per capita, education, and skilled labor.

We reproduce these results in Table A5, and compute influential sets for the first-stage OLS and the full 2SLS estimates (using the updating formulae of Phillips, 1977). The first-stage results are comparatively insensitive to influential sets. For the 2SLS estimates (of

10 Subsequent removals would be the Russian Federation, Lithuania, Estonia, and