
“What Is the Effect of UN Involvement on Post-War Peace?”

We Have Different Answers - Which Do We Trust?

Assessing Model Behaviour on Extreme Counterfactuals.

Mayowa Osibodu.

Aditu.

mayowa@aditu.tech

Abstract:

This paper outlines reported issues that stem from applying statistical models to extreme counterfactuals: data points which lie far outside the scope of what the model was trained on.

It then proposes a technique to estimate model behaviour on these out-of-convex-hull data points, and to rank models based on their performance.

1. Causal Inference: The Causal Effect.

According to Rubin [1], the causal effect of a treatment A on an entity E, relative to a different treatment B, is described by the difference in the eventual states of E after independent applications of treatments A and B.

For example, the causal effect of “eating breakfast” on a person’s mood, compared to “not eating breakfast”, can be observed by assessing the difference in their mood with vs. without breakfast at noon.

This can be visualized as follows:

Fig. 1

Algebraically this is described as

Causal Effect = y(A) − y(B)    (1)

where y is a dependent variable representing the status of the entity E, y(A) is the value of y at time t2 given treatment A, and y(B) is the value of y at time t2 given treatment B. y(A) and y(B) are known as the factual and counterfactual terms.

This notion of a causal effect is described in contrast to a mere correlation between y and either of the treatments A or B. Such a correlation makes no definitive statement on whether changes in y are a result of the treatments. It only points out an apparent association.

For any given entity E however, only one of y(A) and y(B) is known. At time t2, E has been transformed into either E_A or E_B, but not both. E.g., at noon you’ve either eaten breakfast or you haven’t. Thus only one of the terms in expression (1) is known for certain.

How then is the Causal effect of treatment A over treatment B to be

estimated?

This conundrum is known as the “Fundamental Problem of Causal Inference” [2]. Rubin [1] describes a number of approaches to obtain usable estimates of the causal effect, in spite of this conundrum.

Methods of Causal inference are broadly used across a number of disciplines,

from Medicine to Political Science.

2. Extreme Counterfactuals.

Doyle and Sambanis [3] conducted a detailed study on 124 post-World War 2 civil

wars (the most recent ending in 1997). They outlined a number of factors

influencing eventual peacebuilding success in these wars, and analyzed the

nature of this influence.

One of their findings was that multilateral UN peacekeeping operations make a

positive difference on eventual peacebuilding outcomes.

This was done by fitting a logistic regression model to the data on civil wars,

and conducting inference with this model to determine the causal influence of

UN involvement (a variable named UNTYPE4) on eventual peacebuilding success

(as defined by their Strict Peacebuilding Success PBS2S3 metric).

According to their results, the odds of peacebuilding success are 23 times larger

with a multidimensional UN peacekeeping operation than without it,

accounting for confounding variables.

Essentially their findings provided an estimate for the causal effect y(UN Operation) − y(No UN Operation) with regard to the eventual state of peace in the concerned location.

We know from the Fundamental Problem of Causal Inference that for a given civil war, only one of y(UN Operation) and y(No UN Operation) is known with certainty. The UN either intervened or did not intervene in a given civil war.

The estimates of y(UN Operation) − y(No UN Operation) were obtained by analyzing peacebuilding outcomes over the entire dataset: contrasting outcomes of civil wars in which the UN intervened with those in which the UN didn’t.

King and Zeng [4], however, analyzed the statistical procedure involved in this

specific aspect of the study. They showed that such comparisons were

problematic because the factual and counterfactual terms were too different

from each other - data on UN intervention versus No UN intervention were too

dissimilar to be usefully compared.

Describing their perspective quantitatively: For wars in which the UN did not

intervene, the UNTYPE4 variable (UNOP4 in King’s paper) was originally set to

0. Setting this variable to 1 led to data points which were far outside the

convex hull of the available data. The same was observed in setting the

UNTYPE4 variable to 0 for wars in which the UN did intervene.

According to King and Zeng, not only did the counterfactuals all lie outside the

convex hull, but most were fairly extreme extrapolations well beyond the data.

Thus any resulting inference was more indicative of model specifics, than of

any actual suggestions in the data.

Below we view a diagram from King and Zeng’s paper [4]. It describes a straightforward case of such model dependence.

Fig. 2

We see that for an X value of 5 (possibly representing some distant counterfactual case outside the convex hull of the available data points), whatever prediction we have seems to be entirely dependent on our choice of model. At that distance, our prediction has practically nothing to do with the existing data.

Does this mean that absolutely no statement can be made from Doyle and

Sambanis’ dataset, on the causal effect of UN involvement on post-war peace?

Below we outline an approach to still generate some useful insights in such a

situation: We explore the possibility that although inference is unreliable for

extreme counterfactuals, some models can be judged as being more capable of

handling out-of-convex-hull data than others.

3. Drawing Insight from Cross-Validation:

For our approach, we draw insight from the technique of cross-validation. In cross-validation we aim to estimate how well a model trained on some dataset X1 (the training set) will perform on some other dataset X2 (the test set), where X1 and X2 are drawn from the same population.

This estimation is done by training the model on a subset of X1 (say X1A), and then using model performance on the rest of X1 (say X1B) as an estimate of possible performance on X2 (the test set).

The logic underlying this aims to test for model dependence. Essentially we are testing how well the model trained on X1 generalizes to X2 by learning the statistical patterns present in the underlying population distribution.

If the model learns the underlying patterns well, then it should demonstrate satisfactory accuracy on X2. However, if it learns anything other than population-wide patterns, then we expect suboptimal performance on X2.
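The hold-out logic described above can be sketched in a few lines of Python (a minimal illustration only: the nearest-class-mean toy model and all names here are ours, not from the paper's code):

```python
import random

# Toy model: classify a 1-D point by the nearer class mean.
def fit(xs, ys):
    means = {}
    for c in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(pts) / len(pts)
    return means

def predict(means, x):
    return min(means, key=lambda c: abs(x - means[c]))

def holdout_estimate(X1, labels, fit, predict, holdout_frac=0.3, seed=0):
    """Split X1 into X1A (train) and X1B (held out), fit on X1A, and
    return accuracy on X1B as an estimate of accuracy on an unseen X2."""
    rng = random.Random(seed)
    idx = list(range(len(X1)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - holdout_frac))
    train, held_out = idx[:cut], idx[cut:]
    model = fit([X1[i] for i in train], [labels[i] for i in train])
    correct = sum(predict(model, X1[i]) == labels[i] for i in held_out)
    return correct / len(held_out)

X1 = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]  # two well-separated classes
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(holdout_estimate(X1, labels, fit, predict))
```

Because X1A and X1B are drawn at random from the same sample, a model that learned the population-wide pattern scores well on the held-out portion.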

We apply this logic to convex hulls:

Given a dataset X, we can view data points within the convex hull of X as our X1 (training set), and out-of-convex-hull data points as X2 (test set).

By creating artificial convex hulls around a subset of X1 (say X1A), we can use model performance on the rest of X1 (say X1B, which is outside the convex hull of X1A) as an estimate for model performance on new data which is outside the convex hull of X1.

In contrast to cross-validation however, we are certain that the points in X2 are external and “far away” from the ones in X1. I.e., in this case there is a notion of distance, unlike in cross-validation, which only assumes two different samples drawn from the same population.

Here we extend this idea behind cross-validation and define it such that it takes the notion of distance into consideration.

4. The Gower Metric G2

King and Zeng [4] use two different metrics to describe how “distant and isolated”

a counterfactual is from the rest of the data. The first metric is a largely

qualitative one which employs the notion of the convex hull.

With this metric, a point x is “distant” from a set of points X, if x exists outside

the convex hull of X.

The second, more quantitative metric is the Gower metric G2. G2 is a number ranging from 0 upwards that describes how far a point x1 is from another point x2, relative to the spread of a set of values X:

G2 = |x1 − x2| / (max(X) − min(X))

(The one-dimensional case is shown here for simplicity.)

Thus given a dataset X and some point x2 representing a distant counterfactual, the G2 for some x1 (drawn from X) and x2 describes how “far away” x2 is from x1, relative to the spread of values in X.

In our approach to extract usable insight from extreme counterfactuals, we use the Gower metric G2 to define our notion of distance.
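The one-dimensional formula above translates directly into code (a minimal sketch; the function name is ours):

```python
def gower_g2(x1, x2, X):
    """One-dimensional Gower distance G2 between points x1 and x2,
    normalized by the spread (max - min) of the reference set X."""
    return abs(x1 - x2) / (max(X) - min(X))

X = [1.0, 2.0, 3.0, 4.0, 5.0]  # spread: max(X) - min(X) = 4
print(gower_g2(2.0, 4.0, X))   # the gap covers half the spread
```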

5. Convex Hull Iteration.

5.1 Definition.

Say we have a given set X, containing data points in two dimensions. A convex

hull iteration through X begins from the central element in X, and then

proceeds outwards element-wise, until it reaches the outermost element of X.

We illustrate this below:

Fig. 3

The diagram above represents X, containing 4 different data points. The first step of a convex hull iteration starts at the central element of X. The second step expands to the second most central element, and so on.

The third step of the iteration will look like this:

Fig. 4

Essentially the process involves iterating outwards through convex hulls

spreading throughout X.
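One simple way to realize this central-outwards traversal in code is to order the points by distance from the centroid (an illustrative sketch; the published code [5] may order points differently):

```python
def hull_iteration_order(points):
    """Return the points of a 2-D set ordered from the most central
    element outwards, by distance from the centroid."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    dist = lambda p: ((p[0] - cx) ** 2 + (p[1] - cy) ** 2) ** 0.5
    return sorted(points, key=dist)

X = [(0, 0), (1, 0), (0, 1), (5, 5)]
ordered = hull_iteration_order(X)
# Step n of the iteration covers the convex hull of ordered[:n].
print(ordered)
```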

5.2 Estimating Out-of-Convex-Hull Performance with Convex Hull Iteration.

Say we train a predictive model on the 3 innermost points of X - the 3 elements

within the convex hull illustrated above. We could test this model on the fourth

point - the point just outside the hull in the diagram. Since we have complete

information about this data point, we can instantly verify if our model’s

prediction was correct or not.

Recall the notion of distance, the Gower metric: we can calculate the Gower distance G2 between this fourth point and the central element of X. This quantifies our notion of distance, as G2 tells us how “far away” from the training set our fourth point is.

At this point we have two quantities: the prediction accuracy of our model for this fourth point, and the Gower distance at which we observed this accuracy.

We extend this logic:

1. We train a given model on the n-innermost elements (where n ranges

from some baseline to the total number of elements) of our dataset.

2. For the elements outside the given convex hull of n elements, we

calculate the prediction accuracy of our model, and at what Gower

distance this prediction accuracy was observed.

This provides us with data on how well our model generally performs across

different out-of-convex-hull Gower distances.
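Steps 1 and 2 above can be sketched as follows on one-dimensional data, with a nearest-class-mean classifier standing in for the models discussed in the paper (the toy data and all names here are illustrative):

```python
def gower_g2(x1, x2, X):
    # One-dimensional Gower distance, normalized by the spread of X.
    return abs(x1 - x2) / (max(X) - min(X))

def nearest_mean_fit(xs, ys):
    # Per-class mean of the training values.
    return {c: sum(x for x, y in zip(xs, ys) if y == c) /
               sum(1 for y in ys if y == c) for c in set(ys)}

def nearest_mean_predict(means, x):
    return min(means, key=lambda c: abs(x - means[c]))

def hull_iteration_scores(X, labels, n_min=3):
    """For each hull size n, train on the n innermost points and record,
    for every point outside that hull, (G2 from the centre, correct?)."""
    centre = sorted(X)[len(X) // 2]  # central element of X
    order = sorted(range(len(X)), key=lambda i: abs(X[i] - centre))
    scores = []
    for n in range(n_min, len(X)):
        inner = order[:n]
        means = nearest_mean_fit([X[i] for i in inner],
                                 [labels[i] for i in inner])
        for i in order[n:]:
            g2 = gower_g2(X[i], centre, X)
            correct = nearest_mean_predict(means, X[i]) == labels[i]
            scores.append((g2, correct))
    return scores

X = [0.0, 0.2, 0.4, 0.6, 2.0, 2.2, 2.4]
labels = [0, 0, 0, 0, 1, 1, 1]
for g2, ok in hull_iteration_scores(X, labels):
    print(round(g2, 2), ok)
```

Aggregating the recorded pairs by Gower distance yields an accuracy-versus-distance curve of the kind shown in the figures below.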

5.3 Inherent Assumption.

This approach assumes that actual out-of-convex-hull data is drawn from a distribution identical to that of the existing data points (the same assumption underlies the effectiveness of cross-validation). The effectiveness of this approach would be impaired in cases where this does not hold.

6. Demonstration:

After expressing the above technique in Python code [5], running it with different models on the well-known Iris classification dataset [6] gave the results shown below:

Fig. 5

In the diagram above, we see that all the model curves tend to follow a

characteristic trend: For data points very close to the training data they

perform optimally. Then as these points get farther away, the performance

drops quickly and fluctuates with increasing Gower distance.

A key observation is that although the different curves have similar shapes,

some models can be seen to perform consistently better than others.

The Random forest model, for example, consistently outperforms the Support Vector machine. Thus we generally expect that when predicting for extreme counterfactuals, the Random forest should make more accurate predictions than the other models.

The diagram below shows the results of applying the technique to the Wine quality classification dataset [7].

Fig. 6

We see a similar trend. There is a definitive indication that some models

perform consistently better at predicting out-of-convex-hull data, than others.

The Random forest, for example, consistently outperforms the Neural network.

A peculiar feature of these graphs is that some models seem to perform better

as the Gower distance axis approaches its maximum value. More analysis would

be required to understand why this happens.

7. Summary of Results:

We have discussed statistical predictions as generally being unreliable for

extreme counterfactuals.

The approach presented however, provides a way to rank model performance

on out-of-convex-hull data.

In addition to ranking models on out-of-convex-hull performance, the graphs

also provide estimates of expected model accuracy at given out-of-convex-hull

Gower distances.

For example, in Figure 5 the Random Forest is seen to have an average classification accuracy of about 0.5 at a Gower distance of 0.8. This could be

taken as an estimate of expected accuracy on a similar data point outside the

convex hull of the training set.

Such estimates should be made with caution however, because they rest on the strong assumption that the counterfactual concerned is drawn from a distribution identical to that of the training set. This could be difficult to verify for very extreme counterfactuals.

Acknowledgements:

The essential idea here started as a college paper supervised by Prof. Alexis

Diamond at Minerva University USA, in 2016. I’m appreciative of his input and

support during that period.

References:

1. Rubin, D. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology, Vol. 66, No. 5, 688-701.

Retrieved from http://www.fsb.muohio.edu/lij14/420_paper_Rubin74.pdf

2. King, G., Keohane, R. O. and Verba, S. (1994). Designing Social Inquiry:

Scientific Inference in Qualitative Research. Princeton: Princeton University

Press.

3. Doyle, M. W. and Sambanis, N. (2000) International Peacebuilding: A

Theoretical and Quantitative Analysis. American Political Science Review.

Retrieved from https://www.jstor.org/stable/2586208

4. King, G. and Zeng, L. (2007) When Can History Be Our Guide? The Pitfalls of

Counterfactual Inference. International Studies Quarterly. Pp 183-210.

Retrieved from https://gking.harvard.edu/files/gking/files/counterf.pdf

5. Osibodu, M. Convex Hull Iteration. GitHub.

Retrieved from https://github.com/mayowaosibodu/Convex-Hull-Iteration

6. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. University of

California, School of Information and Computer Science, Irvine, CA, USA.

Retrieved from https://archive.ics.uci.edu/ml/datasets/iris

7. Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009). Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decision Support Systems, Elsevier, 47(4):547-553.

Retrieved from https://archive.ics.uci.edu/ml/datasets/wine+quality