“What Is the Effect of UN Involvement on Post-War Peace?”
We Have Different Answers - Which Do We Trust?
Assessing Model Behaviour on Extreme Counterfactuals.
Mayowa Osibodu.
This paper outlines reported issues that stem from applying statistical models
to extreme counterfactuals - data points which are far outside the scope of
what the model was trained on.
It then proposes a technique to estimate model behaviour on these
out-of-convex-hull datapoints, and rank models based on their performance.
1. Causal Inference: The Causal Effect.
According to Rubin [1], the causal effect of a treatment A on an entity E, relative
to a different treatment B, is the difference in the eventual states
of E after independent applications of treatments A and B.
For example, the causal effect of “eating breakfast” on a person’s mood
compared to “not eating breakfast” can be observed by assessing the
difference in their mood at noon with versus without breakfast.
This can be visualized as follows:
Fig. 1
Algebraically this is described as

Causal Effect = $y(A) - y(B)$   (1)

where $y$ is a dependent variable representing the status of the entity E,
$y(A)$ is the value of $y$ at time $t_2$ given treatment A,
and $y(B)$ is the value of $y$ at time $t_2$ given treatment B.
$y(A)$ and $y(B)$ are known as the factual and counterfactual terms.
This notion of a causal effect is described in contrast to a mere correlation
between $y$ and either of the treatments A or B. Such a correlation makes no
definitive statement on whether changes in $y$ are a result of the treatments.
It only points out an apparent association.
For any given entity E, however, only one of $y(A)$ and $y(B)$ is known. At time $t_2$,
E has been transformed by either treatment A or treatment B, but not both. E.g., at noon you’ve
either eaten breakfast, or you haven’t. Thus only one of the terms in
expression (1) is known for certain.
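The point can be made concrete with a toy simulation. The sketch below is illustrative only: the three records and their mood scores are invented, and the names `causal_effect` and `observe` are hypothetical helpers. In a simulation both potential outcomes can be stored, so expression (1) is computable; the observed view keeps only the factual outcome, which is all an analyst would have.

```python
# Toy simulation: every number below is invented for illustration.
# In a simulation we can store BOTH potential outcomes for each person;
# in real data only the outcome under the treatment received is recorded.
people = [
    {"id": 1, "y_A": 7, "y_B": 5, "received": "A"},
    {"id": 2, "y_A": 6, "y_B": 6, "received": "B"},
    {"id": 3, "y_A": 8, "y_B": 4, "received": "B"},
]

def causal_effect(person):
    """Expression (1): y(A) - y(B). Requires both potential outcomes."""
    return person["y_A"] - person["y_B"]

def observe(person):
    """Return only the factual outcome, as an analyst would see it."""
    outcome = person["y_" + person["received"]]
    return {"id": person["id"], "received": person["received"], "y": outcome}

true_effects = [causal_effect(p) for p in people]
observed = [observe(p) for p in people]

print(true_effects)
for row in observed:
    print(row)
```

None of the observed rows contains enough information to recompute `true_effects`, which is the conundrum in miniature.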
How then is the causal effect of treatment A over treatment B to be determined?
This conundrum is known as the “Fundamental Problem of Causal Inference” [2].
Rubin [1] describes a number of approaches to obtain usable estimates of the
causal effect, in spite of the mentioned conundrum.
Methods of Causal inference are broadly used across a number of disciplines,
from Medicine to Political Science.
2. Extreme Counterfactuals.
Doyle and Sambanis [3] conducted a detailed study on 124 post-World War 2 civil
wars (the most recent ending in 1997). They outlined a number of factors
influencing eventual peacebuilding success in these wars, and analyzed the
nature of this influence.
One of their findings was that multilateral UN peacekeeping operations make a
positive difference on eventual peacebuilding outcomes.
This was done by fitting a logistic regression model to the data on civil wars,
and conducting inference with this model to determine the causal influence of
UN involvement (a variable named UNTYPE4) on eventual peacebuilding success
(as defined by their Strict Peacebuilding Success PBS2S3 metric).
According to their results, the odds of peacebuilding success are 23 times larger
with a multidimensional UN peacekeeping operation than without it,
after accounting for confounding variables.
Essentially their findings provided an estimate for the causal effect
$y(\text{UN Operation}) - y(\text{No UN Operation})$ with regard to the eventual state of
peace in the concerned location.
We know from the Fundamental Problem of Causal Inference that for a given
civil war, only one of $y(\text{UN Operation})$ and $y(\text{No UN Operation})$ is known with
certainty. The UN either intervened or did not intervene in a given civil war.
The estimates of $y(\text{UN Operation}) - y(\text{No UN Operation})$ were obtained by
analyzing peacebuilding outcomes over the entire dataset, contrasting
outcomes of civil wars in which the UN intervened with those in which the UN
did not.
King and Zeng [4], however, analyzed the statistical procedure involved in this
specific aspect of the study. They showed that such comparisons were
problematic because the factual and counterfactual terms were too different
from each other: data on UN intervention versus no UN intervention were too
dissimilar to be usefully compared.
Describing their perspective quantitatively: For wars in which the UN did not
intervene, the UNTYPE4 variable (UNOP4 in King’s paper) was originally set to
0. Setting this variable to 1 led to data points which were far outside the
convex hull of the available data. The same was observed in setting the
UNTYPE4 variable to 0 for wars in which the UN did intervene.
According to King and Zeng, not only did the counterfactuals all lie outside the
convex hull, but most were fairly extreme extrapolations well beyond the data.
Thus any resulting inference was more indicative of model specifics, than of
any actual suggestions in the data.
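A small two-dimensional sketch can show how flipping a single variable produces points outside the convex hull even though each coordinate stays within its observed range. Everything here is an assumption made for illustration: the six (treatment, covariate) pairs are invented, and `convex_hull` and `in_hull` are hypothetical helpers written for the 2D case only. It does not reproduce King and Zeng's procedure or the Doyle and Sambanis data.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_hull(point, hull):
    """True if point is inside or on a counter-clockwise convex polygon.
    Assumes the hull has at least 3 non-collinear vertices."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))

# Invented toy data: (treatment flag, some covariate).
# Untreated cases have low covariate values, treated cases have high ones.
data = [(0, 1.0), (0, 2.0), (0, 3.0), (1, 6.0), (1, 7.0), (1, 8.0)]
hull = convex_hull(data)

# Counterfactuals: flip the treatment flag, keep the covariate unchanged.
counterfactuals = [(1 - t, c) for t, c in data]

factual_inside = [in_hull(p, hull) for p in data]
counterfactual_inside = [in_hull(p, hull) for p in counterfactuals]
print(factual_inside)
print(counterfactual_inside)
```

In this toy set every counterfactual falls outside the hull, because treated and untreated cases occupy different regions of the covariate.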
Below we view a diagram from King and Zeng’s paper [4]. It describes a
straightforward case of such model dependence.
Fig. 2
We see that for an x value of 5 (possibly representing some distant
counterfactual case outside the convex-hull of the available data points),
whatever prediction we have seems to be entirely dependent on our choice of
model. At that distance, our prediction has practically nothing to do with the
existing data.
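The same effect can be reproduced numerically. The sketch below is a stand-in for the figure, not a reconstruction of it: the six data points are invented, `fit_line` is a hypothetical helper, and the two models (linear in x, linear in x squared) are chosen only because both have a closed-form least-squares fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Invented observations, all with x between 1 and 2.
xs = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
ys = [1.0, 1.25, 1.38, 1.65, 1.77, 2.05]

# Model 1: y is linear in x.
a1, b1 = fit_line(xs, ys)
# Model 2: y is linear in x squared.
a2, b2 = fit_line([x ** 2 for x in xs], ys)

def linear(x):
    return a1 + b1 * x

def squared(x):
    return a2 + b2 * x ** 2

gap_inside = abs(linear(1.5) - squared(1.5))   # within the observed range
gap_outside = abs(linear(5.0) - squared(5.0))  # far beyond the observed range
print(round(gap_inside, 3), round(gap_outside, 3))
```

Inside the observed range the two fits are nearly indistinguishable; at x = 5 they disagree by several units, so the prediction there reflects the model choice rather than the data.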
Does this mean that absolutely no statement can be made from Doyle and
Sambanis’ dataset, on the causal effect of UN involvement on post-war peace?
Below we outline an approach to still generate some useful insights in such a
situation: We explore the possibility that although inference is unreliable for
extreme counterfactuals, some models can be judged as being more capable of
handling out-of-convex-hull data than others.
3. Drawing Insight from Cross-Validation:
For our approach, we draw insight from the technique of cross-validation. In
cross-validation we aim to estimate how well a model trained on some dataset
X1 (the training set) will perform on some other dataset X2 (the test set),
where X1 and X2 are drawn from the same population.
This estimation is done by training the model on a subset of X1 (say X1A), and
then using model performance on the rest of X1 (say X1B) as an estimate of
possible performance on X2 (the test set).
The logic underlying this aims to test for model dependence. Essentially we are
testing how well the model trained on X1 generalizes to X2 by learning the
statistical patterns present in the underlying population distribution.
If the model learns the underlying patterns well, then it should demonstrate
satisfactory accuracy on X2. However if it learns anything other than
population-wide patterns, then we expect suboptimal performance on X2.
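As a minimal sketch of this hold-out logic, assuming one-dimensional features and a nearest-class-mean classifier (both chosen purely for brevity), the example below trains on X1A, scores on X1B, and compares that estimate with the score on X2. All values are invented and the helper names are hypothetical.

```python
def fit_nearest_mean(xs, ys):
    """Model = mean feature value per class label."""
    groups = {}
    for x, label in zip(xs, ys):
        groups.setdefault(label, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def predict(model, x):
    """Predict the label whose class mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def accuracy(model, xs, ys):
    return sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)

# Invented training set X1 and test set X2 from the same two-cluster population.
X1 = [0.2, 9.9, 0.8, 10.4, 1.1, 9.5, 0.5, 10.1]
y1 = [0, 1, 0, 1, 0, 1, 0, 1]
X2 = [0.4, 10.6, 1.3, 9.2]
y2 = [0, 1, 0, 1]

# Split X1 into X1A (used for fitting) and X1B (held out).
half = len(X1) // 2
X1A, y1A = X1[:half], y1[:half]
X1B, y1B = X1[half:], y1[half:]

model = fit_nearest_mean(X1A, y1A)
estimated = accuracy(model, X1B, y1B)  # estimate obtained without touching X2
actual = accuracy(model, X2, y2)       # what the estimate is meant to approximate
print(estimated, actual)
```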
We apply this logic to Convex hulls:
Given a dataset X, we can view data points within the convex hull of X as our X1
(training set), and out-of-convex-hull data points as X2 (test set).
By creating artificial convex hulls around a subset of X1 (say X1A), we can use
model performance on the rest of X1 (say X1B, which is outside the convex hull
of X1A) as an estimate for model performance on new data which is outside the
convex hull of X1.
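One way such a split could be built in two dimensions is sketched below. The choices here are assumptions, not taken from the paper: X1A is taken to be the k points nearest the centroid, X1B is every remaining point that falls outside the hull of X1A, and the eight points are invented. `convex_hull`, `in_hull` and `hull_split` are hypothetical helpers.

```python
import math

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_hull(point, hull):
    """True if point is inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))

def hull_split(points, k):
    """Return (X1A, X1B): the k points nearest the centroid, and the
    remaining points that lie outside the convex hull of X1A."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    ordered = sorted(points, key=lambda p: math.dist(p, (cx, cy)))
    inner, rest = ordered[:k], ordered[k:]
    hull = convex_hull(inner)
    return inner, [p for p in rest if not in_hull(p, hull)]

# Invented points: a small central cluster plus three outlying points.
X1 = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (3, 3), (-3, 2), (2, -3)]
X1A, X1B = hull_split(X1, k=5)
print(X1A)
print(X1B)
```

A model fitted on `X1A` and scored on `X1B` is then being scored on points that are, by construction, outside the hull of its training data.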
In contrast to cross-validation, however, we are certain that the points in X2 are
external to and “far away” from the ones in X1. I.e., in this case there is a notion
of distance, unlike in cross-validation, which only assumes two different samples
drawn from the same population.
Here we extend the idea behind cross-validation and define it such that it
takes into consideration the notion of distance.
4. The Gower Metric G2
King and Zeng [4] use two different metrics to describe how “distant and isolated”
a counterfactual is from the rest of the data. The first metric is a largely
qualitative one which employs the notion of the convex hull.
With this metric, a point x is “distant” from a set of points X, if x exists outside
the convex hull of X.
The second, more quantitative metric is the Gower metric G2. G2 is a number
ranging from 0 upwards that describes how far a point $x_1$ is from another point
$x_2$, relative to the spread of a set of values X.

$$G2 = \frac{|x_1 - x_2|}{\max(X) - \min(X)}$$

(The one-dimensional case is shown here for simplicity.)
Thus given a dataset X and some point $x_2$ representing a distant counterfactual,
the G2 for some $x_1$ (drawn from X) and $x_2$ describes how “far away” $x_2$ is from
$x_1$, relative to the spread of values in X.
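A minimal sketch of the one-dimensional formula follows. The multi-dimensional version shown alongside it is an assumption: it averages the one-dimensional distance over the variables, which is a common way to extend Gower's distance but is not spelled out in the text above. The function names and example values are invented.

```python
def gower_1d(x1, x2, values):
    """One-dimensional G2: |x1 - x2| divided by the spread of `values`."""
    spread = max(values) - min(values)
    if spread == 0:
        raise ValueError("all values are identical, so the spread is zero")
    return abs(x1 - x2) / spread

def gower(p, q, data):
    """Assumed multi-dimensional extension: the mean of the
    one-dimensional distances across all variables."""
    distances = []
    for j in range(len(p)):
        column = [row[j] for row in data]
        distances.append(gower_1d(p[j], q[j], column))
    return sum(distances) / len(distances)

# One-dimensional example: X spans 2..10, the counterfactual sits at 14.
X = [2, 4, 6, 10]
d1 = gower_1d(4, 14, X)

# Two-dimensional example with invented rows.
data = [(0, 0), (2, 10), (4, 20)]
d2 = gower((2, 10), (6, 30), data)
print(d1, d2)
```

Values above 1 are possible when the second point lies beyond the observed range, which is the situation of interest for extreme counterfactuals.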
In our approach to extract usable insight from extreme counterfactuals, we use
the Gower metric G2 to define our notion of distance.
5. Convex Hull Iteration.
5.1 Definition.
Say we have a given set X, containing data points in two dimensions. A convex
hull iteration through X begins from the central element in X, and then
proceeds outwards element-wise, until it reaches the outermost element of X.
We illustrate this below:
Fig. 3
The diagram above represents X, containing 4 different data points. The first
step of a convex hull iteration starts at the central element of X. The second
step expands to the second central element, etc.
The third step of the iteration will look like this:
Fig. 4
Essentially the process involves iterating outwards through convex hulls
spreading throughout X.
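The iteration can be sketched as follows. The text does not define “central element” precisely, so this sketch assumes it is the point closest to the centroid of X, and that “proceeding outwards” means visiting points in order of increasing Euclidean distance from that centroid. The four points are invented, and `centre_out_order` and `hull_iteration` are hypothetical names.

```python
import math

def centre_out_order(points):
    """Sort points from the most central to the outermost, measuring
    centrality as Euclidean distance to the centroid (an assumption)."""
    dims = len(points[0])
    centroid = tuple(sum(p[j] for p in points) / len(points) for j in range(dims))
    return sorted(points, key=lambda p: math.dist(p, centroid))

def hull_iteration(points, start=1):
    """Yield (inside, outside) pairs, growing the inner set one point at a time."""
    ordered = centre_out_order(points)
    for n in range(start, len(ordered) + 1):
        yield ordered[:n], ordered[n:]

# Four invented points, loosely mirroring the four-point illustration.
X = [(0, 0), (1, 0.5), (-1.5, 1), (4, 4)]
steps = list(hull_iteration(X))
for inside, outside in steps:
    print(len(inside), inside, outside)
```

At the third step the inner set holds three points and one point remains outside, matching the situation described for the third step of the iteration.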
5.2. Estimating Out-of-Convex-Hull Performance with Convex Hull Iteration.
Say we train a predictive model on the 3 innermost points of X - the 3 elements
within the convex hull illustrated above. We could test this model on the fourth
point - the point just outside the hull in the diagram. Since we have complete
information about this data point, we can instantly verify if our model’s
prediction was correct or not.
Recall our notion of distance, the Gower metric. We can calculate the
Gower distance G2 between this fourth point and the central element of X.
This quantifies our notion of distance, as G2 tells us how “far away” from the
training set our fourth point is.
At this point we have two quantities: the prediction accuracy of our model for
this fourth point, and the Gower distance at which we observed this accuracy.
We extend this logic:
1. We train a given model on the n-innermost elements (where n ranges
from some baseline to the total number of elements) of our dataset.
2. For the elements outside the given convex hull of n elements, we
calculate the prediction accuracy of our model, and at what Gower
distance this prediction accuracy was observed.
This provides us with data on how well our model generally performs across
different out-of-convex-hull Gower distances.
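The two steps above can be sketched end to end. This is not the author's implementation: it assumes points are ordered by Euclidean distance from the centroid, treats every point not among the n innermost as “outside”, measures Gower distance to the central element relative to the spread of the full dataset, and uses a 1-nearest-neighbour classifier as a stand-in model. The data and all function names are invented.

```python
import math

def gower(p, q, data):
    """Mean over variables of |p_j - q_j| / (range of variable j in data)."""
    total = 0.0
    for j in range(len(p)):
        column = [row[j] for row in data]
        total += abs(p[j] - q[j]) / (max(column) - min(column))
    return total / len(p)

def predict_1nn(train_X, train_y, x):
    """Stand-in model: label of the nearest training point."""
    nearest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[nearest]

def convex_hull_iteration(X, y, baseline=2):
    """Return (gower_distance, correct) records for every outer point
    at every step of the centre-out iteration."""
    dims = len(X[0])
    centroid = tuple(sum(p[j] for p in X) / len(X) for j in range(dims))
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], centroid))
    centre = X[order[0]]
    records = []
    for n in range(baseline, len(X)):
        inner, outer = order[:n], order[n:]
        train_X = [X[i] for i in inner]
        train_y = [y[i] for i in inner]
        for i in outer:
            correct = predict_1nn(train_X, train_y, X[i]) == y[i]
            records.append((gower(X[i], centre, X), correct))
    return records

# Invented two-class toy data.
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = [0, 0, 0, 1, 1, 1]
records = convex_hull_iteration(X, y)
for distance, correct in records:
    print(round(distance, 3), correct)
```

Each record pairs a Gower distance with whether the prediction at that distance was correct; plotting accuracy against distance for several models gives curves of the kind discussed in the demonstration.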
5.3 Inherent Assumption.
This approach assumes actual out-of-convex-hull data is drawn from the
same distribution as the existing data points (the same assumption underlies
the effectiveness of cross-validation). The effectiveness of this approach would
be impaired in cases where this is not certain.
6. Demonstration:
After expressing the above technique in Python code [5], running it with different
models on the well-known Iris classification dataset [6] gave the results shown
below:
Fig. 5
In the diagram above, we see that all the model curves tend to follow a
characteristic trend: For data points very close to the training data they
perform optimally. Then as these points get farther away, the performance
drops quickly and fluctuates with increasing Gower distance.
A key observation is that although the different curves have similar shapes,
some models can be seen to perform consistently better than others.
The Random Forest model, for example, consistently outperforms the Support
Vector Machine. Thus we generally expect that when predicting for extreme
counterfactuals, the Random Forest should have more accurate predictions
than the models it outperforms here.
The diagram below shows results of applying the technique to the Wine quality
classification dataset [7].
Fig. 6
We see a similar trend. There is a definitive indication that some models
perform consistently better at predicting out-of-convex-hull data, than others.
The Random forest for example, consistently outperforms the Neural network.
A peculiar feature of these graphs is that some models seem to perform better
as the Gower distance axis approaches its maximum value. More analysis would
be required to understand why this happens.
7. Summary of Results:
We have discussed statistical predictions as generally being unreliable for
extreme counterfactuals.
The approach presented however, provides a way to rank model performance
on out-of-convex-hull data.
In addition to ranking models on out-of-convex-hull performance, the graphs
also provide estimates of expected model accuracy at given out-of-convex-hull
Gower distances.
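A sketch of how such a ranking and per-distance estimate could be computed from (Gower distance, correct) records is given below. The records, the model names `model_a` and `model_b`, the bin width of 0.2 and both helper functions are invented for illustration; they are not results from the paper.

```python
def accuracy_by_distance(records, bin_width=0.2):
    """Mean accuracy per Gower-distance bin.
    `records` is a list of (distance, correct) pairs."""
    hits, counts = {}, {}
    for distance, correct in records:
        b = int(distance // bin_width)
        hits[b] = hits.get(b, 0) + int(correct)
        counts[b] = counts.get(b, 0) + 1
    return {round(b * bin_width, 10): hits[b] / counts[b] for b in sorted(counts)}

def rank_models(results):
    """Order model names by overall out-of-hull accuracy, best first."""
    overall = {
        name: sum(correct for _, correct in recs) / len(recs)
        for name, recs in results.items()
    }
    return sorted(overall, key=overall.get, reverse=True)

# Invented records for two hypothetical models.
results = {
    "model_a": [(0.1, True), (0.3, True), (0.5, False), (0.7, True)],
    "model_b": [(0.1, True), (0.3, False), (0.5, False), (0.7, False)],
}

ranking = rank_models(results)
curve_a = accuracy_by_distance(results["model_a"])
print(ranking)
print(curve_a)
```

Reading a value off `curve_a` at a given bin corresponds to reading an expected accuracy off one of the plotted curves at that Gower distance.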
For example in Figure 5, the Random Forest is seen to have average
classification accuracy of about 0.5 at a Gower distance of 0.8. This could be
taken as an estimate of expected accuracy on a similar data point outside the
convex hull of the training set.
Such estimations should be done with caution, however, because they strongly
assume that the concerned counterfactual is drawn from the same
distribution as the training set. This could be difficult to establish for very
extreme counterfactuals.
The essential idea here started as a college paper supervised by Prof. Alexis
Diamond at Minerva University USA, in 2016. I’m appreciative of his input and
support during that period.
1. Rubin, D. (1974). Estimating Causal Effects of Treatments in Randomized
and Nonrandomized Studies. Journal of Educational Psychology, 66(5), 688-701.
2. King, G., Keohane, R. O. and Verba, S. (1994). Designing Social Inquiry:
Scientific Inference in Qualitative Research. Princeton: Princeton University
Press.
3. Doyle, M. W. and Sambanis, N. (2000). International Peacebuilding: A
Theoretical and Quantitative Analysis. American Political Science Review.
4. King, G. and Zeng, L. (2007). When Can History Be Our Guide? The Pitfalls of
Counterfactual Inference. International Studies Quarterly, pp. 183-210.
5. Osibodu, M. Convex Hull Iteration. GitHub.
6. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. University of
California, School of Information and Computer Science, Irvine, CA, USA.
7. Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009).
Modeling wine preferences by data mining from physicochemical properties.
Decision Support Systems, 47(4), 547-553.