Joint Clustering and Feature Selection in Data Envelopment Analysis
Sandra Benítez-Peña1, Peter Bogetoft2, and Dolores Romero Morales2
1Department of Statistics, Universidad Carlos III de Madrid, Getafe, Spain
sbenitez@est-econ.uc3m.es
2Department of Economics, Copenhagen Business School, Frederiksberg, Denmark
{pb.eco,drm.eco}@cbs.dk
August 30, 2022
Abstract
In modelling the relative performance of a set of Decision Making Units (DMUs), a common
challenge is to account for heterogeneity in the services they provide and the settings in which they
operate. One solution is to include many features in the model and thereby use a one-fits-all model
that is sufficiently complex to account for this heterogeneity. Another approach is to introduce
several but simpler models for different clusters of the DMUs. In this paper, we investigate the
joint problem of DMU clustering and feature selection. The goal is to find a small number of clusters
of DMUs and the features that can be used in each cluster to maximize the average efficiency of the
DMUs. We formulate this as a Mixed Integer Linear Optimization problem and propose a collection
of constructive heuristics based on different types of similarity between DMUs. The approach is
used on a real-world dataset from the benchmarking of electricity Distribution System Operators,
as well as on simulated data. We show that by introducing clusters we can considerably reduce the
number of features necessary to get high efficiency levels.
Keywords: Data Envelopment Analysis; Heterogeneity; Clustering; Feature Selection; Mixed
Integer Linear Optimization
1 Introduction
Relative performance evaluation or benchmarking is a popular tool to evaluate firms and organizations.
There are many documented examples of the use of benchmarking in both the private and the public
sector, see [5] and the references therein. Examples include the applications in energy (e.g., electricity
distribution), health care (e.g., hospitals, doctors), education (e.g., schools, universities), and public
administration (e.g., municipalities). The nonparametric mathematical optimization approach Data
Envelopment Analysis (DEA) is one of the most popular benchmarking tools [7,11,15,17,19,20,22,
23].
Starting from a set of Decision Making Units (DMUs) that consume the same type of inputs to
produce the same type of outputs, DEA estimates the production frontier and measures the efficiency
of each DMU relative to this best practice frontier. In its simplest form, DEA computes an efficiency
score using Linear Programming (LP) which defines the relative efficiency of a particular DMU when
compared to the remaining DMUs. DMUs with a score equal to one are deemed as efficient, and
otherwise as underperforming.
The efficiency score naturally depends on which inputs and outputs are included in the model.
This, the feature selection problem, is non-trivial since DEA is based on the idea of joint production
and it is therefore impossible to understand the impact of a given feature without knowing the other
features taken into account to calculate the efficiency of the DMUs. Hence, features must be selected
jointly, and there is an emerging Operations Research literature on feature selection, cf. e.g. [4,18,21].
The efficiency score of a DMU also depends heavily on which DMUs we compare against, i.e., the set
of potential peers. The traditional approach is to compare all DMUs against a common frontier, but
to cope with heterogeneity it may be more natural to think of different clusters of DMUs, each with
their own cluster-specific frontier.
The idea of clustering DMUs is not new to the DEA literature. If the modeller has domain
knowledge to avoid comparisons between some pairs of DMUs, clustering is an obvious tool. More
generally, if there exist obviously relevant categorical or ordinal variables, the DEA literature has
long recommended the introduction of clustering according to the categorical variables, and “nested”
clustering according to ordinal variables such that DMUs in “easy” environments can be benchmarked
against DMUs in “difficult” environments but not the other way around [3,8]. In fact, some studies
have used a DMU specific cluster structure where each DMU has its own unique set of relevant potential
peers, cf. [6]. Clusters can also be based on the similarity between the DMUs, and many Cluster
Analysis techniques can be used to partition the DMUs into clusters, e.g., partitioning methods or
hierarchical clustering [24]. In this paper, instead of enforcing a cluster structure ex-ante, the clusters
are formed jointly with the feature selection and the efficiency analysis.
Hence, the problem of finding clusters and features that maximize the mean relative efficiency of
the entities is obviously a complicated one. Clustering is intrinsically difficult, even when the entities to
be clustered are given. In our case, the entities are not fully defined ex ante since different features can
be selected. Therefore, we need to find the description of the entities that allows for the best clustering,
i.e. clustering and feature selection must be decided simultaneously. Moreover, the objective of the
clustering and feature selection problem has no simple formulation. Rather, the objective value, the
average relative performance, can only be determined by solving linear programming problems, one
for each of the firms. Moreover, it is not obvious what leads to high mean efficiencies. While clustering
usually strives to create homogeneous groups, this will not guarantee that the mean efficiency will be
high. Relative performance evaluation is like a competition and it might lead to tougher competition
among entities that are more similar. This only adds to the complications.
On the other hand, the combined problem of choosing features, forming clusters and undertaking
relative performance evaluation, is relevant in practice. The motivation of this paper originates from
real-world applications to the regulation of electricity Distribution System Operators (DSOs). In
this particular context, the conventional benchmark may be inappropriate [10] in the presence of
heterogeneity in the set of DMUs [9]. This may, for instance, be the case when some DSOs operate in
larger cities, where underground cables form a large proportion of the electricity grid, whereas some
others operate in rural areas, where overhead lines are used instead. In this setting, one may want to
ensure that no urban network company will be identified as benchmark for rural network firms and
vice versa. More complex forms of heterogeneity, associated with more than one attribute of the data,
may be present. Instead of developing a single model that includes many features to account for this
heterogeneity, it may be more advantageous to partition the set of DMUs into clusters and build a
more targeted and simpler model for each of the clusters.
In this paper, we tackle the joint clustering and feature selection problem by means of Mathematical
Optimization. We propose a Mixed Integer Linear Optimization (MILO) formulation, in which both
the clustering of DMUs and the feature selection for each cluster are decided. This integrative approach
overcomes performing these decisions in a preprocessing step with imperfect information from models
using, for instance, all the DMUs. With our novel formulation we can compare the relative efficiencies
provided by the one-model-fits-all approach and the clustered approach. We will see that in our real-
world dataset the average efficiency provided by both approaches is roughly the same, when each of
the two clustered models only uses half of the features of the single model approach.
We propose three different strategies to construct a heuristic solution to the MILO problem that
are in line with model development in practice. Typically, this is an iterative and circular process
where the evaluated DMUs form different coalitions to advocate the introduction of specific features
and clusters. There are at least three ways in which such internal advocacy groups may form. One
possibility is that some DMUs have somewhat similar input and output structures. We will call this
primal similarity. The urban-rural example may illustrate this. Another possibility is that some
DMUs form a group based on similarities in how they prefer that the different inputs and outputs
are priced. We will call this dual similarity. Note that such similarities are hard to identify ex-ante
since they require a full evaluation of all DMUs and a good idea of which features are to be included.
The dual similarities are therefore also going to change as the feature selection changes in real world
model developments. A third example of the endogenous clustering development is that DMUs form
clustering and feature advocacy groups based on efficiency scores. We call this efficiency similarity.
The DMUs with low efficiencies may for example claim that they cannot be compared to the DMUs
with high efficiencies due to some latent, unobservable difference in the working conditions of the two
groups of DMUs.
The remainder of the paper is structured as follows. In Section 2 we provide further motivation
for the study of joint clustering and feature selection. In Section 3 we provide the MILO formulation
of the joint clustering and feature selection model, and the three paradigms to construct a heuristic
solution. Section 4 is devoted to the illustration of our models in the benchmarking of electricity
Distribution System Operators, as well as on simulated datasets. We end the paper in Section 5 with
some conclusions and lines for future research.
2 Motivation
The general idea of DEA is to compare DMUs doing similar tasks in similar settings. This is often
understood as the DMUs using the same types of inputs to produce the same types of outputs using
similar DEA technologies. Ideally, we therefore need all aspects of the transformation process to
be accounted for by the measured inputs and outputs.
If this is the case, the measurement of efficiency is relatively straightforward. We can for example
use Farrell input efficiency, which is the largest proportional reduction in all inputs of a DMU that
allows it to produce its present outputs in the estimated technology. The technology is traditionally
estimated using the so-called minimal extrapolation principle, i.e., as the smallest subset of input-
output space that contains all observed input-output combinations and which satisfies a minimum of
production economic regularity conditions.
To formalize, consider $K$ DMUs (indexed by $k$), using $I$ inputs (indexed by $i$), and producing $O$ outputs (indexed by $o$). DMU $k$ uses a vector of inputs $x^k \in \mathbb{R}^I_+$ to produce a vector of outputs $y^k \in \mathbb{R}^O_+$. Assuming a technology that contains $(x^k, y^k)$, $k \in \{1,\dots,K\}$, and that is convex, freely disposable and satisfies constant returns to scale, the Farrell input-oriented efficiency of DMU $k$, $E_k$, is then the solution to the following LP formulation, often referred to as the DEA problem in primal space:
\[
\begin{aligned}
E_k = \operatorname*{minimize}_{(E,\lambda)} \quad & E \\
\text{s.t.} \quad & \sum_{j=1}^{K} \lambda_j x_i^j \le E\, x_i^k \qquad \forall i = 1,\dots,I \\
& \sum_{j=1}^{K} \lambda_j y_o^j \ge y_o^k \qquad \forall o = 1,\dots,O \\
& E \in \mathbb{R}_+ \\
& \lambda \in \mathbb{R}^K_+,
\end{aligned}
\]
where $E$ models the efficiency of DMU $k$ and $\lambda$ the weights assigned to the $K$ peers of $k$ involved in the calculation of the efficiency.
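To make the primal problem concrete, the following is a minimal sketch of how the Farrell input efficiency of a single DMU could be computed with scipy.optimize.linprog; the function name and the array layout (rows of X and Y are DMUs) are our own choices, not part of the paper.

```python
# A minimal sketch of the Farrell input-oriented DEA score under CRS.
# X is a K x I input matrix, Y a K x O output matrix (rows are DMUs).
import numpy as np
from scipy.optimize import linprog

def farrell_efficiency(X, Y, k):
    K, I = X.shape
    O = Y.shape[1]
    # Decision vector: (E, lambda_1, ..., lambda_K); we minimize E.
    c = np.zeros(1 + K)
    c[0] = 1.0
    # Input constraints: sum_j lambda_j x_i^j - E x_i^k <= 0, for all i.
    A_in = np.hstack([-X[k].reshape(I, 1), X.T])
    # Output constraints: -sum_j lambda_j y_o^j <= -y_o^k, for all o.
    A_out = np.hstack([np.zeros((O, 1)), -Y.T])
    A_ub = np.vstack([A_in, A_out])
    b_ub = np.concatenate([np.zeros(I), -Y[k]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (1 + K))
    return res.x[0]  # the efficiency score E_k
```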
The dual of this simple LP problem is often used instead. The dual problem assigns prices to the inputs and outputs and seeks to find the prices making the value of DMU $k$'s outputs as valuable as possible, subject to the constraints that no DMU can make a positive profit with these prices and that the value of the inputs used by DMU $k$ is equal to 1. The DEA problem in dual space reads as follows:
\[
\begin{aligned}
E_k = \operatorname*{maximize}_{(\alpha,\beta)} \quad & \sum_{o=1}^{O} \beta_o y_o^k \\
\text{s.t. (DEA)} \quad & \sum_{o=1}^{O} \beta_o y_o^j - \sum_{i=1}^{I} \alpha_i x_i^j \le 0 \qquad \forall j = 1,\dots,K \\
& \sum_{i=1}^{I} \alpha_i x_i^k = 1 \\
& \alpha \in \mathbb{R}^I_+ \\
& \beta \in \mathbb{R}^O_+,
\end{aligned}
\]
where $\alpha_i$ is the price for input $i$ and $\beta_o$ the price for output $o$ in the evaluation of DMU $k$.
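For illustration, here is a sketch of the dual problem (DEA) for one DMU in gurobipy, the Python solver interface the paper reports using later; the function and variable names are ours, and X, Y are assumed to be numpy arrays with DMUs as rows.

```python
# A sketch of the dual DEA problem for a single DMU k under CRS.
import gurobipy as gp
from gurobipy import GRB

def dual_dea(X, Y, k):
    K, I = X.shape
    O = Y.shape[1]
    m = gp.Model("dual-DEA")
    alpha = m.addVars(I, lb=0.0, name="alpha")  # input prices
    beta = m.addVars(O, lb=0.0, name="beta")    # output prices
    # Maximize the value of DMU k's outputs.
    m.setObjective(gp.quicksum(beta[o] * Y[k, o] for o in range(O)),
                   GRB.MAXIMIZE)
    # No DMU makes a positive profit at these prices.
    for j in range(K):
        m.addConstr(gp.quicksum(beta[o] * Y[j, o] for o in range(O))
                    - gp.quicksum(alpha[i] * X[j, i] for i in range(I)) <= 0)
    # Normalization: the value of DMU k's inputs equals 1.
    m.addConstr(gp.quicksum(alpha[i] * X[k, i] for i in range(I)) == 1)
    m.optimize()
    return m.ObjVal  # the efficiency score E_k
```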
In these two formulations we have assumed that all the $I$ inputs and all the $O$ outputs are relevant and can be included in the model. In reality, there are usually large sets of possible inputs and possible outputs, and including them all would lead to overfitting. This leads to the feature selection problem, i.e., the problem of deciding which inputs and outputs to actually use in the model. In the following we will focus on output selection for the sake of simplicity, but it is straightforward to generalize our approach to both input and output selection.
The feature selection problem then reduces to one of finding a subset of the $O$ outputs to use for the $K$ DMUs in the evaluation process. Let us assume that we want to select at most $p$ outputs. Different DMUs are likely to prefer different outputs, and a compromise must be made. An obvious objective is the maximization of the mean efficiency across all DMUs; see [4] for other suitable objective functions. This feature selection problem can be incorporated into the dual DEA model by simply considering new binary decision variables that model the selection of features. This leads to the following MILO formulation of the Output Selection DEA (OSDEA) problem:
\[
\begin{aligned}
\frac{1}{K}\sum_{k=1}^{K} E_k = \operatorname*{maximize}_{(\alpha,\beta,z)} \quad & \frac{1}{K}\sum_{k=1}^{K}\sum_{o=1}^{O} \beta_o^k y_o^k && (1) \\
\text{s.t. (OSDEA)} \quad & \sum_{o=1}^{O} \beta_o^k y_o^j - \sum_{i=1}^{I} \alpha_i^k x_i^j \le 0 \qquad \forall j = 1,\dots,K;\ \forall k = 1,\dots,K && (2) \\
& \sum_{i=1}^{I} \alpha_i^k x_i^k = 1 \qquad \forall k = 1,\dots,K && (3) \\
& \beta_o^k \le M z_o \qquad \forall o = 1,\dots,O;\ \forall k = 1,\dots,K && (4) \\
& \sum_{o=1}^{O} z_o \le p && (5) \\
& \alpha \in \mathbb{R}^{I\cdot K}_+ && (6) \\
& \beta \in \mathbb{R}^{O\cdot K}_+ && (7) \\
& z \in \{0,1\}^O, && (8)
\end{aligned}
\]
where $M$ is a big constant. The objective function (1) is equal to the average of the efficiencies across all DMUs. Similarly as in (DEA), constraints (2)-(3) are necessary to find the prices of the inputs and the outputs for each DMU. Constraints (4) make sure that the output selection variables $z_o$ are well defined with respect to $\beta_o^k$, i.e., any output with a positive price forces the corresponding feature selection variable to be 1. Constraint (5) models the maximum number of outputs to be selected. Finally, constraints (6)-(8) define the types of decision variables we are dealing with.
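A sketch of how (OSDEA) could be written in gurobipy follows; the helper name and return values are our own, and the default for M matches the value reported in the experiments of Section 4.

```python
# A sketch of (OSDEA): output selection with at most p outputs,
# shared across all K DMUs; M is the big-M constant from the text.
import gurobipy as gp
from gurobipy import GRB

def solve_osdea(X, Y, p, M=1000):
    K, I = X.shape
    O = Y.shape[1]
    m = gp.Model("OSDEA")
    alpha = m.addVars(K, I, lb=0.0, name="alpha")
    beta = m.addVars(K, O, lb=0.0, name="beta")
    z = m.addVars(O, vtype=GRB.BINARY, name="z")  # is output o selected?
    # (1): average efficiency across all DMUs.
    m.setObjective((1.0 / K) * gp.quicksum(beta[k, o] * Y[k, o]
                                           for k in range(K)
                                           for o in range(O)),
                   GRB.MAXIMIZE)
    for k in range(K):
        # (2): no DMU j makes a positive profit at DMU k's prices.
        for j in range(K):
            m.addConstr(gp.quicksum(beta[k, o] * Y[j, o] for o in range(O))
                        - gp.quicksum(alpha[k, i] * X[j, i] for i in range(I))
                        <= 0)
        # (3): normalization of DMU k's input value.
        m.addConstr(gp.quicksum(alpha[k, i] * X[k, i] for i in range(I)) == 1)
        # (4): prices only on selected outputs.
        for o in range(O):
            m.addConstr(beta[k, o] <= M * z[o])
    # (5): at most p outputs selected.
    m.addConstr(gp.quicksum(z[o] for o in range(O)) <= p)
    m.optimize()
    return m.ObjVal, [o for o in range(O) if z[o].X > 0.5]
```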
The pure feature selection model works well under the assumption of homogeneity, i.e., if the
DMUs are comparable. It might also work well even in the presence of some heterogeneity in the
set of DMUs. The traditional idea in DEA is that we can capture some heterogeneity in DMUs
by including sufficient details in the input and output space, i.e., by working with sufficiently many
different inputs and outputs. Although this approach works well in many applications, it is not
foolproof. If the DMUs are too heterogeneous, we might need a very large number of inputs and
outputs, and this may not work well unless the number of DMUs is sufficiently large.
It is well-known that DMUs are more efficient the more features are included and the smaller
the set of DMUs that is compared. Hence, with more features and more clusters we can improve the
average efficiency. We do not know, however, if one or the other approach is most effective in attempts
to make average efficiency high. Also, we cannot simply use both many features and many clusters to
make average efficiency high, since there is always a risk that we may overfit the data.
Hence, to make good models, one must consider the simultaneous choice of features and clusters.
The choice of features and clusters is further complicated by the fact that the relative efficiency of a
DMU depends on which exact features are considered and which exact DMUs it is compared to. It
is not only the number of features and the number of DMUs that are compared which matters. It is
also which features and which DMUs are chosen in the different clusters that matters. Summing up,
the choice of features and clusters in a DEA modelling context must be done simultaneously.
To get a bit of intuition, let us consider a few examples of the power of clustering.
Example 1. Primal similarity. In Figure 1, we depict a situation where a number of DMUs have used the same input value, say $x = 1$, to produce different combinations of two outputs, $y_1$ and $y_2$. We only show the outputs. Intuitively, it seems that there are two clusters corresponding to the red and the blue DMUs, respectively.

Indeed, if we split the observations into these groups and use $y_1$ as the single output to evaluate the red DMUs, and $y_2$ as the one to evaluate the blue DMUs, all DMUs are efficient, yielding an average efficiency equal to 1.
If the DMUs are not clustered, we cannot achieve an average efficiency of 1 for any choice of outputs. If we only use one output, we get very low average efficiencies. Using, for instance, output $y_2$, the efficiencies of the red DMUs would be 1, 0.8, and 0.6, respectively, while all the blue DMUs would get efficiencies equal to 0.2, yielding an average efficiency equal to 0.5. The same average efficiency would be obtained when selecting output $y_1$. If both outputs are selected, the efficiencies improve. The $x = 1$ iso-quant is illustrated, and it is easy to see that the blue DMUs would get efficiencies of $1$, $\frac{1}{1.2}$ and $\frac{0.8}{1.2}$, respectively, yielding an average efficiency of $\frac{5}{6}$, and likewise for the red DMUs. Using one cluster and two outputs therefore leads to an average efficiency of $\frac{5}{6}$.

In summary, this example illustrates that introducing two clusters and allowing only one feature to be used in each of them is better than the one-fits-all model, without clustering and allowing all features to be used.
Example 2. Similarity in dual space. In Figure 2, we have again a number of DMUs that have used the same input value, $x = 1$, to produce different combinations of two outputs, $y_1$ and $y_2$. In this case, however, the primal similarity does not work well, and we introduce another type of similarity, the dual one, based on the $\beta$'s of the dual formulation.

With a single cluster model and two features, the efficiencies are equal to 0.2, 0.6, 1, 0.2, 0.6, and 1, respectively, and the average efficiency is 0.6.

We can get the same average efficiency if we use two clusters. Looking at the dual variables of the joint model above, we see that all red DMUs are assigned a value of 0 for $\beta_1$ and that all the blue points are assigned a value of 0 for $\beta_2$. (For the two DMUs to the north-east, the optimal dual variables are not unique.) If we cluster the points accordingly, i.e., all red DMUs go to one cluster and all blue DMUs go to another cluster, and we use feature $y_2$ in the red cluster and $y_1$ in the blue cluster, we get the same mean efficiency as in the single cluster and two features case. In other words, we get the
same average efficiency if we cluster according to the dual variables and only use one feature in each cluster as we get if we use only one cluster and two features.

Figure 1: Illustrating similarity of DMUs in the primal space, with one input equal to 1 for all DMUs and two outputs $y_1$ (horizontal axis) and $y_2$ (vertical axis)
Example 3. Similarity in efficiency. In Figure 3, we illustrate the need for considering another type of similarity, namely grouping DMUs with similar efficiencies. We have again six DMUs that have used the same input value, $x = 1$, to produce different combinations of two outputs, $y_1$ and $y_2$.

Using a joint model with two features, we see that the average efficiency is 0.75. We also see that the three blue DMUs have efficiencies equal to 0.5 and the three red DMUs have efficiencies equal to 1. There is therefore a natural clustering in the efficiency levels in the joint model.

If we use the initial efficiencies to cluster the DMUs into two groups, the red group and the blue group, and if we use only one feature, namely $y_2$, in each cluster, all DMUs will be fully efficient. Hence, the clustering according to efficiency works very well in this example.

The difference between the clusters could, for example, be the result of an ordinal latent variable reflecting that some DMUs, the red ones, operate in more “friendly” environments than the blue ones.
In the following section we will formulate the combined clustering and feature selection problem as an MILO problem. We also discuss how to obtain initial solutions inspired by the examples discussed above, namely to look for (i) similarity in primal data, $\{(x^k, y^k)\}_{k=1}^K$, (ii) similarity in dual values, $\{(\alpha^k, \beta^k)\}_{k=1}^K$, and (iii) similarity in efficiency levels, $\{E_k\}_{k=1}^K$. Our methodology is suitable when there are no natural clusters that are known ex-ante. We are looking for structures in the data that allow us to save on the number of features included and that may be the result of latent variables or differences in the working conditions which are not known ex-ante.

Figure 2: Illustrating similarity of DMUs in the dual space, with one input equal to 1 for all DMUs and two outputs $y_1$ (horizontal axis) and $y_2$ (vertical axis)

Figure 3: Illustrating similarity of DMUs in efficiencies, with one input equal to 1 for all DMUs and two outputs $y_1$ (horizontal axis) and $y_2$ (vertical axis)
3 The clustered DEA model
In this section we study the mathematical optimization problem that partitions the DMUs into C
clusters and selects features independently for each cluster with the goal of maximizing the average
efficiency across all DMUs. We formulate the joint clustering and feature selection problem as an
MILO model, and propose a number of strategies to construct a heuristic solution.
3.1 The formulation
Let $C$ be the number of clusters sought to form a partition of the set of $K$ DMUs, and let $C_c$ denote the set of DMUs in the $c$-th cluster. Let $p_c$ be the number of outputs selected for $C_c$, $c = 1,\dots,C$. Let $z_{oc}$ be equal to 1 if output $o$ can be used by $C_c$ for the calculation of the efficiencies, and 0 otherwise. Let $\gamma_{kc}$ be equal to 1 if DMU $k$ is in cluster $C_c$, and 0 otherwise. The decision variables $\alpha_i^k$ and $\beta_o^k$ are defined as above. The Clustered DEA (CLUDEA) problem, in which we cluster the DMUs and perform feature selection for each cluster, can be written as the following Mathematical Optimization model:
\[
\begin{aligned}
\operatorname*{maximize}_{(\alpha,\beta,\gamma,z)} \quad & \frac{1}{K}\sum_{k=1}^{K}\sum_{o=1}^{O} \beta_o^k y_o^k && (9) \\
\text{s.t. (CLUDEA)} \quad & \sum_{o=1}^{O} \beta_o^k y_o^j - \sum_{i=1}^{I} \alpha_i^k x_i^j \le M\left(1 - \sum_{c=1}^{C} \gamma_{jc}\,\gamma_{kc}\right) \qquad \forall j = 1,\dots,K;\ \forall k = 1,\dots,K && (10) \\
& \sum_{i=1}^{I} \alpha_i^k x_i^k = 1 \qquad \forall k = 1,\dots,K && (11) \\
& \beta_o^k \le M\,(z_{oc} - \gamma_{kc} + 1) \qquad \forall k = 1,\dots,K;\ \forall o = 1,\dots,O;\ \forall c = 1,\dots,C && (12) \\
& \sum_{o=1}^{O} z_{oc} \le p_c \qquad \forall c = 1,\dots,C && (13) \\
& \sum_{c=1}^{C} \gamma_{kc} = 1 \qquad \forall k = 1,\dots,K && (14) \\
& \alpha \in \mathbb{R}^{I\cdot K}_+ && (15) \\
& \beta \in \mathbb{R}^{O\cdot K}_+ && (16) \\
& \gamma \in \{0,1\}^{K\cdot C} && (17) \\
& z \in \{0,1\}^{O\cdot C}. && (18)
\end{aligned}
\]
The objective function and constraints (11) and (15)-(16) are as in (OSDEA). Constraints (10) are now enforced only for pairs of DMUs within the same cluster, removing the links between DMUs that are not in the same cluster. Constraints (12) model the relationship between $z_{oc}$, $\gamma_{kc}$ and $\beta_o^k$: if output $o$ is not selected in the cluster to which DMU $k$ is assigned, namely $z_{oc} = 0$ and $\gamma_{kc} = 1$, then $\beta_o^k$ has to be 0; otherwise, the constraint is not binding. Constraints (13) ensure that $C_c$ selects at most $p_c$ outputs. Constraints (14) ensure that each DMU is in exactly one cluster. Constraints (17)-(18) ensure that $\gamma_{kc}$ and $z_{oc}$ are binary variables. All constraints in (CLUDEA) except for (10) are linear. With standard modelling techniques [12], this constraint can be linearized, yielding an MILO formulation.
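For concreteness, one standard way to linearize the bilinear terms $\gamma_{jc}\,\gamma_{kc}$ in (10), in the spirit of [12], introduces auxiliary variables $w_{jkc}$; this is one possible linearization, not necessarily the exact one used in the implementation:
\[
w_{jkc} \le \gamma_{jc}, \qquad w_{jkc} \le \gamma_{kc}, \qquad w_{jkc} \ge \gamma_{jc} + \gamma_{kc} - 1, \qquad w_{jkc} \ge 0,
\]
so that $w_{jkc} = \gamma_{jc}\,\gamma_{kc}$ at any feasible binary point, and (10) can be rewritten with $\sum_{c=1}^{C} w_{jkc}$ in place of $\sum_{c=1}^{C} \gamma_{jc}\,\gamma_{kc}$.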
In the rest of this section we propose three strategies to construct a heuristic solution based on the similarities introduced in Section 2. We will denote the input and output vectors of the $K$ DMUs by $D$.
3.2 Heuristics
The first heuristic solution consists of putting together those DMUs with similar features by means of a Cluster Analysis procedure, such as the popular k-means or hierarchical clustering [14,16]. Once the clusters have been found, we optimize the selection of outputs using (OSDEA) separately for each of the clusters. This heuristic is presented in Algorithm 1. Note that a heuristic similar to the one in Algorithm 1 can be constructed using dual information.
Algorithm 1: Pseudocode for the constructive heuristic based on primal similarity
1. Data ($D$), number of clusters ($C$), maximum number of outputs to select in the $c$-th cluster ($p_c$)
2. Using a Cluster Analysis procedure on the data $(x^j, y^j)$, $j = 1,\dots,K$, partition the set of DMUs $\{1,\dots,K\} = \bigcup_{c=1}^{C} C_c$
3. For each $c = 1,\dots,C$, solve (OSDEA) with the DMUs in $C_c$, where at most $p_c$ outputs are selected
4. Save all decision variables and efficiency information
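The following is a minimal sketch of Algorithm 1 in Python, using k-means from scikit-learn as the Cluster Analysis procedure; solve_osdea is the hypothetical helper sketched for (OSDEA) above.

```python
# A sketch of Algorithm 1 (primal similarity): cluster DMUs on their
# input-output vectors with k-means, then run (OSDEA) per cluster.
import numpy as np
from sklearn.cluster import KMeans

def heuristic_primal(X, Y, C, p_c):
    data = np.hstack([X, Y])  # primal description (x^j, y^j) of each DMU
    labels = KMeans(n_clusters=C, n_init=10).fit_predict(data)
    results = {}
    for c in range(C):
        members = np.where(labels == c)[0]
        # Solve (OSDEA) restricted to the DMUs of cluster c.
        results[c] = solve_osdea(X[members], Y[members], p=p_c[c])
    return labels, results
```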
An alternative heuristic solution consists of building sets of DMUs such that they are similar in terms of efficiency levels. Below, we follow a slicing approach to build the clusters. We first find a set of DMUs and the corresponding features such that the efficiencies of these DMUs are above a given threshold, say $E$. These will define the first cluster of DMUs and the corresponding features. We then continue with the remaining DMUs and slice off the remaining clusters in an iterative process.
In each iteration, this constructive heuristic requires solving the so-called Quantile DEA (QDEA) problem, a MILO problem in which we have as decision variables $\delta_k$, equal to 1 if DMU $k$ is selected to be in the slice, and 0 otherwise. The rest of the decision variables $\alpha$, $\beta$, $z$ are defined similarly as in (OSDEA). We then have the following problem:
\[
\begin{aligned}
\operatorname*{maximize}_{(\alpha,\beta,z,\delta)} \quad & \sum_{k=1}^{K} \delta_k && (19) \\
\text{s.t. (QDEA)} \quad & \sum_{o=1}^{O} \beta_o^k y_o^k \ge E\,\delta_k \qquad \forall k = 1,\dots,K && (20) \\
& \sum_{o=1}^{O} \beta_o^k y_o^j - \sum_{i=1}^{I} \alpha_i^k x_i^j \le M\,(2 - (\delta_k + \delta_j)) \qquad \forall j = 1,\dots,K;\ \forall k = 1,\dots,K && (21) \\
& \sum_{i=1}^{I} \alpha_i^k x_i^k = 1 \qquad \forall k = 1,\dots,K && (22) \\
& \beta_o^k \le M z_o \qquad \forall o = 1,\dots,O;\ \forall k = 1,\dots,K && (23) \\
& \sum_{o=1}^{O} z_o \le p && (24) \\
& \alpha \in \mathbb{R}^{I\cdot K}_+ && (25) \\
& \beta \in \mathbb{R}^{O\cdot K}_+ && (26) \\
& z \in \{0,1\}^O && (27) \\
& \delta \in \{0,1\}^K, && (28)
\end{aligned}
\]
where $M$ is a big constant. The objective function (19), together with constraints (20) and (28), counts the number of DMUs with an efficiency above $E$. Constraints (21) are now enforced only for the pairs of DMUs in the slice, removing the links between DMUs that are not in this cluster. Constraints (22)-(27) are as in (OSDEA).
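A sketch of (QDEA) in gurobipy, mirroring the (OSDEA) sketch above; as before, the names are ours, and the threshold is passed in as a parameter.

```python
# A sketch of (QDEA): select a largest set of DMUs whose efficiency
# can be pushed above the threshold E with at most p outputs.
import gurobipy as gp
from gurobipy import GRB

def solve_qdea(X, Y, p, E, M=1000):
    K, I = X.shape
    O = Y.shape[1]
    m = gp.Model("QDEA")
    alpha = m.addVars(K, I, lb=0.0)
    beta = m.addVars(K, O, lb=0.0)
    z = m.addVars(O, vtype=GRB.BINARY)
    delta = m.addVars(K, vtype=GRB.BINARY)  # is DMU k in the slice?
    # (19): count the DMUs in the slice.
    m.setObjective(delta.sum(), GRB.MAXIMIZE)
    for k in range(K):
        # (20): selected DMUs must reach efficiency at least E.
        m.addConstr(gp.quicksum(beta[k, o] * Y[k, o] for o in range(O))
                    >= E * delta[k])
        # (21): profit constraints only bind within the slice.
        for j in range(K):
            m.addConstr(gp.quicksum(beta[k, o] * Y[j, o] for o in range(O))
                        - gp.quicksum(alpha[k, i] * X[j, i] for i in range(I))
                        <= M * (2 - delta[k] - delta[j]))
        # (22): normalization of DMU k's input value.
        m.addConstr(gp.quicksum(alpha[k, i] * X[k, i] for i in range(I)) == 1)
        # (23): prices only on selected outputs.
        for o in range(O):
            m.addConstr(beta[k, o] <= M * z[o])
    m.addConstr(z.sum() <= p)  # (24)
    m.optimize()
    return [k for k in range(K) if delta[k].X > 0.5]
```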
With (QDEA) we identify the first slice, which will become the first cluster. We build the second slice by solving (QDEA) again, but starting from $\{1,\dots,K\} \setminus C_1$, and finding again a subset of DMUs for which the efficiencies are above a threshold. We repeat this until we have found the $(C-1)$-th slice. This heuristic is presented in Algorithm 2.
Algorithm 2: Pseudocode for the constructive heuristic based on similarity in efficiency
1. Data ($D$), number of clusters ($C$), maximum number of outputs to select in the $c$-th cluster ($p_c$), maximum number of outputs to select in (QDEA) ($p$), threshold efficiency $E$
2. Set $L = \{1,\dots,K\}$
3. for $c = 1,\dots,C-1$ do
4.   Solve (QDEA) with the DMUs in $L$; slice from the set of DMUs the $c$-th cluster $C_c$
5.   Set $L = L \setminus C_c$
6. end
7. Cluster $C_C$ is composed of the remaining DMUs
8. For each $c = 1,\dots,C$, solve (OSDEA) with the DMUs in $C_c$, where at most $p_c$ outputs are selected
9. Save all decision variables and efficiency information
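A minimal sketch of Algorithm 2's slicing loop; solve_qdea and solve_osdea are the hypothetical helpers sketched above, and solve_qdea is assumed to return positional indices into the DMUs it was given.

```python
# A sketch of Algorithm 2 (efficiency similarity): repeatedly slice off
# a cluster of DMUs whose efficiency can exceed the threshold E_bar.
import numpy as np

def heuristic_efficiency(X, Y, C, p, p_c, E_bar):
    remaining = list(range(X.shape[0]))
    clusters = []
    for c in range(C - 1):
        # Solve (QDEA) on the DMUs not yet assigned to a cluster.
        selected = solve_qdea(X[remaining], Y[remaining], p=p, E=E_bar)
        clusters.append([remaining[s] for s in selected])
        remaining = [k for idx, k in enumerate(remaining)
                     if idx not in set(selected)]
    clusters.append(remaining)  # last cluster: whatever is left
    # Re-optimize the output selection within each cluster via (OSDEA).
    results = [solve_osdea(X[idx], Y[idx], p=p_c[c])
               for c, idx in enumerate(clusters)]
    return clusters, results
```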
4 Numerical results
In this section, we illustrate our methodology on a simulated dataset and a real-world one. In the simulated dataset, where we have a data generating model with underlying clusters and the subset of features that play a role in the calculation of the efficiencies for each cluster, we show how (CLUDEA) is able to recover these clusters and the relevant features for each of them. In the real-world dataset, we show how separating the DMUs into two clusters can already improve the efficiency of the firms significantly compared to the one-model-fits-all approach. Even more, (CLUDEA) uses the same number of features in total, and sometimes fewer, than the one-model-fits-all approach. The experiments were run on a computer with an Intel® Core™ i7-7700 processor at 3.6 GHz using 32 GB of RAM, running Windows 10 Pro. All the optimization problems have been solved using the Python 3.10 interface [25] to the Gurobi 9.5.1 solver [13], with a time limit of 600 seconds and $M$ equal to 1000.
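For reference, the solver settings above could be reproduced along the following lines; the model name is purely illustrative.

```python
# Illustrative solver configuration matching the reported settings;
# the model object here stands in for the (CLUDEA) model.
import gurobipy as gp

model = gp.Model("CLUDEA")
model.Params.TimeLimit = 600  # time limit of 600 seconds
M = 1000                      # big-M constant used in all formulations
```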
The simulated dataset, Dataset 1, consists of $K = 150$ firms, $I = 1$ input and $O = 3$ outputs. The input for all DMUs is equal to 1. The outputs have been generated in such a way that, using 3 clusters of 50 DMUs each and selecting 2 outputs in each cluster, we can make every firm efficient, therefore obtaining an average efficiency of 1 in each of the clusters and hence globally. The output space of this dataset and the underlying clusters are depicted in Figure 4. We can see there that for the black cluster, the relevant outputs, $y_1, y_2$, were created in such a way that $y_1 \sim U(0,1)$ and $y_2 = 1 - y_1$, while the irrelevant one, $y_3$, was equal to zero. The values of the outputs for the green cluster are generated in a similar way, where the relevant features are $y_1, y_3$, while for the red cluster they are $y_2, y_3$.

Figure 4: Visualization of the output space for Dataset 1, where the colors are used for the underlying clusters in the data generating model; DMUs in the black cluster are characterized by features $y_1$ and $y_2$, in green by $y_1$ and $y_3$, and in red by $y_2$ and $y_3$
In what follows, we discuss the results of (CLUDEA) on Dataset 1 for $C = 2, 3$ and $p = 1, 2, 3$, with and without an initial solution. Our methodology is compared with the one-model-fits-all approach, i.e., (OSDEA) with $p = 1, 2, 3$, which corresponds to having all firms in the same cluster. The average efficiencies obtained with these models are summarized in Table 1 and the clusters are depicted in Figures 5-7. The first column of Table 1 contains the number of features, $p$, the second the average efficiencies for (OSDEA), and the third and fourth those for (CLUDEA) without an initial solution. The last four columns of this table show the average efficiencies when (CLUDEA) is given as initial solution the one coming from Algorithm 1, based on primal similarity, and the one from Algorithm 2, based on similarity at the efficiency level, respectively. In Algorithm 1, we use k-means
as the Cluster Analysis procedure, while in Algorithm 2 we use (QDEA) with $E$ conveniently chosen depending on $C$ and $p$. For $C = 2$, we choose $E = 0.6$ for $p = 1$, 0.9 for $p = 2$, and 1 for $p = 3$, while for $C = 3$ we impose stricter values of $E$, namely, 0.6 for $p = 1$ and 1 for $p = 2, 3$.

      OSDEA         CLUDEA
                    No initial solution       Primal initial solution    Efficiency level initial solution
 p    No clusters   2 clusters   3 clusters   2 clusters   3 clusters    2 clusters   3 clusters
 1    0.383         0.639        0.666        0.599        0.655         0.639        0.749
 2    0.716         0.913        0.972        0.913        0.914         0.895        0.993
 3    1             1            1            1            1             1            1

Table 1: Results for Dataset 1 using (CLUDEA) with C = 2 and 3 clusters, as well as (OSDEA) when all DMUs are in the same cluster, for p = 1, 2, 3.
By construction, in the optimal solution of (CLUDEA) for $C = 3$ and $p = 2$ all the firms are efficient. In the feasible solution we obtain within the time limit, (CLUDEA) gives an average efficiency equal to 0.972, which is close to the true optimal value of 1. Using the heuristics to give an initial solution to the Gurobi solver, we obtain efficiencies equal to 0.914 and 0.993, respectively. Thus, for this simulated dataset, the heuristic based on efficiency level similarity can help the solver obtain a slightly better average efficiency. By looking at Figures 5-7, we can see that the heuristic based on efficiency level similarity is already able to detect the underlying pattern of the clusters, although not optimally.
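A sketch of how such a heuristic solution could be handed to Gurobi as a MIP start; gamma and z stand for the (CLUDEA) binaries, and labels/selected for the output of one of the heuristics above (all names are ours).

```python
# Hypothetical MIP start: seed the (CLUDEA) binaries with a heuristic
# clustering `labels` and per-cluster selected outputs `selected`.
for k in range(K):
    for c in range(C):
        gamma[k, c].Start = 1.0 if labels[k] == c else 0.0
for c in range(C):
    for o in range(O):
        z[o, c].Start = 1.0 if o in selected[c] else 0.0
```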
Although, in general, it is not clear how many clusters should be used, because of the data generating model behind Dataset 1 we know that if all DMUs are in the same cluster, i.e., the one-model-fits-all approach, we cannot achieve perfect efficiency for all DMUs with either $p = 1$ or $2$. This is also suggested by Table 1, where the average efficiency for (OSDEA) with $p = 1$ is 0.383, and with $p = 2$ is 0.716. Only when all three outputs are used, i.e., $p = 3$, can we achieve this perfect scenario for the efficiencies, as can be seen in the last row of the table. Similarly, we know that if $C = 2$ we cannot achieve perfect efficiency for all DMUs with either $p = 1$ or $2$, which can again be seen from the results in this table.
We now present the results on Dataset 2, a real-world dataset on the benchmarking of electricity Distribution System Operators (DSOs) [1,2]. Here we have $K = 182$ DMUs, $O = 100$ outputs, and $I = 1$ input. As customary, each output has been normalized by dividing it by the difference between the maximum and the minimum values of the output.

Figure 5: Visualization of the clusters obtained with (CLUDEA) without an initial solution, using $p = 2$, $C = 3$.

The results are summarized in Table 2, in
which we solve (CLUDEA) with $C = 2$, with and without an initial solution, as well as (OSDEA). We consider ten values for the number of outputs selected, namely, $p = 1,\dots,10$. To generate an initial solution for (CLUDEA), we again use Algorithm 1 with k-means and Algorithm 2 with $E$ depending on $p$, namely $0.75, 0.775, 0.8, 0.82, 0.82, 0.82, 0.9, 0.9, 0.95, 0.95$. As for the simulated dataset, we can see that the heuristics can in some cases help to obtain a better feasible solution within the time limit. This is, for instance, the case for $p = 2$ for the efficiency level similarity, as well as for $p = 5$ for the primal similarity.
More importantly, these results suggest that we can maintain roughly the same average efficiency, or even improve it, if we double the number of clusters and halve the number of features per cluster. Take for instance (CLUDEA) with $C = 2$ and $p = 1$; this means that we are using one feature per cluster and thus 2 in total. The average efficiency there is 0.777. We now compare this with (OSDEA) with $p = 2$, where all the DMUs are in the same cluster and 2 features are used to calculate the efficiency. The average efficiency there is 0.655. We have therefore managed to improve the average efficiency of the DMUs by clustering them, while keeping the same number of features used in total. This is actually a general pattern. Taking now (CLUDEA) with $C = 2$ and $p = 2$ and (OSDEA) with $p = 4$, we again see (CLUDEA) outperforming (OSDEA), 0.777 versus 0.749. This goes on until the
end of the table, where for (CLUDEA) with $C = 2$ and $p = 5$ we obtain an average efficiency of 0.886, while (OSDEA) with $p = 10$ gives an average efficiency of 0.873. Therefore, the use of clusters can improve the average efficiency of the DMUs.

Figure 6: Clusters using the heuristic based on primal similarity in Dataset 1, and clusters obtained with (CLUDEA) having this heuristic solution as the initial one, using $p = 2$, $C = 3$. Panels: (a) Initial heuristic solution; (b) Solution after optimizing.
A careful examination of the selected features suggests that (OSDEA) and the second cluster of
(CLUDEA) are selecting similar ones, while the first cluster of (CLUDEA) is capturing the activities of
another group of companies. Specifically, based on domain knowledge and the underlying definitions
of the features, it seems that the first cluster of (CLUDEA) is representing companies with lower
voltage levels and more overhead lines while the second one focuses more on middle and high voltage
networks with more cables.
5 Conclusions
In this paper, we have introduced the combined problem of clustering and feature selection to address
heterogeneity in DEA. The problem has been reformulated as a mixed integer linear optimization
problem. To enhance its optimization with commercial solvers, we have proposed constructive heuris-
tics based on three types of similarity between the DMUs. We have investigated numerically one
simulated and one real-world dataset. The simulated dataset allows us to check our formulations and
to understand the value of clustering relatively easily.

Figure 7: Clusters using the heuristic based on efficiency level similarity in Dataset 1, and clusters obtained with (CLUDEA) having this heuristic solution as the initial one, using $p = 2$, $C = 3$. Panels: (a) Initial heuristic solution; (b) Solution after optimizing.

The real-world dataset allows us to investigate a more realistic application with numerous features to select from, and where many features typically
are correlated, making it more difficult to make the most efficient selection. We have seen that when clusters and features are optimized, clustering can be a very effective means of improving the fit of the DEA model. In our real dataset, for example, we saw that two (optimal) clusters with five (optimal) features in each lead to a fit that is similar to having one cluster with ten (optimal) features. Our approach is flexible enough to handle domain knowledge constraints such as must-link and cannot-link.
In terms of future research, an obvious extension of this paper concerns the interpretability of the clusters created. It would be interesting to discover what characterizes the DMUs in a cluster. Second, a problem related to (CLUDEA) is the detection of outliers in DEA. The aim in the outlier detection problem is to find a cluster of ordinary DMUs and a small cluster of outlier ones such that the average efficiency of the DMUs in the ordinary cluster is maximized. The similarities defined in this paper can be used for this purpose. Finally, enhancing the big-$M$ formulations proposed in this paper with domain knowledge is another fruitful line of research.
Acknowledgements
This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214), FQM-329, P18-FR-2369 and US-1381178 (Junta de Andalucía), PID2019-110886RB-I00 (Ministerio de Ciencia, Innovación y Universidades, Spain), and the Independent Research Fund Denmark project on Benchmarking-Based Incentives and Regulatory Applications (9038-00042B). This support is gratefully acknowledged.

      OSDEA         CLUDEA
                    No initial solution   Primal initial solution   Efficiency level initial solution
 p    No clusters   2 clusters            2 clusters                2 clusters
 1    0.556         0.777                 0.675                     0.638
 2    0.655         0.777                 0.698                     0.811
 3    0.712         0.858                 0.793                     0.820
 4    0.749         0.858                 0.808                     0.820
 5    0.781         0.886                 0.913                     0.821
 6    0.808         0.912                 0.914                     0.824
 7    0.829         0.916                 0.922                     0.829
 8    0.846         0.929                 0.934                     0.869
 9    0.861         0.946                 0.940                     0.883
 10   0.873         0.957                 0.940                     0.910

Table 2: Results for Dataset 2 (DSOs) using (CLUDEA) with C = 2, as well as (OSDEA) when all DMUs are in the same cluster, for p = 1,...,10.
References
[1] P.J. Agrell and P. Bogetoft. Regulatory benchmarking: Models, analyses and applications. Data
Envelopment Analysis Journal, 3(1–2):49–91, 2017.
[2] P.J. Agrell and P. Bogetoft. Theory, techniques, and applications of regulatory benchmarking
and productivity analysis. In The Oxford Handbook of Productivity Analysis. 2018.
[3] Rajiv D Banker and Richard C Morey. The use of categorical variables in data envelopment
analysis. Management Science, 32(12):1613–1627, 1986.
[4] Sandra Benítez-Peña, Peter Bogetoft, and Dolores Romero Morales. Feature selection in data envelopment analysis: A mathematical optimization approach. Omega, 96:102068, 2020.
[5] P. Bogetoft. Performance benchmarking: Measuring and managing performance. Springer Science
& Business Media, 2013.
[6] Peter Bogetoft and Jesper Wittrup. Productivity and education: Benchmarking of elementary and lower secondary schools in Denmark. In Productivity and Competitiveness, pages 257-294. Nordic Council of Ministers, 2011.
[7] A. Charnes, W.W. Cooper, and E. Rhodes. Measuring the efficiency of decision making units.
European Journal of Operational Research, 2(6):429–444, 1978.
[8] Wade D Cook, Dan Chai, John Doyle, and Rodney Green. Hierarchies and groups in DEA.
Journal of Productivity Analysis, 10(2):177–198, 1998.
[9] Wade D. Cook, Julie Harrison, Raha Imanirad, Paul Rouse, and Joe Zhu. Data envelopment
analysis with nonhomogeneous DMUs. Operations Research, 61(3):666–676, 2013.
[10] Xiaofeng Dai and Timo Kuosmanen. Best-practice benchmarking using clustering methods: Application to energy regulation. Omega, 42(1):179-188, 2014.
[11] A. Emrouznejad and G.-L. Yang. A survey and analysis of the first 40 years of scholarly literature
in DEA: 1978–2016. Socio-Economic Planning Sciences, 61:4–8, 2018.
[12] Robert Fortet. Applications de l'algèbre de Boole en recherche opérationnelle. Revue Française de Recherche Opérationnelle, 4(14):17-26, 1960.
[13] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016.
[14] A.K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666,
2010.
[15] C. Jiang and W. Lin. DEARank: a data-envelopment-analysis-based ranking method. Machine
Learning, 101(1–3):415–435, 2015.
[16] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data. An Introduction to Cluster Analysis.
John Wiley & Sons, New York, 1990.
[17] M. Landete, J.F. Monge, and J.L. Ruiz. Robust DEA efficiency scores: A probabilistic/combinatorial approach. Expert Systems with Applications, 86:145-154, 2017.
[18] Chia-Yen Lee and Jia-Ying Cai. Lasso variable selection in data envelopment analysis with small
datasets. Omega, 91:102019, 2020.
[19] Christopher F. Parmeter and Valentin Zelenyuk. Combining the virtues of stochastic frontier and
data envelopment analysis. Operations Research, 67(6):1628–1658, 2019.
[20] N.C. Petersen. Directional Distance Functions in DEA with Optimal Endogenous Directions.
Operations Research, 66(4):1068–1085, 2018.
[21] A. Peyrache, C. Rose, and G. Sicilia. Variable selection in data envelopment analysis. European
Journal of Operational Research, 282(2):644–659, 2020.
[22] J.L. Ruiz and I. Sirvent. Common benchmarking and ranking of units with DEA. Omega, 65:1-9, 2016.
[23] José L. Ruiz and Inmaculada Sirvent. Performance evaluation through DEA benchmarking adjusted to goals. Omega, 87:150-157, 2019.
[24] Sergey Samoilenko and Kweku-Muata Osei-Bryson. Increasing the discriminatory power of DEA
in the presence of the sample heterogeneity with cluster analysis and decision trees. Expert
Systems with Applications, 34(2):1568–1581, 2008.
[25] Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. CreateSpace, Scotts Valley,
CA, 2009.