
Joint Clustering and Feature Selection in Data Envelopment Analysis

Sandra Ben´ıtez-Pe˜na1, Peter Bogetoft2, and Dolores Romero Morales2

1Department of Statistics, Universidad Carlos III de Madrid, Getafe, Spain

sbenitez@est-econ.uc3m.es

2Department of Economics, Copenhagen Business School, Frederiksberg, Denmark

{pb.eco,drm.eco}@cbs.dk

August 30, 2022

Abstract

In modelling the relative performance of a set of Decision Making Units (DMUs), a common

challenge is to account for heterogeneity in the services they provide and the settings in which they

operate. One solution is to include many features in the model and hereby to use a one-ﬁts-all model

that is suﬃciently complex to account for this heterogeneity. Another approach is to introduce

several but simpler models for diﬀerent clusters of the DMUs. In this paper, we investigate the

joint problem of DMU clustering and feature selection. The goal is to ﬁnd a small number of clusters

of DMUs and the features that can be used in each cluster to maximize the average eﬃciency of the

DMUs. We formulate this as a Mixed Integer Linear Optimization problem and propose a collection

of constructive heuristics based on diﬀerent types of similarity between DMUs. The approach is

used on a real-world dataset from the benchmarking of electricity Distribution System Operators,

as well as on simulated data. We show that by introducing clusters we can considerably reduce the

number of features necessary to get high eﬃciency levels.

Keywords: Data Envelopment Analysis; Heterogeneity; Clustering; Feature Selection; Mixed

Integer Linear Optimization


1 Introduction

Relative performance evaluation or benchmarking is a popular tool to evaluate ﬁrms and organizations.

There are many documented examples of the use of benchmarking in both the private and the public

sector, see [5] and the references therein. Examples include the applications in energy (e.g., electricity

distribution), health care (e.g., hospitals, doctors), education (e.g., schools, universities), and public

administration (e.g., municipalities). The nonparametric mathematical optimization approach Data

Envelopment Analysis (DEA) is one of the most popular benchmarking tools [7,11,15,17,19,20,22,

23].

Starting from a set of Decision Making Units (DMUs) that consume the same type of inputs to

produce the same type of outputs, DEA estimates the production frontier and measures the eﬃciency

of each DMU relative to this best practice frontier. In its simplest form, DEA computes an eﬃciency

score using Linear Programming (LP) which deﬁnes the relative eﬃciency of a particular DMU when

compared to the remaining DMUs. DMUs with a score equal to one are deemed efficient, while the rest are underperforming.

The eﬃciency score naturally depends on which inputs and outputs are included in the model.

This, the feature selection problem, is non-trivial since DEA is based on the idea of joint production

and it is therefore impossible to understand the impact of a given feature without knowing the other

features taken into account to calculate the eﬃciency of the DMUs. Hence, features must be selected

jointly, and there is an emerging Operations Research literature on feature selection, cf. e.g. [4,18,21].

The eﬃciency score of a DMU also depends heavily on which DMUs we compare against, i.e., the set

of potential peers. The traditional approach is to compare all DMUs against a common frontier, but

to cope with heterogeneity it may be more natural to think of diﬀerent clusters of DMUs, each with

their own cluster-speciﬁc frontier.

The idea of clustering DMUs is not new to the DEA literature. If the modeller has domain

knowledge to avoid comparisons between some pairs of DMUs, clustering is an obvious tool. More

generally, if there exist obviously relevant categorical or ordinal variables, the DEA literature has long recommended the introduction of clustering according to the categorical variables, and "nested"

clustering according to ordinal variables such that DMUs in “easy” environments can be benchmarked

against DMUs in “diﬃcult” environments but not the other way around [3,8]. In fact, some studies

have used a DMU-specific cluster structure where each DMU has its own unique set of relevant potential peers, cf. [6]. Clusters can also be based on the similarity between the DMUs, and many Cluster

Analysis techniques can be used to partition the DMUs into clusters, e.g., partitioning methods or

hierarchical clustering [24]. In this paper, instead of enforcing a cluster structure ex-ante, the clusters

are formed jointly with the feature selection and the eﬃciency analysis.

Hence, the problem of ﬁnding clusters and features that maximize the mean relative eﬃciency of

the entities is obviously a complicated one. Clustering is intrinsically diﬃcult, even when the entities to

be clustered are given. In our case, the entities are not fully deﬁned ex ante since diﬀerent features can

be selected. Therefore, we need to find the description of the entities that allows for the best clustering,

i.e. clustering and feature selection must be decided simultaneously. Moreover, the objective of the

clustering and feature selection problem has no simple formulation. Rather, the objective value, the

average relative performance, can only be determined by solving linear programming problems, one

for each of the ﬁrms. Moreover, it is not obvious what leads to high mean eﬃciencies. While clustering

usually strives to create homogeneous groups, this will not guarantee that the mean efficiency will be

high. Relative performance evaluation is like a competition and it might lead to tougher competition

among entities that are more similar. This only adds to the complications.

On the other hand, the combined problem of choosing features, forming clusters and undertaking

relative performance evaluation is relevant in practice. The motivation for this paper originates from

real-world applications to the regulation of electricity Distribution System Operators (DSOs). In

this particular context, the conventional benchmark may be inappropriate [10] in the presence of

heterogeneity in the set of DMUs [9]. This may, for instance, be the case when some DSOs operate in

larger cities, where underground cables form a large proportion of the electricity grid, whereas some

others operate in rural areas, where overhead lines are used instead. In this setting, one may want to

ensure that no urban network company will be identified as a benchmark for rural network firms and

vice versa. More complex forms of heterogeneity, associated with more than one attribute of the data,

may be present. Instead of developing a single model that includes many features to account for this

heterogeneity, it may be more advantageous to partition the set of DMUs into clusters and build a

more targeted and simpler model for each of the clusters.

In this paper, we tackle the joint clustering and feature selection problem by means of Mathematical

Optimization. We propose a Mixed Integer Linear Optimization (MILO) formulation, in which both

the clustering of DMUs and the feature selection for each cluster are decided. This integrative approach

overcomes performing these decisions in a preprocessing step with imperfect information from models


using, for instance, all the DMUs. With our novel formulation we can compare the relative eﬃciencies

provided by the one-model-ﬁts-all approach and the clustered approach. We will see that in our real-

world dataset the average eﬃciency provided by both approaches is roughly the same, when each of

the two clustered models only uses half of the features of the single model approach.

We propose three diﬀerent strategies to construct a heuristic solution to the MILO problem that

are in line with model development in practice. Typically, this is an iterative and circular process

where the evaluated DMUs form diﬀerent coalitions to advocate the introduction of speciﬁc features

and clusters. There are at least three ways in which such internal advocacy groups may form. One

possibility is that some DMUs have somewhat similar input and output structures. We will call this

primal similarity. The urban-rural example may illustrate this. Another possibility is that some

DMUs form a group based on similarities in how they prefer that the diﬀerent inputs and outputs

are priced. We will call this dual similarity. Note that such similarities are hard to identify ex-ante

since they require a full evaluation of all DMUs and a good idea of which features are to be included.

The dual similarities are therefore also going to change as the feature selection changes in real world

model developments. A third example of the endogenous clustering development is that DMUs form

clustering and feature advocacy groups based on eﬃciency scores. We call this eﬃciency similarity.

The DMUs with low eﬃciencies may for example claim that they cannot be compared to the DMUs

with high eﬃciencies due to some latent, unobservable diﬀerence in the working conditions of the two

groups of DMUs.

The remainder of the paper is structured as follows. In Section 2 we provide further motivation for the study of joint clustering and feature selection. In Section 3 we provide the MILO formulation of the joint clustering and feature selection model, and the three paradigms to construct a heuristic solution. Section 4 is devoted to the illustration of our models in the benchmarking of electricity Distribution System Operators, as well as on simulated datasets. We end the paper in Section 5 with some conclusions and lines for future research.

2 Motivation

The general idea of DEA is to compare DMUs doing similar tasks in similar settings. This is often

understood as the DMUs using the same types of inputs to produce the same types of outputs using

similar DEA technologies. Ideally, we therefore need all aspects of the transformation process to be accounted for by the measured inputs and outputs.

If this is the case, the measurement of eﬃciency is relatively straightforward. We can for example

use Farrell input eﬃciency, which is the largest proportional reduction in all inputs of a DMU that

allows it to produce its present outputs in the estimated technology. The technology is traditionally

estimated using the so-called minimal extrapolation principle, i.e., as the smallest subset of input-

output space that contains all observed input-output combinations and which satisﬁes a minimum of

production economic regularity conditions.
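Under these conditions (convexity, free disposability, constant returns to scale), the minimal-extrapolation estimate is the well-known CRS hull of the observations; a sketch of this standard set, in our notation:

\[
T^{*} = \left\{ (x, y) \in \mathbb{R}_+^{I} \times \mathbb{R}_+^{O} \;:\; x \ge \sum_{j=1}^{K} \lambda_j x^j, \; y \le \sum_{j=1}^{K} \lambda_j y^j, \; \lambda \in \mathbb{R}_+^{K} \right\},
\]

and the Farrell input efficiency of DMU k is then E^k = \min\{E : (E x^k, y^k) \in T^{*}\}.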

To formalize, consider K DMUs (indexed by k), using I inputs (indexed by i), and producing O outputs (indexed by o). DMU k uses a vector of inputs x^k ∈ R_+^I to produce a vector of outputs y^k ∈ R_+^O. Assuming a technology that contains (x^k, y^k), k ∈ {1, …, K}, and that is convex, free disposable and satisfies constant returns to scale, the Farrell input-oriented efficiency of DMU k, E^k, is then the solution to the following LP formulation, often referred to as the DEA problem in primal space:

E^k = \min_{(E,\lambda)} \; E

s.t.

\sum_{j=1}^{K} \lambda_j x_i^j \le E \, x_i^k \qquad \forall i = 1, \dots, I

\sum_{j=1}^{K} \lambda_j y_o^j \ge y_o^k \qquad \forall o = 1, \dots, O

E \in \mathbb{R}_+

\lambda \in \mathbb{R}_+^{K},

where E models the efficiency of DMU k and λ the weights assigned to the K peers of k involved in the calculation of the efficiency.
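As a concrete illustration of the envelopment-form LP above (our sketch, not the authors' implementation; the helper name `farrell_input_efficiency` is hypothetical), the problem can be solved with SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def farrell_input_efficiency(X, Y, k):
    """Farrell input efficiency of DMU k under CRS (envelopment form).
    X is (K, I) inputs, Y is (K, O) outputs; variables are [E, lambda_1..lambda_K]."""
    K, I = X.shape
    O = Y.shape[1]
    c = np.zeros(1 + K)
    c[0] = 1.0                                    # minimize E
    # sum_j lambda_j x_i^j <= E x_i^k   ->   -x_i^k E + sum_j lambda_j x_i^j <= 0
    A_inputs = np.hstack([-X[k].reshape(I, 1), X.T])
    # sum_j lambda_j y_o^j >= y_o^k     ->   -sum_j lambda_j y_o^j <= -y_o^k
    A_outputs = np.hstack([np.zeros((O, 1)), -Y.T])
    res = linprog(c,
                  A_ub=np.vstack([A_inputs, A_outputs]),
                  b_ub=np.concatenate([np.zeros(I), -Y[k]]),
                  bounds=[(0, None)] * (1 + K))
    return res.x[0]

# Toy data: two DMUs with a single input and a single output each;
# the second DMU is dominated by the first.
X = np.array([[1.0], [1.0]])
Y = np.array([[2.0], [1.0]])
eff = [farrell_input_efficiency(X, Y, k) for k in range(2)]
```

On this toy data the first DMU is efficient, while the dominated one gets a score of one half.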

The dual of this simple LP problem is often used instead. The dual problem assigns prices to the inputs and outputs and seeks to find the prices making the outputs of DMU k as valuable as possible, subject to the constraints that no DMU can make a positive profit with these prices, and that the value of the inputs used by DMU k is equal to 1. The DEA problem in dual space reads as follows:

E^k = \max_{(\alpha,\beta)} \; \sum_{o=1}^{O} \beta_o y_o^k

s.t. (DEA)

\sum_{o=1}^{O} \beta_o y_o^j - \sum_{i=1}^{I} \alpha_i x_i^j \le 0 \qquad \forall j = 1, \dots, K

\sum_{i=1}^{I} \alpha_i x_i^k = 1

\alpha \in \mathbb{R}_+^{I}

\beta \in \mathbb{R}_+^{O},

where α_i is the price for input i and β_o the price for output o.
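A matching sketch of the multiplier (dual) form, again with SciPy and a hypothetical helper name; by LP duality it returns the same score as the envelopment form:

```python
import numpy as np
from scipy.optimize import linprog

def dea_dual_efficiency(X, Y, k):
    """Multiplier (dual) form: find prices alpha, beta maximizing the value
    of DMU k's outputs, with no DMU allowed a positive profit and the value
    of DMU k's inputs normalized to 1. Variables are [alpha_1..alpha_I, beta_1..beta_O]."""
    K, I = X.shape
    O = Y.shape[1]
    c = np.concatenate([np.zeros(I), -Y[k]])      # maximize beta . y^k
    A_ub = np.hstack([-X, Y])                     # beta . y^j - alpha . x^j <= 0
    A_eq = np.concatenate([X[k], np.zeros(O)]).reshape(1, -1)   # alpha . x^k = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(K),
                  A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (I + O))
    return -res.fun

# Same toy data as before: the dual scores coincide with the primal ones.
X = np.array([[1.0], [1.0]])
Y = np.array([[2.0], [1.0]])
scores = [dea_dual_efficiency(X, Y, k) for k in range(2)]
```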

In these two formulations we have assumed that all the inputs in I and all the outputs in O are

possible outputs, and including them all would lead to overﬁtting. This leads to the feature selection

problem, i.e., the problem of deciding which inputs and outputs to actually use in the model. In

the following we will focus on output selection for the sake of simplicity, but it is straightforward to

generalize our approach to both input and output selection.

The feature selection problem then reduces to one of finding a subset of the O outputs to use for the K DMUs in the evaluation process. Let us assume that we want to select at most p outputs.

Diﬀerent DMUs are likely going to prefer diﬀerent outputs and a compromise must be made. An

obvious objective is the maximization of the mean eﬃciency across all DMUs, see [4] for other suitable

objective functions. This feature selection problem can be incorporated into the dual DEA model, by

simply considering new binary decision variables that model the selection of features. This leads to

the following MILO formulation of the Output Selection DEA (OSDEA) problem:

\frac{1}{K}\sum_{k=1}^{K} E^k = \max_{(\alpha,\beta,z)} \; \frac{1}{K}\sum_{k=1}^{K}\sum_{o=1}^{O} \beta_o^k y_o^k \qquad (1)

s.t. (OSDEA)

\sum_{o=1}^{O} \beta_o^k y_o^j - \sum_{i=1}^{I} \alpha_i^k x_i^j \le 0 \qquad \forall j = 1, \dots, K; \; \forall k = 1, \dots, K \qquad (2)

\sum_{i=1}^{I} \alpha_i^k x_i^k = 1 \qquad \forall k = 1, \dots, K \qquad (3)

\beta_o^k \le M z_o \qquad \forall o = 1, \dots, O; \; \forall k = 1, \dots, K \qquad (4)

\sum_{o=1}^{O} z_o \le p \qquad (5)

\alpha \in \mathbb{R}_+^{I \cdot K} \qquad (6)

\beta \in \mathbb{R}_+^{O \cdot K} \qquad (7)

z \in \{0,1\}^{O}, \qquad (8)

where M is a big constant. The objective function (1) is equal to the average of the efficiencies across all DMUs. Similarly as in (DEA), constraints (2)-(3) are necessary to find the prices of the inputs and the outputs for each DMU. Constraints (4) make sure that the output selection variables z_o are well defined with respect to β_o^k, i.e., any output with a positive price forces the corresponding feature selection variable to be 1. Constraint (5) models the maximum number of outputs to be selected. Finally, constraints (6)-(8) define the types of decision variables we are dealing with.
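For small O, the optimum that (OSDEA) searches for can be checked by brute force: enumerate all output subsets of size at most p and score each by the mean of the K dual LPs. The following sketch (our illustration, not the paper's code; all names are hypothetical) does exactly that:

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def _efficiency(X, Ysel, k):
    # multiplier-form DEA LP restricted to the selected output columns
    K, I = X.shape
    O = Ysel.shape[1]
    c = np.concatenate([np.zeros(I), -Ysel[k]])
    A_eq = np.concatenate([X[k], np.zeros(O)]).reshape(1, -1)
    res = linprog(c, A_ub=np.hstack([-X, Ysel]), b_ub=np.zeros(K),
                  A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (I + O))
    return -res.fun

def best_output_subset(X, Y, p):
    """Enumerate every output subset of size at most p and return the one
    maximizing mean efficiency; feasible only for small O, but it yields
    the same optimum that the MILO searches for."""
    K, O = Y.shape
    best_mean, best_S = -1.0, None
    for r in range(1, p + 1):
        for S in combinations(range(O), r):
            mean_eff = np.mean([_efficiency(X, Y[:, list(S)], k) for k in range(K)])
            if mean_eff > best_mean:
                best_mean, best_S = mean_eff, S
    return best_mean, best_S

# Toy data: output 2 makes both DMUs efficient, output 1 does not.
X = np.ones((2, 1))
Y = np.array([[2.0, 1.0], [1.0, 1.0]])
m, S = best_output_subset(X, Y, p=1)
```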

The pure feature selection model works well under the assumption of homogeneity, i.e., if the

DMUs are comparable. It might also work well even in the presence of some heterogeneity in the

set of DMUs. The traditional idea in DEA is that we can capture some heterogeneity in DMUs

by including suﬃcient details in the input and output space, i.e., by working with suﬃciently many

diﬀerent inputs and outputs. Although this approach works well in many applications, it is not

foolproof. If the DMUs are too heterogeneous, we might need a very large number of inputs and

outputs, and this may not work well unless the number of DMUs is suﬃciently large.

It is well known that DMUs appear more efficient the more features are included and the smaller the set of DMUs they are compared against. Hence, with more features and more clusters we can improve the

average eﬃciency. We do not know, however, if one or the other approach is most eﬀective in attempts

to make average eﬃciency high. Also, we cannot simply use both many features and many clusters to

make average eﬃciency high, since there is always a risk that we may overﬁt the data.

Hence, to make good models, one must consider the simultaneous choice of features and clusters.

The choice of features and clusters is further complicated by the fact that the relative eﬃciency of a

DMU depends on which exact features are considered and which exact DMUs it is compared to. It

is not only the number of features and the number of DMUs that are compared which matters. It is

also which features and which DMUs are chosen in the diﬀerent clusters that matters. Summing up,

the choice of features and clusters in a DEA modelling context must be done simultaneously.

To get a bit of intuition, let us consider a few examples of the power of clustering.


Example 1. Primal similarity. In Figure 1, we depict a situation where a number of DMUs have

used the same input value, say x= 1, to produce diﬀerent combinations of two outputs, y1and y2.

We only show the outputs. Intuitively, it seems that there are two clusters corresponding to the red

and the blue DMUs, respectively.

Indeed, if we split the observations into these groups and use y1as the single output to evaluate the

red DMUs, and y2as the one to evaluate the blue DMUs, all DMUs are eﬃcient, yielding an average

eﬃciency equal to 1.

If the DMUs are not clustered, we cannot achieve an average efficiency of 1 for any choice of outputs. If we only use one output, we get very low average efficiencies. Using, for instance, output y2, the efficiencies of the red DMUs would be 1, 0.8, and 0.6 respectively, while all the blue DMUs would get efficiencies equal to 0.2, yielding an average efficiency equal to 0.5. The same average efficiency would be obtained when selecting output y1. If both outputs are selected, the efficiencies improve. The x = 1 iso-quant is illustrated, and it is easy to see that the blue DMUs would get efficiencies of 1, 1/1.2 and 0.8/1.2 respectively, yielding an average efficiency of 5/6, and likewise for the red DMUs. Using one cluster and two outputs therefore leads to an average efficiency of 5/6.

In summary, this example illustrates that introducing two clusters and allowing only one feature

to be used in each of them is better than the one-ﬁts-all model, without clustering and allowing all

features to be used.

Example 2. Similarity in dual space. In Figure 2, we have again a number of DMUs that have

used the same input value, x= 1, to produce diﬀerent combinations of two outputs, y1and y2. In this

case, however, the primal similarity does not work well, and we introduce another type of similarity,

the dual one based on the β’s of the dual formulation.

With a single cluster model and two features, the eﬃciencies are equal to 0.2, 0.6, 1, 0.2, 0.6, and

1 respectively, and the average eﬃciency is 0.6.

We can get the same average eﬃciencies if we use two clusters. Looking at the dual variables of

the joint model above, we see that all red DMUs are assigned a value of 0 to β1and that all the blue

points are assigned a value of 0 to β2. (For the two DMUs to the north-east, the optimal dual variables

are not unique). If we cluster the points accordingly, i.e., all red DMUs go to one cluster and all blue

DMUs go to another cluster, and we use feature y2in the red cluster and y1in the blue cluster, we

get the same mean eﬃciency as in the single cluster and two features case. In other words, we get the

Figure 1: Illustrating similarity of DMUs in the primal space, with one input equal to 1 for all DMUs and two outputs y1 (horizontal axis) and y2 (vertical axis)

same average eﬃciency if we cluster according to the dual variables and only use one feature in each

cluster as we get if we use only one cluster and two features.

Example 3. Similarity in eﬃciency. In Figure 3, we illustrate the need for considering another

type of similarity, namely grouping DMUs with similar eﬃciencies. We have again six DMUs that

have used the same input value, x= 1, to produce diﬀerent combinations of two outputs, y1and y2.

Using a joint model with two features, we see that the average eﬃciency is 0.75. We also see that

the three blue DMUs have eﬃciencies equal to 0.5 and the three red DMUs have eﬃciencies 1. There

is therefore a natural clustering in the eﬃciency levels in the joint model.

If we use the initial eﬃciencies to cluster the DMUs into two groups, the red group and the blue

group, and if we use only one feature, namely y2, in each cluster, all DMUs will be fully eﬃcient.

Hence, the clustering according to eﬃciency works very well in this example.

The diﬀerence between the clusters could for example be the result of an ordinal latent variable

reﬂecting that some DMUs, the red ones, operate in more “friendly” environments than the blue ones.

In the following section we will formulate the combined clustering and feature selection problem as

an MILO problem. We also discuss how to obtain initial solutions inspired by the examples discussed

above, namely to look for (i) similarity in primal data, {(x^k, y^k)}_{k=1}^{K}, (ii) similarity in dual values,

Figure 2: Illustrating similarity of DMUs in the dual space, with one input equal to 1 for all DMUs and two outputs y1 (horizontal axis) and y2 (vertical axis)

Figure 3: Illustrating similarity of DMUs in efficiencies, with one input equal to 1 for all DMUs and two outputs y1 (horizontal axis) and y2 (vertical axis)


{(α^k, β^k)}_{k=1}^{K}, and (iii) similarity in efficiency levels, {E^k}_{k=1}^{K}. Our methodology is suitable when

there are no natural clusters that are known ex-ante. We are looking for structures in the data that

allow us to save on the number of features included and that may be the result of latent variables or

diﬀerences in the working conditions which are not known ex-ante.

3 The clustered DEA model

In this section we study the mathematical optimization problem that partitions the DMUs into C

clusters and selects features independently for each cluster with the goal of maximizing the average

eﬃciency across all DMUs. We formulate the joint clustering and feature selection problem as an

MILO model, and propose a number of strategies to construct a heuristic solution.

3.1 The formulation

Let C be the number of clusters sought to form a partition of the set of K DMUs, and let C_c denote the set of DMUs in the c-th cluster. Let p_c be the number of outputs selected for C_c, c = 1, …, C. Let z_{oc} be equal to 1 if output o can be used by C_c for the calculation of the efficiencies, and 0 otherwise. Let γ_{kc} be equal to 1 if DMU k is in cluster C_c, and 0 otherwise. The decision variables α_i^k and β_o^k are defined as above. The Clustered DEA (CLUDEA) problem, in which we cluster the DMUs and perform feature selection for each cluster, can be written as the following Mathematical Optimization model:

\max_{(\alpha,\beta,\gamma,z)} \; \frac{1}{K}\sum_{k=1}^{K}\sum_{o=1}^{O} \beta_o^k y_o^k \qquad (9)

s.t. (CLUDEA)

\sum_{o=1}^{O} \beta_o^k y_o^j - \sum_{i=1}^{I} \alpha_i^k x_i^j \le M\left(1 - \sum_{c=1}^{C} \gamma_{jc}\,\gamma_{kc}\right) \qquad \forall j = 1, \dots, K; \; \forall k = 1, \dots, K \qquad (10)

\sum_{i=1}^{I} \alpha_i^k x_i^k = 1 \qquad \forall k = 1, \dots, K \qquad (11)

\beta_o^k \le M\,(z_{oc} - \gamma_{kc} + 1) \qquad \forall k = 1, \dots, K; \; \forall o = 1, \dots, O; \; \forall c = 1, \dots, C \qquad (12)

\sum_{o=1}^{O} z_{oc} \le p_c \qquad \forall c = 1, \dots, C \qquad (13)

\sum_{c=1}^{C} \gamma_{kc} = 1 \qquad \forall k = 1, \dots, K \qquad (14)

\alpha \in \mathbb{R}_+^{I \cdot K} \qquad (15)

\beta \in \mathbb{R}_+^{O \cdot K} \qquad (16)

\gamma \in \{0,1\}^{K \cdot C} \qquad (17)

z \in \{0,1\}^{O \cdot C}. \qquad (18)

The objective function and constraints (11) and (15)-(16) are as in (OSDEA). Constraints (10) are now enforced only for pairs of DMUs within the same cluster, removing the links between DMUs that are not in the same cluster. Constraints (12) model the relationship between z_{oc}, γ_{kc} and β_o^k: if output o is not selected in the cluster c to which DMU k is assigned, i.e., z_{oc} = 0 and γ_{kc} = 1, then β_o^k has to be 0; otherwise, the constraint is not binding. Constraints (13) ensure that C_c selects at most p_c outputs. Constraints (14) ensure that each DMU is in exactly one cluster. Constraints (17)-(18) ensure that γ_{kc} and z_{oc} are binary variables. All constraints in (CLUDEA) except for (10) are linear. With standard modeling techniques [12], this constraint can be linearized, yielding an MILO formulation.
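One standard way to do this, sketched here as one possible instance of such modeling techniques, is to replace each bilinear product of binaries by an auxiliary variable w_{jkc} \ge 0 with the McCormick inequalities

\[
w_{jkc} \le \gamma_{jc}, \qquad w_{jkc} \le \gamma_{kc}, \qquad w_{jkc} \ge \gamma_{jc} + \gamma_{kc} - 1,
\]

so that, since γ_{jc} and γ_{kc} are binary, w_{jkc} equals γ_{jc}γ_{kc} at every feasible point, and constraint (10) becomes the linear constraint

\[
\sum_{o=1}^{O} \beta_o^k y_o^j - \sum_{i=1}^{I} \alpha_i^k x_i^j \le M\left(1 - \sum_{c=1}^{C} w_{jkc}\right) \qquad \forall j, k = 1, \dots, K.
\]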

In the rest of the section we propose three strategies to construct a heuristic solution based on

the similarities introduced in Section 2. In the remainder of the section, we will denote the input and

output vectors of the KDMUs by D.

3.2 Heuristics

The ﬁrst heuristic solution consists of putting together those DMUs with similar features by means

of a Cluster Analysis procedure, such as the popular k-means or hierarchical clustering [14,16]. Once

the clusters have been found, we optimize the selection of outputs using (OSDEA) separately for each

of the clusters. This heuristic is presented in Algorithm 1. Note that a similar heuristic to the one in

Algorithm 1 can be constructed using dual information.

Algorithm 1: Pseudocode for constructive heuristic based on primal similarity

1. Data (D), number of clusters (C), maximum number of outputs to select in the c-th cluster (p_c)
2. Using a Cluster Analysis procedure on the data (x^j, y^j), j = 1, …, K, partition the set of DMUs {1, …, K} = ∪_{c=1}^{C} C_c
3. For each c = 1, …, C, solve (OSDEA) with the DMUs in C_c, where at most p_c outputs are selected
4. Save all decision variables and efficiency information
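A minimal sketch of this heuristic, assuming k-means (via `scipy.cluster.vq.kmeans2`) for step 2 and, in place of solving the (OSDEA) MILO in step 3, an exhaustive search over output subsets that is exact for small O; all function names are ours:

```python
import numpy as np
from itertools import combinations
from scipy.cluster.vq import kmeans2
from scipy.optimize import linprog

def _efficiency(X, Ysel, k):
    # multiplier-form DEA LP restricted to the selected output columns
    K, I = X.shape
    O = Ysel.shape[1]
    c = np.concatenate([np.zeros(I), -Ysel[k]])
    A_eq = np.concatenate([X[k], np.zeros(O)]).reshape(1, -1)
    res = linprog(c, A_ub=np.hstack([-X, Ysel]), b_ub=np.zeros(K),
                  A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (I + O))
    return -res.fun

def primal_similarity_heuristic(X, Y, C, p, seed=0):
    """Step 2: k-means on the stacked (x, y) vectors; step 3: per-cluster
    exhaustive output selection as a small-scale stand-in for (OSDEA)."""
    _, labels = kmeans2(np.hstack([X, Y]), C, minit='++', seed=seed)
    effs = np.zeros(X.shape[0])
    chosen = {}
    for c in range(C):
        idx = np.where(labels == c)[0]
        best_mean = -1.0
        for r in range(1, p + 1):
            for S in combinations(range(Y.shape[1]), r):
                e = np.array([_efficiency(X[idx], Y[np.ix_(idx, list(S))], j)
                              for j in range(len(idx))])
                if e.mean() > best_mean:
                    best_mean, chosen[c], effs[idx] = e.mean(), S, e
    return labels, chosen, effs.mean()

# Two well-separated groups, each fully efficient with a single output.
X = np.ones((6, 1))
Y = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, 0.2],
              [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]])
labels, chosen, mean_eff = primal_similarity_heuristic(X, Y, C=2, p=1)
```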

An alternative heuristic solution consists of building sets of DMUs such that they are similar in


terms of eﬃciency levels. Below, we follow a slicing approach to build the clusters. We ﬁrst ﬁnd a set

of DMUs and the corresponding features such that the eﬃciencies of these DMUs are above a given

threshold, say E. These will deﬁne the ﬁrst cluster of DMUs and the corresponding features. We then

continue with the remaining DMUs, and slice the remaining clusters, in an iterative process.

In each iteration, this constructive heuristic requires solving the so-called Quantile DEA (QDEA), an MILO problem in which we have as decision variables δ_k, equal to 1 if DMU k is selected to be in the slice, and 0 otherwise. The rest of the decision variables, α, β, z, are defined as in (OSDEA). We then have:

\max_{(\alpha,\beta,z,\delta)} \; \sum_{k=1}^{K} \delta_k \qquad (19)

s.t. (QDEA)

\sum_{o=1}^{O} \beta_o^k y_o^k \ge E\,\delta_k \qquad \forall k = 1, \dots, K \qquad (20)

\sum_{o=1}^{O} \beta_o^k y_o^j - \sum_{i=1}^{I} \alpha_i^k x_i^j \le M\,(2 - (\delta_k + \delta_j)) \qquad \forall j = 1, \dots, K; \; \forall k = 1, \dots, K \qquad (21)

\sum_{i=1}^{I} \alpha_i^k x_i^k = 1 \qquad \forall k = 1, \dots, K \qquad (22)

\beta_o^k \le M z_o \qquad \forall o = 1, \dots, O; \; \forall k = 1, \dots, K \qquad (23)

\sum_{o=1}^{O} z_o \le p \qquad (24)

\alpha \in \mathbb{R}_+^{I \cdot K} \qquad (25)

\beta \in \mathbb{R}_+^{O \cdot K} \qquad (26)

z \in \{0,1\}^{O} \qquad (27)

\delta \in \{0,1\}^{K}, \qquad (28)

where M is a big constant. The objective function (19), together with constraints (20) and (28), counts the number of DMUs with an efficiency above E. Constraints (21) are now enforced only for the pairs of DMUs in the slice, removing the links between DMUs that are not in this cluster. Constraints (22)-(27) are as in (OSDEA).

With (QDEA) we identify the ﬁrst slice, which will become the ﬁrst cluster. We build the second

slice by solving (QDEA) again, now starting from {1, …, K} \ C_1, and finding again a subset of DMUs for which the efficiencies are above a threshold. We repeat this until we have found the (C−1)-th slice. This heuristic is presented in Algorithm 2.

Algorithm 2: Pseudocode for constructive heuristic based on similarity in efficiency

1. Data (D), number of clusters (C), maximum number of outputs to select in the c-th cluster (p_c), maximum number of outputs to select in (QDEA) (p), threshold efficiency E
2. Set L = {1, …, K}
3. for c = 1, …, C − 1 do
4.   Solve (QDEA) with the DMUs in L; slice from the set of DMUs the c-th cluster C_c
5.   Set L = L \ C_c
6. end
7. Cluster C_C is composed of the remaining DMUs
8. For each c = 1, …, C, solve (OSDEA) with the DMUs in C_c, where at most p_c outputs are selected
9. Save all decision variables and efficiency information
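The slicing idea above can be mimicked without a MILO solver by a greedy loop (our simplification, not an exact substitute for (QDEA)): repeatedly score the unassigned DMUs with the best subset of at most p outputs, found by exhaustive search, and peel off those with efficiency at least E:

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def _efficiency(X, Ysel, k):
    # multiplier-form DEA LP restricted to the selected output columns
    K, I = X.shape
    O = Ysel.shape[1]
    c = np.concatenate([np.zeros(I), -Ysel[k]])
    A_eq = np.concatenate([X[k], np.zeros(O)]).reshape(1, -1)
    res = linprog(c, A_ub=np.hstack([-X, Ysel]), b_ub=np.zeros(K),
                  A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (I + O))
    return -res.fun

def _best_subset_effs(X, Y, p):
    # exhaustively pick at most p outputs maximizing the mean efficiency
    best_mean, best_effs = -1.0, None
    for r in range(1, p + 1):
        for S in combinations(range(Y.shape[1]), r):
            e = np.array([_efficiency(X, Y[:, list(S)], k)
                          for k in range(X.shape[0])])
            if e.mean() > best_mean:
                best_mean, best_effs = e.mean(), e
    return best_effs

def efficiency_slicing(X, Y, C, p, E):
    """Peel off C-1 slices of DMUs whose efficiency is at least E among the
    DMUs still unassigned; the leftovers form the last cluster."""
    remaining = np.arange(X.shape[0])
    clusters = []
    for _ in range(C - 1):
        effs = _best_subset_effs(X[remaining], Y[remaining], p)
        mask = effs >= E
        if not mask.any():        # nobody clears the threshold; stop slicing
            break
        clusters.append(remaining[mask])
        remaining = remaining[~mask]
    clusters.append(remaining)
    return clusters

# On two well-separated groups, one slice already isolates the first group.
X = np.ones((6, 1))
Y = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, 0.2],
              [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]])
cl = efficiency_slicing(X, Y, C=2, p=1, E=0.9)
```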

4 Numerical results

In this section, we illustrate our methodology in a simulated dataset and a real-world one. In the

simulated dataset, where we have a data generating model with underlying clusters and the subset of

features that play a role in the calculation of the eﬃciencies for each cluster, we show how (CLUDEA)

is able to recover these clusters and the relevant features for each of them. In the real-world dataset,

we can see how separating the DMUs into two clusters can already improve the efficiency of the firms significantly compared to the one-model-fits-all approach. Moreover, (CLUDEA) uses the same number of features in total, and sometimes fewer, than the one-model-fits-all approach. The experiments were run on a computer with an Intel Core i7-7700 processor at 3.6 GHz using 32 GB of RAM, running Windows 10 Pro. All the optimization problems have been solved using Python 3.10 [25] with the Gurobi 9.5.1 solver [13], with a time limit of 600 seconds and M equal to 1000.

The simulated dataset, Dataset 1, consists of K = 150 firms, I = 1 input and O = 3 outputs. The input for all DMUs is equal to 1. The outputs have been generated in such a way that, using 3 clusters of 50 DMUs each and selecting 2 outputs in each cluster, we can make every firm efficient, therefore obtaining an average efficiency of 1 in each of the clusters and hence globally.

Figure 4: Visualization of the output space for Dataset 1, where the colors are used for the underlying clusters in the data generating model; DMUs in the black cluster are characterized by features y1 and y2, in the green cluster by y1 and y3, and in the red cluster by y2 and y3

The output space of this dataset and the underlying clusters are depicted in Figure 4. We can see there

that for the black cluster the relevant outputs, y1 and y2, were created in such a way that y1 ∼ U(0,1) and y2 = 1 − y1, while the irrelevant one, y3, was equal to zero. The values of the outputs for the green cluster are generated in a similar way, where the relevant features are y1 and y3, while for the red cluster they are y2 and y3.
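A generator consistent with this description can be sketched as follows; the exact scheme is our inference from the text (each cluster's two relevant outputs sum to 1, the third is zero), and the function name is hypothetical:

```python
import numpy as np

def generate_dataset1(n_per_cluster=50, seed=0):
    """Sketch of a Dataset-1-style generator: 3 clusters of DMUs, each with
    two relevant outputs lying on the line y_a + y_b = 1 and a zero third output."""
    rng = np.random.default_rng(seed)
    pairs = [(0, 1), (0, 2), (1, 2)]     # relevant output indices per cluster
    blocks, labels = [], []
    for c, (a, b) in enumerate(pairs):
        u = rng.uniform(0.0, 1.0, n_per_cluster)
        block = np.zeros((n_per_cluster, 3))
        block[:, a] = u
        block[:, b] = 1.0 - u            # points lie on the line y_a + y_b = 1
        blocks.append(block)
        labels += [c] * n_per_cluster
    Y = np.vstack(blocks)
    X = np.ones((Y.shape[0], 1))         # single input equal to 1 for all DMUs
    return X, Y, np.array(labels)

X, Y, lab = generate_dataset1()
```

Within each cluster every DMU then lies on its own 2-output frontier, so selecting the two relevant outputs per cluster makes all 150 firms efficient.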

In what follows, we discuss the results of (CLUDEA) on Dataset 1 for C = 2, 3 and p = 1, 2, 3, with and without an initial solution. Our methodology is compared with the one-model-fits-all approach, i.e., (OSDEA) with p = 1, 2, 3, which corresponds to having all firms in the same cluster. The average efficiencies obtained with these models are summarized in Table 1 and the clusters are depicted in Figures 5-7. The first column of Table 1 contains the value of the number of features, p, the second the average efficiencies for (OSDEA), and the third and fourth those for (CLUDEA) without an initial solution. The last four columns of this table show the average efficiencies when (CLUDEA) is given as initial solution the one coming from Algorithm 1, based on primal similarity, and the one from Algorithm 2, based on similarity at the efficiency level, respectively. In Algorithm 1 we use k-means as the Cluster Analysis procedure, while in Algorithm 2 we use (QDEA) with E conveniently chosen

        OSDEA         CLUDEA
                      No initial solution      Primal initial solution   Efficiency level initial solution
p       No clusters   2 clusters  3 clusters   2 clusters  3 clusters    2 clusters  3 clusters
1       0.383         0.639       0.666        0.599       0.655         0.639       0.749
2       0.716         0.913       0.972        0.913       0.914         0.895       0.993
3       1             1           1            1           1             1           1

Table 1: Results for Dataset 1 using (CLUDEA) with C = 2 and 3 clusters, as well as (OSDEA) when all DMUs are in the same cluster, for p = 1, 2, 3.

depending on C and p. For C = 2, we choose E = 0.6 for p = 1, 0.9 for p = 2, and 1 for p = 3, while for C = 3 we impose stricter values of E, namely 0.6 for p = 1 and 1 for p = 2, 3.

By construction, in the optimal solution of (CLUDEA) for C = 3 and p = 2 all the firms are efficient. In the feasible solution we obtain within the time limit, (CLUDEA) gives an average efficiency equal to 0.972, which is close to the true optimal value of 1. Using the heuristics to give an initial solution to the Gurobi solver, we obtain efficiencies equal to 0.914 and 0.993, respectively. Thus, for this simulated dataset, the heuristic based on efficiency level similarity can help the solver obtain a slightly better average efficiency. By looking at Figures 5-7, we can see that the heuristic based on efficiency level similarity is already able to detect the underlying pattern in the clusters, although not optimally.

Although, in general, it is not clear how many clusters should be used, because of the data generating model behind Dataset 1 we know that if all DMUs are in the same cluster, i.e., the one-model-fits-all approach, we cannot achieve perfect efficiency for all DMUs with either p = 1 or p = 2. This is also suggested by Table 1, where the average efficiency for (OSDEA) with p = 1 is 0.383, and with p = 2 it is 0.716. Only when all three outputs are used, i.e., p = 3, can we achieve this perfect scenario for the efficiencies, as can be seen in the last row of the table. Similarly, we know that if C = 2 we cannot achieve perfect efficiency for all DMUs with either p = 1 or p = 2, which can again be seen from the results in this table.

Figure 5: Visualization of the clusters obtained with (CLUDEA) without an initial solution, using p = 2, C = 3.

We now present the results on Dataset 2, a real-world dataset from the benchmarking of electricity
Distribution System Operators (DSOs) [1, 2]. Here we have K = 182 DMUs, O = 100 outputs, and
I = 1 input. As is customary, each output has been normalized by dividing it by the difference between
its maximum and minimum values. The results are summarized in Table 2, in
which we solve (CLUDEA) with C = 2, with and without an initial solution, as well as (OSDEA). We

consider ten values for the number of outputs selected, namely, p = 1, ..., 10. To generate an initial
solution for (CLUDEA), we again use Algorithm 1 with k-means and Algorithm 2 with E depending
on p, namely 0.75, 0.775, 0.8, 0.82, 0.82, 0.82, 0.9, 0.9, 0.95, 0.95. As for the simulated dataset, we can
see that the heuristics can in some cases help obtain a better feasible solution within the time limit.
This is, for instance, the case for p = 2 with the efficiency level similarity, as well as for p = 5 with the
primal similarity.
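The range normalization applied to the DSO outputs can be sketched as follows; the data matrix is a hypothetical stand-in for the real K x O output matrix.

```python
import numpy as np

# Hypothetical K x O output matrix (rows: DMUs, columns: outputs).
Y = np.array([[10.0, 200.0],
              [30.0, 600.0],
              [20.0, 400.0]])

# Divide each output by its range (max minus min), as done for Dataset 2.
ranges = Y.max(axis=0) - Y.min(axis=0)
Y_norm = Y / ranges
print(Y_norm)
```

Dividing by the range puts outputs measured on very different scales on a comparable footing without shifting them, which matters when the model works with ratios of outputs to inputs.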

More importantly, these results suggest that we can maintain roughly the same average
efficiency, or even improve it, if we double the number of clusters and halve the number of features per
cluster. Take, for instance, (CLUDEA) with C = 2 and p = 1; this means that we are using one feature
per cluster and thus 2 in total. The average efficiency there is 0.777. We now compare this with (OSDEA)
with p = 2, where all the DMUs are in the same cluster and 2 features are used to calculate the
efficiency. The average efficiency there is 0.655. We have therefore managed to improve the average
efficiency of the DMUs by clustering them, while keeping the same total number of features. This
is in fact a general pattern. Taking now (CLUDEA) with C = 2 and p = 2 and (OSDEA) with
p = 4, we again see (CLUDEA) outperforming (OSDEA), 0.777 versus 0.749. This goes on until the
end of the table, where (CLUDEA) with C = 2 and p = 5 obtains an average efficiency of 0.886


(a) Initial heuristic solution    (b) Solution after optimizing

Figure 6: Clusters using the heuristic based on primal similarity in Dataset 1, and clusters obtained with
(CLUDEA) having this heuristic solution as the initial one, using p = 2, C = 3.

while (OSDEA) with p = 10 gives an average efficiency of 0.873. Therefore, the use of clusters can
improve the average efficiency of the DMUs.
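The pattern just described, two clusters with p features each beating one cluster with 2p features, can be checked directly against the Table 2 columns (the values below are copied from that table):

```python
# Average efficiencies from Table 2 (Dataset 2, DSOs), indexed by p = 1..10.
osdea = {1: 0.556, 2: 0.655, 3: 0.712, 4: 0.749, 5: 0.781,
         6: 0.808, 7: 0.829, 8: 0.846, 9: 0.861, 10: 0.873}
cludea = {1: 0.777, 2: 0.777, 3: 0.858, 4: 0.858, 5: 0.886}  # C = 2, no initial solution

# (CLUDEA) with p features per cluster uses 2p features in total,
# so the fair comparison is against (OSDEA) with 2p features.
for p in range(1, 6):
    print(p, cludea[p], osdea[2 * p], cludea[p] > osdea[2 * p])
```

For every p from 1 to 5, the clustered model attains a higher average efficiency than the single-cluster model with the same total feature budget.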

A careful examination of the selected features suggests that (OSDEA) and the second cluster of
(CLUDEA) select similar ones, while the first cluster of (CLUDEA) captures the activities of
another group of companies. Specifically, based on domain knowledge and the underlying definitions
of the features, it seems that the first cluster of (CLUDEA) represents companies with lower
voltage levels and more overhead lines, while the second focuses more on middle and high voltage
networks with more cables.

5 Conclusions

In this paper, we have introduced the combined problem of clustering and feature selection to address
heterogeneity in DEA. The problem has been formulated as a Mixed Integer Linear Optimization
problem. To enhance its optimization with commercial solvers, we have proposed constructive heuristics
based on three types of similarity between the DMUs. We have investigated numerically one
simulated and one real-world dataset. The simulated dataset allows us to check our formulations and
to understand the value of clustering relatively easily. The real-world dataset allows us to investigate


(a) Initial heuristic solution    (b) Solution after optimizing

Figure 7: Clusters using the heuristic based on efficiency level similarity in Dataset 1, and clusters obtained
with (CLUDEA) having this heuristic solution as the initial one, using p = 2, C = 3.

a more realistic application, with numerous features to select from, many of which are typically
correlated, making it more difficult to make the most efficient selection. We have seen that when
clusters and features are optimized, clustering can be a very effective means of improving the fit of the
DEA model. In our real dataset, for example, we saw that two (optimal) clusters with five (optimal)
features each lead to a fit similar to that of one cluster with ten (optimal) features. Our
approach is flexible enough to handle domain knowledge constraints such as must-link and cannot-link.

In terms of future research, an obvious extension of this paper concerns the interpretability of
the clusters created. It would be interesting to discover what characterizes the DMUs in a cluster.
Second, a problem related to (CLUDEA) is the detection of outliers in DEA. The aim of the outlier
detection problem is to find a cluster of ordinary DMUs and a small cluster of outlier ones such that
the average efficiency of the DMUs in the ordinary cluster is maximized. The similarities defined in
this paper can be used for this purpose. Finally, enhancing the big-M formulations proposed in this
paper with domain knowledge is another fruitful line of research.

       OSDEA           CLUDEA
                 No initial solution   Primal initial solution   Efficiency level initial solution
 p   No clusters     2 clusters             2 clusters                 2 clusters
 1     0.556           0.777                  0.675                      0.638
 2     0.655           0.777                  0.698                      0.811
 3     0.712           0.858                  0.793                      0.820
 4     0.749           0.858                  0.808                      0.820
 5     0.781           0.886                  0.913                      0.821
 6     0.808           0.912                  0.914                      0.824
 7     0.829           0.916                  0.922                      0.829
 8     0.846           0.929                  0.934                      0.869
 9     0.861           0.946                  0.940                      0.883
10     0.873           0.957                  0.940                      0.910

Table 2: Results for Dataset 2 (DSOs) using (CLUDEA) with C = 2, as well as (OSDEA) when all
DMUs are in the same cluster, for p = 1, ..., 10.

Acknowledgements

This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant
agreement ID: 822214), FQM-329, P18-FR-2369 and US-1381178 (Junta de Andalucía), PID2019-
110886RB-I00 (Ministerio de Ciencia, Innovación y Universidades, Spain), and the Independent
Research Fund Denmark project on Benchmarking-Based Incentives and Regulatory Applications (9038-
00042B). This support is gratefully acknowledged.

References

[1] P.J. Agrell and P. Bogetoft. Regulatory benchmarking: Models, analyses and applications. Data

Envelopment Analysis Journal, 3(1–2):49–91, 2017.

[2] P.J. Agrell and P. Bogetoft. Theory, techniques, and applications of regulatory benchmarking

and productivity analysis. In The Oxford Handbook of Productivity Analysis. 2018.

[3] Rajiv D. Banker and Richard C. Morey. The use of categorical variables in data envelopment
analysis. Management Science, 32(12):1613–1627, 1986.


[4] Sandra Benítez-Peña, Peter Bogetoft, and Dolores Romero Morales. Feature selection in data
envelopment analysis: A mathematical optimization approach. Omega, 96:102068, 2020.

[5] P. Bogetoft. Performance benchmarking: Measuring and managing performance. Springer Science

& Business Media, 2013.

[6] Peter Bogetoft and Jesper Wittrup. Productivity and education: Benchmarking of elementary
and lower secondary schools in Denmark. In Productivity and Competitiveness, pages 257–294.
Nordic Council of Ministers, 2011.

[7] A. Charnes, W.W. Cooper, and E. Rhodes. Measuring the eﬃciency of decision making units.

European Journal of Operational Research, 2(6):429–444, 1978.

[8] Wade D Cook, Dan Chai, John Doyle, and Rodney Green. Hierarchies and groups in DEA.

Journal of Productivity Analysis, 10(2):177–198, 1998.

[9] Wade D. Cook, Julie Harrison, Raha Imanirad, Paul Rouse, and Joe Zhu. Data envelopment

analysis with nonhomogeneous DMUs. Operations Research, 61(3):666–676, 2013.

[10] Xiaofeng Dai and Timo Kuosmanen. Best-practice benchmarking using clustering methods: Ap-

plication to energy regulation. Omega, 42(1):179–188, 2014.

[11] A. Emrouznejad and G.-L. Yang. A survey and analysis of the ﬁrst 40 years of scholarly literature

in DEA: 1978–2016. Socio-Economic Planning Sciences, 61:4–8, 2018.

[12] Robert Fortet. Applications de l'algèbre de Boole en recherche opérationnelle. Revue Française
de Recherche Opérationnelle, 4(14):17–26, 1960.

[13] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016.

[14] A.K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666,

2010.

[15] C. Jiang and W. Lin. DEARank: a data-envelopment-analysis-based ranking method. Machine

Learning, 101(1–3):415–435, 2015.

[16] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data. An Introduction to Cluster Analysis.

John Wiley & Sons, New York, 1990.


[17] M. Landete, J.F. Monge, and J.L. Ruiz. Robust DEA eﬃciency scores: A probabilis-

tic/combinatorial approach. Expert Systems with Applications, 86:145–154, 2017.

[18] Chia-Yen Lee and Jia-Ying Cai. Lasso variable selection in data envelopment analysis with small

datasets. Omega, 91:102019, 2020.

[19] Christopher F. Parmeter and Valentin Zelenyuk. Combining the virtues of stochastic frontier and

data envelopment analysis. Operations Research, 67(6):1628–1658, 2019.

[20] N.C. Petersen. Directional Distance Functions in DEA with Optimal Endogenous Directions.

Operations Research, 66(4):1068–1085, 2018.

[21] A. Peyrache, C. Rose, and G. Sicilia. Variable selection in data envelopment analysis. European

Journal of Operational Research, 282(2):644–659, 2020.

[22] J.L. Ruiz and I. Sirvent. Common benchmarking and ranking of units with DEA. Omega,
65:1–9, 2016.

[23] José L. Ruiz and Inmaculada Sirvent. Performance evaluation through DEA benchmarking
adjusted to goals. Omega, 87:150–157, 2019.

[24] Sergey Samoilenko and Kweku-Muata Osei-Bryson. Increasing the discriminatory power of DEA

in the presence of the sample heterogeneity with cluster analysis and decision trees. Expert

Systems with Applications, 34(2):1568–1581, 2008.

[25] Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. CreateSpace, Scotts Valley,

CA, 2009.
