All content in this area was uploaded by Jing Yang on Jun 24, 2021

Finding the “Liberos”:

Discover Organizational Models with Overlaps

Jing Yang1, Chun Ouyang2, Maolin Pan1, Yang Yu1 (✉), and

Arthur H.M. ter Hofstede2

1Sun Yat-sen University, Guangzhou, China

yangj357@mail2.sysu.edu.cn, {panml,yuy}@mail.sysu.edu.cn

2Queensland University of Technology, Brisbane, Australia

{c.ouyang, a.terhofstede}@qut.edu.au

Abstract. Organizational mining aims at gaining insights for business process improvement by discovering organizational knowledge relevant to the performance of business processes. A key topic of organizational mining is the discovery of organizational models from event logs. While it is common for modern organizations to have employees sharing roles and responsibilities across different internal groups, most of the existing methods for organizational model discovery are unable to identify such overlaps. The overlapping resources are likely to be generalists in an organization. Existing findings in process redesign best practices have proven that generalists can help increase the flexibility of a business process (similarly to the flexibility of the role of “libero” in certain team sports). In this paper we propose an approach capable of discovering organizational models with overlaps and thus helping identify generalists in an organization. The approach builds on existing cluster analysis techniques to address the underlying technical challenges. Through experiments on real-life event logs the applicability and effectiveness of the proposed method are evaluated.

Keywords: Process mining · Organizational mining · Organizational model mining · Overlapping clustering

1 Introduction

Process mining enables data-driven process analysis using the massive amount of

event log data captured by information systems in today’s organizations. Various

techniques have been developed to help extract insights about the actual business

processes with the ultimate goal to improve process performance as well as the

organizations’ business performance. While the main focus of process mining

is on the control-ﬂow perspective, recent years have seen research devoted to

mining other aspects such as the organizational context of business processes.

Organizational mining focuses on discovering organizational knowledge, in-

cluding e.g. organizational structures and human resources relevant to the per-

formance of a business process, from event log data [1]. In any organization

where humans play a dominant role, organizational mining helps managers gain

a better understanding of the de facto grouping of human resources and their

interactions, and thus improve the related business processes. The importance of

such organizational knowledge in process improvement is also emphasized by

the fact that 10 of the 29 best practices in process redesign proposed in [2] are

concerned with the structure and population (i.e. resources) of an organization.

Hence, an interesting research topic concerns the discovery of organizational

models from event log data. Given the fact that in many real-life event logs,

only limited information about process execution is provided, it is challenging

to derive the actual organizational model (e.g. an organizational chart) in an

organization. However, it is possible to recognize groups of resources that have

similar characteristics relevant to the performance of a business process. For

example, in [1] the authors propose a resource grouping mechanism based on

how frequently the human resources carry out the same tasks, and suggest that

the discovered organizational groups can be relevant to roles and functional units

in which employees possess similar skills and knowledge to perform the tasks.

To date there have been a number of research eﬀorts on mining organiza-

tional models from event logs (e.g. [1, 3, 4]), whereas almost all of these existing

studies have made an assumption of disjoint organizational groups, which means

that each resource is a member of a single organizational group. In fact, in many

real-world organizations it is common to have employees who possess multiple

skills to share roles and responsibilities across organizational groups. More gen-

erally, modern organizations emphasize the importance of having smooth and

active communication among various functional units, and achieve this by setting

up cross-department roles to enhance the coordination [5]. From the viewpoint

of organizational structures, resources working across diﬀerent organizational

groups form the overlap between the groups. From the viewpoint of process im-

provement, such resources are likely to be the so-called generalists – a special

category of resources that can help increase the ﬂexibility of a business pro-

cess [2]. In terms of ﬂexibility, we consider the generalists to carry out a role

similar to the role of “libero” in certain team sports.

In this paper we propose an approach for the discovery of organizational

models from event logs, which allows the sharing of human resources between

diﬀerent organizational groups. By relaxing the assumption of disjoint organiza-

tional groups (applied in most of the existing work), new discovery algorithms

are developed to address the challenges arising from dealing with the potential

overlaps between organizational groups. Based on the characteristics of the prob-

lem of interest, a couple of existing cluster analysis techniques (from the ﬁeld of

data mining) are chosen and applied in our discovery algorithms. Experiments

are conducted on an implementation of the discovery algorithms, using real-life

event logs, to evaluate the applicability and eﬀectiveness of our approach.

The contribution of our work is twofold. On the one hand, the discovered

organizational model with potential overlaps is a better reﬂection of the actual

organizational grouping of resources relevant to process execution, and hence it

will enable more insightful resource performance analysis. On the other hand,

identifying resources that belong to more than one organizational group from

event logs presents a novel data-driven approach to the discovery of generalists in

an organization and their organizational positioning (i.e. in which organizational

groups they perform in practice). Finding the information about generalists will

help improve resource utilization and also serve as an important step for action-

able process improvement. For example, one strategy for process improvement

is to keep such resources free when possible, which guarantees ﬂexibility in the

distribution of work [6].

The rest of the paper is organized as follows. Sect. 2 provides a review of the

related work on the topic. Sect. 3 introduces basic concepts and preliminary no-

tions. In Sect. 4, we present our approach for mining organizational models with

overlaps, and in Sect. 5 we discuss the experiments and analyze the evaluation

results. Finally, Sect. 6 concludes the paper and outlines future work.

2 Related Work

The research considering the organizational perspective of process mining origi-

nates from the work by van der Aalst et al. [7], in which several types of inter-

resource relationship metrics are deﬁned for deriving resource social networks

from event logs. Based on the analysis of resource social networks, Song and van

der Aalst [1] propose the conceptual framework of organizational mining as a

sub-field of process mining, within which three research dimensions of organizational

mining are proposed: discovery, conformance checking and extension.

Discovery refers to constructing models that reﬂect the reality. In the con-

text of organizational mining, these models include organizational models, social

networks and resource assignment/allocation rules. Organizational model min-

ing focuses on ﬁnding the grouping of resources (employees), e.g. who belongs

to which functional unit [1, 8], who plays what roles [3, 9] or holds what social

positions in collaboration [10]. Recently, the work of Appice [8] introduces an

approach for mining organizational models using a community detection tech-

nique, which makes no assumption about each resource belonging to a single

group. To the best of our knowledge, this is so far the only existing approach

capable of deriving organizational models with potential overlaps.

The discovery of social networks emphasizes the use of social network analysis

to help understand the structure of communication between individual resources

as well as between organizational groups [4, 7, 11]. The research presented in [12,

13] studies the discovery of rules related to staﬀ assignment (who is allowed to

do which tasks) and runtime activity distribution (to whom a speciﬁc task is

allocated) to help with diagnosis and optimization of pre-deﬁned rules.

In addition, there is also existing research concerning the organizational per-

spective of business processes at the level of individual resources. For example,

in [14] the authors analyze the correlation between the workload of individual

resources and their performance, and in [15] the authors propose a framework for

analyzing and evaluating diﬀerent resource behaviors in order to provide insights

towards more informed resource-related decisions for performance improvement.

3 Preliminaries

Here we present several preliminary concepts necessary for describing the prob-

lem, following the conceptual framework of organizational mining deﬁned by

Song and van der Aalst [1]. A typical event log usually consists of a set of

uniquely identiﬁable cases corresponding to the instances of an underlying busi-

ness process. Each case contains a sequence of events that describe the activities

carried out by some resources. Table 1 gives an example fragment of an event log

recorded by a process-aware information system. Each row refers to one single

event, which is described using attributes such as activity label, timestamp, and

identity of the originating resource¹.

Table 1. An example fragment of an event log.

| Case ID | Event ID | Activity label | Resource | Timestamp |
|---|---|---|---|---|
| c1 | e1 | Register request | John | 2018/01/03 10:59:06 |
| c1 | e2 | Examine thoroughly | Mike | 2018/02/03 11:10:13 |
| c1 | e3 | Decide | Clare | 2018/02/21 15:43:32 |
| c1 | e4 | Reject request | John | 2018/02/22 10:35:52 |

Definition 1 (Event Log [7]). Let $T$ be a set of tasks and $R$ be a set of resources. $E \subseteq T \times R$ is the set of events that denote the execution of tasks by originator resources. For any event $e \in E$, $\pi_t(e) \in T$ is the task being executed (or the activity) in $e$ and $\pi_r(e) \in R$ is the originator resource of $e$. $C = E^*$ is the set of possible event sequences (traces describing a case). $L \in \mathcal{B}(C)$ is an event log, where $\mathcal{B}(C)$ is the set of all bags (multi-sets) over $C$.

In Deﬁnition 1 we do not take into account the ordering of events in a case.

We focus on two standard attributes of an event – task and resource identity. We

use them to build a simple “proﬁle” for each resource, which reﬂects the history

of the resource performing activities. Accordingly, a performer by activity matrix

can be used to represent the proﬁles of a set of resources given an event log.

Definition 2 (Performer by Activity Matrix, adapted from [7]). Given an event log $L$, let $\{e_1, \ldots, e_n\}$ be the set of all possible events recorded in $L$. The performer by activity matrix is an integer-valued matrix $X$ of size $|R| \times |T|$, in which each row vector corresponds to the execution history of activities for a specific resource. Each element of $X$ denotes the number of times a resource $r_i \in R$ has conducted a specific task $t_j \in T$, defined as:

$$X_{ij} = \sum_{1 \le k \le n} \begin{cases} 1, & \text{if } \pi_r(e_k) = r_i \text{ and } \pi_t(e_k) = t_j \\ 0, & \text{otherwise} \end{cases}$$

where $1 \le i \le |R|$ and $1 \le j \le |T|$.

¹ For illustration purposes, resource names are used in the example in Table 1.

Consider the example fragment of an event log shown in Table 1.

The performer by activity matrix built from this example based on Definition 2

is shown in Table 2. Below, we propose a generic and simple deﬁnition of an

organizational group as a non-empty group of human resources (i.e. employees) in

an organization. For each organizational group, we deﬁne a membership indicator

associated with each resource to specify whether or not the resource belongs to

the group.

Definition 3 (Organizational Group). Let $R$ be a set of (human) resources in an organization. An organizational group can be defined as $G \subseteq R$ with $G \neq \emptyset$. Given an organizational group $G$, for any $r \in R$, we define a membership indicator function $I_G : R \to \{0, 1\}$ where $I_G(r) = 1$ if $r \in G$ and $0$ otherwise.

Finally, we define the concept of an organizational model. It is simply a

collection of several organizational groups as defined above.

Definition 4 (Organizational Model). An organizational model $O$ is a set that consists of a finite number ($k$) of organizational groups $\{G_1, \ldots, G_k\}$. Any resource $r$ that is part of the organizational model $O$ belongs to one or more organizational groups in $O$. That is, $\forall r \in \bigcup_{G \in O} G,\ \sum_{G \in O} I_G(r) \geq 1$.

As mentioned before, most of the existing studies in organizational mining apply

the assumption of disjoint organizational groups in an organization, and hence

they require that each resource should only belong to a single organizational

group (i.e. $\forall r \in \bigcup_{G \in O} G,\ \sum_{G \in O} I_G(r) = 1$). In Definition 4, our focus is to

relax this assumption by recognizing that resources may belong to more than

one organizational group in reality and thus to allow potential overlaps between

diﬀerent organizational groups.
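The two constraints (at least one group per resource in Definition 4, exactly one under the disjointness assumption) can be checked mechanically. The following is a minimal sketch; the function names and the example groups are hypothetical and only reuse the resource names of Table 1.

```python
def membership_counts(model):
    """Map each resource to the number of groups it belongs to,
    i.e. the sum of the indicator values I_G(r) over all groups G."""
    counts = {}
    for group in model:
        for resource in group:
            counts[resource] = counts.get(resource, 0) + 1
    return counts


def is_disjoint(model):
    """True if every resource belongs to exactly one group."""
    return all(c == 1 for c in membership_counts(model).values())


# Hypothetical groups reusing the resource names of Table 1.
overlapping_model = [{"John", "Mike"}, {"Mike", "Clare"}]
disjoint_model = [{"John"}, {"Mike", "Clare"}]

print(membership_counts(overlapping_model))
print(is_disjoint(overlapping_model), is_disjoint(disjoint_model))
```

In the first model Mike has a membership count of 2 and therefore forms the overlap between the two groups.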

4 Approach

Organizational model mining aims at recognizing groups of resources having

similar characteristics. We note the connection between this aim and the purpose

of cluster analysis in data mining, which is to group a set of data objects into

multiple clusters such that objects within a cluster have high similarity but

are dissimilar to those in other clusters [16]. Cluster analysis is a relatively mature

field, and various types of techniques have been developed to provide solutions for different

requirements and contexts. Since our intention is to derive results in which one

resource may be a member of more than a single organizational group, we select the

technique of overlapping clustering, which allows flexible assignment of one data

object to multiple clusters. In this paper, we design an approach adopting the

idea of overlapping clustering to solve the problem of discovering organizational

models with overlaps. Fig. 1 gives an overview of the three-phased procedure.

We start from constructing the performer by activity matrix that characterizes

the resources. Then we transfer the problem into cluster analysis and apply the

selected model and algorithm to produce the clustering result, from which we

derive an organizational model as the end result.

[Figure 1 labels: event log → Characterize Resources (1. Performer-Activity matrix, 2. Similarity measure) → Run Cluster Analysis → Clustering result → Determine Resource Membership → organizational model]

Fig. 1. The designed procedure for discovering organizational models with overlaps.

4.1 Characterizing Resources

Given an event log, we construct the performer by activity matrix by directly

following Deﬁnition 2 and determine the execution frequencies while iterating

over the events. Table 2 shows the result of deriving the matrix using the example

event log fragment in Table 1 as input.

Table 2. The performer by activity matrix built from the example event log fragment.

| | Activity 1: Register request | Activity 2: Examine thoroughly | Activity 3: Decide | Activity 4: Reject request |
|---|---|---|---|---|
| John | 1 | 0 | 0 | 1 |
| Mike | 0 | 1 | 0 | 0 |
| Clare | 0 | 0 | 1 | 0 |
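The construction can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the tuple layout of the events and the function name `performer_by_activity` are assumptions, and rows and columns are simply ordered by first appearance in the log.

```python
from collections import Counter

# Events from Table 1 as (case id, event id, activity label, resource).
# Timestamps are omitted because Definition 2 ignores event ordering.
events = [
    ("c1", "e1", "Register request", "John"),
    ("c1", "e2", "Examine thoroughly", "Mike"),
    ("c1", "e3", "Decide", "Clare"),
    ("c1", "e4", "Reject request", "John"),
]


def performer_by_activity(events):
    """Return (resources, tasks, X) where X[i][j] counts how often
    resource i executed task j. Resources and tasks are listed in
    order of first appearance in the log."""
    resources, tasks = [], []
    counts = Counter()
    for _case, _event, task, resource in events:
        if resource not in resources:
            resources.append(resource)
        if task not in tasks:
            tasks.append(task)
        counts[(resource, task)] += 1
    matrix = [[counts[(r, t)] for t in tasks] for r in resources]
    return resources, tasks, matrix


resources, tasks, X = performer_by_activity(events)
for name, row in zip(resources, X):
    print(name, row)
```

Applied to the four events of Table 1, the rows agree with Table 2.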

Once the performer by activity matrix has been built, we need to select a

measure for quantifying the similarity between any two resources by comparing

the corresponding row vectors, in order to further group similar resources and

derive an organizational model. Some variants of distance-based metrics pro-

vide meaningful measures in a process mining context. The Hamming distance,

for example, accounts for whether or not two resources have executed the same

types of tasks. Meanwhile, correlation-based metrics such as Pearson’s correla-

tion coeﬃcient provide a view of statistical correlation. The choice of similarity

measure should be made according to the purpose and context of the analysis.
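To make the two kinds of measure concrete, the sketch below computes both on the row vectors of Table 2. Two choices here are assumptions made for illustration: rows are binarized before taking the Hamming distance, and the correlation of a constant row (which is mathematically undefined) is returned as 0.0.

```python
from math import sqrt


def hamming_distance(u, v):
    """Number of tasks that exactly one of the two resources has executed
    (rows are binarized first: executed at least once or not)."""
    return sum((a > 0) != (b > 0) for a, b in zip(u, v))


def pearson(u, v):
    """Pearson's correlation coefficient of two equal-length rows.
    Returns 0.0 when either row is constant (correlation undefined)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / sqrt(var_u * var_v)


# Row vectors of John, Mike and Clare from Table 2.
john, mike, clare = [1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]
print(hamming_distance(john, mike))
print(pearson(john, mike))
```

John and Mike share no task, so their Hamming distance is 3 (all three tasks that either has executed) and their correlation is negative.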

In the next step, we apply clustering techniques to obtain the clusters of resources. Two possible solutions are presented below. Since the two differ in how the final clusters are decided, we describe for each of them how to derive the end result, i.e. the output organizational model.

4.2 Solution 1: Cluster Analysis using a Mixture Model

We ﬁrst elaborate on how to correlate the current problem with the concepts

of overlapping clustering. The concept of probabilistic cluster and the hypoth-

esis of mixture models are commonly used in cluster analysis to characterize

the ﬂexible assignment of one object to multiple clusters simultaneously. The

hypothesis states that the latent categories hidden in the data objects could be

mathematically represented using a series of distribution functions [16]. Each

data object is related to each latent category by a sampling probability, and is

viewed as a sample drawn from a mixture of distributions. In the context of our

problem, we can regard the execution history of activities (i.e. the row vector in

the performer by activity matrix corresponding to a resource) as the result of a

resource following the work patterns of the organizational group(s) it belongs to.

If the resource is indeed a member of several diﬀerent groups, then its execution

history of activities should be the consequence of multiple work patterns. We

may therefore adopt the hypothesis of mixture models as an idea for a solution.

First, we cluster the resources by leveraging the performer by activity matrix along with the specified similarity measure and find the distribution function for each cluster. Then, for each row vector, we calculate a sampling probability related to each cluster, which can be used to decide the membership of the resource.

Following the idea we could apply a classic Gaussian mixture model (GMM)

as the ﬁrst solution. In GMM we assume a Gaussian distribution for each latent

category, and apply the well-founded EM algorithm [16] to ﬁt the mixture model

using the performer by activity matrix. EM works in an iterative ﬁtting process,

which starts with a random initialization and updates the mixture model greed-

ily towards a higher value of the goal function (the likelihood of sampling all

the vectors using the current model). The mixture model has converged once the goal

function value no longer increases, or increases only by a negligible amount.
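For reference, the goal function and the per-cluster probability of a row vector take the following standard form for a Gaussian mixture. The symbols $\pi_j$ (mixing weight), $\mu_j$, $\Sigma_j$ (mean and covariance of component $j$) and $\gamma_{ij}$ are notation introduced here for illustration; they do not appear in the original text.

```latex
% Log-likelihood maximized by EM, for n row vectors x_i and k components
\mathcal{L} = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)

% Posterior probability that row vector x_i was sampled from component j
\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
                   {\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}
```

By construction $\sum_{j} \gamma_{ij} = 1$ for every row vector, so the probabilities of one resource over all clusters compete with each other.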

Using the converged mixture model, we can calculate the posterior probabil-

ity of a row vector relating with each cluster, and take the result as the sampling

probability. However, for actually deciding the membership of a resource, we need

to choose a threshold to be applied on the probability value, which determines if

the resource belongs to one or several of the groups. For example, if the chosen

threshold value is 0.5, then the resource should belong to a group only if its

related sampling probability is greater than or equal to 0.5.
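The thresholding step itself is simple, as the following sketch shows. The function name and the probability values are hypothetical; in practice the probabilities would come from the fitted mixture model.

```python
def memberships_from_posteriors(posteriors, threshold):
    """Return the indices of the groups whose posterior probability
    is greater than or equal to the threshold."""
    return [j for j, p in enumerate(posteriors) if p >= threshold]


# Hypothetical posterior probabilities of one resource over three groups.
posteriors = [0.45, 0.40, 0.15]

print(memberships_from_posteriors(posteriors, 0.5))  # no group reaches 0.5
print(memberships_from_posteriors(posteriors, 0.4))  # groups 0 and 1
```

Because the probabilities of one resource sum to 1, a threshold above 0.5 can never produce an overlap, and a high threshold may leave a resource without any group, as the first call illustrates.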

For the basic solution using GMM, we notice some problems related to its

conﬁguration. Before starting the ﬁtting process, it requires us to decide the

number of clusters upfront. This should be done based on the control of granu-

larity we desire: with a higher number it enables us to discover more ﬁne-grained

groups, which may be the very speciﬁc roles or small workgroups, whereas a

lower number of clusters would possibly lead to ﬁnding departments at a higher

level. Another problem concerns the thresholding step applied for the purpose

of deciding resource membership. It is hard to determine an eﬀective level of

probability value that decides whether a resource indeed belongs to an organi-

zational group or not: for instance, for a ﬁtted GMM we could calculate the

result that an involved employee Jack has the probability value of 0.49 that he

belongs to Group 1, and 0.51 that he belongs to Group 2. The question is: how

should we actually decide Jack’s membership given these numbers? Selecting an

appropriate threshold value may become a challenging task, since the scale of the

estimated posterior probabilities lacks a solid interpretation in the context of

organizational model mining. We therefore present another overlapping clustering

algorithm that addresses the challenge.

4.3 Solution 2: Cluster Analysis using a More Generative Model

Consider the example of deciding Jack’s membership illustrated before. The use

of mixture models like GMM poses the challenge of configuring a proper threshold

parameter, which may hinder us from directly applying the method for discover-

ing organizational models. The challenge arises from the underlying hypothesis

of mixture models: when we view the row vector corresponding to a resource as

a data object being clustered, the posterior probabilities that we use for later

deriving membership only indicate the possibilities of having the current data

object sampled from each of the distributions independently [16]. Hence a mixture

model may fail to adequately characterize the reality that, for resources with

multiple memberships across several groups, their execution history of activities

results from the joint eﬀect of all the work patterns of the groups.

Without shifting from the general concepts of both organizational model

mining and overlapping clustering, we seek to ﬁnd a more natural and descriptive

model that avoids deriving membership from probabilities, and constitutes a

better solution for the current problem.

The Model-based Overlapping Clustering (MOC) model [17] bases itself on

the same concepts of probabilistic clusters as GMM does, but without employ-

ing the hypothesis of having objects sampled from a mixture of distributions

related with the latent categories. Instead, a boolean-valued membership vector

is deﬁned directly for each of the objects to be clustered, of which the values

are inferred after ﬁtting the model with the data. In comparison with mixture

models, the MOC model is a more natural generative model for overlapping clus-

tering. In MOC the data objects being clustered are hypothesized to be generated

by simultaneously considering multiple components, as each of the components

refers to a part of the model that relates to one of the latent categories to be

discovered (similar to the distribution functions).

Algorithm 1 depicts the procedure of applying the MOC model to the current

problem. We omit some of the mathematical details here for brevity and refer the reader to [17] for a more in-depth explanation. Given $n$ resources and the related event log, we assume that the performer by activity matrix $X$ and the similarity measure have been decided prior to running the algorithm, and that the granularity of analysis has been specified already, i.e. $k$ groups are to be discovered. The algorithm starts with an initial estimate of the membership matrix $M$, which is usually initialized in a random manner. Another model parameter to be initialized is a matrix that represents the active status of each component in the MOC model, denoted as $A$, for which random initialization will be fine.

After the initialization of the model parameters we proceed to the iterative

process for fitting the model to the data (Lines 3-10). At each iteration we first update the value of $A$ directly using the current $M$ and $X$ [17]. In the next step, for each membership vector $M_i$ we try to find a value that maximizes the selected similarity measure. The search may be time-consuming when the desired group number $k$ is large; however, certain algorithms could be plugged in here to speed up the search process [17]. When the appropriate setting of $M$ has been obtained, we calculate the value of the goal function, defined here as the log-likelihood (Line 8), and compute the increase in comparison to the result of the last iteration. The iterative updating process stops when convergence is reached, i.e. the increase in the goal function value is sufficiently small.

Algorithm 1: Applying MOC for Discovering an Organizational Model

Input:
  – {r_1, ..., r_n}: the n resources involved;
  – X: the constructed performer by activity matrix, also assuming that the similarity measure has been specified accordingly;
  – k: the number of organizational groups expected to be discovered (depending on the desired granularity).

Output: O: the resulting organizational model consisting of k groups.

    // Step 1: Initialize the membership parameter
 1  Initialize an n × k boolean-valued matrix M, where each of the n row vectors indicates the membership of a corresponding resource
 2  Initialize a k × d real-valued matrix A that denotes the active status of each component in the model
    // Step 2: Fit the model to the data through iterative updating until convergence
 3  repeat
        // Update A by directly computing the value from X and M (cf. [17])
 4      A ← update(A, X, M)
        // Update M by searching for a setting that maximizes the selected similarity measure
 5      for i = 1 to n do
 6          M_i ← argmax over M_i ∈ {0,1}^k of SIMILARITY_MEASURE(X_i, M_i A)
 7      end
        // Calculate the goal function value using the log-likelihood (cf. [17])
 8      L = log P(X, M, A)
 9      Calculate the increase ΔL by comparing with the last iteration
10  until ΔL is sufficiently small
    // Step 3: Derive the resulting organizational model utilizing the membership matrix
11  Initialize k empty sets G_1, G_2, ..., G_k
12  for i = 1 to n do
13      for j = 1 to k do
14          if M_ij = true then
15              G_j ← G_j ∪ {r_i}
16          end
17      end
18  end
19  return O = {G_1, G_2, ..., G_k}
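The membership update (Lines 5-7) and the derivation of groups (Lines 11-19) can be sketched as follows. This is a partial illustration under stated assumptions, not the full MOC procedure of [17]: the component matrix `A` is held fixed with hypothetical values, the update of `A` and the log-likelihood check are left out, the search is exhaustive, and the all-zero membership vector is excluded so that every resource ends up in at least one group.

```python
from itertools import product
from math import sqrt


def pearson(u, v):
    """Pearson's correlation; 0.0 if either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return 0.0 if var_u == 0 or var_v == 0 else cov / sqrt(var_u * var_v)


def best_membership(x_i, A):
    """Exhaustively search {0,1}^k (excluding the all-zero vector) for the
    membership vector m whose reconstruction m*A is most similar to x_i."""
    k, d = len(A), len(A[0])
    best, best_score = None, float("-inf")
    for m in product((0, 1), repeat=k):
        if not any(m):
            continue  # assumption: each resource belongs to at least one group
        reconstruction = [sum(m[j] * A[j][t] for j in range(k)) for t in range(d)]
        score = pearson(x_i, reconstruction)
        if score > best_score:
            best, best_score = m, score
    return best


def groups_from_memberships(resources, M):
    """Step 3 of Algorithm 1: turn the boolean membership matrix into groups."""
    k = len(M[0])
    return [{r for r, row in zip(resources, M) if row[j]} for j in range(k)]


# Hypothetical component matrix A (k = 2 groups, d = 3 tasks) and profiles X.
A = [[5.0, 1.0, 0.0],
     [0.0, 1.0, 4.0]]
X = [[5, 1, 0], [0, 1, 4], [5, 2, 4]]
resources = ["r1", "r2", "r3"]

M = [best_membership(x, A) for x in X]
print(M)
print(groups_from_memberships(resources, M))
```

The profile of `r3` is best explained by both components acting together, so it is placed in both groups. The exhaustive search visits 2^k - 1 candidates per resource, which is why the text notes that faster search strategies may be needed for large k.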

With the fitted model we can now derive the end result in a straightforward way, since the membership of all the $n$ resources has been determined as the value of the $n \times k$ membership matrix $M$. Therefore, we simply need to assign the resources to the corresponding ones of the $k$ sets (Lines 11-18), and return the resulting sets as the discovered organizational groups.

Compared to the more naïve solution using GMM, the solution using the MOC

model avoids introducing probabilities as the degree of resource membership,

and therefore addresses the challenge of having to select thresholds. Given the

event log and resources to be analyzed, users would only need to focus on the

resource proﬁling phase, and then set up the expected number of groups. The

end result will be an organizational model containing the exact number of groups

as required, where overlaps are allowed to exist.

5 Evaluation

5.1 Experiment Design

Both solutions (applying either GMM or MOC) have been implemented in a

standalone demo¹. We evaluated their feasibility on real-life event log data. We

aim at giving empirical validation on whether the proposed solutions work ef-

fectively in discovering organizational models when there indeed exist overlaps

among organizational groups.

Event Logs. Diﬀerent from the evaluation methods in the previous research on

the problem (cf. [1], [8], [11]), the purpose of the validation here requires us to

be aware of the “ground truth” information relevant to the internal groups in an

organization a priori. For this purpose we picked two sets of real-life event logs,

namely “WABO” and “Volvo”. The background of these event log datasets is

as follows:

– WABO: The event log from the WABO dataset contains the records of the receiving phase of an environmental permit application process in an anonymous municipality within the CoSeLoG project [18].

– Volvo: This dataset includes event logs generated from the problem management system VINST of Volvo Belgium, which was originally released for the BPI Challenge 2013 [19]. It contains the event logs that describe several business processes handling incidents and problems in the IT services delivered and/or operated by Volvo IT. For our experiments we chose the event log of the process that manages open problems.

The event logs are recorded in the IEEE standard XES format [20], and in-

clude an extended event attribute termed org:group, which indicates the group

identity of the resource that triggered the event. We recognize that the ground

truth organizational models can be extracted by utilizing this information of

identities, which can then serve as the reference models for our experiments.

To do this we first filter out the events with missing values on org:group (including both null and invalid ones). Then we extract the ground truth organizational model by grouping resources according to the org:group values they relate to as event originators. Table 3 gives a brief overview of the preprocessed event logs, along with some basic statistics of the extracted reference models: the average size of groups (Avg. group size), and the average number of groups that a resource belongs to (Avg. membership). The Avg. membership values above 1 show immediately that overlaps exist in the reference models. A further comparison on Avg. membership reveals that the overlap is less pronounced in the Volvo case (1.176, versus 3.886 for WABO), suggesting that considerably fewer employees hold multiple group identities in Volvo IT.

¹ https://github.com/royyjing/bpm-2018-Yang Find

Table 3. Overview of the event logs and the extracted reference models.

| Event log | Cases | Events | Activities | Resources | Organizational groups | Avg. group size | Avg. membership |
|---|---|---|---|---|---|---|---|
| WABO | 1,348 | 6,641 | 27 | 44 | 9 | 19.0 | 3.886 |
| Volvo | 818 | 2,331 | 5 | 239 | 11 | 25.5 | 1.176 |
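The extraction of a reference model and of the two statistics can be sketched as follows. The input format (a list of resource and org:group pairs) and the function names are assumptions, and only null or empty group values are treated as missing here; the criterion for "invalid" values is not specified in the text.

```python
from collections import defaultdict


def extract_reference_model(events):
    """Group resources by the org:group values of the events they
    originated, skipping events whose group value is missing."""
    groups = defaultdict(set)
    for resource, group in events:
        if group is None or group == "":
            continue
        groups[group].add(resource)
    return dict(groups)


def model_statistics(model):
    """Return (average group size, average membership per resource)."""
    resources = set().union(*model.values())
    total_memberships = sum(len(members) for members in model.values())
    return total_memberships / len(model), total_memberships / len(resources)


# Hypothetical (resource, org:group) pairs; None marks a missing value.
events = [
    ("r1", "G1"), ("r2", "G1"), ("r2", "G2"),
    ("r3", "G2"), ("r3", None), ("r1", "G1"),
]
model = extract_reference_model(events)
print(model)
print(model_statistics(model))
```

Both statistics are derived from the same total number of memberships, so the number of groups times Avg. group size equals the number of resources times Avg. membership. The WABO row of Table 3 is consistent with this: 9 × 19.0 = 171 and 44 × 3.886 ≈ 171.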

Experiment Setups. We conducted the experiments using a comparative

method. Two methods proposed in previous research are selected as baselines:

a traditional partitioning method that produces disjoint organizational mod-

els [1], namely MJA; and a community detection based method developed by

Appice [8] that is capable of deriving organizational models with possible over-

laps, namely Commu. We examine if the organizational models discovered from

the same source of event logs using GMM and MOC can better capture the

reality, i.e. are more similar to the reference models.

To start with, we build the performer by activity matrix and choose

Pearson’s correlation coefficient as the similarity measure. Since the

setup of the algorithms involved in evaluation may vary, we decided to conﬁgure

the parameters for each algorithm separately, as long as they produce resulting

organizational models with exactly the same number of organizational groups

discovered as that in the reference ground truth.

Evaluation Metrics. For the purpose of comparing between the results of dis-

covery and the reference models to assess the eﬀectiveness of diﬀerent methods,

we consider adopting extrinsic evaluation metrics. One example is the entropy

measure [1], which can be used for measuring the scale of diﬀerence between a

generated model and the reference one. However, as the current research has

been extended to the overlapping situation, the entropy measure, like many

other commonly used extrinsic measures, becomes inappropriate. We therefore

turn to the extended BCubed metrics (including BCubed Precision, Recall and

F-measure) [21], as they are applicable for evaluation on the overlapping cases.

From an organizational model mining point of view, the meaning of the BCubed

metrics can be interpreted as follows:

1. BCubed Precision measures the extent to which resources placed in the same discovered organizational groups also belong to the same actual groups. A higher value of BCubed Precision means fewer mistaken assignments in the discovered organizational model.

2. BCubed Recall measures the extent to which resources from the same actual groups are assigned to the same discovered organizational groups. A higher value of BCubed Recall means that more resources with the same actual group identities are placed together by the mining algorithm.

3. BCubed F-measure is a combination of BCubed Precision and Recall, defined as the harmonic mean of the two.
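The three metrics above can be sketched as follows, using the pairwise multiplicity formulation of the extended BCubed metrics from [21]. The dictionary-based input format and the function name are assumptions for the example, as is the choice to score a resource with no group at all as 0.

```python
def extended_bcubed(discovered, reference):
    """Extended BCubed precision, recall and F-measure for overlapping groups.

    Both arguments map each resource to the set of group identifiers it
    belongs to, and must cover the same resources.
    """
    assert discovered.keys() == reference.keys()
    resources = list(reference)

    def average_score(own, other):
        # For every pair of resources sharing at least one group in `own`,
        # check how many of those shared groups are matched in `other`.
        total = 0.0
        for e in resources:
            scores = []
            for e2 in resources:
                shared_own = len(own[e] & own[e2])
                if shared_own == 0:
                    continue
                shared_other = len(other[e] & other[e2])
                scores.append(min(shared_own, shared_other) / shared_own)
            # Assumption: a resource without any group contributes 0.
            total += sum(scores) / len(scores) if scores else 0.0
        return total / len(resources)

    precision = average_score(discovered, reference)
    recall = average_score(reference, discovered)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: b belongs to two actual groups, but the discovered
# model is disjoint and places b in one group only.
reference = {"a": {1}, "b": {1, 2}, "c": {2}}
discovered = {"a": {"X"}, "b": {"X"}, "c": {"Y"}}
precision, recall, f_measure = extended_bcubed(discovered, reference)
```

In this toy case the disjoint model makes no mistaken assignment, so precision is 1, while the missing second membership of b lowers recall to 2/3 and the F-measure to 0.8.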

Besides the BCubed metrics, we also compare the basic statistics of the discovered organizational model (Avg. group size and Avg. number of membership) with those of the ground truth model.
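A minimal sketch of how these two statistics can be computed is given below. It assumes that an organizational model is represented as a list of resource sets, and that the average number of membership is taken over the resources that appear in at least one group; both are assumptions made for the example.

```python
def grouping_statistics(groups):
    """Return (average group size, average number of memberships per resource).

    `groups` is a non-empty list of sets of resources; a resource that
    appears in several sets has several memberships.
    """
    members = set().union(*groups)
    total_memberships = sum(len(group) for group in groups)
    avg_group_size = total_memberships / len(groups)
    avg_membership = total_memberships / len(members)
    return avg_group_size, avg_membership


# Hypothetical models: b is shared between the two overlapping groups.
overlapping = [{"a", "b"}, {"b", "c"}]
disjoint = [{"a", "b"}, {"c"}]
```

For a disjoint model every resource has exactly one membership, so the second statistic is always 1.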

5.2 Comparing with the Disjoint Partitioning Method

In the first experiment we compare our solutions with the disjoint partitioning method MJA. The idea behind MJA is to view the resources as vertices in a graph, and to connect them with weighted edges based on the measured similarity values. By eliminating edges according to a threshold value, the original graph is partitioned into several connected components, which are taken as the organizational groups that constitute the final organizational model. The result generated by MJA is disjoint by construction.
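The graph-based idea described above can be sketched as follows. This is a simplified illustration rather than the MJA implementation from [1]: the function name, the similarity callback and the rule that an edge is kept when its weight is at least the threshold are all assumptions.

```python
from itertools import combinations


def threshold_partition(resources, sim, threshold):
    """Partition resources into the connected components of a thresholded graph.

    `sim(r1, r2)` returns the similarity of two resources. An edge is kept
    only if its weight reaches `threshold` (assumed rule for this sketch).
    """
    adjacency = {r: set() for r in resources}
    for r1, r2 in combinations(resources, 2):
        if sim(r1, r2) >= threshold:
            adjacency[r1].add(r2)
            adjacency[r2].add(r1)

    groups, seen = [], set()
    for start in resources:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        groups.append(component)
    return groups


# Hypothetical similarity values between three resources.
weights = {
    frozenset(("ann", "bob")): 0.9,
    frozenset(("bob", "cat")): 0.2,
    frozenset(("ann", "cat")): 0.1,
}
groups = threshold_partition(
    ["ann", "bob", "cat"],
    lambda r1, r2: weights[frozenset((r1, r2))],
    threshold=0.5,
)
```

Because each vertex lies in exactly one connected component, no resource can receive more than one group identity under this scheme.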

Table 4 shows the evaluation results measured by the BCubed metrics. From the table we can see that MJA obtains higher precision. However, the disjoint nature of MJA prevents it from recognizing that similar resources may share more than one group identity in an overlapping organizational model. Thus, MJA clusters similar resources into one group only, which leads to its relatively lower recall.

On the other hand, the proposed solutions using either GMM or MOC have comparatively lower precision yet higher recall. This can be explained by the tendency of both overlapping clustering based algorithms to put more resources into the groups, which is consistent with the larger group sizes shown in Table 5. This leads to better recall, but at the same time makes the discovered organizational groups contain relatively more mistakenly assigned members, which directly causes the lower precision of GMM and MOC.

Moreover, for the Volvo case we notice that even the baseline MJA produces relatively low precision, and the difference in recall described above becomes even more pronounced. The reason is the large total number of resources compared to the much smaller number of activity types (239 resources compared to 5 activity types). The smaller number of activity types leads to fewer columns in the performer by activity matrix, and may therefore make the similarity measure less discriminative.

Although GMM and MOC tend to sacrifice some precision and introduce mistaken assignments, from Table 5 we can draw a conclusion: the overlapping-clustering-based solutions are able to derive an overlapping organizational model that captures the reality, whereas methods like MJA, which assume a disjoint organizational model, are not.

Nevertheless, the following questions remain: How effective are our solutions compared to other solutions that can also produce overlapping organizational models? Will the other solutions also encounter the problem of unsatisfactory precision? We explore the answers to these questions through the following experiment and analysis.

Table 4. Results of comparing with MJA on the BCubed metrics.

             BCubed Precision        BCubed Recall           BCubed F-measure
Event log    MJA    GMM    MOC       MJA    GMM    MOC       MJA    GMM    MOC
WABO         0.814  0.624  0.757     0.213  0.812  0.735     0.337  0.706  0.745
Volvo        0.496  0.186  0.24      0.397  0.944  0.94      0.441  0.31   0.382

Table 5. Results of comparing with MJA on the grouping statistics.

             Avg. group size                       Avg. number of membership
Event log    Ground truth  MJA    GMM    MOC       Ground truth  MJA  GMM    MOC
WABO         19.0          4.9    28.4   22.8      3.886         1    5.818  4.659
Volvo        25.5          21.7   146.5  110.9     1.176         1    6.745  5.105

5.3 Comparing with the Community-Detection Based Method

In this experiment we choose as baseline a community detection based ap-

proach [8] which we refer to as Commu. Our goal is to make a comparison

between the eﬀectiveness of Commu and our approach. Commu is based on

social network analysis techniques rather than cluster analysis, but shares the

same purpose of grouping cohesive resources into communities that represent

the internal organizational groups. It applies the linear network model with the

Louvain algorithm, and derives organizational models which allow the existence

of overlapping communities (organizational groups).

Tables 6 and 7 show the evaluation results of this experiment. By observing the average number of membership we first confirm that the baseline method Commu indeed generates overlapping results. For the BCubed metrics, we notice that GMM performs roughly the same as Commu, whereas MOC achieves higher precision and F-measure than Commu in both cases. The grouping statistics also show that the models produced by either GMM or MOC are closer to the ground truth than those produced by Commu.

Meanwhile, the tables show that Commu also produces low precision and oversized groups in the Volvo case, with groups even larger than those of GMM and MOC (refer to the grouping statistics in Table 7).

In general, we may conclude that our approach is more eﬀective as a solution

to discovering organizational models with overlaps, compared to the community

detection based method. Nevertheless, as both methods have the shortcoming

of introducing mistaken assignment of resources to groups causing low precision

and unrealistic group sizes, further work is needed to address this shortcoming.

Table 6. Results of comparing with Commu on the BCubed metrics.

             BCubed Precision        BCubed Recall           BCubed F-measure
Event log    Commu  GMM    MOC       Commu  GMM    MOC       Commu  GMM    MOC
WABO         0.718  0.624  0.757     0.651  0.812  0.735     0.683  0.706  0.745
Volvo        0.195  0.186  0.24      0.948  0.944  0.94      0.324  0.31   0.382

Table 7. Results of comparing with Commu on the grouping statistics.

             Avg. group size                       Avg. number of membership
Event log    Ground truth  Commu  GMM    MOC       Ground truth  Commu  GMM    MOC
WABO         19.0          28.8   28.4   22.8      3.886         5.886  5.818  4.659
Volvo        25.5          152.6  146.5  110.9     1.176         7.025  6.745  5.105

5.4 Discussion

We can draw some insights from the results of both experiments. The first concerns the comparison of effectiveness between GMM and MOC. The experiments show that MOC performs better, as indicated by its higher precision and F-measure, along with grouping characteristics that are more similar to those of the ground truth model. Considering also that MOC does not require the cumbersome choice of an extra threshold parameter, we conclude that MOC serves as a better solution than GMM for discovering organizational models with overlaps.

On the other hand, our solution has a shortcoming that becomes significant when the latent organizational model is less overlapped. We see two possible reasons for this. The first is the relatively small number of activity types compared to the number of resources. The second is the lack of constraints on the number of groups to which each resource may be assigned. As the former is determined by the content of the event log, we discuss remedies for the latter.

Given no constraints, both GMM and MOC may relate resources to many organizational groups as long as the value of the objective function is being optimized. This eventually causes an unrealistic mining result in which one resource is a member of many organizational groups simultaneously, diverging from the reality that some resources may possess few or no shared group identities, as in the Volvo case. To solve this, a natural idea is to introduce constraints that mitigate the problem of involving too many resources. Yet this would require more prior knowledge of the underlying organizational structure. Nevertheless, we argue that such an improvement needs only a slight modification of the current solution. For GMM, it requires a proper threshold value. For MOC, heuristics are to be introduced to prune the search space when updating the estimate of membership. Another remedy could be to combine the proposed solution with the traditional disjoint method: given an organizational model mining task for which the performer by activity matrix has been built and the similarity measure has been specified, one may first mine a disjoint model using the traditional method, and utilize the obtained model statistics for the guided initialization of the parameters. Then, GMM or MOC is applied to discover an organizational model with potential overlaps. We leave the exploration of these improvements to our future research on the topic.
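To make the idea of constraining memberships concrete, the following hypothetical sketch turns soft membership scores (for example, posterior probabilities from a mixture model) into group assignments using a threshold and an optional cap on the number of groups per resource. The function name, the parameter `max_groups` and the fallback to the most likely group are assumptions for illustration, not part of the evaluated solutions.

```python
def assign_groups(posteriors, threshold, max_groups=None):
    """Assign each resource to the groups whose score reaches `threshold`.

    `posteriors` maps each resource to a list of membership scores, one per
    group. `max_groups` optionally caps the memberships per resource
    (a hypothetical constraint introduced for this sketch).
    """
    assignment = {}
    for resource, scores in posteriors.items():
        # Rank group indices from the most to the least likely.
        ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
        chosen = [g for g in ranked if scores[g] >= threshold]
        if not chosen:
            # Assumption: every resource keeps at least its most likely group.
            chosen = ranked[:1]
        if max_groups is not None:
            chosen = chosen[:max_groups]
        assignment[resource] = set(chosen)
    return assignment


# Hypothetical soft memberships over three groups.
posteriors = {"ann": [0.5, 0.3, 0.2], "bob": [0.9, 0.05, 0.05]}
unconstrained = assign_groups(posteriors, threshold=0.25)
capped = assign_groups(posteriors, threshold=0.25, max_groups=1)
```

A lower threshold yields more overlaps, while the cap bounds the number of memberships regardless of the threshold, at the cost of requiring prior knowledge about how many groups a resource can plausibly belong to.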

6 Conclusion

Organizational model mining techniques enable the discovery of organizational models from event logs. In this paper, we relax the assumption of disjoint organizational groups held by existing methods and discover organizational models in which individual resources may share multiple group identities. We draw on overlapping clustering techniques and introduce two solutions, GMM and MOC, for deriving organizational models with overlaps. Results from experiments on real-life event log data demonstrate the applicability and effectiveness of the methods. We also recognize a potential limitation of our solution and analyze the reasons behind it, which leads to the identification of potential heuristics for further improving the current approach.

In future work we will consider the following aspects: (1) to improve our

approach by eﬀectively incorporating the identiﬁed heuristics; (2) to link the

current research with performance analysis on generalist resources; (3) to con-

duct evaluation on more real-life cases.

Acknowledgments. This work is supported by the National Key Research

and Development Program of China (Grant No. 2017YFB0202200); the National

Natural Science Foundation of China (Grant No. 61572539); the Research Foun-

dation of Science and Technology Plan Project in Guangdong Province (Grant

No. 2016B050502006); and the Research Foundation of Science and Technology

Plan Project in Guangzhou City (Grants No. 2016201604030001, 201704020092).

References

1. Song, M., van der Aalst, W.M.P.: Towards comprehensive support for organiza-

tional mining. Decision Support Systems 46(1) (2008) 300–317

2. Reijers, H., Mansar, S.L.: Best practices in business process redesign: an overview

and qualitative evaluation of successful redesign heuristics. Omega 33(4) (2005)

283 – 306

3. Jin, T., Wang, J., Wen, L.: Organizational modeling from event logs. In: Interna-

tional Conference on Grid and Cooperative Computing (GCC). (2007) 670–675

4. van Zelst, S.J., van Dongen, B.F., van der Aalst, W.M.P.: Online discovery of

cooperative structures in business processes. In: OTM Confederated International

Conferences, Springer (2016) 210–228

5. Daft, R.L.: Organization Theory and Design. (2010)

6. van der Aalst, W.M.P., van Hee, K.: Workﬂow Management: Models, Methods,

and Systems. MIT Press, Cambridge, MA, USA (2004)

7. van der Aalst, W.M.P., Reijers, H.A., Song, M.: Discovering social networks from

event logs. Computer Supported Cooperative Work (CSCW) 14(6) (2005) 549–593

8. Appice, A.: Towards mining the organizational structure of a dynamic event sce-

nario. Journal of Intelligent Information Systems 50(1) (Feb 2018) 165–193

9. Burattin, A., Sperduti, A., Veluscek, M.: Business models enhancement through

discovery of roles. In: IEEE Symposium on Computational Intelligence and Data

Mining (CIDM). (2013) 103–110

10. Liu, R., Agarwal, S., Sindhgatta, R.R., Lee, J.: Accelerating collaboration in task

assignment using a socially enhanced resource model. In: Business Process Man-

agement, Springer (2013) 251–258

11. Ferreira, D.R., Alves, C.: Discovering user communities in large event logs. In:

International Conference on Business Process Management. (2011) 123–134

12. Rinderle-Ma, S., van der Aalst, W.M.P.: Life-cycle support for staff assignment

rules in process-aware information systems. Technical Report 213, TU Eindhoven

(2007)

13. Schönig, S., Cabanillas, C., Jablonski, S., Mendling, J.: A framework for efficiently

mining the organisational perspective of business processes. Decision Support Sys-

tems 89 (2016) 87 – 97

14. Nakatumba, J., van der Aalst, W.M.P.: Analyzing resource behavior using process

mining. In: International Conference on Business Process Management, Springer

(2009) 69–80

15. Pika, A., Leyer, M., Wynn, M.T., Fidge, C.J., ter Hofstede, A.H.M., van der Aalst,

W.M.P.: Mining resource proﬁles from event logs. ACM Trans. Manage. Inf. Syst.

8(1) (March 2017) 1:1–1:30

16. Han, J., Pei, J., Kamber, M.: Data mining: concepts and techniques. Elsevier

(2011)

17. Banerjee, A., Krumpelman, C., Ghosh, J., Basu, S., Mooney, R.J.: Model-based

overlapping clustering. In: Proceedings of the Eleventh ACM SIGKDD Interna-

tional Conference on Knowledge Discovery in Data Mining. (2005) 532–537

18. Buijs, J.: Receipt phase of an environmental permit application process (WABO),

CoSeLoG project (2014)

19. Steeman, W.: BPI challenge 2013 (2013)

20. IEEE: IEEE Standard for eXtensible Event Stream (XES) for Achieving Interop-

erability in Event Logs and Event Streams. Technical report (Nov 2016) IEEE Std

1849-2016.

21. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering

evaluation metrics based on formal constraints. Information Retrieval 12(4) (2009)

461–486