All content in this area was uploaded by Artem Polyvyanyy on Dec 02, 2020

Postprint, June 2020

A Framework for Estimating Simplicity of

Automatically Discovered Process Models Based on

Structural and Behavioral Characteristics

Anna Kalenkova, Artem Polyvyanyy, and Marcello La Rosa

School of Computing and Information Systems

The University of Melbourne, Parkville, VIC, 3010, Australia

{anna.kalenkova;artem.polyvyanyy;marcello.larosa}@unimelb.edu.au

Abstract. A plethora of algorithms for automatically discovering process mod-

els from event logs has emerged. The discovered models are used for analysis and

come with a graphical ﬂowchart-like representation that supports their compre-

hension by analysts. According to the Occam’s Razor principle, a model should

encode the process behavior with as few constructs as possible, that is, it should

not be overcomplicated without necessity. The simpler the graphical representa-

tion, the easier the described behavior can be understood by a stakeholder. Con-

versely, and intuitively, a complex representation should be harder to understand.

Although various conformance checking techniques that relate the behavior of

discovered models to the behavior recorded in event logs have been proposed,

there are no methods for evaluating whether this behavior is represented in the

simplest possible way. Existing techniques for measuring the simplicity of dis-

covered models focus on their structural characteristics such as size or density,

and ignore the behavior these models encode. In this paper, we present a conceptual framework that can be instantiated into a concrete approach for estimating

the simplicity of a model, considering the behavior the model describes, thus al-

lowing a more holistic analysis. The reported evaluation over real-life event logs

for several instantiations of the framework demonstrates its feasibility in practice.

1 Introduction

Information systems keep records of the business processes they support in the form of

event logs. An event log is a collection of traces encoding timestamped actions undertaken to execute the corresponding process. Thus, such logs contain valuable information

on how business processes are carried out in the real world. Process mining [1] aims

to exploit this historical information to understand, analyze, and ultimately improve

business processes. A core problem in process mining is that of automatically discov-

ering a process model from an event log. Such a model should faithfully encode the

process behavior captured in the log and, hence, meet a range of criteria. Speciﬁcally,

a discovered model should describe the traces recorded in the log (have good ﬁtness),

not encode traces not present in the log (have good precision), capture possible traces

that may stem from the same process but are not present in the log (have good gener-

alization), and be “simple”. These quality measures for discovered process models are

studied within the conformance checking area of process mining. A good discovered

model is thus supposed to achieve a good balance between these criteria [3].


In [1], Van der Aalst suggests that process discovery should be guided by the Oc-

cam’s Razor principle [10,8], a problem-solving principle attributed to William of Ock-

ham. Accordingly, “one should not increase, beyond what is necessary, the number

of entities required to explain anything” [1]. Speciﬁcally to process mining, a discov-

ered process model should only contain the necessary constructs. Various measures

for assessing whether a discovered model is simple have been proposed in the liter-

ature [9,15], such as the number of nodes and arcs, density, and diameter. However,

these measures address the number of constructs, i.e., the structure of the discovered

models, while ignoring what these constructs describe, i.e., the process behavior.

In this paper, we present a framework that considers the model’s structure and be-

havior to operationalize Occam’s Razor principle for measuring the simplicity of a pro-

cess model discovered from an event log. The framework comprises three components

that can selectively be conﬁgured: (i) a notion for measuring the structural complexity

of a process model, e.g., size or diameter; (ii) a notion for assessing the behavioral sim-

ilarity, or equivalence, of process models, e.g., trace equivalence, bisimulation, or en-

tropy; and (iii) the representation bias, i.e., a modeling language for describing models.

A conﬁgured framework results in an approach for estimating the simplicity of process

models. The obtained simplicity score establishes whether the behavior captured by the

model can be encoded in a structurally simpler model. To this end, the structure of the

model is related to the structures of other behaviorally similar process models.

To demonstrate these ideas, we instantiate the framework with the number of nodes

[9] and control ﬂow complexity (cfc) [4] measures of structural complexity, topological

entropy [20] measure of behavioral similarity, and uniquely labeled block-structured

[19] process models captured in the Business Process Model and Notation (BPMN)

[18] as the representation bias. We then apply these framework instantiations to assess

the simplicity of the process models automatically discovered from event logs by the

Inductive miner algorithm [16]. This algorithm constructs process trees, which can then

be converted into uniquely labeled block-structured BPMN models [14].

Once the framework is conﬁgured, the next challenge is to obtain models of various

structures that specify behaviors similar to that captured by the given model, as these

are then used to establish and quantify the amount of unnecessary structural information

in the given model. To achieve completeness, one should aim to obtain all the similar

models, including the simplest ones. As an exhaustive approach for synthesizing all

such models is often unfeasible in practice, in this paper, we take an empirical approach

and synthesize random models that approximate those models with similar behavior. To

implement such approximations, we developed a tool that generates uniquely labeled

block-structured BPMN models randomly, or exhaustively for some restricted cases,

and measures their structural complexity and behavioral similarity.

The remainder of this paper is organized as follows. Section 2 discusses an example

that motivates the problem of ignoring the behavior when measuring the simplicity of

process models. Section 3 presents our framework for estimating the simplicity of the

discovered models. In Section 4, we instantiate the framework with concrete compo-

nents. Section 5 presents the results of an analysis of process models discovered from

real-life event logs using our framework instantiations. Section 6 concludes the paper.


2 Motivating Example

In this section, we show that the existing simplicity measures do not always follow the

Occam’s Razor principle. Consider event log

$$
\begin{aligned}
L = \{ & \langle \text{load page}, \text{fill name}, \text{fill passport}, \text{fill expire date} \rangle, \\
& \langle \text{load page}, \text{fill name}, \text{fill expire date}, \text{fill passport} \rangle, \\
& \langle \text{load page}, \text{fill passport}, \text{fill expire date} \rangle, \\
& \langle \text{load page}, \text{fill name}, \text{fill expire date} \rangle, \\
& \langle \text{load page}, \text{fill name}, \text{fill passport} \rangle, \\
& \langle \text{load page}, \text{fill name}, \text{fill passport}, \text{fill expire date}, \text{load page} \rangle, \\
& \langle \text{load page}, \text{fill name}, \text{fill expire date}, \text{fill passport}, \text{load page} \rangle, \\
& \langle \text{load page}, \text{fill name}, \text{load page} \rangle, \\
& \langle \text{fill name}, \text{fill expire date}, \text{fill passport} \rangle, \\
& \langle \text{fill passport}, \text{fill expire date} \rangle \}
\end{aligned}
$$

generated by a passport renewal information system.¹ The log contains ten traces, each encoded as a sequence of events, or steps, taken by the users of the system. Usually, the user loads the Web page and fills out relevant forms with details such as name, previous passport number, and expiry date. Some steps in the traces may be skipped or repeated, as is common for real-world event data [13]. Fig. 1 and Fig. 2 present BPMN models discovered from $L$ using, respectively, the Split miner [2] and Inductive miner (with the noise threshold set to 0.2) [16] process discovery algorithms.

[Figure: BPMN model with tasks load page, fill name, fill passport, and fill expire date.]
Fig. 1: A BPMN model discovered by Split miner from event log L.

It is evident from the ﬁgures that the models are different. First, they have differ-

ent structures. Fig. 1 shows an acyclic model with exclusive and parallel branches. In

contrast, the model in Fig. 2 only contains exclusive branches enclosed in a loop that allows any of the steps to be skipped. Second, the models describe different collections of traces. While the model in Fig. 1 describes three traces (viz. $\langle \text{load page} \rangle$, $\langle \text{load page}, \text{fill name}, \text{fill passport}, \text{fill expire date} \rangle$, and $\langle \text{load page}, \text{fill name}, \text{fill expire date}, \text{fill passport} \rangle$), the model in Fig. 2 describes all the possible traces over the given steps, i.e., all the possible sequences of the steps, including repetitions.

[Figure: BPMN model with tasks fill name, fill passport, fill expire date, and load page.]
Fig. 2: A BPMN model discovered by Inductive miner from event log L.

The latter fact is also evident in the precision and recall values between the models

and log. The precision and recall values of the model in Fig. 1 are 0.852 and 0.672,

¹ This simple example is inspired by a real-world event log analyzed in [13].


respectively, while the precision and recall values of the model in Fig. 2 are 0.342 and

1.0, respectively; the values were obtained using the entropy-based measures presented

in [20]. The values indicate, for instance, that the model in Fig. 2 is more permissive

(has lower precision), i.e., encodes more behavior not seen in the log than the model in

Fig. 1, and describes all the traces in the log (has perfect ﬁtness of 1.0); the measures

take values on the interval [0,1] with larger values showing better precision and ﬁtness.

[Figure: BPMN model with tasks load page, fill name, fill passport, and fill expire date.]
Fig. 3: A “flower” model.

To assess the simplicity of discovered process models, measures of their structural complexity [4,9,15,17], such as the number of nodes and/or edges, density, depth, coefficients of network connectivity, and control flow complexity, can be employed. If one relies on the number of nodes to establish the simplicity of the two example models, then one will arrive at the conclusion that they are equally simple, as both contain ten nodes. This conclusion is, however, naïve for at least two reasons: (i) the two models use ten nodes to encode different behaviors, and (ii) it may be unnecessary to use ten nodes to encode the corresponding behaviors.

The model in Fig. 3 describes the same behavior as the model in Fig. 2 using eight

nodes. One can use different notions to establish similarity of the behaviors, including

exact (e.g., trace equivalence) or approximate (e.g., topological entropy). The models in

Figs. 2 and 3 are trace equivalent and specify the behaviors that have the (short-circuit)

topological entropy of 1.0 [20]. Intuitively, the entropy measures the “variety” of traces

of different lengths speciﬁed by the model. The more distinct traces of different lengths

the model describes, the closer the entropy is to 1.0. The entropy of the model in Fig. 1 is

0.185. There is no block-structured BPMN model with unique task labels that describes

the behavior with the entropy of 0.185 and uses fewer than ten nodes. Thus, we argue that the model in Fig. 1 should be regarded as simpler than the model in Fig. 2.

3 A Framework for Estimating Simplicity of Process Models

In this section, we present our framework for estimating the simplicity of process mod-

els. The framework describes standard components that can be conﬁgured to result in

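a concrete measure of simplicity. The simplicity framework is a tuple $F = (\mathcal{M}, \mathcal{C}, \mathcal{B})$, where $\mathcal{M}$ is a collection of process models, $\mathcal{C} : \mathcal{M} \to \mathbb{R}^+_0$ is a measure of structural complexity, and $\mathcal{B} \subseteq (\mathcal{M} \times \mathcal{M})$ is a behavioral equivalence relation over $\mathcal{M}$.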

The process models are captured using some process modeling language (represen-

tation bias), e.g., ﬁnite state machines [11], Petri nets [21], or BPMN [18]. The mea-

sure of structural complexity is a function that maps the models onto non-negative real

numbers, with smaller assigned numbers indicating simpler models. For graph-based

models, this can be the number of nodes and edges, density, diameter, or some other

existing measure of simplicity used in process mining [17]. The behavioral equivalence

relation $\mathcal{B}$ must define an equivalence relation over $\mathcal{M}$, i.e., be reflexive, symmetric, and transitive. For instance, $\mathcal{B}$ can be given by (weak or strong) bisimulation [12] or trace equivalence [11] relation over models. Alternatively, equivalence classes of $\mathcal{B}$ can be defined by models with the same or similar measure of behavioral complexity, e.g., (short-circuit) topological entropy [20].

[Figure: (a) An equivalence class. (b) Similar equivalence classes.]
Fig. 4: Behavioral classes of model equivalence.

Given a model $m_1 \in \mathcal{M}$, its behavioral equivalence class per relation $\mathcal{B}$ is the set $M = \{m \in \mathcal{M} \mid (m, m_1) \in \mathcal{B}\}$, cf. Fig. 4a. If one knows a model $m_* \in M$ with the lowest structural complexity in $M$, i.e., $\forall m \in M : \mathcal{C}(m_*) \leq \mathcal{C}(m)$, then they can put the simplicity of models in $M$ into the perspective of the simplicity of $m_*$. For instance, one can use function $sim(m) = \frac{\mathcal{C}(m_*) + 1}{\mathcal{C}(m) + 1}$ to establish such a perspective.

Suppose that $\mathcal{M}$ is the set of all block-structured BPMN models with four uniquely labeled tasks, $\mathcal{B}$ is the trace equivalence relation, and $\mathcal{C}$ is the measure of the number of nodes in the models. Then, it holds that $sim(m_1) = 1.0$ and $sim(m_2) = 9/11 = 0.818$, where $m_1$ and $m_2$ are the models from Fig. 1 and Fig. 2, respectively, indicating that $m_1$

is simpler than m2. To obtain these simplicity values, we used our tool and generated

all the block-structured BPMN models over four uniquely labeled tasks, computed all

the behavioral equivalence classes over the generated models, and collected statistics

on the numbers of nodes in the models.
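The ratio above can be sketched in a few lines of Python. This is only an illustration of the formula; the function and argument names are assumptions made here and are not taken from the authors' tool.

```python
def relative_simplicity(c_min: float, c_model: float) -> float:
    """Return sim(m) = (C(m*) + 1) / (C(m) + 1).

    c_min   -- C(m*), the lowest structural complexity in the equivalence class
    c_model -- C(m), the structural complexity of the analyzed model
    """
    if c_min < 0 or c_model < c_min:
        raise ValueError("expected 0 <= C(m*) <= C(m)")
    return (c_min + 1) / (c_model + 1)


# Number-of-nodes values reported in the text: the model of Fig. 2 has ten
# nodes, while the trace-equivalent "flower" model of Fig. 3 has eight.
print(round(relative_simplicity(8, 10), 3))  # 0.818
# A model that is itself minimal in its class scores 1.0.
print(relative_simplicity(10, 10))           # 1.0
```

Adding one to both terms keeps the ratio defined even for a complexity measure that can evaluate to zero.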

For some configurations of the framework, however, such exhaustive analysis may be intractable. For instance, the collection of models of interest may be infinite, or finite but immense. Note that the number of block-structured BPMN models with four uniquely labeled tasks is 2,211,840, and is 297,271,296 if one considers models with five uniquely labeled tasks (the number of models grows exponentially with the number of allowed labels). In such cases, we suggest grounding the analysis in a representative subset $\mathcal{M}' \subset \mathcal{M}$ of the models.

Suppose that one analyzes model $m \in \mathcal{M}$ that has no other (or only a few) models in its equivalence class $M$, refer to Fig. 4b. Then, model $m$ can be compared to models of lowest structural complexities $m'_*$ and $m''_*$ from some other equivalence classes $M'$ and $M''$ which contain models that describe the behaviors “similar” to the one captured by $m$. To this end, one needs to establish a measure of “similarity” between the behavioral equivalence classes of models.

In the next section, we exemplify the discussed concepts by presenting example

instantiations of the framework.

4 Framework Instantiations

In this section, we instantiate our framework $F = (\mathcal{M}, \mathcal{C}, \mathcal{B})$ for assessing the simplicity of process models discovered from event logs and define the set of models ($\mathcal{M}$), structural complexity ($\mathcal{C}$), and the behavioral equivalence relation ($\mathcal{B}$) as follows:


– $\mathcal{M}$ is a set of block-structured BPMN models with a fixed number of uniquely labeled tasks. BPMN is one of the most popular process modeling languages. Besides, block-structured uniquely labeled process models are discovered by Inductive miner, a widely used process discovery algorithm;

– $\mathcal{C}$ is either the number of nodes or the control flow complexity measure. These measures were selected among other simplicity measures because, as shown empirically in Section 5, there is a relation between these measures and the behavioral characteristics of process models; and

– $\mathcal{B}$ is the behavioral equivalence relation induced by the notion of (short-circuit) topological entropy [20]. The entropy measure is selected because it maps process models onto non-negative real numbers that reflect the complexity of the behaviors they describe; the greater the entropy, the more variability is present in the underlying behavior. Consequently, models from an equivalence class of $\mathcal{B}$ describe behaviors with the same (or very similar) entropy values.

For BPMN models, the problem of minimization is still open, and only some rules for local simplification of BPMN models exist [14,22]. Although NP-complete techniques [5] for synthesizing Petri nets with minimal regions (corresponding to BPMN models [14] with a minimal number of routing constructs) from sets of traces can be applied, there is no general algorithm for finding a block-structured BPMN model that contains a minimal possible number of nodes or has a minimal control flow complexity for a given process behavior. In this case, it may be feasible to generate the set of all possible process models for the given behavioral class (see the general description of this approach within our framework in Section 3, Fig. 4a). However, due to the combinatorial explosion, the possible number of block-structured BPMN models grows exponentially with the number of tasks. While it is still possible to generate all block-structured BPMN models containing four or fewer tasks, for larger numbers of tasks this problem is computationally expensive and cannot be solved in any reasonable amount of time. In this work, we propose an approach that approximates the exact solutions by comparing analyzed models with only

some randomly generated models that behave similarly. This approach implements a

general approximation idea proposed within our framework (Section 3, Fig. 4b). Sec-

tion 4.1 introduces basic notions used to describe this approach. Section 4.2 describes

the proposed approach, discusses its parameters and analyzes dependencies between

structural and behavioral characteristics of block-structured BPMN models.

4.1 Basic Notions

In this subsection, we deﬁne basic notions that are used later in this section.

Let $X$ be a finite set of elements. By $\langle x_1, x_2, \ldots, x_k \rangle$, where $x_1, x_2, \ldots, x_k \in X$, $k \in \mathbb{N}_0$, we denote a finite sequence of elements over $X$ of length $k$. $X^*$ stands for the set of all finite sequences over $X$ including the empty sequence of zero length.

Given two sequences $x = \langle x_1, x_2, \ldots, x_k \rangle$ and $y = \langle y_1, y_2, \ldots, y_m \rangle$, by $x \cdot y$ we denote concatenation of $x$ and $y$, i.e., the sequence obtained by appending $y$ to the end of $x$: $x \cdot y = \langle x_1, x_2, \ldots, x_k, y_1, y_2, \ldots, y_m \rangle$.

[Figure: (a) Initial pattern. (b) Sequence pattern. (c) Choice pattern. (d) Parallel pattern. (e) Loop pattern. (f) Skip pattern.]
Fig. 5: Patterns of block-structured BPMN models.

An alphabet is a nonempty finite set. The elements of an alphabet are its labels. A (formal) language $L$ over an alphabet $\Sigma$ is a (not necessarily finite) set of sequences over $\Sigma$, i.e., $L \subseteq \Sigma^*$. Let $L_1$ and $L_2$ be two languages. Then, $L_1 \circ L_2$ is their concatenation defined by $\{l_1 \cdot l_2 \mid l_1 \in L_1 \wedge l_2 \in L_2\}$. The language $L^*$ is defined as $L^* = \bigcup_{n=0}^{\infty} L^n$, where $L^0 = \{\langle\rangle\}$, $L^n = L^{n-1} \circ L$.
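As an illustration of language concatenation and of the inductive definition of $L^n$, the following sketch represents sequences as Python tuples and languages as sets of tuples. The helper names are chosen here for illustration; since $L^*$ is infinite in general, only the finite union $L^0 \cup \ldots \cup L^k$ is computed.

```python
from itertools import product


def concat(l1, l2):
    """L1 ∘ L2: every sequence of L1 followed by every sequence of L2."""
    return {x + y for x, y in product(l1, l2)}


def star_up_to(language, max_power):
    """Union of L^0 .. L^max_power, a finite approximation of L*."""
    power = {()}                         # L^0 contains only the empty sequence
    result = set(power)
    for _ in range(max_power):
        power = concat(power, language)  # L^n = L^(n-1) ∘ L
        result |= power
    return result


L1 = {("a",), ("a", "b")}
L2 = {("c",)}
print(sorted(concat(L1, L2)))                # [('a', 'b', 'c'), ('a', 'c')]
print(len(star_up_to({("a",), ("b",)}, 3)))  # 15 = 1 + 2 + 4 + 8
```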

Structural Representation. The class of process models considered in this work is that of block-structured BPMN models, which are often used for the representation of processes discovered from event logs, e.g., these models are discovered by the Inductive miner algorithm [16].

Block-structured BPMN models are constructed from the following basic set of el-

ements: start and end events represented by circles with thin and thick borders respec-

tively and denoting beginning and termination of the process; tasks modeling atomic

process steps and depicted by rounded rectangles with labels; routing exclusive and par-

allel gateways modeling exclusive and parallel executions and presented by diamonds;

and control ﬂow arcs that deﬁne the order in which elements are executed.

The investigated class of block-structured BPMN models consists of all and only

BPMN models that:

1) can be constructed starting from the initial model presented in Fig. 5a and induc-

tively replacing tasks with the patterns presented in Figs. 5b to 5f;

2) have uniquely labeled tasks, i.e., any two tasks have different labels;

3) are constructed such that only patterns other than loop are applied to the nested task of the loop pattern, and only patterns other than skip and loop are applied to the nested task of the skip pattern.

When constructing a model, the number of tasks increases if the patterns from Figs. 5b to 5d are applied, the patterns from Figs. 5e and 5f can be applied no more than twice in a row, and the pattern from Fig. 5a is applied only once. Hence, if we fix the number of tasks (labels) in the investigated models, the overall set of these models is finite.

After constructing the collection of models, local minimization rules are applied [14]. These rules merge gateways without changing the model semantics. An example of local reduction of gateways is presented in Fig. 6a; Fig. 6b illustrates merging of loop and skip constructs. For the detailed description of local minimization rules refer to [14].

[Figure: (a) Merging parallel gateways. (b) Merging loop and skip constructs.]
Fig. 6: Examples of applying local minimization rules.

We focus on the following complexity measures of block-structured BPMN models: (1) $\mathcal{C}_n$, the number of nodes (including start and end events, tasks, and gateways); (2) $\mathcal{C}_{cfc}$, the control flow complexity measure, which is defined as a sum of two numbers: the number of all splitting parallel gateways and the total number of all outgoing control flows of all splitting exclusive (choice) gateways [4].
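The two measures can be computed directly from a graph encoding of a model. The sketch below assumes a hypothetical encoding (a dictionary from node names to node kinds plus a list of arcs) and treats a gateway as splitting when it has more than one outgoing arc; neither the encoding nor the example model comes from the paper.

```python
from collections import Counter


def node_count(nodes):
    """C_n: number of nodes (events, tasks, and gateways)."""
    return len(nodes)


def control_flow_complexity(nodes, arcs):
    """C_cfc: each splitting parallel gateway counts once; each splitting
    exclusive gateway contributes its number of outgoing arcs."""
    outgoing = Counter(source for source, _target in arcs)
    total = 0
    for name, kind in nodes.items():
        if outgoing[name] < 2:      # not a splitting gateway
            continue
        if kind == "parallel":
            total += 1
        elif kind == "exclusive":
            total += outgoing[name]
    return total


# Hypothetical model: a choice between a and b, followed by c and d in parallel.
nodes = {
    "start": "event", "end": "event",
    "x_split": "exclusive", "x_join": "exclusive",
    "p_split": "parallel", "p_join": "parallel",
    "a": "task", "b": "task", "c": "task", "d": "task",
}
arcs = [
    ("start", "x_split"), ("x_split", "a"), ("x_split", "b"),
    ("a", "x_join"), ("b", "x_join"), ("x_join", "p_split"),
    ("p_split", "c"), ("p_split", "d"), ("c", "p_join"), ("d", "p_join"),
    ("p_join", "end"),
]
print(node_count(nodes))                     # 10
print(control_flow_complexity(nodes, arcs))  # 3 = 2 (choice) + 1 (parallel)
```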

Sequences of labels are used to encode executions of business processes. The or-

dering of tasks being executed deﬁnes the ordering of labels in a sequence. We say

that a process model encodes or accepts a formal language if and only if this lan-

guage contains all possible sequences of labels corresponding to the orderings of tasks

being executed within the model and only them. Fig. 7a presents an example of a block-structured model $m_1$ that accepts language $L_1 = \{\langle a, b, c \rangle, \langle a, c, b \rangle, \langle b, a, c \rangle, \langle b, c, a \rangle, \langle c, a, b \rangle, \langle c, b, a \rangle\}$. A block-structured BPMN model $m_2$ accepting an infinite language of all sequences starting with $a$ in alphabet $\{a, b, c\}$ is presented in Fig. 7b.

[Figure: (a) Block-structured BPMN model $m_1$. (b) Block-structured BPMN model $m_2$.]
Fig. 7: Examples of block-structured BPMN models.
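To make the notion of an accepted language concrete, the sketch below enumerates the language of a loop-free block-structured model. It assumes a hypothetical nested-tuple encoding of blocks (`task`, `seq`, `xor`, `and`, `skip`) and interprets parallel blocks as interleavings of atomic tasks; loops are left out because they yield infinite languages.

```python
def interleavings(x, y):
    """All ways to merge sequences x and y while keeping each one's order."""
    if not x or not y:
        return {x + y}
    return ({x[:1] + rest for rest in interleavings(x[1:], y)}
            | {y[:1] + rest for rest in interleavings(x, y[1:])})


def language(block):
    """Finite language of a loop-free block given as a nested tuple."""
    kind, *children = block
    if kind not in {"task", "seq", "xor", "and", "skip"}:
        raise ValueError(f"unknown block kind: {kind}")
    if kind == "task":
        return {(children[0],)}
    parts = [language(child) for child in children]
    if kind == "skip":                  # the nested block may be bypassed
        return parts[0] | {()}
    result = parts[0]
    for part in parts[1:]:
        if kind == "seq":
            result = {x + y for x in result for y in part}
        elif kind == "xor":
            result = result | part
        else:                           # "and": parallel execution
            result = {z for x in result for y in part
                      for z in interleavings(x, y)}
    return result


# Three tasks in parallel, as in the description of model m1.
m1 = ("and", ("task", "a"), ("task", "b"), ("task", "c"))
print(len(language(m1)))  # 6
```

For three parallel tasks, the result coincides with the six orderings listed for $L_1$ above.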

Behavioral Representation. Next, we recall the notion of entropy which is used for

the behavioral analysis of process models and event logs [20]. Let $\Sigma$ be an alphabet and let $L \subseteq \Sigma^*$ be a language over this alphabet. We say that language $L$ is an irreducible regular language if and only if it is accepted by a strongly connected automaton (for details refer to [6]). Let $C_n(L)$, $n \in \mathbb{N}_0$, be the set of all sequences in $L$ of length $n$. Then, the topological entropy that estimates the cardinality of $L$ by measuring the ratio of the number of distinct sequences in the language to the length of these sequences is defined as [6]:

$$
ent(L) = \limsup_{n \to \infty} \frac{\log |C_n(L)|}{n}. \qquad (1)
$$
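Eq. (1) can be illustrated numerically by counting the accepted sequences of a given length in a deterministic automaton and evaluating $\log |C_n(L)| / n$ for a large $n$. The sketch below is not the computation used in [20]; the automaton (sequences over {a, b} without two consecutive b's), the helper name, and the choice of base-2 logarithms are assumptions made for this example.

```python
import math


def count_words(transitions, initial, accepting, length):
    """|C_n(L)|: number of accepted sequences with exactly `length` labels.

    `transitions` maps (state, label) to the next state of a deterministic
    automaton, so distinct paths correspond to distinct sequences.
    """
    counts = {initial: 1}
    for _ in range(length):
        following = {}
        for (source, _label), target in transitions.items():
            if source in counts:
                following[target] = following.get(target, 0) + counts[source]
        counts = following
    return sum(n for state, n in counts.items() if state in accepting)


# Strongly connected automaton: after a "b" only an "a" may follow.
dfa = {("A", "a"): "A", ("A", "b"): "B", ("B", "a"): "A"}
accepting = {"A", "B"}

print([count_words(dfa, "A", accepting, n) for n in range(1, 6)])  # [2, 3, 5, 8, 13]

n = 400
estimate = math.log2(count_words(dfa, "A", accepting, n)) / n
print(round(estimate, 2))  # 0.69
```

The counts follow the Fibonacci numbers, so the estimate approaches $\log_2((1 + \sqrt{5})/2) \approx 0.694$ as $n$ grows.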


The languages accepted by block-structured BPMN models are regular, because they are also accepted by corresponding automata models [11]. However, not all of them are irreducible, so the standard topological entropy (Eq. (1)) cannot always be calculated. To address this, in [20], it was proposed to construct an irreducible language $(L \circ \{\langle \chi \rangle\})^* \circ L$, where $\chi \notin \Sigma$, for each language $L$, and use the so-called short-circuit entropy $ent_\bullet(L) = ent((L \circ \{\langle \chi \rangle\})^* \circ L)$. Monotonicity of the short-circuit measure follows immediately from the definition of the short-circuit topological entropy and Lemma 4.7 in [20]:

Corollary 4.1 (Topological entropy). Let $L_1$ and $L_2$ be two regular languages.

1. If $L_1 = L_2$, then $ent_\bullet(L_1) = ent_\bullet(L_2)$;

2. If $L_1 \subset L_2$, then $ent_\bullet(L_1) < ent_\bullet(L_2)$.

Note that the opposite is not always true, i.e., different languages can be represented

by the same entropy value. Although the language (trace) equivalence is stricter than

the entropy-based equivalence, in the next section, we show that entropy is still useful

for classifying the process behavior.
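A possible way to compute the short-circuit entropy for small examples is sketched below. It is not the implementation from [20]: it assumes a deterministic automaton in which every state lies on a path from the initial state to an accepting state, adds a fresh χ-labeled arc from each accepting state back to the initial state, and relies on the standard fact that the topological entropy of a language accepted by a strongly connected deterministic automaton is the logarithm of the largest eigenvalue of its adjacency matrix. The eigenvalue is approximated by power iteration on the adjacency matrix plus the identity; base-2 logarithms and all names are choices made here.

```python
import math


def short_circuit(transitions, initial, accepting, chi="χ"):
    """Add a chi-labeled arc from every accepting state to the initial state."""
    closed = dict(transitions)
    for state in accepting:
        closed[(state, chi)] = initial
    return closed


def largest_eigenvalue(transitions, iterations=500):
    """Approximate the largest eigenvalue of the adjacency matrix A by power
    iteration on A + I (the shift avoids oscillation on periodic graphs)."""
    states = list({source for source, _ in transitions} | set(transitions.values()))
    index = {state: i for i, state in enumerate(states)}
    size = len(states)
    adjacency = [[0] * size for _ in range(size)]
    for (source, _label), target in transitions.items():
        adjacency[index[source]][index[target]] += 1
    vector = [1.0] * size
    shifted = 1.0
    for _ in range(iterations):
        image = [vector[i] + sum(adjacency[i][j] * vector[j] for j in range(size))
                 for i in range(size)]
        shifted = max(image)
        vector = [value / shifted for value in image]
    return shifted - 1.0


def short_circuit_entropy(transitions, initial, accepting):
    closed = short_circuit(transitions, initial, accepting)
    return math.log2(largest_eigenvalue(closed))


sequence = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3}      # accepts only <a, b, c>
choice = {(0, "a"): 1, (0, "b"): 1}                     # accepts <a> and <b>
flower = {(0, label): 0 for label in "abcd"}            # accepts all sequences

print(round(short_circuit_entropy(sequence, 0, {3}), 3))  # 0.0
print(round(short_circuit_entropy(choice, 0, {1}), 3))    # 0.5
print(round(short_circuit_entropy(flower, 0, {0}), 3))    # 2.322
```

The three results respect the monotonicity stated in Corollary 4.1: the single-sequence language has the lowest value and the language of all sequences has the highest.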

In this paper, we use the notion of normalized entropy. Suppose that $L$ is a language over alphabet $\Sigma$, then the normalized entropy of $L$ is defined as: $\overline{ent}(L) = \frac{ent_\bullet(L)}{ent_\bullet(\Sigma^*)}$, where $\Sigma^*$ is the language containing all words over alphabet $\Sigma$. The normalized entropy value is bounded, because for any language $L$ it holds that $L \subseteq \Sigma^*$, and hence, by Corollary 4.1, $ent_\bullet(L) \leq ent_\bullet(\Sigma^*)$; consequently, $\overline{ent}(L) \in [0,1]$. Obviously, Corollary 4.1 can be formulated and applied to the normalized entropy measure, i.e., for two languages $L_1$ and $L_2$ over alphabet $\Sigma$, if $L_1 = L_2$, then $\overline{ent}(L_1) = \overline{ent}(L_2)$, and if $L_1 \subset L_2$, it holds that $\overline{ent}(L_1) < \overline{ent}(L_2)$.

We define the relation of behavioral equivalence $\mathcal{B}$ using the normalized entropy. Let $\Sigma$ be an alphabet and let $L_1, L_2 \subseteq \Sigma^*$ be languages accepted by models $m_1$ and $m_2$, respectively. Then, $(m_1, m_2) \in \mathcal{B}$ if and only if $\overline{ent}(L_1) = \overline{ent}(L_2)$.

Normalized entropy allows us not only to define the notion of behavioral equivalence, but also to formalize the notion of behavioral similarity. For a given parameter $\Delta$, we say that two models $m_1$ and $m_2$ are behaviorally similar if and only if $|\overline{ent}(L_1) - \overline{ent}(L_2)| < \Delta$, where $L_1$ and $L_2$ are the languages these models accept.
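The normalization and the two relations can be sketched as follows. The sketch assumes that short-circuit entropy values are already available and are measured with base-2 logarithms. It uses the observation that short-circuiting $\Sigma^*$ yields the set of all sequences over $\Sigma \cup \{\chi\}$, whose entropy is $\log_2(|\Sigma| + 1)$. Function names are illustrative.

```python
import math


def normalized_entropy(short_circuit_entropy: float, alphabet_size: int) -> float:
    """Divide by the short-circuit entropy of Sigma*, i.e. log2(|Sigma| + 1)."""
    return short_circuit_entropy / math.log2(alphabet_size + 1)


def behaviorally_equivalent(e1: float, e2: float) -> bool:
    """(m1, m2) is in B iff the normalized entropies coincide."""
    return e1 == e2


def behaviorally_similar(e1: float, e2: float, delta: float) -> bool:
    """Models are similar iff their normalized entropies differ by less than delta."""
    return abs(e1 - e2) < delta


# A model accepting all sequences over four labels reaches the upper bound.
print(normalized_entropy(math.log2(5), 4))      # 1.0
# Normalized entropies reported in the text for the models of Fig. 1 and Fig. 2.
print(behaviorally_similar(0.185, 1.0, 0.05))   # False
print(behaviorally_similar(0.185, 0.2, 0.05))   # True
```

Exact equality of floating-point entropies is fragile in practice, which is one practical reason to work with the $\Delta$-based similarity instead.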

4.2 Estimating Simplicity of Block-Structured BPMN Models

In this subsection, we devise a method for assessing the simplicity of uniquely labeled

block-structured BPMN models. As no analytical method for synthesizing a “minimal” block-structured BPMN model in terms of number of nodes or control flow complexity for a given behavior is known, and no computationally feasible approach for generating all possible models with a given behavior exists, we propose an approach that investigates the dependencies between the structural and behavioral model characteristics empirically, and reuses these dependencies to measure the simplicity of models.

As the set of all models $\mathcal{M}$ cannot be exhaustively constructed, we generate its subset $\mathcal{M}' \subset \mathcal{M}$ and relate analyzed models from $\mathcal{M}$ with behaviorally similar models from $\mathcal{M}'$, producing an approximate solution. We then estimate the simplicity of the

[Plots (a) and (b): structural complexity $\mathcal{C}$ (vertical axis) against entropy $E$ (horizontal axis); panel (b) shows $f$ with the envelopes $f^+$ and $f^-$.]
Fig. 8: Structural ($\mathcal{C}$) and behavioral ($E$) characteristics of process models: (a) for all models in $\mathcal{M}'$ and (b) filtered models with upper and lower envelopes for $\Delta = 0.05$.

given model by comparing its structural complexity with that of the simplest behav-

iorally similar models, with the complexity of these models being in a certain interval

from “the best case” to “the worst case” complexity. In order to deﬁne this interval and

relate it to entropy values, we construct envelope functions $f^-$ and $f^+$ that approximate “the best case” and “the worst case” structural complexity of the simplest process

models for a given entropy value. Below we give an approach for constructing these

envelope functions and deﬁne the simplicity measure that relates the model complexity

to an interval deﬁned by these functions.

Let $E : \mathcal{M}' \to [0,1]$ be a function that maps process models in $\mathcal{M}'$ onto the corresponding normalized entropy values. Fig. 8a presents an example plot relating structural characteristics $\mathcal{C}(m)$ and entropy $E(m)$ values for each process model $m \in \mathcal{M}'$; the example is artificial and does not correspond to any concrete structural and behavioral measures of process models. Once such data points are obtained, we filter out all the models $m \in \mathcal{M}'$ such that $\exists m' \in \mathcal{M}' : E(m) = E(m')$ and $\mathcal{C}(m) > \mathcal{C}(m')$. In other words, we filter out a model if its underlying behavior can be described by a structurally simpler model. This means that only the structurally “simplest” models remain. The process models that were filtered out are presented by gray dots in Fig. 8a.
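The filtering step amounts to keeping, for every entropy value, the lowest observed complexity. A minimal sketch, assuming the generated models are summarized as (entropy, complexity) pairs with made-up values:

```python
def simplest_per_entropy(points):
    """Map each entropy value to the lowest complexity observed for it.

    `points` is an iterable of (E(m), C(m)) pairs; a pair is dropped when
    another pair has the same entropy and a strictly lower complexity.
    """
    f = {}
    for entropy, complexity in points:
        if entropy not in f or complexity < f[entropy]:
            f[entropy] = complexity
    return f


# Artificial data points, not taken from the paper.
points = [(0.0, 5), (0.0, 7), (0.3, 9), (0.3, 8), (1.0, 8), (1.0, 10)]
print(simplest_per_entropy(points))  # {0.0: 5, 0.3: 8, 1.0: 8}
```

The resulting dictionary relates each remaining entropy value to a single complexity value. Grouping by exact floating-point equality is a simplification, and rounding the entropies beforehand may be needed in practice.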

Once the set $\mathcal{M}''$ of the models remaining after filtering the models in $\mathcal{M}'$ is obtained, it defines the partial function $f : [0,1] \nrightarrow \mathbb{R}^+_0$, such that $f(e) = c$ if and only if there exists a model $m \in \mathcal{M}''$, where $e = E(m)$ and $c = \mathcal{C}(m)$ (Fig. 8b). This function relates the behavioral and structural characteristics of the remaining models. Then, we construct envelope functions that define intervals of the structural complexities of the remaining “simplest” models. The upper envelope $f^+ : [0,1] \to \mathbb{R}^+_0$ is a function going through the set of data points $D^+ = \{(e, f(e)) \mid \forall e' \in dom(f) : (|e - e'| < \Delta \Rightarrow f(e) \geq f(e'))\}$, where $\Delta$ is a parameter that defines classes of behaviorally similar models. Less formally, $f^+$ goes through all the data points that are maximum in a $\Delta$-size window. Similarly, the lower envelope is defined as a function $f^- : [0,1] \to \mathbb{R}^+_0$ going through $D^- = \{(e, f(e)) \mid \forall e' \in dom(f) : (|e - e'| < \Delta \Rightarrow f(e) \leq f(e'))\}$. In general, the envelope can be any smooth polynomial interpolation or a piecewise linear function; the only restriction is that it goes through the $D^+$ and $D^-$ data points.
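The sets $D^+$ and $D^-$ can be obtained with a direct scan over the domain of $f$. In the sketch below, $f$ is a dictionary from entropy to complexity with artificial values, and the envelope is evaluated by piecewise linear interpolation that is held constant outside the range of the selected points; that boundary behavior is an assumption of this example.

```python
def envelope_points(f, delta):
    """Return (D+, D-): points that are maximal / minimal within a delta window."""
    upper, lower = [], []
    for e, c in sorted(f.items()):
        window = [c2 for e2, c2 in f.items() if abs(e - e2) < delta]
        if c >= max(window):
            upper.append((e, c))
        if c <= min(window):
            lower.append((e, c))
    return upper, lower


def interpolate(points, e):
    """Piecewise linear function through `points` (sorted by entropy)."""
    if e <= points[0][0]:
        return points[0][1]
    for (e0, c0), (e1, c1) in zip(points, points[1:]):
        if e <= e1:
            return c0 + (c1 - c0) * (e - e0) / (e1 - e0)
    return points[-1][1]


# Artificial partial function f: entropy -> complexity.
f = {0.0: 4, 0.02: 6, 0.04: 5, 0.5: 8, 0.52: 7}
upper, lower = envelope_points(f, delta=0.05)
print(upper)  # [(0.02, 6), (0.5, 8)]
print(lower)  # [(0.0, 4), (0.52, 7)]
print(round(interpolate(upper, 0.26), 3))  # 7.0
```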

Parameter $\Delta$ defines the measure of similarity between classes of behaviorally equivalent models. In each case, $\Delta$ should be selected empirically. For instance, too small a $\Delta$ will lead to local evaluations that may not be reliable because they do not take into account global trends, while setting $\Delta$ too large results in a situation where entropy is not taken into account and the model is related to all other models from the set. In Fig. 8b, the upper and lower envelopes were constructed for $\Delta = 0.05$.

Using the upper and lower envelope functions we can estimate simplicity of process

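models from $\mathcal{M}$. The simplicity measure is defined in Eq. (2), where $sim(m)$ is the simplicity of model $m \in \mathcal{M}$ with an entropy value $e = E(m)$; $\alpha$ and $\beta$ are parameters, such that $\alpha, \beta \in [0,1]$ and $\alpha + \beta \leq 1$.

$$
sim(m) =
\begin{cases}
\alpha \cdot \dfrac{f^+(e)}{\mathcal{C}(m)} & \mathcal{C}(m) \geq f^+(e), \\[6pt]
\alpha + (1 - \alpha - \beta) \cdot \dfrac{f^+(e) - \mathcal{C}(m)}{f^+(e) - f^-(e)} & f^-(e) \leq \mathcal{C}(m) < f^+(e), \\[6pt]
1.0 - \beta \cdot \dfrac{\mathcal{C}(m)}{f^-(e)} & \mathcal{C}(m) < f^-(e).
\end{cases}
\qquad (2)
$$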

According to Eq. (2), $sim(m)$ is in the interval between zero and one. Parameters $\alpha$ and $\beta$ are used to adjust the measure. Parameter $\alpha$ shows the level of confidence that some complexity values can be above the upper envelope. If the complexity $\mathcal{C}(m)$ of model $m$ is above the upper envelope $f^+(e)$, i.e., $m$ is more complex than “the worst case” model, then $sim(m)$ is less than or equal to $\alpha$ and tends to zero as $\mathcal{C}(m)$ grows. If the model complexity $\mathcal{C}(m)$ is between the envelopes $f^-(e)$ and $f^+(e)$, then $sim(m) \in (\alpha, 1 - \beta]$ and the higher $\mathcal{C}(m)$ is (the closer the model is to “the worst case”), the closer the simplicity value is to $\alpha$. Parameter $\beta$ shows the level of confidence that some data points may be below the lower envelope. If it is guaranteed that there are no models in $\mathcal{M}$ with data points below the lower envelope, it is feasible to set $\beta$ to zero. Otherwise, if $\mathcal{C}(m)$ is lower than the lower envelope $f^-(e)$, then $sim(m)$ belongs to the interval $(1 - \beta, 1]$ and tends to one as $\mathcal{C}(m)$ approaches zero.
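Eq. (2) translates into a three-branch function. The sketch below takes the envelope values $f^+(e)$ and $f^-(e)$ as plain numbers and assumes they are positive with $f^-(e) \leq f^+(e)$; the parameter names and the example values are illustrative only.

```python
def simplicity(c, f_plus, f_minus, alpha, beta):
    """Evaluate Eq. (2) for complexity c and envelope values f+(e), f-(e)."""
    if not (0 <= alpha <= 1 and 0 <= beta <= 1 and alpha + beta <= 1):
        raise ValueError("alpha and beta must lie in [0, 1] with alpha + beta <= 1")
    if c >= f_plus:                     # at or above "the worst case"
        return alpha * f_plus / c
    if c >= f_minus:                    # between the two envelopes
        return alpha + (1 - alpha - beta) * (f_plus - c) / (f_plus - f_minus)
    return 1.0 - beta * c / f_minus     # below "the best case"


# Illustrative values: f+(e) = 10, f-(e) = 6, alpha = 0.2, beta = 0.1.
for c in (12, 8, 6, 3):
    print(c, round(simplicity(c, 10, 6, 0.2, 0.1), 3))
# 12 0.167
# 8 0.55
# 6 0.9
# 3 0.95
```

The value at $\mathcal{C}(m) = f^-(e)$ equals $1 - \beta$, and the value at $\mathcal{C}(m) = f^+(e)$ equals $\alpha$, matching the interval bounds discussed above.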

Next, we apply the proposed approach to construct upper and lower envelope func-

tions for the number of nodes and control ﬂow complexity measures of block-structured

BPMN models with a ﬁxed number of uniquely labeled tasks. To analyze the relations

between the structural and behavioral characteristics, we generated all block-structured

BPMN models with three tasks. Fig. 9 contains plots with data points representing the

behavioral and structural characteristics of these process models. These data points,

as well as the upper and lower envelopes, are constructed using the general technique

described above with window parameter ∆= 0.01. Additionally, to make the enve-

lope functions less detailed and reﬂect the main trends, we construct them only through

some of the data points from the D+ and D− sets in such a way that the second derivative

of each envelope function is either non-negative or non-positive, i.e., envelope functions

are either convex or concave.
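
One way to obtain a concave upper envelope or a convex lower envelope through a subset of the data points is to keep only the points on the corresponding part of the convex hull. The sketch below uses the standard monotone chain scan for this purpose. The text does not state which selection procedure was used, so the hull-based choice, the function name `hull_envelope`, and the sample points are assumptions made for illustration.

```python
def _cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_envelope(points, upper=True):
    """Select knots of a concave upper (or convex lower) piecewise linear envelope.

    points -- iterable of (entropy, complexity) pairs
    upper  -- True for a concave upper envelope, False for a convex lower one
    """
    # Keep one point per entropy value: the largest (upper) or smallest (lower).
    best = {}
    for e, c in points:
        if e not in best or (c > best[e] if upper else c < best[e]):
            best[e] = c
    hull = []
    for p in sorted(best.items()):
        # For a concave envelope every turn must be clockwise (cross < 0);
        # for a convex envelope every turn must be counter-clockwise (cross > 0).
        while len(hull) >= 2:
            turn = _cross(hull[-2], hull[-1], p)
            if (turn >= 0) if upper else (turn <= 0):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


# Hypothetical points standing in for the D+ and D- sets.
d_plus = [(0.0, 5), (0.3, 12), (0.5, 11), (0.7, 14), (1.0, 7)]
d_minus = [(0.0, 5), (0.4, 6), (0.6, 5), (1.0, 7)]
upper_knots = hull_envelope(d_plus, upper=True)
lower_knots = hull_envelope(d_minus, upper=False)
```

Connecting the returned knots with straight segments yields slopes that never increase (upper case) or never decrease (lower case), which is the piecewise linear counterpart of the second-derivative condition stated above.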

In both cases, the upper envelope functions f+_n and f+_cfc grow monotonically, starting at the sequential model (a sequence of three tasks) with the entropy of zero, reach the maximum, and then drop to the “flower” model (see an example of a “flower” model in Fig. 3) with the entropy of one. These results show that, as the number of nodes (Fig. 9a) and complexity of the control flow (Fig. 9b) increase, the minimum possible entropy values also increase. As the model structure becomes more complex, more behavior is allowed and the lower bound of the entropy increases.

[Figure 9: two plots of data points over entropy E (horizontal axis, 0 to 1). Panel (a) shows the number of nodes C_n with the envelopes f+_n and f−_n; panel (b) shows the control flow complexity C_cfc with the envelopes f+_cfc and f−_cfc.]

Fig. 9: Dependencies between entropy and structural complexity of block-structured BPMN models containing three tasks for the (a) number of nodes and (b) control flow complexity measures.

While the lower envelope for the number of nodes measure f−_n (Fig. 9a) is flat and does not reveal any explicit dependency, the lower envelope for the control flow complexity f−_cfc (Fig. 9b) grows. This can be explained by the fact that the number of nodes is reduced during the local model minimization [14], while the control flow complexity measure takes into account the number of outgoing arcs of exclusive splitting gateways and is not affected by the local minimization.

The empirical results and observations presented in this section reveal the main de-

pendencies between the structural complexity and the behavioral complexity of block-

structured BPMN models and can be generalized for an arbitrary number of tasks. They

also show that this approach relies on the quality of the model generator, i.e., the set

of generated models should be dense enough to reveal these dependencies. In the next

section, we apply the proposed approach to assess the simplicity of process models

discovered from real world event logs.

5 Evaluation

In this section, we use the approach from Section 4 to evaluate the simplicity of process models discovered by Inductive miner (with noise threshold 0.2) from industrial Business Process Intelligence Challenge (BPIC) event logs² and an event log of a booking flight system (BFS). Before the analysis, we filtered out infrequent events that appear in less than 80% of traces using the “Filter Log Using Simple Heuristics” Process Mining Framework (ProM) plug-in [7].³ The Inductive miner algorithm discovers uniquely labeled block-structured BPMN models. As structural complexity measures, we used the number of nodes and the control flow complexity (cfc).

² BPIC logs: https://data.4tu.nl/repository/collection:event_logs_real.

³ The filtered logs are available here: https://github.com/jbpt/codebase/tree/master/jbpt-pm/logs.
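
The filtering step can be illustrated with a short Python sketch. It is not the implementation of the ProM plug-in. It assumes that an event log is a list of traces, that each trace is a list of event labels, and that an event is "infrequent" when its label occurs in less than 80% of the traces; the function name and the sample log are hypothetical.

```python
from collections import Counter


def filter_infrequent_events(log, min_trace_share=0.8):
    """Drop events whose label occurs in fewer than min_trace_share of traces.

    log -- list of traces, each trace being a list of event labels
    """
    if not log:
        return []
    # Count, for every label, the number of traces that contain it at least once.
    presence = Counter()
    for trace in log:
        presence.update(set(trace))
    keep = {label for label, count in presence.items()
            if count / len(log) >= min_trace_share}
    # Project every trace onto the retained labels, preserving event order.
    return [[label for label in trace if label in keep] for trace in log]


# Hypothetical log with five traces: "a" occurs in 5/5 traces, "b" in 4/5,
# "c" in 2/5, and "d" in 1/5, so only "a" and "b" are retained.
sample_log = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a", "b"], ["a", "b", "d"]]
filtered_log = filter_infrequent_events(sample_log)
```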

To estimate the upper and lower bounds of the structural complexity of the models, for each number of tasks (each size of the log alphabet), we generated 5,000 random uniquely labeled block-structured BPMN models. For all the models, data points representing the entropy and structural complexity of the models were constructed. We then constructed the upper and lower envelopes using a window of size 0.01 (refer to Section 4 for the details). The envelopes were constructed as piecewise linear functions going through all the selected data points.
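
A possible way to turn the data points into piecewise linear envelope functions is sketched below in Python. The selection rule is an assumption made for illustration: a data point is kept for the upper (lower) envelope if no other point within entropy distance ∆ has a larger (smaller) complexity. The precise construction is the one given in Section 4, and the function names and sample points are hypothetical.

```python
from bisect import bisect_right


def select_envelope_points(points, delta, upper=True):
    """Keep points that are extreme within an entropy window of half-width delta."""
    sign = 1.0 if upper else -1.0
    selected = set()
    for e, c in points:
        window = [c2 for e2, c2 in points if abs(e2 - e) <= delta]
        if all(sign * c >= sign * c2 for c2 in window):
            selected.add((e, c))
    return sorted(selected)


def piecewise_linear(knots):
    """Return a function interpolating linearly through sorted, non-empty knots.

    Outside the range of the knots the nearest knot value is used.
    """
    xs = [e for e, _ in knots]
    ys = [c for _, c in knots]

    def f(e):
        if e <= xs[0]:
            return ys[0]
        if e >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, e)  # xs[i - 1] <= e < xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (e - x0) / (x1 - x0)

    return f


# Hypothetical (entropy, complexity) data points.
data = [(0.0, 3), (0.004, 5), (0.2, 8), (0.205, 6), (0.5, 10), (1.0, 4)]
f_plus = piecewise_linear(select_envelope_points(data, 0.01, upper=True))
f_minus = piecewise_linear(select_envelope_points(data, 0.01, upper=False))
```

The quadratic scan over all pairs of points is kept for readability; for 5,000 models per alphabet size it remains cheap, and a sorted sliding window could replace it if needed.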

For both structural complexity measures, parameter α, cf. Eq. (2), was set to 0.5, i.e., models represented by the data points above the upper envelope are presumed to have a simplicity value lower than 0.5. In turn, parameter β was set to 0.0 and 0.1 for the number of nodes and cfc measures, respectively. In contrast to cfc, the lower envelope for the number of nodes measure is defined as the minimal possible number of nodes in any model, which guarantees that there are no data points below it. The final equations for the number of nodes and cfc simplicity measures are given below, where e = E(m) is the topological entropy of model m, C_n(m) is the number of nodes in m, C_cfc(m) is the cfc complexity of m, f+_n and f+_cfc are the upper envelopes for the number of nodes and cfc measures, and f−_n and f−_cfc are the corresponding lower envelopes.

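\[
\mathit{sim}_{n}(m) =
\begin{cases}
0.5 \cdot \dfrac{f^{+}_{n}(e)}{\mathcal{C}_{n}(m)}, & \mathcal{C}_{n}(m) \geq f^{+}_{n}(e), \\[1ex]
0.5 + 0.5 \cdot \dfrac{f^{+}_{n}(e) - \mathcal{C}_{n}(m)}{f^{+}_{n}(e) - f^{-}_{n}(e)}, & f^{-}_{n}(e) \leq \mathcal{C}_{n}(m) < f^{+}_{n}(e), \\[1ex]
1.0, & \mathcal{C}_{n}(m) < f^{-}_{n}(e).
\end{cases}
\tag{3}
\]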

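\[
\mathit{sim}_{cfc}(m) =
\begin{cases}
0.5 \cdot \dfrac{f^{+}_{cfc}(e)}{\mathcal{C}_{cfc}(m)}, & \mathcal{C}_{cfc}(m) \geq f^{+}_{cfc}(e), \\[1ex]
0.5 + 0.4 \cdot \dfrac{f^{+}_{cfc}(e) - \mathcal{C}_{cfc}(m)}{f^{+}_{cfc}(e) - f^{-}_{cfc}(e)}, & f^{-}_{cfc}(e) \leq \mathcal{C}_{cfc}(m) < f^{+}_{cfc}(e), \\[1ex]
1.0 - 0.1 \cdot \dfrac{\mathcal{C}_{cfc}(m)}{f^{-}_{cfc}(e)}, & \mathcal{C}_{cfc}(m) < f^{-}_{cfc}(e).
\end{cases}
\tag{4}
\]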

Table 1 presents the original and adjusted (proposed in this paper) simplicity measures, induced by the number of nodes and cfc structural complexity measures, for the process models discovered from the evaluated event logs. Models were discovered from the filtered event logs and their sublogs that contain only traces appearing in the filtered event logs at least two or four times. Model m8 (refer to Fig. 11a) is an automatically discovered process model with redundant nodes. The value of sim_n(m8) is less than 0.5 because the corresponding data point is above the upper envelope (Fig. 10a). Note that the manually constructed “flower” model m′8 (Fig. 11b) that accepts the same traces as m8 has better structural complexity and, consequently, the corresponding adjusted simplicity measurements relate as follows: sim_n(m′8) = 1.0 > sim_n(m8) = 0.485, and sim_cfc(m′8) = 0.890 > sim_cfc(m8) = 0.723. The difference between the sim_cfc(m8) and sim_cfc(m′8) simplicity values is not as significant as the difference between sim_n(m8) and sim_n(m′8) because, although m′8 has only two gateways, the total number of outgoing sequence flows from these gateways is rather high.
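
As a sanity check of the first case of Eq. (3), the reported values for m8 can be inverted to recover the upper envelope value they imply. The Python sketch below does this; the helper name is hypothetical, the envelope value itself is not reported in the paper, and the result is approximate because Table 1 rounds to three decimals.

```python
ALPHA = 0.5  # value of alpha used for both measures in this section


def implied_upper_envelope(c, sim, alpha=ALPHA):
    """Invert sim = alpha * f_plus / c, the case C(m) >= f+(e) of Eq. (2)."""
    if not 0.0 < sim <= alpha:
        raise ValueError("the first case of Eq. (2) requires 0 < sim <= alpha")
    return sim * c / alpha


# Model m8 in Table 1: C_n(m8) = 20 and sim_n(m8) = 0.485.
f_plus_n_m8 = implied_upper_envelope(20, 0.485)  # approximately 19.4
```

The implied envelope value of about 19.4 is slightly below C_n(m8) = 20, which agrees with the statement that the data point of m8 lies above the upper envelope.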


Model  Event log   #Traces  #Labels  Entropy  C_n  sim_n  C_cfc  sim_cfc
m1     BPIC’2019     3,365        6    0.484   12  0.923      5    0.767
m2     BPIC’2019       614        6    0.333   12  0.887      5    0.697
m3     BPIC’2019       302        6    0.377   12  0.901      6    0.714
m4     BPIC’2018    15,536        8    0.800   25  0.684     20    0.690
m5     BPIC’2018     1,570        7    0.432   16  0.802     10    0.680
m6     BPIC’2018       618        7    0.638   18  0.813     15    0.676
m7     BFS             279        6    0.378   14  0.754      5    0.723
m8     BFS              70        6    0.847   20  0.485     13    0.723
m9     BFS              29        6    0.258   15  0.630      7    0.516

Table 1: Simplicity of uniquely labeled block-structured BPMN models discovered from industrial event logs and their sublogs; the number of unique traces (#Traces), number of distinct labels in the discovered models (#Labels) and their entropy values (Entropy) are specified.

Models m1, m2, and m3 have the same number of nodes, but different entropy values. Model m1 is considered the simplest in terms of the number of nodes among the three models because it is located further from the upper envelope than the other two models. Hence, its sim_n value is the highest. Models m1 and m2, which in addition have the same control flow complexity values, are shown in Fig. 12a and Fig. 12b, respectively. Model m1 is considered simpler than model m2 because it merely runs all the tasks in parallel (allowing the “Record Service Entry Sheet” task to be skipped or executed several times). In contrast, model m2 imposes additional constraints on the order of tasks (leading to lower entropy) and, thus, should be easier to test. Note that m1 models a more diverse behavior which in the worst case, according to the upper envelope, can be modeled with more nodes. These results demonstrate that when analyzing the simplicity of a discovered process model, it is feasible and beneficial to consider both phenomena of the structural complexity of the model’s diagrammatic representation and the variability/complexity of the behavior the model describes. This way, one can adhere to the Occam’s Razor problem-solving principle.

[Figure 10: two plots of structural complexity against entropy E(m) (horizontal axis, 0 to 1) with data points for models m1, m2, m3, m7, m8, m′8, and m9. Panel (a), number of nodes: C_n(m) with the envelopes f+_n and f−_n. Panel (b), control flow complexity (cfc): C_cfc(m) with the envelopes f+_cfc and f−_cfc.]

Fig. 10: Structural complexity of block-structured BPMN models over six labels.


[Figure 11: BPMN diagrams of (a) model m8 and (b) model m′8, each over the tasks surname-fill, name-fill, birthday-fill, docnum-fill, docexpire-fill, and c_phone_num-fill.]

Fig. 11: Models m8 and m′8 discovered from the BFS event log.

[Figure 12: BPMN diagrams of (a) model m1 and (b) model m2, each over the tasks Create Purchase Order Item, Record Goods Receipt, Record Invoice Receipt, Record Service Entry Sheet, Vendor creates invoice, and Clear Invoice.]

Fig. 12: Models m1 and m2 discovered from the BPIC’2019 event log.

6 Conclusion

This paper presents a framework that can be conﬁgured to result in a concrete approach

for measuring the simplicity of process models discovered from event logs. In con-

trast to the existing simplicity measures, our framework accounts for both a model’s

structure and behavior. In this paper, the framework was implemented for the class of

uniquely-labeled block-structured BPMN models using topological entropy as a mea-

sure of process model behavior. The experimental evaluation of process models dis-

covered from real-life event logs shows the approach’s ability to evaluate the quality

of discovered process models by relating their structural complexity to the structural

complexity of other process models that describe similar behaviors. Such analysis can complement existing simplicity measurement techniques by showing the relative aspects of the structural complexity of the model.

We identify several research directions arising from this work. First, we acknowl-

edge that the proposed instantiation of the framework is approximate and depends on

the quality of the randomly generated models. The analysis of other structural com-

plexity measures as well as more sophisticated random model generation algorithms

can lead to a more precise approach. Second, the framework described in this paper

can be instantiated with other classes of process models to extend its applicability to

models discovered by a broader range of process discovery algorithms. Finally, we be-

lieve that this work can give valuable insights into the improvement of existing, and the

development of new process discovery algorithms.


Acknowledgments. This work was supported by the Australian Research Council Dis-

covery Project DP180102839. We sincerely thank the anonymous reviewers whose sug-

gestions helped us to improve this paper.

References

1. van der Aalst, W.: Data Science in Action, pp. 3–23. Springer Berlin Heidelberg (2016)

2. Augusto, A., Conforti, R., Dumas, M., La Rosa, M., Polyvyanyy, A.: Split miner: automated

discovery of accurate and simple business process models from event logs. Knowl. Inf. Syst.

59(2), 251–284 (2019). https://doi.org/10.1007/s10115-018-1214-x

3. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: On the role of ﬁtness, precision,

generalization and simplicity in process discovery. In: On the Move to Meaningful Internet

Systems: OTM 2012. pp. 305–322. Springer, Berlin (2012)

4. Cardoso, J.: How to Measure the Control-ﬂow Complexity of Web processes and Workﬂows,

pp. 199–212 (01 2005)

5. Carmona, J., Cortadella, J., Kishinevsky, M.: A region-based algorithm for discovering petri

nets from event logs. In: Dumas, M., Reichert, M., Shan, M.C. (eds.) Business Process Man-

agement. pp. 358–373. Springer Berlin Heidelberg, Berlin, Heidelberg (2008)

6. Ceccherini-Silberstein, T., Machì, A., Scarabotti, F.: On the entropy of regular languages. Theor. Comp. Sci. 307, 93–102 (2003)

7. van Dongen, B.F., de Medeiros, A.K.A., Verbeek, H.M.W., Weijters, A.J.M.M., van der

Aalst, W.M.P.: The ProM Framework: A New Era in Process Mining Tool Support. In: Ap-

plications and Theory of Petri Nets 2005. pp. 444–454. Springer Berlin Heidelberg (2005)

8. Garrett, A.J.M.: Ockham’s Razor, pp. 357–364. Springer Netherlands (1991)

9. Gruhn, V., Laue, R.: Complexity metrics for business process models. In: 9th International

Conference on Business Information Systems (BIS 2006). pp. 1–12 (2006)

10. Grünwald, P.D.: The Minimum Description Length Principle (Adaptive Computation and Machine Learning). The MIT Press (2007)

11. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and

Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., USA (2006)

12. Jančar, P., Kučera, A., Mayr, R.: Deciding bisimulation-like equivalences with finite-state processes. In: Automata, Languages and Programming. pp. 200–211. Springer (1998)

13. Kalenkova, A.A., Ageev, A.A., Lomazova, I.A., van der Aalst, W.M.P.: E-government ser-

vices: Comparing real and expected user behavior. In: Business Process Management Work-

shops. pp. 484–496. Springer, Cham (2018)

14. Kalenkova, A., Aalst, W., Lomazova, I., Rubin, V.: Process mining using BPMN: Relating event logs and process models. Software and Systems Modeling 16, 1019–1048 (01 2017)

15. Kluza, K., Nalepa, G.J., Lisiecki, J.: Square Complexity Metrics for Business Process Mod-

els, pp. 89–107. Springer International Publishing, Cham (2014)

16. Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering block-structured process

models from event logs - a constructive approach. In: Application and Theory of Petri Nets

and Concurrency. pp. 311–329. Springer Berlin Heidelberg (2013)

17. Lieben, J., Jouck, T., Depaire, B., Jans, M.: An improved way for measuring simplicity dur-

ing process discovery. In: EOMAS. pp. 49–62. Springer (2018)

18. OMG: Business Process Model and Notation (BPMN), Version 2.0.2 (Dec 2013), http://www.omg.org/spec/BPMN/2.0.2

19. Polyvyanyy, A.: Structuring Process Models. Ph.D. thesis, University of Potsdam (2012),

http://opus.kobv.de/ubp/volltexte/2012/5902/


20. Polyvyanyy, A., Solti, A., Weidlich, M., Ciccio, C.D., Mendling, J.: Monotone precision

and recall measures for comparing executions and speciﬁcations of dynamic systems. ACM

Trans. Softw. Eng. Methodol. 29(3) (Jun 2020). https://doi.org/10.1145/3387909

21. Reisig, W.: Understanding Petri Nets: Modeling Techniques, Analysis Methods, Case Stud-

ies. Springer Publishing Company, Incorporated (2013)

22. Wynn, M.T., Verbeek, H.M.W., van der Aalst, W.M.P., ter Hofstede, A.H.M., Edmond, D.: Reduction rules for YAWL workflows with cancellation regions and OR-joins. Inf. Softw. Technol. 51(6), 1010–1020 (Jun 2009)