Discovering Process Models from Uncertain Event Data

Marco Pegoraro¹, Merih Seran Uysal¹, and Wil M.P. van der Aalst¹

¹Chair of Process and Data Science (PADS), Department of Computer Science, RWTH Aachen University, Aachen, Germany

{pegoraro, uysal, vwdaalst}@pads.rwth-aachen.de

Abstract

Modern information systems are able to collect event data in the form of event logs. Process mining techniques make it possible to discover a model from event data, to check the conformance of an event log against a reference model, and to perform further process-centric analyses. In this paper, we consider uncertain event logs, where data is recorded together with explicit uncertainty information. We describe a technique to discover a directly-follows graph from such event data which retains information about the uncertainty in the process. We then present experimental results of performing inductive mining over the directly-follows graph to obtain models representing the certain and uncertain part of the process.

Keywords: Process Mining · Process Discovery · Uncertain Data.

Colophon

This work is licensed under a Creative Commons “Attribution-NonCommercial 4.0 International” license.

©the authors. Some rights reserved.

This document is an Author Accepted Manuscript (AAM) corresponding to the following scholarly paper:

Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “Discovering Process Models from Uncertain Event Data”. In: Business Process Management Workshops - BPM 2019 International Workshops, Vienna, Austria, September 1-6, 2019, Revised Selected Papers. Ed. by Di Francescomarino, Chiara, Remco M. Dijkman, and Uwe Zdun. Vol. 362. Lecture Notes in Business Information Processing. Springer, 2019, pp. 238–249. doi:10.1007/978-3-030-37453-2_20

Please, cite this document as shown above.

Publication chronology:

•2019-06-02: full text submitted to the International Workshop on Business Process Intelligence (BPI) 2019

•2019-06-28: notification of acceptance

•2019-07-13: camera-ready version submitted

•2019-09-02: presented

•2019-09-20: post-proceedings version submitted

•2020-01-03: post-proceedings published

The published version referred above is ©Springer.

Correspondence to:

Marco Pegoraro, Chair of Process and Data Science (PADS), Department of Computer Science,

RWTH Aachen University, Ahornstr. 55, 52074 Aachen, Germany

Website: http://mpegoraro.net/ ·Email: pegoraro@pads.rwth-aachen.de ·ORCID: 0000-0002-8997-7517

Content: 15 pages, 7 figures, 1 table, 9 references. Typeset with pdfLaTeX, Biber, and BibLaTeX.

Please do not print this document unless strictly necessary.

M. Pegoraro et al. Discovering Process Models from Uncertain Event Data

1 Introduction

With the advent of digitalization of business processes and related management tools, Process-Aware Information Systems (PAISs), ranging from ERP/CRM-systems to BPM/WFM-systems, are widely used to support operational administration of processes. The databases of PAISs containing event data can be queried to obtain event logs, collections of recordings of the execution of activities belonging to the process. The discipline of process mining aims to synthesize knowledge about processes via the extraction and analysis of execution logs.

When applying process mining in real-life settings, the need to address anomalies in data recording when performing analyses is omnipresent. A number of such anomalies can be modeled by using the notion of uncertainty: uncertain event logs contain, alongside the event data, some attributes that describe a certain level of uncertainty affecting the data. A typical example is the timestamp information: in many processes, specifically the ones where data is in part manually recorded, the timestamp of events is recorded with low precision (e.g., specifying only the day of occurrence). If multiple events belonging to the same case are recorded within the same time unit, the information regarding the event order is lost. This can be modeled as uncertainty of the timestamp attribute by assigning a time interval to the events. Another example of uncertainty is a situation where the activity label is unrecorded or lost, but the events are associated with specific resources that carried out the corresponding activity. In many organizations, each resource is authorized to perform a limited set of activities, depending on their role. In this case, it is possible to model the absence of activity labels by associating every event with the set of possible activities which the resource is authorized to perform.

Usually, information about uncertainty is not natively contained in a log: event data is extracted from information systems as activity label, timestamp and case id (and possibly additional attributes), without any sort of meta-information regarding uncertainty. In some cases, a description of the uncertainty in the process can be obtained from background knowledge. Information translatable to uncertainty, such as the examples given above, can for instance be acquired from an interview with the process owner, and then inserted in the event log with a pre-processing step. Research efforts regarding how to discover uncertainty in a representation of domain knowledge and how to translate it to obtain an uncertain event log are currently ongoing.

Uncertainty can be addressed by filtering out the affected events when it appears sporadically throughout an event log. Conversely, in situations where uncertainty affects a significant fraction of an event log, filtering out uncertain events can lead to information loss such that analysis becomes very difficult. In this circumstance, it is important to deploy process mining techniques that can mine information also from the uncertain part of the process.

In this paper, we aim to develop a process discovery approach for uncertain event data. We present a methodology to obtain Uncertain Directly-Follows Graphs (UDFGs), models based on directed graphs that synthesize information about the uncertainty contained in the process. We then show how to convert UDFGs into models with execution semantics via filtering on uncertainty information and inductive mining.

The remainder of the paper is structured as follows: in Section 2, we present relevant previous work. In Section 3, we provide the preliminary information necessary for formulating uncertainty. In Section 4, we define the uncertain version of directly-follows graphs. In Section 5, we describe some examples of exploiting UDFGs to obtain executable models. Section 6 presents some experiments. Section 7 proposes future work and concludes the paper.

2 Related Work

In a previous work [9], we proposed a taxonomy of possible types of uncertainty in event data. To the best of our knowledge, no previous work addressing explicit uncertainty currently exists in process mining. Since usual event logs do not contain any hint regarding misrecordings of data or other anomalies, the notion of “noise” or “anomaly” normally considered in process discovery refers to outlier behavior. This is often obtained by setting thresholds to filter out the behavior not considered for representation in the resulting process model. A variant of the Inductive Miner by Leemans et al. [6] considers only directly-follows relationships appearing with a certain frequency. In general, a direct way to address infrequent behavior on the event level is to apply to it the concepts of support and confidence, widely used in association rule learning [5]. More sophisticated techniques employ infrequent pattern detection based on a mapping between events [8] or a finite state automaton [4] mined from the most frequent behavior.

Although various interpretations of uncertain information can exist, this paper presents a novel approach that aims to represent uncertainty explicitly, rather than filtering it out. For this reason, existing approaches to identify noise cannot be applied to the problem at hand.

3 Preliminaries

To define uncertain event data, we introduce some basic notations and concepts, partially from [1]:

Definition 1 (Power Set). The power set of a set $A$ is the set of all possible subsets of $A$, and is denoted with $\mathcal{P}(A)$. $\mathcal{P}_{NE}(A)$ denotes the set of all the non-empty subsets of $A$: $\mathcal{P}_{NE}(A) = \mathcal{P}(A) \setminus \{\emptyset\}$.

Definition 2 (Sequence). Given a set $X$, a finite sequence over $X$ of length $n$ is a function $s \in X^{*} : \{1, \dots, n\} \to X$, typically written as $s = \langle s_1, s_2, \dots, s_n \rangle$. For any sequence $s$ we define $|s| = n$, $s[i] = s_i$, $S_s = \{s_1, s_2, \dots, s_n\}$ and $x \in s \iff x \in S_s$. Over the sequences $s$ and $s'$ we define $s \cup s' = \{a \in s\} \cup \{a \in s'\}$.

Definition 3 (Directed Graph). A directed graph $G = (V, E)$ is a set of vertices $V$ and a set of directed edges $E \subseteq V \times V$. We denote with $U_G$ the universe of such directed graphs.

Definition 4 (Bridge). An edge $e \in E$ is called a bridge if and only if the graph becomes disconnected if $e$ is removed: there exists a partition of $V$ into $V'$ and $V''$ such that $E \cap ((V' \times V'') \cup (V'' \times V')) = \{e\}$. We denote with $E_B \subseteq E$ the set of all such bridges over the graph $G = (V, E)$.

Definition 5 (Path). A path over a graph $G = (V, E)$ is a sequence of vertices $p = \langle v_1, v_2, \dots, v_n \rangle$ with $v_1, \dots, v_n \in V$ and $\forall_{1 \le i \le n-1}\, (v_i, v_{i+1}) \in E$. $P_G(v, w)$ denotes the set of all paths connecting $v$ and $w$ in $G$. A vertex $w \in V$ is reachable from $v \in V$ if there is at least one path connecting them: $|P_G(v, w)| > 0$.

Definition 6 (Transitive Reduction). A transitive reduction of a graph $G = (V, E)$ is a graph $\rho(G) = (V, E')$ with the same reachability between vertices and a minimal number of edges. $E' \subseteq E$ is a smallest set of edges such that $|P_{\rho(G)}(v, w)| > 0 \iff |P_G(v, w)| > 0$ for any $v, w \in V$.

In this paper, we consider uncertain event logs. These event logs contain uncertainty information explicitly associated with event data. A taxonomy of different kinds of uncertainty and uncertain event logs has been presented in [9], which distinguishes between two main classes of uncertainty. Weak uncertainty provides a probability distribution over a set of possible values, while strong uncertainty only provides the possible values for the corresponding attribute.

We will use the notion of simple uncertainty, which includes strong uncertainty on the control-flow perspective: activities, timestamps, and indeterminate events. An example of a simple uncertain trace is shown in Table 1. Event $e_1$ has been recorded with two possible activity labels (a or c), an example of strong uncertainty on activities. Some events, e.g. $e_2$, do not have a precise timestamp but a time interval in which the event could have happened has been recorded: in some cases, this causes the loss of the precise order of events (e.g. $e_1$ and $e_2$). These are examples of strong uncertainty on timestamps. As shown by the “?” symbol, $e_3$ is an indeterminate event: it has been recorded, but it is not guaranteed to have happened.

Definition 7 (Universes). Let $U_E$ be the set of all the event identifiers. Let $U_C$ be the set of all case identifiers. Let $U_A$ be the set of all the activity identifiers. Let $U_T$ be the totally ordered set of all the timestamp identifiers. Let $U_O = \{!, ?\}$, where the “!” symbol denotes determinate events, and the “?” symbol denotes indeterminate events.

Table 1: An example of simple uncertain trace.

Case ID   Event ID   Activity   Timestamp                                Event Type
354       e1         {a, c}     [2011-12-02T00:00, 2011-12-05T00:00]     !
354       e2         {a, d}     [2011-12-03T00:00, 2011-12-05T00:00]     !
354       e3         {a, b}     2011-12-07T00:00                         ?
354       e4         {a, b}     [2011-12-09T00:00, 2011-12-15T00:00]     !
354       e5         {b, c}     [2011-12-11T00:00, 2011-12-17T00:00]     !
354       e6         {b}        2011-12-20T00:00                         !

Definition 8 (Simple uncertain traces and logs). $\sigma \in \mathcal{P}_{NE}(U_E \times \mathcal{P}_{NE}(U_A) \times U_T \times U_T \times U_O)$ is a simple uncertain trace if for any $(e_i, A, t_{min}, t_{max}, o) \in \sigma$, $t_{min} < t_{max}$ and all the event identifiers are unique. $T_U$ denotes the universe of simple uncertain traces. $L \in \mathcal{P}(T_U)$ is a simple uncertain log if all the event identifiers in the log are unique. Over the uncertain event $e = (e_i, A, t_{min}, t_{max}, o) \in \sigma$ we define the following projection functions: $\pi_A(e) = A$, $\pi_{t_{min}}(e) = t_{min}$, $\pi_{t_{max}}(e) = t_{max}$ and $\pi_o(e) = o$. Over $L \in \mathcal{P}(T_U)$ we define the following projection function: $\Pi_A(L) = \bigcup_{\sigma \in L} \bigcup_{e \in \sigma} \pi_A(e)$.
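As an illustration, the trace of Table 1 can be encoded with a small Python structure. This is a minimal sketch rather than the implementation used in the paper: the names `UncertainEvent`, `ev` and `log_activities` are hypothetical, and an exact timestamp is stored as a degenerate interval with equal bounds, which is an assumption made for convenience.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UncertainEvent:
    """One simple uncertain event (e_i, A, t_min, t_max, o)."""
    event_id: str
    activities: frozenset      # pi_A(e): non-empty set of possible labels
    t_min: datetime            # pi_tmin(e)
    t_max: datetime            # pi_tmax(e)
    determinate: bool = True   # True encodes "!", False encodes "?"


def ev(event_id, labels, start, end=None, determinate=True):
    """Hypothetical helper: build an event from ISO 8601 strings."""
    t_min = datetime.fromisoformat(start)
    # Assumption: an exact timestamp is stored as a degenerate interval.
    t_max = datetime.fromisoformat(end) if end else t_min
    return UncertainEvent(event_id, frozenset(labels), t_min, t_max, determinate)


# The simple uncertain trace of Table 1 (case 354).
TABLE_1 = [
    ev("e1", "ac", "2011-12-02T00:00", "2011-12-05T00:00"),
    ev("e2", "ad", "2011-12-03T00:00", "2011-12-05T00:00"),
    ev("e3", "ab", "2011-12-07T00:00", determinate=False),
    ev("e4", "ab", "2011-12-09T00:00", "2011-12-15T00:00"),
    ev("e5", "bc", "2011-12-11T00:00", "2011-12-17T00:00"),
    ev("e6", "b", "2011-12-20T00:00"),
]


def log_activities(log):
    """Pi_A(L): union of the possible labels of every event in every trace."""
    return set().union(*(e.activities for trace in log for e in trace))


print(sorted(log_activities([TABLE_1])))
```

A log is represented here simply as a list of traces, so `log_activities([TABLE_1])` collects the labels a, b, c and d.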

The behavior graph is a structure that summarizes information regarding the uncertainty contained in a trace. Namely, two vertices are linked by an edge if their corresponding events may have happened one immediately after the other.

Definition 9 (Behavior Graph). Let $\sigma \in T_U$ be a simple uncertain trace. A behavior graph $\beta : T_U \to U_G$ is the transitive reduction of a directed graph $\rho(G)$, where $G = (V, E) \in U_G$ is defined as:

• $V = \{e \in \sigma\}$

• $E = \{(v, w) \mid v, w \in V \wedge \pi_{t_{max}}(v) < \pi_{t_{min}}(w)\}$

Notice that the behavior graph is obtained from the transitive reduction of an acyclic graph, and thus is unique. The behavior graph for the trace in Table 1 is shown in Figure 1.
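The construction in Definition 9 can be sketched in a few lines of plain Python. The function name `behavior_graph` and the encoding of timestamps as day numbers are assumptions made for this illustration. The sketch relies on the observation that the relation "v ends strictly before w starts" is transitive, so an edge is redundant exactly when some intermediate event lies between its endpoints.

```python
def behavior_graph(intervals):
    """Edges of the behavior graph of a trace.

    `intervals` maps an event identifier to a (t_min, t_max) pair with
    t_min <= t_max; any totally ordered timestamp type works.
    """
    # Edges of G: v must end strictly before w starts.
    full = {(v, w) for v in intervals for w in intervals
            if intervals[v][1] < intervals[w][0]}
    # The relation above is transitive, so G is transitively closed and an
    # edge (v, w) is redundant exactly when some u satisfies v -> u -> w.
    return {(v, w) for (v, w) in full
            if not any((v, u) in full and (u, w) in full for u in intervals)}


# Table 1, with timestamps abbreviated to the day of December 2011.
TABLE_1 = {"e1": (2, 5), "e2": (3, 5), "e3": (7, 7),
           "e4": (9, 15), "e5": (11, 17), "e6": (20, 20)}

edges = behavior_graph(TABLE_1)
print(sorted(edges))
# [('e1', 'e3'), ('e2', 'e3'), ('e3', 'e4'), ('e3', 'e5'), ('e4', 'e6'), ('e5', 'e6')]
```

Events e1 and e2 remain unconnected, as do e4 and e5, because their time intervals overlap.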

[Figure 1: graph with one vertex per event, e1 to e6, each labeled with its set of possible activities.]

Figure 1: The behavior graph of the uncertain trace given in Table 1. Each vertex represents an uncertain event and is labeled with the possible activity label of the event. The dotted circle represents an indeterminate event (may or may not have happened).

4 Uncertain DFGs

The definitions shown in Section 3 allow us to introduce some fundamental concepts necessary to perform discovery in an uncertain setting. Let us define a measure for the frequencies of single activities. In an event log without uncertainty, the frequency of an activity is the number of events that have the corresponding activity label. In the uncertain case, there are events that can have multiple possible activity labels. For a given activity $a \in U_A$, the minimum activity frequency of $a$ is the number of events that certainly have $a$ as activity label and certainly happened; the maximum activity frequency is the number of events that may have $a$ as activity label.

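Definition 10 (Minimum and maximum activity frequency). The minimum and maximum activity frequency $\#_{min} : T_U \times U_A \to \mathbb{N}$ and $\#_{max} : T_U \times U_A \to \mathbb{N}$ of an activity $a \in U_A$ in regard of an uncertain trace $\sigma \in T_U$ are defined as:

• $\#_{min}(\sigma, a) = |\{e \in \sigma \mid \pi_A(e) = \{a\} \wedge \pi_o(e) = {!}\}|$

• $\#_{max}(\sigma, a) = |\{e \in \sigma \mid a \in \pi_A(e)\}|$.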
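The two counts of Definition 10 translate directly into code. The following sketch uses hypothetical function names and encodes each event of Table 1 as a pair of its possible labels and a flag telling whether it is determinate; timestamps play no role in these counts and are omitted.

```python
def min_activity_frequency(trace, activity):
    """#min: determinate events whose only possible label is `activity`."""
    return sum(1 for labels, determinate in trace
               if labels == {activity} and determinate)


def max_activity_frequency(trace, activity):
    """#max: events that list `activity` among their possible labels."""
    return sum(1 for labels, _ in trace if activity in labels)


# Table 1 as (possible labels, is_determinate) pairs, in the order e1..e6.
TABLE_1 = [({"a", "c"}, True), ({"a", "d"}, True), ({"a", "b"}, False),
           ({"a", "b"}, True), ({"b", "c"}, True), ({"b"}, True)]

for act in "abcd":
    print(act, min_activity_frequency(TABLE_1, act),
          max_activity_frequency(TABLE_1, act))
# a 0 4
# b 1 4
# c 0 2
# d 0 1
```

Only e6 contributes to a minimum frequency, since it is the only determinate event with a single possible label.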

Many discovery algorithms exploit the concept of directly-follows relationship [2,6]. In this paper, we extend this notion to uncertain traces and uncertain event logs. An uncertain trace embeds some behavior which depends on the instantiation of the stochastic variables contained in the event attributes. Some directly-follows relationships exist in part, but not all, of the possible behavior of an uncertain trace. As an example, consider events $e_3$ and $e_5$ in the uncertain trace shown in Table 1: the relationship “a is directly followed by b” appears once only if $e_3$ actually happened immediately before $e_5$ (i.e., $e_4$ did not happen in-between), and if the activity label of $e_5$ is b (as opposed to c, the other possible label). In all the behavior that does not satisfy these conditions, the directly-follows relation does not appear on $e_3$ and $e_5$.

Let us define as realizations all the possible certain traces that are obtainable by choosing a value among all possible ones for each uncertain attribute of the uncertain trace. For example, some possible realizations of the trace in Table 1 are $\langle a, d, b, a, c, b \rangle$, $\langle a, a, a, a, b, b \rangle$, and $\langle c, a, c, b, b \rangle$. We can express the strength of the directly-follows relationship between two activities in an uncertain trace by counting the minimum and maximum number of times the relationship can appear in one of the possible realizations of that trace. To this goal, we exploit some structural properties of the behavior graph in order to obtain the minimum and maximum frequency of directly-follows relationships in a simpler manner.
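For a trace as small as the one in Table 1, the realizations can simply be enumerated. The sketch below is a brute-force illustration, not the graph-based method developed in this section: it assumes that every ordering compatible with the time intervals is a valid realization, its cost grows exponentially with the trace length, and the names `realizations` and `df_count` are hypothetical.

```python
from itertools import combinations, permutations, product

# Table 1: id -> (possible labels, t_min, t_max, is_determinate); timestamps
# are abbreviated to the day of December 2011.
TABLE_1 = {
    "e1": ("ac", 2, 5, True), "e2": ("ad", 3, 5, True),
    "e3": ("ab", 7, 7, False), "e4": ("ab", 9, 15, True),
    "e5": ("bc", 11, 17, True), "e6": ("b", 20, 20, True),
}


def realizations(trace):
    """Yield every certain trace (tuple of labels) the uncertain trace allows."""
    optional = [e for e in trace if not trace[e][3]]
    mandatory = [e for e in trace if trace[e][3]]
    for k in range(len(optional) + 1):
        for kept in combinations(optional, k):   # indeterminate events kept
            for order in permutations(mandatory + list(kept)):
                # Reject orders that place an event after one it surely follows.
                if any(trace[order[j]][2] < trace[order[i]][1]
                       for i in range(len(order))
                       for j in range(i + 1, len(order))):
                    continue
                # One realization per choice of activity labels.
                yield from product(*(trace[e][0] for e in order))


def df_count(labels, a, b):
    """Occurrences of 'a is directly followed by b' in a certain trace."""
    return sum(1 for x, y in zip(labels, labels[1:]) if (x, y) == (a, b))


counts = [df_count(r, "a", "b") for r in realizations(TABLE_1)]
print(min(counts), max(counts))   # 0 2
```

Under these assumptions, the relationship "a is directly followed by b" occurs between zero and two times across the realizations of the trace.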

A useful property to compute the minimum number of occurrences between two activities exploits the fact that parallel behavior is represented by the branching of arcs in the graph. Two connected determinate events have happened one immediately after the other if the graph does not have any other parallel path: if two determinate events are connected by a bridge, they will certainly happen in succession. This property is used to define a strong sequential relationship.

The next property accounts for the fact that, by construction, uncertain events corresponding to nodes in the graph not connected by a path can happen in any order. This follows directly from the definition of the edges in the graph, together with the transitivity of $U_T$ (which is a totally ordered set). This means that two disconnected nodes $v$ and $w$ may account for one occurrence of the relation “$\pi_A(v)$ is directly followed by $\pi_A(w)$”. Conversely, if $w$ is reachable from $v$, the directly-follows relationship may be observed if all the events separating $v$ from $w$ are indeterminate (i.e., there is a chance that no event will interpose between the ones in $v$ and $w$). This happens for vertices $e_2$ and $e_4$ in the graph in Figure 1, which are connected by a path and separated only by vertex $e_3$, which is indeterminate. This property is useful to compute the maximum number of directly-follows relationships between two activities, leading to the notion of weak sequential relationship.

Definition 11 (Strong sequential relationship). Given a behavior graph $\beta = (V, E)$ and two vertices $v, w \in V$, $v$ is in a strong sequential relationship with $w$ (denoted by $v \blacktriangleright_\beta w$) if and only if $\pi_o(v) = {!}$ and $\pi_o(w) = {!}$ ($v$ and $w$ are both determinate) and there is a bridge between them: $(v, w) \in E_B$.

Definition 12 (Weak sequential relationship). Given a behavior graph $\beta = (V, E)$ and two vertices $v, w \in V$, $v$ is in a weak sequential relationship with $w$ (denoted by $v \triangleright_\beta w$) if and only if $|P_\beta(w, v)| = 0$ ($v$ is unreachable from $w$) and no node in any possible path between $v$ and $w$, excluding $v$ and $w$, is determinate: $\bigcup_{p \in P_\beta(v, w)} \{e \in p \mid \pi_o(e) = {!}\} \setminus \{v, w\} = \emptyset$.

Notice that if $v$ and $w$ are mutually unreachable they are also in a mutual weak sequential relationship. Given two activity labels, these properties allow us to extract sets of candidate pairs of vertices of the behavior graph.
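Both relationships can be checked with elementary graph searches. The sketch below is one possible reading of Definitions 11 and 12, with hypothetical function names; the edge set is the one obtained by applying Definition 9 to Table 1, and bridges are detected by removing an edge and testing whether its endpoints stay connected when direction is ignored.

```python
def descendants(edges, v):
    """Vertices reachable from v through at least one edge."""
    seen, stack = set(), [v]
    while stack:
        x = stack.pop()
        for s, t in edges:
            if s == x and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def is_bridge(edges, edge):
    """True if removing `edge` disconnects its endpoints, ignoring direction."""
    rest = edges - {edge}
    undirected = rest | {(t, s) for s, t in rest}
    return edge[1] not in descendants(undirected, edge[0])


def strong(edges, determinate, v, w):
    """Strong sequential relationship: determinate endpoints joined by a bridge."""
    return (determinate[v] and determinate[w]
            and (v, w) in edges and is_bridge(edges, (v, w)))


def weak(edges, determinate, v, w):
    """Weak sequential relationship between two distinct vertices v and w."""
    if v in descendants(edges, w):
        return False
    # Interior vertices of the paths from v to w: reachable from v, reaching w.
    between = {u for u in descendants(edges, v) if w in descendants(edges, u)}
    return not any(determinate[u] for u in between - {v, w})


EDGES = {("e1", "e3"), ("e2", "e3"), ("e3", "e4"),
         ("e3", "e5"), ("e4", "e6"), ("e5", "e6")}
DETERMINATE = {"e1": True, "e2": True, "e3": False,
               "e4": True, "e5": True, "e6": True}

print(weak(EDGES, DETERMINATE, "e2", "e4"))               # True
print(weak(EDGES, DETERMINATE, "e4", "e5"),
      weak(EDGES, DETERMINATE, "e5", "e4"))               # True True
print(weak(EDGES, DETERMINATE, "e1", "e6"))               # False
print(any(strong(EDGES, DETERMINATE, v, w) for v, w in EDGES))  # False
```

In this graph the only bridges are the edges entering e3, which is indeterminate, so no pair of vertices is in a strong sequential relationship.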

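Definition 13 (Candidates for minimum and maximum directly-follows frequencies). Given two activities $a, b \in U_A$ and an uncertain trace $\sigma \in T_U$ and the corresponding behavior graph $\beta(\sigma) = (V, E)$, the candidates for minimum and maximum directly-follows frequency $cand_{min} : T_U \times U_A \times U_A \to \mathcal{P}(V \times V)$ and $cand_{max} : T_U \times U_A \times U_A \to \mathcal{P}(V \times V)$ are defined as:

• $cand_{min}(\sigma, a, b) = \{(v, w) \in V \times V \mid v \neq w \wedge \pi_A(v) = \{a\} \wedge \pi_A(w) = \{b\} \wedge v \blacktriangleright_\beta w\}$

• $cand_{max}(\sigma, a, b) = \{(v, w) \in V \times V \mid v \neq w \wedge a \in \pi_A(v) \wedge b \in \pi_A(w) \wedge v \triangleright_\beta w\}$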

After obtaining the sets of candidates, it is necessary to select a subset of pairs of vertices such that there are no repetitions. In a realization of an uncertain trace, an event $e$ can only have one successor: if multiple vertices of the behavior graph correspond to events that can succeed $e$, only one can be selected.

Consider the behavior graph in Figure 1. If we search candidates for “a is directly followed by b”, we find $cand_{max}(\sigma, a, b) = \{(e_1, e_3), (e_2, e_3), (e_1, e_5), (e_2, e_4), (e_3, e_4), (e_3, e_5), (e_4, e_6)\}$. However, there are no realizations of the trace represented by the behavior graph that contain all the candidates; this is because some vertices appear in multiple candidates. A possible realization with the highest frequency of $a \to b$ is $\langle d, a, b, c, a, b \rangle$.

Conversely, consider “a is directly followed by a”. When the same activity appears on both sides of the relationship, an event can be part of two different occurrences, as first member and second member; e.g., in the trace $\langle a, a, a \rangle$, the relationship $a \to a$ occurs two times, and the second event is part of both occurrences. In the behavior graph of Figure 1, the relation $a \to b$ cannot be supported by candidates $(e_1, e_3)$ and $(e_3, e_4)$ at the same time, because $e_3$ has either label a or b in a realization. But $(e_1, e_3)$ and $(e_3, e_4)$ can both support the relationship $a \to a$, in realizations where $e_1$, $e_3$ and $e_4$ all have label a.

When counting the frequencies of directly-follows relationships between the activities $a$ and $b$, every node of the behavior graph can appear at most once if $a \neq b$. If $a = b$, every node can appear once on each side of the relationship.
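The no-repetition rule can be illustrated with an exhaustive search over subsets of candidate pairs. This is only a sketch under stated assumptions: the function names are hypothetical, the search is exponential in the number of candidates, and larger graphs would call for a proper matching algorithm. The example uses the certain trace of three events with label a mentioned above, whose vertices are named x1, x2 and x3 for the occasion.

```python
from itertools import combinations


def compatible(pairs, same_activity):
    """Check the no-repetition rule on a collection of (v, w) vertex pairs."""
    if same_activity:
        firsts = [v for v, _ in pairs]
        seconds = [w for _, w in pairs]
        # a = b: a vertex may occur once as first and once as second member.
        return (len(set(firsts)) == len(firsts)
                and len(set(seconds)) == len(seconds))
    nodes = [x for pair in pairs for x in pair]
    # a != b: a vertex may occur in at most one selected pair.
    return len(set(nodes)) == len(nodes)


def largest_selection(candidates, same_activity):
    """Size of a largest compatible subset, found by exhaustive search."""
    candidates = list(candidates)
    for size in range(len(candidates), 0, -1):
        if any(compatible(subset, same_activity)
               for subset in combinations(candidates, size)):
            return size
    return 0


# Candidate pairs of a certain trace with three consecutive events.
chain = [("x1", "x2"), ("x2", "x3")]
print(largest_selection(chain, same_activity=True))    # 2
print(largest_selection(chain, same_activity=False))   # 1
```

With equal activities both pairs can be kept, because x2 appears once on each side; with different activities x2 can only be used once, so a single pair survives.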

Definition 14 (Minimum directly-follows frequency). Given $a, b \in U_A$ and $\sigma \in T_U$, let $R_{min} \subseteq cand_{min}(\sigma, a, b)$ be a largest set such that for any $(v, w), (v', w') \in R_{min}$, it holds:

$(v, w) \neq (v', w') \implies \{v, w\} \cap \{v', w'\} = \emptyset$, if $a \neq b$

$(v, w) \neq (v', w') \implies v \neq v' \wedge w \neq w'$, if $a = b$

The minimum directly-follows frequency $\rightsquigarrow_{min} : T_U \times U_A^2 \to \mathbb{N}$ of two activities $a, b \in U_A$ in regard of an uncertain trace $\sigma \in T_U$ is defined as $\rightsquigarrow_{min}(\sigma, a, b) = |R_{min}|$.

Definition 15 (Maximum directly-follows frequency). Given $a, b \in U_A$ and $\sigma \in T_U$, let $R_{max} \subseteq cand_{max}(\sigma, a, b)$ be a largest set such that for any $(v, w), (v', w') \in R_{max}$, it holds:

$(v, w) \neq (v', w') \implies \{v, w\} \cap \{v', w'\} = \emptyset$, if $a \neq b$

$(v, w) \neq (v', w') \implies v \neq v' \wedge w \neq w'$, if $a = b$

The maximum directly-follows frequency $\rightsquigarrow_{max} : T_U \times U_A^2 \to \mathbb{N}$ of two activities $a, b \in U_A$ in regard of an uncertain trace $\sigma \in T_U$ is defined as $\rightsquigarrow_{max}(\sigma, a, b) = |R_{max}|$.

For the uncertain trace in Table 1, $\rightsquigarrow_{min}(\sigma, a, b) = 0$, because $R_{min} = \emptyset$; conversely, $\rightsquigarrow_{max}(\sigma, a, b) = 2$, because a maximal set of candidates is $R_{max} = \{(e_1, e_3), (e_4, e_6)\}$. Notice that maximal candidate sets are not necessarily unique: $R_{max} = \{(e_2, e_3), (e_4, e_6)\}$ is also a valid one.

The $\rightsquigarrow$ operator synthesizes information regarding the strength of the directly-follows relation between two activities in an event log where some events are uncertain. The relative difference between the min and max counts is a measure of how certain the relationship is when it appears in the event log. Notice that, in the case where no uncertainty is contained in the event log, min and max will coincide, and will both contain a directly-follows count for two activities.

An Uncertain DFG (UDFG) is a graph representation of the activity frequencies and the directly-follows frequencies; using the measures we defined, we exclude the activities and the directly-follows relations that never happened.

Definition 16 (Uncertain Directly-Follows Graph (UDFG)). Given an event log $L \in \mathcal{P}(T_U)$, the Uncertain Directly-Follows Graph $DFG_U(L)$ is a directed graph $G = (V, E)$ where:

• $V = \{a \in \Pi_A(L) \mid \sum_{\sigma \in L} \#_{max}(\sigma, a) > 0\}$

• $E = \{(a, b) \in V \times V \mid \sum_{\sigma \in L} \rightsquigarrow_{max}(\sigma, a, b) > 0\}$

The UDFG is a low-abstraction model that, together with the data decorating vertices and arcs, gives indications on the overall uncertainty affecting activities and directly-follows relationships. Moreover, the UDFG does not filter out uncertainty: the information about the uncertain portion of a process is summarized by the data labeling vertices and edges. In addition to the elimination of the anomalies in an event log in order to identify the happy path of a process, this allows the process miner to isolate the uncertain part of a process, in order to study its features and analyze its causes. In essence, however, this model has the same weak points as the classic DFG: it does not support concurrency, and if many activities happen in different order the DFG creates numerous loops that cause underfitting.

5 Inductive Mining Using Directly-Follows Frequencies

A popular process mining algorithm for discovering executable models from DFGs is the Inductive Miner [6]. A variant presented by Leemans et al. [7], the Inductive Miner–directly-follows (IMD), has the peculiar feature of preprocessing an event log to obtain a DFG, and then discovering a process tree exclusively from the graph, which can then be converted to a Petri net. This implies a high scalability of the algorithm, which has a linear computational cost over the number of events in the log, but it also makes it suited to the case at hand in this paper. To allow for inductive mining, and subsequent representation of the process as a Petri net, we introduce a form of filtering called UDFG slicing, based on four filtering parameters: $act_{min}$, $act_{max}$, $rel_{min}$ and $rel_{max}$. The parameters $act_{min}$ and $act_{max}$ filter the nodes of the UDFG, based on how certain the corresponding activity is in the log. Conversely, $rel_{min}$ and $rel_{max}$ filter the edges of the UDFG, based on how certain the corresponding directly-follows relationship is in the log.

Definition 17 (Uncertain DFG slice). Given an uncertain event log $L \in \mathcal{P}(T_U)$, its uncertain directly-follows graph $DFG_U(L) = (V', E')$, and $act_{min}, act_{max}, rel_{min}, rel_{max} \in [0, 1]$, an uncertain directly-follows slice is a function $DFG_U : L \to U_G$ where $DFG_U(L, act_{min}, act_{max}, rel_{min}, rel_{max}) = (V, E)$ with:

• $V = \{a \in V' \mid act_{min} \le \frac{\sum_{\sigma \in L} \#_{min}(\sigma, a)}{\sum_{\sigma \in L} \#_{max}(\sigma, a)} \le act_{max}\}$

• $E = \{(a, b) \in E' \mid rel_{min} \le \frac{\sum_{\sigma \in L} \rightsquigarrow_{min}(\sigma, a, b)}{\sum_{\sigma \in L} \rightsquigarrow_{max}(\sigma, a, b)} \le rel_{max}\}$

A UDFG slice is an unweighted directed graph which represents a filtering performed over vertices and edges of the UDFG. This graph can then be processed by the IMD.
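The slicing step amounts to two ratio filters. The sketch below assumes that the minimum and maximum frequencies have already been summed over the log; the function name `udfg_slice` and the sample counts are hypothetical and are not taken from the experiments. One choice goes beyond the definition and is marked in the code: edges whose endpoints were filtered out are dropped as well.

```python
def udfg_slice(act_counts, rel_counts, act_min=0.0, act_max=1.0,
               rel_min=0.0, rel_max=1.0):
    """Filter a UDFG by the ratio between minimum and maximum frequencies.

    `act_counts` maps an activity to (sum of #min, sum of #max) over the log;
    `rel_counts` maps a pair (a, b) to the summed (min, max) directly-follows
    frequencies. Every entry is assumed to have a maximum count above zero.
    """
    vertices = {a for a, (lo, hi) in act_counts.items()
                if act_min <= lo / hi <= act_max}
    edges = {(a, b) for (a, b), (lo, hi) in rel_counts.items()
             if rel_min <= lo / hi <= rel_max
             # Assumption beyond Definition 17: drop edges whose endpoints
             # were filtered out, so that the result is a well-formed graph.
             and a in vertices and b in vertices}
    return vertices, edges


# Hypothetical aggregated counts, chosen only for illustration.
acts = {"a": (10, 10), "b": (6, 10), "c": (0, 4)}
rels = {("a", "b"): (6, 10), ("a", "c"): (0, 4), ("b", "c"): (0, 2)}

print(udfg_slice(acts, rels))                # everything is retained
print(udfg_slice(acts, rels, act_min=0.5))   # activity c and its edges vanish
print(udfg_slice(acts, rels, rel_max=0.5))   # only the uncertain edges remain
```

Raising a lower bound keeps the more certain part of the graph, while lowering an upper bound keeps the more uncertain part.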

Definition 18 (Uncertain Inductive Miner–directly-follows (UIMD)). Given an uncertain event log $L \in \mathcal{P}(T_U)$ and $act_{min}, act_{max}, rel_{min}, rel_{max} \in [0, 1]$, the Uncertain Inductive Miner–directly-follows (UIMD) returns the process tree obtained by IMD over an uncertain DFG slice: $IMD(DFG_U(L, act_{min}, act_{max}, rel_{min}, rel_{max}))$.

The filtering parameters $act_{min}$, $act_{max}$, $rel_{min}$, $rel_{max}$ make it possible to isolate the desired type of behavior of the process. In fact, $act_{min} = rel_{min} = 0$ and $act_{max} = rel_{max} = 1$ retain all possible behavior of the process, which is then represented in the model: both the behavior deriving from the process itself and the behavior deriving from the uncertain traces. Higher values of $act_{min}$ and $rel_{min}$ filter out uncertain behavior and retain only the parts of the process observed in certain events. Vice versa, lowering $act_{max}$ and $rel_{max}$ makes it possible to observe only the uncertain part of an event log.

Figure 2: UIMD on the test log with $act_{min} = 0$, $act_{max} = 1$, $rel_{min} = 0$, $rel_{max} = 1$.

Figure 3: UIMD on the test log with $act_{min} = 0.6$, $act_{max} = 1$, $rel_{min} = 0$, $rel_{max} = 1$.

6 Experiments

The approach described here has been implemented using the Python process mining framework PM4Py [3]. The models obtained through the Uncertain Inductive Miner–directly-follows cannot be evaluated with commonly used metrics in process mining, since the metrics in use are not applicable to uncertain event data, nor do other approaches for performing discovery over uncertain data exist. This preliminary evaluation of the algorithm will, therefore, not be based on measurements; it will show the effect of the UIMD with different settings on an uncertain event log.

Let us introduce a simplified notation for uncertain event logs. In a trace, we represent an uncertain event with multiple possible activity labels by listing the labels between curly braces. When two events have overlapping timestamps, we represent their activity labels between square brackets, and we represent the indeterminate events by overlining them. For example, the trace $\langle \overline{a}, \{b, c\}, [d, e] \rangle$ is a trace containing 4 events, of which the first is an indeterminate event with label a, the second is an uncertain event that can have either b or c as activity label, and the last two events have a range as timestamp (and the two ranges overlap). The simplified representation of the trace in Table 1 is $\langle [\{a, c\}, \{a, d\}], \overline{\{a, b\}}, [\{a, b\}, \{b, c\}], b \rangle$. Let us observe the effect of the UIMD on the following test log:

$\langle a, b, e, f, g, h \rangle^{80}, \langle a, [\{b, c\}, e], \overline{f}, g, h, i \rangle^{15}, \langle a, [\{b, c, d\}, e], \overline{f}, g, h, j \rangle^{5}$.

In Figure 2, we can see the model obtained without any filtering: it represents all the possible behavior in the uncertain log. The models in Figures 3 and 4 show the effect of filtering on the minimum number of times an activity appears in the log: in Figure 3 activities c and d are filtered out, while the model in Figure 4 only retains the activities which never appear in an uncertain event (i.e., the activities for which $\#_{min}$ is at least 90% of $\#_{max}$).

Figure 4: UIMD on the test log with $act_{min} = 0.9$, $act_{max} = 1$, $rel_{min} = 0$, $rel_{max} = 1$.

Figure 5: UIMD on the test log with $act_{min} = 0$, $act_{max} = 1$, $rel_{min} = 0.7$, $rel_{max} = 1$.
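The activity filter can be checked by hand on the test log. The sketch below recomputes the ratio between the summed minimum and maximum activity frequencies for each activity; it is an illustration with hypothetical names, not the PM4Py-based implementation, and it reads f as an indeterminate event in the second and third trace variants, as the later discussion of the indeterminate event f suggests. Timestamps are omitted because they do not affect activity frequencies.

```python
def certain(label):
    """A determinate event with a single possible label."""
    return ({label}, True)


# Test log as (events, multiplicity); an event is (possible labels, is_determinate).
TEST_LOG = [
    ([certain(x) for x in "abefgh"], 80),
    ([certain("a"), ({"b", "c"}, True), certain("e"), ({"f"}, False),
      certain("g"), certain("h"), certain("i")], 15),
    ([certain("a"), ({"b", "c", "d"}, True), certain("e"), ({"f"}, False),
      certain("g"), certain("h"), certain("j")], 5),
]


def activity_ratios(log):
    """Ratio between the summed #min and the summed #max of every activity."""
    lo, hi = {}, {}
    for events, multiplicity in log:
        for labels, determinate in events:
            for a in labels:
                hi[a] = hi.get(a, 0) + multiplicity
                if determinate and labels == {a}:
                    lo[a] = lo.get(a, 0) + multiplicity
    return {a: lo.get(a, 0) / hi[a] for a in hi}


ratios = activity_ratios(TEST_LOG)
for threshold in (0.6, 0.9):
    print(threshold, sorted(a for a, r in ratios.items() if r >= threshold))
# 0.6 ['a', 'b', 'e', 'f', 'g', 'h', 'i', 'j']
# 0.9 ['a', 'e', 'g', 'h', 'i', 'j']
```

Activities c and d have ratio 0 and disappear at the 0.6 threshold, while b and f have ratio 0.8 and disappear only at 0.9.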

Filtering on $rel_{min}$ has a similar effect, although it retains the most certain relationships, rather than activities, as shown in Figure 5. An even more aggressive filtering on $rel_{min}$, as shown in Figure 6, represents only the parts of the process which are never subjected to uncertainty by being in a directly-follows relationship that has a low min value.

The UIMD can also do the opposite: hide certain behavior and highlight the uncertain behavior. Figure 7 shows a model that only displays the behavior which is part of uncertain attributes, while activities h, i and j, which are never part of uncertain behavior, have not been represented. Notice that g is represented even though it always appeared as a certain event; this is due to the fact that the filtering is based on relationships, and g is in a directly-follows relationship with the indeterminate event f.

7 Conclusion

In this explorative work, we present the foundations for performing process discovery over uncertain event data. We present a method that is effective in representing a process containing uncertainty by exploiting the information in an uncertain event log to synthesize an uncertain model. The UDFG is a formal description of uncertainty, rather than a method to eliminate uncertainty to observe the underlying process. This makes it possible to study uncertainty in isolation, possibly allowing us to determine which effects it has on the process in terms of behavior, as well as what the causes of its appearance are. We also present a method to filter the UDFG, obtaining a graph that represents a specific perspective of the uncertainty in the process; this can then be transformed into a model that is able to express concurrency using the UIMD algorithm.

Figure 6: UIMD on the test log with $act_{min} = 0$, $act_{max} = 1$, $rel_{min} = 0.9$, $rel_{max} = 1$.

Figure 7: UIMD on the test log with $act_{min} = 0$, $act_{max} = 1$, $rel_{min} = 0$, $rel_{max} = 0.8$.

This approach has a number of limitations that will need to be addressed in future work. An important research direction is the formal definition of metrics and measures over uncertain event logs and process models, in order to allow for a quantitative evaluation of the quality of this discovery algorithm, as well as other process mining methods over uncertain logs. Another line of research can be the extension to the weakly uncertain event data (i.e., including probabilities) and the extension to event logs also containing uncertainty related to case IDs.

Acknowledgements

We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research interactions.

References

[1] van der Aalst, Wil M. P. Process Mining - Data Science in Action, Second Edition. Springer, 2016. isbn: 978-3-662-49850-7. doi:10.1007/978-3-662-49851-4.

[2] van der Aalst, Wil M. P., Ton Weijters, and Laura Maruster. “Workflow Mining: Discovering Process Models from Event Logs”. In: IEEE Transactions on Knowledge and Data Engineering 16.9 (2004), pp. 1128–1142. doi:10.1109/TKDE.2004.47.

[3] Berti, Alessandro, Sebastiaan J. van Zelst, and Wil M. P. van der Aalst. “Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science”. In: ICPM Demo Track (CEUR 2374). 2019, pp. 13–16. url: http://ceur-ws.org/Vol-2374/paper4.pdf.

[4] Conforti, Raffaele, Marcello La Rosa, and Arthur H. M. ter Hofstede. “Filtering Out Infrequent Behavior from Business Process Event Logs”. In: IEEE Transactions on Knowledge and Data Engineering 29.2 (2017), pp. 300–314. doi:10.1109/TKDE.2016.2614680.

[5] Hornik, Kurt, Bettina Grün, and Michael Hahsler. “arules – A computational environment for mining association rules and frequent item sets”. In: Journal of Statistical Software 14.15 (2005), pp. 1–25. doi:10.18637/jss.v014.i15.

[6] Leemans, Sander J. J., Dirk Fahland, and Wil M. P. van der Aalst. “Discovering Block-Structured Process Models from Event Logs - A Constructive Approach”. In: Application and Theory of Petri Nets and Concurrency - 34th International Conference, PETRI NETS 2013, Milan, Italy, June 24-28, 2013. Proceedings. Ed. by Colom, José Manuel and Jörg Desel. Vol. 7927. Lecture Notes in Computer Science. Springer, 2013, pp. 311–329. doi:10.1007/978-3-642-38697-8_17.

[7] Leemans, Sander J. J., Dirk Fahland, and Wil M. P. van der Aalst. “Scalable process discovery and conformance checking”. In: Software and Systems Modeling 17.2 (2018), pp. 599–631. doi:10.1007/s10270-016-0545-x.

[8] Lu, Xixi, Dirk Fahland, Frank J. H. M. van den Biggelaar, et al. “Detecting Deviating Behaviors Without Models”. In: Business Process Management Workshops - BPM 2015, 13th International Workshops, Innsbruck, Austria, August 31 - September 3, 2015, Revised Papers. Ed. by Reichert, Manfred and Hajo A. Reijers. Vol. 256. Lecture Notes in Business Information Processing. Springer, 2015, pp. 126–139. doi:10.1007/978-3-319-42887-1_11.

[9] Pegoraro, Marco and Wil M. P. van der Aalst. “Mining Uncertain Event Data in Process Mining”. In: International Conference on Process Mining, ICPM 2019, Aachen, Germany, June 24-26, 2019. IEEE, 2019, pp. 89–96. doi:10.1109/ICPM.2019.00023.