
Discovering Process Models from Uncertain Event Data


Marco Pegoraro¹, Merih Seran Uysal¹, and Wil M. P. van der Aalst¹
¹ Chair of Process and Data Science (PADS), Department of Computer Science, RWTH Aachen University, Aachen, Germany
{pegoraro, uysal, vwdaalst}
Abstract. Modern information systems are able to collect event data in the form of event logs. Process mining techniques make it possible to discover a model from event data, to check the conformance of an event log against a reference model, and to perform further process-centric analyses. In this paper, we consider uncertain event logs, where data is recorded together with explicit uncertainty information. We describe a technique to discover a directly-follows graph from such event data which retains information about the uncertainty in the process. We then present experimental results of performing inductive mining over the directly-follows graph to obtain models representing the certain and uncertain part of the process.
Keywords: Process Mining · Process Discovery · Uncertain Data.
This work is licensed under a Creative Commons “Attribution-NonCommercial 4.0 International” license.
© the authors. Some rights reserved.
This document is an Author Accepted Manuscript (AAM) corresponding to the following scholarly paper:
Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “Discovering Process Models from Uncertain Event
Data”. In: Business Process Management Workshops - BPM 2019 International Workshops, Vienna, Austria, September
1-6, 2019, Revised Selected Papers. Ed. by Di Francescomarino, Chiara, Remco M. Dijkman, and Uwe Zdun. Vol. 362.
Lecture Notes in Business Information Processing. Springer, 2019, pp. 238–249. doi:10.1007/978-3-030-37453-
Please, cite this document as shown above.
Publication chronology:
2019-06-02: full text submitted to the International Workshop on Business Process Intelligence (BPI) 2019
2019-06-28: notification of acceptance
2019-07-13: camera-ready version submitted
2019-09-02: presented
2019-09-20: post-proceedings version submitted
2020-01-03: post-proceedings published
The published version referred to above is © Springer.
Correspondence to:
Marco Pegoraro, Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Ahornstr. 55, 52074 Aachen, Germany
ORCID: 0000-0002-8997-7517
Content: 15 pages, 7 figures, 1 table, 9 references. Typeset with pdfLaTeX, Biber, and BibLaTeX.
Please do not print this document unless strictly necessary.
M. Pegoraro et al. Discovering Process Models from Uncertain Event Data
1 Introduction
With the advent of digitalization of business processes and related management tools, Process-Aware Information Systems (PAISs), ranging from ERP/CRM-systems to BPM/WFM-systems, are widely used to support operational administration of processes. The databases of PAISs containing event data can be queried to obtain event logs, collections of recordings of the execution of activities belonging to the process. The discipline of process mining aims to synthesize knowledge about processes via the extraction and analysis of execution logs.
When applying process mining in real-life settings, the need to address anomalies in data recording when performing analyses is omnipresent. A number of such anomalies can be modeled by using the notion of uncertainty: uncertain event logs contain, alongside the event data, some attributes that describe a certain level of uncertainty affecting the data. A typical example is the timestamp information: in many processes, specifically the ones where data is in part manually recorded, the timestamp of events is recorded with low precision (e.g., specifying only the day of occurrence). If multiple events belonging to the same case are recorded within the same time unit, the information regarding the event order is lost. This can be modeled as uncertainty of the timestamp attribute by assigning a time interval to the events. Another example of uncertainty is the situation where the activity label is unrecorded or lost, but the events are associated with specific resources that carried out the corresponding activity. In many organizations, each resource is authorized to perform a limited set of activities, depending on their role. In this case, it is possible to model the absence of activity labels by associating every event with the set of possible activities which the resource is authorized to perform.
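The two examples above can be pictured as a small pre-processing step. The following is only a minimal sketch: the role table `AUTHORIZED`, the resource names and the choice of a one-day interval are illustrative assumptions, not part of the approach described in this paper.

```python
from datetime import datetime, timedelta

# Hypothetical role table: which activities each resource is authorized to perform.
AUTHORIZED = {"nurse": {"triage", "discharge"}, "clerk": {"register"}}


def day_to_interval(day: str):
    """Turn a day-precision timestamp into the time interval covering that day."""
    start = datetime.strptime(day, "%Y-%m-%d")
    return start, start + timedelta(days=1)


def possible_activities(resource: str):
    """Replace a missing activity label with the set of activities of the resource."""
    return set(AUTHORIZED[resource])
```

An event recorded only as "2011-12-02, performed by nurse" would then carry the interval from midnight of December 2 to midnight of December 3 and the activity set {triage, discharge}.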
Usually, information about uncertainty is not natively contained in a log: event data is extracted from information systems as activity label, timestamp and case id (and possibly additional attributes), without any sort of meta-information regarding uncertainty. In some cases, a description of the uncertainty in the process can be obtained from background knowledge. Information translatable to uncertainty, such as the examples given above, can for instance be acquired from an interview with the process owner, and then inserted in the event log with a pre-processing step. Research efforts regarding how to discover uncertainty in a representation of domain knowledge and how to translate it to obtain an uncertain event log are currently ongoing.
Uncertainty can be addressed by filtering out the affected events when it appears sporadically throughout an event log. Conversely, in situations where uncertainty affects a significant fraction of an event log, filtering out uncertain events can lead to information loss such that analysis becomes very difficult. In this circumstance, it is important to deploy process mining techniques that make it possible to mine information also from the uncertain part of the process.
In this paper, we aim to develop a process discovery approach for uncertain event data. We present a methodology to obtain Uncertain Directly-Follows Graphs (UDFGs), models based on directed graphs that synthesize information about the uncertainty contained in the process. We then show how to convert UDFGs into models with execution semantics via filtering on uncertainty information and inductive mining.
The remainder of the paper is structured as follows: in Section 2, we present relevant previous work. In Section 3, we provide the preliminary information necessary for formulating uncertainty. In Section 4, we define the uncertain version of directly-follows graphs. In Section 5, we describe some examples of exploiting UDFGs to obtain executable models. Section 6 presents some experiments. Section 7 proposes future work and concludes the paper.
2 Related Work
In a previous work [9], we proposed a taxonomy of possible types of uncertainty in event data. To the best of our knowledge, no previous work addressing explicit uncertainty currently exists in process mining. Since usual event logs do not contain any hint regarding misrecordings of data or other anomalies, the notion of “noise” or “anomaly” normally considered in process discovery refers to outlier behavior. This is often obtained by setting thresholds to filter out the behavior not considered for representation in the resulting process model. A variant of the Inductive Miner by Leemans et al. [6] considers only directly-follows relationships appearing with a certain frequency. In general, a direct way to address infrequent behavior on the event level is to apply to it the concepts of support and confidence, widely used in association rule learning [5]. More sophisticated techniques detect infrequent patterns by employing a mapping between events [8] or a finite state automaton [4] mined from the most frequent behavior.
Although various interpretations of uncertain information can exist, this paper presents a novel approach that aims to represent uncertainty explicitly, rather than filtering it out. For this reason, existing approaches to identify noise cannot be applied to the problem at hand.
3 Preliminaries
To define uncertain event data, we introduce some basic notations and concepts, partially from [1]:
Definition 1 (Power Set). The power set of a set $A$ is the set of all possible subsets of $A$, and is denoted with $\mathcal{P}(A)$. $\mathcal{P}_{NE}(A)$ denotes the set of all the non-empty subsets of $A$: $\mathcal{P}_{NE}(A) = \mathcal{P}(A) \setminus \{\emptyset\}$.
Definition 2 (Sequence). Given a set $X$, a finite sequence over $X$ of length $n$ is a function $s \in X^*: \{1, \dots, n\} \to X$, typically written as $s = \langle s_1, s_2, \dots, s_n \rangle$. For any sequence $s$ we define $|s| = n$, $s[i] = s_i$, $S_s = \{s_1, s_2, \dots, s_n\}$ and $x \in s \Leftrightarrow x \in S_s$. Over the sequences $s$ and $s'$ we define $s \cup s' = \{a \in s\} \cup \{a \in s'\}$.
Definition 3 (Directed Graph). A directed graph $G = (V, E)$ is a set of vertices $V$ and a set of directed edges $E \subseteq V \times V$. We denote with $\mathcal{U}_G$ the universe of such directed graphs.
Definition 4 (Bridge). An edge $e \in E$ is called a bridge if and only if the graph becomes disconnected if $e$ is removed: there exists a partition of $V$ into $V'$ and $V''$ such that $E \cap ((V' \times V'') \cup (V'' \times V')) = \{e\}$. We denote with $E_B \subseteq E$ the set of all such bridges over the graph $G = (V, E)$.
Definition 5 (Path). A path over a graph $G = (V, E)$ is a sequence of vertices $p = \langle v_1, v_2, \dots, v_n \rangle$ with $v_1, \dots, v_n \in V$ and $\forall_{1 \le i \le n-1}\, (v_i, v_{i+1}) \in E$. $P_G(v, w)$ denotes the set of all paths connecting $v$ and $w$ in $G$. A vertex $w \in V$ is reachable from $v \in V$ if there is at least one path connecting them: $|P_G(v, w)| > 0$.
Definition 6 (Transitive Reduction). A transitive reduction of a graph $G = (V, E)$ is a graph $\rho(G) = (V, E')$ with the same reachability between vertices and a minimal number of edges. $E' \subseteq E$ is a smallest set of edges such that $|P_{\rho(G)}(v, w)| > 0 \implies |P_G(v, w)| > 0$ for any $v, w \in V$.
In this paper, we consider uncertain event logs. These event logs contain uncertainty information explicitly associated with event data. A taxonomy of different kinds of uncertainty and uncertain event logs has been presented in [9], which distinguishes between two main classes of uncertainty. Weak uncertainty provides a probability distribution over a set of possible values, while strong uncertainty only provides the possible values for the corresponding attribute.
We will use the notion of simple uncertainty, which includes strong uncertainty on the control-flow perspective: activities, timestamps, and indeterminate events. An example of a simple uncertain trace is shown in Table 1. Event $e_1$ has been recorded with two possible activity labels ($a$ or $c$), an example of strong uncertainty on activities. Some events, e.g. $e_2$, do not have a precise timestamp; instead, a time interval in which the event could have happened has been recorded. In some cases, this causes the loss of the precise order of events (e.g. $e_1$ and $e_2$). These are examples of strong uncertainty on timestamps. As shown by the “?” symbol, $e_3$ is an indeterminate event: it has been recorded, but it is not guaranteed to have happened.
Definition 7 (Universes). Let $\mathcal{U}_E$ be the set of all the event identifiers. Let $\mathcal{U}_C$ be the set of all case ID identifiers. Let $\mathcal{U}_A$ be the set of all the activity identifiers. Let $\mathcal{U}_T$ be the totally ordered set of all the timestamp identifiers. Let $\mathcal{U}_O = \{!, ?\}$, where the “!” symbol denotes determinate events, and the “?” symbol denotes indeterminate events.

Table 1: An example of simple uncertain trace.
Case ID   Event ID   Activity   Timestamp                                Event Type
354       $e_1$      {a, c}     [2011-12-02T00:00, 2011-12-05T00:00]     !
354       $e_2$      {a, d}     [2011-12-03T00:00, 2011-12-05T00:00]     !
354       $e_3$      {a, b}     2011-12-07T00:00                         ?
354       $e_4$      {a, b}     [2011-12-09T00:00, 2011-12-15T00:00]     !
354       $e_5$      {b, c}     [2011-12-11T00:00, 2011-12-17T00:00]     !
354       $e_6$      {b}        2011-12-20T00:00                         !
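Definition 8 (Simple uncertain traces and logs). $\sigma \in \mathcal{P}_{NE}(\mathcal{U}_E \times \mathcal{P}_{NE}(\mathcal{U}_A) \times \mathcal{U}_T \times \mathcal{U}_T \times \mathcal{U}_O)$ is a simple uncertain trace if for any $(e_i, A, t_{min}, t_{max}, o) \in \sigma$, $t_{min} < t_{max}$ and all the event identifiers are unique. $\mathcal{T}_U$ denotes the universe of simple uncertain traces. $L \in \mathcal{P}(\mathcal{T}_U)$ is a simple uncertain log if all the event identifiers in the log are unique. Over the uncertain event $e = (e_i, A, t_{min}, t_{max}, o) \in \sigma$ we define the following projection functions: $\pi_A(e) = A$, $\pi_{t_{min}}(e) = t_{min}$, $\pi_{t_{max}}(e) = t_{max}$ and $\pi_o(e) = o$. Over $L \in \mathcal{P}(\mathcal{T}_U)$ we define the following projection function: $\Pi_A(L) = \bigcup_{\sigma \in L} \bigcup_{e \in \sigma} \pi_A(e)$.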
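To make Definition 8 concrete, the trace of Table 1 can be encoded as a small data structure. This is a sketch only: the class name `UncertainEvent`, its field names and the helper `day` are illustrative choices, and an event with a precise timestamp is stored here with equal interval endpoints.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UncertainEvent:
    event_id: str
    activities: frozenset      # pi_A(e): non-empty set of possible labels
    t_min: datetime            # pi_tmin(e)
    t_max: datetime            # pi_tmax(e)
    determinate: bool = True   # pi_o(e): True for "!", False for "?"


def day(d):
    """Midnight of a day in December 2011, as used in Table 1."""
    return datetime(2011, 12, d)


TABLE_1 = [
    UncertainEvent("e1", frozenset("ac"), day(2), day(5)),
    UncertainEvent("e2", frozenset("ad"), day(3), day(5)),
    UncertainEvent("e3", frozenset("ab"), day(7), day(7), determinate=False),
    UncertainEvent("e4", frozenset("ab"), day(9), day(15)),
    UncertainEvent("e5", frozenset("bc"), day(11), day(17)),
    UncertainEvent("e6", frozenset("b"), day(20), day(20)),
]


def log_activities(log):
    """Pi_A(L): union of the possible labels over all events of all traces."""
    return set().union(*(e.activities for trace in log for e in trace))
```

For the log containing only this trace, `log_activities([TABLE_1])` yields the four labels a, b, c and d.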
The behavior graph is a structure that summarizes information regarding the uncertainty contained in a trace. Namely, two vertices are linked by an edge if their corresponding events may have happened one immediately after the other.
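Definition 9 (Behavior Graph). Let $\sigma \in \mathcal{T}_U$ be a simple uncertain trace. The behavior graph $\beta: \mathcal{T}_U \to \mathcal{U}_G$ maps $\sigma$ to the transitive reduction $\beta(\sigma) = \rho(G)$ of a directed graph $G = (V, E) \in \mathcal{U}_G$, defined as:
$V = \{e \in \sigma\}$
$E = \{(v, w) \mid v, w \in V \wedge \pi_{t_{max}}(v) < \pi_{t_{min}}(w)\}$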
Notice that the behavior graph is obtained from the transitive reduction of an acyclic graph, and thus is unique. The behavior graph for the trace in Table 1 is shown in Figure 1.
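The construction of the behavior graph can be sketched in a few lines. The snippet below is an illustration under simplifying assumptions: timestamps are reduced to the day of the month from Table 1, and the names `INTERVALS` and `behavior_graph` are hypothetical.

```python
from itertools import product

# Table 1 timestamps as day of the month in December 2011: id -> (t_min, t_max).
INTERVALS = {"e1": (2, 5), "e2": (3, 5), "e3": (7, 7),
             "e4": (9, 15), "e5": (11, 17), "e6": (20, 20)}


def behavior_graph(intervals):
    """Edges of the transitive reduction of G, with (v, w) in G iff t_max(v) < t_min(w)."""
    full = {(v, w) for v, w in product(intervals, repeat=2)
            if intervals[v][1] < intervals[w][0]}
    # G is transitively closed, because it stems from a strict order on intervals.
    # An edge is therefore redundant exactly when some vertex u lies between v and w.
    return {(v, w) for (v, w) in full
            if not any((v, u) in full and (u, w) in full for u in intervals)}


EDGES = behavior_graph(INTERVALS)
```

On the trace of Table 1 this yields six edges: $e_1$ and $e_2$ both precede $e_3$, which precedes the parallel events $e_4$ and $e_5$, which in turn both precede $e_6$.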
Figure 1: The behavior graph of the uncertain trace given in Table 1. Each vertex represents an uncertain event and is labeled with the possible activity labels of the event. The dotted circle represents an indeterminate event (may or may not have happened).
4 Uncertain DFGs
The definitions shown in Section 3 allow us to introduce some fundamental concepts necessary to perform discovery in an uncertain setting. Let us define a measure for the frequencies of single activities. In an event log without uncertainty, the frequency of an activity is the number of events that have the corresponding activity label. In the uncertain case, there are events that can have multiple possible activity labels. For a certain activity $a \in \mathcal{U}_A$, the minimum activity frequency of $a$ is the number of events that certainly have $a$ as activity label and certainly happened; the maximum activity frequency is the number of events that may have $a$ as activity label.
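Definition 10 (Minimum and maximum activity frequency). The minimum and maximum activity frequency $\#_{min}: \mathcal{T}_U \times \mathcal{U}_A \to \mathbb{N}$ and $\#_{max}: \mathcal{T}_U \times \mathcal{U}_A \to \mathbb{N}$ of an activity $a \in \mathcal{U}_A$ in regard of an uncertain trace $\sigma \in \mathcal{T}_U$ are defined as:
$\#_{min}(\sigma, a) = |\{e \in \sigma \mid \pi_A(e) = \{a\} \wedge \pi_o(e) = {!}\}|$
$\#_{max}(\sigma, a) = |\{e \in \sigma \mid a \in \pi_A(e)\}|$.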
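As an illustration of Definition 10, the two frequencies can be computed directly on the trace of Table 1. The encoding of events as (label set, determinate flag) pairs and the function names are assumptions made for this sketch.

```python
# Table 1, one entry per event: (possible labels, determinate?).
TRACE = [({"a", "c"}, True), ({"a", "d"}, True), ({"a", "b"}, False),
         ({"a", "b"}, True), ({"b", "c"}, True), ({"b"}, True)]


def min_activity_frequency(trace, a):
    """Events that certainly carry label a and certainly happened."""
    return sum(1 for labels, determinate in trace if labels == {a} and determinate)


def max_activity_frequency(trace, a):
    """Events that may carry label a."""
    return sum(1 for labels, _ in trace if a in labels)
```

For this trace, activity $a$ has minimum frequency 0 and maximum frequency 4, while $b$ has minimum frequency 1 (only $e_6$ is certainly a $b$) and maximum frequency 4.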
Many discovery algorithms exploit the concept of directly-follows relationship [2,6]. In this paper, we extend this notion to uncertain traces and uncertain event logs. An uncertain trace embeds some behavior which depends on the instantiation of the stochastic variables contained in the event attributes. Some directly-follows relationships exist in part, but not all, of the possible behavior of an uncertain trace. As an example, consider events $e_3$ and $e_5$ in the uncertain trace shown in Table 1: the relationship “$a$ is directly followed by $b$” appears once only if $e_3$ actually happened immediately before $e_5$ (i.e., $e_4$ did not happen in-between), and if the activity label of $e_3$ is $a$ and that of $e_5$ is $b$ (as opposed to $c$, the other possible label of $e_5$). In all the behavior that does not satisfy these conditions, the directly-follows relation does not appear on $e_3$ and $e_5$.
Let us define as realizations all the possible certain traces that are obtainable by choosing a value among all possible ones for each uncertain attribute of the uncertain trace. For example, some possible realizations of the trace in Table 1 are $\langle a, d, b, a, c, b \rangle$, $\langle a, a, a, a, b, b \rangle$, and $\langle c, a, c, b, b \rangle$. We can express the strength of the directly-follows relationship between two activities in an uncertain trace by counting the minimum and maximum number of times the relationship can appear in one of the possible realizations of that trace. To this end, we exploit some structural properties of the behavior graph in order to obtain the minimum and maximum frequency of directly-follows relationships in a simpler manner.
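For a trace as small as the one in Table 1, the realizations can also be enumerated by brute force. The sketch below is illustrative and assumes that any ordering is admissible as long as no event is placed after another event whose interval lies entirely later; timestamps are again reduced to days of the month, and all names are hypothetical.

```python
from itertools import permutations, product

# Table 1: id -> (possible labels, t_min, t_max, determinate); days of December 2011.
EVENTS = {
    "e1": ("ac", 2, 5, True),
    "e2": ("ad", 3, 5, True),
    "e3": ("ab", 7, 7, False),
    "e4": ("ab", 9, 15, True),
    "e5": ("bc", 11, 17, True),
    "e6": ("b", 20, 20, True),
}


def admissible(order, events):
    """No later event may lie entirely before an earlier one."""
    return not any(events[order[j]][2] < events[order[i]][1]
                   for i in range(len(order)) for j in range(i + 1, len(order)))


def realizations(events):
    """All certain traces obtainable from the uncertain trace, as tuples of labels."""
    optional = [i for i in events if not events[i][3]]
    found = set()
    # Choose which indeterminate events actually happened.
    for keep in product([True, False], repeat=len(optional)):
        dropped = {i for i, k in zip(optional, keep) if not k}
        present = [i for i in events if i not in dropped]
        for order in permutations(present):
            if admissible(order, events):
                # Choose one label per event.
                found.update(product(*(events[i][0] for i in order)))
    return found


REALIZATIONS = realizations(EVENTS)
```

The three realizations quoted above are all contained in `REALIZATIONS`. The enumeration grows factorially with the number of events, which is one reason to work on the behavior graph instead.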
A useful property to compute the minimum number of occurrences between two activities exploits the fact that parallel behavior is represented by the branching of arcs in the graph. Two connected determinate events have happened one immediately after the other if the graph does not have any other parallel path: if two determinate events are connected by a bridge, they will certainly happen in succession. This property is used to define a strong sequential relationship.
The next property accounts for the fact that, by construction, uncertain events corresponding to nodes in the graph not connected by a path can happen in any order. This follows directly from the definition of the edges in the graph, together with the transitivity of $\mathcal{U}_T$ (which is a totally ordered set). This means that two disconnected nodes $v$ and $w$ may account for one occurrence of the relation “$\pi_A(v)$ is directly followed by $\pi_A(w)$”. Conversely, if $w$ is reachable from $v$, the directly-follows relationship may be observed if all the events separating $v$ from $w$ are indeterminate (i.e., there is a chance that no event will interpose between the ones in $v$ and $w$). This happens for vertices $e_2$ and $e_4$ in the graph in Figure 1, which are connected by a path and separated only by vertex $e_3$, which is indeterminate. This property is useful to compute the maximum number of directly-follows relationships between two activities, leading to the notion of weak sequential relationship.
Definition 11 (Strong sequential relationship). Given a behavior graph $\beta = (V, E)$ and two vertices $v, w \in V$, $v$ is in a strong sequential relationship with $w$ (denoted by $v \blacktriangleright_\beta w$) if and only if $\pi_o(v) = {!}$ and $\pi_o(w) = {!}$ ($v$ and $w$ are both determinate) and there is a bridge between them: $(v, w) \in E_B$.
Definition 12 (Weak sequential relationship). Given a behavior graph $\beta = (V, E)$ and two vertices $v, w \in V$, $v$ is in a weak sequential relationship with $w$ (denoted by $v \triangleright_\beta w$) if and only if $|P_\beta(w, v)| = 0$ ($v$ is unreachable from $w$) and no node in any possible path between $v$ and $w$, excluding $v$ and $w$, is determinate: $\bigcup_{p \in P_\beta(v, w)} \{e \in p \mid \pi_o(e) = {!}\} \setminus \{v, w\} = \emptyset$.
Notice that if $v$ and $w$ are mutually unreachable they are also in a mutual weak sequential relationship. Given two activity labels, these properties allow us to extract sets of candidate pairs of vertices of the behavior graph.
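Both relationships can be checked mechanically on the behavior graph of Figure 1. The following sketch assumes a connected, acyclic behavior graph and treats the graph as undirected when testing for bridges; the function names are illustrative.

```python
# Behavior graph of Table 1 (Figure 1) and the determinate flag of each event.
EDGES = {("e1", "e3"), ("e2", "e3"), ("e3", "e4"),
         ("e3", "e5"), ("e4", "e6"), ("e5", "e6")}
DETERMINATE = {"e1": True, "e2": True, "e3": False,
               "e4": True, "e5": True, "e6": True}


def paths(edges, v, w):
    """All paths from v to w in an acyclic graph, as lists of vertices."""
    if v == w:
        return [[v]]
    return [[v] + rest for (x, y) in edges if x == v for rest in paths(edges, y, w)]


def weak(edges, determinate, v, w):
    """Definition 12: v is not reachable from w, and no inner path node is determinate."""
    if paths(edges, w, v):  # also rules out v == w
        return False
    inner = {u for p in paths(edges, v, w) for u in p[1:-1]}
    return not any(determinate[u] for u in inner)


def is_bridge(edges, e):
    """True if removing edge e disconnects the undirected view of the graph."""
    rest = edges - {e}
    vertices = {x for edge in edges for x in edge}
    seen, stack = set(), [e[0]]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack += [y for x, y in rest if x == u] + [x for x, y in rest if y == u]
    return seen != vertices


def strong(edges, determinate, v, w):
    """Definition 11: both events are determinate and joined by a bridge."""
    return determinate[v] and determinate[w] and (v, w) in edges and is_bridge(edges, (v, w))
```

Here `weak(EDGES, DETERMINATE, "e2", "e4")` holds because the only node between $e_2$ and $e_4$ is the indeterminate $e_3$. The two bridges of the graph, $(e_1, e_3)$ and $(e_2, e_3)$, both involve $e_3$, so no pair is in a strong sequential relationship.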
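Definition 13 (Candidates for minimum and maximum directly-follows frequencies). Given two activities $a, b \in \mathcal{U}_A$, an uncertain trace $\sigma \in \mathcal{T}_U$ and the corresponding behavior graph $\beta(\sigma) = (V, E)$, the candidates for minimum and maximum directly-follows frequency $cand_{min}: \mathcal{T}_U \times \mathcal{U}_A \times \mathcal{U}_A \to \mathcal{P}(V \times V)$ and $cand_{max}: \mathcal{T}_U \times \mathcal{U}_A \times \mathcal{U}_A \to \mathcal{P}(V \times V)$ are defined as:
$cand_{min}(\sigma, a, b) = \{(v, w) \in V \times V \mid v \neq w \wedge \pi_A(v) = \{a\} \wedge \pi_A(w) = \{b\} \wedge v \blacktriangleright_\beta w\}$
$cand_{max}(\sigma, a, b) = \{(v, w) \in V \times V \mid v \neq w \wedge a \in \pi_A(v) \wedge b \in \pi_A(w) \wedge v \triangleright_\beta w\}$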
After obtaining the sets of candidates, it is necessary to select a subset of pairs of vertices such that there are no repetitions. In a realization of an uncertain trace, an event $e$ can only have one successor: if multiple vertices of the behavior graph correspond to events that can succeed $e$, only one can be selected.
Consider the behavior graph in Figure 1. If we search candidates for “$a$ is directly followed by $b$”, we find $cand_{max}(\sigma, a, b) = \{(e_1, e_3), (e_2, e_3), (e_1, e_5), (e_2, e_4), (e_3, e_4), (e_3, e_5), (e_4, e_6)\}$. However, there are no realizations of the trace represented by the behavior graph that contain all the candidates; this is because some vertices appear in multiple candidates. A possible realization with the highest frequency of $a \rightarrow b$ is $\langle d, a, b, c, a, b \rangle$. Conversely, consider “$a$ is directly followed by $a$”. When the same activity appears on both sides of the relationship, an event can be part of two different occurrences, as first member and second member; e.g., in the trace $\langle a, a, a \rangle$, the relationship $a \rightarrow a$ occurs two times, and the second event is part of both occurrences. In the behavior graph of Figure 1, the relation $a \rightarrow b$ cannot be supported by candidates $(e_1, e_3)$ and $(e_3, e_4)$ at the same time, because $e_3$ has either label $a$ or $b$ in a realization. But $(e_1, e_3)$ and $(e_3, e_4)$ can both support the relationship $a \rightarrow a$, in realizations where $e_1$, $e_3$ and $e_4$ all have label $a$.
When counting the frequencies of directly-follows relationships between the activities $a$ and $b$, every node of the behavior graph can appear at most once if $a \neq b$. If $a = b$, every node can appear once on each side of the relationship.
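The counting rule above amounts to picking a largest set of mutually compatible candidate pairs. The sketch below does this by exhaustive search, which is only practical for small candidate sets; the function names and the brute-force strategy are illustrative choices, not the method used in the implementation of this paper.

```python
from itertools import combinations


def compatible(p, q, same_activity):
    """Whether two distinct candidate pairs may be counted together."""
    (v, w), (v2, w2) = p, q
    if same_activity:                # a = b: a node may appear once on each side
        return v != v2 and w != w2
    return not ({v, w} & {v2, w2})   # a != b: a node may appear at most once


def largest_selection(candidates, same_activity):
    """A largest subset of candidates that are pairwise compatible."""
    candidates = list(set(candidates))
    for size in range(len(candidates), 0, -1):
        for subset in combinations(candidates, size):
            if all(compatible(p, q, same_activity) for p, q in combinations(subset, 2)):
                return set(subset)
    return set()
```

For the candidates $(e_1, e_3)$ and $(e_3, e_4)$ discussed above, the selection has size 2 when the relationship is $a \rightarrow a$ and size 1 when the two activities differ, since $e_3$ would otherwise be used twice.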
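Definition 14 (Minimum directly-follows frequency). Given $a, b \in \mathcal{U}_A$ and $\sigma \in \mathcal{T}_U$, let $R_{min} \subseteq cand_{min}(\sigma, a, b)$ be a largest set such that for any $(v, w), (v', w') \in R_{min}$, it holds:
$(v, w) \neq (v', w') \implies \{v, w\} \cap \{v', w'\} = \emptyset$, if $a \neq b$
$(v, w) \neq (v', w') \implies v \neq v' \wedge w \neq w'$, if $a = b$
The minimum directly-follows frequency $\rightsquigarrow_{min}: \mathcal{T}_U \times \mathcal{U}_A^2 \to \mathbb{N}$ of two activities $a, b \in \mathcal{U}_A$ in regard of an uncertain trace $\sigma \in \mathcal{T}_U$ is defined as $\rightsquigarrow_{min}(\sigma, a, b) = |R_{min}|$.
Definition 15 (Maximum directly-follows frequency). Given $a, b \in \mathcal{U}_A$ and $\sigma \in \mathcal{T}_U$, let $R_{max} \subseteq cand_{max}(\sigma, a, b)$ be a largest set such that for any $(v, w), (v', w') \in R_{max}$, it holds:
$(v, w) \neq (v', w') \implies \{v, w\} \cap \{v', w'\} = \emptyset$, if $a \neq b$
$(v, w) \neq (v', w') \implies v \neq v' \wedge w \neq w'$, if $a = b$
The maximum directly-follows frequency $\rightsquigarrow_{max}: \mathcal{T}_U \times \mathcal{U}_A^2 \to \mathbb{N}$ of two activities $a, b \in \mathcal{U}_A$ in regard of an uncertain trace $\sigma \in \mathcal{T}_U$ is defined as $\rightsquigarrow_{max}(\sigma, a, b) = |R_{max}|$.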
For the uncertain trace in Table 1, $\rightsquigarrow_{min}(\sigma, a, b) = 0$, because $R_{min} = \emptyset$; conversely, $\rightsquigarrow_{max}(\sigma, a, b) = 2$, because a maximal set of candidates is $R_{max} = \{(e_1, e_3), (e_4, e_6)\}$. Notice that maximal candidate sets are not necessarily unique: $R_{max} = \{(e_2, e_3), (e_4, e_6)\}$ is also a valid one.
The operator $\rightsquigarrow$ synthesizes information regarding the strength of the directly-follows relation between two activities in an event log where some events are uncertain. The relative difference between the min and max counts is a measure of how certain the relationship is when it appears in the event log. Notice that, in the case where no uncertainty is contained in the event log, $\rightsquigarrow_{min}$ and $\rightsquigarrow_{max}$ will coincide, and will both contain a directly-follows count for two activities.
An Uncertain DFG (UDFG) is a graph representation of the activity frequencies and the directly-follows frequencies; using the measures we defined, we exclude the activities and the directly-follows relations that never happened.
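Definition 16 (Uncertain Directly-Follows Graph (UDFG)). Given an event log $L \in \mathcal{P}(\mathcal{T}_U)$, the Uncertain Directly-Follows Graph $DFG_U(L)$ is a directed graph $G = (V, E)$ where:
$V = \{a \in \Pi_A(L) \mid \sum_{\sigma \in L} \#_{max}(\sigma, a) > 0\}$
$E = \{(a, b) \in V \times V \mid \sum_{\sigma \in L} \rightsquigarrow_{max}(\sigma, a, b) > 0\}$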
The UDFG is a low-abstraction model that, together with the data decorating vertices and arcs, gives indications on the overall uncertainty affecting activities and directly-follows relationships. Moreover, the UDFG does not filter out uncertainty: the information about the uncertain portion of a process is summarized by the data labeling vertices and edges. In addition to the elimination of the anomalies in an event log in order to identify the happy path of a process, this allows the process miner to isolate the uncertain part of a process, in order to study its features and analyze its causes. In essence, however, this model has the same weak points as the classic DFG: it does not support concurrency, and if many activities happen in different order the DFG creates numerous loops that cause underfitting.
5 Inductive Mining Using Directly-Follows Frequencies
A popular process mining algorithm for discovering executable models from DFGs is the Inductive Miner [6]. A variant presented by Leemans et al. [7], the Inductive Miner–directly-follows (IMD), has the peculiar feature of preprocessing an event log to obtain a DFG, and then discovering a process tree exclusively from the graph, which can then be converted to a Petri net. This implies a high scalability of the algorithm, which has a linear computational cost over the number of events in the log, but it also makes it suited to the case at hand in this paper. To allow for inductive mining, and subsequent representation of the process as a Petri net, we introduce a form of filtering called UDFG slicing, based on four filtering parameters: $act_{min}$, $act_{max}$, $rel_{min}$ and $rel_{max}$. The parameters $act_{min}$ and $act_{max}$ make it possible to filter on nodes of the UDFG, based on how certain the corresponding activity is in the log. Conversely, $rel_{min}$ and $rel_{max}$ make it possible to filter on edges of the UDFG, based on how certain the corresponding directly-follows relationship is in the log.
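Definition 17 (Uncertain DFG slice). Given an uncertain event log $L \in \mathcal{P}(\mathcal{T}_U)$, its uncertain directly-follows graph $DFG_U(L) = (V', E')$, and $act_{min}, act_{max}, rel_{min}, rel_{max} \in [0, 1]$, an uncertain directly-follows slice is a function $DFG_U: L \to \mathcal{U}_G$ where $DFG_U(L, act_{min}, act_{max}, rel_{min}, rel_{max}) = (V, E)$ with:
$V = \{a \in V' \mid act_{min} \le \frac{\sum_{\sigma \in L} \#_{min}(\sigma, a)}{\sum_{\sigma \in L} \#_{max}(\sigma, a)} \le act_{max}\}$
$E = \{(a, b) \in E' \mid rel_{min} \le \frac{\sum_{\sigma \in L} \rightsquigarrow_{min}(\sigma, a, b)}{\sum_{\sigma \in L} \rightsquigarrow_{max}(\sigma, a, b)} \le rel_{max}\}$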
A UDFG slice is an unweighted directed graph which represents a filtering performed over vertices and edges of the UDFG. This graph can then be processed by the IMD.
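The slicing of Definition 17 reduces to two ratio filters once the frequencies have been summed over the log. The sketch below assumes these sums are already available as dictionaries; the function name `udfg_slice` and the toy frequencies in the usage note are illustrative, not taken from the experiments.

```python
def udfg_slice(act_freq, rel_freq, act_min=0.0, act_max=1.0, rel_min=0.0, rel_max=1.0):
    """Filter UDFG vertices and edges by the ratio of minimum to maximum frequency.

    act_freq maps an activity to (sum of #min, sum of #max) over the log;
    rel_freq maps a pair (a, b) to (sum of min, sum of max) directly-follows frequency.
    """
    # Entries with a maximum frequency of 0 are not part of the UDFG (Definition 16).
    vertices = {a for a, (lo, hi) in act_freq.items()
                if hi > 0 and act_min <= lo / hi <= act_max}
    # As in Definition 17, edges are filtered independently of the vertex filter.
    edges = {(a, b) for (a, b), (lo, hi) in rel_freq.items()
             if hi > 0 and rel_min <= lo / hi <= rel_max}
    return vertices, edges
```

With `act_freq = {"a": (100, 100), "b": (80, 100), "c": (0, 20)}` and `act_min=0.6`, only `a` and `b` are retained. A caller that needs a well-formed graph may additionally drop edges whose endpoints were filtered out; that step is not part of Definition 17.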
Definition 18 (Uncertain Inductive Miner–directly-follows (UIMD)). Given an uncertain event log $L \in \mathcal{P}(\mathcal{T}_U)$ and $act_{min}, act_{max}, rel_{min}, rel_{max} \in [0, 1]$, the Uncertain Inductive Miner–directly-follows (UIMD) returns the process tree obtained by IMD over an uncertain DFG slice: $IMD(DFG_U(L, act_{min}, act_{max}, rel_{min}, rel_{max}))$.
The filtering parameters $act_{min}$, $act_{max}$, $rel_{min}$, $rel_{max}$ make it possible to isolate the desired type of behavior of the process. In fact, $act_{min} = rel_{min} = 0$ and $act_{max} = rel_{max} = 1$ retain all possible behavior of the process, which is then represented in the model: both the behavior deriving from the process itself and the behavior deriving from the uncertain traces. Higher values of $act_{min}$ and $rel_{min}$ filter out uncertain behavior and retain only the parts of the process observed in certain events. Vice versa, lowering $act_{max}$ and $rel_{max}$ makes it possible to observe only the uncertain part of an event log.
Figure 2: UIMD on the test log with $act_{min} = 0$, $act_{max} = 1$, $rel_{min} = 0$, $rel_{max} = 1$.
Figure 3: UIMD on the test log with $act_{min} = 0.6$, $act_{max} = 1$, $rel_{min} = 0$, $rel_{max} = 1$.
6 Experiments
The approach described here has been implemented using the Python process mining framework PM4Py [3]. The models obtained through the Uncertain Inductive Miner–directly-follows cannot be evaluated with commonly used metrics in process mining, since the metrics in use are not applicable to uncertain event data; nor do other approaches for performing discovery over uncertain data exist. This preliminary evaluation of the algorithm will, therefore, not be based on measurements; it will show the effect of the UIMD with different settings on an uncertain event log.
Let us introduce a simplified notation for uncertain event logs. In a trace, we represent an uncertain event with multiple possible activity labels by listing the labels between curly braces. When two events have overlapping timestamps, we represent their activity labels between square brackets, and we represent the indeterminate events by overlining them. For example, the trace $\langle \overline{a}, \{b, c\}, [d, e] \rangle$ is a trace containing 4 events, of which the first is an indeterminate event with label $a$, the second is an uncertain event that can have either $b$ or $c$ as activity label, and the last two events have a range as timestamp (and the two ranges overlap). The simplified representation of the trace in Table 1 is $\langle [\{a, c\}, \{a, d\}], \overline{\{a, b\}}, [\{a, b\}, \{b, c\}], b \rangle$. Let us observe the effect of the UIMD on the following test log:
$\langle a, b, e, f, g, h \rangle^{80}$, $\langle a, [\{b, c\}, e], \overline{f}, g, h, i \rangle^{15}$, $\langle a, [\{b, c, d\}, e], \overline{f}, g, h, j \rangle^{5}$.
In Figure 2, we can see the model obtained without any filtering: it represents all the possible behavior in the uncertain log. The models in Figures 3 and 4 show the effect of filtering on the minimum number of times an activity appears in the log: in Figure 3 activities $c$ and $d$ are filtered out, while the model in Figure 4 only retains the activities which never appear in an uncertain event (i.e., the activities for which $\#_{min}$ is at least 90% of $\#_{max}$).
Figure 4: UIMD on the test log with $act_{min} = 0.9$, $act_{max} = 1$, $rel_{min} = 0$, $rel_{max} = 1$.
Figure 5: UIMD on the test log with $act_{min} = 0$, $act_{max} = 1$, $rel_{min} = 0.7$, $rel_{max} = 1$.
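The activity filtering described above can be retraced numerically on the test log. The sketch below assumes that $f$ is the overlined (indeterminate) event in the second and third trace variants, and it ignores timestamps, which do not influence activity frequencies; the helper names are illustrative.

```python
from collections import Counter


def ev(labels, determinate=True):
    """One event: its possible labels and whether it certainly happened."""
    return (frozenset(labels), determinate)


# The three trace variants of the test log with their multiplicities.
TEST_LOG = [
    ([ev("a"), ev("b"), ev("e"), ev("f"), ev("g"), ev("h")], 80),
    ([ev("a"), ev("bc"), ev("e"), ev("f", False), ev("g"), ev("h"), ev("i")], 15),
    ([ev("a"), ev("bcd"), ev("e"), ev("f", False), ev("g"), ev("h"), ev("j")], 5),
]


def activity_ratios(log):
    """Ratio of summed minimum to summed maximum activity frequency per activity."""
    lo, hi = Counter(), Counter()
    for trace, count in log:
        for labels, determinate in trace:
            for a in labels:
                hi[a] += count
            if len(labels) == 1 and determinate:
                lo[next(iter(labels))] += count
    return {a: lo[a] / hi[a] for a in hi}


RATIOS = activity_ratios(TEST_LOG)
KEPT_06 = {a for a, r in RATIOS.items() if r >= 0.6}   # act_min = 0.6
KEPT_09 = {a for a, r in RATIOS.items() if r >= 0.9}   # act_min = 0.9
```

Under these assumptions $b$ and $f$ have ratio 0.8, $c$ and $d$ have ratio 0, and all other activities have ratio 1. A threshold of 0.6 therefore removes only $c$ and $d$, while a threshold of 0.9 also removes $b$ and $f$.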
Filtering on $rel_{min}$ has a similar effect, although it retains the most certain relationships, rather than activities, as shown in Figure 5. An even more aggressive filtering of $rel_{min}$, as shown in Figure 6, makes it possible to represent only the parts of the process which are never subjected to uncertainty by being in a directly-follows relationship that has a low $\rightsquigarrow_{min}$ value.
The UIMD also makes it possible to do the opposite: hide certain behavior and highlight the uncertain behavior. Figure 7 shows a model that only displays the behavior which is part of uncertain attributes, while activities $h$, $i$ and $j$, which are never part of uncertain behavior, have not been represented. Notice that $g$ is represented even though it always appeared as a certain event; this is due to the fact that the filtering is based on relationships, and $g$ is in a directly-follows relationship with the indeterminate event $f$.
7 Conclusion
In this explorative work, we present the foundations for performing process discovery over uncertain event data. We present a method that is effective in representing a process containing uncertainty by exploiting the information in an uncertain event log to synthesize an uncertain model. The UDFG is a formal description of uncertainty, rather than a method to eliminate uncertainty to observe the underlying process. This makes it possible to study uncertainty in isolation, possibly allowing us to determine which effects it has on the process in terms of behavior, as well as what the causes of its appearance are. We also present a method to filter the UDFG, obtaining a graph that represents a specific perspective of the uncertainty in the process; this can then be transformed into a model that is able to express concurrency using the UIMD algorithm.
Figure 6: UIMD on the test log with $act_{min} = 0$, $act_{max} = 1$, $rel_{min} = 0.9$, $rel_{max} = 1$.
Figure 7: UIMD on the test log with $act_{min} = 0$, $act_{max} = 1$, $rel_{min} = 0$, $rel_{max} = 0.8$.
This approach has a number of limitations that will need to be addressed in future work. An important research direction is the formal definition of metrics and measures over uncertain event logs and process models, in order to allow for a quantitative evaluation of the quality of this discovery algorithm, as well as other process mining methods over uncertain logs. Another line of research can be the extension to weakly uncertain event data (i.e., including probabilities) and the extension to event logs also containing uncertainty related to case IDs.
Acknowledgements
We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research in-
[1] van der Aalst, Wil M. P. Process Mining - Data Science in Action, Second Edition.
Springer, 2016. isbn: 978-3-662-49850-7. doi:10.1007/978-3-662-49851-
[2] van der Aalst, Wil M. P., Ton Weijters, and Laura Maruster. “Workflow Mining:
Discovering Process Models from Event Logs”. In: IEEE Transactions on Knowledge
and Data Engineering 16.9 (2004), pp. 1128–1142. doi:10.1109/TKDE.
[3] Berti, Alessandro, Sebastiaan J. van Zelst, and Wil M. P. van der Aalst. “Process
Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science”.
In: ICPM Demo Track (CEUR 2374). 2019, pp. 13–16. url:http://
[4] Conforti, Rafaele, Marcello La Rosa, and Arthur H. M. ter Hofstede. “Filtering
Out Infrequent Behavior from Business Process Event Logs”. In: IEEE Transac-
tions on Knowledge and Data Engineering 29.2 (2017), pp. 300–314. doi:10 .
[5] Hornik, Kurt, Bettina Gr ¨
un, and Michael Hahsler. “arules – A computational en-
vironment for mining association rules and frequent item sets”. In: Journal of Sta-
tistical Soware 14.15 (2005), pp. 1–25. doi:10.18637/jss.v014.i15.
[6] Leemans, Sander J. J., Dirk Fahland, and Wil M. P. van der Aalst. “Discovering
Block-Structured Process Models from Event Logs - A Constructive Approach”.
In: Application and Theory of Petri Nets and Concurrency - 34th International
Conference, PETRI NETS 2013, Milan, Italy, June 24-28, 2013. Proceedings. Ed.
by Colom, Jos´
e Manuel and J¨
org Desel. Vol. 7927. Lecture Notes in Computer
Science. Springer, 2013, pp. 311–329. doi:10.1007/978-3-642-38697-8_17.
[7] Leemans, Sander J. J., Dirk Fahland, and Wil M. P. van der Aalst. “Scalable pro-
cess discovery and conformance checking”. In: Soware and Systems Modeling 17.2
(2018), pp. 599–631. doi:10.1007/s10270-016-0545-x.
[8] Lu, Xixi, Dirk Fahland, Frank J. H. M. van den Biggelaar, et al. “Detecting Devi-
ating Behaviors Without Models”. In: Business Process Management Workshops -
BPM 2015, 13th International Workshops, Innsbruck, Austria, August 31 - Septem-
ber 3, 2015, Revised Papers. Ed. by Reichert, Manfred and Hajo A. Reijers. Vol. 256.
Lecture Notes in Business Information Processing. Springer, 2015, pp. 126–139.
[9] Pegoraro, Marco and Wil M. P. van der Aalst. “Mining Uncertain Event Data in
Process Mining”. In: International Conference on Process Mining, ICPM 2019,
Aachen, Germany, June 24-26, 2019. IEEE, 2019, pp. 89–96. doi:10 . 1109 /