Probability Estimation of Uncertain
Process Trace Realizations
Marco Pegoraro 1, Bianka Bakullari 1, Merih Seran Uysal 1, and
Wil M.P. van der Aalst 1
1Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Aachen, Germany
{pegoraro, bianka.bakullari, uysal, vwdaalst}@pads.rwth-aachen.de
Abstract
Process mining is a scientic discipline that analyzes event data, ofen collected
in databases called event logs. Recently, uncertain event logs have become of in-
terest, which contain non-deterministic and stochastic event attributes that may
represent many possible real-life scenarios. In this paper, we present a method to
reliably estimate the probability of each of such scenarios, allowing their analy-
sis. Experiments show that the probabilities calculated with our method closely
match the true chances of occurrence of specic outcomes, enabling more trust-
worthy analyses on uncertain data.
Keywords: Process Mining ·Uncertain Data ·Partial Order.
Colophon
This work is licensed under a Creative Commons “Attribution-NonCommercial 4.0 International” license.
© the authors. Some rights reserved.
This document is an Author Accepted Manuscript (AAM) corresponding to the following scholarly paper:
Pegoraro, Marco et al. “Probability Estimation of Uncertain Process Trace Realizations”. In: International Workshop
on Event Data and Behavioral Analytics (EdbA). Springer, 2021
Please, cite this document as shown above.
Publication chronology:
2021-06-15: abstract submitted to the International Conference on Process Mining (ICPM) 2021, main track
2021-07-01: full text submitted to the International Conference on Process Mining (ICPM) 2021, main track
2021-08-16: notication of rejection
2021-08-17: abstract submitted to the International Workshop on EventData and Behavioral Analytics (EdbA) 2021
2021-08-20: full text submitted to the International Workshopon Event Data and Behavioral Analytics (EdbA) 2021
2021-09-16: notication of acceptance
2021-09-22: camera-ready version submitted
2021-11-01: presented
2022-03-24: proceedings published
The published version referred to above is © Springer.
Correspondence to:
Marco Pegoraro, Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Ahornstr. 55, 52074 Aachen, Germany
Website: http://mpegoraro.net/ · Email: pegoraro@pads.rwth-aachen.de · ORCID: 0000-0002-8997-7517
Content: 16 pages, 7 figures, 4 tables, 11 references. Typeset with pdfLaTeX, Biber, and BibLaTeX.
Please do not print this document unless strictly necessary.
M. Pegoraro et al. Probability Estimation of Uncertain Trace Realizations
1 Introduction
Process mining is a discipline that focuses on extracting insights about processes in a data-driven manner. For instance, on the basis of the recorded information on historical process executions, process mining makes it possible to automatically extract a model of the behavior of process instances, or to measure the compliance of the process data with a prescribed normative model of the process. In process mining, the central focus is on the event log, a collection of data that tracks past process instances. Every activity performed in a process is recorded in the event log, together with information such as the corresponding process case and the timestamp of the activity, in a sequence of events called a trace.
Recently, research on novel forms of event data has garnered the attention of the scientific community. Among these there are uncertain event logs, which contain data affected by imprecision [8]. This data contains meta-information describing the nature and extent of the uncertainty. Such meta-information can be obtained from the inherent precision with which the data has been recorded (e.g., timestamps only indicating the date have a possible “true value” range of 24 hours), from the precision of the tools involved in supporting the process (e.g., the absolute error of sensors), or from the domain knowledge provided by a process expert. An uncertain trace corresponds to multiple possible real-life scenarios, each of which might have very diverse implications on features of cases such as compliance to a model. It is then important to be able to assess the risk of occurrence of specific outcomes of uncertain traces, which makes it possible to estimate the impact of such traces on indicators such as cost and conformance.
In this paper, we present a method to obtain a complete probability distribution over the possible instantiations of uncertain attributes in a trace. As a possible example of application, we frame our results in the context of conformance checking, and show the impact of assessing probability estimates for uncertain traces on insights about the compliance of an uncertain trace to a process model. We validate our method with experiments based on a Monte Carlo simulation, which shows that the probability estimates are reliable and reflect the true chances of occurrence of a specific outcome.
The remainder of the paper is structured as follows. Section 2 examines relevant related work. Section 3 illustrates a motivating running example for our technique. Section 4 presents preliminary definitions of different types of uncertainty in process mining. Section 5 illustrates a method for computing probabilities of realizations for uncertain process traces. Section 6 validates our method through experimental results. Finally, Section 7 concludes the paper.
2 Related Work
The analysis of uncertain data in process mining is a very recent research direction. The specific formulation and definition of uncertain data utilized in this paper has been introduced in 2019 [8], in the context of an analysis approach consisting in computing bounds for the conformance score of uncertain traces through alignments [5]. Subsequently, that work has been extended with an inductive mining approach for process discovery over uncertainty [10] and a taxonomy of different types of uncertain data, with their characteristics [9].
Uncertain data, as formulated in our present and previous work, is closely related to a
considerably more studied data anomaly in process mining: partially ordered event data.
In fact, uncertain data as described here is a generalization of partially ordered traces. Lu
et al. [7] proposed a conformance checking approach based on alignments to measure
conformance of partially ordered traces. More recently, Van der Aa et al. [1] illustrated a
method for inferring a linear extension, i.e., a compliant total order, of events in partially
ordered traces, based on examples of correct orderings extracted from other traces in the
log. Busany et al. [4] estimated probabilities for partially ordered events in IoT event
streams.
An associated topic, which draws from disciplines such as pattern and sequence min-
ing and is antithetical to the analysis of partially ordered data, is the inference of partial
orders from fully sequential data as a way to model its behavior. This goes under the
name of episode mining, which can be performed with many techniques both on batched
data and with online streams of events [11,6,2].
In this paper, we present a method to estimate the likelihood of any scenario in
an uncertain setting, which covers partially ordered traces as well as other types of un-
certainty illustrated in the taxonomy [9]. Furthermore, we will cover both the non-
deterministic case (strong uncertainty) and the probabilistic case (weak uncertainty).
3 Running Example
In this section, we will provide a running example of an uncertain process instance related to a sample process. We will then apply our probability estimation method to this uncertain trace, to illustrate its operation. The example we analyze here is a simplified generalization of a remote credit card fraud investigation process. This process is visualized by the Petri net in Figure 1.
Figure 1: A Petri net model of the credit card fraud investigation process. This net allows for 10 possible traces.
Firstly, the credit card owner alerts the credit card company of a possibly fraudulent transaction. The customer may either notify the company by calling their hotline (alert hotline) or arrange an urgent meeting with personnel of the bank that issued the credit card (alert bank). In both scenarios, the customer's credit is frozen (freeze credit) to prevent further fraud. All information provided by the customer about the transaction is summarized when filing the formal report (file report). As a next step, the credit card company tries to contact the merchant that charged the credit card. If this happens (contact merchant), the credit card company clarifies whether there has been just a mistake (e.g., the merchant charging without delivering a product, or a billing mistake) on the merchant’s side. In such cases, the customer gets a refund from the merchant and the case is closed. Another outcome might be the discovery of a friendly fraud, which is when a cardholder makes a purchase and then disputes it as fraud even though it was not. If contacting the merchant is impossible, a fraud investigation is initiated. In this case, fraud investigators will usually start with the transaction data and look for timestamps, geolocation, IP addresses, and other elements that can be used to prove whether or not the cardholder was involved in the transaction. The outcome might be either friendly fraud or true fraud. True fraud can also happen when both the merchant and the cardholder are affected by the fraud. In this case, the cardholder receives a refund from the credit institute (activity refund credit institute) and the case is closed.
Note that for simplicity, we have used single letters to represent the activity labels in the Petri net transitions. Some possible traces in this process are for example: $\langle h, c, r, m, u \rangle$, $\langle b, c, r, m, f \rangle$, $\langle h, c, r, i, f \rangle$ and $\langle b, c, r, i, t, v \rangle$.
Suppose that the credit card company wants to perform conformance checking to identify deviant process instances. However, some traces in the information system of the company are affected by uncertainty, such as the one in Table 1.
Suppose that in the rst half of October 2020, the company was implementing a new
system for automatic event data generation. During this time, the event data regarding
the credit card fraud investigation process ofen had to be inserted manually by the em-
ployees. Such manual recordings were subject to inaccuracies, leading to imprecise or
missing data afecting the cases during this period. The process instance from Table 1is
one of the afected instances. Here, events e2, e3, e5, e6are uncertain. The timestamp of
event e2is not precise enough, so the possible timestamp lies between 06-10-2020 00:00
5 / 16
M. Pegoraro et al. Probability Estimation of Uncertain Trace Realizations
Table 1: Example of an uncertain case from the credit card fraud investigation process.
Case ID Event ID Activity Timestamp Ind.
5167 e1h(alert hotline) 05-10-2020 23:00
5167 e2c(freeze credit) 06-10-2020
5167 e3r(le report) U(05-10-2020 20:00,
06-10-2020 10:00)
5167 e4i(fraud investigation) 09-10-2020 10:00
5167 e5
{f: 0.3(friendly fraud),
t: 0.7(true fraud)}14-10-2020 09:00
5167 e6v(refund credit institute) 15-10-2020 10:00 ?
and 06-10-2020 23:59. Event e3has happened some time between 20:00 on October 5th
and 10:00 on October 6th. Event e5has two possible activity labels: fwith probability
0.3and twith probability 0.7. Refunding the customer (event e6) has been recorded in
the system, but the customer has not received the money yet, which is why the event is
indeterminate: this is indicated with a question mark (?) in the rightmost column, and
indicates an event that has been recorded, but for which is unclear if it actually occurred
in reality.
The credit card company is interested in understanding if and how the data in this uncertain trace conforms with the normative process model, and the extent of the actual compliance risk; they are specifically interested in knowing whether a severely non-compliant scenario is highly likely. In the remainder of the paper, we will describe a method able to estimate the probability of all possible outcome scenarios.
4 Preliminaries
Let us now present some preliminary definitions regarding uncertain event data.
Definition 1 (Uncertain attributes). Let $\mathcal{U}$ be the universe of attribute domains, and the set $D \in \mathcal{U}$ be an attribute domain. Any $D \in \mathcal{U}$ is a discrete set or a totally ordered set. A strongly uncertain attribute of domain $D$ is a subset $d_S \subseteq D$ if $D$ is a discrete set, and it is a closed interval $d_S = [d_{min}, d_{max}]$ with $d_{min} \in D$ and $d_{max} \in D$ otherwise. We denote with $S_D$ the set of all such strongly uncertain attributes of domain $D$. A weakly uncertain attribute $f_D$ of domain $D$ is a function $f_D : D \nrightarrow [0, 1]$ such that $0 < \sum_{x \in D} f_D(x) \le 1$ if $D$ is finite, $0 < \int_{-\infty}^{\infty} f_D(x)\,dx \le 1$ otherwise. We denote with $W_D$ the set of all such weakly uncertain attributes of domain $D$. We collectively denote with $\mathcal{U}_D = S_D \cup W_D$ the set of uncertain attributes of domain $D$.
It is easy to see how a “certain” attribute $x$, with a value not affected by any uncertainty, can be represented through the definitions in use here: if its domain is discrete, it can be represented with the singleton $\{x\}$; otherwise, it can be represented with the degenerate interval $[x, x]$.
Definition 2 (Uncertain events). Let $\mathcal{U}_I$ be the universe of event identifiers. Let $\mathcal{U}_C$ be the universe of case identifiers. Let $A \in \mathcal{U}$ be the discrete domain of all the activity identifiers. Let $T \in \mathcal{U}$ be the totally ordered domain of all the timestamp identifiers. Let $O = \{?\} \in \mathcal{U}$, where the “?” symbol is a placeholder denoting event indeterminacy. The universe of uncertain events is denoted with $\mathcal{E} = \mathcal{U}_I \times \mathcal{U}_C \times \mathcal{U}_A \times \mathcal{U}_T \times \mathcal{U}_O$.
The activity label, timestamp and indeterminacy attribute values of an uncertain event are drawn from $\mathcal{U}_A$, $\mathcal{U}_T$ and $\mathcal{U}_O$; in accordance with Definition 1, each of these attributes can be strongly uncertain (set of possible values or interval) or weakly uncertain (probability distribution). The indeterminacy domain is defined on a single element “?”: thus, strongly uncertain indeterminacy may be $\{?\}$ (indeterminate event) or $\emptyset$ (no indeterminacy). In weakly uncertain indeterminacy, the “?” element is associated with a probability value.
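To make Definitions 1 and 2 concrete, the following is a minimal sketch of how the uncertain trace of Table 1 could be encoded. The class name, field names, the helper functions and the choice of measuring timestamps in hours since 05-10-2020 00:00 are all assumptions made for illustration; they do not come from the paper or from any library.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple, Union

# Hypothetical encoding, chosen for illustration only.
Activity = Union[FrozenSet[str], Dict[str, float]]  # strong: set of labels; weak: label -> probability
Interval = Tuple[float, float]                      # closed interval [t_min, t_max]
Indeterminacy = Union[None, str, float]             # None: determinate; "?": strong; float: weight of "?"


@dataclass(frozen=True)
class UncertainEvent:
    event_id: str
    case_id: str
    activity: Activity
    timestamp: Interval  # assumed unit: hours since 05-10-2020 00:00
    indeterminacy: Indeterminacy = None


def certain(label: str) -> FrozenSet[str]:
    """A certain activity label is represented by the singleton {label}."""
    return frozenset({label})


def is_uncertain(e: UncertainEvent) -> bool:
    """True if any attribute of the event carries uncertainty."""
    t_min, t_max = e.timestamp
    return len(e.activity) > 1 or t_min < t_max or e.indeterminacy is not None


trace = [
    UncertainEvent("e1", "5167", certain("h"), (23.0, 23.0)),
    UncertainEvent("e2", "5167", certain("c"), (24.0, 48.0)),  # some time on 06-10-2020 (approximated)
    UncertainEvent("e3", "5167", certain("r"), (20.0, 34.0)),  # support of the uniform distribution
    UncertainEvent("e4", "5167", certain("i"), (106.0, 106.0)),
    UncertainEvent("e5", "5167", {"f": 0.3, "t": 0.7}, (225.0, 225.0)),
    UncertainEvent("e6", "5167", certain("v"), (250.0, 250.0), "?"),
]

print([e.event_id for e in trace if is_uncertain(e)])
```

Under this encoding, the events flagged as uncertain are the same four that the running example names: e2, e3, e5 and e6.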
Definition 3 (Projection functions). For an uncertain event $e = (i, c, a, t, o) \in \mathcal{E}$, we define the following projection functions: $\pi_a(e) = a$, $\pi_t(e) = t$, $\pi_o(e) = o$. We define $\pi_a^{set}(e) = a$ if $a$ is strongly uncertain, and $\pi_a^{set}(e) = \{x \in \mathcal{U}_A \mid f_A(x) > 0\}$ with $a = f_A$ otherwise. If the timestamp $t = [t_{min}, t_{max}]$ is strongly uncertain, we define $\pi_{t_{min}}(e) = t_{min}$ and $\pi_{t_{max}}(e) = t_{max}$. If the timestamp $t = f_T$ is weakly uncertain, we define $\pi_{t_{min}}(e) = \operatorname{argmin}_x (f_T(x) > 0)$ and $\pi_{t_{max}}(e) = \operatorname{argmax}_x (f_T(x) > 0)$.
Definition 4 (Uncertain traces and logs). $\tau \subseteq \mathcal{E}$ is an uncertain trace if all the event identifiers in $\tau$ are unique and all events in $\tau$ share the same case identifier $c \in \mathcal{U}_C$. $\mathcal{T}$ denotes the universe of uncertain traces. $L \subseteq \mathcal{T}$ is an uncertain log if all the event identifiers in $L$ are unique.
Definition 5 (Realizations of uncertain traces). Let $e, e' \in \mathcal{E}$ be two uncertain events. $\prec_{\mathcal{E}}$ is a strict partial order defined on the universe of strongly uncertain events $\mathcal{E}$ as $e \prec_{\mathcal{E}} e' \Leftrightarrow \pi_{t_{max}}(e) < \pi_{t_{min}}(e')$. Let $\tau \in \mathcal{T}$ be an uncertain trace. The sequence $\rho = \langle e_1, e_2, \dots, e_n \rangle \in \mathcal{E}^*$, with $n \le |\tau|$, is an order-realization of $\tau$ if there exists a total function $f : \{1, 2, \dots, n\} \to \tau$ such that:
- for all $1 \le i < j \le n$ we have that $\rho[j] \nprec_{\mathcal{E}} \rho[i]$,
- for all $e \in \tau$ with $\pi_o(e) = \emptyset$ there exists $1 \le i \le n$ such that $f(i) = e$.
We denote with $R_O(\tau)$ the set of all such order-realizations of the trace $\tau$.
Given an order-realization $\rho = \langle e_1, e_2, \dots, e_n \rangle \in R_O(\tau)$, the sequence $\sigma \in \mathcal{U}_A^*$ is a realization of $\rho$ if $\sigma \in \{\langle a_1, a_2, \dots, a_n \rangle \mid \forall\, 1 \le i \le n : a_i \in \pi_a^{set}(e_i)\}$. We denote with $R_A(\rho) \subseteq \mathcal{U}_A^*$ the set of all such realizations of the order-realization $\rho$. We denote with $R(\tau) \subseteq \mathcal{U}_A^*$ the union of the realizations obtainable from all the order-realizations of $\tau$: $R(\tau) = \bigcup_{\rho \in R_O(\tau)} R_A(\rho)$. We will say that an order-realization $\rho \in R_O(\tau)$ enables a sequence $\sigma \in \mathcal{U}_A^*$ if $\sigma \in R_A(\rho)$.
Detailing an algorithm to generate all realizations of an uncertain trace is beyond the scope of this paper. The literature illustrates a conformance checking method over uncertain data which employs a behavior net, a Petri net able to replay all and only the realizations of an uncertain trace [8]. Exhaustively exploring all complete firing sequences of a behavior net, e.g., through its reachability graph, provides all realizations of the corresponding uncertain trace.
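For very small traces, Definition 5 can also be checked by brute force. The sketch below is not the behavior net approach cited above: it simply tries every permutation of every admissible subset of events, so its cost grows factorially and it is only meant as an illustration. The dictionary encoding and the timestamps in hours since 05-10-2020 00:00 are assumptions.

```python
from itertools import combinations, permutations, product

# Hypothetical encoding: event id -> (t_min, t_max, indeterminate, possible labels).
# Timestamps are hours since 05-10-2020 00:00 for the trace of Table 1 (an assumed unit).
TRACE = {
    "e1": (23.0, 23.0, False, {"h"}),
    "e2": (24.0, 48.0, False, {"c"}),
    "e3": (20.0, 34.0, False, {"r"}),
    "e4": (106.0, 106.0, False, {"i"}),
    "e5": (225.0, 225.0, False, {"f", "t"}),
    "e6": (250.0, 250.0, True, {"v"}),
}


def precedes(trace, e, f):
    """Strict partial order of Definition 5: e precedes f iff t_max(e) < t_min(f)."""
    return trace[e][1] < trace[f][0]


def order_realizations(trace):
    """Enumerate all order-realizations by exhaustive search (exponential cost)."""
    mandatory = [e for e in trace if not trace[e][2]]
    optional = [e for e in trace if trace[e][2]]
    found = []
    for k in range(len(optional) + 1):
        for kept in combinations(optional, k):
            for rho in permutations(mandatory + list(kept)):
                # Reject rho if some later event strictly precedes an earlier one.
                violated = any(
                    precedes(trace, rho[j], rho[i])
                    for i in range(len(rho))
                    for j in range(i + 1, len(rho))
                )
                if not violated:
                    found.append(rho)
    return found


def realizations(trace, rho):
    """All activity sequences enabled by the order-realization rho."""
    return set(product(*(sorted(trace[e][3]) for e in rho)))


rhos = order_realizations(TRACE)
all_sigmas = set().union(*(realizations(TRACE, rho) for rho in rhos))
print(len(rhos), len(all_sigmas))
```

With these assumed timestamps the search finds 6 order-realizations and 12 realizations, which agrees with the counts in Tables 2 and 3.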
Given the above formalization, we can now define more clearly the research question that we are investigating in this paper. Given an uncertain trace $\tau \in \mathcal{T}$ and one of its realizations $\sigma \in R(\tau)$, our goal is to obtain a procedure to reliably compute $P(\sigma \mid \tau)$, the probability of $\sigma$ given that we observe $\tau$. In other words, provided that $\sigma$ corresponds to a scenario (i.e., a realization) for the uncertain trace $\tau$, we are interested in calculating the probability that $\sigma$ is the scenario that actually occurred in reality, which caused the recording of the uncertain trace $\tau$ in the event log. In the next section, we will illustrate how to calculate such probabilities of uncertain trace realizations.
5 Method
Before we show how we can obtain probability estimates for all realizations of an uncertain trace, it is important to state an assumption: the information on uncertainty related to a particular attribute in some event is independent of the possible values of the same attribute present in other events, and it is independent of the uncertainty information on other attributes of the same event. Note that in the examples of uncertainty sources given in Section 1 (data coarseness and sensor errors), this independence assumption often holds.
Additionally, we need to consider the fact that strongly uncertain attributes do not come with known probability values: their description only specifies the values that attributes might acquire, but not the likelihood of each possible value. As a consequence, estimating probability for specific realizations in a strongly uncertain environment is only possible with a-priori assumptions on how probability distributes among the attribute values. At times, it might be possible to assume the distribution in an informed way, for instance on the basis of features of the information system hosting the data, of the sensors recording events and attributes, or of other tools involved in the management of the process.
In case no indication is present, a reasonable assumption (which we will hold for the remainder of the paper) is that any possible value of a strongly uncertain attribute is equally likely. Formally, with $e = (i, c, a, t, o) \in \mathcal{E}$ let $\tau_s : \mathcal{E} \to \mathcal{E}$ be a function such that $\tau_s(e) = (i, c, a', t', o')$, where $a' = \{(x, \frac{1}{|\pi_a^{set}(e)|}) \mid x \in \pi_a^{set}(e)\}$ if $a \in S_A$ and $a' = a$ otherwise; $t' = U(\pi_{t_{min}}(e), \pi_{t_{max}}(e))$ if $t \in S_T$ and $t' = t$ otherwise; $o' = 0.5$ if $o = \{?\}$ and $o' = o$ otherwise.
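The uniform-prior conversion can be sketched in a few lines. The representation used here is an assumption made for illustration: strong activities are Python sets, weak activities are dictionaries, a strong timestamp is a pair `(t_min, t_max)`, a uniform distribution is tagged as `("uniform", t_min, t_max)`, and strong indeterminacy is the string `"?"`.

```python
def to_weak_activity(activity):
    """Strong activity (set of labels) -> uniform distribution; weak activity is kept as is."""
    if isinstance(activity, dict):
        return dict(activity)
    labels = sorted(activity)
    return {label: 1.0 / len(labels) for label in labels}


def to_weak_timestamp(timestamp):
    """Strong timestamp (t_min, t_max) -> tagged uniform distribution U(t_min, t_max)."""
    if isinstance(timestamp, tuple) and len(timestamp) == 2:
        t_min, t_max = timestamp
        return ("uniform", t_min, t_max)
    return timestamp  # already a (hypothetical) distribution object


def to_weak_indeterminacy(indeterminacy):
    """Strong indeterminacy "?" -> 0.5; None (determinate) and weak values are kept as is."""
    return 0.5 if indeterminacy == "?" else indeterminacy


print(to_weak_activity({"f", "t"}))
print(to_weak_timestamp((24.0, 48.0)))
print(to_weak_indeterminacy("?"))
```

After this step every attribute is either certain or carries an explicit distribution, so the rest of the method only has to deal with the weakly uncertain case.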
First, observe that the probability $P(\sigma \mid \tau)$ that an activity sequence $\sigma \in \mathcal{U}_A^*$ is indeed a realization of the trace $\tau \in \mathcal{T}$, and thus $\sigma \in R(\tau)$, increases with the number of order-realizations enabling it. Furthermore, for each such order-realization, one can construct a probability function $P_O(\rho \mid \tau)$ reflecting the likelihood of the sequence $\rho$ itself given the trace $\tau$, and a probability function $P_A(\sigma \mid \rho)$ reflecting the likelihood that the realization corresponding to $\rho$ is indeed $\sigma$. The value of $P_O(\rho \mid \tau)$ is affected by the uncertainty information in timestamps and indeterminate events, while the value of $P_A(\sigma \mid \rho)$ is aggregated from the uncertainty information in the activity labels.
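Given a realization $\sigma$ of an uncertain process instance and the set of its enablers, its probability is computed as follows:
\[
P(\sigma \mid \tau) = \sum_{\rho \in R_O(\tau)} P_O(\rho \mid \tau) \cdot P_A(\sigma \mid \rho)
\]
Note that, if $\rho$ does not enable $\sigma$, $P_A(\sigma \mid \rho) = 0$. For any uncertain trace $\tau \in \mathcal{T}$, it holds that $\sum_{\sigma \in R(\tau)} P(\sigma \mid \tau) = 1$, since both $P_O(\cdot)$ and $P_A(\cdot)$ are each constructed to be (independent) probability distributions.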
We will now compute $P_A(\sigma \mid \rho)$ using the uncertainty information on the activity labels. Let us write $f_A^e$ as a shorthand for $\pi_a(e)$. If there is uncertainty in activities, then for each event $e \in \rho$ and activity label $a \in \pi_a^{set}(e)$, the probability that $e$ executes $a$ is given by $f_A^e(a)$. Thus, for every $\rho = \langle e_1, \dots, e_n \rangle \in R_O(\tau)$ and $\sigma = \langle a_1, \dots, a_n \rangle \in R_A(\rho)$, the value $P_A$ can be aggregated from these distributions in the following way:
\[
P_A(\sigma \mid \rho) = \prod_{i=1}^{n} f_A^{e_i}(a_i)
\]
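The product above translates directly into code. The following minimal sketch evaluates it on the running example; the dictionary `F_A` (event id to label distribution) and the function name are hypothetical, and a sequence that $\rho$ does not enable is given probability 0, as stated earlier in this section.

```python
from math import prod

# Hypothetical encoding of f_A^e for the trace of Table 1: event id -> {label: probability}.
F_A = {
    "e1": {"h": 1.0},
    "e2": {"c": 1.0},
    "e3": {"r": 1.0},
    "e4": {"i": 1.0},
    "e5": {"f": 0.3, "t": 0.7},
    "e6": {"v": 1.0},
}


def p_activities(sigma, rho, f_a):
    """P_A(sigma | rho): product over positions of f_A^{e_i}(a_i); 0 if rho does not enable sigma."""
    if len(sigma) != len(rho):
        return 0.0
    return prod(f_a[e].get(a, 0.0) for e, a in zip(rho, sigma))


rho_1 = ("e1", "e2", "e3", "e4", "e5", "e6")
print(p_activities(("h", "c", "r", "i", "f", "v"), rho_1, F_A))
print(p_activities(("h", "c", "r", "i", "t", "v"), rho_1, F_A))
```

Only event e5 has more than one possible label here, so the result is driven entirely by whether the sequence contains f or t.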
Through the value of $P_A$, we can assess the likelihood that any given order-realization executes a particular realization. The next step is to estimate the probability of each order-realization $\rho$ from the set $R_O(\tau)$. The probability of observing $\rho$ needs to be aggregated from two values: the probability that the corresponding set of events appears in the given particular order, which is determined by the timestamp intervals and, if applicable, the distributions over them; and the probability that the order-realization contains the corresponding specific set of events, which is determined by the uncertainty information on the indeterminacy. Multiplying these two values to yield a probability estimate for the order-realization reflects our independence assumption. Let us first focus on uncertainty on timestamps, which causes the events to be partially ordered.
We will write $f_T^e(t)$ as a shorthand for $\pi_t(e)(t)$. For every event $e$, the value of $f_T^e(t)$ yields the probability that event $e$ happened on timestamp $t$. This value is always 0 for all $t < \pi_{t_{min}}(e)$ and $t > \pi_{t_{max}}(e)$ (see $\pi_{t_{min}}$ and $\pi_{t_{max}}$ in Definition 3). Given the continuous domain of timestamps, $P_O(\cdot)$ is assessed by using integrals. For a trace $\tau \in \mathcal{T}$ and an order-realization $\rho = \langle e_1, \dots, e_n \rangle \in R_O(\tau)$, let $a_i = \pi_{t_{min}}(e_i)$ and $b_i = \pi_{t_{max}}(e_i)$ for all $1 \le i \le n$. Then, we define:
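\[
\begin{aligned}
I(\rho) &= \int_{a_1}^{\min\{b_1, \dots, b_n\}} f_T^{e_1}(x_1) \int_{\max\{a_2, x_1\}}^{\min\{b_2, \dots, b_n\}} f_T^{e_2}(x_2) \cdots \int_{\max\{a_i, x_{i-1}\}}^{\min\{b_i, \dots, b_n\}} f_T^{e_i}(x_i) \cdots \int_{\max\{a_n, x_{n-1}\}}^{b_n} f_T^{e_n}(x_n) \, dx_n \dots dx_1 \\
&= \int_{a_1}^{\min\{b_1, \dots, b_n\}} \int_{\max\{a_2, x_1\}}^{\min\{b_2, \dots, b_n\}} \cdots \int_{\max\{a_i, x_{i-1}\}}^{\min\{b_i, \dots, b_n\}} \cdots \int_{\max\{a_n, x_{n-1}\}}^{b_n} \prod_{i=1}^{n} f_T^{e_i}(x_i) \, dx_n \dots dx_1
\end{aligned}
\]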
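As a small worked instance of this chain of integrals, consider the simplest non-trivial case of $n = 2$ events whose timestamps are both uniformly distributed on the same interval $[a, b]$. This setting is an assumption chosen for illustration, not a case taken from the running example. The result is $1/2$, as symmetry suggests: each of the two orders is equally likely.

```latex
% Assumption: f_T^{e_1} = f_T^{e_2} = 1/(b-a) on [a, b], hence a_1 = a_2 = a and b_1 = b_2 = b.
% Outer limits: a_1 = a to min{b_1, b_2} = b; inner limits: max{a_2, x_1} = x_1 to b_2 = b.
\begin{align*}
I(\langle e_1, e_2 \rangle)
  &= \int_{a}^{b} \frac{1}{b-a} \int_{x_1}^{b} \frac{1}{b-a} \, dx_2 \, dx_1 \\
  &= \frac{1}{(b-a)^2} \int_{a}^{b} (b - x_1) \, dx_1
   = \frac{1}{(b-a)^2} \cdot \frac{(b-a)^2}{2}
   = \frac{1}{2}.
\end{align*}
```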
This chain of integrals allows us to compute the probability of a specific order among all the events in an uncertain trace. Now, to compute the probability of each order-realization in $R_O(\tau)$ while accounting for indeterminate events, we combine both the probability of the events having appeared in a particular order and the probability that the sequence contains exactly those events. For simplicity, we will use a function that takes the value 0 if an event is not indeterminate, so that determinate events contribute a factor of 1 to the probability of the order-realizations containing them. Let us define $f_O^e : O \to [0, 1]$ such that $f_O^e(?) = \pi_o(e)(?)$ if $\pi_o(e) \neq \emptyset$ and $f_O^e(?) = 0$ otherwise. More precisely, given $\tau \in \mathcal{T}$ and $\rho \in R_O(\tau)$, we compute:
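\[
P_O(\rho \mid \tau) = I(\rho) \cdot \prod_{\substack{e \in \tau \\ e \in \rho}} \left(1 - f_O^e(?)\right) \cdot \prod_{\substack{e \in \tau \\ e \notin \rho}} f_O^e(?)
\]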
We now have at our disposal all the necessary tools to compute a probability dis-
tribution over the trace realizations of any uncertain process instance in any possible
uncertainty scenario. Let us then apply this method to compute the probabilities of all
realizations of the trace τin Table 1, and to analyze its conformance to the process in
Figure 1.
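Each order-realization of $\tau$ enables two realizations, because event $e_5$ has two possible activity labels. Since for events $e \in \tau \setminus \{e_5\}$, we have $f_A^e$ equal to 1 for their corresponding unique activity label, the probability that an order-realization $\rho \in R_O(\tau)$ has some realization $\sigma \in R_A(\rho)$ only depends on whether the trace $\sigma$ contains activity $f$ or $t$. Thus, for traces $\sigma_1', \sigma_2', \sigma_3', \sigma_4', \sigma_5', \sigma_6'$ and their unique enabling sequences, we always have $P_A(\sigma_i' \mid \rho_i) = f_A^{e_5}(f) = 0.3$, where $i \in \{1, \dots, 6\}$. Similarly, for traces $\sigma_1'', \sigma_2'', \sigma_3'', \sigma_4'', \sigma_5'', \sigma_6''$ and their unique enabling sequences, we always have $P_A(\sigma_i'' \mid \rho_i) = f_A^{e_5}(t) = 0.7$, where $i \in \{1, \dots, 6\}$. Next, we calculate the $P_O(\cdot)$ values for the 6 possible order-realizations in $R_O(\tau)$, which are displayed in Table 2.

Table 2: The possible order-realizations of the process instance from Table 1 and their probabilities.

| Order-realization $\rho$ | $I(\rho)$ | $P_O(\rho)$ |
|---|---|---|
| $\rho_1$: $\langle e_1, e_2, e_3, e_4, e_5, e_6 \rangle$ | 0.149 | 0.074 |
| $\rho_2$: $\langle e_1, e_3, e_2, e_4, e_5, e_6 \rangle$ | 0.780 | 0.390 |
| $\rho_3$: $\langle e_3, e_1, e_2, e_4, e_5, e_6 \rangle$ | 0.072 | 0.036 |
| $\rho_4$: $\langle e_1, e_2, e_3, e_4, e_5 \rangle$ | 0.149 | 0.074 |
| $\rho_5$: $\langle e_1, e_3, e_2, e_4, e_5 \rangle$ | 0.780 | 0.390 |
| $\rho_6$: $\langle e_3, e_1, e_2, e_4, e_5 \rangle$ | 0.072 | 0.036 |

Table 3: The set of possible realizations of the example from Table 1, their enablers, their probabilities, and their conformance scores. The conformance score is equal to the cost of the optimal alignment between the trace and the Petri net in Figure 1.

| Realization $\sigma$ | $\rho$ | $P(\sigma \mid \tau)$ | conf |
|---|---|---|---|
| $\sigma_1'$: $\langle h, c, r, i, f, v \rangle$ | $\rho_1$ | $P_O(\rho_1) \cdot P_A(\sigma_1' \mid \rho_1) = 0.022$ | 1 |
| $\sigma_1''$: $\langle h, c, r, i, t, v \rangle$ | $\rho_1$ | $P_O(\rho_1) \cdot P_A(\sigma_1'' \mid \rho_1) = 0.052$ | 0 |
| $\sigma_2'$: $\langle h, r, c, i, f, v \rangle$ | $\rho_2$ | $P_O(\rho_2) \cdot P_A(\sigma_2' \mid \rho_2) = 0.117$ | 3 |
| $\sigma_2''$: $\langle h, r, c, i, t, v \rangle$ | $\rho_2$ | $P_O(\rho_2) \cdot P_A(\sigma_2'' \mid \rho_2) = 0.273$ | 2 |
| $\sigma_3'$: $\langle r, h, c, i, f, v \rangle$ | $\rho_3$ | $P_O(\rho_3) \cdot P_A(\sigma_3' \mid \rho_3) = 0.011$ | 3 |
| $\sigma_3''$: $\langle r, h, c, i, t, v \rangle$ | $\rho_3$ | $P_O(\rho_3) \cdot P_A(\sigma_3'' \mid \rho_3) = 0.025$ | 2 |
| $\sigma_4'$: $\langle h, c, r, i, f \rangle$ | $\rho_4$ | $P_O(\rho_4) \cdot P_A(\sigma_4' \mid \rho_4) = 0.022$ | 0 |
| $\sigma_4''$: $\langle h, c, r, i, t \rangle$ | $\rho_4$ | $P_O(\rho_4) \cdot P_A(\sigma_4'' \mid \rho_4) = 0.052$ | 1 |
| $\sigma_5'$: $\langle h, r, c, i, f \rangle$ | $\rho_5$ | $P_O(\rho_5) \cdot P_A(\sigma_5' \mid \rho_5) = 0.117$ | 2 |
| $\sigma_5''$: $\langle h, r, c, i, t \rangle$ | $\rho_5$ | $P_O(\rho_5) \cdot P_A(\sigma_5'' \mid \rho_5) = 0.273$ | 3 |
| $\sigma_6'$: $\langle r, h, c, i, f \rangle$ | $\rho_6$ | $P_O(\rho_6) \cdot P_A(\sigma_6' \mid \rho_6) = 0.011$ | 2 |
| $\sigma_6''$: $\langle r, h, c, i, t \rangle$ | $\rho_6$ | $P_O(\rho_6) \cdot P_A(\sigma_6'' \mid \rho_6) = 0.025$ | 3 |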
One can notice that the $I$ values only depend on the ordering of the first three events, which are also the only ones with overlapping timestamps. Since the indeterminate event $e_6$ does not overlap with any other event, pairs of sequences where the first three events have the same order also have the same probability. This reflects our assumption that the occurrence and non-occurrence of $e_6$ are both equally possible. Table 3 displays the calculations for the computation of the $P(\sigma \mid \tau)$ values for all realizations. Now we can compute the expected conformance score for the uncertain process instance $\tau = \{e_1, \dots, e_6\}$. We can do so by computing alignments [5] for each realization of $\tau$:
\[
\begin{aligned}
\mathit{conf}(\tau) &= \sum_{\sigma \in R(\tau)} P(\sigma \mid \tau) \cdot \mathit{conf}(\sigma, M) \\
&= 0.022 \cdot 1 + 0.052 \cdot 0 + 0.117 \cdot 3 + 0.273 \cdot 2 + 0.011 \cdot 3 + 0.025 \cdot 2 \\
&\quad + 0.022 \cdot 0 + 0.052 \cdot 1 + 0.117 \cdot 2 + 0.273 \cdot 3 + 0.011 \cdot 2 + 0.025 \cdot 3 \\
&= 2.204.
\end{aligned}
\]
Given the information on uncertainty available for the trace, this conformance score
is a more realistic estimate of the real conformance score compared to taking the best,
worst or average scores with values 0, 3 and 1.75 respectively.
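The expected conformance score is a plain probability-weighted sum, which the following sketch recomputes from the rounded values listed in Table 3. The list name and layout are assumptions made for illustration; the numbers are copied from the table.

```python
# (probability, conformance cost) pairs copied from Table 3, in row order.
TABLE_3 = [
    (0.022, 1), (0.052, 0), (0.117, 3), (0.273, 2), (0.011, 3), (0.025, 2),
    (0.022, 0), (0.052, 1), (0.117, 2), (0.273, 3), (0.011, 2), (0.025, 3),
]

total_probability = sum(p for p, _ in TABLE_3)
expected_conf = sum(p * cost for p, cost in TABLE_3)
best_case = min(cost for _, cost in TABLE_3)
worst_case = max(cost for _, cost in TABLE_3)

print(round(total_probability, 3))
print(round(expected_conf, 3))
print(best_case, worst_case)
```

The probabilities sum to 1 and the weighted sum gives 2.204, between the best case of 0 and the worst case of 3.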
Figure 2: The behavior graph of the uncertain trace considered as example for validation. Node labels: $e_1$: a; $e_2$: b: 0.9, c: 0.1; $e_3$: d, ?: 0.8; $e_4$: e.
Figure 3: The behavior net obtained from the behavior graph in Figure 2.
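Table 4: The set of realizations of the trace from Figure 2, their enablers, and their probabilities.

| Realization $\sigma$ | $\rho$ | $P(\sigma \mid \tau)$ |
|---|---|---|
| $\sigma_1$: $\langle a, b, e \rangle$ | $\rho_1$: $\langle e_1, e_2, e_4 \rangle$ | $P_O(\rho_1) \cdot P_A(\sigma_1 \mid \rho_1) = 0.8 \cdot 0.9 = 0.72$ |
| $\sigma_2$: $\langle a, b, d, e \rangle$ | $\rho_2$: $\langle e_1, e_2, e_3, e_4 \rangle$ | $P_O(\rho_2) \cdot P_A(\sigma_2 \mid \rho_2) = (0.5 \cdot 0.2) \cdot 0.9 = 0.09$ |
| $\sigma_3$: $\langle a, d, b, e \rangle$ | $\rho_3$: $\langle e_1, e_3, e_2, e_4 \rangle$ | $P_O(\rho_3) \cdot P_A(\sigma_3 \mid \rho_3) = (0.5 \cdot 0.2) \cdot 0.9 = 0.09$ |
| $\sigma_4$: $\langle a, c, e \rangle$ | $\rho_4$: $\langle e_1, e_2, e_4 \rangle$ | $P_O(\rho_4) \cdot P_A(\sigma_4 \mid \rho_4) = 0.8 \cdot 0.1 = 0.08$ |
| $\sigma_5$: $\langle a, c, d, e \rangle$ | $\rho_5$: $\langle e_1, e_2, e_3, e_4 \rangle$ | $P_O(\rho_5) \cdot P_A(\sigma_5 \mid \rho_5) = (0.5 \cdot 0.2) \cdot 0.1 = 0.01$ |
| $\sigma_6$: $\langle a, d, c, e \rangle$ | $\rho_6$: $\langle e_1, e_3, e_2, e_4 \rangle$ | $P_O(\rho_6) \cdot P_A(\sigma_6 \mid \rho_6) = (0.5 \cdot 0.2) \cdot 0.1 = 0.01$ |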
6 Validation of Probability Estimates
In this section, we compute the probability estimates for the realizations of an uncertain trace, and then show a validation of those estimates by Monte Carlo simulation on the behavior net of the trace. The process instance of our example has strong uncertainty in timestamps and weak uncertainty in activities and indeterminacy. It consists of 4 events: $e_1$, $e_2$, $e_3$ and $e_4$, where $e_2$ and $e_3$ have overlapping timestamps. Event $e_2$ executes $b$ (resp., $c$) with probability 0.9 (resp., 0.1). There is a probability of 0.8 that $e_3$ did not occur, as indicated by the weight of “?” in Figure 2 and by the values in Table 4. Figure 2 shows the corresponding behavior graph, an uncertain event data visualization that represents the time relationships between events with a directed acyclic graph [8]. Lastly, Table 4 lists all the possible realizations, their probabilities, and the order-realizations enabling them.
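The entries of Table 4 can be recomputed with a short script that combines $P_O$ and $P_A$. This is a sketch under stated assumptions: the dictionaries and function names are hypothetical, the value $I(\rho) = 0.5$ for the two orderings of the overlapping events and $I(\rho) = 1$ when $e_3$ is absent are read off Table 4 rather than integrated, and determinate events are taken to contribute a factor of 1 to $P_O$.

```python
from math import prod

# Hypothetical encoding of the validation trace (Figure 2 and Table 4).
F_A = {"e1": {"a": 1.0}, "e2": {"b": 0.9, "c": 0.1}, "e3": {"d": 1.0}, "e4": {"e": 1.0}}
F_O = {"e3": 0.8}  # weight of "?": probability that e3 is absent; determinate events default to 0
ORDER_INTEGRAL = {  # I(rho), values taken from Table 4 (assumption: not recomputed here)
    ("e1", "e2", "e4"): 1.0,
    ("e1", "e2", "e3", "e4"): 0.5,
    ("e1", "e3", "e2", "e4"): 0.5,
}


def p_order(rho):
    """P_O(rho | tau): ordering integral times presence/absence factors."""
    present = prod(1.0 - F_O.get(e, 0.0) for e in rho)
    absent = prod(F_O.get(e, 0.0) for e in F_A if e not in rho)
    return ORDER_INTEGRAL[rho] * present * absent


def p_activities(sigma, rho):
    """P_A(sigma | rho): 0 if rho does not enable sigma."""
    if len(sigma) != len(rho):
        return 0.0
    return prod(F_A[e].get(a, 0.0) for e, a in zip(rho, sigma))


def p_realization(sigma):
    """P(sigma | tau): sum over all order-realizations."""
    return sum(p_order(rho) * p_activities(sigma, rho) for rho in ORDER_INTEGRAL)


for sigma in [("a", "b", "e"), ("a", "b", "d", "e"), ("a", "d", "b", "e"),
              ("a", "c", "e"), ("a", "c", "d", "e"), ("a", "d", "c", "e")]:
    print(sigma, round(p_realization(sigma), 2))
```

The six values printed are 0.72, 0.09, 0.09, 0.08, 0.01 and 0.01, matching Table 4, and they sum to 1.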
We now validate our obtained probability estimates quantitatively by means of a Monte Carlo simulation approach. First, we construct the behavior net [9] corresponding to the uncertain process instance, which is shown in Figure 3. The set of replayable traces in this behavior net is exactly the set of realizations for the uncertain instance. Then, we simulate realizations on the behavior net, dividing the accumulated count of each realization by the number of runs, and compare those values to our probability estimates. Here, we use the stochastic simulator of the PM4Py library [3]. In every step of the simulation, the stochastic simulator chooses one enabled transition to fire according to a stochastic map, assigning a weight to each transition in the Petri net (here, the behavior net).
To simulate uncertainty in activities, events and timestamps, we do the following: possible activities executed by the same event appearing in an XOR-split in the behavior net are weighted so as to reflect the probability values of the activity labels. Indeterminacy is equivalently modeled as an XOR-choice between a visible transition and a silent one in the behavior net, so as to model a “skip”. If there are two or more possible activities for an indeterminate event, then the sum of the weights of the visible transitions in relation to the weight of the silent transition should be the same as in the distribution given in the event type uncertainty information. Whenever there are events with overlapping timestamps, these appear in an AND-split in the behavior net. The (enabled) path of the AND-split which is taken first signals which event is executed at that moment.
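Let $bn(\tau) = (P, T)$ be the behavior net of trace $\tau$. Let $(e, a) \in T$ be a visible transition related to some event $e \in \tau$. We weight $(e, a)$ the following way:
\[
weight((e, a)) =
\begin{cases}
f_A^e(a) & \text{if } \pi_o(e) = \emptyset, \\
(1 - f_O^e(?)) \cdot f_A^e(a) & \text{otherwise.}
\end{cases}
\]
If $e \in \tau$ is an indeterminate event, then the silent transition $(e, \tau)$ has $weight((e, \tau)) = f_O^e(?)$.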
Note that according to the weight assignment function, if $e$ is determinate, then $\sum_{a \in \pi_a^{set}(e)} weight((e, a)) = 1$. Otherwise, $\sum_{a \in \pi_a^{set}(e)} weight((e, a)) = 1 - f_O^e(?) = 1 - weight((e, \tau))$. By construction of the behavior net, any transition related to an event in $\tau$ can only fire in accordance with the partial order of uncertain timestamps. Additionally, all transitions representing events with overlapping timestamps appear in an AND construct. By definition of our weight function, whenever the transitions of some $e \in \tau$ are enabled (in an XOR construct), the probability of firing one of them is $1/k$, where $k$ is the number of events from $\tau$ for which none of the corresponding transitions have fired yet. This way, there is always a uniform distribution over the set of enabled transitions representing overlapping events. Assigning the weights according to this distribution makes it possible to decorate the behavior net with probabilities that reflect the chances of occurrence of every possible value in uncertain attributes.
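The weight assignment can be sketched independently of any Petri net library. In the snippet below, the function name, the `SILENT` marker for the skip transition and the dictionary-based inputs are assumptions made for illustration; no PM4Py call is used.

```python
SILENT = None  # hypothetical marker for the silent ("skip") transition of an event


def transition_weights(f_a_e, f_o_e=None):
    """Weights of the transitions of one event.

    f_a_e: dict label -> probability (the distribution f_A^e).
    f_o_e: weight of "?" for an indeterminate event, or None if the event is determinate.
    """
    if f_o_e is None:
        # Determinate event: visible transitions carry the label probabilities.
        return dict(f_a_e)
    # Indeterminate event: scale visible transitions and add the silent one.
    weights = {label: (1.0 - f_o_e) * p for label, p in f_a_e.items()}
    weights[SILENT] = f_o_e
    return weights


w_e2 = transition_weights({"b": 0.9, "c": 0.1})
w_e3 = transition_weights({"d": 1.0}, f_o_e=0.8)
print(w_e2)
print(w_e3)
```

For the validation trace this gives weights 0.9 and 0.1 for the two labels of e2, and about 0.2 for the visible transition of e3 against 0.8 for its silent transition, so the weights of each event sum to 1.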
Applying the stochastic simulator $n$ times yields $n$ realizations. For each of the 6
possible realizations of the uncertain process instance, we obtain a probability measurement
by dividing its simulated frequency by $n$. Figures 4 through 7 show how, for greater
$n$, this measurement converges to the probability estimates shown in Table 4, which were
computed with our method.
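The simulation procedure can be approximated with the following minimal sketch. It does not build a Petri net; it samples directly from a partial order of events, picking uniformly among the currently enabled events and then choosing one of the event's transitions by weight, which is one reading of the firing rule described above. The trace `EVENTS`, its probabilities, and the names `simulate_once` and `estimate` are hypothetical: the values are illustrative assumptions chosen so that the exact realization probabilities agree with the expected values quoted in the captions of Figures 4 through 7.

```python
import random
from collections import Counter

# Hypothetical uncertain trace (illustrative assumption, not taken from the paper).
# "weights": transition label -> weight; the key None stands for the silent transition.
# "after": events that must have fired before this one (partial order of timestamps).
EVENTS = {
    "e1": {"weights": {"a": 1.0}, "after": set()},
    "e2": {"weights": {"b": 0.9, "c": 0.1}, "after": {"e1"}},
    "e3": {"weights": {"d": 0.2, None: 0.8}, "after": {"e1"}},  # indeterminate
    "e4": {"weights": {"e": 1.0}, "after": {"e2", "e3"}},
}


def simulate_once(events, rng):
    """Sample one realization as a tuple of activity labels."""
    fired = set()
    realization = []
    while len(fired) < len(events):
        # Events whose predecessors have all fired and that have not fired yet.
        enabled = sorted(
            name for name, ev in events.items()
            if name not in fired and ev["after"] <= fired
        )
        chosen = rng.choice(enabled)  # uniform choice: probability 1/k
        labels = list(events[chosen]["weights"])
        weights = list(events[chosen]["weights"].values())
        label = rng.choices(labels, weights=weights, k=1)[0]
        if label is not None:  # a silent transition leaves no activity
            realization.append(label)
        fired.add(chosen)
    return tuple(realization)


def estimate(events, n, seed=0):
    """Relative frequency of each realization over n simulated runs."""
    rng = random.Random(seed)
    counts = Counter(simulate_once(events, rng) for _ in range(n))
    return {r: c / n for r, c in counts.items()}
```

Calling `estimate(EVENTS, 1000)` returns a dictionary mapping each sampled realization to its relative frequency, the quantity plotted against the number of runs in the figures.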
To conclude, the Monte Carlo simulation shows that our estimated probabilities for
realizations match their relative frequencies when one simulates the behavior net of the
corresponding uncertain trace.
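As a rough guide to how quickly such frequencies settle, one can treat each run as an independent Bernoulli trial for a given realization. This is a standard binomial argument and an assumption added here, not part of the original evaluation. With true probability $p$ and $n$ runs, the relative frequency $\hat{p}_n$ has standard deviation

$$
\sigma(\hat{p}_n) = \sqrt{\frac{p(1 - p)}{n}},
\qquad
p = 0.72,\ n = 1000 \;\Rightarrow\; \sigma(\hat{p}_n) = \sqrt{\frac{0.72 \cdot 0.28}{1000}} \approx 0.014 .
$$

Under the same assumption, a realization with $p = 0.08$ gives $\sigma(\hat{p}_n) \approx 0.0086$ at $n = 1000$, so deviations of about one to three percentage points from the expected value are unsurprising at that sample size.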
Figure 4: Plot showing how the frequency of trace $\langle a, b, e \rangle$ converges to the expected value of 0.72 over 1000 runs.
Figure 5: Plot showing how the frequency of trace $\langle a, b, d, e \rangle$ converges to the expected value of 0.09 over 1000 runs.
Figure 6: Plot showing how the frequency of trace $\langle a, d, b, e \rangle$ converges to the expected value of 0.09 over 1000 runs.
Figure 7: Plot showing how the frequency of trace $\langle a, c, e \rangle$ converges to the expected value of 0.08 over 1000 runs.
7 Conclusion
Uncertain traces inherently contain behavior, allowing for many realizations; these, in
turn, correspond to diverse possible real-life scenarios that may have different consequences
for the management and governance of a process. In this paper, we presented a
method to quantify the probability of each realization of an uncertain trace. This enables
process analysts to weigh the impact of specific insights gathered with uncertainty-aware
process mining techniques, such as conformance checking using alignments. As a consequence,
information from process analysis techniques can be associated with a quantification
of risk or opportunity for specific scenarios, making them more trustworthy.
Multiple avenues for future work on this topic are possible. These include inferring
probabilities for uncertain traces from sections of the log not affected by uncertainty,
adopting certain traces or fragments of traces as ground truth. Moreover, inferring probabilities
by examining evidence against a ground truth can also be achieved with a normative
model that includes information concerning the probability of error or noise in
specific parts of the process.
Acknowledgements
We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research
interactions.
References
[1] van der Aa, Han, Henrik Leopold, and Matthias Weidlich. “Partial order resolu-
tion of event logs for process conformance checking”. In: Decision Support Sys-
tems 136 (2020), p. 113347. doi:10.1016/j.dss.2020.113347.
[2] Ao, Xiang, Ping Luo, Chengkai Li, et al. “Online Frequent Episode Mining”.
In: 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul,
South Korea, April 13-17, 2015. Ed. by Gehrke, Johannes, Wolfgang Lehner, Kyuseok
Shim, et al. IEEE Computer Society, 2015, pp. 891–902. doi:10.1109/ICDE .
2015.7113342.
[3] Berti, Alessandro, Sebastiaan J. van Zelst, and Wil M. P. van der Aalst. “Process
Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Sci-
ence”. In: ICPM Demo Track (CEUR 2374). 2019, pp. 13–16.
[4] Busany, Nimrod, Han van der Aa, Arik Senderovich, et al. “Interval-based Queries
over Lossy IoT Event Streams”. In: Transanctions on Data Science1.4 (2020), 27:1–
27:27. doi:10.1145/3385191.
[5] van Dongen, Boudewijn F., Josep Carmona, Thomas Chatain, et al. “Aligning
Modeled and Observed Behavior: A Compromise Between Computation Com-
plexity and Quality”. In: Advanced Information Systems Engineering - 29th Inter-
national Conference, CAiSE 2017, Essen, Germany, June 12-16, 2017, Proceedings.
Ed. by Dubois, Eric and Klaus Pohl. Vol. 10253. Lecture Notes in Computer Sci-
ence. Springer, 2017, pp. 94–109. doi:10.1007/978-3-319-59536-8_7.
[6] Leemans, Maikel and Wil M. P. van der Aalst. “Discovery of Frequent Episodes in
Event Logs”. In: Proceedings of the 4th International Symposium on Data-driven
Process Discovery and Analysis (SIMPDA 2014), Milan, Italy, November 19-21,
2014. Ed. by Accorsi, Rafael, Paolo Ceravolo, and Barbara Russo. Vol. 1293. CEUR
Workshop Proceedings. CEUR-WS.org, 2014, pp. 31–45. url:http://ceur-
ws.org/Vol-1293/paper3.pdf.
15 / 16
M. Pegoraro et al. Probability Estimation of Uncertain Trace Realizations
[7] Lu, Xixi, Dirk Fahland, and Wil M. P. van der Aalst. “Conformance Checking
Based on Partially Ordered Event Data”. In: Business Process Management Work-
shops - BPM 2014 International Workshops, Eindhoven, The Netherlands, Septem-
ber 7-8, 2014, Revised Papers. Ed. by Fournier, Fabiana and Jan Mendling. Vol. 202.
Lecture Notes in Business Information Processing. Springer, 2014, pp. 75–88. doi:
10.1007/978-3-319-15895-2_7.
[8] Pegoraro, Marco and Wil M. P. van der Aalst. “Mining Uncertain Event Data in
Process Mining”. In: International Conference on Process Mining, ICPM 2019,
Aachen, Germany, June 24-26, 2019. IEEE, 2019, pp. 89–96. doi:10 . 1109 /
ICPM.2019.00023.
[9] Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “Conformance
Checking over Uncertain Event Data”. In: Information Systems (2021), p. 101810.
doi:10.1016/j.is.2021.101810.
[10] Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “Discovering
Process Models from Uncertain Event Data”. In: Business Process Management
Workshops - BPM 2019 International Workshops, Vienna, Austria, September 1-
6, 2019, Revised Selected Papers. Ed. by Francescomarino, Chiara Di, Remco M.
Dijkman, and Uwe Zdun. Vol. 362. Lecture Notes in Business Information Pro-
cessing. Springer, 2019, pp. 238–249. doi:10 .1007 / 978- 3 - 030- 37453 -
2_20.
[11] Zhu, Huisheng, Peng Wang, Xianmang He, et al. “Ecient Episode Mining with
Minimal and Non-overlapping Occurrences”. In: ICDM 2010, The 10th IEEE
International Conference on Data Mining, Sydney, Australia, 14-17 December
2010. Ed. by Webb, Geofrey I., Bing Liu, Chengqi Zhang, et al. IEEE Computer
Society, 2010, pp. 1211–1216. doi:10.1109/ICDM.2010.25.