Efﬁcient Time and Space Representation

of Uncertain Event Data

Marco Pegoraro¹, Merih Seran Uysal¹, and Wil M.P. van der Aalst¹

¹Chair of Process and Data Science (PADS), Department of Computer Science, RWTH Aachen University, Aachen, Germany

{pegoraro, uysal, vwdaalst}@pads.rwth-aachen.de

Abstract

Process mining is a discipline which concerns the analysis of execution data of operational processes, the extraction of models from event data, the measurement of the conformance between event data and normative models, and the enhancement of all aspects of processes. Most approaches assume that event data accurately captures behavior. However, this is not realistic in many applications: data can contain uncertainty, generated from errors in recording, imprecise measurements, and other factors. Recently, new methods have been developed to analyze event data containing uncertainty; these techniques prominently rely on representing uncertain event data by means of graph-based models explicitly capturing uncertainty. In this paper, we introduce a new approach to efficiently calculate a graph representation of the behavior contained in an uncertain process trace. We present our novel algorithm, prove its asymptotic time complexity, and show experimental results that highlight order-of-magnitude performance improvements for the behavior graph construction.

Keywords: Process Mining · Uncertain Data · Partial Order.

Colophon

This work is licensed under a Creative Commons “Attribution-NonCommercial 4.0 In-

ternational” license.

©the authors. Some rights reserved.

This document is an Author Accepted Manuscript (AAM) corresponding to the following scholarly paper:

Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “Efficient Time and Space Representation of Uncertain Event Data”. In: Algorithms 13.11 (2020), p. 285. doi:10.3390/a13110285

Please, cite this document as shown above.

Publication chronology:

• 2020-09-30: full text submitted to MDPI Algorithms, special issue Process Mining and Emerging Applications
• 2020-10-21: major revision requested
• 2020-10-30: revised version submitted
• 2020-11-06: notification of acceptance
• 2020-11-08: camera-ready version submitted
• 2020-11-09: published

Correspondence to:

Marco Pegoraro, Chair of Process and Data Science (PADS), Department of Computer Science,

RWTH Aachen University, Ahornstr. 55, 52074 Aachen, Germany

Website: http://mpegoraro.net/ · Email: pegoraro@pads.rwth-aachen.de · ORCID: 0000-0002-8997-7517

Content: 39 pages, 16 figures, 5 tables, 33 references. Typeset with pdfLaTeX, Biber, and BibLaTeX.

Please do not print this document unless strictly necessary.

M. Pegoraro et al. Eﬃcient Time and Space Representation of Uncertain Data

1 Introduction

The pervasive diffusion of digitization, which gained momentum thanks to advancements in electronics and computing at the end of the last century, brought a wave of innovation in the tools supporting businesses and companies. The past decades have seen the rise of Process-Aware Information Systems (PAISs), useful to structurally support processes in a business, as well as research disciplines such as Business Process Management (BPM) and process mining.

Process mining [2] is a field of research that enables process analysis in a data-driven manner. Process mining analyses are based on recordings of tasks and events in a process, stored in an ensemble of information systems which support business operations. These recordings are exported and systematically collected in databases called event logs. Using an event log as a starting point, process mining techniques can automatically obtain a process model illustrating the behavior of the real-life process (process discovery) and identify anomalies and deviations between the execution data of a process and a normative model (conformance checking). Process mining is a subfield of data science which is quickly growing in interest both in academia and industry. Over 30 commercial software tools are available on the market for analyzing processes and their execution data. Process mining tools are used by process experts to analyze processes in tens of thousands of organizations; within Siemens, for example, over 6000 employees actively use process mining to improve internal procedures.

Commercial process mining software is able to discover and build a process model from an event log. Most of the process discovery algorithms implemented in these tools are based on tallying the number of directly-follows relationships between activities in the execution data of the process. The more frequently a specific activity immediately follows another one in the execution log of a process, the stronger a causality and/or precedence implication between the two activities is understood to be. Such directly-follows relationships are also the basis for the identification of more complex and abstract constructs in the workflow of a process, such as interleaving or parallelism of activities. These relationships between activities are often represented in a labeled directed graph called the Directly-Follows Graph (DFG).

In recent times, a new type of event log has gained research interest: uncertain event logs [26]. Such execution logs contain, rather than precise values, an indication of the possible values that event attributes can acquire. In this paper, we will consider the setting where uncertainty is represented by either an interval or a set of possible values for an event attribute. Moreover, we will consider the case in which an event has been recorded in the event log even though it did not happen in reality.

Uncertainty in event logs is best illustrated with a real-life example of a process that can generate uncertain data in an information system. Let us consider the following process instance, a simplified version of anomalies that actually occur in processes of the healthcare domain. An elderly patient enrolls in a clinical trial for an experimental treatment against myeloproliferative neoplasms, a class of blood cancers. The enrollment in this trial includes a lab exam and a visit with a specialist; then, the treatment can begin. The lab exam, performed on the 8th of July, finds a low level of platelets in the blood of the patient, a condition known as thrombocytopenia (TP). At the visit, on the 10th of July, the patient self-reports an episode of night sweats on the night of the 5th of July, prior to the lab exam: the medic notes this, but also hypothesizes that it might not be a symptom, since it can be caused not by the condition but by external factors (such as very warm weather). The medic also reads the medical records of the patient and sees that, shortly prior to the lab exam, the patient was undergoing a heparin treatment (a blood-thinning medication) to prevent blood clots. The thrombocytopenia found with the lab exam can then be primary (caused by the blood cancer) or secondary (caused by other factors, such as a drug). Finally, the medic finds an enlargement of the spleen in the patient (splenomegaly). It is unclear when this condition developed: it might have appeared at any moment prior to that point. The medic decides to admit the patient to the clinical trial, starting on the 12th of July.

Table 1: The uncertain trace of an instance of a healthcare process used as running example. The “Case ID” is a unique identifier for all events in a single process case; the “Event ID” is a unique identifier for the events in the trace. The “Timestamp” field indicates either the moment in time in which the event has happened, or the interval of time in which the event may have happened. The “Activity” field indicates the possible choices for the activity instantiated by the event. Lastly, the “Indeterminate event” field contains a “!” if the corresponding event has surely occurred, and a “?” if it might have been recorded despite not occurring in reality. For the sake of readability, the timestamp column only reports the day of the month.

Case ID   Event ID   Timestamp   Activity         Indet. event
ID327     e1         5           NightSweats      ?
ID327     e2         8           {PrTP, SecTP}    !
ID327     e3         [4, 10]     Splenomeg        !
ID327     e4         12          Adm              !

These events generate the trace of Table 1 in the information system of the hospital. For clarity, the timestamp field only reports the day of the month.

Event e2 has been recorded with two possible activity labels (PrTP or SecTP). This is an example of uncertainty on activities. Some events, e.g. e3, do not have a precise timestamp; instead, a time interval in which the event could have happened has been recorded. In some cases, this causes the loss of a precise ordering of events (e.g. e2 and e3). This is an instance of uncertainty on the time dimension, i.e., on timestamps. As indicated by the “?” symbol, e1 is an indeterminate event: it has been recorded, but it is not guaranteed to have actually happened. Conversely, the “!” symbol indicates that the event has been recorded while certainly occurring in reality, i.e., it has been recorded correctly in the information system (e.g., the event e4).
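To make the three kinds of uncertainty concrete, the trace of Table 1 can be written down as plain data. The following is a minimal sketch: the class name `UncertainEvent`, its field names, and the helper `uncertainty_kinds` are hypothetical and are not taken from the PROVED library. A precise timestamp is modeled here as an interval whose two bounds coincide, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class UncertainEvent:
    """One row of Table 1 (hypothetical container, for illustration only)."""
    event_id: str
    activities: FrozenSet[str]   # possible activity labels
    t_min: int                   # earliest possible day of the month
    t_max: int                   # latest possible day of the month
    indeterminate: bool = False  # True corresponds to "?", False to "!"


# The running example; precise timestamps are stored as t_min == t_max.
RUNNING_EXAMPLE: List[UncertainEvent] = [
    UncertainEvent("e1", frozenset({"NightSweats"}), 5, 5, indeterminate=True),
    UncertainEvent("e2", frozenset({"PrTP", "SecTP"}), 8, 8),
    UncertainEvent("e3", frozenset({"Splenomeg"}), 4, 10),
    UncertainEvent("e4", frozenset({"Adm"}), 12, 12),
]


def uncertainty_kinds(event: UncertainEvent) -> List[str]:
    """List which kinds of uncertainty affect a single event."""
    kinds = []
    if len(event.activities) > 1:
        kinds.append("activity")
    if event.t_min < event.t_max:
        kinds.append("timestamp")
    if event.indeterminate:
        kinds.append("indeterminate")
    return kinds


for event in RUNNING_EXAMPLE:
    print(event.event_id, uncertainty_kinds(event))
```

In this encoding e1 carries only the indeterminacy flag, e2 only activity uncertainty, e3 only timestamp uncertainty, and e4 is fully certain, matching the description above.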

Quality problems and imprecision in data recording, such as the ones described in the running example as sources of uncertainty, are not uncommon; in some settings, they are a frequent occurrence. Healthcare processes are specifically known to be afflicted by these sorts of data anomalies, especially if parts of the process rely on recording information on paper [3,17]. Existing process mining software cannot manage such uncertain event data. When mining processes where uncertainty in execution data is prominent, a natural first approach is to filter the event log, eliminating the cases where uncertainty appears. Unfortunately, in processes where a large portion of cases is affected by such data anomalies, filtering without losing essential information about the process is not feasible.

As a consequence, new process mining methods to inspect and analyze uncertain event data have to be developed. Uncertain timestamps are the most prominent and critical source of uncertain behavior in a process trace. For example, if n events have uncertain timestamps such that their order is unknown, the possible configurations that the control-flow of the trace can assume are all the n! permutations of the events, in the case where all events in a case have timestamps defined by mutually overlapping intervals. This is the worst possible scenario in terms of the amount of uncertain behavior introduced by uncertainty on the timestamps of the events in a trace. Thus, it is important to capture the time relationships between events in a compact and effective way. This is accomplished by the construction of a behavior graph, a directed acyclic graph that expresses precedence between events. Figure 1 shows the behavior graph of the process trace in Table 1; every known precedence relationship between events is represented by the edges of the graph, while the pairs of events for which the order is unknown remain unconnected. Effectively, this creates a representation of the partial order where the arcs are defined by the possible values of the timestamps contained in the trace, and where the nodes may refer to sets of possible activities. As we will see, this construct is central to effectively implementing both process discovery and conformance checking applied to uncertain event data.

In a previous paper [28], we presented a time-effective algorithm for the construction of the behavior graph of an uncertain process trace, attaining quadratic time complexity on the number of events in the trace.

This paper elaborates on this previous result by providing the proof of the correctness of the new algorithm. Additionally, we will show the improvement in performance both theoretically, via asymptotic complexity analysis, and in practice, with experiments on various uncertain event logs comparing computation times of the baseline method against the novel construction algorithm. Furthermore, the version of the algorithms presented in this paper is refined so as to preprocess uncertain traces in linear time, identifying the variants (traces which share the same behavior graph), and to perform the construction of the behavior graph only once per variant. This slightly improves performance and, more importantly, enables the representation of an uncertain event log as a multiset of behavior graphs, greatly reducing the memory requirements to store the log. This enables a streamlined application of process mining techniques on event data where uncertainty is present.

[Figure 1: graph with four nodes: e1 (NightSweats), e2 ({PrTP, SecTP}), e3 (Splenomeg), e4 (Adm).]

Figure 1: The behavior graph of the trace in Table 1. Every node represents an event; the labels in the nodes represent the activity, or set of activities, associated with the event. The arcs represent the partial order relationship between events as defined by their timestamps. The indeterminate event, which might not have occurred, is represented by a dashed node.

The algorithms have been implemented within the PROVED (PRocess mining OVer uncErtain Data) library¹, based on the PM4Py process mining framework [10].

The remainder of the paper is structured as follows: Section 2 motivates the study of uncertainty in process mining by illustrating an example of conformance checking over uncertain event data. Section 3 strengthens the motivation by showing the discovery of process models of uncertain event logs. Section 4 provides formal definitions, describes the baseline technique for our research, and shows a new and more efficient method to obtain a behavior graph of an uncertain trace. Section 5 presents the analysis of asymptotic complexity for both the baseline and the novel method. Section 6 shows results of experiments on both synthetic and real-life uncertain event logs comparing the efficiency of both methods to compute behavior graphs. Section 7 explores recent related works in the context of uncertain event data and the management of alterations of data in process mining. Finally, Section 8 discusses the output of the experiments and concludes the paper.

2 Conformance Checking over Uncertain Data

Conformance checking is one of the main tasks in process mining, and consists in measuring the deviation between process execution data (usually in the form of a trace) and a reference model. This is particularly useful for organizations, since it enables them to compare historical process data against a normative model created by process experts to identify anomalies and deviations in their operations.

¹ https://github.com/proved-py/proved-core/tree/Efficient_Time_and_Memory_Representation_for_Uncertain_Event_Data

[Figure 2: Petri net with transitions t1 to t6; the visible transition labels are NightSweats, Splenomeg, PrTP and Adm.]

Figure 2: A normative model for the healthcare process case in the running example. The initial marking is displayed; the gray “token slot” represents the final marking.

Let us assume that we have access to a normative model for the disease of the patient

in the running example, shown in Figure 2.

This model essentially states that the disease is characterised by the occurrence of night sweats and splenomegaly in the patient, which may happen concurrently, and then should be followed by primary thrombocytopenia. We would like to measure the conformance between the trace in Table 1 and this normative model. A very popular conformance checking technique works via the computation of alignments [4]. Through this technique, we are able to identify the deviations in the execution of a process, in the form of behavior happening in the model but not in the trace, and behavior happening in the trace but not in the model. These deviations are identified, and used as a basis to compute a conformance score between the trace and the process model.

The formulation of alignments in [4] is not applicable to an uncertain trace. In fact, depending on the instantiation of the uncertain attributes of events, like the timestamp of e3 in the trace, the order of events may differ, and so may the conformance score. However, we can look at the best- and worst-case scenarios: the instantiations of attributes of the trace that entail the minimum and maximum number of deviations with respect to the reference model. In our example, two possible outcomes for the sample trace are $\langle$NightSweats, Splenomeg, PrTP, Adm$\rangle$ and $\langle$SecTP, Splenomeg, Adm$\rangle$; both represent a sequence of events that might have happened in reality, but their conformance scores are very different. The alignment of the first trace against the reference model can be seen in Table 2, while the alignment of the second trace can be seen in Table 3. These two outcomes of the uncertain trace in Table 1 represent, respectively, the minimum and maximum amount of deviation possible with respect to the reference model, and thus define a lower and upper bound for the conformance score.
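The idea of bounding conformance by the best and worst realization can be illustrated by brute force on the running example. The sketch below is not the uncertain alignment technique of [26], which works on a behavior net: it simply enumerates every realization of the trace and scores each one. Two assumptions are made and should be read as such: the language of the model in Figure 2 is written out by hand as two sequences (night sweats and splenomegaly in either order, then PrTP, then Adm), and the deviation cost is computed as the number of unmatched events under unit costs for moves on log and moves on model, with silent transitions costing nothing. All function names are hypothetical.

```python
from itertools import permutations, product

# (event id, possible activities, t_min, t_max, indeterminate) as in Table 1.
TRACE = [
    ("e1", ("NightSweats",), 5, 5, True),
    ("e2", ("PrTP", "SecTP"), 8, 8, False),
    ("e3", ("Splenomeg",), 4, 10, False),
    ("e4", ("Adm",), 12, 12, False),
]

# ASSUMPTION: visible language of the model in Figure 2, listed by hand.
MODEL_LANGUAGE = [
    ("NightSweats", "Splenomeg", "PrTP", "Adm"),
    ("Splenomeg", "NightSweats", "PrTP", "Adm"),
]


def realizations(trace):
    """All activity sequences that the uncertain trace may stand for."""
    mandatory = [e for e in trace if not e[4]]
    optional = [e for e in trace if e[4]]
    result = set()
    for keep in product([True, False], repeat=len(optional)):
        events = mandatory + [e for e, k in zip(optional, keep) if k]
        for order in permutations(events):
            n = len(order)
            # Reject orderings in which a later event surely precedes an earlier one.
            if any(order[j][3] < order[i][2]
                   for i in range(n) for j in range(i + 1, n)):
                continue
            for labels in product(*(e[1] for e in order)):
                result.add(labels)
    return result


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def deviation_cost(sequence):
    """Unmatched events against the closest model sequence (stand-in for alignment cost)."""
    return min(len(sequence) + len(m) - 2 * lcs_length(sequence, m)
               for m in MODEL_LANGUAGE)


def conformance_bounds(trace):
    costs = [deviation_cost(r) for r in realizations(trace)]
    return min(costs), max(costs)


print(conformance_bounds(TRACE))
```

Under these assumptions the enumeration yields the same lower and upper bounds as the costs reported for Tables 2 and 3. Enumeration grows factorially with the number of overlapping events, which is why a compact representation of the trace is needed in practice.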

The minimum and maximum bounds for the conformance score of an uncertain trace and a reference process model can be found with the uncertain version of the alignment technique that we first described in [26]. In order to find such bounds, it is necessary to build a Petri net able to simulate all possible behaviors in the uncertain trace, called the behavior net. Obtaining a behavior net is possible through a construction that uses behavior graphs as a starting point, using the structural information therein contained to connect places and transitions in the net. The behavior net of the trace in Table 1 is shown in Figure 3.

Table 2: An optimal alignment for $\langle$NightSweats, Splenomeg, PrTP, Adm$\rangle$, one of the possible instantiations of the trace in Table 1, against the model in Figure 2. This alignment has a deviation cost equal to 0, and corresponds to the best case scenario for conformance between the process model and the uncertain trace.

Trace:        NightSweats, Splenomeg, PrTP, Adm
Model:        τ, NightSweats, Splenomeg, τ, PrTP, Adm
Transitions:  t1, t2, t3, t4, t5, t6

Table 3: An optimal alignment for $\langle$SecTP, Splenomeg, Adm$\rangle$, one of the possible instantiations of the trace in Table 1, against the model in Figure 2. This alignment has a deviation cost equal to 3, caused by 2 moves on model and 1 move on log, and corresponds to the worst case scenario for conformance between the process model and the uncertain trace.

Trace:        SecTP, Splenomeg, Adm
Model:        τ, NightSweats, Splenomeg, τ, PrTP, Adm
Transitions:  t1, t2, t3, t4, t5, t6

The alignments in Tables 2 and 3 show how we can get actionable insights from process mining over uncertain data. In some applications it is reasonable and appropriate to remove uncertain data from an event log via filtering, and then compute log-level aggregate information (such as the total number of deviations, or the average deviations per trace) using the remaining certain data. Even in processes where this is possible, doing so prevents the important process mining task of case diagnostics. Conversely, uncertain alignments make it possible not only to obtain best- and worst-case scenarios for a trace, but also to identify the specific deviations affecting both scenarios. For instance, the alignments of the running example can be implemented in a system that warns the medics that the patient might have been affected by a secondary thrombocytopenia not explained by the model of the disease. Since the model indicates that the disease should develop primary thrombocytopenia as a symptom, this patient is at risk of both types of platelet deficit simultaneously, which is a serious situation. The medics can then intervene to avoid this complication, and perform more exams to ascertain the cause of the patient’s thrombocytopenia.

3 Process Discovery over Uncertain Data

Process discovery is another main objective in process mining, and involves automatically creating a process model from event data. Many process discovery algorithms rely on the concept of directly-follows relationships between activities to gather clues on how to structure the process model. Uncertain Directly-Follows Graphs (UDFGs) enable the representation of directly-follows relationships in an event log under conditions of uncertainty in the event data; they are directed graphs where the activity labels appearing in an event log constitute the nodes, and the edges are decorated with information on the minimum and maximum frequency observable for the directly-follows relation between pairs of activities.

[Figure 3: Petri net with places (start, e1), (e1, e2), (e2, e4), (start, e3), (e3, e4), (e4, end) and transitions (e1, NightSweats), (e1, τ), (e2, PrTP), (e2, SecTP), (e3, Splenomeg), (e4, Adm).]

Figure 3: The behavior net representing the behavior of the uncertain trace in Table 1 and obtained thanks to its behavior graph. The initial marking is displayed; the gray “token slot” represents the final marking. This artifact is necessary to perform conformance checking between uncertain traces and a reference model.

Let us examine an example of UDFG. In order to build a significant example, we need to introduce an entire uncertain event log; since the full table notation for uncertain traces becomes cumbersome for entire logs, let us utilize a shorthand simplified notation. In a trace, we represent an uncertain event with multiple possible activity labels by listing all the associated labels between curly braces.

When two events have mutually overlapping timestamps, we write their activity labels between square brackets, and we indicate indeterminate events by overlining them². For instance, the trace $\langle \overline{a}, \{b, c\}, [d, e] \rangle$ is a trace containing 4 events, of which the first is an indeterminate event with activity label $a$, the second is an uncertain event that can have either $b$ or $c$ as activity label, and the last two events have an interval as timestamp (and the two ranges overlap). Let us consider the following event log:

$\langle a, b, e, f, g, h \rangle^{80}, \langle a, \{b, c\}, [e, f], g, i \rangle^{15}, \langle a, \{b, c, d\}, [e, f], g, j \rangle^{5}$.

For each pair of activities, we can count the minimum and maximum occurrences of a directly-follows relationship that can be observed in the log. The resulting UDFG is shown in Figure 4.

² Notice that this notation does not allow for the representation of every possible uncertain trace: in the case of timestamp uncertainty, it can only express mutual overlapping of time intervals. However, this notation is adequate to illustrate an example for process discovery under uncertainty.

[Figure 4: directed graph over the activities a to j; every arc carries a [minimum, maximum] frequency label.]

Figure 4: The Uncertain Directly-Follows Graph (UDFG) computed based on the uncertain event log $\langle a, b, e, f, g, h \rangle^{80}, \langle a, \{b, c\}, [e, f], g, i \rangle^{15}, \langle a, \{b, c, d\}, [e, f], g, j \rangle^{5}$. The arcs are labeled with the minimum and maximum number of directly-follows relationships observable between activities in the corresponding trace. Uncertain directly-follows relationships are inferred from the behavior graphs of the traces in the log. The construction of this object is necessary to perform automatic process discovery over uncertain event data.
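The counting of minimum and maximum directly-follows occurrences can be sketched by brute force over the shorthand notation. This is an illustrative enumeration and not the behavior-graph-based procedure used for UDFGs. The encoding is hypothetical: a trace is a list of groups, where a group holds events with mutually overlapping timestamps, and each event is a tuple of possible labels plus an indeterminacy flag. The log below is transcribed as it appears in the text, without any indeterminate events, so arcs that depend on overlined events may differ from Figure 4.

```python
from collections import Counter
from itertools import permutations, product


def ev(*labels, indeterminate=False):
    """Hypothetical event encoding: (possible labels, indeterminate flag)."""
    return (tuple(labels), indeterminate)


# Each entry: (trace as a list of groups of mutually overlapping events, frequency).
LOG = [
    ([[ev("a")], [ev("b")], [ev("e")], [ev("f")], [ev("g")], [ev("h")]], 80),
    ([[ev("a")], [ev("b", "c")], [ev("e"), ev("f")], [ev("g")], [ev("i")]], 15),
    ([[ev("a")], [ev("b", "c", "d")], [ev("e"), ev("f")], [ev("g")], [ev("j")]], 5),
]


def realizations(trace):
    """Yield every activity sequence the shorthand trace may stand for."""
    for ordering in product(*(permutations(group) for group in trace)):
        events = [e for group in ordering for e in group]
        # An indeterminate event may also be absent, encoded as None.
        choices = [labels + ((None,) if indet else ()) for labels, indet in events]
        for picked in product(*choices):
            yield tuple(label for label in picked if label is not None)


def udfg(log):
    """Map each activity pair to (min, max) directly-follows counts over the log."""
    lo, hi = Counter(), Counter()
    for trace, freq in log:
        per_realization = [Counter(zip(r, r[1:])) for r in realizations(trace)]
        pairs = set().union(*per_realization)
        for pair in pairs:
            counts = [c[pair] for c in per_realization]
            lo[pair] += freq * min(counts)
            hi[pair] += freq * max(counts)
    return {pair: (lo[pair], hi[pair]) for pair in hi}


graph = udfg(LOG)
for pair in sorted(graph):
    print(pair, graph[pair])
```

Because traces are independent of each other, the log-level minimum (maximum) for a pair is the frequency-weighted sum of the per-trace minima (maxima), which is what the two counters accumulate.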

This graph can then be utilized to discover process models of uncertain logs via process discovery methods based on directly-follows relationships. In a previous work [27] we illustrated this principle by applying it to the Inductive Miner, a popular discovery algorithm [27]; the edges of the UDFG can be filtered using the information on the labels, in such a way that the final model can represent all possible behavior in the uncertain log, or only a part. Figure 5 shows some process models obtained through inductive mining of the UDFG, as well as a description regarding how the model relates to the original uncertain log.

UDFGs of uncertain event data are obtained on the basis of the behavior graphs of the traces in an uncertain event log, making their construction a necessary step to perform uncertain process discovery. In fact, the frequency information labeling the edges of UDFGs is obtained through a search among the possible connections within the behavior graphs of all the traces in an uncertain log.

Thus, the construction of behavior graphs for uncertain traces is the basis of both process discovery and conformance checking on uncertain event data, since the behavior graph is a necessary processing step to mine information from uncertain traces. It is then important to be able to quickly and efficiently build the behavior graph of any given uncertain trace, in order to enable performant process discovery and conformance checking.

[Figure 5: three process model diagrams, panels (a), (b) and (c).]

(a) A process model that can only replay the relationships appearing in the certain parts of the traces in the uncertain log. Here, information from uncertainty has been excluded completely.

(b) A process model that can replay some, but not all, of the relationships appearing in the uncertain parts of the traces in the uncertain log. This process model mediates between representing only certain observations and representing all the possible behavior in the process.

(c) A process model that can replay all possible configurations of certain and uncertain traces in the uncertain log. This process model has the highest possible replay fitness, but is also very likely to contain some noisy or otherwise unwanted behavior.

Figure 5: Three different process models for the uncertain event log $\langle a, b, e, f, g, h \rangle^{80}, \langle a, \{b, c\}, [e, f], g, i \rangle^{15}, \langle a, \{b, c, d\}, [e, f], g, j \rangle^{5}$ obtained through inductive mining over an uncertain directly-follows graph. The different filtering parameters for the UDFG yield models with distinct features.

4 Materials and Methods

4.1 Preliminaries

Let us illustrate some basic concepts and notations, partially from [2]:

Definition 1 (Power set). The power set of a set $A$ is the set of all possible subsets of $A$, and is denoted with $\mathcal{P}(A)$. $\mathcal{P}_{NE}(A)$ denotes the set of all the non-empty subsets of $A$: $\mathcal{P}_{NE}(A) = \mathcal{P}(A) \setminus \{\emptyset\}$.

Definition 2 (Multiset). A multiset is an extension of the concept of set that keeps track of the cardinality of each element. $\mathcal{B}(A)$ is the set of all multisets over some set $A$. Multisets are denoted with square brackets, e.g. $b = [x, x, y]$, or with the cardinality of the elements as superscript, e.g. $b = [x^2, y]$. We denote the empty multiset with $[\,]$. The operator $(\cdot)$ retrieves the cardinality of an element of the multiset, e.g. $b(x) = 2$, $b(y) = 1$, $b(z) = 0$. Over multisets we define $x \in b \Leftrightarrow b(x) \geq 1$, and $set(b) = \{x \in b\}$. The multiset union $b = b_1 \uplus b_2$ is the multiset $b$ such that for all $x$ we have $b(x) = b_1(x) + b_2(x)$.
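The multiset operations of Definition 2 map closely onto `collections.Counter` from the Python standard library. The short sketch below is only an illustration of the notation; the helper names `multiset_union` and `support` are hypothetical.

```python
from collections import Counter

# b = [x, x, y], equivalently [x^2, y]
b = Counter(["x", "x", "y"])

# b(x) = 2, b(y) = 1, b(z) = 0: a Counter returns 0 for absent elements.
print(b["x"], b["y"], b["z"])


def multiset_union(b1: Counter, b2: Counter) -> Counter:
    """Multiset union: cardinalities are added element by element."""
    return b1 + b2


def support(multiset: Counter) -> set:
    """set(b): the elements with cardinality at least 1."""
    return {x for x, n in multiset.items() if n >= 1}


union = multiset_union(b, Counter(["y", "z"]))
print(union)           # cardinalities: x -> 2, y -> 2, z -> 1
print(support(union))
```

A multiset of this kind is also what allows a log to be stored as a collection of variants with frequencies rather than as a list of individual traces.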

Definition 3 (Sequence and permutation). Given a set $X$, a finite sequence over $X$ of length $n$ is a function $s \in X^* : \{1, \dots, n\} \to X$, and is written as $s = \langle s_1, s_2, \dots, s_n \rangle$. For any sequence $s$ we define $|s| = n$, $s[i] = s_i$, $x \in s \Leftrightarrow x \in \{s_1, s_2, \dots, s_n\}$ and $s \oplus s' = \langle s_1, s_2, \dots, s_n, s' \rangle$. A permutation of the set $X$ is a sequence $x_S$ that contains all elements of $X$ without duplicates: $x_S \in X$, $X \in x_S$, and for all $1 \leq i \leq |x_S|$ and for all $1 \leq j \leq |x_S|$, $x_S[i] = x_S[j] \rightarrow i = j$. We denote with $S_X$ all such permutations of set $X$. We overload the notation for sequences: given a sequence $s = \langle s_1, s_2, \dots, s_n \rangle$, we will write $S_s$ in place of $S_{\{s_1, s_2, \dots, s_n\}}$.

Definition 4 (Transitive relation and correct evaluation order). Let $X$ be a set of objects and $R$ be a binary relation $R \subseteq X \times X$. $R$ is transitive if and only if for all $x, x', x'' \in X$ we have that $(x, x') \in R \wedge (x', x'') \in R \rightarrow (x, x'') \in R$. A correct evaluation order is a permutation $s \in S_X$ of the elements of the set $X$ such that for all $1 \leq i < j \leq |s|$ we have that $(s[i], s[j]) \in R$.

Definition 5 (Strict partial order). Let $S$ be a set of objects. Let $s, s' \in S$. A strict partial order $(\prec, S)$ is a binary relation that has the following properties:

• Irreflexivity: $s \prec s$ is false.
• Transitivity: $s \prec s'$ and $s' \prec s''$ imply $s \prec s''$.³

³ Formally, the third property of strict partial orders is antisymmetry: $s \prec s'$ implies that $s' \prec s$ is false. It is implied by irreflexivity and transitivity [14].


Definition 6 (Directed graph). A directed graph $G \in \mathcal{U}_G$ is a tuple $(V, E)$ where $V$ is the set of vertices and $E \subseteq V \times V$ is the set of directed edges. The set $\mathcal{U}_G$ is the graph universe. A path in a directed graph $G = (V, E)$ is a sequence of vertices $p$ such that for all $1 < i < |p| - 1$ we have that $(p_i, p_{i+1}) \in E$. We denote with $P_G$ the set of all such possible paths over the graph $G$. Given two vertices $v, v' \in V$, we denote with $p_G(v, v')$ the set of all paths beginning in $v$ and ending in $v'$: $p_G(v, v') = \{p \in P_G \mid p[1] = v \wedge p[|p|] = v'\}$. $v$ and $v'$ are connected (and $v'$ is reachable from $v$), denoted by $v \overset{G}{\mapsto} v'$, if and only if there exists a path between them in $G$: $p_G(v, v') \neq \emptyset$. Conversely, $v \overset{G}{\not\mapsto} v' \Leftrightarrow p_G(v, v') = \emptyset$. We drop the superscript $G$ if it is clear from the context. A directed graph $G$ is acyclic if there exists no path $p \in P_G$ satisfying $p[1] = p[|p|]$.

Definition 7 (Topological sorting). Let $G = (V, E)$ be an acyclic directed graph. A topological sorting [16] $o_G = \langle v_1, v_2, \dots, v_{|V|} \rangle \in S_V$ is a permutation of the vertices of $G$ such that for all $1 \leq i < j \leq |V|$ we have that $v_j \not\mapsto v_i$. We denote with $O_G \subseteq S_V$ all such possible topological sortings over $G$.
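Definition 7 can be checked literally on a small graph by filtering all permutations of the vertices. The sketch below is a brute-force illustration intended only for tiny graphs; the function names and the diamond-shaped example graph are hypothetical and do not come from the paper.

```python
from itertools import permutations


def reachable(edges, source):
    """Vertices reachable from `source` through at least one edge."""
    seen, stack = set(), [source]
    while stack:
        v = stack.pop()
        for a, b in edges:
            if a == v and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def topological_sortings(vertices, edges):
    """All permutations where no later vertex reaches an earlier one."""
    reach = {v: reachable(edges, v) for v in vertices}
    result = []
    for order in permutations(vertices):
        n = len(order)
        if all(order[i] not in reach[order[j]]
               for i in range(n) for j in range(i + 1, n)):
            result.append(order)
    return result


# Diamond-shaped DAG: a precedes b and c, which both precede d.
VERTICES = ["a", "b", "c", "d"]
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

for sorting in topological_sortings(VERTICES, EDGES):
    print(sorting)
```

In the diamond graph only the relative order of b and c is free, so exactly two sortings survive the filter.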

Definition 8 (Transitive reduction). A transitive reduction [6] $\rho : G \to G$ of a graph $G = (V, E)$ is a graph $\rho(G) = (V, E_r)$ with $E_r \subseteq E$ where every pair of vertices connected in $\rho(G)$ is not connected by any other path: for all $(v, v') \in E_r$, $p_G(v, v') = \{\langle v, v' \rangle\}$. $\rho(G)$ is the graph with the minimal number of edges that maintains the reachability between vertices of $G$. The transitive reduction of a directed acyclic graph always exists and is unique [6].

This paper proposes an analysis technique on uncertain event logs. These execution logs contain information about uncertainty explicitly associated with event data. A taxonomy of different types of uncertain event logs and attribute uncertainty has been described in [26]; we will refer to the notion of simple uncertainty, which includes uncertainty without probabilistic information on the control-flow perspective: activities, timestamps, and indeterminate events.

Definition 9 (Universes). Let $\mathcal{U}_I$ be the set of all the event identifiers. Let $\mathcal{U}_C$ be the set of all case ID identifiers. Let $\mathcal{U}_A$ be the set of all the activity identifiers. Let $\mathcal{U}_T$ be the totally ordered set of all the timestamp identifiers. Let $\mathcal{U}_O = \{!, ?\}$, where the “!” symbol denotes determinate events, and the “?” symbol denotes indeterminate events.

Definition 10 (Simple uncertain events). $e = (e_i, A, t_{min}, t_{max}, o)$ is a simple uncertain event, where $e_i \in \mathcal{U}_I$ is its event identifier, $A \in \mathcal{P}_{NE}(\mathcal{U}_A)$ is the set of possible activity labels for $e$, $t_{min}$ and $t_{max}$ are the lower and upper bounds for the value of its timestamp, and $o$ indicates if it is an indeterminate event. Let $\mathcal{U}_E = (\mathcal{U}_I \times \mathcal{P}_{NE}(\mathcal{U}_A) \times \mathcal{U}_T \times \mathcal{U}_T \times \mathcal{U}_O)$ be the set of all simple uncertain events. Over the uncertain event $e = (e_i, A, t_{min}, t_{max}, o)$ we define the projection functions $\pi_a(e) = A$, $\pi_{t_{min}}(e) = t_{min}$, $\pi_{t_{max}}(e) = t_{max}$ and $\pi_o(e) = o$.


Definition 11 (Simple uncertain traces and logs). $\sigma \subseteq U_E$ is a simple uncertain trace if for any $(e_i, A, t_{\min}, t_{\max}, o) \in \sigma$, $t_{\min} \le t_{\max}$ and all the event identifiers are unique. $T_U$ denotes the universe of simple uncertain traces. $L \subseteq T_U$ is a simple uncertain log if all the event identifiers in the log are unique.
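Definitions 9 to 11 can be mirrored by a small data structure. The sketch below is one possible encoding, not the authors' implementation: the class name `SimpleUncertainEvent`, its field names, and the use of floats as timestamps are all assumptions. The validity check allows $t_{\min} = t_{\max}$, which is the case of an event with a certain timestamp.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class SimpleUncertainEvent:
    event_id: str                 # element of the universe of event identifiers
    activities: FrozenSet[str]    # non-empty set of possible activity labels
    t_min: float                  # lower bound of the timestamp
    t_max: float                  # upper bound of the timestamp
    indeterminate: bool = False   # True encodes "?", False encodes "!"


def is_simple_uncertain_trace(events: Iterable[SimpleUncertainEvent]) -> bool:
    """Check unique identifiers, non-empty label sets, and ordered bounds."""
    events = list(events)
    ids = [e.event_id for e in events]
    return (len(ids) == len(set(ids))
            and all(e.activities for e in events)
            and all(e.t_min <= e.t_max for e in events))
```

Freezing the dataclass keeps events hashable, so a trace can be stored as a set, in line with $\sigma \subseteq U_E$.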

Definition 12 (Strict partial order over simple uncertain events). Let $e, e' \in U_E$ be two simple uncertain events. $(\prec, U_E)$ is an order defined on the universe of simple uncertain events $U_E$ as:

$$e \prec e' \Leftrightarrow \pi_{t_{\max}}(e) < \pi_{t_{\min}}(e')$$

Definition 13 (Order-realizations of simple uncertain traces). Let $\sigma \in T_U$ be a simple uncertain trace. An order-realization $\sigma_O = \langle e_1, e_2, \dots, e_{|\sigma|} \rangle \in S_\sigma$ is a permutation of the events in $\sigma$ such that for all $1 \le i < j \le |\sigma|$ we have that $e_j \nprec e_i$, i.e., $\sigma_O$ is a correct evaluation order for $\sigma$ over $(\prec, U_E)$, and the (total) order in which events are sorted in $\sigma_O$ is a linear extension of the strict partial order $(\prec, U_E)$. We denote with $R_O(\sigma)$ the set of all such order-realizations of the trace $\sigma$.
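To make Definitions 12 and 13 concrete, the following brute-force sketch enumerates the order-realizations of a three-event toy trace. The trace, the identifiers, and the function names are hypothetical and chosen only for illustration; the enumeration is exponential and is not meant as a practical method.

```python
from itertools import permutations

# Hypothetical toy trace: event id -> (t_min, t_max); the values are illustrative only.
trace = {"e1": (1, 1), "e2": (2, 4), "e3": (3, 3)}


def precedes(e, f):
    """Strict partial order of Definition 12: t_max(e) < t_min(f)."""
    return trace[e][1] < trace[f][0]


def order_realizations(events):
    """Enumerate the permutations allowed by Definition 13."""
    result = []
    for perm in permutations(events):
        # Reject the permutation if a later event certainly precedes an earlier one.
        if not any(precedes(perm[j], perm[i])
                   for i in range(len(perm)) for j in range(i + 1, len(perm))):
            result.append(perm)
    return result


print(order_realizations(trace))
# [('e1', 'e2', 'e3'), ('e1', 'e3', 'e2')]
```

Here e1 certainly precedes both other events, while e2 and e3 overlap, so exactly two orderings are possible.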

A necessary step to allow for analysis of simple uncertain traces is to obtain their

behavior graph. A behavior graph is a directed acyclic graph that synthesizes the infor-

mation regarding the uncertainty on timestamps contained in the trace.

Definition 14 (Behavior graph). Let $\sigma \in T_U$ be a simple uncertain trace. Let the identification function $id: \sigma \to \{1, 2, \dots, |\sigma|\}$ be a bijection between the events in $\sigma$ and the first $|\sigma|$ natural numbers. A behavior graph $\beta: T_U \to U_G$ is the transitive reduction $\rho(G)$ of a directed graph $G = (V, E) \in U_G$, which is defined⁴ as:

• $V = \{(id(e), \pi_a(e), \pi_o(e)) \mid e \in \sigma\}$

• $E = \{(v, w) \mid v, w \in V \wedge \pi_{t_{\max}}(v) < \pi_{t_{\min}}(w)\}$

The set of topological sortings of a behavior graph $\beta(\sigma)$ corresponds to the set of all the order-realizations of the trace $\sigma$.

Figures 6 and 7 show the transitive reduction operation on the running example.

The semantics of a behavior graph efficaciously communicate, in a compact manner, the time and order relationships among events in the corresponding uncertain trace. For a behavior graph $\beta(\sigma) = (V, E)$ and two events $e_1 \in \sigma$, $e_2 \in \sigma$, $(e_1, e_2) \in E$ holds if and only if $e_1$ is immediately followed by $e_2$ for some possible values of the timestamps of the events in the trace. A consequence of this fact is that if a pair of events in the graph are mutually unreachable, they might have occurred in any order.

⁴ A technical note: this definition for the nodes of the behavior graph is slightly different from the one in [26], to simplify the notation in algorithms. The two definitions are functionally identical.


[Figure 6 diagram: nodes e1 (NightSweats), e2 ({PrTP, SecTP}), e3 (Splenomeg), e4 (Adm).]

Figure 6: The behavior graph of the trace in Table 1 before applying the transitive reduction. All the nodes in the graph are pairwise connected based on precedence relationships; pairs of nodes for which the order is unknown are not connected.

[Figure 7 diagram: nodes e1 (NightSweats), e2 ({PrTP, SecTP}), e3 (Splenomeg), e4 (Adm).]

Figure 7: The same behavior graph after the transitive reduction. The arc between e1 and e4 is removed, since they are reachable through e2. This graph has a minimal number of arcs while conserving the same reachability relationship between nodes.

Definition 14 is meaningful and clear from a theoretical point of view. It rigorously defines a behavior graph and the semantics of its parts. While helpful to understand the function of behavior graphs, obtaining them from process traces following this definition, that is, utilizing the transitive reduction, is inefficient and slow. This hinders the analysis of logs with a large number of events, and with longer traces. It is nonetheless possible to build behavior graphs from process traces in a faster and more efficient way.
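The baseline construction that follows Definition 14 can be sketched in a few lines. The example below uses the running example of Table 4, with each date encoded as its day of December 2011 (an encoding chosen here for brevity). The reduction step relies on the observation that the precedence relation is transitively closed, so an edge is redundant exactly when an intermediate event exists; this shortcut is an assumption of this sketch and not necessarily how the baseline in [26] is implemented.

```python
# Table 4: event id -> (t_min, t_max), dates encoded as day of December 2011.
trace = {1: (5, 5), 2: (6, 10), 3: (7, 7), 4: (8, 11), 5: (9, 9), 6: (12, 13)}


def naive_behavior_graph(trace):
    # Step 1: connect every pair (v, w) with t_max(v) < t_min(w); O(n^2) comparisons.
    full = {(v, w) for v in trace for w in trace if trace[v][1] < trace[w][0]}
    # Step 2: transitive reduction. Since `full` is transitively closed, an edge is
    # redundant exactly when some intermediate node u gives (v, u) and (u, w).
    reduced = {(v, w) for (v, w) in full
               if not any((v, u) in full and (u, w) in full for u in trace)}
    return full, reduced


full, reduced = naive_behavior_graph(trace)
print(len(full), sorted(reduced))
# 11 [(1, 2), (1, 3), (2, 6), (3, 4), (3, 5), (4, 6), (5, 6)]
```

Four of the eleven initial edges are created only to be removed again, and the second step inspects every edge against every node, which is the source of the cubic cost.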

4.2 Efficient Construction of Behavior Graphs

The set of steps to efficiently create a behavior graph from an uncertain trace is separated into two distinct phases, described by Algorithms 1 and 2. An uncertain event $e$ is associated with a time interval which is determined by two values: the minimum and maximum timestamp of that event, $\pi_{t_{\min}}(e)$ and $\pi_{t_{\max}}(e)$. If an event $e$ has a certain timestamp, we have that $\pi_{t_{\min}}(e) = \pi_{t_{\max}}(e)$.

We will examine here the effect of Algorithms 1 and 2 on a running example, the process trace shown in Table 4. Notice that, in this running example, neither uncertainty on activity labels nor indeterminate events are present: this is because the topology of a behavior graph only depends on the (uncertain) timestamps of the events belonging to the corresponding trace.

The construction of the graph relies on a preprocessing step shown in Algorithm 1, where a support list $L$ is created (lines 4-8). Every entry in this list is a tuple of four elements. For each event $e$ in the trace, we insert two entries in the list, one for each of the timestamps $\pi_{t_{\min}}$ and $\pi_{t_{\max}}$ appearing in the trace. The four elements in each tuple contained in the list are:

• an identifier, which in the list construction is an integer representing the rank of the uncertain event by minimum timestamp (computed in line 3);


Algorithm 1: TimestampList(σ)

Input: An uncertain trace σ.
Output: The list of timestamps L of σ.

 1  L* ← ⟨⟩                                       // Support list
 2  L ← ⟨⟩                                        // List of event attributes
 3  E ← Sort(σ)                                   // Sorts uncertain events by minimum timestamp
 4  i ← 1
 5  while i ≤ |E| do
 6      L* ← L* ⊕ (π_tmin(E[i]), i, E[i], 'MIN')
 7      L* ← L* ⊕ (π_tmax(E[i]), i, E[i], 'MAX')
 8      i ← i + 1
 9  Sort(L*)                                      // Sorts the list based on timestamp value
10  i ← 1
11  while i ≤ |L*| do
12      (t, id, e, type) ← L*[i]
13      L ← L ⊕ (id, π_a(e), π_o(e), type)
14      i ← i + 1
15  return L

• the activity labels associated with the event, $\pi_a(e)$;

• the attribute $\pi_o(e)$, which will carry the information regarding indeterminate events;

• the type of timestamp that generated this entry, i.e., whether it is the minimum or the maximum of an interval.

As we can see, the list is designed to contain all information about an uncertain event

except the values of minimum and maximum timestamps, which we use to sort the list

(line 9) and then discard prior to returning the list (lines 10-15).

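The events of the trace in Table 4 are represented in the list $L^*$ by the entries shown in Table 5. These entries are then sorted by Algorithm 1, yielding the following list $L$:

$$L = \langle (1, \{a\}, !, \text{'MIN'}), (1, \{a\}, !, \text{'MAX'}), (2, \{b\}, !, \text{'MIN'}), (3, \{c\}, !, \text{'MIN'}), (3, \{c\}, !, \text{'MAX'}), (4, \{d\}, !, \text{'MIN'}), (5, \{e\}, !, \text{'MIN'}), (5, \{e\}, !, \text{'MAX'}), (2, \{b\}, !, \text{'MAX'}), (4, \{d\}, !, \text{'MAX'}), (6, \{f\}, !, \text{'MIN'}), (6, \{f\}, !, \text{'MAX'}) \rangle$$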
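The preprocessing step can be sketched in Python as follows, using the trace of Table 4 as input. This is an illustrative reading of Algorithm 1, not the PROVED implementation: the tuple layout and the function name `timestamp_list` are assumptions, and so is the tie-breaking rule that places 'MIN' entries before 'MAX' entries with the same timestamp, which the pseudocode leaves unspecified.

```python
from datetime import date

# Table 4 as (activity labels, t_min, t_max, event type) tuples.
trace = [
    (frozenset("a"), date(2011, 12, 5), date(2011, 12, 5), "!"),
    (frozenset("b"), date(2011, 12, 6), date(2011, 12, 10), "!"),
    (frozenset("c"), date(2011, 12, 7), date(2011, 12, 7), "!"),
    (frozenset("d"), date(2011, 12, 8), date(2011, 12, 11), "!"),
    (frozenset("e"), date(2011, 12, 9), date(2011, 12, 9), "!"),
    (frozenset("f"), date(2011, 12, 12), date(2011, 12, 13), "!"),
]


def timestamp_list(trace):
    events = sorted(trace, key=lambda e: e[1])     # line 3: rank by minimum timestamp
    support = []                                   # the support list L*
    for i, (labels, t_min, t_max, o) in enumerate(events, start=1):
        support.append((t_min, i, labels, o, "MIN"))
        support.append((t_max, i, labels, o, "MAX"))
    # Line 9: sort by timestamp. Assumption made here: on equal timestamps, 'MIN'
    # entries come before 'MAX' entries, so touching intervals count as overlapping.
    support.sort(key=lambda entry: (entry[0], entry[4] == "MAX", entry[1]))
    # Lines 10-15: drop the timestamps and keep the four remaining elements.
    return [(i, labels, o, kind) for _, i, labels, o, kind in support]


L = timestamp_list(trace)
print([(i, kind) for i, _, _, kind in L])
```

The printed identifiers and types follow the same order as the list $L$ given above, starting with (1, 'MIN'), (1, 'MAX'), (2, 'MIN') and ending with (6, 'MIN'), (6, 'MAX').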

One of the purposes the list $L$ serves is gathering the structural information to create the behavior graph; in fact, visiting the list in order is equivalent to sweeping the events of


Algorithm 2: BehaviorGraph(TimestampList(σ))

Input: The list L = TimestampList(σ) of an uncertain trace σ.
Output: The behavior graph β(σ) = (V, E).

 1  V ← {(id, π_a(e), π_o(e)) | (id, π_a(e), π_o(e), type) ∈ L}
 2  E ← ∅
 3  i ← 1
 4  while i < |L| do
 5      (id, a, o, type) ← L[i]
 6      if type = 'MAX' then
 7          j ← i + 1
 8          while j ≤ |L| do
 9              (id*, a*, o*, type*) ← L[j]
10              if type* = 'MIN' then
11                  E ← E ∪ {((id, a, o), (id*, a*, o*))}
12              else if ((id, a, o), (id*, a*, o*)) ∈ E then
13                  break
14              j ← j + 1
15      i ← i + 1
16  return (V, E)

Table 4: Running example for the creation of the behavior graph.

Case ID   Event ID   Activity   Timestamp                  Event Type
872       e1         a          05-12-2011                 !
872       e2         b          [06-12-2011, 10-12-2011]   !
872       e3         c          07-12-2011                 !
872       e4         d          [08-12-2011, 11-12-2011]   !
872       e5         e          09-12-2011                 !
872       e6         f          [12-12-2011, 13-12-2011]   !


Table 5: Entries of the list $L^*$ generated by each event in the uncertain trace. Every event $e$ has two associated entries, one marked as 'MIN' and the other as 'MAX'. Each entry contains the timestamp used for sorting, an integer that acts as event identifier, the set of possible activity labels $\pi_a(e)$ of the uncertain event, the indeterminate event attribute $\pi_o(e)$, and the type of timestamp ('MIN' or 'MAX').

Event   List L* entry (minimum timestamp)   List L* entry (maximum timestamp)
e1      (05-12-2011, 1, {a}, !, 'MIN')      (05-12-2011, 1, {a}, !, 'MAX')
e2      (06-12-2011, 2, {b}, !, 'MIN')      (10-12-2011, 2, {b}, !, 'MAX')
e3      (07-12-2011, 3, {c}, !, 'MIN')      (07-12-2011, 3, {c}, !, 'MAX')
e4      (08-12-2011, 4, {d}, !, 'MIN')      (11-12-2011, 4, {d}, !, 'MAX')
e5      (09-12-2011, 5, {e}, !, 'MIN')      (09-12-2011, 5, {e}, !, 'MAX')
e6      (12-12-2011, 6, {f}, !, 'MIN')      (13-12-2011, 6, {f}, !, 'MAX')

the trace on the time dimension, encountering each timestamp (minimum or maximum)

sorted through time. We can visualize this on the Gantt diagram representation of the

trace of Table 4, visible in Figure 8.

Every segment representing an uncertain event in the diagram is translated by

TimestampList into two entries in a sorted list, representing the two extremes of the

segment. Events without an uncertain timestamp collapse into a single point in the dia-

gram, and their corresponding two entries in the list are characterized by the same times-

tamp.

Now, let us examine Algorithm 2. The idea leading the algorithm is to analyze the

time relationship among uncertain events in a more precise manner, as opposed to adding

a large number of edges to the graph and then removing them via transitive reduction.

This is attained by searching all the viable successors of each event in the sorted timestamp list $L$. We scan the list $L$ with two nested loops, and we use the inner loop to look

for successors of the entry selected by the outer loop. According to the semantics of be-

havior graphs, events with overlapping intervals as timestamps must not be connected

by a path; thus, we draw outgoing edges from an event only when, reading the list, we

arrive at a point in time in which the event has certainly occurred. This is the reason

why outgoing edges are not drawn when inspecting minimum timestamps (line 6) and

incoming edges are not drawn when inspecting maximum timestamps (line 10).

First, we initialize the set of nodes with all the triples $(id, \pi_a(e), \pi_o(e))$ in the entries of $L$, and we initialize the edges with an empty set (lines 1-2). For each maximum timestamp that we encounter in the list, we start searching for successors in the following entries (lines 3-9), so we proceed in looking for the successors of $(id, a, o, type)$ only if $type = \text{'MAX'}$.

If, while searching for successors of the entry $(id, a, o, \text{'MAX'})$, we encounter the entry $(id^*, a^*, o^*, type^*)$ corresponding to a minimum timestamp ($type^* = \text{'MIN'}$), we


[Figure 8 diagram: Gantt chart with a time axis with daily ticks from 06-12-2011 00:00:00 to 13-12-2011 00:00:00 and one row per activity label (a, b, c, d, e, f).]

Figure 8: A Gantt diagram visualizing the time perspective of the events in Table 4. The horizontal blue bars represent the interval of possible timestamps of uncertain events: such an interval is wide for the events with activity labels "b", "d", and "f", which have an uncertain timestamp, and is narrow to indicate a precise point in time for the other events. This diagram is able to show the order relationship between events in a trace, as well as the dimensions of their interval of possible timestamps in scale.

connect $(id, a, o)$ and $(id^*, a^*, o^*)$ in the graph, since their timestamps do not have any possible value in common. The search for successors must continue, since it is possible that other events took place before the maximum timestamp of the event corresponding to $(id^*, a^*, o^*, type^*)$. This configuration occurs for events $e_1$ and $e_2$ in Table 4. As can be seen in Figure 8, $e_2$ can indeed follow $e_1$, but the still undiscovered event $e_3$ is another possible successor for $e_1$.

If the entry $(id^*, a^*, o^*, type^*)$ corresponds to a maximum timestamp (line 12), so $type^* = \text{'MAX'}$, there are two separate situations to consider. Case 1: $(id, a, o)$ was not already connected to $(id^*, a^*, o^*)$. Then, the timestamps of the events corresponding to $(id, a, o)$ and $(id^*, a^*, o^*)$ overlap with each other. If they did not, the two nodes would have already been connected, since we would have encountered $(id^*, a^*, o^*, \text{'MIN'})$ from $(id, a, o, \text{'MAX'})$ before encountering $(id^*, a^*, o^*, \text{'MAX'})$. Thus, $(id, a, o)$ must not be connected to $(id^*, a^*, o^*)$ and the search must continue. Events $e_5$ and $e_4$ are an example: when the maximum timestamp of $e_4$ is encountered during the search for the successors of $e_5$, the two are not connected, so the search for a viable successor of $e_5$ has to continue. Case 2: $(id, a, o)$ and $(id^*, a^*, o^*)$ are already connected. This means that we had already encountered $(id^*, a^*, o^*, \text{'MIN'})$ during the search for the successors of $(id, a, o)$. Since the entire time interval representing the possible timestamp of the event associated with $(id^*, a^*, o^*)$ is detected after the occurrence of $(id, a, o)$, there are no further events to consider as successors of $(id, a, o)$ and the search stops (line 13). In the running example,


[Figure 9 diagram: nodes e1 (a), e3 (c), e4 (d), e5 (e), e2 (b), e6 (f).]

Figure 9: The behavior graph of the trace in Table 4.

this happens between $e_5$ and $e_6$: when searching for the successors of $e_5$, we first connect it with $e_6$ when we encounter its minimum timestamp; we then encounter its maximum timestamp, so no other successive event can be a successor for $e_5$. This concludes the walkthrough of the procedure, which shows why Algorithms 1 and 2 can be used to correctly compute the behavior graph of a trace. The behavior graph of the trace in Table 4 obtained through this procedure is shown in Figure 9.
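The walkthrough can be reproduced with a short Python sketch of Algorithm 2, applied to the sorted list $L$ of the running example. The function name `behavior_graph` and the tuple encoding are assumptions made for this illustration; label sets are abbreviated to single strings to keep the listing short.

```python
# The sorted list L of the running example (label sets abbreviated to strings).
L = [(1, "a", "!", "MIN"), (1, "a", "!", "MAX"), (2, "b", "!", "MIN"),
     (3, "c", "!", "MIN"), (3, "c", "!", "MAX"), (4, "d", "!", "MIN"),
     (5, "e", "!", "MIN"), (5, "e", "!", "MAX"), (2, "b", "!", "MAX"),
     (4, "d", "!", "MAX"), (6, "f", "!", "MIN"), (6, "f", "!", "MAX")]


def behavior_graph(L):
    nodes = {entry[:3] for entry in L}          # line 1
    edges = set()                               # line 2
    for i, entry in enumerate(L):
        if entry[3] != "MAX":                   # line 6: only maxima emit edges
            continue
        for other in L[i + 1:]:                 # lines 7-14: scan later entries
            edge = (entry[:3], other[:3])
            if other[3] == "MIN":               # line 10: a certain successor
                edges.add(edge)
            elif edge in edges:                 # line 12: successor fully passed
                break                           # line 13
    return nodes, edges


nodes, edges = behavior_graph(L)
print(sorted((v[0], w[0]) for v, w in edges))
# [(1, 2), (1, 3), (2, 6), (3, 4), (3, 5), (4, 6), (5, 6)]
```

The seven edges are produced directly, without first materializing and then pruning the redundant ones.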

Let us now prove, in more formal terms, the correctness of these algorithms. We will show that the procedures BehaviorGraph and TimestampList are able to construct a behavior graph with the semantics illustrated in Definition 14.

Theorem 1 (Correctness of the behavior graph construction). Let $\sigma \in T_U$ be an uncertain trace. Let $bg = (V, E) = \text{BehaviorGraph}(\text{TimestampList}(\sigma))$ be the behavior graph of $\sigma$ obtained through Algorithms 1 and 2. The graph $bg$ follows the behavior graph semantics: for all pairs of events $e \in \sigma$ and $e' \in \sigma$ such that $id(e) = e_{id}$, $\pi_a(e) = e_a$, $\pi_o(e) = e_o$, $id(e') = e'_{id}$, $\pi_a(e') = e'_a$, $\pi_o(e') = e'_o$, we have that the node $(e_{id}, e_a, e_o)$ is connected to the node $(e'_{id}, e'_a, e'_o)$ if and only if $\pi_{t_{\max}}(e) < \pi_{t_{\min}}(e')$ and there exists no event $e'' \in \sigma$ such that $\pi_{t_{\max}}(e) < \pi_{t_{\min}}(e'') \le \pi_{t_{\max}}(e'') < \pi_{t_{\min}}(e')$. Thus, $bg = \beta(\sigma)$.

Proof. Let us first define a suitable $id$ function for the behavior graph utilizing the list $E$ created in TimestampList($\sigma$). For all events $e^* \in \sigma$ and for $i \in \mathbb{N}$ such that $E[i] = e^*$, we define $id(e^*) = i$. Since $id$ is just an enumeration of the events in $\sigma$, it is trivially bijective.

($\Leftarrow$) Assume $\pi_{t_{\max}}(e) < \pi_{t_{\min}}(e')$. By construction, we have that $L = \langle \dots, (e_{id}, e_a, e_o, \text{'MAX'}), \dots, (e'_{id}, e'_a, e'_o, \text{'MIN'}), \dots \rangle$. The checks in line 6 and line 10 only allow for edges to be linked from entries of type 'MAX' to entries of type 'MIN'


that only appear in a later position in the list $L$. Thus, the configuration $\pi_{t_{\max}}(e) < \pi_{t_{\min}}(e')$ is a strict prerequisite for $(e_{id}, e_a, e_o)$ and $(e'_{id}, e'_a, e'_o)$ to be connected: $((e_{id}, e_a, e_o), (e'_{id}, e'_a, e'_o)) \in E \Rightarrow \pi_{t_{\max}}(e) < \pi_{t_{\min}}(e')$.

($\Rightarrow$) Assume $\pi_{t_{\max}}(e) < \pi_{t_{\min}}(e')$, and that the algorithm is currently searching the successors for the entry $(e_{id}, e_a, e_o, \text{'MAX'})$. Eventually, the inner loop will consider as a successor the entry $(e'_{id}, e'_a, e'_o, \text{'MIN'})$, and since it is of type 'MIN', $(e_{id}, e_a, e_o)$ and $(e'_{id}, e'_a, e'_o)$ will necessarily be connected unless the algorithm executes the break at line 13. To execute it, the algorithm needs to find a list entry $(e''_{id}, e''_a, e''_o, \text{'MAX'})$ such that there already exists an arc between $(e_{id}, e_a, e_o)$ and $(e''_{id}, e''_a, e''_o)$, and this is only possible if $(e''_{id}, e''_a, e''_o, \text{'MIN'})$ has been encountered while searching for successors of $(e_{id}, e_a, e_o)$. This implies that

$$L = \langle \dots, (e_{id}, e_a, e_o, \text{'MAX'}), \dots, (e''_{id}, e''_a, e''_o, \text{'MIN'}), \dots, (e''_{id}, e''_a, e''_o, \text{'MAX'}), \dots, (e'_{id}, e'_a, e'_o, \text{'MIN'}), \dots \rangle$$

which, by construction of $L$, is only possible if there exists some $e'' \in \sigma$ such that

$$\pi_{t_{\max}}(e) < \pi_{t_{\min}}(e'') \le \pi_{t_{\max}}(e'') < \pi_{t_{\min}}(e').$$

As mentioned earlier, the procedure of constructing a behavior graph has been structured in two different algorithms specifically to enable further optimization in processing uncertain process traces. This becomes evident once we consider the problem of converting into behavior graphs all the traces in an event log, as opposed to one single uncertain trace.

Firstly, it is important to notice that different uncertain traces can have the same list $L$. Similarly to directly-follows relationships in more classical process mining, which can ignore the amount of time in absolute terms elapsed between two consecutive events, specific values of timestamps in an uncertain trace are not necessarily meaningful with respect to the connection in the behavior graph; their order, conversely, is crucial.

This fact enables further optimization at the log level. The construction of the list $L$ in TimestampList($\sigma$) is engineered in a way that allows for computing the behavior graph without direct lookup to the events in the trace. This implies that it is possible to extract a multiset of lists $L$ from the event log, and to compute the conversion to behavior graph only for the set of lists induced by this multiset. This saves computation time in converting an entire event log to behavior graphs; furthermore, it enables a more compact representation of the log in memory, since we only need to store a smaller number of graphs to represent the whole log.

The procedure to efficiently convert an event log into graphs is detailed in Algorithm 3.


Algorithm 3: ProcessUncertainLog

Input: An uncertain log L.
Output: A multiset of behavior graphs V_L.

 1  ML ← [ ]
 2  V_L ← [ ]
 3  for σ ∈ L do
 4      ML ← ML ⊎ [TimestampList(σ)]
 5  for L ∈ ML do
 6      V_L ← V_L ⊎ [BehaviorGraph(L)^ML(L)]
 7  return V_L

These considerations allow us to extend to the uncertain scenario some concepts that are essential in classical process mining. Firstly, we can now extend the definition of variant, highly important for preexisting process mining techniques, to uncertain event data.

Definition 15 (Uncertain variants). Let $L \subseteq T_U$ be a simple uncertain event log. The variants of $L$, denoted by $V_L$, are the multiset of behavior graphs for the uncertain traces in $L$, and are computed with ProcessUncertainLog($L$).

The computational advantage in representing a log through a multiset of behavior graphs is evident in the procedure described in Algorithm 2. We see that all data necessary to the creation of a behavior graph is contained in the list $L$, a fact that justifies the log representation method illustrated in Algorithm 3.

Lemma 1. Two uncertain traces $\sigma_1 \in L$ and $\sigma_2 \in L$ belong to the same variant, and share the same behavior graph, if and only if they result in the same timestamp list $L$: TimestampList($\sigma_1$) = TimestampList($\sigma_2$).
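The log-level optimization can be sketched with Python's `collections.Counter` as the multiset. The listing below is a compact illustration under stated assumptions, not the PROVED implementation: traces are lists of hypothetical `(label, t_min, t_max)` tuples, the indeterminacy attribute is omitted for brevity, and ties between timestamps are broken by placing 'MIN' before 'MAX'.

```python
from collections import Counter


def timestamp_list(trace):
    """Compact version of Algorithm 1; returns a hashable tuple of entries."""
    events = sorted(trace, key=lambda e: e[1])
    support = []
    for i, (label, t_min, t_max) in enumerate(events, start=1):
        support += [(t_min, 0, i, label, "MIN"), (t_max, 1, i, label, "MAX")]
    support.sort(key=lambda entry: entry[:3])   # assumption: 'MIN' before 'MAX' on ties
    return tuple((i, label, kind) for _, _, i, label, kind in support)


def behavior_graph(L):
    """Compact version of Algorithm 2; returns a hashable (nodes, edges) pair."""
    edges = set()
    for pos, (i, _, kind) in enumerate(L):
        if kind != "MAX":
            continue
        for j, _, other_kind in L[pos + 1:]:
            if other_kind == "MIN":
                edges.add((i, j))
            elif (i, j) in edges:
                break
    nodes = frozenset((i, label) for i, label, _ in L)
    return nodes, frozenset(edges)


def process_uncertain_log(log):
    """Sketch of Algorithm 3: build one graph per distinct timestamp list."""
    lists = Counter(timestamp_list(trace) for trace in log)   # multiset of lists
    variants = Counter()
    for L, count in lists.items():
        variants[behavior_graph(L)] += count                  # graph with multiplicity
    return variants


log = [
    [("a", 1, 1), ("b", 2, 4), ("c", 3, 3)],
    [("a", 11, 11), ("b", 20, 40), ("c", 30, 30)],   # same order, different values
    [("a", 1, 1), ("b", 2, 2), ("c", 3, 3)],
]
variants = process_uncertain_log(log)
print(len(variants), sorted(variants.values()))
# 2 [1, 2]
```

The first two traces differ in every timestamp value but share the same list, so Algorithm 2 runs twice rather than three times.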

Another central concept in process mining is the so-called control-flow perspective of event data. In certain process traces, where timestamps have a total order, events have a single activity label and no event is indeterminate, the control-flow information is represented by a sequence of activity labels sorted by timestamp. Although there are many analysis approaches that also account for other perspectives (e.g., the performance perspective, which considers the duration of events and their distance in time, or the resource perspective, which accounts for the agents that execute the activities), a vast number of process mining techniques, including most popular algorithms for process discovery and conformance checking, rely only on the control-flow perspective of a process. Analogously, behavior graphs carry over the control-flow information of an uncertain trace:


instead of describing the flow of events like their certain counterpart, the behavior graph describes all possible flows of events in the uncertain trace.

5 Asymptotic Complexity

In this section, we will provide some values for the asymptotic complexity of the algo-

rithms seen in this paper.

In a previous paper [26] we introduced the concept of behavior graph for the representation of uncertain event data, together with a method to obtain such graphs. Definition 14 describes such a baseline method for the creation of the behavior graph consisting of two main parts: the construction of the starting graph and the computation of its transitive reduction. Let us consider an uncertain process trace $\sigma \in T_U$ with $|\sigma| = n$ events, and the graph $G = (V, E)$ generated in Definition 14 before the transitive reduction.

The starting graph is created by inspecting the time relationship between every pair of events; this corresponds to checking if an edge exists between each pair of vertices in $G$, which needs $O(n^2)$ time.

The transitive reduction of graphs can be obtained through many methods. A simple and efficient method to compute the transitive reduction on sparse graphs is to test reachability through a search (either breadth-first or depth-first) from each edge. This method costs $O(V \cdot E)$ time⁵. However, in the initial graph each event $e \in V$ has an inbound arc from each event certainly preceding $e$ and an outbound arc to each event certainly following $e$. More events with overlapping intervals as timestamps imply fewer arcs in $G$; the initial graph $G$ of a trace with no uncertainty has $|E| = \frac{n(n-1)}{2} = O(V^2)$ edges. Thus, except for rare, very uncertain cases, the graph $G$ is dense.

Aho et al. [6] presented a technique to compute the transitive reduction in $O(n^3)$ time, more appropriate in the case of dense graphs, and proved that the transitive reduction has the same computational complexity as the matrix multiplication problem. The problem of matrix multiplication was generally regarded as having an optimal time complexity of $O(n^3)$, until Volker Strassen presented an algorithm [31] able to multiply matrices in $O(n^{2.807355})$ time. Subsequent improvements have followed, by Coppersmith and Winograd [12], Stothers [30] and Williams [33]. The asymptotically fastest algorithm known to date has been illustrated by Le Gall [19] and has an execution time of $O(n^{2.3728639})$. However, these faster algorithms are seldom used in practice, due to the existence of large constant factors in their computation time that are hidden by the asymptotic notation. Moreover, they have vast memory requirements. The Strassen algorithm is helpful in real-life applications only when applied on very large matrices [13], and the Coppersmith-Winograd algorithm and subsequent improvements are more efficient only with inputs so large that they are effectively classified as galactic algorithms [18].

⁵ Here, for simplicity, we resort to a widely adopted abuse of notation in asymptotic complexity analysis: we indicate a set instead of its cardinality (e.g., we use $O(V)$ in place of $O(|V|)$).

Bearing in mind these considerations, for the vast majority of event logs, the most efficient way to implement the creation of the behavior graph via transitive reduction runs in $O(n^2) + O(n^3) = O(n^3)$ time in the worst-case scenario.

It is straightforward to find upper bounds for the complexity of Algorithms 1 and 2. Line 3 of TimestampList requires $O(n \log n)$ time to be executed. Lines 5-8 require $O(n)$ time. Line 9 requires $O(2n \log(2n)) = O(n \log n)$ time to be run. Lines 11-14 require $2n = O(n)$ time to be run. Lines 1, 2, 4, and 10 have a constant cost $O(1)$. Thus, TimestampList has a total asymptotic cost of $O(1) + 2 \cdot O(n \log n) + 2 \cdot O(n) = O(n \log n)$ in the worst-case scenario.

Let us now examine BehaviorGraph. Lines 1-3 and line 11 run in $O(1)$ time. Lines 4-15 consist of two nested loops over the list $L$, and we have $|L| = 2n$, resulting in an asymptotic cost of $O((2n)^2) = O(n^2)$. The total running time for the novel construction method is then $O(1) + O(n^2) = O(n^2)$ in the worst-case scenario.

We can also obtain a lower bound for the complexity in the worst-case scenario by analyzing the possible size of the output. The complete directed bipartite graph with $n$ vertices, usually indicated with $K_{\frac{n}{2},\frac{n}{2}}$, is a DAG that has $(\frac{n}{2})^2 = \Theta(n^2)$ edges. It is easy to see that the complete bipartite graph fulfills the requirements to be a behavior graph: it is in fact acyclic, and no edge can be removed without changing the reachability of the graph; namely, it is equivalent to its transitive reduction. We can show that a behavior graph with such a shape exists employing a simple construction: a trace composed of $n$ events with timestamps such that the first $\frac{n}{2}$ events all have overlapping timestamps, the last $\frac{n}{2}$ also all have overlapping timestamps, and the maximum timestamp of each of the first $\frac{n}{2}$ events is smaller than the minimum timestamp of each of the last $\frac{n}{2}$ events. The construction, together with an example, is illustrated in Figure 10. Since lines 4-15 of the algorithm build this graph with $\Theta(n^2)$ edges, the algorithm runs in $\Omega(n^2)$ time, and thus also in $\Theta(n^2)$ time. This also proves the asymptotic optimality of the algorithm: no algorithm to build behavior graphs can run in less than $\Theta(n^2)$ time in the worst-case scenario.
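The lower-bound construction can be checked numerically. The sketch below builds the intervals of Figure 10 for a given $k$ (so $n = 2k$), collects every pair of events in which one certainly precedes the other, and verifies that no edge is redundant. The function names are chosen here for illustration, and events are identified by their position in the list.

```python
def bipartite_trace(k):
    """Intervals of Figure 10: two groups of k mutually overlapping events."""
    first = [(i, k + i - 1) for i in range(1, k + 1)]     # [1, k], ..., [k, 2k-1]
    second = [(2 * k + i, 3 * k + i) for i in range(k)]   # [2k, 3k], ..., [3k-1, 4k-1]
    return first + second


def precedence_edges(intervals):
    """All pairs (u, v) of event positions with t_max(u) < t_min(v)."""
    n = len(intervals)
    return {(u, v) for u in range(n) for v in range(n)
            if intervals[u][1] < intervals[v][0]}


def is_transitively_reduced(edges, n):
    # The precedence relation is transitively closed, so an edge is redundant
    # only if some intermediate node w gives both (u, w) and (w, v).
    return not any((u, w) in edges and (w, v) in edges
                   for (u, v) in edges for w in range(n))


for k in (1, 2, 4, 8):
    edges = precedence_edges(bipartite_trace(k))
    print(k, len(edges), is_transitively_reduced(edges, 2 * k))
# 1 1 True
# 2 4 True
# 4 16 True
# 8 64 True
```

For every $k$ the graph has exactly $k^2 = (\frac{n}{2})^2$ edges and coincides with its transitive reduction, so any construction algorithm has to output a quadratic number of edges on these traces.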

6 Experimental Results

The formal definition of our novel construction method for the behavior graph was used to show its asymptotic speedup with respect to the construction utilizing the transitive reduction. To empirically confirm this improvement, we built a set of experiments to measure the gain in speed and memory usage.


[Figure 10 diagram: general construction with the intervals [1, k], [2, k+1], [3, k+2], ..., [k, 2k-1] for the first group of events and [2k, 3k], [2k+1, 3k+1], [2k+2, 3k+2], ..., [3k-1, 4k-1] for the second group; instantiated example with the intervals [1, 4], [2, 5], [3, 6], [4, 7] and [8, 12], [9, 13], [10, 14], [11, 15].]

Figure 10: Construction of the class of behavior graphs isomorphic to a complete bipartite graph and an instantiated example. For any $n = 2k$, it is possible to have a behavior graph isomorphic to the graph $K_{k,k}$, which thus has a number of edges quadratic in the number of vertices.

6.1 Performance of Behavior Graph Construction

In this section, we will show a comparison between the running time of the naïve behavior graph construction, which employs the transitive reduction, versus the improved method detailed throughout the paper. The experiments are set to investigate the difference in performance between the two algorithms, and most importantly how this difference scales when the size of the event log increases, as well as the number of events in the log that have uncertain timestamps. In designing the experiments, we took into consideration the following research questions:

• Q1: how does the computation time of the two methods compare when run on logs having an increasing number of traces?

• Q2: how does the computation time of the two methods compare when run on logs with increasing trace lengths?

• Q3: how does the computation time of the two methods compare when run on logs with increasing percentages of events with uncertain timestamps?

• Q4: what degree of reduction in memory consumption for the representation of an uncertain log can we attain with the novel method?

• Q5: do the answers obtained for Q3 hold when simulating uncertainty on real-life event data?


Both the baseline algorithm based on transitive reduction [26] and the new algo-

rithm for the construction of the behavior graph are implemented in Python, within

the PROVED project. The implementation of both methods is available online, as well

as the full code for the experiments presented here (see the reference in Section 1).

For each series of experiments exploring Q1 through Q4, we generate a synthetic event log with a number $n$ of traces of length $l$ (in number of events belonging to the trace). Uncertainty on timestamps is then artificially added to the events in the log. A specific percentage $p$ of the events in the event log will have an uncertain timestamp, causing it to overlap with an adjacent event. Finally, behavior graphs are built from all the traces in the event log with either algorithm, while the execution time is measured. All results in this section are presented as the mean of the measurements for 10 runs of the corresponding experiment. In the diagrams, we will label with "TrRed" the naïve method using the transitive reduction, and with "Improved" the faster algorithm illustrated in this paper. Additionally, the data series for the novel method are labeled with its running time relative to the baseline method for each specific data point in the experiment, expressed as a percentage.
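A synthetic log of this kind could be produced along the following lines. This is a hypothetical sketch and not the generator used for the experiments: the integer time grid, the stretch of 1.5 time units, and the function names `synthetic_trace` and `synthetic_log` are all assumptions made here for illustration.

```python
import random


def synthetic_trace(length, p, seed=0):
    """Return `length` intervals (t_min, t_max); a fraction p of them is uncertain."""
    rng = random.Random(seed)
    # Start from certain events placed at the integer times 0, 1, ..., length - 1.
    intervals = [(float(t), float(t)) for t in range(length)]
    for idx in rng.sample(range(length), round(p * length)):
        t = float(idx)
        # Stretch towards the next event, or towards the previous one for the last
        # event, so that the interval overlaps with an adjacent event.
        intervals[idx] = (t, t + 1.5) if idx < length - 1 else (t - 1.5, t)
    return intervals


def synthetic_log(n, length, p, seed=0):
    """Return n traces, each generated with its own seed for reproducibility."""
    return [synthetic_trace(length, p, seed + k) for k in range(n)]


log = synthetic_log(n=3, length=20, p=0.5)
print([sum(t_min < t_max for t_min, t_max in trace) for trace in log])
# [10, 10, 10]
```

Seeding each trace separately keeps the logs reproducible across runs, which helps when averaging the timings of repeated measurements.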

To answer Q1, the first experiment inspects how the efficiency of the two algorithms scales with log dimension in number of traces. We generate logs with a fixed uncertainty percentage of $p = 0.5$, and trace length of $l = 20$. The number of traces in the uncertain log progressively scales from $n = 1000$ to $n = 10000$. As shown in Figure 11, our proposed algorithm outperforms the baseline algorithm, showing a much smaller slope in computation time. As anticipated by the theoretical analysis, the computing time to build behavior graphs increases linearly with the number of traces in the event log for both methods; in the novel method, the constant factors are much smaller, thus producing the speedup that we can observe in the graph. Note that in this experiment the novel method requires between 18% and 26% of the time with respect to the baseline method.


[Figure 11 plot: x-axis "Log size (number of traces)", ticks from 2,000 to 10,000; y-axis "Behavior graph building time (seconds)", ticks from 0.0 to 17.5; series "TrRed" and "Improved"; the "Improved" data points are labeled with relative times between 18.92% and 26.13%.]

Figure 11: Time in seconds for the creation of the behavior graphs for synthetic logs with traces of length $l = 20$ events and $p = 0.5$ of uncertain events, with increasing number of traces $n$. The solid blue line indicates the time needed for the naïve construction; the dashed red line shows the building time of the improved algorithm, and is labeled with the relative time variation (in percentage).

The second experiment is designed to answer Q2. We analyze the effect of the trace length on the total time needed for behavior graph creation. Therefore, we created logs with $n = 100$ traces of increasing lengths in number of events, and added uncertain timestamps to events with $p = 0.5$. The results, illustrated by Figure 12, meet our expectations: the computation time of the baseline method scales much worse than the computation time required by our new technique, due to its cubic asymptotic time complexity. This confirms the results of the asymptotic time complexity analysis detailed in Section 5. We can notice an order-of-magnitude increase in speed. At trace length $l = 600$, the new algorithm computes the graphs in only 0.35% of the time required by the baseline algorithm.


[Figure 12 plot: x-axis, trace length (number of events), up to 600; y-axis, behavior graph building time (seconds), logarithmic scale; series: TrRed, Improved. The Improved series is labeled with relative times decreasing from 5.64% to 0.35% of the TrRed time.]

Figure 12: Time in seconds for the creation of the behavior graphs for synthetic logs with n = 100 traces and p = 0.5 of uncertain events, with increasing trace length l.

The next experiment tackles Q3, by inspecting the difference in execution time for the two algorithms as a function of the percentage of events with an uncertain timestamp in the event log. Keeping constant the values n = 100 and l = 100, we progressively increased the percentage p of events with an uncertain timestamp and measured computation time. As presented in Figure 13, the time required for behavior graph construction remains almost constant for our proposed algorithm, while it is very slightly decreasing for the baseline algorithm. This behavior is expected, and is justified by the fact that the worst-case scenario for the baseline algorithm is a trace that has no uncertainty on the timestamp: in that case, the behavior graph is simply a chain of nodes representing the total order in a sequence of events with certain timestamps, thus the transitive reduction needs to find and remove a higher number of edges from the directed graph. This worst-case scenario occurs at p = 0, explaining why the computation time needed by the transitive reduction is at its highest. It is important to note, however, that for all values of p our new algorithm is significantly more efficient than the baseline algorithm: with p = 0, the new algorithm takes 0.47% of the time needed by the naïve construction, while for p = 1 this figure grows to 4.39%.
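The worst case described above can be checked on a toy example. The sketch below is not the implementation benchmarked in the paper: it assumes that event i precedes event j when the latest possible time of i is strictly before the earliest possible time of j, and it uses a deliberately naive reduction that is valid only because this precedence relation is already transitive. All function names are hypothetical.

```python
def precedence_edges(events):
    """Edges (i, j) where event i certainly precedes event j.

    events: list of (t_min, t_max) intervals; a certain timestamp has
    t_min == t_max. Assumed rule: i precedes j when t_max of i is
    strictly smaller than t_min of j.
    """
    n = len(events)
    return {
        (i, j)
        for i in range(n)
        for j in range(n)
        if events[i][1] < events[j][0]
    }


def transitive_reduction(n, edges):
    """Naive reduction of a transitively closed DAG.

    An edge (u, v) is redundant when some node w satisfies u -> w -> v.
    This test is sufficient only because the input relation is transitive.
    """
    return {
        (u, v)
        for (u, v) in edges
        if not any((u, w) in edges and (w, v) in edges for w in range(n))
    }


def removed_edge_count(events):
    """Number of edges the reduction has to discard for one trace."""
    edges = precedence_edges(events)
    return len(edges) - len(transitive_reduction(len(events), edges))


certain = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]    # p = 0: total order
uncertain = [(0, 0), (1, 3), (2, 4), (5, 5), (6, 6)]  # two overlapping events

print(removed_edge_count(certain))    # 6: 10 edges reduced to a chain of 4
print(removed_edge_count(uncertain))  # 4: 9 edges reduced to 5
```

With fully certain timestamps every pair of events is ordered, so the initial graph has l(l-1)/2 edges and all but l-1 of them must be removed; overlapping intervals leave fewer edges to discard.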


[Figure 13 plot: x-axis, uncertainty p, from 0.0 to 1.0; y-axis, behavior graph building time (seconds); series: TrRed, Improved. The Improved series is labeled with relative times between 0.47% and 4.77% of the TrRed time.]

Figure 13: Time in seconds for the creation of the behavior graphs for synthetic logs with n = 100 traces of length l = 100 events, with increasing percentages of timestamp uncertainty p.

An additional experiment is illustrated to provide an answer to Q4. Similarly to the first experiment, we increase the number of traces n in the uncertain log, while keeping the other parameters fixed: l = 10 and p = 0.5. We then perform the behavior graph construction with both methods, and we measure the memory consumption of the transitive reduction method (which keeps in memory one behavior graph for each uncertain trace) versus that of the improved method (which generates a multiset of behavior graphs, one for each variant in the uncertain log).


[Figure 14 plot: x-axis, log size (number of traces); y-axis, memory occupation (bytes, ×10^8); series: TrRed, Improved. The Improved series is labeled with relative memory occupation decreasing from 93.97% to 59.2% of the TrRed value.]

Figure 14: Memory consumption in bytes needed to store the behavior graphs for synthetic uncertain event logs with traces of length l = 10 events and timestamp uncertainty of p = 0.5, with an increasing number of traces n.

The results are summarized in Figure 14. Note that when n increases, more and more uncertain traces are characterized by the same behavior graph, and can then be grouped in the same variant. This allows the improved algorithm to store the uncertain log more effectively. At n = 15000, the space needed by the multiset of behavior graphs is 59.2% of the space needed by the baseline method, a sizable improvement in memory requirements when analyzing uncertain event logs of substantial dimensions. This improvement in memory consumption is a consequence of the new technique utilized in this paper to obtain the timestamp list, which enables such refinement with respect to the technique illustrated in [28].
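The grouping of traces into variants can be illustrated with a multiset keyed by a hashable form of each graph. This is a minimal sketch, not the data structure used in the paper: it assumes that every behavior graph lists its nodes in a fixed canonical order, so that equal graphs yield equal keys, and it does not attempt general graph isomorphism. The names `graph_key` and `to_variants` are hypothetical.

```python
from collections import Counter


def graph_key(labels, edges):
    """Hashable form of a behavior graph.

    labels: activity labels of the nodes, listed in a fixed canonical order.
    edges: iterable of (i, j) index pairs into `labels`.
    """
    return (tuple(labels), frozenset(edges))


def to_variants(graphs):
    """Collapse a list of behavior graphs into a multiset of variants."""
    return Counter(graph_key(labels, edges) for labels, edges in graphs)


graphs = [
    (["a", "b", "c"], [(0, 1), (0, 2)]),
    (["a", "b", "c"], [(0, 2), (0, 1)]),  # same graph, edges listed differently
    (["a", "b", "c"], [(0, 1), (1, 2)]),
]
variants = to_variants(graphs)
print(len(graphs), len(variants))  # 3 graphs stored as 2 variants
```

Each distinct graph is stored once together with its multiplicity, which is why the relative saving grows as more traces fall into already known variants.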

Finally, to address research question Q5, we compared the computation time for behavior graph creation on real-life event logs, where we artificially inserted timestamp uncertainty in progressively higher percentages p of uncertain events, as described for the experiments above. We considered three event logs: an event log tracking the activities of the help desk process of an Italian software company, a log related to the management of road traffic fines in an Italian municipality, and a log from the BPI Challenge 2012 related to a loan application process. The results, presented in Figure 15, closely adhere to the findings of the experiments with synthetically generated uncertain event data: the novel method provides a substantial speedup that remains rather stable with respect to the percentage p of uncertain events added to the log.


[Figure 15 plots, three panels: BPIC 2012, HelpDesk, RTFM. In each panel: x-axis, uncertainty p; y-axis, behavior graph building time (seconds).]

Figure 15: Execution times in seconds for real-life event logs with increasing percentages p of timestamp uncertainty.

6.2 Applications of the Behavior Graph Construction

In Section 1 we saw how building the behavior graph is a fundamental preprocessing step for both process discovery and conformance checking when dealing with uncertain event logs. In the previous section, we showed in practice how the novel algorithm presented in this paper impacts the computation time for the construction of behavior graphs. Now, let us look at the effect of this speedup when applied to process mining techniques.

In this additional experiment we consider the conformance checking problem. In [26] we proposed an approach to compute upper and lower bounds for the conformance score of a trace against a reference Petri net through the alignment technique, which yields alignments for the best- and worst-case scenarios of an uncertain trace, as illustrated in Section 1. The experiment is set up to assess the effect of the new behavior graph construction on the overall performance of conformance checking over uncertain data. We first generate a Petri net with t transitions, simulate a log by playing out n = 500 traces, and add timestamp uncertainty with p = 0.1. We then compute the lower bound for conformance between the uncertain event log and the Petri net used as a source, and compare the overall execution time for conformance using the two different methods for the creation of the behavior graph. In this specific experiment, we also considered the other types of uncertainty in process mining illustrated in the taxonomy of [26], as well as all types of uncertainty simulated together on the same log.

[Figure 16 plots, four panels: Activities, Timestamps, Indeterminate events, All. In each panel: x-axis, number of transitions; y-axis, time variation (%).]

Figure 16: Relative variation in computation time obtained through the improved behavior graph construction when applied to the computation of conformance bounds between a synthetic uncertain log and a Petri net with an increasing number of transitions. The synthetic uncertain logs have n = 500 traces and timestamp uncertainty has been introduced with p = 0.1.

The results are shown in Figure 16. We can see that, on very small nets (t = 5), the alignment algorithm takes a short time to execute, so the speedup provided by the improved behavior graph construction has a larger impact on the total computation time (taking as little as 30.71% of the time to calculate alignments). With the increase of t, the computation time for conformance checking using the fast construction of the behavior graph appears to stabilize around 65% of the time needed if we employ the naïve construction, when considering only one type of uncertainty in isolation. This translates into a reduction of roughly 35% of computation time for the very common problem of calculating the conformance score between event data and a reference model, a significant impact on the performance of concrete applications of process mining over uncertain data. When compounding all types of uncertainty we see a similar effect, although for t = 5 the improved method takes 52.22% of the time required by the baseline construction, a less dramatic effect than in the other uncertainty settings. This is due to the fact that, even at such small scales, the high number of realizations of traces slows down the alignment phase of the computation.

In evaluating this result, it is important to consider that alignments are a notoriously time-intensive technique [20], since the technique is based on an A* search on a state space that consists of pairs of the activities in the trace combined with the possible actions in the model. As a consequence, the impact of the algorithm presented in this paper is limited by the characteristics of the implementation of such alignment technique; combining it with more refined alignment algorithms would further improve the gain in speed.
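The reasoning above can be made concrete with a small calculation. The sketch below assumes a simple two-phase cost model (behavior graph construction followed by alignment) that is not taken from the paper; the function name and the numbers in the usage example are hypothetical and chosen only for illustration.

```python
def relative_total_time(graph_share, graph_ratio):
    """Overall time of the improved pipeline relative to the baseline.

    graph_share: fraction of the baseline total time spent building
                 behavior graphs (between 0 and 1).
    graph_ratio: time of the improved construction divided by the time of
                 the baseline construction (e.g. 0.05 for 5%).
    The alignment phase is assumed to cost the same in both pipelines.
    """
    if not 0.0 <= graph_share <= 1.0:
        raise ValueError("graph_share must lie in [0, 1]")
    return (1.0 - graph_share) + graph_share * graph_ratio


# Hypothetical numbers: if graph construction took 40% of the baseline time
# and the improved construction needs 5% of that, the pipeline takes 62%.
print(round(relative_total_time(0.40, 0.05), 4))
```

Under this model, the overall ratio approaches that of the graph construction alone as the alignment phase shrinks, which is consistent with the larger impact observed on small nets and with the expectation that faster alignment algorithms would increase the relative gain.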

In summary, the outcomes of the experiments show that the new algorithm presented here outperforms the previous method for creating the behavior graph along all the parameters in which the problem instance can scale, in both time and space. The experiment designed to answer Q3 shows that, like the naïve algorithm, our novel method is essentially insensitive to the percentage of events with uncertain timestamps contained in a trace. This fact is also verified by the experiment associated with Q5 on real-life data with added time uncertainty. While for every combination of parameters we benchmarked the novel algorithm runs in a fraction of the time required by the baseline method, the experiments also confirm the improvements in asymptotic time complexity demonstrated through theoretical complexity analysis.

7 Related Work

The topic of process mining analysis over uncertain event data is relatively new, and little research has been carried out. The work introducing the concept of uncertainty in process mining, together with a taxonomy of the various types of uncertainty, specifically illustrated that if a trace displays uncertain attributes, it contains behavior, which can be effectively represented through graphical models, namely behavior graphs and behavior nets [26]. Differently from classic process mining, where we have a clearly defined separation between data and model and between the static behavior of data and the dynamic behavior of models, the distinction between data and models becomes less clear in the presence of uncertainty, because of the variety in behavior that affects the data. Representing traces through process models is utilized in [26] for the computation of upper and lower bounds for conformance scores of uncertain process traces against classic reference models. Another practical application of behavior graphs in the field of process mining over uncertain event data is presented in [27]. Behavior graphs of uncertain traces are employed to determine the number of possible directly-follows relationships between uncertain events, with the end goal of automatically discovering process models from uncertain event data.

Although, as said, the application of the concept of uncertainty in data to process mining is recent, the same idea has precedents in the older field of data mining. Aggarwal and Yu [5] offer an overview of the topic of uncertain data and its analysis, with a strong focus on querying. Such data is modeled on the basis of probabilistic databases [32], a foundational concept in the setting of uncertain data mining. A branch of data mining particularly close to process mining is frequent itemset mining: an efficient algorithm to search for frequent itemsets over uncertain data, U-Apriori, has been described by Chui et al. [11].

Behavior graphs are Directed Acyclic Graphs (DAGs), which are widely used throughout many areas of science to represent dependencies, precedence relationships, time information, or partial orders with a graph-like model. They are effectively utilized in circular dependency analysis in software [8], probabilistic graphical models [9], dynamic graphs analytics [24], and compiler design [7]. In process mining, Conditional Partial Order Graphs (CPOGs), which consist of collections of DAGs, have been exploited by Mokhov et al. [25] to aid the task of process discovery.

We have seen throughout the paper that uncertainty on the timestamp dimension (namely, representing the time at which an event occurred with an interval of possible timestamps) generates a partial order on the precedence relationships of events. Although uncertainty research in process mining provides a novel justification of partial ordering that stems from specific attribute values, the idea of having a partial order instead of a total order among events in a trace has precedents in process mining research. Lu et al. [22][23] examined the problem of conformance checking through alignments in the case of partially ordered traces, and developed a construct to represent conformance called a p-alignment. Genga et al. [15] devised a method to identify highly frequent anomalous patterns in partially ordered process traces. More recently, Van der Aa et al. [1] developed a probabilistic infrastructure that allows inferring the most likely linear extension of a partial order between events in a trace, with the goal of “resolving” the partial order.

An important aspect to notice is that conformance checking over uncertain event

data is not to be confused with stochastic conformance checking, which concerns mea-

suring conformance of certain event data against models enriched with probabilistic in-

formation. The probabilities decorating a stochastic model do not derive from uncer-

tainties in event data, but rather from frequency of activities [21] or from performance

indicators [29].

A review of related work on the topic of the asymptotic complexity of the transi-

tive reduction and the equivalent problem of matrix multiplication is provided with the

complexity analysis of the algorithms examined by this paper, in Section 5.


8 Conclusions

The creation of behavior graphs, a graphical structure of paramount importance for the analysis of uncertain data in the domain of process mining, plays a key role as an initial processing step for both conformance checking and process discovery of process traces containing events with timestamp uncertainty, the most critical type of uncertain behavior. In fact, it allows representing the time relationship between uncertain events, which can be in a partial order. The behavior graph also carries the information regarding other types of uncertainty, like uncertain activity labels and indeterminate events. Such a representation is vital to establish which possible sequence of events in an uncertain trace most closely adheres to the behavior prescribed by a reference model, thereby enabling conformance checking; and to measure the number of possible occurrences of the directly-follows relationship