Uncertain Case Identifiers in Process Mining:
A User Study of the Event-Case Correlation
Problem on Click Data
Marco Pegoraro 1, Merih Seran Uysal 1, Tom-Hendrik Hülsmann 1,
and Wil M.P. van der Aalst 1
1 Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Aachen, Germany
{pegoraro, uysal, vwdaalst},
Among the many sources of event data available today, a prominent one is user interaction data. User activity may be recorded during the use of an application or website, resulting in a type of user interaction data often called click data. An obstacle to the analysis of click data using process mining is the lack of a case identifier in the data. In this paper, we show a case and user study for event-case correlation on click data, in the context of user interaction events from a mobility sharing company. To reconstruct the case notion of the process, we apply a novel method, based on neural networks, to aggregate user interaction data in separate user sessions, which are interpreted as cases. To validate our findings, we qualitatively discuss the impact of process mining analyses on the resulting well-formed event log through interviews with process experts.
Keywords: Process Mining · Uncertain Event Data · Event-Case Correlation · Case Notion Discovery · Unlabeled Event Logs · Machine Learning · Neural Networks · word2vec · UI Design · UX Design.
This work is licensed under a Creative Commons “Attribution-NonCommercial 4.0 International” license.
© the authors. Some rights reserved.
This document is an Author Accepted Manuscript (AAM) corresponding to the following scholarly paper:
Pegoraro, Marco et al. “Uncertain Case Identifiers in Process Mining: A User Study of the Event-Case Correlation Problem on Click Data”. In: 27th International Conference on Exploring Modeling Methods for Systems Analysis and Development, EMMSAD 2022, Proceedings. Ed. by Augusto, Adriano et al. Lecture Notes in Business Information Processing. Springer, 2022
Please cite this document as shown above.
Publication chronology:
2021-11-22: abstract submitted to the International Conference on Advanced Information Systems Engineering (CAiSE) 2022, main track
2021-11-29: full text submitted to the International Conference on Advanced Information Systems Engineering (CAiSE) 2022, main track
2022-03-01: notification of rejection
2022-03-08: full text submitted to the International Working Conference on Exploring Modeling Methods for Systems Analysis and Development (EMMSAD) 2022
2022-04-01: notification of acceptance
2022-04-06: camera-ready version submitted
The published version referred to above is © Springer.
Correspondence to:
Marco Pegoraro, Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Ahornstr. 55, 52074 Aachen, Germany
Website: ·Email: ·ORCID: 0000-0002-8997-7517
Content: 19 pages, 10 figures, 1 table, 18 references. Typeset with pdfLaTeX, Biber, and BibLaTeX.
Please do not print this document unless strictly necessary.
M. Pegoraro et al. A User Study of the Event-Case Correlation Problem
1 Introduction
In the last decades, the dramatic rise of both performance and portability of computing devices has enabled developers to design software with an ever-increasing level of sophistication. Such escalation in functionalities caused a subsequent increase in the complexity of software, making it harder to access for users. The shift from large screens of desktop computers to small displays of smartphones, tablets, and other handheld devices has strongly contributed to this increase in the intricacy of software interfaces. User interface (UI) design and user experience (UX) design aim to address the challenge of managing complexity, to enable users to interact easily and effectively with the software.
In designing and improving user interfaces, important sources of guidance are the records of user interaction data. Many websites and apps track the actions of users, such as pageviews, clicks, and searches. This type of information is often called click data; an example is given in Table 1. Click data can then be analyzed to identify parts of the interface that need to be simplified, through, e.g., pattern mining, or performance measures such as the time spent performing a certain action or viewing a certain page.
Table 1: A sample of click data from the user interactions with the smartphone app of a German mobility sharing company. This dataset is the basis for the qualitative evaluation of the method presented in this paper.
timestamp                screen       user   team   os
2021-01-25 23:00:00.939  pre booking  b0b00  2070b  iOS
2021-01-25 23:00:03.435  tariffs      b0b00  2070b  iOS
2021-01-25 23:00:04.683  menu         3fc0c  02d1f  Android
2021-01-25 23:00:05.507  my bookings  3fc0c  02d1f  Android
In the context of novel click data analysis techniques, a particularly promising subfield of data science is process mining. Process mining is a discipline that aims to analyze event data generated by process executions, to e.g. obtain a model of the process, measure its conformance with normative behavior, or analyze the performance of process instances with respect to time.
Towards the analysis of click data with process mining, a foundational challenge remains: the association of event data (here, user interactions) with a process case identifier. While each interaction logged in a database is associated with a user identifier, which is read from the current active session in the software, there is a lack of an attribute to isolate events corresponding to one single utilization of the software from beginning to end. Aggregating user interactions into cases is of crucial importance, since the case identifier, together with the activity label and the timestamp, is a fundamental attribute to reconstruct a process instance as a sequence of activities (trace), also known as the control-flow perspective of a process instance. The vast majority of the process mining techniques available require the control-flow perspective of a process to be known.
In this paper, we propose a novel case attribution approach for click data. Our method allows us to effectively segment the sequence of interactions from a user into separate cases on the basis of normative behavior. We then verify the effectiveness of our method by applying it to a real-life use case scenario related to a mobility sharing smartphone app. Then, we perform common process mining analyses such as process discovery on the resulting segmented log, and we conduct a user study among business owners by presenting the result of such analyses to process experts from the company. Through interviews with such experts, we assess the impact of process mining analysis techniques enabled by our event-case correlation method.
The remainder of the paper is organized as follows. Section 2 discusses existing event-case correlation methods and other related work. Section 3 illustrates a novel event-case correlation method. Section 4 describes the results of our method on a real-life use case scenario related to a mobility sharing app, together with a discussion of interviews of process experts from the company about the impact of process mining techniques enabled by our method. Finally, Section 5 concludes the paper.
2 Related Work
The problem of assigning a case identifier to events in a log is a long-standing challenge in the process mining community [5], and is known by multiple names in literature, including event-case correlation problem [3] and case notion discovery problem [13]. Event logs where events are missing the case identifier attribute are usually referred to as unlabeled event logs [5]. Several of the attempts to solve this problem, such as an early one by Ferreira et al. based on first-order Markov models [5] or the Correlation Miner by Pourmirza et al., based on quadratic programming [17], are very limited in the presence of loops in the process. Other approaches, such as the one by Bayomie et al. [2], can indeed work in the presence of loops, by relying on heuristics based on activity durations which lead to a set of candidate segmented logs. This comes at the cost of a slow computing time. An improvement of the aforementioned method [3] employs simulated annealing to select an optimal case notion; while still very computationally heavy, this method delivers high-quality case attribution results.
The problem of event-case correlation can be positioned in the broader context of uncertain event data [15,16]. This research direction aims to analyze event data with imprecise attributes, where single traces might correspond to an array of possible real-life scenarios. Akin to the method proposed in this paper, some techniques make it possible to obtain probability distributions over such scenarios [14].
A notable and rapidly-growing field where the problem of event-case correlation is crucial is Robotic Process Automation (RPA), the automation of process activities through software bots. Similar to many approaches related to the problem at large, existing approaches to event-case correlation in the RPA field often heavily rely on unique start and end events in order to segment the log, either explicitly or implicitly [10,18,9].
The problem of event-case attribution is different when considered on click data, particularly from mobile apps. Normally, the goal is to learn a function that receives an event as an independent variable and produces a case identifier as an output. In the scenario studied in this paper, however, the user is tracked by the open session in the app during the interaction, and recorded events with different user identifiers cannot belong to the same process case. The goal is then to subdivide the sequence of interactions from one user into one or more sessions (cases). Marrella et al. [11] examined the challenge of obtaining case identifiers for unsegmented user interaction logs in the context of learnability of software systems, by segmenting event sequences with a predefined set of start and end activities as normative information. They find that this approach cannot discover all types of cases, which limits its flexibility and applicability. Jlailaty et al. [7] encounter the segmentation problem in the context of email logs. They segment cases by designing an ad-hoc metric that combines event attributes such as timestamp, sender, and receiver. Their results, however, show that this method fails on edge cases. Other prominent sources of sequential event data without case attribution are IoT sensors: Janssen et al. [6] address the problem of obtaining process cases from sequential sensor event data by splitting the long traces according to an application-dependent fixed length, to find the optimal sub-trace length such that, after splitting, each case contains only a single activity. One major limitation of this approach that the authors mention is the use of only a single constant length for all of the different activities, which may have varying lengths. More recently, Burattin et al. [4] tackled a segmentation problem for user interactions with a modeling software; in their approach, the segmentation is obtained exploiting eye tracking data.
The goal of the study reported in this paper is to present a method able to rapidly and efficiently segment a user interaction log in a setting where no sample of ground truth cases is available, and the only normative information at disposal is in the form of a link graph, which is relatively easy to extract from a UI. Section 3 shows the segmentation technique we propose.
3 Method
In this section, we illustrate our proposed method for event-case correlation on click data. As mentioned earlier, the goal is to segment the sequence of events corresponding to the interactions of every user in the database into complete process executions (cases). In fact, the click data we consider in this study have a property that we need to account for while designing our method: all events belonging to one case are contiguous in time. Thus, our goal is to determine split points for different cases in a sequence of interactions related to the same user. More concretely, if a user of the app produces the sequence of events $\langle e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9 \rangle$, our goal is to section such sequence in contiguous subsequences that represent a complete interaction, for instance $\langle e_1, e_2, e_3, e_4 \rangle$, $\langle e_5, e_6 \rangle$, and $\langle e_7, e_8, e_9 \rangle$. We refer to this as the log segmentation problem, which can be considered a special case of the event-case correlation problem. In this context, “unsegmented log” is synonymous with “unlabeled log”.
Rather than being based on a collection of known complete process instances as training set, the creation of our segmentation model is based on behavior described by a model of the system. A type of model particularly suited to the problem of segmentation of user interaction data, and especially click data, is the link graph. In fact, since the activities in our process correspond to screens in the app, a graph of the links in the app is relatively easy to obtain, since it can be constructed in an automatic way by following the links between views in the software. This link graph will be the basis for our training data generation procedure.
We will use as running example the link graph of Figure 1. The resulting normative
traces will then be used to train a neural network model based on the word2vec architec-
ture [12], which will be able to split contiguous user interaction sequences into cases.
3.1 Training Log Generation
To generate the training data, we will begin by exploiting the fact that each process case will only contain events associated with one and only one user. Let $L$ be our unsegmented log and $u \in U$ be a user in $L$; then, we indicate with $L_u$ the sub-log of $L$ where all events are associated with the user $u$.
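As a minimal sketch of this first step, the following Python snippet groups the four events of Table 1 into one sub-log per user. The function name `build_sublogs` and the tuple layout are illustrative assumptions, not part of the original method description.

```python
from collections import defaultdict

# Rows copied from Table 1: (timestamp, screen, user, team, os).
EVENTS = [
    ("2021-01-25 23:00:04.683", "menu", "3fc0c", "02d1f", "Android"),
    ("2021-01-25 23:00:00.939", "pre booking", "b0b00", "2070b", "iOS"),
    ("2021-01-25 23:00:05.507", "my bookings", "3fc0c", "02d1f", "Android"),
    ("2021-01-25 23:00:03.435", "tariffs", "b0b00", "2070b", "iOS"),
]


def build_sublogs(events):
    """Return {user: [screen, ...]} with each user's screens in time order."""
    sublogs = defaultdict(list)
    # Timestamps in this fixed-width format sort correctly as plain strings.
    for _timestamp, screen, user, _team, _os in sorted(events):
        sublogs[user].append(screen)
    return dict(sublogs)


SUBLOGS = build_sublogs(EVENTS)
```

Each value of `SUBLOGS` plays the role of one unsegmented sequence $L_u$ in the remainder of this section.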
Our training data will be generated by simulating a transition system annotated with probabilities. The construction of a transition system based on event data is a well-known procedure in process mining [1], which requires choosing an event representation abstraction and a window size (or horizon), which are process-specific. In the context of this section, we will show our method using a sequence abstraction with window size 2. Initially, for each user $u \in U$ we create a transition system $TS_u = (S_u, E_u, T_u, i)$ based on the sequence of user interactions in the sub-log $L_u$. $S_u^{end} \subseteq S_u$ denotes the final states of $TS_u$. All such transition systems $TS_u$ share the same initial state $i$. To identify the end of sequences, we add a special symbol to the states, $f \in S'$, to which we connect any state $s \in S$ if it appears at the end of a user interaction sequence. To traverse the transitions to the final state $f$ we utilize as placeholder the empty label $\tau$.
We then obtain a transition system $TS' = (S', A, T', i)$ corresponding to the entire log $L$, where $A$ is the set of activity labels appearing in $L$, $S' = \bigcup_{u \in U} S_u$, and $T' = \bigcup_{u \in U} T_u$. Moreover, ${S'}^{end} = \bigcup_{u \in U} S_u^{end}$. We also collect information about the frequency of each transition in the log: we define a weighting function $\omega$ for the transitions $t \in T$ where $\omega(t)$ = # of occurrences of $t$ in $L$. If $t \notin T$, $\omega(t) = 0$. Through $\omega$, it is optionally possible to filter out rare behavior by deleting transitions with $\omega(t) < \epsilon$, for a small threshold $\epsilon$. Figure 2 shows a transition system with the chosen abstraction and window size, annotated with both frequencies and transition labels, for the user interactions $L_{u_1} = \langle M, A, M, B, C \rangle$, $L_{u_2} = \langle M, B, C, M \rangle$, and $L_{u_3} = \langle M, A, B, C \rangle$.
Figure 1: The link graph of a simple, fictional system that we are going to use as running example. From this process, we aim to segment the three unsegmented user interactions $\langle M, A, M, B, C \rangle$, $\langle M, B, C, M \rangle$, and $\langle M, A, B, C \rangle$.
Figure 2: The transition system $TS'$ obtained by the user interaction data of the example (Figure 1). During the reduction phase, the transition $(M, A)$ to $(A, M)$ is removed, since it is not supported by the link graph ($M$ does not follow $A$). The state $(A, M)$ is not reachable and is removed entirely (in red). Consequently, the reduced transition system $TS$ is obtained.
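The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `INITIAL`, `FINAL`, `TAU`, and `count_transitions` are hypothetical, states are encoded as tuples holding the last two activities (sequence abstraction, window size 2), and the weighting function is stored as a counter keyed by (state, label, next state).

```python
from collections import Counter

# Hypothetical encodings of the initial state i, the final state f,
# and the empty label tau.
INITIAL, FINAL, TAU = "i", "f", "tau"


def count_transitions(sequences, window=2):
    """Count transitions (state, label, next_state) over unsegmented sequences."""
    omega = Counter()
    for sequence in sequences:
        state, history = INITIAL, ()
        for activity in sequence:
            # Sequence abstraction: a state is the last `window` activities seen.
            history = (history + (activity,))[-window:]
            omega[(state, activity, history)] += 1
            state = history
        # Mark the end of the user interaction sequence.
        omega[(state, TAU, FINAL)] += 1
    return omega


# The three user interaction sequences of the running example.
USER_SEQUENCES = [
    ("M", "A", "M", "B", "C"),
    ("M", "B", "C", "M"),
    ("M", "A", "B", "C"),
]
OMEGA = count_transitions(USER_SEQUENCES)
```

On the running example, this yields 11 distinct transitions; for instance, the transition from the initial state to the state `("M",)` has frequency 3, since all three sequences start with M.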
In contrast to transition systems that are created based on logs that are segmented, the obtained transition system might contain states that are not reachable and transitions that are not possible according to the real process. Normally, the transition system abstraction is applied on a case-by-case basis. In our case, however, we applied the abstraction to the whole sequence of interactions that is associated with a specific user; as a result, consecutive interactions that belong to different cases will be included as undesired transitions in the transition system. In order to prune undesired transitions from the transition system, we exploit the link graph of the system: a transition in the transition system is only valid if it appears in the link graph. Unreachable states are also pruned.
We will assume a sequence abstraction in $TS$. Given a link graph $G = (V, E)$, we define the reduced transition system $TS = (S, A, T, i)$, where $T = \{(\langle \ldots, a_1 \rangle, a_2, \langle \ldots, a_1, a_2 \rangle) \in T' \mid (a_1, a_2) \in E\}$ and $S = \bigcup_{(s_1, a, s_2) \in T} \{s_1, s_2\}$. Figure 1 shows a link graph for our running example, and Figure 2 shows how this is used to reduce $TS'$ into $TS$.
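The reduction step can be illustrated with the following sketch. The edge set `LINK_EDGES` is an assumption made for illustration, since the link graph of Figure 1 is not reproduced in the text; the only fact taken from the source is that M does not follow A. The sketch also assumes that transitions leaving the initial state and transitions labeled with the empty label are always kept. All names are hypothetical.

```python
from collections import deque

INITIAL, FINAL, TAU = "i", "f", "tau"  # hypothetical encodings of i, f, tau

# Assumed link graph edges (a1, a2): "a2 can follow a1 in the app".
LINK_EDGES = {("M", "A"), ("M", "B"), ("A", "B"), ("B", "C"), ("C", "M")}

# Transition frequencies of the running example (window size 2).
OMEGA = {
    (INITIAL, "M", ("M",)): 3,
    (("M",), "A", ("M", "A")): 2,
    (("M",), "B", ("M", "B")): 1,
    (("M", "A"), "M", ("A", "M")): 1,
    (("A", "M"), "B", ("M", "B")): 1,
    (("M", "A"), "B", ("A", "B")): 1,
    (("A", "B"), "C", ("B", "C")): 1,
    (("M", "B"), "C", ("B", "C")): 2,
    (("B", "C"), "M", ("C", "M")): 1,
    (("B", "C"), TAU, FINAL): 2,
    (("C", "M"), TAU, FINAL): 1,
}


def reduce_transition_system(omega, link_edges):
    """Drop transitions unsupported by the link graph, then unreachable states."""
    kept = {
        (src, label, dst): count
        for (src, label, dst), count in omega.items()
        if src == INITIAL or label == TAU or (src[-1], label) in link_edges
    }
    # Breadth-first search from the initial state over the kept transitions.
    successors = {}
    for src, _label, dst in kept:
        successors.setdefault(src, set()).add(dst)
    reachable, queue = {INITIAL}, deque([INITIAL])
    while queue:
        for nxt in successors.get(queue.popleft(), ()):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)
    return {t: c for t, c in kept.items() if t[0] in reachable}


REDUCED = reduce_transition_system(OMEGA, LINK_EDGES)
```

With these assumed edges, the transition from `("M", "A")` to `("A", "M")` is dropped, the state `("A", "M")` becomes unreachable, and its outgoing transition is removed as well, leaving 9 transitions.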
Next, we define probabilities for transitions and states based on the values for $\omega(t)$. Let $T_{out} : S \to \mathcal{P}(T)$ be $T_{out}(s) = \{(s_1, a, s_2) \in T \mid s_1 = s\}$; this function returns all outgoing transitions from a given state. The likelihood of a transition $(s_1, a, s_2) \in T$ is then computed with $l_{trans} : T \to [0, 1]$:
$$l_{trans}(s_1, a, s_2) = \frac{\omega(s_1, a, s_2)}{\sum_{t \in T_{out}(s_1)} \omega(t)}$$
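Note that if $s_1$ has no outgoing transition and $T_{out}(s_1) = \emptyset$, by definition $l_{trans}(s_1, a, s_2) = 0$ for any $a \in A$ and $s_2 \in S$. We will need two more supporting functions. We define $l_{start} : S \to [0, 1]$ and $l_{end} : S \to [0, 1]$ as the probabilities that a state $s \in S$ is, respectively, the initial and final state of a sequence:
$$l_{start}(s) = \frac{\sum_{(i, a, s) \in T} \omega(i, a, s)}{\sum_{(s', a, s'') \in T,\; s' = i} \omega(s', a, s'')} \qquad l_{end}(s) = \frac{\omega(s, \tau, f)}{\sum_{(s', a, s'') \in T,\; s'' = f} \omega(s', a, s'')}$$
In our running example of Figure 2, $l_{start}((M)) = \frac{3}{3} = 1$, and $l_{end}((C, M)) = \frac{1}{3}$. Given a path of states $\langle s_1, s_2, \ldots, s_n \rangle$ transitioning through the sequence $\langle (i, a_1, s_1), (s_1, a_2, s_2), \ldots, (s_{n-1}, a_n, s_n), (s_n, \tau, f) \rangle$, we now have the means to compute its probability with the function $l : S^* \to [0, 1]$:
$$l(\langle s_1, s_2, \ldots, s_n \rangle) = l_{start}(s_1) \cdot \prod_{i=2}^{n} l_{trans}(s_{i-1}, a_i, s_i) \cdot l_{end}(s_n)$$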
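To make these quantities concrete, the sketch below computes them with exact fractions on the reduced transition system of the running example. It rests on assumptions: the start and end likelihoods are normalized over all transitions leaving the initial state and all transitions entering the final state respectively, which is one reading consistent with the example value $l_{start}((M)) = 1$; the frequencies in `OMEGA` follow from the three example sequences after pruning with an assumed link graph; all function and constant names are hypothetical.

```python
from fractions import Fraction

I, F, TAU = "i", "f", "tau"  # hypothetical encodings of i, f, tau

# Reduced transition system of the running example: (source, label, target) -> frequency.
OMEGA = {
    (I, "M", ("M",)): 3,
    (("M",), "A", ("M", "A")): 2,
    (("M",), "B", ("M", "B")): 1,
    (("M", "A"), "B", ("A", "B")): 1,
    (("A", "B"), "C", ("B", "C")): 1,
    (("M", "B"), "C", ("B", "C")): 2,
    (("B", "C"), "M", ("C", "M")): 1,
    (("B", "C"), TAU, F): 2,
    (("C", "M"), TAU, F): 1,
}


def l_trans(omega, transition):
    """Frequency of a transition relative to all transitions leaving its source."""
    out_total = sum(n for (src, _, _), n in omega.items() if src == transition[0])
    return Fraction(omega.get(transition, 0), out_total) if out_total else Fraction(0)


def l_start(omega, state):
    """Share of start transitions (leaving I) that lead to `state`."""
    starts = {t: n for t, n in omega.items() if t[0] == I}
    total = sum(starts.values())
    hits = sum(n for (_, _, dst), n in starts.items() if dst == state)
    return Fraction(hits, total) if total else Fraction(0)


def l_end(omega, state):
    """Share of end transitions (entering F) that leave from `state`."""
    ends = {t: n for t, n in omega.items() if t[2] == F}
    total = sum(ends.values())
    return Fraction(ends.get((state, TAU, F), 0), total) if total else Fraction(0)


def path_likelihood(omega, path):
    """l_start(s_1) * product of l_trans over inner transitions * l_end(s_n)."""
    first, *inner, last = path
    result = l_start(omega, first[2]) * l_end(omega, last[0])
    for transition in inner:
        result *= l_trans(omega, transition)
    return result


# Path for the trace <M, B, C>, including the start and end transitions.
PATH = [
    (I, "M", ("M",)),
    (("M",), "B", ("M", "B")),
    (("M", "B"), "C", ("B", "C")),
    (("B", "C"), TAU, F),
]
```

Under these assumptions, the path for the trace $\langle M, B, C \rangle$ has likelihood $1 \cdot \frac{1}{3} \cdot 1 \cdot \frac{2}{3} = \frac{2}{9}$.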
This enables us to obtain an arbitrary number of well-formed process cases as sequences of activities $\langle a_1, a_2, \ldots, a_n \rangle$, utilizing a Monte Carlo procedure. We can sample a random starting state for the case, through the probability distribution given by $l_{start}$; then, we compose a path with the probabilities provided by $l_{trans}$ and $l_{end}$. The traces sampled in this way will reflect the available user interaction data in terms of initial and final activities, and internal structure, although the procedure still allows for generalization. Such generalization is, however, controlled thanks to the pruning provided by the link graph of the system. We will refer to the set of generated traces as the training log $L_T$.
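A simplified version of this sampling procedure is sketched below. It is not the exact procedure of the paper: as a simplifying assumption, the walk picks each outgoing transition proportionally to its frequency and stops when the empty label is drawn, so that the end of a trace is decided by the same weighted choice. The frequencies are those of the reduced running example (under an assumed link graph), and all names are hypothetical.

```python
import random

I, F, TAU = "i", "f", "tau"  # hypothetical encodings of i, f, tau

# Reduced transition system of the running example: (source, label, target) -> frequency.
OMEGA = {
    (I, "M", ("M",)): 3,
    (("M",), "A", ("M", "A")): 2,
    (("M",), "B", ("M", "B")): 1,
    (("M", "A"), "B", ("A", "B")): 1,
    (("A", "B"), "C", ("B", "C")): 1,
    (("M", "B"), "C", ("B", "C")): 2,
    (("B", "C"), "M", ("C", "M")): 1,
    (("B", "C"), TAU, F): 2,
    (("C", "M"), TAU, F): 1,
}


def sample_trace(omega, rng, max_len=50):
    """Random walk from the initial state; stops on the empty label."""
    state, trace = I, []
    while len(trace) < max_len:
        options = [(t, n) for t, n in omega.items() if t[0] == state]
        if not options:
            break  # dead end: no outgoing transition
        transitions, weights = zip(*options)
        _, label, state = rng.choices(transitions, weights=weights, k=1)[0]
        if label == TAU:
            break  # reached the final state
        trace.append(label)
    return tuple(trace)


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
TRAINING_LOG = [sample_trace(OMEGA, rng) for _ in range(100)]
```

Every sampled trace starts with M and only contains directly-follows pairs that survived the pruning, which is the controlled generalization described above.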
3.2 Model Training
The training log $L_T$ obtained in Section 3.1 is now used in order to train the segmentation models. The core component of the proposed method consists of one or more word2vec models to detect the boundaries between cases in the input log. When applied for natural language processing, the input of a word2vec model is a corpus of sentences which consist of words. Instead of sentences built as sequences of words, we consider traces $\langle a_1, a_2, \ldots, a_n \rangle$ as sequences of activities.
The training log $L_T$ needs an additional processing step to be used as training set for word2vec. Given two traces $\sigma_1 \in L_T$ and $\sigma_2 \in L_T$, we build a training instance by joining them in a single sequence, concatenating them with a placeholder activity $\blacksquare$. So, for instance, the traces $\sigma_1 = \langle a_1, a_2, a_4, a_5 \rangle \in L_T$ and $\sigma_2 = \langle a_6, a_7, a_8 \rangle \in L_T$ are combined in the training sample $\langle a_1, a_2, a_4, a_5, \blacksquare, a_6, a_7, a_8 \rangle$. This is done repeatedly, shuffling the order of the traces. Figure 3 shows this processing step on the running example.
Figure 3: Construction of the training instances. Traces are shuffled and concatenated with a placeholder end activity.
Figure 4: The word2vec neural network. Given the sequence $\langle A, ?, C \rangle$, the network produces a probability distribution over the possible activity labels for ?.
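The construction of one training instance can be sketched as follows. The token `"<END>"` is a hypothetical stand-in for the placeholder activity, and `build_training_instance` is an illustrative name rather than the authors' implementation.

```python
import random

PLACEHOLDER = "<END>"  # hypothetical token standing in for the placeholder activity


def build_training_instance(traces, rng):
    """Shuffle the traces and join them, separated by the placeholder."""
    order = list(traces)
    rng.shuffle(order)
    instance = []
    for index, trace in enumerate(order):
        if index > 0:
            instance.append(PLACEHOLDER)
        instance.extend(trace)
    return instance


TRACES = [("a1", "a2", "a4", "a5"), ("a6", "a7", "a8")]
INSTANCE = build_training_instance(TRACES, random.Random(0))
```

Calling the function repeatedly with the same random generator yields differently ordered concatenations of the same traces, so that the model sees the placeholder in many different contexts.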
The word2vec model [12] consists of three layers: an input layer, a single hidden layer, and the output layer. This model has already been successfully employed in process mining to solve the problem of missing events [8]. During training, the network reads the input sequences with a sliding window. The activity occupying the center of the sliding window is called the center action, while the surrounding activities are called context actions. The proposed method uses the Continuous Bag-Of-Words (CBOW) variant of word2vec, where the context actions are introduced as input in the neural network in order to predict the center action. The error measured in the output layer is used for training in order to adjust the weights in the neural network, using the backpropagation algorithm. These forward and backward steps of the training procedure are repeated for all the positions of the sliding window and all the sequences in the training set; when fully trained, the network will output a probability distribution for the center action given the context actions. Figure 4 shows an example of likelihood estimation for a center action in our running example, with a sliding window of size 3.
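The sliding-window reading of the input can be illustrated with a small sketch that only enumerates (context actions, center action) pairs; it does not train a network. As a simplifying assumption, only positions with a full window are emitted, whereas word2vec implementations typically also handle truncated windows at the sequence edges. The function name `cbow_pairs` is hypothetical.

```python
def cbow_pairs(sequence, window=3):
    """List (context actions, center action) pairs for an odd window size."""
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd number >= 3")
    half = window // 2
    pairs = []
    for center in range(half, len(sequence) - half):
        # Context: `half` actions before and `half` actions after the center.
        before = tuple(sequence[center - half:center])
        after = tuple(sequence[center + 1:center + half + 1])
        pairs.append((before + after, sequence[center]))
    return pairs


PAIRS = cbow_pairs(["M", "A", "B", "C"], window=3)
```

For the sequence $\langle M, A, B, C \rangle$ and a window of size 3, this produces the pairs ((M, B), A) and ((A, C), B); the second one corresponds to the query $\langle A, ?, C \rangle$ of Figure 4.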
3.3 Segmentation
Through the word2vec model we trained in Section 3.2, we can now estimate the likelihood of a case boundary at any position of a sequence of user interactions. Figure 5 shows these estimates on one user interaction sequence from the running example. Note that this method of computing likelihoods is easy to extend to an ensemble of predictive models: the different predicted values can then be aggregated, e.g., with the mean or the median.
Next, we use these scores to determine case boundaries, which will correspond to prominent peaks in the graph. Let $\langle p_1, p_2, \ldots, p_n \rangle$ be the sequence of likelihoods of a case boundary obtained on a user interaction sequence. We consider $p_i$ a boundary if it satisfies the following conditions: first, $p_i > b_1 \cdot p_{i-1}$; then, $p_i > b_2 \cdot p_{i+1}$; finally, $p_i > b_3 \cdot \frac{\sum_{j=i-k}^{i-1} p_j}{k}$, where $b_1, b_2, b_3 \in [1, \infty)$ and $k \in \mathbb{N}$ are hyperparameters that influence the sensitivity of the segmentation. The first two inequalities use $b_1$ and $b_2$ to ensure that the score is sufficiently higher than the immediate predecessor and successor. The third inequality uses $b_3$ to make sure that the likelihood is also significantly higher than a neighborhood defined by the parameter $k$.
Figure 5: A plot indicating the chances of having a case segment for each position of the user interaction data (second and third trace from the example in Figure 1).
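The peak selection can be sketched as below. Two assumptions are made: the third condition compares $p_i$ against $b_3$ times the mean of the $k$ preceding likelihoods, and positions without a successor or without $k$ predecessors are skipped. Indices are 0-based, and the function name and default parameter values are purely illustrative.

```python
def find_boundaries(likelihoods, b1=2.0, b2=2.0, b3=2.0, k=2):
    """Return 0-based positions whose likelihood is a sufficiently prominent peak."""
    boundaries = []
    for i in range(k, len(likelihoods) - 1):
        p = likelihoods[i]
        neighborhood_mean = sum(likelihoods[i - k:i]) / k
        if (
            p > b1 * likelihoods[i - 1]      # higher than the predecessor
            and p > b2 * likelihoods[i + 1]  # higher than the successor
            and p > b3 * neighborhood_mean   # higher than the preceding neighborhood
        ):
            boundaries.append(i)
    return boundaries


SCORES = [0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.7, 0.1]
BOUNDARIES = find_boundaries(SCORES)
```

Larger values of `b1`, `b2`, and `b3` make the selection stricter: on the scores above, positions 3 and 6 are selected with the default factor 2, while a factor of 10 would reject both.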
These three conditions allow us to select valid case boundaries within user interaction sequences. Splitting the sequences on such boundaries yields traces of complete process executions, whose events will be assigned a unique case identifier. The set of such traces then constitutes a traditional event log, ready to be analyzed with established process mining techniques.
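A minimal sketch of this final step follows. It assumes that a boundary position marks the index (0-based) at which a new case starts, and it builds case identifiers from the user identifier and a running number; both conventions, as well as the function name, are illustrative assumptions.

```python
def assign_case_ids(user, events, boundaries):
    """Split one user's events at the boundaries; return {case_id: [event, ...]}."""
    cuts = sorted({b for b in boundaries if 0 < b < len(events)}) + [len(events)]
    cases, start = {}, 0
    for number, cut in enumerate(cuts, start=1):
        # Hypothetical case identifier scheme: "<user>-<running number>".
        cases[f"{user}-{number}"] = events[start:cut]
        start = cut
    return cases


EVENTS = ["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9"]
CASES = assign_case_ids("u1", EVENTS, boundaries=[4, 6])
```

With boundaries at positions 4 and 6, the nine events are split into the three cases used as an example at the beginning of Section 3.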
4 User Study
In order to validate the utility of process mining workflows in the area of user behavior analysis, a case study was conducted. Such study also aims at assessing the quality of the segmentation produced by the proposed method in a real-life setting, in an area where the ground truth is not available (i.e., there are no normative well-formed cases). We applied the proposed method to a dataset which contains real user interaction data collected from the mobile applications of a German vehicle sharing company. We then utilized the resulting segmented log to analyze user behavior with an array of process mining techniques. Then, the results were presented to process experts from the company, who utilized such results to identify critical areas of the process and suggest improvements.
Figure 6: DFG automatically discovered from the log segmented by our method.
In the data, the abstraction for recorded user interactions is the screen (or page) in the app. For each interaction, the system recorded five attributes: timestamp, screen, user, team, and os. The timestamp marks the point in time when the user visited the screen, which is identified by the screen attribute, our activity label. The user attribute identifies who performed the interaction, and the team attribute is an additional field referring to the vehicle provider associated with the interaction. Upon filtering out pre-login screens (not associated with a user), the log consists of about 990,000 events originating from about 12,200 users. A snippet of these click data was shown in Table 1, in Section 1.
We applied the segmentation method presented in Section 3 to this click data. We then analyzed the resulting log with well-known process mining techniques. Lastly, the findings were presented to and discussed with four experts from the company, consisting of one UX expert, two mobile developers and one manager from a technical area. All of the participants are working directly on the application and are therefore highly familiar with it. We will report here the topics of discussion in the form of questions; for reasons of space, we will only document a selection of the most insightful questions.
Q1: Draw your own process model of the user interactions.
The participants were asked to draw a Directly-Follows Graph (DFG) describing the most common user interactions with the app. A DFG is a simple process model consisting of a graph where activities A and B are connected by an arc if B is executed immediately after A. The concept of this type of graph was explained to the participants beforehand. The experts were given five minutes in order to create their models. A cleaned up representation of the resulting models can be seen in Figures 7 and 8.
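For readers unfamiliar with DFGs, the following sketch counts directly-follows pairs over a small set of traces. The two traces are toy examples built from the activity labels of the running example in Section 3, not data from the study, and the function name is illustrative.

```python
from collections import Counter


def directly_follows_graph(traces):
    """Count how often activity b immediately follows activity a."""
    dfg = Counter()
    for trace in traces:
        dfg.update(zip(trace, trace[1:]))
    return dfg


TOY_TRACES = [("M", "A", "B", "C"), ("M", "B", "C")]
DFG = directly_follows_graph(TOY_TRACES)
```

Each key of `DFG` is an arc (a, b) of the graph, and its value is the frequency with which b was executed immediately after a.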
For comparison, we created a DFG of the segmented log (Figure 6). Such model was configured to contain a similar number of different screens as the expert models. The colors indicate the agreement between the model and the expert models. Darker colors signify that a screen was included in more expert models. The dashed edges between the screens signify edges that were identified by the generated model, but are not present in the participants' models.
Figure 7: DFGs created by three of the process experts as part of Q1.
Figure 8: DFG created by one of the process experts as part of Q1.
The mobile developers (models A and B) tend to describe the interactions in a more precise way that follows the different screens more closely, while the technical manager and UX expert (C and D) provided models that capture the usage of the application in a more abstract way. The fact that the computed model and the expert models are overall very similar to each other suggests that our proposed method is able to create a segmentation that contains cases that are able to accurately describe the real user behavior.
Q2: Given this process model that is based on interactions ending on the
booking screen, what are your observations?
Given the process model shown in Figure 9, the participants were surprised by the fact that the map-based dashboard type is used significantly more frequently than the basic dashboard. Additionally, two of the experts were surprised by the number of users that are accessing their bookings through the list of all bookings (my bookings). This latter observation was also made during the analysis of the segmented log and is the reason that this process model was presented to the experts. In general, a user that has created a booking for a vehicle can access this booking directly from all of the different types of dashboards. The fact that a large fraction of the users take a detour through the menu and booking list in order to reach the booking screen is therefore surprising. This circumstance was actually already identified by one of the mobile developers some time before this evaluation, while they were manually analyzing the raw interaction recordings data. They noticed this behavior because they repeatedly encountered the underlying pattern while working with the data for other unrelated reasons. Using the segmented user interaction log, the behavior was however much more discoverable and supported by concrete data rather than just a vague feeling. Another observation that was not made by the participants is that the path through the booking list is more frequently taken by users that originate from the map-based dashboard rather than the basic dashboard. The UX expert suspected that this may have been the case, because the card that can be used to access a booking from the dashboard is significantly smaller on the map-based dashboard and may therefore be missed more frequently by the users. This is a concrete actionable finding of the analysis that was only made possible by the use of process mining techniques in conjunction with the proposed method.
Q3: What is the median time a user takes to book a vehicle?
The correct answer to this question is 66 seconds. This was calculated based on the median time of all cases in which a vehicle booking was confirmed. Three participants gave the answers 420 seconds, 120 seconds and 120 seconds. The fourth participant argued that this time may depend on the type of dashboard that the user is using and answered 300 seconds for the basic dashboard and 120 seconds for the map-based dashboard. When asked to settle on only one time, the participant gave an answer of 180 seconds. Overall, this means that the experts estimated, on average, a median duration for this task of 3 minutes and 30 seconds. This again is a significant overestimation compared to the value that was obtained by analyzing the real user behavior. Again, a mismatch between the perception of the experts and the real behavior of the users was revealed.
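The aggregate figure can be reproduced with a short calculation over the four final answers reported above; the variable names are illustrative.

```python
from statistics import mean

ESTIMATES_S = [420, 120, 120, 180]  # final answers of the four experts, in seconds
MEASURED_MEDIAN_S = 66              # median booking time measured on the segmented log

average_estimate = mean(ESTIMATES_S)
minutes, seconds = divmod(average_estimate, 60)
overestimation_factor = average_estimate / MEASURED_MEDIAN_S
```

The average of the estimates is 210 seconds, i.e., 3 minutes and 30 seconds, which is a little more than three times the measured median of 66 seconds.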
Figure 9: A process model created using Disco, with the booking screen as endpoint of the process.
Q4: Given this process model that is based on interactions ending on the confirm booking screen (Figure 10), what are your observations?
Figure 10: A process model based on cases that begin in any dashboard and end on the confirm booking
Several of the experts observed that the screens that show details about the vehicles and
the service, such as tariffs, insurance details and car features, are seemingly
used much less frequently than expected. In only about 2-10% of cases, the user
visits these screens before booking a vehicle. When considering the concrete numbers,
the availability calendar screen (which is used to choose a timeframe for the
booking) and the tariffs screen (which displays pricing information) are used most
frequently before a booking confirmation. This suggests that time and pricing
information are significantly more important to the users than information about the
vehicle or about the included insurance. These findings sparked a detailed discussion
between the experts about the possible reasons for the observed behavior. Overall, this
shows that models obtained from segmented user interaction logs are an important tool
for the analysis of user behavior and that these models provide a valuable foundation
for a more detailed analysis by the process experts. Another observation regarding this
model was that a majority of the users seem to choose a vehicle directly from the
dashboard cards present in the app rather than using the search functionality. This
suggests that the users are more interested in the vehicle itself, rather than looking
for any available vehicle at a certain point in time.
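Percentages such as those discussed for Q4 can be derived from a segmented log by counting, among the cases that reach the booking confirmation, how many visit a given screen beforehand. The sketch below is one possible way to do this in Python; the traces and screen identifiers are hypothetical, and each case is reduced to its ordered list of screens.

```python
from collections import Counter

# Hypothetical traces: one ordered list of visited screens per case.
traces = [
    ["dashboard", "availability_calendar", "confirm_booking"],
    ["dashboard", "tariffs", "availability_calendar", "confirm_booking"],
    ["dashboard", "car_features"],  # never reaches the confirmation
    ["dashboard", "confirm_booking"],
]


def screens_before_booking(cases, target="confirm_booking"):
    """Share of booking cases that visit each screen before the confirmation."""
    counts = Counter()
    booking_cases = 0
    for trace in cases:
        if target not in trace:
            continue  # only cases ending in a confirmation are considered
        booking_cases += 1
        prefix = trace[: trace.index(target)]
        counts.update(set(prefix))  # count each screen at most once per case
    if booking_cases == 0:
        return {}
    return {screen: n / booking_cases for screen, n in counts.items()}


shares = screens_before_booking(traces)
for screen, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{screen}: {share:.0%}")
```

Counting each screen once per case means that repeated visits within a session do not inflate the share.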
Q5: Discuss the fact that 2% of users activate the intermediate lock before
ending the booking.
The smartphone application ofers the functionality to lock certain kinds of vehicles dur-
ing an active booking. This is for example possible for bicycles, which can be locked
by the users during the booking whenever they are leaving the bicycle alone. To do so,
the intermediate lock and intermediate action screens are used. During the
analysis, it was found that 2 of users use this functionality in order to lock the vehicle
directly before ending the booking. This is noteworthy, as it is not necessary to manually
lock the vehicle before returning it. All vehicles are automatically locked by the system
at the end of each booking. One expert argued that this may introduce additional tech-
nical diculties during the vehicle return, because the system will try to lock the vehicle
again. These redundant lock operations, discovered analyzing the segmented log, may
introduce errors in the return process.
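A pattern like the redundant lock can be found in a segmented log with a simple check on consecutive events. The following Python sketch flags cases in which the lock screen immediately precedes the end of the booking. The screen identifiers, the example traces and the strict "immediately before" reading of "directly before" are assumptions of this example. It also counts cases rather than users, which is a simplification of the per-user figure reported above.

```python
# Hypothetical traces: one ordered list of visited screens per case.
traces = [
    ["start_booking", "intermediate_lock", "end_booking"],
    ["start_booking", "intermediate_lock", "intermediate_action", "end_booking"],
    ["start_booking", "end_booking"],
    ["start_booking", "intermediate_lock"],  # booking not ended in this case
]


def has_redundant_lock(trace, lock="intermediate_lock", end="end_booking"):
    """True if the lock screen is visited immediately before the booking ends."""
    if end not in trace:
        return False
    end_position = trace.index(end)
    return end_position > 0 and trace[end_position - 1] == lock


flagged = [trace for trace in traces if has_redundant_lock(trace)]
redundant_share = len(flagged) / len(traces)
print(f"{redundant_share:.0%} of cases lock the vehicle right before ending the booking")
```

A looser variant could allow a small number of intermediate screens, or a time window, between the lock and the end of the booking.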
Q6: Discuss the fact that only 5% of users visit damages and cleanliness.
The application allows users to report damages to the vehicles and rate their cleanliness,
through the pages of the same name. It was possible to observe that only a small
percentage of the users seem to follow this routine, which was surprising to the experts.
For the vehicle providers it is generally important that the users report problems with
the vehicles; optimally, every user should do this for all of their bookings. According
to the data, however, this is not the case, as only a small percentage of the users
actually use both functionalities. The experts therefore concluded that better
communication of these functionalities is required.
5 Conclusion
In this paper, we presented a case and user study on the problem of event-case
correlation. This classic process mining problem was examined here in the specific
application domain of user interaction data.
We examined a case study, the analysis of click data from a mobility sharing smartphone
application. To perform log segmentation, we proposed an original technique
based on the word2vec neural network architecture, which can obtain case identification
for an unlabeled user interaction log on the sole basis of a link graph of the system
as normative information. We then presented a user study, where experts of the process
were confronted with insights obtained by applying process mining techniques to the
log segmented using our method. The interviews with experts confirm that our technique
helped to uncover hidden characteristics of the process, including inefficiencies
and anomalies unknown to the domain knowledge of the business owners. Importantly,
the analyses yielded actionable suggestions for UI/UX improvements. This substantiates
both the scientific value of event-log correlation techniques for user interaction data,
and the validity of the segmentation method presented in this paper.
Many avenues for future work are possible. The most prominent one is the need to
further validate our technique by lifting it from the scope of a user study by means of a
quantitative evaluation, to complement the qualitative one shown in this paper. Our
segmentation technique has several points of improvement, including the relatively high
number of hyperparameters: thus, it would benefit from a heuristic procedure to determine
the (starting) value for such hyperparameters. Lastly, it is important to consider
additional event data perspectives: one possibility, in this regard, is to add the data
perspective to the technique, by encoding additional attributes to train the neural network.
We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research in-
[1] van der Aalst, Wil M. P., Vladimir A. Rubin, H. M. W. Verbeek, et al. “Process
mining: a two-step approach to balance between undertting and overtting”. In:
Soware and Systems Modeling 9.1 (2010), pp. 87–111. doi:10.1007/s10270-
[2] Bayomie, Dina, Ahmed Awad, and Ehab Ezat. “Correlating Unlabeled Events
from Cyclic Business Processes Execution”. In: Advanced Information Systems
Engineering - 28th International Conference, CAiSE 2016, Ljubljana, Slovenia,
June 13-17, 2016. Proceedings. Ed. by Nurcan, Selmin, Pnina Sofer, Marko Bajec,
et al. Vol. 9694. Lecture Notes in Computer Science. Springer, 2016, pp. 274–289.
[3] Bayomie, Dina, Claudio Di Ciccio, Marcello La Rosa, et al. “A Probabilistic Ap-
proach to Event-Case Correlation for Process Mining”. In: Conceptual Modeling
- 38th International Conference, ER 2019, Salvador, Brazil, November 4-7, 2019,
Proceedings. Ed. by Laender, Alberto H. F., Barbara Pernici, Ee-Peng Lim, et al.
Vol. 11788. Lecture Notes in Computer Science. Springer, 2019, pp. 136–152. doi:
[4] Burattin, Andrea, Michael Kaiser, Manuel Neurauter, et al. “Learning process
modeling phases from modeling interactions and eye tracking data”. In: Data &
Knowledge Engineering 121 (2019), pp. 1–17. doi:10.1016/j.datak.2019.
[5] Ferreira, Diogo R. and Daniel Gillblad. “Discovering Process Models from Un-
labelled Event Logs”. In: Business Process Management, 7th International Con-
ference, BPM 2009, Ulm, Germany, September 8-10, 2009. Proceedings. Ed. by
Dayal, Umeshwar, Johann Eder, Jana Koehler, et al. Vol. 5701. Lecture Notes in
Computer Science. Springer, 2009, pp. 143–158. doi:10.1007/978-3- 642-
[6] Janssen, Dominik, Felix Mannhardt, Agnes Koschmider, et al. “Process Model
Discovery from Sensor Event Data”. In: Process Mining Workshops - ICPM 2020
International Workshops, Padua, Italy, October 5-8, 2020, Revised Selected Papers.
Ed. by Leemans, Sander J. J. and Henrik Leopold. Vol. 406. Lecture Notes in Busi-
ness Information Processing. Springer, 2020, pp. 69–81. doi:10. 1007/978 -
[7] Jlailaty, Diana, Daniela Grigori, and Khalid Belhajjame. “Business Process Instances
Discovery from Email Logs”. In: 2017 IEEE International Conference on Services
Computing, SCC 2017, Honolulu, HI, USA, June 25-30, 2017. Ed. by Liu, Xiao-
17 / 19
M. Pegoraro et al. A User Study of the Event-Case Correlation Problem
qing (Frank) and Umesh Bellur. IEEE Computer Society, 2017, pp. 19–26. doi:
[8] Lakhani, Karuna and Apurva Narayan. “A Neural Word Embedding Approach
to System Trace Reconstruction”. In: 2019 IEEE International Conference on Sys-
tems, Man and Cybernetics, SMC 2019, Bari, Italy, October 6-9, 2019. IEEE, 2019,
pp. 285–291. doi:10.1109/SMC.2019.8914322.
[9] Leno, Volodymyr, Adriano Augusto, Marlon Dumas, et al. “Identifying Candi-
date Routines for Robotic Process Automation from Unsegmented UI Logs”.
In: 2nd International Conference on Process Mining, ICPM 2020, Padua, Italy,
October 4-9, 2020. Ed. by van Dongen, Boudewijn F., Marco Montali, and Moe
Thandar Wynn. IEEE, 2020, pp. 153–160. doi:10.1109/ICPM49681.2020.
[10] Linn, Christian, Phileas Zimmermann, and Dirk Werth. “Desktop Activity Min-
ing - A new level of detail in mining business processes”. In: 48. Jahrestagung
der Gesellscha f¨
ur Informatik, Architekturen, Prozesse, Sicherheit und Nach-
haltigkeit, INFORMATIK 2018 - Workshops, Berlin, Germany, September 26-
27, 2018. Ed. by Czarnecki, Christian, Carsten Brockmann, Eldar Sultanow, et al.
Vol. P-285. LNI. GI, 2018, pp. 245–258. url:
[11] Marrella, Andrea and Tiziana Catarci. “Measuring the Learnability of Interactive
Systems Using a Petri Net Based Approach”. In: Proceedings of the 2018 on De-
signing Interactive Systems Conference 2018, DIS 2018, Hong Kong, China, June
09-13, 2018. Ed. by Koskinen, Ilpo, Youn-Kyung Lim, Teresa Cerratto Pargman,
et al. ACM, 2018, pp. 1309–1319. doi:10.1145/3196709.3196744.
[12] Mikolov, Tom´
as, Ilya Sutskever, Kai Chen, et al. “Distributed Representations of
Words and Phrases and their Compositionality”. In: Advances in Neural Infor-
mation Processing Systems 26: 27th Annual Conference on Neural Information
Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake
Tahoe, Nevada, United States. Ed. by Burges, Christopher J. C., L´
eon Bottou,
Zoubin Ghahramani, et al. 2013, pp. 3111–3119. url:https://proceedings.
[13] de Murillas, Eduardo Gonz´
alez L´
opez, Hajo A. Reijers, and Wil M. P. van der
Aalst. “Case notion discovery and recommendation: automated event log build-
ing on databases”. In: Knowledge and Information Systems 62.7 (2020), pp. 2539–
2575. doi:10.1007/s10115-019-01430-6.
18 / 19
M. Pegoraro et al. A User Study of the Event-Case Correlation Problem
[14] Pegoraro, Marco, Bianka Bakullari, Merih Seran Uysal, et al. “Probability Estima-
tion of Uncertain Process Trace Realizations”. In: Process Mining Workshops -
ICPM 2021 International Workshops, Eindhoven, The Netherlands, October 31 -
November 4, 2021, Revised Selected Papers. Ed. by Munoz-Gama, Jorge and Xixi
Lu. Vol. 433. Lecture Notes in Business Information Processing. Springer, 2021,
pp. 21–33. doi:10.1007/978-3-030-98581-3_2.
[15] Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “Conformance
checking over uncertain event data”. In: Information Systems 102 (2021), p. 101810.
[16] Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “PROVED: A
Tool for Graph Representation and Analysis of Uncertain Event Data”. In: Ap-
plication and Theory of Petri Nets and Concurrency - 42nd International Con-
ference, PETRI NETS 2021, Virtual Event, June 23-25, 2021, Proceedings. Ed. by
Buchs, Didier and Josep Carmona. Vol. 12734. Lecture Notes in Computer Sci-
ence. Springer, 2021, pp. 476–486. doi:10.1007/978-3-030-76983-3_24.
[17] Pourmirza, Shaya, Remco M. Dijkman, and Paul Grefen. “Correlation Miner:
Mining Business Process Models and Event Correlations Without Case Identi-
ers”. In: International Journal of Cooperative Information Systems 26.2 (2017),
1742002:1–1742002:32. doi:10.1142/S0218843017420023.
[18] Ramirez, Andres Jimenez, Hajo A. Reijers, Irene Barba, et al. “A Method to Im-
prove the Early Stages of the Robotic Process Automation Lifecycle”. In: Ad-
vanced Information Systems Engineering - 31st International Conference, CAiSE
2019, Rome, Italy, June 3-7, 2019, Proceedings. Ed. by Giorgini, Paolo and Barbara
Weber. Vol. 11483. Lecture Notes in Computer Science. Springer, 2019, pp. 446–
461. doi:10.1007/978-3-030-21290-2_28.
19 / 19