Uncertain Case Identifiers in Process Mining:
A User Study of the Event-Case Correlation
Problem on Click Data
Marco Pegoraro 1, Merih Seran Uysal 1, Tom-Hendrik Hülsmann 1,
and Wil M.P. van der Aalst 1
1 Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Aachen, Germany
{pegoraro, uysal, vwdaalst}@pads.rwth-aachen.de, tom.huelsmann@rwth-aachen.de
Abstract
Among the many sources of event data available today, a prominent one is user
interaction data. User activity may be recorded during the use of an application or
website, resulting in a type of user interaction data often called click data. An
obstacle to the analysis of click data using process mining is the lack of a case
identifier in the data. In this paper, we show a case and user study for event-case
correlation on click data, in the context of user interaction events from a mobility
sharing company. To reconstruct the case notion of the process, we apply a novel
method based on neural networks to aggregate user interaction data in separate user
sessions, interpreted as cases. To validate our findings, we qualitatively discuss the
impact of process mining analyses on the resulting well-formed event log through
interviews with process experts.
Keywords: Process Mining · Uncertain Event Data · Event-Case Correlation
· Case Notion Discovery · Unlabeled Event Logs · Machine Learning · Neural
Networks · word2vec · UI Design · UX Design.
Colophon
This work is licensed under a Creative Commons “Attribution-NonCommercial 4.0
International” license.
© the authors. Some rights reserved.
This document is an Author Accepted Manuscript (AAM) corresponding to the following scholarly paper:
Pegoraro, Marco et al. “Uncertain Case Identifiers in Process Mining: A User Study of the Event-Case Correlation
Problem on Click Data”. In: 27th International Conference on Exploring Modeling Methods for Systems Analysis and
Development, EMMSAD 2022, Proceedings. Ed. by Augusto, Adriano et al. Lecture Notes in Business Information Processing.
Springer, 2022
Please, cite this document as shown above.
Publication chronology:
• 2021-11-22: abstract submitted to the International Conference on Advanced Information Systems Engineering (CAiSE) 2022, main track
• 2021-11-29: full text submitted to the International Conference on Advanced Information Systems Engineering (CAiSE) 2022, main track
• 2022-03-01: notification of rejection
• 2022-03-08: full text submitted to the International Working Conference on Exploring Modeling Methods for Systems Analysis and Development (EMMSAD) 2022
• 2022-04-01: notification of acceptance
• 2022-04-06: camera-ready version submitted
The published version referred to above is © Springer.
Correspondence to:
Marco Pegoraro, Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Ahornstr. 55, 52074 Aachen, Germany
Website: http://mpegoraro.net/ · Email: pegoraro@pads.rwth-aachen.de · ORCID: 0000-0002-8997-7517
Content: 19 pages, 10 figures, 1 table, 18 references. Typeset with pdfLaTeX, Biber, and BibLaTeX.
Please do not print this document unless strictly necessary.
M. Pegoraro et al. A User Study of the Event-Case Correlation Problem
1 Introduction
In the last decades, the dramatic rise of both performance and portability of computing
devices has enabled developers to design software with an ever-increasing level of
sophistication. Such escalation in functionalities caused a subsequent increase in the
complexity of software, making it harder to access for users. The shift from large screens
of desktop computers to small displays of smartphones, tablets, and other handheld devices
has strongly contributed to this increase in the intricacy of software interfaces. User
interface (UI) design and user experience (UX) design aim to address the challenge of
managing complexity, to enable users to interact easily and effectively with the software.
In designing and improving user interfaces, important sources of guidance are the
records of user interaction data. Many websites and apps track the actions of users, such
as pageviews, clicks, and searches. Such type of information is often called click data, of
which an example is given in Table 1. These data can then be analyzed to identify parts of
the interface which need to be simplified, through, e.g., pattern mining, or performance
measures such as time spent performing a certain action or viewing a certain page.
Table 1: A sample of click data from the user interactions with the smartphone app of a German mobility
sharing company. This dataset is the basis for the qualitative evaluation of the method presented in this
paper.
timestamp                screen       user   team   os
-----------------------  -----------  -----  -----  -------
2021-01-25 23:00:00.939  pre booking  b0b00  2070b  iOS
2021-01-25 23:00:03.435  tariffs      b0b00  2070b  iOS
2021-01-25 23:00:04.683  menu         3fc0c  02d1f  Android
2021-01-25 23:00:05.507  my bookings  3fc0c  02d1f  Android
...                      ...          ...    ...    ...
In the context of novel click data analysis techniques, a particularly promising
subfield of data science is process mining. Process mining is a discipline that aims to
analyze event data generated by process executions in order to, e.g., obtain a model of the
process, measure its conformance with normative behavior, or analyze the performance of
process instances with respect to time.
Towards the analysis of click data with process mining, a foundational challenge
remains: the association of event data (here, user interactions) with a process case identifier.
While each interaction logged in a database is associated with a user identifier, which is
read from the current active session in the software, there is no attribute that isolates the
events corresponding to one single utilization of the software from beginning to end.
Aggregating user interactions into cases is of crucial importance, since the case identifier,
together with the activity label and the timestamp, is a fundamental attribute to
reconstruct a process instance as a sequence of activities (trace), also known as the
control-flow perspective of a process instance. The vast majority of the process mining
techniques available require the control-flow perspective of a process to be known.
In this paper, we propose a novel case attribution approach for click data. Our
method allows us to effectively segment the sequence of interactions from a user into
separate cases on the basis of normative behavior. We then verify the effectiveness of
our method by applying it to a real-life use case scenario related to a mobility sharing
smartphone app. Then, we perform common process mining analyses such as process
discovery on the resulting segmented log, and we conduct a user study among business
owners by presenting the result of such analyses to process experts from the company.
Through interviews with such experts, we assess the impact of process mining analysis
techniques enabled by our event-case correlation method.
The remainder of the paper is organized as follows. Section 2 discusses existing
event-case correlation methods and other related work. Section 3 illustrates a novel event-case
correlation method. Section 4 describes the results of our method on a real-life use case
scenario related to a mobility sharing app, together with a discussion of interviews of
process experts from the company about the impact of process mining techniques
enabled by our method. Finally, Section 5 concludes the paper.
2 Related Work
The problem of assigning a case identifier to events in a log is a long-standing challenge
in the process mining community [5], and is known by multiple names in literature,
including event-case correlation problem [3] and case notion discovery problem [13]. Event
logs where events are missing the case identifier attribute are usually referred to as
unlabeled event logs [5]. Several of the attempts to solve this problem, such as an early one
by Ferreira et al. based on first-order Markov models [5] or the Correlation Miner by
Pourmirza et al., based on quadratic programming [17], are very limited in the presence
of loops in the process. Other approaches, such as the one by Bayomie et al. [2], can
indeed work in the presence of loops, by relying on heuristics based on activity durations
which lead to a set of candidate segmented logs. This comes at the cost of a slow
computing time. An improvement of the aforementioned method [3] employs simulated
annealing to select an optimal case notion; while still very computationally heavy, this
method delivers high-quality case attribution results.
The problem of event-case correlation can be positioned in the broader context of
uncertain event data [15,16]. This research direction aims to analyze event data with
imprecise attributes, where single traces might correspond to an array of possible
real-life scenarios. Akin to the method proposed in this paper, some techniques make it
possible to obtain probability distributions over such scenarios [14].
A notable and rapidly growing field where the problem of event-case correlation is
crucial is Robotic Process Automation (RPA), the automation of process activities through
software bots. Similar to many approaches related to the problem at large, existing
approaches to event-case correlation in the RPA field often rely heavily on unique start and
end events in order to segment the log, either explicitly or implicitly [10,18,9].
The problem of event-case attribution is different when considered on click data,
particularly from mobile apps. Normally, the goal is to learn a function that receives
an event as an independent variable and produces a case identifier as an output. In the
scenario studied in this paper, however, the user is tracked by the open session in the
app during the interaction, and recorded events with different user identifiers cannot
belong to the same process case. The goal is then to subdivide the sequence of interactions
from one user into one or more sessions (cases). Marrella et al. [11] examined the
challenge of obtaining case identifiers for unsegmented user interaction logs in the context of
learnability of software systems, by segmenting event sequences with a predefined set of
start and end activities as normative information. They find that this approach cannot
discover all types of cases, which limits its flexibility and applicability. Jlailaty et al. [7]
encounter the segmentation problem in the context of email logs. They segment cases
by designing an ad-hoc metric that combines event attributes such as timestamp, sender,
and receiver. Their results, however, show that this method fails on edge cases. Other
prominent sources of sequential event data without case attribution are IoT sensors:
Janssen et al. [6] address the problem of obtaining process cases from sequential
sensor event data by splitting the long traces according to an application-dependent fixed
length, to find the optimal sub-trace length such that, after splitting, each case contains
only a single activity. One major limitation of this approach that the authors mention
is the use of only a single constant length for all of the different activities, which may
have varying lengths. More recently, Burattin et al. [4] tackled a segmentation problem
for user interactions with a modeling software; in their approach, the segmentation is
obtained by exploiting eye tracking data.
The goal of the study reported in this paper is to present a method able to rapidly and
efficiently segment a user interaction log in a setting where no sample of ground truth
cases is available, and the only normative information at disposal is in the form of a link
graph, which is relatively easy to extract from a UI. Section 3 shows the segmentation
technique we propose.
3 Method
In this section, we illustrate our proposed method for event-case correlation on click
data. As mentioned earlier, the goal is to segment the sequence of events corresponding
to the interactions of every user in the database into complete process executions
(cases). In fact, the click data we consider in this study have a property that we need
to account for while designing our method: all events belonging to one case are
contiguous in time. Thus, our goal is to determine split points for different cases in a
sequence of interactions related to the same user. More concretely, if a user of the app
produces the sequence of events $\langle e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9 \rangle$, our goal is to
section such sequence in contiguous subsequences that represent a complete interaction, for
instance $\langle e_1, e_2, e_3, e_4 \rangle$, $\langle e_5, e_6 \rangle$, and $\langle e_7, e_8, e_9 \rangle$. We refer to this as the log
segmentation problem, which can be considered a special case of the event-case correlation
problem. In this context, “unsegmented log” is synonymous with “unlabeled log”.
Rather than being based on a collection of known complete process instances as
training set, the creation of our segmentation model is based on behavior described by a
model of the system. A type of model particularly suited to the problem of segmentation
of user interaction data, and especially click data, is the link graph. In fact, since the
activities in our process correspond to screens in the app, a graph of the links in the app
is relatively easy to obtain: it can be constructed automatically by following
the links between views in the software. This link graph will be the basis for our training
data generation procedure.
We will use the link graph of Figure 1 as a running example. The normative traces
resulting from the training data generation procedure will then be used to train a neural
network model based on the word2vec architecture [12], which will be able to split
contiguous user interaction sequences into cases.
3.1 Training Log Generation
To generate the training data, we will begin by exploiting the fact that each process case
will only contain events associated with one and only one user. Let $L$ be our
unsegmented log and $u \in U$ be a user in $L$; then, we indicate with $L_u$ the sub-log of $L$ where
all events are associated with the user $u$.
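As an illustration of the sub-logs $L_u$, the following minimal sketch groups click events by user and orders each group by timestamp. The rows are copied from Table 1; the function name `split_by_user` and the tuple layout are assumptions made for this example, not part of the original implementation.

```python
from collections import defaultdict

# Rows mirror Table 1: (timestamp, screen, user, team, os).
events = [
    ("2021-01-25 23:00:00.939", "pre booking", "b0b00", "2070b", "iOS"),
    ("2021-01-25 23:00:03.435", "tariffs", "b0b00", "2070b", "iOS"),
    ("2021-01-25 23:00:04.683", "menu", "3fc0c", "02d1f", "Android"),
    ("2021-01-25 23:00:05.507", "my bookings", "3fc0c", "02d1f", "Android"),
]


def split_by_user(log):
    """Return {user: [screen, ...]}, each sub-log ordered by timestamp."""
    sub_logs = defaultdict(list)
    # Fixed-width timestamps of this form sort chronologically as strings.
    for _timestamp, screen, user, _team, _os in sorted(log):
        sub_logs[user].append(screen)
    return dict(sub_logs)


sub_logs = split_by_user(events)
print(sub_logs)
```

Each value of `sub_logs` is one unsegmented user interaction sequence, which is the input to the segmentation steps described next.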
Our training data will be generated by simulating a transition system annotated with
probabilities. The construction of a transition system based on event data is a
well-known procedure in process mining [1], which requires choosing an event representation
abstraction and a window size (or horizon), both of which are process-specific. In the
context of this section, we will show our method using a sequence abstraction with window
size 2. Initially, for each user $u \in U$ we create a transition system $TS_u = (S_u, E_u, T_u, i)$
based on the sequence of user interactions in the sub-log $L_u$. $S_u^{end} \subseteq S_u$ denotes the final
states of $TS_u$. All such transition systems $TS_u$ share the same initial state $i$. To identify
the end of sequences, we add a special state $f \in S'$, to which we connect
any state $s \in S$ that appears at the end of a user interaction sequence. To traverse the
transitions to the final state $f$, we use the empty label $\tau$ as a placeholder.
We then obtain a transition system $TS' = (S', A, T', i)$ corresponding to the
entire log $L$, where $A$ is the set of activity labels appearing in $L$, $S' = \bigcup_{u \in U} S_u$, and
$T' = \bigcup_{u \in U} T_u$. Moreover, $S'^{end} = \bigcup_{u \in U} S_u^{end}$. We also collect information about the
frequency of each transition in the log: we define a weighting function $\omega$ for the
transitions $t \in T$, where $\omega(t)$ is the number of occurrences of $t$ in $L$. If $t \notin T$, $\omega(t) = 0$.
Through $\omega$, it is optionally possible to filter out rare behavior by deleting transitions with
$\omega(t) < \epsilon$, for a small threshold $\epsilon$. Figure 2 shows a transition system with the chosen
abstraction and window size, annotated with both frequencies and transition labels, for the
user interactions $L_{u_1} = \langle M, A, M, B, C \rangle$, $L_{u_2} = \langle M, B, C, M \rangle$, and $L_{u_3} = \langle M, A, B, C \rangle$.
Figure 1: The link graph of a simple, fictional system that we are going to use as running example.
From this process, we aim to segment the three unsegmented user interactions $\langle M, A, M, B, C \rangle$,
$\langle M, B, C, M \rangle$, and $\langle M, A, B, C \rangle$.
Figure 2: The transition system $TS'$ obtained from the user interaction data of the example (Figure 1).
During the reduction phase, the transition from $(M, A)$ to $(A, M)$ is removed, since it is not supported by
the link graph ($M$ does not follow $A$). The state $(A, M)$ is not reachable and is removed entirely (in
red). Consequently, the reduced transition system $TS$ is obtained.
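The construction of $TS'$ and of the weighting function $\omega$ can be sketched as follows for the three user interactions of the running example. This is a minimal illustration under stated assumptions: states are represented as tuples holding the last two activities, the initial state $i$ as the empty tuple, and the names `build_transition_system`, `INITIAL`, `FINAL`, and `TAU` are chosen for this example only.

```python
from collections import Counter

INITIAL = ()    # shared initial state i (assumed representation)
FINAL = "f"     # special final state f
TAU = "tau"     # empty label used to reach f


def build_transition_system(sequences, window=2):
    """Count transitions (s1, label, s2) under a sequence abstraction."""
    omega = Counter()
    for seq in sequences:
        state = INITIAL
        for activity in seq:
            # The next state keeps only the last `window` activities.
            next_state = (state + (activity,))[-window:]
            omega[(state, activity, next_state)] += 1
            state = next_state
        # Connect the last state of each user sequence to f.
        omega[(state, TAU, FINAL)] += 1
    return omega


user_sequences = [
    ["M", "A", "M", "B", "C"],
    ["M", "B", "C", "M"],
    ["M", "A", "B", "C"],
]
omega = build_transition_system(user_sequences)
for transition, count in sorted(omega.items(), key=str):
    print(transition, count)
```

Because all sub-logs are fed into the same counter, the per-user transition systems are merged implicitly; the counter values play the role of $\omega(t)$.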
In contrast to transition systems that are created based on logs that are segmented,
the obtained transition system might contain states that are not reachable and
transitions that are not possible according to the real process. Normally, the transition system
abstraction is applied on a case-by-case basis. In our case, however, we applied the
abstraction to the whole sequence of interactions that is associated with a specific user;
as a consequence, consecutive interactions that belong to different cases will be included
as undesired transitions in the transition system. In order to prune undesired transitions
from the transition system, we exploit the link graph of the system: a transition in the
transition system is only valid if it appears in the link graph. Unreachable states are also pruned.
We will assume a sequence abstraction in $TS$. Given a link graph $G = (V, E)$,
we define the reduced transition system $TS = (S, A, T, i)$, where
$T = \{(\langle \dots, a_1 \rangle, a_2, \langle \dots, a_1, a_2 \rangle) \in T' \mid (a_1, a_2) \in E\}$ and
$S = \bigcup_{(s_1, a, s_2) \in T} \{s_1, s_2\}$. Figure 1 shows a link graph for our running example, and
Figure 2 shows how this is used to reduce $TS'$ into $TS$.
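The reduction step can be sketched as a filter over the transitions followed by a reachability check. The edge set below is an assumption: the text only states that $M$ does not follow $A$ in the link graph, so the remaining edges are inferred from the fact that only the transition into $(A, M)$ is removed in Figure 2. The function name and the handling of initial and $\tau$ transitions are likewise choices made for this sketch.

```python
from collections import deque

INITIAL, FINAL, TAU = (), "f", "tau"


def reduce_transition_system(omega, edges):
    """Drop transitions not backed by the link graph, then unreachable ones."""
    kept = {}
    for (s1, label, s2), count in omega.items():
        # Transitions leaving the initial state or entering f have no
        # (a1, a2) pair to check, so this sketch keeps them (assumption).
        if s1 == INITIAL or label == TAU or (s1[-1], label) in edges:
            kept[(s1, label, s2)] = count
    # Breadth-first search from the initial state over the kept transitions.
    reachable, queue = {INITIAL}, deque([INITIAL])
    while queue:
        state = queue.popleft()
        for (s1, _label, s2) in kept:
            if s1 == state and s2 not in reachable:
                reachable.add(s2)
                queue.append(s2)
    return {t: c for t, c in kept.items() if t[0] in reachable}


# Frequencies of the running example: (s1, label, s2) -> count.
omega = {
    ((), "M", ("M",)): 3,
    (("M",), "A", ("M", "A")): 2,
    (("M",), "B", ("M", "B")): 1,
    (("M", "A"), "M", ("A", "M")): 1,
    (("A", "M"), "B", ("M", "B")): 1,
    (("M", "A"), "B", ("A", "B")): 1,
    (("A", "B"), "C", ("B", "C")): 1,
    (("M", "B"), "C", ("B", "C")): 2,
    (("B", "C"), "M", ("C", "M")): 1,
    (("B", "C"), TAU, FINAL): 2,
    (("C", "M"), TAU, FINAL): 1,
}
# Assumed link graph edges (not listed explicitly in the text).
link_edges = {("M", "A"), ("M", "B"), ("A", "B"), ("B", "C"), ("C", "M")}
reduced = reduce_transition_system(omega, link_edges)
print(len(omega), "->", len(reduced))
```

Under these assumptions, the transition into $(A, M)$ is removed by the link graph filter, and the transition leaving $(A, M)$ is removed because its source state is no longer reachable.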
Next, we define probabilities for transitions and states based on the values for $\omega(t)$.
Let $T_{out}: S \to \mathcal{P}(T)$ be $T_{out}(s) = \{(s_1, a, s_2) \in T \mid s_1 = s\}$; this function returns all
outgoing transitions from a given state. The likelihood of a transition $(s_1, a, s_2) \in T$ is
then computed with $l_{trans}: T \to [0, 1]$:
$$l_{trans}(s_1, a, s_2) = \frac{\omega(s_1, a, s_2)}{\sum_{t^* \in T_{out}(s_1)} \omega(t^*)}$$
Note that if $s_1$ has no outgoing transition and $T_{out}(s_1) = \emptyset$, by definition
$l_{trans}(s_1, a, s_2) = 0$ for any $a \in A$ and $s_2 \in S$. We will need two more supporting functions. We
define $l_{start}: S \to [0, 1]$ and $l_{end}: S \to [0, 1]$ as the probabilities that a state $s \in S$ is,
respectively, the initial and final state of a sequence:
$$l_{start}(s) = \frac{\sum_{a \in A} \omega(i, a, s)}{\sum_{s^* \in S,\, a \in A} \omega(s^*, a, s)}$$
$$l_{end}(s) = \frac{\omega(s, \tau, f)}{\sum_{s^* \in S,\, a \in A} \omega(s, a, s^*)}$$
In our running example of Figure 2, $l_{start}((M)) = \frac{3}{3} = 1$, and $l_{end}((C, M)) = \frac{1}{3}$.
Given a path of states $\langle s_1, s_2, \dots, s_n \rangle$ transitioning through the sequence
$\langle (i, a_1, s_1), (s_1, a_2, s_2), \dots, (s_{n-1}, a_n, s_n), (s_n, \tau, f) \rangle$, we now have the means to compute its
probability with the function $l: S^* \to [0, 1]$:
$$l(\langle s_1, s_2, \dots, s_n \rangle) = l_{start}(s_1) \cdot \prod_{i=2}^{n} l_{trans}(s_{i-1}, a_i, s_i) \cdot l_{end}(s_n)$$
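The middle factor of $l$, the product of transition likelihoods, can be sketched with exact fractions as follows. The frequencies are those of the reduced running example; leaving the $\tau$ arcs out of $T_{out}$ is an assumption of this sketch, and the factors $l_{start}$ and $l_{end}$ are deliberately not computed here. The function names are illustrative.

```python
from fractions import Fraction


def l_trans(omega, transition):
    """omega(t) divided by the total weight of transitions leaving s1."""
    s1 = transition[0]
    total = sum(c for (src, _a, _dst), c in omega.items() if src == s1)
    # By definition the likelihood is 0 when s1 has no outgoing transition.
    return Fraction(omega.get(transition, 0), total) if total else Fraction(0)


def transition_product(omega, transitions):
    """Product of l_trans over the consecutive transitions of a path."""
    result = Fraction(1)
    for t in transitions:
        result *= l_trans(omega, t)
    return result


# Activity-labeled transitions of the reduced running example
# (tau arcs omitted, which is an assumption of this sketch).
omega = {
    ((), "M", ("M",)): 3,
    (("M",), "A", ("M", "A")): 2,
    (("M",), "B", ("M", "B")): 1,
    (("M", "A"), "B", ("A", "B")): 1,
    (("A", "B"), "C", ("B", "C")): 1,
    (("M", "B"), "C", ("B", "C")): 2,
    (("B", "C"), "M", ("C", "M")): 1,
}
path = [
    (("M",), "A", ("M", "A")),
    (("M", "A"), "B", ("A", "B")),
    (("A", "B"), "C", ("B", "C")),
]
print(l_trans(omega, path[0]))          # 2/3
print(transition_product(omega, path))  # 2/3
```

State $(M)$ has two outgoing transitions with frequencies 2 and 1, so the transition labeled $A$ has likelihood 2/3; the two following transitions are the only ones leaving their source states and contribute a factor of 1 each.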
This enables us to obtain an arbitrary number of well-formed process cases as
sequences of activities $\langle a_1, a_2, \dots, a_n \rangle$, utilizing a Monte Carlo procedure. We can sample
a random starting state for the case, through the probability distribution given by $l_{start}$;
then, we compose a path with the probabilities provided by $l_{trans}$ and $l_{end}$. The traces
sampled in this way will reflect the available user interaction data in terms of initial and
final activities, and internal structure, although the procedure still allows for
generalization. Such generalization is, however, controlled thanks to the pruning provided by the
link graph of the system. We will refer to the set of generated traces as the training log
$L_T$.
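A simplified sketch of the Monte Carlo procedure is given below. It departs from the formulas above in one respect, stated here as an assumption: instead of evaluating $l_{end}$ separately, the decision to end a case is folded into the weighted choice among outgoing transitions, with the $\tau$ arc as one of the options. The start weights, function name, and `max_len` guard are also illustrative.

```python
import random

FINAL, TAU = "f", "tau"


def sample_trace(start_weights, omega, rng, max_len=50):
    """Sample one trace by a weighted random walk until f is reached."""
    states, weights = zip(*start_weights.items())
    state = rng.choices(states, weights=weights)[0]
    trace = [state[-1]]  # the start state contributes its last activity
    while len(trace) < max_len:
        outgoing = [(t, c) for t, c in omega.items() if t[0] == state]
        if not outgoing:
            break
        transitions = [t for t, _ in outgoing]
        counts = [c for _, c in outgoing]
        _, label, next_state = rng.choices(transitions, weights=counts)[0]
        if label == TAU:  # the tau arc ends the case
            break
        trace.append(label)
        state = next_state
    return trace


# Reduced running example; start weights stand in for l_start.
start_weights = {("M",): 3}
omega = {
    (("M",), "A", ("M", "A")): 2,
    (("M",), "B", ("M", "B")): 1,
    (("M", "A"), "B", ("A", "B")): 1,
    (("A", "B"), "C", ("B", "C")): 1,
    (("M", "B"), "C", ("B", "C")): 2,
    (("B", "C"), "M", ("C", "M")): 1,
    (("B", "C"), TAU, FINAL): 2,
    (("C", "M"), TAU, FINAL): 1,
}
rng = random.Random(0)
training_log = [sample_trace(start_weights, omega, rng) for _ in range(5)]
print(training_log)
```

Since sampling only follows transitions that survived the pruning, every pair of consecutive activities in a sampled trace is an edge of the link graph, which is the controlled generalization described above.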
3.2 Model Training
The training log $L_T$ obtained in Section 3.1 is now used to train the segmentation
models. The core component of the proposed method consists of one or more word2vec
models that detect the boundaries between cases in the input log. When applied to
natural language processing, the input of a word2vec model is a corpus of sentences which
consist of words. Instead of sentences built as sequences of words, we consider traces
$\langle a_1, a_2, \dots, a_n \rangle$ as sequences of activities.
The training log $L_T$ needs an additional processing step to be used as training set for
word2vec. Given two traces $\sigma_1 \in L_T$ and $\sigma_2 \in L_T$, we build a training instance by
joining them in a single sequence, concatenating them with a placeholder activity $\blacksquare$. So, for
instance, the traces $\sigma_1 = \langle a_1, a_2, a_4, a_5 \rangle \in L_T$ and $\sigma_2 = \langle a_6, a_7, a_8 \rangle \in L_T$ are combined
in the training sample $\langle a_1, a_2, a_4, a_5, \blacksquare, a_6, a_7, a_8 \rangle$. This is done repeatedly, shuffling the
order of the traces. Figure 3 shows this processing step on the running example.
Figure 3: Construction of the training instances. Traces are shuffled and concatenated with a
placeholder end activity.
Figure 4: The word2vec neural network. Given the sequence $\langle A, ?, C \rangle$, the network produces a
probability distribution over the possible activity labels for $?$.
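The construction of training instances can be sketched as follows, assuming that each instance joins two randomly drawn traces. The token `"<end>"` stands in for the placeholder activity, and the function name and the number of instances are hypothetical choices for this example.

```python
import random

END = "<end>"  # stand-in token for the placeholder activity


def build_training_instances(traces, n_instances, rng):
    """Join randomly drawn pairs of traces with the placeholder in between."""
    instances = []
    for _ in range(n_instances):
        # Draw two distinct traces in random order.
        first, second = rng.sample(traces, 2)
        instances.append(first + [END] + second)
    return instances


traces = [
    ["a1", "a2", "a4", "a5"],
    ["a6", "a7", "a8"],
    ["a1", "a3", "a5"],
]
rng = random.Random(42)
instances = build_training_instances(traces, n_instances=4, rng=rng)
for instance in instances:
    print(instance)
```

Every instance contains exactly one placeholder, located at a position that is known to be a case boundary, which is what the model is later asked to recognize.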
The word2vec model [12] consists of three layers: an input layer, a single hidden
layer, and the output layer. This model has already been successfully employed in pro-
cess mining to solve the problem of missing events [8]. During training, the network
reads the input sequences with a sliding window. The activity occupying the center of
the sliding window is called the center action, while the surrounding activities are called
context actions. The proposed method uses the Continuous Bag-Of-Words (CBOW) vari-
ant of word2vec, where the context actions are introduced as input in the neural network
in order to predict the center action. The error measured in the output layer is used for
training in order to adjust the weights in the neural network, using the backpropagation
algorithm. These forward and backward steps of the training procedure are repeated for
all the positions of the sliding window and all the sequences in the training set; when
fully trained, the network will output a probability distribution for the center action
given the context actions. Figure 4 shows an example of likelihood estimation for a
center action in our running example, with a sliding window of size 3.
3.3 Segmentation
Through the word2vec model we trained in Section 3.2, we can now estimate the
likelihood of a case boundary at any position of a sequence of user interactions. Figure 5
shows these estimates on one user interaction sequence from the running example. Note
that this method of computing likelihoods is easy to extend to an ensemble of predictive
models: the different predicted values can then be aggregated, e.g., with the mean or the
median.
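To make the idea of a boundary likelihood concrete without depending on a neural network library, the sketch below replaces word2vec with a plain count-based estimator: for each gap between two consecutive events, it estimates how often the placeholder occurred as the center of that context in the training instances. This is a stand-in technique for illustration, not the CBOW model used by the method; the window of size 3, the token `"<end>"`, and all names are assumptions.

```python
from collections import Counter

END = "<end>"  # stand-in token for the placeholder activity


def fit_center_counts(instances):
    """Count (left, right, center) triples with a sliding window of size 3."""
    counts = Counter()
    for seq in instances:
        for left, center, right in zip(seq, seq[1:], seq[2:]):
            counts[(left, right, center)] += 1
    return counts


def boundary_scores(counts, sequence):
    """For each gap between consecutive events, estimate P(END | context)."""
    scores = []
    for left, right in zip(sequence, sequence[1:]):
        total = sum(c for (l, r, _), c in counts.items() if (l, r) == (left, right))
        scores.append(counts[(left, right, END)] / total if total else 0.0)
    return scores


instances = [
    ["M", "A", "B", "C", END, "M", "B", "C"],
    ["M", "B", "C", "M", END, "M", "A", "B", "C"],
]
counts = fit_center_counts(instances)
scores = boundary_scores(counts, ["M", "B", "C", "M", "A", "B", "C"])
print(scores)
```

Entry `scores[g]` refers to the gap between events `g` and `g + 1`. In this toy input, only the gap between `C` and `M` has been seen with the placeholder in the middle, so it receives the highest score.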
Next, we use these scores to determine case boundaries, which will correspond to
prominent peaks in the graph. Let $\langle p_1, p_2, \dots, p_n \rangle$ be the sequence of likelihoods of a
case boundary obtained on a user interaction sequence. We consider $p_i$ a boundary if
it satisfies the following conditions: first, $p_i > b_1 \cdot p_{i-1}$; then, $p_i > b_2 \cdot p_{i+1}$; finally,
$p_i > b_3 \cdot \frac{\sum_{j=i-k-1}^{i-1} p_j}{k}$, where $b_1, b_2, b_3 \in [1, \infty)$ and $k \in \mathbb{N}$ are hyperparameters that
influence the sensitivity of the segmentation. The first two inequalities use $b_1$ and $b_2$ to
ensure that the score is sufficiently higher than the immediate predecessor and successor.
The third inequality uses $b_3$ to make sure that the likelihood is also significantly higher
than a neighborhood defined by the parameter $k$.
Figure 5: A plot indicating the chances of having a case segment for each position of the user interaction
data (second and third trace from the example in Figure 1).
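The three peak conditions can be sketched as a direct check on a list of scores. The summation bounds follow the inequality as printed; skipping positions that lack a full neighborhood or a successor is an assumption of this sketch, as are the example scores and hyperparameter values.

```python
def is_boundary(p, i, b1=1.0, b2=1.0, b3=1.0, k=2):
    """Check the three peak conditions at 0-based position i of scores p."""
    # Assumption: positions without a complete neighborhood are skipped.
    if i - k - 1 < 0 or i + 1 >= len(p):
        return False
    # Sum over j = i-k-1 .. i-1, divided by k, as in the third inequality.
    neighborhood = sum(p[i - k - 1:i]) / k
    return (
        p[i] > b1 * p[i - 1]
        and p[i] > b2 * p[i + 1]
        and p[i] > b3 * neighborhood
    )


def find_boundaries(p, **params):
    """Return all positions of p that satisfy the three conditions."""
    return [i for i in range(len(p)) if is_boundary(p, i, **params)]


# Hypothetical boundary likelihoods for one user interaction sequence.
scores = [0.1, 0.1, 0.2, 0.1, 0.9, 0.1, 0.1]
boundaries = find_boundaries(scores, b1=2.0, b2=2.0, b3=1.5, k=2)
print(boundaries)
```

Larger values of `b1`, `b2`, and `b3` make the detector more conservative, so fewer and more pronounced peaks are accepted as case boundaries.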
These three conditions allow us to select valid case boundaries within user
interaction sequences. Splitting the sequences on such boundaries yields traces of complete
process executions, whose events will be assigned a unique case identifier. The set of
such traces then constitutes a traditional event log, ready to be analyzed with established
process mining techniques.
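The final splitting step can be sketched as follows, assuming that each boundary index `g` denotes the gap between events `g` and `g + 1` of one user's sequence. The case identifier format (user identifier plus a running number) is a hypothetical choice for this example.

```python
def segment(user, events, boundaries):
    """Split one user's events at the boundary gaps and label each case."""
    cases, current = [], []
    cut = set(boundaries)
    for position, event in enumerate(events):
        current.append(event)
        if position in cut:  # gap `position` closes the current case
            cases.append(current)
            current = []
    if current:
        cases.append(current)
    # Attach a case identifier to every event.
    return [
        (f"{user}-{number}", event)
        for number, case in enumerate(cases, start=1)
        for event in case
    ]


labeled = segment("u2", ["M", "B", "C", "M", "A", "B", "C"], boundaries=[2])
for case_id, activity in labeled:
    print(case_id, activity)
```

The resulting pairs of case identifier and activity, together with the timestamps of the original events, provide the attributes needed by standard process mining tools.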
4 User Study
In order to validate the utility of process mining workflows in the area of user behavior
analysis, a case study was conducted. Such study also aims at assessing the quality of the
segmentation produced by the proposed method in a real-life setting, in an area where
the ground truth is not available (i.e., there are no normative well-formed cases). We
applied the proposed method to a dataset which contains real user interaction data
collected from the mobile applications of a German vehicle sharing company. We then
utilized the resulting segmented log to analyze user behavior with an array of process mining
techniques. Then, the results were presented to process experts from the company, who
utilized such results to identify critical areas of the process and suggest improvements.
Figure 6: DFG automatically discovered from the log segmented by our method.
In the data, the abstraction for recorded user interactions is the screen (or page) in
the app. For each interaction, the system recorded five attributes: timestamp, screen,
user, team, and os. The timestamp marks the point in time when the user visited
the screen, which is identified by the screen attribute, our activity label. The user
attribute identifies who performed the interaction, and the team attribute is an additional
field referring to the vehicle provider associated with the interaction. Upon filtering out
pre-login screens (not associated with a user), the log consists of about 990,000 events
originating from about 12,200 users. A snippet of these click data was shown in Table 1,
in Section 1.
We applied the segmentation method presented in Section 3 to this click data. We
then analyzed the resulting log with well-known process mining techniques. Lastly, the
findings were presented to and discussed with four experts from the company, consisting
of one UX expert, two mobile developers and one manager from a technical area. All of
the participants are working directly on the application and are therefore highly familiar
with it. We will report here the topics of discussion in the form of questions; for reasons
of space, we will only document a selection of the most insightful questions.
Q1: Draw your own process model of the user interactions.
The participants were asked to draw a Directly-Follows Graph (DFG) describing the most
common user interactions with the app. A DFG is a simple process model consisting of
a graph where activities A and B are connected by an arc if B is executed immediately
after A. The concept of this type of graph was explained to the participants beforehand.
The experts were given five minutes in order to create their models. A cleaned-up
representation of the resulting models can be seen in Figures 7 and 8.
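The definition of a DFG given above can be sketched in a few lines: every pair of directly consecutive activities within a case contributes to an arc, and the arc frequency is the number of times the pair occurs. The example traces are hypothetical and only serve to show the computation.

```python
from collections import Counter


def directly_follows_graph(traces):
    """Count how often activity b directly follows activity a in a case."""
    dfg = Counter()
    for trace in traces:
        dfg.update(zip(trace, trace[1:]))
    return dfg


# Hypothetical segmented log: one list of activities per case.
segmented_log = [
    ["M", "A", "B", "C"],
    ["M", "B", "C"],
    ["M", "B", "C", "M"],
]
dfg = directly_follows_graph(segmented_log)
for (source, target), frequency in dfg.most_common():
    print(f"{source} -> {target}: {frequency}")
```

Because pairs are only formed within a case, the quality of the segmentation directly determines which arcs appear: an unsegmented log would add arcs between the last activity of one case and the first activity of the next.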
For comparison, we created a DFG of the segmented log (Figure 6). Such model was
configured to contain a similar number of different screens as the expert models. The
colors indicate the agreement between the model and the expert models. Darker colors
signify that a screen was included in more expert models. The dashed edges between the
screens signify edges that were identified by the generated model, but are not present in
the participants' models.
Figure 7: DFGs created by three of the process experts as part of Q1.
Figure 8: DFG created by one of the process experts as part of Q1.
The mobile developers (models A and B) tend to describe the interactions in a more
precise way that follows the different screens more closely, while the technical manager
and UX expert (C and D) provided models that capture the usage of the application in a
more abstract way. The fact that the computed model and the expert models are overall
very similar to each other suggests that our proposed method is able to create a
segmentation whose cases accurately describe the real user behavior.
Q2: Given this process model that is based on interactions ending on the
booking screen, what are your observations?
Given the process model shown in Figure 9, the participants were surprised by the fact
that the map-based dashboard type is used significantly more frequently than the
basic dashboard. Additionally, two of the experts were surprised by
the number of users that are accessing their bookings through the list of all bookings
(my bookings). This latter observation was also made during the analysis of the
segmented log and is the reason that this process model was presented to the experts. In
general, a user that has created a booking for a vehicle can access this booking directly
from all of the different types of dashboards. The fact that a large fraction of the users
take a detour through the menu and booking list in order to reach the booking screen
is therefore surprising. This circumstance was actually already identified by one of the
mobile developers some time before this evaluation, while they were manually analyzing
the raw interaction recordings data. They noticed this behavior because they repeatedly
encountered the underlying pattern while working with the data for other unrelated
reasons. Using the segmented user interaction log, the behavior was however much more
discoverable and supported by concrete data rather than just a vague feeling. Another
observation that was not made by the participants is that the path through the booking
list is more frequently taken by users that originate from the map-based dashboard rather
than the basic dashboard. The UX expert suspected that this may have been the case
because the card that can be used to access a booking from the dashboard is significantly
smaller on the map-based dashboard and may therefore be missed more frequently by
the users. This is a concrete actionable finding of the analysis that was only made
possible by the use of process mining techniques in conjunction with the proposed method.
Q3: What is the median time a user takes to book a vehicle?
The correct answer to this question is 66 seconds. This was calculated based on the
median time of all cases in which a vehicle booking was confirmed. Three participants
gave the answers 420 seconds, 120 seconds and 120 seconds. The fourth participant
argued that this time may depend on the type of dashboard that the user is using and
answered 300 seconds for the basic dashboard and 120 seconds for the map-based
dashboard. When asked to settle on only one time, the participant gave an answer of 180
seconds. Averaged over the four answers, the experts thus estimated a median duration for
this task of 3 minutes and 30 seconds. This is a significant overestimation compared to the
value that was obtained by analyzing the real user behavior. Again, a mismatch between
the perception of the experts and the real behavior of the users was revealed.
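The arithmetic behind the comparison can be checked with a short sketch. The expert answers are taken from the text; the case start and end times are made-up values used only to show how a median case duration can be computed from a segmented log, and the function name is hypothetical.

```python
from datetime import datetime
from statistics import mean, median

# Final answers of the four experts, in seconds (from the text).
expert_estimates = [420, 120, 120, 180]
average_estimate = mean(expert_estimates)
print(average_estimate)  # 210 seconds, i.e. 3 minutes and 30 seconds


def median_case_duration(cases):
    """Median duration in seconds over (start, end) timestamp pairs."""
    durations = [
        (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()
        for start, end in cases
    ]
    return median(durations)


# Made-up cases for illustration only (start and end of each case).
example_cases = [
    ("2021-01-25 23:00:00", "2021-01-25 23:00:50"),
    ("2021-01-25 23:05:00", "2021-01-25 23:06:10"),
    ("2021-01-25 23:10:00", "2021-01-25 23:12:00"),
]
print(median_case_duration(example_cases))
```

The average of the expert answers is more than three times the 66 seconds measured on the segmented log, which is the overestimation discussed above.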
Q4: Given this process model that is based on interactions ending on the
confirm booking screen (Figure 10), what are your observations?
Figure 9: A process model created using Disco, with the booking screen as endpoint of the process.
Several of the experts observed that the screens that show details about the vehicles and
the service, such as tariffs, insurance details and car features, are
seemingly used much less frequently than expected. In only about 2-10% of cases, the user
visits these screens before booking a vehicle. When considering the concrete numbers,
the availability calendar screen (which is used to choose a timeframe for the
booking) and the tariffs screen (which displays pricing information) are used most
frequently before a booking confirmation. This suggests that time and pricing
information are significantly more important to the users than information about the vehicle or
about the included insurance. These findings sparked a detailed discussion between the
experts about the possible reasons for the observed behavior. This shows
that models obtained from segmented user interaction logs are an important tool for
the analysis of user behavior and that these models provide a valuable foundation for a
more detailed analysis by the process experts. Another observation regarding this model
was that a majority of the users seem to choose a vehicle directly from the dashboard
cards present on the app rather than using the search functionality. This suggests that
the users are more interested in the vehicle itself, rather than looking for any available
vehicle at a certain point in time.
Figure 10: A process model based on cases that begin in any dashboard and end on the confirm booking
screen.
Q5: Discuss the fact that 2% of users activate the intermediate lock before
ending the booking.
The smartphone application offers the functionality to lock certain kinds of vehicles during an active booking. This is for example possible for bicycles, which can be locked by the users during the booking whenever they are leaving the bicycle alone. To do so, the intermediate lock and intermediate action screens are used. During the analysis, it was found that 2% of users use this functionality in order to lock the vehicle directly before ending the booking. This is noteworthy, as it is not necessary to manually lock the vehicle before returning it. All vehicles are automatically locked by the system at the end of each booking. One expert argued that this may introduce additional technical difficulties during the vehicle return, because the system will try to lock the vehicle again. These redundant lock operations, discovered by analyzing the segmented log, may introduce errors in the return process.
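A pattern such as "lock directly before ending the booking" can be checked on a segmented log by looking for a directly-follows pair within each case. The sketch below is only an illustration under stated assumptions: the trace representation, the function name and the "end booking" screen label are hypothetical, and the share is computed over cases, whereas the figure discussed above refers to users.

```python
def share_directly_followed(traces, first, second):
    """Fraction of traces in which `first` is immediately followed by `second`."""
    if not traces:
        return 0.0
    hits = sum(
        1
        for trace in traces
        if any(a == first and b == second for a, b in zip(trace, trace[1:]))
    )
    return hits / len(traces)


# Hypothetical toy cases, each given as a sequence of visited screens.
toy_traces = [
    ["dashboard", "intermediate lock", "end booking"],   # redundant lock
    ["dashboard", "end booking"],
    ["intermediate lock", "dashboard", "end booking"],   # lock, but not directly before
    ["dashboard", "search", "end booking"],
]
toy_share = share_directly_followed(toy_traces, "intermediate lock", "end booking")
```

Grouping the cases by user before counting would give a per-user share instead, which is closer to the statistic shown to the experts.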
Q6: Discuss the fact that only 5% of users visit damages and cleanliness.
The application allows users to report damages to the vehicles and rate their cleanliness, through the pages of the same names. It was possible to observe that only a small percentage of the users seem to follow this routine, which was surprising to the experts. For the vehicle providers it is generally important that the users are reporting problems with the vehicles; optimally, every user should do this for all of their bookings. According to the data, this is however not the case, as only a small percentage of the users are actually using both of the functionalities. The experts, therefore, concluded that a better communication of these functionalities is required.
5 Conclusion
In this paper, we presented a case and user study on the problem of event-case correlation. This classic process mining problem was addressed here in the specific application domain of user interaction data.
We examined a case study, the analysis of click data from a mobility sharing smartphone application. To perform log segmentation, we proposed an original technique based on the word2vec neural network architecture, which can obtain case identification for an unlabeled user interaction log on the sole basis of a link graph of the system as normative information. We then presented a user study, where experts of the process were confronted with insights obtained by applying process mining techniques to the log segmented using our method. The interviews with experts confirm that our technique helped to uncover hidden characteristics of the process, including inefficiencies and anomalies unknown to the domain knowledge of the business owners. Importantly, the analyses yielded actionable suggestions for UI/UX improvements. This substantiates both the scientific value of event-log correlation techniques for user interaction data, and the validity of the segmentation method presented in this paper.
Many avenues for future work are possible. The most prominent one is the need to further validate our technique by lifting it from the scope of a user study by means of a quantitative evaluation, to complement the qualitative one shown in this paper. Our segmentation technique has several points of improvement, including the relatively high number of hyperparameters: thus, it would benefit from a heuristic procedure to determine the (starting) value for such hyperparameters. Lastly, it is important to consider additional event data perspectives: one possibility, in this regard, is to add the data perspective to the technique, by encoding additional attributes to train the neural network model.
Acknowledgements
We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research interactions.
References
[1] van der Aalst, Wil M. P., Vladimir A. Rubin, H. M. W. Verbeek, et al. “Process mining: a two-step approach to balance between underfitting and overfitting”. In: Software and Systems Modeling 9.1 (2010), pp. 87–111. doi:10.1007/s10270-008-0106-z.
[2] Bayomie, Dina, Ahmed Awad, and Ehab Ezat. “Correlating Unlabeled Events from Cyclic Business Processes Execution”. In: Advanced Information Systems Engineering - 28th International Conference, CAiSE 2016, Ljubljana, Slovenia, June 13-17, 2016. Proceedings. Ed. by Nurcan, Selmin, Pnina Soffer, Marko Bajec, et al. Vol. 9694. Lecture Notes in Computer Science. Springer, 2016, pp. 274–289. doi:10.1007/978-3-319-39696-5_17.
[3] Bayomie, Dina, Claudio Di Ciccio, Marcello La Rosa, et al. “A Probabilistic Approach to Event-Case Correlation for Process Mining”. In: Conceptual Modeling - 38th International Conference, ER 2019, Salvador, Brazil, November 4-7, 2019, Proceedings. Ed. by Laender, Alberto H. F., Barbara Pernici, Ee-Peng Lim, et al. Vol. 11788. Lecture Notes in Computer Science. Springer, 2019, pp. 136–152. doi:10.1007/978-3-030-33223-5_12.
[4] Burattin, Andrea, Michael Kaiser, Manuel Neurauter, et al. “Learning process modeling phases from modeling interactions and eye tracking data”. In: Data & Knowledge Engineering 121 (2019), pp. 1–17. doi:10.1016/j.datak.2019.04.001.
[5] Ferreira, Diogo R. and Daniel Gillblad. “Discovering Process Models from Unlabelled Event Logs”. In: Business Process Management, 7th International Conference, BPM 2009, Ulm, Germany, September 8-10, 2009. Proceedings. Ed. by Dayal, Umeshwar, Johann Eder, Jana Koehler, et al. Vol. 5701. Lecture Notes in Computer Science. Springer, 2009, pp. 143–158. doi:10.1007/978-3-642-03848-8_11.
[6] Janssen, Dominik, Felix Mannhardt, Agnes Koschmider, et al. “Process Model Discovery from Sensor Event Data”. In: Process Mining Workshops - ICPM 2020 International Workshops, Padua, Italy, October 5-8, 2020, Revised Selected Papers. Ed. by Leemans, Sander J. J. and Henrik Leopold. Vol. 406. Lecture Notes in Business Information Processing. Springer, 2020, pp. 69–81. doi:10.1007/978-3-030-72693-5_6.
[7] Jlailaty, Diana, Daniela Grigori, and Khalid Belhajjame. “Business Process Instances Discovery from Email Logs”. In: 2017 IEEE International Conference on Services Computing, SCC 2017, Honolulu, HI, USA, June 25-30, 2017. Ed. by Liu, Xiaoqing (Frank) and Umesh Bellur. IEEE Computer Society, 2017, pp. 19–26. doi:10.1109/SCC.2017.12.
[8] Lakhani, Karuna and Apurva Narayan. “A Neural Word Embedding Approach to System Trace Reconstruction”. In: 2019 IEEE International Conference on Systems, Man and Cybernetics, SMC 2019, Bari, Italy, October 6-9, 2019. IEEE, 2019, pp. 285–291. doi:10.1109/SMC.2019.8914322.
[9] Leno, Volodymyr, Adriano Augusto, Marlon Dumas, et al. “Identifying Candidate Routines for Robotic Process Automation from Unsegmented UI Logs”. In: 2nd International Conference on Process Mining, ICPM 2020, Padua, Italy, October 4-9, 2020. Ed. by van Dongen, Boudewijn F., Marco Montali, and Moe Thandar Wynn. IEEE, 2020, pp. 153–160. doi:10.1109/ICPM49681.2020.00031.
[10] Linn, Christian, Phileas Zimmermann, and Dirk Werth. “Desktop Activity Mining - A new level of detail in mining business processes”. In: 48. Jahrestagung der Gesellschaft für Informatik, Architekturen, Prozesse, Sicherheit und Nachhaltigkeit, INFORMATIK 2018 - Workshops, Berlin, Germany, September 26-27, 2018. Ed. by Czarnecki, Christian, Carsten Brockmann, Eldar Sultanow, et al. Vol. P-285. LNI. GI, 2018, pp. 245–258. url:https://dl.gi.de/20.500.12116/17225.
[11] Marrella, Andrea and Tiziana Catarci. “Measuring the Learnability of Interactive Systems Using a Petri Net Based Approach”. In: Proceedings of the 2018 on Designing Interactive Systems Conference 2018, DIS 2018, Hong Kong, China, June 09-13, 2018. Ed. by Koskinen, Ilpo, Youn-Kyung Lim, Teresa Cerratto Pargman, et al. ACM, 2018, pp. 1309–1319. doi:10.1145/3196709.3196744.
[12] Mikolov, Tomás, Ilya Sutskever, Kai Chen, et al. “Distributed Representations of Words and Phrases and their Compositionality”. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States. Ed. by Burges, Christopher J. C., Léon Bottou, Zoubin Ghahramani, et al. 2013, pp. 3111–3119. url:https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.
[13] de Murillas, Eduardo González López, Hajo A. Reijers, and Wil M. P. van der Aalst. “Case notion discovery and recommendation: automated event log building on databases”. In: Knowledge and Information Systems 62.7 (2020), pp. 2539–2575. doi:10.1007/s10115-019-01430-6.
[14] Pegoraro, Marco, Bianka Bakullari, Merih Seran Uysal, et al. “Probability Estimation of Uncertain Process Trace Realizations”. In: Process Mining Workshops - ICPM 2021 International Workshops, Eindhoven, The Netherlands, October 31 - November 4, 2021, Revised Selected Papers. Ed. by Munoz-Gama, Jorge and Xixi Lu. Vol. 433. Lecture Notes in Business Information Processing. Springer, 2021, pp. 21–33. doi:10.1007/978-3-030-98581-3_2.
[15] Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “Conformance checking over uncertain event data”. In: Information Systems 102 (2021), p. 101810. doi:10.1016/j.is.2021.101810.
[16] Pegoraro, Marco, Merih Seran Uysal, and Wil M. P. van der Aalst. “PROVED: A Tool for Graph Representation and Analysis of Uncertain Event Data”. In: Application and Theory of Petri Nets and Concurrency - 42nd International Conference, PETRI NETS 2021, Virtual Event, June 23-25, 2021, Proceedings. Ed. by Buchs, Didier and Josep Carmona. Vol. 12734. Lecture Notes in Computer Science. Springer, 2021, pp. 476–486. doi:10.1007/978-3-030-76983-3_24.
[17] Pourmirza, Shaya, Remco M. Dijkman, and Paul Grefen. “Correlation Miner: Mining Business Process Models and Event Correlations Without Case Identifiers”. In: International Journal of Cooperative Information Systems 26.2 (2017), 1742002:1–1742002:32. doi:10.1142/S0218843017420023.
[18] Ramirez, Andres Jimenez, Hajo A. Reijers, Irene Barba, et al. “A Method to Improve the Early Stages of the Robotic Process Automation Lifecycle”. In: Advanced Information Systems Engineering - 31st International Conference, CAiSE 2019, Rome, Italy, June 3-7, 2019, Proceedings. Ed. by Giorgini, Paolo and Barbara Weber. Vol. 11483. Lecture Notes in Computer Science. Springer, 2019, pp. 446–461. doi:10.1007/978-3-030-21290-2_28.