Chapter
PDF available

A Probabilistic Approach to Event-Case Correlation for Process Mining


Abstract

Process mining aims to understand the actual behavior and performance of business processes from event logs recorded by IT systems. A key requirement is that every event in the log must be associated with a unique case identifier (e.g., the order ID in an order-to-cash process). In reality, however, this case ID may not always be present, especially when logs are acquired from different systems or when such systems have not been explicitly designed to offer process-tracking capabilities. Existing techniques for correlating events have worked with assumptions to make the problem tractable: some assume the generative processes to be acyclic, while others require heuristic information or user input. In this paper, we lift these assumptions by presenting a novel technique called EC-SA based on probabilistic optimization. Given as input a sequence of timestamped events (the log without case IDs) and a process model describing the underlying business process, our approach returns an event log in which every event is mapped to a case identifier. The approach minimises both the misalignment between the generated log and the input process model and the variance of activity durations across cases. The experiments conducted on a variety of real-life datasets show the advantages of our approach over the state of the art.
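To make the search concrete, the following is a minimal, self-contained sketch of simulated-annealing-based event correlation. It is an illustration only, not the EC-SA implementation: the process model is reduced to a hypothetical set of allowed directly-follows activity pairs (EC-SA computes proper alignments), and the events, case count, and weights are made up.

```python
# A toy sketch of event-case correlation by simulated annealing.
# Assumptions: "model" = allowed directly-follows pairs; fixed number of
# cases; energy = misalignment proxy + weighted duration variance.
import math
import random
from statistics import pvariance

events = [("A", 1), ("A", 2), ("B", 3), ("B", 5), ("C", 6), ("C", 8)]
allowed = {("A", "B"), ("B", "C")}   # assumed process-model behaviour
n_cases = 2

def energy(assignment, lam=0.01):
    """Misalignment w.r.t. allowed pairs plus activity-duration variance."""
    cases = {}
    for (act, ts), cid in zip(events, assignment):
        cases.setdefault(cid, []).append((act, ts))
    cost, durations = 0, []
    for trace in cases.values():
        trace.sort(key=lambda e: e[1])
        for (a1, t1), (a2, t2) in zip(trace, trace[1:]):
            if (a1, a2) not in allowed:
                cost += 1            # proxy for alignment-based fitness
            durations.append(t2 - t1)
    var = pvariance(durations) if len(durations) > 1 else 0.0
    return cost + lam * var

random.seed(0)
state = [random.randrange(n_cases) for _ in events]
temp = 5.0
while temp > 0.01:
    cand = state[:]
    cand[random.randrange(len(events))] = random.randrange(n_cases)
    delta = energy(cand) - energy(state)
    if delta <= 0 or random.random() < math.exp(-delta / temp):
        state = cand                 # Metropolis acceptance
    temp *= 0.95                     # geometric cooling schedule

print(state)   # a case ID for every event
```

The Metropolis step occasionally accepts a worse assignment, which lets the search escape local optima; the cooling schedule makes such moves rarer as the search converges.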
... In previous work [18], we introduced a probabilistic optimization technique called EC-SA (Events Correlation by Simulated Annealing) based on a simulated annealing heuristic approach. EC-SA addresses the correlation problem as a multilevel optimization problem. ...
... Also, it is computationally inefficient due to the breadth-first search approach. In a previous paper [18], we proposed the Event Correlation by Simulated Annealing (EC-SA) approach, which uses the event names and timestamps in addition to the process model. EC-SA addresses the correlation problem as a multi-level optimization problem, as it searches for the nearest optimal correlated log considering the fitness with an input process model and the activities' timed behavior within the log. ...
... The definition of a complete set of measures to compare logs goes beyond the scope of this paper. The measures we propose here are inspired by related work in the literature [18,27,56,59-61] and, as the experimental results evidenced, provide a good trade-off between run-time computability and the level of detail used in the comparison. A refinement and enrichment of this set can be part of future investigations. ...
Article
Process mining supports the analysis of the actual behavior and performance of business processes using event logs. An essential requirement is that every event in the log must be associated with a unique case identifier (e.g., the order ID of an order-to-cash process). In reality, however, this case identifier may not always be present, especially when logs are acquired from different systems or extracted from non-process-aware information systems. In such settings, the event log needs to be pre-processed by grouping events into cases – an operation known as event correlation. Existing techniques for correlating events have worked with assumptions to make the problem tractable: some assume the generative processes to be acyclic, while others require heuristic information or user input. Moreover, they abstract the log to activities and timestamps, and miss the opportunity to use data attributes. In this paper, we lift these assumptions and propose a new technique called EC-SA-Data based on probabilistic optimization. The technique takes as inputs a sequence of timestamped events (the log without case IDs), a process model describing the underlying business process, and constraints over the event attributes. Our approach returns an event log in which every event is associated with a case identifier. The technique allows users to incorporate rules on process knowledge and data constraints flexibly. The approach minimizes the misalignment between the generated log and the input process model, maximizes the support of the given data constraints over the correlated log, and minimizes the variance of activity durations across cases. Our experiments with various real-life datasets show the advantages of our approach over the state of the art.
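A sketch of how such data constraints could enter the objective as a support term is shown below. The constraint language and the attribute name are assumptions made for illustration, not EC-SA-Data's actual formulation.

```python
# Support of data constraints over a tentatively correlated log. A constraint
# is modelled here as a predicate over two consecutive events of a case;
# support = fraction of within-case adjacent pairs satisfying all constraints.
def constraint_support(cases, constraints):
    sat = tot = 0
    for trace in cases.values():          # cases: case ID -> event dicts
        for e1, e2 in zip(trace, trace[1:]):
            tot += 1
            sat += all(c(e1, e2) for c in constraints)
    return sat / tot if tot else 1.0

# Hypothetical constraint: consecutive events of a case share a customer.
same_customer = lambda e1, e2: e1["customer"] == e2["customer"]

# The annealing energy could then trade off all three terms, e.g.:
#   energy = misalignment + lam * variance + mu * (1 - support)
```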
... To address this issue, several approaches for correlating events were proposed. They work under different assumptions about the domain, such as acyclicity of the process [2], [3], the existence of case identifiers among the event attributes [4], [5], a profiling of the activities' execution time [6], or additional data rules about event attributes [7]. The experiments conducted in [7] highlight that more detailed domain knowledge (in the form of accurate data rules) implies a higher quality of the output correlated event log. However, extra information about the domain is not always available or easy to acquire. ...
... Our approach learns rules during the event correlation process based on partial versions of the correlated event log. To this end, we build on the event correlation process presented in [7]. This approach uses simulated annealing to iterate over the search space and generate a correlated event log. ...
Conference Paper
Process mining analyzes business processes’ behavior and performance using event logs. An essential requirement is that events are grouped into cases representing the execution of process instances. However, logs extracted from different systems or non-process-aware information systems do not map events to unique case identifiers (case IDs). In such settings, the event log needs to be pre-processed to group events into cases – an operation known as event correlation. Existing techniques for correlating events work with different assumptions: some assume the generating processes are acyclic, others require extra domain knowledge such as the relation between the events and event attributes, or heuristic information about the activities’ execution time behavior. However, such domain knowledge is not always available or easy to acquire, compromising the quality of the correlated event log. In this paper, we propose a new technique called EC-SA-RM, which correlates the events using a simulated annealing technique and iteratively learns the domain knowledge as a set of association rules. The technique requires a sequence of timestamped events (i.e., the log without case IDs) and a process model describing the underlying business process. At each iteration of the simulated annealing, a possible correlated log is generated. Then, EC-SA-RM uses this correlated log to learn a set of association rules that represent the relationship between the events and the changing behavior over the events’ attributes in an understandable way. These rules enrich the input and improve the event correlation process for the next iteration. EC-SA-RM returns an event log in which events are grouped into cases, together with a set of association rules that explain the correlation over the events. We evaluate our approach using four real-life datasets.
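The per-iteration rule-learning step can be sketched as follows, under an assumed and deliberately simple rule language (attribute-stability rules); EC-SA-RM's actual association rules are richer than this.

```python
# Learn "attribute x keeps one value within a case" rules from a tentatively
# correlated log; rules holding in most cases feed the next SA iteration.
from collections import Counter

def learn_rules(cases, min_support=0.8):
    stable, total = Counter(), Counter()
    for trace in cases.values():          # cases: case ID -> event dicts
        attrs = set().union(*(e.keys() for e in trace)) - {"activity", "ts"}
        for a in attrs:
            values = [e[a] for e in trace if a in e]
            total[a] += 1
            if len(set(values)) == 1:     # value never changes in this case
                stable[a] += 1
    return {a for a in total if stable[a] / total[a] >= min_support}

# Loop: correlate with SA -> learn_rules(...) -> use the learned rules as
# data constraints in the energy function of the next iteration.
```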
... The problem of assigning a case identifier to events in a log is a long-standing challenge in the process mining community [17], and is known by multiple names in literature, including event-case correlation problem [10] and case notion discovery problem [33]. Event logs where events are missing the case identifier attribute are usually referred to as unlabeled event logs [17]. ...
... This comes at the cost of a slow computing time. An improvement of the aforementioned method [10] employs simulated annealing to select an optimal case notion; while still computationally heavy, this method delivers high-quality case attribution results. This was further improved in [9], where the authors reduce the dependence of the method on control-flow information and exploit user-defined rules to obtain a higher-quality result. ...
Preprint
Modern software systems are able to record vast amounts of user actions, stored for later analysis. One of the main types of such user interaction data is click data: the digital trace of the actions of a user through the graphical elements of an application, website or software. While readily available, click data is often missing a case notion: an attribute linking events from user interactions to a specific process instance in the software. In this paper, we propose a neural network-based technique to determine a case notion for click data, thus enabling process mining and other process analysis techniques on user interaction data. We describe our method, show its scalability to datasets of large dimensions, and we validate its efficacy through a user study based on the segmented event log resulting from interaction data of a mobility sharing company. Interviews with domain experts in the company demonstrate that the case notion obtained by our method can lead to actionable process insights.
... Such a case notion correlates events of a process instance and represents them as a single sequence, e.g., a sequence of events of a patient. However, in real-life business processes supported by ERP systems such as SAP and Oracle, multiple objects (i.e., multiple sequences of events) exist in a process instance [2,7] and they share events (i.e., sequences are overlapping). Fig. 1(a) shows a process instance in a simple blood test process as multiple overlapping sequences. ...
Chapter
Full-text available
Performance analysis in process mining aims to provide insights on the performance of a business process by using a process model as a formal representation of the process. Existing techniques for performance analysis assume that a single case notion exists in a business process (e.g., a patient in a healthcare process). However, in reality, different objects might interact (e.g., order, delivery, and invoice in an O2C process). In such a setting, traditional techniques may yield misleading or even incorrect insights on performance metrics such as waiting time. More importantly, by considering the interaction between objects, we can define object-centric performance metrics such as synchronization time, pooling time, and lagging time. In this work, we propose a novel approach to performance analysis considering multiple case notions by using object-centric Petri nets as formal representations of business processes. The proposed approach correctly computes existing performance metrics, while supporting the derivation of newly-introduced object-centric performance metrics. We have implemented the approach as a web application and conducted a case study based on a real-life loan application process.
Keywords: Performance analysis, Object-centric process mining, Object-centric Petri net, Actionable insights, Process improvement
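For intuition, one of the newly introduced metrics can be reduced to a very small computation. The sketch below treats synchronization time simply as the spread between the moments at which the required objects become ready for an event; the paper defines this properly over object-centric Petri net semantics.

```python
# Synchronization time, heavily simplified: how long the earliest-ready
# object waits for the last required object before the event can occur.
def synchronization_time(ready_times):
    """ready_times: one readiness timestamp per required object."""
    return max(ready_times) - min(ready_times)

# e.g. the order is ready at t=2, the delivery only at t=7:
print(synchronization_time([2, 7]))   # 5 time units of synchronization
```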
... The development of a probabilistic approach to mathematical modeling in scientific research and engineering is highly relevant. To name only the areas most in demand for probabilistic and statistical models: medicine [1], power engineering [2], chemical technologies, primarily those related to the oil and gas industry [3], information technology [4], business process analysis [5], and so on. At the same time, from the employers' point of view, it is precisely the construction of mathematical models that is a weak point in the training of engineers and economists today. ...
Article
Full-text available
ON CHOICE FROM FINITE VS INFINITE FIELDS IN COURSE OF PROBABILITY THEORY
Krasnoshchekov V.V.¹, Semenova N.V.¹, Mohamed B.M.M.², Bakkar M.M.A.³
¹ Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia (195220, St. Petersburg, Polytechnicheskaya street, 29), e-mail: krasno_vv@spbstu.ru
² Cairo University, Giza, Egypt (12613, 1 Gamaa Street, Giza, Egypt), e-mail: boss_3ds@yahoo.com
³ University Al-Baath, Damascus, Syria (PP75+5VC, Aleppo Highway, Damascus, Syria), e-mail: mamadyan1997@gmail.com
The authors continue to study the accuracy and the limits of applicability of probabilistic models. The concept of accuracy is an important component of the competencies of university graduates in the field of mathematical modeling. In this paper, the authors compare probabilities calculated using exact and approximate formulas. They find the probabilities of selection from an infinite field of options using the exact Bernoulli formula, which in this case rests on the statistical definition of probability. Since in practical problems only a choice from a finite field of options is possible, calculations are also made according to the classical selection formula. All experiments use the same task, which is neutral in content and at the same time allows for a simple interpretation of the results. The computed absolute and relative errors demonstrate fairly fast convergence of the approximate results to the exact ones. The authors thus empirically establish the limit sizes of the bank of variants at which the exact and approximate results differ by no more than 1%. Fitted approximations of the convergence curves yield formulas for the minimum required size of the bank of variants, applicable at medium risk levels of the processes under consideration.
Keywords: teaching the theory of probability, Bernoulli's formula, choice based on the classical definition, absolute error, relative error
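On one plausible reading of this abstract, the formulas being compared are the Bernoulli (binomial) probability for selection from an infinite field and the classical (hypergeometric) probability for a finite bank of N variants, K of them favorable, with p = K/N; this identification is an assumption, not stated in the abstract.

```latex
% Assumed reading of the compared formulas and the relative error:
P_{\infty}(k) = \binom{n}{k} p^{k} (1-p)^{\,n-k},
\qquad
P_{N}(k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}},
\qquad
\delta(N) = \frac{\lvert P_{N}(k) - P_{\infty}(k) \rvert}{P_{\infty}(k)}.
```

The empirically established bank sizes would then be the smallest N for which \delta(N) \le 0.01, matching the 1% threshold mentioned above.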
Article
Full-text available
Cross-organizational process mining aims to discover an entire process model across multiple organizations whose identifier (ID) systems are not managed uniformly, each organization having an independent ID system. Cross-organizational process mining has been gaining popularity as information systems increase in complexity. However, previous methods have limitations: they do not work well for event logs that contain only common items, or for cyclic orchestrations, i.e., models that contain loops. In this paper, we propose an accurate cross-organizational process mining technique based on a step-by-step case ID identification mechanism that uses only common items in event logs and can handle cyclic orchestrations. Step-by-step case ID identification repeats the following steps: 1) identification of case IDs based on the activity connections of adjacent event pairs, and 2) extraction of additional activity connections by leveraging the newly identified case IDs. We alternately identify the most probable case ID pairs and remove events belonging to these identified case IDs from the event log, which contributes to extracting additional activity connections and narrowing down the candidate case ID pairs. Evaluation using real-world event logs showed that the proposed method generates the process model with more than 98.4% precision and more than 94.2% recall for two datasets, outperforming previous methods.
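The alternating mechanism can be sketched as a loop that repeatedly commits the most confident pairing and removes its events, so that the remaining pairs become less ambiguous. The pairing score below is a placeholder, not the paper's probability model, and the event layout is assumed.

```python
# Step-by-step case-ID pair identification (illustrative skeleton).
def correlate(events, score_pair, threshold=0.9):
    """events: (activity, local_case_id, org) tuples in timestamp order."""
    remaining, matched = list(events), []
    while remaining:
        candidates = [(score_pair(e1, e2), e1, e2)
                      for e1, e2 in zip(remaining, remaining[1:])]
        best = max(candidates, default=None, key=lambda c: c[0])
        if best is None or best[0] < threshold:
            break                          # no confident pairing left
        _, e1, e2 = best
        matched.append((e1[1], e2[1]))     # pair of per-organization IDs
        # removing matched events narrows down the remaining candidates
        remaining = [e for e in remaining if e not in (e1, e2)]
    return matched
```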
Chapter
Event logs are the main source for business process mining techniques. However, not all information systems produce a standard event log. Furthermore, logs may reflect only parts of the process, which may span multiple systems. We suggest using network traffic data to fill these gaps. However, traffic data is interleaved and noisy, and there is a conceptual gap between this data and event logs at the business level. This paper proposes a method for producing event logs from network traffic data. The specific challenges addressed are (a) abstracting the low-level data to business-meaningful activities, (b) overcoming the interleaving of low-level events due to concurrency of activities and processes, and (c) associating the abstracted events to cases. The method uses two trained sequence models based on conditional random fields (CRF), applied to data reflecting interleaved activities. We use simulated traffic data generated by a predefined business process. The data is annotated for sequence learning to produce models, which are then used to identify concurrently performed activities and cases and to produce an event log. The produced event log is checked for conformance against the process model, yielding high fitness and precision scores.
Keywords: Event abstraction, Sequence models, Network traffic, Interleaved data
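The sequence-labeling idea can be sketched with a linear-chain CRF via the sklearn-crfsuite library, as below. All endpoints, features, and labels here are invented; the chapter trains two such models (one for abstracting activities, one for associating events to cases) on annotated simulated traffic.

```python
# Tag each low-level network event with a business activity using a CRF.
import sklearn_crfsuite  # pip install sklearn-crfsuite

traffic_sessions = [
    [{"endpoint": "/login", "method": "POST"},
     {"endpoint": "/cart", "method": "GET"},
     {"endpoint": "/pay", "method": "POST"}],
]
activity_labels = [["Authenticate", "Browse", "Checkout"]]

def features(i, session):
    prev = session[i - 1]["endpoint"] if i > 0 else "START"
    return {"endpoint": session[i]["endpoint"],
            "method": session[i]["method"],
            "prev_endpoint": prev}          # context feature for the chain

X = [[features(i, s) for i in range(len(s))] for s in traffic_sessions]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=50)
crf.fit(X, activity_labels)
print(crf.predict(X))   # one activity label per low-level event
```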
Chapter
Full-text available
User interaction logs allow us to analyze the execution of tasks in a business process at a finer level of granularity than event logs extracted from enterprise systems. The fine-grained nature of user interaction logs open up a number of use cases. For example, by analyzing such logs, we can identify best practices for executing a given task in a process, or we can elicit differences in performance between workers or between teams. Furthermore, user interaction logs allow us to discover repetitive and automatable routines that occur during the execution of one or more tasks in a process. Along this line, this chapter introduces a family of techniques, called Robotic Process Mining (RPM), which allow us to discover repetitive routines that can be automated using robotic process automation technology. The chapter presents a structured landscape of concepts and techniques for RPM, including techniques for user interaction log preprocessing, techniques for discovering frequent routines, notions of routine automatability, as well as techniques for synthesizing executable routine specifications for robotic process automation.
Chapter
Robotic Process Automation (RPA) is an emerging automation technology in the field of Business Process Management (BPM) that creates software (SW) robots to partially or fully automate rule-based and repetitive tasks (a.k.a. routines) previously performed by human users in their applications’ user interfaces (UIs). Nowadays, successful usage of RPA requires strong support by skilled human experts, from the discovery of the routines to be automated (i.e., the so-called segmentation issue of UI logs) to the development of the executable scripts required to enact SW robots. In this paper, we present a human-in-the-loop approach to filter out the routine behaviors (a.k.a. routine segments) not allowed (i.e., wrongly discovered from the UI log) by any real-world routine under analysis, thus supporting human experts in the identification of valid routine segments. We have also measured to which extent the human-in-the-loop strategy satisfies three relevant non-functional requirements, namely effectiveness, robustness and usability.
Keywords: Robotic Process Automation, Segmentation of UI logs, Declarative constraints
Chapter
Robotic process automation (RPA) is a technology that is presented as a universal tool that solves major problems of modern businesses. It aims to reduce costs, improve quality and create customer value. However, the business reality differs from this aspiration. After interviews with managers, we found that the implementation of robots does not always lead to the assumed effect and some robots are subsequently withdrawn from companies. In consequence, people take over robotized tasks to perform them manually again, in practice replacing the robots back; we call this ‘re-manualization’. Unfortunately, companies do not seem to be aware of this possibility until they experience it on their own; to the best of our knowledge, no previous research has described or analysed this phenomenon so far. This lack of awareness, however, may pose risks and even be harmful for organizations. In this paper, we present an exploratory study. We used individual interviews, group discussions with managers experienced in RPA, and secondary data analysis to elaborate on the re-manualization phenomenon. As a result, we found four types of ‘cause and effect’ narrations that reflect reasons for this to occur: (1) overenthusiasm for RPA, (2) low awareness and fear of robots, (3) legal or supply change and (4) code faults.
Keywords: Robotic process automation, RPA, Software robot, Investment, Information systems, Work manualization
Article
Full-text available
The problem of automated discovery of process models from event logs has been intensively researched in the past two decades. Despite a rich field of proposals, state-of-the-art automated process discovery methods suffer from two recurrent deficiencies when applied to real-life logs: (i) they produce large and spaghetti-like models; and (ii) they produce models that either poorly fit the event log (low fitness) or over-generalize it (low precision). Striking a trade-off between these quality dimensions in a robust and scalable manner has proved elusive. This paper presents an automated process discovery method, namely Split Miner, which produces simple process models with low branching complexity and consistently high and balanced fitness and precision, while achieving considerably faster execution times than state-of-the-art methods, measured on a benchmark covering twelve real-life event logs. Split Miner combines a novel approach to filter the directly-follows graph induced by an event log, with an approach to identify combinations of split gateways that accurately capture the concurrency, conflict and causal relations between neighbors in the directly-follows graph. Split Miner is also the first automated process discovery method that is guaranteed to produce deadlock-free process models with concurrency, while not being restricted to producing block-structured process models.
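As a concrete illustration of the first ingredient, a directly-follows graph (DFG) can be built from an event log and frequency-filtered as below. The ratio cutoff is a simplification; Split Miner's actual filter is percentile-based and preserves connectivity.

```python
# Build a DFG from traces and prune infrequent arcs (simplified filter).
from collections import Counter

log = [["A", "B", "C"], ["A", "C", "B"], ["A", "B", "C"], ["A", "X", "C"]]

dfg = Counter((t[i], t[i + 1]) for t in log for i in range(len(t) - 1))

def filter_dfg(dfg, keep_ratio=0.75):
    """Keep arcs whose frequency reaches keep_ratio of the max frequency."""
    cutoff = keep_ratio * max(dfg.values())
    return {arc: f for arc, f in dfg.items() if f >= cutoff}

print(filter_dfg(dfg))   # infrequent arcs such as ('A', 'X') drop out
```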
Article
Full-text available
The domains of complex event processing (CEP) and business process management (BPM) have different origins but in many respects draw on similar concepts. While specific combinations of BPM and CEP have attracted research attention, resulting in solutions to specific problems, we attempt to take a broad view of the opportunities and challenges involved. We first illustrate these by a detailed example from the logistics domain. We then propose a mapping of this area into four quadrants – two quadrants drawing from CEP to create or extend process models and two quadrants starting from a process model to address how it can guide CEP. Existing literature is reviewed and specific challenges and opportunities are indicated for each of these quadrants. Based on this mapping, we identify challenges and opportunities that recur across quadrants and can be considered the core issues of this combination. We suggest that addressing these issues in a generic manner would form a sound basis for future applications and advance this area significantly.
Conference Paper
Full-text available
Nowadays, many business processes that were once intra-organizational are becoming inter-organizational. Thus, being able to monitor how such processes are performed, including the portions carried out by service providers, is paramount. Yet, traditional process monitoring techniques present some shortcomings when dealing with inter-organizational processes. In particular, they require human operators to notify when business activities are performed, and to stop the process when it is not executed as expected. In this paper, we address these issues by proposing an artifact-driven monitoring service, capable of autonomously and continuously monitoring inter-organizational processes. To do so, this service relies on the state of the artifacts (i.e., physical entities) participating in the process, represented using the E-GSM notation. A working prototype of this service is presented and validated using real-world processes and data from the logistics domain.
Article
Full-text available
Nowadays, business processes are increasingly supported by IT services that produce massive amounts of event data during the execution of a process. These event data can be used to analyze the process using process mining techniques to discover the real process, measure conformance to a given process model, or to enhance existing models with performance information. Mapping the produced events to activities of a given process model is essential for conformance checking, annotation and understanding of process mining results. In order to accomplish this mapping with low manual effort, we developed a semi-automatic approach that maps events to activities using insights from behavioral analysis and label analysis. The approach extracts Declare constraints from both the log and the model to build matching constraints to efficiently reduce the number of possible mappings. These mappings are further reduced using techniques from natural language processing, which allow for a matching based on labels and external knowledge sources. The evaluation with synthetic and real-life data demonstrates the effectiveness of the approach and its robustness toward non-conforming execution logs.
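For illustration, checking a single Declare template over a log is a small computation, as sketched below; the approach extracts many such constraints from both the log and the model and uses them to prune candidate event-activity mappings.

```python
# Declare "response(a, b)": whenever a occurs, b must occur later in the trace.
def holds_response(log, a, b):
    for trace in log:
        for i, act in enumerate(trace):
            if act == a and b not in trace[i + 1:]:
                return False
    return True

log = [["register", "check", "pay"], ["register", "pay"]]
print(holds_response(log, "register", "pay"))    # True
print(holds_response(log, "register", "check"))  # False: 2nd trace violates
```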
Article
Full-text available
In the era of "big data" one of the key challenges is to analyze large amounts of data collected in meaningful and scalable ways. The field of process mining is concerned with the analysis of data of a particular nature, namely data that results from the execution of business processes. The analysis of such data can be negatively influenced by the presence of outliers, which reflect infrequent behavior or "noise". In process discovery, where the objective is to automatically extract a process model from the data, this may result in rarely travelled pathways that clutter the process model. This paper presents an automated technique for the removal of infrequent behavior from event logs. The proposed technique is evaluated in detail, and it is shown that its application in conjunction with certain existing process discovery algorithms significantly improves the quality of the discovered process models and that it scales well to large datasets.
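A deliberately simple sketch of the goal is shown below: dropping whole infrequent trace variants. The article's technique is finer-grained, removing individual events by means of an automaton of frequent behavior rather than whole traces, so this is only an illustration.

```python
# Remove trace variants whose relative frequency is below a threshold.
from collections import Counter

def filter_variants(log, min_rel_freq=0.3):
    variants = Counter(tuple(t) for t in log)
    return [list(v)
            for v, f in variants.items()
            if f / len(log) >= min_rel_freq
            for _ in range(f)]

log = [["A", "B", "C"]] * 3 + [["A", "C"]]
print(filter_variants(log))   # the rare variant ['A', 'C'] is removed
```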
Chapter
Process monitoring aims to provide transparency over operational aspects of a business process. In practice, it is a challenge that traces of business process executions span a number of diverse systems. It is cumbersome manual engineering work to identify which attributes in unstructured event data can serve as case and activity identifiers for extracting and monitoring the business process. Approaches from the literature assume that these identifiers are known a priori and that data is readily available in formats like the eXtensible Event Stream (XES). However, in practice this is hardly the case, specifically when event data from different sources are pooled together in event stores. In this paper, we address this research gap by inferring potential case and activity identifiers in a provenance-agnostic way. More specifically, we propose a semi-automatic technique for discovering event relations that are semantically relevant for business process monitoring. The results are evaluated in an industry case study with an international telecommunication provider.
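One way such an inference could rank candidate case-identifier attributes is sketched below. The heuristic (a good case ID groups events into many moderately sized groups, here with an assumed typical case size of five events) is invented for illustration and is not the paper's relevance measure.

```python
# Score a column's suitability as a case identifier (illustrative heuristic).
from collections import Counter

def case_id_score(values, typical_case_size=5):
    groups = Counter(values)
    n, g = len(values), len(groups)
    if g in (1, n):
        return 0.0                   # constant column or unique per event
    avg = n / g                      # average events per candidate case
    return 1.0 / (1.0 + abs(avg - typical_case_size))

rows = [{"sid": "s1", "event_id": 1, "status": "ok"},
        {"sid": "s1", "event_id": 2, "status": "ok"},
        {"sid": "s1", "event_id": 3, "status": "ok"},
        {"sid": "s2", "event_id": 4, "status": "ok"},
        {"sid": "s2", "event_id": 5, "status": "ok"}]
for col in ("sid", "event_id", "status"):
    print(col, case_id_score([r[col] for r in rows]))
# "sid" scores highest; constant and unique-per-event columns score 0.
```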
Article
Process discovery algorithms aim to capture process models from event logs. These algorithms have been designed for logs in which the events that belong to the same case are related to each other — and to that case — by means of a unique case identifier. However, in service-oriented systems, these case identifiers are rarely stored beyond request-response pairs, which makes it hard to relate events that belong to the same case. This is known as the correlation challenge. This paper addresses the correlation challenge by introducing a technique, called the correlation miner, that facilitates the discovery of business process models when events are not associated with a case identifier. It extends previous work on the correlation miner by not only enabling the discovery of the process model, but also detecting which events belong to the same case. Experiments performed on both synthetic and real-world event logs show the applicability of the correlation miner. The resulting technique enables us to observe a service-oriented system and determine — with high accuracy — which request-response pairs sent by different communicating parties are related to each other.
Article
Process mining methods allow analysts to exploit logs of historical executions of business processes in order to extract insights regarding the actual performance of these processes. One of the most widely studied process mining operations is automated process discovery. An automated process discovery method takes as input an event log, and produces as output a business process model that captures the control-flow relations between tasks that are observed in or implied by the event log. Several dozen automated process discovery methods have been proposed in the past two decades, striking different trade-offs between scalability, accuracy and complexity of the resulting models. So far, automated process discovery methods have been evaluated in an ad hoc manner, with different authors employing different datasets, experimental setups, evaluation measures and baselines, often leading to incomparable conclusions and sometimes unreproducible results due to the use of non-publicly available datasets. In this setting, this article provides a systematic review of automated process discovery methods and a systematic comparative evaluation of existing implementations of these methods using an open-source benchmark covering nine publicly-available real-life event logs and eight quality metrics. The review and evaluation results highlight gaps and unexplored trade-offs in the field, including the lack of scalability of several proposals and a strong divergence in the performance of different methods with respect to different quality metrics. The proposed benchmark allows researchers to empirically compare new automated process discovery methods against existing ones in a unified setting.
Book
This is the second edition of Wil van der Aalst’s seminal book on process mining, which now discusses the field also in the broader context of data science and big data approaches. It includes several additions and updates, e.g. on inductive mining techniques, the notion of alignments, a considerably expanded section on software tools and a completely new chapter on process mining in the large. It is self-contained, while at the same time covering the entire process-mining spectrum from process discovery to predictive analytics. After a general introduction to data science and process mining in Part I, Part II provides the basics of business process modeling and data mining necessary to understand the remainder of the book. Next, Part III focuses on process discovery as the most important process mining task, while Part IV moves beyond discovering the control flow of processes, highlighting conformance checking, and organizational and time perspectives. Part V offers a guide to successfully applying process mining in practice, including an introduction to the widely used open-source tool ProM and several commercial products. Lastly, Part VI takes a step back, reflecting on the material presented and the key open challenges. Overall, this book provides a comprehensive overview of the state of the art in process mining. It is intended for business process analysts, business consultants, process managers, graduate students, and BPM researchers.