Chapter (PDF available)

Abstract

With the increasing availability of business process related event logs, the scalability of techniques that discover a process model from such logs becomes a performance bottleneck. In particular, exploratory analysis that investigates manifold parameter settings of discovery algorithms, potentially using a software-as-a-service tool, relies on fast response times. However, common approaches for process model discovery always parse and analyse all available event data, whereas a small fraction of a log could have already led to a high-quality model. In this paper, we therefore present a framework for process discovery that relies on statistical pre-processing of an event log and significantly reduces its size by means of sampling. It thereby reduces the runtime and memory footprint of process discovery algorithms, while providing guarantees on the introduced sampling error. Experiments with two public real-world event logs reveal that our approach speeds up state-of-the-art discovery algorithms by a factor of up to 20.
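As a rough illustration of this kind of statistical pre-processing, the Python sketch below samples traces until a batch contributes almost no new directly-follows relations. The function names, batch size, and the plain saturation check are illustrative assumptions; the paper derives its stopping criterion from a statistical test with explicit error guarantees.

```python
import random

def directly_follows(trace):
    """Directly-follows pairs of a trace given as a list of activity labels."""
    return {(a, b) for a, b in zip(trace, trace[1:])}

def sample_log(log, delta=0.05, batch=50, seed=42):
    """Sample traces until a batch contributes (almost) no new directly-follows
    pairs. `delta` is the tolerated fraction of 'new-information' traces per
    batch; the paper uses a statistical hypothesis test here, which this sketch
    replaces by a simple saturation check (an illustrative assumption)."""
    rng = random.Random(seed)
    remaining = list(log)
    rng.shuffle(remaining)
    seen_pairs, sample = set(), []
    while remaining:
        chunk, remaining = remaining[:batch], remaining[batch:]
        new = sum(1 for t in chunk if not directly_follows(t) <= seen_pairs)
        for t in chunk:
            seen_pairs |= directly_follows(t)
        sample.extend(chunk)
        if new / len(chunk) < delta:
            break
    return sample

# Toy usage: 9,000 traces drawn from three variants; only a fraction is kept.
log = [v for v in (["a", "b", "c"], ["a", "c", "b"], ["a", "b", "b", "c"])
       for _ in range(3000)]
print(len(sample_log(log)), "of", len(log), "traces kept")
```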
... One sampling strategy is to iteratively grow the number of traces until an objective criterion is met. Bauer et al. propose a statistical framework to perform this task using statistical tests as a stopping criterion [10], [12]. Berti et al. use a similar sampling strategy that stops when the pair-wise activity dependencies resemble those of the original event log [13]. ...
Conference Paper
Full-text available
When event logs are large, the time needed to analyze them using process mining techniques can become prohibitive. In this paper, using sampling, we aim to reduce the size of event logs to p-traces, while minimizing the Earth Movers' Distance (EMD) from the unsampled original event log. We contribute by formalizing log sampling in a canonical form and showing its link with the EMD, a metric increasingly used for process mining. Next, we propose three log-sampling algorithms that we evaluate using a collection of 18 event logs from industry. We show that our approach largely reduces the EMD compared to existing sampling strategies. Moreover, we observe that sampled event logs with low EMDs tend to have better behavioural quality, which underlines the generality of our work.
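For intuition on the distance used above, the following sketch compares the trace-variant distributions of an original and a sampled log under a unit ground distance (any two distinct variants are at distance 1), in which case the EMD collapses to half the L1 distance. This closed form is an illustrative simplification; the cited work uses a string-edit ground distance, which requires solving a transportation problem.

```python
from collections import Counter

def variant_distribution(log):
    """Relative frequency of each trace variant (a trace is a list of labels)."""
    counts = Counter(tuple(t) for t in log)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def unit_cost_emd(original_log, sampled_log):
    """EMD between the two variant distributions under a unit ground distance,
    which equals half the L1 distance between the frequency vectors."""
    p, q = variant_distribution(original_log), variant_distribution(sampled_log)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))
```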
... For example, in [20] event attributes are used to generate hierarchical process models that better represent different levels of process granularity. A statistical pre-processing framework for event logs that reduces the amount of data needed to produce high quality process models is presented in [7]. Similarly, the influence of subset selection on the model quality was examined in [11] where it was shown that, in contrast to random-based selection, strategic subset selection increases the model quality. ...
Chapter
Event logs have become a valuable information source for business process management, e.g., when analysts discover process models to inspect the process behavior and to infer actionable insights. To this end, analysts configure discovery pipelines in which logs are filtered, enriched, abstracted, and process models are derived. While pipeline operations are necessary to manage log imperfections and complexity, they might, however, influence the nature of the discovered process model and its properties. Ultimately, not considering this possibility can negatively affect downstream decision making. We hence propose a framework for assessing the consistency of model properties with respect to the pipeline operations and their parameters, and, if inconsistencies are present, for revealing which parameters contribute to them. Following recent literature on software engineering for machine learning, we refer to it as debugging. From evaluating our framework in a real-world analysis scenario based on complex event logs and third-party pipeline configurations, we see strong evidence towards it being a valuable addition to the process mining toolbox.
... A statistical framework based on information saturation is proposed in [6]. Their approach differs from the probability sampling techniques we propose. ...
Chapter
Full-text available
The aim of a process discovery algorithm is to construct from event data a process model that describes the underlying, real-world process well. Intuitively, the better the quality of the input event data, the better the quality of the resulting discovered model should be. However, existing process discovery algorithms do not guarantee this relationship. We demonstrate this by using a range of quality measures for both event data and discovered process models. This paper is a call to the community of IS engineers to complement their process discovery algorithms with properties that relate qualities of their inputs to those of their outputs. To this end, we distinguish four incremental stages for the development of such algorithms, along with concrete guidelines for the formulation of relevant properties and experimental validation. We use these stages to reflect on the state of the art, which shows the need to move forward in our thinking about algorithmic process discovery.
Article
Process discovery, one of the main branches of process mining, aims to discover a process model that accurately describes the underlying process captured within the event data recorded in an event log. In general, process discovery algorithms return models describing the entire event log. However, this strategy may lead to the discovery of complex, incomprehensible process models that conceal the correct and/or relevant behavior of the underlying process. Processing the entire event log is no longer feasible when dealing with large amounts of events. In this study, we propose the PROMISE⁺ method, which rests on an abstraction involving predictive process mining to generate an event log summary. This summarization step may enable the discovery of simpler process models with higher precision. Experiments with several benchmark event logs and various process discovery algorithms show the effectiveness of the proposed method.
Article
Full-text available
In the field of process discovery, it is worth noting that most process discovery algorithms assume that event logs are clean, i.e., that they do not contain infrequent behaviors. However, real-life event logs often contain infrequent behaviors (i.e., outliers) that lead to quality issues in the discovered process model. On the other hand, driven by recent trends such as big data and process automation, the volume of event data is rapidly increasing: an event log may contain billions of events. Unfortunately, some process mining algorithms and platforms have difficulties handling such event logs. The ever-increasing size of event data and infrequent behaviors in the event log are the two main challenges in the field of process discovery nowadays. However, little research has been conducted on simultaneously filtering infrequent behaviors and decreasing the size of the event log: various filtering methods can remove infrequent behaviors, but the volume of the filtered log remains considerable; conversely, sampling methods can reduce the size of the event log, but the sampled log may still contain infrequent behaviors. Therefore, this paper proposes a technique that simultaneously filters infrequent behaviors and controls the volume of the input log by capturing important behaviors and rating trace variants. Our experiments show that our approach can significantly improve the quality of the discovered process models. Furthermore, our approach can obtain a better process model from 0.001% of the trace variants than from the complete event log and significantly improves the runtime of discovery algorithms.
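As a baseline for the combined filtering-and-size-reduction idea above, the sketch below keeps only the most frequent trace variants up to a coverage threshold. It is a plain frequency cut-off for illustration, not the variant-rating scheme of the cited approach; the function name and the coverage parameter are assumptions.

```python
from collections import Counter

def keep_frequent_variants(log, coverage=0.95):
    """Keep the most frequent trace variants that together account for
    `coverage` of all traces; rarer variants are treated as infrequent
    behaviour and dropped."""
    counts = Counter(tuple(t) for t in log).most_common()
    total, kept, covered = len(log), [], 0
    for variant, c in counts:
        kept.append(list(variant))
        covered += c
        if covered / total >= coverage:
            break
    return kept  # one representative trace per retained variant
```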
Chapter
Due to growing digital opportunities, persistent legislative pressure, and recent challenges in the wake of the COVID-19 pandemic, public universities need to engage in digital innovation (DI). While society expects universities to lead DI efforts, the successful development and implementation of DIs, particularly in administration and management contexts, remains a challenge. In addition, research lacks knowledge on the DI process at public universities, while further understanding and guidance are needed. Against this backdrop, our study aims to enhance the understanding of the DI process at public universities by providing a structured overview of corresponding drivers and barriers through an exploratory single case study. We investigate the case of a German public university and draw from primary and secondary data of its DI process from the development of three specific digital process innovations. Building upon Business Process Management (BPM) as a theoretical lens to study the DI process, we present 13 drivers and 17 barriers structured along the DI actions and BPM core elements. We discuss corresponding findings and provide related practice recommendations for public universities that aim to engage in DI. In sum, our study contributes to the explanatory knowledge at the convergent interface between DI and BPM in the context of public universities.
Chapter
Business Process Mining (BPM) has become an essential tool in internal audit (IA), which helps auditors analyze potential risks in clients’ core business processes. After finishing the risk analysis task for the target business process with BPM, auditors need to sample a small set of representative process cases from event log, based on which clients will verify the analysis results and analyze the triggers for the risks in the target business process. This process case sampling (PCS) step is important because it is difficult to check each single case from a large event log. Therefore, the quality of the set of case samples (SCS) from PCS is regarded as one of the success factors in IA project. Manual PCS and simple random PCS are two basic methods for executing PCS. However, both methods cannot assure the quality of the generated SCS. In this paper, we propose an advanced PCS method. It first defines the risk of process cases as well as the factors that affect the quality of SCS, before dynamically optimizing the quality of SCS during PCS. Our experimental evaluation highlights that our approach yields higher quality SCS than manual PCS and simple random PCS.
Conference Paper
Full-text available
Although our capabilities to store and process data have been increasing exponentially since the 1960s, suddenly many organizations realize that survival is not possible without exploiting available data intelligently. Out of the blue, "Big Data" has become a topic in board-level discussions. The abundance of data will change many jobs across all industries. Moreover, scientific research is becoming more…
Article
Full-text available
The aim of process discovery, originating from the area of process mining, is to discover a process model based on business process execution data. The majority of process discovery techniques rely on an event log as input. An event log is a static source of historical data capturing the execution of a business process. In this paper we focus on process discovery relying on online streams of business process execution events. Learning process models from event streams poses both challenges and opportunities, i.e., we need to handle unlimited amounts of data using finite memory and, preferably, constant time. We propose a generic architecture that allows for adopting several classes of existing process discovery techniques in the context of event streams. Moreover, we provide several instantiations of the architecture, accompanied by implementations in the process mining tool-kit ProM (http://promtools.org). Using these instantiations, we evaluate several dimensions of stream-based process discovery. The evaluation shows that the proposed architecture allows us to lift process discovery to the streaming domain.
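To make the finite-memory constraint concrete, the sketch below maintains directly-follows counts from a stream of (case_id, activity) events while storing only the last activity of the most recently seen cases. It illustrates the general streaming pattern under assumed names and an assumed eviction policy, not one of the ProM instantiations described in the paper.

```python
from collections import Counter, OrderedDict

class StreamingDFG:
    """Maintain directly-follows counts from an event stream with bounded
    memory: only the last activity of the `max_cases` most recently seen
    cases is kept; older cases are evicted."""

    def __init__(self, max_cases=10_000):
        self.max_cases = max_cases
        self.last_activity = OrderedDict()  # case_id -> last observed activity
        self.dfg = Counter()                # (a, b) -> directly-follows count

    def observe(self, case_id, activity):
        prev = self.last_activity.pop(case_id, None)
        if prev is not None:
            self.dfg[(prev, activity)] += 1
        self.last_activity[case_id] = activity      # case becomes most recent
        if len(self.last_activity) > self.max_cases:
            self.last_activity.popitem(last=False)  # evict least recent case
```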
Article
Process discovery aims at constructing a model from a set of observations given by execution traces (a log). Petri nets are a preferred target model in that they produce a compact description of the system by exhibiting its concurrency. This article presents a process discovery algorithm using Petri net synthesis, based on the notion of region introduced by A. Ehrenfeucht and G. Rozenberg and on techniques from linear algebra. The algorithm proceeds in three successive phases which make it possible to find a compromise between inferring behaviours of the system from the set of observations and ensuring a parsimonious model, in terms of fitness, precision and simplicity. All algorithms used are incremental, which means that the produced model can be modified when new observations are reported without reconstructing it from scratch.
Conference Paper
Analysing the performance of business processes is an important vehicle to improve their operation. Specifically, an accurate assessment of sojourn times and remaining times enables bottleneck analysis and resource planning. Recently, methods to create respective performance models from event logs have been proposed. These works are severely limited, though: they either consider control-flow and performance information separately, or rely on an ad-hoc selection of temporal relations between events. In this paper, we introduce the Temporal Network Representation (TNR) of a log, based on Allen's interval algebra, as a complete temporal representation of a log, which enables simultaneous discovery of control-flow and performance information. We demonstrate the usefulness of the TNR for detecting (unrecorded) delays and for probabilistic mining of variants when modelling the performance of a process. In order to compare different models from the performance perspective, we develop a framework for measuring performance fitness. Under this framework, we provide guarantees that TNR-based process discovery dominates existing techniques in measuring performance characteristics of a process. To illustrate the practical value of the TNR, we evaluate the approach against three real-life datasets. Our experiments show that the TNR yields an improvement in performance fitness over state-of-the-art algorithms.
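For readers unfamiliar with Allen's interval algebra referenced above, the helper below classifies the basic relation between two activity intervals given their start and end timestamps. It is an illustrative building block under assumed names, not the TNR construction of the paper.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Basic Allen relation of interval A with respect to interval B for the
    seven 'forward' relations; the six inverse relations fall through to the
    final case. Timestamps can be any comparable values."""
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_start > b_start and a_end == b_end:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start and a_end < b_end:
        return "overlaps"
    return "inverse relation (e.g. after, met-by, contains, overlapped-by)"
```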
Article
Process mining methods allow analysts to exploit logs of historical executions of business processes in order to extract insights regarding the actual performance of these processes. One of the most widely studied process mining operations is automated process discovery. An automated process discovery method takes as input an event log, and produces as output a business process model that captures the control-flow relations between tasks that are observed in or implied by the event log. Several dozen automated process discovery methods have been proposed in the past two decades, striking different trade-offs between scalability, accuracy and complexity of the resulting models. So far, automated process discovery methods have been evaluated in an ad hoc manner, with different authors employing different datasets, experimental setups, evaluation measures and baselines, often leading to incomparable conclusions and sometimes unreproducible results due to the use of non-publicly available datasets. In this setting, this article provides a systematic review of automated process discovery methods and a systematic comparative evaluation of existing implementations of these methods using an open-source benchmark covering nine publicly available real-life event logs and eight quality metrics. The review and evaluation results highlight gaps and unexplored trade-offs in the field, including the lack of scalability of several proposals in the field and a strong divergence in the performance of different methods with respect to different quality metrics. The proposed benchmark allows researchers to empirically compare new automated process discovery methods against existing ones in a unified setting.
Book
This is the second edition of Wil van der Aalst's seminal book on process mining, which now also discusses the field in the broader context of data science and big data approaches. It includes several additions and updates, e.g. on inductive mining techniques, the notion of alignments, a considerably expanded section on software tools and a completely new chapter on process mining in the large. It is self-contained, while at the same time covering the entire process-mining spectrum from process discovery to predictive analytics. After a general introduction to data science and process mining in Part I, Part II provides the basics of business process modeling and data mining necessary to understand the remainder of the book. Next, Part III focuses on process discovery as the most important process mining task, while Part IV moves beyond discovering the control flow of processes, highlighting conformance checking, and organizational and time perspectives. Part V offers a guide to successfully applying process mining in practice, including an introduction to the widely used open-source tool ProM and several commercial products. Lastly, Part VI takes a step back, reflecting on the material presented and the key open challenges. Overall, this book provides a comprehensive overview of the state of the art in process mining. It is intended for business process analysts, business consultants, process managers, graduate students, and BPM researchers.
Conference Paper
Understanding the performance of business processes is an important part of any business process intelligence project. From historical information recorded in event logs, performance can be measured and visualized on a discovered process model. The accuracy of the measured performance, e.g., waiting time, thereby greatly depends on (1) the availability of start and completion events for activities in the event log, i.e., transactional information, and (2) the ability to differentiate between subtle control-flow aspects, e.g., concurrent and interleaved execution. Current process discovery algorithms either do not use activity life cycle information in a systematic way or cannot distinguish subtle control-flow aspects, leading to less accurate performance measurements. In this paper, we investigate the automatic discovery of process models from event logs, such that performance can be measured more accurately. We discuss ways of systematically treating life cycle information in process discovery and their implications. We introduce a process discovery technique that is able to handle life cycle data and that distinguishes concurrency and interleaving. Finally, we show that it can discover models and reliable performance information from event logs alone.
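To show how the start/complete life cycle information mentioned above supports performance measurement, the sketch below derives per-activity service times from transactional events. Event layout and function name are assumptions of this sketch, and it simplifies by allowing at most one open instance of an activity per case.

```python
from collections import defaultdict

def service_times(events):
    """Per-activity service times from events given as tuples
    (case_id, activity, lifecycle, timestamp) with lifecycle in
    {'start', 'complete'}. Assumes at most one open instance of an
    activity per case at any time."""
    open_starts = {}               # (case_id, activity) -> start timestamp
    durations = defaultdict(list)  # activity -> list of observed durations
    for case_id, activity, lifecycle, ts in sorted(events, key=lambda e: e[3]):
        key = (case_id, activity)
        if lifecycle == "start":
            open_starts[key] = ts
        elif lifecycle == "complete" and key in open_starts:
            durations[activity].append(ts - open_starts.pop(key))
    return durations
```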
Conference Paper
Scalability is a major challenge for existing behavioral log analysis algorithms, which extract finite-state automaton models or temporal properties from logs generated by running systems. In this paper we present statistical log analysis, which addresses scalability using statistical tools. The key to our approach is to consider behavioral log analysis as a statistical experiment. Rather than analyzing the entire log, we suggest analyzing only a sample of traces from the log and, most importantly, providing means to compute statistical guarantees for the correctness of the analysis result. We present the theoretical foundations of our approach and describe two example applications, to the classic k-Tails algorithm and to the recently presented BEAR algorithm. Finally, based on experiments with logs generated from real-world models and with real-world logs provided to us by our industrial partners, we present extensive evidence for the need for scalable log analysis and for the effectiveness of statistical log analysis.
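To illustrate the flavour of such statistical guarantees, the sketch below computes a classic sample-size bound: if a trace property actually fails on more than an epsilon fraction of the log, the chance of it holding on all n independently sampled traces is at most (1 - epsilon)^n <= exp(-epsilon n). This is a textbook bound given for intuition, not the specific guarantees derived in the cited paper.

```python
import math

def required_sample_size(epsilon, delta):
    """Smallest n such that exp(-epsilon * n) <= delta, i.e. a property
    violated by more than an `epsilon` fraction of traces survives all
    n samples with probability at most `delta`."""
    return math.ceil(math.log(1.0 / delta) / epsilon)

# Example: tolerate at most 1% violating traces with 99% confidence.
print(required_sample_size(0.01, 0.01))  # -> 461
```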