SAP Signavio Academic Models: A Large Process Model Dataset
Abstract
In this paper, we introduce the SAP Signavio Academic Models (SAP-SAM) dataset, a collection of hundreds of thousands of business models, mainly process models in BPMN notation. The model collection is a subset of the models that were created over the course of roughly a decade on academic.signavio.com, a free-of-charge software-as-a-service platform that researchers, teachers, and students can use to create business (process) models. We provide a preliminary analysis of the model collection, as well as recommendations on how to work with it. In addition, we discuss potential use cases and limitations of the model collection from academic and industry perspectives.
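A minimal sketch of working with such a model collection, assuming models are distributed as JSON files in a Signavio-style format (the field names `childShapes`, `stencil`, and `properties` below are a simplified assumption; consult the dataset's own documentation for the authoritative schema):

```python
# Hypothetical excerpt of a Signavio-style JSON process model; the schema
# shown here is an illustrative assumption, not the official SAP-SAM format.
sample_model = {
    "childShapes": [
        {"stencil": {"id": "StartNoneEvent"}, "properties": {"name": ""}},
        {"stencil": {"id": "Task"}, "properties": {"name": "Check invoice"}},
        {"stencil": {"id": "Task"}, "properties": {"name": "Approve payment"}},
        {"stencil": {"id": "EndNoneEvent"}, "properties": {"name": ""}},
    ]
}

def task_labels(model: dict) -> list:
    """Collect the names of all Task shapes in a model."""
    return [
        shape["properties"].get("name", "")
        for shape in model.get("childShapes", [])
        if shape["stencil"]["id"] == "Task"
    ]

print(task_labels(sample_model))  # ['Check invoice', 'Approve payment']
```

The same traversal can serve as a starting point for the kind of preliminary analysis described above, e.g., counting element types or filtering models by size.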
... The Business Process Management Academic Initiative (BPMAI) [31] and SAP Signavio Academic Models (SAP-SAM) [32] datasets are the most representative collections of process models publicly available. Both are collections of models created over the course of roughly a decade on a platform that researchers, teachers, and students could use to create business (process) models. ...
... Leveraging larger training datasets, such as the SAP-SAM dataset [32], can potentially reduce overfitting risks when training general KGE models for activity recommendation tasks. Furthermore, investigating alternative translation methods, including inverse relationships such as precededBy and behavioural relationships such as indirectFollowedBy, is an important direction for future research. ...
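The alternative relations named above can be derived mechanically from a base followedBy relation. A hedged sketch (the triple representation is an illustrative assumption, not the paper's exact translation):

```python
# Base relation: (a, followedBy, b) triples extracted from a process model.
followed_by = {("Receive order", "Check stock"), ("Check stock", "Ship goods")}

# Inverse relation: b precededBy a whenever a followedBy b.
preceded_by = {(b, a) for (a, b) in followed_by}

# Behavioural relation: a indirectFollowedBy c whenever a path a -> ... -> c
# exists, i.e., the transitive closure of followedBy.
def transitive_closure(pairs):
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

indirect = transitive_closure(followed_by)
print(("Receive order", "Ship goods") in indirect)  # True
```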
Activity recommendation is an approach to assist process modelers by recommending suitable activities to be inserted at a user-defined position. In this paper, we suggest approaching activity recommendation as a knowledge graph completion task. We convert business process models into knowledge graphs through various translation methods and apply embedding- and rule-based knowledge graph completion techniques to the translated models. Our experimental evaluation reveals that generic knowledge graph completion methods do not perform well on the given task. They lack the flexibility to capture complex regularities that can be learned using a rule-based approach specifically designed for activity recommendation.
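As a generic illustration of frequency-based recommendation over a model repository (not the paper's rule-based approach, which is more expressive), one can recommend the activities that most often directly follow a given activity:

```python
from collections import Counter

# Toy repository of traces; labels are invented for illustration.
repository = [
    ["Receive order", "Check stock", "Ship goods"],
    ["Receive order", "Check stock", "Cancel order"],
    ["Receive order", "Check stock", "Ship goods"],
]

def recommend(activity: str, traces, k: int = 2):
    """Return the k activities that most frequently follow `activity`."""
    successors = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            if a == activity:
                successors[b] += 1
    return [label for label, _ in successors.most_common(k)]

print(recommend("Check stock", repository))  # ['Ship goods', 'Cancel order']
```

A rule-based approach as described in the paper can capture regularities beyond such direct-successor counts, e.g., conditions on the wider context of the insertion point.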
... Creation Although several datasets of process models [27,14], model-text pairs [20,5,19,23], and model-image pairs [25,1] exist, none contains multimodal process documentation and respective ground truths. Therefore, we augment a subset of the SAP-SAM dataset, which contains around a million process models created by researchers, teachers, and students [27]. The SAP-SAM dataset is filtered for high-quality and representative models. ...
This paper presents an investigation of the capabilities of Generative Pre-trained Transformers (GPTs) to auto-generate graphical process models from multi-modal (i.e., text- and image-based) inputs. More precisely, we first introduce a small dataset as well as a set of evaluation metrics that allow for a ground truth-based evaluation of multi-modal process model generation capabilities. We then conduct an initial evaluation of commercial GPT capabilities using zero-, one-, and few-shot prompting strategies. Our results indicate that GPTs can be useful tools for semi-automated process modeling based on multi-modal inputs. More importantly, the dataset and evaluation metrics as well as the open-source evaluation code provide a structured framework for continued systematic evaluations moving forward.
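Few-shot prompting as used above can be sketched as plain prompt assembly. The instruction wording and the textual model serialization below are illustrative assumptions, not the paper's exact setup:

```python
# One (description, serialized model) pair acting as the "shot".
examples = [
    ("A clerk receives an order and then checks stock.",
     "Receive order -> Check stock"),
]

def build_prompt(description: str, shots) -> str:
    """Assemble a few-shot prompt for text-to-process-model generation."""
    parts = ["Convert the process description into a sequence of activities."]
    for text, model in shots:
        parts.append(f"Description: {text}\nModel: {model}")
    parts.append(f"Description: {description}\nModel:")
    return "\n\n".join(parts)

prompt = build_prompt("Goods are shipped after payment is approved.", examples)
print(prompt)
```

Zero-shot prompting corresponds to `shots=[]`; a few-shot prompt simply appends more example pairs before the query.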
... Datasets. We gathered real-world business process models from SAP-SAM [17], focusing on two strategies: mirroring SAP-SAM's task distribution and sourcing models from diverse fields like industry and finance. We then manually annotated the corresponding business process documentation. ...
... These datasets enable software engineers to leverage ML and data-driven techniques to automate and/or improve software development tasks. Within the process modeling community, there have been efforts to curate datasets containing various types of process models [8,19,22]. The sub-discipline of process mining is also heavily engaged in the creation and use of publicly available datasets (see [9]). ...
The conceptual modeling community and its subdivisions, such as enterprise modeling, are increasingly investigating the potential of applying artificial intelligence, in particular machine learning (ML), to tasks like model creation, model analysis, and model processing. A prerequisite for conducting research involving ML, and currently a limiting factor for the community, is the availability of open models of adequate quality and quantity. With the paper at hand, we aim to tackle this limitation by introducing EA ModelSet, a curated and FAIR repository of enterprise architecture models that can be used by the community. We report on our efforts in building this dataset and elaborate on the possibilities of conducting ML-based modeling research with it. We hope this paper sparks a community effort toward the development of a FAIR, large model set that enables ML research with conceptual models.
Activity recommendation in business process modeling is concerned with suggesting suitable labels for a new activity inserted by a modeler in a process model under development. Recently, it has been proposed to represent process model repositories as knowledge graphs, which makes it possible to address the activity-recommendation problem as a knowledge graph completion task. However, existing recommendation approaches are entirely dependent on the knowledge contained in the model repository used for training. This makes them rigid in general and even inapplicable in situations where a process model consists of unseen activities, which were not part of the repository used for training. In this paper, we avoid these issues by recognizing that the semantics contained in process models can be used to instead pose the activity-recommendation problem as a set of textual sequence-to-sequence tasks. This enables the application of transfer-learning techniques from natural language processing, which allows for recommendations that go beyond the activities contained in an available repository. We operationalize this with an activity-recommendation approach that employs a pre-trained language model at its core, and uses the representations of process knowledge as structured graphs combined with the natural-language-based semantics of process models. In an experimental evaluation, we show that our approach considerably outperforms the state of the art in terms of semantic accuracy of the recommendations and that it is able to recommend and handle activity labels that go beyond the vocabulary of the model repository used during training. Keywords: activity recommendation, process models, semantic process analysis, language models, sequence-to-sequence models
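Posing activity recommendation as a textual sequence-to-sequence task requires linearizing the graph context around the insertion point into text. A minimal sketch (the encoding below is an illustrative assumption, not the paper's exact serialization):

```python
def linearize(predecessors, successors):
    """Serialize the local graph context of an insertion point into a textual
    input for a pre-trained sequence-to-sequence model."""
    pred = " | ".join(predecessors) if predecessors else "none"
    succ = " | ".join(successors) if successors else "none"
    return f"predecessors: {pred} ; successors: {succ} ; recommend:"

seq = linearize(["Check invoice"], ["Archive invoice"])
print(seq)
# predecessors: Check invoice ; successors: Archive invoice ; recommend:
```

Because the model consumes and emits free text, it can produce activity labels that never occurred in the training repository, which is the key advantage over purely repository-bound approaches.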
Streaming process mining refers to the set of techniques and tools that process a stream of data (as opposed to a finite event log). The goal of these techniques, like that of their counterparts described in the previous chapters, is to extract relevant information concerning the running processes. This chapter presents an overview of the problems related to the processing of streams, as well as a categorization of the existing solutions. Details about control-flow discovery and conformance checking techniques are also presented, together with a brief overview of the state of the art.
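A common building block in streaming control-flow discovery is maintaining directly-follows counts under a bounded memory budget, e.g., with a Lossy Counting-style periodic cleanup. A hedged sketch (parameter names and the cleanup policy are illustrative):

```python
class LossyDFCounter:
    """Approximate directly-follows counting over an event stream."""

    def __init__(self, bucket_width: int = 100):
        self.bucket_width = bucket_width
        self.counts = {}           # (a, b) -> approximate frequency
        self.last_activity = {}    # case id -> last seen activity
        self.observed = 0

    def observe(self, case_id, activity):
        prev = self.last_activity.get(case_id)
        if prev is not None:
            pair = (prev, activity)
            self.counts[pair] = self.counts.get(pair, 0) + 1
        self.last_activity[case_id] = activity
        self.observed += 1
        # Periodic cleanup: decrement all counts and drop infrequent pairs,
        # bounding memory at the price of approximation (Lossy Counting idea).
        if self.observed % self.bucket_width == 0:
            self.counts = {p: c - 1 for p, c in self.counts.items() if c > 1}

stream = [("c1", "a"), ("c2", "a"), ("c1", "b"), ("c2", "b"), ("c1", "c")]
counter = LossyDFCounter()
for case_id, activity in stream:
    counter.observe(case_id, activity)
print(counter.counts)  # {('a', 'b'): 2, ('b', 'c'): 1}
```

In a full streaming discovery pipeline, the `last_activity` map would also need eviction (cases never explicitly end on a stream), which is one of the problems the chapter categorizes.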
User interaction logs allow us to analyze the execution of tasks in a business process at a finer level of granularity than event logs extracted from enterprise systems. The fine-grained nature of user interaction logs opens up a number of use cases. For example, by analyzing such logs, we can identify best practices for executing a given task in a process, or we can elicit differences in performance between workers or between teams. Furthermore, user interaction logs allow us to discover repetitive and automatable routines that occur during the execution of one or more tasks in a process. Along this line, this chapter introduces a family of techniques, called Robotic Process Mining (RPM), which allow us to discover repetitive routines that can be automated using robotic process automation technology. The chapter presents a structured landscape of concepts and techniques for RPM, including techniques for user interaction log preprocessing, techniques for discovering frequent routines, notions of routine automatability, as well as techniques for synthesizing executable routine specifications for robotic process automation.
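A deliberately simple stand-in for frequent-routine discovery is counting repeated n-grams of user interactions; real RPM techniques are considerably more sophisticated, but the sketch conveys the core idea:

```python
from collections import Counter

def frequent_ngrams(log, n: int = 2, min_support: int = 2):
    """Return interaction n-grams occurring at least min_support times,
    as candidate repetitive routines."""
    counts = Counter(tuple(log[i:i + n]) for i in range(len(log) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_support}

# Toy user interaction log (labels invented for illustration).
ui_log = ["open form", "copy cell", "paste field", "copy cell", "paste field"]
print(frequent_ngrams(ui_log))  # {('copy cell', 'paste field'): 2}
```

Candidates found this way would still need an automatability assessment (e.g., is the routine deterministic given its inputs?) before synthesizing an executable RPA routine, mirroring the pipeline the chapter lays out.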
Predictive Process Monitoring [29] is a branch of process mining that aims at predicting the future of an ongoing (uncompleted) process execution. Typical prediction targets for an ongoing trace include the outcome of the process execution, its completion time, and the sequence of its future activities.
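As an illustrative baseline (not a specific method from the literature), the next activity of an ongoing trace can be predicted from the most frequent continuation of the same prefix in historical traces:

```python
from collections import Counter

def predict_next(prefix, history):
    """Predict the next activity as the most frequent continuation of
    `prefix` among completed historical traces."""
    continuations = Counter(
        trace[len(prefix)]
        for trace in history
        if len(trace) > len(prefix) and trace[:len(prefix)] == prefix
    )
    return continuations.most_common(1)[0][0] if continuations else None

history = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
print(predict_next(["a", "b"], history))  # c
```

Practical predictive monitoring approaches replace this exact-prefix lookup with learned models (e.g., recurrent networks) that generalize to unseen prefixes.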
Classical process mining relies on the notion of a unique case identifier, which is used to partition event data into independent sequences of events. In this chapter, we study the shortcomings of this approach for event data over multiple entities. We introduce event knowledge graphs as a data structure that naturally models behavior over multiple entities as a network of events. We explore how to construct, query, and aggregate event knowledge graphs to gain insights into complex behaviors. We ultimately show that event knowledge graphs are a very versatile tool that opens the door to process mining analyses in multiple behavioral dimensions at once.
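A minimal in-memory rendering of the construction idea, assuming each event lists the entities it belongs to: each entity contributes its own directly-follows ("df") edges between the events it participates in, so one event can sit on several behavioral paths at once. (The dictionary schema below is an illustrative assumption.)

```python
# Events in temporal order; each event references the entities it involves.
events = [
    {"id": "e1", "activity": "Create order", "entities": {"order": "o1"}},
    {"id": "e2", "activity": "Create invoice",
     "entities": {"order": "o1", "invoice": "i1"}},
    {"id": "e3", "activity": "Pay invoice", "entities": {"invoice": "i1"}},
]

def df_edges(events):
    """Build per-entity directly-follows edges between event nodes."""
    last_event = {}   # (entity type, entity id) -> previous event id
    edges = []        # (source event, "df", entity type, target event)
    for event in events:
        for etype, eid in event["entities"].items():
            key = (etype, eid)
            if key in last_event:
                edges.append((last_event[key], "df", etype, event["id"]))
            last_event[key] = event["id"]
    return edges

print(df_edges(events))
# [('e1', 'df', 'order', 'e2'), ('e2', 'df', 'invoice', 'e3')]
```

Note that event e2 is connected along both the order and the invoice dimension, which a single case identifier cannot express.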
Process mining techniques can help organizations to improve their operational processes. Organizations can benefit from process mining techniques in finding and amending the root causes of performance or compliance problems. Considering the volume of the data and the number of features captured by the information systems of today’s companies, the task of discovering the set of features that should be considered in causal analysis can be quite involved. In this paper, we propose a method for finding the set of (aggregated) features with a possible causal effect on the problem. The causal analysis task is usually done by applying a machine learning technique to the data gathered from the information system supporting the processes. To prevent mixing up correlation and causation, which may happen because of interpreting the findings of machine learning techniques as causal, we propose a method for discovering the structural equation model of the process that can be used for causal analysis. We have implemented the proposed method as a plugin in ProM, and we have evaluated it using real and synthetic event logs. These experiments show the validity and effectiveness of the proposed methods.
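A toy illustration of the estimation step inside a linear structural equation model: the coefficient of a single equation Y = aX + noise is estimated by ordinary least squares. Crucially, this estimate only carries a causal reading under an assumed causal structure, which is exactly why the paper discovers the structural equation model first. (Data and names below are invented for illustration.)

```python
def ols_slope(xs, ys):
    """Ordinary least squares slope: cov(X, Y) / var(X)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

xs = [1.0, 2.0, 3.0, 4.0]      # e.g., a process feature (queue length)
ys = [2.1, 3.9, 6.2, 7.8]      # e.g., observed waiting time, roughly y = 2x
print(round(ols_slope(xs, ys), 2))  # 1.94
```

Interpreting such a slope as "X causes Y" without the structural model would conflate correlation with causation, e.g., when a confounder drives both variables.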
Background
Command centres have been piloted in some hospitals across the developed world in the last few years. Their impact on patient safety, however, has not been systematically studied. Hence, we aimed to investigate this.
Methods
This is a retrospective population-based cohort study. Participants were patients who visited Bradford Royal Infirmary Hospital and Calderdale & Huddersfield hospitals between 1 January 2018 and 31 August 2021. A five-phase, interrupted time series, linear regression analysis was used.
Results
After the introduction of a Command Centre, while mortality and readmissions marginally improved, there was no statistically significant impact on postoperative sepsis. In the intervention hospital, when compared with the preintervention period, mortality decreased by 1.4% (95% CI 0.8% to 1.9%), 1.5% (95% CI 0.9% to 2.1%), 1.3% (95% CI 0.7% to 1.8%) and 2.5% (95% CI 1.7% to 3.4%) during successive phases of the command centre programme, including roll-in and activation of the technology and preparatory quality improvement work. However, in the control site, compared with the baseline, the weekly mortality also decreased by 2.0% (95% CI 0.9 to 3.1), 2.3% (95% CI 1.1 to 3.5), 1.3% (95% CI 0.2 to 2.4), 3.1% (95% CI 1.4 to 4.8) for the respective intervention phases. No impact on any of the indicators was observed when only the software technology part of the Command Centre was considered.
Conclusion
Implementation of a hospital Command Centre may have a marginal positive impact on patient safety when implemented as part of a broader hospital-wide improvement programme including colocation of operations and clinical leads in a central location. However, improvement in patient safety indicators was also observed for a comparable period in the control site. Further evaluative research into the impact of hospital command centres on a broader range of patient safety and other outcomes is warranted.
As big data becomes ubiquitous across domains, and more and more stakeholders aspire to make the most of their data, demand for machine learning tools has spurred researchers to explore the possibilities of automated machine learning (AutoML). AutoML tools aim to make machine learning accessible for non-machine learning experts (domain experts), to improve the efficiency of machine learning, and to accelerate machine learning research. But although automation and efficiency are among AutoML’s main selling points, the process still requires human involvement at a number of vital steps, including understanding the attributes of domain-specific data, defining prediction problems, creating a suitable training dataset, and selecting a promising machine learning technique. These steps often require a prolonged back-and-forth that makes this process inefficient for domain experts and data scientists alike and keeps so-called AutoML systems from being truly automatic. In this review article, we introduce a new classification system for AutoML systems, using a seven-tiered schematic to distinguish these systems based on their level of autonomy. We begin by describing what an end-to-end machine learning pipeline actually looks like, and which subtasks of the machine learning pipeline have been automated so far. We highlight those subtasks that are still done manually—generally by a data scientist—and explain how this limits domain experts’ access to machine learning. Next, we introduce our novel level-based taxonomy for AutoML systems and define each level according to the scope of automation support provided. Finally, we lay out a roadmap for the future, pinpointing the research required to further automate the end-to-end machine learning pipeline and discussing important challenges that stand in the way of this ambitious goal.