Analyzing Medical Data with Process Mining:
a COVID-19 Case Study
Marco Pegoraro 1, Madhavi Bangalore Shankara Narayana 1,
Elisabetta Benevento 1,3, Wil M.P. van der Aalst 1, Lukas Martin 2, and
Gernot Marx 2
1 Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Aachen, Germany
{pegoraro, madhavi.shankar, benevento, vwdaalst}
2 Department of Intensive Care and Intermediate Care,
RWTH Aachen University Hospital, Aachen, Germany
{lmartin, gmarx}
3 Department of Energy, Systems, Territory and Construction Engineering, University of Pisa, Pisa, Italy
The recent increase in the availability of medical data, made possible by the automation
and digitization of medical equipment, has enabled more accurate and complete analysis of
patients’ medical data through many branches of data science. In particular, medical records
that include timestamps showing the history of a patient have enabled the representation of
medical information as sequences of events, effectively making process mining analyses
possible. In this paper, we present some preliminary findings obtained with established
process mining techniques on the medical data of patients of the Uniklinik Aachen hospital
affected by the recent epidemic of COVID-19. We show that process mining techniques are able
to reconstruct a model of the ICU treatments for COVID patients.
Keywords: Process Mining · Healthcare · COVID-19.
This work is licensed under a Creative Commons “Attribution-NonCommercial 4.0 International” license.
©the authors. Some rights reserved.
This document is an Author Accepted Manuscript (AAM) corresponding to the following scholarly paper:
Pegoraro, Marco, Madhavi Bangalore Shankara Narayana, Elisabetta Benevento, Wil M. P. van der Aalst, Lukas Martin,
and Gernot Marx. “Analyzing Medical Data with Process Mining: a COVID-19 Case Study”. In: Business Information
Systems Workshops. Ed. by Abramowicz, Witold, Sören Auer, and Milena Stróżyna. Springer, 2022, pp. 39–44.
Please, cite this document as shown above.
Publication chronology:
2021-05-01: full text submitted to the Workshop on Applications of Knowledge-Based Technologies in Business, work-in-progress track
2021-05-14: notification of acceptance
2021-05-19: camera-ready version submitted
2021-06-15: presented
2022-04-06: proceedings published
The published version referred to above is © Springer.
Correspondence to:
Marco Pegoraro, Chair of Process and Data Science (PADS), Department of Computer Science,
RWTH Aachen University, Ahornstr. 55, 52074 Aachen, Germany
Website: ·Email: ·ORCID: 0000-0002-8997-7517
Content: 9 pages, 5 figures, 11 references. Typeset with pdfLaTeX, Biber and BibLaTeX.
Please do not print this document unless strictly necessary.
M. Pegoraro et al. Analyzing COVID-19 Data with Process Mining
1 Introduction
The widespread adoption of Hospital Information Systems (HISs) and Electronic Health
Records (EHRs), together with recent Information Technology (IT) advancements such as
cloud platforms, smart technologies, and wearable sensors, is allowing hospitals to measure
and record an ever-growing volume and variety of patient- and process-related data [7]. This
trend is making the most innovative and advanced data-driven techniques more applicable to
the process analysis and improvement of healthcare organizations [5]. In particular, process
mining has emerged as a suitable approach to analyze, discover, improve, and manage complex
real-life processes by extracting knowledge from event logs [1]. Indeed, healthcare processes
are recognized to be complex, flexible, multidisciplinary, and ad hoc; thus, they are
difficult to manage and analyze with traditional model-driven techniques [9]. Process mining
is widely used to devise insightful models describing the flow from different perspectives,
e.g., control-flow, data, performance, and organizational.

Because it is both highly contagious and deadly, COVID-19 has been the subject of intense
research efforts by a large part of the international research community. Data scientists
have taken part in this scientific work, and a great number of articles have now been
published on the analysis of medical and logistic information related to COVID-19. In terms
of raw data, numerous openly accessible datasets exist, and efforts are ongoing to catalog
and unify them [6]. A wealth of data analytics approaches is now available for descriptive,
predictive, and prescriptive analytics, with objectives such as measuring the effectiveness
of early response [8], inferring the speed and extent of infections [2,10], and predicting
diagnosis and prognosis [11]. However, the process perspective of datasets related to the
COVID-19 pandemic has, thus far, received little attention from the scientific community.

The aim of this work-in-progress paper is to exploit process mining techniques to model and
analyze the care process for COVID-19 patients treated at the Intensive Care Unit (ICU) ward
of the Uniklinik Aachen hospital in Germany. To do so, we use a real-life dataset extracted
from the ICU information system. In more detail, we discover the patient-flows for COVID-19
patients, extract useful insights into resource consumption, compare the process models
based on data from the two COVID waves, and analyze their performance. The analysis was
carried out in collaboration with the ICU medical staff.

The remainder of the paper is structured as follows. Section 2 describes the COVID-19 event
log that is the subject of our analysis. Section 3 reports insights from preliminary process
mining analysis results. Lastly, Section 4 concludes the paper and describes our roadmap for
future work.
Figure 1: Dotted chart of the COVAS event log. Every dot corresponds to an event recorded in the log; the
cases with Acute Respiratory Distress Syndrome (ARDS) are colored in pink, while cases with no ARDS are
colored in green. The two “waves” of the virus are clearly distinguishable.
2 Dataset Description
The dataset that is the subject of our study records information about COVID-19 patients
monitored in the context of the COVID-19 Aachen Study (COVAS). The log contains event
information regarding COVID-19 patients admitted to the Uniklinik Aachen hospital between
February 2020 and December 2020. The dataset includes 216 cases, of which 196 are complete
cases (for which the patient has been discharged either dead or alive) and 20 are ongoing
cases (partial process traces) under treatment in the COVID unit at the time of exporting
the data. The dataset records 1645 events in total, resulting in an average of 7.6 events
recorded per admission. The cases recorded in the log belong to 65 different variants, with
distinct event flows. The events are labeled with the executed activity; the log includes 14
distinct activities. Figure 1 shows a dotted chart of the event log.
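
The summary statistics quoted in this section (cases, events, events per case, variants, distinct activities) can all be derived from a flat event table. The following is a minimal sketch in plain Python, assuming a hypothetical list of `(case_id, activity, timestamp)` tuples. The function name `log_statistics`, the toy rows, and the `Discharge` activity label are illustrative assumptions and are not taken from the COVAS log.

```python
from collections import defaultdict

def log_statistics(events):
    """Summarize a flat event log given as (case_id, activity, timestamp) tuples."""
    traces = defaultdict(list)
    for case_id, activity, timestamp in events:
        traces[case_id].append((timestamp, activity))

    variants = set()
    activities = set()
    n_events = 0
    for rows in traces.values():
        rows.sort(key=lambda row: row[0])        # order each case by timestamp
        sequence = tuple(activity for _, activity in rows)
        variants.add(sequence)                   # a variant is a distinct activity sequence
        activities.update(sequence)
        n_events += len(sequence)

    n_cases = len(traces)
    return {
        "cases": n_cases,
        "events": n_events,
        "events_per_case": n_events / n_cases if n_cases else 0.0,
        "variants": len(variants),
        "activities": len(activities),
    }

# Toy log (hypothetical rows, not COVAS data); ISO date strings sort chronologically.
toy_log = [
    ("p1", "Hospitalization", "2020-03-01"),
    ("p1", "ICUadmission", "2020-03-03"),
    ("p1", "Discharge", "2020-03-20"),
    ("p2", "Hospitalization", "2020-03-05"),
    ("p2", "Discharge", "2020-03-12"),
    ("p3", "Hospitalization", "2020-04-01"),
    ("p3", "Discharge", "2020-04-09"),
]
stats = log_statistics(toy_log)
```

Applied to the figures reported above, the same ratio gives 1645 / 216 ≈ 7.6 events per case.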
3 Analysis
In this section, we illustrate the preliminary results obtained through a detailed process
mining-based analysis of the COVAS dataset. More specifically, we elaborate on results based
on the control-flow and performance perspectives.
Figure 2: A normative Petri net that models the process related to the COVAS data.
[Figure not reproduced; legible node labels: Start, startSymptoms, Hospitalization, endSymptoms, ICUadmission.]
First, we present a process model extracted from the event data of the COVAS event log.
Among the several process discovery algorithms in the literature [1], we applied the
Interactive Process Discovery (IPD) technique [3] to extract the patient-flows for COVAS
patients, obtaining a model in the form of a Petri net (Figure 2). IPD allows domain
knowledge to be incorporated into the discovery of process models, leading to improved and
more trustworthy process models. This approach is particularly useful in healthcare
contexts, where physicians have tacit domain knowledge that is difficult to elicit but
highly valuable for the comprehensibility of the process models.

The discovered process map provides operational knowledge about the structure of the process
and the main patient-flows. Specifically, the analysis reveals that COVID-19 patients are
characterized by a fairly homogeneous high-level behavior, but several variants exist due to
the possibility of an ICU admission or to the different outcomes of the process. In more
detail, after the hospitalization and the onset of first symptoms, if present, each patient
may be subject to oxygen therapy and, eventually, to the ICU pathway, with subsequent
ventilation and ECMO activities, until the end of the symptoms. Once conditions improve,
patients may be discharged or transferred to another
Figure 3: Plot showing the usage of assisted ventilation machines for COVID-19 patients in the ICU ward
of the Uniklinik Aachen. Maximum occupancy was reached on the 13th of April 2020, with 39 patients
simultaneously ventilated.

We evaluated the quality of the obtained process model through conformance checking [1].
Specifically, we measured the token-based replay fitness between the Petri net and the event
log, obtaining a value of 98%. This is a strong indication of both a high level of
compliance in the process (the flow of events does not deviate from the intended behavior)
and a high reliability of the methodologies employed in data recording and extraction (very
few deviations in the event log also imply very few missing events and a low amount of noise
in the dataset).
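
To make the fitness measure concrete, the sketch below replays traces on a small Petri net and applies the standard token-based replay formula, fitness = ½(1 − m/c) + ½(1 − r/p), where p, c, m, and r are the produced, consumed, missing, and remaining tokens. It is a simplified illustration under stated assumptions: every activity maps to exactly one visible transition (no silent or duplicate transitions), the three-transition net is a toy and not the model of Figure 2, and the `Discharge` label and the place names are hypothetical. The paper does not state which replay implementation was used.

```python
from collections import Counter

def replay_trace(net, trace, source="start", sink="end"):
    """Replay one trace; return (produced, consumed, missing, remaining) token counts."""
    marking = Counter({source: 1})
    produced, consumed, missing = 1, 0, 0      # the environment produces the initial token
    for activity in trace:
        inputs, outputs = net[activity]
        for place in inputs:
            if marking[place] > 0:
                marking[place] -= 1
            else:
                missing += 1                   # token had to be added artificially
            consumed += 1
        for place in outputs:
            marking[place] += 1
            produced += 1
    # The environment consumes the final token from the sink place.
    if marking[sink] > 0:
        marking[sink] -= 1
    else:
        missing += 1
    consumed += 1
    remaining = sum(marking.values())
    return produced, consumed, missing, remaining

def replay_fitness(net, log):
    """Aggregate token counts over all traces and apply the fitness formula."""
    p = c = m = r = 0
    for trace in log:
        dp, dc, dm, dr = replay_trace(net, trace)
        p, c, m, r = p + dp, c + dc, m + dm, r + dr
    return 0.5 * (1 - m / c) + 0.5 * (1 - r / p)

# Toy sequential net: transition -> (input places, output places).
toy_net = {
    "Hospitalization": (["start"], ["p1"]),
    "ICUadmission": (["p1"], ["p2"]),
    "Discharge": (["p2"], ["end"]),
}
fitting = ["Hospitalization", "ICUadmission", "Discharge"]
deviating = ["Hospitalization", "Discharge"]   # skips ICUadmission
```

A trace that the net can replay without intervention contributes no missing or remaining tokens, so a log of perfectly fitting traces scores 1; each skipped or extra activity lowers the score.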
From the information stored in the event log, it is also possible to gain insights regarding
the time performance of each activity and the resource consumption. For example, Figure 3
shows the rate of utilization of ventilation machines. This information may help hospital
managers to manage and allocate resources, especially critical or shared ones, more
efficiently.
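
A utilization curve of this kind, and its peak, can be computed from the start and end times of each ventilation episode with a sweep over the sorted interval boundaries. The sketch below is an illustration only: the function name `peak_occupancy`, the half-open interval convention, and the toy dates are assumptions, and the paper does not describe how the plot in Figure 3 was produced.

```python
from datetime import date

def peak_occupancy(intervals):
    """Return (peak, time_of_peak) for half-open usage intervals [start, end)."""
    boundaries = []
    for start, end in intervals:
        boundaries.append((start, 1))    # a machine becomes occupied
        boundaries.append((end, -1))     # a machine is released
    # Sort by time; at equal times, releases (-1) are processed before new starts (+1).
    boundaries.sort(key=lambda b: (b[0], b[1]))

    current = peak = 0
    peak_time = None
    for time, delta in boundaries:
        current += delta
        if current > peak:
            peak, peak_time = current, time
    return peak, peak_time

# Toy ventilation episodes (hypothetical dates, not COVAS data).
toy_intervals = [
    (date(2020, 4, 1), date(2020, 4, 10)),
    (date(2020, 4, 5), date(2020, 4, 15)),
    (date(2020, 4, 8), date(2020, 4, 12)),
    (date(2020, 4, 15), date(2020, 4, 20)),
]
peak, peak_day = peak_occupancy(toy_intervals)
```

Recording `current` at every boundary instead of only its maximum would give the full occupancy time series.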
Finally, with the aid of the process mining tool Everflow [4], we investigated the different
patient-flows with respect to the first wave (until the end of June 2020) and the second wave
(from July 2020 onward) of the COVID-19 pandemic, and evaluated their performance
perspective, which is shown in Figures 4 and 5, respectively. The first wave involves 133
cases with an average case duration of 33 days and 6 hours; the second wave includes 63
patients, with an average case duration of 23 days and 1 hour. The difference in average case
duration is significant, and could be due to the medics being more skilled and prepared in
treating COVID cases, as well as to a lower number of simultaneous admissions on average in
the second wave.
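
The wave comparison amounts to partitioning cases at a cutoff date and averaging case durations per group. The sketch below shows one way to do this in plain Python. It assumes that a case is assigned to a wave by the timestamp of its first event, which the paper does not specify, and it uses hypothetical case data. Only the cutoff (end of June 2020) comes from the text.

```python
from datetime import datetime, timedelta

# First wave: until the end of June 2020; second wave: from July 2020 onward.
WAVE_CUTOFF = datetime(2020, 7, 1)

def average_duration_by_wave(cases):
    """cases: dict mapping case_id -> (first_event_time, last_event_time)."""
    durations = {"first": [], "second": []}
    for first_event, last_event in cases.values():
        # Assumption: the wave of a case is decided by its first event.
        wave = "first" if first_event < WAVE_CUTOFF else "second"
        durations[wave].append(last_event - first_event)
    return {
        wave: sum(values, timedelta()) / len(values)
        for wave, values in durations.items()
        if values
    }

# Toy cases (hypothetical, not COVAS data).
toy_cases = {
    "c1": (datetime(2020, 3, 1), datetime(2020, 3, 31)),
    "c2": (datetime(2020, 4, 10), datetime(2020, 4, 20)),
    "c3": (datetime(2020, 10, 1), datetime(2020, 10, 21, 12)),
}
averages = average_duration_by_wave(toy_cases)
```

Ongoing cases would need to be excluded or handled separately before averaging, since their last recorded event is not a discharge.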
Figure 4: Filtered directly-follows graph related to the first wave of the COVID pandemic.
Figure 5: Filtered directly-follows graph related to the second wave of the COVID pandemic.
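
A directly-follows graph counts how often one activity is immediately followed by another within the same case. The sketch below builds such a graph from a list of traces and filters it with a minimum edge frequency. The frequency threshold is an assumed filtering criterion chosen for illustration, and the toy traces and the `Discharge` label are hypothetical. The paper does not describe the filter applied in Everflow.

```python
from collections import Counter

def directly_follows(traces, min_count=1):
    """Count directly-follows pairs and keep edges seen at least min_count times."""
    counts = Counter()
    for trace in traces:
        for source, target in zip(trace, trace[1:]):
            counts[(source, target)] += 1
    return {edge: n for edge, n in counts.items() if n >= min_count}

# Toy traces (hypothetical, not COVAS data).
toy_traces = [
    ["Hospitalization", "ICUadmission", "Discharge"],
    ["Hospitalization", "Discharge"],
    ["Hospitalization", "ICUadmission", "Discharge"],
]
full_graph = directly_follows(toy_traces)
filtered_graph = directly_follows(toy_traces, min_count=2)
```

Annotating each edge with the mean time between its two events, instead of a count, would give a performance view of the same graph.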
4 Conclusion and Future Work
In this preliminary paper, we show some techniques to inspect hospitalization event data
related to the COVID-19 pandemic. The application of process mining to COVID event data
appears to lead to insights related to the development of the disease, to the efficiency in
managing the effects of the pandemic, and to the optimal usage of medical equipment in the
treatment of COVID patients in critical conditions. We present a normative model obtained
with the aid of IPD for the operations at the COVID unit of the Uniklinik Aachen hospital,
which indicates a high reliability of the data recording methods in the ICU.
Among the ongoing research on COVID event data, a prominent future development consists in
performing comparative analyses between geographically and temporally diverse datasets and
event logs. By inspecting differences only detectable with process science techniques (e.g.,
deviations in the control-flow perspective), novel insights can be obtained on aspects of the
pandemic such as spread, effectiveness of different crisis responses, and long-term impact on
the population.
We acknowledge the ICU4COVID project (funded by European Union’s Horizon 2020
under grant agreement n. 101016000) and the COVAS project for our research interactions.
[1] van der Aalst, Wil M. P. Process Mining - Data Science in Action, Second Edition.
Springer, 2016. isbn: 978-3-662-49850-7. doi:10.1007/978-3-662-49851-
[2] Anastassopoulou, Cleo, Lucia Russo, Athanasios Tsakris, et al. “Data-based analysis,
modelling and forecasting of the COVID-19 outbreak”. In: PloS one 15.3 (2020),
[3] Dixit, Prabhakar M., H. M. W. Verbeek, Joos C. A. M. Buijs, et al. “Interactive
Data-Driven Process Model Construction”. In: Conceptual Modeling - 37th International
Conference, ER 2018, Xi’an, China, October 22-25, 2018, Proceedings.
Ed. by Trujillo, Juan, Karen C. Davis, Xiaoyong Du, et al. Vol. 11157. Lecture Notes
in Computer Science. Springer, 2018, pp. 251–265. doi:10.1007/978-3-030-00847-5_19.
[4] Everflow Process Mining. [Online; accessed 2021-05-17].
[5] Galetsi, Panagiota and Korina Katsaliaki. “A review of the literature on big data
analytics in healthcare”. In: Journal of the Operational Research Society 71.10 (2020),
pp. 1511–1529. doi:10.1080/01605682.2019.1630328.
[6] Guidotti, Emanuele and David Ardia. “COVID-19 Data Hub”. In: Journal of
Open Source Software 5.51 (2020). Ed. by Rowe, Will, p. 2376. doi:10.21105/
[7] Kou, Vassiliki, Flora Malamateniou, and George Vassilacopoulos. “A Big Data-
driven Model for the Optimization of Healthcare Processes”. In: Digital Health-
care Empowering Europeans - Proceedings of MIE2015, Madrid Spain, 27-29 May,
2015. Ed. by Cornet, Ronald, Lacramioara Stoicu-Tivadar, Alexander H¨
orbst, et
al. Vol. 210. Studies in Health Technology and Informatics. IOS Press, 2015, pp. 697–
701. doi:10.3233/978-1-61499-512-8-697.
[8] Lavezzo, Enrico, Elisa Franchin, Constanze Ciavarella, et al. “Suppression of a
SARS-CoV-2 outbreak in the Italian municipality of Vo’”. In: Nature 584.7821
(2020), pp. 425–429.
[9] Mans, Ronny S., Wil M. P. van der Aalst, and Rob J. B. Vanwersch. Process Min-
ing in Healthcare - Evaluating and Exploiting Operational Healthcare Processes.
Springer Briefs in Business Process Management. Springer, 2015. isbn: 978-3-319-
16070-2. doi:10.1007/978-3-319-16071-9.
[10] Sarkar, Kankan, Subhas Khajanchi, and Juan J Nieto. “Modeling and forecast-
ing the COVID-19 pandemic in India”. In: Chaos, Solitons & Fractals 139 (2020),
p. 110049.
[11] Wynants, Laure, Ben Van Calster, Gary S Collins, et al. “Prediction models for
diagnosis and prognosis of covid-19: systematic review and critical appraisal”. In:
British Medical Journal 369 (2020).