Chapter 11
Knowledge Extraction from Events Flows
Alireza Rezaei Mahdiraji, Bruno Rossi, Alberto Sillitti, and Giancarlo Succi
Abstract. In this chapter, we propose an analysis of the approaches and methods
available for the automated extraction of knowledge from event flows. We specifically
focus on the reconstruction of processes from automatically generated event
logs. In this context, we consider that knowledge can be directly gathered by means
of the reconstruction of business process models. In the ArtDECO project, we frame
such approaches inside delta analysis, that is, the detection of differences of the executed
processes from the planned models. To this end, we provide an overview of the
different techniques available for process reconstruction, and propose an approach
for the detection of deviations. To show its effectiveness, we instantiate the usage to
the ArtDECO case study.
11.1 Introduction
Event logs are typically available inside an organisation, and they encode relevant
information about the execution of high-level business processes. Even if such
information is usually available, very often knowledge about processes is difficult
to reconstruct, or processes can deviate from the planned behaviour. Thus, retrieving
knowledge from such event flows is part of the so-called process mining approach,
that is, the reconstruction of business processes from event log traces [30].
The reasons for using process mining are usually multiple and multi-faceted.
Two are the most relevant: (1) the information can be used to see whether the
actions inside an organisation are aligned with the designed business processes
(so-called delta analysis); (2) process mining can be used to derive existing patterns
in users' activities that can then be used for process improvement. In both cases,
the knowledge that can be gathered allows a more efficient usage of resources inside
an organisation [28].

Alireza Rezaei Mahdiraji · Bruno Rossi · Alberto Sillitti · Giancarlo Succi
Center for Applied Software Engineering (CASE) - Free University of Bozen-Bolzano,
Piazza Domenicani 3, 39100 Bolzano, Italy
e-mail: {alireza.rezaei,bruno.rossi,alberto.sillitti,gsucci}

G. Anastasi et al. (Eds.): Networked Enterprises, LNCS 7200, pp. 221–236, 2012.
© Springer-Verlag Berlin Heidelberg 2012
In this chapter, we provide an overview of several techniques and approaches that
can be used for process mining from event flows. We analyse such approaches, and
we describe how they have been used in the context of the ArtDECO project.
11.2 Knowledge Flow Extraction in the ArtDECO Project
In the context of the ArtDECO project, we have interconnected networked enter-
prises that do not only use their own internal business processes but need to orches-
trate higher level processes involving several companies. Three layers have been
identified: the business process, the application, and the logical layer. The business
process level deals with the interrelations among enterprises at the level of business
process models. The application level is the implementation of the business pro-
cesses, running either intra- or inter-enterprises. The logical layer is an abstraction
over the physical resources of networked enterprises. The contextualisation to the
GialloRosso winery case study is discussed in Chapter 2, Figure 1.1.
The vertical view shows all the different abstract processes, while the horizontal
line highlights the stakeholders interested in the different information flows. Enterprises
are typically integrated horizontally; as such, we need to consider the integration
of the different business processes. As the case study has been instantiated to the
wine domain, we have typically three different stakeholders to consider: the winery,
the delivery company and the retailer. Each actor has its own business process that
is integrated in the overall view of the general interconnected processes.
Extracting knowledge from low-level event flows thus poses more issues than
normal business process reconstruction, as we have very fine-grained information as
the source of the whole reconstruction process [18, 6].
In this context, a possible approach for process improvement is the following:
1. Collection of data with client-side plug-ins. This is a step that can be performed
by using systems for automated data collection enabling the organisation to
harvest information during the execution of the business processes. Such information
can then be used to reconstruct knowledge from the low-level events;
2. Collection of data from sensor networks. Another source of data comes from
sensor networks. They provide low level data that can be used to evaluate in-
consistencies between the actual situation and the expected one. Such low-level
events can be used for subsequent phases of the process;
3. Integration of the data sources into a DBMS. All low-level events need to be
integrated in a common centralised repository or DBMS. Such point of integra-
tion can be queried for the reconstruction of the process and the evaluation of
triggers when the expected behaviour is not conformant to the planned business
processes;
4. Reconstruction of the business processes. To reconstruct the business processes
in an organisation, we need to extract knowledge from low-level events data.
To this end, we need some algorithms that - given a set of events - are able to
reconstruct the ongoing process;
5. Delta analysis of the deviations from expected behaviour. Once the business
processes have been reconstructed, the difference with the planned models can
be detected. If necessary, triggers are set up to notify stakeholders about unexpected
behaviours in the context of the planned high-level processes.
Such approach can be supported by a tool such as ProM [27] that allows the collection
of low-level events and the subsequent analysis of the processes [3, 4, 9].
In this chapter, we focus on techniques for the extraction of knowledge from
event flows and process reconstruction (Step 4).
11.3 Overview of Process Mining Techniques
A business process is a collection of related activities that produces a specific service
or product for a customer. The success of an organisation is directly related to the
quality and efficiency of its business processes. Designing these processes is a
time-consuming and error-prone task, because the knowledge about a process is usually
distributed among the employees and managers that execute it. Moreover,
such knowledge is not only distributed in a single organisation but, frequently, it is
distributed across several organisations (cross-organisational processes) that belong
to the same supply chain. For these reasons, business process experts that are in
charge of the formal definition of processes face a hard task. Additionally, the
designed model needs to adapt to the changes that the market imposes on organisations
to be competitive.
To spend less time and effort and obtain models based on what really happens
in an organisation, we can adopt a bottom-up approach, i.e., extract the structure of
processes from recorded event logs [26, 2, 5]. Usually, information systems in an
organisation have logging mechanisms to record most of the events. These logs contain
information about the actual execution of the processes, such as the list of executed
tasks, their order, and process instances. The method of extracting the structure
of a process (a.k.a. the process model) from event logs is known as process mining,
process discovery, or workflow mining [28]. The extracted model can be used to
analyse and improve the current business, e.g., it can be used to detect deviations
from normal process executions [19].
Each event log consists of several traces. Each trace corresponds to the execution
of a process instance, also known as a case. Each case is obtained by executing
activities (tasks) in a specific order. A process model is designed to handle similar
cases: it specifies the tasks to be executed and their temporal order of execution.
11.3.1 A General Process Mining Algorithm
Figure 11.1 shows the steps of a typical process mining algorithm. The first step
deals with reading the content of the log. Most of the algorithms assume that the
log contains at least the case identifiers, the task identifiers, and the execution order
of the tasks for each case. This is the minimal information that is needed for process
mining. However, in reality, logs contain further information that can be used to
extract more dependency information, e.g., if a log contains information about the start
and completion time of each task (non-atomic tasks), parallelism detection can be
done by examining just one trace; otherwise, at least two traces are needed [35].
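For non-atomic tasks, this overlap check can be sketched as follows; the task names, timestamps, and interval representation are illustrative assumptions, not taken from the chapter:

```python
# Sketch: with start/completion timestamps (non-atomic tasks), two tasks in a
# single trace can be flagged as parallel when their execution intervals overlap.

def overlaps(a, b):
    """True if the (start, completion) intervals a and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

# one trace: task -> (start, completion); values are illustrative
trace = {"A": (0, 4), "B": (2, 6), "C": (6, 8)}

parallel = {tuple(sorted((x, y)))
            for x in trace for y in trace
            if x < y and overlaps(trace[x], trace[y])}
print(parallel)  # {('A', 'B')}
```

With atomic events, the same conclusion would require at least two traces showing A and B in both orders.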
The second step deals with the extraction of the dependency relations (also
known as follows relations) among tasks. These relations can be inferred using the temporal
relationships between tasks in the event log. Task B is dependent on task A iff, in
every trace of the log, task B follows task A. This definition is unrealistic for real-world logs
because logs always contain noise. Noisy data in the logs are due either
to a problem in the logging mechanism or to exceptional behaviours. A process
mining algorithm has to extract the most common behaviour even in the presence of
noisy data, which makes the task of the mining algorithm more difficult and can result
in overfitting (i.e., the generated model is too specific and allows only the exact behaviours
seen in the log). To cut off the noise, we need a way (e.g., a threshold) to
discard less frequent information and reconstruct a sound business process. Hence,
we need to consider the frequency of the dependency relations. We can modify the
definition of the dependency relation as follows: task B is dependent on task A iff, in
most traces, task B follows task A.
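The frequency-based definition can be sketched as a direct-succession count with a threshold; the traces and the threshold value are illustrative assumptions:

```python
from collections import Counter

# Sketch: count direct successions over all traces, then keep only the pairs
# frequent enough to survive the noise threshold.
traces = [["A", "B", "C"], ["A", "B", "C"], ["A", "C", "B"], ["A", "B", "C"]]

follows = Counter()
for trace in traces:
    for x, y in zip(trace, trace[1:]):
        follows[(x, y)] += 1

threshold = 0.5  # keep pairs observed in at least half of the traces
dependencies = {pair for pair, n in follows.items()
                if n / len(traces) >= threshold}
print(sorted(dependencies))  # [('A', 'B'), ('B', 'C')]
```

Here the infrequent successions (A, C) and (C, B), each seen once, are discarded as noise.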
The third step deals with the induction of the structure of the process model. The
problem is to find a process model that satisfies three conditions: (i) it generates all
the traces in the log (in case of noise-free logs), (ii) it covers only few traces that are not
in the log (extra-behaviour), and (iii) it has the minimal number of nodes (i.e., steps
of the process). For simple processes, it is easy to discover a model that recreates
Fig. 11.1 Steps in a General Process Mining Algorithm
the log, but for larger processes it is a hard task, e.g., the log may not contain
all the combinations of selection and parallel routings, some paths may have low
probability and remain undetected, or the log may contain too much noise.
The fourth step deals with the identification of the routing paths among the nodes.
The resulting models may include four basic routing constructs: (i)
Sequential: the execution of one task is followed by another task; (ii) Parallel: tasks
A and B are in parallel if they can be executed in any order or at the same time; (iii)
Conditional (Selection or Choice): between tasks A and B, either task A or task B is
executed; and (iv) Iteration (Loop): a task or a set of tasks is executed multiple times.
The induction of conditions phase corresponds to the induction of conditions for
non-deterministic transitions (i.e., selections) based on process attributes. A process
attribute is a specific piece of information used as a control variable for the routing
of a case, e.g., the attributes of the documents which are passed between actors.
Approaches such as decision rule mining from machine learning can be used to
induce these conditions on a set of attributes [16].
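As a minimal sketch of such condition induction, the following one-rule learner picks a threshold on a single hypothetical "amount" attribute that best separates the two branches of a choice; it is an illustration, not the decision rule miner of [16]:

```python
# Sketch: induce a routing condition for a choice between tasks A and B from
# observed cases. Each case records the value of a process attribute and the
# branch that was actually taken. Attribute name and values are illustrative.
cases = [  # (amount attribute, branch taken)
    (120, "A"), (300, "B"), (80, "A"), (450, "B"), (200, "B"), (90, "A"),
]

def best_threshold(cases):
    values = sorted(v for v, _ in cases)
    # candidate thresholds: midpoints between adjacent attribute values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    def accuracy(t):
        # cases correctly routed by the rule "go to A if amount <= t"
        return sum((v <= t) == (branch == "A") for v, branch in cases)
    return max(candidates, key=accuracy)

t = best_threshold(cases)
print(f"route to A if amount <= {t}")  # route to A if amount <= 160.0
```

A real rule miner would handle several attributes at once and produce conjunctive conditions, but the induced artefact has the same shape: a predicate on process attributes attached to the non-deterministic transition.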
The last step converts the resulting model (e.g., a Petri net) to a graphical
representation that is easy to understand for the users.
11.3.2 Challenges of Process Mining Algorithms
In [28], several of the most challenging problems of process mining research have
been introduced. Some of those challenges are now partially solved, such as mining
hidden tasks, mining duplicate tasks, mining problematic constructs (e.g., non-free-
choice net) and using time information, but some of them still need more research.
The current most important challenges of process mining are:
Mining different perspectives. Most of the research in process mining is devoted
to mining the control flow (How?), i.e., the ordering of tasks. There are few works
that aim at mining other perspectives, such as the organisational perspective (Who?), e.g.,
mining social networks of organisational relationships [31].
Handling noise. Noise in real-world event logs is unavoidable, and each
process mining system should deal with noisy data. As mentioned before, noisy
logs cause overfitting. The main idea is that the algorithms should be able to distinguish
exceptions from real behaviours. Usually, exceptions are infrequent observations
in the logs, e.g., infrequent causal relations. Most algorithms provide a
threshold to distinguish between noise and correct behaviours. There are several
heuristics in the literature to handle noise.
Underfitting. It means that the model overgeneralises the behaviours seen in the
logs; this is due to the unsupervised setting, i.e., the lack of negative traces in
the logs [7]. The logs only contain successful process instances, so negative
process instances need to be generated from the current log, e.g.,
first we find outliers in the current log and label them as negative traces, and
then apply a binary classification algorithm.
Conformance Testing. Comparing a prior model of the process with the model
extracted through process mining is known as conformance testing or delta
analysis. The comparison procedure aims at finding the differences and
commonalities between the two models [24].
Dependency between Cases. Most of the current approaches assume that there are
no dependencies among cases, i.e., the routing of one case does not depend on
the routing of other cases; in other words, the events of different cases are
independent. Real-world data may violate this assumption. For example, there
may be a competition among different cases for some resources. These so-called
history-dependent behaviours are a stronger predictor for the process model [12].
Non-Unique Model. It means that different non-equivalent process models can
be mined from one log. Some of them are too specific and some others are
too general, but all of them generate the log. By finding a balance between
specificity and generality, we can find an appropriate process model.
Low-Level Event-Logs. Most of the current process mining techniques are only
applicable to the logs of process-aware information systems, i.e., logs that are at the task
level of abstraction. This is not the case for many real-life logs. They usually lack
the concept of task and instead contain many low-level events. Although
groups of low-level events together represent tasks, it is not easy to infer those
groups. The information systems' logs must first be ported to the task level, and then
a process mining technique can extract the process model [33, 13, 15].
Un-Structured Event-Logs. In real-life systems, even if there is a prior process
model, it is not enforced, and there are lots of behaviours in the logs that
are instances of deviations from the model. This flexibility results in unstructured
process models, i.e., models with lots of nodes and relations. Most of the techniques
in the literature generate unstructured models in such environments. The resulting
models are not incorrect and usually capture all the deviations in the log. Two approaches
in the literature dealing with this issue use clustering and fuzzy techniques
[33, 14].
11.3.3 Review of the Approaches
There are several approaches for mining process models from log data [28]. Based
on the strategy that they use to search for an appropriate process model, we can divide
them into local and global approaches. Local strategies rely only on local information
to build the model step by step, while global strategies search the space of
potential models to find the model. The local strategies usually have problems with
noise and with discovering more complex constructs. Most of the current approaches
to process mining are in the first category, and only a few of them are in the global
category. Figure 11.1 depicts the general steps of local algorithms. In the following, we
concisely introduce some of these approaches.
In [16, 17], three algorithms are developed, namely, Merge Sequential, Split Se-
quential, and Split Parallel. These algorithms are the best known for dealing with
duplicate tasks. The first and second algorithms are suitable for sequential process
models and the third one for parallel processes. They extract process models as a
Hidden Markov Model (HMM). HMM is basically a Finite State Machine (FSM),
but each transition has a probability associated and each state has a finite set of
output symbols. The Merge Sequential and Split Sequential algorithms use the
generalization and specialization approaches, respectively. Merge Sequential is a
bottom-up algorithm, i.e., it starts with the most specific process model, with one
separate path for each trace in the log; then it iteratively generalizes the model by
merging states that have the same output symbols, until a termination
criterion is satisfied. To reduce the size of the initial model, a prefix HMM can be used, i.e., all
states that have a common prefix are mapped to a single state. The problem with the
specialization operator is the number of merging operations, i.e., even with few states
with the same symbols, the number of merging operations is usually very large.
The Split Sequential algorithm aims to overcome the complexity of Merge Se-
quential. It starts with the most general process model (i.e., without duplicate tasks
and able to generate all the behaviours in the log) and then iteratively splits states
with more than one incoming transition into two states. The Split Parallel algorithm
is an extension of Split Sequential for concurrent processes. The reason for this
extension is that, unlike in sequential processes, the split of a node in concurrent
processes changes the dependency graph and may thus have global side
effects. So, instead of applying the split operator on the model, the split operations
are done at the level of the process instances. It is also top-down and works as follows:
suppose that activity A is split. The split operator is able to make a distinction among
certain occurrences of A, e.g., A1 and A2. Then, it induces a general model based
on the current instances, which contains the two nodes A1 and A2 instead of only A. After
the termination of the specialization, re-labelled nodes are changed back to their original labels.
This approach also uses a decision rule induction algorithm to induce conditions for
non-deterministic transitions.
In [25], a new approach based on a block-oriented representation to extract
minimal complete models has been introduced. In the block-oriented representation, each
process model consists of a set of nested building blocks. Each building block con-
sists of one or more activities or building blocks that are connected by operators.
There are four main operators, namely, sequence, parallel, alternative, and loop.
Operators define the control-flow of the model.
The works in [28, 29, 30] are some of the more extensive works in process mining.
The authors started by developing the alpha algorithm and over time extended
it with many modifications to tackle different challenges. They proved that the alpha
algorithm is suitable for a specific class of models. In the first version of the alpha
algorithm, they assumed that logs are noise-free and complete. The alpha algorithm
is unable to find short loops and implicit places, and works based on binary relations
in the logs. There are four relations: follows, causal, parallel, and unrelated.
Two tasks A and B have a follows relation if they appear next to each other in the
log. This relation is the basic relation from which the other relations are extracted.
Two tasks A and B have a causal relation if A follows B, but B does not follow A.
If B also follows A, then the tasks have a parallel relation. When A and B are not
involved in a follows relation, they are said to be unrelated. All the dependency
relations are inferred based on local information in the log. Additionally, because the
algorithm works based on sets, it cannot mine models with duplicate tasks. The
alpha algorithm works only on the follows relation without considering frequencies;
therefore, it cannot handle noise. In [34], the alpha algorithm was enhanced with
several heuristics that consider frequencies to handle noise. The main idea behind the
heuristics is as follows: the more often task A follows task B and the less often B
follows A, the more likely it is that A is a cause for B. Because the algorithm mainly
works based on binary relations, non-free-choice constructs cannot be captured.
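The frequency heuristic can be sketched with a dependency measure in the style of the heuristics miner of [34]; the traces and the exact formula used here are illustrative:

```python
from collections import Counter

# Sketch: the dependency measure for A => B grows when A directly precedes B
# often and B rarely precedes A, so that rare reversals are treated as noise.
traces = [["A", "B", "D"], ["A", "B", "D"], ["A", "C", "D"], ["B", "A", "D"]]

direct = Counter()
for trace in traces:
    for x, y in zip(trace, trace[1:]):
        direct[(x, y)] += 1

def dependency(a, b):
    """Measure in (-1, 1): close to 1 suggests a causal relation a => b."""
    ab, ba = direct[(a, b)], direct[(b, a)]
    return (ab - ba) / (ab + ba + 1)

print(round(dependency("A", "B"), 2))  # 0.25
print(round(dependency("A", "D"), 2))  # 0.5
```

The single reversed occurrence of B before A lowers the A => B score without eliminating it, which is exactly the noise tolerance the plain alpha relations lack.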
In [21], the authors first extended the three relational metrics of [34] and added two new
metrics that are more suitable for distinguishing between choice and parallel situations.
Then, based on a training set containing information about the five metrics and the
actual relations between pairs of activities (causal, choice, and parallel), they applied
a classification rule learner to induce rules that distinguish among the different relations
based on the values of the five relational metrics.
Clustering-Based Techniques. Most of the approaches in process mining produce
only one process model. This single model is supposed to represent every single
detail in the log. So, the resulting model is intricate and hard to understand. Many
real-world logs lead to such unstructured models, usually because very flexible
executions of the process model are allowed. Flexible environments generate heterogeneous
logs, i.e., they contain information about very different cases. Another reason
for unstructured logs is that some logs record information about different processes
that belong to the same domain. In both cases, a single process model is unable to
describe this kind of log, and the resulting model is either very specific (when an exact process
model is the goal) or over-general. The basic idea of using clustering techniques in
the process mining domain is to divide the original log into several smaller logs, where
each one of them contains only homogeneous cases. Then, for each partition, a separate
process model is extracted.
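A minimal sketch of this partitioning idea follows, using bag-of-activities profiles as a deliberately simplistic grouping criterion; the traces are illustrative, and real approaches use richer feature vectors and distance-based clustering:

```python
from collections import Counter

# Sketch: represent each trace by its bag-of-activities profile and group
# identical profiles, so each partition contains only homogeneous cases.
traces = [
    ["register", "check", "approve"],
    ["register", "approve", "check"],
    ["register", "reject"],
    ["register", "reject"],
]

clusters = {}
for trace in traces:
    profile = frozenset(Counter(trace).items())
    clusters.setdefault(profile, []).append(trace)

print(len(clusters))  # 2
```

A separate mining pass would then be run on each of the two partitions, yielding two small structured models instead of one intricate model covering all four traces.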
In the process mining literature, there are several studies using clustering in different
ways. For example, in [1], each trace is represented with a vector of features
extracted from different perspectives, i.e., control-flow, organisation, data, etc. In
[10, 11, 23], a hierarchy of process models is generated, i.e., each
log partition is further partitioned if it is not expressive enough.

These are all local algorithms, but there are three main global approaches in the
literature, namely, genetic process mining, an approach based on first-order logic, and
fuzzy process mining.
Genetic Process Mining. Since genetic algorithms are global search algorithms,
genetic process mining is also a global algorithm that can handle noise and can
capture problematic structures such as non-free-choice constructs [22, 32]. To apply
genetic algorithms to process mining, a new representation for process models,
known as the causal matrix, was proposed in [22, 32]. Each individual, or potential process
model, is represented by a causal matrix. The causal matrix contains information about
the causal relations between the activities and the input/output of each activity. A causal
matrix can be mapped to a Petri net [22, 32]. Three genetic operators are used,
namely, elitism, crossover, and mutation. The elitism operator selects a percentage of the
best process models for the next generation. Crossover recombines causality relationships
in the current population, and the mutation operator inserts new causal relationships
and adds/removes activities from the input or output of each activity. One of
the main drawbacks of genetic process mining is the computational complexity of
the approach.
First-Order-Logic Approach. Another approach based on global search uses
first-order-logic learning for process mining [20]. In contrast to the other approaches
seen so far, this approach generates declarative process models instead of
imperative (procedural) models. Imperative approaches define an exact execution order
for a set of tasks, while declarative approaches only focus on what should be done. In
[20], the process mining problem is defined as a binary classification problem, i.e.,
the log contains positive and negative traces.
The advantages of using first-order learning are as follows: (i) it can discover structural
patterns, i.e., search for patterns of relations between rows in the event log; (ii)
by using a declarative representation, it generates more flexible process models; and
(iii) it can use prior knowledge in learning.
Fuzzy Process Mining. The third global strategy is the fuzzy approach. The idea of
fuzzy process mining is to generate different views of the same process model based
on configurable parameters. Based on what is interesting, the configuration parameters
are used to keep only those parts of the process that are relevant and to remove the
others. To achieve this objective, it considers both global and local information from
different aspects of the process, such as the control-flow and the organisational structure, to
produce a process model. This is why the fuzzy approach is also known as the
multi-perspective approach [14].
11.4 Application to the GialloRosso Case Study
In this section, we start by defining the process of delta analysis, then we delve
into an application to the GialloRosso case study. Knowledge extraction in the
ArtDECO Project has been contextualised to the analysis of deviations from
Fig. 11.2 Phases of the lifecycle of Business Processes for delta analysis
expected business models. Specifically, we used a local algorithm based on
happened-before relations, supported by a rule-based system for task reconstruction
from event logs. Figure 11.2 shows the different phases of a business process,
from modelling and instantiation to execution. We generally start with a business
process model that represents all the steps as they have been conceived by a business
process analyst. Such model is then instantiated and executed: at the same time,
several executions of the same process model can be running. Each process execution
needs to be monitored to derive indications from the real running processes.
Then, this information can be used to reconstruct the real model. Afterwards, delta
analysis is used to compare the planned and the actual models to derive the
deviations. For delta analysis, the monitoring phase is critical for the derivation of the
low-level events that can be analysed for knowledge extraction. Once the actual execution
model has been reconstructed by means of algorithms for process mining,
delta analysis is used to derive the deviations from the execution traces.
In our case, the process of delta analysis is done by means of the following steps:
(a) determination of the happened-before relations among tasks from the original
planned model: for each pair of tasks, if task A precedes B, then A -> B;
(b) process monitoring for the collection of events for different instances of the process;
(c) task reconstruction from events by means of a rule-based system, as in Figure 11.3;
(d) determination of the violations of the happened-before relations from the execution
traces, when an event in task B is detected with no event from task A happening
before. Such violations are annotated.
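Step (d) can be sketched as a single scan over a reconstructed trace; the happened-before relations and the traces used here are illustrative:

```python
# Sketch: flag a happened-before violation when a task is observed before any
# of its prerequisite tasks has appeared in the trace.
happens_before = {"B": {"A"}, "C": {"A", "B"}}  # task -> prerequisites

def violations(trace):
    seen, found = set(), []
    for task in trace:
        missing = happens_before.get(task, set()) - seen
        if missing:
            found.append((task, sorted(missing)))  # annotate the violation
        seen.add(task)
    return found

print(violations(["A", "B", "C"]))  # []
print(violations(["B", "A", "C"]))  # [('B', ['A'])]
```

In the second trace, B is executed before its prerequisite A has been observed, so the pair is annotated for delta analysis.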
In particular, to reconstruct the process phases from low-level events, events need
to be mapped to higher-level constructs (Figure 11.3). For this, we use a rule-based
system that maps all the events, according to domain information, to the higher levels,
so that it is possible to associate each event to a phase of a process. A rule is defined
by means of the source application that generated the event, the text of the event, and
a discriminator for a particular process instance (e.g., an item the instance of the
process is related to). Each rule maps specifically to a task in the original process
model. The actor/agent that generated the event is already part of the event metadata.
After the rules are applied, events that do not comply with any rule are discarded and not
considered for delta analysis.

Fig. 11.3 Phases of the lifecycle of Business Processes for delta analysis
If we consider the case study of the winery (the GialloRosso case, see Chapter
2), we can instantiate the approach on part of the defined business processes to show
how it has been applied.
The planned behaviour in the case study foresees that, when the handling of the
wine starts, a message must be sent from the distributor of the winery to the carrier,
which is responsible for the transportation (Figure 11.4). In parallel, the distributor starts
the quality monitoring process for the delivery, so as to gather objective data about
the performance of the delivery process. Such information is then used to alert the
distributors and to update the final wine quality. At this point, the whole process ends.
Figure 11.4 already contains information about different execution traces and
their deviations from the planned behaviour. The detail is shown with the process
number (e.g., P1 or P2) and an indication of a (d)eviation or an (u)nauthorized action
during the step. We explain in the following how these activities are detected
and the software implementation that has been used for the analysis.
With the aim of explaining the approach undertaken, we focus just on the initial
data exchange between the distributor and the carrier (top left part of Figure 11.4).
Actions can deviate from the original plans under some circumstances, as the process
can start without being triggered by a message in the information system. For
example, the process can be started due to personal communication between the two
actors of the process. Even in this trivial case, this activity can be detrimental for
Fig. 11.4 The 'discover carrier' subprocess results from delta analysis: the original workflow
is tagged with the deviations of two process instances P1 and P2
process improvement analyses, as there is no information about how much time
was required to pass from the decision taken at the management level to the control
center level. Also, there is no tracking of the original communication, nor any evaluation
of errors in the definition and/or execution of the directives that were assigned.
Furthermore, the execution violates the originally planned sequence of actions.
This can be detected by process mining on the low-level data. The following
preconditions and post-conditions (happened-before relations) can be inferred from the
original model. They can be derived from several execution traces
or from the original process. If we consider the Handling Start (H_S), Mail Send
(MS), and Handling Execution (H_E) actions:
H_S = pre (MS(Distributor, Carrier))
H_S = post (MS(Carrier, Distributor))
H_S = post (H_E)
We derive that the Handling Start phase has as a precondition that a message must be exchanged between the two actors of the business process. Its postconditions are another exchange of messages between the actors and the execution of the successive phase of the process. In other terms, these are the conditions for the part of the process that we consider, expressed in happened-before notation:
MS(Distributor, Carrier) -> H_S;
H_S -> MS(Carrier, Distributor);
H_S -> H_E;
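These happened-before relations can be checked mechanically against a timestamped trace. The following is a minimal sketch in Python, with hypothetical action names and timestamps; it illustrates the idea, not the prototype's actual implementation:

```python
# Happened-before relations derived from the planned model:
# each pair (a, b) means "a must occur before b".
RELATIONS = [
    ("MS(Distributor,Carrier)", "H_S"),
    ("H_S", "MS(Carrier,Distributor)"),
    ("H_S", "H_E"),
]

def check_trace(trace, relations=RELATIONS):
    """Return the list of violated relations for one execution trace.

    `trace` is a list of (timestamp, action) pairs for a single
    process instance; an action missing from the trace counts as a
    violation of any relation that mentions it.
    """
    first_seen = {}
    for ts, action in trace:
        first_seen.setdefault(action, ts)
    violations = []
    for before, after in relations:
        if (before not in first_seen or after not in first_seen
                or first_seen[before] >= first_seen[after]):
            violations.append((before, after))
    return violations

# A trace where Handling Start happened without the triggering message:
trace_p1 = [(10, "H_S"), (12, "MS(Carrier,Distributor)"), (15, "H_E")]
print(check_trace(trace_p1))  # -> [('MS(Distributor,Carrier)', 'H_S')]
```

A trace that respects all three relations would yield an empty list; any non-empty result marks the corresponding phase as deviating.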
Once we have this information for the planned model, we need to focus on the actual instances of the process. If we run several instances of the process while monitoring the actors, we collect execution traces that need to be mapped to higher-level constructs. This is done by a domain-dependent rule-based system that maps events to tasks and process phases. An alternative is to use machine learning approaches, which need to be trained on several execution traces. The rule-based system also needs to discriminate the specific instance of the process, so we need a way to divide flows of events across different processes. The following is an example of a rule:
APP{"any"} AND EVENT{"*warehouse*"} AND DISC{"item *"} -> Task A
In this case, we are defining a rule to process events generated by any application that are related to documents that have "warehouse" in the title, divided according to a tagged item. Therefore, events with different items will be mapped to different process flows. All the events that comply with this rule will be mapped to task A, keeping the timestamp information.
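Such a rule can be read as three wildcard patterns over an event's attributes. The following Python sketch illustrates one possible matcher; the attribute names (`app`, `event`, `disc`, `ts`) and the dictionary-based event encoding are illustrative assumptions, not the actual rule engine of the prototype:

```python
from fnmatch import fnmatch

# One rule: wildcard patterns over the application name, the event
# description, and the discriminator used to split process instances.
# APP{"any"} is rendered here as the pattern "*".
RULE = {"app": "*", "event": "*warehouse*", "disc": "item *", "task": "Task A"}

def map_event(event, rule=RULE):
    """Map a low-level event to (task, instance, timestamp), or
    return None if the rule does not apply.

    The discriminator value (e.g. 'item 3') doubles as the
    process-instance key, so events with different items end up in
    different process flows; the timestamp is kept so that temporal
    checks remain possible downstream.
    """
    if (fnmatch(event["app"], rule["app"])
            and fnmatch(event["event"], rule["event"])
            and fnmatch(event["disc"], rule["disc"])):
        return rule["task"], event["disc"], event["ts"]
    return None

e = {"app": "erp", "event": "warehouse registry updated", "disc": "item 3", "ts": 42}
print(map_event(e))  # -> ('Task A', 'item 3', 42)
```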
Once events have been associated with the planned tasks, violations of the original model are detected by evaluating the temporal execution of the events associated with each task: if a task is executed before another temporally related task that should precede it, such a violation is annotated.
Figure 11.4 shows an example of the original process annotated with violations from two running instances, P1 and P2. Each violation indicates that the process may have been executed without following the original temporal relations among phases. In the case study, this can mean that the communication flows among stakeholders followed paths different from those planned. We can see the number of violations per task, and we can further focus on each task to inspect the causes of such deviations by looking at the single low-level events. Two different types of information are shown: (d)eviations, that is actions undertaken without respecting the temporal relations, for example sending the delivery without prior communication by the distributor; and (u)nauthorized actions, that is actions that were not allowed at a specified step, like recording information in a warehouse's registry. The latter kind of actions can be specified by the user to be evaluated at runtime against the real execution traces.
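One possible way to combine the two kinds of markers into per-task annotations is sketched below in Python; the allowed-action sets, the counter encoding, and the input shapes are illustrative assumptions, not the prototype's actual data model:

```python
from collections import defaultdict

# Actions allowed at each task of the planned model (illustrative).
ALLOWED = {"Handling Start": {"send_mail", "open_order"}}

def annotate(instances, temporal_violations, allowed=ALLOWED):
    """Build per-task annotations: 'd' counts temporal deviations,
    'u' counts actions that were not allowed at that step.

    `instances` maps an instance id (e.g. 'P1') to a list of
    (task, action) pairs; `temporal_violations` maps an instance id
    to the list of tasks whose happened-before relations it violated.
    """
    marks = defaultdict(lambda: {"d": 0, "u": 0})
    for pid, tasks in temporal_violations.items():
        for task in tasks:
            marks[task]["d"] += 1
    for pid, events in instances.items():
        for task, action in events:
            if action not in allowed.get(task, set()):
                marks[task]["u"] += 1
    return dict(marks)

annotations = annotate(
    {"P1": [("Handling Start", "update_registry")]},  # not allowed here
    {"P1": ["Handling Start"], "P2": ["Handling Start"]},
)
print(annotations)  # -> {'Handling Start': {'d': 2, 'u': 1}}
```

The resulting per-task counters correspond to the (d) and (u) tags shown on the workflow of Figure 11.4.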
Fig. 11.5 PROM plug-in for the execution of delta analysis
For the execution of process mining and delta analysis, we implemented a plug-in for the PROM software [27]. In particular, we took the opportunity to use the low-level events collected by PROM to perform the analysis. The plug-in has been integrated into the Eclipse application with the support of the Graphical Modeling Framework (GMF).
Figure 11.5 gives an overview of the software prototype used for the analysis. It shows the same business case used above to explain the approach, loaded into the application's workspace. The current prototype takes three different types of input: a) the definition of a workflow, b) the definition of the rules for each node in the workflow, and c) a set of events generated from several execution traces. Given these inputs, the application parses all the events and annotates the business process with the relevant information. The relevant quadrants of the application in Figure 11.5 are the top and bottom ones. In the top one, the loaded process is visualized. In the bottom one there is the output of all the deviations detected, and the user can select the different views to activate in the workspace.
In the case used to exemplify the approach, two different execution traces have been instantiated by generating events and set as the input of the application. After the analysis has been performed, each node of the loaded process is marked with the deviation information derived from delta analysis. This is a particular scenario, in which the data is analysed ex-post. As noted in Section 11.2, the usefulness of the proposed approach lies in analysing data in real time, with execution traces collected and analysed as actions are performed by the actors of the process.
11.5 Conclusions
In this chapter, we proposed an analysis of methods for the automated extraction of knowledge from event flows. We focused specifically on process mining, that is, reconstructing business processes from event log traces.

Process mining can be important for organisations that want to reconstruct the knowledge hidden in their event logs. Typically, any organisation has the opportunity to collect this kind of information. The advantages are multi-faceted; we referred mostly to two specific areas.

On one side, such knowledge can be used to evaluate whether the high-level business processes are aligned with the business plan models. As such, process mining can be used to see whether the actual behaviour deviates from the expected behaviour. On the other side, the knowledge can be used to detect hidden behaviours (i.e. behaviours not encoded in high-level business processes) inside the organisation. Such behaviours can then be the focus of further analyses to see whether they are really required, whether resources are wasted, or whether process improvement/restructuring opportunities can derive from them.

We proposed an approach based on delta analysis to derive information from low-level event flows and reconstruct the original processes. We showed, in the context of the case study of the ArtDECO project, the GialloRosso winery, how event flows are used to reconstruct the original processes and detect deviations from the planned models.

References
1. Aires da Silva, G., Ferreira, D.R.: Applying Hidden Markov Models to Process Mining.
In: Rocha, A., Restivo, F., Reis, L.P., Torrao, S. (eds.) Sistemas e Tecnologias de Informação: Actas da 4ª Conferência Ibérica de Sistemas e Tecnologias de Informação, pp. 207–210. AISTI/FEUP/UPF (2009)
2. Coman, I., Sillitti, A.: An Empirical Exploratory Study on Inferring Developers’ Activ-
ities from Low-Level Data. In: 19th International Conference on Software Engineering
and Knowledge Engineering (SEKE 2007), Boston, MA, USA, July 9-11 (2007)
3. Coman, I., Sillitti, A.: Automated Identification of Tasks in Development Sessions. In:
16th IEEE International Conference on Program Comprehension (ICPC 2008), Amster-
dam, The Netherlands, June 10-13 (2008)
4. Coman, I., Sillitti, A., Succi, G.: Investigating the Usefulness of Pair-Programming in a
Mature Agile Team. In: 9th International Conference on eXtreme Programming and Ag-
ile Processes in Software Engineering (XP 2008), Limerick, Ireland, June 10-14 (2008)
5. Coman, I., Sillitti, A.: Automated Segmentation of Development Sessions into Task-
related Subsections. International Journal of Computers and Applications 31(3) (2009)
6. Coman, I., Sillitti, A., Succi, G.: A Case-study on Using an Automated In-process Soft-
ware Engineering Measurement and Analysis System in an Industrial Environment. In:
31st International Conference on Software Engineering (ICSE 2009), Vancouver, BC,
Canada, May 16-24 (2009)
7. Cook, J.E., Wolf, A.L.: Discovering Models of Software Processes from Event-Based Data. ACM Transactions on Software Engineering and Methodology 7(3), 215–249 (1998)
8. Ferreira, D., Zacarias, M., Malheiros, M., Ferreira, P.: Approaching Process Mining with
Sequence Clustering: Experiments and Findings. In: Alonso, G., Dadam, P., Rosemann,
M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 360–374. Springer, Heidelberg (2007)
9. Fronza, I., Sillitti, A., Succi, G.: Modeling Spontaneous Pair Programming when New Developers Join a Team. In: 3rd International Symposium on Empirical Software Engineering and Measurement (ESEM 2009), Lake Buena Vista, FL, USA, October 15-16 (2009)
10. Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models
by clustering log traces. IEEE Trans. Knowl. Data Eng. 18(8), 1010–1027 (2006)
11. Greco, G., Guzzo, A., Pontieri, L.: Mining taxonomies of process models. Data &
Knowledge Engineering 67(1), 74–102 (2008)
12. Goedertier, S., Martens, D., Baesens, B., Haesen, R., Vanthienen, J.: Process Mining
as First-Order Classification Learning on Logs with Negative Events. In: ter Hofstede,
A.H.M., Benatallah, B., Paik, H.-Y. (eds.) BPM Workshops 2007. LNCS, vol. 4928, pp.
42–53. Springer, Heidelberg (2008)
13. Günther, C.W., van der Aalst, W.M.P.: Mining Activity Clusters from Low-Level Event Logs. BETA Working Paper Series, WP 165. Eindhoven University of Technology, Eindhoven (2006)
14. Günther, C.W., van der Aalst, W.M.P.: Fuzzy Mining – Adaptive Process Simplification
Based on Multi-perspective Metrics. In: Alonso, G., Dadam, P., Rosemann, M. (eds.)
BPM 2007. LNCS, vol. 4714, pp. 328–343. Springer, Heidelberg (2007)
15. Günther, C.W., Rozinat, A., van der Aalst, W.M.P.: Activity Mining by Global Trace
Segmentation. In: Rinderle-Ma, S., Sadiq, S., Leymann, F. (eds.) BPM 2009. LNBIP,
vol. 43, pp. 128–139. Springer, Heidelberg (2010)
16. Herbst, J.: Dealing with concurrency in workflow induction. In: Proceedings of the 7th
European Concurrent Engineering Conference, Society for Computer Simulation (SCS),
pp. 169–174 (2000)
17. Herbst, J., Karagiannis, D.: Integrating Machine Learning and Workflow Management
to Support Acquisition and Adaptation of Workflow Models. International Journal of
Intelligent Systems in Accounting, Finance and Management 9, 67–92 (2000)
18. Janes, A., Scotto, M., Sillitti, A., Succi, G.: A perspective on non-invasive software
management. In: 2006 IEEE Instrumentation and Measurement Technology Conference
(IMTC 2006), Sorrento, Italy, April 24-27 (2006)
19. Janes, A., Sillitti, A., Succi, G.: Non-invasive software process data collection for expert
identification. In: 20th International Conference on Software Engineering and Knowl-
edge Engineering (SEKE 2008), San Francisco, CA, USA, July 1-3 (2008)
20. Lamma, E., Mello, P., Riguzzi, F., Storari, S.: Applying Inductive Logic Programming
to Process Mining. In: Blockeel, H., Ramon, J., Shavlik, J., Tadepalli, P. (eds.) ILP 2007.
LNCS (LNAI), vol. 4894, pp. 132–146. Springer, Heidelberg (2008)
21. Maruster, L., Weijters, A.J.M.M., Van der Aalst, W.M.P., Van den Bosch, A.: A rule-
based approach for process discovery: Dealing with noise and imbalance in process logs.
Data Mining and Knowledge Discovery 13(1), 67–87 (2006)
22. de Medeiros, A.K.A., Weijters, A.J.M.M., van der Aalst, W.M.P.: Genetic Process Min-
ing: A Basic Approach and Its Challenges. In: Bussler, C.J., Haller, A. (eds.) BPM 2005.
LNCS, vol. 3812, pp. 203–215. Springer, Heidelberg (2006)
23. de Medeiros, A.K.A., Guzzo, A., Greco, G., van der Aalst, W.M.P., Weijters, A.J.M.M.,
van Dongen, B.F., Saccà, D.: Process Mining Based on Clustering: A Quest for Precision. In: ter Hofstede, A.H.M., Benatallah, B., Paik, H.-Y. (eds.) BPM Workshops 2007.
LNCS, vol. 4928, pp. 17–29. Springer, Heidelberg (2008)
24. Rozinat, A., van der Aalst, W.M.P.: Conformance Testing: Measuring the Fit and Appro-
priateness of Event Logs and Process Models. In: Bussler, C.J., Haller, A. (eds.) BPM
2005. LNCS, vol. 3812, pp. 163–176. Springer, Heidelberg (2006)
25. Schimm, G.: Mining exact models of concurrent workflows. Comput. Ind. 53, 265–281 (2004)
26. Scotto, M., Sillitti, A., Succi, G., Vernazza, T.: Dealing with Software Metrics Collection
and Analysis: a Relational Approach. Studia Informatica Universalis, Suger 3(3), 343–
366 (2004)
27. Sillitti, A., Janes, A., Succi, G., Vernazza, T.: Collecting, Integrating and Analyzing Soft-
ware Metrics and Personal Software Process Data. In: Proceedings of the 29th EUROMI-
CRO Conference (2003)
28. Van der Aalst, W.M.P., Weijters, A.J.M.M.: Process mining: a research agenda. Comput. Ind. 53,
231–244 (2002)
29. Van der Aalst, W.M.P., Van Dongen, B.F., Herbst, J., Maruster, L., Schimm, G., Weijters,
A.J.M.M.: Workflow mining: A survey of issues and approaches. Data & Knowledge
Engineering 47(2), 237–267 (2003)
30. Van der Aalst, W.M.P., Weijters, A.J.M.M., Maruster, L.: Workflow Mining: Discovering
Process Models from Event Logs. IEEE Transactions on Knowledge and Data Engineer-
ing 16(9), 1128–1142 (2004)
31. Van der Aalst, W.M.P., Reijers, H., Song, M.: Discovering Social Networks from Event
Logs. Computer Supported Cooperative Work 14(6), 549–593 (2005)
32. van der Aalst, W.M.P., de Medeiros, A.K.A., Weijters, A.J.M.M.: Genetic Process Min-
ing. In: Ciardo, G., Darondeau, P. (eds.) ICATPN 2005. LNCS, vol. 3536, pp. 48–69.
Springer, Heidelberg (2005)
33. Song, M., Günther, C.W., van der Aalst, W.M.P.: Trace Clustering in Process Mining.
In: Ardagna, D., Mecella, M., Yang, J. (eds.) BPM 2008 Workshops. LNBIP, vol. 17, pp.
109–120. Springer, Heidelberg (2009)
34. Weijters, A.J.M.M., Van der Aalst, W.M.P.: Rediscovering Workflow Models from
Event-Based Data using Little Thumb. Integrated Computer-Aided Engineering 10(2),
151–162 (2003)
35. Wen, L., Wang, J., Van der Aalst, W.M.P., Wang, Z., Sun, J.: A Novel Approach for
Process Mining Based on Event Types. BETA Working Paper Series, WP 118. Eindhoven
University of Technology, Eindhoven (2004)