Listen to me: Improving Process Model
Matching through User Feedback
Christopher Klinkmüller1, Henrik Leopold2, Ingo Weber3,4, Jan Mendling2,
and André Ludwig1
1 Information Systems Institute, University of Leipzig, Leipzig, Germany*
{klinkmueller,ludwig}@wifa.uni-leipzig.de
2 Wirtschaftsuniversität Wien, Augasse 2-6, A-1090 Vienna, Austria
{henrik.leopold,jan.mendling}@wu.ac.at
3 Software Systems Research Group, NICTA, Sydney, Australia**
ingo.weber@nicta.com.au
4 School of Computer Science & Engineering, University of New South Wales
Abstract. Many use cases in business process management rely on the
identification of correspondences between process models. However, the
sparse information in process models makes matching a fundamentally
hard problem. Consequently, existing approaches yield a matching qual-
ity which is too low to be useful in practice. Therefore, we investigate
incorporating user feedback to improve matching quality. To this end,
we examine which information is suitable for feedback analysis. On this
basis, we design an approach that performs matching in an iterative,
mixed-initiative fashion: we determine correspondences between two
models automatically, let the user correct them, and analyze this input
to adapt the matching algorithm. Then, we continue with matching the
next two models, and so forth. This approach improves the matching
quality, as showcased by a comparative evaluation. From this study, we
also derive strategies on how to maximize the quality while limiting the
additional effort required from the user.
Keywords: BPM, process similarity, process model matching
1 Introduction
More and more organizations use process models as a tool for managing their
operations. Typical use cases for process models range from process documen-
tation to enactment through a workflow system. Once a repository of process
* The work presented in this paper was partly funded by the German Federal Ministry of Education and Research under the projects LSEM (BMBF 03IPT504X) and LogiLeit (BMBF 03IPT504A).
** NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.
models reaches a certain size, there are several important use cases which re-
quire the comparison of process models. Examples include validating a technical
implementation of a business process against a business-centered specification
[2], process model search [6, 13, 10], or identifying clones in process models [7].
The demand for techniques that are capable of comparing process models
has led to the development of a variety of process model matchers. These match-
ers, e.g. [24, 14, 11], are usually designed for universal applicability. That is, they
are based on common matching metrics used to assess pairs of activities and
define classification rules which are believed to provide meaningful indications
of similarity for activities in any pair of process models. However, the insuffi-
cient accuracy of these approaches [3] suggests that the assumption of universal
applicability is too strict, and might hinder effective application in practice.
For this reason, we take up the idea of an adaptive matcher. A related idea
was discussed in [23] where characteristics of a certain process model collec-
tion are analyzed to select well-suited matchers for the collection. In contrast
to this approach, we devise an iterative, mixed-initiative approach that utilizes
user feedback to constantly adapt the matching algorithm. It works by presenting
automatically determined correspondences between two models to the user, and
asking her to add missing and remove incorrect ones. The matching algorithm
is then adjusted by analyzing the feedback and the next model pair is matched.
The contributions of this paper are threefold. First, we investigate which
information in process models can reliably be used for feedback analysis. For this
purpose, we derive indicators from the literature which provide information on
whether activities correspond or not and assess their correlation to the classes
of corresponding and non-corresponding activity pairs. The results also offer
insights into the challenges that process model matching faces. Second, based on
this analysis we introduce an approach that integrates user feedback to improve
the matching quality. Third, we perform a comparative evaluation and, based
on the results, derive strategies to minimize the user workload while maximizing
the quality improvements.
The rest of the paper is organized as follows. Section 2 defines process model
matching and introduces the state of the art. Section 3 provides an overview of
correspondence indicators derived from related research and investigates their
potential for user feedback analysis. Based on this survey, Section 4 defines our
approach that incorporates feedback. Section 5 evaluates the approach using
simulated feedback from gold standards. Finally, Section 6 concludes the paper.
2 Foundations: Problem Illustration and Related Work
This section introduces the problem of process model matching in Section 2.1
and reviews the state of the art in Section 2.2.
2.1 Problem Illustration
In accordance with ontology matching [8], process model matching is the pro-
cess of identifying an alignment between two process models. In this paper, a
process model is regarded as a business process graph as defined in [4]: a process
model consists of labeled nodes of different types and directed edges connecting
them. While the edges define the control flow of the process, the nodes express
activities, gateways, etc. This abstract notion of process models permits the ap-
plication of our work to different notations like Petri nets, Event-driven Process
Chains (EPCs) or Business Process Model and Notation (BPMN).
Definition 1 (Process model, Set of activities). Let $\mathcal{L}$ be a set of labels and $\mathcal{T}$ be a set of types. A process model $p$ is a tuple $(N, E, \lambda, \tau)$, in which:
– $N$ is the set of nodes;
– $E \subseteq N \times N$ is the set of edges;
– $\lambda: N \to \mathcal{L}$ is a function that maps nodes to labels; and
– $\tau: N \to \mathcal{T}$ is a function that assigns types to nodes.

For a given process model $p = (N, E, \lambda, \tau)$, the set $A = \{a \mid a \in N \wedge \tau(a) = \mathit{activity}\}$ is called the set of activities, where we require $\forall a \in A, n \in N: |\{n \mid (a, n) \in E\}| \leq 1$ and $|\{n \mid (n, a) \in E\}| \leq 1$. Furthermore, we require that there exists exactly one start node ($\exists n \in N, \forall n_i \in N: (n_i, n) \notin E$) and exactly one end node ($\exists n \in N, \forall n_i \in N: (n, n_i) \notin E$).
Given two process models $p_1$, $p_2$ and their activity sets $A_1$, $A_2$, an alignment is a set of correspondences, i.e. activity pairs $(a_1, a_2)$ with $a_1 \in A_1$ and $a_2 \in A_2$ that represent similar functionality. Correspondences between sets of activities $(A_1^*, A_2^*)$ with $A_1^* \subseteq A_1$ and $A_2^* \subseteq A_2$ are expressed as sets of correspondences between all activity pairs in $A_1^*, A_2^*$: $\{(a_1^*, a_2^*) \mid a_1^* \in A_1^* \wedge a_2^* \in A_2^*\}$.
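To make these definitions concrete, the following sketch shows one possible Python representation; the class layout and field names are illustrative choices of ours, not part of the formal definition.

```python
# A minimal sketch of Definition 1; field names and the "activity" type
# tag are illustrative, not prescribed by the definition.
from dataclasses import dataclass

@dataclass
class ProcessModel:
    nodes: set            # N
    edges: set            # E ⊆ N × N, as (source, target) pairs
    labels: dict          # λ: N → L
    types: dict           # τ: N → T, e.g. "activity", "gateway", "event"

    def activities(self):
        """A = {a ∈ N | τ(a) = activity}."""
        return {n for n in self.nodes if self.types.get(n) == "activity"}

# An alignment between two models is then simply a set of activity pairs,
# e.g. the one-to-many correspondence of the running example below:
alignment = {("alpha1", "beta1"), ("alpha1", "beta2")}
```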
Fig. 1 shows an alignment between two university admission process models
which will be used as a running example throughout the paper. Both processes
represent the scenario of receiving, evaluating, and deciding about an applica-
tion. Hence, activities from one process related to one of these tasks are matched
with activities dealing with the same task in the other process. While α2 and β3 constitute a one-to-one correspondence, β6 is not matched. Moreover, there are two complex correspondences: a one-to-many correspondence formed by α1, β1, and β2, and a many-to-many correspondence comprised of α3, α4, α5, β4, and β5.
Applying a matcher to automatically determine alignments will only be use-
ful if it is of high quality, i.e. if it meets the user’s expectations. This will be the
[Figure: Process A with activities α1 “Check Application”, α2 “Evaluate Application”, α3 “Prepare notification”, α4 “Publish notification”, α5 “Register applicant”; Process B with activities β1 “Documents Complete?”, β2 “Documents in Time?”, β3 “Is Student Qualified?”, β4 “Reject Student”, β5 “Accept Student”, β6 “Archive Documents”.]
Fig. 1: An example for a process model alignment
case when the number of correctly identified correspondences (true positives) is
high. Consequently, as few correspondences as possible should be missed (false
negatives), while the results of a good matcher also contain few erroneous cor-
respondences (false positives).
2.2 Related Work
The foundations for research in process model matching can be found in various
works on schema and ontology matching [1, 8] as well as in research on process
model similarity. Such process similarity techniques exploit different sources of
information such as text [5, 12], model structure [9, 4], or execution semantics
[13, 26]. An overview is provided in [5].
Approaches for process model matching typically derive attributes of activ-
ity pairs from these techniques and aggregate these attributes in a predefined
static classifier in different ways (see e.g. [24, 14, 11]). In [23], the idea of a more
dynamic assembly of matchers is discussed: matchers are associated with properties of process model pairs, and by evaluating these properties within a model collection, appropriate matchers are selected and composed.
However, up until now there is no automated technique for process match-
ing available that achieves results comparable to those in the field of ontology
matching. In fact, a comparison of techniques developed by different researchers
revealed that the best matcher achieved an f-measure of 0.45 on the test data
sets [3]. This calls for improving precision and recall of existing techniques. To
this end, we investigate suitable matching indicators and user feedback.
3 Information for User Feedback Analysis
The goal of analyzing user feedback is to find models that can predict user deci-
sions with high success. Therefore, indicators whose values are highly correlated
to the decisions, i.e., whether activity pairs correspond or not, are needed [19].
For example, label similarity is seen as a good indicator: activity pairs with a
high similarity tend to correspond; pairs with a low similarity tend to not corre-
spond. As various information sources, e.g. structure and execution semantics,
can be considered, we systematically identify suitable indicators in a two-step
approach: we first present indicators from the literature and our own prior work
(Section 3.1) and investigate their potential for feedback analysis (Section 3.2).
3.1 Indicator Definitions
Matching approaches rely on various characteristics of activities to judge whether
they correspond. From analyzing related work, especially the approaches evalu-
ated in the matching contest 2013 [3], we identified five categories: position and
neighborhood based on the model structure, label specificity and label semantics
referring to the labels, and execution semantics. Note that some approaches rely on a certain modeling notation or do not explicitly define the characteristics. In order to be able to assess whether these characteristics can be used for feedback analysis, we present indicators adapted to our process model definition.
To this end, we define indicators as similarity functions from the set of activity pairs $A_1 \times A_2$ to the interval $[0,1]$: a value of 0 indicates total dissimilarity, a value of 1 identity, and values in between a degree of similarity. Most of the presented indicators utilize an attribute function $at: A \to \mathbb{R}_{\geq 0}$, which returns a value measured with regard to a certain activity property. Those indicators are referred to as attribute indicators. Given an activity pair, they indicate the similarity of these activities with regard to a certain attribute.
Definition 2 (Attribute indicator). Let $A_1$, $A_2$ be two sets of activities and $a_1 \in A_1$, $a_2 \in A_2$ be two activities. The attribute indicator $i_{at}$ is then defined as:

$$i_{at}(a_1, a_2) = \begin{cases} 0 & \max\limits_{a \in A_1}(at(a)) = 0 \,\vee\, \max\limits_{a \in A_2}(at(a)) = 0 \\ 1 - \left|\frac{at(a_1)}{\max\limits_{a \in A_1}(at(a))} - \frac{at(a_2)}{\max\limits_{a \in A_2}(at(a))}\right| & \text{else} \end{cases}$$
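As an illustration, the following sketch implements the attribute indicator for an arbitrary attribute function `at`; all names are ours.

```python
# A sketch of the attribute indicator i_at of Definition 2. `at` is any
# attribute function A → R≥0, e.g. the distance to the start node.
def attribute_indicator(at, a1, a2, activities1, activities2):
    max1 = max(at(a) for a in activities1)
    max2 = max(at(a) for a in activities2)
    if max1 == 0 or max2 == 0:
        return 0.0                              # first case of Definition 2
    # otherwise compare the attribute values, normalized per model
    return 1.0 - abs(at(a1) / max1 - at(a2) / max2)

# σ|label| for (α1, β1) from Table 1: lengths 2 and 2, maxima 2 and 3
label_length = {"alpha1": 2, "alpha2": 2, "beta1": 2, "beta3": 3}.get
print(attribute_indicator(label_length, "alpha1", "beta1",
                          {"alpha1", "alpha2"}, {"beta1", "beta3"}))  # ≈ 0.67
```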
In the following, we describe various attributes with regard to the general
attribute indicator and define other indicators for each of the five categories.
Position. Process models might represent the same abstract process. In such
cases, it is more likely for activities at similar positions to correspond than for
activities whose positions differ. This idea is pursued in the Triple-S approach,
which takes the relative position of nodes in the process models as a similarity
indicator [3]. According to our definition, each process model has one start and
one end node. Thus, we view these nodes as anchors and consider the distances
to these nodes, i.e. the smallest number of activities on paths from a node to the
start or end node, as attributes to define the attribute indicators σ_pos^start and σ_pos^end.
The position of an activity can also be defined with reference to the Refined
Process Structure Tree (RPST) [23, 24]. The RPST is a hierarchical representa-
tion of a process model consisting of single-entry-single-exit fragments [20]. Each
RPST fragment belongs to one of four structured classes: trivial fragments (T)
consist of two nodes connected with a single edge; a Bond (B) represents a set of
fragments sharing two common nodes; polygons (P) capture sequences of other
fragments; in case a fragment cannot be classified as trivial, bond, or polygon,
it is categorized as a rigid (R). Fig. 2 presents the RPST of Process A.
The idea is to view the depth of the non-trivial fragments that contain the activity as an attribute for the position in the model structure (σ_pos^rpst), i.e., the deeper an activity is located in the RPST, the more decision points need to be passed to get to the activity. Activities have at most one incoming and at most one outgoing edge. Thus, they cannot be an entry or exit node of a non-trivial fragment, and the trivial fragments they belong to have the same depth.
[Figure: the RPST of Process A with trivial fragments T1–T9, polygons P1–P4, and bond B1.]
Fig. 2: The fragments of the admission process of university A and the RPST
Table 1: Attribute indicators for an activity pair from the running example.

Indicator        α1   max a∈A_A   β1   max a∈A_B   i(α1, β1)
σ_pos^start       0       3        0       3         1.00
σ_pos^end         3       3        3       3         1.00
σ_pos^rpst        2       3        3       3         0.67
σ_neigh^model     1       3        2       4         0.83
σ_neigh^rpst      2       2        0       0         0.00
σ|label|          2       2        2       3         0.67
σ_↝               4       4        4       4         1.00
σ_+               0       0        0       1         0.00
σ_∥               0       1        1       1         0.00
Table 1 illustrates the position indicators for (α1,β1) from the running ex-
ample. Both activities have a distance to the start event of 0. As the structure
of both processes is similar they also have the same distance to the end node.
Thus, both attribute indicators are 1. As activity β1 is located in a parallel block and α1 is not, their RPST positions differ, leading to an indicator value of 0.67.
Neighborhood. Whereas the position attributes consider the global location
of activities in a model, we next consider the local structure. In this regard, the
Triple-S approach [3] considers the ratios of incoming and outgoing edges. As
our definition requires activities to have at most one incoming and at most one
outgoing edge, these ratios would not provide much information. Instead, we
define the structural neighborhood indicator (σ_neigh^model) based on the undirected
version of the process model. We count the activities that are connected to an
activity by at least one sequence of distinct edges not containing any activities.
We also consider the RPST for comparing the local structure of activities
and define the RPST neighborhood indicator (σ_neigh^rpst). To this end, we determine the trivial fragments an activity is part of and count their sibling fragments.
Table 1 also shows examples for the neighborhood indicators. α1 has one structural neighbor (α2), and in Process A, α3 has the most neighbors (α2, α4, α5). Similarly, β1 has two neighbors (β2, β3), and the maximum is four neighbors for β3 (β1, β2, β4, β5). Thus, the structural neighborhood of both activities is similar (0.83). The RPST neighborhood indicator is 0, because for each activity in Process B there are two trivial fragments forming a polygon. As each of these polygons does not comprise any further fragments, all activities in Process B have an RPST neighborhood size of 0.
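A possible implementation of the structural neighborhood count is a breadth-first search on the undirected model that passes through gateways and events but stops at activities; the following sketch (with illustrative names) shows the idea.

```python
# A sketch of the σ_neigh^model attribute: count the activities reachable
# from `start` via paths of distinct edges that contain no other activity.
from collections import deque

def structural_neighbors(edges, types, start):
    adjacent = {}
    for u, v in edges:                          # undirected version of the model
        adjacent.setdefault(u, set()).add(v)
        adjacent.setdefault(v, set()).add(u)
    neighbors, seen, queue = set(), {start}, deque([start])
    while queue:
        for node in adjacent.get(queue.popleft(), ()):
            if node in seen:
                continue
            seen.add(node)
            if types.get(node) == "activity":
                neighbors.add(node)             # stop: paths must not contain activities
            else:
                queue.append(node)              # continue through gateways/events
    return len(neighbors)

# α1 in Process A: start → α1 → α2, so α1 has exactly one structural neighbor
edges = {("start", "a1"), ("a1", "a2")}
types = {"start": "event", "a1": "activity", "a2": "activity"}
print(structural_neighbors(edges, types, "a1"))  # 1
```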
Label Specificity. According to an analysis of matching challenges in [11], label
specificity (i.e., one label containing more detailed information than another)
had a big impact on the correct identification of correspondences. Thus, we
assume activities with a similar specificity to correspond more likely than those
with different specificities. An attribute indicator in this regard is defined upon
the label length (σ|label|), i.e., the more words a label contains, the more specific
information it provides. It is considered for matcher selection in [23] and for label
pruning in [11]. The label length is defined as the number of individual words
in a label without common stop words like “the”, “if”, and “to”. The individual
words of an activity label are returned by the function $\Omega: \mathcal{L} \to \mathcal{P}(\mathcal{W})$. Table 1 shows that $|\Omega(\alpha_1)| = |\Omega(\beta_1)| = 2$. Moreover, the maximum label length in Process A is 2. In Process B, β3 (“Is student qualified”) has the longest label of length 3, whereas β2 (“Documents in Time?”) consists of two individual words, because “in” is a stop word. Thus, the label length indicator is 0.67.

Table 2: Word occurrences and term frequencies in the admission processes

                  check   application   documents   complete
occurrences         1          2            3           1
term frequency    0.33       0.67         1.00        0.33
We further assume frequently occurring words to be more specific than less
frequently occurring words. This idea is also pursued for label pruning in [11].
Thus, we rely on the term frequency which is well known in information retrieval.
It is defined as the number of occurrences of a certain word in a document. On
the one hand, we take the union of all activity labels in the model collection as
a document and define the function $tf_{coll}: \mathcal{W} \to [0,1]$ to return the number of a word's occurrences in the model collection divided by the maximum number determined for a word in the collection. On the other hand, we define $tf_{2p}: \mathcal{W} \to [0,1]$ by using all activity labels in the examined model pair to create the document. Based thereon, we define the term frequency indicators σ_tf^coll and σ_tf^2p.
Definition 3 (Term frequency indicators). Let $a_1$, $a_2$ be two activities. Then, the term frequency indicators $\sigma_{tf}^{coll}$ and $\sigma_{tf}^{2p}$ are defined as:

$$\sigma_{tf}^{coll}(a_1, a_2) = 1 - \left|\frac{1}{|\Omega(a_1)|} \sum_{\omega \in \Omega(a_1)} tf_{coll}(\omega) - \frac{1}{|\Omega(a_2)|} \sum_{\omega \in \Omega(a_2)} tf_{coll}(\omega)\right|$$

$$\sigma_{tf}^{2p}(a_1, a_2) = 1 - \left|\frac{1}{|\Omega(a_1)|} \sum_{\omega \in \Omega(a_1)} tf_{2p}(\omega) - \frac{1}{|\Omega(a_2)|} \sum_{\omega \in \Omega(a_2)} tf_{2p}(\omega)\right|$$
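For illustration, the following sketch computes σ_tf^2p for the running example; it assumes the labels are already tokenized and stop-word-filtered, i.e., the output of Ω, and all helper names are ours.

```python
# A sketch of σ_tf^2p (Definition 3) on the model pair of the running example.
from collections import Counter

def tf_indicator(words1, words2, pair_labels):
    counts = Counter(w for label in pair_labels for w in label)
    top = max(counts.values())
    avg_tf = lambda ws: sum(counts[w] / top for w in ws) / len(ws)
    return 1.0 - abs(avg_tf(words1) - avg_tf(words2))

pair_labels = [                                  # Ω applied to all labels of the pair
    ["check", "application"], ["evaluate", "application"],
    ["prepare", "notification"], ["publish", "notification"],
    ["register", "applicant"],                   # Process A
    ["documents", "complete"], ["documents", "time"],
    ["student", "qualified"], ["reject", "student"],
    ["accept", "student"], ["archive", "documents"],  # Process B
]
print(tf_indicator(["check", "application"], ["documents", "complete"],
                   pair_labels))                 # ≈ 0.83, as in the text
```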
Table 2 illustrates the model pair based indicator. “Documents” occurs most often in the pair. Thus, the term frequencies are yielded by dividing the occurrence values by 3. As the average term frequency of α1 (“Check Application”) is 0.50 and for β1 (“Documents Complete?”) it is 0.67, the indicator yields 0.83.
Label Semantics. Every matching approach relies on the calculation of label
similarities as an indicator to which degree activities constitute the same func-
tionality. Prior research has shown that the basic bag-of-words similarity [11]
yields good results [3]. It calculates a symmetric similarity score σ.ω :W2→
[0..1] for each pair of individual words (ω1, ω2) with ω1∈Ω(a1) and ω2∈Ω(a2).
Based thereon, it is then defined as the mean of the maximum similarity score
each individual word has with any of the individual words from the other label.
Definition 4 (Basic bag-of-words similarity). Let $a_1$, $a_2$ be two activities. The basic bag-of-words similarity $\sigma.\lambda$ is then defined as:

$$\sigma.\lambda(a_1, a_2) = \frac{\sum\limits_{\omega_1 \in \Omega(a_1)} \max\limits_{\omega_2 \in \Omega(a_2)} \sigma.\omega(\omega_1, \omega_2) \;+\; \sum\limits_{\omega_2 \in \Omega(a_2)} \max\limits_{\omega_1 \in \Omega(a_1)} \sigma.\omega(\omega_1, \omega_2)}{|\Omega(a_1)| + |\Omega(a_2)|}$$
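A direct transcription of Definition 4 might look as follows; `word_sim` stands for an arbitrary word similarity σ.ω, and the example reuses the precomputed word scores of Table 3.

```python
# A sketch of the basic bag-of-words similarity (Definition 4).
def bag_of_words_similarity(words1, words2, word_sim):
    best1 = sum(max(word_sim(w1, w2) for w2 in words2) for w1 in words1)
    best2 = sum(max(word_sim(w1, w2) for w1 in words1) for w2 in words2)
    return (best1 + best2) / (len(words1) + len(words2))

# Reproducing Table 3 with its precomputed word similarities:
scores = {("check", "documents"): 0.78, ("check", "complete"): 0.25,
          ("application", "documents"): 0.11, ("application", "complete"): 0.18}
sim = lambda w1, w2: scores[(w1, w2)]
print(bag_of_words_similarity(["check", "application"],
                              ["documents", "complete"], sim))  # ≈ 0.50 (cf. Table 3)
```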
Table 3 illustrates the computation of the basic bag-of-words similarity for α1 (“Check Application”) and β1 (“Documents Complete?”). To compute the similarity of a pair of words, we relied on the maximum of the Levenshtein similarity [15] and the Lin similarity [16]. This measure sees high values in both syntax (Levenshtein) and semantics (Lin) as evidence for similarity.

Table 3: Example for the basic bag-of-words similarity

              documents   complete   max
check            0.78       0.25     0.78
application      0.11       0.18     0.18
max              0.78       0.25     σ.λ = 0.50
Behavior. Lastly, there are approaches that account for the behavioral context
of activities within a process model. Such behavioral attributes are proposed
as indicators for matcher selection [23], considered for probabilistic match opti-
mization [14] and also implemented in the ICoP framework [21]. The idea is that
corresponding activity pairs show similar characteristics during process execu-
tion, whereas non-corresponding pairs do not. Therefore, we rely on the notion
of behavioral profiles [22] which comprise three relations between activities in
a process model defined upon the set of all possible execution sequences. Two activities are in strict order (a1 ↝ a2) if a2 is executed after a1 in all execution sequences. They are exclusive (a1 + a2) if no sequence contains both activities. Lastly, they are interleaving (a1 ∥ a2) if there are sequences in which a1 occurs before a2 and there are sequences in which a2 occurs before a1. For each type of relation, we count the number of relations the given activity participates in. Based on these counts, we define the attribute indicators σ_↝, σ_+, and σ_∥, which are illustrated in Table 1, too. While α1 and β1 have an identical number of strict order relations (their execution can be followed by the execution of up to four activities), they do not share similar characteristics with regard to the other behavioral attributes. On the one hand, there are no exclusive activities in Process A at all. Thus, the maximum in Process A and the according attribute indicator yield a value of 0. On the other hand, there is one interleaving relation in each process (α4 ∥ α5 and β1 ∥ β2). As β1 is part of one of these relations and α1 is not, the according indicator is 0.
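For illustration, the following sketch derives the three relation counts directly from a given set of execution sequences, assuming loop-free sequences in which each activity occurs at most once; practical implementations compute behavioral profiles from the model itself [22].

```python
# A sketch of the behavioral relation counts per activity (strict order ↝,
# exclusive +, interleaving ∥), computed from execution sequences.
from itertools import combinations

def relation_counts(sequences, activities):
    counts = {a: {"strict": 0, "exclusive": 0, "interleaving": 0}
              for a in activities}
    for a, b in combinations(sorted(activities), 2):
        joint = [s for s in sequences if a in s and b in s]
        if not joint:
            relation = "exclusive"                       # a + b
        elif all(s.index(a) < s.index(b) for s in joint) or \
             all(s.index(b) < s.index(a) for s in joint):
            relation = "strict"                          # a ↝ b or b ↝ a
        else:
            relation = "interleaving"                    # a ∥ b
        counts[a][relation] += 1
        counts[b][relation] += 1
    return counts

# β1 and β2 lie on parallel branches, so both orders occur:
sequences = [["b1", "b2", "b3"], ["b2", "b1", "b3"]]
print(relation_counts(sequences, {"b1", "b2", "b3"}))
```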
3.2 Applicability Assessment
We now use these indicators to analyze whether the information sources are
suitable to derive models that can predict a user’s decisions. Thus, we examine
whether there is a correlation between an indicator’s values and the classes.
As the suitability of an indicator cannot be predicted in general, it must be
estimated with regard to particular data sets (i.e., process collections) for which
the set of correspondences is known (i.e., a gold standard of correspondences
exists). To this end, we used the two process collections and respective gold
standards from the matching contest in 2013 [3]: processes on birth certificates
and university admission. More precisely, we took the set of all corresponding and
the set of all non-corresponding activity pairs for both data sets as representative
Table 4: p-values of the Kolmogorov–Smirnov test for the birth certificate (first row) and the university admission (second row) data sets; values below the significance level of 0.01 are marked with *.

              σ_pos^start  σ_pos^end  σ_pos^rpst  σ_neigh^model  σ_neigh^rpst  σ|label|  σ_tf^coll  σ_tf^2p   σ.λ     σ_↝     σ_+    σ_∥
birth cert.      0.001*      0.010      0.967        0.054          0.010       0.581     0.000*     0.111   0.000*  0.000*  0.111  0.211
univ. adm.       0.000*      0.367      0.155        0.286          0.468       0.210     0.016      0.699   0.000*  0.001*  0.864  0.393
samples for both classes. At this point, it should be noted that some of the process models in the university admission data set are not sound, although soundness is a necessary prerequisite for computing the behavioral attributes. Thus, we only considered the sound university admission models for these attributes.
To assess the correlation of classes and indicator values, we first examined the
distributions of indicator values within both classes. The rationale is that classes
can only be assigned to value ranges if the values are distributed differently
across the classes. Therefore, we randomly drew 100 activity pairs from each
class per attribute. The reason is that the number of non-corresponding activity
pairs is roughly 30 times as high as the number of corresponding pairs in both
data sets, which would distort our analysis. Next, we conducted a two-sided
Kolmogorov-Smirnov [17] test at a significance level of 0.01 with these samples.
The null hypothesis of this test is that the examined distributions are equal; it is rejected if the yielded p-value is lower than the significance level.
Table 4 summarizes the p-values yielded for each attribute. Marked values are below the significance level.
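The distribution comparison is a standard two-sample test; as an illustration, the following sketch runs it with SciPy (a tooling choice of ours, the paper does not prescribe an implementation) on synthetic samples.

```python
# A sketch of the per-indicator distribution check: 100 indicator values per
# class, two-sided two-sample Kolmogorov-Smirnov test at significance 0.01.
# The samples below are synthetic placeholders, not the paper's data.
import random
from scipy.stats import ks_2samp

random.seed(0)
corresponding = [random.betavariate(5, 2) for _ in range(100)]      # class c
non_corresponding = [random.betavariate(2, 5) for _ in range(100)]  # class n

statistic, p_value = ks_2samp(corresponding, non_corresponding)
print(p_value, "reject equal distributions" if p_value < 0.01 else "keep")
```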
As can be seen from the table, there are only three attributes (σ_pos^start, σ.λ, and σ_↝) for which the null hypothesis is rejected in both cases. From this analysis, these three attributes seem suitable for classification, but we will also consider σ_tf^coll as its p-values only marginally infringe the test conditions.
We further substantiated our analysis by investigating how well each class
can be assigned to a value range of an indicator. To this end, we measured the information gain [19], a well-established measure from statistics, as an indicator for the entropy of class assignments within subsets of activity pairs with regard
to all pairs. More precisely, we calculated the values of all activity pairs for each
of the four attributes (σ_pos^start, σ.λ, σ_tf^coll, σ_↝). We then determined two subsets of pairs with regard to one of the attributes and to a threshold. For all pairs in the first subset the attribute value is smaller than the threshold, whereas the values of pairs in the second subset are larger. We considered all possible separations of activity pairs that satisfied this rule and chose the separation with the highest information gain for each attribute. The rationale is that the respective subsets constitute the best separation of corresponding and non-corresponding pairs with regard to the considered attribute.
Table 5: Information gains for the selected attributes for the birth certificate (first row) and the university admission (second row) data sets.

                σ.λ    σ_tf^coll    σ_↝    σ_pos^start
birth cert.    0.056     0.023     0.016      0.005
univ. adm.     0.027     0.010     0.007      0.002
[Figure: distributions of indicator values for σ.λ, σ_tf^coll, and σ_pos^start.]
Fig. 3: Box plots for corresponding (c) and non-corresponding (n) activity pairs representing three indicators for the birth certificate (upper row) and the university admission (lower row) data sets.
As can be seen from Table 5, σ.λ yields the highest and σ_pos^start the lowest information gain; σ_tf^coll and σ_↝ are in between. To convey a better intuition for this measure, Fig. 3 shows the distribution of the relative value frequencies for σ.λ and σ_pos^start as well as for σ_tf^coll as a representative for the indicators with medium information gains.
According to these box plots a threshold at about 0.4 would yield a good
classifier for σ.λ as many corresponding and only a few non-corresponding activ-
ity pairs have values larger than this threshold. For the other indicators, whose
distributions differ only slightly, there is no threshold which would classify that
well. Thus, we only consider label similarity in terms of σ.λ for user feedback analysis; in the next section, we introduce a mixed-initiative approach which aims at increasing the applicability of σ.λ for separating activity pairs.
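As an illustration, the threshold search described above could be implemented as follows; `pairs` holds the values of one indicator together with the gold-standard class, and all helper names are ours.

```python
# A sketch of the best-split search by information gain [19]: try every
# threshold between two distinct indicator values and keep the split whose
# subsets have the lowest weighted class entropy.
from math import log2

def entropy(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)                 # share of corresponding pairs
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def best_split(pairs):                            # [(indicator value, corresponds?)]
    pairs = sorted(pairs)
    labels = [c for _, c in pairs]
    base = entropy(labels)
    best_gain, best_threshold = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no threshold separates equal values
        gain = base - (i * entropy(labels[:i])
                       + (len(pairs) - i) * entropy(labels[i:])) / len(pairs)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_gain, best_threshold

print(best_split([(0.2, False), (0.3, False), (0.5, True), (0.7, True)]))
# (1.0, 0.4): a threshold of 0.4 separates the two classes perfectly
```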
4 Word Similarity Adaptation
The incorporation of user feedback opens the opportunity to analyze the user’s
decisions and adjust the matching process accordingly. Here, we rely on correc-
tions made by the user to proposed alignments. To this end, we let the user select a pair of process models and automatically determine an alignment. This alignment is presented to the user, who is asked to remove incorrect and add missing correspondences.
These corrections are passed to the algorithm which examines the feedback and
adapts its classification mechanism. Afterwards, the next matching process can be started by the user. Fig. 4 illustrates this basic approach.

[Figure: the user selects a model pair and corrects the proposed alignment; the algorithm determines alignments and analyzes the user feedback.]
Fig. 4: Basic mixed-initiative approach to learning
As outlined in Section 3, we only consider the basic bag-of-words similarity
σ.λ for correspondence identification. Given a predefined threshold we classify
all activity pairs with a basic bag-of-words similarity score higher than or equal
to the threshold as correspondences.
Although our analysis shows this indicator to have the most desirable pro-
perties, there will still be false positives and false negatives leading to an unsat-
isfactory matching quality [3]. Hence, it is the goal of the feedback analysis to
understand why mistakes were made and how they could have been avoided.
With regard to the matching process, a false positive was suggested because
the similarity of the activity pair was estimated too high, i.e., it should have
been lower than the threshold. In case of a false negative, it is the other way
around, i.e., the similarity should have been higher than the threshold. The
main reasons for such wrong assessments do not directly originate in the ba-
sic bag-of-words similarity, but in the underlying word similarity measure σ.ω.
Those measures are either syntactic, not considering word meaning, or semantic, relying on external sources of knowledge like lexical databases or corpora
[18]. As the creation of such databases or corpora incurs huge manual effort,
matchers usually rely on universal ones. In both cases, i.e. syntactic matching or
semantic matching using universal corpora, the word similarity measures do not
sufficiently account for domain-specific information, e.g., technical vocabulary
or abbreviations, and thus introduce errors.
Consequently, when the user feedback indicates a misclassification of an ac-
tivity pair, our learning approach checks which pairs of words contributed to
that misclassification. According to the definition of the basic bag-of-words sim-
ilarity, a word pair contributes to an activity pair classification each time it
yields the highest similarity score for one word in the respective activity labels.
Therefore, in order to adjust the word similarities to the domain characteristics
of the considered process model collection, we decrease the similarity of a pair of
words whenever it contributed to a false positive, and increase the similarity for
a false negative. We do so by defining two counting functions: $\gamma_{fp}: \mathcal{W}^2 \to \mathbb{N}$ returns the number of counted false positive contributions for a word pair, and $\gamma_{fn}: \mathcal{W}^2 \to \mathbb{N}$ analogously for false negative contributions. Based on these counters, we introduce a word similarity correction term.
Definition 5 (Word similarity correction). Let $\omega_1$, $\omega_2$ be two words. Furthermore, let $\rho_{fp}, \rho_{fn} \in \mathbb{R}$ be two predefined learning rates. The correction function $\delta: \mathcal{W}^2 \to \mathbb{R}$ is then defined as:

$$\delta(\omega_1, \omega_2) := \rho_{fp} \times \gamma_{fp}(\omega_1, \omega_2) + \rho_{fn} \times \gamma_{fn}(\omega_1, \omega_2)$$
Note that the counts are multiplied by the learning rates; together with the
threshold these are the control parameters of the approach.
Given this correction term and an ordinary word similarity measure $\sigma.\omega_o$, we introduce the adaptive word similarity $\sigma.\omega_\alpha$.
Definition 6 (Adaptive word similarity). Let $\omega_1$, $\omega_2$ be two words. Furthermore, let $\delta: \mathcal{W}^2 \to \mathbb{R}$ be a function that returns a correction value for a word pair. The adaptive word similarity function $\sigma.\omega_\alpha: \mathcal{W}^2 \to [0,1]$ is then defined as:

$$\sigma.\omega_\alpha(\omega_1, \omega_2) := \begin{cases} 1 & \sigma.\omega_o(\omega_1, \omega_2) + \delta(\omega_1, \omega_2) > 1 \\ 0 & \sigma.\omega_o(\omega_1, \omega_2) + \delta(\omega_1, \omega_2) < 0 \\ \sigma.\omega_o(\omega_1, \omega_2) + \delta(\omega_1, \omega_2) & \text{else} \end{cases}$$
Since $\sigma.\omega_o(\omega_1, \omega_2) + \delta(\omega_1, \omega_2)$ might return a value outside the interval $[0,1]$, but any σ.ω function is expected to stay within these bounds, we enforce the bounds as per the first and second case in the above definition. We then use σ.ω_α as σ.ω in the basic bag-of-words similarity when determining the alignment between two process models.
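Putting Definitions 5 and 6 together, the core of the feedback analysis could be sketched as follows; the class layout is ours, and making ρ_fp negative is our reading of the rule that false positives decrease and false negatives increase the similarity.

```python
# A sketch of the adaptive word similarity σ.ω_α (Definitions 5 and 6) with
# the feedback counters γ_fp and γ_fn. `base_sim` plays the role of the
# ordinary measure σ.ω_o; ρ_fp is chosen negative so that false positives
# decrease and false negatives increase the corrected similarity.
from collections import Counter

class AdaptiveWordSimilarity:
    def __init__(self, base_sim, rho_fp=-0.1, rho_fn=0.1):
        self.base_sim, self.rho_fp, self.rho_fn = base_sim, rho_fp, rho_fn
        self.gamma_fp, self.gamma_fn = Counter(), Counter()

    def __call__(self, w1, w2):
        delta = (self.rho_fp * self.gamma_fp[w1, w2]
                 + self.rho_fn * self.gamma_fn[w1, w2])
        return min(1.0, max(0.0, self.base_sim(w1, w2) + delta))  # clamp to [0, 1]

    def feedback(self, words1, words2, false_positive):
        """Count every word pair that produced a row or column maximum in the
        bag-of-words computation for a misclassified activity pair."""
        counter = self.gamma_fp if false_positive else self.gamma_fn
        for w1 in words1:
            counter[w1, max(words2, key=lambda w2: self(w1, w2))] += 1
        for w2 in words2:
            counter[max(words1, key=lambda w1: self(w1, w2)), w2] += 1
```

The worked example below can be replayed with this sketch: one false negative report for the labels of Table 3 increases γ_fn twice for (“check”, “documents”) and once each for the other two maximum-yielding pairs.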
To illustrate this approach, we refer to Table 3, which outlines the computation of σ.λ for (α1, β1). In previous work [11], we found that a threshold above 0.6 yields good results. In this case, the pair (α1, β1) will be classified as non-corresponding. Collecting user feedback, this will be revealed as a wrong classification. Thus, the false negative counter will be increased by 2 for (“check”, “documents”), as this word pair yielded the highest value for both words, and by one for (“check”, “complete”) and for (“application”, “complete”). Having ρ_fn set to 0.1, the basic bag-of-words similarity based on the adapted word similarities will now roughly be 0.65. Thus, an activity pair with the labels of α1 and β1 will now be classified as corresponding.
5 Evaluation
This section has two objectives. First, we analyze if our mixed-initiative approach
improves the results of existing matchers with regard to the amount of missing
and incorrect correspondences. Second, we aim to derive strategies to minimize
the amount of user feedback required to achieve a high matching quality.
Experiment Setup. Our evaluation utilizes the birth certificate and the univer-
sity admission data sets from the matching competition [3]. The gold standards
serve a dual purpose here: (i) assessing the matching quality and (ii) simulating
user feedback. Therefore, going through a sequence of model pairs, we first deter-
mine an alignment for the current pair and assess the quality of this alignment.
That is, we determine the number of true positives (TP), false positives (FP)
and false negatives (FN) given the gold standard. We then calculate the stan-
dard measures of precision (P) (T P /(T P +F P )), recall (R) (T P /(T P +F N )),
and f-measure as their harmonic mean (F) (2 ×P×R/(P+R)). Next, we pass
the sets of false positives and false negatives to the algorithm which adapts the
word similarities accordingly. Then, we move on to the next pair. The average
(AVG) and the standard deviation (STD) of these measures over all model pairs are used to assess the approach's quality. These are calculated either as running statistics during learning, or as an overall quality indicator after all model pairs have been matched and the respective feedback has been considered.
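The per-pair bookkeeping amounts to a few set operations; the following sketch (with illustrative names) shows the computation.

```python
# A sketch of the per-pair quality computation: proposed and gold-standard
# alignments are sets of activity pairs.
def quality(proposed, gold):
    tp = len(proposed & gold)                 # true positives
    fp = len(proposed - gold)                 # false positives
    fn = len(gold - proposed)                 # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

proposed = {("a1", "b1"), ("a1", "b2"), ("a2", "b2")}
gold = {("a1", "b1"), ("a1", "b2"), ("a2", "b3")}
print(quality(proposed, gold))                # (0.667, 0.667, 0.667)
```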
Table 6: Best results from the matching contest and for word similarity adaptation

                       Birth Certificate                      University Admission
             Precision     Recall     F-Measure     Precision     Recall     F-Measure
Approach     AVG  STD     AVG  STD     AVG  STD     AVG  STD     AVG  STD     AVG  STD
Baseline     .68  .19     .33  .22     .45  .18     .56  .23     .32  .28     .41  .20
Adaptive     .73  .15     .67  .24     .69  .18     .60  .20     .56  .25     .58  .21
We sampled the space of possible threshold values over the interval [0,1] in
steps of 0.05 as well as the space of possible false positive and false negative
learning rates over the interval [0,0.2] in steps of 0.01. Moreover, we randomly
generated different model pair sequences in order to check the influence of the
model pair order on the quality. We used the maximum of the Levenshtein [15]
and the Lin [16] similarities as the ordinary similarity measure.
Matching Results. Table 6 compares the results of our mixed-initiative ap-
proach to a baseline comprised of the best results from the matching competition
[3], i.e., the RefMod-Mine/NSCM results for the birth certificate and the bag-
of-words similarity with label pruning for the university admission data set. The
results for the mixed-initiative approach were determined for collecting user feed-
back over all model pairs. We observed an increase of the f-measure by 0.24 for
the birth certificate and by 0.17 for the university admission data set. While the
precision remained stable, there was a dramatic improvement in the recall.
Deriving strategies. To derive strategies for minimizing the user workload, we
first investigated if the order in which process model pairs are considered had an impact on the overall quality. For this purpose, we determined the quality of the
basic bag-of-words similarity for each model pair. Then, we split the model pairs
for each data set into three equal-sized classes, i.e., model pairs with a high,
a medium, and a low f-measure. We generated three sequences (high,medium,
and low) where each sequence starts with 12 model pairs of the respective class,
randomly ordered, followed by the remaining 24 model pairs, also in random
order. Fig. 5 shows the running average f-measure after the ith iteration for all
three sequences per data set. The results suggest that the order only has a small
impact on the final quality, since the average f-measures converge to roughly the
same value as the number of iterations increases. However, the running average
can be misleading: if we start learning with pairs that are already matched well
before learning (as in the high case), how much can we learn from them? To
examine this aspect, we ran a different experiment, where learning is stopped
after the ith iteration, and the f-measure over all pairs is computed. The results
are shown in Fig. 6, left. Looking at the data, one might hypothesize that here
the user workload per model pair is lower in the high case than for the other
[Figure: running average f-measure per iteration for the high, medium, and low sequences and the baseline; left: Birth Certificate, right: University Admission.]
Fig. 5: Running average f-measure after the ith iteration
sequences. Thus, we also counted the number of changes a user has to do until learning is stopped. These effort indicators are shown in Fig. 6, right.

[Figure: overall average f-measure (left) and user effort as the number of changes (right) when learning is stopped after the ith iteration, for the high, medium, and low sequences on both data sets.]
Fig. 6: Overall average f-measure over all 36 model pairs and the user workload after learning for i iterations
First of all, it can be seen that, regardless of the order, the amount of corrections grows roughly linearly, without big differences across the sequences. Furthermore, the f-measure curves for all three sequences approach each other with a growing number of iterations used to learn. When learning is stopped early, the best results are yielded for the low and the medium sequences: feedback on models has a larger impact if matching quality is low beforehand. Finally, regardless of the order, two thirds of the improvements are obtained from analyzing about half the model pairs (i = 16). In practice, it is not possible to sort model pairs with regard to the f-measure upfront. But as feedback collection progresses, the relative improvements can be measured. As soon as the improvements from additional feedback level off, analyzing can be stopped.
Discussion. The evaluation shows that the incorporation of user feedback leads
to strong improvements compared to the top matchers of the matching compe-
tition [3]. When feedback is collected for all model pairs, the f-measure increases
by 41% and 53% for the two data sets. Even when reducing the workload by only
collecting feedback for half of the model pairs, big improvements are obtained.
The main concern about experiments on process model matching relates to
external validity, i.e., to what extent the results of our study can be generalized
[25]. In this regard, the size of the two data sets restricts the validity of both,
the indicator assessment and the evaluation. Furthermore, the processes in each
data set represent the same abstract process. Hence, some structural and be-
havioral characteristics might be underrepresented, limiting the significance of
the indicator assessment. This problem also has implications for the evaluation
of the word similarity adaptation, as the processes only cover a small number of
tasks from both domains and a rather limited vocabulary used to describe them.
Thus, words might tend to occur more often than in other model collections
and feedback might be collected for the same word pair more often than usual.
This limits the generalization of the quality improvements and of the strategies
to minimize the user’s efforts. Lastly, the indicator assessment does not allow
for a general judgment on the sources of information, as there might exist other
indicators which better exploit these sources. Therefore, enlarging the data sets by including collections whose characteristics differ from the ones considered in this paper and considering more indicators are important steps in future work.
6 Conclusions and Future Work
In this paper, we investigated user feedback as a means for improving the quality
of process model matching. To this end, we first reviewed sources of information from
the literature and assessed their potential for feedback analysis based on derived
correspondence indicators. This assessment indicated that only the label based
similarity of activities can reliably be applied to decide whether an activity pair
corresponds or not. In a next step, we designed a mixed-initiative approach
that adapts the word similarity scores based on user feedback. We evaluated
our approach with regard to established benchmarking samples and showed that
user feedback can substantially improve the matching quality. Furthermore, we
investigated strategies to reduce the user workload while maximizing its benefit.
In future research, we plan to investigate further strategies for decreasing the
user workload while maximizing the matching quality. This comprises guidelines
for choosing model pairs (or activity pairs) the user needs to provide feedback
on. Another direction we plan to pursue is the extension of our approach to
better account for semantic relations and co-occurrences of words within labels.
References
1. Z. Bellahsene, A. Bonifati, and E. Rahm. Schema Matching and Mapping. Springer,
Heidelberg, 2011.
2. M. C. Branco, J. Troya, K. Czarnecki, J. M. Küster, and H. Völzer. Matching
business process workflows across abstraction levels. In MoDELS, pages 626–641,
2012.
3. U. Cayoglu, R. Dijkman, M. Dumas, P. Fettke, L. García-Bañuelos, P. Hake, C. Klinkmüller, H. Leopold, A. Ludwig, P. Loos, J. Mendling, A. Oberweis,
A. Schoknecht, E. Sheetrit, T. Thaler, M. Ullrich, I. Weber, and M. Weidlich.
The process model matching contest 2013. In PMC-MR, 2013.
4. R. Dijkman, M. Dumas, and L. García-Bañuelos. Graph matching algorithms for
business process model similarity search. In BPM, pages 48–63, 2009.
5. R. Dijkman, M. Dumas, B. van Dongen, R. Käärik, and J. Mendling. Similarity of
business process models: Metrics and evaluation. Inf. Syst., 36(2):498–516, 2011.
6. M. Dumas, L. García-Bañuelos, and R. M. Dijkman. Similarity search of business
process models. IEEE Data Eng. Bull., 32(3):23–28, 2009.
7. C. C. Ekanayake, M. Dumas, L. García-Bañuelos, M. L. Rosa, and A. H. M. ter
Hofstede. Approximate clone detection in repositories of business process models.
In BPM, pages 302–318, 2012.
8. J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, Berlin, 2013.
9. D. Grigori, J. C. Corrales, and M. Bouzeghoub. Behavioral Matchmaking for
Service Retrieval. In IEEE ICWS, pages 145–152, 2006.
10. T. Jin, J. Wang, M. L. Rosa, A. H. ter Hofstede, and L. Wen. Efficient querying
of large process model repositories. Computers in Industry, 64(1):41–49, 2013.
11. C. Klinkmüller, I. Weber, J. Mendling, H. Leopold, and A. Ludwig. Increasing
recall of process model matching by improved activity label matching. In BPM,
pages 211–218, 2013.
12. A. Koschmider and E. Blanchard. User assistance for business process model
decomposition. In IEEE RCIS, pages 445–454, 2007.
13. M. Kunze, M. Weidlich, and M. Weske. Behavioral similarity - a proper metric.
In BPM, pages 166–181, 2011.
14. H. Leopold, M. Niepert, M. Weidlich, J. Mendling, R. M. Dijkman, and H. Stuck-
enschmidt. Probabilistic optimization of semantic process model matching. In
BPM, pages 319–334, 2012.
15. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and
reversals. Soviet Physics Doklady, 10(8):707–710, 1966.
16. D. Lin. An information-theoretic definition of similarity. In ICML, pages 296–304,
1998.
17. F. J. Massey. The Kolmogorov-Smirnov test for goodness of fit. Journal of the
American Statistical Association, 46(253):68–78, 1951.
18. R. Navigli. Word sense disambiguation: A survey. ACM Comput. Surv., 41(2):10:1–
10:69, 2009.
19. P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining, (First
Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, 2005.
20. J. Vanhatalo, H. Völzer, and J. Koehler. The refined process structure tree. Data
Knowl. Eng., 68(9):793–818, 2009.
21. M. Weidlich, R. M. Dijkman, and J. Mendling. The ICoP framework: Identification
of correspondences between process models. In CAiSE, pages 483–498, 2010.
22. M. Weidlich, J. Mendling, and M. Weske. Efficient consistency measurement based
on behavioral profiles of process models. IEEE Trans. Softw. Eng., 37(3):410–429,
2011.
23. M. Weidlich, T. Sagi, H. Leopold, A. Gal, and J. Mendling. Predicting the quality
of process model matching. In BPM, pages 203–210, 2013.
24. M. Weidlich, E. Sheetrit, M. C. Branco, and A. Gal. Matching business process
models using positional passage-based language models. In ER, pages 130–137,
2013.
25. C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén. Ex-
perimentation in Software Engineering: An Introduction. Kluwer Academic Pub-
lishers, 2000.
26. H. Zha, J. Wang, L. Wen, C. Wang, and J. Sun. A workflow net similarity measure
based on transition adjacency relations. Computers in Industry, 61(5):463–471,
2010.