

Merging Business Process Models

Marcello La Rosa1, Marlon Dumas2, Reina Uba2, and Remco Dijkman3

1Queensland University of Technology, Australia

m.larosa@qut.edu.au

2University of Tartu, Estonia

{marlon.dumas,reinak}@ut.ee

3Eindhoven University of Technology, The Netherlands

r.m.dijkman@tue.nl

Abstract. This paper addresses the following problem: given two busi-

ness process models, create a process model that is the union of the

process models given as input. In other words, the behavior of the pro-

duced process model should encompass that of the input models. The

paper describes an algorithm that produces a single configurable process

model from a pair of process models. The algorithm works by extracting

the common parts of the input process models, creating a single copy of

them, and appending the differences as branches of configurable connec-

tors. This way, the merged process model is kept as small as possible,

while still capturing all the behavior of the input models. Moreover, ana-

lysts are able to trace back which model(s) a given element in the merged

model originates from. The algorithm has been prototyped and tested

against process models taken from several application domains.

1 Introduction

In the context of company mergers and restructurings, it often occurs that mul-

tiple alternative processes, previously belonging to different companies or units,

need to be consolidated into a single one in order to eliminate redundancies and

create synergies. To this end, teams of business analysts need to compare simi-

lar process models so as to identify commonalities and differences, and to create

integrated process models that can be used to drive the process consolidation

effort. This process model merging effort is tedious, time-consuming and error-

prone. In one instance reported in this paper, it took a team of three analysts

130 man-hours to merge 25% of two variants of an end-to-end process model.

In this paper, we consider the problem of (semi-)automatically merging pro-

cess models under the following requirements:

1. The behavior of the merged model should subsume that of the input models.

2. Given an element in the merged process model, analysts should be able to

trace back from which process model(s) the element in question originates.

3. One should be able to derive the input process models from the merged one.

The main contribution of the paper is an algorithm that takes as input a col-

lection of process models and generates a configurable process model [15]. A

R. Meersman et al. (Eds.): OTM 2010, Part I, LNCS 6426, pp. 96–113, 2010.

© Springer-Verlag Berlin Heidelberg 2010


configurable process model is a modeling artifact that captures a family of pro-

cess models in an integrated manner and that allows analysts to understand

what these process models share, what their differences are, and why and how

these differences occur. Given a configurable process model, analysts can derive

individual members of the underlying process family by means of a procedure

known as individualization. We contend that configurable process models are a

suitable output for a process merging algorithm, because they provide a mecha-

nism to fulfill the second and third requirements outlined above. Moreover, they

can be used to derive new process models that were not available in the orig-

inating process family, e.g. when the need to capture new business procedures

arises. In this respect, the merged model can be seen as a reference model [4] for

the given process family.

The algorithm requires as input a mapping that defines which elements from

one process model correspond to which elements from another process model.

To assist in the construction of this mapping, an initial mapping is suggested

to the user, who can then adapt it if necessary. The algorithm has been tested on

process models sourced from different domains. The tests show that the process

merging algorithm produces compact models and scales up to process models

containing hundreds of nodes.

The paper is structured as follows. Section 2 introduces the notion of con-

figurable process model as well as a technique for proposing an initial mapping

between similar process model elements. Section 3 presents the process merging

algorithm. Section 4 reports on the implementation and evaluation of the algo-

rithm. Finally, Section 5 discusses related work and Section 6 draws conclusions.

2 Background

This section introduces two basic ingredients of the proposed process merging

technique: a notation for configurable process models and a technique to match

the elements of a given pair of process models. This latter technique is used

to assist users in determining which pairs of process model elements should be

considered as equivalent when merging.

2.1 Configurable Business Processes

There exist many notations to represent business processes, such as Event-driven

Process Chains (EPC), UML Activity Diagrams (UML ADs) and the Business

Process Modeling Notation (BPMN). In this paper we abstract from any specific

notation and represent a business process model as a directed graph with labeled

nodes as per the following definition. This process abstraction allows us to merge

process models defined in different notations.

Definition 1 (Business Process Graph). A business process graph G is a

set of pairs of process model nodes—each pair denoting a directed edge. A node

n of G is a tuple (idG(n),λG(n),τG(n)) consisting of a unique identifier idG(n)

(of type string), a label λG(n) (of type string), and a type τG(n). In situations

where there is no ambiguity, we will drop the subscript G from idG, λG and τG.



For a business process graph G, its set of nodes, denoted NG, is

⋃{{n1,n2} | (n1,n2) ∈ G}. Each node has a type. The available types of nodes

depend on the language that is used. For example, BPMN has nodes of type

‘activity’, ‘event’ and ‘gateway’. In the rest of this paper we will show examples

using the EPC notation, which has three types of nodes: i) ‘function’ nodes,

using the EPC notation, which has three types of nodes: i) ‘function’ nodes,

representing tasks that can be performed in an organization; ii) ‘event’ nodes,

representing pre-conditions that must be satisfied before a function can be per-

formed, or post-conditions that are satisfied after a function has been performed;

and iii) ‘connector’ nodes, which determine the flow of execution of the process.

Thus, τG∈ {“f”, “e”, “c”} where the letters represent the (f)unction, (e)vent

and (c)onnector type. The label of a node of type “c” indicates the kind of

connector. EPCs have three kinds of connectors: AND, XOR and OR. AND

connectors either represent that after the connector, the process can continue

along multiple parallel paths (AND-split), or that it has to wait for multiple par-

allel paths in order to be able to continue (AND-join). XOR connectors either

represent that after the connector, a choice has to be made about which path to

continue on (XOR-split), or that the process has to wait for a single path to be

completed in order to be allowed to continue (XOR-join). OR connectors start

or wait for multiple paths. Models G1 and G2 in Fig. 1 are two example EPCs.
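To make Definition 1 concrete, the encoding can be sketched in Python. The names (`Node`, `nodes`) and the example labels are illustrative, not part of the paper's formalism:

```python
from dataclasses import dataclass

# Definition 1, sketched: a node is (id, label, type); a process graph is a
# set of directed edges, i.e. (source, target) pairs of nodes.
@dataclass(frozen=True)
class Node:
    nid: str     # unique identifier id_G(n)
    label: str   # label lambda_G(n)
    ntype: str   # type tau_G(n): "f" (function), "e" (event), "c" (connector)

ship = Node("n1", "Shipment processing", "f")
split = Node("n2", "xor", "c")
ev = Node("n3", "Delivery is relevant for shipment", "e")
G = {(ship, split), (split, ev)}

# The node set N_G is the set of all endpoints of the edges.
def nodes(graph):
    return {n for edge in graph for n in edge}
```

Representing the graph purely as an edge set, as in the definition, keeps the merge operations later in the paper expressible as plain set unions and differences.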

A Configurable EPC (C-EPC) [15] is an EPC where some connectors are

marked as configurable. A configurable connector can be configured by removing

one or more of its incoming branches (in the case of a join) or one or more of

its outgoing branches (in the case of a split). The result is a regular connector

with a possibly reduced number of incoming or outgoing branches. In addition, a

configurable OR connector can be mutated into a regular XOR or a regular AND.

After all nodes in a C-EPC are configured, a C-EPC needs to be individualized

by removing those branches that have been excluded during the configuration

of each configurable connector. Model CG in Fig. 1 is an example of C-EPC

featuring a configurable XOR-split, a configurable XOR-join and a configurable

OR-join, while the two models G1 and G2 are two possible individualizations

of CG. G1can be obtained by configuring the three configurable connectors in

order to keep all branches labeled “1”, and restricting the OR-join to an AND-

join; G2 can be obtained by configuring the three configurable connectors in

order to keep all branches labeled “2” and restricting the OR-join to an XOR-

join. Since in both cases only one branch is kept for the two configurable XOR

connectors (either the one labeled “1” or the one labeled “2”), these connectors

are removed during individualization. For more details on the individualization

algorithm, we refer to [15].
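The configuration of a single split connector can be sketched as follows. This is a simplified illustration with hypothetical names, not the full individualization algorithm of [15], which also handles joins, OR mutations and the removal of excluded branches:

```python
# Configuring a configurable split, sketched: keep only the selected outgoing
# branches; a configurable OR may additionally be restricted to "xor" or "and".
# All names here are illustrative, not the notation of [15].
def configure_split(edges, connector, keep_targets, restrict_to=None):
    configured = {(s, t) for (s, t) in edges
                  if s != connector or t in keep_targets}
    return configured, restrict_to

edges = {("xor1", "branch1"), ("xor1", "branch2"), ("branch1", "f1")}
cfg, kind = configure_split(edges, "xor1", {"branch1"})
# Only one outgoing branch of "xor1" survives, so during individualization
# the connector itself could be removed, as in the example above.
```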

According to requirement (2) in Section 1, we need a mechanism to trace back

from which variant a given element in the merged model originates. Coming back

to the example in Fig. 1, the C-EPC model (CG) can also be seen as the result

of merging the two EPCs (G1and G2). The configurable XOR-split immediately

below function “Shipment Processing” in CG has two outgoing edges. One of

them originates from G1 (and we thus label it with identifier “1”) while the

second originates from G2 (identifier “2”). In some cases, an edge in the merged

[Figure 1 here. It shows the two example EPCs G1 and G2, a mapping between their elements, and the merged C-EPC model CG; edges of CG are annotated with “1”, “2” or “1,2”, configurable connectors carry annotations, and the maximum common regions are highlighted. The legend covers events, functions, AND/XOR/OR connectors, configurable connectors, arcs, mappings and annotations.]

Fig. 1. Two business process models with a mapping, and their merged model

model originates from multiple variants. For example, the edge that emanates

from event “Delivery is relevant for shipment” is labeled with both variants (“1”

and “2”) since this edge can be found in both original models.

Also, since nodes in the merged model are obtained by combining nodes from

different variants, we need to capture the label of the node in each of its vari-

ants. For example, function “Transportation planning and processing” in CG

stems from the merger of the function with the same name in G1, and function

“Transporting” in G2. Accordingly, this function in CG will have an annotation

(as shown in the figure), stating that its label in variant 1 is “Transportation

planning and processing”, while its label in variant 2 is “Transporting”. Simi-

larly, the configurable OR connector just above “Transportation planning and

processing” in CG stems from two connectors: an AND connector in variant

1 and an XOR connector in variant 2. Thus an annotation will be attached

to this node (as shown in the figure) which will record the fact that the label

of this connector is “and” in variant 1, and “xor” in variant 2. In addition to

providing traceability, these annotations enable us to derive the original process

models by configuring the merged one, as per requirement (3) in Section 1. Thus,

we define the concept of Configurable Process Graph, which attaches additional

configuration metadata to each edge and node in a business process graph.

Definition 2 (Configurable Business Process Graph). Let I be a set of

identifiers of business process models, and L the set of all labels that process


model nodes can take. A Configurable Business Process graph is a tuple

(G,αG,γG,ηG) where G is a business process graph, αG: G → ℘(I) is a function

that maps each edge in G to a set of process graph identifiers, γG : NG → ℘(I×L)

is a function that maps each node n ∈ NG to a set of pairs (pid, l) where pid is

a process graph identifier and l is the label of node n in process graph pid, and

ηG : NG → {true, false} is a boolean function indicating whether a node is configurable

or not.

Because we attach annotations to graph elements, our concept of configurable

process graph slightly differs from the one defined in [15].

Below, we define some auxiliary notations which we will use when matching

pairs of process graphs.

Definition 3 (Preset, Postset, Transitive Preset, Transitive Postset).

Let G be a business process graph. For a node n ∈ NGwe define the preset as

•n = {m|(m,n) ∈ G} and the postset as n• = {m|(n,m) ∈ G}. We call an

element of the preset predecessor and an element of the postset successor. There

is a path between two nodes n ∈ NG and m ∈ NG, denoted n ↪ m, if and

only if (iff) there exists a sequence of nodes n1,...,nk ∈ NG with n = n1 and

m = nk such that for all i ∈ 1,...,k − 1 holds (ni,ni+1) ∈ G. If n ≠ m and

for all i ∈ 2,...,k − 1 holds τ(ni) = “c”, the path n ↪ᶜ m is called a connector

chain. The set of nodes from which a node n ∈ NG is reachable via a connector

chain is defined as ᶜ•n = {m ∈ NG | m ↪ᶜ n} and is called the transitive preset

of n via connector chains. Similarly, n•ᶜ = {m ∈ NG | n ↪ᶜ m} is the transitive

postset of n via connector chains.

For example, the transitive preset of event “Delivery is relevant for shipment” in

Fig. 1 includes functions “Delivery” and “Shipment processing”, since these

two latter functions can be reached from the event by traversing backward edges

and skipping any connectors encountered in the backward path.
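This backward traversal can be sketched in Python. Helper names are illustrative; graphs are edge sets as before and `types[v]` gives τ(v):

```python
# Transitive preset via connector chains, sketched: walk edges backwards from
# n, passing through connector nodes and collecting the first non-connector
# nodes reached.
def preset(graph, n):
    return {m for (m, k) in graph if k == n}

def transitive_preset(graph, types, n):
    result, seen = set(), set()
    frontier = set(preset(graph, n))
    while frontier:
        m = frontier.pop()
        if m in seen:
            continue
        seen.add(m)
        if types[m] == "c":   # skip connectors, keep walking backwards
            frontier |= preset(graph, m)
        else:
            result.add(m)
    return result

# Stand-in fragment of G1: two functions joined by a connector into the event.
types = {"Delivery": "f", "Shipment processing": "f", "and1": "c",
         "Delivery is relevant for shipment": "e"}
graph = {("Delivery", "and1"), ("Shipment processing", "and1"),
         ("and1", "Delivery is relevant for shipment")}
```

The `seen` set guards against cycles of connectors; the transitive postset is the symmetric sketch over outgoing edges.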

2.2 Matching Business Processes

The aim of matching two process models is to establish the best mapping between

their nodes. Here, a mapping is a function from the nodes in the first graph to

those in the second graph. What is considered to be the best mapping depends

on a scoring function, called the matching score. The matching score we employ

is related to the notion of graph edit distance [1]. We use this matching score as

it performed well in several empirical studies [17,2,3]. Given two graphs and a

mapping between their nodes, we compute the matching score in three steps.

First, we compute the matching score between each pair of nodes as follows.

Nodes of different types must not be mapped, and splits must not be matched

with joins. Thus, a mapping between nodes of different types, or between a split

and a join, has a matching score of 0. The matching score of a mapping between

two functions or between two events is measured by the similarity of their la-

bels. To determine this similarity, we use a combination of a syntactic similarity


measure, based on string edit distance [10], and a linguistic similarity measure,

based on the Wordnet::Similarity package [13] (if specific ontologies for a domain

are available, such ontologies can be used instead of Wordnet). We apply these

measures on pairs of words from the two labels, after removing stop-words (e.g.

articles and conjunctions) and stemming the remaining words (to remove word

endings such as ”-ing”). The similarity between two words is the maximum be-

tween their syntactic similarity and their linguistic similarity. The total similarity

between two labels is the average of the similarities between each pair of words

(w1,w2) such that w1 belongs to the first label and w2 belongs to the second

label. With reference to the example in Fig. 1, the similarity score between nodes

‘Transportation planning and processing’ in G1 and node ‘Transporting’ in G2 is

around 0.35. After removing the stop-word “and”, we have three pairs of terms.

The similarity between “Transportation” and “Transporting” after stemming is

1.0, while the similarity between “plan” and “process” or between “plan” and

“transport” is close to 0. The average similarity between these three pairs is thus

around 0.35. This approach is directly inspired from established techniques for

matching pairs of elements in the context of schema matching [14].
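The syntactic half of this similarity can be sketched as follows. This is a simplified illustration: the stop-word list is a small stand-in, and the WordNet-based linguistic measure and stemming are omitted, so scores will differ from the 0.35 reported above:

```python
# Classic dynamic-programming Levenshtein (string edit) distance.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def word_sim(w1, w2):
    # Normalized syntactic similarity in [0, 1].
    return 1.0 - edit_distance(w1, w2) / max(len(w1), len(w2))

STOP = {"and", "or", "the", "a", "of", "is", "to", "be"}  # stand-in list

def label_sim(l1, l2):
    ws1 = [w for w in l1.lower().split() if w not in STOP]
    ws2 = [w for w in l2.lower().split() if w not in STOP]
    # For each word of the first label, take its best match in the second
    # label, then average (the full technique would take the max of the
    # syntactic and the WordNet-based similarity per word pair).
    sims = [max(word_sim(w1, w2) for w2 in ws2) for w1 in ws1]
    return sum(sims) / len(sims)
```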

The above approach to compute similarities between functions/events cannot

be used to compute the similarity between pairs of splits or pairs of joins, as

connectors’ labels are restricted to a small set (e.g. ‘OR’, ‘XOR’ and ’AND’)

and they each have a specific semantics. Instead, we use a notion of context

similarity. Given two mapped nodes, context similarity is the fraction of nodes

in their transitive presets and their transitive postsets that are mapped (i.e. the

contexts of the nodes), provided at least one mapping of transitive preset nodes

and one mapping of transitive postset nodes exists.

Definition 4 (Context similarity). Let G1 and G2 be two process graphs. Let

M : NG1 ⇸ NG2 be a partial injective mapping that maps nodes in G1 to nodes

in G2. The context similarity of two mapped nodes n ∈ NG1 and m ∈ NG2 is:

(|M(ᶜ•n) ∩ ᶜ•m| + |M(n•ᶜ) ∩ m•ᶜ|) / (max(|ᶜ•n|, |ᶜ•m|) + max(|n•ᶜ|, |m•ᶜ|))

where M applied to a set yields the set in which M is applied to each element.

For example, the event ‘Delivery is relevant for shipment’ preceding the AND-

join (via a connector chain of size 0) in model G1from Fig. 1 is mapped to the

event ‘Delivery is relevant for shipment’ preceding the XOR-join in G2. Also,

the function succeeding the AND-join (via a connector chain of size 0) in G1is

mapped to the function succeeding the XOR-join in G2. Therefore, the context

similarity of the two joins is (1 + 1)/(3 + 1) = 0.5.
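Under the assumption that transitive presets and postsets have been precomputed, Definition 4 reduces to a few set operations. The helper names and stand-in node names below are illustrative:

```python
# Definition 4, sketched: tpre*/tpost* are precomputed transitive presets and
# postsets via connector chains (dictionaries from node to set of nodes);
# M is the partial injective mapping as a dictionary.
def context_sim(M, tpre1, tpost1, tpre2, tpost2, n, m):
    mapped_pre = {M[x] for x in tpre1[n] if x in M} & tpre2[m]
    mapped_post = {M[x] for x in tpost1[n] if x in M} & tpost2[m]
    denom = (max(len(tpre1[n]), len(tpre2[m])) +
             max(len(tpost1[n]), len(tpost2[m])))
    return (len(mapped_pre) + len(mapped_post)) / denom

# Stand-in data mimicking the two joins discussed above: one mapped
# predecessor pair, one mapped successor pair, presets of sizes 3 and 2.
M = {"Delivery": "Delivery_2", "Transport": "Transport_2"}
tpre1 = {"and-join": {"Delivery", "Freight packed", "Other"}}
tpost1 = {"and-join": {"Transport"}}
tpre2 = {"xor-join": {"Delivery_2", "Delivery unblocked"}}
tpost2 = {"xor-join": {"Transport_2"}}
```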

Second, we derive from the mapping the number of: Node substitutions (a

node in one graph is substituted for a node in the other graph iff they appear

in the mapping); Node insertions/deletions (a node is inserted into or deleted

from one graph iff it does not appear in the mapping); Edge substitutions (an

edge from node a to node b in one graph is substituted for an edge in the other

graph iff node a is matched to node a′, node b is matched to node b′ and there

exists an edge from node a′ to node b′); and Edge insertions/deletions (an edge

is inserted into or deleted from one graph iff it is not substituted).

Third, we use the matching scores from step one and the information about

substituted, inserted and deleted nodes and edges from step two, to compute

the matching score for the mapping as a whole. We define the matching score

of a mapping as the weighted average of the fraction of inserted/deleted nodes,

the fraction of inserted/deleted edges and the average score for node substitu-

tions. Specifically, the matching score of a pair of process graphs and a mapping

between them is defined as follows.

Definition 5 (Matching score). Let G1 and G2 be two process graphs and

let M be their mapping function, where dom(M) denotes the domain of M and

cod(M) denotes the codomain of M. Let also 0 ≤ wsubn ≤ 1, 0 ≤ wskipn ≤ 1

and 0 ≤ wskipe ≤ 1 be the weights that we assign to substituted nodes, inserted

or deleted nodes and inserted or deleted edges, respectively, and let Sim(n,m)

be the function that returns the similarity score for a pair of mapped nodes, as

computed in step one.

The set of substituted nodes, denoted subn, inserted or deleted nodes, denoted

skipn, substituted edges, denoted sube, and inserted or deleted edges, denoted

skipe, are defined as follows:

subn = dom(M) ∪ cod(M)
skipn = (NG1 ∪ NG2) \ subn
sube = {(a,b) ∈ E1 | (M(a),M(b)) ∈ E2} ∪ {(a′,b′) ∈ E2 | (M⁻¹(a′),M⁻¹(b′)) ∈ E1}
skipe = (E1 ∪ E2) \ sube

The fraction of inserted or deleted nodes, denoted fskipn, the fraction of inserted

or deleted edges, denoted fskipe, and the average distance of substituted nodes,

denoted fsubn, are defined as follows:

fskipn = |skipn| / (|N1| + |N2|)
fskipe = |skipe| / (|E1| + |E2|)
fsubn = 2.0 · Σ(n,m)∈M (1.0 − Sim(n,m)) / |subn|

Finally, the matching score of a mapping is defined as:

1.0 − (wskipn · fskipn + wskipe · fskipe + wsubn · fsubn) / (wskipn + wskipe + wsubn)

For example, in Fig. 1 the node ‘Freight packed’ and its edge to the AND-join

in G1 are inserted, and so are the node ‘Delivery unblocked’ and its edge to the

XOR-join in G2. The AND-join in G1 is substituted by the second XOR-join in

G2 with a matching score of 0.5, while the node ‘Transportation planning and

processing’ in G1 is substituted by the node ‘Transporting’ in G2 with a match-

ing score of 0.35 as discussed above. Thus, the edge between ‘Transportation

planning and processing’ and the AND-join in G1 is substituted by the edge

between ‘Transporting’ and the XOR-join in G2, as both edges are between two

substituted nodes. All the other substituted nodes have a matching score of

1.0. If all weights are set to 1.0, the total matching score for this mapping is

1.0 − (7/21 + 11/19 + (2·0.5 + 2·0.65)/14)/3 = 0.64.
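The whole of Definition 5 can be sketched in Python. Names are hypothetical; graphs are passed as node and edge sets, and `sim` is a precomputed dictionary of node-pair similarities from step one:

```python
# Definition 5, sketched: matching score of a mapping M (a dictionary)
# between graphs (N1, E1) and (N2, E2).
def matching_score(N1, E1, N2, E2, M, sim,
                   wskipn=1.0, wskipe=1.0, wsubn=1.0):
    subn = set(M) | set(M.values())
    # Substituted edges are counted in both graphs, as in the definition.
    sube = {(a, b) for (a, b) in E1
            if a in M and b in M and (M[a], M[b]) in E2}
    inv = {v: k for k, v in M.items()}
    sube |= {(a, b) for (a, b) in E2
             if a in inv and b in inv and (inv[a], inv[b]) in E1}
    skipn = (N1 | N2) - subn
    skipe = (E1 | E2) - sube
    fskipn = len(skipn) / (len(N1) + len(N2))
    fskipe = len(skipe) / (len(E1) + len(E2))
    fsubn = 2.0 * sum(1.0 - sim[(n, m)] for n, m in M.items()) / len(subn)
    return 1.0 - ((wskipn * fskipn + wskipe * fskipe + wsubn * fsubn)
                  / (wskipn + wskipe + wsubn))
```

With a perfect mapping (every node and edge substituted with similarity 1.0) all three fractions vanish and the score is 1.0, matching the intuition behind the definition.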


Definition 5 gives the matching score of a given mapping. To determine the

matching score of two business process graphs, we must exhaustively try all

possible mappings and find the one with the highest matching score. Various

algorithms exist to find the mapping with the highest matching score. In the

experiments reported in this paper, we use a greedy algorithm from [2], since its

computational complexity is much lower than that of an exhaustive algorithm,

while having a high precision.
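A greedy mapping construction in this spirit (an illustrative sketch, not the exact algorithm of [2]) repeatedly commits the best-scoring pair of still-unmapped nodes:

```python
# Greedy mapping, sketched: score all cross-graph pairs, then commit pairs
# in descending score order, skipping nodes that are already mapped.
def greedy_mapping(nodes1, nodes2, pair_score, threshold=0.0):
    candidates = sorted(((pair_score(n, m), n, m)
                         for n in nodes1 for m in nodes2),
                        key=lambda t: t[0], reverse=True)
    mapping, used1, used2 = {}, set(), set()
    for score, n, m in candidates:
        if score <= threshold:
            break  # remaining candidates score no better
        if n not in used1 and m not in used2:
            mapping[n] = m
            used1.add(n)
            used2.add(m)
    return mapping
```

This runs in O(n² log n) for graphs with n nodes each, versus the factorial cost of exhaustively trying all mappings, which is the trade-off the paper alludes to.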

3 Merging Algorithm

The merging algorithm is defined over pairs of configurable process graphs. In

order to merge two or more (non-configurable) process graphs, we first need to

convert each process graph into a configurable process graph. This is trivially

achieved by annotating every edge of a process graph with the identifier of the

process graph, and every node in the process graph with a pair indicating the

process graph identifier and the label for that node. We then obtain a config-

urable process graph representing only one possible variant.
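This lifting step can be sketched as follows, with hypothetical names; edges are pairs of node identifiers and `labels` maps each node to its label:

```python
# Lifting a plain process graph to a configurable one: annotate every edge
# with the graph identifier (alpha) and every node with (id, label) (gamma).
def to_configurable(graph_id, edges, labels):
    alpha = {edge: {graph_id} for edge in edges}         # edge annotations
    nodes = {n for e in edges for n in e}
    gamma = {n: {(graph_id, labels[n])} for n in nodes}  # node annotations
    return edges, alpha, gamma

edges = {("e1", "f1")}
labels = {"e1": "Freight packed", "f1": "Shipment processing"}
_, alpha, gamma = to_configurable("1", edges, labels)
```

Merging then amounts to taking unions of these annotation sets wherever nodes or edges are matched, which is exactly what Algorithm 1 does below.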

Given two configurable process graphs G1 and G2 and their mapping M,

the merging algorithm (Algorithm 1) starts by creating an initial version of the

merged graph CG by doing the union of the edges of G1 and G2, excluding

the edges of G2 that are substituted. In this way for each matched node we

keep the copy in G1only. Next, we set the annotation of each edge in CG that

originates from a substituted edge, with the union of the annotations of the two

substituted edges in G1and G2. For example, this produces all edges with label

“1,2” in model CG in Fig. 1. Similarly, we set the annotation of each node in

CG that originates from a matched node, with the union of the annotations of

the two matched nodes in G1and G2. In Fig. 1, this produces the annotations of

the last two nodes of CG—the only two nodes originating from matched nodes

with different labels (the other annotations are not shown in the figure).

Next, we use function MaximumCommonRegions to partition the mapping

between G1 and G2 into maximum common regions (Algorithm 2). A maxi-

mum common region (mcr) is a maximum connected subgraph consisting only

of matched nodes and substituted edges. For example, given models G1 and

G2in Fig. 1, MaximumCommonRegions returns the three mcrs highlighted by

rounded boxes in the figure. To find all mcrs, we first randomly pick a matched

node that has not yet been included in any mcr. We then compute the mcr of

that node using a breadth-first search. After this, we choose another mapped

node that is not yet in an mcr, and we construct the next mcr. We then postpro-

cess the set of maximum common regions to remove from each mcr those nodes

that are at the beginning or at the end of one model, but not of the other (this

step is not shown in Algorithm 2). Such nodes cannot be merged, otherwise it

would not be possible to trace back which model they come from. For example,

we do not merge event “Deliveries need to be planned” in Fig. 1 as this node is

at the beginning of G1and at the end of G2. In this case, since the mcr contains

this node only, we remove the mcr altogether.
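The region-growing search just described can be transcribed into Python. This is an illustrative sketch with graphs as edge sets and the mapping as a dictionary; the postprocessing of start/end nodes is omitted:

```python
from collections import deque

# Breadth-first search over matched nodes connected by substituted edges:
# each region is a maximal connected set of matched nodes of G1 whose
# connecting edges are mirrored in G2 under the mapping M.
def maximum_common_regions(G1, G2, M):
    visited, regions = set(), []
    for start in M:
        if start in visited:
            continue
        region, queue = set(), deque([start])
        while queue:
            c = queue.popleft()
            if c in visited:
                continue
            visited.add(c)
            region.add(c)
            for n in M:
                if n not in visited and (
                        ((c, n) in G1 and (M[c], M[n]) in G2) or
                        ((n, c) in G1 and (M[n], M[c]) in G2)):
                    queue.append(n)
        regions.append(region)
    return regions
```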


Algorithm 1. Merge

function Merge(Graph G1, Graph G2, Mapping M)
init
    Mapping mcr, Graph CG
begin
    CG ⇐ G1 ∪ G2 \ (G2 ∩ sube)
    foreach (x, y) ∈ CG ∩ sube do
        αCG(x, y) ⇐ αG1(x, y) ∪ αG2(M(x), M(y))
    end
    foreach n ∈ NCG ∩ subn do
        γCG(n) ⇐ γG1(n) ∪ γG2(M(n))
    end
    foreach mcr ∈ MaximumCommonRegions(G1, G2, M) do
        FG1 ⇐ {x ∈ dom(mcr) | •x ∩ dom(mcr) = ∅ ∨ •M(x) ∩ cod(mcr) = ∅}
        foreach fG1 ∈ FG1 such that |•fG1| = 1 and |•M(fG1)| = 1 do
            pfG1 ⇐ Any(•fG1), pfG2 ⇐ Any(•M(fG1))
            xj ⇐ new Node(“c”, “xor”, true)
            CG ⇐ (CG \ {(pfG1, fG1), (pfG2, fG2)}) ∪ {(pfG1, xj), (pfG2, xj), (xj, fG1)}
            αCG(pfG1, xj) ⇐ αG1(pfG1, fG1), αCG(pfG2, xj) ⇐ αG2(pfG2, fG2)
            αCG(xj, fG1) ⇐ αG1(pfG1, fG1) ∪ αG2(pfG2, fG2)
        end
        LG1 ⇐ {x ∈ dom(mcr) | x• ∩ dom(mcr) = ∅ ∨ M(x)• ∩ cod(mcr) = ∅}
        foreach lG1 ∈ LG1 such that |lG1•| = 1 and |M(lG1)•| = 1 do
            slG1 ⇐ Any(lG1•), slG2 ⇐ Any(M(lG1)•)
            xs ⇐ new Node(“c”, “xor”, true)
            CG ⇐ (CG \ {(lG1, slG1), (lG2, slG2)}) ∪ {(xs, slG1), (xs, slG2), (lG1, xs)}
            αCG(xs, slG1) ⇐ αG1(lG1, slG1), αCG(xs, slG2) ⇐ αG2(lG2, slG2)
            αCG(lG1, xs) ⇐ αG1(lG1, slG1) ∪ αG2(lG2, slG2)
        end
    end
    CG ⇐ MergeConnectors(M, CG)
    return CG
end

Once we have identified all mcrs, we need to reconnect them with the remain-

ing nodes from G1and G2that are not matched. The way a region is reconnected

depends on the position of its sources and sinks in G1and G2. A region’s source

is a node whose preset is empty (the source is a start node) or at least one of

its predecessors is not in the region; a region’s sink is a node whose postset is

empty (the sink is an end node) or at least one of its successors is not in the

region. We observe that this condition may be satisfied by a node in one graph

but not by its matched node in the other graph. For example, a node may be a

source of a region for G2but not for G1.

If a node fG1 is a source in G1 or its matched node M(fG1) is a source in

G2 and both fG1 and M(fG1) have exactly one predecessor each, we insert a

configurable XOR-join xj in CG to reconnect the two predecessors to the copy of

fG1 in CG. Similarly, if a node lG1 is a sink in G1 or its matched node M(lG1)

is a sink in G2 and both nodes have exactly one successor each, we insert a


Algorithm 2. Maximum Common Regions

function MaximumCommonRegions(Graph G1, Graph G2, Mapping M)
init
    {Node} visited ⇐ ∅, {Mapping} MCRs ⇐ ∅
begin
    while exists c ∈ dom(M) such that c ∉ visited do
        {Node} mcr ⇐ ∅
        {Node} tovisit ⇐ {c}
        while tovisit ≠ ∅ do
            c ⇐ dequeue(tovisit)
            mcr ⇐ mcr ∪ {c}
            visited ⇐ visited ∪ {c}
            foreach n ∈ dom(M) such that (((c, n) ∈ G1 and (M(c), M(n)) ∈ G2) or
                    ((n, c) ∈ G1 and (M(n), M(c)) ∈ G2)) and n ∉ visited do
                enqueue(tovisit, n)
            end
        end
        MCRs ⇐ MCRs ∪ {mcr}
    end
    return MCRs
end

configurable XOR-split xs in CG to reconnect the two successors to the copy of lG1 in CG. We also set the labels of the new edges in CG to track back the edges in the original models. This is illustrated in Fig. 2, where we use symbol pfG1 to indicate the only predecessor of node fG1 in G1, slG1 to indicate the only successor of node lG1 in G1, and so on. Moreover, in Algorithm 1 we use function Node to create the configurable XOR-joins and XOR-splits that we need to add, and function Any to extract the element of a singleton set.

In Fig. 1, node "Shipment processing" in G1 and its matched node in G2 are both sink nodes and have exactly one successor each ("Delivery is relevant for shipment" in G1 and "Delivery is to be created" in G2). Thus, we reconnect this node in CG to the two successors via a configurable XOR-split and set the labels of the incoming and outgoing edges of this split accordingly. The same operation applies when a node is a source (sink) in one graph but not in the other.
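To make the reconnection step concrete, here is a minimal Python sketch of inserting a configurable XOR-join for a region source, as the algorithm above describes. The representation is an assumption for illustration only: a merged graph is a dict mapping edges (source, target) to sets of provenance labels, and the function and node names are ours, not the paper's implementation.

```python
# Hypothetical sketch: graphs as {(source, target): set_of_labels}.
def reconnect_source(cg, f, pf_g1, pf_g2, lab_g1, lab_g2):
    """Insert a configurable XOR-join xj between the two predecessors and f."""
    xj = ("c", "xor", True)  # type "c"(onnector), label "xor", configurable
    # Drop the direct edges from both predecessors to the copy of f...
    cg.pop((pf_g1, f), None)
    cg.pop((pf_g2, f), None)
    # ...and route them through xj. Incoming edges keep the original labels;
    # the outgoing edge carries the union, so provenance stays traceable.
    cg[(pf_g1, xj)] = lab_g1
    cg[(pf_g2, xj)] = lab_g2
    cg[(xj, f)] = lab_g1 | lab_g2
    return xj

cg = {("pfG1", "fG1"): {"1"}, ("pfG2", "fG1"): {"2"}}
xj = reconnect_source(cg, "fG1", "pfG1", "pfG2", {"1"}, {"2"})
```

The symmetric case for a region sink would insert a configurable XOR-split between the sink and its two successors.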

By removing from MCRs all the nodes that are at the beginning or at the end of one model but not of the other, we guarantee that either both a source and its matched node have predecessors or none has, and similarly, that either both a sink and its matched node have successors or none has. In Fig. 1, the region containing node "Deliveries need to be planned" is removed after postprocessing MCRs, since this node is a start node for G1 and an end node for G2.

If a source has multiple predecessors (i.e. it is a join) or a sink has multiple successors (i.e. it is a split), we do not need to add a configurable XOR-join before the source, or a configurable XOR-split after the sink. Instead, we can simply reconnect these nodes with the remaining nodes in their preset (if a join) or postset (if a split) which are not matched. This case is covered by
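The region growing in Algorithm 2 is a plain breadth-first search that follows an edge only when its counterpart under the mapping exists in the other graph. The following Python sketch illustrates this; the edge-set and mapping representations are simplifications we introduce here, not the paper's data structures.

```python
from collections import deque

def maximum_common_regions(g1_edges, g2_edges, mapping):
    """Grow a common region from each unvisited matched node by BFS.

    g1_edges, g2_edges: sets of (source, target) edges.
    mapping: dict from nodes of G1 to their matched nodes in G2.
    """
    visited, regions = set(), []
    for start in mapping:
        if start in visited:
            continue
        region, queue = set(), deque([start])
        while queue:
            c = queue.popleft()
            region.add(c)
            visited.add(c)
            for n in mapping:
                if n in visited or n in queue:
                    continue
                # Follow (c, n) or (n, c) only if the matched edge exists too.
                fwd = (c, n) in g1_edges and (mapping[c], mapping[n]) in g2_edges
                bwd = (n, c) in g1_edges and (mapping[n], mapping[c]) in g2_edges
                if fwd or bwd:
                    queue.append(n)
        regions.append(region)
    return regions

g1 = {("a", "b"), ("b", "c"), ("c", "d")}
g2 = {("A", "B")}
mapping = {"a": "A", "b": "B", "d": "D"}
regions = maximum_common_regions(g1, g2, mapping)
```

In the example, edge (a, b) has a matched counterpart (A, B), so a and b form one region; d is matched but has no matched edges, so it forms a singleton region.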

M. La Rosa et al.

Fig. 2. Reconnecting a maximum common region to the nodes that are not matched

function MergeConnectors (Algorithm 3). This function is invoked in the last step of Algorithm 1 to merge the preset and postset of all matched connectors, including those that are source or sink of a region, as well as any matched connector inside a region. In fact, the operation that we need to perform is the same in both cases. Since every matched connector c in CG is copied from G1, we need to reconnect to c the predecessors and successors of M(c) that are not matched. We do so by adding a new edge between each such predecessor or successor of M(c) and c. If at least one such predecessor or successor exists, we make c configurable, and if there is a mismatch between the labels of the two matched connectors (e.g. one is "xor" and the other is "and") we also change the label of c to "or". For example, the AND-join in G1 of Fig. 1 is matched with the XOR-join that precedes function "Transporting" in G2. Since both nodes are sources of the region in their respective graphs, we do not need to add a further configurable XOR-join. The only non-matched predecessor of the XOR-join in G2 is node "Delivery unblocked". Thus, we reconnect the latter to the copy of the AND-join in CG via a new edge labeled "2". Also, we make this connector configurable and change its label to "or", obtaining graph CG in Fig. 1.
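The effect of merging one matched connector pair can be sketched as follows. This is an illustrative Python fragment under assumed representations (labels as strings, the copy of c denoted "c"), not the paper's code.

```python
def merge_connector(c_label, mc_label, unmatched_preds, unmatched_succs):
    """Rewire non-matched neighbors of M(c) to the copy of c in CG.

    Returns the (possibly generalized) label of c, whether c becomes
    configurable, and the new edges to add to CG.
    """
    # c becomes configurable iff at least one non-matched neighbor exists.
    configurable = bool(unmatched_preds or unmatched_succs)
    label = c_label
    # A label mismatch (e.g. "and" vs "xor") generalizes c to "or".
    if configurable and c_label != mc_label:
        label = "or"
    new_edges = ([(p, "c") for p in unmatched_preds] +
                 [("c", s) for s in unmatched_succs])
    return label, configurable, new_edges

# The AND-join of G1 matched with an XOR-join of G2 that has one extra,
# non-matched predecessor becomes a configurable OR-join:
label, configurable, new_edges = merge_connector(
    "and", "xor", ["Delivery unblocked"], [])
```

This mirrors the "Delivery unblocked" example above: one new incoming edge is added and the connector label is generalized to "or".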

After merging two process graphs, we can simplify the resulting graph by applying a set of reduction rules. These rules are used to reduce connector chains that may have been generated after inserting configurable XOR connectors. This reduces the size of the merged process graph while preserving its behavior and its configuration options. The reduction rules are: 1) merge consecutive splits/joins, 2) remove redundant transitive edges between connectors, and 3) remove trivial connectors (i.e. those connectors with one input edge and one output edge). They are applied until the process graph cannot be further reduced. For space reasons, we cannot provide full details of the reduction rules. Detailed explanations and formal descriptions of the rules are given in a technical report [9].
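As an illustration of the third rule, the Python sketch below bypasses trivial connectors in a labeled edge dict. The representation and connector detection are simplified assumptions; the full, formally defined rules are in the cited technical report.

```python
def remove_trivial_connectors(edges, connectors):
    """Bypass each connector that has exactly one incoming and one outgoing edge.

    edges: dict {(source, target): set_of_provenance_labels}.
    connectors: nodes known to be connectors.
    """
    edges = dict(edges)
    for c in connectors:
        ins = [(s, t) for (s, t) in edges if t == c]
        outs = [(s, t) for (s, t) in edges if s == c]
        if len(ins) == 1 and len(outs) == 1:
            (p, _), (_, n) = ins[0], outs[0]
            # Replace p -> c -> n by a direct edge p -> n, keeping the
            # union of labels so provenance is preserved.
            labels = edges.pop(ins[0]) | edges.pop(outs[0])
            edges[(p, n)] = labels
    return edges

g = {("a", "x"): {"1"}, ("x", "b"): {"1", "2"}}
g = remove_trivial_connectors(g, ["x"])
```

A full implementation would iterate all three rules to a fixed point, since applying one rule can expose opportunities for another.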

The worst-case complexity of the process merging procedure is O(|NG|³), where |NG| is the number of nodes of the largest graph. This is the complexity of the process mapping step when using a greedy algorithm [2], which dominates the complexity of the other steps of the procedure. The complexity of the algorithm for merging connectors is linear in the number of connectors. The algorithm for calculating the maximum common regions is a breadth-first search, thus linear in the number of edges. The algorithm for calculating the merged


Algorithm 3. Merge Connectors
function MergeConnectors(Mapping M, {Edge} CG)
init
  {Node} S ⇐ ∅, {Node} J ⇐ ∅
begin
  foreach c ∈ dom(M) such that τ(c) = "c" do
    S ⇐ {x ∈ M(c)• | x ∉ cod(M)}
    J ⇐ {x ∈ •M(c) | x ∉ cod(M)}
    CG ⇐ (CG \ (⋃x∈S {(M(c), x)} ∪ ⋃x∈J {(x, M(c))})) ∪ ⋃x∈S {(c, x)} ∪ ⋃x∈J {(x, c)}
    foreach x ∈ S do
      αCG(c, x) ⇐ αG2(M(c), x)
    end
    foreach x ∈ J do
      αCG(x, c) ⇐ αG2(x, M(c))
    end
    if |S| > 0 or |J| > 0 then
      ηCG(c) ⇐ true
      if λG1(c) ≠ λG2(M(c)) then
        λCG(c) ⇐ "or"
      end
    end
  end
  return CG
end

model calls the algorithm for calculating the maximum common regions, then visits at most all nodes of each maximum common region, and finally calls the algorithm for merging connectors. Since the number of nodes in a maximum common region and the number of maximum common regions are both bounded by the number of edges, and given that different regions do not share edges, the complexity of the merging algorithm is also linear in the number of edges.

The merged graph subsumes the input graphs in the sense that the set of traces induced by the merged graph includes the union of the traces of the two input graphs. The reason is that every node in an input graph has a corresponding node in the merged graph, and every edge in any of the original graphs has a corresponding edge (or pair of edges) in the merged graph. Hence, for any run of the input graph (represented as a sequence of traversed edges) there is a corresponding run in the merged graph. The run in the merged graph has additional edges which correspond to edges that have a configurable XOR connector either as source or target. From a behavioral perspective, these configurable XOR connectors are "silent" steps which do not alter the execution semantics. If we abstract from these connectors, the run in the input graph is equivalent to the corresponding run in the merged graph. Furthermore, each reduction rule is behavior-preserving. A detailed proof is outside the scope of this paper.

We observe that the merging algorithm accepts both configurable and non-configurable process graphs as input. Thus, the merging operator can be used for multi-way merging. Given a collection of process graphs to be merged, we can start by merging the first two graphs in the collection, then merge the resulting configurable process graph with the third graph in the collection, and so on.
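This multi-way merging is a left fold of the binary merge operator over the collection. A short Python sketch, using a stand-in merge function (the real operator is the paper's Merge, not shown here):

```python
from functools import reduce

def merge_all(graphs, merge):
    """Fold a binary merge operator over a collection of process graphs."""
    return reduce(merge, graphs)

# Stand-in merge for illustration: union of edge sets. The actual operator
# would return a configurable process graph at each step.
merged = merge_all(
    [{("a", "b")}, {("b", "c")}, {("a", "c")}],
    lambda g1, g2: g1 | g2,
)
```

Because each intermediate result is itself a valid (configurable) process graph, the fold can proceed pairwise through the whole collection.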

4 Evaluation

The algorithm for process merging has been implemented as a tool which is freely available as part of the Synergia toolset (see: http://www.processconfiguration.com). The tool takes as input two EPCs represented in the EPML format and suggests a mapping between the two models. Once this mapping has been validated by the user, the tool produces a configurable EPC in EPML by merging the two input models. Using this tool, we conducted tests in order to evaluate (i) the size of the models produced by the merging operator, and (ii) the scalability of the merging operator.

Size of merged models. Size is a key factor affecting the understandability of process models, and it is thus desirable that merged models are as compact as possible. Of course, if we merge very different models, we can expect that the size of the merged model will be almost equal to the sum of the sizes of the two input models, since we need to keep all the information in the original models. However, if we merge very similar models, we expect to obtain a model whose size is close to the size of the largest of the two models.

We conducted tests aimed at comparing the sizes of the models produced by the merging operator relative to the sizes of the input models. For these tests, we took the SAP reference model, consisting of 604 EPCs, and constructed every pair of EPCs from among them. We then filtered out pairs in which a model was paired with itself and pairs for which the matching score of the models was less than 0.5. As a result of the filtering step, we were left with 489 pairs of similar but non-identical EPCs. Next, we merged each of these model pairs and calculated the ratio between the size of the merged model and the size of the input models. This ratio is called the compression factor and is defined as CF(G1, G2) = |CG| / (|G1| + |G2|), where CG = Merge(G1, G2). A compression factor of 1 means that the input models are totally different and thus the size of the merged model is equal to the sum of the sizes of the input models (the merging operator merely juxtaposes the two input models side-by-side). A compression factor close to 0.5 (but still greater than 0.5) means that the input models are very similar and thus the merged model is very close to one of the input models. Finally, if the matching score of the input models is

Table 1. Size statistics of merged SAP reference models

|         | Size 1 | Size 2 | Size merged | Compression | Merged after reduction | Compression after reduction |
|---------|--------|--------|-------------|-------------|------------------------|-----------------------------|
| Min     | 3      | 3      | 3           | 0.5         | 3                      | 0.5                         |
| Max     | 130    | 130    | 194         | 1.17        | 186                    | 1.05                        |
| Average | 22.07  | 24.31  | 33.90       | 0.75        | 31.52                  | 0.68                        |
| Std dev | 20.95  | 22.98  | 30.35       | 0.15        | 28.96                  | 0.13                        |


Fig. 3. Correlation between matching score of input models and compression factor (R² = 0.8377)

very low (e.g. only a few isolated nodes are similar), the addition of configurable connectors may induce an overhead, explaining a compression factor above 1.¹
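The compression factor defined above is a simple ratio. The fragment below computes it on illustrative sizes (our numbers, not values from the paper's tables):

```python
def compression_factor(size_merged, size_g1, size_g2):
    """CF(G1, G2) = |CG| / (|G1| + |G2|), as defined in the text."""
    return size_merged / (size_g1 + size_g2)

# Two heavily overlapping models of 50 nodes each merging into 55 nodes
# give a factor near 0.5; totally different models juxtaposed give 1.
cf = compression_factor(55, 50, 50)
```

Values slightly above 1 are possible when the overhead of added configurable connectors exceeds the savings from shared regions.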

Table 1 summarizes the test results. The first two columns show the sizes of the initial models. The third and fourth columns show the size of the merged model and the compression factor before applying any reduction rule, while the last two columns show the size of the merged model and the compression factor after applying the reduction rules. The table shows that the reduction rules improve the compression factor (average of 68% vs. 75%), but the merging algorithm itself yields the bulk of the compression. This can be explained by the fact that the merging algorithm factors out common regions when merging. In light of this, we can expect that the more similar two process models are, the more common regions they share and thus the smaller the compression factor is. This hypothesis is confirmed by the scatter plot in Figure 3, which shows the compression factors (X axis) obtained for different matching scores of the input models (Y axis). The solid line is the linear regression of the points.

Scalability. We also conducted tests with large process models in order to assess the scalability of the proposed merging operator. We considered four model pairs. The first three pairs capture a process for handling motor incident and personal injury claims at an Australian insurer. The first pair corresponds to the claim initiation phase (one model for motor incident and one for personal injury), the second pair corresponds to claim processing, and the third pair corresponds to the payment of invoices associated with a claim. Each pair of models has a high similarity, but the models diverge due to differences in the object of the claim.

A fourth pair of models was obtained from an agency specialized in handling applications for developing parcels of land. One model captures how land development applications are handled in South Australia, while the other captures the same process in Western Australia. The similarity between these models was

¹ In file compression, the compression factor is defined as 1 − |CG|/(|G1| + |G2|), but here we use the reverse in order to compare this factor with the matching score.


Table 2. Results of merging insurance and land development models

| Pair # | Size 1 | Size 2 | Merge time (msec.) | Size merged | Compression | Merged after reduction | Compression after reduction |
|--------|--------|--------|--------------------|-------------|-------------|------------------------|-----------------------------|
| 1      | 339    | 357    | 79                 | 486         | 0.7         | 474                    | 0.68                        |
| 2      | 22     | 78     | 0                  | 88          | 0.88        | 87                     | 0.87                        |
| 3      | 469    | 213    | 85                 | 641         | 0.95        | 624                    | 0.92                        |
| 4      | 200    | 191    | 20                 | 290         | 0.75        | 279                    | 0.72                        |

high since they cover the same process and were designed by the same analysts. However, due to regulatory differences, the models diverge in certain points.

Table 2 shows the sizes of the input models, the execution time of the merging operator and statistics related to the size of the merged models. The tests were conducted on a laptop with a dual core Intel processor, 2.53 GHz, 3 GB memory, running Microsoft Vista and SUN Java Virtual Machine version 1.6 (with 512 MB of allocated memory). The execution times include both the matching step and the merging step, but they exclude the time taken to read the models from disk. The results show that the merging operator can handle pairs of models with around 350 nodes each in a matter of milliseconds, an observation supported by the execution times we observed when merging the pairs from the SAP reference model. Table 2 also shows the compression factors. Pairs 2 and 3 have a poor compression factor (lower is better). This is in great part due to differences in the size of these two models, which yields a low matching score. For example, in the case of pair 2 (matching score of 0.56) it can be seen that the merged model is only slightly larger than the larger of the two input models.

When the insurance process models were given to us, a team of three analysts at the insurance company had tried to manually merge these models. It took them 130 man-hours to merge about 25% of the end-to-end process models. The most time-consuming part of the work was to identify common regions manually.

Later, we compared the common regions identified by our algorithm and those found manually. Often, the regions identified automatically were smaller than those identified manually. Closer inspection showed that during the manual merge, analysts had determined that some minor differences between the models being merged were due to omissions. Figure 4 shows a typical case (full node names are not shown for confidentiality reasons). Function C appears in one model but not in the other, and so the algorithm identifies two separate common regions. However, the analysts determined that the absence of C in the motor insurance model was an omission and created a common region with all four nodes. This scenario

Fig. 4. Fragment of insurance models
