All content in this area was uploaded by Martin Jergler on Jul 14, 2015
1041-4347 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TKDE.2015.2421331, IEEE Transactions on Knowledge and Data Engineering
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015 1
Safe Distribution and Parallel Execution of
Data-centric Workflows over the
Publish/Subscribe Abstraction
Mohammad Sadoghi, Martin Jergler, Hans-Arno Jacobsen, Richard Hull, Roman Vaculín
Abstract—In this work, we develop an approach for the safe distribution and parallel execution of data-centric workflows over the
publish/subscribe abstraction. In essence, we design a unique representation of data-centric workflows, specifically designed to exploit
the loosely coupled and distributed nature of publish/subscribe systems. Furthermore, we argue for the practicality and expressiveness
of our approach by mapping a standard and industry-strength data-centric workflow model, namely, IBM Business Artifacts with
Guard-Stage-Milestone (GSM), into the publish/subscribe abstraction. In short, the contributions of this work are three-fold: (1) mapping
of data-centric workflows into publish/subscribe to achieve distributed and parallel execution; (2) detailed theoretical analysis of the
mapping; and (3) formulation of the complexity of the optimal workflow distribution over the publish/subscribe abstraction as an NP-hard
problem.
Index Terms—Data-centric workflows, publish/subscribe, workflow distribution, business artifacts, case-management
1 INTRODUCTION
Typically, workflows support globally distributed business
processes (BPs) involving data and participants from disparate
geographical locations and organizations. At the same time, the
vast majority of workflow and business process management
(BPM) systems are either centralized in nature, relying on
centralized processing of associated data, or support only rather
restricted forms of distributed execution without considering data
appropriately [5], [13], [7]. In such environments, for instance,
in global corporations, it is not uncommon that large amounts
of data need to be regularly moved across the globe, resulting
in decreased business efficiency. Furthermore, compliance with
legal regulations, such as the privacy of business-relevant data,
or other constraints that are imposed by individual organizations
or even governments is hard to address. For example, the eighth
Data Protection Principle of the Data Protection Act (DPA) in the
United Kingdom requires that personal data (e.g., customer
information) must not be transferred outside the European Economic
Area unless the country or territory to which the data are to be
transferred provides an adequate level of protection for personal
data [14]. In summary, a major hurdle of current workflow
systems is their neglect of a distributed execution that adheres
to the actual geographical needs (i.e., locality of data), the
workflow scale (e.g., the number of tasks and/or instances), and
compliance with regulations or constraints [17], [12].
In recent years, there has been a growing interest in frameworks
for specifying and deploying workflows that combine
both data and process as first-class citizens [2], [24], [26], [30],
[12], [25], [18]. Data-centric workflows have the potential to
address the problem described above. Process and associated
data are tightly coupled in the sense that both are expressed in
a single model without giving explicit favor to one of them.
This simplifies workflow distribution according to geographical,
organizational, and legal constraints as only a single model
needs to be distributed. In this paper, we consider one such
data-centric BPM approach called Business Artifacts (BA) [6],
[9], [24] and a recent meta-model for modeling business artifacts
called Guard-Stage-Milestone (GSM) [15], [12], [16]. We focus
on how business processes specified in GSM can be distributed
and executed on a massively parallel infrastructure employing
the publish/subscribe (pub/sub) abstraction. Due to recent trends
towards ad-hoc and adaptable workflows (e.g., the recent Case
Management standard [20], [21], [25], which was significantly
influenced by GSM), we believe that the loosely coupled nature
of pub/sub systems provides a convenient substrate for workflow
execution. Adaptations like the addition or removal of individual
tasks, users, and constraints can be accomplished during runtime
by (un-)subscribing to events that drive the execution.
• Throughout 2010-2015, this work has been supported by an IBM Faculty
Award, an NSERC Discovery Grant, and an Alexander von Humboldt Award.
From 2010-2012, M. Sadoghi was with the Middleware Systems Research
Group at the University of Toronto.
• M. Sadoghi, R. Hull, and R. Vaculín are with IBM T.J. Watson Research
Center, Yorktown Heights, USA.
• M. Jergler is with Technische Universität München, Germany.
The ultimate goal of distributing a data-centric workflow is
to achieve an effective grouping of workflow components such
as flow activities and associated data fragments, respecting a set
of constraints such as the infrastructure topology, geographical
constraints, or pricing factors, while minimizing communication
or data transport costs. This work provides the foundation for
developing a mapping from data-centric workflow primitives
to publish/subscribe primitives, while maintaining an equivalent
operational semantics. This foundation can be applied to
identify an optimal workflow distribution that conforms to
given constraints. Moreover, the pub/sub nature decouples
individual workflow components and thus facilitates their
migration, enabling effective scalability of the system.
In the artifact-centric paradigm, BPs are modeled as
interactions of key business-relevant, conceptual entities called
Business Artifacts (or “artifacts,” for short). Artifacts are
modeled using an information model, that includes attributes for
storing all business-relevant information about the artifact, and
a lifecycle model, that represents the possible ways the artifact
might evolve. The artifact approach typically yields a high-level
factoring of BPs into a handful of interacting artifact types.
The recently introduced data-centric workflow model known
as Business Artifacts with Guard-Stage-Milestone Lifecycles
meta-model [15], [16], [12] provides a declarative approach for
specifying artifact lifecycles. GSM supports parallelism and
modularity, with an operational semantics based on a variant
of Event-Condition-Action (ECA) rules. There are four key
elements in the GSM meta-model: (a) the information model; (b)
milestones, which correspond to business-relevant operational
objectives that are achieved (and possibly invalidated) based on
triggering events and/or conditions over the information models
of BA instances; (c) stages, which correspond to clusters of
activity intended to achieve milestones; and (d) guards, which
control when stages are opened or closed, respectively. Multiple
stages of an artifact instance may be active at the same time,
which enables parallelism. Hierarchical structuring of the stages
supports a rich form of modularity.
The operational semantics of GSM is characterized by how
a single “incoming external event” is incorporated into the
current “snapshot” of the information model of a GSM-based
system [12]. This semantics extends the well-known
Event-Condition-Action (ECA) rule paradigm. It is centered around
business steps (or B-steps, for short) that focus on the full impact
of incorporating incoming external events. In particular, the focus
is on what milestones (i.e., goals or objectives) are achieved
or invalidated and what stages (i.e., tasks) are opened and
closed, as a result of the incoming event. Changes in milestone
and stage status are treated as internal “status events” and can
trigger further status changes in the B-step. Intuitively, a B-step
corresponds to the smallest unit of business-relevant change
that can occur to a data-centric workflow. In this paper, we rely
on the incremental operational semantics introduced in [12],
which resembles the incremental application of ECA-like rules
providing a natural and direct approach for its implementation.
Starting with an information model and a set of data-centric
workflow primitives (based on a set of acyclic ECA-style rules)
that rely on an incremental operational semantics, we develop
a complete mapping of data-centric workflows into the pub/sub
abstraction. We enable this workflow transformation by redefin-
ing and formalizing key pub/sub constructs such as subscriptions
and publications together with their matching conditions, as well
as consumption and notification policies. As a result, once a data-
centric workflow is transformed into the pub/sub abstraction, it
seamlessly inherits the distributed and loosely-coupled benefits
of pub/sub. Altogether, we make the following contributions:
1) Formalization of data-centric workflows and a suitable
pub/sub abstraction (Sec. 3-4).
2) Mapping of data-centric workflows into the pub/sub
abstraction to achieve distributed and parallel execution
(Sec. 5-6).
3) Detailed theoretical analysis of the mapping (Sec. 7).
4) Complexity analysis for optimal workflow distribution
over pub/sub (Sec. 8).
2 RELATED WORK
Our work is based on the data-centric business artifacts
paradigm [24], [6], [9], with the GSM meta-model being
a natural evolution from the earlier practical artifact
meta-models [10], [28], but using a declarative basis and supporting
modularity and parallelism within artifact instances. The existing
work on GSM operational semantics already addresses some sort
of parallelism [29] but does not consider the distributed execution
of business artifacts [12]. Recently, different data-centric
approaches have been proposed including the FlexConnect
meta-model [26], in which processes are organized as interacting
business objects, the Case Management paradigm [30], [20],
[25], [21], and the AXML Artifact Model [2], [?], which is
based on a declarative form of artifacts using Active XML
as a basis [1]. Another object-aware framework that aims at
unifying process and data is PHILharmonicFlows [18]. Here,
workflows are modeled as micro processes that represent the
data and behavior of individual objects and macro processes that
represent the interactions among such objects.
There exists a body of work focused on various aspects of
distributed workflow execution. For instance, [5] has a similar
goal as our work but is applied to an inherently activity-centric
workflow model, in which data is only considered as input and
output of flow activities (dataflow) and no data-centric execution
is supported. This is also true in [4], in which scheduling of
workflows in self-organizing wireless networks is addressed to
respect resource allocation constraints and dynamic topology
changes, or for [27], [19] that use pub/sub techniques to
implement some of the BPM execution aspects.
Distributed workflow execution was studied in the 1990s
to also address scalability, fault resilience, and enterprise-wide
workflow management [3], [32], [22]. A detailed design of a
distributed workflow management system was proposed in [3].
The work bears similarity with our approach in that a business
process is fully distributed among a set of nodes. However, the
distribution architectures differ fundamentally. In our approach,
a content-based message routing substrate naturally enables
decoupling, dynamic reconfiguration, system monitoring, and
run-time control. This is not addressed in the earlier work.
A behavior-preserving transformation of a centralized activity
chart, representing a workflow, into an equivalent partitioned one
is described in [22] and realized in the MENTOR system [32].
MENTOR is inspired by compiler-based techniques, including
control flow and data flow analysis, in order to parallelize the
business process [23]. However, these approaches are
complementary to our work since we operate with the original business
process model without analyzing the process. An advantage
of executing an unmodified process is that dynamic changes to
the executing business process instances are possible, as their
structure remains unchanged from the original specification.
Finally, an approach to integrate existing business processes as
part of a larger workflow is presented in [8]. The authors define
event points in business processes where events can be received
or sent. Events are filtered, correlated, and dispatched using a
centralized pub/sub model. The interaction of existing business
processes is synchronized by event communication. This is
similar to our work in terms of allowing business processes to
publish and subscribe. In our approach, activities in a business
process are decoupled, and the communication between them
is performed in a content-based pub/sub broker network.
3 DATA-CENTRIC WORKFLOWS
We begin by describing the concepts behind GSM for modeling
and executing data-centric workflows. Then we give a concrete
example of a BP represented in GSM, which will serve as a
running example throughout the remainder of the paper.
3.1 Overview of GSM Schema
A GSM schema describing a workflow is defined as a set of
artifact types with lifecycle, denoted by $\mathcal{A}$, where each
$A$ is defined by a six-tuple:
$$A = \langle x, Att, Typ, Stg, Mst, Lcyc \rangle.$$
In essence, a GSM workflow schema (or model) can succinctly
be described as the grouping of business processes into an
artifact type $A$ that corresponds to an actual business entity
within an organization. Each artifact is comprised of a set of
goal-oriented work items with lifecycles, in which a work item is
modeled as stages ($Stg$) and goals are referred to as milestones
($Mst$). In addition, each artifact may have many instances ($x$),
i.e., workflow instances or enactments over a globally shared
information model in order to store relevant data, e.g., a set of
data and status attributes ($Att$) and their associated data types
($Typ$). Moreover, the lifecycle schema, i.e., the blueprint for the
artifact's evolution through its various stages, is given by:
$$Lcyc = \langle Substage, Tasks, Owns, Guards, Ach, Inv \rangle.$$
The lifecycle of each stage captures the hierarchy of its substages
($Substage$), encapsulation of a task within each (sub)stage
($Tasks$), information about stage nesting ($Owns$), conditions
for enabling (sub)stages ($Guards$), conditions for determining
the successful completion of (sub)stages ($Ach$), and conditions
for disabling (sub)stages ($Inv$). Roughly speaking, the GSM
schema defines a workflow through the lens of a stage, guards
for entering a stage, and milestones for leaving a stage.
A key primitive GSM construct, in addition to guard, stage,
and milestone, is the sentry, which in fact is the building block
of guards and milestones. A sentry is a Boolean formula of
type $\chi(x)$ that consists of two parts: the (triggering) event
$\xi(x)$, which is a Boolean formula to test the type of an incoming
external event, and a condition $\varphi(x)$, which is a Boolean
formula defined over a subset of status attributes. A sentry may
have three different forms: (i) on $\xi(x)$ if $\varphi(x)$;
(ii) on $\xi(x)$; and (iii) if $\varphi(x)$.
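The three sentry forms can be illustrated with a small Python sketch. It is not taken from the GSM implementation: the `Sentry` class, its field names, the dictionary encoding of a snapshot, and all event and attribute names are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Snapshot = Dict[str, bool]  # assumed encoding: status attribute -> value


@dataclass(frozen=True)
class Sentry:
    """Hypothetical sentry with an optional event test and an optional condition."""
    on: Optional[str] = None                           # xi(x): expected event type
    cond: Optional[Callable[[Snapshot], bool]] = None  # phi(x): test over status attributes

    def holds(self, event_type: str, snapshot: Snapshot) -> bool:
        if self.on is None and self.cond is None:
            raise ValueError("a sentry needs an event, a condition, or both")
        event_ok = self.on is None or self.on == event_type
        cond_ok = self.cond is None or self.cond(snapshot)
        return event_ok and cond_ok


# The three forms; event and attribute names are made up for illustration.
on_if = Sentry(on="R:Resume", cond=lambda s: s["stage_open"])      # (i)   on xi if phi
on_only = Sentry(on="R:Start")                                     # (ii)  on xi
if_only = Sentry(cond=lambda s: s["goal_a"] and not s["goal_b"])   # (iii) if phi
```

A missing part is treated as trivially satisfied, so form (ii) ignores the snapshot and form (iii) ignores the event type.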
With respect to GSM execution, we focus on the incremental
formulation of the GSM operational semantics: a variation
of incremental firing of Event-Condition-Action (ECA) rules,
known as Prerequisite-Antecedent-Consequent (PAC) rules. PAC
rules differ from traditional ECA rules in that they also
incorporate a temporal aspect, i.e., the prerequisite, allowing
for conditions on prior system states. The set of PAC rules
can be derived in polynomial time from a GSM schema [12].
More importantly, the order of PAC rule firing is defined by the
generalized notion of the Polarized Dependency Graph (PDG).
The PDG imposes a topological sort order on PAC rule firing,
essentially a PAC rule stratification, in which no cyclic relations
among PAC rules are allowed, which requires the PDG to
be acyclic. The PDG-imposed order on rule firing guarantees the
uniqueness and the termination properties in the context of
defining the smallest logical unit of work, i.e., a B-step, as the
well-formedness of a finite set of PAC rules within the B-step [12].
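To make the role of acyclicity concrete, the following Python sketch layers the nodes of a dependency graph so that every edge points from an earlier layer to a later one, and rejects cyclic input. It uses a layered variant of Kahn's algorithm as a stand-in; it is not the PDG construction of [12], and the function name `strata` and the node labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def strata(edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Layer the nodes of an acyclic graph so that every edge points from an
    earlier layer to a later one (a layered variant of Kahn's algorithm)."""
    succ: Dict[str, Set[str]] = defaultdict(set)
    indeg: Dict[str, int] = {}
    for src, dst in edges:
        indeg.setdefault(src, 0)
        indeg.setdefault(dst, 0)
        if dst not in succ[src]:      # ignore duplicate edges
            succ[src].add(dst)
            indeg[dst] += 1

    layer = sorted(n for n, d in indeg.items() if d == 0)
    layers: List[List[str]] = []
    placed = 0
    while layer:
        layers.append(layer)
        placed += len(layer)
        ready = []
        for node in layer:
            for nxt in succ[node]:
                indeg[nxt] -= 1
                if indeg[nxt] == 0:
                    ready.append(nxt)
        layer = sorted(ready)
    if placed != len(indeg):
        raise ValueError("cyclic dependency: no valid firing order exists")
    return layers


# Hypothetical dependencies: a guard opens a stage, which affects two milestones.
example = [("+guard", "+stage"), ("+stage", "-milestone_a"), ("+stage", "-milestone_b")]
order = strata(example)
```

Rules attached to nodes of the same layer do not depend on each other in this sketch, which is the property a stratified firing order relies on.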
The incremental formulation of GSM (in turn, the execution
of PAC rules in the prescribed order of the PDG) is driven and
initiated upon receiving an external event from the environment.
The set of all relevant PAC rules is executed in response
to the external event; the firing of PAC rules is sequenced
to form an atomic step. The semantics of such a B-step with
respect to the overall GSM system state snapshot (i.e., the
instantiation of the information model) is summarized using a
5-tuple $\langle \Sigma, e, t, \Sigma', Gen \rangle$, where $\Sigma$
is the current system snapshot of the GSM instance prior to
consuming the external event $e$, $\Sigma'$ is the new snapshot of
the system after firing all relevant PAC rules that are triggered
directly or indirectly by the external event $e$, and $Gen$ is a set
of generated immutable events as a result of 1-way and 2-way
service calls that may be encapsulated in a task.
DEFINITION 1. An immutable event is a static instantiation of
an event schema such that all its attribute values are predefined
and not changed at runtime.
A task itself is encapsulated in a stage. Thus, the B-step is
formalized with respect to the sequence of PAC rule firings such
that $\Sigma = \Sigma_0, \Sigma_1, \Sigma_2, \cdots, \Sigma_n = \Sigma'$
(where $\Sigma_0 \neq \Sigma_1$). Thus, after applying the $i$th PAC
rule, according to the order imposed by the PDG, the state advances
from $\Sigma_i$ to $\Sigma_{i+1}$, which is also referred to as a
micro-B-step in GSM.
The key properties surrounding B-steps are that a B-step
$\langle \Sigma, e, t, \Sigma', Gen \rangle$ always terminates and
ends in a unique state $\Sigma'$, where $\Sigma \neq \Sigma'$. We
refer to these as uniqueness properties of B-steps [12]. They are
achieved in part by restricting that each $Att$ in the GSM schema
changes at most once as a result of PAC rules firing within the
context of a single B-step (i.e., toggle-once property), which
implies that a change cannot be undone, and in part by executing
all relevant$^1$ PAC rules whose consequents are reachable in the
PDG graph and in an order imposed by the PDG, namely, visiting
every reachable node in the PDG using a strata-based,
breadth-first graph traversal.
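The interplay of ordered rule firing and the toggle-once property can be sketched in Python as follows. This is a toy model under stated assumptions, not the semantics of [12]: the `PacRule` structure, the flat dictionary snapshot, and the reading that a prerequisite is tested on the snapshot before the B-step while an antecedent sees the intermediate state are all assumptions made here, and the rule order is simply passed in rather than derived from a PDG.

```python
from typing import Callable, Dict, List, NamedTuple, Set, Tuple

State = Dict[str, object]  # attribute name -> value (assumed flat snapshot)


class PacRule(NamedTuple):
    prerequisite: Callable[[State], bool]        # tested on the snapshot before the B-step
    antecedent: Callable[[State, State], bool]   # tested on (previous snapshot, current state)
    attr: str                                    # status attribute set by the consequent
    value: bool


def b_step(sigma: State, ordered_rules: List[PacRule]) -> Tuple[State, List[State]]:
    """Apply rules in the given order; return the final state and the
    sequence Sigma_0, Sigma_1, ... of micro-step states."""
    current = dict(sigma)
    toggled: Set[str] = set()
    trace = [dict(current)]
    for rule in ordered_rules:
        if rule.attr in toggled:                 # toggle-once: never undo a change
            continue
        if (rule.prerequisite(sigma)
                and rule.antecedent(sigma, current)
                and current[rule.attr] != rule.value):
            current[rule.attr] = rule.value
            toggled.add(rule.attr)
            trace.append(dict(current))
    return current, trace


# Made-up two-rule workflow: an event opens a stage, which invalidates a milestone.
rules = [
    PacRule(lambda prev: not prev["stage"],
            lambda prev, cur: cur["latestIncEvent"] == "R:Go",
            "stage", True),
    PacRule(lambda prev: prev["milestone"],
            lambda prev, cur: cur["stage"] and not prev["stage"],
            "milestone", False),
]
sigma = {"latestIncEvent": "R:Go", "stage": False, "milestone": True}
sigma_prime, trace = b_step(sigma, rules)
```

Each entry appended to `trace` corresponds to one micro-B-step; because every rule is visited once and no attribute is changed twice, the loop terminates after at most one change per attribute.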
The GSM schema consists of six distinct types of PAC rules
that are also described in Appendix B: PAC-1 for achieving
guards; PAC-2 for achieving milestones; PAC-3 for invalidating
a milestone once its stage is opened; PAC-4 for invalidating
a milestone once its invalidating sentry is achieved; PAC-5 for
closing a stage when one of its milestones is achieved; and
PAC-6 for closing a substage when its parent stage is closed.
The GSM execution model assumes a global, external event
queue, and the current GSM operational semantics is serialized
w.r.t. the external event queue. In this work, we also rely on a
global event queue to orchestrate concurrent B-step executions
such that the event queue behaves as a pseudo-global clock.
However, we interleave and pipeline the processing of multiple
B-steps over a loosely coupled, distributed pub/sub infrastructure,
in which each B-step is associated with a different external
event. We achieve this distributed and parallel execution while
guaranteeing an identical behavior as if B-steps were processed
centrally and in the sequential order of the global event queue.
1. If the PAC rule's prerequisite and antecedent are satisfied and its consequent
is applied, the current state of a GSM instance changes.
[Figure 1: lifecycle model with stages Requirements Approval (RA),
Engineering Design (ED), Legal Reviewing (LR), Evaluating Country
Restrictions (ECR), and Preparing Export Documents (PE), guards g1 to g9,
and milestones RA:ap, ED:cp, ED:sp, LR:cp, ECR:ev, PE:pp, PE:sp;
information model with data attributes (design, customer, requirements,
latestIncEvent, ...) and status attributes for stages (open/closed)
and milestones.]
Fig. 1. “Design-to-order” business process.
For enhanced comprehensibility, we focus on the core aspects
of GSM workflows (i.e., the data and the lifecycle) to describe
our mapping. Therefore, we formalize a GSM workflow schema,
$\Gamma$, as follows:
$$\Gamma = \langle I, R \rangle,$$
where $I$ is the workflow information model that consists of a
set of ordered $\langle attr, datatype \rangle$-pairs and
distinguishes between data attributes ($I_d$), i.e., application
data, and status attributes ($I_s$) (describing the state of the
workflow within its lifecycle). The number of status attributes is
finite and bounded by the schema. $R$ is a set of acyclic PAC rules
representing the lifecycle. The operational semantics of $\Gamma$
follows the general notion of incremental operational semantics [12].
3.2 Example of Data-centric Workflow in GSM
The example depicted in Figure 1 (cf. [12]) represents a
data-centric GSM model for a product design process on behalf
of an external customer. It is structured into various stages (i.e.,
rounded rectangles) describing the individual task definitions.
Guards are denoted by diamonds and milestones by circles.
Upon a customer order, i.e., an external event of type
R:NewOrder, a new workflow is instantiated and the corresponding
product requirements are approved. Once the requirements have
been approved (i.e., an external event of type
T:RequirementsApproval, which indicates that the clerk in charge
has finished the task, has been received), the actual engineering
stage is opened. In case the customer decides to change
the requirements afterwards (i.e., R:CustomerChange), the
design stage is suspended and the requirements are approved
again. The legal reviewing of the order is encapsulated in a
separate stage comprising two sub-stages. While country restrictions
can be evaluated in parallel with the approval of requirements, the
preparation of the export documents requires a completed design.
Furthermore, preparation is suspended and country restrictions
are re-evaluated if requirements change. The whole process is
accomplished once the export documents are prepared.
[Figure 2: PDG with one positive (+) and one negative (-) node per stage
(RA, ED, LR, ECR, PE) and per milestone (RA:ap, ED:cp, ED:sp, ECR:ev,
PE:pp, PE:sp, LR:cp), and one positive node per guard (g1 to g9).]
Fig. 2. Polarized dependency graph (PDG) for BP.
GUARD  SENTRY
g1     latestIncEvent = “R:NewOrder”
g2     latestIncEvent = “R:CustomerChange”
g3     RA:ap ∧ ECR:ev ∧ ¬ED:cp
g4     latestIncEvent = “R:ResumeDesign”
g5     latestIncEvent = “R:NewOrder”
g6     latestIncEvent = “R:RedoExportDocs”
g7     ¬ECR:ev
g8     ⊕ED:cp
g9     latestIncEvent = “R:RedoExportDocs”
TABLE 1
Sentries associated with guards.
MILESTONE  TYPE  SENTRY
RA:ap      Ach   latestIncEvent = “T:RequirementsApproval”
ED:cp      Ach   latestIncEvent = “T:EngineeringDesign”
           Inv   ⊖RA:ap
ED:sp      Ach   ⊖RA:ap
ECR:ev     Ach   latestIncEvent = “T:EvalCountryRestrictions”
PE:pp      Ach   latestIncEvent = “T:PreparingExportDocs”
           Inv   ⊖ED:cp
PE:sp      Ach   ⊖ED:cp
LR:cp      Ach   ⊕PE:pp
           Inv   ⊖PE:pp
TABLE 2
Sentries associated with milestones.
The above behavior is implicitly described by the sentries
associated with guards (cf. Table 1) and milestones (cf. Table 2) of
the GSM schema. The triggering events in the sentry definitions
can be either external or internal events. Conceptually, external
events are further divided into request-events invoking a task
(indicated with an “R” in the example) and task-termination-events
starting with a “T”. Internal events represent status attribute
updates in the information model, whereby ⊕ indicates that an
attribute toggled to true and ⊖ indicates that it toggled to false.
For example, guard g1 is achieved if the latest
incoming external event was of type R:NewOrder,
which corresponds to a new customer request. Similarly,
milestone ExportDocsPrepared is invalidated if an
internal status-update event notified the invalidation of milestone
DesignCompleted (i.e., ED:cp, for short).
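The difference between a sentry that tests a status value and one that tests a status toggle can be sketched in Python using two guards from Table 1. The helper names, the dictionary encoding, and the ASCII prefixes `+` and `-` (standing in for the toggled-to-true and toggled-to-false markers) are assumptions made for this illustration only.

```python
from typing import Dict, Set

Status = Dict[str, bool]


def status_events(before: Status, after: Status) -> Set[str]:
    """Internal events caused by moving from `before` to `after`.
    '+' stands in for the toggled-to-true marker, '-' for toggled-to-false."""
    events = set()
    for attr, new in after.items():
        old = before.get(attr, False)
        if new and not old:
            events.add("+" + attr)
        elif old and not new:
            events.add("-" + attr)
    return events


def g3(status: Status) -> bool:
    # Table 1: RA:ap AND ECR:ev AND NOT ED:cp (a condition over current values)
    return status["RA:ap"] and status["ECR:ev"] and not status["ED:cp"]


def g8(events: Set[str]) -> bool:
    # Table 1: ED:cp toggled to true (an internal status event)
    return "+ED:cp" in events


before = {"RA:ap": True, "ECR:ev": True, "ED:cp": False}
after = {"RA:ap": True, "ECR:ev": True, "ED:cp": True}
events = status_events(before, after)
```

Here g3 holds on `before` but no longer on `after`, whereas g8 depends only on the change between the two snapshots.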
The PAC rules for this workflow are derived from the
GSM model according to the six rule templates described in
Appendix B (cf. also [12]). Altogether, this comprises a set of 41
rules that are depicted in Appendix D. An excerpt of three PAC
rules, which will be relevant for a subsequent running example of
the mapping, is depicted in Table 3. The order of PAC rule firing
for the GSM operational semantics is described by the PDG
depicted in Figure 2. It has been established according to the
construction algorithm described in Appendix C (cf. also [12]).
In the rest of the paper we exploit this example to illustrate
our GSM-to-pub/sub mapping. In particular, we show the
construction of two distinct subscriptions capturing the semantics
of (1) the invalidation of milestone PE:pp (based on the rules
depicted in Table 3) and (2) maintaining a consistent view on
status attribute ExportDocsPrepared (i.e., PE:pp).
NO  PREREQUISITE  ANTECEDENT                                  CONSEQUENT
1   PE:pp         ⊖ED:cp                                      ⊖PE:pp
2   PE:pp         ⊕ED:cp ∧ LR                                 ⊖PE:pp
3   PE:pp         latestIncEvent = “R:RedoExportDocs” ∧ LR    ⊖PE:pp
TABLE 3
Excerpt of PAC rules for “Design-to-order” BP.
4 PUBLISH/SUBSCRIBE SCHEMA
In this section, we present the necessary formalization of the
pub/sub abstraction for subsequently being able to prove the
correctness of our mapping from data-centric workflows to
pub/sub. At the core of the pub/sub abstraction lies a set of
publications ($\mathcal{P}$) and subscriptions ($\mathcal{S}$).
Each publication, $P \in \mathcal{P}$, is defined as follows:
$$P = \langle E \rangle,$$
where $E$ defines the publication's event schema that consists of
a set of ordered $\langle attr, datatype \rangle$-pairs. Events are
instances of this schema and defined as sets of ordered
$\langle attr, value \rangle$-pairs, where $value$ is an instance of
the $datatype$ specified in $E$. Over time, a publisher continuously
produces events that conform to its event schema. Each subscription,
$S \in \mathcal{S}$, is defined as follows:
$$S = \langle D, \Phi(\rho_k), \delta(\rho_k), N(\rho_k), \Psi(\rho_k) \rangle \qquad (1)$$
where $\rho_k = \langle e, t, x \rangle$, with event type $e$,
logical event time $t$, and subscription instance $x$ ($x$ is a
context variable essentially identifying a concrete workflow
instance).
$D$ is the data model. It describes the internal state of a
subscription and its unique key is formed by the triplet $\rho_k$.
$$D = \langle e, t, x, onE_1, \cdots, onE_m, d_1, \cdots, d_n, s_1, \cdots, s_p, visited \rangle$$
For every toggling status attribute appearing in antecedents of
PAC rules there is a column $onE_i$ in $D$. Moreover, there is a
column for every data attribute $d_i$ (i.e., application data) and
every status attribute $s_j$ (i.e., internal workflow state) appearing
in logical expressions of PAC rules. The tuple with key $\rho_k$
maintains these values as the result of receiving events associated
with $\rho_k$. The final column $visited$ indicates whether or not all
the values in this tuple have stabilized, i.e., no longer change
as a result of external event $\rho_k$. Setting $visited$ to true in
the tuple with key $\rho_k$ implies that this tuple is now a read-only
tuple and any notification (event generation) associated with $S$ for
event $\rho_k$ has been completed. A read-only tuple is retained for
maintaining an execution history and for enabling parallel and
distributed processing of PAC rules. We define the domain range
for status attributes as $DOM_{status} = \{true, false, \emptyset\}$,
i.e., the Boolean constants together with a special symbol
$\emptyset$. The domain for status changes is defined as
$DOM_{toggling} = \{\langle Boolean \times Boolean \rangle, \emptyset\}$,
i.e., all possible transitions for the status attribute together
with $\emptyset$. Similarly, the domain range for data attributes
is defined as $DOM_{data} = \{String \cup Number \cup \emptyset\}$,
i.e., any string or number together with $\emptyset$. The special
symbol $\emptyset$ indicates that the current value is unstable,
i.e., the attribute has not been updated in the context of the
external event $\rho_k$, while all other domain values are
considered as stable.
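A tuple of the data model, with its unstable values and its read-only state once visited is set, can be sketched in Python as below. The `Row` class, the `UNSTABLE` sentinel, the method names, and the key and column values are assumptions chosen for illustration; in particular, initializing every column as unstable for a new external event is one possible reading of the description above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


class _Unstable:
    """Sentinel standing in for the special 'unstable' symbol."""
    def __repr__(self) -> str:
        return "UNSTABLE"


UNSTABLE = _Unstable()
Key = Tuple[str, int, str]  # rho_k = (event type e, logical time t, instance x)


@dataclass
class Row:
    values: Dict[str, object]
    visited: bool = False

    def put(self, column: str, value: object) -> None:
        if self.visited:
            raise RuntimeError("tuple is read-only once visited is set")
        if column not in self.values:
            raise KeyError(column)
        self.values[column] = value

    def is_stable(self) -> bool:
        return all(v is not UNSTABLE for v in self.values.values())

    def mark_visited(self) -> None:
        if not self.is_stable():
            raise RuntimeError("cannot mark an unstable tuple as visited")
        self.visited = True


def new_row(columns: Iterable[str]) -> Row:
    # Assumption: every column starts out unstable for a new external event.
    return Row({c: UNSTABLE for c in columns})


table: Dict[Key, Row] = {}
key: Key = ("R:RedoExportDocs", 7, "order-42")   # made-up key values
table[key] = new_row(["onED:cp", "LR", "PE:pp"])
```

Keeping visited rows in `table` rather than deleting them mirrors the idea of retaining read-only tuples as an execution history.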
$\Phi(\rho_k)$ is the subscription's matching condition. It is
a disjunction over $\phi_i(\rho_k) \in \Phi(\rho_k)$, where each
$\phi_i(\rho_k)$ is a condition, that is, a logical formula
representing the antecedent of a PAC rule, that is instantiated and
correlated with each external event $\rho_k$. This condition is
expressed over the condition language $L$ that is a subset of
First-Order Logic (FOL) supporting: scalar values, binary relations
(i.e., logical operators ($\lor, \land, \rightarrow$) and relational
operators ($<, \leq, =, \neq, \geq, >$)), the unary relation $\neg$,
and quantification over subscription instances $\rho_k$, i.e.,
$\forall$ and $\exists$. The quantification domain for $\rho$ is
totally ordered by time $t$ and instance $x$. Furthermore, we define
the following functions.
1) $\tau_k(attr, \rho_k)$, or simply, $\tau(attr)$, which returns the
current value of the attribute $attr$ w.r.t. $\rho_k$ in $D$.
2) $\tau_{k-1}(attr, \rho_k)$, which returns the last value of the
attribute $attr$ w.r.t. $\rho_k$ for $k > 2$ in $D$; otherwise it
returns $False$ for Boolean attributes, and a default or a null
value ($\bot$) for non-Boolean attributes.
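The two lookup functions can be approximated in Python as follows. The `History` mapping and the function names are hypothetical, and the fallback is triggered here whenever no earlier tuple is stored, which simplifies the index condition stated above.

```python
from typing import Dict, Optional

History = Dict[int, Dict[str, object]]  # k -> attribute values recorded for rho_k


def tau(history: History, k: int, attr: str) -> object:
    """Current value of `attr` with respect to rho_k."""
    return history[k][attr]


def tau_prev(history: History, k: int, attr: str, boolean: bool) -> Optional[object]:
    """Value of `attr` recorded for the preceding event, with the stated
    fallbacks when no earlier tuple is stored (a simplifying assumption)."""
    previous = history.get(k - 1)
    if previous is None:
        return False if boolean else None   # None plays the role of the null value
    return previous[attr]


# Made-up history for two consecutive external events.
history = {
    1: {"PE:pp": True, "design": "v1"},
    2: {"PE:pp": False, "design": "v2"},
}
```

Comparing `tau` and `tau_prev` for the same attribute is one way to detect a toggle between consecutive events.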
Finally, we resort to three-valued logic with three possible values (i.e., $true$, $false$, $unknown$), where $unknown$ is the interpretation of the unstable value ($\varnothing$). We do not consider the null value ($\bot$) as unstable, and we do not permit the null value for Boolean variables. We define the evaluation of any logical binary or unary operator involving $unknown$ as $unknown$, whereas we rely on traditional two-valued logic when no $unknown$ value is present. Also, when dealing with different system snapshots $\Sigma$, to differentiate an attribute value among different snapshots when it is not clear from the context, we extend the definition of $\tau$ to include $\Sigma$ as an input parameter as follows: $\tau_k(\Sigma, attr, \rho_k)$ or $\tau(\Sigma, attr)$.
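The evaluation rule for $unknown$ can be illustrated with a few small helper functions. This is a sketch only; the names `UNKNOWN`, `not3`, `and3`, `or3`, and `implies3` are hypothetical. Note that, under the rule stated above, every operator involving $unknown$ yields $unknown$, so $false \land unknown$ is $unknown$ here, unlike in Kleene's strong three-valued logic.

```python
UNKNOWN = object()  # hypothetical marker for the interpretation of the unstable value


def not3(a):
    return UNKNOWN if a is UNKNOWN else (not a)


def and3(a, b):
    # Any operand that is UNKNOWN makes the whole result UNKNOWN.
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return a and b


def or3(a, b):
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return a or b


def implies3(a, b):
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return (not a) or b
```

When both operands are ordinary Booleans, the helpers fall back to two-valued logic, as described in the text.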
$\delta$ is the subscription's consumption policy, which describes how the internal state of a subscription changes after consuming an event (cf. Section 6).
$N(\rho_k)$ is the subscription's notification policy, which is also a disjunction over $\nu_i(\rho_k) \in N(\rho_k)$ and is instantiated and correlated with each external event $\rho_k$. The notification consists of the notification schema, which describes the content of the event(s) (its payload), and a set of conditions $\nu_i(\rho_k)$ that dictate how the content of the event is generated.
$\Psi(\rho_k)$ defines the relationship between a subscription's condition $\Phi$ and a subscription's notification policy $N$ and is represented as a set of ordered pairs $\langle \varphi \in \Phi, \nu \in N \rangle$, where each $\varphi_i$ is associated with the corresponding $\nu_i$, meaning that when the matching condition $\varphi_i$ is satisfied, the notification condition $\nu_i$ is evaluated:
$$\Psi_s(\rho_k) = \bigcup_{\varphi_i \in \Phi} \langle \varphi_i(\rho_k), \nu_i(\rho_k) \rangle.$$
An instance of the subscription $S$ consists of an internal state $\Sigma^S_j$ over the data model $D$. The internal state of a subscription can only be changed upon receiving (consuming) an external event or generating an event (notification). In general, the internal state together with an event shapes the subscription operational semantics $O_S$ (a.k.a. the matching semantics), which is summarized as a 6-tuple:
$$O_S = \langle \Sigma^S_j, e, t, x, \Sigma^S_{j+1}, Gen \rangle.$$
1) $\Sigma^S_j$ is the current internal state of the subscription $S$.
2) $e$ is an occurrence of an external immutable event.
3) $t$ is the logical time, which is greater than all logical timestamps occurring in $\Sigma^S_j$.
4) $x$ is a variable that ranges over the IDs of instances of $S$. This is referred to as the context variable of $S$.
5) $\Sigma^S_{j+1}$ is the internal state after consuming event $e$.
6) $Gen$ is the set of generated immutable event occurrences (generated by the notification policy) as a reaction to the external event $\rho_k$.
Consequently, the operational semantics for a subscription, $O_S$, is formally defined as follows.
DEFINITION 2. Given a subscription $S$ with internal state $\Sigma^S_j$ and an external event $e$ at time $t$ for the instance $x$ (denoted by $\rho$) of $\Sigma^S_j$, the subscription $S$ examines $e$ and either accepts $e$ and makes a transition $\Sigma^S_j \overset{e,t,x}{\longmapsto} \Sigma^S_{j+1}$, or rejects $e$ if neither the consumption policy nor the notification policy defines a state change for $e$.
We formally define the pub/sub schema $\Pi$ as follows:
$$\Pi = \langle \mathcal{P}, \mathcal{S}, \mathcal{E}, \mathcal{C} \rangle$$
1) $\mathcal{P}$ is a set of publications.
2) $\mathcal{S}$ is a set of subscriptions.
3) $\mathcal{E}$ is the event schema that captures both the publications' event schemas and the subscriptions' notification schemas.
4) $\mathcal{C}$ is the communication state, which maintains for each external event its type ($e$), the logical time of its occurrence ($t$), the subscription instance that processed it ($x$), and the subscription type $S$, formalized as $\mathcal{C} = \langle e, t, x, S \rangle$.
Without loss of generality, if there is more than a single publisher for external events, our formal pub/sub model assumes the following two properties: (i) each subscription instantaneously examines the external event $e$ according to the subscription operational semantics; (ii) at any instant in time, only a single subscription is examining $e$ in $\Pi$. These assumptions simplify the correctness proof for our mapping (cf. Section 7), as external events are inspected in the same sequential order by all subscriptions. We refer to this as the pseudo-serializable execution property of $\Gamma$.²
An instance of the pub/sub schema $\Pi$ is defined as a sequence of global snapshots $\Sigma_1 \cdots \Sigma_k$ over a discrete time space $t$, where $\Sigma_i = \{\Sigma^C_k, \Sigma^S_j\}$, $\Sigma^C_k$ is the communication state at time $t$ over $\Pi$, and $\Sigma^S_j$ is the internal state of each subscription instance of $S$. $\Pi$'s operational semantics is summarized as follows:
$$O_\Pi = \langle \Sigma_k, e, t, x, \Sigma_{k+1} \rangle.$$
Here, $\Sigma_k$ is the current snapshot of $\Pi$, and $e$ is an occurrence of an external event that is pending, implying that there is at least one instance $x$ of at least one subscription $S$ that has not yet examined $e$ at logical time $t$. $\Sigma_{k+1}$ is the new global snapshot of $\Pi$. Formally, the operational semantics of the pub/sub model, $O_\Pi$, is defined as follows:
DEFINITION 3. Given a pending event $\rho_k = (e, t, x)$ and a subscription $S$ that has yet to examine $\rho_k$ and has the current state $\Sigma^S_j$, the global snapshot advances instantaneously from $\Sigma_k \overset{e,t,x}{\longmapsto} \Sigma_{k+1}$, namely,
1) The communication state transitions from $\Sigma^C_k \overset{e,t,x}{\longmapsto} \Sigma^C_{k+1}$, i.e., event $e$ was sent to $S$ for instance $x$.
2) The subscription $S$ examines the external event $e$ in accordance with $O_S$; hence, $S$ either accepts $e$ and transitions from $\Sigma^S_j \overset{e,t,x}{\longmapsto} \Sigma^S_{j+1}$ or rejects $e$.
² The practical implication of these assumptions is that with multiple external event publishers, all external events must be serialized. This requires a synchronization mechanism between external event publishers in order to generate a total order over a discrete time space $t$. A solution to this problem in content-based pub/sub systems is presented in [33].
We define a valid execution sequence over $\Pi$ as one that corresponds to a pseudo-serializable execution such that, at any instant in time, $\Pi$ transitions only once from state $\Sigma^C_k \overset{e,t,x}{\longmapsto} \Sigma^C_{k+1}$, and only a single instance of subscription $S$ receives an event $e$ and transitions from $\Sigma^S_j \overset{e,t,x}{\longmapsto} \Sigma^S_{j+1}$ (if necessary). Notably, at any instant in time, many subscriptions (or many instances of a single subscription) may be waiting to receive the event $e$; however, the pseudo-serializable execution property does not impose any restriction on the order in which subscriptions (or instances of a subscription) must receive the event $e$. Therefore, any non-deterministic selection of subscriptions (or instances) that results in an instantaneous examination of event $e$ at time $t$ by a single subscription instance $x$ suffices. Most importantly, this pseudo-serialization requirement can be dropped when there is a single publisher of external events (cf. assumptions on the formal pub/sub model).
An event is pending only if at least one subscription instance has not examined it yet, and (in theory) every subscription instance must examine every event exactly once. Therefore, from the communication state $\mathcal{C}$, it can be inferred which events have been processed for which instances of subscription $S$ and which events are pending for which instances of $S$.
Finally, in general, with more than one publisher of external events, any valid implementation of $\Pi$ must guarantee the pseudo-serializable execution property.
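The role of the communication state in tracking pending events can be sketched as follows. This is an illustrative approximation rather than the system's implementation: the class `CommunicationState`, its methods, and the representation of each record as a tuple `(e, t, x, S)` are assumptions made for the example. One delivery happens per call, and the instance that examines the event next is chosen non-deterministically, as permitted by the pseudo-serializable execution property.

```python
import random


class CommunicationState:
    """Records which (event, time, instance, subscription) tuples were examined."""

    def __init__(self, instances):
        # instances: iterable of (subscription_name, instance_id) pairs
        self.instances = set(instances)
        self.examined = set()

    def pending(self, e, t):
        """Subscription instances that have not yet examined event e at time t."""
        return sorted(
            (s, x) for (s, x) in self.instances
            if (e, t, x, s) not in self.examined
        )

    def deliver_next(self, e, t, rng=random):
        """Let exactly one pending instance examine the event; None if done."""
        todo = self.pending(e, t)
        if not todo:
            return None
        s, x = rng.choice(todo)  # any non-deterministic selection suffices
        self.examined.add((e, t, x, s))
        return s, x
```

Repeatedly calling `deliver_next` until it returns `None` visits every subscription instance exactly once for the given event, in some arbitrary order.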
5 WORKFLOW MAPPING OVERVIEW
Given a data-centric workflow schema $\Gamma = \langle \mathcal{I}, \mathcal{R} \rangle$, we construct a pub/sub schema $\Pi = \langle \mathcal{P}, \mathcal{S}, \mathcal{E}, \mathcal{C} \rangle$ by applying a mapping function $M$ such that $M: \Gamma \longrightarrow \Pi$. The set $\mathcal{P}$ in our mapping consists of a single publisher, which simply publishes the external events coming from the environment. However, constructing the set of necessary subscriptions, $\mathcal{S}$, is more subtle and is primarily derived from the set of PAC rules and the PDG for a given schema, $\Gamma$. In addition, we require a set of subscriptions for bookkeeping purposes such as updating data and status attributes and determining the start and the end of a B-step.
We define subscriptions both for processing relevant PAC rules and for maintaining the current values of status and data attributes. In general, two classes of subscriptions arise: (1) Application-specific subscriptions, which capture the core of the workflow operational semantics by encoding both the PAC rule semantics and the PDG topological sort order semantics. (2) Generic subscriptions, which implement a bookkeeping mechanism to provide a consistent view of the data with an implicit locking mechanism. This mechanism maintains multiple versions of values for all attributes from the information model and applies updates in a deterministic order dictated by the order of external events. Hence, there is one version for each $\rho_k$, i.e., for each B-step. These two classes of subscriptions also incorporate the time semantics of the workflow schema, which is based on the external event received from the single publisher in our pub/sub formulation. Therefore, subscriptions are event-relativized in the sense that each subscription evaluates its conditions, implements its consumption policy, and sends its notification in the context of each external event in isolation, which forms a B-step.
Fig. 3. High-level illustration of subscription flow (nodes: the environment; bookkeeping subscriptions such as $S_{source}$, $S_{sink}$, and $S_{s_i}$; application-specific subscriptions such as $S_{\oplus s_i}$ and $S_{\ominus s_i}$).
In our mapping, $M: \Gamma \longrightarrow \Pi$, we require the following set of subscriptions for key workflow operations: [$S_{\oplus s}$] and [$S_{\ominus s}$] for satisfying/falsifying or validating/invalidating the status attribute $s$; [$S_s$] for updating the status attribute $s$; [$S_d$] for updating the data attribute $d$; [$S_{source}$] for identifying the start of a B-step; and [$S_{sink}$] for identifying the end of a B-step, where the $\oplus$ or $\ominus$ polarity denotes a positive or a negative change in status attributes. Next, we provide a high-level overview of each subscription.
The high-level representation of, and interaction among, subscriptions (represented as ovals) is also depicted in Figure 3. The directed, solid arrows indicate the flow of events among subscriptions and the (bright-colored) directed, dashed arrows indicate events received from and sent to the environment, while the (black) dashed lines are bookkeeping messages for maintaining a consistent view of the attributes. What is not shown in the figure, for improved readability, is that there must be an arrow from every node to the node $S_{sink}$ to determine the end of a B-step. The precise meaning of the arrows becomes evident in Section 6, after formally defining each subscription.
[$S_{\oplus s}$], [$S_{\ominus s}$]: For each status attribute $s$ in the information model $\mathcal{I}$ of $\Gamma$, we add the subscription $S_{\oplus s}$ for validating the attribute $s$ and the subscription $S_{\ominus s}$ for invalidating $s$. The subscription's condition $\Phi$ is derived from the PAC rules' prerequisite and antecedent conditions. Hence, $\Phi$ is an application-specific condition.
[$S_s$]: For each status attribute $s$ in $\mathcal{I}$, we add the subscription $S_s$ that listens to updates (i.e., notifications of $S_{\oplus s}$ and $S_{\ominus s}$) for $s$. Hence, $S_s$'s $\Phi$ is a generic condition.
[$S_d$]: For each data attribute $d$ in $\mathcal{I}$, we add the subscription $S_d$ that listens to updates on $d$ at the outset of the B-step. Hence, $S_d$'s $\Phi$ is also a generic condition.
[$S_{source}$, $S_{sink}$]: For identifying the beginning and the ending of a B-step, we add the source subscription $S_{source}$ and the sink subscription $S_{sink}$, respectively. All of these subscriptions are intended for bookkeeping purposes. Thus, their conditions are also generic.
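The enumeration above can be summarized in a few lines that list which subscriptions a given information model would require. This is a sketch for orientation only; the function `build_subscriptions`, the string naming scheme (`S_+s`, `S_-s`, and so on), and the example attribute names are hypothetical.

```python
def build_subscriptions(status_attrs, data_attrs):
    """List the subscriptions needed for the given status and data attributes."""
    application_specific = []
    bookkeeping = ["S_source"]
    for s in status_attrs:
        # One validating and one invalidating subscription per status attribute.
        application_specific += [f"S_+{s}", f"S_-{s}"]
        # One generic subscription that applies the resulting update to s.
        bookkeeping.append(f"S_{s}")
    # One generic subscription per data attribute.
    bookkeeping += [f"S_{d}" for d in data_attrs]
    bookkeeping.append("S_sink")
    return {
        "application_specific": application_specific,
        "bookkeeping": bookkeeping,
    }
```

For an information model with $n$ status attributes and $m$ data attributes, this yields $2n$ application-specific subscriptions and $n + m + 2$ bookkeeping subscriptions.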
6 MAPPING FORMALIZATION
The subscription plays a central role in formulating the mapping of a data-centric workflow schema, $\Gamma = \langle \mathcal{I}, \mathcal{R} \rangle$, into the pub/sub abstraction given by $\Pi = \langle \mathcal{P}, \mathcal{S}, \mathcal{E}, \mathcal{C} \rangle$. We formalize the semantics of a subscription $S \in \mathcal{S}$ as described in Equation 1, where its condition $\Phi(\rho_k)$, consumption policy $\delta(\rho_k)$, and notification policy $N(\rho_k)$ are instantiated and associated with an external event $\rho_k$, and $\Psi(\rho_k)$ interrelates each condition with its corresponding notification.
6.1 Matching and Notification Policies
In this section, we start by providing a detailed account of
the mapping of the workflow’s application-specific semantics,
namely, encoding of PAC rules and the PDG topological sort
order, into a set of subscriptions. In addition, we provide the
foundation for a mapping that emulates the generic-execution
semantics including the necessary bookkeeping mechanism as
a set of subscriptions. We guarantee that the workflow correctly
executes by ensuring data consistency and the B-step semantics.
6.1.1 PAC Rules and PDG Mapping
We first define the $\Gamma$ application-specific conditions for subscriptions $S \in \mathcal{S}$. Each logical formula $\varphi_i(\rho_k) \in \Phi$ is defined as follows:
$$\varphi_i(\rho_k) = \overbrace{\psi_{i,PDG}(\rho_k)}^{\text{PDG Predecessors}} \land \overbrace{\psi_{i,PseudoClock}(\rho_k)}^{\text{Event-based Pseudo Clock}} \quad (2)$$
Here, $\psi_{i,PDG}$ is the PDG predecessor component, that is, a logical formula that encodes the PDG topological sort order, i.e., $\psi_{i,PDG}$ evaluates to true when all variables in $D$ have stabilized (cf. detailed descriptions in Sections 6.1.1.1 and 6.1.1.2). $\psi_{i,PseudoClock}$ is a logical formula that enforces that subscriptions are processed based on the order of external events, i.e., it guarantees event-order serialization (cf. detailed description in Section 6.1.1.3). The second component of $\Psi$, the notification expression $\nu_i(\rho_k) \in N$, is defined as follows:
$$\nu_i(\rho_k) = \begin{cases} \gamma_{\rho_k},\ S^{visited}_{s\rho_k} & \text{if } \psi_{i,SAT}(\rho_k) \\ \bar{\gamma}_{\rho_k},\ S^{visited}_{s\rho_k} & \text{if } \forall \nu_i \in N,\ \neg(\psi_{i,SAT}(\rho_k)) \\ WAIT & \text{if } \exists \varphi_i \in \Psi_i,\ \neg(\varphi_i) \end{cases} \quad (3)$$
Here, $S^{visited}_{s\rho_k}$ is an event that indicates that the subscription $S$ was successfully visited for the external event $\rho_k$, i.e., (partially) completed as defined in Section 6.2. $\gamma_{\rho_k}$ is an event that represents the consequent of the PAC rule, indicating a change (either positive or negative) to a particular status attribute $s \in \mathcal{I}$ ($\gamma = \odot s$), while $\bar{\gamma}$ indicates no change to the status attribute $s$. In addition, each event $\gamma_{\rho_k}$ and $\bar{\gamma}_{\rho_k}$ contains the current value of the status attribute $s$ in the context of the external event $\rho_k$. $\psi_{i,SAT}(\rho_k)$ is a logical formula derived from a PAC rule's prerequisite $\pi$ and antecedent $\alpha$, and $WAIT$ is an indicator that implies that not all subscription conditions ($\varphi_i$) have been satisfied. The notification policy is explained in detail in Section 6.1.1.4.
6.1.1.1 Application-specific Condition: To construct the application-specific condition, we adapt the PDG construction algorithm [15], which operates on the set of PAC rules $\mathcal{R}$. Suppose each PAC rule has the form $\langle \pi, \alpha, \gamma \rangle$, which stands for prerequisite, antecedent, and consequent, respectively. Then, the antecedent $\alpha$ of a PAC rule is constructed according to the template on $\xi(x)$ if $\phi(x)$, where each expression $expr \in \xi(x)$ is of the form on-event $onEventType$ or $\odot s$, where $onEventType$ indicates requiring an external event of the type given by $onEventType$ and $\odot$ means waiting for a positive, $\oplus$, or negative, $\ominus$, change in status attribute $s$. Similarly, every expression $expr \in \gamma$ also follows the form $\odot s$. However, an expression $expr \in \phi(x)$ has the form $.s$, which simply indicates a stable value for status attribute $s$. A value is stable if it no longer changes in the current B-step.
We first collapse instances of PAC rules, $R_i \in \mathcal{R}$, that have identical $\pi$ and $\gamma$ into a super PAC rule given by $\langle \pi, A, \gamma \rangle$, where $A = (\bigvee_{\alpha \in R_i} \alpha)$. In general, PAC rules share identical $\pi$ and $\gamma$ because for a given status attribute $s$ there exist multiple rules that satisfy or falsify it. The notion of a super PAC rule simplifies the mapping. Therefore, the (super) PAC rule $R$ is mapped to $S_{\odot s}$, where $\odot s \in \gamma$.
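Collapsing PAC rules into super PAC rules amounts to grouping by prerequisite and consequent. The sketch below assumes, purely for illustration, that a PAC rule is a triple of strings `(pi, alpha, gamma)` and that the disjunction $A$ is represented as a tuple of antecedents; the function name `collapse_pac_rules` is hypothetical.

```python
def collapse_pac_rules(rules):
    """Group rules with identical prerequisite and consequent into super rules.

    rules: iterable of (pi, alpha, gamma) triples.
    Returns a list of (pi, A, gamma), where A is the tuple of antecedents
    that are to be read as a disjunction.
    """
    grouped = {}
    for pi, alpha, gamma in rules:
        grouped.setdefault((pi, gamma), []).append(alpha)
    return [(pi, tuple(alphas), gamma) for (pi, gamma), alphas in grouped.items()]
```

Each resulting super rule then corresponds to one application-specific subscription, and each antecedent in `A` becomes one disjunct of its matching condition.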
Example 1 In our example BP, there are three PAC rules that share the common prerequisite $\pi = \text{PE:pp}$ and consequent $\gamma = \ominus\text{PE:pp}$ (cf. Table 3). These PAC rules represent the incoming edges in the PDG for node (-,PE:pp) (cf. Figure 2). Hence, they can be collapsed into the super PAC rule $PAC_{\ominus\text{PE:pp}}$ that represents the invalidation of milestone PE:pp.
| $\pi$ | $A$ | $\gamma$ |
| PE:pp | ($\ominus$ED:cp) $\lor$ ($\oplus$ED:cp $\land$ LR) $\lor$ (latestIncEvent = "R:RedoExportDocs" $\land$ LR) | $\ominus$PE:pp |
The super PAC rule is mapped to subscription $S_{\ominus\text{PE:pp}}$.
The relation $\Psi_s \in S_{\odot s}$ (an application-specific condition and notification) is constructed through various mapping stages, which are described next.
Each PAC rule is used to construct the subscription's matching condition, $\langle \pi, A, \gamma \rangle \in \mathcal{R} \leadsto \Phi \in S_{\odot s}$. More specifically, we derive each $\varphi_i \in \Phi$ based on the PAC rule as follows:
$$M_\Phi: \alpha_i \in A \longrightarrow \varphi_i \in \Phi. \quad (4)$$
Intuitively speaking, the antecedent of a PAC rule ($\alpha$) forms the matching condition ($\Phi$). In the case of a super PAC rule, each antecedent of the original PAC rules that are collapsed within this super PAC rule ($\alpha_i \in A$) forms a single component of the matching condition ($\varphi_i \in \Phi$).
6.1.1.2 PDG Predecessors: The key component of $\varphi_i$, denoted by $\psi_{i,PDG} \in \varphi_i$, is at the core of the subscription mapping and incorporates the notion of PDG predecessors, an integral part of encoding the PDG topological sort order semantics of $\Gamma$ into $\Pi$. This mapping stage is represented by:
$$M_{\psi_{i,PDG}}: \alpha_i \in A \longrightarrow \psi_{i,PDG} \in \varphi_i. \quad (5)$$
Thus, for each $\varphi_i \in \Phi$, we construct $\psi_{i,PDG} \in \varphi_i$ from the corresponding $\alpha_i \in R$. The actual definition of $\psi_{i,PDG}$ is derived by adapting the PDG construction algorithm, which examines the antecedent component of each PAC rule to identify the set of status attributes whose values must stabilize before firing a PAC rule, i.e., before evaluating the subscription. We formally define $\psi_{i,PDG}$ as a set of on-events that listen for a positive or negative change in the variables appearing in the PAC rule's antecedent, which encodes the three different forms a sentry can have:
$$\psi_{i,PDG}(\rho_k) = \bigwedge_{s \in \xi(x)} \tau_k(on\odot s, \rho_k) \land \bigwedge_{s \in \phi(x)} \tau_k(on.s, \rho_k), \quad (6)$$
where $on\odot s$ refers to events that announce a change or no change to $s$ (i.e., the on $\xi(x)$ component in antecedent $\alpha$), and $on.s$ refers to an event that holds the current value of $s$ (i.e., the if $\phi(x)$ component in $\alpha$).
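One way to read Equation 6 operationally is that the PDG predecessor component holds once every on-event the subscription listens to has received a stable value for the current external event. The sketch below follows that reading; it assumes a hypothetical per-event row stored as a dictionary, uses `None` as a stand-in for the unstable symbol, and the column names are invented for the example.

```python
UNSTABLE = None  # hypothetical stand-in for the unstable symbol


def psi_pdg(row, on_events):
    """True once every listened-for on-event in `row` has a stable value.

    row: dict mapping on-event names to their value for the current event.
    on_events: names of the on-events derived from the antecedent.
    """
    return all(row.get(name, UNSTABLE) is not UNSTABLE for name in on_events)
```

A subscription that waits on both a change event for ED:cp and the current value of LR would therefore remain blocked until both have arrived for the same $\rho_k$.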
6.1.1.3 Event-based Pseudo Clock: The second component of $\varphi_i$ is $\psi_{PseudoClock}$, which enforces that each subscription is processed, namely, its condition $\varphi_i$ is satisfied, in the order in which external events arrive. Therefore, external events act as a pseudo clock. The operation of this pseudo clock is defined by a logical formula as follows:
$$\psi_{PseudoClock}(\rho_k) = \big(\nexists \rho_j,\ \rho_j \in \Sigma^S,\ \neg(\tau_j(isVisited, \rho_j)) \land \tau_j(eventTime, \rho_j) < \tau_k(eventTime, \rho_k) \land \tau_j(subInstance, \rho_j) = \tau_k(subInstance, \rho_k)\big). \quad (7)$$
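Equation 7 translates almost directly into a check over the subscription's internal state. The following sketch assumes, for illustration, that the internal state is a list of dictionaries with the keys `isVisited`, `eventTime`, and `subInstance`; the function name `psi_pseudo_clock` is hypothetical.

```python
def psi_pseudo_clock(state, current):
    """True if no earlier, still unvisited tuple exists for the same instance.

    state: list of rows, one per external event seen by this subscription.
    current: the row belonging to the external event under consideration.
    """
    return not any(
        (not row["isVisited"])
        and row["eventTime"] < current["eventTime"]
        and row["subInstance"] == current["subInstance"]
        for row in state
    )
```

Because the comparison is restricted to the same subscription instance, events for different instances do not block one another in this sketch.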
Example 1 (cont.) As $PAC_{\ominus\text{PE:pp}}$ is originally composed of three individual PAC rules, the subscription condition $\Phi$ contains three disjuncts representing the original antecedents (i.e., $\varphi_1$, $\varphi_2$, and $\varphi_3$), which results in the following PDG predecessor components:
$$\begin{aligned} \psi_{1,PDG}(\rho_k) &= \tau_k(on\odot\text{ED:cp}, \rho_k) \\ \psi_{2,PDG}(\rho_k) &= \tau_k(on\odot\text{ED:cp}, \rho_k) \land \tau_k(on.\text{LR}) \\ \psi_{3,PDG}(\rho_k) &= \tau_k(\text{LR}). \end{aligned}$$
Note that the external request event in PAC Rule 3 (i.e., R:RedoExportDocs) is not evaluated within the matching condition (here $\psi_{3,PDG}(\rho_k)$) but later on within the notification condition. The data model $D$ for this subscription is as follows:
| $e$ | $t$ | $x$ | $on\odot$ED:cp | ED:cp | PE:pp | LR | visited |
The complete subscription condition $\Phi$ is then:
$$\Phi = \varphi_1 \lor \varphi_2 \lor \varphi_3 = (\psi_{1,PDG}(\rho_k) \lor \psi_{2,PDG}(\rho_k) \lor \psi_{3,PDG}(\rho_k)) \land \psi_{PseudoClock}(\rho_k)$$
6.1.1.4 Application-specific Notification: Once the PDG requirement (i.e., $\psi_{i,PDG} \in \varphi_i$) for a subscription instance is satisfied, namely, all variables in $\alpha$ have stabilized, and all prior external events have been processed (i.e., $\psi_{PseudoClock} \in \varphi_i$), the corresponding notification of $(\varphi_i, \nu_i)$ is triggered. Each $\nu_i \in N$ is partially derived from the corresponding PAC rule of the super PAC rule $\langle \pi, \alpha_i, \gamma \rangle$ in accordance with Equation 3:
$$M_N: (\pi, \alpha_i \in A, \gamma) \in \mathcal{R} \longrightarrow \nu_i \in N. \quad (8)$$
The key component of $\nu_i$ is a logical formula $\psi_{i,SAT}$, which describes the behavior of the notification policy. Before giving the definition of the logical formula $\psi_{i,SAT}$, we must re-write $\pi$ and $\alpha_i$. This re-writing is necessary for abiding by the workflow semantics, in which each variable in $\pi$ must use its value from the last completed B-step (if any), while each variable in $\alpha_i$ must use its most recent value. Thus, we re-write $\pi$, which consists of only Boolean variables, as follows:
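$$M_\pi: s_i \in \pi \longrightarrow \tau_{k-1}(s_i, \rho_k). \quad (9)$$
Similarly, we re-write $\alpha$, which consists of both status and data attributes, based on the most recent values as follows ($M_{\alpha_i}$ consists of three stages of re-writing given by $M_{\phi \in \alpha_i}$, $M^1_{\xi \in \alpha_i}$, and $M^2_{\xi \in \alpha_i}$):
$$\begin{aligned} M_{\phi \in \alpha_i} &: a_i \in \phi \longrightarrow \tau_k(a_i, \rho_k), \\ M^1_{\xi \in \alpha_i} &: \odot s \in \xi \longrightarrow \tau_k(on\odot s, \rho_k) = \hat{\odot}, \\ M^2_{\xi \in \alpha_i} &: onEvent \in \xi \longrightarrow \tau_k(eventType, \rho_k) = onEvent, \end{aligned} \quad (10)$$
where $\hat{\odot} \in Boolean \times Boolean$ refers to the type of transition for status attribute $s$ as indicated by $\odot s$. The mapping of a PAC rule to $\psi_{i,SAT}$ is expressed as
$$M_{\psi_{i,SAT}}: (\pi, \alpha_i) \in \mathcal{R} \longrightarrow \psi_{i,SAT} \in \varphi_i, \quad (11)$$
where $\psi_{i,SAT}$ is simply derived by the conjunction of the re-written $\pi$ and $\alpha$:
$$\psi_{i,SAT}(\rho_k) = M_\pi \land M_{\alpha_i}. \quad (12)$$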
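Example 1 (cont.) Based on the super PAC rule, $PAC_{\ominus\text{PE:pp}}$, we now derive the notification condition for the example in a similar fashion:
$$\begin{aligned} \psi_{1,SAT}(\rho_k) &= \tau_{k-1}(\text{PE:pp}, \rho_k) \land \tau_k(on\odot\text{ED:cp}, \rho_k) = (true, false) \\ \psi_{2,SAT}(\rho_k) &= \tau_{k-1}(\text{PE:pp}, \rho_k) \land \tau_k(\text{LR}, \rho_k) \land \tau_k(on\odot\text{ED:cp}, \rho_k) = (false, true) \\ \psi_{3,SAT}(\rho_k) &= \tau_{k-1}(\text{PE:pp}, \rho_k) \land \tau_k(\text{LR}, \rho_k) \land \tau_k(eventType, \rho_k) = \text{'R:RedoExportDocs'} \end{aligned}$$
The components of the notification condition $N$ (i.e., $\nu_1$, $\nu_2$, and $\nu_3$) can then be derived from Equation 3. We show this for $\nu_1$ as follows:
$$\nu_1(\rho_k) = \begin{cases} \ominus\text{PE:pp}_{\rho_k},\ S^{visited}_{\text{PE:pp}\,\rho_k} & \text{if } \psi_{1,SAT}(\rho_k) \\ \overline{\ominus\text{PE:pp}}_{\rho_k},\ S^{visited}_{\text{PE:pp}\,\rho_k} & \text{if } \forall \nu_i,\ \neg(\psi_{i,SAT}(\rho_k)) \\ WAIT & \text{if } \exists \varphi_i \in \Psi_i,\ \neg(\varphi_i) \end{cases}$$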
6.1.2 Data Consistency & Semantics Simulation
Now that we have demonstrated the mapping that translates PAC rules into a set of subscriptions, we next derive the subscriptions required for bookkeeping (described in Sections 6.1.2.1 and 6.1.2.2) and for the execution of $\Gamma$ (described in Section 6.1.2.3). Given the relation $\Psi_{i,s}(\rho_k) = \langle \varphi_i(\rho_k), \nu_i(\rho_k) \rangle$, the generic condition $\varphi_i$ is defined by:
$$\varphi_i(\rho_k) = \tau_k(a, \rho_k), \quad (13)$$
where $\varphi_i$ essentially captures the interest in any attempt to alter the value of attribute $a$. On the other hand, the notification policy $\nu_i$ is expressed as follows:
$$\nu_i = a_{\rho_k},\ S^{visited}_{a\rho_k}, \quad (14)$$
where $a_{\rho_k}$ broadcasts the current value of attribute $a$ and $S^{visited}_{a\rho_k}$ indicates that the bookkeeping subscription for $a$ was visited for external event $\rho_k$.
6.1.2.1 Status Attribute Consistency: We start with the workflow's data consistency requirement, which ensures a consistent view of status attributes. We must ensure that when a status attribute changes, no race condition for updating the value arises and that every interested subscription has the most up-to-date values for its status attributes. To achieve data consistency, we add, for every status attribute $s$ in the information model of $\Gamma$, the subscription $S_s$ with a generic condition; $S_s$ acts as a single gateway for changing $s$'s value and subsequently broadcasting the final stable value of $s$ to all interested subscriptions. The relation $\Psi_s \in S_s$ is given by:
$$\varphi_s(\rho_k) = \tau_k(on\odot s, \rho_k) \quad (15)$$
$$\nu_s(\rho_k) = \begin{cases} \tau_k(s, \rho_k) \Leftarrow True,\ S^{visited}_{s\rho_k} & \text{if } \tau_k(on\odot s, \rho_k) = (false, true) \\ \tau_k(s, \rho_k) \Leftarrow False,\ S^{visited}_{s\rho_k} & \text{if } \tau_k(on\odot s, \rho_k) = (true, false) \\ \tau_k(s, \rho_k) \Leftarrow \tau_{k-1}(s, \rho_k),\ S^{visited}_{s\rho_k} & \text{otherwise,} \end{cases}$$
where $\Leftarrow$ indicates assignment of the value on the right-hand side to the variable on the left-hand side.
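The three cases of the notification policy for $S_s$ can be sketched as a small function. The function name `status_notification`, the encoding of a toggle as a pair of Booleans, and the tuple-based event representation are assumptions chosen for the example.

```python
def status_notification(s, toggle, prev_value):
    """Apply the reported toggle to status attribute s and emit its events.

    toggle: (False, True) for a positive change, (True, False) for a negative
            change, anything else (including None) for "no change".
    prev_value: value of s from the previous B-step.
    Returns the new stable value and the list of generated events.
    """
    if toggle == (False, True):
        value = True
    elif toggle == (True, False):
        value = False
    else:
        value = prev_value  # carry the last value forward unchanged
    events = [("value", s, value), ("visited", s)]
    return value, events
```

In every case the subscription emits both the stable value and its visited event, so downstream subscriptions can rely on receiving exactly one update for $s$ per external event.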
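Example 2 We now show the subscription for capturing updates on status attribute PE:pp. The subscription condition $\Phi$ is given by:
$$\varphi_{\text{PE:pp}}(\rho_k) = \tau_k(on\odot\text{PE:pp}, \rho_k)$$
The notification condition $N = \nu_{\text{PE:pp}}(\rho_k)$ is given by:
$$\nu_{\text{PE:pp}}(\rho_k) = \begin{cases} \tau_k(\text{PE:pp}, \rho_k) \Leftarrow True,\ S^{visited}_{\text{PE:pp}\,\rho_k} & \text{if } \tau_k(on\odot\text{PE:pp}, \rho_k) = (false, true) \\ \tau_k(\text{PE:pp}, \rho_k) \Leftarrow False,\ S^{visited}_{\text{PE:pp}\,\rho_k} & \text{if } \tau_k(on\odot\text{PE:pp}, \rho_k) = (true, false) \\ \tau_k(\text{PE:pp}, \rho_k) \Leftarrow \tau_{k-1}(\text{PE:pp}, \rho_k),\ S^{visited}_{\text{PE:pp}\,\rho_k} & \text{otherwise} \end{cases}$$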
6.1.2.2 Data Attribute Consistency: Likewise, we construct a set of subscriptions that listen to events containing values for each data attribute. Upon consuming an external event, if the value in the event payload is different from the current value, then the subscription $S_d$ generates the value, derived from the change or no-change events, accordingly, as follows:
$$\varphi_d(\rho_k) = \tau_k(on\Delta e_{\rho_k}, \rho_k) \quad (16)$$
$$\nu_d(\rho_k) = \begin{cases} \tau_k(d, \rho_k) \Leftarrow d,\ S^{visited}_{d\rho_k} & \text{if } d \in \Delta e_{\rho_k} \land \tau_k(d, \rho_k) \ne \tau_{k-1}(d, \rho_k) \\ \tau_k(d, \rho_k) \Leftarrow \tau_{k-1}(d, \rho_k),\ S^{visited}_{d\rho_k} & \text{otherwise,} \end{cases}$$
where $\Delta e_{\rho_k}$ summarizes the data attributes appearing in $e$.
6.1.2.3 B-Step Simulation: Finally, in the workflow execution, it is crucial to identify the start and the end of a completed B-step. Therefore, we first focus on the start of a new B-step, which is achieved through subscription $S_{source}$. The source subscription $S_{source}$ has a special property because it is the only subscription that waits for external events $e$ from the environment. Every incoming external event in turn establishes the start of a new B-step (referred to as the B-step deterministic-initiation property). Acting as a single gateway, $S_{source}$ assigns increasing timestamps $t$ to all incoming $e$ and thereby imposes a total order on all external events. Therefore, $S_{source}$ sends the events $S^{visited}_{source\rho_k}$ and $e_{\rho_k}$, which are understood by all subscriptions, where the event's type, time, and intended subscription instances are summarized in $\rho_k$. Hence, the relation $\Psi_{source} = (\varphi(\rho_k), \nu(\rho_k))$ is given as follows:
$$\varphi_{source}(\rho_k) = e \quad (17)$$
$$\nu_{source}(\rho_k) = S^{visited}_{source\rho_k},\ e_{\rho_k},\ \Delta e_{\rho_k}$$
In order to guarantee the B-step deterministic-initiation property, we must add $onS^{visited}_{source\rho_k}$ to all subscriptions whose $\varphi_i \in \Phi$ are empty. In the same spirit, the end of a B-step is determined by introducing $S_{sink}$, which subscribes to every subscription involved in order to establish the ending of a B-step (referred to as the B-step deterministic-completion property). Hence, $\Psi_{sink}(\rho_k)$ is given by:
$$\varphi_{sink}(\rho_k) = \bigwedge_{S_i \in \mathcal{S}'} \tau_k(onS^{visited}_{i\rho_k}, \rho_k) \quad (18)$$
$$\nu_{sink}(\rho_k) = S^{visited}_{sink\rho_k},$$
where $\mathcal{S}' = \mathcal{S} \setminus S_{sink}$.
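The sink condition in Equation 18 is a conjunction over the visited events of all other subscriptions. A minimal sketch, assuming subscriptions are identified by hypothetical string names and that the visited events received for the current $\rho_k$ are collected in a set:

```python
def sink_condition(all_subscriptions, visited, sink="S_sink"):
    """True once every subscription except the sink reported its visited event.

    all_subscriptions: names of all subscriptions in the schema.
    visited: names of the subscriptions whose visited event has arrived
             for the current external event.
    """
    return all(s in visited for s in all_subscriptions if s != sink)
```

Only when this check succeeds would the sink emit its own visited event, which marks the end of the B-step for the current external event.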
6.2 Consumption Policy
The subscriptions' conditions and notification policies define a design-time specification of the workflow semantics under our pub/sub formulation. As opposed to this, the consumption policy specifies how to update the internal state of each subscription, $\Sigma$, at runtime. The consumption policy is tightly bound to the subscription operational semantics, denoted by $O_S = (\Sigma^S_j, e, t, x, \Sigma^S_{j+1}, Gen)$. To precisely model the consumption policy w.r.t. $O_S$, we discuss the subscription's evolution as it goes through the various stages of its lifecycle within a B-step: initiation, modification, completion, satisfaction, generation, and termination. Each stage and its interaction with other stages is defined next and illustrated in Figure 4.
STAGE 1. Subscription initiation occurs for the event associated with $\rho_k$ (within the $k^{th}$ B-step) when the subscription first receives the event, either directly (the event $e_{\rho_k}$) or indirectly (such as status or data attribute updates in the context of $\rho_k$). Then, $eventType$, $eventTime$, and $subscriptionInstance$ are populated based on $\rho_k$ and $isVisited$ is set to false, while the rest of its attributes in $D$ are set to $\varnothing$. However, if the subscription instance $x \in \rho_k$ does not exist in $\Sigma^S$, then as part of the initialization (and creation of the new instance), all status attributes are set to false and all data attributes are set to their default values.
STAGE 2. Subscription modification occurs for the event associated with $\rho_k$ (within the $k^{th}$ B-step) after the subscription has been initiated (before or after a subscription's partial completion), when the internal state of the subscription is updated and it is transitioned according to the subscription operational semantics:
$$O_S = \Sigma^S_j \overset{E_{\rho_k} \rightarrow (e,t,x)}{\longmapsto} \Sigma^S_{j+1}.$$
The internal state of the subscription changes by at most one single attribute in $D$ and is characterized by the following assignment:
$$(\forall (a_i, value) \in E_{\rho_k},\ a_i \in D) \rightarrow \tau_k(a_i, \rho_k) \Leftarrow value$$
Fig. 4. Consumption policy state transition (Stage 1: Initiation; Stage 2: Modification; Stage 3: Completion / Partial Completion; Stage 4: Satisfaction; Stage 5: Generation; Stage 6: Termination).
STAGE 3. Subscription (partial) completion occurs for the event associated with $\rho_k$ (within the $k^{th}$ B-step) after the subscription has been initiated, when at least one of the subscription's $\varphi_i(\rho_k) \in \Psi_s(\rho_k)$ has evaluated to true. If all $\varphi_i(\rho_k)$ have evaluated to true, then the subscription is considered completed, while if at least one of the $\varphi_i(\rho_k)$ has evaluated to true, then the subscription is considered partially completed. Explicitly considering a stage for the partial completion allows the pub/sub system to evaluate the notification policy, i.e., Stage 4, and generate events, i.e., Stage 5, before the subscription is completed. Hence, a tuple $\langle \varphi_i(\rho_k), \nu_i(\rho_k) \rangle \in \Psi_S(\rho_k)$ might completely evaluate to true, and the corresponding notifications are generated, even if there exist conditions $\varphi_j \in \Phi$ that have not yet evaluated to true. This behavior improves parallelism in execution and is indicated by the dashed lines in Figure 4.
STAGE 4. Subscription satisfaction occurs for the event associated with $\rho_k$ (within the $k^{th}$ B-step) after the subscription has been (partially) completed, when $\varphi_i(\rho_k) \in \Psi(\rho_k)$ is evaluated to true, i.e., the subscription is (partially) completed, and the subscription's corresponding notification policy, $\nu_i(\rho_k)$, evaluates to true.
STAGE 5. Subscription generation occurs for the event associated with $\rho_k$ (within the $k^{th}$ B-step) after the subscription is satisfied and when the subscription's relevant events are generated according to $\nu_i(\rho_k)$.
STAGE 6. Subscription termination occurs for the event $\rho_k$ (within the $k^{th}$ B-step) after all events have been generated by the subscription and the attribute $\tau_k(isVisited, \rho_k)$ is assigned to true. Once $isVisited$ is set to true, the tuple associated with $\rho_k$ becomes read-only.
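The six stages can be viewed as a small state machine per subscription and external event. The sketch below is one possible reading of the stage descriptions, not a transcription of Figure 4: the transition table `ALLOWED` is an assumption (in particular, the loops back to modification after partial completion and after generation are inferred from the text on partial completion), and the names `Stage` and `advance` are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    INITIATION = 1
    MODIFICATION = 2
    COMPLETION = 3      # covers partial and full completion
    SATISFACTION = 4
    GENERATION = 5
    TERMINATION = 6


# Assumed transitions; the loops back to MODIFICATION model partial completion.
ALLOWED = {
    Stage.INITIATION: {Stage.MODIFICATION},
    Stage.MODIFICATION: {Stage.MODIFICATION, Stage.COMPLETION},
    Stage.COMPLETION: {Stage.MODIFICATION, Stage.SATISFACTION},
    Stage.SATISFACTION: {Stage.GENERATION},
    Stage.GENERATION: {Stage.MODIFICATION, Stage.TERMINATION},
    Stage.TERMINATION: set(),  # the tuple is read-only after termination
}


def advance(current: Stage, nxt: Stage) -> Stage:
    """Move to the next stage, rejecting transitions the table does not allow."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

The empty transition set for `TERMINATION` reflects that the tuple associated with $\rho_k$ becomes read-only once `isVisited` is set to true.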
7 WORKFLOW MAPPING ANALYSIS
In this section, we show that under the incremental formulation (sequential execution), the data-centric workflow schema $\Gamma$ is equivalent to the pub/sub schema $\Pi$ (distributed execution), expressed as $M: \Gamma \longrightarrow \Pi$. Before establishing the correctness and equivalence of the $\Gamma$ and $\Pi$ schemas, we define a set of preliminary concepts. For the proofs of these preliminaries and of the overall equivalence of $\Gamma$ and $\Pi$, we refer the reader to Appendix A.
As described in Section 3, the incremental operational
semantics of $\Gamma$ is defined as the 5-tuple $(\Sigma, e, t, \Sigma', \mathit{Gen})$, and the
$\Gamma$ system snapshot transition, denoted by $\Sigma \stackrel{e}{\longmapsto} \Sigma'$, is defined
as the smallest logical business step (B-step), which consists of
the sequential firing of PAC rules. The B-step in its expanded
form is given by $\Sigma = \Sigma_0, \Sigma_1, \Sigma_2, \cdots, \Sigma_n = \Sigma'$, where $\Sigma_0 \neq \Sigma_1$
(due to updating data attributes based on the external incoming
event $e$) and each $\Sigma_i$ is referred to as a micro-B-step. Thus, the
$i$th micro-B-step corresponds to the firing of the $i$th PAC rule.
Furthermore, based on the $\Gamma$ semantics, each PAC rule firing
results in the change of exactly one status attribute, and the value
of each status attribute changes at most once within a B-step
(the toggle-once property). Consequently, each PAC rule is fired
at most once within a B-step.
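As a rough illustration of a B-step as a sequence of micro-B-steps, the following sketch fires rules sequentially and enforces the toggle-once property. It assumes that status attributes are Boolean, that each rule is a pair of the attribute it toggles and a condition over the snapshot, that the rules are supplied in a valid firing order, and that the input snapshot already incorporates the data of the external event $e$. The names `run_b_step` and `PacRule` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Snapshot = Dict[str, bool]
# Hypothetical PAC rule: (status attribute it toggles, condition on snapshot).
PacRule = Tuple[str, Callable[[Snapshot], bool]]


def run_b_step(snapshot: Snapshot, rules: List[PacRule]) -> List[Snapshot]:
    """Fire rules sequentially and return one snapshot per micro-B-step."""
    toggled: Set[str] = set()   # status attributes already changed in this B-step
    trace: List[Snapshot] = []
    current = dict(snapshot)
    for attr, condition in rules:           # rules assumed to be in firing order
        if attr in toggled or not condition(current):
            continue                         # toggle-once: never change twice
        current = dict(current)              # keep earlier snapshots immutable
        current[attr] = not current[attr]    # exactly one status attribute changes
        toggled.add(attr)
        trace.append(current)
    return trace


initial = {"d": True, "s1": False, "s2": False}
rules: List[PacRule] = [
    ("s1", lambda s: s["d"]),     # fires because data attribute d is set
    ("s2", lambda s: s["s1"]),    # fires once s1 has been toggled
    ("s1", lambda s: True),       # suppressed: s1 was already toggled
]
trace = run_b_step(initial, rules)
```

The third rule is skipped because `s1` has already changed within this B-step, so the trace contains exactly two micro-B-steps.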
First, we provide a formalization for the set of status
attributes that is changed within a B-step in reaction to an
external event $e$. Essentially, this set can be derived from the
event-relativized-PDG.
DEFINITION 4. The event-relativized-PDG for an external
event $e$, denoted by $\mathit{ePDG} = (V_e, E_e)$, is a subgraph of the
PDG that includes all PAC rules, and their ordering, that are
triggered in reaction to $e$:
$\mathit{PDG}(V, E) \supseteq \mathit{ePDG} = \{(V_e, E_e) \mid V_e \subseteq V, E_e \subseteq E\}$
DEFINITION 5. Given $\mathit{ePDG} = (V_e, E_e)$ for external events of
type $e$, the event-relativized status attribute set for $e$, denoted by
$I^e_s$, contains all status attributes that occur in the nodes $V_e$ of $\mathit{ePDG}$,
i.e., all status attributes that are changed within the B-step.
$I^e_s = \{s \mid s \in V_e, (V_e, E_e) = \mathit{ePDG}\}$
Thus, the set of status attributes that is not changed within the
B-step is given by $\overline{I^e_s} = I_s \setminus I^e_s$.
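The two sets can be computed with a small sketch. It assumes that the PDG is given as an adjacency list over status attributes and that the event-relativized-PDG consists of everything reachable from the nodes that the event triggers directly; this reachability reading, the function name `event_relativized_pdg`, and the example graph are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical encoding: status attribute -> attributes whose rules it may trigger.
Graph = Dict[str, List[str]]


def event_relativized_pdg(
    pdg: Graph, roots: Iterable[str]
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Return (V_e, E_e): nodes reachable from roots and the edges among them."""
    v_e: Set[str] = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in v_e:
            continue
        v_e.add(node)
        stack.extend(pdg.get(node, []))
    # Every successor of a reachable node is reachable, so these edges stay in V_e.
    e_e = {(u, v) for u in v_e for v in pdg.get(u, [])}
    return v_e, e_e


pdg: Graph = {"s1": ["s2"], "s2": ["s3"], "s3": [], "s4": ["s3"]}
all_status = set(pdg)                       # I_s
v_e, e_e = event_relativized_pdg(pdg, ["s1"])
changed = v_e                               # event-relativized set for e
unchanged = all_status - changed            # its complement within I_s
```

Here `s4` is not reachable from the node triggered by the event, so it ends up in the set of unchanged status attributes.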
We now formalize the changes in value of a status attribute
over the notion of stable attribute values as follows.
DEFINITION 6. A status attribute $s \in I_s$ is called stable,
denoted by $\dot{s}_\Sigma$, within a B-step caused by $e$ iff $s$ is within the
set of attributes that are not changed in reaction to $e$, or $s$ is
in the event-relativized status attribute set for $e$ and has changed its
value in the context of $e$.
$\dot{s}_{\Sigma_i} = \begin{cases} \tau(\Sigma_{i-1}, s) \neq \tau(\Sigma'_i, s), & \text{if } s \in I^e_s \\ \top, & \text{if } s \in \overline{I^e_s} \end{cases}$
DEFINITION 7. We refer to the initial and final system snapshots
of a B-step as complete system snapshots, denoted by $\Sigma$ (or $\Sigma_0$)
and $\Sigma'$ (or $\Sigma_n$), if all status attributes are stable.
$\forall s \in I_s, \; \dot{s}_\Sigma$
DEFINITION 8. We refer to an intermediate system snapshot
within a B-step as a partial system snapshot, denoted by
$\Sigma_i$, $0 < i < n$, i.e., a snapshot in which not all status attributes are stable.
$\exists s \in I_s, \; \neg\dot{s}_\Sigma$
Finally, we emphasize that the incremental formulation of
the execution follows a sequential and central execution, in
which the $\Gamma$ semantics for the B-step execution is defined as
an atomic step and each B-step consists of a finite number of
micro-B-steps. Therefore, we define the concept of time in terms
of a B-step such that system time advances from $t_i$ to $t_{i+1}$ only
after processing the $i$th event ($e_i$), i.e., upon completion of the
$i$th B-step. In addition, external events are processed in the order
in which they arrive (the in-order processing of external events).
LEMMA 1. The $\Gamma$ incremental semantics guarantees the
in-order processing of external events (when all events are
published from a single source). Hence, the B-step execution
(i.e., PAC rule firing) follows the event-order serialization (cf.
Proof 2 in Appendix A).
Similar to the B-step event-order serialization in the $\Gamma$
semantics, the micro-B-steps within a B-step also follow a strict
order, which is imposed by the topological sort order of the
PDG (the PDG-based serialization of micro-B-steps).
DEFINITION 9. The $\Gamma$ incremental semantics guarantees the
PDG-based serialization of micro-B-steps [12].
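A PDG-based serialization can be sketched with the topological sorter from the Python standard library (`graphlib`, available from Python 3.9). The sketch assumes that the PDG is acyclic and is encoded as a mapping from each rule to the set of rules it depends on; the rule names and the function `micro_b_step_order` are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def micro_b_step_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return one firing order in which every rule follows its predecessors."""
    # TopologicalSorter expects a mapping: node -> set of predecessor nodes.
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical PDG in which the dependencies admit exactly one order.
depends_on = {"r2": {"r1"}, "r3": {"r1", "r2"}, "r4": {"r3"}}
order = micro_b_step_order(depends_on)
```

When the PDG admits several topological orders, any of them respects the dependencies; the sketch simply returns the one produced by the library.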
Next, we show how the operational semantics of $\Gamma$ is also guaranteed
in our pub/sub formulation. As provided in Section 4, the
operational semantics of the pub/sub schema $\Pi$ is also formalized as a
sequence of changes in a system snapshot, denoted by $\Sigma_i \stackrel{e,t,x}{\longmapsto} \Sigma_{i+1}$,
implying that a single subscriber received and accepted event $e$.
LEMMA 2. The pub/sub operational semantics guarantees
in-order delivery of events between any pair of publisher and
subscriber (cf. Proof 3 in Appendix A).
COROLLARY 1. As a consequence of Lemma 2, the mapping $M$
under our pub/sub operational semantics guarantees in-order
processing of external events (when all events are published
from a single source).
Furthermore, our subscription mapping for PAC rules in the
$\Gamma$ schema processes events with respect to the order of external
events (published from a single source in both the $\Gamma$ and $\Pi$ schemas).
This mapping also introduces the notion of an event-based pseudo-clock
(Section 6) in order to achieve event-order serialization.
LEMMA 3. The mapping $M$ under the pub/sub operational
semantics guarantees execution of subscriptions based on
event-order serialization (cf. Proof 4 in Appendix A).
LEMMA 4. The mapping $M$ under our pub/sub operational
semantics guarantees the PDG-based serialization of
subscriptions (cf. Proof 5 in Appendix A).
With respect to the B-step execution, we also prove that the
pub/sub semantics satisfy the toggle-once property.
LEMMA 5. The mapping $M$ guarantees the toggle-once
property of a B-step (cf. Proof 6 in Appendix A).
To prove the correctness of the overall execution of the pub/sub
workflow formulation, we introduce the notion of a reachable
system snapshot: the state of the system after executing a set of
external events. The correctness of our model after
processing a set of external events is then determined by comparing
the information model (captured by the system snapshot) of the
$\Gamma$ and $\Pi$ schemas. If the two snapshots are identical, then our
workflow to pub/sub mapping is correct; otherwise, it is incorrect.
To compare $\Gamma$ and $\Pi$ system snapshots, denoted by $\Sigma_\Gamma$ and $\Sigma_\Pi$,
respectively, we introduce two levels of equivalence, namely
weak and strong equivalence. Without loss of generality, we
make the following simplification in the internal data model of a
system snapshot for both $\Sigma_\Gamma$ and $\Sigma_\Pi$: we conceptualize $\Sigma_\Gamma$ and
$\Sigma_\Pi$ as simply a collection of all data and status attributes given
in the $\Gamma$ information model. In addition, in $\Sigma_\Pi$, we also employ
a versioning mechanism for storing this collection, in which the
version is advanced with respect to external events. Hence,
through versioning in $\Sigma_\Pi$, values of data and status attributes are
retained separately for each external event, while in $\Sigma_\Gamma$, only the
latest version of the data and status attribute values is maintained.
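The difference between the two storage models can be sketched as follows. The sketch assumes that external events are numbered consecutively from 0 and that a read for event index $i$ falls back to the most recent earlier version when the attribute was not written for $i$; this fallback rule and the class names `LatestSnapshot` and `VersionedSnapshot` are assumptions introduced for illustration.

```python
from typing import Dict


class LatestSnapshot:
    """Sketch of the Gamma-side store: only the latest value per attribute."""

    def __init__(self) -> None:
        self.values: Dict[str, object] = {}

    def write(self, attr: str, value: object) -> None:
        self.values[attr] = value      # overwrites the previous value

    def read(self, attr: str) -> object:
        return self.values[attr]


class VersionedSnapshot:
    """Sketch of the Pi-side store: one version per external event index."""

    def __init__(self) -> None:
        self.versions: Dict[int, Dict[str, object]] = {}

    def write(self, event_index: int, attr: str, value: object) -> None:
        self.versions.setdefault(event_index, {})[attr] = value

    def read(self, event_index: int, attr: str) -> object:
        # Assumed fallback: use the most recent version at or before event_index.
        for i in range(event_index, -1, -1):
            if attr in self.versions.get(i, {}):
                return self.versions[i][attr]
        raise KeyError(attr)


gamma = LatestSnapshot()
gamma.write("s", False)
gamma.write("s", True)          # the earlier value is lost

pi = VersionedSnapshot()
pi.write(0, "s", False)
pi.write(1, "s", True)          # both versions are retained
```

Retaining one version per event is what allows subscriptions that are still working on an earlier event to read the values that were valid for that event.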
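DEFINITION 10. The (partial) system snapshots $\Sigma_\Gamma$ and $\Sigma_\Pi$ are
weakly equivalent up to event $e_i$, denoted by $\Sigma_\Gamma \Leftrightarrow_w \Sigma_\Pi$, iff the
values of stable status attributes in both $\Sigma_\Gamma$ and $\Sigma_\Pi$ are equal.
$\forall s \in \Sigma_\Gamma, \; \dot{s}^{\Gamma}_{\Sigma} \wedge \dot{s}^{\Pi}_{\Sigma} \rightarrow \tau(\Sigma_\Gamma, s) = \tau_i(\Sigma_\Pi, s)$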
DEFINITION 11. The (complete) system snapshots $\Sigma_\Gamma$ and $\Sigma_\Pi$