Response Time and Availability Study of RAFT
Consensus in Distributed SDN Control Plane
Ermin Sakic, Student Member, IEEE, and Wolfgang Kellerer, Senior Member, IEEE
Abstract—Software Defined Networking promises unprece-
dented flexibility and ease of network operations. While flexibility
is an important factor when leveraging advantages of a new
technology, critical infrastructure networks also have stringent
requirements on network robustness and control plane delays.
Robustness in the SDN control plane is realized by deploy-
ing multiple distributed controllers, formed into clusters for
durability and fast-failover purposes. However, the effect of the
controller clustering on the total system response time is not
well investigated in current literature. Hence, in this work we
provide a detailed analytical study of the distributed consensus
algorithm RAFT, implemented in OpenDaylight and ONOS SDN
controller platforms. In those controllers, RAFT implements the
data-store replication, leader election after controller failures and
controller state recovery on successful repairs. To evaluate its
performance, we introduce a framework for numerical analysis
of various SDN cluster organizations w.r.t. their response time
and availability metrics. We use Stochastic Activity Networks
for modeling the RAFT operations, failure injection and cluster
recovery processes, and using real-world experiments, we collect
the rate parameters to provide realistic inputs for a representative
cluster recovery model. We also show how a fast rejuvenation
mechanism for the treatment of failures induced by software
errors can minimize the total response time experienced by the
controller clients, while guaranteeing a higher system availability
in the long-term.
Keywords - performance analysis, stochastic activity net-
works, SDN, distributed control plane, RAFT, strong consis-
tency, fault tolerance, smart grid, OpenDaylight, ONOS
I. INTRODUCTION
A. Background and problem statement
In critical infrastructure, such as the utility [1], [2] and
automotive [3] domains, resilience of the communication
network is a necessary property and an important criterion
for adopting a new and disruptive network technology such
as Software Defined Networking (SDN). In single controller
SDN scenarios, unavailability of the controller leads to loss of
control and monitoring channels with the network devices and
hence a system instability. The loss of network control may
further result in production and power outages (smart grid [1])
or even life-threatening scenarios (dependable automotive [3]).
To address the resilience issues, SDN controllers can be
logically coupled into controller clusters, where each instance
of the controller, referred to as replica hereafter, is respon-
sible for managing a number of switches in the network.
A particular controller may exhibit its control only over the
switches to which it is assigned. In order to provide a fallback
solution in case of another controller’s failure, it also keeps
E. Sakic is with the Department of Electrical Engineering and Information
Technology, Technical University of Munich, Germany; and Siemens AG,
Munich, Germany, E-Mail: (ermin.sakic@{tum.de, siemens.com}).
W. Kellerer is with the Department of Electrical Engineering and In-
formation Technology, Technical University of Munich, Germany, E-Mail:
(wolfgang.kellerer@tum.de).
track of the internal state information related to the switches
managed by other controllers. When a controller replica fails,
a different controller instance from the same cluster takes
over and resumes operation with some downtime. To keep
the backup replicas up-to-date w.r.t. the internal controller
state, controllers synchronize their state. Depending on the
consistency model which defines the ordering of synchro-
nization messages, the synchronization procedure imposes a
varying overhead on the control channel [4], [5]. The two
major controller platforms OpenDaylight [6] and ONOS [7]
implement the strong consistency model, which requires that
the update of a distributed state has been seen by the majority
of the cluster members before it is considered to be committed.
In a strongly consistent cluster, whenever an update request
is initialized by a cluster client at one of the controller replicas,
the receiving replica sends out the received request to the
current cluster leader. The leader is the controller instance
that orders all incoming state update requests, so as to allow
for a serialized history of updates and thus operational state
consistency at runtime. Following a state update at the leader,
the update is propagated using a consensus protocol to the
cluster replicas, and is committed to the data-store only after
the majority of replicas have agreed on the update.
A consensus algorithm ensures that all replicas always
decide on the same value (agreement), with the constraint that
only a value proposed by one of the replicas eventually be-
comes accepted after the synchronization procedure (integrity).
Google’s Chubby [8] is a distributed locking service whose
state-distribution and failure tolerance mechanism are based on
a variation of the Paxos consensus algorithm [9], [10]. Open-
Daylight and ONOS implement the more recent algorithm
RAFT [11]. Unlike Paxos, RAFT also provides for persistent
logging and state reconciliation for recovered replicas.
In addition to the availability concerns, critical infrastruc-
ture providers often have very stringent requirements on the
experienced control plane delay. For example, the smart grid is
a delay-sensitive infrastructure that requires techniques which
identify and react on any abnormal communication network
changes in a timely manner. If the detection and responses are
not made promptly, the grid may become inefficient or even
unstable and cause further catastrophic failures in the entire
network [2]. Events in the grid may require rapid reaction
from the network controller - i.e. rerouting in case of power
grid failures, expedited diagnostics and alarm handling [1].
Furthermore, network management systems in the 5G context
can require bounded configuration times when establishing
on-demand network services [4]. However, the frequently
deployed strong consistency model in a clustered SDN requires
that, prior to any operation in the SDN controller cluster, a
cluster-wide synchronization must occur. The response time of
such a control plane is hence dependent on parameters such
as the cluster size, controller placement, processing delays in
different system components and the failure vectors.
The clustered SDN controller solutions require estimations
of the worst-case response times and the expected availability
for arbitrary sets of configurations, before their deployment
can be considered in critical infrastructure networks. To our
best knowledge, no prior work has investigated these issues.
Hence, we fill the gap with an appropriate analytical study.
In the remainder of the introductory section, we give an
overview of our contributions. In Section II, we describe the
assumed multi-controller SDN architecture from the system
perspective. In the same section we specify the technique
of Stochastic Activity Networks-based (SAN) modeling and
outline the consensus algorithm RAFT. In Section III we
introduce and explain in detail the proposed SAN models for
response time, controller failure injection and cluster recovery
modeling. In Section IV we explain the evaluation methods
and parametrizations used to compute the results presented in
Section V. In Section VI we present existing work in the field
of distributed SDN control plane and consensus algorithms.
Section VII concludes the paper.
B. Our contribution
In this paper, we present a system model of a distributed
SDN control plane that leverages the Stochastic Activity
Networks (SAN) modeling framework for the estimation of
cluster response time and availability measures. Our SANs
comprise the detailed sub-models of the RAFT consensus
algorithm, cluster failure and recovery. We also define the
parametrizations (studies) for different cluster configurations
in order to evaluate the introduced models. We further evaluate
a steady-state configuration of the distributed SDN control
plane using long-term failure rates for SDN controllers at sub-
module, process and hardware level, and short-term response
times experienced after immediate controller failures.
By assuming reliable event delivery and bounded network,
application and data-store commit delays, we can provide
stochastic delay guarantees for response handling times in
non-failure, partial-failure and cluster-majority failure states.
Failures are modeled as stochastic arrival processes for long-
term, and deterministic occurrences for worst-case evaluations.
As per the nature of the modeled consensus algorithm RAFT,
the recovery process too is a combination of stochastic and
deterministic message and timeout delays. We further intro-
duce an enhancement to the current controller platforms for
enabling a fast recovery of controller bundles and processes.
We evaluate its benefits w.r.t. to the total expected response
time and cluster availability using the developed model. Our
SAN models are compiled into Continuous Time Markov
Chain (CTMC) state spaces. In contrast to existing works on
consensus algorithms that derive their performance analysis
from experiments, we provide analytical guarantees. To this
end, our numerical solutions cover the space of all possible
state combinations which an SDN cluster may be in.
II. SYSTEM MODEL, SAN PERFORMABILITY MODELING AND THE RAFT CONSENSUS ALGORITHM
In this section, we introduce the assumed system model that
comprises the forwarding devices, multiple SDN controllers
for redundancy and controller clients. We then outline the
background on the formal concepts used in our modeling
and discuss the evaluated RAFT consensus algorithm in more
detail. The notation used henceforth is presented in Table I.
TABLE I
NOTATION USED IN SECTIONS II AND III. THE REMAINING MODEL PARAMETERS ARE SPECIFIED IN TABLE II.

Symbol     Parameter
C          Number of SDN controller cluster replicas
F or NF    Controller failure count
π(t)       State probability vector at time t
q          Uniformization rate constant of the CTMC
TR         RAFT replica-to-leader network delay
TM         Actual follower-majority-to-leader network delay
TMworst    Worst-case follower-majority-to-leader network delay
TMbest     Best-case follower-majority-to-leader network delay
RM         Number of missing RAFT terms in a lagging follower
FSf        Boolean depicting a failure of a single RAFT follower
FMj        Boolean depicting a failure of the RAFT follower majority
FLdr       Boolean depicting a failure of the RAFT leader
LUp        Counter of currently available RAFT leaders
FUp        Counter of currently available RAFT followers
A. Generic system model
We assume a set of C SDN controllers, collected in a single cluster and deployed for the purpose of achieving fault-tolerant operation [5]. Fig. 1 depicts a deployment of the redundant control- and data planes in an exemplary industrial SDN network. Control plane redundancy is realized by running C = 3 controllers simultaneously, and a number of disjoint paths in between them for fail-over purposes in case of link and node outages. In general, a deployment of C = 2F + 1 controllers tolerates a maximum of F controller failures before the SDN cluster becomes unavailable. Thus, in Fig. 1 only a single controller failure is tolerated before the cluster stops serving the clients' requests (consult the explanation below). The
clients of the controllers, such as the network administrators,
switches, network appliances and end-hosts, can trigger con-
troller events that lead to a cluster-wide state synchronization
and subsequent event processing in the cluster leader. Clients
communicate their requests (i.e. Remote Procedure Calls, state
updates, topology events etc.) to any live replica that is a
member of the SDN cluster. The replica then contacts the
current cluster leader to serialize (order) the request, which
in return distributes the request to other replicas, commits the
request and executes its local state machine (in zero-failure
case). The replica is then notified of the result of the request
execution and can respond to the client with an application
response. In the case of a leader failure during the request
processing, a new leader is elected by executing a consensus
algorithm and the synchronization process is re-initiated.
The limitation of supporting only F failures when 2F + 1
replicas are deployed relates to the CAP theorem [12]. This
theorem states that any distributed system can provide a
maximum of two of the following three system properties at
the same time: consistency, availability and partition-tolerance
(CAP). Consensus algorithms such as RAFT [11] and Paxos
[9], [13] favor consistency and partition-tolerance properties,
and are able to forward their state consistently even in the face
of network partitions. A consistent operation of a controller
cluster ensures that the majority of controllers will have the
same controller state at any given time, and that no two
conflicting state updates are ever successfully committed to
the shared update history. Hence, controllers are in consensus
with regards to their state. Consistent and partition-tolerant
operation, however, comes at the cost of a lower availability,
since consistent operation in the face of network partitions can
only be guaranteed by disabling the operation of a partitioned
cluster minority while the majority continues to operate. In
the remainder of the paper we take this limitation into account
and consider the system as available only when the majority
of controller nodes are available and are mutually reachable.
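To make the majority arithmetic explicit, the following minimal sketch (the helper names are ours and purely illustrative) computes the tolerated failure count and the commit quorum for a given cluster size:

    # Minimal sketch of the 2F+1 fault-tolerance arithmetic; helper names are illustrative.
    def tolerated_failures(c: int) -> int:
        """A cluster of C = 2F + 1 replicas tolerates F = (C - 1) // 2 failures."""
        return (c - 1) // 2

    def commit_quorum(c: int) -> int:
        """A commit requires the leader plus a follower majority, i.e. C // 2 + 1 replicas."""
        return c // 2 + 1

    for c in (3, 5, 7, 9):
        print(c, tolerated_failures(c), commit_quorum(c))
    # C = 3 tolerates 1 failure and needs 2 replicas to commit; C = 5 tolerates 2 and needs 3.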
Fig. 1. An exemplary industrial SDN with redundant paths for majority
of controller-to-controller and controller-to-switch connections. The SDN
controllers execute the RAFT [14] agents, responsible for per-state-shard state
synchronization, leader election and cluster recovery after individual replica
failures. Red dashed lines represent the RAFT session exchanges between
the SDN controller replicas, blue dashed lines are the ”client” connections
(switch-controllers, northbound interface client-controllers).
In state of the art controller platform implementations with
a strong consistency model [6], [7], network configuration re-
quests facilitate a number of state changes and inter-controller
synchronization steps before coming to a consensus in de-
cision and actual execution of the configuration change. For
example, assuming an SDN module that subscribes and reacts
to topology changes (e.g., raises alarms to an administrator in
case of link failures), the topology change would first need
to be committed across the majority of controllers, before the
subscribed SDN module could be notified of the committed
change and execute the reaction. The duration of this process
obviously depends on the cluster size, the availability of
controllers and the controller-to-controller delays.
In the following subsections we describe the SAN frame-
work we use for modeling of the distributed SDN control
plane. We give an overview of the RAFT consensus algorithm
responsible for state synchronization, cluster leader election
and state recovery after failures.
B. Stochastic Activity Networks
In contrast to previously published evaluation methods that provide a means to evaluate an existing product or deployment, e.g. by using measurement techniques, deductive analysis allows for a system evaluation before the system is actually deployed. Hence, significant savings can be achieved
if the deductive solutions are able to accurately predict the
real-world behavior of the future non-implemented system
or system extensions. With this in mind, contrary to the
previously published methods on evaluation of the distributed
SDN control plane, which base their analysis on a limited
number of physical cluster configurations [15], [16], we opt
for the flexible and economical deductive solution.
Discrete-event simulation is, for example, partially applica-
ble to our problem. Simulation allows for a tunable quality
of the results by repeating execution of a given model and
derivation of the relevant output measures. However, the
simulation methodology may not handle corner cases, which
are numerous in a consensus algorithm such as RAFT.
Another class of deductive analysis methods are the analytic
numerical methods, which are suitable when a closed-form
solution is not obtainable. Analytic numerical solvers allow
for an accurate evaluation of each system state configuration.
For this purpose, they require a manually or automatically gen-
erated model state space as an input. The additional overhead
of the state space generation, as well as the inclusion of each
state in the solution, generally leads to a higher computational
effort compared to simulation. Furthermore, the generation
of the state space may lead to a state explosion problem
and infeasible solving times. Therefore, we dedicate Section
V-C to specifically discuss the scalability of our models.
Instead of the manual state modeling, we automate the model
generation process and hence avoid the issue of largeness [17]
of the resulting state space. For the purpose of the automated
model generation, we use the Stochastic Activity Networks
(SANs), one of the most prominent representatives of model
generation frameworks. We choose specifically SANs over
similar techniques, such as Generalized Stochastic Petri Nets
and Stochastic Reward Nets, because of their practical extensions
for the inhibition of state transitions, as well as the flexible
predicate assignment to the gate abstractions (see below).
SANs are an extension of Petri Nets (PN) and an established
graphical language for describing the system behavior. SANs
have been successfully used in survivability and performability
studies of critical infrastructures [18], industrial control sys-
tems [19] and telecommunication systems [20] since the late
1980s. We provide a brief summary of the most important
SAN concepts used in our modeling here, and refer the more
interested reader to comprehensive descriptions in [17], [21].
A SAN consists of places, activities, input gates and output
gates. Similar to PNs, places have a certain token assignment
associated with them. Every unique assignment of tokens
across the places uniquely defines a state of the SAN. These
states are called markings. In Markov Chain analogy, a single
marking represents a unique state of a Markov Chain. An
activity element of a SAN defines a transition with the
corresponding transition rates, and allows for controlling the
flow of tokens from a single SAN place into a different SAN
place. Furthermore, an activity allows for connecting a place
to an output gate where, on transition of a token from a place
to an output gate, a sequence of actions can be taken - e.g. "if the number of tokens in place A > n, increment the number of tokens in place B by m". Hence, compact state changes (and
a large number of unique markings) triggered by a particular
transition may be modeled using a smaller number of modeling
elements, compared to a traditional Markov Chain.
When an activity fires, a number of tokens are removed
from the source place and transferred to a destination place
connected by the activity. An input gate serves as an inhibitor
of an associated activity. It specifies a boolean predicate which,
when evaluated true, enables an activity and allows the firing
4
of the activity. If the inhibitor evaluates false, the associated
activity is disabled. An instantaneous activity is enabled at all
times, and will fire whenever there are tokens available in its
input place. A timed activity, on the other hand, is assigned a
time distribution function which specifies the firing rate of a
specific activity. In our model, for timed activities we assign
the deterministic (Erlang-approximated) and exponential firing
rates, but also specify instantaneous activities where necessary.
An activity may further lead to a token transfer from a source
place to one of multiple destination places. This uncertainty
is modeled using a case definition for each destination state,
where each case is assigned a probability parameter.
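For readers unfamiliar with the SAN vocabulary, the following sketch illustrates how the primitives used above (markings, marking-dependent rates, input-gate predicates and probabilistic cases) could be represented in code; the class and field names are our own illustration and are not Möbius constructs:

    # Illustrative representation of SAN primitives (not Möbius code); names are ours.
    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    Marking = Dict[str, int]   # place name -> token count; each unique marking is one CTMC state

    @dataclass
    class Activity:
        name: str
        rate: Callable[[Marking], float]        # firing rate, may depend on the current marking
        input_gate: Callable[[Marking], bool]   # boolean predicate; False inhibits the activity
        cases: List[Tuple[float, Callable[[Marking], Marking]]]  # (probability, output-gate action)

    # Example: a software-failure activity moving one token from NodesUp to NodesDown.
    # Marking-dependent rate: NodesUp nodes, each failing roughly once per week (in seconds).
    process_failure = Activity(
        name="Process_F",
        rate=lambda m: m["NodesUp"] / (7 * 24 * 3600.0),
        input_gate=lambda m: m["NodesUp"] > 0,
        cases=[(1.0, lambda m: {**m, "NodesUp": m["NodesUp"] - 1,
                                "NodesDown": m.get("NodesDown", 0) + 1})],
    )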
To solve the SAN, it must first be transformed into a
discrete-state stochastic process [17]. We make use of the flat
state space generator implemented in the Möbius modeling tool [22], to generate the Continuous Time Markov Chain (CTMC) state space inherent to the evaluated SAN. To derive instantaneous state probabilities of a CTMC, the transient solver of Möbius implements the uniformization method [17],
[23]. In short, using uniformization, the transient state probability vector π(t) of the CTMC can be expressed in terms of a one-step probability matrix of a Discrete Time Markov Chain (DTMC), so that all state transitions of a resulting DTMC occur with a uniform rate q. As a result of the transformation, the desired state probability vector π(t) at time t is governed by a Poisson variable qt and can be expressed as follows:

π(t) = Σ_{i=l}^{r} v(i) · e^{−qt} · (qt)^i / i!,   where v(0) = π(0)    (1)

where v(i) represents an iteratively computed DTMC state probability vector at step i. Lower and upper bounds, l and r, govern the number of iterations required to compute the state probability vector with an overall error tolerance of ε = εl + εr and truncation points l and r, respectively.
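As an illustration of Eq. (1), the sketch below applies uniformization to a small example CTMC; the generator matrix, rates and truncation handling are ours and simplified with respect to the Möbius transient solver:

    # Uniformization sketch for Eq. (1); example generator and rates are illustrative.
    import numpy as np

    def transient_probs(Q: np.ndarray, pi0: np.ndarray, t: float,
                        eps: float = 1e-9, max_iter: int = 100_000) -> np.ndarray:
        """Return pi(t) for a CTMC with generator Q and initial distribution pi0."""
        q = 1.1 * np.max(-np.diag(Q))      # uniformization rate q, above the maximum exit rate
        P = np.eye(len(Q)) + Q / q         # one-step DTMC probability matrix
        v = pi0.copy()                     # v(0) = pi(0)
        weight = np.exp(-q * t)            # Poisson weight for i = 0
        result = weight * v
        acc, i = weight, 0
        while 1.0 - acc > eps and i < max_iter:   # right truncation point r via error tolerance
            i += 1
            v = v @ P                      # v(i) = v(i-1) * P
            weight *= q * t / i
            acc += weight
            result += weight * v
        return result

    # Toy two-state up/down model: failure rate 0.2/h, repair rate 2.0/h
    Q = np.array([[-0.2, 0.2],
                  [2.0, -2.0]])
    print(transient_probs(Q, np.array([1.0, 0.0]), t=1.0))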
C. Case Study: RAFT Consensus Algorithm
RAFT is a distributed consensus algorithm that provides
safe and ordered updates in a system comprised of multiple
running replicas. RAFT is the only consensus algorithm imple-
mentation in the two prominent open-source SDN controller
platforms OpenDaylight [6] and ONOS [7]. It tries to solve the
issues of understandability of the previous de-facto standard
consensus algorithm Multi-Paxos [9], and additionally stan-
dardizes an implementation of leader election and post-failure
replica recovery operations. A comprehensive description of
the algorithm can be found in [11], [14].
A RAFT cluster comprises leader, follower and candidate replica roles. The leader is the node that parses and distributes incoming client updates (i.e. reads, writes, no-ops) to RAFT
followers and ensures safe commits. The majority of cluster
followers must confirm the acceptance of a new update before
the leader and the followers may commit the update in
the local commit log. Only after the update is committed,
the SDN applications built on top of a RAFT agent can
continue their processing. After the application has computed
the operation related to state update, a response is forwarded
to a client (e.g., a switch or network management system).
RAFT guarantees that the applied state updates are eventually
Fig. 2. A simplified lifecycle schema of a replica inside the RAFT cluster.
Adapted from [14] and extended for the purpose of detailed modeling.
committed in every available replica in the cluster in the
right order. Furthermore, each update is applied exactly once,
hence enabling linearizable semantics [14] when operating
with the controller state. In the case of a leader failure, after
an expiration of an internal follower timeout, the remaining
followers automatically switch to a candidate role. A candidate
is an active replica which offers to become the new cluster
leader. To do so, it propagates its candidate status to the other
available replicas. If a majority of nodes vote for the same
candidate, this candidate node becomes the new leader.
Updates in RAFT require a single-round trip delay between
the leader and the preferred follower majority (the fastest to
reach followers). When a controller failure occurs, depending
on the role of the failed replica, additional delay overhead is
imposed. Failures in the RAFT leader during the processing
of a particular update lead to a new leader election after an
expired election timeout. After an exceeded client timeout, the
client retries its request. If instead of the leader a follower
had failed, depending on the follower’s type and the number
of active followers, we distinguish three scenarios:
• Failure of a follower that is not a member of the preferred follower majority results in no additional imposed delays between the leader and the cluster majority.
• Failure of a follower that is a member of the preferred follower majority leads to the RAFT leader having to include an additional "slower" follower in the preferred follower majority. This, in return, may negatively affect the update commit times depending on the follower's placement and its distance to the RAFT leader.
• Failure of any follower that comprises the follower set, with no backup followers available (stand-by RAFT members), necessarily leads to the cluster unavailability. The client update requests that were not successfully committed must be repeated by the client.
After a successful recovery of the majority of the RAFT
members and the re-election of a new leader, RAFT is able
to forward its state and commit new updates. Depending on
the failure source and the repair time, as well as on the
RAFT recovery parameters (candidate and election timeout),
the recovery takes a non-deterministic period to finish.
Fig. 2 gives a high-level overview of the states a cluster
replica may traverse throughout its lifecycle. We present the
more detailed structural and behavioral models of RAFT in
zero- and multiple-failure cases in Section III. Section IV
details the timing variables used in our parametrization of
RAFT.
III. SAN MODELS
In this section, we present the SAN models for response
time, failure and recovery processes in the context of a RAFT-
enabled SDN control plane. We represent places as blue cir-
cles, timed activities as thick blue vertical bars, instantaneous
activities as thin blue vertical bars, and input and output gates
as thick red and black arrows, respectively.
A. RAFT End-to-End Delay SAN Model
The distributed SDN control plane model assumes C controllers connected in a RAFT cluster, hosting one or multiple
SDN applications (referred to as bundles) that react to asyn-
chronous client events. The client device is external to the
SDN controller (e.g. an OpenFlow switch or a northbound
interface consumer). The client sporadically generates events,
such as flow requests or switch notifications which neces-
sitate data-store updates and its subsequent synchronization.
The client delivers these events in asynchronous and reliable
manner to the replicas in the cluster for processing. After the
controller cluster finishes the event processing, the client is
notified of the result. The SAN in Fig. 3 depicts this process.
The place IdleState models the initial system state
where no events are queued for internal processing. Following
a new event arrival at any of the RAFT replicas, the receiving
replica is tasked with the propagation of the new event
to the current RAFT leader. New event arrivals increment
the token amount in the state EventQueuedForLeader,
where events are queued until a leader controller replica
becomes elected in the cluster. The input gates enumerated
LeaderAndMajorityUp# ensure that the transmission of
the event to the RAFT leader or replicas, as well as the
intermediate processing inside the cluster happens only in the
case where both the RAFT leader and the follower majority
are up and available in the cluster. Propagation of the event
from the furthest-away replica to the leader is modeled by
the activity delayToLeader using a deterministic worst-
case delay metric TR = TMworst (see below). On a re-
ceived event, the leader initiates the propagation of the re-
spective data-store state update to its followers. The delays
induced by the activities delayToMajorityFollowers
and delayFromMajorityToLeader correspond, in the
best-case to the leader-to-(preferred)-majority delay TMbest .
In the worst-case, to contact the follower majority, the leader
needs to contact the follower furthest away from it and hence
induce the worst-case uni-directional network delay TMworst .
Thus, the delay between the RAFT leader and the follower
majority is governed by the number of failed followers. We
model the non-constant leader-follower majority delay TM as
detailed below. When the follower majority has acknowledged
the state update, the leader continues committing the data-
store change locally, and the system eventually reaches the
CommitDone state. To prevent the leader from broadcast-
ing multiple unacknowledged updates, we ensure the input
gate DisableConcurrentUpdates enables the transition
delayToMajorityFollowers if and only if the distribu-
tion states of RAFT do not contain any outstanding tokens (no
synchronization in progress).
Alternatively, in the case that at least one replica that necessarily comprises the cluster majority lags behind the RAFT leader in terms of its commit log (see Subsection II-C),
the leader enforces additional steps in order to synchronize
the cluster majority with its local view. To this end, the
activity majorFollowerNotUpToDate fires and a token
is incremented in the place BringFollowersUpToDate.
The lagging followers are thus assigned the Follower
(Lagging) node status (depicted in Fig. 2). For each
RAFT term state that is missing in the lagging follower, an
additional round-trip delay in the critical path between the
leader and replica is induced, hence adding 2 · RM · TM to the overall delay, where RM is the maximum number of
missing RAFT terms in the follower. This delay is imposed in
the definition of the activity lateBringUpToDateNodes.
To govern the activation of the instantaneous activity
majorFollowerNotUpToDate we make use of a stateful
counter CounterFailures that is incremented on each
new logged replica failure (refer to Subsection III-B). We
consider the worst-case and hence assume that at any time
after ⌊(C − 1)/2⌋ + 1 replica nodes have been disabled, out-
of-date replicas are automatically present in the majority of the
follower nodes required to confirm a leader update. Hence,
we infer the additional overhead of updating the lagging
nodes in the replica majority. The flag to enable the activity
majorFollowerNotUpToDate is cleared after the recon-
ciliation (as a side configuration of the gate resetCounter).
Following an applied data-store commit in the majority of
replicas, the leader commits the state locally and the SDN
application gets notified of the data-store event. The data-store
commit and the SDN application’s processing delay are in-
duced during the activity applyCommit and are modeled as
TC and TA in Table II, respectively. After the application has
completed its processing (system state ApplicationDone),
the leader notifies the replica that initially generated the update
event (thus adding once more TR to the overall worst-case
delay), and the replica further forwards its response to the
client (thus adding TCR which is the client-replica delay). The
system then finally reaches the stable state SequenceEnd,
where the event is marked as successfully processed. In case
of a failure occurrence in the leader or follower majority
during the event processing, the activities named CH# lead
to a token being shifted from the current SAN place to
the EventQueuedForLeader place, using the output gate
increment action modeled by OGF#. Hence, the event distribu-
tion procedure restarts as soon as the cluster is re-established.
The delay until a critical failure occurrence of the leader or
the follower majority is noticed by the client is modeled using
the client timeout TCL .
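Reading the critical path above end to end, a rough (non-SAN) bound on the zero-failure response time can be sketched as

TR + 2 · TM + TC + TA + TR + TCR,

with an additional 2 · RM · TM whenever lagging followers must be reconciled within the acknowledging majority. With the parametrization of Table II and TM = TMbest, the zero-failure term amounts to roughly 10 + 10 + 1 + 1 + 10 + 1 = 33 ms; this back-of-the-envelope reading ignores queuing and the gate predicates of the SAN and is only meant to make the delay composition explicit.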
As previously noted, the delay from the RAFT leader to
the furthest-away replica from the follower majority will vary
depending on the availability and proximity of followers w.r.t.
Fig. 3. The RAFT response time model depicting the sub-processes of: the receival of a client event at a follower proxy-replica; the event propagation to the
RAFT leader; the event propagation from the leader to the follower majority; the data-store commit of the client event and its subsequent processing in the
SDN application; and ultimately the propagation of the SDN application’s response from leader, through the proxy-replica, to client. Detecting a failure of the
RAFT follower majority or leader leads to the restart of the event handling process, starting at the place EventQueuedForLeader - but only after an added
deterministic delay of client timeout (see Table II). Furthermore, the extended sub-process of AppendEntries RPC, necessary to reconcile the RAFT followers
that lag behind the current RAFT leader in terms of their state, is included in the lower right part of the critical path (state BringFollowersUpToDate).
the current cluster leader. We annotate the leader-majority followers delay as TM. Assuming a deployment of C controller nodes and a single leader L at any time, the set SL = {DR1, DR2, ..., DR⌊C/2⌋} contains the maximum bounded delays between the leader L and the ⌊C/2⌋ follower nodes closest to L w.r.t. the delay between controller L and each of the available followers Ri. Hence, we define the delay between leader L and the follower majority as the delay between L and the farthest follower in the majority, TM = max{SL}.
To emphasize the effect of a failed preferred-follower controller on the response time, in our exemplary evaluation, we scale the delay value to contact the follower majority linearly with the number of currently available followers using a scaling factor SF, so that:

TM = SF · TMbest    when Fup ≥ ⌊C/2⌋
TM = undefined      otherwise

For the evaluation purposes we model SF as a function of the current marking of the SAN, so that SF = (C − 1)/Fup and thus:

TM = ((C − 1)/Fup) · TMbest    when Fup ≥ ⌊C/2⌋
TM = undefined                 otherwise

where Fup represents the number of currently available followers. In the best case where all nodes are up, the leader-majority delay equals TMbest. In the worst case, the controller-majority delay peaks at TMworst = TR = 2 · TMbest when only ⌊C/2⌋ + 1 controller nodes (including the leader) are active.
Using a fixed scaling factor is an exemplary and non-optimal
representation, as the exact worst-case leader-majority delay
is equal to the delay between the leader and the farthest
away follower in the current follower majority, and hence
necessitates knowing the exact bounded delays between each
two SDN controllers in the network. We omit this level of
model granularity as the required parameters would require
population from an engineered network topology, and would
further rely on the optimality of the used controller placement
technique. Nevertheless, the SAN model proposed here can
be extended to take an arbitrary set of controller-to-controller
delay parameters with little effort.
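The two variants of TM discussed above can be sketched as follows; the per-pair delay values and helper names are hypothetical and only illustrate the computation:

    # Sketch: leader-to-follower-majority delay TM; delay values and names are illustrative.
    from typing import List

    def tm_exact(delays_from_leader: List[float], c: int) -> float:
        """TM = delay to the farthest of the floor(C/2) closest available followers."""
        majority = c // 2
        if len(delays_from_leader) < majority:
            raise ValueError("follower majority unavailable; TM undefined")
        return sorted(delays_from_leader)[majority - 1]

    def tm_scaled(tm_best: float, f_up: int, c: int) -> float:
        """Linear scaling used in the evaluation: TM = ((C - 1) / F_up) * TM_best."""
        if f_up < c // 2:
            raise ValueError("follower majority unavailable; TM undefined")
        return (c - 1) / f_up * tm_best

    # C = 5, delays (ms) from the leader to the currently reachable followers
    print(tm_exact([4.0, 5.0, 7.0, 9.0], c=5))   # farthest of the 2 closest followers -> 5.0
    print(tm_scaled(5.0, f_up=4, c=5))           # all followers up -> TM_best = 5.0
    print(tm_scaled(5.0, f_up=2, c=5))           # two followers down -> 10.0 (= TM_worst)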
Data-store sharding: The data-store of an SDN controller
(e.g., OpenDaylight) is sharded into an arbitrary amount of
data shards at a flexible granularity (e.g., data shard for topol-
ogy or flow state). Separate RAFT sessions are responsible for
each data shard. We assume that all data shards are available
on all SDN controller replicas. Hence, each available controller
is an active member of each per-shard RAFT session. RAFT
can handle the updates of different data shards concurrently
and in isolation. This enhances the overall throughput of the
system as multiple asynchronous updates to different shards
are parallelized and executed in a non-blocking manner.
Batching of the data-store updates and latency considera-
tions: We assume that the clients specify their updates that
modify a shard either as a single state update or as a batch
of updates [24] for maximized throughput and minimized
response time. Thus, the worst-case occurs when the update or
a batch of concurrent updates is exchanged in a single frame
across the cluster and the majority of the cluster members fail
before the updates are committed successfully. If a new update
arrives during the processing of another update of the same
shard and a leader fails, we assume that the client updates are
in the worst-case batched with the previous non-committed
updates and are transferred in one round after the cluster has
recovered. This model is fitting for handling real-time events
(e.g., alarms) that should preferably never get queued.
B. Cluster Failure SAN Model
To evaluate the performance of the SDN distributed control
plane and RAFT in the face of failures, we introduce a
dedicated failure model. For general long-term considerations,
we distinguish between hardware and software failures with
failure rates λFH and λFS, respectively. All specified non-
deterministic timeouts, failure and repair rates in our model
follow a negative exponential distribution. For software fail-
ures, we distinguish failures at the application bundle (i.e. an
OSGI bundle in ONOS [7] and OpenDaylight [6] controllers)
and process level failures. Similarly, repair rates are distin-
guished correspondingly as specified in Table II.
The SAN failure model is depicted in Fig. 4. The place
NodesUp contains the total number of available controllers
(nodes that are up, but not necessarily assigned a RAFT
member role). Depending on the failure type (at hardware,
process or bundle level), after an occurrence of a failure, a
token is placed into the respective NodesDown place. Further-
more, each firing of a failure activity triggers a token addition
in the place NodeDownSelectFailure and results in
a subsequent evaluation in the instantaneous case activity
failureSelectRole. We distinguish between the safe-
follower (FSf ), follower-majority (FMj ) and leader failures
(FLdr), with the following probabilities:
P(FSf) = Fup / (Fup + 1)   when Fup ≥ ⌈(C − 1)/2 + 1⌉ ∧ Lup > 0
P(FSf) = 0                 otherwise

P(FMj) = 1                 when Fup < ⌈(C − 1)/2 + 1⌉ ∨ Lup == 0
P(FMj) = 0                 otherwise

P(FLdr) = 1 / (Fup + 1)    when Fup ≥ ⌈(C − 1)/2 + 1⌉ ∧ Lup > 0
P(FLdr) = 0                otherwise
where Fup and Lup are the counters of tokens in places
FollowersUp and LeaderUp before the failure occur-
rence, respectively. Failure of the follower-majority FM j , or
a leader failure FLdr during the event handling in controller,
results in a client timeout and subsequent restart of the event
handling process. On the other hand, failure FSf does not
affect the cluster availability as the reorganization of a stable
cluster majority is still possible with the remaining nodes,
albeit with an added delay (as per the definition of TM).
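The role-selection probabilities above translate directly into a small sampling routine; the function and label names below are illustrative and mirror, rather than reproduce, the instantaneous activity failureSelectRole:

    # Sketch of the failure-role selection; mirrors the probabilities given above.
    import math, random

    def failure_role_probs(f_up: int, l_up: int, c: int):
        """Return (P(FSf), P(FMj), P(FLdr)) for the next failure, given the current marking."""
        threshold = math.ceil((c - 1) / 2 + 1)
        if f_up < threshold or l_up == 0:
            return (0.0, 1.0, 0.0)            # the failure necessarily hits the (lost) majority
        p_leader = 1.0 / (f_up + 1)           # failed node is the leader with prob. 1/(Fup + 1)
        return (f_up / (f_up + 1), 0.0, p_leader)

    def sample_failure_role(f_up: int, l_up: int, c: int) -> str:
        p_sf, p_mj, p_ldr = failure_role_probs(f_up, l_up, c)
        return random.choices(["safe_follower", "majority", "leader"], [p_sf, p_mj, p_ldr])[0]

    print(failure_role_probs(f_up=4, l_up=1, c=5))   # (0.8, 0.0, 0.2)
    print(failure_role_probs(f_up=2, l_up=1, c=5))   # (0.0, 1.0, 0.0)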
Fig. 4. The SAN model of the failure processes includes the long-term
failure rates (Hw_F, Process_F and Bundle_F) and the controlled
failure injection (Inj_Hw_F, Inj_Process_F and Inj_Bundle_F).
The failure type is decided based on a random selection process (bottom-
left), and its severity is a function of the current system state (bottom-right).
In order to observe the response time during and shortly af-
ter the failure, we also model a procedure for controlled failure
injection of single and multiple-correlated transient controller
failures and observe the system performance over a short-
term time range at millisecond granularity. The correlated
failures are modeled as bursty and may occur concurrently.
In the past, correlated failures have been investigated in the
context of distributed systems [25], and represent a flexible
method to consider chained failure propagation, i.e. resulting
from a malfunctioning replicated SDN application. The failure
injection process is depicted in the upper left part of the
SAN shown in Fig. 4. The place BurstyFailureTokens
initially holds a number of tokens corresponding to the num-
ber of simultaneous bursty failure injections. The activity
selectFailureType governs the probability distributions
for encountering a particular type of node failure.
As will be shown in the Subsection V-A, in our response
time evaluation we distinguish the scenarios of mixed hard-
ware and software failure injection, as well as the single and
multiple-correlated failure injections at the granular level of
controller bundle, controller process or hardware.
Critical data plane failures: In a DTMC, the probability of occurrence of an SDN controller element failure Fc or a critical data plane element failure Fd corresponds to:

P(Fc ∪ Fd) = P(Fc) + P(Fd) − P(Fc ∩ Fd)    (2)
In the continuous time domain, failure arrivals for the crit-
ical data plane elements that carry the controller-to-controller
flows, and failure arrivals for the SDN controller elements can
be represented as two independent Poisson processes Nd(t)
and Nc(t)with the unique firing rates λdand λc, respectively.
Since the two processes are independent, they also have
independent increments. Therefore, critical failure arrivals
associated with the summed process Nt(t) = Nd(t) + Nc(t)
can be modeled using the rate λt = λd + λc.
The failure rates for the critical data plane paths which
carry the network control flows can be embedded in the
parametrization of our models without an additional modeling
overhead (see Table II). However, in this work we primarily
focus on studying the control plane consensus for the use case
of a highly redundant industrial network [26], [27]. Thus, we
intentionally decouple our work from the data plane reliability
studies and assume the reliable parametrization 1/λd = ∞.
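The superposition argument can be checked with a short simulation sketch; the rates below are illustrative, and the paper's evaluation corresponds to λd = 0 (reliable data plane):

    # Sketch: superposition of independent controller and data-plane failure arrivals.
    import random

    def merged_failure_times(lambda_c: float, lambda_d: float, horizon: float):
        """Sample arrivals of the merged process with rate lambda_t = lambda_c + lambda_d."""
        lambda_t = lambda_c + lambda_d
        t, arrivals = 0.0, []
        while True:
            t += random.expovariate(lambda_t)     # inter-arrival times are Exp(lambda_t)
            if t > horizon:
                return arrivals
            # each arrival is a controller failure with probability lambda_c / lambda_t
            kind = "controller" if random.random() < lambda_c / lambda_t else "data-plane"
            arrivals.append((round(t, 1), kind))

    # e.g. one controller software failure per week (168 h), reliable data plane
    print(merged_failure_times(lambda_c=1 / 168.0, lambda_d=0.0, horizon=1000.0)[:5])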
C. RAFT Recovery SAN Model
The RAFT recovery SAN model in Fig. 5 depicts the
process of re-inclusion of a previously disabled controller
replica in the RAFT cluster. The place InitElectionPool
holds a token for each running controller replica that is
available but still needs to be admitted in the cluster. As
per RAFT design, the replica expects the RAFT leader of
the current term to announce its presence using a leader
heartbeat. If a leader is identified before the follower time-
out expires, the replica takes upon the follower role and a
token is assigned to the place AnnounceFollowerRole.
Alternatively, the replica switches to the candidate role (place
AnnounceCandidateRole). Three cases are now possi-
ble, each adding its specific delay to the overall response time:
1) If the cluster majority is up (≥ ⌊C/2⌋ + 1) and the repli-
cas acknowledge the candidate as a new leader before
the expiration of the candidate timeout, the candidate is
elected as the leader (output gate setLeaderUp). The
announcement of the candidate role from the candidate
to the cluster majority takes an additional round trip.
2) If another leader is identified while the replica
is in the candidate state, the candidate replica
becomes a follower and a token moves from
the AnnounceCandidateRole place to the
setNewFollowerUp output gate.
3) If the cluster majority sends no acknowledgment to the
candidate nodes during the candidate timeout (occurs
whenever a total of ⌊C/2⌋ + 1 replicas are still down),
the candidate waits for the timeout to expire and then
repeats the candidate procedure.
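The three cases can be summarized by the toy transition function below; it is a sketch of the re-inclusion logic under a static cluster view, not the recovery SAN itself, and the timeout means follow Table II:

    # Sketch of the replica re-inclusion logic (follower/candidate/leader); names are ours.
    import random

    def recover_replica(c: int, nodes_up_incl_self: int, leader_present: bool,
                        follower_timeout_mean: float = 0.225,
                        candidate_timeout_mean: float = 0.225,
                        round_trip: float = 0.010):
        """Return (final_role, elapsed_seconds) for one recovering replica."""
        if leader_present:
            # A leader heartbeat is seen before the follower timeout expires -> follower role.
            return ("follower", 0.0)
        elapsed = random.expovariate(1.0 / follower_timeout_mean)   # time out -> candidate
        if nodes_up_incl_self < c // 2 + 1:
            # Case 3: no quorum; the candidate times out and retries once more nodes recover.
            return ("candidate", elapsed + random.expovariate(1.0 / candidate_timeout_mean))
        # Case 1: quorum available; one additional round trip to collect the votes.
        return ("leader", elapsed + round_trip)

    print(recover_replica(c=3, nodes_up_incl_self=2, leader_present=False))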
Fig. 5. The RAFT recovery SAN model depicts the inclusion of a previously
unavailable controller replica into the cluster. Depending on the current state,
the replica may become either the new RAFT leader or a follower. Duration
of the recovery process will affect the resulting event response time and the
cluster availability if the recovering replica is needed to establish a follower-
majority and elect a new RAFT leader.
If the replica becomes a RAFT leader, a token is assigned
to the LeaderUp place (previously empty); alternatively, the
token is assigned to the place FollowersUp. In both cases,
the NodesUp token counter is incremented by 1.
IV. EVALUATION
A. Model parametrization using a RAFT experiment
To evaluate the general fitness of our response time model,
we first compare the proposed model against an experimental
RAFT setup in a zero-failure scenario. For this purpose, we
implement a RAFT agent and deploy multiple copies thereof
in a RAFT cluster. For the RAFT backend implementation, we
use the open-source Java library libraft 1. The cluster was orga-
nized so that the controller nodes, acting as RAFT agents, were
reachable in an any-to-any manner over a single-hop Open
vSwitch instance. We configured the Open vSwitch to inject a
constant symmetrical delay of 5ms on each egress port and we
then used this value as a deterministic base leader-follower-
majority delay TM in the model parametrization. Furthermore, based on the libraft performance observations, we modeled the commit delay parameter TA as an exponentially distributed
delay with a mean of 1ms. The resulting modeled response
time and the comparison with the experimental results for
different controller cluster sizes are depicted in Fig. 6. As can
be noted, our model predicts the expected performance well.
To reflect the stochastic performance guarantees when replica
failures are concerned, we resort to using only SAN-based
analytical modeling for most accurate approximations.
B. Fast recovery mechanism for bundles and processes
In contrast to evaluating the system with purely fixed
software repair rates, as was done in relevant past studies [28],
1libraft - Raft Distributed Consensus Protocol in Java: https://libraft.io
Fig. 6. Comparison of experimentally observed and modeled RAFT per-
formance with clusters of various sizes. Represented are the CDFs of per-
cluster-configuration measurements, with each measurement encompassing
1000 sequential write operations. The observed delay considers a fixed single-
hop packet latency of 5ms in between the RAFT leader and replicas, as well
as a 1ms data-store commit time in the RAFT leader and replica majority.
Client and application delays were not considered in this experiment. The
measurements were taken in a zero-failure state of the RAFT cluster and
should serve as an initial indicator of the response time model fitness.
[29], we utilize a recovery model that much more closely reflects the actual state-of-the-art SDN controller implementations. We
further propose an optimization to enhance upon the standard
repair time in the face of controller failures. We assume a
watchdog-like mechanism implementation that monitors the
critical controller components’ health and correctness. The
watchdog can monitor both the granular SDN controller ap-
plications (bundles) and the actual controller process (that
comprises many bundles). Whenever a bundle or a process
fails, we assume an immediate scheduling of a rejuvenation
procedure that repairs the affected software component.
Realization: While there may exist various designs to realize
a watchdog functionality for the purpose of monitoring the
liveness of a software or hardware component, we opted
to implement the watchdog as a software-agent external to
the OSGi container hosting the SDN controller bundles. We
deployed the watchdog agent on the same host machine as
the monitored SDN controller instance. Following a successful
start-up of both the watchdog and the SDN controller pro-
cesses, the watchdog establishes a connection to the OSGi
environment hosting different controller bundles. We make
use of the Apache Karaf's² Remoting mechanism to allow for
remote connections to a running Karaf instance.
Our agent periodically polls the status of a bundle’s lifecycle
and discovers that the bundle is in one of the following UP
states: {INSTALLED, STARTING, ACTIVE}; or DOWN states:
{UNINSTALLED, STOPPING, RESOLVED}. Upon discovery
of a bundle that is in a DOWN-state, the agent schedules a
bundle:start-transition for the affected bundle, in order to get
it up and running in an UP-state. In the case of an unsuccessful
remote connection to Karaf, the watchdog evaluates the current
list of processes for false positives and, if a missing Karaf
process is detected, it schedules an immediate restart of Karaf.
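A minimal polling loop in the spirit of the described agent is sketched below. It assumes that Karaf's remote console client is reachable on the local machine and that console commands such as bundle:list and bundle:start are available; the output parsing and the process-restart hook are simplified assumptions, not an excerpt of the actual implementation:

    # Illustrative watchdog loop; Karaf client invocation and parsing are simplified assumptions.
    import subprocess, time

    DOWN_STATES = {"UNINSTALLED", "STOPPING", "RESOLVED"}

    def karaf(command: str) -> str:
        """Run a console command through Karaf's remote client (assumed to be on PATH)."""
        return subprocess.run(["client", command], capture_output=True,
                              text=True, timeout=10, check=True).stdout

    def check_once() -> None:
        try:
            for line in karaf("bundle:list -t 0").splitlines()[1:]:
                fields = [f.strip() for f in line.replace("\u2502", "|").split("|")]
                if len(fields) >= 2 and fields[1] in DOWN_STATES:
                    karaf(f"bundle:start {fields[0]}")      # rejuvenate the affected bundle
        except (subprocess.SubprocessError, OSError):
            # Remote console unreachable: treat as a process-level failure and restart Karaf
            subprocess.run(["./bin/start"])                 # hypothetical restart hook

    if __name__ == "__main__":
        while True:
            check_once()
            time.sleep(1.0)                                 # polling period (tunable)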
The watchdog process could also be executed externally
to the machine running the SDN controller. Hence, while
not considered in our evaluation, the same mechanism can
2Apache Karaf - an OSGi distribution offered by the Apache Software
Foundation based on Apache Felix - https://karaf.apache.org
be applied to schedule physical or VM reboots in case of a
hardware or hypervisor failure. On the other hand, hardware
or hypervisor issues may be a sign of misconfigurations or
recurring defects whose source should be diagnosed manually.
To collect the accurate real-world repair rates for controller
bundles and processes, we have used our watchdog agent
implementation to evaluate the bundle and process reboot
times in a clustered OpenDaylight (ODL) setup. We had
experimentally injected bundle- and process-critical failures
in sequence and then measured the subsequent recovery time
required to re-stabilize the system. The distinguished mean
bundle and software process repair times, measured during
the controlled rejuvenation of the critical RAFT component
sal-distributed-datastore and the ODL’s controller process,
ticked at 182.9ms and 26.9srespectively, far below the 3
minute recovery intervals previously proposed in literature
[28], [30]. The measured recovery time purposely does not
include the time needed to re-include the recovered node in
the RAFT cluster, since this is modeled as a separate non-
deterministic process in our SANs. The bundle and process
reboots took place inside a dedicated ODL VM that was part
of a bigger ODL controller cluster, virtualized on a modern
Intel Xeon-based server, with each of the ODL VMs assigned
4 vCores and 8 GB of DDR4 memory. ODL was loading the
OSGi bundles available in the OpenFlowPlugin and Controller
projects and had the Clustering component enabled.
TABLE II
SAN MODEL PARAMETERS USED IN OUR SOLUTIONS.
Parameter       Intensity   Unit        Meaning
TA              1           [ms]        Application handling time
TC              1           [ms]        Data-store commit delay
1/λf            225         [ms]        Mean follower timeout
1/λca           225         [ms]        Mean candidate timeout
TCL             50          [ms]        Client timeout
TR (TMworst)    10          [ms]        Worst-case replica-leader delay
TMbest          5           [ms]        Best-case majority-leader delay
TCR             1           [ms]        Delay client-to-replica
NF              1..C        N/A         Controller failure count
1/λFH           6           [months]    Hardware failure rate
1/λFS           1           [week]      Software failure rate
1/λFSi          30/#F       [ms]        Software failure rate (injected)
1/λRH           12          [h]         Hardware repair rate
1/λd            ∞           [h]         Critical data plane failure rate
1/λRS           3           [minutes]   Bundle and process repair rate
1/λRSbw         182.9       [ms]        (Watchdog) Bundle repair rate
1/λRSpw         26.9        [s]         (Watchdog) Process repair rate
Es              20          N/A         Erlang approximation stages
RM              10          N/A         Max. inconsistent RAFT terms
C. On parameter selection
To evaluate the expected response time metrics of various
cluster configurations, we vary the SDN cluster size and
hence the number of controller replicas that take part in the
RAFT algorithm as per Table II. The generalized long-term
software and hardware failure rates, as well as the hardware
repair rates are taken from Liu et al. [29]. As discussed in
Subsection III-B, to allow for granular worst-case response
time analysis, we model single and correlated failure injections
with varying number of failures, where following a failure, a
replica is temporarily excluded from the cluster until recov-
ered. To depict the benefits of failure source differentiation
and the proposed watchdog mechanism, we distinguish mixed
and software-only failures, and vary the number of failure
injections between 1 and ⌊C/2⌋ + 1 (majority nodes down)
controller failures. The fact that the process of uniformization
may only be applied to exponentially distributed transition
rates makes our estimations slightly pessimistic. Thus at the
cost of the generated Continuous Time Markov Chain (CTMC)
state size and required solving time, we approximate every
deterministic message delay and timeout and minimize the
total distribution variance using a 20-stage Erlang distribution.
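The effect of the Erlang approximation can be illustrated with a few lines of sampling code; the choice of d = 10 ms is an example value (TR in Table II), not a new parameter:

    # Sketch: approximating a deterministic delay d by an Erlang-k distribution (k stages).
    import random, statistics

    def erlang_sample(d: float, k: int = 20) -> float:
        """Sum of k exponential stages with mean d/k each; overall mean d, variance d^2/k."""
        return sum(random.expovariate(k / d) for _ in range(k))

    samples = [erlang_sample(10.0, k=20) for _ in range(20_000)]   # e.g. d = TR = 10 ms
    print(round(statistics.mean(samples), 2), round(statistics.variance(samples), 2))
    # The mean stays close to 10 ms while the variance shrinks toward d^2/k = 5,
    # i.e. the approximated delay becomes increasingly deterministic as k grows.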
V. RESULTS OF OUR ANALYTICAL EVALUATION
In this section, we present the results of the analytical eval-
uations for various SDN cluster sizes and arbitrary numbers of
injected failures. We emphasize the benefits of a fast recovery
mechanism for the experienced worst-case response time of
an SDN cluster prone to software failures, and visualize its
advantages for the long-term system availability. Finally, we
discuss the complexity properties of our approach.
A. Response Time Analysis
When a single random-role controller from the SDN cluster
fails as a result of a hardware, process or bundle failure (each
being equally probable), deploying a larger number of SDN
controller replicas ensures an overall lower expected response
time (see Fig. 7). This is related to the probability of a leader
being injected with a failure, hence necessitating a leader re-
election to move forward the state. The probability of a leader
failure becomes increasingly lower when larger clusters are
deployed (as explained in Subsection III-B).
Fig. 7. Varying probability of an event being successfully handled in a given
time period t for different SDN controller cluster sizes C. The probability of
the RAFT leader failing is inversely proportional to the cluster size.
Next, we evaluate the probability of meeting an event
handling deadline when the majority of nodes in the cluster
have failed. The expected response times where mixed hard-
ware and software failures, as well as exclusively bundle-level
failures may occur, are depicted with and without the watchdog
(WD) mechanism enabled in Fig. 8a and 8b, respectively.
The watchdog mechanism enables faster recovery of replicas
and hence faster repeated processing of an event in the case
of leader and follower majority failures. An SDN cluster
equipped with the watchdog mechanism on average processes
the events faster and with a higher probability than the one
without. Especially when simultaneous hardware failures are
improbable and software failures are typical, the fast software
recovery provides obvious response-time benefits (Fig. 8b).
(a) Resulting response time assuming an occurrence of NF combined correlated hardware and software (process, bundle) failures. All three types of failures are injected with equal probability.
(b) Resulting response time assuming NF bundle-only failures. The watchdog mechanism will guarantee a timely repair and inclusion of the recovered RAFT node in the cluster.
Fig. 8. Probability of receiving an event response during an observation window, assuming a simultaneous occurrence of (a) NF mixed and (b) NF software-bundle only controller failures in a cluster comprised of C controllers. The failures are injected at rate NF · 0.0333 (all correlated NF failures are thus expected to be injected by time point t = 30 ms).
Fig. 9 depicts the effect of the consecutive failures on
the experienced response time in a 7-node controller cluster.
If the majority nodes remain available after each individual
failure, the time to respond is governed by the case where
a cluster leader fails and a new leader election procedure
is automatically initiated. There is no noticeable difference
in the convergence time regardless of the (non-)usage of the
watchdog mechanism in this particular case. The lower the
maximum number of induced failures, the slightly shorter the
expected response time becomes. This may be related to the
fact that the follower timeouts are exponentially distributed,
hence a higher number of active nodes that time out after a
leader failure leads to an overall lower expected time to select
a candidate and repair the cluster.
B. Cluster Availability
Next, we emphasize the long-term advantage of an SDN
controller bundle/process watchdog mechanism by evaluat-
ing the availability of a 3-node cluster configuration over
an observation period of 1000 hours. Fig. 10 depicts the
Fig. 9. Resulting response time assuming an occurrence of 1 ≤ NF ≤
(⌊C/2⌋ + 1) controller failures in a 7-node controller cluster. The response
time is governed by the duration of the leader election procedure. When the
majority of controllers are unavailable, the usage of the watchdog mechanism
(dashed) leads to important benefits w.r.t. expected worst-case response time.
unavailability of a 3-node controller cluster setup. We define
the unavailability measure as the probability of encountering
an unavailable cluster of controllers at any time instant t as
PCU(t) = 1 − PCA(t). Here, PCA(t) represents the proba-
bility of encountering a system in a state where the RAFT
leader and the majority of RAFT followers are available and
have converged their leader-election processes. Software and
hardware failures are modeled using the long-term exponen-
tial hardware and software failure rates presented in Table
II. The approximated unavailability measure saturates after
∼85 hours, which corresponds to the expected mean time to failure for the
combined software failures at bundle and process level, given
the individual exponentially distributed failures with a mean of
∼1 week (∼170 hours) for individual arrivals. We consider the
process and bundle failure arrivals as two independent Poisson
processes with variably configured rates. Hence, merging the
two independent processes with equal arrival rates results in
an approximately halved inter-arrival time between software
failures. The usage of a watchdog that proactively rejuvenates
a system after a software failure leads to a shorter overall
experienced downtime, and hence a lower expected RAFT
cluster unavailability in the long-term. Configurations with
five or more replicas guarantee a negligible unavailability of
< 1e−9 and are hence not included in the figure.
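The halving of the software-failure inter-arrival time used above follows from the superposition property of Poisson processes. A short simulation sketch (illustrative only; the per-process mean of 168 hours, i.e., one week, is a rounded assumption) confirms that merging the process- and bundle-level failure streams roughly halves the mean gap between arrivals:

```python
import numpy as np

rng = np.random.default_rng(2)
mean_single_h = 168.0   # assumed per-process mean inter-arrival time (~1 week), in hours
n_arrivals = 200_000

# Arrival times of two independent Poisson processes
# (process-level and bundle-level software failures).
process_arrivals = np.cumsum(rng.exponential(mean_single_h, n_arrivals))
bundle_arrivals = np.cumsum(rng.exponential(mean_single_h, n_arrivals))

# Superposition: merge both arrival streams and inspect the gaps.
merged = np.sort(np.concatenate([process_arrivals, bundle_arrivals]))
print("mean inter-arrival of merged stream [h]:", np.diff(merged).mean())  # ~84 h
```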
C. Model Complexity and Solve Time
Compared to the manual Markov Chain modeling, SANs
allow for more compact modeling of complex scenarios.
Analytically, both options need to solve the same CTMC
and have to deal with an exponential increase in model
size which may result in inefficient or intractable analytical
solutions when complex models are concerned [31], [32]. The
model complexity dictates both the amount of computational
resources and the time required to solve the model.
Fig. 11 depicts the state space sizes of the generated
CTMCs. The generated state space is used by the transient
solver to find the transient solutions for short-term (NFlower
than C) and long term (NFconsiders up to Cfailures)
numerical analysis. The model complexity increases with the
number of possible combinations the system may occupy. For
[Fig. 10 plots the cluster unavailability (scale 1e−5) against time t [hours] on a logarithmic axis, comparing a fixed repair rate 1/λS = 180 s with the watchdog-enabled configuration.]
Fig. 10. Transient analysis of the SDN controller cluster unavailability over
a period of 1000 hours. The cluster size of exactly three controllers was
considered in the transient analysis. As expected, the inclusion of a liveness
guard mechanism results in a lower overall expected unavailability. SDN
controller clusters that include five or a higher number of replicas per cluster
have been shown to possess negligible availability concerns. This confirms the claims
made in [28], where authors discuss the minimal effect of long-term failure
rates on the experienced downtime of an SDN control plane.
short-term response time analysis we limit the complexity of
the model by considering only the injected correlated failures -
this is realistic as only a very short time period (1s < x < 2s)
is considered (see Figures 7 and 8). For long-term analysis,
additional system states, where more than just the majority
of nodes may fail could be of interest (consider Fig. 10).
Fig. 11 shows the CTMC state space sizes for the cluster
configurations up to C= 19. We observe that, for some
parametrizations, the compiled state space size grows expo-
nentially with the number of controller replicas. The number
of possible failure injections dictates the number of generated
unique combinations. For the most accurate setting of
ES = 20 (20 Erlang stages, see Subsection IV-C) and cluster
sizes of 17 and more replicas, we have encountered memory-
handling limitations in the flat state space generator in Möbius.
Namely, if the solution should cover for all theoretically
possible system combinations, i.e. when failure of every single
node should be considered, the solution space eventually grows
to an intractable amount of states for very large cluster sizes.
To cater for the scalability of our solution when analyzing
large control planes, we propose three options:
1) State space largeness avoidance by applying a scenario-
based approach to the worst-case modeling. For exam-
ple, one could consider a limited number of maximum
failure injections. By limiting the number of maximum
failure injections to NF = ⌊C/2⌋ + 1, large-scale clusters
can be analyzed successfully (see Fig. 11 and the rough
combinatorial sketch after this list).
2) State space largeness avoidance by trading solution
accuracy, e.g. by manipulation of the Erlang stages used
for the approximation of the deterministic transitions.
3) Faster convergence of the transient solver by raising the
error tolerance of the uniformization (see Equation 1).
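As a rough illustration of option 1, the sketch below counts only the distinct "which replicas have failed" subsets. The real CTMC additionally tracks RAFT roles, timers and Erlang stages, so the absolute numbers differ from Fig. 11, but the relative effect of bounding max(NF) on the combinatorial growth is the same:

```python
from math import comb

def failure_combinations(cluster_size, max_failures):
    """Count the distinct 'which replicas have failed' subsets when at most
    `max_failures` replicas may fail. The actual CTMC also tracks RAFT roles,
    timers and Erlang stages, so it is far larger, but the binomial sum
    captures how bounding max(N_F) tames the combinatorial growth."""
    return sum(comb(cluster_size, k) for k in range(max_failures + 1))

for c in (5, 9, 13, 17, 19):
    majority = c // 2 + 1
    print(f"C={c:2d}: N_F<=1 -> {failure_combinations(c, 1):>3}, "
          f"N_F<=floor(C/2)+1 -> {failure_combinations(c, majority):>6}, "
          f"N_F<=C -> {failure_combinations(c, c):>6}")
```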
For completeness, we also evaluate the second option by
varying the Erlang stage parametrization. We take note of the
effect on the overall result accuracy for the transient analysis of
a 7-node cluster. Fig. 12 depicts the inaccuracy of the latency
bound introduced by lowering the number of Erlang stages
[Fig. 11 is a bar chart of the resulting CTMC state space size [states] (10^3 to 10^7, log scale) versus the controller cluster size C ∈ {5, 9, 13, 17, 19}, with bars for max(NF) ∈ {1, ⌊C/2⌋+1, C} and ES ∈ {1, 5, 10, 15}.]
Fig. 11. Size of the CTMC state space generated using the SAN models
and parameters discussed in Sections III and IV, respectively. The lower the
number of controller failures of interest (i.e., where NF < C), the smaller the
resulting CTMC state space size. If the possibility of an eventual occurrence
of failures in all nodes is assumed, the state space grows correspondingly,
reaching up to 10^7 possible state space combinations with the controller cluster
size set to C = 13 and the maximum accuracy ES = 15. Striped bars
represent the unsuccessful CTMC compilations where the flat state space
generator fails to compile the state space. However, by considering a lower
number of Erlang approximation stages ES, C = 19 and more controller
replicas can be handled with a limited inaccuracy (see Fig. 12). Similarly, a
focused assumption on the maximum number of possible failure occurrences
helps the scalability of the solution (where max(NF) < C).
from ES_high = 20 to ES_low ∈ {5, 10, 15}. At ES = 5,
the generated state space size is decreased by an order of magnitude
(see Fig. 11) and is hence, in addition to the first option, an
effective method of deploying our models in a scalable manner.
From this study, we conclude that the state space generation
process is scalable as long as the accuracy and failure injection
parameters are selected carefully for the use case at hand.
[Fig. 12 plots the CDF P(D < t) of the event response time t [ms] for C = 7, NF ∈ {2, 3, 4} and ES ∈ {5, 10, 15, 20}.]
Fig. 12. Inaccuracies stemming from a decreased number of Erlang stages
ES used in the approximation of deterministic transitions are negligible.
An inaccurate approximation of a deterministic distribution leads to a higher
variance of the random variable describing the failure arrivals. Hence, for
small ES the solver estimates a more relaxed (thus more pessimistic) latency
bound.
We next consider the performance overhead of the state
space generation in our approach. Fig. 13 shows how the
scenario where max(NF) = C with C = 19 and ES = 5
results in a tolerable solving period on the order of 10^3 seconds.
Fig. 14 depicts the computation time to solve the presented
SANs. The duration of the solution computation of SAN
will vary depending on the model complexity (state space
size), the definition of the observed performance variable
(reward function and the number and granularity of time
Fig. 13. Overhead of the CTMC compilation for varying cluster sizes C,
failure injection counts max(NF) and Erlang parameterizations ES. While
very accurate and large-scale combinations may lead to intractable solutions,
feasible solutions can be presented even for the complex deployments of
C = 19 controllers with various degrees of accuracy and all max(NF)
combinations.
measurements), as well as the required accuracy and model
stiffness (the range of the expected action completion times)
[33]. In the Möbius modeling tool, the accuracy of the transient
solver indicates the degree of accuracy that the user wishes in
terms of the number of decimal places. The solver execution
times depicted in Fig. 14 were observed for the accuracy
parameter set to 9 and an observation window of 1 second
(1000 data points). The largest generated state space for the
purpose of modeling the largest cluster size necessarily leads
to the longest solution computation times. For the analysis
scenarios described here, these computation times are feasible.
Fig. 14. Computation time of the instant-of-time [22] transient solutions
for the state space sizes depicted in Fig. 11. The solution covers the target
observation interval of 1 second at millisecond resolution; hence, the transient
solver has computed the solutions for 1000 time points. The computations
were executed on commodity hardware equipped with a modern AMD
processor and 32 GB of DDR4 memory. The required computation overhead
for the numerical solution is feasible for a short-term response time study.
VI. RELATED WORK
In state of the art literature, availability and overhead mod-
eling of SDN has recently started to gain traction. In [28], the
authors investigate the impact of operational and management
failures on the availability in SDNs. They focus on the long-
term availability impact of adding additional controllers, but
do not include any response time analysis nor consider the
impact of controller synchronization at micro-scale.
Tuncer et al. [34] enhance a controller placement heuristic
to cater for the minimization of the controller-to-network-device
cluster imbalance. Given an arbitrary network topology, their
objective is to compute the number of controllers and the
fitting placement, as well as to declare the controller-device
assignments when considering a distance (e.g. delay) con-
straint. While the controller-switch assignment was specifi-
cally targeted in their work, the same solution could be applied
to planning an efficient controller cluster configuration. The
problem we solve is complementary to this, since we allow
for analyzing any given SDN cluster with regards to its worst-
case control plane performance at runtime.
Muqaddas et al. [5], [35] investigate the load overhead
of the intra-cluster communication in a 2- and 3-controller
ONOS cluster. They propose a model to quantify the traffic
exchanged among the controllers and express it as a function
of the network topology. They do not, however, consider the effect of
transient failures on the response time and availability.
In [36], Zhang et al. describe the single-data ownership
organizational model implemented by the RAFT algorithm
and propose an estimation formula for approximating the
flow setup time in a distributed SDN controller cluster. Their
estimation is however fairly simplistic as it models only the
average case. The worst-case estimations were not considered.
Ongaro [24] and Howard et al. [14] provide initial perfor-
mance evaluations of the RAFT consensus algorithm. Howard
et al. [14] further implement an event-driven framework for
prototyping of RAFT using experimental topologies. Contrary
to the analytical approach presented in our work, their per-
formance evaluation of RAFT is based on a limited number
of repeated experiments and focuses on evaluating the RAFT
leader re-election procedure following a failure. Unfortunately,
these works do not provide a good understanding of how the
overall system response time is affected after a failure.
In two experiment-based studies, Suh et al. [15], [16]
measure the throughput and the recovery time of a RAFT-
enabled SDN controller cluster with 1, 3 and 5 replicas. They
put special focus on the effect of the φ accrual failure detector
[37] on the resulting performance footprint. The authors
deduce that the controller failover time increases as the value of φ
increases. With a higher φ, the OpenDaylight cluster becomes
more conservative in declaring a controller failure; hence,
in the case of failures, using a large φ value will generally lead
to slow failure discovery. The authors varied φ and measured a
lowest recovery time of 2.6 s, which is an unsatisfactory
recovery time for many critical industrial applications. Instead
of using an adaptive scaling factor φ, we rely on a fixed
follower timeout variable with a mean of 225 ms. We
assume that the controller-to-controller delays are bounded and
will hence not exceed this value except in the case where a
controller failure has occurred. This value is recommended
by the authors of the RAFT consensus algorithm, and was
determined to be a good trade-off between the recovery time
and the signaling overhead in their experiments [11].
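For context, the sketch below illustrates such a fixed, randomized follower timeout. It assumes the commonly recommended RAFT election-timeout range of 150-300 ms (mean 225 ms); the names and the range are illustrative and not taken from the OpenDaylight or ONOS implementations:

```python
import random

# Assumed election-timeout range following the common RAFT recommendation
# of 150-300 ms (mean 225 ms); names and values are illustrative and not
# taken from the OpenDaylight/ONOS implementations.
ELECTION_TIMEOUT_RANGE_MS = (150.0, 300.0)

def next_election_timeout(rng=random):
    """Draw a fresh randomized follower timeout."""
    return rng.uniform(*ELECTION_TIMEOUT_RANGE_MS)

def should_start_election(ms_since_last_heartbeat, timeout_ms):
    """A follower that misses leader heartbeats for longer than its timeout
    becomes a candidate and triggers a new leader election."""
    return ms_since_last_heartbeat > timeout_ms

timeout = next_election_timeout()
print(f"timeout = {timeout:.0f} ms, "
      f"start election after 400 ms of silence: {should_start_election(400, timeout)}")
```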
The introduced watchdog mechanism for fast software sys-
tem recovery relates to the concept of software rejuvenation.
Several works have investigated the phenomenon of ”software
aging” wherein the health of a software system degrades with
time [38], [39]. These papers conclude that a mechanism
which “rejuvenates” or “recovers” the software component to
its stable state would provide long-term benefits in terms of
experienced system availability. In this work, we evaluate the
benefits of the reactive controller recovery where, following
a detected controller bundle or process failure, the affected
component is reinitialized in order to minimize the downtime.
Machida et al. [40] analyze the completion time of a job
running on a server that is affected by software aging, and
consider the benefit of the preemptive-resume operation, where
a job resumes execution from the point of interruption as
soon as the failed server recovers. Similar to this work, we
investigate the job completion time for a client request, but
we consider a distributed multi-server operation. We focus on
the strategy where, assuming a failure occurs, the request is
handled from the beginning instead of delegating it to the next
server.
Apart from RAFT, Paxos [9] is another influential [8], [41]
consensus algorithm that eventually motivated the develop-
ment of RAFT. Paxos ensures that any two distributed servers
that are part of the same cluster may never disagree about
the value of a particular update, for any applied update
in the update history. In its optimizations, its performance
is comparable to RAFT, in that, assuming a stable cluster
leadership, committing a cluster-wide update takes a single
round trip in most cases [10]. Multi-Paxos [9], [13] is a
prominent variation of Paxos, that assumes a stable leader for
an infinite number of sequential cluster updates. This allows
for one-round-trip delay as the first phase of Paxos becomes
unnecessary for the majority of updates. In [42] the authors
evaluate an implementation of Multi-Paxos and conclude that
the overall performance of Multi-Paxos is limited by the
slowest node in the fastest cluster majority. This is a valid
observation for any quorum-based consensus algorithm, hence
we distinguish the leader failures as critical for our analysis.
To minimize the effect of single-leader failure and maximize
the load balancing of requests, Mencius [43] proposes a
round-robin-based update-handling by multiple leaders in a
Paxos cluster. While it enables higher throughput in the stable
case, the cluster will always run at the speed of the slowest
elected leader, as the new updates may be dependent on
previous updates that are assigned to be handled by the slow
node. EPaxos [10] is a recent leader-less take on Paxos that
tries to circumvent these issues. It keeps track of the ordering
and mutual dependencies between the client-initiated updates.
Hence, it is able to parallelize multiple update instances when
no collisions between concurrent client updates are expected.
Like RAFT, it requires a single round-trip in most cases to
commit a state update, and two round-trips if dependency
conflicts arise. Contrary to RAFT and other leader-based Paxos
variations, the response time in an EPaxos cluster does not
suffer from unstable leaders, since the clients may always
fall back to any remaining live replica. However, the
algorithm adds additional complexity in state-keeping and log
compaction tasks because of the added dependency trees.
Since all available SDN cluster implementations focus on
a single master for any switch in its administrative domain
at runtime, we put focus on the evaluation of a single-
leader RAFT-based cluster and consider its direct comparison
with multi-leader EPaxos [10] and eventually consistent ap-
proaches [4], [44] as future work.
VII. CONCLUSION AND OUTLOOK
SDN enables the necessary control plane robustness by
controller clustering and state replication. However, this repli-
cation incurs additional performance overhead. Indeed, it is
not always clear which particular cluster configuration would
best suit the application and network configuration at hand.
Existing performance studies of the distributed SDN control plane
neglect the cluster’s response time and availability metrics.
Hence, we hereby propose the usage of Stochastic Activity
Networks (SANs) for modeling and numerical evaluation
of distributed SDN clusters. We put special focus on the
practically relevant distributed consensus algorithm RAFT, but
generalize our model to be applicable to similar Paxos variants
(e.g., Multi-Paxos). RAFT is implemented in two dominant
open-source SDN platforms and is of practical relevance for
performance analysis of the distributed SDN control plane.
We introduce and discuss the SAN-based models for response
time and availability evaluation, and include a failure injection
model for evaluation of the two metrics under the effect of
an arbitrary number of correlated controller failures. Using
transient solver methods, we are able to provide a probabilistic
guarantee on the event handling response times and numeri-
cally evaluate the availability property of arbitrary SDN cluster
configurations. We have shown that, assuming a balanced
distribution of controllers in the network w.r.t. the controller-
to-controller delays, larger clusters provide lower worst-case
response times and higher system availability. With the help
of analytical modeling, the evaluation and optimization of
cluster configurations, in order to determine the best suited
configuration for the network at hand, becomes possible
without costly hardware setups for experimental evaluation or
lengthy simulation runs. Analytical modeling further provides
for corner case inclusion and tighter stochastic guarantees than
possible using experimental sampling.
Finally, we have proposed the watchdog mechanism for
fast recovery from software failures in a distributed SDN
controller setting. Using transient solvers, we have proven its
benefits on the short-term response time and the long-term
availability properties of a controller cluster. The solutions to
our models are computationally feasible for both the typical
(3-5 controllers) and very complex clusters (20 controllers).
Extensions to support the novel leaderless Paxos variants,
such as EPaxos, require significant changes in the models
used and are thus considered as future work. Furthermore, we
provide the response time metrics for a model that assumes
an accumulated distribution of RAFT state updates in the
latency- and throughput-optimized, batched mode. Extending
the proposed model to support a sequential distribution of
updates at high scale is non-trivial using SAN-based mod-
eling, because of the added state size complexity. Supporting
the queueing behavior when handling client-generated events
requires inclusion of additional concepts from classic queueing
theory or network calculus for practical value.
ACKNOWLEDGMENT
This work has received funding from the EU’s Horizon 2020
research and innovation programme under grant agreement
number 671648 VirtuWind and in part under grant agreement
number 647158 FlexNets (by the European Research Council).
REFERENCES
[1] T. Mahmoodi et al., “VirtuWind: virtual and programmable industrial
network prototype deployed in operational wind park,” Transactions on
Emerging Telecommunications Technologies, vol. 27, no. 9, 2016.
[2] Y. Huang et al., “Real-time detection of false data injection in smart grid
networks: an adaptive CUSUM method and analysis,” IEEE Systems
Journal, vol. 10, no. 2, 2016.
[3] S. C. Sommer et al., “Race: A centralized platform computer based
architecture for automotive applications,” in Electric Vehicle Conference
(IEVC), 2013 IEEE International. IEEE, 2013.
[4] E. Sakic et al., “Towards Adaptive State Consistency in Distributed SDN
Control Plane,” in IEEE ICC 2017 SAC Symposium SDN & NFV Track
(ICC’17 SAC-11 SDN&NFV), 2017.
[5] A. S. Muqaddas et al., “Inter-controller traffic in ONOS clusters for
SDN networks,” in Communications (ICC), 2016 IEEE International
Conference on. IEEE, 2016.
[6] J. Medved et al., “Opendaylight: Towards a model-driven SDN controller
architecture,” in A World of Wireless, Mobile and Multimedia Networks
(WoWMoM), 2014 IEEE 15th International Symposium on, 2014.
[7] P. Berde et al., “ONOS: towards an open, distributed SDN OS,” in
Proceedings of the third workshop on Hot topics in software defined
networking. ACM, 2014.
[8] M. Burrows, “The Chubby lock service for loosely-coupled distributed
systems,” in Proceedings of the 7th symposium on Operating systems
design and implementation. USENIX Association, 2006.
[9] L. Lamport, “The part-time parliament,” ACM Transactions on Computer
Systems (TOCS), vol. 16, no. 2, 1998.
[10] I. Moraru et al., “There is more consensus in egalitarian parliaments,”
in Proceedings of the Twenty-Fourth ACM Symposium on Operating
Systems Principles. ACM, 2013.
[11] D. Ongaro et al., “In Search of an Understandable Consensus Algo-
rithm,” in USENIX Annual Technical Conference, 2014.
[12] S. Gilbert et al., “Brewer’s conjecture and the feasibility of consistent,
available, partition-tolerant web services,” ACM SIGACT News, vol. 33,
no. 2, 2002.
[13] L. Lamport et al., “Paxos made simple,” ACM SIGACT News, vol. 32,
no. 4, 2001.
[14] H. Howard et al., “Raft Refloated: Do We Have Consensus?” SIGOPS
Oper. Syst. Rev., vol. 49, no. 1, 2015.
[15] D. Suh et al., “Toward Highly Available and Scalable Software Defined
Networks for Service Providers,” IEEE Communications Magazine,
vol. 55, no. 4, 2017.
[16] ——, “On performance of OpenDaylight clustering,” in NetSoft Confer-
ence and Workshops (NetSoft), 2016 IEEE. IEEE, 2016.
[17] G. Bolch et al., Queueing networks and Markov chains: Modeling and
performance evaluation with computer science applications. John
Wiley & Sons, 2006.
[18] A. Avritzer et al., “Survivability evaluation of gas, water and electricity
infrastructures,” Electronic Notes in Theoretical Computer Science, vol.
310, 2015.
[19] M. A. Ndiaye et al., “Performance assessment of industrial control
system during pre-sales uncertain context using automatic Colored
Petri Nets model generation,” in Control, Decision and Information
Technologies (CoDIT), 2016 International Conference on. IEEE, 2016.
[20] A. Gonzalez et al., “Service Availability in the NFV Virtualized Evolved
Packet Core,” in Global Communications Conference (GLOBECOM),
2015 IEEE. IEEE, 2015.
[21] W. H. Sanders et al., Stochastic activity networks: Formal definitions
and concepts. Springer, 2001.
[22] G. Clark et al., “The Mobius Modeling Tool,” in Petri Nets and
Performance Models, 2001. Proceedings. 9th International Workshop
on. IEEE, 2001.
[23] A. Reibman et al., “Numerical transient analysis of Markov models,”
Computers & Operations Research, vol. 15, no. 1, 1988.
[24] D. Ongaro, “Consensus: Bridging theory and practice,” Ph.D. disserta-
tion, Stanford University, 2014.
[25] K. Nagaraja et al., “Using Fault Injection and Modeling to Evaluate the
Performability of Cluster-Based Services.” in USENIX Symposium on
Internet Technologies and Systems, 2003.
[26] E. Molina et al., “Availability improvement of Layer 2 seamless net-
works using Openflow,” The Scientific World Journal, vol. 2015, 2015.
[27] ——, “Performance enhancement of high-availability seamless redun-
dancy networks using openflow,” IEEE Communications Letters, vol. 20,
no. 2, 2016.
[28] G. Nencioni et al., “Availability Modelling of Software-Defined Back-
bone Networks,” in Dependable Systems and Networks Workshop, 2016
46th Annual IEEE/IFIP International Conference on. IEEE, 2016.
[29] Y. Liu et al., “A proactive approach towards always-on availability in
broadband cable networks,” Computer Communications, vol. 28, 2005.
[30] S. Verbrugge et al., “General availability model for multilayer trans-
port networks,” in Design of Reliable Communication Networks,
2005.(DRCN 2005). Proceedings. 5th International Workshop on. IEEE,
2005.
[31] W. H. Sanders et al., “Reduced base model construction methods
for stochastic activity networks,” IEEE Journal on Selected Areas in
Communications, vol. 9, no. 1, 1991.
[32] P. E. Heegaard et al., “Survivability modeling with stochastic reward
nets,” in Winter Simulation Conference, 2009.
[33] M. Malhotra et al., “Stiffness-tolerant methods for transient analysis of
stiff Markov chains,” Microelectronics Reliability, vol. 34, no. 11, 1994.
[34] D. Tuncer et al., “On the placement of management and control
functionality in software defined networks,” in Network and Service
Management (CNSM), 2015 11th International Conference on. IEEE,
2015.
[35] A. S. Muqaddas et al., “Inter-controller Traffic to Support Consistency
in ONOS Clusters,” IEEE Transactions on Network and Service Man-
agement, 2017.
[36] T. Zhang et al., “The role of inter-controller traffic in SDN controllers
placement,” in Network Function Virtualization and Software Defined
Networks (NFV-SDN), IEEE Conference on. IEEE, 2016.
[37] N. Hayashibara et al., “The φ accrual failure detector,” in Reliable
Distributed Systems, 2004. Proceedings of the 23rd IEEE International
Symposium on. IEEE, 2004.
[38] W. Xie et al., “Software rejuvenation policies for cluster systems under
varying workload,” in Dependable Computing, 2004. Proceedings. 10th
IEEE Pacific Rim International Symposium on. IEEE, 2004.
[39] K. Vaidyanathan et al., “Analysis and implementation of software
rejuvenation in cluster systems,” in ACM SIGMETRICS Performance
Evaluation Review, vol. 29, no. 1. ACM, 2001.
[40] F. Machida et al., “Job completion time on a virtualized server with
software rejuvenation,” ACM Journal on Emerging Technologies in
Computing Systems (JETC), vol. 10, no. 1, 2014.
[41] W. J. Bolosky et al., “Paxos replicated state machines as the basis of
a high-performance data store,” in Symposium on Networked Systems
Design and Implementation (NSDI), 2011.
[42] H. Du et al., “Multi-Paxos: An Implementation and Evaluation,” Depart-
ment of Computer Science and Engineering, University of Washington,
Tech. Rep. UW-CSE-09-09-02, 2009.
[43] Y. Mao et al., “Mencius: building efficient replicated state machines for
WANs,” in OSDI, vol. 8, 2008.
[44] D. Levin et al., “Logically centralized?: state distribution trade-offs in
software defined networks,” in Proceedings of the first workshop on Hot
topics in software defined networks. ACM, 2012.
Ermin Sakic (S'17) received his B.Sc. and M.Sc.
degrees in electrical engineering and information
technology from Technical University of Munich
in 2012 and 2014, respectively. He is currently
employed at Siemens AG as a Research Scientist in
the Corporate Technology research unit. Since 2016,
he is pursuing the Ph.D. degree with the Department
of Electrical and Computer Engineering at TUM.
His research interests include reliable and scalable
Software Defined Networks, distributed systems and
efficient network and service management.
Wolfgang Kellerer (M'96 – SM'11) is a Full
Professor with the Technical University of Munich
(TUM), heading the Chair of Communication Net-
works at the Department of Electrical and Computer
Engineering. Before, he was for over ten years with
NTT DOCOMO’s European Research Laboratories.
He received his Dipl.-Ing. degree (Master) and his
Dr.-Ing. degree (Ph.D.) from TUM, in 1995 and
2002, respectively. His research has resulted in over 200
publications and 35 granted patents. He currently
serves as an associate editor for IEEE Transactions
on Network and Service Management and on the Editorial Board of the IEEE
Communications Surveys and Tutorials. He is a member of ACM and the
VDE ITG.