© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media,
including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to
servers or lists, or reuse of any copyrighted component of this work in other works.
BFT Protocols for Heterogeneous Resource
Allocations in Distributed SDN Control Plane
Ermin Sakic∗†, Wolfgang Kellerer
Technical University Munich, Germany, Siemens AG, Germany
E-Mail:{ermin.sakic, wolfgang.kellerer}@tum.de, ermin.sakic@siemens.com
Abstract—Distributed Software Defined Networking (SDN) controllers aim to solve the issue of single-point-of-failure and improve the scalability of the control plane. Byzantine and faulty controllers, however, may enforce incorrect configurations and thus endanger the control plane correctness. Multiple Byzantine Fault Tolerance (BFT) approaches relying on Replicated State Machine (RSM) execution have been proposed in the past to cater for this issue. The scalability of such solutions is, however, limited. Additionally, the interplay between progressing the state of the distributed controllers and the consistency of the external reconfigurations of the forwarding devices has not been thoroughly investigated. In this work, we propose an agreement-and-execution group-based approach to increase the overall throughput of a BFT-enabled distributed SDN control plane. We adapt a proven sequencing-based BFT protocol, and introduce two optimized BFT protocols that preserve the uniform agreement, causality and liveness properties. A state-hashing approach which ensures causally ordered switch reconfigurations is proposed, enabling an opportunistic RSM execution without relying on strict sequencing. The proposed designs are implemented and validated for two realistic topologies, a path computation application and a set of KPIs: switch reconfiguration (response) time, signaling overhead, and acceptance rates. We show a clear decrease in the system response time and communication overhead with the proposed models, compared to a state-of-the-art approach.
I. INTRODUCTION AND PROBLEM STATEMENT
Software Defined Networking (SDN) centralizes the
decision-making in a dedicated controller component. Con-
cepts for achieving crash-fault-tolerance and scalable operation
of the controller have been presented in the past [1], [2].
By means of a logical distribution of controller replicas and
the state synchronization, the controller instances are able to
synchronize the results of their individual computations and
come to consistent decisions independent of the instance that
handled the client request. However, these approaches are
based on weak crash-tolerant algorithms (e.g. RAFT [3] and
Paxos [4]) that are unable to cater for malicious and incorrect
(e.g., buggy [5]) controller decisions that have an individ-
ual controller instance fault as a root cause. Recent works
have thus highlighted the importance of deploying Byzantine
Fault Tolerance (BFT) protocols for achieving consensus, in
scenarios where a subset of controllers is faulty due to a
malicious adversary or internal bugs. Realizing a BFT SDN
control plane comes with an additional controller deployment
overhead, previously shown to range between 2F_M + F_A + 1
[6] and 3(F_M + F_A) + 1 [7] controller instances required to
tolerate up to F_M strictly Byzantine and F_A fail-crash failures.
To support stateful controller-based applications (i.e.,
resource-constrained routing, load-balancing, stateful fire-
walls), the controllers synchronize their internal state updates.
Traditional BFT designs [7], [8] require active participation
of all replicas in the system. Thus, they leverage an RSM
approach to handle the client requests, where a majority of
controller instances must come to the agreement about the
order of the client requests, before subsequently executing
them. Finally, the controllers reach consensus on the output of
the computation in order to ensure the causality of subsequent
decisions. We have identified two issues with this approach.
First, to preserve causality, the non-faulty replicas always
participate in all system operations. In the absence of faults,
more replicas execute the decision-making requests than re-
quired to make progress, thus strongly limiting the execution
throughput of the system. Namely, the application execution
is handled by each controller instance in the cluster. In het-
erogeneous environments, where particular controller replicas
can be assigned a higher resource set compared to the others,
this leads to an under-utilization of fast replicas, as the system
progresses at best at the speed of the ⌊(|C|+1)/2⌋+1-th fastest
replica (|C| is the number of deployed controllers) [2]. Second, these
BFT implementations rely on reaching a successful agreement
about the sequence number mapping for each arriving client
request, prior to its actual execution. The agreement phase thus
necessarily increases the total processing time of individual
requests. We claim that the serialization of requests is a means
to an end and that the causality of configurations on individual
external devices (i.e., switches) is a sufficient constraint.
II. OUR CONTRIBUTION
In this work we make a point that an optimal separation
of the controller cluster into sufficiently-sized agreement and
execution (A&E) groups leads to an overall higher utilization
in request processing. In our approach, faster replicas may
be leveraged in the intersection of different A&E groups,
while slower replicas may run at their assigned speed without
negatively influencing the faster replicas. To identify the A&E
groups, we extend an existing ILP formulation for controller-
switch assignment procedure [6]. The solver identifies an A&E
group for each deployed switch element, while maximizing the
overlap of the members of different groups. The formulation
considers the execution capacity of individual controllers, as
well as the switch-controller delays as its constraints. The
solver executes during runtime, thus optimizing the assignment
upon each discovered Byzantine/fail-crash failure.
To cater for the second issue, we adopt the classical Practical
BFT (PBFT) approach [8] to realize a distributed sequencer
in order to minimize the fail-over time in the case of a leader
failure. We additionally introduce a group-based variant of this
protocol, that leverages the partitioning of the total controller
set into multiple A&E groups. Finally, in addition to the two
agreement-based designs above, we present an opportunistic
protocol design. With the opportunistic approach, successful
handling of a client request implies reaching a consensus on a
consistent device reconfiguration while preserving the causal-
ity of decisions, subsequent to the actual request handling. We
achieve the causality and agreement by reaching consensus: i)
on the controller state at the time of application execution;
ii) on the actual computed output result (to guarantee the
consistency of decisions).
We have implemented these three BFT protocols and have
analyzed the overheads of switch reconfiguration time, the
communication overhead and the request acceptances rates.
We ran our evaluation for emulated Open vSwitch-based
Internet2 and Fat-Tree topologies, comprising up to 34 Open
vSwitch instances and up to 13 controllers, while considering
a varied number of tolerated Byzantine failures.
Paper structure: Sec. III introduces the overall system
model. Sec. IV details the proposed BFT protocols. Sec. IV-D
discusses the ILP formulation for the optimal controller-switch
assignment. Sec. V presents the evaluation methodology. Sec.
VI discusses the results. Sec. VII summarizes the related work.
Sec. VIII concludes this paper.
III. SYSTEM MODEL
In [6], we discussed the often neglected differentiation
between state-independent (SIA) and state-dependent (SDA)
SDN applications. SDAs require an up-to-date and synchronized
application state in order to serve the client requests.
In this work, we consider solely the global SDA operations
where successfully handled client requests result in stateful
write operations to the replicated data-store. The subsequent
client request executions that result in new writes to the same
state must consider the preceding writes for their correctness.
The value of the write operation is determined by an execution
of a multi-phase BFT protocol. We distinguish accepting and
rejecting protocol executions. Rejecting executions are caused
by replicas that interrupt the run because of a missing consen-
sus in one of the protocol phases (caused by e.g., conflicting
seq. no. proposals, faulty controllers and packet loss). We
assume that clients retransmit the requests until a successful
execution has been acknowledged by the controllers.
Our SDN architecture is comprised of: i) controllers
that individually execute an instance of a BFT process; ii)
the switches that implement a comparison mechanism for
matching controller configuration messages (as per [6]); iii)
the clients; iv) a REASSIGNER component that maintains
the switch-controller assignments (as per [6]). The request-
initiating clients comprise northbound clients (e.g., applica-
tions, administrators) and the switches capable of forwarding
the client requests as (OpenFlow) packet-in messages to the
SDN controllers (e.g., routing, load-balancing requests). The
target clients represent the configuration targets, e.g., switches
that are (re)configured as a result of request handling.
We assume a fair-loss link abstraction, where a message
(re-)transmitted infinitely often is eventually delivered at the
recipient. Packets may be arbitrarily dropped, lost, delayed,
duplicated and delivered out of order during any of the BFT
protocol phases. The SDN control plane is realized in either
in-band or out-of-band manner. Control messages exchanged
between the controller, switches and clients are assumed to be
signed, thus ensuring: i) the integrity of messages exchanged
using the SDN data plane; ii) the infeasibility of message forging.
State-updates distribution assumes an eventually syn-
chronous model as per [9], where different replicas possess
different views of the current configuration state for a limited
time duration. Eventually, given an appropriately long quies-
cent period, all correct replicas converge to the same state.
We assume that a bounded number of controllers may exhibit
Byzantine behavior and/or fail-crash failures, respectively.
IV. BFT CONSENSUS PROTOCOLS AND THE
CONTROLLER-SWITCH ASSIGNMENT METHODOLOGY
The proposed protocols guarantee the following properties:
Uniform Agreement: When a correct replica commits a
particular internal state/switch update (i.e., computes a
particular response), all correct replicas eventually com-
mit the same update.
Liveness: All correct replicas eventually finalize the
processing of each client request. The resulting run is
declared either accepting or rejecting.
Causality: The updates to the controller data-store and
the per-switch configuration updates are executed in a
causally dependent order. The controller’s decision to
reconfigure a switch takes into account all preceding
configurations of that switch.
We assume a deployment of a total of 2F_M + F_A + 1
controllers per agreement and execution (A&E) group in order
to tolerate an upper bound of individual F_M Byzantine and
F_A fail-crash controller failures in that particular A&E group.
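The group sizing and quorum arithmetic above can be captured in a short sketch; the function names are ours, and the quorum formula follows the correct-majority expression given later in Table III:

```python
import math

def group_size(f_m: int, f_a: int) -> int:
    # Controllers needed per A&E group to tolerate f_m Byzantine
    # and f_a fail-crash failures: 2*F_M + F_A + 1.
    return 2 * f_m + f_a + 1

def matching_quorum(n: int, f_m: int) -> int:
    # Matching messages required for a correct-majority decision
    # over n replicas: ceil((n + F_M + 1) / 2), cf. Table III.
    return math.ceil((n + f_m + 1) / 2)

# Example: one Byzantine failure, no fail-crash failures per group
assert group_size(1, 0) == 3
assert matching_quorum(group_size(1, 0), 1) == 3  # |M_agr| for |A| = 3
```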
In the remainder of this section, we introduce the three BFT
protocols: the agreement-based MPBFT and SBFT protocols,
and the opportunistic OBFT (ref. Table I).
TABLE I
OVERVIEW OF PRESENTED BFT PROTOCOLS.

Alg.    Name                            Type              No. Rounds
MPBFT   Modified PBFT                   Agreement-based   2
SBFT    Serialized A Priori BFT         Agreement-based   3
OBFT    Opportunistic A Posteriori BFT  Opportunistic     2
A. Pre-serialization model MPBFT (agreement-based)
Modified PBFT (MPBFT) imposes a single A&E group
where each active controller replica is tasked with execution of
an agreed command. The workflow of MPBFT is visualized
in Fig. 1. A request-initiating client initially invokes its ap-
plication request to all active controller replicas (REQUEST
phase). For each incoming client request, each controller
replica assigns a unique sequence number and distributes
this sequence number proposal to the other controllers in the
cluster (PREPARE phase). The replicas compare the sequence
number proposals. If the correct majority of proposals are
matching (i.e., the same sequence number is proposed by the
majority of correct replicas), successful global agreement has
been reached. At the beginning of the COMMIT phase, each correct
replica executes the client request. The execution output is
subsequently broadcasted by each replica to the remainder of
the cluster and the collected output responses are once again
compared in all replicas. Each controller deduces the correct
majority response and eventually commits the output to its
local data-store (i.e., a store of reservations) and finally reports
the agreed output to the target clients (REPLY phase). After
collecting F_M + 1 consistent output messages, the target clients
(e.g., switches) decide to apply the new configuration.
MPBFT is a variation of PBFT [8] that requires no leader
and is thus tolerant to individual node failures. Compared
to PBFT, we shorten the protocol execution by one round.
Whereas PBFT proposes a PRE-PREPARE round, MPBFT
skips this round by leveraging a client-initiated atomic multi-
cast execution and a distributed sequencer. Namely, each new
client request is multicasted to each replica of the system.
The replicas propose a new seq. number for the request by
incrementing the current counter as per Alg. 1. The seq.
numbers for new client requests are assigned based on the
current state of a local atomic counter. Following an arrival of
a new request, the replicas yield the lowest unallocated seq.
number value and propose this seq. number to the remaining
replicas. After collecting a sufficient amount of matching
PREPARE messages, all correct replicas decide to accept the
seq. number contained in the correct majority proposal as the
final seq. number for this request. Table III summarizes the
exact amounts of required matching messages to progress the
protocol execution.
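A minimal Python sketch of this distributed sequencer logic (our illustration of the propose_seq_no routine of Alg. 1; the class name and the zero-based counter start are assumptions):

```python
class LogicalSequencer:
    """Assigns the lowest unallocated sequence number to each new
    client request id; repeated calls for an already-mapped request
    are idempotent, as in Alg. 1's propose_seq_no."""

    def __init__(self):
        self.mappings = {}   # R_mappings: request id -> seq. number
        self.counter = 0     # S_atomic

    def propose_seq_no(self, request_id):
        if request_id in self.mappings:
            return self.mappings[request_id]
        allocated = set(self.mappings.values())
        # Yield the lowest unallocated sequence number value
        while self.counter in allocated:
            self.counter += 1
        self.mappings[request_id] = self.counter
        return self.counter

seq = LogicalSequencer()
assert seq.propose_seq_no("r1") == 0
assert seq.propose_seq_no("r2") == 1
assert seq.propose_seq_no("r1") == 0  # idempotent for known requests
```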
If no correct majority vote is achieved during the agreement
process on either the sequence number or the computed output,
the replicas respond with a rejection status. If sufficient rejec-
tion messages are collected, the current execution is cancelled
and the run is declared rejecting. Concurrent client requests
can lead to same sequence numbers being assigned to different
requests at different replicas, thus resulting in rejecting runs.
The execution capacity of MPBFT is limited by the slowest
replica in the system. Consider the scenario F_M = 1, F_A = 0
depicted in Fig. 1. Each controller C_i is able to service request
workload up to a capacity of P_i per observation interval.
The portrayed system is thus able to service computations
up to min_{i=1..N}(P_i), or 500 requests/interval
(imposed by the capacity of C_4 and C_5). Thus, Client 1 (with a
processing requirement of λ_1 = 500) and Client 2 (λ_2 = 400)
cannot be serviced concurrently. One can alternatively portray
the depicted rates as continuous execution workloads. While
active participation of C_4 and C_5 in the system is unnecessary
to tolerate a single Byzantine fault, they are included in
execution and signaling and are necessary to progress the
system state. MPBFT’s communication overhead is quadratic
(ref. Table IV). With alternative protocol designs SBFT and
OBFT, we next leverage the additional execution capacity by
partitioning the control plane into multiple A&E groups.
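The capacity argument can be reproduced in a few lines, using the Fig. 1 values; the function names are ours:

```python
def mpbft_capacity(capacities):
    # Under MPBFT every replica executes every request, so the
    # sustainable rate is bounded by the slowest replica.
    return min(capacities)

def admissible(workloads, capacities):
    # A set of client workloads fits iff their sum stays within
    # the bound imposed by the slowest replica.
    return sum(workloads) <= mpbft_capacity(capacities)

P = [500, 800, 900, 500, 500]        # P_1..P_5 from Fig. 1
assert mpbft_capacity(P) == 500
assert admissible([500], P)          # Client 1 alone fits
assert not admissible([500, 400], P) # Clients 1 and 2 cannot be served concurrently
```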
B. Pre-serialization model SBFT (agreement-based)
With Serialized A Priori BFT (SBFT), agreement and
execution processes are administered by multiple A&E groups.
We assign for each request-initiating client (i.e., a northbound
application, an edge switch) an A&E group according to the
algorithm presented in Sec. IV-D. To tolerate F_M Byzantine
and F_A fail-crash failures in the scope of a single A&E group,
each group must comprise 2F_M + F_A + 1 controllers. Multiple
execution groups can process the client requests concurrently.
SBFT design is depicted in Fig. 2. Compared to MPBFT,
SBFT introduces the PRE-PREPARE step, where the replicas
belonging to the A&E group propose and subsequently notify
the remainder of the replicas of an assigned sequence number.
In an accepting run, the group replicas collect the responses
in the PREPARE phase and reach consensus by collecting
⌈(|C|+F_M+1)/2⌉ matching sequence number proposals. Finally, the
replicas of the A&E group execute the request in the COMMIT
phase and broadcast the response to all remaining replicas. If
F_M + 1 matching outputs are received, the replicas apply the
internal state reconfiguration and notify the target clients of
the final result during REPLY. The communication overhead
Algorithm 1 Logical Sequencer: Ordering of client requests
Notation:
M_P         Client request (e.g., flow request)
M_C         Replica message (seq. no. proposal) initiated at a remote controller
C           Set of available SDN controllers
R_ID        Unique client request identifier
R_mappings  Mapping of client request ids to unique seq. numbers
S_atomic    Atomic sequencer that yields the current seq. number

1: upon event on-client-request <M_P, R_ID> do
2:   ...
3:   proposed_seq_no = propose_seq_no(R_ID)
4:   ...
5:
6: upon event on-new-replica-sync-update <M_C, R_ID> do
7:   ...
8:   switch PHASE do
9:     case MPBFT-PREPARE:
10:      propose_seq_no(R_ID)
11:    case SBFT-PRE-PREPARE:
12:      propose_seq_no(R_ID)
13:  ...
14:
15: function PROPOSE_SEQ_NO(R_ID)
16:   if R_ID ∈ R_mappings then
17:     return R_mappings[R_ID]
18:   else
19:     while S_atomic ∈ R_mappings.values() do
20:       S_atomic = S_atomic + 1
21:     R_mappings[R_ID] = S_atomic
22:     return R_mappings[R_ID]
[Fig. 1 illustration: Client 1 (λ_1 = 500) and Client 2 (λ_2 = 400) issue requests to controllers C_1..C_5 with capacities P_1 = 500, P_2 = 800, P_3 = 900, P_4 = 500, P_5 = 500; phases REQUEST, PREPARE, COMMIT, REPLY.]
Fig. 1. MPBFT Model: In REQUEST phase, the clients initiate new executions. During PREPARE, the controller replicas agree on the execution order by reaching consensus on the assigned sequence number for the clients’ requests. Each controller executes the request in the COMMIT phase. During REPLY, target clients are notified of reconfigurations. Client 2’s requests cannot be serviced as a result of a limited processing capacity of the controllers.
of SBFT is bounded by O(3|A||C|), and grows linearly for a fixed
A&E group size.
Causality: To ensure that the causality property holds in
MPBFT and SBFT, the controllers execute the sequenced
requests in the order agreed during PREPARE. The replicas execute
the COMMIT phase only if the outputs (i.e., the added reserva-
tions) for the preceding requests were seen by the executing
replica. Thus, before handling subsequent requests, the status
of preceding runs (accepting/rejecting) must be determined.
C. Post-negotiation model OBFT (opportunistic)
Opportunistic A Posteriori BFT (OBFT) is a speculative
take on SBFT, where computations of the client requests
[Fig. 2 illustration: the same clients (λ_1 = 500, λ_2 = 400) and controllers C_1..C_5 as in Fig. 1; phases REQUEST, PRE-PREPARE, PREPARE, COMMIT, REPLY.]
Fig. 2. SBFT Model: Compared to MPBFT, SBFT allows for more efficient
allocation of execution resources, since execution is separated into multiple
A&E groups. This comes with an overhead of a PRE-PREPARE step, required
to reach consensus on the sequence number allocated to the request.
execute prior to reaching consensus about the computed output
values. A global sequencer is not used in OBFT and thus
PRE-PREPARE and PREPARE phases are omitted. Instead,
each replica maintains the hashes of current switch configu-
rations, as well as a state array containing the hashes of the
configurations of the switches at the time of request executions
(TORC hashes). Following the output computation in the
COMMIT phase, the replicas come to consensus about the
updated switch state in the PRE-REPLY phase. This workflow
is depicted in Fig. 3.
In contrast to MPBFT and SBFT, in their COMMIT phase,
the replicas belonging to the same A&E group compute the
outputs, and in addition to the computed response outputs,
they broadcast the hash arrays denoting their view of the target
clients’ configurations. Each accepting replica that is not part
of the serving A&E group evaluates its actual current local
view of the switch states, and iff: i) F_M + 1 matching output
values have been computed by the A&E replicas; and ii) its
current view of switch configuration hashes matches that of
the A&E replicas; it answers with an accepting
status. The execution replicas (belonging to the A&E group),
instead compare the proposed hash array with their local
TORC hashes for the target client (i.e., target switches) and
notify other A&E replicas of their status. If sufficient (ref.
Table III) positive confirmations have been collected at the
end of PRE-REPLY phase, each active controller internally
commits the output proposed by the correct majority of
the A&E group. The A&E group members then notify the
configuration targets of the agreed output in REPLY phase.
OBFT’s comm. overhead is quadratic and grows with |C|.
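The per-switch hash check that non-A&E replicas perform (detailed later in Alg. 2) can be sketched as follows, assuming SHA-256 hashes as used in the paper's evaluation; the names and the flattened per-request HVC structure are our simplifications:

```python
import hashlib

def config_hash(switch_state: bytes) -> str:
    # Per-switch configuration hash; SHA-256 via hashlib as in Sec. V.
    return hashlib.sha256(switch_state).hexdigest()

def inline_with_replica_view(path, proposed, hv, hvc):
    # ACCEPT iff, for every switch on the proposed path, the proposer's
    # hash matches either our current view (HV) or our view at the time
    # of request computation (HVC, the TORC hashes) -- cf. Alg. 2.
    for sw in path:
        if proposed[sw] == hv.get(sw):
            continue
        if proposed[sw] == hvc.get(sw):
            continue
        return "REJECT"
    return "ACCEPT"

h_cur = config_hash(b"flow-table-v1")
hv = {"S1": h_cur}
hvc = {"S1": config_hash(b"flow-table-v0")}
assert inline_with_replica_view(["S1"], {"S1": h_cur}, hv, hvc) == "ACCEPT"
assert inline_with_replica_view(["S1"], {"S1": config_hash(b"x")}, hv, hvc) == "REJECT"
```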
D. Dynamic Controller-Switch (Re)Assignment Procedure
In our design, each request-initiating client (i.e., a north-
bound client or a switch) is assigned a unique controller agree-
ment and execution (A&E) group. Groups assigned to different
switches are allowed to partially or fully overlap. Only the
assigned controllers are required to contact the target clients
and apply reconfigurations. Similarly, only these controllers
are contacted by the request-initiating clients with new applica-
tion requests. Our ILP formulation of the assignment problem
aims to maximize the overlap between the members of the
active A&E groups, so as to minimize the synchronization delay
during the consensus executions. The proposed reassignment
[Fig. 3 illustration: the same clients and controllers as in Fig. 1; phases REQUEST, COMMIT, PRE-REPLY, REPLY.]
Fig. 3. OBFT model: An opportunistic protocol variation, where A&E group
members execute their clients’ requests prior to the distribution of reference
state configurations based on which the computations were executed. The
internal controller state and the target clients are updated only if the consensus
on the reference state configurations could be reached for the correct majority
of global controller instances (ref. Table III).
Algorithm 2 Hash comparison in the OBFT-COMMIT phase
Notation:
R_ID        Unique client request identifier
HV          Current configuration hashes for the switches
HVC[R_ID]   Switches’ config. hashes prior to request computation (TORC)
find-path()   An exemplary SDN application logic operation
consensus()   Returns consensus message according to the number of minimum required confirmations (ref. Table III)
m^C_RID     A COMMIT message for round R_ID
M^C_RID     Set of buffered COMMIT messages for R_ID

1: procedure HANDLE NEW CLIENT REQUEST
2:   upon event on-received-client-request(CL_RID) do
3:     R = find-path(CL_RID.routing_request)
4:     for all SW ∈ R do
5:       current-hash[SW] = hash(SW.state)
6:     m^C_RID.hash, m^C_RID.path = current-hash, R
7:     broadcast-to-cluster-members(m^C_RID)
8:
9: procedure HANDLE INCOMING COMMIT MESSAGE
10:   upon event new-replica-sync-message(m^C_RID) do
11:     P^RID_C = consensus(M^C_RID, <val, state-hash-array>)
12:     on-init-obft-pre-reply(P^RID_C, inline-with-replica-view(P^RID_C))
13:
14: function INLINE-WITH-REPLICA-VIEW(P^RID_C)
15:   for all SW ∈ P^RID_C.path do
16:     if HV[SW] == P^RID_C.hash[SW] then
17:       pass()
18:     else if HVC[R_ID][SW] == P^RID_C.hash[SW] then
19:       pass()
20:     else return (REJECT)
21:   return (ACCEPT)
mechanism, the objective function and the constraints extend
the formulation presented in [6]. For brevity, we do not discuss
each constraint in detail here, but refer the reader to the
summary in Table V and [6]. The procedure is executed once
at the system startup and dynamically during runtime, on each
detected controller failure.
For each switch S_i we can derive a bitstring RS_i comprised
of ones for replicas actively assigned to S_i and zeros for the
unassigned replicas. We then formalize the objective function:
TABLE II
NOTATION USED IN TABLES III, IV AND V.

Symbol       Meaning
C            Set of active controllers in the system
F_M          No. of tolerated Byzantine faults in a single A&E group
Req(t)       Time-variant no. of controllers [6] that must be assigned to each switch, to tolerate the Byzantine failures
S            Set of switches in the system
P_Ci         Total available controller C_i’s capacity
L_CLk, L_Sj  Request processing load stemming from the northbound client CL_k and edge switch S_j, respectively
D_C,S        Max. tolerable delay for controller-switch communication
A            Controller replicas belonging to a single A&E group
|M_agr|      Sum of the tolerated Byzantine failures and the majority of correct replicas per A&E group: ⌈(|A|+F_M+1)/2⌉
|M_glob|     Sum of the tolerated Byzantine replicas and the majority of all correct active replicas: ⌈(|C|+F_M+1)/2⌉
CMP          Comp. overhead of executing the packet comparison
E            Comp. overhead of executing SDN application operation
TABLE III
THE AMOUNT OF MATCHING MESSAGES REQUIRED TO REACH CONSENSUS IN THE RESPECTIVE PROTOCOL PHASE (WORST-CASE).

Algorithm  PRE-PREPARE  PREPARE   COMMIT    PRE-REPLY  REPLY
MPBFT      N/A          |M_glob|  F_M + 1   N/A        F_M + 1
SBFT       |M_agr|      |M_glob|  F_M + 1   N/A        F_M + 1
OBFT       N/A          N/A       |M_agr|   |M_glob|   F_M + 1
min Σ_{S_j ∈ S} Σ_{S_i ∈ S, S_i ≠ S_j} HD(RS_j, RS_i)    (1)

where HD(RS_j, RS_i) denotes the Hamming distance between
the assignment bitstrings for S_j and S_i. Combined with
the adapted minimum assignment constraint depicted in Table
V, we ensure the building of minimum-sized A&E groups that
fulfill the capacity and delay constraints of the clients.
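Objective (1) can be illustrated with plain bitstrings; this is a sketch with hypothetical assignments, and it sums over unordered switch pairs, whereas (1) ranges over ordered pairs:

```python
from itertools import combinations

def hamming(rs_a: str, rs_b: str) -> int:
    # Hamming distance HD between two assignment bitstrings.
    return sum(x != y for x, y in zip(rs_a, rs_b))

def overlap_objective(bitstrings):
    # Total pairwise Hamming distance across switch assignments;
    # smaller values mean larger overlap between A&E groups.
    return sum(hamming(a, b) for a, b in combinations(bitstrings, 2))

# Two switches served by the same 3-controller group vs. barely
# overlapping groups over 5 controllers:
assert overlap_objective(["11100", "11100"]) == 0  # full overlap
assert overlap_objective(["11100", "00111"]) == 4  # overlap only on C3
```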
V. EVALUATION
To evaluate the different BFT protocols, we realized a
centralized path computation application that executes in each
of the deployed controller replicas. Based on the sequence
and current state of link reservations, the routing algorithm
leverages Dijkstra’s algorithm to choose the optimal (cheapest)
path w.r.t. bandwidth resource consumption, and thus implic-
itly load-balances the embedded flows in the given topology.
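Such a cheapest-path computation can be sketched as follows (a standard Dijkstra variant over per-link costs; the topology, cost values and function names are hypothetical, with costs standing in for current bandwidth reservations):

```python
import heapq

def cheapest_path(graph, src, dst):
    # Dijkstra over link costs; with costs reflecting current
    # reservations, the cheapest path implicitly load-balances flows.
    # `graph[u]` maps neighbour -> link cost.
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    return [src] + path[::-1]

# Hypothetical 4-switch topology; S1->S2->S4 is cheaper than S1->S3->S4
g = {"S1": {"S2": 1, "S3": 4}, "S2": {"S4": 1}, "S3": {"S4": 1}, "S4": {}}
assert cheapest_path(g, "S1", "S4") == ["S1", "S2", "S4"]
```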
TABLE IV
COMPUTATIONAL AND COMMUNICATION OVERHEAD OF THE INTRODUCED BFT PROTOCOLS.

Alg.    Computational Overhead         Communication Overhead
MPBFT   O(2|C|·CMP + |C|·E)            O(2|C|·|C|)
SBFT    O(CMP·(2|C| + |A|) + |A|·E)    O(3|A|·|C|)
OBFT    O(2|C|·CMP + |A|·E)            O(|A|(|C|+1) + |C|(|C|−1))
TABLE V
CONSTRAINTS USED IN BUILDING THE A&E GROUPS.

Constraint          Formulation
Min. Assignment     Σ_{C_i∈C} A_{C_i,S_j} == Req(t), ∀S_j ∈ S
Unique Assignment   A_{C_i,S_j} ≤ 1, ∀C_i ∈ C, S_j ∈ S
Bounded Capacity    Σ_{S_j∈S} A_{C_i,S_j}·L_{S_j} ≤ P_{C_i} − Σ_{CL_k∈CL} L_{CL_k}, ∀C_i ∈ C
Delay Bounds        A_{C_i,S_j}·d_{C_i,S_j} ≤ D_{C,S}, ∀C_i ∈ C, S_j ∈ S
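These constraints can be validated for a candidate assignment with a simple checker. This is only a sketch: the paper solves the actual ILP with Gurobi, and our handling of the northbound client load as a per-controller aggregate is a simplification:

```python
def feasible(assign, switch_loads, client_load, caps, delays, req, d_max):
    # Validation sketch for the Table V constraints over a binary
    # assignment matrix `assign[(ci, sj)]` (0/1 values make the
    # Unique Assignment constraint implicit).
    controllers, switches = sorted(caps), sorted(switch_loads)
    # Min. Assignment: exactly Req(t) controllers per switch
    for sj in switches:
        if sum(assign.get((ci, sj), 0) for ci in controllers) != req:
            return False
    # Bounded Capacity: assigned switch load fits the remaining capacity
    for ci in controllers:
        used = sum(assign.get((ci, sj), 0) * switch_loads[sj] for sj in switches)
        if used > caps[ci] - client_load.get(ci, 0):
            return False
    # Delay Bounds: assigned pairs respect the max. tolerable delay
    for (ci, sj), a in assign.items():
        if a and delays[(ci, sj)] > d_max:
            return False
    return True

assign = {("C1", "S1"): 1, ("C2", "S1"): 1, ("C3", "S1"): 0}
ok = feasible(assign, {"S1": 100}, {}, {"C1": 500, "C2": 500, "C3": 500},
              {("C1", "S1"): 1, ("C2", "S1"): 1, ("C3", "S1"): 9},
              req=2, d_max=5)
assert ok
```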
The BFT protocol executions take the source-destination pair
and the required bandwidth as an input for the service request.
Subsequently, the protocol computes the optimal path in the
COMMIT phase and notifies the switches on the path of new
reservation in the REPLY phase. To evaluate the designs of
all three protocols, we consider the following performance
metrics: i) time required to apply a new switch reconfiguration,
measured from the time of a client request arrival until the
confirmation of the last switch reconfiguration; ii) the acceptance
rate for the new arrivals; iii) the total communication overhead.
To validate our claims in a realistic environment, we have
emulated the Internet2 topology, as well as a fat-tree data-
center topology, encompassing 34 and 20 switches, respec-
tively. The controllers in the Internet2 scenario were placed
so to maximize the system coverage against failures as per
[6], [10]. The controllers of the fat-tree topology were placed
on the leaf-nodes as per [6], [11]. The state synchronization
between the controllers and the resulting switch reconfigura-
tions occur in in-band control mode. To provide for realistic
delay emulation, we derive the link distances from the publicly
available geographical Internet2 data1and inject the propaga-
tion delays using Linux’s tc tool. A single client was placed at
each switch of the Internet2 topology, while two clients were
placed at each leaf-switch of the fat-tree topology. The arrivals
of incoming service requests are modeled using a negative exponential distribution (n.e.d.) [11].
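Such negative-exponentially distributed arrivals can be generated, e.g., as follows (a sketch; the seed and helper name are ours):

```python
import random

def arrival_times(rate, n, seed=42):
    # Cumulative arrival times with negative-exponentially distributed
    # inter-arrival gaps, i.e., a Poisson arrival process with the
    # given per-client request rate.
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)
        times.append(t)
    return times

ts = arrival_times(rate=4.0, n=1000)
assert len(ts) == 1000
assert all(b > a for a, b in zip(ts, ts[1:]))  # strictly increasing
```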
To generate the hashes for per-switch configuration state
(ref. Sec. IV-C), we used Python’s hashlib implementation and
the SHA256 secure hash algorithm, defined in FIPS 180-2
[12]. We used Gurobi to solve the ILP formulated in Sec.
IV-D. The measurements were executed on a commodity PC
equipped with AMD Ryzen 1600 CPU and 32 GB RAM.
VI. DISCUSSION
1) Total reconfiguration time for the internal controller and
the switch state: Fig. 4a and Fig. 4b depict the accumulated
response time starting with the reception of a client request
at a controller replica until the last reconfiguration in one of
the switches on the detected path. The total number of active
controllers was fixed to |C| = 10 and the measurement was
executed for A&E group sizes varying between |A| = 3 and
|A| = 7 (F_M = 1 and F_M = 3, respectively) controllers.
Rejecting executions were not considered. Both Fat-Tree and
Internet2 topologies depict the benefit of opportunistic execu-
tion and a lower number of phases in OBFT in all scenarios.
In Fig. 5a and Fig. 5b, we vary the total number of deployed active controllers. The figures show that MPBFT performs on par with SBFT and OBFT in the controller constellations where the A&E group size in SBFT and OBFT approximately equals the total number of active controllers (i.e., all controllers belong to the same A&E group). After provisioning additional replicas (the case of |C| = [7..13]), the performance of MPBFT starts to suffer compared to both SBFT and OBFT, as it requires interactions between all controller instances for successful request handling, whereas SBFT and OBFT continue to operate with a constant A&E group size. OBFT offers the best performance in both topologies. This is because SBFT and MPBFT require additional rounds to handle request sequencing, whereas OBFT ensures that the causality property holds
per-switch, even in the case of unordered executions.
1 Internet2 topological data (provided by the POCO project): https://github.com/lsinfo3/poco/tree/master/topologies
[Fig. 4: CDFs of the switch reconfiguration delay for SBFT and OBFT, FM = [1..3], |C| = 10; panels: (a) Fat-Tree topology, (b) Internet2 topology.]
Fig. 4. Total accumulated switch reconfiguration (system response) time for varied sizes of A&E groups for max. tolerated Byzantine failures FM = [1..3], FA = 0 and a fixed total number of active controllers |C| = 10. OBFT shows dominantly lower commit delays in all depicted scenarios.
MPBFT
suffers further since the commands execute on each of the controller replicas; hence, its consensus on average involves a larger number of replicas than SBFT and OBFT. The Internet2 topology shows a smaller discrepancy between SBFT and OBFT and highlights the benefit of sequencing in geographically distributed scenarios, where network delays cause a longer asynchronous period and thus a higher probability of execution overlaps (confirmed by Fig. 6). The maximum path lengths are higher in the Internet2 topology, resulting in a larger number of overlapping reservations that cause execution rejections/stalling periods in the opportunistic OBFT.
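The CDF curves discussed above (P(delay < D) in Fig. 4 and Fig. 5) can be reproduced from raw delay samples with a standard empirical-CDF computation. A minimal sketch follows; the sample delays are illustrative, not measured values from the paper:

```python
def empirical_cdf(samples):
    """Return (x, p) points of the empirical CDF: for each observed
    reconfiguration delay x, p is the fraction of samples <= x."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Illustrative switch reconfiguration delays in ms (hypothetical data).
delays = [34.1, 36.5, 33.2, 40.8, 35.0]
cdf = empirical_cdf(delays)
assert cdf[0] == (33.2, 0.2)   # fastest reconfiguration, lowest quantile
assert cdf[-1] == (40.8, 1.0)  # slowest reconfiguration, CDF reaches 1
```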
2) Acceptance rates for arriving requests: In Fig. 6 we vary the per-client arrival rate λ for incoming client requests. In the case of λ = 4, up to 64 requests/second are processed by the cluster in the Internet2 topology. The opportunistic execution of OBFT and the subsequent hash comparison tend to result in rejecting runs more often, compared to SBFT, which serializes all requests prior to their processing. MPBFT results in a relatively high percentage of rejections, due to a higher chance of conflicting sequence-number handouts that may occur concurrently, since all replicas are involved in proposals during the PREPARE phase.
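The arrival process used in the evaluation (n.e.d. inter-arrival times at per-client rate λ) can be sketched as follows. The client count of 16 is our assumption, chosen so that λ = 4 yields the aggregate of 64 requests/second mentioned above; the paper does not specify its generator implementation:

```python
import random

def generate_arrivals(lam, num_clients, horizon, seed=42):
    """Generate (time, client) request arrivals up to `horizon` seconds.

    Each client issues requests with exponentially distributed (n.e.d.)
    inter-arrival gaps at rate `lam` requests/second.
    """
    rng = random.Random(seed)
    arrivals = []
    for client in range(num_clients):
        t = rng.expovariate(lam)
        while t < horizon:
            arrivals.append((t, client))
            t += rng.expovariate(lam)
    return sorted(arrivals)

# With lam = 4 req/s per client and 16 clients, the aggregate request
# rate approaches 64 requests/second over a long horizon.
arrivals = generate_arrivals(lam=4.0, num_clients=16, horizon=100.0)
assert 55.0 < len(arrivals) / 100.0 < 75.0
```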
[Fig. 5: CDFs of the switch reconfiguration delay for MPBFT, SBFT and OBFT, FM = 1, |C| = [4..13]; panels: (a) Fat-Tree topology, (b) Internet2 topology.]
Fig. 5. Total accumulated switch reconfiguration (system response) time for varied numbers of active controller replicas |C| = [4..13], FA = 0 and an A&E group size of |A| = 3. While OBFT portrays the lowest reconfiguration delays, its performance is similar to SBFT and MPBFT for small control planes (especially for Internet2), slightly better compared to SBFT and largely dominant compared to MPBFT for larger topology sizes.
[Fig. 6: Acceptance ratio [%] for per-client arrival rates λ = 1, 2, 4 requests/second, for SBFT, OBFT and MPBFT.]
Fig. 6. Acceptance rates for incoming client requests in the fat-tree topology, FM = 1, |C| = 4. SBFT tends to execute a higher number of successful runs compared to: i) MPBFT, due to its larger number of active replicas involved in the sequencing process, and ii) OBFT, due to its opportunistic design, where consistency of outputs is agreed upon after execution has finished.
3) Communication overhead: Fig. 7 depicts the scaling of the communication overhead with the increase of the total number of active controllers. Controller-to-Switch (C2S) communication overhead increases with the number of controllers that execute the operation and communicate their result to the target
switches. Thus, in MPBFT, following an output response computation, each controller distributes the newly computed configurations to the switches, hence the linear overhead increase. Since the size of the A&E group remains unchanged throughout all depicted scenarios, SBFT and OBFT show a constant, low C2S overhead. The Controller-to-Controller (C2C) overhead scales with the number of active controllers involved in the A&E group. For MPBFT and OBFT, this increase is quadratic; for SBFT, it is linear. It should be noted that the linear evolution holds only for constant A&E group sizes, i.e., for fixed FM and FA.
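The qualitative scaling above (constant C2S for SBFT/OBFT vs. linear for MPBFT; linear C2C for SBFT vs. quadratic for MPBFT/OBFT) can be captured in a toy per-request message-count model. The concrete formulas and constants below are illustrative assumptions for the trends, not values fitted to Fig. 7:

```python
def message_counts(protocol, num_controllers, group_size, path_len):
    """Per-request (C2S, C2C) message counts under simplified assumptions:
    - SBFT/OBFT: only the A&E group of `group_size` replicas executes and
      reconfigures the `path_len` switches; in MPBFT all replicas do.
    - C2C: all-to-all exchange among the agreeing replicas for MPBFT/OBFT
      (quadratic), leader-based rounds for SBFT (linear in group size).
    """
    if protocol == "MPBFT":
        c2s = num_controllers * path_len
        c2c = num_controllers * (num_controllers - 1)
    elif protocol == "OBFT":
        c2s = group_size * path_len
        c2c = group_size * (group_size - 1)
    elif protocol == "SBFT":
        c2s = group_size * path_len
        c2c = 2 * (group_size - 1)  # leader collects and redistributes
    else:
        raise ValueError(f"unknown protocol: {protocol}")
    return c2s, c2c

# C2S stays constant for SBFT/OBFT as |C| grows, but grows for MPBFT.
assert message_counts("SBFT", 13, 3, 4)[0] == message_counts("SBFT", 4, 3, 4)[0]
assert message_counts("MPBFT", 13, 3, 4)[0] > message_counts("MPBFT", 4, 3, 4)[0]
```

The model makes the trade-off explicit: decoupling the executing group from the full controller set is what keeps the SBFT/OBFT switch-facing load flat as the control plane scales.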
[Fig. 7: C2S and C2C average packet load [pps] for MPBFT, SBFT and OBFT, |C| = [4..13].]
Fig. 7. Signaling overhead [pps] when serving 16 requests/second for a varied number of controllers |C| = [4..13] and a fixed A&E group size |A| = 3. SBFT possesses the lowest overhead (linear growth), followed by OBFT and MPBFT, which show a quadratic growth scaling with |C|.
Additional notes: While SBFT and MPBFT ensure a single execution and validation of inputs for client requests (i.e., each client sequence number is mapped to a unique request), OBFT executes client requests speculatively, prior to reaching consensus. Thus, Byzantine clients may attempt to affect the order of execution, or to generate execution contentions. Metering mechanisms that detect and exclude misbehaving clients could cater for this case; they are, however, out of the scope of this work.
VII. RELATED WORK
Agreement-based approaches have focused on the optimization of the sequencing procedure by minimizing the number of replicas that actively participate in sequence proposals [13], [14]. REBFT [13] keeps only a subset (2F + 1 of a total of 3F + 1) of the replicas active during normal-case operation. It activates the passive replicas only after a replica fault is detected. Such approaches rely on a trusted counter implementation to prevent equivocation, i.e., the capability of a malicious replica to send conflicting proposals to other members. Since we do not assume a centralized proposer, we prevent equivocation by deciding new sequence numbers individually, without the overhead of a trusted counter or the delay of passive replica activation.
Speculative BFT protocols have been investigated in [15], [16]. However, these approaches conclude on the consensus of the computed decisions based on a comparison of the instantaneous outputs, and assume stateless operation. In contrast, in OBFT we leverage an agreement procedure that relies on external outputs, i.e., stateful per-switch configurations that are inherent to network management scenarios.
Omada [17] is a sequencing-based BFT design that assigns replicas either agreement or execution roles and parallelizes the agreement phase. It highlights the benefit of selecting a configuration with the lowest number of agreement groups. Contrary to our work, the authors assume a centralized sequencer per agreement group. Distinguishing the causality property per configuration target is neither discussed nor leveraged in their protocol. Similarly, Omada does not provide insight into opportunistic approaches to execution handling.
VIII. CONCLUSION
We have implemented two agreement-based and an opportunistic BFT protocol for the purpose of SDN controller state synchronization, and have analyzed their overheads in an emulated environment using software switches and emulated network delays. The evaluated KPIs include the switch reconfiguration times, the request acceptance rates and the
communication overhead. We have shown how our opportunistic BFT approach leverages agreement on the switch state at the time of request computation to ensure causality during request reconfiguration. It offers considerably lower response times compared to the sequencing-based approaches. However, this benefit comes at the expense of a lower acceptance rate and a quadratic communication overhead. For those metrics, the A&E group-based sequencing approach SBFT presents a better alternative. Both approaches achieve a higher throughput than MPBFT, which adapts the PBFT protocol.
ACKNOWLEDGMENT
This work has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant
agreement number 780315 SEMIOTICS. We are grateful to
Nemanja Deric, Arled Papa, Johannes Riedl and the reviewers
for their useful feedback and comments.
REFERENCES
[1] D. Suh et al., “On performance of OpenDaylight clustering,” in NetSoft
Conference and Workshops (NetSoft), 2016 IEEE. IEEE, 2016.
[2] E. Sakic et al., “Response Time and Availability Study of RAFT
Consensus in Distributed SDN Control Plane,” IEEE Transactions on
Network and Service Management, 2017.
[3] H. Howard et al., “Raft Refloated: Do we have Consensus?” ACM
SIGOPS Operating Systems Review, vol. 49, no. 1, 2015.
[4] L. Lamport, “Paxos made simple,” ACM SIGACT News, vol. 32, no. 4, 2001.
[5] P. Vizarreta et al., “Mining Software Repositories for Predictive Mod-
elling of Defects in SDN Controller,” in IFIP/IEEE International Sym-
posium on Integrated Network Management, 2019.
[6] E. Sakic et al., “MORPH: An Adaptive Framework for Efficient and
Byzantine Fault-Tolerant SDN Control Plane,” IEEE Journal on Selected
Areas in Communication, 2018.
[7] H. Li et al., “Byzantine-resilient secure software-defined networks with
multiple controllers in cloud,” IEEE Transactions on Cloud Computing,
vol. 2, no. 4, 2014.
[8] M. Castro et al., “Practical Byzantine fault tolerance,” in OSDI, vol. 99,
1999.
[9] A. Miller et al., “The honey badger of BFT protocols,” in Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications
Security. ACM, 2016.
[10] D. Hock et al., “POCO-framework for Pareto-optimal resilient controller
placement in SDN-based core networks,” in Network Operations and
Management Symposium (NOMS), 2014 IEEE. IEEE, 2014.
[11] X. Huang et al., “Dynamic Switch-Controller Association and Control Devolution for SDN Systems,” arXiv preprint arXiv:1702.03065, 2017.
[12] National Institute of Standards and Technology, “Secure Hash Standard,” FIPS 180-2 (with change notice), 2004.
[13] T. Distler et al., “Resource-efficient Byzantine fault tolerance,” IEEE Transactions on Computers, vol. 65, no. 9, 2016.
[14] J. Liu et al., “Scalable Byzantine Consensus via Hardware-assisted Secret Sharing,” IEEE Transactions on Computers, 2018.
[15] R. Kotla et al., “Zyzzyva: speculative Byzantine fault tolerance,” ACM SIGOPS Operating Systems Review, vol. 41, no. 6, 2007.
[16] P. Mohan et al., “Primary-Backup Controller Mapping for Byzantine
Fault Tolerance in Software Defined Networks,” in GLOBECOM 2017-
2017 IEEE Global Communications Conference. IEEE, 2017.
[17] M. Eischer et al., “Scalable Byzantine Fault Tolerance on Heterogeneous
Servers,” in Dependable Computing Conference (EDCC), 2017 13th
European. IEEE, 2017.