
P4BFT: Hardware-Accelerated Byzantine-Resilient Network Control Plane

Ermin Sakic∗†, Nemanja Deric, Endri Goshi, Wolfgang Kellerer
∗Technical University Munich, Germany; †Siemens AG, Germany
E-Mail:{ermin.sakic, nemanja.deric, endri.goshi, wolfgang.kellerer}@tum.de, ermin.sakic@siemens.com
Abstract—Byzantine Fault Tolerance (BFT) enables correct operation of distributed, i.e., replicated, applications in the face of malicious take-over and faulty/buggy individual instances. Recently, BFT designs have gained traction in the context of Software Defined Networking (SDN). In SDN, controller replicas are distributed and their state is replicated for high availability purposes. Malicious controller replicas, however, may destabilize the control plane and manipulate the data plane, thus motivating the BFT requirement. Nonetheless, deploying BFT in practice comes at the cost of increased traffic load stemming from replicated controllers, as well as a requirement for proprietary switch functionalities, thus putting strain on the switches’ control plane, where particular BFT actions must be executed in software. P4BFT leverages an optimal strategy to decrease the total number of messages transmitted to switches that are the configuration targets of SDN controllers. It does so by comparing controller messages and deducing the correct ones at optimally determined locations in the data plane. In terms of the incurred control plane load, our P4-based data plane extensions outperform the existing solutions by ∼33.2% and ∼40.2% on average, in random 128-switch and Fat-Tree/Internet2 topologies, respectively. To validate the correctness and performance gains of P4BFT, we deploy bmv2 and Netronome Agilio SmartNIC-based topologies. The advantages of P4BFT can thus be reproduced both with software switches and "commodity" P4-enabled hardware. A hardware-accelerated controller packet comparison procedure results in an average 96.4% decrease in processing delay per request compared to existing software approaches.
I. INTRODUCTION
State-of-the-art failure-tolerant SDN controllers base their state distribution on crash-tolerant consensus approaches. Such approaches rely on single-leader operation, where a leader replica decides on the ordering of client updates. After confirming the update with the follower majority, the leader triggers the cluster-wide commit operation and acknowledges the update to the requesting client. The RAFT algorithm [1] realizes this approach and is implemented in OpenDaylight [2] and ONOS [3]. RAFT is, however, unable to distinguish malicious/incorrect from correct controller decisions, and can easily be manipulated by an adversary in possession of the leader replica [4]. Recently, Byzantine Fault Tolerance (BFT)-enabled controllers were proposed for the purpose of enabling correct consensus in scenarios where a subset of controllers is faulty due to a malicious adversary or internal bugs [5]–[7]. In BFT-enabled SDN, multiple controllers act as replicated state machines and hence process incoming client requests individually. Thus, with BFT, each controller of a single administrative domain transmits the output of its computation to the target switch. The outputs of the controllers are then collected by trusted configuration targets (e.g., switches) and compared for matching payloads in order to identify the correct message.
In in-band [8] deployments, where application flows share
the same infrastructure as the control flows, the traffic arriving
from controller replicas imposes a non-negligible overhead [9].
Similarly, comparing and processing controller messages in
the switches’ software-based control plane causes additional
delays and CPU load [7], leading to longer reconfigurations.
Moreover, the comparison of control packets is implemented as a proprietary, non-standardized switch function and is thus unsupported in off-the-shelf devices.
In this work, we investigate the benefits of offloading the comparison of controller outputs, required for correct BFT operation, to carefully selected network switches. By minimizing the distance between the processing nodes and the controller clusters/individual controller instances, we decrease the network load imposed by BFT operation. P4BFT’s P4-enabled pipeline is in charge of controller packet collection, correct packet identification and forwarding of the correct packet to the destination nodes, thus minimizing accesses to the switches’ software control plane and outperforming the existing software-based solutions.
II. BACKGROUND AND PROBLEM STATEMENT
BFT has recently been investigated in the context of the distributed SDN control plane [5]–[7], [10]. In [5], [6], $3F_M+1$ controller replicas are required to tolerate up to $F_M$ Byzantine failures. MORPH [7] requires $2F_M+F_A+1$ replicas in order to tolerate up to $F_M$ Byzantine and $F_A$ availability-induced failures. The presented models assume the deployment of SDN controllers as a set of replicated state machines, where clients submit inputs to the controllers, which process them in isolation and subsequently send the computed outputs to the target destination (i.e., reconfiguration messages to destination switches). They assume trusted platform execution and a mechanism in the destination switch capable of comparing the controller messages and deducing the correct message. Namely, after receiving $F_M+1$ matching payloads, the observed message is regarded as correct and the contained configuration is applied.
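The acceptance rule of these models can be sketched in Python. This is a minimal illustration, not code from [5]–[7]: the function name `deduce_correct_payload` is hypothetical, payloads are compared byte-for-byte, and each replica is assumed to contribute exactly one authenticated copy.

```python
from collections import Counter
from typing import Iterable, Optional


def deduce_correct_payload(payloads: Iterable[bytes], f_m: int) -> Optional[bytes]:
    """Return the first payload seen F_M + 1 times, or None if none qualifies yet.

    Assumes each replica contributes one authenticated copy, so F_M + 1
    matching copies include at least one copy from a correct replica.
    """
    counts: Counter = Counter()
    for payload in payloads:
        counts[payload] += 1
        if counts[payload] >= f_m + 1:
            return payload
    return None


# Hypothetical outputs of five replicas, tolerating F_M = 2 Byzantine controllers.
outputs = [b"rule-A", b"rule-X", b"rule-A", b"rule-Y", b"rule-A"]
print(deduce_correct_payload(outputs, f_m=2))  # b'rule-A'
```

While fewer than $F_M+1$ copies match, the function returns None, i.e., the configuration target keeps waiting for further controller messages.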
The presented models are sub-optimal in a few regards. First, they assume the collection and processing of controller messages exclusively in the receiver nodes (configuration targets). Propagating each controller message can carry a large system footprint in large-scale in-band controlled networks, thus imposing a non-negligible load on the data plane. Second, none of the models details the overhead of the message comparison procedure in the target switches. The implementations presented in [5]–[7], [10] perform packet comparison solely in software. The non-deterministic, varying latency imposed by software switching may, however, be limiting in specific use cases, such as failure scenarios in critical infrastructure networks [11] or 5G scenarios [12]. This motivates a hardware-accelerated BFT design that minimizes the processing delays.
A. Our contribution
We introduce and experimentally validate the P4BFT design, which builds upon [5]–[7] and adds further optimizations:
• It allows for the collection of controllers’ packets and their comparison in processing nodes, as well as for relaying the deduced correct packets to the destinations;
• It selects the optimal processing nodes at per-destination-switch granularity. The proposed objective minimizes the control plane load and reconfiguration time, while considering constraints related to the switches’ processing capacity and the upper-bound reconfiguration delay;
• It executes in software, e.g., in the P4 switch behavioral model (bmv2¹), or in a physical environment, e.g., on a Netronome SmartNIC². Correctness, processing time and deployment flexibility are validated on both platforms.
We present the evaluation results of P4BFT for well-known and randomized network topologies, with varied controller and cluster sizes and placements. To the best of our knowledge, this is the first implementation of a BFT-enabled solution on a hardware platform, allowing for accelerated packet processing and low-latency detection of malicious controllers.
Paper Structure: Related work is presented in Section III.
Section IV details the P4BFT co-design of the control and data
plane as well as the optimization procedure. Section V presents
the evaluation methodology and discusses the empirically
measured performance of P4BFT in software- and hardware-
based data planes. Section VI concludes this paper.
III. RELATED WORK
1) BFT variations in the SDN context: In the context of centralized network control, BFT is still a relatively novel area of research. Reference solutions [5]–[7] assume the comparison of configuration messages, transmitted by the controller replicas, in the switch destined as the configuration target. With P4BFT, we investigate the flexibility advantages of processing messages in any node capable of message collection and processing, thus allowing for footprint minimization. [6] and [7] discuss a strategy for minimizing the number of matching messages required to deduce correct controller decisions, which we adopt in this work as well. [10] discusses the benefit of disaggregating BFT consensus groups in the SDN control plane into multiple controller cluster partitions, thus enabling higher scalability than possible with [6] and [7]. While compatible with [10], our work focuses on scalability enhancements and footprint minimization by means of data-plane reconfiguration for realizing more efficient packet comparison.
¹P4 Software Switch - https://github.com/p4lang/behavioral-model
²Netronome Agilio® CX 2x10GbE SmartNIC Product Brief - https://www.netronome.com/media/documents/PB_Agilio_CX_2x10GbE.pdf
2) Data Plane-accelerated Service Execution: Recently, Dang et al. [13] have shown the benefits of offloading coordination services for reaching consensus to the data plane, using a Paxos implementation in the P4 language as an example. In this paper, we investigate whether a similar claim holds for BFT algorithms in the SDN context. In the same spirit, in [14], end-hosts partially offload the log replication and log commitment operations of the RAFT consensus algorithm to neighboring P4 devices, thus accelerating the overall commit time. In the context of in-network computation, Sapio et al. [15] discuss the benefit of offloading data aggregation to constrained network devices for the purpose of data reduction and minimization of workers’ computation time.
IV. SYSTEM MODEL AND DESIGN
A. P4BFT System Model
We consider a typical SDN architecture allowing for flexible function execution on the networking switches for the purpose of BFT system operation. The flexibility of in-network function execution is bounded by the limitations of the data plane programming interface (i.e., the P4₁₆ [16] language specification in the case of P4BFT). The control plane communication between the switches and controllers and in-between the controllers is realized using an in-band control channel [8]. In order to prevent faulty replicas from impersonating correct replicas, controllers authenticate each message using Message Authentication Codes (assuming pre-shared symmetric keys for each pair) [17]. Similarly, switches that are in charge of message comparison and message propagation to the configuration targets must be capable of generating a signature using the processed payload and their secret key.
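The per-pair authentication can be illustrated with Python’s standard `hmac` module. HMAC-SHA256 is an assumption here (the text only requires MACs with pre-shared symmetric keys [17]), and the key table `PAIR_KEYS` and the helper names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical pre-shared symmetric keys, one per (controller, switch) pair.
PAIR_KEYS = {
    ("C1", "S4"): b"key-c1-s4",
    ("C2", "S4"): b"key-c2-s4",
}


def tag_message(sender: str, receiver: str, payload: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the payload with the pair's shared key."""
    return hmac.new(PAIR_KEYS[(sender, receiver)], payload, hashlib.sha256).digest()


def verify_message(sender: str, receiver: str, payload: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(tag_message(sender, receiver, payload), tag)


tag = tag_message("C1", "S4", b"flow-mod")
print(verify_message("C1", "S4", b"flow-mod", tag))  # True
print(verify_message("C2", "S4", b"flow-mod", tag))  # False: tag does not verify under C2's key
```

A faulty replica holding only its own keys therefore cannot produce a tag that verifies as another controller’s message. The signatures generated by processing nodes are a separate mechanism that this sketch does not cover.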
In P4BFT, controllers calculate their decisions in isolation from each other and transmit them to the destination switch. Control packets are intercepted by the processing nodes (i.e., processing switches) responsible for decisions destined for the target switch. In order to collect and compare control packets, we assume packet header fields that include the client_request_id, controller_id, destination_switch_id (e.g., MAC/IP address), the payload (controller-decided configuration) and the optional signature field (denoting whether a packet has already been processed by a processing node). Clients must include the client_request_id field in their controller requests.
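The header fields listed above could be serialized as follows. The field order follows the text, but the widths (32-bit request ID, 16-bit controller ID, 6-byte switch ID, 32-byte signature), the use of an all-zero signature for "not yet processed", and the class name are assumptions for illustration, not the P4BFT wire format.

```python
import struct
from dataclasses import dataclass

# Assumed layout: request id (4 B), controller id (2 B), destination switch
# id (6 B, e.g. a MAC address), payload length (2 B); then payload and signature.
HEADER_FMT = "!IH6sH"
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 14 bytes, no padding in network order
SIG_LEN = 32  # assumed signature size; all-zero bytes stand for "not yet processed"


@dataclass
class ControlPacket:
    client_request_id: int
    controller_id: int
    destination_switch_id: bytes  # exactly 6 bytes in this sketch
    payload: bytes
    signature: bytes = bytes(SIG_LEN)

    def is_processed(self) -> bool:
        """A non-zero signature marks a packet already handled by a processing node."""
        return self.signature != bytes(SIG_LEN)

    def pack(self) -> bytes:
        header = struct.pack(HEADER_FMT, self.client_request_id, self.controller_id,
                             self.destination_switch_id, len(self.payload))
        return header + self.payload + self.signature

    @classmethod
    def unpack(cls, raw: bytes) -> "ControlPacket":
        req, ctrl, dst, plen = struct.unpack_from(HEADER_FMT, raw)
        payload = raw[HEADER_LEN:HEADER_LEN + plen]
        signature = raw[HEADER_LEN + plen:HEADER_LEN + plen + SIG_LEN]
        return cls(req, ctrl, dst, payload, signature)


pkt = ControlPacket(7, 2, bytes.fromhex("0000000000a4"), b"flow-mod")
print(ControlPacket.unpack(pkt.pack()) == pkt, len(pkt.pack()))  # True 54
```

Placing the signature last matches the "trailing signature field" described in Section IV-C, so a processing node can append it without rewriting the preceding header.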
Apart from distinguishing correct from malicious/incorrect
messages, P4BFT allows for identification and exclusion of
faulty controller replicas. P4BFT’s architectural model assumes three entities, each with a distinct role:
1) Network controllers enforce forwarding plane configurations based on internal decision making. For simplicity, each controller replica of an administrative domain serves each client request. Each correct replica maintains internal state information (e.g., resource reservations) matching that of other correct instances. In the case of a controller with diverged state, e.g., as a result of corrupted operation or a malicious adversary take-over, the incorrect controllers’ computation outputs may differ from the correct ones.
2) P4-enabled switches forward the control and application packets. Depending on the output of the Reassigner’s optimization step, a switch may be assigned the processing node role, i.e., become in charge of comparing outputs computed by different controllers, destined for itself or other configuration targets. A processing node compares messages sent out by different controllers and distinguishes the correct ones. On identification of a faulty controller, it reports the faulty replica to the Reassigner. In contrast to [5]–[7], P4BFT enables control packet comparison for packets destined for remote targets.
3) The Reassigner is responsible for two tasks:
Task 1: It dynamically reassigns the controller-switch connections based on the events collected from the detection mechanism of the switches, i.e., it excludes controllers detected as faulty from the assignment procedure. It furthermore ensures that the minimum number of controllers required to tolerate $F_A$ availability failures and $F_M$ malicious failures is loaded and associated with each switch. This task is also discussed in [6], [7].
Task 2: It maps a processing node, in charge of comparing controller messages, to each destination switch. Based on the result of this optimization, switches gain the responsibility of processing control packets. The output of the optimization procedure is the Processing Table, which identifies the switches responsible for comparing controller messages. Additionally, the Reassigner computes the Forwarding Tables, necessary for forwarding controller messages to processing nodes and reconfiguration targets. Given the number of controllers and the user-configurable maximum number of tolerated Byzantine failures $F_M$, the Reassigner reports to the processing nodes the number of matching messages that must be collected before a controller message is marked as correct.
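Both Reassigner tasks can be sketched in Python. All function names are hypothetical. Task 1 assumes the MORPH bound $2F_M+F_A+1$ [7] (the $3F_M+1$ bound of [5], [6] is included for comparison), and taking the first remaining controllers stands in for the controller-switch assignment, which is not detailed here. For Task 2, the sketch only inverts a given optimization result into Processing Table contents and derives the $F_M+1$ matching threshold.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def required_replicas_morph(f_m: int, f_a: int) -> int:
    """Replicas for f_m Byzantine and f_a availability failures, as in MORPH [7]."""
    return 2 * f_m + f_a + 1


def required_replicas_classic(f_m: int) -> int:
    """Replicas for f_m Byzantine failures, as in [5], [6]."""
    return 3 * f_m + 1


def assignable_controllers(all_ctrls: Sequence[str], excluded: Set[str],
                           f_m: int, f_a: int) -> List[str]:
    """Task 1: drop controllers reported as faulty, keep the minimum set required."""
    pool = [c for c in all_ctrls if c not in excluded]
    needed = required_replicas_morph(f_m, f_a)
    if len(pool) < needed:
        raise RuntimeError(f"only {len(pool)} controllers left, {needed} required")
    return pool[:needed]


def build_processing_tables(assignment: Dict[str, str]) -> Dict[str, Set[str]]:
    """Task 2: invert target -> processing node into per-switch Processing Tables."""
    tables: Dict[str, Set[str]] = defaultdict(set)
    for target, proc_node in assignment.items():
        tables[proc_node].add(target)
    return dict(tables)


def matching_threshold(f_m: int) -> int:
    """Matching messages a processing node must collect before accepting: F_M + 1."""
    return f_m + 1


ctrls = [f"C{i}" for i in range(1, 8)]
print(assignable_controllers(ctrls, excluded={"C3"}, f_m=2, f_a=1))
# ['C1', 'C2', 'C4', 'C5', 'C6', 'C7']

# Hypothetical optimization output: S2 processes for S4 (as in Fig. 1c)), S3 for S5.
assignment = {"S1": "S1", "S2": "S2", "S3": "S3", "S4": "S2", "S5": "S3"}
print(sorted(build_processing_tables(assignment)["S2"]), matching_threshold(2))
# ['S2', 'S4'] 3
```

If excluding a faulty replica leaves fewer controllers than the bound requires, the sketch raises an error instead of silently weakening the fault-tolerance guarantee.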
B. Finding the Optimal Processing Nodes
The optimization methodology minimizes the experienced switch reconfiguration delay as well as the total network load introduced by the exchanged controller packets. When a switch is assigned the processing node role for itself or another target switch, it collects the control packets destined for the target switch and deduces the correct payload on-the-fly; it then forwards a single packet copy containing the correct controller message to the destination switch. Consider Fig. 1a). If control packet comparison is done only at the target switch (as in prior works), a request for S4 creates a total footprint of $F_C = 13$ packets in the data plane (the sum of the Cluster 1 and Cluster 2 utilizations of 4 and 9, respectively). In contrast, if the processing is executed in S3 (as depicted in Fig. 1b)), the total footprint can be decreased to $F_C = 11$. Therefore, in order to minimize the total control plane footprint, we identify an optimal processing node for each target switch, based on the given topology, the placement of controllers and the processing nodes’ capacity constraints. If we additionally extend the optimization to a multi-objective formulation by considering the delay metric, the critical path between the controller furthest away from the configuration target and the target amounts to $F_D = 3$ in the worst case (ref. Fig. 1c)), i.e., 3 hops assuming a delay weight of 1 per hop. This assignment additionally retains the minimized communication overhead of $F_C = 11$.
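The footprint arithmetic of this example, i.e., one target’s term of the communication overhead objective in Eq. (1), can be reproduced with a breadth-first search over unit-weight links. The five-switch topology and cluster attachment points below are hypothetical and are not the exact topology of Fig. 1; counting only switch-to-switch hops is likewise an assumption of this sketch.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical five-switch topology (not the exact topology of Fig. 1).
EDGES: List[Tuple[str, str]] = [("S1", "S2"), ("S2", "S3"), ("S2", "S4"),
                                ("S3", "S5"), ("S4", "S5")]
CLUSTERS = {"S1": 2, "S3": 3}  # |C_j|: controllers attached to switch j


def adjacency(edges: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    adj: Dict[str, List[str]] = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def hops_from(adj: Dict[str, List[str]], src: str) -> Dict[str, int]:
    """Breadth-first search: shortest hop counts h_{src,*} with unit edge weights."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def footprint(adj: Dict[str, List[str]], clusters: Dict[str, int],
              proc: str, target: str) -> int:
    """F_C for one target: sum_j |C_j| * h_{j,proc} plus one copy over h_{proc,target}."""
    collect = sum(n * hops_from(adj, j)[proc] for j, n in clusters.items())
    return collect + hops_from(adj, proc)[target]


adj = adjacency(EDGES)
for proc in ["S4", "S2", "S3"]:
    print(proc, footprint(adj, CLUSTERS, proc, "S4"))
# S4 10
# S2 6
# S3 6
```

Processing at the target (S4) corresponds to the prior-work setting; offloading to S2 or S3 cuts the footprint from 10 to 6 packets in this toy topology. The two offloaded options tie on footprint, which is where the delay objective becomes the tie-breaker.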
TABLE I
PARAMETERS USED IN THE MODEL
Symbol | Description
$V: \{S_1, S_2, \dots, S_n\},\ n \in \mathbb{Z}^+$ | Set of all switch nodes in the topology.
$C: \{C_1, C_2, \dots, C_n\},\ n \in \mathbb{Z}^+$ | Set of all controllers connected to the topology.
$D: \{d_{i,j,k},\ i, j, k \in V\}$ | Set of delay values for the path from $i$ to $k$, passing through $j$.
$H: \{h_{i,j},\ i, j \in V\}$ | Set of hop counts of the shortest path from $i$ to $j$.
$Q: \{q_i,\ i \in V\}$ | Set of the switches’ processing capacities.
$C_j \subseteq C$ | Set of controllers connected to node $j$.
$M \subseteq V$ | Set of switches connected to at least one controller.
$T$ | Maximum tolerated delay value.
$x(i, k)$ | Binary variable that equals 1 if $i$ is a processing node for $k$.
We describe the processing node mapping problem using
an integer linear programming (ILP) formulation. Table I
summarizes the notation used.
Communication overhead minimization objective minimizes the total communication footprint imposed on the control plane. Each controller replica generates an individual message sent to the processing node $i$, which subsequently collects all remaining necessary messages and forwards a single resulting correct message to the configuration target $k$:
$$M_F = \min \sum_{k \in V} \sum_{i \in V} \Big( 1 \cdot h_{i,k} \cdot x(i,k) + \sum_{j \in M} |C_j| \cdot h_{j,i} \cdot x(i,k) \Big) \qquad (1)$$
Configuration delay minimization objective minimizes the worst-case delay imposed on the critical path used for forwarding configuration messages from a controller associated with node $j$, to the potential processing node $i$ and finally to the configuration target node $k$:
$$M_D = \min \sum_{k \in V} \sum_{i \in V} x(i,k) \cdot \max_{j \in M} \left( d_{j,i,k} \right) \qquad (2)$$
Bi-objective optimization minimizes the weighted sum of the two objectives, with $w_1$ and $w_2$ being the associated weights:
$$\min \; w_1 \cdot M_F + w_2 \cdot M_D \qquad (3)$$
Processing capacity constraint: The sum of messages requiring processing on $i$, over each configuration target $k$ assigned to $i$, must be kept at or below its processing capacity $q_i$:
$$\text{Subject to:} \quad \sum_{k \in V} x(i,k) \cdot |C| \le q_i, \quad \forall i \in V \qquad (4)$$
Maximum delay constraint: For each configuration target $k$, the delay imposed by forwarding the controller packets to node $i$, which is responsible for packet collection and comparison and for forwarding the correct message to the target node $k$, does not exceed an upper bound $T$:
$$\text{Subject to:} \quad \sum_{i \in V} x(i,k) \cdot \max_{j \in M} \left( d_{j,i,k} \right) \le T, \quad \forall k \in V \qquad (5)$$
Single assignment constraint: For each configuration target $k$, there exists exactly one processing node $i$:
$$\text{Subject to:} \quad \sum_{i \in V} x(i,k) = 1, \quad \forall k \in V \qquad (6)$$
Note: The assignment of controller-switch connections for
the purpose of control and reconfiguration is adapted from
existing formulations [7], [10] and is thus not detailed here.
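For toy instances, the formulation in Eqs. (1)–(6) can be checked by exhaustive search over all assignments $x(i,k)$. This is a brute-force stand-in, not the ILP solver used in the evaluation; it assumes unit edge weights, so that $d_{j,i,k} = h_{j,i} + h_{i,k}$, and uses a hypothetical five-switch topology with function names of our own choosing.

```python
import itertools
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical toy instance with unit edge weights (as assumed in Sec. V-A).
EDGES = [("S1", "S2"), ("S2", "S3"), ("S2", "S4"), ("S3", "S5"), ("S4", "S5")]
CLUSTERS = {"S1": 2, "S3": 3}  # |C_j| for every j in M


def all_pairs_hops(edges) -> Dict[str, Dict[str, int]]:
    """h_{i,j} for all switch pairs via breadth-first search."""
    adj: Dict[str, List[str]] = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    hops: Dict[str, Dict[str, int]] = {}
    for src in adj:
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        hops[src] = dist
    return hops


def solve(edges, clusters: Dict[str, int], capacity: Dict[str, int], t_max: int,
          w1: float = 1.0, w2: float = 1.0) -> Optional[Tuple[float, Dict[str, str]]]:
    """Exhaustive search over x(i, k); only practical for tiny topologies."""
    h = all_pairs_hops(edges)
    nodes = sorted(h)
    n_ctrl = sum(clusters.values())  # |C|
    best = None
    # Each tuple picks exactly one processing node per target: constraint (6).
    for choice in itertools.product(nodes, repeat=len(nodes)):
        assign = dict(zip(nodes, choice))  # target k -> processing node i
        m_f = m_d = 0
        load = {i: 0 for i in nodes}
        feasible = True
        for k, i in assign.items():
            delay = max(h[j][i] + h[i][k] for j in clusters)  # max_j d_{j,i,k}
            if delay > t_max:  # constraint (5)
                feasible = False
                break
            m_f += h[i][k] + sum(n * h[j][i] for j, n in clusters.items())  # Eq. (1)
            m_d += delay  # Eq. (2)
            load[i] += n_ctrl
        if not feasible or any(load[i] > capacity[i] for i in nodes):  # constraint (4)
            continue
        cost = w1 * m_f + w2 * m_d  # Eq. (3)
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


caps = {s: 10 for s in ["S1", "S2", "S3", "S4", "S5"]}
print(solve(EDGES, CLUSTERS, caps, t_max=4))
# (36.0, {'S1': 'S1', 'S2': 'S2', 'S3': 'S3', 'S4': 'S2', 'S5': 'S3'})
```

A capacity of 10 lets each switch process two targets of $|C| = 5$ messages each. For target S4, S2 and S3 yield the same footprint, but S2 has the lower critical-path delay, so the weighted objective selects S2. The search space grows as $|V|^{|V|}$, so realistic topologies require an ILP solver.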
[Fig. 1 diagram: switches S1, S2, S3, S4, S5; Cluster 1 (C1, C2) and Cluster 2 (C3, C4, C5). (a) Case I: $F_C = 13$; $F_D = 3$ hops. (b) Case II: $F_C = 11$; $F_D = 5$ hops. (c) Case III: $F_C = 11$; $F_D = 3$ hops.]
Fig. 1. For brevity, we depict only the control flows destined for configuration target S4. The orange and red blocks represent an exemplary cluster separation of 5 controllers into groups of 2 and 3 controllers, respectively. The green dashed block highlights the processing node responsible for comparing the controller messages destined for S4. Figure (a) presents the unoptimized case as per [5]–[7], where S4 collects and processes control messages destined for itself, thus resulting in a control plane load of $F_C = 13$ and a delay on the critical path (marked with blue labels) of $F_D = 3$ hops (assuming edge weights of 1). By optimizing for the total communication overhead, $F_C$ can be decreased to 11, as portrayed in Figure (b). Contrary to (a), in (b) the processing of packets destined for S4 is offloaded to the processing node S3. However, additional delay is incurred by the traversal of path S1-S2-S3-S2-S4 for the control messages sourced in Cluster 1. Multi-objective optimization according to P4BFT, which aims to minimize both the communication overhead and the control plane delay, instead selects S2 as the optimal processing node (ref. Figure (c)), thus minimizing both $F_C$ and $F_D$.
C. P4 Switch and Reassigner Control Flow
Processing node data plane: Switches declared to process
controller messages for a particular target (i.e., for itself, or for
another switch) initially collect the control payloads stemming
from different controllers. Each processing node maintains
counters for the number of observed and matching packets
for a particular (re-)configuration request identifier. After sufficiently many matching packets are collected for a particular payload (more specifically, hash of the payload), the processing node signs a message using its private key and forwards one copy of the correct packet to its own control plane for the required software processing (i.e., identification of the correct message and potentially malicious controllers), and the second copy on the port leading to the configuration target. Processed and unprocessed packets are distinguished by means of the trailing signature field.
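The data-plane behavior described above can be modeled in software as follows. This is a Python model rather than P4 code: SHA-256 stands in for the pipeline’s hash computation and HMAC-SHA256 for the signature (both assumptions), and the class name, `Action` tuple and key are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

Action = Tuple[str, Optional[int], bytes, bytes]  # (where, port, payload, signature)


@dataclass
class ProcessingNode:
    """Software model of a processing node's data-plane logic (illustrative only)."""
    node_key: bytes              # hypothetical secret key used for signing
    threshold: int               # matching packets required, F_M + 1
    egress_port: Dict[str, int]  # destination switch id -> egress port
    # request id -> payload hash -> ids of controllers that sent that payload
    seen: Dict[int, Dict[bytes, Set[int]]] = field(default_factory=dict)

    def on_packet(self, request_id: int, controller_id: int,
                  dst: str, payload: bytes) -> List[Action]:
        digest = hashlib.sha256(payload).digest()  # stands in for the P4 hash step
        voters = self.seen.setdefault(request_id, {}).setdefault(digest, set())
        before = len(voters)
        voters.add(controller_id)
        # Emit only on the packet that raises the match count to the threshold.
        if len(voters) == before or len(voters) != self.threshold:
            return []
        signature = hmac.new(self.node_key, payload, hashlib.sha256).digest()
        # One copy to the local control plane, one towards the configuration target.
        return [("cpu", None, payload, signature),
                ("port", self.egress_port[dst], payload, signature)]


node = ProcessingNode(b"s2-secret", threshold=3, egress_port={"S4": 2})
actions: List[Action] = []
for cid, payload in [(1, b"A"), (2, b"B"), (3, b"A"), (4, b"A"), (5, b"A")]:
    actions += node.on_packet(42, cid, "S4", payload)
print([(where, port) for where, port, _, _ in actions])  # [('cpu', None), ('port', 2)]
```

Counting distinct controller IDs per hash, rather than raw packets, keeps a retransmission from one controller from being counted twice; later matching packets after the threshold produce no further copies.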
Processing node control plane: After determining the cor-
rect packet, the processing node identifies any incorrect con-
troller replicas (i.e., replicas whose output hashes diverge
from the deduced correct hash) and subsequently notifies the
Reassigner of the discrepancy. Alternatively, the switch applies
the configuration message if it is the configuration target itself.
The switch then proceeds to clear its registers associated with the processed message hash so as to free the memory for future requests.
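The control-plane step can be sketched as a function over per-request state of the form request ID, then payload hash, then set of controller IDs. The function name and the state layout are hypothetical; the sketch returns the controllers whose output hash diverged and removes the request’s entries, mirroring the register reset.

```python
from typing import Dict, Set

# Hypothetical per-request state: request id -> payload hash -> controller ids.
State = Dict[int, Dict[bytes, Set[int]]]


def finalize_request(state: State, request_id: int, correct_digest: bytes) -> Set[int]:
    """Return controllers whose output hash diverged from the correct one.

    The request's entries are removed, mirroring the register reset that
    frees memory for future requests.
    """
    per_hash = state.pop(request_id, {})
    faulty: Set[int] = set()
    for digest, controllers in per_hash.items():
        if digest != correct_digest:
            faulty |= controllers
    return faulty


state: State = {42: {b"hA": {1, 3, 4}, b"hB": {2}}}
print(finalize_request(state, 42, b"hA"), state)  # {2} {}
```

In a full system the returned IDs would be reported to the Reassigner. This sketch only inspects packets that arrived before the request was finalized.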
Reassigner control flow: At network bootstrapping time, or
on occurrence of any of the following events: i) a detected
malicious controller; ii) a failed controller replica; or iii) a
switch/link failure, the Reassigner reconfigures the processing and forwarding tables of the switches, as well as the number of required matching messages to detect the correct message.
D. P4 Tables Design
Switches maintain Tables and Registers that define the method of processing incoming packets. The Reassigner populates the switches’ Tables and Registers so that the selection of processing nodes for controller messages is optimal w.r.t. a set of given constraints, i.e., so that the total message overhead or the latency experienced in the control plane is minimized (according to the optimization procedure in Section IV-B). The Reassigner thus updates these elements whenever a controller is identified as incorrect and excluded from consideration, since this yields a different optimization result.
P4BFT leverages four P4 tables:
1) Processing Table: It holds identifiers of the switches
whose packets must be processed by the switch hosting
this table. Incoming packets are matched based on the
destination switch’s ID. In the case of a table hit, the
hosting switch processes the packets as a processing node.
Otherwise, the packet is matched against the Process-Forwarding Table.
2) Process-Forwarding Table: Declares which egress port
the packets should be sent out on for further processing.
If an unprocessed control packet is not to be processed
locally, the switch will forward the packet towards the
correct processing node, based on forwarding entries
maintained in this table.
3) L2-Forwarding Table: After the processing node has processed the incoming control packets destined for the destination switch, the last step is forwarding the correctly deduced packet towards it. Information on how to reach the destination switches is maintained in this table. In contrast to the Process-Forwarding Table, the packet is now forwarded to the destination switch rather than to a processing node.
4) Hash Table with associated registers: Processing a set
of controller packets for a particular request identifier
requires evaluating and counting the number of occur-
rences of packets containing the matching payload. To
uniquely identify the decision of the controller, a hash
value is generated on the payload during processing. The
counting of incoming packets is done by updating the
corresponding binary values in the register vectors, with
respective layout depicted in Table II.
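The lookup order across these tables can be summarized in a small Python model. Port numbers, identifiers, the function names and the `apply` action for signed packets reaching their destination are illustrative assumptions (the text does not detail signature verification at the target), and the Hash Table with its registers is reduced to the `process` action here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class SwitchTables:
    """Illustrative contents of the P4BFT tables on one switch."""
    processing: Set[str]                 # Processing Table: destinations processed here
    process_forwarding: Dict[str, int]   # destination -> port towards its processing node
    l2_forwarding: Dict[str, int]        # destination -> port towards the destination


def classify(tables: SwitchTables, self_id: str, dst: str,
             processed: bool) -> Tuple[str, Optional[int]]:
    """Return the pipeline action for an incoming control packet."""
    if processed:
        # Signed by a processing node: deliver it to its configuration target.
        if dst == self_id:
            return ("apply", None)
        return ("forward", tables.l2_forwarding[dst])    # L2-Forwarding Table
    if dst in tables.processing:
        return ("process", None)                         # hash, count, compare
    return ("forward", tables.process_forwarding[dst])  # Process-Forwarding Table


# Hypothetical S2 that processes packets for itself and for S4.
s2 = SwitchTables(processing={"S2", "S4"},
                  process_forwarding={"S1": 1, "S3": 3, "S5": 3},
                  l2_forwarding={"S1": 1, "S4": 4})
print(classify(s2, "S2", "S4", processed=False))  # ('process', None)
print(classify(s2, "S2", "S4", processed=True))   # ('forward', 4)
print(classify(s2, "S2", "S5", processed=False))  # ('forward', 3)
```

Checking the signature state first keeps already-deduced packets from being counted again at intermediate switches that happen to process other targets.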
On each arriving unprocessed packet, the processing node computes a hash $h_i^{\mathrm{request\_id}}$ over the acquired payload, which is either a previously seen hash or the $i$-th initially observed one. Subsequently, it sets the binary flag for the source controller controller_id to 1 in the $i$-th register row at column [client_request_id · |C| + controller_id], where |C| represents the total number of deployed controllers. Each time a client request is fully processed, the binary entries associated with the corresponding client_request_id are reset to zero. To detect a malicious controller, the controller IDs associated with hashes distinguished as incorrect are reported to the Reassigner.
TABLE II
HASH TABLE LAYOUT
Msg Hash | Request ID 1: C1, C2, ..., CN | ... | Request ID K: C1, C2, ..., CN
$h_0$ | $b^{h_0}_{C_1}$, $b^{h_0}_{C_2}$, ..., $b^{h_0}_{C_N}$ | ... | $b^{h_0}_{C_1}$, $b^{h_0}_{C_2}$, ..., $b^{h_0}_{C_N}$
... | ... | ... | ...
$h_{F_M}$ | $b^{h_{F_M}}_{C_1}$, $b^{h_{F_M}}_{C_2}$, ..., $b^{h_{F_M}}_{C_N}$ | ... | $b^{h_{F_M}}_{C_1}$, $b^{h_{F_M}}_{C_2}$, ..., $b^{h_{F_M}}_{C_N}$
Note: To tolerate $F_M$ Byzantine failures, a maximum of $F_M+1$ unique hashes for a single request identifier may be detected, hence the corresponding $F_M+1$ pre-allocated table rows in Table II.
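The register layout of Table II can be emulated with plain Python lists to make the indexing concrete. The column formula is the one given in the text; folding request identifiers into $K$ slots with a modulo, using 0-based controller IDs, and the class name `HashRegisters` are assumptions of this sketch.

```python
from typing import List, Optional


class HashRegisters:
    """Bit registers following the Table II layout (software illustration).

    Row i stores the i-th distinct payload hash observed for a request slot;
    bit [slot * |C| + controller_id] in that row marks the sending controller.
    """

    def __init__(self, f_m: int, n_controllers: int, n_slots: int) -> None:
        self.rows = f_m + 1          # at most F_M + 1 distinct hashes per request
        self.n_ctrl = n_controllers  # |C|
        self.n_slots = n_slots       # K request identifiers tracked at once
        self.bits = [[0] * (n_slots * n_controllers) for _ in range(self.rows)]
        self.row_hash: List[List[Optional[int]]] = [
            [None] * n_slots for _ in range(self.rows)]

    def column(self, request_id: int, controller_id: int) -> int:
        # Assumes 0-based controller ids and request ids folded into [0, K).
        return (request_id % self.n_slots) * self.n_ctrl + controller_id

    def record(self, request_id: int, controller_id: int, h: int) -> int:
        """Set the controller's bit in the row holding hash h; return its match count."""
        slot = request_id % self.n_slots
        start = slot * self.n_ctrl
        for i in range(self.rows):
            if self.row_hash[i][slot] in (None, h):  # rows fill in order of arrival
                self.row_hash[i][slot] = h
                self.bits[i][self.column(request_id, controller_id)] = 1
                return sum(self.bits[i][start:start + self.n_ctrl])
        raise RuntimeError("more than F_M + 1 distinct hashes for one request")

    def reset(self, request_id: int) -> None:
        """Zero all entries of a fully processed request."""
        slot = request_id % self.n_slots
        start = slot * self.n_ctrl
        for i in range(self.rows):
            self.row_hash[i][slot] = None
            for col in range(start, start + self.n_ctrl):
                self.bits[i][col] = 0


regs = HashRegisters(f_m=2, n_controllers=5, n_slots=4)
for cid, h in [(0, 0xA), (1, 0xB), (2, 0xA), (3, 0xA)]:
    count = regs.record(request_id=6, controller_id=cid, h=h)
print(count)  # 3
```

Observing more than $F_M+1$ distinct hashes for one request would imply more than $F_M$ faulty replicas, which is outside the tolerated fault model; the sketch raises an error in that case.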
V. EVALUATION, RESULTS AND DISCUSSION
A. Evaluation Methodology
We next evaluate the following metrics using P4BFT and state-of-the-art [5]–[7] designs: i) control plane load; ii) imposed processing delay in the software and hardware P4BFT nodes; iii) end-to-end switch reconfiguration delay; and iv) ILP solution time. We execute the measurements for random controller placements and diverse data plane topologies: i) random topologies with a fixed average node degree; ii) the reference Internet2 topology [18]; and iii) a data-center Fat-Tree (k = 4). We also vary and depict the impact of the number of switches, controller instances, and disjoint controller clusters. To compute paths between controllers and switches and between processing and destination switches, the Reassigner leverages the Constrained Shortest Path First (CSPF) algorithm. For simplicity, we assume edge weights of 1 as an input to the optimization procedure in the Reassigner. The objective function used in processing node selection is Eq. 3, parametrized with $(w_1, w_2) = (1, 1)$.
The P4BFT implementation is a combination of P4₁₆ and P4Runtime code, compiled for software and physical execution on the P4 software switch bmv2 (master check-out, December 2018) and a Netronome Agilio SmartNIC device with the corresponding firmware compiled using SDK 6.1-Preview, respectively. Apache Thrift and gRPC are used for populating registers and table entries in bmv2, respectively. Thrift is used for populating both tables and registers on the Netronome SmartNIC, because the current SDK release does not fully support P4Runtime. HTTP REST is used for the exchange between the P4 switch control plane and the Reassigner. The Reassigner and network controller replicas are implemented as Python applications.
B. Communication Overhead Advantage
Figure 2 depicts the packet load improvement of P4BFT over the existing reference solutions [5]–[7] for randomly generated topologies with an average node degree of 4. The footprint improvement is defined as $1 - F_C^{\mathrm{P4BFT}} / F_C^{\mathrm{SoA}}$, where $F_C$ denotes the sum of the packet footprints of control flows destined to each destination switch of the network topology, as per Sec. IV-B and Fig. 1. P4BFT outperforms the state-of-the-art because each of the referenced works assumes an uninterrupted control flow from each controller instance to the destination switches. P4BFT, on the other hand, aggregates control packets in the processing nodes, which, after collecting the control packets, forward a single correct message towards the destination, thus decreasing the control plane load.
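The footprint improvement metric reduces to simple arithmetic. As a single-target illustration, the Fig. 1 numbers ($F_C = 13$ without offloading, $F_C = 11$ with it) give an improvement of about 15.4%; the evaluated metric instead sums $F_C$ over all destination switches. The function name is hypothetical.

```python
def footprint_improvement(f_c_p4bft: float, f_c_soa: float) -> float:
    """Footprint improvement in percent: (1 - F_C^P4BFT / F_C^SoA) * 100."""
    if f_c_soa <= 0:
        raise ValueError("the SoA footprint must be positive")
    return (1.0 - f_c_p4bft / f_c_soa) * 100.0


# Single-target numbers from Fig. 1: 13 packets in (a) versus 11 in (b) and (c).
print(round(footprint_improvement(11, 13), 1))  # 15.4
```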
[Fig. 2 plot: Footprint Improvement [%] (compared to [5]-[7]) vs. Number of Switches in the Topology (8, 24, 40, 56, 128). Series: P4BFT-capable - 1 Random; P4BFT-capable - 25% Random; P4BFT-capable - 50% Random; P4BFT-capable - 75% Random; P4BFT-capable - 100%.]
Fig. 2. Packet load improvement of P4BFT over the reference works [5]–[7] for 5000 randomly generated network topologies per scenario, with 7 controllers distributed into 3 disjoint and randomly placed clusters. In addition to the 100% coverage, where each node may be considered a P4BFT processing node, we include scenarios where only a random subset of the available nodes in the infrastructure (a single node, or 25%, 50% or 75% of all nodes) is P4BFT-enabled. Thus, even in topologies with limited programmable data plane resources, i.e., in brownfield scenarios involving OpenFlow/NETCONF+YANG non-P4 configuration targets, P4BFT offers substantial advantages over the existing SoA.
Fig. 3 (a) and (b) portray how the footprint improvement scales with the number of controllers and disjoint clusters. P4BFT’s footprint efficiency generally benefits from a higher number of controller instances. Controller clusters, on the other hand, aggregate replicas behind the same edge switch. Thus, with a higher number of disjoint clusters, the degree of aggregation and hence the total footprint improvement decrease.
C. Processing and Reconfiguration Delay
Fig. 4 depicts the processing delay incurred in the processing node for a single client request. The delay corresponds to the P4 pipeline execution time spent on identifying a correct controller message, comprising i) hash computation over controller messages; ii) incrementing the counters for the computed hash; iii) signing the correct packet; and iv) propagating it to the correct egress port. When using the P4-enabled SmartNIC, P4BFT decreases the processing time by two orders of magnitude compared to the bmv2 software target.
[Fig. 3: Footprint Improvement [%] (compared to [5]–[7]) for Fat-Tree (k=4) and Internet2; (a) vs. Number of Replicated Controller Instances (5, 9, 13, 17); (b) vs. Number of Disjoint Controller Clusters (1, 7, 13, 17)]
Fig. 3. The impact of (a) controllers and (b) disjoint controller clusters on the control plane load footprint in Internet2 and Fat-Tree (k = 4) topologies for 5000 randomized controller placements each. (a) randomizes the placement but fixes the no. of disjoint clusters to 3; (b) randomizes the no. of disjoint clusters over {1, 7, 13, 17} but fixes the no. of controllers to 17. The resulting footprint improvement grows with the number of controllers and decreases with the number of disjoint clusters.

[Fig. 4: Cumulative Probability vs. P4BFT Switch Processing Delay [µs] for Netronome Agilio CX 10GbE and bmv2 P4 Software Switch]
Fig. 4. The CDF of processing delays imposed in a P4BFT processing node for a scenario including 5 controller instances. Three correct packets, and thus three P4 pipeline executions, are necessary to confirm the payload correctness when tolerating 2 Byzantine controller failures.

Fig. 5 depicts the total reconfiguration delay imposed in the SoA and P4BFT designs for (w1, w2) = (1, 1) (ref. Eq. 3). It measures the time elapsed from issuing a switch reconfiguration request until the correct controller message is determined and applied at the destination. Related works process the reconfiguration messages stemming from the controller replicas in the destination target, with their control flows traversing shortest paths in all cases. On average, P4BFT's reconfiguration delay is comparable with that of related works, while the overall control plane footprint is substantially improved.
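The trade-off between SoA and P4BFT reconfiguration delays can be illustrated with a simple critical-path calculation. The sketch assumes additive link delays, ignores processing time in the nodes, and assumes that a message is accepted once F_M + 1 matching copies from correct replicas have arrived at the comparing node; it is an illustration, not the delay model of Eq. 2. For SoA designs the comparing node is the target itself, so the forwarding term is zero. The delay values in the example are hypothetical.

```python
import heapq

def reconfiguration_delay(correct_to_proc, proc_to_target, f_m):
    """Time until the target switch holds a confirmed controller message (assumed model).

    correct_to_proc: delays (ms) from each correct replica to the comparing node.
    proc_to_target: delay (ms) from the comparing node to the target switch;
                    0 when the target compares itself, as in SoA designs.
    f_m: number of tolerated Byzantine replicas (f_m + 1 matching copies needed).
    """
    needed = f_m + 1
    if len(correct_to_proc) < needed:
        raise ValueError("not enough correct replicas to confirm a message")
    # The decision is taken when the (f_m + 1)-th correct copy arrives.
    decision_time = heapq.nsmallest(needed, correct_to_proc)[-1]
    return decision_time + proc_to_target

soa = reconfiguration_delay([4, 5, 6, 9, 10], proc_to_target=0, f_m=2)
p4bft = reconfiguration_delay([1, 2, 3, 7, 8], proc_to_target=5, f_m=2)
print(soa, p4bft)  # 6 8
```

Here the processing node that saves control plane load sits farther from the target, so the reconfiguration completes 2 ms later than with comparison at the destination, which is the kind of occasionally longer critical path the Fig. 5 results attribute to P4BFT.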
[Fig. 5: Cumulative Probability vs. Switch Reconfiguration Delay [ms] for SoA ([5]–[7]) and P4BFT; secondary axis: Footprint Improvement [%] (compared to [5]–[7]), with Mean Improvement]
Fig. 5. CDFs of the time taken to configure randomly selected switches in SoA and P4BFT environments for the Internet2 topology, with 10 random controller placements for 5 replicas and 1700 individual requests per placement. SoA works [5]–[7] collect, compare and apply the controllers' reconfiguration messages in the destination switch, thus effectively minimizing the reconfiguration delay at all times. P4BFT, on the other hand, may occasionally favor footprint minimization over the incurred reconfiguration delay and thus impose a longer critical path, leading to slower reconfigurations. On average, however, P4BFT imposes comparable reconfiguration delays at a much higher footprint improvement (depicted in blue), with a mean of 38% and best and worst cases of 60% and 19.3%, respectively, for the evaluated placements.

D. Optimization Procedure in Reassigner
1) Impact of optimization objectives: Fig. 6 depicts the Pareto frontier of optimal processing node assignments w.r.t. the objectives presented in Section IV-B: the total control plane footprint (minimized as per Eq. 1) and the reconfiguration delay (minimized as per Eq. 2). From the total solution space, depending on the weight prioritization in Eq. 3, either the (26.0, 3.0) or the (28.0, 2.0) solution can be considered optimal. Comparable works implicitly minimize the incurred reconfiguration delay but fail to consider the control plane load. Hence, they prefer the (30.0, 2.0) solution (encircled in red), which imposes a higher footprint than (28.0, 2.0) at the same delay.
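The Pareto and weighting argument can be checked on the three solutions named above. The sketch assumes that Eq. 3 is a plain weighted sum w1 · footprint + w2 · delay of the two objectives; the actual formulation may normalize or scale the terms differently, and the function names are hypothetical.

```python
def pareto_front(solutions):
    """(footprint, delay) pairs not dominated by another pair; both are minimized."""
    return [
        s for s in solutions
        if not any(o[0] <= s[0] and o[1] <= s[1] and o != s for o in solutions)
    ]

def weighted_choice(solutions, w1, w2):
    """Assumed form of Eq. 3: minimize w1 * footprint + w2 * delay."""
    return min(solutions, key=lambda s: w1 * s[0] + w2 * s[1])

# (total control plane footprint in packets, worst-case reconfiguration delay), from Fig. 6
solutions = [(26.0, 3.0), (28.0, 2.0), (30.0, 2.0)]
print(pareto_front(solutions))           # [(26.0, 3.0), (28.0, 2.0)]
print(weighted_choice(solutions, 1, 1))  # (26.0, 3.0)
print(weighted_choice(solutions, 1, 3))  # (28.0, 2.0)
```

With (w1, w2) = (1, 1) the footprint-lean (26.0, 3.0) wins, while a stronger delay weight selects (28.0, 2.0). A delay-only objective scores (28.0, 2.0) and (30.0, 2.0) identically and therefore cannot tell the footprint-wasteful (30.0, 2.0) apart from the better option.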
[Fig. 6: Worst-Case Reconfig. Delay vs. Total Control-Plane Footprint (No. Packets)]
Fig. 6. Pareto frontier of P4BFT's solution space for the topology presented in Fig. 1. The comparable works tend to minimize the incurred reconfiguration delay but ignore the imposed control plane load. Hence, [5]–[7] select (30.0, 2.0) as the optimal solution (encircled in red), while P4BFT selects (26.0, 3.0) or (28.0, 2.0), thus minimizing the total overhead as per Eq. 3.
2) ILP solution time - impact of topology, number of controllers and disjoint clusters: Fig. 7 (a) depicts the solution time of the optimization procedure for random topologies with an average network degree of 4 and a fixed number of randomly placed controllers. The solution time scales with the number of switches, peaking at 420 ms for large 128-switch topologies. The reassignment procedure is executed only on rare events: during network bootstrapping, upon detection of a malicious/failed controller, and following a switch/link failure. Thus, we consider the observed solution time short and viable for online mapping. Fig. 7 (b) depicts how the ILP solution time scales with the number of active controllers: the lower the number of active controllers, the shorter the solution time. In the "Fixed Clusters" case, each controller is placed in its own disjoint cluster (the worst case for the optimization).
[Fig. 7: panels (a) and (b)]
Fig. 7. (a) depicts the impact of the network topology size on the ILP solution time for random topologies. (b) depicts the impact of the number of controllers and of cluster disjointness for the Internet2 topology. The results are averaged over 5000 per-scenario iterations. The higher the cluster aggregation of controllers, the lower the ILP solution time. "Fixed Clusters" considers the worst case, where each controller is placed randomly, but disjointly w.r.t. the other controller instances. The ILP solution time scales with the number of deployed switches and controllers. We used the Gurobi 8.1 optimization framework, configured to run multiple solvers simultaneously on multiple threads, and took the result of whichever solver finished first.
The "Random Clusters" case considers a typical clustering scenario, where between one and three clusters are deployed, each comprising a uniform number of controller instances. The higher the cluster aggregation, the lower the ILP solution time.
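The Reassigner itself solves an ILP with Gurobi; the sketch below instead enumerates candidate processing nodes for a single target switch by brute force, which is only practical for toy instances. It assumes a hop-count model (footprint as packet transmissions summed over links, worst-case delay as the slowest replica's hop count to the processing node plus the hop count to the target) and ignores any further constraints the actual ILP may contain. Replicas behind the same switch are treated as one cluster with a multiplicity, so each candidate is evaluated over distinct clusters rather than individual replicas, loosely echoing why stronger aggregation shortens the solution time. `choose_processing_node` and the topology are hypothetical.

```python
from collections import Counter, deque

def hop_distances(graph, source):
    """Breadth-first search: hop count from `source` to every reachable switch."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def choose_processing_node(graph, replica_switches, target, candidates, w1=1.0, w2=1.0):
    """Hypothetical brute-force stand-in for one target of the Reassigner's ILP.

    Returns (processing_node, footprint, worst_case_delay) minimizing
    w1 * footprint + w2 * worst_case_delay over all candidate processing nodes.
    """
    clusters = Counter(replica_switches)  # replicas behind one switch = one cluster
    best = None
    for p in candidates:
        d = hop_distances(graph, p)
        # one packet per replica towards p, then a single packet from p to the target
        footprint = sum(count * d[s] for s, count in clusters.items()) + d[target]
        # slowest cluster's hop count to p, plus p -> target (hops as a delay proxy)
        delay = max(d[s] for s in clusters) + d[target]
        cost = w1 * footprint + w2 * delay
        if best is None or cost < best[0]:
            best = (cost, p, footprint, delay)
    return best[1:]

# 5-switch line s0 - s1 - s2 - s3 - s4; two replicas behind s0, one behind s1.
line = {f"s{i}": [] for i in range(5)}
for i in range(4):
    line[f"s{i}"].append(f"s{i + 1}")
    line[f"s{i + 1}"].append(f"s{i}")
replicas = ["s0", "s0", "s1"]
print(choose_processing_node(line, replicas, "s4", list(line)))  # ('s1', 5, 4)
print(choose_processing_node(line, replicas, "s4", ["s4"]))      # ('s4', 11, 4)
```

Restricting the candidates to the target switch reproduces comparison at the destination: the same worst-case delay of 4 hops, but a footprint of 11 packets instead of 5.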
VI. CONCLUSION
P4BFT introduces a switch control-plane/data-plane co-design capable of identifying malicious controllers while simultaneously minimizing the control plane footprint. By merging the control channels in P4-enabled processing nodes, P4BFT lowers the control plane footprint compared to existing designs. In a hardware-based data plane, offloading packet processing from the general-purpose CPU to the data-plane NPU additionally decreases the request processing time. Given the low solution time, the presented ILP formulation is viable for online execution. While we focused on an SDN scenario here, future work should consider transferring the P4BFT concept to other application domains, including stateful web applications and critical industrial control systems.
ACKNOWLEDGMENT
This work has received funding from the European Commission's H2020 research and innovation programme under grant agreement no. 780315 SEMIoTICS and from the German Research Foundation (DFG) under grant number KE 1863/8-1. We are grateful to Cristian Bermudez Serna, Dr. Johannes Riedl and the anonymous reviewers for their useful feedback and comments.
REFERENCES
[1] H. Howard, M. Schwarzkopf, A. Madhavapeddy, and J. Crowcroft, “Raft refloated: Do we have consensus?” ACM SIGOPS Operating Systems Review, vol. 49, no. 1, 2015.
[2] J. Medved, R. Varga, A. Tkacik, and K. Gray, “OpenDaylight: Towards a model-driven SDN controller architecture,” in Proceedings of IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks 2014. IEEE, 2014, pp. 1–6.
[3] P. Berde, M. Gerola, J. Hart, Y. Higuchi, M. Kobayashi, T. Koide, B. Lantz, B. O’Connor, P. Radoslavov, W. Snow et al., “ONOS: Towards an open, distributed SDN OS,” in Proceedings of the third workshop on Hot topics in software defined networking. ACM, 2014, pp. 1–6.
[4] C. Copeland et al., “Tangaroa: A Byzantine Fault Tolerant Raft,” http://www.scs.stanford.edu/14au-cs244b/labs/projects/copeland_zhong.pdf, [Accessed March-2019].
[5] H. Li, P. Li, S. Guo, and A. Nayak, “Byzantine-resilient secure software-defined networks with multiple controllers in cloud,” IEEE Transactions on Cloud Computing, vol. 2, no. 4, pp. 436–447, 2014.
[6] P. M. Mohan, T. Truong-Huu, and M. Gurusamy, “Primary-backup controller mapping for Byzantine fault tolerance in software defined networks,” in GLOBECOM 2017 - 2017 IEEE Global Communications Conference. IEEE, 2017, pp. 1–7.
[7] E. Sakic, N. Ðerić, and W. Kellerer, “MORPH: An adaptive framework for efficient and Byzantine fault-tolerant SDN control plane,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 10, pp. 2158–2174, 2018.
[8] L. Schiff, S. Schmid, and P. Kuznetsov, “In-band synchronization for distributed SDN control planes,” ACM SIGCOMM Computer Communication Review, vol. 46, no. 1, pp. 37–43, 2016.
[9] A. S. Muqaddas, A. Bianco, P. Giaccone, and G. Maier, “Inter-controller traffic in ONOS clusters for SDN networks,” in 2016 IEEE International Conference on Communications (ICC). IEEE, 2016, pp. 1–6.
[10] E. Sakic and W. Kellerer, “BFT protocols for heterogeneous resource allocations in distributed SDN control plane,” in 2019 IEEE International Conference on Communications (IEEE ICC’19), Shanghai, P.R. China, 2019.
[11] E. Sakic and W. Kellerer, “Response time and availability study of RAFT consensus in distributed SDN control plane,” IEEE Transactions on Network and Service Management, vol. 15, no. 1, 2018.
[12] University of Surrey - 5G Innovation Centre, “5G Whitepaper: The Flat Distributed Cloud (FDC) 5G Architecture Revolution,” 2016.
[13] H. T. Dang, M. Canini, F. Pedone, and R. Soulé, “Paxos made switch-y,” ACM SIGCOMM Computer Communication Review, vol. 46, no. 2, pp. 18–24, 2016.
[14] Y. Zhang, B. Han, Z.-L. Zhang, and V. Gopalakrishnan, “Network-assisted raft consensus algorithm,” in Proceedings of the SIGCOMM Posters and Demos. ACM, 2017, pp. 94–96.
[15] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis, “In-network computation is a dumb idea whose time has come,” in Proceedings of the 16th ACM Workshop on Hot Topics in Networks. ACM, 2017, pp. 150–156.
[16] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese et al., “P4: Programming protocol-independent packet processors,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 3, pp. 87–95, 2014.
[17] M. Eischer and T. Distler, “Scalable byzantine fault tolerance on heterogeneous servers,” in 2017 13th European Dependable Computing Conference (EDCC). IEEE, 2017, pp. 34–41.
[18] Internet2 Consortium, “Internet2 Network Infrastructure Topology,” https://www.internet2.edu/media_files/422, [Accessed March-2019].
... • Reconfigurability: the parser and the processing logic can be redefined in the field. Variations [54][55][56][57][58][59][60][61][62] Collectors and Solutions [63][64][65][66][67] Congestion Control [68][69][70][71][72][73][74][75][76] Measurements AQM [99][100][101][102][103][104][105][106][107][108][109] QoS and TM [110][111][112][113][114] Multicast [115][116][117] Load Balancing [118][119][120][121][122][123][124][125][126] Caching [127][128][129][130][131][132][133][134][135][136] Telecom Services [137][138][139][140][141][142][143][144][145][146] Contentcentric Networking [147][148][149][150][151][152] Consensus [153][154][155][156][157][158][159][160] Machine Learning [161][162][163][164][165][166] Miscellaneous [167][168][169][170][171][172][173][174][175] Aggregation [176][177][178][179] Service Automation [180,181] Heavy Hitter [182][183][184][185][186][187][188][189][190] Cryptography [191][192][193][194][195] Anonymity [196][197][198][199][200] Access Control [201][202][203][204][205][206][207][208] Attacks and Defenses Troubleshoot [230][231][232][233][234] Verification [235][236][237][238][239][240][241][242][243] • Protocol independence: the switch is protocol-agnostic. The programmer defines the protocols, the parser, and the operations to process the headers. ...
... It processes a large class of distributed transactions in a single round trip, without any additional coordination between shards and replicas. Sakic et al. [159] proposed P4 Byzantine Fault Tolerance (P4BFT), a system that is based on BFT-enabled SDN, where controllers act as replicated state machines. The system offloads the comparison of controllers' outputs required for correct BFT operations to programmable switches. ...
... Others Eris [155] Novel P4BFT [159] BFT N/A [156] Raft × Unordered and completely asynchronous networks require the full implementation and complexity of Paxos. NOPaxos suggests that the communication layer should provide a new Ordered Unreliable Multicast (OUM) primitive; that is, there is a guarantee that receivers will process the multicast messages in the same order, though messages can be lost. ...
Article
Full-text available
Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols. This closed-design paradigm has limited the capability of the switches to proprietary implementations which are hard-coded by vendors, inducing a lengthy, costly, and inflexible process. Recently, data plane programmability has attracted significant attention from both the research community and the industry, permitting operators and programmers in general to run customized packet processing functions. This open-design paradigm is paving the way for an unprecedented wave of innovation and experimentation by reducing the time of designing, testing, and adopting new protocols; enabling a customized, top-down approach to develop network applications; providing granular visibility of packet events defined by the programmer; reducing complexity and enhancing resource utilization of the programmable switches; and drastically improving the performance of applications that are offloaded to the data plane. Despite the impressive advantages of programmable data plane switches and their importance in modern networks, the literature has been missing a comprehensive survey. To this end, this paper provides a background encompassing an overview of the evolution of networks from legacy to programmable, describing the essentials of programmable switches, and summarizing their advantages over Software-defined Networking (SDN) and legacy devices. The paper then presents a unique, comprehensive taxonomy of applications developed with P4 language; surveying, classifying, and analyzing more than 200 articles; discussing challenges and considerations; and presenting future perspectives and open research issues.
... Control [68][69][70][71][72][73][74][75][76] Measurements AQM [99][100][101][102][103][104][105][106][107][108][109] QoS and TM [110][111][112][113][114] Multicast [115][116][117] Load Balancing [118][119][120][121][122][123][124][125][126] Caching [127][128][129][130][131][132][133][134][135][136] Telecom Services [137][138][139][140][141][142][143][144][145][146] Contentcentric Networking [147][148][149][150][151][152] Consensus [153][154][155][156][157][158][159][160] Machine Learning [161][162][163][164][165][166] Miscellaneous [167][168][169][170][171][172][173][174][175] Aggregation [176][177][178][179] Service Automation [180,181] Heavy Hitter [182][183][184][185][186][187][188][189][190] Cryptography [191][192][193][194][195] Anonymity [196][197][198][199][200] Access Control [201][202][203][204][205][206][207][208] Attacks and Defenses Troubleshoot [230][231][232][233][234] Verification [235][236][237][238][239][240][241][242][243] • Protocol independence: the switch is protocol-agnostic. The programmer defines the protocols, the parser, and the operations to process the headers. ...
... It processes a large class of distributed transactions in a single round trip, without any additional coordination between shards and replicas. Sakic et al. [159] proposed P4 Byzantine Fault Tolerance (P4BFT), a system that is based on BFT-enabled SDN, where controllers act as replicated state machines. The system offloads the comparison of controllers' outputs required for correct BFT operations to programmable switches. ...
... Others Eris [155] Novel P4BFT [159] BFT N/A [156] Raft × Unordered and completely asynchronous networks require the full implementation and complexity of Paxos. NOPaxos suggests that the communication layer should provide a new Ordered Unreliable Multicast (OUM) primitive; that is, there is a guarantee that receivers will process the multicast messages in the same order, though messages can be lost. ...
Preprint
Full-text available
Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols. This closed-design paradigm has limited the capability of the switches to proprietary implementations which are hardcoded by vendors, inducing a lengthy, costly, and inflexible process. Recently, data plane programmability has attracted significant attention from both the research community and the industry, permitting operators and programmers in general to run customized packet processing function. This open-design paradigm is paving the way for an unprecedented wave of innovation and experimentation by reducing the time of designing, testing, and adopting new protocols; enabling a customized, top-down approach to develop network applications; providing granular visibility of packet events defined by the programmer; reducing complexity and enhancing resource utilization of the programmable switches; and drastically improving the performance of applications that are offloaded to the data plane. Despite the impressive advantages of programmable data plane switches and their importance in modern networks, the literature has been missing a comprehensive survey. To this end, this paper provides a background encompassing an overview of the evolution of networks from legacy to programmable, describing the essentials of programmable switches, and summarizing their advantages over Software-defined Networking (SDN) and legacy devices. The paper then presents a unique, comprehensive taxonomy of applications developed with P4 language; surveying, classifying, and analyzing more than 150 articles; discussing challenges and considerations; and presenting future perspectives and open research issues.
... Downstream nodes need to parse only the UPC to make forwarding decisions. [485] 2017 -Sankaran et al. [486] 2020 -Zang et al. [487] 2017 bmv2 Dang et al. [488,489] 2016/20 Tofino [490] P4BFT [491,492] 2019 bmv2, Netronome SwiShmem [493] 2020 -SC-BFT [494] 2020 bmv2 ...
... P4BFT [491,492] introduces a consensus mechanism against buggy or malicious control plane instances. The controller responses are sent to trustworthy instances which compare the responses and establish consensus, e.g., by choosing the most common response. ...
Preprint
With traditional networking, users can configure control plane protocols to match the specific network configuration, but without the ability to fundamentally change the underlying algorithms. With SDN, the users may provide their own control plane, that can control network devices through their data plane APIs. Programmable data planes allow users to define their own data plane algorithms for network devices including appropriate data plane APIs which may be leveraged by user-defined SDN control. Thus, programmable data planes and SDN offer great flexibility for network customization, be it for specialized, commercial appliances, e.g., in 5G or data center networks, or for rapid prototyping in industrial and academic research. Programming protocol-independent packet processors (P4) has emerged as the currently most widespread abstraction, programming language, and concept for data plane programming. It is developed and standardized by an open community and it is supported by various software and hardware platforms. In this paper, we survey the literature from 2015 to 2020 on data plane programming with P4. Our survey covers 497 references of which 367 are scientific publications. We organize our work into two parts. In the first part, we give an overview of data plane programming models, the programming language, architectures, compilers, targets, and data plane APIs. We also consider research efforts to advance P4 technology. In the second part, we analyze a large body of literature considering P4-based applied research. We categorize 241 research papers into different application domains, summarize their contributions, and extract prototypes, target platforms, and source code availability.
... In the study of Lin et al. [13], a practical collaboration infrastructure for 5G network slice broker is designed, where the core challenge is the consensus protocol to guarantee the security and performance of the overall system. By solving the consensus problem, many related applications can be realized, such as the adaptive weighted replication [14,15], information retrieval [16,17], and the flight control system [18,19]. In addition, the consensus problem has also been studied and widely used in various fields such as blockchain and IoT [20,21]. ...
Article
Full-text available
The continuous development of fifth-generation (5G) networks is the main driving force for the growth of Internet of Things (IoT) applications. It is expected that the 5G network will greatly expand the applications of the IoT, thereby promoting the operation of cellular networks, the security and network challenges of the IoT, and pushing the future of the Internet to the edge. Because the IoT can make anything in anyplace be connected together at any time, it can provide ubiquitous services. With the establishment and use of 5G wireless networks, the cellular IoT (CIoT) will be developed and applied. In order to provide more reliable CIoT applications, a reliable network topology is very important. Reaching a consensus is one of the most important issues in providing a highly reliable CIoT design. Therefore, it is necessary to reach a consensus so that even if some components in the system is abnormal, the application in the system can still execute correctly in CIoT. In this study, a protocol of consensus is discussed in CIoT with dual abnormality mode that combines dormant abnormality and malicious abnormality. The protocol proposed in this research not only allows all normal components in CIoT to reach a consensus with the minimum times of data exchange, but also allows the maximum number of dormant and malicious abnormal components in CIoT. In the meantime, the protocol can make all normal components in CIoT satisfy the constraints of reaching consensus: Termination, Agreement, and Integrity.
... In-network computing has been applied to several domains, including packet aggregation [28], databases [18], machine-learning acceleration [27], data analytics [19], network telemetry [5], and even consensus protocols [11]. This paradigm can also result in significant energy savings [34] and even offer novel ways to defend against byzantine behaviors [26]. ...
Preprint
Full-text available
Network appliances continue to offer novel opportunities to offload processing from computing nodes directly into the data plane. One popular concern of network operators and their customers is to move data increasingly faster. A common technique to increase data throughput is to compress it before its transmission. However, this requires compression of the data -- a time and energy demanding pre-processing phase -- and decompression upon reception -- a similarly resource consuming operation. Moreover, if multiple nodes transfer similar data chunks across the network hop (e.g., a given pair of switches), each node effectively wastes resources by executing similar steps. This paper proposes ZipLine, an approach to design and implement (de)compression at line speed leveraging the Tofino hardware platform which is programmable using the P4_16 language. We report on lessons learned while building the system and show throughput, latency and compression measurements on synthetic and real-world traces, showcasing the benefits and trade-offs of our design.
... Recently, implementations of consensus algorithms in networking hardware (e.g., those of Paxos [30], [31], Raft [32] and Byzantine agreement [33], [34]) have started gaining traction. Dang et al. [30], [31] portray throughput, latency and flexibility benefits of network-supported consensus execution at line speed. ...
Conference Paper
Full-text available
Centralized Software Defined Networking (SDN) controllers and Network Management Systems (NMS) introduce the issue of controller as a single-point of failure (SPOF). The SPOF correspondingly motivated the introduction of distributed controllers, with replicas assigned into clusters of controller instances replicated for purpose of enabling high availability. The replication of the controller state relies on distributed consensus and state synchronization for correct operation. Recent works have, however, demonstrated issues with this approach. False positives in failure detectors deployed in replicas may result in oscillating leadership and control plane unavailability. In this paper, we first elaborate the problematic scenario. We resolve the related issues by decoupling failure detector from the underlying signaling methodology and by introducing event agreement as a necessary component of the proposed design. The effectiveness of the proposed model is validated using an exemplary implementation and demonstration in the problematic scenario. We present an analytic model to describe the worst- case delay required to reliably agree on replica failures. The effectiveness of the analytic formulation is confirmed empirically using varied cluster configurations in an emulated environment. Finally, we discuss the impact of each component of our design on the replica failure- and recovery-detection delay, as well as on the imposed communication overhead.
... Given the user-configurable parameter of required tolerated failures F A and F M , Reassigner reports to processing nodes the number of necessary matching messages that must be collected prior to marking a controller message as correct. The details of the P4 control flow, as well as the match-action pairs of P4 tables, are presented in [4,5]. ...
Article
Supporting byzantine fault tolerance (BFT) in distributed software-defined networks (SDNs) may lead to increased consensus delay and traffic load since all messages should be verified and multicasted among controllers. To address this problem, we propose a switch-centric byzantine fault tolerant (SC-BFT) mechanism, in which key BFT functions (e.g., message authentication and comparison) are implemented at the programmable switches. Thus, SC-BFT can accelerate the consensus procedure and mitigate the communication overhead. We implemented SC-BFT at BMv2 using P4. Analytical and simulation results show that SC-BFT provides 80% reduced response time compared to conventional BFT consensus mechanisms with significantly reduced communication overhead.
Conference Paper
Full-text available
Distributed Software Defined Networking (SDN) controllers aim to solve the issue of single-point-of-failure and improve the scalability of the control plane. Byzantine and faulty controllers, however, may enforce incorrect configurations and thus endanger the control plane correctness. Multiple Byzantine Fault Tolerance (BFT) approaches relying on Replicated State Machine (RSM) execution have been proposed in the past to cater for this issue. The scalability of such solutions is, however, limited. Additionally, the interplay between progressing the state of the distributed controllers and the consistency of the external reconfigurations of the forwarding devices has not been thoroughly investigated. In this work, we propose an agreement-and-execution group-based approach to increase the overall through-put of a BFT-enabled distributed SDN control plane. We adapt a proven sequencing-based BFT protocol, and introduce two optimized BFT protocols that preserve the uniform agreement, causality and liveness properties. A state-hashing approach which ensures causally ordered switch reconfigurations is proposed, that enables an opportunistic RSM execution without relying on strict sequencing. The proposed designs are implemented and validated for two realistic topologies, a path computation application and a set of KPIs: switch reconfiguration (response) time, signaling overhead, and acceptance rates. We show a clear decrease in the system response time and communication overhead with the proposed models, compared to a state-of-the-art approach.
Article
Full-text available
Current approaches to tackling the single point of failure in SDN entail a distributed operation of SDN controller instances. Their state synchronization process is reliant on the assumption of a correct decision-making in the controllers. Successful introduction of SDN in the critical infrastructure networks also requires catering to the issue of unavailable, unreliable (e.g. buggy) and malicious controller failures. We propose MORPH, a framework tolerant to unavailability and Byzantine failures, that distinguishes and localizes faulty controller instances and appropriately reconfigures the control plane. Our controller-switch connection assignment leverages the awareness of the source of failure to optimize the number of active controllers and minimize the controller and switch reconfiguration delays. The proposed re-assignment executes dynamically after each successful failure identification. We require 2FM +FA+1 controllers to tolerate FM malicious and FA availability-induced failures. After a successful detection of FM malicious controllers, MORPH reconfigures the control plane to require a single controller message to forward the system state. Next, we outline and present a solution to the practical correctness issues related to the statefulness of the distributed SDN controller applications, previously ignored in the literature. We base our performance analysis on a resource-aware routing application, deployed in an emulated testbed comprising up to 16 controllers and up to 34 switches, so to tolerate up to 5 unique Byzantine and additional 5 availability-induced controller failures (a total of 10 unique controller failures). We quantify and highlight the dynamic decrease in the packet and CPU load and the response time after each successful failure detection.
Conference Paper
Full-text available
Programmable data plane hardware creates new opportunities for infusing intelligence into the network. This raises a fundamental question: what kinds of computation should be delegated to the network? In this paper, we discuss the opportunities and challenges for co-designing data center distributed systems with their network layer. We believe that the time has finally come for offloading part of their computation to execute in-network. However, in-network computation tasks must be judiciously crafted to match the limitations of the network machine architecture of programmable devices. With the help of our experiments on machine learning and graph analytics workloads, we identify that aggregation functions raise opportunities to exploit the limited computation power of networking hardware to lessen network congestion and improve the overall application performance. Moreover, as a proof-of-concept, we propose Daiet, a system that performs in-network data aggregation. Experimental results with an initial prototype show a large data reduction ratio (86.9%-89.3%) and a similar decrease in the workers' computation time.
Article
Full-text available
Software Defined Networking promises unprecedented flexibility and ease of network operations. While flexibility is an important factor when leveraging advantages of a new technology, critical infrastructure networks also have stringent requirements on network robustness and control plane delays. Robustness in the SDN control plane is realized by deploying multiple distributed controllers, formed into clusters for durability and fast-failover purposes. However, the effect of the controller clustering on the total system response time is not well investigated in current literature. Hence, in this work we provide a detailed analytical study of the distributed consensus algorithm RAFT, implemented in OpenDaylight and ONOS SDN controller platforms. In those controllers, RAFT implements the data-store replication, leader election after controller failures and controller state recovery on successful repairs. To evaluate its performance, we introduce a framework for numerical analysis of various SDN cluster organizations w.r.t. their response time and availability metrics. We use Stochastic Activity Networks for modeling the RAFT operations, failure injection and cluster recovery processes, and using real-world experiments, we collect the rate parameters to provide realistic inputs for a representative cluster recovery model. We also show how a fast rejuvenation mechanism for the treatment of failures induced by software errors can minimize the total response time experienced by the controller clients, while guaranteeing a higher system availability in the long-term.
Security in Software Defined Networks (SDNs) has been a major concern for their deployment. Byzantine threats in SDNs are harder to defend against, since control messages issued by a compromised controller look legitimate. Applying the traditional Byzantine Fault Tolerance approach to SDNs requires each switch to be mapped to 3f + 1 controllers to defend against f simultaneous controller failures. On the one hand, this approach overloads the controllers due to multiple requests from switches. On the other hand, it raises new challenges concerning the switch-controller mapping and determining the minimum number of controllers required in the network. In this paper, we present a novel primary-backup controller mapping approach in which a switch is mapped to only f + 1 primary and f backup controllers to defend against simultaneous Byzantine attacks on f controllers. We develop an optimization formulation that provides the switch-controller mapping solution and minimizes the total number of controllers required. We consider the controller processing capacity and the communication delay between switches and controllers as problem constraints. Our approach also facilitates capacity sharing of backup controllers when two switches use the same backup controller but do not need it simultaneously. We demonstrate the effectiveness of the proposed approach through numerical analysis. The results show that the proposed approach reduces the total number of controllers required by up to 50% compared to an existing scheme, while providing better load balancing among controllers with a fairness index of up to 0.92.
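The mapping above replaces 3f + 1 controllers per switch with f + 1 primaries plus f backups, i.e. 2f + 1 in total. The sketch below shows one plausible switch-side acceptance rule for such a mapping; it is our own illustration, not the cited paper's algorithm, and `decide` and `query_backups` are hypothetical names. It assumes correct controllers are deterministic (so they emit identical messages) and that at most f of the 2f + 1 mapped controllers are Byzantine. Under those assumptions, f + 1 identical primary messages must include one from a correct controller, and when the primaries disagree, the f backups bring the total to 2f + 1 messages, of which at least f + 1 are correct and identical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def decide(primary_msgs: Sequence[Hashable], f: int,
           query_backups: Callable[[], Sequence[Hashable]]) -> Hashable:
    """Hypothetical acceptance rule for an f+1 primary / f backup mapping.

    primary_msgs: the f+1 messages received from the primary controllers.
    f: number of simultaneously Byzantine controllers tolerated.
    query_backups: returns the f backup messages; called only on disagreement.
    """
    if len(primary_msgs) != f + 1:
        raise ValueError("expected exactly f+1 primary messages")
    if len(set(primary_msgs)) == 1:
        # f+1 matching messages include at least one from a correct controller.
        return primary_msgs[0]
    # Disagreement proves a fault: consult the f backups (2f+1 messages total).
    msgs = list(primary_msgs) + list(query_backups())
    value, votes = Counter(msgs).most_common(1)[0]
    # At least f+1 correct controllers agree; faulty ones supply at most f votes.
    if votes < f + 1:
        raise RuntimeError("no message gathered f+1 matching votes")
    return value
```

Because backups are contacted only when the primaries disagree, the fault-free case costs f + 1 messages per request rather than 3f + 1, which is the load reduction the abstract points to.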
Consensus is a fundamental problem in distributed computing. In this poster, we ask the following question: can we partially offload the execution of a consensus algorithm to the network to improve its performance? We argue for an affirmative answer by proposing a network-assisted implementation of the Raft consensus algorithm. Our approach reduces consensus latency, is failure-aware, and does not sacrifice correctness or scalability. To enable Raft-aware forwarding and quick responses, we use P4-based programmable switches and offload part of the Raft functionality to the switch. We demonstrate the efficacy of our approach and the performance improvements it offers via a prototype implementation.
Control planes of forthcoming Software-Defined Networks (SDNs) will be distributed: to ensure availability and fault tolerance, to improve load balancing, and to reduce overheads, modules of the control plane should be physically distributed. However, to guarantee consistent network operation, actions performed on the data plane by different controllers may need to be synchronized, which is a nontrivial task. In this paper, we propose a synchronization framework for control planes based on atomic transactions, implemented in-band on the data-plane switches. We argue that this in-band approach is attractive, as it keeps the failure scope local and does not require additional out-of-band coordination mechanisms. It allows us to realize fundamental consensus primitives in the presence of controller failures, and we discuss their applications for consistent policy composition and fault-tolerant control planes. Interestingly, by using part of the data plane configuration space as a shared memory and leveraging the match-action paradigm, we can implement our synchronization framework in today's standard OpenFlow protocol, and we report on our proof-of-concept implementation.