Content uploaded by Han Liu

Author content

All content in this area was uploaded by Han Liu on Apr 25, 2017

Content may be subject to copyright.

VIRTUAL TOPOLOGY PARTITIONING TOWARDS AN EFFICIENT FAILURE

RECOVERY OF SOFTWARE DEFINED NETWORKS

ALI MALIK1, BENJAMIN AZIZ1, CHIH-HENG KE2, HAN LIU1, MO ADDA1

1School of Computing University of Portsmouth, Portsmouth PO1 3HE, UK

2Department of Computer Science and Information Engineering, National Quemoy University, Taiwan

E-MAIL: {ali.al-bdairi, benjamin.aziz, han.liu, mo.adda}@port.ac.uk, smallko@gmail.com

Abstract:

Software Deﬁned Networking is a new networking paradigm

that has emerged recently as a promising solution for tackling the

inﬂexibility of the classical IP networks. The centralized approach

of SDN yields a broad area for intelligence to optimise the network

at various levels. Fault tolerance is considered one of the most cur-

rent research challenges that facing the SDN, hence, in this paper

we introduce a new method that computes an alternative paths re-

actively for centrally controlled networks like SDN. The proposed

method aims to reduce the update operation cost that the SDN

network controller would spend in order to recover from a single

link failure. Through utilising the principle of community detec-

tion, we deﬁne a new network model for the sake of improving the

network’s fault tolerance capability. An experimental study is re-

ported showing the performance of the proposed method. Based

on the results, some further directions are suggested in the context

of machine learning towards achieving further advances in this

research area.

Keywords:

Network Topology, Community Detection, Graph Theory, Soft-

ware Deﬁned Networking

1. Introduction

The Internet and networking system in general plays an es-

sential role in changing our life style through producing vari-

ous type of technologies that get involved in our daily activities

such as social media, economics and business. Networking de-

vices exchange data in a variety networks (e.g. wireless sensor

networks, the Internet-of-Things and Cloud networks) where

currently there are about 9 billion devices connected to the In-

ternet and this number is expected to more than double by 2020.

However, the today’s IP network infrastructures do not have

the ability to accommodate such a huge number of devices,

hence the Internet ossiﬁcation is highly expected [1]. One ef-

fective solution to tackle the ossiﬁcation issue is via replace

the complex and rigid networking systems by a programmable

networking instead.

Software-deﬁned networking (SDN) was resulted from a

long history of efforts that have been exerted for the purpose

of simplifying the computer networks management and control

[2]. In SDN the control plane has been moved out from the

data plane and placed in a central location usually called the

network controller or the network operating system. Due to

this decoupling, the network forwarding elements (e.g. routers

and switches) became a dummy devices, which typically dic-

tated by the network controller. In reality, the SDN controller is

either has a direct connection to the data plane forwarding ele-

ments or indirect way through which the controller will be able

to instruct those devices. The OpenFlow [3] protocol is the

most commonly used to establish the connection between the

data plane forwarding elements and the controller, so the con-

troller will be able to send the forwarding rules to each device

that lies on its domain. So far, SDN has gained much attention

of both academia and industry community and adopted by some

of well known pioneering companies like Deutsche Telekom,

Google, Microsoft, Verizon, and Yahoo [4]. Although SDN has

brought a signiﬁcant beneﬁts to the concept of networks, some

new challenges accompanied this innovation such as faults,

conﬁguration, security and performance [5]. Thus, our goal

in this paper is to eliminate the SDN drawbacks through miti-

gating the one of the most current challenges, which is the data

plane fault tolerance.

The remainder of this paper is organised as follows. In Sec-

tion 2 various techniques to solve the issue of SDN data plane

failures are presented. Section 3 and 4 illustrate our model and

the proposed framework. The experimental results of the eval-

uation are presented in section 5. Finally, the conclusion of this

paper and future work directions are provided in section 6.

2. Related work

The topic of SDN faults and recovery has been already stud-

ied, so we will discuss the relevant literature in this section.

Since the SDN has two separated planes (i.e. control and data),

hence each plane is susceptible to failure. Apart from control

plane failure, which is not the main focused in this work, the

recovery mechanism of data plane can be classiﬁed into pro-

tection and restoration techniques [6]. In protection method,

which also known as proactive, the ﬂow entries of the backup

paths are determined in early stage of failure, hence the effected

data packets will convey through the backup path directly at the

moment of failure. On the other hand, in restoration mecha-

nism, which also known as reactive, the backup path will be de-

termined by the controller after the occurrence of failure, thus

extra time is required to set up the discovered alternative path.

Authors in [7] and [8] have shown how a fast data plane recov-

ery can be achieved within the protection approach. However,

the cost of proactive mechanisms is high as it consumes the ca-

pacity of the Ternary Content Addressable Memory (TCAM)

[6], where the forwarding rules to be stored. In addition, there

is no guarantee that the pre-planned paths will be in a good

health once the primary paths will fail.

In contrast, efforts have been exerted to investigate the effec-

tiveness of the reactive approach. In this context, the authors in

[9] and [10] have shown how a fast restoration can be accom-

plished, however in both works the experimental topologies

were small scale (i.e. 6 and 14 nodes respectively). Further-

more, the processing time for setting up the chosen path was

ignored, which is a requirement in SDNs in order to re-routing

from the affected primary path to the backup one.

Unlike the previous studies, the authors in [11] produced

a new method for fast restoration by reducing the processing

time, which typically paid by the controller, through ﬁnding

the alternative path (from end-to-end) that has a minimum op-

eration requirement. One main drawback with this work is that

it does not guarantee the sequentiality of the health nodes in the

effected path to be in similar order on the alternative one. Addi-

tionally, there is a lack of information regarding the simulation

tool that has been used. We also noted that there was a mistake

with the ﬁrst network topology that used in their experiments,

since the authors mentioned that it contains 37 nodes and 57

links, while in reality their network contains a 56 links as it

missed the edge that connects between Hamburg and Frankfurt

according to the SNDlib library [12], hence their results could

be signiﬁcantly deviant from the expected one.

All the above issues motivated us to put a further efforts in

order to investigate a more possible solutions for the problem of

SDN fault tolerance. The next section will present the proposed

system model of this work.

3. System model

The most frequently used notations throughout this paper are

listed in Table 1.

TABLE 1. List of notations

Symbol Description

CThe set of cliques

cThe failed clique

rsSource router

rdDestination router

ric Any arbitrary router in the failed clique

rjc Any arbitrary router in the failed clique

Pmin Dijkstra’s shortest path in terms of number of hops

DGDijkstra for ﬁnding the shortest path based on the graph G

DcDijkstra for ﬁnding the shortest path based on the clique c

We utilised the undirected graph theory as a basis to model

the computer network topology. Generally, each simple graph

G= (V, E )consists of a set of vertices (i.e. nodes), V, as well

as a set of edges (i.e. links), E, which connect the nodes to one

another. The set of all links in Gcan be deﬁned as a 2-element

subset of nodes, E⊆V×V. We deﬁne a path P, from

source to destination, as a sequence of consecutive vertices

representing nodes or routers in the network1. The path starts

at the source router, rs, and ends with a destination router, rd,

with ric and rjc being any two adjacent routers along P:

P= (rs, . . . , ric, rj c, . . . , rd)

We deﬁne the set of all possible paths, Prs,rd, between any

source router rsand destination router rd, as the following set:

Prs,rd={P|(ﬁrst(P) = rs)∧(last(P) = rd)}

and the deﬁnition of ﬁrst and last is given as functions on any

general sequence (a1, . . . , an):

ﬁrst((a1, . . . , an)) = a1

last((a1, . . . , an)) = an

1We use the terms router and node interchangeably.

FIGURE 1. Community detection

Community detection (or sometimes known as cliques iden-

tiﬁcation) has been productively proposed as a solution to re-

solve various kinds of network-related problems including the

problem of network path optimisation. In this context, we will

use the concept of non-overlapping cliques as an approach to

optimise SDN restoration through accelerating the process of

failure recovery. By dividing the network’s graph Ginto a cer-

tain number of cliques, we are assuming that when the link

failure occurs then only one clique will suffer from the failure.

Meantime, the other cliques should be working ﬁne. It is highly

likely that most of the path’s links will be distributed over var-

ious cliques, hence at the link failure moment, the only clique

c, which includes the effected link, will be treated rather than

dealing with the whole graph Gin looking for an alternative

path from end-to-end.

Informally, Figure 1 depicts how the virtual cliques can be

extracted from the original network topology graph through

applying any community detection algorithm, where the

number of the extracted cliques is vary and typically depends

on the network topology and the used algorithm itself. In our

model, we deﬁne the set of cliques dividing the network graph

Gas C⊆G, where individual cliques c∈Care deﬁned as

follows:

c= (V0, E0)|V0⊆V∧E0⊆E

The deﬁnition of cliques is mutually exclusive, in other words:

∀c1, c2∈C|c1= (V1, E1)∧c2= (V2, E2)⇒(V1∩V2=

∅)∧(E1∩E2=∅)

We deﬁne a failed link (i.e. a 2-router sub-path), F, in a path

Pas follows:

F= (ric, rj c)| ∃ c:c= (V0, E 0)∧F∈E0

This deﬁnition of Fassumes only those cases where failed

links are always of an intra-clique type, i.e. it will never be the

case that the failed link connects two cliques, i.e.:

@F= (ric, rj c), c1= (V1, E1), c2= (V2, E2)|

c16=c2⇒ric ∈V1∧rjc ∈V2

Inter-clique failures will be the focus of future work.

We use the term longest-shortest path as the path that has

the maximum number of hops amongst the set of solutions

returned by applying Dijkstra’s algorithm [13] to our network

topology for ﬁnding the shortest path between every possible

two nodes in that network. To ﬁnd this longest-shortest path,

we deﬁne the special function LS as follows:

LS(PDset) = x, such that x∈ PDset and

∀y∈ PDset :len(y)≤len(x)

If there are more than one longest-shortest paths in PDset,

we pick one randomly. PDset itself represents the set of all

Dijkstra-based solutions for some network topology (V, E ):

PDset ={P| ∀ rs, rd∈V:P=D(Prs,rd)}

4. The Proposed Framework

From a high level view, Figure 2 illustrates the main com-

ponents of our proposed framework where the Fault Tolerance

Enhancer component is the primary contribution of this frame-

work. We discuss next in more detail the developed component

as well as the tools that we used in this framework.

FIGURE 2. Proposed framework components

4.1 SDN controller

SDN controller representing the network’s brain and the

most vital part, which is the place where the intelligence and

decision making resides. Presently, there are more than 30 dif-

ferent SDN controller offering from both academia (e.g. for

research purposes) and industry (e.g. for commercial use). Our

framework currently supports the POX controller [14], which

is a python-based open source SDN controller. We selected

the POX as it is more suitable for research purposes and also

more convenient for fast prototyping than the other available

ones [15]. The OpenFlow [3] protocol is used to communicate

between the control and data planes in order to gather statistics

from the data plane and carry the controller instructions to pro-

gram the data plane elements, whereas the set of POX APIs can

be utilised for developing various network applications.

4.2 Fault tolerance enhancer component

Currently, there are three main parts that consisting this com-

ponent as follows:

A.Topology parser: will be responsible for fetching the under-

lying network topology characteristics and build a topological

view with the aid of the POX openﬂow.discovery, which is an

already developed component. In order to represent the gained

network topology as a graph G, we utilised the NetworkX [16]

tool, which is a pure python package with a set of functions

that can be used to manipulate and simplify the network graphs.

B.Cliques producer: is responsible for virtually partitioning

the network topology graph Ginto C(i.e. sub-graphs) by

incorporating the well known community detection algorithm

Girvan and Newman [17] to produce the possible cliques

(with any size) on the basis of the network graph that acquired

from the topology parser. The densely connection between

the resulted clique’s vertices is the main feature of Girvan

and Newman that interesting to us, in other words, the strong

connection among the nodes in each clique could provide a

multiple alternative paths that would utilised at the failure

events. To do so, we have incorporated the igraph tool [18]

to our framework, which is a python open source library for

analysing and manipulating the graphs where the Girvan and

Newman algorithm is already implemented.

C.Route ﬁnder: is used to identify the paths for the network’s

ﬂows on the basis of the global view of the underlay topology,

which can be gained from the topology parser. in fact, the route

ﬁnder will be called on two occasions; the ﬁrst one when a new

packet arrives to the network, while the second one when the

failure occurs so a new path should be computed. For this pur-

pose, two algorithms have been developed to obtain the shortest

path based on the well known Dijkstra’s algorithm. We adopt

the hop count as an additive metric on which the Dijkstra’s al-

gorithm will be able to form the shortest path for any incoming

requests. The ﬁrst algorithm reﬂects the default action that per-

formed by the SDN controller at failure moments, which is to

erase the ﬂow entries of the effected path and then install the

rules of the backup one from the source router rsto destination

rd, the pseudo code is demonstrated in Algorithm 1.

Algorithm 1: First algorithm to ﬁnd the shortest path with

Dijkstra from End-to-End based on the Graph G

On Normal: Set Primary Path as Pmin ∈ P rs,rd

On Failure : Do the following procedure

1Prs,rd:= Prs,rd− {Pmin}

2Pmin := DG(Prs,rd)

On the other hand, the second algorithm will tackle the fail-

ure based on the effected clique c∈Crather than searching

from end-to-end based on the whole graph G, the pseudo code

of this algorithm is illustrated in Algorithm 2. Currently, the

Algorithm 2 is capable to manage the clique’s intra-link failure

and for the future we will extend the framework to include the

inter-link failure among the cliques.

Algorithm 2: Second algorithm to ﬁnd the shortest path

with Dijkstra on the basis of either Gor effected c

On Normal: Set Primary Path as Pmin ∈ P rs,rd

On Failure : Do the following procedure

1if F= (ric, rj c)then

2Pmin := Dc(Pric,rj c )

3end

The complexity of the two above algorithms are similar to

the Dijkstra, which is O(|V|+|E|log |V|). It also worth not-

ing here that both algorithms have the same strategy to obtain

the shortest path for the newly arriving packets, which is by

running the Dijkstra based on the network graph G. However,

they differ in respect of ﬁnding the alternative path to recover

from the failure, in which the Algorithm 2 will turn to apply the

Dijkstra on the crather than G.

5. Simulations and experimental results

In order to test the performance of the proposed framework,

we ﬁrst build it on top of POX controller as illustrated in Figure

2, then we used the popular SDN emulator Mininet [19] to eval-

uate the prototype of our proposed framework. As mentioned

earlier that SDN has splitted the network architecture into con-

trol plane, the brain, and data plane, the muscles. So, we have

FIGURE 3. Cliques-based network topology of Germany50, [12]

modelled the Germany50 from SNDlib [12] as an instance of

a real world network topology to act our data plane structure,

which contains 50 nodes and 88 edges. The black links in

Figure 3 represent the original Germany50 topology and we

added the colored layer (i.e. from c1 to c6) to demonstrate the

cliques that achieved after applying the Girvan and Newman

algorithm. According to the Germany50 topology, the longest-

shortest path lies between 1 and 13 through the route:(1, 5, 27,

22, 32, 42, 48, 50, 26, 13), there might be several longest-

shortest paths in the network topology and in such a case we

choose randomly. We attached two virtual hosts H1and H2

to the rsand rdof the longest-shortest path respectively in or-

der to simulate the scenario of packet injection. For now let’s

assume that the link (27, 22) fails. Consequently, Algorithm

1 will response to the failure by removing the failed-primary

path and install (1, 25, 47, 3, 39, 7, 41, 17, 40, 13) as an alter-

native shortest path to mask the failure. Since our selected path

passes through 4 cliques, which are c1,c2,c3 and c6, this

means only one clique (e.g. sub-graph) will be involved at the

moment of the intra link failure in the scenario of Algorithm

2, which will react ﬁrst by detecting the effected clique c(i.e.

c2 in our case) and thence ﬁnd a loop-free shortest path be-

tween the routers on both sides of the failed link, i.e. between

the two nodes of F. According to the given example, the sub-

path (27, 45, 22) will be returned by the second algorithm as

a quick solution based on c2 without taking into consideration

the other clique’s nodes as they will remain on the same settings

(untouched). Hence, it can be clearly seen that the second algo-

rithm does not guarantee the shortest path from end-to-end as

it works over a clique and not a whole graph. We are expecting

a signiﬁcant reduction in the duration of the recovery process

for two reasons, (I) because the proposed algorithm will con-

sider only one clique but not the whole graph and (II) due to

minimising the number of rules modiﬁcation. Figure 4 shows

the behaviour of the two algorithms, which is based on the se-

lected path (i.e. from 1 to 13), in both failure and non-failure

scenarios. We note that the two algorithms have nearly similar

FIGURE 4. Before and after failure measurements

durations for setting up the path (i.e. before failure scenario).

However unlike the Algorithm 1, Algorithm 2 has an extremely

positive impact in terms of enhancing the time cost of fault tol-

erance. The reduction rate can be calculated through compar-

ing the consumed time before and after the failure, which can

be arrived through the following formula:

R=Time BF −Time AF

Time BF ×100 (1)

Where Rrepresents the reduction rate, Time BF is the time

of setting up the primary path before the failure and Time AF

is the time of setting up the alternative path after the failure.

Based on (1) and according to the result of ﬁgure 4, the Rof

Algorithm 2 is 84.333%. Let us assume that len(rs, ..., rd) =

n, then the utilisation rate of Algorithm 2 can be measured

through (n−2

n)×100%, hence, the pre-installed rules utilisa-

tion percentage (based on our example case) will be 80%, this

ratio interprets the high reduction rate that gained by the algo-

rithm. As a result, the larger the value of n, the more utilisation

we obtain out of the selected primary path.

6. Conclusions

This paper provides a new approach for ameliorating the

fault tolerance mechanism of software-deﬁned networks. We

have deﬁned a new network model based on the set theory and

graph theory to represent the theoretical part. We also have

developed a new framework on the basis of the created model

to represent the practical side of this work. The experimen-

tal results, which have been carried out through a well-known

tools and emulation, demonstrate how the proposed method has

led to improving the performance of reactive failure recovery

through segmenting the network topology into a certain num-

ber of non-overlapping cliques. The proposed approach has not

been explored so far, which indicates that this paper ﬁrst time

suggests the partitioning as a technique towards enhancing the

restoration mode of the SDN’s fault tolerance.

In future, we will position the study in the setting of ma-

chine learning, towards achieving that the routing solution can

be dynamically adjusted according to the update of the statis-

tical details of topological data. In other words, we aim to in-

volve learning strategies in routing to achieve globally optimal

search towards ﬁnding the shortest path.

Acknowledgements

This work is supported by the College of Computer Science

and Information Technology at the University of Al-Qadisiyah

under the scholarship program of the Iraqi Ministry of Higher

Education and Scientiﬁc Research.

References

[1] Lin, P., Bi, J., Hu, H., Feng, T., & Jiang, X. (2011, Novem-

ber). A quick survey on selected approaches for preparing pro-

grammable networks. In Proceedings of the 7th Asian Internet

Engineering Conference (pp. 160-163). ACM.

[2] Feamster, N., Rexford, J., & Zegura, E. (2014). The road to

SDN: an intellectual history of programmable networks. ACM

SIGCOMM Computer Communication Review, 44(2), 87-98.

[3] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Pe-

terson, L., Rexford, J., ... & Turner, J. (2008). OpenFlow: en-

abling innovation in campus networks. ACM SIGCOMM Com-

puter Communication Review, 38(2), 69-74.

[4] Kreutz, D., Ramos, F. M., Esteves Verissimo, P., Esteve Rothen-

berg, C., Azodolmolky, S., & Uhlig, S. (2015). Software-deﬁned

networking: A comprehensive survey. Proceedings of the IEEE,

103 (1), 14-76.

[5] Wickboldt, J. A., De Jesus, W. P., Isolani, P. H., Both, C. B., Ro-

chol, J., & Granville, L. Z. (2015). Software-deﬁned network-

ing: management requirements and challenges. IEEE Commu-

nications Magazine, 53(1), 278-285.

[6] Akyildiz, I. F., Lee, A., Wang, P., Luo, M., & Chou, W. (2014).

A roadmap for trafﬁc engineering in SDN-OpenFlow networks.

Computer Networks, 71, 1-30.

[7] Kempf, J., Bellagamba, E., Kern, A., Jocha, D., Takcs, A., &

Skldstrm, P. (2012, June). Scalable fault management for Open-

Flow. In Communications (ICC), 2012 IEEE International Con-

ference on (pp. 6606-6610). IEEE.

[8] Sgambelluri, A., Giorgetti, A., Cugini, F., Paolucci, F., & Cas-

toldi, P. (2013). OpenFlow-based segment protection in Ethernet

networks. Journal of Optical Communications and Networking,

5(9), 1066-1075.

[9] Sharma, S., Staessens, D., Colle, D., Pickavet, M., & De-

meester, P. (2011, October). Enabling fast failure recovery in

OpenFlow networks. In Design of Reliable Communication Net-

works (DRCN), 2011 8th International Workshop on the (pp.

164-171). IEEE.

[10] Staessens, D., Sharma, S., Colle, D., Pickavet, M., & Demeester,

P. (2011, October). Software deﬁned networking: Meeting car-

rier grade requirements. In Local & Metropolitan Area Networks

(LANMAN), 2011 18th IEEE Workshop on (pp. 1-6). IEEE.

[11] Astaneh,S.A.,& Heydari,S. S. (2016).Optimization of SDN ﬂow

operations in multi-failure restoration scenarios.IEEE Transac-

tions on Network and Service Management,13(3),421-432.

[12] SNDlib library, http://sndlib.zib.de.

[13] Dijkstra, E. W. (1959). A note on two problems in connexion

with graphs. Numerische mathematik, 1(1), 269-271.

[14] POX , https://openﬂow.stanford.edu/display/ONL/POX+Wiki

[15] Shalimov, A., Zuikov, D., Zimarina, D., Pashkov, V., & Smelian-

sky, R. (2013, October). Advanced study of SDN/OpenFlow

controllers. In Proceedings of the 9th central & eastern euro-

pean software engineering conference in russia (p. 1). ACM.

[16] Schult, D. A., & Swart, P. (2008, August). Exploring network

structure, dynamics, and function using NetworkX. In Proceed-

ings of the 7th Python in Science Conferences (SciPy 2008) (Vol.

2008, pp. 11-16).

[17] Girvan, M., & Newman, M. E. (2002). Community structure

in social and biological networks. Proceedings of the national

academy of sciences, 99(12), 7821-7826.

[18] Csardi, G., & Nepusz, T. (2006). The igraph software package

for complex network research. InterJournal, Complex Systems,

1695(5), 1-9.

[19] Lantz, B., Heller, B., & McKeown, N. (2010, October). A net-

work in a laptop: rapid prototyping for software-deﬁned net-

works. In Proceedings of the 9th ACM SIGCOMM Workshop on

Hot Topics in Networks (p. 19). ACM.