Available via license: CC BY-NC-ND 4.0
Safety and Reliability – Safe Societies in a Changing World – Haugen et al. (Eds)
© 2018 Taylor & Francis Group, London, ISBN 978-0-8153-8682-7
Self-healing networks: A theoretical approach to smart grids’ resilience
A. Scala
Istituto Sistemi Complessi CNR, Università “Sapienza” Roma, Rome, Italy
F. Morone & H. Makse
City College of New York, New York, USA
ABSTRACT: Self-healing capabilities are a crucial feature for ensuring a high quality of service to users in the next generation of smart grids. We show how distributed communication protocols can enrich complex networks with self-healing capabilities; an obvious field of application is infrastructural networks that distribute a commodity via a flow, such as gas, water or electric power. We consider the case where the presence of redundant links allows the connectivity of the system to be recovered. We then introduce a theoretical framework to calculate the fraction of nodes still served for increasing levels of network damage. This framework allows us to analyse the interplay between redundancy and topology, a key point in improving the resilience of networked infrastructures to multiple failures.
1 INTRODUCTION
The functioning of any advanced society relies on networks that distribute commodities like water, gas and energy. The increasing urbanization and the accelerated growth of the size and number of mega-cities (Facchini et al. 2017, Kennedy et al. 2015) require such networks to be not only robust, i.e. able to withstand natural hazards, random failures or intentional attacks, but also resilient, i.e. able to recover quickly from such hazards. Thus, the key feature to be implemented in network systems is resilience (Francis and Bekera 2014, Ganin et al. 2016), i.e. the ability to recover an acceptable level of service in the face of faults, failures, accidents and attacks. In this paper, we describe a simplified model for networks that are able to increase their resilience through self-healing capabilities, while considering the effects of both topology and redundancy on the capability of network recovery. After introducing the concept of resilience in sec. 2, we introduce in sec. 3 a percolative model of self-healing reconnection. In sec. 4 we describe the analytic approaches needed to solve the problem at the mean-field level; we compare the model predictions with numerical simulations in sec. 5.
2 RESILIENCE
Resilience is a complex property and is best expressed via the function describing the recovery history of a system (see Fig. 1, upper panel). Loosely speaking, three main factors characterize resilience: the initial after-event state s*, the recovery time τ and the recovery level sC. In the usual resilience cycle, an accident happens at time t* and the quality of service s drops quickly to a value s* that depends on the robustness (Cohen and Havlin 2010) and on the reliability (Chaturvedi 2016) of the network; after a recovery time τ that depends on the restoration plan, service is restored up to a level sC that in some cases can be equal to or even higher than the initial one. In this paper we concentrate on the restored quality of service sC and introduce a self-healing percolation model for characterizing this quantity.
Self-healing is a crucial feature for implementing resilience in the smart networks of the future (Quattrociocchi et al. 2014). An example of already existing self-healing infrastructures is telecommunication networks, which rely on self-healing procedures based on routing protocols to restore traffic using redundant links (Bhandari 1998). This is a typical example of a strategy that relies on the redundancy in the interconnectivity of a system's components to ensure its continuity; for example, when a hole is punched in a leaf, the remaining vessels are able to sustain the extra flow necessary to keep the tissues alive (Katifori, Szöllősi, & Magnasco 2010). On the other hand, self-healing in infrastructural networks should instead be thought of as a constrained mechanism in which only a limited amount of resources is available.
Such a strategy is also common in materials science, where new polymeric compounds are capable of self-healing thanks to small amounts of healing agents that are released and activated upon cracking (White et al. 2001, Toohey et al. 2007).
While the security of large-scale infrastructural networks, such as long-range, high-voltage electric power networks, is based on redundancy, most local (regional, rural and city level) networks are, due to economic constraints, tree-like objects with few redundant links (Quattrociocchi et al. 2014). To describe and characterize a core feature of the resilience of such local networks, we introduce a new percolation problem that models their capability of recovering connectivity upon random failures. Our model for self-healing networks is thus inspired by local distribution networks such as gas, water and medium/low-voltage electric power networks. In real networks, cables and pipes will likely follow the topology of the street network, especially in urban areas. Moreover, tree-like networks allow operators to minimize costs (building physical links requires huge investments) and to account easily for consumption. However, a few redundant links must be present in order to recover the connectivity of the network in case of accidents: after a link failure, some redundant links can be activated to recover, at least partially, the functionality of the network. In real infrastructures such a procedure is often implemented manually, while in the smart networks of the future it should function automatically, possibly by embedding distributed algorithms that automate network recovery (Quattrociocchi et al. 2014). In the lower panel of Fig. 1 we present a cartoon of the self-healing process in real networks.
3 MODEL
In our scenario we consider network systems distributing some utility; for the sake of simplicity, we consider a single node to be the source of the quantity distributed on the network. Examples of such networks are water, gas or oil pipelines and electric power distribution. At each instant of time, the topology of the network distributing the utility (the active tree) is assumed to be a tree; this assumption is partially verified in the above-mentioned systems and, in particular, it is mostly verified in the case of electric power distribution (Pagani and Aiello 2011). In fact, such a structure meets the needs of infrastructure managers, i.e. to measure (for billing purposes) in an easy and precise way how much of a given quantity is served to any single node of the network. Finally, as a further simplification we do not take into account the magnitudes of the flows (all links and sources are assumed to have infinite capacity) but focus on maximizing the connectedness of the system in order to serve as many nodes as possible. Notice that in the case of real flows such an assumption can be unrealistic, since links and nodes in real networks have limits beyond which they stop operating.
In order to implement our strategy and its self-
healing capabilities, we consider the presence of
dormant backup links—i.e., a set of links that can
be switched on. Nodes are assumed to be able to
communicate with their neighbors by means of
a suitable distributed interaction protocol with a
limited amount of knowledge: in particular, nodes
are supposed to possess information only about
the state of neighboring nodes connected either via
active or via dormant links.
Figure 1. Relation between self-healing and resilience. Upper panel: cartoon of a recovery function describing the resilience of a system; here we plot a metric s measuring the quality of service versus the time t. An accident happens at time t* and s drops to a low value that depends on the robustness and on the reliability of the network; after a time τ, recovery plans will have restored the service s up to a level sC that in some cases can be equal to or even higher than the initial one. While the order of the restoration events determines the recovery time τ, any optimal restoration plan will achieve the same level of service sC given the same resources. In this paper we concentrate on the magnitude of sC and not on the recovery time. Lower panel: cartoon of a self-healing process in a distribution network.
When either a node or a link failure occurs, all the nodes below the failure disconnect from the active tree and become unserved. Such unserved nodes can then try to reconnect to the active tree by waking up, through the protocol, some dormant backup links. This process reconstructs a new active tree that can restore the flow totally or partially, i.e. heal the system. Fig. 2 presents a graphical sketch of the healing procedure.
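The healing step described above can be illustrated with a minimal sketch. It is a centralized simulation of the local rule, not the distributed protocol itself, and it rests on assumptions of ours: the function name `heal`, the encoding of the active tree as a child-to-father dictionary, and the rule that a dormant link is woken only when exactly one of its endpoints is currently served (which keeps the active structure a tree).

```python
from collections import defaultdict, deque

def heal(parent, failed, dormant, root):
    """Return (served nodes, activated backup links) after self-healing.

    parent:  dict child -> father describing the active tree (hypothetical encoding)
    failed:  set of children whose link to their father has failed
    dormant: list of (a, b) backup links that may be switched on
    """
    # Undirected adjacency of the tree links that survived the damage.
    alive = defaultdict(set)
    for child, father in parent.items():
        if child not in failed:
            alive[child].add(father)
            alive[father].add(child)

    def absorb(start, seen):
        # Breadth-first search over active links; grows `seen` in place.
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in alive[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)

    served = set()
    absorb(root, served)
    activated = []
    changed = True
    while changed:
        changed = False
        for a, b in dormant:
            # Local rule: wake a backup link only if it joins an unserved node to a served one.
            if (a in served) != (b in served):
                activated.append((a, b))
                alive[a].add(b)
                alive[b].add(a)
                absorb(b if a in served else a, served)
                changed = True
    return served, activated

# Tree 0-1, 0-2, 1-3, 1-4, 2-5 rooted in 0; the link between 1 and 0 fails.
tree = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
served, activated = heal(tree, failed={1}, dormant=[(4, 5), (3, 5)], root=0)
print(sorted(served), activated)
```

In this toy case a single backup link is enough to reattach the whole isolated sub-tree, so the second dormant link stays off.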
In the following, we indicate with $T_s(G)$ a tree on the graph $G$ rooted in the node $s$; we indicate with $R \subseteq G - T$ the set of redundant links and with $N$ the number of nodes. The set of redundant links is also described by its adjacency matrix $R_{ij}$, which takes the value 1 if a redundant link is present between nodes $i$ and $j$, and 0 otherwise. Moreover, we describe the damage inflicted on the distribution tree by the matrix $q_{ij}$, with $q_{ij}=0$ if link $ij$ is removed and $q_{ij}=1$ otherwise. The relevant metric for the effectiveness of a redundancy pattern $R$ given an attack $q$ is the fraction of served nodes FoS, i.e. the number of nodes connected to the root normalized by $N$.
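As an illustration of the FoS metric, the sketch below evaluates it directly from the matrices of the text by a breadth-first search from the root. The function names, the list-of-lists encoding of the matrices and the helper `symmetric` are our own assumptions; the sketch also assumes that every useful redundant link gets activated.

```python
from collections import deque

def fraction_of_served(T, R, q, root):
    """FoS: share of nodes reachable from `root` via surviving tree links or redundant links."""
    n = len(T)
    seen = {root}
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j in range(n):
            # A tree link is usable only if it survived (q_ij = 1); redundant links are always usable.
            link_usable = (T[i][j] and q[i][j]) or R[i][j]
            if link_usable and j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen) / n

def symmetric(n, links, default=0):
    """Hypothetical helper: n x n matrix equal to 1 - default on `links`, `default` elsewhere."""
    m = [[default] * n for _ in range(n)]
    for i, j in links:
        m[i][j] = m[j][i] = 1 - default
    return m

T = symmetric(4, [(0, 1), (1, 2), (2, 3)])   # path 0-1-2-3 rooted in node 0
R = symmetric(4, [(0, 3)])                   # one redundant link between 0 and 3
q = symmetric(4, [(1, 2)], default=1)        # link 1-2 removed, i.e. q_12 = 0
print(fraction_of_served(T, R, q, root=0))
```

Replacing `R` with an all-zero matrix leaves only nodes 0 and 1 served, which shows how a single redundant link changes the metric.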
4 METHODS
Let us first notice that a node i is connected to a
node j if and only if a message from node i can
reach node j. To solve such a problem, we use a cav-
ity based approach for message passing (Mezard
and Montanari 2009).
First of all, let us define $S(i)$ as the set of nodes that are sons of node $i$ on the original tree $T$ (i.e. not considering the redundant edges): $S(i) \equiv \{k : k \text{ is a son of } i\}$. Moreover, we call $F(i)$ the unique father of node $i$. Finally, we define $R(i) = \{j : R_{ij} = 1\}$ as the set of nodes connected to $i$ via redundant links; obviously, we have $F(i) \notin R(i)$ and $S(i) \cap R(i) = \emptyset$. Then, let us consider a node $i$ and a node $j \in S(i)$ which is one of the sons of node $i$ on the original tree.
We introduce the following quantities:
• $d_{i\to j}$ is the probability that node $i$ is connected to the root $s$ when the son-node $j \in S(i)$ is absent from the tree.
• $u_{i\to j}$ is the probability that node $i$ is connected to the root $s$ when the father-node $j = F(i)$ is absent from the tree.
• $r_{i\to j}$ is the probability that node $i$ is connected to the root $s$ when $i$ and $j$ are connected by a redundant edge and $j$ is absent from the tree.
We first derive the recursive equation for $d_{i\to j}$, $j \in S(i)$. When $j$ is absent from the graph, $i$ is connected to the root $s$ if at least one of these possibilities is realized:
1. its father $F(i)$ is connected to the root. The probability that such an event does NOT happen is
$$\pi_1 = 1 - q_{iF(i)}\, d_{F(i)\to i}$$
2. one of its sons $S(i)$, except $j$, is connected to the root. The probability that such an event does NOT happen is
$$\pi_2 = \prod_{k\in S(i)\setminus j} \left(1 - q_{ki}\, u_{k\to i}\right)$$
3. one of the neighbours connected to $i$ via a redundant link is connected to the root when $i$ is absent. The probability that such an event does NOT happen is
$$\pi_3 = \prod_{k=1}^{N} \left(1 - R_{ik}\, r_{k\to i}\right)$$
The total probability that $i$ is connected to the root when its son $j$ is absent is thus given by $1 - \pi_1 \times \pi_2 \times \pi_3$, i.e.
$$d_{i\to j} = 1 - \left[1 - q_{iF(i)}\, d_{F(i)\to i}\right] \times \prod_{k\in S(i)\setminus j} \left(1 - q_{ki}\, u_{k\to i}\right) \prod_{m=1}^{N} \left(1 - R_{im}\, r_{m\to i}\right) \tag{1}$$
Figure 2. Example of the healing procedure. (Left panel) In the initial state, the source node (filled square, upper left corner) is able to serve all 16 nodes through the links of the active tree. The 4 dashed lines (green online) represent dormant backup links that can be activated upon failure. The redundancy of the system is p = 4/9, as only 4 of the 9 possible backup links are present. The link marked with an X is the one that is going to fail. (Central panel) A single link failure disconnects all the nodes of a sub-tree; in the example, a sub-tree of 6 nodes (red online) is left isolated from the source, i.e. the system has a damage ∆ = 6. (Right panel) By activating a single dormant backup link, the self-healing protocol has been able to recover connectivity for the whole system, in this case bringing the number of served nodes back to its maximum value of 16. The link that has recovered the connectivity is marked with an R. Notice that in real networks, due to the physics of the flows and to the constraints on link and node capacities, not all possible reconnections may be viable; some could even lead to cascading behavior.
Following the same procedure we can write the equation for $u_{i\to j}$:
$$u_{i\to j} = 1 - \prod_{k\in S(i)} \left(1 - q_{ki}\, u_{k\to i}\right) \times \prod_{m=1}^{N} \left(1 - R_{im}\, r_{m\to i}\right) \tag{2}$$
Finally, the equation for $r_{i\to j}$ is:
$$r_{i\to j} = 1 - \left[1 - q_{iF(i)}\, d_{F(i)\to i}\right] \times \prod_{k\in S(i)} \left(1 - q_{ki}\, u_{k\to i}\right) \prod_{m=1,\, m\neq j}^{N} \left(1 - R_{im}\, r_{m\to i}\right) \tag{3}$$
Equations (1–3) constitute the self-consistent equations of the problem. They are valid for any given realization of the random tree and of the redundant links. Equations (1–3) can be regarded as describing messages running on the edges of the tree and on the redundant edges, and they can be solved by simple iteration.
Once a solution to Eqs. (1–3) has been found, we can compute the total probability $p_i$ that node $i$ is connected to the root:
$$p_i = 1 - \left[1 - q_{iF(i)}\, d_{F(i)\to i}\right] \times \prod_{k\in S(i)} \left(1 - q_{ki}\, u_{k\to i}\right) \prod_{m=1}^{N} \left(1 - R_{im}\, r_{m\to i}\right) \tag{4}$$
Disregarding correlations, we can estimate the average fraction of served nodes FoS as
$$\mathrm{FoS} = \frac{1}{N} \sum_{i=1}^{N} p_i \tag{5}$$
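The iteration of Eqs. (1–3) and the evaluation of Eqs. (4–5) can be sketched as follows. This is an illustrative sketch under assumptions of ours, not a reference implementation: the function name, the data layout (`parent`, `q` indexed by the child end of each tree link, `R` as symmetric per-node dictionaries), the synchronous update, the all-zero initial messages, and the boundary condition that the root is always served are not specified in the text.

```python
from math import prod

def cavity_probabilities(parent, q, R, root, max_sweeps=1000, tol=1e-12):
    """Iterate Eqs. (1-3) to a fixed point and return the p_i of Eq. (4).

    parent[i]: father F(i) of node i on the tree (None for the root)
    q[i]:      q_{iF(i)} for the tree link joining i to its father (1 kept, 0 removed)
    R[i]:      dict {m: R_im} of the redundant neighbours of i (assumed symmetric)
    """
    n = len(parent)
    sons = [[] for _ in range(n)]
    for i, f in enumerate(parent):
        if f is not None:
            sons[f].append(i)

    # Messages start at 0 so that connectivity can only spread out from the root.
    d = {(i, j): 0.0 for i in range(n) for j in sons[i]}
    u = {i: 0.0 for i in range(n) if i != root}
    r = {(i, j): 0.0 for i in range(n) for j in R[i]}

    def pi1(i):
        # Assumption: the root is always served, so its "father" factor is set to 0.
        return 0.0 if i == root else 1.0 - q[i] * d[(parent[i], i)]

    def pi2(i, skip=None):
        return prod(1.0 - q[k] * u[k] for k in sons[i] if k != skip)

    def pi3(i, skip=None):
        return prod(1.0 - R[i][m] * r[(m, i)] for m in R[i] if m != skip)

    for _ in range(max_sweeps):
        new_d = {(i, j): 1.0 - pi1(i) * pi2(i, skip=j) * pi3(i) for (i, j) in d}  # Eq. (1)
        new_u = {i: 1.0 - pi2(i) * pi3(i) for i in u}                              # Eq. (2)
        new_r = {(i, j): 1.0 - pi1(i) * pi2(i) * pi3(i, skip=j) for (i, j) in r}   # Eq. (3)
        change = max(
            [abs(new_d[key] - d[key]) for key in d]
            + [abs(new_u[key] - u[key]) for key in u]
            + [abs(new_r[key] - r[key]) for key in r],
            default=0.0,
        )
        d, u, r = new_d, new_u, new_r
        if change < tol:
            break

    return [1.0 - pi1(i) * pi2(i) * pi3(i) for i in range(n)]                      # Eq. (4)

# Tree 0-1, 0-2, 1-3, 1-4, 2-5 rooted in 0; link 1-0 removed; redundant link 4-5.
parent = [None, 0, 0, 1, 1, 2]
q = [1, 0, 1, 1, 1, 1]            # q[0] is unused because node 0 is the root
R = [{}, {}, {}, {}, {5: 1}, {4: 1}]
p = cavity_probabilities(parent, q, R, root=0)
print(p, sum(p) / len(p))         # per-node probabilities and the FoS of Eq. (5)
```

On a single realization with 0/1 entries such as this one, the messages are themselves 0 or 1, and the fixed point reached from zero messages marks exactly the nodes that a direct graph search from the root would find; with fractional entries the same routine returns probabilities.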
5 RESULTS
We now compare the results of our Eqs. (1–3) with numerical simulations of our self-healing procedure. In particular, we consider networks generated from random trees to which a fraction α of random recovery links is added. After an initial fraction f of the links in the tree is deleted at random, recovery links are activated whenever they reduce the number of connected components. Finally, the FoS is calculated as the fraction of nodes connected to the origin via the surviving links plus the redundant ones. Notice that when averaging Eqs. (1–3) to solve the mean-field equations, we use $\langle q_{ij} \rangle = 1 - f$ (since $q_{ij}=1$ for surviving links) and $\langle R_{ij} \rangle = \alpha/(N-1)$.
First, we perform simulations on random trees on a complete graph of 1000 nodes. The random trees are generated according to a flat-sampling procedure in the space of possible trees (Broder 1989, Aldous 1990, Wilson 1996). Edges of the tree are removed at random; the fraction of removed edges is $f = 1 - \sum_{ij \in T} q_{ij} / |T|$, where the sum runs over the links of the tree and $|T| = N - 1$. We parametrize with α the population of redundant links between pairs of nodes $i$ and $j$ whose link $ij$ does not belong to $T$; thus, the $R_{ij}$ are random variables on {0,1}, with $R_{ij}=1$ with probability $\alpha/(N-1)$. In each simulation, a random tree is generated, a random vertex is taken as the source, a fraction α of redundant links is added to the tree and a fraction f of links is erased at random from the tree; then the fraction FoS of sites connected to the root (via links either in $T - q$ or in $R$) is calculated. The results for the average FoS are presented in Fig. 3.
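A single realization of this simulation can be sketched as below. The sketch uses the Aldous-Broder random walk (one of the flat-sampling procedures cited above) to draw the tree; the function names, the choice to erase each tree link independently with probability f, and the small sizes in the usage lines are assumptions of ours rather than the exact setup behind Fig. 3.

```python
import random
from collections import deque

def uniform_random_tree(n, rng):
    """Sample a spanning tree of the complete graph on n nodes (Aldous-Broder walk)."""
    current = rng.randrange(n)
    visited = {current}
    edges = []
    while len(visited) < n:
        nxt = rng.randrange(n - 1)      # uniform choice among the other n-1 nodes
        if nxt >= current:
            nxt += 1
        if nxt not in visited:          # first visit: keep the link used to arrive
            visited.add(nxt)
            edges.append((current, nxt))
        current = nxt
    return edges

def simulate_fos(n, alpha, f, rng):
    """One realization: random tree, random damage, random redundancy, then FoS."""
    tree = uniform_random_tree(n, rng)
    tree_links = {frozenset(link) for link in tree}
    adj = [[] for _ in range(n)]
    for a, b in tree:
        if rng.random() >= f:           # assumption: each tree link fails independently with prob. f
            adj[a].append(b)
            adj[b].append(a)
    for a in range(n):
        for b in range(a + 1, n):
            # R_ab = 1 with probability alpha/(N-1), only for pairs outside the tree.
            if frozenset((a, b)) not in tree_links and rng.random() < alpha / (n - 1):
                adj[a].append(b)
                adj[b].append(a)
    root = rng.randrange(n)             # a random vertex plays the role of the source
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) / n

rng = random.Random(0)
runs = [simulate_fos(200, alpha=0.4, f=0.3, rng=rng) for _ in range(20)]
print(sum(runs) / len(runs))            # average FoS over 20 realizations
```

Counting every node reachable through surviving plus redundant links gives the same served set as activating recovery links one by one whenever they merge two components, which is why no explicit activation loop is needed here.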
Figure 3. Comparison of simulations and analytical approximation. The curves show the fraction of served nodes FoS (i.e. nodes connected to the origin after self-healing) versus the initial fraction of failed links f. Crosses correspond to average values of the FoS obtained by simulating random trees of 1000 nodes for different levels of redundancy: α = 0.2 (red X), α = 0.4 (green X) and α = 0.6 (blue X); the size of the symbols is of the order of three times the error bars. As expected, curves are monotonically decreasing with f and monotonically increasing with α. Full lines correspond to the predictions of our analytical approximation, eqs. (7).
We then consider the average behavior of Eqs. (1–3) over the randomness in the model, i.e. over the possible realizations of the $q_{ij}$ and $R_{ij}$. Performing the average we obtain the following self-consistent equations:
$$d = 1 - \left[1 - (1-f)\, d\right] e^{-\alpha r} \times \sum_k \frac{k(k-1)P(k)}{\langle k^2 \rangle - \langle k \rangle} \left[1 - (1-f)\, u\right]^{k-2} \tag{6}$$
$$u = 1 - e^{-\alpha r} \sum_k \frac{kP(k)}{\langle k \rangle} \left[1 - (1-f)\, u\right]^{k-1}$$
$$r = 1 - \left[1 - (1-f)\, d\right] e^{-\alpha r} \sum_k \frac{kP(k)}{\langle k \rangle} \left[1 - (1-f)\, u\right]^{k-1}$$
where $P(k)$ is the degree distribution (i.e. the probability of having $k$ neighbours) of nodes in the tree which are not leaves. In Fig. 3 we compare the theoretically predicted FoS obtained by averaging eq. (5) with the results of numerical simulations.
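As a brief sketch of where the exponential factor in the averaged equations can come from, consider the redundant-link product appearing in Eqs. (1–4). The step below is our reading of the mean-field average and rests on stated assumptions: the $R_{im}$ are independent with mean $\alpha/(N-1)$, every message $r_{m\to i}$ is replaced by its average $r$, and each node has about $N-1$ potential redundant partners.

```latex
\left\langle \prod_{m=1}^{N} \left(1 - R_{im}\, r_{m\to i}\right) \right\rangle
\simeq \left(1 - \frac{\alpha}{N-1}\, r\right)^{N-1}
\xrightarrow[N\to\infty]{} e^{-\alpha r}
```

In the same spirit, since a tree link survives with probability $1-f$, each son contributes an average factor $1-(1-f)\,u$ to the product over $S(i)$.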
6 CONCLUSION
In this paper we have discussed the recovery of networks through a minimal self-healing procedure that exploits the presence of redundant edges to recover the connectivity of the system. Our scenario is inspired by real-world distribution networks that are tree-like, often for economic reasons, and at the same time are often provided with alternative backup links that can be activated in case of malfunction; this is the case, for example, for low-voltage distribution networks (ENEL 2011).
Our model, albeit schematic, is realistic in the sense that it could be readily implemented with current technologies. In fact, routing protocols represent a vast available source of distributed algorithms able to maintain the connectivity of a system. Therefore, our scheme could be implemented by coupling an ICT network to current infrastructures. Our case is an example in which interdependencies enhance resilience instead of introducing catastrophic breakdowns (Buldyrev et al. 2010). However, since we assume that link capacities are infinite, our model applies to cases where the network is not stressed. For real flow networks, the physics of the flows and their constraints can forbid some of the possible reconnections, which in the worst cases could even lead to cascading failures like the ones observed in model power grids (Pahwa et al. 2014). Moreover, we consider the case of a single source; in the case of multiple sources, leaving the system disconnected (islanding) could even improve its robustness (Mureddu et al. 2016). Notice also that our model predicts only the final level of service of a network; timescales are an important element that should be introduced to allow for a characterization of the full resilience curve of the system.
In this paper we have introduced the cavity equations describing our model and compared an analytical approximation for the average connectivity under random failures with numerical simulations. We find a promising agreement between this approximation and the numerical results, which opens the field for future investigations.
The first direction to be investigated is the study of our approximation when both the fraction of failures f and the fraction of redundant links α are small. In fact, in real systems the number of concurrent failures is small: this is the reason behind the N – 1 criterion in engineering. Likewise, for economic reasons the fraction α of redundant links is also bound to be small: hence, our approximation is valid for small f’s and α’s, where analytical expressions could be linearly expanded. On the other hand, if failures are not independent but happen in a correlated and perhaps catastrophic way, as in cascading events (Pahwa et al. 2014), the approximation must be enhanced to hold for the entire range of f; work in this direction is in progress.
Most importantly, the cavity equations (1–3) can be applied to single networks with a given topology and set of redundant links to calculate the FoS, as an alternative to numerical simulations. Having a set of closed equations then makes it easy to analyze scenarios in which the set of redundant links is varied: coupling our approach with a cost function for the links is important for the design of networks, since it would allow the redundancy to be optimized. As an example, numerical simulations on planar topologies suggest that a very effective strategy to strengthen planar networks is to add long-range links (Quattrociocchi et al. 2014): since such links are overly expensive in networks like electric distribution, the feasibility of such a strategy depends on a cost-benefit analysis of the implementation of physical long-range links in PNIs. A further direction of study would be to consider the effects of more detailed structural characteristics on the dynamics of the system (D’Agostino et al. 2012). However, it is important to remember that in optimizing the system the cost of the links is as important as the increase in resilience. In fact, a simple constraint like keeping the number of redundant links fixed would lead to very unrealistic topologies in which the source is at the center of a star regardless of the length of the links (Quattrociocchi et al. 2014).
REFERENCES
Aldous, D.J. (1990, November). The random walk construction of uniform spanning trees and uniform labelled trees. SIAM J. Discret. Math. 3(4), 450–465.
Bhandari, R. (1998). Survivable Networks: Algorithms for Diverse Routing. Norwell, MA, USA: Kluwer Academic Publishers.
Broder, A. (1989). Generating random spanning trees. In 30th Annual Symposium on Foundations of Computer Science, pp. 442–447.
Buldyrev, S.V., R. Parshani, G. Paul, H.E. Stanley, & S. Havlin (2010). Catastrophic cascade of failures in interdependent networks. Nature 464(7291), 1025–1028.
Chaturvedi, S.K. (2016). Network reliability: measures and evaluation (1 ed.). Performability engineering series. John Wiley.