IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 18, NO. 3, JUNE 2010, p. 985
A General Framework for Service Availability for
Bandwidth-Efficient Connection-Oriented Networks
Ori Gerstel, Fellow, IEEE, and G. Sasaki
Abstract—Availability in connection-oriented service in networks has traditionally been “all-or-nothing,” i.e., when a failure occurs, a connection either is unprotected or fully protected. The differences in availability and cost between these two extremes can be quite high. A general framework for service availability is presented that fills the gap. It is shown how network resources and cost are related to the service parameters of the framework for networks that are bandwidth-efficient. In addition, a simple revenue model is presented and characterized, revealing when nontraditional service agreements may be attractive.
Index Terms—Protection switching, service availability, service
level agreements, survivable networks.
I. INTRODUCTION
Most protection schemes in a network attempt to achieve the availability specified in a customer’s service level agreement¹ (SLA) by “all-or-nothing” switching, i.e., whenever a fault occurs, a connection on the fault is either completely protected or not protected at all for the duration of the fault. There are great differences in availability and cost between these two extremes. For example, a 1+1 protected connection may have a high availability of 99.999%, while an unprotected connection could have a much lower availability of 99.9% or even lower. In addition, the 1+1 connection may use more than twice the network resources of an unprotected connection since the disjoint working and protection paths of a 1+1 connection are together at least twice as long as a shortest path of an unprotected connection. While shared protection schemes reduce resource usage, they still require significant protection resources, especially for sparse topologies such as rings, and do not provide availability guarantees for connections that are not 99.999% protected. So among the limited selection of classical protection services, there is a steep tradeoff between availability and network cost, which ultimately affects customer prices. What is needed is to bridge the gap between these classical protection services so that a customer will be able to find the right SLA at the right price.
Manuscript received March 09, 2008; revised November 06, 2008; February 21, 2009; and August 13, 2009; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor A. Somani. First published April 19, 2010; current version published June 16, 2010.
O. Gerstel is with Carrier Routing Business Unit, Cisco Systems, Natanya 42504, Israel (e-mail: ori@ieee.org).
G. H. Sasaki is with the University of Hawaii, Honolulu, HI 96822 USA (e-mail: galens@hawaii.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNET.2010.2046746
¹ For the sake of simplicity, we refer to the availability defined in the SLA as “the SLA.” The SLA contains many other aspects, such as maximum latency, jitter, as well as support guarantees, but they are outside the scope of this paper.
The purpose of this paper is to propose a general framework for the SLA toward this goal while keeping in mind the underlying technologies for implementation. The framework has the following practical advantages. It can be implemented with current and foreseeable technologies at both the optical layer and the electronic packet-switched layer (e.g., Ethernet, MPLS, IP). Its parameters can be measured to verify SLA compliance. It leads to bandwidth efficiency in the network, which in this paper means that there is minimal or possibly no additional bandwidth beyond the necessary working bandwidth. This is important to keep connection prices low.
The framework allows protection bandwidth to be a fraction
of the working bandwidth such as in [1] and [2]. It also allows
connections that are not directly on a fault to be interrupted to
free surviving bandwidth for protection such as in [3] and [4].
This is a departure from the usual telecommunication practice
of not disturbing established connections, but it allows greater
flexibility in optimizing surviving bandwidth, leading to a more
holistic network protection:
Definition: Network protection is a means to redistribute the
limited bandwidth that survived a network failure, among all
the services supported by the network with the single goal of
ensuring they all meet their SLAs.
In addition, the framework will introduce features that ad-
dress the following weakness of conventional availability spec-
ifications. The availability of a connection is typically specified
by a percentage, e.g., 99.9%. For an operating period of say a
year, the connection will be unavailable for at most 8 hours and
46 min. Note that the connection can be continually down for
8 hours and 46 min and still meet its SLA. This may be too long
for a customer, who may prefer to limit any continuous down-
time to a couple of hours, and spread out the downtimes. The
SLA framework of this paper addresses this by ensuring avail-
ability over short periods.
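The arithmetic behind these downtime figures can be sketched as follows. This is a minimal illustration that assumes a 365-day year; the function names are ours and do not come from the paper.

```python
def max_downtime_minutes(availability, period_hours=365 * 24):
    """Largest total downtime (in minutes) allowed by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours * 60.0


def as_hours_and_minutes(minutes):
    """Split a duration in minutes into whole hours and rounded minutes."""
    total = round(minutes)
    return total // 60, total % 60


if __name__ == "__main__":
    # 99.9% availability over one year
    print(as_hours_and_minutes(max_downtime_minutes(0.999)))
    # 99.999% availability over one year, in minutes
    print(round(max_downtime_minutes(0.99999), 2))
```

Note that the percentage alone says nothing about how the downtime is distributed: a single outage of the full budget and many short outages give the same availability, which is the weakness the short-term components of the framework address.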
The paper is organized as follows. Related work is discussed in Section II, and the SLA framework is presented in Section III. Section IV describes how the SLA framework affects network cost, and in particular the required link bandwidths. In that section, and throughout this paper, the system is assumed to be composed of two network nodes connected by a pair of connections that pass through a network as shown in Fig. 1. The connections basically serve as a pair of links between the two nodes, and they will be referred to as “links” 1 and 2. The “links” are assumed to have the same bandwidth, to have disjoint physical paths, and not to fail together. It will be assumed throughout the paper that the total time that link $i$ is down is at most $F_i$. In addition, the time to repair a link is at most $R$. The values $F_1$, $F_2$, and $R$ are assumed to be known a priori. Presumably, at least conservative estimates are known by service providers. Throughout the paper, the following notation will be used: $F = F_1 + F_2$. Note that $F$ is an upper bound on the total amount of time that there is some failure. It will also be assumed that the links carry a total of $N$ connections, and each connection operates over the time interval $[0, T]$. The duration $T$ will be referred to as the lifetime of the network. In Section IV, a lower bound on link bandwidth requirements is given. It will be shown that the lower bound is nearly achievable for simple but important special cases. Section V presents scenarios when nontraditional protection schemes may be economically attractive. Simple economic models are used to illustrate the tradeoffs. Implementation is discussed in Section VI. It is worth noting that the network must now keep track of state information per connection. Section VII has final remarks. It includes directions for future research and a discussion of generalizations of the assumption shown in Fig. 1 from two links to multiple links.
Fig. 1. Point-to-point system through a network.
II. RELATED WORK
There have been a number of proposals to bridge the gap
between full protection and unprotected service. Several pa-
pers—such as [5]—propose that connections be given priori-
ties based on their SLA, and survivability of a connection de-
pends on its relative priority among the other connections. Such
“best effort” approaches that depend on the SLA of other con-
nections may be unsuitable for applications that require service
guarantees.
The Quality of Protection (QoP) framework in [1] is an SLA framework with survivability guarantees that are independent of other connections. Here, the availability of connections is accommodated in one of two ways: 1) connections are given the amount of bandwidth they were promised under a failure condition, even if it is only a fraction of the bandwidth under normal conditions; 2) a probabilistic scheme in which connections get their full bandwidth according to a probability that is based on their priority. Note that the first approach is also considered in [2], and that the second approach is a form of “all-or-nothing” protection switching. Another probabilistic approach is presented in [6], which describes a mathematical optimization framework. In [7], routing under a probabilistic framework is considered.
In [3] and [4], connections that are not on a fault can be interrupted or “victimized,” allowing greater flexibility to meet the SLAs of all connections. In the case of [3], the connections protect a fraction of their working bandwidth. This approach can be viewed as a generalization of the “extra traffic” concept in SONET. Our approach also blurs the boundary between working and protection bandwidth, as ultimately the system can use both to protect connections. It should be noted that the selection of which connections to victimize must be done carefully to ensure SLAs are satisfied.
In [4], the accumulated downtime of each connection is tracked and used to determine which connections to victimize. It is shown in [4] that naive greedy approaches can fail. Accumulated downtimes of connections are also used in [8]. In [9], it is discussed that the guaranteed maximum accumulated downtime may need a safety margin to take into account the time to repair.
III. SLA FRAMEWORK
The SLA framework for a connection $k$ will be described. The framework has components referred to as SLA1–SLA4.
SLA1: The connection has two states, working and protection, corresponding to two bandwidths, the working state bandwidth $B_W(k)$ and the protection state bandwidth $B_P(k)$, respectively.² The working state is the normal state for the connection, and the protection state occurs only when there is a fault somewhere in the network. So when there are no faults in the network, all connections are in their working state, and if there is a fault, then some connections may be in the protection state, and the rest of the connections are in the working state. The set of connections that are in the protection state may change over time during a fault. The protection state bandwidth $B_P(k)$ is a fraction of $B_W(k)$ and can be zero. The following simplifying assumption will be used in the subsequent sections. For all connections $k$, the working state bandwidth is $B_W$, and the protection state bandwidth is $B_P$.
² Such an SLA is feasible if connection interfaces can transmit at two bandwidth rates. Section VI presents more implementation details.
SLA2: The connection has a maximum accumulated time $D_k$ that the connection is in the protection state. Note that the service availability of the connection must be at least $1 - D_k/T$, where $T$ is the lifetime of the network. For example, if the availability is 99.999% and $T$ is a year, then $D_k$ is 5.26 min. $D_k$ will be referred to as the maximum accumulated protection state time for connection $k$.
SLA1 and SLA2 cover classical connection services: unprotected, fully protected, and low-priority preemptible (extra traffic in SONET nomenclature) services. For unprotected connections, $D_k = F_i$, where $i$ is the link on which connection $k$ is normally carried and $F_i$ bounds the total time that link $i$ is down. For fully protected connections (i.e., availability is 100%), $D_k = 0$. For low-priority preemptible connections, $D_k = F$, where $F$ bounds the total time that there is some failure, since any failure will impact these connections due to their preemption, even if they were not directly impacted. SLA1 and SLA2 also cover the protection schemes of [1]–[4].
The next two SLA components, SLA3 and SLA4, specify the availability of a connection over short time periods.
SLA3: Whenever the connection goes into the working state, it must remain in that state for at least a minimum amount of time $W_k$ before going to the protection state. This ensures that services that require the working state bandwidth have sufficient time to be completed. For example, video streams for movies require bandwidth for 60 to 90 min. The parameter $W_k$ will be referred to as the minimum working state duration.
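As a rough illustration of how SLA1–SLA3 could be held per connection, and of how the three classical services map onto the maximum accumulated protection state time, consider the sketch below. The class, field, and function names are hypothetical, and the protection state bandwidth of the classical services is simply set to zero for illustration.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24


@dataclass
class ConnectionSLA:
    """SLA1-SLA3 parameters for one connection (field names are illustrative)."""
    working_bw_gbps: float          # SLA1: working state bandwidth
    protection_bw_gbps: float       # SLA1: protection state bandwidth (may be 0)
    max_protection_hours: float     # SLA2: maximum accumulated protection state time
    min_working_hours: float = 0.0  # SLA3: minimum working state duration

    def guaranteed_availability(self, lifetime_hours=HOURS_PER_YEAR):
        """Fraction of the network lifetime guaranteed in the working state."""
        return 1.0 - self.max_protection_hours / lifetime_hours


def classical_service(kind, bw_gbps, own_link_down_hours, any_link_down_hours):
    """Map a classical service onto SLA1/SLA2 parameters.

    own_link_down_hours: bound on the downtime of the link carrying the connection.
    any_link_down_hours: bound on the total time during which some link is down.
    """
    if kind == "unprotected":
        budget = own_link_down_hours   # down whenever its own link is down
    elif kind == "fully_protected":
        budget = 0.0                   # never leaves the working state
    elif kind == "preemptible":
        budget = any_link_down_hours   # preempted by any failure
    else:
        raise ValueError(f"unknown service kind: {kind}")
    return ConnectionSLA(bw_gbps, 0.0, budget)
```

For example, `classical_service("unprotected", 10.0, 8.76, 17.52).guaranteed_availability()` corresponds to an availability of about 99.9% over a one-year lifetime.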
The following assumption will be used in the next section to ensure SLA3 with minimal link bandwidth. Let the time before the first failure and the times between consecutive faults be referred to as fault-free periods. Assume that the minimum fault-free duration is at least $\max_k W_k$, where $W_k$ is the minimum working state duration for connection $k$. Note that this assumption should be reasonable if the minimum working state durations are much smaller than the mean time to failure for the links. Without this assumption, a link could go down, come back up, and then immediately go back down again, possibly leading to a violation of SLA3. For example, suppose each link is 30 Gb/s and carries three connections, where the working state bandwidth is $B_W = 10$ Gb/s and the protection state bandwidth is $B_P = 5$ Gb/s. Note that the links have the minimum bandwidth for the connections in their working state, and if a link goes down, then the surviving link has just enough bandwidth for all connections in their protection state. Now when one of the links goes down, all six connections go into their protection states, each with 5 Gb/s on the surviving link. When the link comes back up, all connections are required to be in their working state. However, when the link immediately goes back down again, there is not enough surviving bandwidth to ensure all connections can be in their working state for their minimum working state durations, violating SLA3. To ensure SLA3, more link bandwidth is needed, but then the links are less bandwidth-efficient.
SLA4: Whenever the connection goes into the protection state, it must transition to the working state after at most $P_k$ amount of time. The parameter $P_k$ will be referred to as the maximum protection state duration. In addition, over any time period of length $t$, the amount of time that the connection is in the working state is at least $\alpha_k (t - P_k)$, where $\alpha_k$ is a parameter satisfying $0 \le \alpha_k \le 1$ and referred to as the short-term availability rate. Note that this has the same form as the quality-of-service guarantee definition in [10].
This ensures that working state bandwidth will resume within a tolerable prescribed delay $P_k$. An example application is rescheduling a video conference meeting within a couple of hours. It also ensures that the connection has access to working state bandwidth for a fraction of time that is approximately $\alpha_k$ during faults, and $\alpha_k$ can be chosen high enough to provide good average throughput. An example application that needs good throughput is offline backup.
The next lemma shows how the maximum protection state duration $P_k$ and minimum working state duration $W_k$ imply an access rate to working state bandwidth.
Lemma 1: Suppose a connection $k$ has a minimum working state duration $W_k$ and a maximum protection state duration $P_k$. Then, during any interval of length $t$, the amount of time the connection is in the working state is at least $\frac{W_k}{W_k + P_k}(t - P_k)$.
The proof of the lemma is given in Appendix A. From the lemma, it can be assumed without loss of generality that $\alpha_k \ge W_k/(W_k + P_k)$. To achieve bandwidth efficiency, in many cases, the value of $\alpha_k$ must be strictly larger than $W_k/(W_k + P_k)$. For example, suppose in Fig. 1 there are two connections, 1 and 2, that are on links 1 and 2, respectively, when there
are no faults. Suppose they have working state bandwidth
and protection state bandwidth 0. Suppose connection 1 has
hour, and connection 2 has
Suppose
, so whenever link 2 has a fault, the
connections are in the working state for approximately 50% of
the time. However, with these values of
fault on link 2 for say 8 hours, then connection 2 will be in the
working state for at least 4 hours, during which connection 1
will also be in the working state at some time. Since both
connections will be in the working state on link 1 at the same
time, link 1 must have bandwidth 2
two connections each only require average bandwidth
link 1 during the fault, the link must have bandwidth 2
the link will be utilized at only about 50%.
Note that subsets of the SLA components can be disabled by
choosing appropriate parameter values. For example, to disable
SLA1, SLA2, SLA3, or SLA4, the parameters can be chosen so
that
, or
The following are mild assumptions on the SLA parameters
to simplify results in Section IV.
Assumption 1: Without loss of generality, each connection k
is assumed to satisfy:
SLA4); and
(for SLA4).
hours.
and, if there is a
. So even though the
on
, and
, respectively.
(for SLA2);(for
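The access-rate guarantee of Lemma 1 can be checked numerically. The sketch below is only an illustration under two stated assumptions: that the bound takes the form $\frac{W}{W+P}(t - P)$ for a window of length $t$, where $W$ is the minimum working state duration and $P$ the maximum protection state duration, and that the least favorable pattern alternates a full protection period $P$ with a minimal working period $W$, starting with protection. All names are ours.

```python
def worst_case_working_time(t, w, p):
    """Working time in a window of length t for the repeating pattern:
    p time units in protection, then w time units working (assumed worst case)."""
    cycle = w + p
    full_cycles = int(t // cycle)
    remainder = t - full_cycles * cycle
    # In the partial cycle, the first p time units are spent in protection.
    return full_cycles * w + max(0.0, remainder - p)


def lemma_bound(t, w, p):
    """Assumed form of the Lemma 1 guarantee: (w / (w + p)) * (t - p)."""
    return w / (w + p) * (t - p)


def bound_holds(w, p, horizon, step=0.25):
    """Check the bound on a grid of window lengths up to the horizon."""
    t = 0.0
    while t <= horizon:
        if worst_case_working_time(t, w, p) < lemma_bound(t, w, p) - 1e-9:
            return False
        t += step
    return True
```

With $W = P = 1$ hour and an 8-hour window, the pattern yields 4 hours of working time against a bound of 3.5 hours; the bound is met with equality at windows of length $n(W+P) + P$.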
A. Examples of Service Mixes
To add some intuition to the variety of options that our gen-
eral SLA framework provides, this example presents possible
service mixes for Fig. 1 when each link is 30 Gb/s, and each
link has three 10-Gb/s connections, so there are six connections
altogether. The following are services that could be supported
by the system.
Service Mix 1: All six connections are unprotected.
Service Mix 2: Three of the connections are fully protected,
and the other three are low-priority. This is a classical SONET
scenario with extra traffic.
Service Mix 3: Whenever a fault occurs, all six connections have protection state bandwidth of 5 Gb/s. This scenario would apply to a real-time application that can fall back to a degraded performance at a certain reduced data rate. An example is given in Section VI of connections carrying high-definition TV (HDTV) video, but when there is a fault, the connections reduce their bandwidth to carry standard-definition TV (SDTV) video.
Service Mix 4: Whenever a fault occurs, half of the connections are at 10 Gb/s, and the other half have no bandwidth. The connections take turns at having the working state bandwidth of 10 Gb/s and switch every 2 hours. This is a “rolling blackout” strategy to share the surviving bandwidth. This corresponds to Proposition 4 in Section IV, where the minimum working state duration is 2 hours, the protection state bandwidth is zero, and a parameter of the proposition is equal to 2. This scenario corresponds to a nonreal-time application, such as data center backup during off hours.
Service Mix 5: Whenever a fault occurs, four connections have protection state bandwidth of 2.5 Gb/s, and two of the connections have working state bandwidth of 10 Gb/s. The connections take turns at having the working state bandwidth of 10 Gb/s every 2 hours. This corresponds to Proposition 4 in Section IV, where the minimum working state duration is 2 hours, the maximum protection state duration is 4 hours, the protection state bandwidth is 2.5 Gb/s, and a parameter of the proposition is equal to 3. It corresponds to a mixture of services: some real-time applications with a degraded fallback option and some offline backup services.
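A quick way to see that Service Mixes 3 to 5 each fit on the single surviving 30-Gb/s link is to add up the bandwidth of the connections in each state. The sketch below does this and also shows one possible round-robin rotation for the "rolling blackout" mixes; the helper names and the rotation order are assumptions made for illustration.

```python
def surviving_link_demand(groups):
    """Total bandwidth in Gb/s for a list of (connection_count, bandwidth_gbps)."""
    return sum(count * bandwidth for count, bandwidth in groups)


def working_group(connections, working_at_once, slot):
    """Connections holding working state bandwidth in a given 2-hour slot,
    rotating round-robin over consecutive groups (assumes an even split)."""
    if len(connections) % working_at_once != 0:
        raise ValueError("connections must split evenly into groups")
    group_count = len(connections) // working_at_once
    start = (slot % group_count) * working_at_once
    return connections[start:start + working_at_once]


LINK_GBPS = 30.0
mix3 = surviving_link_demand([(6, 5.0)])             # all six in protection state
mix4 = surviving_link_demand([(3, 10.0), (3, 0.0)])  # half working, half dark
mix5 = surviving_link_demand([(2, 10.0), (4, 2.5)])  # two working, four degraded

if __name__ == "__main__":
    for name, demand in (("Mix 3", mix3), ("Mix 4", mix4), ("Mix 5", mix5)):
        print(name, demand, demand <= LINK_GBPS)
```

In Service Mix 5, `working_group([1, 2, 3, 4, 5, 6], 2, slot)` cycles through three groups, so each connection works for 2 hours and then waits 4 hours for its next turn.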
IV. NETWORK COST
In this section, the link bandwidth requirements will be discussed for the system in Fig. 1 given the SLA framework of Section III. We will present upper and lower bounds on the link bandwidth that depend on SLA parameter values. These bounds are presented in four propositions numbered 1–4. The propositions also have constraints on the values of the parameters that show how the parameters are related to each other. For example, the choice of the minimum working state durations and the short-term availability rates will determine how small the maximum protection state durations can be.
Proposition 1 has simple lower bounds. Proposition 2 presents a simple upper bound, but where the short-term availability conditions SLA3 and SLA4 are essentially ignored. Propositions 3 and 4 have upper bounds when SLA3 and SLA4 are included. Proposition 4 has additional restrictions on the values of minimum working state durations, maximum protection state durations, and short-term availability rates that lead to lower protection state durations. Though Proposition 4 has restricted SLA parameter values, it can still be applied to useful services such as Service Mixes 4 and 5 in Section III. Next, the four propositions will be presented, followed by a comparison of the bounds.
Proposition 1: Suppose there are
is the working state bandwidth and
bandwidth from SLA1. For each connection
maximum accumulated protection state time from SLA2,
be the minimum working state duration from SLA3, and
be the maximum protection state duration and
termavailabilityratefromSLA4.SupposeAssumption1istrue.
Then, the bandwidth on a link is at least:
(a)
(b), where
(c)
, where
Proof: Part (a) is the link bandwidth required for
nections in their working states when there are no faults. For
part (b), suppose the accumulated times when there is a fault is
. SLA2 implies that, during faults, the fraction of time that
a connection
is in the protection state is at most
equivalently the fraction of time that the connection is in the
working state is at least
. Thus, during the fault, the average
bandwidth for connection
is at least
implies part (b). For part (c), suppose there is a single fault with
duration .SLA4impliesthatduringthefault,connection
be in the working state for at least
or equivalently the fraction of time that the connection is in the
working state is at least
. Thus, during the fault, the average
bandwidth required by connection
This implies part (c). QED
Proposition 2: Suppose there are
istheworkingstatebandwidthand istheprotectionstateband-
width from SLA1. For each connection , let
imum accumulated protection state time from SLA2. Suppose
and the
will determine how small the
can be.
connections. Suppose
is the protection state
, let be the
be the short-
; and
.
con-
, or
. This
will
amount of time,
is at least.
connections. Suppose
be the max-
Fig. 2. A surviving bandwidth schedule ?.
the short-term availability constraints SLA3 and SLA4 are ig-
nored, i.e., for each connection , minimum working state dura-
tion
, and maximum protection state duration
SupposeAssumption1istrue.TosatisfytheSLA,thefollowing
link bandwidth is sufficient:
.
whereis defined in Proposition 1.
Proof: A bandwidth schedule for the connections will be presented that assumes that there is one down link and one surviving link during the time interval $[0, F]$, where $F$ bounds the total time that there is some failure. The schedule determines when the connections are in their working and protection states and is defined over $[0, F]$. This will be referred to as the surviving bandwidth schedule $\mathcal{S}$ and is illustrated in Fig. 2. In the figure, the rows correspond to connections that are in their working states over time. For example, at time 0, connections 1, 2, 4, and 7 are in their working states; then after some time, connection 4 goes into its protection state, and connection 5 goes into its working state. Finally, at time $F$, connections 2, 4, and 7 are in their working states. Note that a valid schedule requires that a connection cannot appear in two or more rows at the same time.
The connections are arranged in the figure as follows. Each connection $k$ is scheduled to be in the working state for an accumulated time $F - D_k$, where $D_k$ is its maximum accumulated protection state time, so it is in the protection state for an accumulated time at most $D_k$, and SLA2 is satisfied. Connections are scheduled in order, starting with connection 1, filling one row at a time, with wraparound at the end of the row (at time $F$) to the beginning of the next. Since each connection $k$ has $F - D_k \le F$, it appears in at most one row at any time. Connections 2, 4, and 7 are examples of connections that have their schedules wrapped around from one row to the next.
The surviving bandwidth schedule $\mathcal{S}$ will be used as follows. Whenever a fault occurs, the connections follow $\mathcal{S}$, where on the first fault, the connections begin following $\mathcal{S}$ from its starting time 0. When there is no fault, all connections go to their working states. When the next fault occurs, the connections resume following $\mathcal{S}$ from where they left off. For example, suppose schedule $\mathcal{S}$ is Fig. 2, and there are three faults of durations $t_1$, $t_2$, and $t_3$. Then, Fig. 3 shows how the schedule is followed during faults. For example, fault 2 begins with connections 1, 3, 5, and 7 in their working states and ends with connections 1, 3, 5, and 8 in their working states.
Fig. 3. An example for three faults.
Note that the connections satisfy SLA2 since they are in their protection states only during faults, during which schedule $\mathcal{S}$ has them in their protection states for accumulated time at most $D_k$.
The link bandwidth of the proposition will be shown to be sufficient. When there are no faults, the link bandwidth is sufficient to carry the connections on each link in their working states. When there is a fault, the connections follow schedule $\mathcal{S}$. Then, the number of connections in their working states is at most $m = \lceil \sum_k (F - D_k)/F \rceil$, which is the number of rows in Fig. 2. Since $B_W \ge B_P$, the required bandwidth for the connections on the surviving link is at most $m B_W + (N - m) B_P$. Therefore, the link bandwidth of the proposition is sufficient. QED
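The row-filling construction in the proof can be sketched in a few lines. This is an illustrative reading of the wraparound schedule, not the authors' implementation: `working_times` holds the accumulated working time scheduled for each connection, `horizon` is the length of the schedule, and both names are ours.

```python
def surviving_bandwidth_schedule(working_times, horizon):
    """Fill rows of length `horizon` with the connections' working intervals,
    in order, wrapping from the end of one row to the start of the next.

    Returns a list of rows; each row is a list of (connection, start, end).
    """
    rows = [[]]
    cursor = 0.0
    for connection, duration in enumerate(working_times, start=1):
        if duration > horizon:
            raise ValueError("a connection cannot work longer than the horizon")
        remaining = duration
        while remaining > 1e-12:
            if horizon - cursor <= 1e-12:   # current row is full: wrap around
                rows.append([])
                cursor = 0.0
            piece = min(remaining, horizon - cursor)
            rows[-1].append((connection, cursor, cursor + piece))
            cursor += piece
            remaining -= piece
    return rows


def overlaps_itself(rows):
    """True if some connection is scheduled in two rows at the same time."""
    intervals = {}
    for row in rows:
        for connection, start, end in row:
            for other_start, other_end in intervals.get(connection, []):
                if start < other_end and other_start < end:
                    return True
            intervals.setdefault(connection, []).append((start, end))
    return False
```

Because no connection is scheduled for longer than the horizon, a wrapped connection occupies the end of one row and the beginning of the next, and the two pieces never overlap in time. The number of rows equals the total scheduled working time divided by the horizon, rounded up, which is the number of connections that must be carried at working state bandwidth at once.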
Proposition 3: Suppose there are
the working state bandwidth and
width from SLA1. Suppose each connection
accumulated protection state time
working state duration
for SLA3, and maximum protection
state duration
and short-term availability rate
Suppose Assumption 1 is true. Suppose the minimum fault-free
durationisatleast
.Supposeeachconnection
satisfy SLA2 since they are in
per link
. Then, the number of connec-
, which is
,
connections. Let
be the protection state band-
has maximum
for SLA2, minimum
be
for SLA4.
satisfies
(1)
where
sition 1. To satisfy the SLA, the following link bandwidth is
sufficient:
, and is from Propo-
(2)
The proof of the proposition is presented in Appendix B. The
proposition shows how the SLA parameters are related. For ex-
ample, (1) implies that the maximum protection state duration
is proportional to the minimum working state duration
and inversely proportional to the short-term availability rate
Note that
in the link bandwidth formula (2) is approxi-
mately equal to
, which is part of previous link bandwidth
formulas in Propositions 1 and 2. It is a close approximation if
.
TABLE I
COMPARISON OF ?
AND ? ? WHEN AVAILABILITY IS 99.9%
theminimumworkingstatedurations
than the accumulated time of all failures
pares the values of
when the network lifetime
is 99.9% (i.e.,
all connections have the same minimum working state duration
. Note that the values are about the same if the minimum
working state durations are at most an hour.
In the next proposition, Proposition 4, there are restrictions on the values of the minimum working state duration, maximum protection state duration, and short-term availability rate. By constraining the values, the connections can be scheduled more efficiently during faults, and this can shorten the maximum protection state duration for a given working state duration and short-term availability rate.
Proposition 4: Suppose there are
the working state bandwidth and
width from SLA1. For each connection , let
imum accumulated protection state time for SLA2. Suppose all
connections have the same minimum working state duration
for SLA3. For SLA4, suppose there is an integer
eachconnection ,thereisanonnegativeinteger
the maximum protection state duration is
and the short-term availability rate
Inaddition,supposethemaximum accumulatedprotection state
time
is sufficiently large so that
Suppose Assumption 1 is true. Suppose the minimum fault-free
duration is at least
. To satisfy the SLA, the following link
bandwidth is sufficient:
aremuchsmaller
. Table I com-
and for different values of
is a year, the long-termavailability
hours), and assuming
and
connections. Let
be the protection state band-
be
be the max-
, and for
suchthat
.
.
The proof of the proposition is left in Appendix C. The proof
relies on a bandwidth schedule when failures occur. Also, the
constraint
under the schedule. This constraint implies that the short-term
availability rate
(which affects the link bandwidth) must sat-
isfy
under the assumptions of the proposition.
ToillustratethatProposition4leadstosmallervaluesofmax-
imumprotectionstateduration
of Proposition 4 are true and for all connections
Then, Propositions 3 and 4 have the same link bandwidth value.
Notethat
ensures SLA2 is satisfied
,supposethattheassumptions
.
forProposition3,whereas