Disaster Recovery as a Cloud Service:
Economic Benefits & Deployment Challenges
Timothy Wood, Emmanuel Cecchet, K.K. Ramakrishnan, Prashant Shenoy, Jacobus van der Merwe, and Arun Venkataramani
University of Massachusetts Amherst and AT&T Labs - Research
{twood,cecchet,shenoy,arun}@cs.umass.edu, {kkrama,kobus}@research.att.com
Abstract
Many businesses rely on Disaster Recovery (DR) services to
prevent either manmade or natural disasters from causing ex-
pensive service disruptions. Unfortunately, current DR services
come either at very high cost, or with only weak guarantees
about the amount of data lost or time required to restart opera-
tion after a failure. In this work, we argue that cloud comput-
ing platforms are well suited for offering DR as a service due
to their pay-as-you-go pricing model that can lower costs, and
their use of automated virtual platforms that can minimize the
recovery time after a failure. To this end, we perform a pricing
analysis to estimate the cost of running a public cloud based DR
service and show significant cost reductions compared to using
privately owned resources. Further, we explore what additional
functionality must be exposed by current cloud platforms and
describe what challenges remain in order to minimize cost, data
loss, and recovery time in cloud based DR services.
1 Introduction
Our society’s growing reliance on crucial computer sys-
tems means that even short periods of downtime can re-
sult in significant financial loss, or in some cases even
put human lives at risk. Many business and government
services utilize Disaster Recovery (DR) systems to mini-
mize the downtime incurred by catastrophic system fail-
ures. Current Disaster Recovery mechanisms range from
periodic tape backups that are trucked offsite, to contin-
uous synchronous replication of data between geograph-
ically separated sites.
A key challenge in providing DR services is to support Business Continuity (BC), allowing applications to rapidly come back online after a failure occurs. By minimizing the recovery time and the data lost due to disaster, a DR service can also provide BC, but typically at high cost. In this paper we explore how virtualized cloud platforms can be used to provide low cost DR solutions that are better at enabling Business Continuity. (In this work we consider BC to be a stringent form of DR that requires applications to resume full or partial operation shortly after a disaster occurs, and we focus on the software and IT infrastructure needed to support this. In addition, a full BC plan must cover issues related to physical facilities and personnel management.)
Virtualized cloud platforms are well matched to pro-
viding DR. The “pay-as-you-go” model of cloud plat-
forms can lower the cost of DR since different amounts
of resources are needed before and after a disaster oc-
curs. Under normal operating conditions, a cloud based
DR service may only need a small share of resources to
synchronize state from the primary site to the cloud; the
full amount of resources required to run the application
only needs to be provisioned (and paid for) if a disas-
ter actually happens. The use of automated virtualization
platforms means that these additional resources can be
rapidly brought online once the disaster is detected. This
can dramatically reduce the recovery time after a failure,
a key component in enabling business continuity.
To explore the potential for using cloud computing as
a DR solution, we perform a basic pricing analysis to
understand the cost of running cloud-based DR for dif-
ferent application types and backup mechanisms. Our
results indicate that some applications can see substan-
tial economic benefits due to the on demand nature of
cloud computing platforms. We discuss under what sce-
narios clouds provide the greatest benefits for DR, and
present the limitations of current cloud platform features
and pricing schemes. Our end goal is to show how cloud
platforms can provide low cost DR services and can be
optimized to minimize data loss and recovery time in or-
der to provide both efficient disaster recovery and busi-
ness continuity.
2 How is DR Done Today?
A typical DR service works by replicating application
state between two data centers; if the primary data cen-
ter becomes unavailable, then the backup site can take
over and will activate a new copy of the application us-
ing the most recently replicated data. In this work we
focus on DR systems with the goal of providing business
continuity–allowing applications to fail over to a backup
site while minimizing service disruptions.
2.1 DR Requirements
This section discusses the key requirements for an ef-
fective DR service. Some of these requirements may be
based on business decisions such as the monetary cost of
system downtime or data loss, while others are directly
tied to application performance and correctness.
Recovery Point Objective (RPO): The RPO of a DR
system represents the point in time of the most recent
backup prior to any failure. The necessary RPO is gen-
erally a business decision—for some applications abso-
lutely no data can be lost (RPO=0), requiring continuous
synchronous replication to be used, while for other appli-
cations, the acceptable data loss could range from a few
seconds to hours or even days.
Recovery Time Objective (RTO): The RTO is an or-
thogonal business decision that specifies a bound on how
long it can take for an application to come back online
after a failure occurs. This includes the time to detect the
failure, prepare any required servers in the backup site
(virtual or physical), initialize the failed application, and
perform the network reconfiguration required to reroute
requests from the original site to the backup site so the
application can be used. Depending on the application
type and backup technique, this may involve additional
manual steps such as verifying the integrity of state or
performing application specific data restore operations,
and can require careful scheduling of recovery tasks to
be done efficiently [7]. Having a very low RTO can en-
able business continuity, allowing an application to seam-
lessly continue operating despite a site wide disaster.
Performance: For a DR service to be useful it must
have a minimal impact on the performance of each appli-
cation being protected under failure-free operation. DR
can impact performance either directly such as in the syn-
chronous replication case where an application write will
not return until it is committed remotely, or indirectly by
simply consuming disk and network bandwidth resources
that the application could otherwise use.
Consistency: The DR service must ensure that after a
failure occurs the application can be restored to a con-
sistent state. This may require the DR mechanism to
be application specific to ensure that all relevant state is
properly replicated to the backup site. In other cases, the
DR system may assume that the application will keep a
consistent copy of its important state on disk, and use a
disk replication scheme to create consistent copies at the
backup site.
Geographic Separation: It is important that the pri-
mary and backup sites are geographically separated in or-
der to ensure that a single disaster will not impact both
sites. This geographic separation adds its own challenges
since increased distance leads to higher WAN bandwidth
costs and will incur greater network latency. Increased
round trip latency directly impacts application response
time when using synchronous replication. As round trip
delays are limited by the speed of light, synchronous
replication is feasible only when the backup site is within
tens of kilometers of the primary. Asynchronous tech-
niques can improve performance over longer distances,
but can lead to greater data loss during a disaster. Dis-
tance can especially be a challenge in cloud based DR
services as a business might have only coarse control over
where resources will be physically located.
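To make the distance constraint concrete, the sketch below estimates the minimum round-trip latency added to every synchronous write. It assumes signals propagate through fiber at roughly two-thirds of the speed of light and ignores queuing and processing delays; the distances shown are illustrative choices, not values from the paper.

```python
# Estimate the round-trip propagation delay that synchronous replication
# adds to every write, for a range of primary-to-backup distances.
# Assumption: light travels through fiber at ~2/3 c (~200,000 km/s);
# queuing, serialization, and routing overheads are ignored.

SPEED_IN_FIBER_KM_PER_MS = 200.0  # ~200,000 km/s expressed per millisecond

def sync_write_penalty_ms(distance_km: float) -> float:
    """Minimum extra latency (ms) per synchronous write: one round trip."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS

if __name__ == "__main__":
    for km in (10, 50, 200, 1000):
        print(f"{km:>5} km separation -> +{sync_write_penalty_ms(km):.2f} ms per write")
    # At tens of kilometers the penalty stays well under 1 ms, which is why
    # synchronous replication is usually limited to nearby sites; at 1000 km
    # it reaches roughly 10 ms per write.
```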
2.2 DR Mechanisms
Disaster Recovery is primarily a form of long distance
state replication combined with the ability to start up ap-
plications at the backup site after a failure is detected.
The amount and type of state that is sent to the backup
site can vary depending on the application’s needs. State
replication can be done at one of these layers: (i) within
an application, (ii) per disk or within a file system, or
(iii) for the full system context. Replication at the appli-
cation layer can be the most optimized, only transferring
the crucial state of a specific application. For example,
some high-end database systems replicate state by trans-
ferring only the database transaction logs, which can be
more efficient than sending the full state modified by each
query [8]. Backup mechanisms operating at the file sys-
tem or disk layer replicate all or a portion of the file sys-
tem tree to the remote site without requiring specific ap-
plication knowledge [6]. The use of virtualization makes
it possible to not only transparently replicate the com-
plete disk, but also the memory context of a virtual ma-
chine, allowing it to seamlessly resume operation after a
failure; however, such techniques are typically designed
only for LAN environments due to significant bandwidth
and latency requirements [4, 9].
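As a concrete illustration of disk- or block-level replication, the sketch below acknowledges writes locally and ships dirty blocks to the backup site asynchronously; the RPO is then roughly the flush interval plus the network delay. The class and the send_to_backup() transport are hypothetical stand-ins, not components of any system described in the paper.

```python
# Minimal sketch of asynchronous block-level replication: writes complete
# locally right away, and a background thread ships dirty blocks to the
# backup site on a fixed interval. All names here are illustrative.
import threading
import time

class AsyncReplicatedDevice:
    def __init__(self, send_to_backup, flush_interval_s=5.0):
        self.blocks = {}                 # block_id -> bytes (local store)
        self.dirty = set()               # blocks written but not yet shipped
        self.lock = threading.Lock()
        self.send_to_backup = send_to_backup
        self.flush_interval_s = flush_interval_s
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write(self, block_id, data):
        """Local write returns immediately; replication happens later."""
        with self.lock:
            self.blocks[block_id] = data
            self.dirty.add(block_id)

    def _flush_loop(self):
        while True:
            time.sleep(self.flush_interval_s)
            with self.lock:
                batch = {b: self.blocks[b] for b in self.dirty}
                self.dirty.clear()
            if batch:
                # Anything written since the last flush is what a disaster
                # occurring right now could lose (the effective RPO).
                self.send_to_backup(batch)
```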
The level of data protection and speed of recovery de-
pends on the type of backup mechanism used and the na-
ture of resources available at the backup site. In general,
DR services fall under one of the following categories:
Hot Backup Site: A hot backup site typically provides
a set of mirrored stand-by servers that are always avail-
able to run the application once a disaster occurs, provid-
ing minimal RTO and RPO. Hot standbys typically use
synchronous replication to prevent any data loss due to a
disaster. This form of backup is the most expensive since
fully powered servers must be available at all times to
run the application, plus extra licensing fees may apply
for some applications. It can also have the largest impact
on normal application performance since network latency
between the two sites increases response times.
Warm Backup Site: A warm backup site may keep
state up to date with either synchronous or asynchronous
replication schemes depending on the necessary RPO.
Standby servers to run the application after failure are
available, but are only kept in a “warm” state where it
may take minutes to bring them online. This slows re-
covery, but also reduces cost; the server resources to run
the application need to be available at all times, but ac-
tive costs such as electricity and network bandwidth are
lower during normal operation.
Cold Backup Site: In a cold backup site, data is of-
ten only replicated on a periodic basis, leading to an RPO
of hours or days. In addition, servers to run the applica-
tion after failure are not readily available, and there may
be a delay of hours or days as hardware is brought out
of storage or repurposed from test and development sys-
tems, resulting in a high RTO. It can be difficult to sup-
port business continuity with cold backup sites, but they
are a very low cost option for applications that do not
require strong protection or availability guarantees.
The on-demand nature of cloud computing means that
it provides the greatest cost benefit when peak resource
demands are much higher than average case demands.
This suggests that cloud platforms can provide the great-
est benefit to DR services that require warm stand-by
replicas. In this case, the cloud can be used to cheaply
maintain the state of an application using low cost re-
sources under ordinary operating conditions. Only after
a disaster occurs must a cloud based DR service pay for
the more powerful–and expensive–resources required to
run the full application, and it can add these resources
in a matter of seconds or minutes. In contrast, an enter-
prise using its own private resources for DR must always
have servers available to meet the resource needs of the
full disaster case, resulting in a much higher cost during
normal operation.
2.3 Failover and Failback
In addition to managing state replication, a DR solution
must be able to detect when a disaster has occurred, per-
form a failover procedure to activate the backup site, as
well as run the failback steps necessary to revert con-
trol back to the primary data center once the disaster
has been dealt with. Detecting when a disaster has oc-
curred is a challenging problem since transient failures
or network segmentation can trigger false alarms. In
practice, most DR techniques rely on manual detection
and failover mechanisms. Cloud based systems can sim-
plify this problem by monitoring the primary data center
from cloud nodes distributed across different geographic
regions, making it simpler to determine the extent of a
network failure and react accordingly.
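A minimal sketch of this monitoring idea follows, assuming a hypothetical probe_from(region, primary_url) helper that runs a health check from a cloud node in the given region. Declaring a disaster only when a quorum of vantage points agrees helps distinguish a site-wide failure from a local network problem or a single monitor's false alarm.

```python
# Sketch: declare a disaster only if a quorum of geographically distributed
# cloud monitors cannot reach the primary site. probe_from() is a hypothetical
# helper that performs a health check from the given cloud region and returns
# True when the primary responds correctly.

def primary_is_down(primary_url, regions, probe_from, quorum=None):
    """Return True if a quorum of vantage points report the primary unreachable."""
    quorum = quorum or (len(regions) // 2 + 1)
    failures = 0
    for region in regions:
        try:
            if not probe_from(region, primary_url):   # unhealthy response
                failures += 1
        except Exception:                             # timeout / unreachable
            failures += 1
    return failures >= quorum

# Example: with monitors in three regions, two failed probes trigger failover,
# while a single regional network glitch does not.
```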
In most cases, a disaster will eventually pass, and a
business will want to revert control of its applications
back to the original site. To do this, the DR software
must support bidirectional state replication so that any
new data that was created at the backup site during the
disaster can be transferred back to the primary. Doing
this efficiently can be a major challenge: the primary site may have lost an arbitrary amount of data due to the disaster, so the replication software must be able to determine what new and old state must be resynchronized to the original site. In addition, the failback procedure must be scheduled and implemented in order to minimize the level of application downtime.

[Figure 1 diagram: the primary data center (3x web servers, a database, and its disk) in Replication Mode synchronizes state to a single DR server in the DR cloud; the cloud's Failover Mode resources remain inactive until a disaster, after which client traffic is redirected from the primary site to the cloud.]
Figure 1: RUBiS is configured with 3 web servers and 1 database at the primary site. In ordinary operation the cloud only requires a single DR server to maintain database state, and only initializes the full application resources once a disaster occurs. After the failure, client traffic must be redirected to the cloud site.
3 DR as a Cloud Service
While there are many types of DR that can be provided
using cloud resources, we focus on a warm standby sys-
tem where important application state is continuously
replicated into the cloud. Figure 1 illustrates this setup
for a web application that requires four servers (one
database and three web servers) in the primary site.
Within the cloud providing DR, the level of resources
required depends on whether it is in Replication Mode
or Failover Mode. During normal operation, the system
stays in Replication Mode, and requires only a single low
cost VM to act as the DR Server that handles the state
synchronization. When a disaster occurs, the system en-
ters Failover Mode, which requires resources to support
the full application. In this section we analyze the costs
of this form of DR and discuss both the benefits and chal-
lenges remaining for DR in the cloud.
3.1 Are Clouds Cheaper for DR?
We first study the costs associated with disaster recovery
services to understand if clouds can actually make DR
cheaper. We compare the cost of running a DR service using public cloud resources against a "build your own" DR service using an enterprise's own private resources. To estimate the cost of the latter approach, we use the price of renting resources from a colocation facility. This is a reasonable estimate for small to medium size businesses which may own a single data center but cannot afford the additional expense of a second full data center as a DR site.

(a)
RUBiS                 Public Cloud                     Colocation
                      Replication     Failover         Replication     Failover
Servers               $2.04           $32.64           $26.88          $26.88
Network               $0.54           $18.00           $1.16           $39.14
Storage               $1.22                            $1.39
Total per day         $3.80           $52.03           $28.04          $66.01
Total per year        $1,386          $18,992          $10,234         $24,095
99% uptime cost       $1,562 per year                  $10,373 per year

(b)
Resource Consumption  Replication            Failover
Servers               1 (cloud) / 4 (colo)   4
Network               5.4 GB/day             180 GB/day
Storage               30 GB                  30 GB
IO                    130 req/sec            150 req/sec

Figure 2: (a) Cost per day and year for providing DR services for RUBiS. Under normal operation, only the Replication Mode cost must be paid, leading to substantial savings when using a cloud platform. (b) Resources required during Replication and Failover Modes are the same for the cloud and colocation center except that the colo center must always have 4 servers available.
Our cost study is meant to be illustrative rather than
definitive—we found a wide range of prices for both
cloud and colo providers, and we do not include fac-
tors such as management costs which may not be equiv-
alent in each case. While large enterprises that own
multiple data centers may be able to obtain cheaper re-
sources by running DR between their sites, they will
still face the same cost model as the colocation facil-
ity. Past cost studies indicate that the primary costs of
running a private data center are for purchasing servers
and infrastructure—costs that do not change regardless of
whether servers are actively used or not [5]. In contrast,
the cloud’s pay-as-you-go model benefits users who can
turn resources on and off as needed, which is exactly the
case in disaster recovery services that acquire resources
on demand only after a failure occurs.
3.1.1 Case Study: Multi-tier Web Application
To understand the cost of providing DR in the cloud, we
first consider a common multi-tier web application archi-
tecture composed of several web front ends connected to
a database server containing the persistent state for the
application. This scenario illustrates how some compo-
nents of an application may have different DR require-
ments. The web servers in this example contain only
transient state (e.g., session cookies that can be lost with-
out significantly disrupting the application) and only re-
quire a weak backup policy; we assume that all the front
ends can be recreated from a template image stored in the
backup site and do not require any other form of synchro-
nization. The database node, however, requires stronger
consistency and uses a disk based replication scheme to
send all writes to a VM in the backup site. Applications
such as this are a natural fit for a cloud based DR ser-
vice because fewer resources are required to replicate the
important state than to run the full application.
To analyze the cost of providing DR for such an appli-
cation, we calculate the Replication Mode and Failover
Mode costs of running DR for the RUBiS web bench-
mark. RUBiS is an e-commerce web application that
can be run using multiple Tomcat servers and a MySQL
database [3]. Figure 1 shows RUBiS’s structure and how
it replicates state to the cloud. We calculate costs based
on resource usage traces recorded from running RUBiS
with 300 clients, and prices gathered from Amazon’s
Cost Comparison Calculator [1]; we have validated that
the colocation pricing information is competitive with of-
ferings from other providers.
Cost Breakdown: Figure 2(a) shows the daily and yearly costs of running the DR service with a public cloud or a private colocation facility. In the cloud, Replication Mode requires only a single "small" VM to act as the DR server, whereas the colocation DR approach must always be provisioned with the four "large" servers needed
to run the application during failover. Figure 2(b) shows
the resource requirements for both modes. The network
and IO consumption during failover mode includes the
web traffic between the live application and its clients, whereas the
replication mode only includes the replicated state per-
sisted to the database. The storage cost for EC2 is based
on EBS volumes (Amazon’s persistent storage product)
and IO costs, whereas the colocation center storage cost
is included as part of the server hardware costs.
99% Uptime Cost: Since disasters are rare, most of
the time only the Replication Mode cost must be paid.
The best way to compare total costs is thus to calcu-
late the yearly cost of each approach based on a cer-
tain level of downtime caused by disasters. Assuming a
99% uptime model where a total of 3.6 days of downtime
is handled by transitioning from Replication to Failover
Mode, the yearly cost of the cloud DR service comes to
only $1,562, compared to $10,373 with the colocation
provider—an 85% reduction (Figure 2a). This illustrates
the benefit of the cloud's pay-as-you-go pricing model—substantial savings can be achieved if the cost to synchronize state to a backup site is lower than the cost of running the full application.

(a)
Data Warehouse        Public Cloud                     Colocation
                      Replication     Failover         Replication     Failover
Servers               $4.08           $12.00           $8.51           $8.51
Network               $0.10           $0.12            $0.22           $0.26
Storage               $3.50                            $3.92
Total per day         $7.68           $16.04           $8.73           $8.77
Total per year        $2,802          $5,853           $3,186          $3,202
99% uptime cost       $2,832 per year                  $3,186 per year

(b)
[Figure 3(b) bar chart: yearly 99% uptime cost ($0K-$4K) for cloud vs. colo as the number of backups per day varies (1, 2, 6, 12), compared against continuous replication.]

Figure 3: (a) Cost for providing DR services for the data warehouse application. The cloud provides only moderate savings due to high storage costs. (b) Using periodic backups can significantly lower the price of DR in the cloud by reducing the cost of VMs.
Cost of Adding DR: Our analysis so far assumed that the primary site runs on the user's own private resources, but it could also run in the cloud. How-
ever, simply using cloud resources does not eliminate the
need for DR—it is still critical to run a DR service to
ensure continued operation if the primary cloud provider
is disrupted. Running the whole application in the cloud
costs $18,992 per year and using cloud DR in addition
only adds 8%. Running the application in a colo center
costs more in the first place ($24,095 per year) but adding
DR in a second colo facility increases the total cost by al-
most 42%. Finally, if a colocation center is used for the
primary site but a cloud is used for DR, then the incre-
mental cost of having DR is only 6.5%.
3.1.2 Case Study: Data Warehouse
Our second case study analyzes the cost of providing DR
for a Data Warehouse application. A data warehouse
records data such as a stream of website clicks or sales
information produced by other applications. Data is typi-
cally appended to the warehouse at regular intervals, and
reports are generated based on the incoming and exist-
ing data sets. We consider a small sized Data Warehouse
with a 1TB capacity that adds 1 GB of new data per day.
To run the full application, a powerful server is required–
we estimate costs based on a “High-Memory Extra Large
Instance” from EC2.
Cost Breakdown: Figure 3(a) shows the cost for run-
ning the data warehouse application. We assume that the
cloud based DR system requires a “medium” size VM
as a backup server due to the application's IO-intensive workload, result-
ing in a relatively high server cost even under normal op-
eration. Additionally, the cloud must pay a large stor-
age cost to support the 1TB capacity of the data ware-
house. As a result, the cloud based DR service provides a
smaller benefit because its Replication Mode cost is only
slightly lower than the cost in a colocation facility, and
its Failover Mode cost is significantly higher.
99% Uptime Cost: By comparing the Failover Mode
costs, it is clear that it is cheapest to use a colocation
center as the primary site of the data warehouse ($5,853
per year in the cloud versus $3,202 per year in a coloca-
tion center). However, since the replication cost for the
cloud is lower and is incurred for 99% of the time, the
total cost is still lower for the cloud. Despite having a
higher Failover Mode price, the cloud based DR system
still lowers the total DR cost from $3,186 to $2,832 over
a one year period assuming 99% uptime.
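The uptime-weighted totals reported in Figures 2(a) and 3(a) follow from weighting the two modes by the assumed downtime. The sketch below reproduces those numbers, taking 3.65 days (1% of a year) as the downtime handled in Failover Mode; small differences from the published figures come from rounding of the per-day costs.

```python
# Reproduce the 99%-uptime yearly costs from Figures 2(a) and 3(a):
# pay the Replication Mode rate for ~99% of the year and the Failover Mode
# rate for the remaining ~3.65 days.

DAYS_PER_YEAR = 365.0

def yearly_dr_cost(replication_per_day, failover_per_day, downtime_days=3.65):
    normal_days = DAYS_PER_YEAR - downtime_days
    return replication_per_day * normal_days + failover_per_day * downtime_days

if __name__ == "__main__":
    # RUBiS (Figure 2a)
    print(round(yearly_dr_cost(3.80, 52.03)))   # ~1563 (paper reports $1,562)
    print(round(yearly_dr_cost(28.04, 66.01)))  # ~10373
    # Data warehouse (Figure 3a)
    print(round(yearly_dr_cost(7.68, 16.04)))   # ~2834 (paper reports $2,832)
    print(round(yearly_dr_cost(8.73, 8.77)))    # ~3187 (paper reports $3,186)
```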
Periodic Backups: The data warehouse application
obtains a smaller economic benefit from the cloud than
seen in the multi-tier web application case study due to
its increased server and storage requirements during or-
dinary operation. However, the flexibility of cloud re-
sources can help reduce this cost if the application can
tolerate a weaker RPO. For example, it may be sufficient
to only send periodic backups to the cloud site once ev-
ery few hours or after each bulk load, rather than running
the DR service continuously. Assuming that one hour of
VM time is charged per backup, Figure 3(b) shows how
the cost of DR can be substantially lowered by reducing
the backup frequency. While a similar approach could be
used in a private data center to reduce energy consump-
tion, it would have a much smaller effect on overall cost
since power usage of individual servers is a minor frac-
tion compared to the cost of hardware and space that must
be paid regardless of whether a machine is in use or not.
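A rough model of the effect shown in Figure 3(b) is sketched below. It assumes each periodic backup is charged one VM-hour at an hourly rate derived from Figure 3(a)'s $4.08/day replication server (about $0.17/hour), and it holds the storage and network components constant; these are simplifying assumptions rather than values taken from the paper's analysis.

```python
# Rough model of the periodic-backup savings in Figure 3(b): instead of keeping
# the replication VM running all day ($4.08/day), charge one VM-hour per backup.
# The hourly rate and fixed storage/network costs are assumptions derived from
# Figure 3(a), not exact values from the paper.

VM_HOURLY = 4.08 / 24          # ~$0.17/hour for the "medium" replication VM
FIXED_PER_DAY = 0.10 + 3.50    # network + storage from Figure 3(a), held constant

def replication_cost_per_day(backups_per_day=None):
    if backups_per_day is None:            # continuous replication
        vm_cost = 4.08
    else:                                  # one VM-hour billed per backup
        vm_cost = backups_per_day * VM_HOURLY
    return vm_cost + FIXED_PER_DAY

for n in (1, 2, 6, 12, None):
    label = "continuous" if n is None else f"{n:>2} backups/day"
    print(f"{label}: ${replication_cost_per_day(n):.2f} per day")
# Fewer backups per day shrink the dominant VM cost, at the price of a weaker RPO.
```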
3.2 Benefits of the Cloud
Under current pricing schemes, cloud based DR services
will not see much benefit when used for applications that
require true “hot” standby servers since this can signifi-
cantly raise the cost during normal operation. However,
for applications that can tolerate recovery times on the
order of 200 seconds (a typical VM startup time in the
EC2 cloud), substantial savings can be found by using low-cost servers for state replication under ordinary conditions and provisioning powerful ones only after a disaster occurs.
Cloud DR services may be able to obtain additional eco-
nomic benefits by multiplexing a single replication server
for multiple applications, further lowering the cost of re-
sources under normal operation. For applications with a
loose RPO, the cloud can provide even greater benefits by
only initiating the replication service a few times a day to
create periodic backups.
Cloud computing can facilitate disaster recovery by significantly lowering costs:
- The cloud's pay-as-you-go pricing model significantly lowers costs due to the different level of resources required before and during a disaster.
- Cloud resources can quickly be added with fine granularity and have costs that scale smoothly without requiring large upfront investments.
- The cloud platform manages and maintains the DR servers and storage devices, lowering IT costs and reducing the impact of failures at the disaster site.
The benefits of virtualization, while not necessarily specific to cloud platforms, still provide important features for disaster recovery:
- VM startup can be easily automated, lowering recovery times after a disaster.
- Virtualization eliminates hardware dependencies, potentially lowering hardware requirements at the backup site.
- Application-agnostic state replication software can be run outside of the VM, treating it as a black box.
These characteristics can simplify the replication and deployment of resources in a cloud DR site, and enable business continuity by reducing recovery times.
4 Challenges for the Cloud Provider
Although cloud-based DR can provide economic benefits
for a customer, such a service raises numerous challenges
for a cloud provider, as discussed next.
4.1 Handling Correlated Failures
Typically a cloud provider will attempt to statistically
multiplex its DR customers onto its server pool. Such
statistical multiplexing assumes that not all of its cus-
tomers will experience simultaneous failures, and hence
the number of free servers that the cloud provider must have available is smaller than the peak needs of all its customers. However, correlated failures across customers are not uncommon—for instance, an electric grid failure or a natural disaster such as a flood can cause a large number of customers from a geographic area to simultaneously fail over to their DR sites. To prevent such correlated
failures from stressing any one data center, the cloud
provider should attempt to distribute its DR customers
across multiple data centers in a way that minimizes po-
tential conflicts—e.g. multiple customers from the same
geographic region should be backed up to different cloud
data centers. This placement problem is further compli-
cated by constraints such as limits on latency between
the customer and cloud site. To intelligently address this
issue, the cloud provider must employ risk models—not
unlike ones used by insurance companies—to (i) estimate
how many servers should be available in a data center for
a certain group of customers and (ii) determine how to distribute
customers from a region across different data center sites
to “spread the risk”. In the event of stress on any single
data center due to correlated failures, dynamic migration
of a group of customers to another site can be employed.
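One simple way to operationalize this "spread the risk" idea is a greedy placement that, for each customer, picks the eligible data center currently carrying the least load from that customer's geographic region. The sketch below is illustrative only; the capacity figures and the reachable() latency constraint are hypothetical inputs, not part of the paper's proposal.

```python
# Greedy sketch of risk-aware placement: customers from the same geographic
# region are spread across data centers so that a regional disaster does not
# concentrate failovers on one site. Inputs are illustrative.
from collections import defaultdict

def place_customers(customers, data_centers, reachable):
    """
    customers:    list of (customer_id, region, servers_needed_on_failover)
    data_centers: dict of dc_name -> spare server capacity
    reachable:    function (region, dc_name) -> True if the latency/RPO
                  constraint between that region and data center is acceptable
    Returns:      dict customer_id -> dc_name
    """
    load_by_region = defaultdict(lambda: defaultdict(int))  # region -> dc -> servers
    spare = dict(data_centers)
    placement = {}
    for cust_id, region, need in customers:
        options = [dc for dc in spare if spare[dc] >= need and reachable(region, dc)]
        if not options:
            raise RuntimeError(f"no feasible data center for {cust_id}")
        # Prefer the data center with the least exposure to this region.
        dc = min(options, key=lambda d: load_by_region[region][d])
        placement[cust_id] = dc
        spare[dc] -= need
        load_by_region[region][dc] += need
    return placement
```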
To achieve all of these tasks seamlessly, the cloud
provider should be able to treat all of its data centers
as a single pool of resources available to its DR cus-
tomers [10, 2]. In practice, current data centers act as iso-
lated entities and it is non-trivial to move or replicate stor-
age and computation resources between data centers. We
believe that future cloud architectures will rely on net-
work virtualization to provide seamless connectivity be-
tween data centers, and wide-area VM and storage migra-
tion to allow for resource management across data center
sites.
4.2 Revenue Maximization
The DR strategies we have discussed assume that cus-
tomers only pay for the majority of their DR resources
after some kind of failure actually occurs, and that suf-
ficient resources are always available when needed. The
cloud service provider must maintain these resources and
pay for their upkeep at all times, regardless of whether
a customer has experienced a failure. Since disasters are
typically rare, there will be little or no revenue from the
server farm in the normal case when there are no fail-
ures. Hence, a cloud provider must find ways to generate
revenue from such idling resources in order to make its
capital investments viable.
We assume that a cloud DR provider will also offer tra-
ditional cloud computing services and rent its resources
to customers for non-DR purposes. In this case, the cloud
may be able to “double book” its servers for both regu-
lar and DR customers. Public clouds generally only offer
best effort service when new VM or network resources
are requested. While this is sufficient for general cloud
computing, in disaster recovery it is imperative that ad-
ditional resources be available within the specified RTO.
One existing pricing mechanism that would facilitate this
on demand access to resources is the use of “spot in-
stances”. Spot instances allow the service provider to
rent resources, typically at a lower price, without guar-
antees about how long they will be available. A cloud
service could generate revenue from idling DR servers by
offering them as spot instances to non-DR customers and
reclaim them on-demand when these servers are needed
for high priority DR customers.
Currently, cloud platforms often provide few guaran-
tees about server and bandwidth availability and network
quality of service, which are important for ensuring an
application can fully operate after failover. EC2 currently
supports “reserved” VM instances that are guaranteed to
be available, but they are primarily designed for users
who know that they will be actively running a VM for
a long period of time, and their pricing is designed to re-
flect this with a moderate yearly fee but cheaper hourly
costs. For disaster recovery, it may be desirable to al-
low for “priority resources” which are guaranteed to be
available on demand, although perhaps at a higher hourly
cost than ordinary VM instances or network bandwidth
(which also increases the revenue for the cloud provider
while providing better assurances to a customer).
4.3 Mechanisms for Cloud DR
While cloud computing platforms already contain many
useful features for supporting disaster recovery, there are
additional requirements they must meet before they can
provide DR as a cloud service.
Network Reconfiguration: For a cloud DR service to
provide true business continuity, it must facilitate recon-
figuring the network setup for an application after it is
brought online in the backup site. We have previously
proposed how a cloud infrastructure can be combined
with virtual private networks (VPNs) to support this kind
of rapid reconfiguration for applications that only com-
municate within a private business environment [10].
Public Internet facing applications would require addi-
tional forms of network reconfiguration through either
modifying DNS or updating routes to redirect traffic to
the failover site. To support any of these features, cloud
platforms need greater coordination with network service
providers.
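As one concrete example of DNS-based redirection (the VPN-based approach for private applications is described in [10]), the sketch below repoints a low-TTL DNS record at the failover site. It uses Amazon Route 53 via boto3 purely as a modern illustration of a managed DNS API, which the paper itself does not assume; the hosted zone ID, record name, and addresses are placeholders.

```python
# Illustrative DNS-based failover: repoint a low-TTL A record from the primary
# site's address to the cloud failover site's address. Shown with Amazon
# Route 53 via boto3 as one possible managed DNS API; zone ID, record name,
# and IP addresses below are placeholders.
import boto3

def redirect_to_failover(zone_id, record_name, failover_ip, ttl=60):
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": "DR failover: send client traffic to the backup site",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,  # short TTL so clients re-resolve quickly
                    "ResourceRecords": [{"Value": failover_ip}],
                },
            }],
        },
    )

# e.g. redirect_to_failover("Z123EXAMPLE", "app.example.com.", "203.0.113.10")
```

The failback path would apply the same kind of update in reverse once the primary site is restored and resynchronized.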
Security & Isolation: The public nature of cloud
computing platforms remains a concern for some busi-
nesses. In order for an enterprise to be willing to fail over
from its private data center to a cloud during a disaster it
will require strong guarantees about the privacy of stor-
age, network, and the virtual machine resources it uses.
Likewise, clouds must guarantee that the performance of
applications running in the cloud will not be impacted by
disasters affecting other businesses.
VM Migration & Cloning: Current cloud comput-
ing platforms do not support VM migration in or out
of the cloud. VM migration or cloning would simplify
the failback procedure for moving an application back
to its original site after a disaster has been dealt with.
This would also be a useful mechanism for facilitating
planned maintenance downtime. The Remus system [4]
has demonstrated how a continuous form of VM migra-
tion can be used to synchronize both memory and disk
state of a virtual machine to a backup server. This could
potentially allow for full system DR mechanisms that al-
low completely transparent failover during a disaster. To
support this, clouds must expose additional hypervisor
level functionality to their customers, and migration tech-
niques must be optimized for WAN environments.
5 Ongoing Work and Conclusions
We have argued that cloud computing platforms are an
excellent match for providing disaster recovery services
due to their pay-as-you-go pricing model and ability to
rapidly bring resources online after a disaster. The flexi-
bility of cloud resources also allows enterprises to make
a trade off between data protection and price to an ex-
tent not possible when using private resources that must
be statically provisioned. We have compared the costs
of running DR services using public cloud or privately
owned resources, and shown cost reductions of up to 85%
by taking advantage of cloud resources.
In our ongoing work, we are developing Dr. Cloud, a
prototype DR system that we can use to understand the
potential for using existing cloud platforms to provide
DR. This will allow us to better understand what fea-
tures and optimizations must be included within the cloud
platform itself, and to explore the tradeoffs between cost,
RPO, and RTO in a cloud DR service.
Acknowledgements: This work was supported in
part by NSF grants CNS-0720271, CNS-0720616, CNS-
09169172, and CNS-0834243, as well as by AT&T. We
also thank our reviewers for their comments and sugges-
tions.
References
[1] AWS Economics Center. http://aws.amazon.com/economics/.
[2] Rajkumar Buyya, Rajiv Ranjan, and Rodrigo N. Calheiros. InterCloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Application Services. In Proceedings of the 10th International Conference on Algorithms and Architectures for Parallel Processing, Busan, Korea, 2010.
[3] Emmanuel Cecchet, Anupam Chanda, Sameh Elnikety, Julie Marguerite, and Willy Zwaenepoel. Performance Comparison of Middleware Architectures for Generating Dynamic Web Content. In Proceedings of the 4th ACM/IFIP/USENIX International Middleware Conference, June 2003.
[4] Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield. Remus: High Availability via Asynchronous Virtual Machine Replication. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2008.
[5] Albert Greenberg, James Hamilton, David A. Maltz, and Parveen Patel. The Cost of a Cloud: Research Problems in Data Center Networks. ACM SIGCOMM Computer Communication Review, February 2009.
[6] Kimberly Keeton, Cipriano Santos, Dirk Beyer, Jeffrey Chase, and John Wilkes. Designing for Disasters. In Proceedings of the Conference on File and Storage Technologies (FAST), 2004.
[7] Kimberly Keeton, Dirk Beyer, Ernesto Brau, Arif Merchant, Cipriano Santos, and Alex Zhang. On the Road to Recovery: Restoring Data after Disasters. European Conference on Computer Systems (EuroSys), 40(4), 2006.
[8] Tirthankar Lahiri, Amit Ganesh, Ron Weiss, and Ashok Joshi. Fast-Start: Quick Fault Recovery in Oracle. ACM SIGMOD Record, 30(2), 2001.
[9] VMware High Availability. http://www.vmware.com/products/high-availability/.
[10] T. Wood, A. Gerber, K. Ramakrishnan, J. Van der Merwe, and P. Shenoy. The Case for Enterprise-Ready Virtual Private Clouds. In Proceedings of the USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), San Diego, CA, June 2009.
7
... Due to the present problems with data recovery and backup in a single cloud, massive amounts of data storage are being used by duplicating data to many data centres within the same cloud. Furthermore, a single cloud may run into issues including hardware malfunction, software bugs, network intrusions, natural catastrophes, and human-caused harm that affect the data that is stored [10,11]. These issues often cause service interruptions, and in the worst scenario, they might corrupt data and cause the system to collapse. ...
... Certain providers of cloud services came up with workable strategies to prevent this issue, such as using regional data dispersion to safeguard the most important data. Additionally, [11,12] one cloud service provider has data centres in many locations, and these centres often employ comparable infrastructures-bulk purchases, operation mechanisms, and management teams-and software stacks. The procedure used by an organisation to restore business operations after a disruptive incident is known as disaster recovery, or DR. ...
... This condition often refers to the quantity of data that may be lost between the moment of a critical incident and the most recent backup, within a time frame that is most relevant to a company, before severe damage arises. RPO is dependent on business decisions, as mentioned in [11], yet it should be remembered that certain software applications and technologies still need RPO=0 in spite of this determination. ii. ...
Article
Full-text available
The most important components of a firm are business continuity and disaster recovery planning, although they are often disregarded. Even before a crisis strikes, businesses need to have a well-organized strategy and documentation for business continuity and recovery after a disaster. A single cloud is characterised as a collection of servers housed in one or more data centres that are provided by a single supplier. Nonetheless, there are several reasons why switching from a single cloud to multiple clouds is sensible and crucial. For example, single cloud providers are still vulnerable to outages, which impacts the database's availability. Furthermore, the single cloud may experience partial or whole data loss in the event of a catastrophe. Due to the significant risks of database accessibility failure and the potential for malevolent insiders inside the single cloud, it is anticipated that consumers would become less fond of single clouds. Cloud-based Disaster Recovery (DR) enables the coordinated use of resources from many cloud services offered by the DR Service provider. Thus, it is essential to create a workable multi-cloud-based Disaster Recovery (DR) architecture that minimises backup costs in relation to Recovery Time Objective (RTO) and Recovery Point Objective (RPO). By achieving high data dependability, cheap backup costs, quick recovery, and business continuity before to, during, and after the catastrophic incidence, the framework should preserve accessibility to data. This study suggests a multi-cloud architecture that ensures high data availability before to, during, and after the catastrophe. Additionally, it guarantees that database services will continue both before and after the financial crisis.
... Our society's increasing dependence on critical computer systems means that even short periods of downtime can result in significant financial losses and, in some cases, even put human lives at risk. A key challenge is to handle the business continuity of companies, enabling them to return to operation quickly after a disaster (Wood et al. 2010). Optimized use of IT services, including cloud computing services, is a condition for business continuity, which is using these services increasingly intensive. ...
... Optimized use of IT services, including cloud computing services, is a condition for business continuity, which is using these services increasingly intensive. The disaster-recovery mechanisms in clouds are cheaper than the traditional ones and is also more convenient and user friendly (Wood et al., 2010). In principle, cloud computing has the potential to become an energy-efficient technology (e.g. for ICT), as some academics say (Berl et al., 2010) provided that the potential for its significant energy savings, which has so far been focused on hardware aspects, can be fully referenced to system operation and network aspects. ...
Article
Full-text available
Purpose: The use of cloud services grows rapidly, due in part to the development of Industry 4.0 concept and massive use of new artificial intelligence applications. It should be noted that this involves significantly increased electricity consumption. This increase is mainly concentrated in data centers, but the understanding energy efficiency practices on user sides is also necessary. Design/methodology/approach: The research problem of the article was the assessing the level of managers awareness in manufacturing companies regarding the energy consumption of cloud computing services. As a first step literature study was performed out on the importance of carrying out energy efficiency measures and practices, especially related to the development of cloud computing services. The empirical part of the study was based on a quantitative method, the opinions of 300 IT managers were collected using a questionnaire designed for this purpose (the Cronbach's alpha 0.957). Findings: The overall of the importance of the investigated energy efficiency practices is low and their perception slightly varies according to the size of the organization. Research limitations/implications: Only a few selected from wide range of practices were surveyed restricted to the area of purchasing and IT management. Managers' opinions were questioned, the research did not include documents (e.g. procurement documents) to confirm opinions. Practical implications: The future introduction of the obligation to conduct an energy audit in medium and small organisations can raise awareness of the need for energy management in the procurement and use of IT services. It should be motivated and supported by activities carried out by governmental and social organisations. Social implications: Increasing awareness and action related to energy management in the purchase and IT services management has a positive impact on the environment, especially while their massive use. Originality/value: The article points out the need for organizations to raise awareness of the environmental impact of using cloud computing services in the area of energy efficiency and consumption. This is important both for organizations and for the bodies responsible for stimulating sustainable development activities.
... Et al., 2014). E mplement, sele but will greatly he following: nce of e-learnin s may use e-le ction of cost; format capab problems such Karim Wood et al. (2010) compared the costs of running DR services using the public cloud with those of privately owned resources, and the results showed that the cost reductions increased by 85% through taking advantage of cloud resources (Wood et al., 2010). Additionally, the testing and development environment of the cloud reduces the unit cost, while increasing effectiveness. ...
... Et al., 2014). E mplement, sele but will greatly he following: nce of e-learnin s may use e-le ction of cost; format capab problems such Karim Wood et al. (2010) compared the costs of running DR services using the public cloud with those of privately owned resources, and the results showed that the cost reductions increased by 85% through taking advantage of cloud resources (Wood et al., 2010). Additionally, the testing and development environment of the cloud reduces the unit cost, while increasing effectiveness. ...
Article
Full-text available
Teaching today relies a great deal on IT resources which require large investments and there are many higher institutions that cannot afford such investments. Educational institutions usually search for opportunities to better manage their resources, especially after the economic crisis, which has resulted in reducing government support, especially in western countries. It is argued that ‘cloud computing’ is one of those opportunities for any educational institution due to its benefits in terms of cost reduction. Today, ‘cloud computing’ can be seen as one of the latest dynamic services in the IT world because of its flexibility. This paper investigates the financial incentives for adopting cloud computing in higher educational institutions. To achieve this objective the research employs a qualitative method to collect the data. Interviews were conducted with a number of cloud service providers, experts in the field and users/potential users of the cloud. The results reveal that cloud computing drives down up-front and on-going costs, and that the number of IT staff can be reduced if the cloud is adopted. Disaster recovery and business continuity are other cost-savings areas for an educational institute in adopting the cloud, and cloud computing provides low cost testing and a development environment solution.
... Organizations with geographically distributed operations have leveraged MI Link to implement sophisticated data distribution strategies that balance local performance with centralized management. The disaster recovery research documented that multinational enterprises increasingly implement strategic data distribution rather than simple redundancy, with 73% maintaining synchronized database instances in multiple geographic regions [7]. This 804 editor@iaeme.com ...
Article
Full-text available
This comprehensive article explores how Azure SQL Managed Instance and MI Link create a transformative approach to database modernization through seamless hybrid cloud integration. By offering near-complete compatibility with on-premises SQL Server environments alongside automated management features, these technologies enable organizations to migrate databases with minimal disruption while maintaining critical workloads across both environments. The bidirectional replication capabilities of MI Link provide enterprises with flexible disaster recovery options and phased migration paths that prioritize business continuity. Through a detailed examination of architecture, implementation strategies, and performance optimization techniques, the article provides IT leaders with actionable insights for leveraging these technologies to modernize infrastructure, enhance operational resilience, and strategically distribute database workloads between on-premises systems and cloud environments for optimal performance and cost efficiency.
... Organizations with geographically distributed operations have leveraged MI Link to implement sophisticated data distribution strategies that balance local performance with centralized management. The disaster recovery research documented that multinational enterprises increasingly implement strategic data distribution rather than simple redundancy, with 73% maintaining synchronized database instances in multiple geographic regions [7]. This 804 editor@iaeme.com ...
Article
Full-text available
This comprehensive article explores how Azure SQL Managed Instance and MI Link create a transformative approach to database modernization through seamless hybrid cloud integration. By offering near-complete compatibility with on-premises SQL Server environments alongside automated management features, these technologies enable organizations to migrate databases with minimal disruption while maintaining critical workloads across both environments. The bidirectional replication capabilities Siva Kumar Raju Bhupathiraju https://iaeme.com/Home/journal/IJITMIS 798 editor@iaeme.com of MI Link provide enterprises with flexible disaster recovery options and phased migration paths that prioritize business continuity. Through a detailed examination of architecture, implementation strategies, and performance optimization techniques, the article provides IT leaders with actionable insights for leveraging these technologies to modernize infrastructure, enhance operational resilience, and strategically distribute database workloads between on-premises systems and cloud environments for optimal performance and cost efficiency.
... However, incorporating resilient computing into distributed cloud applications remains challenging, still requires significant programming efforts and is an open area of research 51 [121]. Notwithstanding, disaster recovery is an expensive operation, and is required as a service to minimise recovery time and costs after a failure has occurred [122]. Multi-cloud and multi-region architectures that scale both horizontally (geographically distributed) and vertically (not only in cloud data centers, but throughout the network) are recommended to avoid single points of failure [123]. ...
Preprint
The landscape of cloud computing has significantly changed over the last decade. Not only have more providers and service offerings crowded the space, but also cloud infrastructure that was traditionally limited to single provider data centers is now evolving. In this paper, we firstly discuss the changing cloud infrastructure and consider the use of infrastructure from multiple providers and the benefit of decentralising computing away from data centers. These trends have resulted in the need for a variety of new computing architectures that will be offered by future cloud infrastructure. These architectures are anticipated to impact areas, such as connecting people and devices, data-intensive computing, the service space and self-learning systems. Finally, we lay out a roadmap of challenges that will need to be addressed for realising the potential of next generation cloud systems.
... Data center operators emphasize on predicting SSD device failures to prevent data loss, minimize service unavailability, and reduce maintenance costs. Despite redundant protection schemes such as RAID(Redundant Array of Independent Disks) and replication, recovering from service disruptions in data centers is costly and time-consuming, leading to significant financial losses [1]. Proactively identifying faults offers numerous benefits, making them a critical focus for efficient data center operations [2] [3]. ...
Article
Full-text available
Proactive strategies for predicting solid state drive(SSD) failures are imperative to ensure uninterrupted services in data centers. Traditional methods that rely on rule-based approaches and machine learning algorithms often fail to accurately predict these failures. In this study, we introduce a temporal-contextual attention network(TCAN), a pioneering method that integrates long short-term memory(LSTM) and transformer architectures to address the limitations of existing schemes. TCAN exploits temporal patterns and attribute dependencies and offers a more comprehensive solution for SSD failure prediction. Unlike conventional feature selection methods, TCAN adopts a feature grouping approach that utilizes all available attributes while accounting for its unique characteristics. Specifically, TCAN treats certain features based on their temporal aspects, capturing how these features change over time, whereas other features are used to capture the dependencies and interactions among different attributes. Through extensive evaluation of private datasets from the Tencent data center and comparison with state-of-the-art models, including machine learning and deep learning approaches, TCAN demonstrates superior performance in identifying potential failures. Furthermore, ablation studies and evaluations using public datasets validate their effectiveness and robustness across different datasets. Our findings underscore the importance of considering both temporal features and inter-attribute dependencies for accurate SSD failure prediction, highlighting the potential of TCAN for enhancing storage system reliability and service stability in data center environments.
... This elasticity ensures that Amazon only acquires what it must as this results to efficient and more predictable IT costs [4]. Moreover, less incidence which may be as a result of system failure and down times associated with cloud infrastructures cuts down on expenses of managing the system [34]. ...
Article
Full-text available
Background: This research paper discusses a detailed exploration of Amazon's adoption of Oracle ERP Cloud, focusing on the strategic benefits of the implementation and the challenges and wider implications of implementing cloud-based ERP solutions within one of the world's largest and most complex enterprises. Further, it is detailed how, through a strict selection process, Amazon was led to settle for Oracle ERP Cloud from several leading ERP systems in the market. It also brings forth the criteria and evaluations at hand that guided this decision-making. Method: This technique focuses on the phased rollout strategy, showing how Amazon brought the ERP system incrementally across departments, beginning with finance and procurement. It underlines the important role played by cross-functional teamwork, depicting efforts between finance, supply chain, HR, and IT teams to smooth implementation. Results: The study shows how deep technologies such as AI, machine learning, the Internet of Things, and blockchain are integrated into the ERP system. These go a long way to increase the decision-making ability and better operation of security, with improved transparency in Amazon; they provide it with real-time analytics, predictive insights, and improved transparency. Conclusion: Implementing Oracle ERP Cloud at Amazon sheds light on how scalable and cost-efficient cloud-based ERP solutions are. The availability of real-time data access and advanced analytics has spurred data-driven decision-making, but issues such as data migration and security require careful consideration in the planning process. This work provides valuable insights for enterprises seeking to implement similar ERP systems.
Article
Cloud computing has completely transformed how users access and use applications, services, and data. This study provides a thorough analysis of cloud computing, including its history, key features, the role of virtualization in cloud environments, and various cloud service models. From the introduction of time-sharing in the 1960s to its widespread adoption in the 2000s, cloud computing has evolved significantly. The properties of cloud computing, such as resource pooling, on-demand self-service, measured service, resilience, and rapid flexibility, are examined in this research. Virtualization plays a crucial role in cloud computing by enabling efficient resource utilization, scalability, and workload separation. A detailed discussion of several virtualization techniques, including multitenancy, containerization, and hypervisors, is provided. The advantages and drawbacks of each method are also compared in the paper to help readers select the most suitable approach for specific use cases. The functions, deployment models, customization options, scalability, and service examples of cloud service models—Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS)—are described. Additionally, the study explores cloud deployment options, such as community, multi-cloud, hybrid, public, and private models, each with its own unique features. This article offers a comprehensive overview of cloud computing, making it an invaluable resource for both beginners and experts. It enables informed decision-making and the successful deployment of cloud technologies to meet various business needs.
Conference Paper
Restoring data operations after a disaster is a daunting task: how should recovery be performed to minimize data loss and application downtime? Administrators are under considerable pressure to recover quickly, so they lack time to make good scheduling decisions. They schedule recovery based on rules of thumb, or on pre-determined orders that might not be best for the failure occurrence. With multiple workloads and recovery techniques, the number of possibilities is large, so the decision process is not trivial. This paper makes several contributions to the area of data recovery scheduling. First, we formalize the description of potential recovery processes by defining recovery graphs. Recovery graphs explicitly capture alternative approaches for recovering workloads, including their recovery tasks, operational states, timing information and precedence relationships. Second, we formulate the data recovery scheduling problem as an optimization problem, where the goal is to find the schedule that minimizes the financial penalties due to downtime, data loss and vulnerability to subsequent failures. Third, we present several methods for finding optimal or near-optimal solutions, including priority-based, randomized and genetic algorithm-guided ad hoc heuristics. We quantitatively evaluate these methods using realistic storage system designs and workloads, and compare the quality of the algorithms' solutions to optimal solutions provided by a math programming formulation and to the solutions from a simple heuristic that emulates the choices made by human administrators. We find that our heuristics' solutions improve on the administrator heuristic's solutions, often approaching or achieving optimality.
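To make the flavor of such a priority-based heuristic concrete, here is a minimal Python sketch of greedy recovery scheduling over a toy recovery graph; the task names, durations, penalty rates, and the single-recovery-resource assumption are all illustrative and are not taken from the cited work.

```python
# Hedged sketch: greedy, priority-based scheduling of recovery tasks on a
# recovery graph. Penalty rates, durations, and the graph itself are invented
# for illustration; the cited paper's heuristics are considerably richer.
from dataclasses import dataclass, field

@dataclass
class RecoveryTask:
    name: str
    duration: float          # hours needed to recover this workload
    penalty_rate: float      # downtime cost per hour while unrecovered
    depends_on: list = field(default_factory=list)

def schedule(tasks):
    """Return (order, total_penalty) using a penalty-rate/duration priority."""
    done, order, clock, penalty = set(), [], 0.0, 0.0
    tasks = {t.name: t for t in tasks}
    while len(done) < len(tasks):
        ready = [t for t in tasks.values()
                 if t.name not in done and all(d in done for d in t.depends_on)]
        # Highest penalty rate per hour of recovery work goes first.
        nxt = max(ready, key=lambda t: t.penalty_rate / t.duration)
        clock += nxt.duration
        penalty += nxt.penalty_rate * clock   # workload is down until its task ends
        done.add(nxt.name)
        order.append(nxt.name)
    return order, penalty

if __name__ == "__main__":
    example = [
        RecoveryTask("database", 4.0, 500.0),
        RecoveryTask("app-server", 1.0, 200.0, depends_on=["database"]),
        RecoveryTask("batch-jobs", 6.0, 50.0),
    ]
    print(schedule(example))
```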
Article
The data centers used to create cloud services represent a significant investment in capital outlay and ongoing costs. Accordingly, we first examine the costs of cloud service data centers today. The cost breakdown reveals the importance of optimizing work completed per dollar invested. Unfortunately, the resources inside the data centers often operate at low utilization due to resource stranding and fragmentation. To attack this first problem, we propose (1) increasing network agility, and (2) providing appropriate incentives to shape resource consumption. Second, we note that cloud service providers are building out geo-distributed networks of data centers. Geo-diversity lowers latency to users and increases reliability in the presence of an outage taking out an entire site. However, without appropriate design and management, these geo-diverse data center networks can raise the cost of providing service. Moreover, leveraging geo-diversity requires services be designed to benefit from it. To attack this problem, we propose (1) joint optimization of network and data center resources, and (2) new systems and mechanisms for geo-distributing state.
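As a rough illustration of the "work completed per dollar invested" framing, the sketch below amortizes assumed per-server costs and shows how low utilization inflates the effective cost; every number is a placeholder assumption, not a figure from the cited paper.

```python
# Hedged sketch of the kind of amortized cost accounting the cited work motivates.
# Every number below is an illustrative assumption, not a figure from the paper.
def monthly_cost_per_server(server_price=2500.0, server_life_months=36,
                            power_watts=250, pue=1.7, price_per_kwh=0.07,
                            facility_share=60.0, network_share=30.0):
    """Rough monthly cost of one data-center server (capex amortized linearly)."""
    capex = server_price / server_life_months
    # Facility power includes cooling/distribution overhead via PUE.
    energy = (power_watts / 1000.0) * pue * 24 * 30 * price_per_kwh
    return capex + energy + facility_share + network_share

if __name__ == "__main__":
    cost = monthly_cost_per_server()
    # Work completed per dollar improves directly with utilization.
    for util in (0.1, 0.3, 0.6):
        print(f"utilization {util:.0%}: effective cost per utilized server "
              f"${cost / util:,.2f}/month")
```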
Conference Paper
On-line services are making increasing use of dynamically generated Web content. Serving dynamic content is more complex than serving static content. Besides a Web server, it typically involves a server-side application and a database to generate and store the dynamic content. A number of standard mechanisms have evolved to generate dynamic content. We evaluate three specific mechanisms in common use: PHP, Java servlets, and Enterprise Java Beans (EJB). These mechanisms represent three different architectures for generating dynamic content. PHP scripts are tied to the Web server and require writing explicit database queries. Java servlets execute in a different process from the Web server, allowing them to be located on a separate machine for better load balancing. The database queries are written explicitly, as in PHP, but in certain circumstances the Java synchronization primitives can be used to perform locking, reducing database lock contention and the amount of communication between servlets and the database. Enterprise Java Beans (EJB) provide several services and facilities. In particular, many of the database queries can be generated automatically. We measure the performance of these three architectures using two application benchmarks: an online bookstore and an auction site. These benchmarks represent common applications for dynamic content and stress different parts of a dynamic content Web server. The auction site stresses the server front-end, while the online bookstore stresses the server back-end. For all measurements, we use widely available open-source software (the Apache Web server, Tomcat servlet engine, JOnAS EJB server, and MySQL relational database). While Java servlets are less efficient than PHP, their ability to execute on a different machine from the Web server and their ability to perform synchronization leads to better performance when the front-end is the bottleneck or when there is database lock contention. EJB facilities and services come at the cost of lower performance than both PHP and Java servlets.
Article
The standard data protection solutions discussed include inter-array mirroring (local and remote, synchronous and asynchronous), tertiary storage (e.g., tape) backup, remote vaulting, and snapshots, combined with recovery by failover or data reconstruction.
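For readers unfamiliar with the mirroring options listed above, the sketch below contrasts the synchronous and asynchronous write paths; the `MirroredVolume` class and its timings are invented for illustration only.

```python
# Hedged sketch of the write path for synchronous vs. asynchronous mirroring.
# The storage interfaces and latencies are invented for illustration only.
import queue, threading, time

class MirroredVolume:
    def __init__(self, synchronous=True):
        self.synchronous = synchronous
        self.local, self.remote = [], []
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, block):
        self.local.append(block)                 # local write always completes first
        if self.synchronous:
            self._replicate(block)               # ack only after the remote copy exists
        else:
            self._pending.put(block)             # ack now; replicate in the background
        return "ack"

    def _replicate(self, block):
        time.sleep(0.05)                         # stand-in for WAN round-trip latency
        self.remote.append(block)

    def _drain(self):
        while True:
            self._replicate(self._pending.get())

if __name__ == "__main__":
    for mode in (True, False):
        vol = MirroredVolume(synchronous=mode)
        start = time.time()
        for i in range(5):
            vol.write(f"block-{i}")
        print("synchronous" if mode else "asynchronous",
              f"write latency ~{(time.time() - start) / 5 * 1000:.1f} ms",
              f"(remote copies so far: {len(vol.remote)})")
```

The asynchronous path acknowledges writes before the remote copy exists, which is precisely the data-loss window that synchronous mirroring eliminates at the cost of WAN-bound write latency.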
Article
Allowing applications to survive hardware failure is an expensive undertaking, which generally involves re-engineering software to include complicated recovery logic as well as deploying special-purpose hardware; this represents a severe barrier to improving the dependability of large or legacy applications. We describe the construction of a general and transparent high availability service that allows existing, unmodified software to be protected from the failure of the physical machine on which it runs. Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections. Our approach encapsulates protected software in a virtual machine, asynchronously propagates changed state to a backup host at frequencies as high as forty times a second, and uses speculative execution to concurrently run the active VM slightly ahead of the replicated system state.
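The checkpoint/buffer/release cycle the abstract describes can be sketched schematically as below; this is not the Remus code, and the `ProtectedVM` and `Backup` stand-ins only mimic the protocol's ordering (run speculatively, snapshot, replicate, then release buffered output once the backup acknowledges).

```python
# Hedged schematic of the checkpoint/buffer/release cycle the abstract describes.
# This is not the Remus implementation; the VM and backup objects are stand-ins.
import copy, time

class ProtectedVM:
    def __init__(self):
        self.state = {"counter": 0}
        self.outbound = []                 # network output produced this epoch

    def run_epoch(self):
        self.state["counter"] += 1
        self.outbound.append(f"packet-{self.state['counter']}")

class Backup:
    def __init__(self):
        self.state = None
    def apply(self, snapshot):
        self.state = snapshot              # stand-in for applying dirtied pages
        return True                        # acknowledgment

def protect(vm, backup, epochs=5, interval=0.025):   # roughly 40 checkpoints/second
    released = []
    for _ in range(epochs):
        vm.run_epoch()                     # VM runs speculatively ahead of the backup
        snapshot = copy.deepcopy(vm.state) # pause briefly, capture changed state
        acked = backup.apply(snapshot)     # ship it asynchronously to the backup host
        if acked:
            released.extend(vm.outbound)   # release buffered output only after the ack
            vm.outbound.clear()
        time.sleep(interval)
    return released

if __name__ == "__main__":
    print(protect(ProtectedVM(), Backup()))
```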
Conference Paper
Availability requirements for database systems are more stringent than ever before with the widespread use of databases as the foundation for e-business. This paper highlights Fast-Start™ Fault Recovery, an important availability feature in Oracle, designed to expedite recovery from unplanned outages. Fast-Start allows the administrator to configure a running system to impose predictable bounds on the time required for crash recovery. For instance, Fast-Start allows fine-grained control over the duration of the roll-forward phase of crash recovery by adaptively varying the rate of checkpointing with minimal impact on online performance. Persistent transaction locking in Oracle allows normal online processing to be resumed while the rollback phase of recovery is still in progress, and Fast-Start allows quick and transparent rollback of changes made by uncommitted transactions prior to a crash.
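A hedged sketch of the general idea, adaptively checkpointing so that the redo that must be replayed after a crash stays under a bound, is shown below; it illustrates the concept only, not Oracle's Fast-Start mechanism, and all rates are assumed.

```python
# Hedged sketch of adaptively varying the checkpoint rate so that the redo log
# that must be replayed after a crash stays under a recovery-time target.
# This illustrates the idea only; it is not Oracle's Fast-Start mechanism.
import random

def run(redo_target_mb=200, minutes=60):
    """Simulate one hour of load; checkpoint whenever pending redo nears the bound."""
    pending_redo, checkpoints = 0.0, 0
    for minute in range(minutes):
        pending_redo += random.uniform(5, 40)        # redo generated by transactions
        if pending_redo > 0.8 * redo_target_mb:      # act before the bound is reached
            checkpoints += 1                         # flush dirty buffers to disk
            pending_redo = 0.0
    # Roll-forward time after a crash is roughly bounded by target redo / replay rate.
    replay_rate_mb_per_s = 20.0
    print(f"checkpoints taken: {checkpoints}, "
          f"worst-case roll-forward ~{redo_target_mb / replay_rate_mb_per_s:.0f} s")

if __name__ == "__main__":
    run()
```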
Conference Paper
Cloud computing providers have set up several data centers at different geographical locations over the Internet in order to optimally serve the needs of their customers around the world. However, existing systems do not support mechanisms and policies for dynamically coordinating load distribution among different Cloud-based data centers in order to determine the optimal location for hosting application services to achieve reasonable QoS levels. Further, Cloud computing providers are unable to predict the geographic distribution of users consuming their services, hence load coordination must happen automatically, and the distribution of services must change in response to changes in the load. To counter this problem, we advocate the creation of a federated Cloud computing environment (InterCloud) that facilitates just-in-time, opportunistic, and scalable provisioning of application services, consistently achieving QoS targets under variable workload, resource, and network conditions. The overall goal is to create a computing environment that supports dynamic expansion or contraction of capabilities (VMs, services, storage, and database) for handling sudden variations in service demands. This paper presents the vision, challenges, and architectural elements of InterCloud for utility-oriented federation of Cloud computing environments. The proposed InterCloud environment supports scaling of applications across multiple vendor clouds. We have validated our approach by conducting a set of rigorous performance evaluation studies using the CloudSim toolkit. The results demonstrate that the federated Cloud computing model has immense potential, as it offers significant performance gains in terms of response time and cost savings under dynamic workload scenarios.
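A broker in such a federation must repeatedly decide where to place (or re-place) application services; the sketch below shows one simplistic placement policy, with hypothetical data centers, prices, and latencies rather than anything taken from the CloudSim evaluation.

```python
# Hedged sketch of the kind of placement decision an InterCloud-style broker
# makes: pick the federated data center that can meet a latency QoS target at
# the lowest cost. Data centers, prices, and latencies are hypothetical.
def place(request, datacenters):
    candidates = [dc for dc in datacenters
                  if dc["free_vms"] >= request["vms"]
                  and dc["latency_ms"][request["region"]] <= request["max_latency_ms"]]
    if not candidates:
        return None                              # trigger scale-out to another provider
    return min(candidates, key=lambda dc: dc["price_per_vm_hour"])

if __name__ == "__main__":
    datacenters = [
        {"name": "us-east", "free_vms": 40, "price_per_vm_hour": 0.10,
         "latency_ms": {"us": 20, "eu": 110}},
        {"name": "eu-west", "free_vms": 15, "price_per_vm_hour": 0.12,
         "latency_ms": {"us": 120, "eu": 25}},
    ]
    print(place({"vms": 10, "region": "eu", "max_latency_ms": 50}, datacenters))
```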