Deterministic Models of Software Aging and Optimal Rejuvenation Schedules
ABSTRACT Automated modeling of software aging processes is a prerequisite for cost-effective usage of adaptive software rejuvenation as a self-healing technique. We consider the problem of such automated modeling in server-type applications whose performance degrades depending on the "work" done since last rejuvenation, for example the number of served requests. This type of performance degradation - caused mostly by resource depletion - is common, as we illustrate in a study of the popular Axis Soap server 1.3. In particular, we propose deterministic models for approximating the leading indicators of aging and an automated procedure for statistical testing of their correctness. We further demonstrate how to use these models for finding optimal rejuvenation schedules under utility functions. Our focus is on the important case that the utility function is the average of a performance metric (such as maximum service rate). We also consider optional SLA constraints under which the performance should never drop below a specified level. Our approach is verified by a study of the aging processes in the Axis Soap 1.3 server. The experiments show that the deterministic modeling technique is appropriate in this case, and that the optimization of rejuvenation schedules can greatly improve the average maximum service rate of an aging application.
Deterministic Models of Software Aging and Optimal Rejuvenation Schedules
Zuse Institute Berlin (ZIB)
Takustraße 7, 14195 Berlin-Dahlem
Dep. Engenharia Informática
CoreGRID Technical Report
November 2, 2006
Institute on System Architecture
CoreGRID - Network of Excellence
CoreGRID is a Network of Excellence funded by the European Commission under the Sixth Framework Programme
Project no. FP6-004265
1 Introduction
Over the last decades, software applications have grown enormously in complexity. Despite advances in software engineering techniques, even the most mission-critical applications are still prone to latent bugs and, on top of that, are becoming more difficult to manage.
This increase in complexity was recognized, for instance, by IBM, which in 2001 launched the Autonomic Computing Initiative as a vision to guide research efforts in the area of system self-management. The topics addressed include system configuration, protection, healing and optimization. One of the four targeted properties is the development of self-healing computing systems. A self-healing system should be able to automatically predict or detect potential errors and to execute proactive actions to avoid the occurrence of failures.
This paper presents a contribution to the topic of self-healing. One of the main concerns in today's complex software systems is the appearance of software aging. The term software aging describes the phenomenon of progressive degradation of running software that may lead to system crashes or undesirable hang-ups. It may happen due to the exhaustion of system resources, caused by memory leaks, unreleased locks, non-terminated threads, shared-memory pool latching, storage fragmentation, data corruption and accumulation of numerical errors.
The aging phenomenon is likely to be found in any type of software of sufficient complexity, but it is particularly troublesome in long-running applications. It is not only a problem for desktop operating systems: it has been observed in telecommunication systems, web-servers [12, 6], enterprise clusters, OLTP systems, and spacecraft systems. The problem has even been reported in military systems, with severe consequences including the loss of lives.
Several commercial tools help to identify some sources of memory leaks during the development phase [26, 23]. However, not all faults can be spotted, and those tools cannot be applied to third-party software packages when there is no access to the source code. This means that existing production systems have to deal with the problem of software aging.
Software systems are notably increasing in complexity with the introduction of Web technologies, the use of complex middleware for enterprise application integration, and the adoption of SOA. As a consequence, there are increasing concerns about the phenomenon of software aging, and it is wise to devise techniques to deal with this problem in order to increase the dependability and autonomic capabilities of complex IT systems.
The most natural way to combat software aging is to apply the well-known technique of software rejuvenation. Two basic rejuvenation policies have been proposed: time-based and proactive rejuvenation. Time-based rejuvenation is widely used today in real production systems, for instance by some web-servers [12, 6]. Proactive rejuvenation has been studied in [5, 4, 13, 28, 19, 15, 18, 16, 29], and it is widely understood that this technique provides better results, leading to higher availability and lower costs.
In this paper we present techniques for finding the optimal rejuvenation time, with the main purpose of improving the reliability of applications and minimizing the possible downtime due to a rejuvenation action. Our focus extends to any complex server-based software that has to run 24x7 with strict requirements of sustained performance and dependability.
In previous experimental work with a SOAP server implementation, Apache Axis 1.3, we have demonstrated that this particular middleware package is highly prone to software aging. We also found that this problem was highly deterministic, which led us to develop mathematical techniques to model the phenomenon and to find the optimal rejuvenation strategy for a computational server in single-server and cluster configurations.
This work can be applied to any software system that potentially suffers from aging and exhibits deterministic behaviour: even if the system is restarted, the problem will reappear at some point in the future, in a time-independent way that is directly related to the usage of system resources.
The aging behaviour of any software can be captured by one or more indicators of aging. Such an indicator is any measurable metric of the server likely to be influenced by software aging, for example the maximum number of requests it can serve per second. Aging indicators frequently depend on the time since last rejuvenation, but might also depend on other metrics such as the number of processed requests, the number of performed database operations, the size of the swap file or the amount of main memory used by the software.
The basis of this paper is a simple deterministic approach for time-independent modeling of the aging indicators.
Our main assumption is that the leading indicators of performance degradation can be approximated as (deterministic)
functions of some “work”-related metric, e.g. the number of served requests since last rejuvenation. This type of
software degradation might be attributed to memory leaks, unterminated threads, stale file locks, and other resource
depletion occurring with each request, or a series of them.
Modeling aging behaviour as a deterministic process depending on the amount of work since last rejuvenation offers several advantages. Firstly, such a description of the aging process is independent of the work (e.g. request) arrival rate and its time distribution. Compared to time-based characterisations, these models are more universal and have fewer parameters. A consequence of the determinism is a simple and concise description of the model - for example, as an interpolating spline or a sequence of functions. This greatly facilitates analytical treatment and optimization - tasks much more cumbersome for the probabilistic models predominant in the modeling of aging phenomena. Furthermore, if applicable, deterministic modeling is likely to yield a higher level of accuracy than probabilistic techniques. Finally, the general applicability of these types of models is high, since they correctly describe the aging process if memory leaks or other unreleased resources are the primary cause of the degradation. We have confirmed this empirically for the aging processes identified in Apache Axis 1.3 (for the request rate capacity as an aging indicator).
Of course, not every aging indicator can be described as a function of the performed work. In some cases the approximate model provided by our approach might be sufficient, while in other cases deterministic models will not be applicable at all. A verification procedure for the applicability of the models under a given approximation error tolerance is therefore needed. To automate the modeling process, such a verification should follow objective criteria and not rely on expert judgement. Such a model verification procedure also offers automated selection of a suitable work metric, or even of combinations of metrics. In this paper we discuss the complete, automated process of choosing and fitting the models of performance degradation, together with statistical tests that these models are appropriate.
The obtained models can be used to find optimal rejuvenation schedules for individual servers and server pools in order to maximize some utility function, possibly under additional constraints. For example, we might want to maximize, as the utility function, the average service rate capacity of a server, while ensuring that this capacity never drops below a certain limit. This technique is described and evaluated in this paper. For this case we derive analytical formulas for efficient computation of the rejuvenation times of individual servers.
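As a rough numerical illustration of this optimization idea (not the paper's analytical formulas), the sketch below searches for the cycle length that maximizes the cycle-averaged value of an aging indicator S(x), optionally under an SLA floor. The function name, the toy degradation model, and the assumption that the restart downtime can be expressed in work units are illustrative choices of ours.

```python
import numpy as np

def optimal_rejuvenation_point(S, x_max, downtime, sla_floor=None, grid=10_000):
    """Cycle length n* (in work units, e.g. served requests) that maximizes
    the average value of the aging indicator S over one rejuvenation cycle.

    The restart itself costs `downtime` work units during which capacity is
    zero, so the average capacity over a cycle of length n is approximately
        U(n) = (integral of S from 0 to n) / (n + downtime).
    """
    dx = x_max / grid
    x = dx * np.arange(1, grid + 1)       # candidate cycle lengths
    cum = np.cumsum(S(x)) * dx            # ~ integral of S over [0, n]
    avg = cum / (x + downtime)
    if sla_floor is not None:
        # Forbid cycles during which the indicator drops below the SLA level.
        ok = np.minimum.accumulate(S(x)) >= sla_floor
        avg = np.where(ok, avg, -np.inf)
    i = int(np.argmax(avg))
    return x[i], avg[i]

# Toy degradation model: capacity decays with the number of served requests.
S = lambda x: 100.0 - 0.01 * x - 1e-6 * x**2
n_star, u = optimal_rejuvenation_point(S, x_max=5000.0, downtime=200.0)
```

Without a downtime cost the optimum would trivially be immediate rejuvenation; the (n + downtime) denominator is what makes the trade-off non-degenerate.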
Here we obtain deterministic models for the request rate capacity as a function of the number of served requests. We derive the optimal rejuvenation schedules by our technique and verify their optimality via experiments.
The main technical contributions of this paper are 1) a deterministic technique for modeling the leading indicators of aging as functions of work metrics, 2) an automated procedure for selecting a suitable work metric, finding a model for the aging process, and statistically verifying these results, 3) an approach for finding optimal rejuvenation schedules of applications that maximize the average value of a performance metric (such as the maximum service rate), and 4) an evaluation of our approach via a case study on the Apache Axis 1.3 server, where the "work"-related metric is the number of served requests since last rejuvenation.
The rest of the paper is organized as follows: after reviewing related work in Section 2, we present the process of determining and verifying deterministic models of aging in Section 3. Section 4 explains how to obtain optimal rejuvenation schedules with and without SLA constraints. Section 5 describes the setup of the experiments performed on the Apache Axis 1.3 server. The evaluation of the experiments is presented in Section 6, where we discuss the obtained deterministic models for an aging indicator (maximum request rate capacity) of the server, the results of modified ANOVA tests, and the optimal rejuvenation times for maximizing the average maximum request rate.
2 Related work
Software rejuvenation was first proposed in , and tens of papers have been published on it since then. Two policies for applying software rejuvenation have been studied : (a) scheduling periodic rejuvenation actions; (b) estimating the time to resource exhaustion and triggering rejuvenation proactively.
While the first policy is simple to understand and apply, it does not provide the best result in terms of availability and cost, since it may trigger unnecessary rejuvenation actions. Proactive rejuvenation is therefore the better option.
There are two basic approaches to proactive software rejuvenation: (i) the analytic-based approach and (ii) the measurement-based approach.
The first approach uses analytic modeling of the system, assuming some distributions for failures, workload and repair time, and tries to obtain the optimal rejuvenation schedule. Several papers describing analytical models have been presented in the literature; a survey of papers following this approach can be found in . The work in  presented a continuous-time Markov chain model to find a closed-form expression for the optimal trigger rate. In  a semi-Markov model was presented that relaxed the assumption of time-independent transition rates.  presented a Markov regenerative process that allowed the rejuvenation trigger clock to start in a robust state. A modeling approach for transactional systems was presented in . This Markov-based model took into account details of transaction arrivals and losses.
In the measurement-based approach, the goal is to collect data from the running system and then quantify and validate the effect of aging on system resources. Three main techniques have been presented in the literature:  used a time-based and workload-independent estimation of software aging. A different study that takes the workload into account was presented in . Those authors subsequently refined their model in several further papers. A yet different approach was used by , which made use of ARMA/ARX models to validate the occurrence of software aging in a web-server. The work presented in  considered several algorithms for predicting resource exhaustion, mainly based on curve fitting.
In  a study about software aging in a web-server (Apache) is presented. That study uses time-series analysis to predict the occurrence of software aging. Proactive detection of software aging in OLTP servers was studied in , using monitoring data collected during a period of 5 months. That data was used to train a pattern-recognition tool. After the training phase, the system went back to production and the monitoring activity continued. The tool was able to predict the occurrence of software aging well in advance. Another related study was presented in . The authors applied MSET (a statistical pattern recognition method developed by NASA and the US Department of Energy) for proactive detection of software aging in cluster systems. MSET provided excellent results and was able to detect the occurrence of memory contention problems with high sensitivity and a low probability of false alarms. Further work presented in  continued to prove the effectiveness of MSET for eager detection of software aging and runaway processes. In  another relevant study is presented that uses sequential probability ratio tests (SPRT) to achieve early warning of potential aging problems by on-line monitoring of several hardware parameters and software performance metrics.
Figure 1: Approximating an aging indicator by a concatenation of elementary functions
Our modeling approach requires fitting elementary functions (polynomials, exponential functions) to noisy data from multiple trials. One of the most popular approaches to describing and approximating sampled signals is curve fitting using splines . Advanced techniques for spline fitting in the presence of noise and non-stationarity have recently been studied in medicine (particularly in neuroscience). They include DMS, SARS, and the Bayesian Adaptive Regression Splines (BARS) approach . The latter method has been further extended to fitting curves from multiple trials  and to non-parametric testing of the equality of functions . An alternative to spline fitting is the approximation of the data by basic parametrized functions such as linear functions, exponential functions and higher-degree polynomials. The fitting is usually performed via the Levenberg-Marquardt algorithm . The disadvantage is the difficulty of approximating the whole sample by a single function. As a remedy, a concatenation of separate functions over different argument ranges is used.
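As a hedged illustration of this style of fitting (not the paper's implementation), SciPy's curve_fit applies the Levenberg-Marquardt algorithm to an elementary parametrized function; the exponential model and synthetic data below are assumptions of ours.

```python
import numpy as np
from scipy.optimize import curve_fit

# Elementary model family: exponentially decaying capacity with an offset.
def exp_model(x, a, b, c):
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4000.0, 200)                  # work metric (requests)
y = exp_model(x, 40.0, 1e-3, 60.0) + rng.normal(0.0, 0.5, x.size)  # noisy indicator

# For unconstrained problems curve_fit supports method="lm"
# (Levenberg-Marquardt); p0 is the initial parameter guess.
popt, pcov = curve_fit(exp_model, x, y, p0=(30.0, 1e-3, 50.0), method="lm")
```

The covariance estimate pcov can additionally be used to judge how well the elementary function family explains the sampled signal.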
3 Modeling Aging Processes
We denote an indicator of aging by y. The amount of "work" performed by an application since last rejuvenation is described by a work metric x, such as the number of served requests since reboot. The key assumption of our approach is that an indicator of aging depends primarily on a single work metric x, i.e. y = y(x). While this assumption might not hold strictly for the majority of aging processes, it is often sufficient to have approximate models of the above kind. For a user-specified approximation error level, this approach can be followed if we have a verification procedure that detects whether the specified error level is likely to be exceeded. Moreover, automating such a procedure offers the opportunity to automatically test and select the work metric (or even a function of several metrics) from a pool of collected metrics. In this section we describe how to build a work metric-based model for a given aging indicator and how to automatically test the applicability of a metric as a model input.
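The paper's statistical test is a modified ANOVA (Section 6), whose details are not reproduced here. As a simplified stand-in for such an applicability check, the sketch below accepts a work-metric model only if a model fitted to the remaining trials predicts each held-out trial within the error tolerance; the helper names and the quadratic model family are illustrative assumptions.

```python
import numpy as np

def model_applicable(trials, fit, tol):
    """Leave-one-trial-out applicability check for a model y = y(x).

    trials: list of (x, y) arrays from independent runs (possibly with
    different arrival rates); fit: maps pooled (x, y) data to a callable
    model. Accept only if every held-out trial is predicted within tol."""
    for i, (xi, yi) in enumerate(trials):
        rest = [t for j, t in enumerate(trials) if j != i]
        xp = np.concatenate([x for x, _ in rest])
        yp = np.concatenate([y for _, y in rest])
        model = fit(xp, yp)
        if np.max(np.abs(model(xi) - yi)) > tol:
            return False
    return True

# Example fit: least-squares quadratic in the work metric.
fit_quad = lambda x, y: np.poly1d(np.polyfit(x, y, 2))

# Three synthetic runs generated from the same underlying law should pass,
rng = np.random.default_rng(1)
truth = lambda x: 100.0 - 0.01 * x
def trial(n=120):
    x = np.sort(rng.uniform(0.0, 1000.0, n))
    return x, truth(x) + rng.normal(0.0, 0.2, n)
trials = [trial(), trial(), trial()]
ok = model_applicable(trials, fit_quad, tol=2.0)
# while a shifted run (indicator NOT a function of x alone) should fail.
x_bad, y_bad = trial()
bad = model_applicable(trials[:2] + [(x_bad, y_bad + 10.0)], fit_quad, tol=2.0)
```

Cross-trial prediction is what makes the check workload-independent: trials with different arrival rates still trace out the same curve if the work metric truly determines the indicator.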
3.1 Approximating aging indicators by splines
Work metrics are in general monotone functions of time - they either increase or decrease. This allows using them as arguments of other functions - in our case, of models for indicators of aging. Using a work metric as the argument, we find a function S(x) which approximates an aging indicator y via curve fitting. Unfortunately, an aging indicator cannot in general be described by one basic function (such as a polynomial or an exponential function) over the whole range of its argument. We handle this problem by subdividing the argument range into a set of segments and fitting a basic function separately for every segment. For example, taking the maximum service rate P as an indicator of aging and the number of served requests as x, we could approximate P = P(x) as a (low-order) polynomial for x = 0, ..., n1, another polynomial for x = n1 + 1, ..., n2, and yet another function for x > n2 (see Figure 1).
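A minimal sketch of this segment-wise fitting, assuming the segment boundaries n1 < n2 < ... are already given (the automated choice of boundaries and function families is not reproduced here); the function name is illustrative.

```python
import numpy as np

def fit_segmented(x, y, breakpoints, degree=2):
    """Fit a separate low-order polynomial on each segment of the work-metric
    range, mirroring the concatenation of elementary functions in Figure 1.
    breakpoints: interior boundaries n1 < n2 < ...; returns a callable S(x)."""
    edges = [float(x.min()), *breakpoints, float(x.max())]
    polys = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (x >= lo) & (x <= hi)
        polys.append((lo, hi, np.poly1d(np.polyfit(x[m], y[m], degree))))

    def S(q):
        q = np.asarray(q, dtype=float)
        out = np.empty_like(q)
        for lo, hi, p in polys:     # evaluate the polynomial of each segment
            m = (q >= lo) & (q <= hi)
            out[m] = p(q[m])
        return out
    return S

# Piecewise-linear toy indicator: slow decay, then faster decay after 500.
x = np.linspace(0.0, 1000.0, 201)
y = np.where(x <= 500.0, 100.0 - 0.005 * x, 97.5 - 0.02 * (x - 500.0))
S = fit_segmented(x, y, breakpoints=[500.0], degree=1)
```

The resulting S(x) is exactly the kind of concise model description (a sequence of fitted functions) that the later optimization of rejuvenation schedules operates on.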