
Capturing Workload Pathology by Statistical Exception Detection System.



The paper describes one site's experience of using Multivariate Adaptive Statistical Filtering (MASF) to automatically recognize some common computer system defects, such as run-away processes on one or multiple CPUs and memory leaks. A home-made SEDS (Statistical Exception Detection System), which captures global and application level statistical exceptions, was modified to recognize, report and alert about those defects.
Igor Trubin, PhD, Vice Chair SCMG
1. Introduction
In 2000 the Statistical Exception Detection System
(SEDS) was developed and implemented to support
the IT Capacity Management process in a large U.S.
financial services company. Exploiting the MASF
method, SEDS automatically scans large volumes of
performance data and identifies global metric
measurements that differ significantly from their
expected values. The system was first introduced
publicly at the CMG 2001 conference [1]. The current
structure of the system is shown in Figure 1.
Figure 1 – SEDS structure
When the system went to production the typical
exception looked like Figure 2.
Figure 2 – Typical Exception
From these types of control charts the following facts
can be read. Normally, the server (HP-UX N-class 4-
way box) is heavily used but more or less stable. All
day yesterday it had more than the average load
plus three hours of statistically unusual CPU
utilization that is represented by exceeding the red
upper limit curve (Mean + 3σ). Using this approach,
SEDS filters out the real performance issues from
thousands of servers and allows the relatively small
group of capacity planners to provide proactive
performance/capacity management of the large
computer farm.
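The control chart logic described above can be sketched as follows. This is a minimal illustration, not the SEDS implementation: the function names, the choice of hour of day as the grouping key and the sample-standard-deviation estimator are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_limits(history, sigmas=3.0):
    """Build {hour: (mean, upper_limit)} from (hour, value) history samples.

    The upper limit is mean + sigmas * sample standard deviation,
    computed separately for every hour present in the history.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    limits = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # stdev needs at least two samples
        avg = mean(values)
        limits[hour] = (avg, avg + sigmas * stdev(values))
    return limits


def exceptions(limits, actual):
    """Return the hours in `actual` ({hour: value}) above the upper limit."""
    return sorted(h for h, v in actual.items()
                  if h in limits and v > limits[h][1])
```

For example, with a history of 40, 42, 38 and 40 percent for hour 10, a reading of 46 percent for that hour yesterday exceeds the upper limit and would be reported as an exception.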
But soon SEDS also captured a different type of
issue shown in Figure 3. The performance analysts
quickly recognized known system defects such as
memory leaks and run-away processes (infinite
program loop). In terms of statistical analysis, those
exceptions are definitely outliers and SEDS was
modified to exclude them from historical data to
avoid an exaggerated upper control limit calculation
for future detections.
Figure 3 – Non-Typical Exceptions for CPU
and Memory Utilization
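The effect of leaving pathological samples in the history can be illustrated with a small sketch. The sample values and the helper names are hypothetical; the point is only that a couple of run-away readings inflate the Mean + 3σ limit so much that later exceptions would no longer be detected.

```python
from statistics import mean, stdev


def upper_limit(samples, sigmas=3.0):
    """Upper control limit: mean + sigmas * sample standard deviation."""
    return mean(samples) + sigmas * stdev(samples)


def clean_history(samples, pathology_flags):
    """Drop samples recorded while a pathology was flagged."""
    return [s for s, bad in zip(samples, pathology_flags) if not bad]


# Hypothetical CPU utilization (%) for the same hour on eight days;
# two of the days were affected by a run-away process.
history = [40, 42, 38, 41, 39, 100, 100, 40]
flags = [False, False, False, False, False, True, True, False]

raw_limit = upper_limit(history)                         # inflated by outliers
clean_limit = upper_limit(clean_history(history, flags))  # healthy baseline
```

With these numbers the limit computed from the raw history is above 100%, which no utilization reading can exceed, while the limit from the cleaned history stays close to the normal load.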
After a while this unexpected SEDS “side effect”
became a very effective way to detect some
common computer system pathologies, which is the
subject of this paper.
2. Workload Pathology Overview
Looking at the classical performance charts any
experienced system performance analyst can
recognize a memory leak pattern or run-away
process situation (Figure 4).
Sometimes it’s not so obvious and requires a deeper
analysis to recognize a pathology. For instance, a
global metric might not clearly show it, but the
workload view can help to show the run-away
process on a particular application as seen in
Figure 5.
No doubt these workload defects cause problems
on the server because they capture excess
resources and capacity. Even if system
performance with a parasite process is still OK, the
real resource usage is hidden and the capacity
usage picture becomes inaccurate.
Figure 4 – Typical Pathology Patterns: “Fast”
Memory Leak and CPU Run-Away Process on a
4-Way Server
Figure 5 – Monthly Workload Chart Shows Run-
Away Application (Marked by Blue Color)
When the automatic future trend chart generator is
used [2], historical data containing pathologies
causes inaccurate predictions, as shown in Figure 6.
Figure 6 – Memory Leak Issue Spoiled the Forecast
The problem of system pathology recognition has
been discussed numerous times, including some
CMG presentations.
For example, there is a good definition of a memory
leak in one CMG 2001 paper [3]: “Memory leaks are
when the application acquires memory to create
objects and then ‘forgets’ to release the memory for
reuse. As a result, the application memory usage
continues to grow as the application executes”.
Another 2004 CMG paper [4] lists a set of rules
of thumb for developing automated analysis of computer
subsystem “hot-spots”, including memory leaks and
run-away processes, which it treats as memory
and process “hot-spots”.
One more 2004 CMG paper [5] underlines the
necessity of adding to IT service management tools
an “intelligent pattern recognition to look for the
pathology of processes (such as memory leaks or
program loops)”.
The best research in this field was recently
published at the 2003 CMG conference
by Ron Kaminski [6]. That study presents an excellent
classification of computer system defects as
well as detailed capturing algorithms.
The advantage of using the statistical process
control technique to recognize this pathology is
simplicity. SEDS can do it without any complex
analytical algorithms, since any pathology is treated
as statistically unusual behavior and can easily be
captured on a more or less stable production
server.
SEDS also has a special database that keeps all
exceptions, as shown in Figure 1. This database
has a separate table for exceptions that indicate
possible pathology cases.
3. Capturing Workload Pathologies by SEDS
In the first SEDS implementation, there was an
extremely simple criterion for adding an exception to
that table: for several hours in a row, the CPU
and/or memory utilization was above a very high
threshold (99.5% for UNIX and 95% for Wintel).
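This first criterion can be sketched as a search for a run of consecutive hours above the threshold. The two threshold values come from the text; the function names and the default of three hours for "several hours in a row" are assumptions made for this example.

```python
# Thresholds in percent, as quoted in the text.
THRESHOLDS = {"unix": 99.5, "wintel": 95.0}


def longest_run_above(hourly_util, threshold):
    """Length of the longest run of consecutive hours above threshold."""
    longest = current = 0
    for value in hourly_util:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def is_possible_pathology(hourly_util, platform, min_hours=3):
    """True if utilization stayed above the platform threshold for at
    least min_hours consecutive hours (min_hours is an assumed value)."""
    return longest_run_above(hourly_util, THRESHOLDS[platform]) >= min_hours
```

The same check applies to either CPU or memory utilization, since both are expressed as hourly percentages.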
If an exception has been added to that table, the
SEDS e-mail notification includes, in addition to the
list of servers with general exceptions, a
separate list of servers with possible run-away
and/or memory leak patterns. SEDS produces this type
of smart alarm every morning and sends it to capacity
planners or performance analysts.
On the SEDS web report, shown in Figure 7, a
server with a possible pathology is colored red.
The line with the server name has links to control
charts as well as classical performance charts that
allow a user to perform additional analysis and
determine an appropriate action plan to address the
issue. Other exceptions are marked by other colors,
as explained in the figure’s legend.
SEDS creates the data report based on any data
source available in SAS/ITRM. In this case the
pathology can be detected from BMC BGS type of
data, Concord eHealth data or HP Vantage Point
data (MeasureWare, MW).
If application (workload) level data is available
and an exception is detected, SEDS generates a
control chart for any workload definition, which
helps to identify immediately which application or
process is responsible for the problem. An example
is shown in Figure 8.
Figure 7 – SEDS Web Report
4. What If a Run-Away Process Did not
Consume the Entire Machine?
Indeed, another typical situation on an SMP type of
server is when some process takes not all CPUs but
only some of them, as shown in Figures 4, 5 and 8.
Any of these situations will be captured as statistical
exceptions for sure, but to filter out that pathology
the following non-statistical algorithm was added to
SEDS. Let's assume:
Uh - CPU utilization for each hour of yesterday
Uah - six-month average CPU utilization for the same hour
Ncpu - number of CPUs
The server most likely had a run-away process on at
least one CPU if, for each hour of yesterday (24
hours in a row), the following condition is true:
Uh > Uah + 100% / Ncpu
For example, on a 4-CPU system, this will highlight a
process consuming an additional 100%/4 = 25% of
system capacity.
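The condition above translates directly into code. The sketch below assumes that both inputs are lists of 24 hourly CPU utilization values in percent; the function name and the input layout are illustrative and not taken from SEDS.

```python
def runaway_suspected(yesterday, six_month_avg, ncpu):
    """Check Uh > Uah + 100% / Ncpu for every one of yesterday's 24 hours.

    yesterday      -- 24 hourly CPU utilization values (Uh), in percent
    six_month_avg  -- 24 six-month averages for the same hours (Uah)
    ncpu           -- number of CPUs on the server (Ncpu)
    """
    if len(yesterday) != 24 or len(six_month_avg) != 24:
        raise ValueError("expected 24 hourly values in each list")
    margin = 100.0 / ncpu  # share of capacity taken by one busy CPU
    return all(uh > uah + margin
               for uh, uah in zip(yesterday, six_month_avg))
```

On a 4-way server with a 20% baseline, 24 hours at 46% satisfy the condition (46 > 20 + 25), while a single hour at 44% is enough to reject the server as a run-away candidate.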
A server that matches this criterion is added
to the pathology table of the SEDS database. For
better visual recognition of this type of issue,
SEDS publishes the list of servers with run-away
processes as a bar chart, as shown in Figure 9,
where the X axis is the daily average CPU utilization. If
all of the CPU resource was used, the bar value will
be close to 100%. If not, it will be significantly
less than 100%. For instance, for an otherwise
underutilized 2-way server, such as the one presented
in Figure 8, it might be slightly more than 50%.
Figure 8 also shows confirmation that the run-away
process was successfully killed around 16:30.
Figure 8 – Run-Away Workload on One CPU of Two
Figure 9 – Run-Away Server List
5. Recognition of Other Complex Defects
The current version of SEDS still cannot filter out
some other complex defects, such as slow (long
term) increases (“ramps”). This typical situation is
presented in Figure 10, where memory utilization
has a slowly growing component representing some
process with a memory leak. SEDS captured this
server with a memory utilization exception, as shown
in the second chart in Figure 10.
This case has a distinct pattern, a very
small deviation from the statistical upper limit, which
could probably be formalized and added to the
filter to put these cases into the pathology table.
Figure 10 – Complex Memory Leak Example;
Classical Performance Chart and Control Chart
6. SEDS Exception Database Usage
The history of exceptions can be used for other
analysis to show longer trends of metrics outside the
three standard deviations range. These can indicate
system resource problems, server load growth (or
reduction), seasonal deviation [7] or a possible
pathologic situation. The SEDS database was
developed to store such data; it is actually a log file
in which the exception detector records each
exception.
Even if an exception was not put into the special
pathology table, the exception history can uncover a
hidden pathology. In the chart in Figure 11,
which was created from the exception database, the
high-density “clouds” clearly indicate a possible
memory leak type of defect. This type of chart is
generated automatically by SEDS and is available
under the “history” link on web reports (Figure 7).
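One way to look for such high-density periods programmatically is to count exception days inside a sliding window. This is only a sketch: the paper identifies the "clouds" visually, and the window length, the minimum count and the function name below are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def dense_windows(exception_days, window_days=7, min_count=5):
    """Return the start dates of windows that contain at least
    min_count distinct exception days within window_days days."""
    days = sorted(set(exception_days))
    span = timedelta(days=window_days)
    starts = []
    for i, start in enumerate(days):
        # Exception days falling inside [start, start + window_days)
        in_window = [d for d in days[i:] if d - start < span]
        if len(in_window) >= min_count:
            starts.append(start)
    return starts
```

Isolated exceptions spread over several months produce an empty result, while a server that raises a memory exception almost every day of a week is reported with the date on which the dense period begins.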
If the defect was recognized by an analyst, certain
steps should be taken with the process level data to
find the defective process ID and to open an incident or
problem ticket to kill or restart the process. Sometimes
even a reboot has to be done to solve the
problem. It takes time.
The sooner the pathology is found and resolved,
the better the service IT provides for the business.
Figure 11 – History of Pathology Exceptions
To see the progress and effectiveness of capturing
and resolving particular system pathologies, the
following report was presented to upper
management. The idea of the report was “how long
run-away processes had captured each server”.
The result was put on a bar chart, as shown in
Figure 12. Based on that, a stable improvement
was reported: in 2003 the overall duration of run-away
processes was two times less than in 2001!
Figure 12 – Overall Duration of CPU Pathologies on UNIX Servers
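The aggregation behind such a report can be sketched in a few lines, assuming the pathology table can be exported as (year, server, hours) records. The record layout and the function name are hypothetical, and any numbers fed into it here would be made up rather than taken from the paper.

```python
from collections import defaultdict


def pathology_hours_by_year(records):
    """Sum the duration of run-away pathologies per year.

    records -- iterable of (year, server, hours) tuples, one per
               detected pathology episode (assumed layout).
    """
    totals = defaultdict(float)
    for year, _server, hours in records:
        totals[year] += hours
    return dict(totals)
```

Comparing the totals of two years then gives the kind of ratio quoted in the report.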
7. Summary
The workload pathology can be discovered by
statistical tools, since it normally appears as very
unusual system behavior. It should then be excluded or
masked from healthy historical data as an outlier to
avoid statistically incorrect conclusions.
In most cases an additional non-statistical filter can
be added to the MASF-type detector to recognize a
typical defect such as a run-away process (infinite
program loop) or a memory leak. Even if this filter
does not recognize a defect, the exception should
be the subject of additional analysis, and the hidden
defect can be discovered. The history of previous
exceptions can help to identify that.
This approach was implemented in the SAS-based
Statistical Exception Detection System, and it has
been successfully used for 5 years to automatically
report possible workload pathologies on a
large computer farm.
8. References
[1] Igor Trubin, Kevin McLaughlin; “Exception
Detection System, Based on the Statistical
Process Control Concept”, Proceedings of the
Computer Measurement Group, 2001
[2] Igor Trubin, Linwood Merritt; “Disk
Subsystem Capacity Management Based on
Business Drivers I/O Performance Metrics and
MASF”, Proceedings of the Computer
Measurement Group, 2003
[3] Jack Woolley; “How to Validate Application
Quality, Performance and Scalability”,
Proceedings of the Computer Measurement
Group, 2001
[4] Russell Rogers; “Ubiquitous Data Collection
in a Large Distributed Environment”,
Proceedings of the Computer Measurement
Group, 2004
[5] Adam Grummitt; “Corporate Performance
Management as a Pragmatic Process in an ITIL
World”, Proceedings of the Computer
Measurement Group, 2004
[6] Ron Kaminski; “Automating Process and
Workload Pathology Detection”, Proceedings
of the Computer Measurement Group, 2003
[7] Igor Trubin; “Global and Application Levels
Exception Detection System, Based on MASF
Technique”, Proceedings of the Computer
Measurement Group, 2002