Predicting Performance Anomalies in Software Systems at Run-time
GUOLIANG ZHAO, School of Computing, Queen’s University, Canada
SAFWAT HASSAN, Department of Engineering, Thompson Rivers University, Canada
YING ZOU, Department of Electrical and Computer Engineering, Canada
DEREK TRUONG, IBM, Canada
TOBY CORBIN, IBM, United Kingdom
High performance is a critical factor to achieve and maintain the success of a software system. Performance anomalies represent
the performance degradation issues (e.g., slowing down in system response times) of software systems at run-time. Performance
anomalies can have a dramatic negative impact on users' satisfaction. Prior studies propose different approaches to detect
anomalies by analyzing execution logs and resource utilization metrics after the anomalies have happened. However, the prior
detection approaches cannot predict the anomalies ahead of time; such limitation causes an inevitable delay in taking corrective actions
to prevent performance anomalies from happening. We propose an approach that can predict performance anomalies in software
systems and raise anomaly warnings in advance. Our approach uses a Long-Short Term Memory (LSTM) neural network to capture the
normal behaviors of a software system. Then, our approach predicts performance anomalies by identifying the early deviations from
the captured normal system behaviors. We conduct extensive experiments to evaluate our approach using two real-world software
systems (i.e., Elasticsearch and Hadoop). We compare the performance of our approach with two baselines. The first baseline is a state-of-the-art approach called Unsupervised Behavior Learning (UBL). The second baseline predicts performance anomalies by checking whether the resource utilization exceeds pre-defined thresholds. Our results show that our approach can predict various performance
anomalies with high precision (i.e., 97-100%) and recall (i.e., 80-100%), while the baselines achieve 25-97% precision and 93-100%
recall. For a range of performance anomalies, our approach can achieve sufficient lead times that vary from 20 to 1,403 seconds (i.e.,
23.4 minutes). We also demonstrate the ability of our approach to predict the performance anomalies that are caused by real-world
performance bugs. For predicting performance anomalies that are caused by real-world performance bugs, our approach achieves
95-100% precision and 87-100% recall, while the baselines achieve 49-83% precision and 100% recall. The obtained results show that
our approach outperforms the existing anomaly prediction approaches and is able to predict performance anomalies in real-world
systems.
CCS Concepts: •Software and its engineering →Software post-development issues.
Additional Key Words and Phrases: Performance Anomaly Prediction, LSTM Neural Network, Software Systems
ACM Reference Format:
Guoliang Zhao, Safwat Hassan, Ying Zou, Derek Truong, and Toby Corbin. 2020. Predicting Performance Anomalies in Software
Systems at Run-time. ACM Trans. Softw. Eng. Methodol. 1, 1, Article 1 (January 2020), 32 pages. https://doi.org/10.1145/3440757
Authors’ addresses: Guoliang Zhao, School of Computing, Queen’s University, Kingston, Canada, 17gz2@cs.queensu.ca; Safwat Hassan, Department
of Engineering, Thompson Rivers University, Kamloops, Canada, shassan@tru.ca; Ying Zou, Department of Electrical and Computer Engineering,
Kingston, Canada, ying.zou@queensu.ca; Derek Truong, IBM, Markham, Canada, trong@ca.ibm.com; Toby Corbin, IBM, Southampton, United Kingdom,
corbint@uk.ibm.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
1 INTRODUCTION
Large-scale software systems have become an essential part of our daily activities, such as performing financial services, fulfilling healthcare operations, and enabling interpersonal communication. To meet users' expectations, software systems need to perform their functionalities with high performance and reliability. However, the performance of software systems is prone to run-time performance anomalies (e.g., slowing down in system response time) [55]. Performance anomalies mean that the monitored system behaviors cannot be explained by the current system workload [9]. For example, the number of transactions that are processed by a system suggests less CPU and memory consumption than the actual resource usage.
To maintain the performance of software systems, operators need to detect performance anomalies and prevent failures from happening at run-time [29]. Due to the increasing scale and complexity of software systems, it is challenging for operators to manually keep track of the execution status of software systems. Hence, researchers have proposed various approaches to automatically monitor software systems and detect anomalies at run-time [27] [29] [44] [45] [51] [64]. However, the delay in detecting anomalies and taking corrective actions can result in violations of Service Level Objectives (SLOs) (e.g., long response times for users' requests) or system failures, which can cause financial loss [11]. For example, in August 2013, Amazon was down for 40 minutes. Due to this outage, Amazon had almost $5 million in revenue loss [61]. Hence, it is desirable to provide a performance anomaly prediction approach that can proactively raise anomaly warnings in advance, and thereby help operators prevent potential anomalies from happening.
Performance anomaly prediction approaches forecast that a software system is about to enter an anomalous state by capturing the pre-failure state before the anomaly happens [11]. Performance bugs are non-functional bugs that can cause significant performance degradation [30]. Performance bugs (e.g., memory leak bugs) do not always break down software systems instantly. Software systems are considered to be in a pre-failure state from the start of experiencing performance degradation until the performance anomaly occurs [11]. Prior studies propose approaches that can predict failures of virtual machines (VMs) in cloud infrastructures [11] [56]. The existing studies [11] [56] analyze the metrics collected from VMs (e.g., the CPU usage of a VM) and predict the future failures of the monitored VMs. For a cloud with hundreds of VMs, predicting VM-level failures can pinpoint the VM that will have a failure and enable operators to take proper actions to prevent the failure from happening. For example, if a VM in the cloud is running out of memory, operators can migrate the VM to a physical host with a larger memory space to stop the out-of-memory failure from happening [57]. However, the existing studies [11] [56] predict failures at the VM-level, and operators still need to manually localize the anomalous application that causes the predicted failure. There are multiple challenges for achieving efficient performance anomaly prediction for applications (i.e., achieving performance anomaly prediction at the application-level), as follows.
• Dynamic software behaviors. The behaviors of software systems are dynamic (e.g., fluctuating memory usage) because of the constantly changing user behaviors and varying system workloads. Hence, it is difficult to characterize the behaviors of a software system and precisely capture the pre-failure state [26]. Consequently, existing anomaly detection techniques that use statistical models can easily misclassify temporary high resource utilization as anomalous behavior [7] [41] [42] [43] [59].
• Instrumentation challenges. Operators do not always have access to the source code of systems running on the cloud. It is infeasible to apply existing anomaly detection techniques [4] [17] that need to instrument the source code of software systems. In addition, instrumenting the binary or source code of software systems can introduce overhead and impact the performance of the systems at run-time [17] [60].
• Data acquisition challenges. It is challenging to obtain enough labeled training data (i.e., monitoring data with normal and anomalous labels) to build supervised anomaly prediction approaches [19] [24] [66]. Companies are often reluctant to release their monitoring dataset as it may contain confidential information about their users. More importantly, supervised techniques can only predict known anomalies that appear in the training dataset [11].
In this paper, we present a performance anomaly prediction approach that overcomes the aforementioned challenges. First, we apply a Long-Short Term Memory (LSTM) neural network to handle the difficulty in characterizing the dynamic behaviors of applications. The LSTM neural network overcomes the vanishing gradient problem experienced by recurrent neural networks (RNNs) [23]. LSTM neural networks have been widely used in language modeling [54], text classification [71], and failure prediction for physical equipment (e.g., engines) [7] [41] [42] [43] [59]. Compared with other machine learning techniques (e.g., linear regression) and deep learning techniques (e.g., RNN), the LSTM neural network has the advantage of incorporating the time dependency in time-series data. By incorporating the time dependency, LSTM neural networks can capture both the temporal fluctuations and the long-term changes in the monitoring data (e.g., CPU usage) collected from an application to characterize the dynamic behaviors of the application. To avoid instrumenting applications, we only use a performance monitoring application that is external to the monitored applications to track resource utilization (e.g., the CPU and memory usage of a JAVA application). To solve the challenges related to data acquisition, we train LSTM neural networks to capture the normal behaviors of applications under regular operations. Compared with obtaining labeled data from both normal and anomalous behaviors of applications, collecting only normal behavior data is more straightforward and practical [11]. At the run-time monitoring stage, our approach predicts anomalies by checking whether the application deviates from the normal behaviors that are expected by the LSTM neural networks. Therefore, our approach applies the LSTM neural networks to predict performance anomalies without the need to be trained on anomalous behavior data.
Predicting performance anomalies at the application-level enables our approach to be used in multiple scenarios.
Applications can be hosted on a large-scale cluster with thousands of nodes (e.g., virtual machines). Our approach
can be used to monitor the instances of an application running on each node and locate the anomalous instances of
the application in the cluster. In addition, in an industrial environment, there might be several applications running
on the same machine or virtual machine (VM). Operators can deploy multiple instances of our approach with each
instance monitoring an application. By monitoring each application, our approach can predict which of the applications running on a machine is anomalous.
The lead time measures how far in advance our approach can predict a performance anomaly. After predicting a performance anomaly, operators or automatic anomaly prevention approaches [57] should take actions to prevent the predicted anomaly from happening. Anomaly prevention actions, such as scaling VM resources and live VM migration, take time to complete. Thus, our approach should predict anomalies with sufficient lead times to ensure that the anomaly prevention actions can be finished.
We conduct quantitative experiments on two different widely used open-source software systems: 1) Elasticsearch, an open-source, distributed, and RESTful search engine [15], and 2) the Hadoop MapReduce sample applications provided by the Hadoop distribution [3]. Specifically, our work addresses the following research questions (RQs):
RQ1. What is the performance of our approach for predicting performance anomalies?
To evaluate the performance of our approach, we inject bugs that can cause performance anomalies into the source
code of the studied systems and trigger the injected performance bugs. Furthermore, we compare our approach with
two baselines: (1) the Unsupervised Behavior Learning (UBL) baseline proposed by Dean et al. [11], and (2) a baseline that predicts anomalies for an application by checking if the resource utilization of the application exceeds pre-defined thresholds. Our experiment results demonstrate that our approach can predict various performance anomalies with 97-100% precision and 80-100% recall, while the two baselines achieve 25-97% precision and 93-100% recall.
RQ2. How early can our approach raise a performance anomaly warning before the anomaly occurs?
In this RQ, we evaluate the lead times of our approach in predicting performance anomalies. Through our experiments, our approach can predict different types of performance anomalies in advance and achieve lead times varying from 20 to 1,403 seconds (i.e., 23.4 minutes). As mentioned in the existing study [57], anomaly prevention actions, such as scaling VM resources, can be completed within one second. Thus, the lead times achieved by our approach can be sufficient for the automatic prevention approaches [57] to take proper actions and prevent the anomalies from happening.
RQ3. Can our approach be used to predict anomalies that happened in the real world?
In the previous RQs, we measure the performance of our approach in predicting performance anomalies that are caused by the injected bugs. However, real-world performance bugs may be more complex than our injected bugs. Hence, in this RQ, we demonstrate the ability of our approach to predict performance anomalies that are caused by real-world performance bugs. Because of the limited and ambiguous information in bug reports, reproducing real-world performance bugs is a challenging and time-consuming task [50]. In addition, there are performance bugs that can only occur under special workloads and running environments. In the end, we managed to reproduce five performance bugs that appeared in Elasticsearch. Then, we use the reproduced performance bugs to trigger performance anomalies. We observe that our approach can predict the anomalies that are caused by these five real-world performance bugs with 95-100% precision and 87-100% recall, while the two baselines achieve 49-83% precision and 100% recall. In addition, we evaluate the run-time overhead of our anomaly prediction approach. We observe that our LSTM neural networks take an average of 0.25 milliseconds, 320MB of memory, and 25% CPU usage (i.e., two CPU cores) on our machine with an 8-core Intel i7-4790 3.60GHz CPU to predict anomalies every second.
Paper organization: The rest of the paper is organized as follows. Section 2 describes the design of our approach. Section 3 outlines the data collection methodology. Section 4 presents the evaluation results for our approach. We discuss the overhead and usage scenarios of our approach in Section 5. The threats to validity are discussed in Section 6. We present the related work in Section 7. Finally, we conclude our paper in Section 8.
2 SYSTEM DESIGN
Figure 1 gives an overview of our approach. As shown in Figure 1, our approach consists of two phases: (1) the offline training phase to train our LSTM neural network, and (2) the run-time monitoring phase to predict performance anomalies. In this section, we introduce the design of our performance anomaly prediction approach. We first describe the metrics that are used to monitor applications. Next, we introduce the architecture of our approach using the LSTM neural network. Then, we present the training phase of our approach. Finally, we describe our approach to predicting performance anomalies using the trained LSTM neural network.
2.1 Monitoring Applications
In a real-world system, an application might co-exist with other applications. We measure the resource utilization that is specific to the application under monitoring using performance monitoring tools. Various performance monitoring tools (e.g., jconsole [47], Solarwinds [52], and the task manager [65]) have been developed to monitor the resource utilization of an application.
[Figure 1 depicts the two phases of our approach: in the offline training phase, the values of the monitoring metrics collected from a stable version of the monitored system are used to train the LSTM neural network; in the run-time monitoring phase, the values of the monitoring metrics collected from a new version of the monitored system are fed to the trained LSTM neural network to predict performance anomalies and raise anomaly alarms.]
Fig. 1. An overview of our performance anomaly prediction approach.
The performance monitoring tools can capture the resource utilization of the running processes of an application without instrumenting the application.
Table 1 presents the monitoring metrics that are used in our approach. We apply six metrics (e.g., CPUUsage, HeapUsage, and NativeUsage) that are commonly used to monitor the performance of applications. For JAVA applications, there are four regions (i.e., the eden, survivor, old generation, and code cache regions) in the heap and two regions (i.e., the meta and compressed class regions) in the native memory space. The detailed descriptions of the regions can be found in the OpenJDK documentation [46]. Each region is responsible for storing a unique type of data. Hence, we monitor the usage of each region. All twelve metrics shown in Table 1 can be obtained using performance monitoring applications without any instrumentation.
Table 1. The metrics used to monitor applications in our approach.
Metric Name Description
CPUUsage The CPU usage of the monitored application
HeapUsage The size of the heap of the monitored application
NativeUsage The size of the native memory space of the monitored application
EdenUsage The size of the eden region in the heap
SurUsage The size of the survivor region in the heap
OldGenUsage The size of the old generation region in the heap
CodeCacheUsage The size of the code cache region in the heap
MetaUsage The size of the meta region in the native memory space
CompressedClassUsage The size of the compressed class region in the native memory space
NumThreads The number of live threads that are issued by the monitored application
NumClasses The number of classes of the monitored application that are loaded into memory
GcTime The time spent in the latest garbage collection (in milliseconds)
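To make the sampling step concrete, the following sketch shows a once-per-second collection loop for a monitored process. It is a minimal illustration only: it uses the psutil library as a stand-in for the monitoring tooling described later (IBM Javametrics), so it covers just three of the twelve metrics, and the process id passed to it is a hypothetical placeholder.

```python
import time
import psutil

def sample_metrics(pid, interval=1.0, duration=60):
    """Sample a few process-level metrics once per second (simplified).

    The approach collects twelve JVM-level metrics with an external
    monitoring tool; this sketch only shows the sampling loop with CPU
    usage, resident memory, and thread count obtained through psutil.
    """
    process = psutil.Process(pid)
    samples = []
    for _ in range(int(duration / interval)):
        samples.append({
            "timestamp": time.time(),
            "cpu_usage": process.cpu_percent(interval=None),  # % since last call
            "memory_bytes": process.memory_info().rss,        # resident set size
            "num_threads": process.num_threads(),
        })
        time.sleep(interval)
    return samples
```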
2.2 Architecture of our approach
To precisely model the behavior of an application, we need to capture the time dependency of the collected metrics
(i.e., the dependency between current resource usages and past resource usages). Long-Short Term Memory (LSTM)
neural network is well known to accurately capture the long term time dependency and model complex correlations in
time-series data (e.g., the collected values of the monitoring metrics over time) [
7
] [
43
]. The LSTM neural network
is a special kind of recurrent neural network (RNN) that avoids the vanishing gradient problem existed in RNN [
23
].
The vanishing gradient problem means the loss of long-term dependencies introduced by the decaying gradient values.
The LSTM neural network overcomes this problem by applying multiplicative gates that enforce constant error ow
through the internal states of the LSTM memory cells. The key of the LSTM neural network is the internal states of its
the LSTM memory cells. Figure 2 shows the structure of a LSTM memory cell. There are three multiplicative gates, i.e.,
input gate, forget gate, and output gate in the LSTM memory cells. The input gate is used to learn what new input
information should be stored in the internal states of the memory cells. The forget gate controls how long the new
input information should be stored and the output gate learns what parts of the stored information the memory cells
should output. These three gates prevent the internal states of memory cells from being perturbed by irrelevant inputs
and outputs, which enables the LSTM neural networks to capture long term time dependency. Thus, we choose to build
a predictive the LSTM neural network to model the normal behaviors of an application.
Fig. 2. Long Short-term Memory Cell.
The architecture of our approach is shown in Figure 3. Our approach contains four layers: one input layer, two hidden LSTM layers, and one output layer with three output branches. We build a deep LSTM network by stacking two LSTM layers, i.e., each neural unit in the lower LSTM hidden layer is fully connected to each unit in the higher LSTM hidden layer through feedforward connections. Prior studies show that stacking recurrent hidden layers enables a neural network to capture the structure of time-series data accurately [7] [22] [32] [41] [42] [43] [59]. Inspired by prior studies, we choose to use stacked LSTM layers to process the values of the monitoring metrics and model the normal behaviors of applications.
Fig. 3. The architecture of our approach.
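As an illustration of this architecture, the following Keras sketch stacks two LSTM layers and attaches three single-value output branches. It is a minimal sketch rather than the authors' exact implementation; the layer sizes and window length follow the parameters reported later in Table 3, while details such as the output layers and activations are assumptions.

```python
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Model

def build_model(num_steps=4, num_features=12, hidden_units=8):
    """Two stacked LSTM layers feeding three single-value output branches."""
    inputs = Input(shape=(num_steps, num_features))
    # The lower LSTM layer returns the full sequence so that the upper
    # layer can consume it, which is what stacking two recurrent layers
    # requires in Keras.
    x = LSTM(hidden_units, return_sequences=True)(inputs)
    x = LSTM(hidden_units)(x)
    cpu = Dense(1, name="cpu_usage")(x)
    old_gen = Dense(1, name="old_gen_usage")(x)
    threads = Dense(1, name="num_threads")(x)
    model = Model(inputs=inputs, outputs=[cpu, old_gen, threads])
    # Mean squared error is computed separately for each output branch.
    model.compile(optimizer="adam", loss="mse")
    return model
```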
2.3 Training our Predictive LSTM Neural Network
As shown in Figure 4, the input to our LSTM neural network is a time series
x1,x2,x3· · · ,xn
, where each entry
xt,
1
≤t≤n
, is a 12-dimensional vector that contains the values of the monitoring metrics collected at time
t
. For
each metric, we assign one neural unit in the input layer to take the value of the metric. To capture the behavior of an
application, the next values (i.e.,
x(n+1)
) of the monitoring metrics should be predicted [
42
]. We train our LSTM neural
network to predict the CPU usage, memory usage, and the number of threads because these three metrics are common
measures for the resource utilization of applications. For the memory usage, we choose to predict the
Old GenU saдe
metric that measures the size of the old generation region in the heap. More specically, the old generation region is
used to only store stable objects [
35
] [
46
]. Thus, compared with other memory usage metrics,
Old GenU saдe
contains
the long-term memory usage of objects and is less uctuating. During the training phase, our LSTM neural network tries
to minimize the dierence between the predicted values of the monitoring metrics and the real observed values of the
monitoring metrics. We use only the values of the monitoring metrics collected from normal behaviors of applications
to train our LSTM neural network. Thus, our LSTM neural network can capture normal behaviors of applications.
The output of our LSTM neural network is the predictions of
CP UU s aдe
,
Old GenU saдe
, and
NumT hreads
. The
stacking LSTM layers, shown in Figure 3, extract the time dependency and correlations information from the input
time-series values of the monitoring metrics. Then, the extracted information is passed down to the three output
branches and each branch predicts the value of a monitoring metric. Dierent from predicting all these three metrics
together in a three-dimensional vector, predicting them separately provides information loss for each metric during the
training process.
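The following sketch shows one way to turn the collected normal-behavior samples into training pairs for the network above: each input window holds a few consecutive 12-dimensional samples, and the targets are the CPUUsage, OldGenUsage, and NumThreads values of the sample that follows the window. The column positions of the three target metrics are assumptions about how the samples are stored, and build_model refers to the architecture sketch above.

```python
import numpy as np

def make_windows(metrics, num_steps=4):
    """Turn a (T, 12) array of normal monitoring samples into training pairs.

    X[i] holds num_steps consecutive samples; the targets are the CPUUsage,
    OldGenUsage, and NumThreads values of the sample following the window.
    The column indices below are assumed positions of the target metrics.
    """
    CPU, OLD_GEN, THREADS = 0, 5, 9  # assumed column positions
    X, y_cpu, y_old, y_thr = [], [], [], []
    for t in range(len(metrics) - num_steps):
        X.append(metrics[t:t + num_steps])
        nxt = metrics[t + num_steps]
        y_cpu.append(nxt[CPU])
        y_old.append(nxt[OLD_GEN])
        y_thr.append(nxt[THREADS])
    return np.array(X), [np.array(y_cpu), np.array(y_old), np.array(y_thr)]

# Training on normal-behavior data only (parameter values from Table 3):
# model = build_model()
# X, targets = make_windows(normal_metrics)
# model.fit(X, targets, batch_size=8, epochs=20)
```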
2.4 Performance Anomaly Prediction Using the Trained LSTM Neural Network
Performance bugs (e.g., memory leak and infinite loop bugs) do not always result in performance anomalies instantly. For example, it takes seven days for memory leak bug #8249 in Elasticsearch to occupy the memory space and crash Elasticsearch, as introduced in the bug report (https://github.com/elastic/elasticsearch/issues/8249). There is a time window, i.e., the pre-failure time window, from the start of a performance bug until the performance anomaly occurs [11]. Therefore, we use the trained LSTM neural network to predict the future occurrences of performance anomalies by detecting whether the values of the monitoring metrics deviate from their normal behaviors.
Fig. 4. An overview of our approach using the LSTM neural network to predict performance anomalies.
Figure 4 shows an overview of our performance anomaly prediction approach. First, the trained LSTM neural network reads the collected time-series values of the monitoring metrics before time n+1, i.e., x_1, x_2, x_3, ..., x_n. Then, the LSTM neural network predicts the values of the monitoring metrics at time n+1, i.e., the predicted CPUUsage_{n+1}, OldGenUsage_{n+1}, and NumThreads_{n+1}. Next, we calculate the differences between the three pairs of predicted and real observed values, i.e., the predicted and observed CPUUsage_{n+1}, the predicted and observed OldGenUsage_{n+1}, and the predicted and observed NumThreads_{n+1}. Our LSTM neural network is trained to capture the normal behaviors of an application. Thus, the predicted values follow the knowledge learned from the normal behaviors. A large difference between any pair of predicted and real observed values indicates that the monitored application deviates from its normal behaviors. We set up a threshold for the difference between each pair of predicted and real observed values. As shown in Figure 4, if the difference is greater than the threshold, we consider that the application is in a pre-failure state, and a performance anomaly warning is raised. Otherwise, the application is in a normal state, and no performance anomaly warning is raised. The design of the thresholds is explained in Section 4.1.
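A minimal sketch of this run-time check, assuming the model from the earlier sketches and absolute differences as the deviation measure, could look as follows; the threshold values themselves are chosen on the validation data as described in Section 4.1.

```python
def predict_and_check(model, recent_window, observed, thresholds):
    """Raise a warning if any prediction deviates too much from the observation.

    recent_window: array of shape (1, num_steps, 12) with the latest samples.
    observed: dict with the real CPUUsage, OldGenUsage, and NumThreads at n+1.
    thresholds: per-metric deviation thresholds chosen on the validation set.
    """
    pred_cpu, pred_old, pred_thr = (p.item() for p in
                                    model.predict(recent_window, verbose=0))
    deviations = {
        "cpu_usage": abs(pred_cpu - observed["cpu_usage"]),
        "old_gen_usage": abs(pred_old - observed["old_gen_usage"]),
        "num_threads": abs(pred_thr - observed["num_threads"]),
    }
    # The application is considered to be in a pre-failure state if any
    # deviation exceeds its threshold.
    raise_warning = any(deviations[m] > thresholds[m] for m in deviations)
    return raise_warning, deviations
```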
3 DATA COLLECTION
We evaluate our performance anomaly prediction approach using the data collected from two software systems: 1) the Hadoop MapReduce sample applications, which are JAVA open-source applications provided by the Hadoop distribution [3]; and 2) Elasticsearch, which is a JAVA open-source, distributed, and RESTful search engine [15]. To test the generalizability of our approach, we conduct experiments using seven different Hadoop MapReduce applications mentioned in Section 3.2. The studied seven Hadoop MapReduce applications have less than 1,000 lines of code. To evaluate our approach on a large-scale software system, we test our approach on Elasticsearch, which has more than 1.7 million lines of code. Elasticsearch is developed by over 1,300 developers and has more than 50,000 commits on GitHub. Furthermore, we test our approach on different released versions of Elasticsearch mentioned in Section 4.3.
Figure 5 shows an overview of our data collection approach. To train and evaluate our approach, we need to collect the metrics that track both normal and anomalous application behaviors. As shown in Figure 5, we first download the released versions of the studied applications. For collecting the values of the monitoring metrics for normal behaviors, we download Elasticsearch version 5.3.0 and the Hadoop applications that run on Hadoop MapReduce 3.1.3. We select these versions because they are well tested and widely used in practice. Next, we collect the values of the monitoring metrics for normal behaviors by simulating various workloads on the studied applications. Figure 6(a) shows an example of the values of the CPU usage for normal behaviors that are collected from the Hadoop applications. Then, we inject bugs into the source code of the studied applications and obtain the buggy applications. Finally, we collect the values of the monitoring metrics for anomalous cases by simulating the workloads on the buggy applications. Figure 6(b) shows an example of the values of the CPU usage for anomalous cases that are collected from a buggy Elasticsearch with an injected infinite loop bug. As shown in Figure 6(b), after the occurrence of the infinite loop bug at the time mark of 1,500 seconds, the CPU usage of Elasticsearch starts increasing and reaches 100%. We explain the bug injection and workload simulation steps as follows.
[Figure 5 depicts the data collection pipeline. For normal behaviors, we download the released versions from GitHub, run the clean systems under various workloads, and collect the values of the monitoring metrics. For anomalous behaviors, we inject bugs into the released versions, run the systems with the injected bugs under various workloads, and collect the values of the monitoring metrics from the buggy systems.]
Fig. 5. An overview of our data collection approach.
Bug injection: To simulate performance anomalies, previous studies inject bugs that are related to the CPU and memory usage into the source code [11] [19] [56] [66]. Inspired by the previous studies, we inject three types of performance bugs into the studied applications as follows.
• Infinite loop. We add infinite loop bugs that increase CPU consumption continuously by creating threads with infinite loops.
• MemLeak. We inject memory leak bugs that continuously create objects and store them in static lists. Because the elements in static lists are not removed by the garbage collection process, the memory usage of the monitored JVM keeps increasing.
• Deadlock. We insert buggy code to create threads with deadlock issues. The created threads deadlock each other.
We aim to inject the three types of performance bugs into different source code locations of the studied applications and obtain different buggy applications. Manually injecting performance bugs is error-prone and time-consuming. Therefore, we propose an automatic bug injection approach that injects bugs into the studied applications, as shown in Figure 7. We first select the JAVA classes in the source code to inject bugs. We only select JAVA classes that can be reached during run-time to inject performance bugs. Because the architectures of the Hadoop MapReduce applications and Elasticsearch are different, we apply different strategies to select JAVA classes in the studied applications, as shown in Sections 3.1 and 3.2. We build the Abstract Syntax Trees (ASTs) of the selected classes, which represent the syntax structures of the source code of the classes. For each selected class, we analyze the AST and identify the location to inject performance bugs. Next, we inject the three types of performance bugs into the source code of the selected class.
(a) Normal CPU usage of the monitored Hadoop applications. (b) Anomalous CPU usage of the monitored Elasticsearch.
Fig. 6. An example of the values of the monitoring metrics for the normal and anomalous cases of the monitored Elasticsearch and Hadoop applications.
Figure 8(a) shows the source code of a Java class in Elasticsearch. Our approach analyzes the syntax structure of the class and identifies that the definitions of the RestAnalyzeAction class and the prepareRequest function start at lines 41 and 78, respectively. Figure 8(b) shows the source code of the Java class after injecting a memory leak bug. A static list is injected in the RestAnalyzeAction class (i.e., line 42) and a for loop is injected at the beginning of the prepareRequest function (i.e., line 80) to add elements into the static list.
[Figure 7 depicts the automatic bug injection pipeline: select the classes to inject bugs into, build the ASTs of the selected classes, and inject the bugs to obtain the buggy systems.]
Fig. 7. An overview of our automatic bug injection approach.
Running the normal/buggy applications under various workloads: We conduct all the workload simulations on a single machine with an 8-core Intel i7-4790 3.60GHz CPU, 32GB memory, and Ubuntu version 16.04.1. We deploy Elasticsearch and the Hadoop MapReduce framework in a single-node cluster. We realize that our testbed environment is much simpler than an industrial environment in which a cluster could host thousands of nodes. As our approach works on the application-level resource usage metrics, a single-node cluster in our experiment design can provide the same infrastructure as a multiple-node cluster for collecting the application-level metrics. We discuss how to use our approach in a large-scale cluster in the Discussion Section. To demonstrate that our approach can be used in practice, we drive the subject applications using various workloads that are observed in real-world scenarios, as follows.
(a) The Java class before bug injection. (b) The Java class after bug injection.
Fig. 8. The source code of a Java class before and after injecting a memory leak bug.
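To illustrate the kind of transformation shown in Figure 8, the following simplified sketch inserts a memory-leak snippet into a Java class. It is not the authors' AST-based injector: it locates the class and method declarations with a plain line search, and the injected field name and loop bound are hypothetical examples of objects that stay reachable from a static list and therefore survive garbage collection.

```python
LEAK_FIELD = ("    static java.util.List<int[]> injectedLeak ="
              " new java.util.ArrayList<>();\n")
LEAK_LOOP = ("        for (int i = 0; i < 100000; i++) {"
             " injectedLeak.add(new int[1024]); }\n")

def inject_memory_leak(java_source, class_name, method_name):
    """Insert a memory-leak snippet into a Java class (simplified sketch).

    The injected static list keeps every added array reachable, so the
    garbage collector cannot reclaim it and memory usage grows on each call.
    """
    out, injected_field, injected_loop = [], False, False
    for line in java_source.splitlines(keepends=True):
        out.append(line)
        if not injected_field and f"class {class_name}" in line and "{" in line:
            out.append(LEAK_FIELD)   # field added right after the class header
            injected_field = True
        elif (injected_field and not injected_loop
              and method_name + "(" in line and "{" in line):
            out.append(LEAK_LOOP)    # loop added at the start of the method body
            injected_loop = True
    return "".join(out)
```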
• For Elasticsearch, we use the datasets that are designed to conduct the macrobenchmarking test as the input workload. The macrobenchmarking test is a testing methodology used to evaluate the overall performance of applications [38].
• As shown in Table 2, the studied Hadoop MapReduce applications produce statistics about a set of text files. For example, the application wordMean counts the average length of the words in the input files. RandomTextWriter [2] is a text generation application that is provided by Hadoop to generate random large text files (e.g., files with 1GB size). Hence, we generate the input datasets for the studied Hadoop MapReduce applications using the RandomTextWriter application.
Our approach collects the values of the monitoring metrics of the studied applications using a toolkit called IBM Javametrics [25], which monitors the resource usage of a JVM. IBM Javametrics can be configured to automatically store the values of the monitoring metrics, while other performance monitoring tools (e.g., the task manager [65] or jconsole [47]) require manual effort to output the values of the monitoring metrics. We configure the JVMs of the studied applications to load IBM Javametrics when the applications start running. Then, the loaded IBM Javametrics collects the resource usage of the applications at run-time with a configured sampling interval. We set the sampling time interval to one second in our experiments, as used in the existing studies [11]. In the next sections, we describe the details of the data collection for the studied applications.
3.1 Data Collection for Elasticsearch
We follow the bug injection approach shown in Figure 7 to obtain buggy versions of Elasticsearch. We first identify all the classes that handle HTTP requests. In Elasticsearch, the classes that handle HTTP requests act as entry points for processing HTTP requests. Hence, we can inject the performance bugs into the classes that handle HTTP requests and trigger the injected bugs once we send HTTP requests. For example, we obtain a buggy Elasticsearch by injecting an infinite loop bug into the class RestClusterStateAction.java, which handles the HTTP requests for checking the cluster states. The injected bug creates a CPU consumption thread every time the application handles a request for checking the cluster state.
The API method controller.registerHandler() is used to register the classes for handling HTTP requests. Therefore, we find all the classes that handle HTTP requests by searching for the keyword controller.registerHandler(). We then manually check the classes identified by the keyword search and filter out the classes that do not process HTTP requests. After the filtering, 90 classes that handle HTTP requests are identified. For each class, we obtain three buggy versions of Elasticsearch by injecting the three different types of performance bugs. In total, we obtain 90*3 (i.e., 270) buggy versions of Elasticsearch. After adding the one normal version of Elasticsearch, we have 90*3+1 (i.e., 271) versions of Elasticsearch in total.
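A minimal sketch of the keyword search described above could look as follows; the returned files are only candidates and still require the manual filtering step mentioned in the text.

```python
import os

def find_handler_classes(source_root, keyword="controller.registerHandler("):
    """Return the Java files under source_root that contain the keyword."""
    matches = []
    for dirpath, _, filenames in os.walk(source_root):
        for name in filenames:
            if not name.endswith(".java"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                if keyword in handle.read():
                    matches.append(path)
    return matches
```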
We use Rally [14] to simulate real-world workloads. Rally is a macrobenchmarking framework for Elasticsearch, and it provides various macrobenchmarking test suites based on different datasets, such as the question and answer datasets from StackOverflow and the HTTP server log dataset. In our experiment, a workload simulation is the execution of Elasticsearch under a macrobenchmarking test suite. Figure 9 shows an example of applying a test suite to Elasticsearch. As shown in Figure 9, the macrobenchmarking test suites simulate real-world scenarios of using Elasticsearch, such as executing term searching operations that search for different keywords or executing range searching operations that search for all documents created in a specific time window. Each test suite takes the numbers of different types of operations as configurable parameters. Hence, to ensure that the training data and testing data for our approach are collected under different workloads, we randomize the parameters of a test suite before applying it. Then, the different types of operations are executed sequentially. After executing the test suite, we restart Elasticsearch and start applying the next test suite. We have 271 versions (i.e., 270 buggy versions and one normal version) of Elasticsearch. We find three stable test suites that can be applied to simulate workloads. Thus, for each version of Elasticsearch (e.g., normal or buggy versions), we apply these three test suites in Rally. In total, we run 271*3 (i.e., 813) test suites. The duration of each test suite varies from 45 minutes to 2 hours. It takes our machine around two months to finish running all the test suites and collecting the values of the monitoring metrics.
[Figure 9 depicts the workload simulation steps for a running Elasticsearch system: randomize the numbers of the different operations; execute the term searching, range searching, scroll searching, and other operations; and then restart the Elasticsearch system.]
Fig. 9. A workload simulation example of applying a test suite to Elasticsearch.
To generate data for anomalous situations, we keep sending the HTTP requests that are handled by the injected classes to trigger the injected bugs. The starting time for sending the HTTP requests and the time intervals between every two requests are randomly selected and recorded. While simulating the workloads, we track the response times of Elasticsearch for processing the submitted HTTP requests. A performance anomaly is marked if the average response time is greater than a threshold (e.g., 100ms) [11]. The marked performance anomalies are then used to evaluate the performance of our anomaly prediction approach.
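The following sketch illustrates how such anomaly labels could be derived from the recorded response times; the sliding-window average is an assumption, as the text only states that an anomaly is marked when the average response time exceeds a threshold (e.g., 100 ms).

```python
def mark_anomalies(response_times_ms, threshold_ms=100, window=10):
    """Flag the seconds whose average response time exceeds the threshold.

    response_times_ms: per-second response times of the monitored system.
    The sliding window length is an assumed parameter.
    """
    flags = []
    for i in range(len(response_times_ms)):
        start = max(0, i - window + 1)
        recent = response_times_ms[start:i + 1]
        flags.append(sum(recent) / len(recent) > threshold_ms)
    return flags
```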
3.2 Data Collection for Hadoop
We conduct experiments using seven sample applications that are provided by the Hadoop distribution, as listed in Table 2. For each sample application, we use the approach shown in Figure 7 to automatically inject the three types of performance bugs into the Map class and obtain three buggy applications. We have seven Hadoop applications, meaning that we obtain 7*3 (i.e., 21) buggy Hadoop applications and seven normal Hadoop applications in total. We then collect the values of the monitoring metrics following the data generation approach shown in Figure 5. For the Hadoop applications multiFileWc, wordCount, wordMean, wordMedian, and wordStandardDeviation, each simulated workload is 10GB of text files. For the Hadoop applications bbp and pi, each simulated workload is the number of digits of π that needs to be calculated. To ensure that the training data and testing data for our approach are collected under different workloads, we generate different text files and different numbers of digits of π whenever applying them to the Hadoop applications.
Table 2. The sample applications provided by the Hadoop distribution.
Application | Description
bbp | An application that uses the Bailey-Borwein-Plouffe algorithm to compute the exact digits of Pi.
pi | An application that estimates Pi using a quasi-Monte Carlo method.
multiFileWc | An application that counts the words in several input files separately.
wordCount | An application that counts the words in the input files.
wordMean | An application that counts the average length of the words in the input files.
wordMedian | An application that counts the median length of the words in the input files.
wordStandardDeviation | An application that counts the standard deviation of the length of the words in the input files.
We run the normal and buggy versions of these seven applications to collect the values of the monitoring metrics
under normal and anomalous situations. We program the injected performance bugs to record their start time at
run-time. We monitor the progress of each application and mark a performance anomaly when the application does not
make any progress for processing the input data.
4 EVALUATION
In this section, we present the approach and the results of the studied research questions.
4.1 RQ1. What is the performance of our approach for predicting performance anomalies?
Motivation: To test the ability of our approach to predict performance anomalies, in this RQ, we measure the performance of our approach and compare it with existing state-of-the-art approaches.
Approach: We collect the values of the monitoring metrics under normal and anomalous situations from the studied applications, as described in Section 3. We build our LSTM neural network using Keras [33]. Table 3 lists the parameters that are used to build and train our LSTM neural network. The detailed descriptions of the parameters can be found in the Keras documentation [33]. We train our LSTM neural network using the values of the monitoring metrics under normal behaviors, as introduced in Section 2.3. In the run-time monitoring phase, our LSTM neural network predicts the next values of the monitoring metrics by analyzing the current and past values of the monitoring metrics. In particular, our approach predicts the values for three monitoring metrics (i.e., CPUUsage, OldGenUsage, and NumThreads). Finally, we compare the predicted values with the actual observed values and raise anomaly warnings if the differences are larger than the predefined thresholds.
Measuring the performance of our approach. We use precision and recall to measure the performance of our approach for predicting performance anomalies. As shown in Figure 10, we consider that our approach achieves a true positive prediction if our approach raises an anomaly warning after the start of the performance bug and before the occurrence of the performance anomaly. If our approach raises a warning, but there is no performance bug happening, we consider that our approach makes a false positive prediction. If our approach does not raise any warning, but the performance anomaly actually occurs, we consider that our approach fails to predict the performance anomaly.
Table 3. The parameters that are used to build and train our LSTM neural network.
Parameter Value
Batch size 8
Number of input features 12
Number of steps 4
Number of hidden cells in LSTM layers 8
Number of epochs 20
Optimization function Adam [34]
Loss function Mean square error [67]
Equations 1 and 2 show the computation for precision and recall. Precision measures the percentage of true positive predictions among all the predictions made by our approach. Recall represents the percentage of the performance anomalies that can be predicted by our approach among all the anomalies in our experiment datasets.

Precision = Num_tp / (Num_tp + Num_fp)    (1)

Recall = Num_tp / (Num_tp + Num_fn)    (2)
Where Num_tp and Num_fp are the numbers of true positive (TP) and false positive (FP) predictions that our approach makes. In addition, Num_fn is the number of false negatives (FN), i.e., the number of performance anomalies that our approach fails to predict.
[Figure 10 shows a timeline: a performance bug starts occurring, our approach raises an anomaly warning, and then the performance anomaly occurs.]
Fig. 10. An example of our approach making a true positive prediction.
Identifying the optimal thresholds for raising performance anomaly predictions. We use the values of the monitoring metrics under normal situations as our training datasets. We split the data for the values of the monitoring metrics under anomalous situations into the validation and testing datasets, as shown in Tables 4 and 5. There are more buggy versions of the studied applications than normal versions. Therefore, there are more data in the validation and testing datasets than in the training datasets. As mentioned in Section 3, we simulate different workloads to collect data. Thus, the training and testing data shown in Tables 4 and 5 are collected under different workloads, which enables us to measure the performance of our approach under workloads that differ from those used in the training phase.
We record the time taken to train the LSTM neural networks on our experiment machine with an 8-core Intel i7-4790 3.60GHz CPU and 32GB memory. It takes 276 seconds (i.e., 4.6 minutes) and 240 seconds (i.e., 4 minutes) to train the LSTM neural networks for Elasticsearch and the Hadoop applications, respectively. We set up three thresholds for the values of the three monitoring metrics predicted by our LSTM neural network (i.e., CPUUsage, OldGenUsage, and NumThreads). Similar to the existing studies [7], we choose the thresholds by conducting a sensitivity analysis to find the values that yield the maximum F-score on the validation datasets. The F-score is a combined metric of precision and recall. Equation 3 shows the computation for the F-score.

F-score = (2 * Precision * Recall) / (Precision + Recall)    (3)
We choose the thresholds using the validation dataset. Then, the selected thresholds are used on the testing datasets
to predict performance anomalies and evaluate the performance of our approach.
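A minimal sketch of this sensitivity analysis, assuming per-sample deviation scores and pre-failure labels on the validation set, is shown below; counting every flagged sample as a prediction is a simplification of the warning-level evaluation described above.

```python
import numpy as np

def f_score(precision, recall):
    """Harmonic mean of precision and recall (Equation 3)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pick_threshold(deviations, labels, candidates):
    """Grid-search a deviation threshold that maximizes the F-score.

    deviations: per-sample |predicted - observed| values on the validation set.
    labels: 1 if the sample lies inside a pre-failure window, 0 otherwise.
    candidates: threshold values to try (e.g., percentiles of the deviations).
    """
    best_threshold, best_f = None, -1.0
    for threshold in candidates:
        warn = deviations > threshold
        tp = np.sum(warn & (labels == 1))
        fp = np.sum(warn & (labels == 0))
        fn = np.sum(~warn & (labels == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_score(precision, recall)
        if score > best_f:
            best_threshold, best_f = threshold, score
    return best_threshold, best_f
```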
Table 4. Data collected from the workload simulation of Elasticsearch.
Datasets | Description | Number of collected samples
Training | The values of the monitoring metrics that are collected by running the normal version of Elasticsearch. | 45,801
Validation | 30% of the values of the monitoring metrics that are collected by running the buggy versions of Elasticsearch. | 199,973
Testing | 70% of the values of the monitoring metrics that are collected by running the buggy versions of Elasticsearch. | 466,603
Table 5. Data collected from the workload simulation of Hadoop applications.
Datasets | Description | Number of collected samples
Training | The values of the monitoring metrics that are collected by running all normal versions of the Hadoop applications. | 36,625
Validation | 30% of the values of the monitoring metrics that are collected by running the buggy versions of the Hadoop applications. | 19,989
Testing | 70% of the values of the monitoring metrics that are collected by running the buggy versions of the Hadoop applications. | 46,641
Comparing with the existing state-of-the-art approaches. We select the Unsupervised Behavior Learning (UBL) approach proposed by Dean et al. [11] as a baseline. The UBL baseline is unsupervised and is capable of predicting performance anomalies by analyzing the values of the monitoring metrics. The UBL baseline uses a Self-Organizing Map (SOM) to capture the behaviors of applications. We build the SOM using the minisom python library [31]. For the configuration of the SOM, we use the best configuration presented by Dean et al. [11] in their experiments. We randomly initialize the weight vectors of the SOM, as done in the UBL baseline [11]. We train the SOMs on our training datasets listed in Tables 4 and 5. The UBL baseline needs a threshold to differentiate between the normal and anomalous values of the monitoring metrics. We determine the threshold by conducting the sensitivity analysis on the F-score using our validation datasets. Finally, we test the performance of the UBL baseline on our testing datasets.
In addition to the UBL baseline, we implement a simple baseline that raises performance anomaly warnings for an application by checking if the resource utilization of the application exceeds pre-defined thresholds. For example, VMware (https://www.vmware.com/), one of the top companies in providing cloud computing and virtualization software and services, suggests raising anomaly warnings once the CPU usage of a virtual machine (VM) is above 95% [63]. We set up the thresholds for the values of the three monitoring metrics, i.e., CPUUsage, OldGenUsage, and NumThreads, and predict performance anomalies by checking if any of the values of these three metrics exceeds its threshold. We determine the thresholds for the values of these three monitoring metrics by maximizing the F-score on our validation datasets listed in Tables 4 and 5. We compare the performance of our approach with the performance of the two baselines (i.e., the UBL and the simple baselines).
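For reference, the following sketch shows how a UBL-style baseline can be assembled with the minisom library: a SOM is trained on normal samples, and each new sample is scored by the distance between its winner neuron and the neighboring neurons' weight vectors. The grid size, training parameters, and the 8-connected neighborhood are illustrative assumptions rather than the configuration used by Dean et al. [11].

```python
import numpy as np
from minisom import MiniSom

def train_som(normal_data, grid=32, sigma=1.0, learning_rate=0.5,
              iterations=10000):
    """Train a Self-Organizing Map on normal samples (UBL-style baseline)."""
    som = MiniSom(grid, grid, normal_data.shape[1],
                  sigma=sigma, learning_rate=learning_rate)
    som.random_weights_init(normal_data)   # random initial weight vectors
    som.train_random(normal_data, iterations)
    return som

def neighborhood_distance(som, sample, grid=32):
    """Average distance between the winner neuron and its direct neighbors."""
    wx, wy = som.winner(sample)
    weights = som.get_weights()
    distances = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = wx + dx, wy + dy
            if (dx, dy) != (0, 0) and 0 <= nx < grid and 0 <= ny < grid:
                distances.append(np.linalg.norm(weights[wx, wy] - weights[nx, ny]))
    return float(np.mean(distances))

# A warning is raised when neighborhood_distance(...) exceeds a threshold
# chosen on the validation set, mirroring the UBL decision rule.
```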
Results: Our approach can predict various performance anomalies with 97-100% precision and 80-100% recall.
Table 6 shows the performance of our approach and the baselines in predicting anomalies for Elasticsearch. As shown in Table 6, for Elasticsearch, our approach is able to achieve high precision of 98%, 97%, and 100% in predicting the performance anomalies that are caused by the deadlock, infinite loop, and memory leak bugs, respectively. For Elasticsearch, our approach achieves the highest recall of 100% and 98% in predicting the performance anomalies that are caused by the infinite loop and deadlock bugs, respectively. Table 7 shows the performance of our approach and the baselines in predicting anomalies for the studied Hadoop applications. For the Hadoop applications, our approach achieves 100% precision and 100% recall in predicting the performance anomalies that are caused by the infinite loop and deadlock bugs.
The good performance regarding the infinite loop and deadlock bugs suggests that our approach models the behaviors of the normal CPU usage and the number of threads accurately. Conversely, we find that the lowest recalls (i.e., 80% and 87%) of our approach occur when the performance anomalies are caused by the memory leak bugs. The lower performance for predicting the anomalies caused by the memory leak bugs can be explained by the high fluctuation of the memory usage. Memory usage depends on the size of the data that are processed by the applications. Because we run the studied applications to process large datasets (e.g., 10GB of text files for the Hadoop MapReduce applications), the memory usage of the applications fluctuates. The high fluctuation of the memory usage impacts the ability of our approach to distinguish between the normal and anomalous behaviors of memory usage. Consequently, the high fluctuation reduces the recall of our approach in predicting the performance anomalies that are caused by the memory leak bugs.
Table 6. The performance of our approach and the baselines in predicting performance anomalies in the studied Elasticsearch
experiments. We highlight the results of the approaches that achieve the best performance.
Approach | Precision (Deadlock / Infinite loop / Memory leak) | Recall (Deadlock / Infinite loop / Memory leak) | F-score (Deadlock / Infinite loop / Memory leak)
Our approach | 98% / 97% / 100% | 98% / 100% / 87% | 98% / 98% / 93%
The UBL baseline | 97% / 65% / 25% | 100% / 93% / 94% | 98% / 77% / 39%
The simple baseline | 76% / 79% / 78% | 100% / 100% / 93% | 86% / 88% / 85%
Our approach consistently outperforms the UBL and the simple baselines in predicting performance anomalies in Elasticsearch and the Hadoop applications. We highlight the results of the approaches that achieve the best performance in Tables 6 and 7. Our approach consistently achieves higher precision and F-score than the baselines. For Elasticsearch, our approach achieves an average precision of 98.3% in predicting anomalies that are caused by the different types of performance bugs, while the UBL and the simple baselines achieve average precisions of 62.3% and 77.7%, respectively. For the Hadoop applications, both our approach and the UBL baseline achieve high precision and recall in predicting the performance anomalies that are caused by the infinite loop bugs.
Table 7. The performance of our approach and the baselines in predicting performance anomalies in the studied Hadoop applications
experiments. We highlight the results of the approaches that achieve the best performance.
Approach | Precision (Deadlock / Infinite loop / Memory leak) | Recall (Deadlock / Infinite loop / Memory leak) | F-score (Deadlock / Infinite loop / Memory leak)
Our approach | 100% / 100% / 100% | 100% / 100% / 80% | 100% / 100% / 89%
The UBL baseline | 41% / 85% / 70% | 100% / 100% / 100% | 58% / 92% / 82%
The simple baseline | 67% / 60% / 71% | 100% / 100% / 100% | 80% / 75% / 83%
The reason behind the high performance is that the CPU usage of the normal Hadoop applications does not experience high fluctuation during the execution of the map tasks. The map tasks execute the same operations on different chunks of data over time. The simple baseline achieves a low precision of 60% in predicting anomalies that are caused by the infinite loop bugs. One reason for the low performance is that the CPU usage threshold learned from the Hadoop validation dataset does not generalize to the CPU usage behaviors in the Hadoop testing dataset.
Anomaly prediction examples. The aforementioned results show that our approach outperforms the baselines; in particular, our approach achieves higher precision in predicting performance anomalies. To obtain insights into the high precision of our approach, we provide examples to show the cases where the baselines raise false positive predictions while our approach does not.
Figure 11 shows an example of using our approach and the UBL baseline to predict a performance anomaly that is caused by an infinite loop bug. As proposed by Dean et al. [11], for the values of the monitoring metrics collected at each second, the UBL baseline maps the values of the metrics to a winner neuron in the Self-Organizing Map and calculates the distance between the winner neuron and its neighbor neurons. The distances for the collected values of the monitoring metrics are shown in Figure 11(b). If the distance is greater than a pre-defined threshold, a performance anomaly warning is raised. As mentioned in Section 2, our approach raises performance anomaly warnings by checking the differences between the predicted values of the monitoring metrics and the real observed values of the metrics. The differences between the predicted and real observed values are shown in Figure 11(c). If the difference is greater than a threshold, our approach raises a performance anomaly warning. As introduced in Section 4.1, the thresholds for our approach and the UBL baseline are selected by maximizing the F-score on the Elasticsearch validation dataset shown in Table 4.
As shown in Figure 11, the CPU usage of Elasticsearch keeps fluctuating from the beginning because of the varying operations in the workload simulation. After an infinite loop bug starts to occur at the time mark of 1,633 seconds, the real observed CPU usage continues to increase and reaches 100%. Both our approach and the UBL baseline make a true positive anomaly prediction at the time mark of 1,682 seconds. However, the UBL baseline raises several false positive predictions before the occurrence of the infinite loop bug because of the fluctuation of the CPU usage. For example, at the time mark of 127 seconds, the real CPU usage of Elasticsearch has a large fluctuation and the UBL baseline raises a false positive prediction. Compared with the UBL baseline, no false positive predictions are raised by our approach. Our LSTM neural network is not impacted by the fluctuations, as the LSTM neural network can capture both the short fluctuations and the long-term changes.
Fig. 11. An example of predicting anomalies that are caused by an infinite loop bug using the UBL baseline and our approach.
Timestamps when our approach and the baseline approaches raise anomaly warnings are highlighted in red.
Figure 12 shows an example of using our approach and the simple baseline to predict a performance anomaly that is caused by a memory leak bug. After a memory leak bug starts happening at the time mark of 1,567 seconds, the real observed memory usage of Elasticsearch continues to increase and reaches 100% of the memory space. As shown in Figure 12, both our approach and the simple baseline are able to predict the performance anomaly around the time mark of 1,696 seconds. However, the simple baseline raises several false positive predictions before the occurrence of the memory leak bug. The memory usage threshold learned from the validation dataset is 78%. At the time marks of 576 and 1,218 seconds, the memory usage of Elasticsearch increases and exceeds the threshold. Compared with the simple baseline, our approach is more robust to the fluctuations of the memory usage and achieves a higher precision.
The reasons behind the good performance of our approach. The aforementioned results and examples show that our approach achieves good performance in predicting performance anomalies. The explanation for the good performance of our approach is that the LSTM neural networks can incorporate the time dependency in time-series data (e.g., the values of the monitoring metrics in our case). In addition, we use two stacked LSTM layers, which have been observed to capture the time dependency accurately in the existing work [7] [41] [42] [43] [59].
From the aforementioned examples, we demonstrate that the LSTM neural networks can predict anomalies more accurately than the UBL baseline. We find that the LSTM neural networks are not impacted by the fluctuations in the values of the monitoring metrics, as the neural networks can capture both the short fluctuations and the long-term changes. For example, the CPU usage of Elasticsearch has high fluctuations because of varying workloads. The high fluctuations result in the low precision of the UBL baseline in predicting anomalies (e.g., 25% precision in predicting anomalies caused by the memory leak bugs), while our approach achieves a precision of 100%.
Fig. 12. An example of predicting anomalies that are caused by a memory leak bug using the simple baseline and our approach.
Timestamps when our approach and the baseline approaches raise anomaly warnings are highlighted in red.
Summary of RQ 1
Our approach can predict performance anomalies with 97-100% precision and 80-100% recall. Our approach
outperforms the UBL and the simple baselines in predicting performance anomalies in the studied applications.
By analyzing the anomaly prediction examples of our approach and the baselines, we observe that our approach
can handle the fluctuations in the values of the monitoring metrics using the LSTM neural networks.
4.2 RQ2. How early can our approach raise a performance anomaly warning before the anomaly occurs?
Motivation: To demonstrate that our approach is capable of predicting performance anomalies in advance, in this RQ, we measure the lead time of our approach. The lead time measures the amount of time by which our approach can raise a performance anomaly warning before the anomaly occurs.
Approach. Dean et al. [11] mark the point when the performance anomalies occur by searching for service level objective (SLO) violations (e.g., the average HTTP request response time is larger than a specific value). In particular, Dean et al. identify the performance anomalies when the average HTTP request response time is greater than 100ms. We use the same threshold (i.e., 100ms) to mark the performance anomaly occurrence in our experiment. For Elasticsearch, we mark a performance anomaly if the average response time for processing the HTTP requests is greater than 100ms. For Hadoop applications, we mark a performance anomaly when the application does not make any progress in processing the input data. As shown in Figure 10, if a warning is raised at second t_p and the real performance anomaly happens at second t_p + ∆, the lead time is ∆ seconds. As shown in Section 4.1, the precision of the baselines is low, meaning that the baselines tend to raise false anomaly warnings. The lead times achieved by the baselines are not valid because many anomaly warnings are false warnings. Thus, we do not compare the lead times of our approach with the lead times of the baselines.
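As an illustration of this setup, the sketch below marks the anomaly onset from the SLO (average HTTP request response time above 100ms, as used for Elasticsearch) and derives the lead time as the gap between the warning and the onset; the function names and the per-second response-time list are illustrative assumptions rather than the authors' code.

    SLO_THRESHOLD_MS = 100  # average HTTP request response time that marks an anomaly

    def anomaly_onset_second(avg_response_times_ms):
        """Return the first second at which the SLO is violated, or None if it never is."""
        for second, avg_ms in enumerate(avg_response_times_ms):
            if avg_ms > SLO_THRESHOLD_MS:
                return second
        return None

    def lead_time_seconds(warning_second, anomaly_second):
        """Lead time: how far ahead of the anomaly the warning was raised."""
        return anomaly_second - warning_second

    # Illustrative numbers only: warning at second 1,650, anomaly observed at second 1,700
    print(lead_time_seconds(1650, 1700))  # 50 seconds of lead time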
Fig. 13. Beanplots of the lead times (in seconds) from (a) the Elasticsearch experiments and (b) the Hadoop experiments, grouped by the injected bugs (infinite loop, deadlock, and memory leak). The dotted line represents the overall median lead time.
Results: Our approach predicts performance anomalies in advance with lead times that vary from 20 to 1,403 seconds (i.e., 23.4 minutes). Figures 13(a) and 13(b) show the lead times achieved by our approach for the studied applications. For Elasticsearch, the median lead time for predicting the anomalies that are caused by the deadlock bugs is 1,372 seconds (i.e., 22.8 minutes), with the maximum lead time of 1,403 seconds (i.e., 23.4 minutes). For predicting the anomalies introduced by the infinite loop bugs, our approach achieves a median lead time of 434 seconds (i.e., 7.2 minutes). Our approach achieves a median lead time of 75 seconds for predicting the anomalies that are caused by the memory leak bugs in Elasticsearch. We believe that the short lead times for the memory leak bugs are because of the high fluctuation of the memory usage at run-time. Thus, our approach can only predict an anomaly when the application is close to the anomalous state.
For Hadoop applications, our approach achieves a median lead time of 165 seconds (i.e., 2.8 minutes) for predicting the performance anomalies that are caused by the memory leak bugs. For the performance anomalies that are caused by the deadlock bugs, our approach is able to predict them with a median lead time of 979 seconds (i.e., 16.3 minutes). Compared with the anomalies caused by the infinite loop and memory leak bugs, our approach can predict the anomalies caused by the deadlock bugs with longer lead times. The deadlock bugs continuously create threads with deadlock issues, and each thread has its own memory space. With more memory being consumed by the increasing number of threads, the studied applications become slower, and performance anomalies happen when the system runs out of memory. Our approach predicts the anomalies based on the anomalous behavior of the number of threads (i.e., the number of threads continues to increase). Our approach achieves long lead times (i.e., 979 seconds) because each thread only consumes a small amount of memory, and it takes a long time to create enough threads to affect the memory usage of the studied applications.
Our approach aims to predict performance anomalies and enable operators to prevent the predicted anomalies from happening. As mentioned in the existing study [57], local anomaly prevention actions, such as scaling VM resources, can be completed within one second, while more costly anomaly prevention actions, such as live VM migration, take 10 to 30 seconds to finish. The lead times achieved by our approach vary from 20 seconds to 1,403 seconds (i.e., 23.4 minutes); such lead times can be sufficient for the automatic prevention approaches [57] to take proper actions and prevent the anomalies from happening.
It should be noted that the achieved lead times can be impacted by the frequency of triggering the injected bugs during the workload simulation. For example, during the workload simulation of Elasticsearch, we send HTTP requests to trigger the injected bugs. The more frequently we trigger the injected bugs, the quicker the performance anomalies occur and, thus, the shorter the lead times achieved by our approach. In addition, the selected thresholds can impact the lead times as follows. Using a small threshold to check the differences between the predicted values of the monitoring metrics and the real observed values of the metrics can improve the lead times because a small deviation from the normal behaviors can trigger an anomaly warning. However, more false-positive predictions can also be raised when we apply a small threshold.
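For concreteness, the following is a minimal sketch of how such a threshold could be selected by maximizing the F-score on a labeled validation set, as described in Section 4.1; the candidate range and the use of scikit-learn's f1_score are assumptions for illustration rather than the authors' exact procedure.

    import numpy as np
    from sklearn.metrics import f1_score

    def select_threshold(validation_differences, validation_labels,
                         candidates=np.linspace(0.01, 0.5, 50)):
        """Return the threshold that maximizes the F-score on the validation data.

        validation_differences: per-second prediction errors on the validation dataset
        validation_labels:      1 if the corresponding second is anomalous, 0 otherwise
        candidates:             candidate thresholds to evaluate (illustrative range)
        """
        differences = np.asarray(validation_differences, dtype=float)
        labels = np.asarray(validation_labels, dtype=int)
        return max(candidates,
                   key=lambda t: f1_score(labels, (differences > t).astype(int)))

A larger selected threshold trades recall (and lead time) for precision, which is the tradeoff discussed above.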
Summary of RQ 2
Our approach achieves lead times that vary from 20 to 1,403 seconds (i.e., 23.4 minutes) for predicting performance anomalies in our experiments. As suggested by the prior study [57], the achieved lead times are sufficient for automatic prevention approaches to take proper actions and prevent the predicted anomalies from happening.
4.3 RQ3. Could our approach be used to predict anomalies that happen in the real world?
Motivation:
In the previous RQs, we measure the performance of our approach in predicting the performance anomalies
caused by the injected bugs. However, the injected performance bugs might be more straightforward than the real-world
performance bugs. Hence, in this RQ, we demonstrate the ability of our approach to predict the performance anomalies
that are caused by real-world performance bugs and compare the performance of our approach with the baselines.
In addition, we evaluate the run-time overhead of our anomaly prediction approach to test whether our approach is
practical to predict anomalies in real-world systems.
Approach: To obtain performance bug reports, we search for bug reports that contain keywords (e.g., memory leak, deadlock, and infinite loop) about performance bugs in the studied applications. No performance bug reports are returned by our search for the Hadoop MapReduce applications. The MapReduce applications are small (i.e., less than 1,000 lines of code) and are used as sample applications for learning purposes. Thus, it is reasonable that no performance bugs have been found in these sample applications. Therefore, we could not reproduce performance bugs in the MapReduce applications.
For each identified performance bug report, we obtain the Elasticsearch version that contains the performance bug. Next, we identify the operations that can trigger the performance bug by exploring the bug reports and analyzing the source code of the buggy version. There are performance bugs that only occur under special workloads and running environments. For example, the infinite loop bug #30962³ only happens on Windows platforms and the deadlock bug #36195⁴ requires Elasticsearch to run on a multiple-node cluster. Due to our single-node experiment environment, we could not reproduce performance bugs that need intensive datasets or require the tested application to run on large clusters (e.g., deploying Elasticsearch to several virtual machines). Because of the limited and ambiguous information in bug reports, reproducing real-world performance bugs is a challenging and time-consuming task [50]. To this end, we successfully reproduce five performance bugs as listed in Table 8. We reproduce more memory leak bugs than infinite loop and deadlock bugs because memory leak bugs usually do not require a multiple-node cluster environment to reproduce.
Table 8. Summary of the reproduced real-world performance bugs.

Bug ID    | Description                                                       | Performance bug type | Elasticsearch version | Lines of code of Elasticsearch
#6553⁵    | Out of memory issue occurs when using percolator queries.        | Memory leak          | 1.2.1                 | 381,159
#8249⁶    | Heap usage grows when using cache keys.                          | Memory leak          | 1.3.4                 | 414,081
#9377⁷    | Elasticsearch hangs when using the collate option.               | Deadlock             | 1.4.2                 | 434,125
#24108⁸   | High memory usage issue occurs when using nested queries.        | Memory leak          | 5.5.0                 | 2,988,321
#24735⁹   | Thread falls into infinite loop when processing indices queries. | Infinite loop        | 5.3.1                 | 2,549,236
To collect the values of the monitoring metrics in anomalous executions, we follow the same approach shown in Figure 5. We send HTTP requests that trigger the real-world performance bugs while running the macrobenchmarking test suites. Each Elasticsearch version has its supported test suites. In total, we apply 16 test suites to the five buggy Elasticsearch versions shown in Table 8. To collect more anomalous executions, we iterate the workload simulation five times [5]. To this end, we collect 16*5 (i.e., 80) anomalous Elasticsearch executions with performance anomalies. For the data collected from each workload simulation iteration (i.e., 16 anomalous Elasticsearch executions), we evaluate the precision, recall, and F-score of our approach and the baselines in predicting anomalies. We use the same prediction models and thresholds that are trained and validated on the datasets shown in Table 4. After five iterations, we calculate the average and the standard deviation of the precision, recall, and F-score of our approach and the baselines in predicting performance anomalies, as shown in Table 9. To evaluate the run-time overhead of our anomaly prediction approach, we measure the amount of time, CPU usage, and memory usage that our approach takes to make anomaly predictions. We conduct the experiments on our experiment environment with an 8-core Intel i7-4790 3.60GHz CPU and 32GB memory.
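The per-bug-type numbers reported in Table 9 are summarized in the usual way; as a small sketch under the assumption that each iteration yields one precision value per bug type, the mean and standard deviation over the five iterations could be computed as follows (the listed values are illustrative, not the reported results).

    import numpy as np

    # One precision value per workload-simulation iteration (illustrative values only)
    precision_per_iteration = [0.95, 1.00, 0.90, 1.00, 0.92]

    mean_precision = np.mean(precision_per_iteration)
    std_precision = np.std(precision_per_iteration)  # reported as the "±" deviation
    print(f"{mean_precision:.0%} ± {std_precision:.0%}")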
Results: Our approach achieves high precision and recall in predicting the performance anomalies that are caused by the five reproduced performance bugs in Elasticsearch. As shown in Table 9, our approach achieves an average of 100% precision and 87% recall in predicting performance anomalies that are caused by the deadlock bugs. In addition, our approach achieves an average of 100% and 95% precision in predicting performance anomalies caused by the infinite loop and memory leak bugs, respectively.
³ https://github.com/elastic/elasticsearch/issues/30962
⁴ https://github.com/elastic/elasticsearch/issues/36195
⁵ https://github.com/elastic/elasticsearch/issues/6553
⁶ https://github.com/elastic/elasticsearch/issues/8249
⁷ https://github.com/elastic/elasticsearch/issues/9377
⁸ https://github.com/elastic/elasticsearch/issues/24108
⁹ https://github.com/elastic/elasticsearch/issues/24735
Table 9. The performance of our approach and the baselines in predicting performance anomalies that are caused by the real-world performance bugs. We highlight the results of the approaches that achieve the best performance.

                      | Precision                            | Recall                               | F-score
                      | Deadlock  Infinite loop  Memory leak | Deadlock  Infinite loop  Memory leak | Deadlock  Infinite loop  Memory leak
Our approach          | 100% ±0%  100% ±0%       95% ±9%     | 87% ±9%   100% ±0%       100% ±0%    | 93% ±5%   100% ±0%       97% ±5%
The UBL baseline      | 49% ±3%   59% ±4%        71% ±15%    | 100% ±0%  100% ±0%       100% ±0%    | 66% ±2%   74% ±3%        82% ±11%
The simple baseline   | 50% ±0%   83% ±24%       68% ±19%    | 100% ±0%  100% ±0%       100% ±0%    | 67% ±0%   89% ±16%       79% ±13%
Our approach achieves higher average performance with less deviation compared with the baselines. As shown in Table 9, our approach consistently achieves a higher F-score than the baselines in predicting performance anomalies that are caused by the different real-world performance bugs. In addition, our approach achieves less deviation, which means that our approach achieves more stable performance in predicting performance anomalies than the baselines. For example, the deviation of the F-score of our approach in predicting anomalies that are caused by the memory leak bugs is 0.05, while the deviation of the UBL baseline is 0.11.
Our approach predicts the performance anomalies caused by the real-world performance bugs with lead times that vary from 2 to 846 seconds (i.e., 14.1 minutes). As shown in Figure 14, the median lead time for predicting the anomalies that are caused by the infinite loop bug is 116 seconds (i.e., 1.9 minutes). For predicting the anomalies that are caused by the memory leak bugs, our approach achieves a median lead time of 406 seconds (i.e., 6.7 minutes), with the maximum lead time of 846 seconds (i.e., 14.1 minutes). Our approach achieves a median lead time of 3 seconds for predicting the anomalies that are caused by the deadlock bug.
Fig. 14. Beanplots of the lead times (in seconds) for predicting the performance anomalies caused by the five reproduced performance bugs in Elasticsearch, grouped by bug type (infinite loop, deadlock, and memory leak). The dotted line represents the overall median lead time.
We analyze the reason behind the short lead times for predicting the anomalies caused by the deadlock bug #9377. The deadlock bug #9377 causes Elasticsearch to hang with the searching threads deadlocking each other. The searching threads in Elasticsearch are responsible for processing document searching requests. After all the searching threads deadlock each other, Elasticsearch stops processing document searching requests and puts the searching requests into the waiting queue. After the waiting queue becomes full, Elasticsearch starts throwing exceptions and rejecting further searching requests. Because Elasticsearch is used to search documents in practice, intensive document searching requests (e.g., hundreds per second) are sent to Elasticsearch during the workload simulation. In our experiments, we use the default queue size of 1,000 [16]. After the deadlock bug occurs, the waiting queue of Elasticsearch is filled with a thousand searching requests within a few seconds, and Elasticsearch starts rejecting additional searching requests. The 100% precision shown in Table 9 means that our approach is able to predict the performance anomalies caused by the deadlock bug #9377 even though the anomalies happen only a few seconds after the occurrence of the performance bug.
Fig. 15. A false positive prediction that is raised by our approach. Timestamps when our approach raises anomaly warnings are
highlighted in red.
We manually analyze the false positive predictions that are raised by our approach. Figure 15 shows a false positive prediction example when we test our approach using the data collected from the memory leak bug #24108. We run Elasticsearch under a workload that requires Elasticsearch to load a large number of documents into the memory. Thus, the real observed memory usage fluctuates from the beginning, and it is difficult to predict the fluctuating value precisely. As shown in Figure 15 (c), our trained LSTM neural network can predict the fluctuating memory usage with small differences. After a memory leak bug starts to occur at the time mark of 1,163 seconds, the real observed memory usage continues to increase and reaches 100% of the memory space. A performance anomaly is observed at the time mark of 1,968 seconds based on the increase in the average response times of the HTTP requests (as described in Section 3.1). As shown in Figure 15 (b), our approach predicts that the memory usage should stop growing at around 70%. The differences between the observed and predicted memory usages start increasing before the occurrence of the anomaly. At the time mark of 1,290 seconds, the difference exceeds the threshold (i.e., 0.075), and our approach raises a true anomaly warning. As introduced in Section 4.1, the threshold (i.e., 0.075) is selected by maximizing the F-score on the Elasticsearch validation dataset shown in Table 4. However, before the memory leak bug starts to occur, the difference between the observed and the predicted memory usages exceeds the threshold (i.e., 0.075) at the time mark of 835 seconds, and our approach raises an anomaly warning. There is no performance bug occurring at the time mark of 835 seconds and Elasticsearch is under normal execution. Hence, the anomaly warning is false, and our approach makes a false positive prediction.
Table 10. The measurements of the overhead of our approach.

Number of samples | Time taken (in seconds) | Average CPU usage | Maximum CPU usage | Average memory usage | Maximum memory usage
141,647           | 36                      | 21.8%             | 51.6%             | 2.0%                 | 2.1%
The run-time overhead of our performance anomaly prediction approach is feasible for real-world systems. As shown in Table 10, there are 141,647 samples in the monitoring data collected from the replicated performance bugs. For each sample (i.e., the values of the monitoring metrics collected at each second), our approach processes the collected values of the monitoring metrics and predicts if a performance anomaly would happen. Our approach takes 36 seconds to process the 141,647 samples and predict the performance anomalies that are caused by the real-world performance bugs. On average, our approach needs 36/141,647 seconds (i.e., 0.25 milliseconds) to predict anomalies using the values of the monitoring metrics collected at every second. The average CPU usage of our approach for predicting anomalies is 21.8% on our machine. There are only 8 cores in our CPU, which means that our approach needs an average of 8*21.8% (i.e., approximately 2) CPU cores to predict anomalies. The average memory usage of our approach is 32GB*2.0% (i.e., 320MB). More powerful CPUs and larger memory sizes than those of our experiment machine are used in industrial environments. For example, in Amazon's Elastic Compute Cloud (EC2), each node (i.e., virtual machine) can have up to 16 cores of a 2.3 GHz AWS Graviton CPU and 32GB memory [1]. In industrial environments, our approach can be assigned to CPU cores separate from those of the monitored application so that it does not impact the execution of the monitored application. Thus, our anomaly prediction approach is practical for predicting anomalies in real-world systems.
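As a back-of-the-envelope check of the figures above (assuming the measurements in Table 10 and the 8-core, 32GB experiment machine), the per-sample prediction time and the average number of CPU cores used can be derived as follows.

    samples = 141_647         # per-second metric samples processed (Table 10)
    total_seconds = 36        # wall-clock time taken to process all samples (Table 10)
    avg_cpu_fraction = 0.218  # average CPU usage on the 8-core experiment machine (Table 10)

    per_sample_ms = total_seconds / samples * 1000  # roughly 0.25 ms per prediction
    avg_cores_used = 8 * avg_cpu_fraction           # roughly 1.7, i.e., about 2 CPU cores on average
    print(per_sample_ms, avg_cores_used)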
Summary of RQ 3
Our approach achieves higher average performance with less deviation compared with the baselines in
predicting the performance anomalies caused by the five real-world performance bugs. Our approach achieves
lead times that vary from 2 to 846 seconds (i.e., 14.1 minutes). In addition, our approach takes an average of
0.25 milliseconds, 320MB memory, and 25% CPU usage to predict anomalies at every second.
5 DISCUSSION
In this section, we discuss the implementation of our approach to monitor applications in the real-world industrial
environment, the usage scenarios of our approach, and the limitations of our approach.
5.1 Implementing our approach to monitor applications in the industrial environment
Data collection. Figure 16 shows the overview of implementing our approach to monitor an application. As shown in Figure 16, the first step in implementing our anomaly prediction approach is to collect data to train the LSTM neural networks. For an application, the ideal training data should be collected by monitoring the latest stable version of the application. Human operators can manually examine the collected data to make sure that no performance anomaly happened during the data collection.
Fig. 16. The overview of implementing our approach to monitor an application: the values of the monitoring metrics are collected from a stable version of the application, the LSTM neural networks are built and trained on the collected normal metric values, the trained LSTM is deployed, and the LSTM is re-trained when needed.
Building and training LSTM neural networks. After data collection, the parameters for building and training the LSTM neural networks should be adjusted based on the collected data, as shown in Figure 16. For example, the number of input features should be the same as the number of monitoring metrics in the collected data. Once the parameters are decided, the LSTM neural networks can be trained offline using the collected data.
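As one possible instantiation, the sketch below builds a two-layer stacked LSTM in Keras (the library cited in this paper) whose input dimension equals the number of monitoring metrics and whose output is the predicted metric values for the next second; the window length, layer sizes, optimizer, and loss shown here are illustrative placeholders rather than the authors' exact configuration.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    def build_prediction_model(n_metrics, window_len=30, units=64):
        """Two stacked LSTM layers that predict the next second's metric values.

        n_metrics:  number of monitoring metrics in the collected data
        window_len: number of past seconds fed to the network (placeholder value)
        units:      hidden units per LSTM layer (placeholder value)
        """
        model = Sequential([
            LSTM(units, return_sequences=True, input_shape=(window_len, n_metrics)),
            LSTM(units),          # second stacked LSTM layer
            Dense(n_metrics),     # predicted values of the monitoring metrics
        ])
        model.compile(optimizer="adam", loss="mae")
        return model

The trained model can then be shipped to the monitoring host and fed the most recent window of metric values every second.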
Deploying and re-training LSTM neural networks. The trained LSTM neural networks can be deployed on the platform (e.g., VMs in a cloud infrastructure) that hosts the application to be monitored. After the deployment, the LSTM neural networks take the values of the monitoring metrics of the application as input and predict performance anomalies. In addition, the behaviors of the application might change because of application updates. To adapt the LSTM neural networks to the new behaviors, the LSTM neural networks should be re-trained using the data collected from the new stable versions of the application, as shown in Figure 16. However, frequent re-training can decrease the performance of the LSTM neural networks if the neural networks are trained on a non-stable version, i.e., a version that has buggy behaviors that developers have not yet recognized.
5.2 The usage scenarios of our approach
We deploy our approach to monitor the application-level metrics. In this section, we describe two scenarios of deploying
our approach.
Scenarios when an application runs on a multiple-node cluster. Applications can be hosted on a cluster with thousands of nodes (e.g., virtual machines). For monitoring an application running on a large-scale cluster, we can first collect the application-level metrics of the instance of the application (i.e., the application process) running on each node (e.g., virtual machine) in the cluster. Then, we can train the LSTM neural network using the collected application-level metric values. Next, our approach can be deployed on each node in the cluster. For every node, our approach monitors the instance of the application running on the node and predicts anomalies for that instance. By predicting anomalies for each instance of the application, our approach can help developers and operators locate the anomalous instance of the application in the cluster.
Scenarios when several applications are running in the same system. In an industrial environment, there might be several applications running on the same machine or virtual machine (VM). The co-existing applications share computation resources, such as CPU and memory. The anomalous behaviors of one application can affect the executions of the other co-existing applications. For example, a memory leak bug in one application can cause the system to run out of memory and affect the executions of the other applications in the system. For the scenarios when several applications are running in the same system, operators can deploy multiple instances of our approach. More specifically, each instance monitors one application and predicts the performance anomalies for that application. By monitoring each application, our approach can predict which application is the anomalous one. For example, suppose applications A and B are running in the same system and a memory leak is happening in application B. After deploying a separate instance of our approach for each application, our approach can predict performance anomalies in application B based on its increasing memory usage.
5.3 The limitations of our approach
Our approach captures the normal behaviors of applications using the LSTM neural networks. In our experiments,
we observe that our approach achieves good performance in predicting performance anomalies by checking whether
applications deviate from the captured normal behaviors. However, every approach has its limitations. In this section,
we summarize and discuss the limitations of our approach.
Our approach may not be able to differentiate performance anomalies from non-anomalous deviations from the normal behaviors. For example, a spontaneous increase in the number of document loading requests to Elasticsearch can suddenly increase the CPU and memory usage of Elasticsearch. In our experiments, we observe that our approach can handle the fluctuations in the monitoring metrics (e.g., CPU and memory usage). However, if a workload change results in large fluctuations in the monitoring metrics, our approach might raise a false performance anomaly warning. For example, our approach makes a false positive prediction when the memory usage is fluctuating heavily, as shown in the example in Section 4.3.
Our approach has limitations in automatically adapting to the new normal behaviors of applications. The LSTM neural networks capture only the normal behaviors that are included in the training dataset. However, the behaviors of applications could change over time because of application updates. In our approach, the LSTM neural networks cannot automatically adapt to the new behaviors of applications during the run-time monitoring phase. The LSTM neural networks need to be re-trained offline to learn the new behaviors of applications.
6 THREATS TO VALIDITY
In this section, we discuss the threats to the validity of our study.
Threats to external validity concern the generalizability of our results with respect to other project settings. To ensure the generalizability of our approach, we conduct experiments on two Java software systems. These two software systems have different architectures and belong to different domains. However, studying software systems that are programmed in other languages can be useful to augment the generalizability of our approach.
Threats to internal validity concern the uncontrolled factors that may affect the experiment results. One internal threat to our results is that the injected performance bugs are simpler than real-world performance bugs. To mitigate this threat, we test our approach on real-world performance bugs by reproducing the performance bugs of Elasticsearch. We manage to reproduce five real-world performance bugs of Elasticsearch. The obtained results show that our approach can accurately predict the performance anomalies that are caused by real-world performance bugs. However, further studies can explore reproducing more performance bugs, such as concurrent performance bugs. The time interval that is used to collect the values of the monitoring metrics of the studied systems can affect our results. Different time intervals might affect the ability of our approach to monitor software systems and predict performance anomalies. We apply a one-second time interval following the existing work [11]. However, further studies can explore setting the best time intervals for different software systems. In addition, the thresholds that are used to raise performance anomaly warnings can affect our results. There is a tradeoff between precision and recall. Using a smaller threshold can improve the recall of our approach because small deviations from the normal behaviors can raise performance anomaly predictions. However, a smaller threshold can affect the precision of our approach since the predictions raised from small deviations might be false-positive predictions. In our experiments, we determine the values of the thresholds by selecting the values that achieve the best F-score on the validation datasets.
7 RELATED WORK
In recent years, studying operational and performance anomalies has been of great interest to researchers. In this
section, we discuss the related work concerning operational anomaly detection, performance anomaly detection, and
performance anomaly prediction.
7.1 Operational Anomaly Detection
Operational anomalies are related to exceptional workflows. For example, a NameNode in the Hadoop Distributed File System (HDFS) is not updated after deleting a data block [69]. Generally speaking, operational anomaly detection approaches analyze the console logs of a software system to monitor the execution flow of the system in two steps: (1) parsing unstructured logs to produce structured log events; and (2) detecting operational anomalies using the structured log events.
Step 1: Parsing unstructured logs to produce structured log events. Logs record the run-time information (e.g., execution traces and the internal states of programs) during the execution of a software system. However, logs are usually unstructured, free-form text. Before feeding log messages into log analysis approaches, the unstructured log messages need to be parsed to produce structured log events. For example, the log printing statement printf("Start transaction %d.", id) contains a constant string, i.e., Start transaction, and a variable transaction parameter, i.e., id. Log parsing is used to identify the log event Start transaction *, where * stands for the placeholder for variables (i.e., parameter values). Various log parsing approaches have been proposed in recent years, such as [12] [18] [21] [40] [58] [62] [68] [72].
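As a toy illustration of this parsing step (and not any of the cited parsers), a regex-based template extractor could abstract the variable parts of a log line as follows.

    import re

    def to_log_event(log_line):
        """Replace variable parts (hexadecimal ids, numbers) with '*' to recover the log event."""
        template = re.sub(r"0x[0-9a-fA-F]+", "*", log_line)  # hexadecimal identifiers
        template = re.sub(r"\d+", "*", template)             # numeric parameter values
        return template

    print(to_log_event("Start transaction 4095."))  # -> "Start transaction *."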
Step 2: Detecting operational anomalies using structured log events. Various supervised and unsupervised machine learning approaches have been proposed to detect operational anomalies. For example, Yen et al. [70] propose an unsupervised clustering approach that identifies malware infections and policy violations based on application-specific features extracted from logs. Xu et al. [68] propose a PCA-based unsupervised approach to detect anomalies from the structured log events. Fu et al. [18] use a Finite State Automaton (FSA), an unsupervised machine learning technique, to model the normal workflow from system logs and detect anomalies based on the learned workflows. Lou et al. [39] group the structured log events into different groups and mine program invariants from each group to represent the system workflows. The mined invariants are then used to detect anomalies from logs. Lin et al. [37] propose a cluster-based approach that checks the occurrence of log sequences to detect anomalies. Du et al. [13] propose a Long-Short Term Memory (LSTM) neural network to model structured log events as natural language sequences and learn log patterns from normal executions to detect anomalies. In addition to unsupervised algorithms, Liang et al. [36] apply a supervised learning algorithm, i.e., Support Vector Machine (SVM), to detect operational anomalies. Chen et al. [8] use a decision tree to detect and diagnose system failures by analyzing the logs from user requests. Different from the aforementioned operational anomaly detection approaches, our approach aims to predict performance anomalies in software systems.
7.2 Performance Anomaly Detection
Performance anomalies are related to exceptional resource utilization. Researchers propose various approaches to automatically detect performance anomalies by analyzing logs and run-time metrics (e.g., resource utilization and network traffic statistics) of the monitored systems. Munawar et al. [45] employ a linear regression model to identify correlations among measurement metrics, such as the correlation between memory usage and the number of live threads. The measured correlations are then used to characterize normal behaviors and identify anomalous states. In addition to applying linear regression, researchers explore various statistical techniques, such as multi-variable regression [29], locally-weighted regression [44], auto-regressive regression [27], and Gaussian mixture models [20], to capture correlations among measurement metrics. In contrast to the above approaches, our approach can predict future anomalies instead of detecting anomalies after the occurrence.
In addition to analyzing correlations among metrics, Shen et al. [51] propose a reference-driven performance anomaly detection approach by checking how metrics differ from the ideal behavior. Powers et al. [48] explore different statistical learning methods to detect performance violations in a software system. Cohen et al. [10] extract signatures from the labeled failure data to detect recurrent problems. Later, Bodik et al. [6] improve the signature extraction approach by applying feature selection techniques. Cherkasova et al. [9] use a regression-based transaction model with application performance signatures to distinguish between performance anomalies and system changes. Jiang et al. [28] propose an approach to detect system failures based on the entropy of the clustered system metrics. Roy et al. [49] introduce an approach to find performance anomalies by checking a subset of correlated system metrics. Stewart et al. [53] apply a transaction mix model to predict the performance of a system given a certain workload. Anomalies are detected if the observed performance is different from the predicted performance. Wang et al. [64] propose an entropy-based approach to detect anomalies by analyzing the distributions of the metrics. Different from the aforementioned anomaly detection studies, our approach aims to predict performance anomalies in advance and can potentially enable operators or automatic prevention approaches to take proactive actions.
7.3 Performance Anomaly Prediction
Prediction approaches intend to predict anomalies before they occur, while detection approaches identify anomalies when the anomalies happen. Existing studies propose approaches that predict performance anomalies in virtual machines (i.e., predicting VM-level anomalies). For example, Williams et al. [66] propose a supervised approach to predict failures in a distributed system by analyzing VM-level resource usage metrics collected from every node in the distributed system. Gu et al. [19] combine Markov models and Bayesian classification models to process the VM-level resource usage metrics and predict performance anomalies. Huang et al. [24] apply recurrent neural networks to predict performance anomalies in distributed systems. In contrast to our approach, the aforementioned approaches [19] [24] [66] require labeled training datasets that represent the normal and anomalous behaviors of the studied systems. Dean et al. [11] propose an unsupervised behavior learning approach to predict the failures of virtual machines by analyzing the monitoring data collected from VMs in cloud infrastructures. Tan et al. [56] propose a context-aware anomaly prediction approach to predict the failures of virtual machines in large-scale hosting infrastructures.
Different from our work, the aforementioned approaches [11] [56] provide VM-level performance anomaly prediction. Our approach can issue application-level performance anomaly warnings that predict the anomalous applications in a virtual machine in advance. In our experiments, we observe that our approach is able to handle the fluctuations in the monitoring data using the LSTM neural networks. We find that our approach achieves more accurate results in predicting application-level performance anomalies than the approach proposed by Dean et al. [11]. By providing more accurate predictions, our approach can be more practical in industrial production environments.
8 CONCLUSION
In this paper, we present an approach that predicts performance anomalies in software systems. First, our approach uses LSTM neural networks to capture the normal behaviors of software systems. Then, our approach predicts performance anomalies at run-time by checking the early deviations from the normal behaviors that are expected by the LSTM neural networks. We conduct extensive experiments using two real-world software systems (i.e., Elasticsearch and Hadoop applications) to evaluate our approach. In addition, we demonstrate the ability of our approach to predict the performance anomalies that are caused by real-world performance bugs. The obtained results show:
• Our approach outperforms the baselines and predicts various performance anomalies with 97-100% precision and 80-100% recall.
• Our approach predicts performance anomalies in advance with lead times that vary from 20 to 1,403 seconds (i.e., 23.4 minutes).
• Our approach achieves 95-100% precision and 87-100% recall with lead times that vary from 2 to 846 seconds (i.e., 14.1 minutes) for predicting the anomalies that are caused by the five real-world performance bugs in Elasticsearch.
Overall, the results of the experiments show that our approach achieves high performance in predicting anomalies that are caused by both the injected bugs and real-world performance bugs. In the future, we plan to collect data from more real-world software systems, reproduce more real-world performance bugs (e.g., concurrent performance bugs), and compare the performance of our approach with more existing approaches.
REFERENCES
[1] Amazon. [n. d.]. Amazon EC2 Instance Types. Retrieved September 5, 2020 from https://aws.amazon.com/ec2/instance-types/
[2]
Apache. [n. d.]. Apache Hadoop RandomTextWriter application. Retrieved November 27, 2019 from https://hadoop.apache.org/docs/r1.2.1/api/org/
apache/hadoop/examples/RandomTextWriter.html
[3] Apache. [n. d.]. Apache Hadoop System. Retrieved November 27, 2019 from http://hadoop.apache.org/
[4]
Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. 2004. Using Magpie for request extraction and workload modelling.. In OSDI,
Vol. 4. 18–18.
[5]
Stefan Berner, Roland Weber, and Rudolf K Keller. 2005. Observations and lessons learned from automated testing. In Proceedings of the 27th
international conference on Software engineering. 571–579.
[6]
Peter Bodík, Moises Goldszmidt, and Armando Fox. 2008. HiLighter: Automatically Building Robust Signatures of Performance Behavior for
Small-and Large-Scale Systems.. In SysML.
[7]
Sucheta Chauhan and Lovekesh Vig. 2015. Anomaly detection in ECG time signals via deep long short-term memory networks. In 2015 IEEE
International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 1–7.
[8]
Mike Chen, Alice X Zheng, Jim Lloyd, Michael I Jordan, and Eric Brewer. 2004. Failure diagnosis using decision trees. In International Conference on
Autonomic Computing, 2004. Proceedings. IEEE, 36–43.
[9]
Ludmila Cherkasova, Kivanc Ozonat, Ningfang Mi, Julie Symons, and Evgenia Smirni. 2008. Anomaly? application change? or workload change?
towards automated detection of application performance anomaly and change. In 2008 IEEE International Conference on Dependable Systems and
Networks With FTCS and DCC (DSN). IEEE, 452–461.
[10]
Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, and Armando Fox. 2005. Capturing, indexing, clustering, and retrieving
system history. In ACM SIGOPS Operating Systems Review, Vol. 39. ACM, 105–118.
[11]
Daniel Joseph Dean, Hiep Nguyen, and Xiaohui Gu. 2012. Ubl: Unsupervised behavior learning for predicting performance anomalies in virtualized
cloud systems. In Proceedings of the 9th international conference on Autonomic computing. ACM, 191–200.
[12]
Min Du and Feifei Li. 2016. Spell: Streaming parsing of system event logs. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE,
859–864.
[13]
Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system logs through deep learning.
In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1285–1298.
[14] Elastic. [n. d.]. Rally. Retrieved December 9, 2019 from https://github.com/elastic/rally
[15] ElasticSearch. [n. d.]. Elasticsearch. Retrieved November 27, 2019 from https://www.elastic.co
[16]
ElasticSearch. [n. d.]. Elasticsearch Reference. Retrieved November 27, 2019 from https://www.elastic.co/guide/en/elasticsearch/reference/5.3/
modules-threadpool.html
[17]
Úlfar Erlingsson, Marcus Peinado, Simon Peter, Mihai Budiu, and Gloria Mainar-Ruiz. 2012. Fay: extensible distributed tracing from kernels to
clusters. ACM Transactions on Computer Systems (TOCS) 30, 4 (2012), 1–35.
[18]
Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. 2009. Execution anomaly detection in distributed systems through unstructured log analysis. In
2009 ninth IEEE international conference on data mining. IEEE, 149–158.
[19]
Xiaohui Gu and Haixun Wang. 2009. Online anomaly prediction for robust cluster systems. In 2009 IEEE 25th International Conference on Data
Engineering. IEEE, 1000–1011.
[20]
Zhen Guo, Guofei Jiang, Haifeng Chen, and Kenji Yoshihira. 2006. Tracking probabilistic correlation of monitoring data for fault detection in
complex systems. In International Conference on Dependable Systems and Networks (DSN’06). IEEE, 259–268.
[21]
Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. 2017. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE International
Conference on Web Services (ICWS). IEEE, 33–40.
[22]
Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrent neural networks. In Advances in neural information
processing systems. 190–198.
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[24]
Shaohan Huang, Carol Fung, Kui Wang, Polo Pei, Zhongzhi Luan, and Depei Qian. 2016. Using recurrent neural networks toward black-box system
anomaly prediction. In 2016 IEEE/ACM 24th International Symposium on Quality of Service (IWQoS). IEEE, 1–10.
[25] IBM. [n. d.]. IBM Javametrics. Retrieved December 2, 2019 from https://developer.ibm.com/javasdk/application-metrics-java/
[26]
Guofei Jiang, Haifeng Chen, and Kenji Yoshihira. 2006. Discovering likely invariants of distributed transaction systems for autonomic system
management. In 2006 IEEE International Conference on Autonomic Computing. IEEE, 199–208.
[27]
Guofei Jiang, Haifeng Chen, and Kenji Yoshihira. 2006. Modeling and tracking of transaction flow dynamics for fault detection in complex systems.
IEEE Transactions on Dependable and Secure Computing 3, 4 (2006), 312–326.
[28]
Miao Jiang, Mohammad A Munawar, Thomas Reidemeister, and Paul AS Ward. 2009. Automatic fault detection and diagnosis in complex software
systems by information-theoretic monitoring. In 2009 IEEE/IFIP International Conference on Dependable Systems & Networks. IEEE, 285–294.
[29]
Miao Jiang, Mohammad A Munawar, Thomas Reidemeister, and Paul AS Ward. 2009. System monitoring with metric-correlation models: problems
and solutions. In Proceedings of the 6th international conference on Autonomic computing. ACM, 13–22.
[30]
Guoliang Jin, Linhai Song, Xiaoming Shi, Joel Scherpelz, and Shan Lu. 2012. Understanding and detecting real-world performance bugs. ACM
SIGPLAN Notices 47, 6 (2012), 77–88.
[31]
JustGlowing. [n. d.]. MiniSom: a minimalistic implementation of the Self Organizing Maps. Retrieved November 27, 2019 from https://github.com/
JustGlowing/minisom
[32]
Eamonn Keogh, Jessica Lin, and Ada Fu. 2005. Hot sax: Efficiently finding the most unusual time series subsequence. In Fifth IEEE International Conference on Data Mining (ICDM'05). IEEE, 8 pp.
[33] Keras. [n. d.]. Keras: The Python Deep Learning library. Retrieved November 27, 2019 from https://keras.io/
[34] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[35]
Mayuresh Kunjir, Yuzhang Han, and Shivnath Babu. 2016. Where does Memory Go?: Study of Memory Management in JVM-based Data Analytics. (2016).
[36]
Yinglung Liang, Yanyong Zhang, Hui Xiong, and Ramendra Sahoo. 2007. Failure prediction in ibm bluegene/l event logs. In Seventh IEEE International
Conference on Data Mining (ICDM 2007). IEEE, 583–588.
[37]
Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log clustering based problem identication for online service
systems. In Proceedings of the 38th International Conference on Software Engineering Companion. ACM, 102–111.
[38]
Matteo Lissandrini, Martin Brugnara, and Yannis Velegrakis. 2018. Beyond macrobenchmarks: microbenchmark-based graph database evaluation.
Proceedings of the VLDB Endowment 12, 4 (2018), 390–403.
[39]
Jian-Guang Lou, Qiang Fu, Shengqi Yang, Ye Xu, and Jiang Li. 2010. Mining Invariants from Console Logs for System Problem Detection.. In USENIX
Annual Technical Conference. 1–14.
[40]
Adetokunbo Makanju, A Nur Zincir-Heywood, and Evangelos E Milios. 2012. A lightweight algorithm for message type extraction in system
application logs. IEEE Transactions on Knowledge and Data Engineering 24, 11 (2012), 1921–1936.
[41]
Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2016. LSTM-based encoder-decoder
for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148 (2016).
[42]
Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. 2015. Long short term memory networks for anomaly detection in time series.
In Proceedings. Presses universitaires de Louvain, 89.
[43]
Shigeru Maya, Ken Ueno, and Takeichiro Nishikawa. 2019. dLSTM: a new approach for anomaly detection using deep learning with delayed
prediction. International Journal of Data Science and Analytics (2019), 1–28.
[44]
Mohammad A Munawar and Paul AS Ward. 2007. A comparative study of pairwise regression techniques for problem determination. In Proceedings
of the 2007 conference of the center for advanced studies on Collaborative research. IBM Corp., 152–166.
[45]
Mohammad Ahmad Munawar and Paul AS Ward. 2007. Leveraging many simple statistical models to adaptively monitor software systems. In
International Symposium on Parallel and Distributed Processing and Applications. Springer, 457–470.
[46] Openjdk. [n. d.]. Openjdk documentation. Retrieved December 2, 2019 from http://openjdk.java.net/groups/hotspot/docs/HotSpotGlossary.html
[47] Oracle. [n. d.]. jconsole. Retrieved December 3, 2019 from http://openjdk.java.net/tools/svc/jconsole/
[48]
Rob Powers, Moises Goldszmidt, and Ira Cohen. 2005. Short term performance forecasting in enterprise systems. In Proceedings of the eleventh ACM
SIGKDD international conference on Knowledge discovery in data mining. ACM, 801–807.
[49]
Sudip Roy, Arnd Christian König, Igor Dvorkin, and Manish Kumar. 2015. Perfaugur: Robust diagnostics for performance anomalies in cloud
services. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 1167–1178.
[50]
Ripon K Saha, Sarfraz Khurshid, and Dewayne E Perry. 2014. An empirical study of long lived bugs. In 2014 Software Evolution Week-IEEE Conference
on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, 144–153.
[51]
Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li. 2009. Reference-driven performance anomaly identification. In ACM SIGMETRICS
Performance Evaluation Review, Vol. 37. ACM, 85–96.
[52]
solarwinds. [n. d.]. Solarwinds SAM Server & Application Monitor. Retrieved November 27, 2019 from https://www.solarwinds.com/
server-application-monitor?CMP=BIZ-TAD-PCWDLD-SAM_PP-A-PP-Q116
[53]
Christopher Stewart, Terence Kelly, and Alex Zhang. 2007. Exploiting nonstationarity for performance prediction. In ACM SIGOPS Operating Systems
Review, Vol. 41. ACM, 31–44.
[54]
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth annual conference of the
international speech communication association.
[55]
Yongmin Tan and Xiaohui Gu. 2010. On predictability of system anomalies in real world. In 2010 IEEE International Symposium on Modeling, Analysis
and Simulation of Computer and Telecommunication Systems. IEEE, 133–140.
[56] Yongmin Tan, Xiaohui Gu, and Haixun Wang. 2010. Adaptive system anomaly prediction for large-scale hosting infrastructures. In Proceedings of
the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing. ACM, 173–182.
[57]
Yongmin Tan, Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Chitra Venkatramani, and Deepak Rajan. 2012. Prepare: Predictive performance anomaly
prevention for virtualized cloud systems. In 2012 IEEE 32nd International Conference on Distributed Computing Systems. IEEE, 285–294.
[58]
Liang Tang, Tao Li, and Chang-Shing Perng. 2011. LogSig: Generating system events from raw textual logs. In Proceedings of the 20th ACM
international conference on Information and knowledge management. ACM, 785–794.
[59]
Adrian Taylor, Sylvain Leblanc, and Nathalie Japkowicz. 2016. Anomaly detection in
automobile control network data with long short-term memory networks. In 2016 IEEE International Conference on Data Science and Advanced
Analytics (DSAA). IEEE, 130–139.
[60]
Avishay Traeger, Ivan Deras, and Erez Zadok. 2008. DARC: Dynamic analysis of root causes of latency distributions. In Proceedings of the 2008 ACM
SIGMETRICS international conference on Measurement and modeling of computer systems. 277–288.
[61]
Dylan Tweney. 2013. Amazon website goes down for 40 minutes, costing the company $5 million. Retrieved November 26, 2019 from
https://venturebeat.com/2013/08/19/amazon-website-down/
[62]
Risto Vaarandi. 2003. A data clustering algorithm for mining patterns from event logs. In Proceedings of the 3rd IEEE Workshop on IP Operations &
Management (IPOM 2003)(IEEE Cat. No. 03EX764). IEEE, 119–126.
[63] VMware. [n. d.]. Virtual machine CPU usage alarm. Retrieved September 5, 2020 from https://kb.vmware.com/s/article/2057830
[64]
Chengwei Wang, Vanish Talwar, Karsten Schwan, and Parthasarathy Ranganathan. 2010. Online detection of utility cloud anomalies using metric
distributions. In 2010 IEEE Network Operations and Management Symposium-NOMS 2010. IEEE, 96–103.
[65] Jim / James C. Warner. 2013. top, Linux man page. Retrieved November 26, 2019 from https://linux.die.net/man/1/top
[66]
Andrew W Williams, Soila M Pertet, and Priya Narasimhan. 2007. Tiresias: Black-box failure prediction in distributed systems. In 2007 IEEE
International Parallel and Distributed Processing Symposium. IEEE, 1–8.
[67]
Cort J Willmott and Kenji Matsuura. 2005. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing
average model performance. Climate research 30, 1 (2005), 79–82.
[68]
Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael Jordan. 2009. Online system problem detection by mining patterns of console
logs. In 2009 Ninth IEEE International Conference on Data Mining. IEEE, 588–597.
[69]
Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I Jordan. 2009. Detecting large-scale system problems by mining console logs. In
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 117–132.
[70]
Ting-Fang Yen, Alina Oprea, Kaan Onarlioglu, Todd Leetham, William Robertson, Ari Juels, and Engin Kirda. 2013. Beehive: Large-scale log analysis
for detecting suspicious activity in enterprise networks. In Proceedings of the 29th Annual Computer Security Applications Conference. ACM, 199–208.
[71]
Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630
(2015).
[72]
Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, and Michael R Lyu. 2018. Tools and Benchmarks for Automated Log Parsing.
arXiv preprint arXiv:1811.03509 (2018).