Energy-Based Anomaly Detection
A New Perspective for Predicting Software Failures
Cristina Monni
USI Università della Svizzera Italiana
cristina.monni@usi.ch
Mauro Pezzè
USI Università della Svizzera Italiana
Università degli Studi di Milano-Bicocca
mauro.pezze@usi.ch
Abstract—The ability to predict failures before they occur is a fundamental enabler for reducing field failures and improving the reliability of complex software systems. Recent research proposes many techniques to detect anomalous values of system metrics, and demonstrates that collective anomalies are a good symptom of failure-prone states.
In this paper we (i) observe the analogy between complex software systems and multi-particle and network systems, (ii) propose to use energy-based models commonly exploited in physics and statistical mechanics to precisely reveal failure-prone behaviors without training with seeded errors, and (iii) present some preliminary experimental results that show the feasibility of our approach.
Index Terms—anomaly detection, failure prediction, complex
systems
I. INTRODUCTION
Developing fault-free software is a hard if not impossible task [1]. When faults are activated, that is, when they result in some unexpected behaviors, they produce incorrect program states (errors), and errors may cause the program to produce incorrect (observable) results (the program fails).
Anomaly detection approaches aim to predict failures by
revealing errors before they lead to failures, to both alert
operators and trigger protective actions that either prevent
failures or mitigate their effects [2, 3, 4, 5, 6]. Empirical
studies indicate that errors persist in the system before leading
to failures, often altering an incrementally larger set of system
metrics [6, 7]. Anomaly detectors predict failures that occur after a sequence of errors by monitoring metrics measured during system execution, and by triggering alerts on anomalous values.
Anomaly detectors collect several heterogeneous metrics at many different locations that span from nodes to resources, virtual machines and applications; these metrics are called Key Performance Indicators (KPIs). Anomalies are out-of-norm KPIs, such as an unusually low number of packets sent from a virtual machine, a sudden increase of CPU or memory utilisation, or an anomalous increase of allocated virtual machines. Empirical studies indicate that single anomalous KPIs are rarely symptoms of incoming failures when considered in isolation, while collectively anomalous KPIs [8] may or may not be symptoms of failure-prone states depending on the configuration of the execution state [6, 7].
The main challenge for anomaly detectors is to precisely identify collective anomalies that may lead to failures, so as to predict incoming failures early with few false positives.
In this paper, we propose a new approach to predict failures, which we identified by observing relevant analogies between complex software systems, complex physical systems and complex networks. Starting from this observation, we study the applicability of statistical mechanics, commonly used for detecting anomalies in physical systems and networks, to the problem of detecting anomalies in complex software systems, and propose an energy-based anomaly detection approach to reveal collective anomalies that may lead to failures.
II. RELATED WORK
Current anomaly detectors rely either on rules that encode
failure patterns (signature-based detectors) [2] or on the anal-
ysis of data collected with black box monitoring (data-driven
detectors) [3, 4, 5, 7, 9, 10, 11]. Signature-based detectors
can be very precise, but can detect only errors that correspond
to well known patterns of anomalies, thus suffering from high
false negative rates, while data-driven approaches based on sta-
tistical analysis or unsupervised learning can detect emerging
and unexpected anomalies, but suffer from high false positive
rates [4], low precision [11], limited generalizability [5] or high overhead [4, 10, 11]. Suitable training with seeded faults may increase the precision of data-driven approaches, but the improvement in precision is limited to the types of seeded faults, and is proportional to the duration of the training sessions [12, 13]. Unfortunately, long training sessions with seeded faults are rarely possible in large software systems.
An energy-based approach does not rely on predefined knowledge, thus overcoming the limitations of signature-based approaches; does not need training with seeded faults, thus overcoming the limitations of data-driven approaches with seeded faults [12, 13]; and has linear complexity, thus overcoming the limitations of data-driven approaches that do not require seeded faults but present very high complexity [4, 5, 10, 11].
III. STATISTICAL MECHANICS OF SOFTWARE FAILURES
In physics, the study of macroscopic many-particle systems relies on a statistical characterisation of the properties of single particles, which affect global physical quantities such as energy or temperature. Statistical mechanics is the branch of physics that models the global behavior of many-particle systems with stochastic functions that depend on the interactions among the system particles [14].
The core contribution of statistical mechanics is the observation that microscopic interactions among particles result in a collective behavior that manifests with the appearance of
new phases of matter, such as solid, liquid, or more recently
superfluid and superconductive [15], with a wide variety of
practical applications. Such collective behavior manifests also
in complex software systems, where the properties of single
KPIs are related to the occurrence of system failures. Moti-
vated by the general applicability of statistical mechanics to
complex dynamical systems, we propose a statistical mechan-
ics approach to the analysis of complex software systems, for
predicting failures as a collective behavior emerging from the
anomalous KPIs. We designed our approach taking inspiration
from some relevant cases that manifest non-trivial collective
behavior: the macroscopic properties of multi-particle systems,
the growth of the World Wide Web and the spread of epidemics
through airport networks [16].
The Ising model is a statistical mechanics model that relates particular configurations of spin states collectively aligned along the same direction (the behavior of the basic elements) to the magnetization of the matter (a particular behavior of the overall system). The model shows that the macroscopic state of the system depends on the collective configuration (spin alignment) of its particles, but not on the individual state (spin) of each single particle [14].
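As a concrete illustration of this preference for collectively aligned configurations, consider a minimal one-dimensional sketch of the Ising energy (the coupling constant J and the chain length are arbitrary choices for illustration; the paper refers to the richer 2-dimensional model):

```python
import numpy as np

# Minimal 1-D Ising energy: E(s) = -J * sum_i s_i * s_{i+1}, with spins +/-1.
# With J > 0, aligned neighbors lower the energy, so the fully aligned
# chain attains the minimum energy -J * (n - 1).
def ising_energy(spins, J=1.0):
    return -J * np.sum(spins[:-1] * spins[1:])

aligned = np.ones(20)                     # all spins up: collective alignment
rng = np.random.default_rng(3)
disordered = rng.choice([-1, 1], size=20) # a disordered configuration

print(ising_energy(aligned), ising_energy(disordered))
```

The aligned (magnetized) configuration has the lowest energy of any spin assignment, mirroring how the macroscopic state reflects the collective configuration rather than any single spin.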
In complex networks, scientists are interested in predicting
the dynamics of global properties of the network, such as the
growth of the World Wide Web and the spread of epidemics
on airport networks, and use statistical mechanics to study the
relationships between the states of the nodes (basic elements)
and the dynamics of the overall properties of the network [16].
Complex networks and multi-particle systems share the dependency of interesting properties of the global state of the system (the WWW growth, or the epidemic threshold) on the collective configuration of their basic elements (the degree distribution of the nodes), but not on the state of single elements (the degree of the single nodes).
In complex software systems, we are interested in predicting
the dynamics of failures by analyzing the dependencies of in-
teresting properties of the global state of the system (the pres-
ence of failure-prone errors) on the collective configuration
of its basic elements (anomalies) [6, 7]. We can thus reduce the problem of predicting failures to the problem of revealing anomalies as a manifestation of a non-trivial collective behavior of basic elements related to global properties of the system, similarly to what happens in multi-particle and complex network systems.
This led us to study the applicability of the Ising model from
statistical mechanics in this new context.
An Ising approach to the analysis of the dynamics of soft-
ware failures computes a probability distribution that models
the global state of the system from the microscopic states of
the basic elements of the system (KPIs).
Statistical mechanics stochastically describes complex systems by associating a scalar energy $E_i$ and a probability

$p_i \propto e^{-E_i}$   (1)

to each possible microscopic configuration $i$ of the system (microstate), and quantifies the states that are accessible to the system with the Gibbs free energy $G$, defined as the negative logarithm of the weighted sum over all microstates [14]

$G = -\log \sum_i e^{-E_i}$   (2)
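As a numerical illustration of Equations (1) and (2), the following sketch computes the microstate probabilities and the free energy for a hypothetical system with four microstates (the energy values are invented for illustration):

```python
import numpy as np

# Hypothetical energies E_i for four microstates of a small system.
E = np.array([0.5, 1.2, 0.8, 2.0])

# Equation (1): each microstate i gets probability p_i proportional to exp(-E_i).
weights = np.exp(-E)
p = weights / weights.sum()

# Equation (2): the Gibbs free energy is the negative log of the weighted sum.
G = -np.log(weights.sum())

print(p)  # lower-energy microstates are more probable
print(G)
```

Note that the lowest-energy microstate receives the highest probability, and that G summarizes the whole distribution in a single scalar.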
The free energy includes the relevant information about
the system, and discontinuities in the free energy identify
the emergence of collective behavior. We apply a statistical
mechanics approach to software systems, by modeling the
KPI states with a parametric energy-based Ising model, and
associating each configuration of the KPI states with the scalar
energy of the model. We solve the parametric energy-based
models of software systems with an optimization technique
that finds the most likely configuration corresponding to some
local optima of the probability distribution describing the KPI
configuration, to identify collective behaviors that may lead
to system failures. In this paper we discuss how to solve the
optimization problem by training a neural network specifically
designed as an Ising model.
IV. ENERGY-BASED ANOMALY DETECTION
In complex software systems, the interactions among KPIs
depend on many factors that cannot be measured directly,
such as the runtime interactions among applications, virtual
machines and nodes. Such interactions depend on the con-
tingent distribution of workload and resources. This leads
to parametric Ising models that cannot be computed either analytically or numerically, since the computation is intractable. The solutions of the parametric Ising models can be efficiently computed with Restricted Boltzmann Machines (RBMs), which well approximate the intractable numerical computations [17, 18].
A RBM is a bipartite neural network composed of a visible layer, a hidden layer, and connections only across layers, with associated weights that are computed during a training phase [18]. The RBM couples the set of visible variables $\{v_i, i = 1, \ldots, N\}$ to a set of hidden variables $\{h_j, j = 1, \ldots, M\}$ ($M \leq N$) with weighted edges, and quantifies the interactions among variables through the energy function of the 2-dimensional Ising model [14]

$E(\{v_i\}, \{h_j\}) = -\sum_i a_i v_i - \sum_{ij} v_i w_{ij} h_j - \sum_j b_j h_j$   (3)

where $a_i$, $w_{ij}$, $b_j$ are parameters that the RBM computes during a training phase: $w_{ij}$ is the weight associated with the edge $\langle v_i, h_j \rangle$, and $a_i$ and $b_j$ are bias values. The RBM solves Equation (1) by computing the joint probability $p(\{v_i\}, \{h_j\})$ of observing a configuration of hidden and visible variables as a function of the energy (Equation (3))

$p(\{v_i\}, \{h_j\}) \propto e^{-E(\{v_i\}, \{h_j\})}$   (4)

The training of the RBM efficiently approximates the joint probability (4) by computing the Gibbs free energy $G$ (Equation (2)) as

$G(\{v_i\}) = -\log \sum_{\{h_j\}} e^{-E(\{v_i\}, \{h_j\})}$   (5)
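For binary hidden units, the sum over hidden configurations in Equation (5) factorizes into a closed form, which is what makes the RBM computation efficient. The following sketch, with randomly initialized parameters standing in for trained ones, checks the factorized form against the brute-force sum over all hidden configurations:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 4                     # visible (KPI) and hidden units, M <= N
a = rng.normal(size=N)          # visible biases a_i
b = rng.normal(size=M)          # hidden biases b_j
W = rng.normal(size=(N, M))     # weights w_ij

def energy(v, h):
    """Equation (3): E(v, h) = -a.v - v.W.h - b.h."""
    return -(a @ v) - (v @ W @ h) - (b @ h)

def free_energy(v):
    """Equation (5) for binary hidden units: the sum over all 2^M hidden
    configurations factorizes into
    G(v) = -a.v - sum_j log(1 + exp(b_j + (v.W)_j))."""
    return -(a @ v) - np.sum(np.logaddexp(0.0, b + v @ W))

# Check the closed form against the brute-force sum over all 2^M hidden states.
v = rng.uniform(size=N)
brute = -np.log(sum(np.exp(-energy(v, np.array(h)))
                    for h in np.ndindex(*(2,) * M)))
print(np.isclose(free_energy(v), brute))  # True
```

The closed form costs O(N·M) per sample instead of the exponential O(2^M) brute-force sum, which is the linear complexity referred to later in the paper.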
When modelling complex software systems, we identify
failure-prone states from the free energy of the system, by
training the RBM with KPI values observed during the normal
execution of the system. We collect KPIs as metrics available
at different levels, for instance, CPU utilization at system
level, transmitted packets at network level, and sent messages
at application level. At each timestamp t, we collect a set
of $N$ KPIs, $KPI_1, \ldots, KPI_N$, and associate each $KPI_i$ with a stochastic variable $v_i$ that we obtain by normalizing the KPI values to the interval $[0,1]$. Given an input configuration $\{KPI_i = v_i \in [0,1],\ i = 1, \ldots, N\}$, the RBM computes the marginal distribution that best approximates the system state given the states of single KPIs.
The training data of the RBM is a matrix of KPI values
aggregated over time, where each row contains all KPI values
collected at a specific time, and each column contains all
values of a single KPI. The training phase computes the
distribution corresponding to the most likely configuration of
all KPIs in the training set.
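The construction of the training matrix can be sketched as follows; the raw KPI samples are invented for illustration (rows are timestamps, columns are single KPIs):

```python
import numpy as np

# Hypothetical raw KPI samples: rows are timestamps, columns are single KPIs
# (e.g. CPU utilization, transmitted packets, sent messages).
raw = np.array([
    [55.0, 1200.0, 30.0],
    [60.0, 1500.0, 28.0],
    [52.0,  900.0, 35.0],
    [58.0, 1400.0, 31.0],
])

# Normalize each KPI (column) to [0, 1] with min-max scaling, so each value
# can be treated as a visible variable v_i of the RBM; constant columns are
# left at zero by guarding the denominator.
lo, hi = raw.min(axis=0), raw.max(axis=0)
train = (raw - lo) / np.where(hi > lo, hi - lo, 1.0)

print(train.min(), train.max())  # 0.0 1.0
```

Each row of `train` is then one visible configuration fed to the RBM during the training phase.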
The Gibbs free energy that the RBM computes for the tuples of KPI values collected at any time sample precisely identifies collective anomalies with respect to the aggregations of KPI values used in the training phase. We define the following
energy-based anomaly detection approach: (i) We compute the
free energy corresponding to the normal state of the system by
training a RBM with data from normal executions, (ii) build a
baseline model corresponding to the range of values of the free
energy of the normal state over a time window, (iii) compare
the baseline model with values of the free energies computed
during system execution, corresponding to the current state of
the system, (iv) detect anomalies when the free energy of the
current state exceeds the baseline of normal behavior.
In steps (i) and (ii) of the approach, we compute the
free energy of the normal state by training the RBM with
a matrix of KPI values that we obtain from monitoring the
normal execution of the system in a given time window. By
building the baseline model from KPIs collected during normal
behavior, we overcome one of the main limitations of current
approaches, which require long training sessions with seeded
faults. In steps (iii) and (iv), we consider that the free energy
computed over the current execution window is anomalous
when the value significantly deviates from the trend line of the
free energy corresponding to normal behaviors, as computed
in the former steps for normal executions.
Both Mariani et al. [7] and Jin et al. [6] show that failures are preceded by bursts of anomalies, while spurious anomalies usually correspond to transient legal behaviors. Grounded in this evidence, we consider as anomalous only free energy values that exceed the standard deviation of the trend line that best fits the free energy computed from normal behavior. We report the best values corresponding to the baseline $trendline(G) \pm 3\sigma$, where $\sigma$ is the standard deviation of the Gibbs free energy $G$ and the multiplicative factor 3 is the best value that we identified with a set of experiments.
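Steps (iii) and (iv) with the trendline(G) ± 3σ rule can be sketched as follows, on synthetic free-energy values and assuming a linear trend model (the paper does not prescribe a specific trend-fitting method, so `np.polyfit` here is an illustrative choice):

```python
import numpy as np

# Hypothetical free-energy series from a normal execution: a slow linear
# drift plus noise (steps (i)-(ii) would obtain this by training the RBM
# on normal executions and evaluating G over a time window).
t = np.arange(100)
rng = np.random.default_rng(2)
G = -10.0 + 0.01 * t + rng.normal(0.0, 0.3, size=t.size)

# Fit the trend line and take sigma as the standard deviation of residuals.
slope, intercept = np.polyfit(t, G, 1)
sigma = (G - (slope * t + intercept)).std()

# Steps (iii)-(iv): a new free-energy value at time t_new is anomalous if it
# falls outside trendline(G) +/- 3*sigma.
t_new, G_new = 105, -6.0
expected = slope * t_new + intercept
is_anomalous = abs(G_new - expected) > 3 * sigma

print(is_anomalous)
```

Here the new value deviates from the extrapolated trend by far more than 3σ, so the window is flagged as a collective anomaly.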
V. FEASIBILITY STUDY
We evaluated the feasibility of the approach by measuring
the precision of RBM in revealing failure-prone anomalies. We
conducted the experiment on the Yahoo Webscope1dataset, a
reference library provided by Yahoo Research and contain-
ing real values of data from Yahoo services. We chose the
S5 A1Benchmark dataset of the Yahoo Webscope Program,
because it contains real production traffic data from various
Yahoo services with anomalies marked by experts. Each times-
tamp is an integer that corresponds to an hour of aggregated
data. We created our dataset by considering values of 35
metrics, a large subset of the complete set of metrics that
we pruned by eliminating metrics with odd information about
anomalies.
We built a normal execution of the system that we use
to create the baseline model (steps (i) and (ii)) as the first
829 timestamps with a very small percentage of anomalies
(below 0.01%), an amount representative of benign anomalies
commonly observed in failure-free executions. We considered
the data of timestamps remaining after the first 829, and
set up two sets of experiments of 244 timestamps each, a
set of timestamps without any anomalous value (the normal
validation set) and a set of timestamps with at least 20%
of anomalous values (the anomalous test set), to resemble a
normal and a failing execution, respectively. We analyzed both
data sets with our approach, and the experiments resulted in
a false positive rate of 7% and a true positive rate of 100%.
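The two reported rates are computed from per-window detector decisions against the expert labels, as in the following sketch (the label and prediction vectors are invented for illustration; 1 marks an anomalous window):

```python
import numpy as np

# Hypothetical per-window detector decisions against expert labels,
# mirroring how false/true positive rates are computed on the normal
# validation set and the anomalous test set.
labels      = np.array([0, 0, 0, 0, 0, 1, 1, 1])
predictions = np.array([0, 1, 0, 0, 0, 1, 1, 1])

fp = np.sum((predictions == 1) & (labels == 0))
tp = np.sum((predictions == 1) & (labels == 1))
fpr = fp / np.sum(labels == 0)   # false positive rate
tpr = tp / np.sum(labels == 1)   # true positive rate

print(fpr, tpr)  # 0.2 1.0
```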
The results are significantly better than those of current approaches that do not require fault seeding, and in line with those of approaches that rely on fault seeding [4]. The linear complexity of the two-layer architecture of the RBM indicates the scalability of
the approach [17]. Our preliminary experiments confirm the
low complexity of RBMs, with an average training time of 4 seconds on a 16 GB RAM laptop with 3840 NVIDIA CUDA cores, with the number of hidden units equal to the number of visible units, thus equal to the number of KPIs collected from the system. Additional recent experiments with larger datasets confirm the scalability of RBMs [19].
The encouraging results suggest that our overall hypothesis
about the Ising nature of failures in complex software systems
is worth additional investigation: Failures in complex software
systems may be predictable by revealing collective anomalies
of KPI values, which can be effectively done from the free
energy computed with a RBM in linear time without requiring
previous knowledge of errors. The energy-based approach overcomes the main limitations of signature-based approaches, which can predict only failures of the types encoded in the signatures, of data-driven approaches trained with seeded faults, which can hardly be applied in industrial-scale environments, and of data-driven approaches without seeded faults, which either suffer from high false positive rates or are extremely complex and thus hardly scale to large systems.

¹Yahoo Computing System Data, https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70
We would like to emphasize that the results reported in this paper aim only to provide some initial data about the feasibility of the approach, and are far from conclusive. We conducted a simple experiment with third-party data, and referred to the annotated anomalies to validate the approach; that is, we trained the initial model with KPI values from normal executions, and used data with anomalies to verify the quality of the results of the analysis.
VI. CONCLUSION
In this paper, we discuss the Ising nature of failures of complex software systems, from the observation that failure-prone states depend on non-trivial collective anomalous values of system metrics, a property shared with multi-particle systems and complex networks. We present the encouraging results of a preliminary feasibility study that suggest the validity of our hypothesis and support further studies in this direction. The main contributions of this paper are (i) the analogy between the non-trivial collective behavior in complex physical systems and complex networks and the collective nature of anomalies and failures in complex software systems, which allows us to argue for (ii) the possible application of energy-based models for predicting failure-prone anomalies, (iii) a preliminary implementation of energy-based models with RBMs for detecting failure-prone anomalies, and (iv) some early results that indicate the feasibility of the approach.
REFERENCES
[1] L. Gazzola, L. Mariani, F. Pastore, and M. Pezzè. “An Exploratory Study of Field Failures”. In: Proc. International Symposium on Software Reliability Engineering. IEEE, 2017, pp. 67–77.
[2] B. Ozcelik and C. Yilmaz. “Seer: A Lightweight Online Failure Prediction Approach”. In: Transactions on Software Engineering 42.1 (2016), pp. 26–46.
[3] T. Yongmin, N. Hiep, S. Zhiming, G. Xiaohui, V. Chitra, and R. Deepak. “PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems”. In: Proc. International Conference on Distributed Computing Systems. IEEE, 2012, pp. 285–294.
[4] O. Ibidunmoye, A.-R. Rezaie, and E. Elmroth. “Adaptive Anomaly Detection in Performance Metric Streams”. In: Transactions on Network and Service Management 15.1 (2018), pp. 217–231.
[5] T. Ahmed, M. Coates, and A. Lakhina. “Multivariate Online Anomaly Detection Using Kernel Recursive Least Squares”. In: Proc. International Conference on Computer Communications. IEEE, 2007, pp. 625–633.
[6] S. Jin, Z. Zhang, K. Chakrabarty, and X. Gu. “Changepoint-based Anomaly Detection in a Core Router System”. In: Proc. International Test Conference. IEEE, 2017, pp. 1–10.
[7] L. Mariani, C. Monni, M. Pezzè, O. Riganelli, and R. Xin. “Localizing Faults in Cloud Systems”. In: Proc. International Conference on Software Testing. IEEE, 2018, pp. 262–273.
[8] V. Chandola, A. Banerjee, and V. Kumar. “Anomaly Detection: A Survey”. In: Computing Surveys 41.3 (2009), 15:1–15:58.
[9] D. J. Dean, H. Nguyen, and X. Gu. “UBL: Unsupervised Behavior Learning for Predicting Performance Anomalies in Virtualized Cloud Systems”. In: Proc. International Conference on Autonomic Computing. ACM, 2012, pp. 191–200.
[10] H. Mi, H. Wang, Y. Zhou, M. Lyu, and H. Cai. “Toward Fine-Grained, Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems”. In: Transactions on Parallel and Distributed Systems 24.6 (2013), pp. 1245–1255.
[11] A. Lakhina, M. Crovella, and C. Diot. “Diagnosing Network-wide Traffic Anomalies”. In: Proc. Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. ACM, 2004, pp. 219–230.
[12] C. Sauvanaud, K. Lazri, M. Kaâniche, and K. Kanoun. “Anomaly Detection and Root Cause Localization in Virtual Network Functions”. In: Proc. International Symposium on Software Reliability Engineering. IEEE, 2016, pp. 196–206.
[13] X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan, and T. Jiaqi. “Ganesha: BlackBox Diagnosis of MapReduce Systems”. In: ACM SIGMETRICS Performance Evaluation Review 37.3 (2010), pp. 8–13.
[14] D. Chandler. Introduction to Modern Statistical Mechanics. Oxford University Press, Sept. 1987, p. 288.
[15] D. Tilley and J. Tilley. Superfluidity and Superconductivity. Graduate Student Series in Physics. Taylor & Francis, 1990.
[16] S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes. “Critical Phenomena in Complex Networks”. In: Reviews of Modern Physics 80 (2008), pp. 1275–1335.
[17] M. Á. Carreira-Perpiñán and G. E. Hinton. “On Contrastive Divergence Learning”. In: Proc. International Workshop on Artificial Intelligence and Statistics. The Society for Artificial Intelligence and Statistics, 2005.
[18] G. E. Hinton and R. R. Salakhutdinov. “Reducing the Dimensionality of Data with Neural Networks”. In: Science 313.5786 (2006), pp. 504–507.
[19] C. Monni, M. Pezzè, and G. Prisco. “An RBM Anomaly Detector for the Cloud”. In: Proc. International Conference on Software Testing. IEEE, 2019, to appear.