
Energy-Based Anomaly Detection

A New Perspective for Predicting Software Failures

Cristina Monni
USI Università della Svizzera italiana
cristina.monni@usi.ch

Mauro Pezzè
USI Università della Svizzera italiana
Università degli Studi di Milano-Bicocca
mauro.pezze@usi.ch

Abstract—The ability to predict failures before they occur is a fundamental enabler for reducing field failures and improving the reliability of complex software systems. Recent

research proposes many techniques to detect anomalous values

of system metrics, and demonstrates that collective anomalies are

a good symptom of failure-prone states.

In this paper we (i) observe the analogy between complex software systems and multi-particle and network systems, (ii) propose

to use energy-based models commonly exploited in physics and

statistical mechanics to precisely reveal failure-prone behaviors

without training with seeded errors, and (iii) present some

preliminary experimental results that show the feasibility of our

approach.

Index Terms—anomaly detection, failure prediction, complex

systems

I. INTRODUCTION

Developing fault-free software is a hard if not impossible

task [1]. When faults are activated, that is, they result in

some unexpected behaviors, they produce incorrect program

states (errors), and errors may cause the program to produce

incorrect (observable) results (the program fails).

Anomaly detection approaches aim to predict failures by

revealing errors before they lead to failures, to both alert

operators and trigger protective actions that either prevent

failures or mitigate their effects [2, 3, 4, 5, 6]. Empirical

studies indicate that errors persist in the system before leading

to failures, often altering an incrementally larger set of system

metrics [6, 7]. Anomaly detectors predict failures that occur

after a sequence of errors by monitoring metrics measured

during system execution, and by flagging anomalous values.

Anomaly detectors collect several heterogeneous metrics, called Key Performance Indicators (KPIs), at many different locations that span from nodes to resources, virtual machines and applications. Anomalies are out-of-norm KPIs,

such as an unusually low number of packets sent from a virtual

machine, a sudden increase of CPU or memory utilisation, an

anomalous increase of allocated virtual machines. Empirical

studies indicate that single anomalous KPIs are rarely symp-

toms of incoming failures when considered in isolation, while

collectively anomalous KPIs [8] may or may not be symptoms

of failure-prone states depending on the conﬁguration of the

execution state [6, 7].

The main challenge for anomaly detectors is to precisely identify collective anomalies that may lead to failures, in order to predict incoming failures early with few false positives.

In this paper, we propose a new approach to predict failures

that we identiﬁed by observing relevant analogies among the

nature of complex software systems, complex physical systems

and complex networks. Starting from this observation, we

study the applicability of statistical mechanics commonly used

for detecting anomalies in physical systems and networks

to the problem of detecting anomalies in complex software

systems, and propose an energy-based anomaly detection ap-

proach to reveal collective anomalies that may lead to failures.

II. RELATED WORK

Current anomaly detectors rely either on rules that encode

failure patterns (signature-based detectors) [2] or on the anal-

ysis of data collected with black box monitoring (data-driven

detectors) [3, 4, 5, 7, 9, 10, 11]. Signature-based detectors

can be very precise, but can detect only errors that correspond

to well known patterns of anomalies, thus suffering from high

false negative rates, while data-driven approaches based on sta-

tistical analysis or unsupervised learning can detect emerging

and unexpected anomalies, but suffer from high false positive

rates [4], low precision [11], limited generalizability [5] or

high overhead [4, 10, 11]. Suitable training with seeded

faults may increase the precision of data-driven approaches,

but the improvement in precision is limited to the types of

seeded faults, and is proportional to the extension of training

sessions [12, 13]. Unfortunately, long training sessions with

seeded faults are rarely possible in large software systems.

An energy-based approach does not rely on predeﬁned

knowledge thus overcoming the limitations of signature-based

approaches, does not need training with seeded faults thus

overcoming the limitations of data-driven approaches with

seeded faults [12, 13], and has a linear complexity thus

overcoming the limitations of data-driven approaches that do

not require seeded faults but present very high complexity [4,

5, 10, 11].

III. STATISTICAL MECHANICS OF SOFTWARE FAILURES

In physics, the study of macroscopic many-particle sys-

tems relies on a statistical characterisation of the properties

of single particles, which affect the global physical quantities

such as energy or temperature. Statistical mechanics is the

branch of physics that models the global behavior of many-

particle systems with stochastic functions that depend on the

interactions among the system particles [14].

The core contribution of statistical mechanics is the obser-

vation that microscopic interactions among particles result in

a collective behavior that manifests with the appearance of

new phases of matter, such as solid, liquid, or more recently

superﬂuid and superconductive [15], with a wide variety of

practical applications. Such collective behavior manifests also

in complex software systems, where the properties of single

KPIs are related to the occurrence of system failures. Moti-

vated by the general applicability of statistical mechanics to

complex dynamical systems, we propose a statistical mechan-

ics approach to the analysis of complex software systems, for

predicting failures as a collective behavior emerging from the

anomalous KPIs. We designed our approach taking inspiration

from some relevant cases that manifest non-trivial collective

behavior: the macroscopic properties of multi-particle systems,

the growth of the World Wide Web and the spread of epidemics

through airport networks [16].

The Ising model is a statistical mechanics model that relates the unlikely configurations of spin states collectively aligned along the same direction (the behavior of the basic elements) to the magnetization of the matter (a particular behavior of the overall system). The model shows that the macroscopic

state of the system depends on the collective conﬁguration

(spin alignment) of its particles, but not on the single state

(spin) of each single particle [14].

In complex networks, scientists are interested in predicting

the dynamics of global properties of the network, such as the

growth of the World Wide Web and the spread of epidemics

on airport networks, and use statistical mechanics to study the

relationships between the states of the nodes (basic elements)

and the dynamics of the overall properties of the network [16].

Complex networks and multi-particle systems share the de-

pendencies of interesting properties of the global state of the

system (the WWW growth, or the epidemic threshold) on

the collective conﬁguration of its basic elements (the degree

distribution of the nodes), but not on the single state (the

degree of the single nodes).

In complex software systems, we are interested in predicting

the dynamics of failures by analyzing the dependencies of in-

teresting properties of the global state of the system (the pres-

ence of failure-prone errors) on the collective conﬁguration

of its basic elements (anomalies) [6, 7]. We can thus reduce the problem of predicting failures to that of revealing anomalies as a manifestation of a non-trivial collective behavior of basic

elements related to global properties of the system, similarly to

what happens in multi-particle and complex network systems.

This led us to study the applicability of the Ising model from

statistical mechanics in this new context.

An Ising approach to the analysis of the dynamics of soft-

ware failures computes a probability distribution that models

the global state of the system from the microscopic states of

the basic elements of the system (KPIs).

Statistical mechanics stochastically describes complex systems by associating a scalar energy $E_i$ and a probability

$$p_i \propto e^{-E_i} \qquad (1)$$

with each possible microscopic configuration $i$ of the system (microstate), and quantifies the states that are accessible to the system with the Gibbs free energy $G$, defined as the negative logarithm of the weighted sum over all microstates [14]

$$G = -\log \sum_i e^{-E_i} \qquad (2)$$
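As a toy illustration of Equations (1) and (2), the following sketch computes the microstate probabilities and the Gibbs free energy for a handful of hypothetical energy values (the numbers are invented for illustration, not taken from the paper):

```python
import math

def microstate_probabilities(energies):
    """Equation (1): p_i proportional to e^{-E_i}, normalized to sum to 1."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def gibbs_free_energy(energies):
    """Equation (2): G = -log sum_i e^{-E_i}."""
    return -math.log(sum(math.exp(-e) for e in energies))

# Hypothetical energies of four microstates.
energies = [0.0, 1.0, 2.0, 3.0]
probs = microstate_probabilities(energies)
G = gibbs_free_energy(energies)

# Lower-energy microstates are exponentially more likely, and each
# probability can be recovered from the free energy as p_i = e^{-(E_i - G)}.
assert probs[0] > probs[1] > probs[2] > probs[3]
assert abs(probs[0] - math.exp(-(energies[0] - G))) < 1e-12
```

The identity $p_i = e^{-(E_i - G)}$ checked at the end shows in what sense $G$ summarizes the whole distribution: a shift in $G$ signals a change in the set of states the system effectively occupies.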

The free energy includes the relevant information about

the system, and discontinuities in the free energy identify

the emergence of collective behavior. We apply a statistical

mechanics approach to software systems, by modeling the

KPI states with a parametric energy-based Ising model, and

associating each conﬁguration of the KPI states with the scalar

energy of the model. We solve the parametric energy-based

models of software systems with an optimization technique

that ﬁnds the most likely conﬁguration corresponding to some

local optima of the probability distribution describing the KPI

conﬁguration, to identify collective behaviors that may lead

to system failures. In this paper we discuss how to solve the

optimization problem by training a neural network speciﬁcally

designed as an Ising model.

IV. ENERGY-BASED ANOMALY DETECTION

In complex software systems, the interactions among KPIs

depend on many factors that cannot be measured directly,

such as the runtime interactions among applications, virtual

machines and nodes. Such interactions depend on the contingent distribution of workload and resources. This leads to parametric Ising models that cannot be computed either analytically or numerically, because the computation is intractable.

The solutions of the parametric Ising models can be efﬁciently

computed with Restricted Boltzmann Machines (RBMs) that

well approximate the intractable numerical computations [17,

18].

An RBM is a bipartite neural network composed of a visible layer, a hidden layer, and connections only across layers, with associated weights that are computed during a training phase [18]. The RBM couples the set of visible variables $\{v_i, i = 1, \dots, N\}$ to a set of hidden variables $\{h_j, j = 1, \dots, M\}$ ($M \leq N$) with weighted edges, and quantifies the interactions among variables through the energy function of the 2-dimensional Ising model [14]

$$E(\{v_i\},\{h_j\}) = \sum_i a_i v_i + \sum_{ij} v_i w_{ij} h_j + \sum_j b_j h_j \qquad (3)$$

where $a_i$, $w_{ij}$, $b_j$ are parameters that the RBM computes during a training phase: $w_{ij}$ is the weight associated with the edge $\langle v_i, h_j \rangle$, and $a_i$ and $b_j$ are bias values. The RBM solves Equation (1) by computing the joint probability $p(\{v_i\},\{h_j\})$ of observing a configuration of hidden and visible variables as a function of the energy (Equation (3))

$$p(\{v_i\},\{h_j\}) \propto e^{-E(\{v_i\},\{h_j\})} \qquad (4)$$

The training of the RBM efficiently approximates the joint probability (4) by computing the Gibbs free energy $G$ (Equation (2)) as

$$G(\{v_i\}) = -\log \sum_{\{h_j\}} e^{-E(\{v_i\},\{h_j\})} \qquad (5)$$
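For binary hidden units the sum over hidden configurations in Equation (5) factorizes into a product over the $h_j$, which is what makes the free energy cheap to evaluate. The following sketch (pure Python, with invented parameter values; it assumes binary hidden units and the sign convention of Equation (3)) checks the factorized closed form against brute-force enumeration:

```python
import itertools
import math

def energy(v, h, a, b, w):
    """Equation (3): E = sum_i a_i v_i + sum_ij v_i w_ij h_j + sum_j b_j h_j."""
    return (sum(ai * vi for ai, vi in zip(a, v))
            + sum(v[i] * w[i][j] * h[j]
                  for i in range(len(v)) for j in range(len(h)))
            + sum(bj * hj for bj, hj in zip(b, h)))

def free_energy_enumerated(v, a, b, w):
    """Equation (5) by brute force: G(v) = -log sum_{h} e^{-E(v, h)}."""
    m = len(b)
    return -math.log(sum(math.exp(-energy(v, list(h), a, b, w))
                         for h in itertools.product([0, 1], repeat=m)))

def free_energy_closed_form(v, a, b, w):
    """Factorized form for binary h_j:
    G(v) = sum_i a_i v_i - sum_j log(1 + e^{-(b_j + sum_i v_i w_ij)})."""
    n, m = len(a), len(b)
    g = sum(ai * vi for ai, vi in zip(a, v))
    for j in range(m):
        g -= math.log(1.0 + math.exp(-(b[j] + sum(v[i] * w[i][j]
                                                  for i in range(n)))))
    return g

# Hypothetical parameters: 3 visible units, 2 hidden units.
a = [0.1, -0.2, 0.3]
b = [0.05, -0.1]
w = [[0.4, -0.3], [0.2, 0.1], [-0.5, 0.6]]
v = [1.0, 0.0, 0.5]

assert abs(free_energy_enumerated(v, a, b, w)
           - free_energy_closed_form(v, a, b, w)) < 1e-10
```

With the closed form, evaluating $G(\{v_i\})$ costs $O(NM)$ per sample rather than $O(2^M)$, which is consistent with the linear complexity claimed for the approach.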

When modelling complex software systems, we identify

failure-prone states from the free energy of the system, by

training the RBM with KPI values observed during the normal

execution of the system. We collect KPIs as metrics available

at different levels, for instance, CPU utilization at system

level, transmitted packets at network level, and sent messages

at application level. At each timestamp $t$, we collect a set of $N$ KPIs, $KPI_1, \dots, KPI_N$, and associate each $KPI_i$ with a stochastic variable $v_i$ that we obtain by normalizing the KPI values in the interval $[0,1]$. Given an input configuration $\{KPI_i = v_i \in [0,1], i = 1, \dots, N\}$, the RBM computes the marginal distribution that best approximates the system state

given the states of single KPIs.

The training data of the RBM is a matrix of KPI values

aggregated over time, where each row contains all KPI values

collected at a speciﬁc time, and each column contains all

values of a single KPI. The training phase computes the

distribution corresponding to the most likely conﬁguration of

all KPIs in the training set.
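As an illustration, the training matrix could be assembled as follows. This is a minimal sketch: the KPI readings are invented, and min-max scaling is only one possible way to normalize each KPI to $[0,1]$, since the paper does not specify the normalization scheme:

```python
def build_training_matrix(samples):
    """Build the RBM training matrix: each row holds all KPI values
    collected at one timestamp, each column holds one KPI over time.
    Each column is min-max normalized to [0, 1] (assumed scheme)."""
    n_kpis = len(samples[0])
    cols = list(zip(*samples))  # transpose: one tuple per KPI
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(row[i] - mins[i]) / (maxs[i] - mins[i]) if maxs[i] > mins[i] else 0.0
         for i in range(n_kpis)]
        for row in samples
    ]

# Hypothetical readings: 4 timestamps x 3 KPIs
# (CPU utilization %, transmitted packets, sent messages).
raw = [
    [20.0, 1500.0, 40.0],
    [35.0, 1700.0, 55.0],
    [25.0, 1600.0, 45.0],
    [80.0, 9000.0, 200.0],
]
matrix = build_training_matrix(raw)
assert all(0.0 <= x <= 1.0 for row in matrix for x in row)
```

Per-column normalization matters here because the KPIs have incomparable scales (percentages vs. packet counts); normalizing each column independently keeps every visible unit in the same $[0,1]$ range.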

The Gibbs free energy that the RBM computes for the tuples

of KPI values collected at any time sample, precisely identiﬁes

collective anomalies with respect to the aggregations of KPI

values used in the training phase. We deﬁne the following

energy-based anomaly detection approach: (i) We compute the

free energy corresponding to the normal state of the system by

training a RBM with data from normal executions, (ii) build a

baseline model corresponding to the range of values of the free

energy of the normal state over a time window, (iii) compare

the baseline model with values of the free energies computed

during system execution, corresponding to the current state of

the system, (iv) detect anomalies when the free energy of the

current state exceeds the baseline of normal behavior.
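The four steps can be sketched as a small pipeline. In this sketch, `train_rbm` and `free_energy` are hypothetical stand-ins for the RBM training and Equation (5), and the baseline is simply the min/max range of the free energy over the normal window:

```python
def detect_anomalies(normal_windows, live_windows, train_rbm, free_energy):
    """Steps (i)-(iv): train on normal data, build a free-energy baseline,
    then flag live windows whose free energy leaves the baseline range."""
    model = train_rbm(normal_windows)                       # (i)
    baseline = [free_energy(model, w) for w in normal_windows]
    low, high = min(baseline), max(baseline)                # (ii)
    alarms = []
    for t, w in enumerate(live_windows):                    # (iii)
        g = free_energy(model, w)
        if not (low <= g <= high):                          # (iv)
            alarms.append((t, g))
    return alarms

# Toy stand-ins: the "model" is the mean KPI vector, and the "free energy"
# is the squared distance to it (not the RBM quantities, just placeholders).
def toy_train(windows):
    n = len(windows[0])
    return [sum(w[i] for w in windows) / len(windows) for i in range(n)]

def toy_free_energy(model, w):
    return sum((wi - mi) ** 2 for wi, mi in zip(w, model))

normal = [[0.5, 0.5], [0.52, 0.48], [0.49, 0.51]]
live = [[0.5, 0.5], [0.9, 0.1]]  # the second window is anomalous
alarms = detect_anomalies(normal, live, toy_train, toy_free_energy)
assert [t for t, _ in alarms] == [1]
```

The point of the sketch is the control flow: the model and the baseline are built once from normal data only, and at runtime the detector only compares scalar free-energy values, never individual KPIs.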

In steps (i) and (ii) of the approach, we compute the

free energy of the normal state by training the RBM with

a matrix of KPI values that we obtain from monitoring the

normal execution of the system in a given time window. By

building the baseline model from KPIs collected during normal

behavior, we overcome one of the main limitations of current

approaches, which require long training sessions with seeded

faults. In steps (iii) and (iv), we consider that the free energy

computed over the current execution window is anomalous

when the value signiﬁcantly deviates from the trend line of the

free energy corresponding to normal behaviors, as computed

in the former steps for normal executions.

Both Mariani et al. [7] and Jin et al. [6] show that failures

are preceded by bursts of anomalies, while spurious anomalies

usually correspond to transient legal behaviors. Grounded on

this evidence, we consider as anomalous only free energy

values that exceed the standard deviation of the trend line

that best fits the free energy computed from normal behavior. We report as anomalous the values that deviate from the baseline trend line of $G$ by more than $3\sigma$, where $\sigma$ is the standard deviation of the Gibbs free energy $G$ and the multiplicative factor 3 is the best value that we identified with a set of experiments.
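The criterion above can be sketched as follows (pure Python; a least-squares linear fit is assumed for the trend line, which the paper does not specify, and σ here is computed from the baseline residuals around that line):

```python
def fit_trend_line(values):
    """Least-squares line g ~ slope * t + intercept over timestamps 0..n-1."""
    n = len(values)
    ts = list(range(n))
    mean_t = sum(ts) / n
    mean_g = sum(values) / n
    slope = (sum((t - mean_t) * (g - mean_g) for t, g in zip(ts, values))
             / sum((t - mean_t) ** 2 for t in ts))
    return slope, mean_g - slope * mean_t

def flag_anomalies(baseline, live, k=3.0):
    """Flag live free-energy values farther than k * sigma from the trend
    line fitted on the baseline window (live values continue the timeline)."""
    slope, intercept = fit_trend_line(baseline)
    residuals = [g - (slope * t + intercept) for t, g in enumerate(baseline)]
    sigma = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    n = len(baseline)
    return [t for t, g in enumerate(live)
            if abs(g - (slope * (n + t) + intercept)) > k * sigma]

# Hypothetical free-energy values: a flat baseline and a burst at index 1.
baseline = [10.0, 10.1, 9.9, 10.2, 10.0, 10.1, 9.95, 10.05]
live = [10.1, 14.0, 10.0]
assert flag_anomalies(baseline, live) == [1]
```

The 3σ band absorbs the spurious, transient anomalies mentioned above, while a burst that shifts the free energy well outside the band is flagged.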

V. FEASIBILITY STUDY

We evaluated the feasibility of the approach by measuring

the precision of RBM in revealing failure-prone anomalies. We

conducted the experiment on the Yahoo Webscope¹ dataset, a reference library provided by Yahoo Research that contains real data from Yahoo services. We chose the S5 A1Benchmark dataset of the Yahoo Webscope Program,

because it contains real production trafﬁc data from various

Yahoo services with anomalies marked by experts. Each times-

tamp is an integer that corresponds to an hour of aggregated

data. We created our dataset by considering values of 35

metrics, a large subset of the complete set of metrics that

we pruned by eliminating metrics with odd information about

anomalies.

We built a normal execution of the system that we use

to create the baseline model (steps (i) and (ii)) as the ﬁrst

829 timestamps with a very small percentage of anomalies

(below 0.01%), an amount representative of benign anomalies

commonly observed in failure-free executions. We considered

the data of timestamps remaining after the ﬁrst 829, and

set up two sets of experiments of 244 timestamps each, a

set of timestamps without any anomalous value (the normal

validation set) and a set of timestamps with at least 20%

of anomalous values (the anomalous test set), to resemble a

normal and a failing execution, respectively. We analyzed both

data sets with our approach, and the experiments resulted in

a false positive rate of 7% and a true positive rate of 100%.

The results are signiﬁcantly better than current approaches

that do not require fault seeding, while in line with approaches

that rely on fault seeding [4]. The linear complexity of the two-layer architecture of the RBM indicates the scalability of

the approach [17]. Our preliminary experiments conﬁrm the

low complexity of RBMs, with an average training time of 4

seconds on a 16 GB RAM laptop with 3840 NVIDIA CUDA

cores, with a number of hidden units equal to the size of the

visible units, thus equal to the number of KPIs collected from

the system. Additional recent experiments with larger datasets

conﬁrm the scalability of RBMs [19].

The encouraging results suggest that our overall hypothesis

about the Ising nature of failures in complex software systems

is worth additional investigation: Failures in complex software

systems may be predictable by revealing collective anomalies

of KPI values, which can be effectively done from the free

energy computed with a RBM in linear time without requiring

previous knowledge of errors. The energy-based approach

overcomes the main limitations of signature-based approaches

which can predict only failures of the type encoded in the signatures, data-driven approaches trained with seeded faults, which can hardly be applied in industrial-scale environments, and data-driven approaches without seeded faults, which either suffer from high false positive rates or are extremely complex and thus hardly scale to large systems.

¹Yahoo Computing System Data, https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70

We would like to emphasize that the results reported in this

paper aim to only provide some initial data about the feasibility

of the approach, and they are far from being conclusive. We

conducted a simple experiment with third party data, and

referred to the annotated anomalies to validate the approach,

that is, we trained the initial model with KPI values from

normal executions, and used data with anomalies to verify the

quality of the results of the analysis.

VI. CONCLUSION

In this paper, we discuss the Ising nature of failures of complex software systems, from the observation that failure-prone

states depend on non-trivial collective anomalous values of

system metrics, a property shared with multi-particle systems

and complex networks. We present the encouraging results of

a preliminary feasibility study that suggest the validity of our

hypothesis and support further studies in this direction. The

main contributions of this paper are (i) the analogy between

the non-trivial collective behavior in complex physical systems/complex networks and the collective nature of anomalies

and failures in complex software systems, that allows us

to argue about the (ii) possible application of energy-based

models for predicting failure-prone anomalies, (iii) a prelimi-

nary implementation of energy-based models with RBMs for

detecting failure-prone anomalies, and (iv) some early results

that indicate the feasibility of the approach.

REFERENCES

[1] L. Gazzola, L. Mariani, F. Pastore, and M. Pezzè. “An

Exploratory Study of Field Failures”. In: Proc. Interna-

tional Symposium on Software Reliability Engineering.

IEEE, 2017, pp. 67–77.

[2] B. Ozcelik and C. Yilmaz. “Seer: A Lightweight On-

line Failure Prediction Approach”. In: Transactions on

Software Engineering 42.1 (2016), pp. 26–46.

[3] T. Yongmin, N. Hiep, S. Zhiming, G. Xiaohui, V. Chitra,

and R. Deepak. “PREPARE: Predictive Performance

Anomaly Prevention for Virtualized Cloud Systems”.

In: Proc. International Conference on Distributed Com-

puting Systems. IEEE, 2012, pp. 285–294.

[4] O. Ibidunmoye, A.-R. Rezaie, and E. Elmroth.

“Adaptive Anomaly Detection in Performance Metric

Streams”. In: Transactions on Network and Service

Management 15.1 (2018), pp. 217–231.

[5] T. Ahmed, M. Coates, and A. Lakhina. “Multivariate

Online Anomaly Detection Using Kernel Recursive

Least Squares”. In: Proc. International Conference on

Computer Communications. IEEE, 2007, pp. 625–633.

[6] S. Jin, Z. Zhang, K. Chakrabarty, and X. Gu.

“Changepoint-based anomaly detection in a core router

system”. In: Proc. International Test Conference. IEEE,

2017, pp. 1–10.

[7] L. Mariani, C. Monni, M. Pezzè, O. Riganelli, and

R. Xin. “Localizing Faults in Cloud Systems”. In: Proc.

international Conference on Software Testing. IEEE,

2018, pp. 262–273.

[8] V. Chandola, A. Banerjee, and V. Kumar. “Anomaly

Detection: A Survey”. In: Computing Surveys 41.3

(2009), 15:1–15:58.

[9] D. J. Dean, H. Nguyen, and X. Gu. “UBL: Unsuper-

vised Behavior Learning for Predicting Performance

Anomalies in Virtualized Cloud Systems”. In: Proc.

International Conference on Autonomic Computing.

ACM, 2012, pp. 191–200.

[10] H. Mi, H. Wang, Y. Zhou, M. Lyu, and H. Cai. “Toward

Fine-Grained, Unsupervised, Scalable Performance Di-

agnosis for Production Cloud Computing Systems”. In:

Transactions on Parallel and Distributed Systems 24.6

(2013), pp. 1245–1255.

[11] A. Lakhina, M. Crovella, and C. Diot. “Diagnosing

Network-wide Trafﬁc Anomalies”. In: Proc. Conference

on Applications, Technologies, Architectures, and Pro-

tocols for Computer Communications. New York, NY,

USA: ACM, 2004, pp. 219–230.

[12] C. Sauvanaud, K. Lazri, M. Kaâniche, and K. Kanoun.

“Anomaly Detection and Root Cause Localization in

Virtual Network Functions”. In: Proc. International

Symposium on Software Reliability Engineering. IEEE,

2016, pp. 196–206.

[13] X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan,

and T. Jiaqi. “Ganesha: blackBox diagnosis of MapRe-

duce systems”. In: ACM SIGMETRICS Performance

Evaluation Review 37.3 (2010), pp. 8–13.

[14] D. Chandler. Introduction to Modern Statistical Me-

chanics. Oxford University Press, Sept. 1987, p. 288.

[15] D. Tilley and J. Tilley. Superﬂuidity and Superconduc-

tivity. Graduate Student Series in Physics. Taylor &

Francis, 1990.

[16] S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes.

“Critical phenomena in complex networks”. In: Reviews

of Modern Physics 80 (2008), pp. 1275–1335.

[17] M. Á. Carreira-Perpiñán and G. E. Hinton. “On Con-

trastive Divergence Learning”. In: Proc. International

Workshop on Artiﬁcial Intelligence and Statistics. The

Society for Artiﬁcial Intelligence and Statistics, 2005.

[18] G. E. Hinton and R. R. Salakhutdinov. “Reducing the

Dimensionality of Data with Neural Networks”. In:

Science 313.5786 (2006), pp. 504–507.

[19] C. Monni, M. Pezzè, and G. Prisco. “An RBM Anomaly

Detector for the Cloud”. In: Proc. International Confer-

ence on Software Testing. IEEE, 2019, to appear.