
Performance Anomaly and Change Point Detection for Large-Scale System Management (WorldS4 2021)

Authors: Dr. Igor Trubin (Capital One)

Abstract

Presented virtually at the 2021 WorldS4 Conference (https://conferences.ieee.org/conferences_events/conferences/conferencedetails/51998), London.
WorldS4 2021
Performance Anomaly and Change Point Detection
For Large-Scale System Management
Dr. Igor Trubin – igor.trubin@capitalone.com

         
     
      !  
 "#   $%  #
$% &  '    #
 #      
      '  (
 %      ) *
 '+# , 
  
# '
    (    
           
-*,
2WorldS4 2021
You can build your own weekly profile using a free tool: https://www.PERFOMALIST.com/
SETDS (Statistical Exception and Trend Detection System) - Machine Learning based Performance Anomaly Detection
Check out more in my CMG papers listed here: http://www.trub.in/2007/06/system-management-by-exception.html
SETDS - Machine Learning based Change Point Detection
Exception Values (EVs, which are essentially anomaly scores or magnitudes) are then calculated hourly or daily as the difference between the actual data and the statistical upper and/or lower control limits (UCL and/or LCL), and are kept aside for additional analysis. The EV data is used to detect past change points by solving the equation EV(t) = 0, where 't' is time and the roots are change points. This method can be used to detect the most recent trend in the historical data, from the last change point to the most recent data point. Using that subset of the historical data as a sample allows building a much more accurate trend forecast, as shown in the picture below.
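A minimal R sketch of this idea (the 'actual', 'ucl', and 'lcl' vectors are assumed inputs - hourly data and its precomputed control limits - not the original SETDS code):

# Signed exception value: how far the actual data sits outside the control limits
ev <- pmax(actual - ucl, 0) - pmax(lcl - actual, 0)   # 0 while inside the limits

# Treat the last hour with a non-zero EV as the most recent change point
nonzero           <- which(ev != 0)
last_change_point <- if (length(nonzero) > 0) max(nonzero) else 1

# Fit the recent trend only on the data after the last change point and forecast ahead
recent   <- data.frame(t = last_change_point:length(actual),
                       y = actual[last_change_point:length(actual)])
trend    <- lm(y ~ t, data = recent)
next_day <- predict(trend, newdata = data.frame(t = length(actual) + 1:24))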
Catching Anomaly and Normality in Cloud by Neural Net
and Entropy Calculation
Part 1. Catching Normality with a Neural Net
Part 2. Detecting Imbalanced AWS Objects by Using Entropy Calculation
Part 1. The Neural Network (NN) is not a new machine learning method. About 12 years ago I was involved as a Capacity Planning resource in a project to build the infrastructure (servers) to run a NN for a fraud detection application. Now NN has gotten much more attention and popularity as a part of AI, mostly because computing power has increased dramatically and correspondingly more tasks can be done using NN.
The goal of the presentation is to demystify the technique in simple terms and examples, to show what it actually is and how it could be used for Capacity and Demand management. That is done by developing R code to recognize typical workload patterns, such as OLTP, in the daily profiles of time-series performance data.
Part 2. It is a typical concern how to detect anomalies for short-lived objects or for objects with a very small number of measurements. Why? The number of those objects could be in the thousands, so it is important to separate the exceptional ones with anomalies for further investigation. Those could be servers or customers that have just started being monitored, or public cloud objects (EC2s, ASGs) that usually have a very short lifespan. The suggested approach to detect anomalous behavior of this type of object is to estimate the entropy of each object. If the entropy is low, everything should be in order and is most likely OK. If not, there is possible disorder or a mess there, and someone needs to check what is going on with the object.
The method is implemented in a cloud-based application written in R that scans all cloud Auto Scaling Groups (ASGs) every hour to detect the ones that are imbalanced in terms of the number of EC2 instances in the group. That allows separating a couple of hundred ASGs out of hundreds of thousands of them.
This entropy-based method is well known and is described in detail in the following www.Trub.in blog post:
Quantifying Imbalance in Computer Systems
PART 1.
Catching Normality with a Neural Net
Introduction to Neural Nets (Networks)
The Neural Network (NN) is not a new machine learning method, even for Capital One. About 12 years ago I was involved as a Capacity Planning resource in a project to build the infrastructure (servers) to run a NN for a fraud detection application. Now NN has gotten much more attention and popularity as a part of AI, mostly because computing power has increased dramatically and correspondingly more tasks can be done using NN.
DEFINITION: An artificial neural network is a computation-based, nonlinear empirical model inspired by biological neural networks. An ANN acts as a black box and learns to predict the value of specific output variables given sufficient input information.
Neurons are connected by synapses. A synapse simply multiplies the weight (w) by the output (x) of the previous neuron (wjxj). The neuron itself sums all of its inputs and applies some function to the sum; the most common one is the logistic function.
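As a tiny worked example (illustrative values only), a single neuron in R:

x <- c(0.2, 0.7, 0.1)          # outputs of the previous-layer neurons
w <- c(0.4, -0.3, 0.9)         # synapse weights
logistic <- function(z) 1 / (1 + exp(-z))
y <- logistic(sum(w * x))      # neuron output: weighted sum passed through the logistic function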
NN in R to recognize “OLTP” (Online Transaction Processing) pattern
We have built a Neural Network model to recognize the "OLTP" (Online Transaction Processing) type of daily workload pattern in data that mostly has non-OLTP patterns (server CPU utilization in this case). Considering that OLTP is a normal pattern, the goal is to detect "Normality" in the data!
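A minimal sketch of training such a model with the R neuralnet package (the 'profiles' data frame, its h0..h23 hourly columns, and the 0/1 'oltp' label are assumed names, not the original data):

library(neuralnet)

# One row per server-day: normalized hourly CPU utilization h0..h23 plus an 'oltp' label
f         <- as.formula(paste("oltp ~", paste(paste0("h", 0:23), collapse = " + ")))
set.seed(1)
train_idx <- sample(nrow(profiles), floor(0.8 * nrow(profiles)))
nn        <- neuralnet(f, data = profiles[train_idx, ],
                       hidden = c(5), linear.output = FALSE)   # classification

# Score the hold-out days and compare against the known labels
pred <- compute(nn, profiles[-train_idx, paste0("h", 0:23)])$net.result
table(predicted = pred > 0.5, actual = profiles$oltp[-train_idx] == 1)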
 !"#$#%&
"'
(!)"#* "
+$#%!###*,-.
7WorldS4 2021
NN in R to recognize “OLTP” (Online Transaction Processing) pattern
!#$!$"
/#$
!#'
!0!/
/0/"
!#'
!/
"
$#
#'
!1$"2
3"
4
#5
'*
!
!
#6"#!
'!)
$"7"
89.
8WorldS4 2021
NN in R to recognize "OLTP": RESULT
&$#'"$#/!$#showed two possible false negatives:
/"#*,-
"!*.
/"#*,-0"
!*
-*'!!$#%.
  -         
/!##'/#'
NN to recognize the opposite – mostly OLTP (Business Transactions)
!  #$0 " !   / :#   3#;"
#$#%%&
0 1
- linear.output = {T|F} (regression vs. classification)
- the number of output neurons and hidden layers can be more than one
Check SYSTEMS AND METHODS FOR MODELING COMPUTER
RESOURCE METRICS - our US patent #10,437,697


 2
CONCLUSION
Deeper Learning?
Deep learning means increasing the number of hidden layers. After rerunning the program with hidden = c(18,17,16,15,13,12,5) and linear.output = F (7 hidden layers), the accuracy of this case is much better and is finally acceptable:
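Continuing the earlier sketch (same assumed 'f', 'profiles', and 'train_idx' objects), the deeper rerun would look like:

# Seven hidden layers, classification output
nn_deep <- neuralnet(f, data = profiles[train_idx, ],
                     hidden = c(18, 17, 16, 15, 13, 12, 5),
                     linear.output = FALSE)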
Applying the approach to daily profiles gives the ability to classify workloads and recognize the known patterns. That could be used to find good and different kinds of profiles, or defective ones.
Interestingly, with the shallower network the output (test result) was not really acceptable, having too many false negatives. How to improve? That is what motivated the deeper network above.
PART 2.
Detecting Imbalanced AWS Objects by Using Entropy Calculation
Problem statement: how to detect anomalies for short-lived objects or for objects with a very small number of measurements?
Solution: compare them with each other. It is important to separate the exceptional ones with anomalies for further investigation.
Objects could be:
- servers or customers that have just started being monitored, or
- public cloud objects (EC2s, ASGs) that usually have a very short lifespan.
The suggested approach is to estimate the entropy of the object. The object is normal if the entropy is low. If not, there is possible disorder there, and someone needs to check what is going on with the object.
“Entropy is a notion used to describe the amount of disorder (or randomness) in
a system. The basic principle is this: high disorder equals high entropy, while
order equals low entropy”, SO
Quantifying Imbalance in Computer Systems
This entropy-based method is well known and is described in detail in the www.Trub.in blog post:
Quantifying Imbalance in Computer Systems
http://www.trub.in/2012/01/quantifying-imbalance-in-computer.html
The following formula is used to estimate how imbalanced the object is:
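The formula itself appears as an image in the slide; a plausible minimal form, assuming the normalized Shannon entropy described in the blog post (an assumption, not a quote of the original formula), is

I = \frac{H}{H_{\max}} = \frac{-\sum_{i=1}^{n} p_i \ln p_i}{\ln n}

where p_i is the observed fraction of measurements in which the object is in state i (e.g., running a particular number of EC2 instances) and n > 1 is the number of distinct observed states (I is taken as 0 for a constant series).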
The resulting imbalance value varies from 0 to 1: "0" means the object is completely balanced (e.g., for an ASG it is just a constant number of EC2s) and "1" means complete random disorder.
Applying the method to public cloud metrics via SETDS
- In SETDS the entropy is calculated across all hourly data points (which could be a couple of days or 3-4 weeks - the whole history of the particular ASG's life).
It does NOT detect anomalies during the object's life, but it detects the objects that are anomalous (unusual) among a large number of other similar ones.
Applying the method to public cloud metrics via SETDS – RESULT
- Below are the R code lines that do that work (highlighted in red in the slide). Note that the "family" there is the ASV and the "object" is the ASG.
- In the R code, the entropy is first calculated for each hour (weekhourentrophy), and then at the end of the loop (within the SELECT statement) the imbalance is calculated for the entire data frame ("Actual").
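A minimal sketch of that computation (the slide's actual code, including the weekhourentrophy variable and the SELECT statement, is an image; the 'asg' data frame and its 'instances' column are assumed names):

# 'asg': one row per hour of the ASG's life, 'instances' = number of EC2s in the group
shannon_entropy <- function(x) {
  p <- table(x) / length(x)          # empirical distribution of instance counts
  -sum(p * log(p))
}

H     <- shannon_entropy(asg$instances)
H_max <- log(length(unique(asg$instances)))     # maximum possible entropy
imbalance <- if (H_max > 0) H / H_max else 0    # 0 = constant (balanced), 1 = disorder

# Flag the ASG for review if it looks unusually imbalanced (threshold is illustrative)
if (imbalance > 0.05) message("Imbalanced ASG - worth a look")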
Applying the method to public cloud metrics via SETDS - Visual Validation by Control Charts Proves Findings
Control charts for ASG 213, ASG 312, and ASG 321: the chart with Imbalance = 0.079 shows a real mess, while the charts with Imbalance = 0.001, 0.002, and 0.008 show just scaling-out events.
The Model Factory – Business Driven
Massive Predictions
Based on Capital One Patent 10,437,697
SYSTEMS AND METHODS FOR MODELING COMPUTER RESOURCE METRICS
Modeling Factory Concept (Patent Application ABSTRACT)
Inputs: Business Driver Info, Component Mapping, Utilization Data
“This disclosure relates generally to system modeling, and more particularly to systems and methods for modeling computer resource metrics. In one embodiment, a processor-implemented computer resource metric modeling method is disclosed. The method may include detecting one or more statistical trends in aggregated interaction data for one or more interaction types, and mapping each interaction type to one or more devices facilitating the transactions. The method may further include generating one or more linear regression models of a relationship between device utilization and interaction volume, and calculating one or more diagnostic statistics for the one or more linear regression models. A subset of the linear regression models may be filtered out based on the one or more diagnostic statistics. One or more forecasts may be generated using the remaining linear regression models, using which a report may be generated and provided.”
Model Factory Output: Models, Forecasts
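A minimal sketch of that flow in R (illustrative only; the patent text does not prescribe this code, and the 'md' data frame, its columns, and the 0.75 R-squared cutoff are assumptions):

# Regress device utilization on interaction (business driver) volume per device,
# keep only models that pass a diagnostic check, then forecast.
models <- lapply(split(md, md$device), function(d) lm(cpu_util ~ volume, data = d))

# Diagnostic statistic: keep only models with a reasonable fit
good <- Filter(function(m) summary(m)$r.squared > 0.75, models)

# Forecast utilization at a projected business-driver volume for each kept model
future    <- data.frame(volume = 1.3 * max(md$volume))   # e.g. +30% demand growth
forecasts <- sapply(good, predict, newdata = future)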
Scalability Challenge: Multivariate Adaptive Statistical Filtering (MASF) Profiles vs. Raw Data
Raw data – noisy and massive
MASF weekly profiles are data cubes* with the following two dimensions: a 30-week (~6 months' worth) historical baseline and 168 (24*7) week hours.
R script to build MASF profiles against raw hourly-stamped time-series data:
http://itrubin.blogspot.com/2012/03/r-script-to-aggregate-etl-to-mysql.html
MASF profiles - clean and short
In good models, the business driver MASF profile should be consistent with the server's:
*There are two time dimensions in the MASF profiles: the hour of the week, and the thirty data points for each of those hours. The third dimension is system utilization.
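A minimal sketch of building such a profile in R (the 'perf' data frame with 'timestamp' and 'cpu_util' columns is an assumed layout; the linked blog post has the author's full ETL script):

# Map each hourly sample to its week hour (0..167: Monday 00:00 = 0)
perf$weekhour <- (as.integer(format(perf$timestamp, "%u")) - 1) * 24 +
                  as.integer(format(perf$timestamp, "%H"))

# Aggregate ~30 weeks of raw data into a 168-row weekly profile with control limits
profile <- aggregate(cpu_util ~ weekhour, data = perf,
                     FUN = function(v) c(mean = mean(v),
                                         ucl  = mean(v) + 3 * sd(v),
                                         lcl  = max(mean(v) - 3 * sd(v), 0)))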
Creating Capacity Models – Mobile App Example
Independent variable: the business driver MASF profile.
Dependent variable: the capacity resource (CPU) MASF profile.
Linear regression produces the business-driver-based capacity usage forecast.
There are multiple thresholds calculated depending
on redundancy in server configuration:
N-1 – assumes one server is down/passive.
N/2 – assumes half the environment is down/passive.
The steps: relate the MASF profiles, model the variables, and express capacity in business terms. Extrapolate the line to determine at what transaction volume the capacity threshold is reached (the chart shows actuals and forecasts).
* Based on SAS PROC REG
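A minimal sketch of the same model in R (the slides use SAS PROC REG; the 'wk' data frame, the server count, and the threshold arithmetic below are assumptions):

# One row per week-hour: 'driver' = business transactions, 'cpu' = CPU utilization (%)
fit <- lm(cpu ~ driver, data = wk)

# Capacity thresholds depending on redundancy (illustrative values)
n_servers    <- 4
threshold_n1 <- 100 * (n_servers - 1) / n_servers   # N-1: one server down/passive
threshold_n2 <- 100 / 2                              # N/2: half the environment down

# Extrapolate the fitted line: at what transaction volume is the threshold reached?
b            <- coef(fit)
volume_at_n1 <- (threshold_n1 - b[1]) / b[2]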
The End
Questions? Please send them to Igor@Trub.in and I will respond!
Thank you!