Scalable monitoring via threshold compression in a large operational 3G network.
Scalable Monitoring via Threshold Compression in a Large
Operational 3G Network
Suk-Bok Lee1, Dan Pei2, MohammadTaghi Hajiaghayi2,3, Ioannis Pefkianakis1, Songwu Lu1,
He Yan4, Zihui Ge2, Jennifer Yates2, Mario Kosseifi5
1UCLA Computer Science
2AT&T Labs–Research
3University of Maryland
4Colorado State University
5AT&T Network Services
ABSTRACT
Threshold-based performance monitoring in large 3G networks is
very challenging for two main reasons: large network scale and
dynamics in both the time and spatial domains. There is a
fundamental tradeoff between the size of the threshold setting and
the alarm quality. In this paper, we propose a scalable monitoring
solution, called threshold-compression, that characterizes this
tradeoff via intelligent threshold aggregation. The main insight
behind our solution is to identify groups of network elements with
similar threshold behaviors across the location and time dimensions,
thus forming spatial-temporal clusters and generating the associated
compressed thresholds within an optimization framework. Our
evaluations on a commercial 3G network demonstrate the effectiveness
of our threshold-compression solution, e.g., a threshold-setting
reduction of up to 90% with false/miss alarm rates within 10%.
Categories and Subject Descriptors: C.2.3 [Computer-Communication
Networks]: Network Operations
General Terms: Measurement, Algorithms
1. INTRODUCTION
The current practice for monitoring the health of a large-scale
network is to use pre-defined thresholds on selected key performance
indicator (KPI) metrics. However, direct application of such a
pre-computed, threshold-based alarming model does not scale in 3G
networks, for two main reasons: (1) massive data volume and large
network scale; (2) rich dynamics in both the time and spatial
domains. A single static threshold per KPI fails to capture these
spatial and temporal dynamics, leading to unacceptably poor alarm
quality with nearly 70% false positives/negatives. On the other
hand, a finer-grained location- and time-dependent threshold setting
can capture network dynamics but incurs prohibitively high system
management complexity. The number of thresholds to be maintained
grows very large with the number of network elements (NEs) and the
time granularity. For example, given that one regional area has
about 5,000 cells and 30 KPIs, the per-NE hourly threshold scheme
has as many as 5K × 24 × 30 = 3.6 million thresholds in a single
area. It is therefore increasingly difficult to monitor an
operational 3G network with a naive pre-defined threshold scheme. To
this end, we propose a scalable threshold-based solution, called
threshold-compression, which combines a small number of thresholds
with accurate capture of spatial-temporal network dynamics.
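The threshold count quoted above can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch; the function name and parameter names are hypothetical and do not come from the paper.

```python
def per_ne_hourly_threshold_count(num_cells: int, hours_per_day: int, num_kpis: int) -> int:
    """Number of thresholds when every (cell, hour, KPI) triple has its own threshold."""
    return num_cells * hours_per_day * num_kpis


# Figures from the text: about 5,000 cells, 24 hourly slots, 30 KPIs.
count = per_ne_hourly_threshold_count(num_cells=5_000, hours_per_day=24, num_kpis=30)
print(f"{count:,} thresholds in a single area")  # 3,600,000 thresholds in a single area
```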
Copyright is held by the author/owner(s).
SIGMETRICS’11, June 7–11, 2011, San Jose, California, USA.
ACM 978-1-4503-0262-3/11/06.
2. THRESHOLD COMPRESSION
We describe threshold-compression by highlighting the moti-
vation, problem formulation, and compression algorithm suite.
Case for similar threshold behavior. Our threshold compression
approach is motivated by two key observations: (1) threshold
behavior similarity among certain groups of NEs, and (2) stable or
close threshold trends over some periods of time. Figure 1 shows
example NE pairs for the downlink-throughput KPI using per-NE-hourly
thresholds. Such spatial similarity is attributed to the geographic
locations of the NEs and the user population in the corresponding
area. For example, NEs in urban (rural) areas are likely to have
similarly high (low) dynamics over time. Time-domain similarity is
also observed, as each NE is likely to have similarly high (low)
demand during peak (sleep) hours. For example, in the figure, each
NE group shows very stable threshold behavior during the peak hours
between 11:00 GMT and 22:00 GMT, which gives us an opportunity to
form a temporal-domain cluster.
[Figure 1 panels: (a) Node B 1 and Node B 2; (b) Node B 3 and Node B 4. x-axis: time in GMT; y-axis: DL throughput (normalized).]
Figure 1: Downlink-throughput KPI: similar threshold (per-NE-hourly)
behavior among different Node Bs.
Desirable properties of threshold compression. To ensure scalable
monitoring performance as well as practical threshold management,
threshold-compression should have the following properties: (1) High
compression gain: The resulting threshold setting should remain
small even with a large number of NEs; (2) Low false alarm rate: The
compressed thresholds must result in good alarm quality, i.e., a low
false positive rate (FPR) and false negative rate (FNR); we
therefore use a concept of threshold closeness, with a lower
($T^{i,j}_{lower}$) and upper ($T^{i,j}_{upper}$) bound to
approximate the per-NE-hourly threshold of NE i and hour j;
(3) Management-oriented grouping policy: The spatial-temporal
clusters must be easy to manage and update in the monitoring system.
To this end, we employ a consistent NE grouping policy where each NE
can belong to only one NE group (but there can be multiple hour
groups within an NE group), hence a two-level hierarchical
clustering structure.
Problem formulation. We formulate the threshold compression problem
taking the alarm quality as well as the required clustering policy
into account. The objective is to find the minimum number of
spatial-temporal clusters (or, equivalently, the minimum threshold
setting) from a given fine-grained threshold setting, subject to the
following constraints: (1) Each compressed threshold must be within
the permissible threshold interval between $T^{i,j}_{lower}$ and
$T^{i,j}_{upper}$; (2) NE grouping must be consistent in time;
(3) Each cluster must consist of continuous time steps (optional
rule).
It turns out that this problem is not only NP-hard (regardless of
the optional rule) but also very hard to approximate. The proof is
given in the full version of the paper [1].
Threshold compression algorithm suite. Our threshold-compression
takes a two-stage approach. We first decouple the spatial NE
grouping from the original two-dimensional clustering problem, and
then proceed with temporal-domain clustering within each identified
NE group. Our key strategy for clustering is to combine
spatial-temporal blocks if they (i) have a common intersection in
their permissible intervals, and (ii) meet the consistent NE
grouping rule. Note that having a common intersection among the
cluster members ensures satisfactory alarm quality.
1. NE grouping: greedy coloring approach. The first stage identifies
NE groups whose members show similar threshold behavior in every
hour. As the first level of the clustering hierarchy, each NE group
in fact consists of 24 hour-groups, which will be compressed further
in the next stage via time-domain clustering. The NE grouping
problem then naturally reduces to graph coloring, which asks for the
minimum number of colors (NE groups) assignable to the vertices
(NEs) such that no edge (lack of a common intersection) connects two
identically colored vertices (group members). This graph coloring
instance is NP-hard, so we employ a greedy coloring heuristic, which
works quite well in practice. Specifically, we apply the
Welsh-Powell algorithm [2], which uses at most
$\max_i \min\{d(v_i) + 1, i\}$ colors when the vertices are indexed
in non-increasing order of degree $d(v_i)$, that is, at most one
more than the maximum degree of the graph. We convert our problem
instance to a graph G(V, E), where each NE corresponds to a vertex
in G. For each vertex pair $v_i$ and $v_{i'}$, we put an edge
between them if their counterpart NEs i and i' have disjoint
threshold intervals in any hour. The vertices colored γ (by the
greedy coloring algorithm) can then be readily transformed into NE
group γ in our problem.
Once identified, each NE group γ defines its own permissible
threshold interval to reflect each member's interval. Setting the
group threshold interval to the common intersection among the
members lets the next-stage clustering procedure keep control of the
resulting alarm quality.
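The NE grouping stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, and the toy data with only two hours (instead of 24) are all assumptions made for the example. Because pairwise-overlapping intervals on a line always share a common point, NEs that end up with the same color have a common intersection in every hour.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[float, float]  # (lower, upper) permissible threshold bounds


def disjoint(a: Interval, b: Interval) -> bool:
    """True if the closed intervals a and b share no point."""
    return a[1] < b[0] or b[1] < a[0]


def build_conflict_graph(intervals: Dict[str, List[Interval]]) -> Dict[str, Set[str]]:
    """Connect two NEs if their intervals are disjoint in at least one hour."""
    nes = list(intervals)
    adj: Dict[str, Set[str]] = {ne: set() for ne in nes}
    for idx, u in enumerate(nes):
        for v in nes[idx + 1:]:
            if any(disjoint(a, b) for a, b in zip(intervals[u], intervals[v])):
                adj[u].add(v)
                adj[v].add(u)
    return adj


def welsh_powell(adj: Dict[str, Set[str]]) -> Dict[str, int]:
    """Greedy coloring that visits vertices in order of decreasing degree."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    color: Dict[str, int] = {}
    next_color = 0
    for v in order:
        if v in color:
            continue
        color[v] = next_color
        members = {v}
        # Reuse the same color for every later vertex with no conflict so far.
        for w in order:
            if w not in color and not (adj[w] & members):
                color[w] = next_color
                members.add(w)
        next_color += 1
    return color


# Hypothetical toy input: one (lower, upper) interval per hour for each NE.
toy_intervals = {
    "NE1": [(0.2, 0.5), (0.6, 0.9)],
    "NE2": [(0.3, 0.6), (0.7, 1.0)],
    "NE3": [(0.7, 0.9), (0.1, 0.3)],
}
ne_group = welsh_powell(build_conflict_graph(toy_intervals))
print(ne_group)  # {'NE3': 0, 'NE1': 1, 'NE2': 1}
```

In this toy input NE1 and NE2 overlap in both hours and share a group, while NE3 conflicts with both and is placed in a group of its own.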
2. Hour grouping: minimum cover selection. As the next level of the
clustering hierarchy, the time-domain clustering takes the NE
grouping result as input to perform the hour grouping for each
identified NE group. Within NE group γ, there are initially 24
hour-groups, each of which we simply refer to as an hour. Each hour
j is then represented by its threshold interval between
$\Phi^{\gamma,j}_{lower}$ and $\Phi^{\gamma,j}_{upper}$ (i.e., the
common intersection among all members at hour j) as a result of NE
grouping. Given the set of intervals, the hour grouping problem is
to find the minimum number of interval groups such that (i) each
interval belongs to one of the interval groups, and (ii) there is a
common intersection in each interval group. We use a simple greedy
algorithm that leads to an optimal solution to this problem. The
algorithm is as follows. We first sort all the interval endpoints
($\forall j \in H: \Phi^{\gamma,j}_{lower}, \Phi^{\gamma,j}_{upper}$)
in ascending order of their values. We scan the list (in ascending
order) until first encountering an upper-bound point
$\Phi^{\gamma,j'}_{upper}$. We then put all intervals containing
this point (i.e., all hours
$j: \Phi^{\gamma,j}_{lower} \le \Phi^{\gamma,j'}_{upper}$) into a
new interval group $C'_h$, and delete them from the list. We repeat
this process until there is no interval in the list. This simple
greedy rule indeed finds the minimum number of interval groups,
hence the minimum number of hour groups. The proof is given in the
full version of the paper [1].
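The greedy rule above can be sketched in a few lines. This is an illustrative version under stated assumptions: the function name and the input layout (a dictionary from hour index to a `(lower, upper)` pair) are hypothetical, the toy data uses six hours rather than 24, and the optional continuous-time rule is not enforced. Instead of scanning an explicitly sorted endpoint list, the sketch picks the smallest remaining upper bound on each pass, which is the first upper-bound point such a scan would encounter.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (lower, upper) bounds of one hour's interval


def group_hours(intervals: Dict[int, Interval]) -> List[List[int]]:
    """Partition hours into the fewest groups whose intervals share a common point."""
    remaining = dict(intervals)
    groups: List[List[int]] = []
    while remaining:
        # Smallest upper bound among the intervals not yet grouped.
        stab = min(upper for _, upper in remaining.values())
        # Every remaining interval has upper >= stab, so lower <= stab means
        # the interval contains the point stab.
        group = sorted(h for h, (lower, _) in remaining.items() if lower <= stab)
        groups.append(group)
        for h in group:
            del remaining[h]
    return groups


# Hypothetical toy input: interval of NE group gamma for each hour.
toy_hours = {
    0: (0.1, 0.3),
    1: (0.2, 0.4),
    2: (0.5, 0.8),
    3: (0.6, 0.9),
    4: (0.7, 1.0),
    5: (0.25, 0.35),
}
print(group_hours(toy_hours))  # [[0, 1, 5], [2, 3, 4]]
```

Note that hour 5 joins hours 0 and 1 even though it is not adjacent to them in time; enforcing the optional contiguity rule would need an additional check.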
Now, all hours in each identified interval group $C'_h$ of NE group
γ can form a spatial-temporal cluster $C''_\delta$. In order to
preserve the threshold-closeness property for all members, we set
the compressed threshold $T_{comp}(\delta)$ within the common
intersection across all NEs $i \in C_\gamma$ and hours $j \in C'_h$
in the spatial-temporal cluster; we use the median point in this
study. We again note that this compressed threshold
$T_{comp}(\delta)$ is shared by all NEs and hours in $C''_\delta$,
thus reducing the threshold setting while still preserving the
location- and time-specific thresholds.
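As a rough sketch of this last step, the compressed threshold of one cluster can be computed from the permissible intervals of all its (NE, hour) members. The function names are hypothetical, and reading "median point" as the midpoint of the common intersection is an assumption of this example, not something the text states explicitly.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower, upper) permissible threshold bounds


def common_intersection(intervals: List[Interval]) -> Interval:
    """Intersection of closed intervals; raises if they share no point."""
    lower = max(lo for lo, _ in intervals)
    upper = min(up for _, up in intervals)
    if lower > upper:
        raise ValueError("intervals have no common intersection")
    return lower, upper


def compressed_threshold(intervals: List[Interval]) -> float:
    """Midpoint of the common intersection (assumed meaning of 'median point')."""
    lower, upper = common_intersection(intervals)
    return (lower + upper) / 2


# Hypothetical permissible intervals of all (NE, hour) members of one cluster.
cluster_members = [(0.5, 1.0), (0.25, 0.75), (0.5, 0.875)]
print(common_intersection(cluster_members))   # (0.5, 0.75)
print(compressed_threshold(cluster_members))  # 0.625
```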
3. EVALUATION
We evaluate the performance of threshold-compression on data
recorded from June 2010 to August 2010 in one regional 3G network
that covers several thousand NEs. Figure 2 shows the threshold
compression gain on different KPIs. The compression gain is defined
as the threshold-setting reduction relative to the fine-grained
per-NE-hourly setting. Each compression gain in the figure
represents the highest threshold-compression gain observed when the
resulting false/miss alarm rates FPR and FNR (based on the
per-NE-hourly alarm statistics) are both within the 10% (and 20%)
range. We observe that, under the 10% false/miss alarm condition,
most KPIs show a very high compression gain of nearly 80–90%.
Table 1 compares the threshold-setting sizes and false/miss alarm
rates produced by different thresholding schemes. Our approach
balances the problematic tradeoff between the threshold setting and
the alarm quality very well, while the other schemes are unable to
achieve both.
Figure 2: Compression gain on different KPIs.
Threshold scheme        #thresholds   FPR     FNR
per-NE-hourly           25320         -       -
threshold-compression   3763          8.4%    2.7%
per-NE-static           1055          31.1%   51.8%
per-NEtype-hourly       24            51.2%   47.5%
per-NEtype-static       1             53.2%   58.0%
Table 1: Thresholding on downlink-throughput KPI.
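The compression gain defined in this section can be worked out for the downlink-throughput row of Table 1. The helper below is a hypothetical illustration of that definition (reduction relative to the per-NE-hourly setting), using only the two threshold counts from the table.

```python
def compression_gain(compressed: int, fine_grained: int) -> float:
    """Threshold-setting reduction relative to the fine-grained setting."""
    return 1.0 - compressed / fine_grained


# Table 1: 25320 per-NE-hourly thresholds versus 3763 compressed thresholds.
gain = compression_gain(compressed=3763, fine_grained=25320)
print(f"{gain:.1%}")  # 85.1%
```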
4. CONCLUSION
Motivated by key observations of spatial-temporal threshold
similarity, we have proposed a scalable monitoring solution, called
threshold-compression, that can characterize the location- and
time-specific threshold trend of each individual NE with a minimal
threshold setting. Our experience with applying the
threshold-compression solution to operational 3G network monitoring
has been very positive and has demonstrated the effectiveness of the
proposed approach, e.g., a threshold-setting reduction of up to 90%
with false/miss alarm rates within 10%.
5. REFERENCES
[1] S.-B. Lee, D. Pei, M. Hajiaghayi, I. Pefkianakis, S. Lu, H. Yan, Z. Ge,
J. Yates, and M. Kosseifi. Scalable monitoring via threshold compression
in a large operational 3G network. AT&T Technical Report, 2011.
[2] D. Welsh and M. Powell. An upper bound for the chromatic number
of a graph and its application to timetabling problems. Computer
Journal, 10(1):85–86, 1967.