Conference Paper

SEEN: ML Assisted Cellular Service Diagnosis

References
Conference Paper
Full-text available
Effective management of large-scale cellular data networks is critical to meet customer demands and expectations. Customer calls for technical support provide a direct indication of the problems customers encounter. In this paper, we study customer tickets - free-text recordings and classifications by customer support agents - collected at a large cellular network provider, with two inter-related goals: i) to characterize and understand the major factors that lead customers to call and seek support; and ii) to utilize such customer tickets to help identify potential network problems. For this purpose, we develop a novel statistical approach to model customer call rates that accounts for customer-side factors (e.g., user tenure and handset types) and geo-locations. We show that most calls are due to customer-side factors and can be well captured by the model. Furthermore, we also demonstrate that location-specific deviations from the model provide a good indicator of potential network-side issues.
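The modeling step described here lends itself to a short sketch: regress call counts on customer-side features with a Poisson model, then flag locations whose observed calls exceed the prediction. This is a minimal illustration of the idea, not the paper's actual model; the feature names, data, and threshold below are all invented.

```python
# Hypothetical sketch: Poisson regression of call counts on customer-side
# factors, with location-level residuals as a network-issue signal.
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor

# One row per (location, customer segment): illustrative data only.
df = pd.DataFrame({
    "tenure_months": [2, 30, 5, 48, 1, 24],
    "handset_age_months": [1, 20, 3, 36, 2, 12],
    "subscribers": [900, 1200, 800, 1500, 700, 1100],
    "location": ["A", "A", "B", "B", "C", "C"],
    "calls": [40, 25, 30, 28, 90, 70],
})

X = df[["tenure_months", "handset_age_months"]]
# Model the per-subscriber call rate, weighting by exposure (subscriber count).
model = PoissonRegressor(alpha=1e-3)
model.fit(X, df["calls"] / df["subscribers"], sample_weight=df["subscribers"])

expected = model.predict(X) * df["subscribers"]
# Standardized deviation of observed from expected calls, averaged per location.
resid = (df["calls"] - expected) / np.sqrt(expected)
per_loc = resid.groupby(df["location"]).mean()
print(per_loc[per_loc > 2.0])  # locations with excess calls: candidate network issues
```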
Conference Paper
Full-text available
Detecting anomalous traffic is a crucial part of managing IP networks. In recent years, network-wide anomaly detection based on Principal Component Analysis (PCA) has emerged as a powerful method for detecting a wide variety of anomalies. We show that tuning PCA to operate effectively in practice is difficult and requires more robust techniques than have been presented thus far. We analyze a week of network-wide traffic measurements from two IP backbones (Abilene and Geant) across three different traffic aggregations (ingress routers, OD flows, and input links), and conduct a detailed inspection of the feature time series for each suspected anomaly. Our study identifies and evaluates four main challenges of using PCA to detect traffic anomalies: (i) the false positive rate is very sensitive to small differences in the number of principal components in the normal subspace, (ii) the effectiveness of PCA is sensitive to the level of aggregation of the traffic measurements, (iii) a large anomaly may inadvertently pollute the normal subspace, (iv) correctly identifying which flow triggered the anomaly detector is an inherently challenging problem.
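The subspace method under evaluation is concrete enough to sketch. Assuming the usual formulation (a timebins-by-features traffic matrix, the top-k principal components spanning the normal subspace, and squared prediction error as the detection statistic), a toy version looks like this; the data, the choice of k, and the threshold are placeholders, and k is exactly the sensitivity the abstract warns about.

```python
# Toy PCA subspace anomaly detector: residual energy (SPE) per timebin.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2016, 120))          # timebins x traffic features (synthetic)
X[1000] += 8.0                            # inject a network-wide anomaly

Xc = X - X.mean(axis=0)                   # center each feature's time series
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 4                                     # size of the normal subspace (sensitive!)
P = Vt[:k].T                              # top-k principal directions

residual = Xc - Xc @ P @ P.T              # projection onto the anomalous subspace
spe = (residual ** 2).sum(axis=1)         # squared prediction error per timebin
threshold = spe.mean() + 3 * spe.std()    # crude surrogate for the Q-statistic limit
print(np.where(spe > threshold)[0])       # flagged timebins; should include 1000
```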
Conference Paper
Full-text available
Traditional DSL troubleshooting solutions are reactive, relying mainly on customers to report problems; they tend to be labor-intensive, time-consuming, and prone to incorrect resolutions, and can overall contribute to increased customer dissatisfaction. In this paper, we propose NEVERMIND, a proactive approach to facilitate troubleshooting customer edge problems and reduce customer tickets. Our system consists of: i) a ticket predictor which predicts future customer tickets; and ii) a trouble locator which helps technicians accelerate the troubleshooting process during field dispatches. Both components infer future tickets and trouble locations from existing sparse line measurements, and the inference models are constructed automatically using supervised machine learning techniques. We propose several novel techniques to address the operational constraints in DSL networks and to enhance the accuracy of NEVERMIND. Extensive evaluations using an entire year's worth of customer tickets and measurement data from a large network show that our method can predict thousands of future customer tickets per week with high accuracy and significantly reduce the time and effort for diagnosing these tickets. This is beneficial as it both reduces the number of customer care calls and improves customer satisfaction.
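The ticket-predictor component reduces to a familiar supervised-learning setup, which a short sketch makes concrete: train a classifier on per-line measurements to score each line's risk of generating a ticket. The features, labels, and model choice below are hypothetical stand-ins, not the paper's pipeline.

```python
# Illustrative ticket predictor: classify lines by near-term ticket risk.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 5000
# Hypothetical per-line DSL measurements.
X = np.column_stack([
    rng.normal(12.0, 3.0, n),    # downstream sync rate (Mbps)
    rng.poisson(0.5, n),         # retrains in the past week
    rng.poisson(20.0, n),        # FEC errors (thousands)
])
# Synthetic label: low sync rate plus frequent retrains tends to produce a ticket.
y = ((X[:, 0] < 9) & (X[:, 1] >= 1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Rank lines by predicted risk so repairs can be dispatched proactively.
risk = clf.predict_proba(X_te)[:, 1]
print("top-risk lines:", np.argsort(risk)[::-1][:10])
```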
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
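The gating mechanism the abstract describes is easiest to see written out. Below is one LSTM timestep in plain NumPy, a standard textbook formulation rather than the paper's original (slightly different) architecture: multiplicative gates control an additively updated cell state, the "constant error carousel" through which gradients can flow without decaying.

```python
# One LSTM step in NumPy, showing the gates and the additive cell update.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """Single LSTM timestep. W: (4H, D+H), b: (4H,). Returns new (h, c)."""
    z = W @ np.concatenate([x, h]) + b
    H = h.shape[0]
    i = sigmoid(z[0*H:1*H])        # input gate: admit new information
    f = sigmoid(z[1*H:2*H])        # forget gate: decay old cell state
    o = sigmoid(z[2*H:3*H])        # output gate: expose cell state
    g = np.tanh(z[3*H:4*H])        # candidate cell update
    c_new = f * c + i * g          # additive carousel: error flows through the "+"
    h_new = o * np.tanh(c_new)
    return h_new, c_new

D, H = 8, 16
rng = np.random.default_rng(0)
W, b = rng.normal(0, 0.1, (4*H, D+H)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for t in range(1000):              # 1000-step lags are exactly LSTM's target regime
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
```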
Conference Paper
An essential step in the customer care routine of cellular service carriers is determining whether an individual user is impacted by ongoing service issues. This is traditionally done by monitoring the network and the services. However, user feedback data, generated when users call customer care agents with problems, is a complementary source of data for this purpose. User feedback data is particularly valuable as it provides the user's perspective on service issues. However, this data is extremely noisy, due to the range of issues that users have and the diversity of the language used by care agents. In this paper, we present LOTUS, a system that identifies users impacted by a common root cause (such as a network outage) from user feedback. LOTUS is based on a novel algorithmic framework that tightly couples co-training and spatial scan statistics. To model the text in the user feedback, LOTUS also incorporates custom-built language models using deep sequence learning. Through experimental analysis on synthetic and live data, we demonstrate the accuracy of LOTUS. LOTUS has been deployed for several months, and has identified the impact of over 200 events.
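The spatial-scan half of that coupling can be sketched on its own. Assuming a Kulldorff-style Poisson scan over candidate geographic regions (the standard formulation, not necessarily LOTUS's exact statistic), each region is scored by the likelihood ratio of its observed complaint count against its expected count; the grid and the counts below are invented.

```python
# Toy Kulldorff-style spatial scan over grid cells of complaint counts.
import numpy as np

def poisson_llr(c_in, e_in, c_tot, e_tot):
    """Log-likelihood ratio that a region's complaint rate is elevated."""
    if c_in <= e_in:               # only score excesses over expectation
        return 0.0
    c_out, e_out = c_tot - c_in, e_tot - e_in
    return (c_in * np.log(c_in / e_in)
            + (c_out * np.log(c_out / e_out) if c_out > 0 else 0.0))

# Complaints per cell of a coarse geographic grid (hypothetical numbers).
observed = np.array([3, 2, 4, 25, 22, 3, 1, 2])
expected = np.full(8, observed.sum() / 8)   # uniform baseline for illustration

scores = [poisson_llr(observed[i], expected[i], observed.sum(), expected.sum())
          for i in range(8)]
best = int(np.argmax(scores))
print(f"most anomalous cell: {best}, LLR={scores[best]:.2f}")
# In practice the max-LLR region's significance is assessed by Monte Carlo
# replication under the null, and the learned text model filters which
# complaints count toward the scan in the first place.
```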
Article
A recent experimental evaluation assessed 19 time series classification (TSC) algorithms and found that one was significantly more accurate than all others: the Flat Collective of Transformation-based Ensembles (Flat-COTE). Flat-COTE is an ensemble that combines 35 classifiers over four data representations. However, while comprehensive, the evaluation did not consider deep learning approaches. Convolutional neural networks (CNNs) have seen a surge in popularity and are now state of the art in many fields, raising the question of whether CNNs could be equally transformative for TSC. We implement a benchmark CNN for TSC using a common structure and use results from a TSC-specific CNN from the literature. We compare both to Flat-COTE and find that the collective is significantly more accurate than both CNNs. These results are impressive, but Flat-COTE is not without deficiencies. We significantly improve the collective by proposing a new hierarchical structure with probabilistic voting, defining and including two novel ensemble classifiers built in existing feature spaces, and adding further modules to represent two additional transformation domains. The resulting classifier, the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE), encapsulates classifiers built on five data representations. We demonstrate that HIVE-COTE is significantly more accurate than Flat-COTE (and all other TSC algorithms that we are aware of) over 100 resamples of 85 TSC problems, and is the new state of the art for TSC. Further analysis is included through the introduction and evaluation of 3 new case studies and extensive experimentation on 1,000 simulated datasets of 5 different types.
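The "hierarchical structure with probabilistic voting" amounts to combining per-representation modules weighted by how well each performs. A minimal sketch of that combination rule, with made-up module outputs and weights (not HIVE-COTE's actual modules or weighting scheme):

```python
# Weighted probabilistic voting across ensemble modules, one per representation.
import numpy as np

# P[m, k]: module m's probability for class k on a single test instance.
module_probs = np.array([
    [0.70, 0.30],   # e.g. a shapelet-based module
    [0.40, 0.60],   # e.g. an elastic-distance module
    [0.80, 0.20],   # e.g. a spectral module
])
# Each module's cross-validated accuracy on training data (hypothetical).
acc = np.array([0.85, 0.60, 0.90])

weights = acc / acc.sum()                  # normalize accuracies into vote weights
combined = weights @ module_probs          # weighted probabilistic vote
print("class probabilities:", combined, "-> predict", int(np.argmax(combined)))
```

The design point is that a weak representation (the middle module) still contributes, but cannot outvote modules that demonstrably perform better on the training data.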
Conference Paper
In an increasingly mobile, connected world, our experience of mobile applications depends more and more on the performance of cellular radio access networks (RANs). To achieve high quality of experience for the user, it is imperative that operators identify and diagnose performance problems quickly. In this paper, we describe our experience in understanding the challenges in automating the diagnosis of RAN performance problems. Working with a major cellular network operator on a part of their RAN that services more than 2 million users, we demonstrate that fine-grained modeling and analysis could be the key towards this goal. We describe our methodology for analyzing RAN problems, and highlight a few of our findings, some previously unknown. We also discuss lessons from our attempt at building automated diagnosis solutions.
Article
The last decade has witnessed tremendous growth of interest in applications that deal with querying and mining of time series data. Numerous representation methods for dimensionality reduction and similarity measures geared towards time series have been introduced. Each individual work introducing a particular method has made specific claims and, aside from the occasional theoretical justifications, provided quantitative experimental observations. However, for the most part, the comparative aspects of these experiments were too narrowly focused on demonstrating the benefits of the proposed methods over some of the previously introduced ones. In order to provide a comprehensive validation, we conducted an extensive set of time series experiments re-implementing 8 different representation methods and 9 similarity measures and their variants, and testing their effectiveness on 38 time series data sets from a wide variety of application domains. In this paper, we give an overview of these different techniques and present our comparative experimental findings regarding their effectiveness. Our experiments have provided both a unified validation of some of the existing achievements and, in some cases, suggested that certain claims in the literature may be unduly optimistic.
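Two measures anchor most comparisons in this literature and are worth sketching directly: Euclidean distance on z-normalized series, and unconstrained dynamic time warping (DTW). These are the standard textbook formulations, not the survey's own code.

```python
# Baseline time series similarity measures: z-normalized Euclidean and DTW.
import numpy as np

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def euclidean(a, b):
    return float(np.linalg.norm(znorm(a) - znorm(b)))

def dtw(a, b):
    """Unconstrained DTW distance via O(n*m) dynamic programming."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i-1] - b[j-1]) ** 2
            D[i, j] = cost + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    return float(np.sqrt(D[n, m]))

t = np.linspace(0, 2 * np.pi, 100)
a, b = np.sin(t), np.sin(t + 0.5)             # same shape, phase-shifted
print("euclidean:", euclidean(a, b))
print("dtw:      ", dtw(znorm(a), znorm(b)))  # DTW absorbs the phase shift
```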
Conference Paper
This paper presents NetSieve, a system that aims to do automated problem inference from network trouble tickets. Network trouble tickets are diaries comprising fixed fields and free-form text written by operators to document the steps while troubleshooting a problem. Unfortunately, while tickets carry valuable information for network management, analyzing them to do problem inference is extremely difficult--fixed fields are often inaccurate or incomplete, and the free-form text is mostly written in natural language. This paper takes a practical step towards automatically analyzing natural language text in network tickets to infer the problem symptoms, troubleshooting activities and resolution actions. Our system, NetSieve, combines statistical natural language processing (NLP), knowledge representation, and ontology modeling to achieve these goals. To cope with ambiguity in free-form text, NetSieve leverages learning from human guidance to improve its inference accuracy. We evaluate NetSieve on 10K+ tickets from a large cloud provider, and compare its accuracy using (a) an expert review, (b) a study with operators, and (c) vendor data that tracks device replacement and repairs. Our results show that NetSieve achieves 89%-100% accuracy and its inference output is useful to learn global problem trends. We have used NetSieve in several key network operations: analyzing device failure trends, understanding why network redundancy fails, and identifying device problem symptoms.
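As a deliberately small stand-in for NetSieve's inference stage (the real system combines statistical NLP, knowledge representation, and ontology modeling), the sketch below maps phrases found in free-form ticket text onto three ontology classes. The phrase table and ticket text are invented; NetSieve learns and refines such mappings with human guidance rather than hard-coding them.

```python
# Toy phrase-to-ontology tagger for network trouble-ticket text.
ONTOLOGY = {
    "symptom":  ["packet loss", "link down", "high latency", "crc errors"],
    "activity": ["rebooted", "ran diagnostics", "checked cabling"],
    "action":   ["replaced line card", "swapped optic", "upgraded firmware"],
}

def infer(ticket_text):
    """Return the ontology classes whose phrases appear in the ticket text."""
    text = ticket_text.lower()
    found = {cls: [p for p in phrases if p in text]
             for cls, phrases in ONTOLOGY.items()}
    return {cls: ps for cls, ps in found.items() if ps}

ticket = ("Customer reported packet loss on uplink. Tech rebooted the switch, "
          "ran diagnostics, and replaced line card in slot 2.")
print(infer(ticket))
# {'symptom': ['packet loss'], 'activity': ['rebooted', 'ran diagnostics'],
#  'action': ['replaced line card']}
```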
Conference Paper
Customer care calls serve as a direct channel for a service provider to learn feedback from its customers. They reveal details about the nature and impact of major events and problems observed by customers. By analyzing customer care calls, a service provider can detect important events to speed up problem resolution. However, automating event detection based on customer care calls poses several significant challenges. First, the relationship between customers' calls and network events is blurred because customers respond to an event in different ways. Second, customer care calls can be labeled inconsistently across agents and across call centers, and a given event naturally gives rise to calls spanning a number of categories. Third, many important events cannot be detected by looking at calls in one category; how to aggregate calls from different categories for event detection is important but challenging. Lastly, customer care call records have high dimensionality (e.g., thousands of categories in our dataset). In this paper, we propose a systematic method for detecting events in a major cellular network using customer care call data. It consists of three main components: (i) using a regression approach that exploits temporal stability and low-rank properties to automatically learn the relationship between customer calls and major events, (ii) reducing the number of unknowns by clustering call categories and using L1-norm minimization to identify important categories, and (iii) employing multiple classifiers to enhance robustness against noise and differing response times. For the detected events, we leverage Twitter social media to summarize them and to locate the impacted regions. We show the effectiveness of our approach using data from a large cellular service provider in the US.
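The L1-minimization step in component (ii) has a compact illustration: fit an L1-penalized regression from per-category call counts to a known event indicator, so that only the few categories that actually respond to events keep nonzero weight. The data, category indices, and model choice here are synthetic stand-ins, not the paper's formulation.

```python
# L1-penalized selection of event-relevant call categories (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_hours, n_categories = 500, 50
X = rng.poisson(5.0, size=(n_hours, n_categories)).astype(float)
event = (rng.random(n_hours) < 0.1).astype(int)
# During events, categories 3 and 17 (e.g. "no data service") spike.
X[event == 1, 3] += 20
X[event == 1, 17] += 15

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, event)
important = np.nonzero(clf.coef_[0])[0]
print("event-relevant call categories:", important)  # should recover 3 and 17
```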
Conference Paper
Network anomaly detection using dimensionality reduction techniques has received much recent attention in the literature. For example, previous work has aggregated netflow records into origin-destination (OD) flows, yielding a much smaller set of dimensions which can then be mined to uncover anomalies. However, this approach can only identify which OD flow is anomalous, not the particular IP flow(s) responsible for the anomaly. In this paper we show how one can use random aggregations of IP flows (i.e., sketches) to enable more precise identification of the underlying causes of anomalies. We show how to combine traffic sketches with a subspace method to (1) detect anomalies with high accuracy and (2) identify the IP flow(s) that are responsible for the anomaly. Our method has detection rates comparable to previous methods and detects many more anomalies than prior work, taking us a step closer towards a robust on-line system for anomaly detection and identification.
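A minimal illustration of the combination, under simplifying assumptions (one random hash, synthetic flow rates, and the same residual-subspace test as in the PCA sketch earlier): hash IP flows into a small number of buckets, then detect on the bucket time series. With several independent hashes, intersecting the flagged buckets is what narrows the anomaly down to specific flows; this sketch shows a single hash.

```python
# Sketch-based flow aggregation followed by residual-subspace detection.
import numpy as np

rng = np.random.default_rng(3)
n_t, n_flows, n_buckets = 500, 10000, 64
flow_rates = rng.gamma(2.0, 1.0, size=n_flows)
bucket_of = rng.integers(0, n_buckets, size=n_flows)   # one random hash

# Aggregate per-flow traffic into per-bucket time series.
X = np.zeros((n_t, n_buckets))
for f in range(n_flows):
    X[:, bucket_of[f]] += flow_rates[f] * (1 + 0.05 * rng.normal(size=n_t))
culprit = 1234
X[300, bucket_of[culprit]] += 50 * flow_rates[culprit]  # inject a flow-level anomaly

# Residual-subspace detection on the sketched series.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:5].T                                            # 5-component normal subspace
residual = Xc - Xc @ P @ P.T
spe = (residual ** 2).sum(axis=1)
t = int(np.argmax(spe))
print("anomalous timebin:", t,
      "suspect bucket:", int(np.argmax(np.abs(residual[t]))))
```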
Variational Autoencoder Based Anomaly Detection Using Reconstruction Probability
  • Jinwon An
  • Sungzoon Cho
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • Jacob Devlin
  • Ming-Wei Chang
  • Kenton Lee
  • Kristina Toutanova
CableMon: Improving the Reliability of Cable Broadband Networks via Proactive Network Maintenance
  • Jiyao Hu
  • Zhenyu Zhou
  • Xiaowei Yang
  • Jacob Malone
  • Jonathan W Williams