Shenglin Zhang

Nankai University | NKU · College of Software

PhD

About

39
Publications
20,738
Reads
531
Citations
Introduction
Shenglin Zhang is an associate professor at the College of Software, Nankai University. His research focuses on AIOps for software and network service management, including anomaly detection, failure diagnosis, root cause analysis, and failure prediction. He has published more than 30 papers at international conferences such as ATC, WWW, VLDB, SIGMETRICS, CoNEXT, INFOCOM, IJCAI, ISSRE, and IWQoS, and in peer-reviewed journals such as IEEE TC, TSC, and TNSM.
Additional affiliations
July 2012 - July 2017
Tsinghua University
Position
  • PhD Student
Education
September 2008 - June 2012
Xidian University
Field of study
  • Network Engineering

Publications (39)
Preprint
Full-text available
Recently, AIOps (Artificial Intelligence for IT Operations) has been well studied in academia and industry to enable automated and effective software service management. Plenty of efforts have been dedicated to AIOps, including anomaly detection, root cause localization, incident management, etc. However, most existing works are evaluated on privat...
Article
Detecting malicious non-existent domain names (NXDomains) in a real-time manner is vitally important to the security of large-scale dependable systems. Existing detection methods are trained based on the assumption that the NXDomains, which cannot be recognized by the domain generation algorithm (DGA) archive, are benign. However, new types of mali...
Preprint
Full-text available
UniLog: Deploy One Model and Specialize it for All Log Analysis Tasks
Article
Today's large datacenters house a massive number of machines, each of which is being closely monitored with multivariate time series (e.g., CPU idle, memory utilization) to ensure service quality. Detecting outlier machine instances with multivariate time series is crucial for service management. However, it is a challenging task due to the multipl...
Article
Logs are imperative in the management process of networks and services. However, manually identifying and classifying anomalous logs is time-consuming, error-prone, and labor-intensive. Additionally, rule-based approaches cannot tackle the challenges underlying anomalous log identification and classification resulting from new types of logs and par...
Preprint
Full-text available
Logs are one of the most valuable data sources for managing large-scale online services. After a failure is detected/diagnosed/predicted, operators still have to inspect the raw logs to gain a summarized view before taking action. However, manual or rule-based log summarization has become inefficient and ineffective. In this work, we propose LogSumm...
Conference Paper
Full-text available
Logs are one of the most valuable data sources for large-scale service (e.g., social network, search engine) maintenance. Log parsing serves as the first step towards automated log analysis. However, the current log parsing methods are not adaptive. Without intra-service adaptiveness, log parsing cannot handle software/firmware upgrades because...
Article
Full-text available
With the growing market of cloud databases, careful detection and elimination of slow queries are of great importance to service stability. Previous studies focus on optimizing the slow queries that result from internal reasons (e.g., poorly-written SQLs). In this work, we discover a different set of slow queries which might be more hazardous to da...
Article
Full-text available
Syslog parsing is of vital importance for the detection, diagnosis and prediction of network device failures in a datacenter. A common approach to syslog parsing is to extract templates from historical syslogs, after which syslogs are matched to these templates. To address the problems in the existing syslog parsing techniques, we propose a novel f...
Conference Paper
Full-text available
Recording runtime status via logs is common for almost every computer system, and detecting anomalies in logs is crucial for the timely identification of system malfunctions. However, manually detecting anomalies in logs is time-consuming, error-prone, and infeasible. Existing automatic log anomaly detection approaches, using indexes rather than semanti...
Article
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, so that operators can intervene and "fix" the potential failures before they happen. Specifi...
Conference Paper
Full-text available
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, so that operators can intervene and "fix" the potential failures before they happen. Specifi...
Article
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, so that operators can intervene and "fix" the potential failures before they happen. Specifi...
Article
Full-text available
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, so that operators can intervene and "fix" the potential failures before they happen. Specifi...
Article
Full-text available
Additive key performance indicators (KPIs, such as page view, revenue, error count) with multi-dimensional attributes (such as ISP, Province, DataCenter) are common and important monitoring metrics in Internet companies. When an anomaly happens to an overall KPI, it is critical but challenging to localize the root cause, which is one (or more) comb...
Article
Full-text available
As a path vector protocol, Border Gateway Protocol (BGP) messages contain an entire Autonomous System (AS) path to each destination for breaking arbitrarily long AS path loops. However, after observing the global routing data from RouteViews, we find that BGP AS Path Looping (BAPL) behavior does occur and in fact can lead to multi-AS forwarding loops...
Article
Full-text available
The detection of performance changes in software change roll-outs in Internet-based services is crucial for an operations team, because it allows timely roll-back of a software change when performance degrades unexpectedly. However, it is infeasible to manually investigate millions of performance measurements of many roll-outs. In this paper, we pr...
Conference Paper
Full-text available
The detection of performance changes in software change roll-outs in Internet-based services is crucial for an operations team, because it allows timely roll-back of a software change when performance degrades unexpectedly. However, it is infeasible to manually investigate millions of performance measurements of many roll-outs. In this paper, we pr...
Article
Full-text available
In the design and construction process of the Next Generation Internet, it is important to accurately identify the source of each forwarded IP packet, especially to support precise fine-grained management, control, and traceability and to improve the trustworthiness of the Internet. This paper designs a scalable Network Identity (NID) scheme for t...
Article
Full-text available
Provider Portal for Applications (P4P) is a model aiming to incorporate peer-to-peer (P2P) applications with Internet Service Providers (ISPs) and improve the performance of both the ISP and the P2P applications. In this study, we have analyzed the relationship between the link traffic and the P-distance, which is the core interface of P4P. In addi...
Conference Paper
Full-text available
As a path vector protocol, Border Gateway Protocol (BGP) messages contain the entire Autonomous System (AS) path to each destination for breaking arbitrarily long AS path loops. However, after observing the global routing data from RouteViews, we find that BGP AS path looping (BAPL) behavior does occur and in fact can lead to multi-AS forwarding loop...
Chapter
Full-text available
P4P (Provider Portal for Applications) is a model aiming to incorporate P2P with ISP and improve the performance of both the ISP and the P2P applications. In this study, we analyze the relationship between the link traffic and the P-distance, which is the core interface of P4P, and illustrate the disadvantage of P4P in dealing with network topology...