Shenglin Zhang

Nankai University | NKU · College of Software

PhD

About

80
Publications
32,172
Reads
2,124
Citations
Introduction
Shenglin Zhang is an associate professor at the College of Software, Nankai University. His research interests focus on AIOps, including anomaly detection, failure diagnosis, root cause analysis, failure prediction, etc., for software/network service management. He has published 30+ papers in international conferences, including ATC, WWW, VLDB, SIGMETRICS, CoNEXT, INFOCOM, IJCAI, ISSRE, IWQOS, etc., and peer-reviewed journals, including IEEE TC/TSC/TNSM, etc.
Additional affiliations
July 2012 - July 2017
Tsinghua University
Position
  • PhD Student
Education
September 2008 - June 2012
Xidian University
Field of study
  • Network Engineering

Publications (80)
Article
Widely adopted for their scalability and flexibility, modern microservice systems present unique failure diagnosis challenges due to their independent deployment and dynamic interactions. This complexity can lead to cascading failures that negatively impact operational efficiency and user experience. Recognizing the critical role of fault diagnosis...
Article
With the booming of large-scale network devices, anomaly detection on multivariate time series (MTS), such as a combination of CPU utilization, average response time, and network packet loss, is important for system reliability. Although a collection of learning-based approaches have been designed for this purpose, our study shows that these approa...
Article
Microservices improve the scalability and flexibility of monolithic architectures to accommodate the evolution of software systems, but the complexity and dynamics of microservices challenge system reliability. Ensuring microservice quality requires efficient failure diagnosis, including detection and triage. Failure detection involves identifying...
Preprint
Automatic log analysis is essential for the efficient Operation and Maintenance (O&M) of software systems, providing critical insights into system behaviors. However, existing approaches mostly treat log analysis as training a model to perform an isolated task, using task-specific log-label pairs. These task-based approaches are inflexible in gener...
Article
Accurate and efficient localization of root cause instances in large-scale microservice systems is of paramount importance. Unfortunately, prevailing methods face several limitations. Notably, some recent methods rely on supervised learning which necessitates a substantial amount of labeled data. However, labeling root cause instances is time-consu...
Article
The availability of microservice systems is critical to business operations and corporate reputation. However, the dynamics and complexity of microservice systems introduce significant challenges to the performance issue diagnosis of large-scale microservice systems. After investigating hundreds of real-world performance issue cases in Tencent, we...
Preprint
Large language models (LLMs) excel at general question-answering (Q&A) but often fall short in specialized domains due to a lack of domain-specific knowledge. Commercial companies face the dual challenges of privacy protection and resource constraints when involving LLMs for fine-tuning. This paper proposes a novel framework, Self-Evolution, designe...
Preprint
AIOps algorithms play a crucial role in the maintenance of microservice systems. The performance leaderboards of many previous benchmarks provide valuable guidance for selecting appropriate algorithms. However, existing AIOps benchmarks mainly utilize offline datasets to evaluate algorithms. They cannot consistently evaluate the performance of algorithms...
Article
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces). In this study, we conduct an empirical study using real-world failure cases to show that combining these sources of data (multimodal data) leads to a more accur...
Article
To ensure the performance of large-scale datacenters, operators need to monitor up to tens of millions of various-type KPIs, e.g., CPU utilization, memory utilization. For each KPI, it is crucial but challenging to detect outliers that deviate from its historical patterns or the patterns of other KPIs in the same period. In this work, we propose O...
Article
We propose LogSummary, an automatic, unsupervised end-to-end log summarization framework for software system maintenance in this work. LogSummary obtains the summarized triples of necessary logs for a given log sequence. It integrates a novel information extraction method that considers semantic information and domain knowledge with a new triple-ra...
Article
Logs are among the most valuable data for describing the running state of services. Failure diagnosis through logs is crucial for service reliability and security. Current automatic log-based failure diagnosis methods cannot fully use the multiple fields of logs and thus fail to capture the relations between them. In this paper, we propose LogKG, a new fr...
Preprint
Internet-based services have seen remarkable success, generating vast amounts of monitored key performance indicators (KPIs) as univariate or multivariate time series. Monitoring and analyzing these time series are crucial for researchers, service operators, and on-call engineers to detect outliers or anomalies indicating service failures or signif...
Preprint
Proactive failure detection of instances is essential to microservice systems because an instance failure can propagate to the whole system and degrade the system's performance. Over the years, many single-modal (i.e., metrics, logs, or traces) data-based anomaly detection methods have been proposed. However, they tend to miss a large number...
Preprint
Full-text available
Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time cloud computing applications and services hosted on the cloud are affected by a cloud outage, users can experience slow response times, connection issues or total service disruption, resulting in a significant negative business impact....
Preprint
Full-text available
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces). In this study, we conduct an empirical study using real-world failure cases to show that combining these sources of data (multimodal data) leads to a more ac...
Article
Full-text available
The reliability of wireless base stations is essential to guarantee user experience in wireless networks, so anomaly detection on multivariate time series is indispensable for network operators to monitor the behaviours of large‐scale wireless base stations. In this paper, a general unsupervised anomaly detection mo...
Preprint
Full-text available
Recently, AIOps (Artificial Intelligence for IT Operations) has been well studied in academia and industry to enable automated and effective software service management. Plenty of efforts have been dedicated to AIOps, including anomaly detection, root cause localization, incident management, etc. However, most existing works are evaluated on privat...
Article
Timely anomaly detection of key performance indicators (KPIs), e.g., service response time and error rate, is of utmost importance to Web services. Over the years, many unsupervised deep learning-based anomaly detection approaches have been proposed. To achieve good performance, they require a long period of KPI data for model training, which is no...
Article
Detecting malicious non-existent domain names (NXDomains) in a real-time manner is vitally important to the security of large-scale dependable systems. Existing detection methods are trained based on the assumption that the NXDomains, which cannot be recognized by the domain generation algorithm (DGA) archive, are benign. However, new types of mali...
Preprint
Full-text available
The reliability of wireless base stations in China Mobile is of vital importance, because the cell phone users are connected to the stations and the behaviors of the stations are directly related to user experience. Although the monitoring of the station behaviors can be realized by anomaly detection on multivariate time series, due to complex corr...
Article
Anomaly clue localization for multi-dimensional derived measures is vitally important for the reliability of online video services. In this paper, we propose RobustSpot, an end-to-end framework for localizing the clues to anomalous multi-dimensional derived measures. RobustSpot integrates two novel indicators, i.e., “Anomaly Degree” and “Contribution...
Preprint
Full-text available
UniLog: Deploy One Model and Specialize it for All Log Analysis Tasks
Article
Today's large datacenters house a massive number of machines, each of which is being closely monitored with multivariate time series (e.g., CPU idle, memory utilization) to ensure service quality. Detecting outlier machine instances with multivariate time series is crucial for service management. However, it is a challenging task due to the multipl...
Article
Logs are imperative in the management process of networks and services. However, manually identifying and classifying anomalous logs is time-consuming, error-prone, and labor-intensive. Additionally, rule-based approaches cannot tackle the challenges underlying anomalous log identification and classification resulting from new types of logs and par...
Preprint
Full-text available
Logs are one of the most valuable data sources for managing large-scale online services. After a failure is detected/diagnosed/predicted, operators still have to inspect the raw logs to gain a summarized view before taking action. However, manual or rule-based log summarization has become inefficient and ineffective. In this work, we propose LogSumm...
Conference Paper
Full-text available
Logs are one of the most valuable data sources for large-scale service (e.g., social network, search engine) maintenance. Log parsing serves as the first step towards automated log analysis. However, the current log parsing methods are not adaptive. Without intra-service adaptiveness, log parsing cannot handle software/firmware upgrade because...
Article
Full-text available
With the growing market of cloud databases, careful detection and elimination of slow queries are of great importance to service stability. Previous studies focus on optimizing the slow queries that result from internal reasons (e.g., poorly-written SQLs). In this work, we discover a different set of slow queries which might be more hazardous to da...
Article
Full-text available
Syslog parsing is of vital importance for the detection, diagnosis and prediction of network device failures in a datacenter. A common approach to syslog parsing is to extract templates from historical syslogs, after which syslogs are matched to these templates. To address the problems in the existing syslog parsing techniques, we propose a novel f...
Conference Paper
Full-text available
Recording runtime status via logs is common for almost every computer system, and detecting anomalies in logs is crucial for timely identifying malfunctions of systems. However, manually detecting anomalies for logs is time-consuming, error-prone, and infeasible. Existing automatic log anomaly detection approaches, using indexes rather than semanti...
Article
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifi...
Conference Paper
Full-text available
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifi...
Article
Full-text available
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifi...
Article
Full-text available
Additive key performance indicators (KPIs, such as page view, revenue, error count) with multi-dimensional attributes (such as ISP, Province, DataCenter) are common and important monitoring metrics in Internet companies. When an anomaly happens to an overall KPI, it is critical but challenging to localize the root cause, which is one (or more) comb...
Article
Full-text available
As a path vector protocol, Border Gateway Protocol (BGP) messages contain an entire Autonomous System (AS) path to each destination for breaking arbitrarily long AS path loops. However, after observing the global routing data from RouteViews, we find that BGP AS Path Looping (BAPL) behavior does occur and in fact can lead to multi-AS forwarding loops...
Article
Full-text available
The detection of performance changes in software change roll-outs in Internet-based services is crucial for an operations team, because it allows timely roll-back of a software change when performance degrades unexpectedly. However, it is infeasible to manually investigate millions of performance measurements of many roll-outs. In this paper, we pr...
Conference Paper
Full-text available
The detection of performance changes in software change roll-outs in Internet-based services is crucial for an operations team, because it allows timely roll-back of a software change when performance degrades unexpectedly. However, it is infeasible to manually investigate millions of performance measurements of many roll-outs. In this paper, we pr...
Article
Full-text available
In the design and construction process of Next Generation Internet, it is important to identify the source of each IP packet forwarding accurately, especially for the support of precise fine-grained management, control, traceability and improving the trustworthiness of the Internet. This paper designed a scalable Network Identity (NID) scheme for t...
Article
Full-text available
Provider Portal for Applications (P4P) is a model aiming to incorporate peer-to-peer (P2P) applications with Internet Service Providers (ISPs) and improve the performance of both the ISPs and the P2P applications. In this study, we have analyzed the relationship between the link traffic and the P-distance, which is the core interface of P4P. In addi...
Conference Paper
Full-text available
As a path vector protocol, Border Gateway Protocol (BGP) messages contain the entire Autonomous System (AS) path to each destination for breaking arbitrarily long AS path loops. However, after observing the global routing data from RouteViews, we find that BGP AS path looping (BAPL) behavior does occur and in fact can lead to multi-AS forwarding loop...
Chapter
Full-text available
P4P (Provider Portal for Applications) is a model aiming to incorporate P2P with ISP and improve the performance of both the ISP and the P2P applications. In this study, we analyze the relationship between the link traffic and the P-distance, which is the core interface of P4P, and illustrate the disadvantage of P4P in dealing with network topology...
