
Shenglin ZhangNankai University | NKU · College of Software
Shenglin Zhang
PhD
About
80
Publications
32,172
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
2,124
Citations
Introduction
Shenglin Zhang is an associate professor at the College of Software, Nankai University. His research interests focus on AIOps, including anomaly detection, failure diagnosis, root cause analysis, failure prediction, etc., for software/network service management. He has published 30+ papers in international conferences, including ATC, WWW, VLDB, SIGMETRICS, CoNEXT, INFOCOM, IJCAI, ISSRE, IWQOS, etc., and peer-reviewed journals, including IEEE TC/TSC/TNSM, etc.
Skills and Expertise
Additional affiliations
July 2012 - July 2017
Education
September 2008 - June 2012
Publications
Publications (80)
Widely adopted for their scalability and flexibility, modern microservice systems present unique failure diagnosis challenges due to their independent deployment and dynamic interactions. This complexity can lead to cascading failures that negatively impact operational efficiency and user experience. Recognizing the critical role of fault diagnosis...
With the booming of large-scale network devices, anomaly detection on multivariate time series (MTS), such as a combination of CPU utilization, average response time, and network packet loss, is important for system reliability. Although a collection of learning-based approaches have been designed for this purpose, our study shows that these approa...
Microservices improve the scalability and flexibility of monolithic architectures to accommodate the evolution of software systems, but the complexity and dynamics of microservices challenge system reliability. Ensuring microservice quality requires efficient failure diagnosis, including detection and triage. Failure detection involves identifying...
Automatic log analysis is essential for the efficient Operation and Maintenance (O&M) of software systems, providing critical insights into system behaviors. However, existing approaches mostly treat log analysis as training a model to perform an isolated task, using task-specific log-label pairs. These task-based approaches are inflexible in gener...
Accurate and efficient localization of root cause instances in large-scale microservice systems is of paramount importance. Unfortunately, prevailing methods face several limitations. Notably, some recent methods rely on supervised learning which necessitates a substantial amount of labeled data. However, labeling root cause instances is time-consu...
The availability of microservice systems is critical to business operations and corporate reputation. However, the dynamics and complexity of microservice systems introduce significant challenges to the performance issue diagnosis of large-scale microservice systems. After investigating hundreds of real-world performance issue cases in Tencent, we...
Large language models (LLMs) excel at general question-answering (Q&A) but often fall short in specialized domains due to a lack of domain-specific knowledge. Commercial companies face the dual challenges of privacy protection and resource constraints when involving LLMs for fine-tuning. This paper propose a novel framework, Self-Evolution, designe...
AIOps algorithms play a crucial role in the maintenance of microservice systems. Many previous benchmarks' performance leaderboard provides valuable guidance for selecting appropriate algorithms. However, existing AIOps benchmarks mainly utilize offline datasets to evaluate algorithms. They cannot consistently evaluate the performance of algorithms...
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure diagnosis methods rely solely on single-modal data (ie using either metrics, logs, or traces). In this study, we conduct an empirical study using real-world failure cases to show that combining these sources of data (multimodal data) leads to a more accur...
To ensure the performance of large-scale datacenters, operators need to monitor up to tens of millions of various-type KPIs, e.g., CPU utilization, memory utilization. For each KPI, it is crucial but challenging to detect outliers that deviate from its historical patterns or the patterns of other KPIs in the same period. In this work, we propose
O...
We propose LogSummary, an automatic, unsupervised end-to-end log summarization framework for software system maintenance in this work. LogSummary obtains the summarized triples of necessary logs for a given log sequence. It integrates a novel information extraction method that considers semantic information and domain knowledge with a new triple-ra...
Logs are one of the most valuable data to describe the running state of services. Failure diagnosis through logs is crucial for service reliability and security. The current automatic log failure diagnosis methods cannot fully use the multiple fields of logs, which fail to capture the relation between them. In this paper, we propose LogKG, a new fr...
Internet-based services have seen remarkable success, generating vast amounts of monitored key performance indicators (KPIs) as univariate or multivariate time series. Monitoring and analyzing these time series are crucial for researchers, service operators, and on-call engineers to detect outliers or anomalies indicating service failures or signif...
Proactive failure detection of instances is vitally essential to microservice systems because an instance failure can propagate to the whole system and degrade the system's performance. Over the years, many single-modal (i.e., metrics, logs, or traces) data-based nomaly detection methods have been proposed. However, they tend to miss a large number...
Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time cloud computing applications and services hosted on the cloud are affected by a cloud outage, users can experience slow response times, connection issues or total service disruption, resulting in a significant negative business impact....
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces). In this study, we conduct an empirical study using real-world failure cases to show that combining these sources of data (multimodal data) leads to a more ac...
Abstract The reliability of wireless base stations is essential to guarantee the user experiences in wireless networks, thereby employing the anomaly detection on multivariate time series is indispensable for network operators to monitor the behaviours of large‐scale wireless base stations. In this paper, a general unsupervised anomaly detection mo...
Recently, AIOps (Artificial Intelligence for IT Operations) has been well studied in academia and industry to enable automated and effective software service management. Plenty of efforts have been dedicated to AIOps, including anomaly detection, root cause localization, incident management, etc. However, most existing works are evaluated on privat...
Timely anomaly detection of key performance indicators (KPIs),
e.g.
, service response time, error rate, is of utmost importance to Web services. Over the years, many unsupervised deep learning-based anomaly detection approaches have been proposed. To achieve good performance, they require a long period of KPI data for model training, which is no...
Detecting malicious non-existent domain names (NXDomains) in a real-time manner is vitally important to the security of large-scale dependable systems. Existing detection methods are trained based on the assumption that the NXDomains, which cannot be recognized by the domain generation algorithm (DGA) archive, are benign. However, new types of mali...
The reliability of wireless base stations in China Mobile is of vital importance, because the cell phone users are connected to the stations and the behaviors of the stations are directly related to user experience. Although the monitoring of the station behaviors can be realized by anomaly detection on multivariate time series, due to complex corr...
Anomaly clue localization of multi-dimensional derived measure is vitally important for the reliability of online video services. In this paper, we propose RobustSpot, an end-to-end framework for localizing the clues to anomalous multi-dimensional derived measures. RobustSpot integrates two novel indicators, i.e., “Anomaly Degree” and “Contribution...
UniLog: Deploy One Model and Specialize it for All Log Analysis Tasks
Today's large datacenters house a massive number of machines, each of which is being closely monitored with multivariate time series (e.g., CPU idle, memory utilization) to ensure service quality. Detecting outlier machine instances with multivariate time series is crucial for service management. However, it is a challenging task due to the multipl...
Logs are imperative in the management process of networks and services. However, manually identifying and classifying anomalous logs is time-consuming, error-prone, and labor-intensive. Additionally, rule-based approaches cannot tackle the challenges underlying anomalous log identification and classification resulting from new types of logs and par...
Logs are one of the most valuable data sources for managing large-scale online services. After a failure is detected/diagnosed/predicted, operators still have to inspect the raw logs to gain a summarized view before take actions. However, manual or rule-based log summarization has become inefficient and ineffective. In this work, we propose LogSumm...
Logs are one of the most valuable data sources for large-scale service (e.g., social network, search engine) maintenance. Log parsing serves as the the first step towards automated log analysis. However, the current log parsing methods are not adaptive. Without intra-service adaptiveness, log parsing cannot handle software/firmware upgrade because...
With the growing market of cloud databases, careful detection and elimination of slow queries are of great importance to service stability. Previous studies focus on optimizing the slow queries that result from internal reasons (e.g., poorly-written SQLs). In this work, we discover a different set of slow queries which might be more hazardous to da...
Syslog parsing is of vital importance for the detection, diagnosis and prediction of network device failures in a datacenter. A common approach to syslog parsing is to extract templates from historical syslogs, after which syslogs are matched to these templates. To address the problems in the existing syslog parsing techniques, we propose a novel f...
Recording runtime status via logs is common for almost every computer system, and detecting anomalies in logs is crucial for timely identifying malfunctions of systems. However, manually detecting anomalies for logs is time-consuming, error-prone, and infeasible. Existing automatic log anomaly detection approaches, using indexes rather than semanti...
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifi...
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifi...
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifi...
In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach by predicting failures, thus the operators can intervene and "fix" the potential failures before they happen. Specifi...
Additive key performance indicators (KPIs, such as page view, revenue, error count) with multi-dimensional attributes (such as ISP, Province, DataCenter) are common and important monitoring metrics in Internet companies. When an anomaly happens to an overall KPI, it is critical but challenging to localize the root cause, which is one (or more) comb...
As a path vector protocol, Border Gateway Protocol (BGP) messages contain an entire Autonomous System (AS) path to each destination for breaking arbitrary long AS path loops. However, after observing the global routing data from RouteViews, we find that BGP AS Path Looping (BAPL) behavior does occur and in fact can lead to multi-AS forwarding loops...
The detection of performance changes in software change roll-outs in Internet-based services is crucial for an operations team, because it allows timely roll-back of a software change when performance degrades unexpectedly. However, it is infeasible to manually investigate millions of performance measurements of many roll-outs. In this paper, we pr...
The detection of performance changes in software change roll-outs in Internet-based services is crucial for an operations team, because it allows timely roll-back of a software change when performance degrades unexpectedly. However, it is infeasible to manually investigate millions of performance measurements of many roll-outs.
In this paper, we pr...
In the design and construction process of Next Generation Internet, it is important to identify the source of each IP packet forwarding accurately, especially for the support of precise fine-grained management, control, traceability and improving the trustworthiness of the Internet. This paper designed a scalable Network Identity (NID) scheme for t...
Provider Portal for Applications (P4P) is a model aiming to incorporate (peer to peer) P2P applications with Internet Service Providers (ISPs) and improve the performance of the both ISP and the P2P applications. In this study, we have analyzed the relationship between the link traffic and the P-distance, which is the core interface of P4P. In addi...
As a path vector protocol, Border Gateway Protocol (BGP) messages contain the entire Autonomous System (AS) path to each destination for breaking arbitrary long AS path loops. However, after observing the global routing data from RouteViews, we find that BGP AS path looping (BAPL) behavior does occur and in fact can lead to multi-AS forwarding loop...
P4P (Provider Portal for Applications) is a model aiming to incorporate P2P with ISP and improve the performance of both the ISP and the P2P applications. In this study, we analyze the relationship between the link traffic and the P-distance, which is the core interface of P4P, and illustrate the disadvantage of P4P in dealing with network topology...
Network
Cited