November 2024 · 9 Reads
October 2024 · 4 Reads
Many problems that cloud operators solve are computationally expensive, so operators often use heuristic algorithms, which are faster and scale better than optimal ones, to solve them more efficiently. Heuristic analyzers enable operators to find when and by how much their heuristics underperform. However, these tools do not provide enough detail for operators to mitigate the heuristic's impact in practice: they discover only a single input instance that causes the heuristic to underperform (not the full set), and they do not explain why it underperforms. We propose XPlain, a tool that extends these analyzers and helps operators understand when and why their heuristics underperform. We present promising initial results that show such an extension is viable.
August 2024 · 7 Reads · 8 Citations
June 2024 · 10 Reads · 2 Citations
Proceedings of the ACM on Programming Languages
Configuration errors remain a major cause of system failures and service outages. One promising approach to identify configuration errors automatically is to learn common usage patterns (and anti-patterns) using data-driven methods. However, existing data-driven learning approaches analyze only simple configurations (e.g., those with no hierarchical structure), identify only simple types of issues (e.g., type errors), or require extensive domain-specific tuning. In this paper, we present Diffy, the first push-button configuration analyzer that detects likely bugs in structured configurations. From example configurations, Diffy learns a common template, with "holes" that capture their variation. It then applies unsupervised learning to identify anomalous template parameters as likely bugs. We evaluate Diffy on a large cloud provider's wide-area network, an operational 5G network testbed, and MySQL configurations, demonstrating its versatility, performance, and accuracy. During Diffy's development, it caught and prevented a bug in a configuration timer value that had previously caused an outage for the cloud provider.
November 2023 · 8 Reads · 8 Citations
May 2023 · 33 Reads · 1 Citation
We show that communication schedulers recently proposed for ML collectives do not scale to the increasing problem sizes that arise from training larger models. These schedulers also often produce suboptimal schedules. We make a connection with similar problems in traffic engineering and propose a new method, TECCL, that finds better quality schedules (e.g., ones that finish collectives faster and/or send fewer bytes) and does so more quickly on larger topologies. We present results on many different GPU topologies that show substantial improvement over the state-of-the-art.
November 2021 · 31 Reads · 1 Citation
August 2021 · 21 Reads · 19 Citations
July 2020 · 17 Reads · 24 Citations
... Classical algorithms such as FanOut [7] and Spread-Out [33] are simple approaches to scheduling that take 2.5-91× longer to complete All-to-All transfers than a theoretical optimum (§6). A modern line of research (such as TE-CCL [29] and TACCL [48]) uses computationally complex algorithms to compute optimized schedules; the transfers they compute for balanced workloads (i.e., those without skew) complete near-optimally. However, the time to compute one schedule ranges from minutes to days, all for an All-to-All transfer, which itself completes in milliseconds! ...
August 2024
... Model checkers [1,3,4,14,18,36,39,43,54,57] model a network's routing and forwarding behaviors based on protocol semantics and device configurations, and check whether engineer-specified reachability and resilience policies are satisfied. Consistency checkers [13,19,20,22,23,24,45] compare configurations within and across devices and flag inconsistencies and deviations from best practices. LLM-based Q&A tools [7,11,29,32,50,51] parse configuration files and query pre-trained sequential transformer models through prompts to detect syntax and subtle semantic issues. ...
June 2024
Proceedings of the ACM on Programming Languages
... Second, the recent rise of Large Language Models (LLMs) has provided increasingly sophisticated automatic coding and code understanding tools, and ways for operators to interact with their system at a higher level of abstraction. This has the potential to reduce the significant manual effort usually required by operators for tasks such as incident detection [38], incident management and mitigation [3,17,19,23,39], and root cause analysis [3,6,9,35,42,43]. ...
Reference:
Intent-based System Design and Operation
November 2023
... However, this approach faces two major challenges: first, the need for real-time collection of network-wide demand information to compute a schedule of switch configurations and the long reconfiguration times of commercial OCSes result in significant overhead when frequently reconfiguring networks at scale [33]; second, the collected traffic may not accurately predict future communication demands, leading to potential mismatches between switch configurations and actual application requirements. Current CCLs [21,49,50,51] for model training are primarily designed for static topologies, whereas in reconfigurable networks, the topology may continuously change. To address this, we propose a novel strategy that reconfigures the communication patterns of collective algorithms to adapt to dynamically changing network topologies, thereby optimizing overall performance. ...
May 2023
... B is expected to have enough expertise to solve the task, but also could be the adversary that performs malicious operations. For the rest of the paper, we focus on tasks related to router configuration troubleshooting (such as OSPF, BGP, and filter configurations) due to their high frequency (Table I) and significance [72], [44], [55], [35], [36]. Our primary focus is ensuring the network's correctness, specifically maintaining the reachability between network devices. ...
August 2021
... The state of these heuristic analyzers today is reminiscent of the early days of our community's exploration of network verifiers and their potential to help network operators configure and manage their networks. In the same way that network verifiers enabled operators to identify bugs in their configurations [10,14,15,19,22,24,28,29,31,33,40,48,50], a heuristic analyzer can help them find the performance gap of the algorithms they deploy. Tools that allow operators to leverage heuristic analyzers more easily, identify why the heuristics underperform, and devise solutions to remediate the issue serve a similar purpose to the tools our community crafted that explained the impact of configuration bugs [23,25,40,41] (by producing all sets of packets that the bug impacted and the configuration lines that caused the impact). ...
Reference:
Towards Safer Heuristics With XPlain
July 2020