Siva Kesava Reddy Kakarla’s research while affiliated with Microsoft and other places


Publications (9)


End-to-End Performance Analysis of Learning-enabled Systems
  • Conference Paper

November 2024 · 9 Reads

Pooria Namyar · Michael Schapira · Ramesh Govindan · [...]


Figure 1: Example heuristics and their encoding in MetaOpt (sub-figures (b) and (c)). The heuristic in sub-figure (b) pins demands below a threshold and then solves a flow maximization problem; the heuristic in sub-figure (c) assigns each ball to the first bin that can fit it.
Figure 5: The adversarial subspace generator: (a) finds a rough subspace and separates bad samples from good ones; (b) trains a regression tree on these samples and uses it to refine the subspace, producing (c). We show the first subspace (D_0) for our FF example in (c). Here, C_ij encodes the rough subspace, and T_i and V_i encode the path in the regression tree.
Towards Safer Heuristics With XPlain
  • Preprint
  • File available

October 2024 · 4 Reads

Many problems that cloud operators solve are computationally expensive, and operators often use heuristic algorithms (that are faster and scale better than optimal) to solve them more efficiently. Heuristic analyzers enable operators to find when and by how much their heuristics underperform. However, these tools do not provide enough detail for operators to mitigate the heuristic's impact in practice: they only discover a single input instance that causes the heuristic to underperform (and not the full set), and they do not explain why. We propose XPlain, a tool that extends these analyzers and helps operators understand when and why their heuristics underperform. We present promising initial results that show such an extension is viable.
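
The notion of a heuristic "underperforming" can be illustrated on first-fit bin packing. The sketch below is not XPlain or MetaOpt; it is a toy comparison, under the assumption of integer ball sizes and a single bin capacity, between a first-fit heuristic and an exhaustive optimal search. The function names `first_fit` and `optimal_bins` are hypothetical and chosen only for this illustration.

```python
def first_fit(balls, capacity):
    """Place each ball in the first bin with room; return the number of bins used."""
    bins = []
    for ball in balls:
        for i, load in enumerate(bins):
            if load + ball <= capacity:
                bins[i] += ball
                break
        else:
            # No existing bin fits this ball, so open a new one.
            bins.append(ball)
    return len(bins)


def optimal_bins(balls, capacity):
    """Exhaustive search for the minimum number of bins (exponential; toy inputs only)."""
    best = len(balls)  # one bin per ball is always feasible if each ball fits a bin

    def place(idx, bins):
        nonlocal best
        if len(bins) >= best:
            return  # cannot improve on the best solution found so far
        if idx == len(balls):
            best = len(bins)
            return
        ball = balls[idx]
        for i in range(len(bins)):
            if bins[i] + ball <= capacity:
                bins[i] += ball
                place(idx + 1, bins)
                bins[i] -= ball
        bins.append(ball)
        place(idx + 1, bins)
        bins.pop()

    place(0, [])
    return best


balls, capacity = [4, 4, 6, 6], 10
gap = first_fit(balls, capacity) - optimal_bins(balls, capacity)
print(gap)
```

On this input first-fit packs the two small balls together and then needs a separate bin for each large ball, while the optimal packing pairs each small ball with a large one. A heuristic analyzer searches for such inputs automatically rather than by hand.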


Diffy: Data-Driven Bug Finding for Configurations

June 2024 · 10 Reads · 2 Citations

Proceedings of the ACM on Programming Languages

Configuration errors remain a major cause of system failures and service outages. One promising approach to identify configuration errors automatically is to learn common usage patterns (and anti-patterns) using data-driven methods. However, existing data-driven learning approaches analyze only simple configurations (e.g., those with no hierarchical structure), identify only simple types of issues (e.g., type errors), or require extensive domain-specific tuning. In this paper, we present Diffy, the first push-button configuration analyzer that detects likely bugs in structured configurations. From example configurations, Diffy learns a common template, with "holes" that capture their variation. It then applies unsupervised learning to identify anomalous template parameters as likely bugs. We evaluate Diffy on a large cloud provider's wide-area network, an operational 5G network testbed, and MySQL configurations, demonstrating its versatility, performance, and accuracy. During Diffy's development, it caught and prevented a bug in a configuration timer value that had previously caused an outage for the cloud provider.
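
The idea of a template with "holes" followed by anomaly detection can be sketched in a few lines. This is not Diffy's algorithm: it is a minimal illustration that assumes flat, whitespace-tokenized configurations of equal length and numeric hole values, and it uses a median-absolute-deviation rule as a stand-in for the unsupervised learning step. All names (`learn_template`, `hole_values`, `anomalies`) and the sample configurations are hypothetical.

```python
from statistics import median


def learn_template(configs):
    """Keep tokens shared by every config; mark positions that vary as holes (None)."""
    rows = [c.split() for c in configs]
    assert len({len(r) for r in rows}) == 1, "toy sketch assumes equal-length configs"
    return [col[0] if len(set(col)) == 1 else None for col in zip(*rows)]


def hole_values(configs, template):
    """For each hole position, collect the value every config puts there."""
    rows = [c.split() for c in configs]
    return {i: [r[i] for r in rows] for i, t in enumerate(template) if t is None}


def anomalies(values, threshold=5.0):
    """Return indices of numeric values far from the median, in MAD units."""
    nums = [float(v) for v in values]
    med = median(nums)
    mad = median(abs(x - med) for x in nums) or 1.0  # avoid division by zero
    return [i for i, x in enumerate(nums) if abs(x - med) / mad > threshold]


# Hypothetical example configurations; the last one has a suspicious timer value.
configs = [
    "timer keepalive 30",
    "timer keepalive 30",
    "timer keepalive 31",
    "timer keepalive 29",
    "timer keepalive 3000",
]
template = learn_template(configs)
flagged = {pos: anomalies(vals) for pos, vals in hole_values(configs, template).items()}
print(template, flagged)
```

Here the only hole is the timer value, and the configuration at index 4 is flagged because its value lies far from the others. Real structured configurations are hierarchical and would need a much richer template language than token positions.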



Comparing the transfer time from SCCL least-steps with TE-CCL (K = 10 and chunk size = 25 KB). TE-CCL can better pipeline chunks and so pays less α cost with larger transfers.
Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

May 2023 · 33 Reads · 1 Citation

We show that the communication schedulers proposed in recent work for ML collectives do not scale to the increasing problem sizes that arise from training larger models. These works also often produce suboptimal schedules. We make a connection with similar problems in traffic engineering and propose a new method, TE-CCL, that finds better quality schedules (e.g., ones that finish collectives faster and/or send fewer bytes) and does so more quickly on larger topologies. We present results on many different GPU topologies that show substantial improvement over the state-of-the-art.




Citations (6)


... Classical algorithms such as FanOut [7] and Spread-Out [33] are simple approaches to scheduling that take 2.5-91× longer to complete All-to-All transfers than a theoretical optimum ( §6). A modern line of research (such as TE-CCL [29] and TACCL [48]) uses computationally complex algorithms to compute optimized schedules; the transfers they compute for balanced workloads (i.e., those without skew) complete near-optimally. However, the time to compute one schedule ranges from minutes to days, all for an All-to-All transfer, which itself completes in milliseconds! ...

Reference:

FLASH: Fast All-to-All Communication in GPU Clusters
Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
  • Citing Conference Paper
  • August 2024

... Model checkers [1,3,4,14,18,36,39,43,54,57] model a network's routing and forwarding behaviors based on protocol semantics and device configurations, and check whether engineer-specified reachability and resilience policies are satisfied. Consistency checkers [13,19,20,22,23,24,45] compare configurations within and across devices and flag inconsistencies and deviations from best practices. LLM-based Q&A tools [7,11,29,32,50,51] parse configuration files and query pre-trained sequential transformer models through prompts to detect syntax and subtle semantic issues. ...

Diffy: Data-Driven Bug Finding for Configurations
  • Citing Article
  • June 2024

Proceedings of the ACM on Programming Languages

... Second, the recent rise of Large Language Models (LLMs) has provided increasingly sophisticated automatic coding and code understanding tools, and ways for operators to interact with their system at a higher level of abstraction. This has the potential to reduce the significant manual effort usually required by operators for tasks such as incident detection [38], incident management and mitigation [3,17,19,23,39], and root cause analysis [3,6,9,35,42,43]. ...

A Holistic View of AI-driven Network Incident Management
  • Citing Conference Paper
  • November 2023

... However, this approach faces two major challenges: first, the need for real-time collection of network-wide demand information to compute a schedule of switch configurations and the long reconfiguration times of commercial OCSes result in significant overhead when frequently reconfiguring networks at scale [33]; second, the collected traffic may not accurately predict future communication demands, leading to potential mismatches between switch configurations and actual application requirements. Current CCLs [21,49,50,51] for model training are primarily designed for static topologies, whereas in reconfigurable networks, the topology may continuously change. To address this, we propose a novel strategy that reconfigures the communication patterns of collective algorithms to adapt to dynamically changing network topologies, thereby optimizing overall performance. ...

Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

... B is expected to have enough expertise to solve the task, but also could be the adversary that performs malicious operations. For the rest of the paper, we focus on tasks related to router configuration troubleshooting (such as OSPF, BGP, and filter configurations) due to their high frequency (Table I) and significance [72], [44], [55], [35], [36]. Our primary focus is ensuring the network's correctness, specifically maintaining the reachability between network devices. ...

Campion: debugging router configuration differences
  • Citing Conference Paper
  • August 2021

... The state of these heuristic analyzers today is reminiscent of the early days of our community's exploration of network verifiers and their potential to help network operators configure and manage their networks. In the same way that network verifiers enabled operators to identify bugs in their configurations [10,14,15,19,22,24,28,29,31,33,40,48,50], a heuristic analyzer can help them find the performance gap of the algorithms they deploy. Tools that allow operators to leverage heuristic analyzers more easily, identify why the heuristics underperform, and devise solutions to remediate the issue serve a similar purpose to the tools our community crafted that explained the impact of configuration bugs [23,25,40,41] (by producing all sets of packets that the bug impacted and the configuration lines that caused the impact). ...

GRooT: Proactive Verification of DNS Configurations
  • Citing Conference Paper
  • July 2020