Prateesh Goyal’s research while affiliated with Microsoft and other places


Publications (37)


Figure 1: m4 mimics the computational structure of flowSim but replaces its components with learnable modules.
Figure 2: m4's workflow: Inputs (yellow boxes), outputs (red boxes), intermediate components (white boxes).
Figure 3: m4 adds "dense" supervision during training by querying intermediate network states for "remaining size" and "queue length". Dashed boxes represent subsequent simulations triggered by new flow-level events.
Figure 4: m4 converts (a) a network snapshot in time to a (b) bipartite graph and uses GNN to capture spatial dynamics.
Figure 5: m4's implementation
m4: A Learned Flow-level Network Simulator
  • Preprint

March 2025 · Anton A. Zabreyko · [...] · Thomas Anderson

Flow-level simulation is widely used to model large-scale data center networks due to its scalability. Unlike packet-level simulators that model individual packets, flow-level simulators abstract traffic as continuous flows with dynamically assigned transmission rates. While this abstraction enables orders-of-magnitude speedup, it sacrifices accuracy by omitting critical packet-level effects such as queuing, congestion control, and retransmissions. We present m4, an accurate and scalable flow-level simulator that uses machine learning to learn the dynamics of the network of interest. At the core of m4 lies a novel ML architecture that decomposes state transition computations into distinct spatial and temporal components, each represented by a suitable neural network. To efficiently learn the underlying flow-level dynamics, m4 adds dense supervision signals by predicting intermediate network metrics such as remaining flow size and queue length during training. m4 achieves a speedup of up to 10⁴× over packet-level simulation. Relative to a traditional flow-level simulation, m4 reduces per-flow estimation errors by 45.3% (mean) and 53.0% (p90). For closed-loop applications, m4 accurately predicts network throughput under various congestion control schemes and workloads.
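To make the spatial component concrete, the sketch below shows one message-passing round over the flow-link bipartite graph of Figure 4. This is an illustration, not the authors' code: the graph layout, feature names, and the sum-aggregation stand in for the learned GNN modules m4 would use.

```python
# Illustrative sketch (not the authors' code): one message-passing round
# over a flow-link bipartite graph, the structure m4 uses to capture
# spatial dynamics. Features and aggregation are hypothetical; a learned
# model would replace the sums with neural network updates.

# Each flow lists the links it traverses; a flow's state might hold
# (remaining size, current rate), a link's state a queue length.
flows = {
    "f1": {"links": ["l1", "l2"], "state": [1.0, 0.5]},
    "f2": {"links": ["l2"],       "state": [2.0, 0.3]},
}
links = {"l1": [0.1], "l2": [0.4]}

def message_passing_round(flows, links):
    # Link update: aggregate the states of all flows crossing the link.
    new_links = {}
    for lid, lstate in links.items():
        incoming = [f["state"] for f in flows.values() if lid in f["links"]]
        agg = [sum(xs) for xs in zip(*incoming)] if incoming else [0.0, 0.0]
        new_links[lid] = [lstate[0] + sum(agg)]
    # Flow update: aggregate the states of links along the flow's path.
    new_flows = {}
    for fid, f in flows.items():
        agg = sum(new_links[lid][0] for lid in f["links"])
        new_flows[fid] = {"links": f["links"],
                          "state": [f["state"][0], f["state"][1] + agg]}
    return new_flows, new_links

flows2, links2 = message_passing_round(flows, links)
```

In the actual architecture, each aggregation step would be a trainable module, and the "dense" supervision of Figure 3 would attach losses to the intermediate link (queue length) and flow (remaining size) states produced by rounds like this one.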


Figure 3: SYN and SYN-ACK spraying in Machnet to establish a flow between engine #1 at the client and engine #1 at the server. This figure shows the unoptimized version where the client sprays n² SYN packets. The green SYN and SYN-ACK packets are the ones that hash to the correct RX queue.
Figure 4: This experiment measures the single message latency of each approach. DPDK performs significantly better than existing options.
Figure 6: Performance of the FASTER key-value store over Machnet and Linux TCP/IP. Machnet achieves 3.3x higher throughput and 80% lower p99 latency compared to Linux.
Figure 7: FASTER server performance with multiple threads
Least common denominator NIC model in Machnet
Fast Userspace Networking for the Rest of Us

February 2025

After a decade of research in userspace network stacks, why do new solutions remain inaccessible to most developers? We argue that this is because they ignored (1) the hardware constraints of public cloud NICs (vNICs) and (2) the flexibility required by applications. Concerning the former, state-of-the-art proposals rely on specific NIC features (e.g., flow steering, deep buffers) that are not broadly available in vNICs. As for the latter, most of these stacks enforce a restrictive execution model that does not align well with cloud application requirements. We propose a new userspace network stack, Machnet, built for public cloud VMs. Central to Machnet is a new "Least Common Denominator" model, a conceptual NIC with a minimal feature set supported by all kernel-bypass vNICs. The challenge is to build a new solution with performance comparable to existing stacks while relying only on basic features (e.g., no flow steering, no RSS reconfiguration). Machnet uses a microkernel design to provide higher flexibility in application execution compared to a library OS design; we show that microkernels' inter-process communication overhead is negligible on large cloud networks.
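A consequence of the "Least Common Denominator" model is that, without hardware flow steering, demultiplexing must happen in software: packets land on whichever RX queue the vNIC chose, and the stack routes them to applications itself. The sketch below illustrates that idea under assumed names; it is not Machnet's actual API.

```python
# Hypothetical sketch of software demultiplexing under an LCD vNIC model:
# with no hardware flow steering, the userspace stack dispatches received
# packets to per-application channels itself. Names are illustrative.
from collections import deque

app_channels = {}   # (dst_ip, dst_port) -> deque standing in for a shared-memory ring

def register_app(dst_ip, dst_port):
    """An application registers the address it listens on."""
    app_channels[(dst_ip, dst_port)] = deque()

def dispatch(packet):
    """Route a received packet to its application's channel in software."""
    key = (packet["dst_ip"], packet["dst_port"])
    ring = app_channels.get(key)
    if ring is None:
        return False        # no listener: drop the packet
    ring.append(packet)
    return True

register_app("10.0.0.1", 80)
ok = dispatch({"src_ip": "10.0.0.9", "src_port": 5555,
               "dst_ip": "10.0.0.1", "dst_port": 80, "payload": b"hi"})
```

In a microkernel design like Machnet's, a dispatcher of this kind would live in a separate process, handing packets to applications over shared-memory channels rather than via a linked-in library.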




Challenging the Need for Packet Spraying in Large-Scale Distributed Training

June 2024

Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads. In this paper, we challenge this prevailing belief and pose the question: How close can a single-path transport come to an optimal multipath transport? We demonstrate that single-path transport (from a NIC's perspective) is sufficient and can perform nearly as well as an ideal multipath transport with packet spraying, particularly in the context of distributed training in leaf-spine topologies. Our assertion is based on four key observations about workloads driven by collective communication patterns: (i) flows within a collective start almost simultaneously, (ii) flow sizes are nearly equal, (iii) the completion time of a collective is more crucial than individual flow completion times, and (iv) flows can be split upon arrival. We analytically prove that single-path transport, using minimal flow splitting (at the application layer), is equivalent to an ideal multipath transport with packet spraying in terms of maximum congestion. Our preliminary evaluations support our claims. This paper suggests an alternative agenda for developing next-generation transport protocols tailored for large-scale distributed training.
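The equivalence claim can be sanity-checked with a back-of-envelope calculation. Under the paper's assumptions (equal-size flows, equal-cost paths), splitting each flow into one subflow per path at the application layer produces the same maximum path load as spreading every byte via per-packet spraying. The snippet below is an illustration of that arithmetic, not the paper's proof.

```python
# Illustrative check (not the paper's proof): with n equal-size flows and
# k equal-cost paths, application-layer flow splitting matches ideal
# per-packet spraying in maximum path load.

def max_load_spraying(flow_sizes, k):
    # Ideal spraying spreads every byte evenly over all k paths.
    return sum(flow_sizes) / k

def max_load_split(flow_sizes, k):
    # Each flow is split into k equal subflows, one pinned to each path.
    loads = [0.0] * k
    for size in flow_sizes:
        for p in range(k):
            loads[p] += size / k
    return max(loads)

flows = [100.0] * 8   # 8 equal flows of a collective (observation ii)
k = 4                 # 4 equal-cost paths in a leaf-spine fabric
```

Here both schemes place `sum(flows) / k` bytes on the most loaded path; the interesting part of the paper's analysis is that only minimal splitting (far less than one subflow per path per flow) is needed to preserve this bound.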



Figure 1: Basic components of DBO.
Figure 7: High-level architecture of the Release Buffer. The Delivery Clock advances upon reception of new market data from the CES. Incoming trades from the MP are tagged with the Delivery Clock id and the MP's response time before being sent to the OB/CES.
Figure 8: Cloud-hosted exchanges' architectural view.
DBO: Response Time Fairness for Cloud-Hosted Financial Exchanges

March 2023

In this paper, we consider the problem of hosting financial exchanges in the cloud. Financial exchanges require predictable, equal latency to all market participants to ensure fairness for various tasks, such as high-speed trading. However, it is extremely difficult to ensure equal latency to all market participants in existing cloud deployments, for various reasons such as congestion and unequal network paths. In this paper, we address the unfairness that stems from the lack of determinism in cloud networks. We argue that predictable or bounded latency is not necessary to achieve fairness. Inspired by the use of logical clocks in distributed systems, we present Delivery Based Ordering (DBO), a new approach that ensures fairness by instead correcting for differences in latency to the participants. We evaluate DBO both in our hardware testbed and in a public cloud deployment and demonstrate that it is feasible to achieve guaranteed fairness and sub-100 microsecond latency while operating at high transaction rates.
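The core idea of correcting for latency differences can be sketched as an ordering rule: rank each trade by the delivery-clock id of the market data it reacted to (Figure 7), breaking ties by the participant's response time, rather than by raw arrival time at the exchange. The names and tuple layout below are illustrative, not DBO's actual interface.

```python
# Hedged sketch of delivery-based ordering (illustrative names, not DBO's
# actual API): trades are ordered by the delivery-clock id of the market
# data they responded to, then by the participant's own response time, so
# unequal network latencies to participants do not affect fairness.

def order_trades(trades):
    """Each trade is (participant, delivery_clock_id, response_time).
    Sorting on (delivery_clock_id, response_time) ranks the participant
    who reacted fastest to the same market data first, regardless of how
    long its path to the exchange is."""
    return sorted(trades, key=lambda t: (t[1], t[2]))

trades = [
    ("A", 7, 3.0),   # A saw tick 7, responded in 3.0 time units
    ("B", 7, 1.5),   # B saw tick 7 and responded faster: ranked first
    ("A", 6, 9.0),   # a reaction to an earlier tick precedes both
]
ordered = order_trades(trades)
```

The Release Buffer of Figure 7 plays the role of the sort here: it holds incoming trades, tagged with delivery-clock id and response time, until they can be released to the OB/CES in this corrected order.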





Citations (21)


... Packet-level simulators [35,63,71] are popular in networking research, but face significant scalability challenges for modeling large-scale data center networks. Recent work improves the scalability of packet-level simulation using machine learning [43,74,75], approximation techniques [76], and better parallelization [27]. However, these approaches have some key limitations. ...

Reference:

m4: A Learned Flow-level Network Simulator
m3: Accurate Flow-Level Performance Estimation using Machine Learning
  • Citing Conference Paper
  • August 2024

... There are many applications in modern datacenters which are implemented effectively using distributed systems. Some applications benefit from hard guarantees about the state at distinct nodes, such as databases [1,2,3] and financial exchanges [4], where the system must ensure that all nodes can reason correctly about the order of transactions. Another important feature in applications is predictable latency and resource requirements, important in robotics [5] and in large-scale numerical computations such as machine-learning training and inference [6]. ...

DBO: Fairness for Cloud-Hosted Financial Exchanges
  • Citing Conference Paper
  • September 2023

... While the throughput of a datacenter topology is interesting from a theory standpoint, the vast majority of the literature focuses on practically achieving the ideal throughput of a topology: for instance, congestion control [37, 62-69], buffer management [70-75], scheduling [76-78], and load-balancing [79-83]. In fact, the underlying protocols can turn out to be the key enablers (or limiters) of system performance in the datacenter [65]. ...

Annulus: A Dual Congestion Control Loop for Datacenter and WAN Traffic Aggregates
  • Citing Article
  • July 2020

... L4S queue selection with SwiftQueue: Current L4S queue selection involves putting all packets from a given L4S flow into the same queue. Yet, recall that this approach can be limiting; recent works [26, 62, 93] have developed a centralized TCP CC plane, which views multiple data paths and integrates a central feedback loop. Several of these already have Linux kernel integration. ...

Elasticity detection: a building block for internet congestion control
  • Citing Conference Paper
  • August 2022

... For a video V of length L, let c_k denote the k-th chunk at bitrate r, where r ∈ {r_1, r_2, ..., r_m}, and let R_k denote the time spent rebuffering. Then, according to the video streaming literature [21], [22], the QoE observed by a client for the k-th chunk is calculated as follows: ...

End-to-end transport for video QoE fairness
  • Citing Article
  • August 2019

... In [311], a LAN is set up in which one sender is connected to an access point with a wired Ethernet link and one receiver is connected to the same access point with a WiFi link. Similar single-user and multi-user LAN scenarios are used in [312]. The link qualities are varied purposefully by periodically changing the MCS index of the WiFi access point. ...

ABC: A simple explicit congestion controller for wireless networks
  • Citing Article
  • February 2020

... However, kernel modules normally are limited to integer arithmetic so the floating-point operations employed in MODRL/P-MODRL will need to be emulated using integer arithmetic (e.g., [49]). Alternatively, the recently proposed Congestion Control Plane (CCP) [50] approach enables part of the transport codes to run in user space which supports floating point calculations. Therefore, we adopt CCP to demonstrate the potential performance gains of MODRL/P-MODRL on Linux TCP. ...

Restructuring endpoint congestion control
  • Citing Article
  • August 2018

... More recent approaches exploit the increasing programmability of network devices [4] to obtain yet more information, in particular from packet payloads. However, sending raw packets and statistics from hundreds or thousands of switches to a collector can quickly congest network links and overwhelm the collector, even if implemented in a streaming fashion, as in Marple [5], Sonata [6], or Newton [7]. While some of these approaches aim to process raw statistics locally to switches in order to alleviate the bottleneck introduced by a logically centralized collector, they can only capture simple monitoring tasks since their statefulness is confined to aggregates (e.g., minimum, maximum, average, count). ...

Language-Directed Hardware Design for Network Performance Monitoring
  • Citing Article
  • August 2017