Matthew Caesar

University of Illinois, Urbana-Champaign, Urbana, IL, USA

Publications (20)

  • Article: Finishing Flows Quickly with Preemptive Scheduling
    ABSTRACT: Today's data centers face extreme challenges in providing low latency. However, fair sharing, a principle commonly adopted in current congestion control protocols, is far from optimal for satisfying latency requirements. We propose Preemptive Distributed Quick (PDQ) flow scheduling, a protocol designed to complete flows quickly and meet flow deadlines. PDQ enables flow preemption to approximate a range of scheduling disciplines. For example, PDQ can emulate a shortest job first algorithm to give priority to the short flows by pausing the contending flows. PDQ borrows ideas from centralized scheduling disciplines and implements them in a fully distributed manner, making it scalable to today's data centers. Further, we develop a multipath version of PDQ to exploit path diversity. Through extensive packet-level and flow-level simulation, we demonstrate that PDQ significantly outperforms TCP, RCP and D3 in data center environments. We further show that PDQ is stable, resilient to packet loss, and preserves nearly all its performance gains even given inaccurate flow information.
    06/2012;
  • Article: Shortest Paths in Less Than a Millisecond
    ABSTRACT: We consider the problem of answering point-to-point shortest path queries on massive social networks. The goal is to answer queries within tens of milliseconds while minimizing the memory requirements. We present a technique that achieves this goal for an extremely large fraction of path queries by exploiting the structure of the social networks. Using evaluations on real-world datasets, we argue that our technique offers a unique trade-off between latency, memory and accuracy. For instance, for the LiveJournal social network (roughly 5 million nodes and 69 million edges), our technique can answer 99.9% of the queries in less than a millisecond. In comparison to storing all pair shortest paths, our technique requires at least 550x less memory; the average query time is roughly 365 microseconds --- 430x faster than the state-of-the-art shortest path algorithm. Furthermore, the relative performance of our technique improves with the size (and density) of the network. For the Orkut social network (3 million nodes and 220 million edges), for instance, our technique is roughly 2588x faster than the state-of-the-art algorithm for computing shortest paths.
    06/2012;
  • Article: Slick Packets
    ABSTRACT: Source-controlled routing has been proposed as a way to improve flexibility of future network architectures, as well as simplifying the data plane. However, if a packet specifies its path, this precludes fast local re-routing within the network. We propose SlickPackets, a novel solution that allows packets to slip around failures by specifying alternate paths in their headers, in the form of compactly-encoded directed acyclic graphs. We show that this can be accomplished with reasonably small packet headers for real network topologies, and results in responsiveness to failures that is competitive with past approaches that require much more state within the network. Our approach thus enables fast failure response while preserving the benefits of source-controlled routing.
    01/2012;
  • Conference Proceeding: Debugging the data plane with anteater.
    Proceedings of the ACM SIGCOMM 2011 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Toronto, ON, Canada, August 15-19, 2011; 01/2011
  • Conference Proceeding: Guaranteeing BGP Stability with a Few Extra Paths.
    2010 International Conference on Distributed Computing Systems, ICDCS 2010, Genova, Italy, June 21-25, 2010; 01/2010
  • Article: Dynamic route recomputation considered harmful.
    Computer Communication Review. 01/2010; 40:66-71.
  • Conference Proceeding: Achieving convergence-free routing using failure-carrying packets.
    Proceedings of the ACM SIGCOMM 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Kyoto, Japan, August 27-31, 2007; 01/2007
  • Conference Proceeding: ROFL: routing on flat labels.
    Proceedings of the ACM SIGCOMM 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Pisa, Italy, September 11-15, 2006; 01/2006
  • Conference Proceeding: HLP: a next generation inter-domain routing protocol.
    Proceedings of the ACM SIGCOMM 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Philadelphia, Pennsylvania, USA, August 22-26, 2005; 01/2005
  • Article: The SAHARA Model for Service Composition Across Multiple Providers
    ABSTRACT: Services are capabilities that enable applications and are of crucial importance to pervasive computing in next-generation networks. Service Composition is the construction of complex services from primitive ones; thus enabling rapid and flexible creation of new services. The presence of multiple independent service providers poses new and significant challenges. Managing trust across providers and verifying the performance of the components in composition become essential issues. Adapting the composed service to network and user dynamics by choosing service providers and instances is yet another challenge. In SAHARA, we are developing a comprehensive architecture for the creation, placement, and management of services for composition across independent providers. In this paper, we present a layered reference model for composition based on a classification of different kinds of composition. We then discuss the different overarching mechanisms necessary for the successful deployment of such an architecture through a variety of case-studies involving composition.
    06/2002;
  • Conference Proceeding: The SAHARA Model for Service Composition across Multiple Providers.
    Pervasive Computing, First International Conference, Pervasive 2002, Zürich, Switzerland, August 26-28, 2002, Proceedings; 01/2002
  • Article: Stabilizing Route Selection in BGP
    ABSTRACT: Route instability is an important contributor to data plane unreliability on the Internet, and also incurs load on the control plane of routers. In this paper, we study how route selection schemes can avoid these changes in routes. Specifically, we characterize the tradeoffs between interruption rate, our measure of stability; availability of routes; and deviation from the network operator's preferred routes. We develop algorithms to lower bound the feasible points in the tradeoff spaces between these three cost metrics. We also propose a new approach, Stable Route Selection (SRS), which uses flexibility in route selection to improve stability without sacrificing availability, and with a controlled amount of deviation. Our large-scale simulation results show that SRS can significantly improve stability while deviating only a small amount from preferred routes. We implement our protocol in a software router, Quagga, and confirm in cluster deployment that SRS's gains in route stability translate to improved reliability in the data plane. Finally, we evaluate SRS under direct feeds of route update traffic from Internet routers. In this case, we observe less improvement, but SRS can still improve stability when multiple disjoint paths are available.
  • Article: Towards a next generation inter-domain routing protocol
  • Article: Stabilizing BGP, Safely
    ABSTRACT: Route instability incurs significant load on core routers and is widely recognized as a major contributor to data plane unreliability on the Internet. Route flap damping provides some protection against instability, but introduces pathologies and reduces availability. Concerns about the scalability of the routing system and the increasing prevalence of real-time applications have prompted a renewed interest in stability. We believe it is time for a principled approach to stabilizing Internet routing. This paper takes a step towards that goal by characterizing the tradeoff between stability and availability, and between stability and deviation from preferred routes. We propose a new approach, Stable Route Selection (SRS), which uses flexibility in route selection to improve stability without sacrificing availability. Our extensive simulation and experimental results show that SRS improves stability and data plane reliability while deviating only a small amount from preferred routes.
  • Article: The Case for an Internet Health Monitoring System
    ABSTRACT: Internet routing is plagued with several problems today, including chronic instabilities, convergence problems, and misconfiguration of routers. We believe that a first step towards making the Internet robust to these problems is to develop a systematic methodology for analyzing routing changes and inferring why they happen and where they originate. In this paper, we motivate the need as well as describe the design of an Internet health monitoring system that identifies the source of routing instabilities purely by passively observing routing updates from different vantage points. We believe such a system could be used to continuously infer the state of the network. Such inferences may then be used offline for network performance monitoring and troubleshooting, or online to improve path selection and damping of instability.
  • Article: Towards Root Cause Analysis of Internet Routing Dynamics
    Matthew Caesar, L Subramanian, Randy H Katz
    ABSTRACT: The lack of a good understanding of the dynamics of interdomain routing has made efforts to address BGP's shortcomings a black art. To gain more insight into these dynamics, we need to answer two questions: What is the cause of a routing change? Where does a routing change originate? This paper proposes the design of a BGP health inferencing system that answers these questions by observing routing updates from multiple vantage points and inferring the type and location of an event that triggers a routing change. To build such a system, we solve two basic problems: (a) classify route updates into groups of correlated routing changes where all route updates in a group are triggered by the same event, (b) given the set of routing changes for an event, determine the location and the cause of the event. By analyzing route updates from Routeviews and RIPE for over ¢ ¤ £ months, we found that our approach can pinpoint the location where an update is triggered to a single inter-AS link in over 70% of observed updates. We found that the majority of updates are caused by a relatively small number of unstable links. In addition, 25% of prefixes are persistently unstable, causing 20% of all updates observed. Routes through the Internet core usually reconverge quickly after events, while an event taking place at the network edge is 9 times more likely to cause a long-term route change. We validated our approach by showing it can detect a variety of well-known events, namely: (a) session resets recorded in the NANOG mailing list; (b) routing problems within ISPs; (c) location of BGP Beacons. In addition, our inference methodology is able to detect several routing problems not publicly known. In summary, we believe that our health inference system is a first step towards forming a better understanding of inter-domain routing dynamics.
  • Article: Stable path(s) assignment for inter-domain routing
    ABSTRACT: The Border Gateway Protocol (BGP) is the inter-domain routing protocol in the Internet that allows each autonomous system (AS) to select routes to the destinations based on locally determined policies. It has been shown that the policy autonomy exercised by ASes may result in persistent oscillations in BGP. Current solutions either rely on globally consistent policy assignments (which are hard to achieve in a distributed fashion), or require significant deviations from locally assigned policies (which reduce flexibility). In this paper, we take a different approach. Namely, we propose multipath routing to resolve the conflict between policy autonomy and system stability. We design an algorithm, STABLE PATH(S) ASSIGNMENT (SPA), that provably detects persistent oscillations and eliminates these oscillations by assigning multiple paths to some ASes in the network. We design a distributed protocol for SPA and present tight bounds on the number of paths assigned by the algorithm to the ASes. Using simulations on the AS graph, we show that SPA assigns at most two paths to any AS in the network (in 99.9% of the instances), while assigning a single path in absence of persistent oscillations. Our evaluation results suggest that SPA can effectively detect networks that have a stable state but can potentially face persistent oscillations, and assigns a single path to the ASes in such networks.
  • Article: Towards localizing root causes of BGP dynamics
    ABSTRACT: Today, we lack a clear understanding of the dynamics of the Border Gateway Protocol (BGP) and this has largely restricted our ability to address BGP's shortcomings. To gain more insight into BGP's dynamics, this paper proposes the design of a BGP health inferencing system that localizes the root causes of routing changes. Specifically, the inference system addresses two questions: What is the cause of a routing change? Where does a routing change originate? The inference system correlates routing updates across multiple vantage points to narrow down the suspect set of ASes that might have triggered routing changes. Our methodology is primarily targeted towards analyzing events affecting relatively stable prefixes (composing roughly 80% of the routing table), which are known to be the most popular destinations of Internet traffic. For 70% of observed updates to these prefixes, our approach can pinpoint the location of origin to a single inter-AS link. We analytically and empirically argue correctness of several key steps of our methodology and additionally show that our technique can correctly pinpoint the source of several well-known/documented routing events.
  • Article: A General Auction-based Architecture for Resource Allocation
    Matthew Caesar, Weidong Cui
    ABSTRACT: In this paper we present a framework for resource allocation based on auctions. We leverage (i) application awareness to achieve the performance metrics desired by the application, (ii) prediction to preallocate resources based on expected demand, and (iii) a control channel to exchange information between the client and the allocator to improve resource utilization. We design a scheme that is (i) general: can be used in many application contexts, (ii) flexible: can be used under a variety of workloads, (iii) efficient: provides high resource utilization with low overhead, (iv) responsive: adapts quickly to client demand, and (v) fair: provides proportional fairness to client requests. Simulation results show that the simple auction-based allocation techniques we propose can be used to achieve these goals with excellent performance. High resource utilization, low delay, and proportional fairness in terms of delay and resource allocation can be observed even under bursty application workloads. Furthermore, the simple prediction techniques we propose can greatly improve responsiveness. However, we note that misprediction of future load can cause a significant decrease in system utilization. We compare our allocation techniques with lottery scheduling and note that our auction-based resource allocation scheme performs better under most application request workloads.
  • Article: Dynamic route computation considered harmful
    ABSTRACT: This paper advocates a different approach to reduce routing convergence—side-stepping the problem by avoiding it in the first place! Rather than recomputing paths after temporary topology changes, we argue for a separation of timescale between offline computation of multiple diverse paths and online spreading of load over these paths. We believe decoupling failure recovery from path computation leads to networks that are inherently more efficient, more scalable, and easier to manage.
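
The PDQ abstract at the top of this list contrasts fair sharing with a shortest-job-first discipline that pauses contending flows. The following is a minimal sketch of that contrast on a single bottleneck link, not the PDQ protocol itself. It assumes all flows arrive at time zero and share one link of fixed capacity; the function names and the example flow sizes are hypothetical and chosen only for illustration.

```python
def fair_sharing_completion_times(sizes, capacity=1.0):
    """Completion times when all active flows split the link equally.

    Assumption: every flow starts at time 0 on one shared link.
    Returned times are in order of increasing flow size.
    """
    ordered = sorted(sizes)
    n = len(ordered)
    now = 0.0
    served = 0.0  # amount each still-active flow has sent so far
    times = []
    for i, size in enumerate(ordered):
        active = n - i  # flows still competing for the link
        # Each active flow gets capacity / active, so sending the
        # remaining (size - served) takes (size - served) * active / capacity.
        now += (size - served) * active / capacity
        served = size
        times.append(now)
    return times


def sjf_completion_times(sizes, capacity=1.0):
    """Completion times when the shortest flow gets the whole link
    and every other flow is paused until it finishes."""
    now = 0.0
    times = []
    for size in sorted(sizes):
        now += size / capacity
        times.append(now)
    return times


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    flows = [1, 2, 3]  # hypothetical flow sizes, in units of link capacity
    fair = fair_sharing_completion_times(flows)
    sjf = sjf_completion_times(flows)
    print("fair sharing:", fair, "mean", mean(fair))
    print("shortest first:", sjf, "mean", mean(sjf))
```

Under these assumptions the last flow finishes at the same time in both schedules, but the shorter flows finish earlier when served first, which lowers the mean completion time.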