David A. Maltz

Microsoft · Azure Networking Team

Ph.D.
Looking for collaborations to advance the state of the art in Cloud networking and services.

About

119
Publications
50,555
Reads
43,840
Citations
Introduction
Dr. David A. Maltz is the engineering leader for the Azure Networking team, responsible for developing, deploying, and operating the software and network devices that connect Microsoft's largest services, including the Azure Public Cloud and Microsoft 365. We write network services such as Network Security and DNS, the distributed systems that control our software-defined and physical networks, and the SONiC firmware for our switches. We also design cloud-scale networks and optical layers.
Additional affiliations
August 2005 - present
Microsoft
Position
  • Partner Development Manager

Publications (119)
Article
It's been 15 years since what we now call Software Defined Network began emerging out of a set of ideas in the networking research community. This editorial note traces how the ideas in one particular paper from that time have evolved and found practical applications.
Article
Full-text available
Anycast is an internet addressing protocol where multiple hosts share the same IP-address. A popular architecture for modern Content Distribution Networks (CDNs) for geo-replicated HTTP-services consists of multiple layers of proxy nodes for service and co-located DNS-servers for load-balancing on different proxies. Both the proxies and the DNS-ser...
Patent
A method of networking a plurality of servers together within a data center is disclosed. The method includes the step of addressing a data packet for delivery to a destination server by providing the destination server address as a flat address. The method further includes the steps of obtaining routing information required to route the packet to...
Article
Full-text available
For more than a decade, special purpose DNS [1] servers have been used to direct users to different IP addresses based on characteristics such as proximity, load balancing and failover. In this paper, we examine two of the mechanisms these special purpose DNS servers may use to control user traffic: short time-to-lives (TTL) and time-splicing multi...
Conference Paper
This paper proposes an approach to the design of large-scale general-purpose data center networks based on the notions of volume and area universality introduced by Leiserson in the 1980's in the context of VLSI design. In particular, we suggest that the principal goal of the network designer should be to build a single network that is provably com...
Conference Paper
Full-text available
Performance of online applications directly impacts user satisfaction. A major component of the user-perceived performance of the application is the time spent in transit between the user's device and the application existing in data centers. Content Delivery Networks (CDNs) are typically used to improve user-perceived application performance throu...
Conference Paper
Networks for Cloud-Scale Data Centers are challenged by the need to support multiple diverse traffic patterns and traffic types across a single shared physical network. This talk lays out challenges and shows how software-defined networking combined with simple physical networks meets the need.
Patent
A plurality of network addresses from a distributed client is obtained, at least a first portion of the obtained network addresses including resolved network address responses to distributed client requests for resolved network addresses corresponding to one or more network location indicators associated with a first web service. Test content is ob...
Conference Paper
Full-text available
Recent trends to pack data centers with more CPUs per rack have led to a scenario in which each individual rack may contain hundreds, or even thousands, of compute nodes using system-on-chip (SoC) architectures. At this increased scale, traditional rack-level star topologies with a top-of-rack (ToR) switch as the hub and servers as the leaves are n...
Article
As modern datacenter networks (DCNs) grow to support hundreds of thousands of servers and beyond, managing network equipment -- such as routers, firewalls, and load balancers -- becomes increasingly complex. Network attributes such as IP address allocations and BGP neighbor relations are scattered among various network engineering groups, which mak...
Conference Paper
Building cost-efficient cloud data centers means putting more and more servers on the same sites, and this stresses the ability to build cost-efficient fiber plants and networks to connect them. With the lifespan of a server being 3 years while fiber can be good for 10 years or more, designs must be future-proof. Given the scale of the networks, reduci...
Article
Deploying interactive applications in the cloud is a challenge due to the high variability in performance of cloud services. In this paper, we present Dealer - a system that helps geo-distributed, interactive and multi-tier applications meet their stringent requirements on response time despite such variability. Our approach is motivated by the fac...
Conference Paper
Full-text available
Datacenter networks (DCNs) are constantly evolving due to various updates such as switch upgrades and VM migrations. Each update must be carefully planned and executed in order to avoid disrupting many of the mission-critical, interactive applications hosted in DCNs. The key challenge arises from the inherent difficulty in synchronizing the changes...
Article
Layer-4 load balancing is fundamental to creating scale-out web services. We designed and implemented Ananta, a scale-out layer-4 load balancer that runs on commodity hardware and meets the performance, reliability and operational requirements of multi-tenant cloud computing environments. Ananta combines existing techniques in routing and distribut...
Conference Paper
Layer-4 load balancing is fundamental to creating scale-out web services. We designed and implemented Ananta, a scale-out layer-4 load balancer that runs on commodity hardware and meets the performance, reliability and operational requirements of multi-tenant cloud computing environments. Ananta combines existing techniques in routing and distribut...
Conference Paper
Datacenter networks (DCNs) are constantly evolving due to various updates such as switch upgrades and VM migrations. Each update must be carefully planned and executed in order to avoid disrupting many of the mission-critical, interactive applications hosted in DCNs. The key challenge arises from the inherent difficulty in synchronizing the changes...
Conference Paper
Full-text available
Data centers are fascinating places, where the massive scale required to deliver on-line services like web search and cloud hosting turns minor issues into major challenges that must be addressed in the design of the physical infrastructure and the software platform. In this talk, I'll briefly overview the kinds of applications that run in mega-dat...
Article
Data centers are fascinating places, where the massive scale required to deliver on-line services like web search and cloud hosting turns minor issues into major challenges that must be addressed in the design of the physical infrastructure and the software platform. In this talk, I'll briefly overview the kinds of applications that run in mega-dat...
Patent
Constructing an inference graph relates to the creation of a graph that reflects dependencies within a network. In an example embodiment, a method includes determining dependencies among components of a network and constructing an inference graph for the network responsive to the dependencies. The components of the network include services and hard...
Patent
A system including at least one storage node and at least one computation node connected by a switch is described herein. Each storage node has one or more storage units and one or more network interface components, the collective bandwidths of the storage units and the network interface components being proportioned to one another to enable commun...
Conference Paper
Deploying interactive applications in the cloud is a challenge due to the high variability in performance of cloud services. In this paper, we present Dealer, a system that helps geo-distributed, interactive and multi-tier applications meet their stringent requirements on response time despite such variability. Our approach is motivated by the fac...
Article
Full-text available
Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff...
Article
Driven by the soaring demands for always-on and fast-response online services, modern datacenter networks have recently undergone tremendous growth. These networks often rely on commodity hardware to reach immense scale while keeping capital expenses under check. The downside is that commodity devices are prone to failures, raising a formidable cha...
Conference Paper
Driven by the soaring demands for always-on and fast-response online services, modern datacenter networks have recently undergone tremendous growth. These networks often rely on commodity hardware to reach immense scale while keeping capital expenses under check. The downside is that commodity devices are prone to failures, raising a formidable cha...
Article
Full-text available
Enterprise networks are important, with size and complexity even surpassing carrier networks. Yet, the design of enterprise networks remains ad hoc and poorly understood. In this paper, we show how a systematic design approach can handle two key areas of enterprise design: virtual local area networks (VLANs) and reachability control. We focus on th...
Conference Paper
Cloud service providers operate data centers around the world, and they depend on Global Traffic Management systems to direct requests from clients to the most appropriate data center to serve the requests. While GTM systems have been in use for years, they are attracting renewed interest due to the rapid expansion of cloud service providers' net...
Conference Paper
When a user requests content from a cloud service provider, sometimes the content sent by the provider is modified in flight by third-party entities. To our knowledge, there is no comprehensive study that examines the extent and primary root causes of the content modification problem. We design a lightweight experiment and instrument a vast number o...
Article
Full-text available
To be agile and cost effective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL2, a practical network architecture that scales to support huge data centers with uniform high capacity betw...
Article
Enterprises are increasingly deploying their applications in the cloud given the cost-saving advantages, and the potential to geo-distribute applications to ensure resilience and better service experience. However, a key unknown is whether it is feasible to meet the stringent response time requirements of enterprise applications using the cloud. We ma...
Conference Paper
Full-text available
In this paper, we tackle challenges in migrating enterprise services into hybrid cloud-based deployments, where enterprise operations are partly hosted on-premise and partly in the cloud. Such hybrid architectures enable enterprises to benefit from cloud-based architectures, while honoring application performance requirements, and privacy restricti...
Conference Paper
Full-text available
Cloud data centers host diverse applications, mixing workloads that require small predictable latency with others requiring large sustained throughput. In this environment, today's state-of-the-art TCP protocol falls short. We present measurements of a 6000 server production cluster and reveal impairments that lead to high application latencies, ro...
Article
In this paper, we tackle challenges in migrating enterprise services into hybrid cloud-based deployments, where enterprise operations are partly hosted on-premise and partly in the cloud. Such hybrid architectures enable enterprises to benefit from cloud-based architectures, while honoring application performance requirements, and privacy restricti...
Article
Full-text available
Networks cannot be managed without management plane communications among geographically distributed network devices and control agents. Unfortunately, the mechanisms used in commercial networks to support management plane communications are often hard to configure, insufficiently secured, and/or suboptimal in performance. This paper presents the de...
Conference Paper
Although there is tremendous interest in designing improved networks for data centers, very little is known about the network-level traffic characteristics of data centers today. In this paper, we conduct an empirical study of the network traffic in 10 data centers belonging to three different categories, including university, enterprise campus, an...
Article
Full-text available
Cloud data centers host diverse applications, mixing workloads that require small predictable latency with others requiring large sustained throughput. In this environment, today's state-of-the-art TCP protocol falls short. We present measurements of a 6000 server production cluster and reveal impairments that lead to high application latencies, ro...
Article
Full-text available
This article is an editorial note submitted to CCR. It has NOT been peer reviewed. The author takes full responsibility for this article's technical content. Comments can be posted through CCR Online. Abstract: Network management represents an architectural gap in today's Internet (1). Many problems with computer networks today, such as faults, mi...
Article
Full-text available
To be agile and cost effective, data centers must allow dynamic resource allocation across large server pools. In particular, the data center network should provide a simple flat abstraction: it should be able to take any set of servers anywhere in the data center and give them the illusion that they are plugged into a physically separate, noninter...
Conference Paper
Operator interviews and anecdotal evidence suggest that an operator's ability to manage a network decreases as the network becomes more complex. However, there is currently no way to systematically quantify how complex a network's design is nor how complexity may impact network management activities. In this paper, we develop a suite of complex...
Article
Full-text available
A repository of router configuration files from production networks would provide the research community with a treasure trove of data about network topologies, routing designs, and security policies. However, configuration files have been largely unobtainable precisely because they provide detailed information that could be exploited by competitor...
Article
Full-text available
The data centers used to create cloud services represent a significant investment in capital outlay and ongoing costs. Accordingly, we first examine the costs of cloud service data centers today. The cost breakdown reveals the importance of optimizing work completed per dollar invested. Unfortunately, the resources inside the data centers often ope...
Conference Paper
Full-text available
Organizations world-wide are adopting wireless networks at an impressive rate, and a new industry has sprung up to provide tools to manage these networks. Unfortunately, these tools do not integrate cleanly with traditional wired network management tools, leading to unsolved problems and frustration among the IT staff. We explore the problem of...
Conference Paper
Few studies so far have examined the nature of reachability policies in enterprise networks. A better understanding of reachability policies could both inform future approaches to network design as well as current network configuration mechanisms. In this paper, we introduce the notion of a policy unit, which is an abstract representation of how...
Article
The data centers used to create cloud services represent a significant investment in capital outlay and ongoing costs. Accordingly, we first examine the costs of cloud service data centers today. The cost breakdown reveals the importance of optimizing work completed per dollar invested. Unfortunately, the resources inside the data centers often ope...
Conference Paper
Full-text available
Enterprise networks are important, with size and complexity even surpassing carrier networks. Yet, the design of enterprise networks is ad-hoc and poorly understood. In this paper, we show how a systematic design approach can handle two key areas of enterprise design: virtual local area networks (VLANs) and reachability control. We focus on these t...
Conference Paper
Full-text available
We argue that there is a continuum between completely manual and completely automated management of networks and distributed applications. The ability to visualize the status of the network and applications inside a data center allows human users to rapidly assess the health of the system - quickly identifying problems that span across components an...
Conference Paper
Full-text available
Applications hosted in today's data centers suffer from internal fragmentation of resources, rigidity, and bandwidth constraints imposed by the architecture of the network connecting the data center's servers. Conventional architectures statically map web services to Ethernet VLANs, each constrained in size to a few hundred servers owing to...
Article
Full-text available
Organizations world-wide are adopting wireless networks at an impressive rate, and a new industry has sprung up to provide tools to manage these networks. Unfortunately, these tools do not integrate cleanly with traditional wired network management tools, leading to unsolved problems and frustration among the IT staff. We explore the problem of...
Conference Paper
Full-text available
Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an Inf...
Article
Full-text available
Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994. Includes bibliographical references (leaves 68-71). by David A. Maltz. M.S.
Conference Paper
We present Tesseract, an experimental system that enables the direct control of a computer network that is under a single administrative domain. Tesseract's design is based on the 4D architecture, which advocates the decomposition of the network control plane into decision, dissemination, discovery, and data planes. Tesseract provides two p...
Conference Paper
Full-text available
Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use approximate Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10^4 possible faults from 10^5 observatio...
Article
Full-text available
Defenses against botnet-based distributed denial-of-service (DDoS) attacks must demonstrate that in addition to being technically feasible, they are also economically viable, particularly when compared with the two most widely deployed defenses--simple massive overprovisioning of resources to absorb and handle DDoS traffic, and "scrubbing" of incom...
Conference Paper
Full-text available
This paper presents the Leslie Graph, a simple yet powerful abstraction describing the complex dependencies between network, host and application components in modern networked systems. It discusses challenges in the discovery of Leslie Graphs, their uses, and describes two alternate approaches to their discovery, supported by some initial feasibil...
Conference Paper
Full-text available
Content providers base their business on their ability to receive and answer requests from clients distributed across the Internet. Since disruptions in the flow of these requests directly translate into lost revenue, there is tremendous incentive to diagnose why some requests fail and prod the responsible parties into corrective action. Howev...
Article
Full-text available
Today's data networks are surprisingly fragile and difficult to manage. We argue that the root of these problems lies in the complexity of the control and management planes--the software and protocols coordinating network elements--and particularly the way the decision logic and the distributed-systems issues are inexorably intertwined. We advocate...
Article
Full-text available
We argue for the refactoring of the IP control plane to support network-wide objectives and control. We put forward a design that refactors functionality into a novel 4D architecture composed of four separate planes: decision, dissemination, discovery and data. All decision-making logic is moved out of routers along with current management plane funct...
Conference Paper
Full-text available
We propose a novel technique that can determine both the host responsible for originating a propagating worm attack and the set of attack flows that make up the initial stages of the attack tree via which the worm infected successive generations of victims. We argue that knowledge of both is important for combating worms: knowledge of the origin su...
Conference Paper
Full-text available
The primary purpose of a network is to provide reachability between applications running on end hosts. In this paper, we describe how to compute the reachability a network provides from a snapshot of the configuration state from each of the routers. Our primary contribution is the precise definition of the potential reachability of a network and a...
Conference Paper
Full-text available
A repository of router configuration files from production networks would provide the research community with a treasure trove of data about network topologies, routing designs, and security policies. However, configuration files have been largely unobtainable precisely because they provide detailed information that could be exploited by competitor...
Conference Paper
Full-text available
In any IP network, routing protocols provide the intelligence that takes a collection of physical links and transforms them into a network that enables packets to travel from one host to another. Though routing design is arguably the single most important design task for large IP networks, there has been very little systematic investigation into ho...
Conference Paper
In any IP network, routing protocols provide the intelligence that takes a collection of physical links and transforms them into a network that enables packets to travel from one host to another. Though routing design is arguably the single most important design task for large IP networks, there has been very little systematic investigation into ho...
Article
There are two usual methods to evaluate a software system in multi-hop wireless ad hoc networks: simulation and real test-bed. The test-bed method is expensive and non-repeatable. The simulation method usually requires re-implementing the real software system inside the simulator, which is also infeasible for large scale software systems. In th...
Article
Mobile nodes of the future will be equipped with multiple network interfaces to take advantage of overlay networks, yet no current mobility systems provide full support for the simultaneous use of multiple interfaces. The need for such support arises when multiple connectivity options are available with different cost, coverage, latency and bandwid...