Thomas E. Anderson's research while affiliated with Trinity Washington University and other places

Publications (175)

Preprint
In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large scale data center networks. Network tail latency is often a crucial metric for cloud application performance that can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size...
Preprint
Full-text available
Personal computer owners often want to be able to run security-critical programs on the same machine as other untrusted and potentially malicious programs. While ostensibly trivial, this requires users to trust hardware and system software to correctly sandbox malicious programs, trust that is often misplaced. Our goal is to minimize the number and...
Preprint
Full-text available
Modern networks exhibit a high degree of variability in link rates. Cellular network bandwidth inherently varies with receiver motion and orientation, while class-based packet scheduling in datacenter and service provider networks induces high variability in available capacity for network tenants. Recent work has proposed numerous congestion contro...
Preprint
Full-text available
The end of Dennard scaling and the slowing of Moore's Law have put the energy use of datacenters on an unsustainable path. Datacenters are already a significant fraction of worldwide electricity use, with application demand scaling at a rapid rate. We argue that substantial reductions in the carbon intensity of datacenter computing are possible with...
Preprint
The increasing use of cloud computing for latency-sensitive applications has sparked renewed interest in providing tight bounds on network tail latency. Achieving this in practice at reasonable network utilization has proved elusive, due to a combination of highly bursty application demand, faster link speeds, and heavy-tailed message sizes. While...
Preprint
Talek is a private group messaging system that sends messages through potentially untrustworthy servers, while hiding both data content and the communication patterns among its users. Talek explores a new point in the design space of private messaging; it guarantees access sequence indistinguishability, which is among the strongest guarantees in th...
Preprint
Disaggregated, or non-local, file storage has become a common design pattern in cloud systems, offering benefits of resource pooling and server specialization, where the inherent overhead of separating compute and storage is mostly hidden by storage device latency. We take an alternate approach, motivated by the commercial availability of very low...
Preprint
Effective congestion control in a multi-tenant data center is becoming increasingly challenging with rapidly increasing workload demand, ever faster links, small average transfer sizes, extremely bursty traffic, limited switch buffer capacity, and one-way protocols such as RDMA. Existing deployed algorithms, such as DCQCN, are still far from optima...
Conference Paper
The ability to extend kernel functionality safely has long been a design goal for operating systems. Modern operating systems, such as Linux, are structured for extensibility to enable sharing a single code base among many environments. Unfortunately, safety has lagged behind, and bugs in kernel extensions continue to cause problems. We study three...
Conference Paper
Many computer systems, especially mobile and IoT systems, use a large number of I/O devices. A contemporary OS acts as a security guard for these devices, which trust the OS to correctly implement the "perimeter defense." Moreover, the OS also trusts these devices and their drivers to be well-behaved and bug-free. This interwoven trust model compli...
Conference Paper
Writing correct distributed systems code is difficult, especially for novice programmers. The inherent asynchrony and need for fault-tolerance make errors almost inevitable. Industrial-strength testing and model checking have been shown to be effective at uncovering bugs, but they come at a cost --- in both time and effort --- that is far beyond wh...
Preprint
Designing and debugging distributed systems is notoriously difficult. The correctness of a distributed system is largely determined by its handling of failure scenarios. The sequence of events leading to a bug can be long and complex, and it is likely to include message reorderings and failures. On single-node systems, interactive debuggers enable...
Article
A perennial question in computer networks is where to place functionality among components of a distributed computer system. In data centers, one option is to move all intelligence to the edge, essentially relegating switches and middleboxes, regardless of their programmability, to simple static routing policies. Another is to add more intelligence...
Conference Paper
Current hardware and application storage trends put immense pressure on the operating system's storage subsystem. On the hardware side, the market for storage devices has diversified to a multi-layer storage topology spanning multiple orders of magnitude in cost and performance. Above the file system, applications increasingly need to process small...
Conference Paper
We take a comprehensive look at packet corruption in data center networks, which leads to packet losses and application performance degradation. By studying 350K links across 15 production data centers, we find that the extent of corruption losses is significant and that its characteristics differ markedly from congestion losses. Corruption impacts...
Conference Paper
Many data center traffic patterns exhibit abundant concurrent connections and high churn. In the face of these characteristics, server-centric congestion control is a poor fit—each connection, no matter how small, must start from scratch in determining when and how much to send along a given path. This is despite the fact that there are a large numbe...
Conference Paper
Web applications are a frequent target of successful attacks. In most web frameworks, the damage is amplified by the fact that application code is responsible for security enforcement. In this paper, we design and evaluate Radiatus, a shared-nothing web framework where application-specific computation and storage on the server is contained within a...
Conference Paper
The recent surge of network I/O performance has put enormous pressure on memory and software I/O processing subsystems. We argue that the primary reason for high memory and processing overheads is the inefficient use of these resources by current commodity network interface cards (NICs). We propose FlexNIC, a flexible network DMA interface that ca...
Conference Paper
We present the first formal verification of state machine safety for the Raft consensus protocol, a critical component of many distributed systems. We connected our proof to previous work to establish an end-to-end guarantee that our implementation provides linearizable state machine replication. This proof required iteratively discovering and prov...
Conference Paper
As network demand increases, data center network operators face a number of challenges including the need to add capacity to the network. Unfortunately, network upgrades can be an expensive proposition, particularly at the edge of the network where most of the network's cost lies. This paper presents a quantitative study of alternative ways of wiri...
Article
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that split...
Conference Paper
Full-text available
Distributed systems are difficult to implement correctly because they must handle both concurrency and failures: machines may crash at arbitrary points and networks may reorder, drop, or duplicate packets. Further, their behavior is often too complex to permit exhaustive testing. Bugs in these systems have led to the loss of critical data and unacc...
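The failure model this abstract names (machines crashing at arbitrary points, and a network that may reorder, drop, or duplicate packets) is easy to exercise directly in a test harness. A minimal sketch, with all names and probabilities hypothetical:

```python
import random

class FaultyNetwork:
    """Simulated network that may drop, duplicate, or reorder messages."""

    def __init__(self, seed=0, drop_p=0.1, dup_p=0.1):
        self.rng = random.Random(seed)  # seeded for reproducible test runs
        self.drop_p = drop_p
        self.dup_p = dup_p
        self.in_flight = []

    def send(self, msg):
        if self.rng.random() < self.drop_p:
            return                      # message silently lost
        copies = 2 if self.rng.random() < self.dup_p else 1
        self.in_flight.extend([msg] * copies)

    def deliver_one(self):
        """Deliver an arbitrary in-flight message (models reordering)."""
        if not self.in_flight:
            return None
        i = self.rng.randrange(len(self.in_flight))
        return self.in_flight.pop(i)

net = FaultyNetwork(seed=42)
for n in range(10):
    net.send(n)

received = []
while (m := net.deliver_one()) is not None:
    received.append(m)
# With drops, duplicates, and reordering enabled, `received` is generally
# neither a permutation of 0..9 nor in sorted order.
```

A system under test is run against this harness instead of real sockets, so a checker can explore delivery schedules that rarely occur in practice.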
Article
A longstanding problem with the Internet is that it is vulnerable to outages, black holes, hijacking and denial of service. Although architectural solutions have been proposed to address many of these issues, they have had difficulty being adopted due to the need for widespread adoption before most users would see any benefit. This is especially re...
Article
Although rare in absolute terms, undetected CPU, memory, and disk errors occur often enough at datacenter scale to significantly affect overall system reliability and availability. In this paper, we propose a new failure model, called Machine Fault Tolerance, and a new abstraction, a replicated write-once trusted table, to provide improved resilien...
Conference Paper
As personal information increases in value, the incentives for remote services to collect as much of it as possible increase as well. In the current Internet, the default assumption is that all behavior can be correlated using a variety of identifying information, not the least of which is a user's IP address. Tools like Tor, Privoxy, and even NATs...
Conference Paper
Interdomain path changes occur frequently. Because routing protocols expose insufficient information to reason about all changes, the general problem of identifying the root cause remains unsolved. In this work, we design and evaluate PoiRoot, a real-time system that allows a provider to accurately isolate the root cause (the network responsible) o...
Conference Paper
In this paper, we argue that recent device hardware trends enable a new approach to the design of operating systems: instead of the operating system mediating access to hardware, applications run directly on top of virtualized I/O devices, where the kernel provides only control plane services. This new division of labor is transparent to the user,...
Conference Paper
Full-text available
The data center network is increasingly a cost, reliability and performance bottleneck for cloud computing. Although multi-tree topologies can provide scalable bandwidth and traditional routing algorithms can provide eventual fault tolerance, we argue that recovery speed can be dramatically improved through the co-design of the network topology, ro...
Conference Paper
Free web services often face growing pains. In the current client-server access model, the cost of providing a service increases with its popularity. This leads organizations that want to provide services free-of-charge to rely on donations, advertisements, or mergers with larger companies to cope with operational costs. This paper proposes an alte...
Article
The Internet was designed to always find a route if there is a policy-compliant path. However, in many cases, connectivity is disrupted despite the existence of an underlying valid path. The research community has focused on short-term outages that occur during route convergence. There has been less progress on addressing avoidable long-lasting out...
Article
As the Internet has become more popular, it has increasingly been a target and medium for monitoring, censorship, content discrimination, and denial of service. Although anonymizing overlays such as Tor [2] provide some help to end users in combating these trends, the overlays themselves have become targets in turn. In this paper, we take a fresh a...
Article
We propose a new approach to mitigate disruptions of Internet connectivity. The Internet was designed to always find a route if there is a policy-compliant path; however, in many cases, connectivity is disrupted despite the existence of an underlying valid path. The research community has done considerable work on this problem, much of it focused o...
Conference Paper
Full-text available
Distributed storage systems often trade off strong semantics for improved scalability. This paper describes the design, implementation, and evaluation of Scatter, a scalable and consistent distributed key-value storage system. Scatter adopts the highly decentralized and self-organizing structure of scalable peer-to-peer systems, while preserving li...
Article
A common assumption made in log analysis research is that the underlying log is totally ordered. For concurrent systems, this assumption constrains the generated log to either exclude concurrency altogether, or to capture a particular interleaving of concurrent events. This paper argues that capturing concurrency as a partial order is useful and of...
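The partial order over concurrent events that this paper argues for is conventionally captured with vector clocks; the following is a generic sketch of that technique, not code from the paper:

```python
def vc_send(clock, pid):
    """Increment the local entry before attaching the clock to a message."""
    clock = dict(clock)
    clock[pid] = clock.get(pid, 0) + 1
    return clock

def vc_recv(clock, msg_clock, pid):
    """Merge a received clock entry-wise, then increment the local entry."""
    merged = {p: max(clock.get(p, 0), msg_clock.get(p, 0))
              for p in set(clock) | set(msg_clock)}
    merged[pid] = merged.get(pid, 0) + 1
    return merged

def happens_before(a, b):
    """True iff the event with clock `a` causally precedes the one with `b`."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

a = vc_send({}, "P0")        # P0 sends a message:        {"P0": 1}
b = vc_send({}, "P1")        # independent event on P1:   {"P1": 1}
c = vc_recv(b, a, "P1")      # P1 receives P0's message:  {"P0": 1, "P1": 2}

assert happens_before(a, c)                               # causal edge
assert not happens_before(a, b) and not happens_before(b, a)  # concurrent
```

Events whose clocks are incomparable (like `a` and `b` above) are exactly the concurrent ones a totally ordered log would force into an arbitrary interleaving.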
Article
In this paper, we design, implement, and evaluate a new scalable and fault tolerant network operating system, called ETTM, for securely and efficiently managing network resources at a packet granularity. Our aim is to provide network administrators a greater degree of control over network behavior at lower cost, and network users a greater degre...
Conference Paper
Full-text available
Traceroute is the most widely used Internet diagnostic tool today. Network operators use it to help identify routing failures, poor performance, and router misconfigurations. Researchers use it to map the Internet, predict performance, geolocate routers, and classify the performance of ISPs. However, traceroute has a fundamental limitation th...
Conference Paper
Privacy -- the protection of information from unauthorized disclosure -- is increasingly scarce on the Internet. The lack of privacy is particularly true for popular peer-to-peer data sharing applications such as BitTorrent where user behavior is easily monitored by third parties. Anonymizing overlays such as Tor and Freenet can improve user privac...
Conference Paper
Full-text available
Operators and researchers want accurate router-level views of the Internet for purposes including troubleshooting and modeling. However, tools such as traceroute return IP addresses. Because routers may have dozens of IP addresses, or aliases, multiple measurements may return different addresses, obscuring whether they represent the same machine. W...
Conference Paper
Full-text available
Flaws in the standard libraries of secure sandboxes represent a major security threat to billions of devices worldwide. The standard libraries are hard to secure because they frequently need to perform low-level operations that are forbidden in untrusted application code. Existing designs have a single, large trusted computing base that contains se...
Article
Full-text available
We argue that carrier sense in 802.11 and other wireless protocols leads to scheduling decisions that are overly pessimistic and hence waste capacity. As an alternative, we propose interference cancellation, in which simultaneous signals are modeled and decoded together rather than treating all but one as random noise. This method greatly expands...
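The core idea, decoding simultaneous signals jointly instead of treating one as noise, can be illustrated with successive interference cancellation reduced to amplitude arithmetic. A toy sketch (real receivers operate on sampled waveforms; none of this is the paper's actual decoder):

```python
# Two BPSK senders transmit simultaneously at different amplitudes.
# The receiver decodes the stronger signal first, subtracts its
# reconstruction from the sum, then decodes the weaker one.

STRONG_AMP, WEAK_AMP = 2.0, 1.0

def transmit(strong_bits, weak_bits):
    """Superpose the two senders' BPSK symbols (+A for 1, -A for 0)."""
    return [STRONG_AMP * (1 if s else -1) + WEAK_AMP * (1 if w else -1)
            for s, w in zip(strong_bits, weak_bits)]

def sic_decode(samples):
    strong, weak = [], []
    for y in samples:
        s = 1 if y > 0 else 0                 # stronger sender dominates the sign
        residual = y - STRONG_AMP * (1 if s else -1)
        w = 1 if residual > 0 else 0          # decode weaker from the residual
        strong.append(s)
        weak.append(w)
    return strong, weak

samples = transmit([1, 0, 1, 1], [0, 1, 1, 0])
assert sic_decode(samples) == ([1, 0, 1, 1], [0, 1, 1, 0])
```

Both transmissions are recovered from a single set of samples, which is exactly the capacity that carrier sense forgoes by forcing the senders to take turns.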
Conference Paper
The last fifteen years have seen a vast proliferation of middleboxes to solve all manner of persistent limitations in the Internet protocol suite. Examples include firewalls, NATs, load balancers, traffic shapers, deep packet intrusion detection, virtual private networks, network monitors, transparent web caches, content delivery networks, and...
Conference Paper
Full-text available
Many peer-to-peer distributed applications can benefit from accurate predictions of Internet path performance. Existing approaches either 1) achieve high accuracy for sophisticated path properties, but adopt an unscalable centralized approach, or 2) are lightweight and decentralized, but work only for latency prediction. In this paper, we present...
Conference Paper
Full-text available
Replicating content across a geographically distributed set of servers and redirecting clients to the closest server in terms of latency has emerged as a common paradigm for improving client performance. In this paper, we analyze latencies measured from servers in Google's content distribution network (CDN) to clients all across the Internet to study the effectiv...
Article
We motivate the capability approach to network denial-of-service (DoS) attacks, and evaluate the Traffic Validation Architecture (TVA), which builds on capabilities. With our approach, rather than send packets to any destination at any time, senders must first obtain "permission to send" from the receiver, which provides the per...
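The "permission to send" idea can be sketched as a receiver-minted token that routers verify before forwarding. The HMAC construction, key handling, and field layout below are illustrative assumptions, not TVA's actual wire format:

```python
import hmac
import hashlib

ROUTER_KEY = b"per-router secret"   # hypothetical; a real design derives keys differently

def grant_capability(src, dst, expiry):
    """Receiver side: mint a 'permission to send' token for (src, dst)."""
    msg = f"{src}|{dst}|{expiry}".encode()
    return hmac.new(ROUTER_KEY, msg, hashlib.sha256).hexdigest()

def check_capability(src, dst, expiry, token, now):
    """Router side: forward only packets carrying a valid, unexpired token."""
    if now >= expiry:
        return False
    expected = grant_capability(src, dst, expiry)
    return hmac.compare_digest(expected, token)

tok = grant_capability("10.0.0.1", "10.0.0.2", expiry=1000)
assert check_capability("10.0.0.1", "10.0.0.2", 1000, tok, now=500)
assert not check_capability("10.0.0.9", "10.0.0.2", 1000, tok, now=500)  # spoofed source
```

Because the token binds source, destination, and expiry, flooding traffic that lacks a valid token can be dropped at routers rather than at the victim.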
Article
Unspoofable source identifiers in packets are a basic building block in combating Denial of Service (DoS) attacks. Routers may rely on these identifiers to precisely block attack traffic or enforce fair resource allocation. It is also possible to
Article
Full-text available
Internet routing protocols (BGP, OSPF, RIP) have traditionally favored responsiveness over consistency. A router applies a received update immediately to its forwarding table before propagating the update to other routers, including those that potentially depend upon the outcome of the update. Responsiveness comes at the cost of routing loops and b...
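The loop hazard described here (a router applies an update before its neighbors have processed the same update) can be shown with a two-entry forwarding table; the topology and names are hypothetical:

```python
# Transient forwarding loop from inconsistent tables: router A has already
# switched its next hop toward D, but B's table is stale and still points
# back through A. Until B processes the update, packets ping-pong.

next_hop = {
    "A": {"D": "B"},   # A applied the new route immediately
    "B": {"D": "A"},   # B has not yet processed the update
}

def forward(src, dst, max_hops=8):
    """Follow next-hop tables; return the path, truncated if it loops."""
    path, node = [src], src
    while node != dst and len(path) <= max_hops:
        node = next_hop[node][dst]
        path.append(node)
    return path

path = forward("A", "D")
assert path == ["A", "B", "A", "B", "A", "B", "A", "B", "A"]  # loops, never reaches D
```

Delaying the local application of an update until dependent routers agree removes this window, at some cost in responsiveness.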
Conference Paper
An emerging paradigm in peer-to-peer (P2P) networks is to explicitly consider incentives as part of the protocol design in order to promote good (or discourage bad) behavior. However, effective incentives are hampered by the challenges of a P2P environment, e.g. transient users and no central authority. In this paper, we quantify these challenges...
Conference Paper
This paper develops a model of computer systems research as a way of helping explain to prospective authors the often obscure workings of conference program committees. While our goal is primarily descriptive, we use the model to motivate several recent changes in conference design and to suggest some further potential improvements.
Conference Paper
We present Hubble, a system that operates continuously to find Internet reachability problems in which routes exist to a destination but packets are unable to reach the destination. Hubble monitors, at a 15-minute granularity, the data path to prefixes that cover 89% of the Internet's edge address space. Key enabling techniques include a hybrid passi...
Conference Paper
Large-scale distributed denial of service (DoS) attacks are an unfortunate everyday reality on the Internet. They are simple to execute and with the growing prevalence and size of botnets more effective than ever. Although much progress has been made in developing techniques to address DoS attacks, no existing solution is unilaterally deployable,...
Conference Paper
Full-text available
A fundamental problem with unmanaged wireless networks is high packet loss rates and poor spatial reuse, especially with bursty traffic typical of normal use. To address these limitations, we explore the notion of interference cancellation for unmanaged networks — the ability for a single receiver to disambiguate and successfully receive simult...
Conference Paper
A fundamental problem with many peer-to-peer systems is the tendency for users to "free ride"--to consume resources without contributing to the system. The popular file distribution tool BitTorrent was explicitly designed to address this problem, using a tit-for-tat reciprocity strategy to provide positive incentives for nodes to contribute resourc...
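BitTorrent's tit-for-tat strategy can be sketched as a rate-ranked unchoking decision: reciprocate with the peers that have uploaded to you the fastest. This omits optimistic unchoking, snubbing, and other details of the real choking algorithm:

```python
def pick_unchoked(upload_rates_to_us, slots=4):
    """Tit-for-tat: unchoke the peers that gave us the most.

    `upload_rates_to_us` maps peer id -> observed download rate from that
    peer (e.g. bytes/s over a recent rolling window). Peers outside the
    top `slots` are choked and receive nothing from us this round.
    """
    ranked = sorted(upload_rates_to_us, key=upload_rates_to_us.get, reverse=True)
    return set(ranked[:slots])

rates = {"p1": 900, "p2": 50, "p3": 0, "p4": 400, "p5": 700, "p6": 120}
assert pick_unchoked(rates) == {"p1", "p5", "p4", "p6"}
assert "p3" not in pick_unchoked(rates)   # the free rider is choked
```

A free rider (rate 0, like `p3`) never makes the top ranks and so is starved, which is the positive incentive to contribute that the abstract describes.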
Conference Paper
Traditional methods of conducting measurements to end hosts require sending unexpected packets to measurement targets. Although existing techniques can ascertain end host characteristics accurately, their use in large-scale measurement studies is hindered by the fact that unexpected traffic can trigger alarms in common intrusion detection systems, often...
Conference Paper
We present Wiser, an Internet routing protocol that enables ISPs to jointly control routing in a way that produces efficient end-to-end paths even when they act in their own interests. Wiser is a simple extension of BGP, uses only existing peering contracts for monetary exchange, and can be incrementally deployed. Each ISP selects paths in a wa...
Conference Paper
Full-text available
Distributed hash tables (DHTs) provide scalable, key-based lookup of objects in dynamic network environments. Although DHTs have been studied extensively from an analytical perspective, only recently have wide deployments enabled empirical examination. This paper reports measurements of the Azureus BitTorrent client's DHT, which is in active...
Article
As distributed systems that span multiple administrative domains proliferate, robust protocols increasingly need to incorporate the incentives of multiple stakeholders into their design. A significant challenge in designing
Article
A key challenge in combating Denial of Service (DoS) attacks is to reliably identify attack sources from packet contents. If a source can be reliably identified, routers can stop an attack by filtering packets from the attack sources without causing collateral damage to legitimate traffic. This task is difficult because attackers may spoof arbitrar...
Conference Paper
Several models have been recently proposed for predicting the latency of end-to-end Internet paths. These models treat the Internet as a black-box, ignoring its internal structure. While these models are simple, they can often fail systematically; for example, the most widely used models use metric embeddings that predict no benefit to detour rou...
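The systematic failure mentioned, that metric embeddings predict no benefit to detour routing, follows from the triangle inequality: any distance derived from an embedding satisfies it, while measured Internet latencies need not. A small sketch with made-up coordinates and latencies:

```python
import math
import itertools

# Euclidean network-coordinate sketch (coordinates are invented):
coords = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (6.0, 0.0)}

def predicted_latency(x, y):
    """Latency predicted by the embedding: Euclidean distance."""
    return math.dist(coords[x], coords[y])

# The triangle inequality holds in any metric embedding, so the model
# can never predict that a detour (x -> z -> y) beats the direct path.
for x, y, z in itertools.permutations(coords, 3):
    assert predicted_latency(x, y) <= predicted_latency(x, z) + predicted_latency(z, y)

# Real measured latencies need not form a metric: a detour can win.
measured_ms = {("A", "C"): 120.0, ("A", "B"): 20.0, ("B", "C"): 30.0}
assert measured_ms[("A", "B")] + measured_ms[("B", "C")] < measured_ms[("A", "C")]
```

Any model whose predictions come from distances in a metric space inherits this blind spot, regardless of how the coordinates are fit.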