Robbert Van Renesse

Cornell University | CU · Department of Computer Science

About

277
Publications
66,735
Reads
13,439
Citations
Additional affiliations
January 1991 - present
Cornell University
Position
  • Group Leader

Publications (277)
Chapter
This paper presents Escher, an approach to build and deploy multi-tiered cloud-based applications, and outlines the framework that supports it. Escher is designed to allow systems of systems to be derived methodically and to evolve over time, in a modular way. To this end, Escher includes (i) a novel authenticated message bus that hides from one an...
Article
Infrastructure-as-a-Service cloud providers sell virtual machines that are only specified in terms of number of CPU cores, amount of memory, and I/O throughput. Performance-critical aspects such as cache sizes and memory latency are missing or reported in ways that make them hard to compare across cloud providers. It is difficult for users to adapt...
Article
Scaling Byzantine Fault Tolerant (BFT) systems in terms of membership is important for secure applications with large participation such as blockchains. While traditional protocols have low latency, they cannot handle many processors. Conversely, blockchains often have hundreds to thousands of processors to increase robustness, but they typically h...
Preprint
Full-text available
In distributed systems, a group of learners achieve consensus when, by observing the output of some acceptors, they all arrive at the same value. Consensus is crucial for ordering transactions in failure-tolerant systems. Traditional consensus algorithms are homogeneous in three ways: - all learners are treated equa...
Preprint
As RAM is becoming cheaper and growing abundant, it is time to revisit the design of persistent key-value storage systems. Most of today's persistent key-value stores are based either on a write-optimized Log-Structured Merge tree (LSM) or a read-optimized B+-tree. Instead, this paper introduces a new design called "lazy-trie" to index the persiste...
Preprint
Fault tolerant consensus protocols usually involve ordered rounds of voting between a collection of processes. In this paper, we derive a general specification of fault tolerant asynchronous consensus protocols and present a class of consensus protocols that refine this specification without using rounds. Crash-tolerant protocols in this class use...
Preprint
This paper introduces a family of leaderless Byzantine fault tolerance protocols, built around a metastable mechanism via network subsampling. These protocols provide a strong probabilistic safety guarantee in the presence of Byzantine adversaries while their concurrent and leaderless nature enables them to achieve high throughput and scalability....
Preprint
We present Charlotte, a framework for composable, authenticated distributed data structures. Charlotte data is stored in blocks that reference each other by hash. Together, all Charlotte blocks form a directed acyclic graph, the blockweb; all observers and applications use subgraphs of the blockweb for their own data structures. Unlike prior system...
Conference Paper
Full-text available
"Cloud-native" container platforms, such as Kubernetes, have become an integral part of production cloud environments. One of the principles in designing cloud-native applications is called Single Concern Principle, which suggests that each container should handle a single responsibility well. In this paper, we propose X-Containers as a new securit...
Article
Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and includes mechanisms that assist in restart after...
Chapter
Blockchain-based cryptocurrencies have demonstrated how to securely implement traditionally centralized systems, such as currencies, in a decentralized fashion. However, there have been few measurement studies on the level of decentralization they achieve in practice. We present a measurement study on various decentralization metrics of two of the...
Conference Paper
Use-based privacy restricts how information may be used, making it well-suited for data collection and data analysis applications in networked information systems. This work investigates the feasibility of enforcing use-based privacy in distributed systems with adversarial service providers. Three architectures that use Intel-SGX are explored: sour...
Preprint
Full-text available
Blockchains offer a useful abstraction: a trustworthy, decentralized log of totally ordered transactions. Traditional blockchains have problems with scalability and efficiency, preventing their use for many applications. These limitations arise from the requirement that all participants agree on the total ordering of transactions. To address this f...
Article
Full-text available
Blockchain-based cryptocurrencies have demonstrated how to securely implement traditionally centralized systems, such as currencies, in a decentralized fashion. However, there have been few measurement studies on the level of decentralization they achieve in practice. We present a measurement study on various decentralization metrics of two of the...
Article
Just like we have small dedicated networks in houses, cars, factories, etc., we will have small clouds in all these places. We need a way to glue all these clouds together into a single worldwide infrastructure. The Cloud Abstraction Layer could provide such glue.
Poster
Full-text available
The IoT and mobile computing revolution is causing an exponential growth in edge devices connected to the Internet. These capture and store an increasing amount of personal and privacy sensitive data about end users. Through recent legislation such as the European Union General Data Protection Regulation (GDPR), such data is often accompanied by strict pr...
Article
Full-text available
Infrastructure-as-a-Service (IaaS) cloud providers hide available interfaces for virtual machine (VM) placement and migration, CPU capping, memory ballooning, page sharing, and I/O throttling, limiting the ways in which applications can optimally configure resources or respond to dynamically shifting workloads. Given these interfaces, applications...
Conference Paper
The "cloud paradigm" can provide a wealth of sophisticated emergency communication services that are gamechangers in emergency response, but its current implementation is not suitable to the challenging environments in which these responses often take place. The networking infrastructure may be all but unavailable, and access to centralized datacen...
Conference Paper
The coming generation of Internet-of-Things (IoT) applications will process massive amounts of incoming data while supporting data mining and online learning. In cases with demanding real-time requirements, such systems behave as smart memories: a high-bandwidth service that captures sensor input, processes it using machine-learning tools, replicat...
Conference Paper
Full-text available
The rise of blockchain-based cryptocurrencies has led to an explosion of services using distributed ledgers as their underlying infrastructure. However, due to inherently single-service oriented blockchain protocols, such services can bloat the existing ledgers, fail to provide sufficient security, or completely forego the property of trustless aud...
Article
Consus is a strictly serializable geo-replicated transactional key-value store. The key contribution of Consus is a new commit protocol that reduces the cost of executing a transaction to three wide area message delays in the common case. Augmenting the commit protocol are multiple Paxos implementations optimized for different purposes. Together th...
Technical Report
Full-text available
The rise of blockchain-based cryptocurrencies has led to an explosion of services using distributed ledgers as their underlying infrastructure. However, due to inherently single-service oriented blockchain protocols, such services can bloat the existing ledgers, fail to provide sufficient security, or completely forego the property of trustless aud...
Article
We present Moving Participants Turtle Consensus (MPTC), an asynchronous consensus protocol for crash and Byzantine-tolerant distributed systems. MPTC uses various moving target defense strategies to tolerate certain Denial-of-Service (DoS) attacks issued by an adversary capable of compromising a bounded portion of the system. MPTC supports o...
Conference Paper
Modern applications often operate on data in multiple administrative domains. In this federated setting, participants may not fully trust each other. These distributed applications use transactions as a core mechanism for ensuring reliability and consistency with persistent data. However, the coordination mechanisms needed for transactions can both...
Conference Paper
Full-text available
Global cloud services have to respond to workloads that shift geographically as a function of time-of-day or in response to special events. While many such services have support for adding nodes in one region and removing nodes in another, we demonstrate that such mechanisms can lead to significant performance degradation. Yet other services do not...
Article
Modern applications often operate on data in multiple administrative domains. In this federated setting, participants may not fully trust each other. These distributed applications use transactions as a core mechanism for ensuring reliability and consistency with persistent data. However, the coordination mechanisms needed for transactions can both...
Conference Paper
Full-text available
We present Ovid, a framework for building evolvable large-scale distributed systems that run in the cloud. Ovid constructs and deploys distributed systems as a collection of simple components, creating systems suited for containerization in the cloud. Ovid supports evolution of systems through transformations, which are automated refinements. Examp...
Conference Paper
Data networks require a high degree of performance and reliability as mission-critical IoT deployments increasingly depend on them. Although performance and fault tolerance can be individually addressed at all levels of the networking stack, few solutions tackle these challenges in an elegant and scalable manner. We propose a redundant array of ind...
Conference Paper
A Supercloud is a "CrossCloud": It is an Infrastructure-as-a-Service (IaaS) that goes beyond federated or hybrid clouds and gives its users direct control over cloud deployments--even across different underlying cloud providers (Jia et al. 2015). It supports privileged cloud operations such as migration across autonomous cloud providers even if the...
Conference Paper
In this paper, we explore the use of live VM migration to take advantage of spot markets such as provided by Amazon and Google. These markets provide an exciting low cost alternative to regular VM instances, but the threats of price spikes and premature termination severely limit their usability. Migration can address these threats: spot market ins...
Conference Paper
Full-text available
Configuring large distributed computations is a challenging task. Efficiently executing distributed computations requires configuration tuning based on careful examination of application and hardware properties. Considering the large number of parameters and impracticality of using trial and error in a production environment, programmers tend to ma...
Conference Paper
Consensus is a basic building block in middleware configuration services [4, 18]. While such services are designed to tolerate crash failures in asynchronous settings, they may not stand up well to Denial-of-Service (DoS) attacks. Specifically, malicious clients can carefully craft workloads that substantially degrade the performance of many state-...
Technical Report
Full-text available
Cryptocurrencies, based on and led by Bitcoin, have shown promise as infrastructure for pseudonymous online payments, cheap remittance, trustless digital asset exchange, and smart contracts. However, Bitcoin-derived blockchain protocols have inherent scalability limits that trade-off between throughput and latency and withhold the realization of th...
Article
This paper presents the omni-kernel architecture, a novel operating system architecture designed around the basic premise of pervasive monitoring and scheduling. Motivated by new requirements in virtualized environments, the architecture ensures that all resource consumption is measured, that the resource consumption resulting from a scheduling dec...
Conference Paper
Full-text available
This paper proposes a mechanism for expressing and enforcing security policies for shared data. Security policies are expressed as stateful meta-code operations; meta-code can express a broad class of policies, including access-based policies, use-based policies, obligations, and sticky policies with declassification. The meta-code is interposed in...
Article
Full-text available
An attacker who controls a computer in an overlay network can effectively control the entire overlay network if the mechanism managing membership information can successfully be targeted. This article describes Fireflies, an overlay network protocol that fights such attacks by organizing members in a verifiable pseudorandom structure so that an int...
Article
For anybody who has ever tried to implement it, Paxos is by no means a simple protocol, even though it is based on relatively simple invariants. This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details. The initial description avoids optimizations tha...
Article
Infrastructure as a Service (IaaS) clouds couple applications tightly with the underlying infrastructures and services. This vendor lock-in problem forces users to apply ad-hoc deployment strategies in order to tolerate cloud failures, and limits the ability of doing virtual machine (VM) migration and resource scaling across different clouds. This...
Article
This paper investigates cache placement on a cooperative cache built from individual client caches in an online social network or web service. We use a service that maintains a mapping between content and the clients that cache it, and propose cache placement schemes that leverage relationships between clients (for example, social links) and worklo...
Article
The robustness of distributed systems is usually phrased in terms of the number of failures of certain types that they can withstand. However, these failure models are too crude to describe the different kinds of trust and expectations of participants in the modern world of complex, integrated systems extending across different owners, networks, an...
Article
Modern Web services rely extensively upon a tier of in-memory caches to reduce request latencies and alleviate load on backend servers. Within a given cache, items are typically partitioned across cache servers via consistent hashing, with the goal of balancing the number of items maintained by each cache server. Effects of consistent hashing vary...
Article
Full-text available
In-memory read-only caches are widely used in cloud infrastructure to reduce access latency and to reduce load on backend databases. Operators view coherent caches as impractical at genuinely large scale and many client-facing caches are updated in an asynchronous manner with best-effort pipelines. Existing incoherent cache technologies do not supp...
Patent
Full-text available
In a method for improving the efficiency of a search engine in accessing, searching and retrieving information in the form of documents stored in document or content repositories, the search engine comprises an array of search nodes hosted on one or more servers. An index of the stored document is created. The search engine processes a user search...
Article
Full-text available
NEBULA is a proposal for a Future Internet Architecture. It is based on the assumptions that: (1) cloud computing will comprise an increasing fraction of the application workload offered to an Internet, and (2) that access to cloud computing resources will demand new architectural features from a network. Features that we have identified include de...
Conference Paper
Full-text available
State machine replication (SMR) is a well-known technique able to provide fault-tolerance. SMR consists of sequencing client requests and executing them against replicas in the same order; thanks to deterministic execution, every replica will reach the same state after the execution of each request. However, SMR is not scalable since any replica ad...
Conference Paper
Replication is a widely used technique to provide high-availability to online services. While being an effective way to mask failures, replication comes at a price: at least twice as much hardware and energy are required to mask a single failure. In a context where the electricity drawn by data centers worldwide is increasing each year, there is a...
Conference Paper
Operators of the nationwide power grid use proprietary data networks to monitor and manage their power distribution systems. These purpose-built, wide area communication networks connect a complex array of equipment ranging from PMUs and synchrophasors to SCADA systems. Collectively, this equipment forms part of an intricate feedback system that en...
Conference Paper
Fault-tolerant distributed systems often contain complex error handling code. Such code is hard to test or model-check because there are often too many possible failure scenarios to consider. As we will demonstrate in this paper, formal methods have evolved to a state in which it is possible to generate this code along with correctness guarantees....
Conference Paper
This paper describes and evaluates Sprinkler, a reliable high-throughput broadcast facility for geographically dispersed datacenters. For scaling cloud services, datacenters use caching throughout their infrastructure. Sprinkler can be used to broadcast update events that invalidate cache entries. The number of recipients can scale to many thousand...
Conference Paper
Full-text available
This paper examines the workload of Facebook's photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook's image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional ca...
Conference Paper
Most if not all datacenter services use sharding and replication for scalability and reliability. Shards are more-or-less independent of one another and individually replicated. In this paper, we challenge this design philosophy and present a replication protocol where the shards interact with one another: A protocol running within shards ensures l...
Conference Paper
Today, Infrastructure-as-a-Service (IaaS) cloud providers such as Amazon's Elastic Compute Cloud (EC2), Google's Compute Engine, and Microsoft's Azure offer elastic and isolated compute resources via virtualization, and users often choose one of these providers based on price, locality, performance, and features. Typically, a user will choose the s...
Article
Full-text available
Paxos, Viewstamped Replication, and Zab are replication protocols that ensure high-availability in asynchronous environments with crash failures. Various claims have been made about similarities and differences between these protocols. But how does one determine whether two protocols are the same, and if not, how significant the differences are? We...
Conference Paper
Some network protocols tie application state to underlying TCP connections, leading to unacceptable service outages when an endpoint loses TCP state during fail-over or migration. For example, BGP ties forwarding tables to its control plane connections so that the failure of a BGP endpoint can lead to widespread routing disruption, even if it recov...
Conference Paper
Full-text available
The NEBULA Future Internet Architecture (FIA) project is focused on a future network that enables the vision of cloud computing [8,12] to be realized. With computation and storage moving to data centers, networking to these data centers must be several orders of magnitude more resilient for some applications to trust cloud computing and enable thei...
Conference Paper
The collection and prompt analysis of synchrophasor measurements is a key step towards enabling the future smart power grid, in which grid management applications would be deployed to monitor and react intelligently to changing conditions. The potential exists to slash inefficiencies and to adaptively reconfigure the grid to take better advantage o...
Article
Practical systems must often guarantee that changes to the system state are durable. Examples of such systems are databases, file systems, and messaging middleware with guaranteed delivery. One common way of implementing durability while keeping performance ...
Conference Paper
We present a new class of Byzantine-tolerant State Machine Replication protocols for asynchronous environments that we term Byzantine Chain Replication. We demonstrate two implementations that present different trade-offs between performance and security, and compare these with related work. Leveraging an external reconfiguration service, these pro...
Article
Full-text available
We propose embedding executable code fragments in cryptographically protected capabilities to enable flexible discretionary access control in cloud-like computing infrastructures. We are developing this as part of a sports analytics application that runs on a federation of public and enterprise clouds. The capability mechanism is implemented comple...
Conference Paper
This paper describes ShadowDB, a replicated version of the BerkeleyDB database. ShadowDB is a primary-backup based replication protocol where failure handling, the critical part of the protocol, is taken care of by a synthesized consensus service that is correct by construction. The service has been proven correct semiautomatically by the Nuprl pro...
Conference Paper
We present a fault-tolerant ordered broadcast service that is correct-by-construction. Our broadcast service allows for diversity in space, whereby the participants in the broadcast protocol run different code, as well as in time, whereby the protocol itself is changed periodically. We use the Nuprl proof assistant to specify the service, prove cor...
Article
Full-text available
There are pressing economic as well as environmental arguments for the overhaul of the current outdated power grid, and its replacement with a Smart Grid that integrates new kinds of green power generating systems, monitors power use, and adapts consumption to match power costs and system load. This paper identifies some of the computing needs for...
Article
Coordination in a distributed system is facilitated if there is a unique process, the leader, to manage the other processes. The leader creates edicts and sends them to other processes for execution or forwarding to other processes. The leader may fail, and when this occurs a leader election protocol selects a replacement. This paper describes Neri...
Article
Full-text available
Attack-tolerant distributed systems change their protocols on-the-fly in response to apparent attacks from the environment; they substitute functionally equivalent versions possibly more resistant to detected threats. Alternative protocols can be packaged together as a single adaptive protocol or variants from a formal protocol library can be sen...
Article
Today’s Internet often suffers transient outages, but as increasingly critical services migrate to the cloud, much higher levels of Internet availability will be necessary. The stunning shift toward cloud computing has created new pressures on the Internet. Loads are soaring, and many applications increasingly depend on real-time data streaming. Un...
Article
Gossip protocols are known to be highly robust in scenarios with high churn, but if the data that is being gossiped becomes corrupted, a protocol's very robustness can make it hard to fix the problem. All participants need to be taken down, any disk-based data needs to be scrubbed, the cause of the corruption needs to be fixed, and only then can pa...
Conference Paper
Full-text available
The chapter studies how to provide clients with access to a replicated object that is logically indistinguishable from accessing a single yet highly available object. We study this problem under two different models. In the first, we assume that failures can be detected accurately. In the second we drop this assumption, making the model more realis...
Article
Full-text available
Gossip protocols are an important building block of many large-scale systems. They have inherent load-balancing properties as long as nodes are deployed over a network with a "flat" topology, that is, a topology where any pair of nodes may engage in a gossip exchange. Unfortunately, the Internet is not flat in the sense that firewalls and NAT boxes...
Article
Full-text available
There has been considerable interest in reliability services such as Google's Chubby and Yahoo's Zookeeper, and in the State Machine Replication model, the standard way of formalizing them. Yet, traditional SMR treatments omit a formal analysis of reconfiguration as actually implemented in production settings. We develop such a model; it ensures th...
Article
Full-text available
In designing and building distributed systems, it is common engineering practice to separate steady-state ("normal") operation from abnormal events such as recovery from failure. This way the normal case can be optimized extensively while recovery can be amortized. However, integrating the recovery procedure with the steady-state protocol is often...
Article
Full-text available
Peer-to-peer (P2P) architectures are popular for tasks such as collaborative download, VoIP telephony, and backup. To maximize performance in the face of widely variable storage capacities and bandwidths, such systems typically need to shift work from poor nodes to richer ones. Similar requirements are seen in today's large data centers, where mach...
Article
Full-text available