Ken Birman
Cornell University | CU · Department of Computer Science

PhD

About

422
Publications
34,428
Reads
16,015
Citations
Additional affiliations
August 1982 - present
Cornell University
Position
  • N. Rama Rao Professor

Publications (422)
Preprint
Full-text available
Leveraging one-sided RDMA for applications that replicate small data objects can be surprisingly difficult: such uses amplify any protocol overheads. Spindle is a set of optimization techniques for systematically tackling this class of challenges for atomic multicast over RDMA. These include memory polling optimizations using novel sender and recei...
Conference Paper
Full-text available
Safety-critical applications have always struggled to balance strong consistency with high performance. Kernel-bypassing technologies like RDMA (Remote Direct Memory Access) promise to ease this tension by offering fast communication options that can fully leverage modern hardware. We introduce DerechoDDS, an OMG-compliant data distribution service...
Article
Deploying machine learning into IoT cloud settings will require an evolution of the cloud infrastructure. In this white paper, we justify this assertion and identify new capabilities needed for real-time intelligent systems. We also outline our initial efforts to create a new edge architecture more suitable for ML. Although the work is still underw...
Article
Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and includes mechanisms that assist in restart after...
Chapter
The deployment of Phasor Measurement Units (PMUs) could support a new generation of wide area monitoring and situational awareness systems, but this has not yet occurred. Instead, PMU data exchange occurs through bilateral agreements, each reflecting substantial human involvement. Our work proposes a new model for PMU data sharing, based upon today...
Article
Applications that aggregate and query data from distributed embedded devices are of interest in many settings, such as smart buildings and cities, the smart power grid, and mobile health applications. However, such devices also pose serious privacy concerns due to the personal nature of the data being collected. In this article, we present an algor...
Article
The continuing rollout of phasor measurement units (PMUs) enables wide area monitoring and control (WAMS/WACS), but the difficulty of sharing data in a secure, scalable, cost-effective, low-latency manner limits exploitation of this new capability by bulk electric power grid operators. GridCloud is an open-source platform for real time data acquisi...
Conference Paper
The coming generation of Internet-of-Things (IoT) applications will process massive amounts of incoming data while supporting data mining and online learning. In cases with demanding real-time requirements, such systems behave as smart memories: a high-bandwidth service that captures sensor input, processes it using machine-learning tools, replicat...
Article
A new system for automated analysis of "Holter" 24-hour ECG recordings has been developed at Columbia University. The current system is substantially changed from the previous system, Columbia III, and runs under the UNIX time-sharing operating system. All programs have been written in the high-level, block-structured language "C". A central compo...
Conference Paper
Full-text available
Many applications perform real-time analysis on data streams. We argue that existing solutions are poorly matched to the need, and introduce our new Freeze-Frame File System. Freeze-Frame FS is able to accept streams of updates while satisfying "temporal reads" on demand. The system is fast and accurate: we keep all update history in a memory-mappe...
Conference Paper
Data networks require a high degree of performance and reliability as mission-critical IoT deployments increasingly depend on them. Although performance and fault tolerance can be individually addressed at all levels of the networking stack, few solutions tackle these challenges in an elegant and scalable manner. We propose a redundant array of ind...
Conference Paper
Ken Birman's talk focused on controversies surrounding fault-tolerance and consistency. Looking at the 1990s, he pointed to debate around the so-called CATOCS question (CATOCS refers to causally and totally ordered communication primitives) and drew a parallel to the more modern debate about consistency at cloud scale (often referred to as the CAP...
Article
Read-only caches are widely used in cloud infrastructures to reduce access latency and load on backend databases. Operators view coherent caches as impractical at genuinely large scale and many client-facing caches are updated asynchronously with best-effort pipelines. Existing solutions that support cache consistency are inapplicable to this scena...
Article
Full-text available
New technologies for computerized metering and data collection in the electrical power grid promise to create a more efficient, cost-effective, and adaptable smart grid. However, naive implementations of smart grid data collection could jeopardize the privacy of consumers, and concerns about privacy are a significant obstacle to the rollout of smar...
Article
Modern Web services rely extensively upon a tier of in-memory caches to reduce request latencies and alleviate load on backend servers. Within a given cache, items are typically partitioned across cache servers via consistent hashing, with the goal of balancing the number of items maintained by each cache server. Effects of consistent hashing vary...
Article
This experience report presents the results of an extensive performance evaluation conducted using four open-source implementations of Paxos deployed in Amazon's EC2. Paxos is a fundamental algorithm for building fault-tolerant services, at the core of state-machine replication. Implementations of Paxos are currently used in many prototypes and pro...
Article
Full-text available
In-memory read-only caches are widely used in cloud infrastructure to reduce access latency and to reduce load on backend databases. Operators view coherent caches as impractical at genuinely large scale and many client-facing caches are updated in an asynchronous manner with best-effort pipelines. Existing incoherent cache technologies do not supp...
Article
Full-text available
NEBULA is a proposal for a Future Internet Architecture. It is based on the assumptions that: (1) cloud computing will comprise an increasing fraction of the application workload offered to an Internet, and (2) that access to cloud computing resources will demand new architectural features from a network. Features that we have identified include de...
Conference Paper
The developers of today’s cloud computing systems are expected to not only create applications that will work well at scale, but also to create management services that will monitor run-time conditions and intervene to address problems as conditions evolve. Management tasks are generally not performance intensive, but robustness is critical: when a...
Article
In smart power grids it is possible to match supply and demand by applying control mechanisms that are based on fine-grained load prediction. A crucial component of every control mechanism is monitoring, that is, executing queries over the network of smart meters. However, smart meters can learn so much about our lives that if we are to use such me...
Conference Paper
Operators of the nationwide power grid use proprietary data networks to monitor and manage their power distribution systems. These purpose-built, wide area communication networks connect a complex array of equipment ranging from PMUs and synchrophasors to SCADA systems. Collectively, this equipment forms part of an intricate feedback system that en...
Conference Paper
Full-text available
This paper examines the workload of Facebook's photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook's image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional ca...
Conference Paper
The big-data community generally favors a two-stage methodology whereby data is first collected, then uploaded for analysis using tools like MapReduce. During analysis the data won't change; this simplifies fault-tolerance and makes it worthwhile to cache intermediary results. In contrast, when it is necessary to capture data continuously and query...
Article
Energy accounts for a significant fraction of the operational costs of a data center, and data center operators are increasingly interested in moving toward low-power designs. Two distinct approaches have emerged toward achieving this end: the power-proportional approach focuses on reducing disk and server power consumption, while the green data ce...
Conference Paper
Some network protocols tie application state to underlying TCP connections, leading to unacceptable service outages when an endpoint loses TCP state during fail-over or migration. For example, BGP ties forwarding tables to its control plane connections so that the failure of a BGP endpoint can lead to widespread routing disruption, even if it recov...
Conference Paper
Full-text available
The NEBULA Future Internet Architecture (FIA) project is focused on a future network that enables the vision of cloud computing [8,12] to be realized. With computation and storage moving to data centers, networking to these data centers must be several orders of magnitude more resilient for some applications to trust cloud computing and enable thei...
Conference Paper
Applications used to evaluate next-generation electrical power grids ("smart grids") are anticipated to be compute and data-intensive. In this work, we parallelize and improve performance of one such application which was run sequentially prior to the use of our cloud-based configuration. We examine multiple cloud computing offerings, both commercia...
Conference Paper
The collection and prompt analysis of synchrophasor measurements is a key step towards enabling the future smart power grid, in which grid management applications would be deployed to monitor and react intelligently to changing conditions. The potential exists to slash inefficiencies and to adaptively reconfigure the grid to take better advantage o...
Article
Full-text available
Two new algorithms are given for randomized consensus in a shared-memory model with an oblivious adversary. Each is based on a new construction of a conciliator, an object that guarantees termination and validity, but that only guarantees agreement with ...
Article
Full-text available
There are pressing economic as well as environmental arguments for the overhaul of the current outdated power grid, and its replacement with a Smart Grid that integrates new kinds of green power generating systems, monitors power use, and adapts consumption to match power costs and system load. This paper identifies some of the computing needs for...
Article
Full-text available
This paper presents a new object-oriented approach to modeling the semantics of distributed multi-party protocols such as leader election, distributed locking, or reliable multicast, and a programming language that supports it. The approach builds on and extends our live distributed objects model [37] by introducing a new concept of a distributed f...
Article
Full-text available
We propose a novel, declarative approach to implementing reliable multi-party protocols that enables efficient and scalable implementations. Our Properties Framework (PF) is able to express semantics as simple as gossip or resource cleanup, or as complex as transactions, consensus, and virtual synchrony. Protocols written in the PF compile to a...
Article
Full-text available
Component integration environments such as Microsoft .NET and J2EE have become widely popular with application developers, who benefit from standardized memory management, system-wide type checking, debugging, and performance analysis tools that operate across component boundaries. This paper describes QuickSilver Scalable Multicast (QSM), a n...
Article
Full-text available
An increasingly important class of applications requires high-speed online processing of massive quantities of real-time information — examples include sensor network monitoring, deep packet inspection, financial calculators
Article
Full-text available
New data-consistency models make it possible for cloud computing developers to replicate soft state without encountering the limitations associated with the CAP theorem. The CAP theorem explores tradeoffs between consistency, availability, and partition tolerance, and concludes that a replicated service can have just two of these three properties....
Chapter
This Guide to Reliable Distributed Systems describes the key concepts, principles and implementation options for creating high-assurance cloud computing solutions. In combination with the Isis² software platform, the text offers a practical path to success in this vital emerging area. Opening with a broad technical overview, the guide then delves i...
Chapter
In previous chapters we looked at cloud computing from the outside; here, we will do so from the inside, within the data center. We focus on some well-known cloud components in enough detail to appreciate the basic ideas, why they work (and when they might not work), and we will speculate a bit about how they might be generalized for use in other s...
Chapter
The context established by Chap. 10 provides conceptual tools to develop an automated and highly dynamic membership tracking service, in which membership of a system varies as service instances are launched and join an active system, shut down and must leave it, or crash. We solve the problem in steps, first showing how a system can track its own m...
Chapter
We first encountered the transactional execution model in Chap. 7, in conjunction with client/server architectures. As noted at that time, the model draws on a series of assumptions to arrive at a style of computing that is especially well matched to the needs of applications operating on databases. In this chapter we consider some of the details t...
Chapter
Readers of this text will have learned a great deal about Cornell’s Isis² platform, available for download from Cornell (under FreeBSD licensing) from the web site http://www.cs.cornell.edu/ken/isis2. Here we present the Isis² API.
Chapter
Client/server computing can be recast into an object-oriented model in ways that greatly simplify application development. Here we focus on a key technology that had an unusually large influence on the emergence of the cloud: CORBA, a powerful object oriented standard that lives on within frameworks like the Java runtime environment or Microsoft’s...
Chapter
This chapter extends the protocols of Chap. 12 with mechanisms required for integrating group communication with point-to-point communication, and for dealing with large numbers of groups that might be extensively overlapped.
Chapter
Many systems evolve incrementally and hence the need arises to retrofit reliability into existing and often very complex systems. Here we discuss some of the major options for performing that task without needing to recode the existing application from scratch.
Chapter
Protocol design is just part of the challenge when building distributed systems: our protocols also need to be presented in an easily-used form that lends itself to high performance and scalability. This chapter discusses some of the challenges. The Isis² system, available for use by readers of this text, is an example of one embodiment of the idea...
Chapter
Up to now we have looked at cloud computing from a fairly high level, and used terms such as “client” and “server” in ways intended to evoke the reader’s intuition into the way that modern computing systems work: our mobile devices, laptops and desktop systems operate fairly autonomously, requesting services from servers that might run in a machine...
Chapter
In this chapter, we consider a number of protocols representative of a new wave of research and commercial activity in distributed computing. The protocols in question share two characteristics. First, they exploit what are called peer-to-peer communication patterns. Peer-to-peer computing is in some ways a meaningless categorization, since all of...
Chapter
The previous chapter looked at cloud computing from a client’s perspective. Late in the discussion we touched on one of the ways that cloud computing is forcing the Internet itself to evolve. In this chapter, we will say more about that topic. Dominant at the network level are issues stemming from the need of the cloud to control routing and mainta...
Chapter
Before jumping into the question of how to make systems reliable, it will be useful to briefly understand the reasons that distributed systems fail. In this chapter we discuss some of the thinking around failure: a surprisingly rich and varied technical topic.
Chapter
This chapter begins a more systematic analysis of the major components of the cloud architecture. We consider client platform architectures and issues of mobility, the cloud and the Internet, where questions of control over routing dominate, and the multi-tiered structure of cloud computing datacenters. Within this broad context, many standards ari...
Chapter
Previous chapters of this book have made a number of uses of clocks or time in distributed protocols. In this chapter, we look more closely at the underlying issues. Our focus is on aspects of real-time computing that are specific to distributed protocols and systems.
Chapter
This book is intended for use by professionals or advanced students, and the material presented is at a level for which simple problems are not entirely appropriate. Accordingly, most of the problems in this appendix are intended as the basis for essay-style responses or for programming projects, which might build upon the technologies we have trea...
Chapter
The protocols of Chap. 12 represent a new kind of building block from which a wide variety of sophisticated systems can be constructed. Here, we do so, exploring the needed mechanisms that will let us take the step from a world of protocols that live in isolation to a full-fledged toolkit for implementing applications. When the properties of the mo...
Chapter
In this and the next two chapters, we will be focused on mechanisms for replicating data and computation while guaranteeing some form of consistent behavior to the end-user. For example, we might want to require that even though information has been replicated, the system behaves as if that information was not replicated and instead resides at a si...
Chapter
Our goal in Chap. 12 is to identify the best options for implementing high-speed data replication and other tools needed for fault-tolerant, highly assured Web Services and other forms of distributed computing. Given the GMS created in Chap. 11, one option would be to plunge right in and build replicated applications using the protocol directly in...
Chapter
In this chapter, we drill down on some of the major issues that a cloud computing client system must handle.
Chapter
In this chapter we look at the technical underpinnings of modern security technologies. Our treatment won’t substitute for a full course in web security, but will provide details for the mechanisms used elsewhere in the text.
Conference Paper
The April 2011 DOE workshop, 'Computational Needs for the Next Generation Electric Grid', was the culmination of a year-long process to bring together some of the Nation's leading researchers and experts to identify computational challenges associated with the operation and planning of the electric power system. The attached papers provide a journe...
Article
Today’s Internet often suffers transient outages, but as increasingly critical services migrate to the cloud, much higher levels of Internet availability will be necessary. The stunning shift toward cloud computing has created new pressures on the Internet. Loads are soaring, and many applications increasingly depend on real-time data streaming. Un...
Conference Paper
Full-text available
We present a power-lean storage system, where racks of servers, or even entire data center shipping containers, can be powered down to save energy. We show that racks and containers are more than the sum of their servers, and demonstrate the feasibility of designing a storage system that powers them up and down on demand; further, we show that such...
Article
Full-text available
The global network of data centers is emerging as an important distributed systems paradigm: commodity clusters running high-performance applications, connected by high-speed “lambda” networks across hundreds of milliseconds of network latency. Packet loss on long-haul networks can cripple applications and protocols: a loss rate as low as 0.1% is su...
Article
Full-text available
The Cornell University Distributed Systems research group conducted a lecture series at the Air Force Research Laboratory Information Directorate (AFRL/RI) in Rome, NY. A total of eight half-day workshops on technologies for building robust cloud computing solutions were held at Rome Research Site. During these workshops, AF Research Laboratory and...
Article
The Cornell Live Information Objects (LIO) effort was undertaken to overcome a limitation of the tools used to create modern information-enabled applications. When using the Global Information Grid (GIG) standards, existing technologies assume continuous connectivity to some form of data center. As a result, while it is not difficult to build power...
Article
Full-text available
We design and implement a novel class of highly precise network instrumentation, capable of the first-ever capture of exact packet timings of network traffic. Our instrumentation, combining real-time physics test equipment with off-line postprocessing software, prevents interference with the system under test, provides reproducible measurements b...
Conference Paper
Full-text available
While Web Services ensure interoperability and extensibility for networked applications, they also complicate the deployment of highly collaborative systems, such as virtual reality environments and massively multiplayer online games. Quite simply, such systems often manifest a natural peer-to-peer structure. This conflicts with Web Services’ impos...
Article
Full-text available
When replicating data in a cloud computing setting, it is common to send updates using reliable dissemination mechanisms such as network overlay trees. We show that as data centers scale up, such multicast schemes manifest various performance and stability problems.
Conference Paper
Full-text available
The paper introduces Self-Replicating Objects (SROs), a new concurrent programming abstraction. An SRO is implemented and used much like an ordinary .NET object and can expose arbitrary user-defined APIs, but it is aggressive about automatically exploiting multicore CPUs. It does so by spontaneously and transparently partitioning its state into a s...
Conference Paper
Full-text available
High-bandwidth, semi-private optical lambda networks carry growing volumes of data on behalf of large data centers, both in cloud computing environments and for scientific, financial, defense, and other enterprises. This paper undertakes a careful examination of the end-to-end characteristics of an uncongested lambda network running at high s...
Article
Full-text available
We introduce the Live Objects framework, which leverages our distributed object-oriented programming model and enables tactical edge mashups for battlefield command and control. Unlike most deployed web services, which are typically limited to client-server interactions, Live Objects can simultaneously support multiple patterns of communication, in...
Article
Full-text available
Gossip-based protocols are commonly used for diffusing information in large-scale distributed applications. GO (Gossip Objects) is a per-node gossip platform that we developed in support of this class of protocols. GO allows nodes to join multiple gossip groups without losing the appealing fixed bandwidth guarantee of gossip protocols, and the plat...
Article
Full-text available
Today's Rich Internet Application (RIA) technologies such as Ajax, Flex, or Silverlight, are designed around the client-server paradigm and cannot easily take advantage of replication, publish-subscribe, or peer-to-peer mechanisms for better scalability or responsiveness. This is particularly true of storage: content is typically persisted in data...
Conference Paper
Full-text available
Network bottlenecks, firewalls, restrictions on IP Multicast availability and administrative policies have long prevented the use of multicast even where the fit seems obvious. The confusion around multicast poses a problem for large-scale pub/sub-based applications that need blazing speed even across WAN networks. There are a number of multicast p...
Conference Paper
Full-text available
In this chapter, we discuss a widely used fault-tolerant data replication model called virtual synchrony. The model responds to two kinds of needs. First, there is the practical question of how best to embed replication into distributed systems. Virtual synchrony defines dynamic process groups that have self-managed membership. Applications can j...
Conference Paper
Full-text available
This talk overviews some of the lower bounds on the complexity of implementing software transactional memory, and explains their underlying assumptions. The talk will discuss how these lower bounds align with experimental results and design choices made ...
Conference Paper
Full-text available
Data centers avoid IP Multicast (IPMC) because of a series of problems with the technology. We introduce Dr. Multicast (MCMD), a system that maps IPMC operations to a combination of point-to-point unicast and traditional IPMC transmissions. MCMD optimizes the use of IPMC addresses within a data center, while simultaneously respecting an administrat...
Conference Paper
Full-text available
We design and implement a novel class of highly precise network instrumentation and apply this tool to perform the first exact packet-timing measurements of a wide-area network ever undertaken, capturing 10 Gigabit Ethernet packets in flight on optical fiber. Through principled design, we improve timing precision by two to six orders of magnitude o...