Dahlia Malkhi

Microsoft, Redmond, Washington, United States

Publications (168)

  • ABSTRACT: CORFU is a global log which clients can append-to and read-from over a network. Internally, CORFU is distributed over a cluster of machines in such a way that there is no single I/O bottleneck to either appends or reads. Data is fully replicated for fault tolerance, and a modest cluster of about 16--32 machines with SSD drives can sustain 1 million 4-KByte operations per second. The CORFU log enabled the construction of a variety of distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services.
    ACM Transactions on Computer Systems (TOCS). 12/2013; 31(4).
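The append/read shape of the CORFU log described above can be illustrated with a tiny in-process stand-in. This is a hypothetical sketch of the interface style only, not CORFU's actual API or its distribution machinery.

```python
# Hypothetical, single-process stand-in for the client-facing shape of a shared
# log: append() assigns and returns a global position, read() fetches the entry
# stored at a position. The real system distributes these entries over a cluster.

class SharedLog:
    def __init__(self):
        self._entries = []                     # stands in for the replicated cluster

    def append(self, data: bytes) -> int:
        """Append a record and return the log position it was assigned."""
        self._entries.append(data)
        return len(self._entries) - 1

    def read(self, position: int) -> bytes:
        """Return the record previously appended at `position`."""
        return self._entries[position]

log = SharedLog()
pos = log.append(b"put k=v")
assert log.read(pos) == b"put k=v"
```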
  • Source
    ABSTRACT: Distributed systems are easier to build than ever with the emergence of new, data-centric abstractions for storing and computing over massive datasets. However, similar abstractions do not exist for storing and accessing meta-data. To fill this gap, Tango provides developers with the abstraction of a replicated, in-memory data structure (such as a map or a tree) backed by a shared log. Tango objects are easy to build and use, replicating state via simple append and read operations on the shared log instead of complex distributed protocols; in the process, they obtain properties such as linearizability, persistence and high availability from the shared log. Tango also leverages the shared log to enable fast transactions across different objects, allowing applications to partition state across machines and scale to the limits of the underlying log without sacrificing consistency.
    Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles; 11/2013
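A minimal sketch of the pattern the Tango abstract describes, assuming nothing about Tango's real API: a map replica appends mutations to a shared log and rebuilds its in-memory view by replaying the log, so any replica sharing the log converges to the same state.

```python
import json

class LogBackedMap:
    """Toy replicated map whose only replication mechanism is a shared log."""

    def __init__(self, shared_log):
        self.log = shared_log          # shared, append-only list of records
        self.view = {}                 # in-memory materialized state
        self.applied = 0               # length of the log prefix replayed so far

    def put(self, key, value):
        # A mutation is just an append; there is no replica-to-replica protocol.
        self.log.append(json.dumps({"op": "put", "k": key, "v": value}))

    def get(self, key):
        self._sync()                   # catch up with the log before serving reads
        return self.view.get(key)

    def _sync(self):
        while self.applied < len(self.log):
            entry = json.loads(self.log[self.applied])
            if entry["op"] == "put":
                self.view[entry["k"]] = entry["v"]
            self.applied += 1

shared = []                            # stand-in for the shared log
a, b = LogBackedMap(shared), LogBackedMap(shared)
a.put("color", "blue")
assert b.get("color") == "blue"        # b observes a's update by replaying the log
```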
  • Dahlia Malkhi, Jean-Philippe Martin
    ABSTRACT: The Spanner project reports that one can build practical large-scale systems that combine strong semantics with geo-distribution. In this review manuscript, we provide insight on how Spanner's concurrency control provides both read-only transactions which avoid locking data, and strong consistency.
    ACM SIGACT News 09/2013; 44(3):73-77.
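The lock-free read-only transactions mentioned above rest on multiversioned data plus meaningful commit timestamps. The toy below shows that general mechanism only; it is not Spanner's actual interfaces or TrueTime.

```python
class MultiVersionStore:
    """Toy multiversioned store: reads at a timestamp never block writers."""

    def __init__(self):
        self.versions = {}             # key -> list of (commit_ts, value), in ts order

    def commit_write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))

    def snapshot_read(self, key, read_ts):
        """Return the latest value with commit_ts <= read_ts, taking no locks."""
        best = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts <= read_ts:
                best = value
        return best

store = MultiVersionStore()
store.commit_write("x", "v1", commit_ts=10)
store.commit_write("x", "v2", commit_ts=20)
assert store.snapshot_read("x", read_ts=15) == "v1"   # consistent snapshot at ts=15
```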
  • Source
    ABSTRACT: The basic block I/O interface used for interacting with storage devices hasn't changed much in 30 years. With the advent of very fast I/O devices based on solid-state memory, it becomes increasingly attractive to make many devices directly and concurrently available to many clients. However, when multiple clients share media at fine grain, retaining data consistency is problematic: SCSI, IDE, and their descendants don't offer much help. We propose an interface to networked storage that reduces an existing software implementation of a distributed shared log to hardware. Our system achieves both scalable throughput and strong consistency, while obtaining significant benefits in cost and power over the software implementation.
    Proceedings of the 6th International Systems and Storage Conference; 06/2013
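A sketch of the storage-interface flavor the paper argues for, with hypothetical names: a position-addressed, write-once page store that many clients can share without a storage server mediating every access.

```python
class WriteOncePageStore:
    """Toy network-attached device: pages are addressed by position and written once."""

    def __init__(self, num_pages):
        self.pages = [None] * num_pages

    def write(self, position, data):
        if self.pages[position] is not None:
            raise ValueError("page already written")   # write-once semantics
        self.pages[position] = data

    def read(self, position):
        if self.pages[position] is None:
            raise KeyError("page not yet written")
        return self.pages[position]

dev = WriteOncePageStore(num_pages=1024)
dev.write(0, b"record-0")
assert dev.read(0) == b"record-0"
```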
  • Dahlia Malkhi, Robbert van Renesse
    ABSTRACT: Practical systems must often guarantee that changes to the system state are durable. Examples of such systems are databases, file systems, and messaging middleware with guaranteed delivery. One common way of implementing durability while keeping performance ...
    ACM SIGOPS Operating Systems Review 01/2013; 47(1):1-2.
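As a generic illustration of the durability guarantee the editorial is about (not a technique taken from it), a state change counts as durable once it has been forced to stable storage before being acknowledged:

```python
import os

def durable_append(path, record: bytes):
    """Append a record and return only once it has reached stable storage."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()
        os.fsync(f.fileno())          # acknowledge only after the disk has the data

durable_append("/tmp/wal.log", b"state change #1")
```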
  • ABSTRACT: We introduce here a generalized method, a new algorithm, to find the Triple-Base number system and Triple-Base chain, and hence in turn the Single Digit Triple-Base number system (SDTBNS). The proposed method is not only simpler and faster than the algorithms to ...
    ACM SIGARCH Computer Architecture News 12/2012; 40(4):1-2.
  • Source
    ABSTRACT: Dynamically changing (reconfiguring) the membership of a replicated distributed system while preserving data consistency and system availability is a challenging problem. In this paper, we show that reconfiguration can be simplified by taking advantage of certain properties commonly provided by Primary/Backup systems. We describe a new reconfiguration protocol, recently implemented in Apache Zookeeper. It fully automates configuration changes and minimizes any interruption in service to clients while maintaining data consistency. By leveraging the properties already provided by Zookeeper our protocol is considerably simpler than state of the art.
    01/2012;
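A sketch of the general shape of such a reconfiguration, not the Zookeeper protocol itself: the new membership is committed like any other ordered operation by a quorum of the current configuration, and only then activated.

```python
class ReplicatedConfig:
    """Toy membership object; a change takes effect only after the old quorum commits it."""

    def __init__(self, members):
        self.members = set(members)

    def has_quorum(self, acks):
        return len(set(acks) & self.members) > len(self.members) // 2

    def reconfigure(self, new_members, acks_from_current):
        if not self.has_quorum(acks_from_current):
            raise RuntimeError("change not committed by a quorum of the current config")
        self.members = set(new_members)          # activate the new configuration

cfg = ReplicatedConfig({"A", "B", "C"})
cfg.reconfigure({"A", "B", "C", "D", "E"}, acks_from_current={"A", "B"})
assert cfg.members == {"A", "B", "C", "D", "E"}
```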
  • Source
    ABSTRACT: The paper proposes microsharding, a relational alternative for the recent procedural approaches with large-scale data stores to support OLTP workloads elastically. It employs a declarative specification, called transaction classes, of constraints applied ...
    ACM SIGOPS Operating Systems Review 01/2012;
  • Source
    ABSTRACT: CORFU--which stands for Clusters of Redundant Flash Units, and also for an island near Paxos in Greece--organizes a cluster of flash devices as a single, shared log that can be accessed concurrently by multiple clients over the network. The CORFU shared log makes it easy to build distributed applications that require strong consistency at high speeds, such as databases, transactional key-value stores, replicated state machines, and metadata services. CORFU can be viewed as a distributed SSD, providing advantages over conventional SSDs such as distributed wear-leveling, network locality, fault tolerance, incremental scalability and geo-distribution. A single CORFU instance can support up to 200K appends/sec, while reads scale linearly with cluster size. Importantly, CORFU is designed to work directly over network-attached flash devices, slashing cost, power consumption and latency by eliminating storage servers.
    01/2011;
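One reason there is no single I/O bottleneck is that clients can locate any log position themselves. The hypothetical round-robin layout below illustrates that kind of deterministic position-to-devices mapping; the real mapping is more elaborate.

```python
def replica_set(position, units, replication=2):
    """Deterministically map a global log position to the devices holding its replicas."""
    start = position % len(units)
    return [units[(start + r) % len(units)] for r in range(replication)]

units = [f"flashunit-{i}" for i in range(8)]
print(replica_set(0, units))    # ['flashunit-0', 'flashunit-1']
print(replica_set(5, units))    # ['flashunit-5', 'flashunit-6']
```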
  • Source
    ABSTRACT: Corfu exposes a cluster of flash devices to applications as a single, shared log. Applications can append data to this log or read from the middle. Internally, this shared log is implemented as a distributed log spread over the flash cluster. There are two reasons why this design makes sense: From a bottom-up perspective, flash requires log-structured writes to ensure even and minimal wear-out as well as high throughput. By implementing a distributed log, we eliminate the need for low-level logging on each flash device. This means we can operate over very dumb flash chips directly attached to the network, resulting in massive savings in power and infrastructure cost (basically, we don't need storage servers any more). From a top-down perspective, a really fast flash-based shared log is great for applications that need strong consistency, such as databases, transactional key-value stores and metadata services. We can run a database at speeds that saturate the raw flash. For some types of strongly consistent operations (like atomic updates), we are able to run at a few hundred thousand operations per second. The current Corfu implementation has been deployed over a cluster of 32 Intel X25M server-attached SSDs. This deployment currently supports 400K 4KB reads/sec and 200K 4KB appends/sec. Several applications have been prototyped over Corfu, including a transactional key-value store and a fully replicated database. While we are still evaluating these applications, the initial results are promising; for instance, our key-value store can support atomic multi-gets and multi-puts involving ten 4KB keys each at speeds of 40K/sec and 20K/sec, respectively.
    01/2011;
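A toy illustration (not the prototype above) of why a shared log makes atomic multi-key operations natural: a multi-put is a single append, so readers replaying the log see either all of its updates or none of them.

```python
import json

log = []                                         # stand-in for the Corfu log

def multi_put(pairs):
    log.append(json.dumps(pairs))                # one append == one atomic step

def multi_get(keys):
    state = {}
    for entry in log:                            # replay entries in log order
        state.update(json.loads(entry))
    return {k: state.get(k) for k in keys}

multi_put({"a": 1, "b": 2})
multi_put({"b": 3, "c": 4})
assert multi_get(["a", "b", "c"]) == {"a": 1, "b": 3, "c": 4}
```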
  • Source
    ABSTRACT: This article deals with the emulation of atomic read/write (R/W) storage in dynamic asynchronous message passing systems. In static settings, it is well known that atomic R/W storage can be implemented in a fault-tolerant manner even if the system is completely asynchronous, whereas consensus is not solvable. In contrast, all existing emulations of atomic storage in dynamic systems rely on consensus or stronger primitives, leading to a popular belief that dynamic R/W storage is unattainable without consensus. In this article, we specify the problem of dynamic atomic read/write storage in terms of the interface available to the users of such storage. We discover that, perhaps surprisingly, dynamic R/W storage is solvable in a completely asynchronous system: we present DynaStore, an algorithm that solves this problem. Our result implies that atomic R/W storage is in fact easier than consensus, even in dynamic systems.
    J. ACM. 01/2011; 58:7.
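For context on the static case the abstract refers to, here is a compact sketch of the classic majority-quorum emulation of atomic R/W storage (timestamped values plus a read write-back). DynaStore's contribution, changing the membership itself without consensus, is not captured by this toy.

```python
class Replica:
    def __init__(self):
        self.ts, self.val = 0, None

def majority(replicas):
    # Any two majorities intersect; the toy simply uses the first ceil(n/2)+ replicas.
    return replicas[: len(replicas) // 2 + 1]

def write(replicas, value):
    ts = max(r.ts for r in majority(replicas)) + 1   # go above a majority's timestamps
    for r in majority(replicas):                     # store the value at a majority
        if ts > r.ts:
            r.ts, r.val = ts, value

def read(replicas):
    latest = max(majority(replicas), key=lambda r: r.ts)
    for r in majority(replicas):                     # write-back to preserve atomicity
        if latest.ts > r.ts:
            r.ts, r.val = latest.ts, latest.val
    return latest.val

nodes = [Replica() for _ in range(5)]
write(nodes, "hello")
assert read(nodes) == "hello"
```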
  • ABSTRACT: Two or more mobile entities, called agents or robots, starting at distinct initial positions in some environment, have to meet. This task is known in the literature as rendezvous. Among many alternative assumptions that have been used to study the rendezvous ...
    Distributed Computing - 25th International Symposium, DISC 2011, Rome, Italy, September 20-22, 2011. Proceedings; 01/2011
  • Source
    ABSTRACT: We consider data-centric distributed storage, where storage-nodes are directly attached to the network. We present DynaDisk, the first read/write storage system that allows clients to add and remove storage devices in a completely decentralized manner, and without stopping ongoing read/write operations. DynaDisk supports two alternative approaches to reconfiguration, one partially synchronous (consensus-based) and one asynchronous. We evaluate DynaDisk on a LAN cluster and compare these two reconfiguration methods.
    07/2010;
  • ABSTRACT: Modern storage solutions, such as non-volatile solid-state devices, offer unprecedented speed of access over high-bandwidth interconnects. An array of flash memory chips attached directly to a 1-10 GB fiber switch can support up to 100K page writes per second. While no single host can drive such throughput, the combined power of a large group of clients, accessing the shared storage over a common interconnect, can utilize the system at full capacity.
    Distributed Computing, 24th International Symposium, DISC 2010, Cambridge, MA, USA, September 13-15, 2010. Proceedings; 01/2010
  • Source
    ABSTRACT: There has been considerable interest in reliability services such as Google's Chubby and Yahoo's Zookeeper, and in the State Machine Replication model, the standard way of formalizing them. Yet, traditional SMR treatments omit a formal analysis of reconfiguration as actually implemented in production settings. We develop such a model; it ensures that members of the new configuration start with full knowledge of the finalized state of the prior configuration. To evaluate the approach, we develop a new implementation of atomic multicast, and evaluate its performance under a range of reconfiguration scenarios.
    01/2010;
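The key property described above (the new configuration starts with full knowledge of the prior configuration's finalized state) can be sketched as: seal the old configuration so it accepts no further commands, capture its final state, and start the new configuration from that state. Illustration only; all names are hypothetical.

```python
class Configuration:
    def __init__(self, members, state=None):
        self.members = set(members)
        self.state = list(state or [])         # commands executed in this configuration
        self.sealed = False

    def execute(self, cmd):
        if self.sealed:
            raise RuntimeError("configuration is sealed; no further commands accepted")
        self.state.append(cmd)

def reconfigure(old_cfg, new_members):
    old_cfg.sealed = True                       # wedge the old configuration
    return Configuration(new_members, state=old_cfg.state)   # start from its finalized state

c1 = Configuration({"A", "B", "C"})
c1.execute("op1")
c2 = reconfigure(c1, {"B", "C", "D"})
assert c2.state == ["op1"]
```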
  • Source
    ABSTRACT: We provide the first sparse covers and probabilistic partitions for graphs excluding a fixed minor that have strong diameter bounds; i.e., each set of the cover/partition has a small diameter as an induced sub-graph. Using these results we provide improved distributed name-independent routing schemes. Specifically, given a graph excluding a minor on r vertices and a parameter ρ > 0 we obtain the following results: (1) a polynomial algorithm that constructs a set of clusters such that each cluster has a strong diameter of O(r^2 ρ) and each vertex belongs to 2^O(r) r! clusters; (2) a name-independent routing scheme with a stretch of O(r^2), headers of O(log n + r log r) bits, and tables of size 2^O(r) r! log^4 n / log log n bits; (3) a randomized algorithm that partitions the graph such that each cluster has strong diameter O(r 6^r ρ) and the probability an edge (u,v) is cut is O(r d(u,v)/ρ).
    Theory of Computing Systems 01/2010; 47:837-855. · 0.48 Impact Factor
  • Source
    ABSTRACT: We give randomized agreement algorithms with constant expected running time in asynchronous systems subject to process failures, where up to a minority of processes may fail. We consider three types of process failures: crash, omission, and Byzantine. For crash or omission failures, we solve consensus assuming private channels or a public-key infrastructure, respectively. For Byzantine failures, we solve weak Byzantine agreement assuming a public-key infrastructure and a broadcast primitive called weak sequenced broadcast. We show how to obtain weak sequenced broadcast using a minimal trusted platform module. The presented algorithms are simple, have optimal resilience, and have optimal asymptotic running time. They work against a sophisticated adversary that can adaptively schedule messages, processes, and failures based on the messages seen by faulty processes.
    Distributed Computing, 24th International Symposium, DISC 2010, Cambridge, MA, USA, September 13-15, 2010. Proceedings; 01/2010
  • Source
    ABSTRACT: We live in a world of Internet services such as email, social networks, web searching, and more, which must store increasingly larger volumes of data. These services must run on cheap infrastructure, hence they must use distributed storage systems; and they have to provide reliability of data for long periods as well as availability, hence they must support online reconfiguration to remove failed nodes and add healthy ones. The knowledge needed to implement online reconfiguration is subtle and simple techniques often fail to work well. This tutorial provides an introductory overview of this topic, including a description of the main technical challenges, as well as the various approaches that are used to address these challenges.
    01/2010;
  • Source
    ABSTRACT: In designing and building distributed systems, it is common engineering practice to separate steady-state ("normal") operation from abnormal events such as recovery from failure. This way the normal case can be optimized extensively while recovery can be amortized. However, integrating the recovery procedure with the steady-state protocol is often far from obvious, and can present subtle difficulties. This issue comes to the forefront in modern data centers, where applications are often implemented as elastic sets of replicas that must reconfigure while continuing to provide service, and where it may be necessary to install new versions of active services as bugs are fixed or new functionality is introduced. Our paper explores this topic in the context of a dynamic reconfiguration model of our own design that unifies two widely popular prior approaches to the problem: virtual synchrony, a model and associated protocols for reliable group communication, and state machine replication (in particular, Paxos), a model and protocol for replicating some form of deterministic functionality specified as an event-driven state machine.
    01/2010;
  • Source
    Leslie Lamport, Dahlia Malkhi, Lidong Zhou
    ABSTRACT: Reconfiguration means changing the set of processes executing a distributed system. We explain several methods for reconfiguring a system implemented using the state-machine approach, including some new ones. We discuss the relation between these methods and earlier reconfiguration algorithms--especially view changing in group communication.
    SIGACT News. 01/2010; 41:63-73.

Publication Stats

4k Citations
19.66 Total Impact Points

Institutions

  • 2007–2013
    • Microsoft
      Redmond, Washington, United States
  • 1999–2010
    • Hebrew University of Jerusalem
      • Rachel and Selim Benin School of Computer Science and Engineering
      Jerusalem, Jerusalem District, Israel
  • 1999–2001
    • University of Texas at Austin
      • Department of Computer Science
      Austin, Texas, United States
  • 1997–2001
    • AT&T Labs
      Florham Park, New Jersey, United States