Article

MixT: a language for mixing consistency in geodistributed transactions

Authors: Matthew Milano, Andrew C. Myers

Abstract

Programming concurrent, distributed systems is hard—especially when these systems mutate shared, persistent state replicated at geographic scale. To enable high availability and scalability, a new class of weakly consistent data stores has become popular. However, some data needs strong consistency. To manipulate both weakly and strongly consistent data in a single transaction, we introduce a new abstraction: mixed-consistency transactions, embodied in a new embedded language, MixT. Programmers explicitly associate consistency models with remote storage sites; each atomic, isolated transaction can access a mixture of data with different consistency models. Compile-time information-flow checking, applied to consistency models, ensures that these models are mixed safely and enables the compiler to automatically partition transactions. New run-time mechanisms ensure that consistency models can also be mixed safely, even when the data used by a transaction resides on separate, mutually unaware stores. Performance measurements show that despite their stronger guarantees, mixed-consistency transactions retain much of the speed of weak consistency, significantly outperforming traditional serializable transactions.
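MixT itself is an embedded C++ DSL, and the paper's syntax is not reproduced here; the Python sketch below only illustrates the core idea from the abstract: values carry the consistency level of the store they were derived from, and a write is rejected when weakly consistent information would flow into a more strongly consistent store. The store names, levels, and API are invented for illustration.

from dataclasses import dataclass

STRENGTH = {"eventual": 0, "causal": 1, "serializable": 2}  # weakest to strongest

@dataclass
class Labeled:
    value: object
    level: str   # consistency level of the store the value was derived from

def combine(a, b, op):
    # A derived value is only as strong as its weakest input.
    weaker = a.level if STRENGTH[a.level] <= STRENGTH[b.level] else b.level
    return Labeled(op(a.value, b.value), weaker)

class Store:
    def __init__(self, name, level):
        self.name, self.level, self.data = name, level, {}
    def read(self, key):
        return Labeled(self.data.get(key, 0), self.level)
    def write(self, key, labeled):
        # Information-flow check: weak data must not influence strong data.
        if STRENGTH[labeled.level] < STRENGTH[self.level]:
            raise TypeError(
                f"flow from {labeled.level} data into {self.level} store rejected")
        self.data[key] = labeled.value

# A "mixed-consistency transaction" in miniature: read a counter from an
# eventually consistent store and a balance from a serializable store.
cache = Store("cache", "eventual")
bank = Store("bank", "serializable")
bank.data["balance"] = 100
cache.data["hits"] = 7

hits = cache.read("hits")
balance = bank.read("balance")
cache.write("hits", combine(hits, Labeled(1, "serializable"), lambda a, b: a + b))  # allowed
try:
    bank.write("balance", combine(balance, hits, lambda a, b: a + b))  # rejected
except TypeError as e:
    print(e)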


... We define our analysis over an augmented version of the OOlong [4] core language, with constructs to (1) identify which fields are replicated and, of these, which allow for weak consistency, (2) specify invariants over the replicated fields, and (3) define guards for method executions. The language shares similarities with MixT [19] or DCCT [22]. (Figure 1: bank accounts replicated among several sites; calls that require coordination are marked red, the remainder green.) ...
... In Kraska et al. [12], DCCT [22] and MixT [19], as in Ant, consistency guarantees are bound to data. However, in these proposals, the programmer must reason about the consistency levels, explicitly binding data to consistency categories, such as serializable or eventual, which contrasts with Ant's weak. ...
Preprint
Full-text available
Distributed applications often deal with data with different consistency requirements: while a post in a social network only requires weak consistency, the user balance in turn has strong correctness requirements, demanding mutations to be synchronised. To deal efficiently with sequences of operations on different replicas of a distributed application, it is useful to know which operations commute with each other, and thus when an operation that does not require synchronisation can be anticipated with respect to one that does, avoiding unnecessary waits. Herein we present a language-based static analysis that extracts, at compile time, information about which operations commute with which other operations; this information can be used by the run-time support to decide on call anticipations of operations in replicas without compromising consistency. We illustrate the formal analysis on several paradigmatic examples and briefly present a proof-of-concept implementation in Java.
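As a concrete illustration of the commutativity information such an analysis extracts, the toy Python sketch below (an invented bank-account example, not the cited analysis, which works statically at compile time) simply tests whether two operations produce the same state in either application order:

def deposit(amount):
    return lambda state: state + amount

def withdraw_if_covered(amount):
    # Guarded withdrawal: only applies when the balance stays non-negative.
    return lambda state: state - amount if state >= amount else state

def commute(op1, op2, states):
    # Two operations commute when both application orders agree on every state.
    return all(op1(op2(s)) == op2(op1(s)) for s in states)

sample_states = range(0, 200, 10)
print(commute(deposit(30), deposit(50), sample_states))              # True
print(commute(deposit(30), withdraw_if_covered(50), sample_states))  # False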
... connections between transactional isolation and distributed consistency guarantees [29]. Another theme across a number of our recent results is the composition of services that offer different consistency guarantees [31,67,68]. The composition of multiple services with different consistency guarantees is a signature of modern cloud computing that needs to be brought more explicitly into programming frameworks. ...
... The first problem amounts to dataflow analysis across HydroLogic handlers; this is easy to do conservatively in a static analysis of a HydroLogic program, though we may desire more nuanced conditional solutions enforced at runtime. The question of metaconsistency is related to our prior work on mixed consistency of black-box services [31,67,68]. In the Hydro context we may use third-party services, but we also expect to have plenty of white-box HydroLogic code, where we have the flexibility to change the consistency specs across modules to make them consistent with the consistency of endpoint specifications. ...
Preprint
Full-text available
Nearly twenty years after the launch of AWS, it remains difficult for most developers to harness the enormous potential of the cloud. In this paper we lay out an agenda for a new generation of cloud programming research aimed at bringing research ideas to programmers in an evolutionary fashion. Key to our approach is a separation of distributed programs into a PACT of four facets: Program semantics, Availability, Consistency and Targets of optimization. We propose to migrate developers gradually to PACT programming by lifting familiar code into our more declarative level of abstraction. We then propose a multi-stage compiler that emits human-readable code at each stage that can be hand-tuned by developers seeking more control. Our agenda raises numerous research challenges across multiple areas including language design, query optimization, transactions, distributed consistency, compilers and program synthesis.
... The information-flow type system of ConSysT ensures that mixing consistency levels does not violate consistency guarantees. That is, Weak data never flows into Strong objects, as such a flow could degrade their strong consistency guarantees [4]. ...
... Operations have to be labeled red when they can violate application invariants if executed concurrently. MixT [4] uses a type system to enforce correct mixing of data located in different data stores (with possibly different consistency guarantees). In contrast to MixT, ConSysT defines consistency as part of the data and not at the level of operations or data stores. ...
Conference Paper
Data replication is essential in scenarios like geo-distributed datacenters and edge computing. Yet, it poses a challenge for data consistency. Developers either adopt high consistency to the detriment of performance, or they embrace low consistency and face a much higher programming complexity. We argue that language abstractions should support associating the level of consistency with data types. We present ConSysT, a programming language and middleware that provides abstractions to specify consistency types, enabling different consistency levels to be mixed in the same application. Such a mechanism is fully integrated with object-oriented programming, and the type system guarantees that different levels can be mixed only in a correct way.
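A rough intuition for consistency attached to data types, in a hypothetical Python rendering rather than ConSysT's actual Java API: strongly consistent objects update every replica before returning, while weakly consistent objects accept local writes and reconcile replicas lazily. Object and replica names are invented.

import random

class Replica:
    def __init__(self):
        self.data = {}

REPLICAS = [Replica(), Replica(), Replica()]

class Strong:
    """Writes reach every replica before returning (coordination, in this toy model)."""
    def __init__(self, key, initial=0):
        self.key = key
        for r in REPLICAS:
            r.data[key] = initial
    def write(self, value):
        for r in REPLICAS:
            r.data[self.key] = value
    def read(self):
        return REPLICAS[0].data[self.key]

class Weak:
    """Writes go to one replica and propagate lazily (high availability)."""
    def __init__(self, key, initial=0):
        self.key = key
        for r in REPLICAS:
            r.data[key] = initial
    def write(self, value):
        random.choice(REPLICAS).data[self.key] = value
    def read(self):
        return random.choice(REPLICAS).data[self.key]   # may be stale
    def anti_entropy(self):
        newest = max(r.data[self.key] for r in REPLICAS)
        for r in REPLICAS:
            r.data[self.key] = newest

balance = Strong("balance", 100)   # needs strong consistency
likes = Weak("likes", 0)           # tolerates staleness
likes.write(1)
print(likes.read())                # 0 or 1 until anti-entropy runs
likes.anti_entropy()
print(likes.read())                # 1
balance.write(90)
print(balance.read())              # 90 on every replica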
... It is also known that many geo-distributed applications do not require strong consistency for every operation [36] and that many such applications are dominated by operations that are correct even if executed over a weakly-consistent view. This observation has motivated the advent of geo-replicated systems supporting hybrid (or mixed) consistency models [43], [42], [34]. To the best of our knowledge, NimbleChain is the first to introduce hybrid consistency models to permissionless environments. ...
Preprint
Full-text available
Nakamoto's seminal work gave rise to permissionless blockchains -- as well as a wide range of proposals to mitigate its performance shortcomings. Despite substantial throughput and energy efficiency achievements, most proposals only bring modest (or marginal) gains in transaction commit latency. Consequently, commit latencies in today's permissionless blockchain landscape remain prohibitively high for latency-sensitive geo-distributed applications. This paper proposes NimbleChain, which extends standard permissionless blockchains with a fast path that delivers consensusless promises of commitment. This fast path supports cryptocurrency transactions and only takes a small fraction of the original commit latency, while providing consistency guarantees that are strong enough to ensure correct cryptocurrencies. Since today's general-purpose blockchains also support smart contract transactions, which typically have (strong) sequential consistency needs, NimbleChain implements a hybrid consistency model that also supports strongly-consistent applications. To the best of our knowledge, NimbleChain is the first system to bring together fast consensusless transactions with strongly-consistent consensus-based transactions in a permissionless setting. We implement NimbleChain as an extension of Ethereum and evaluate it in a 500-node geo-distributed deployment. The results show that the average latency to promise a transaction is an order of magnitude faster than consensus-based commit, with minimal overhead when compared with a vanilla Ethereum implementation.
... While confidentiality and integrity are mainstay features of information-flow tracking, information-flow labels can also express other properties. For instance, MixT [MM18] describes how to use information-flow labels to create safe transactions across databases with different consistency models, and the work of Zheng and Myers [ZM05] uses information-flow labels to provide availability guarantees. FLAFOL allows such alternative interpretations of labels by using an abstract permission model to give meaning to labels. ...
Preprint
Full-text available
We present the Flow-Limited Authorization First-Order Logic (FLAFOL), a logic for reasoning about authorization decisions in the presence of information-flow policies. We formalize the FLAFOL proof system, characterize its proof-theoretic properties, and develop its security guarantees. In particular, FLAFOL is the first logic to provide a non-interference guarantee while supporting all connectives of first-order logic. Furthermore, this guarantee is the first to combine the notions of non-interference from both authorization logic and information-flow systems. All theorems in this paper are proven in Coq.
... Improvements in scalability and throughput of Atomic Commit implementations, and other optimizations related to consistency, are widely applicable to databases [1, 2, 7, 18], programming languages [3, 10-12, 17] and distributed systems in general [6, 9, 19]. ...
Conference Paper
Full-text available
In high-throughput distributed applications, such as large-scale banking systems, synchronization between objects becomes a bottleneck. This short paper focuses on research, in close collaboration with ING Bank, on the opportunity of leveraging application-specific knowledge captured by model-driven engineering approaches to increase application performance in high-contention scenarios, while maintaining functional application-level consistency.
... IPA stores adapt consistency to system load within the user-specified bound. Similarly, MixT [Milano and Myers 2018] is a language that allows transactions to access different stores with varying consistency guarantees and applies information-flow analysis to prevent less-consistent data from influencing more-consistent data. In contrast to our approach, which infers the required synchronization and dependencies, IPA and MixT require the user to explicitly associate consistency conditions with objects and stores. ...
Article
Full-text available
Distributed system replication is widely used as a means of fault-tolerance and scalability. However, it provides a spectrum of consistency choices that impose a dilemma for clients between correctness, responsiveness and availability. Given a sequential object and its integrity properties, we automatically synthesize a replicated object that guarantees state integrity and convergence and avoids unnecessary coordination. Our approach is based on a novel sufficient condition for integrity and convergence called well-coordination that requires certain orders between conflicting and dependent operations. We statically analyze the given sequential object to decide its conflicting and dependent methods and use this information to avoid coordination. We present novel coordination protocols that are parametric in terms of the analysis results and provide the well-coordination requirements. We implemented a tool called Hamsaz that can automatically analyze the given object, instantiate the protocols and synthesize replicated objects. We have applied Hamsaz to a suite of use-cases and synthesized replicated objects that are significantly more responsive than the strongly consistent baseline.
Article
We present batch-based consistency, a new approach for consistency optimization that allows programmers to specialize consistency with application-level integrity properties. We implement the approach with a two-step process: we statically infer optimal consistency requirements for executions of bounded sets of operations, and then, use the inferred requirements to parameterize a new distributed protocol to relax operation reordering at run time when it is safe to do so. Our approach supports standard notions of consistency. We implement batch-based consistency in Peepco, demonstrate its expressiveness for partial data replication, and examine Peepco’s run-time performance impact in different settings.
Article
Local-first software embraces data replication as a means to achieve scalability and offline availability. A crucial ingredient of local-first software is mergeable data types, like conflict-free replicated data types (CRDTs), which feature eventual consistency by enabling processes to access data locally and later merge it with other replicas in an asynchronous manner. Notably, the merging process needs to adhere to application constraints for correctness. Ensuring such application-level invariants poses a challenge, as developers must reason about the replicated program state and resort to manual synchronization of specific application components to enforce the invariant. This paper introduces ConLoc (Consistent Local-First Software), a novel system designed to automatically enforce safety and maintain invariants in local-first applications. ConLoc effectively addresses the issue of preserving invariants in the execution of programs with replicated data types, including CRDTs. Our approach is able to verify the correctness of many CRDTs examined in the literature and in implementations, such as the ones used in the Riak database. ConLoc ensures that applications are automatically synchronized correctly, resulting in substantial latency and throughput improvements when compared to sequential execution, while upholding the same set of invariants.
Article
Despite decades of research and practical experience, developers have few tools for programming reliable distributed applications without resorting to expensive coordination techniques. Conflict-free replicated datatypes (CRDTs) are a promising line of work that enable coordination-free replication and offer certain eventual consistency guarantees in a relatively simple object-oriented API. Yet CRDT guarantees extend only to data updates; observations of CRDT state are unconstrained and unsafe. We propose an agenda that embraces the simplicity of CRDTs, but provides richer, more uniform guarantees. We extend CRDTs with a query model that reasons about which queries are safe without coordination by applying monotonicity results from the CALM Theorem, and lay out a larger agenda for developing CRDT data stores that let developers safely and efficiently interact with replicated application state.
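The sketch below illustrates the distinction drawn above, using a grow-only set CRDT: a monotone query (membership) is stable and therefore safe to answer from any replica without coordination, whereas a non-monotone query (exact size) can be invalidated by a later merge. This is an illustrative example, not the proposed query model itself.

class GSet:
    """Grow-only set CRDT: state only grows, merge is a join (set union)."""
    def __init__(self):
        self.items = set()
    def add(self, x):
        self.items.add(x)
    def merge(self, other):            # commutative, associative, idempotent
        self.items |= other.items
    def contains(self, x):             # monotone query: once true, stays true
        return x in self.items
    def size_is_exactly(self, n):      # non-monotone query: can flip back to False
        return len(self.items) == n

a, b = GSet(), GSet()
a.add("alice"); b.add("bob")
print(a.contains("alice"))      # True now, and true forever: coordination-free
print(a.size_is_exactly(1))     # True now, but not stable
a.merge(b)
print(a.size_is_exactly(1))     # False: the earlier answer was retracted by the merge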
Article
Geo-replication is essential for providing low latency response and quality Internet services. However, designing fast and correct geo-replicated services is challenging due to the complex trade-off between performance and consistency semantics in optimizing the expensive cross-site coordination. State-of-the-art solutions rely on programmers to derive sufficient application-specific invariants and code specifications, which is both time-consuming and error-prone. In this paper, we propose an end-to-end geo-replication deployment framework AutoGR (AUTOmated Geo-Replication) to free programmers from such labor-intensive tasks. AutoGR enables the geo-replication features for non-replicated, serializable applications in an automated way with optimized performance and correct application semantics. Driven by a novel static analyzer RIGI, AutoGR can extract application invariants by verifying whether their geo-replicated versions obey the serializable semantics of the non-replicated application. RIGI takes application code as input and infers a set of side effects and path conditions possibly leading to consistency violations. RIGI employs the Z3 theorem prover to identify pairs of conflicting side effects and feeds them to a geo-replication framework for automated cross-site deployment. We evaluate AutoGR by transforming four serializable and originally non-replicated DB-compliant applications to geo-replicated ones across 3 sites. Compared with state-of-the-art human-intervention-free automated approaches (e.g., strong consistency), AutoGR reduces latency by up to 61.8% and achieves up to 2.12X higher peak throughput. Compared with state-of-the-art approaches relying on a manual analysis (e.g., PoR), AutoGR can quickly enable the geo-replication feature with zero human intervention while offering similarly low latency and high throughput.
Article
Nakamoto’s seminal work gave rise to permissionless blockchains – as well as a wide range of proposals to mitigate their performance shortcomings. Despite substantial throughput and energy efficiency achievements, most proposals only bring modest (or marginal) gains in transaction commit latency. Consequently, commit latencies in today’s permissionless blockchain landscape remain prohibitively high. This paper proposes NimbleChain, a novel algorithm that extends permissionless blockchains based on Nakamoto consensus with a fast path that delivers causal promises of commitment, or simply promises. Since promises only partially order transactions, their latency is only a small fraction of the totally-ordered commitment latency of Nakamoto consensus. Still, the weak consistency guarantees of promises are strong enough to correctly implement cryptocurrencies. To the best of our knowledge, NimbleChain is the first system to bring together fast, partially-ordered transactions with consensus-based, totally-ordered transactions in a permissionless setting. This hybrid consistency model is able to speed up cryptocurrency transactions while still supporting smart contracts, which typically have (strong) sequential consistency needs. We implement NimbleChain as an extension of Ethereum and evaluate it in a 500-node geo-distributed deployment. The results show NimbleChain can promise cryptocurrency transactions up to an order of magnitude faster than a vanilla Ethereum implementation, with marginal overheads.
Article
Conflict-free replicated data types (CRDTs) are a promising tool for designing scalable, coordination-free distributed systems. However, constructing correct CRDTs is difficult, posing a challenge for even seasoned developers. As a result, CRDT development is still largely the domain of academics, with new designs often awaiting peer review and a manual proof of correctness. In this paper, we present Katara, a program synthesis-based system that takes sequential data type implementations and automatically synthesizes verified CRDT designs from them. Key to this process is a new formal definition of CRDT correctness that combines a reference sequential type with a lightweight ordering constraint that resolves conflicts between non-commutative operations. Our process follows the tradition of work in verified lifting, including an encoding of correctness into SMT logic using synthesized inductive invariants and hand-crafted grammars for the CRDT state and runtime. Katara is able to automatically synthesize CRDTs for a wide variety of scenarios, from reproducing classic CRDTs to synthesizing novel designs based on specifications in existing literature. Crucially, our synthesized CRDTs are fully, automatically verified, eliminating entire classes of common errors and reducing the process of producing a new CRDT from a painstaking paper proof of correctness to a lightweight specification.
Article
In this article we study the properties of distributed systems that mix eventual and strong consistency. We formalize such systems through acute cloud types (ACTs), abstractions similar to conflict-free replicated data types (CRDTs), which by default work in a highly available, eventually consistent fashion, but which also feature strongly consistent operations for tasks that require global agreement. Unlike other mixed-consistency solutions, ACTs can rely on efficient quorum-based protocols, such as Paxos. Hence, ACTs gracefully tolerate machine and network failures also for the strongly consistent operations. We formally study ACTs and demonstrate phenomena which are present neither in purely eventually consistent nor in strongly consistent systems. In particular, we identify temporary operation reordering, which implies interim disagreement between replicas on the relative order in which the client requests were executed. When not handled carefully, this phenomenon may lead to undesired anomalies, including circular causality. We prove an impossibility result which states that temporary operation reordering is unavoidable in mixed-consistency systems with sufficiently complex semantics. Our result is startling, because it shows that an apparent strengthening of the semantics of a system (by introducing strongly consistent operations to an eventually consistent system) results in the weakening of the guarantees on the eventually consistent operations.
Chapter
Distributed systems address the increasing demand for fast access to resources and fault tolerance for data. However, due to scalability requirements, software developers need to trade consistency for performance. For certain data, consistency guarantees may be weakened if application correctness is unaffected. In contrast, data flow from data with weak consistency to data with strong consistency requirements is problematic, since application correctness may be broken. In this paper, we propose LCD, a higher-order static consistency-typed language with replicated data types. The type system of LCD statically enforces noninterference between data types with weak consistency and data types with strong consistency. This means that updates of weakly-consistent data can never affect strongly-consistent data. Finally, our main theorem guarantees sequential consistency for so-called con operations in our language.
Conference Paper
Large-scale distributed systems must embrace the trade-off between consistency and availability, accepting lower levels of consistency to guarantee higher availability. Existing programming languages are, however, agnostic to this compromise, resulting in consistency guarantees that are the same for the whole application and are implicitly adopted from the middleware or hardcoded in configuration files. In this paper, we propose to integrate availability in the design of an object-oriented language, allowing developers to specify different consistency and isolation constraints in the same application at the granularity of single objects. We investigate how availability levels interact with object structure and define a type system that preserves correct program behavior. Our evaluation shows that our solution performs efficiently and improves the design of distributed applications.
Article
Current programming models only provide abstractions for sharing data under a homogeneous consistency model. It is, however, not uncommon for a distributed application to provide strong consistency for one part of the shared data and eventual consistency for another part. Because mixing consistency models is not supported by current programming models, writing such applications is extremely difficult. In this paper we propose CScript, a distributed object-oriented programming language with built-in support for data replication. At its core are consistent and available replicated objects. CScript regulates the interactions between these objects to avoid subtle inconsistencies that arise when mixing consistency models. Our evaluation compares a collaborative text editor built atop CScript with a state-of-the-art implementation. The results show that our approach is flexible and more memory efficient.
Conference Paper
Distributed key-value (KV) stores are a rising alternative to traditional relational databases since they provide a flexible yet simple data model. Recent KV stores use eventual consistency to ensure fast reads and writes as well as high availability. Support for eventual consistency is, however, still very limited, as typically only a handful of replicated data types are provided. Moreover, modern applications maintain various types of data, some of which require strong consistency whereas others require high availability. Implementing such applications remains cumbersome due to the lack of support for data consistency in today's KV stores. In this paper we propose Squirrel, an open implementation of an in-memory distributed KV store. The core idea is to reify distribution through consistency models and protocols. We implement two families of consistency models (strong consistency and strong eventual consistency) and several consistency protocols, including two-phase commit and CRDTs.
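As a reminder of what the strongly consistent end of that spectrum involves, here is a deliberately simplified sketch of two-phase commit, one of the protocols mentioned above (no failure handling, logging, or timeouts; participant names and the local validation rule are invented):

class Participant:
    def __init__(self, name):
        self.name, self.committed, self.staged = name, {}, None
    def prepare(self, update):
        ok = all(v >= 0 for v in update.values())   # toy local validation rule
        self.staged = update if ok else None
        return ok                                    # the participant's vote
    def commit(self):
        self.committed.update(self.staged)
        self.staged = None
    def abort(self):
        self.staged = None

def two_phase_commit(participants, update):
    votes = [p.prepare(update) for p in participants]   # phase 1: voting
    if all(votes):                                       # phase 2: decision
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

nodes = [Participant("eu"), Participant("us")]
print(two_phase_commit(nodes, {"stock": 5}))    # committed everywhere or nowhere
print(two_phase_commit(nodes, {"stock": -1}))   # aborted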
Conference Paper
Full-text available
In this paper, we present a novel adaptive and workload-driven partitioning framework, named Helios, aiming to achieve low-latency and high-throughput online queries in distributed graph stores. As each workload typically contains popular or similar queries, our partitioning method uses the existing workload to capture active vertices and edges which are frequently visited and traversed respectively. This information is used to heuristically improve the quality of partitions either by avoiding the concentration of active vertices in a few partitions proportional to their visit frequencies or by reducing the probability of the cut of active edges proportional to their traversal frequencies. In order to assess the impact of Helios on a graph store, and to show how easily the approach can be plugged on top of the system, we exploit it in a distributed, graph-based RDF store. The query engine of the store exploits Helios to reduce or eliminate data communication for future queries and balance the load among nodes. We evaluate the store by using realistic query workloads over an RDF dataset. Our results demonstrate the ability of Helios to handle varying query workloads with minimum overhead while maintaining the quality of partitions over time, along with being scalable by increasing either the data size or the number of computing nodes.
Article
Full-text available
The integration of transactions into hardware relaxed memory architectures is a topic of current research both in industry and academia. In this paper, we provide a general architectural framework for the introduction of transactions into models of relaxed memory in hardware, including the SC, TSO, ARMv8 and PPC models. Our framework incorporates flexible and expressive forms of transaction aborts and execution that have hitherto been in the realm of software transactional memory. In contrast to software transactional memory, we account for the characteristics of relaxed memory as a restricted form of distributed system, without a notion of global time. We prove abstraction theorems to demonstrate that the programmer API matches the intuitions and expectations about transactions.
Article
Full-text available
Serializability is a well-understood correctness criterion that simplifies reasoning about the behavior of concurrent transactions by ensuring they are isolated from each other while they execute. However, enforcing serializable isolation comes at a steep cost in performance and hence database systems in practice support, and often encourage, developers to implement transactions using weaker alternatives. Unfortunately, the semantics of weak isolation is poorly understood, and usually explained only informally in terms of low-level implementation artifacts. Consequently, verifying high-level correctness properties in such environments remains a challenging problem. To address this issue, we present a novel program logic that enables compositional reasoning about the behavior of concurrently executing weakly-isolated transactions. Recognizing that the proof burden necessary to use this logic may dissuade application developers, we also describe an inference procedure based on this foundation that ascertains the weakest isolation level that still guarantees the safety of high-level consistency invariants associated with such transactions. The key to effective inference is the observation that weakly-isolated transactions can be viewed as functional (monadic) computations over an abstract database state, allowing us to treat their operations as state transformers over the database. This interpretation enables automated verification using off-the-shelf SMT solvers. Case studies and experiments of real-world applications (written in an embedded DSL in OCaml) demonstrate the utility of our approach.
Article
Full-text available
Programming with replicated objects is difficult. Developers must face the fundamental trade-off between consistency and performance head on, while struggling with the complexity of distributed storage stacks. We introduce Correctables, a novel abstraction that hides most of this complexity, allowing developers to focus on the task of balancing consistency and performance. To aid developers with this task, Correctables provide incremental consistency guarantees, which capture successive refinements on the result of an ongoing operation on a replicated object. In short, applications receive both a preliminary (fast, possibly inconsistent) result, as well as a final (consistent) result that arrives later. We show how to leverage incremental consistency guarantees by speculating on preliminary values, trading throughput and bandwidth for improved latency. We experiment with two popular storage systems (Cassandra and ZooKeeper) and three applications: a Twissandra-based microblogging service, an ad serving system, and a ticket selling system. Our evaluation on the Amazon EC2 platform with YCSB workloads A, B, and C shows that we can reduce the latency of strongly consistent operations by up to 40% (from 100ms to 60ms) at little cost (10% bandwidth increase, 6% throughput drop) in the ad system. Even if the preliminary result is frequently inconsistent (25% of accesses), incremental consistency incurs a bandwidth overhead of only 27%.
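The following toy sketch conveys the shape of incremental consistency guarantees as described above: a single read first delivers a preliminary value from a nearby replica and later a final, authoritative value. It is a simulation with invented names, not the Correctables API.

import time, threading

local_replica = {"ad_budget": 90}     # fast, possibly stale
quorum_value = {"ad_budget": 100}     # slow, authoritative (simulated)

def incremental_read(key, on_preliminary, on_final):
    on_preliminary(local_replica[key])                # delivered immediately
    def fetch_final():
        time.sleep(0.1)                               # simulated WAN round trip
        on_final(quorum_value[key])
    threading.Thread(target=fetch_final).start()

incremental_read("ad_budget",
                 on_preliminary=lambda v: print("preliminary:", v),
                 on_final=lambda v: print("final:", v))
time.sleep(0.2)   # wait for the final callback in this toy example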
Article
Full-text available
This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by the proliferation of storage engines and query languages based on the observation that "no one size fits all". To address this shift, we propose a polystore architecture; it is designed to unify querying over multiple data models. We consider the challenges and opportunities associated with polystores. Open questions in this space revolve around query optimization and the assignment of objects to storage engines. We introduce our approach to these topics and discuss our prototype in the context of the Intel Science and Technology Center for Big Data.
Article
Full-text available
User-facing online services utilize geo-distributed data stores to minimize latency and tolerate partial failures, with the intention of providing a fast, always-on experience. However, geo-distribution does not come for free; application developers have to contend with weak consistency behaviors and the lack of abstractions to composably construct high-level replicated data types, necessitating complex application logic and invariably exposing inconsistencies to the user. Some commercial distributed data stores and several academic proposals provide a lattice of consistency levels, with stronger consistency guarantees incurring increased latency and throughput costs. However, correctly assigning the right consistency level for an operation requires subtle reasoning and is often an error-prone task. In this paper, we present QUELEA, a declarative programming model for eventually consistent data stores (ECDS), equipped with a contract language capable of specifying fine-grained application-level consistency properties. A contract enforcement system analyses contracts and automatically generates the appropriate consistency protocol for the method protected by the contract. We describe an implementation of QUELEA on top of an off-the-shelf ECDS that provides support for coordination-free transactions. Several benchmarks, including two large web applications, illustrate the effectiveness of our approach.
Conference Paper
Full-text available
Online services distribute and replicate state across geographically diverse data centers and direct user requests to the closest or least loaded site. While effectively ensuring low latency responses, this approach is at odds with maintaining cross-site consistency. We make three contributions to address this tension. First, we propose RedBlue consistency, which enables blue operations to be fast (and eventually consistent) while the remaining red operations are strongly consistent (and slow). Second, to make use of fast operations whenever possible and only resort to strong consistency when needed, we identify conditions delineating when operations can be blue and when they must be red. Third, we introduce a method that increases the space of potential blue operations by breaking them into separate generator and shadow phases. We built a coordination infrastructure called Gemini that offers RedBlue consistency, and we report on our experience modifying the TPC-W and RUBiS benchmarks and an online social network to use Gemini. Our experimental results show that RedBlue consistency provides substantial performance gains without sacrificing consistency.
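A much-simplified sketch of the red/blue split and the generator/shadow idea, with invented operation names: deposits commute and cannot violate the non-negative balance invariant, so they are blue and replicate asynchronously as shadow operations; withdrawals must be red, here stood in for by a single global sequencer rather than a real coordination protocol.

import itertools

red_sequencer = itertools.count()            # stand-in for cross-site coordination

class Site:
    def __init__(self):
        self.balance, self.red_log = 100, []

    def deposit(self, amount):               # blue: safe without coordination
        self.balance += amount
        return ("deposit", amount)           # shadow op to ship to other sites

    def withdraw(self, amount):              # red: could violate balance >= 0
        seq = next(red_sequencer)            # obtain a global position first
        ok = self.balance >= amount
        if ok:
            self.balance -= amount
        self.red_log.append((seq, amount, ok))
        return ("withdraw", amount) if ok else ("noop", 0)

    def apply_remote(self, op):              # apply a shadow op from another site
        kind, amount = op
        if kind == "deposit":
            self.balance += amount
        elif kind == "withdraw":
            self.balance -= amount

eu, us = Site(), Site()
ops = [eu.deposit(20), us.withdraw(50)]
eu.apply_remote(ops[1]); us.apply_remote(ops[0])
print(eu.balance, us.balance)                # both sites converge to 70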
Article
Full-text available
Replicating data across multiple data centers not only allows moving the data closer to the user and, thus, reduces latency for applications, but also increases the availability in the event of a data center failure. Therefore, it is not surprising that companies like Google, Yahoo, and Netflix already replicate user data across geographically different regions. However, replication across data centers is expensive. Inter-data center network delays are in the hundreds of milliseconds and vary significantly. Synchronous wide-area replication is therefore considered to be infeasible with strong consistency, and current solutions either settle for asynchronous replication, which implies the risk of losing data in the event of failures, restrict consistency to small partitions, or give up consistency entirely. With MDCC (Multi-Data Center Consistency), we describe the first optimistic commit protocol that does not require a master or partitioning, and is strongly consistent at a cost similar to eventually consistent protocols. MDCC can commit transactions in a single round-trip across data centers in the normal operational case. We further propose a new programming model which empowers the application developer to handle longer and unpredictable latencies caused by inter-data center communication. Our evaluation using the TPC-W benchmark with MDCC deployed across 5 geographically diverse data centers shows that MDCC is able to achieve throughput and latency similar to eventually consistent quorum protocols and that MDCC is able to sustain a data center outage without a significant impact on response times while guaranteeing strong consistency.
Conference Paper
Full-text available
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
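The sketch below illustrates the versioning approach described in the abstract: each version carries a vector clock, causally dominated versions are discarded, and the application reconciles the remaining concurrent versions (here, a shopping cart merged by union). It is a toy model, not Dynamo's interface.

def dominates(a, b):
    """True if vector clock a is causally greater than or equal to b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def reconcile(versions):
    """Drop versions that some other version causally dominates; survivors are concurrent."""
    return [v for v in versions
            if not any(dominates(o[0], v[0]) and o is not v for o in versions)]

def merge_carts(versions):
    """Application-assisted conflict resolution: union the carts, join the clocks."""
    merged_clock, merged_cart = {}, set()
    for clock, cart in versions:
        merged_cart |= cart
        for node, count in clock.items():
            merged_clock[node] = max(merged_clock.get(node, 0), count)
    return merged_clock, merged_cart

v1 = ({"A": 1}, {"book"})                    # written via node A
v2 = ({"A": 1, "B": 1}, {"book", "pen"})     # later write via node B
v3 = ({"A": 2}, {"book", "mug"})             # concurrent write via node A
survivors = reconcile([v1, v2, v3])          # v1 is dominated; v2 and v3 conflict
print(merge_carts(survivors))                # clock {'A': 2, 'B': 1}; cart has book, pen, mug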
Article
Full-text available
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.
Article
Full-text available
Our previous work has shown that architectural and application shifts have resulted in modern OLTP databases increasingly falling short of optimal performance (10). In particular, the availability of multiple cores, the abundance of main memory, the lack of user stalls, and the dominant use of stored procedures are factors that portend a clean-slate redesign of RDBMSs. This previous work showed that such a redesign has the potential to outperform legacy OLTP databases by a significant factor. These results, however, were obtained using a bare-bones prototype that was developed just to demonstrate the potential of such a system. We have since set out to design a more complete execution platform, and to implement some of the ideas presented in the original paper. Our demonstration presented here provides insight on the development of a distributed main memory OLTP database and allows for the further study of the challenges inherent in this operating environment.
Article
Full-text available
ANSI SQL-92 defines Isolation Levels in terms of phenomena: Dirty Reads, Non-Repeatable Reads, and Phantoms. This paper shows that these phenomena and the ANSI SQL definitions fail to characterize several popular isolation levels, including the standard locking implementations of the levels. Investigating the ambiguities of the phenomena leads to clearer definitions; in addition new phenomena that better characterize isolation types are introduced. An important multiversion isolation type, Snapshot Isolation, is defined.
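Snapshot Isolation is a useful example of why such precise definitions matter: it excludes all the ANSI phenomena yet still admits anomalies such as write skew, illustrated by the small simulation below (the accounts and the invariant x + y >= 0 are invented for the example).

# Both transactions read the same snapshot, write disjoint items, and together
# violate the invariant; a serializable execution would have rejected one of them.
db = {"x": 50, "y": 50}

def withdraw(snapshot, account, amount):
    """Runs against a private snapshot; returns the write set if allowed."""
    if snapshot["x"] + snapshot["y"] - amount >= 0:    # invariant check on the snapshot
        return {account: snapshot[account] - amount}
    return {}

snap1, snap2 = dict(db), dict(db)    # both transactions start from the same snapshot
w1 = withdraw(snap1, "x", 70)        # sees x + y = 100, so the withdrawal looks safe
w2 = withdraw(snap2, "y", 70)        # also sees x + y = 100, so it looks safe too
db.update(w1); db.update(w2)         # disjoint write sets: first-committer-wins raises
                                     # no conflict, so both commit under SI
print(db, "invariant holds:", db["x"] + db["y"] >= 0)   # {'x': -20, 'y': -20} False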
Article
Full-text available
Chopping transactions into pieces is good for performance but may lead to nonserializable executions. Many researchers have reacted to this fact by either inventing new concurrency-control mechanisms, weakening serializability, or both. We adopt a different approach. We assume a user who (a) has access only to user-level tools such as (1) choosing isolation degrees 1–4, (2) the ability to execute a portion of a transaction using multiversion read consistency, and (3) the ability to reorder the instructions in transaction programs; and (b) knows the set of transactions that may run during a certain interval (users are likely to have such knowledge for on-line or real-time transactional applications). Given this information, our algorithm finds the finest chopping of a set of transactions TranSet with the following property: if the pieces of the chopping execute serializably, then TranSet executes serializably. This permits users to obtain more concurrency while preserving correctness. Besides obtaining more intertransaction concurrency, chopping transactions in this way can enhance intratransaction parallelism. The algorithm is inexpensive, running in O(n×(e+m)) time, once conflicts are identified, using a naive implementation, where n is the number of concurrent transactions in the interval, e is the number of edges in the conflict graph among the transactions, and m is the maximum number of accesses of any transaction. This makes it feasible to add as a tuning knob to real systems.
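A straightforward (unoptimized) rendering of the chopping test is sketched below: pieces of the same transaction are linked by sibling edges, conflicting pieces of different transactions by conflict edges, and the chopping is rejected if the graph contains a cycle with both edge kinds. Here that is detected by asking, for each transaction, whether two of its pieces remain connected once that transaction's own sibling edges are removed; closing such a path with the removed sibling edge yields a mixed cycle. The example transactions and piece names are invented.

from collections import defaultdict, deque

def has_mixed_cycle(pieces_by_txn, conflict_edges):
    """pieces_by_txn: {"T1": ["T1a", "T1b"], ...}
       conflict_edges: pairs of pieces from different transactions."""
    sibling_edges = [(a, b)
                     for pieces in pieces_by_txn.values()
                     for i, a in enumerate(pieces)
                     for b in pieces[i + 1:]]
    txn_of = {p: t for t, ps in pieces_by_txn.items() for p in ps}

    for txn, pieces in pieces_by_txn.items():
        # Graph with all conflict edges plus sibling edges of *other* transactions.
        adj = defaultdict(set)
        for a, b in conflict_edges + [e for e in sibling_edges
                                      if txn_of[e[0]] != txn]:
            adj[a].add(b); adj[b].add(a)
        # If two pieces of txn are still connected here, the chopping is unsafe.
        for i, start in enumerate(pieces):
            seen, queue = {start}, deque([start])
            while queue:
                for nxt in adj[queue.popleft()]:
                    if nxt not in seen:
                        seen.add(nxt); queue.append(nxt)
            if any(p in seen for p in pieces[i + 1:]):
                return True
    return False

# T1 split into two pieces that both conflict with single-piece T2: unsafe chopping.
print(has_mixed_cycle({"T1": ["T1a", "T1b"], "T2": ["T2a"]},
                      [("T1a", "T2a"), ("T1b", "T2a")]))   # True
# Removing one conflict breaks the cycle, so this chopping is safe.
print(has_mixed_cycle({"T1": ["T1a", "T1b"], "T2": ["T2a"]},
                      [("T1a", "T2a")]))                   # False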
Conference Paper
Replication protocols in distributed storage systems are fundamentally constrained by the finite propagation speed of information, which necessitates trade-offs among performance metrics even in the absence of failures. We focus on the consistency-latency trade-off, which dictates that a distributed storage system can either guarantee that clients always see the latest data, or it can guarantee that operation latencies are small (relative to the inter-data-center latencies), but not both. We propose a technique called spectral shifting for tuning this trade-off adaptively to meet an application-specific performance target in a dynamically changing environment. Experiments conducted in a real-world cloud computing environment demonstrate that our tuning framework provides superior convergence compared to a state-of-the-art solution.
Article
In wide-area distributed systems, data replication provides fault tolerance and low latency. Causal consistency in such systems is an interesting consistency model. Most existing works assume the data is fully replicated because this greatly simplifies the design of the algorithms to implement causal consistency. Recently, we proposed causal consistency under partial replication because it reduces the number of messages used under a wide range of workloads. One drawback of partial replication is that its meta-data tends to be relatively large when the message size is small. In this paper, we propose an algorithm, Approx-Opt-Track, which provides approximate causal consistency whereby we can reduce the meta-data at the cost of some violations of causal consistency. The amount of violations can be made arbitrarily small by controlling a tunable parameter that we call credits. We present analytical data to show the performance of Approx-Opt-Track. We then give simulation results to show the potential benefit of Approx-Opt-Track, viz., its ability to provide almost the same guarantees as causal consistency, at a smaller cost.
Article
Protocols have been proposed to achieve causal consistency on top of a distributed data store that does not itself make safety guarantees. Such a protocol works with an unmodified data store if it is implemented as middleware or a shim layer, although it can also be implemented inside a data store. The middleware approach, however, has required modifications to applications: applications have to explicitly specify the data dependencies to be managed. Our Letting-It-Be protocol, in contrast, handles all implicit dependencies naturally resulting from data accesses even though it is implemented as middleware. Our protocol does not require any modifications to either data stores or applications; it works with them as they are. It trades some performance for this benefit. Throughput declines, relative to the data store alone, were 21% in the best case and 78% in the worst case without multi-level management of the dependency graph, a performance optimization technique.
Article
Developing and reasoning about systems using eventually consistent data stores is a difficult challenge due to the presence of unexpected behaviors that do not occur under sequential consistency. A fundamental problem in this setting is to identify a correctness criterion that precisely captures intended application behaviors yet is generic enough to be applicable to a wide range of applications. In this paper, we present such a criterion. More precisely, we generalize conflict serializability to the setting of eventual consistency. Our generalization is based on a novel dependency model that incorporates two powerful algebraic properties: commutativity and absorption. These properties enable precise reasoning about programs that employ high-level replicated data types, common in modern systems. To apply our criterion in practice, we also developed a dynamic analysis algorithm and a tool that checks whether a given program execution is serializable. We performed a thorough experimental evaluation on two real-world use cases: debugging cloud-backed mobile applications and implementing clients of a popular eventually consistent key-value store. Our experimental results indicate that our criterion reveals harmful synchronization problems in applications, is more effective at finding them than prior approaches, and can be used for the development of practical, eventually consistent applications.
Article
Plotkin's λ-value calculus is sound but incomplete for reasoning about βη-transformations on programs in continuation-passing style (CPS). To find a complete extension, we define a new, compactifying CPS transformation and an “inverse” mapping, un-CPS, both of which are interesting in their own right. Using the new CPS transformation, we can determine the precise language of CPS terms closed under βη-transformations. Using the un-CPS transformation, we can derive a set of axioms such that every equation between source programs is provable if and only if βη can prove the corresponding equation between CPS programs. The extended calculus is equivalent to an untyped variant of Moggi's computational λ-calculus.
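For readers unfamiliar with the program form being axiomatized, the tiny Python example below shows the difference between direct style and continuation-passing style; it is only an illustration of CPS, not the paper's transformation or calculus.

def add_direct(x, y):
    return x + y

def add_cps(x, y, k):
    # Instead of returning, pass the result to the continuation k.
    return k(x + y)

def square_cps(x, k):
    return k(x * x)

# Direct style: square(add(1, 2))
print(add_direct(1, 2) ** 2)                                  # 9

# CPS: evaluation order and every intermediate result are made explicit.
print(add_cps(1, 2, lambda s: square_cps(s, lambda r: r)))    # 9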
Conference Paper
Distributed applications and web services, such as online stores or social networks, are expected to be scalable, available, responsive, and fault-tolerant. To meet these steep requirements in the face of high round-trip latencies, network partitions, server failures, and load spikes, applications use eventually consistent datastores that allow them to weaken the consistency of some data. However, making this transition is highly error-prone because relaxed consistency models are notoriously difficult to understand and test. In this work, we propose a new programming model for distributed data that makes consistency properties explicit and uses a type system to enforce consistency safety. With the Inconsistent, Performance-bound, Approximate (IPA) storage system, programmers specify performance targets and correctness requirements as constraints on persistent data structures and handle uncertainty about the result of datastore reads using new consistency types. We implement a prototype of this model in Scala on top of an existing datastore, Cassandra, and use it to make performance/correctness tradeoffs in two applications: a ticket sales service and a Twitter clone. Our evaluation shows that IPA prevents consistency-based programming errors and adapts consistency automatically in response to changing network conditions, performing comparably to weak consistency and 2-10× faster than strong consistency.
Conference Paper
This paper presents the design, implementation, and evaluation of TARDiS (Transactional Asynchronously Replicated Divergent Store), a transactional key-value store explicitly designed for weakly-consistent systems. Reasoning about these systems is hard, as neither causal consistency nor per-object eventual convergence allow applications to deal satisfactorily with write-write conflicts. TARDiS instead exposes as its fundamental abstraction the set of conflicting branches that arise in weakly-consistent systems. To this end, TARDiS introduces a new concurrency control mechanism: branch-on-conflict. On the one hand, TARDiS guarantees that storage will appear sequential to any thread of execution that extends a branch, keeping application logic simple. On the other, TARDiS provides applications, when needed, with the tools and context necessary to merge branches atomically, when and how applications want. Since branch-on-conflict in TARDiS is fast, weakly-consistent applications can benefit from adopting this paradigm not only for operations issued by different sites, but also, when appropriate, for conflicting local operations. We find that TARDiS reduces coding complexity for these applications and that judicious branch-on-conflict can improve their local throughput at each site by two to eight times.
Conference Paper
Large-scale distributed systems often rely on replicated databases that allow a programmer to request different data consistency guarantees for different operations, and thereby control their performance. Using such databases is far from trivial: requesting stronger consistency in too many places may hurt performance, and requesting it in too few places may violate correctness. To help programmers in this task, we propose the first proof rule for establishing that a particular choice of consistency guarantees for various operations on a replicated database is enough to ensure the preservation of a given data integrity invariant. Our rule is modular: it allows reasoning about the behaviour of every operation separately under some assumption on the behaviour of other operations. This leads to simple reasoning, which we have automated in an SMT-based tool. We present a nontrivial proof of soundness of our rule and illustrate its use on several examples.
Conference Paper
This paper describes the design, implementation, and evaluation of Callas, a distributed database system that offers to unmodified, transactional ACID applications the opportunity to achieve a level of performance that can currently only be reached by rewriting all or part of the application in a BASE/NoSQL style. The key to combining performance and ease of programming is to decouple the ACID abstraction (which Callas offers identically for all transactions) from the mechanism used to support it. MCC, the new Modular approach to Concurrency Control at the core of Callas, makes it possible to partition transactions in groups with the guarantee that, as long as the concurrency control mechanism within each group upholds a given isolation property, that property will also hold among transactions in different groups. Because of their limited and specialized scope, these group-specific mechanisms can be customized for concurrency with unprecedented aggressiveness. In our MySQL Cluster-based prototype, Callas yields an 8.2x throughput gain for TPC-C with no programming effort.
Conference Paper
Referential integrity, which guarantees that named resources can be accessed when referenced, is an important property for reliability and security. In distributed systems, however, the attempt to provide referential integrity can itself lead to security vulnerabilities that are not currently well understood. This paper identifies three kinds of referential security vulnerabilities related to the referential integrity of distributed, persistent information. Security conditions corresponding to the absence of these vulnerabilities are formalized. A language model is used to capture the key aspects of programming distributed systems with named, persistent resources in the presence of an adversary. The referential security of distributed systems is proved to be enforced by a new type system.
Article
Geo-replicated storage provides copies of the same data at multiple, geographically distinct locations. Facebook, for example, geo-replicates its data (profiles, friends lists, likes, etc.) to data centers on the east and west coasts of the United States, and in Europe. In each data center, a tier of separate Web servers accepts browser requests and then handles those requests by reading and writing data from the storage system.
Article
GentleRain is a new causally consistent geo-replicated data store that provides throughput comparable to eventual consistency and superior to current implementations of causal consistency. GentleRain uses a periodic aggregation protocol to determine whether updates can be made visible in accordance with causal consistency. Unlike current implementations, it does not use explicit dependency check messages, resulting in a major throughput improvement at the expense of a modest increase in update visibility latency. Furthermore, GentleRain tracks causal consistency by attaching to updates scalar timestamps derived from loosely synchronized physical clocks. Clock skew does not cause violations of causal consistency, but may delay the visibility of updates. By encoding causality in a single scalar timestamp, GentleRain reduces storage and communication overhead for tracking causality. We evaluate GentleRain using Amazon EC2, and demonstrate that it achieves throughput equal to about 99% of eventual consistency, and 120% better than previous implementations of causal consistency.
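A much-simplified sketch of the visibility rule described above: updates carry scalar timestamps from loosely synchronized clocks, and a remote update becomes visible only once the minimum clock value aggregated across replicas (a global stable time) has passed its timestamp. The values and replica names are invented.

replica_clocks = {"A": 105, "B": 98, "C": 110}      # loosely synchronized clocks

def global_stable_time():
    return min(replica_clocks.values())             # result of the periodic aggregation

pending_updates = [("x", 1, 95), ("y", 2, 102)]     # (key, value, timestamp)

visible = {key: value
           for key, value, ts in pending_updates
           if ts <= global_stable_time()}
print(visible)    # {'x': 1}: the update stamped 102 stays hidden until GST reaches 102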
Article
Graph databases have become a common infrastructure component. Yet existing systems either operate on offline snapshots, provide weak consistency guarantees, or use expensive concurrency control techniques that limit performance. In this paper, we introduce a new distributed graph database, called Weaver, which enables efficient, transactional graph analyses as well as strictly serializable ACID transactions on dynamic graphs. The key insight that allows Weaver to combine strict serializability with horizontal scalability and high performance is a novel request ordering mechanism called refinable timestamps. This technique couples coarse-grained vector timestamps with a fine-grained timeline oracle to pay the overhead of strong consistency only when needed. Experiments show that Weaver enables a Bitcoin blockchain explorer that is 8x faster than Blockchain.info, and achieves 10.9x higher throughput than the Titan graph database on social network workloads and 4x lower latency than GraphLab on offline graph traversal workloads.
Article
Out of the many NoSQL databases in use today, some that provide simple data structures for records, such as Redis and MongoDB, are now becoming popular. Building applications out of these complex data types provides a way to communicate intent to the database system without sacrificing flexibility or committing to a fixed schema. Currently this capability is leveraged in limited ways, such as to ensure related values are co-located, or for atomic updates. There are many ways data types can be used to make databases more efficient that are not yet being exploited. We explore several ways of leveraging abstract data type (ADT) semantics in databases, focusing primarily on commutativity. Using a Twitter clone as a case study, we show that using commutativity can reduce transaction abort rates for high-contention, update-heavy workloads that arise in real social networks. We conclude that ADTs are a good abstraction for database records, providing a safe and expressive programming model with ample opportunities for optimization, making databases more safe and scalable.
Article
Cloud storage solutions promise high scalability and low cost. Existing solutions, however, differ in the degree of consistency they provide. Our experience using such systems indicates that there is a non-trivial trade-off between cost, consistency and availability. High consistency implies high cost per transaction and, in some situations, reduced availability. Low consistency is cheaper but it might result in higher operational cost because of, e.g., overselling of products in a Web shop. In this paper, we present a new transaction paradigm that not only allows designers to define the consistency guarantees on the data instead of at the transaction level, but also allows consistency guarantees to be switched automatically at runtime. We present a number of techniques that let the system dynamically adapt the consistency level by monitoring the data and/or gathering temporal statistics of the data. We demonstrate the feasibility and potential of the ideas through extensive experiments on a first prototype implemented on Amazon's S3 and running the TPC-W benchmark. Our experiments indicate that the adaptive strategies presented in the paper result in a significant reduction in response time and costs, including the cost penalties of inconsistencies.
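A toy policy in the spirit of this adaptive approach, assuming conflict likelihood is estimated from monitored update rates; the `ConsistencyRationer` class and its threshold are hypothetical illustrations, not the paper's mechanism.

```python
class ConsistencyRationer:
    """Route updates to strong (serializable) or cheap (eventual) handling
    based on an estimated probability of concurrent conflicting updates."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold
        self.update_rate = {}   # item -> observed updates per second (monitored)

    def observe(self, item, updates_per_second):
        self.update_rate[item] = updates_per_second

    def consistency_level(self, item, window_seconds=1.0):
        # Crude conflict estimate: expected concurrent updates in the window.
        expected = self.update_rate.get(item, 0.0) * window_seconds
        return "serializable" if expected > self.threshold else "eventual"

rationer = ConsistencyRationer()
rationer.observe("hot_item_stock", 50.0)
rationer.observe("cold_item_stock", 0.001)
assert rationer.consistency_level("hot_item_stock") == "serializable"
assert rationer.consistency_level("cold_item_stock") == "eventual"
```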
Article
Spanner is Google’s scalable, multiversion, globally distributed, and synchronously replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This article describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free snapshot transactions, and atomic schema changes, across all of Spanner.
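The role of clock uncertainty can be sketched with a TrueTime-style interval clock and the commit-wait rule the paper describes; `tt_now` below is a stand-in for the real API, and the epsilon value is purely illustrative.

```python
import time
from collections import namedtuple

TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

def tt_now(epsilon=0.007):
    """Stand-in for a TrueTime-style clock: returns an interval assumed
    (for this sketch) to contain the true physical time."""
    now = time.time()
    return TTInterval(now - epsilon, now + epsilon)

def commit_with_wait():
    """Commit wait: pick the commit timestamp at the high end of the clock
    uncertainty interval, then delay until that timestamp is definitely in
    the past, so externally later transactions get strictly larger timestamps."""
    s = tt_now().latest            # commit timestamp
    while tt_now().earliest <= s:  # wait out the uncertainty (roughly 2 * epsilon)
        time.sleep(0.001)
    return s
```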
Conference Paper
We consider the problem of separating consistency-related safety properties from availability and durability in distributed data stores via the application of a "bolt-on" shim layer that upgrades the safety of an underlying general-purpose data store. This shim provides the same consistency guarantees atop a wide range of widely deployed but often inflexible stores. As causal consistency is one of the strongest consistency models that remain available during system partitions, we develop a shim layer that upgrades eventually consistent stores to provide convergent causal consistency. Accordingly, we leverage widely deployed eventually consistent infrastructure as a common substrate for providing causal guarantees. We describe algorithms and shim implementations that are suitable for a large class of application-level causality relationships and evaluate our techniques using an existing, production-ready data store and with real-world explicit causality relationships.
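A minimal sketch of what such a shim does, assuming each write carries its explicit causal dependencies as (key, version) pairs; the `BoltOnShim` class is a hypothetical illustration rather than the authors' implementation.

```python
class BoltOnShim:
    """Toy shim over an eventually consistent store (a dict here): a write
    carries its explicit causal dependencies, and the shim reveals it to
    local readers only after every dependency is already visible."""

    def __init__(self, ec_store):
        self.ec_store = ec_store   # key -> (value, version, deps)
        self.local = {}            # key -> (value, version), a convergent causal cut

    def _covered(self, deps):
        return all(self.local.get(k, (None, -1))[1] >= v for k, v in deps.items())

    def read(self, key):
        entry = self.ec_store.get(key)
        if entry is not None:
            value, version, deps = entry
            current = self.local.get(key, (None, -1))
            # Only advance the local cut when causality is satisfied.
            if version > current[1] and self._covered(deps):
                self.local[key] = (value, version)
        item = self.local.get(key)
        return item[0] if item else None
```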
Article
This paper presents secure program partitioning, a language-based technique for protecting confidential data during computation in distributed systems containing mutually untrusted hosts. Confidentiality and integrity policies can be expressed by annotating programs with security types that constrain information flow; these programs can then be partitioned automatically to run securely on heterogeneously trusted hosts. The resulting communicating subprograms collectively implement the original program, yet the system as a whole satisfies the security requirements of participating principals without requiring a universally trusted host machine. The experience in applying this methodology and the performance of the resulting distributed code suggest that this is a promising way to obtain secure distributed computation.
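The core checks can be pictured with a tiny two-point label lattice: information may only flow up the lattice, and a statement may only be placed on a host trusted for its label. This is a schematic sketch of the idea, not the paper's actual system.

```python
# A two-point lattice of confidentiality labels and a flow check, in the
# spirit of security-typed programs that are then partitioned onto hosts.
ORDER = {("public", "public"), ("public", "secret"), ("secret", "secret")}

def flows_to(source_label, sink_label):
    return (source_label, sink_label) in ORDER

def assign_host(stmt_label, hosts):
    """Place a statement only on a host trusted to see its label."""
    for host, trust in hosts.items():
        if flows_to(stmt_label, trust):
            return host
    raise ValueError("no host is trusted enough for label " + stmt_label)

hosts = {"db-server": "secret", "web-frontend": "public"}
assert assign_host("public", hosts) in ("db-server", "web-frontend")
assert assign_host("secret", hosts) == "db-server"
```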
Conference Paper
Modern cloud systems are geo-replicated to improve application latency and availability. Transactional consistency is essential for application developers; however, the corresponding concurrency control and commitment protocols are costly in a geo-replicated setting. To minimize this cost, we identify the following essential scalability properties: (i) only replicas updated by a transaction T make steps to execute T, (ii) a read-only transaction never waits for concurrent transactions and always commits, (iii) a transaction may read object versions committed after it started, and (iv) two transactions synchronize with each other only if their writes conflict. We present Non-Monotonic Snapshot Isolation (NMSI), the first strong consistency criterion to allow implementations with all four properties. We also present a practical implementation of NMSI called Jessy, which we compare experimentally against a number of well-known criteria. Our measurements show that the latency and throughput of NMSI are comparable to the weakest criterion, read-committed, and between two and fourteen times faster than well-known strong consistencies.
Conference Paper
Choosing a cloud storage system and specific operations for reading and writing data requires developers to make decisions that trade off consistency for availability and performance. Applications may be locked into a choice that is not ideal for all clients and changing conditions. Pileus is a replicated key-value store that allows applications to declare their consistency and latency priorities via consistency-based service level agreements (SLAs). It dynamically selects which servers to access in order to deliver the best service given the current configuration and system conditions. In application-specific SLAs, developers can request both strong and eventual consistency as well as intermediate guarantees such as read-my-writes. Evaluations running on a worldwide test bed with geo-replicated data show that the system adapts to varying client-server latencies to provide service that matches or exceeds the best static consistency choice and server selection scheme.
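The SLA mechanism can be approximated with a ranked list of subSLAs and a client-side chooser; the data layout and `choose` function below are hypothetical illustrations of the idea, not Pileus's API.

```python
# A consistency-based SLA as a ranked list of subSLAs; the client picks the
# highest-utility subSLA that some replica can currently satisfy.
sla = [
    {"consistency": "strong",         "latency_ms": 150, "utility": 1.0},
    {"consistency": "read_my_writes", "latency_ms": 100, "utility": 0.7},
    {"consistency": "eventual",       "latency_ms": 50,  "utility": 0.3},
]

replicas = [
    {"name": "us-east", "rtt_ms": 20,  "offers": {"eventual", "read_my_writes"}},
    {"name": "eu-west", "rtt_ms": 120, "offers": {"eventual", "read_my_writes", "strong"}},
]

def choose(sla, replicas):
    for sub in sorted(sla, key=lambda s: s["utility"], reverse=True):
        for r in replicas:
            if sub["consistency"] in r["offers"] and r["rtt_ms"] <= sub["latency_ms"]:
                return r["name"], sub["consistency"]
    return None

# The strong subSLA is reachable within 150 ms via eu-west, so it wins here.
assert choose(sla, replicas) == ("eu-west", "strong")
```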
Conference Paper
Storing big data reliably is hard. Searching that data is just as hard. Basho Technologies, the company behind Riak KV and Riak Search, focuses on solving these two problems. Both Riak KV (a key-value datastore and map/reduce platform) and Riak Search (a Solr-compatible full-text search and indexing engine) are built around a library called Riak Core that manages the mechanics of running a distributed application in a cluster without requiring a central coordinator or shared state. Using Riak Core, these applications can scale to hundreds of servers, handle enterprise-sized amounts of data, and remain operational in the face of server failure. This talk will describe the implementation, responsibilities, and boundaries of Riak Core, including how Riak Core:
• Divides the data- or computing-domain of your application into separate virtual nodes located on different physical servers.
• Distributes data and operations to the correct virtual nodes within the cluster.
• Dynamically re-shapes the cluster without requiring a central coordinator when nodes enter, leave, crash, slow down, or go dark.
• Provides the Erlang community with a solid platform for creating other distributed applications.
Special attention will be paid to how Riak Core adopts common functional programming patterns and leverages OTP/Erlang libraries and behaviours.
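The coordinator-free routing rests on a consistent-hashing ring of virtual nodes; the following generic sketch (not Riak Core's Erlang implementation) shows how a key maps to a preference list of vnodes.

```python
import hashlib

def ring_position(key, ring_size):
    """Hash a key onto a fixed-size ring of virtual node partitions."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % ring_size

def preference_list(key, vnode_owners, n_replicas=3):
    """Pick the owning vnode plus the next (n_replicas - 1) vnodes clockwise;
    no central coordinator is consulted, only the shared ring layout."""
    ring_size = len(vnode_owners)
    start = ring_position(key, ring_size)
    return [vnode_owners[(start + i) % ring_size] for i in range(n_replicas)]

# An 8-partition ring owned round-robin by four physical nodes.
owners = [f"node{(i % 4) + 1}" for i in range(8)]
print(preference_list("user:4711", owners))   # e.g. ['node3', 'node4', 'node1']
```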
Article
Database-backed applications are nearly ubiquitous in our daily lives. Applications that make many small accesses to the database create two challenges for developers: increased latency and wasted resources from numerous network round trips. A well-known technique to improve transactional database application performance is to convert part of the application into stored procedures that are executed on the database server. Unfortunately, this conversion is often difficult. In this paper we describe Pyxis, a system that takes database-backed applications and automatically partitions their code into two pieces, one of which is executed on the application server and the other on the database server. Pyxis profiles the application and server loads, statically analyzes the code's dependencies, and produces a partitioning that minimizes the number of control transfers as well as the amount of data sent during each transfer. Our experiments using TPC-C and TPC-W show that Pyxis is able to generate partitions with up to 3x reduction in latency and 1.7x improvement in throughput when compared to a traditional non-partitioned implementation and has comparable performance to that of a custom stored procedure implementation.
Presentation
At PODC 2000, the CAP theorem received its first broad audience. Surprisingly for an impossibility result, one important effect has been to free designers to explore a wider range of distributed systems. Designers of wide-area systems, in which network partitions are considered inevitable, know they cannot have both availability and consistency, and thus can now justify weaker consistency. The rise of the "NoSQL" movement ("Not Only SQL") is an expression of this freedom. The choices of how and when to weaken consistency are often the defining characteristics of these systems, with new variations appearing every year. We review a variety of interesting places in the "CAP Space" as a way to illuminate these issues and their consequences. For example, automatic teller machines (ATMs), which predate the CAP theorem, surprisingly choose availability with weak consistency but with bounded risk. Finally, I explore a few of the options to try to "work around" the impossible. The most basic is the use of commutative operations, which make it easy to restore consistency after a partition heals. However, even many commutative operations have non-commutative exceptions in practice, which means that the exceptions may be incorrect or late. Adding the concept of "delayed exceptions" allows more operations to be considered commutative and simplifies eventual consistency during a partition. Furthermore, we can think of delayed exception handling as "compensation": we execute a compensating transaction that restores consistency. Delayed exception handling with compensation appears to be what most real wide-area systems do: inconsistency due to limited communication is treated as an exception, and some exceptional action, such as monetary compensation or even legal action, is used to fix it. This approach to wide-area systems puts the emphasis on audit trails and recovery rather than prevention, and implies that we should expand and formalize the role of compensation in the design of complex systems.
Conference Paper
ANSI SQL-92 [MS, ANSI] defines Isolation Levels in terms of phenomena: Dirty Reads, Non-Repeatable Reads, and Phantoms. This paper shows that these phenomena and the ANSI SQL definitions fail to properly characterize several popular isolation levels, including the standard locking implementations of the levels covered. Ambiguity in the statement of the phenomena is investigated and a more formal statement is arrived at; in addition new phenomena that better characterize isolation types are introduced. Finally, an important multiversion isolation type, called Snapshot Isolation, is defined.
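Snapshot Isolation is commonly characterized by a first-committer-wins rule on write sets; the following is a schematic validator illustrating that rule (and why write skew remains possible), not code from the paper.

```python
class SnapshotIsolationValidator:
    """First-committer-wins rule: a transaction may commit only if no item
    in its write set was committed by another transaction after this
    transaction's snapshot was taken."""

    def __init__(self):
        self.last_commit = {}   # item -> commit timestamp
        self.clock = 0

    def begin(self):
        return self.clock       # snapshot timestamp

    def try_commit(self, snapshot_ts, write_set):
        if any(self.last_commit.get(item, 0) > snapshot_ts for item in write_set):
            return None         # concurrent committed writer: abort
        self.clock += 1
        for item in write_set:
            self.last_commit[item] = self.clock
        return self.clock

v = SnapshotIsolationValidator()
t1, t2 = v.begin(), v.begin()
assert v.try_commit(t1, {"x"}) is not None
assert v.try_commit(t2, {"x"}) is None      # write-write conflict aborts t2
assert v.try_commit(t2, {"y"}) is not None  # disjoint writes commit (write skew possible)
```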
Conference Paper
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an "always-on" experience where operations always complete with low latency. Today's systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model---causal consistency with convergent conflict handling, or causal+---that is the strongest achieved under these constraints. We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide-area. A key contribution of COPS is its scalability, which can enforce causal dependencies between keys stored across an entire cluster, rather than a single server like previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions in order to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
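The dependency-checking step can be sketched as follows, assuming each replicated write carries (key, version) dependencies; the class and method names are hypothetical illustrations, not the COPS codebase.

```python
class CausalClusterNode:
    """Toy model of COPS-style replication: a write replicated from another
    datacenter is applied locally only after a dependency check confirms
    that every causal dependency is already visible in the local cluster."""

    def __init__(self):
        self.store = {}     # key -> (value, version)
        self.waiting = []   # replicated writes whose dependencies are not yet met

    def dep_check(self, key, version):
        return self.store.get(key, (None, 0))[1] >= version

    def apply_replicated(self, key, value, version, deps):
        if all(self.dep_check(k, v) for k, v in deps):
            self.store[key] = (value, version)
            self._retry_waiting()
        else:
            self.waiting.append((key, value, version, deps))

    def _retry_waiting(self):
        still_waiting, progressed = [], False
        for key, value, version, deps in self.waiting:
            if all(self.dep_check(k, v) for k, v in deps):
                self.store[key] = (value, version)
                progressed = True
            else:
                still_waiting.append((key, value, version, deps))
        self.waiting = still_waiting
        if progressed and self.waiting:
            self._retry_waiting()
```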
Article
We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
Article
The tradeoffs between consistency, performance, and availability are well understood. Traditionally, however, designers of replicated systems have been forced to choose from either strong consistency guarantees or none at all. This paper explores the semantic space between traditional strong and optimistic consistency models for replicated services. We argue that an important class of applications can tolerate relaxed consistency, but benefit from bounding the maximum rate of inconsistent access in an application-specific manner. Thus, we develop a set of metrics, Numerical Error, Order Error, and Staleness, to capture the consistency spectrum. We then present the design and implementation of TACT, a middleware layer that enforces arbitrary consistency bounds among replicas using these metrics. Finally, we show that three replicated applications demonstrate significant semantic and performance benefits from using our framework.
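A toy "conit" capturing two of the three metrics (numerical error and staleness) shows how a middleware layer could decide when synchronization is required; the class below is an illustrative sketch that omits order error and is not TACT itself.

```python
import time

class Conit:
    """Track numerical error (weight of remote updates not yet applied
    locally) and staleness (age of the oldest unapplied remote update)."""

    def __init__(self, max_numerical_error, max_staleness_s):
        self.max_numerical_error = max_numerical_error
        self.max_staleness_s = max_staleness_s
        self.unseen = []    # (arrival_time, weight) of known-but-unapplied remote updates

    def record_remote_update(self, weight):
        self.unseen.append((time.time(), weight))

    def within_bounds(self):
        numerical_error = sum(w for _, w in self.unseen)
        staleness = time.time() - min((t for t, _ in self.unseen), default=time.time())
        return (numerical_error <= self.max_numerical_error
                and staleness <= self.max_staleness_s)

    def must_synchronize(self):
        # The middleware pulls remote updates (anti-entropy) once a bound is at risk.
        return not self.within_bounds()
```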
Article
A sequence of interleaved user transactions in a database system may not be serializable, i.e., equivalent to some sequential execution of the individual transactions. Using a simple transaction model, it is shown that recognizing the transaction histories that are serializable is an NP-complete problem. Several efficiently recognizable subclasses of the class of serializable histories are therefore introduced; most of these subclasses correspond to serializability principles existing in the literature and used in practice. Two new principles that subsume all previously known ones are also proposed. Necessary and sufficient conditions are given for a class of histories to be the output of an efficient history scheduler; these conditions imply that there can be no efficient scheduler that outputs all serializable histories, and also that all subclasses of serializable histories studied above have an efficient scheduler. Finally, it is shown how these results can be extended to far more general transaction models, to transactions with partly interpreted functions, and to distributed database systems.
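One of the efficiently recognizable subclasses used in practice is conflict serializability, which can be tested in polynomial time by checking the conflict graph for cycles; the sketch below illustrates that standard test (it is not from the paper, which studies a broader notion of serializability).

```python
def conflict_serializable(history):
    """Conflict-graph test: build edges between transactions with conflicting
    operations on the same item and check for cycles.
    `history` is a list of (txn, action, item) with action in {'r', 'w'}."""
    edges = {}
    for i, (t1, a1, x1) in enumerate(history):
        for t2, a2, x2 in history[i + 1:]:
            if t1 != t2 and x1 == x2 and (a1 == "w" or a2 == "w"):
                edges.setdefault(t1, set()).add(t2)

    # Depth-first search for a cycle in the conflict graph.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def has_cycle(node):
        color[node] = GRAY
        for nxt in edges.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:
                return True
            if color.get(nxt, WHITE) == WHITE and has_cycle(nxt):
                return True
        color[node] = BLACK
        return False

    txns = {t for t, _, _ in history}
    return not any(color.get(t, WHITE) == WHITE and has_cycle(t) for t in txns)

# r1(x) w2(x) w1(x): T1 -> T2 (read-write) and T2 -> T1 (write-write) form a cycle.
assert conflict_serializable([("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")]) is False
assert conflict_serializable([("T1", "r", "x"), ("T1", "w", "x"), ("T2", "w", "x")]) is True
```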
Article
Most current approaches to concurrency control in database systems rely on locking of data objects as a control mechanism. In this paper, two families of non-locking concurrency controls are presented. The methods used are "optimistic" in the sense that they rely mainly on transaction backup as a control mechanism, "hoping" that conflicts between transactions will not occur. Applications where these methods should be more efficient than locking are discussed.
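The validation phase of such optimistic methods can be sketched as a backward check of a committing transaction's read set against the write sets of transactions that committed while it ran; the `OptimisticScheduler` below is an illustrative toy, not the paper's algorithm verbatim.

```python
class OptimisticScheduler:
    """Backward-validation sketch of optimistic concurrency control: a
    transaction runs without locks, then at commit its read set is checked
    against the write sets of transactions that committed while it was
    running; on conflict it is backed up (aborted) and can be retried."""

    def __init__(self):
        self.commit_log = []    # (commit_number, write_set)
        self.next_commit = 0

    def begin(self):
        return self.next_commit          # start timestamp

    def validate_and_commit(self, start, read_set, write_set):
        for commit_number, committed_writes in self.commit_log:
            if commit_number >= start and committed_writes & read_set:
                return False             # conflict: back up the transaction
        self.commit_log.append((self.next_commit, set(write_set)))
        self.next_commit += 1
        return True

s = OptimisticScheduler()
t1, t2 = s.begin(), s.begin()
assert s.validate_and_commit(t1, {"x"}, {"x"})
assert not s.validate_and_commit(t2, {"x"}, {"y"})   # T2 read x, which T1 overwrote
```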