Russell Sears

University of California, Berkeley, Berkeley, California, United States

Publications (25)

  • ABSTRACT: In this demo proposal, we describe REEF, a framework that makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models. We will demonstrate diverse workloads, including extract-transform-load MapReduce jobs, iterative machine learning algorithms, and ad-hoc declarative query processing. At its core, REEF builds atop YARN (Apache Hadoop 2's resource manager) to provide retainable hardware resources with lifetimes that are decoupled from those of computational tasks. This allows us to build persistent (cross-job) caches and cluster-wide services, but, more importantly, supports high-performance iterative graph processing and machine learning algorithms. Unlike existing systems, REEF aims for composability of jobs across computational models, providing significant performance and usability gains, even with legacy code. REEF includes a library of interoperable data management primitives optimized for communication and data movement (which are distinct from storage locality). The library also allows REEF applications to access external services, such as user-facing relational databases. We were careful to decouple lower levels of REEF from the data models and semantics of systems built atop it. The result was two new standalone systems: Tang, a configuration manager and dependency injector, and Wake, a state-of-the-art event-driven programming and data movement framework. Both are language independent, allowing REEF to bridge the JVM and .NET.
    No preview · Article · Aug 2013
  • ABSTRACT: Walnut is an object-store being developed at Yahoo! with the goal of serving as a common low-level storage layer for a variety of cloud data management systems including Hadoop (a MapReduce system), MObStor (a multimedia serving system), and PNUTS (an extended key-value serving system). Thus, a key performance challenge is to meet the latency and throughput requirements of the wide range of workloads commonly observed across these diverse systems. The motivation for Walnut is to leverage a carefully optimized low-level storage system, with support for elasticity and high-availability, across all of Yahoo!'s data clouds. This would enable sharing of hardware resources across hitherto siloed clouds of different types, offering greater potential for intelligent load balancing and efficient elastic operation, and simplify the operational tasks related to data storage. In this paper, we discuss the motivation for unifying different storage clouds, describe the requirements of a common storage layer, and present the Walnut design, which uses a quorum-based replication protocol and one-hop direct client access to the data in most regular operations. A unique contribution of Walnut is its hybrid object strategy, which efficiently supports both small and large objects. We present experiments based on both synthetic and real data traces, showing that Walnut works well over a wide range of workloads, and can indeed serve as a common low-level storage layer across a range of cloud systems.
    No preview · Article · Jan 2012
  • Russell Sears · Raghu Ramakrishnan
    ABSTRACT: Data management workloads are increasingly write-intensive and subject to strict latency SLAs. This presents a dilemma: Update in place systems have unmatched latency but poor write throughput. In contrast, existing log structured techniques improve write throughput but sacrifice read performance and exhibit unacceptable latency spikes. We begin by presenting a new performance metric: read fanout, and argue that, with read and write amplification, it better characterizes real-world indexes than approaches such as asymptotic analysis and price/performance. We then present bLSM, a Log Structured Merge (LSM) tree with the advantages of B-Trees and log structured approaches: (1) Unlike existing log structured trees, bLSM has near-optimal read and scan performance, and (2) its new "spring and gear" merge scheduler bounds write latency without impacting throughput or allowing merges to block writes for extended periods of time. It does this by ensuring merges at each level of the tree make steady progress without resorting to techniques that degrade read performance. We use Bloom filters to improve index performance, and find a number of subtleties arise. First, we ensure reads can stop after finding one version of a record. Otherwise, frequently written items would incur multiple B-Tree lookups. Second, many applications check for existing values at insert. Avoiding the seek performed by the check is crucial.
    No preview · Article · Jan 2012
  • ABSTRACT: Mobile application development is challenging for several reasons: intermittent and limited network connectivity, tight power constraints, server-side scalability concerns, and a number of fault-tolerance issues. Developers handcraft complex solutions that include client-side caching, conflict resolution, disconnection tolerance, and backend database sharding. To simplify mobile app development, we present Mobius, a system that addresses the messaging and data management challenges of mobile application development. Mobius introduces MUD (Messaging Unified with Data). MUD presents the programming abstraction of a logical table of data that spans devices and clouds. Applications using Mobius can asynchronously read from/write to MUD tables, and also receive notifications when tables change via continuous queries on the tables. The system combines dynamic client-side caching (with intelligent policies chosen on the server-side, based on usage patterns across multiple applications), notification services, flexible query processing, and a scalable and highly available cloud storage system. We present an initial prototype to demonstrate the feasibility of our design. Even in our initial prototype, remote read and write latency overhead is less than 52% when compared to a hand-tuned solution. Our dynamic caching reduces the number of messages by a factor of 4 to 8.5 when compared to fixed strategies, thus reducing latency, bandwidth, power, and server load costs, while also reducing data staleness.
    Preview · Article · Jan 2012
  • ABSTRACT: Cloud data management systems are growing in prominence, particularly at large Internet companies like Google, Yahoo!, and Amazon, which prize them for their scalability and elasticity. Each of these systems trades off between low-latency serving performance and batch processing throughput. In this paper, we discuss our experience running batch-oriented Hadoop on top of Yahoo's serving-oriented PNUTS system instead of the standard HDFS file system. Though PNUTS is optimized for and primarily used for serving, a number of applications at Yahoo! must run batch-oriented jobs that read or write data that is stored in PNUTS. Combining these systems reveals several key areas where the fundamental properties of each system are mismatched. We discuss our approaches to accommodating these mismatches, by either bending the batch and serving abstractions, or inventing new ones. Batch systems like Hadoop provide coarse task-level recovery, while serving systems like PNUTS provide finer record or transaction-level recovery. We combine both types to log record-level errors, while detecting and recovering from large-scale errors. Batch systems optimize for read and write throughput of large requests, while serving systems use indexing to provide low latency access to individual records. To improve latency-insensitive write throughput to PNUTS, we introduce a batch write path. The systems provide conflicting consistency models, and we discuss techniques to isolate them from one another.
    Preview · Conference Paper · Jan 2011
  • ABSTRACT: While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible: it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.
    Preview · Conference Paper · Sep 2010

  • No preview · Article · Jan 2010 · ACM SIGOPS Operating Systems Review
  • ABSTRACT: Building and debugging distributed software remains extremely difficult. We conjecture that by adopting a data-centric approach to system design and by employing declarative programming languages, a broad range of distributed software can be recast naturally in a data-parallel programming model. Our hope is that this model can significantly raise the level of abstraction for programmers, improving code simplicity, speed of development, ease of software evolution, and program correctness. This paper presents our experience with an initial large-scale experiment in this direction. First, we used the Overlog language to implement a "Big Data" analytics stack that is API-compatible with Hadoop and HDFS and provides comparable performance. Second, we extended the system with complex distributed features not yet available in Hadoop, including high availability, scalability, and unique monitoring and debugging facilities. We present both quantitative and anecdotal results from our experience, providing some concrete evidence that both data-centric design and declarative languages can substantially simplify distributed systems programming.
    Preview · Conference Paper · Jan 2010
  • Conference Paper: “MapReduce Online”
    ABSTRACT: MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
    Preview · Conference Paper · Jan 2010
  • Preview · Conference Paper · Jan 2010
  • ABSTRACT: MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, the output of each MapReduce task and job is materialized to disk before it is consumed. In this demonstration, we describe a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We demonstrate a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop, and can run unmodified user-defined MapReduce programs.
    Full-text · Conference Paper · Jan 2010
  • ABSTRACT: Recent research has explored using Datalog-based languages to express a distributed system as a set of logical invariants. Two properties of distributed systems proved difficult to model in Datalog. First, the state of any such system evolves with its execution. Second, deductions in these systems may be arbitrarily delayed, dropped, or reordered by the unreliable network links they must traverse. Previous efforts addressed the former by extending Datalog to include updates, key constraints, persistence and events, and the latter by assuming ordered and reliable delivery while ignoring delay. These details have a semantics outside Datalog, which increases the complexity of the language and its interpretation, and forces programmers to think operationally. We argue that the missing component from these previous languages is a notion of time. In this paper we present Dedalus, a foundation language for programming and reasoning about distributed systems. Dedalus reduces to a subset of Datalog with negation, aggregate functions, successor and choice, and adds an explicit notion of logical time to the language. We show that Dedalus provides a declarative foundation for the two signature features of distributed systems: mutable state, and asynchronous processing and communication. Given these two features, we address two important properties of programs in a domain-specific manner: a notion of safety appropriate to non-terminating computations, and stratified monotonic reasoning with negation over time. We also provide conservative syntactic checks for our temporal notions of safety and stratification. Our experience implementing full-featured systems in variants of Datalog suggests that Dedalus is well-suited to the specification of rich distributed services and protocols, and provides both cleaner semantics and richer tests of correctness.
    Preview · Article · Dec 2009
  • Russell Sears · Eric Brewer
    ABSTRACT: Although existing write-ahead logging algorithms scale to conventional database workloads, their communication and synchronization overheads limit their usefulness for modern applications and distributed systems. We revisit write-ahead logging with an eye toward finer-grained concurrency and an increased range of workloads, then remove two core assumptions: that pages are the unit of recovery and that timestamps (LSNs) should be stored on each page. Recovering individual application-level objects (rather than pages) simplifies the handling of systems with object sizes that differ from the page size. We show how to remove the need for LSNs on the page, which in turn enables DMA or zero-copy I/O for large objects, increases concurrency, and reduces communication between the application, buffer manager and log manager. Our experiments show that the looser coupling significantly reduces the impact of latency among the components. This makes the approach particularly applicable to large scale distributed systems, and enables a "cross pollination" of ideas from distributed systems and transactional storage. However, these advantages come at a cost; segments are incompatible with physiological redo, preventing a number of important optimizations. We show how allocation enables (or prevents) mixing of ARIES pages (and physiological redo) with segments. We present an allocation policy that avoids undesirable interactions that complicate other combinations of ARIES and LSN-free pages, and then present a proof that both approaches and our combination are correct. Many optimizations presented here were proposed in the past. However, we believe this is the first unified approach.
    Full-text · Conference Paper · Aug 2009
  • ABSTRACT: Most application provenance systems are hard coded for a particular type of system or data, while current provenance file systems maintain in-memory provenance graphs and reside in kernel space, leading to complex and constrained implementations. Story Book resides in user space, and treats provenance events as a generic event log, leading to a simple, flexible and easily optimized system. We demonstrate the flexibility of our design by adding provenance to a number of different systems, including a file system, database and a number of file types, and by implementing two separate storage backends. Although Story Book is nearly 2.5 times slower than ext3 under worst case workloads, this is mostly due to FUSE message passing overhead. Our experiments show that coupling our simple design with existing storage optimizations provides higher throughput than existing systems.
    Preview · Conference Paper · Jan 2009
  • ABSTRACT: The efficiencies of cloud computing enable a wide range of developers to access the power of large clusters. Unfortunately, parallel and distributed programming remains too complex for many of these developers, and slows the progress of even sophisticated distributed system builders. We conjecture that a broad range of distributed software can be recast naturally in a data-parallel programming model. We argue that this significantly raises the level of abstraction for programmers, improving code simplicity, speed of development, ease of software evolution, and program correctness. To evaluate these claims, the bulk of the paper presents our experience using the Overlog language to implement a "Big Data" analytics stack that is API-compatible with Hadoop and HDFS, providing comparable performance. We extended the system with complex distributed features not yet available in Hadoop, including availability, scalability, and unique monitoring and debugging facilities. We present both quantitative and anecdotal results from our experience, showing that a data-centric approach can substantially simplify distributed systems programming.
    Preview · Article · Jan 2009
  • ABSTRACT: The Paxos consensus protocol can be specified concisely, but is notoriously difficult to implement in practice. We recount our experience building Paxos in Overlog, a distributed declarative programming language. We found that the Paxos algorithm is easily translated to declarative logic, in large part because the primitives used in consensus protocol specifications map directly to simple Overlog constructs such as aggregation and selection. We discuss the programming idioms that appear frequently in our implementation, and the applicability of declarative programming to related application domains.
    Preview · Article · Jan 2009
  • Russell Sears · Mark Callaghan (Google) · Eric Brewer
    ABSTRACT: Rose is a database storage engine for high-throughput replication. It targets seek-limited, write-intensive transaction processing workloads that perform near real-time decision support and analytical processing queries. Rose uses log structured merge (LSM) trees to create full database replicas using purely sequential I/O, allowing it to provide orders of magnitude more write throughput than B-tree based replicas. Also, LSM-trees cannot become fragmented and provide fast, predictable index scans. Rose's write performance relies on replicas' ability to perform writes without looking up old values. LSM-tree lookups have performance comparable to B-tree lookups. If Rose read each value that it updated then its write throughput would also be comparable to a B-tree. Although we target replication, Rose provides high write throughput to any application that updates tuples without reading existing data, such as append-only, streaming and versioning databases. We introduce a page compression format that takes advantage of LSM-tree's sequential, sorted data layout. It increases replication throughput by reducing sequential I/O, and enables efficient tree lookups by supporting small page sizes and doubling as an index of the values it stores. Any scheme that can compress data in a single pass and provide random access to compressed values could be used by Rose. Replication environments have multiple readers but only one writer. This allows Rose to provide atomicity, consistency and isolation to concurrent transactions without resorting to rollback, blocking index requests or interfering with maintenance tasks. Rose avoids random I/O during replication and scans, leaving more I/O capacity for queries than existing systems, and providing scalable, real-time replication of seek-bound workloads. Analytical models and experiments show that Rose provides orders of magnitude greater replication bandwidth over larger databases than conventional techniques.
    Full-text · Conference Paper · Aug 2008
  • Russell Sears · Catharine Van Ingen · Jim Gray
    ABSTRACT: Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation, one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256KB are best stored in a database while objects larger than 1MB are best stored in the filesystem. Between 256KB and 1MB, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of "storage age" or number of object overwrites as a way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
    Preview · Article · Jan 2007
  • Russell Sears · Catharine Van Ingen
    ABSTRACT: Fragmentation leads to unpredictable and degraded application performance. While these problems have been studied in detail for desktop filesystem workloads, this study examines newer systems such as scalable object stores and multimedia repositories. Such systems use a get/put interface to store objects. In principle, databases and filesystems can support such applications efficiently, allowing system designers to focus on complexity, deployment cost and manageability. Although theoretical work proves that certain storage policies behave optimally for some workloads, these policies often behave poorly in practice. Most storage benchmarks focus on short-term behavior or do not measure fragmentation. We compare SQL Server to NTFS and find that fragmentation dominates performance when object sizes exceed 256KB-1MB. NTFS handles fragmentation better than SQL Server. Although the performance curves will vary with other systems and workloads, we expect the same interactions between fragmentation and free space to apply. It is well-known that fragmentation is related to the percentage free space. We found that the ratio of free space to object size also impacts performance. Surprisingly, in both systems, storing objects of a single size causes fragmentation, and changing the size of write requests affects fragmentation. These problems could be addressed with simple changes to the filesystem and database interfaces. It is our hope that an improved understanding of fragmentation will lead to predictable storage systems that require less maintenance after deployment. Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA
    Preview · Article · Dec 2006