Russell Sears

University of California, Berkeley, Berkeley, California, United States

Publications (22)

  • ABSTRACT: In this demo proposal, we describe REEF, a framework that makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models. We will demonstrate diverse workloads, including extract-transform-load MapReduce jobs, iterative machine learning algorithms, and ad-hoc declarative query processing. At its core, REEF builds atop YARN (Apache Hadoop 2's resource manager) to provide retainable hardware resources with lifetimes that are decoupled from those of computational tasks. This allows us to build persistent (cross-job) caches and cluster-wide services, but, more importantly, supports high-performance iterative graph processing and machine learning algorithms. Unlike existing systems, REEF aims for composability of jobs across computational models, providing significant performance and usability gains, even with legacy code. REEF includes a library of interoperable data management primitives optimized for communication and data movement (which are distinct from storage locality). The library also allows REEF applications to access external services, such as user-facing relational databases. We were careful to decouple lower levels of REEF from the data models and semantics of systems built atop it. The result was two new standalone systems: Tang, a configuration manager and dependency injector, and Wake, a state-of-the-art event-driven programming and data movement framework. Both are language independent, allowing REEF to bridge the JVM and .NET.
    Proceedings of the VLDB Endowment. 08/2013; 6(12):1370-1373.
  • ABSTRACT: Walnut is an object-store being developed at Yahoo! with the goal of serving as a common low-level storage layer for a variety of cloud data management systems including Hadoop (a MapReduce system), MObStor (a multimedia serving system), and PNUTS (an extended key-value serving system). Thus, a key performance challenge is to meet the latency and throughput requirements of the wide range of workloads commonly observed across these diverse systems. The motivation for Walnut is to leverage a carefully optimized low-level storage system, with support for elasticity and high-availability, across all of Yahoo!'s data clouds. This would enable sharing of hardware resources across hitherto siloed clouds of different types, offering greater potential for intelligent load balancing and efficient elastic operation, and simplify the operational tasks related to data storage. In this paper, we discuss the motivation for unifying different storage clouds, describe the requirements of a common storage layer, and present the Walnut design, which uses a quorum-based replication protocol and one-hop direct client access to the data in most regular operations. A unique contribution of Walnut is its hybrid object strategy, which efficiently supports both small and large objects. We present experiments based on both synthetic and real data traces, showing that Walnut works well over a wide range of workloads, and can indeed serve as a common low-level storage layer across a range of cloud systems.
    01/2012;
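    The hybrid object strategy mentioned above (efficient support for both small and large objects) can be illustrated with a rough sketch. The size threshold, the shared append-only segment for small objects, and the chunked path for large objects below are illustrative assumptions, not Walnut's actual design or API.

      # Illustrative sketch of a size-based hybrid object store (hypothetical,
      # not Walnut's implementation): small objects are appended to a shared
      # log segment, large objects are split into fixed-size chunks.

      SMALL_OBJECT_LIMIT = 64 * 1024   # assumed threshold, bytes
      CHUNK_SIZE = 1 * 1024 * 1024     # assumed chunk size for large objects

      class HybridObjectStore:
          def __init__(self):
              self.log = bytearray()   # shared append-only segment for small objects
              self.small_index = {}    # key -> (offset, length) into the log
              self.chunks = {}         # (key, chunk_no) -> bytes for large objects
              self.large_index = {}    # key -> number of chunks

          def put(self, key, value):
              if len(value) <= SMALL_OBJECT_LIMIT:
                  # Small objects share a segment: one sequential append, no extra seek.
                  offset = len(self.log)
                  self.log.extend(value)
                  self.small_index[key] = (offset, len(value))
              else:
                  # Large objects are chunked so they can be written and read in pieces.
                  for i in range(0, len(value), CHUNK_SIZE):
                      self.chunks[(key, i // CHUNK_SIZE)] = value[i:i + CHUNK_SIZE]
                  self.large_index[key] = (len(value) + CHUNK_SIZE - 1) // CHUNK_SIZE

          def get(self, key):
              if key in self.small_index:
                  offset, length = self.small_index[key]
                  return bytes(self.log[offset:offset + length])
              n = self.large_index[key]
              return b"".join(self.chunks[(key, i)] for i in range(n))

      store = HybridObjectStore()
      store.put("thumbnail", b"x" * 1000)
      store.put("video", b"y" * (3 * 1024 * 1024))
      assert store.get("thumbnail") == b"x" * 1000
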
  • Russell Sears, Raghu Ramakrishnan
    ABSTRACT: Data management workloads are increasingly write-intensive and subject to strict latency SLAs. This presents a dilemma: Update in place systems have unmatched latency but poor write throughput. In contrast, existing log structured techniques improve write throughput but sacrifice read performance and exhibit unacceptable latency spikes. We begin by presenting a new performance metric: read fanout, and argue that, with read and write amplification, it better characterizes real-world indexes than approaches such as asymptotic analysis and price/performance. We then present bLSM, a Log Structured Merge (LSM) tree with the advantages of B-Trees and log structured approaches: (1) Unlike existing log structured trees, bLSM has near-optimal read and scan performance, and (2) its new "spring and gear" merge scheduler bounds write latency without impacting throughput or allowing merges to block writes for extended periods of time. It does this by ensuring merges at each level of the tree make steady progress without resorting to techniques that degrade read performance. We use Bloom filters to improve index performance, and find a number of subtleties arise. First, we ensure reads can stop after finding one version of a record. Otherwise, frequently written items would incur multiple B-Tree lookups. Second, many applications check for existing values at insert. Avoiding the seek performed by the check is crucial.
    01/2012;
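    A minimal sketch of two of the read-path ideas described above: per-run Bloom filters and stopping as soon as the newest version of a key is found. The filter and run structures here are simplifications for illustration, not excerpts from bLSM.

      # Simplified LSM read path (illustrative, not bLSM's implementation):
      # each flushed run carries a Bloom filter; lookups probe runs from
      # newest to oldest and stop at the first version found.

      import hashlib

      class BloomFilter:
          def __init__(self, bits=1024, hashes=3):
              self.bits, self.hashes, self.array = bits, hashes, bytearray(bits)

          def _positions(self, key):
              for i in range(self.hashes):
                  digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                  yield int.from_bytes(digest[:4], "big") % self.bits

          def add(self, key):
              for p in self._positions(key):
                  self.array[p] = 1

          def may_contain(self, key):
              return all(self.array[p] for p in self._positions(key))

      class LSMTree:
          def __init__(self):
              self.memtable = {}   # newest data, in memory
              self.runs = []       # list of (bloom, data) pairs, newest first

          def put(self, key, value):
              self.memtable[key] = value

          def flush(self):
              bloom = BloomFilter()
              for k in self.memtable:
                  bloom.add(k)
              self.runs.insert(0, (bloom, dict(self.memtable)))
              self.memtable.clear()

          def get(self, key):
              if key in self.memtable:          # newest version wins; stop here
                  return self.memtable[key]     # without touching older runs
              for bloom, run in self.runs:
                  if bloom.may_contain(key) and key in run:
                      return run[key]           # first (newest) version found
              return None

      tree = LSMTree()
      tree.put("a", 1); tree.flush()
      tree.put("a", 2)                          # newer version shadows the old one
      assert tree.get("a") == 2
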
  • ABSTRACT: Mobile application development is challenging for several reasons: intermittent and limited network connectivity, tight power constraints, server-side scalability concerns, and a number of fault-tolerance issues. Developers handcraft complex solutions that include client-side caching, conflict resolution, disconnection tolerance, and backend database sharding. To simplify mobile app development, we present Mobius, a system that addresses the messaging and data management challenges of mobile application development. Mobius introduces MUD (Messaging Unified with Data). MUD presents the programming abstraction of a logical table of data that spans devices and clouds. Applications using Mobius can asynchronously read from/write to MUD tables, and also receive notifications when tables change via continuous queries on the tables. The system combines dynamic client-side caching (with intelligent policies chosen on the server-side, based on usage patterns across multiple applications), notification services, flexible query processing, and a scalable and highly available cloud storage system. We present an initial prototype to demonstrate the feasibility of our design. Even in our initial prototype, remote read and write latency overhead is less than 52% when compared to a hand-tuned solution. Our dynamic caching reduces the number of messages by a factor of 4 to 8.5 when compared to fixed strategies, thus reducing latency, bandwidth, power, and server load costs, while also reducing data staleness.
    01/2012;
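    The MUD abstraction above (a logical table that is read and written from the app, with notifications delivered via continuous queries) might look roughly like the hypothetical client sketch below. The class name, callback shape, and synchronous delivery are assumptions for illustration, not Mobius's actual interface, which spans devices and the cloud asynchronously.

      # Hypothetical sketch of a MUD-style logical table: writes, reads, and
      # continuous queries that notify the app when matching rows change.
      # Names and semantics are illustrative, not Mobius's real API.

      class MudTable:
          def __init__(self, name):
              self.name = name
              self.rows = {}            # key -> row dict (stand-in for device + cloud copies)
              self.watchers = []        # (predicate, callback) continuous queries

          def put(self, key, row):
              self.rows[key] = row
              for predicate, callback in self.watchers:
                  if predicate(row):
                      callback(key, row)          # push a change notification to the app

          def get(self, key):
              return self.rows.get(key)

          def watch(self, predicate, callback):
              # Continuous query: the callback fires whenever a future write matches.
              self.watchers.append((predicate, callback))

      messages = MudTable("messages")
      messages.watch(lambda row: row["to"] == "alice",
                     lambda key, row: print("notify alice:", row["body"]))
      messages.put("m1", {"to": "alice", "body": "hi"})      # triggers the notification
      messages.put("m2", {"to": "bob", "body": "ignored"})   # does not
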
  • ABSTRACT: Cloud data management systems are growing in prominence, particularly at large Internet companies like Google, Yahoo!, and Amazon, which prize them for their scalability and elasticity. Each of these systems trades off between low-latency serving performance and batch processing throughput. In this paper, we discuss our experience running batch-oriented Hadoop on top of Yahoo's serving-oriented PNUTS system instead of the standard HDFS file system. Though PNUTS is optimized for and primarily used for serving, a number of applications at Yahoo! must run batch-oriented jobs that read or write data that is stored in PNUTS. Combining these systems reveals several key areas where the fundamental properties of each system are mismatched. We discuss our approaches to accommodating these mismatches, by either bending the batch and serving abstractions, or inventing new ones. Batch systems like Hadoop provide coarse task-level recovery, while serving systems like PNUTS provide finer record or transaction-level recovery. We combine both types to log record-level errors, while detecting and recovering from large-scale errors. Batch systems optimize for read and write throughput of large requests, while serving systems use indexing to provide low latency access to individual records. To improve latency-insensitive write throughput to PNUTS, we introduce a batch write path. The systems provide conflicting consistency models, and we discuss techniques to isolate them from one another.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011; 01/2011
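    Two of the accommodations described above, a buffered batch write path into the serving store and record-level error logging instead of whole-task failure, could be sketched along the following lines. The store interface, batch size, and error threshold are illustrative assumptions, not the actual Hadoop/PNUTS integration code.

      # Illustrative sketch of a batch write path into a serving store with
      # record-level error handling: bad records are logged rather than failing
      # the task, but a high error rate is treated as a large-scale failure.

      class BatchWriter:
          def __init__(self, store, batch_size=100, max_error_rate=0.05):
              self.store = store
              self.batch_size = batch_size
              self.max_error_rate = max_error_rate
              self.buffer = []
              self.written = 0
              self.errors = []          # (key, error) pairs logged per record

          def write(self, key, value):
              self.buffer.append((key, value))
              if len(self.buffer) >= self.batch_size:
                  self.flush()

          def flush(self):
              for key, value in self.buffer:
                  try:
                      self.store.put(key, value)
                      self.written += 1
                  except Exception as err:
                      # Log the bad record instead of failing the whole task ...
                      self.errors.append((key, str(err)))
              self.buffer.clear()
              total = self.written + len(self.errors)
              # ... but treat a high error rate as a large-scale failure.
              if total and len(self.errors) / total > self.max_error_rate:
                  raise RuntimeError(f"{len(self.errors)}/{total} records failed")

      class DictStore:
          def __init__(self):
              self.data = {}
          def put(self, key, value):
              self.data[key] = value

      writer = BatchWriter(DictStore(), batch_size=2)
      writer.write("k1", "v1")
      writer.write("k2", "v2")   # the second write triggers a batched flush
      writer.flush()
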
  • ABSTRACT: MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, the output of each MapReduce task and job is materialized to disk before it is consumed. In this demonstration, we describe a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We demonstrate a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop, and can run unmodified user-defined MapReduce programs.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010; 01/2010
  • ABSTRACT: Building and debugging distributed software remains extremely difficult. We conjecture that by adopting a data-centric approach to system design and by employing declarative programming languages, a broad range of distributed software can be recast naturally in a data-parallel programming model. Our hope is that this model can significantly raise the level of abstraction for programmers, improving code simplicity, speed of development, ease of software evolution, and program correctness. This paper presents our experience with an initial large-scale experiment in this direction. First, we used the Overlog language to implement a "Big Data" analytics stack that is API-compatible with Hadoop and HDFS and provides comparable performance. Second, we extended the system with complex distributed features not yet available in Hadoop, including high availability, scalability, and unique monitoring and debugging facilities. We present both quantitative and anecdotal results from our experience, providing some concrete evidence that both data-centric design and declarative languages can substantially simplify distributed systems programming.
    European Conference on Computer Systems, Proceedings of the 5th European conference on Computer systems, EuroSys 2010, Paris, France, April 13-16, 2010; 01/2010
  • ABSTRACT: While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible—it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.
    Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC 2010, Indianapolis, Indiana, USA, June 10-11, 2010; 01/2010
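    The workload model described above (a configurable mix of operations driven against a pluggable data store) can be approximated with a small driver like the one below. This is a Python illustration of the idea only, not the actual YCSB Java tool; the proportions and the skewed key distribution are assumed values.

      # Minimal YCSB-like workload driver (illustrative; the real YCSB is a Java
      # tool with pluggable database bindings). Operations are sampled from a
      # configurable mix and keys from a skewed, roughly Zipfian distribution.

      import random

      class WorkloadDriver:
          def __init__(self, db, record_count=1000, read_prop=0.95,
                       update_prop=0.05, zipf_s=0.99):
              self.db = db
              self.keys = [f"user{i}" for i in range(record_count)]
              # Zipf-like weights: a few popular keys get most of the traffic.
              self.weights = [1.0 / (rank + 1) ** zipf_s for rank in range(record_count)]
              self.ops = ["read", "update"]
              self.props = [read_prop, update_prop]

          def load(self):
              for key in self.keys:
                  self.db[key] = {"field0": "init"}

          def run(self, operation_count=10000):
              stats = {"read": 0, "update": 0}
              for _ in range(operation_count):
                  op = random.choices(self.ops, weights=self.props)[0]
                  key = random.choices(self.keys, weights=self.weights)[0]
                  if op == "read":
                      _ = self.db.get(key)
                  else:
                      self.db[key] = {"field0": "updated"}
                  stats[op] += 1
              return stats

      driver = WorkloadDriver(db={})
      driver.load()
      print(driver.run(1000))   # e.g. {'read': 948, 'update': 52}
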
  • Conference Paper: MapReduce Online.
    ABSTRACT: MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
    Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2010, April 28-30, 2010, San Jose, CA, USA; 01/2010
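    The pipelining and online aggregation described above reduce to a simple idea: map output is consumed as it is produced, and the consumer emits periodic "early returns" over the data seen so far. The toy example below is a conceptual sketch only (it omits the shuffle and fault tolerance), not HOP's actual implementation.

      # Toy illustration of pipelined map output with online aggregation: the
      # consumer processes the map stream incrementally and reports a running
      # aggregate instead of waiting for all input to be materialized.

      def mapper(lines):
          for line in lines:                 # emit as soon as each record is read
              for word in line.split():
                  yield word, 1

      def online_reducer(pairs, snapshot_every=3):
          counts, seen = {}, 0
          for key, value in pairs:           # consume the map stream incrementally
              counts[key] = counts.get(key, 0) + value
              seen += 1
              if seen % snapshot_every == 0:
                  print(f"early return after {seen} pairs: {counts}")
          return counts

      lines = ["the cat sat", "the dog sat", "the cat ran"]
      final = online_reducer(mapper(lines))
      print("final:", final)
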
  • Datalog Reloaded - First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers; 01/2010
  • ABSTRACT: Recent research has explored using Datalog-based languages to express a distributed system as a set of logical invariants [2, 19]. Two properties of distributed systems proved difficult to model in Datalog. First, the state of any such system evolves with its execution. Second, deductions in these systems may be arbitrarily delayed, dropped, or reordered by the unreliable network links they must traverse. Previous efforts addressed the former by extending Datalog to include updates, key constraints, persistence and events, and the latter by assuming ordered and reliable delivery while ignoring delay. These details have a semantics outside Datalog, which increases the complexity of the language or its interpretation, and forces programmers to think operationally. We argue that the missing component from these previous languages is a notion of time.
    12/2009;
  • ABSTRACT: The Paxos consensus protocol can be specified concisely, but is notoriously difficult to implement in practice. We recount our experience building Paxos in Overlog, a distributed declarative programming language. We found that the Paxos algorithm is easily translated to declarative logic, in large part because the primitives used in consensus protocol specifications map directly to simple Overlog constructs such as aggregation and selection. We discuss the programming idioms that appear frequently in our implementation, and the applicability of declarative programming to related application domains.
    Operating Systems Review. 01/2009; 43:25-30.
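    The abstract above notes that consensus primitives map naturally onto aggregation and selection. The sketch below shows the phase-one (prepare/promise) logic of single-decree Paxos in ordinary Python, with a quorum count playing the role of the Overlog aggregate; it is a simplified illustration, not the paper's Overlog program.

      # Simplified single-decree Paxos phase 1 (prepare/promise), illustrating
      # how quorum checks reduce to counting and value selection reduces to a
      # max aggregate. Not the paper's Overlog code.

      class Acceptor:
          def __init__(self):
              self.promised_n = -1       # highest ballot promised so far
              self.accepted_n = -1       # ballot of the accepted value, if any
              self.accepted_value = None

          def prepare(self, n):
              if n > self.promised_n:
                  self.promised_n = n
                  # Promise, reporting any previously accepted value.
                  return ("promise", self.accepted_n, self.accepted_value)
              return ("reject", None, None)

      def run_phase1(acceptors, ballot, proposed_value):
          promises = [a.prepare(ballot) for a in acceptors]
          granted = [p for p in promises if p[0] == "promise"]
          if len(granted) <= len(acceptors) // 2:      # aggregation: count promises
              return None                              # no quorum; retry with a higher ballot
          # Selection: adopt the value accepted under the highest ballot, if any.
          best = max(granted, key=lambda p: p[1])
          return best[2] if best[1] >= 0 else proposed_value

      acceptors = [Acceptor() for _ in range(3)]
      value = run_phase1(acceptors, ballot=1, proposed_value="decide:A")
      print(value)   # "decide:A" when no acceptor has accepted anything yet
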
  • Russell Sears, Eric A. Brewer
    ABSTRACT: We revisit write-ahead logging with an eye toward finer-grained concurrency and an increased range of workloads, then remove two core assumptions: that pages are the unit of recovery and that timestamps (LSNs) should be stored on each page. Recovering individual application-level objects (rather than pages) simplifies the handling of systems with object sizes that differ from the page size. We show how to remove the need for LSNs on the page, which in turn enables DMA or zero-copy I/O for large objects, increases concurrency, and reduces communication between the application, buffer manager and log manager. Our experiments show that the looser coupling significantly reduces the impact of latency among the components. This makes the approach particularly applicable to large scale distributed systems, and enables a "cross pollination" of ideas from distributed systems and transactional storage. However, these advantages come at a cost; segments are incompatible with physiological redo, preventing a number of important optimizations. We show how allocation enables (or prevents) mixing of ARIES pages (and physiological redo) with segments. We present an allocation policy that avoids undesirable interactions that complicate other combinations of ARIES and LSN-free pages, and then present a proof that both approaches and our combination are correct. Many optimizations presented here were proposed in the past. However, we believe this is the first unified approach.
    PVLDB. 01/2009; 2:490-501.
  • ABSTRACT: The efficiencies of cloud computing enable a wide range of developers to access the power of large clusters. Unfortunately, parallel and distributed programming remains too complex for many of these developers, and slows the progress of even sophisticated distributed system builders. We conjecture that a broad range of distributed software can be recast naturally in a data-parallel programming model. We argue that this significantly raises the level of abstraction for programmers, improving code simplicity, speed of development, ease of software evolution, and program correctness. To evaluate these claims, the bulk of the paper presents our experience using the Overlog language to implement a "Big Data" analytics stack that is API-compatible with Hadoop and HDFS, providing comparable performance. We extended the system with complex distributed features not yet available in Hadoop, including availability, scalability, and unique monitoring and debugging facilities. We present both quantitative and anecdotal results from our experience, showing that a data-centric approach can substantially simplify distributed systems programming.
    01/2009;
  • ABSTRACT: Most application provenance systems are hard coded for a particular type of system or data, while current provenance file systems maintain in-memory provenance graphs and reside in kernel space, leading to complex and constrained implementations. Story Book resides in user space, and treats provenance events as a generic event log, leading to a simple, flexible and easily optimized system. We demonstrate the flexibility of our design by adding provenance to a number of different systems, including a file system, database and a number of file types, and by implementing two separate storage backends. Although Story Book is nearly 2.5 times slower than ext3 under worst case workloads, this is mostly due to FUSE message passing overhead. Our experiments show that coupling our simple design with existing storage optimizations provides higher throughput than existing systems.
    First Workshop on the Theory and Practice of Provenance, February 23, 2009, San Francisco, CA, USA, Proceedings; 01/2009
  • ABSTRACT: Rose is a database storage engine for high-throughput replication. It targets seek-limited, write-intensive transaction processing workloads that perform near real-time decision support and analytical processing queries. Rose uses log structured merge (LSM) trees to create full database replicas using purely sequential I/O, allowing it to provide orders of magnitude more write throughput than B-tree based replicas. Also, LSM-trees cannot become fragmented and provide fast, predictable index scans. Rose's write performance relies on replicas' ability to perform writes without looking up old values. LSM-tree lookups have performance comparable to B-tree lookups. If Rose read each value that it updated then its write throughput would also be comparable to a B-tree. Although we target replication, Rose provides high write throughput to any application that updates tuples without reading existing data, such as append-only, streaming and versioning databases. We introduce a page compression format that takes advantage of LSM-tree's sequential, sorted data layout. It increases replication throughput by reducing sequential I/O, and enables efficient tree lookups by supporting small page sizes and doubling as an index of the values it stores. Any scheme that can compress data in a single pass and provide random access to compressed values could be used by Rose. Replication environments have multiple readers but only one writer. This allows Rose to provide atomicity, consistency and isolation to concurrent transactions without resorting to rollback, blocking index requests or interfering with maintenance tasks. Rose avoids random I/O during replication and scans, leaving more I/O capacity for queries than existing systems, and providing scalable, real-time replication of seek-bound workloads. Analytical models and experiments show that Rose provides orders of magnitude greater replication bandwidth over larger databases than conventional techniques.
    Proceedings of The Vldb Endowment - PVLDB. 01/2008; 1(1).
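    The page-format requirement stated above, compress sorted data in a single pass while keeping random access to individual values, can be illustrated with simple frame-of-reference encoding of sorted integer keys. This is a stand-in chosen for brevity; Rose's actual column formats differ.

      # Frame-of-reference sketch: sorted integer keys on a page are stored as a
      # base value plus small fixed-width deltas, so the page is built in one
      # sequential pass and any slot can be read back by index without
      # decompressing the whole page. A stand-in, not Rose's actual format.

      import array
      import bisect

      class ForPage:
          def __init__(self, sorted_keys):
              assert sorted_keys == sorted(sorted_keys)
              self.base = sorted_keys[0]
              # One pass over the input: store 32-bit offsets from the base.
              self.offsets = array.array("I", (k - self.base for k in sorted_keys))

          def get(self, slot):
              # Random access: reconstruct one value without touching its neighbors.
              return self.base + self.offsets[slot]

          def find(self, key):
              # The sorted layout also allows binary search over the compressed form.
              slot = bisect.bisect_left(self.offsets, key - self.base)
              if slot < len(self.offsets) and self.get(slot) == key:
                  return slot
              return None

      page = ForPage([1000, 1003, 1004, 1010, 1025])
      assert page.get(3) == 1010
      assert page.find(1004) == 2
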
  • Russell Sears, Catharine Van Ingen, Jim Gray
    ABSTRACT: Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation - one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256KB are best stored in a database while objects larger than 1MB are best stored in the filesystem. Between 256KB and 1MB, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of "storage age", or number of object overwrites, as a way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
    Computing Research Repository - CORR. 01/2007;
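    The "storage age" measure used above normalizes wall-clock time by how many times the stored objects have been overwritten, so workloads with very different write rates can be compared at the same point in their lifetime. A minimal sketch of the bookkeeping (the field names are illustrative, not taken from the paper):

      # Storage age as a normalization of time: cumulative bytes written divided
      # by the live data size, i.e. roughly how many times the stored objects
      # have been overwritten. (Field names are illustrative.)

      def storage_age(bytes_written_total, live_data_bytes):
          return bytes_written_total / live_data_bytes

      # Example: a 10 GB object store that has absorbed 45 GB of writes has a
      # storage age of 4.5, regardless of whether that took a day or a year.
      print(storage_age(45 * 2**30, 10 * 2**30))   # 4.5
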
  • Russell Sears, Catharine Van Ingen
    ABSTRACT: Fragmentation leads to unpredictable and degraded application performance. While these problems have been studied in detail for desktop filesystem workloads, this study examines newer systems such as scalable object stores and multimedia repositories. Such systems use a get/put interface to store objects. In principle, databases and filesystems can support such applications efficiently, allowing system designers to focus on complexity, deployment cost and manageability. Although theoretical work proves that certain storage policies behave optimally for some workloads, these policies often behave poorly in practice. Most storage benchmarks focus on short-term behavior or do not measure fragmentation. We compare SQL Server to NTFS and find that fragmentation dominates performance when object sizes exceed 256KB-1MB. NTFS handles fragmentation better than SQL Server. Although the performance curves will vary with other systems and workloads, we expect the same interactions between fragmentation and free space to apply. It is well-known that fragmentation is related to the percentage of free space. We found that the ratio of free space to object size also impacts performance. Surprisingly, in both systems, storing objects of a single size causes fragmentation, and changing the size of write requests affects fragmentation. These problems could be addressed with simple changes to the filesystem and database interfaces. It is our hope that an improved understanding of fragmentation will lead to predictable storage systems that require less maintenance after deployment.
    3rd Biennial Conference on Innovative Data Systems Research (CIDR 2007), January 7-10, 2007, Asilomar, California, USA; 12/2006
  • Russell Sears, Eric A. Brewer
    ABSTRACT: An increasing range of applications requires robust support for atomic, durable and concurrent transactions. Databases provide the default solution, but force applications to interact via SQL and to forfeit control over data layout and access mechanisms. We argue there is a gap between DBMSs and file systems that limits designers of data-oriented applications. Stasis is a storage framework that incorporates ideas from traditional write-ahead logging algorithms and file systems. It provides applications with flexible control over data structures, data layout, robustness, and performance. Stasis enables the development of unforeseen variants on transactional storage by generalizing write-ahead logging algorithms. Our partial implementation of these ideas already provides specialized (and cleaner) semantics to applications. We evaluate the performance of a traditional transactional storage system based on Stasis, and show that it performs favorably relative to existing systems. We present examples that make use of custom access methods, modified buffer manager semantics, direct log file manipulation, and LSN-free pages. These examples facilitate sophisticated performance optimizations such as zero-copy I/O. These extensions are composable, easy to implement and significantly improve performance.
    7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6-8, Seattle, WA, USA; 01/2006
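    As background for the write-ahead logging generalization described above, the core invariant any such system maintains is that a redo record reaches the durable log before the transaction is reported committed, so committed work can be replayed after a crash. The sketch below is a generic illustration of that invariant in Python, not Stasis's actual (C) interface.

      # Generic write-ahead logging sketch (illustration only, not Stasis's API):
      # log first, then update volatile state; recovery redoes committed work.

      class WalStore:
          def __init__(self):
              self.log = []        # append-only log (stand-in for a durable log file)
              self.pages = {}      # volatile "buffer pool"

          def write(self, xid, key, value):
              self.log.append(("update", xid, key, value))   # log record first ...
              self.pages[key] = value                         # ... then update in memory

          def commit(self, xid):
              self.log.append(("commit", xid))                # durable commit record

          def recover(self):
              committed = {rec[1] for rec in self.log if rec[0] == "commit"}
              pages = {}
              for rec in self.log:                            # redo pass over the log
                  if rec[0] == "update" and rec[1] in committed:
                      pages[rec[2]] = rec[3]
              self.pages = pages

      store = WalStore()
      store.write(xid=1, key="a", value="v1")
      store.commit(xid=1)
      store.write(xid=2, key="b", value="v2")   # never committed
      store.pages = {}                          # simulate losing volatile state in a crash
      store.recover()
      print(store.pages)                        # {'a': 'v1'}
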
  • ABSTRACT: Systems for learning to detect anomalous email behavior, such as worms and viruses, tend to build either per-user models or a single global model. Global models leverage a larger training corpus but often model individual users poorly. Per-user models capture fine-grained behaviors but can take a long time to accumulate sufficient training data. Approaches that combine global and per-user information have the potential to address these limitations. We use the Latent Dirichlet Allocation model to transition smoothly from the global prior to a particular user's empirical model as the amount of user data grows. Preliminary results demonstrate long-term accuracy comparable to per-user models, while also showing near-ideal performance almost immediately on new users.
    01/2006;
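    The smooth transition from a global prior to a per-user model described above can be illustrated with plain Dirichlet smoothing. The paper uses Latent Dirichlet Allocation, so the formula below is a deliberately simplified stand-in that shows the same qualitative effect: with little user data the global model dominates, and the user's empirical distribution takes over as data accumulates. The feature names and prior strength are made up for the example.

      # Dirichlet-smoothed per-user estimate (simplified stand-in for the
      # paper's LDA approach): interpolate between a global prior and the
      # user's own counts, weighted by how much data the user has.

      def smoothed_user_prob(feature, user_counts, global_probs, alpha=50.0):
          n_user = sum(user_counts.values())
          prior = alpha * global_probs.get(feature, 0.0)
          return (user_counts.get(feature, 0) + prior) / (n_user + alpha)

      global_probs = {"sends_exe_attachment": 0.01, "plain_text": 0.70}

      new_user = {}                                      # no mail observed yet
      seasoned_user = {"sends_exe_attachment": 40, "plain_text": 60}

      print(smoothed_user_prob("sends_exe_attachment", new_user, global_probs))      # 0.01, the global prior
      print(smoothed_user_prob("sends_exe_attachment", seasoned_user, global_probs)) # 0.27, user data dominates
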