About
72 Publications · 9,622 Reads · 5,010 Citations
Publications (72)
First introduced in 1954, linear probing is one of the oldest data structures in computer science, and due to its unrivaled data locality, it continues to be one of the fastest hash tables in practice. It is widely believed and taught, however, that linear probing should never be used at high load factors; this is because primary-clustering effects...
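For readers who want the basic mechanism in front of them, here is a minimal linear-probing table; the class and method names are illustrative, it omits deletion, and it has none of the high-load-factor machinery the paper studies:

```python
class LinearProbingTable:
    """Toy linear-probing hash table (illustrative sketch; no deletion)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: None or (key, value)
        self.size = 0

    def _probe(self, key):
        # Scan forward from the hash slot until we find the key or an
        # empty cell; this forward scanning is where primary clustering
        # arises at high load factors.
        i = hash(key) % self.capacity
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.capacity
        return i

    def insert(self, key, value):
        if self.size + 1 > 0.9 * self.capacity:
            raise RuntimeError("load factor too high for this sketch")
        i = self._probe(key)
        if self.slots[i] is None:
            self.size += 1
        self.slots[i] = (key, value)

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot is not None else None
```

The sketch caps the load factor at 0.9 only to keep probes terminating; the paper's point is precisely that this conventional caution can be revisited.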
From bottom to top
The doubling of the number of transistors on a chip every 2 years, a seemingly inevitable trend that has been called Moore's law, has contributed immensely to improvements in computer performance. However, silicon-based transistors cannot get much smaller than they are today, and other approaches should be explored to keep performan...
Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transaction...
The CSI framework [15] provides comprehensive static instrumentation that a compiler can insert into a program-under-test so that dynamic-analysis tools - memory checkers, race detectors, cache simulators, performance profilers, code-coverage analyzers, etc. - can observe and investigate runtime behavior. Heretofore, tools based on compiler instrum...
We present Autogen—an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. Autogen analyzes the set of DP table locations accessed by the iterative algorithm when run on...
File systems that employ write-optimized dictionaries (WODs) can perform random-writes, metadata updates, and recursive directory traversals orders of magnitude faster than conventional file systems. However, previous WOD-based file systems have not obtained all of these performance gains without sacrificing performance on other operations, such as...
Most B-tree articles assume that all N keys have the same size K, that f = B/K keys fit in a disk block, and therefore that the search cost is O(log_{f+1} N) block transfers. When keys have variable size, however, B-tree operations have no nontrivial performance guarantees.
This article provides B-tree-like performance guarantees on dictionari...
The Bε-tree File System, or BetrFS (pronounced "better eff ess"), is the first in-kernel file system to use a write-optimized data structure (WODS). WODS are promising building blocks for storage systems because they support both microwrites and large scans efficiently. Previous WODS-based file systems have shown promise but have been hampered in s...
Cilkprof is a scalability profiler for multithreaded Cilk computations. Unlike its predecessor Cilkview, which analyzes only the whole-program scalability of a Cilk computation, Cilkprof collects work (serial running time) and span (critical-path length) data for each call site in the computation to assess how much each call site contributes to the...
A method, apparatus and computer program product for storing data in a disk storage system is presented. A high-performance dictionary data structure is defined. The dictionary data structure is stored on a disk storage system. Key-value pairs can be inserted into and deleted from the dictionary data structure. Updates run faster than one insertion per
We implemented two data structures in a consistency-oblivious programming (COP) style: a red-black tree and a dynamic cache-oblivious B-tree. Unlike a naive transactional style, in which an operation such as an insertion is enclosed in a hardware transaction, in a COP style there are two phases: an oblivious phase that runs with no transactions or...
A method, apparatus and computer program product for storing data in a disk storage system is presented. A dictionary data structure is defined and stored on the disk storage system. Key-value pairs can be inserted into and deleted from the dictionary data structure, with full transactional semantics, at a rate that is faster than one insertion per disk...
This paper presents new alternatives to the well-known Bloom filter data structure. The Bloom filter, a compact data structure supporting set insertion and membership queries, has found wide application in databases, storage systems, and networks. Because the Bloom filter performs frequent random reads and writes, it is used almost exclusively in R...
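The Bloom filter itself is compact enough to sketch; this toy version (illustrative parameters m and k, hash positions derived from SHA-256) shows the set-insertion and membership-query interface the abstract refers to:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k bit positions per item, no deletions,
    one-sided error (false positives only). Parameters are illustrative."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k pseudo-independent bit positions from SHA-256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[p] for p in self._positions(item))
```

Note how `add` and `might_contain` touch k scattered bit positions: these are exactly the frequent random reads and writes that the abstract says confine Bloom filters to RAM.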
The TokuFS file system outperforms write-optimized file systems by an order of magnitude on microdata write workloads, and outperforms read-optimized file systems by an order of magnitude on read workloads. Microdata write workloads include creating and destroying many small files, performing small unaligned writes within large files, and updating...
A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification o...
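To make "stencil computation" concrete, here is the naive serial form of a 1-D three-point stencil; this is the simple loop a programmer would write, not the cache-efficient trapezoidal-decomposition code that Pochoir generates:

```python
def step(grid):
    """One update of a 1-D three-point averaging stencil (illustrative).
    Each interior point becomes the mean of itself and its two neighbors;
    boundary points are held fixed."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new
```

A full stencil computation applies `step` repeatedly over many time steps; the difficulty the paper addresses is doing those repeated sweeps with good cache behavior and parallelism.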
Most B-tree papers assume that all N keys have the same size K, that F = B/K keys fit in a disk block, and therefore that the search cost is O(log_{F+1} N) block transfers. When keys have variable size, however, B-tree operations have no nontrivial performance guarantees. This paper provides B-tree-like performance guarantees on dictionaries that cont...
The tx2500 disk cluster at MIT Lincoln Laboratory sorted a terabyte (10^10 100-byte records) in 197s using an "Indy" sort, and in 297s using a "Daytona" sort. The sort employed a parallel sample sort, and ran on 400 nodes, each containing a 6-disk RAID, and 8GB of memory, all connected by Infiniband. It employed TCP sockets to communicate between the...
A streaming B-tree is a dictionary that efficiently implements insertions and range queries. We present two cache-oblivious streaming B-trees: the shuttle tree and the cache-oblivious lookahead array (COLA). For block-transfer size B and on N elements, the shuttle tree implements searches in optimal O(log_{B+1} N) transfers, range queries of L suc...
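The COLA's insertion mechanics resemble binary addition: level k holds a sorted run of 2^k elements or is empty, and inserting into a full level carries a merge into the next level. A toy sequential sketch (illustrative only; it uses Python's sort in place of a linear-time merge and omits the fractional cascading the real structure uses to speed searches):

```python
import bisect

class COLA:
    """Toy cache-oblivious lookahead array: levels[k] is either empty or
    a sorted run of exactly 2**k elements."""

    def __init__(self):
        self.levels = []

    def insert(self, x):
        carry = [x]  # a sorted run of size 2**k at loop iteration k
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry
                return
            # Two full runs of size 2**k combine into one of size 2**(k+1),
            # like a carry in binary addition.  (sorted() stands in for a
            # linear-time merge of two sorted runs.)
            carry = sorted(carry + self.levels[k])
            self.levels[k] = []
            k += 1

    def contains(self, x):
        # Without fractional cascading, search every level independently.
        for run in self.levels:
            i = bisect.bisect_left(run, x)
            if i < len(run) and run[i] == x:
                return True
        return False
```

Each element is merged O(log N) times over its lifetime, which is how the COLA converts many small writes into a few large sequential ones.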
Humanity's understanding of the Earth's weather and climate depends critically on accurate forecasting and state-estimation technology. It is not clear how to build an effective dynamic data-driven application system (DDDAS) in which computer models of the planet and observations of the actual conditions interact, however. We are designing and...
A mesh is a graph that divides physical space into regularly-shaped regions. Mesh computations form the basis of many applications, including finite-element methods, image rendering, collision detection, and N-body simulations. In one important mesh primitive, called a mesh update, each mesh vertex stores a value and repeatedly updates this value...
B-trees are the data structure of choice for maintaining searchable data on disk. However, B-trees perform suboptimally when keys are long or of variable length; when keys are compressed, even using front compression, the standard B-tree compression scheme; for range queries; and with respect to memory effects such as disk prefetching. This...
Transactional memory should be virtualized to support transactions of arbitrary footprint and duration. The unbounded transactional memory (UTM) architecture achieves high performance in the common case of small transactions, without sacrificing correctness in large transactions.
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994. Vita. Includes bibliographical references (p. 149-159). by Bradley C. Kuszmaul. Ph.D.
This paper analyzes the worst-case performance of randomized backoff on simple multiple-access channels. Most previous analysis of backoff has assumed a statistical arrival model. For batched arrivals, in which all n packets arrive at time 0, we show the following tight high-probability bounds. Randomized binary exponential backoff has makespan Θ(nl...
This paper presents concurrent cache-oblivious (CO) B-trees. We extend the cache-oblivious model to a parallel or distributed setting and present three concurrent CO B-trees. Our first data structure is a concurrent lock-based exponential CO B-tree. This data structure supports insertions and non-blocking searches/successor queries. The second and...
I present a VLSI circuit for segmented parallel prefix with gate delay O(log S) and wire delay.
Hardware transactional memory should support unbounded transactions: transactions of arbitrary size and duration. We describe a hardware implementation of unbounded transactional memory, called UTM, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory...
Summary form only given. Backoff strategies have typically been analyzed by making statistical assumptions on the distribution of problem inputs. Although these analyses have provided valuable insights into the efficacy of various backoff strategies, they leave open the question as to which backoff algorithms perform best in the worst case or on in...
Backoff strategies have typically been analyzed by making statistical assumptions on the distribution of problem inputs. Although these analyses have provided valuable insights into the efficacy of various backoff strategies, they leave open the question as to which backoff algorithms perform best in the worst case or on inputs, such as bursty inpu...
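A small simulation makes the batched-arrival setting concrete: all n stations start backlogged, and in round r each remaining station transmits in a uniformly random slot of a window of size 2^r. This is an illustrative model of binary exponential backoff, not the papers' analysis:

```python
import random

def bexp_backoff_makespan(n, seed=0):
    """Simulate binary exponential backoff with batched arrivals: all n
    stations have a packet at time 0.  In round r each backlogged station
    picks a uniformly random slot in a window of size 2**r; a slot succeeds
    iff exactly one station chose it.  Returns the total slots elapsed."""
    rng = random.Random(seed)
    pending, time, r = n, 0, 0
    while pending:
        window = 2 ** r
        choices = [rng.randrange(window) for _ in range(pending)]
        counts = [0] * window
        for c in choices:
            counts[c] += 1
        # Stations whose slot saw no collision are done.
        pending -= sum(1 for c in choices if counts[c] == 1)
        time += window
        r += 1
    return time
```

Running this for various n gives an empirical feel for the makespan that the papers bound analytically for the worst case.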
This work shows how hardware transactional memory (HTM) can be implemented to support transactions of arbitrarily large size, while ensuring that small transactions run efficiently. Our implementation handles small transactions similar to Herlihy and Moss's scheme in that it holds tentative updates in a cache. Unlike their scheme, which uses a spec...
Arguably, one of the biggest deterrents for software developers who might otherwise choose to write parallel code is that parallelism makes their lives more complicated. Perhaps the most basic problem inherent in the coordination of concurrent tasks is the enforcing of atomicity so that the partial results of one task do not inadvertently corrupt a...
The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as Θ(n^2) where n is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, U...
A processor with an explicit dataflow instruction-set architecture may be able to achieve performance comparable to a superscalar RISC processor, even on serial code. To achieve this, the dataflow processor must support speculative operation, especially speculative branches, and a pipeline with bypassing for serial code. This paper outlines a set o...
Introduction. Today's superscalar processors rename registers, bypass registers, checkpoint state so that they can recover from speculative execution, check for dependencies, allocate execution units, and access multi-ported register files. The circuits employed are complex and irregular, requiring much effort and ingenuity to implement well. Furthe...
instructions is fetched in parallel from instruction memory and stored in execution stations from left to right. During each successive clock cycle, a grid-like datapath supplies arguments to instructions. Instructions issue whenever their arguments become available. Once all four instructions have executed, the commit logic computes and stores new...
This paper describes a processor, the Ultrascalar, based on such a structure. We have laid out the Ultrascalar's datapath in an H-tree layout using the Magic design tool. We built our own CMOS standard cells, which we used in designing the register datapath. Figure 1 depicts this datapath for a 64-station Ultrascalar. Each leaf-node of the H-tree r...
Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor implemented in today's technology can achieve an increase of 10--60% (geometric mean of 31%) in program speed compared to today's processors. The processor operates at clo...
The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as Θ(n^2), where n is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, the...
The StarTech massively parallel chess program, running on a 512-processor Connection Machine CM-5 supercomputer, tied for third place at the 1993 ACM International Computer Chess Championship. StarTech employs the Jamboree search algorithm, a natural extension of J. Pearl's Scout search algorithm, to find parallelism in game-tree searches. StarTech...
Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical path" of a Cilk computation can be used to accurately model performan...
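The work/span model mentioned here predicts running time on P processors from two numbers: the work T_1 and the critical-path length T_∞. A tiny helper stating the standard greedy-scheduler bound (a textbook formula, not code from Cilk itself):

```python
def runtime_bound(work, span, p):
    """Greedy-scheduler upper bound T_P <= T_1/P + T_inf, the model the
    Cilk papers use to predict parallel running time."""
    return work / p + span

def parallelism(work, span):
    """Average parallelism T_1 / T_inf: beyond this many processors,
    adding more cannot help."""
    return work / span
```

For example, a computation with work 1000 and span 10 has parallelism 100, so near-linear speedup is expected only up to roughly 100 processors.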
This paper describes a new processor microarchitecture, called the
The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as Θ(n^2) where n is the fetch width, the issue width, or the window size. This paper presents a novel implementation, called t...
This paper is organized as follows. Section 1 reviews the parallel-prefix problem along with the standard solutions. Section 2 reviews the standard log-depth circuits for solving parallel prefix. Section 3 reviews segmented parallel prefix. Section 4 discusses a few minor variations on parallel prefix. Section 5 describes the cyclic segmented paral...
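As background for the circuits discussed here, prefix sums can be computed in ceil(log2 n) combining rounds. This sequential simulation (unsegmented and illustrative; the paper's circuits also handle segments and cyclic wrap-around) performs one layer of a log-depth parallel-prefix circuit per loop iteration:

```python
def prefix_sums(xs):
    """Inclusive prefix sums by recursive doubling: after round with
    stride d, ys[i] holds the sum of the last 2*d inputs ending at i
    (or all inputs up to i, once d exceeds i).  ceil(log2 n) rounds,
    each mirroring one layer of a log-depth parallel-prefix circuit."""
    ys = list(xs)
    d = 1
    while d < len(ys):
        # In hardware, all positions update in parallel within a layer.
        ys = [ys[i] + (ys[i - d] if i >= d else 0) for i in range(len(ys))]
        d *= 2
    return ys
```

A segmented version would additionally stop each combination from crossing a segment boundary, restarting the running sum at the start of every segment.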
This memo describes how to design scheduler circuits that efficiently assign a set of resources to a set of requesting elements. The elements are assumed to be stored in a wrap-around sequence. Not all the elements in the sequence need be requesting, and the number of requesting elements may exceed the number of available resources. Various criteria can be us...
This paper shows how to run multithreaded programs on a DRAM (Distributed Random Access Memory) parallel computer and demonstrates that such programs can run efficiently on a collection of machines distributed across thousands of miles over the internet. Suppose we have a fully strict multithreaded program with work T_1 and critical-path length T_∞,...
The Connection Machine Model CM-5 Supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10^12 floating-point operations per second). The CM-5 obtains its high performance while offering ease of programming, flexibility, and reliability. The machine contains three communication networks: a d...
Cilk (pronounced “silk”) is a C-based runtime system for multi-threaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to accurately model performa...
The RACE(R) parallel computer system provides a high-performance parallel interconnection network at low cost. This paper describes the architecture and implementation of the RACE system, a parallel computer for embedded applications. The topology of the network, which is constructed with 6-port switches, can be specified by the customer, and is ty...
Programmers of the Connection Machine CM-5 data network can improve the performance of their data movement code by more than a factor of three by selectively using global barriers, by limiting the rate at which messages are injected into the network, and by managing the order in which they are injected. Barriers eliminate target-processor congestion,...
The Connection Machine Model CM-5 Supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10^12 floating-point operations per second). The CM-5 obtains its high performance while offering ease of programming, flexibility, and reliability. The machine contains three communication networks: a da...
Message routing networks are acknowledged to be one of the most critical portions of massively parallel computers. This paper presents a processor chip for use in a massively parallel computer. The programmable approach used in this processor provides enough flexibility to make it a “universal” part for building a wide variety of interconnection ne...
Computer chess provides a good testbed for understanding dynamic MIMD-style computations. To investigate the programming issues, we engineered a parallel chess program called *Socrates (pronounced "star-Socrates"), which, running on the Sandia National Laboratories 1824-node Paragon, placed second in the 1995 World Computer Chess Championship. *Socr...
Fast global synchronization provides simple, efficient solutions to many of the system problems of parallel computing. It achieves this by providing composition of both performance and correctness. If you understand the performance and meaning of parallel computations A and B, then you understand the performance and meaning of "A; barrier; B". To d...
The Connection Machine (CM) is a highly parallel single-instruction multiple-data (SIMD) computer, which has been described as "a huge piece of hardware looking for a programming methodology" [Arv]. Applicative languages, on the other hand, can be described as a programming methodology looking for a parallel computing engine. By simulating archite...
This document describes Cilk
This paper investigates the puzzle of this new glitch, in the following linear, self-timed sequence of events.
Workloads for high-performance streaming databases often contain many writes of small data blocks (for example, of metadata) followed by large subrange queries. Most of today's file systems and databases either cannot provide adequate performance for the write phase, the read phase, or both. The supercomputing technologies group at MIT CSAIL has...
Cache-oblivious B-trees for data sets stored in external memory represent an application that can benefit from the use of transactional memory (TM), yet pose several challenges for existing TM implementations. Using TM, a programmer can modify a serial, in-memory cache-oblivious B-tree (CO B-tree) to support concurrent operations in a straightf...
For the fourth challenge (string matching) of the 2009 Intel Threading Challenge I implemented Rabin-Karp hashing, using the Cilk++ multithreading programming environment, and ran it on Linux. The search component of this program runs with nearly linear speedup on up to 16 cores. I ran a search on the first chromosome (123MB) of Canis familiaris (d...
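The Rabin-Karp rolling-hash search that entry describes can be sketched as follows; the base and modulus are illustrative choices, not those of the contest program, and this serial version omits the Cilk++ parallelization:

```python
def rabin_karp(text, pattern, base=256, mod=(1 << 61) - 1):
    """Rolling-hash substring search; returns the start index of every
    match.  Hash collisions are resolved by an explicit comparison."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    lead = pow(base, m - 1, mod)  # weight of the outgoing character
    ph = th = 0
    for i in range(m):
        ph = (ph * base + ord(pattern[i])) % mod
        th = (th * base + ord(text[i])) % mod
    hits = []
    for i in range(len(text) - m + 1):
        if th == ph and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < len(text):
            # Slide the window: drop text[i], append text[i+m].
            th = ((th - ord(text[i]) * lead) * base + ord(text[i + m])) % mod
    return hits
```

The parallel version in the entry splits the text into overlapping chunks, one per worker; the rolling hash makes each chunk's scan linear time.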
I implemented a program to use skip lists as a transposition table in the game of Quari. I used a straightforward search implemented in Cilk, and spent most of my effort making the move generation code as fast as possible. I implemented the skip list algorithm of [2].