Article

Pipelining with Futures

Abstract

Pipelining has been used in the design of many PRAM algorithms to reduce their asymptotic running time. Paul, Vishkin, and Wagener (PVW) used the approach in a parallel implementation of 2-3 trees. The approach was later used by Cole in the first O(lg n) time sorting algorithm on the PRAM not based on the AKS sorting network, and has since been used to improve the time of several other algorithms. Although the approach has improved the asymptotic time of many algorithms, there are two practical problems: maintaining the pipeline is quite complicated for the programmer, and the pipelining forces highly synchronous code execution. Synchronous execution is less practical on asynchronous machines and makes it difficult to modify a schedule to use less memory or to take better advantage of locality. In this paper we show how futures (a parallel language construct) can be used to implement pipelining without requiring the user to code it explicitly, allowing for much simpler code and more asynchronous execution. A runtime system manages the pipelining implicitly. As with user-managed pipelining, we show how the technique reduces the depth of many algorithms by a logarithmic factor over the nonpipelined version. We describe and analyze four algorithms for which this is the case: a parallel merging algorithm on trees, parallel algorithms for finding the union and difference of two randomized balanced trees (treaps), and insertion into a variant of the PVW 2-3 trees. For three of these, the pipeline delays are data dependent, making them particularly difficult to pipeline by hand. To determine the runtime of the algorithms, we first analyze them in a language-based cost model in terms of the work w and depth d of the computations, and then show universal bounds for implementing the language on various machine models.
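
To make the idea concrete, here is a minimal C++ sketch (my illustration, not code from the paper, which works in a language-based PRAM-style cost model): each list cell's tail is a future, so a consumer can traverse and sum a prefix of the stream while the producer is still extending it. The names Node, produce, and consume are hypothetical.

#include <cstdio>
#include <future>
#include <memory>

// A cons cell whose tail is a future: the consumer can inspect the head
// while the producer is still computing the rest of the stream.
struct Node;
using NodePtr = std::shared_ptr<Node>;
struct Node {
    int value;
    std::shared_future<NodePtr> tail;
};

// Producer: emits the squares of i..n-1, one asynchronous step per element.
std::shared_future<NodePtr> produce(int i, int n) {
    return std::async(std::launch::async, [=]() -> NodePtr {
        if (i == n) return nullptr;
        auto node = std::make_shared<Node>();
        node->value = i * i;
        node->tail = produce(i + 1, n);  // the rest is a future: pipelining
        return node;
    }).share();
}

// Consumer: folds over the stream, blocking only for the next needed cell.
long consume(std::shared_future<NodePtr> stream) {
    long sum = 0;
    for (NodePtr p = stream.get(); p != nullptr; p = p->tail.get())
        sum += p->value;
    return sum;
}

int main() {
    std::printf("sum = %ld\n", consume(produce(0, 100)));
    return 0;
}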

... We can also encode various forms of concurrency primitives, such as fork/join parallelism (Conway, 1963) implemented by parallel pairs, futures (Halstead, 1985), and a concurrency monad in the style of SILL (Toninho et al., 2013; Toninho, 2015; Griffith, 2016) (which combines sequential functional programming with concurrent session-typed programming). As an almost immediate consequence of our reconstruction of futures, we obtain a clean and principled subsystem of linear futures, already anticipated and used in parallel complexity analysis by Blelloch and Reid-Miller (Blelloch & Reid-Miller, 1999) without being rigorously developed. ...
... The shifts between the linear and non-linear modes allow both types of futures to be used in the same program, embedding the linear language (including its futures) into the non-linear language via the monad formed by composing the up and down shifts between the two modes. Uses for linear futures (without a full formalization) in the efficient expression of certain parallel algorithms have already been explored in prior work (Blelloch & Reid-Miller, 1999), but to our knowledge, no formalization of linear futures has yet been given. ...
... There are several potential directions that future work in this space could take. In our reconstruction of futures, we incidentally also provide a definition of linear futures, which have been used in designing pipelines (Blelloch & Reid-Miller, 1999), but to our knowledge have not been examined formally or implemented. One item of future work, then, would be to further explore linear futures, now aided by a formal definition which is also amenable to implementation. ...
Article
Full-text available
Common approaches to concurrent programming begin with languages whose semantics are naturally sequential and add new constructs that provide limited access to concurrency, as exemplified by futures. This approach has been quite successful, but often does not provide a satisfactory theoretical backing for the concurrency constructs, and it can be difficult to give a good semantics that allows a programmer to use more than one of these constructs at a time. We take a different approach, starting with a concurrent language based on a Curry–Howard interpretation of adjoint logic, to which we add three atomic primitives that allow us to encode sequential composition and various forms of synchronization. The resulting language is highly expressive, allowing us to encode futures, fork/join parallelism, and monadic concurrency in the same framework. Notably, since our language is based on adjoint logic, we are able to give a formal account of linear futures, which have been used in complexity analysis by Blelloch and Reid-Miller. The uniformity of this approach means that we can similarly work with many of the other concurrency primitives in a linear fashion, and that we can mix several of these forms of concurrency in the same program to serve different purposes.
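
A rough intuition for linearity, sketched in C++ under the assumption that "linear" means "touched exactly once": a plain std::future is move-only and permits a single get(), which approximates (but does not formalize) the discipline discussed above.

#include <cstdio>
#include <future>
#include <utility>

// A linear future must be consumed exactly once. std::future is move-only,
// and get() may be called only once before the handle becomes invalid.
int main() {
    std::future<int> f = std::async(std::launch::async, [] { return 42; });
    std::future<int> g = std::move(f);  // ownership transfers; f is now empty
    std::printf("touched once: %d\n", g.get());
    // A second g.get() would be invalid: the single use has been spent.
    return 0;
}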
... The additional flexibility of futures allows one to write a wider range of parallel programs and/or provide a higher level of parallelism beyond what can be specified using only fork-join parallelism. For instance, Blelloch and Reid-Miller [9] show that one can asymptotically reduce the span of various tree operations using parallel futures. Since its proposal in the late 70s [4,20], future constructs have been incorporated into various task parallel languages and platforms [14-16, 19, 23, 33, 39, 44, 47], including the C++11 standard [31]. ...
... Benchmarks under category A) include: mm (matrix multiplication, input 4k-by-4k), smm (the Strassen matrix multiplication algorithm, input 4k-by-4k), and sort (parallel mergesort, input 10⁸). Benchmarks under category B) include: heartwall (adapted from the Rodinia benchmark suite [17] that tracks the movement of a mouse heart over a sequence of ultrasound images, input 104 frames), lcs (dynamic programming solving longest common subsequence, input 32k with base case size 512), sw (dynamic programming that implements Smith-Waterman for sequence alignment, input 2k with base case 32), bst (the pipelined tree merge using parallel futures from [9], input 8e6 and 4e6). The benchmark under category C) is ferret (adapted from the PARSEC benchmark suite [6] that implements a content-based similarity search on images with pipeline parallelism, with input size native). ...
... By removing the synchronization overhead between get and put, we see T1 drops further from 26.78 seconds to 21.06 seconds, which is comparable to that of fib-fj (though naturally things cannot run correctly in parallel if there is no synchronization between get and put). This difference in span decreases as B approaches √N. This shows that these are ...
Conference Paper
The use of futures provides a flexible way to express parallelism and can generate arbitrary dependences among parallel subcomputations. The additional flexibility that futures provide comes with a cost, however. When scheduled using classic work stealing, a program with futures, compared to a program that uses only fork-join parallelism, can incur a much higher number of "deviations," a metric for evaluating the performance of parallel executions. All prior works assume a parsimonious work-stealing scheduler, however, where a worker thread (surrogate of a processor) steals work only when its local deque becomes empty. In this work, we investigate an alternative scheduling approach, called ProWS, where the workers perform proactive work stealing when handling future operations. We show that ProWS, for programs that use futures, can provide provably efficient execution time and equal or better bounds on the number of deviations compared to classic parsimonious work stealing. Given a computation with T1 work and T∞ span, ProWS executes the computation on P processors in expected time O(T1/P + T∞ lg P), with an additional lg P overhead on the span term compared to the parsimonious variant. For structured use of futures, where each future is single-touch with no race on the future handle, the algorithm incurs a number of deviations matching that of the parsimonious variant. For general use of futures, the algorithm incurs O(mk T∞ + P T∞ lg P) deviations, where mk is the maximum number of future touches that are logically parallel. Compared to the bound for the parsimonious variant, O(k T∞ + P T∞), with k being the total number of touches in the entire computation, this bound is better assuming mk = Ω(P lg P) and mk is smaller than k, which holds true for all the benchmarks we examined.
... We evaluate FutureRD using six benchmarks: longestcommon subsequence (lcs), Smith-Waterman (sw), matrix multiplication without temporary matrices (mm), binary tree merge (bst) as described by Blelloch and Reid-Miller [10], Heart Wall Tracking (heartwall), and Dedup (dedup). Heart Wall Tracking and Dedup both contain parallel patterns that cannot be easily implemented using fork-join constructs alone. ...
... Lemma A.10. Consider a node u where U_u = Find(D_NSP, u) is unattached. ...
... Say u has an attached successor A_1 and v has an attached predecessor A_2, and there is no path from u to v that contains only series-parallel edges (otherwise, the first part of the query will give the correct answer), but there is a path p from u to v containing create_fut and get_fut edges. This path must go through the last node of A_1 (Lemma A.10). In addition, since there are no incident non-SP edges between nodes in A_2 and v (Lemma A.8), this path must also go through A_2. ...
Preprint
Full-text available
This paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior works on determinacy race detection have mostly focused on either task parallel programs that follow a series-parallel dependence structure or ones with unrestricted use of futures that generate arbitrary dependences. In this work, we consider a restricted use of futures and show that we can detect races more efficiently than with general use of futures. Specifically, we present two algorithms: MultiBags and MultiBags+. MultiBags targets programs that use futures in a restricted fashion and runs in time O(T1 α(m, n)), where T1 is the sequential running time of the program, α is the inverse Ackermann function, m is the total number of memory accesses, and n is the dynamic count of places at which parallelism is created. Since α is a very slowly growing function (upper bounded by 4 for all practical purposes), it can be treated as a close-to-constant overhead. MultiBags+ is an extension of MultiBags that targets programs with general use of futures. It runs in time O((T1 + k²) α(m, n)), where T1, α, m, and n are defined as before, and k is the number of future operations in the computation. We implemented both algorithms and empirically demonstrate their efficiency.
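
The α(m, n) factor in these bounds is characteristic of a disjoint-set (union-find) structure with union by rank and path compression; the sketch below shows that standard ingredient on its own. This is my reconstruction of the underlying primitive, not the MultiBags code itself.

#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

// Disjoint sets with union by rank and path compression: a sequence of m
// operations on n elements costs O(m α(m, n)), the close-to-constant
// factor cited in the race-detection bounds above.
struct DisjointSets {
    std::vector<int> parent, rank_;
    explicit DisjointSets(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);  // each element is its own set
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b);
        parent[b] = a;                       // attach shorter tree under taller
        if (rank_[a] == rank_[b]) rank_[a]++;
    }
};

int main() {
    DisjointSets ds(5);
    ds.unite(0, 1); ds.unite(3, 4);
    std::printf("%d %d\n", ds.find(1) == ds.find(0), ds.find(2) == ds.find(3));
    return 0;
}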
... Fork-join parallelism is well-suited to dynamic load-balancing techniques such as work stealing [1–3, 9, 11–13, 15, 18–20]. A popular and effective way to extend fork-join parallelism is to allow threads to create futures [4] [7] [14] [18] [19]. A future is a data object that represents a promise to deliver the result of an asynchronous computation when it is ready. ...
... Note that in a structured computation with local touch constraint, a future thread is now allowed to evaluate multiple futures and these futures can be touched at different times. Though allowing a future thread to compute multiple futures is not very common, Blelloch and Reid-Miller [7] point out that it can be useful for some future-parallel computations like pipeline parallelism [7] [9] [16] [17] [21]. We will show in Section 6.1 that work-stealing parallel executions of computations satisfying the local touch constraint also have relatively low cache overheads. ...
Article
Full-text available
In fork-join parallelism, a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges, and the tasks are executed, possibly in parallel, in an order consistent with their dependencies. A popular and effective way to extend fork-join parallelism is to allow threads to create futures. A thread creates a future to hold the results of a computation, which may or may not be executed in parallel. That result is returned when some thread touches that future, blocking if necessary until the result is ready. Recent research has shown that while futures can, of course, enhance parallelism in a structured way, they can have a deleterious effect on cache locality. In the worst case, futures can incur Ω(P T∞ + t T∞) deviations, which implies Ω(C P T∞ + C t T∞) additional cache misses, where C is the number of cache lines, P is the number of processors, t is the number of touches, and T∞ is the computation span. Since cache locality has a large impact on software performance on modern multicores, this result is troubling. In this paper, however, we show that if futures are used in a simple, disciplined way, then the situation is much better: if each future is touched only once, either by the thread that created it, or by a later descendant of the thread that created it, then parallel executions with work stealing can incur at most O(C P T∞²) additional cache misses, a substantial improvement. This structured use of futures is characteristic of many (but not all) parallel applications.
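
A minimal C++ sketch of the disciplined pattern just described, under the assumption that the single-touch rule means the creating thread performs the only get(); the workloads are placeholders.

#include <cstdio>
#include <future>

// The creator spawns a future, does its own work, and then performs the
// one and only touch: the discipline the bound above relies on.
int main() {
    auto f = std::async(std::launch::async, [] {
        long s = 0;
        for (int i = 0; i < 1000; i++) s += i;   // the future's computation
        return s;
    });
    long local = 0;
    for (int i = 0; i < 1000; i++) local += i;   // creator works in parallel
    std::printf("%ld\n", local + f.get());       // single touch, by the creator
    return 0;
}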
... Pipeline parallelism [6, 16, 17, 25, 27, 28, 31, 33, 35, 37] is a well-known parallel-programming pattern that can be used to parallelize a variety of applications, including streaming applications from the domains of video, audio, and digital signal processing. Many applications, including the ferret, dedup, and x264 benchmarks from the PARSEC benchmark suite [4, 5], exhibit parallelism in the form of a linear pipeline, where a linear sequence S = S_0, . . . ...
... Cilk-P's support for defining linear pipelines on the fly is more flexible than the ordered directive in OpenMP [29], which supports a limited form of on-the-fly pipelining, but it is less expressive than other approaches. Blelloch and Reid-Miller [6] describe a scheme for on-the-fly pipeline parallelism that employs futures [3, 14] to coordinate the stages of the pipeline, allowing even nonlinear pipelines to be defined on the fly. Although futures permit more complex, nonlinear pipelines to be expressed, this generality can lead to unbounded space requirements to attain even modest speedups [7]. ...
Conference Paper
Full-text available
Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP ≤ T1/P + O(T∞ + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
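
The dependence structure of such a linear pipeline can be sketched with futures, though Cilk-P itself uses dedicated linguistics rather than futures. In this hypothetical C++ grid, cell (i, s) waits on stage s-1 of iteration i and on stage s of iteration i-1, which is exactly the cross-iteration dependence a linear pipeline imposes.

#include <cstdio>
#include <future>

int main() {
    constexpr int ITER = 4, STAGE = 3;
    std::shared_future<int> done[ITER][STAGE];
    for (int i = 0; i < ITER; i++) {
        for (int s = 0; s < STAGE; s++) {
            std::shared_future<int> left = (s > 0) ? done[i][s - 1]
                                                   : std::shared_future<int>();
            std::shared_future<int> up = (i > 0) ? done[i - 1][s]
                                                 : std::shared_future<int>();
            done[i][s] = std::async(std::launch::async, [=] {
                if (left.valid()) left.get();  // stage s-1, same iteration
                if (up.valid()) up.get();      // stage s, previous iteration
                std::printf("iteration %d, stage %d\n", i, s);
                return 0;
            }).share();
        }
    }
    done[ITER - 1][STAGE - 1].get();  // drain the pipeline
    return 0;
}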
... The performance of these schedulers has been extensively studied, both theoretically [9] and empirically [10] for nested-parallel (or fully strict) programs. Nested parallelism, however, is a fairly restrictive form of parallelism and cannot be used to implement parallel pipelining [5] . In contrast, parallel pipelines can be constructed using parallel futures [18] as well as other more general forms of parallelism. ...
... Parallel pipelining is an example of a style of program that uses futures and cannot be expressed with nested parallelism. Parallel pipelines can be as simple as a single producer-consumer pair or far more sophisticated [5]. This paper focuses on futures as they seem to be a modest extension of nested parallelism that already reveals fundamental differences. ...
... Parallel pipelining can also be implemented using a number of other language features such as I-structures in ID [4] and streams in sisal [14]. Blelloch and Reid-Miller [5] give several examples of applications where the use of futures can asymptotically reduce the span of the computation graphs. They also give an upper bound on the time required to execute these computations by giving an implementation of a greedy scheduler that supports futures. ...
Conference Paper
ABSTRACT Work stealing is a popular method of scheduling fine-grained parallel tasks. The performance of work stealing has been extensively studied, both theoretically and empirically, but primarily for the restricted class of nested-parallel (or fully strict) computations. We extend this prior work by considering a broader class of programs that also supports pipelined parallelism through the use of parallel futures. Though the overhead of work-stealing schedulers is often quantified in terms of the number of steals, we show that a broader metric, the number of deviations, is a better way to quantify work-stealing overhead for less restrictive forms of parallelism, including parallel futures. For such parallelism, we prove bounds on work-stealing overheads (scheduler time and cache misses) as a function of the number of deviations. Deviations can occur, for example, when work is stolen or when a future is touched. We also show instances where deviations can occur independently of steals and touches. Next, we prove that, under work stealing, the expected number of deviations is O(Pd + td) in a P-processor execution of a computation with span d and t touches of futures. Moreover, this bound is existentially tight for any work-stealing scheduler that is parsimonious (those where processors steal only when their queues are empty); this class includes all prior work-stealing schedulers. We also present empirical measurements of the number of deviations incurred by a classic application of futures, Halstead's quicksort, using our parallel implementation of ML. Finally, we identify a family ...
... In general, there are two types of model parallelism in training DNNs. The first type is graph partitioning [2,13,14,12,15,22,9,39,45,56]. Due to the sequential dependency between the computation tasks in a deep neural network, most graph partitioning works use pipeline parallelism [13,14,56]. ...
... [19] proposes to search for novel neural architectures to provide higher concurrency and distribution opportunities. Though modifying or searching neural architectures for CNN distribution can achieve a highly efficient parallel neural architecture, it intrinsically shows the following drawbacks: (1) it takes a long time to modify or search for the optimal neural architecture and to train the neural network in order to recover the accuracy; (2) the modified neural network cannot achieve the same level of accuracy even after retraining [31,4,19,18,66]. ...
Preprint
Partitioning CNN model computation between edge devices and servers has been proposed to alleviate edge devices' computing capability and network transmission limitations. However, due to the large data size of the intermediate output in CNN models, the transmission latency is still the bottleneck for such partition-offloading. Though compression methods for images like JPEG-based compression can be applied to the intermediate output data in CNN models, their compression rates are limited, and the compression leads to high accuracy loss. Other compression methods for partition-offloading adopt deep learning technology and require hours of additional training. In this paper, we propose a novel compression method DISC for intermediate output data in CNN models. DISC can be applied to partition-offloading systems in a plug-and-play way without any additional training. It shows higher performance on intermediate output data compression than the other compression methods designed for image compression. Further, AGLOP is developed to optimize the partition-offloading system by adjusting the partition point and the hyper-parameters of DISC. Based on our evaluation, DISC can achieve over 98% data size reduction with less than 1% accuracy loss, and AGLOP can achieve over 91.2% end-to-end execution latency reduction compared with the original partition-offloading.
... The algorithms run in optimal O(m lg(n/m)) work and in O(lg n) depth (parallel time) on an EREW PRAM with unit-time scan operations (used for load balancing), all expected case. This bound is based on automated pipelining [3]; without pipelining the algorithms run in O(lg² n) depth. As with the sequential algorithms on treaps, the expectation is over possible random priorities, such that the bounds do not depend on the key values. ...
... Since the expected heights of t_m and t_n are O(lg m) and O(lg n), the expected depth of the operations is O(lg m lg n). In [3] the authors show that the depths of the operations can be reduced to O(lg m + lg n) using pipelining. Furthermore, since the recursive calls in the algorithms are independent (they never access the same parts of the trees), the algorithms run with exclusive reads and writes. ...
Article
We present parallel algorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps [26]). For two sets of size n and m (m ≤ n) the algorithms run in expected O(m lg(n/m)) work and O(lg n) depth (parallel time) on an EREW PRAM with scan operations (implying O(lg² n) depth on a plain EREW PRAM). As with the sequential algorithms on treaps for insertion and deletion, the main advantage of our algorithms is their simplicity. In fact, our algorithms for set operations seem simpler than previous sequential algorithms with the same work bounds, and might therefore also be useful in a sequential context. To analyze the effectiveness of the algorithms we implemented both sequential and parallel versions of the algorithms and ran several experiments on them. Our parallel implementation uses the Cilk [5] shared memory runtime system on a 16-processor SGI Power Challenge and a 6-processor Sun Ultra Enterprise 3000. It shows reasonable speedup: 6.3 ...
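
A sketch of the non-pipelined parallel union on treaps, as the abstract describes it: the higher-priority root wins, the other treap is split at that key, and the two sides are united in parallel. This destructive C++ rendering is my reconstruction, not the authors' code (split and unite are illustrative names); the paper's pipelined variant further reduces the depth.

#include <future>
#include <memory>
#include <utility>

struct Treap {
    int key, priority;
    std::shared_ptr<Treap> left, right;
};
using TreapPtr = std::shared_ptr<Treap>;

// Split t into keys < k and keys >= k (destructive, for brevity).
std::pair<TreapPtr, TreapPtr> split(TreapPtr t, int k) {
    if (!t) return {nullptr, nullptr};
    if (t->key < k) {
        auto [lo, hi] = split(t->right, k);
        t->right = lo;
        return {t, hi};
    }
    auto [lo, hi] = split(t->left, k);
    t->left = hi;
    return {lo, t};
}

TreapPtr unite(TreapPtr a, TreapPtr b) {
    if (!a) return b;
    if (!b) return a;
    if (a->priority < b->priority) std::swap(a, b);  // a's root survives
    auto [lo, hi] = split(b, a->key);
    auto lf = std::async(std::launch::async, unite, a->left, lo);  // fork
    a->right = unite(a->right, hi);
    a->left = lf.get();  // join
    return a;
}

int main() {
    auto n1 = std::make_shared<Treap>(Treap{1, 30, nullptr, nullptr});
    auto n2 = std::make_shared<Treap>(Treap{2, 50, nullptr, nullptr});
    auto n3 = std::make_shared<Treap>(Treap{3, 40, nullptr, nullptr});
    auto t = unite(unite(n1, n3), n2);  // root is the max-priority node
    return t->key == 2 ? 0 : 1;
}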
... We also consider multithreaded computations that use futures [12,13,14,20,25]. The dag structures of computations with futures are defined elsewhere [4]. This is a superclass of nested-parallel computations, but still much more restrictive than general computations. ...
... That is, we have i(u) ≤ (3/4) i(q). ... Suppose a process p chooses process q in D_i as its victim at time step i (a steal attempt of p targeting q occurs at step i). ...
Article
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multi-threaded computations Gn, each member of which requires Θ(n) total instructions (work), for which, when using work stealing, the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multi-threaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.
... This severely underutilizes the accelerator devices. Therefore, most works on graph partitioning (including GPipe) employ pipeline parallelism [23], [6], [7] to increase utilization. In pipeline parallelism, an accelerator device is allocated to a subcomponent that is often referred to as a stage. ...
... Researchers have proposed interesting uses of futures. Blelloch and Reid-Miller [1997] used futures to generate "non-linear" pipelines. Using futures to pipeline the split and merge of binary trees, they developed a parallel tree-merge algorithm with better span than a fork-join parallel merge algorithm. ...
Preprint
Task parallelism research has traditionally focused on optimizing computation-intensive applications. Due to the proliferation of commodity parallel processors, there has been recent interest in supporting interactive applications. Such interactive applications frequently rely on I/O operations that may incur significant latency. In order to increase performance, when a particular thread of control is blocked on an I/O operation, ideally we would like to hide this latency by using the processing resources to do other ready work instead of blocking or spin waiting on this I/O. There has been limited prior work on hiding this latency and only one result that provides a theoretical bound for interactive applications that use I/Os. In this work, we propose a method for hiding the latency of I/O operations by using the futures abstraction. We provide a theoretical analysis of our algorithm that shows our algorithm provides better execution time guarantees than prior work. We also implemented the algorithm in a practically efficient prototype library that runs on top of the Cilk-F runtime, a runtime system that supports futures within the context of the Cilk Plus language, and performed experiments that demonstrate the efficiency of our implementation.
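
The latency-hiding idea in miniature, as a hedged C++ sketch (read_file_async and input.txt are placeholders, not the paper's Cilk-F API): issue the blocking read inside a future, keep doing ready work, and touch the future only when the bytes are actually needed.

#include <cstdio>
#include <fstream>
#include <future>
#include <iterator>
#include <string>

// Run a blocking file read in a helper thread and hand back a future.
std::future<std::string> read_file_async(std::string path) {
    return std::async(std::launch::async, [p = std::move(path)] {
        std::ifstream in(p);
        return std::string(std::istreambuf_iterator<char>(in), {});
    });
}

int main() {
    auto contents = read_file_async("input.txt");  // I/O proceeds in background
    long work = 0;
    for (int i = 0; i < 1000000; i++) work += i;   // useful work meanwhile
    std::printf("read %zu bytes; work = %ld\n", contents.get().size(), work);
    return 0;
}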
... Fork-join programs can preserve important locality properties of a parallel program during execution but futures may not [3]; though more restricted versions of futures may [24]. Futures can improve performance of some applications asymptotically by enabling pipelining [10]. Furthermore, some applications can benefit from even more relaxed forms of parallelism where programs can be structured more freely [41]. ...
Article
Increasing availability of multicore systems has led to greater focus on the design and implementation of languages for writing parallel programs. Such languages support various abstractions for parallelism, such as fork-join, async-finish, futures. While they may seem similar, these abstractions lead to different semantics, language design and implementation decisions, and can significantly impact the performance of end-user applications. In this paper, we consider the question of whether it would be possible to unify various paradigms of parallel computing. To this end, we propose a calculus, called dag calculus, that can encode fork-join, async-finish, and futures, and possibly others. We describe dag calculus and its semantics, establish translations from the aforementioned paradigms into dag calculus. These translations establish that dag calculus is sufficiently powerful for encoding programs written in prevailing paradigms of parallelism. We present concurrent algorithms and data structures for realizing dag calculus on multicore hardware and prove that the proposed techniques are consistent with the semantics. Finally, we present an implementation of the calculus and evaluate it empirically by comparing its performance to highly optimized code from prior work. The results show that the calculus is expressive and that it competes well with, and sometimes outperforms, the state of the art.
... Pipeline parallelism can be constructed by either futures (e.g. [16] who used it to shorten span) or synchronization variables, or by some elegantly defined linguistic constructs [39]. The key idea in pipeline parallelism is to organize a parallel program as a linear sequence of stages. ...
Conference Paper
The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "∥" (parallel) and ";" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct "⤳" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is O((Σ_{i=0}^{h-1} Q*(t; σ·M_i)·C_i)/p), where Q* is the cache complexity of task t, C_i is the cost of a cache miss at the level-i cache, which is of size M_i, σ ∈ (0,1) is a constant, and p is the number of processors in an h-level cache hierarchy.
... However, the nature of data dependences within the algorithm makes vectorization and parallelization non-trivial on modern CPUs and GPUs. On the other hand, these same data dependences have very nice dataflow properties which are quite naturally mapped on spatial architectures using pipeline parallelism [1], [2]. Moreover, the exploitation of pipeline parallelism also enables the conversion of many memory references into much less expensive local communication which further improves efficiency. ...
... At some level all the models based on work and depth view a computation as a DAG, where the nodes represent sequential tasks, and the edges represent dependences between the tasks. The DAG can either be given offline [29], or revealed online [26] [24] [16] [12]. What is common with all these DAG models is that they require some form of scheduling to map a computation onto actual physical hardware. ...
Article
Full-text available
... common theme is parallel computing. Building applications that take full advantage of parallelism remains a significant challenge, even when exclusive access to the computing fabric is assumed. But a chief characteristic of next-generation computing is simultaneous access to shared computing resources by millions of users. For example, an Internet search engine must serve multiple simultaneous queries, each of which is decomposed into multiple parallel tasks whose results are combined to produce a response to a query. The need for parallelism at the level of the individual query arises not only to improve response time, but also to permit decomposition of the search space into fragments that can be managed by individual processors. With the emergence of increasingly sophisticated Web services, one can readily envision greater demand for multiscale parallelism. For example, electronic commerce applications may service many purchase transactions simultaneously, each of which can be broken up into sub-transactions that can be executed in parallel. Or one may envision computationally-intensive services such as a biological database that can be searched for sequence or expression patterns homologous to some user input, or a biological simulation service that performs protein folding simulations on demand. Our view is that the computing systems of the future exhibit parallelism at multiple levels of granularity, and that these resources must be shared among many users simultaneously. The combination of these two aspects, which we call multiscale parallelism (see Figure 1), poses significant challenges for implementing next-generation systems. Conventional parallel programming decomposes a problem into tasks that are executed cooperatively on a shared computing fabric. Conventional job scheduling considers jobs to be atomic units of work that are to be executed competitively on a shared fabric. Multiscale parallelism integrates the competitive with the cooperative aspects at all levels of granularity. Our contention is that scheduling is a major component of any successful approach to managing multiscale parallelism. Broadly speaking, by scheduling we mean the placement over time of components of a computation on processing elements. Scheduling is an area that has been studied extensively in many different ...
... A detailed analysis of this algorithm is still lacking and the main drawback is the reconstructing phase that is composed of waves of synchronized processors which modify the tree in a bottom-up fashion. A top-down approach has been developed in [7]. ...
Article
Fringe analysis uses the distribution of bottom subtrees or fringe of search trees under the assumption of random insertion of keys, yielding an average-case analysis of the fringe. The results in the fringe give upper and lower bounds for several measures for the whole tree. We are interested in the fringe analysis of the synchronized parallel insertion algorithms of Paul, Vishkin, and Wagener (PVW) on 2-3 trees. This algorithm inserts k keys with k processors into a tree of size n in time O(log n + log k). As the direct analysis of this algorithm is very difficult, we tackle this problem by introducing a new family of algorithms, denoted by MacroSplit algorithms, and our main theorem proves that two algorithms of this family, denoted MaxMacroSplit and MinMacroSplit, bound the behavior of the fringe in the PVW algorithm. Previous work deals with the fringe analysis of sequential algorithms, but this type of analysis was still an open problem for parallel algorithms on search trees. We extend fringe analysis to parallel algorithms and we get a rich mathematical structure giving new interpretations even in the sequential case. We prove that random insertion of keys generates a binomial distribution, that the synchronized insertion of keys can be modeled by a Markov chain, and that the coefficients of the transition matrix of the Markov chain are related to the expected local behavior of our algorithm. Finally, we show that the coefficients of the power expansion of this matrix over (n+1)^{-1} are the binomial transform of the expected local behavior of the algorithm, and that the fringe of the PVW algorithm asymptotically converges to the sequential case.
... Again we see no general way of avoiding them within the same time bounds. In more recent work we have shown that by restricting the language it is possible to implement call-by-speculation with purely exclusive access [Blelloch and Reid-Miller 1997]. It is an interesting question whether the concurrent memory access capability is needed in the general case. ...
Article
Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation of a speculative functional language on various machine models. The implementation includes proper parallelization of the necessary queuing operations on suspended threads. Our target machine models are a butterfly network, hypercube, and PRAM. To prove the efficiency of our implementation, we provide a cost model using a profiling semantics and relate the cost model to implementations on the parallel machine models.
Article
Recent work has proposed a memory property for parallel programs, called disentanglement, and showed that it is pervasive in a variety of programs, written in different languages, ranging from C/C++ to Parallel ML, and showed that it can be exploited to improve the performance of parallel functional programs. All existing work on disentanglement, however, considers the "fork/join" model for parallelism and does not apply to "futures", the more powerful approach to parallelism. This is not surprising: fork/join parallel programs exhibit a reasonably strict dependency structure (e.g., series-parallel DAGs), which disentanglement exploits. In contrast, with futures, parallel computations become first-class values of the language, and thus can be created, and passed between functions calls or stored in memory, just like other ordinary values, resulting in complex dependency structures, especially in the presence of mutable state. For example, parallel programs with futures can have deadlocks, which is impossible with fork-join parallelism. In this paper, we are interested in the theoretical question of whether disentanglement may be extended beyond fork/join parallelism, and specifically to futures. We consider a functional language with futures, Input/Output (I/O), and mutable state (references) and show that a broad range of programs written in this language are disentangled. We start by formalizing disentanglement for futures and proving that purely functional programs written in this language are disentangled. We then generalize this result in three directions. First, we consider state (effects) and prove that stateful programs are disentangled if they are race free. Second, we show that race freedom is sufficient but not a necessary condition and non-deterministic programs, e.g. those that use atomic read-modify-operations and some non-deterministic combinators, may also be disentangled. Third, we prove that disentangled task-parallel programs written with futures are free of deadlocks, which arise due to interactions between state and the rich dependencies that can be expressed with futures. Taken together, these results show that disentanglement generalizes to parallel programs with futures and, thus, the benefits of disentanglement may go well beyond fork-join parallelism.
Chapter
We exhibit a strong bisimulation between asynchronous message-passing concurrency with session types and shared-memory concurrency with futures. A key observation is that both arise from closely related interpretations of the semi-axiomatic sequent calculus with recursive definitions, which provides a unifying framework. As a further result we show that the bisimulation applies to both linear and nonlinear versions of the two languages. Keywords: session types, futures, bisimulation.
Preprint
Full-text available
Semisort is a fundamental algorithmic primitive widely used in the design and analysis of efficient parallel algorithms. It takes as input an array of records and a function extracting a key per record, and reorders them so that records with equal keys are contiguous. Since many applications only require collecting equal values, but not fully sorting the input, semisort is broadly applicable, e.g., in string algorithms, graph analytics, and geometry processing, among many other domains. However, despite dozens of recent papers that use semisort in their theoretical analysis and the existence of an asymptotically optimal parallel semisort algorithm, most implementations of these parallel algorithms choose to implement semisort by using comparison or integer sorting in practice, due to potential performance issues in existing semisort implementations. In this paper, we revisit the semisort problem, with the goal of achieving a high-performance parallel semisort implementation with a flexible interface. Our approach easily extends to two related problems, histogram and collect-reduce. Our algorithms achieve strong speedups in practice, and importantly, outperform state-of-the-art parallel sorting and semisorting methods for almost all settings we tested, with varying input sizes, distributions, and key types. We also test two important applications with real-world data, and show that our algorithms improve the performance over existing approaches. We believe that many other parallel algorithm implementations can be accelerated using our results.
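
A sequential miniature of the semisort contract (not the paper's parallel algorithm): hashing keys into buckets makes records with equal keys contiguous without ordering the keys themselves, which is exactly the relaxation semisort exploits.

#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<std::string, int>> records =
        {{"b", 1}, {"a", 2}, {"b", 3}, {"c", 4}, {"a", 5}};
    // Group by key; the hash table imposes no global key order.
    std::unordered_map<std::string, std::vector<int>> buckets;
    for (const auto& [key, val] : records) buckets[key].push_back(val);
    // Emit bucket by bucket: equal keys are contiguous, keys unsorted.
    for (const auto& [key, vals] : buckets)
        for (int v : vals) std::printf("%s:%d ", key.c_str(), v);
    std::printf("\n");
    return 0;
}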
Preprint
Full-text available
Biconnectivity is one of the most fundamental graph problems. The canonical parallel biconnectivity algorithm is the Tarjan-Vishkin algorithm, which has O(n+m) optimal work (number of operations) and polylogarithmic span (longest dependent operations) on a graph with n vertices and m edges. However, Tarjan-Vishkin is not widely used in practice. We believe the reason is its space-inefficiency (it generates an auxiliary graph with O(m) edges). In practice, existing parallel implementations are based on breadth-first search (BFS). Since BFS has span proportional to the diameter of the graph, existing parallel BCC implementations suffer from poor performance on large-diameter graphs and can be even slower than the sequential algorithm on many real-world graphs. We propose the first parallel biconnectivity algorithm (FAST-BCC) that has optimal work, polylogarithmic span, and is space-efficient. Our algorithm first generates a skeleton graph based on any spanning tree of the input graph. Then we use the connectivity information of the skeleton to compute the biconnectivity of the original input. All the steps in our algorithm are highly parallel. We carefully analyze the correctness of our algorithm, which is highly non-trivial. We implemented FAST-BCC and compared it with existing implementations, including GBBS, Slota and Madduri's algorithm, and the sequential Hopcroft-Tarjan algorithm. We ran them on a 96-core machine on 27 graphs, including social, web, road, k-NN, and synthetic graphs, with significantly varying sizes and edge distributions. FAST-BCC is the fastest on all 27 graphs. On average (geometric means), FAST-BCC is 5.1× faster than GBBS, and 3.1× faster than the best existing baseline on each graph.
Conference Paper
In this paper, we study batch parallel algorithms for the dynamic connectivity problem, a fundamental problem that has received considerable attention in the sequential setting. The best sequential algorithm for dynamic connectivity is the elegant level-set algorithm of Holm, de Lichtenberg and Thorup (HDT), which achieves O(log² n) amortized time per edge insertion or deletion, and O(log n) time per query. We design a parallel batch-dynamic connectivity algorithm that is work-efficient with respect to the HDT algorithm for small batch sizes, and is asymptotically faster when the average batch size is sufficiently large. Given a sequence of batched updates, where Δ is the average batch size of all deletions, our algorithm achieves O(log n log(1 + n/Δ)) expected amortized work per edge insertion and deletion and O(log³ n) depth w.h.p. Our algorithm answers a batch of k connectivity queries in O(k log(1 + n/k)) expected work and O(log n) depth w.h.p. To the best of our knowledge, our algorithm is the first parallel batch-dynamic algorithm for connectivity.
Article
We present Pipelite, a dynamic scheduler that exploits the properties of dynamic linear pipelines to achieve high performance for fine-grained workloads. The flexibility of Pipelite allows the stages and their data dependences to be determined at runtime. Pipelite unifies communication, scheduling, and synchronization algorithms with suitable data structures. This unified design introduces the local suspension mechanism and a wait-free enqueue operation, which allow efficient dynamic scheduling. The evaluation on a 44-core machine, using programs from three widely used benchmark suites, shows that Pipelite implies low overhead and significantly outperforms the state of the art in terms of speedup, scalability, and memory usage.
Preprint
In this paper we present two versions of a parallel working-set map on p processors that supports searches, insertions and deletions. In both versions, the total work of all operations when the map has size at least p is bounded by the working-set bound, i.e., the cost of an item depends on how recently it was accessed (for some linearization): accessing an item in the map with recency r takes O(1+log r) work. In the simpler version each map operation has O((log p)^2+log n) span (where n is the maximum size of the map). In the pipelined version each map operation on an item with recency r has O((log p)^2+log r) span. (Operations in parallel may have overlapping span; span is additive only for operations in sequence.) Both data structures are designed to be used by a dynamic multithreading parallel program that at each step executes a unit-time instruction or makes a data structure call. To achieve the stated bounds, the pipelined data structure requires a weak-priority scheduler, which supports a limited form of 2-level prioritization. At the end we explain how the results translate to practical implementations using work-stealing schedulers. To the best of our knowledge, this is the first parallel implementation of a self-adjusting search structure where the cost of an operation adapts to the access sequence. A corollary of the working-set bound is that it achieves work static optimality: the total work is bounded by the access costs in an optimal static search tree.
Conference Paper
Dynamic data race detection is a program analysis technique for detecting errors provoked by undesired interleavings of concurrent threads. A primary challenge when designing efficient race detection algorithms is to achieve manageable space requirements. State of the art algorithms for unstructured parallelism require Θ(n) space per monitored memory location, where n is the total number of tasks. This is a serious drawback when analyzing programs with many tasks. In contrast, algorithms for programs with a series-parallel (SP) structure require only Θ(1) space. Unfortunately, it is currently poorly understood if there are classes of parallelism beyond SP that can also benefit from and be analyzed with Θ(1) space complexity. In the present work, we show that structures richer than SP graphs, namely that of two-dimensional (2D) lattices, can be analyzed in O(1) space: a) we extend Tarjan's algorithm for finding lowest common ancestors to handle 2D lattices; b) from that extension we derive a serial algorithm for race detection that can analyze arbitrary task graphs having a 2D lattice structure; c) we present a restriction to fork-join that admits precisely the 2D lattices as task graphs (e.g., it can express pipeline parallelism). Our work generalizes prior work on race detection, and aims to provide a deeper understanding of the interplay between structured parallelism and program analysis efficiency.
Conference Paper
We present MetaFork, a metalanguage for multithreaded algorithms based on the fork-join concurrency model and targeting multicore architectures. MetaFork is implemented as a source-to-source compilation framework allowing automatic translation of programs from one concurrency platform to another. The current version of this framework supports CilkPlus and OpenMP. We evaluate the benefits of the MetaFork framework through a series of experiments, such as narrowing performance bottlenecks in multithreaded programs. Our experiments show also that, if a native program, written either in CilkPlus or OpenMP, has little parallelism overhead, then the same property holds for its OpenMP or CilkPlus counterpart translated by MetaFork.
Conference Paper
In this paper, we demonstrate the ability of spatial architectures to significantly improve both runtime performance and energy efficiency on edit distance, a broadly used dynamic programming algorithm. Spatial architectures are an emerging class of application accelerators that consist of a network of many small and efficient processing elements that can be exploited by a large domain of applications. In this paper, we utilize the dataflow characteristics and inherent pipeline parallelism within the edit distance algorithm to develop efficient and scalable implementations on a previously proposed spatial accelerator. We evaluate our edit distance implementations using a cycle-accurate performance and physical design model of a previously proposed triggered instruction-based spatial architecture in order to compare against real performance and power measurements on an x86 processor. We show that when chip area is normalized between the two platforms, it is possible to get more than a 50× runtime performance improvement and over 100× reduction in energy consumption compared to an optimized and vectorized x86 implementation. This dramatic improvement comes from leveraging the massive parallelism available in spatial architectures and from the dramatic reduction of expensive memory accesses through conversion to relatively inexpensive local communication.
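
For reference, the edit-distance recurrence whose dataflow the paper exploits, in a plain serial C++ sketch: cell (i, j) depends only on its three neighbors, so all cells on an anti-diagonal (i + j constant) are independent and can fill a pipeline on a spatial fabric.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int edit_distance(const std::string& a, const std::string& b) {
    const size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1));
    for (size_t i = 0; i <= n; i++) d[i][0] = static_cast<int>(i);
    for (size_t j = 0; j <= m; j++) d[0][j] = static_cast<int>(j);
    // Cell (i, j) reads only (i-1, j), (i, j-1), and (i-1, j-1).
    for (size_t i = 1; i <= n; i++)
        for (size_t j = 1; j <= m; j++)
            d[i][j] = std::min({d[i - 1][j] + 1,               // deletion
                                d[i][j - 1] + 1,               // insertion
                                d[i - 1][j - 1] + (a[i - 1] != b[j - 1])});
    return d[n][m];
}

int main() {
    std::printf("%d\n", edit_distance("kitten", "sitting"));  // prints 3
    return 0;
}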
Conference Paper
In this paper an approach to dynamic parallelizing of coarse- and medium-grained programs is proposed, where the parallelization sources are both dataflow analysis and the features given in the program by annotating some of its operators. Program annotation makes it possible to support two additional types of parallel computations which cannot be discovered from the analysis of dataflow dependencies alone. First, there are speculative computations based on anticipating alternative branches of the program's computational process. Second, there are pipeline computations that sometimes may be initialized for operators at the moment when their input data are not yet complete. The system for dynamic parallelizing of programs is implemented in C++/PVM for PC clusters, and experiments with Buchberger's algorithm are presented.
Conference Paper
An approach to dynamic parallelizing of coarse-grained programs is proposed, where the parallelization sources are both dataflow analysis and the features pointed out in the program by annotation. Program annotation makes it possible to support two additional types of parallel computations which cannot be discovered from the analysis of dataflow dependences alone. First, there are speculative computations based on anticipating alternative branches of the program's computational process. Second, there are pipeline computations that sometimes may be initialised for operators at the moment when their input data are not yet complete. Automated program analysis of this type of concurrency is either very hard or it generates a lot of surplus computation, thus absorbing the effect of program parallelization. The implementation of the system of dynamic program parallelization for clusters of PCs and results of some experiments performed on it are described.
Article
Full-text available
We present a randomized strategy for maintaining balance in dynamically changing search trees that has optimal expected behavior. In particular, in the expected case a search or an update takes logarithmic time, with the update requiring fewer than two rotations. Moreover, the update time remains logarithmic, even if the cost of a rotation is taken to be proportional to the size of the rotated subtree. Finger searches and splits and joins can be performed in optimal expected time also. We show that these results continue to hold even if very little true randomness is available, i.e., if only a logarithmic number of truly random bits are available. Our approach generalizes naturally to weighted trees, where the expected time bounds for accesses and updates again match the worst-case time bounds of the best deterministic methods. We also discuss ways of implementing our randomized strategy so that no explicit balance information is maintained. Our balancing strategy and our algorithms are exceedingly simple and should be fast in practice.
Conference Paper
Full-text available
The purpose of this paper is to describe a sorting network of size O(n log n) and depth O(log n). A natural way of sorting is through consecutive halvings: determine the upper and lower halves of the set, proceed similarly within the halves, and so on. Unfortunately, while one can halve a set using only O(n) comparisons, this cannot be done in less than log n (parallel) time, and it is known that a halving network needs (1/2)n log n comparisons. It is possible, however, to construct a network of O(n) comparisons which halves in constant time with high accuracy. This procedure (ε-halving) and a derived procedure (ε-nearsort) are described below, and our sorting network will be centered around these elementary steps.
Conference Paper
Full-text available
Our model of computation is a parallel computer with k synchronized processors P1, ..., Pk sharing a common random access storage, where simultaneous access to the same storage location by two or more processors is not allowed. Suppose a 2-3 tree T with n leaves is implemented in the storage, suppose a1, ..., ak are data that may or may not be stored in the leaves, and for all i, 1 ≤ i ≤ k, processor Pi knows ai. We show how to search for a1, ..., ak in the tree T, how to insert these data into the tree, and how to delete them from the tree in O(log n + log k) steps.
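A much-simplified illustration of the setting (my sketch, not the paper's algorithm): k independent lookups dispatched in parallel, with a read-only sorted array standing in for the 2-3 tree. The paper's actual contribution is handling parallel insertions and deletions, which require coordinated restructuring of the tree and are far less trivial.

```python
# Sketch: processor P_i searches for its own key a_i; lookups are
# independent, so they parallelize trivially. (Inserts/deletes do not.)
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

def parallel_search(sorted_leaves, keys, workers=4):
    def find(key):
        i = bisect_left(sorted_leaves, key)
        return i < len(sorted_leaves) and sorted_leaves[i] == key
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(find, keys))

print(parallel_search([1, 3, 5, 7, 9], [3, 4, 9]))  # [True, False, True]
```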
Article
Full-text available
Effectively using shared-memory multiprocessors requires substantial programming effort. We present the programming language COOL (Concurrent Object-Oriented Language), which was designed to exploit coarse-grained parallelism at the task level in shared-memory multiprocessors. COOL's primary design goals are efficiency and expressiveness. By efficiency we mean that the language constructs should be efficient to implement and a program should not have to pay for features it does not use. By expressiveness, we mean that the language should flexibly support different concurrency patterns, thereby allowing various decompositions of a problem. COOL emphasizes the integration of concurrency and synchronization with data abstraction to ease the task of creating modular and efficient parallel programs. It is an extension of C++, which was chosen because it supports abstract data type definitions and is widely used.
Article
Full-text available
Recent work on scheduling algorithms has resulted in provable bounds on the space taken by parallel computations in relation to the space taken by sequential computations. The results for online versions of these algorithms, however, have been limited to computations in which threads can only synchronize with ancestor or sibling threads. Such computations do not include languages with futures or user-specified synchronization constraints. Here we extend the results to languages with synchronization variables. Such languages include languages with futures, such as Multilisp and COOL, as well as other languages such as Id. The main result is an online scheduling algorithm which, given a computation with w work (total operations), σ synchronizations, d depth (critical path) and s1 sequential space, will run in O(w/p + σ·log(p·d)/p + d·log(p·d)) time and s1 + O(p·d·log(p·d)) space, on a p-processor CRCW PRAM with a fetch-and-add primitive. This includes all time and space costs for both t...
Article
Full-text available
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any computation with w units of work and critical path length d, and for any sequential schedule that takes space s1, we provide a parallel schedule that takes fewer than w/p + d steps on p processors and requires less than s1 + p·d space. This matches the lower bound that we show, and significantly improves upon the best previous bound of s1·p space for the common case where d ≪ s1. The paper then describes a scheduler for implementing high-level languages with nested parallelism, that generates schedules in this class. During program execution, as the structure of the computation is revealed, the scheduler keeps track of the active tasks, allocates the tasks to the processors, and performs the necessary task synchronization. The scheduler is itself a parallel algorithm, and incurs at most a constant factor overhead in time and space, even when the scheduling granularity is individual units of work. The algorithm is the first efficient solution to the scheduling problem discussed here, even if space considerations are ignored.
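To see the strength of the space bound, an illustrative instantiation (my arithmetic, not an example from the paper):

```latex
% Take w = 10^9 work, critical path d = 10^4, sequential space
% s_1 = 10^6, and p = 100 processors. The schedule class gives
T_p < \frac{w}{p} + d = 10^7 + 10^4 \text{ steps},
\qquad
S_p < s_1 + p \cdot d = 10^6 + 10^6 = 2 \times 10^6,
% versus the previous space bound s_1 \cdot p = 10^8: a 50x
% improvement in this instance, since d \ll s_1.
```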
Article
Linear logic has been proposed as one solution to the problem of garbage collection and providing efficient "update-in-place" capabilities within a more functional language. Linear logic conserves accessibility, and hence provides a mechanical metaphor which is more appropriate for a distributed-memory parallel processor in which copying is explicit. However, linear logic's lack of sharing may introduce significant inefficiencies of its own. We show an efficient implementation of linear logic called Linear Lisp that runs within a constant factor of non-linear logic. This Linear Lisp allows RPLACX operations, and manages storage as safely as a non-linear Lisp, but does not need a garbage collector. Since it offers assignments but no sharing, it occupies a twilight zone between functional languages and imperative languages. Our Linear Lisp Machine offers many of the same capabilities as combinator/graph reduction machines, but without their copying and garbage collection problems.
Article
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Utilizing a new graph-theoretic model of multithreaded computation, execution efficiency is quantified by three important measures: T1 is the time required for executing the computation on 1 processor, T∞ is the time required by an infinite number of processors, and S1 is the space required to execute the computation on 1 processor. A computation executed on P processors is time-efficient if the time is O(T1/P + T∞), that is, it achieves linear speedup when P = O(T1/T∞), and it is space-efficient if it uses O(S1·P) total space, that is, the space per processor is within a constant factor of that required for a 1-processor execution. The first result derived from this model shows that there exist multithreaded computations such that no execution schedule can simultaneously achieve efficient time and efficient space. But by restricting attention to "strict" computations – those in which all arguments to a procedure must be available before the procedure can be invoked – much more positive results are obtainable. Specifically, for any strict multithreaded computation, a simple online algorithm can compute a schedule that is both time-efficient and space-efficient. Unfortunately, because the algorithm uses a global queue, the overhead of computing the schedule can be substantial. This problem is overcome by a decentralized algorithm that can compute and execute a P-processor schedule online in expected time O(T1/P + T∞·lg P) and worst-case space O(S1·P·lg P), including overhead costs.
Article
Mul-T is a parallel Lisp system, based on Multilisp's future construct, that has been developed to run on an Encore Multimax multiprocessor. Mul-T is an extended version of the Yale T system and uses the T system's ORBIT compiler to achieve “production quality” performance on stock hardware — about 100 times faster than Multilisp. Mul-T shows that futures can be implemented cheaply enough to be useful in a production-quality system. Mul-T is fully operational, including a user interface that supports managing groups of parallel tasks.
Article
We study Girard's linear logic from the point of view of giving a concrete computational interpretation of the logic, based on the Curry-Howard isomorphism. In the case of intuitionistic linear logic, this leads to a refinement of the lambda calculus, giving finer control over order of evaluation and storage allocation, while maintaining the logical content of programs as proofs, and computation as cut-elimination. In the classical case, it leads to a concurrent process paradigm with an operational semantics in the style of Berry and Boudol's chemical abstract machine. This opens up a promising new approach to the parallel implementation of functional programming languages, and offers the prospect of typed concurrent programming in which correctness is guaranteed by the typing.
Conference Paper
In an earlier paper[RRH92], we presented a new technique for the SPMD parallelization of programs that use dynamic data structures. Our approach is based on migrating the thread of computation to the processor that owns the data, and annotating the program with futures to introduce parallelism. We have implemented this approach for the Intel iPSC/860. This paper reports on our implementation, called Olden, and presents some early performance results from a series of non-trivial benchmarks.
Conference Paper
The PRAM model is a machine that comprises p processors and m memory cells; each processor can access each memory cell in constant time. The PRAM has proved a popular model for parallel algorithm design, for the task of designing efficient, highly parallel algorithms is quite difficult in general. The PRAM model provides an abstraction that strips away problems of synchronization, reliability and communication delays, thereby permitting algorithm designers to focus first and foremost on the structure of the computational problem at hand, rather than the architecture of a currently available machine. As a consequence, a considerable body of PRAM algorithms has been discovered in the past several years, and a number of powerful techniques for designing such algorithms have been identified (see, for instance, the survey articles [KR88, EG88]).
Article
This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. We develop a formal graph-theoretic model of multithreaded computation and give three important measures of time and space: T1 is the time required for executing the computation on 1 processor, T∞ is the time required by an infinite number of processors, and S1 is the space required to execute the computation on 1 processor. We consider a computation executed on P processors to be time-efficient if the time is O(T1/P + T∞), that is, it achieves linear speedup when P = O(T1/T∞). We consider a P-processor execution of the computation to be space-efficient if it uses O(S1·P) total space, that is, the space per processor is within a constant factor of the space required for a 1-processor execution. We show that there exist multithreaded computations such that no execution schedule c...
Article
It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4·log2 n + 10(n − 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed.
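An illustrative instantiation of the stated bound (my arithmetic): evaluating an expression with n = 1024 operands on p = 64 processors gives

```latex
4\log_2 n + \frac{10(n-1)}{p}
  \;=\; 4 \cdot 10 + \frac{10 \cdot 1023}{64}
  \;\approx\; 40 + 159.8
  \;\approx\; 200 \text{ steps},
```

compared with roughly n − 1 = 1023 steps for a sequential left-to-right evaluation.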
Article
The subject of this chapter is the design and analysis of parallel algorithms. Most of today’s algorithms are sequential, that is, they specify a sequence of steps in which each step consists of a single operation. These algorithms are well suited to today’s computers, which basically perform operations in a sequential fashion. Although the speed at which sequential computers operate has
Article
The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model, yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
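For orientation, the per-superstep cost accounting commonly used in the BSP literature (a standard formulation; presentations vary, and this is not quoted from the article itself):

```latex
% Cost of one BSP superstep: local computation, then an h-relation
% of communication, then a barrier synchronization.
T_{\text{superstep}} \;=\; w \;+\; g \cdot h \;+\; l
```

where w is the maximum local work on any processor, h is the maximum number of messages any one processor sends or receives, g is the per-message bandwidth cost, and l is the barrier latency.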
Article
Multilisp is a version of the Lisp dialect Scheme extended with constructs for parallel execution. Like Scheme, Multilisp is oriented toward symbolic computation. Unlike some parallel programming languages, Multilisp incorporates constructs for causing side effects and for explicitly introducing parallelism. The potential complexity of dealing with side effects in a parallel context is mitigated by the nature of the parallelism constructs and by support for abstract data types: a recommended Multilisp programming style is presented which, if followed, should lead to highly parallel, easily understandable programs. Multilisp is being implemented on the 32-processor Concert multiprocessor; however, it is ultimately intended for use on larger multiprocessors. The current implementation, called Concert Multilisp, is complete enough to run the Multilisp compiler itself and has been run on Concert prototypes including up to eight processors. Concert Multilisp uses novel techniques for task scheduling and garbage collection. The task scheduler helps control excessive resource utilization by means of an unfair scheduling policy; the garbage collector uses a multiprocessor algorithm based on the incremental garbage collector of Baker.
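A rough analogue of Multilisp's (future e) construct using Python's concurrent.futures (my sketch; Python threads illustrate the semantics of futures, not Multilisp's implementation or true parallel speedup):

```python
# (future e) returns a placeholder immediately; the producer runs in
# parallel with the consumer, which blocks only when it touches the value.
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

with ThreadPoolExecutor() as pool:
    fut = pool.submit(fib, 25)        # like (future (fib 25))
    other = sum(range(10**6))         # proceeds without waiting
    print(other, fut.result())        # .result() "touches" the future
```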
Article
Linear Logic [6] provides a refinement of functional programming and suggests a new implementation technique with the following features: a synthesis of strict and lazy evaluation, a clean semantics of side effects, and no garbage collector.
Article
This paper investigates some problems associated with an argument evaluation order that we call "future" order, which is different from both call-by-name and call-by-value. In call-by-future, each formal parameter of a function is bound to a separate process (called a "future") dedicated to the evaluation of the corresponding argument. This mechanism allows the fully parallel evaluation of arguments to a function, and has been shown to augment the expressive power of a language. We discuss an approach to a problem that arises in this context: futures which were thought to be relevant when they were created become irrelevant through being ignored in the body of the expression where they were bound. The problem of irrelevant processes also appears in multiprocessing problem-solving systems which start several processors working on the same problem but with different methods, and return with the solution which finishes first. This parallel method strategy has the drawback that the processes which are investigating the losing methods must be identified, stopped, and re-assigned to more useful tasks. The solution we propose is that of garbage collection. We propose that the goal structure of the solution plan be explicitly represented in memory as part of the graph memory (like Lisp's heap) so that a garbage collection algorithm can discover which processes are performing useful work, and which can be recycled for a new task. An incremental algorithm for the unified garbage collection of storage and processes is described.
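The "irrelevant process" problem sketched in executable form (my illustration; Python offers no garbage collection of running tasks, so explicit best-effort cancellation stands in for the process GC the paper proposes):

```python
# Several methods race on one problem; once a winner finishes, the
# losers are doing useless work and should be reclaimed.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def race(methods, problem):
    with ThreadPoolExecutor(max_workers=len(methods)) as pool:
        futures = [pool.submit(m, problem) for m in methods]
        done, losers = wait(futures, return_when=FIRST_COMPLETED)
        for f in losers:
            f.cancel()   # best effort: cannot stop already-running tasks,
        return next(iter(done)).result()  # which is the paper's point
```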
Conference Paper
We present techniques for parallel divide-and-conquer, resulting in improved parallel algorithms for a number of problems. The problems for which we give improved algorithms include intersection detection, trapezoidal decomposition (hence, polygon triangulation), and planar point location (hence, Voronoi diagram construction). We also give efficient parallel algorithms for fractional cascading, 3-dimensional maxima, 2-set dominance counting, and visibility from a point. All of our algorithms run in O(log n) time with either a linear or sub-linear number of processors in the CREW PRAM model.
Conference Paper
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time TP to execute a fully strict computation on P processors using our work-stealing scheduler is TP = O(T1/P + T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors. Moreover, the space SP required by the execution satisfies SP ≤ S1·P. We also show that the expected total communication of the algorithm is at most O(T∞·Smax·P), where Smax is the size of the largest activation record of any thread, thereby justifying the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
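A toy, sequential simulation of the work-stealing discipline analyzed here (my sketch, not the authors' scheduler): each worker owns a deque, takes work from its bottom, and steals from the top of a random victim when idle.

```python
# Sketch: round-based simulation of work stealing over task deques.
import random
from collections import deque

def work_steal_sim(tasks, p=4):
    deques = [deque() for _ in range(p)]
    for i, t in enumerate(tasks):
        deques[i % p].append(t)          # seed work round-robin
    done = []
    while any(deques):
        for w in range(p):
            if deques[w]:
                done.append(deques[w].pop())       # bottom: own work
            else:
                victim = random.randrange(p)
                if victim != w and deques[victim]:
                    deques[w].append(deques[victim].popleft())  # steal top
    return done

print(len(work_steal_sim(list(range(100)))))  # 100: all tasks executed
```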
Article
Early results of a project on compiling stylized recursion into stackless iterative code are reviewed as they apply to a target environment with multiprocessing. Parallelism is possible in executing the compiled image of argument evaluation (collateral argument evaluation of Algol 68), of data structure construction when suspensions are used, and of functional combinations. The last facility provides general, concise expression for all operations performed in Lisp by mapping functions and in APL by typed operators; there are other uses as well.
Article
A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. It is argued that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. The author describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. These all run on an EREW (exclusive read, exclusive write) PRAM with the addition of the two scan primitives and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine system and are available in PARIS (the parallel instruction set of the machine).
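To illustrate how a scan primitive simplifies algorithm design, here is the one-bit "split" step of a radix sort expressed with an exclusive plus-scan (my sketch in sequential Python; on the machines discussed, each scan would be a unit-time primitive):

```python
# Exclusive plus-scan, then stable routing of keys by one bit:
# two scans compute every element's destination index directly.
def exclusive_scan(xs):
    total, out = 0, []
    for x in xs:
        out.append(total)
        total += x
    return out

def split_by_bit(keys, bit):
    flags = [(k >> bit) & 1 for k in keys]
    zero_pos = exclusive_scan([1 - f for f in flags])  # slots for 0-keys
    one_pos = exclusive_scan(flags)                    # offsets for 1-keys
    n_zeros = zero_pos[-1] + (1 - flags[-1]) if keys else 0
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        out[zero_pos[i] if flags[i] == 0 else n_zeros + one_pos[i]] = k
    return out

print(split_by_bit([5, 2, 7, 4], bit=0))  # [2, 4, 5, 7] (stable)
```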
Article
Past attempts to apply Girard's linear logic have either had a clear relation to the theory (Lafont, Holmström, Abramsky) or a clear practical value (Guzmán and Hudak, Wadler), but not both. This paper defines a sequence of languages based on linear logic that span the gap between theory and practice. Type reconstruction in a linear type system can derive information about sharing. An approach to linear type reconstruction based on use types is presented. Applications to the array update problem are considered.
Article
Speculative evaluation, including leniency and futures, is often used to produce high degrees of parallelism. Existing speculative implementations, however, may serialize computation because of their implementation of queues of suspended threads. We give a provably efficient parallel implementation of a speculative functional language on various machine models. The implementation includes proper parallelization of the necessary queuing operations on suspended threads. Our target machine models are a butterfly network, hypercube, and PRAM. To prove the efficiency of our implementation, we provide a cost model using a profiling semantics and relate the cost model to implementations on the parallel machine models.
Article
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is T1/P + O(T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most S1·P, where S1 is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most O(P·T∞·(1 + nd)·Smax), where Smax is the size of the largest activation record of any thread and nd is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
Article
We present parallel algorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps [26]). For two sets of size n and m (m ≤ n) the algorithms run in expected O(m·lg(n/m)) work and O(lg n) depth (parallel time) on an EREW PRAM with scan operations (implying O(lg² n) depth on a plain EREW PRAM). As with the sequential algorithms on treaps for insertion and deletion, the main advantage of our algorithms is their simplicity. In fact, our algorithms for set operations seem simpler than previous sequential algorithms with the same work bounds, and might therefore also be useful in a sequential context. To analyze the effectiveness of the algorithms we implemented both sequential and parallel versions of the algorithms and ran several experiments on them. Our parallel implementation uses the Cilk [5] shared memory runtime system on a 16 processor SGI Power Challenge and a 6 processor Sun Ultra Enterprise 3000. It shows reasonable speedup: 6.3 ...
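The skeleton of such a union, sequentially and much simplified (my sketch, reusing the Node class from the treap sketch above; duplicate keys are not handled, and the paper runs the two recursive union calls in parallel, which is where the O(lg n) depth comes from):

```python
# Union of two treaps: split one by the other's root key, keep the
# higher-priority root, recurse on both sides (sketch only).
def split(root, key):
    """Return (keys < key, keys >= key) as two treaps."""
    if root is None:
        return None, None
    if root.key < key:
        less, geq = split(root.right, key)
        root.right = less
        return root, geq
    less, geq = split(root.left, key)
    root.left = geq
    return less, root

def union(a, b):
    if a is None:
        return b
    if b is None:
        return a
    if a.priority < b.priority:
        a, b = b, a                     # keep the higher-priority root
    b_less, b_geq = split(b, a.key)
    a.left = union(a.left, b_less)      # these two calls are the ones
    a.right = union(a.right, b_geq)     # done in parallel in the paper
    return a
```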
Article
In this paper we prove time and space bounds for the implementation of the programming language NESL on various parallel machine models. NESL is a sugared typed λ-calculus with a set of array primitives and an explicit parallel map over arrays. Our results extend previous work on provable implementation bounds for functional languages by considering space and by including arrays. For modeling the cost of NESL we augment a standard call-by-value operational semantics to return two cost measures: a DAG representing the sequential dependences in the computation, and a measure of the space taken by a sequential implementation. We show that a NESL program with w work (nodes in the DAG), d depth (levels in the DAG), and s sequential space can be implemented on a p-processor butterfly network, hypercube, or CRCW PRAM using O(w/p + d·log p) time and O(s + d·p·log p) reachable space. For programs with sufficient parallelism these bounds are optimal in that they give linear speedup and use spac...
Article
We develop formal methods for reasoning about memory usage at a level of abstraction suitable for establishing or refuting claims about the potential applications of linear logic for static analysis. In particular, we demonstrate a precise relationship between type correctness for a language based on linear logic and the correctness of a reference-counting interpretation of the primitives that the language draws from the rules for the 'of course' operation. Our semantics is 'low-level' enough to express sharing and copying while still being 'high-level' enough to abstract away from details of memory layout. This enables the formulation and proof of a result describing the possible run-time reference counts of values of linear type.
Space-efficient scheduling of parallelism with synchronization variables
  • G Blelloch
  • P Gibbons
  • Y Matias
  • G Narlikar
G. Blelloch, P. Gibbons, Y. Matias, and G. Narlikar. Space-efficient scheduling of parallelism with synchronization variables. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 12-23, June 1997.
Mul-T: a high-performance parallel Lisp
  • D A Kranz
  • R H Halstead
  • E Mohr
D. A. Kranz, R. H. Halstead, Jr., and E. Mohr. Mul-T: a high-performance parallel Lisp. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 81-90, 1989.
Parallel dictionaries on 2-3 trees
  • W Paul
  • U Vishkin
  • H Wagener
W. Paul, U. Vishkin, and H. Wagener. Parallel dictionaries on 2-3 trees. In Proceedings of the 10th Colloquium on Automata, Languages and Programming, pages 597-609. Lecture Notes in Computer Science, vol. 143. Springer-Verlag, Berlin, July 1983.