Parallel Processing Letters

Published by World Scientific

Online ISSN: 1793-642X · Print ISSN: 0129-6264

Articles


Figure 4: Partitioning of distance intervals for K = 4, K′ = 3, d<sub>T</sub> = 2. 
Figure 5: A step of the algorithm, consisting of K = 4 rounds. Here messages are represented as small balls. Notice that messages in the same cell are at the same distance from the sink, but they can be in different vertices. 
Hardness and approximation of Gathering in static radio networks

April 2006


In this paper, we address the problem of gathering information in a central node of a radio network, where interference constraints are present. We take into account the fact that, when a node transmits, it produces interference in an area bigger than the area in which its message can actually be received. The network is modeled by a graph; a node is able to transmit one unit of information to the set of vertices at distance at most d<sub>T</sub> in the graph, but when doing so it generates interference that does not allow nodes at distance up to d<sub>I</sub> (d<sub>I</sub> ≥ d<sub>T</sub>) to listen to other transmissions. Time is synchronous and divided into time-steps in each of which a round (set of non-interfering radio transmissions) is performed. We give a general lower bound on the number of rounds required to gather on any graph, and present an algorithm working on any graph, with an approximation factor of 4. We also show that the problem of finding an optimal strategy for gathering (one that uses a minimum number of time-steps) does not admit a fully polynomial time approximation scheme if d<sub>I</sub> > d<sub>T</sub>, unless P=NP, and in the case d<sub>I</sub> = d<sub>T</sub> the problem is NP-hard.
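The interference model in this abstract lends itself to a small executable check. The sketch below (hypothetical helper names, not the authors' algorithm) verifies that a proposed round of (sender, receiver) transmissions respects both the transmission radius d_T and the interference radius d_I:

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted graph (adjacency dict)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def valid_round(adj, transmissions, dT, dI):
    """Check that a set of (sender, receiver) pairs is a non-interfering
    round: each receiver is within dT of its sender, and no *other*
    sender is within dI of it."""
    dists = {s: bfs_dist(adj, s) for s, _ in transmissions}
    for s, r in transmissions:
        if dists[s].get(r, float("inf")) > dT:
            return False          # receiver out of transmission range
        for s2, _ in transmissions:
            if s2 != s and dists[s2].get(r, float("inf")) <= dI:
                return False      # another sender interferes at r
    return True

# Path graph 0-1-2-3-4-5, with d_T = 1 and d_I = 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(valid_round(adj, [(1, 0), (5, 4)], 1, 2))  # True: senders far apart
print(valid_round(adj, [(1, 0), (3, 2)], 1, 2))  # False: sender 1 interferes at receiver 2
```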

Study of the medium message performance of BIP/Myrinet

February 2000


L. Prylli and B. Tourancheau (1998) presented the BIP low-level communication protocols for Myrinet programmable NICs. On one side, BIP has proven that very small latency and very large throughput can be delivered, surpassing other communication layers for Myrinet. On the other side, hardware implementations like Giganet or SCI show very good performance for medium-size messages. Our aim in this paper is to study how programmable NICs compare with hardware solutions for medium-size messages. We constructed a very precise model of the pipeline behavior directed by the NIC program in order to identify the bottlenecks. The results show that pipeline quality is very important. We develop an adaptive strategy as described by K. Yocum et al. (1998) and introduce it in BIP. Experiments show up to a twofold gain over the previous BIP, reaching performance above that of the hardware solutions.

An improved parallel algorithm for Delaunay triangulation on distributed memory parallel computers
Delaunay triangulation has been much used in such applications as volume rendering, shape representation, terrain modeling and so on. The main disadvantage of Delaunay triangulation is the large computation time required to obtain the triangulation of an input point set. This time can be reduced by using more than one processor, and several parallel algorithms for Delaunay triangulation have been proposed. In this paper, we propose an improved parallel algorithm for Delaunay triangulation, which partitions the bounding convex region of the input point set into a number of regions by using Delaunay edges and generates Delaunay triangles in each region by applying an incremental construction approach. Partitioning by Delaunay edges makes it possible to eliminate the merging step required for integrating subresults. It is shown from the experiments that the proposed algorithm has good load balance and is more efficient than Cignoni et al.'s algorithm (1993) and our previous algorithm (1996).
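Incremental Delaunay construction, as referenced in the abstract, hinges on the in-circle predicate: a triangle stays Delaunay only if no other point lies inside its circumcircle. A minimal sketch of that predicate using the standard lifted 3×3 determinant (illustrative, not the paper's code):

```python
def incircle(ax, ay, bx, by, cx, cy, dx, dy):
    """Return > 0 if (dx, dy) lies inside the circumcircle of the
    counter-clockwise triangle (a, b, c); < 0 if outside; 0 on the circle.
    Sign of the standard lifted determinant used in incremental Delaunay."""
    adx, ady = ax - dx, ay - dy
    bdx, bdy = bx - dx, by - dy
    cdx, cdy = cx - dx, cy - dy
    ad = adx * adx + ady * ady
    bd = bdx * bdx + bdy * bdy
    cd = cdx * cdx + cdy * cdy
    return (adx * (bdy * cd - bd * cdy)
            - ady * (bdx * cd - bd * cdx)
            + ad * (bdx * cdy - bdy * cdx))

# Triangle (1,0), (0,1), (-1,0) is CCW; its circumcircle is the unit circle.
print(incircle(1, 0, 0, 1, -1, 0, 0, 0) > 0)   # center (0,0) is inside: True
print(incircle(1, 0, 0, 1, -1, 0, 5, 5) < 0)   # (5,5) is outside: True
```

In a robust implementation this test is done with exact or adaptive-precision arithmetic; the float version here is only a sketch.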

OPIOM: off-processor IO with Myrinet

February 2001


As processors become more powerful and clusters larger, users will exploit this increased power to progressively run larger and larger problems. Today's datasets in biology, physics or multimedia applications are huge and require high performance storage sub-systems. As a result, the hot spot of cluster computing is gradually moving from high performance computing to high performance IO. The solutions proposed by the parallel file-system community try to improve performance by working at the kernel level to enhance the regular IO design or by using a dedicated storage area network like Fiber Channel. We propose a new design to merge the communication network and the storage network at the best price. We have implemented it in OPIOM with the Myrinet interconnect. OPIOM moves data asynchronously from SCSI disks to the embedded memory of a Myrinet interface in order to send it to a remote node. This design presents attractive features: high performance and extremely low host overhead.

Figure 1: Parallel efficiency of various processes on the XMT. For all runs, take l = 2k.
Figure 2: Parallel efficiency of the overall algorithm on the XMT. For all runs, take l = 2k.
Parallel Implementation of Fast Randomized Algorithms for Low Rank Matrix Decomposition

May 2012


We analyze the parallel performance of randomized interpolative decomposition by decomposing low-rank complex-valued Gaussian random matrices larger than 100 GB. We chose a Cray XMT supercomputer as it provides an almost ideal PRAM model, permitting quick investigation of parallel algorithms without obfuscation from hardware idiosyncrasies. We find that performance on non-square matrices is very good, with overall runtime more than 70 times faster on 128 processors. We also verify that numerically discovered error bounds still hold on matrices two orders of magnitude larger than those previously tested.

Iso-Quality of Service: Fairly Ranking Servers for Real-Time Data Analytics

January 2015


We present a mathematically rigorous metric which relates the achievable quality of service (QoS) of a real-time analytics service to the server energy cost of offering the service. Using a new iso-QoS evaluation methodology, we scale server resources to meet QoS targets and directly rank the servers in terms of their energy efficiency and, by extension, cost of ownership. Our metric and method are platform-independent and enable fair comparison of datacenter compute servers with significant architectural diversity, including micro-servers. We deploy our metric and methodology to compare three servers running financial option pricing workloads on real-life market data. We find that server ranking is sensitive to data inputs and the desired QoS level, and that although scale-out micro-servers can be up to two times more energy-efficient than conventional heavyweight servers for the same target QoS, they are still six times less energy-efficient than high-performance computational accelerators.
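The ranking idea can be illustrated with a toy calculation: once each server has been scaled to just meet the common QoS target, energy per request (power divided by throughput) gives a platform-independent ordering. The server names and numbers below are invented for illustration, not the paper's measurements:

```python
def rank_by_iso_qos(servers, qos_target):
    """Rank servers by energy per request at a common QoS target.
    Each server reports, for the configuration that just meets the target,
    its power draw (W) and throughput (requests/s). Lower joules/request
    means more energy-efficient. All values here are hypothetical."""
    ranked = []
    for name, power_w, throughput in servers:
        energy_per_req = power_w / throughput   # joules per request
        ranked.append((energy_per_req, name))
    return [name for _, name in sorted(ranked)]

servers = [
    ("heavyweight-xeon", 400.0, 10000.0),   # 0.0400 J/request
    ("micro-server",      45.0,  2200.0),   # ~0.0205 J/request
    ("accelerator",      250.0, 70000.0),   # ~0.0036 J/request
]
print(rank_by_iso_qos(servers, qos_target="p99 < 10 ms"))
# -> ['accelerator', 'micro-server', 'heavyweight-xeon']
```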

Array dataflow analysis for explicitly parallel programs

January 2006


This paper describes a dataflow analysis of array data structures for data-parallel and/or control- (or task-) parallel imperative languages. This analysis departs from previous work because it 1) simultaneously handles both parallel programming paradigms, and 2) does not rely on the usual iterative solving process of a set of data flow equations but extends array dataflow analysis based on integer linear programming, thus improving the precision of results.

Figure 6: Influence of pipeline looseness on the socket (open symbols) and node level (filled symbols). The case d u − d l = 0 represents a rigid "lockstep", which is obviously hazardous. Data was taken on Nehalem EP, but the general characteristic is very similar on all architectures.
Figure 7: Multi-layer halo communication. Each halo is transmitted consecutively along the three coordinate directions, avoiding direct communication across edges and corners [13].
Figure 8: Theoretical multi-layer halo advantage versus linear subdomain size L for different halo widths h. Parameters are set for a vector-mode hybrid Jacobi solver on a QDR-IB network and a per-node performance of 2000 MLUP/s (see text for details). Inset: Ratio of computation versus overall time ("computational efficiency") for the corner cases h = 2 and h = 32.
Figure 9: Distributed-memory parallel performance (strong scaling) of the standard and the multi-halo pipelined Jacobi solvers with relaxed synchronization, at a problem size of 600³.
Figure 10: Distributed-memory parallel performance (weak scaling) of the standard and the multi-halo pipelined Jacobi solvers with relaxed synchronization, at a problem size of 600³ per process.
Leveraging shared caches for parallel temporal blocking of stencil codes on multicore processors and clusters

June 2010


Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. Benchmark results are presented for three current x86-based microprocessors, showing clearly that our optimization works best on designs with high-speed shared caches and low memory bandwidth per core. We furthermore demonstrate that simple bandwidth-based performance models are inaccurate for this kind of algorithm and employ a more elaborate, synthetic modeling procedure. Finally we show that temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment, albeit with limited benefit at strong scaling.
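The temporal-blocking idea — performing several time steps per pass over the data at the price of some redundant halo computation — can be sketched in one dimension. This is an illustrative variant with halo width equal to the number of blocked steps, not the paper's pipelined multicore scheme:

```python
def jacobi_naive(u, steps):
    """Reference solver: one full sweep per time step, ends held fixed."""
    u = list(u)
    for _ in range(steps):
        u = [u[0]] + [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)] + [u[-1]]
    return u

def _advance_slice(u, lo, hi, steps, n):
    """Advance the slice u[lo:hi] by `steps` sweeps. Cells at the physical
    boundary stay fixed; cells near a slice edge go stale, losing one cell
    of validity per step, which the caller absorbs in the halo."""
    t = list(u[lo:hi])
    for _ in range(steps):
        new = list(t)
        for i in range(len(t)):
            g = lo + i                       # global index of local cell i
            if 0 < g < n - 1 and 0 < i < len(t) - 1:
                new[i] = 0.5 * (t[i - 1] + t[i + 1])
        t = new
    return t

def jacobi_blocked(u, steps, block):
    """Temporal blocking: each spatial block is advanced `steps` time steps
    in one pass over a slice extended by a halo of width `steps`, so every
    interior result equals the global sweep (domain-of-dependence argument)."""
    n = len(u)
    out = list(u)
    for s in range(1, n - 1, block):
        e = min(s + block, n - 1)            # block covers interior cells [s, e)
        lo, hi = max(0, s - steps), min(n, e + steps)
        tile = _advance_slice(u, lo, hi, steps, n)
        out[s:e] = tile[s - lo:e - lo]
    return out

u0 = [float(i * i % 7) for i in range(20)]
assert jacobi_blocked(u0, steps=3, block=4) == jacobi_naive(u0, 3)
```

The payoff in practice is that the slice (block plus halo) fits in cache, so `steps` updates touch main memory only once.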

Unifying look at semigroup computations on meshes with multiple broadcasting

January 1970


Semigroup computations are a fundamental algorithmic tool finding applications in all areas of parallel processing. Given a sequence of m items a<sub>1</sub>, a<sub>2</sub>, ..., a<sub>m</sub> from a semigroup S with an associative operation ⊕, the semigroup computation problem asks for a<sub>1</sub> ⊕ a<sub>2</sub> ⊕ ... ⊕ a<sub>m</sub>; we study it on a mesh with multiple broadcasting of size √n × √n. Our contribution is to present the first lower bound and the first time-optimal algorithm which apply to the entire range of m (2 ≤ m ≤ n). First, we establish a lower bound of Ω(max{min{log m, log(n<sup>2/3</sup>/m<sup>1/3</sup>)}, m<sup>1/3</sup>/n<sup>1/6</sup>}) time. Second, we show that our bound is tight by designing an algorithm whose running time matches the lower bound. These results unify and generalize all semigroup lower bounds and algorithms known to the authors.

Calculating Voronoi Diagrams Using Simple Chemical Reactions

February 2014


This paper overviews work on the use of simple chemical reactions to calculate Voronoi diagrams and undertake other related geometric calculations. This work highlights that this type of specialised chemical processor is a model example of a parallel processor. For example, increasing the complexity of the input data within a given area does not increase the computation time. These processors are also able to calculate two or more Voronoi diagrams in parallel. Due to the specific chemical reactions involved and the relative strength of reaction with the substrate (and cross-reactivity with the products), these processors are also capable of calculating Voronoi diagrams sequentially from distinct chemical inputs. The chemical processors are capable of calculating a range of generalised Voronoi diagrams (either from circular drops of chemical or other geometric shapes made from adsorbent substrates soaked in reagent), skeletonisation of planar shapes, and weighted Voronoi diagrams (e.g. additively weighted Voronoi diagrams and multiplicatively weighted crystal-growth Voronoi diagrams). The paper also discusses some limitations of these processors. These chemical processors constitute a class of pattern-forming reactions which have parallels with those observed in natural systems. It is possible that specialised chemical processors of this general type could be useful for synthesising functional structured materials.
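The wavefront character of the chemical computation can be imitated on a grid: reagent fronts grow one cell per step from each drop and annihilate where they meet, leaving the bisector cells unclaimed. A toy simulation of that behaviour (not a model of the actual chemistry):

```python
def chemical_voronoi(width, height, seeds):
    """Grid Voronoi by synchronous wavefront growth from each seed,
    mimicking reagent fronts spreading at equal speed; cells where two
    different fronts arrive in the same step stay -1 (discrete bisector)."""
    UNSEEN, TIE = None, -1
    owner = [[UNSEEN] * width for _ in range(height)]
    frontier = []
    for label, (x, y) in enumerate(seeds):
        owner[y][x] = label
        frontier.append((x, y))
    while frontier:
        claims = {}                       # cell -> labels arriving this step
        for x, y in frontier:
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if 0 <= nx < width and 0 <= ny < height and owner[ny][nx] is UNSEEN:
                    claims.setdefault((nx, ny), set()).add(owner[y][x])
        frontier = []
        for (nx, ny), labels in claims.items():
            if len(labels) == 1:
                owner[ny][nx] = labels.pop()
                frontier.append((nx, ny))
            else:
                owner[ny][nx] = TIE       # fronts annihilate where they meet
    return owner

grid = chemical_voronoi(5, 5, [(0, 2), (4, 2)])
for row in grid:
    print(row)   # every row reads [0, 0, -1, 1, 1]: column x=2 is the bisector
```

Note the parallel-processor property the paper emphasises: adding more seeds does not lengthen the computation, since all fronts advance simultaneously.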

Fig. 1. The local nature of the CPM: removing links from the network has no effect on the communities which do not contain any of the endpoints. The link represented by a dashed line in Fig. a) is removed from the network. The resulting community structure is shown in Fig. b). The left community (grey nodes) is not affected by the link removal, since the link is not part of the community. The community on the right side of the figure (black nodes) is partially affected. 
Fig. 2. Splitting of a network into two pieces before the exploration of the communities. a) The original network: the links highlighted by dashed lines are selected as cut-links, their end-nodes define the boundary region between the two pieces. b) After the link removal the network falls into two separate subnetworks. c) Re-inserting the boundary region into each subnetwork separately. This way the k-cliques of the boundary region will appear in both pieces.
Fig. 3. 
Fig. 4. The total running time is decreasing with the inverse of the subnetwork size until the number of subnetworks reaches the number of available processing units. The optimal size for the subnetworks is the largest possible size, which fits into one database. 
Parallel clustering with CFinder

May 2012


The amount of available data about complex systems is increasing every year; measurements of larger and larger systems are collected and recorded. A natural representation of such data is given by networks, whose size follows the size of the original system. The current trend of multiple cores in computing infrastructures calls for a parallel reimplementation of earlier methods. Here we present the grid version of CFinder, which can locate overlapping communities in directed, weighted, or undirected networks based on the clique percolation method (CPM). We show that the computation of the communities can be distributed among several CPUs or computers. Although switching to the parallel version does not necessarily lead to a gain in computing time, it definitely makes the community structure of extremely large networks accessible.
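The clique percolation method underlying CFinder can be sketched for small graphs: enumerate the k-cliques, join two cliques when they share k−1 nodes, and read the (possibly overlapping) communities off the connected clique components. A brute-force sketch, nothing like the distributed implementation:

```python
from itertools import combinations

def cpm_communities(edges, k):
    """k-clique percolation on a small graph: brute-force clique
    enumeration (exponential; illustration only), union-find over cliques
    adjacent via k-1 shared nodes, node sets of the components returned."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    parent = list(range(len(cliques)))
    def find(i):                         # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    comms = {}
    for i, c in enumerate(cliques):
        comms.setdefault(find(i), set()).update(c)
    return sorted(map(sorted, comms.values()))

# Two triangles sharing an edge percolate into one k=3 community;
# the pendant edge (5,6) contains no triangle and is left out.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (5, 6)]
print(cpm_communities(edges, 3))   # -> [[1, 2, 3, 4]]
```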

Figure 1. A 3 dimensional illustration of Example 1. Example 1: Let us identify the lower and the upper parts of two horizontal half hyperplanes, illustrated by two half spaces, in Minkowski spacetime; see Figure 1.
Figure 2. A 3 dimensional illustration of Example 2
Figure 3. A 3 dimensional illustration of Example 3
Figure 4. Illustration of Example 4
Figure 5. Illustration of Example 5
Closed Timelike Curves in Relativistic Computation

April 2012


In this paper, we investigate the possibility of using closed timelike curves (CTCs) in relativistic hypercomputation. We introduce a wormhole based hypercomputation scenario which is free from the common worries, such as the blueshift problem. We also discuss the physical reasonability of our scenario, and why we cannot simply ignore the possibility of the existence of spacetimes containing CTCs.

Figure 4: Internode results for the nonblocking MPI benchmark on the Westmere-based test cluster and on Cray XT4 and XE6 systems. Unless indicated otherwise, results for nonblocking send and receive are almost identical.
Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems

June 2011


We evaluate optimized parallel sparse matrix-vector operations for several representative application areas on widespread multicore-based cluster configurations. First the single-socket baseline performance is analyzed and modeled with respect to basic architectural properties of standard multicore chips. Beyond the single node, the performance of parallel sparse matrix-vector operations is often limited by communication overhead. Starting from the observation that nonblocking MPI is not able to hide communication cost using standard MPI implementations, we demonstrate that explicit overlap of communication and computation can be achieved by using a dedicated communication thread, which may run on a virtual core. Moreover we identify performance benefits of hybrid MPI/OpenMP programming due to improved load balancing even without explicit communication overlap. We compare performance results for pure MPI, the widely used "vector-like" hybrid programming strategies, and explicit overlap on a modern multicore-based cluster and a Cray XE6 system.
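The kernel being optimized here is the sparse matrix-vector product in CSR format; a plain sequential sketch of it (the per-row loop is what the MPI/OpenMP schemes in the paper partition across threads and processes — this is not the authors' code):

```python
def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a matrix in CSR format: row i owns the slice
    indptr[i]:indptr[i+1] of the column-index and value arrays."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):              # the loop that gets parallelized
        acc = 0.0
        for jj in range(indptr[i], indptr[i + 1]):
            acc += data[jj] * x[indices[jj]]
        y[i] = acc
    return y

# A = [[10, 0, 2], [0, 3, 0], [1, 0, 4]]
indptr  = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data    = [10.0, 2.0, 3.0, 1.0, 4.0]
print(csr_spmv(indptr, indices, data, [1.0, 1.0, 1.0]))  # -> [12.0, 3.0, 5.0]
```

In the distributed setting, the entries of `x` owned by other processes are what must be communicated — and what a dedicated communication thread can fetch while local rows are being processed.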

Using duplication for the multiprocessor scheduling problem with hierarchical communications

August 1999


We propose a two-step algorithm that efficiently constructs a schedule of minimum makespan for the precedence-constrained multiprocessor scheduling problem in the presence of hierarchical communications and task duplication. We consider the case where all the tasks of the precedence graph have unit execution times, and the multiprocessor is composed of an unbounded number of clusters with two identical processors each. The communication delay for transferring data between a predecessor task and a successor task executed on processors of different clusters takes one unit of time, while this cost is zero whenever these tasks are executed on the same processor or on different processors of the same cluster. The first step of the algorithm computes for each task an earliest starting time of any of its copies and constructs a critical graph, whereas the second step uses the obtained critical graph and task duplication to build an earliest optimal schedule.

Fig. 7. Linearizing an unsuccessful contains() method call is a bit tricky. Dark nodes are physically in the list and white nodes are physically removed. During a traversal of the list by thread A, the sublist starting at the node pointed to by curr (and schematically represented by "...") may be disconnected from the main list by a concurrent remove() method execution. Both nodes with items a and b can still be reached, and the determination of whether an item is in the list is based solely on the mark-bit. 
Fig. 9. The graph shows throughput as concurrency increases with a 34%, 33% and 33% ratio respectively of contains(), add(), and remove() method calls. 
A Lazy Concurrent List-Based Set Algorithm
List-based implementations of sets are a fundamental building block of many concurrent algorithms. A skiplist based on the lock-free list-based set algorithm of Michael will be included in the Java™ Concurrency Package of JDK 1.6.0. However, Michael’s lock-free algorithm has several drawbacks, most notably that it requires all list traversal operations, including membership tests, to perform cleanup operations of logically removed nodes, and that it uses the equivalent of an atomically markable reference, a pointer that can be atomically “marked,” which is expensive in some languages and unavailable in others. We present a novel “lazy” list-based implementation of a concurrent set object. It is based on an optimistic locking scheme for inserts and removes, eliminating the need to use the equivalent of an atomically markable reference. It also has a novel wait-free membership test operation (as opposed to Michael’s lock-free one) that does not need to perform cleanup operations and is more efficient than that of all previous algorithms. Empirical testing shows that the new lazy-list algorithm consistently outperforms all known algorithms, including Michael’s lock-free algorithm, throughout the concurrency range. At high load, with 90% membership tests, the lazy algorithm is more than twice as fast as Michael’s. This is encouraging given that typical search structure usage patterns include around 90% membership tests. By replacing the lock-free membership test of Michael’s algorithm with our new wait-free one, we achieve an algorithm that slightly outperforms our new lazy-list (though it may not be as efficient in other contexts as it uses Java’s RTTI mechanism to create pointers that can be atomically marked).
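The lazy algorithm's core ingredients — a mark bit for logical deletion, optimistic unlocked traversal, and locking plus validation only for updates — can be sketched compactly. This Python rendering is illustrative only (the paper targets Java, and Python's GIL hides the memory-model subtleties the real algorithm must handle):

```python
import threading

class Node:
    def __init__(self, key, nxt=None):
        self.key, self.next = key, nxt
        self.marked = False              # logical-deletion bit
        self.lock = threading.Lock()

class LazyList:
    """Sketch of the lazy list-based set: optimistic lock-based add/remove
    with validation, and a traversal-only contains() that never locks and
    never performs cleanup (the wait-free membership test)."""
    def __init__(self):
        self.head = Node(float("-inf"), Node(float("inf")))  # sentinels

    def _locate(self, key):
        pred = self.head
        curr = pred.next
        while curr.key < key:
            pred, curr = curr, curr.next
        return pred, curr

    def _validate(self, pred, curr):
        return (not pred.marked) and (not curr.marked) and pred.next is curr

    def add(self, key):
        while True:
            pred, curr = self._locate(key)
            with pred.lock, curr.lock:
                if self._validate(pred, curr):
                    if curr.key == key:
                        return False               # already present
                    pred.next = Node(key, curr)    # link in the new node
                    return True
            # validation failed: a concurrent update raced us; retry

    def remove(self, key):
        while True:
            pred, curr = self._locate(key)
            with pred.lock, curr.lock:
                if self._validate(pred, curr):
                    if curr.key != key:
                        return False
                    curr.marked = True             # logical removal first...
                    pred.next = curr.next          # ...then physical unlink
                    return True

    def contains(self, key):                       # no locks, no cleanup
        curr = self.head
        while curr.key < key:
            curr = curr.next
        return curr.key == key and not curr.marked

s = LazyList()
assert s.add(3) and s.add(1) and not s.add(3)
assert s.contains(1) and s.remove(1) and not s.contains(1)
```

Marking before unlinking is what lets an unlocked `contains()` decide membership from the mark bit alone, even if the node it reached has just been disconnected.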

Wormhole deadlock prediction

April 2006


Deadlock prevention is usually realized by imposing strong restrictions on packet transmissions in the network, so the resulting deadlock-free routing algorithms are not optimal with respect to resource utilization. Optimality can be achieved by forbidding transmissions only when they would bring the network into a configuration that will necessarily evolve into a deadlock. Hence, optimal deadlock avoidance is closely related to deadlock prediction. In this paper it is shown that wormhole deadlock prediction is a hard problem. This result is proved with respect to both static and dynamic routing.

Fig. 1: Manual copy delay line with two stages. The syringe is used to indicate the species where inputs are presented and X1 and X2 represent the output species from the delay line. Species X2C is used to cascade a value to a delay line of greater than two stages. X1 signal and X2 signal catalyze the copy and decay (λ).
Fig. 7: SAMP calculated for delay lines. Mm and Bm are the m-th stage of the manual copying and back propagation delay line.
Fig. 8: Perceptron integration with backwards propagating delay line of two stages. The delay line outputs (X1 and X2) are fed to the perceptron without modification of the delay line.
Fig. 9: Binary time-series chemical perceptron success rate. The perceptron learns 11 of the 14 functions with an accuracy of greater than 85%.
Delay Line as a Chemical Reaction Network

April 2014


Chemistry as an unconventional computing medium presently lacks a systematic approach to gathering, storing, and sorting data over time. To build more complicated systems in chemistries, the ability to look at past data would be a valuable tool for performing complex calculations. In this paper we present the first implementation of a chemical delay line providing information storage in a chemistry that can reliably capture information over an extended period of time. The delay line is capable of parallel operations in a single instruction, multiple data (SIMD) fashion. Using Michaelis-Menten kinetics, we describe the chemical delay line implementation featuring an enzyme acting as a means to reduce copy errors. We also discuss how information is randomly accessible from any element on the delay line. Our work shows how the chemical delay line retains and provides a value from a previous cycle. The system's modularity allows for integration with existing chemical systems. We exemplify the delay line's capabilities by integrating it with a threshold asymmetric signal perceptron to demonstrate how it learns all 14 linearly separable binary functions over a size-two sliding window. The delay line has applications in biomedical diagnosis and treatment, such as smart drug delivery.

Hierarchical Peer-to-Peer Systems
Structured peer-to-peer (P2P) lookup services organize peers into a flat overlay network and offer distributed hash table (DHT) functionality. Data is associated with keys and each peer is responsible for a subset of the keys. In hierarchical DHTs, peers are organized into groups, and each group has its autonomous intra-group overlay network and lookup service. Groups are organized in a top-level overlay network. To find a peer that is responsible for a key, the top-level overlay first determines the group responsible for the key; the responsible group then uses its intra-group overlay to determine the specific peer that is responsible for the key. We provide a general framework and a scalable hierarchical overlay management. We study a two-tier hierarchy using Chord for the top level. Our analysis shows that by using the most reliable peers in the top level, the hierarchical design significantly reduces the expected number of hops.
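The two-tier lookup described above can be sketched with a hash ring at each level: the top-level ring maps the key to a group, and that group's own ring maps it to the responsible peer. Group and peer names below are made up for illustration; a real deployment would use Chord routing rather than a sorted scan:

```python
import hashlib

def h(s):
    """Hash a string onto a 32-bit identifier ring."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

def successor(ring, point):
    """Chord-style successor: first identifier clockwise from `point`."""
    ids = sorted(ring)
    for i in ids:
        if i >= point:
            return ring[i]
    return ring[ids[0]]                  # wrap around the ring

def hierarchical_lookup(top_level, intra_level, key):
    """Two-tier DHT lookup: top-level ring -> responsible group,
    then the group's intra-group ring -> responsible peer."""
    group = successor(top_level, h(key))
    peer = successor(intra_level[group], h(key))
    return group, peer

top_level = {h(g): g for g in ("group-A", "group-B", "group-C")}
intra_level = {g: {h(p): p for p in (g + "/peer1", g + "/peer2")}
               for g in top_level.values()}

group, peer = hierarchical_lookup(top_level, intra_level, "some-key")
print(group, peer)   # the chosen peer always belongs to the chosen group
```

The scalability benefit in the paper comes from the top level containing only group identifiers (one per group, held by the most reliable peers), keeping the expected hop count low.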

Scheduling Sensors by Tiling Lattices

July 2008


Suppose that wirelessly communicating sensors are placed in a regular fashion on the points of a lattice. Common communication protocols allow the sensors to broadcast messages at arbitrary times, which can lead to problems should two sensors broadcast at the same time. It is shown that one can exploit a tiling of the lattice to derive a deterministic periodic schedule for the broadcast communication of sensors that is guaranteed to be collision-free. The proposed schedule is shown to be optimal in the number of time slots.
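One concrete instance of such a construction: the perfect tiling of Z² by Lee spheres of radius 1 induces the coloring (x + 2y) mod 5, giving a 5-slot periodic schedule in which any two sensors sharing a slot are at L1 distance at least 3. This worked example is consistent with, but not taken from, the paper:

```python
def slot(x, y, period=5):
    """Broadcast slot for the sensor at lattice point (x, y), from the
    perfect tiling of Z^2 by Lee spheres of radius 1: the coloring
    (x + 2y) mod 5 gives each of the 5 cells of a tile its own slot."""
    return (x + 2 * y) % period

# Same-slot sensors satisfy dx + 2*dy ≡ 0 (mod 5), whose shortest nonzero
# solutions, e.g. (1, 2) and (2, -1), have L1 length 3 — so sensors within
# L1 distance 2 of each other never broadcast in the same slot.
points = [(x, y) for x in range(10) for y in range(10)]
for (x1, y1) in points:
    for (x2, y2) in points:
        if (x1, y1) != (x2, y2) and slot(x1, y1) == slot(x2, y2):
            assert abs(x1 - x2) + abs(y1 - y2) >= 3
print("collision-free with period", 5)
```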

Memory Reuse Analysis in the Polyhedral Model

June 1997


In this paper we develop the constraints that the projection functions must satisfy, based on the information obtained in the usage table. The usage table may have many other applications. For example, when parallel code is generated, it can be used for communication optimization (message vectorization, detection of common communication patterns such as broadcast, scatter, gather, total exchange, scans, etc.). It may also form the basis of the analysis necessary to generate sender-initiated communications. The remainder of this paper is organized as follows. We follow this introduction with a review of related work. Next, we describe the Alpha language, system, and our compilation methodology.

Scatter of Weak Robots

February 2007


In this paper, we investigate the scatter problem, which is defined as follows: given a set of n robots, regardless of the initial positions of the robots on the plane, eventually no two robots are located at the same position, and this holds forever. We show that this problem cannot be deterministically solved. Next, we propose a randomized algorithm. The proposed solution is trivially self-stabilizing. We then show how to design a self-stabilizing version of any deterministic solution for the Pattern Formation and the Gathering problems for any number n ≥ 2 of robots.

Themis: Component Dependence Metadata In Adaptive Parallel Applications

May 2001


This paper describes THEMIS, a programming model and run-time library being designed to support cross-component performance optimization through explicit manipulation of the computation's iteration space at run-time. Each component is augmented with "component dependence metadata", which characterizes the constraints on its execution order, data distribution, and memory access order. We show how this supports dynamic adaptation of each component to exploit the available resources, the context in which its operands are generated and its results are used, and the evolution of the problem instance. Using a computational fluid dynamics visualization example as motivation, we show how component dependence metadata provides a framework in which a number of interesting optimizations become possible. Examples include data placement optimization, loop fusion, tiling, memoization, checkpointing, and incrementalization.

Low Crosstalk Address Encodings for Optical Message Switching Systems

May 1998


An optical message switching system delivers messages from N sources to N destinations using beams of light. The redirection of the beams involves a vector-matrix multiplication and a threshold operation. The input vectors are set by the sources and may be viewed as the addresses of the desired destinations. In a massively parallel system, it is highly desirable to reduce the number of threshold (non-linear) elements, which require extra wiring and increase clock skew. Moreover, the threshold devices have a sensitivity parameter (implied by the technology) defined as the gap in which the outcome of the device is not determined. This gap is largely affected by the crosstalk, which is the maximum number of jointly set bits in any pair of addresses, implying a lower bound on the maximum intensity for which the outcome of the threshold operation is determined. In this work we consider the design of addresses which are both short (so that the number of threshold devices is reduced) and have low crosstalk (so that the sensitivity gap may grow). We show that addresses of O(log N) bits exist for which the crosstalk is a constant fraction of the number of set bits in each address, hence allowing for a Θ(log N)-sized sensitivity gap. More generally, we show the precise coefficient, which depends on the desired gap. It is established that when using O(log N)-bit addresses, the crosstalk cannot be further reduced. An exact construction of O(log² N)-bit addresses is given, where the involved constant depends on the desired crosstalk. Finally, we briefly describe the basic optical elements that can be used to construct a message switching system that uses these address schemes.
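The crosstalk of a given address set is straightforward to compute directly from its definition (maximum number of jointly set bits over all pairs). A toy constant-weight example, invented for illustration and unrelated to the paper's construction:

```python
from itertools import combinations

def crosstalk(addresses):
    """Maximum number of jointly set bits over all pairs of addresses —
    the quantity the address design wants to keep small relative to the
    number of set bits in each address."""
    return max(bin(a & b).count("1") for a, b in combinations(addresses, 2))

# Toy constant-weight code: 4 addresses of 6 bits, each of weight 3;
# every pair shares exactly one set bit, so the crosstalk is 1.
addrs = [0b111000, 0b100110, 0b010101, 0b001011]
print(crosstalk(addrs))   # -> 1
```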

Tuning Message Aggregation On High Performance Clusters For Efficient Parallel Simulations

November 2000


High performance clusters (HPCs) based on commodity hardware are becoming more and more popular in the parallel computing community. These new platforms offer hardware capable of very low latency and very high throughput at an unbeatable cost, making them attractive for a large variety of parallel and distributed applications. With adequate communication software, HPCs have the potential to achieve a level of performance similar to massively parallel computers. However, for parallel applications that present a high communication/computation ratio, it is still essential to provide the lowest latency in order to minimize the communication overhead. In this paper, we investigate message aggregation techniques to improve parallel simulations of fine-grain ATM communication network models. Even if message aggregation is a well-known solution for improving the communication performance of high-latency interconnection networks, the complex interaction between message aggreg...

Fast Parallel Permutation Algorithms

October 1995


We investigate the problem of permuting n data items on an EREW PRAM with p processors using little additional storage. We present a simple algorithm with run time O((n/p) log n) and an improved algorithm with run time O(n/p + log n log log(n/p)). Both algorithms require n additional global bits and O(1) local storage per processor. If prefix summation is supported at the instruction level, the run time of the improved algorithm is O(n/p). The algorithms can be used to rehash the address space of a PRAM emulation. Keywords: Parallel Algorithms, Permutations, Shared Memory, Rehashing. 1. Introduction. Consider the task of permuting n data items on an EREW PRAM with p ≤ n processors according to a permutation given in the form of a constant-time "black-box" program. The task is trivial if n additional (global or local) memory cells are available: the items are first moved to the additional storage, with each processor handling O(n/p) items, and then written back in permuted order. We ...
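The flavor of the problem — applying a black-box permutation in place with only n extra bits — can be shown with a sequential cycle-leader sketch; the paper's contribution is doing this in parallel on an EREW PRAM, which this sketch does not attempt:

```python
def permute_in_place(items, pi):
    """Apply the permutation i -> pi(i) (item at i moves to position pi(i))
    using only n extra bits: follow each cycle from its leader, carrying
    one item along and marking positions as they are written."""
    n = len(items)
    done = [False] * n                   # the n additional global bits
    for start in range(n):
        if done[start]:
            continue
        carried = items[start]
        j = pi(start)
        while j != start:                # walk the cycle, swapping along
            items[j], carried = carried, items[j]
            done[j] = True
            j = pi(j)
        items[start] = carried
        done[start] = True
    return items

# Rotate right by one as a "black-box" permutation of 5 items.
print(permute_in_place(list("abcde"), lambda i: (i + 1) % 5))
# -> ['e', 'a', 'b', 'c', 'd']
```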

Systematic Derivation of Tree Contraction Algorithms

July 2004


While tree contraction algorithms play an important role in efficient parallel tree computation, it is difficult to develop such algorithms due to the strict conditions imposed on contracting operators. In this paper, we propose a systematic method of deriving efficient tree contraction algorithms from recursive functions on trees of any shape. We identify a general recursive form that can be parallelized to obtain efficient tree contraction algorithms, and present a derivation strategy for transforming general recursive functions into the parallelizable form. We illustrate our approach by deriving a novel parallel algorithm for the maximum connected-set sum problem on arbitrary trees, the tree version of the famous maximum segment sum problem.
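The maximum connected-set sum problem mentioned at the end has a simple sequential tree DP — the kind of recursive function a tree contraction algorithm would parallelize. This is a sketch of the problem itself, not the authors' derivation:

```python
import sys

def max_connected_set_sum(tree, weights, root=0):
    """Maximum total weight of a connected subtree of a node-weighted tree:
    best(v) = w(v) + sum over children c of max(0, best(c)); the answer is
    the maximum of best(v) over all v (the tree analogue of Kadane's rule
    for maximum segment sum)."""
    sys.setrecursionlimit(10000)
    best = {}
    def dp(v, parent):
        s = weights[v]
        for c in tree.get(v, []):
            if c != parent:
                s += max(0, dp(c, v))    # take a child only if it helps
        best[v] = s
        return s
    dp(root, None)
    return max(best.values())

#        0(+1)
#       /     \
#   1(-5)     2(+4)
#     |         |
#   3(+6)     4(-2)
tree = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
weights = {0: 1, 1: -5, 2: 4, 3: 6, 4: -2}
print(max_connected_set_sum(tree, weights))  # -> 6, e.g. the set {0, 1, 3, 2}
```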

On Designing Communication-Intensive Algorithms For A Spanning Optical Bus Based Array

August 1998


The Reconfigurable Array with Spanning Optical Buses (or RASOB) architecture provides flexible reconfiguration and strong connectivities with low hardware and control complexities. We use parallel implementations of matrix transposition and multiplication algorithms as examples to show how the architectural capabilities can be taken advantage of in designing efficient parallel algorithms. Keywords: Optical Bus, Reconfiguration, Matrix Transposition, Matrix Multiplication. 1. Introduction. Reconfigurable architectures are attractive because they provide alternatives to completely connected systems at lower implementation costs. Since optical interconnects can offer many advantages over their electronic counterparts, including high connection density and a relaxed bandwidth-distance product, they will soon be a viable alternative for multiprocessor interconnections [1, 2, 3, 4]. This paper describes the Reconfigurable Array with Spanning Optical Buses (RASOB) architecture t...

On The Alignment Problem

October 1996

·

135 Reads

This paper deals with the problem of aligning data and computations when mapping uniform or affine loop nests onto SPMD distributed memory parallel computers. For affine loop nests we formulate the problem by introducing the communication graph, which can be viewed as the counterpart for the mapping problem of the dependence graph for scheduling. We illustrate the approach with several examples to show the difficulty of the problem. In the simplest case, that of perfect loop nests with uniform dependences, we show that minimizing the number of communications is NP-complete, although we are able to derive a good alignment heuristic in most practical cases. Keywords: Loop nest; Uniform dependences; Affine dependences; Alignment problem; The owner computes rule; Convex domain; Communication graph; Mapping strategies. 1. Introduction The automatic parallelization of loop nests targeted for execution onto DMPCs has motivated a vast amount of research ([11, 9, 18] and references therein). ...

An Efficient Processor Allocation For Nested Parallel Loops On Distributed Memory Hypercubes

June 2000

·

34 Reads

We consider the static processor allocation problem for arbitrarily nested parallel loops on distributed memory, message-passing hypercubes. We present HYPAL (HYpercube Partitioning ALgorithm) as an efficient algorithm to solve this problem. HYPAL calculates an optimal set of partitions of the dimension of the hypercube, and assigns them to the set of iterations of the nested loop. To obtain a more realistic model, we also consider the influence of communication overhead. The main problem at this point is to obtain the communication pattern associated with the parallel program, because it depends on scheduling and data distribution. Keywords: Distributed memory hypercube multiprocessor, parallelizing compiler, processor allocation, loop scheduling. 1. Introduction One of the phases in scheduling is processor allocation, which is the assignment of a number of processors to each task of the parallelized program. In this paper, we conce...

Simultaneous Allocation And Scheduling Using Convex Programming Techniques

November 1995

·

11 Reads

Simultaneous exploitation of task and data parallelism provides significant benefits for many applications. The basic approach for exploiting task and data parallelism is to use a task graph representation (Macro Dataflow Graph) for programs to decide on the degree of data parallelism to be used for each task (allocation) and an execution order for the tasks (scheduling). Previously, we presented a two step approach for allocation and scheduling by considering the two steps to be independent of each other. In this paper, we present a new simultaneous approach which uses constraints to model the scheduler during allocation. The new simultaneous approach provides significant benefits over our earlier approach for the benchmark task graphs that we have considered.

Parallel Copying Garbage Collection using Delayed Allocation

November 1998

·

23 Reads

We present a new approach to parallel copying garbage collection on symmetric multiprocessor (SMP) machines appropriate for Java and other object-oriented languages. Parallel, in this setting, means that the collector runs in several parallel threads. Our collector is based on a new idea called delayed allocation, which completely eliminates the fragmentation problem of previous parallel copying collectors while still keeping low synchronization, high efficiency, and simplicity of collection. In addition to this main idea, we also discuss several other ideas such as termination detection, balancing the distribution of work, and dealing with contention during work distribution.

Memory Cost Due to Anticipated Broadcast

January 2003

·

17 Reads

To get efficient solutions, parallelization techniques mainly focus on data alignment or on communication minimization. The efficiency of a parallel solution depends not only on the communication cost, but also on the memory cost. This paper mainly focuses on a symbolic evaluation of the memory cost due to anticipated broadcast. This evaluation is conducted in the polytope model using Ehrhart polynomials, which express the number of integer points in a parameterized polytope.

Approximating Maximum 2-CNF Satisfiability

June 1997

·

302 Reads

A parallel approximation algorithm for the MAXIMUM 2-CNF SATISFIABILITY problem is presented. This algorithm runs in O(log²(n + |F|)) parallel time on a CREW PRAM machine using O(n + |F|) processors, where n is the number of variables and |F| is the number of clauses. Performance guarantees are considered for three slightly differing definitions of this problem. Keywords: Satisfiability, Maximum 2-CNF SAT, Maximum Cut, Approximation Algorithm. 1. Introduction A satisfiability problem takes as input a formula which is a conjunction of clauses F = (c_1, …, c_m). Let |F| denote m, the number of clauses in F. Each clause c_i is a disjunction of r_i literals, where each literal is either a positive (true) or negative (false) appearance of a variable from the set X = {x_1, …, x_n}. Such a boolean formula is said to be in conjunctive normal form (CNF). The objective is to find a truth assignment of the n variables that satisfies (makes the boolean clause true) either a...
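For intuition about approximation guarantees in this setting, here is a minimal sequential sketch; it is our own illustration, not the paper's CREW PRAM algorithm, and the signed-integer literal encoding is an assumption. It uses the classic observation that any assignment and its complement together satisfy every 2-literal clause, so the better of the two is a 1/2-approximation:

```python
def count_satisfied(clauses, assign):
    """clauses: 2-literal tuples; literal v > 0 means x_v, v < 0 means not x_v.
    assign: dict mapping variable index -> bool."""
    value = lambda l: assign[abs(l)] if l > 0 else not assign[abs(l)]
    return sum(1 for c in clauses if any(value(l) for l in c))

def half_approx(clauses, variables):
    # A clause whose two literals are both false under `a` is satisfied
    # by the complementary assignment, so the better of the pair
    # satisfies at least half of all clauses.
    a = {v: True for v in variables}
    b = {v: False for v in variables}
    return max((a, b), key=lambda t: count_satisfied(clauses, t))
```

The paper's algorithm achieves stronger guarantees (via a reduction related to MAXIMUM CUT, per its keywords); this sketch only shows why constant-factor approximation is easy while exact optimization is hard.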

Figure 1: An 8 × 8 Benes network with r = 3
Figure 4: A 5 × 5 Benes network
Figure 5: Two loops in the realization of a permutation in a 9 × 9 AS-Benes
Figure 6: Comparisons between AS-Benes and Benes networks
Arbitrary Size Benes Networks

May 1997

·

1,828 Reads

The Benes network is a rearrangeable nonblocking network which can realize any arbitrary permutation. Overall, the r-dimensional Benes network connects 2^r inputs to 2^r outputs through 2r − 1 levels of 2 × 2 switches. Each level of switches consists of 2^(r−1) switches, and hence the size of the network has to be a power of two. In this paper, we extend Benes networks to arbitrary sizes. We also show that the looping routing algorithm used in Benes networks can be slightly modified and applied to arbitrary size Benes networks. 1 Introduction A multistage network consists of more than one stage of switching elements and is usually capable of connecting an arbitrary input terminal to an arbitrary output terminal. Multistage networks are classified into blocking, rearrangeable, or nonblocking networks. In blocking networks, simultaneous connections of more than one terminal pair may result in conflicts in the use of network communication links. A network is a rearrang...
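The quoted size parameters can be written down directly; a tiny sketch (our illustration) consistent with the figures above, where an 8 × 8 network has r = 3, five levels, and twenty 2 × 2 switches:

```python
def benes_stats(r):
    """Size parameters of the r-dimensional Benes network (N = 2^r ports).

    Returns (ports, levels, total number of 2x2 switches):
    2r - 1 levels, each containing 2^(r-1) switches.
    """
    ports = 2 ** r
    levels = 2 * r - 1
    switches_per_level = 2 ** (r - 1)
    return ports, levels, levels * switches_per_level
```

The power-of-two restriction is visible here in `ports = 2 ** r`; removing it is exactly what the AS-Benes construction in the paper is about.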

Functional Algorithm Simulation Of The Fast Multipole Method: Architectural Implications

July 1995

·

21 Reads

Functional Algorithm Simulation is a methodology for predicting the computation and communication characteristics of parallel algorithms for a class of scientific problems, without actually performing the expensive numerical computations involved. In this paper, we use Functional Algorithm Simulation to study the parallel Fast Multipole Method (FMM), which solves the N-body problem. Functional Algorithm Simulation provides us with useful information regarding communication patterns in the algorithm, the variation of available parallelism during different algorithmic phases, and upper bounds on available speedups for different problem sizes. Furthermore, it allows us to predict the performance of the FMM on message-passing multiprocessors with topologies such as cliques, hypercubes, rings, and multirings, over a wider range of problem sizes and numbers of processors than would be feasible by direct simulation. Our simulations show that an implementation of the FMM on low-cost...

Reconfigurable Parallel Computer Architecture Based on Wavelength-Division Multiplexed Optical Interconnection Network

April 1995

·

15 Reads

Reconfigurability is a desirable characteristic in parallel computer architecture that supports the structural parallelism inherent to multiple parallel algorithms. This paper presents a novel approach to achieve reconfigurability via a multiple-domain wavelength-division multiplexed optical interconnection network. A network structure is defined to host a set of point-to-point guest interconnection topologies. Fast complete reconfiguration takes place by re-labeling node identifiers and re-tuning receiver filters according to a local table at each node. For a network of size N, the connecting fiber complexity is O(N), the same order as the link complexities of the individual topologies. The proposed network avoids the possible edge dilation, congestion, and cumulative switching latencies associated with previously proposed approaches. Keywords: Parallel computer architecture, reconfigurability, interconnection networks, virtual topology, optical interconnection. 1. Introduction The...

Figure 1. Stages in the communication between two objects. Each stage begins with the injection of a new probe.
Figure 5. Migration approaches.
Figure 6. The testbed application in a 4x4 transputer mesh.
A Routing Strategy For Object-Oriented Applications In Massively Parallel Architectures

November 1996

·

39 Reads

Parallel object-oriented environments have a high degree of dynamicity and need specialised support to achieve efficiency of execution. Static strategies are not suitable for these environments: any prediction made before execution can only roughly estimate the real behaviour. In object-oriented environments, the decision to create/destroy objects is usually taken at run-time, and object allocation can change during the execution. The requirement of dynamicity should be considered in the design of every component of the support. The routing system, for instance, must ensure delivery even in the case of dynamic object allocation/reallocation. The paper argues that routing algorithms for parallel object-oriented environments in massively parallel architectures should be both adaptive and efficient. We adopted a routing strategy designed to be effective for objects that are dynamically created/destroyed and capable of moving during the execution. Our adaptive strategy does not assume any knowledge of either object allocation or the system topology configuration.

Figure 3: Ratio of clock periods R = t_c^S / t_c^1 required for the S-pipelined and non-pipelined designs to have the same performance; i.e., t_c^S · T_comp^S = t_c^1 · T_comp^1.
Optimal Synthesis of Processor Arrays with Pipelined Arithmetic Units

April 1994

·

26 Reads

Two-level pipelining in processor arrays (PAs) involves pipelining of operations across processing elements (PEs) and pipelining of operations in functional units in each PE. Although it is an attractive method for improving the throughput of PAs, existing methods for generating PAs with two-level pipelining are restricted and cannot systematically explore the entire space of feasible designs. In this paper, we extend a systematic design method, called the General Parameter Method (GPM), which we developed earlier, to find optimal designs of PAs with two-level pipelines. The basic idea is to add new constraints on the periods of data flows to include the effect of internal functional pipelines in the PEs. As an illustration, we present pipelined PA designs for computing matrix products. For n-dimensional meshes and other symmetric problems, we provide an efficient scheme to obtain a pipelined PA from a non-pipelined PA using a reindexing transformation. This scheme is used in GPM as...

Costing Nested Array Codes

May 2002

·

25 Reads

We discuss a language-based cost model for array programs built on the notions of work complexity and parallel depth. The programs operate over data structures comprising nested arrays and recursive product-sum types. In a purely functional setting, such programs can be implemented by way of the flattening transformation that converts codes over nested arrays into vectorised code over flat arrays. Flat arrays lend themselves to a particularly efficient implementation on standard hardware, but the overall efficiency of the approach depends on the flattening transformation preserving the asymptotic complexity of the nested array code. Blelloch has characterised a class of first-order array programs, called contained programs, for which flattening preserves the asymptotic depth complexity. However, his result is restricted to programs processing only arrays and tuples. In the present paper, we extend Blelloch's result to array programs processing data structures containing arrays as well as arbitrary recursive product-sum types. Moreover, we replace the notion of containment by the more general concept of fold programs.

Bounds for the bandwidth of the d-ary de Bruijn graph

April 2002

·

23 Reads

The computation of upper bounds on the bandwidth of graphs is mainly based on exhibiting a numbering which achieves these bounds. In [9], Harper proposed such a numbering for the binary hypercube, based on the Hamming weights and binary values of the hypercube vertices. By defining an extended Hamming weight, this numbering can lead to an equivalent proof for the d-ary de Bruijn graph. We present in this paper an approach, based on the use of the continuous domain and Laplace's theorem for asymptotically evaluating integrals, which leads to the enumeration of the vertices of the same extended Hamming weight in the non-binary case. This result allows the computation of an upper bound of the bandwidth of the unoriented de Bruijn graph, as well as an upper bound of its vertex-bisection when both the diameter and the degree are even.

Automated Negotiation Between Publishers And Consumers Of Grid Notifications

June 2004

·

43 Reads

Notification services mediate between information publishers and consumers that wish to subscribe to periodic updates. In many cases, however, there is a mismatch between the dissemination of these updates and the delivery preferences of the consumer, often in terms of frequency of delivery, quality, etc. In this paper, we present an automated negotiation engine that identifies mutually acceptable terms; we study its performance, and discuss its application to a Grid notification service. We also demonstrate how the negotiation engine enables users to control the Quality of Service levels they require.

Automatic Data and Computation Decomposition for Distributed Memory Machines

December 1994

·

13 Reads

In this paper, we develop an automatic compile-time computation and data decomposition technique for distributed memory machines. Our method can handle complex programs containing perfect and nonperfect loop nests with or without loop-carried dependences. Applying our decomposition algorithms, a program will be divided into collections (called clusters) of loop nests, such that data redistributions are allowed only between the clusters. Within each cluster of loop nests, decomposition and data locality constraints are formulated as a system of homogeneous linear equations which is solved by polynomial time algorithms. Our algorithm can selectively relax data locality constraints within a cluster to achieve a balance between parallelism and data locality. Such relaxations are guided by exploiting the hierarchical program nesting structures from outer to inner nesting levels, to keep the communications at the outermost level possible. This work is central to the on-going compiler developmen...

Optimal Parameters For Load Balancing With The Diffusion Method In Mesh Networks

September 1994

·

38 Reads

The diffusion method is a simple distributed load balancing method for distributed memory multiprocessors. It operates in a relaxation fashion for point-to-point networks. Its convergence to the balanced state relies on the value of a parameter, the diffusion parameter. An optimal diffusion parameter leads to the fastest convergence of the method. Previous results on optimal parameters exist for the k-ary n-cube and the torus. In this paper, we derive optimal diffusion parameters for mesh networks. Keywords: Diffusion method, load balancing, mesh networks, multicomputers 1. Introduction Given an N-processor system G, the problem of load balancing is to redistribute the system workload (w_1, w_2, …, w_N), where w_i is a nonnegative real number representing the workload in processor i, such that each processor ends up with the same average load w̄ = (Σ_i w_i)/N. It requires the processors not only to reach an agreement on the average load, but also to ad...
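The relaxation step the abstract refers to can be sketched in a few lines; the diffusion parameter `alpha` below (and the 0.25 used in the example) is purely illustrative, while the paper's contribution is deriving its optimal value for mesh networks:

```python
def diffusion_step(load, neighbors, alpha):
    """One synchronous diffusion round on a point-to-point network:
    w_i <- w_i + alpha * sum_{j in N(i)} (w_j - w_i).
    Total load is conserved; the choice of alpha governs convergence speed.
    """
    return [w + alpha * sum(load[j] - w for j in neighbors[i])
            for i, w in enumerate(load)]
```

Iterating this on, say, a 1-D mesh (path) spreads an initial imbalance toward the uniform average; too large an alpha causes oscillation, which is why the optimal parameter matters.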

A Self-Stabilizing Distributed Algorithm to Construct BFS Spanning Trees of a Symmetric Graph

October 1996

·

30 Reads

We propose a simple and efficient self-stabilizing distributed algorithm to construct the breadth first search (BFS) spanning tree of an arbitrary connected symmetric graph. We develop a completely new direct approach of graph theoretical reasoning to prove the correctness of our algorithm. The approach seems to have potential applications in proving the correctness of other self-stabilizing algorithms for graph theoretical problems. 1 Introduction A distributed system can be viewed as consisting of a set of loosely connected systems (state machines) which do not share a global memory but can share information only by exchanging messages. Each node or machine is allowed to have only a partial view of the global state, which dep...

Figure 1 : Package requirement conflicts in a complex application.
Figure 2: Layout for an object using CDNs.
Binary Version Management for Computational Grids

November 1999

·

52 Reads

Applications are no longer monolithic files, but rather a collection of dynamically linked libraries, images, fonts, etc. For such applications to function correctly, all of the required files must be available and be the correct version. Missing files preclude application execution, and incorrect versions lead to mysterious and frustrating failures. This paper describes a simple scheme to address this problem: Content-Derived Names (CDNs). CDNs use digital signatures to automatically and uniquely name specific versions of files. Because Content-Derived Names are computed using a cryptographically strong hash over the text of a package, this process is safe from spoofing and other attacks based on providing the wrong library. We explain how CDNs ease the management of application distribution for clusters and grids. We also describe a prototype implementation of CDNs for the Tcl programming language. 1. Introduction The proliferation of complex software libraries has made development...
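A minimal sketch of the naming idea; the hash function and name layout here are our assumptions, since the paper's prototype targeted Tcl and predates SHA-256:

```python
import hashlib

def content_derived_name(data: bytes, ext: str = ".so") -> str:
    """Name a specific version of a file by a cryptographically strong
    hash of its contents, so two different builds can never collide on
    a name and a name always denotes exactly one version."""
    digest = hashlib.sha256(data).hexdigest()
    return digest[:16] + ext
```

Because the name is derived purely from the bytes, a node that already holds a file with the requested name is guaranteed to hold the right version, which is what makes the scheme attractive for clusters and grids.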

On a Sublinear Time Parallel Construction of Optimal Binary Search Trees (Note)

April 1997

·

56 Reads

We design an efficient sublinear time parallel construction of optimal binary search trees. The efficiency of the parallel algorithm corresponds to its total work (the product time × processors). Our algorithm works in O(n^(1−ε) log n) time with total work O(n^(2−2ε)), for an arbitrarily small constant 0 < ε ≤ 1/2. This is optimal within a factor n^(2ε) with respect to the best known sequential algorithm given by Knuth, which needs only O(n²) time due to a monotonicity property of optimal binary search trees (see [6]). It is unknown how to exploit this property in an efficient NC construction of binary search trees. Here we show that it can be effectively used in sublinear time parallel computation. Our improvement also relies on the use (in independently processed small subcomputations) of the parallelism present in Knuth's algorithm. The best known sublinear time algorithms for the construction of binary search trees (as an instance of a more general problem) have O(n³) work for time larger than n^(3/4), see [3] and [7]. For time √n these algorithms need n⁴ work, while our algorithm needs only n³ work for this time, thus improving the known algorithms by a linear factor. Also if time is O(n^(1−ε)) and ε is very small, our improvement is close to O(n). Such improvement is similar to the one implied by the monotonicity property in sequential computations (from n³ sequential time for a more general dynamic programming problem to n² time for the special case of optimal binary search trees).
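The monotonicity property the abstract leans on is the one behind Knuth's sequential O(n²) dynamic program. A compact sketch of that sequential baseline (successful-search probabilities only, a simplification of the general model; this is context, not the paper's parallel algorithm):

```python
def optimal_bst(p):
    """Knuth's O(n^2) DP: weighted path length of an optimal BST over
    keys 1..n with access probabilities p[0..n-1] (root at depth 1).
    Monotonicity root[i][j-1] <= root[i][j] <= root[i+1][j] restricts
    the candidate roots, reducing the work from O(n^3) to O(n^2)."""
    n = len(p)
    c = [[0.0] * (n + 1) for _ in range(n + 2)]     # c[i][j]: cost of keys i..j
    w = [[0.0] * (n + 1) for _ in range(n + 2)]     # w[i][j]: sum of p over i..j
    root = [[0] * (n + 1) for _ in range(n + 2)]
    for i in range(1, n + 1):
        w[i][i] = c[i][i] = p[i - 1]
        root[i][i] = i
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            w[i][j] = w[i][j - 1] + p[j - 1]
            best, best_r = float("inf"), i
            # only roots between root[i][j-1] and root[i+1][j] need checking
            for r in range(root[i][j - 1], root[i + 1][j] + 1):
                cand = c[i][r - 1] + c[r + 1][j]
                if cand < best:
                    best, best_r = cand, r
            c[i][j] = w[i][j] + best
            root[i][j] = best_r
    return c[1][n]
```

The paper's point is that this root-range restriction, long used sequentially, can also be exploited inside independently processed subcomputations of a sublinear time parallel construction.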

Bipartite Expander Matching Is In Nc

January 1997

·

52 Reads

A work-efficient deterministic NC algorithm is presented for finding a maximum matching in a bipartite expander graph with any expansion factor β > 1. This improves upon a recently presented deterministic NC maximum matching algorithm which is restricted to those bipartite expanders with large expansion factors (β > Δ^ε, ε > 0), and is not work-efficient [1]. Keywords: Bipartite Matching, Expander Graphs, NC, Network Flow. 1. Introduction Finding maximum cardinality matchings in bipartite expander graphs has many applications such as routing networks, sorting networks, permutation networks, and path selection. Note that by Hall's Theorem there is a perfect matching in a bipartite expander graph with expansion factor β > 1. Thus we are really finding one of the (potentially many) perfect matchings. Bipartite expander graphs are an important part of the design of routing networks such as concentrators and superconcentrators [2]. They are also used in self-routing permuta...

A Lower Bound For Order-Preserving Broadcast In The Postal Model

October 1999

·

39 Reads

In the postal model of message passing systems, the actual communication network between processors is abstracted by a single communication latency factor, which measures the inverse ratio of the time it takes for a processor to send a message and the time that passes until the recipient receives the message. In this paper we examine the problem of broadcasting multiple messages in an order-preserving fashion in the postal model. We prove lower bounds for all parameter ranges and show that these lower bounds are within a factor of seven of the best upper bounds. In some cases, our lower bounds show significant asymptotic improvements over the previous best lower bounds. Keywords: Parallel algorithms, Postal model, Broadcast, Lower bounds. 1. Introduction In many distributed memory parallel computers and high speed communication networks, processors submit messages to and retrieve messages from an underlying communication network. It is the job of the communication network to deliver...

A Lower Bound On The Average Physical Length Of Edges In The Physical Realization Of Graphs

November 1995

·

6 Reads

The stereo-realization of a graph is the assignment of positions in Cartesian space to each of its vertices such that vertex density is bounded. A bound is derived on the average edge length in such a realization. It is similar to an earlier reported result; however, the new bound can be applied to graphs for which the earlier result is not well suited. A more precise realization definition is also presented. The bound is applied to d-dimensional realizations of de Bruijn graphs, yielding an edge length of Ω((1 − 2^(−d)) r^(n/d) / (2n)), where r is the radix (number of distinct symbols) and n is the number of graph dimensions (number of symbol positions). The bound is also applied to shuffle-exchange graphs; for such graphs with small radix the edge-length bound is (2/3)·l_σ + (1/3)·l_ε ≥ (1 − 2^(−d)) r^(n/d) / (2(n(2 − 1/r) − 1)), where r is the radix, n is the number of graph dimensions, l_σ is the average length of shuffle edges, and l_ε is th...

Calculating an Optimal Homomorphic Algorithm for Bracket Matching

December 1999

·

42 Reads

It is widely recognized that a key problem of parallel computation is the development of both efficient and correct parallel software. Although many advanced language features and compilation techniques have been proposed to alleviate the complexity of parallel programming, much effort is still required to develop parallelism in a formal and systematic way. In this paper, we intend to clarify this point by demonstrating a formal derivation of a correct and efficient homomorphic parallel algorithm for a simple language recognition problem known as bracket matching. To the best of our knowledge, our formal derivation leads to a novel divide-and-conquer parallel algorithm for bracket matching.
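In the standard formulation, the homomorphism such derivations arrive at summarizes each substring by its counts of unmatched closing and opening brackets; the combiner is associative, which is what licenses divide-and-conquer evaluation. This sketch is our illustration of that well-known tupling, not the paper's own derivation:

```python
def reduce_brackets(s):
    """Map a bracket string to (unmatched ')' count, unmatched '(' count).
    This pair summarizes everything about s that matters for matching."""
    close, open_ = 0, 0
    for ch in s:
        if ch == '(':
            open_ += 1
        elif ch == ')':
            if open_:
                open_ -= 1      # matches a pending '('
            else:
                close += 1      # unmatched ')'
    return close, open_

def combine(a, b):
    """Associative combiner: pending '(' of the left chunk cancel
    unmatched ')' of the right chunk."""
    ac, ao = a
    bc, bo = b
    matched = min(ao, bc)
    return ac + (bc - matched), (ao - matched) + bo

def balanced(s):
    return reduce_brackets(s) == (0, 0)
```

Because `combine` is associative with identity (0, 0), chunks of the input can be reduced independently and merged in any grouping, e.g. as a parallel tree reduction.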