Journal of Parallel and Distributed Computing

Published by Elsevier BV

Online ISSN: 1096-0848


Print ISSN: 0743-7315


More IMPATIENT: a gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs

May 2013




Nady Obeid





Several recent methods have been proposed to obtain significant speed-ups in MRI image reconstruction by leveraging the computational power of GPUs. Previously, we implemented a GPU-based image reconstruction technique called the Illinois Massively Parallel Acquisition Toolkit for Image reconstruction with ENhanced Throughput in MRI (IMPATIENT MRI) for reconstructing data collected along arbitrary 3D trajectories. In this paper, we improve IMPATIENT by removing computational bottlenecks, using a gridding approach to accelerate the computation of various data structures needed by the previous routine. Further, we enhance the routine with capabilities for off-resonance correction and multi-sensor parallel imaging reconstruction. By incorporating optimized gridding into our iterative reconstruction scheme, the improved GPU implementation achieves speed-ups of more than a factor of 200 over the previous accelerated GPU code.

Fig. 1. Initial arrangement of runs on the PDM. The leading block of each run resides on a different disk. The leading blocks of these runs can be read in one parallel I/O during the R-way merge.
Efficient Out of Core Sorting Algorithms for the Parallel Disks Model

November 2011



In this paper we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However, many of these merge-based algorithms have large underlying constants in the time bounds, because they suffer from a lack of read parallelism on the PDM. The irregular consumption of the runs during the merge hurts read parallelism and increases the sorting time. In this paper we first introduce a novel idea called dirty sequence accumulation that improves read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/Os required to sort the input close to the lower bound of [Formula: see text]. We experimentally verify our dirty sequence idea with the standard R-way merge and show that it can significantly reduce the number of parallel I/Os needed to sort on the PDM.
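The R-way merge at the heart of such algorithms can be sketched sequentially with a min-heap; the function name and the in-memory representation of runs are illustrative, and the parallel disk I/O scheduling that the paper optimizes is not modeled here:

```python
import heapq

def r_way_merge(runs):
    """Merge R sorted runs into one sorted sequence (sequential sketch).

    On the PDM, the leading block of each run would be fetched in a single
    parallel I/O; here each run is simply a sorted Python list.
    """
    # Seed the heap with the first element of every non-empty run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        # Replace the consumed element with the next one from the same run.
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(r_way_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The irregular consumption the abstract mentions corresponds to one run's elements being popped much faster than the others', which serializes block reads on a real PDM.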

Multi-heuristic dynamic task allocation using genetic algorithms in a heterogeneous distributed system

July 2010



We present a multi-heuristic evolutionary task allocation algorithm to dynamically map tasks to processors in a heterogeneous distributed system. It utilizes a genetic algorithm, combined with eight common heuristics, in an effort to minimize the total execution time. It operates on batches of unmapped tasks and can preemptively remap tasks to processors. The algorithm has been implemented on a Java distributed system and evaluated with a set of six problems from the areas of bioinformatics, biomedical engineering, computer science and cryptography. Experiments using up to 150 heterogeneous processors show that the algorithm achieves better efficiency than other state-of-the-art heuristic algorithms.
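A minimal sketch of one classic mapping heuristic of the kind the genetic algorithm is combined with (a Min-Min-style rule: repeatedly commit the task/processor pair with the smallest completion time). The ETC-matrix representation and names are assumptions for illustration, not the paper's code:

```python
def min_min_schedule(etc):
    """Min-Min style heuristic. etc[t][p] = estimated time of task t on processor p.

    Repeatedly picks the (task, processor) pair with the smallest completion
    time, a common seed heuristic for evolutionary task mappers (sketch only).
    """
    ready = [0.0] * len(etc[0])          # current finish time of each processor
    unmapped = set(range(len(etc)))
    mapping = {}
    while unmapped:
        # Among all unmapped tasks, find the globally earliest completion.
        t, p, ct = min(
            ((t, p, ready[p] + etc[t][p])
             for t in unmapped for p in range(len(ready))),
            key=lambda x: x[2])
        mapping[t] = p
        ready[p] = ct
        unmapped.remove(t)
    return mapping, max(ready)

mapping, makespan = min_min_schedule([[3, 5], [4, 1], [2, 6]])
```

A GA such as the one in the paper would use assignments like `mapping` as chromosomes and `makespan` (total execution time) as the fitness to minimize.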

High Performance Multiple Sequence Alignment System for Pyrosequencing Reads from Multiple Reference Genomes

November 2012



Genome resequencing with short reads generated from pyrosequencing generally relies on mapping the short reads against a single reference genome. However, mapping of reads from multiple reference genomes is not possible using a pairwise mapping algorithm. In order to align the reads with respect to each other and the reference genomes, existing multiple sequence alignment (MSA) methods cannot be used, because they do not take into account the position of these short reads with respect to the genome and are highly inefficient for large numbers of sequences. In this paper, we develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align such large numbers of reads from single or multiple reference genomes. The proposed alignment algorithm accurately aligns the erroneous reads, and has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of execution time, quality of the alignments, and the ability of the algorithm to handle reads from multiple haplotypes. We report high-quality multiple alignment of up to 0.5 million reads. The algorithm is shown to be highly scalable and exhibits super-linear speedups with increasing number of processors.

Scalable isosurface visualization of massive datasets on commodity off-the-shelf clusters

January 2009



Tomographic imaging and computer simulations are increasingly yielding massive datasets. Interactive and exploratory visualizations have rapidly become indispensable tools to study large volumetric imaging and simulation data. Our scalable isosurface visualization framework on commodity off-the-shelf clusters is an end-to-end parallel and progressive platform, from initial data access to the final display. Interactive browsing of extracted isosurfaces is made possible by parallel isosurface extraction and rendering, in conjunction with a new specialized piece of image-compositing hardware called the Metabuffer. In this paper, we focus on back-end scalability by introducing a fully parallel and out-of-core isosurface extraction algorithm. It achieves scalability by using both parallel and out-of-core processing and parallel disks. It statically partitions the volume data across parallel disks with a balanced workload spectrum, and builds I/O-optimal external interval trees to minimize the number of I/O operations needed to load large data from disk. We also describe an isosurface compression scheme that is efficient for progressive extraction, transmission, and storage of isosurfaces.
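The query that the external interval trees answer can be illustrated with a linear scan: which cells of the volume have a value range containing the isovalue, since only those cells can contribute to the isosurface. This is a sketch only; the paper's contribution is answering exactly this query with an I/O-optimal external-memory structure rather than a scan:

```python
def stabbing_query(cells, isovalue):
    """Return ids of cells whose [min, max] value range contains the isovalue.

    cells maps a cell id to its (min, max) scalar range. Only these cells can
    produce isosurface triangles, so they are the only ones worth loading from
    disk. (Linear-scan stand-in for an external interval tree.)
    """
    return [cid for cid, (lo, hi) in cells.items() if lo <= isovalue <= hi]
```

An interval tree answers the same stabbing query in output-sensitive time, which is what keeps the number of disk I/Os proportional to the size of the result rather than the size of the volume.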

Distributed Computation of the knn Graph for Large High-Dimensional Point Sets

March 2007



High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors.
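The knn graph itself can be sketched with a brute-force O(n<sup>2</sup>) construction over an arbitrary distance metric; the distributed framework partitions exactly this computation across processors. Names and the point representation are illustrative:

```python
def knn_graph(points, k, dist):
    """Brute-force knn graph: connect each point to its k closest points.

    `dist` is an arbitrary metric, matching the requirement for graphs based
    on arbitrary distance functions. Sequential sketch of the O(n^2) work
    that a cluster would split across processors.
    """
    graph = {}
    for i, p in enumerate(points):
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]))
        graph[i] = neighbors[:k]
    return graph

g = knn_graph([0, 1, 2, 10], 1, lambda a, b: abs(a - b))
```

A natural distribution, of the kind the paper makes efficient with message passing, is to give each processor a slice of `points` and have it compute the neighbor lists for its own slice.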

Nearly Logarithmic-Time Parallel Algorithms for the Class of ±2b ASCEND Computations on a SIMD Hypercube

April 1992



Recently, the author has studied two important classes of algorithms requiring ±2<sup>b</sup> communications: ±2<sup>b</sup>-descend and ±2<sup>b</sup>-ascend. Let N = 2<sup>n</sup> be the number of PEs in a SIMD hypercube which restricts all communications to a single fixed dimension at a time. He has developed an efficient O(n) algorithm for the descend class, and also obtained a simple O(n<sup>2</sup>/log n) algorithm for the ascend class, requiring O(log n) words of local memory per PE. In the present paper he presents two new algorithms for the ascend class on a SIMD hypercube. The first algorithm runs in O(n<sup>1.5</sup>) time and requires O(1) space per PE. The second algorithm, which is discussed only briefly, runs in O(n√n/log n) time and requires O(log n) space per PE.

2D and 3D Optimal Parallel Image Warping

May 1993



Spatial image warping is useful for image processing and graphics. The authors present optimal concurrent-read-exclusive-write (CREW) and exclusive-read-exclusive-write (EREW) parallel-random-access-machine (PRAM) algorithms that achieve O(1) asymptotic run time. The significant result is the creative processor assignment that yields an EREW PRAM forward direct warp algorithm. The forward algorithm computes any non-scaling affine transform. The EREW algorithm is the most efficient in practice: a 16K-processor MasPar MP-1 can rotate a 4-million-element image in under a second and a 2-million-element volume in half a second. This high performance allows interactive viewing of volumes from arbitrary viewpoints and illustrates linear speedup.
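A forward direct warp under a non-scaling affine map can be sketched as each source pixel writing its value to the transformed location. This is a sequential stand-in for the per-pixel processor assignment; the sparse-dictionary image representation and rounding rule are assumptions for illustration:

```python
def forward_warp(image, a, b, c, d):
    """Forward warp under the affine map (x, y) -> (a*x + b*y, c*x + d*y).

    image maps (x, y) pixel coordinates to values. Each source pixel plays
    the role of one PRAM processor: it computes its destination and writes
    there, rounding to the nearest grid cell.
    """
    out = {}
    for (x, y), v in image.items():
        u, w = round(a * x + b * y), round(c * x + d * y)
        out[(u, w)] = v
    return out

# 90-degree rotation: (x, y) -> (-y, x)
rotated = forward_warp({(1, 0): 5, (0, 1): 7}, 0, -1, 1, 0)
```

The EREW difficulty the paper addresses is making these concurrent writes collision-free; with a bijective non-scaling affine map, distinct source pixels target distinct destinations.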

Design and Analysis of a Hybrid Access Control to an Optical Star Using WDM

May 1991



A passive optical star is an ideal shared medium, from both fault tolerance and access synchronization points of view. The communication over an optical star merges to a single point in space and then broadcasts back to all the nodes. This circular symmetry facilitates the solution of two basic distributed synchronization problems, which are presented in this work: (i) the generation of a global event clock for synchronizing the nodes' operation, and (ii) distributed scheduling for accessing the shared passive medium, which is a hybrid (deterministic and random) technique. We present, prove, and analyze this hybrid scheduling algorithm, which is equivalent to a distributed queue and, therefore, is also algorithmically fair. Furthermore, our solution has two additional properties: destination overflow prevention and destination fairness. The effective solution of these problems can be used for efficiently implementing a local area network based on a passive optical star.

Fast Parallel Algorithms for Solving Triangular Systems of Linear Equations on the Hypercube

January 1991



Presents efficient hypercube algorithms for solving triangular systems of linear equations by using various matrix partitioning and mapping schemes. Recently, several parallel algorithms have been developed for this problem. In these algorithms, the triangular solver is treated as the second stage of Gauss elimination. Thus, the triangular matrix is distributed by columns (or rows) in a wrap fashion, since it is likely that the matrix is distributed this way after an LU decomposition has been done on the matrix. However, the efficiency of these algorithms is low. The motivation here is to develop various data partitioning and mapping schemes for hypercube algorithms by treating the triangular solver as an independent problem. Performance of the algorithms is analyzed theoretically and empirically by implementing them on a commercially available hypercube.
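The sequential kernel these hypercube algorithms parallelize is plain forward substitution, shown here for a lower-triangular system; the paper's contribution is how the matrix is partitioned and mapped to processors, which this sketch does not model:

```python
def forward_substitution(L, b):
    """Solve L x = b for a lower-triangular matrix L by forward substitution.

    x[i] depends on all earlier x[j], which is the serial dependence chain
    the hypercube partitioning/mapping schemes work around.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # Subtract contributions of already-solved unknowns, then divide.
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x
```

Column-wrap distributions, as the abstract notes, assign column j to processor j mod p, so each subtraction step can be done where the column data already lives.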

An Efficient Processor Allocation Algorithm Using Two-Dimensional Packing

April 1997



The mesh is one of the most widely used interconnection networks for multiprocessor systems. We propose an approach to partition a given mesh into m submeshes which can be allocated to m tasks with grid structures. We adapt two-dimensional packing to solve the submesh allocation problem. Due to the intractability of the two-dimensional packing problem, finding an optimal solution is computationally infeasible. We develop an efficient heuristic packing algorithm called TP-heuristic. Allocating a submesh to each task is achieved using the results of packing. We propose two different methods called uniform scaling and non-uniform scaling. Experiments were carried out to test the accuracy of solutions provided by our allocation algorithm.

Embedding K-ary Complete Trees into Hypercubes

May 1994



In this paper, dilated embedding and precise embedding of K-ary complete trees into hypercubes are studied. For dilated embedding, a nearly optimal algorithm is proposed which embeds a K-ary complete tree of height h, T<sub>K</sub>(h), into an ((h − 1)⌈log K⌉ + ⌈log(K + 2)⌉)-dimensional hypercube with dilation max{2, φ(K), φ(K + 2)}, where φ(x) = min{λ : Σ<sub>i=0</sub><sup>λ</sup> C(d, i) ≥ x, with d = ⌈log x⌉}. (It is clear that ⌈(⌈log x⌉ + 1)/2⌉ ≤ φ(x) ≤ ⌈log x⌉ for x ≥ 3.) For precise embedding, we show that a ((K − 1)h + 1)-dimensional hypercube is large enough to contain T<sub>K</sub>(h) as its subgraph, for K ≥ 3.

Embedding All Binary Trees in the Hypercube

January 1992



An O(N<sup>2</sup>) heuristic algorithm is presented that embeds all binary trees, with dilation 2 and small average dilation, into the optimal-sized hypercube. The heuristic relies on a conjecture about all binary trees with a perfect matching. It provides a practical and robust technique for mapping binary trees into the hypercube and ensures that the communication load is evenly distributed across the network under any shortest-path routing strategy. One contribution of this work is the identification of a rich collection of binary trees that can be easily mapped into the hypercube.

Parallelizing a Level 3 BLAS Library for LAN-Connected Workstations

May 1995



Parallel BLAS libraries have recently shown promise as a means of taking advantage of parallel computing in solving scientific problems. However, little work has been done on providing such a parallel library on LAN-connected workstations. Our motivation for this research lies in the strong belief that since LAN-connected workstations are highly cost-effective, it is important to study the issues in such environments. Dynamic load balancing is a method that allows parallel programs to run efficiently on LAN-connected workstations. Introducing dynamic load balancing leads to new design concerns for data distribution, sequential implementation, and library interfaces. Through a series of experiments, we investigate the influence of these factors. We also propose a set of guidelines for developing an efficient parallel Level 3 BLAS library.

BLITZEN: a highly integrated massively parallel machine

November 1988



The design of BLITZEN, a highly integrated chip with 128 processing elements (PEs), is presented. The bit-serial processing element is described, and some comparisons with the massively parallel processor (MPP) and the Connection Machine are provided. Local control features and methods for memory access are emphasized. The organization of PEs on the custom chip, with emphasis on interconnection and I/O schemes, is described. Details of the custom chip design and instruction pipeline are provided. An overview of system architecture concepts and software for BLITZEN is also given. Each PE has 1 Kbit of static RAM and bit-serial functional elements for arithmetic, logic, and shifting. Unique local control features include modification of the global memory address by data local to each PE and complementary operations based on a condition register. Fixed-point operations on 32-bit data can exceed a rate of one billion operations per second. Since the processors are bit-serial devices, performance rates improve with shorter word lengths. The bus-oriented I/O scheme can transfer data at 10240 MB/s.

Fully-scalable fault-tolerant simulations for BSP and CGM

May 1999



In this paper we consider general simulations of algorithms designed for fully operational BSP and CGM machines on machines with faulty processors. The faults are deterministic (i.e., worst-case distributions of faults are considered) and static (i.e., they do not change in the course of computation). We assume that a constant fraction of processors are faulty. We present a deterministic simulation (resp. a randomized simulation) that achieves constant slowdown per local computation and O((log<sub>h</sub> p)<sup>2</sup>) (resp. O(log<sub>h</sub> p)) slowdown per communication round, provided that a deterministic preprocessing is done that requires O((log<sub>h</sub> p)<sup>2</sup>) communication rounds and linear (in h) computation per processor in each communication round. Our results are fully scalable over all values of p from Θ(1) to Θ(n). Furthermore, our results imply that for p ≤ n<sup>ε</sup> (ε < 1), algorithms can be made resilient to a constant fraction of processor faults without any asymptotic slowdown.

On the Embedding of a Class of Regular Graphs in a Faulty Hypercube

January 1995



A wide range of graphs with regular structures are shown to be embeddable in an injured hypercube with faulty links. These include rings, linear paths, binomial trees, binary trees, meshes, tori, and many others. Unlike many existing algorithms, which are capable of embedding only one type of graph, our algorithm embeds the above graphs in a unified way, all centered around a notion called the edge matrix. In many cases, the degree of fault tolerance offered by the algorithm is optimal or near-optimal.

Packet scheduling with joint design of MIMO and network coding

November 2009



In this paper we propose a joint design of MIMO techniques and network coding (MIMO-NC) and apply it to improve the performance of wireless networks. We consider a system in which the packet exchange among multiple wireless users is forwarded by a relay node. In order to enjoy the benefit of MIMO-NC, all the nodes in the network are equipped with two antennas and the relay node possesses coding capability. For the cross traffic flows among any four users, the relay node not only can receive packets simultaneously from two compatible users in the uplink (users to relay node), but also can mix distinct packets destined for four users into two coded packets and concurrently send them out in the same downlink (relay node to users), so that the information content of each transmission is significantly increased. We formalize the problem of finding a schedule that forwards the buffered data of all the users in the minimum number of transmissions as a problem of finding a maximum matching in a graph. We also provide an analytical model of maximum throughput and optimal energy efficiency, which explicitly measures the performance gain of the MIMO-NC enhancement. Our analytical and simulation results demonstrate that system performance can be greatly improved by the efficient utilization of MIMO and network coding opportunities.

An Optimal Parallel Matching Algorithm for Cographs

January 1992



The class of cographs, or complement-reducible graphs, arises naturally in many different areas of applied mathematics and computer science. The authors show that the problem of finding a maximum matching in a cograph can be solved optimally in parallel by reducing it to parenthesis matching. With an n-vertex cograph G represented by its parse tree as input, the algorithm finds a maximum matching in G in O(log n) time using O(n/log n) processors in the EREW PRAM model.
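The parenthesis-matching primitive the reduction targets can be sketched sequentially with a stack; the paper computes it in O(log n) EREW PRAM time, so this sketch only shows what is being computed, not how it is parallelized:

```python
def match_parentheses(s):
    """Pair each '(' with its matching ')' in a balanced string.

    Returns a dict mapping the index of each '(' to the index of its
    partner ')'. Sequential stack-based stand-in for the parallel
    parenthesis-matching primitive.
    """
    stack, pairs = [], {}
    for i, ch in enumerate(s):
        if ch == '(':
            stack.append(i)      # remember the open position
        elif ch == ')':
            pairs[stack.pop()] = i  # close the most recent open
    return pairs
```

In the reduction, matched parenthesis pairs derived from the cograph's parse tree correspond to matched vertex pairs in the maximum matching.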

On collaborative content distribution using multi-message gossip

May 2006



We study epidemic schemes in the context of collaborative data delivery. In this context, multiple chunks of data reside at different nodes, and the challenge is to simultaneously deliver all chunks to all nodes. Here we explore the interoperation between the gossip of multiple, simultaneous message chunks. In this setting, interacting nodes must select which chunk, among many, to exchange in every communication round. We provide an efficient solution that possesses the inherent robustness and scalability of gossip. Our approach maintains the simplicity of gossip, and has low message, connection, and computation overhead. Because our approach differs from solutions proposed by network coding, we are able to provide insight into the tradeoffs and analysis of the problem of collaborative content distribution. We formally analyze the performance of the algorithm, demonstrating its efficiency with high probability.
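One plausible per-round chunk-selection rule, pushing the globally rarest chunk that the chosen peer is missing, can be sketched as follows. This illustrates the selection problem the paper studies, not its exact rule; all names are assumptions:

```python
import random

def gossip_round(nodes, rng):
    """One synchronous gossip round: each node pushes one chunk to a random peer.

    nodes maps a node id to the set of chunks it holds. Among the chunks the
    peer is missing, the sender pushes the rarest one network-wide
    (illustrative rule, not the paper's algorithm).
    """
    counts = {}
    for chunks in nodes.values():
        for c in chunks:
            counts[c] = counts.get(c, 0) + 1
    for u in list(nodes):
        v = rng.choice([w for w in nodes if w != u])
        missing = nodes[u] - nodes[v]
        if missing:
            nodes[v].add(min(missing, key=lambda c: counts[c]))

nodes = {'a': {1, 2}, 'b': {2}, 'c': set()}
gossip_round(nodes, random.Random(0))
```

Whatever the selection rule, a round only spreads existing chunks, so the set of chunk types in the network is invariant and every node's holdings grow monotonically.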

Units of Computation in Fault-Tolerant Distributed Systems

July 1994



We develop a framework that helps in understanding fault-tolerant distributed systems and so helps in designing them. We define a unit of computation in such systems, referred to as a molecule, that has a well-defined interface with other molecules, i.e., has minimal dependence on other molecules. The smallest such unit, an indivisible molecule, is termed an atom. We show that any execution of a fault-tolerant distributed computation can be seen as an execution of molecules/atoms in a partial order, and such a view provides insight into understanding the computation. This is particularly valuable for a fault-tolerant system, where it is important to guarantee that a unit of computation is either completely executed or not at all, and system designers need to reason about the states after execution of such units. We prove different properties satisfied by molecules and atoms, and present algorithms to detect atoms in an ongoing computation and to force the completion of a molecule. We illustrate the uses of the developed work in application areas such as debugging, checkpointing, and reasoning about stable properties.

Employing nested OpenMP for the parallelization of multi-zone computational fluid dynamics applications

May 2004



Summary form only given. We describe the parallelization of the multi-zone versions of the NAS Parallel Benchmarks employing multilevel OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues which affect the performance of multilevel parallel applications.

Scalable Parallel Matrix Multiplication on Distributed Memory Parallel Computers

February 2000



Consider any known sequential algorithm for matrix multiplication over an arbitrary ring with time complexity O(N<sup>α</sup>), where 2 < α ≤ 3. We show that such an algorithm can be parallelized on a distributed memory parallel computer (DMPC) in O(log N) time by using N<sup>α</sup>/log N processors. Such a parallel computation is cost optimal and matches the performance of PRAM. Furthermore, our parallelization on a DMPC can be made fully scalable, that is, for all 1 ≤ p ≤ N<sup>α</sup>/log N, multiplying two N×N matrices can be performed by a DMPC with p processors in O(N<sup>α</sup>/p) time, i.e., linear speedup and cost optimality can be achieved in the range [1..N<sup>α</sup>/log N]. This unifies all known algorithms for matrix multiplication on DMPC, standard or non-standard, sequential or parallel. Extensions of our methods and results to other parallel systems are also presented. The above claims result in significant progress in scalable parallel matrix multiplication (as well as solving many other important problems) on distributed memory systems, both theoretically and practically.

A Parallel Algorithm for Computing Polygon Set Operations

May 1994



We present a parallel algorithm for performing Boolean set operations on generalized polygons that have holes in them. The intersection algorithm has a processor complexity of O(m<sup>2</sup>n<sup>2</sup>) processors and a time complexity of O(max(2 log m, log<sup>2</sup> n)), where m is the maximum number of vertices in any loop of a polygon, and n is the maximum number of loops per polygon. The union and difference algorithms have a processor complexity of O(m<sup>2</sup>n<sup>2</sup>) and time complexities of O(log m) and O(max(2 log m, log n)), respectively. The algorithm is based on the EREW PRAM model. The algorithm minimizes intersection-point computations by intersecting only a subset of the loops of the polygons, based on their topological relationships.

Eventually consistent failure detectors

February 2002



The concept of an unreliable failure detector was introduced by T.D. Chandra and S. Toueg (1996) as a mechanism that provides information about process failures. This mechanism has been used to solve different problems in asynchronous systems, in particular the Consensus problem. In this paper, we present a new class of unreliable failure detectors, which we call Eventually Consistent and denote by ◊C. This class adds to the failure detection capabilities of other classes an eventual leader election capability, which allows all correct processes to eventually choose the same correct process as leader. We study the relationship between ◊C and other classes of failure detectors. We also propose an efficient algorithm to transform ◊C into ◊P in models of partial synchrony. Finally, to show the power of this new class of failure detectors, we present a Consensus algorithm based on ◊C. This algorithm successfully exploits the leader election capability of the failure detector and performs better in number of rounds than all previously proposed algorithms for failure detectors with eventual accuracy.
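The eventual leader election capability can be illustrated with an Omega-style rule: once all correct processes receive the same suspicion output from the failure detector, each picks the smallest non-suspected process id, so they all converge on the same leader. This is a sketch of the capability, not the paper's transformation algorithm:

```python
def leader(processes, suspected):
    """Pick a leader from the failure detector's current output.

    processes is the known process ids; suspected is the set the local
    failure detector currently suspects. Choosing the smallest non-suspected
    id means that once suspicion lists stabilize and agree, every correct
    process elects the same correct leader.
    """
    alive = [p for p in processes if p not in suspected]
    return min(alive) if alive else None
```

Before the detector's output stabilizes, different processes may disagree on the leader; the "eventually" in the class name is exactly the guarantee that such disagreement is temporary.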
