May 2013


Published by Elsevier BV

Online ISSN: 1096-0848 · Print ISSN: 0743-7315


Several recent methods have been proposed to obtain significant speed-ups in MRI image reconstruction by leveraging the computational power of GPUs. Previously, we implemented a GPU-based image reconstruction technique called the Illinois Massively Parallel Acquisition Toolkit for Image reconstruction with ENhanced Throughput in MRI (IMPATIENT MRI) for reconstructing data collected along arbitrary 3D trajectories. In this paper, we remove computational bottlenecks in IMPATIENT by using a gridding approach to accelerate the computation of various data structures needed by the previous routine. Further, we enhance the routine with capabilities for off-resonance correction and multi-sensor parallel imaging reconstruction. Through the implementation of optimized gridding in our iterative reconstruction scheme, the improved GPU implementation achieves speed-ups of more than a factor of 200 over the previous accelerated GPU code.
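The gridding step the abstract refers to can be illustrated in one dimension: non-Cartesian k-space samples are convolved onto a uniform grid so that a standard inverse FFT can then be applied. The kernel and parameters below are illustrative assumptions, not the IMPATIENT implementation (which typically uses a Kaiser-Bessel kernel and runs on the GPU).

```python
import math

def grid_1d(kx, data, n, width=2):
    # Convolve non-Cartesian k-space samples onto a uniform n-point grid
    # with a triangular kernel (a hypothetical stand-in for the
    # Kaiser-Bessel kernel usually used in MRI gridding).
    grid = [0j] * n
    for k, d in zip(kx, data):
        center = k * n                      # map k in [0, 1) to grid units
        lo = math.floor(center - width / 2)
        for i in range(lo, lo + width + 1):
            w = max(0.0, 1.0 - abs(i - center) / (width / 2))
            grid[i % n] += w * d            # accumulate weighted sample
    return grid
```

After gridding, an inverse FFT and a deapodization (kernel-compensation) step would complete one iteration of such a reconstruction.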

November 2011


In this paper we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However, many of these merge-based algorithms have large underlying constants in their time bounds because they suffer from a lack of read parallelism on the PDM: the irregular consumption of the runs during the merge limits read parallelism and increases the sorting time. We first introduce a novel idea called dirty sequence accumulation that improves read parallelism. Second, we show analytically that this idea can reduce the number of parallel I/Os required to sort the input to close to the lower bound of [Formula: see text]. We experimentally verify our dirty sequence idea with the standard R-way merge and show that it can significantly reduce the number of parallel I/Os needed to sort on the PDM.
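As a baseline for the merge the paper builds on, a standard R-way merge of sorted runs can be sketched with a heap. This sequential sketch omits the PDM block I/O and the dirty-sequence accumulation that the paper adds; it only shows the merge whose irregular run consumption causes the read-parallelism problem.

```python
import heapq

def r_way_merge(runs):
    # Merge R sorted runs into one sorted sequence using a min-heap of
    # (value, run index, position) triples.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```

Note how the next block needed always comes from whichever run holds the smallest pending key; on the PDM this makes prefetching across D disks hard, which is the irregularity the dirty-sequence idea targets.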

July 2010


We present a multi-heuristic evolutionary task allocation algorithm to dynamically map tasks to processors in a heterogeneous distributed system. It utilizes a genetic algorithm, combined with eight common heuristics, in an effort to minimize the total execution time. It operates on batches of unmapped tasks and can preemptively remap tasks to processors. The algorithm has been implemented on a Java distributed system and evaluated with a set of six problems from the areas of bioinformatics, biomedical engineering, computer science and cryptography. Experiments using up to 150 heterogeneous processors show that the algorithm achieves better efficiency than other state-of-the-art heuristic algorithms.
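A minimal version of the evolutionary mapping idea can be sketched as follows. This is a plain genetic algorithm without the paper's eight seeding heuristics or preemptive remapping; the population size, rates, and operators are illustrative assumptions.

```python
import random

def makespan(mapping, times):
    # times[t][p]: execution time of task t on processor p.
    loads = [0.0] * len(times[0])
    for t, p in enumerate(mapping):
        loads[p] += times[t][p]
    return max(loads)

def ga_map(times, pop=20, gens=50, seed=0):
    # Toy GA mapper: elitism, tournament-style parent pool,
    # one-point crossover, point mutation.
    rng = random.Random(seed)
    n_tasks, n_procs = len(times), len(times[0])
    popn = [[rng.randrange(n_procs) for _ in range(n_tasks)]
            for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=lambda m: makespan(m, times))
        nxt = popn[:2]                      # keep the two best mappings
        while len(nxt) < pop:
            a, b = rng.sample(popn[:10], 2)
            cut = rng.randrange(1, n_tasks)
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:          # mutate: move one task
                child[rng.randrange(n_tasks)] = rng.randrange(n_procs)
            nxt.append(child)
        popn = nxt
    return min(popn, key=lambda m: makespan(m, times))
```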

November 2012


Genome resequencing with short reads generated from pyrosequencing generally relies on mapping the short reads against a single reference genome. However, mapping reads from multiple reference genomes is not possible using a pairwise mapping algorithm. Existing multiple sequence alignment (MSA) methods cannot be used to align the reads with respect to each other and the reference genomes, because they do not take into account the positions of these short reads with respect to the genome and are highly inefficient for large numbers of sequences. In this paper, we develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align such large numbers of reads from single or multiple reference genomes. The proposed alignment algorithm accurately aligns erroneous reads, and has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of execution time, quality of the alignments, and the ability of the algorithm to handle reads from multiple haplotypes. We report high-quality multiple alignment of up to 0.5 million reads. The algorithm is shown to be highly scalable and exhibits super-linear speedups with increasing numbers of processors.

January 2009


Tomographic imaging and computer simulations are increasingly yielding massive datasets. Interactive and exploratory visualizations have rapidly become indispensable tools to study large volumetric imaging and simulation data. Our scalable isosurface visualization framework on commodity off-the-shelf clusters is an end-to-end parallel and progressive platform, from initial data access to the final display. Interactive browsing of extracted isosurfaces is made possible by parallel isosurface extraction and rendering, in conjunction with a new specialized piece of image-compositing hardware called the Metabuffer. In this paper, we focus on back-end scalability by introducing a fully parallel and out-of-core isosurface extraction algorithm. It achieves scalability by using both parallel and out-of-core processing and parallel disks. It statically partitions the volume data to parallel disks with a balanced workload spectrum, and builds I/O-optimal external interval trees to minimize the number of I/O operations needed to load large data from disk. We also describe an isosurface compression scheme that is efficient for progressive extraction, transmission, and storage of isosurfaces.
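The role of the interval trees can be seen from the query they answer: given an isovalue v, report the "active" cells whose scalar range spans v. The simple in-core stand-in below (a sorted linear scan, not the I/O-optimal external structure from the paper) answers the same query and is only meant to make the query concrete.

```python
import bisect

class IntervalStab:
    # In-core stand-in for an external interval tree over the cells'
    # scalar ranges. Answers: which cells does isovalue v pass through?
    def __init__(self, cells):
        # cells: list of (min_val, max_val, cell_id)
        self.by_min = sorted(cells)               # ascending by min_val
        self.mins = [c[0] for c in self.by_min]

    def active(self, v):
        # Candidates are cells with min_val <= v; keep those whose
        # max_val also reaches v. Worst case O(n), unlike a real
        # interval tree's output-sensitive bound.
        hi = bisect.bisect_right(self.mins, v)
        return [cid for mn, mx, cid in self.by_min[:hi] if mx >= v]
```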

March 2007


High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors.
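The graph being distributed has a simple brute-force definition: connect each point to its k closest points. In the paper's setting each processor would own a block of points and compute its rows of the graph locally, exchanging points by message passing; the sequential sketch below only fixes the definition.

```python
import heapq
import math

def knn_graph(points, k, dist=math.dist):
    # Brute-force knn graph: for each point, the k nearest other points
    # under the given distance metric (Euclidean by default).
    graph = {}
    for i, p in enumerate(points):
        graph[i] = heapq.nsmallest(
            k,
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
    return graph
```

In a distributed run, each of the O(n^2) distance evaluations still happens, but the rows of `graph` are partitioned across processors, which is why near-linear speedup is plausible.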

April 1992


Recently, the author has studied two important classes of algorithms requiring ±2^b communications: ±2^b-descend and ±2^b-ascend. Let N = 2^n be the number of PEs in a SIMD hypercube which restricts all communications to a single fixed dimension at a time. He has developed an efficient O(n) algorithm for the descend class, and also obtained a simple O(n^2/log n) algorithm for the ascend class, requiring O(log n) words of local memory per PE. In the present paper he presents two new algorithms for the ascend class on a SIMD hypercube. The first algorithm runs in O(n^1.5) time and requires O(1) space per PE. The second algorithm, which is discussed only briefly, runs in O(n√n/log n) time and requires O(log n) space per PE.
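An ascend-class computation visits the hypercube dimensions in increasing order, each PE combining with its neighbor across the current dimension. The sequential simulation below shows only the communication template (the paper's contribution, doing this in O(1) space per PE on a restricted SIMD machine, is not modeled here).

```python
def ascend(data, op):
    # Ascend template on N = 2**n simulated PEs: for dimensions
    # b = 0, 1, ..., n-1, each PE pair differing in bit b applies op.
    n_pes = len(data)
    b = 0
    while (1 << b) < n_pes:
        for i in range(n_pes):
            if not i & (1 << b):            # i is the lower PE of the pair
                j = i | (1 << b)            # neighbor across dimension b
                data[i], data[j] = op(data[i], data[j])
        b += 1
    return data
```

With op returning the pair's sum on both sides, the template computes an all-reduce: after all n dimensions every PE holds the total.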

May 1993


Spatial image warping is useful for image processing and graphics. The authors present optimal concurrent-read-exclusive-write (CREW) and exclusive-read-exclusive-write (EREW) parallel-random-access-machine (PRAM) algorithms that achieve O(1) asymptotic run time. The significant result is the creative processor assignment that yields an EREW PRAM forward direct warp algorithm. The forward algorithm calculates any nonscaling affine transform. The EREW algorithm is the most efficient in practice: a 16K-processor MasPar MP-1 can rotate a 4-million-element image in under a second and a 2-million-element volume in half a second. This high performance allows interactive viewing of volumes from arbitrary viewpoints and illustrates linear speedup.

May 1991


A passive optical star is an ideal shared medium, from both fault tolerance and access synchronization points of view. The communication over an optical star merges to a single point in space and then broadcasts back to all the nodes. This circular symmetry facilitates the solution of two basic distributed synchronization problems, which are presented in this work: (i) the generation of a global event clock for synchronizing the nodes' operation, and (ii) distributed scheduling for accessing the shared passive medium, which is a hybrid (deterministic and random) technique. We present, prove, and analyze this hybrid scheduling algorithm, which is equivalent to a distributed queue and, therefore, is also algorithmically fair. Furthermore, our solution has two additional properties: destination overflow prevention and destination fairness. The effective solution of these problems can be used for efficiently implementing a local area network based on a passive optical star.

January 1991


Presents efficient hypercube algorithms for solving triangular systems of linear equations using various matrix partitioning and mapping schemes. Recently, several parallel algorithms have been developed for this problem. In these algorithms, the triangular solver is treated as the second stage of Gaussian elimination; thus, the triangular matrix is distributed by columns (or rows) in a wrap fashion, since the matrix is likely to be distributed this way after an LU decomposition has been done on it. However, the efficiency of these algorithms is low. The motivation here is to develop various data partitioning and mapping schemes for hypercube algorithms by treating the triangular solver as an independent problem. Performance of the algorithms is analyzed theoretically and empirically by implementing them on a commercially available hypercube.
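The column-wrap distribution the abstract mentions can be illustrated with forward substitution on a lower triangular system. The sketch below records the processor assignment but executes sequentially; the data layout, not the arithmetic, is the point.

```python
def forward_solve_wrapped(L, b, n_procs=4):
    # Solve L x = b (L lower triangular) with columns of L dealt to
    # processors in wrap (cyclic) fashion, as they would be after a
    # column-wrapped LU decomposition. owner[j] is the PE that holds
    # column j and therefore computes x[j]; here the PEs are simulated.
    n = len(b)
    owner = [j % n_procs for j in range(n)]
    x = [0.0] * n
    y = list(b)
    for j in range(n):
        x[j] = y[j] / L[j][j]        # PE owner[j] computes x[j], then
        for i in range(j + 1, n):    # broadcasts it; every PE updates
            y[i] -= L[i][j] * x[j]   # the rows of columns it owns
    return x
```

The wrap mapping keeps loads balanced as the active part of the matrix shrinks, which a blocked (contiguous) column mapping would not.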

April 1997


The mesh is one of the most widely used interconnection networks for multiprocessor systems. We propose an approach to partition a given mesh into m submeshes which can be allocated to m tasks with grid structures. We adapt two-dimensional packing to solve the submesh allocation problem. Due to the intractability of the two-dimensional packing problem, finding an optimal solution is computationally infeasible. We develop an efficient heuristic packing algorithm called TP-heuristic. Allocating a submesh to each task is achieved using the results of packing. We propose two different methods, called uniform scaling and non-uniform scaling. Experiments were carried out to test the accuracy of solutions provided by our allocation algorithm.
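The flavor of packing-based allocation can be shown with a first-fit shelf heuristic, a much simpler scheme than TP-heuristic (whose details the abstract does not give): task rectangles are placed left to right on shelves inside the mesh, and the resulting positions become submesh origins.

```python
def shelf_pack(rects, bin_width):
    # First-fit shelf packing: place each (w, h) rectangle on the first
    # shelf with enough remaining width and height, opening a new shelf
    # when none fits. Returns the (x, y) origin assigned to each rect.
    shelves = []                   # each shelf: [y, height, used_width]
    pos, top = [], 0
    for w, h in rects:
        for s in shelves:
            if s[2] + w <= bin_width and h <= s[1]:
                pos.append((s[2], s[0]))
                s[2] += w
                break
        else:
            shelves.append([top, h, w])
            pos.append((0, top))
            top += h
    return pos
```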

May 1994


In this paper, dilated embedding and precise embedding of K-ary complete trees into hypercubes are studied. For dilated embedding, a nearly optimal algorithm is proposed which embeds a K-ary complete tree of height h, T_K(h), into a ((h-1)[log K] + [log(K+2)])-dimensional hypercube with dilation max{2, φ(K), φ(K+2)}, where φ(x) = min{λ : Σ_{i=0}^{λ} C(d, i) ≥ x and d = [log x]}. (It is clear that [([log x] + 1)/2] ≤ φ(x) ≤ [log x], for x ≥ 3.) For precise embedding, we show that a ((K-1)h + 1)-dimensional hypercube is large enough to contain T_K(h) as a subgraph, for K ≥ 3. (C) 1995 Academic Press, Inc.

January 1992


An O(N^2) heuristic algorithm is presented that embeds all binary trees, with dilation 2 and small average dilation, into the optimal-sized hypercube. The heuristic relies on a conjecture about all binary trees with a perfect matching. It provides a practical and robust technique for mapping binary trees into the hypercube and ensures that the communication load is evenly distributed across the network, assuming any shortest-path routing strategy. One contribution of this work is the identification of a rich collection of binary trees that can be easily mapped into the hypercube.

May 1995


Parallel BLAS libraries have recently shown promise as a means of taking advantage of parallel computing in solving scientific problems. However, little work has been done on providing such a parallel library on LAN-connected workstations. Our motivation for this research lies in the strong belief that, since LAN-connected workstations are highly cost-effective, it is important to study the issues in such environments. Dynamic load balancing is a method that allows parallel programs to run efficiently on LAN-connected workstations. Introducing dynamic load balancing leads to new design concerns for data distribution, sequential implementation, and library interfaces. Through a series of experiments, we investigate the influence of these factors. We also propose a set of guidelines for developing an efficient parallel Level 3 BLAS library.

November 1988


The design of BLITZEN, a highly integrated chip with 128 processing elements (PEs), is presented. The bit-serial processing element is described, and some comparisons with the Massively Parallel Processor (MPP) and the Connection Machine are provided. Local control features and methods for memory access are emphasized. The organization of PEs on the custom chip is described, with emphasis on interconnection and I/O schemes, along with details of the custom chip design and instruction pipeline. An overview of system architecture concepts and software for BLITZEN is also given. Each PE has 1 Kbit of static RAM and bit-serial functional elements for arithmetic, logic, and shifting. Unique local control features include modification of the global memory address by data local to each PE and complementary operations based on a condition register. Fixed-point operations on 32-bit data can exceed a rate of one billion operations per second; since the processors are bit-serial devices, performance rates improve with shorter word lengths. The bus-oriented I/O scheme can transfer data at 10240 MB/s.
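Bit-serial arithmetic of the kind these PEs perform can be sketched as a one-bit-per-cycle full adder with a single-bit carry register. This is only an illustration of the principle, not the BLITZEN logic; it also shows why shorter word lengths run proportionally faster: the cycle count equals the word length.

```python
def bit_serial_add(a_bits, b_bits):
    # Add two equal-length operands one bit per "cycle", least
    # significant bit first, keeping the carry in a 1-bit register.
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s = a ^ b ^ carry                     # sum bit from full adder
        carry = (a & b) | (carry & (a ^ b))   # carry into next cycle
        out.append(s)
    return out, carry
```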

May 1999


In this paper we consider general simulations of algorithms designed for fully operational BSP and CGM machines on machines with faulty processors. The faults are deterministic (i.e., worst-case distributions of faults are considered) and static (i.e., they do not change in the course of the computation). We assume that a constant fraction of the processors are faulty. We present a deterministic simulation (resp. a randomized simulation) that achieves constant slowdown per local computation and O((log_h p)^2) (resp. O(log_h p)) slowdown per communication round, provided that a deterministic preprocessing is done that requires O((log_h p)^2) communication rounds and linear (in h) computation per processor in each communication round. Our results are fully scalable over all values of p from Θ(1) to Θ(n). Furthermore, our results imply that for p ≤ n^ε (ε < 1), algorithms can be made resilient to a constant fraction of processor faults without any asymptotic slowdown.

January 1995


A wide range of graphs with regular structure are shown to be embeddable in an injured hypercube with faulty links. These include rings, linear paths, binomial trees, binary trees, meshes, tori, and many others. Unlike many existing algorithms, which are capable of embedding only one type of graph, our algorithm embeds the above graphs in a unified way, all centered around a notion called the edge matrix. In many cases, the degree of fault tolerance offered by the algorithm is optimal or near-optimal.

November 2009


In this paper we propose a joint design of MIMO techniques and network coding (MIMO-NC) and apply it to improve the performance of wireless networks. We consider a system in which the packet exchange among multiple wireless users is forwarded by a relay node. In order to enjoy the benefit of MIMO-NC, every node in the network is equipped with two antennas and the relay node possesses coding capability. For the cross traffic flows among any four users, the relay node not only can receive packets simultaneously from two compatible users on the uplink (users to relay node), but can also mix distinct packets destined for four users into two coded packets and send them out concurrently on the same downlink (relay node to users), so that the information content of each transmission is significantly increased. We formalize the problem of finding a schedule that forwards the buffered data of all users in the minimum number of transmissions as the problem of finding a maximum matching in a graph. We also provide an analytical model of maximum throughput and optimal energy efficiency, which explicitly measures the performance gain of the MIMO-NC enhancement. Our analytical and simulation results demonstrate that system performance can be greatly improved by the efficient utilization of MIMO and network coding opportunities.

January 1992


The class of cographs, or complement-reducible graphs, arises naturally in many different areas of applied mathematics and computer science. The authors show that the problem of finding a maximum matching in a cograph can be solved optimally in parallel by reducing it to parenthesis matching. With an n-vertex cograph G represented by its parse tree as input, the algorithm finds a maximum matching in G in O(log n) time using O(n/log n) processors in the EREW-PRAM model.
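The reduction target, parenthesis matching, pairs each '(' with its corresponding ')'. A sequential stack version is below; the PRAM algorithm instead computes nesting depths with prefix sums and pairs brackets of equal depth, which is what makes an O(log n)-time EREW bound possible.

```python
def match_parens(s):
    # Pair each '(' with its matching ')' in a balanced string;
    # returns a dict mapping each index to its partner's index.
    stack, pairs = [], {}
    for i, c in enumerate(s):
        if c == '(':
            stack.append(i)
        else:
            j = stack.pop()
            pairs[j], pairs[i] = i, j
    return pairs
```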

May 2006


We study epidemic schemes in the context of collaborative data delivery. In this context, multiple chunks of data reside at different nodes, and the challenge is to deliver all chunks to all nodes simultaneously. Here we explore the interoperation among the gossip of multiple simultaneous message chunks: interacting nodes must select which chunk, among many, to exchange in every communication round. We provide an efficient solution that possesses the inherent robustness and scalability of gossip. Our approach maintains the simplicity of gossip and has low message, connection, and computation overhead. Because our approach differs from solutions proposed by network coding, we are able to provide insight into the tradeoffs and analysis of the problem of collaborative content distribution. We formally analyze the performance of the algorithm, demonstrating its efficiency with high probability.
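The chunk-selection question can be made concrete with a toy push-gossip simulation: each round, every node contacts a random peer and pushes one chunk the peer lacks. The uniform chunk choice here is a deliberate simplification, not the paper's selection scheme; it only shows the setting in which that scheme operates.

```python
import random

def gossip_all_chunks(n_nodes, n_chunks, seed=1):
    # Simulate rounds of push gossip until every node holds every
    # chunk; chunks start scattered, one distinct chunk per node.
    rng = random.Random(seed)
    have = [{i % n_chunks} for i in range(n_nodes)]
    rounds = 0
    while any(len(h) < n_chunks for h in have):
        rounds += 1
        for i in range(n_nodes):
            j = rng.randrange(n_nodes)          # pick a random peer
            missing = have[i] - have[j]
            if missing:                          # push one chunk j lacks
                have[j].add(rng.choice(sorted(missing)))
    return rounds
```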

July 1994


We develop a framework that helps in understanding fault-tolerant distributed systems and so helps in designing such systems. We define a unit of computation in such systems, referred to as a molecule, that has a well-defined interface with other molecules, i.e., minimal dependence on other molecules. The smallest such unit, an indivisible molecule, is termed an atom. We show that any execution of a fault-tolerant distributed computation can be seen as an execution of molecules/atoms in a partial order. Such a view provides insight into understanding the computation, particularly for a fault-tolerant system, where it is important to guarantee that a unit of computation is either completely executed or not at all, and where system designers need to reason about the states after execution of such units. We prove different properties satisfied by molecules and atoms, and present algorithms to detect atoms in an ongoing computation and to force the completion of a molecule. We illustrate the uses of the developed work in application areas such as debugging, checkpointing, and reasoning about stable properties.

May 2004


Summary form only given. We describe the parallelization of the multizone versions of the NAS Parallel Benchmarks employing multilevel OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues which affect the performance of multilevel parallel applications.

February 2000


Consider any known sequential algorithm for matrix multiplication over an arbitrary ring with time complexity O(N^α), where 2 < α ≤ 3. We show that such an algorithm can be parallelized on a distributed memory parallel computer (DMPC) in O(log N) time by using N^α/log N processors. Such a parallel computation is cost optimal and matches the performance of PRAM. Furthermore, our parallelization on a DMPC can be made fully scalable, that is, for all 1 ≤ p ≤ N^α/log N, multiplying two N×N matrices can be performed by a DMPC with p processors in O(N^α/p) time, i.e., linear speedup and cost optimality can be achieved in the range [1..N^α/log N]. This unifies all known algorithms for matrix multiplication on DMPC, standard or non-standard, sequential or parallel. Extensions of our methods and results to other parallel systems are also presented. The above claims result in significant progress in scalable parallel matrix multiplication (as well as solving many other important problems) on distributed memory systems, both theoretically and practically.

May 1994


We present a parallel algorithm for performing Boolean set operations on generalized polygons that have holes in them. The intersection algorithm has a processor complexity of O(m^2 n^2) and a time complexity of O(max(2 log m, log^2 n)), where m is the maximum number of vertices in any loop of a polygon and n is the maximum number of loops per polygon. The union and difference algorithms have a processor complexity of O(m^2 n^2) and time complexities of O(log m) and O(max(2 log m, log n)), respectively. The algorithm is based on the EREW PRAM model. It minimizes intersection-point computations by intersecting only a subset of the loops of the polygons, based on their topological relationships.

February 2002


The concept of an unreliable failure detector was introduced by T.D. Chandra and S. Toueg (1996) as a mechanism that provides information about process failures. This mechanism has been used to solve different problems in asynchronous systems, in particular the Consensus problem. In this paper, we present a new class of unreliable failure detectors, which we call Eventually Consistent and denote by ◊C. This class adds to the failure detection capabilities of other classes an eventual leader election capability, which allows all correct processes to eventually choose the same correct process as leader. We study the relationship between ◊C and other classes of failure detectors. We also propose an efficient algorithm to transform ◊C into ◊P in models of partial synchrony. Finally, to show the power of this new class of failure detectors, we present a Consensus algorithm based on ◊C. This algorithm successfully exploits the leader election capability of the failure detector and performs better in number of rounds than all previously proposed algorithms for failure detectors with eventual accuracy.