Journal of Parallel and Distributed Computing

Published by Elsevier
Online ISSN: 1096-0848
Article
Several recent methods have been proposed to obtain significant speed-ups in MRI image reconstruction by leveraging the computational power of GPUs. Previously, we implemented a GPU-based image reconstruction technique called the Illinois Massively Parallel Acquisition Toolkit for Image reconstruction with ENhanced Throughput in MRI (IMPATIENT MRI) for reconstructing data collected along arbitrary 3D trajectories. In this paper, we improve IMPATIENT by removing computational bottlenecks: a gridding approach accelerates the computation of various data structures needed by the previous routine. Further, we enhance the routine with capabilities for off-resonance correction and multi-sensor parallel imaging reconstruction. With optimized gridding integrated into our iterative reconstruction scheme, the improved GPU implementation achieves speed-ups of more than a factor of 200 over the previous accelerated GPU code.
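The gridding step the abstract refers to can be illustrated with a minimal 1-D sketch: non-Cartesian k-space samples are convolved onto a uniform grid with a small interpolation kernel. The function name, kernel choice (truncated Gaussian), and width are illustrative assumptions, not IMPATIENT's actual implementation.

```python
import numpy as np

def grid_1d(k_coords, samples, grid_size, kernel_width=2.0):
    """Convolve non-uniform k-space samples onto a uniform grid
    with a truncated Gaussian kernel (illustrative stand-in)."""
    grid = np.zeros(grid_size, dtype=complex)
    for k, s in zip(k_coords, samples):
        center = int(round(k))
        # spread each sample over a small neighborhood of grid points
        for g in range(center - 2, center + 3):
            if 0 <= g < grid_size:
                w = np.exp(-((g - k) ** 2) / kernel_width)
                grid[g] += w * s
    return grid
```

After gridding, a standard Cartesian FFT recovers the image, which is what makes the approach fast on GPUs.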
 
Figure: Initial arrangement of runs on the PDM. The leading block of each run resides on a different disk, so the leading blocks of all runs can be read in one parallel I/O during the R-Way merge.
Article
In this paper we present efficient algorithms for sorting on the Parallel Disks Model (PDM). Numerous asymptotically optimal algorithms have been proposed in the literature. However, many of these merge-based algorithms have large underlying constants in their time bounds because they suffer from a lack of read parallelism on the PDM. The irregular consumption of the runs during the merge limits read parallelism and contributes to the increased sorting time. In this paper we first introduce a novel idea called dirty sequence accumulation that improves the read parallelism. Secondly, we show analytically that this idea can reduce the number of parallel I/Os required to sort the input close to the lower bound of [Formula: see text]. We experimentally verify our dirty sequence idea with the standard R-Way merge and show that it can significantly reduce the number of parallel I/Os needed to sort on the PDM.
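For context, the baseline the dirty-sequence idea augments is a plain R-way merge of sorted runs; on the PDM the run heads would be the leading blocks fetched in one parallel I/O. A minimal in-memory sketch (names illustrative):

```python
import heapq

def r_way_merge(runs):
    """Merge R sorted runs into one sorted sequence using a
    tournament (heap) of the current head of each run."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):
            # advance within run i; on the PDM this is where the next
            # block of the run would be fetched from its disk
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```

The irregular rate at which runs are consumed here is exactly what makes prefetching the next blocks in parallel difficult.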
 
Article
We present a multi-heuristic evolutionary task allocation algorithm to dynamically map tasks to processors in a heterogeneous distributed system. It utilizes a genetic algorithm, combined with eight common heuristics, in an effort to minimize the total execution time. It operates on batches of unmapped tasks and can preemptively remap tasks to processors. The algorithm has been implemented on a Java distributed system and evaluated with a set of six problems from the areas of bioinformatics, biomedical engineering, computer science and cryptography. Experiments using up to 150 heterogeneous processors show that the algorithm achieves better efficiency than other state-of-the-art heuristic algorithms.
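One of the common mapping heuristics such a genetic algorithm is typically combined with is Min-min: repeatedly map the task with the smallest minimum completion time to its best processor. This sketch is a generic illustration of that heuristic, not the paper's implementation.

```python
def min_min(etc):
    """Min-min mapping heuristic. etc[t][p] is the estimated time to
    compute task t on processor p. Returns a task -> processor map."""
    ready = [0.0] * len(etc[0])          # processor ready times
    unmapped = set(range(len(etc)))
    mapping = {}
    while unmapped:
        # pick the task whose best-case completion time is smallest
        _, t = min(
            (min(ready[p] + etc[t][p] for p in range(len(ready))), t)
            for t in unmapped
        )
        p = min(range(len(ready)), key=lambda q: ready[q] + etc[t][q])
        mapping[t] = p
        ready[p] += etc[t][p]
        unmapped.remove(t)
    return mapping
```

A GA like the one described would use mappings produced by heuristics of this kind to seed or guide its population.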
 
Article
Genome resequencing with short reads generated from pyrosequencing generally relies on mapping the short reads against a single reference genome. However, mapping reads from multiple reference genomes is not possible using a pairwise mapping algorithm. Existing multiple sequence alignment (MSA) methods cannot be used to align the reads with respect to each other and the reference genomes, because they do not take into account the position of these short reads with respect to the genome and are highly inefficient for large numbers of sequences. In this paper, we develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align such large numbers of reads from single or multiple reference genomes. The proposed alignment algorithm accurately aligns erroneous reads and has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of execution time, quality of the alignments, and the ability of the algorithm to handle reads from multiple haplotypes. We report high-quality multiple alignment of up to 0.5 million reads. The algorithm is shown to be highly scalable and exhibits super-linear speedups with increasing number of processors.
 
Article
Tomographic imaging and computer simulations are increasingly yielding massive datasets. Interactive and exploratory visualizations have rapidly become indispensable tools to study large volumetric imaging and simulation data. Our scalable isosurface visualization framework on commodity off-the-shelf clusters is an end-to-end parallel and progressive platform, from initial data access to the final display. Interactive browsing of extracted isosurfaces is made possible by using parallel isosurface extraction and rendering in conjunction with a new specialized piece of image-compositing hardware called Metabuffer. In this paper, we focus on the back-end scalability by introducing a fully parallel and out-of-core isosurface extraction algorithm. It achieves scalability by using both parallel and out-of-core processing and parallel disks. It statically partitions the volume data across parallel disks with a balanced workload, and builds I/O-optimal external interval trees to minimize the number of I/O operations for loading large data from disk. We also describe an isosurface compression scheme that is efficient for progressive extraction, transmission and storage of isosurfaces.
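The interval-tree query at the core of the extraction step answers a stabbing question: which cells' scalar-value ranges contain the isovalue? A brute-force in-core stand-in for that query (the paper's external interval tree answers the same question with far fewer I/Os on out-of-core data) can be sketched as:

```python
def active_cells(cell_ranges, isovalue):
    """Return indices of cells whose scalar range spans the isovalue.
    cell_ranges: list of (min_value, max_value) per cell.
    An in-core stand-in for the external interval-tree stabbing query."""
    return [i for i, (lo, hi) in enumerate(cell_ranges)
            if lo <= isovalue <= hi]
```

Only the returned "active" cells need to be loaded and triangulated, which is why minimizing the I/O cost of this query dominates out-of-core isosurface extraction.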
 
Article
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors.
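A serial brute-force construction of the knn graph, which the distributed framework partitions across processors (each worker computing the neighbor lists for its own block of points), can be sketched as follows; the Euclidean metric here is just one choice, since the framework supports arbitrary distance metrics.

```python
import math

def knn_graph(points, k):
    """Brute-force k-nearest-neighbor graph: for each point, the
    indices of its k closest other points (Euclidean distance)."""
    n = len(points)
    graph = []
    for i in range(n):
        # in the distributed setting, each worker runs this loop
        # only over its own partition of the row indices i
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        graph.append([j for _, j in dists[:k]])
    return graph
```

Because each row of the graph is independent, the computation parallelizes with essentially no communication beyond distributing the data, consistent with the near-linear speedups reported.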
 
Conference Paper
Recently, the author has studied two important classes of algorithms requiring ±2<sup>b</sup> communications: ±2<sup>b</sup>-descend and ±2<sup>b</sup>-ascend. Let N=2<sup>n</sup> be the number of PEs in a SIMD hypercube which restricts all communications to a single fixed dimension at a time. He has developed an efficient O(n) algorithm for the descend class, and also obtained a simple O(n<sup>2</sup>/log n) algorithm for the ascend class, requiring O(log n) words of local memory per PE. In the present paper he presents two new algorithms for the ascend class on a SIMD hypercube. The first algorithm runs in O(n<sup>1.5</sup>) time and requires O(1) space per PE. The second algorithm, which is discussed only briefly, runs in O(n√n/log n) time and requires O(log n) space per PE.
 
Conference Paper
Spatial image warping is useful for image processing and graphics. The authors present optimal concurrent-read-exclusive-write (CREW) and exclusive-read-exclusive-write (EREW) parallel-random-access-machine (PRAM) algorithms that achieve O(1) asymptotic run time. The significant result is the creative processor assignment that yields an EREW PRAM forward direct warp algorithm. The forward algorithm calculates any nonscaling affine transform. The EREW algorithm is the most efficient in practice: a 16k-processor MasPar MP-1 can rotate a 4-million-element image in under a second and a 2-million-element volume in half a second. This high performance allows interactive viewing of volumes from arbitrary viewpoints and illustrates linear speedup.
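A forward (scatter) warp of the kind described maps each source pixel through the affine transform and writes it at the rounded destination coordinate. This minimal rotation sketch uses nearest-neighbor assignment and illustrates only the forward-warp structure, not the paper's EREW processor assignment.

```python
import numpy as np

def forward_rotate(img, theta):
    """Forward-map each source pixel through a rotation about the
    image center (scatter / forward direct warp, nearest neighbor)."""
    h, w = img.shape
    out = np.zeros_like(img)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = np.cos(theta), np.sin(theta)
    for y in range(h):
        for x in range(w):
            # destination of source pixel (x, y) under the rotation
            xr = c * (x - cx) - s * (y - cy) + cx
            yr = s * (x - cx) + c * (y - cy) + cy
            xi, yi = int(round(xr)), int(round(yr))
            if 0 <= yi < h and 0 <= xi < w:
                out[yi, xi] = img[y, x]
    return out
```

The hard parallel problem is that two source pixels may scatter to the same destination; the paper's contribution is a processor assignment that avoids such write conflicts on an EREW machine.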
 
Conference Paper
A passive optical star is an ideal shared medium, from both fault tolerant and access synchronization points of view. The communication over an optical star merges to a single point in space and then broadcasts back to all the nodes. This circular symmetry facilitates the solution for two basic distributed synchronization problems, which are presented in this work: (i) the generation of a global event clock for synchronizing the nodes' operation, and (ii) distributed scheduling for accessing the shared passive medium, which is a hybrid (deterministic and random) technique. We present, prove, and analyze this hybrid scheduling algorithm, which is equivalent to a distributed queue, and, therefore, is also algorithmically fair. Furthermore, our solution has two additional properties: destination overflow prevention and destination fairness. The effective solution of these problems can be used for efficiently implementing a local area network based on a passive optical star.
 
Conference Paper
Presents efficient hypercube algorithms for solving triangular systems of linear equations by using various matrix partitioning and mapping schemes. Recently, several parallel algorithms have been developed for this problem. In these algorithms, the triangular solver is treated as the second stage of Gaussian elimination; thus, the triangular matrix is distributed by columns (or rows) in a wrap fashion, since the matrix is likely to be distributed this way after an LU decomposition has been done on it. However, the efficiency of these algorithms is low. The motivation here is to develop various data partitioning and mapping schemes for hypercube algorithms by treating the triangular solver as an independent problem. Performance of the algorithms is analyzed theoretically and empirically by implementing them on a commercially available hypercube.
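The triangular solve being parallelized is, at its core, forward substitution; in a column-wrapped distribution, the updates generated by column j belong to processor j mod p. A sequential sketch for a lower-triangular system (illustrative names, column-oriented form):

```python
def forward_substitution(L, b):
    """Column-oriented forward substitution for lower-triangular L,
    solving L x = b. In a column-wrapped mapping, the updates for
    column j would be owned by processor j mod p."""
    n = len(b)
    x = [0.0] * n
    y = list(b)
    for j in range(n):
        x[j] = y[j] / L[j][j]
        # broadcast x[j], then update the remaining right-hand side
        for i in range(j + 1, n):
            y[i] -= L[i][j] * x[j]
    return x
```

The serial dependence of x[j] on all earlier solution components is what makes the choice of partitioning and mapping scheme decisive for hypercube efficiency.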
 
Conference Paper
The mesh is one of the most widely used interconnection networks for multiprocessor systems. We propose an approach to partition a given mesh into m submeshes which can be allocated to m tasks with grid structures. We adapt two-dimensional packing to solve the submesh allocation problem. Due to the intractability of the two-dimensional packing problem, finding an optimal solution is computationally infeasible, so we develop an efficient heuristic packing algorithm called TP-heuristic. Allocating a submesh to each task is achieved using the results of packing. We propose two different methods, called uniform scaling and non-uniform scaling. Experiments were carried out to test the accuracy of solutions provided by our allocation algorithm.
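The flavor of heuristic 2-D packing involved can be illustrated with a simple first-fit shelf packer; the abstract does not specify TP-heuristic itself, so this sketch is a generic stand-in showing how rectangular task footprints are placed inside a fixed-width mesh.

```python
def shelf_pack(rects, bin_width):
    """First-fit shelf packing (generic illustration, not TP-heuristic).
    rects: (width, height) pairs. Returns (x, y) placements and the
    total height of the packing."""
    placements = []
    shelf_y = 0        # bottom of the current shelf
    shelf_h = 0        # height of the current shelf
    cursor_x = 0
    for w, h in rects:
        if cursor_x + w > bin_width:   # rectangle doesn't fit: new shelf
            shelf_y += shelf_h
            shelf_h = 0
            cursor_x = 0
        placements.append((cursor_x, shelf_y))
        cursor_x += w
        shelf_h = max(shelf_h, h)
    return placements, shelf_y + shelf_h
```

Each placed rectangle corresponds to a submesh carved out of the larger mesh for one task; a good packing wastes few processors.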
 
Conference Paper
In this paper, dilated embedding and precise embedding of K-ary complete trees into hypercubes are studied. For dilated embedding, a nearly optimal algorithm is proposed which embeds a K-ary complete tree of height h, T<sub>K</sub>(h), into an ((h−1)⌈log K⌉ + ⌈log(K+2)⌉)-dimensional hypercube with dilation max{2, φ(K), φ(K+2)}, where φ(x) = min{λ : Σ<sub>i=0</sub><sup>λ</sup> C(d, i) ≥ x, d = ⌈log x⌉}. (It is clear that ⌈(⌈log x⌉+1)/2⌉ ≤ φ(x) ≤ ⌈log x⌉ for x ≥ 3.) For precise embedding, we show that a ((K−1)h+1)-dimensional hypercube is large enough to contain T<sub>K</sub>(h) as a subgraph, for K ≥ 3.
 
Conference Paper
An O(N<sup>2</sup>) heuristic algorithm is presented that embeds all binary trees, with dilation 2 and small average dilation, into the optimal-sized hypercube. The heuristic relies on a conjecture about all binary trees with a perfect matching. It provides a practical and robust technique for mapping binary trees into the hypercube and ensures that the communication load is evenly distributed across the network, assuming any shortest-path routing strategy. One contribution of this work is the identification of a rich collection of binary trees that can be easily mapped into the hypercube.
 
Article
Parallel BLAS libraries have recently shown promise as a means of taking advantage of parallel computing in solving scientific problems. However, little work has been done on providing such a parallel library in LAN-connected workstations. Our motivation for this research lies in the strong belief that since LAN-connected workstations are highly cost-effective, it is important to study the issues in such environments. Dynamic load balancing is a method that allows parallel programs to run efficiently in LAN-connected workstations. Introducing dynamic load balancing leads to new design concerns for data distribution, sequential implementation, and library interfaces. Through a series of experiments, we investigate the influence of these factors. We also propose a set of guidelines for developing an efficient parallel Level 3 BLAS library.
 
Conference Paper
The design of BLITZEN, a highly integrated chip with 128 processing elements (PEs), is presented. The bit-serial processing element is described, and some comparisons with the massively parallel processor (MPP) and the Connection Machine are provided. Local control features and methods for memory access are emphasized. The organization of PEs on the custom chip, with emphasis on interconnection and I/O schemes, is described, and details of the custom chip design and instruction pipeline are provided. An overview of system architecture concepts and software for BLITZEN is also given. Each PE has 1 Kbit of static RAM and contains bit-serial processing and functional elements for arithmetic, logic, and shifting. Unique local control features include modification of the global memory address by data local to each PE and complementary operations based on a condition register. Fixed-point operations on 32-bit data can exceed a rate of one billion operations per second; since the processors are bit-serial devices, performance rates improve with shorter word lengths. The bus-oriented I/O scheme can transfer data at 10240 MB/s.
 
Conference Paper
In this paper we consider general simulations of algorithms designed for fully operational BSP and CGM machines on machines with faulty processors. The faults are deterministic (i.e., worst-case distributions of faults are considered) and static (i.e., they do not change in the course of the computation). We assume that a constant fraction of the processors are faulty. We present a deterministic simulation (resp. a randomized simulation) that achieves constant slowdown per local computation and O((log<sub>h</sub> p)<sup>2</sup>) (resp. O(log<sub>h</sub> p)) slowdown per communication round, provided that a deterministic preprocessing is done that requires O((log<sub>h</sub> p)<sup>2</sup>) communication rounds and linear (in h) computation per processor in each communication round. Our results are fully scalable over all values of p from Θ(1) to Θ(n). Furthermore, our results imply that for p ≤ n<sup>ε</sup> (ε < 1), algorithms can be made resilient to a constant fraction of processor faults without any asymptotic slowdown.
 
Conference Paper
A wide range of graphs with regular structures are shown to be embeddable in an injured hypercube with faulty links. These include rings, linear paths, binomial trees, binary trees, meshes, tori, and many others. Unlike many existing algorithms which are capable of embedding only one type of graph, our algorithm embeds the above graphs in a unified way, all centered around a notion called the edge matrix. In many cases, the degree of fault tolerance offered by the algorithm is optimal or near-optimal.
 
Conference Paper
In this paper we propose a joint design of MIMO techniques and network coding (MIMO-NC) and apply it to improve the performance of wireless networks. We consider a system in which the packet exchange among multiple wireless users is forwarded by a relay node. In order to enjoy the benefit of MIMO-NC, every node in the network is equipped with two antennas and the relay node possesses coding capability. For the cross traffic flows among any four users, the relay node not only can receive packets simultaneously from two compatible users in the uplink (users to relay node), but also can mix distinct packets for four destined users into two coded packets and concurrently send them out in the same downlink (relay node to users), so that the information content is significantly increased in each transmission. We formalize the problem of finding a schedule that forwards the buffered data of all users in the minimum number of transmissions as the problem of finding a maximum matching in a graph. We also provide an analytical model of maximum throughput and optimal energy efficiency, which explicitly measures the performance gain of the MIMO-NC enhancement. Our analytical and simulation results demonstrate that system performance can be greatly improved by the efficient utilization of MIMO and network coding opportunities.
 
Conference Paper
The class of cographs, or complement-reducible graphs, arises naturally in many different areas of applied mathematics and computer science. The authors show that the problem of finding a maximum matching in a cograph can be solved optimally in parallel by reducing it to parenthesis matching. With an n-vertex cograph G represented by its parse tree as input, the algorithm finds a maximum matching in G in O(log n) time using O(n/log n) processors in the EREW-PRAM model.
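The parenthesis-matching primitive the reduction targets pairs each '(' with its corresponding ')'. Sequentially it is a simple stack scan, as sketched below; the paper's contribution is performing this (and the reduction from cographs) in O(log n) parallel time.

```python
def match_parens(s):
    """Pair each '(' with its matching ')'. Returns a dict mapping the
    index of each open parenthesis to the index of its close."""
    stack, pairs = [], {}
    for i, c in enumerate(s):
        if c == '(':
            stack.append(i)
        elif c == ')':
            pairs[stack.pop()] = i
    return pairs
```

In the cograph reduction, a traversal of the parse tree is encoded as a balanced parenthesis string whose matched pairs correspond to matched vertices.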
 
Conference Paper
We study epidemic schemes in the context of collaborative data delivery. In this context, multiple chunks of data reside at different nodes, and the challenge is to simultaneously deliver all chunks to all nodes. Here we explore the inter-operation between the gossip of multiple, simultaneous message-chunks. In this setting, interacting nodes must select which chunk, among many, to exchange in every communication round. We provide an efficient solution that possesses the inherent robustness and scalability of gossip. Our approach maintains the simplicity of gossip, and has low message, connection, and computation overhead. Because our approach differs from solutions proposed by network coding, we are able to provide insight into the tradeoffs and analysis of the problem of collaborative content distribution. We formally analyze the performance of the algorithm, demonstrating its efficiency with high probability.
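A single push round of such a multi-chunk gossip can be sketched as follows; the uniform choice among a peer's missing chunks is an assumed placeholder for the paper's actual chunk-selection rule.

```python
import random

def gossip_round(have, peers, rng):
    """One push round of multi-chunk gossip.
    have:  node -> set of chunk ids the node holds.
    peers: node -> list of nodes it may contact.
    Each node contacts one random peer and pushes one chunk chosen
    uniformly among the chunks that peer is missing."""
    for u in list(have):
        v = rng.choice(peers[u])
        missing = have[u] - have[v]
        if missing:
            have[v].add(rng.choice(sorted(missing)))
```

Iterating rounds until every node holds every chunk models the collaborative-delivery process; the interesting analysis is how the chunk-selection rule affects the number of rounds needed.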
 
Conference Paper
We develop a framework that helps in understanding fault-tolerant distributed systems and so helps in designing such systems. We define a unit of computation in such systems, referred to as a molecule, that has a well-defined interface with other molecules, i.e., minimal dependence on other molecules. The smallest such unit, an indivisible molecule, is termed an atom. We show that any execution of a fault-tolerant distributed computation can be seen as an execution of molecules/atoms in a partial order. Such a view provides insight into the computation, particularly for a fault-tolerant system, where it is important to guarantee that a unit of computation is either completely executed or not at all, and where system designers need to reason about the states after execution of such units. We prove different properties satisfied by molecules and atoms, and present algorithms to detect atoms in an ongoing computation and to force the completion of a molecule. We illustrate the uses of the developed work in application areas such as debugging, checkpointing, and reasoning about stable properties.
 
Conference Paper
Summary form only given. We describe the parallelization of the multizone versions of the NAS Parallel Benchmarks employing multilevel OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues which affect the performance of multilevel parallel applications.
 
Conference Paper
Consider any known sequential algorithm for matrix multiplication over an arbitrary ring with time complexity O(N<sup>α</sup>), where 2 < α ≤ 3. We show that such an algorithm can be parallelized on a distributed memory parallel computer (DMPC) in O(log N) time by using N<sup>α</sup>/log N processors. Such a parallel computation is cost optimal and matches the performance of PRAM. Furthermore, our parallelization on a DMPC can be made fully scalable: for all 1 ≤ p ≤ N<sup>α</sup>/log N, multiplying two N×N matrices can be performed by a DMPC with p processors in O(N<sup>α</sup>/p) time, i.e., linear speedup and cost optimality can be achieved in the range [1..N<sup>α</sup>/log N]. This unifies all known algorithms for matrix multiplication on DMPC, standard or non-standard, sequential or parallel. Extensions of our methods and results to other parallel systems are also presented. The above claims represent significant progress in scalable parallel matrix multiplication (as well as in solving many other important problems) on distributed memory systems, both theoretically and practically.
 
Conference Paper
We present a parallel algorithm for performing Boolean set operations on generalized polygons that have holes in them. The intersection algorithm has a processor complexity of O(m<sup>2</sup>n<sup>2</sup>) processors and a time complexity of O(max(2 log m, log<sup>2</sup> n)), where m is the maximum number of vertices in any loop of a polygon, and n is the maximum number of loops per polygon. The union and difference algorithms have a processor complexity of O(m<sup>2</sup>n<sup>2</sup>) and time complexities of O(log m) and O(max(2 log m, log n)), respectively. The algorithm is based on the EREW PRAM model. It minimizes intersection-point computations by intersecting only a subset of loops of the polygons, based on their topological relationships.
 
Conference Paper
The concept of unreliable failure detector was introduced by T.D. Chandra and S. Toueg (1996) as a mechanism that provides information about process failures. This mechanism has been used to solve different problems in asynchronous systems, in particular the Consensus problem. In this paper, we present a new class of unreliable failure detectors, which we call Eventually Consistent and denote by ◊C. This class adds to the failure detection capabilities of other classes an eventual leader election capability. This capability allows all correct processes to eventually choose the same correct process as leader. We study the relationship between ◊C and other classes of failure detectors. We also propose an efficient algorithm to transform ◊C into ◊P in models of partial synchrony. Finally, to show the power of this new class of failure detectors, we present a Consensus algorithm based on ◊C. This algorithm successfully exploits the leader election capability of the failure detector and requires fewer rounds than all previously proposed algorithms for failure detectors with eventual accuracy.
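The eventual-leader capability can be pictured as each process choosing the smallest identifier its failure detector does not currently suspect: once the detector's suspicions stabilize on the set of crashed processes, all correct processes converge on the same leader. A toy sketch (names and the min-id rule are illustrative assumptions):

```python
def elect_leader(processes, suspected):
    """Omega-style leader choice on top of failure-detector output.
    suspected[p] is the set of processes that p currently suspects.
    Each process picks the smallest id it does not suspect."""
    return {p: min(q for q in processes if q not in suspected[p])
            for p in processes}
```

Before stabilization, different processes may hold different suspicion sets and thus disagree on the leader; the "eventually" in ◊C is exactly the guarantee that this disagreement is temporary.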
 
Conference Paper
The problem of deciding if a set of real-time messages can be transmitted in a unidirectional ring network with m > 1 nodes is considered. Complexity results are given for various restrictions on the four parameters associated with each message: origin node, destination node, release time, and deadline. For nonpreemptive transmission, it is shown that the problem is solvable in polynomial time for any case when only one of the four parameters is allowed to be arbitrary, and that it is NP-complete for each case when any two of the four parameters are fixed. For preemptive transmission, the problem is likewise solvable in polynomial time for any case when only one of the four parameters is allowed to be arbitrary, and NP-complete for each case when any two of the four parameters are fixed, except in the following two cases: (1) same origin node and release time; and (2) same destination node and deadline.
 
Conference Paper
Cyclic shifts are intrinsic operations in many parallel algorithms. Therefore, it is important to execute them efficiently. The authors present and analyze algorithms for the cyclic shift operation on n-dimensional (distributed memory) hypercubes. The first algorithm, by S.L. Johnsson (1987), always uses shortest paths between hypercube nodes for routing. The authors prove that when using this algorithm, there is no node and link congestion in any communication step on synchronous hypercubes. Consequently, any cyclic shift can be implemented in n steps on such machines, without any local buffering of messages. They show that all previously known algorithms need local buffering, in the context of asynchronous hypercubes. In order to overcome this, they design an algorithm that always uses link-disjoint paths for routing. In this case, they prove that any cyclic shift can be realized by using at most (4/3)n steps, without any local message buffers.
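Dimension-order routing of a cyclic shift can be simulated directly: each message corrects one address bit per step, so n steps always suffice for delivery. This sketch checks delivery only; Johnsson's schedule additionally guarantees the absence of node and link congestion at each step.

```python
def cyclic_shift(n_dims, s):
    """Simulate routing of a cyclic shift by s on a 2^n-node hypercube,
    correcting one dimension per step (dimension-ordered routing).
    Returns the final position of the message originating at each node."""
    n = 1 << n_dims
    pos = list(range(n))                     # message from node i starts at i
    dest = [(i + s) % n for i in range(n)]
    for d in range(n_dims):                  # one communication step per dim
        for i in range(n):
            if (pos[i] ^ dest[i]) & (1 << d):
                pos[i] ^= 1 << d             # traverse dimension d
    return pos
```

The subtlety the paper addresses is that several messages may want the same link in the same step; the congestion-free and link-disjoint schedules arrange the dimension corrections so this never forces local buffering.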
 
Conference Paper
The B-tree is a fundamental data structure that is used to access and update a large number of keys. In this paper we present a parallel algorithm on the EREW PRAM that deletes keys in a B-tree. Our algorithm runs in O(t(log k + log<sub>t</sub> n)) time with k processors, where n is the number of keys in the B-tree, t is the minimum degree of the B-tree, and k is the number of unsorted keys to delete; it improves upon the previous algorithm by a factor of t.
 
Conference Paper
Advancements in storage technology, along with the fast deployment of high-speed networks, have made the storage, transmission and manipulation of multimedia information such as text, graphics, still images, video and audio feasible. Our study focuses on the performance of the mass storage system for a large-scale video-on-demand server. Different video file striping schemes, such as application-level striping and device-driver-level striping, were examined in order to study scalability and performance issues. To study the impact of different concurrent access patterns on the performance of a server, experimental results were obtained for group access on a single video file and multiple group accesses on multiple video files.
 
Conference Paper
The implementation and testing of an algorithm for the numerical solution of the nonlinear Poisson equation of semiconductor device theory (MOSFET) on a massively parallel processor (MPP) is presented. A brief description is provided of the parallel architecture of the MPP, highlighting the features exploited by the numerical implementation. The specifics of the algorithm implementation using the Parallel Pascal language of the MPP are described. Refinements made to the algorithm implementation to improve run-time efficiency are also discussed. The algorithm implementation is summarized and future work is outlined.
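The structure of such a grid solver on a massively parallel array is a stencil relaxation in which every interior point updates simultaneously. This Jacobi sketch for the linear 2-D Poisson problem with zero boundary is an illustrative stand-in for the paper's nonlinear device equation; the function name and discretization are assumptions.

```python
import numpy as np

def jacobi_poisson(f, iters=500):
    """Jacobi relaxation for -∇²u = f on the unit square with zero
    boundary values. On an array machine like the MPP, each interior
    grid point performs this update in lock-step."""
    n = f.shape[0]
    h2 = (1.0 / (n + 1)) ** 2          # grid spacing squared
    u = np.zeros((n + 2, n + 2))       # includes the boundary ring
    for _ in range(iters):
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                + u[1:-1, :-2] + u[1:-1, 2:]
                                + h2 * f)
    return u
```

The nonlinear semiconductor equation replaces the constant right-hand side with a solution-dependent term, but the per-point, neighbor-only communication pattern (the part the MPP architecture exploits) is the same.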
 
Conference Paper
Recently the diameter problem for Packed Exponential Connections (PEC) networks was addressed by Lin and Prasanna (1992), who presented asymptotically tight bounds for the diameter and showed asymptotically optimal routing algorithms. In this paper exact solutions to the diameter and routing problems of PEC networks are derived, thereby strengthening the asymptotic bounds. For an N = 2<sup>n</sup>-node PEC network, with √(2n) an integer, it is shown that the diameter is given by the simple expression 2<sup>√(2n)−3</sup>(3√(2n) − 2). An exact expression for the diameter of PEC networks for general N is also derived. Efficient algorithms for shortest-path routing between nodes in a PEC network are then developed. These algorithms use at most O(log<sup>2</sup> N) time for computing the lengths of minimal routes between nodes. Finally, a simple modification to obtain symmetric PEC networks is suggested.
 
Conference Paper
The main goal is to derive an approximate, closed-form solution for the decentralized dynamic load sharing (LS) problem treated in an earlier paper [K. G. Shin and Y.-C. Chang, Load sharing in distributed real-time systems with state-change broadcasts. IEEE Trans. Comput. C-38, 9, 1124-1142 (1989)]. Whenever the load state of a node changes from underloaded to fully loaded and vice versa, the node broadcasts this change to a set of nodes in its physical proximity, called a buddy set. An overloaded node can select, without probing other nodes, the first available node from its preferred list, an ordered buddy set. Preferred lists are so constructed that the probability of more than one overloaded node sending tasks to an underloaded node may be made very small. In hard real-time systems, the problem of scheduling periodic tasks to meet their deadlines has been studied extensively, but scheduling aperiodic tasks has been addressed far less, due mainly to their random arrivals. We show that the proposed LS method can be used to effectively handle aperiodic tasks in distributed real-time systems. The probability of missing task deadlines can be kept below a specified level by choosing appropriate threshold patterns and buddy set sizes which are derived from the approximate closed-form solution. Specifically, "optimal" threshold patterns and buddy set sizes are derived for different system loads by minimizing the communication overhead subject to a constraint of keeping the probability of missing task deadlines below any given level. (One can also derive optimal solutions by minimizing the probability of missing deadlines while keeping the communication overhead below a specified level.) Several examples are presented to demonstrate the power and utility of the proposed LS approach.
 
Conference Paper
For a Kautz network with faulty components, the authors propose a distributed fault-tolerant routing scheme, called DFTR, in which each nonfaulty node knows no more than the condition of its own links and adjacent nodes. They construct a rooted tree for a given destination in the Kautz network and use it to develop DFTR such that a faulty component will never be encountered more than once. In DFTR, each node attempts to route a message via the shortest path. If a node on the path detects a faulty node at the next hop, a best alternative path for routing the message around the faulty component is obtained. A best alternative path is first generated by the reduced concatenation of this node and the destination, and is then checked to make sure that it does not contain any of the encountered faulty nodes. If it does, a new alternative path is generated as before. The authors devise an efficient approach in the checking step to reduce computation time. With slight modification, DFTR may be adapted to de Bruijn networks as well.
 
Conference Paper
Algorithms for mapping n-dimensional meshes on a star graph of degree n with expansion 1 and dilation 3 are developed. It is shown that an n-degree star graph can efficiently simulate an n-dimensional mesh. The analysis indicates that the algorithms developed for uniform meshes may not be efficiently simulated on the star graph.
 
Conference Paper
A nonlinear programming approach is introduced for solving the hypercube embedding problem. The basic idea of the proposed approach is to approximate the discrete space of an n-dimensional hypercube, i.e. {z : z ∈ {0,1}<sup>n</sup>}, with the continuous space of an n-dimensional hypersphere, i.e. {x : x ∈ R<sup>n</sup> and ||x||<sup>2</sup> = 1}. The mapping problem is initially solved in the continuous domain by applying the gradient projection technique to a continuously differentiable objective function. The optimal process `locations' from the solution of the continuous hypersphere mapping problem are then discretized onto the n-dimensional hypercube. The proposed approach can directly solve the problem of mapping P processes onto N nodes for the general case where P > N. In contrast, competing embedding heuristics from the literature can produce only one-to-one mappings and cannot, therefore, be directly applied when P > N.
 
Conference Paper
Free-space optical interconnection (FSOI) of integrated circuits (ICs) shows great potential for the efficient implementation of high-performance parallel computing and signal processing systems. The objective of this paper is the design study of a high-speed FSOI FFT processor system. The FFT is an important mathematical operation in many applications, such as signal processing, telecommunications, and instrumentation, which makes it an excellent choice for a high-speed system. The result of this work is the design of a photonic FFT system that includes the Photonic FFT (PFFT) chip and the Photonic RAM (PRAM) chip.
 
Conference Paper
The authors consider the problems of selection, routing, and sorting on an n-star graph (with n! nodes), an interconnection network which has been proven to possess many special properties. They identify a tree-like subgraph (a `(k, 1, k) chain network') of the star graph which enables them to design efficient algorithms for these problems. They present an algorithm that performs a sequence of n prefix computations in O(n<sup>2</sup>) time; this algorithm is used as a subroutine in other algorithms. In addition, they offer an efficient deterministic sorting algorithm that runs in (n<sup>3</sup> log n)/2 steps. They also show that sorting can be performed on the n-star graph in O(n<sup>3</sup>) time and that selection of a set of uniformly distributed n keys can be performed in O(n<sup>2</sup>) time with high probability. Finally, they present a deterministic (non-oblivious) routing algorithm that realizes any permutation in O(n<sup>3</sup>) steps on the n-star graph.
 
Conference Paper
The article presents a variation of the partition method for solving m<sup>th</sup>-order linear recurrences that is well suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities and reduces the number of memory accesses compared to the more commonly used version of the partition method. The variation uses a general loop-restructuring technique called loop raking. The article describes an implementation of this technique on the CRAY Y-MP and presents performance results on first- and second-order linear recurrences, as well as on Livermore Loops 5, 11, and 19, which are based on linear recurrences.
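For the first-order case x[i] = a[i]·x[i-1] + b[i], the partition idea can be sketched as follows (an illustrative serial Python sketch, not the paper's vectorised loop-raking CRAY code): each chunk is reduced to a single affine map, the p maps are chained in one short sequential sweep, and each chunk can then be expanded independently:

```python
def solve_recurrence(a, b, x0, p=4):
    """Solve x[i] = a[i]*x[i-1] + b[i] by the partition method.
    Pass 1 and pass 3 work chunk-by-chunk and are the parallelisable
    parts; only the short sweep over the p chunk maps is sequential."""
    n = len(a)
    chunk = (n + p - 1) // p
    starts = list(range(0, n, chunk))
    # pass 1 (parallelisable): reduce each chunk to one map x_out = A*x_in + B
    maps = []
    for s in starts:
        A, B = 1.0, 0.0
        for i in range(s, min(s + chunk, n)):
            A, B = a[i] * A, a[i] * B + b[i]
        maps.append((A, B))
    # pass 2 (sequential, length p): chain the chunk boundary values
    bounds = [x0]
    for A, B in maps:
        bounds.append(A * bounds[-1] + B)
    # pass 3 (parallelisable): expand each chunk from its boundary value
    out = [0.0] * n
    for k, s in enumerate(starts):
        x = bounds[k]
        for i in range(s, min(s + chunk, n)):
            x = a[i] * x + b[i]
            out[i] = x
    return out
```

The result matches a plain sequential evaluation of the recurrence for any chunking, which is what makes the per-chunk passes safe to distribute.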
 
Conference Paper
The authors present a programming environment called C-NET, developed for the reconfigurable SuperNode multiprocessor. It allows the implementation of variable-topology programs, referred to as phase-reconfigurable programs. The design decisions concerning dynamic-reconfiguration handling are discussed with regard to the architectural constraints of the machine. C-NET provides three specialized languages: PPL (phase programming language), for the development of phase-reconfigurable programs; GCL (graph-construction language), for the construction of the graphs on which the phases are to be executed; and CPL (component programming language), for coding the software components. The first example on which the programming environment was tested was the conjugate-gradient (CG) algorithm: a phase-reconfigurable implementation of CG was developed and compared with a fixed-topology implementation (an 8×4 torus). The results are encouraging.
 
Conference Paper
A local stabilizer protocol that takes any on- or off-line distributed algorithm and converts it into a synchronous self-stabilizing algorithm with local monitoring and repairing properties is presented. Whenever the self-stabilizing version enters an inconsistent state, the inconsistency is detected, in O(1) time, and the system state is repaired in a local manner. The expected computation time that is lost during the repair process is proportional to the largest diameter of a faulty region. (C) 2002 Elsevier Science (USA).
 
Article
While a number of user-level protocols have been developed to reduce the gap between the performance capabilities of the physical network and the performance actually available, their compatibility issues with existing sockets-based applications and IP-based infrastructure have been an area of major concern. To address these compatibility issues while maintaining high performance, a number of researchers have been looking at alternative approaches for optimizing the existing traditional protocol stacks. Broadly, previous research has broken the overheads in the traditional protocol stack into four related aspects, namely: (i) compute requirements and contention, (ii) memory contention, (iii) I/O bus contention, and (iv) system resources' idle time. While previous research dealing with some of these aspects exists, to the best of our knowledge there is no work which deals with all of these issues in an integrated manner while maintaining backward compatibility with existing applications and infrastructure. In this paper, we address each of these issues, propose solutions for minimizing these overheads by exploiting the emerging architectural features provided by modern Network Interface Cards (NICs), and demonstrate the capabilities of these solutions using an implementation based on UDP/IP over Myrinet. Our experimental results show that our implementation of UDP, termed E-UDP, can achieve up to 94% of the theoretical maximum bandwidth. We also present a mathematical performance model which allows us to study the scalability of our approach for different system architectures and network speeds.
 
Article
The true costs of high performance computing are currently dominated by software. Addressing these costs requires shifting to high-productivity languages such as Matlab. MatlabMPI is a Matlab implementation of the Message Passing Interface (MPI) standard and allows any Matlab program to exploit multiple processors. MatlabMPI currently implements the six basic functions that are the core of the MPI point-to-point communication standard. The key technical innovation of MatlabMPI is that it implements the widely used MPI ``look and feel'' on top of standard Matlab file I/O, resulting in an extremely compact (~250 lines of code) and ``pure'' implementation which runs anywhere Matlab runs, and on any heterogeneous combination of computers. The performance has been tested on both shared- and distributed-memory parallel computers (e.g. Sun, SGI, HP, IBM, and Linux). MatlabMPI can match the bandwidth of C-based MPI at large message sizes. A test image filtering application using MatlabMPI achieved a speedup of ~300 using 304 CPUs and ~15% of the theoretical peak (450 Gigaflops) on an IBM SP2 at the Maui High Performance Computing Center. In addition, this entire parallel benchmark application was implemented in 70 software lines of code (SLOC), yielding 0.85 Gigaflops/SLOC or 4.4 CPUs/SLOC, the highest values of these software price-performance metrics achieved for any application. The MatlabMPI software will be available for download.
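The file-I/O trick can be sketched in a few lines (a Python sketch of the mechanism, not MatlabMPI's actual Matlab code; the file-naming scheme below is an assumption): the sender writes the message to a shared file and then creates an empty lock file, and the receiver spins until the lock file appears, so it never reads a half-written message:

```python
import os
import pickle
import time

def msg_paths(comm_dir, src, dst, tag):
    # one data file plus one lock file per (source, destination, tag)
    base = os.path.join(comm_dir, f"msg_{src}_to_{dst}_tag{tag}")
    return base + ".pkl", base + ".lock"

def send(comm_dir, src, dst, tag, payload):
    data_file, lock_file = msg_paths(comm_dir, src, dst, tag)
    with open(data_file, "wb") as f:
        pickle.dump(payload, f)
    open(lock_file, "w").close()  # lock file last: signals a complete message

def recv(comm_dir, src, dst, tag, poll=0.01):
    data_file, lock_file = msg_paths(comm_dir, src, dst, tag)
    while not os.path.exists(lock_file):  # spin until the sender finishes
        time.sleep(poll)
    with open(data_file, "rb") as f:
        return pickle.load(f)
```

Here `send` and `recv` run in one process for illustration; in MatlabMPI's setting the two sides are separate Matlab processes sharing a file system.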
 
Number of cluster heads for a WSN with 100 nodes. 
Number of cluster heads for a WSN with 500 nodes. 
Average number of nodes in the clusters for a WSN with 100 nodes. 
Average number of nodes in the clusters for a WSN with 500 nodes. 
Radio characteristics used in our simulations.
Article
The deployment of wireless sensor networks in many application areas requires self-organization of the network nodes into clusters. Clustering is a network-management technique, since it creates a hierarchical structure over a flat network. Quite a few node-clustering techniques have appeared in the literature; they roughly fall into two families: those based on the construction of a dominating set and those based solely on energy considerations. The former family suffers from the fact that only a small subset of the network nodes is responsible for relaying the messages, which causes rapid consumption of those nodes' energy. The latter family uses the residual energy of each node to decide whether it will elect itself as the leader of a cluster. These methods ignore topological features of the nodes and are used in combination with methods of the former family. We propose an energy-efficient distributed clustering protocol for wireless sensor networks, based on a metric characterizing the significance of a node with respect to its contribution to relaying messages. The protocol achieves small communication complexity and linear computation complexity. Experimental results attest that the protocol improves network longevity.
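To make the significance-metric idea concrete, here is a deliberately simplified local election rule — an illustration only, not the paper's protocol or its metric: a node declares itself a cluster head iff its weight is the largest in its neighbourhood, with ties broken by node id:

```python
def elect_heads(adj, weight):
    """Local cluster-head election sketch: node v becomes a head iff
    (weight[v], v) beats (weight[u], u) for every neighbour u, so each
    node decides using only information about its own neighbourhood.
    adj maps each node to its neighbour list; weight holds each node's
    significance metric."""
    heads = []
    for v, nbrs in adj.items():
        if all((weight[v], v) > (weight[u], u) for u in nbrs):
            heads.append(v)
    return sorted(heads)
```

Because each node compares itself only against its neighbours, the rule needs one message exchange per link, in line with the small communication complexity the abstract emphasises.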
 
Article
One of the challenges for developers of 3D multiuser virtual simulation environments (3DMUVEs) is to keep the shared virtual simulation environment synchronized among all the participating users' terminals. Supporting 3DMUVEs through the traditional client–server communication model offers simpler management but can lead to bottlenecks and higher latencies. The peer-to-peer communication model, on the other hand, requires no central coordination but is more complex to manage. Current peer-to-peer networks, such as KaZaA and Gnutella, provide multimedia sharing services but do not support multiuser 3D virtual environment (VE) applications. This paper describes a solution to support 3DMUVEs in a hybrid peer-to-peer Gnutella network, which provides session control and distributed shared-VE synchronization. As a result of this work, two components specified by the ongoing multiuser extension to the MPEG-4 standard were implemented and integrated into the Gnutella network for control and synchronization. This solution minimizes the disadvantages of the client–server and pure peer-to-peer models. The results show that this approach can be a feasible solution, especially for spontaneous 3DMUVEs that can emerge from any user with no investment needed (apart from his own computer). Peer-to-peer networks such as Gnutella could also serve as a test environment for companies wishing to check their multiuser 3DMUVE software for correctness, and its acceptance by the user community, before making heavy investments.
 
Article
In this paper we focus on the problem of designing very fast, output-size-sensitive parallel algorithms for the convex hull and the vector maxima problems in three dimensions. Our algorithms achieve [Formula: see text] parallel time and optimal work with high probability on the CRCW PRAM, where n and h are the input and output sizes, respectively. These bounds are independent of the input distribution and are faster than those of the previously known algorithms. We also present an optimal-speedup (with respect to the input size only) sublogarithmic-time algorithm that uses a superlinear number of processors for vector maxima in three dimensions.
 
Article
In this paper, we solve the k-dimensional all nearest neighbor (kD_ANN) problem, where k=2 or 3, on a linear array with a reconfigurable pipelined bus system (LARPBS), from an image-processing perspective. First, for a two-dimensional (2D) binary image of size N×N, we devise an algorithm for solving the 2D_ANN problem using a LARPBS of size N<sup>2+ϵ</sup>, where 0<ϵ⪡1. Then, for a three-dimensional (3D) binary image of size N×N×N, we devise an algorithm for solving the 3D_ANN problem using a LARPBS of size N<sup>3+ϵ</sup>, where 0<ϵ⪡1. To the best of our knowledge, these are the best O(1)-time 2D_ANN and 3D_ANN algorithms known on the LARPBS model.
 
Article
Dijkstra defined a distributed system to be self-stabilizing if, regardless of the initial state, the system is guaranteed to reach a legitimate (correct) state in a finite time. Even though the concept of self-stabilization received little attention when it was introduced, it has become one of the most popular fault tolerance approaches. On the other hand, graph algorithms form the basis of many network protocols. They are used in routing, clustering, multicasting and many other tasks. The objective of this paper is to survey the self-stabilizing algorithms for dominating and independent set problems, colorings, and matchings. These graph theoretic problems are well studied in the context of self-stabilization and a large number of algorithms have been proposed for them.
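Dijkstra's original example, the K-state token ring, remains the canonical illustration of the definition above. The sketch below simulates it under a central daemon (one privileged node moves per step; here, simply the lowest-numbered one): node 0 is privileged when its state equals its left neighbour's and increments modulo K, every other node is privileged when its state differs from its left neighbour's and copies it; for K at least the ring size, any initial state converges to a configuration with exactly one privilege and stays there:

```python
def privileged_count(states):
    # a legitimate configuration has exactly one privileged node
    n = len(states)
    count = 1 if states[0] == states[-1] else 0
    count += sum(1 for i in range(1, n) if states[i] != states[i - 1])
    return count

def step(states, K):
    # one central-daemon move: the lowest-numbered privileged node acts
    if states[0] == states[-1]:
        states[0] = (states[0] + 1) % K      # node 0 passes the token on
        return
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:
            states[i] = states[i - 1]        # node i copies its left neighbour
            return
```

Starting from the arbitrary state [3, 1, 4, 1] with K = 5, repeated steps reach, and then remain in, configurations with exactly one privilege, which is precisely the self-stabilization property.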
 
Article
The increasing demand for real-time applications in Wireless Sensor Networks (WSNs) has made Quality of Service (QoS) based communication protocols an interesting and active research topic. Satisfying QoS requirements (e.g. bandwidth and delay constraints) for the different QoS-based applications of WSNs raises significant challenges. More precisely, the networking protocols need to cope with energy constraints while providing precise QoS guarantees. Therefore, enabling QoS applications in sensor networks requires energy and QoS awareness in different layers of the protocol stack. In many of these applications (such as multimedia applications, or real-time and mission-critical applications), the network traffic is a mix of delay-sensitive and delay-tolerant traffic; hence, QoS routing becomes an important issue. In this paper, we propose an Energy-efficient and QoS-aware multipath routing protocol (EQSR for short) that maximizes the network lifetime by balancing energy consumption across multiple nodes, uses the concept of service differentiation to allow delay-sensitive traffic to reach the sink node within an acceptable delay, reduces the end-to-end delay by spreading the traffic across multiple paths, and increases the throughput by introducing data redundancy. EQSR uses the residual energy, node available buffer size, and Signal-to-Noise Ratio (SNR) to predict the best next hop during the path construction phase. Based on the concept of service differentiation, the EQSR protocol employs a queuing model to handle both real-time and non-real-time traffic. By means of simulations, we evaluate and compare the performance of our routing protocol with the MCMP (Multi-Constraint Multi-Path) routing protocol. Simulation results show that our protocol achieves lower average delay, more energy savings, and a higher packet delivery ratio than the MCMP protocol.
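The next-hop prediction can be illustrated with a toy scoring rule. The weighted-sum form and the weights below are assumptions for illustration only; the abstract specifies which quantities feed the choice (residual energy, free buffer, SNR) but not how they are combined:

```python
def next_hop_score(residual_energy, buffer_free, snr, weights=(0.4, 0.3, 0.3)):
    """Hypothetical figure of merit in the spirit of EQSR's path
    construction: combine residual energy, free buffer space, and SNR
    (each normalised to [0, 1]) into one score; higher is better."""
    w_e, w_b, w_s = weights
    return w_e * residual_energy + w_b * buffer_free + w_s * snr

def pick_next_hop(neighbors):
    # neighbors: dict node_id -> (energy, buffer_free, snr), all in [0, 1]
    return max(neighbors, key=lambda n: next_hop_score(*neighbors[n]))
```

A node running this rule would evaluate each neighbour's advertised triple and forward towards the highest-scoring one when constructing a path.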
 
Article
Supporting real-time applications over local-area wireless access networks requires features and mechanisms that are not present in the original IEEE 802.11 standard for WLANs. Therefore, several quality-of-service (QoS) enabling mechanisms have been added to the MAC layer in the new IEEE 802.11e standard. However, the standard does not mandate a specific QoS solution and intentionally leaves it to developers and equipment vendors to devise such schemes. We present a solution that employs the controlled-access mechanisms of 802.11e to provide per-session guaranteed QoS to multimedia sessions. We introduce a framework that centralizes the task of scheduling uplink and downlink flows in the access point through the new concept of virtual packets. We propose a fair scheduler based on generalized processor sharing (GPS), integrated with a traffic shaper, for scheduling controlled (polling) and contention access durations. Through analysis and experiments we demonstrate that our solution provides guaranteed fair access for multimedia sessions over WLANs.
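The GPS-approximating scheduling step can be sketched with virtual finish times, in the style of weighted fair queueing (an illustrative sketch, not the paper's scheduler: all packets are assumed backlogged from the start and the traffic shaper is omitted): each queued packet of a flow with weight w gets a finish tag prev + size/w, and packets are transmitted in increasing tag order:

```python
import heapq

def wfq_order(queues, weights):
    """Order backlogged packets by weighted-fair virtual finish times.
    queues maps flow -> list of packet sizes (in queue order);
    weights maps flow -> its GPS weight w."""
    iters = {f: iter(p) for f, p in queues.items()}
    finish = {f: 0.0 for f in queues}
    heap, order = [], []
    # seed the heap with the head packet of each flow
    for f in queues:
        size = next(iters[f], None)
        if size is not None:
            finish[f] += size / weights[f]
            heapq.heappush(heap, (finish[f], f, size))
    # repeatedly "transmit" the packet with the smallest finish tag
    while heap:
        _, f, size = heapq.heappop(heap)
        order.append((f, size))
        size = next(iters[f], None)
        if size is not None:
            finish[f] += size / weights[f]
            heapq.heappush(heap, (finish[f], f, size))
    return order
```

With queues `{"voice": [100, 100], "data": [300]}` and weights `{"voice": 2, "data": 1}`, both voice packets (tags 50 and 100) go out before the data packet (tag 300), reflecting the voice flow's larger share.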
 
Article
This paper presents a performance model developed for the deployment design of IEEE 802.11s Wireless Mesh Networks (WMN). The model contains seven metrics to analyze the state of WMN, and novel mechanisms to use multiple evaluation criteria in WMN performance optimization. The model can be used with various optimization algorithms. In this work, two example algorithms for channel assignment and minimizing the number of mesh Access Points (APs) have been developed. A prototype has been implemented with Java, evaluated by optimizing a network topology with different criteria and verified with NS-2 simulations. According to the results, multirate operation, interference aware routing, and the use of multiple evaluation criteria are crucial in WMN deployment design. By channel assignment and removing useless APs, the capacity increase in the presented simulations was between 230% and 470% compared to a single channel configuration. At the same time, the coverage was kept high and the traffic distribution fair among the APs.
 
Top-cited authors
Vipin Kumar
  • University of Minnesota Twin Cities
Albert Zomaya
  • The University of Sydney
Giorgos Kollias
  • Purdue University
Karthik Kambatla
  • Purdue University
Xiao Qin
  • Auburn University