The Journal of Supercomputing

Published by Springer Nature

Online ISSN: 1573-0484


Print ISSN: 0920-8542


Fig. 1. A detailed flowchart of the XenoCluster approach  
Fig. 2. UIPTC runtime and efficiency results from Table 3  
Multi-granularity Parallel Computing in a Genome-Scale Molecular Evolution Application
  • Conference Paper
  • Full-text available

August 2009


91 Reads



Terry A Braun




Previously [1], we reported a coarse-grained parallel computational approach to identifying rare molecular evolutionary events often referred to as horizontal gene transfers. Very high degrees of parallelism (up to 65x speedup on 4,096 processors) were reported, yet the overall execution time for a realistic problem size was still on the order of 12 days. With the availability of large numbers of compute clusters, as well as genomic sequence from more than 2,000 species containing as many as 35,000 genes each, and trillions of sequence nucleotides in all, we demonstrated the computational feasibility of a method to examine "clusters" of genes using phylogenetic tree similarity as a distance metric. A full serial solution to this problem requires years of CPU time, yet only makes modest IPC and memory demands; thus, it is an ideal candidate for a grid computing approach involving low-cost compute nodes. This paper now describes a multiple granularity parallelism solution that includes exploitation of multi-core shared memory nodes to address fine-grained aspects in the tree-clustering phase of our previous deployment of XenoCluster 1.0. In addition to benchmarking results that show up to 80% speedup efficiency on 8 CPU cores, we report on the biological accuracy and relevance of our results compared to a reported set of known xenologs in yeast.
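
The efficiency figure quoted above can be related to speedup with the usual definition, efficiency = speedup / number of cores. The short sketch below is only illustrative arithmetic; the function names are hypothetical and are not part of XenoCluster.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return parallel efficiency, defined as speedup divided by core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


def speedup_from_efficiency(efficiency: float, cores: int) -> float:
    """Invert the definition: speedup = efficiency * core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return efficiency * cores


# 80% efficiency on 8 cores, as quoted in the abstract.
implied_speedup = speedup_from_efficiency(0.80, 8)
```

Under this definition, 80% efficiency on 8 cores corresponds to a speedup of about 6.4x over a single core.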

Parallelism of iterative CT reconstruction based on local reconstruction algorithm

April 2009


94 Reads

An iterative algorithm is well suited to reconstructing CT images from noisy or truncated projection data. Its disadvantage is that it requires significant computational time. Although a parallel technique can be used to reduce the computational time, a large amount of communication overhead becomes an obstacle to its performance (Li et al. in J. X-Ray Sci. Technol. 13:1-10, 2005). To overcome this problem, we proposed an innovative parallel method based on the local iterative CT reconstruction algorithm (Wang et al. in Scanning 18:582-588, 1996 and IEEE Trans. Med. Imaging 15(5):657-664, 1996). The object to be reconstructed is partitioned into a number of subregions, which are assigned to different processing elements (PEs). Within each PE, local iterative reconstruction is performed to recover the subregion. Several numerical experiments were conducted on a high performance computing cluster, and the FORBILD head phantom (Lauritsch and Bruder) was used as a benchmark to measure the parallel performance. The experimental results showed that the proposed parallel algorithm significantly reduces the reconstruction time, hence achieving a high speedup and efficiency.

Fig. 1. Convergence curve. Each point represents one combination of parameters. $\log_{10}(\hat{Q}_s)$ vs. $\log_{10}(\hat{F}_s)$
DEEP - Differential Evolution Entirely Parallel Method for Gene Regulatory Networks

August 2011


108 Reads

The Differential Evolution Entirely Parallel (DEEP) method is applied to the biological data fitting problem. We introduce a new migration scheme, in which the best member of a branch replaces the oldest member of the next branch; this scheme provides fast convergence of the algorithm. We analyze the performance and efficiency of the developed algorithm on a test problem of finding the regulatory interactions within the network of gap genes that control the development of the early Drosophila embryo. The parameters of a set of nonlinear differential equations are determined by minimizing the total error between the model behavior and experimental observations. The age of an individual is defined as the number of iterations that the individual has survived without changes. We used a ring topology for the network of computational nodes. The computer codes are available upon request.
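
The migration step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Individual` class, the `migrate_ring` function, the choice to pick all migrants before any replacement, and resetting the migrant's age to zero are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Individual:
    params: List[float]
    error: float  # objective value; lower is better (assumed convention)
    age: int = 0  # iterations survived without changes


def migrate_ring(branches: List[List[Individual]]) -> None:
    """Copy the best member of branch i over the oldest member of branch i+1.

    Branches form a ring, so the last branch sends its best member to the first.
    The branches are modified in place.
    """
    n = len(branches)
    # Pick every migrant before replacing anything, so that a member received
    # in this round is not forwarded again in the same round (an assumption).
    bests = [min(branch, key=lambda ind: ind.error) for branch in branches]
    for i, best in enumerate(bests):
        target = branches[(i + 1) % n]
        oldest = max(range(len(target)), key=lambda j: target[j].age)
        # The migrant is a copy; its age restarts at zero (an assumption).
        target[oldest] = Individual(list(best.params), best.error, age=0)
```

Replacing the oldest member rather than the worst one keeps good but recently changed candidates in the receiving branch, which is one plausible reading of why age is tracked at all.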

Protein Simulation Data in the Relational Model

October 2012


209 Reads

High performance computing is leading to unprecedented volumes of data. Relational databases offer a robust and scalable model for storing and analyzing scientific data. However, these features do not come without a cost: significant design effort is required to build a functional and efficient repository. Modeling protein simulation data in a relational database presents several challenges: the data captured from individual simulations are large and multi-dimensional, and must integrate with both simulation software and external data sites. Here we present the dimensional design and relational implementation of a comprehensive data warehouse for storing and analyzing molecular dynamics simulations using SQL Server.

Table 5. Total SNF execution time (in seconds) for the 1024 × 1024 Landsat TM band 5 image.
Parallel algorithms for image enhancement and segmentation by region growing with an experimental study
Presents efficient and portable implementations of a useful image enhancement process, the symmetric neighborhood filter (SNF), and an image segmentation technique which makes use of the SNF and a variant of the conventional connected components algorithm which we call δ-connected components. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. The image segmentation algorithm makes use of an efficient connected components algorithm based on a novel approach for parallel merging. The algorithms have been coded in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, Intel Paragon, and workstation clusters. Our experimental results are consistent with the theoretical analysis (and provide the best known execution times for segmentation, even when compared with machine-specific implementations). Our test data include difficult images from the Landsat Thematic Mapper (TM) satellite data.

Supporting the sockets interface over user-level communication architecture: Design issues and performance comparisons

July 2005


42 Reads

Since user-level communication (ULC) architecture provides only primitive operations for application programmers, there have been several efforts to build a portable and standard communication interface, such as sockets, on top of ULC architecture. There are three basic approaches to supporting the sockets interface over ULC architecture: LAN emulation, user-level sockets, and kernel-level sockets. The primary objective of this paper is to compare these approaches in terms of their design, implementation, and performance. We have developed and implemented a kernel-level sockets layer over ULC architecture, since there is currently no available implementation. We also present different design and implementation decisions on data receiving, data sending, connection management, etc. in the three approaches. Through the performance comparison, we show that the LAN emulation approach exhibits the worst performance in both latency and bandwidth. Our experiments also show that user-level sockets are useful for latency-sensitive applications and that kernel-level sockets are effective for applications which require high bandwidth and full compatibility with the legacy sockets interface.

n-dimensional processor arrays with optical dBuses

November 1995


16 Reads

dBus-array(k,n) is an n-dimensional processor array of k^n nodes connected via k^(n-1) dBuses. A dBus is a unidirectional bus which receives signals from a set of n nodes (input set), and transmits signals to a different set of n nodes (output set). Two optical implementations of the dBus-array(k,n) are discussed. One implementation uses wavelength division multiplexing, as in the wavelength division multiple access channel hypercube WMCH (P.W. Dowd, 1992). WMCH(k,n) and dBus-array(k,n) have the same diameter and about the same average internode distance, while the dBus-array requires only one tunable transmitter/receiver per node, compared to n tunable transmitters/receivers per node for the WMCH. The other implementation uses one fixed wavelength transmitter/receiver per node and the dilated slipped banyan switching network (DSB) (R.A. Thompson, 1991) to combine time division and wavelength division multiplexing (G. Lin et al., 1994).

Dynamic Load Balancing Computation of Pulses Propagating in a Nonlinear Medium

February 2002


28 Reads

The aim of this work is to present an efficient parallel approach for the numerical computation of pulse propagation in nonlinear dispersive optical media. We consider the nonlinear Maxwell's equations associated with the modeling of the residual susceptibilities. The numerical approach is based on the finite difference time domain (FDTD) method, developed in a system of coordinates moving with the group velocity of the main pulse. In order to reduce the computation time, the size of the window is defined dynamically. However, for high frequency pulses propagating in a large domain, the computation time is still prohibitive, particularly for 2D and 3D computations. Parallelization is therefore a way to develop an efficient approach. We present two parallel strategies, developed in the message passing framework. The first approach is based on a static load distribution, and the associated communication structures are very simple. However, in this case the equivalent global load is increased compared to the optimal sequential computations. The second parallel approach preserves the global load of the optimal sequential computations. In this case, we have developed a load re-balancing strategy using specific communication structures. The parallel strategies are developed in the one-dimensional case, and their extension to specific multidimensional cases is straightforward. The efficiency of the parallel approaches is investigated with the computation of the second harmonic generation in a KDP type crystal. In a sequential context, we have investigated the self-focusing process in a nonlinear Kerr medium.

Shared memory vs. message passing: the COMOPS benchmark experiment

February 1998


12 Reads

The paper presents a comparison of the COMOPS benchmark performance in MPI and shared memory on three different shared memory platforms: the DEC AlphaServer 8400/300, the SGI Power Challenge, and the HP Convex Exemplar SPP1600. The paper also qualitatively analyzes the obtained performance data based on an understanding of the corresponding architecture and the MPI implementations. Some conclusions are drawn about the interprocessor communication performance on these three shared memory platforms.

Conventional benchmarks as a sample of the performance spectrum

February 1998


1,690 Reads

Most benchmarks are smaller than actual application programs. One reason is to improve benchmark universality by demanding resources every computer is likely to have. But users dynamically increase the size of application programs to match the power available, whereas most benchmarks are static and of a size appropriate for computers available when the benchmark was created; this is particularly true for parallel computers. Thus, the benchmark overstates computer performance since smaller problems spend more time in cache. Scalable benchmarks, such as HINT, examine the full spectrum of performance through various memory regimes, and express a superset of the information given by any particular fixed size benchmark. Using 5000 experimental measurements, we have found that performance on the NAS Parallel Benchmarks, SPEC, LINPACK, and other benchmarks is predicted accurately by subsets of the HINT performance curve. Correlations are typically better than 0.995, and predicted ranking is often perfect.

An Enhanced Parallel Loop Self-Scheduling Scheme for Cluster Environments

April 2005


26 Reads

In this paper, a parallel loop self-scheduling scheme for heterogeneous PC cluster systems is proposed. Though the proposed scheme does allow users to choose parameters before the execution initialization phase, there are still weaknesses that motivate us to go further with new improvements in that scheme. For instance, a decision on a fixed and monotonous parameter, made using previous input information, can easily lead to an invalid schedule. Thus, this paper proposes a new scheme in which the scheduling parameter can be adjusted dynamically to fit most widely available computer systems, in order to provide higher overall performance.
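
For readers unfamiliar with loop self-scheduling, the sketch below shows the classic guided self-scheduling rule, in which each idle worker takes a chunk proportional to the iterations still remaining. It is not the scheme proposed in the paper; the function name and the tunable `alpha` parameter are hypothetical and stand in for a generic scheduling parameter.

```python
import math
from typing import Iterator, Tuple


def guided_chunks(total_iters: int, workers: int,
                  alpha: float = 1.0) -> Iterator[Tuple[int, int]]:
    """Yield (start, size) chunks for guided self-scheduling.

    Each chunk holds ceil(remaining / (alpha * workers)) iterations, so chunks
    shrink as the loop nears completion. A larger alpha gives smaller chunks.
    """
    if workers <= 0 or alpha <= 0:
        raise ValueError("workers and alpha must be positive")
    start = 0
    while start < total_iters:
        remaining = total_iters - start
        size = max(1, math.ceil(remaining / (alpha * workers)))
        yield start, size
        start += size


# Example: chunk sizes handed out for 100 iterations and 4 workers.
sizes = [size for _, size in guided_chunks(100, 4)]
```

Large early chunks keep scheduling overhead low, while small late chunks let slower nodes finish at roughly the same time as faster ones. A scheme that adjusts its parameter at run time would change something like `alpha` between requests instead of fixing it up front.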

A Parallel Algorithm to Reconstruct Bounding Surfaces in 3D Images

January 1998


22 Reads

The growing size of 3D digital images causes sequential algorithms to be less and less usable on whole images, and a parallelization of these algorithms is often required. We have developed an algorithm named Sewing Faces which synthesizes both geometrical and topological information on the bounding surface of 6-connected 3D objects. We call such combined information a skin. In this paper we present a parallelization of Sewing Faces. It is based on splitting 3D images into several sub-blocks. When all the sub-blocks have been processed, a gluing step merges all the sub-skins to obtain the final skin. Moreover, we propose a fine-grain approach where each sub-block is processed by several parallel processors.

Performance comparison of the CRAY-2 and CRAY X-MP/416 supercomputers

June 1990


20 Reads

The serial and parallel performance of one of the world's fastest general purpose computers, the CRAY-2, is analyzed using the standard Los Alamos Benchmark Set plus codes adapted for parallel processing. For comparison, architectural and performance data are also given for the CRAY X-MP/416. Factors affecting performance, such as memory bandwidth, size and access speed of memory, and software exploitation of hardware, are examined. The parallel processing environments of both machines are evaluated, and speedup measurements for the parallel codes are given.

LABILE: Link quAlity-based lexIcaL routing mEtric for reactive routing protocols in IEEE 802.15.4 networks

January 2010


446 Reads

In this paper, we propose a lexical routing metric to enable path-wise link quality-aware routing in Wireless Sensor Networks (WSNs). The realization of this routing metric is achieved by applying the indexing techniques of formal language processing in multi-metric route classification and cost evaluation for WSNs. In particular, the metric is motivated by the fact that IEEE 802.15.4 networks are formed on links with highly variable quality, and selection of poor quality links degrades the data delivery considerably. Although IEEE 802.15.4 supports link-level awareness through LQI, (a) it remains unknown at the path level, and therefore (b) its processing is link-local in scope. We propose LABILE, a composite routing metric implemented by modifying the RREQ and RREP structures of AODV to capture and convey two-state link information to the destination, where it is processed through an easy-to-compute routing lexicon and a corresponding lexical algorithm for path selection when multiple paths are available. Multiple paths are represented in a path space through a proposed cost model that encompasses hop-count, weak links and a weakness factor for each link, depending upon a thresholding mechanism which in turn declares a link to be either healthy (aka usable) or weak (aka unusable). Using the weakness factor, the success probability of data delivery over a weak link turns out to be a fraction of the success probability over a healthy link. The mathematical model along with the experiments done for LABILE show that the proposed composite metric and its parsing scheme together achieve link robustness consistently by evading link failures, as the number of weak links is reduced in the lexical path compared to the hop-count-based metric. Increased data delivery is obtained by preventing retransmissions on failure-prone links.
Keywords: IEEE 802.15.4; Routing metric; Lexical ordering; Wireless sensor networks; Language

Table 4. Fraction of time spent in subroutines of SGETRF NAS Strassen, NB = 512.
Using Strassen's Algorithm to Accelerate the Solution of Linear Systems

March 1990


666 Reads

Strassen's algorithm for fast matrix-matrix multiplication has been implemented for matrices of arbitrary shapes on the Cray-2 and Cray Y-MP supercomputers. A number of techniques have been used to reduce the scratch space requirement for this algorithm, at the same time preserving a high level of performance. When the resulting Strassen-based matrix multiply routine is combined with some routines from the new LAPACK library, LU decomposition can be performed with rates significantly higher than by conventional means. We succeeded in factoring a 2048 x 2048 matrix on the Cray Y-MP at a rate equivalent to 325 MFLOPS.
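
As a reminder of what Strassen's method does, the sketch below multiplies two square matrices using seven recursive block products instead of eight. It is a textbook version written for clarity, assuming square matrices whose size is a power of two and plain Python lists; it does not reproduce the arbitrary-shape handling or the scratch-space reductions described in the abstract. All function names are chosen for this example.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a: Matrix, b: Matrix) -> Matrix:
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def naive_mul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def split(m: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """Split a matrix into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a: Matrix, b: Matrix, cutoff: int = 2) -> Matrix:
    """Multiply n x n matrices (n a power of two) with Strassen's recursion."""
    n = len(a)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("matrix size must be a positive power of two")
    if n <= cutoff:
        return naive_mul(a, b)  # conventional product for small blocks
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # The seven Strassen products.
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    # Recombine into the four quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

The `cutoff` argument switches to the conventional product for small blocks, which is the usual way to keep the extra additions from outweighing the saved multiplication.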

Figure 2 Performance (GFLOPS) attained by matrix inversion on peco.
Using graphics processors to accelerate the computation of the matrix inverse

December 2011


269 Reads

We study the use of massively parallel architectures for computing a matrix inverse. Two different algorithms are reviewed, the traditional approach based on Gaussian elimination and the Gauss–Jordan elimination alternative, and several high performance implementations are presented and evaluated. The target architecture is a current general-purpose multicore processor (CPU) connected to a graphics processor (GPU). Numerical experiments show the efficiency attained by the proposed implementations and how the computation of large-scale inverses, which only a few years ago would have required a distributed-memory cluster, takes only a few minutes on a hybrid architecture formed by a multicore CPU and a GPU.
Keywords: Linear algebra; Matrix inversion; Graphics processors
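
The Gauss–Jordan alternative mentioned above can be illustrated with a small sequential sketch. It works on the augmented matrix [A | I] with exact rational arithmetic and row pivoting; it is meant only to show the elimination pattern and says nothing about the blocked CPU/GPU implementations evaluated in the paper. The function name is chosen for this example.

```python
from fractions import Fraction
from typing import List, Sequence


def gauss_jordan_inverse(a: Sequence[Sequence[int]]) -> List[List[Fraction]]:
    """Invert a square matrix by Gauss-Jordan elimination on [A | I]."""
    n = len(a)
    aug = [
        [Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(a)
    ]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate the column from every other row, above and below.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]
```

Unlike LU-based inversion, every step here updates all rows at once, which is the property that makes the method attractive on wide parallel hardware.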

Towards an RCC-Based Accelerator for Computational Fluid Dynamics Applications

December 2004


34 Reads

Computational Fluid Dynamics (CFD) applications are a critical tool in designing sophisticated mechanical systems such as jet engines and gas turbines. CFD applications use intensive floating-point calculations and are typically run on High-Performance Computing (HPC) systems. We analyze three of the most compute intensive functions (Euler, Viscous, and Smoothing algorithms) and develop a baseline system architecture for accelerating these functions in RCC hardware. We then present detailed design data for the most compute intensive (Euler) function. Based on this analysis, we show that an RCC-based CFD accelerator—compared to conventional processors—promises dramatic improvement in sustained compute speed at better price-performance ratios coupled with much lower overall power consumption.

Role-based access control for a Grid system using OGSA-DAI and Shibboleth

November 2010


48 Reads

In this paper, we propose a new role-based access control (RBAC) system for Grid data resources in the Open Grid Services Architecture Data Access and Integration (OGSA-DAI). OGSA-DAI is a widely used framework for integrating data resources in Grids. However, OGSA-DAI’s identity-based access control causes substantial administration overhead for the resource providers in virtual organizations (VOs) because of the direct mapping between individual Grid users and the privileges on the resources. To solve this problem, we used Shibboleth, an attribute authorization service, to support RBAC within OGSA-DAI. In addition, access control policies need to be specified and managed across multiple VOs. For the specification of access control policies, we used the Core and Hierarchical RBAC profile of the eXtensible Access Control Markup Language (XACML); and for distributed administration of those policies and the user-role assignments, we used the Object, Metadata and Artifacts Registry (OMAR). OMAR is based on the e-business eXtensible Markup Language (ebXML) registry specifications developed to achieve interoperable registries and repositories. Our RBAC system provides scalable and fine-grain access control and allows privacy protection. It also supports dynamic delegation of rights and user-role assignments, and reduces the administration overheads for the resource providers because they need to maintain only the mapping information from VO roles to local database roles. Moreover, unnecessary mapping and connections can be avoided by denying invalid requests at the VO level. Performance analysis shows that our RBAC system adds only a small overhead to the existing security infrastructure of OGSA-DAI.
Keywords: Open Grid Services Architecture Data Access and Integration (OGSA-DAI); Grid data resources; Virtual organization (VO); Shibboleth; Object, Metadata and Artifacts Registry (OMAR); Role-based access control (RBAC); eXtensible Access Control Markup Language (XACML)

Embedded access points for trusted data and resources access in HPC systems

January 2011


25 Reads

Biometric authentication systems represent a valid alternative to the conventional username–password based approach for user authentication. However, authentication systems composed of a biometric reader, a smartcard reader, and a networked workstation which perform user authentication via software algorithms have been found to be vulnerable in two areas: firstly in the communication channels between readers and workstation (communication attacks), and secondly through overriding of their processing algorithms and/or matching results (replay attacks, confidentiality and integrity threats related to the stored information of the networked workstation). In this paper, a full hardware access point for HPC environments is proposed. The access point is composed of a fingerprint scanner, a smartcard reader, and a hardware core for fingerprint processing and matching. The hardware processing core is described in Handel-C, an algorithmic-like hardware programming language, and prototyped on a Field Programmable Gate Array (FPGA) based board. The well-known False Acceptance Rate (FAR) and False Rejection Rate (FRR) indexes have been used to test the authentication accuracy of the prototype. Experimental trials conducted on several fingerprint DBs show that the hardware prototype achieves a working point with FAR=1.07% and FRR=8.33% on a proprietary DB which was acquired via a capacitive scanner, a working point with FAR=0.66% and FRR=6.13% on a proprietary DB which was acquired via an optical scanner, and a working point with FAR=1.52% and FRR=9.64% on the official FVC2002_DB2B database. In the best case scenario (depending on fingerprint image size), the execution time of the proposed recognizer is 183.32 ms.
Keywords: Trusted authentication; Embedded biometric authentication systems; Security solutions for user authentication
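
The two accuracy indexes reported above have standard definitions: FAR is the fraction of impostor attempts that are wrongly accepted, and FRR is the fraction of genuine attempts that are wrongly rejected. The sketch below computes both for one decision threshold. The similarity-score convention (accept when the score is at or above the threshold), the function name, and the sample scores are assumptions for illustration and are not data from the paper.

```python
from typing import Sequence, Tuple


def far_frr(genuine_scores: Sequence[float],
            impostor_scores: Sequence[float],
            threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) for a matcher that accepts when score >= threshold."""
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Made-up similarity scores, used only to exercise the function.
genuine = [0.9, 0.8, 0.4, 0.7]
impostor = [0.1, 0.6, 0.3, 0.2, 0.5]
far, frr = far_frr(genuine, impostor, threshold=0.55)
```

Raising the threshold lowers FAR and raises FRR, so a reported "working point" is one particular threshold on that trade-off curve.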

Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers

June 1996


57 Reads

Portability, efficiency, and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes an approach, called Global Arrays (GAs), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GAs is that they provide a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. We have implemented the GA library on a variety of computer systems, including the Intel Delta and Paragon, the IBM SP-1 and SP-2 (all message passers), the Kendall Square Research KSR-1/2 and the Convex SPP-1200 (nonuniform access shared-memory machines), the CRAY T3D (a globally addressable distributed-memory computer), and networks of UNIX workstations. We discuss the design and implementation of these libraries, report their performance, illustrate the use of GAs in the context of computational chemistry applications, and describe the use of a GA performance visualization tool.

An interference-Aware multichannel media access control protocol for wireless sensor networks

October 2008


35 Reads

Scalable Parallel Systems (SPS) have offered a challenging model of computing and pose fascinating optimization problems in sensor networks. With the development of sensor hardware technology, a sensor node can be equipped with a radio transceiver that can be tuned to work on multiple channels. In this paper, we develop a novel interference-aware multichannel media access control (IMMAC) protocol for wireless sensor networks, which takes advantage of multichannel availability. Firstly, each node is assigned a quiescent channel beforehand to reduce hidden terminal problems, and it then adjusts its channel according to dynamic traffic. Secondly, a scalable multichannel media access control protocol is designed to make a tradeoff between channel switching overhead and fairness, and it effectively supports node unicast and broadcast based on receiver-directed channel switching. We have implemented a simulation to evaluate the performance of IMMAC by comparing it with other relevant protocols. The results show that our protocol performs better, using multiple channels for parallel transmission and effectively reducing hidden terminal problems in resource-constrained wireless sensor networks.

Distributed scheduling algorithms for channel access in TDMA wireless mesh networks

July 2008


34 Reads

In this paper, we have considered the distributed scheduling problem for channel access in TDMA wireless mesh networks. The problem is to assign time-slot(s) for nodes to access the channels such that nodes are guaranteed to communicate with all their one-hop neighbors in the assigned time-slot(s). The objective is to minimize the cycle length, i.e., the total number of different time-slots in one scheduling cycle. In single-channel ad hoc networks, the best known result for this problem is proved to be K^2 in arbitrary graphs (Chlamtac and Pinter in IEEE Trans. Comput. C-36(6):729–737, 1987) and 25K in unit disk graphs (http://pdfserv.maxim-ic.com/en/ds/MAX2820-MAX2821.pdf), with K as the maximum node degree. There are multiple channels in wireless mesh networks, and different nodes can use different control channels to reduce congestion on the control channels. In this paper, we have considered two scheduling models for wireless mesh networks. The first model is that each node has two radios, and the scheduling is simultaneously done on the two radios. We have proved that the upper bound of the cycle length in arbitrary graphs can be 2K. The second model is that the time-slots are scheduled for the nodes regardless of the number of radios on them. In this case, we have proved that the upper bound can be (4K−2). We have also proposed greedy algorithms with different criteria. The basic idea of these algorithms is to order the conflicting nodes by a specific criterion, such as node identification, node degree, or the number of conflicting neighbors. A node cannot be assigned a time-slot until all neighbor nodes that rank higher under the criterion and might conflict with it have already been assigned time-slot(s). All these algorithms are fully distributed and easy to realize. Simulations are also done to verify the performance of these algorithms.
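
The greedy idea described above can be sketched as a sequential simulation: nodes are visited in decreasing order of a priority criterion, and each takes the smallest slot not already used by a conflicting node. The conflict model (two nodes conflict when they are within two hops), the function names, and the single-slot-per-node simplification are assumptions made for this example; the paper's own models and distributed message exchange are not reproduced.

```python
from typing import Callable, Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def two_hop_conflicts(adj: Graph) -> Graph:
    """Map each node to the nodes within two hops of it (assumed conflict set)."""
    conflicts: Graph = {}
    for u, nbrs in adj.items():
        c = set(nbrs)
        for v in nbrs:
            c |= adj[v]
        c.discard(u)
        conflicts[u] = c
    return conflicts


def greedy_slots(adj: Graph,
                 priority: Callable[[Hashable], int]) -> Dict[Hashable, int]:
    """Assign one time-slot per node, highest-priority nodes first."""
    conflicts = two_hop_conflicts(adj)
    slots: Dict[Hashable, int] = {}
    for u in sorted(adj, key=priority, reverse=True):
        # Every conflicting node with higher priority already has a slot.
        used = {slots[v] for v in conflicts[u] if v in slots}
        slot = 0
        while slot in used:
            slot += 1
        slots[u] = slot
    return slots


# A four-node path 0-1-2-3, with the node identifier as the criterion.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
schedule = greedy_slots(path, priority=lambda node: node)
cycle_length = max(schedule.values()) + 1
```

Swapping the `priority` function for node degree or number of conflicting neighbors gives the other criteria named in the abstract without changing the rest of the procedure.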

Access Grid technology in classroom and research environments

January 2007


12 Reads

Global communication is essential to industry, research and education. The Access Grid (AG) is a suite of hardware, software, and tools to facilitate communication and collaboration over the Internet. These resources are used at over 500 institutions worldwide to support group-to-group interactions across the Grid including collaborative research work sessions, tutorials, lectures, large-scale distributed meetings and training. This paper will provide an overview of the technology to encourage professionals to integrate benefits and tools of the Grid into their instruction and research. Furthermore, this paper will compare this new technology to more traditional videoconferencing and distributed collaborative working environments. Lastly, it will present issues and challenges that must be addressed to incorporate this momentous technology within the classroom and for collaboration throughout the world.

Parallel Large Scale High Accuracy Navier-Stokes Computations on Distributed Memory Clusters

January 2004


14 Reads

We present a highly scalable parallelization of a high-accuracy 3D serial multiblock Navier-Stokes solver. The code solves the full Navier-Stokes equations and is capable of performing large-scale computations for practical configurations in an industrial environment. The parallelization strategy is based on the geometrical domain decomposition principle, and on the overlapped communication and computation concept. The important advantage of the strategy is that the suggested type of message-passing ensures a very high scalability of the algorithm from the network point of view, because, on average, the communication work per processor does not increase as the number of processors increases. The parallel multiblock-structured Navier-Stokes solver, based on the parallel virtual machine (PVM) routines, was implemented on a 106-processor distributed-memory cluster managed by the MOSIX software package. Analysis of the results demonstrated a high level of parallel efficiency (speed up) of the computational algorithm. This allowed the reduction of the execution time for large-scale computations employing 10 million grid points from an estimated 46 days on the SGI ORIGIN 2000 computer (in the serial single-user mode) to 5–6 hours on the 106-processor cluster. Thus, the parallel multiblock full Navier-Stokes code can be successfully used for large-scale practical aerodynamic simulations of a complete aircraft on multimillion-point grids on a daily basis, as needed in industry.

TCP CAE: An improved congestion control using comparative ACK-based estimator

January 2011


17 Reads

TCP receivers deliver ACK packets to senders for reliable end-to-end transfer. However, due to network congestion in the backward direction, ACK packets may not be successfully transferred, which degrades TCP performance. To overcome this problem, this paper proposes a reverse congestion warning mechanism and a congestion handling mechanism for heterogeneous networks with heavy background traffic in the backward direction. In the proposed scheme, senders detect the reverse direction congestion and execute an exponential backoff algorithm in advance instead of waiting for RTO expiration. According to simulation results obtained with the NS-2 network simulator, the proposed scheme improves performance by 20% over Reno, 150% over New Reno, and 450% over Westwood in heterogeneous networks where the error rate of the radio link is 1% and the backward network is congested.
Keywords: Transmission control protocol; Congestion; Wireless networks; Reverse traffic
