Concurrency and Computation Practice and Experience

Published by Wiley
Online ISSN: 1532-0634
Article
Image segmentation is a very important step in the computerized analysis of digital images. The maxflow-mincut approach has been successfully used to obtain minimum-energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB, and hardware synchronization for each 64-bit word. It is thus well suited to the parallelization of graph-theoretic algorithms such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture to image analysis in a production setting. The largest images we have run are 32000² pixels in size, well beyond the largest previously reported in the literature.
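For reference, the sequential skeleton of the preflow-push (push-relabel) method is compact. The sketch below is an illustrative plain-Python version with dense adjacency and made-up names, not the paper's XMT-2 code; it shows the push and relabel operations that the parallel implementation distributes across threads:

```python
def preflow_push_maxflow(capacity, source, sink):
    """Minimal sequential Goldberg-Tarjan preflow-push sketch.
    capacity: dict-of-dicts {u: {v: cap}}; returns the max-flow value."""
    nodes = set(capacity)
    for u in capacity:
        nodes.update(capacity[u])
    nodes = list(nodes)
    n = len(nodes)
    # residual capacities, heights, and per-node excess flow
    res = {u: dict(capacity.get(u, {})) for u in nodes}
    for u in nodes:
        for v in nodes:
            res[u].setdefault(v, 0)
    height = {u: 0 for u in nodes}
    excess = {u: 0 for u in nodes}
    height[source] = n
    for v in nodes:                      # saturate all edges out of the source
        if res[source][v] > 0:
            excess[v] += res[source][v]
            res[v][source] += res[source][v]
            res[source][v] = 0
    active = [u for u in nodes if u not in (source, sink) and excess[u] > 0]
    while active:
        u = active[0]
        pushed = False
        for v in nodes:                  # push along admissible residual edges
            if res[u][v] > 0 and height[u] == height[v] + 1:
                d = min(excess[u], res[u][v])
                res[u][v] -= d
                res[v][u] += d
                excess[u] -= d
                excess[v] += d
                if v not in (source, sink) and v not in active:
                    active.append(v)
                pushed = True
                if excess[u] == 0:
                    break
        if not pushed:                   # relabel: lift u above its lowest neighbour
            height[u] = 1 + min(height[v] for v in nodes if res[u][v] > 0)
        if excess[u] == 0:
            active.pop(0)
    return excess[sink]
```

The per-node push and relabel steps touch only local state, which is what makes the method amenable to the fine-grained word-level synchronization the XMT-2 provides.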
 
Figures: the SPRINT framework architecture as described in [9]; how permutations are distributed among the available processes; pmaxT speed-up on the various systems.
Article
The statistical language R and its Bioconductor package are favoured by many biostatisticians for processing microarray data. The amount of data produced by some analyses has reached the limits of many common bioinformatics computing infrastructures. High Performance Computing systems offer a solution to this issue. The Simple Parallel R Interface (SPRINT) is a package that provides biostatisticians with easy access to High Performance Computing systems and allows the addition of parallelized functions to R. Previous work has established that the SPRINT implementation of an R permutation testing function has close to optimal scaling on up to 512 processors on a supercomputer. Access to supercomputers, however, is not always possible, and so the work presented here compares the performance of the SPRINT implementation on a supercomputer with benchmarks on a range of platforms including cloud resources and a common desktop machine with multiprocessing capabilities. Copyright © 2011 John Wiley & Sons, Ltd.
 
Conference Paper
The distributed real-time architecture of an embedded system is often described as a set of communicating components. Such a system is data-flow oriented (in its description) and time-triggered (in its execution). This work addresses these issues and focuses on checking the temporal compatibility of a set of interdependent data used by the system components. The architecture of a component-based system forms a graph of communicating components, in which more than one path can link two components. These paths may have different timing characteristics, but the flows of information that transit along them may need to be adequately matched, so that a component uses inputs that all (directly or indirectly) depend on the same production step. In this paper, we define this temporal data-matching property, show how to analyze the architecture to detect situations that cause data-matching inconsistencies, and describe an approach to managing data matching that uses queues to delay overly fast paths and timestamps to recognize consistent data.
 
Conference Paper
A method of learning adaptation rules for case-based reasoning (CBR) is proposed in this paper. Adaptation rules are generated from the case base with the guidance of domain knowledge that is also extracted from the case base. The adaptation rules are refined before they are applied in the revision process. After each new problem is solved, the adaptation rule set is updated by an evolution module in the retention process. The results of a preliminary experiment show that the adaptation rules obtained can improve the performance of a CBR system compared with a retrieval-only CBR system.
 
Conference Paper
It is becoming increasingly difficult to implement effective systems for preventing network attacks, due to the combination of (1) the rising sophistication of attacks, requiring more complex analysis to detect, (2) the relentless growth in the volume of network traffic that we must analyze, and, critically, (3) the failure in recent years of uniprocessor performance to sustain the exponential gains that CPUs enjoyed for so many years ("Moore's Law"). For commodity hardware, tomorrow's performance gains will instead come from multicore architectures in which a whole set of CPUs executes concurrently. Taking advantage of the full power of multicore processors for network intrusion prevention requires an in-depth approach. In this work we frame an architecture customized for parallel execution of network attack analysis. At the lowest layer of the architecture is an "Active Network Interface" (ANI), a custom device based on an inexpensive FPGA platform. The ANI provides the inline interface to the network, reading in packets and forwarding them after they are approved. It also serves as the front-end for dispatching copies of the packets to a set of analysis threads. The analysis itself is structured as an event-based system, which allows us to find many opportunities for concurrent execution, since events introduce a natural, decoupled asynchrony into the flow of analysis while still maintaining good cache locality. Finally, by associating events with the packets that ultimately stimulated them, we can determine when all analysis for a given packet has completed, and thus that it is safe to forward the pending packet, provided that none of the analysis elements previously signaled that the packet should instead be discarded.
 
Conference Paper
We present the parallelization of a sparse grid finite element discretization of the Black-Scholes equation, which is commonly used for option pricing. Sparse grids make it possible to handle higher-dimensional options than classical approaches on full grids, and can be extended to a fully adaptive discretization method. We introduce the algorithmic structure of efficient algorithms operating on sparse grids, and demonstrate how they can be used to derive an efficient OpenMP parallelization of the Black-Scholes solver. We show results on different commodity hardware systems based on multi-core architectures with up to 8 cores, and discuss the parallel performance using Intel and AMD CPUs.
 
Figure: task dependency DAG.
Conference Paper
Data-oriented workflows are often used in scientific applications for executing a set of dependent tasks across multiple computers. We discuss how these can be modeled using lambda calculus, and how ideas from functional programming are applicable in the design of workflows. Such an approach avoids the restrictions often found in workflow languages, permitting the implementation of complex application logic and data manipulation. This paper explains why lambda calculus is an appropriate model for workflow representation, and how a suitably efficient implementation can provide a wide range of capabilities to developers. The presented approach also permits high-level workflow features to be implemented at user level, in terms of a small set of low-level primitives provided by the language implementation.
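The core idea, that a workflow is just an expression in which tasks are functions and dependencies follow from data flow, can be illustrated in a few lines. The task names below are invented for illustration and are not the paper's API:

```python
from functools import reduce

def fetch(x):
    """Stand-in for a data-producing workflow task."""
    return list(range(x))

def transform(data):
    """Stand-in for a data-manipulation task."""
    return [v * v for v in data]

def summarize(data):
    """Stand-in for an aggregating task."""
    return sum(data)

def pipeline(*stages):
    """Compose stages left to right: pipeline(f, g)(x) == g(f(x)).
    The dependency between stages is implied by function application,
    not declared in a separate workflow language."""
    return lambda x: reduce(lambda acc, f: f(acc), stages, x)

workflow = pipeline(fetch, transform, summarize)
```

Because the workflow is an ordinary expression, complex application logic (conditionals, higher-order combinators, partial application) comes for free from the host language.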
 
Conference Paper
Automatic construction of workflows on the Grid is currently a hot research topic. The problems that have to be solved are manifold: How can existing services be integrated into a workflow that is able to accomplish a specific task? How can an optimal workflow be constructed with respect to resource characteristics that change during the optimization process? How can dynamically changing or incomplete knowledge of the goal function of the optimization be handled? And, finally, how should service failures during workflow execution be dealt with? In this paper we propose a method to optimize a workflow based on a heuristic A* approach that makes it possible to react to dynamics in the environment, such as changes in the Grid infrastructure and in the users' requirements during the optimization process, and failing resources during execution.
 
Figure: required rails as a function of the number of nodes for both static allocation algorithms. Table: livelock avoidance state table.
Conference Paper
Using multiple independent networks (also known as rails) is an emerging technique used to overcome bandwidth limitations and enhance the fault tolerance of current high-performance parallel computers. In this paper, we present and analyze various algorithms to allocate multiple communication rails, including static and dynamic allocation schemes. An analytical lower bound on the number of rails required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various algorithms in terms of bandwidth and latency. We show that striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load and allocation scheme. The methods compared include a static rail allocation, a basic round-robin rail allocation, a local-dynamic allocation based on local knowledge and a dynamic rail allocation that reserves both communication endpoints of a message before sending it. The last method is shown to perform better than the others at higher loads: up to 49% better than local-knowledge allocation and 37% better than round-robin allocation. This allocation scheme also shows lower latency and saturates at higher loads (for long enough messages). Most importantly, the proposed allocation scheme scales well with the number of rails and message size. In addition, we propose a hybrid algorithm that combines the benefits of the local-dynamic allocation for short messages with those of the dynamic algorithm for large messages.
 
Conference Paper
Next generation e-learning platforms should support cooperative use of geographically distributed computing and educational resources as an aggregated environment to provide new levels of flexibility and extensibility. In this overall framework, our activity addresses the definition and implementation of advanced multimedia services for an aggregated grid-based e-learning environment, as well as the design and experimentation of a content distribution and multimedia streaming infrastructure in light of edge device heterogeneity, mobility, content adaptation and scalability. In this paper we initially present the general objectives and requirements that we are taking into account in the development of a multimedia access service for an e-learning platform. Then we describe a partial system prototype which capitalizes upon traditional features of grid computing like providing access to heterogeneous resources and services of different administrative domains in a transparent and secure way. Moreover, our system takes advantage of recent proposals by the Global Grid Forum (GGF) aiming at a standard for service-oriented architectures based on the concept of grid service.
 
Conference Paper
The paper focuses on the formal aspects of the DIPLODOCUS environment. DIPLODOCUS is a UML profile intended for the modeling and verification of real-time and embedded applications meant to be executed on complex Systems-on-Chip. Application tasks and architectural elements (e.g., CPUs, bus, memories) are described with a UML-based language, using an open-source toolkit named TTool. Those descriptions may be automatically transformed into a formal hardware and software specification. From that specification, model-checking techniques may be applied to evaluate several properties of the system, e.g., safety, schedulability, and performance properties. The approach is exemplified with an MPEG2 decoding application.
 
Conference Paper
Due to the increasing number of real-time high-performance applications, such as control systems, autonomous robots, and financial systems, scheduling these real-time applications on HPC resources has become an important problem. This paper presents a novel real-time multiprocessor scheduling algorithm, called Notional Approximation for Balancing Load Residues (NABLR), which heuristically selects tasks for execution by taking into account their residual loads and laxities. The NABLR schedule is created by considering a sequence of inter-arrival intervals (IAI) between two consecutive job arrivals of any task and using a heuristic to carefully plan task execution so as to fully utilize the available resources in each of these intervals and avoid deadline misses as much as possible. Performance evaluation shows that NABLR outperforms previously known efficient algorithms (i.e., EDF and EDZL) in successfully scheduling sets of tasks whose total utilization equals the available resource capacity, performing the closest to optimal algorithms such as LLREF and Pfair. Out of 2500 randomly selected high-utilization task sets, NABLR can schedule up to 97.9% of the sets, versus 63.2% for the best known efficient algorithms. The overheads of the NABLR schedule are significantly smaller than those of optimal schedules (on average 80.57% fewer preemptions and migrations and 75.52% fewer scheduler invocations than LLREF) and comparable to those of efficient suboptimal schedules (fewer or nearly the same number of invocations as EDZL and ASEDZL, and only 0.12% more preemptions/migrations than ASEDZL). NABLR has the same O(N log N) time complexity as other previously proposed efficient algorithms.
 
Article
We consider a task graph to be executed on a set of processors. We assume that the mapping is given, say by an ordered list of tasks to execute on each processor, and we aim at optimizing the energy consumption while enforcing a prescribed bound on the execution time. While it is not possible to change the allocation of a task, it is possible to change its speed. Rather than using a local approach such as backfilling, we consider the problem as a whole and study the impact of several speed variation models on its complexity. For continuous speeds, we give a closed-form formula for trees and series-parallel graphs, and we cast the problem into a geometric programming problem for general directed acyclic graphs. We show that the classical dynamic voltage and frequency scaling (DVFS) model with discrete modes leads to an NP-complete problem, even if the modes are regularly distributed (an important particular case in practice, which we analyze as the incremental model). In contrast, the VDD-hopping model leads to a polynomial solution. Finally, we provide an approximation algorithm for the incremental model, which we extend to the general DVFS model.
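For the special case of a linear chain under the continuous model, the closed form is easy to state: if a task of work w run at speed s costs energy w·s², convexity implies a single uniform speed s = W/D is optimal, where W is the total work and D the deadline. A small sketch of this special case (hypothetical helper, not the paper's general algorithm):

```python
def chain_energy(works, deadline):
    """Optimal continuous-speed schedule for a linear chain of tasks.
    Energy model: a task of work w executed at speed s takes w/s time
    and costs w * s**2 energy. By convexity of s -> w*s**2, the single
    uniform speed s = sum(works)/deadline minimizes total energy while
    finishing exactly at the deadline."""
    s = sum(works) / deadline
    energy = sum(w * s * s for w in works)
    return s, energy
```

The general DAG case is harder precisely because parallel branches can profitably run at different speeds, which is what the geometric-programming formulation captures.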
 
Article
In this paper, we study CPU utilization time patterns of several Map-Reduce applications. After extracting the running patterns of several applications, the patterns and their statistical information are saved in a reference database to be used later to tweak system parameters so as to efficiently execute unknown applications in the future. To achieve this goal, the CPU utilization patterns of new applications, along with their statistical information, are compared with the already known ones in the reference database to find/predict their most probable execution patterns. Because of the different pattern lengths, Dynamic Time Warping (DTW) is utilized for this comparison; a statistical analysis is then applied to the DTW outcomes to select the most suitable candidates. Moreover, under a hypothesis, another algorithm is proposed to classify applications with similar CPU utilization patterns. Three widely used text processing applications (WordCount, Distributed Grep, and Terasort) and another application (Exim Mainlog parsing) are used to evaluate our hypothesis in tweaking system parameters for executing similar applications. The results were very promising and showed the effectiveness of our approach on a 5-node Map-Reduce platform.
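The DTW comparison at the core of the approach is the textbook dynamic program, which tolerates sequences of different lengths by allowing time-axis warping. A plain-Python sketch (not the authors' code):

```python
def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping with an
    absolute-difference local cost. Returns the minimal cumulative
    cost of aligning sequence a against sequence b, allowing each
    element to match several elements of the other sequence."""
    n, m = len(a), len(b)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible predecessors
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Two CPU traces of different lengths but similar shape thus get a small distance, which is exactly the property exploited when matching a new application against the reference database.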
 
Article
The pricing of American-style and multiple-exercise options is a very challenging problem in mathematical finance. One usually employs a least-squares Monte Carlo approach (the Longstaff-Schwartz method) for the evaluation of the conditional expectations that arise in the backward dynamic programming principle for such optimal stopping or stochastic control problems in a Markovian framework. Unfortunately, these least-squares Monte Carlo approaches are rather slow and, due to the dependency structure in the backward dynamic programming principle, allow no parallel implementation, neither on the Monte Carlo level nor on the time-layer level of the problem. We therefore present in this paper a quantization method for the computation of the conditional expectations that allows a straightforward parallelization on the Monte Carlo level. Moreover, for AR(1) processes we are able to develop a further parallelization in the time domain, which makes use of faster memory structures and therefore maximizes parallel execution. Finally, we present numerical results for a CUDA implementation of these methods. It turns out that such an implementation leads to an impressive speed-up compared with a serial CPU implementation.
 
Article
Financial institutions have massive computations to carry out overnight that are very demanding in terms of CPU consumption. The challenge is to price many different products on a cluster-like architecture. We have used the Premia software to value the financial derivatives. In this work, we explain how Premia can be embedded into Nsp, a scientific software package similar to Matlab, to provide a powerful tool for valuing a whole portfolio. Finally, we have integrated an MPI toolbox into Nsp so that Premia can be used to solve a batch of pricing problems on a cluster. This unified framework can then be used to test different parallel architectures.
 
Article
With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Despite the fact that these platforms are heterogeneous multicores, no approach exists yet that is capable of joining the forces of a system's CPU and GPU for JPEG decoding. In this paper we introduce a novel JPEG decoding scheme for heterogeneous architectures consisting of a CPU and an OpenCL-programmable GPU. We employ an offline profiling step to determine the performance of a system's CPU and GPU with respect to JPEG decoding. For a given JPEG image, our performance model uses (1) the CPU and GPU performance characteristics, (2) the image entropy and (3) the width and height of the image to balance the JPEG decoding workload on the underlying hardware. Our runtime partitioning and scheduling scheme exploits task, data and pipeline parallelism by scheduling the non-parallelizable entropy decoding task on the CPU, whereas inverse discrete cosine transformations (IDCTs), color conversions and upsampling are conducted on both the CPU and the GPU. Our kernels have been optimized for GPU memory hierarchies. We have implemented the proposed method in the context of the libjpeg-turbo library, which is an industrial-strength JPEG encoding and decoding engine. Libjpeg-turbo's hand-optimized SIMD routines for ARM and x86 architectures constitute a competitive yardstick for the comparison to the proposed approach. Retrofitting our method into libjpeg-turbo provides insights into the software-engineering aspects of re-engineering legacy code for heterogeneous multicores. We have evaluated our approach on a total of 7194 JPEG images across three high- and middle-end CPU-GPU combinations. We achieve speedups of up to 4.2x over the SIMD version of libjpeg-turbo, and speedups of up to 8.5x over its sequential code. Taking into account the non-parallelizable JPEG entropy decoding part, our approach achieves up to 95% of the theoretically attainable maximal speedup, with an average of 88%.
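The balancing step such a performance model implies can be illustrated with a toy proportional split (hypothetical function; the paper's actual model also folds in image entropy and dimensions): give each device a share of the image rows proportional to its measured decoding rate, so both finish at roughly the same time.

```python
def split_rows(total_rows, cpu_rate, gpu_rate):
    """Toy workload partitioner: assign image rows so that both
    devices finish together. cpu_rate and gpu_rate are measured
    decoding throughputs (rows/second) from an offline profiling
    step; the split is proportional to each device's rate."""
    gpu_rows = round(total_rows * gpu_rate / (cpu_rate + gpu_rate))
    return total_rows - gpu_rows, gpu_rows
```

With equal rates the split is 50/50; a GPU three times faster than the CPU receives three quarters of the rows.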
 
Article
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms have to be developed, in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks which completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
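The dependency-driven execution described above can be sketched generically. The scheduler below is a toy illustration (the task names follow tiled-QR kernel convention, but nothing here is the paper's implementation): tasks run as soon as their prerequisites complete, regardless of submission order.

```python
from collections import deque

def dynamic_schedule(tasks, deps):
    """Run each task as soon as its dependencies are satisfied,
    mimicking dynamic out-of-order scheduling of tile tasks.
    tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Returns the order in which tasks were executed."""
    remaining = {t: set(d) for t, d in deps.items()}
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                       # execute the tile kernel
        order.append(t)
        for u, d in remaining.items():   # release tasks t was blocking
            if t in d:
                d.discard(t)
                if not d and u not in order and u not in ready:
                    ready.append(u)
    return order
```

In a real runtime the ready tasks would be dispatched to worker threads concurrently; the point is that only true data dependencies, not program order, constrain execution.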
 
Article
Energy efficiency has been a daunting challenge for datacenters. The financial industry operates some of the largest datacenters in the world. With increasing energy costs and the growth of the financial services sector, emerging financial analytics workloads may incur extremely high operational costs in order to meet their latency targets. Microservers have recently emerged as an alternative to high-end servers, promising scalable performance and low energy consumption in datacenters via scale-out. Unfortunately, stark differences in architectural features, form factor and design considerations make a fair comparison between servers and microservers exceptionally challenging. In this paper we present a rigorous methodology and new metrics for fair comparison of server and microserver platforms. We deploy our methodology and metrics to compare a microserver with ARM cores against two servers with x86 cores, running the same real-time financial analytics workload. We define workload-specific but platform-independent performance metrics for platform comparison, targeting both datacenter operators and end users. Our methodology establishes that a server based on the Xeon Phi processor delivers the highest performance and energy-efficiency. However, by scaling out energy-efficient microservers, we achieve competitive or better energy-efficiency than a power-equivalent server with two Sandy Bridge sockets, despite the microserver's slower cores. Using a new iso-QoS (iso-Quality of Service) metric, we find that the ARM microserver scales enough to meet market throughput demand, i.e., 100% QoS in terms of timely option pricing, with as little as 55% of the energy consumed by the Sandy Bridge server.
 
Article
In this paper we describe our work on enabling fine-grained authorization for resource usage and management. We address the need of virtual organizations to enforce their own policies, in addition to those of the resource owners, with regard to both resource consumption and job management. To implement this design, we propose changes and extensions to the resource management mechanism of version 2 of the Globus Toolkit. We describe the prototype and the policy language that we designed to express fine-grained policies, and we present an analysis of our solution.
 
Conference Paper
We introduce a new technique for automated performance diagnosis, using the program’s callgraph. We discuss our implementation of this diagnosis technique in the Paradyn Performance Consultant. Our implementation includes the new search strategy and new dynamic instrumentation to resolve pointer-based dynamic call sites at run-time. We compare the effectiveness of our new technique to the previous version of the Performance Consultant for several sequential and parallel applications. Our results show that the new search method performs its search while inserting dramatically less instrumentation into the application, resulting in reduced application perturbation and consequently a higher degree of diagnosis accuracy.
 
Article
The latest trends in high-performance computing systems show an increasing demand for using large-scale multicore systems efficiently, so that highly compute-intensive applications can be executed reasonably well. However, exploiting the degree of parallelism available in each multicore component can be limited by poor utilization of the available memory hierarchy. In fact, the multicore architecture introduces some distinct features already observed in shared-memory and distributed environments. One example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative that a careful allocation of an application onto the available cores is carried out, based on a scheduling model that considers the main performance bottlenecks, such as memory contention. In this paper, the Multicore Cluster Model (MCM) is proposed, which captures the most relevant performance characteristics of multicore systems, such as the influence of the memory hierarchy and contention. Better performance was achieved when a load-balancing strategy for a branch-and-bound application applied to the partitioning-sets problem was based on MCM, showing its efficiency and applicability to modern systems.
 
Article
Electronic Health (e-Health) technology has brought significant transformation from traditional paper-based medical practice to Information and Communication Technologies (ICT)-based systems for automatic management (storage, processing, and archiving) of information. Traditionally, e-Health systems have been designed to operate within stovepipes on dedicated networks, physical computers, and locally managed software platforms, which makes them susceptible to many serious limitations, including: 1) lack of on-demand scalability during critical situations; 2) high administrative overheads and costs; and 3) inefficient resource utilization and energy consumption due to lack of automation. In this paper, we present an approach to migrate the ICT systems in the e-Health sector from a traditional in-house client/server (C/S) architecture to a virtualised cloud computing environment. To this end, we developed two cloud-based e-Health applications (a Medical Practice Management System and a Telemedicine Practice System) to demonstrate how cloud services can be leveraged for developing and deploying such applications. The Windows Azure cloud computing platform is selected as an example public cloud platform for our study. We conducted several performance evaluation experiments to understand the Quality of Service (QoS) tradeoffs of our applications under variable workload on Azure.
 
Article
Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single- and multi-core performance of streaming kernels. The model refines the well-known roofline model, since it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multicore chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different requirements to the hardware, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multicore processors, and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice-Boltzmann flow solver code.
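The saturation behavior that distinguishes the ECM model from a simple roofline can be conveyed with a deliberately simplified scaling rule (a caricature of the published model, not its actual equations): per-core runtime shrinks with the core count until the memory-transfer share alone dictates the runtime, at which point adding cores no longer helps and only adds power.

```python
def scaling_prediction(t_core_cycles, t_mem_cycles, cores):
    """Simplified ECM-flavored multicore scaling rule for a
    bandwidth-limited loop kernel. t_core_cycles: in-core execution
    time for one core's share of the work; t_mem_cycles: time the
    shared memory interface needs for the data traffic. Runtime
    scales with the core count until the memory term saturates it."""
    single_core = t_core_cycles + t_mem_cycles   # no contention at n = 1
    saturated = t_mem_cycles                     # bandwidth-bound floor
    return max(single_core / cores, saturated)
```

The crossover point of the two terms is the saturation point, which is exactly where, for energy purposes, clock frequency and core count should stop being increased.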
 
Conference Paper
Many concurrent data-structure implementations use the well-known compare-and-swap (CAS) operation, supported in hardware by most modern multiprocessor architectures for inter-thread synchronization. A key weakness of the CAS operation is the degradation in its performance in the presence of memory contention. In this work we study the following question: can software-based contention management improve the efficiency of hardware-provided CAS operations? Our performance evaluation establishes that lightweight contention management support can greatly improve performance under medium and high contention levels while typically incurring only small overhead when contention is low.
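The kind of lightweight contention management studied can be illustrated with randomized exponential backoff wrapped around a CAS retry loop. Python exposes no hardware CAS, so the primitive below is simulated with a lock, and all names are illustrative rather than taken from the paper:

```python
import random
import threading
import time

class BackoffCAS:
    """Illustrative contention-managed word. Failed CAS attempts
    trigger randomized exponential backoff, reducing pressure on
    the contended word; the CAS itself is lock-simulated here."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Atomically set the word to `new` iff it equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def fetch_and_add(self, delta, max_backoff=1e-3):
        """CAS retry loop with exponential backoff on failure."""
        backoff = 1e-6
        while True:
            old = self._value
            if self.compare_and_swap(old, old + delta):
                return old
            # CAS failed under contention: wait a random, growing interval
            time.sleep(random.uniform(0, backoff))
            backoff = min(backoff * 2, max_backoff)
```

Under low contention the backoff path is never taken, which matches the observation that such contention management costs little when it is not needed.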
 
Conference Paper
The distributed nature of the grid results in the problem of scheduling parallel jobs produced by several independent organizations that have partial control over the system. We consider systems composed of n identical clusters of m processors. We show that it is always possible to produce a collaborative solution that respects the participants' selfish goals while improving the global performance of the system. We propose algorithms with a guaranteed worst-case performance ratio on the global makespan: a 3-approximation algorithm if the last completed job requires at most m/2 processors, and a 4-approximation algorithm in the general case.
 
Conference Paper
Available GPUs provide increasingly more processing power, especially for multimedia and digital signal processing. Despite the tremendous progress in hardware and thus processing power, there are, and always will be, applications that require using multiple GPUs, either running inside the same machine or distributed in the network, due to computationally intensive processing algorithms. Existing solutions for developing applications for GPUs still require a lot of hand-optimization when using multiple GPUs inside the same machine and in general provide no support for using remote GPUs distributed in the network. In this paper we address this problem and show that an open distributed multimedia middleware, like the Network-Integrated Multimedia Middleware (NMM), is able (1) to seamlessly integrate processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing.
 
Article
Machine learning over fully distributed data poses an important problem in peer-to-peer (P2P) applications. In this model we have one data record at each network node, but without the possibility to move raw data due to privacy considerations. For example, user profiles, ratings, history, or sensor readings can represent this case. This problem is difficult, because there is no possibility to learn local models, the system model offers almost no guarantees for reliability, and yet the communication cost needs to be kept low. Here we propose gossip learning, a generic approach based on multiple models taking random walks over the network in parallel, while applying an online learning algorithm to improve themselves, and getting combined via ensemble learning methods. We present an instantiation of this approach for the case of classification with linear models. Our main contribution is an ensemble learning method which, through the continuous combination of the models in the network, implements a virtual weighted voting mechanism over an exponential number of models at practically no extra cost compared with independent random walks. We prove the convergence of the method theoretically, and perform extensive experiments on benchmark datasets. Our experimental analysis demonstrates the performance and robustness of the proposed approach.
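A toy instantiation of gossip learning can make the walk-update-merge cycle concrete. The update and merge rules below are hypothetical stand-ins (plain SGD on squared loss, plain averaging); the paper's online learner and ensemble method are more refined:

```python
import random

def online_update(w, x, y, lr=0.5):
    """One online-SGD step on squared loss for a linear model,
    a stand-in for the paper's online learning algorithm."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

def merge(w1, w2):
    """Combine two walking models (simple averaging here); this is
    the step that realizes the virtual voting over many models."""
    return [(a + b) / 2 for a, b in zip(w1, w2)]

def gossip_step(cached, data, neighbors, rng):
    """One gossip round: every node sends its cached model along a
    random edge; each node then merges whatever it received into its
    own copy and takes one online step on its single local record."""
    received = {}
    for node, w in cached.items():
        target = rng.choice(neighbors[node])
        received.setdefault(target, []).append(w)
    updated = {}
    for node, w in cached.items():
        for msg in received.get(node, []):
            w = merge(w, msg)
        x, y = data[node]
        updated[node] = online_update(w, x, y)
    return updated
```

No raw record ever leaves its node; only model parameters travel, which is what makes the scheme compatible with the privacy constraint.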
 
Article
Fast multipole methods (FMMs) have O(N) complexity, are compute bound, and require very little synchronization, which makes them favorable algorithms for next-generation supercomputers. Their most common application is to accelerate N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load balancing becomes a non-trivial question. A common strategy for load-balancing FMMs is to use the work load from the previous step as weights to statically repartition the next step. In this paper, the authors discuss another approach, based on data-driven execution, to efficiently tackle this challenging load-balancing problem. The core idea consists of breaking the most time-consuming stages of the FMM into smaller tasks. The algorithm can then be represented as a directed acyclic graph (DAG), where nodes represent tasks and edges represent dependencies among them. The tasks are scheduled asynchronously by the QUARK runtime environment in a way that never violates data dependencies, thus preserving numerical correctness; this asynchronous scheduling results in an out-of-order execution. The data-driven FMM execution outperforms the previous strategy and shows linear speedup on a quad-socket quad-core Intel Xeon system.
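The data-driven idea can be illustrated with a toy DAG executor. This is not QUARK (which schedules onto worker threads with data-hazard tracking); it is a minimal sequential sketch showing the core rule that a task becomes ready as soon as all of its dependencies have completed:

```python
from collections import deque

def run_dag(tasks, deps):
    """Run callables in `tasks` (name -> fn) so that no task starts before
    all of its prerequisites in `deps` (name -> list of names) have finished.
    Tasks whose inputs are ready are queued immediately, giving the
    out-of-order, data-driven execution style sketched above."""
    remaining = {t: len(deps.get(t, [])) for t in tasks}
    successors = {t: [] for t in tasks}
    for t, pre in deps.items():
        for p in pre:
            successors[p].append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()          # a real runtime would dispatch this to a worker
        order.append(t)
        for s in successors[t]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    if len(order) != len(tasks):
        raise ValueError("cycle detected in task graph")
    return order
```

With FMM stage names (here purely illustrative: particle-to-multipole, multipole-to-multipole, and so on), independent branches such as the near-field P2P stage interleave freely with the far-field chain, which is exactly what enables load balancing by task granularity.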
 
Article
In many scientific applications, the solution of non-linear differential equations is obtained through the set-up and solution of a number of successive eigenproblems. These eigenproblems can be regarded as a sequence whenever the solution of one problem fosters the initialization of the next. In addition, some eigenproblem sequences show a connection between the solutions of adjacent eigenproblems. Whenever it is possible to unravel the existence of such a connection, the eigenproblem sequence is said to be correlated. When facing a sequence of correlated eigenproblems, the current strategy amounts to solving each eigenproblem in isolation. We propose a novel approach which exploits such correlation through the use of an eigensolver based on subspace iteration and accelerated with Chebyshev polynomials (ChFSI). The resulting eigensolver is optimized by minimizing the number of matrix-vector multiplications and parallelized using the Elemental library framework. Numerical results show that ChFSI achieves excellent scalability and is competitive with current dense linear algebra parallel eigensolvers.
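A serial NumPy sketch of Chebyshev-filtered subspace iteration may help make the idea concrete. This is a simplified stand-in for the ChFSI solver above (which is dense-parallel and reuses information across the problem sequence): a Chebyshev polynomial in A damps the unwanted upper spectrum and amplifies the wanted lower part, after which a Rayleigh-Ritz step rotates the filtered basis. The filter bounds here come from the current Ritz values and a spectral-norm estimate, whereas a production solver would obtain them more cheaply, e.g. from Lanczos runs or the previous problem in the sequence.

```python
import numpy as np

def cheb_filter(A, V, deg, a, b):
    # Three-term Chebyshev recurrence in A, with the spectrum mapped so that
    # [a, b] goes to [-1, 1]: eigencomponents inside [a, b] are damped, while
    # components below a are amplified polynomially with the degree.
    e, c = (b - a) / 2.0, (b + a) / 2.0
    Y = (A @ V - c * V) / e
    for _ in range(2, deg + 1):
        V, Y = Y, (2.0 / e) * (A @ Y - c * Y) - V
    return Y

def chfsi(A, k, deg=8, iters=25, extra=3, seed=0):
    """Chebyshev-filtered subspace iteration for the k lowest eigenpairs of a
    symmetric matrix A. A few extra block vectors buffer the slowly
    converging modes near the filter boundary."""
    m = k + extra
    V = np.linalg.qr(np.random.default_rng(seed).standard_normal((A.shape[0], m)))[0]
    b = np.linalg.norm(A, 2) + 0.1            # safe upper bound on the spectrum
    for _ in range(iters):
        a = np.linalg.eigvalsh(V.T @ A @ V)[-1]   # damp everything above this
        V = np.linalg.qr(cheb_filter(A, V, deg, a, b))[0]
        w, Q = np.linalg.eigh(V.T @ A @ V)        # Rayleigh-Ritz rotation
        V = V @ Q
    return w[:k], V[:, :k]
```

The dominant cost per iteration is the `deg` matrix-block products inside the filter, which is why minimizing matrix-vector multiplications, as the abstract notes, drives the optimization of such solvers.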
 
Article
We introduce SLIRP, a module generator for the S-Lang numerical scripting language, with a focus on its vectorization capabilities. We demonstrate how both SLIRP and S-Lang were easily adapted to exploit the inherent parallelism of high-level mathematical languages with OpenMP, allowing general users to employ tightly-coupled multiprocessors in scriptable research calculations while requiring no special knowledge of parallel programming. Motivated by examples in the ISIS astrophysical modeling & analysis tool, performance figures are presented for several machine and compiler configurations, demonstrating beneficial speedups for real-world operations.
 
Article
The MPI/RT standard is the product of the work of many people in an open community standards group over a period of more than six years. The purpose of this archival publication is to preserve the significant knowledge and experience in real-time message passing systems that was developed as a consequence of the R&D effort, as well as in the specification of the standard. Interestingly, several implementations of MPI/RT (as well as comprehensive test suites) were created in industry and academia over the period during which the standard was created. MPI/RT is likely to gain adoption interest over time, and this adoption may be driven by the promulgation of the standard, including this publication. We expect that, when people are interested in understanding options for reliable, QoS-oriented parallel computing with message passing, MPI/RT will serve as a foundation for such a study, whether or not its complete formalism is accepted into other systems or standards.
 
Article
In the course of our work in developing formal specifications for components of the Java Virtual Machine (JVM), we have uncovered subtle bugs in the bytecode verifier of Sun's Java 2 SDK 1.2. These bugs, which lead to type safety violations, relate to the naming of reference types. Under certain circumstances, these names can be spoofed through delegating class loaders. These flaws expose some inaccuracies and ambiguities in the JVM specification. We propose several solutions to all of these bugs. In particular, we propose a general solution that makes use of subtype loading constraints. Such constraints complement the equality loading constraints introduced in the Java 2 Platform, and are posted by the bytecode verifier when checking assignment compatibility of class types. By posting constraints instead of resolving and loading classes, the bytecode verifier in our solution has a cleaner interface with the rest of the JVM, and allows lazier loading. We sketch some excerpts of our mathematical formalization of this approach and of its type safety results.
 
Article
Digital scholarship offers the opportunity to move beyond the limitations of traditional scholarly publication. Rather than limiting scholarly communication to text-based static documents, the Web makes it possible for scholars to expose and share the full evidence of their research including data, images, video, and other genre of materials. These aggregations of evidence, or compound documents, can then be integrated into a linked data cloud, the basis of Scholarship 2.0—an open environment in which scholars collaborate and build new knowledge on the existing scholarship. We present Open Archives Initiative–Object Reuse and Exchange (OAI–ORE), a set of standards to identify and describe aggregations of Web Resources, thereby making the Scholarship 2.0 vision possible. Copyright © 2010 John Wiley & Sons, Ltd.
 
Article
An important factor for high-speed optical communication is the availability of ultrafast and low-noise photodetectors. Among the semiconductor photodetectors commonly used in today's long-haul and metro-area fiber-optic systems, avalanche photodiodes (APDs) are often preferred over p-i-n photodiodes due to their internal gain, which significantly improves the receiver sensitivity and alleviates the need for optical pre-amplification. Unfortunately, the random nature of the carrier impact ionization process that generates the gain is inherently noisy and results in fluctuations not only in the gain but also in the time response. We recently developed a theory characterizing the autocorrelation function of APDs that incorporates the dead-space effect, an effect that is very significant in thin, high-performance APDs. This work extends the time-domain analysis of the dead-space multiplication model to compute the autocorrelation function of the APD impulse response. However, the computation requires a large amount of memory and is very time consuming. In this paper, we describe our experience in parallelizing the code in MPI and OpenMP using CAPTools. Several array partitioning schemes and scheduling policies are implemented and tested. Our results show that the code is scalable up to 64 processors on an SGI Origin 2000 machine and has small average errors. Copyright © 2004 John Wiley & Sons, Ltd.
 
Article
The LHCb Grid software has been used for two Physics Data Challenges, with the latter producing over 98 TB of data and consuming over 650 processor-years of computing power. This paper discusses the experience of developing a Grid infrastructure, interfacing ...
 
Article
In this paper we discuss the issues related to the development of efficient parallel implementations of the Marching Cubes algorithm, one of the most widely used methods for isosurface extraction, which is a fundamental operation for 3D data analysis and visualization. We present three possible parallelization strategies and outline the pros and cons of each of them, considering isosurface extraction either as a stand-alone operation or as part of a dynamic workflow. Our analysis shows that none of these implementations represents the most efficient solution for arbitrary situations. This is a major issue, because in many cases the quality of the service provided by a workflow depends on the possibility of selecting dynamically the operations to perform and, consequently, the most efficient basic building block for each stage. We present a set of guidelines for achieving the highest performance for isosurface extraction in the most common situations, considering the characteristics of the data to process and of the workflow. These guidelines represent a suitable example to support the efficient configuration of workflows for 3D data processing in a dynamic and complex computing environment. Copyright © 2011 John Wiley & Sons, Ltd.
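One classic parallelization strategy for Marching Cubes, static slab partitioning of the volume, can be sketched as follows. This is an illustrative fragment, not the paper's implementation: it shows only the cell-classification stage (finding cells crossed by the isosurface), with each slab of the volume processed by a separate worker, and all function names are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def active_cells(volume, iso, z0, z1):
    """Count cells with z-index in [z0, z1): a cell is 'active' (crossed by
    the isosurface) when its 8 corner samples do not all lie on the same
    side of the isovalue `iso`."""
    sub = volume[z0:z1 + 1]  # one extra sample layer closes the last cells
    corners = np.stack([sub[dz:sub.shape[0] - 1 + dz,
                            dy:sub.shape[1] - 1 + dy,
                            dx:sub.shape[2] - 1 + dx]
                        for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)])
    above = corners > iso
    return int(np.logical_and(above.any(0), (~above).any(0)).sum())

def parallel_active_cells(volume, iso, n_parts=4):
    """Split the cell layers along z into contiguous slabs and classify each
    slab in a separate worker; counts simply add up because the slabs
    partition the cells exactly."""
    nz = volume.shape[0] - 1          # number of cell layers along z
    bounds = np.linspace(0, nz, n_parts + 1).astype(int)
    with ThreadPoolExecutor(n_parts) as ex:
        counts = ex.map(lambda b: active_cells(volume, iso, b[0], b[1]),
                        zip(bounds[:-1], bounds[1:]))
    return sum(counts)
```

The fragment also exposes the trade-off the paper analyzes: static slabs are simple and communication-free, but when the isosurface is unevenly distributed across the volume, some slabs carry most of the active cells and the load becomes unbalanced.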
 
Article
The routing protocol of the IEEE P802.11s/D4.0 standard is not considered secure. In this paper, we propose IBC-HWMP, a secure Hybrid Wireless Mesh Protocol (HWMP) using identity-based cryptography (IBC). We use IBC because it does not require verifying the authenticity of public keys. We have implemented the IBC mechanism to secure the control messages in HWMP, namely path request and path reply, focusing on the secure exchange of their mutable fields. Through extensive ns-3 simulations, we show that the overhead introduced by IBC-HWMP is not significant compared with classical HWMP while, at the same time, security is improved. Copyright © 2011 John Wiley & Sons, Ltd.
 
Article
One of the language features of the core language of HPF 2.0 is the HPF Library, which consists of 55 generic functions. The implementation of this library presents the challenge that all data types, data kinds, array ranks and input distributions need to be supported. For instance, more than 2 billion separate functions are required to support COPY_SCATTER fully. The efficient support of these billions of specific functions is one of the outstanding problems of High Performance Fortran. We have solved this problem by developing a library generator which utilizes the mechanism of parameterized templates. This mechanism allows the procedures to be instantiated at compile time for arguments with a specific type, kind, rank, and distribution over a specific processor array. We describe the algorithms used in the different library functions. The implementation makes it possible to generate a large number of library routines from a single template, and the templates can be extended with special code for specific combinations of the input arguments. We describe in detail the implementation and performance of the Matrix Multiplication template for the Fujitsu VPP5000 platform. Keywords: parallel computing, parallel languages, code generation, library functions, parameterized templates, matrix multiplication.
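The parameterized-template mechanism can be illustrated with a small Python analogue. The real generator emits specialized Fortran routines at compile time; here, purely for illustration (template text, names, and the simplified scatter semantics are all hypothetical), a template string is instantiated on demand for one (type, rank) combination, so only the combinations actually used are ever materialized:

```python
TEMPLATE = '''\
def copy_scatter_{dtype}_{rank}d(src, idx, dst):
    """Specialized instance generated for dtype={dtype}, rank={rank}."""
    for i, j in enumerate(idx):
        dst[j] = src[i]
    return dst
'''

def instantiate(dtype, rank, namespace):
    # Instantiate the template for one (type, rank) combination, mirroring
    # how a library generator emits one specific routine per combination
    # instead of shipping billions of prebuilt functions.
    name = f"copy_scatter_{dtype}_{rank}d"
    exec(TEMPLATE.format(dtype=dtype, rank=rank), namespace)
    return namespace[name]
```

A given program touches only a handful of combinations, so instantiation-on-use collapses the combinatorial explosion (2 billion potential COPY_SCATTER variants) to the few specializations actually requested.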
 
Article
Science gateways have emerged as a concept for allowing large numbers of users in communities to easily access high-performance computing resources which previously required a steep learning curve to utilize. In order to reduce the complexity of managing access for these communities, which can often be large and dynamic, the concept of community accounts is being considered. This paper proposes a security model for community accounts, organized by the four As of security: Authentication, Authorization, Auditing and Accounting. Copyright © 2006 John Wiley & Sons, Ltd.
 
Hierarchical model.
Article
This paper reports the hybridization of the artificial bee colony (ABC) algorithm and a genetic algorithm (GA) in a hierarchical topology, a step beyond our previous work. We used this parallel approach to solve the protein structure prediction problem using the three-dimensional hydrophobic-polar model with side-chains (3DHP-SC). The proposed method was run in a parallel processing environment (a Beowulf cluster), and several aspects of the modeling and implementation are presented and discussed. The performance of the hybrid-hierarchical ABC-GA approach was compared with a hierarchical ABC-only approach on four benchmark instances. Results show that hybridizing the ABC with the GA improves the quality of the solutions, owing to the coevolution between the two algorithms and their complementary search behavior. Copyright © 2011 John Wiley & Sons, Ltd.
 
Article
OoLALA is an object-oriented linear algebra library designed to reduce the effort of software development and maintenance. In contrast with traditional (Fortran-based) libraries, it provides two high abstraction levels that significantly reduce the number of implementations necessary for particular linear algebra operations. Initial performance evaluations of a Java implementation of OoLALA show that the two high abstraction levels are not competitive with the low abstraction level of traditional libraries. These initial performance results motivate the present contribution: the characterization of a set of storage formats (data structures) and matrix properties (special features) for which implementations at the two high abstraction levels can be transformed into implementations at the low (more efficient) abstraction level. Copyright © 2005 John Wiley & Sons, Ltd.
 
Article
The modeling of the electrical activity of the heart is of great medical and scientific interest, because it provides a way to better understand the related biophysical phenomena, allows the development of new diagnostic techniques and serves as a platform for drug tests. Cardiac electrophysiology may be simulated by solving a partial differential equation coupled to a system of ordinary differential equations describing the electrical behavior of the cell membrane. The numerical solution is, however, computationally demanding because of the fine temporal and spatial sampling required. The demand for real-time, high-definition 3D graphics has turned modern graphics processing units (GPUs) into highly parallel, multithreaded, many-core processors with tremendous computational horsepower, which makes them a promising alternative for simulating the electrical activity of the heart. The aim of this work is to study the performance of GPUs for solving the equations underlying the electrical activity in a simple cardiac tissue. In tests on 2D cardiac tissues with different cell models, the GPU implementation runs 20 times faster than a parallel CPU implementation using 4 threads on a quad-core machine, and parts of the code are accelerated by a factor of up to 180. Copyright © 2010 John Wiley & Sons, Ltd.
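The PDE/ODE coupling described above has the structure of a reaction-diffusion system, which a small NumPy fragment can make concrete. This is a sketch only, not the paper's model: it uses the simple FitzHugh-Nagumo membrane kinetics as a stand-in for the cardiac cell models, an explicit Euler step, a five-point Laplacian with periodic boundaries, and unit grid spacing. On a GPU, each grid point's update would map naturally to one thread, which is why the method parallelizes so well.

```python
import numpy as np

def fhn_step(v, w, dt=0.1, D=0.1, a=0.7, b=0.8, eps=0.08):
    """One explicit time step of a 2D monodomain-style reaction-diffusion
    model: `v` is the transmembrane-potential-like variable (diffusing, PDE
    part) and `w` the recovery variable (local ODE part), with
    FitzHugh-Nagumo kinetics as an illustrative cell model."""
    # Five-point Laplacian via np.roll (periodic boundaries, dx = 1)
    lap = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
           np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4.0 * v)
    v_new = v + dt * (D * lap + v - v ** 3 / 3.0 - w)   # diffusion + reaction
    w_new = w + dt * eps * (v + a - b * w)              # pointwise recovery ODE
    return v_new, w_new
```

The fine sampling requirement mentioned in the abstract shows up directly here: the explicit scheme forces `dt` to respect both the stiff membrane kinetics and the diffusion stability limit, so realistic tissue sizes need very many cheap, highly parallel steps.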
 
Article
As the leading vendor of enterprise business standard software, SAP has recognized the need to adapt their R/3 system to current trends in software development and to meet market needs for speed of development, flexibility, openness, and interoperability. In this paper, we first present SAP's approach to object-oriented and component-based technology by describing the Business Framework, the concepts of Business Objects, BAPIs, and the Business Object Repository. On this basis, we then analyze current communication architectures and products enabling the interaction of external Java- based software applications with SAP R/3, point out the advantages and disadvantages of different solutions, and finally elaborate on potential strategies and steps for driving the evolution of SAP R/3 in order to further increase interoperability, openness, and flexibility. Copyright © 2001 John Wiley & Sons, Ltd.
 
Article
concerns, and therefore have wide variation in their library support. It is difficult to keep track of which libraries are installed on each platform. In general, the more libraries on which a program depends, the fewer environments in which it can run. In addition, an increased acceptance of software reuse has enabled creation of more and more sophisticated software packages, built out of numerous components obtained from diverse sources. This in turn has created a configuration management problem for users and system administrators, who must devote valuable time to obtaining and installing those components, tracking changes (including enhancements, bug fixes, and sometimes security fixes) to those components. Version conflicts, where one software package requires version X of a library while another package on the same platform requires version Y, are common. Packages such as ATLAS [3,4] can increase the effective performance of many platforms by optimizing the parameters of the com
 
Article
Networks of the future will be characterized by a variety of computational devices that display a level of dynamism not seen in traditional wired networks. Because of the dynamic nature of these networks, resource discovery is one of the fundamental problems that must be faced. While resource discovery systems are not a novel concept, securing these systems in an efficient and scalable way is challenging. This paper describes the design and implementation of an architecture for access-controlled resource discovery. The system achieves this goal by integrating access control with the Intentional Naming System (INS), a resource discovery and service location system. The integration is scalable, efficient, and fits well within a proxy-based security framework designed for dynamic networks. We provide performance experiments that show how our solution outperforms existing schemes. The result is a system that provides secure, access-controlled resource discovery that can scale to large numbers of resources and users.
 
Article
This paper describes the Grid Resource Broker (GRB) portal, a web gateway to computational grids in use at the University of Lecce. The portal allows trusted users seamless access to computational resources and grid services, providing a friendly computing environment that takes advantage of the underlying Globus Toolkit middleware, enhancing its basic services and capabilities. Keywords: computational grids, web portals.
 
Article
Few methods use molecular dynamics simulations in concert with atomically detailed force fields to perform protein–ligand docking calculations because they are considered too time demanding, despite their accuracy. In this paper we present a docking algorithm based on molecular dynamics which has a highly flexible computational granularity. We compare the accuracy and the time required with well-known, commonly used docking methods such as AutoDock, DOCK, FlexX, ICM, and GOLD. We show that our algorithm is accurate, fast and, because of its flexibility, applicable even to loosely coupled distributed systems such as desktop Grids for docking. Copyright © 2005 John Wiley & Sons, Ltd.
 
Article
In performance critical applications, memory latency is frequently the dominant overhead. In many cases, automatic compiler-based optimizations to improve memory performance are limited and programmers frequently resort to manual optimization techniques. However, this process is tedious and time-consuming. Furthermore, as the potential benefit from optimization is unknown there is no way to judge the amount of effort worth expending, nor when the process can stop, i.e. when optimal memory performance has been achieved or sufficiently approached. Architecture simulators can provide such information but designing an accurate model of an existing architecture is difficult and simulation times are excessively long. In this article, we propose and implement a technique that is both fast and reasonably accurate for estimating a lower bound on execution time for scientific applications. This technique has been tested on a wide range of programs from the SPEC benchmark suite and two commercial applications, where it has been used to guide a manual optimization process and iterative compilation. We compare our technique with that of a simulator with an ideal memory behaviour and demonstrate that our technique provides comparable information on memory performance and yet is over two orders of magnitude faster. We further show that our technique is considerably more accurate than hardware counters. Copyright © 2004 John Wiley & Sons, Ltd.
 
Article
OpenMP is emerging as a viable high-level programming model for shared memory parallel systems. It was conceived to enable easy, portable application development on this range of systems, and it has also been implemented on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. Unfortunately, it is hard to obtain high performance on the latter architecture, particularly when large numbers of threads are involved. In this paper, we discuss the difficulties faced when writing OpenMP programs for ccNUMA systems, and explain how the vendors have attempted to overcome them. We focus on one such system, the SGI Origin 2000, and perform a variety of experiments designed to illustrate the impact of the vendor's efforts. We compare codes written in a standard, loop-level parallel style under OpenMP with alternative versions written in a Single Program Multiple Data (SPMD) fashion, also realized via OpenMP, and show that the latter consistently provides superior performance. A carefully chosen set of language extensions can help us translate programs from the former style to the latter (or to compile directly, but in a similar manner). Syntax for these extensions can be borrowed from HPF, and some aspects of HPF compiler technology can help the translation process. It is our expectation that an extended language, if well compiled, would improve the attractiveness of OpenMP as a language for high-performance computation on an important class of modern architectures. Copyright © 2002 John Wiley & Sons, Ltd.
 
Top-cited authors
Rajkumar Buyya
  • University of Melbourne
Bertram Ludäscher
  • University of Illinois, Urbana-Champaign
Ilkay Altintas
  • University of California, San Diego
Jing Tao
  • University of California, Santa Barbara
Yang Zhao
  • Google Inc.