Concurrency and Computation: Practice and Experience

Published by Wiley

Online ISSN: 1532-0634


Print ISSN: 1532-0626


Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2
  • Article

December 2014


87 Reads

Shahid H Bokhari



Metin N Gurcan
Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64-bit word. It is thus well-suited to the parallelization of graph theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 32000² pixels in size, which are well beyond the largest previously reported in the literature.
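
As a rough illustration of the preflow-push idea named in the abstract above, here is a minimal sequential sketch in Python. It is not the authors' XMT-2 code: the function name `max_flow_push_relabel`, the dense capacity-matrix input, and the FIFO ordering of active vertices are all assumptions made for the example, and nothing here is parallel.

```python
from collections import deque


def max_flow_push_relabel(capacity, source, sink):
    """Sequential FIFO preflow-push on a dense capacity matrix (illustrative only)."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    height = [0] * n
    excess = [0] * n
    height[source] = n
    active = deque()

    # Initial preflow: saturate every edge leaving the source.
    for v in range(n):
        c = capacity[source][v]
        if c > 0:
            flow[source][v] = c
            flow[v][source] = -c
            excess[v] += c
            excess[source] -= c
            if v != sink:
                active.append(v)

    while active:
        u = active.popleft()
        # Discharge u: push along admissible edges, relabel when none exist.
        while excess[u] > 0:
            pushed = False
            for v in range(n):
                residual = capacity[u][v] - flow[u][v]
                if residual > 0 and height[u] == height[v] + 1:
                    delta = min(excess[u], residual)
                    flow[u][v] += delta
                    flow[v][u] -= delta
                    excess[u] -= delta
                    # A vertex becomes active when its excess turns positive.
                    if v != source and v != sink and excess[v] == 0:
                        active.append(v)
                    excess[v] += delta
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:
                # Relabel: lift u just above its lowest residual neighbour.
                height[u] = 1 + min(
                    height[v] for v in range(n) if capacity[u][v] - flow[u][v] > 0
                )
    return excess[sink]
```

In a segmentation setting, pixels would be graph vertices and the minimum cut separating source from sink would give the labelling; the sketch only returns the flow value, which equals the cut capacity.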

The SPRINT framework architecture as described in [9].
How permutations are distributed among the available processes.
pmaxT speed-up on the various systems.
Optimization of a parallel permutation testing function for the SPRINT R package
  • Article
  • Full-text available

December 2011


214 Reads

Savvas Petrou



Muriel Mewissen




Jon Hill
The statistical language R and its Bioconductor package are favoured by many biostatisticians for processing microarray data. The amount of data produced by some analyses has reached the limits of many common bioinformatics computing infrastructures. High Performance Computing systems offer a solution to this issue. The Simple Parallel R Interface (SPRINT) is a package that provides biostatisticians with easy access to High Performance Computing systems and allows the addition of parallelized functions to R. Previous work has established that the SPRINT implementation of an R permutation testing function has close to optimal scaling on up to 512 processors on a supercomputer. Access to supercomputers, however, is not always possible, and so the work presented here compares the performance of the SPRINT implementation on a supercomputer with benchmarks on a range of platforms including cloud resources and a common desktop machine with multiprocessing capabilities. Copyright © 2011 John Wiley & Sons, Ltd.

Analysis of distributed multi-periodic systems to achieve consistent data matching

July 2010


47 Reads

The distributed real-time architecture of an embedded system is often described as a set of communicating components. Such a system is data-flow oriented (in its description) and time-triggered (in its execution). This work addresses these issues and focuses on controlling the time compatibility of a set of interdependent data used by the system components. The architecture of a component-based system forms a graph of communicating components, where more than one path can link two components. These paths may have different timing characteristics, but the flows of information that transit on these paths may need to be adequately matched, so that a component uses inputs which all (directly or indirectly) depend on the same production step. In this paper, we define this temporal data-matching property, we show how to analyze the architecture to detect situations that cause data matching inconsistencies, and we describe an approach to manage data matching that uses queues to delay paths that are too fast and timestamps to recognize consistent data.

Adaptation Rule Learning for Case-Based Reasoning

November 2007


142 Reads

A method of learning adaptation rules for case-based reasoning (CBR) is proposed in this paper. Adaptation rules are generated from the case-base with the guidance of domain knowledge which is also extracted from the case-base. The adaptation rules are refined before they are applied in the revision process. After solving each new problem, the adaptation rule set is updated by an evolution module in the retention process. The results of a preliminary experiment show that the adaptation rules obtained could improve the performance of the CBR system compared to a retrieval-only CBR system.

An architecture for exploiting multi-core processors to parallelize network intrusion prevention

July 2009


60 Reads

It is becoming increasingly difficult to implement effective systems for preventing network attacks, due to the combination of (1) the rising sophistication of attacks requiring more complex analysis to detect, (2) the relentless growth in the volume of network traffic that we must analyze, and, critically, (3) the failure in recent years of uniprocessor performance to sustain the exponential gains that for so many years CPUs enjoyed (“Moore’s Law”). For commodity hardware, tomorrow’s performance gains will instead come from multicore architectures in which a whole set of CPUs executes concurrently. Taking advantage of the full power of multi-core processors for network intrusion prevention requires an in-depth approach. In this work we frame an architecture customized for parallel execution of network attack analysis. At the lowest layer of the architecture is an “Active Network Interface” (ANI), a custom device based on an inexpensive FPGA platform. The ANI provides the inline interface to the network, reading in packets and forwarding them after they are approved. It also serves as the front-end for dispatching copies of the packets to a set of analysis threads. The analysis itself is structured as an event-based system, which allows us to find many opportunities for concurrent execution, since events introduce a natural, decoupled asynchrony into the flow of analysis while still maintaining good cache locality. Finally, by associating events with the packets that ultimately stimulated them, we can determine when all analysis for a given packet has completed, and thus that it is safe to forward the pending packet - providing none of the analysis elements previously signaled that the packet should instead be discarded.

Parallelizing a Black-Scholes solver based on finite elements and sparse grids

May 2010


119 Reads

We present the parallelization of a sparse grid finite element discretization of the Black-Scholes equation, which is commonly used for option pricing. Sparse grids make it possible to handle higher dimensional options than classical approaches on full grids, and can be extended to a fully adaptive discretization method. We introduce the algorithmic structure of efficient algorithms operating on sparse grids, and demonstrate how they can be used to derive an efficient parallelization with OpenMP of the Black-Scholes solver. We show results on different commodity hardware systems based on multi-core architectures with up to 8 cores, and discuss the parallel performance using Intel and AMD CPUs.

Figure 1. Task dependency DAG
Lambda Calculus as a Workflow Model
Data-oriented workflows are often used in scientific applications for executing a set of dependent tasks across multiple computers. We discuss how these can be modeled using lambda calculus, and how ideas from functional programming are applicable in the design of workflows. Such an approach avoids the restrictions often found in workflow languages, permitting the implementation of complex application logic and data manipulation. This paper explains why lambda calculus is an appropriate model for workflow representation, and how a suitably efficient implementation can provide a wide range of capabilities to developers. The presented approach also permits high-level workflow features to be implemented at user level, in terms of a small set of low-level primitives provided by the language implementation.

Grid Workflow Optimization Regarding Dynamically Changing Resources and Conditions

September 2007


21 Reads

Automatic construction of workflows on the Grid is currently a hot research topic. The problems that have to be solved are manifold: How can existing services be integrated into a workflow that is able to accomplish a specific task? How can an optimal workflow be constructed with respect to resource characteristics that change during the optimization process? How can one cope with dynamically changing or incomplete knowledge of the goal function of the optimization process? And finally: How should one react to service failures during workflow execution? In this paper we propose a method to optimize a workflow based on a heuristic A* approach that makes it possible to react to dynamics in the environment, such as changes in the Grid infrastructure and in the users' requirements during the optimization process, and failing resources during execution.

Figure 2. Required rails as a function of the number of nodes for both static allocation algorithms.
Livelock avoidance state table.
Using multirail networks in high-performance clusters

February 2001


424 Reads

Using multiple independent networks (also known as rails) is an emerging technique which is being used to overcome bandwidth limitations and enhance fault tolerance of current high-performance parallel computers. In this paper, we present and analyze various algorithms to allocate multiple communication rails, including static and dynamic allocation schemes. An analytical lower bound on the number of rails required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various algorithms in terms of bandwidth and latency. We show that striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load and allocation scheme. The methods compared include a static rail allocation, a basic round-robin rail allocation, a local-dynamic allocation based on local knowledge and a dynamic rail allocation that reserves both communication endpoints of a message before sending it. The last method is shown to perform better than the others at higher loads: up to 49% better than local-knowledge allocation and 37% better than the round-robin allocation. This allocation scheme also shows lower latency and it saturates at higher loads (for long enough messages). Most importantly, this proposed allocation scheme scales well with the number of rails and message size. In addition, we propose a hybrid algorithm that combines the benefits of the local-dynamic allocation for short messages with those of the dynamic algorithm for large messages.

Designing Grid services for multimedia streaming in an e-learning environment

July 2004


56 Reads

Next generation e-learning platforms should support cooperative use of geographically distributed computing and educational resources as an aggregated environment to provide new levels of flexibility and extensibility. In this overall framework, our activity addresses the definition and implementation of advanced multimedia services for an aggregated grid-based e-learning environment, as well as the design and experimentation of a content distribution and multimedia streaming infrastructure in light of edge device heterogeneity, mobility, content adaptation and scalability. In this paper we initially present the general objectives and requirements that we are taking into account in the development of a multimedia access service for an e-learning platform. Then we describe a partial system prototype which capitalizes upon traditional features of grid computing like providing access to heterogeneous resources and services of different administrative domains in a transparent and secure way. Moreover, our system takes advantage of recent proposals by the Global Grid Forum (GGF) aiming at a standard for service-oriented architectures based on the concept of grid service.

Formal system-level design space exploration

July 2010


46 Reads

The paper focuses on the formal aspects of the DIPLODOCUS environment. DIPLODOCUS is a UML profile intended for the modeling and verification of real-time and embedded applications meant to be executed on complex Systems-on-Chip. Application tasks and architectural elements (e.g., CPUs, bus, memories) are described with a UML-based language, using an open-source toolkit named TTool. Those descriptions may be automatically transformed into a formal hardware and software specification. From that specification, model-checking techniques may be applied to evaluate several properties of the system, e.g., safety, schedulability, and performance properties. The approach is exemplified with an MPEG2 decoding application.

Improved real-time scheduling for periodic tasks on multiprocessors

August 2011


30 Reads

Due to increasing numbers of real-time high-performance applications like control systems, autonomous robots, financial systems, scheduling these real-time applications on HPC resources has become an important problem. This paper presents a novel real-time multiprocessor scheduling algorithm, called Notional Approximation for Balancing Load Residues (NABLR), which heuristically selects tasks for execution by taking into account their residual loads and laxities. The NABLR schedule is created by considering a sequence of inter-arrival intervals (IAI) between two consecutive job arrivals of any task and using a heuristic to carefully plan task execution to fully utilize available resources in each of these intervals and avoid deadline misses as much as possible. Performance evaluation shows that NABLR outperforms previously known efficient algorithms (i.e. EDF and EDZL) in successfully scheduling sets of tasks in which total utilization of each task set equals available resource capacity, performing the closest to an optimal algorithm such as LLREF and Pfair. Out of 2500 randomly selected high-utilization task sets, NABLR can schedule up to 97.9% of the sets versus 63.2% by the best known efficient NABLR schedule are significantly smaller than those of optimal schedules (on average 80.57% fewer preemptions, migrations and 75.52% fewer scheduler invocations than those of LLREF) and comparably efficient suboptimal schedules (fewer or nearly the same number of invocations as EDZL and ASEDZL, but within only 0.12% more preemptions/migrations than ASEDZL). NABLR has the same O(NlogN) time complexity as other previously proposed efficient.

Reclaiming the energy of a schedule: Models and algorithms

August 2013


44 Reads

We consider a task graph to be executed on a set of processors. We assume that the mapping is given, say by an ordered list of tasks to execute on each processor, and we aim at optimizing the energy consumption while enforcing a prescribed bound on the execution time. While it is not possible to change the allocation of a task, it is possible to change its speed. Rather than using a local approach such as backfilling, we consider the problem as a whole and study the impact of several speed variation models on its complexity. For continuous speeds, we give a closed-form formula for trees and series-parallel graphs, and we cast the problem into a geometric programming problem for general directed acyclic graphs. We show that the classical dynamic voltage and frequency scaling (DVFS) model with discrete modes leads to a NP-complete problem, even if the modes are regularly distributed (an important particular case in practice, which we analyze as the incremental model). On the contrary, the VDD-hopping model leads to a polynomial solution. Finally, we provide an approximation algorithm for the incremental model, which we extend for the general DVFS model.

A Study on Using Uncertain Time Series Matching Algorithms in MapReduce Applications

August 2013


134 Reads

In this paper, we study CPU utilization time patterns of several MapReduce applications. After extracting running patterns of several applications, the patterns and their statistical information are saved in a reference database, to be used later to tune system parameters so that unknown applications can be executed efficiently in the future. To achieve this goal, the CPU utilization patterns of new applications, along with their statistical information, are compared with the already known ones in the reference database to find or predict their most probable execution patterns. Because the patterns differ in length, Dynamic Time Warping (DTW) is used for this comparison; a statistical analysis is then applied to the DTW outcomes to select the most suitable candidates. Moreover, under a hypothesis, another algorithm is proposed to classify applications with similar CPU utilization patterns. Three widely used text processing applications (WordCount, Distributed Grep, and Terasort) and another application (Exim Mainlog parsing) are used to evaluate our hypothesis in tuning system parameters when executing similar applications. Results were very promising and showed the effectiveness of our approach on a 5-node MapReduce platform.
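
To make the DTW comparison step in the abstract above concrete, the following Python sketch computes a textbook dynamic time warping distance between two utilization traces of different lengths and picks the closest entry from a small reference set. The function names, the absolute-difference cost, and the sample traces are assumptions for illustration; the paper's statistical post-processing is not reproduced.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] matched again against b[:j]
                cost[i][j - 1],      # b[j-1] matched again against a[:i]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def closest_pattern(query, reference_db):
    """Return the name of the reference trace with the smallest DTW distance."""
    return min(reference_db, key=lambda name: dtw_distance(query, reference_db[name]))


# Hypothetical CPU utilization traces (percent), not data from the paper.
reference_db = {
    "wordcount": [10, 50, 50, 90, 50],
    "grep": [90, 90, 10, 10],
}
best = closest_pattern([10, 50, 90, 50], reference_db)
```

Because DTW allows one sample to align with several samples of the other trace, a stretched copy of a pattern can still have distance zero, which is why it suits traces of unequal length.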

GPGPUs in computational finance: Massive parallel computing for American style options

June 2012


76 Reads

The pricing of American style and multiple exercise options is a very challenging problem in mathematical finance. One usually employs a Least-Square Monte Carlo approach (Longstaff-Schwartz method) for the evaluation of conditional expectations which arise in the Backward Dynamic Programming principle for such optimal stopping or stochastic control problems in a Markovian framework. Unfortunately, these Least-Square Monte Carlo approaches are rather slow and, due to the dependency structure in the Backward Dynamic Programming principle, allow no parallel implementation, neither on the Monte Carlo level nor on the time layer level of this problem. We therefore present in this paper a quantization method for the computation of the conditional expectations that allows a straightforward parallelization on the Monte Carlo level. Moreover, we are able to develop for AR(1)-processes a further parallelization in the time domain, which makes use of faster memory structures and therefore maximizes parallel execution. Finally, we present numerical results for a CUDA implementation of this method. It will turn out that such an implementation leads to an impressive speed-up compared to a serial CPU implementation.

Using Premia and Nsp for Constructing a Risk Management Benchmark for Testing Parallel Architecture

May 2009


79 Reads

Financial institutions have massive computations to carry out overnight, which are very demanding in terms of CPU time. The challenge is to price many different products on a cluster-like architecture. We have used the Premia software to value the financial derivatives. In this work, we explain how Premia can be embedded into Nsp, a scientific software package similar to Matlab, to provide a powerful tool for valuing a whole portfolio. Finally, we have integrated an MPI toolbox into Nsp so that Premia can be used to solve a large set of pricing problems on a cluster. This unified framework can then be used to test different parallel architectures.

Parallel Tiled QR Factorization for Multicore Architectures

September 2008


100 Reads

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can only be exploited at the level of the BLAS operations.

Dynamic Partitioning-based JPEG Decompression on Heterogeneous Multicore Architectures

November 2013


142 Reads

With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Despite the fact that these platforms are heterogeneous multicores, no approach exists yet that is capable of joining forces of a system's CPU and GPU for JPEG decoding. In this paper we introduce a novel JPEG decoding scheme for heterogeneous architectures consisting of a CPU and an OpenCL-programmable GPU. We employ an offline profiling step to determine the performance of a system's CPU and GPU with respect to JPEG decoding. For a given JPEG image, our performance model uses (1) the CPU and GPU performance characteristics, (2) the image entropy and (3) the width and height of the image to balance the JPEG decoding workload on the underlying hardware. Our runtime partitioning and scheduling scheme exploits task, data and pipeline parallelism by scheduling the non-parallelizable entropy decoding task on the CPU, whereas inverse cosine transformations (IDCTs), color conversions and upsampling are conducted on both the CPU and the GPU. Our kernels have been optimized for GPU memory hierarchies. We have implemented the proposed method in the context of the libjpeg-turbo library, which is an industrial-strength JPEG encoding and decoding engine. Libjpeg-turbo's hand-optimized SIMD routines for ARM and x86 architectures constitute a competitive yardstick for the comparison to the proposed approach. Retro-fitting our method with libjpeg-turbo provides insights on the software-engineering aspects of re-engineering legacy code for heterogeneous multicores. We have evaluated our approach for a total of 7194 JPEG images across three high- and middle-end CPU--GPU combinations. We achieve speedups of up to 4.2x over the SIMD-version of libjpeg-turbo, and speedups of up to 8.5x over its sequential code. 
Taking into account the non-parallelizable JPEG entropy decoding part, our approach achieves up to 95% of the theoretically attainable maximal speedup, with an average of 88%.

Methods and Metrics for Fair Server Assessment under Real-Time Financial Workloads

December 2014


48 Reads

Energy efficiency has been a daunting challenge for datacenters. The financial industry operates some of the largest datacenters in the world. With increasing energy costs and the growth of the financial services sector, emerging financial analytics workloads may incur extremely high operational costs to meet their latency targets. Microservers have recently emerged as an alternative to high-end servers, promising scalable performance and low energy consumption in datacenters via scale-out. Unfortunately, stark differences in architectural features, form factor and design considerations make a fair comparison between servers and microservers exceptionally challenging. In this paper we present a rigorous methodology and new metrics for fair comparison of server and microserver platforms. We deploy our methodology and metrics to compare a microserver with ARM cores against two servers with x86 cores, running the same real-time financial analytics workload. We define workload-specific but platform-independent performance metrics for platform comparison, targeting both datacenter operators and end users. Our methodology establishes that a server based on the Xeon Phi processor delivers the highest performance and energy-efficiency. However, by scaling out energy-efficient microservers, we achieve competitive or better energy-efficiency than a power-equivalent server with two Sandy Bridge sockets despite the microserver's slower cores. Using a new iso-QoS (iso-Quality of Service) metric, we find that the ARM microserver scales enough to meet market throughput demand, i.e. a 100% QoS in terms of timely option pricing, with as little as 55% of the energy consumed by the Sandy Bridge server.

Fine-Grained Authorization for Job Execution in the Grid: Design and Implementation

April 2004


45 Reads

In this paper we describe our work on enabling fine-grained authorization for resource usage and management. We address the need of virtual organizations to enforce their own policies in addition to those of the resource owners, in regard to both resource consumption and job management. To implement this design, we propose changes and extensions to the Globus Toolkit's version 2 resource management mechanism. We describe the prototype and the policy language that we designed to express fine-grained policies, and we present an analysis of our solution.

A Callgraph-Based Search Strategy for Automated Performance Diagnosis

August 2000


24 Reads

We introduce a new technique for automated performance diagnosis, using the program’s callgraph. We discuss our implementation of this diagnosis technique in the Paradyn Performance Consultant. Our implementation includes the new search strategy and new dynamic instrumentation to resolve pointer-based dynamic call sites at run-time. We compare the effectiveness of our new technique to the previous version of the Performance Consultant for several sequential and parallel applications. Our results show that the new search method performs its search while inserting dramatically less instrumentation into the application, resulting in reduced application perturbation and consequently a higher degree of diagnosis accuracy.

Memory Aware Load Balance Strategy on a Parallel Branch-and-Bound Application

April 2014


63 Reads

The latest trends in high-performance computing systems show an increasing demand for the efficient use of large-scale multicore systems, so that highly compute-intensive applications can be executed reasonably well. However, the exploitation of the degree of parallelism available at each multicore component can be limited by poor utilization of the available memory hierarchy. The multicore architecture introduces some distinct features that are already observed in shared memory and distributed environments. One example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative that a careful allocation scheme of an application is carried out on the available cores, based on a scheduling model that considers the main performance bottlenecks, for example memory contention. In this paper, the Multicore Cluster Model (MCM) is proposed, which captures the most relevant performance characteristics in multicore systems, such as the influence of memory hierarchy and contention. Better performance was achieved when a load balance strategy for a Branch-and-Bound application applied to the Partitioning Sets Problem was based on MCM, showing its efficiency and applicability to modern systems.

Reporting an Experience on Design and Implementation of e-Health Systems on Azure Cloud

June 2014


199 Reads

Electronic Health (e-Health) technology has brought a significant transformation from traditional paper-based medical practice to Information and Communication Technologies (ICT)-based systems for automatic management (storage, processing, and archiving) of information. Traditionally, e-Health systems have been designed to operate within stovepipes on dedicated networks, physical computers, and locally managed software platforms, which makes them susceptible to many serious limitations, including: 1) lack of on-demand scalability during critical situations; 2) high administrative overheads and costs; and 3) inefficient resource utilization and energy consumption due to lack of automation. In this paper, we present an approach to migrate the ICT systems in the e-Health sector from traditional in-house Client/Server (C/S) architecture to the virtualised cloud computing environment. To this end, we developed two cloud-based e-Health applications (Medical Practice Management System and Telemedicine Practice System) to demonstrate how cloud services can be leveraged for developing and deploying such applications. The Windows Azure cloud computing platform is selected as an example public cloud platform for our study. We conducted several performance evaluation experiments to understand the Quality of Service (QoS) tradeoffs of our applications under variable workload on Azure.

Exploring performance and power properties of modern multicore chips via simple machine models

January 2014


1,029 Reads

Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single- and multi-core performance of streaming kernels. The model refines the well-known roofline model, since it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multicore chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different requirements to the hardware, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multicore processors, and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice-Boltzmann flow solver code.
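
For readers unfamiliar with the roofline model that the abstract above says ECM refines, this short Python sketch shows only the basic roofline bound: attainable performance is the smaller of the compute peak and the memory bandwidth times the kernel's arithmetic intensity. The function names and the example numbers are hypothetical, and the ECM model itself is not reproduced here.

```python
def roofline(peak_flops, mem_bandwidth, intensity):
    """Upper bound on performance for a kernel with the given arithmetic intensity.

    peak_flops    -- compute peak, e.g. in GFLOP/s
    mem_bandwidth -- sustained memory bandwidth, e.g. in GB/s
    intensity     -- arithmetic intensity in FLOP per byte moved
    """
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_intensity(peak_flops, mem_bandwidth):
    """Intensity at which a kernel stops being bandwidth-bound."""
    return peak_flops / mem_bandwidth


# Hypothetical machine: 100 GFLOP/s peak, 40 GB/s memory bandwidth.
streaming_bound = roofline(100.0, 40.0, 0.5)   # bandwidth-limited kernel
compute_bound = roofline(100.0, 40.0, 10.0)    # compute-limited kernel
```

A streaming kernel with low intensity sits on the bandwidth-limited slope of the roofline, which is the regime where saturation with increasing core count matters.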

Lightweight Contention Management for Efficient Compare-and-Swap Operations

May 2013


63 Reads

Many concurrent data-structure implementations use the well-known compare-and-swap (CAS) operation, supported in hardware by most modern multiprocessor architectures for inter-thread synchronization. A key weakness of the CAS operation is the degradation in its performance in the presence of memory contention. In this work we study the following question: can software-based contention management improve the efficiency of hardware-provided CAS operations? Our performance evaluation establishes that lightweight contention management support can greatly improve performance under medium and high contention levels while typically incurring only small overhead when contention is low.
