Image segmentation is a very important step in the computerized analysis of digital images. The max-flow/min-cut approach has been successfully used to obtain minimum-energy segmentations of images in many fields. Classical algorithms for max-flow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB, and hardware synchronization for each 64-bit word. It is thus well suited to the parallelization of graph-theoretic algorithms such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture to image analysis in a production setting. The largest images we have run are 32000^2 pixels in size, which is well beyond the largest previously reported in the literature.
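For readers unfamiliar with the underlying algorithm, the following is a minimal sequential sketch of the generic Goldberg-Tarjan preflow-push scheme on a dense capacity matrix. It only illustrates the push and relabel operations; the multithreaded XMT-2 implementation for large grid-graph images described above is not shown.

```java
/** Minimal sequential sketch of generic preflow-push (push-relabel) on a dense
 *  capacity matrix; illustration only, not the parallel XMT-2 code. */
public final class PreflowPush {

    static long maxFlow(long[][] cap, int s, int t) {
        int n = cap.length;
        long[][] flow = new long[n][n];
        long[] excess = new long[n];
        int[] height = new int[n];

        // Saturate every edge leaving the source and lift the source to height n.
        height[s] = n;
        for (int v = 0; v < n; v++) {
            flow[s][v] = cap[s][v];
            flow[v][s] = -cap[s][v];
            excess[v] = cap[s][v];
        }

        boolean active = true;
        while (active) {
            active = false;
            for (int u = 0; u < n; u++) {
                if (u == s || u == t || excess[u] == 0) continue;
                active = true;
                // Push along admissible residual edges (height drops by exactly one).
                for (int v = 0; v < n && excess[u] > 0; v++) {
                    long residual = cap[u][v] - flow[u][v];
                    if (residual > 0 && height[u] == height[v] + 1) {
                        long d = Math.min(excess[u], residual);
                        flow[u][v] += d;  flow[v][u] -= d;
                        excess[u]  -= d;  excess[v]  += d;
                    }
                }
                // Still active: relabel to one above the lowest residual neighbour.
                if (excess[u] > 0) {
                    int minH = Integer.MAX_VALUE;
                    for (int v = 0; v < n; v++)
                        if (cap[u][v] - flow[u][v] > 0) minH = Math.min(minH, height[v]);
                    if (minH < Integer.MAX_VALUE) height[u] = minH + 1;
                }
            }
        }
        return excess[t];   // all remaining excess has drained into the sink
    }

    public static void main(String[] args) {
        long[][] cap = {
            {0, 3, 2, 0},
            {0, 0, 1, 3},
            {0, 0, 0, 2},
            {0, 0, 0, 0}
        };
        System.out.println(maxFlow(cap, 0, 3));   // expected max flow: 5
    }
}
```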
The distributed real-time architecture of an embedded system is often described as a set of communicating components. Such a system is data-flow oriented in its description and time-triggered in its execution. This work addresses these issues and focuses on controlling the temporal compatibility of a set of interdependent data used by the system components. The architecture of a component-based system forms a graph of communicating components, where more than one path can link two components. These paths may have different timing characteristics, but the flows of information that transit on these paths may need to be adequately matched, so that a component uses inputs that all (directly or indirectly) depend on the same production step. In this paper, we define this temporal data-matching property, we show how to analyze the architecture to detect situations that cause data-matching inconsistencies, and we describe an approach to manage data matching that uses queues to delay paths that are too fast and timestamps to recognize consistent data.
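As an illustration of the queue-and-timestamp idea (the class and field names below are ours, not the paper's), a consumer with two input paths can buffer the faster path and only combine inputs whose production-step timestamps agree:

```java
import java.util.ArrayDeque;

/** Illustrative sketch only: a consumer with two input paths buffers samples and
 *  consumes a pair only when both carry the same production-step timestamp. */
final class MatchingConsumer {

    record Sample(long productionStep, double value) {}

    private final ArrayDeque<Sample> fastPath = new ArrayDeque<>();
    private final ArrayDeque<Sample> slowPath = new ArrayDeque<>();

    void pushFast(Sample s) { fastPath.add(s); tryConsume(); }
    void pushSlow(Sample s) { slowPath.add(s); tryConsume(); }

    private void tryConsume() {
        while (!fastPath.isEmpty() && !slowPath.isEmpty()) {
            long f = fastPath.peek().productionStep();
            long g = slowPath.peek().productionStep();
            if (f == g) {                  // same production step: temporally consistent pair
                compute(fastPath.poll(), slowPath.poll());
            } else if (f < g) {
                fastPath.poll();           // stale sample that can no longer be matched
            } else {
                slowPath.poll();
            }
        }
    }

    private void compute(Sample a, Sample b) {
        System.out.printf("step %d: f(%.2f, %.2f)%n", a.productionStep(), a.value(), b.value());
    }
}
```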
A method of learning adaptation rules for case-based reasoning (CBR) is proposed in this paper. Adaptation rules are generated from the case base with the guidance of domain knowledge, which is also extracted from the case base. The adaptation rules are refined before they are applied in the revision process. After solving each new problem, the adaptation rule set is updated by an evolution module in the retention process. The results of a preliminary experiment show that the adaptation rules obtained can improve the performance of the CBR system compared to a retrieval-only CBR system.
It is becoming increasingly difficult to implement effective systems for preventing network attacks, due to the combination of (1) the rising sophistication of attacks, which requires more complex analysis to detect, (2) the relentless growth in the volume of network traffic that we must analyze, and, critically, (3) the failure in recent years of uniprocessor performance to sustain the exponential gains that CPUs enjoyed for so many years ("Moore's Law"). For commodity hardware, tomorrow's performance gains will instead come from multicore architectures in which a whole set of CPUs executes concurrently. Taking advantage of the full power of multicore processors for network intrusion prevention requires an in-depth approach. In this work we frame an architecture customized for parallel execution of network attack analysis. At the lowest layer of the architecture is an "Active Network Interface" (ANI), a custom device based on an inexpensive FPGA platform. The ANI provides the inline interface to the network, reading in packets and forwarding them after they are approved. It also serves as the front end for dispatching copies of the packets to a set of analysis threads. The analysis itself is structured as an event-based system, which allows us to find many opportunities for concurrent execution, since events introduce a natural, decoupled asynchrony into the flow of analysis while still maintaining good cache locality. Finally, by associating events with the packets that ultimately stimulated them, we can determine when all analysis for a given packet has completed, and thus that it is safe to forward the pending packet, provided none of the analysis elements previously signaled that the packet should instead be discarded.
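The per-packet completion test described above can be pictured as a small piece of reference counting; the sketch below is our illustration of that idea (names and structure are assumptions, not the system's actual code):

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of per-packet bookkeeping: a packet held by the front-end is released
 *  only once every analysis event derived from it has finished, and is forwarded
 *  only if no event asked for it to be dropped. */
final class PendingPacket {
    // Starts at 1: a "dispatch token" held until all initial events are created,
    // so the count cannot reach zero while events are still being spawned.
    private final AtomicInteger pendingEvents = new AtomicInteger(1);
    private volatile boolean drop = false;

    void eventSpawned()     { pendingEvents.incrementAndGet(); }
    void dispatchComplete() { release(); }

    void eventFinished(boolean verdictDrop) {
        if (verdictDrop) drop = true;
        release();
    }

    private void release() {
        if (pendingEvents.decrementAndGet() == 0) {
            if (drop) { /* tell the front-end to discard the pending packet */ }
            else      { /* tell the front-end to forward it */ }
        }
    }
}
```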
We present the parallelization of a sparse grid finite element discretization of the Black-Scholes equation, which is commonly used for option pricing. Sparse grids make it possible to handle higher-dimensional options than classical approaches on full grids, and can be extended to a fully adaptive discretization method. We introduce the algorithmic structure of efficient algorithms operating on sparse grids, and demonstrate how they can be used to derive an efficient OpenMP parallelization of the Black-Scholes solver. We show results on different commodity hardware systems based on multi-core architectures with up to 8 cores, and discuss the parallel performance using Intel and AMD CPUs.
Data-oriented workflows are often used in scientific applications for executing a set of dependent tasks across multiple computers. We discuss how these can be modeled using lambda calculus, and how ideas from functional programming are applicable in the design of workflows. Such an approach avoids the restrictions often found in workflow languages, permitting the implementation of complex application logic and data manipulation. This paper explains why lambda calculus is an appropriate model for workflow representation, and how a suitably efficient implementation can provide a wide range of capabilities to developers. The presented approach also permits high-level workflow features to be implemented at user level, in terms of a small set of low-level primitives provided by the language implementation.
Automatic construction of workflows on the Grid is currently a hot research topic. The problems that have to be solved are manifold: How can existing services be integrated into a workflow that is able to accomplish a specific task? How can an optimal workflow be constructed with respect to changing resource characteristics during the optimization process? How can dynamically changing or incomplete knowledge of the goal function of the optimization process be handled? And finally, how should the system react to service failures during workflow execution? In this paper we propose a method to optimize a workflow based on a heuristic A* approach that can react to dynamics in the environment, such as changes in the Grid infrastructure and in the users' requirements during the optimization process, as well as failing resources during execution.
Using multiple independent networks (also known as rails) is an emerging technique for overcoming bandwidth limitations and enhancing the fault tolerance of current high-performance parallel computers. In this paper, we present and analyze various algorithms to allocate multiple communication rails, including static and dynamic allocation schemes. An analytical lower bound on the number of rails required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various algorithms in terms of bandwidth and latency. We show that striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load and allocation scheme. The methods compared include a static rail allocation, a basic round-robin rail allocation, a local-dynamic allocation based on local knowledge, and a dynamic rail allocation that reserves both communication endpoints of a message before sending it. The last method is shown to perform better than the others at higher loads: up to 49% better than the local-knowledge allocation and 37% better than the round-robin allocation. This allocation scheme also shows lower latency and saturates at higher loads (for long enough messages). Most importantly, the proposed allocation scheme scales well with the number of rails and message size. In addition, we propose a hybrid algorithm that combines the benefits of the local-dynamic allocation for short messages with those of the dynamic algorithm for large messages.
Next generation e-learning platforms should support cooperative use of geographically distributed computing and educational resources as an aggregated environment to provide new levels of flexibility and extensibility. In this overall framework, our activity addresses the definition and implementation of advanced multimedia services for an aggregated grid-based e-learning environment, as well as the design and experimentation of a content distribution and multimedia streaming infrastructure in light of edge device heterogeneity, mobility, content adaptation and scalability. In this paper we initially present the general objectives and requirements that we are taking into account in the development of a multimedia access service for an e-learning platform. Then we describe a partial system prototype which capitalizes upon traditional features of grid computing like providing access to heterogeneous resources and services of different administrative domains in a transparent and secure way. Moreover, our system takes advantage of recent proposals by the Global Grid Forum (GGF) aiming at a standard for service-oriented architectures based on the concept of grid service.
The paper focuses on the formal aspects of the DIPLODOCUS environment. DIPLODOCUS is a UML profile intended for the modeling and verification of real-time and embedded applications meant to be executed on complex Systems-on-Chip. Application tasks and architectural elements (e.g., CPUs, buses, memories) are described with a UML-based language, using an open-source toolkit named TTool. Those descriptions may be automatically transformed into a formal hardware and software specification. From that specification, model-checking techniques may be applied to evaluate several properties of the system, e.g., safety, schedulability, and performance properties. The approach is exemplified with an MPEG2 decoding application.
Due to the increasing number of real-time high-performance applications such as control systems, autonomous robots, and financial systems, scheduling these real-time applications on HPC resources has become an important problem. This paper presents a novel real-time multiprocessor scheduling algorithm, called Notional Approximation for Balancing Load Residues (NABLR), which heuristically selects tasks for execution by taking into account their residual loads and laxities. The NABLR schedule is created by considering a sequence of inter-arrival intervals (IAI) between two consecutive job arrivals of any task and using a heuristic to carefully plan task execution so as to fully utilize the available resources in each of these intervals and avoid deadline misses as much as possible. Performance evaluation shows that NABLR outperforms previously known efficient algorithms (i.e., EDF and EDZL) in successfully scheduling sets of tasks whose total utilization equals the available resource capacity, performing closest to optimal algorithms such as LLREF and Pfair. Out of 2500 randomly selected high-utilization task sets, NABLR can schedule up to 97.9% of the sets versus 63.2% for the best known efficient algorithm. The overheads of the NABLR schedule are significantly smaller than those of optimal schedules (on average 80.57% fewer preemptions and migrations and 75.52% fewer scheduler invocations than LLREF) and comparable to those of efficient suboptimal schedules (fewer or nearly the same number of invocations as EDZL and ASEDZL, and only 0.12% more preemptions/migrations than ASEDZL). NABLR has the same O(N log N) time complexity as other previously proposed efficient algorithms.
We consider a task graph to be executed on a set of processors. We assume
that the mapping is given, say by an ordered list of tasks to execute on each
processor, and we aim at optimizing the energy consumption while enforcing a
prescribed bound on the execution time. While it is not possible to change the
allocation of a task, it is possible to change its speed. Rather than using a
local approach such as backfilling, we consider the problem as a whole and
study the impact of several speed variation models on its complexity. For
continuous speeds, we give a closed-form formula for trees and series-parallel
graphs, and we cast the problem into a geometric programming problem for
general directed acyclic graphs. We show that the classical dynamic voltage and
frequency scaling (DVFS) model with discrete modes leads to an NP-complete
problem, even if the modes are regularly distributed (an important particular
case in practice, which we analyze as the incremental model). On the contrary,
the VDD-hopping model leads to a polynomial solution. Finally, we provide an
approximation algorithm for the incremental model, which we extend for the
general DVFS model.
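To make the continuous model concrete, consider the simplest special case of a single chain of tasks on one processor with power proportional to the cube of the speed; by convexity, the optimal solution runs every task at the common speed W/D, where W is the total work and D the deadline. The sketch below computes this closed form (the general formulas for trees, series-parallel graphs and DAGs are more involved and are not reproduced here):

```java
/** Toy illustration of the continuous-speed model for a single chain of tasks on
 *  one processor, assuming dynamic power proportional to s^3 (so the energy of a
 *  task of work w run at speed s is w * s^2). For a chain, convexity makes the
 *  common speed W/D optimal; general graphs need the paper's formulas. */
final class UniformSpeed {
    static double optimalEnergy(double[] work, double deadline) {
        double total = 0;
        for (double w : work) total += w;
        double speed = total / deadline;      // common optimal speed for a chain
        return total * speed * speed;         // sum_i w_i * s^2
    }

    public static void main(String[] args) {
        double[] work = {2.0, 1.0, 3.0};
        System.out.println(optimalEnergy(work, 4.0));   // speed 1.5, energy 13.5
    }
}
```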
In this paper, we study CPU utilization time patterns of several Map-Reduce applications. After extracting the running patterns of several applications, the patterns and their statistical information are saved in a reference database to be used later to tweak system parameters so that unknown applications can be executed efficiently in the future. To achieve this goal, the CPU utilization patterns of new applications, along with their statistical information, are compared with the already known ones in the reference database to find/predict their most probable execution patterns. Because the patterns have different lengths, Dynamic Time Warping (DTW) is used for this comparison; a statistical analysis is then applied to the DTW outcomes to select the most suitable candidates. Moreover, under a hypothesis, another algorithm is proposed to classify applications with similar CPU utilization patterns. Three widely used text processing applications (WordCount, Distributed Grep, and Terasort) and another application (Exim Mainlog parsing) are used to evaluate our hypothesis of tweaking system parameters when executing similar applications. The results were very promising and showed the effectiveness of our approach on a 5-node Map-Reduce platform.
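As a reference for the comparison step, the sketch below shows the standard DTW distance between two utilization traces of different lengths; the sampling, normalization and statistical post-processing used in the paper are not shown:

```java
/** Standard dynamic-time-warping distance between two CPU-utilization traces of
 *  possibly different lengths (textbook O(n*m) dynamic program). */
final class Dtw {
    static double distance(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(a[i - 1] - b[j - 1]);
                d[i][j] = cost + Math.min(d[i - 1][j - 1],
                                Math.min(d[i - 1][j], d[i][j - 1]));
            }
        }
        return d[n][m];   // lower value = more similar utilization patterns
    }
}
```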
The pricing of American-style and multiple exercise options is a very challenging problem in mathematical finance. One usually employs a Least-Squares Monte Carlo approach (the Longstaff-Schwartz method) for the evaluation of the conditional expectations that arise in the Backward Dynamic Programming principle for such optimal stopping or stochastic control problems in a Markovian framework. Unfortunately, these Least-Squares Monte Carlo approaches are rather slow and, due to the dependency structure in the Backward Dynamic Programming principle, allow no parallel implementation, neither at the Monte Carlo level nor at the time-layer level of this problem. We therefore present in this paper a quantization method for the computation of the conditional expectations that allows a straightforward parallelization on the Monte Carlo level. Moreover, for AR(1) processes we are able to develop a further parallelization in the time domain, which makes use of faster memory structures and therefore maximizes parallel execution. Finally, we present numerical results for a CUDA implementation of these methods. It turns out that such an implementation leads to an impressive speed-up compared to a serial CPU implementation.
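The quantization idea can be illustrated as follows: conditional expectations are replaced by weighted sums over finite grids, so the Backward Dynamic Programming recursion becomes a sequence of small matrix-vector products that parallelize naturally over grid points. The sketch below assumes the grids, transition weights and payoffs are given, and omits discounting and the CUDA-specific time-domain parallelization:

```java
/** Illustration of quantization-based backward dynamic programming for an optimal
 *  stopping problem: conditional expectations become weighted sums over a finite
 *  grid at each time step. Grids, transition weights and payoffs are assumed given. */
final class QuantizationTree {
    /**
     * @param payoff payoff[k][i]: exercise value at time step k in grid point i
     * @param p      p[k][i][j]: transition weight from point i at step k to point j at step k+1
     * @return       option values on the initial grid
     */
    static double[] solve(double[][] payoff, double[][][] p) {
        int steps = payoff.length;
        double[] value = payoff[steps - 1].clone();          // terminal condition
        for (int k = steps - 2; k >= 0; k--) {
            double[] next = value;
            value = new double[payoff[k].length];
            for (int i = 0; i < value.length; i++) {
                double cont = 0.0;                           // quantized conditional expectation
                for (int j = 0; j < next.length; j++) cont += p[k][i][j] * next[j];
                value[i] = Math.max(payoff[k][i], cont);     // exercise vs. continue
            }
        }
        return value;
    }
}
```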
Financial institutions have massive computations to carry out overnight which are very demanding in terms of CPU consumption. The challenge is to price many different products on a cluster-like architecture. We have used the Premia software to value the financial derivatives. In this work, we explain how Premia can be embedded into Nsp, a scientific software package similar to Matlab, to provide a powerful tool for valuing a whole portfolio. Finally, we have integrated an MPI toolbox into Nsp to enable the use of Premia to solve a set of pricing problems on a cluster. This unified framework can then be used to test different parallel architectures.
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization in which the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks, which completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
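As an illustration of the task structure (kernel names follow the usual tiled-QR literature; the numerical kernels and the dynamic scheduler are omitted), the factorization of a p x q grid of square blocks generates the following sequence of fine-grained tasks, which a dependency-driven scheduler may then execute out of order:

```java
/** Enumerates the fine-grained tasks of a tiled QR factorization of a p x q grid of
 *  square blocks. Only the task structure is shown; the kernels themselves and the
 *  dependency-driven runtime are not implemented here. */
final class TiledQrTasks {
    static void enumerate(int p, int q) {
        for (int k = 0; k < Math.min(p, q); k++) {
            System.out.printf("GEQRT(%d,%d)%n", k, k);              // factor diagonal tile
            for (int j = k + 1; j < q; j++)
                System.out.printf("UNMQR(%d,%d)%n", k, j);          // apply Q^T along row k
            for (int i = k + 1; i < p; i++) {
                System.out.printf("TSQRT(%d,%d)%n", i, k);          // annihilate tile below diagonal
                for (int j = k + 1; j < q; j++)
                    System.out.printf("SSRFB(%d,%d,%d)%n", i, k, j); // update trailing tiles
            }
        }
    }

    public static void main(String[] args) { enumerate(3, 3); }
}
```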
With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Despite the fact that these platforms are heterogeneous multicores, no approach exists yet that is capable of combining the forces of a system's CPU and GPU for JPEG decoding.
In this paper we introduce a novel JPEG decoding scheme for heterogeneous architectures consisting of a CPU and an OpenCL-programmable GPU. We employ an offline profiling step to determine the performance of a system's CPU and GPU with respect to JPEG decoding. For a given JPEG image, our performance model uses (1) the CPU and GPU performance characteristics, (2) the image entropy and (3) the width and height of the image to balance the JPEG decoding workload on the underlying hardware. Our runtime partitioning and scheduling scheme exploits task, data and pipeline parallelism by scheduling the non-parallelizable entropy decoding task on the CPU, whereas inverse discrete cosine transforms (IDCTs), color conversions and upsampling are conducted on both the CPU and the GPU. Our kernels have been optimized for GPU memory hierarchies.
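The performance model itself is not given in the abstract; the sketch below merely illustrates the kind of split decision such a model implies, with per-device throughput estimates (assumed to come from the offline profiling step for a given entropy level) used to divide the image's MCU rows between CPU and GPU:

```java
/** Hypothetical sketch of the partitioning decision implied by a CPU/GPU performance
 *  model: throughput estimates per MCU row (assumed profiled per entropy level) are
 *  used to split the image's rows proportionally between the two devices. */
final class JpegSplit {
    static int gpuRows(int totalMcuRows, double cpuRowsPerSec, double gpuRowsPerSec) {
        double gpuShare = gpuRowsPerSec / (cpuRowsPerSec + gpuRowsPerSec);
        return (int) Math.round(totalMcuRows * gpuShare);   // remaining rows stay on the CPU
    }

    public static void main(String[] args) {
        // e.g. profiled throughputs for one entropy bucket (made-up numbers)
        int rowsOnGpu = gpuRows(480, 2_000.0, 6_000.0);
        System.out.println(rowsOnGpu + " of 480 MCU rows go to the GPU");   // 360
    }
}
```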
We have implemented the proposed method in the context of the libjpeg-turbo library, which is an industrial-strength JPEG encoding and decoding engine. Libjpeg-turbo's hand-optimized SIMD routines for ARM and x86 architectures constitute a competitive yardstick for comparison with the proposed approach. Retrofitting our method into libjpeg-turbo provides insights into the software-engineering aspects of re-engineering legacy code for heterogeneous multicores.
We have evaluated our approach for a total of 7194 JPEG images across three high- and middle-end CPU-GPU combinations. We achieve speedups of up to 4.2x over the SIMD version of libjpeg-turbo, and speedups of up to 8.5x over its sequential code. Taking into account the non-parallelizable JPEG entropy decoding part, our approach achieves up to 95% of the theoretically attainable maximal speedup, with an average of 88%.
Energy efficiency has been a daunting challenge for datacenters. The
financial industry operates some of the largest datacenters in the world. With
increasing energy costs and the growth of the financial services sector, emerging financial analytics workloads may incur extremely high operational costs to meet their latency targets. Microservers have recently emerged as an
alternative to high-end servers, promising scalable performance and low energy
consumption in datacenters via scale-out. Unfortunately, stark differences in
architectural features, form factor and design considerations make a fair
comparison between servers and microservers exceptionally challenging. In this
paper we present a rigorous methodology and new metrics for fair comparison of
server and microserver platforms. We deploy our methodology and metrics to
compare a microserver with ARM cores against two servers with x86 cores,
running the same real-time financial analytics workload. We define
workload-specific but platform-independent performance metrics for platform
comparison, targeting both datacenter operators and end users. Our methodology
establishes that a server based on the Xeon Phi processor delivers the highest
performance and energy-efficiency. However, by scaling out energy-efficient
microservers, we achieve competitive or better energy-efficiency than a
power-equivalent server with two Sandy Bridge sockets despite the microserver's
slower cores. Using a new iso-QoS (iso-Quality of Service) metric, we find that
the ARM microserver scales enough to meet market throughput demand, i.e., 100%
QoS in terms of timely option pricing, with as little as 55% of the energy
consumed by the Sandy Bridge server.
In this paper we describe our work on enabling fine-grained authorization for resource usage and management. We address the need of virtual organizations to enforce their own policies, in addition to those of the resource owners, with regard to both resource consumption and job management. To implement this design, we propose changes and extensions to the Globus Toolkit version 2 resource management mechanism. We describe the prototype and the policy language that we designed to express fine-grained policies, and we present an analysis of our solution.
We introduce a new technique for automated performance diagnosis, using the program’s callgraph. We discuss our implementation
of this diagnosis technique in the Paradyn Performance Consultant. Our implementation includes the new search strategy and
new dynamic instrumentation to resolve pointer-based dynamic call sites at run-time. We compare the effectiveness of our new
technique to the previous version of the Performance Consultant for several sequential and parallel applications. Our results
show that the new search method performs its search while inserting dramatically less instrumentation into the application,
resulting in reduced application perturbation and consequently a higher degree of diagnosis accuracy.
The latest trends in high-performance computing systems show an increasing demand for using large-scale multicore systems in an efficient way, so that highly compute-intensive applications can be executed reasonably well. However, the exploitation of the degree of parallelism available at each multicore component can be limited by poor utilization of the available memory hierarchy. In fact, the multicore architecture introduces some distinct features that are already observed in shared-memory and distributed environments. One example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative that applications be carefully allocated to the available cores, based on a scheduling model that considers the main performance bottlenecks, such as memory contention. In this paper, the Multicore Cluster Model (MCM) is proposed, which captures the most relevant performance characteristics of multicore systems, such as the influence of the memory hierarchy and contention. Better performance was achieved when the load-balancing strategy of a Branch-and-Bound application applied to the Partitioning Sets Problem was based on MCM, showing its efficiency and applicability to modern systems.
Electronic Health (e-Health) technology has brought the world a significant transformation from traditional paper-based medical practice to Information and Communication Technology (ICT)-based systems for the automatic management (storage, processing, and archiving) of information. Traditionally, e-Health systems have been designed to operate within stovepipes on dedicated networks, physical computers, and locally managed software platforms, which makes them susceptible to many serious limitations, including: 1) lack of on-demand scalability during critical situations; 2) high administrative overheads and costs; and 3) inefficient resource utilization and energy consumption due to a lack of automation. In this paper, we present an approach to migrate the ICT systems in the e-Health sector from the traditional in-house Client/Server (C/S) architecture to the virtualised cloud computing environment. To this end, we developed two cloud-based e-Health applications (a Medical Practice Management System and a Telemedicine Practice System) to demonstrate how cloud services can be leveraged for developing and deploying such applications. The Windows Azure cloud computing platform is selected as an example public cloud platform for our study. We conducted several performance evaluation experiments to understand the Quality of Service (QoS) tradeoffs of our applications under variable workload on Azure.
Modern multicore chips show complex behavior with respect to performance and
power. Starting with the Intel Sandy Bridge processor, it has become possible
to directly measure the power dissipation of a CPU chip and correlate this data
with the performance properties of the running code. Going beyond a simple
bottleneck analysis, we employ the recently published Execution-Cache-Memory
(ECM) model to describe the single- and multi-core performance of streaming
kernels. The model refines the well-known roofline model, since it can predict
the scaling and the saturation behavior of bandwidth-limited loop kernels on a
multicore chip. The saturation point is especially relevant for considerations
of energy consumption. From power dissipation measurements of benchmark programs with vastly different demands on the hardware, we derive a
simple, phenomenological power model for the Sandy Bridge processor. Together
with the ECM model, we are able to explain many peculiarities in the
performance and power behavior of multicore processors, and derive guidelines
for energy-efficient execution of parallel programs. Finally, we show that the
ECM and power models can be successfully used to describe the scaling and power
behavior of a lattice-Boltzmann flow solver code.
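The scaling prediction at the heart of such models can be summarized very roughly as follows (the numbers and the simplified min() form below are illustrative, not the published ECM model): performance grows linearly with the core count until the chip's bandwidth ceiling is hit, and that saturation point is where additional cores cost power without adding performance.

```java
/** Simplified, illustrative scaling prediction in the spirit of bandwidth-saturation
 *  models: linear scaling with core count up to a memory-bandwidth-limited ceiling. */
final class ScalingSketch {
    static double predict(int cores, double singleCorePerf, double memBoundPerf) {
        return Math.min(cores * singleCorePerf, memBoundPerf);
    }

    public static void main(String[] args) {
        double singleCore = 1.4;   // e.g. GFlop/s for one core (made-up number)
        double memBound   = 4.2;   // bandwidth-limited ceiling for the full chip
        for (int n = 1; n <= 8; n++)
            System.out.printf("%d cores: %.1f GFlop/s%n", n, predict(n, singleCore, memBound));
        // Saturation at 3 cores here; beyond that, extra cores add only power,
        // which is why the saturation point matters for energy efficiency.
    }
}
```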
Many concurrent data-structure implementations use the well-known
compare-and-swap (CAS) operation, supported in hardware by most modern
multiprocessor architectures for inter-thread synchronization. A key weakness
of the CAS operation is the degradation in its performance in the presence of
memory contention.
In this work we study the following question: can software-based contention
management improve the efficiency of hardware-provided CAS operations? Our
performance evaluation establishes that lightweight contention management
support can greatly improve performance under medium and high contention levels
while typically incurring only small overhead when contention is low.
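The specific contention-management mechanism studied in this work is not described here; the sketch below only illustrates the general idea of lightweight software contention management, wrapping a failed compareAndSet in a randomized, exponentially growing backoff:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

/** Generic illustration of software contention management around CAS (not the
 *  specific scheme evaluated here): after a failed compareAndSet, the thread backs
 *  off for a randomized, exponentially growing interval before retrying, reducing
 *  cache-line ping-pong under heavy contention. */
final class BackoffCounter {
    private final AtomicLong value = new AtomicLong();

    long increment() {
        long maxDelayNs = 1_000;                              // backoff cap grows up to ~1 ms
        while (true) {
            long current = value.get();
            if (value.compareAndSet(current, current + 1)) return current + 1;
            LockSupport.parkNanos(ThreadLocalRandom.current().nextLong(1, maxDelayNs + 1));
            maxDelayNs = Math.min(maxDelayNs * 2, 1_000_000);
        }
    }
}
```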