Conference Paper

Parallel Application Signature for Performance Prediction

DOI: 10.1109/HPCC.2010.92
Conference: PDPTA
Source: DBLP

ABSTRACT Predicting the performance of parallel applications is becoming increasingly complex. The best performance predictor is the application itself, but the time required to run it thoroughly is an onerous requirement. We seek to characterize the behavior of message-passing applications on different systems by extracting a signature that allows us to predict on which system the application will perform best. To achieve this goal, we have developed a method called Parallel Application Signatures for Performance Prediction (PAS2P) that describes an application based on its behavior. Based on the application's message-passing activity, we identify and extract representative phases, from which we create a Parallel Application Signature that allows us to predict the application's performance. We have experimented with different signature-extraction algorithms and found a reduction in the prediction error for different scientific applications on different clusters. We were able to predict execution times with an average accuracy of over 98%.
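The abstract does not spell out here how accuracy is computed; assuming the usual relative-error definition, the reported figure corresponds to

\[
\text{error} = \frac{\lvert T_{\text{predicted}} - T_{\text{measured}} \rvert}{T_{\text{measured}}} \times 100\%,
\qquad
\text{accuracy} = 100\% - \text{error},
\]

i.e. an average accuracy above 98% means the predicted runtime deviates from the measured runtime by less than 2% on average.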

  • ABSTRACT: Being able to accurately estimate how an application will perform on a specific computational system provides many useful benefits and can lead to smarter decisions. In this work we present a novel approach to modeling the behavior of message-passing parallel applications. Based on the concept of signatures, which comprise the most relevant parts of an application (its phases), we build a model that allows us to predict the application's execution time on different systems with variable input data sizes. Executing these signatures with different input data sizes defines a partial function of the program's behavior. Using regression, we generalize this behavior function to predict an application's performance on a target system for other input data sizes within a predefined range. We explain our methodology and, to validate the proposal, present results using a synthetic program and well-known applications. (A minimal sketch of such a regression fit follows this entry.)
    The 9th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2011, Sharm El-Sheikh, Egypt, December 27-30, 2011; 01/2011
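    The entry above does not specify the regression model. The sketch below assumes a power-law fit t ≈ a·n^b over measured signature times; the sizes, timings and function names are invented for illustration and are not taken from the paper.

    ```python
    import numpy as np

    # Signature execution times measured on the target system for several
    # input data sizes. Values are illustrative, not taken from the paper.
    sizes = np.array([1_000, 2_000, 4_000, 8_000])
    times = np.array([1.2, 4.6, 18.1, 71.5])      # seconds

    # Fit a power-law model t = a * n**b in log-log space.
    # (The entry only says "regression"; the model choice is an assumption.)
    b, log_a = np.polyfit(np.log(sizes), np.log(times), deg=1)
    a = np.exp(log_a)

    def predicted_time(n):
        """Predicted execution time for input size n, valid only within the
        measured range, as the method prescribes."""
        return a * n ** b

    print(f"predicted time for n=6000: {predicted_time(6_000):.1f} s")
    ```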
  • ABSTRACT: The increase in the number of processing units, in speed and computational power, and in the complexity of scientific applications that use high-performance computing requires more efficient Input/Output (I/O) systems. To use the I/O system efficiently, it is necessary to know its performance capacity and determine whether it fulfills an application's I/O requirements. This paper proposes a methodology to evaluate I/O performance on computer clusters under different I/O configurations. This evaluation is useful for studying how different I/O subsystem configurations will affect application performance. The approach encompasses the characterization of the I/O system at three levels: application, I/O system, and I/O devices. We select different system configurations and/or I/O operation parameters and evaluate their impact on performance, considering both the application and the I/O architecture. During the I/O configuration analysis we identify configurable factors that affect the performance of the I/O system. In addition, we extract information that lets us select the most suitable configuration for the application. (A sketch of such a configuration sweep follows this entry.)
    2011 IEEE International Conference on Cluster Computing (CLUSTER), Austin, TX, USA, September 26-30, 2011; 01/2011
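    The entry above does not list the concrete configurable factors it evaluates. The sketch below uses hypothetical factors (stripe_size_kb, io_nodes) and a placeholder measurement routine, only to illustrate the idea of sweeping I/O configurations and selecting the most suitable one.

    ```python
    import itertools
    import time

    # Hypothetical configurable I/O factors; the real factors and levels come
    # from characterizing the application and the I/O architecture.
    factors = {
        "stripe_size_kb": [64, 256, 1024],
        "io_nodes": [1, 2, 4],
    }

    def measure_io_time(config):
        """Placeholder: run the application's I/O kernel (or an I/O benchmark)
        under `config` and return the elapsed time in seconds."""
        start = time.time()
        # launch the benchmark here with the given configuration
        return time.time() - start

    results = {}
    for values in itertools.product(*factors.values()):
        config = dict(zip(factors, values))
        results[tuple(values)] = measure_io_time(config)

    # The most suitable configuration is the one with the lowest I/O time.
    best = dict(zip(factors, min(results, key=results.get)))
    print("most suitable configuration:", best)
    ```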
  • ABSTRACT: In order to measure the performance of a parallel machine, a set of application kernels has often been used as benchmarks. However, it is not always possible to characterize performance using only benchmarks, given that each one usually reflects, at best, a narrow set of kernel applications. Computers show different performance indices for the different applications they run. Accurate prediction of a parallel application's performance is becoming increasingly complex, and the time required to run the application thoroughly is an onerous requirement, especially if we want to predict for different systems. In production clusters, where throughput and efficiency of use are fundamental, it is important to be able to predict which system is more appropriate for an application, or how long a scheduled application will take to run, in order to make better use of the available resources.

    We have created a tool [4], which we dubbed Parallel Application Signature for Performance Prediction (PAS2P), to characterize message-passing parallel applications. PAS2P instruments and executes applications on a parallel machine and produces a trace log. The data collected is used to characterize computation and communication behaviour. To obtain a machine-independent application model, the trace is assigned a logical global clock according to the causality relations between communication events, through an algorithm inspired by Lamport's. Once we have the logical trace, we identify and extract the most relevant event sequences (phases) and assign each a weight based on the number of times it occurs. Afterwards, we create a signature defined by the set of phases selected according to their weight. Executing this signature on different target systems allows us to measure the execution time of each phase, and hence to estimate the entire application's run time on each of those systems by extrapolating each phase's execution time using the weights we have obtained.

    As shown in Figure 1, a sequence of stages is necessary to obtain the relevant portions (phases) and their weights. With this information, we can create a completely machine-independent signature for each application, which we can then execute on other systems in a shorter amount of time, since the execution time of the signature will always be a small fraction of the whole application's runtime. Finally, in the last stage, we predict the full execution time of the parallel application by adding the execution times of all the phases multiplied by their weights (a minimal sketch of this stage appears after this entry).

    To instrument the parallel applications, we need to collect the communication events and the computational time. We define: Event: the action of sending or receiving a message. Extended Basic Block (EBB): a generalization of the Basic Block concept for parallel systems; we define it as a segment of a process whose beginning and end are delimited by occurrences of messages, either sent or received. We may also say that it is a "computational time" segment bounded by communication actions. The synchronization between computing nodes, which is absent in sequential applications, becomes necessary. To solve this, we have to move from multiple physical, local clocks to a single logical, global clock. In [1], we showed a logical clock based on the order of precedence of events across processes as defined by Lamport.
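    A minimal sketch of such a Lamport-style logical clock over a message-passing trace, under simplifying assumptions: one send and one receive per message, events listed in a globally consistent order, and event tuples invented here for illustration (this is not PAS2P's trace format).

    ```python
    from collections import defaultdict

    # Illustrative trace: (process, kind, message_id).
    events = [
        (0, "send", "m1"),
        (1, "recv", "m1"),
        (1, "send", "m2"),
        (0, "recv", "m2"),
    ]

    clock = defaultdict(int)   # current logical time per process
    send_lt = {}               # message_id -> logical time of its send
    logical_trace = []

    for proc, kind, msg in events:
        if kind == "send":
            lt = clock[proc] + 1
            send_lt[msg] = lt
        else:
            # Lamport's rule: a reception happens after both the local past
            # and the corresponding send.
            lt = max(clock[proc], send_lt[msg]) + 1
        clock[proc] = lt
        logical_trace.append((proc, kind, msg, lt))

    print(logical_trace)
    # [(0, 'send', 'm1', 1), (1, 'recv', 'm1', 2),
    #  (1, 'send', 'm2', 3), (0, 'recv', 'm2', 4)]
    ```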
    With that ordering, however, we found that the quality of the prediction falls, because the ordering of receptions is non-deterministic. To solve the problem of non-deterministic events (receptions), we introduced a new algorithm [3] inspired by Lamport's. Through this algorithm we define a new logical ordering in which, if a process sends a message at a logical time (LT), its reception is modeled to arrive at LT + 1 and never later. Once all events have been assigned an LT, we create a logical trace into which we insert every event according to its logical time and type of communication (Send or Recv). Finally, once each event has been placed, we subdivide the logical trace into additional logical times so that there is only one event per process at each logical time.
    11/2010;
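    As referenced above, a minimal sketch of the final prediction stage: phases are weighted by how often they occur, and the predicted runtime is the weighted sum of the phase times measured by running the signature on the target machine. Phase names, weights and timings below are invented purely for illustration; in PAS2P the phases themselves are discovered from repeated sequences of communication events in the logical trace, whereas here they are simply given.

    ```python
    from collections import Counter

    # Machine-independent logical trace reduced to the phase each segment
    # belongs to (phase names and repetition pattern are illustrative).
    phase_sequence = ["A", "B", "A", "B", "A", "B", "C"]

    # Weight of each phase = number of times it occurs.
    weights = Counter(phase_sequence)            # {'A': 3, 'B': 3, 'C': 1}

    # Executing the signature (one instance of each phase) on the target
    # system yields per-phase execution times in seconds (illustrative).
    phase_time = {"A": 0.8, "B": 1.5, "C": 4.0}

    # Predicted whole-application runtime: sum of phase time x phase weight.
    predicted = sum(phase_time[p] * w for p, w in weights.items())
    print(f"predicted execution time: {predicted:.1f} s")   # 10.9 s
    ```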