Parallel Application Signature for Performance Prediction.
ABSTRACT Predicting the performance of parallel applications is becoming increasingly complex. The best performance predictor is the application itself, but the time required to run it thoroughly is an onerous requirement. We seek to characterize the behavior of message-passing applications on different systems by extracting a signature that allows us to predict which system will allow the application to perform best. To achieve this goal, we have developed a method we call Parallel Application Signatures for Performance Prediction (PAS2P), which strives to describe an application based on its behavior. Based on the application's message-passing activity, we have been able to identify and extract representative phases, with which we created a Parallel Application Signature that has allowed us to predict the application's performance. We have experimented with different signature-extraction algorithms and found a reduction in the prediction error using different scientific applications on different clusters. We were able to predict execution times with an average accuracy of over 98%.
- "The modules of DwarfCode include trace recording, trace merging, repeat compression and dwarf code generation. Although several related studies have been conducted and well-grounded in trace recording and code generation, challenges remain" (Author note: W. Zhang is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.)
ABSTRACT: We present DwarfCode, a performance prediction tool for MPI applications on diverse computing platforms. The goal is to accurately predict the running time of applications for task scheduling and job migration. First, DwarfCode collects the execution traces to record the computing and communication events. Then, it merges the traces from different processes into a single trace. After that, DwarfCode identifies and compresses the repeating patterns in the final trace to shrink the size of the events. Finally, a dwarf code is generated to mimic the original program behavior. This smaller running benchmark is replayed in the target platform to predict the performance of the original application. In order to generate such a benchmark, two major challenges are to reduce the time complexity of the trace merging and repeat compression algorithms. We propose an O(mpn) trace merging algorithm to combine the traces generated by separate MPI processes, where m denotes the upper bound of tracing distance, p denotes the number of processes, and n denotes the maximum of event numbers of all the traces. More importantly, we put forward a novel repeat compression algorithm, whose time complexity is O(n log n). Experimental results show that DwarfCode can accurately predict the running time of MPI applications. The error rate is below 10% for compute and communication intensive applications. This toolkit has been released for free download as a GNU General Public License v3 software.
IEEE Transactions on Computers 01/2015; DOI:10.1109/TC.2015.2417526 · 1.47 Impact Factor
- "Some example scenarios are shown in Fig. 2. In addition, we have equipped our models with a set of configuration parameters that allow users to modify the behavior and configuration of the simulated system. Some of these parameters enable the simulation of real-based scenarios from the execution traces of real programs, as well as the inclusion of failure traces of real HPC systems. The most relevant parameters of the simulation models, summarized in Table 4, are: network topology; routing algorithm; traffic pattern; real-program execution traces; real-system failure traces; packet size; link frequency/speed; and router buffer size."
ABSTRACT: Nowadays, the study of high-performance computing (HPC) is one of the essential aspects of postgraduate programmes in Computational Science. However, university education in HPC often suffers from a significant gap between theoretical concepts and the practical experience of students. To face this challenge, we have implemented an innovative teaching strategy that provides students with appropriate resources to ease the assimilation of theoretical concepts, while improving their practical experience through the use of teaching tools and resources specifically designed to promote active learning. We have used the proposed strategy to organize the module of Parallel Computers and Architectures of the Master's in High-Performance Computing at the Universitat Autònoma de Barcelona, obtaining very promising results. In particular, compared with the previous teaching approach, we have observed improvements in both the academic marks of students and their perception of their own expertise and skills in HPC.
ABSTRACT: To measure the performance of a parallel machine, a set of application kernels has often been used as benchmarks. However, it is not always possible to characterize performance using only benchmarks, given that each one usually reflects a narrow set of kernel applications at best. Computers show different performance indices for the different applications they run. Accurate prediction of a parallel application's performance is becoming increasingly complex, and the time required to run the application thoroughly is an onerous requirement, especially if we want to predict for different systems. In production clusters, where throughput and efficiency of use are fundamental, it is important to be able to predict which system is more appropriate for an application, or how long a scheduled application will take to run, in order to make better use of the resources available. We have created a tool, which we dubbed Parallel Application Signature for Performance Prediction (PAS2P), to characterize message-passing parallel applications. PAS2P instruments and executes applications on a parallel machine and produces a trace log. The data collected is used to characterize computation and communication behaviour. To obtain the machine-independent application model, the trace is assigned a logical global clock according to the causality relations between communication events, through an algorithm inspired by Lamport's. Once we have the logical trace, we identify and extract the most relevant event sequences (phases) and assign each of them a weight based on the number of times it occurs. Afterwards, we create a signature defined by a set of phases selected according to their weight. Executing this signature on different target systems allows us to measure the execution time of each phase, and hence to estimate the entire application's run time on each of those systems.
We do this by extrapolating each phase's execution time using the weights we have obtained. As shown in Figure 1, a sequence of stages is necessary to obtain the relevant portions (phases) and their weights. With this information, we can create a completely machine-independent signature for each application, which we can then execute on other systems in a shorter amount of time, since the execution time of the signature will always be a small fraction of the whole application's runtime. Finally, in the last stage, we predict the full execution time of the parallel application by adding up the execution times of all the phases, each multiplied by its weight.
To instrument the parallel applications, we need to collect the communication events and the computational time. We then define:
- Event: The action of sending or receiving a message.
- Extended Basic Block (EBB): A generalization of the Basic Block concept for parallel systems. We define it as a segment of a process whose beginning and end are defined by occurrences of messages, either sent or received. We may also say that it is a "computational time" segment bounded by communication actions.
Synchronization between computing nodes, which is absent in sequential applications, becomes necessary. To solve this, we have to move from multiple physical, local clocks to a single logical, global clock. In earlier work, we showed a logical clock based on the order of precedence of events across processes as defined by Lamport. We found that the quality of prediction falls, because the ordering of receptions is non-deterministic. To solve the problem of non-deterministic events (receptions), we have decided to introduce a new algorithm inspired by Lamport's. Through this algorithm, we define a new logical ordering in which, if one process sends a message at a Logical Time (LT), its reception is modeled to arrive at LT + 1 and never afterwards.
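The logical-time rule can be illustrated with a minimal Python sketch. This is not the PAS2P implementation: the text only states that a reception is placed at the LT of its send plus one. The `Event` record, the `msg` identifier that pairs a send with its receive, the function name `assign_logical_times`, and the choice to give each send the LT following the highest LT already used on its process are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    proc: int  # rank of the process where the event occurs
    kind: str  # "send" or "recv"
    msg: int   # hypothetical message id pairing a send with its recv


def assign_logical_times(traces):
    """Assign a logical time (LT) to every event.

    traces maps a rank to its events in per-process order.
    Assumed rule for sends: LT = highest LT used so far on that process + 1.
    Rule from the text for receives: LT = LT of the matching send + 1.
    """
    lt = {}                           # Event -> logical time
    send_lt = {}                      # msg id -> LT of its send
    clock = {p: -1 for p in traces}   # highest LT used so far per process
    cursor = {p: 0 for p in traces}   # next unprocessed event per process
    remaining = sum(len(t) for t in traces.values())

    while remaining:
        progressed = False
        for p, trace in traces.items():
            while cursor[p] < len(trace):
                ev = trace[cursor[p]]
                if ev.kind == "send":
                    t = clock[p] + 1
                    send_lt[ev.msg] = t
                elif ev.msg in send_lt:
                    t = send_lt[ev.msg] + 1  # reception at LT + 1
                else:
                    break  # matching send has no LT yet; try other processes
                lt[ev] = t
                clock[p] = max(clock[p], t)
                cursor[p] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise ValueError("unmatched receive or cyclic dependency")
    return lt


# Two processes exchanging two messages (illustrative data only).
example = {
    0: [Event(0, "send", 1), Event(0, "recv", 2)],
    1: [Event(1, "recv", 1), Event(1, "send", 2)],
}
times = assign_logical_times(example)
```

Because a reception is pinned to the LT of its send plus one regardless of when it physically arrived, two events on the same process may end up with the same LT under these assumptions.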
Once all events have been assigned an LT, we create a logical trace into which we insert every event according to its logical time and type of communication (Send or Recv). Finally, once each event has been located, we subdivide the logical trace into additional logical times so that there is at most one event per process in each logical time.
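The final prediction stage described earlier (adding up the measured execution time of each phase multiplied by its weight) can be sketched in Python as follows. The function names `phase_weights` and `predict_execution_time`, the phase labels, and the timing values are hypothetical and serve only to illustrate the arithmetic; how phases are actually detected and selected is not shown.

```python
from collections import Counter


def phase_weights(phase_sequence):
    """Weight of a phase = number of times it occurs in the logical trace."""
    return Counter(phase_sequence)


def predict_execution_time(phase_times, weights):
    """Predicted run time = sum over signature phases of time * weight.

    phase_times: measured time (seconds) of each phase in the signature,
                 obtained by running the signature on the target system.
    weights:     occurrence count of each phase in the full application.
    """
    return sum(t * weights[p] for p, t in phase_times.items())


# Illustrative values only: phase "A" occurs three times, phase "B" once.
weights = phase_weights(["A", "B", "A", "A"])
measured = {"A": 0.5, "B": 2.0}
predicted = predict_execution_time(measured, weights)  # 0.5*3 + 2.0*1
```

Only the phases present in the signature are measured, so the estimate is only as good as the share of the application's run time that those phases cover.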