Conference Paper

A New Vision for Coarray Fortran

... Each portion of the global address space has an affinity to a certain process or thread. A number of PGAS programming systems have been implemented in the past, including Unified Parallel C (UPC) [4], Co-Array Fortran (CAF) [21], [15], Chapel [5], X10 [27], STAPL [3] and Titanium [29]. Typically, these approaches rely on one-sided communication substrates such as GASNet [1] to perform inter-node communication. ...
... The most popular PGAS languages are UPC [4], CAF [21], [15], Chapel [5], X10 [27], STAPL [3] and Titanium [29]. UPC is one of the first and one of the few fully implemented PGAS languages and is an extension of the C programming language. ...
... It converts Fortran 95 into a robust parallel language with a few rules addressing two fundamental issues: work distribution and data distribution. By 2005, the Fortran Standards Committee had decided to integrate coarrays into the Fortran 2008 standard. Compared with CAF 1.0, CAF 2.0 [15] presents a richer set of coarray-based language extensions. ...
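As an illustration of those two rules, here is a minimal coarray sketch (assuming any Fortran 2008 compiler with coarray support): each image does its own local work, and data on another image is reached simply by adding a coindex in square brackets.

program caf_minimal
  implicit none
  integer :: a(4)[*]         ! coarray: every image holds its own a(4)
  integer :: me, np

  me = this_image()          ! work distribution: SPMD, one share of work per image
  np = num_images()
  a  = me                    ! purely local assignment

  sync all                   ! make all images' writes visible

  ! data distribution: image 1 reads image np's copy through a coindex
  if (me == 1) print *, 'image', np, 'holds', a(:)[np]
end program caf_minimal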
Conference Paper
Full-text available
A Partitioned Global Address Space (PGAS) approach treats a distributed system as if its memory were shared at a global level. Given such a global view of memory, the user may program applications much as on shared-memory systems. This greatly simplifies the development of parallel applications, because no explicit communication has to be specified in the program for data exchange between different computing nodes. In this paper we present DART, a runtime environment which implements the PGAS paradigm on large-scale high-performance computing clusters. A specific feature of our implementation is the use of one-sided communication of the Message Passing Interface (MPI) version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated the performance of the implementation with several low-level kernels in order to determine overheads and limitations in comparison to the underlying MPI-3.
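The substrate DART builds on can be sketched directly in Fortran. The fragment below is a minimal, self-contained example of MPI-3 one-sided communication (active-target fences around an MPI_Put), shown as an illustration of the interface, not DART's actual implementation.

program mpi3_put_sketch
  use mpi_f08
  implicit none
  integer, parameter :: intbytes = 4
  integer :: rank, nproc, right, buf, recvslot
  integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
  type(MPI_Win) :: win

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc)

  winsize = intbytes                      ! expose one integer per process
  call MPI_Win_create(recvslot, winsize, intbytes, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, win)

  buf   = rank
  right = mod(rank + 1, nproc)
  disp  = 0

  call MPI_Win_fence(0, win)              ! open the access epoch
  call MPI_Put(buf, 1, MPI_INTEGER, right, disp, 1, MPI_INTEGER, win)
  call MPI_Win_fence(0, win)              ! close it: puts are now visible

  print *, 'rank', rank, 'received', recvslot
  call MPI_Win_free(win)
  call MPI_Finalize()
end program mpi3_put_sketch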
... In Numrich and Reid's 1998 proposal [17], Coarray Fortran is a simple set of extensions to Fortran 95, principal among which is support for shared data known as coarrays. Responding to shortcomings in the Fortran Standards Committee's addition of coarrays to the Fortran 2008 standard, we at Rice envisioned an extensive update which has come to be known as Coarray Fortran 2.0 [15]. In this paper, we chronicle the evolution of Coarray Fortran 2.0 as it gains support for asynchronous point-to-point and collective operations. ...
... Coarray Fortran 2.0 (CAF 2.0) is an extension of Coarray Fortran that provides better expressiveness through features such as process subsets, topologies, and fine-grain synchronization. We outlined the motivation and vision for CAF 2.0 in a 2009 paper [15]. In this paper, we augment CAF 2.0 with additional primitives that enable users to overlap communication latency with computation. ...
... To make this paper self-contained, we briefly summarize the overall structure of Coarray Fortran 2.0 (CAF 2.0) described in earlier work [15]. CAF 2.0 differs significantly from the coarray-based extensions being incorporated into Fortran as part of the 2008 standard. ...
Article
Full-text available
In Numrich and Reid's 1998 proposal [17], Coarray Fortran is a simple set of extensions to Fortran 95, principal among which is support for shared data known as coarrays. Responding to shortcomings in the Fortran Standards Committee's addition of coarrays to the Fortran 2008 standard, we at Rice envisioned an extensive update which has come to be known as Coarray Fortran 2.0 [15]. In this paper, we chronicle the evolution of Coarray Fortran 2.0 as it gains support for asynchronous point-to-point and collective operations. We outline how these operations are implemented and describe code fragments from several benchmark programs to show how we use these operations to hide latency by overlapping communication and computation.
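The asynchronous primitives chronicled here were CAF 2.0 research features; Fortran 2018 later standardized events, which are close in spirit. The following is a hedged sketch of the latency-hiding pattern using standard Fortran 2018 events, not CAF 2.0's own API: the producer posts an event after a one-sided put, and the consumer overlaps independent computation before waiting.

program overlap_sketch
  use, intrinsic :: iso_fortran_env, only: event_type
  implicit none
  real :: halo(100)[*], work(1000)
  type(event_type) :: ready[*]
  integer :: me

  me = this_image()
  call random_number(work)
  if (num_images() < 2) stop 'needs at least two images'

  if (me == 2) then
     halo(:)[1] = work(1:100)     ! one-sided put to image 1
     event post (ready[1])        ! signal image 1 that the halo has landed
  else if (me == 1) then
     work = work * 2.0            ! overlap: local work that does not need the halo
     event wait (ready)           ! block only once the halo is actually required
     work(1:100) = work(1:100) + halo
  end if
end program overlap_sketch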
... Unified Parallel C (UPC) [13], Coarray Fortran (CAF) [34], X10 [17], Chapel [14], and CAF 2.0 [28] are examples of language-based PGAS models, while OpenSHMEM [15] and Global Arrays [32] are examples of library-based PGAS models. Library-based models assume the use of a base language (C/C++ and Fortran being typical) which does not require or generally benefit from any special compiler support. ...
... Hence, a hybrid application using PGAS and MPI is not uncommon. Yang et al. [42] have shown the importance of interoperability between CAF and MPI using their CAF 2.0 [28] implementation over MPI-3.0. ...
Thesis
Full-text available
Languages and libraries based on the Partitioned Global Address Space (PGAS) programming model have emerged in recent years with a focus on addressing the programming challenges for scalable parallel systems. Among these, Coarray Fortran (CAF) is unique in that it has been incorporated into an existing standard (Fortran 2008), and therefore it is of particular importance that implementations supporting it are both portable and deliver sufficient levels of performance. OpenSHMEM is a library which is the culmination of a standardization effort among many implementers and users of SHMEM, and it provides a means to develop light-weight, portable, scalable applications based on the PGAS programming model. As such, we propose here that OpenSHMEM is well situated to serve as a runtime substrate for other PGAS programming models. In this work, we demonstrate how OpenSHMEM can be exploited as a runtime layer upon which CAF may be implemented. Specifically, we re-targeted the CAF implementation provided in the OpenUH compiler to OpenSHMEM, and show how parallel language features provided by CAF may be directly mapped to OpenSHMEM, including allocation of remotely accessible objects, one-sided communication, and various types of synchronization. Moreover, we present and evaluate various algorithms we developed for implementing remote access of non-contiguous array sections, and acquisition and release of remote locks using the OpenSHMEM interface. Through this work, we argue for specific features like block-wise strided data transfer, multi-dimensional strided data transfer, and atomic memory operations which may be added to OpenSHMEM to better support idiomatic usage of CAF.
... Several languages and libraries provide a PGAS programming model, some with extended capabilities such as asynchronous spawning of computations at remote loca-tions in addition to facilities for expressing data movement and synchronization. Unified Parallel C (UPC) [2], Coarray Fortran (CAF) [3], X10 [4], Chapel [5], and CAF 2.0 [6] are examples of language-based PGAS models, while OpenSHMEM [7] and Global Arrays [8] are examples of library-based PGAS models. Library-based models assume the use of a base language (C/C++ and Fortran being typical) which does not require or generally benefit from any special compiler support. ...
... Hence, a hybrid application using PGAS and MPI is not uncommon. Yang et al. [24] have shown the importance of interoperability between CAF and MPI using their CAF 2.0 [6] implementation over MPI-3.0. ...
Conference Paper
Full-text available
Languages and libraries based on the Partitioned Global Address Space (PGAS) programming model have emerged in recent years with a focus on addressing the programming challenges for scalable parallel systems. Among these, Coarray Fortran (CAF) is unique in that it has been incorporated into an existing standard (Fortran 2008), and therefore it is of particular importance that implementations supporting it are both portable and deliver sufficient levels of performance. OpenSHMEM is a library which is the culmination of a standardization effort among many implementers and users of SHMEM, and it provides a means to develop light-weight, portable, scalable applications based on the PGAS programming model. As such, we propose here that OpenSHMEM is well situated to serve as a runtime substrate for CAF implementations. In this paper, we demonstrate how OpenSHMEM can be exploited as a runtime layer upon which CAF may be implemented. Specifically, we re-targeted the CAF implementation provided in the OpenUH compiler to OpenSHMEM, and show how parallel language features provided by CAF may be directly mapped to OpenSHMEM, including allocation of remotely accessible objects, one-sided communication, and various types of synchronization. Moreover, we present and evaluate various algorithms we developed for implementing remote access of non-contiguous array sections and acquisition and release of remote locks using the OpenSHMEM interface.
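To picture the mapping the abstract describes, here is a toy sketch of how a coarray assignment such as a(:)[p] = b(:) could land on OpenSHMEM's (since deprecated) Fortran bindings. This is our illustration of the idea, not the OpenUH runtime's actual code; the symmetric-data and binding details are simplifying assumptions.

program caf_on_shmem_sketch
  implicit none
  include 'shmem.fh'
  integer :: shmem_my_pe, shmem_n_pes   ! Fortran SHMEM query functions
  integer, save :: a(4)                 ! SAVE'd data is symmetric (remotely accessible)
  integer :: b(4), me, np, p

  call shmem_init()
  me = shmem_my_pe()                    ! PEs are 0-based; CAF images are 1-based
  np = shmem_n_pes()

  b = me
  p = mod(me + 1, np)                   ! right neighbour

  call shmem_integer_put(a, b, 4, p)    ! plays the coarray assignment a(:)[p+1] = b(:)
  call shmem_barrier_all()              ! plays the role of SYNC ALL

  print *, 'PE', me, 'now holds a =', a
end program caf_on_shmem_sketch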
... language that supports the SPMD programming style. There have been a few academic and commercial CAF compilers available to date [6,3,2,16]. In CAF, all data and computations are local and must be explicitly decomposed across the executing processes or threads. ...
... While our compiler translation is not yet guided by analysis or augmented with compiler optimizations, we plan to exploit the robust optimization framework provided by Open64 to this end. More recently, Rice University has presented a critique of the proposed coarray features in Fortran 2008 [14] and has proposed a new vision in CAF 2.0 [16]. CAF 2.0 offers a number of desirable features for a more expressive parallel programming language, including process subsets, topologies, team-based coarray allocation/deallocation, enhanced synchronization and collectives, and asynchronous communication support. ...
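Several of these features later acquired standardized counterparts. As a sketch of the process-subset idea using Fortran 2018 teams (CAF 2.0's own team syntax differs, so this is an analogue, not the CAF 2.0 API):

program teams_sketch
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: subset
  integer :: color

  ! split the images into two subsets, in the spirit of CAF 2.0 process subsets
  color = merge(1, 2, this_image() <= num_images() / 2)
  form team (color, subset)

  change team (subset)
     ! inside the block, image indices and SYNC ALL are relative to the team
     print *, 'team', color, ': image', this_image(), 'of', num_images()
     sync all
  end team
end program teams_sketch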
Article
Coarray Fortran (CAF) comprises a set of proposed language extensions to Fortran that are expected to be adopted as part of the Fortran 2008 standard. In contrast to prior open-source implementation efforts, our approach is to use a single, unified compiler infrastructure to translate, optimize and generate binaries from CAF codes. In this paper, we describe our compiler and runtime implementation of CAF using an Open64-based compiler infrastructure. We detail the process by which we generate a high-level intermediate representation from the CAF code in our compiler's front-end, how our compiler analyzes and translates this IR to generate a binary which makes use of our runtime system, and how we support the runtime execution model with our runtime library. We have carried out experiments using both ARMCI- and GASNet-based runtime implementations, and we present these results.
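The translation described above can be made concrete with a toy lowering. The caf_put routine below is an invented stand-in for the runtime entry points a CAF compiler emits; a real runtime such as OpenUH's forwards such calls to ARMCI or GASNet.

module toy_caf_runtime
  implicit none
contains
  ! invented stand-in for the runtime's one-sided put entry point
  subroutine caf_put(image, offset_bytes, value)
    integer, intent(in) :: image, offset_bytes
    real,    intent(in) :: value
    ! the real runtime would hand this to ARMCI or GASNet; we just trace it
    print *, 'PUT to image', image, 'at byte offset', offset_bytes, ':', value
  end subroutine caf_put
end module toy_caf_runtime

program lowering_sketch
  use toy_caf_runtime
  implicit none
  integer :: i, p
  real :: x
  i = 3;  p = 2;  x = 1.5
  ! plausible front-end output for the CAF statement  a(i)[p] = x
  ! (assuming 4-byte reals and a descriptor resolved elsewhere)
  call caf_put(p, (i-1)*4, x)
end program lowering_sketch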
... One-sided communication is supported by many PGAS languages [35], [4], [7], [21], [6], [13]. In this work, we use UPC++, a library-based PGAS programming extension to C ++ [35]. ...
... The performance and potential of one-sided communication have been studied by many researchers, although most of them focus on FORTRAN Co-Arrays (CAF) [5], [21] and UPC [31]. In this study we focus on MPI one-sided communication and UPC++. ...
... The related work can be divided into two categories based on the programming models used: CAF and UPC. For CAF, Mellor-Crummey et al. have proposed Co-array Fortran 2.0 [14]. Their CAF 2.0 compiler uses a source-to-source translator to convert CAF code into Fortran 90 (F90) with calls to a low-level one-sided communication library such as GASNet. ...
Article
Full-text available
The Gemini interconnect on the Cray XE6 platform provides for lightweight remote direct memory access (RDMA) between nodes, which is useful for implementing partitioned global address space languages like UPC and Co-Array Fortran. In this paper, we perform a study of Gemini performance using a set of communication microbenchmarks and compare the performance of one-sided communication in PGAS languages with two-sided MPI. Our results demonstrate the performance benefits of the PGAS model on Gemini hardware, showing in what circumstances and by how much one-sided communication outperforms two-sided in terms of messaging rate, aggregate bandwidth, and computation and communication overlap capability. For example, for 8-byte and 2KB messages the one-sided messaging rate is 5 and 10 times greater respectively than the two-sided one. The study also reveals important information about how to optimize one-sided Gemini communication.
... In addition, we compare foMPI with two major HPC PGAS languages: UPC and Fortran 2008 with Coarrays, both specially tuned for Cray systems. We did not evaluate the semantically richer Coarray Fortran 2.0 [22] because no tuned version was available on our system. We execute all benchmarks on the Blue Waters system, using the Cray XE6 nodes only. ...
Conference Paper
Full-text available
Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.
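For readers unfamiliar with the interface being implemented, the sketch below shows the passive-target MPI-3.0 RMA idiom (lock_all, Put, flush) in Fortran. It is a minimal example of the API such protocols sit beneath, not code from the paper, and it assumes the unified memory model that MPI_Win_allocate windows commonly provide.

program passive_rma_sketch
  use mpi_f08
  use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
  implicit none
  integer :: rank, nproc, peer, payload
  integer, pointer :: slot
  type(c_ptr) :: base
  integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
  type(MPI_Win) :: win

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc)

  winsize = 4
  call MPI_Win_allocate(winsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, base, win)
  call c_f_pointer(base, slot)
  slot = -1
  call MPI_Barrier(MPI_COMM_WORLD)       ! everyone initialized

  call MPI_Win_lock_all(0, win)          ! one long passive-target epoch
  peer    = mod(rank + 1, nproc)
  payload = rank
  disp    = 0
  call MPI_Put(payload, 1, MPI_INTEGER, peer, disp, 1, MPI_INTEGER, win)
  call MPI_Win_flush(peer, win)          ! the put is remotely complete here
  call MPI_Win_unlock_all(win)

  call MPI_Barrier(MPI_COMM_WORLD)       ! unified-model assumption: slot readable now
  print *, 'rank', rank, 'slot =', slot
  call MPI_Win_free(win)
  call MPI_Finalize()
end program passive_rma_sketch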
... From 2005 to 2010 the HPCC suite has been used in the HPC Challenge Competition, where participants enter implementations of the benchmarks in various languages to compete for the "most productive" implementation. In previous years, entries written in PGAS and HPCS languages such as Chapel [5], CAF 2.0 [24], UPC, and X10 [10] have been involved in this competition. ...
Article
Full-text available
The Parallel Ocean Program (POP) is a 71,000 line-of-code program written in Fortran and MPI. POP is a component of the Community Earth System Model (CESM), which is a heavily used global climate model. Now that coarrays are part of the Fortran standard, one question raised by POP's developers is whether coarrays could be used to improve POP's performance or reduce its code volume. Although Coarray Fortran (CAF) has been evaluated with smaller benchmarks and with an older version of POP, it has not been evaluated with newer versions of POP or on modern platforms. In this paper, we examine what impact using CAF has on a large climate simulation application by comparing and evaluating variants of the CGPOP miniapp, which serves as a performance proxy of POP.
... P3DFFT is implemented using the Message Passing Interface (MPI) [1] parallel programming model. While several researchers have explored various other programming models [2], [3], MPI has been the dominant programming model for the past couple of decades and has scaled to the largest parallel machines. The MPI Standard defines a set of collective operations to provide abstractions for various group communication patterns. ...
Article
Parallel 3D FFT (P3DFFT) is an important component of many scientific computing applications, spanning fluid dynamics, astrophysics and molecular dynamics. One of the main limiting factors of parallel 3D FFT performance and scalability is the time spent in All-to-all communication. The implementation of P3DFFT uses the Message Passing Interface (MPI) as the parallel programming model. MPI has been the dominant programming model for the past couple of decades, and many scientific applications use it. Of the many communication primitives offered by MPI, collective communications, especially MPI_Alltoall, consume a lot of time for applications using P3DFFT. Hiding the latency of MPI_Alltoall is critical for scaling libraries such as P3DFFT. The newest revision of MPI, MPI-3, is widely expected to provide support for non-blocking collective communication to enable latency hiding. At the same time, popular interconnection networks, such as InfiniBand, have provided support for non-blocking collective communication. For example, the latest ConnectX-2 adapter by Mellanox provides a low-level task-list offload capability. In this paper, we design a non-blocking offloaded collective for All-to-all Personalized Exchange (MPI_Alltoall) using the task-list offload of the ConnectX-2 network adapter. Simultaneously, we re-design the P3DFFT library and a sample application kernel to overlap the MPI_Alltoall operations with application-level computation and demonstrate the benefits of our novel non-blocking All-to-all algorithm. Our experimental evaluation shows that we are able to achieve near-perfect overlap of computation and communication (99%) through the use of the offload mechanism, without any adverse impact on the latency of the MPI_Alltoall operation. We also see an improvement of up to 23% in the overall run-time with the modified P3DFFT kernel when compared to the default blocking version.
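The overlap pattern central to this redesign can be sketched with the standard non-blocking MPI_Ialltoall of MPI-3 (the paper additionally offloads the collective to the NIC, which this sketch does not model):

program ialltoall_overlap
  use mpi_f08
  implicit none
  integer :: rank, nproc
  real, allocatable :: sendbuf(:), recvbuf(:), local(:)
  type(MPI_Request) :: req

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc)

  allocate(sendbuf(nproc), recvbuf(nproc), local(100000))
  sendbuf = real(rank)
  call random_number(local)

  ! start the all-to-all without blocking (the transpose step of a 3D FFT)
  call MPI_Ialltoall(sendbuf, 1, MPI_REAL, recvbuf, 1, MPI_REAL, &
                     MPI_COMM_WORLD, req)

  local = sqrt(local)                 ! independent work overlaps the exchange

  call MPI_Wait(req, MPI_STATUS_IGNORE)
  print *, 'rank', rank, 'exchange done, sum =', sum(recvbuf)
  call MPI_Finalize()
end program ialltoall_overlap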
... The portable Berkeley UPC implementation relies on GASNet [4] for communication, while some vendors directly target their low-level interconnect API. Co-array Fortran [17,15] extends the notion of standard Fortran arrays with a co-index to specify the process holding the array. The DARPA-sponsored HPCS (High Productivity Computing Systems) languages X10 [7], Fortress [1], ...
Conference Paper
Full-text available
DASH is a realization of the PGAS (partitioned global address space) model in the form of a C++ template library. Operator overloading is used to provide global-view PGAS semantics without the need for a custom PGAS (pre-)compiler. The DASH library is implemented on top of our runtime system DART, which provides an abstraction layer on top of existing one-sided communication substrates. DART contains methods to allocate memory in the global address space as well as collective and one-sided communication primitives. To support the development of applications that exploit a hierarchical organization, either on the algorithmic or on the hardware level, DASH features the notion of teams that are arranged in a hierarchy. Based on a team hierarchy, the DASH data structures support locality iterators as a generalization of the conventional local/global distinction found in many PGAS approaches.
... This description is within the context of extensions to Fortran; as shorthand, these extensions are referred to as CAFe, for Coarray Fortran extensions. CAFe is complementary to previous work extending coarray Fortran [9,11]. ...
Article
Full-text available
Emerging hybrid accelerator architectures for high performance computing are often suited for the use of a data-parallel programming model. Unfortunately, programmers of these architectures face a steep learning curve that frequently requires learning a new language (e.g., OpenCL). Furthermore, the distributed (and frequently multi-level) nature of the memory organization of clusters of these machines provides an additional level of complexity. This paper presents preliminary work examining how programming with a local orientation can be employed to provide simpler access to accelerator architectures. A locally-oriented programming model is especially useful for the solution of algorithms requiring the application of a stencil or convolution kernel. In this programming model, a programmer codes the algorithm by modifying only a single array element (called the local element), but has read-only access to a small sub-array surrounding the local element. We demonstrate how a locally-oriented programming model can be adopted as a language extension using source-to-source program transformations.
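The locally-oriented stencil model can be approximated in plain coarray Fortran. The sketch below is our illustration under that assumption; CAFe's actual extensions hide the halo exchange that is explicit here.

program local_stencil_sketch
  implicit none
  integer, parameter :: n = 64
  real :: u(0:n+1)[*]          ! local slab plus one halo cell on each side
  real :: unew(n)
  integer :: me, np, i

  me = this_image()
  np = num_images()
  u  = real(me)

  sync all
  ! halo exchange: pull the neighbours' boundary cells (explicit here, hidden in CAFe)
  if (me > 1)  u(0)   = u(n)[me-1]
  if (me < np) u(n+1) = u(1)[me+1]
  sync all

  ! the local orientation: each element is updated from its small neighbourhood
  do i = 1, n
     unew(i) = (u(i-1) + u(i) + u(i+1)) / 3.0
  end do
  u(1:n) = unew
end program local_stencil_sketch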
... J. Zhang et al. [32] studied the performance of the N-Body problem in UPC. J. Mellor-Crummey et al. [21] examined the performance of the Challenge Benchmark Suite in CAF 2.0. P. Ghosh et al. [16] explored the ordering of one-sided messages to achieve better performance. ...
Conference Paper
Full-text available
Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.
... Meanwhile, data distribution is up to the programmer. After coarrays were finally incorporated into the Fortran standard, Mellor-Crummey et al. [MC+09] proposed further features, e.g., global pointers, safe synchronization, and collective communication. ...
Thesis
Full-text available
The Partitioned Global Address Space (PGAS) programming model represents distributed memory as a global shared memory space. The DASH library implements C++ standard library concepts based on PGAS and provides additional containers and algorithms that are common in High-Performance Computing (HPC) applications. This thesis examines especially sparse matrices and whether state-of-the-art conceptual approaches to their storage can be abstracted to improve the adaptation of domain-specific computational intent to the underlying hardware. Therefore, an intensive systematic analysis of state-of-the-art distribution and storage of sparse data is done. Based on these findings, the given property classification system of general dense data distributions is extended by sparse properties. Afterward, a universal vocabulary of a domain decomposition concept abstraction is developed. This four-layer concept introduces the following steps: (i) Global Canonical Domain, (ii) Formatting, (iii) Decomposition, and (iv) Hardware. In combination with the proposed abstraction, an index set algebra utilizing the resulting sparse data distribution system is presented. This culminates in a newly created Position Concept that is capable of referencing an element in the data domain as a multivariate position. The key feature is the ability to reference this element with arbitrary representations of the element's physical memory location.
... In addition, we compare foMPI with two major HPC PGAS languages: UPC and Fortran 2008 with Coarrays, both specially tuned for Cray systems. We did not evaluate the semantically richer Coarray Fortran 2.0 [22] because no tuned version was available on our system. We execute all benchmarks on the Blue Waters system, using the Cray XE6 nodes only. ...
Preprint
Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.
... PGAS essentially brings the advantages of threading-based programming (such as global visibility and accessibility of data elements) to distributed memory systems and accounts for the performance characteristics of data accesses by making the locality properties explicitly available to the programmer. Traditional PGAS approaches come in the form of a library (e.g., OpenSHMEM [12], Global Arrays [11]) or language extensions (Unified parallel C, UPC [8], Co-Array Fortran, CAF [29], [30]). Those solutions usually don't address hierarchical locality and offer only a two-level (local/remote) distinction of access costs. ...
Conference Paper
We present DASH, a C++ template library that offers distributed data structures and parallel algorithms and implements a compiler-free PGAS (partitioned global address space) approach. DASH offers many productivity and performance features such as global-view data structures, efficient support for the owner-computes model, flexible multidimensional data distribution schemes and inter-operability with STL (standard template library) algorithms. DASH also features a flexible representation of the parallel target machine and allows the exploitation of several hierarchically organized levels of locality through a concept of Teams. We evaluate DASH on a number of benchmark applications and we port a scientific proxy application using the MPI two-sided model to DASH. We find that DASH offers excellent productivity and performance and demonstrate scalability up to 9800 cores.
... For the asynchronous version of barrier, a better name would be asyncBarrier, but that prefix argues for the second option, which would mean that a method named Xxx works synchronously and only methods named asyncXxx work asynchronously. One might also ask whether an asyncBarrier method is needed at all, but from our own experience and from that of other PGAS-based languages [4], such an operation can be useful. ...
Chapter
Full-text available
The article describes research on a new version of the PCJ library (Parallel Computing in Java) for parallel programming in the Java language. The PCJ library was created and developed in the scope of a doctoral dissertation. The library implements solutions developed through research into new methods of parallel programming in the Java language using the PGAS (Partitioned Global Address Space) paradigm. In the dissertation, the library was benchmarked and compared to MPI (Message Passing Interface), the current standard for creating concurrent applications in HPC (high-performance computing). The article describes the development history of the PCJ library up to its current version 5. The library is based on the PGAS programming model, which allows for easy development of parallel applications for HPC or for processing Big Data. The library can be used on multicore computers and computing clusters, i.e., systems that consist of many nodes, each with many cores. PCJ hides the communication within and between nodes. The library can also be used on a single workstation with a Java Virtual Machine (JVM). The current version of the PCJ library uses mechanisms introduced in the newest versions of the Java language, such as lambda expressions and streams, and introduces new possibilities for transferring data and creating distributed objects.
... The second commonplace technique to improve upon the local-by-default situation is to move toward a global-bydefault model, at least for certain classes of objects. PGAS models, now widely available from Fortran Co-Arrays [47], Chapel [14], Julia [38], and many other languages, provide some differentiation between local and global objects but allow global access without explicit regard for locality considerations. The compiler and/or runtime system might optimize layout and placement of objects based on their global access pattern, but the degree to which this optimization can be usefully done is still an open question. ...
Article
Full-text available
The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
... However, coarrays are often not yet, or only partially, supported by other Fortran compilers [CS14]. Additionally, scientists from Rice University in Houston, Texas developed Coarray Fortran 2.0 (CAF 2.0) [MCASJ09]. CAF 2.0 builds on the coarray programming model but includes additional features, implemented through a source-to-source compiler and runtime library. More details on CAF's features are discussed in Section 4.3. ...
Thesis
Machine learning is an approach to devising algorithms that compute an output without a given rule set, based instead on a self-learning concept. This approach is of great importance for several fields of application in science and industry where traditional programming methods are not sufficient. In neural networks, a popular subclass of machine learning algorithms, previous experience is commonly used to train the network and produce good outputs for newly introduced inputs. By increasing the size of the network, more complex problems can be solved, which in turn requires a huge amount of training data. Increasing the complexity also leads to higher computational demand and storage requirements and to the need for parallelization. Several parallelization approaches for neural networks have already been considered. Most approaches use special-purpose hardware, whilst other work focuses on using standard hardware. Often these approaches target the problem by parallelizing the training data. In this work a new parallelization method named poadSGD is proposed for the parallelization of fully-connected, large-scale feedforward networks on a compute cluster with standard hardware. poadSGD is based on the stochastic gradient descent algorithm. A block-wise distribution of the network's layers to groups of processes and a pipelining scheme for batches of the training samples are used. The network is updated asynchronously without interrupting ongoing computations of subsequent batches. For this task a one-sided communication scheme is used. A main algorithmic part of the batch-wise pipelined version consists of matrix multiplications which occur in a special distributed setup, where each matrix is held by a different process group. GASPI, a parallel programming model from the field of Partitioned Global Address Space (PGAS) models, is introduced and compared to other models from this class. As it mainly relies on one-sided and asynchronous communication, it is a perfect candidate for the asynchronous update task in the poadSGD algorithm. Therefore, the matrix multiplication is also implemented based on GASPI. In order to efficiently handle upcoming synchronizations within the process groups and achieve a good workload distribution, a two-dimensional block-cyclic data distribution is applied to the matrices. Based on this distribution, the multiplication algorithm is computed by diagonally iterating over the sub-blocks of the resulting matrix and computing the sub-blocks in subgroups of the processes. The sub-blocks are computed by sharing the workload between the process groups and communicating mostly in pairs or in subgroups. The communication in pairs is set up to be overlapped by other ongoing computations. The implementation poses a special challenge, since the asynchronous communication routines must be handled with care as to which processor is working with which data at what point in time, in order to prevent unintentional dual use of data. The theoretical analysis shows the matrix multiplication to be superior to a naive implementation when the dimension of the sub-blocks of the matrices exceeds 382. The performance achieved in the test runs did not meet the expectations predicted by the theoretical analysis. The algorithm is executed on up to 512 cores and for matrices up to a size of 131,072 x 131,072.
The implementation using the GASPI API was found not to be straightforward, but to provide good potential for overlapping communication with computation whenever the data dependencies of an application allow for it. The matrix multiplication was successfully implemented and can be used within an implementation of the poadSGD method that is yet to come. The poadSGD method seems very promising, especially as nowadays, with larger amounts of data and the increased complexity of applications, approaches to the parallelization of neural networks are increasingly of interest.
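The two-dimensional block-cyclic distribution mentioned above admits a one-line owner computation: assuming a P x Q process grid (the values below are illustrative), global block (ib, jb) is owned by process (mod(ib,P), mod(jb,Q)). A minimal sketch:

program block_cyclic_owner
  implicit none
  integer, parameter :: P = 4, Q = 2     ! illustrative process-grid dimensions
  integer :: ib, jb
  do jb = 0, 3
     do ib = 0, 5
        write (*,'(a,i2,a,i2,a,i2,a,i2,a)') 'block (', ib, ',', jb, &
             ') -> process (', mod(ib, P), ',', mod(jb, Q), ')'
     end do
  end do
end program block_cyclic_owner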
... They use the concept of a globally visible address space with explicit handling of addresses that are non-local. Ongoing efforts in the Fortran community ensure continuous support for CAF functionality, as exemplified by CAF 2.0 [31,48,49,54] and the incorporation of some CAF features into the Fortran 2008 standard [55]. DARPA's HPCS program introduced three more languages into the space: Fortress [2,33], X10 [59], and Chapel (Cascade High Productivity Language) [12,13]. ...
Article
Full-text available
The objective of the PULSAR project was to design a programming model suitable for large-scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.
... CAF 1.0 was included in the Fortran 2008 standard [76], with the addition of functions enabling, among other things, global communication. A version 2.0 is being developed separately by Rice University [67], [7]. CAF uses an SPMD programming paradigm. ...
Thesis
Computing grids are distributed architectures commonly used for executing scientific or simulation programs. Programmers must therefore acquire new skills to take full advantage of all the available resources: they must learn to write parallel code and, possibly, to manage a distributed memory. The ambition of this thesis is to propose a compilation chain that automatically generates distributed task-parallel code from sequential code. For this, the source-to-source compiler PIPS is used. Our approach has two major strengths: 1) a succession of simple, modular transformations is applied, allowing the user to understand each transformation, to modify it, to reuse it in other contexts, and to add new ones; 2) a proof of correctness is given for each transformation, guaranteeing that the generated code is equivalent to the initial code. This automatic generation of distributed task-parallel code also offers a simple programming interface: a parallel version of the code is generated automatically from annotated sequential code. Experiments carried out on two parallel machines, on Polybench kernels, show an average speedup that is linear or even super-linear on small examples, and an average speedup equal to half the number of processes on large examples.
...
• Data transfer operations: Put, Get, and Accumulate
• Synchronization operations: atomic read-modify-write and lock/mutex
• Utility operations: memory allocation/deallocation, local/global fence, and error handling
The Berkeley Global Address Space Networking (GASNet) library [Bon08] is designed as a compiler runtime library for the PGAS languages UPC and Titanium. It also provides the foundation for Rice University's Coarray Fortran 2.0, which aims to correct a number of identified shortcomings [MCASJ09]. The GASNet library is structured as a core API and an extended API. ...
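These operation categories have direct counterparts in MPI-3 RMA as well. As a neutral illustration (deliberately not GASNet's API), here is an atomic fetch-and-add ticket counter in Fortran using MPI_Fetch_and_op:

program fetch_add_sketch
  use mpi_f08
  use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
  implicit none
  integer :: rank, one, ticket
  integer, pointer :: counter
  type(c_ptr) :: base
  integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
  type(MPI_Win) :: win

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)

  winsize = 4
  call MPI_Win_allocate(winsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, base, win)
  call c_f_pointer(base, counter)
  counter = 0
  call MPI_Barrier(MPI_COMM_WORLD)

  one = 1;  disp = 0
  call MPI_Win_lock_all(0, win)
  ! atomic read-modify-write on rank 0's counter
  call MPI_Fetch_and_op(one, ticket, MPI_INTEGER, 0, disp, MPI_SUM, win)
  call MPI_Win_flush(0, win)
  call MPI_Win_unlock_all(win)

  print *, 'rank', rank, 'drew ticket', ticket
  call MPI_Win_free(win)
  call MPI_Finalize()
end program fetch_add_sketch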
Thesis
The growing need for computing power is more and more challenging, especially in the embedded-systems world with autonomous cars, drones, and smartphones. New highly parallel and heterogeneous processors are emerging to answer this challenge. They operate in constrained environments with real-time requirements, reduced power consumption, and safety demands. Programming these new chips is a time-consuming and challenging task, leading to huge software development costs. The Kalray MPPA® processor is a competitive example of low-power supercomputing on a single chip. It integrates up to 288 VLIW cores grouped in 18 clusters, each fitted with shared local memory. These clusters are interconnected with a high-bandwidth network-on-chip, and DMA engines are used to communicate. This processor is used in this thesis for experimental results. We propose the AOS library, enabling high-performance communication and synchronization of distributed local memories on clustered manycores. AOS achieves 70% of the peak hardware throughput for transfers larger than 8 KB. We propose tools for the implementation of static and dynamic dataflow programs based on AOS to accelerate parallel application development on clustered manycores. We propose an implementation of OpenVX for clustered manycores on top of AOS. OpenVX is a dataflow-based standard for the development of computer vision and neural network computing. The proposed OpenVX implementation includes automatic optimizations such as data prefetching to overlap communication and computation, and kernel fusion to avoid the main-memory bandwidth bottleneck. Results show super-linear speedups.
... SHMEM [17] falls into the same category; however, strides cannot have multiple levels, so sending a 3D array slice can require multiple calls. Modern HPC languages such as CAF [18], UPC [19], Chapel and X10 support array slicing within the language and allow assigning slices to remote arrays, but they do not support transfers based on index lists directly (such support might not be needed if the compiler can aggregate multiple small sends at runtime). ...
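At the language level, a CAF slice transfer is a single statement. A small sketch (the strided row of a remote matrix fetched in one assignment, which the runtime is free to aggregate):

program strided_slice_sketch
  implicit none
  integer, parameter :: n = 8
  real :: a(n,n)[*], row(n)
  integer :: me

  me = this_image()
  a  = real(me)
  sync all

  if (me == 1 .and. num_images() > 1) then
     ! one statement moves a non-contiguous slice: row 3 of image 2's matrix
     ! (strided in column-major storage); the runtime may aggregate the pieces
     row = a(3, :)[2]
     print *, 'fetched remote row:', row
  end if
  sync all
end program strided_slice_sketch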
Preprint
Full-text available
Applications often communicate data that is non-contiguous in the send- or the receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non-contiguous transfers are well supported in HPC (e.g., MPI derived datatypes), they can still be up to 5x slower than contiguous transfers of the same size. As we enter the era of network acceleration, we need to investigate which tasks to offload to the NIC: in this work we argue that non-contiguous memory transfers can be transparently network-accelerated, truly achieving zero-copy communications. We implement and extend sPIN, a packet streaming processor, within a Portals 4 NIC SST model, and evaluate strategies for NIC-offloaded processing of MPI datatypes, ranging from datatype-specific handlers to general solutions for any MPI datatype. We demonstrate up to 10x speedup in the unpack throughput of real applications, demonstrating that non-contiguous memory transfers are a first-class candidate for network acceleration.
... SHMEM [17] falls into the same category; however, strides cannot have multiple levels, so sending a 3D array slice can require multiple calls. Modern HPC languages such as CAF [18], UPC [19], Chapel and X10 support array slicing within the language and allow assigning slices to remote arrays, but they do not support transfers based on index lists directly (such support might not be needed if the compiler can aggregate multiple small sends at runtime). ...
Conference Paper
Full-text available
Applications often communicate data that is non-contiguous in the send- or the receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non-contiguous transfers are well supported in HPC (e.g., MPI derived datatypes), they can still be up to 5x slower than contiguous transfers of the same size. As we enter the era of network acceleration, we need to investigate which tasks to offload to the NIC: In this work we argue that non-contiguous memory transfers can be transparently network-accelerated, truly achieving zero-copy communications. We implement and extend sPIN, a packet streaming processor, within a Portals 4 NIC SST model, and evaluate strategies for NIC-offloaded processing of MPI datatypes, ranging from datatype-specific handlers to general solutions for any MPI datatype. We demonstrate up to 8x speedup in the unpack throughput of real applications, demonstrating that non-contiguous memory transfers are a first-class candidate for network acceleration.
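The MPI derived datatypes mentioned above describe a non-contiguous buffer once, at commit time. A minimal Fortran sketch, assuming two ranks: one strided matrix row (rows are non-contiguous in Fortran's column-major storage) is sent with an MPI_Type_vector.

program row_datatype_sketch
  use mpi_f08
  implicit none
  integer, parameter :: n = 6
  real :: a(n,n), row(n)
  type(MPI_Datatype) :: rowtype
  integer :: rank, nproc

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc)
  if (nproc < 2) error stop 'run with at least two ranks'

  ! n blocks of 1 real with stride n: one matrix row in column-major storage
  call MPI_Type_vector(n, 1, n, MPI_REAL, rowtype)
  call MPI_Type_commit(rowtype)

  if (rank == 0) then
     a = 1.0
     call MPI_Send(a(3,1), 1, rowtype, 1, 0, MPI_COMM_WORLD)   ! send row 3
  else if (rank == 1) then
     call MPI_Recv(row, n, MPI_REAL, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE)
     print *, 'row received:', row
  end if

  call MPI_Type_free(rowtype)
  call MPI_Finalize()
end program row_datatype_sketch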
Article
Full-text available
Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.
Thesis
This thesis deals with a software library named IOFWD, which intercepts data destined for files and forwards it to specialized I/O servers. Various environments are examined, e.g., simulations with MPI and PGAS environments such as UPC.
Conference Paper
Full-text available
In the era of mobile communication via smartphones, instant messengers, or text messages, reading the affective state of one's interlocutor is a considerable challenge, and failures cause misunderstandings, misinterpretation of the speaker's affect, or a lack of comprehension. The emotions a person feels in this type of social interaction manifest themselves through the body in the form of gestures, facial expressions, micro-expressions, and muscle tension. These signals not only serve as a basic source of information about an individual's affective state, but also facilitate, or indeed enable, interpersonal communication. In our research, we looked for ways of encoding emotions in mobile communication at the level of gestures elicited while perceiving affective stimuli. To record affective signs, a smartphone touchscreen with specially prepared software was used, measuring the key components of gestures performed on the touchscreen as a motor response to an affective stimulus of a given arousal and valence. The studies showed significant changes in the initial pressure force as a function of the perceived emotion, suggesting that higher arousal may result in a stronger tendency to refrain from performing a given gesture. It was also shown that the pleasantness or unpleasantness of the perceived emotion significantly changes the trajectory of the gestures in the vertical direction: patterns for negative emotions were recorded in the upper parts of the screen, while response patterns for positive emotions were observed with movements in the lower parts of the screen.
Article
Multi-core processors are now considered the only feasible alternative to large single-core processors, which have become limited by technological aspects such as power consumption and heat dissipation. However, due to their inherently parallel structure and their diversity, multi-cores are difficult to program. There is a variety of approaches to simplify multi-core programming, but most of them solve only parts of the problem, leaving the rest as (unrealistic) assumptions. This thesis proposes a unitary framework (called MAP) for effective programming of multi-core processors, filling a gap in the landscape of multi-core programming models. The framework is designed to assist the programmer in application design, implementation, optimization, and performance analysis. MAP is built using the expertise and guidelines gathered while programming three types of multi-core processors for three different classes of applications. Thus, MAP has several stages: application design, modeling, prototyping, and tuning, as well as performance checkpoints and a performance-guided feedback loop. Overall, MAP is a viable application-centric approach to programming multi-core processors. However, part of the tool support is lacking, and more has to be done, as future work, to replace some of the phases which are now carried out by hand with (semi-)automated tools.
Conference Paper
In the quest to build exascale supercomputers, designers are increasing the number of hierarchical levels that exist among system components. Software developed for these systems must account for the various hierarchies to achieve maximum efficiency. The first step in this work is to identify groups of processes that share common resources. We develop, analyze, and test several algorithms that can split millions of processes into groups based on arbitrary, user-defined data. We find that bitonic sort and our new hash-based algorithm best suit the task.
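The grouping problem described here is what MPI exposes through communicator splitting. As an illustration of the end result (the paper's contribution is the scalable sorting and hashing beneath such operations, not this call itself), the sketch below groups processes by shared-memory node:

program node_groups_sketch
  use mpi_f08
  implicit none
  type(MPI_Comm) :: node_comm
  integer :: world_rank, node_rank

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank)

  ! group processes that share a resource: here, a shared-memory node
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, node_comm)
  call MPI_Comm_rank(node_comm, node_rank)

  print *, 'world rank', world_rank, 'is rank', node_rank, 'within its node'
  call MPI_Comm_free(node_comm)
  call MPI_Finalize()
end program node_groups_sketch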