Conference Paper

Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond

Abstract

With the appearance of the heterogeneous platform OpenPower, many-core accelerator devices have been coupled with Power host processors for the first time. Towards utilizing their full potential, it is worth investigating performance-portable algorithms that allow choosing the best-fitting hardware for each domain-specific compute task. Suiting even the high level of parallelism on modern GPGPUs, our presented approach relies heavily on abstract meta-programming techniques, which are essential for focusing on fine-grained tuning rather than code porting. With this in mind, the CUDA-based open-source plasma simulation code PIConGPU is currently being abstracted to support the heterogeneous OpenPower platform using our fast porting interface cupla, which wraps the abstract parallel C++11 kernel acceleration library Alpaka. We demonstrate how PIConGPU can benefit from the tunable kernel execution strategies of the Alpaka library, achieving portability and performance with single-source kernels on conventional CPUs, Power8 CPUs and NVIDIA GPUs.
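
To make the single-source idea concrete, the toy sketch below mimics the pattern the abstract describes: the kernel is written once as a functor templated on an accelerator type, and a separate backend decides how the index space is traversed. This is an illustrative analogue only, not the Alpaka or cupla API; all names in it are invented for the example.

```cpp
// Toy analogue of a single-source kernel: the functor never mentions a
// concrete backend, so the same code could be handed to a serial, threaded
// or GPU executor. Not Alpaka/cupla code -- a minimal sketch of the pattern.
#include <cstddef>
#include <iostream>
#include <vector>

struct SaxpyKernel
{
    // One logical "thread": compute a single element, whatever the backend.
    template<typename Acc>
    void operator()(Acc const&, std::size_t i,
                    float a, float const* x, float* y) const
    {
        y[i] = a * x[i] + y[i];
    }
};

// Minimal serial "accelerator": runs the kernel for every index on the host.
struct CpuSerial
{
    template<typename Kernel, typename... Args>
    void run(std::size_t n, Kernel const& kernel, Args... args) const
    {
        for(std::size_t i = 0; i < n; ++i)
            kernel(*this, i, args...);
    }
};

int main()
{
    std::size_t const n = 8;
    std::vector<float> x(n, 2.0f), y(n, 1.0f);

    // Swapping the executor (e.g. for a threaded or GPU backend) would not
    // change the kernel source at all.
    CpuSerial{}.run(n, SaxpyKernel{}, 3.0f, x.data(), y.data());

    std::cout << y[0] << "\n"; // prints 7
}
```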

... PIConGPU is an electro-magnetic PIC code [5,6] implemented via abstract, performance portable C++11 kernels on manycore hardware utilizing the Alpaka library [3,4]. Its applications span from general plasma physics and laser-matter interaction to laser-plasma based particle accelerator research. ...
... We present the consequences of near-perfect weak-scaling of such a code in terms of I/O demands from an application perspective based on production runs using the particle-in-cell (PIC) code PIConGPU [1,2]. PIConGPU demonstrates a typical use case in which a PFlops/s-scale, performance portable simulation [3,4] leads automatically to PByte-scale output even for single runs. ...
Article
Full-text available
We implement and benchmark parallel I/O methods for the fully-manycore driven particle-in-cell code PIConGPU. Identifying throughput and overall I/O size as a major challenge for applications on today's and future HPC systems, we present a scaling law characterizing performance bottlenecks in state-of-the-art approaches for data reduction. Consequently, we propose, implement and verify multi-threaded data-transformations for the I/O library ADIOS as a feasible way to trade underutilized host-side compute potential on heterogeneous systems for reduced I/O latency.
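
The underlying idea, spending otherwise idle host threads on a data transformation so that less data reaches the I/O layer, can be sketched without the actual I/O library. The snippet below is a generic illustration using standard C++ futures and a placeholder transform; it does not use the ADIOS API or the transformations evaluated in the paper.

```cpp
// Generic sketch: transform chunks of a buffer concurrently on host threads,
// then hand the (smaller) results to the I/O layer. The "transform" is a
// placeholder, not one of the ADIOS data transformations from the paper.
#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

// Placeholder reduction: keep every second sample (illustrative only).
std::vector<float> transform_chunk(std::vector<float> const& chunk)
{
    std::vector<float> out;
    out.reserve(chunk.size() / 2);
    for(std::size_t i = 0; i < chunk.size(); i += 2)
        out.push_back(chunk[i]);
    return out;
}

int main()
{
    std::vector<float> data(1 << 20, 1.0f);
    unsigned const nThreads = 4;                           // size divides evenly here;
    std::size_t const chunkSize = data.size() / nThreads;  // remainder handling omitted

    // Transform chunks concurrently on the host.
    std::vector<std::future<std::vector<float>>> jobs;
    for(unsigned t = 0; t < nThreads; ++t)
    {
        std::vector<float> chunk(data.begin() + t * chunkSize,
                                 data.begin() + (t + 1) * chunkSize);
        jobs.push_back(std::async(std::launch::async, transform_chunk,
                                  std::move(chunk)));
    }

    // Collect the reduced buffers; a real code would now pass them to the I/O library.
    std::size_t reducedBytes = 0;
    for(auto& j : jobs)
        reducedBytes += j.get().size() * sizeof(float);
    std::cout << "bytes handed to I/O: " << reducedBytes << "\n";
}
```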
... We have developed Alpaka [28] out of our own need to program highly efficient algorithms for simulations [27] and data analysis on modern hardware in a portable manner. The aim of our approach is to have a single C++ source code in which we can express all levels of parallelism available on modern compute hardware, using a parallel redundant hierarchy model similar to that found in CUDA or OpenCL. ...
... Our open-source projects PIConGPU [3,2] and HaseOnGPU [5] both use Alpaka for the kernel abstraction on various many-core hardware [27,28], but different libraries for the topics not handled by Alpaka, such as Graybat [26] for network communication, mallocMC for memory management, or libPMacc for containers and asynchronous event handling. Alpaka is not meant as a fully grown solution for developing or porting whole HPC applications, but as a single-purpose library that can easily be included into the individual software of an existing HPC project. ...
Conference Paper
Full-text available
We present an analysis on optimizing performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm in order to show that compilers can optimize Alpaka code effectively when tuning key parameters of the algorithm. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to prove that Alpaka allows for platform-specific tuning with a single source code. In addition we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code for bleeding edge architectures such as Nvidia’s Tesla P100, Intel’s Knights Landing (KNL) and Haswell architecture as well as IBM’s Power8 system. On some of these we are able to reach almost 50% of the peak floating point operation performance using the aforementioned means. When adding compiler-specific #pragmas we are able to reach 5 TFLOPS/s on a P100 and over 1 TFLOPS/s on a KNL system.
Article
Full-text available
Quantitative predictions from synthetic radiation diagnostics often have to consider all accelerated particles. For particle-in-cell (PIC) codes, this not only means including all macro-particles but also taking into account the discrete electron distribution associated with them. This paper presents a general form factor formalism that allows the radiation from this discrete electron distribution to be determined in order to compute the coherent and incoherent radiation self-consistently. Furthermore, we discuss a memory-efficient implementation that allows PIC simulations with billions of macro-particles. The impact on the radiation spectra is demonstrated on a large-scale LWFA simulation.
Article
Full-text available
We present an analysis on optimizing performance of a single C++11 source code using the Alpaka hardware abstraction library. For this we use the general matrix multiplication (GEMM) algorithm in order to show that compilers can optimize Alpaka code effectively when tuning key parameters of the algorithm. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to prove that Alpaka allows for platform-specific tuning with a single source code. In addition we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code for bleeding edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architecture as well as IBM's Power8 system. On some of these we are able to reach almost 50% of the peak floating point operation performance using the aforementioned means. When adding compiler-specific #pragmas we are able to reach 5 TFLOPS/s on a P100 and over 1 TFLOPS/s on a KNL system.
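
The kind of tuning knob this study exposes can be pictured as a compile-time parameter such as a tile size that each platform instantiates differently. The blocked GEMM below is a deliberately simplified stand-in rather than the benchmarked Alpaka kernel, and the tile value of 64 is an arbitrary example.

```cpp
// Simplified illustration of a compile-time tuning parameter: a blocked GEMM
// whose tile size is a template argument, so every platform can pick the value
// that suits its caches and registers. Not the Alpaka kernel from the paper.
#include <algorithm>
#include <cstddef>
#include <vector>

template<std::size_t Tile>
void gemm_blocked(std::size_t n, double const* A, double const* B, double* C)
{
    // C += A * B for square n x n matrices, processed tile by tile.
    for(std::size_t ii = 0; ii < n; ii += Tile)
        for(std::size_t kk = 0; kk < n; kk += Tile)
            for(std::size_t jj = 0; jj < n; jj += Tile)
                for(std::size_t i = ii; i < std::min(ii + Tile, n); ++i)
                    for(std::size_t k = kk; k < std::min(kk + Tile, n); ++k)
                    {
                        double const a = A[i * n + k];
                        for(std::size_t j = jj; j < std::min(jj + Tile, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

int main()
{
    std::size_t const n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    gemm_blocked<64>(n, A.data(), B.data(), C.data()); // <64> is the tuning knob
    return C[0] == static_cast<double>(n) ? 0 : 1;     // all-ones check
}
```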
... Due to the software stack PIConGPU uses, only a few top level changes are made to support running on AMD GPUs via HIP. Figure 2 shows the entire PIConGPU software stack. More information about porting PIConGPU can be found in the 2016 ICHPC paper [25] and the Alpaka code repository [3]. As part of the ongoing CAAR for Frontier effort, we analyze the performance of PIConGPU on OLCF's Summit supercomputer, the second fastest in the world [2] as of the time of writing, and on an early access Frontier Center of Excellence machine. ...
Preprint
Full-text available
Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD architectures (CPU-GPU), which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this paper, we design an instruction roofline model for AMD GPUs using AMD's ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application's performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell (PIC) simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work.
... Within the PIC community, portability is underexplored, and our work seeks to address this weakness. Example projects such as PIConGPU [34], [35] and the AMITIS project [36] demonstrate the applicability of the PIC algorithm to GPU architectures, but do not make a concerted effort to address the portability issues presented by the increasing diversity of modern HPC platforms. ...
Article
Full-text available
VPIC is a general purpose particle-in-cell simulation code for modeling plasma phenomena such as magnetic reconnection, fusion, solar weather, and laser-plasma interaction in three dimensions using large numbers of particles. VPIC's capacity in both fidelity and scale makes it particularly well-suited for plasma research on pre-exascale and exascale platforms. In this paper we demonstrate the unique challenges involved in preparing the VPIC code for operation at exascale, outlining important optimizations to make VPIC efficient on accelerators. Specifically, we show the work undertaken in adapting VPIC to exploit the portability-enabling framework Kokkos and highlight the enhancements to VPIC's modeling capabilities to achieve performance at exascale. We assess the achieved performance-portability trade-off through a suite of studies on nine different varieties of modern pre-exascale hardware. Our performance-portability study includes weak-scaling runs on three of the top ten TOP500 supercomputers, as well as a comparison of low-level system performance of hardware from four different vendors.
... Within the PIC community, portability is underexplored, and our work seeks to address this weakness. Example projects such as PIConGPU [39], [40] and the AMITIS project [41] demonstrate the applicability of the PIC algorithm to GPU architectures, but do not make a concerted effort to address the portability issues presented by the increasing diversity of modern HPC platforms. ...
Preprint
Full-text available
VPIC is a general purpose Particle-in-Cell simulation code for modeling plasma phenomena such as magnetic reconnection, fusion, solar weather, and laser-plasma interaction in three dimensions using large numbers of particles. VPIC's capacity in both fidelity and scale makes it particularly well-suited for plasma research on pre-exascale and exascale platforms. In this paper we demonstrate the unique challenges involved in preparing the VPIC code for operation at exascale, outlining important optimizations to make VPIC efficient on accelerators. Specifically, we show the work undertaken in adapting VPIC to exploit the portability-enabling framework Kokkos and highlight the enhancements to VPIC's modeling capabilities to achieve performance at exascale. We assess the achieved performance-portability trade-off through a suite of studies on nine different varieties of modern pre-exascale hardware. Our performance-portability study includes weak-scaling runs on three of the top ten TOP500 supercomputers, as well as a comparison of low-level system performance of hardware from four different vendors.
... In Zenker et al. [33], the OpenPOWER memory architecture, along with the improved CPU-GPU communications, is used to complement the memory limitations of the GPUs, in a GPGPU environment. The authors focused on the portability of a PIC algorithm when using multiple CPU architectures and NVIDIA GPUs, finally creating an abstraction of the CUDA programming language. ...
Article
Full-text available
Performance, i.e., execution times, is one of the most important features of HPC software, but energy consumption is growing in importance as applications are extended toward exascale. This is the case for HPC software used in weather forecasting, in which every ounce of performance is critical in order to increase the accuracy and precision of its results. In this work, we study the performance-energy balance of an OpenPOWER processor, which is designed for the high workloads typically seen on data servers and in HPC environments. Our results show that the OpenPOWER processor is superior in performance on weather forecast workloads compared to other processors commonly used in HPC, but at the expense of consuming more energy. Furthermore, the highest hyperthreading modes available on OpenPOWER processors do not perform well with HPC workloads and are even detrimental to performance.
... All 4096 GPUs used together reach a peak performance of ∼16.2 PFLOP/s. PIConGPU is memory bound, but still capable of using over 12% of the single-precision peak performance on the Kepler architecture [35]. On the investigated subset of Piz Daint this means ∼1.9 PFLOP/s are actually executed. ...
Article
Full-text available
The computation power of supercomputers grows faster than the bandwidth of their storage and network. Especially applications using hardware accelerators like Nvidia GPUs cannot save enough data to be analyzed in a later step. There is a high risk of losing important scientific information. We introduce the in situ template library ISAAC which enables arbitrary applications like scientific simulations to live-visualize their data without the need of deep copy operations or data transformation, using the very same compute node and hardware accelerator the data is already residing on. Arbitrary meta data can be added to the renderings and user-defined steering commands can be asynchronously sent back to the running application. Using an aggregating server, ISAAC streams the interactive visualization video and enables users to access their applications from everywhere.
Article
Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work.
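
For orientation, an instruction roofline bounds attainable instruction throughput (GIPS) by the lower of the peak issue rate and the product of memory transaction throughput and instruction intensity (instructions per transaction). The helper below evaluates that bound with placeholder numbers; none of the values correspond to the GPUs measured in the study.

```cpp
// Sketch of the instruction roofline bound: a kernel is limited either by the
// peak instruction issue rate or by how many instructions its memory traffic
// can sustain. All numbers below are placeholders, not measured values.
#include <algorithm>
#include <iostream>

// peakGips:  peak instructions per second (GIPS)
// bwGtxn:    sustained memory transactions per second (GTXN/s)
// intensity: instructions executed per memory transaction
double instruction_roofline(double peakGips, double bwGtxn, double intensity)
{
    return std::min(peakGips, bwGtxn * intensity);
}

int main()
{
    double const bound = instruction_roofline(/*peakGips=*/500.0,
                                              /*bwGtxn=*/25.0,
                                              /*intensity=*/8.0);
    std::cout << "attainable GIPS <= " << bound << "\n"; // memory-bound: 200
}
```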
Article
Full-text available
The fully electromagnetic particle-in-cell code WarpX is being developed by a team of the U.S. DOE Exascale Computing Project (with additional non-U.S. collaborators on part of the code) to enable the modeling of chains of tens to hundreds of plasma accelerator stages on exascale supercomputers, for future collider designs. The code combines the latest algorithmic advances (e.g., Lorentz boosted frame and pseudo-spectral Maxwell solvers) with mesh refinement and runs on the latest central processing unit (CPU) and graphics processing unit (GPU) architectures. In this paper, we summarize the strategy that was adopted to port WarpX to GPUs, report on the weak parallel scaling of the pseudo-spectral electromagnetic solver, and then present solutions for decreasing the time spent in data exchanges from guard regions between subdomains. In Sec. IV, we demonstrate the simulations of a chain of three consecutive multi-GeV laser-driven plasma accelerator stages.
Article
Full-text available
This paper reports on an in-depth evaluation of the performance portability frameworks Kokkos and RAJA with respect to their suitability for the implementation of complex particle-in-cell (PIC) simulation codes, extending previous studies based on codes from other domains. Using the example of a particle-in-cell model, we implemented the hotspot of the code in C++ and parallelized it using OpenMP, OpenACC, CUDA, Kokkos, and RAJA, targeting multi-core (CPU) and graphics (GPU) processors. Both Kokkos and RAJA appear mature, are usable for complex codes, and keep their promise to provide performance portability across different architectures. Comparing the obtainable performance on state-of-the-art hardware, but also considering aspects such as code complexity, feature availability, and overall productivity, we finally draw the conclusion that the Kokkos framework would be suited best to tackle the massively parallel implementation of the full PIC model.
Thesis
Full-text available
In this thesis, we are interested in solving the Vlasov–Poisson system of equations (useful in the domain of plasma physics, for example within the ITER project), thanks to classical Particle-in-Cell (PIC) and semi-Lagrangian methods. The main contribution of our thesis is an efficient implementation of the PIC method on multi-core architectures, written in C, called Pic-Vert. Our implementation (a) achieves a close-to-minimal number of memory transfers with the main memory, (b) exploits SIMD instructions for numerical computations, and (c) exhibits a high degree of shared memory parallelism. To put our work in perspective with respect to the state-of-the-art, we propose a metric to compare the efficiency of different PIC implementations when using different multi-core architectures. Our implementation is 3 times faster than other recent implementations on the same architecture (Intel Haswell).
Conference Paper
Full-text available
We implement and benchmark parallel I/O methods for the fully-manycore driven particle-in-cell code PIConGPU. Identifying throughput and overall I/O size as a major challenge for applications on today’s and future HPC systems, we present a scaling law characterizing performance bottlenecks in state-of-the-art approaches for data reduction. Consequently, we propose, implement and verify multi-threaded data-transformations for the I/O library ADIOS as a feasible way to trade underutilized host-side compute potential on heterogeneous systems for reduced I/O latency.
Conference Paper
Full-text available
We present a particle-in-cell simulation of the relativistic Kelvin-Helmholtz Instability (KHI) that for the first time delivers angularly resolved radiation spectra of the particle dynamics during the formation of the KHI. This enables studying the formation of the KHI with unprecedented spatial, angular and spectral resolution. Our results are of great importance for understanding astrophysical jet formation and comparable plasma phenomena by relating the particle motion observed in the KHI to its radiation signature. The innovative methods presented here on the implementation of the particle-in-cell algorithm on graphics processing units can be directly adapted to any many-core parallelization of the particle-mesh method. With these methods we see a peak performance of 7.176 PFLOP/s (double-precision) plus 1.449 PFLOP/s (single-precision), an efficiency of 96% when weakly scaling from 1 to 18432 nodes, an efficiency of 68.92% and a speed up of 794 (ideal: 1152) when strongly scaling from 16 to 18432 nodes.
Article
Full-text available
High-intensity laser plasma-based ion accelerators provide unsurpassed field gradients in the megavolt-per-micrometer range. They represent promising candidates for next-generation applications such as ion beam cancer therapy in compact facilities. The weak scaling of maximum ion energies with the square-root of the laser intensity, established for large sub-picosecond class laser systems, motivates the search for more efficient acceleration processes. Here we demonstrate that for ultrashort (pulse duration ~30 fs) highly relativistic (intensity ~10^21 W cm^-2) laser pulses, the intra-pulse phase of the proton acceleration process becomes relevant, yielding maximum energies of around 20 MeV. Prominent non-target-normal emission of energetic protons, reflecting an engineered asymmetry in the field distribution of promptly accelerated electrons, is used to identify this pre-thermal phase of the acceleration. The relevant timescale reveals the underlying physics leading to the near-linear intensity scaling observed for 100 TW class table-top laser systems.
Article
Full-text available
We provide an overview of the key architectural features of recent microprocessor designs and describe the programming model and abstractions provided by OpenCL, a new parallel programming standard targeting these architectures.
Code
This is the archive containing the software used for evaluations in the publication "Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond" submitted to the International Workshop on OpenPOWER for HPC 2016. The archive has the following content:

PIConGPU Kelvin-Helmholtz simulation code (picongpu-alpaka/):
Remote: https://github.com/psychocoderHPC/picongpu-alpaka.git
Branch: topic-scaling
Commit: 1f004c8e0514ad1649f3958a6184878af6e75150

Alpaka code (alpaka/):
Remote: https://github.com/psychocoderHPC/alpaka.git
Branch: topic-picongpu-alpaka
Commit: 4a6dd35a9aff62e7f500623c3658685f827f73e5

Cupla (cupla/):
Remote: https://github.com/psychocoderHPC/cupla.git
Branch: topic-dualAccelerators
Commit: 4660f5fd8e888aa732230946046219f7e5daa1c9

The simulation was executed for one thousand time steps with the following configuration: a particle shape of higher order than CIC (we used TSC), the Boris pusher, the Esirkepov current solver (optimized, generalized), the Yee field solver, trilinear interpolation in field gathering, and 16 particles per cell.

Compile flags:
CPU (g++-4.9.2): -g0 -O3 -m64 -funroll-loops -march=native -ffast-math --param max-unroll-times=512
GPU (nvcc): --use_fast_math --ftz=false -g0 -O3 -m64
Conference Paper
Porting applications to new hardware or programming models is a tedious and error-prone process. Every help that eases these burdens saves developer time that can then be invested into the advancement of the application itself instead of preserving the status quo on a new platform. The Alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. By doing so, it allows platform and performance portability to be achieved across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs and other accelerators) are supported and can be programmed in the same way. The Alpaka C++ template interface allows for straightforward extension of the library to support other accelerators and specialization of its internals for optimization. Running Alpaka applications on a new (and supported) platform requires the change of only one source code line instead of a lot of #ifdefs.
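
The claim that moving to a new platform touches only one source line can be illustrated with a self-contained toy: the kernel and the launch site are backend-agnostic, and a single type alias selects the executor. The two executors below (serial and std::thread) merely stand in for real back-ends; none of this is Alpaka code.

```cpp
// Toy illustration of "change one line to switch platforms": only the 'Acc'
// alias differs between builds, while kernel and launch code stay untouched.
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

struct FillKernel
{
    template<typename Acc>
    void operator()(Acc const&, std::size_t i, int* out) const
    {
        out[i] = static_cast<int>(i);
    }
};

// Backend 1: plain serial loop on the host.
struct SerialAcc
{
    template<typename Kernel, typename... Args>
    static void run(std::size_t n, Kernel const& k, Args... args)
    {
        for(std::size_t i = 0; i < n; ++i)
            k(SerialAcc{}, i, args...);
    }
};

// Backend 2: one std::thread per index (wasteful, but shows the point).
struct ThreadsAcc
{
    template<typename Kernel, typename... Args>
    static void run(std::size_t n, Kernel const& k, Args... args)
    {
        std::vector<std::thread> pool;
        for(std::size_t i = 0; i < n; ++i)
            pool.emplace_back([=] { k(ThreadsAcc{}, i, args...); });
        for(auto& t : pool)
            t.join();
    }
};

// The only line that changes when targeting a different backend:
using Acc = ThreadsAcc;

int main()
{
    std::vector<int> out(8, -1);
    Acc::run(out.size(), FillKernel{}, out.data());
    std::cout << out[7] << "\n"; // prints 7
}
```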
Article
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
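
A minimal example of the abstractions described here, Views for data and parallel_for/parallel_reduce for execution, might look as follows. It assumes a working Kokkos installation and the default execution and memory spaces; the SAXPY itself is generic and not taken from the paper.

```cpp
// Minimal Kokkos sketch: Views manage data with a backend-appropriate layout,
// parallel dispatch runs on whichever execution space Kokkos was built for.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    {
        int const n = 1 << 20;
        double const a = 3.0;
        Kokkos::View<double*> x("x", n), y("y", n);

        Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(int const i) {
            x(i) = 2.0;
            y(i) = 1.0;
        });
        Kokkos::parallel_for("saxpy", n, KOKKOS_LAMBDA(int const i) {
            y(i) = a * x(i) + y(i);
        });

        double sum = 0.0;
        Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(int const i, double& s) {
            s += y(i);
        }, sum);
        std::printf("sum = %g\n", sum); // n * 7
    }
    Kokkos::finalize();
}
```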
Conference Paper
The 12-core 649mm² POWER8™ leverages IBM's 22nm eDRAM SOI technology [1], and microarchitectural enhancements to deliver up to 2.5× the socket performance [2] of its 32nm predecessor, POWER7+™ [3]. POWER8 contains 4.2B transistors and 31.5μF of deep-trench decoupling capacitance. Three thin-oxide transistor Vts are used for power/performance tuning, and thick-oxide transistors enable high-voltage I/O and analog designs. The 15-layer BEOL contains 5 × 80nm, 2 × 144nm, 3 × 288nm, and 3 × 640nm pitch layers for low-latency communication as well as 2 × 2400nm ultra-thick-metal (UTM) pitch layers for low-resistance distribution of power and clocks.
Article
The particle-in-cell (PIC) algorithm is one of the most widely used algorithms in computational plasma physics. With the advent of graphical processing units (GPUs), large-scale plasma simulations on inexpensive GPU clusters are in reach. We present an implementation of a fully relativistic plasma PIC algorithm for GPUs based on the NVIDIA CUDA library. It supports a hybrid architecture consisting of single computation nodes interconnected in a standard cluster topology, with each node carrying one or more GPUs. The internode communication is realized using the message-passing interface. The simulation code PIConGPU presented in this paper is, to our knowledge, the first scalable GPU cluster implementation of the PIC algorithm in plasma physics.
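
For orientation, the particle-push step at the heart of any PIC code reduces to a kick-drift update per macro-particle, as in the deliberately simplified sketch below. The actual code uses a relativistic pusher, field interpolation from the mesh and current deposition, none of which is shown; all values here are placeholders.

```cpp
// Highly simplified stand-in for a PIC particle push: each particle gets a
// velocity kick from a constant field and then drifts. No field gather,
// no current deposit, no relativistic correction -- illustration only.
#include <iostream>
#include <vector>

struct Particle { double x, v; };

void push(std::vector<Particle>& particles, double qOverM, double E, double dt)
{
    for(auto& p : particles)
    {
        p.v += qOverM * E * dt; // kick (constant field instead of interpolation)
        p.x += p.v * dt;        // drift
    }
}

int main()
{
    std::vector<Particle> particles(1000, Particle{0.0, 0.0});
    for(int step = 0; step < 100; ++step)
        push(particles, /*qOverM=*/-1.0, /*E=*/0.5, /*dt=*/1e-3);
    std::cout << particles[0].x << "\n";
}
```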
Article
High intensity short-pulse laser experiments are used for various applications from fast ignitor concept to proton beam generation. These systems produce substantial fluxes of energetic electrons, and the production of Kα spectrum by these energetic electrons has become a key diagnostics of interaction of short-pulse laser and solid-density matter. In general, spectral modeling is required to derive the thermal electron temperature from measured Kα spectrum and it is essential to use a Non-LTE (local thermodynamic equilibrium) model for this purpose. In this paper, we investigate the assumptions necessary for the application of the Non-LTE models and show that an extensive set of configurations is required to provide a valid Kα diagnosis of hot dense matter.
References

Intel. Intel Xeon Processor E5-2698 v3 Specification. http://ark.intel.com/de/products/81060/Intel-Xeon-Processor-E5-2698-v3-40M-Cache-2_30-GHz. [Online; accessed April 11, 2016].
Oliver Kowalke. boost.fiber. https://github.com/olk/boost-fiber. [Online; accessed April 12, 2016].
R. Hornung and J. Keasler. "The RAJA portability layer: overview and status". In: Lawrence Livermore National Laboratory, Livermore, USA (2014).
Denis Foley. "NVLink, Pascal and Stacked Memory: Feeding the Appetite for Big Data". In: Nvidia.com (2014). https://devblogs.nvidia.com/parallelforall/nvlink-pascal-stacked-memory-feeding-appetite-big-data. [Online]
Mauricio Faria de Oliveira. NVIDIA CUDA on IBM POWER8: Technical overview, software installation, and application development.
René Widera. cupla - C++ User interface for the Platform independent Library Alpaka. https://github.com/ComputationalRadiationPhysics/cupla. [Online; accessed March 14, 2016].
AMD. AMD Opteron 6200 Series Processor Quick Reference Guide. https://www.amd.com/Documents/Opteron_6000_QRG.pdf. [Online; accessed April 11, 2016].