Article

LLAMA: The low‐level abstraction for memory access

Abstract

The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero‐runtime‐overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the low‐level abstraction of memory access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user‐defined data types. The library is extensible with third‐party allocators. Providing two close‐to‐life examples, we show that the LLAMA‐generated array of structs and struct of arrays layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU® lbm benchmark and the particle‐in‐cell simulation PIConGPU demonstrate LLAMA's abilities in real‐world applications. LLAMA's layout‐aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element‐wise copying. LLAMA provides a novel tool for the development of high‐performance C++ applications in a heterogeneous environment.
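As a minimal sketch of the usage pattern described above (declaring a record of nested fields once and choosing the memory layout solely through an exchangeable mapping), the following assumes a recent LLAMA API (llama::Record, llama::Field, llama::mapping::AoS/SoA, llama::allocView); exact template signatures differ between library versions:

#include <llama/llama.hpp>
#include <cstddef>

// Tags naming the fields of the record dimension.
struct X{}; struct Y{}; struct Z{}; struct Pos{}; struct Mass{};

// The nested, structured element type, declared once.
using Particle = llama::Record<
    llama::Field<Pos, llama::Record<
        llama::Field<X, float>,
        llama::Field<Y, float>,
        llama::Field<Z, float>>>,
    llama::Field<Mass, float>>;

int main()
{
    const auto extents = llama::ArrayExtentsDynamic<std::size_t, 1>{1024};
    // The mapping is the only place where the layout is chosen; swapping
    // mapping::SoA for mapping::AoS leaves all access code untouched.
    const auto mapping = llama::mapping::SoA<decltype(extents), Particle>{extents};
    auto view = llama::allocView(mapping);

    for(std::size_t i = 0; i < 1024; ++i)
        view(i)(Pos{}, X{}) = static_cast<float>(i); // layout-independent access
}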

... developed since its first publication [1]. In this article, we would like to present these recently introduced features and discuss their applications and use cases. ...
... We have yet to see how to deal with such instruction sets.

template <int N, typename ParticleView>
void updateSimd(ParticleView& particleView)
{
    using Particle = ParticleView::RecordDim;
    for(std::size_t i = 0; i < problemSize; i += N)
    {
        llama::SimdN<Particle, N, std::fixed_size_simd> simdParticles;
        llama::loadSimd(particleView(i), simdParticles);
        for(std::size_t j = 0; j < problemSize; ++j)
            pPInteraction(simdParticles, particleView(j));
        llama::storeSimd(simdParticles(tag::Vel{}), particleView(i)(tag::Vel{}));
    }
}

Figure 2. A SIMD version of the n-body update routine from the original LLAMA paper [1], using std::fixed_size_simd as SIMD technology, as proposed for C++26 [13]. llama::loadSimd and llama::storeSimd take a SIMD construct or scalar and a reference to memory. ...
... LLAMA will handle records and the underlying memory layout transparently for the user. Figure 2 shows a simdized version of the update routine of the n-body example from the original LLAMA paper [1]. With N > 1 and the right compiler flags, SIMD code is produced. ...
Preprint
Full-text available
Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. The low-level abstraction of memory access (LLAMA) is a C++ library that provides a zero-runtime-overhead abstraction layer, underneath which memory mappings can be freely exchanged to customize data layouts, memory access and access instrumentation, focusing on multidimensional arrays of nested, structured data. After its scientific debut, several improvements and extensions have been added to LLAMA. This includes compile-time array extents for zero-memory-overhead views, support for computations during memory access, new mappings for bit-packing, switching types, byte-splitting, memory access instrumentation, and explicit SIMD support. This contribution provides an overview of recent developments in the LLAMA library.
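One of the listed additions, compile-time array extents, can be sketched as follows; this is a hedged example assuming the ArrayExtents API of recent LLAMA releases, where extents given as template values leave the mapping with no runtime state to store:

#include <llama/llama.hpp>
#include <cstddef>

struct X{}; struct Y{};
using Vec = llama::Record<llama::Field<X, float>, llama::Field<Y, float>>;

int main()
{
    // Every extent is a compile-time constant, so the mapping (and hence the
    // view) carries zero bytes of runtime shape information.
    using Extents = llama::ArrayExtents<std::size_t, 1024>;
    const auto mapping = llama::mapping::AoS<Extents, Vec>{Extents{}};
    auto view = llama::allocView(mapping);
    view(5)(X{}) = 1.0f;
}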
... In contrast to integrating current implementations of SoA containers [23]–[26], our tailored implementation provides a zero-cost abstraction layer, meaning it incurs no runtime overhead and seamlessly integrates with existing codebases. It leverages automatic vectorization to fully exploit SIMD (Single Instruction, Multiple Data) capabilities. ...
Preprint
Full-text available
A dense SLAM system is essential for mobile robots, as it provides localization and allows navigation, path planning, obstacle avoidance, and decision-making in unstructured environments. Due to increasing computational demands the use of GPUs in dense SLAM is expanding. In this work, we present coVoxSLAM, a novel GPU-accelerated volumetric SLAM system that takes full advantage of the parallel processing power of the GPU to build globally consistent maps even in large-scale environments. It was deployed on different platforms (discrete and embedded GPU) and compared with the state of the art. The results obtained using public datasets show that coVoxSLAM delivers a significant performance improvement considering execution times while maintaining accurate localization. The presented system is available as open-source on GitHub https://github.com/lrse-uba/coVoxSLAM.
... GPUs are ideal for instance for ML training; ROOT is working to accelerate data movement from storage to GPUs. R&D areas include memory layouts particularly suitable for GPU algorithms [19]; direct transfer from storage to GPU, bypassing the CPU; use of compression algorithms optimized for GPUs; and total throughput optimization of these different options, possibly combining them. Much of this sees very recent and ongoing technology evolution, such as nvCOMP and DirectStorage; ROOT is following these developments, making sure that they can be captured for production use in the context of RNTuple, ROOT's future I/O library. ...
Preprint
Full-text available
ROOT is high energy physics' software for storing and mining data in a statistically sound way, and for publishing results with scientific graphics. It has been evolving for 25 years and now provides the storage format for more than one exabyte of data; virtually all high energy physics experiments use ROOT. With another significant increase in the amount of data to be handled scheduled to arrive in 2027, ROOT is preparing for a massive upgrade of its core ingredients. As part of a review of crucial software for high energy physics, the ROOT team has documented its R&D plans for the coming years.
... The explicit translation of C++ objects into arrays of simple types allows for robust data interpretability even in the absence of C++ reflection capabilities. For instance, 3rd party tools can interpret the structure and (within limits) the meaning of RNTuple data without an understanding of C++ classes, as has been recently demonstrated in the context of research on a memory layout abstraction library [10]. ...
Preprint
Full-text available
This document discusses the state, roadmap, and risks of the foundational components of ROOT with respect to the experiments at the HL-LHC (Run 4 and beyond). As foundational components, the document considers in particular the ROOT input/output (I/O) subsystem. The current HEP I/O is based on the TFile container file format and the TTree binary event data format. The work going into the new RNTuple event data format aims at superseding TTree, to make RNTuple the production ROOT event data I/O that meets the requirements of Run 4 and beyond.
Article
We present a C++ library for transparent memory and compute abstraction across CPU and GPU architectures. Our library combines generic data structures like vectors, multi‐dimensional arrays, maps, graphs, and sparse grids with basic generic algorithms like arbitrary‐dimensional convolutions, copying, merging, sorting, prefix sum, reductions, neighbor search, and filtering. The memory layout of the data structures is adapted at compile time using C++ tuples with optional memory double‐mapping between host and device and the capability of using memory managed by external libraries with no data copying. We combine this transparent memory layout with generic thread‐parallel algorithms under two alternative common interfaces: a CUDA‐like kernel interface and a lambda‐function interface. We quantify the memory and compute performance and portability of our implementation using micro‐benchmarks, showing that the abstractions introduce negligible performance overhead, and we compare performance against the current state of the art in a real‐world scientific application from computational fluid mechanics.
Chapter
We present a C++14 library for performance portability of scientific computing codes across CPU and GPU architectures. Our library combines generic data structures like vectors, multi-dimensional arrays, maps, graphs, and sparse grids with basic, reusable algorithms like convolutions, sorting, prefix sum, reductions, and scan. The memory layout of the data structures is adapted at compile-time using tuples with optional memory mirroring between CPU and GPU. We combine this transparent memory mapping with generic algorithms under two alternative programming interfaces: a CUDA-like kernel interface for multi-core CPUs, Nvidia GPUs, and AMD GPUs, as well as a lambda interface. We validate and benchmark the presented library using micro-benchmarks, showing that the abstractions introduce negligible performance overhead, and we compare performance against the current state of the art.

Keywords: performance portability, memory layout, generic algorithms, C++ tuples, multi-core, GPU
Article
Full-text available
The growing rate of technology improvements has caused dramatic advances in processor performance, significantly speeding up processor working frequencies and increasing the number of instructions that can be processed in parallel. This development in processor technology has brought performance improvements to computer systems, but not for all types of applications. The reason for this lies in the well-known von Neumann bottleneck, which occurs during communication between the processor and the main memory in a standard processor-centric system. This problem has been examined by many scientists, who have proposed different approaches for improving memory bandwidth and latency. This paper provides a brief review of these techniques and also gives a deep analysis of various memory-centric systems that implement different approaches to merging or placing the memory near the processing elements. Within this analysis we discuss the advantages, disadvantages, and applications (purposes) of several well-known memory-centric systems.
Conference Paper
Full-text available
The conventional approach of moving stored data to the CPU for computation has become a major performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, advances in integration technology have made the decade-old concept of coupling compute units close to the memory (called near-memory computing) more viable. Processing right at the home of the data can eliminate the data movement problem of data-intensive applications. This paper focuses on analyzing and organizing the extensive body of literature on near-memory computing across various dimensions: starting from the memory level where this paradigm is applied, to the granularity of the applications that could be executed on the near-memory units. We highlight the challenges as well as the critical need for evaluation methodologies that can be employed in designing these special architectures. Using a case study, we present our methodology and also identify topics for future research to unlock the full potential of near-memory computing.
Article
Full-text available
Scalable and efficient numerical simulations continue to gain importance, as computation is firmly established as the third pillar of discovery, alongside theory and experiment. Meanwhile, the performance of computing hardware grows through increasingly heterogeneous parallelism, enabling simulations of ever more complex models. However, efficiently implementing scalable codes on heterogeneous, distributed hardware systems becomes the bottleneck. This bottleneck can be alleviated by intermediate software layers that provide higher-level abstractions closer to the problem domain, reducing development times and allowing computational scientists to focus. Here, we present OpenFPM, an open and scalable framework that provides an abstraction layer for numerical simulations using particles and/or meshes. OpenFPM provides transparent and scalable infrastructure for shared-memory and distributed-memory implementations of particles-only and hybrid particle-mesh simulations of both discrete and continuous models, as well as non-simulation codes. This infrastructure is complemented with frequently used numerical routines, as well as interfaces to third-party libraries. We present the architecture and design of OpenFPM, detail the underlying abstractions, and benchmark the framework in applications ranging from Smoothed-Particle Hydrodynamics (SPH) to Molecular Dynamics (MD), Discrete Element Methods (DEM), Vortex Methods, stencil codes (finite differences), and high-dimensional Monte Carlo sampling (CMA-ES), comparing it to the current state of the art and to existing software frameworks.

Program summary
Program Title: OpenFPM
Program Files doi: http://dx.doi.org/10.17632/4yrp8nbm7c.1
Licensing provisions: GPLv3
Programming language: C++
Nature of problem: Writing numerical simulation programs that use meshes, particles, or any combination of the two typically requires long development times, in particular if the code is to scale efficiently on parallel distributed-memory computers. The long development times incur high financial and project-time costs and often lead to sub-optimal program performance as shortcuts are taken. Yet, a large portion of the functionality is common across programs and could be automated or provided as reusable software components, leading to large savings in project costs and potentially improved software performance.
Solution method: OpenFPM provides a scalable, highly efficient software platform for numerical simulations using meshes, particles, or any combination of the two on parallel computers. It is based on a well-known set of abstract data types and operators that suffice to express any such simulation, regardless of the application domain. OpenFPM provides reusable, tested, and internally parallelized software components that reduce development times and make parallel computing accessible to computational scientists without extensive knowledge in parallel programming.
Additional comments including restrictions and unusual features: OpenFPM is a software library based on which users can implement their simulation codes at a fraction of the development cost. All parallelization and memory handling is transparently done by the library. As its main innovation, OpenFPM makes use of C++ Template Meta Programming in order to enable simulations in arbitrary-dimensional spaces, distribution of arbitrary user-defined C++ objects, and compile-time code optimization and targeting for specific hardware platforms.
OpenFPM-based simulations can directly output VTK files for visualization of results and HDF5 files for data archiving.
Article
Full-text available
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X–30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X–80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
Conference Paper
Full-text available
We present a particle-in-cell simulation of the relativistic Kelvin-Helmholtz Instability (KHI) that for the first time delivers angularly resolved radiation spectra of the particle dynamics during the formation of the KHI. This enables studying the formation of the KHI with unprecedented spatial, angular and spectral resolution. Our results are of great importance for understanding astrophysical jet formation and comparable plasma phenomena by relating the particle motion observed in the KHI to its radiation signature. The innovative methods presented here on the implementation of the particle-in-cell algorithm on graphic processing units can be directly adapted to any many-core parallelization of the particle-mesh method. With these methods we see a peak performance of 7.176 PFLOP/s (double-precision) plus 1.449 PFLOP/s (single-precision), an efficiency of 96% when weakly scaling from 1 to 18432 nodes, an efficiency of 68.92% and a speed up of 794 (ideal: 1152) when strongly scaling from 16 to 18432 nodes.
Article
Full-text available
When designing and implementing highly efficient scientific applications for parallel computers such as clusters of workstations, it is inevitable to consider and to optimize the single-CPU performance of the codes. For this purpose, it is particularly important that the codes respect the hierarchical memory designs that computer architects employ in order to hide the effects of the growing gap between CPU performance and main memory speed. In this article, we present techniques to enhance the single-CPU efficiency of lattice Boltzmann methods which are commonly used in computational fluid dynamics. We show various performance results for both 2D and 3D codes in order to emphasize the effectiveness of our optimization techniques.
Conference Paper
Full-text available
This paper looks at the evolution of the "Memory Wall" problem over the past decade. It begins by reviewing the short Computer Architecture News note that coined the phrase, including the motivation behind the note, the context in which it was written, and the controversy it sparked. What has changed over the years? Are we hitting the Memory Wall? And if so, for what types of applications?
Article
DRAM-based memory suffers from increasing row buffer conflicts, which cause significant performance degradation and power consumption. As memory capacity increases, row buffer conflict overheads worsen due to growing bitline length, which results in high row activation and precharge latencies. In this work, we propose a practical approach called Row Buffer Cache (RBC) to efficiently mitigate row buffer conflict overheads. At the core of our proposed RBC architecture, rows with good spatial locality are cached and protected, exempting them from being interrupted by accesses to rows with poor locality. Such an RBC architecture significantly reduces the performance and energy overheads caused by row activation and precharge, and thus improves overall system performance and energy efficiency. We evaluate the RBC architecture using SPEC CPU2006 on a DDR4 memory, compared to a commodity baseline memory system. Results show that RBC improves overall performance by up to 2.24× (16.1% on average) and reduces memory energy by up to 68.2% (23.6% on average) for single-core simulations. For multi-core simulations, RBC increases overall performance by up to 1.55× (17% on average) and reduces memory energy consumption by up to 35.4% (21.3% on average).
Article
The numerical study of physical problems often requires integrating the dynamics of a large number of particles evolving according to a given set of equations. Particles are characterized by the information they carry, such as an identity, a position, and other properties. Generally speaking, there are two different possibilities for handling particles in high performance computing (HPC) codes. The concept of an Array of Structures (AoS) is in the spirit of the object-oriented programming (OOP) paradigm in that the particle information is implemented as a structure. Here, an object (realization of the structure) represents one particle and a set of many particles is stored in an array. In contrast, using the concept of a Structure of Arrays (SoA), a single structure holds several arrays, each representing one property (such as the identity) of the whole set of particles. The AoS approach is often implemented in HPC codes due to its handiness and flexibility. For a class of problems, however, it is known that the performance of SoA is much better than that of AoS. We confirm this observation for our particle problem. Using a benchmark we show that on modern Intel Xeon processors the SoA implementation is typically several times faster than the AoS one. On Intel's MIC co-processors the performance gap even attains a factor of ten. The same is true for GPU computing, using both computational and multi-purpose GPUs. Combining performance and handiness, we present the library SoAx, which has optimal performance (on CPUs, MICs, and GPUs) while providing the same handiness as AoS. For this, SoAx uses modern C++ design techniques such as template metaprogramming that allow it to automatically generate code for user-defined heterogeneous data structures.
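To make the AoS/SoA contrast above concrete, a plain-C++ illustration (this is not SoAx itself; all names are invented for the example):

#include <cstdint>
#include <vector>

// Array of Structures: one object per particle, properties interleaved in memory.
struct ParticleAoS { std::int64_t id; double x, y, z; };
using ParticlesAoS = std::vector<ParticleAoS>; // access: aos[i].x

// Structure of Arrays: one contiguous array per property, which favors
// vectorization and coalesced GPU memory accesses.
struct ParticlesSoA {
    std::vector<std::int64_t> id;
    std::vector<double> x, y, z; // access: soa.x[i]
};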
Conference Paper
Porting applications to new hardware or programming models is a tedious and error-prone process. Every help that eases these burdens saves developer time that can then be invested into the advancement of the application itself, instead of preserving the status quo on a new platform. The Alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. By doing so, it allows achieving platform and performance portability across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs, and other accelerators) are supported and can be programmed in the same way. The Alpaka C++ template interface allows for straightforward extension of the library to support other accelerators and specialization of its internals for optimization. Running Alpaka applications on a new (and supported) platform requires the change of only one source code line instead of a lot of #ifdefs.
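The "only one source code line" mentioned above is the accelerator type alias; a hedged sketch using recent alpaka spellings (alias names may differ across library versions):

#include <alpaka/alpaka.hpp>
#include <cstddef>

using Dim = alpaka::DimInt<1>;
using Idx = std::size_t;

// Retargeting the whole program means changing only this alias:
using Acc = alpaka::AccCpuSerial<Dim, Idx>;    // serial CPU backend
// using Acc = alpaka::AccGpuCudaRt<Dim, Idx>; // NVIDIA GPU backend instead

int main() { /* kernels are then written once and launched for the chosen Acc */ }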
Chapter
Memory access patterns are critical for performance, especially on parallel architectures such as graphics processing units (GPUs). Because of this, the choice between an array-of-structures (AoS) data layout and a structure-of-arrays (SoA) layout has a large impact on overall program performance. However, it is not always obvious which layout will better serve a particular application, and testing both of them by hand in C++ is tedious because their syntax greatly differs. Not only is the syntax for defining the container different, but worse, the syntax for accessing the data within the container is different, leading to anywhere from tens to thousands of source code changes needed to switch any given container from the AoS to the SoA layout or vice versa. This chapter presents an abstraction layer that allows switching between the AoS and SoA layouts in C++ without having to change the data access syntax. A few changes to the structure and container definitions allow for easy performance comparison of AoS vs. SoA on existing AoS code. This abstraction retains the more intuitive AoS syntax (container[index].component) for data access yet allows switching between the AoS and SoA layouts with a single template parameter in the container type definition on the CPU and GPU. In this way, code development becomes independent of the data layout and performance is improved by choosing the correct layout for the application's usage pattern.
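A minimal sketch of the pattern the chapter describes, with identical container[index].component syntax for both layouts and the layout switched by a single template parameter (an illustrative reimplementation, not the chapter's actual code):

#include <cstddef>
#include <vector>

enum class Layout { AoS, SoA };

struct Particle { float x, y; };

template <Layout L>
struct Particles;

// AoS: one struct per element; operator[] hands out the struct itself.
template <>
struct Particles<Layout::AoS> {
    std::vector<Particle> data;
    explicit Particles(std::size_t n) : data(n) {}
    Particle& operator[](std::size_t i) { return data[i]; }
};

// SoA: one array per member; operator[] hands out a proxy of references,
// so the access syntax p[i].x stays identical to the AoS case.
template <>
struct Particles<Layout::SoA> {
    std::vector<float> xs, ys;
    explicit Particles(std::size_t n) : xs(n), ys(n) {}
    struct Ref { float& x; float& y; };
    Ref operator[](std::size_t i) { return {xs[i], ys[i]}; }
};

// The algorithm is written once; switching the layout is one template argument.
template <Layout L>
void scale(Particles<L>& p, std::size_t n, float f) {
    for(std::size_t i = 0; i < n; ++i) {
        p[i].x *= f;
        p[i].y *= f;
    }
}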
Article
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos' abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
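For instance, Kokkos couples the data layout to a View's layout template parameter; a minimal hedged sketch (Kokkos::View, LayoutLeft/LayoutRight, and parallel_for are Kokkos API; the surrounding setup is an assumed example):

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);
    {
        // 1000 particles with 3 coordinates. The layout decides whether the
        // coordinates of one particle are contiguous (LayoutRight, AoS-like)
        // or each component forms its own contiguous run (LayoutLeft, SoA-like).
        Kokkos::View<double*[3], Kokkos::LayoutLeft> pos("pos", 1000);
        Kokkos::parallel_for("scale", 1000, KOKKOS_LAMBDA(const int i) {
            for(int d = 0; d < 3; ++d)
                pos(i, d) *= 2.0;
        });
    }
    Kokkos::finalize();
}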
Conference Paper
SIMD extensions have been a feature of choice for processor manufacturers for a couple of decades. Designed to exploit data parallelism in applications at the instruction level, these extensions still require a high level of expertise or the use of potentially fragile compiler support or vendor-specific libraries. While a large fraction of their theoretical accelerations can be obtained using such tools, exploiting such hardware becomes tedious as soon as application portability across hardware is required. In this paper, we describe BOOST.SIMD, a C++ template library that simplifies the exploitation of SIMD hardware within a standard C++ programming model. BOOST.SIMD provides a portable way to vectorize computation on Altivec, SSE, or AVX while providing a generic way to extend the set of supported functions and hardware. We introduce a C++ standard compliant interface for users which increases expressiveness by providing a high-level abstraction to handle SIMD operations, an extension-specific optimization pass, and a set of SIMD-aware standard compliant algorithms which allow reusing classical C++ abstractions for SIMD computation. We assess BOOST.SIMD performance and applicability by providing an implementation of BLAS and image processing algorithms.
Article
“The GPU Gems series features a collection of the most essential algorithms required by Next-Generation 3D Engines.” -Martin Mittring, Lead Graphics Programmer, Crytek

This third volume of the best-selling GPU Gems series provides a snapshot of today's latest Graphics Processing Unit (GPU) programming techniques. The programmability of modern GPUs allows developers to not only distinguish themselves from one another but also to use this awesome processing power for non-graphics applications, such as physics simulation, financial analysis, and even virus detection, particularly with the CUDA architecture. Graphics remains the leading application for GPUs, and readers will find that the latest algorithms create ultra-realistic characters, better lighting, and post-rendering compositing effects.

Major topics include: geometry; light and shadows; rendering; image effects; physics simulation; GPU computing.

Contributors are from the following corporations and universities: 3Dfacto, Adobe Systems, Apple, Budapest University of Technology and Economics, CGGVeritas, The Chinese University of Hong Kong, Cornell University, Crytek, Czech Technical University in Prague, Dartmouth College, Digital Illusions Creative Entertainment, Eindhoven University of Technology, Electronic Arts, Havok, Helsinki University of Technology, Imperial College London, Infinity Ward, Juniper Networks, LaBRI–INRIA (University of Bordeaux), mental images, Microsoft Research, Move Interactive, NCsoft Corporation, NVIDIA Corporation, Perpetual Entertainment, Playlogic Game Factory, Polytime, Rainbow Studios, SEGA Corporation, UFRGS (Brazil), Ulm University, University of California Davis, University of Central Florida, University of Copenhagen, University of Girona, University of Illinois at Urbana-Champaign, University of North Carolina Chapel Hill, University of Tokyo, and University of Waterloo.

Section Editors include NVIDIA engineers: Cyril Zeller, Evan Hart, Ignacio Castaño, Kevin Bjorke, Kevin Myers, and Nolan Goodnight.

The accompanying DVD includes complementary examples and sample programs.
Article
The continuously growing gap between CPU and memory speeds is an important drawback to overall computer performance. Starting by identifying the problem and the complexity behind it, this communication addresses the recent past and current efforts to attenuate this disparity, namely memory hierarchy strategies, improvement of bus controllers, and the development of smarter memories. This communication ends by pointing out directions for technology evolution over the next few years.
Article
It is an established trend that CPU development takes advantage of Moore's Law to improve in parallelism much more than in scalar execution speed. This results in higher hardware thread counts (MIMD) and improved vector units (SIMD), of which the MIMD developments have received the focus of library research and development in recent years. To make use of the latest hardware improvements, SIMD must receive a stronger focus of API research and development because the computational power can no longer be neglected and often auto-vectorizing compilers cannot generate the necessary SIMD code, as will be shown in this paper. Nowadays, the SIMD capabilities are sufficiently significant to warrant vectorization of algorithms requiring more conditional execution than was originally expected for Streaming SIMD Extension to handle. The Vc library (http://compeng.uni-frankfurt.de/?vc) was designed to support developers in the creation of portable vectorized code. Its capabilities and performance have been thoroughly tested. Vc provides portability of the source code, allowing full utilization of the hardware's SIMD capabilities, without introducing any overhead. Copyright © 2011 John Wiley & Sons, Ltd.
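A hedged sketch of the explicit vectorization style Vc enables (float_v and its load/store with Vc::Unaligned are Vc API; the scaling loop itself is an assumed example):

#include <Vc/Vc>
#include <cstddef>

// Scale an array by f, processing float_v::size() lanes per iteration.
void scale(float* data, std::size_t n, float f)
{
    using V = Vc::float_v;
    std::size_t i = 0;
    for(; i + V::size() <= n; i += V::size())
    {
        V v(data + i, Vc::Unaligned);     // SIMD load
        v *= f;                           // broadcast-multiply across all lanes
        v.store(data + i, Vc::Unaligned); // SIMD store
    }
    for(; i < n; ++i) // scalar remainder
        data[i] *= f;
}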
Article
The characteristics of the waves guided along a plane interface which separates a semi-infinite region of free space from that of a magnetoionic medium are investigated for the case in which the static magnetic field is oriented perpendicular to the plane interface. It is found that surface waves exist only when ωe < ωp, and then only for angular frequencies which lie between ωe and 1/√2 times the upper hybrid resonant frequency. The surface waves propagate with a phase velocity which is always less than the velocity of electromagnetic waves in free space. The attenuation rates normal to the interface of the surface wave fields in both the media are examined. Numerical results of the surface wave characteristics are given for one typical case.
Article
In C++, multi-dimensional arrays are often used but the language provides limited native support for them. The language, in its Standard Library, supplies sophisticated interfaces for manipulating sequential data, but relies on its bare-bones C heritage for arrays. The MultiArray library, a part of the Boost library collection, enhances a C++ programmer's tool set with versatile multi-dimensional array abstractions. It includes a general array class template and native array adaptors that support idiomatic array operations and interoperate with C++ Standard Library containers and algorithms. The arrays share a common interface, expressed as a generic programming concept, in terms of which generic array algorithms can be implemented. We present the library design, introduce a generic interface for array programming, demonstrate how the arrays integrate with the C++ Standard Library, and discuss the essential aspects of their implementation. Copyright © 2004 John Wiley & Sons, Ltd.
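A minimal usage sketch of the interface described above (boost::multi_array and boost::extents are the library's actual API; the values are arbitrary):

#include <boost/multi_array.hpp>
#include <cassert>

int main()
{
    // A 3-dimensional array of doubles with extents 4 x 5 x 6.
    boost::multi_array<double, 3> a(boost::extents[4][5][6]);
    a[1][2][3] = 42.0; // idiomatic element access
    assert(a.num_elements() == 4 * 5 * 6);
}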
mallocMC ‐ memory allocator for many core architectures
  • C Eckert
  • R Widera
  • A Huebl
P2072: differentiable programming for C++. Technical report ISO JTC1/SC22/WG21 - Papers Mailing List
  • M Foco
  • W S Moses
  • V Vassilev
  • M Wong
ZettaScaler: liquid immersion cooling manycore based supercomputer. CANDAR2017 Keynote 3
  • S Torii
  • H Ishikawa
UltimateSoA: a trivial SOA binding to your beloved OO data hierarchy
  • V Innocente
Ultimate container - user friendly storage types for HPC simulations in C++
  • J Vyskočil
ArrayFire - A high performance software library for parallel computing with an easy-to-use API
  • P Yalamanchili
  • U Arshad
  • Z Mohammed
P0009: MDSPAN. Technical report ISO JTC1/SC22/WG21 - Papers Mailing List
  • C Trott
  • D Hollman
  • D Lebrun-Grandie
EVE - The expressive vector engine
  • J Falcou
P1684: mdarray: an owning multidimensional array analog of mdspan. Technical report ISO JTC1/SC22/WG21 - Papers Mailing List
  • M Hoemmen
  • D Hollman
  • C Trott
  • D Sunderland
P2040: reflection-based lazy-evaluation. Technical report ISO JTC1/SC22/WG21 - Papers Mailing List
  • C Jabot
The Boost C++ metaprogramming library
  • A Gurtovoy
  • D Abrahams
P2237: metaprogramming. Technical report ISO JTC1/SC22/WG21 - Papers Mailing List
  • A Sutton
NSIMD: high performance computing SIMD library
  • Agenium Scale