Conference Paper

Sierra: a SIMD extension for C++

Authors: Roland Leißa, Immanuel Haffner, Sebastian Hack

Abstract

Nowadays, SIMD hardware is omnipresent in computers. Nonetheless, many software projects hardly make use of SIMD instructions: applications are usually written in general-purpose languages like C++, but general-purpose languages provide only poor abstractions for SIMD programming, enforcing an error-prone, assembly-like programming style. Data-parallel languages are an alternative. They indeed offer more convenience for targeting SIMD architectures but introduce their own set of problems. In particular, programmers are often unwilling to port their working C++ code to a new programming language. In this paper we present Sierra: a SIMD extension for C++. It combines the full power of C++ with an intuitive and effective way to address SIMD hardware. With Sierra, the programmer can write efficient, portable and maintainable code. It is particularly easy to enhance existing code to run efficiently on SIMD machines. In contrast to prior approaches, the programmer has explicit control over the involved vector lengths.
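The abstract's key mechanism is lock-step execution of control flow over vector values. The following plain C++ sketch is not Sierra syntax (the paper expresses this via varying/uniform type qualifiers, as the citation contexts below describe); it only illustrates, under illustrative names and an assumed width W = 4, the masking semantics that a vectorized if-statement compiles down to:

    #include <cmath>
    #include <cstddef>

    // Lock-step semantics of a vectorized if: both branches execute and a
    // per-lane mask merges the results. W plays the role of the explicit
    // vector length that Sierra exposes to the programmer.
    constexpr std::size_t W = 4;

    void kernel(float* x) {
        bool mask[W];
        for (std::size_t l = 0; l < W; ++l)
            mask[l] = x[l] > 0.0f;                   // "varying" condition
        for (std::size_t l = 0; l < W; ++l)
            x[l] = mask[l] ? std::sqrt(x[l]) : 0.0f; // masked merge
    }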


... tree traversal codes [Jo et al., 2013; Ren et al., 2015] or FFTs [Frigo and Johnson, 2005]. Second, compilers for high-level vectorization languages, such as ISPC [Pharr and Mark, 2012] or Sierra [Leißa et al., 2014], vectorize code only in their structured language. Finally, the vectorizer passes of general-purpose compilers build on the compilers' unstructured intermediate representations (IR). ...
... Data-parallel languages commonly do not offer ways to explicitly specify predicated execution [Pharr and Mark, 2012;Munshi, 2009;Leißa, 2017;Leißa et al., 2014;Reiche, 2018]. ...
... Several data-parallel languages with lock-step semantics have been proposed for vectorization [Pharr and Mark, 2012; Leißa, 2017; Leißa et al., 2014; Reiche, 2018]. These languages do not allow unstructured divergent control flow (goto statement). ...
... The idea of using templates to provide a generic layer above intrinsic functions, as we propose in this paper, is not new and was first proposed a decade ago [3]. It has more recently appeared in dedicated tools [4][5][6][7][8][9][10][11] and in the scope of HPC projects [12,13]. However, this approach has not become widespread because it is incompatible with the two most widely used languages in HPC, which are C and Fortran. ...
... At line (5), we fill vector res with 10 where the condition is true, or with 20 otherwise. At line (9), we increment res by twice the current value at the positions that are not equal to a given value. At line (14), we set to zero the positions in res that are different from 60; we explicitly use a precomputed Boolean vector at line (19); and we increment some of the positions by a value at line (22). ...
... .Then(another value).ElseIf(a value < 10.).Then(a value * 10).ElseIf(a value == 0).Then(...) ... InaVecBestType<double> aKernelWithTest(const InaVecBestType<double> v1, const InaVecBestType<double> v2) { return InaVecBestType<double>::If(v1 < v2).Then((v1 - v2).exp()).Else((v2 - v1).exp()); } ...
Article
Full-text available
The development of scientific applications requires highly optimized computational kernels to benefit from modern hardware. In recent years, vectorization has gained key importance in exploiting the processing capabilities of modern CPUs, whose evolution is characterized by increasing register widths and core counts, but stagnating clock frequencies. In particular, vectorization allows floating point operations to be performed at a higher rate than the processor's frequency. However, compilers often fail to vectorize complex codes, and pure assembly/intrinsic implementations often suffer from software engineering issues, such as readability and maintainability. Moreover, it is difficult for domain scientists to write optimized code without technical support. To address these issues, we propose Inastemp, a lightweight open-source C++ library. Inastemp offers a solution for developing hardware-independent computational kernels for the CPU. These kernels are portable across compilers and floating point precisions, and are vectorized targeting SSE(3,4.1,4.2), AVX(2), AVX-512, or AltiVec/VMX instructions. Inastemp provides advanced features, such as an if-else statement that vectorizes branches that cannot be removed. Our performance study shows that Inastemp has the same efficiency as pure intrinsic approaches on modern architectures. As side results, this study provides micro-benchmarks on the latest HPC architectures for three different computational kernels, emphasizing comparisons between scalar and intrinsic-based codes.
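The citation contexts above quote Inastemp's vectorized if-else; the quoted fragment can be cleaned up into the following self-contained kernel. The header name is an assumption (check the Inastemp distribution); the vector type and the If/Then/Else chain come directly from the quoted snippet:

    #include "InastempCompileConfig.h"  // assumed header; see Inastemp docs

    using Vec = InaVecBestType<double>; // best vector type for the target

    // Both branches are evaluated and a lane mask selects per-lane results,
    // so the branch need not be removed to vectorize.
    Vec aKernelWithTest(const Vec v1, const Vec v2) {
        return Vec::If(v1 < v2)
                  .Then((v1 - v2).exp())
                  .Else((v2 - v1).exp());
    }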
... An alternative solution is to add support for vectorization in the source language: ispc [PM12], a compiler for a C-based language with support for vector programming, and Sierra [LHH14], an extension to C++, use vector types and overloaded control-flow statements. In these languages, a variable can be varying, meaning that each vector lane may have a different value for that variable, or uniform, in which case every vector lane contains the same value. ...
... A sketch of the algorithm is given below: In this sample code, there are two calls to vectorize: One is outside the packet traversal loop, and the other is inside. This form of nested vectorization, where the vector width can be changed from inside a vectorized region, is something that other vectorization frameworks, like ispc [PM12] or Sierra [LHH14], cannot do naturally. ispc would require separate compilation of the packet and single-ray traversal, and Sierra would require the programmer to manually annotate types everywhere in order to make the switch possible. ...
... In AoSoA, groups of N objects of an array are transformed into an SoA layout, hence it is necessary to only allocate one cluster. Sierra [37] is a language intended for writing code that better exploits SIMD capabilities. Sierra provides a similar construct to ISPC's soa<N>, but the developer can also write code that is generic over the parameter N. Despite not supporting them natively, we believe it is easy to extend SHAPES to support AoSoA layouts by exploiting specialisation and by providing instructions specific to pools of such layouts. ...
... Pools are not homogeneous, hence runtime type information is necessary for field access and object construction. [10] presents extensions to the SHAPES model to add support for dynamically allocated pool-backed arrays with value semantics and a SIMD environment similar to that of [36,37]. ...
Article
The vast divide between the speed of CPU and RAM means that effective use of CPU caches is often a prerequisite for high performance on modern architectures. Hence, developers need to consider how to place data in memory so as to exploit spatial locality and achieve high memory bandwidth. Such manual memory optimisations are common in unmanaged languages (e.g. C, C++), but they sacrifice readability, maintainability, memory safety, and object abstraction. In managed languages, such as Java and C#, where the runtime abstracts away the memory from the developer, such optimisations are almost impossible. We present a language extension called SHAPES, which aims to offer developers more fine-grained control over the placement of data, without sacrificing memory safety or object abstraction. In SHAPES, programmers group related objects into pools, and specify how objects are laid out in these pools. Classes and types are annotated by pool parameters, which allow placement aspects to be changed orthogonally to the code that operates on the objects in the pool. These design decisions disentangle business logic and memory concerns. We give a formal model of SHAPES, present its type and memory safety model, and present its translation to a low-level language. We argue why we expect this translation to be efficient in terms of runtime representation of objects and access to their fields. We argue that SHAPES can be incorporated into existing managed and unmanaged language runtimes and fit well with garbage collection.
... In Structure of Arrays (SOA; column stores in database systems), all values of a field are grouped and stored contiguously across the entire object space. SOA is a well-studied, established data layout [4,19,20,24,28,31,39] and CUDA best practice which can save on memory access time (memory coalescing [16]), maximize cache usage and allow for vectorization via SIMD instructions [2,14,15]. [Figure: Comparison of OOP notation for a simplified 2D N-body simulation.] ...
... This approach is fragile and could break at any time. Second, code can be vectorized manually, either with C++ SSE intrinsics or with a vectorization framework like Sierra [14,15]. Considering that real applications, which exhibit code that is more complex than our example here, cannot be automatically vectorized (yet) with today's compilers, even if written in SOA style, this approach seems feasible to us. ...
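For readers unfamiliar with the two layouts contrasted in the snippets above, a minimal C++ sketch (field names are illustrative):

    #include <vector>

    // Array of Structures: the fields of one body are adjacent in memory.
    struct BodyAoS { float x, y, vx, vy; };
    using AoS = std::vector<BodyAoS>;

    // Structure of Arrays: all x values are contiguous, so one SIMD load
    // fetches x for several bodies at once (memory coalescing on GPUs).
    struct SoA {
        std::vector<float> x, y, vx, vy;
    };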
Conference Paper
Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We present Ikra-Cpp, an embedded DSL for object-oriented programming in C++/CUDA. Ikra-Cpp's notation is very close to standard AOS-style C++ code, but data is laid out as SOA. This gives programmers the performance benefit of SOA and the expressiveness of AOS-style object-oriented programming at the same time. Ikra-Cpp is well integrated with C++ and lets programmers use C++ notation and syntax for classes, fields, member functions, constructors and instance creation.
... This number of products is equivalent to the performance of the geometric product algorithm described in [3] and [8]. ...
... This leads to an equivalent of Gaigen. Additionally, the products are vectorized using the SIMD instructions described in [5], enabling us to operate on up to four leaves of the tree at the same time by using SIMD registers. The SIMD code is inserted on line 6 of Algorithm 3. ...
Article
Full-text available
This paper presents an efficient implementation of geometric algebra, based on a recursive representation of the algebra elements using binary trees. The proposed approach consists in restructuring a state-of-the-art recursive algorithm to handle parallel optimizations. The resulting algorithm is described for the outer product and the geometric product. The proposed implementation is usable for any dimension, including high dimensions (e.g., the algebra of dimension 15). The method is compared with the main state-of-the-art geometric algebra implementations, with a time complexity study as well as a practical benchmark. The tests show that our implementation is at least as fast as the main geometric algebra implementations.
... SPMD programming models with data parallel languages such as ispc [30] and ones with C++ SIMD extensions such as Sierra [19] have well-defined SPMD semantics but are more restrictive than Parsimony's proposed semantics. Prior work has also studied the use of GPU-focused SPMD programming models to target CPU SIMD units [7,43]. ...
... However, this also requires significant expertise and can require re-implementing kernels for each instruction set architecture (ISA). To mitigate this drawback, several abstraction libraries have been created [2,4,6,9,10,16,19]. They allow writing vectorized code with an abstract vector type which is fixed at compile time depending on the target architecture. ...
Preprint
Full-text available
Leveraging the SIMD capability of modern CPU architectures is mandatory to take full benefit of their increasing performance. To exploit this feature, binary executables must be explicitly vectorized by the developers or by an automatic vectorization tool. This is why the compilation research community has created several strategies to transform a scalar code into a vectorized implementation. However, the majority of the approaches focus on regular algorithms, such as affine loops, that can be vectorized with few data transformations. In this paper, we present a new approach that allows automatically vectorizing scalar codes with chaotic data accesses, as long as their operations can be statically inferred. We describe how our method transforms a graph of scalar instructions into a vectorized one using different heuristics, with the aim of reducing the number or cost of the instructions. Finally, we demonstrate the effectiveness of our approach on various computational kernels using Intel AVX-512 and ARM SVE.
... SPMD Languages. SPMD languages like ISPC [Pharr and Mark 2012], Sierra [Leißa et al. 2014], and Hipacc [Reiche et al. 2017] provide abstractions for the SPMD programming model. Their compilers all perform divergence analysis to improve the performance of the generated vector ...
Article
Vectorizing compilers employ divergence analysis to detect at which program point a specific variable is uniform, i.e. has the same value on all SPMD threads that execute this program point. They exploit uniformity to retain branching to counter branch divergence and to defer computations to scalar processor units. Divergence is a hyper-property and is closely related to non-interference and binding time. Several divergence, binding time, and non-interference analyses already exist, but they either sacrifice precision or place significant restrictions on the syntactical structure of the program in order to achieve soundness. In this paper, we present the first abstract interpretation for uniformity that is general enough to be applicable to reducible CFGs and, at the same time, more precise than other analyses that achieve at least the same generality. Our analysis comes with a correctness proof that is in large part mechanized in Coq. Our experimental evaluation shows that the compile time and the precision of our analysis are on par with LLVM's default divergence analysis, which is only sound on more restricted CFGs. At the same time, our analysis is faster and achieves better precision than a state-of-the-art non-interference analysis that is sound and at least as general as our analysis.
... In contrast, since AoS has only one active reference, it does not have this sensitivity. A hybrid memory layout is possible, retaining the vectorization advantages of SoA while using only one active reference like AoS [48]. Such a layout is called AoSoA: it consists of an array of "SoA". ...
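A minimal sketch of the AoSoA layout described above, with illustrative field names and an assumed cluster size of N = 8 lanes:

    #include <vector>

    // AoSoA: an array of fixed-size SoA clusters. Each cluster stores the
    // fields of N consecutive objects, so SIMD accesses stay contiguous
    // while the container is traversed through a single active reference.
    template <int N>
    struct Cluster {
        float x[N], y[N];
    };
    using AoSoA = std::vector<Cluster<8>>; // N = 8, e.g. AVX single precision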
Thesis
Throughout this thesis, we studied small-dimensional linear algebra problems (typically 2x2 to 5x5) used within the LHCb experiment (but also in other fields, such as computer vision). Linear algebra libraries such as Eigen, Magma or the MKL are not optimized for small matrices. We used and combined several well-known transformations that facilitate SIMD, as well as less common transformations such as the fast inverse square root. To ease the writing of these transformations, and also to obtain portable code, we wrote a code generator. We tested these transformations and analyzed their impact on the processing speed of simple algorithms. Batch processing in SoA is crucial to maximize the processing speed of these small-dimensional problems. An analysis of the accuracy of the results as a function of the computation precision was also carried out on these examples. We then implemented these transformations in order to accelerate the Cholesky factorization of small matrices (up to 12x12). The processing speed plateaus without the fast inverse square root computation. We obtained a speedup between 10x and 33x in F32, making our code 3x to 10x faster than the MKL. Finally, we studied and accelerated the general-purpose Kalman filter, obtaining a 90x speedup on the 4x4 F32 implementation. The Kalman filter used within LHCb was accelerated by a factor of 2.2x compared to the current SIMD version and by more than 2.3x compared to the filters of other particle physics experiments.
... In recent years, many such approaches have been published. We mention Agner Fog's vector class library [42], Vc [75], Boost.SIMD [39], the Generic SIMD Library [117], Sierra [77], MIPP [23], xsimd [100], Google's dimsum [50], UME::SIMD [58] and libsimdpp [57]. Additionally, compilers have lately (e.g. ...
Thesis
Numerical simulation with partial differential equations is an important discipline in high performance computing. Notable application areas include geosciences, fluid dynamics, solid mechanics and electromagnetics. Recent hardware developments have made it increasingly hard to achieve very good performance. This is both due to a lack of numerical algorithms suited for the hardware and efficient implementations of these algorithms not being available. Modern CPUs require a sufficiently high arithmetic intensity in order to unfold their full potential. In this thesis, we use a numerical scheme that is well-suited for this scenario: The Discontinuous Galerkin Finite Element Method on cuboid meshes can be implemented with optimal complexity exploiting the tensor product structure of basis functions and quadrature formulae using a technique called sum factorization. A matrix-free implementation of this scheme significantly lowers the memory footprint of the method and delivers a fully compute-bound algorithm. An efficient implementation of this scheme for a modern CPU requires maximum use of the processor’s SIMD units. General purpose compilers are not capable of autovectorizing traditional PDE simulation codes, requiring high performance implementations to explicitly spell out SIMD instructions. With the SIMD width increasing in the last years (reaching its current peak at 512 bits in the Intel Skylake architecture) and programming languages not providing tools to directly target SIMD units, such code suffers from a performance portability issue. This work proposes generative programming as a solution to this issue. To this end, we develop a toolchain that translates a PDE problem expressed in a domain specific language into a piece of machine-dependent, optimized C++ code. This toolchain is embedded into the existing user workflow of the DUNE project, an open source framework for the numerical solution of PDEs. Compared to other such toolchains, special emphasis is put on an intermediate representation that enables performance-oriented transformations. Furthermore, this thesis defines a new class of SIMD vectorization strategies that operate on batches of subkernels within one integration kernel. The space of these vectorization strategies is explored systematically from within the code generator in an autotuning procedure. We demonstrate the performance of our vectorization strategies and their implementation by providing measurements on the Intel Haswell and Intel Skylake architectures. We present numbers for the diffusion-reaction equation, the Stokes equations and Maxwell’s equations, achieving up to 40% of the machine’s theoretical floating point performance for an application of the DG operator.
... Sierra [66] is a SIMD extension for C++. This API provides a portable and easy way for the programmer to exploit vectorization explicitly. ...
... SIMDization is a technique to utilize the capability of SIMD extensions that can be applied explicitly or implicitly [8]. For explicit SIMDization, programmers should implement their applications using the available programming models [9], [14][15][16][17][18][19]. The intrinsics programming model provides access to the ISA's capabilities that is very similar to inline assembly. ...
Article
Processor vendors have been expanding Single Instruction Multiple Data (SIMD) extensions to exploit data-level parallelism in their General Purpose Processors (GPPs). Each SIMD technology, such as Streaming SIMD Extensions (SSE) and Advanced Vector eXtensions (AVX), has its own Instruction Set Architecture (ISA), which is equipped with Special Purpose Instructions (SPIs). In order to exploit these features, many programming approaches have been developed. The Intrinsic Programming Model (IPM) is a low-level concept for explicit SIMDization. Besides, Compilers' Automatic Vectorization (CAV) has been embedded in modern compilers such as the Intel C++ compiler (ICC), the GNU Compiler Collection (GCC) and LLVM for implicit vectorization. Each SIMDization shows different improvements because of different SIMD ISAs, vector register widths, and programming approaches. Our goal in this paper is to evaluate the performance of explicit and implicit vectorization. Our experimental results show that the behavior of explicit vectorization on different compilers is almost the same compared to implicit vectorization. IPM improves the performance more than CAVs. In general, the ICC and GCC compilers can vectorize kernels and use SPIs more efficiently than LLVM. In addition, AVX2 technology is more useful for small matrices and compute-intensive kernels than for large matrices and data-intensive kernels, because of memory bottlenecks. Furthermore, CAVs fail to vectorize kernels which have overlapping and non-consecutive memory access patterns. The way a kernel is implemented impacts its vectorization. In order to understand what kinds of scalar implementations of an algorithm are suitable for vectorization, an approach based on a code modification technique is proposed. Our experimental results show that scalar implementations that use loop collapsing, loop unrolling, software pipelining, or loop exchange techniques can be efficiently vectorized compared to straightforward implementations.
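As a concrete illustration of the intrinsic programming model (IPM) the study evaluates, the following AVX snippet adds eight floats per instruction; the function itself is an illustrative example, not taken from the paper:

    #include <immintrin.h>

    // Explicit SIMDization with intrinsics: one AVX add processes eight
    // floats, where the scalar loop would issue one add per element.
    void add8(const float* a, const float* b, float* c) {
        __m256 va = _mm256_loadu_ps(a);
        __m256 vb = _mm256_loadu_ps(b);
        _mm256_storeu_ps(c, _mm256_add_ps(va, vb));
    }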
... In a non-transparent language such as C, one cannot in general know whether vector instructions will be used: this is compiler-specific, and depends both on the compiler heuristics and on how sensitive it is to the way the source code is written. Several programming models have therefore been built with the purpose of taking explicit control over code vectorization [13,14,4,12,5,10]. To focus on performance transparency, this section presents a simple, example-specific, hypothetical programming model designed to be performance-transparent. Figure 2 demonstrates how we may represent the vector unit behaviour. ...
Article
Machine-specific optimizations command the machine to behave in a specific way. As current programming models largely leave machine details unexposed, they cannot accommodate direct encoding of such commands. In previous work we have proposed the design of performance-transparent programming models to facilitate this use case; this report contains proof-of-concept examples of such programming models. We demonstrate how programming model abstractions may reveal the memory footprint, vector unit utilization and data reuse of an application, with prediction accuracy ranging from 0 to 25%.
... A flurry of C++ libraries provide a unified programming model for various SIMD architectures, such as Boost.SIMD [11], MIPP [8], UME::SIMD [14], Sierra [18] or Vc [16]. These works are complementary to ours: our intent is to provide a language that (always) compiles to efficient bitsliced code while being amenable to formal verification. ...
Conference Paper
Full-text available
Bitslicing is a programming technique commonly used in cryptography that consists in implementing a combinational circuit in software. It results in a massively parallel program immune to cache-timing attacks by design. However, writing a program in bitsliced form requires extreme minutia. This paper introduces Usuba, a synchronous dataflow language producing bitsliced C code. Usuba is both a domain-specific language -- providing syntactic support for the implementation of cryptographic algorithms -- as well as a domain-specific compiler -- taking advantage of well-defined semantic invariants to perform various optimizations before handing the generated code to an (optimizing) C compiler. On the Data Encryption Standard (DES) algorithm, we show that Usuba outperforms a reference, hand-tuned implementation by 15% (using Intel's 64-bit general-purpose registers and depending on the underlying C compiler), whilst our implementation also transparently supports modern SIMD extensions (SSE, AVX, AVX-512), other architectures (ARM Neon, IBM Altivec) as well as multicore processors through an OpenMP backend.
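A minimal sketch of the bitslicing idea behind Usuba; the struct and gate below are illustrative C++, not Usuba output:

    #include <cstdint>

    // Bitslicing: bit i of every word belongs to instance i, so one 64-bit
    // XOR evaluates the same gate for 64 independent instances in lock step,
    // with no data-dependent branches or table lookups (cache-timing safe).
    struct Sliced4 { std::uint64_t bit[4]; }; // a 4-bit value, 64 instances

    Sliced4 xor4(const Sliced4& a, const Sliced4& b) {
        Sliced4 r;
        for (int i = 0; i < 4; ++i) r.bit[i] = a.bit[i] ^ b.bit[i];
        return r;
    }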
... These sets of macros can be easily extended for the IBM Altivec or ARM Neon SIMD extensions. An alternative is to use a generic SIMD library like [18][8][26]. Another one is to use the Intel SPMD Program Compiler (ISPC) [2]. ...
Conference Paper
Full-text available
This paper presents a new multi-pass iterative algorithm for Connected Component Labeling. The performance of this algorithm is compared to that of state-of-the-art two-pass direct algorithms. We show that, thanks to the parallelism of SIMD multi-core processors and an activity matrix that avoids useless memory accesses, such an algorithm reaches performance that comes closer and closer to that of direct ones. This new tiled iterative algorithm has been benchmarked on four generations of Intel Xeon processors: 2×4-core Nehalem, 2×12-core Ivy Bridge, 2×14-core Haswell and 57-core Knights Corner. Macro meta-programming was used to design a unique code for the SSE, AVX2 and KNC SIMD instruction sets.
... Still on the compiler side, code directives can be used to enforce loop vectorization (#pragma simd for ICC and GCC), but the code quality relies on the compiler and this feature is not available in all of them. Dedicated compilers like ISPC [Pharr 2012], Sierra [Leißa 2014] or Cilk [Robison 2013] choose to add a set of keywords to the language to explicitly mark the code fragments that are candidates for the automatic vectorization process. VaporSIMD [Nuzman 2011] proposes another approach, which consists in auto-vectorizing the C-based code to get the intermediate representation of the compiler and then using a Just-In-Time based framework to generate portable SIMD code. ...
Article
Full-text available
Performing large, intensive or non-trivial computing on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This fact is backed up by the large number of tools, languages and libraries for performing such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exist, from BLAS/LAPACK C++ bindings to template meta-programming based Blitz++ or Eigen. While all of these libraries provide good performance or good abstraction, none of them seems to fit the needs of so many different user types. Moreover, as parallel system complexity grows, the need to maintain all those components quickly becomes unwieldy. This thesis explores various software design techniques - like Generative Programming, Meta-Programming and Generic Programming - and their application to the implementation of various parallel computing libraries in such a way that abstraction and expressiveness are maximized while efficiency overhead is minimized.
... Due to its C interface, using intrinsics forces the programmer to deal with a verbose style of programming. Furthermore, the Application Programming Interface differs from one extension to another. Dedicated compilers like ISPC [81], Sierra [72] or Cilk [85] choose to add a set of keywords to the language to explicitly mark the code fragments that are candidates for the automatic vectorization process. VaporSIMD [77] proposes another approach, which consists in auto-vectorizing the C-based code to get the intermediate representation of the compiler and then using a Just-In-Time based framework to generate portable SIMD code. ...
Article
The constantly increasing need for computing power has pushed the development of parallel architectures. Scientific computing relies on the performance of such architectures to produce scientific results. Programming efficient applications that take advantage of these computing systems remains a non-trivial task. In this thesis, we present a new methodology to design architecture-aware software: the AA-DEMRAL methodology. This methodology aims at simplifying the development of parallel programming tools with multi-architectural support through a generic and generative approach. We then present three high-level programming tools that rely on this approach. First, we introduce the Boost.Dispatch library that provides a way to develop software based on the AA-DEMRAL methodology. The Boost.Dispatch library is a C++ generic framework for architecture-aware function dispatching. Then, we present two C++ template libraries implemented as architecture-aware DSELs which assess the AA-DEMRAL methodology through the use of Boost.Dispatch: Boost.SIMD, which provides a high-level API for SIMD extensions, and NT2, which proposes a Matlab-like interface with support for multi-core and SIMD based systems. We assess the performance of these libraries and the validity of our new methodology through benchmarks.
Article
Leveraging the SIMD capability of modern CPU architectures is mandatory to take full advantage of their increased performance. To exploit this capability, binary executables must be vectorized, either manually by developers or automatically by a tool. For this reason, the compilation research community has developed several strategies for transforming scalar code into a vectorized implementation. However, most existing automatic vectorization techniques in modern compilers are designed for regular codes, leaving irregular applications with non-contiguous data access patterns at a disadvantage. In this paper, we present a new tool, Autovesk, that automatically generates vectorized code from scalar code, specifically targeting irregular data access patterns. We describe how our method transforms a graph of scalar instructions into a vectorized one, using different heuristics to reduce the number or cost of instructions. Finally, we demonstrate the effectiveness of our approach on various computational kernels using Intel AVX-512 and ARM SVE. We compare the speedups of Autovesk vectorized code over GCC, Clang/LLVM and Intel automatic vectorization optimizations. We achieve competitive results on linear kernels and up to 11x speedups on irregular kernels.
Article
A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational graph over a set of highly optimized, batched general-purpose kernels. While this approach simplifies the kernels’ implementation for each individual accelerator, the increasing heterogeneity among accelerator architectures for HPC complicates the creation of portable and extensible libraries of such kernels. Therefore, using a generalization of the CUDA community’s warp register cache programming idiom, we propose a new programming idiom (CoRe) and a virtual architecture model (PIRCH), abstracting over SIMD and SIMT paradigms. We define and automate the mapping process from a single source to PIRCH’s intermediate representation and develop backends that issue code for three different architectures: Intel AVX512, NVIDIA GPUs, and NEC SX-Aurora. Code generated by our source-to-source compiler for batched kernels, borG, competes favorably with vendor-tuned libraries and is up to 2× faster than hand-tuned kernels across architectures.
Article
The C programming language is a foundational technology for modern computing with millions of lines of code implementing everything from hobby projects to commercial operating systems. This installation base and the programmers producing it represent a massive software engineering investment spanning decades and likely to continue for decades more. Nevertheless, C, which was first standardized almost 30 years ago, lacks many features that make programming in more modern languages safer and more productive. The goal of the C∀ project (pronounced "C for all") is to create an extension of C that provides modern safety and productivity features while still ensuring strong backward compatibility with C and its programmers. Prior projects have attempted similar goals but failed to honor the C programming style; for instance, adding object-oriented or functional programming with garbage collection is a nonstarter for many C developers. Specifically, C∀ is designed to have an orthogonal feature set based closely on the C programming paradigm, so that C∀ features can be added incrementally to existing C code bases, and C programmers can learn C∀ extensions on an as-needed basis, preserving investment in existing code and programmers. This paper presents a quick tour of C∀ features, showing how their design avoids shortcomings of similar features in C and other C-like languages. Experimental results are presented to validate several of the new features.
Conference Paper
The parallelization of programs and distributing their workloads to multiple threads can be a challenging task. In addition to multi-threading, harnessing vector units in CPUs proves highly desirable. However, employing vector units to speed up programs can be quite tedious. Either a program developer solely relies on the auto-vectorization capabilities of the compiler or he manually applies vector intrinsics, which is extremely error-prone, difficult to maintain, and not portable at all. Based on whole-function vectorization, a method to replace control flow with data flow, we propose auto-vectorization techniques for image processing DSLs in the context of source-to-source compilation. The approach does not require the input to be available in SSA form. Moreover, we formulate constraints under which the vectorization analysis and code transformations may be greatly simplified in the context of image processing DSLs. As part of our methodology, we present control flow to data flow transformation as a source-to-source translation. Moreover, we propose a method to efficiently analyze algorithms with mixed bit-width data types to determine the optimal SIMD width, independently of the target instruction set. The techniques are integrated into an open source DSL framework. Subsequently, the vectorization capabilities are compared to a variety of existing state-of-the-art C/C++ compilers. A geometric mean speedup of up to 3.14 is observed for benchmarks taken from ISPC and image processing, compared to non-vectorized executions.
Conference Paper
Recent trends in processor design accommodate wide vector extensions. SIMD vectorization is more important than before to exploit the potential performance of the target architecture. The latest OpenMP specification provides new directives which help compilers produce better code for SIMD auto-vectorization. However, it is hard to optimize SIMD code performance in OpenMP, since the target SIMD code generation mostly relies on the compiler implementation. In this paper, we propose a new directive that specifies user-defined SIMD variants of functions used in SIMD loops. The compiler can then use the user-defined SIMD variants when it encounters OpenMP loops, instead of auto-vectorized SIMD variants. The user can optimize the SIMD performance by implementing highly optimized SIMD code with intrinsic functions. The performance evaluation using an image composition kernel shows that the user can optimize SIMD code generation in an explicit way by using our approach. The user-defined function reduces the number of instructions by 70% compared with the auto-vectorized code generated from the serial code.
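For context, a sketch of the standard OpenMP mechanism the proposal builds on; the paper's new directive itself is not shown here, and the function is an illustrative assumption:

    // `declare simd` asks the compiler to emit an auto-generated SIMD
    // variant of scale() for use inside `omp simd` loops; the paper proposes
    // letting the user supply a hand-optimized variant instead.
    #pragma omp declare simd
    float scale(float x) { return 2.0f * x + 1.0f; }

    void apply(float* a, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            a[i] = scale(a[i]);
    }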
Article
Full-text available
A hierarchical model of computer organizations is developed, based on a tree model using request/service type resources as nodes. Two aspects of the model are distinguished: logical and physical. General parallel- or multiple-stream organizations are examined as to type and effectiveness, especially regarding intrinsic logical difficulties. The overlapped simplex processor (SISD) is limited by data dependencies. Branching has a particularly degenerative effect. The parallel processors [single-instruction stream-multiple-data stream (SIMD)] are analyzed. In particular, a nesting type explanation is offered for Minsky's conjecture: the performance of a parallel processor increases as log M instead of M (the number of data stream processors). Multiprocessors (MIMD) are subjected to a saturation syndrome based on general communications lockout. Simplified queuing models indicate that saturation develops when the fraction of task time spent locked out (L/E) approaches 1/n, where n is the number of processors. Resource sharing in multiprocessors can be used to avoid several other classic organizational problems.
Conference Paper
Full-text available
Our ability to create systems with large amount of hardware parallelism is exceeding the average software developer’s ability to effectively program them. This is a problem that plagues our industry. Since the vast majority of the world’s software developers are not parallel programming experts, making it easy to write, port, and debug applications with sufficient core and vector parallelism is essential to enabling the use of multi- and many-core processor architectures. However, hardware architectures and vector ISAs are also shifting and diversifying quickly, making it difficult for a single binary to run well on all possible targets. Because of this, retargetability and dynamic compilation are of growing relevance. This paper introduces Intel® Array Building Blocks (ArBB), which is a retargetable dynamic compilation framework. This system focuses on making it easier to write and port programs so that they can harvest data and thread parallelism on both multi-core and heterogeneous many-core architectures, while staying within standard C++. ArBB interoperates with other programming models to help meet the demands we hear from customers for a solution with both greater programmer productivity and good performance. This work makes contributions in language features, compiler architecture, code transformations and optimizations. It presents performance data from the current beta release of ArBB and quantitatively shows the impact of some key analyses, enabling transformations and optimizations for a variety of benchmarks that are of interest to our customers.
Article
Full-text available
For almost two decades researchers have argued that ray tracing will eventually become faster than the rasterization technique that completely dominates today's graphics hardware. However, this has not happened yet. Ray tracing is still exclusively being used for off-line rendering of photorealistic images, and it is commonly believed that ray tracing is simply too costly to ever challenge rasterization-based algorithms for interactive use. However, there is hardly any scientific analysis that supports either point of view. In particular, there is no evidence of where the crossover point might be, at which ray tracing would eventually become faster, or if such a point does exist at all. This paper provides several contributions to this discussion: We first present a highly optimized implementation of a ray tracer that improves performance by more than an order of magnitude compared to currently available ray tracers. The new algorithm makes better use of computational resources such as caches and SIMD instructions and better exploits image and object space coherence. Secondly, we show that this software implementation can challenge and even outperform high-end graphics hardware in interactive rendering performance for complex environments. We also provide a brief overview of the benefits of ray tracing over rasterization algorithms and point out the potential of interactive ray tracing both in hardware and software.
Conference Paper
Full-text available
The recent proliferation of the single instruction multiple data (SIMD) model has led to a wide variety of implementations. These have been incorporated into many platforms, from gaming machines and DSPs to general purpose architectures. In this paper, we present an automatic vectorizer as implemented in GCC, the most multi-targetable compiler available today. We discuss the considerations involved in developing a multi-platform vectorization technology, and demonstrate how our vectorization scheme is suited to a variety of SIMD architectures. Experiments on four different SIMD platforms demonstrate that our automatic vectorization scheme is able to efficiently support individual platforms, achieving significant speedups on key kernels.
Article
Full-text available
The huge processing power needed by multimedia applications has led to multimedia extensions in the instruction sets of microprocessors which exploit subword parallelism. Examples of these extended instruction sets are the Visual Instruction Set of the UltraSPARC processor, the AltiVec instruction set of the PowerPC processor, the MMX and SSE extensions of the Pentium processors, and the MAX-2 instruction set of the HP PA-RISC processor. Currently, these extensions can only be used by programs written in assembly language, through system libraries or by calling specialized macros in a high-level language. Therefore, these instructions are not used by most applications. We propose two code generation techniques to produce native code using these multimedia extensions for programs written in a high-level language: classical vectorization and vectorization by unrolling. Vectorization by unrolling is simpler than classical vectorization, since data dependence analysis is reduced to acyclic control flow graph analysis. Furthermore, we address the problem of unaligned memory accesses. This can be handled by both static analysis and dynamic run-time checking. Preliminary experimental results for a code generator for the UltraSPARC VIS instruction set show that speedups of up to a factor of 4.8 are possible, and that vectorization by unrolling is much simpler but as effective as classical vectorization. Keywords: compiler, multimedia processors, vectorization, loop unrolling
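A conceptual sketch of vectorization by unrolling as described above (illustrative code, not from the paper): unroll the loop by the subword factor, then replace the isomorphic statements in the unrolled body with one packed operation:

    #include <cstdint>

    void add_bytes(std::uint8_t* a, const std::uint8_t* b, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            a[i + 0] += b[i + 0]; // these four scalar adds map onto a
            a[i + 1] += b[i + 1]; // single 4-way subword addition on a
            a[i + 2] += b[i + 2]; // multimedia ISA such as VIS or MMX
            a[i + 3] += b[i + 3];
        }
        for (; i < n; ++i) a[i] += b[i]; // scalar epilogue
    }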
Article
Today's desktop PCs feature a variety of parallel processing units. Developing applications that exploit this parallelism is a demanding task, and a programmer has to obtain detailed knowledge about the hardware for efficient implementation. CGiS is a data-parallel programming language providing a unified abstraction for two parallel processing units: graphics processing units (GPUs) and the vector processing units of CPUs. The CGiS compiler framework fully virtualizes the differences in capability and accessibility by mapping an abstract data-parallel programming model on those targets. The applicability of CGiS for GPUs has been shown in previous work; this work focuses on applying the abstract programming model of CGiS to CPUs with SIMD (Single Instruction Multiple Data) instruction sets. We have identified, adapted and implemented a set of program analyses to expose and access the available parallelism. The code generation phase is based on selected optimization algorithms tailored to SIMD code generation. Via code generation profiles, it is possible to adapt the code generation strategy to different target architectures. To assess the effectiveness of our approach, we have implemented backends for the two most widespread SIMD instruction sets, namely Intel's Streaming SIMD Extensions and Freescale's AltiVec. Additionally, we integrated a prototypical backend for the Cell Broadband Engine as an example for a multi-core architecture. Our experimental results show excellent average performance gains by a factor of 3 compared to standard scalar C++ implementations and underline the viability of this approach: real-world applications can be implemented easily with CGiS and result in efficient code.
Conference Paper
Data-parallel languages like OpenCL and CUDA are an important means to exploit the computational power of today's computing devices. In this paper, we deal with two aspects of implementing such languages on CPUs: First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization. We evaluate our techniques in a custom OpenCL CPU driver which is compared to itself in different configurations and to proprietary implementations by AMD and Intel. We achieve an average speedup factor of 1.21 compared to naïve vectorization and additional factors of 1.15 to 2.09 for suited kernels due to the optimizations enabled by our analysis. Our best configuration achieves an average speedup factor of 2.5 against the Intel driver.
Article
SIMD vector units implement only a subset of the operations used by vectorizing compilers, and there are multiple conflicting techniques to legalize arbitrary vector types into register-sized data types. Traditionally, type legalization is performed using a set of predefined rules, regardless of the operations used in the program. This method is not suitable to sparse SIMD instruction sets and often prevents the vectorization of programs. In this work we introduce a new technique for type legalization, namely vector element promotion, as well as a hybrid method for combining multiple techniques of type legalization. Our hybrid type legalization method makes decisions based on the knowledge of the available instruction set as well as the operations used in the program. Our experimental results demonstrate that program-dependent hybrid type legalization improves the execution time of vector programs, outperforms the existing legalization method, and allows the vectorization of workloads which were not vectorized before.
Conference Paper
SIMD extensions have been a feature of choice for processor manufacturers for a couple of decades. Designed to exploit data parallelism in applications at the instruction level and provide significant accelerations, these extensions still require a high level of expertise or the use of potentially fragile compiler support or vendor-specific libraries. In this poster, we present Boost.SIMD a C++ template library that simplifies the exploitation of SIMD hardware within a standard C++ programming model.
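A sketch of the pack abstraction at the heart of Boost.SIMD; the header path and the reduction name varied across releases, so treat both as assumptions:

    #include <boost/simd/pack.hpp> // header layout differs between versions

    namespace bs = boost::simd;

    // pack<float> models one native-width SIMD register with overloaded
    // operators, so the expression compiles to SIMD instructions without
    // hand-written intrinsics.
    float dot_step(const bs::pack<float>& x, const bs::pack<float>& y) {
        bs::pack<float> p = x * y; // one SIMD multiply
        return bs::sum(p);         // horizontal add (name is an assumption)
    }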
Conference Paper
SIMD parallelism has become an increasingly important mechanism for delivering performance in modern CPUs, due to its power efficiency and relatively low cost in die area compared to other forms of parallelism. Unfortunately, languages and compilers for CPUs have not kept up with the hardware's capabilities. Existing CPU parallel programming models focus primarily on multi-core parallelism, neglecting the substantial computational capabilities that are available in CPU SIMD vector units. GPU-oriented languages like OpenCL support SIMD but lack capabilities needed to achieve maximum efficiency on CPUs and suffer from GPU-driven constraints that impair ease of use on CPUs. We have developed a compiler, the Intel® SPMD Program Compiler (ispc), that delivers very high performance on CPUs thanks to effective use of both multiple processor cores and SIMD vector units. ispc draws from GPU programming languages, which have shown that for many applications the easiest way to program SIMD units is to use a single-program, multiple-data (SPMD) model, with each instance of the program mapped to one SIMD lane. We discuss language features that make ispc easy to adopt and use productively with existing software systems and show that ispc delivers up to 35× speedups on a 4-core system and up to 240× speedups on a 40-core system for complex workloads (compared to serial C++ code).
Article
The paper describes a succinct problem-oriented programming language. The language is broad in scope, having been developed for, and applied effectively in, such diverse areas as microprogramming, switching theory, operations research, information retrieval, sorting theory, structure of compilers, search procedures, and language translation. The language permits a high degree of useful formalism. It relies heavily on a systematic extension of a small set of basic operations to vectors, matrices, and trees, and on a family of flexible selection operations controlled by logical vectors. Illustrations are drawn from a variety of applications.
Conference Paper
Data-parallel programming languages are an important component in today's parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and "general-purpose" languages like CUDA or OpenCL. Current implementations of those languages on CPUs solely rely on multi-threading to implement parallelism and ignore the additional intra-core parallelism provided by the SIMD instruction set of those processors (like Intel's SSE and the upcoming AVX or Larrabee instruction sets). In this paper, we discuss several aspects of implementing data-parallel languages on machines with SIMD instruction sets. Our main contribution is a language- and platform-independent code transformation that performs whole-function vectorization on low-level intermediate code given by a control flow graph in SSA form. We evaluate our technique in two scenarios: First, incorporated in a compiler for a domain-specific language used in real-time ray tracing. Second, in a stand-alone OpenCL driver. We observe average speedup factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for different OpenCL kernels.
Conference Paper
SIMD instructions have been common in CPUs for years now. Using these instructions effectively requires not only vectorization of code, but also modifications to the data layout. However, automatic vectorization techniques are often not powerful enough and suffer from restricted scope of applicability; hence, programmers often vectorize their programs manually by using intrinsics: compiler-known functions that directly expand to machine instructions. They significantly decrease programmer productivity by enforcing a very error-prone and hard-to-read assembly-like programming style. Furthermore, intrinsics are not portable because they are tied to a specific instruction set. In this paper, we show how a C-like language can be extended to allow for portable and efficient SIMD programming. Our extension puts the programmer in total control over where and how control-flow vectorization is triggered. We present a type system and a formal semantics of our extension and prove the soundness of the type system. Using our prototype implementation IVL that targets Intel's MIC architecture and SSE instruction set, we show that the generated code is roughly on par with handwritten intrinsic code.
Conference Paper
Coordination and synchronization of parallel tasks is a major source of complexity in parallel programming. These constructs take many forms in practice, including mutual exclusion in accesses to shared resources, termination detection of child tasks, collective barrier synchronization, and point-to-point synchronization. In this paper, we introduce phasers, a new coordination construct that unifies collective and point-to-point synchronizations. We establish two safety properties for phasers: deadlock-freedom and phase-ordering. Performance results obtained from a portable implementation of phasers on three different SMP platforms demonstrate that phasers can deliver superior performance to existing barrier implementations, in addition to the productivity benefits that result from their generality and safety properties.
Conference Paper
Modern CPUs have instructions that allow basic operations to be performed on several data elements in parallel. These instructions are called SIMD instructions, since they apply a single instruction to multiple data elements. SIMD technology was initially built into commodity processors in order to accelerate the performance of multimedia applications. SIMD instructions provide new opportunities for database engine design and implementation. We study various kinds of operations in a database context, and show how the inner loop of the operations can be accelerated using SIMD instructions. The use of SIMD instructions has two immediate performance benefits: It allows a degree of parallelism, so that many operands can be processed at once. It also often leads to the elimination of conditional branch instructions, reducing branch mispredictions. We consider the most important database operations, including sequential scans, aggregation, index operations, and joins. We present techniques for implementing these using SIMD instructions. We show that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of SIMD technology. Our study shows that using a SIMD parallelism of four, the CPU time for the new algorithms is from 10% to more than four times less than for the traditional algorithms. Superlinear speedups are obtained as a result of the elimination of branch misprediction effects.
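A sketch of the branch-elimination idea in a selection scan, written with AVX2 intrinsics (illustrative only; the paper's own code predates AVX):

    #include <immintrin.h>

    // Branch-free SIMD scan: compare eight keys per iteration and count
    // matches via a bitmask, instead of one mispredictable branch per row.
    // Requires AVX2; n is assumed to be a multiple of 8 for brevity.
    int count_less(const int* keys, int n, int pivot) {
        int count = 0;
        __m256i p = _mm256_set1_epi32(pivot);
        for (int i = 0; i < n; i += 8) {
            __m256i k = _mm256_loadu_si256((const __m256i*)(keys + i));
            __m256i m = _mm256_cmpgt_epi32(p, k); // keys[i] < pivot, per lane
            int bits = _mm256_movemask_ps(_mm256_castsi256_ps(m));
            count += _mm_popcnt_u32((unsigned)bits);
        }
        return count;
    }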
Conference Paper
Program analysis methods, especially those which support automatic vectorization, are based on the concept of interstatement dependence, where a dependence holds between two statements when one of the statements computes values needed by the other. Powerful program transformation systems that convert sequential programs to a form more suitable for vector or parallel machines have been developed using this concept [AllK 82, KKLW 80]. The dependence analysis in these systems is based on data dependence. In the presence of complex control flow, data dependence is not sufficient to transform programs because of the introduction of control dependences. A control dependence exists between two statements when the execution of one statement can prevent the execution of the other. Control dependences do not fit conveniently into dependence-based program translators. One solution is to convert all control dependences to data dependences by eliminating goto statements and introducing logical variables to control the execution of statements in the program. In this scheme, action statements are converted to IF statements. The variables in the conditional expression of an IF statement can be viewed as inputs to the statement being controlled. The result is that control dependences between statements become explicit data dependences expressed through the definitions and uses of the controlling logical variables. This paper presents a method for systematically converting control dependences to data dependences in this fashion. The algorithms presented here have been implemented in PFC, an experimental vectorizer written at Rice University.
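A minimal sketch of the conversion the paper describes, on an illustrative loop: the control dependence of the assignment on the branch becomes a data dependence through a guard variable:

    // Before: if (a[i] > 0) b[i] = 2 * a[i];   -- control dependence.
    // After if-conversion: straight-line, guarded code that a
    // dependence-based vectorizer can analyze.
    void step(const float* a, float* b, int n) {
        for (int i = 0; i < n; ++i) {
            bool g = a[i] > 0.0f;          // guard replaces the branch
            b[i] = g ? 2.0f * a[i] : b[i]; // guarded (masked) assignment
        }
    }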
Conference Paper
Vectorization has been an important method of using data-level parallelism to accelerate scientific workloads on vector machines such as Cray for the past three decades. In the last decade it has also proven useful for accelerating multi-media and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been directed at innermost loops, effectively executing their iterations concurrently as much as possible. Outer loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer loop vectorization has traditionally been performed by interchanging an outer-loop with the innermost loop, followed by vectorizing it at the innermost position. A more direct unroll-and-jam approach can be used to vectorize an outer-loop without involving loop interchange, which can be especially suitable for short SIMD architectures. In this paper we revisit the method of outer loop vectorization, paying special attention to properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves speedup factors of 3.13 and 2.77 on average across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost loop vectorization, when running on a Cell BE SPU and PowerPC970 processors respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.
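A hedged C++ sketch of direct outer-loop vectorization by unroll-and-jam, as the abstract describes it (the array sizes and the factor of 4 are illustrative, and N is assumed to be a multiple of 4):

constexpr int M = 64;  // inner trip count, illustrative

// The outer iterations over i are independent; the inner loop is a reduction.
// Unroll-and-jam advances four i-iterations together through the j loop,
// so each statement group maps onto one 4-wide SIMD operation.
void unroll_and_jam(float* out, const float (*a)[M], const float* b, int N) {
    for (int i = 0; i < N; i += 4) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int j = 0; j < M; ++j) {
            s0 += a[i + 0][j] * b[j];
            s1 += a[i + 1][j] * b[j];
            s2 += a[i + 2][j] * b[j];
            s3 += a[i + 3][j] * b[j];  // s0..s3 occupy the lanes of one vector register
        }
        out[i] = s0; out[i + 1] = s1; out[i + 2] = s2; out[i + 3] = s3;
    }
}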
Conference Paper
Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia and scientific applications. We believe SLP is fundamentally different from the loop level parallelism exploited by traditional vector processing, and therefore demands a new method of extracting it. We have developed a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. In our experiments, dynamic instruction counts were reduced by 46%. Speedups ranged from 1.24 to 6.70.
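The kind of basic block SLP targets can be shown in a short, hedged C++ example: four isomorphic scalar statements on adjacent data are packed into a single superword operation (written here with SSE intrinsics to make the packing explicit):

#include <xmmintrin.h>  // SSE

// Scalar form, as found in a basic block:
//   d[0] = a[0] + b[0];
//   d[1] = a[1] + b[1];
//   d[2] = a[2] + b[2];
//   d[3] = a[3] + b[3];

void slp_packed(const float* a, const float* b, float* d) {
    __m128 va = _mm_loadu_ps(a);            // pack four adjacent loads
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(d, _mm_add_ps(va, vb));   // one superword add replaces four scalar adds
}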
Conference Paper
A shading language provides a means to extend the shading and lighting formulae used by a rendering system. This paper discusses the design of a new shading language based on previous work of Cook and Perlin. This language has various types of shaders for light sources and surface reflectances, point and color data types, control flow constructs that support the casting of outgoing and the integration of incident light, a clearly specified interface to the rendering system using global state variables, and a host of useful built-in functions. The design issues and their impact on the implementation are also discussed.
Article
Thesis (Ph.D.), University of Minnesota, 1994. Includes bibliographical references.
Conference Paper
In this paper, we describe how to extend the concept of superword-level parallelization (SLP), used for multimedia extension architectures, so that it can be applied in the presence of control flow constructs. Superword-level parallelization involves identifying scalar instructions in a large basic block that perform the same operation, and, if dependences do not prevent it, combining them into a superword operation on a multi-word object. A key insight is that we can use techniques related to optimizations for architectures supporting predicated execution, even for multimedia ISAs that do not provide hardware predication. We derive large basic blocks with predicated instructions to which SLP can be applied. We describe how to minimize overheads for superword predicates and re-introduce control flow for scalar operations. We discuss other extensions to SLP to address common features of real multimedia codes. We present automatically-generated performance results on 8 multimedia codes to demonstrate the power of this approach. We observe speedups ranging from 1.97X to 15.07X as compared to both sequential execution and SLP alone.
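A hedged C++ sketch of predicated superword execution on a SIMD ISA without hardware predication, in the spirit of this abstract: both branch arms are computed and then merged through a comparison mask (the scalar original per element would be r = a > 0 ? a * 2 : a - 1):

#include <emmintrin.h>  // SSE2

__m128i predicated_select(__m128i a) {
    __m128i mask   = _mm_cmpgt_epi32(a, _mm_setzero_si128()); // superword predicate
    __m128i then_v = _mm_slli_epi32(a, 1);                    // a * 2
    __m128i else_v = _mm_sub_epi32(a, _mm_set1_epi32(1));     // a - 1
    return _mm_or_si128(_mm_and_si128(mask, then_v),          // keep then-arm where true
                        _mm_andnot_si128(mask, else_v));      // keep else-arm where false
}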
Article
The latest real-time graphics architectures include programmable floating-point vertex and fragment processors, with support for data-dependent control flow in the vertex processor. We present a programming language and a supporting system that are designed for programming these stream processors. The language follows the philosophy of C, in that it is a hardware-oriented, general-purpose language, rather than an application-specific shading language. The language includes a variety of facilities designed to support the key architectural features of programmable graphics processors, and is designed to support multiple generations of graphics architectures with different levels of functionality. The system supports both of the major 3D graphics APIs: OpenGL and Direct3D. This paper identifies many of the choices that we faced as we designed the system, and explains why we made the decisions that we did.
Article
The SUIF vectorizer [2] is used to generate vector operations. It performs vectorization only on parallel loops; note that reductions are currently not recognized. The vector machine model is presented in section 3.1, and the vectorization mechanism is described in detail in section 3.2. The Abstract Vector Machine contains an infinite number of infinite-length vector registers, and vector operations work on these registers. In the SUIF file, vector regions are marked with annotated io_mrk instructions, and vector operations may only occur within these regions. There are four categories of vector operations: 1. Vector Load: loads a block of memory into a vector register, setting the length of the vector register to the number of elements in the block of memory; 2. Vector Comparison: performs arithmetical comparisons on operand vector registers, and writes the output as a mask to an output vector register. The mask may be used by conditional arithmetic oper...
Article
In this paper, we present an implementation of a vectorizing C compiler for Intel's MMX (Multimedia Extension). This compiler would identify data parallel sections of the code using scalar and array dependence analysis. To enhance the scope for application of the subword semantics, our compiler performs several code transformations. These include strip mining, scalar expansion, grouping and reduction, and loop fission and distribution. Thereafter, inline assembly instructions corresponding to the data parallel sections are generated. We have used the Stanford University Intermediate Format (SUIF), a public domain compiler tool, for our implementation. We evaluated the performance of the code generated by our compiler for a number of benchmarks. Initial performance results reveal that our compiler generated code produces a reasonable performance improvement (speedup of 2 to 6.5) over the code generated without the vectorizing transformations/inline assembly. In certain cases, t...
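Strip mining, one of the transformations the abstract lists, can be sketched in hedged C++ (the width of 4 matches the four 16-bit lanes of a 64-bit MMX register; the function name is illustrative):

void strip_mined(const short* a, const short* b, short* c, int n) {
    const int W = 4;                        // SIMD width: 4 x 16-bit lanes
    int i = 0;
    for (; i + W <= n; i += W) {
        // Strip body: a vectorizer replaces these W scalar adds with one
        // packed add (e.g. PADDW) over a 64-bit register.
        for (int k = 0; k < W; ++k)
            c[i + k] = a[i + k] + b[i + k];
    }
    for (; i < n; ++i)                      // scalar epilogue for the remainder
        c[i] = a[i] + b[i];
}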
Article
The recent success of vector computers such as the Cray-1 and array processors such as those manufactured by Floating Point Systems has increased interest in making vector operations available to the FORTRAN programmer. The FORTRAN standards committee is currently considering a successor to FORTRAN 77, usually called FORTRAN 8x, that will permit the programmer to explicitly specify vector and array operations. Although FORTRAN 8x will make it convenient to specify explicit vector operations in new programs, it does little for existing code. In order to benefit from the power of vector hardware, existing programs will need to be rewritten in some language (presumably FORTRAN 8x) that permits the explicit specification of vector operations. One way to avoid a massive manual recoding effort is to provide a translator that discovers the parallelism implicit in a FORTRAN program and automatically rewrites that program in FORTRAN 8x. Such a translation from FORTRAN to FORTRAN 8x is not straightforward because FORTRAN DO loops are not always semantically equivalent to the corresponding FORTRAN 8x parallel operation. The semantic difference between these two constructs is precisely captured by the concept of dependence . A translation from FORTRAN to FORTRAN 8x preserves the semantics of the original program if it preserves the dependences in that program. The theoretical background is developed here for employing data dependence to convert FORTRAN programs to parallel form. Dependence is defined and characterized in terms of the conditions that give rise to it; accurate tests to determine dependence are presented; and transformations that use dependence to uncover additional parallelism are discussed.
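The semantic gap between a DO loop and the corresponding array operation, which this abstract identifies as the core difficulty, can be illustrated in C++ (a stand-in for the two FORTRAN forms):

// Sequential loop: each iteration reads the value written one iteration earlier,
// so a[0] propagates forward: a[1] = 2*a[0], a[2] = 4*a[0], ...
void sequential(float* a, int n) {
    for (int i = 0; i + 1 < n; ++i)
        a[i + 1] = a[i] * 2.0f;
}

// A FORTRAN 8x style array assignment A(2:N) = A(1:N-1) * 2 instead reads all
// of A(1:N-1) before any element is written, doubling each old value exactly once.
// This loop-carried dependence is precisely what dependence analysis must detect
// before such a rewrite is legal.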
Automatic SIMD code generation
  • R Leißa
The OpenGL Shading Language, language version 4
  • Khronos
The Complete HLSL Reference
  • S St-Laurent
  • E Wolfgang
The OpenCL Specification, version 1.2, 2012
  • Khronos
Vector Pascal: an array language, 2002
  • G Michaelson
  • P Cockshott