Figure 1: Clad Integrated in Interactive Environments (available under a Creative Commons Attribution 3.0 Unported license)

Source publication
Article
Full-text available
Automatic Differentiation (AD) is instrumental for science and industry. It is a tool to evaluate the derivative of a function specified through a computer program. AD's application domains range from Machine Learning to Robotics to High Energy Physics. Computing gradients with the help of AD is guaranteed to be more precise than the nume...
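For readers unfamiliar with Clad, the sketch below illustrates the usual source-transformation workflow: Clad is loaded as a Clang plugin and asked to generate the derivative of an ordinary C++ function. The function f and the compile command shown are assumptions made for this example rather than details taken from the article, so they should be checked against the Clad documentation.

    // Minimal sketch of Clad usage (assumed; verify against the Clad docs).
    // Build with Clang and the Clad plugin, e.g.:
    //   clang++ -fplugin=libclad.so -Iclad/include example.cpp
    #include "clad/Differentiator/Differentiator.h"
    #include <cstdio>

    // Hypothetical function to differentiate.
    double f(double x, double y) { return x * x * y + y; }

    int main() {
      // Clad generates df/dx at compile time via source transformation.
      auto df_dx = clad::differentiate(f, "x");
      // d/dx (x*x*y + y) = 2*x*y, so at (3, 4) this prints 24.
      std::printf("df/dx(3, 4) = %f\n", df_dx.execute(3.0, 4.0));
      return 0;
    }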

Citations

... For example, to perform auto differentiation, JAX 1) captures the forward propagation functions' IR through trace-based JIT, 2) generates the gradient functions' IR via transformation, and 3) compiles CUDA binaries using XLA [62]. Similarly, JIT-based approaches are adopted by PyTorch JIT [63], Mathematica [64], Zygote [65], CLAD [66], and Enzyme [67]. ...
Thesis
Full-text available
As deep learning models scale, their training cost has surged significantly. Due to both hardware advancements and limitations in current software stacks, the need for data efficiency has risen. Data efficiency refers to the effective hiding of data access latency and the avoidance of unnecessary data movements. Major challenges arise from the growing disparity between GPU memory bandwidth and computational throughput, imminent GPU memory capacity limitations, and inefficiencies in the PyTorch software stack, including a lack of device-specific PCIe transfer optimizations and high-level domain-specific abstractions. To effectively mitigate these data inefficiencies for deep learning training, this dissertation analyzes data inefficiency in representative deep learning training tasks, specifically in graph neural networks (GNNs) and large language models (LLMs). It then proposes novel runtime and code generation techniques to mitigate these challenges and implements these optimizations seamlessly within the PyTorch stack while maintaining strong programmability and interoperability. First, PyTorch-Direct is devised to incorporate the GPU-centric PCIe data transfer paradigm in PyTorch for GNN training. PyTorch-Direct significantly reduces CPU utilization, resulting in higher end-to-end training performance. For the input datasets and GNN architectures evaluated, PyTorch-Direct decreases the overall training time by up to 38.2%. Next, Hector intermediate representation (IR) and its code generator are proposed to introduce domain-specific high-level abstraction and systematically address memory-intensive performance challenges for relational graph neural networks (RGNNs). The performance challenges stem from RGNN's inherent memory intensiveness, the gap between the programming interface and the kernel APIs, and the high kernel optimization cost due to the kernels' coupling with layout and heterogeneity. Using a general matrix multiply (GEMM) template and a traversal template, Hector achieves up to a 43.7× speed-up in training and inference compared to state-of-the-art systems. Linear operator reordering and compact tensor materialization further achieve up to 3.8× speed-up compared to the unoptimized Hector code. Finally, in LLM training, throughput has been increasingly constrained by GPU memory capacity. To mitigate this, the SSDTrain offloading framework is designed and implemented. Since activations take most of the GPU memory, SSDTrain offloads activations to Non-Volatile Memory Express (NVMe) SSDs with a direct GPU–SSD data path and good interoperability. The evaluation shows that SSDTrain reduces the peak memory use of activations by up to 47% with negligible overhead. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles. Together, these contributions demonstrate that code generation and runtime techniques can systematically mitigate the data management bottlenecks in deep learning training, which stem from the data-intensive nature of workloads and the oversimplification inherent in the deep learning training software stack.
... Recently, several AD tools have been developed for GPU architectures, for example, Refs. [6,7,8,9,10]. Enzyme [6] performs AD of GPU kernels using an LLVM-based plugin [8] that can generate kernel gradients in CUDA or ROCm. In Ref. [6], the authors demonstrated that the AD performance on a set of benchmarks is within an order of magnitude of the performance of the source program. ...
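The excerpt above concerns GPU kernels; as a simpler CPU-side illustration of Enzyme's calling convention, the sketch below differentiates a plain C++ function through the __enzyme_autodiff entry point. The function square and the build command are assumptions for illustration only, and differentiating actual GPU kernels involves additional setup described in the Enzyme documentation.

    // Minimal CPU-side sketch of Enzyme usage (assumed; see the Enzyme docs).
    // Build with Clang and the Enzyme plugin, e.g.:
    //   clang++ -fplugin=ClangEnzyme-<version>.so -O2 example.cpp
    #include <cstdio>

    // Hypothetical function to differentiate.
    double square(double x) { return x * x; }

    // Enzyme resolves this symbol and synthesizes the gradient at compile time.
    extern double __enzyme_autodiff(void*, ...);

    int main() {
      double x = 3.0;
      // d/dx (x*x) = 2*x, so this prints 6.
      double dx = __enzyme_autodiff((void*)square, x);
      std::printf("d(square)/dx at %f = %f\n", x, dx);
      return 0;
    }
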
Preprint
Full-text available
The Hessian-vector product computation appears in many scientific applications, such as optimization and finite element modeling. Often there is a need for computing Hessian-vector products at many data points concurrently. We propose an automatic differentiation (AD) based method, CHESSFAD (Chunked HESSian using Forward-mode AD), that is designed with efficient parallel computation of Hessian and Hessian-vector products in mind. CHESSFAD computes second-order derivatives using forward mode and exposes parallelism at different levels that can be exploited on accelerators such as NVIDIA GPUs. In the CHESSFAD approach, the computation of a row of the Hessian matrix is independent of the computation of other rows. Hence, rows of the Hessian matrix can be computed concurrently. The second level of parallelism is exposed because the CHESSFAD approach partitions the computation of a Hessian row into chunks, where different chunks can be computed concurrently. CHESSFAD is implemented as a lightweight header-based C++ library that works on both CPUs and GPUs. We evaluate the performance of CHESSFAD for performing a large number of independent Hessian-vector products on a set of standard test functions and compare its performance to other existing header-based C++ libraries such as autodiff. Our results show that CHESSFAD performs better than autodiff on all these functions, with improvements ranging from 5% to 50% on average.
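CHESSFAD itself is not reproduced here, but the core idea of computing second-order derivatives in forward mode can be sketched with hyper-dual numbers: seed two independent infinitesimal parts and read the mixed coefficient off the result to obtain one Hessian entry; separate entries (or chunks of a row) are independent and can be evaluated concurrently. The HyperDual type, the test function f, and hessian_entry below are a toy illustration under those assumptions, not the CHESSFAD API.

    // Toy sketch of forward-mode second-order AD with hyper-dual numbers
    // (illustrative only; not the CHESSFAD library interface).
    #include <cstdio>

    struct HyperDual {
      double v;    // function value
      double d1;   // derivative along the first seeded direction
      double d2;   // derivative along the second seeded direction
      double d12;  // mixed second-order term: the Hessian contribution
    };

    HyperDual operator+(HyperDual a, HyperDual b) {
      return {a.v + b.v, a.d1 + b.d1, a.d2 + b.d2, a.d12 + b.d12};
    }

    HyperDual operator*(HyperDual a, HyperDual b) {
      return {a.v * b.v,
              a.d1 * b.v + a.v * b.d1,
              a.d2 * b.v + a.v * b.d2,
              a.d12 * b.v + a.d1 * b.d2 + a.d2 * b.d1 + a.v * b.d12};
    }

    // Hypothetical test function f(x, y) = x*x*y, templated so it also runs on plain doubles.
    template <typename T>
    T f(T x, T y) { return x * x * y; }

    // One Hessian entry d^2 f / (dx_i dx_j): seed direction i in d1 and direction j in d2.
    double hessian_entry(double x, double y, int i, int j) {
      HyperDual hx{x, i == 0 ? 1.0 : 0.0, j == 0 ? 1.0 : 0.0, 0.0};
      HyperDual hy{y, i == 1 ? 1.0 : 0.0, j == 1 ? 1.0 : 0.0, 0.0};
      return f(hx, hy).d12;
    }

    int main() {
      // For f = x*x*y at (3, 4) the Hessian is [[2y, 2x], [2x, 0]] = [[8, 6], [6, 0]].
      // Each entry (and, in CHESSFAD's setting, each chunk of a row) is independent,
      // so these calls could run concurrently, e.g. in separate GPU threads.
      for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
          std::printf("H[%d][%d] = %f\n", i, j, hessian_entry(3.0, 4.0, i, j));
      return 0;
    }
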
... This allows us to differentiate not only mathematical functions, but also entire code segments containing them. While there is a wide selection of AD libraries available, some of them even compatible with the GPU [13], interval arithmetic is typically not supported by these libraries. Although many popular interval libraries can perform AD of interval-valued functions on the CPU, to the authors' knowledge, the Julia language is the only open-source tool offering a straightforward way of using AD with interval-valued functions on the GPU [31], see Section 2.3 for details. ...
Article
Full-text available
Interval methods are helpful in the context of scientific computing for the reliable treatment of problems with bounded uncertainty. Most traditional interval algorithms, however, were designed for sequential execution while internally depending on processor-specific instructions for directed rounding. Nowadays, many-core processors and dedicated hardware for massively parallel data processing have become the de facto standard for high-performance computers. Interval libraries have yet to adapt to this heterogeneous computing paradigm. In this article, we investigate the parallelization of interval methods with an emphasis on modern graphics processors. Using a parameter identification scenario in combination with newly developed or enhanced GPU-based interval software, we evaluate different methods for reducing the size of large interval search domains. For the first time, algorithmic differentiation can be used with intervals on the GPU. Different versions of interval optimization algorithms are compared with respect to their functionality, run times, and energy consumption.
Article
Designing the 3D layout of interconnected systems (SPI2), which is a ubiquitous task in engineered systems, is of crucial importance. Intuitively, it can be thought of as the simultaneous placement of (typically rigid) components and subsystems, as well as the design of the routing of (typically deformable) interconnects between these components and subsystems. However, obtaining solutions that meet the design, manufacturing, and life-cycle constraints is extremely challenging due to highly complex and non-linear interactions between geometries, the multi-physics environment in which the systems participate, the intricate mix of rigid and deformable geometry as well as the difficult manufacturing and life-cycle constraints. Currently, this design task heavily relies on human interaction even though the complexity of searching the design space of most practical problems rapidly exceeds human abilities. In this work, we take advantage of high-performance hierarchical geometric representations and automatic differentiation to simultaneously optimize the packing and routing of complex engineered systems, while completely relaxing the constraints on the complexity of the solid shapes that can be handled and enable intricate yet functionally meaningful objective functions. Moreover, we show that by simultaneously optimizing the packing volume as well as the routing lengths we produce tighter packing and routing designs than by focusing on the bounding volume alone. We show that our proposed approach has a number of significant advantages and offers a highly parallelizable, more integrated solution for complex SPI2 designs, leading to faster development cycles with fewer iterations, and better system complexity management.