Tobias Grosser’s research while affiliated with University of Cambridge and other places


Publications (82)


Compressed and Parallelized Structured Tensor Algebra
  • Article

April 2025 · 8 Reads · Proceedings of the ACM on Programming Languages

Emilien Bauer · Tobias Grosser · Amir Shaikhha
Tensor algebra is a crucial component for data-intensive workloads such as machine learning and scientific computing. As the complexity of data grows, scientists often face a dilemma between highly specialized dense tensor algebra and the efficient structure-aware algorithms provided by sparse tensor algebra. In this paper, we introduce DASTAC, a framework that propagates a tensor's captured high-level structure down to low-level code generation by incorporating techniques such as automatic data layout compression, polyhedral analysis, and affine code generation. Our methodology reduces the memory footprint by automatically detecting the best data layout, benefits heavily from polyhedral optimizations, and enables further optimization and parallelization through MLIR. Through extensive experimentation, we show that DASTAC can compete with, if not significantly outperform, specialized hand-tuned implementations by experts. We observe 0.16x–44.83x and 1.37x–243.78x speed-ups for the single- and multi-threaded cases, respectively.
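The core idea of layout compression can be sketched in a few lines. The following toy (invented for illustration, not the DASTAC API) stores a lower-triangular matrix in a compressed 1-D buffer via an affine index map and runs a matrix-vector product directly on the compressed layout:

```python
# Illustrative sketch only: compress a lower-triangular matrix into a
# flat buffer using the affine-style map idx(i, j) = i*(i+1)/2 + j,
# then compute y = A @ x reading only the stored (nonzero-region) entries.

def tri_index(i, j):
    # Map (i, j) in the lower triangle to its flat buffer offset.
    return i * (i + 1) // 2 + j

def compress_lower_triangular(dense):
    n = len(dense)
    buf = [0.0] * (n * (n + 1) // 2)  # n(n+1)/2 entries instead of n^2
    for i in range(n):
        for j in range(i + 1):
            buf[tri_index(i, j)] = dense[i][j]
    return buf

def tri_matvec(buf, x):
    # y[i] = sum over j <= i of A[i][j] * x[j]
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(i + 1):
            y[i] += buf[tri_index(i, j)] * x[j]
    return y

A = [[1.0, 0.0, 0.0],
     [2.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 1.0, 1.0]
print(tri_matvec(compress_lower_triangular(A), x))  # [1.0, 5.0, 15.0]
```

Because the index map and loop bounds are affine, a polyhedral compiler can analyze and parallelize such code, which is the property the paper exploits.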




Figure 2. Organizing program abstractions as SSA-based IRs enables a modular approach for compiler construction. The above vector-matrix product in MLIR makes the use-def relationships explicit and obviates the need for intricate analyses by capturing information at the right abstraction level (e.g., directly expressing iteration types in linalg.generic).
Figure 5. Our multi-level compiler backend utilizes a host of MLIR dialects to generate efficient code, tailored to the RISC-V Snitch accelerator. The high-level information from the linalg dialect is progressively lowered to the custom Snitch ISA extensions. Our modular approach enables the partitioning of challenging tasks, such as register allocation, at a suitable level of abstraction.
Figure 6. Our multi-level backend uses a mix of SSA-based IRs to represent different levels of abstraction around the RISC-V ISA for a matrix-vector calculation. The SSA formulation of the ISA empowers the compiler to employ well-understood analyses and transformations and, when combined with regions, to encode further information control flow information (e.g., for loops) while staying close to the semantics of the ISA.
Figure 8. We compare our prototype compiler with flows using Clang and MLIR, and separately evaluate the expressivity of our MLIR backend (Section 4.1). We evaluate our multi-level compiler on representative DNN micro-kernels (grouped by computational and memory access traits) across various input shapes. FLOPs indicates the minimum cycles needed for each computation.
A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions
  • Preprint
  • File available

February 2025 · 21 Reads

High-performance micro-kernels must fully exploit today's diverse and specialized hardware to deliver peak performance to DNNs. While higher-level optimizations for DNNs are offered by numerous compilers (e.g., MLIR, TVM, OpenXLA), performance-critical micro-kernels are left to specialized code generators or handwritten assembly. Even though widely-adopted compilers (e.g., LLVM, GCC) offer tuned backends, their CPU-focused input abstraction, unstructured IR, and general-purpose best-effort design inhibit tailored code generation for innovative hardware. We think it is time to widen the classical hourglass backend and embrace progressive lowering across a diverse set of structured abstractions to bring domain-specific code generation to compiler backends. We demonstrate this concept by implementing a custom backend for a RISC-V-based accelerator with hardware loops and streaming registers, leveraging knowledge about the hardware at levels of abstraction that match its custom ISA. We use incremental register allocation over structured IRs, while dropping classical spilling heuristics, and show up to 90% FPU utilization across key DNN kernels. By breaking the backend hourglass model, we reopen the path from domain-specific abstractions to specialized hardware.
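The progressive-lowering idea can be illustrated with a toy two-pass pipeline (all names here are invented for illustration, not the paper's IR): a high-level tensor op is first rewritten into an explicit loop nest, and the loop nest is then rewritten into a pseudo-ISA with Snitch-style hardware-loop instructions, each pass working at the abstraction it understands best.

```python
# Toy sketch of progressive lowering (invented IR, not the paper's dialects).

def lower_to_loops(op):
    # High-level "matvec" op -> an explicit loop-nest description.
    assert op["name"] == "matvec"
    m, n = op["shape"]
    return {"loops": [("i", m), ("j", n)],
            "body": "y[i] += A[i][j] * x[j]"}

def lower_to_hw_loop(loop_ir):
    # Loop nest -> pseudo-assembly with hardware-loop instructions
    # wrapping the innermost body, mimicking hardware-loop ISA extensions.
    (i, m), (j, n) = loop_ir["loops"]
    return [f"hwloop {i}, 0, {m}",
            f"  hwloop {j}, 0, {n}",
            f"    fmadd  # {loop_ir['body']}",
            "  endloop",
            "endloop"]

asm = lower_to_hw_loop(lower_to_loops({"name": "matvec", "shape": (4, 8)}))
for line in asm:
    print(line)
```

Because each pass sees a structured IR rather than a flat instruction stream, information such as loop trip counts survives down to the level where hardware loops are emitted.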




Strided Difference Bound Matrices

July 2024 · 18 Reads

A wide range of symbolic analysis and optimization problems can be formalized using polyhedra. Sub-classes of polyhedra, also known as sub-polyhedral domains, are sought for their lower space and time complexity. We introduce the Strided Difference Bound Matrix (SDBM) domain, which represents a sweet spot in the context of optimizing compilers. Its expressiveness and efficient algorithms are particularly well suited to the construction of machine learning compilers. We present decision algorithms, abstract domain operators and computational complexity proofs for SDBM. We also conduct an empirical study with the MLIR compiler framework to validate the domain’s practical applicability. We characterize a sub-class of SDBMs that frequently occurs in practice, and demonstrate even faster algorithms on this sub-class.
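The flavor of such sub-polyhedral domains can be shown with plain (unstrided) difference bound matrices, a simpler domain than the paper's SDBM: a system of constraints x_j − x_i ≤ c is feasible iff its constraint graph has no negative-weight cycle, which Bellman–Ford detects in polynomial time. The sketch below is illustrative only.

```python
# Minimal sketch: feasibility of a difference-constraint system
# (x_j - x_i <= c for each (i, j, c)) via negative-cycle detection.
# Each constraint is an edge i -> j with weight c in the constraint graph.

def dbm_feasible(num_vars, constraints):
    # Implicit virtual source reaches every variable with distance 0.
    dist = [0.0] * num_vars
    for _ in range(num_vars):
        changed = False
        for i, j, c in constraints:
            if dist[i] + c < dist[j]:
                dist[j] = dist[i] + c
                changed = True
        if not changed:
            return True  # converged: dist is a satisfying assignment
    # Still relaxable after num_vars rounds => negative cycle => infeasible.
    return all(dist[i] + c >= dist[j] for i, j, c in constraints)

# x1 - x0 <= 1 and x0 - x1 <= -2 contradict each other (cycle weight -1):
print(dbm_feasible(2, [(0, 1, 1), (1, 0, -2)]))  # False
print(dbm_feasible(2, [(0, 1, 1), (1, 0, 0)]))   # True
```

The quadratic-ish cost of this check, versus the worst-case exponential cost of general Presburger reasoning, is what makes such sub-polyhedral domains attractive for compilers.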


Compressing Structured Tensor Algebra

July 2024 · 23 Reads

Tensor algebra is a crucial component for data-intensive workloads such as machine learning and scientific computing. As the complexity of data grows, scientists often face a dilemma between highly specialized dense tensor algebra and the efficient structure-aware algorithms provided by sparse tensor algebra. In this paper, we introduce DASTAC, a framework that propagates a tensor's captured high-level structure down to low-level code generation by incorporating techniques such as automatic data layout compression, polyhedral analysis, and affine code generation. Our methodology reduces the memory footprint by automatically detecting the best data layout, benefits heavily from polyhedral optimizations, and enables further optimization and parallelization through MLIR. Through extensive experimentation, we show that DASTAC achieves 1 to 2 orders of magnitude speedup over TACO, a state-of-the-art sparse tensor compiler, and StructTensor, a state-of-the-art structured tensor algebra compiler, with a significantly lower memory footprint.


Verifying Peephole Rewriting In SSA Compiler IRs

July 2024 · 14 Reads

There is an increasing need for domain-specific reasoning in modern compilers. This has fueled the use of tailored intermediate representations (IRs) based on static single assignment (SSA), like in the MLIR compiler framework. Interactive theorem provers (ITPs) provide strong guarantees for the end-to-end verification of compilers (e.g., CompCert). However, modern compilers and their IRs evolve at a rate that makes proof engineering alongside them prohibitively expensive. Nevertheless, well-scoped push-button automated verification tools such as the Alive peephole verifier for LLVM-IR gained recognition in domains where SMT solvers offer efficient (semi) decision procedures. In this paper, we aim to combine the convenience of automation with the versatility of ITPs for verifying peephole rewrites across domain-specific IRs. We formalize a core calculus for SSA-based IRs that is generic over the IR and covers so-called regions (nested scoping used by many domain-specific IRs in the MLIR ecosystem). Our mechanization in the Lean proof assistant provides a user-friendly frontend for translating MLIR syntax into our calculus. We provide scaffolding for defining and verifying peephole rewrites, offering tactics to eliminate the abstraction overhead of our SSA calculus. We prove correctness theorems about peephole rewriting, as well as two classical program transformations. To evaluate our framework, we consider three use cases from the MLIR ecosystem that cover different levels of abstractions: (1) bitvector rewrites from LLVM, (2) structured control flow, and (3) fully homomorphic encryption. We envision that our mechanization provides a foundation for formally verified rewrites on new domain-specific IRs.
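What a peephole rewrite over an SSA IR looks like can be sketched with a toy (this is an invented mini-IR in Python, not the paper's Lean mechanization): because SSA gives each value exactly one definition, a local pattern such as `mul %x, 2` can be rewritten to `shl %x, 1` without any dataflow analysis beyond the instruction itself.

```python
# Toy peephole rewrite on an invented SSA mini-IR.
# An instruction is (result_name, opcode, [operands]);
# string operands are SSA values, int operands are constants.

def peephole_mul2_to_shl(program):
    out = []
    for res, op, args in program:
        if op == "mul" and len(args) == 2 and args[1] == 2:
            # mul %x, 2  ->  shl %x, 1 (strength reduction)
            out.append((res, "shl", [args[0], 1]))
        else:
            out.append((res, op, args))
    return out

prog = [("v1", "add", ["a0", 3]),
        ("v2", "mul", ["v1", 2])]
print(peephole_mul2_to_shl(prog))
# [('v1', 'add', ['a0', 3]), ('v2', 'shl', ['v1', 1])]
```

Proving that every such local replacement preserves the semantics of the whole program, for IRs that also contain regions, is exactly the correctness theorem the paper mechanizes.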


Falcon: A Scalable Analytical Cache Model

June 2024 · 14 Reads · Proceedings of the ACM on Programming Languages

Compilers often use performance models to decide how to optimize code. This is often preferred over using hardware performance measurements, since hardware measurements can be expensive, are limited by hardware availability, and make the output of compilation non-deterministic. Analytical models, on the other hand, serve as efficient and noise-free performance indicators. Since many optimizations focus on improving memory performance, memory cache miss rate estimations can serve as an effective and noise-free performance indicator for superoptimizers, worst-case execution time analyses, manual program optimization, and many other performance-focused use cases. Existing methods to model the cache behavior of affine programs work on small programs such as those in the Polybench benchmark but do not scale to the larger programs we would like to optimize in production, which can be orders of magnitude bigger by lines of code. These analytical approaches hand off the whole program to a Presburger solver and perform expensive mathematical operations on the huge resulting formulas. We develop a scalable cache model for affine programs that splits the computation into smaller pieces that do not trigger the worst-case asymptotic behavior of these solvers. We evaluate our approach on 46 TorchVision neural networks, finding that our model has a geomean runtime of 44.9 seconds compared to over 32 minutes for the state-of-the-art prior cache model; the latter figure even understates the true gap, because the prior model hit our four-hour time limit on 54% of the networks, a limit our tool never reached. Our model exploits parallelism effectively: running it on sixteen cores is 8.2x faster than running it single-threaded.
While the state-of-the-art model takes over four hours to analyze a majority of the benchmark programs, Falcon produces results in at most 3 minutes and 3 seconds; moreover, after a local modification to the program being analyzed, our model efficiently updates the predictions in 513 ms on average (geomean). Thus, we provide the first scalable analytical cache model.


Citations (55)


... Our first step will be to explore mechanisms for combining multiple cost models, useful in cases where performance must be traded off with floating-point accuracy [16,17] in linear algebra micro-kernel compilation [13]. Combining cost models is also inevitable when the program being rewritten is expressed in terms of operations in multiple dialects, with associated cost being computed by separate cost models. ...

Reference:

eqsat: An Equality Saturation Dialect for Non-destructive Rewriting
A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions
  • Citing Conference Paper
  • March 2025

... Whilst we have focused on intrinsics in this paper, it would also be interesting to extend this work to a wider range of algorithmic patterns such as stencils. Work has already been undertaken mapping an MLIR stencil dialect to FPGAs [3], and we plan on extending this to also target AIEs. ...

A shared compilation stack for distributed-memory parallelism in stencil DSLs
  • Citing Conference Paper
  • April 2024

... For example, Dex [100] is very similar to the languages covered here. Other examples are the LIFT [101], RISE [102] and MDH [103] languages, which are similar to, albeit more restricted than, the languages discussed in this paper, but may support user-guided exploration of the optimization space [104]. Unlike the languages in this paper, which all restrict higher-order functions, Erik Holk's Harlan language [105] supports first-class procedures natively. ...

Guided Equality Saturation

Proceedings of the ACM on Programming Languages

... It implements a bottom-up enumerative algorithm, and it uses code analysis to restrict the search space of programs. mlirSynth [8] has a similar approach, but it lifts tensor programs across different MLIR dialects. In both methods, correctness is asserted using only I/O testing while STAGG performs bounded model checking to verify that the lifted programs are equivalent to their original counterpart. ...

mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR Using Program Synthesis
  • Citing Conference Paper
  • October 2023

... Compiler-driven optimizations automate transformations for stencil-based computations, including FDTD. MLIR has been applied to stencils [21,22], matrix multiplication [23], FFTs [24][25][26], and climate modeling [27]. The DaCe framework [28] provides similar optimizations for PDE solvers and scientific simulations [29,30]. ...

Stencil-HMLS: A multi-layered approach to the automatic optimisation of stencil codes on FPGA
  • Citing Conference Paper
  • November 2023

... Stencil-HMLS [31] leverages MLIR to automatically transform stencil-based codes to FPGAs. Driven by extracting stencils from existing programming languages [32] and Domain Specific Languages [33], this work operates upon the MLIR stencil dialect [34] to generate resulting code structures that are highly tuned for FPGAs and then provided to AMD Xilinx's HLS tool at the LLVM-IR level. This work demonstrates that based upon domain-specific abstractions, in this case, stencils, one is able to leverage the knowledge and expertise of the FPGA community to transform these abstract representations into an efficient dataflow form. ...

Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang
  • Citing Conference Paper
  • November 2023

... While various software tools have been developed to facilitate CIM designs, existing approaches have notable limitations. On the one hand, many tools tend to focus predominantly on specific aspects of the design flow, such as hardware simulation [13]- [16] or dataflow compilation [17]- [19], lacking the holistic view necessary for effective design space exploration. On the other hand, most of these tools are primarily designed for analog CIM, and are only later adapted to support digital implementations, often overlooking crucial characteristics of digital CIM architectures. ...

OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... An MRIP is an abstraction characterizing the relations among the source and follow-up inputs of a set of metamorphic relations; an MROP describes an abstract relation among the source and follow-up outputs. [A comparison table of prior metamorphic-testing work, flattened beyond recovery in extraction, is omitted here.] Segura et al. [160] propose six MROPs (equivalence, equality, subset, disjoint, complete, and difference) for testing RESTful web APIs (implementing create, read, update, or delete operations over a resource). In our MR catalog, we leverage some of these patterns to define output conditions; in particular, our MRs verify equality, difference, and subset (i.e., what we achieve with userCanRtrieveContent, which checks if an output is a subset of what was already observed in previous executions). ...

Metamorphic Fuzzing of C++ Libraries
  • Citing Conference Paper
  • April 2022

... By contrast, xDSL [17] is a Python-based compiler design toolkit which is 1-1 compatible with MLIR. It provides the majority of standard MLIR dialects, as well as numerous additional experimental ones, all expressed in the IRDL [6] format within Python classes. xDSL enables rapid exploration and prototyping of MLIR concepts; once these have matured and been proven, they can be contributed into the main MLIR codebase more easily. ...

IRDL: an IR definition language for SSA compilers

... But you cannot capture this closure as a value and pass it around in MLIR out of the box. The lp dialect [9] uses this feature to implement full closure support, but lp works on a type-erased representation. For example, the types of all higher-order arguments have been erased to !lp.t, a boxed heap value. ...

Lambda the Ultimate SSA: Optimizing Functional Programs in SSA
  • Citing Conference Paper
  • April 2022