Yuze Chi’s research while affiliated with University of California, Los Angeles and other places


Publications (42)


Figure 5: RapidStream IR's overall architecture, consisting of the IR (blue, §3.1), plugins (green, §3.2), and passes (red, §3.3).
Figure 10: RIR passes on the LLM accelerator example.
Figure: Frequency improvements automated with RapidStream IR for various design formats on different FPGAs.
RapidStream IR: Infrastructure for FPGA High-Level Physical Synthesis
  • Preprint
  • File available

October 2024 · 76 Reads · Yutong Xie · [...]

The increasing complexity of large-scale FPGA accelerators poses significant challenges in achieving high performance while maintaining design productivity. High-level synthesis (HLS) has been adopted as a solution, but the mismatch between the high-level description and the physical layout often leads to suboptimal operating frequency. Although existing proposals for high-level physical synthesis, which use coarse-grained design partitioning, floorplanning, and pipelining to improve frequency, have gained traction, they lack a framework enabling (1) pipelining of real-world designs at arbitrary hierarchical levels, (2) integration of HLS blocks, vendor IPs, and handcrafted RTL designs, (3) portability to emerging new target FPGA devices, and (4) extensibility for the easy implementation of new design optimization tools. We present RapidStream IR, a practical high-level physical synthesis (HLPS) infrastructure for representing the composition of complex FPGA designs and exploring physical optimizations. Our approach introduces a flexible intermediate representation (IR) that captures interconnection protocols at arbitrary hierarchical levels, coarse-grained pipelining, and spatial information, enabling the creation of reusable passes for design frequency optimizations. RapidStream IR improves the frequency of a broad set of mixed-source designs by 7% to 62%, including large language models and genomics accelerators, and is portable to user-customizable new FPGA platforms. We further demonstrate its extensibility through case studies, showcasing the ability to facilitate future research.


PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs

August 2024 · 18 Reads · 1 Citation

ACM Transactions on Reconfigurable Technology and Systems

In recent years, the adoption of FPGAs in datacenters has increased, with a growing number of users choosing High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, one notable challenge arises when scaling up designs for modern datacenter FPGAs that comprise multiple dies. The extra delays introduced by die crossings and routing congestion can significantly degrade the frequency of large designs on these FPGA boards. Due to the gap between HLS design and physical design, it is challenging for HLS programmers to analyze and identify the root causes and fix their HLS designs to achieve better timing closure. Recent efforts have aimed to address these issues by employing coarse-grained floorplanning and pipelining strategies on task-parallel HLS designs, where multiple tasks run concurrently and communicate through FIFO stream channels. However, many applications are not streaming-friendly, and many existing accelerator designs rely heavily on buffer-channel-based communication between tasks. In this work, we take a step further to support a task-parallel programming model in which tasks can communicate via both FIFO stream channels and buffer channels. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical design co-optimization. Our framework introduces a latency-insensitive buffer channel design that supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. On the frontend, we provide an easy-to-use programming model for the proposed buffer channel; on the backend, we implement efficient placement and pipelining strategies for it. To validate the effectiveness of our framework, we test it on four widely used Rodinia HLS benchmarks and two real-world accelerator designs, showing an average frequency improvement of 25%, with peak improvements of up to 89%, on AMD/Xilinx Alveo U280 boards compared to Vitis HLS baselines.



TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical Design

December 2023 · 54 Reads · 20 Citations

ACM Transactions on Reconfigurable Technology and Systems

In this paper, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allow users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning step during HLS compilation for accurate pipelining of potential critical paths. In addition, TAPA implements several optimization techniques specifically tailored for modern HBM-based FPGAs. In our experiments with a total of 43 designs, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The framework is available at https://github.com/UCLA-VAST/tapa and the core floorplan module is available at https://github.com/UCLA-VAST/AutoBridge .


RapidStream 2.0: Automated Parallel Implementation of Latency Insensitive FPGA Designs Through Partial Reconfiguration

September 2023 · 62 Reads · 7 Citations

ACM Transactions on Reconfigurable Technology and Systems

FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized, physically integrated compilation framework that takes in a latency-insensitive program in C/C++ and generates a fully placed and routed implementation. We present two approaches. The first approach (RapidStream 1.0) resolves inter-partition routing conflicts at the end, when separate partitions are stitched together. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7x reduction in compile time and up to a 1.3x increase in frequency compared to a commercial off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time by up to an order of magnitude in cases with lower performance requirements. The second approach (RapidStream 2.0) prevents routing conflicts using virtual pins. Testing on the Xilinx U280 FPGA, we observe a 5-7x compile-time reduction and a 1.3x frequency increase.




SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs

January 2023 · 20 Reads · 9 Citations

ACM Transactions on Reconfigurable Technology and Systems

Stencil computation is one of the fundamental computing patterns in many application domains such as scientific computing and image processing. While there are promising studies that accelerate stencils on FPGAs, there is no automated acceleration framework that systematically explores both spatial and temporal parallelism for iterative stencils, which can be either computation-bound or memory-bound. In this paper, we present SASA, a scalable and automatic stencil acceleration framework for modern HBM-based FPGAs. SASA takes a high-level stencil DSL and the FPGA platform as inputs, automatically determines the best spatial and temporal parallelism configuration based on our accurate analytical model, and generates the optimized FPGA design with that configuration in TAPA high-level synthesis C++, together with its corresponding host code. Compared to the state-of-the-art automatic stencil acceleration framework SODA, which exploits only temporal parallelism, SASA achieves an average speedup of 3.41× and up to 15.73× on the HBM-based Xilinx Alveo U280 FPGA board for a wide range of stencil kernels.



TARO: Automatic Optimization for Free-Running Kernels in FPGA High-Level Synthesis

October 2022 · 16 Reads · 2 Citations

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Streaming applications have become one of the key application domains for high-level synthesis (HLS) tools. For a streaming application, there is a potential to simplify the control logic by regulating each task with a stream of input and output data. This is called free-running optimization. However, it is difficult to determine when such optimization can be applied without changing the functionality of the original design. Moreover, it takes a large effort to manually apply the optimization across legacy codes. In this paper, we present the TARO framework, which automatically applies the free-running optimization to HLS-based streaming applications. TARO simplifies the control logic without degrading the clock frequency or the performance. Experiments on the Alveo U250 show an average reduction of 16% in LUTs and 45% in FFs for streaming-based systolic array designs.


Citations (26)


... If customization is needed, our synthesizable device codes can be customized via modifying high-level synthesis (HLS) codes [11,66]. Customizing the accelerator can be challenging, even with the synthesizable HLS template and helper tools [36,37]. To ease the difficulties, we provide a C/C++-based simulator integrated into the LLM codes in PyTorch to check the functionality of the accelerator. ...

Reference:

INF^2: High-Throughput Generative Inference of Large Language Models using Near-Storage Processing
TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical Design

ACM Transactions on Reconfigurable Technology and Systems

... However, on-chip memory limitations often prevent caching all necessary data, and overlapping computation and communication remains constrained. While some compilers like Merlin [77] and PASTA [35] support double buffering for single arrays, Merlin's application is limited and context-specific, and PASTA lacks parallelism. ...

PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs
  • Citing Conference Paper
  • May 2023

... Furthermore, RapidIR's success in porting designs to new FPGA architectures showcases its value in maintaining code portability and performance across evolving hardware landscapes. [GMZ22,GMZ23] while extending support to a broader range of HLS tools and design hierarchies. Challenges include interfacing with vendor tools or developing a custom placer and router using RapidWright [LK18]. ...

RapidStream 2.0: Automated Parallel Implementation of Latency Insensitive FPGA Designs Through Partial Reconfiguration

ACM Transactions on Reconfigurable Technology and Systems

... Architectural and architecture-related optimizations [23,35,55,61,68,82,105] on CPUs/GPUs are explored for accelerating scientific computing. [26][27][28] leveraged machine learning for the acceleration of scientific computing and [85][86][87] accelerated sparse linear algebra and solvers on FPGAs. Scientific computing is a major application in high performance computing and heavily relies on general-purpose platforms, but it is a new application domain for emerging PIM architectures and it is challenging because of high cost and low performance of floating-point processing. ...

Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

... Guo et al. [GLR19] designed an FPGA accelerator with co-optimization of the host program and the chaining algorithm [Li18] for accelerating long read pairwise overlapping in third-generation genome sequencing. JAC: Jacobi. Chi et al. [CCW18] proposed an automated framework that takes in a domain-specific language describing the stencil kernel [TYL23] and generates efficient HLS code. We use a 2D-Jacobi kernel from the framework's results to illustrate the challenges in designing a heterogeneous code generator. ...

SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs
  • Citing Article
  • January 2023

ACM Transactions on Reconfigurable Technology and Systems

... Cong et al. [CLL12] introduce metrics to assess the layout-friendliness of an RTL netlist, while Tatsuoka et al. [TWO15,TK18] identify source code lines leading to MUX and deMUX structures. TARO [CCL23] is an optimization technique applied to free-running dataflow kernels to minimize control signal overheads. ...

TARO: Automatic Optimization for Free-Running Kernels in FPGA High-Level Synthesis
  • Citing Article
  • October 2022

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... HBM is being used in many high-performance computing applications [38], and helps to overcome memory bandwidth hurdles that limit the implementation of many applications on FPGAs. Unfortunately, we were not able to infer HBM memory within Vitis HLS, as Xilinx seems to focus on Vitis as the preferred working environment for its newer devices and boards. ...

Serpens: a high bandwidth memory based accelerator for general-purpose sparse matrix-vector multiplication

... SkeletonGCN [14] Hardware-software co-design for efficient GCN training on FPGAs Stream-GCN [56] Optimizes for small graphs with techniques like pipelining and workload distribution Zhang et al. [18] Dense systolic array and non-linear activation module for combination phase • Hardware-Software Co-Design (HW-SW Co-Design) for FPGAs. This approach combines software and hardware techniques to optimize GCN execution. ...

StreamGCN: Accelerating Graph Convolutional Networks with Streaming Processing
  • Citing Conference Paper
  • April 2022

... Furthermore, RapidIR's success in porting designs to new FPGA architectures showcases its value in maintaining code portability and performance across evolving hardware landscapes. [GMZ22,GMZ23] while extending support to a broader range of HLS tools and design hierarchies. Challenges include interfacing with vendor tools or developing a custom placer and router using RapidWright [LK18]. ...

RapidStream: Parallel Physical Implementation of FPGA HLS Designs
  • Citing Conference Paper
  • February 2022