Tianqi Chen’s research while affiliated with Carnegie Mellon University and other places


Publications (55)


Figures: data transfer from global to shared memory for sparse/dense KV-cache in FlashInfer (left: sparse KV-cache with block size bc = 2; right: dense KV-cache; head dimension d); median Inter-Token-Latency (ITL) and median Time-To-First-Token (TTFT) of SGLang integrated with FlashInfer and Triton, compared to TensorRT-LLM; FlashInfer's head-group fusion of query heads with the query-length dimension in GQA.
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
  • Preprint
  • File available

January 2025 · 8 Reads

Zihao Ye · Lequn Chen · Ruihang Lai · [...]

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction over compiler backends on LLM serving benchmarks, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
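To make the block-sparse KV-cache idea concrete, the following is a minimal NumPy sketch of a CSR-style block layout in which requests share a pool of fixed-size KV blocks; the array names and block size are illustrative only and do not reflect FlashInfer's actual API.

```python
# A minimal sketch (illustrative, not FlashInfer's actual API) of a CSR-style
# block-sparse KV-cache: all requests share one pool of fixed-size KV blocks,
# and per-request index arrays describe which blocks each request owns.
import numpy as np

num_blocks, block_size, num_kv_heads, head_dim = 8, 2, 4, 64   # block size bc = 2

# Shared block pool: [num_blocks, 2 (K/V), block_size, num_kv_heads, head_dim]
kv_pool = np.zeros((num_blocks, 2, block_size, num_kv_heads, head_dim), dtype=np.float16)

# CSR-style metadata: request i uses blocks kv_indices[kv_indptr[i]:kv_indptr[i+1]]
kv_indptr = np.array([0, 3, 5, 8])          # three requests owning 3, 2, and 3 blocks
kv_indices = np.array([0, 2, 5, 1, 3, 4, 6, 7])
kv_last_len = np.array([2, 1, 2])           # valid entries in each request's last block

def gather_kv(request_id: int):
    """Gather one request's keys and values from the shared block pool."""
    blocks = kv_indices[kv_indptr[request_id]:kv_indptr[request_id + 1]]
    kv = kv_pool[blocks]                                  # [n_blocks, 2, bc, heads, dim]
    keys = kv[:, 0].reshape(-1, num_kv_heads, head_dim)
    values = kv[:, 1].reshape(-1, num_kv_heads, head_dim)
    seq_len = (len(blocks) - 1) * block_size + kv_last_len[request_id]
    return keys[:seq_len], values[:seq_len]

k, v = gather_kv(0)   # K/V for request 0: shape [seq_len, num_kv_heads, head_dim]
```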


Figure 1: WebLLM System Overview.
WebLLM: A High-Performance In-Browser LLM Inference Engine

December 2024 · 4 Reads

Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices have made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts out the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.
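For context, the OpenAI-style chat-completions interface that WebLLM mirrors has roughly the following request and response shape, shown here as plain Python dictionaries; WebLLM itself is a JavaScript framework, and the model id below is a placeholder rather than a specific WebLLM model identifier.

```python
# Rough shape of an OpenAI-style chat-completions exchange, the API style
# WebLLM exposes in the browser (WebLLM itself is JavaScript; the model id
# below is a placeholder, not a specific WebLLM model identifier).
request = {
    "model": "Llama-3.1-8B-Instruct",      # placeholder model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize WebGPU in one sentence."},
    ],
    "stream": True,                        # tokens can be streamed as they are generated
    "temperature": 0.7,
}

# A simplified non-streaming response follows the same OpenAI-style schema:
response = {
    "choices": [{"message": {"role": "assistant", "content": "..."}}],
    "usage": {"prompt_tokens": 0, "completion_tokens": 0},
}
```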


Figures: the three fine-grained REST APIs; per-layer prefill time and KV transfer time of Llama3.1 8B (in LLM microserving, KV transfer fully overlaps with prefill computation as long as the transfer time does not exceed the computation time).
A System for Microserving of LLMs

December 2024 · 18 Reads

The recent advances in LLMs bring a strong demand for efficient system support to improve overall serving efficiency. As LLM inference scales towards multiple GPUs and even multiple compute nodes, various coordination patterns, such as prefill-decode disaggregation and context migration, arise in serving systems. Most inference services today expose a coarse-grained request-level API with a pre-configured coordination strategy, limiting the ability to customize and dynamically reconfigure the coordination. In this paper, we propose LLM microserving, a multi-level architecture for structuring and programming LLM inference services. We introduce simple yet effective microserving APIs to support fine-grained sub-request level actions. A programmable router transforms user requests into sub-request calls, enabling the dynamic reconfiguration of serving patterns. To support diverse execution patterns, we develop a unified KV cache interface that handles various KV compute, transfer, and reuse scenarios. Our evaluation shows that LLM microserving can be reconfigured to support multiple disaggregation orchestration strategies in a few lines of Python code while maintaining state-of-the-art performance for LLM inference tasks. Additionally, it allows us to explore new strategy variants that reduce job completion time by up to 47% compared to existing strategies.
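As a rough illustration of the router idea, the sketch below shows how a programmable router might decompose a request into prefill and decode sub-requests on disaggregated engines; the endpoint names, URLs, and payload fields are hypothetical and are not the paper's actual microserving APIs.

```python
# Hypothetical sketch of a programmable router implementing prefill-decode
# disaggregation. Endpoint names, URLs, and payload fields are illustrative
# only; they are not the microserving APIs defined in the paper.
import requests

PREFILL_ENGINE = "http://prefill-engine:8000"   # placeholder addresses
DECODE_ENGINE = "http://decode-engine:8001"

def route_request(prompt: str, max_tokens: int = 128) -> str:
    # 1. Run prefill on one engine and stream the resulting KV cache to the
    #    decode engine; ideally the transfer overlaps with prefill compute.
    prefill = requests.post(f"{PREFILL_ENGINE}/prefill", json={
        "prompt": prompt,
        "kv_dst": DECODE_ENGINE,
    }).json()

    # 2. Continue generation on the decode engine from the transferred KV cache.
    decode = requests.post(f"{DECODE_ENGINE}/decode", json={
        "kv_handle": prefill["kv_handle"],
        "max_tokens": max_tokens,
    }).json()
    return decode["text"]
```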


Figures: the persistent stack organizes multiple matching stacks from the current step, together with stacks from previous steps, into a single tree, reducing memory consumption and supporting rollback to previous steps; end-to-end performance comparison between structured generation with XGrammar and unstructured generation in a browser JavaScript environment; comparison of Llama3.1 TPOT (ms) for the XGrammar engine with and without grammar constraints enabled.
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

November 2024 · 8 Reads

The applications of LLM agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing a context-free grammar requires going through several stack states over all tokens in the vocabulary at runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted at runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar can achieve up to a 100x speedup over existing solutions. Combined with an LLM inference engine, it enables near-zero-overhead structured generation in end-to-end low-latency LLM serving.
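A conceptual sketch of the vocabulary split described above follows: context-independent tokens use a mask precomputed per grammar state, context-dependent tokens are checked at runtime, and the combined mask is applied to the logits before sampling. The helper `grammar_accepts` is hypothetical and stands in for the runtime stack-based check.

```python
# Conceptual sketch of grammar-constrained decoding with a vocabulary split
# into prechecked (context-independent) and runtime-checked (context-dependent)
# tokens. `grammar_accepts` is a hypothetical stand-in for the stack-based check.
import numpy as np

def build_token_mask(state, vocab, precomputed_mask, context_dependent_ids, grammar_accepts):
    """Combine the cached per-state mask with runtime checks of context-dependent tokens."""
    mask = precomputed_mask[state].copy()            # context-independent tokens: prechecked
    for tid in context_dependent_ids:                # context-dependent tokens: check now
        mask[tid] = grammar_accepts(state, vocab[tid])
    return mask

def constrained_sample(logits, mask):
    """Mask out tokens the grammar rejects, then sample from the rest."""
    masked = np.where(mask, logits, -np.inf)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```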


Figures: pipeline parallelism for DNN training with the basic terms used in the paper; the pipeline stage partitioner performing series-parallel decompositions (black arrows indicate subproblem formulations, red arrows indicate subproblem solutions); a comparison between all-stage kFkB and per-stage kFkB schedules with different micro-batch sizes over stages, where F{i, j} and B{i, j} denote forward and backward passes for a micro-batch containing samples i and j, showing how per-stage kFkB scheduling can reduce the memory footprint; throughput vs. number of branches on 4, 8, and 16 GPUs, and throughput vs. micro-batch size on 8 GPUs.
GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

June 2024 · 11 Reads

Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN training for different micro-batches in a pipeline fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally-independent operators, resulting in reduced memory requirement and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6X. GraphPipe also reduces the search time by 9-21X compared to PipeDream and Piper.
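To illustrate the core GPP idea in isolation, the toy sketch below represents pipeline stages as a DAG and groups them into topological levels; stages within a level have no dependency on each other and can execute concurrently. The stage names are made up for illustration.

```python
# Toy illustration of graph pipeline parallelism: stages form a DAG, and
# stages in the same topological level are mutually independent, so they can
# run concurrently on different devices. Stage names are made up.
from collections import defaultdict

stage_deps = {                      # stage -> stages it depends on
    "embed": [],
    "branch_a": ["embed"],          # two computationally independent branches
    "branch_b": ["embed"],
    "merge": ["branch_a", "branch_b"],
}

def topological_levels(deps):
    """Group stages into levels; stages within one level may run concurrently."""
    level = {}
    def depth(stage):
        if stage not in level:
            level[stage] = 0 if not deps[stage] else 1 + max(depth(d) for d in deps[stage])
        return level[stage]
    for stage in deps:
        depth(stage)
    groups = defaultdict(list)
    for stage, lvl in level.items():
        groups[lvl].append(stage)
    return [groups[lvl] for lvl in sorted(groups)]

print(topological_levels(stage_deps))
# [['embed'], ['branch_a', 'branch_b'], ['merge']]
```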


Figures and tables: concurrent execution of the unbatched program in the presence of tensor-dependent control flow; time spent (ms) in various activities for DyNet and ACRoBat at batch size 64; Cortex vs. ACRoBat inference latencies (ms), noting that unlike ACRoBat, Cortex is limited to recursive computations, does not support the other models in Table 3, and places a high development burden on users by relying on manual kernel optimization; Relay VM vs. ACRoBat's AOT compilation, inference latencies (ms); model execution times (ms) after the improvements described in §7.2 for the TreeLSTM, MV-RNN, and DRNN models (DN, DN++, and AB stand for DyNet, DyNet with improvements, and ACRoBat, respectively).
ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

May 2023 · 32 Reads

Dynamic control flow is an important technique often used to design expressive and efficient deep learning computations for applications such as text parsing, machine translation, exiting early out of deep models and so on. However, the resulting control flow divergence makes batching, an important performance optimization, difficult to perform manually. In this paper, we present ACRoBat, a framework that enables efficient automatic batching for dynamic deep learning computations by performing hybrid static+dynamic compiler optimizations and end-to-end tensor code generation. ACRoBat performs up to 8.5X better than DyNet, a state-of-the-art framework for automatic batching, on an Nvidia GeForce RTX 3070 GPU.
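The following toy sketch shows the general idea behind runtime auto-batching: operator calls issued by different input instances are buffered, and identical operations are launched together as one batched kernel. It illustrates the concept only and is not ACRoBat's compiler or runtime.

```python
# Toy sketch of runtime auto-batching: calls from different input instances
# are buffered and identical ops are executed as one batched operation.
# Conceptual only; this is not ACRoBat's actual compiler or runtime.
from collections import defaultdict
import numpy as np

class BatchingRuntime:
    def __init__(self):
        self.pending = defaultdict(list)        # op signature -> list of (weight, input)

    def enqueue(self, op_name, weight, x):
        self.pending[(op_name, weight.shape)].append((weight, x))

    def flush(self):
        """Run each group of identical pending ops as a single batched op."""
        results = []
        for (_op_name, _shape), calls in self.pending.items():
            weight = calls[0][0]                     # shared weight across instances
            xs = np.stack([x for _, x in calls])     # [batch, in_dim]
            results.append(xs @ weight)              # one batched matmul
        self.pending.clear()
        return results

rt = BatchingRuntime()
W = np.random.randn(16, 16)
for _ in range(8):                 # 8 instances, possibly with divergent control flow
    rt.enqueue("dense", W, np.random.randn(16))
out = rt.flush()                   # a single [8, 16] @ [16, 16] matmul
```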



ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines

February 2023 · 9 Reads

Batching has a fundamental influence on the efficiency of deep neural network (DNN) execution. However, for dynamic DNNs, efficient batching is particularly challenging as the dataflow graph varies per input instance. As a result, state-of-the-art frameworks use heuristics that result in suboptimal batching decisions. Further, batching puts strict restrictions on memory adjacency and can lead to high data movement costs. In this paper, we provide an approach for batching dynamic DNNs based on finite state machines, which enables the automatic discovery of batching policies specialized for each DNN via reinforcement learning. Moreover, we find that memory planning that is aware of the batching policy can save significant data movement overheads, which is automated by a PQ tree-based algorithm we introduce. Experimental results show that our framework speeds up state-of-the-art frameworks by on average 1.15x, 1.39x, and 2.45x for chain-based, tree-based, and lattice-based DNNs across CPU and GPU.
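As a rough illustration of the FSM-based policy, the toy sketch below lets the current FSM state pick which node type to batch next among the ready nodes; the transition table here is invented for illustration, whereas ED-Batch learns such a policy with reinforcement learning.

```python
# Toy sketch of an FSM-driven batching policy: the current state selects the
# node type to batch next. The transition table is invented for illustration;
# ED-Batch learns such a policy with reinforcement learning.
def fsm_batching(ready_nodes, transitions, start_state="s0", max_steps=100):
    """ready_nodes: dict mapping node type -> list of ready node ids."""
    state, schedule = start_state, []
    while any(ready_nodes.values()) and max_steps > 0:
        max_steps -= 1
        node_type, next_state = transitions[state]
        batch = ready_nodes.get(node_type, [])
        if batch:                              # launch all ready nodes of this type together
            schedule.append((node_type, list(batch)))
            ready_nodes[node_type] = []
        state = next_state
    return schedule

transitions = {"s0": ("leaf", "s1"), "s1": ("internal", "s0")}
ready = {"leaf": [0, 1, 2, 3], "internal": [4, 5]}
print(fsm_batching(ready, transitions))
# [('leaf', [0, 1, 2, 3]), ('internal', [4, 5])]
```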




Citations (31)


... We evaluated FlashInfer's performance across standard LLM serving environments and innovative scenarios, including prefix sharing and speculative decoding. FlashInfer has been integrated with mainstream LLM serving engines, including vLLM (Kwon et al., 2023), MLC-Engine (MLC Community, 2024; Lai et al., 2023), and SGLang (Zheng et al., 2023b). We assessed its impact on end-to-end latency and throughput improvements, showing significant enhancements on standard LLM serving benchmarks and novel applications such as long-context inference and parallel generation. ...

Reference:

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning
  • Citing Conference Paper
  • March 2023

... First, the autotuning framework can closely interface with the DRAM-PIM code generator to search the space of host and kernel tensor programs and optimize them simultaneously. Autotuning formulates optimization as a search problem, exploring different optimized versions of a given tensor operation to identify the best-performing candidate through hardware measurements [4, 5, 12, 56]. This approach can automatically generate code optimized explicitly for target hardware, often outperforming expert-tuned libraries, thereby providing both programmability and performance benefits. ...

TensorIR: An Abstraction for Automatic Tensorized Program Optimization
  • Citing Conference Paper
  • January 2023

... Other systems-software dimensions to consider are hand-tuned kernel libraries such as oneDNN [58], cuDNN [12], or other deep learning compilers such as TensorRT [16] or IREE [77]. Collage [40] can explore varying the backend for different subgraphs of the same DNN, which can bring significant performance improvements. ...

Collage: Seamless Integration of Deep Learning Backends with Automatic Placement
  • Citing Conference Paper
  • January 2023

... The problem of client retention is solved by spotting at-risk clients and supporting proactive policies meant to keep them [19], [20], [21]. Companies can now concentrate on high-quality leads and best utilize their resources [22], [23], [24] by simplifying lead management systems which guarantees consistent and accurate lead classification [25], [26]. Ultimately, accepting a data-driven approach helps companies acquire a competitive edge, enable informed decisions, and maximize their marketing efficacy in the modern data-driven environment. ...

Automatic generation of high-performance quantized machine learning kernels
  • Citing Conference Paper
  • February 2020

... DNN overlays reintroduce the "stored program" concept into FPGAs' native dataflow execution by using instructions to control temporal datapath reuse. Many overlays envision leveraging the efficiency of datapath customization and CPU-like software programmability at the same time [12], [13], [19]. However, despite this ambition, the reality is that current DNN overlays often have more restricted inter-layer execution flexibility than fixed-function designs, as discussed in Section II-B. ...

A Hardware-Software Blueprint for Flexible Deep Learning Specialization
  • Citing Article
  • July 2019

IEEE Micro

... Deploying high-performance deep learning (DL) models on a wide range of platforms (e.g., CPU, GPU, ASIC, FPGA) has become an increasingly crucial research topic [9,15,34,39]. The general lifecycle of a DNN model consists of two stages: development (model designing and training) and deployment (model compilation and inference). ...

Leveraging the VTA-TVM Hardware-Software Stack for FPGA Acceleration of 8-bit ResNet-18 Inference
  • Citing Conference Paper
  • June 2018

... DL framework interfaces. To interface with DL frameworks and automate end-to-end compilation from input models, it is necessary to extend the graph-level TVM frontend using Relax [28] or Relay [42] IR to invoke IMTP and update the TVM runtime to manage layer and memory mapping to DRAM-PIM hardware. Integrating IMTP with application interfaces and optimizing its system-level usage is an important direction for our future work. ...

Relay: A New IR for Machine Learning Frameworks
  • Citing Conference Paper
  • June 2018

... In comparison, the types of gemms that arise in DL present some special dimensions that differ from those typically found in scientific computing, and benefit from a customized design of the micro-kernel. In addition, DL inference with quantized models requires new data types, such as 16-bit float, integer, or mixed precision, which are not supported by current LA libraries, with the notable exception of oneDNN. Recent compiler frameworks such as Halide [9], Apache TVM (Tensor Virtual Machine) [10], Exo [11], JAX [12], TensorFlow XLA [13], and Tiramisu [14] fill this niche by (semi-)automatically generating optimized code. In rough detail, these frameworks take an operation, encoded in a domain-specific language (DSL) or a high-level language such as Python, with some optional hints for the preferred operation plan. ...

TVM: End-to-End Optimization Stack for Deep Learning
  • Citing Article
  • February 2018

... Because (X_t) has π as its stationary distribution, this process has also been used to build Markov Chain Monte Carlo (MCMC) algorithms to sample from π (Chen et al., 2014). It is a special case of the general method of Ma et al. (2015, 2019) for specifying stochastic differential equations with a given stationary distribution, which they use to define a general family of samplers. In particular, Equation 1 defines an irreversible stochastic process, in contrast with more common reversible processes such as the overdamped Langevin diffusion (Roberts and Tweedie, 1996). ...

Irreversible samplers from jump and continuous Markov processes

Statistics and Computing

... Additionally, simply increasing model size inevitably introduces additional inference latency, posing challenges for real-time applications. Given the inherent generalizability of speech recognition tasks, a well-pretrained smaller model might exhibit comparable capabilities to larger ones in certain respects [10], [11]. This raises an intriguing question: Can we efficiently scale up models by reusing existing small ASR models as an optimal starting point, thereby reducing training overhead without significantly impacting the Real-Time Factor (RTF)? ...

Net2Net: Accelerating Learning via Knowledge Transfer
  • Citing Conference Paper
  • November 2016