Tianqi Chen's research while affiliated with Carnegie Mellon University and other places

Publications (43)

Preprint
Sparse tensors are rapidly becoming critical components of modern deep learning workloads. However, developing high-performance sparse operators can be difficult and tedious, and existing vendor libraries cannot satisfy the escalating demands from new operators. Sparse tensor compilers simplify the development of operators, but efficient sparse com...
Preprint
Full-text available
Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration primitives, along with the emerging machine learning models, bring tremendous engineering challenges. In this paper, we...
Preprint
Automatic optimization for tensor programs is becoming increasingly important as we deploy deep learning in various environments, and efficient optimization relies on a rich search space and effective search. Most existing efforts adopt a search space that does not allow domain experts to efficiently grow the search space. This paper intr...
Preprint
Quantization is a key technique to reduce the resource requirement and improve the performance of neural network deployment. However, different hardware backends such as x86 CPU, NVIDIA GPU, ARM CPU, and accelerators may demand different implementations for quantized networks. This diversity calls for specialized post-training quantization pipeline...
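As a rough illustration of the kind of step a post-training quantization pipeline performs (this is not the paper's pipeline; the symmetric per-tensor int8 scheme and the NumPy helpers below are illustrative assumptions):

```python
import numpy as np

def quantize_symmetric_int8(x, eps=1e-12):
    """Map a float tensor onto int8 with a single per-tensor scale."""
    scale = max(np.max(np.abs(x)), eps) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_symmetric_int8(x)
print(np.max(np.abs(x - dequantize(q, s))))  # worst-case error is roughly scale / 2
```

Backends then differ mainly in which integer instructions consume the quantized tensors, which is why the abstract argues for backend-specialized pipelines.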
Preprint
Checkpointing enables training larger models by freeing intermediate activations and recomputing them on demand. Previous checkpointing techniques are difficult to generalize to dynamic models because they statically plan recomputations offline. We present Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for heuristically checkpoin...
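A toy sketch of the kind of greedy, online eviction policy the abstract describes; the score function and bookkeeping below are simplified assumptions rather than DTR's exact heuristic:

```python
import time

class CachedTensor:
    """A cached activation plus the metadata a rematerialization heuristic needs."""
    def __init__(self, name, size_bytes, compute_cost, recompute_fn):
        self.name = name
        self.size = size_bytes            # memory freed if this tensor is evicted
        self.cost = compute_cost          # estimated time to recompute it on demand
        self.recompute = recompute_fn     # closure that re-runs the producing op
        self.last_use = time.monotonic()

def evict_until_fits(cache, bytes_needed, budget, bytes_used):
    """Greedily evict the tensor with the lowest cost / (size * staleness) score."""
    while bytes_used + bytes_needed > budget and cache:
        now = time.monotonic()
        victim = min(cache, key=lambda t: t.cost / (t.size * (now - t.last_use + 1e-9)))
        cache.remove(victim)              # a real runtime would also free the buffer
        bytes_used -= victim.size
    return bytes_used
```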
Article
Specialized Deep Learning acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical representations pose the risk of making custom hardware quickly obsolete. We propose VTA...
Preprint
Full-text available
Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressiveness, composabilit...
Article
Full-text available
In this paper, we propose irreversible versions of the Metropolis–Hastings (MH) and Metropolis-adjusted Langevin algorithm (MALA) with a main focus on the latter. For the former, we show how one can simply switch between different proposal and acceptance distributions upon rejection to obtain an irreversible jump sampler (I-Jump). The resulting alg...
Preprint
State of the art deep learning models have made steady progress in the fields of computer vision and natural language processing, at the expense of growing model sizes and computational complexity. Deploying these models on low power and mobile devices poses a challenge due to their limited compute capabilities and strict energy budgets. One soluti...
Preprint
Hardware acceleration is an enabler for ubiquitous and efficient deep learning. With hardware accelerators being introduced in datacenter and edge devices, it is time to acknowledge that hardware specialization is central to the deep learning system stack. This technical report presents the Versatile Tensor Accelerator (VTA), an open, generic, and...
Conference Paper
We present a full-stack design to accelerate deep learning inference with FPGAs. Our contribution is two-fold. At the software layer, we leverage and extend TVM, the end-to-end deep learning optimizing compiler, in order to harness FPGA-based acceleration. At the hardware layer, we present the Versatile Tensor Accelerator (VTA) which presents a...
Conference Paper
Discussion is centered around the following questions: * How do we facilitate tech transfer between academia and industry in a quickly evolving research landscape? * How do we incentivize companies and academic researchers to release more artifacts and open source projects as portable, customizable and reusable components which can be collaborative...
Conference Paper
Machine learning powers diverse services in industry including search, translation, recommendation systems, and security. The scale and importance of these models require that they be efficient, expressive, and portable across an array of heterogeneous hardware devices. These constraints are often at odds; in order to better accommodate them we pro...
Preprint
Full-text available
We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a...
Article
Scalable frameworks, such as TensorFlow, MXNet, Caffe, and PyTorch drive the current popularity and utility of deep learning. However, these frameworks are optimized for a narrow range of server-class GPUs and deploying workloads to other platforms such as mobile phones, embedded devices, and specialized accelerators (e.g., FPGAs, ASICs) requires l...
Conference Paper
We introduce techniques for rapidly transferring the information stored in one neural net into another neural net. The main purpose is to accelerate the training of a significantly larger neural net. During real-world workflows, one often trains very many different neural networks during the experimentation and design process. This is a wasteful pr...
Article
Full-text available
We propose a framework for Markov chain Monte Carlo using both continuous dynamics and jump processes that enable the development of efficient, irreversible samplers. For each component, we decompose the dynamics into reversible and irreversible processes, and with a parameterization that is easy to specify while ensuring the correct stationary dis...
Conference Paper
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted q...
Article
We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our a...
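A minimal PyTorch sketch of the same sqrt(n) trade-off using the library's built-in checkpointing utility; the layer sizes and segment count are illustrative, and the paper describes the general algorithm rather than this specific API:

```python
import math
import torch
from torch.utils.checkpoint import checkpoint_sequential

n = 64  # number of layers in a toy network
net = torch.nn.Sequential(*[torch.nn.Linear(128, 128) for _ in range(n)])
x = torch.randn(32, 128, requires_grad=True)

# Keep activations only at ~sqrt(n) segment boundaries; everything inside a
# segment is recomputed during the backward pass (one extra forward per segment).
segments = int(math.sqrt(n))
y = checkpoint_sequential(net, segments, x)
y.sum().backward()
```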
Article
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted q...
Article
Full-text available
MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on...
Article
Full-text available
We introduce techniques for rapidly transferring the information stored in one neural net into another neural net. The main purpose is to accelerate the training of a significantly larger neural net. During real-world workflows, one often trains very many different neural networks during the experimentation and design process. This is a wasteful pr...
Article
Full-text available
Many recent Markov chain Monte Carlo (MCMC) samplers leverage stochastic dynamics with state adaptation to define a Markov transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable MCMC algorithms via data subsampling and using stochastic gradients in the stochastic dynamic simulations. Howe...
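For context, the simplest member of this family is stochastic gradient Langevin dynamics; a hedged one-step sketch, where the minibatch gradient estimator and the step-size schedule are left to the caller:

```python
import numpy as np

def sgld_step(theta, grad_log_post, step_size, rng=np.random):
    """One SGLD update: theta <- theta + (eps/2) * grad log p(theta | minibatch) + N(0, eps).

    grad_log_post should return a stochastic estimate of the gradient of the log
    posterior, e.g. the log-prior gradient plus a rescaled minibatch likelihood gradient.
    """
    noise = rng.normal(scale=np.sqrt(step_size), size=np.shape(theta))
    return theta + 0.5 * step_size * grad_log_post(theta) + noise
```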
Article
In this paper we investigate the performance of different types of rectified activation functions in convolutional neural network: standard rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), parametric rectified linear unit (PReLU) and a new randomized leaky rectified linear units (RReLU). We evaluate these activation function...
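The four activation functions compared in the abstract, written out in plain NumPy; the RReLU sampling range and test-time averaging below follow common practice and are meant as an illustration rather than the paper's exact settings:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # fixed small slope for negative inputs
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # alpha is a learned parameter rather than a fixed constant
    return np.where(x > 0, x, alpha * x)

def rrelu(x, lower=1/8, upper=1/3, training=True, rng=np.random):
    # negative slope sampled uniformly at training time, averaged at test time
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2.0
    return np.where(x > 0, x, alpha * x)
```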
Article
Many tasks in data mining and related fields can be formalized as matching between objects in two heterogeneous domains, including collaborative filtering, link prediction, image tagging, and web search. Machine learning techniques, referred to as learning-to-match in this paper, have been successfully applied to the problems. Among them, a class o...
Article
Hamiltonian Monte Carlo (HMC) sampling methods provide a mechanism for defining distant proposals with high acceptance probabilities in a Metropolis-Hastings framework, enabling more efficient exploration of the state space than standard random-walk proposals. The popularity of such methods has grown significantly in recent years. However, a limita...
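As background for the mechanism this abstract refers to, a compact sketch of one standard HMC transition with a leapfrog integrator; the paper itself targets the stochastic-gradient setting, which this sketch does not cover:

```python
import numpy as np

def leapfrog(theta, r, grad_U, eps, L):
    """Simulate Hamiltonian dynamics for L leapfrog steps with potential energy U."""
    r = r - 0.5 * eps * grad_U(theta)
    for _ in range(L - 1):
        theta = theta + eps * r
        r = r - eps * grad_U(theta)
    theta = theta + eps * r
    r = r - 0.5 * eps * grad_U(theta)
    return theta, r

def hmc_step(theta, U, grad_U, eps=0.1, L=20, rng=np.random):
    """One Metropolis-corrected HMC transition; U(theta) = -log target density."""
    theta = np.asarray(theta, dtype=float)
    r0 = rng.normal(size=theta.shape)
    theta_new, r_new = leapfrog(theta, r0, grad_U, eps, L)
    h_old = U(theta) + 0.5 * np.sum(r0 ** 2)
    h_new = U(theta_new) + 0.5 * np.sum(r_new ** 2)
    return theta_new if rng.uniform() < np.exp(h_old - h_new) else theta
```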
Conference Paper
Full-text available
Collaborative filtering techniques rely on aggregated user preference data to make personalized predictions. In many cases, users are reluctant to explicitly express their preferences and many recommender systems have to infer them from implicit user behaviors, such as clicking a link in a webpage or playing a music track. The clicks and the plays...
Conference Paper
Serendipitous recommendation has benefitted both e-retailers and users. It tends to suggest items which are both unexpected and useful to users. These items are not only profitable to the retailers but also surprisingly suitable to consumers' tastes. However, due to the imbalance in observed data for popular and tail items, existing collaborative f...
Article
In this paper we introduce SVDFeature, a machine learning toolkit for feature-based collaborative filtering. SVDFeature is designed to efficiently solve the feature-based matrix factorization. The feature-based setting allows us to build factorization models incorporating side information such as temporal dynamics, neighborhood relationship, and hi...
Conference Paper
Transfer learning as a new machine learning paradigm has gained increasing attention lately. In situations where the training data in a target domain are not sufficient to learn predictive models effectively, transfer learning leverages auxiliary source data from related domains for learning. While most of the existing works in this area are only f...
Article
Digital music has experienced a fascinating transformation during the past decades. Thousands of people share or distribute their music collections on the Internet, resulting in an explosive increase in information and greater user dependence on automatic recommender systems. Though there are many techniques such as collaborative filtering, most...
Article
Recently, recommender systems have fascinated researchers and benefited a variety of people's online activities, helping users cope with the explosion of web information. Traditional collaborative filtering techniques handle general recommendation well. However, most such approaches usually focus on long-term preferences. To discover more short...
Conference Paper
Full-text available
Social networks have become more and more popular in recent years. This popularity creates a need for personalization services to recommend tweets, posts (information) and celebrities/organizations (information sources) to users according to their potential interest. Tencent Weibo (microblog) data in KDD Cup 2012 brings one such challenge to the r...
Article
Twitter has rapidly grown into a popular social network in recent years and provides a large number of real-time messages for users. Tweets are presented in chronological order and users scan their followees' timelines to find what they are interested in. However, an information overload problem has troubled many users, especially those with many follo...
Conference Paper
In this paper, we describe a feature-based informative model for the second track of this year's KDD Cup Challenge. The goal is to discriminate songs rated highly by the user from ones never rated by him/her. The informative model is used to incorporate different kinds of information, such as taxonomy of items, item neighborhoods, user-specific feat...
Conference Paper
Full-text available
In this paper, we address the problem of extracting technical terms automatically from an unannotated corpus. We introduce a technology term tagger, based on Liblinear Support Vector Machines, that employs linguistic features including Part of Speech tags and Dependency Structures, in addition to user feedback, to perform the task of identifi...
Article
In this paper, we describe our solutions to the first track of the CAMRa2011 challenge. The goal of this track is to generate a movie ranking list for each household. To achieve this goal, we propose to use ranking-oriented matrix factorization and matrix factorization with negative example sampling. We also adopt feature-based matrix factoriz...
Article
Full-text available
Recommender systems have become more and more popular and are widely used in many applications. The increasing information available, not only in quantity but also in variety, poses a big challenge for recommender systems: how to leverage this rich information to achieve better performance. Most traditional approaches try to design a specific...
Article
Full-text available
The Yahoo! music rating data set in KDD-Cup 2011 raises a number of interesting challenges: (1) It contains a significantly larger number of users/items than all existing benchmark data sets. (2) The data covers a lengthy time period of more than 8 years. (3) Both date and time information are available for both training and test data. (4) The items...
Conference Paper
Recently, Behavioral Targeting (BT) has been attracting much attention from both industry and academia due to its rapid growth in the online advertising market. Though a basic assumption of BT, namely that users who share similar Web browsing behaviors will have similar preferences over ads, has been empirically verified, we argue that the users' ad click p...

Citations

... As highlighted in Section 6.3.2, we observed that pruned models saw a lower relative speedup when tuned. A more specialized sparse compiler, such as the emerging SparseTIR system [89], could reduce the impact of these overheads. ...
... Many domain-specific languages (DSLs) leverage the idea of decoupling optimizations from algorithms [2,6,22,42,52], allowing users to focus on customization and enabling compilers to perform complex optimizations. TVM [6,14] inherits the decoupling idea from Halide [42] and builds an end-to-end compiler for deep learning inference. Tiramisu [2] further extends the scope of scheduling to polyhedral compilation. ...
... However, the computation complexity is O(N × M), meaning that this method is only suitable for ultra-low precision, typically fewer than 3 bits. For this precision range, the performance over fp32 can be multiplied by a factor of 2 to 6 depending on the precision [18]. This paved the way for the development of specialized bit-serial hardware [19], [20]. ...
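The O(N × M) cost in this snippet comes from decomposing an N-bit by M-bit product into pairwise bit-plane AND-and-popcount operations; a tiny unsigned sketch of that decomposition (packing bit-planes into Python ints is an illustrative choice, not how the cited hardware works):

```python
def bitserial_dot(a_planes, b_planes):
    """Dot product of two unsigned vectors given as bit-planes (LSB first),
    where each plane is an int whose bits hold one bit of every element.
    Cost is one AND + popcount per (i, j) plane pair, i.e. O(N * M) operations."""
    total = 0
    for i, a_i in enumerate(a_planes):
        for j, b_j in enumerate(b_planes):
            total += bin(a_i & b_j).count("1") << (i + j)
    return total

# Example: a = [3, 1], b = [2, 3]  ->  3*2 + 1*3 = 9
a_planes = [0b11, 0b01]   # plane 0 holds bit 0 of each element, plane 1 holds bit 1
b_planes = [0b10, 0b11]
assert bitserial_dot(a_planes, b_planes) == 9
```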
... VTA supports TensorFlow, PyTorch, and ONNX among other ML frameworks. VTA is compatible with various hardware platforms, including FPGAs and ASICs [23]. ...
... The residual building block [29] (RBB) is the basic structure in ResNet-18 [30]. An RBB uses shortcuts to skip convolutional blocks [31], which helps optimize the trainable parameters during backpropagation and avoids exploding [32] or vanishing [33] gradients. ...
... Most of the previously proposed performance models are able to parse the given input DL model from a single DL framework, not from several, as we already discussed in Sect. 2. To enable the use of multiple frameworks, we used Relay, a high-level IR for DL models [17]. It has been used to compile DL models for inference in the TVM framework. ...
... LLVM front-end frameworks. TVM (Chen et al. 2018a) is a tool capable of compiling machine learning models from different popular frameworks and generating specific low-level optimized code for a diverse set of hardware back-ends. The workflow of TVM consists of first translating the inference model imported from ML frameworks into a high-level intermediate representation called Relay, performing a set of high-level and low-level optimizations, and finally generating code for different compiler back-ends, including LLVM. ...
... Historically, Markov chain Monte Carlo has been based on reversible Markov kernels such as the Metropolis-Hastings kernel (Hastings, 1970) and special cases and variations thereof (e.g., Brooks et al., 2011), since it is straightforward to ensure that these target π. However, there has been much recent interest in nonreversible kernels (e.g., Bouchard-Côté et al., 2018; Fearnhead et al., 2018), which have the potential, both in practice and in theory, to be more efficient than their reversible counterparts (Neal, 1998; Diaconis et al., 2000; Bierkens, 2015; Ma et al., 2018). A particular continuous-time nonreversible algorithm, the bouncy particle sampler (Peters & de With, 2012; Bouchard-Côté et al., 2018), and variations such as the coordinate sampler (Wu & Robert, 2020), the discrete bouncy particle sampler (Sherlock & Thiery, 2022) and others (e.g., Vanetti et al., 2018), use occasional reflections of a velocity in the hyperplane perpendicular to the current gradient to eliminate, for continuous-time versions, or substantially reduce, for discrete-time versions, rejections of proposed moves. ...
... The general idea of creating new architectures by modifying existing ones and reusing the weights has been explored in NAS approaches [10,20] relying on network morphisms [6]. Network morphisms are operators that change the architecture of a neural network without influencing its functionality. ...
... In Eq. (8), where T is the number of leaves and w_k is a vector of leaf weights, the regularized objective function L(ϕ) is minimized to produce regression trees. The computation of the ideal leaf weights w*_j using the loss function L is also shown in Eq. (9), where I_j is the collection of sample indices for the j-th leaf (Bakouregui, Mohamed, Yahia, & Benmokrane, 2021; Chen & Guestrin, 2016). ...
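For reference, the regularized objective and the closed-form leaf weights this snippet alludes to take the standard XGBoost form; the equations below follow Chen & Guestrin (2016) and are not tied to the cited Eq. (8) and (9) numbering:

```latex
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}

w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
```

Here g_i and h_i are the first- and second-order gradients of the loss at the current prediction, T is the number of leaves, and I_j is the set of samples routed to leaf j.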