Conference Paper

Scrooge: A Cost-Effective Deep Learning Inference System

... It is imperative to deploy these models cost-effectively while maintaining system performance and scalability. [Flattened residue of a feature-comparison table of inference serving systems, covering InferLine [25], GPULet [22], Llama [61], FA2 [59], Model Switch [75], Scrooge [41], Nexus [64], Cocktail [36], and InfAdapter [63]; see Table 1 of the citing work.] ...
... Unlike individually optimizing each stage, optimizing end-to-end ML inference pipelines captures the correlation between configuration changes across multiple pipeline steps. Previous works [25,41,47,61] have proposed solutions for efficient autoscaling, batching, and pipeline scheduling that address not only these challenges but also the dynamic nature of ML workloads. However, none of these approaches considers the combined optimization of accuracy and resource allocation across pipelines. ...
... Table 1 presents an overview of related inference serving works. Inference pipeline serving systems have often overlooked the presence of multiple model variants for each inference task [25,41,47,59,61]. The heterogeneity of these model variants presents an opportunity not only to configure the pipeline to meet latency objectives but also to opportunistically select the most suitable model variant to enhance the accuracy of the pipeline output. ...
Article
Full-text available
Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase. The code and data for replication are available at this https URL.
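To make the configuration space described in this abstract concrete, here is a minimal Python sketch of the kind of joint choice it describes: picking a model variant, batch size, and replica count per pipeline stage under a latency SLA and an accuracy/cost objective. The stage names, profile numbers, additive latency model, and brute-force search are illustrative assumptions only; IPA itself formulates the decision as an Integer Program.

```python
from itertools import product

# Hypothetical profiled variants per stage:
# (name, accuracy, latency_ms_per_batch(b), cost_per_replica). Numbers are invented.
PROFILES = {
    "detector": [("yolo-tiny", 0.62, lambda b: 12 + 3 * b, 1.0),
                 ("yolo-large", 0.78, lambda b: 40 + 9 * b, 3.0)],
    "classifier": [("resnet18", 0.70, lambda b: 8 + 2 * b, 1.0),
                   ("resnet152", 0.82, lambda b: 30 + 7 * b, 3.0)],
}
BATCHES = [1, 2, 4, 8]
REPLICAS = [1, 2, 4]

def plan(rate_rps, latency_slo_ms, accuracy_weight=1.0, cost_weight=0.2):
    """Pick (variant, batch, replicas) per stage maximizing a weighted
    accuracy-minus-cost score while meeting the end-to-end latency SLO
    and sustaining the offered request rate (toy exhaustive search)."""
    stages = list(PROFILES)
    options = [list(product(PROFILES[s], BATCHES, REPLICAS)) for s in stages]
    best, best_score = None, float("-inf")
    for combo in product(*options):
        latency = sum(variant[2](b) for variant, b, _ in combo)          # additive pipeline latency
        throughput = min(r * b / (variant[2](b) / 1000.0)                 # slowest stage bounds the pipeline
                         for variant, b, r in combo)
        if latency > latency_slo_ms or throughput < rate_rps:
            continue
        accuracy = sum(variant[1] for variant, _, _ in combo) / len(combo)  # crude mean, for illustration
        cost = sum(variant[3] * r for variant, _, r in combo)
        score = accuracy_weight * accuracy - cost_weight * cost
        if score > best_score:
            best, best_score = combo, score
    return best

if __name__ == "__main__":
    choice = plan(rate_rps=50, latency_slo_ms=150)
    for (variant, batch, replicas), stage in zip(choice, PROFILES):
        print(stage, variant[0], "batch", batch, "replicas", replicas)
```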
... Automatic resource allocation is a complex problem that requires careful consideration and has been extensively studied in various domains, including stream processing [27,31,46], serverless computing [52,66], and microservices [30,33,76,77]. Static auto-configuration of hardware resources [68], dynamic rightsizing of resources through autoscaling [73], and maximizing utilization with batching [13] are some of the techniques that have been used for resource management of ML models. ...
... In contrast to individually optimizing each stage, optimizing ML inference pipelines end-to-end captures the correlation between configuration changes across multiple pipeline steps. Previous works [25,41,47,61] have proposed solutions for efficient autoscaling, batching, and pipeline scheduling that address not only these challenges but also the dynamic nature of ML workloads. Nevertheless, none of the above approaches considers the combined optimization of accuracy and resource allocation across pipelines. ...
Preprint
Full-text available
Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in ML production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of accuracy and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling accuracy and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep-learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency SLAs using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves normalized accuracy by up to 35% with a minimal cost increase of less than 5%.
... In real-world scenarios, inference applications often stitch together multiple models and operations into a workflow. Table 1 presents several representative inference applications from recent research [4,11,15,18,42]. As a concrete example, in a traffic monitoring application [42] that analyzes pedestrian and vehicle traffic (Fig. 1), video frames are first decoded and preprocessed. ...
... ML inference applications are increasingly prevalent in daily life [4,15,18,41,42,52]. To efficiently deploy inference applications, many studies [38,48,49,53] have developed GPU-enabled serverless inference systems, allowing users to publish ML models as functions that scale resources on demand with workload fluctuations. ...
Preprint
Serverless computing has gained significant traction for machine learning inference applications, which are often deployed as serverless workflows consisting of multiple CPU and GPU functions with data dependency. However, existing data-passing solutions for serverless computing primarily rely on host memory for fast data transfer, mandating substantial data movement and resulting in salient I/O overhead. In this paper, we present FaaSTube, a GPU-efficient data passing system for serverless inference. FaaSTube manages intermediate data within a GPU memory pool to facilitate direct data exchange between GPU functions. It enables fine-grained bandwidth sharing over PCIe and NVLink, minimizing data-passing latency for both host-to-GPU and GPU-to-GPU while providing performance isolation between functions. Additionally, FaaSTube implements an elastic GPU memory pool that dynamically scales to accommodate varying data-passing demands. Evaluations on real-world applications show that FaaSTube reduces end-to-end latency by up to 90% and achieves up to 12x higher throughput compared to the state-of-the-art.
... To improve the utilization of GPU resources, temporal sharing [9] and spatial sharing [10] are two common GPU resource multiplexing techniques. Many existing works (e.g., Cocktail [11], Clockwork [12]) leverage temporal sharing of GPUs to optimize the DNN inference performance and reduce the monetary cost. ...
... In the scenario of spatial sharing of GPUs, Scrooge [10] leverages CUDA streams and batching to pack DNN inference onto VMs while ensuring the performance SLOs of media applications. Using A100 GPUs with the latest multi-instance GPU (MIG) feature [41], MIG-serving [39] optimizes a set of GPU partitions and DNN inference deployments to meet performance SLOs. ...
Preprint
GPUs are essential to accelerating the latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as motivated by an empirical measurement study of DNN inference on EC2 GPU instances. While existing works on guaranteeing inference performance service level objectives (SLOs) focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques, how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter is comprised of two key components: (1) a lightweight DNN inference performance model, which leverages the system and workload metrics that are practically accessible to capture the performance interference; (2) A cost-efficient GPU resource provisioning strategy that jointly optimizes the GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while saving the monetary cost by up to 25% in comparison to the state-of-the-art GPU resource provisioning strategies.
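As a rough illustration of the interference-aware performance modeling the iGniter abstract describes, the sketch below inflates a model's solo latency by a term proportional to its co-located neighbors' memory pressure and checks the result against an SLO. The Workload fields, the coefficient, and all numbers are invented for illustration; iGniter fits its model from measured system and workload metrics rather than using this formula.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    solo_latency_ms: float   # latency when running alone on the GPU
    gpu_share: float         # fraction of SMs allocated to this workload
    mem_pressure: float      # normalized memory-bandwidth demand (0..1)

def predicted_latency(target: Workload, colocated: list[Workload],
                      interference_coeff: float = 0.6) -> float:
    """Inflate the solo latency by a term proportional to the neighbors'
    combined memory pressure, scaled by how little of the GPU the target owns."""
    neighbor_pressure = sum(w.mem_pressure for w in colocated)
    slowdown = 1.0 + interference_coeff * neighbor_pressure * (1.0 - target.gpu_share)
    return target.solo_latency_ms * slowdown

def fits_slo(target: Workload, colocated: list[Workload], slo_ms: float) -> bool:
    return predicted_latency(target, colocated) <= slo_ms

if __name__ == "__main__":
    resnet = Workload("resnet50", solo_latency_ms=18.0, gpu_share=0.5, mem_pressure=0.4)
    bert = Workload("bert-base", solo_latency_ms=32.0, gpu_share=0.5, mem_pressure=0.6)
    print(round(predicted_latency(resnet, [bert]), 2), "ms with a co-located neighbor")
    print("meets 30 ms SLO:", fits_slo(resnet, [bert], slo_ms=30.0))
```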
... Cloud platforms provide a wide range of virtual machines (VMs), each with different hardware (e.g., different CPUs, GPUs, and memory). While there have been previous attempts at providing partial solutions to exploit hardware heterogeneity in datacenters [10][11][12], at the edge [13,14], and in the cloud [15][16][17], we lack a complete solution that achieves all the desirable properties (Sec. 2). ...
... INFaaS [46] optimally selects a particular hardware type from heterogeneous devices depending on the user application, but unlike Kairos, it does not serve the model using different hardware simultaneously. Media application frameworks such as Llama [45] and Scrooge [15] allocate different hardware to different stages of media application inference, but each query is assigned to the same sequence of hardware types; they do not distribute queries across heterogeneous resources as Kairos does and are not suitable for general-purpose applications. ...
Preprint
Full-text available
Online inference is becoming a key service product for many businesses, deployed in cloud platforms to meet customer demands. Despite their revenue-generation capability, these services need to operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces KAIROS, a novel runtime framework that maximizes the query throughput while meeting a QoS target and a cost budget. KAIROS designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead, and to distribute inference queries optimally at runtime. Our evaluation using industry-grade deep learning (DL) models shows that KAIROS yields up to 2X the throughput of an optimal homogeneous solution and outperforms state-of-the-art schemes by up to 70%, even when the competing schemes are given the advantage of ignoring their exploration overhead.
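The following toy sketch illustrates the general idea of serving from a heterogeneous hardware pool under a cost budget, as in the KAIROS abstract: build a pool greedily by throughput-per-dollar, then weight query dispatch by each instance's sustainable throughput at the QoS target. The instance types, prices, and throughputs are made up, and the greedy heuristic is a stand-in for KAIROS's actual optimization, not a description of it.

```python
# (instance type, $/hour, queries/sec it sustains within the QoS target) -- invented numbers
CANDIDATES = [
    ("g4dn.xlarge", 0.53, 140.0),
    ("g5.xlarge",   1.01, 260.0),
    ("c6i.2xlarge", 0.34,  45.0),
]

def build_pool(budget_per_hour: float):
    """Greedily add copies of the most cost-efficient instance that still fits the budget."""
    pool, remaining = [], budget_per_hour
    ranked = sorted(CANDIDATES, key=lambda c: c[2] / c[1], reverse=True)
    for name, price, qps in ranked:
        while price <= remaining:
            pool.append((name, price, qps))
            remaining -= price
    return pool

def routing_weights(pool):
    """Dispatch queries in proportion to each instance's sustainable throughput."""
    total = sum(qps for _, _, qps in pool)
    return [(name, qps / total) for name, _, qps in pool]

if __name__ == "__main__":
    pool = build_pool(budget_per_hour=3.0)
    for name, weight in routing_weights(pool):
        print(f"{name}: route {weight:.1%} of queries")
```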
... In order to partially move the ML workflow to the edge devices, besides being able to break it into modules or operators as discussed above, another necessary condition is a serving system that supports operator-level parallelism on heterogeneous infrastructures. Prior ML systems focused on data, model (i.e., breaking large DNNs) and operator (i.e., breaking workflows) parallelism on homogeneous infrastructures [20,41,45,50], or on heterogeneous workers within a datacenter [14,26,56]. We demonstrate a qualitative comparison in Table 1. ...
... Nexus [58] automatically chooses the optimal batch size and the number of GPUs to use according to the request rate and latency SLO for a given model. Model DAGs are also considered in other works [4,19,26,27,55,56]. Rim [27] considers serving ML workflows at a cluster of edge GPUs. ...
Preprint
With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not take into account the more complex and tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. On the other hand, recent machine learning efforts have provided viable solutions for model compression, pruning and quantization for heterogeneous environments; for a machine learning model, now we may easily find or even generate a series of models with different tradeoffs between accuracy and efficiency. We design and implement JellyBean, a framework for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean automatically selects the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of infrastructures. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58%, and vehicle tracking from the NVIDIA AI City Challenge by up to 36% compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) by up to 5x in serving costs.
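A toy sketch of the accuracy-constrained model selection and tier placement the JellyBean abstract describes: filter variants by an accuracy target, then place the cheapest feasible variant on the cheapest tier with spare capacity. All variant numbers, tier prices, capacities, and the greedy placement are illustrative assumptions rather than JellyBean's joint optimizer.

```python
VARIANTS = {
    # stage -> [(variant, accuracy, compute_units_needed, cost_per_hour)] -- invented numbers
    "vqa":      [("vilbert-small", 0.61, 2, 0.8), ("vilbert-large", 0.70, 6, 2.4)],
    "tracking": [("yolov5n", 0.55, 1, 0.3), ("yolov5x", 0.68, 4, 1.5)],
}
# (tier, compute capacity, $/compute-unit-hour), listed cheapest first -- invented numbers
TIERS = [("edge-device", 2, 0.0), ("edge-dc", 8, 0.5), ("cloud", 64, 1.0)]

def plan(accuracy_target: float):
    capacity = {name: cap for name, cap, _ in TIERS}
    placement = {}
    for stage, variants in VARIANTS.items():
        ok = [v for v in variants if v[1] >= accuracy_target]
        if not ok:
            raise ValueError(f"no variant of {stage} meets accuracy {accuracy_target}")
        name, _acc, units, model_cost = min(ok, key=lambda v: v[3])   # cheapest accurate-enough variant
        for tier, _cap, unit_price in TIERS:                          # cheapest tier with room
            if capacity[tier] >= units:
                capacity[tier] -= units
                placement[stage] = (name, tier, model_cost + units * unit_price)
                break
        else:
            raise RuntimeError(f"no tier has capacity for {stage}")
    return placement

if __name__ == "__main__":
    for stage, (variant, tier, cost) in plan(accuracy_target=0.6).items():
        print(f"{stage}: {variant} on {tier} (~${cost:.2f}/h)")
```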
... InferLine (Crankshaw et al., 2020) minimizes the cost of inference serving by scaling hardware in response to changes in demand. Much prior work has targeted video analytics pipelines specifically, such as VideoStorm (Zhang et al., 2017), Scrooge (Hu et al., 2021), Llama (Romero et al., 2021b), and Nexus (Shen et al., 2019). DiffServe differs in its focus on cascading models within the pipeline, using a confidence-based decision process to dynamically switch between lightweight and heavyweight models, balancing both computational efficiency and response quality. ...
Preprint
Full-text available
Text-to-image generation using diffusion models has gained increasing popularity due to their ability to produce high-quality, realistic images based on text prompts. However, efficiently serving these models is challenging due to their computation-intensive nature and the variation in query demands. In this paper, we aim to address both problems simultaneously through query-aware model scaling. The core idea is to construct model cascades so that easy queries can be processed by more lightweight diffusion models without compromising image generation quality. Based on this concept, we develop an end-to-end text-to-image diffusion model serving system, DiffServe, which automatically constructs model cascades from available diffusion model variants and allocates resources dynamically in response to demand fluctuations. Our empirical evaluations demonstrate that DiffServe achieves up to 24% improvement in response quality while maintaining 19-70% lower latency violation rates compared to state-of-the-art model serving systems.
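The confidence-based cascading that the DiffServe abstract describes can be illustrated in a few lines of Python: serve the lightweight model first and escalate only when its confidence falls below a threshold. The model callables and the threshold below are placeholders; DiffServe additionally constructs cascades automatically and reallocates resources as demand fluctuates.

```python
from typing import Callable, Tuple

def cascade(query,
            light_model: Callable[[object], Tuple[object, float]],
            heavy_model: Callable[[object], Tuple[object, float]],
            confidence_threshold: float = 0.8):
    """Answer with the cheap model when it is confident; escalate otherwise."""
    result, confidence = light_model(query)
    if confidence >= confidence_threshold:
        return result, "light"
    result, _ = heavy_model(query)   # only hard queries pay the heavyweight cost
    return result, "heavy"

if __name__ == "__main__":
    # Stand-in models returning (output, confidence).
    light = lambda q: (f"draft image for '{q}'", 0.65 if "detailed" in q else 0.9)
    heavy = lambda q: (f"high-quality image for '{q}'", 0.97)
    print(cascade("a cat", light, heavy))
    print(cascade("a detailed city at night", light, heavy))
```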
... Kairos [28] is one such system that aims to minimize the cost of inference serving using heterogeneous cloud resources. MArk [46] and Scrooge [21] also try to minimize the cost of inference serving while trying to meet latency SLOs. iGniter [43] is an interference-aware inference serving system that minimizes serving cost. ...
Preprint
Full-text available
The rapid adoption of machine learning (ML) has underscored the importance of serving ML models with high throughput and resource efficiency. Traditional approaches to managing increasing query demands have predominantly focused on hardware scaling, which involves increasing server count or computing power. However, this strategy can often be impractical due to limitations in the available budget or compute resources. As an alternative, accuracy scaling offers a promising solution by adjusting the accuracy of ML models to accommodate fluctuating query demands. Yet, existing accuracy scaling techniques target independent ML models and tend to underperform while managing inference pipelines. Furthermore, they lack integration with hardware scaling, leading to potential resource inefficiencies during low-demand periods. To address the limitations, this paper introduces Loki, a system designed for serving inference pipelines effectively with both hardware and accuracy scaling. Loki incorporates an innovative theoretical framework for optimal resource allocation and an effective query routing algorithm, aimed at improving system accuracy and minimizing latency deadline violations. Our empirical evaluation demonstrates that through accuracy scaling, the effective capacity of a fixed-size cluster can be enhanced by more than 2.7x compared to relying solely on hardware scaling. When compared with state-of-the-art inference-serving systems, Loki achieves up to a 10x reduction in Service Level Objective (SLO) violations, with minimal compromises on accuracy and while fulfilling throughput demands.
... Autoscaling in inference serving. Autoscaling in inference serving has been extensively studied [10,20,30,37]. Kubernetes VPA [5] and HPA [4] use threshold-based metrics such as CPU or memory usage to change computing resources or the number of instances of DL-based inference services. ...
... Our scheduler's statistics and autoscaling advice can be good signals to these cluster management tools. Llama [25] and Scrooge [12] focus more on complex query pipelines whereas we focus on the batching efficiency of individual models. These systems can adopt our techniques to further improve the efficiency of individual models in a pipeline. ...
Preprint
The orchestration of deep neural network (DNN) model inference on GPU clusters presents two significant challenges: achieving high accelerator efficiency given the batching properties of model inference while meeting latency service level objectives (SLOs), and adapting to workload changes both in terms of short-term fluctuations and long-term resource allocation. To address these challenges, we propose Symphony, a centralized scheduling system that can scale to millions of requests per second and coordinate tens of thousands of GPUs. Our system utilizes a non-work-conserving scheduling algorithm capable of achieving high batch efficiency while also enabling robust autoscaling. Additionally, we developed an epoch-scale algorithm that allocates models to sub-clusters based on the compute and memory needs of the models. Through extensive experiments, we demonstrate that Symphony achieves up to 4.7x higher goodput than prior systems.
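A minimal sketch of the non-work-conserving batching idea mentioned in the Symphony abstract: hold queued requests until either the batch is full or waiting any longer would push the oldest request past its SLO. The latency profile, SLO, and scheduling tick below are assumptions for illustration, not Symphony's actual algorithm.

```python
from collections import deque

SLO_MS = 100.0
MAX_BATCH = 8
BATCH_LATENCY_MS = {1: 12, 2: 14, 4: 18, 8: 26}   # assumed profiled execution times

def should_dispatch(queue: deque, now_ms: float) -> bool:
    """Dispatch when the batch is full or when further waiting would risk the
    oldest request's SLO; otherwise keep waiting to grow the batch."""
    if not queue:
        return False
    if len(queue) >= MAX_BATCH:
        return True
    oldest_arrival_ms = queue[0]
    batch_size = min(len(queue), MAX_BATCH)
    exec_ms = BATCH_LATENCY_MS[max(b for b in BATCH_LATENCY_MS if b <= batch_size)]
    # Dispatch if waiting one more scheduling tick (assume 2 ms) would risk the SLO.
    return now_ms + 2.0 + exec_ms > oldest_arrival_ms + SLO_MS

if __name__ == "__main__":
    q = deque([0.0, 5.0, 9.0])               # arrival times (ms) of queued requests
    print(should_dispatch(q, now_ms=20.0))   # plenty of slack: keep waiting -> False
    print(should_dispatch(q, now_ms=95.0))   # about to violate: dispatch now -> True
```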
... Batching and parallelism parameters are practical configuration knobs of ML inference services. Batching refers to aggregating multiple requests into one request and is widely adopted in GPU inference systems [16,23,33,35]. However, as shown in Figure 4, inference on CPU does not substantially benefit from batching in terms of throughput, while increasing the batch size leads to higher latency. ...
Preprint
Full-text available
The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic request workloads, imposing changes in their computing resources. Failing to right-size computing resources results in either latency service level objective (SLO) violations or wasted computing resources. Adapting to dynamic workloads while considering all the pillars of accuracy, latency, and resource cost is challenging. In response to these challenges, we propose InfAdapter, which proactively selects a set of ML model variants with their resource allocations to meet the latency SLO while maximizing an objective function composed of accuracy and cost. InfAdapter decreases SLO violations and cost by up to 65% and 33%, respectively, compared to a popular industry autoscaler (Kubernetes Vertical Pod Autoscaler).
... Machine learning inference service. Many previous research works have focused on various aspects of ML inference services including latency predictability [67][68][69], cloud cost efficiency [70][71][72][73], and adaptive query batching [36,[74][75][76]. Clover distinguishes itself from these previous contributions, as it focuses on building a carbon-aware inference system. ...
Preprint
Full-text available
This paper presents a solution to the challenge of mitigating carbon emissions from large-scale high performance computing (HPC) systems and datacenters that host machine learning (ML) inference services. ML inference is critical to modern technology products, but it is also a significant contributor to datacenter compute cycles and carbon emissions. We introduce Clover, a carbon-friendly ML inference service runtime system that balances performance, accuracy, and carbon emissions through mixed-quality models and GPU resource partitioning. Our experimental results demonstrate that Clover is effective in substantially reducing carbon emissions while maintaining high accuracy and meeting service level agreement (SLA) targets. Therefore, it is a promising solution toward achieving carbon neutrality in HPC systems and datacenters.
Article
As the applications of AI proliferate, it is critical to increase the throughput of online DNN inference services. Multi-process service (MPS) improves the utilization rate of GPU resources by spatial sharing, but it also brings unique challenges. First, interference between co-located DNN models deployed on the same GPU must be accurately modeled. Second, inference tasks arrive dynamically online, and each task needs to be served within a bounded time to meet the service-level objective (SLO). Third, the problem of resource fragmentation has become more serious. To address the above three challenges, we propose an Intelligent Scheduling orchestrator for multi-GPU inference servers with spatio-temporal Sharing (InSS), aiming to maximize the system throughput. InSS exploits two key innovations: i) an interference-aware analytical model that estimates task latency; ii) a two-stage intelligent scheduler, coupled with the latency model, that jointly optimizes model placement and GPU resource allocation and adaptively decides the batch size. Our prototype implementation on four NVIDIA A100 GPUs shows that InSS can improve the throughput by up to 86% compared to state-of-the-art GPU schedulers, while satisfying SLOs. We further show the scalability of InSS on 64 GPUs.
Conference Paper
Deep neural network (DNN) has achieved the state-of-the-art results in multiple fields, and has been widely used to build latency sensitive applications for its high performance. When dispatching requests among GPU machines for DNN execution, the inference system hosted in the cloud needs to guarantee that the maximum latency of all requests, denoted as the worst case latency, is within the latency objectives of the clients. In this paper, we design and implement a request dispatch system, called DeepLat, which distributes client requests among the GPU machines efficiently to minimize the worst case latency of the DNN-based application. DeepLat uses a batch-aware dispatch policy to minimize the batch collecting time, proposes a duration-based algorithm to reduce the average latency, and supports partial-batch dispatching to minimize the waiting time for bottleneck machines. Evaluation shows that compared to existing request dispatch systems, DeepLat can reduce the worst case latency by 37.7% on average without using extra computing resources. Besides, DeepLat achieves the theoretical lower bound for the worst case latency for over 48% of the workload. With the capability to minimize the worst case latency, DeepLat reduces the total cost of the DNN serving system by 43.2% on average.
Article
In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the perspective of a Cloud Service Provider running a data center, which efficiently selects resources for the execution of each job to minimize the average energy consumption while satisfying time constraints. To model the problem, we first develop a Mixed-Integer Non-Linear Programming formulation. Unfortunately, the computation of an optimal solution is prohibitively expensive, and to overcome this difficulty, we design a heuristic STochastic Scheduler (STS). Exploiting the probability distribution of early termination, STS determines how to adapt the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the job due dates. The results of an extensive experimental evaluation show that STS guarantees significantly better results than other methods in the literature, effectively avoiding due date violations and yielding a percentage total cost reduction between 32% and 80% on average. We also prove the applicability of our method in real-world scenarios, as obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluated the effectiveness of GPU sharing, i.e., running multiple jobs in a single GPU. The obtained results demonstrate that depending on the workload and GPU memory, this further reduces the energy cost by 17-29% on average.
Article
With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not take into account the more complex and tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. On the other hand, recent AutoML efforts have provided viable solutions for model compression, pruning and quantization for heterogeneous environments; for a machine learning model, now we may easily find or even generate a series of model variants with different tradeoffs between accuracy and efficiency. We design and implement JellyBean, a system for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean picks the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of infrastructures. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58% and vehicle tracking from the NVIDIA AI City Challenge by up to 36%, compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) up to 5x in serving costs.
Article
GPUs are essential to accelerating the latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as motivated by an empirical measurement study of DNN inference on EC2 GPU instances. While existing works on guaranteeing inference performance service level objectives (SLOs) focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques, how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter , an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter is comprised of two key components: (1) a lightweight DNN inference performance model, which leverages the system and workload metrics that are practically accessible to capture the performance interference; (2) A cost-efficient GPU resource provisioning strategy that jointly optimizes the GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while saving the monetary cost by up to 25% in comparison to the state-of-the-art GPU resource provisioning strategies.
Conference Paper
Full-text available
Detecting activities from video taken with a single camera is an active research area for ML-based machine vision. In this paper, we examine the next research frontier: near real-time detection of complex activities spanning multiple (possibly wireless) cameras, a capability applicable to surveillance tasks. We argue that a system for such complex activity detection must employ a hybrid design: one in which rule-based activity detection must complement neural network based detection. Moreover, to be practical, such a system must scale well to multiple cameras and have low end-to-end latency. Caesar, our edge computing based system for complex activity detection, provides an extensible vocabulary of activities to allow users to specify complex actions in terms of spatial and temporal relationships between actors, objects, and activities. Caesar converts these specifications to graphs, efficiently monitors camera feeds, partitions processing between cameras and the edge cluster, retrieves minimal information from cameras, carefully schedules neural network invocation, and efficiently matches specification graphs to the underlying data in order to detect complex activities. Our evaluations show that Caesar can reduce wireless bandwidth, on-board camera memory, and detection latency by an order of magnitude while achieving good precision and recall for all complex activities on a public multi-camera dataset.
Article
Full-text available
MXNet is a multi-language machine learning (ML) library to ease the development of ML algorithms, especially for deep neural networks. Embedded in the host language, it blends declarative symbolic expression with imperative tensor computation. It offers auto differentiation to derive gradients. MXNet is computation and memory efficient and runs on various heterogeneous systems, ranging from mobile devices to distributed GPU clusters. This paper describes both the API design and the system implementation of MXNet, and explains how embedding of both symbolic expression and tensor operation is handled in a unified fashion. Our preliminary experiments reveal promising results on large scale deep neural network applications using multiple GPU machines.
Conference Paper
Full-text available
Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources. Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers.
Conference Paper
We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be co-scheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. In large-scale case studies on 16 GPUs, when required to stay within latency constraints at least 99% of the time, Nexus can process requests at rates 1.8-12.7X higher than state of the art systems can. A long-running multi-application deployment stays within 84% of optimal utilization and, on a 100-GPU cluster, violates latency SLOs on 0.27% of requests.
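The batch-size and GPU-count decision mentioned in the citation context for Nexus ("automatically chooses the optimal batch size and the number of GPUs to use according to the request rate and latency SLO") can be approximated with a back-of-the-envelope rule: pick the largest profiled batch whose execution fits comfortably within the latency SLO (here, half the SLO, leaving room for queuing), then provision GPUs for the offered rate. The latency profile below is invented and the half-SLO rule is a common heuristic, not necessarily Nexus's exact policy.

```python
import math

# Assumed profiled per-batch execution latency (ms) for one model on one GPU.
PROFILED_LATENCY_MS = {1: 10, 2: 13, 4: 19, 8: 31, 16: 55, 32: 103}

def choose_batch_and_gpus(rate_rps: float, slo_ms: float):
    """Largest batch whose execution time fits in half the SLO, then enough
    GPUs to sustain the offered request rate at that batch size."""
    feasible = [b for b, lat in PROFILED_LATENCY_MS.items() if lat <= slo_ms / 2]
    if not feasible:
        raise ValueError("no batch size meets the SLO")
    batch = max(feasible)                       # bigger batches give higher throughput
    throughput_per_gpu = batch / (PROFILED_LATENCY_MS[batch] / 1000.0)
    gpus = math.ceil(rate_rps / throughput_per_gpu)
    return batch, gpus

if __name__ == "__main__":
    batch, gpus = choose_batch_and_gpus(rate_rps=2000, slo_ms=100)
    print(f"batch size {batch}, {gpus} GPU(s)")
```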
Article
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that using a PAF-only refinement is able to achieve a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
Chapter
We propose a straightforward method that simultaneously reconstructs the 3D facial structure and provides dense alignment. To achieve this, we design a 2D representation called UV position map which records the 3D shape of a complete face in UV space, then train a simple Convolutional Neural Network to regress it from a single 2D image. We also integrate a weight mask into the loss function during training to improve the performance of the network. Our method does not rely on any prior face model, and can reconstruct full facial geometry along with semantic meaning. Meanwhile, our network is very lightweight and spends only 9.8 ms to process an image, which is much faster than previous works. Experiments on multiple challenging datasets show that our method surpasses other state-of-the-art methods on both reconstruction and alignment tasks by a large margin. Code is available at https://github.com/YadiraF/PRNet.
Conference Paper
Performing inference on pre-trained neural network models must meet the requirement of low-latency, which is often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but batching does not perform well when serving Recurrent Neural Networks with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN "cell" (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves much lower latency and also higher throughput than existing systems.
Article
Large volumes of videos are continuously recorded from cameras deployed for traffic control and surveillance with the goal of answering "after the fact" queries: identify video frames with objects of certain classes (cars, bags) from many days of recorded video. While advancements in convolutional neural networks (CNNs) have enabled answering such queries with high accuracy, they are too expensive and slow. We build Focus, a system for low-latency and low-cost querying on large video datasets. Focus uses cheap ingestion techniques to index the videos by the objects occurring in them. At ingest-time, it uses compression and video-specific specialization of CNNs. Focus handles the lower accuracy of the cheap CNNs by judiciously leveraging expensive CNNs at query-time. To reduce query time latency, we cluster similar objects and hence avoid redundant processing. Using experiments on video streams from traffic, surveillance and news channels, we see that Focus uses 58X fewer GPU cycles than running expensive ingest processors and is 37X faster than processing all the video at query time.
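A toy rendition of the ingest/query split the Focus abstract describes: at ingest time a cheap classifier tags each frame with candidate object classes to build an index, and at query time only the indexed candidate frames are re-checked with an expensive model. The classifiers below are stand-ins; Focus additionally uses clustering and video-specific model specialization that this sketch omits.

```python
def build_index(frames, cheap_classifier):
    """Ingest time: map object class -> list of frame ids that *might* contain it."""
    index = {}
    for frame_id, frame in enumerate(frames):
        for cls in cheap_classifier(frame):
            index.setdefault(cls, []).append(frame_id)
    return index

def query(index, frames, target_class, expensive_classifier):
    """Query time: verify only the candidate frames with the expensive model."""
    candidates = index.get(target_class, [])
    return [fid for fid in candidates if target_class in expensive_classifier(frames[fid])]

if __name__ == "__main__":
    frames = ["frame-with-car", "frame-with-bag", "empty-frame", "frame-with-car-and-bag"]
    cheap = lambda f: [c for c in ("car", "bag") if c in f]       # low-cost ingest classifier (stand-in)
    expensive = lambda f: [c for c in ("car", "bag") if c in f]   # expensive verifier (stand-in)
    idx = build_index(frames, cheap)
    print(query(idx, frames, "car", expensive))   # -> [0, 3]
```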
Article
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual networks and one Inception-v4, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge.
Article
Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
Article
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
Naiad is a distributed system for executing data parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, applications that require all three have relied on multiple platforms, at the expense of efficiency, maintainability, and simplicity. Naiad resolves the complexities of combining these features in one framework. A new computational model, timely dataflow, underlies Naiad and captures opportunities for parallelism across a wide class of algorithms. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism. We show that many powerful high-level programming models can be built on Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining. Naiad outperforms specialized systems in their target application domains, and its unique features enable the development of new high-performance applications.
Article
MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
Article
The problem of optimal scheduling of a job system for two dedicated processors is presented. A machine model with two functional units which can be either sequential or pipelined is considered. The complexity of optimal scheduling for a set of expressions on such machines is investigated. Some previous NP-completeness results are reviewed and several new ones are presented. For one restricted case, a polynomial-time algorithm is described and analyzed
CUDA Pro Tip: Understand Fat Binaries and JIT Caching
  • Mark Harris
Alexander Hermans, Lucas Beyer, and Bastian Leibe
Joseph Redmon and Ali Farhadi
INFaaS: A model-less inference serving system
  • Francisco Romero
  • Qian Li
  • Neeraja J. Yadwadkar
  • Christos Kozyrakis
Gandiva: Introspective Cluster Scheduling for Deep Learning
  • Wencong Xiao
  • Romil Bhardwaj
  • Ramachandran Ramjee
  • Muthian Sivathanu
  • Nipun Kwatra
  • Zhenhua Han
  • Pratyush Patel
  • Xuan Peng
  • Hanyu Zhao
  • Quanlu Zhang
  • Fan Yang
  • Lidong Zhou
Bert: Pre-training of deep bidirectional transformers for language understanding
  • Jacob Devlin
  • Ming-Wei Chang
  • Kenton Lee
  • Kristina Toutanova
Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows
  • Vasiliki Kalavri
  • John Liagouris
  • Moritz Hoffmann
  • Desislava Dimitrova
  • Matthew Forshaw
  • Timothy Roscoe
Elastic Scaling of Stateful Network Functions
  • Shinae Woo
  • Justine Sherry
  • Sangjin Han
  • Sue Moon
  • Sylvia Ratnasamy
  • Scott Shenker
NetBricks: Taking the V out of NFV
  • Aurojit Panda
  • Sangjin Han
  • Keon Jang
  • Melvin Walls
  • Sylvia Ratnasamy
  • Scott Shenker
Live video analytics at scale with approximation and delay-tolerance
  • Haoyu Zhang
  • Ganesh Ananthanarayanan
  • Peter Bodik
  • Matthai Philipose
  • Paramvir Bahl
  • Michael J Freedman