Conference Paper

Rim: Offloading Inference to the Edge

... Furthermore, Jellyfish can only adjust batch sizes for different versions of the same model, not systematically schedule the whole pipeline. Distributed architectures like Distream [14] and Rim [18] instead utilize Edge devices and balance workloads between devices and servers. Distream introduces a stochastic method to determine the "split point," adaptively dividing EVA pipelines between local and server-based workloads. ...
... This is because while batching improves throughput, it increases end-to-end latency for queries in the batch, risking SLO violations [12]. Batch sizes must therefore be dynamically adjusted to real-time workloads and conditions. ...
... Batch sizes must therefore be dynamically adjusted to real-time workloads and conditions. Rim [18] argues that Edge models rarely benefit from batching due to lower workloads compared to the cloud. It selects the split point by maximizing concurrent model execution to improve hardware utilization. ...
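Below is a minimal, hypothetical sketch of the split-point idea attributed to Rim in the excerpt above: enumerate candidate splits of a linear pipeline, discard splits that would violate the end-to-end SLO, and keep the one that lets the most models execute concurrently on the edge GPU. The stage profiles, memory budget, and scoring rule are invented for illustration and are not taken from the paper.

```python
# Hypothetical sketch of split-point selection by maximizing concurrent
# model execution on the edge. Stage profiles (latencies, GPU memory) and
# the edge memory budget are illustrative values, not from the paper.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    edge_latency_ms: float    # profiled per-request latency on the edge GPU
    server_latency_ms: float  # profiled per-request latency on the server GPU
    edge_mem_mb: int          # GPU memory footprint when loaded at the edge

def concurrent_models(stages, mem_budget_mb):
    """How many of the given stages fit in edge GPU memory simultaneously."""
    used, count = 0, 0
    for s in stages:
        if used + s.edge_mem_mb <= mem_budget_mb:
            used += s.edge_mem_mb
            count += 1
    return count

def pick_split(pipeline, mem_budget_mb, slo_ms, uplink_ms):
    """Return split index k: stages[:k] run on the edge, stages[k:] on the server."""
    best_k, best_score = 0, -1
    for k in range(len(pipeline) + 1):
        edge, server = pipeline[:k], pipeline[k:]
        latency = (sum(s.edge_latency_ms for s in edge)
                   + (uplink_ms if server else 0)
                   + sum(s.server_latency_ms for s in server))
        if latency > slo_ms:
            continue  # this split would violate the end-to-end SLO
        score = concurrent_models(edge, mem_budget_mb)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

pipeline = [Stage("detector", 18.0, 6.0, 900),
            Stage("tracker", 7.0, 3.0, 300),
            Stage("classifier", 12.0, 4.0, 600)]
print(pick_split(pipeline, mem_budget_mb=2000, slo_ms=60.0, uplink_ms=25.0))
```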
Preprint
Edge Video Analytics (EVA) has gained significant attention as a major application of pervasive computing, enabling real-time visual processing. EVA pipelines, composed of deep neural networks (DNNs), typically demand efficient inference serving under stringent latency requirements, which is challenging in dynamic Edge environments (e.g., workload variability and network instability). Moreover, EVA pipelines also face significant resource contention caused by resource (e.g., GPU) constraints at the Edge. In this paper, we introduce OCTOPINF, a novel resource-efficient and workload-aware inference serving system designed for real-time EVA. OCTOPINF tackles the unique challenges of dynamic Edge environments through fine-grained resource allocation, adaptive batching, and workload balancing between edge devices and servers. Furthermore, we propose a spatiotemporal scheduling algorithm that optimizes the co-location of inference tasks on GPUs, improving performance and ensuring compliance with service-level objectives (SLOs). Extensive evaluations on a real-world testbed demonstrate the effectiveness of our approach. It achieves an effective throughput increase of up to 10x compared to the baselines and shows better robustness in challenging scenarios. OCTOPINF can be used for any DNN-based EVA inference task with minimal adaptation and is available at https://github.com/tungngreen/PipelineScheduler.
... NLP pipelines (pipelines d and e) are representative examples of emerging use cases of language models [71,72]. For a full specification of the models used in each stage of the pipelines, refer to Appendix A. Baselines: We compare IPA with variations of two similar systems, namely FA2 [59] and RIM [42]. FA2 is a recent system that achieves cost efficiency using scaling and batching; however, compared to IPA, it does not have model switching as an optimization angle. ...
... Predictors are effective in reducing SLA violations. Most previous work on inference pipeline serving [25,41,42,59] has done reactive auto-configuration. In reactive approaches, configuration changes occur with live load monitoring and in response to load changes. ...
... Multi-stage inference serving: Several approaches have been proposed in previous research to improve inference performance metrics in multistage inference serving systems [7,25,37,41,42,44,47,51,58,59,61,65,69] since changing one model's configuration affects subsequent steps. InferLine [25] reduces the end-to-end latency of ML service delivery by heuristically optimizing configurations such as batch sizes and horizontal scaling of each stage. ...
Article
Full-text available
Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase. The code and data for replication are available at this https URL.
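The configuration search that IPA formulates as an Integer Program can be illustrated with a small brute-force sketch: enumerate (variant, batch size, replica count) combinations, keep those that satisfy the latency SLA and a cost budget, and prefer the most accurate (then cheapest) feasible choice. The variant profiles, the sub-linear batching model, and the workload numbers below are made up for the example; the real system's formulation and profiler are richer.

```python
# Illustrative brute-force stand-in for the configuration choice IPA solves
# with Integer Programming. Variant profiles and the workload are invented.

from itertools import product

variants = {           # accuracy, latency for batch of 1 (ms), cost per replica
    "resnet18":  (0.70, 10.0, 1.0),
    "resnet50":  (0.76, 22.0, 2.0),
    "resnet152": (0.78, 45.0, 4.0),
}
batch_sizes = [1, 2, 4, 8]
replicas = [1, 2, 3, 4]

def best_config(rate_rps, sla_ms, budget):
    best = None
    for (name, (acc, lat1, cost)), b, r in product(variants.items(), batch_sizes, replicas):
        batch_lat = lat1 * (0.6 + 0.4 * b)           # crude sub-linear batching model
        throughput = r * b * 1000.0 / batch_lat       # requests/s across replicas
        latency = batch_lat + b * 1000.0 / rate_rps   # execution + batch collection
        if throughput >= rate_rps and latency <= sla_ms and r * cost <= budget:
            key = (acc, -r * cost)                    # prefer accuracy, then lower cost
            if best is None or key > best[0]:
                best = (key, name, b, r)
    return best

print(best_config(rate_rps=100, sla_ms=120, budget=8))
```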
... Baselines: We compare IPA with variations of two similar systems, namely FA2 [59] and RIM [42]. FA2 is a recent system that achieves cost efficiency using scaling and batching; however, compared to IPA, it does not have model switching as an optimization angle. ...
... Predictors are effective in reducing SLA violations. Most previous work on inference pipeline serving [25,41,42,59] has done reactive auto-configuration. In reactive approaches, configuration changes occur with live load monitoring and in response to load changes. ...
... Multi-stage inference serving: Several approaches have been proposed in previous research to improve the performance metrics for inference on multistage inference serving systems [7,25,37,41,42,44,47,51,58,59,61,65,69] since changing one model's configuration affects the subsequent steps. InferLine [25] reduces end-to-end latency of ML service delivery by heuristically optimizing configurations such as batch sizes and horizontal scaling of each stage. ...
Preprint
Full-text available
Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in ML production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of accuracy and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling accuracy and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep-learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency SLAs using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves normalized accuracy by up to 35% with a minimal cost increase of less than 5%.
... Next, we check whether the current throughput can support the incoming workload of the pipeline. If not, we use binary search on the workload and feed it to the same algorithm to find how much of the workload vertical scaling can support (lines 22-29). We then calculate the number of instances needed to serve the remaining requests using horizontal scaling with the same CPU core allocations (line 30). ...
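A small sketch of the binary-search step described in the excerpt above: it searches for the largest share of the incoming workload that vertical scaling alone can absorb, and treats the remainder as the part to be covered by horizontal scaling. The `can_serve` feasibility test and the capacity numbers are placeholders, not the paper's algorithm.

```python
# Sketch of the quoted binary-search step. `can_serve` stands in for the
# paper's vertical-scaling feasibility check and uses made-up capacities.

def can_serve(workload_rps, cpu_cores, per_core_capacity_rps=50):
    """Placeholder feasibility test for vertical scaling."""
    return workload_rps <= cpu_cores * per_core_capacity_rps

def max_vertically_served(workload_rps, cpu_cores, tolerance_rps=1):
    lo, hi = 0, workload_rps
    while hi - lo > tolerance_rps:
        mid = (lo + hi) / 2
        if can_serve(mid, cpu_cores):
            lo = mid          # mid is feasible, try a larger share
        else:
            hi = mid          # infeasible, shrink the search range
    return lo

served = max_vertically_served(workload_rps=1200, cpu_cores=16)
remaining = 1200 - served     # handled by horizontal scaling, as in "line 30"
print(round(served), round(remaining))
```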
... Autoscaling in Inference Serving Systems: Multiple approaches have been proposed to reduce total resource consumption using either vertical or horizontal scaling in inference serving systems while guaranteeing SLOs [12,13,15,24,26,27,29,37,44,45,48,49]. FA2 [43] uses graph transformation and dynamic programming techniques to reduce the number of instances in horizontal scaling. ...
Preprint
Full-text available
Inference serving is of great importance in deploying machine learning models in real-world applications, ensuring efficient processing and quick responses to inference requests. However, managing resources in these systems poses significant challenges, particularly in maintaining performance under varying and unpredictable workloads. Two primary scaling strategies, horizontal and vertical scaling, offer different advantages and limitations. Horizontal scaling adds more instances to handle increased loads but can suffer from cold start issues and increased management complexity. Vertical scaling boosts the capacity of existing instances, allowing for quicker responses but is limited by hardware and model parallelization capabilities. This paper introduces Themis, a system designed to leverage the benefits of both horizontal and vertical scaling in inference serving systems. Themis employs a two-stage autoscaling strategy: initially using in-place vertical scaling to handle workload surges and then switching to horizontal scaling to optimize resource efficiency once the workload stabilizes. The system profiles the processing latency of deep learning models, calculates queuing delays, and employs different dynamic programming algorithms to solve the joint horizontal and vertical scaling problem optimally based on the workload situation. Extensive evaluations with real-world workload traces demonstrate an over 10× reduction in SLO violations compared to the state-of-the-art horizontal or vertical autoscaling approaches while maintaining resource efficiency when the workload is stable.
... Nexus [58] automatically chooses the optimal batch size and the number of GPUs to use according to the request rate and latency SLO for a given model. Model DAGs are also considered in other works [4,19,26,27,55,56]. Rim [27] considers serving ML workflows at a cluster of edge GPUs. ...
... Model DAGs are also considered in other works [4,19,26,27,55,56]. Rim [27] considers serving ML workflows at a cluster of edge GPUs. JellyBean differs in two ways. ...
Preprint
With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not take into account the more complex and tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. On the other hand, recent machine learning efforts have provided viable solutions for model compression, pruning and quantization for heterogeneous environments; for a machine learning model, now we may easily find or even generate a series of models with different tradeoffs between accuracy and efficiency. We design and implement JellyBean, a framework for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean automatically selects the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of infrastructures. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58%, and vehicle tracking from the NVIDIA AI City Challenge by up to 36% compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) up to 5x in serving costs.
... Many modern applications, such as Amazon Alexa, are composed of multiple DL models, usually arranged as a DAG. Generalizing Sponge to support such applications requires a new algorithm design, since there are data dependencies [10,15,21,28] between DL models, and finding an optimal resource allocation for individual DL models requires consideration of all models in the system. Multidimensional scaling. ...
... Although there are a few works that use reinforcement learning to fit the device characteristics and user objectives at the MAC layer [1,6,7,10,11,13], our focus is mostly on supervised and unsupervised learning models. Recently, a considerable amount of work has focused on analyzing video feeds from single and collaborative cameras [3,5,8,9,12,14,17]. Although these works introduce different techniques for scheduling and analyzing video streams to achieve better performance in terms of time and computation power, none of them considers a scenario where overlapping scenarios are used to perform different tasks. ...
Preprint
Full-text available
Today many IoT devices with different hardware characteristics are designed and deployed in different environments, ranging from our homes to city-scale deployments. In this paper we focus on the application of video analytics at the city scale and propose a collaborative analytics pipeline to reduce computation time. In the context of smart cities, surveillance cameras are being widely deployed due to a drop in infrastructure costs and more powerful computing hardware. However, offloading the large feeds from these cameras and sending them to cloud servers is not cost-beneficial and can add significant delay overhead for real-time video analytics applications, as well as raise privacy issues. In recent years, different studies have addressed the aforementioned issues by bringing video analytics to the edge. In this paper, we consider a scenario where a large number of video cameras are installed in different parts of the city. We provide a vision where each entity (offices, departments) only has access to its own data. However, if more data collected by cameras owned by other departments is needed, an entity can submit a query and the corresponding task to that department and receive only the inference results for that query.
... Although there are a few works that use reinforcement learning to fit the device characteristics and user objectives at the MAC layer [1,6,7,10,11,13], our focus is mostly on supervised and unsupervised learning models. Recently, a considerable amount of work has focused on analyzing video feeds from single and collaborative cameras [3,5,8,9,12,16]. Although these works introduce different techniques for scheduling and analyzing video streams to achieve better performance in terms of time and computation power, none of them considers a scenario where overlapping scenarios are used to perform different tasks. ...
... Recently, a considerable amount of work has focused on analyzing video feeds from single and collaborative cameras [2,4-7,10]. Although these works introduce different techniques for scheduling and analyzing video streams to achieve better performance in terms of time and computation power, none of them considers a scenario where overlapping scenarios are used to perform different tasks. ...
Conference Paper
Deep neural networks (DNNs) have achieved state-of-the-art results in multiple fields and are widely used to build latency-sensitive applications for their high performance. When dispatching requests among GPU machines for DNN execution, the inference system hosted in the cloud needs to guarantee that the maximum latency of all requests, denoted as the worst-case latency, is within the latency objectives of the clients. In this paper, we design and implement a request dispatch system, called DeepLat, which distributes client requests among the GPU machines efficiently to minimize the worst-case latency of the DNN-based application. DeepLat uses a batch-aware dispatch policy to minimize the batch collecting time, proposes a duration-based algorithm to reduce the average latency, and supports partial-batch dispatching to minimize the waiting time for bottleneck machines. Evaluation shows that compared to existing request dispatch systems, DeepLat can reduce the worst-case latency by 37.7% on average without using extra computing resources. Besides, DeepLat achieves the theoretical lower bound for the worst-case latency for over 48% of the workload. With the capability to minimize the worst-case latency, DeepLat reduces the total cost of the DNN serving system by 43.2% on average.
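As a rough illustration of the quantity DeepLat optimizes, the worst-case latency of a request can be modeled as the time spent waiting for its batch to fill plus the batch's execution time on the machine it is dispatched to; a batch-aware dispatcher then sends the batch to the machine minimizing that sum. The latency model and machine profiles below are assumptions for illustration, not DeepLat's actual policy.

```python
# Rough, assumed model of worst-case latency: batch-collection time plus
# batch execution time. The dispatch rule and profiles are illustrative only.

def worst_case_latency(batch_size, arrival_rate_rps, exec_ms_per_batch):
    collect_ms = (batch_size - 1) * 1000.0 / arrival_rate_rps  # first request waits longest
    return collect_ms + exec_ms_per_batch

machines = {"gpu0": 40.0, "gpu1": 55.0}   # assumed latency (ms) for a batch of 8

def dispatch(arrival_rate_rps, batch_size=8):
    return min(machines, key=lambda m: worst_case_latency(batch_size, arrival_rate_rps, machines[m]))

print(dispatch(arrival_rate_rps=200))
```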
Article
Artificial Intelligence (AI) at the edge is the utilization of AI in real-world devices. Edge AI refers to the practice of performing AI computations near the users at the network's edge, instead of at a centralised location such as a cloud service provider's data centre. With the latest innovations in AI efficiency, the proliferation of Internet of Things (IoT) devices, and the rise of edge computing, the potential of edge AI has now been unlocked. This study provides a thorough analysis of AI approaches and capabilities as they pertain to edge computing, or Edge AI. Further, a detailed survey of edge computing and its paradigms, including the transition to Edge AI, is presented to explore the background of each variant proposed for implementing edge computing. Furthermore, we discuss the Edge AI approach to deploying AI algorithms and models on edge devices, which are typically resource-constrained devices located at the edge of the network. We also present the technology used in various modern IoT applications, including autonomous vehicles, smart homes, industrial automation, healthcare, and surveillance, and discuss leveraging machine learning algorithms optimized for resource-constrained environments. Finally, important open challenges and potential research directions in the field of edge computing and Edge AI are identified and investigated. We hope that this article will serve as a common blueprint that unites important stakeholders and helps accelerate development in the field of Edge AI.
Article
With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not take into account the more complex and tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. On the other hand, recent AutoML efforts have provided viable solutions for model compression, pruning and quantization for heterogeneous environments; for a machine learning model, now we may easily find or even generate a series of model variants with different tradeoffs between accuracy and efficiency. We design and implement JellyBean, a system for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean picks the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of infrastructures. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58% and vehicle tracking from the NVIDIA AI City Challenge by up to 36%, compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) up to 5x in serving costs.
Conference Paper
Full-text available
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
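The 92 TOPS peak figure can be sanity-checked from the MAC array size: 256 × 256 = 65,536 MACs, each contributing a multiply and an add per cycle, assuming the TPU's reported clock of roughly 700 MHz.

```python
# Back-of-the-envelope check of the 92 TOPS peak throughput quoted above,
# assuming a 256x256 MAC array and a ~700 MHz clock.
macs = 256 * 256
ops_per_mac_per_cycle = 2          # one multiply + one accumulate
clock_hz = 700e6
peak_tops = macs * ops_per_mac_per_cycle * clock_hz / 1e12
print(round(peak_tops, 1))          # ~91.8, quoted as 92 TOPS
```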
Conference Paper
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single-stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
Article
Full-text available
This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from raw waveform.
Article
Full-text available
Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet
Conference Paper
Full-text available
Resource constrained mobile devices need to leverage computation on nearby servers to run responsive applications that recognize objects, people, or gestures from real-time video. The two key questions that impact performance are what computation to offload, and how to structure the parallelism across the mobile device and server. To answer these questions, we develop and evaluate three interactive perceptual applications. We find that offloading and parallelism choices should be dynamic, even for a given application, as performance depends on scene complexity as well as environmental factors such as the network and device capabilities. To this end we develop Odessa, a novel, lightweight, runtime that automatically and adaptively makes offloading and parallelism decisions for mobile interactive perception applications. Our evaluation shows that the incremental greedy strategy of Odessa converges to an operating point that is close to an ideal offline partitioning. It provides more than a 3x improvement in application performance over partitioning suggested by domain experts. Odessa works well across a variety of execution environments, and is agile to changes in the network, device and application inputs.
Conference Paper
Full-text available
This paper addresses the problem of scheduling concurrent jobs on clusters where application data is stored on the computing nodes. This setting, in which scheduling computations close to their data is crucial for performance, is increasingly common and arises in systems such as MapReduce, Hadoop, and Dryad as well as many grid-computing environments. We argue that data-intensive computation benefits from a fine-grain resource sharing model that differs from the coarser semi-static resource allocations implemented by most existing cluster computing architectures. The problem of scheduling with locality and fairness constraints has not previously been extensively studied under this model of resource sharing. We introduce a powerful and flexible new framework for scheduling concurrent distributed jobs with fine-grain resource sharing. The scheduling problem is mapped to a graph datastructure, where edge weights and capacities encode the competing demands of data locality, fairness, and starvation-freedom, and a standard solver computes the optimal online schedule according to a global cost model. We evaluate our implementation of this framework, which we call Quincy, on a cluster of a few hundred computers using a varied workload of data- and CPU-intensive jobs. We evaluate Quincy against an existing queue-based algorithm and implement several policies for each scheduler, with and without fairness constraints. Quincy gets better fairness when fairness is requested, while substantially improving data locality. The volume of data transferred across the cluster is reduced by up to a factor of 3.9 in our experiments, leading to a throughput increase of up to 40%.
Conference Paper
We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be co-scheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. In large-scale case studies on 16 GPUs, when required to stay within latency constraints at least 99% of the time, Nexus can process requests at rates 1.8-12.7X higher than state of the art systems can. A long-running multi-application deployment stays within 84% of optimal utilization and, on a 100-GPU cluster, violates latency SLOs on 0.27% of requests.
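A hedged sketch of the kind of decision Nexus automates for a single model: given a request rate and a latency SLO, pick a batch size whose batching delay plus execution time fits the SLO, and enough GPUs to cover the rate. The affine latency profile (fixed overhead plus per-item cost) and all numbers below are stand-ins for the system's measured profiles, not Nexus's actual optimizer.

```python
# Assumed affine latency profile: exec_ms = overhead + per_item * batch.
# The search picks the fewest GPUs (then the largest feasible batch).

def choose_batch_and_gpus(rate_rps, slo_ms, overhead_ms=5.0, per_item_ms=2.0):
    best = None
    for batch in range(1, 65):
        exec_ms = overhead_ms + per_item_ms * batch
        queue_ms = batch * 1000.0 / rate_rps      # time to fill the batch
        if exec_ms + queue_ms > slo_ms:
            continue                              # batch too large for the SLO
        per_gpu_rps = batch * 1000.0 / exec_ms    # one GPU's throughput at this batch
        gpus = -(-rate_rps // per_gpu_rps)        # ceiling division
        if best is None or gpus < best[1] or (gpus == best[1] and batch > best[0]):
            best = (batch, int(gpus))
    return best

print(choose_batch_and_gpus(rate_rps=500, slo_ms=100))
```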
Article
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that using a PAF-only refinement is able to achieve a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
Conference Paper
Social sensing has emerged as a new sensing application paradigm where measurements about the physical world are collected from humans or devices on their behalf. The advent of edge computing pushes the frontier of computation, service, and data along the cloud-to-IoT continuum. The merge of these two technical trends (referred to as Social Sensing based Edge Computing or SSEC) generates a set of new research challenges. One critical issue in SSEC is the heterogeneity of the edge, where the edge devices owned by human sensors often have diversified computational power, run-time environments, network interfaces, and hardware equipment. Such heterogeneity poses significant challenges in the resource management of SSEC systems. Examples include masking the pronounced heterogeneity across diverse platforms, allocating interdependent tasks with complex requirements on devices with different resources, and adapting to the dynamic and diversified context of the edge devices. In this paper, we develop a new resource management framework, HeteroEdge, to address the heterogeneity of SSEC by 1) providing a uniform interface to abstract the device details (hardware, operating system, CPU); and 2) effectively allocating the social sensing tasks to the heterogeneous edge devices. We implemented HeteroEdge on a real-world edge computing testbed that consists of heterogeneous edge devices (Jetson TX2, TK1, Raspberry Pi3, and personal computer). Evaluations based on two real-world social sensing applications show that HeteroEdge achieved up to a 42% decrease in end-to-end delay and 22% more energy savings compared to the state-of-the-art baselines.
Conference Paper
Deep neural networks (DNNs) are emerging as important drivers for GPU (Graphical Processing Unit) usage. Routinely, now, cloud offerings include GPU-capable VMs, and GPUs are used for training and testing DNNs. A popular way to run inference (or testing) tasks with DNNs is to use middleware called a serving system. Tensorflow-Serving (TF-Serving) is an example of a DNN serving system. In this paper, we consider the problem of carefully scheduling multiple concurrent DNNs in a serving system on a single GPU to achieve fairness or service differentiation objectives, a capability crucial to cloud-based TF-Serving offerings. In scheduling DNNs, we face two challenges: how to schedule, and switch between, different DNN jobs at low overhead; and, how to account for their usage. Our system, Olympian, extends TF-Serving to enable fair sharing of a GPU across multiple concurrent large DNNs at low overhead, a capability TF-Serving by itself is not able to achieve. Specifically, Olympian can run concurrent instances of several large DNN models such as Inception, ResNet, GoogLeNet, AlexNet and VGG, provide each with an equal share of the GPU, while interleaving them at timescales of 1-2 ms, and incurring an overhead of less than 2%. It achieves this by leveraging the predictability of GPU computations to profile GPU resource usage models offline, then using these to achieve low overhead switching between DNNs.
Conference Paper
With the rising interest in personalized VR and gaming experiences comes the need to create high quality 3D avatars that are both low-cost and variegated. Due to this, building dynamic avatars from a single unconstrained input image is becoming a popular application. While previous techniques that attempt this require multiple input images or rely on transferring dynamic facial appearance from a source actor, we are able to do so using only one 2D input image without any form of transfer from a source image. We achieve this using a new conditional Generative Adversarial Network design that allows fine-scale manipulation of any facial input image into a new expression while preserving its identity. Our photoreal avatar GAN (paGAN) can also synthesize the unseen mouth interior and control the eye-gaze direction of the output, as well as produce the final image from a novel viewpoint. The method is even capable of generating fully-controllable temporally stable video sequences, despite not using temporal information during training. After training, we can use our network to produce dynamic image-based avatars that are controllable on mobile devices in real time. To do this, we compute a fixed set of output images that correspond to key blendshapes, from which we extract textures in UV space. Using a subject's expression blendshapes at run-time, we can linearly blend these key textures together to achieve the desired appearance. Furthermore, we can use the mouth interior and eye textures produced by our network to synthesize on-the-fly avatar animations for those regions. Our work produces state-of-the-art quality image and video synthesis, and is the first to our knowledge that is able to generate a dynamically textured avatar with a mouth interior, all from a single image.
Article
We introduce a method for co-locating style-defining elements over a set of 3D shapes. Our goal is to translate high-level style descriptions, such as “Ming” or “European” for furniture models, into explicit and localized regions over the geometric models that characterize each style. For each style, the set of style-defining elements is defined as the union of all the elements that are able to discriminate the style. Another property of the style-defining elements is that they are frequently occurring, reflecting shape characteristics that appear across multiple shapes of the same style. Given an input set of 3D shapes spanning multiple categories and styles, where the shapes are grouped according to their style labels, we perform a cross-category co-analysis of the shape set to learn and spatially locate a set of defining elements for each style. This is accomplished by first sampling a large number of candidate geometric elements and then iteratively applying feature selection to the candidates, to extract style-discriminating elements until no additional elements can be found. Thus, for each style label, we obtain sets of discriminative elements that together form the superset of defining elements for the style. We demonstrate that the co-location of style-defining elements allows us to solve problems such as style classification, and enables a variety of applications such as style-revealing view selection, style-aware sampling, and style-driven modeling for 3D shapes.
Article
Large volumes of videos are continuously recorded from cameras deployed for traffic control and surveillance with the goal of answering "after the fact" queries: identify video frames with objects of certain classes (cars, bags) from many days of recorded video. While advancements in convolutional neural networks (CNNs) have enabled answering such queries with high accuracy, they are too expensive and slow. We build Focus, a system for low-latency and low-cost querying on large video datasets. Focus uses cheap ingestion techniques to index the videos by the objects occurring in them. At ingest-time, it uses compression and video-specific specialization of CNNs. Focus handles the lower accuracy of the cheap CNNs by judiciously leveraging expensive CNNs at query-time. To reduce query time latency, we cluster similar objects and hence avoid redundant processing. Using experiments on video streams from traffic, surveillance and news channels, we see that Focus uses 58X fewer GPU cycles than running expensive ingest processors and is 37X faster than processing all the video at query time.
Conference Paper
Metric learning aims to construct an embedding where two extracted features corresponding to the same identity are likely to be closer than features from different identities. This paper presents a method for learning such a feature space where the cosine similarity is effectively optimised through a simple re-parametrization of the conventional softmax classification regime. At test time, the final classification layer can be stripped from the network, facilitating nearest neighbour queries on unseen individuals using the cosine similarity metric. This approach presents a simple alternative to direct metric learning objectives such as siamese networks that have required sophisticated pair or triplet sampling strategies in the past. The method is evaluated on two large-scale pedestrian re-identification datasets where competitive results are achieved overall. In particular, we achieve better generalization to the test set compared to a network trained with triplet loss.
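A minimal NumPy sketch of the re-parametrisation described above, under assumed shapes and a made-up scale hyperparameter: L2-normalise both the feature vectors and the classifier weights so the logits become scaled cosine similarities; at test time the classification layer is dropped and unseen identities are matched with the same cosine metric.

```python
# Sketch of a cosine-softmax layer: logits are scaled cosine similarities
# between normalised features and normalised class weights. Shapes and the
# scale value are assumptions for illustration.

import numpy as np

def cosine_logits(features, weights, scale=16.0):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    return scale * f @ w            # (batch, num_classes) cosine similarities

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 128))   # embeddings for 4 images
w = rng.normal(size=(128, 10))      # weights for 10 identities
probs = softmax(cosine_logits(feats, w))
print(probs.shape)                  # (4, 10)
```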
Article
We present a fully automatic framework that digitizes a complete 3D head with hair from a single unconstrained image. Our system offers a practical and consumer-friendly end-to-end solution for avatar personalization in gaming and social VR applications. The reconstructed models include secondary components (eyes, teeth, tongue, and gums) and provide animation-friendly blendshapes and joint-based rigs. While the generated face is a high-quality textured mesh, we propose a versatile and efficient polygonal strips (polystrips) representation for the hair. Polystrips are suitable for an extremely wide range of hairstyles and textures and are compatible with existing game engines for real-time rendering. In addition to integrating state-of-the-art advances in facial shape modeling and appearance inference, we propose a novel single-view hair generation pipeline, based on 3D-model and texture retrieval, shape refinement, and polystrip patching optimization. The performance of our hairstyle retrieval is enhanced using a deep convolutional neural network for semantic hair attribute classification. Our generated models are visually comparable to state-of-the-art game characters designed by professional artists. For real-time settings, we demonstrate the flexibility of polystrips in handling hairstyle variations, as opposed to conventional strand-based representations. We further show the effectiveness of our approach on a large number of images taken in the wild, and how compelling avatars can be easily created by anyone.
Conference Paper
Client-side video players employ adaptive bitrate (ABR) algorithms to optimize user quality of experience (QoE). Despite the abundance of recently proposed schemes, state-of-the-art ABR algorithms suffer from a key limitation: they use fixed control rules based on simplified or inaccurate models of the deployment environment. As a result, existing schemes inevitably fail to achieve optimal performance across a broad set of network conditions and QoE objectives. We propose Pensieve, a system that generates ABR algorithms using reinforcement learning (RL). Pensieve trains a neural network model that selects bitrates for future video chunks based on observations collected by client video players. Pensieve does not rely on pre-programmed models or assumptions about the environment. Instead, it learns to make ABR decisions solely through observations of the resulting performance of past decisions. As a result, Pensieve automatically learns ABR algorithms that adapt to a wide range of environments and QoE metrics. We compare Pensieve to state-of-the-art ABR algorithms using trace-driven and real world experiments spanning a wide variety of network conditions, QoE metrics, and video properties. In all considered scenarios, Pensieve outperforms the best state-of-the-art scheme, with improvements in average QoE of 12%--25%. Pensieve also generalizes well, outperforming existing schemes even on networks for which it was not explicitly trained.
Article
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment. In this paper, we introduce Clipper, the first general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. Finally, we compare Clipper to the TensorFlow Serving system and demonstrate comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.
Conference Paper
With smartphones making video recording easier than ever, new apps like Periscope and Meerkat brought personalized interactive video streaming to millions. With a touch, viewers can switch between first person perspectives across the globe, and interact in real-time with broadcasters. Unlike traditional video streaming, these services require low-latency video delivery to support high interactivity between broadcasters and audiences. We perform a detailed analysis into the design and performance of Periscope, the most popular personal livestreaming service with 20 million users. Using detailed measurements of Periscope (3 months, 19M streams, 705M views) and Meerkat (1 month, 164K streams, 3.8M views), we ask the critical question: "Can personalized livestreams continue to scale, while allowing their audiences to experience desired levels of interactivity?" We analyze the network path of each stream and break down components of its end-to-end delay. We find that much of each stream's delay is the direct result of decisions to improve scalability, from chunking video sequences to selective polling for reduced server load. Our results show a strong link between volume of broadcasts and stream delivery latency. Finally, we discovered a critical security flaw during our study, and shared it along with a scalable solution with Periscope and Meerkat management.
Article
Data center-scale clusters are evolving towards heterogeneous hardware for power, cost, differentiated price-performance, and other reasons. MapReduce is a well-known programming model to process large amount of data on data center-scale clusters. Most MapReduce implementations have been designed and optimized for homogeneous clusters. Unfortunately, these implementations perform poorly on heterogeneous clusters (e.g., on a 90-node cluster that contains 10 Xeon-based servers and 80 Atom-based servers, Hadoop performs worse than on 10-node Xeon-only or 80-node Atom-only homogeneous sub-clusters for many of our benchmarks). This poor performance remains despite previously proposed optimizations related to management of straggler tasks. In this paper, we address MapReduce's poor performance on heterogeneous clusters. Our first contribution is that the poor performance is due to two key factors: (1) the non-intuitive effect that MapReduce's built-in load balancing results in excessive and bursty network communication during the Map phase, and (2) the intuitive effect that the heterogeneity amplifies load imbalance in the Reduce computation. Our second contribution is Tarazu, a suite of optimizations to improve MapReduce performance on heterogeneous clusters. Tarazu consists of (1) Communication-Aware Load Balancing of Map computation (CALB) across the nodes, (2) Communication-Aware Scheduling of Map computation (CAS) to avoid bursty network traffic and (3) Predictive Load Balancing of Reduce computation (PLB) across the nodes. Using the above 90-node cluster, we show that Tarazu significantly improves performance over a baseline of Hadoop with straightforward tuning for hardware heterogeneity.
Article
Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
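The map/reduce programming model summarised above is easiest to see in the canonical word-count example; the sketch below runs the map, shuffle, and reduce phases in plain Python without any cluster runtime.

```python
# Word count in the map/reduce model: map emits (word, 1) pairs, the shuffle
# groups values by key, and reduce sums the counts for each word.

from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run(documents):
    groups = defaultdict(list)            # shuffle: group values by key
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run(["the quick brown fox", "the lazy dog"]))
```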
Article
This paper presents Click, a flexible, modular software architecture for creating routers. Click routers are built from fine-grained components; this supports fine-grained extensions throughout the forwarding path. The components are packet processing modules called elements. The basic element interface is narrow, consisting mostly of functions for initialization and packet handoff, but elements can extend it to support other functions (such as reporting queue lengths). To build a router configuration, the user chooses a collection of elements and connects them into a directed graph. The graph's edges, which are called connections, represent possible paths for packet handoff. To extend a configuration, the user can write new elements or compose existing elements in new ways, much as UNIX allows one to build complex applications directly or by composing simpler ones using pipes
NVIDIA System Management Interface. NVIDIA.
The NVIDIA EGX Platform for Edge Computing. NVIDIA.
Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico Santambrogio, Markus Weimer, and Matteo Interlandi. PRETZEL: Opening the black box of machine learning prediction serving systems. USENIX Association.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks.
Oytun Ulutan, Swati Rallapalli, Carlos Torres, Mudhakar Srivatsa, and B. S. Manjunath. Actor conditioned attention maps for video action detection.
Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva: Introspective cluster scheduling for deep learning.