Conference Paper

CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers

... They enable systems to enforce availability guarantees and mitigate the interference caused by shared resource congestion. The benefits of intelligently tuning such features have been shown in previous work to include increased system performance [60], improved system utilization [75], and reduced power consumption [67]. However, to the best of our knowledge, there has been no previous attempt at surveying the academic and industrial landscapes to understand the applicability of these new features in different contexts. ...
... Best-effort (BE) jobs, on the other hand, are non-time-critical, predictable workloads that may run whenever sufficient resources become available, commonly under a batch-style scheduling system. Such jobs are therefore well suited to improving server utilization by consuming the resources that LC services leave unused at their current loads, a technique leveraged by several previous works [60,16,69,75,72,82,84,119,56,121,13]. BE tasks such as machine learning model training or big data analysis jobs are often data intensive and consequently disproportionately "noisy" users of shared system resources. ...
... The goal of the allocations is to reduce resource congestion and increase workload throughput. To validate its effectiveness, a comparison with two older controllers [75,105] is carried out. ...
Preprint
Full-text available
Recent advancements in commodity server processors have enabled dynamic hardware-based quality-of-service (QoS) enforcement. These features have gathered increasing interest in research communities due to their versatility and wide range of applications. Thus, there is a need to understand how scholars leverage hardware QoS enforcement in research, to assess strengths and shortcomings, and to identify gaps in the current state of the art. This paper surveys relevant publications, presents a novel taxonomy, discusses the approaches used, and identifies trends. Furthermore, an opportunity is recognized for applying QoS enforcement in service-based cloud computing environments, and open challenges are presented.
... Many studies try to make the OS intelligent using ML technologies. These efforts cover a wide range of OS functions, e.g., resource scheduling for cloud services [1,3,5-9,11,12,14,19,20], I/O scheduling [2,51], load balancing and task scheduling [13,54,60], memory page allocation and reclamation [52,60,62], and parameter tuning for OS/system software [4,10,16]. We summarize some representative ones among these studies and show them in Table 1 in detail. ...
... They do not need offline traces and can reach a near-optimal solution faster by sampling the scheduling exploration space and observing real-time feedback. For example, Bayesian optimization models the reward as a probability distribution and uses this model to guide the search (e.g., the study [9] in Table 1). The Upper Confidence Bound (UCB) strategy from the multi-armed bandit setting assigns each option an upper confidence bound based on past observations and selects the option with the highest bound, e.g., the work [5] in Table 1. ...
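The UCB rule described in this snippet is compact enough to sketch. The toy loop below uses hypothetical scheduling options with made-up Gaussian rewards (none of it drawn from the cited systems) to show how the confidence term balances exploration and exploitation:

```python
import math
import random

def ucb_select(counts, means, t, c=2.0):
    """Pick the arm with the highest upper confidence bound.

    counts[i] - how many times option i was tried so far
    means[i]  - running mean reward observed for option i
    t         - total number of decisions made so far
    """
    # Try every option once before trusting the confidence bounds.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: means[i] + c * math.sqrt(math.log(t) / counts[i]))

# Toy loop: three hypothetical options with unknown mean rewards.
true_means = [0.3, 0.5, 0.7]
counts, means = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 201):
    arm = ucb_select(counts, means, t)
    r = random.gauss(true_means[arm], 0.1)        # observed real-time feedback
    counts[arm] += 1
    means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
print(counts)  # most pulls should concentrate on the best option
```

Options that look promising but have been sampled rarely keep a large confidence term, so the search keeps revisiting them until their estimates stabilize.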
... For cost-efficiency, multiple services are usually co-located on a server, including latency-critical (LC) services with strict QoS targets and throughput-oriented best-effort (BE) services [3,9,19,23]. Runtime resource scheduling is the core for quality of service (QoS) control in these complicated co-location cases [3,9,23,25]. ...
Preprint
Full-text available
Making the OS intelligent is a promising direction in system/OS design. This paper proposes OSML+, a new ML-based resource scheduling mechanism for co-located cloud services. OSML+ intelligently schedules the cache and main memory bandwidth resources in the memory hierarchy and the computing core resources simultaneously. OSML+ uses a multi-model collaborative learning approach during scheduling and can thus handle complicated cases, e.g., avoiding resource cliffs, sharing resources among applications, and enabling different scheduling policies for applications with different priorities. OSML+ converges faster than previous approaches thanks to its ML models. Moreover, OSML+ automatically learns on the fly and handles dynamically changing workloads accordingly. Using transfer learning, we show that our design works well across various cloud servers, including the latest off-the-shelf large-scale servers. Our experimental results show that OSML+ supports higher loads and meets QoS targets with lower overheads than previous studies.
... sufficiently address resource contention and utilization, leading to performance degradation and resource wastage [8], [9], [10], [11], [12], [13]. ...
... These works ensure LC task QoS by allocating only the minimum necessary resources to LC tasks while assigning the remaining resources to BE tasks, thereby minimizing resource consumption and maximizing BE task performance. However, these isolation methods that focus on separating resources allocated to LC and BE tasks often fail to recognize the saturation point for BE task performance, leading to resource wastage [2], [5], [8], [14], [15]. Moreover, dynamic resource allocation methods, such as those employed by PARTIES, OLPart, and other state-of-the-art approaches, continuously explore and adjust resource combinations during runtime. ...
... [8] focuses on dynamically allocating resources among LC tasks in a QoS-aware manner. While it ensures the QoS of LC tasks, Clite struggles to reclaim resources allocated to BE tasks during contention, leading to inefficient resource distribution and reduced system performance. ...
Article
Full-text available
The importance of data centers in modern computing environments is continuously increasing, and ensuring the Quality of Service (QoS) of Latency-Critical (LC) tasks is essential to prevent system failures and performance degradation. Resource isolation techniques, widely used to guarantee the QoS of LC tasks, allocate additional resources until the QoS requirements are satisfied. However, traditional methods do not consider the performance saturation point of Best-Effort (BE) tasks and allocate all remaining resources to BE tasks after meeting LC task requirements, resulting in resource wastage. Furthermore, the high overhead caused by real-time resource adjustments can lead to system performance degradation. To address these issues, this paper proposes a weight-based Markov Chain model for resource optimization. The proposed model evaluates resource efficiency through offline profiling of balanced resource combinations and determines the optimal resource allocation strategy in advance. By accurately identifying the minimum resources required to ensure LC task QoS and predicting the performance saturation point of BE tasks, the model prevents unnecessary resource wastage. Unlike traditional methods, the proposed model leverages weight-based profiling to significantly reduce real-time scheduling overhead and achieve balanced resource allocation. Experimental results demonstrate that the proposed model maintains LC task QoS while guaranteeing BE task performance at a level comparable to existing resource isolation scheduling techniques. Additionally, the model successfully optimizes resource utilization and effectively reduces profiling overhead compared to traditional methods. The proposed approach optimizes both resource efficiency and performance in multi-task server environments.
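As a rough illustration of the two quantities this abstract emphasizes, the sketch below derives the minimum LC allocation that meets a QoS target and the BE saturation point from hypothetical offline profiles. The numbers and thresholds are invented; this is not the paper's weight-based Markov Chain model.

```python
# Hypothetical offline profiles: cores -> p99 latency (ms) for the LC task,
# and cores -> throughput (ops/s) for the BE task.
lc_profile = {2: 9.5, 4: 4.1, 6: 2.8, 8: 2.6}
be_profile = {2: 120, 4: 230, 6: 310, 8: 318, 10: 320}

def min_lc_cores(profile, qos_ms):
    """Smallest allocation whose profiled p99 latency meets the QoS target."""
    feasible = [c for c, lat in sorted(profile.items()) if lat <= qos_ms]
    return feasible[0] if feasible else None

def be_saturation_point(profile, eps=0.05):
    """First allocation beyond which extra cores add < eps relative throughput."""
    items = sorted(profile.items())
    for (c0, t0), (c1, t1) in zip(items, items[1:]):
        if (t1 - t0) / t0 < eps:
            return c0
    return items[-1][0]

lc = min_lc_cores(lc_profile, qos_ms=5.0)     # -> 4 cores for the LC task
be = be_saturation_point(be_profile)          # -> 6 cores saturate the BE task
print(f"LC gets {lc} cores, BE is capped at {be}; the rest stays free")
```

Capping the BE allocation at its saturation point is what avoids the "all remaining resources to BE" wastage described above.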
... A cloud server may have many co-located LC (latency-critical) and BE (best-effort) services (e.g., the apps in Table 1) at the same time [14,24,38]. The existing OS swap approach does not leverage knowledge of application features. ...
... Figure 5 shows how the existing Linux swap mechanism performs on Redis. As in previous studies, the y-axis shows the 99th percentile response latency, which reflects the QoS of the LC application, and the latency axis is logarithmic [14,24,38]. ...
Article
Full-text available
This paper proposes iSwap, a new memory page swap mechanism that reduces the ineffective I/O swap operations and improves the QoS for applications with a high priority in the cloud environments. iSwap works in the OS kernel. iSwap accurately learns the reuse patterns for memory pages and makes the swap decisions accordingly to avoid ineffective operations. In the cases where memory pressure is high, iSwap compresses pages that belong to the latency-critical (LC) applications (or high-priority applications) and keeps them in main memory, avoiding I/O operations for these LC applications to ensure QoS; and iSwap evicts low-priority applications' pages out of main memory. iSwap has a low overhead and works well for cloud applications with large memory footprints. We evaluate iSwap on Intel x86 and ARM platforms. The experimental results show that iSwap can significantly reduce ineffective swap operations (8.0%-19.2%) and improve the QoS for LC applications (36.8%-91.3%) in cases where memory pressure is high, compared with the latest LRU-based approach widely used in modern OSes.
... Co-locating Workloads. Co-locating latency-critical applications with batch applications is a widely adopted approach to improve resource utilization in datacenters, with the popular resource partitioning techniques (Chen et al., 2019b; Zhu & Erez, 2016; Lo et al., 2015; Wu & Martonosi, 2008; Chen et al., 2019a; Kasture et al., 2015; Sanchez & Kozyrakis, 2011; Kasture & Sanchez, 2014; Zhang et al., 2019; Iorgulescu et al., 2018; Patel & Tiwari, 2020; Chen et al., 2023). Prior work has also explored multiplexing to serve multiple DNNs (Dhakal et al., 2020; Tan et al., 2021; Zhang et al., 2023b), LLMs (Duan et al., 2024), or LoRAs (Sheng et al., 2024a; Wu et al., 2024). ...
Preprint
Full-text available
Recent advancements in large language models (LLMs) have facilitated a wide range of applications with distinct quality-of-experience requirements, from latency-sensitive online tasks, such as interactive chatbots, to throughput-focused offline tasks like document summarization. While deploying dedicated machines for these services ensures high-quality performance, it often results in resource underutilization. This paper introduces HyGen, an interference-aware LLM serving system that enables efficient co-location of online and offline workloads while preserving latency requirements. HyGen incorporates two key innovations: (1) performance control mechanisms, including a latency predictor for batch execution time estimation and an SLO-aware profiler to quantify interference, and (2) SLO-aware offline scheduling policies that maximize throughput and prevent starvation, without compromising online serving latency. Our evaluation on production workloads shows that HyGen achieves up to 5.84x higher throughput compared to existing advances while maintaining comparable latency.
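A minimal sketch of the kind of SLO-aware admission this abstract describes: given a latency predictor for batch execution time (here a made-up linear model), admit only as many offline requests as the online latency slack allows. This illustrates the idea, not HyGen's actual scheduler; the predictor, its coefficients, and the limits are assumptions.

```python
def offline_budget(online_batch, slo_ms, predict_ms, step=1, max_offline=64):
    """Largest number of offline requests that can join this batch without
    pushing the predicted batch execution time past the online latency SLO."""
    n = 0
    while n < max_offline and predict_ms(online_batch, n + step) <= slo_ms:
        n += step
    return n

# Hypothetical linear latency predictor: fixed overhead plus per-request cost.
predict = lambda online, offline: 18.0 + 0.9 * online + 0.6 * offline
print(offline_budget(online_batch=8, slo_ms=40.0, predict_ms=predict))  # -> 24
```

A starvation guard, like the one the abstract mentions, would additionally force a minimum offline quota over longer windows.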
... Third, at many request rates, older GPUs (V100 and T4) can meet latency SLOs comparable to newer GPUs (A100), particularly for smaller models, revealing opportunities for running smaller models using older GPUs effectively. Takeaway 1: The prefill and decoding phases feature different characteristics, as prior works have shown [35,50]. Particularly for old GPUs, the prefill phase is more demanding as it is compute-bound, while the decoding phase is less demanding as it is memory-bound. ...
Preprint
Full-text available
LLMs have been widely adopted across many real-world applications. However, their widespread use comes with significant environmental costs due to their high computational intensity and resource demands. Specifically, this has driven the development of new generations of high-performing GPUs, exacerbating the problem of electronic waste and accelerating the premature disposal of devices. To address this problem, this paper focuses on reducing the carbon emissions of LLM serving by reusing older, low-performing GPUs. We present GreenLLM, an SLO-aware LLM serving framework designed to minimize carbon emissions by reusing older GPUs. GreenLLM builds on two identified use cases that disaggregate specific computations onto older GPUs, reducing carbon emissions while meeting performance goals. To deepen our understanding of the potential carbon savings from disaggregation, we also provide a theoretical analysis of its relationship with carbon intensity and GPU lifetime. Our evaluations show that GreenLLM reduces carbon emissions by up to 40.6% compared to running standard LLM serving on new GPUs only, while meeting latency SLOs for over 90% of requests across various applications, latency requirements, carbon intensities, and GPU lifetimes.
... However, unlike Ribbon, they either do not deal with heterogeneous instances or assume that the queries will not change with time [66,67]. Ribbon introduces a novel use case of Bayesian Optimization for building efficient inference service systems -expanding the surface area of what we had demonstrated previously for other use cases of Bayesian Optimization in the data center/cloud computing resource-management area (i.e., shared resource partitioning for microservices [68], shared resource management for fairness and performance [69], and performance auto-tuning [70]). None of [72,73]. ...
Preprint
Full-text available
Deep learning model inference is a key service in many businesses and scientific discovery processes. This paper introduces RIBBON, a novel deep learning inference serving system that meets two competing objectives: quality-of-service (QoS) targets and cost-effectiveness. The key idea behind RIBBON is to intelligently employ a diverse set of cloud computing instances (heterogeneous instances) to meet the QoS target and maximize cost savings. RIBBON devises a Bayesian Optimization-driven strategy that helps users build the optimal set of heterogeneous instances for their model inference service needs on cloud computing platforms, and it demonstrates its superiority over existing inference serving systems that use homogeneous instance pools. RIBBON saves up to 16% of the inference service cost for different learning models, including emerging deep learning recommender system models and drug-discovery enabling models.
... The vast majority of these performance tuning studies adopt BO to find optimized parameter configurations, since BO is theoretically grounded and can efficiently learn the relationship between performance and parameters from evaluations, thus intelligently tuning parameters for performance improvement. Because of these advantages, BO is also employed in resource allocation studies [45], [46]. However, to the best of our knowledge, previous studies have not focused on tuning the parameters of LLM inference engines and cannot address the three unique challenges faced in LLM inference engine tuning. ...
Preprint
A service-level objective (SLO) is a target performance metric of a service that cloud vendors aim to ensure. Delivering optimized SLOs can enhance user satisfaction and improve the competitiveness of cloud vendors. As large language models (LLMs) gain popularity across various fields, it is of great significance to optimize SLOs for LLM inference services. In this paper, we observe that adjusting the parameters of LLM inference engines can improve service performance, and that the optimal parameter configurations of different services differ. Therefore, we propose SCOOT, an automatic performance tuning system that optimizes SLOs for each LLM inference service by tuning the parameters of the inference engine. We first propose a generalized formulation of the tuning problem to handle various objectives and constraints between parameters, and SCOOT exploits Bayesian optimization (BO) to solve the problem via exploration and exploitation. Moreover, SCOOT adopts a random forest to learn hidden constraints during tuning to mitigate invalid exploration. To improve tuning efficiency, SCOOT utilizes parallel suggestions to accelerate the tuning process. Extensive experiments demonstrate that SCOOT significantly outperforms existing tuning techniques in SLO optimization while greatly improving tuning efficiency.
... Cluster resource and power management. A rich body of work seeks to improve resource efficiency under SLO constraints for a wide range of latency-sensitive workloads, such as microservices [76] and DL workloads, through effective resource sharing [6], [43], [51], dynamic allocation [71], and hardware reconfiguration [24]. Others focus on approaches that enable safe power management and oversubscription [16], [29], [49], leveraging workload characteristics [25], [75] and system state [62]. ...
Preprint
Full-text available
The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume large amounts of energy and, consequently, produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and the fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize the energy and cost of LLM serving under the service's performance SLOs. We show that at the service level, DynamoLLM conserves 53% energy and 38% operational carbon emissions, and reduces cost to the customer by 61%, while meeting the latency SLOs.
... However, DVFaaS still relies on CFS-driven power management, while also following a reactive strategy for adjusting core frequencies based on slack propagation. QoS-aware strategies: Several works have also been proposed to optimize resource allocation for latency-critical applications under QoS constraints [16,44,45,51,63]. While these works effectively address resource optimization to meet latency requirements, they do not focus on power management as a key optimization goal. ...
Preprint
Full-text available
Serverless workflows have emerged in FaaS platforms to represent the operational structure of traditional applications. With latency propagation effects becoming increasingly prominent, step-wise resource tuning is required to meet end-to-end Quality-of-Service (QoS) requirements. Modern processors' support for fine-grained Dynamic Voltage and Frequency Scaling (DVFS), coupled with the intermittent nature of serverless workflows, presents a unique opportunity to reduce power while meeting QoS. In this paper, we introduce a QoS-aware DVFS framework for serverless workflows. Ωkypous regulates the end-to-end latency of serverless workflows by supplying the system with the Core/Uncore frequency combination that minimizes power consumption. With Uncore DVFS enriching the space of power-efficient configurations, we devise a grey-box model that accurately projects a function's execution latency and power for the applied Core and Uncore frequency combination. To the best of our knowledge, Ωkypous is the first work that leverages Core and Uncore DVFS as an integral part of serverless workflows. Our evaluation on the analyzed Azure trace, against state-of-the-art (SotA) power managers, demonstrates an average power consumption reduction of 9% (up to 21%) while minimizing QoS violations.
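To make the Core/Uncore selection concrete, the sketch below enumerates frequency pairs against hypothetical latency and power models and returns the lowest-power pair that fits a step's end-to-end slack. The models, frequency lists, and constants are placeholders, not the paper's grey-box model.

```python
from itertools import product

# Hypothetical grey-box models: predicted latency (ms) and power (W) of a
# function at a core/uncore frequency pair (GHz). Real models would be fit
# from profiling data; these formulas are illustrative only.
def predict_latency(core_f, uncore_f, base_ms=40.0):
    return base_ms * (2.4 / core_f) * (0.8 + 0.2 * (2.0 / uncore_f))

def predict_power(core_f, uncore_f):
    return 10.0 + 6.0 * core_f ** 2 + 3.0 * uncore_f ** 2

CORE_FREQS = [1.2, 1.6, 2.0, 2.4]      # GHz, illustrative steps
UNCORE_FREQS = [1.2, 1.6, 2.0]

def pick_frequencies(slack_ms):
    """Lowest-power core/uncore pair whose predicted latency fits the slack."""
    feasible = [(predict_power(c, u), c, u)
                for c, u in product(CORE_FREQS, UNCORE_FREQS)
                if predict_latency(c, u) <= slack_ms]
    if not feasible:
        return max(CORE_FREQS), max(UNCORE_FREQS)   # fall back to full speed
    _, c, u = min(feasible)
    return c, u

print(pick_frequencies(slack_ms=55.0))  # frequency pair for this step's slack
```

The point of including the Uncore axis is visible even in this toy model: a higher Uncore frequency can admit a lower Core frequency at the same slack, and vice versa.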
... Traditional approaches to dynamic resource allocation for services can be divided into 1) heuristic methods, for instance those that use rule-based algorithms [1], [2], peak allocation [3], or threshold-based solutions [4], [5], [6], [7], [8]; and 2) methods based on analytical models, for instance queuing theory [9], [10], linear functions [11], Continuous Time Markov Chains (CTMC) [12], Stochastic Petri Nets (SPNs) [13], or fluid models [14]. Heuristic methods have the disadvantage that they rely heavily on domain expertise, often overprovision, and are generally tailored to a specific service scenario. ...
Article
Full-text available
We present a framework for achieving end-to-end management objectives for multiple services that concurrently execute on a service mesh. We apply reinforcement learning (RL) techniques to train an agent that periodically performs control actions to reallocate resources. We develop and evaluate the framework using a laboratory testbed where we run information and computing services on a service mesh, supported by the Istio and Kubernetes platforms. We investigate different management objectives that include end-to-end delay bounds on service requests, throughput objectives, cost-related objectives, and service differentiation. Our framework supports the design of a control agent for a given management objective. The management objective is defined first and then mapped onto available control actions. Several types of control actions can be executed simultaneously, which allows for efficient resource utilization. Second, the framework separates the learning of the system model and the operating region from the learning of the control policy. By first learning the system model and the operating region from testbed traces, we can instantiate a simulator and train the agent for different management objectives. Third, the use of a simulator shortens the training time by orders of magnitude compared with training the agent on the testbed. We evaluate the learned policies on the testbed and show the effectiveness of our approach in several scenarios. In one scenario, we design a controller that achieves the management objectives with 50% less system resources than Kubernetes HPA autoscaling.
... These studies typically pre-characterize applications under the influence of controlled-intensity interference [8], [12], [13], thereby limiting their applicability in production environments. Moreover, certain investigations employ feedback-based methods to mitigate interference among co-located applications [5], [6], which involves adjusting LCS resources based on QoS requirements with a few seconds of delay. However, these methods face challenges when applied in highly dynamic production clusters. ...
Preprint
Full-text available
Co-locating latency-critical services (LCSs) and best-effort jobs (BEJs) constitute the principal approach for enhancing resource utilization in production. Nevertheless, the co-location practice hurts the performance of LCSs due to resource competition, even when employing isolation technology. Through an extensive analysis of voluminous real trace data derived from two production clusters, we observe that BEJs typically exhibit periodic execution patterns and serve as the primary sources of interference to LCSs. Furthermore, despite occupying the same level of resource consumption, the diverse compositions of BEJs can result in varying degrees of interference on LCSs. Subsequently, we propose PISM, a proactive Performance Interference Scoring and Mitigating framework for LCSs through the optimization of BEJ scheduling. Firstly, PISM adopts a data-driven approach to establish a characterization and classification methodology for BEJs. Secondly, PISM models the relationship between the composition of BEJs on servers and the response time (RT) of LCSs. Thirdly, PISM establishes an interference scoring mechanism in terms of RT, which serves as the foundation for BEJ scheduling. We assess the effectiveness of PISM on a small-scale cluster and through extensive data-driven simulations. The experiment results demonstrate that PISM can reduce cluster interference by up to 41.5%, and improve the throughput of long-tail LCSs by 76.4%.
... A major challenge is to decide which requested application should be hosted by a particular edge server or group of edge servers that meet the application user's requirements. Yang et al. proposed a key-component acceleration design in [26], a strategy to ensure low latency for multi-tier applications, whereas the authors in [27] maximized resource utilization while avoiding QoS violations for services provisioned in data centers by aggregating web services with different latencies together with batch applications. To increase the throughput of batch applications while maintaining the required QoS of user-oriented services in cloud data centers, the authors in [22,23] proposed allocating computational resources to applications residing at the site. ...
Article
Full-text available
Future applications to be supported by 6G networks are envisaged to be realized by loosely-coupled and independent microservices. In order to achieve an optimal deployment of applications, smart resource management strategies will be required, working in a cost-effective and resource-efficient manner. Current cloud computing services are challenged to meet the explosive growth and demand of future use cases such as virtual/augmented/mixed reality (VR/AR/MR). The purpose of edge computing (EC) is to better address the latency and transmission requirements of those stringent future applications. However, high flexibility and rapid decision-making will be required, since EC suffers from limited resource availability. For this reason, this work proposes an artificial intelligence (AI) technique, based on reinforcement learning (RL), to make intelligent decisions on the optimal tier and edge-site selection to serve any request according to the application's category, constraints, and conflicting costs. In addition, a heuristic is proposed for mapping microservices within the selected edge-site when an application is deployed at the edge network. That heuristic exploits a ranking methodology based on the network topology and the available network and compute resources, while preserving the revenue of the mobile network operator (MNO). Simulation results show that the performance of the proposed RL approach is close to the optimal solution, reaching the cost minimization objective within an 8.3% margin; moreover, RL outperforms the considered benchmark algorithms in most of the conducted experiments.
Article
Recent years have witnessed increasing interest in machine learning (ML) inferences on serverless computing due to its auto-scaling and cost-effective properties. However, one critical aspect, function granularity, has been largely overlooked, limiting the potential of serverless ML. This paper explores the impact of function granularity on serverless ML, revealing its important effects on the SLO hit rates and resource costs of serverless applications. It further proposes adaptive granularity as an approach to addressing the phenomenon that no single granularity fits all applications and situations. It explores three predictive models and presents programming tools and runtime extensions to facilitate the integration of adaptive granularity into existing serverless platforms. Experiments show adaptive granularity produces up to a 29.2% improvement in SLO hit rates and up to a 24.6% reduction in resource costs over the state-of-the-art serverless ML which uses fixed granularity.
Article
Job packing is an effective technique to harvest the idle resources allocated to deep learning (DL) training jobs but not fully utilized, especially when clusters experience low utilization and users overestimate their resource needs. However, existing job packing techniques tend to be conservative due to the mismatch in scope and granularity between job packing and cluster scheduling. In particular, tapping the potential of job packing in the training cluster requires a local and fine-grained coordination mechanism. To this end, we propose a novel job-packing middleware named Gimbal, which operates between the cluster scheduler and the hardware resources. As middleware, Gimbal must not only facilitate coordination among the packed jobs but also support the various scheduling objectives of different schedulers. Gimbal achieves this dual functionality by introducing a set of worker calibration primitives designed to calibrate workers' execution status in a fine-grained manner. The primitives hide the complexity of the underlying job and resource management mechanisms, thus offering the generality and extensibility needed for crafting coordination policies tailored to various scheduling objectives. We implement Gimbal on a real-world GPU cluster and evaluate it with a set of representative DL training jobs. The results show that Gimbal improves different scheduling objectives by up to 1.32× compared with state-of-the-art job packing techniques.
Article
In edge computing environments, the co-location of latency-critical (LC) services and best-effort (BE) jobs is a key strategy for enhancing resource utilization. However, existing analysis-based co-location strategies incur high analytical costs and struggle to rapidly adapt to the evolving fields of edge computing and microservice architectures, often failing to effectively meet the demands of edge computing environments. Feedback-based co-location strategies, while reducing analytical overhead, lack sufficient research in multi-node environments, resulting in overly coarse-grained deployment strategies. These strategies do not adequately consider the dynamic workloads and resource constraints inherent in edge computing, leading to improper resource allocation and degraded performance. This paper introduces Cortex, a Kubernetes-based co-location framework for edge device clusters that addresses these challenges by innovatively transforming the co-location deployment problem into a Minimum Cost Maximum Flow (MCMF) problem and employing the Network Simplex Algorithm (NSA) to optimize resource allocation and ensure QoS of LC services. Cortex also features a dynamic adjustment mechanism that adapts to changes in the request load of LC services, thereby minimizing the performance loss of BE jobs and reducing resource wastage. Our experiments in real edge device clusters demonstrate that Cortex significantly improves system resource utilization by 12.81%, increases the QoS satisfaction rate by 17.86%, and boosts the number of BE jobs by 51.96% compared to existing methods.
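The MCMF formulation can be illustrated with a toy placement instance. The sketch below builds a standard source-jobs-nodes-sink flow network with hypothetical interference costs and solves it with networkx's min-cost-flow routine rather than the network simplex implementation the paper employs; job names, slot counts, and costs are invented for illustration.

```python
import networkx as nx

# Toy instance: three BE jobs to place on two edge nodes with limited free
# slots; edge costs are hypothetical interference/placement scores (lower is
# better), not values from the paper.
jobs = ["be1", "be2", "be3"]
nodes = {"nodeA": 2, "nodeB": 1}
cost = {("be1", "nodeA"): 3, ("be1", "nodeB"): 1,
        ("be2", "nodeA"): 2, ("be2", "nodeB"): 4,
        ("be3", "nodeA"): 1, ("be3", "nodeB"): 5}

G = nx.DiGraph()
for j in jobs:                                 # source -> job: place each job once
    G.add_edge("s", j, capacity=1, weight=0)
for (j, n), c in cost.items():                 # job -> node: cost of this placement
    G.add_edge(j, n, capacity=1, weight=c)
for n, slots in nodes.items():                 # node -> sink: capacity = free slots
    G.add_edge(n, "t", capacity=slots, weight=0)

flow = nx.max_flow_min_cost(G, "s", "t")
placement = {j: n for j in jobs for n in nodes if flow[j].get(n, 0) == 1}
print(placement, "total cost:", nx.cost_of_flow(G, flow))
```

Casting placement this way lets node capacities encode free co-location slots and edge weights encode expected QoS impact, so one flow computation yields the whole deployment.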
Article
The paper focuses on an understudied yet fundamental problem: existing methods typically average the utilization of multiple hardware threads to evaluate the available CPU resources. However, the approach could underestimate the actual usage of the underlying physical core for Simultaneous Multi-Threading (SMT) processors, leading to an overestimation of remaining resources. The overestimation propagates from microarchitecture to operating systems and cloud schedulers, which may misguide scheduling decisions, exacerbate CPU overcommitment, and increase Service Level Agreement (SLA) violations. To address the potential overestimation problem, we propose an SMT-aware and purely data-driven approach named Remaining CPU (RCPU) that reserves more CPU resources to restrict CPU overcommitment and prevent SLA violations. RCPU requires only a few modifications to the existing cloud infrastructures and can be scaled up to large data centers. Extensive evaluations in the data center proved that RCPU contributes to a reduction of SLA violations by 18% on average for 98% of all latency-sensitive applications. Under a benchmarking experiment, we prove that RCPU increases the accuracy by 69% in terms of Mean Absolute Error (MAE) compared to the state-of-the-art.
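A simple way to see the overestimation this abstract targets is to compare averaging the hardware threads with an estimate that accounts for SMT siblings sharing a physical core. The utilization numbers and the fixed SMT speedup below are illustrative assumptions; RCPU itself learns this relationship from data rather than using a constant.

```python
# Hypothetical per-hardware-thread utilizations, grouped by physical core
# (two SMT siblings per core).
sibling_util = [(0.9, 0.1), (0.6, 0.5), (0.2, 0.0), (0.8, 0.7)]

def naive_remaining(pairs):
    """Average the hardware threads, as many schedulers do."""
    threads = [u for pair in pairs for u in pair]
    return len(pairs) - sum(threads) / 2           # in physical-core units

def smt_aware_remaining(pairs, smt_speedup=1.25):
    """Treat a core as busy whenever either sibling is busy; credit only the
    modest extra throughput SMT provides when both siblings overlap."""
    used = 0.0
    for a, b in pairs:
        overlap = min(a, b)
        used += max(a, b) + overlap * (smt_speedup - 1.0)
    return len(pairs) - used

print(f"naive:     {naive_remaining(sibling_util):.2f} cores free")
print(f"SMT-aware: {smt_aware_remaining(sibling_util):.2f} cores free")
```

On this toy input the naive view reports roughly 2.1 free cores while the SMT-aware view reports closer to 1.2, which is the gap that misguides overcommitment decisions.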
Article
Full-text available
Co-locating latency-critical services (LCSs) and best-effort jobs (BEJs) constitute the principal approach for enhancing resource utilization in production clusters. Nevertheless, the co-location practice hurts the performance of LCSs due to resource competition, even when employing isolation technology. Through an extensive analysis of voluminous real trace data derived from two production clusters, we observe that BEJs typically exhibit periodic execution patterns and serve as the primary sources of interference to LCSs. Furthermore, despite occupying the same level of resource consumption, the diverse compositions of BEJs can result in varying degrees of interference on LCSs. Subsequently, we propose PISM, a proactive Performance Interference Scoring and Mitigating framework for LCSs through the optimization of BEJ scheduling. Firstly, PISM adopts a data-driven approach to establish a characterization and classification methodology for BEJs. Secondly, PISM models the relationship between the composition of BEJs on servers and the response time (RT) of LCSs. Thirdly, PISM establishes an interference scoring mechanism in terms of RT, which serves as the foundation for BEJ scheduling. We assess the effectiveness of PISM on a small-scale cluster and through extensive data-driven simulations. The experiment results demonstrate that PISM can reduce cluster interference by up to 41.5%, and improve the throughput of long-tail LCSs by 76.4%.
Article
Serverless computing systems have become very popular because of their natural advantages with respect to auto-scaling, load balancing and fast distributed processing. As of today, almost all serverless systems define two QoS classes: best-effort (BE) and latency-sensitive (LS). Systems typically do not offer any latency or QoS guarantees for BE jobs and run them on a best-effort basis. In contrast, systems strive to minimize the processing time for LS jobs. This work proposes a precise definition for these job classes and argues that we need to consider a bouquet of performance metrics for serverless applications, not just a single one. We thus propose the comprehensive latency (CL) that comprises the mean, tail latency, median and standard deviation of a series of invocations for a given serverless function. Next, we design a system FaaSCtrl, whose main objective is to ensure that every component of the CL is within a prespecified limit for an LS application, and for BE applications, these components are minimized on a best-effort basis. Given the sheer complexity of the scheduling problem in a large multi-application setup, we use the method of surrogate functions in optimization theory to design a simpler optimization problem that relies on performance and fairness. We rigorously establish the relevance of these metrics through characterization studies. Instead of using standard approaches based on optimization theory, we use a much faster reinforcement learning (RL) based approach to tune the knobs that govern process scheduling in Linux, namely the real-time priority and the assigned number of cores. RL works well in this scenario because the benefit of a given optimization is probabilistic in nature, owing to the inherent complexity of the system. We show using rigorous experiments on a set of real-world workloads that FaaSCtrl achieves its objectives for both LS and BE applications and outperforms the state-of-the-art by 36.9% (for tail response latency) and 44.6% (for response latency's std. dev.) for LS applications.
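The comprehensive latency is straightforward to compute from an invocation trace. The sketch below returns the four components named in the abstract for a hypothetical sample; the per-class limits that FaaSCtrl enforces on each component are not shown.

```python
import statistics

def comprehensive_latency(samples_ms, tail_pct=99):
    """The four CL components for one function's invocations:
    mean, median, tail percentile, and standard deviation."""
    ordered = sorted(samples_ms)
    idx = min(len(ordered) - 1, int(round(tail_pct / 100 * len(ordered))))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        f"p{tail_pct}": ordered[idx],
        "stdev": statistics.pstdev(ordered),
    }

# Hypothetical invocation latencies (ms) with one straggler.
print(comprehensive_latency([12, 15, 14, 13, 90, 16, 14, 13, 15, 14]))
```

Reporting all four components together is what distinguishes CL from the usual single tail-latency target: a single straggler barely moves the median but dominates the tail and the standard deviation.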
Article
It is an open challenge for cloud database service providers to guarantee tenants' service-level objectives (SLOs) and enjoy high resource utilization simultaneously. In this work, we propose a novel system Tao to overcome it. Tao consists of three key components: (i) tasklet-based DAG generator, (ii) tasklet-based DAG executor, and (iii) SLO-guaranteed scheduler. The core concept in Tao is tasklet, a coroutine-based lightweight execution unit of the physical execution plan. In particular, we first convert each SQL operator in the traditional physical execution plan into a set of fine-grained tasklets by the tasklet-based DAG generator. Then, we abstract the tasklet-based DAG execution procedure and implement the tasklet-based DAG executor using C++20 coroutines. Finally, we introduce the SLO-guaranteed scheduler for scheduling tenants' tasklets across CPU cores. This scheduler guarantees tenants' SLOs with a token bucket model and improves resource utilization with an on-demand core adjustment strategy. We build Tao on an open-sourced relational database, Hyrise, and conduct extensive experimental studies to demonstrate its superiority over existing solutions.
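The token bucket underlying the SLO-guaranteed scheduler can be sketched as follows. The rate and burst values are hypothetical per-tenant budgets, and the real scheduler additionally adjusts cores on demand; this is only the admission piece.

```python
import time

class TokenBucket:
    """Minimal token bucket: a tenant earns `rate` execution credits per
    second, up to a burst of `capacity`; a tasklet is admitted only if a
    credit is available."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, n=1):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=200, capacity=50)   # hypothetical tenant SLO budget
admitted = sum(bucket.try_consume() for _ in range(1000))
print(f"{admitted} tasklets admitted from this burst")
```

The bucket bounds how much a bursty tenant can monopolize cores in any window, which is what keeps other tenants' SLOs intact while idle credits still allow short bursts.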
Article
User-facing applications often experience excessive loads and are shifting towards the microservice architecture. To fully utilize heterogeneous resources, current datacenters have adopted the disaggregated storage and compute architecture, where the storage and compute clusters are suitable to deploy the stateful and stateless microservices, respectively. Moreover, when the local datacenter has insufficient resources to host excessive loads, a reasonable solution is moving some microservices to remote datacenters. However, it is nontrivial to decide the appropriate microservice deployment inside the local datacenter and identify the appropriate migration decision to remote datacenters, as microservices show different characteristics, and the local datacenter shows different resource contention situations. We therefore propose ELIS, an intra- and inter-datacenter scheduling system that ensures the Quality-of-Service (QoS) of the microservice application, while minimizing the network bandwidth usage and computational resource usage. ELIS comprises a resource manager, a cross-cluster microservice deployer, and a reward-based microservice migrator. The resource manager allocates near-optimal resources for microservices while ensuring QoS. The microservice deployer deploys the microservices between the storage and compute clusters in the local datacenter, to minimize the network bandwidth usage while satisfying the microservice resource demand. The microservice migrator migrates some microservices to remote datacenters when local resources cannot afford the excessive loads. Experimental results show that ELIS ensures the QoS of user-facing applications. Meanwhile, it reduces the public network bandwidth usage, the remote computational resource usage, and the local network bandwidth usage by 49.6%, 48.5%, and 60.7% on average, respectively.
Article
Workload consolidation is a widely used approach to enhance resource utilization in modern data centers. However, the concurrent execution of multiple jobs on a shared server introduces contention for essential shared resources such as CPU cores, Last Level Cache, and memory bandwidth. This contention negatively impacts job performance, leading to significant degradation in throughput. To mitigate resource contention, effective resource isolation techniques at the software or hardware level can be employed to partition the shared resources among colocated jobs. However, existing solutions for resource partitioning often assume a limited number of jobs that can be colocated, making them unsuitable for scenarios with large-scale job colocation due to several critical challenges. In this study, we propose Lavender, a framework specifically designed for addressing large-scale resource partitioning problems. Lavender incorporates several key techniques to tackle the challenges associated with large-scale resource partitioning, ensuring efficiency, adaptivity, and optimality. We conducted comprehensive evaluations of Lavender to validate its performance and analyze the reasons for its advantages. The experimental results demonstrate that Lavender significantly outperforms state-of-the-art baselines. Lavender is publicly available at https://github.com/yanxiaoqi932/OpenSourceLavender.
Article
Energy efficiency is among the most important challenges for computing. There has been an increasing gap between the rate at which processor performance improves and the lower rate of improvement in energy efficiency. This paper answers the question of how to reduce energy usage in heterogeneous datacenters, which undergo continuous upgrades and integrate high-performance 'big' and energy-efficient 'little' cores; as datacenters become more heterogeneous, traditional job scheduling algorithms become suboptimal. It proposes a unified hierarchical scheduling approach using a D-Choices technique, which considers interference and heterogeneity. To this end, we present a two-level hierarchical scheduler for datacenters that exploits increased server heterogeneity. It combines cluster- and node-level scheduling algorithms in a unified approach, and it can consider specific optimization objectives including job completion time, energy usage, and energy-delay-product (EDP). Its novelty lies in the unified approach and in modeling interference and heterogeneity. Experiments on a research cluster found that the proposed approach outperforms state-of-the-art schedulers by around 10% in job completion time, 39% in energy usage, and 42% in EDP. This paper demonstrates a unified approach as a promising direction in optimizing energy and performance for heterogeneous datacenters.
Article
The growing adoption of microservice architectures (MSAs) has led to major research and development efforts to address their challenges and improve their performance, reliability, and robustness. Important aspects of MSA that are not sufficiently covered in the open literature include efficient cloud resource allocation and optimal power management. Other aspects of MSA remain widely scattered in the literature, including cost analysis, service level agreements (SLAs), and demand-driven scaling. In this article, we examine recent cloud frameworks for containerized microservices with a focus on efficient resource utilization using auto-scaling. We classify these frameworks on the basis of their resource allocation models and underlying hardware resources. We highlight current MSA trends and identify workload-driven resource sharing within microservice meshes and SLA streamlining as two key areas for future microservice research.
Article
As the paradigm of cloud computing, a datacenter accommodates many co-running applications sharing system resources. Although highly concurrent applications improve resource utilization, the resulting resource contention can increase the uncertainty of quality of services (QoS). Previous studies have shown that achieving high resource utilization and high QoS simultaneously is challenging. Moreover, quantifying the intensity of interference across multiple concurrent applications in a datacenter, where applications can be either latency-critical (LC) or best-effort (BE), poses a significant challenge. To address these issues, we propose Ah-Q, which comprises two theorems, a metric, and a scheduling strategy. First, we present the necessary and sufficient conditions to precisely test whether a datacenter is both QoS guaranteed and high-throughput. We also present a theorem that reveals the relationship between tail latency and throughput. Our theoretical results are insightful and useful for building datacenters that have desirable performance. Second, we propose the “System Entropy” (E_S) to quantitatively measure the interference within a datacenter. Interference arises due to resource scarcity or irrational scheduling, and effective scheduling can alleviate resource scarcity. To assess the effectiveness of a resource scheduling strategy, we introduce the concept of “resource equivalence”. We evaluate various resource scheduling strategies to demonstrate the correctness and effectiveness of the proposed theory. Third, we introduce a new resource scheduling strategy, ARQ, that leverages both isolation and sharing of resources. Our evaluations show that ARQ significantly outperforms state-of-the-art strategies PARTIES and CLITE in reducing the tail latency of LC applications and increasing the IPC of BE applications.
Conference Paper
Full-text available
Cloud services have recently started undergoing a major shift from monolithic applications, to graphs of hundreds or thousands of loosely-coupled microservices. Microservices fundamentally change a lot of assumptions current cloud systems are designed with, and present both opportunities and challenges when optimizing for quality of service (QoS) and cloud utilization. In this paper we explore the implications microservices have across the cloud system stack. We first present DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular and extensible. DeathStarBench includes a social network, a media service, an e-commerce site, a banking system, and IoT applications for coordination control of UAV swarms. We then use DeathStarBench to study the architectural characteristics of microservices, their implications in networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks. Finally, we explore the tail at scale effects of microservices in real deployments with hundreds of users, and highlight the increased pressure they put on performance predictability.
Article
Full-text available
Cloud multi-tenancy is typically constrained to a single interactive service colocated with one or more batch, low-priority services, whose performance can be sacrificed when deemed necessary. Approximate computing applications offer the opportunity to enable tighter colocation among multiple applications whose performance is important. We present Pliant, a lightweight cloud runtime that leverages the ability of approximate computing applications to tolerate some loss in their output quality to boost the utilization of shared servers. During periods of high resource contention, Pliant employs incremental and interference-aware approximation to reduce contention in shared resources, and prevent QoS violations for co-scheduled interactive, latency-critical services. We evaluate Pliant across different interactive and approximate computing applications, and show that it preserves QoS for all co-scheduled workloads, while incurring a 2.1% loss in output quality, on average.
Article
Full-text available
Most cloud computing optimizers explore and improve one workload at a time. When optimizing many workloads, the single-optimizer approach can be prohibitively expensive. Accordingly, we examine "collective optimizers" that concurrently explore and improve a set of workloads, significantly reducing measurement costs. Our large-scale empirical study shows that there is often a single cloud configuration which is surprisingly near-optimal for most workloads. Consequently, we create a collective optimizer, MICKY, that reformulates the task of finding the near-optimal cloud configuration as a multi-armed bandit problem. MICKY efficiently balances exploration (of new cloud configurations) and exploitation (of known good cloud configurations). Our experiments show that MICKY achieves, on average, an 8.6-fold reduction in measurement cost compared to the state-of-the-art method while finding near-optimal solutions. Hence we propose MICKY as the basis of a practical collective optimization method for finding good cloud configurations (based on various constraints such as budget and tolerance to near-optimal configurations).
Article
Full-text available
As software systems grow in complexity, the space of possible configurations grows exponentially. Within this increasing complexity, developers, maintainers, and users cannot keep track of the interactions between all the various configuration options. Finding the optimally performing configuration of a software system for a given setting is challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, collecting enough data on enough sample configurations can be very expensive, since each such sample requires configuring, compiling, and executing the entire system against a complex test suite. The central insight of this paper is that choosing a suitable source (a.k.a. "bellwether") to learn from, plus a simple transfer learning scheme, will often outperform much more complex transfer learning methods. Using this insight, this paper proposes BEETLE, a novel bellwether-based transfer learning scheme, which can identify a suitable source and use it to find near-optimal configurations of a software system. BEETLE significantly reduces the cost (in terms of the number of measurements of sample configurations) to build performance models. We evaluate our approach with 61 scenarios based on 5 software systems and demonstrate that BEETLE is beneficial in all cases. This approach offers a new high-water mark in configuring software systems. Specifically, BEETLE can find configurations that are as good as or better than those found by anything else while requiring only 1/7th of the evaluations needed by the state-of-the-art.
Article
Full-text available
Finding the right cloud configuration for workloads is an essential step to ensure good performance and contain running costs. A poor choice of cloud configuration decreases application performance and increases running cost significantly. While Bayesian Optimization is effective and applicable to any workload, it is fragile because performance and workload are hard to model (to predict). In this paper, we propose a novel method, SCOUT. The central insight of SCOUT is that using prior measurements, even those for different workloads, improves search performance and reduces search cost. At its core, SCOUT extracts search hints (inferences of resource requirements) from low-level performance metrics. Such hints enable SCOUT to navigate the search space more efficiently: only the spotlighted region will be searched. We evaluate SCOUT with 107 workloads on Apache Hadoop and Spark. The experimental results demonstrate that our approach finds better cloud configurations with a lower search cost than state-of-the-art methods. Based on this work, we conclude that (i) low-level performance information is necessary for finding the right cloud configuration in an effective, efficient and reliable way, and (ii) a search method can be guided by historical data, thereby reducing cost and improving performance.
Article
Full-text available
Scaling Bayesian optimization to high dimensions is a challenging task, as the global optimization of a high-dimensional acquisition function can be expensive and often infeasible. Existing methods depend either on a limited set of active variables or on an additive form of the objective function. We propose a new method for high-dimensional Bayesian optimization that uses a dropout strategy to optimize only a subset of variables at each iteration. We derive theoretical bounds for the regret and show how they can inform the derivation of our algorithm. We demonstrate the efficacy of our algorithms for optimization on two benchmark functions and two real-world applications: training cascade classifiers and optimizing alloy composition.
Article
Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management. We present KPart, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems. KPart first groups applications into clusters, then partitions the cache among these clusters. To build clusters, KPart relies on a novel technique to estimate the performance loss an application suffers when sharing a partition. KPart automatically chooses the number of clusters, balancing the isolation benefits of way-partitioning with its potential performance impact. KPart uses detailed profiling information to make these decisions. This information can be gathered either offline, or online at low overhead using a novel profiling mechanism. We evaluate KPart in a real system and in simulation. KPart improves throughput by 24% on average (up to 79%) on an Intel Broadwell-D system, whereas prior per-application partitioning policies improve throughput by just 1.7% on average and hurt 30% of workloads. Simulation results show that KPart achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.
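On current Linux servers, the way-partitioning that KPart builds on is exposed through the resctrl filesystem (Intel CAT). The sketch below assigns hypothetical LC and BE clusters to disjoint L3 way masks; the group names, masks, and PIDs are placeholders, it requires root and a mounted resctrl, and it is not KPart's clustering or profiling logic.

```python
import os

RESCTRL = "/sys/fs/resctrl"   # assumes resctrl is mounted and we run as root

def create_partition(name, way_mask, pids, cache_id=0):
    """Create a resctrl group, restrict it to the L3 ways in `way_mask`
    (a contiguous bitmask, e.g. 0x007 for the low three ways), and move
    the given PIDs into it."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:{cache_id}={way_mask:x}\n")
    for pid in pids:                      # one PID per write to the tasks file
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))

# Hypothetical clustering outcome: the LC cluster gets eight ways,
# the BE cluster gets three non-overlapping ways.
create_partition("lc_cluster", 0x7f8, pids=[1234, 1235])
create_partition("be_cluster", 0x007, pids=[2345])
```

The limited number of such groups and the coarse way granularity are exactly the hardware constraints KPart works around by sharing partitions within clusters.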
Conference Paper
The microservice architecture has dramatically reduced user effort in adopting and maintaining servers by providing a catalog of functions as services that can be used as building blocks to construct applications. This has enabled datacenter operators to look at managing datacenter hosting microservices quite differently from traditional infrastructures. Such a paradigm shift calls for a need to rethink resource management strategies employed in such execution environments. We observe that the visibility enabled by a microservices execution framework can be exploited to achieve high throughput and resource utilization while still meeting Service Level Agreements, especially in multi-tenant execution scenarios.
Conference Paper
Resource under-utilization is common in cloud data centers. Prior works have proposed improving utilization by running provider workloads in the background, colocated with tenant workloads. However, an important challenge that has still not been addressed is considering the tenant workloads as a black-box. We present Scavenger, a batch workload manager that opportunistically runs containerized batch jobs next to black-box tenant VMs to improve utilization. Scavenger is designed to work without requiring any offline profiling or prior information about the tenant workload. To meet the tenant VMs' resource demand at all times, Scavenger dynamically regulates the resource usage of batch jobs, including processor usage, memory capacity, and network bandwidth. We experimentally evaluate Scavenger on two different testbeds using latency-sensitive tenant workloads colocated with Spark jobs in the background and show that Scavenger significantly increases resource usage without compromising the resource demands of tenant VMs.
Article
Managing high-percentile tail latencies is key to designing user-facing cloud services. Rare system hiccups or unusual code paths make some requests take 10×-100× longer than the average. Prior work seeks to reduce tail latency by trying to address primarily root causes of slow requests. However, often the bulk of requests comprising the tail are not these rare slow-to-execute requests. Rather, due to head-of-line blocking, most of the tail comprises requests enqueued behind slow-to-execute requests. Under high disparity service distributions, queuing effects drastically magnify the impact of rare system hiccups and can result in high tail latencies even under modest load. We demonstrate that improving the queuing behavior of a system often yields greater benefit than mitigating the individual system hiccups that increase service time tails. We suggest two general directions to improve system queuing behavior (server pooling and common-case service acceleration) and discuss circumstances where each is most beneficial.
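The queuing effect described above is easy to reproduce with a tiny FIFO simulation: a 0.1% "hiccup" in service time inflates the 99th-percentile sojourn time far beyond the 99th-percentile service time. The service distribution, hiccup probability, and load below are arbitrary illustrative choices, not values from the article.

```python
import random

random.seed(1)

def simulate_fifo(n=200_000, load=0.6):
    """Single FIFO server; service time is 1 ms normally but 50 ms for 0.1%
    of requests (a rare 'hiccup'). Returns p99 of service vs. sojourn time."""
    mean_service = 0.999 * 1.0 + 0.001 * 50.0
    arrival_rate = load / mean_service
    t = free_at = 0.0
    services, sojourns = [], []
    for _ in range(n):
        t += random.expovariate(arrival_rate)      # Poisson arrivals
        s = 50.0 if random.random() < 0.001 else 1.0
        start = max(t, free_at)                    # wait behind earlier work
        free_at = start + s
        services.append(s)
        sojourns.append(free_at - t)               # queueing + service
    p99 = lambda xs: sorted(xs)[int(0.99 * len(xs))]
    return p99(services), p99(sojourns)

print(simulate_fifo())   # p99 service stays ~1 ms; p99 sojourn is far larger
```

Even though only one request in a thousand is slow to execute, every request that arrives behind it inherits its delay, which is why the p99 of sojourn time dwarfs the p99 of service time.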
Conference Paper
Datacenters use accelerators to provide the significant compute throughput required by emerging user-facing services. The diurnal user access pattern of user-facing services provides a strong incentive to co-located applications for better accelerator utilization, and prior work has focused on enabling co-location on multicore processors and traditional non-preemptive accelerators. However, current accelerators are evolving towards spatial multitasking and introduce a new set of challenges to eliminate QoS violation. To address this open problem, we explore the underlying causes of QoS violation on spatial multitasking accelerators. In response to these causes, we propose Laius, a runtime system that carefully allocates the computation resource to co-located applications for maximizing the throughput of batch applications while guaranteeing the required QoS of user-facing services. Our evaluation on a Nvidia RTX 2080Ti GPU shows that Laius improves the utilization of spatial multitasking accelerators by 20.8%, while achieving the 99%-ile latency target for user-facing services.
Article
We introduce Caliper, a technique for accurately estimating performance interference occurring in shared servers. Caliper overcomes the limitations of prior approaches by leveraging a micro-experiment-based technique. In contrast to state-of-the-art approaches that focus on periodically pausing co-running applications to estimate slowdown, Caliper utilizes a strategic phase-triggered technique to capture interference due to co-location. This enables Caliper to orchestrate an accurate and low-overhead interference estimation technique that can be readily deployed in existing production systems. We evaluate Caliper for a broad spectrum of workload scenarios, demonstrating its ability to seamlessly support up to 16 applications running simultaneously and outperform the state-of-the-art approaches.
Conference Paper
Existing techniques for improving datacenter utilization while guaranteeing QoS are based on the assumption that queries have similar behaviors. However, user queries in emerging compute-demanding services demonstrate significantly diverse behavior and require adaptive parallelism. Our study shows that the end-to-end latency of a compute-demanding query is determined jointly by the system-wide load, its workload, its parallelism, contention on the shared cache, and memory bandwidth. When hosting such new services, the current cross-query resource allocation results in either severe QoS violations or significant resource under-utilization. To maximize hardware utilization while guaranteeing QoS, we present Avalon, a runtime system that independently allocates shared resources for each query. Avalon first provides an automatic feature identification tool based on Lasso regression to identify features that are relevant to a query's performance. It then establishes models that precisely predict a query's duration under various resource configurations. Based on the accurate prediction model, Avalon proactively allocates "just-enough" cores and shared cache space to each query, so that the remaining resources can be assigned to best-effort applications. During runtime, Avalon monitors the progress of each query and mitigates any possible QoS violation due to memory bandwidth contention, occasional I/O contention, or unpredictable system interference. Our results show that Avalon improves utilization by 28.9% on average compared with state-of-the-art techniques while achieving the 99%-ile latency target.
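The Lasso-based feature identification step the abstract mentions can be sketched generically. The example below uses synthetic data and made-up feature names purely to show how L1 regularization zeroes out irrelevant features; it is not Avalon's actual tool.

```python
# Minimal sketch of Lasso-based feature identification on synthetic data.
# Feature names and the data are invented for illustration; the point is that
# L1 regularization drives coefficients of irrelevant features to zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = ["load", "input_size", "parallelism", "llc_occupancy", "pid_parity"]
X = rng.normal(size=(500, len(features)))
# Assume latency depends only on the first four features plus noise.
latency = 3*X[:, 0] + 2*X[:, 1] - 1.5*X[:, 2] + 0.8*X[:, 3] + rng.normal(0, 0.1, 500)

model = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), latency)
for name, coef in zip(features, model.coef_):
    print(f"{name:15s} {coef:+.3f}")   # "pid_parity" should come out ~0
```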
Conference Paper
The variety and complexity of microservices in warehouse-scale data centers has grown precipitously over the last few years to support a growing user base and an evolving product portfolio. Despite accelerating microservice diversity, there is a strong requirement to limit diversity in underlying server hardware to maintain hardware resource fungibility, preserve procurement economies of scale, and curb qualification/test overheads. As such, there is an urgent need for strategies that enable limited server CPU architectures (a.k.a "SKUs") to provide performance and energy efficiency over diverse microservices. To this end, we first undertake a comprehensive characterization of the top seven microservices that run on the compute-optimized data center fleet at Facebook. Our characterization reveals profound diversity in OS and I/O interaction, cache misses, memory bandwidth utilization, instruction mix, and CPU stall behavior. Whereas customizing a CPU SKU for each microservice might be beneficial, it is prohibitive. Instead, we argue for "soft SKUs", wherein we exploit coarse-grain (e.g., boot time) configuration knobs to tune the platform for a particular microservice. We develop a tool, μSKU, that automates search over a soft-SKU design space using A/B testing in production and demonstrate how it can obtain statistically significant gains (up to 7.2% and 4.5% performance improvement over stock and production servers, respectively) with no additional hardware requirements.
Conference Paper
Performance unpredictability is a major roadblock towards cloud adoption, and has performance, cost, and revenue ramifications. Predictable performance is even more critical as cloud services transition from monolithic designs to microservices. Detecting QoS violations after they occur in systems with microservices results in long recovery times, as hotspots propagate and amplify across dependent services. We present Seer, an online cloud performance debugging system that leverages deep learning and the massive amount of tracing data cloud systems collect to learn spatial and temporal patterns that translate to QoS violations. Seer combines lightweight distributed RPC-level tracing, with detailed low-level hardware monitoring to signal an upcoming QoS violation, and diagnose the source of unpredictable performance. Once an imminent QoS violation is detected, Seer notifies the cluster manager to take action to avoid performance degradation altogether. We evaluate Seer both in local clusters, and in large-scale deployments of end-to-end applications built with microservices with hundreds of users. We show that Seer correctly anticipates QoS violations 91% of the time, and avoids the QoS violation to begin with in 84% of cases. Finally, we show that Seer can identify application-level design bugs, and provide insights on how to better architect microservices to achieve predictable performance.
Conference Paper
Multi-tenancy in modern datacenters is currently limited to a single latency-critical, interactive service, running alongside one or more low-priority, best-effort jobs. This limits the efficiency gains from multi-tenancy, especially as an increasing number of cloud applications are shifting from batch jobs to services with strict latency requirements. We present PARTIES, a QoS-aware resource manager that enables an arbitrary number of interactive, latency-critical services to share a physical node without QoS violations. PARTIES leverages a set of hardware and software resource partitioning mechanisms to adjust allocations dynamically at runtime, in a way that meets the QoS requirements of each co-scheduled workload, and maximizes throughput for the machine. We evaluate PARTIES on state-of-the-art server platforms across a set of diverse interactive services. Our results show that PARTIES improves throughput under QoS by 61% on average, compared to existing resource managers, and that the rate of improvement increases with the number of co-scheduled applications per physical host.
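To make the idea of runtime reallocation among several latency-critical services concrete, here is a schematic control loop, explicitly not the PARTIES algorithm, that moves one resource unit from the co-scheduled service with the most latency slack to one violating its target. Measurement and actuation are stubbed; service names, targets, and the integer "core" model are assumptions.

```python
# Schematic QoS-driven reallocation loop for co-located latency-critical
# services. This is NOT the PARTIES algorithm: measurement and actuation are
# stubs, and the simple core-unit resource model is an assumption.
import random, time

services = {
    "memcached": {"target_ms": 1.0, "cores": 4},
    "search":    {"target_ms": 5.0, "cores": 4},
    "ads":       {"target_ms": 3.0, "cores": 4},
}

def measure_tail_latency_ms(name):
    # Stub: in a real system this comes from the service's latency monitor.
    return services[name]["target_ms"] * random.uniform(0.6, 1.3)

def apply_allocation(name, cores):
    # Stub: in a real system this would resize a cpuset or cache partition.
    print(f"{name}: {cores} cores")

while True:
    lat = {s: measure_tail_latency_ms(s) for s in services}
    slack = {s: services[s]["target_ms"] - lat[s] for s in services}
    violator = min(slack, key=slack.get)
    donor = max(slack, key=slack.get)
    if slack[violator] < 0 and donor != violator and services[donor]["cores"] > 1:
        services[donor]["cores"] -= 1      # take a core from the service with most slack
        services[violator]["cores"] += 1   # give it to the one missing its target
        apply_allocation(donor, services[donor]["cores"])
        apply_allocation(violator, services[violator]["cores"])
    time.sleep(1)
```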
Article
Stochastic gradient descent (SGD) is a fundamental algorithm which has had a profound impact on machine learning. This article surveys some important results on SGD and its variants that arose in machine learning.
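For reference, the standard single-sample and mini-batch SGD updates that such surveys build on (textbook forms, not specific to this article) are:

```latex
% At step t, sample an index i_t (or a mini-batch B_t) uniformly at random:
\theta_{t+1} = \theta_t - \eta_t \,\nabla f_{i_t}(\theta_t)
\qquad \text{or} \qquad
\theta_{t+1} = \theta_t - \frac{\eta_t}{|B_t|} \sum_{i \in B_t} \nabla f_i(\theta_t)
```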
Article
In many domains, the previous decade was characterized by increasing data volumes and growing complexity of data analyses, creating new demands for batch processing on distributed systems. Effective operation of these systems is challenging when facing uncertainties about the performance of jobs and tasks under varying resource configurations, e.g., for scheduling and resource allocation. We survey predictive performance modeling (PPM) approaches to estimate performance metrics such as execution duration, required memory or wait times of future jobs and tasks based on past performance observations. We focus on non-intrusive methods, i.e., methods that can be applied to any workload without modification, since the workload is usually a black box from the perspective of the systems managing the computational infrastructure. We classify and compare sources of performance variation, predicted performance metrics, limitations and challenges, required training data, use cases, and the underlying prediction techniques. We conclude by identifying several open problems and pressing research needs in the field.
Conference Paper
Memory bandwidth is a highly performance-critical shared resource on modern computer systems. To prevent the contention on memory bandwidth among the collocated workloads, prior works have investigated memory bandwidth partitioning techniques. Despite the extensive prior works, it still remains unexplored to characterize the widely-used memory bandwidth partitioning techniques based on various metrics and investigate a hybrid technique that employs multiple memory bandwidth partitioning techniques to improve the overall efficiency. To bridge this gap, we first present the in-depth characterization of the three widely-used memory bandwidth partitioning techniques (i.e., thread packing, clock modulation, and Intel's Memory Bandwidth Allocation (MBA)) in terms of dynamic range, granularity, and efficiency. Guided by the characterization results, we propose HyPart, a hybrid technique for practical memory bandwidth partitioning on commodity servers. HyPart composes the three memory bandwidth partitioning techniques in a constructive manner and dynamically performs optimizations based on the application characteristics without requiring any offline profiling. Our experimental results demonstrate the effectiveness of HyPart in that it provides a wider dynamic range and finer-grain control of memory bandwidth and achieves significantly higher efficiency than the conventional memory bandwidth partitioning techniques.
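One of the three mechanisms characterized here, Intel MBA, is exposed on Linux through the resctrl filesystem. The sketch below shows only that mechanism in its simplest static form; it is not HyPart. Availability depends on the CPU and kernel, and the group name, PID, and 50% throttle are illustrative values.

```python
# Hedged sketch: throttling a process group's memory bandwidth with Intel MBA
# via the Linux resctrl filesystem. Requires MBA-capable hardware and
# `mount -t resctrl resctrl /sys/fs/resctrl`. Group name, PID, and the 50%
# throttle are placeholders, not recommendations.
import os

RESCTRL = "/sys/fs/resctrl"
GROUP = os.path.join(RESCTRL, "batch")     # hypothetical control group

os.makedirs(GROUP, exist_ok=True)

# Assign the batch job's PIDs to the group.
for pid in [12345]:                        # placeholder PID
    with open(os.path.join(GROUP, "tasks"), "w") as f:
        f.write(str(pid))

# Cap the group at roughly 50% of available memory bandwidth on socket 0.
with open(os.path.join(GROUP, "schemata"), "w") as f:
    f.write("MB:0=50\n")
```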
Conference Paper
Stragglers are exceptionally slow tasks within a job that delay its completion. Stragglers, which are uncommon within a single job, are pervasive in datacenters with many jobs. We present Hound, a statistical machine learning framework that infers the causes of stragglers from traces of datacenter-scale jobs. Hound is designed to achieve several objectives: datacenter-scale diagnosis, unbiased inference, interpretable models, and computational efficiency. We demonstrate Hound's capabilities for a production trace from Google's warehouse-scale datacenters and two Spark traces from Amazon EC2 clusters.
Article
Cloud services have recently undergone a shift from monolithic applications to microservices, with hundreds or thousands of loosely-coupled microservices comprising the end-to-end application. Microservices present both opportunities and challenges when optimizing for quality of service (QoS) and cloud utilization. In this paper we explore the implications cloud microservices have on system bottlenecks, and datacenter server design. We first present and characterize an end-to-end application built using tens of popular open-source microservices that implements a movie renting and streaming service, and is modular and extensible. We then use the end-to-end service to study the scalability and performance bottlenecks of microservices, and highlight implications they have on the design of datacenter hardware. Specifically, we revisit the long-standing debate of brawny versus wimpy cores in the context of microservices, we quantify the I-cache pressure they introduce, and measure the time spent in computation versus communication between microservices over RPCs. As more cloud applications switch to this new programming model, it is increasingly important to revisit the assumptions we have previously used to build and manage cloud systems.
Conference Paper
In the modern multi-tenant cloud, resource sharing increases utilization but causes performance interference between tenants. More generally, performance isolation is also relevant in any multi-workload scenario involving shared resources. Last level cache (LLC) on processors is shared by all CPU cores in x86, thus the cloud tenants inevitably suffer from the cache flush by their noisy neighbors running on the same socket. Intel Cache Allocation Technology (CAT) provides a mechanism to assign cache ways to cores to enable cache isolation, but its static configuration can result in underutilized cache when a workload cannot benefit from its allocated cache capacity, and/or lead to sub-optimal performance for workloads that do not have enough assigned capacity to fit their working set. In this work, we propose a new dynamic cache management technology (dCat) to provide strong cache isolation with better performance. For each workload, we target a consistent, minimum performance bound irrespective of others on the socket and dependent only on its rightful share of the LLC capacity. In addition, when there is spare capacity on the socket, or when some workloads are not obtaining beneficial performance from their cache allocation, dCat dynamically reallocates cache space to cache-intensive workloads. We have implemented dCat in Linux on top of CAT to dynamically adjust cache mappings. dCat requires no modifications to applications so that it can be applied to all cloud workloads. Based on our evaluation, we see an average of 25% improvement over shared cache and 15.7% over static CAT for selected, memory intensive, SPEC CPU2006 workloads. For typical cloud workloads, with Redis we see 57.6% improvement (over shared LLC) and 26.6% improvement (over static partition) and with ElasticSearch we see 11.9% improvement over both.
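The static CAT partitioning that dCat improves upon can be expressed as way bitmasks in the same resctrl interface. The sketch below shows that static baseline only (dCat itself rewrites these masks dynamically); the way count, masks, and group names are assumptions about the platform.

```python
# Hedged sketch: a *static* CAT partition via resctrl, i.e. the baseline that
# dCat improves on by adjusting masks at runtime. Way-mask width depends on
# the CPU (an 11-way LLC is assumed); group names and masks are illustrative.
import os

RESCTRL = "/sys/fs/resctrl"

def make_group(name, l3_mask):
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:0={l3_mask:x}\n")     # contiguous way mask on cache id 0
    return group

make_group("redis", 0b11111110000)         # 7 ways for the latency-sensitive tenant
make_group("noisy", 0b00000001111)         # 4 ways for the noisy neighbor
```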
Conference Paper
In a multicore system, effective management of shared last level cache (LLC), such as hardware/software cache partitioning, has attracted significant research attention. Some eminent progress is that Intel introduced Cache Allocation Technology (CAT) to its commodity processors recently. CAT implements way partitioning and provides software interface to control cache allocation. Unfortunately, CAT can only allocate at way level, which does not scale well for a large thread or program count to serve their various performance goals effectively. This paper proposes Dynamic Cache Allocation with Partial Sharing (DCAPS), a framework that dynamically monitors and predicts a multi-programmed workload's cache demand, and reallocates LLC given a performance target. Further, DCAPS explores partial sharing of a cache partition among programs and thus practically achieves cache allocation at a finer granularity. DCAPS consists of three parts: (1) Online Practical Miss Rate Curve (OPMRC), a low-overhead software technique to predict online miss rate curves (MRCs) of individual programs of a workload; (2) a prediction model that estimates the LLC occupancy of each individual program under any CAT allocation scheme; (3) a simulated annealing algorithm that searches for a near-optimal CAT scheme given a specific performance goal. Our experimental results show that DCAPS is able to optimize for a wide range of performance targets and can scale to a large core count.
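The simulated-annealing search mentioned in the abstract can be illustrated generically. The snippet below uses a made-up cost function in place of DCAPS's MRC-based occupancy and performance models, and searches over two-workload way splits only; it is a sketch of the search strategy, not of DCAPS.

```python
# Generic simulated-annealing search over cache-way allocations for two
# workloads. The cost function is a stand-in for DCAPS's MRC-based model.
import math, random

random.seed(1)
TOTAL_WAYS = 11

def cost(ways_a):
    # Stub: pretend workload A stops benefiting beyond 6 ways and B beyond 4.
    ways_b = TOTAL_WAYS - ways_a
    return 1.0 / min(ways_a, 6) + 1.0 / min(ways_b, 4)

state, temp = 5, 1.0
best = state
while temp > 1e-3:
    cand = min(TOTAL_WAYS - 1, max(1, state + random.choice([-1, 1])))
    delta = cost(cand) - cost(state)
    if delta < 0 or random.random() < math.exp(-delta / temp):
        state = cand
    if cost(state) < cost(best):
        best = state
    temp *= 0.99                            # geometric cooling schedule
print(f"best split: {best} ways for A, {TOTAL_WAYS - best} for B")
```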
Article
Latency-critical applications, common in datacenters, must achieve small and predictable tail (e.g., 95th or 99th percentile) latencies. Their strict performance requirements limit utilization and efficiency in current datacenters. These problems have sparked research in hardware and software techniques that target tail latency. However, research in this area is hampered by the lack of a comprehensive suite of latency-critical benchmarks. We present TailBench, a benchmark suite and evaluation methodology that makes latency-critical workloads as easy to run and characterize as conventional, throughput-oriented ones. TailBench includes eight applications that span a wide range of latency requirements and domains, and a harness that implements a robust and statistically sound load-testing methodology. The modular design of the TailBench harness facilitates multiple load-testing scenarios, ranging from multi-node configurations that capture network overheads, to simplified single-node configurations that allow measuring tail latency in simulation. Validation results show that the simplified configurations are accurate for most applications. This flexibility enables rapid prototyping of hardware and software techniques for latency-critical workloads.
Article
Scheduling diverse applications in large, shared clusters is particularly challenging. Recent research on cluster scheduling focuses either on scheduling speed, using sampling to quickly assign resources to tasks, or on scheduling quality, using centralized algorithms that search for the resources that improve both task performance and cluster utilization. We present Tarcil, a distributed scheduler that targets both scheduling speed and quality. Tarcil uses an analytically derived sampling framework that adjusts the sample size based on load, and provides statistical guarantees on the quality of allocated resources. It also implements admission control when sampling is unlikely to find suitable resources. This makes it appropriate for large, shared clusters hosting short- and long-running jobs. We evaluate Tarcil on clusters with hundreds of servers on EC2. For highly-loaded clusters running short jobs, Tarcil improves task execution time by 41% over a distributed, sampling-based scheduler. For more general scenarios, Tarcil achieves near-optimal performance for 4× and 2× more jobs than sampling-based and centralized schedulers respectively.
Article
Randomized experiments are the gold standard for evaluating the effects of changes to real-world systems, including Internet services. Data in these tests may be difficult to collect and outcomes may have high variance, resulting in potentially large measurement error. Bayesian optimization is a promising technique for optimizing multiple continuous parameters for field experiments, but existing approaches degrade in performance when the noise level is high. We derive an exact expression for expected improvement under greedy batch optimization with noisy observations and noisy constraints, and develop a quasi-Monte Carlo approximation that allows it to be efficiently optimized. Experiments with synthetic functions show that optimization performance on noisy, constrained problems outperforms existing methods. We further demonstrate the effectiveness of the method with two real experiments conducted at Facebook: optimizing a production ranking system, and optimizing web server compiler flags.
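For context, the classic noiseless, single-point expected improvement that this work generalizes to noisy, batched, constrained settings has the standard closed form under a Gaussian-process posterior (for minimization, with incumbent value f*):

```latex
% \mu(x), \sigma(x): GP posterior mean and standard deviation; \Phi, \phi:
% standard normal CDF and PDF. The paper's noisy/constrained variant differs.
\mathrm{EI}(x) \;=\; \mathbb{E}\!\left[\max\bigl(f^{*} - f(x),\,0\bigr)\right]
\;=\; \bigl(f^{*} - \mu(x)\bigr)\,\Phi(z) \;+\; \sigma(x)\,\phi(z),
\qquad z = \frac{f^{*} - \mu(x)}{\sigma(x)}
```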
Conference Paper
Shared-state scheduling architectures have recently received increasing attention in large-scale cluster scheduling. As a crucial part of the shared-state scheduling architecture, optimistic concurrency control (OCC) has been studied by the database community for a long time. However, few studies aim at making it more suitable for shared-state scheduling environments. In this paper, we propose an extended fine-grained conflict detection method for shared-state scheduling architectures. This method extends the original validation criteria of fine-grained conflict detection so that it can detect complicated concurrent conflicts under high concurrency of scheduling transactions. We report experiments performed on a public simulator using Google production traces of data-processing workloads. These experiments demonstrate that the extended fine-grained conflict detection method markedly reduces the number of harmful conflicts without decreasing scheduler efficiency.
Conference Paper
Latency-critical workloads (e.g., web search), common in datacenters, require stable tail (e.g., 95th percentile) latencies of a few milliseconds. Servers running these workloads are kept lightly loaded to meet these stringent latency targets. This low utilization wastes billions of dollars in energy and equipment annually. Applying dynamic power management to latency-critical workloads is challenging. The fundamental issue is coping with their inherent short-term variability: requests arrive at unpredictable times and have variable lengths. Without knowledge of the future, prior techniques either adapt slowly and conservatively or rely on application-specific heuristics to maintain tail latency. We propose Rubik, a fine-grain DVFS scheme for latency-critical workloads. Rubik copes with variability through a novel, general, and efficient statistical performance model. This model allows Rubik to adjust frequencies at sub-millisecond granularity to save power while meeting the target tail latency. Rubik saves up to 66% of core power, widely outperforms prior techniques, and requires no application-specific tuning. Beyond saving core power, Rubik robustly adapts to sudden changes in load and system performance. We use this capability to design RubikColoc, a colocation scheme that uses Rubik to allow batch and latency-critical work to share hardware resources more aggressively than prior techniques. RubikColoc reduces datacenter power by up to 31% while using 41% fewer servers than a datacenter that segregates latency-critical and batch work, and achieves 100% core utilization.
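Rubik's statistical latency model is not reproduced here, but the actuation side it relies on, per-core frequency scaling, is exposed on Linux through the cpufreq sysfs interface. The sketch below is a minimal illustration of that interface only; governor availability and frequency steps are platform-dependent, and the example frequency is arbitrary.

```python
# Hedged sketch of the actuation side of per-core DVFS via Linux cpufreq sysfs.
# Requires the "userspace" governor to be available; the frequency value is an
# arbitrary example, and Rubik's decision logic is not shown.
CPU = 0
BASE = f"/sys/devices/system/cpu/cpu{CPU}/cpufreq"

def set_frequency_khz(freq_khz):
    # The "userspace" governor lets software pick an explicit frequency.
    with open(f"{BASE}/scaling_governor", "w") as f:
        f.write("userspace")
    with open(f"{BASE}/scaling_setspeed", "w") as f:
        f.write(str(freq_khz))

# e.g. drop core 0 to 1.2 GHz while the predicted latency slack is large:
set_frequency_khz(1_200_000)
```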
Article
As multicore processors with expanding core counts continue to dominate the server market, the overall utilization of the class of datacenters known as warehouse scale computers (WSCs) depends heavily on colocation of multiple workloads on each server to take advantage of the computational power provided by modern processors. However, many of the applications running in WSCs, such as websearch, are user-facing and have quality of service (QoS) requirements. When multiple applications are co-located on a multicore machine, contention for shared memory resources threatens application QoS as severe cross-core performance interference may occur. WSC operators are left with two options: either disregard QoS to maximize WSC utilization, or disallow the co-location of high-priority user-facing applications with other applications, resulting in low machine utilization and millions of dollars wasted. This paper presents ReQoS, a static/dynamic compilation approach that enables low-priority applications to adaptively manipulate their own contentiousness to ensure the QoS of high-priority co-runners. ReQoS is composed of a profile guided compilation technique that identifies and inserts markers in contentious code regions in low-priority applications, and a lightweight runtime that monitors the QoS of high-priority applications and reactively reduces the pressure low-priority applications generate to the memory subsystem when cross-core interference is detected. In this work, we show that ReQoS can accurately diagnose contention and significantly reduce performance interference to ensure application QoS. Applying ReQoS to SPEC2006 and SmashBench workloads on real multicore machines, we are able to improve machine utilization by more than 70% in many cases, and more than 50% on average, while enforcing a 90% QoS threshold. We are also able to improve the energy efficiency of modern multicore machines by 47% on average over a policy of disallowing co-locations.
Article
Scheduling multiple jobs onto a platform enhances system utilization by sharing resources. The benefits from higher resource utilization include reduced cost to construct, operate, and maintain a system, which often include energy consumption. Maximizing these benefits, while satisfying performance limits, comes at a price -- resource contention among jobs increases job completion time. In this paper, we analyze slow-downs of jobs due to contention for multiple resources in a system; referred to as dilation factor. We observe that multiple-resource contention creates non-linear dilation factors of jobs. From this observation, we establish a general quantitative model for dilation factors of jobs in multi-resource systems. A job is characterized by a vector-valued loading statistics and dilation factors of a job set are given by a quadratic function of their loading vectors. We demonstrate how to systematically characterize a job, maintain the data structure to calculate the dilation factor (loading matrix), and calculate the dilation factor of each job. We validated the accuracy of the model with multiple processes running on a native Linux server, virtualized servers, and with multiple MapReduce workloads co-scheduled in a cluster. Evaluation with measured data shows that the D-factor model has an error margin of less than 16%. We also show that the model can be integrated with an existing on-line scheduler to minimize the makespan of workloads.
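One plausible reading of "a quadratic function of their loading vectors", offered purely as illustration and not necessarily the paper's exact formulation, is that each job's slowdown grows with the interaction between its own loading vector and the aggregate load of its co-runners:

```latex
% Illustrative form only (an assumption, not the paper's exact model):
% v_i is job i's loading vector over shared resources and Q is a loading
% matrix capturing cross-resource interference.
d_i \;\approx\; 1 \;+\; v_i^{\top} Q \Bigl(\sum_{j \neq i} v_j\Bigr)
```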
Conference Paper
One of the key decisions made by both MapReduce and HPC cluster management frameworks is the placement of jobs within a cluster. To make this decision, they consider factors like resource constraints within a node or the proximity of data to a process. However, they fail to account for the degree of collocation on the cluster's nodes. A tight process placement can create contention for the intra-node shared resources, such as shared caches, memory, disk, or network bandwidth. A loose placement would create less contention, but exacerbate network delays and increase cluster-wide power consumption. Finding the best job placement is challenging, because among many possible placements, we need to find one that gives us an acceptable trade-off between performance and power consumption. We propose to tackle the problem via multi-objective optimization. Our solution is able to balance conflicting objectives specified by the user and efficiently find a suitable job placement.
Article
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
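The baseline updates such an overview starts from, in their standard forms, are plain gradient descent and classical momentum:

```latex
% Vanilla gradient descent, then classical momentum (standard textbook forms):
\theta_{t+1} = \theta_t - \eta \,\nabla_{\theta} J(\theta_t)
\qquad\qquad
v_t = \gamma\, v_{t-1} + \eta \,\nabla_{\theta} J(\theta_t), \quad
\theta_{t+1} = \theta_t - v_t
```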
Article
Latency-critical applications suffer from both average performance degradation and reduced completion time predictability when collocated with batch tasks. Such variation forces the system to overprovision resources to ensure Quality of Service (QoS) for latency-critical tasks, degrading overall system throughput. We explore the causes of this variation and exploit the opportunities of mitigating variation directly to simultaneously improve both QoS and utilization. We develop, implement, and evaluate Dirigent, a lightweight performance-management runtime system that accurately controls the QoS of latency-critical applications at fine time scales, leveraging existing architecture mechanisms. We evaluate Dirigent on a real machine and show that it is significantly more effective than configurations representative of prior schemes.
Article
User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With the slowdown in technology scaling caused by the sunsetting of Moore’s law, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
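The top-level policy the abstract describes, pause or shrink best-effort work when the latency-critical service loses its slack, can be summarized as a small state machine. The sketch below is a schematic of that idea, not Heracles' actual controller; the thresholds and the measurement stub are assumptions.

```python
# Schematic slack-based policy (NOT Heracles' controller): pause best-effort
# (BE) work when the latency-critical (LC) service misses its target, shrink
# BE when slack is small, grow BE when slack is ample. Thresholds are assumed.
import random, time

TARGET_MS = 10.0

def measure_lc_tail_ms():
    return random.uniform(5.0, 12.0)          # stub for the LC latency monitor

be_share = 0.2                                # fraction of resources given to BE
while True:
    slack = (TARGET_MS - measure_lc_tail_ms()) / TARGET_MS
    if slack < 0.0:
        be_share = 0.0                        # QoS violated: pause BE entirely
    elif slack < 0.1:
        be_share = max(0.0, be_share - 0.1)   # little slack: shrink BE
    else:
        be_share = min(0.9, be_share + 0.05)  # ample slack: grow BE
    print(f"slack={slack:+.2f}  BE share={be_share:.2f}")
    time.sleep(1)
```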
Article
This paper presents a Bayesian optimization method with exponential convergence without the need of auxiliary optimization and without the delta-cover sampling. Most Bayesian optimization methods require auxiliary optimization: an additional non-convex global optimization problem, which can be time-consuming and hard to implement in practice. Also, the existing Bayesian optimization method with exponential convergence requires access to the delta-cover sampling, which was considered to be impractical. Our approach eliminates both requirements and achieves an exponential convergence rate.
Conference Paper
Application profiling is an important performance analysis technique, when an application under test is analyzed dynamically to determine its space and time complexities and the usage of its instructions. A big and important challenge is to profile nontrivial web applications with large numbers of combinations of their input parameter values. Identifying and understanding particular subsets of inputs leading to performance bottlenecks is mostly manual, intellectually intensive and laborious procedure. We propose a novel approach for automating performance bottleneck detection using search-based input-sensitive application profiling. Our key idea is to use a genetic algorithm as a search heuristic for obtaining combinations of input parameter values that maximizes a fitness function that represents the elapsed execution time of the application. We implemented our approach, coined as Genetic Algorithm-driven Profiler (GA-Prof) that combines a search-based heuristic with contrast data mining of execution traces to accurately determine performance bottlenecks. We evaluated GA-Prof to determine how effectively and efficiently it can detect injected performance bottlenecks into three popular open source web applications. Our results demonstrate that GA-Prof efficiently explores a large space of input value combinations while automatically and accurately detecting performance bottlenecks, thus suggesting that it is effective for automatic profiling.
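The search heuristic at the core of the approach, a genetic algorithm over input-parameter combinations whose fitness is measured execution time, can be sketched generically. The parameter names and the timing stub below are invented, and GA-Prof's contrast mining of execution traces is omitted.

```python
# Generic GA sketch searching for input-parameter combinations that maximize
# measured execution time. Parameter space and timing stub are made up;
# GA-Prof additionally mines execution traces, which is omitted here.
import random

random.seed(2)
PARAM_RANGES = {"page_size": (1, 500), "filter_len": (0, 64), "sort_depth": (1, 20)}

def measure_elapsed(params):
    # Stub: GA-Prof would run the web application with these inputs and time
    # it. Here a synthetic bottleneck appears for large sort_depth values.
    return params["sort_depth"] ** 2 + 0.01 * params["page_size"] + random.random()

def random_individual():
    return {k: random.randint(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in PARAM_RANGES}

def mutate(ind):
    k = random.choice(list(PARAM_RANGES))
    lo, hi = PARAM_RANGES[k]
    ind[k] = random.randint(lo, hi)
    return ind

population = [random_individual() for _ in range(20)]
for generation in range(30):
    population.sort(key=measure_elapsed, reverse=True)   # fittest = slowest
    parents = population[:10]
    population = parents + [mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                            for _ in range(10)]
best = max(population, key=measure_elapsed)
print("slowest inputs found:", best)
```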
Article
Big Data applications are typically associated with systems involving large numbers of users, massive complex software systems, and large-scale heterogeneous computing and storage architectures. The construction of such systems involves many distributed design choices. The end products (e.g., recommendation systems, medical analysis tools, real-time game engines, speech recognizers) thus involve many tunable configuration parameters. These parameters are often specified and hard-coded into the software by various developers or teams. If optimized jointly, these parameters can result in significant improvements. Bayesian optimization is a powerful tool for the joint optimization of design choices that is gaining great popularity in recent years. It promises greater automation so as to increase both product quality and human productivity. This review paper introduces Bayesian optimization, highlights some of its methodological aspects, and showcases a wide range of applications.