Luciano Baresi's Lab
Institution: Politecnico di Milano
Featured research (3)
Nowadays a wide range of applications is constrained by low-latency requirements that cloud infrastructures cannot meet. Multi-access Edge Computing (MEC) has been proposed as the reference architecture for executing applications closer to users and reducing latency, but new challenges arise: edge nodes are resource-constrained, the workload can vary significantly since users are nomadic, and task complexity is increasing (e.g., machine learning inference). To overcome these problems, the paper presents NEPTUNE, a serverless-based framework for managing complex MEC solutions. NEPTUNE i) places functions on edge nodes according to user locations, ii) avoids the saturation of single nodes, iii) exploits GPUs when available, and iv) allocates resources (CPU cores) dynamically to meet foreseen execution times. A prototype, built on top of K3S, was used to evaluate NEPTUNE in a set of experiments that demonstrate a significant reduction in response time, network overhead, and resource consumption compared to three state-of-the-art approaches.
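The placement idea in i) and ii) can be illustrated with a minimal sketch: assign each user's requests to the lowest-latency edge node that still has spare capacity, falling back to farther nodes once a node saturates. The function name, node identifiers, and capacity model below are illustrative assumptions, not the algorithm from the paper.

```python
# Hypothetical latency-aware, saturation-avoiding placement sketch.
# Names and data shapes are assumptions for illustration only.

def place_requests(users, nodes):
    """users: list of (user_id, {node_id: latency_ms}).
    nodes: dict node_id -> remaining request capacity (mutated)."""
    placement = {}
    for user_id, latencies in users:
        # try nodes in order of increasing latency, skipping saturated ones
        for node_id in sorted(latencies, key=latencies.get):
            if nodes[node_id] > 0:
                nodes[node_id] -= 1
                placement[user_id] = node_id
                break
    return placement

users = [("u1", {"edge-a": 5, "edge-b": 20}),
         ("u2", {"edge-a": 7, "edge-b": 9}),
         ("u3", {"edge-a": 6, "edge-b": 30})]
nodes = {"edge-a": 2, "edge-b": 2}
print(place_requests(users, nodes))
# u3 spills over to edge-b because edge-a is already saturated
```

The greedy fallback keeps any single node from absorbing all nearby users, at the cost of higher latency for the spilled-over requests.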
TensorFlow, a popular machine learning (ML) platform, allows users to transparently exploit both GPUs and CPUs to run their applications. Since GPUs are optimized for compute-intensive workloads (e.g., matrix calculations), they help boost executions, but introduce resource heterogeneity. TensorFlow neither provides efficient heterogeneous resource management nor allows for the enforcement of user-defined constraints on the execution time. Most existing work addresses these issues in the context of creating models on existing data sets (training phase), and only focuses on scheduling algorithms. This paper focuses on the inference phase, that is, on the application of created models to predict the outcome on new data interactively, and presents a comprehensive resource management solution called ROMA (Resource Constrained ML Applications). ROMA is an extension of TensorFlow that (a) provides means to easily deploy multiple TensorFlow models in containers using Kubernetes, (b) allows users to set constraints on response times, (c) schedules the execution of requests on GPUs and CPUs using heuristics, and (d) dynamically refines the CPU core allocation by exploiting control theory. The assessment conducted on four real-world benchmark applications compares ROMA against four different systems and demonstrates a significant reduction (\(75\%\)) in constraint violations and \(24\%\) saved resources on average.
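The control-theoretic refinement in (d) can be sketched with a simple proportional-integral controller that grows the CPU allocation when the measured response time exceeds the user-set constraint and shrinks it otherwise. The class name, gains, and bounds below are assumptions for the sketch, not ROMA's actual controller design.

```python
# Illustrative PI controller for CPU core allocation under a
# response-time constraint. Gains and bounds are assumed values.

class CpuController:
    def __init__(self, target_rt, kp=0.5, ki=0.1,
                 min_cores=0.1, max_cores=8.0):
        self.target_rt = target_rt          # desired response time (s)
        self.kp, self.ki = kp, ki           # proportional/integral gains
        self.min_cores, self.max_cores = min_cores, max_cores
        self.integral = 0.0                 # accumulated error

    def next_allocation(self, measured_rt, current_cores):
        # positive error -> slower than the target -> allocate more cores
        error = measured_rt - self.target_rt
        self.integral += error
        delta = self.kp * error + self.ki * self.integral
        cores = current_cores + delta
        return max(self.min_cores, min(self.max_cores, cores))

ctl = CpuController(target_rt=0.2)
cores = 1.0
for rt in [0.5, 0.35, 0.22]:    # simulated measured response times
    cores = ctl.next_allocation(rt, cores)
    print(f"measured {rt:.2f}s -> allocate {cores:.2f} cores")
```

The integral term keeps nudging the allocation while a residual error persists, which is what lets a controller like this converge on a constraint rather than oscillate around a fixed threshold.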
Cloud applications are increasingly executed in lightweight containers that can be efficiently managed to cope with highly varying and unpredictable workloads. Kubernetes, the most popular container orchestrator, provides means to automatically scale containerized applications to keep their response time under control. Kubernetes provisions resources using two main components: i) the Horizontal Pod Autoscaler (HPA), which controls the number of containers running for an application, and ii) the Vertical Pod Autoscaler (VPA), which oversees the resource allocation of existing containers. These two components have several limitations: they must control different metrics, they use simple threshold-based rules, and the reconfiguration of existing containers requires stopping and restarting them.
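The threshold-based rule the HPA applies is documented in the Kubernetes reference: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up. A minimal sketch of that rule (the min/max clamping parameters mirror an HPA spec's `minReplicas`/`maxReplicas`):

```python
import math

# Sketch of the Horizontal Pod Autoscaler scaling rule:
#   desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
# clamped to the configured replica bounds.

def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=1, max_replicas=10):
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(3, 90, 50))   # load well above target -> 6 replicas
```

Note how this rule only ever changes the *number* of replicas; resizing the CPU or memory of a running container is the VPA's job, and as the abstract points out, that reconfiguration currently requires restarting the container.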