
Machine learning inference serving models in serverless computing: a survey


Computing (2025) 107:47
https://doi.org/10.1007/s00607-024-01377-9
REGULAR PAPER
Machine learning inference serving models inserverless
computing: asurvey
AkramAslani1· MostafaGhobaei‑Arani1
Received: 12 May 2024 / Accepted: 22 November 2024 / Published online: 7 January 2025
© The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2025
Abstract
Serverless computing has attracted many researchers with features such as scalability, optimized operating costs, no need to manage infrastructure, and faster application development. Serverless computing can be used for real-time machine learning (ML) prediction using serverless inference functions. Deploying an ML serverless inference function involves provisioning a compute resource, deploying an ML model, setting up the network infrastructure, and granting permissions to invoke the inference function. However, machine learning inference (MLI) faces challenges such as resource management, delay and response time, large and complex models, and security and privacy, yet relatively few studies have been conducted in this field. This comprehensive literature review examines recent developments in MLI in serverless computing environments. The mechanisms presented in the taxonomy can be summarized in four categories: service level objective (SLO)-aware, acceleration-aware, framework-aware, and latency-aware. In each category, the methods and algorithms used to optimize inference in serverless environments are examined along with their advantages and disadvantages. We show that acceleration-aware methods focus on the optimal use of computing resources, and framework-aware methods play an important role in improving system efficiency and scalability by examining different frameworks for inference in serverless environments. SLO-aware and latency-aware methods, which consider time limits and service level agreements, help provide high-quality and reliable inference in serverless environments. Finally, this article presents a vision of future challenges and opportunities in this field and outlines directions for future research on MLI in serverless computing.
Keywords: Serverless computing · Function-as-a-Service · Machine learning inference · Deep learning · Inference serving models
* Mostafa Ghobaei-Arani
mo.ghobaei@iau.ac.ir
Akram Aslani
ak.aslani@iau.ac.ir
1 Department of Computer Engineering, Qom Branch, Islamic Azad University, Qom, Iran
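The abstract notes that deploying an ML serverless inference function involves a compute resource, a model artifact, network infrastructure, and permission to invoke the function. As a purely illustrative sketch (not taken from the survey), the following Python handler shows the typical shape of an AWS-Lambda-style inference function that loads the model once per container so warm invocations skip the download; the bucket name, model file, and feature layout are hypothetical.

```python
# Minimal sketch of a serverless ML inference handler (hypothetical names).
# Assumes a scikit-learn model serialized with joblib and stored in S3; any
# FaaS platform with a comparable handler signature would look similar.
import json
import os

import boto3
import joblib

MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "example-models")   # hypothetical bucket
MODEL_KEY = os.environ.get("MODEL_KEY", "classifier.joblib")      # hypothetical object key
LOCAL_PATH = "/tmp/model.joblib"                                  # writable dir in Lambda

_model = None  # cached across warm invocations of the same container


def _load_model():
    """Download and deserialize the model once per container (cold start only)."""
    global _model
    if _model is None:
        boto3.client("s3").download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
        _model = joblib.load(LOCAL_PATH)
    return _model


def handler(event, context):
    """Entry point wired to the function's HTTP/event trigger."""
    model = _load_model()
    features = json.loads(event["body"])["features"]   # e.g. a flat list of floats
    prediction = model.predict([features])[0]
    if hasattr(prediction, "item"):                     # unwrap numpy scalar types
        prediction = prediction.item()
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```

Keeping the model load outside the request path is what makes the cold-start versus warm-invocation distinction, discussed throughout the survey, visible even in this toy handler.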
... In future work, we will study other system resource constraints combined with DAG vertex states to further reduce the latency of distributed stream processing systems, and consider performance-aware and other algorithms to further improve their cost-effectiveness. Aslani et al. [35] surveyed serverless computing, which is scalable, optimizes operating costs, eliminates the need to manage infrastructure, and allows programs to be built at higher speed. ...
Article
Full-text available
With the rapidly growing demand for real-time processing of large amounts of stream data, the global deployment of cloud infrastructure for stream processing has brought tremendous economic benefits to cloud service providers, yet large-scale stream processing continues to increase resource cost consumption. In stream processing, the trade-off between performance and cost for workflow scheduling at lower resource cost remains to be explored. In this paper, a deployment method of streaming applications is proposed to maximize the performance-to-cost ratio in Storm, and a communication traffic optimization method for inter-node communication based on graph partitioning is proposed to reduce resource cost. First, we propose a cost-efficient model based on performance-to-cost ratio and a Storm-based performance-to-cost ratio maximization deployment algorithm, PC-Storm. PC-Storm first detects and collects task node information in the heterogeneous cluster. Then, it calculates the performance-to-cost ratio of the nodes and builds a performance-to-cost ratio table. Finally, the Storm task threads are assigned to nodes with a high performance-to-cost ratio. In addition, a maxSubtopology algorithm is proposed to optimize the communication traffic between Storm nodes based on graph partitioning. The maxSubtopology algorithm assigns the tasks with heavy communication traffic to the same node under the premise of satisfying the performance constraint. Compared with the existing algorithms, PC-Storm improves the resource performance-to-cost ratio by 40.92%, and the maxSubtopology algorithm reduces the communication cost by 42.8%. The proposed method can maximize the performance-to-cost ratio and effectively reduce the communication cost for deployment of streaming applications.
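The PC-Storm idea summarized above (collect node information, build a performance-to-cost ratio table, then place task threads on the highest-ratio nodes) can be illustrated with a small greedy sketch. This is only a schematic reading of that idea, not the published algorithm; the node metrics, slot capacities, and the ratio definition are invented for the example.

```python
# Schematic greedy placement by performance-to-cost ratio (illustrative only,
# not the PC-Storm algorithm itself).
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    throughput: float        # tuples/s the node can sustain (hypothetical metric)
    cost_per_hour: float     # rental cost of the node
    slots: int               # task threads the node can host
    assigned: list = field(default_factory=list)

    @property
    def pc_ratio(self) -> float:
        return self.throughput / self.cost_per_hour


def place_tasks(tasks: list[str], nodes: list[Node]) -> dict[str, str]:
    """Assign each task thread to the highest-ratio node that still has a free slot."""
    placement = {}
    ranked = sorted(nodes, key=lambda n: n.pc_ratio, reverse=True)  # the "ratio table"
    for task in tasks:
        for node in ranked:
            if len(node.assigned) < node.slots:
                node.assigned.append(task)
                placement[task] = node.name
                break
        else:
            raise RuntimeError(f"no capacity left for {task}")
    return placement


if __name__ == "__main__":
    cluster = [Node("n1", throughput=900, cost_per_hour=0.40, slots=2),
               Node("n2", throughput=500, cost_per_hour=0.10, slots=2),
               Node("n3", throughput=300, cost_per_hour=0.05, slots=4)]
    print(place_tasks([f"bolt-{i}" for i in range(6)], cluster))
```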
... The Thoth algorithm is implemented using three approaches, namely Q-Learning, which optimizes computing resources more efficiently than rule-based and Neural Network (NN) algorithms. The authors conducted a systematic and comprehensive literature review to examine recent advancements in several approaches, including machine learning inference (MLI) [48], cold start latency [49, 50], function placement [51], proactive content caching [52], function offloading [53], computation offloading [54], and scheduling mechanisms [55] in a serverless computing environment. To handle the increasing number of modules that need to be executed on Mobile Fog Computing (MFC), a Hidden Markov Model Auto-scaling Offloading (HMAO) method [54] is proposed to find the best place for module execution in order to minimize energy consumption and execution time, and to optimize the execution time, network resource usage, and delay of the modules on the mobile, fog, or cloud. ...
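The HMAO idea mentioned in this snippet (use a hidden-Markov-style model of the workload to decide where a module should run) can be sketched as follows. The state names, probability matrices, and the state-to-placement rule are entirely illustrative assumptions, not parameters of the cited method.

```python
# Toy hidden-Markov-style workload estimator driving a placement choice.
# All probabilities and the state-to-placement mapping are made up for illustration.
import numpy as np

STATES = ["light", "moderate", "heavy"]          # hidden workload states
OBS = ["low_cpu", "mid_cpu", "high_cpu"]         # observed device load levels

A = np.array([[0.70, 0.25, 0.05],                # state transition probabilities
              [0.20, 0.60, 0.20],
              [0.05, 0.35, 0.60]])
B = np.array([[0.80, 0.15, 0.05],                # emission probabilities P(obs | state)
              [0.20, 0.60, 0.20],
              [0.05, 0.25, 0.70]])
pi = np.array([0.6, 0.3, 0.1])                   # initial state distribution

PLACEMENT = {"light": "mobile", "moderate": "fog", "heavy": "cloud"}


def filtered_state(observations: list[str]) -> str:
    """Forward-algorithm filtering: most likely current workload state."""
    belief = pi * B[:, OBS.index(observations[0])]
    belief /= belief.sum()
    for obs in observations[1:]:
        belief = (A.T @ belief) * B[:, OBS.index(obs)]
        belief /= belief.sum()
    return STATES[int(np.argmax(belief))]


if __name__ == "__main__":
    recent = ["low_cpu", "mid_cpu", "high_cpu", "high_cpu"]
    state = filtered_state(recent)
    print(f"estimated workload: {state} -> run module on {PLACEMENT[state]}")
```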
Article
Full-text available
The widespread adoption of cloud services has posed several challenges, primarily revolving around energy and resource efficiency. Integrating cloud and fog resources can help address these challenges by improving fog-cloud computing environments. Nevertheless, the search for optimal task allocation and energy management in such environments continues. Existing studies have introduced notable solutions; however, it is still a challenging issue to efficiently utilize these heterogeneous cloud resources and achieve energy-efficient task scheduling in fog-cloud of things environments. To tackle these challenges, we propose a novel ML-based EcoTaskSched model, which leverages deep learning for energy-efficient task scheduling in fog-cloud networks. The proposed hybrid model integrates Convolutional Neural Networks (CNNs) with Bidirectional Long Short-Term Memory (BiLSTM) to enhance energy-efficient schedulability and reduce energy usage while ensuring QoS provisioning. The CNN model efficiently extracts workload features from tasks and resources, while the BiLSTM captures complex sequential information, predicting optimal task placement sequences. A real fog-cloud environment is implemented using the COSCO framework for the simulation setup, together with four physical nodes from the Azure B2s plan, to test the proposed model. The DeFog benchmark is used to develop task workloads, and data collection was conducted for both normal and intense workload scenarios. During preprocessing, the data was normalized, treated with feature engineering and augmentation, and then split into training and test sets. In the performance evaluation, the proposed EcoTaskSched model demonstrated superiority by significantly reducing energy consumption and improving job completion rates compared to baseline models. Additionally, the EcoTaskSched model maintained a high job completion rate of 85%, outperforming GGCN and BiGGCN. It also achieved lower average response time and SLA violation rates, as well as increased throughput and reduced execution cost compared to other baseline models. In its optimal configuration, the EcoTaskSched model is successfully applied to fog-cloud computing environments, increasing task handling efficiency and reducing energy consumption while maintaining the required QoS parameters. Our future studies will focus on long-term testing of the EcoTaskSched model in real-world IoT environments. We will also assess its applicability by integrating other ML models, which could provide enhanced insights for optimizing scheduling algorithms across diverse fog-cloud settings.
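The EcoTaskSched abstract describes a hybrid of convolutional feature extraction and a bidirectional LSTM over task sequences. The sketch below shows one plausible way to wire such a hybrid in PyTorch; the layer sizes, input shape, and output head are assumptions for illustration and do not reproduce the published model.

```python
# Illustrative CNN + BiLSTM hybrid for scoring task-placement decisions.
# Shapes and hyperparameters are assumptions, not the published EcoTaskSched model.
import torch
import torch.nn as nn


class CnnBiLstmScheduler(nn.Module):
    def __init__(self, n_features: int = 12, n_nodes: int = 4, hidden: int = 64):
        super().__init__()
        # 1-D convolution over the feature axis of each task in the sequence
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels=n_features, out_channels=32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Bidirectional LSTM over the task sequence
        self.bilstm = nn.LSTM(input_size=32, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        # Per-task scores over candidate nodes (placement decision)
        self.head = nn.Linear(2 * hidden, n_nodes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features); Conv1d expects (batch, channels, seq_len)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, seq_len, 32)
        h, _ = self.bilstm(h)                               # (batch, seq_len, 2*hidden)
        return self.head(h)                                 # (batch, seq_len, n_nodes)


if __name__ == "__main__":
    model = CnnBiLstmScheduler()
    tasks = torch.randn(2, 10, 12)        # 2 workloads, 10 tasks each, 12 features
    print(model(tasks).shape)             # torch.Size([2, 10, 4])
```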
... This section reviews the latest research on task scheduling methods [23, 24]. We then analyze the objectives, strengths, weaknesses, and methods of each study in a table. ...
Article
Full-text available
With the explosive growth of terminal devices, scheduling massive parallel task streams has become a core challenge for distributed platforms. For computing resource providers, enhancing reliability, shortening response times, and reducing costs are significant challenges, particularly in achieving energy efficiency through scheduling to realize green computing. This paper investigates the heterogeneous parallel task flow scheduling problem to minimize system energy consumption under response time constraints. First, for a set of independent tasks capable of parallel computation on heterogeneous terminals, the task scheduling is performed according to the computational resource capabilities of each terminal. The problem is modeled as a mixed-integer nonlinear programming problem using a Directed Acyclic Graph as the input model. Then, a dynamic scheduling method based on heuristic and reinforcement learning algorithms is proposed to schedule the task flows. Furthermore, dynamic redundancy is applied to certain tasks based on reliability analysis to enhance system fault tolerance and improve service quality. Experimental results show that our method can achieve significant improvements, reducing energy consumption by 14.3% compared to existing approaches on two practical workflow instances.
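The abstract above models the workload as a DAG and schedules tasks onto heterogeneous terminals under response-time and energy considerations. A minimal list-scheduling sketch in that spirit is shown below; the DAG, per-node speeds and power figures, and the earliest-finish-time rule are illustrative assumptions rather than the paper's algorithm.

```python
# Minimal earliest-finish-time list scheduling on heterogeneous nodes (illustrative).
# The DAG, node speeds, and power numbers are invented for the example.
from collections import defaultdict

# task -> (work units, list of predecessor tasks); listed in topological order
DAG = {
    "t1": (4.0, []),
    "t2": (3.0, ["t1"]),
    "t3": (6.0, ["t1"]),
    "t4": (2.0, ["t2", "t3"]),
}
# node -> (speed in work units per second, power in watts while busy)
NODES = {"edge": (1.0, 4.0), "fog": (2.0, 10.0), "cloud": (4.0, 30.0)}


def schedule(dag, nodes):
    finish = {}                                  # task -> finish time
    node_free = defaultdict(float)               # node -> time it becomes free
    placement, energy = {}, 0.0
    for task, (work, preds) in dag.items():      # dict order is topological here
        ready = max((finish[p] for p in preds), default=0.0)
        best = None                              # (finish time, node, busy energy)
        for node, (speed, power) in nodes.items():
            start = max(ready, node_free[node])
            eft = start + work / speed
            if best is None or eft < best[0]:
                best = (eft, node, work / speed * power)
        finish[task], node_free[best[1]] = best[0], best[0]
        placement[task], energy = best[1], energy + best[2]
    return placement, finish, energy


if __name__ == "__main__":
    place, finish, energy = schedule(DAG, NODES)
    print(place, f"makespan={max(finish.values()):.2f}s", f"energy={energy:.1f}J")
```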
Article
Full-text available
Cloud computing enables scalable and flexible resource access with reduced costs and higher efficiency. However, challenges such as resource management, energy optimization, and load balancing require innovative scheduling approaches. This research proposes a multi-objective hybrid algorithm integrating reinforcement learning, quantum mutation (QM), and metaheuristic optimization to enhance task scheduling in cloud environments. The QM operator improves search diversity, while Q-learning dynamically balances exploration and exploitation. The proposed method is validated using statistical tests and benchmark datasets (GoCJ and synthetic data). Results show a makespan reduction of up to 22.04% and energy savings of 90.64%, demonstrating superior efficiency compared to existing methods.
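The hybrid described above combines a metaheuristic search with Q-learning to balance exploration and exploitation, plus a quantum-inspired mutation to diversify candidates. The toy sketch below shows one way those two ingredients can be combined for a makespan objective; the solution encoding, reward, and mutation rule are assumptions made for illustration, not the paper's algorithm.

```python
# Toy task-to-VM scheduling search combining Q-learning (to pick explore vs exploit)
# with a quantum-inspired mutation; all parameters are illustrative assumptions.
import math
import random

TASK_LEN = [4, 7, 2, 9, 5, 3, 8, 6]          # task lengths (arbitrary units)
VM_SPEED = [1.0, 2.0, 4.0]                   # VM processing speeds


def makespan(assign):
    load = [0.0] * len(VM_SPEED)
    for t, vm in enumerate(assign):
        load[vm] += TASK_LEN[t] / VM_SPEED[vm]
    return max(load)


def quantum_mutation(assign, theta):
    """Quantum-inspired mutation: each gene is reassigned with probability sin^2(theta)."""
    return [random.randrange(len(VM_SPEED)) if random.random() < math.sin(theta) ** 2 else vm
            for vm in assign]


def search(iters=500, alpha=0.3, gamma=0.8, eps=0.2):
    actions = {"exploit": 0.15, "explore": 0.6}          # action -> mutation angle (rad)
    q = {s: {a: 0.0 for a in actions} for s in ("improving", "stagnant")}
    best = [random.randrange(len(VM_SPEED)) for _ in TASK_LEN]
    best_cost, state = makespan(best), "stagnant"
    for _ in range(iters):
        action = (random.choice(list(actions)) if random.random() < eps
                  else max(q[state], key=q[state].get))
        candidate = quantum_mutation(best, actions[action])
        cost = makespan(candidate)
        reward = best_cost - cost                         # positive if the candidate improves
        next_state = "improving" if cost < best_cost else "stagnant"
        q[state][action] += alpha * (reward + gamma * max(q[next_state].values())
                                     - q[state][action])
        if cost < best_cost:
            best, best_cost = candidate, cost
        state = next_state
    return best, best_cost


if __name__ == "__main__":
    assignment, cost = search()
    print("assignment:", assignment, "makespan:", round(cost, 2))
```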
Article
Full-text available
Escalating distributed denial of service (DDoS) attacks severely threaten the security of the industrial internet of things (IIoT). This paper introduces moving target defense (MTD) as an adaptive solution to fortify IIoT security against DDoS attacks. Dynamically reconfiguring network elements and service placements makes it challenging for attackers to target specific vulnerabilities. We propose an MTD traffic manager (MTDTM) architecture to enable early detection and mitigation of DDoS attacks within resource-constrained edge clouds. A traffic classifier is integrated into our model to intelligently filter incoming traffic, ensuring real-time responsiveness to the demands of IIoT applications. Moreover, dynamic admission rules and relocation of service replicas efficiently distribute the traffic, ensuring the availability of services for legitimate users. Unlike traditional static defense methods, our adaptive approach caters to the evolving DDoS threat landscape of IIoT, safeguarding critical industrial processes. Simulation results validate the efficiency of our algorithm while maintaining an acceptable quality of service. Our research demonstrated a 15% to 20% improvement in service response times compared to existing algorithms. Notably, we achieved significant enhancements in average resource availability during DDoS attacks across various parameter variations.
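The MTDTM architecture summarized above combines traffic classification, admission rules, and periodic relocation of service replicas. The following toy sketch captures only the moving-target aspect: replicas are periodically remapped to random edge nodes and traffic flagged by a stubbed classifier is dropped before admission. The node names, services, and classifier threshold are assumptions, not the paper's design.

```python
# Toy moving-target defense loop: relocate service replicas periodically and
# filter traffic with a stub classifier. Names and thresholds are illustrative.
import random

EDGE_NODES = ["edge-a", "edge-b", "edge-c", "edge-d"]
SERVICES = ["telemetry", "control", "alerts"]


def relocate(placement: dict) -> dict:
    """Remap every service replica to a fresh random node (the moving target)."""
    return {svc: random.choice([n for n in EDGE_NODES if n != placement.get(svc)])
            for svc in SERVICES}


def looks_malicious(request: dict) -> bool:
    """Stub traffic classifier: flag unusually high request rates from one source."""
    return request["requests_per_second"] > 100        # illustrative threshold


def admit(requests: list, placement: dict) -> list:
    """Admission rule: drop flagged traffic, route the rest to the current replica."""
    return [(req["source"], placement[req["service"]])
            for req in requests if not looks_malicious(req)]


if __name__ == "__main__":
    placement = {svc: random.choice(EDGE_NODES) for svc in SERVICES}
    traffic = [
        {"source": "plc-17", "service": "telemetry", "requests_per_second": 12},
        {"source": "bot-net", "service": "control", "requests_per_second": 950},
    ]
    for epoch in range(3):                  # each epoch the attack surface moves
        placement = relocate(placement)
        print(f"epoch {epoch}: placement={placement}, admitted={admit(traffic, placement)}")
```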
Article
Full-text available
In this paper, the bipartite consensus tracking control problem is investigated for a class of nonlinear fractional-order multi-agent systems (FOMASs) with unknown dynamics, actuator faults, and input nonlinearities. Based on the adaptive backstepping technique, an adaptive bipartite consensus tracking control framework is constructed for FOMASs, where both cooperative and competitive relationships among agents are implemented. Furthermore, a fault compensation mechanism is proposed to relax the restriction on the number of actuators that can fail, and allow the existence of different types of input nonlinearities for each actuator. In addition, an improved adaptive self-triggered control mechanism that can be dynamically adjusted depending on the bipartite consensus error is extended to FOMASs to save network resources. Then, by means of the fractional-order Lyapunov stability criterion, it is theoretically proved that the proposed control scheme ensures that all signals of the closed-loop systems are bounded and drives the bipartite consensus error into a desired neighborhood of the origin. Finally, simulation results are provided to confirm the effectiveness of the proposed control scheme.
Article
Full-text available
In practice, the time-varying communication delays and actuator failures are the main inevitable issues in nonlinear multilateral teleoperation systems, which can reduce the performance and stability of the considered systems. This article proposed a novel failure-distribution-dependent H∞ fuzzy fault-tolerant control scheme to realize position synchronization and force tracking simultaneously for multilateral teleoperation systems. Firstly, the nonlinear multilateral systems were modeled as a kind of T-S fuzzy systems with multiple time-varying delays. Then, based on the distribution characteristic of failures, by introducing a series of tradeoff coefficients, a novel failure-distribution-dependent fault-tolerant control algorithm was provided to ensure force tracking in spite of failures, and the purpose of position synchronization was achieved (not only the master and slave robot position synchronization but also the position synchronization between each slave robot). Finally, a numerical simulation example was given to show the effectiveness of the proposed method.
Article
Unknown failures and time delays of actuators which may degrade system performance seem inevitable in practical systems. However, the available results to compensate for unknown failures of time delay actuators based on adaptive approaches are very limited. In this paper, we address such a problem by considering controlling a class of second-order systems with unknown actuator failures and input delays. Firstly, the controlled system is transformed into a triangular structure model. Meanwhile, the input time delay is transformed into dynamics of the output signal. Then, an adaptive controller is developed using backstepping. Not only can the uncertainties due to actuator failure be compensated, but the unknown input delay has also been effectively restrained under the proposed control scheme. The upper bound condition of strength scalar [Formula: see text] is proposed. It is a sufficient condition for system stability. When the strength scalar satisfies this condition, the design parameters exist and the system is stable under the controlling of the proposed controller. Finally, the effectiveness of the proposed control scheme is verified through simulation results, including numerical simulation and actual system simulation. Through the comparative simulation, it can be seen that the control scheme proposed in this paper has faster convergence speed.
Article
This paper studies an adaptive fuzzy event-triggered fault-tolerant control (FTC) problem for a category of uncertain under-actuated nonlinear systems with unknown actuator failures based on hierarchical sliding mode surface (HSMS). The under-actuated systems under study contain unknown uncertainty functions, actuator faults, and periodically updated control inputs. First, a fuzzy approximator and a fault observer are used to deal with the unknown functions and actuator failures in the system, respectively, and then, a time-varying bounded function is utilized to transform the control input linearly, making it easy to use in the process of controller design. The robustness and responsiveness of the control system are improved by introducing the HSMS technique in the controller. The HSMS-based active FTC strategy developed in this study enhances the safety and reliability of the controlled system compared to the existing control schemes using HSMS. Meanwhile, as opposed to the fixed-threshold event-triggered control (ETC) scheme, the relative-threshold ETC strategy adopted in this study is able to save communication resources while maintaining a balance between communication cost and control performance. In addition to the repetition amplification caused by the superposition of observation errors, fuzzy approximation errors, and measurement errors, the controller singularity problems are the two main problems caused by the control algorithms. Based on Lyapunov’s theory, the proposed scheme ensures that all closed-loop system signals remain bound. Finally, the feasibility and effectiveness of the control strategy are demonstrated through a numerical simulation and a practical example.