NEPTUNE: Network- and GPU-aware Management
of Serverless Functions at the Edge
Luciano Baresi, Davide Yi Xian Hu, Giovanni Quattrocchi, Luca Terracciano
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
ABSTRACT
Nowadays a wide range of applications is constrained by low-latency requirements that cloud infrastructures cannot meet. Multi-access Edge Computing (MEC) has been proposed as the reference architecture for executing applications closer to users and reducing latency, but new challenges arise: edge nodes are resource-constrained, the workload can vary significantly since users are nomadic, and task complexity is increasing (e.g., machine learning inference). To overcome these problems, the paper presents NEPTUNE, a serverless-based framework for managing complex MEC solutions. NEPTUNE i) places functions on edge nodes according to user locations, ii) avoids the saturation of single nodes, iii) exploits GPUs when available, and iv) allocates resources (CPU cores) dynamically to meet foreseen execution times. A prototype, built on top of K3S, was used to evaluate NEPTUNE on a set of experiments that demonstrate a significant reduction in terms of response time, network overhead, and resource consumption compared to three state-of-the-art approaches.
CCS CONCEPTS
• Theory of computation; • Computing methodologies → Distributed computing methodologies; • Computer systems organization → Distributed architectures.
KEYWORDS
serverless, edge computing, gpu, placement, dynamic resource allocation, control theory
ACM Reference Format:
Luciano Baresi, Davide Yi Xian Hu, Giovanni Quattrocchi, Luca Terrac-
ciano. 2022. NEPTUNE: Network- and GPU-aware Management of Server-
less Functions at the Edge. In 17th International Symposium on Software
Engineering for Adaptive and Self-Managing Systems (SEAMS ’22), May
18–23, 2022, PITTSBURGH, PA, USA. ACM, New York, NY, USA, 13 pages.
1 INTRODUCTION
Multi-access Edge Computing (MEC) [ ] has emerged as a new distributed architecture for running computations at the edge of the network and for reducing latency compared to cloud executions. Differently from cloud computing, which is characterized by a virtually infinite amount of resources placed in large data centers, MEC infrastructures are based on geo-distributed networks of resource-constrained nodes (e.g., 5G base stations) that serve requests and process data close to the users.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from firstname.lastname@example.org.
SEAMS '22, May 18–23, 2022, Pittsburgh, PA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9305-8/22/05...$15.00
The rise of edge computing [ ], also fostered by the advent of 5G networks, enables the creation of applications with extremely low latency requirements like autonomous driving [ ], VR/AR [ ], and mobile gaming [ ] systems. According to Li et al. [ ], the average network delay from 260 locations to the nearest Amazon EC2 availability zone is approximately 74ms. This makes meeting tight response time requirements in the cloud nearly impossible. In use cases like obstacle detection, response times of a few hundred milliseconds are required [ ] and thus the network delay must be lower than the one offered by cloud-based solutions. Many mobile devices would allow these computations to be executed on the device itself, but this is not always possible given the inherent complexity of some tasks (e.g., machine learning-based ones) and the need for limiting resource consumption (e.g., to avoid battery drain).
An important challenge of edge computing is that clients usually produce highly dynamic workloads since they move among different areas (e.g., self-driving vehicles) and the amount of traffic in a given region can rapidly escalate (e.g., users moving towards a stadium for an event). To tackle these cases, solutions that scale resources (i.e., virtual machines and containers [ ]) according to the workload have been extensively investigated in the context of cloud computing, ranging from approaches based on rules [ ] and machine learning [ ] to those based on time-series analysis [ ]. These solutions assume that (new) resources are always available and that nodes are connected through a low-latency, high-bandwidth network. At the edge, these assumptions are not valid anymore and some ad-hoc solutions have been presented in the literature. For example, Ascigil et al. [ ] and Wang et al. [ ] propose solutions for service placement on resource-constrained nodes, while Poularakis et al. [ ] focus on request routing and load balancing at the edge.
Approaches that focus on service placement or request routing for MEC aim to maximize the throughput of edge nodes, but comprehensive solutions that address placement, routing, and minimal delays at the same time are still work in progress. In addition, the tasks at the edge, like AI-based computations, are becoming heavier and heavier. The use of GPUs [ ] can be fundamental to accelerate these computations, but they are seldom taken into account explicitly [6, 20, 39].
arXiv:2205.04320v1 [cs.SE] 9 May 2022
To tame all these problems, this paper presents NEPTUNE, a comprehensive framework for the runtime management of large-scale edge applications that exploits placement, routing, network delays, and CPU/GPU interplay in a coordinated way to allow for the concurrent execution of edge applications that meet user-set response times. NEPTUNE uses the serverless paradigm [ ] to let service providers deploy and execute latency-constrained functions without managing the underlying infrastructure. NEPTUNE uses Mixed Integer Programming (MIP) to allocate functions on nodes, minimize network delays, and exploit GPUs if available. It uses lightweight control theoretical planners to allocate CPU cores dynamically to execute the remaining functions.

NEPTUNE is then holistic and self-adaptive. It addresses edge nodes, function placement, request routing, available resources, and hardware accelerators in a single and coherent framework. Users only provide functions and foreseen response times, and then the system automatically probes available nodes as well as the locality and intensity of workloads and reacts autonomously. While other approaches (see Section 5) only focus on one or a few aspects and can only be considered partial solutions, NEPTUNE tackles all of them and oversees the whole lifecycle from deployment to operation.
We built a prototype that extends K3S, a popular tool for container orchestration at the edge, to evaluate NEPTUNE on a set of experiments executed on a geo-distributed cluster of virtual machines provisioned on Amazon Web Services (AWS). To provide a comprehensive and meaningful set of experiments, we assessed NEPTUNE against three realistic benchmark applications, including one that can be accelerated with GPUs. The comparison revealed 9.4 times fewer response time violations, and 1.6 and 17.8 times improvements in terms of resource consumption and network overhead, respectively.
NEPTUNE builds up from PAPS [ ], from which it inherits part of the design. Compared to PAPS, this paper provides i) a new placement and routing algorithm that exploits resources more efficiently and minimizes service disruption, ii) the support of GPUs, and iii) an empirical evaluation of the approach (PAPS was only simulated).
The rest of the paper is organized as follows. Section 2 explains the problem addressed by NEPTUNE and provides an overview of the solution. Section 3 presents how NEPTUNE tackles placement, routing, and GPU/CPU allocation. Section 4 shows the assessment we carried out to evaluate NEPTUNE. Section 5 presents some related work, and Section 6 concludes the paper.
The goal of NEPTUNE is to allow for the execution of multiple concurrent applications on a MEC infrastructure with user-set response times and to optimize the use of available resources. NEPTUNE must be able to take into consideration the main aspects of edge computing: resource-constrained nodes, limited network infrastructure, highly fluctuating workloads, and strict latency requirements.
A MEC infrastructure is composed of a set N of distributed nodes that allow clients to execute a set of applications. Edge clients (e.g., self-driving cars, mobile phones, or VR/AR devices) access applications placed on MEC nodes. Each node comes with cores, memory, and maybe GPUs, along with their memory. Because of user mobility, the workload on each node can vary frequently, and resource limitations do not always allow each node to serve all the requests it receives; some requests must be outsourced to nearby nodes.

Figure 1: MEC Topology.
Given a request r for an application a, its response time RT is measured as the time required to transmit r from the client to the closest node n, execute the request, and receive a response. RT is defined as RT = E + Q + δ, where E (execution time) represents the time taken for running r, Q (queue time) is the time spent by r waiting for being managed, and δ is the network delay (or network latency). In particular, as shown in Figure 1, δ is the sum of the (round trip) time needed by r to reach the closest node n and, if needed, the (round trip) time needed to outsource the computation on a nearby node. NEPTUNE handles requests once they enter the MEC topology and assumes the client-to-node delay to be optimized by existing network infrastructure.

NEPTUNE is then location aware since it considers the geographical distribution of both nodes and workloads. It stores a representation of the MEC network by measuring the inter-node delays and it monitors where and how many requests are generated by users. Furthermore, NEPTUNE allows users to put a threshold (service level agreement) on the response time provided by each application.
2.1 Solution overview
NEPTUNE requires that an application be deployed as a set of functions, as prescribed by the serverless paradigm. Developers focus on application code without the burden of managing the underlying infrastructure. Each function covers a single functionality and supplies a single or a small set of REST endpoints. The result is more flexible and faster to scale compared to traditional architectures (e.g., monoliths or microservices).

Besides the function's code, NEPTUNE requires that each function be associated with a user-provided required response time (RT^R_f) and the memory (m_f) needed to properly execute it. In case of GPU-accelerated functions, the GPU memory must also be specified. Then, NEPTUNE manages both its
deployment, by provisioning one or more function instances, and its operation, with the goal of fulfilling the set response time.
NEPTUNE manages functions through a three-level control hierarchy: Topology, Community, and Node.

Since function placement is NP-hard [ ], the main goal of the Topology level is to tackle the complexity for the lower levels by splitting the topology into communities of closely-located nodes. Each community is independent of the others and a request can only be managed within the community of the node that received it. If this is not possible, then the community is undersized and the Topology level must reconfigure the communities.

The Topology level employs a single controller based on the Speaker-listener Label Propagation Algorithm (SLPA), proposed by Xie et al. [ ], to create the communities. SLPA has a complexity of O(t ∗ n), where n is the number of distributed nodes and t is the user-defined maximum number of iterations. Since the complexity scales linearly with the number of nodes, this solution has proven to be suitable also for large clusters.
Given a maximum community size and the maximum allowed network delay Δ, SLPA splits the topology into a set of communities whose number of nodes is lower than the maximum size and whose inter-node delays are smaller than Δ (δ_{i,j} ≤ Δ for all nodes i and j in the community). SLPA could potentially assign a node to multiple communities, but to avoid resource contention, NEPTUNE re-allocates shared nodes, if needed, to create non-overlapping communities.
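As a toy illustration of community creation, the following sketch runs a simplified label-propagation pass (in the spirit of SLPA, but not its full speaker-listener protocol, and not NEPTUNE's code) over a five-node topology with invented delays; each node repeatedly adopts the most popular label among neighbors within the delay bound Δ.

```python
import random

DELTA = 10  # maximum allowed inter-node delay (ms), hypothetical
DELAY = {   # symmetric inter-node delays (ms), hypothetical
    ("n1", "n2"): 3, ("n1", "n3"): 4, ("n2", "n3"): 5,
    ("n1", "n4"): 40, ("n2", "n4"): 42, ("n3", "n4"): 41,
    ("n4", "n5"): 2, ("n1", "n5"): 39, ("n2", "n5"): 44,
    ("n3", "n5"): 43,
}

def delay(i, j):
    return 0 if i == j else DELAY.get((i, j), DELAY.get((j, i)))

def communities(nodes, iterations=20, seed=42):
    random.seed(seed)
    label = {n: n for n in nodes}  # start: every node is its own community
    for _ in range(iterations):
        for n in random.sample(nodes, len(nodes)):
            neigh = [m for m in nodes if m != n and delay(n, m) <= DELTA]
            if neigh:
                votes = [label[m] for m in neigh]
                # adopt the most popular neighbor label (deterministic ties)
                label[n] = max(sorted(set(votes)), key=votes.count)
    groups = {}
    for n, l in label.items():
        groups.setdefault(l, set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)

print(communities(["n1", "n2", "n3", "n4", "n5"]))
```

With these delays the three mutually close nodes end up in one community and the remaining pair in another, regardless of the propagation order.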
At Community level, each community is equipped with a MIP-based controller in charge of managing function instances. The controller places function instances on the different nodes of the community by first considering those that could exploit GPUs, if available. The goal is to minimize network delay by dynamically deploying function instances close to where the demand is generated and, at the same time, to minimize the time spent to forward requests when needed.

The Community level computes routing policies i) to allow each node to forward part of the workload to other close nodes, and ii) to prioritize computation-intensive functions by forwarding requests to GPUs up to their full utilization, and then send the remaining requests to CPUs. To avoid saturating single nodes, the Community level can also scale function instances horizontally, that is, it can replicate them on nearby nodes. For example, if a node does not have enough resources to execute an instance of f that can serve its workload, the Community level creates a new instance (of f) on a nearby node, and forwards to it the requests that cannot be served locally.
While the first two control levels take care of network latency (δ), once the requests arrive at the node that processes them, the Node level ensures that function instances have the needed amount of cores to meet set response times. Each function instance is managed by a dedicated Proportional Integral (PI) controller that provides vertical scaling, that is, it adds, or removes, CPU cores to the function (container). Unlike other approaches [ ], NEPTUNE can reconfigure CPU cores without restarting function instances, that is, without service disruption. Figure 2 shows that, given a set point (i.e., the desired response time), the PI controller periodically monitors the performance of its function instance and dynamically allocates CPU cores to optimize both execution time E and queue time Q.
Figure 2: NEPTUNE.
Also, the controllers at Node level are independent of each other:
each controller is not aware of how many cores the others would
like to allocate. Therefore, since the sum of the requests might
exceed the capacity of a node, a Contention Manager solves resource
contentions by proportionally scaling down requested allocations.
Every control level is meant to work independently of the others,
and no communication is required between controllers operating
at the same level. This means that NEPTUNE can easily scale.
The three control levels operate at different frequencies, yet cohesively, to eliminate potential interference. Thanks to fast PI controllers and vertical scaling, the Node level operates every few seconds to handle workload bursts, whereas the Community level computes function placements and routing policies in the order of minutes, and allows the Node level to fully exploit the underlying resources. The Topology level runs at longer intervals, but it can react faster when there are changes in the network, such as when a node is added or removed.
A real-time monitoring infrastructure is the only communication across the levels. It allows NEPTUNE to gather the performance metrics (e.g., response times, core allocations per function instance, network delays) needed by the three control levels to properly operate. Note that the controllers at Community and Node levels measure device performance in real-time without any a-priori assumption. This means that NEPTUNE can also manage heterogeneous CPUs with different performance levels (e.g., different types of virtual machines).
Figure 2 shows a MEC topology managed by the three-level hierarchical control adopted by NEPTUNE. Control components are depicted in grey. First of all, the Topology level splits the network into four communities. Each community is handled by its Community level controller, responsible for function placement and request
routing. The figure reports a detailed view of Community 1. We can observe that i) a set of functions is placed across three nodes, and ii) each node is provided with its set of routing policies. In particular, the routing policies for Node N1 enforce that 70% of the requests for f_a are served by the node itself, while the remaining 30% are forwarded to the instance, running on Node N3, that exploits a GPU. Finally, the figure shows the materialization of the Node level on Node N3. Each function instance is vertically scaled by a dedicated PI controller, and resource contentions are solved by the Contention Manager.
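As an illustration of how such routing policies could be enforced (a sketch using the 70/30 weights from the example above, with hypothetical names; not the prototype's code), a node can pick a target for each request by sampling the policy's weights:

```python
import random

# Routing policy from Figure 2: requests for f_a arriving at N1 stay
# local with weight 0.7 and go to the GPU instance on N3 with weight 0.3.
POLICY = {("f_a", "N1"): {"N1": 0.7, "N3": 0.3}}

def route(function, node, rng):
    """Pick a target node for one request by weighted sampling."""
    targets = POLICY[(function, node)]
    r, acc = rng.random(), 0.0
    for target, weight in targets.items():
        acc += weight
        if r < acc:
            return target
    return node  # numerical safety net

rng = random.Random(0)
picks = [route("f_a", "N1", rng) for _ in range(10_000)]
print(picks.count("N1") / len(picks))  # close to the 0.7 local weight
```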
3 PLACEMENT, ROUTING, AND GPU/CPU ALLOCATION
As explained above, the Topology level partitions the network into a set of communities using the SLPA algorithm. Each community is controlled independently from the others. The main goal of the Community level controller is to dynamically place function executions as close as possible to where the workload is generated. A trivial solution, which could drastically reduce network delay, would be to replicate the whole set of functions on each node. Since nodes have limited resources, this approach is often not feasible.
An effective placement solution should not saturate nodes and should allow for stable placement, that is, it should avoid the disruption implied by repeatedly migrating functions among nodes. At the same time, the placement cannot be fully static since users move and requirements change.
Each placement effort should also consider graceful termination periods and cold starts. The former is the user-defined amount of time we must wait whenever a function instance is to be deleted to let it finish serving pending requests. The latter is a delay that can range from seconds to minutes and that affects newly created function instances [ ]. Before a function instance starts serving requests, a container must be started and the execution environment be initialized. Diverse approaches, like container pooling [ ], mitigate cold starts, but their efficiency depends on the functions at hand and they cannot always reduce cold starts in a significant way. This is why NEPTUNE does not exploit these solutions natively.
To address these problems, the Community level adopts two similar instances of a 2-step optimization process, based on Mixed Integer Programming, to allocate GPUs first and then CPUs. In both cases, the first step aims to find the best function placement and routing policy that minimize the overall network delay. Then, since diverse placements with a network delay close to the optimal one may require different changes in function deployment, the second step is in charge of choosing a placement that minimizes deployment actions, that is, disruption.

Table 1 summarizes the inputs NEPTUNE requires from the user, the characteristics of the used nodes, the values gathered by the monitoring infrastructure, and the decision variables adopted in the MIP formulation.
3.1 Function placement
Each time the Community level is activated, the 2-step optimization process is executed twice. The first execution aims to fully utilize
2 NEPTUNE does not handle application state migration.
Table 1: Inputs, data, and decision variables.

m_f: Memory required by function f
m^G_f: GPU memory required by function f
φ_f: Maximum allowed network delay for function f
M_j: Memory available on node j
M^G_j: GPU memory available on node j
U_j: CPU cores on node j
U^G_j: GPU cores on node j
δ_{i,j}: Network delay between nodes i and j
O_best: Objective function value found after step 1
λ_{f,i}: Incoming f requests to node i
u_{f,j}: Average CPU cores used by node j per single f request
u^G_{f,j}: Average GPU cores used by node j per single f request
x_{f,i,j}: Fraction of f requests sent to CPU instances from node i to j
c_{f,j}: 1 if a CPU instance of f is deployed on node j, 0 otherwise
x^G_{f,i,j}: Fraction of f requests sent to GPU instances from node i to j
c^G_{f,j}: 1 if a GPU instance of f is deployed on node j, 0 otherwise
MG_f: Number of f migrations
CR_f: Number of f creations
DL_f: Number of f deletions
GPU resources, while the second only considers CPUs and the remaining workload to be handled.

Since the two executions are similar, the formulation presented herein is generalized. Some of the employed data are resource-specific: Table 1 differentiates them with a superscript G, while in the rest of this section the superscripts are omitted for simplicity.
Network delay minimization. The first step aims to place function instances and to find routing policies that minimize the overall network delay D in a given community C ⊆ N.

The formulation employs two decision variables: x_{f,i,j} and c_{f,j}. The former (x_{f,i,j} ∈ [0, 1]) represents the fraction of incoming f requests (λ_{f,i}) that node i forwards to node j (i.e., routing policies). The latter (c_{f,j} ∈ {0, 1}) is a boolean variable that is 1 if an instance of f is deployed onto node j (i.e., placement).

The objective function (Formula 1) minimizes the overall network delay of the incoming workload in C. Starting from the incoming f requests to each node i (λ_{f,i}) and the measured delay between nodes i and j (δ_{i,j}), it computes the fractions of outsourced requests that minimize the overall network delay:

min Σ_{f∈F} Σ_{i∈C} Σ_{j∈C} x_{f,i,j} ∗ λ_{f,i} ∗ δ_{i,j}   (1)

For the sake of brevity, we use f request to mean a request generated for function f, and f instance to refer to an instance of function f.
If we only considered inter-node delays (δ_{i,j}), we would minimize the overall network delay only if incoming requests were distributed evenly (i.e., each node manages the same amount of requests). Since workloads can be very different, the addition of the per-node incoming requests (λ_{f,i}) gives a more appropriate formulation of the problem. Intuitively, the higher the workload in a specific area is, the more important the minimization of its network delay becomes.
In addition to the function to minimize, we must add some constraints. First, requests cannot be forwarded too far from where they are generated. Each function is characterized by parameter φ_f, which sets the maximum allowed network delay of each f request:

x_{f,i,j} ∗ δ_{i,j} ≤ x_{f,i,j} ∗ φ_f   ∀i,j ∈ C, ∀f ∈ F   (2)

Second, the nodes that receive forwarded requests must have a function instance that can serve them:

c_{f,j} = 1 if (Σ_{i∈C} x_{f,i,j} > 0), 0 otherwise   ∀j ∈ C, ∀f ∈ F   (3)

Third, the overall memory required by the functions (m_f) deployed on a node must not exceed its capacity M_j:

Σ_{f∈F} c_{f,j} ∗ m_f ≤ M_j   ∀j ∈ C   (4)

Fourth, to avoid resource contentions, routing policies must consider the overall amount of GPU or CPU cores available on the node (U_j) and the average GPU or CPU cores consumption for each f request processed on node j (u_{f,j}):

Σ_{f∈F} Σ_{i∈C} x_{f,i,j} ∗ λ_{f,i} ∗ u_{f,j} ≤ U_j   ∀j ∈ C   (5)

Fifth, routing policies must be defined for all the nodes in the community and all the functions of interest:

Σ_{j∈C} x_{f,i,j} = 1   ∀i ∈ C, ∀f ∈ F   (6)

Note that when i = j, x_{f,i,j} gives the fraction of f requests executed locally, that is, on i itself.
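To make the first step concrete, the following toy sketch (illustrative only: a single function, integral rather than fractional routing, greedy rather than MIP-optimal routing, and invented delays, loads, and capacities) enumerates placements and routes each node's load to the closest feasible instance, mirroring the objective of Formula 1 under constraints in the spirit of Formulae 2, 5, and 6.

```python
from itertools import combinations

# Toy community: 3 nodes, inter-node delays (ms) and per-node load (req/s).
DELAY = [[0, 5, 12],
         [5, 0, 8],
         [12, 8, 0]]
LOAD = [100, 20, 50]   # lambda_{f,i}: incoming f requests per node
CAPACITY = 120         # max requests a single instance can absorb
PHI = 10               # max allowed network delay for f (ms)

def best_placement(max_instances):
    """Exhaustively search placements of at most `max_instances` instances
    and greedily route each node's load to the closest feasible instance."""
    best = None
    for k in range(1, max_instances + 1):
        for placed in combinations(range(len(LOAD)), k):
            residual = {j: CAPACITY for j in placed}
            delay, feasible = 0, True
            # Route heaviest nodes first (greedy heuristic, not the MIP optimum).
            for i in sorted(range(len(LOAD)), key=lambda i: -LOAD[i]):
                for j in sorted(placed, key=lambda j: DELAY[i][j]):
                    if DELAY[i][j] <= PHI and residual[j] >= LOAD[i]:
                        residual[j] -= LOAD[i]
                        delay += LOAD[i] * DELAY[i][j]  # objective of Formula 1
                        break
                else:
                    feasible = False  # no instance within PHI with enough capacity
                    break
            if feasible and (best is None or delay < best[0]):
                best = (delay, placed)
    return best

print(best_placement(2))
```

With these numbers a single instance cannot serve the community (either the delay bound or the capacity is violated), while placing instances on the two heaviest nodes yields the smallest weighted delay.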
This optimization problem finds the placement with the minimum network delay. However, each iteration (execution of the optimization problem) may suggest a placement that requires many disruptive operations (i.e., deletions, creations, and migrations) with respect to the previous placement (iteration). For this reason, a second step is used to minimize service disruption and preserve stability.
Disruption minimization. The second step searches for a function placement that minimizes function creation, deletion, and migration with an overall network delay close to the optimal one found by the first step.

This means that the second step keeps the constraints defined in Formulae 2-6 and adds:

Σ_{f∈F} Σ_{i∈C} Σ_{j∈C} x_{f,i,j} ∗ λ_{f,i} ∗ δ_{i,j} ≤ O_best ∗ (1 + ε)   (7)

to impose that the final placement must be in the interval [O_best, O_best ∗ (1 + ε)], where O_best is the smallest network delay found after the first step, and ε is an arbitrarily small parameter that quantifies the allowed worsening in terms of network overhead. For example, ε = 0.05 means a worsening up to 5%.
We also consider the number of created, deleted, and migrated instances between two subsequent executions of the 2-step optimization process, that is, between the to-be-computed placement (c_{f,i}) and the current one (c^old_{f,i}). Deletions (creations) are computed as the maximum between 0 and the removed (added) instances between the two iterations:

DL_f = Σ_{i∈C} max(c^old_{f,i} − c_{f,i}, 0)   ∀f ∈ F
CR_f = Σ_{i∈C} max(c_{f,i} − c^old_{f,i}, 0)   ∀f ∈ F   (8)

The number of migrations (in the new placement), that is, the number of instances that have been moved from one node to another, is computed as the minimum between instance creations CR_f and instance deletions DL_f:

MG_f = min(CR_f, DL_f)   ∀f ∈ F   (9)
The new objective function is then defined as:

min Σ_{f∈F} (MG_f + ω_CR ∗ CR_f + ω_DL ∗ DL_f)   (10)

The goal of the objective function is to minimize the number of migrations (MG_f), since deletions and creations are necessary to avoid over- and under-provisioning. Factors ω_CR and ω_DL, which are always lower than 1, allow us to discriminate among solutions with the same amount of migrations, but a different number of creations and deletions.

This formulation ensures close-to-optimal network delays, along with the minimum number of instances, to serve the current workload. The controllers at Node level are then entitled to scale executors vertically as needed (see next Section).
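As a small illustration of Formulae 8 and 9 (a sketch with hypothetical node names, not NEPTUNE's code), the following computes creations, deletions, and migrations for one function between two placements:

```python
# Disruption metrics between the current placement (c_old) and a candidate
# one (c_new), both maps node -> 1/0 for a single function f.
def disruption(c_old, c_new):
    deletions = sum(max(c_old[i] - c_new[i], 0) for i in c_old)   # DL_f
    creations = sum(max(c_new[i] - c_old[i], 0) for i in c_old)   # CR_f
    migrations = min(creations, deletions)                        # MG_f
    return creations, deletions, migrations

# An instance leaves N1 and appears on N3: one creation plus one deletion,
# counted as a single migration.
print(disruption({"N1": 1, "N2": 1, "N3": 0},
                 {"N1": 0, "N2": 1, "N3": 1}))
```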
3.2 CPU allocation
The Node level is in charge of minimizing E + Q, that is, the handling time, defined as the sum of the execution time E and the queue time Q; the network delay δ is already minimized by the Community level. E + Q can vary due to many factors, such as variations in the workload or changes in the execution environment, and we aim to control it by changing the amount of CPU cores allocated to function instances. If this is not enough, the problem is lifted up to the Community level that re-calibrates the number of function instances.
Control theoretical approaches have proven to be an effective solution for the self-adaptive management of these resources [ ]. The Node level comprises a lightweight Proportional-Integral (PI)
controller for each function instance to scale allocated cores dynamically. PI controllers support fast control periods, have constant complexity, and provide formal design-time guarantees.
Each function instance is equipped with an independent PI controller. The control loop monitors the average value of QE (i.e., E + Q), computes the core allocation, and actuates it. More formally, given a desired set point QE_{f,desired}, the controller periodically measures the current QE_{f,j} (controlled variable), that is, the actual value of QE, and computes the delta between desired and actual value. Note that, since the controllers will strive to keep QE_{f,j} close to the set point QE_{f,desired}, this value should be set to a lower value than the desired RT^R_f.

The controller reacts to the error and recommends the new amount of cores that the function should use. Algorithm 1 describes the computation.

Algorithm 1 Node level CPU core allocation.
1: procedure ComputeInstanceCores(f, j)
2:   err := 1/QE_{f,desired} − 1/QE_{f,j};
3:   cpu := getCPUAllocation(f, j);
4:   int_old := cpu − g_int ∗ err_old;
5:   int := int_old + g_int ∗ err;
6:   err_old := err;
7:   prop := err ∗ g_prop;
8:   cpu := int + prop;
9:   cpu := max(cpu_min, min(cpu_max, cpu));
10: end procedure

Line 2 computes error err as the difference between the inverse of QE_{f,desired} and the inverse of QE_{f,j}. To compute the integral contribution, the current core allocation (cpu) of the function instance is retrieved at line 3. The previous integral contribution int_old is computed at line 4 by using the allocation, the integral gain g_int (i.e., a tuning parameter), and the prior error err_old. The new integral contribution int is computed by multiplying the current error err by the integral gain g_int, and by adding int_old (line 5). The previous error err_old is then updated at line 6. The proportional contribution prop is computed by using err and the proportional gain g_prop at line 7. Finally, the new allocation cpu is calculated as the sum of the two contributions (line 8) and then adjusted according to the maximum and minimum allowed core allocations cpu_max and cpu_min, respectively (line 9).
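A minimal, illustrative Python rendition of Algorithm 1 follows; the gains, bounds, and set point below are invented for the example (the prototype itself runs on K3S, not this code):

```python
# Sketch of the Node-level PI controller of Algorithm 1; all parameter
# values are hypothetical tuning choices for illustration only.
class PIAllocator:
    def __init__(self, qe_desired, g_prop=0.05, g_int=0.1,
                 cpu_min=0.1, cpu_max=4.0):
        self.qe_desired = qe_desired            # desired QE set point (s)
        self.g_prop, self.g_int = g_prop, g_int  # proportional/integral gains
        self.cpu_min, self.cpu_max = cpu_min, cpu_max
        self.err_old = 0.0

    def step(self, qe_measured, cpu_current):
        err = 1.0 / self.qe_desired - 1.0 / qe_measured    # line 2
        int_old = cpu_current - self.g_int * self.err_old  # line 4
        integral = int_old + self.g_int * err              # line 5
        self.err_old = err                                 # line 6
        prop = err * self.g_prop                           # line 7
        cpu = integral + prop                              # line 8
        return max(self.cpu_min, min(self.cpu_max, cpu))   # line 9

ctrl = PIAllocator(qe_desired=0.2)
# Measured QE (0.5s) above the 0.2s set point: the allocation grows.
print(ctrl.step(qe_measured=0.5, cpu_current=1.0))
```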
Being independent of the others, these controllers are not aware of available CPU cores and of the allocations computed by the other controllers. Therefore, the computed allocations (line 9) are not immediately applied since they could exceed the allowed capacity. The allocations of the function instances deployed on a node are processed by a Contention Manager (one per node), which is in charge of computing a feasible allocation. If the sum of suggested allocations fits the allowed capacity, they are applied without any modification. Otherwise, they are scaled down proportionally. The Contention Manager can easily be extended to embed other, non-proportional heuristics to manage resource contention.
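The proportional scale-down can be sketched as follows (function names, capacity, and requested cores are illustrative):

```python
# Per-node Contention Manager sketch: if the PI controllers' requested
# cores exceed the node capacity, scale every request down proportionally.
def resolve_contention(requested, capacity):
    total = sum(requested.values())
    if total <= capacity:
        return dict(requested)   # everything fits: apply as-is
    scale = capacity / total     # proportional scale-down factor
    return {f: cores * scale for f, cores in requested.items()}

print(resolve_contention({"f_a": 3.0, "f_b": 1.0}, capacity=2.0))
```

The design keeps each PI controller oblivious of its neighbors: coordination happens only at this final, per-node arbitration step.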
Implementation. We implemented a prototype built on top of K3S, a popular distribution of Kubernetes for edge computing. Each control level is materialized in a dedicated component that exclusively uses native K3S APIs to manage deployed applications. Differently from existing approaches (see Section 5), the prototype is capable of performing in-place vertical scaling of containers, that is, it can dynamically update the CPU cores allocated to the different containers without restarting them.

The stable version of K3S does not allow one to change allocated resources without restarting function instances, a process that can sometimes take minutes. This could decrease the capability of the Node level to handle bursty workloads. For this reason, the prototype augments K3S with the Kubernetes Enhancement Proposal 1287, which implements In-Place Pod Vertical Scaling and allows resources to be changed without restarts. This enables faster control loops and better control quality.
To provide an effective usage of GPUs, the prototype uses nvidia-docker, a container runtime that enables the use of GPUs within containers. However, by default, GPU access can only be reserved to one function instance at a time. This prevents the full exploitation of GPUs and limits the possible placements produced at the Community level. To solve this problem, the prototype employs a device plugin developed by Amazon that enables the fractional allocation of GPUs. In particular, the plugin makes use of the Nvidia Multi-Process Service (MPS), a runtime solution designed to transparently allow GPUs to be shared among multiple processes (e.g., containers).
4 EVALUATION
Research questions. The solution adopted at the Topology level has been largely covered by PAPS [ ]. The experiments in the paper focus on evaluating the Community and Node levels. The conducted evaluation addresses the following research questions:

RQ1: How does NEPTUNE handle workloads generated by mobile users at the edge?
RQ2: How does NEPTUNE perform compared to other state-of-the-art approaches?
RQ3: How does NEPTUNE use GPUs to speed up response times?
4.1 Experimental setup
Infrastructure. We conducted the experiments on a simulated MEC topology with nodes provisioned as a cluster of geo-distributed AWS EC2 virtual machines spread across three areas. Each area corresponds to a different AWS region: Area A to eu-west, Area B to us-east, and Area C to us-west. Since communities are independent, our experiments focused on evaluating different aspects of NEPTUNE within a single community that included the three areas.

Figure 3 shows the average network delays between each pair of areas and nodes, computed as the round trip times of an ICMP (Internet Control Message Protocol) [ ] packet. Note that nodes
4 Source code available at https://github.com/deib-polimi/edge-autoscaler.
Figure 3: Network delay between areas.
of the same area were deployed onto different AWS availability zones to obtain significant network delays. Each area contained three worker nodes, and one in Area A was GPU-empowered. These nodes were deployed as c5.xlarge instances (4 vCPUs, 8 GB memory); the one with the GPU used a g4dn.xlarge instance (4 vCPUs, 16 GB memory, 1 GPU). The master node (not depicted in the figure) was deployed on a c5.2xlarge instance (8 vCPUs, 16 GB memory).
NEPTUNE control periods. Node controllers were configured with
a control period of 5 seconds. Faster control loops can be used, but
they may lead to inconsistent resource allocation updates since K3S
resource states are stored in a remote database. Function placement
and routing policies were recomputed by Community controllers
every minute, while the Topology controller was triggered every 10
minutes.
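The three control periods above (5 s, 1 min, 10 min) nest evenly, so the whole hierarchy can be driven by a single tick-based scheduler; a minimal sketch (the function name is illustrative, not part of the prototype):

```python
# Control periods used in the experiments (seconds).
NODE_PERIOD_S = 5        # Node-level controllers
COMMUNITY_PERIOD_S = 60  # placement and routing recomputation
TOPOLOGY_PERIOD_S = 600  # Topology-level reconfiguration

def due_controllers(t_s: int) -> list[str]:
    """Return which control levels fire at second t_s."""
    due = []
    if t_s % NODE_PERIOD_S == 0:
        due.append("node")
    if t_s % COMMUNITY_PERIOD_S == 0:
        due.append("community")
    if t_s % TOPOLOGY_PERIOD_S == 0:
        due.append("topology")
    return due
```

At second 600 all three levels fire together; on most ticks only the Node level runs, which is what keeps the fast loop cheap.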
Applications. To work on a reasonable set of experiments, we
used the three applications summarized in Table 2: we created
the first function, and we borrowed the other two from the literature
[ ]. These applications are written using multiple programming
languages (e.g., Rust, Java, Go) and have different memory
requirements (ranging from 15 MB to 500 MB) and cold start
times (from a bunch of seconds to minutes). The first application is
primes, a stateless and CPU-heavy function that counts all the prime
numbers less than a given input number. As exemplar complex application
we employed sock-shop, which implements an e-commerce
platform. The application uses a microservice architecture; we further
decomposed it into smaller functions to make it suitable for a
function-based deployment. For example, microservice carts was divided
into three smaller units: carts-post, carts-delete, and carts-util. Finally,
to also evaluate GPU-accelerated tasks (e.g., machine learning inference),
we used ResNet [ ], a neural network model for image
classification, implemented using TensorFlow Serving. For each
function, Table 2 also reports the memory requirements, the cold
start times, and the desired response times (obtained by applying
The source code of the function-based version of sock-shop is available at https:
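The primes function above simply counts the primes below its input; a minimal Python sketch of the same computation (the deployed function is written in Rust):

```python
def count_primes(n: int) -> int:
    """Count the prime numbers strictly less than n, using naive
    trial division -- deliberately CPU-heavy, like the benchmark."""
    count = 0
    for k in range(2, n):
        d = 2
        while d * d <= k:
            if k % d == 0:
                break  # k has a divisor, so it is not prime
            d += 1
        else:
            count += 1  # no divisor found: k is prime
    return count
```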
Table 2: Characteristics of deployed functions.
Name Language Memory 𝑅𝑇𝑅 Cold start
Simple stateless function
primes Rust ∼15 MB 200ms <5s
carts-post Java ∼360 MB 300ms ∼100s
carts-delete Java ∼360 MB 200ms ∼100s
carts-util Java ∼360 MB 200ms ∼100s
catalogue Go ∼15 MB 200ms <5s
orders Java ∼400 MB 600ms ∼100s
payment Go ∼15 MB 50ms <5s
shipping Java ∼350 MB 50ms ∼100s
login Go ∼15 MB 100ms <5s
registration Go ∼15 MB 200ms <5s
user Go ∼15 MB 50ms <5s
Machine Learning inference
resnet Python ∼500 MB 550ms ∼100s
the procedure described in Section 4.5). The set points used by the
PI controllers were set to half of the value of 𝑅𝑇𝑅.
We used Locust, a distributed scalable performance testing tool,
to feed the system, and mimicked service demand with different
realistic, dynamic workloads. Each experiment was executed
five times to have (more) consistent results.
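The Node-level controllers mentioned above are PI controllers whose set point is half of 𝑅𝑇𝑅. A minimal discrete PI sketch (the gains and clamping bounds are illustrative assumptions, not the prototype's actual tuning):

```python
class PIController:
    """Discrete PI controller that turns the response-time error into a
    core allocation: above the set point -> more cores, below -> fewer."""

    def __init__(self, set_point_ms: float, kp: float = 0.002,
                 ki: float = 0.0005, min_cores: float = 0.1,
                 max_cores: float = 4.0):
        self.set_point = set_point_ms
        self.kp, self.ki = kp, ki
        self.min_cores, self.max_cores = min_cores, max_cores
        self.integral = 0.0

    def next_allocation(self, measured_rt_ms: float) -> float:
        error = measured_rt_ms - self.set_point  # positive when too slow
        self.integral += error
        cores = self.kp * error + self.ki * self.integral
        # Clamp to the cores a single node can actually grant.
        return min(self.max_cores, max(self.min_cores, cores))

# primes: RT_R = 200 ms, so the set point is 100 ms.
controller = PIController(set_point_ms=100.0)
```

The integral term is what lets the allocation keep growing while the response time stays above the set point, instead of settling on a fixed offset.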
Collected metrics. For each experiment, we collected the average
(𝜇) and standard deviation (𝜎) of the following metrics: i) response
time (ms) as defined in Section 2, ii) response time violation rate
(% of requests), defined as the percentage of requests that are not
served within the required response time, considering the 99th percentile
of the measured response times, iii) network time rate (%), as the
percentage of time spent to forward requests in the network over the
total response time, and iv) allocated cores (millicores, or thousandths of a
core), to measure the resources consumed by function instances.
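The first three metrics can be computed directly from per-request samples; a sketch (simplified: the 99th-percentile filtering applied in the paper is omitted):

```python
def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def violation_rate(rts_ms: list[float], rt_r_ms: float) -> float:
    """% of requests not served within the required response time."""
    return 100.0 * sum(rt > rt_r_ms for rt in rts_ms) / len(rts_ms)

def network_time_rate(network_ms: list[float], total_ms: list[float]) -> float:
    """% of the total response time spent forwarding requests."""
    return 100.0 * sum(network_ms) / sum(total_ms)
```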
Competitors. Our experiments compare NEPTUNE against three
well-known approaches: K3S, Knative (KN), and OpenFaaS (OF).
K3S is one of the most popular solutions for container orchestration
at the edge. It manages the full lifecycle of containerized applications
deployed in a topology and adopts a fair placement policy, that
is, it schedules containers to keep the resource utilization of nodes
equal. K3S exploits the Horizontal Pod Autoscaler to
scale applications. KN and OF add serverless functionalities to K3S
and a set of custom components to perform request routing and
autoscaling. To achieve consistent and statistically relevant results, all
experiments described in this section were run 5 times.
Figure 4: Behavior of NEPTUNE with moving workloads: (a) geo-dynamic workload shape (users), (b) resource allocation (millicores) per node (Node-A-0, Node-A-1, Node-B-0, Node-B-1), (c) average response time (ms) against 𝑅𝑇𝑅, (d) networking time rate (%).
4.2 RQ1: Moving workload
The first experiments evaluate the performance of NEPTUNE when
users move between Area A and Area B within the same community.
We used a cluster of four worker nodes: two nodes in Area A (not
equipped with GPUs) and two in Area B. Each run lasted 60 minutes
and used application primes with 𝑅𝑇𝑅 set to 200ms and the set
point of the PI controllers to 100ms. User migration happened twice
per run and consisted in moving 100 users from one area to another
in less than 10 minutes.
Figure 4 shows the behavior of application primes when managed
by NEPTUNE. Since the multiple runs executed for this set of
experiments had similar behavior, the figure illustrates how workloads,
resources, and performance varied over time during one of
these runs. Figure 4a shows how the workload changed in each area.
In particular, the workload was generated by users close to node
Node-A-0 for Area A and Node-B-0 for Area B. Figure 4b presents the
resources allocated to each node over time. Since communities are
independent, at least one instance per function is always allocated
(if possible) to minimize cold starts. Thus, the overall allocation
is always greater than zero. Conversely, if a node has no resources
allocated at a given time, it means that no function instance is running
on it at that time (e.g., from second 0 to 1250 for Node-B-0).
The chart shows that if one node in an area cannot manage the generated
load, the Community level detects this issue and instantiates
a new function instance on another node as close to the workload
generator as possible. This behavior can be observed close to
second 600, when the workload in Area A reaches its peak and
a new replica is created on Node-A-1. Similarly, at second 1500, a
new replica is deployed on Node-B-1 when the workload in Area
B increases. In contrast, when the workload decreases, instances
are deleted, as shown close to second 2700 on Node-B-0. Moreover,
the experiment clearly shows how NEPTUNE is able to migrate
function instances when users move, so as to keep the network delay
minimized. For example, close to second 1000, users move from
Area A to Area B, and right after, the function is migrated to node
Node-B-0 to handle the workload in the proximity of users.
Thanks to NEPTUNE, function primes never violates the set response
time: the average response time in Figure 4c is always
significantly lower than the threshold (200 ms). The control loops
are able to keep the response time very close to the set point.
Control-theoretical controllers behave very well when they operate
with high-frequency control loops, enabled by the in-place vertical
scaling feature [ ]. In fact, the response time only deviates from
the set point when the instances are replicated (scaled horizontally),
at seconds 600, 1600, and 2600, since this action requires more time
than re-configuring containers. However, note that the response
time always returns close to the set point, and this shows that
NEPTUNE can recover from multiple types of perturbations (e.g.,
creation and deletion of replicas, fluctuating workloads).
Figure 4d shows that NEPTUNE is able to keep the network
overhead extremely low. The only peaks in the chart (seconds 1100
and 2200) are caused by users who change location and by the fact
that routing policies are not updated immediately.
When users start to migrate to another area, replicas cannot always
be created immediately on the nodes with the minimum network
delay, as depicted in the chart close to second 1100: the workload
on Node-B-0 increases and an instance is created on Node-B-1.
This behavior occurs because the two-step optimization process
evaluates the placements on B-0 and B-1 as nearly equivalent,
since they handle a small portion of the traffic compared
to the nodes in Area A. However, NEPTUNE migrates the function
instance directly to Node-B-0 as soon as the workload in Area
B increases (close to second 1200).
4.3 RQ2: Comparison with other approaches
We compared our solution against the three approaches described
in Section 4.1 by means of application sock-shop. Note that some
of the functions of this application must invoke other functions.
For example, function orders invokes function user to retrieve the user's
address and payment card information, function catalogue to retrieve
product details, function payment to ensure the creation of
the invoice, and, finally, in case of success, function carts-delete to
empty the cart. We took these dependencies into account by setting
adequate response times, as shown in Table 2: from 50ms, for simple
Table 3: Results of the comparison with other approaches.
Function Response time (ms) Response time violation (%) Network time rate (%) Core allocation (millicores)
NEPT K3S KN OF NEPT K3S KN OF NEPT K3S KN OF NEPT K3S KN OF
carts-delete 𝜇66.7 64.6 60.6 100.3 0.1 0.3 0 2 3.5 63.9 92.3 72.9 631.3 1921.1 596 597.5
𝜎3.4 10.9 1.6 27.3 0.1 0.0 0 1.3 2.1 16.5 1.2 18.2 149.9 429.6 2.1 2.6
carts-post 𝜇110.6 175.9 73.8 184.3 0.1 3.5 0.1 3.4 3.7 68 78.3 69 722.8 615.5 597.4 597.3
𝜎7.6 64.2 2.0 73.7 0.1 2.7 0.1 2.6 2.9 22.4 0.9 26.4 178.9 31.3 2.6 2.1
carts-util 𝜇57.4 95.4 54.6 45.6 0 1.7 0 0.1 2.6 78.5 92.8 70.3 516.3 689.3 596.5 4306.1
𝜎3.0 31.0 1.2 1.8 0.1 1.4 0 ~0 1.2 19.6 1.0 2.2 83.2 162.2 2.2 180.3
catalogue 𝜇53.3 54.6 163.1 39.2 0 0.1 17.7 0 1.6 74 41.9 71.2 102.7 197.6 65.2 458
𝜎2.7 5.2 35.4 1.4 0 ~0 2.1 0 0.6 11.9 6.3 1.6 4.1 23.9 13.0 3.1
orders 𝜇211.6 418.9 505.1 485.2 0 16.6 16.5 16 4.1 15.8 44.2 25.2 1114.8 4484.5 1040.7 597.8
𝜎12.1 86.7 165.7 126.4 0 8.2 3.0 9.0 1.7 7.5 25.0 8.2 273.0 407.2 294.5 1.2
payment 𝜇10.4 50.2 27.9 23.6 0 2.8 1.2 0.4 8.2 98.7 98.4 98.9 795 101.8 49.7 443.1
𝜎0.7 9.8 0.4 1.3 0 0.4 0.7 ~0 6.1 9.5 1.3 3.9 438.9 13.2 0.2 4.3
shipping 𝜇15 75 28.6 88 2.6 5.9 1.5 8.5 6.4 96.2 95.6 92.7 416.5 888.5 597.4 596.9
𝜎1.1 23.2 0.6 32.8 1.1 1.6 0.6 3.5 2.5 16.9 1.7 20.3 132.3 202.8 1.0 0.9
login 𝜇30.3 72.5 73.2 46 0 2.8 11.1 0.2 2.6 70.1 77.9 63.2 76.7 94.2 54.1 452.2
𝜎1.5 12.3 12.1 0.7 0 0.7 7.0 ~0 0.9 14.4 1.2 1.1 13.9 15.6 6.1 6.6
registration 𝜇46.4 57.7 65 34.9 0 0.1 2.6 0 1.4 80.9 87.9 81.7 71.6 105.3 53.6 453.4
𝜎2.7 6.3 4.4 1.3 0 0.1 1.4 0 0.4 10.4 1.6 1.8 9.8 12.0 6.1 6.2
user 𝜇21.8 66.4 177 93.4 0.5 7.8 46.7 16.8 7.1 77.9 31.5 76.8 153.2 681.8 355.3 463.2
𝜎0.7 6.4 35.1 20.3 0.5 0.4 24.2 5.3 1.0 10.1 5.9 16.3 23.3 91.0 166.8 1.9
functions with no dependencies, to 600ms, assigned to the more complex ones.
Each run had a duration of 20 minutes and used a workload
that resembles a steep ramp, with an arrival rate that
suddenly increases over a short period of time. The workload started
with 10 concurrent users, and we added one additional user every
second up to 100. We considered a network of 6 nodes in Area B.
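The ramp described above is easy to restate: the number of concurrent users at second t is a capped linear function (a sketch, with Locust's actual spawn mechanics abstracted away):

```python
def ramp_users(t_s: int, start: int = 10, step_per_s: int = 1,
               cap: int = 100) -> int:
    """Concurrent users at second t_s of the steep-ramp workload."""
    return min(cap, start + step_per_s * t_s)
```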
Table 3 reports the statistical results obtained during the experiments
with each approach and with each function of application
sock-shop. The results show that NEPTUNE provided in most of the
cases the lowest response time compared to the other approaches.
The obtained response times were consistent across multiple runs:
the standard deviation ranged between 3% and 7% of the average.
Other approaches presented higher standard deviation values: in
the worst case, KN obtained a standard deviation equal to 32%
of the average, while K3S (36.5%) and OF (40%) were even more
inconsistent.
NEPTUNE reported few violations of the required response time.
For most functions the amount of violations was lower than or equal to
1%, while it was 2.6% and 0.5% for functions shipping and user,
respectively. The other solutions obtained significantly higher violations.
In the worst case, K3S failed to meet the foreseen response
time for 16.6% of the requests, while OF and KN reported violations
for 16.9% and 46.8% of the requests, respectively. This can be explained because
the other approaches, compared to NEPTUNE, do not employ precise
routing policies, do not perform an adequate resource allocation,
and do not solve resource contentions on nodes.
We can also observe how NEPTUNE's routing policies helped
meet the set response times. The percentage of time spent routing
requests ranges from 1.4% to 8.2% of the total response time, and,
on average, only 4.1% of the time is spent in the network. On the
other hand, the routing policies of the other solutions do not consider
node utilization, network delay, and application performance. K3S
reported a network time rate ranging from 15.8% to 98.7% of the
response time, with an overall average of 72.4%. Similarly, OF and
KN obtained an average network time rate of 72.2% and 74.1%, respectively.
Finally, as for the resources allocated by each approach for each
function, NEPTUNE allocated on average 4600 millicores, while
K3S and OF used about twice that amount, 9780 and 8960 millicores,
respectively. KN uses fewer resources than NEPTUNE on
average (4500 millicores), but it also suffers from a high number
of response time violations. This means that KN usually allocates
fewer resources than needed (e.g., for function catalogue).
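The per-approach averages quoted in this section can be reproduced from the 𝜇 rows of Table 3; for example, for the network time rate:

```python
# Network time rate (mu, %) per function, copied from Table 3.
neptune = [3.5, 3.7, 2.6, 1.6, 4.1, 8.2, 6.4, 2.6, 1.4, 7.1]
k3s = [63.9, 68.0, 78.5, 74.0, 15.8, 98.7, 96.2, 70.1, 80.9, 77.9]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(neptune), 1), round(mean(k3s), 1))  # 4.1 72.4
```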
Differently from NEPTUNE, the other solutions do not adopt any
resource contention mechanism to provide a fair allocation of resources.
For example, K3S allocated most of the resources, 4480
millicores, to function orders, while other functions could not get
the resources to work properly. This creates an imbalance among
functions that prevents applications from being properly scaled and leads
to more response time violations.
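The missing mechanism can be sketched as proportional fair sharing: when the controllers collectively request more cores than a node offers, every request is scaled down by the same factor (a simplification of NEPTUNE's Contention Manager, not its actual algorithm):

```python
def resolve_contention(requested: dict[str, float],
                       capacity: float) -> dict[str, float]:
    """Scale per-function core requests proportionally when a node is
    oversubscribed; leave them untouched otherwise."""
    total = sum(requested.values())
    if total <= capacity:
        return dict(requested)
    factor = capacity / total
    return {fn: cores * factor for fn, cores in requested.items()}
```

Proportional scaling keeps the relative priorities implied by the controllers, instead of letting one greedy function (like orders under K3S) starve the others.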
4.4 RQ3: GPU Management
The third set of experiments was carried out to assess the transparent
GPU management provided by NEPTUNE for computationally
intensive functions. To provide a heterogeneous environment, the experiments
were conducted using the three nodes in Area A (Node-A-2
is equipped with a GPU).
We used two functions, called resnet-a and resnet-b, both embedding
the ResNet neural network in inference mode. The instances of the
two functions deployed on Node-A-2 were set to share the same GPU.
Each run lasted 20 minutes and used the same workload described
in Section 4.3, with a number of concurrent users starting
from 10 and up to 30 (increased by one every second).
Figure 5: Resnet-a: CPU and GPU executions: (a) response time; (b) response time distribution.
Figure 5 illustrates one run of the experiments. Figure 5a shows
the average response time of function resnet-a when executed on
both CPUs and GPUs. Function resnet-b obtained similar results
that are not reported here for lack of space. GPU executions obtained
an almost constant response time and never violated the set
response time.
At the beginning of the experiment, all the requests were routed
to the GPU, and after some 50 seconds the GPU was fully utilized.
To avoid degradation of the response time, the Community level
quickly reacted by updating the routing policies and allowing part
of the workload to be handled by function instances running on
CPUs. The mean response time of CPU instances shows a peak
at the beginning of the experiment (with some brief violations of
the response time) that is caused by the cold start. After that, the
Node level comes into play and dynamically adjusts the CPU cores
allocated to the replicas to keep the response time close to the set
point.
The box plot of Figure 5b shows the distribution of response
times for both functions resnet-a and resnet-b on GPU, on CPU,
and the aggregated result. The whiskers extend to 1.5 times the interquartile
range (IQR), and the rectangle shows the distribution between the 25th and 75th
percentiles. Both GPU instances of resnet-a and resnet-b are able to
keep response times quite far from the set threshold, and thus cause no
violations. In particular, the mean response times of resnet-a and
resnet-b are 180ms and 183ms, respectively, roughly three times
smaller than the threshold.
The distribution of response times on CPUs is wider compared
to GPUs. CPU containers are managed by PI controllers that have a
transient period to adjust the initial core allocation to an adequate
value to reach the desired set point; this does not happen with GPU
instances. Nevertheless, the CPU-only replicas of resnet-a and resnet-b
can serve 98.3% and 100% of requests within the set response time,
respectively. Moreover, GPU instances handled 70% of the requests, while
the remaining part was routed to CPU instances. As a result, the
total number of violations for both functions is close to 0.
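The routing behavior described above amounts to filling the GPU first and overflowing the excess to CPU replicas; a minimal sketch (the capacity figures in the example are hypothetical):

```python
def split_traffic(arrival_rps: float, gpu_capacity_rps: float):
    """Send traffic to the GPU up to its capacity; the rest goes to CPUs."""
    to_gpu = min(arrival_rps, gpu_capacity_rps)
    return to_gpu, arrival_rps - to_gpu

# Once the GPU saturates at, say, 70 rps out of 100 rps of arrivals,
# 70% of requests stay on the GPU and 30% overflow to CPU instances.
```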
4.5 Threats to validity
We conducted the experiments using twelve functions (three applications),
showing that NEPTUNE is able to minimize the network
delay, to reduce response times, and to efficiently allocate resources
compared to three other well-known approaches. However, we
must highlight threats that may constrain the validity of the obtained
results. The experiments were run with synthetic
workloads that may introduce bias. Workloads have a ramp shape
to simulate an incremental growth or reduction of connected users.
We used the following procedure to retrieve the maximum concurrent
users in each experiment. First, we fixed the amount and
types of nodes the topology was composed of. Then, the maximum concurrent
users of each experiment was retrieved by observing how
many users were required to generate enough workload to
consistently utilize at least 70% of the cluster's resources.
The three applications were not provided with a given required
response time for each function (𝑅𝑇𝑅), which was thus computed using
an iterative process. Starting from 50ms, and proceeding with 50ms
increments, 𝑅𝑇𝑅 was set to the smallest value for which at least 50% of
the requests could be served in an amount of time equal to 𝑅𝑇𝑅. Some of
our assumptions may limit the generalization of the experiments.
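The iterative procedure for deriving 𝑅𝑇𝑅 can be sketched as a simple search, where `serve_fraction_within` is an assumed stand-in for an actual measurement run:

```python
def find_rt_r(serve_fraction_within, start_ms: int = 50,
              step_ms: int = 50, limit_ms: int = 5000) -> int:
    """Raise the candidate RT_R in 50 ms steps until at least 50% of the
    requests complete within it. `serve_fraction_within(t_ms)` is a
    hypothetical callback returning the fraction of requests served
    within t_ms."""
    rt = start_ms
    while rt < limit_ms and serve_fraction_within(rt) < 0.5:
        rt += step_ms
    return rt
```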
Consistently with the serverless paradigm, NEPTUNE assumes
functions to either be stateless (e.g., without session) or depend
on an external database. Currently, interactions with databases are
only partially modeled by NEPTUNE. The time to read from and
write to a database is modeled at the Node level as a non-controllable
stationary disturbance of the response time (e.g., a Gaussian noise).
Thus, during our experiments, databases were deployed on dedicated
and properly sized machines.
Results show that NEPTUNE is able to efficiently control functions
that depend on a database (e.g., orders, carts-post) with a
precision similar to the ones without dependencies (e.g., payment).
Construct and Conclusion Threats. The experiments demonstrate
the validity of our claim, that is, that NEPTUNE is able to
efficiently execute multiple functions deployed on a distributed
edge topology. All experiments have been executed five times, and
the obtained results are statistically robust and show small variance.
5 RELATED WORK
The management of edge topologies is a hot topic, widely addressed
by both industry and academia [ ]. To the best of our
knowledge, NEPTUNE is the first solution that provides: an easy-to-use
serverless interface, optimal function placement and routing
policies, in-place vertical scaling of functions, and transparent
management of GPUs and CPUs. The relevant related works we
are aware of only focus on specific aspects of the problem.
Wang et al. [ ] propose LaSS, a framework for latency-sensitive
edge computations built on top of Kubernetes and OpenWhisk.
LaSS models the problem of resource allocation using an M/M/c
FCFS queuing model. They provide a fair-share resource allocation
algorithm, similar to NEPTUNE's Contention Manager, and
two reclamation policies for freeing allocated resources. LaSS is
the most similar solution to NEPTUNE, but it lacks network overhead
minimization and GPU support. Furthermore, the approach
is not fully compatible with the Kubernetes API: Kubernetes is
only used to deploy OpenWhisk. Functions run natively on top of
the container runtime (e.g., Docker), and resources are vertically
scaled by bypassing Kubernetes. This approach, also adopted in
cloud computing solutions [ ], is known to create state representation
inconsistencies between the container runtime and the orchestrator.
Ascigil et al. [ ] investigate the problem for serverless functions
in hybrid edge-cloud systems and formulate it using
Mixed Integer Programming. They propose both fully-centralized
(function orchestration) approaches, where a single controller is
in charge of allocating resources, and fully-decentralized (function
choreography) ones, where controllers are distributed across the
network and decisions are made independently. Compared to NEPTUNE,
they focus on minimizing the number of unserved requests,
and they assume that each request can be served in a fixed amount
of time (a single time slot). However, this assumption is not easy to
ensure in edge computing: nodes may be equipped with different
types of hardware and produce different response times. This is
naturally considered in NEPTUNE with the help of GPUs.
Multiple approaches in the literature focus on placement and
routing at the edge [ ]. One of the most used techniques,
also employed by NEPTUNE, is to model service placement and
workload routing as an Integer or Mixed Integer Programming
problem. Notably, Tong et al. [ ] model a MEC network as a hierarchical
tree of geo-distributed servers and formulate the problem as a two-step
Mixed Nonlinear Integer Programming (MNIP) problem. In particular,
their approach aims to maximize the amount of served requests by
means of optimal service placement and resource allocation. The
effectiveness of their approach is verified using formal analysis and
large-scale trace-based simulations. They assume that workloads
follow known stochastic models (Poisson distribution) and
that arrival rates are independent and identically distributed. This
may not be true in the context of edge computing, where workloads
are often unpredictable and may significantly deviate from the
assumed distribution. NEPTUNE does not share these assumptions
and uses fast control-theoretical planners to mitigate volatility and
unpredictability in the short term.
To cope with dynamic workloads, Tan et al. [ ] propose an
online algorithm for workload dispatching and scheduling without
any assumption about the distribution. However, since their
approach only focuses on routing requests, it cannot always
minimize network delays, especially when edge clients move from
one location to another.
Mobile workloads are addressed, for example, by Leyva-Pupo
et al. [ ], who present a solution based on an Integer Linear Programming
(ILP) problem with two different objective functions, one
for mobile users and one for static ones. Furthermore, since the
problem is known to be NP-hard, they use heuristic methods to
compute a sub-optimal solution. Sun et al. [ ] propose a service
migration solution based on Mixed Integer Programming to keep
the computation as close as possible to the user. In particular, they
consider different factors that contribute to migration costs (e.g.,
required time and resources). However, the two aforementioned
solutions exploit virtual machines, which are known for their
large image sizes and long start-up times, making service migration
a very costly operation. NEPTUNE, like other approaches in the
literature [ ], uses containers, which are lighter and faster to start.
Only a few solutions have been designed for GPU management
in the context of edge computing. For example, the work by Subedi et al. [ ]
mainly focuses on enabling GPU-accelerated edge computation
without considering latency-critical aspects such as placing applications
close to the edge clients.
6 CONCLUSIONS AND FUTURE WORK
The paper proposes NEPTUNE, a serverless-based solution for managing
latency-sensitive applications deployed on geo-distributed,
large-scale edge topologies. It provides smart placement and routing
to minimize network overhead, dynamic resource allocation to
quickly react to workload changes, and transparent management
of CPUs and GPUs. A prototype built on top of K3S, a popular
container orchestrator for the edge, helped us demonstrate the
feasibility of the approach and obtain interesting results with respect to
similar state-of-the-art solutions.
Our future work comprises the improvement of the adopted scheduling
and resource allocation solutions by exploiting function dependencies
[ ] and workload predictors to anticipate future demand
[ ]. As a further extension, we will consider Bayesian optimization
approaches [ ] to find optimal response times automatically.
State migration and data consistency approaches can
also be integrated to manage stateful applications.
ACKNOWLEDGMENTS
This work has been partially supported by the SISMA national
research project (MIUR, PRIN 2017, Contract 201752ENYB).
REFERENCES
Onur Ascigil, Argyrios Tasiopoulos, Truong Khoa Phan, Vasilis Sourlas, Ioannis
Psaras, and George Pavlou. 2021. Resource Provisioning and Allocation in
Function-as-a-Service Edge-Clouds (Early Access). IEEE Transactions on Services
Computing (2021), 1–14.
David Balla, Csaba Simon, and Markosz Maliosz. 2020. Adaptive scaling of Kuber-
netes pods. In Proceedings of the IEEE/IFIP Network Operations and Management
Symposium, NOMS 2020. IEEE, 1–5.
Luciano Baresi, Davide Yi Xian Hu, Giovanni Quattrocchi, and Luca Terracciano.
2021. KOSMOS: Vertical and Horizontal Resource Autoscaling for Kubernetes. In
Proceedings of the 19th International Conference on Service-Oriented Computing,
ICSOC 2021 (Lecture Notes in Computer Science, Vol. 13121). Springer, 821–829.
Luciano Baresi, Danilo Filgueira Mendonça, and Giovanni Quattrocchi. 2019.
PAPS: A Framework for Decentralized Self-management at the Edge. In Proceed-
ings of the 17th International Conference on Service-Oriented Computing, ICSOC
2019 (Lecture Notes in Computer Science, Vol. 11895). Springer, 508–522.
Luciano Baresi and Giovanni Quattrocchi. 2018. Towards Vertically Scalable
Spark Applications. In Euro-Par 2018: Parallel Processing Workshops - Euro-Par
2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected
Papers (Lecture Notes in Computer Science, Vol. 11339). Springer, 106–118.
Luciano Baresi and Giovanni Quattrocchi. 2020. COCOS: A Scalable Architecture
for Containerized Heterogeneous Systems. In Proceedings of the IEEE International
Conference on Software Architecture, ICSA 2020. IEEE, 103–113.
Julian Bellendorf and Zoltán Ádám Mann. 2020. Classification of optimization
problems in fog computing. Future Gener. Comput. Syst. 107 (2020), 158–176.
David Bermbach, Jonathan Bader, Jonathan Hasenburg, Tobias Pfandzelter, and
Lauritz Thamsen. 2021. AuctionWhisk: Using an Auction-Inspired Approach for
Function Placement in Serverless Fog Platforms (Early Access). Software: Practice
and Experience (2021), 1–49.
Victor Campmany, Sergio Silva, Antonio Espinosa, Juan Carlos Moure, David
Vázquez, and Antonio M. López. 2016. GPU-based Pedestrian Detection for
Autonomous Driving. In Proceedings of the International Conference on Compu-
tational Science 2016, ICCS 2016 (Procedia Computer Science, Vol. 80). Elsevier,
Junguk Cho, Karthikeyan Sundaresan, Rajesh Mahindra, Jacobus E. van der
Merwe, and Sampath Rangarajan. 2016. ACACIA: Context-aware Edge Comput-
ing for Continuous Interactive Applications over Mobile Networks. In Proceedings
of the 12th International on Conference on emerging Networking EXperiments and
Technologies, CoNEXT 2016. ACM, 375–389.
Thomas Heide Clausen and Philippe Jacquet. 2003. Optimized Link State Routing
Protocol (OLSR). RFC 3626 (2003), 1–75.
Xavier Dutreilh, Nicolas Rivierre, Aurélien Moreau, Jacques Malenfant, and Isis
Truck. 2010. From Data Center Resource Allocation to Control Theory and Back.
In Proceedings of the IEEE International Conference on Cloud Computing, CLOUD
2010. IEEE, 410–417.
Nicolò Felicioni, Andrea Donati, Luca Conterio, Luca Bartoccioni, Davide Yi Xian
Hu, Cesare Bernardis, and Maurizio Ferrari Dacrema. 2020. Multi-Objective
Blended Ensemble For Highly Imbalanced Sequence Aware Tweet Engagement
Prediction. In Proceedings of the Recommender Systems Challenge 2020, RecSys
Challenge 2020. ACM, 29–33.
Wes Felter, Alexandre Ferreira, Ram Rajamony, and Juan Rubio. 2015. An updated
performance comparison of virtual machines and Linux containers. In IEEE
International Symposium on Performance Analysis of Systems and Software, ISPASS
2015. IEEE, 171–172.
Domenico Grimaldi, Valerio Persico, Antonio Pescapè, Alessandro Salvi, and
Stefania Santini. 2015. A Feedback-Control Approach for Resource Management
in Public Clouds. In Proceedings of the IEEE Global Communications Conference
2015, GLOBECOM 2015. IEEE, 1–7.
Songtao Guo, Bin Xiao, Yuanyuan Yang, and Yang Yang. 2016. Energy-efficient
dynamic offloading and resource scheduling in mobile cloud computing. In Proceedings
of the 35th Annual IEEE International Conference on Computer Communications,
INFOCOM 2016. IEEE, 1–9.
Akhil Gupta and Rakesh Kumar Jha. 2015. A Survey of 5G Network: Architecture
and Emerging Technologies. IEEE Access 3 (2015), 1206–1232.
Congfeng Jiang, Xiaolan Cheng, Honghao Gao, Xin Zhou, and Jian Wan. 2019.
Toward Computation Offloading in Edge Computing: A Survey. IEEE Access 7 (2019).
Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-che Tsai, Anurag Khan-
delwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Jayant
Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Pat-
terson. 2019. Cloud Programming Simplified: A Berkeley View on Serverless
Computing. CoRR abs/1902.03383 (2019), 1–35.
Patrick Kalmbach, Andreas Blenk, Wolfgang Kellerer, Rastin Pries, Michael
Jarschel, and Marco Hoffmann. 2019. GPU Accelerated Planning and Place-
ment of Edge Clouds. In Proceedings of the International Conference on Networked
Systems 2019, NetSys 2019. IEEE, 1–3.
Jitendra Kumar and Ashutosh Kumar Singh. 2018. Workload prediction in cloud
using artificial neural network and adaptive differential evolution. Future Gener.
Comput. Syst. 81 (2018), 41–52.
Irian Leyva-Pupo, Alejandro Santoyo-González, and Cristina Cervelló-Pastor.
2019. A Framework for the Joint Placement of Edge Service Infrastructure and
User Plane Functions for 5G. Sensors 19, 18 (2019), 3975.
Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang. 2010. CloudCmp:
comparing public cloud providers. In Proceedings of the 10th ACM SIGCOMM
Internet Measurement Conference, IMC 2010. 1–14.
Ping-Min Lin and Alex Glikson. 2019. Mitigating Cold Starts in Serverless
Platforms: A Pool-Based Approach. CoRR abs/1903.12221 (2019), 1–5.
Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md. Enamul Haque,
Lingjia Tang, and Jason Mars. 2018. The Architectural Implications of Au-
tonomous Driving: Constraints and Acceleration. In Proceedings of the 23rd In-
ternational Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS 2018. ACM, 751–766.
Wei-Tsung Lin, Chandra Krintz, and Rich Wolski. 2018. Tracing Function Depen-
dencies across Clouds. In Proceedings of the 11th IEEE International Conference on
Cloud Computing, CLOUD 2018. IEEE, 253–260.
Chunhong Liu, Chuanchang Liu, Yanlei Shang, Shiping Chen, Bo Cheng, and
Junliang Chen. 2017. An adaptive prediction approach based on workload pattern
discrimination in the cloud. J. Netw. Comput. Appl. 80 (2017), 35–44.
Shaoshan Liu, Liangkai Liu, Jie Tang, Bo Yu, Yifan Wang, and Weisong Shi.
2019. Edge Computing for Autonomous Driving: Opportunities and Challenges.
Proceedings of the IEEE 107, 8 (2019), 1697–1716.
Omogbai Oleghe. 2021. Container Placement and Migration in Edge Computing:
Concept and Scheduling Models. IEEE Access 9 (2021), 68028–68043.
Quoc-Viet Pham, Fang Fang, Vu Nguyen Ha, Md. Jalil Piran, Mai Le, Long Bao
Le, Won-Joo Hwang, and Zhiguo Ding. 2020. A Survey of Multi-Access Edge
Computing in 5G and Beyond: Fundamentals, Technology Integration, and State-
of-the-Art. IEEE Access 8 (2020), 116974–117017.
 Jon Postel. 1981. Internet Control Message Protocol. RFC 777 (1981), 1–14.
 Konstantinos Poularakis, Jaime Llorca, Antonia Maria Tulino, Ian J. Taylor, and
Leandros Tassiulas. 2019. Joint Service Placement and Request Routing in Multi-
cell Mobile Edge Computing Networks. In Proceedings of the IEEE Conference on
Computer Communications, INFOCOM 2019. IEEE, 10–18.
Peter-Christian Quint and Nane Kratzke. 2018. Towards a Lightweight Multi-
Cloud DSL for Elastic and Transferable Cloud-native Applications. In Proceedings
of the 8th International Conference on Cloud Computing and Services Science,
CLOSER 2018. SciTePress, 400–408.
Gourav Rattihalli, Madhusudhan Govindaraju, Hui Lu, and Devesh Tiwari. 2019.
Exploring Potential for Non-Disruptive Vertical Auto Scaling and Resource Esti-
mation in Kubernetes. In Proceedings of the 12th IEEE International Conference on
Cloud Computing, CLOUD 2019. IEEE, 33–40.
Krzysztof Rzadca, Pawel Findeisen, Jacek Swiderski, Przemyslaw Zych, Przemys-
law Broniek, Jarek Kusmierek, Pawel Nowak, Beata Strack, Piotr Witusowski,
Steven Hand, and John Wilkes. 2020. Autopilot: workload autoscaling at Google.
In Proceedings of the 15th EuroSys Conference 2020, EuroSys 2020, Heraklion, Greece,
April 27-30, 2020. ACM, 16:1–16:16.
Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmark-
ing State-of-the-Art Deep Learning Software Tools. In Proceedings of the 7th
International Conference on Cloud Computing and Big Data, CCBD 2016. IEEE.
Paulo Silva, Daniel Fireman, and Thiago Emmanuel Pereira. 2020. Prebaking Func-
tions to Warm the Serverless Cold Start. In Proceedings of the 21st International
Middleware Conference, Middleware 2020. ACM, 1–13.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian
Optimization of Machine Learning Algorithms. In Proceedings of the 26th Annual
Conference on Neural Information Processing Systems, NIPS 2012. 2960–2968.
Piyush Subedi, Jianwei Hao, In Kee Kim, and Lakshmish Ramaswamy. 2021.
AI Multi-Tenancy on Edge: Concurrent Deep Learning Model Executions and
Dynamic Model Placements on Edge Devices. In Proceedings of the 14th IEEE
International Conference on Cloud Computing, CLOUD 2021. IEEE, 31–42.
Xiang Sun and Nirwan Ansari. 2016. PRIMAL: PRofIt Maximization Avatar
pLacement for mobile edge computing. In Proceedings of the IEEE International
Conference on Communications 2016, ICC 2016. IEEE, 1–6.
Haisheng Tan, Zhenhua Han, Xiang-Yang Li, and Francis C. M. Lau. 2017. On-
line job dispatching and scheduling in edge-clouds. In Proceedings of the IEEE
Conference on Computer Communications 2017, INFOCOM 2017. IEEE, 1–9.
Liang Tong, Yong Li, and Wei Gao. 2016. A hierarchical edge cloud architecture for
mobile computing. In Proceedings of the 35th Annual IEEE International Conference
on Computer Communications, INFOCOM 2016. IEEE, 1–9.
Abhishek Verma, Hussam Qassim, and David Feinzimer. 2017. Residual squeeze
CNDS deep learning CNN model for very large scale places image recognition. In
Proceedings of the 8th IEEE Annual Ubiquitous Computing, Electronics and Mobile
Communication Conference, UEMCON 2017. IEEE, 463–469.
Bin Wang, Ahmed Ali-Eldin, and Prashant J. Shenoy. 2021. LaSS: Running
Latency Sensitive Serverless Computations at the Edge. In Proceedings of the 30th
International Symposium on High-Performance Parallel and Distributed Computing,
HPDC 2021. ACM, 239–251.
Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael M.
Swift. 2018. Peeking Behind the Curtains of Serverless Platforms. In Proceedings of
the USENIX Annual Technical Conference, USENIX ATC 2018. USENIX Association.
Claes Wohlin, Martin Höst, and Kennet Henningsson. 2006. Empirical Research
Methods in Web and Software Engineering. In Web Engineering. Springer, 409–
Jierui Xie, Boleslaw K. Szymanski, and Xiaoming Liu. 2011. SLPA: Uncovering
Overlapping Communities in Social Networks via a Speaker-Listener Interaction
Dynamic Process. In Proceedings of the IEEE 11th International Conference on Data
Mining Workshops, (ICDMW) 2011. IEEE, 344–349.
Lenar Yazdanov and Christof Fetzer. 2014. Lightweight Automatic Resource
Scaling for Multi-tier Web Applications. In Proceedings of the 7th International
Conference on Cloud Computing, CLOUD 2014. IEEE, 466–473.
Xu Zhang, Hao Chen, Yangchao Zhao, Zhan Ma, Yiling Xu, Haojun Huang, Hao
Yin, and Dapeng Oliver Wu. 2019. Improving Cloud Gaming Experience through
Mobile Edge Computing. IEEE Wirel. Commun. 26, 4 (2019), 178–183.
Ao Zhou, Shangguang Wang, Shaohua Wan, and Lianyong Qi. 2020. LMM:
latency-aware micro-service mashup in mobile edge computing environment.
Neural Comput. Appl. 32, 19 (2020), 15411–15425.
Qian Zhu and Gagan Agrawal. 2012. Resource Provisioning with Budget Con-
straints for Adaptive Applications in Cloud Environments. IEEE Transactions
on Services Computing 5, 4 (2012), 497–511.