NEPTUNE: Network- and GPU-aware Management
of Serverless Functions at the Edge
Luciano Baresi, Davide Yi Xian Hu, Giovanni Quattrocchi, Luca Terracciano
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Milan, Italy
{name.surname}@polimi.it
ABSTRACT
Nowadays a wide range of applications is constrained by low-latency requirements that cloud infrastructures cannot meet. Multi-access Edge Computing (MEC) has been proposed as the reference architecture for executing applications closer to users and reducing latency, but new challenges arise: edge nodes are resource-constrained, the workload can vary significantly since users are nomadic, and task complexity is increasing (e.g., machine learning inference). To overcome these problems, the paper presents NEPTUNE, a serverless-based framework for managing complex MEC solutions. NEPTUNE i) places functions on edge nodes according to user locations, ii) avoids the saturation of single nodes, iii) exploits GPUs when available, and iv) allocates resources (CPU cores) dynamically to meet foreseen execution times. A prototype, built on top of K3S, was used to evaluate NEPTUNE on a set of experiments that demonstrate a significant reduction in terms of response time, network overhead, and resource consumption compared to three well-known approaches.
CCS CONCEPTS
• Theory of computation → Scheduling algorithms; • Computing methodologies → Distributed computing methodologies; • Computer systems organization → Distributed architectures.
KEYWORDS
serverless, edge computing, gpu, placement, dynamic resource allocation, control theory
ACM Reference Format:
Luciano Baresi, Davide Yi Xian Hu, Giovanni Quattrocchi, Luca Terracciano. 2022. NEPTUNE: Network- and GPU-aware Management of Serverless Functions at the Edge. In 17th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS '22), May 18–23, 2022, PITTSBURGH, PA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3524844.3528051
1 INTRODUCTION
Multi-access Edge Computing (MEC) [30] has emerged as a new distributed architecture for running computations at the edge of the
network and reducing latency compared to cloud executions. Differently from cloud computing, which is characterized by a virtually infinite amount of resources placed on large data centers, MEC infrastructures are based on geo-distributed networks of resource-constrained nodes (e.g., 5G base stations) that serve requests and process data close to the users.
The rise of edge computing [17], also fostered by the advent of 5G networks, enables the creation of applications with extremely low latency requirements like autonomous driving [28], VR/AR [10], and mobile gaming [49] systems. According to Li et al. [23], the average network delay from 260 locations to the nearest Amazon EC2 availability zone is approximately 74ms. This makes meeting tight response time requirements in the cloud nearly impossible. In use-cases like obstacle detection, response times of a few hundreds of milliseconds are required [25] and thus the network delay must be lower than the one offered by cloud-based solutions. Many mobile devices would allow these computations to be executed on the device itself, but this is not always possible given the inherent complexity of some tasks (e.g., machine learning-based ones) and the need for limiting resource consumption (e.g., to avoid battery draining).
An important challenge of edge computing is that clients usually produce highly dynamic workloads since they move among different areas (e.g., self-driving vehicles) and the amount of traffic in a given region can rapidly escalate (e.g., users moving towards a stadium for an event). To tackle these cases, solutions that scale resources (i.e., virtual machines and containers [14]) automatically according to the workload have been extensively investigated in the context of cloud computing, ranging from approaches based on rules [12, 48] and machine learning [27, 51] to those based on time-series analysis [35]. These solutions assume that (new) resources are always available and that nodes are connected through a low-latency, high-bandwidth network. At the edge, these assumptions are not valid anymore and some ad-hoc solutions have been presented in the literature. For example, Ascigil et al. [1] and Wang et al. [44] propose solutions for service placement on resource-constrained nodes, while Poularakis et al. [32] focus on request routing and load balancing at the edge.
Approaches that focus on service placement or request routing for MEC aim to maximize the throughput of edge nodes, but comprehensive solutions that address placement, routing, and minimal delays at the same time are still work in progress. In addition, the tasks at the edge, like AI-based computations, are becoming heavier and heavier. The use of GPUs [9] can be fundamental to accelerate these computations, but they are seldom taken into account explicitly [6, 20, 39].
To tame all these problems, this paper presents NEPTUNE, a comprehensive framework for the runtime management of large-scale edge applications that exploits placement, routing, network delays, and CPU/GPU interplay in a coordinated way to allow for the concurrent execution of edge applications that meet user-set response times. NEPTUNE uses the serverless paradigm [19] to let service providers deploy and execute latency-constrained functions without managing the underlying infrastructure. NEPTUNE uses Mixed Integer Programming (MIP) to allocate functions on nodes, minimize network delays, and exploit GPUs if available. It uses lightweight control theoretical planners to allocate CPU cores dynamically to execute the remaining functions.
NEPTUNE is then holistic and self-adaptive. It addresses edge nodes, function placement, request routing, available resources, and hardware accelerators in a single and coherent framework. Users only provide functions and foreseen response times, and then the system automatically probes available nodes as well as the locality and intensity of workloads and reacts autonomously. While other approaches (see Section 5) only focus on a single or few aspects and can only be considered partial solutions, NEPTUNE tackles all of them and oversees the whole lifecycle from deployment to runtime management.
We built a prototype that extends K3S¹, a popular tool for container orchestration at the edge, to evaluate NEPTUNE on a set of experiments executed on a geo-distributed cluster of virtual machines provisioned on Amazon Web Services (AWS). To provide a comprehensive and meaningful set of experiments, we assessed NEPTUNE against three realistic benchmark applications, including one that can be accelerated with GPUs. The comparison revealed 9.4 times fewer response time violations, and 1.6 and 17.8 times improvements as for resource consumption and network delays, respectively.
NEPTUNE builds up from PAPS [4], from which it inherits part of the design. Compared to PAPS, this paper provides i) a new placement and routing algorithm that exploits resources more efficiently and minimizes service disruption, ii) the support of GPUs, and iii) an empirical evaluation of the approach (PAPS was only simulated).
The rest of the paper is organized as follows. Section 2 explains the problem addressed by NEPTUNE and provides an overview of the solution. Section 3 presents how NEPTUNE tackles placement, routing, and GPU/CPU allocation. Section 4 shows the assessment we carried out to evaluate NEPTUNE. Section 5 presents some related work, and Section 6 concludes the paper.
2 NEPTUNE
The goal of NEPTUNE is to allow for the execution of multiple concurrent applications on a MEC infrastructure with user-set response times and to optimize the use of available resources. NEPTUNE must be able to take into consideration the main aspects of edge computing: resource-constrained nodes, limited network infrastructure, highly fluctuating workloads, and strict latency requirements.
A MEC infrastructure is composed of a set $N$ of distributed nodes that allow clients to execute a set $A$ of applications. Edge clients (e.g., self-driving cars, mobile phones, or VR/AR devices) access applications placed on MEC nodes. Each node comes with cores, memory, and maybe GPUs, along with their memory. Because of user mobility, the workload on each node can vary frequently, and resource limitations do not always allow each node to serve all the requests it receives; some requests must be outsourced to nearby nodes.

¹ https://k3s.io

Figure 1: MEC Topology.
Given a request $r$ for an application $a \in A$, its response time $RT$ is measured as the time required to transmit $r$ from the client to the closest node $i$, execute the request, and receive a response. More formally, $RT$ is defined as $RT = E + Q + D$, where $E$ (execution time) represents the time taken for running $a$, $Q$ (queue time) is the time spent by $r$ waiting for being managed, and $D$ is the network delay (or network latency). In particular, as shown in Figure 1, $D$ is the sum of the (round trip) time needed by $r$ to reach $i$ ($y_i$) and, if needed, the (round trip) time needed to outsource the computation on a nearby node $j$ ($\delta_{i,j}$). NEPTUNE handles requests once they enter the MEC topology and assumes $y_i$ to be optimized by existing protocols [11].
NEPTUNE is then location-aware since it considers the geographical distribution of both nodes and workloads. It stores a representation of the MEC network by measuring the inter-node delays, and it monitors where and how many requests are generated by users. Furthermore, NEPTUNE allows users to put a threshold (service level agreement) on the response time provided by each application ($RT^R_a$).
2.1 Solution overview
NEPTUNE requires that an application $a$ be deployed as a set $F_a$ of functions, as prescribed by the serverless paradigm. Developers focus on application code without the burden of managing the underlying infrastructure. Each function covers a single functionality and supplies a single or a small set of REST endpoints. The result is more flexible and faster to scale compared to traditional architectures (e.g., monoliths or microservices).
Besides the function's code, NEPTUNE requires that each function be associated with a user-provided required response time $RT^R_f$ (with $RT^R_f \leq RT^R_a$) and the memory $m^{CPU}_f$ needed to properly execute it. In case of GPU-accelerated functions, the GPU memory $m^{GPU}_f$ must also be specified. Then, NEPTUNE manages both its
deployment by provisioning one or more function instances, and its operation with the goal of fulfilling the set response time.
NEPTUNE manages functions through a three-level control hierarchy: Topology, Community, and Node.
Since function placement is NP-hard [32], the main goal of the Topology level is to tackle the complexity for the lower levels by splitting the topology into communities of closely-located nodes. Each community is independent of the others and a request can only be managed within the community of the node that received it. If this is not possible, then the community is undersized and the Topology level must reconfigure the communities.
The Topology level employs a single controller based on the Speaker-listener Label Propagation Algorithm (SLPA), proposed by Xie et al. [47], to create the communities. SLPA has a complexity of $O(t \cdot N)$, where $N$ is the number of distributed nodes and $t$ is the user-defined maximum number of iterations. Since the complexity scales linearly with the number of nodes, this solution has proven to be suitable also for large clusters [4].
Given a maximum community size $MCS$ and the maximum allowed network delay $\Delta$, SLPA splits the topology in a set of communities with a number of nodes that is lower than $MCS$ and inter-node delays smaller than $\Delta$ (i.e., $\delta_{i,j} \leq \Delta$ for all nodes $i$ and $j$ in the community). SLPA could potentially assign a node to multiple communities, but to avoid resource contention, NEPTUNE re-allocates shared nodes, if needed, to create non-overlapping communities.
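For illustration, the following Python sketch shows how a topology could be partitioned into communities that respect a maximum size and a maximum inter-node delay. This is not the authors' implementation: NEPTUNE relies on SLPA, whereas this simplified greedy variant (function name and data layout are assumptions) only conveys the constraints that the resulting non-overlapping communities must satisfy.

```python
from typing import Dict, List, Tuple

def split_into_communities(
    nodes: List[str],
    delay: Dict[Tuple[str, str], float],  # symmetric inter-node delays (ms)
    mcs: int,                             # maximum community size (MCS)
    max_delay: float,                     # maximum allowed inter-node delay (Delta)
) -> List[List[str]]:
    """Greedy, non-overlapping split: each community holds at most `mcs` nodes
    and every pair of nodes in it is within `max_delay` of each other."""
    communities: List[List[str]] = []
    for node in nodes:
        placed = False
        for community in communities:
            fits = len(community) < mcs and all(
                delay[(node, member)] <= max_delay for member in community
            )
            if fits:
                community.append(node)
                placed = True
                break
        if not placed:
            communities.append([node])  # start a new community
    return communities

# Example: two close nodes end up together, a distant one gets its own community.
delays = {("n1", "n2"): 5.0, ("n2", "n1"): 5.0,
          ("n1", "n3"): 40.0, ("n3", "n1"): 40.0,
          ("n2", "n3"): 42.0, ("n3", "n2"): 42.0}
print(split_into_communities(["n1", "n2", "n3"], delays, mcs=2, max_delay=10.0))
# -> [['n1', 'n2'], ['n3']]
```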
At Community level, each community is equipped with a MIP-based controller in charge of managing function instances. The controller places function instances on the different nodes of the community by first considering those that could exploit GPUs, if available. The goal is to minimize network delay by dynamically deploying function instances close to where the demand is generated and, at the same time, to minimize the time spent to forward requests when needed.
The Community level computes routing policies i) to allow each node to forward part of the workload to other close nodes, and ii) to prioritize computation-intensive functions by forwarding requests to GPUs up to their full utilization, and then send the remaining requests to CPUs. To avoid saturating single nodes, the Community level can also scale function instances horizontally, that is, it can replicate them on nearby nodes. For example, if a node $i$ cannot offer enough resources to execute an instance of $f$ to serve workload $\lambda_{f,i}$, the Community level creates a new instance (of $f$) on a node $j$ close to $i$, and the requests that cannot be served by $i$ are forwarded to $j$.
While the rst two control levels take care of network latency
(
𝐷
), once the requests arrive at the node that processes them, the
Node level ensures that function instances have the needed amount
of cores to meet set response times. Each function instance is man-
aged by a dedicated Proportional Integral (PI) controller that pro-
vides vertical scaling, that is, it adds, or removes, CPU cores to
the function (container). Unlike other approches [
2
,
34
], NEPTUNE
can recongure CPU cores without restarting function instances,
that is, without service disruption. Figure 2 shows that, given a set
point (i.e., the desired response time), the PI controller periodically
monitors the performance of its function instance and dynamically
allocates CPU cores to optimize both execution time
𝐸
and queue
time 𝑄.
Figure 2: NEPTUNE.
Also, the controllers at Node level are independent of each other: each controller is not aware of how many cores the others would like to allocate. Therefore, since the sum of the requests might exceed the capacity of a node, a Contention Manager solves resource contentions by proportionally scaling down requested allocations.
Every control level is meant to work independently of the others, and no communication is required between controllers operating at the same level. This means that NEPTUNE can easily scale.
The three control levels operate at different frequencies, yet cohesively, to eliminate potential interference. Thanks to fast PI controllers and vertical scaling, the Node level operates every few seconds to handle workload bursts, whereas the Community level computes function placements and routing policies in the order of minutes and allows the Node level to fully exploit the underlying resources. The Topology level runs at longer intervals, but it can react faster when there are changes in the network, such as when a node is added or removed.
A real-time monitoring infrastructure is the only communication channel across the levels. It allows NEPTUNE to gather the performance metrics (e.g., response times, core allocations per function instance, network delays) needed by the three control levels to properly operate. Note that the controllers at Community and Node levels measure device performance in real-time without any a-priori assumption. This means that NEPTUNE can also manage heterogeneous CPUs with different performance levels (e.g., different types of virtual machines).
Figure 2 shows a MEC topology controlled by the three-level hierarchical control adopted by NEPTUNE. Control components are depicted in grey. First of all, the Topology level splits the network into four communities. Each community is handled by its Community level controller, responsible for function placement and request
routing. The figure reports a detailed view of Community 1. We can observe that i) a set of functions ($f_a$, $f_b$, $f_c$, $f_d$) is placed across three nodes and ii) each node is provided with its set of routing policies. In particular, the routing policies for Node N1 enforce that 70% of the requests for $f_a$ are served by the node itself, while the remaining 30% is forwarded to the instance, running on Node N3, that exploits a GPU. Finally, the figure shows the materialization of the Node level on Node N3. Each function instance ($f_a$, $f_d$) is vertically scaled by a dedicated PI controller, and resource contentions are solved by the Contention Manager.
3 PLACEMENT, ROUTING, AND ALLOCATION
As explained above, the Topology level partitions the network in a set of communities using the SLPA algorithm. Each community is controlled independently from the others. The main goal of the Community level controller is to dynamically place function executions as close to where the workload is generated as possible. A trivial solution, which could drastically reduce network delay, would be to replicate the whole set of functions on each node. Since nodes have limited resources, this approach is often not feasible.
An effective placement solution should not saturate nodes and should allow for stable placement, that is, it should avoid the disruption implied by keeping migrating functions among nodes². At the same time, the placement cannot be fully static since users move and requirements change.
Each placement effort should also consider graceful termination periods and cold starts. The former is the user-defined amount of time we must wait whenever a function instance is to be deleted, to let it finish serving pending requests. The latter is a delay that can range from seconds to minutes and that affects newly created instances [45]. Before a function instance starts serving requests, a container must be started and the execution environment initialized. Diverse approaches, like container pooling [24, 37], can mitigate cold starts, but their efficiency depends on the functions at hand and they cannot always reduce cold starts in a significant way. This is why NEPTUNE does not exploit these solutions natively.
To address these problems, the Community level adopts two similar instances of a 2-step optimization process, based on Mixed Integer Programming, to allocate GPUs first and then CPUs. In both cases, the first step aims to find the best function placement and routing policy that minimize the overall network delay. Then, since diverse placements with a network delay close to the optimal one may require different changes in function deployment, the second step is in charge of choosing a placement that minimizes deployment actions, that is, disruption.
Table 1 summarizes the inputs NEPTUNE requires from the user, the characteristics of the nodes, the values gathered by the monitoring infrastructure, and the decision variables adopted in the MIP formulation.
3.1 Function placement
Each time the Community level is activated, the 2-step optimization process is executed twice. The first execution aims to fully utilize GPU resources, while the second only considers CPUs and the remaining workload to be handled.

² NEPTUNE does not handle application state migration.

Table 1: Inputs, data, and decision variables.

Inputs
  $m^{CPU}_f$        Memory required by function $f$
  $m^{GPU}_f$        GPU memory required by function $f$
  $\phi_f$           Maximum allowed network delay for function $f$
Infrastructure data
  $M^{CPU}_j$        Memory available on node $j$
  $M^{GPU}_j$        GPU memory available on node $j$
  $U^{CPU}_j$        CPU cores on node $j$
  $U^{GPU}_j$        GPU cores on node $j$
Monitored data
  $\delta_{i,j}$     Network delay between nodes $i$ and $j$
  $O_{best}$         Objective function value found after step 1
  $\lambda_{f,i}$    Incoming $f$ requests to node $i$
  $u^{CPU}_{f,j}$    Average CPU cores used by node $j$ per single $f$ request
  $u^{GPU}_{f,j}$    Average GPU cores used by node $j$ per single $f$ request
Decision variables
  $x^{CPU}_{f,i,j}$  Fraction of $f$ requests sent to CPU instances from node $i$ to $j$
  $c^{CPU}_{f,j}$    1 if a CPU instance of $f$ is deployed on node $j$, 0 otherwise
  $x^{GPU}_{f,i,j}$  Fraction of $f$ requests sent to GPU instances from node $i$ to $j$
  $c^{GPU}_{f,j}$    1 if a GPU instance of $f$ is deployed on node $j$, 0 otherwise
  $MG_f$             Number of $f$ migrations
  $CR_f$             Number of $f$ creations
  $DL_f$             Number of $f$ deletions

Since the two executions are similar, the formulation presented herein is generalized. Some of the employed data are resource-specific (e.g., $x_{f,i,j}$, $m_f$): Table 1 differentiates them with a $CPU$ or $GPU$ superscript, while in the rest of this section the superscripts are omitted for simplicity.
Network delay minimization. The first step aims to place function instances and to find routing policies that minimize the overall network delay $D$ in a given community $C \subseteq N$.
The formulation employs two decision variables: $x_{f,i,j}$ and $c_{f,j}$. The former ($x_{f,i,j} \in [0:1]$) represents the amount of incoming $f$ requests³ ($\lambda_{f,i}$) that node $i$ forwards to node $j$ (i.e., routing policies). The latter ($c_{f,j}$) is a boolean variable that is $true$ if an $f$ instance is deployed onto node $j$ (i.e., placement).
The objective function (Formula 1) minimizes the overall network delay of the incoming workload in $C$. Starting from the incoming $f$ requests to each node $i$ and the measured delay between nodes $i$ and $j$, it computes the fractions of outsourced requests to minimize the overall network delay:

$$\min \sum_{f \in F} \sum_{i \in C} \sum_{j \in C} x_{f,i,j} \, \lambda_{f,i} \, \delta_{i,j} \quad (1)$$

³ For the sake of brevity, we use $f$ request to mean a request generated for function $f$, and $f$ instance to refer to an instance of function $f$.
If we only considered inter-node delays ($\delta_{i,j}$), we would minimize the overall network delay only if incoming requests were distributed evenly (i.e., each node manages the same amount of requests). Since workloads can be very different, the addition of per-node incoming requests ($\lambda_{f,i}$) gives a more appropriate formulation of the problem. Intuitively, the higher the workload in a specific area is, the more important the minimization of network delay becomes.
In addition to the function to minimize, we must add some constraints. First, requests cannot be forwarded too far from where they are generated. Each function is characterized by parameter $\phi_f$, which sets the maximum allowed network delay of each $f$ request:

$$x_{f,i,j} \, \delta_{i,j} \leq x_{f,i,j} \, \phi_f \quad \forall i, j \in C,\ f \in F \quad (2)$$

Second, the nodes that receive forwarded requests must have a function instance that can serve them:

$$c_{f,j} = \mathit{if}\ \Big(\sum_{i \in C} x_{f,i,j} > 0\Big)\ 1\ \mathit{else}\ 0 \quad \forall j \in C,\ f \in F \quad (3)$$

Third, the overall memory required by the functions ($m_f$) placed on a node must not exceed its capacity $M_j$:

$$\sum_{f \in F} c_{f,j} \, m_f \leq M_j \quad \forall j \in C \quad (4)$$

Fourth, to avoid resource contentions, routing policies must consider the overall amount of GPU or CPU cores available in the node ($U_j$) and the average GPU or CPU cores consumption for each $f$ request processed on node $j$ ($u_{f,j}$):

$$\sum_{i \in C} \sum_{f \in F} x_{f,i,j} \, \lambda_{f,i} \, u_{f,j} \leq U_j \quad \forall j \in C \quad (5)$$

Fifth, routing policies must be defined for all the nodes in the community and all the functions of interest:

$$\sum_{j \in C} x_{f,i,j} = 1 \quad \forall i \in C,\ f \in F \quad (6)$$

Note that when $i = j$, $x_{f,i,j}$ gives the fraction of $f$ requests executed locally, that is, on $i$ itself.
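To make the formulation concrete, the sketch below expresses the first optimization step with the PuLP modeling library. This is an assumption for illustration only: the paper does not state which solver or library the prototype uses, and the data are toy values. Constraint (3) is encoded with the standard linear coupling $x_{f,i,j} \leq c_{f,j}$, which forces an instance to be deployed on any node that receives traffic.

```python
# Illustrative sketch of the first optimization step (network delay minimization).
import pulp

F = ["f1"]                      # functions
C = ["n1", "n2"]                # nodes in the community
delta = {("n1", "n1"): 0.0, ("n1", "n2"): 8.0,
         ("n2", "n1"): 8.0, ("n2", "n2"): 0.0}    # inter-node delays (ms)
lam = {("f1", "n1"): 100.0, ("f1", "n2"): 10.0}   # incoming requests per node
phi = {"f1": 10.0}                                # max allowed delay per function
m = {"f1": 256.0}                                 # memory per instance (MB)
M = {"n1": 512.0, "n2": 512.0}                    # node memory capacity (MB)
u = {("f1", "n1"): 0.02, ("f1", "n2"): 0.02}      # average cores per request
U = {"n1": 4.0, "n2": 4.0}                        # cores per node

prob = pulp.LpProblem("delay_minimization", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (F, C, C), lowBound=0, upBound=1)   # routing fractions
c = pulp.LpVariable.dicts("c", (F, C), cat="Binary")               # placement

# Formula 1: minimize the workload-weighted network delay.
prob += pulp.lpSum(x[f][i][j] * lam[(f, i)] * delta[(i, j)]
                   for f in F for i in C for j in C)

for f in F:
    for i in C:
        prob += pulp.lpSum(x[f][i][j] for j in C) == 1                 # (6) route all traffic
        for j in C:
            prob += x[f][i][j] * delta[(i, j)] <= x[f][i][j] * phi[f]  # (2) delay bound
            prob += x[f][i][j] <= c[f][j]                              # (3) only to placed instances
for j in C:
    prob += pulp.lpSum(c[f][j] * m[f] for f in F) <= M[j]              # (4) memory capacity
    prob += pulp.lpSum(x[f][i][j] * lam[(f, i)] * u[(f, j)]
                       for f in F for i in C) <= U[j]                  # (5) core capacity

prob.solve(pulp.PULP_CBC_CMD(msg=False))
O_best = pulp.value(prob.objective)   # reused by the second step (Formula 7)
```

The resulting $O_{best}$ is exactly the quantity that the second step bounds through Formula 7.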
This optimization problem finds the best placement with the minimum network delay. However, each iteration (execution of the optimization problem) may suggest a placement that requires many disruptive operations (i.e., deletions, creations, and migrations) with respect to the previous placement (iteration). For this reason, a second step is used to minimize service disruption and ameliorate the result.
Disruption minimization. The second step searches for a function placement that minimizes function creation, deletion, and migration with an overall network delay close to the optimal one found by the first step.
This means that the second step keeps the constraints defined in Formulae 2-6 and adds:

$$\sum_{f \in F} \sum_{i \in C} \sum_{j \in C} x_{f,i,j} \, \lambda_{f,i} \, \delta_{i,j} \leq O_{best} \cdot (1 + \epsilon) \quad (7)$$

to impose that the final placement must be in the interval $[O_{best}, O_{best} \cdot (1 + \epsilon)]$, where $O_{best}$ is the smallest network delay found after the first step, and $\epsilon$ is an arbitrarily small parameter that quantifies the worsening in terms of network overhead. For example, $\epsilon = 0.05$ means a worsening of up to 5%.
We also consider the number of created, deleted, and migrated $f$ instances between two subsequent executions of the 2-step optimization process, that is, between the to-be-computed placement ($c_{f,i}$) and the current one ($c^{old}_{f,i}$). $DL_f$ and $CR_f$ denote, respectively, the maximum between 0 and the removed (added) $f$ instances between the two iterations:

$$DL_f = \sum_{i \in C} \max(c^{old}_{f,i} - c_{f,i}, 0) \quad \forall f \in F$$
$$CR_f = \sum_{i \in C} \max(c_{f,i} - c^{old}_{f,i}, 0) \quad \forall f \in F \quad (8)$$

The number of migrations (in the new placement), that is, the number of instances that have been moved from one node to another, is computed as the minimum between instance creations $CR_f$ and instance deletions $DL_f$:

$$MG_f = \min(CR_f, DL_f) \quad \forall f \in F \quad (9)$$

The new objective function is then defined as:

$$\min \sum_{f \in F} \left( MG_f + \frac{1}{DL_f + 2} + \frac{1}{CR_f + 2} \right) \quad (10)$$

The goal of the objective function is to minimize the number of migrations ($MG_f$), since deletions and creations are necessary to avoid over- and under-provisioning. Factors $\frac{1}{DL_f + 2}$ and $\frac{1}{CR_f + 2}$, which are always lower than 1, allow us to discriminate among solutions with the same amount of migrations but a different number of creations and deletions.
This formulation ensures close-to-optimal network delays, along with the minimum number of instances, to serve the current workload. The controllers at Node level are then entitled to scale executors vertically as needed (see next Section).
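As an illustration of Formulas 8-10, the hypothetical helper below (not taken from the paper) computes the disruption score of a candidate placement against the current one. In NEPTUNE this objective is optimized by the MIP solver; here it is merely evaluated a posteriori to show what the formulas measure.

```python
from typing import Dict, List, Tuple

def disruption_objective(
    functions: List[str],
    nodes: List[str],
    c_new: Dict[Tuple[str, str], int],   # candidate placement: 1 if f on node j
    c_old: Dict[Tuple[str, str], int],   # current placement
) -> float:
    """Value of Formula 10 for a candidate placement (Formulas 8 and 9 inlined)."""
    total = 0.0
    for f in functions:
        # Formula 8: instances removed (DL) and added (CR) w.r.t. the old placement.
        dl = sum(max(c_old[(f, j)] - c_new[(f, j)], 0) for j in nodes)
        cr = sum(max(c_new[(f, j)] - c_old[(f, j)], 0) for j in nodes)
        # Formula 9: a migration is a paired creation + deletion.
        mg = min(cr, dl)
        # Formula 10: prefer fewer migrations, then fewer deletions/creations.
        total += mg + 1.0 / (dl + 2) + 1.0 / (cr + 2)
    return total

# Example: f1 moves from n1 to n2 (one migration), f2 is newly created on n1.
old = {("f1", "n1"): 1, ("f1", "n2"): 0, ("f2", "n1"): 0, ("f2", "n2"): 0}
new = {("f1", "n1"): 0, ("f1", "n2"): 1, ("f2", "n1"): 1, ("f2", "n2"): 0}
print(disruption_objective(["f1", "f2"], ["n1", "n2"], new, old))
```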
3.2 CPU allocation
The Node level is in charge of minimizing $QE$, that is, the handling time, defined as the sum of the execution time $E$ and the queue time $Q$; the network delay $D$ is already minimized by the Community level. $QE$ can vary due to many factors, such as variations in the workload or changes in the execution environment, and we aim to control it by changing the amount of CPU cores allocated to function instances. If this is not enough, the problem is lifted up to the Community level that re-calibrates the number of function instances.
Control theoretical approaches have proven to be an effective solution for the self-adaptive management of these resources [6, 15]. The Node level comprises a lightweight Proportional-Integral (PI)
controller for each function instance to scale allocated cores dynamically. PI controllers support fast control periods, have constant complexity, and provide formal design-time guarantees.
Each function instance is equipped with an independent PI controller. The control loop monitors the average value of $QE$, computes the allocation, and actuates it. More formally, given a desired set point $QE_{f,desired}$, the controller periodically measures the current value of $QE_{f,j}$ (controlled variable), that is, the actual value of $QE_f$ on node $j$, and computes the delta between desired and actual value. Note that, since the controllers will strive to keep $QE_{f,j}$ close to the set point $QE_{f,desired}$, this value should be set to a lower value than the desired $RT^R_f$.
The controller reacts to the error and recommends the new amount of cores that the function should use. Algorithm 1 describes the computation.

Algorithm 1 Node level CPU core allocation.
1: procedure ComputeInstanceCores($f$, $j$)
2:   $err := 1/QE_{f,desired} - 1/QE_{f,j}$;
3:   $cpu := getCPUAllocation(f, j)$;
4:   $int_{old} := cpu - g_{int} \cdot err_{old}$;
5:   $int := int_{old} + g_{int} \cdot err$;
6:   $err_{old} := err$;
7:   $prop := err \cdot g_{prop}$;
8:   $cpu := int + prop$;
9:   $cpu := \max(cpu_{min}, \min(cpu_{max}, cpu))$;
10: end procedure

Line 2 computes error $err$ as the difference between the inverse of $QE_{f,desired}$ and the inverse of $QE_{f,j}$. To compute the Integral contribution, the current core allocation ($cpu$) of the function instance is retrieved at line 3. The previous integral contribution $int_{old}$ is computed at line 4 by using the allocation, the integral gain $g_{int}$ (i.e., a tuning parameter), and the prior error $err_{old}$. The integral component $int$ is computed by multiplying the current error $err$ times the integral gain $g_{int}$, and by adding $int_{old}$ (line 5). The previous error $err_{old}$ is then updated at line 6.
The proportional contribution is computed by using $err$ and the proportional gain $g_{prop}$ at line 7. Finally, the new allocation is calculated as the sum of the two contributions (line 8) and then adjusted according to the maximum and minimum allowed core allocations, $cpu_{max}$ and $cpu_{min}$, respectively (line 9).
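A minimal Python sketch of the same PI loop follows. It is illustrative only: class and method names, gains, and bounds are assumptions, and the real controller runs inside the Node-level component rather than as a standalone class.

```python
class PIAllocator:
    """PI controller that recommends CPU cores for one function instance
    (mirrors Algorithm 1; gains and bounds are illustrative values)."""

    def __init__(self, qe_desired: float, g_prop: float, g_int: float,
                 cpu_min: float, cpu_max: float):
        self.qe_desired = qe_desired      # set point for QE (queue + execution time)
        self.g_prop = g_prop              # proportional gain
        self.g_int = g_int                # integral gain
        self.cpu_min = cpu_min
        self.cpu_max = cpu_max
        self.err_old = 0.0                # prior error (line 6 of Algorithm 1)

    def compute(self, qe_measured: float, cpu_current: float) -> float:
        err = 1.0 / self.qe_desired - 1.0 / qe_measured          # line 2
        int_old = cpu_current - self.g_int * self.err_old        # line 4
        integral = int_old + self.g_int * err                    # line 5
        self.err_old = err                                       # line 6
        prop = err * self.g_prop                                 # line 7
        cpu = integral + prop                                    # line 8
        return max(self.cpu_min, min(self.cpu_max, cpu))         # line 9

# One control step: QE is above the 100 ms set point, so more cores are suggested
# (and clamped to the maximum allowed allocation).
controller = PIAllocator(qe_desired=0.100, g_prop=10.0, g_int=5.0,
                         cpu_min=0.1, cpu_max=4.0)
print(controller.compute(qe_measured=0.150, cpu_current=1.0))
```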
Being independent of the others, these controllers are not aware of the available CPU cores and of the allocations computed by the other controllers. Therefore, the computed allocations (line 9) are not immediately applied since they could exceed the allowed capacity. The allocations of the function instances deployed on a node are processed by a Contention Manager (one per node), which is in charge of computing a feasible allocation. If the sum of the suggested allocations fits the allowed capacity, they are applied without any modification. Otherwise, they are scaled down proportionally. The Contention Manager can easily be extended to embed other, non-proportional heuristics to manage resource contention.
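The proportional scale-down can be illustrated with a few lines of Python (a sketch under the assumption that allocations are expressed in cores; the prototype's actual interface may differ):

```python
from typing import Dict

def resolve_contention(requested: Dict[str, float], capacity: float) -> Dict[str, float]:
    """Apply the requested core allocations as-is if they fit the node capacity,
    otherwise scale them down proportionally (one Contention Manager per node)."""
    total = sum(requested.values())
    if total <= capacity:
        return dict(requested)
    factor = capacity / total
    return {instance: cores * factor for instance, cores in requested.items()}

# Two instances ask for 3 + 2 = 5 cores on a 4-core node: both are scaled by 0.8.
print(resolve_contention({"f_a": 3.0, "f_d": 2.0}, capacity=4.0))
# -> {'f_a': 2.4, 'f_d': 1.6}
```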
4 EVALUATION
Implementation. We implemented a prototype⁴ of NEPTUNE built on top of K3S, a popular distribution of Kubernetes⁵ optimized for edge computing. Each control level is materialized in a dedicated component that exclusively uses native K3S APIs to manage deployed applications. Conversely to existing approaches (see Section 5), the prototype is capable of performing in-place vertical scaling of containers, that is, it can dynamically update the CPU cores allocated to the different containers without restarting the application.
The stable version of K3S does not allow one to change allocated resources without restarting function instances, a process that sometimes can take minutes. This could decrease the capability of the Node level to handle bursty workloads. For this reason, the prototype augments K3S with the Kubernetes Enhancement Proposal 1287, which implements In-Place Pod Vertical Scaling⁶ and allows resources to be changed without restarts. This enables faster control loops and better control quality.
To provide an effective usage of GPUs, the prototype uses nvidia-docker⁷, a container runtime that enables the use of GPUs within containers. However, by default, GPU access can only be reserved to one function instance at a time. This prevents the full exploitation of GPUs and limits the possible placements produced at Community level. To solve this problem, the prototype employs a device plugin⁸ developed by Amazon that enables the fractional allocation of GPUs. In particular, the plugin makes use of the Nvidia Multi Process Service⁹ (MPS), a runtime solution designed to transparently allow GPUs to be shared among multiple processes (e.g., containers).
Research questions. The solution adopted at the Topology level has been largely covered by PAPS [4]. The experiments in the paper focus on evaluating the Community and Node levels. The conducted evaluation addresses the following research questions:
RQ1 How does NEPTUNE handle workloads generated by mobile users at the edge?
RQ2 How does NEPTUNE perform compared to other state-of-the-art approaches?
RQ3 How does NEPTUNE use GPUs to speed up response times?
4.1 Experimental setup
Infrastructure. We conducted the experiments on a simulated
MEC topology with nodes provisioned as a cluster of AWS EC2 geo-
distributed virtual machines distributed across three areas. Each
area corresponds to a dierent AWS region: Area A to eu-west, Area
B to us-east, and Area C to us-west. Since communities are indepen-
dent, our experiments focused on evaluating dierent aspects of
NEPTUNE within a single community that included the three areas.
Figure 3 shows the average network delays between each pair
of areas and nodes computed as the round trip times of an ICMP
[
31
] (Internet Control Message Protocol) packet. Note that nodes
4Source code available at https://github.com/deib-polimi/edge- autoscaler.
5https://kubernetes.io/.
6
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-
place-update-pod-resources
7https://github.com/NVIDIA/nvidia-docker
8https://github.com/awslabs/aws-virtual-gpu-device-plugin
9https://docs.nvidia.com/deploy/mps/index.html
Figure 3: Network delay between areas.
Note that nodes of the same area were deployed onto different AWS availability zones to obtain significant network delays. Each area contained three worker nodes, and one node in Area A was GPU-empowered. These nodes were deployed as c5.xlarge instances (4 vCPUs, 8 GB memory); the one with a GPU used a g4dn.xlarge instance (4 vCPUs, 16 GB memory, 1 GPU). The master node (not depicted in the figure) was deployed on a c5.2xlarge instance (8 vCPUs, 16 GB memory).
NEPTUNE control periods. Node controllers were configured with a control period of 5 seconds. Faster control loops could be used, but they may lead to inconsistent resource allocation updates since K3S resource states are stored in a remote database. Function placement and routing policies were recomputed by Community controllers every minute, while the Topology controller was triggered every 10 minutes.
Applications. To work on a reasonable set of experiments, we used the three applications summarized in Table 2: we created the first function, and we borrowed the other two from the literature [33, 36]. These applications are written using multiple programming languages (e.g., Rust, Java, Go) and have different memory requirements (ranging from 15 MB to 500 MB) and cold start times (from a bunch of seconds to minutes). The first application is primes, a stateless and CPU-heavy function that counts all the prime numbers less than a given input number. As an exemplar complex application we employed sock-shop¹⁰, which implements an e-commerce platform. The application uses a microservice architecture; we further decomposed it into smaller functions to make it suitable for a serverless platform¹¹. For example, microservice carts was divided into three smaller units: carts-post, carts-delete, and carts-util. Finally, to also evaluate GPU-accelerated tasks (e.g., machine learning inference), we used Resnet [43], a neural network model for image classification, implemented using TensorFlow Serving. For each function, Table 2 also reports the memory requirements, the cold start times, and the desired response times (obtained by applying the procedure described in Section 4.5).

¹⁰ https://github.com/microservices-demo/microservices-demo
¹¹ The source code of the function-based version of sock-shop is available at https://github.com/deib-polimi/serverless-sock-shop
Table 2: Characteristics of deployed functions.

Name          Language  Memory   $RT^R_f$  Cold start
Simple stateless function
primes        Rust      15 MB    200ms     <5s
Complex application
carts-post    Java      360 MB   300ms     100s
carts-delete  Java      360 MB   200ms     100s
carts-util    Java      360 MB   200ms     100s
catalogue     Go        15 MB    200ms     <5s
orders        Java      400 MB   600ms     100s
payment       Go        15 MB    50ms      <5s
shipping      Java      350 MB   50ms      100s
login         Go        15 MB    100ms     <5s
registration  Go        15 MB    200ms     <5s
user          Go        15 MB    50ms      <5s
Machine Learning inference
resnet        Python    500 MB   550ms     100s
The set points used by the PI controllers were set to half of the value of $RT^R_f$.
We used Locust¹², a distributed scalable performance testing tool, to feed the system, and mimicked service demand $\lambda_{f,i}$ through different realistic, dynamic workloads. Each experiment was executed five times to have (more) consistent results.
Collected metrics. For each experiment, we collected the average ($\mu$) and standard deviation ($\sigma$) of the following metrics: i) response time (ms), as defined in Section 2; ii) response time violation rate (% of requests), defined as the percentage of requests that are not served within $RT^R_f$ considering the 99th percentile of the measured response times; iii) network time rate (%), the percentage of time spent to forward requests in the network over the total response time ($D/RT$); and iv) allocated cores (millicores, or thousandths of a core), to measure the resources consumed by function instances.
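As an illustration of how such metrics can be derived from raw samples, the sketch below is a hypothetical helper, not the paper's measurement code; in particular, it adopts one possible reading of the violation-rate definition, in which samples above the 99th percentile are discarded as outliers before checking the threshold.

```python
import numpy as np

def violation_rate(response_times_ms: np.ndarray, rt_requirement_ms: float) -> float:
    """One possible reading of the metric: keep samples up to the 99th percentile
    and report the share of those that exceed the required response time."""
    cutoff = np.percentile(response_times_ms, 99)
    kept = response_times_ms[response_times_ms <= cutoff]
    return 100.0 * np.mean(kept > rt_requirement_ms)

def network_time_rate(network_delays_ms: np.ndarray, response_times_ms: np.ndarray) -> float:
    """Percentage of the total response time spent in the network (D / RT)."""
    return 100.0 * network_delays_ms.sum() / response_times_ms.sum()

rng = np.random.default_rng(0)
rt = rng.normal(120, 30, 1000).clip(min=1)   # synthetic response times (ms)
d = rng.normal(5, 1, 1000).clip(min=0)       # synthetic network delays (ms)
print(violation_rate(rt, 200.0), network_time_rate(d, rt))
```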
Competitors. Our experiments compare NEPTUNE against three well-known approaches: K3S, Knative¹³ (KN), and OpenFaaS¹⁴ (OF). K3S is one of the most popular solutions for container orchestration at the edge. It manages the full lifecycle of containerized applications deployed in a topology and adopts a fair placement policy, that is, it schedules containers to keep the resource utilization of the nodes equal. K3S exploits the Horizontal Pod Autoscaler¹⁵ to horizontally scale applications. KN and OF add serverless functionalities to K3S and a set of custom components to perform request routing and horizontal scaling.
To achieve consistent and statistically relevant results, all the experiments described in this section were run 5 times.

¹² https://locust.io/
¹³ https://knative.dev/docs/
¹⁴ https://www.openfaas.com/
¹⁵ https://rancher.com/docs/rancher/v2.5/en/k8s-in-rancher/horitzontal-pod-autoscaler/
Figure 4: Behavior of NEPTUNE with moving workloads. (a) Geo-dynamic workload shape (users); (b) Resource allocation (millicores); (c) Average response time (ms); (d) Networking time rate (%).
4.2 RQ1: Moving workload
The rst experiments evaluate the performance of NEPTUNE when
users move between Area A and Area B within the same community.
We used a cluster of four worker nodes: two nodes in Area A (not
equipped with GPU) and two in Area B. Each run lasted 60 minutes
and used application primes with
𝑅𝑇 𝑅
𝑝𝑟𝑖𝑚𝑒𝑠
set to 200ms and the set
point of PI controllers to 100ms. User migration happened twice
per run and consisted in moving 100 users from one area to another
in less than 10 minutes.
Figure 4 shows the behavior of application primes when man-
aged by NEPTUNE. Since the multiple runs executed for this set of
experiments had similar behavior, the gure illustrates how work-
loads, resources, and performance varied over time during one of
these runs. Figure 4a shows how the workload changed in each area.
In particular, the workload was generated by users close to node
Node-A-0 for Area A and Node-B-0 for Area B. Figure 4b presents the
resources allocated to each node over time. Since communities are
independent, at least one instance per function is always allocated
(if possible) to minimize cold starts. Thus, the overall allocation
is always greater than zero. Conversely, if a node
𝑖
has 0cores
allocated at time
𝑡
for function
𝑓
, it means
𝑓
is not running on
𝑖
at
𝑡(e.g., from second 0 to 1250 for Node-B-0).
The chart shows that if one node in an area cannot manage gen-
erated load, the Community level detects this issue and instantiates
a new function instance on another node as close to the work-
load generator as possible. This behavior can be observed close to
second 600 when the workload in Area A reaches the peak and
a new replica is created on Node-A-1. Similarly, at second 1500 a
new replica is deployed on Node-B-1 when the workload in Area
B increases. In contrast, when the workload decreases instances
are deleted as shown close to second 2700 on Node-B-0. Moreover,
the experiment clearly shows how NEPTUNE is able to migrate
function instances when users move to keep the network delay
minimized. For example, close to second 1000, users move from
Area A to Area B, and right after the function is migrated on node
Node-B-0 to handle the workload in the proximity of users.
Thanks to NEPTUNE, function primes never violates the set re-
sponse time: the average response time
𝑅𝑇𝑓
in Figure 4c is always
signicantly lower than the threshold (200 ms). The control loops
are able to keep the response time very close to the set point. Con-
trol theoretical controllers behave very well when they operate
with high-frequency control loops, enabled by the in-place vertical
scaling feature [
5
,
6
]. In fact, the response time only deviates from
the set point when the instances are replicated (scaled horizontally),
at seconds 600,1600 and 2600, since the action requires more time
than re-conguring containers. However, note that the response
time always returns close to the set point, and this shows that
NEPTUNE can recover from multiple types of perturbations (e.g,
creation and deletion of replicas, uctuating workloads).
Figure 4d shows that NEPTUNE is able to keep the network
overhead extremely low. The only peaks in the chart (seconds 1100
and 2200) are caused by users who change location and by the fact
that routing policies are not updated immediately.
When users start to migrate to another area, replicas cannot al-
ways be created immediately on nodes with the minimum network
delay, as depicted in the chart close to second 1100: the workload
on Node-B-0 increases and an instance is created on Node-B-1.
This behavior occurs because the two-step optimization process
evaluates the placement on B-0 or B-1 to be extremely comparable
(within
𝜖
) since they handle a small portion of the trac compared
to the nodes in Area A. However, NEPTUNE migrates the function
instance directly to Node-B-0 node as soon as the workload in Area
B increases (close to second 1200).
4.3 RQ2: Comparison with other approaches
We compared our solution against the three approaches described in Section 4.1 by means of application sock-shop. Note that some of the functions of this application must invoke other functions. For example, function orders invokes function user to retrieve the user's address and payment card information, function catalogue to retrieve product details, function payment to ensure the creation of the invoice and, finally, in case of success, function carts-delete to empty the cart. We took these dependencies into account by setting adequate response times, as shown in Table 2: from 50ms, for simple functions with no dependencies, to 600ms, assigned to the more complex ones.

Table 3: Results of the comparison with other approaches (μ: average, σ: standard deviation; each metric reports NEPT, K3S, KN, OF in this order).

Function       | Response time (ms)     | Response time violation (%) | Network time rate (%) | Core allocation (millicores)
carts-delete μ | 66.7 64.6 60.6 100.3   | 0.1 0.3 0 2     | 3.5 63.9 92.3 72.9 | 631.3 1921.1 596 597.5
carts-delete σ | 3.4 10.9 1.6 27.3      | 0.1 0.0 0 1.3   | 2.1 16.5 1.2 18.2  | 149.9 429.6 2.1 2.6
carts-post μ   | 110.6 175.9 73.8 184.3 | 0.1 3.5 0.1 3.4 | 3.7 68 78.3 69     | 722.8 615.5 597.4 597.3
carts-post σ   | 7.6 64.2 2.0 73.7      | 0.1 2.7 0.1 2.6 | 2.9 22.4 0.9 26.4  | 178.9 31.3 2.6 2.1
carts-util μ   | 57.4 95.4 54.6 45.6    | 0 1.7 0 0.1     | 2.6 78.5 92.8 70.3 | 516.3 689.3 596.5 4306.1
carts-util σ   | 3.0 31.0 1.2 1.8       | 0.1 1.4 0 ~0    | 1.2 19.6 1.0 2.2   | 83.2 162.2 2.2 180.3
catalogue μ    | 53.3 54.6 163.1 39.2   | 0 0.1 17.7 0    | 1.6 74 41.9 71.2   | 102.7 197.6 65.2 458
catalogue σ    | 2.7 5.2 35.4 1.4       | 0 ~0 2.1 0      | 0.6 11.9 6.3 1.6   | 4.1 23.9 13.0 3.1
orders μ       | 211.6 418.9 505.1 485.2 | 0 16.6 16.5 16 | 4.1 15.8 44.2 25.2 | 1114.8 4484.5 1040.7 597.8
orders σ       | 12.1 86.7 165.7 126.4  | 0 8.2 3.0 9.0   | 1.7 7.5 25.0 8.2   | 273.0 407.2 294.5 1.2
payment μ      | 10.4 50.2 27.9 23.6    | 0 2.8 1.2 0.4   | 8.2 98.7 98.4 98.9 | 795 101.8 49.7 443.1
payment σ      | 0.7 9.8 0.4 1.3        | 0 0.4 0.7 ~0    | 6.1 9.5 1.3 3.9    | 438.9 13.2 0.2 4.3
shipping μ     | 15 75 28.6 88          | 2.6 5.9 1.5 8.5 | 6.4 96.2 95.6 92.7 | 416.5 888.5 597.4 596.9
shipping σ     | 1.1 23.2 0.6 32.8      | 1.1 1.6 0.6 3.5 | 2.5 16.9 1.7 20.3  | 132.3 202.8 1.0 0.9
login μ        | 30.3 72.5 73.2 46      | 0 2.8 11.1 0.2  | 2.6 70.1 77.9 63.2 | 76.7 94.2 54.1 452.2
login σ        | 1.5 12.3 12.1 0.7      | 0 0.7 7.0 ~0    | 0.9 14.4 1.2 1.1   | 13.9 15.6 6.1 6.6
registration μ | 46.4 57.7 65 34.9      | 0 0.1 2.6 0     | 1.4 80.9 87.9 81.7 | 71.6 105.3 53.6 453.4
registration σ | 2.7 6.3 4.4 1.3        | 0 0.1 1.4 0     | 0.4 10.4 1.6 1.8   | 9.8 12.0 6.1 6.2
user μ         | 21.8 66.4 177 93.4     | 0.5 7.8 46.7 16.8 | 7.1 77.9 31.5 76.8 | 153.2 681.8 355.3 463.2
user σ         | 0.7 6.4 35.1 20.3      | 0.5 0.4 24.2 5.3  | 1.0 10.1 5.9 16.3  | 23.3 91.0 166.8 1.9
Each run had a duration of 20 minutes and used a workload that resembles a steep ramp, with an arrival rate $\lambda_{f,i}$ designed to suddenly increase over a short period of time. The workload started with 10 concurrent users, and we added one additional user every second up to 100. We considered a network of 6 nodes in Areas B and C.
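A workload of this shape can be expressed with a Locust custom load shape. The sketch below is illustrative only: the endpoint, the wait times, and the class names are assumptions, not the authors' test scripts.

```python
from locust import HttpUser, LoadTestShape, between, task

class SockShopUser(HttpUser):
    """Client hitting one sock-shop function; the endpoint path is an assumption."""
    wait_time = between(0.5, 1.5)

    @task
    def get_catalogue(self):
        self.client.get("/catalogue")

class RampShape(LoadTestShape):
    """Steep ramp as in RQ2: start with 10 users and add one per second up to 100."""
    def tick(self):
        run_time = self.get_run_time()
        if run_time > 20 * 60:            # each run lasts 20 minutes
            return None                   # stop the test
        users = min(100, 10 + int(run_time))
        return users, 10                  # (target user count, spawn rate)
```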
Table 3 reports the statistical results obtained during the experiments with each approach and with each function of application sock-shop. The results show that NEPTUNE provided in most of the cases the lowest response time compared to the other approaches. The obtained response times were consistent across multiple runs: the standard deviation ranged between 3% and 7% of the average. Other approaches presented higher standard deviation values: in the worst case, KN obtained a standard deviation equal to 32.8% of the average, while K3S (36.5%) and OF (40%) were even more inconsistent.
NEPTUNE reported few violations of the required response time. For most functions the amount of violations was lower than or equal to 0.1%, while it was 2.6% and 0.5% for functions shipping and user, respectively. Other solutions obtained significantly higher violations. In the worst case, K3S failed to meet the foreseen response time for 16.6% of the requests, while OF and KN reported violations for 16.9% and 46.8% of the requests, respectively. This can be explained because the other approaches, compared to NEPTUNE, do not employ precise routing policies, do not perform an adequate resource allocation, and do not solve resource contentions on nodes.
We can also observe how NEPTUNE routing policies helped meet the set response times. The percentage of time spent routing requests ranges from 1.4% to 8.2% of the total response time and, on average, only 4.1% of the time is spent in the network. On the other hand, the routing policies of the other solutions do not consider node utilization, network delay, and application performance. K3S reported a network time rate ranging from 15.8% to 98.7% of the response time, with an overall average of 72.4%. Similarly, OF and KN obtained an average network time rate of 72.2% and 74.1%, respectively.
Finally, as for the resources allocated by each approach for each function, NEPTUNE allocated on average 4600 millicores, while K3S and OF used about twice that amount, 9780 and 8960 millicores, respectively. KN uses fewer resources than NEPTUNE on average (4500 millicores), but it also suffers from a high number of response time violations. This means that KN usually allocates fewer resources than needed (e.g., for function catalogue).
Differently from NEPTUNE, the other solutions do not adopt any resource contention mechanism to provide a fair allocation of resources. For example, K3S allocated most of the resources, 4480 millicores, to function orders, while other functions could not get the resources to work properly. This creates an imbalance among functions that prevents applications from being properly scaled and leads to more response time violations.
4.4 RQ3: GPU Management
The third set of experiments was carried out to assess the transparent GPU management provided by NEPTUNE for computationally intensive functions. To provide a heterogeneous environment, the experiments were conducted using the three nodes in Area A (Node-A-2 is equipped with a GPU).
We used two functions, called resnet-a and resnet-b, both embedding the ResNet neural network in inference mode. The instances of the two functions deployed on Node-A-2 were set to share the same GPU.
Each run lasted 20 minutes and used the same workload described in Section 4.3, with a number of concurrent users starting from 10 and going up to 30 (increased by one every second).
Figure 5: Resnet-a: CPU and GPU executions. (a) Response time; (b) Distributions.
Figure 5 illustrates one run of the experiments. Figure 5a shows the average response time of function resnet-a when executed on both CPUs and GPUs. Function resnet-b obtained similar results that are not reported here for lack of space. GPU executions obtained an almost constant response time and never violated the set response time.
At the beginning of the experiment, all the requests were routed to the GPU, and after some 50 seconds the GPU was fully utilized. To avoid degradation of the response time, the Community level quickly reacted by updating the routing policies and allowing part of the workload to be handled by function instances running on CPUs. The mean response time of CPU instances shows a peak at the beginning of the experiment (with some brief violations of the response time) that is caused by the cold start. After that, the Node level comes into play and dynamically adjusts the CPU cores allocated to the replicas to keep the response time close to the set point.
The box plot of Figure 5b shows the distribution of response times for both functions resnet-a and resnet-b on GPU, on CPU, and the aggregated result. The interquartile range (IQR) is set to 1.5, and the rectangle shows the distribution between the 25th and 75th percentiles. Both GPU instances of resnet-a and resnet-b are able to keep response times quite far from the set threshold, and thus report no violations. In particular, the mean response times of resnet-a and resnet-b are 180ms and 183ms, respectively, which is three times smaller than the threshold.
The distribution of response times on CPUs is wider compared to GPUs. CPU containers are managed by PI controllers that have a transient period to adjust the initial core allocation to an adequate value to reach the desired set point; this does not happen with GPU instances.
Nevertheless, the CPU-only replicas of resnet-a and resnet-b can serve 98.3% and 100% of requests within the set response time, respectively. Moreover, GPU instances handle 70% of the requests, while the remaining part is routed to CPU instances. As a result, the total number of violations for both functions is close to 0.
4.5 Threats to validity
We conducted the experiments using twelve functions (three applications), showing that NEPTUNE is able to minimize the network delay, to reduce response times, and to efficiently allocate resources compared to three other well-known approaches. However, we must highlight threats that may constrain the validity of the obtained results [46]:
Internal Threats. The experiments were run with synthetic workloads that may introduce bias. Workloads have a ramp shape to simulate an incremental growth or reduction of connected users. We used the following procedure to retrieve the maximum number of concurrent users in each experiment. First, we fixed the amount and types of nodes the topology was composed of. The maximum number of concurrent users of each experiment was then retrieved by observing how many users were required to generate enough workload to consistently require at least 70% of the cluster's resources.
The three applications were not provided with a given required response time for each function ($RT^R_f$). $RT^R_f$ was computed using an iterative process. Starting from 50ms and with 50ms increments, $RT^R_f$ was set to be able to serve at least 50% of requests in an amount of time equal to $RT^R_f / 2$.
External Threats. Some of our assumptions may limit the generalization of the experiments.
Consistently with the serverless paradigm, NEPTUNE assumes functions to either be stateless (e.g., without session) or depend on an external database. Currently, interactions with databases are only partially modeled by NEPTUNE. The time to read from and write on a database is modeled at the Node level as a non-controllable stationary disturbance of the response time (e.g., a Gaussian noise). Thus, during our experiments, databases were deployed on dedicated and properly sized machines.
Results show that NEPTUNE is able to efficiently control functions that depend on a database (e.g., orders, carts-post) with a precision similar to the ones without dependencies (e.g., payment, user).
Construct and Conclusion Threats. The experiments demonstrate the validity of our claim, that is, that NEPTUNE is able to efficiently execute multiple functions deployed on a distributed edge topology. All experiments have been executed five times and the obtained results are statistically robust and show small variance.
5 RELATED WORK
The management of edge topologies is a hot topic, widely addressed by both industry and academia [7, 18]. To the best of our knowledge, NEPTUNE is the first solution that provides: an easy-to-use serverless interface, optimal function placement and routing policies, in-place vertical scaling of functions, and transparent management of GPUs and CPUs. The relevant related works we are aware of only focus on specific aspects of the problem.
Wang et al. [44] propose LaSS, a framework for latency-sensitive
edge computations built on top of Kubernetes and OpenWhisk (https://openwhisk.apache.org/).
LaSS models resource allocation with an M/M/c
FCFS queuing model. They provide a fair-share resource allocation algorithm, similar to NEPTUNE's Contention Manager, and
two reclamation policies for freeing allocated resources. LaSS is
the solution most similar to NEPTUNE, but it lacks network overhead minimization and GPU support. Furthermore, the approach
is not fully compatible with the Kubernetes API: Kubernetes is
only used to deploy OpenWhisk, functions run natively on top of
the container runtime (e.g., Docker, https://www.docker.com), and resources are vertically
scaled by bypassing Kubernetes. This approach, also adopted in
cloud computing solutions [6, 34], is known to create state representation inconsistencies between the container runtime and the
orchestrator [3].
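For reference, and independently of LaSS's exact implementation, in an M/M/c FCFS model with arrival rate $\lambda$, per-replica service rate $\mu$, $c$ replicas, offered load $a = \lambda/\mu$, and utilization $\rho = a/c < 1$, the probability that a request has to queue is given by the Erlang C formula
$$P_{wait} = \frac{\frac{a^c}{c!}\,\frac{1}{1-\rho}}{\sum_{k=0}^{c-1}\frac{a^k}{k!} + \frac{a^c}{c!}\,\frac{1}{1-\rho}},$$
and the expected queuing delay by $W_q = P_{wait}/(c\mu - \lambda)$; such closed forms are what allow queuing-based allocators to size $c$ for a latency target.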
Ascigil et al. [1] investigate resource provisioning and allocation for serverless functions
in hybrid edge-cloud systems and formulate the problem using
Mixed Integer Programming. They propose both fully centralized
(function orchestration) approaches, where a single controller is
in charge of allocating resources, and fully decentralized (function
choreography) ones, where controllers are distributed across the
network and decisions are made independently. Compared to NEPTUNE, they focus on minimizing the number of unserved requests
and assume that each request can be served in a fixed amount
of time (a single time slot). However, this assumption is not easy to
ensure in edge computing: nodes may be equipped with different
types of hardware and produce different response times. NEPTUNE
naturally takes this into account, also with the help of GPUs.
Multiple approaches in the literature focus on placement and
routing at the edge [8, 16, 32]. One of the most widely used techniques,
also employed by NEPTUNE, is to model service placement and
workload routing as an Integer or Mixed Integer Programming
problem.
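In its simplest form (a sketch of the general pattern, not NEPTUNE's exact formulation), such a program uses binary placement variables $x_{f,n}$ (function $f$ placed on node $n$) and routing fractions $y_{i,f,n} \in [0,1]$ (share of node $i$'s requests for $f$ served by node $n$), and minimizes the network delay weighted by the expected request rates:
$$\min \sum_{i,f,n} \lambda_{i,f}\, d_{i,n}\, y_{i,f,n} \quad \text{s.t.} \quad \sum_{n} y_{i,f,n} = 1, \;\; y_{i,f,n} \le x_{f,n}, \;\; \sum_{i,f} \lambda_{i,f}\, c_f\, y_{i,f,n} \le C_n,$$
where $\lambda_{i,f}$ is the request rate from node $i$ for function $f$, $d_{i,n}$ the network delay between nodes $i$ and $n$, $c_f$ the per-request resource demand of $f$, and $C_n$ the capacity of node $n$.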
Notably, Tong et al. [42] model a MEC network as a hierarchical
tree of geo-distributed servers and formulate the problem as a two-step Mixed Nonlinear Integer Program (MNIP). In particular,
their approach aims to maximize the number of served requests by
means of optimal service placement and resource allocation. The
effectiveness of their approach is verified using formal analysis and
large-scale trace-based simulations. They assume that workloads
follow known stochastic models (Poisson distribution) and
that arrival rates are independent and identically distributed. This
may not be true in the context of edge computing, where workloads
are often unpredictable and may significantly deviate from the
assumed distribution. NEPTUNE does not share these assumptions
and uses fast control-theoretical planners to mitigate volatility and
unpredictability in the short term.
To cope with dynamic workloads, Tan et al. [41] propose an
online algorithm for workload dispatching and scheduling without any assumption on the workload distribution. However, since their
approach only focuses on routing requests, they cannot always
minimize network delays, especially when edge clients move from
one location to another.
Mobile workloads are addressed, for example, by Leyva-Pupo
et al. [22], who present a solution based on an Integer Linear Programming (ILP) problem with two different objective functions, one
for mobile users and one for static ones. Furthermore, since the
problem is known to be NP-hard, they use heuristic methods to
compute a sub-optimal solution. Sun et al. [40] propose a service
migration solution based on Mixed Integer Programming to keep
the computation as close as possible to the user. In particular, they
consider different factors that contribute to migration costs (e.g.,
required time and resources). However, the two aforementioned
solutions exploit virtual machines, which are known for their
large image sizes and long start-up times, making service migration a very costly operation. NEPTUNE, like other approaches in the
literature [29, 44, 50], uses containers, which are lighter and faster to
scale.
Only a few solutions have been designed for GPU management
in the context of edge computing. For example, Subedi et al. [39]
mainly focus on enabling GPU-accelerated edge computation
without considering latency-critical aspects such as placing applications close to edge clients.
6 CONCLUSIONS AND FUTURE WORK
The paper proposes NEPTUNE, a serverless-based solution for managing latency-sensitive applications deployed on geo-distributed,
large-scale edge topologies. It provides smart placement and routing to minimize network overhead, dynamic resource allocation to
quickly react to workload changes, and transparent management
of CPUs and GPUs. A prototype built on top of K3S, a popular
container orchestrator for the edge, helped us demonstrate the
feasibility of the approach and obtain promising results compared to
similar state-of-the-art solutions.
Our future work comprises improving the adopted scheduling and resource allocation solutions by exploiting function dependencies [26] and workload predictors to anticipate future demand [21]. As a further extension, we will consider Bayesian optimization approaches [13, 38] to find optimal response times automatically. State migration and data consistency approaches can
also be integrated to manage stateful applications.
7 ACKNOWLEDGEMENTS
This work has been partially supported by the SISMA national
research project (MIUR, PRIN 2017, Contract 201752ENYB).
REFERENCES
[1]
Onur Ascigil, Argyrios Tasiopoulos, Truong Khoa Phan, Vasilis Sourlas, Ioan-
nis Psaras, and George Pavlou. 2021. Resource Provisioning and Allocation in
Function-as-a-Service Edge-Clouds (Early Access). IEEE Transactions on Services
Computing (2021), 1–14.
[2]
David Balla, Csaba Simon, and Markosz Maliosz. 2020. Adaptive scaling of Kuber-
netes pods. In Proceedings of the IEEE/IFIP Network Operations and Management
Symposium, NOMS 2020. IEEE, 1–5.
[3]
Luciano Baresi, Davide Yi Xian Hu, Giovanni Quattrocchi, and Luca Terracciano.
2021. KOSMOS: Vertical and Horizontal Resource Autoscaling for Kubernetes. In
Proceedings of the 19th International Conference on Service-Oriented Computing,
ICSOC 2021 (Lecture Notes in Computer Science, Vol. 13121). Springer, 821–829.
[4]
Luciano Baresi, Danilo Filgueira Mendonça, and Giovanni Quattrocchi. 2019.
PAPS: A Framework for Decentralized Self-management at the Edge. In Proceed-
ings of the 17th International Conference on Service-Oriented Computing, ICSOC
2019 (Lecture Notes in Computer Science, Vol. 11895). Springer, 508–522.
[5]
Luciano Baresi and Giovanni Quattrocchi. 2018. Towards Vertically Scalable
Spark Applications. In Euro-Par 2018: Parallel Processing Workshops - Euro-Par
2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected
Papers (Lecture Notes in Computer Science, Vol. 11339). Springer, 106–118.
[6]
Luciano Baresi and Giovanni Quattrocchi. 2020. COCOS: A Scalable Architecture
for Containerized Heterogeneous Systems. In Proceedings of the IEEE International
Conference on Software Architecture, ICSA 2020. IEEE, 103–113.
[7]
Julian Bellendorf and Zoltán Ádám Mann. 2020. Classification of optimization
problems in fog computing. Future Gener. Comput. Syst. 107 (2020), 158–176.
[8]
David Bermbach, Jonathan Bader, Jonathan Hasenburg, Tobias Pfandzelter, and
Lauritz Thamsen. 2021. AuctionWhisk: Using an Auction-Inspired Approach for
Function Placement in Serverless Fog Platforms (Early Access). Software: Practice
and Experience (2021), 1–49.
[9]
Victor Campmany, Sergio Silva, Antonio Espinosa, Juan Carlos Moure, David
Vázquez, and Antonio M. López. 2016. GPU-based Pedestrian Detection for
Autonomous Driving. In Proceedings of the International Conference on Compu-
tational Science 2016, ICCS 2016 (Procedia Computer Science, Vol. 80). Elsevier,
2377–2381.
[10]
Junguk Cho, Karthikeyan Sundaresan, Rajesh Mahindra, Jacobus E. van der
Merwe, and Sampath Rangarajan. 2016. ACACIA: Context-aware Edge Comput-
ing for Continuous Interactive Applications over Mobile Networks. In Proceedings
of the 12th International on Conference on emerging Networking EXperiments and
Technologies, CoNEXT 2016. ACM, 375–389.
[11]
Thomas Heide Clausen and Philippe Jacquet. 2003. Optimized Link State Routing
Protocol (OLSR). RFC 3626 (2003), 1–75.
[12]
Xavier Dutreilh, Nicolas Rivierre, Aurélien Moreau, Jacques Malenfant, and Isis
Truck. 2010. From Data Center Resource Allocation to Control Theory and Back.
In Proceedings of the IEEE International Conference on Cloud Computing, CLOUD
2010. IEEE, 410–417.
[13]
Nicolò Felicioni, Andrea Donati, Luca Conterio, Luca Bartoccioni, Davide Yi Xian
Hu, Cesare Bernardis, and Maurizio Ferrari Dacrema. 2020. Multi-Objective
Blended Ensemble For Highly Imbalanced Sequence Aware Tweet Engagement
Prediction. In Proceedings of the Recommender Systems Challenge 2020, RecSys
Challenge 2020. ACM, 29–33.
[14]
Wes Felter, Alexandre Ferreira, Ram Rajamony, and Juan Rubio. 2015. An updated
performance comparison of virtual machines and Linux containers. In IEEE
International Symposium on Performance Analysis of Systems and Software, ISPASS
2015. IEEE, 171–172.
[15]
Domenico Grimaldi, Valerio Persico, Antonio Pescapè, Alessandro Salvi, and
Stefania Santini. 2015. A Feedback-Control Approach for Resource Management
in Public Clouds. In Proceedings of the IEEE Global Communications Conference
2015, GLOBECOM 2015. IEEE, 1–7.
[16]
Songtao Guo, Bin Xiao, Yuanyuan Yang, and Yang Yang. 2016. Energy-efficient
dynamic offloading and resource scheduling in mobile cloud computing. In Pro-
ceedings of the 35th Annual IEEE International Conference on Computer Communi-
cations, INFOCOM 2016. IEEE, 1–9.
[17]
Akhil Gupta and Rakesh Kumar Jha. 2015. A Survey of 5G Network: Architecture
and Emerging Technologies. IEEE Access 3 (2015), 1206–1232.
[18]
Congfeng Jiang, Xiaolan Cheng, Honghao Gao, Xin Zhou, and Jian Wan. 2019.
Toward Computation Offloading in Edge Computing: A Survey. IEEE Access 7
(2019), 131543–131558.
[19]
Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-che Tsai, Anurag Khan-
delwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Jayant
Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Pat-
terson. 2019. Cloud Programming Simplified: A Berkeley View on Serverless
Computing. CoRR abs/1902.03383 (2019), 1–35.
[20]
Patrick Kalmbach, Andreas Blenk, Wolfgang Kellerer, Rastin Pries, Michael
Jarschel, and Marco Hoffmann. 2019. GPU Accelerated Planning and Place-
ment of Edge Clouds. In Proceedings of the International Conference on Networked
Systems 2019, NetSys 2019. IEEE, 1–3.
[21]
Jitendra Kumar and Ashutosh Kumar Singh. 2018. Workload prediction in cloud
using articial neural network and adaptive dierential evolution. Future Gener.
Comput. Syst. 81 (2018), 41–52.
[22]
Irian Leyva-Pupo, Alejandro Santoyo-González, and Cristina Cervelló-Pastor.
2019. A Framework for the Joint Placement of Edge Service Infrastructure and
User Plane Functions for 5G. Sensors 19, 18 (2019), 3975.
[23]
Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang. 2010. CloudCmp:
comparing public cloud providers. In Proceedings of the 10th ACM SIGCOMM
Internet Measurement Conference, IMC 2010. 1–14.
[24]
Ping-Min Lin and Alex Glikson. 2019. Mitigating Cold Starts in Serverless
Platforms: A Pool-Based Approach. CoRR abs/1903.12221 (2019), 1–5.
[25]
Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md. Enamul Haque,
Lingjia Tang, and Jason Mars. 2018. The Architectural Implications of Au-
tonomous Driving: Constraints and Acceleration. In Proceedings of the 23rd In-
ternational Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS 2018. ACM, 751–766.
[26]
Wei-Tsung Lin, Chandra Krintz, and Rich Wolski. 2018. Tracing Function Depen-
dencies across Clouds. In Proceedings of the 11th IEEE International Conference on
Cloud Computing, CLOUD 2018. IEEE, 253–260.
[27]
Chunhong Liu, Chuanchang Liu, Yanlei Shang, Shiping Chen, Bo Cheng, and
Junliang Chen. 2017. An adaptive prediction approach based on workload pattern
discrimination in the cloud. J. Netw. Comput. Appl. 80 (2017), 35–44.
[28]
Shaoshan Liu, Liangkai Liu, Jie Tang, Bo Yu, Yifan Wang, and Weisong Shi.
2019. Edge Computing for Autonomous Driving: Opportunities and Challenges.
Proceedings of the IEEE 107, 8 (2019), 1697–1716.
[29]
Omogbai Oleghe. 2021. Container Placement and Migration in Edge Computing:
Concept and Scheduling Models. IEEE Access 9 (2021), 68028–68043.
[30]
Quoc-Viet Pham, Fang Fang, Vu Nguyen Ha, Md. Jalil Piran, Mai Le, Long Bao
Le, Won-Joo Hwang, and Zhiguo Ding. 2020. A Survey of Multi-Access Edge
Computing in 5G and Beyond: Fundamentals, Technology Integration, and State-
of-the-Art. IEEE Access 8 (2020), 116974–117017.
[31] Jon Postel. 1981. Internet Control Message Protocol. RFC 777 (1981), 1–14.
[32] Konstantinos Poularakis, Jaime Llorca, Antonia Maria Tulino, Ian J. Taylor, and
Leandros Tassiulas. 2019. Joint Service Placement and Request Routing in Multi-
cell Mobile Edge Computing Networks. In Proceedings of the IEEE Conference on
Computer Communications, INFOCOM 2019. IEEE, 10–18.
[33]
Peter-Christian Quint and Nane Kratzke. 2018. Towards a Lightweight Multi-
Cloud DSL for Elastic and Transferable Cloud-native Applications. In Proceedings
of the 8th International Conference on Cloud Computing and Services Science,
CLOSER 2018. SciTePress, 400–408.
[34]
Gourav Rattihalli, Madhusudhan Govindaraju, Hui Lu, and Devesh Tiwari. 2019.
Exploring Potential for Non-Disruptive Vertical Auto Scaling and Resource Esti-
mation in Kubernetes. In Proceedings of the 12th IEEE International Conference on
Cloud Computing, CLOUD 2019. IEEE, 33–40.
[35]
Krzysztof Rzadca, Pawel Findeisen, Jacek Swiderski, Przemyslaw Zych, Przemys-
law Broniek, Jarek Kusmierek, Pawel Nowak, Beata Strack, Piotr Witusowski,
Steven Hand, and John Wilkes. 2020. Autopilot: workload autoscaling at Google.
In Proceedings of the 15th EuroSys Conference 2020, EuroSys 2020, Heraklion, Greece,
April 27-30, 2020. ACM, 16:1–16:16.
[36]
Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmark-
ing State-of-the-Art Deep Learning Software Tools. In Proceedings of the 7th
International Conference on Cloud Computing and Big Data, CCBD 2016. IEEE,
99–104.
[37]
Paulo Silva, Daniel Fireman, and Thiago Emmanuel Pereira. 2020. Prebaking Func-
tions to Warm the Serverless Cold Start. In Proceedings of the 21st International
Middleware Conference, Middleware 2020. ACM, 1–13.
[38]
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian
Optimization of Machine Learning Algorithms. In Proceedings of the 26th Annual
Conference on Neural Information Processing Systems. NIPS 2012. 2960–2968.
[39]
Piyush Subedi, Jianwei Hao, In Kee Kim, and Lakshmish Ramaswamy. 2021.
AI Multi-Tenancy on Edge: Concurrent Deep Learning Model Executions and
Dynamic Model Placements on Edge Devices. In Proceedings of the 14th IEEE
International Conference on Cloud Computing, CLOUD 2021. IEEE, 31–42.
[40]
Xiang Sun and Nirwan Ansari. 2016. PRIMAL: PRofIt Maximization Avatar
pLacement for mobile edge computing. In Proceedings of the IEEE International
Conference on Communications 2016, ICC 2016. IEEE, 1–6.
[41]
Haisheng Tan, Zhenhua Han, Xiang-Yang Li, and Francis C. M. Lau. 2017. On-
line job dispatching and scheduling in edge-clouds. In Proceedings of the IEEE
Conference on Computer Communications 2017, INFOCOM 2017. IEEE, 1–9.
[42]
Liang Tong, Yong Li, and Wei Gao. 2016. A hierarchical edge cloud architecture for
mobile computing. In Proceedings of the 35th Annual IEEE International Conference
on Computer Communications, INFOCOM 2016. IEEE, 1–9.
[43]
Abhishek Verma, Hussam Qassim, and David Feinzimer. 2017. Residual squeeze
CNDS deep learning CNN model for very large scale places image recognition. In
Proceedings of the 8th IEEE Annual Ubiquitous Computing, Electronics and Mobile
Communication Conference, UEMCON 2017, 2017. IEEE, 463–469.
[44]
Bin Wang, Ahmed Ali-Eldin, and Prashant J. Shenoy. 2021. LaSS: Running
Latency Sensitive Serverless Computations at the Edge. In Proceedings of the 30th
International Symposium on High-Performance Parallel and Distributed Computing,
HPDC 2021. ACM, 239–251.
[45]
Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael M.
Swift. 2018. Peeking Behind the Curtains of Serverless Platforms. In Proceedings of
the USENIX Annual Technical Conference, USENIX ATC 2018. USENIX Association,
133–146.
[46]
Claes Wohlin, Martin Höst, and Kennet Henningsson. 2006. Empirical Research
Methods in Web and Software Engineering. In Web Engineering. Springer, 409–
430.
[47]
Jierui Xie, Boleslaw K. Szymanski, and Xiaoming Liu. 2011. SLPA: Uncovering
Overlapping Communities in Social Networks via a Speaker-Listener Interaction
Dynamic Process. In Proceedings of the IEEE 11th International Conference on Data
Mining Workshops, (ICDMW) 2011. IEEE, 344–349.
[48]
Lenar Yazdanov and Christof Fetzer. 2014. Lightweight Automatic Resource
Scaling for Multi-tier Web Applications. In Proceedings of the 7th International
Conference on Cloud Computing, CLOUD 2014. IEEE, 466–473.
[49]
Xu Zhang, Hao Chen, Yangchao Zhao, Zhan Ma, Yiling Xu, Haojun Huang, Hao
Yin, and Dapeng Oliver Wu. 2019. Improving Cloud Gaming Experience through
Mobile Edge Computing. IEEE Wirel. Commun. 26, 4 (2019), 178–183.
[50]
Ao Zhou, Shangguang Wang, Shaohua Wan, and Lianyong Qi. 2020. LMM:
latency-aware micro-service mashup in mobile edge computing environment.
Neural Comput. Appl. 32, 19 (2020), 15411–15425.
[51]
Qian Zhu and Gagan Agrawal. 2012. Resource Provisioning with Budget Con-
straints for Adaptive Applications in Cloud Environments. IEEE Transactions
on Services Computing 5, 4 (2012), 497–511.