Model-based Reinforcement Learning for Elastic
Stream Processing in Edge Computing
Jinlai Xu, and Balaji Palanisamy
School of Computing and Information, University of Pittsburgh, Pittsburgh, PA 15213, USA
Email: jinlai.xu@pitt.edu, bpalan@pitt.edu
Abstract—Low-latency data processing is critical for enabling next-generation Internet-of-Things (IoT) applications. Edge
computing-based stream processing techniques that optimize for
low latency and high throughput provide a promising solution
to ensure a rich user experience by meeting strict application
requirements. However, manual performance tuning of stream
processing applications in heterogeneous and dynamic edge
computing environments is not only time consuming but also not
scalable. Our work presented in this paper achieves elasticity
for stream processing applications deployed at the edge by
automatically tuning the applications to meet the performance
requirements. The proposed approach adopts a learning model to configure the parallelism of the operators in the stream processing application using a reinforcement learning (RL) method. We model the elastic control problem as a Markov Decision Process (MDP) and solve it by reducing it to a contextual Multi-Armed Bandit (MAB) problem. The techniques proposed in our work use Upper Confidence Bound (UCB)-based methods to improve the sample efficiency in comparison to traditional random exploration methods such as the ϵ-greedy method. It
achieves a significantly improved rate of convergence compared
to other RL methods through its innovative use of MAB methods
to deal with the tradeoff between exploration and exploitation.
In addition, the use of model-based pre-training results in sub-
stantially improved performance by initializing the model with
appropriate and well-tuned parameters. The proposed techniques
are evaluated using realistic and synthetic workloads through
both simulation and real testbed experiments. The experiment
results demonstrate the effectiveness of the proposed approach
compared to standard methods in terms of cumulative reward
and convergence speed.
I. INTRODUCTION
The proliferation of Internet-of-Things (IoT) devices is
rapidly increasing the demands for efficient processing of
low latency stream data generated close to the edge of the
IoT network. IHS Markit forecasts that the number of IoT
devices will increase to more than 125 billion by 2030 [1].
Edge computing complements traditional cloud computing solutions by moving operations from remote datacenters to computing resources at the edge of the network that are close to the IoT end-devices.
A key objective in enabling low-latency edge computing is
to minimize the volume of data that needs to be transported
and reduce the response time for the requests. Stream data
processing is an integral component of low-latency data an-
alytic systems and several open-source systems (e.g. Apache
Storm [2], and Apache Flink [3]) provide efficient solutions
for processing data streams. These systems optimize the
performance of stream data processing for achieving high
throughput and low (or bounded) response time (latency)
for the stream queries. However, these techniques, when applied in an edge computing environment, incur a huge operational cost. In addition, the performance of each application needs
to be manually tuned and further reconfigured for varying
workload conditions and changing execution environments.
Recent studies indicate that the administrative labor occupies
20-50% of the overall operational cost of deploying an IoT
application [4]. Therefore, automated tuning can drastically
decrease the cost of deploying and maintaining stream pro-
cessing applications in edge-based computing systems.
Due to the inherent hardness in predicting the highly
dynamic changes in an edge computing environment, it is
important to adopt a self-adaptive approach using techniques
such as reinforcement learning (RL) methods that are suitable
for adapting to the changes in the environment. Recently,
RL-based methods were developed in the distributed system
domain to enhance heuristic-based system optimization algo-
rithms and provide more effective solutions to problems for
which heuristic-based solutions are less effective. However,
applying RL methods in the systems domain incurs several
challenges. A key performance factor in effective RL-based
methods is the sample efficiency of the method. Currently,
most RL methods are based on deep learning techniques that
use deep neural networks (DNNs) to handle the approximation
of the environment dynamics and the reward distribution. The
use of DNNs enables the models to handle more complex conditions. However, most of the DNN-based RL algorithms require a large amount of data to converge to a good result, which makes the optimization of the sample efficiency even harder. Therefore, improving the sample efficiency is a critical
problem when applying RL-based algorithms for optimizing
distributed systems management.
In this paper, we propose an RL-based algorithm to optimize
the performance of distributed stream processing (DSP) appli-
cations to meet various quality of service (QoS) requirements.
For deploying a stream processing application, the system
needs to decide how much resources are allocated to each
operator to meet the QoS requirements set by the users. This
is further challenged by the heterogeneous nature of the edge
computing resources and the dynamically changing working
conditions. Based on the recent developments of the RL
algorithms, we model the DSP scaling problem as a contextual
Multi-Armed Bandit (MAB) problem, which is reduced from
the original Markov Decision Process (MDP). With the above
simplification, the elastic parallelism configuration can be efficiently solved using state-of-the-art algorithms that work well on MAB problems [5]. It can automatically achieve tradeoffs between exploring the solution space to find an optimal solution (exploration) and utilizing the data gathered from the previous trials (exploitation). We investigate the use of the LinUCB [5] algorithm to dynamically decide the parallelism
configuration during the execution of the DSP application
that aims to improve multiple QoS metrics including end-
to-end latency upper bound, throughput and resource usage.
We further improve the sample efficiency of LinUCB using
a model-based method which is based on a queuing model
simulation to pre-train the RL agent to improve the accuracy
of the initial parameters. The main contributions of this paper
are summarized as follows:
• We model the elastic parallelism configuration problem for distributed stream processing (DSP) applications in edge computing environments.
• We integrate the problem model into a Markov Decision Process (MDP) and reduce the MDP to a contextual MAB problem where we apply the LinUCB method to find tradeoffs between exploration and exploitation.
• We propose a model-based learning approach to improve the sample efficiency of LinUCB by generating more data based on a queuing model simulation.
• We evaluate our proposed method, MBLinUCB, against other state-of-the-art RL methods using realistic workloads through both simulation and real testbed experiments. The experiment results demonstrate its effectiveness in terms of cumulative reward and convergence speed.
II. ELASTIC STREAM PROCESSING PROBLEM
DSP applications are often long-running and can experience
variable workloads. Additionally, the highly dynamic edge
computing environments may change the working conditions
of the applications requiring operators to be migrated between
nodes with different capacities due to failures or mobility
requirements. To bound the performance of the applications
within an acceptable range, it is important to design an elastic
parallelism configuration algorithm to adapt to the changes in
the dynamic edge computing environment. In this section, we
first explain the terminologies used in modeling the elastic
parallelism configuration problem.
Logic Plan: we assume that there is a stream processing application submitted to the system. The code of the application is translated into a Directed Acyclic Graph (DAG) denoted as $G_{dag}(V_{dag}, E_{dag})$, shown as the logic plan in Figure 1. Here, the vertices $V_{dag}$ represent the operators and the edges $E_{dag}$ represent the streams connecting the operators. We use $i \in V_{dag}$ to denote operator $i$ in the application.
Cluster: we assume that the resources of the cluster are also organized as a graph, $G_{res}(V_{res}, E_{res})$, where $V_{res}$ denotes the nodes in the cluster and $E_{res}$ indicates the virtual links connecting the nodes. As shown in Figure 1, in the edge computing environment there are multiple tiers of resources, such as the micro datacenters (MDCs) and the smart gateways, which are deployed near the edge of the network and are used for processing the data locally to provide low-latency
Fig. 1. Elastic Stream Processing Framework
computing to the applications. Thus, it is natural to consider edge computing as a heterogeneous environment with highly dynamic changes in the execution environment. We simplify the physical resources as virtual nodes in the graph, $G_{res}$. For example, if a node $v$ represents a micro datacenter (MDC), we group its resource capacity as $C_v$ by considering all the resources we can use in an MDC as a virtual node.
Preliminary Operator Placement: the operator placement is a map between the operators, $i \in V_{dag}$, and the nodes, $v \in V_{res}$, in the cluster. We assume that the operator placement for the stream processing application is already provided as shown in Figure 1. It is denoted as a map $X_0 = \{x_i^v \mid i \in V_{dag}, v \in V_{res}\}$. For each operator $i \in V_{dag}$, if it is placed on node $v \in V_{res}$, then $x_i^v = 1$. It is worth noting that, in the current state-of-the-art DSP engines (such as Apache Storm and Flink), one operator can be replicated into multiple instances and the instances of the operator can be placed on different nodes. We assume that each operator is placed on one node as it simplifies the representation complexity of the model. Additionally, if the operator placement needs to be reconfigured to fit changes in the working environment, we can treat the reconfiguration as a new submission, as it does not affect the performance of the parallelism configuration algorithm.
Elastic Parallelism Configuration: the objective of configuring the parallelism is to change the number of instances of each operator to optimize one or multiple QoS requirements of the application. We need to decide the parallelism of each operator $i$, which is denoted as $k_i \in [1, K_{max}]$, where $K_{max}$ is an upper bound on the parallelism. Therefore, the number of possible parallelism configurations for the whole application is $K_{max}^{|V_{dag}|}$ if all the operators have the maximum parallelism $K_{max}$. The parallelism determines the number of threads running for the instances of the operator, which is not directly related to the number of tasks provisioned for an operator. We discuss this in detail in Section IV. The configuration can be either static, i.e., fixed when the application is submitted to the engine, or dynamic, i.e., changeable while the application is running. In this work, we deal with the dynamic parallelism configuration problem and present the detailed solution in Section III.
A. Quality of Service Metrics
The objective of the parallelism configuration can be de-
cided by the user in terms of QoS requirements. As most of
the stream processing applications need to handle the incoming
data under acceptable latency, the goal of the parallelism
configuration can be to minimize the response time while
reducing the resource cost and minimizing the gap between
the throughput and the arrival rate to avoid back-pressure [6].
End-to-end latency upper bound: we assume the end-to-end
latency of the application is primarily composed of compu-
tational or queuing latency. If the network latency or other
latency e.g., the I/O latency caused by memory swapping,
are significant in an application, we rely on other techniques
to optimize the application first before deploying in the edge
computing environment [7]–[9]. To ensure that the end-to-end latency is bounded by a user-defined target, we traverse the paths in the application's DAG to get the estimated end-to-end latency upper bound. We first define a path as a sequence of operators, starting at a source and ending at a sink, denoted as $p \in P$, where $P$ denotes all the paths in the application. We can estimate the latency upper bound (not tight) of the application as the longest path in the DAG:
$$\bar{l}_{dag} = \max_{p \in P} \sum_{i \in p} \bar{l}_i \quad (1)$$
where $\bar{l}_i$ is the latency upper bound when passing through one of the instances of an operator $i$.
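To make Equation 1 concrete, the following minimal Python sketch (not from the paper; the DAG edges and per-operator bounds below are hypothetical) computes the application-level bound as the maximum summed per-operator bound over all source-to-sink paths.

def dag_latency_upper_bound(edges, op_bound):
    """Equation 1: maximum, over all source-to-sink paths, of the summed per-operator bounds."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    sources = set(op_bound) - {v for _, v in edges}

    def longest_from(i):
        # latency bound of the heaviest path starting at operator i
        return op_bound[i] + max((longest_from(j) for j in children.get(i, [])), default=0.0)

    return max(longest_from(s) for s in sources)

# hypothetical 4-operator DAG with per-operator latency bounds (ms)
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
op_bound = {0: 50.0, 1: 200.0, 2: 120.0, 3: 80.0}
print(dag_latency_upper_bound(edges, op_bound))  # 330.0  (path 0 -> 1 -> 3)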
Throughput: in stream processing applications, the through-
put requirement is typically defined by matching the process-
ing rate of the application to the arrival rate. If the processing rate is larger than the arrival rate, the application will not incur back-pressure [6], which would otherwise degrade the performance of the application and may increase the resource usage (e.g., the memory usage for caching the unprocessed tuples). Therefore,
to evaluate the throughput performance, we use the queue
length, noted as ω, which is widely used in the queuing model
to represent the state of the queue. It also captures the gap
between the throughput and the arrival rate in the long run,
which is easy to monitor in the stream processing engine.
Resource usage: for the resource usage, we can directly use the parallelism configuration as an estimate, which is $k_i$ for operator $i$. With an increase in parallelism, more threads are allocated to the operator, so the resource usage increases. Thus, the parallelism can be used as a representation of the resource usage.
Reconfiguration cost: as we change the parallelism configu-
ration when the stream processing application is running, it is
important to consider the reconfiguration cost if the parallelism
is changed (e.g., the operator needs to be restarted to apply
the parallelism change). However, most of the previous works assume a static reconfiguration cost [8], which is a constant cost related to the downtime. This kind of measurement is not accurate due to the correlation between the reconfiguration downtime and other metrics such as latency and throughput.
The downtime caused by the reconfiguration will lead to a
peak latency and throughput after the downtime. Therefore in
this work, we do not include the reconfiguration cost in the
objective. Instead, we include the downtime influence in the
other metrics such as the end-to-end latency and throughput.
Fig. 2. Stream Processing Model
B. Stream Processing Model
With the notion of parallelism configuration and the QoS
metrics defined above, we represent the stream processing
model used to estimate the relationship between the decision
(parallelism configuration) and requirements (QoS metrics)
based on the queuing model of an operator and the message
passing model of the stream processing application. The
discussion will guide the later RL method design in Section III.
As discussed in Section I, the highly dynamic workload
and the heterogeneous resources make it very difficult to
predict the environment dynamics when the stream processing
application is deployed in the edge computing environment.
However, it is important to extract the invariant from the
dynamics for human operators to understand the problem
and the condition of the whole system to debug potential
problems. Based on the intuition above, we adopt the model
from queuing theory and choose the M/M/1 queue (can also
be extended to G/G/1 based on distribution information) to
model the characteristics of the operator. For each operator i,
as shown in Figure 2, we assume that the instances of it do
not share the input and output queues which can be treated
as a M/M/1 queue. An M/M/1 queue can be described as two
variables, λi, µiand one state ωi, where λiis the arrival rate,
µiis the service rate, and ωiis the queue length. Based on
the theory of M/M/1 queue [10], we can get the response
time distribution (which is noted as latency in our work)
and the throughput with closed-form formulations. When the
queue is stable, which means µi> λi, the queue length will
not grow infinitely. Without losing generality, we analyze the
latency upper bound here as an example. The 95th percentile
of latency can be calculated from the cumulative distribution
of an exponential distribution Exp(µi−λi)as follows:
¯
li=ln 20
µi−λi
(2)
Similarly, the other metrics can also be represented with closed-form formulations. For example, the average latency is $1/(\mu_i - \lambda_i)$. If the arrival rate and service rate distributions (G/G/1 queue) are given, Equation 2 can be modified correspondingly to represent the upper bound (95th percentile) of the latency using the cumulative distribution function. In addition, we add a variable, $\psi_i$, to enhance the queuing model, which represents the selectivity of operator $i$, so that we can get the output rate as $\psi_i \lambda_i$ if $\mu_i > \lambda_i$, as shown in Figure 2.
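As an illustration of the per-operator model (our own sketch, not the paper's code, with hypothetical rates), the snippet below evaluates Equation 2 and the selectivity-scaled output rate.

import math

def p95_latency(mu_i, lam_i):
    """95th-percentile latency of a stable M/M/1 queue (Equation 2): ln(20) / (mu - lambda)."""
    assert mu_i > lam_i, "queue must be stable (mu > lambda)"
    return math.log(20) / (mu_i - lam_i)

def output_rate(lam_i, psi_i, mu_i):
    """Output rate of the operator: selectivity times input rate while the queue is stable."""
    return psi_i * lam_i if mu_i > lam_i else psi_i * mu_i

# hypothetical operator: 10 tuples/s arrive, one instance serves 12 tuples/s, selectivity 0.5
print(p95_latency(mu_i=12.0, lam_i=10.0))             # ~1.5 s
print(output_rate(lam_i=10.0, psi_i=0.5, mu_i=12.0))  # 5.0 tuples/s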
After introducing the queuing model of a single operator,
we now move to the model to estimate the performance
of an application. As described in the beginning of this
section, we assume that the application is organized as a DAG,
$G_{dag}(V_{dag}, E_{dag})$, where each vertex $i \in V_{dag}$ represents an operator and each edge $(i, j) \in E_{dag}$ represents a stream. The
tuples transmitted between two operators will be partitioned
by a default shuffling function, or a user-defined partitioning
function, which calculates the index of the downstream in-
stance that the tuple will go to. In the message passing model, instead of composing the overall latency from source to sink as in Equation 1, we break down the latency incurred at each operator and, based on that, set a per-operator latency target from the overall latency requirement. With the split objective, each operator's performance can be tuned without taking the other operators or the overall application into consideration. Here, we use a simple heuristic that sets the maximum latency target of each operator proportional to its contribution to the overall latency:
$$\bar{l}_i^{max} = \frac{\bar{l}_i}{\bar{l}_{dag}} \bar{l}^{max} \quad (3)$$
where $\bar{l}^{max}$ is the upper bound latency of the overall application set by the user. If the profiling information is not
available or not possible to obtain, we can use other heuristics
such as evenly dividing the latency upper bound into the sub-
objective of each operator with the given number of stages
in the DAG. We leave the dynamic orchestration of the sub-objectives, by gathering more information while executing the application, as future work. For the other metrics, such as throughput, we can monitor the input rate and processing rate of an operator, and the throughput sub-objective can be directly obtained from the local information (e.g., queue length) of a particular operator, so that we can rely on the local information to optimize the throughput. Therefore, in the RL algorithm, we only need to focus on tuning the parallelism of one operator with the given sub-objective. The use of sub-objectives decreases the complexity of the parallelism configuration problem, which we discuss in detail in Section III-B.
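A small sketch of the heuristic split of Equation 3 (our illustration, with hypothetical profiled bounds; the value of $\bar{l}_{dag}$ comes from Equation 1):

def split_latency_targets(op_bound, l_dag, l_max):
    """Equation 3: give each operator a share of the end-to-end target l_max
    proportional to its profiled latency bound relative to l_dag."""
    return {i: (b / l_dag) * l_max for i, b in op_bound.items()}

# hypothetical per-operator bounds (ms); l_dag = 330 ms from Equation 1; user target 1000 ms
op_bound = {0: 50.0, 1: 200.0, 2: 120.0, 3: 80.0}
print(split_latency_targets(op_bound, l_dag=330.0, l_max=1000.0))
# operator 1, for example, receives roughly 606 ms of the 1000 ms budget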
In the next section, we present the details of the proposed
model-based RL method based on the above model to automat-
ically decide the parallelism in a dynamic and heterogeneous
edge computing environment.
III. REINFORCEMENT LEARNING FOR ELASTIC STREAM PROCESSING
We structure the elastic parallelism configuration as a Markov Decision Process (MDP) that represents an RL agent's decision-making process when making parallelism decisions. We then reduce the MDP to a contextual MAB problem and apply LinUCB with model-based pre-training.
A. A Markov Decision Process Formulation
The problem of continuously configuring the parallelism of
the stream processing applications in an edge computing en-
vironment can be naturally modelled as an MDP. Formally, an MDP algorithm proceeds in discrete time steps, $t = 1, 2, 3, \ldots$:
(i) The algorithm observes the current DSP application state $s_t$, which is a set of metrics gathered from monitor threads running outside of the system (e.g., node utilization, network usage) or metrics reported by the application itself (e.g., latency, throughput, and queue length, as described in Section II-A).
(ii) Based on the rewards observed in the previous steps, the algorithm chooses an action $\mathbf{k}_t \in \mathcal{A}$, where $\mathcal{A}$ is the overall action space, and receives a reward $r_{\mathbf{k}_t}$, whose expectation depends on both the state $s_t$ and the action $\mathbf{k}_t$. In the parallelism configuration process, each action is composed of the parallelism configurations of all the operators, which can be denoted as $\mathbf{k}_t = \{k_{t,i} \mid i \in V_{dag}\}$.
(iii) The algorithm improves its state-action-selection strategy with the new observations, $(s_t, \mathbf{k}_t, r_{\mathbf{k}_t}, s_{t+1})$.
We choose the finite-horizon undiscounted return [11] as the objective of the MDP, which can be written as:
$$\sum_{t=0}^{T} r_{\mathbf{k}_t}(s_t, s_{t+1}) \quad (4)$$
where $T$ is the number of consecutive time steps considered in the objective. It is a cumulative measure of the undiscounted rewards over a predefined $T$ time steps. As shown in the
equation, compared to the infinite discounted reward, the finite
undiscounted reward treats each time step equally. This fits
the objective of the parallelism configuration that aims to
maximize the utility uniformly among time steps. It also fits
well into the contextual MAB problem which we discuss in
the next subsection.
B. Model-based Reinforcement Learning
As discussed in Section I, traditional RL methods based on Q-value tables or similar representations need a large number of data points to converge. Deep reinforcement learning methods use DNNs to improve the convergence rate, but they also require considerable effort either to tune the hyperparameters to trade off between expressiveness and convergence rate or to gather enough data to feed into the neural networks, which may be costly or even impossible in some conditions. In addition, the incomprehensible and non-adjustable deep neural model is a major barrier to making those kinds of models practical in system operations [12]. In our work, we use LinUCB [5], which assumes a linear relationship between the state and the reward and has been proven effective under the contextual MAB assumptions even when the process is non-stationary.
The LinUCB method fits into the parallelism configuration
problem well based on our two observations: (i) the parallelism
configuration MDP (defined in Section III-A) can be reduced
to a contextual MAB, and (ii) we can define a reward function with a linear relationship between the reward and the parallelism configuration, since most of the common objectives (e.g., latency, throughput) are linear (or can be transformed to be linear) in the parallelism configuration. Next, we discuss the above two observations in detail.
The major difference between the MDP and the contextual
MAB is based on whether the agent considers the state
transitions to make the decision. From the theory of M/M/1
queue [10], we can see that for each time step, the state
transition is only dependent on the arrival rate λ, the service
rate µ, and the initial state of the time step, ω, which is the
initial length of the queue. Therefore, if we have the above
variables in a particular state, we can get the state transition
probability for any possible states in the next time steps. If
the distributions of the arrival process and service process are
stationary, the reward (determined by any QoS metrics) can
be determined by the current state and action regardless of
the trajectory of the previous states. It also means that the
decision of the action can be made based on the current state
instead of the trajectory. The above observation is intuitive
when there is only one operator. If there are multiple connected operators organized as a DAG, the problem is significantly more complex. However, instead of connecting the queuing models of the operators to build a queuing network, we can break the objective (reward) function of the overall application into an individual objective (reward) for each operator using a heuristic based on the message passing model (as discussed in Section II-B). For each operator, we can then safely use the LinUCB algorithm to fit the queuing model with the given objectives, which also reduces the possible action space for the RL method (from exponential to linear).
For the second observation, namely the linear relationship between the state and the reward, we begin analyzing it using a single queuing model. To keep it simple, we omit the time step notation $t$ in the following discussion. Consider an operator $i$ with only one instance. We then have the arrival rate $\lambda_i$, the service rate $\mu_i$ (for one instance), and the queue length $\omega_i$. We assume that the speedup of the operator relative to the single-parallelism case obeys Gustafson's law [13] with a parameter $\rho_i$ that defines the portion of the operation that can benefit from increasing the resource usage. Here $\mu_i(k_i, \rho_i)$ is the estimated service rate when the parallelism is $k_i$ and the parallel portion is $\rho_i$, which can be estimated by:
$$\mu_i(k_i, \rho_i) = (1 - \rho_i + \rho_i k_i)\,\mu_i \quad (5)$$
Without loss of generality, we estimate the latency (response time) distribution in a time step as an example, which can be modeled as an exponential distribution $Exp(\mu_i(k_i, \rho_i) - \lambda_i)$ plus an estimated upper bound on the processing time of the queued tuples, $\omega_i\, Exp(\mu_i(k_i, \rho_i))$ (not tight). Therefore, combining with Equation 2, the latency upper bound can be estimated as:
$$\bar{l}_i(\lambda_i, \mu_i, \omega_i) = \frac{\ln 20}{\mu_i(k_i, \rho_i) - \lambda_i} + \omega_i \frac{\ln 20}{\mu_i(k_i, \rho_i)} \quad (6)$$
With the given parallel portion $\rho_i$ and the average processing rate $\mu_i$, the overall processing rate of operator $i$ with $k_i$ instances is proportional to the number of instances, $k_i$. If throughput is part of the reward, it will therefore have a linear relationship with the parallelism. For latency, in Equation 6, the operator starts in a state where the queue length is zero, $\omega_i = 0$, and the first part of the equation is inversely proportional to the processing rate if the arrival rate $\lambda_i$ is fixed. Therefore, through a simple transformation (e.g., setting $x_1 = 1/(\mu_i(k_i, \rho_i) - \lambda_i)$), we can obtain a linear relationship between the latency upper bound and the parallelism. For the other metrics, such as throughput, queue length, and resource utilization, we can analyze the relationship with the parallelism and obtain similar results. Through similar simple transformations, the linear relationship between the metrics (which represent the states in RL methods) and the parallelism (number of instances) can be obtained.
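The following sketch (ours, with hypothetical numbers) evaluates Equations 5 and 6 for a candidate parallelism and shows the feature transformation that makes the first latency term linear for the bandit model.

import math

def service_rate(mu_base, k, rho):
    """Equation 5 (Gustafson-style speedup): mu(k, rho) = (1 - rho + rho * k) * mu_base."""
    return (1.0 - rho + rho * k) * mu_base

def latency_bound(lam, mu_base, k, rho, omega):
    """Equation 6: queueing-delay term plus the time to drain the omega queued tuples."""
    mu_k = service_rate(mu_base, k, rho)
    assert mu_k > lam, "parallelism too low: the operator would be unstable"
    return math.log(20) / (mu_k - lam) + omega * math.log(20) / mu_k

# hypothetical operator: 10 tuples/s per instance, 90% parallelizable,
# arrival rate 100 tuples/s, 50 tuples already queued, candidate parallelism 12
print(latency_bound(lam=100.0, mu_base=10.0, k=12, rho=0.9, omega=50))  # ~1.7 s

# linear feature used by the bandit: x1 = 1 / (mu(k, rho) - lambda)
x1 = 1.0 / (service_rate(10.0, 12, 0.9) - 100.0)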
Based on the two observations, we apply LinUCB as an RL
agent to decide the parallelism configuration for an operator
and pass the messages between the connected operators in the
DAG. Using the notation of Section III-A, we assume that the expected reward of an action (parallelism configuration) is linear in its $d$-dimensional state $s_{t,k_i}$ with some unknown coefficient vector $\theta^*_{k_i}$. Therefore, the linear relationship between the reward and the state can be described as:
$$\mathbb{E}[r_{t,k_i} \mid s_{t,k_i}] = s_{t,k_i}^{\top} \theta^*_{k_i} \quad (7)$$
As described in LinUCB [5], a ridge regression is used to fit the linear model with the training data to get an estimate of the coefficients $\hat{\theta}_{k_i}$ for each action $k_i$ at each time step $t$. We omit the detailed steps of the LinUCB algorithm and refer the interested readers to the original paper [5]. Here, we only discuss the action selection policy of LinUCB, which can be represented as:
$$k_{t,i} = \arg\max_{k_i \in [1, K_{max}]} \left( s_{t,k_i}^{\top}\hat{\theta}_{k_i} + \alpha\sqrt{s_{t,k_i}^{\top}\left(D_{k_i}^{\top}D_{k_i} + I_d\right)^{-1} s_{t,k_i}} \right) \quad (8)$$
where $\alpha$ is a constant, $D_{k_i}$ is a design matrix of dimension $m \times d$ at time step $t$, whose rows correspond to the $m$ training inputs (states), and $I_d$ is a $d \times d$ identity matrix. From the above equation, we can see that the action selection of LinUCB considers both the current knowledge obtained from the previous trials, in $s_{t,k_i}^{\top}\hat{\theta}_{k_i}$, and the uncertainty (UCB) of the action-reward distribution, in the second part of Equation 8. This is the reason why LinUCB is able to trade off between exploration and exploitation.
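Below is a minimal sketch of the standard disjoint LinUCB update and the selection rule of Equation 8 (our illustration, not the authors' code; the per-arm matrices are the usual LinUCB sufficient statistics $A_k = D_k^{\top}D_k + I_d$ and $b_k = D_k^{\top}r_k$, and the dimensions are assumed).

import numpy as np

d, K_max = 3, 8                                    # state dimension, max parallelism (assumed)
A = {k: np.eye(d) for k in range(1, K_max + 1)}    # D_k^T D_k + I_d per arm
b = {k: np.zeros(d) for k in range(1, K_max + 1)}  # D_k^T r_k per arm

def select(states, alpha=1.0):
    """Equation 8: pick the parallelism with the highest upper confidence bound.
    states maps each candidate parallelism k to its d-dimensional context vector."""
    best_k, best_ucb = None, -np.inf
    for k, s in states.items():
        A_inv = np.linalg.inv(A[k])
        theta_hat = A_inv @ b[k]                              # ridge-regression estimate
        ucb = s @ theta_hat + alpha * np.sqrt(s @ A_inv @ s)  # exploitation + exploration
        if ucb > best_ucb:
            best_k, best_ucb = k, ucb
    return best_k

def update(k, s, reward):
    """Standard LinUCB sufficient-statistic update for the chosen arm."""
    A[k] += np.outer(s, s)
    b[k] += reward * s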
Algorithm 1 Model-based LinUCB pre-train
1: procedure PRETRAIN(G_dag) → Θ
2:   q is initialized as an empty queue
3:   O are the sources of G_dag (in-degrees are zero)
4:   for o ∈ O do
5:     q.append((o, λ_o))
6:   while q is not empty do
7:     i, λ_i = q.pop()
8:     θ_i = TRAIN(i, λ_i)
9:     add trained parameters θ_i to the output Θ
10:    for all downstream operators i′ of i do
11:      λ_i′ = λ_i′ + ψ_i λ_i
12:      remove edge (i, i′) from G_dag
13:      if indegree(i′) == 0 then
14:        q.append((i′, λ_i′))
15: procedure TRAIN(i, λ_i) → θ_i
16:   initialize the model parameters θ_i
17:   while not terminated and not converged do
18:     k_{t,i} = selection(θ_i, s_{t−1,k_i}) by Equation 8
19:     r_{s_t,k_{t,i}}, s_{t,k_i} = simulate(λ_i, µ_i, k_{t,i})
20:     θ_i = updateLinUCB(θ_i, r_{s_t,k_i}, s_{t,k_i}, k_{t,i})
21:     reset operator i's state to the initial state
With the above analysis, we can see that if LinUCB is
directly used to set the parallelism for one operator, it can be
efficient as the uncertainty of the operator can be captured by
the linear model (e.g., the parallel portion, the base processing
rate). However, on one hand, it has a cold start phase which
needs multiple rounds to get enough data for each possible
action to reach a reasonable performance level. On the other
hand, in a stream processing DAG, the operators are connected
to each other and the overall performance of the application
may vary due to different bottlenecks. Given the DAG, it is
a challenging problem to determine how to relate the overall
performance of the application to the metrics of every operator
in it. Therefore, instead of directly optimizing the overall
application, we use the objective function split as discussed
in Section II-B to only deal with the optimization for one
operator for each RL agent. In addition, we use the queuing
model-based simulation to validate the configuration to set
the initial parameters for the LinUCB model. The simulation
also gives additional benefits. On one hand, we can assume
different distributions for the arrival rate and service rate
that can support arbitrary G/G/1 queuing models, which is
evaluated in the experiment in Section V. On the other hand,
the simulator can work in different modes to either generate a
lot of synthetic data to directly feed into the model or interact
with the RL agent as a simulation environment, which can
fit into more RL algorithms. In the simulation, instead of
trying all the combinations of the parallelism configuration
of the operators at the same time, we gradually train the
model for each operator by a topological order [14] of the
DAG to ensure that the upstream operators’ configuration
is fixed before the downstream operator’s model is trained.
The simulation process is shown in Algorithm 1. The reward
function for each operator $i$ is defined by using the Simple Additive Weighting (SAW) technique [15]:
$$r_i(s_{t,k_i}) = w^{lat} r_i^{lat} + w^{que} r_i^{que} + w^{res} r_i^{res} \quad (9)$$
where $s_{t,k_i}$ is the state of time slot $t$ with parallelism $k_i$, and $r_i^{lat}$, $r_i^{que}$, $r_i^{res}$ are the reward functions for latency, throughput, and resource usage based on the application's requirements. To balance the optimization for latency, queue length (the gap between throughput and input rate) and resource usage, we add $w^{lat}$, $w^{que}$, $w^{res}$ as the weights for each component, with $w^{lat} + w^{que} + w^{res} = 1$. For different requirements, the reward function can be set in different forms. For example, if the application requires deadline-awareness and has a strict latency bound, we can set $r_i^{lat} = -1$ when $\bar{l}_i \geq l_i^{max}$ and zero otherwise. If the application's utility is linear in the latency, we can set $r_i^{lat} = -\bar{l}_i / l_i^{max}$, which decreases as the latency increases. Without loss of generality, we define the reward function by setting $r_i^{lat} = -1$ if $\bar{l}_i \geq l_i^{max}$ and zero otherwise, $r_i^{que} = -1$ if $\omega_i \geq \omega_i^{max}$ and zero otherwise, and $r_i^{res} = -k_i / k_i^{max}$. In the definition above, both the latency and throughput penalties
have a bounded reward. The reward of resource usage is linear
in terms of the number of instances running. To eliminate the
impact of the state transition that the model has experienced through a previously selected bad action (e.g., a bad parallelism configuration may leave too many tuples waiting for processing and hence influence the subsequent states), we reset the state of the simulation each time we update the model with one parallelism configuration, in line 21 of Algorithm 1.
Therefore, for each sample of the model, the state of the
operator will start from the same initial state so that each
sample will not be influenced by the previous sample’s state.
In the real environment, we first apply the trained param-
eters to each model as shown in Algorithm 2. Then, for
each operator, the controller decides the parallelism from the
metrics gathered from the system similar to the steps in the
pre-train iteration. The heterogeneity is captured by the linear
model in LinUCB. If the operator is migrated from one node
to another, it only influences the processing rate distribution µi
(when the other latencies are already appropriately optimized).
We evaluate this experimentally in Section V.
In the next section, the implementation is discussed for the
Algorithm 2 MBLinUCB
1: procedure MBLINUCB(G_dag)
2:   initialize the model parameters θ_i for each operator i ∈ V_dag
3:   Θ = PRETRAIN(G_dag)
4:   decide the initial k_0 = {k_{0,i} | i ∈ V_dag} from Θ
5:   submit G_dag with k_0 to the stream processing engine
6:   while G_dag is not terminated do
7:     gather the metrics in time slot t as s_t = {s_{t,k_i} | i ∈ V_dag}
8:     for each operator i do
9:       θ_i = updateLinUCB(θ_i, r_{s_t,k_i}, s_{t,k_i}, k_{t−1,i})
10:      k_{t,i} = selection(θ_i, s_{t,k_i}) by Equation 8
11:      if k_{t,i} ≠ k_{t−1,i} then
12:        submit the parallelism change k_{t,i} to the stream processing engine
Fig. 3. System Architecture Overview
above methods in a real-world DSP engine.
IV. IMPLEMENTATION
We implement a prototype of the proposed method using
Apache Storm (v2.0.0) [2]. Though the proposed algorithms can be implemented on other DSP engines such as Apache Flink [3], we chose Apache Storm for the implementation due to its widespread use in data science applications [16] and because it has the lowest overall latency [17] among the leading stream processing engines. We use Apache Storm to deploy
the DSP application which runs on the distributed worker
nodes managed by the Storm framework. In Storm, the DSP
application can be represented as a DAG topology that is
used to schedule and optimize the application. However, when
we actually deploy the application, it has an execution plan
which can be seen as an extension of the DAG topology. The
execution plan replaces each operator with its tasks. A task
represents an instance of an operator and is in charge of a
partition of the incoming tuples of the operator. In addition,
one or more tasks are grouped into executors, implemented
as threads as shown in Figure 3. Storm can process a large
amount of tuples in parallel by running multiple executors. The executors are handled by the worker process in Storm, a Java process acting as a container that configures a number of parameters including the maximum heap memory that can be used. The parallelism is configured by the number
of executors allocated to an operator. When the number of
the executors reaches the number of tasks, the operator gets
its maximum parallelism. The number of executors can be re-
configured without restarting the application (but the executors
need to be restarted to redistribute the tasks) by the re-
balancing tool provided by Storm.
The implementation of our algorithms in Storm is straight-
forward. As illustrated in Figure 3, we implement a centralized
controller of the application in python. The controller is
implemented using the gym environment [18] interface which
can be directly used on most of the RL libraries. The interface
requires the environment to provide several functionalities,
which include step(), reset(), close(), etc. Here,
the most important interface is the step() interface, which
takes in the action for the time step and returns a four-
tuple including the observation (state), the immediate reward,
the end of episode signal, and the auxiliary diagnostic in-
formation. Based on the above interface, we implement the
DSP controller to control the parallelism configuration based
on the action the algorithms calculated in each time step.
Additionally, the controller also takes the responsibility of
monitoring the status of the DSP application by capturing the
metrics from the output of the application, each physical node,
and each instance of the operators.
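As a rough sketch of how such a controller can expose the gym step() interface (this is not the authors' implementation: the metric collection and reward are stubbed, and the rebalance invocation assumes Storm's standard CLI flags; topology and operator names are placeholders):

import subprocess
import gym
from gym import spaces

class ParallelismEnv(gym.Env):
    """Wrap one operator's parallelism control as a gym environment."""

    def __init__(self, topology, operator, k_max=8):
        self.topology, self.operator = topology, operator
        self.action_space = spaces.Discrete(k_max)        # action a means parallelism a + 1
        self.observation_space = spaces.Box(0.0, float("inf"), shape=(3,))

    def step(self, action):
        k = action + 1
        # apply the new parallelism via Storm's rebalance tool (flags per the Storm CLI docs)
        subprocess.run(["storm", "rebalance", self.topology,
                        "-w", "10", "-e", f"{self.operator}={k}"], check=True)
        state = self._collect_metrics()    # e.g. latency bound, queue length, input rate
        reward = self._reward(state, k)    # Equation 9 over the collected metrics
        return state, reward, False, {}    # long-running job: the episode never ends

    def reset(self):
        return self._collect_metrics()

    def _collect_metrics(self):
        raise NotImplementedError("poll the Storm metrics consumer / UI here")

    def _reward(self, state, k):
        raise NotImplementedError("combine the QoS penalties as in Equation 9")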
By wrapping the controller of the application, we extended
the algorithm (LinUCB) in RLlib [19] to implement the pro-
posed algorithms, which can directly use the gym environment
to interact with the DSP application (shown as interaction loop
in Figure 3). Therefore, when the DSP application is submit-
ted, a controller is created and based on the algorithm chosen
for configuring the parallelism, an RL agent (or multiple RL
agents) is created and attached to the controller. Additionally, the pre-training environment implements the same gym interface, so it can directly interact with the RL agent. With our MBLinUCB method, we tune the parallelism of each operator using a dedicated RL agent, and hence it is possible to distribute the agents and attach each one to its operator to make the decision. In that way, an agent does not need to be placed with the controller and can be deployed anywhere near the operator to make the decision without significantly degrading performance.
V. EVALUATION
We evaluate the proposed techniques against several state-of-the-art RL methods. We use both simulation and real
testbed experiments to study the behavior of the RL algorithms
when they are used in optimizing the parallelism configuration
of DSP applications.
A. Experimental setup
We describe the experimental setup for the simulation
environment and real test-bed environment separately.
For the simulation environment, we implement a DSP ap-
plication simulator by extending the queue and load balancing
environments provided in Park project [20] and make it com-
patible with the gym environment as discussed in Section IV.
The default setup of the simulation is shown in Table I.
TABLE I
DEFAULT SIMULATION PARAMETER SETUP
$K_{max}$: 64 | $w^{lat}, w^{que}, w^{res}$: 1/3
Average input rate: 100 tuples/s | $\bar{l}^{max}$: 1000 ms
$\mu_i$: 10 tuples/s | Time step interval: 10 s
In the
simulation experiment, we test three different datasets: (i) a
synthetic Poisson distribution dataset with default arrival rate
$\lambda = 100$/s, (ii) a synthetic Pareto distribution dataset with the shape parameter $\alpha = 2.0$ and the scale parameter $x_m = 50$ (so that the average input rate is also 100 tuples/s by default
Fig. 4. NY Taxi Profitable Area Application
from the Pareto distribution), and (iii) a trace-driven dataset,
which is made available by Chris Whong [21] that contains
information about the activity of the New York City taxis. Each
data point in the dataset corresponds to a taxi trip including
the timestamp and location for both the pick-up and drop-off
events. As the data is too sparse (around three hundred tuples
per minute) to be used in stream processing experiments, we
speed up the input rate of the dataset by sixty times, which
means that the tuples arriving in one minute in the original
dataset will arrive in one second in the experiment.
In the real testbed experiments, we implement the DSP
application for the 2015 DEBS Grand Challenge (http://www.
debs2015.org/call-grand-challenge.html) to calculate the most
profitable areas for each time window. The dataset used is
the New York City taxis mentioned above and the data rate is
also sped up by sixty times. We deploy a testbed on CloudLab [22] with nodes organized in three tiers. We use ten xl170 servers in the CloudLab cluster and simulate the three-tier architecture on an OpenStack cluster. The third
tier contains fourteen m1.medium instances (2 vCPUs and 4
GB memory) that act as the smart gateways with relatively
low computing capacity corresponding to the leaf nodes of
the architecture. The second tier has five m1.xlarge instances
(8 vCPUs and 16 GB memory) and each of them functions
as a micro datacenter. The first tier contains one m1.2xlarge
instance (16 vCPUs and 32 GB memory) acting as the
computing resource used in the cloud datacenter. The network bandwidth, latency and topology are configured by dividing virtual LANs (local area networks) among the nodes and adding policies to the ports of each node, enforced using the Neutron module of OpenStack and the Linux traffic control (tc) tool.
TABLE II
DEFAULT PARAMETER SETUP FOR REAL TESTBED
$K_{max}$: 8 | $w^{lat}, w^{que}, w^{res}$: 1/3
Average input rate: 4500 tuples/s | $\bar{l}^{max}$: 2000 ms
Time step interval: 60 s
We deploy the Storm Nimbus service (acting as the master node) on an m1.2xlarge instance and a Storm Supervisor service (acting as the slave node) on each node. For the checkpoint store, we use a single-node Redis service placed on the master node. The default network bandwidth is set to 100 Mbps with 20 ms latency between the gateways and the micro datacenters, and to 100 Mbps with 50 ms latency between the cloud datacenter and the micro datacenters.
each smart gateway to emulate the input stream. The input
stream comes to an MQTT (Message Queuing Telemetry
Transport) service deployed on each smart gateway. The
dataset is replicated and replayed on each smart gateway and
the average input rate is around 4500 tuples/s overall. The
Fig. 5. Results of simulation with Synthetic Dataset (Poisson distribution): (a) Reward, (b) 95th percentile Latency, (c) Throughput
Fig. 6. Rewards of simulation with Synthetic Dataset (Pareto distribution ($\alpha = 2.0$))
Fig. 7. Rewards of simulation on the New York taxi trace
default parameters used in the real testbed experiments are
listed in Table II.
B. Application and Operator Placement
To comprehensively evaluate the proposed algorithm, we
choose a smart city application that ranks the profitability of
the areas for taxis in New York city. As shown in Figure 4,
there are seven operators: (i) source and mapper, which con-
sume the input stream from the MQTT service and transform
the raw tuple to the data type that can be understood by the
system, (ii) taxi aggregator, which aggregates the trips by the
taxi IDs in time windows, (iii) taxi counter, which counts the
number of taxis in a particular area in time windows, (iv) profit
aggregator, which aggregates the profits by the pickup area in
time windows, (v) joiner, which joins the profit and number
of taxis to calculate the profitability of a particular area, (vi)
ranking, which sorts the profitability of the area, (vii) sink,
which stores the results of the most profitable areas into a
database for further usage. We have optimized the placement of the application by placing each operator on one of the three tiers based on its selectivity. The data source, which consumes the input tuples from the MQTT services, is placed on the same node (one of the gateways) where the MQTT service is placed. The heavy-load aggregators are placed in the micro datacenters. The join, ranking and sink operators are placed in the cloud datacenter. It is worth noting that because the windowed aggregators (taxi and profit aggregators) handle most of the workload, only those two operators can become bottlenecks in the overall stream processing application. Therefore, in the real testbed experiments, we only consider the scaling of those two operators.
C. Algorithms
In our experiment evaluation, different mechanisms are
measured and compared: (i) PPO, which is a policy gradient
method for RL [23], (ii) A3C, which is the asynchronous
version of the actor-critic methods [24], (iii) DQN, which is a
method based on DNN to learn the policy by Q-learning [25],
Fig. 8. Evaluation of applicability for heterogeneous resources (Poisson distribution): (a) Reward, (b) Action (parallelism)
(iv) LinUCB, which is a MAB method based on a linear model
to approximate the reward distribution and it uses UCB to
select the action [5], and (v) MBLinUCB, which is the method
proposed in this work. All the methods use the default hyper-
parameters configured in RLlib. For the proposed method, we
generate ten thousand data points to initialize the parameters
in the linear models in the MBLinUCB method as described
in Section III.
D. Simulation Experiment Results
We first evaluate the performance of the algorithms in the
simulation environment. As shown in Figure 5, we compare
the algorithms with a synthetic Poisson distribution workload.
We can see that our method converges faster than the other
methods and it only needs three thousand time-steps to reach
an average reward of -0.3. It also starts from a relatively good
initial position above -0.4 compared to -0.5 in the LinUCB
method. With respect to latency and throughput metrics, our
method and LinUCB perform better than the others. However,
the latency of MBLinUCB converges from around one thousand milliseconds, which is much higher than that of the LinUCB method. This is because MBLinUCB initializes its linear model with the data generated from the environment model, so it starts from a configuration that just meets the upper-bound latency requirement (1000 ms) with the minimum parallelism needed.
As shown in Figure 6, we compare the algorithms with
another synthetic workload from a Pareto distribution (for each
time slot, the input rate is sampled from a Pareto distribution).
The workload is used to test the performance of the algorithms
in conditions when the workload has significant fluctuations
while executing the application. In Figure 6, we can see similar
results as in Figure 5. The proposed method performs better
than the other methods. We note that even though LinUCB does not converge to as good a result as MBLinUCB, with enough iterations (around thirty thousand timesteps), A3C can get results similar to MBLinUCB. This is expected as A3C uses
Fig. 9. Evaluation of applicability for heterogeneous operators (Poisson distribution): (a) Reward, (b) Action (parallelism)
the actor-critic method to improve the sample efficiency so
that it performs better than the other RL methods that also
rely on DNN. In the next set of experiments, we study the
performance of the algorithms using a real trace as shown in
Figure 7. We can see that our method converges to an average
reward greater than -0.3 in less than twenty thousand time
steps. However, LinUCB needs more than sixty thousand time
steps, and the other methods require even more time steps.
In the next two sets of experiments shown in Figure 8
and Figure 9, we evaluate the impact of heterogeneity in
the available resources and operators. As we can see in
Figure 8a, for different operator processing rates, which may be influenced by the characteristics of the operator or the power of the underlying server, MBLinUCB can converge to a good reward range within a few episodes. We can see that LinUCB converges to a lower reward than MBLinUCB when the service rate is ten (noted as op-10 in the figure). The above results can be explained by comparing them with the results in Figure 8b. We can see that all the conditions converge to a small range of actions at the end of 50 episodes. However, compared to MBLinUCB, LinUCB converges to a larger parallelism, so it has a higher resource usage penalty and thus a lower reward compared to our method. As shown in Figure 9a, we see similar results for different operator parallel portions (i.e., what fraction of the operator's processing can be parallelized): our mechanism converges within a limited number of episodes. We also note that LinUCB converges to a lower reward with a larger parallelism, as shown in Figure 9b.
E. Real Testbed Experiment Results
We evaluate the performance of LinUCB and our method
in the real testbed with a real stream processing application
and a real dataset. As shown in Figure 10a, we can see that the proposed method converges faster as it has a better initial configuration. It only takes fifteen time steps (of one minute each) to reach a reward of more than -0.3. For the latency analysis, we illustrate the latency upper bound (95th percentile of the latency distribution) in Figure 10b. As shown in the results, our method starts from a latency that already meets the requirement (the latency upper bound is less than two seconds) and is stable during the experiments. However, the original LinUCB method starts from a very high latency (more than five seconds; we cap the latency metric at five seconds if it is larger than that) and then it
Fig. 10. Real Testbed Results: (a) Reward, (b) 95th percentile latency
gradually improves the average latency upper bound to three
seconds. Our MBLinUCB already meets the latency bound
requirements at the initial state and tries to improve it by
exploring the possible parallelism configurations in the real
environment.
VI. RELATED WORK
Over the last few years, the developments in the Big
Data ecosystem have raised higher requirements for scal-
able stream processing engines. Several open-source stream
processing frameworks have been developed. Key examples
include Flink [3] and Storm [2]. There have been several
efforts in recent years to optimize stream processing in edge
computing environments. Xu et al. [9] proposed Amnis to
improve the data locality of edge-based stream processing by
considering resource constraints in an offline manner. In [26],
Xu et al. address fault tolerance aspects of edge-based stream
processing using a hybrid backup mechanism. The techniques
proposed in [9], [26] can be used to generate the initial plan
used in our work. To dynamically scale the application, several
different approaches have been developed including techniques
for re-configuring the execution graphs of the application or
adjusting the parallelism by increasing the number of instances
of certain operators. Cardellini et al. [8] proposed an elastic
stream processing framework based on an ILP (Integer Linear Programming) model that reconfigures the stream processing application by making decisions on operator migration and scaling.
However, given the heterogeneous nature of edge computing
systems, these techniques may require substantial manual
effort to tune the parameters to achieve a self-stabilizing status.
There have also been several efforts on developing techniques
to manage stream processing applications using RL methods.
For example, Li et al. [27] proposed a model-free method
to schedule the stream processing application based on an
actor-critic RL method [24]. Ni et al. [28] developed a resource allocation mechanism based on a GCN (Graph Convolutional Network)-based RL method to group operators onto different nodes. However, these DNN-based methods suffer from long training periods and low sampling efficiency, and they need a large amount of data to build the model. There have
been several efforts on leveraging the traditional RL method
(such as Q-learning) to deal with the problem. For instance, Russo et al. [29] used FA (Function Approximation)-based TBVI (Trajectory Based Value Iteration) to improve the sample efficiency of traditional RL methods (such as Q-learning) when making operator scaling decisions in heterogeneous environments. However, to reduce the action space, that work defines the action as increasing or decreasing an operator's parallelism by only one instance. This increases the convergence trajectory length and therefore, even when the model is trained well, it may still take many steps to reach the optimal state, which also incurs a high reconfiguration cost. In contrast to these
existing works, our model-based method automatically finds
tradeoffs between exploration and exploitation by using the
UCB-based RL method. In addition, the proposed approach
further improves the sample efficiency by utilizing a queuing
model-based simulation to generate more data to pre-train the
model. Based on these features, the method proposed in this
paper achieves a significantly higher convergence rate and
cumulative rewards during long runs compared to the existing
methods.
VII. CONCLUSION
In this paper, we proposed a learning framework that achieves elasticity for stream processing applications deployed at the edge by automatically tuning the application to meet the Quality of Service requirements. The method adopts a reinforcement learning (RL) approach to configure the parallelism of the operators in the stream processing application. We model
the elastic parallelism configuration for stream processing in
edge computing as a Markov Decision Process (MDP), which
is then reduced to a contextual Multi-Armed Bandit (MAB)
problem. Using an Upper Confidence Bound (UCB)-based RL method, the proposed approach significantly improves the sample efficiency and the convergence rate compared to traditional random exploration methods. In addition, the use of
model-based pre-training in the proposed approach results in
substantially improved performance by initializing the model
with appropriate and well-tuned parameters. The proposed
techniques are evaluated using realistic workloads through
both simulation and real testbed experiments. The experiment
results demonstrate the effectiveness of the proposed approach
in terms of cumulative reward and convergence speed.
VIII. ACKNOWLEDGEMENT
This work is partially supported by an IBM Faculty award.
REFERENCES
[1] “The internet of things: A movement, not a market,”
https://ihsmarkit.com/Info/1017/Internet-of-things.html, accessed
November. 2, 2020.
[2] “Apache storm,” https://storm.apache.org/, accessed July. 16, 2021.
[3] “Apache flink,” https://flink.apache.org/, accessed July. 16, 2021.
[4] C. Jasper, “The hidden costs of delivering iiot services: Industrial
monitoring & heavy equipment,” 2016.
[5] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit
approach to personalized news article recommendation,” in Proceedings
of the 19th international conference on World wide web, 2010, pp. 661–
670.
[6] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal,
J. M. Patel, K. Ramasamy, and S. Taneja, “Twitter heron: Stream
processing at scale,” in Proceedings of the 2015 ACM SIGMOD In-
ternational Conference on Management of Data, 2015, pp. 239–250.
[7] P. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh, and
M. Seltzer, “Network-aware operator placement for stream-processing
systems,” in Data Engineering, 2006. ICDE’06. Proceedings of the 22nd
International Conference on. IEEE, 2006, pp. 49–49.
[8] V. Cardellini, F. Lo Presti, M. Nardelli, and G. Russo Russo, “Optimal
operator deployment and replication for elastic distributed data stream
processing,” Concurrency and Computation: Practice and Experience,
vol. 30, no. 9, p. e4334, 2018.
[9] J. Xu, B. Palanisamy, Q. Wang, H. Ludwig, and S. Gopisetty, “Amnis:
Optimized stream processing for edge computing,” to appear in Journal
of Parallel and Distributed Computing, 2021.
[10] W. J. Stewart, Probability, Markov chains, queues, and simulation.
Princeton university press, 2009.
[11] H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus, “An adaptive sampling
algorithm for solving markov decision processes,” Operations Research,
vol. 53, no. 1, pp. 126–139, 2005.
[12] Z. Meng, M. Wang, J. Bai, M. Xu, H. Mao, and H. Hu, “Interpreting
deep learning-based networking systems,” in Proceedings of the Annual
conference of the ACM Special Interest Group on Data Communica-
tion on the applications, technologies, architectures, and protocols for
computer communication, 2020, pp. 154–171.
[13] J. L. Gustafson, “Reevaluating amdahl’s law,” Communications of the
ACM, vol. 31, no. 5, pp. 532–533, 1988.
[14] A. B. Kahn, “Topological sorting of large networks,” Communications
of the ACM, vol. 5, no. 11, pp. 558–562, 1962.
[15] K. P. Yoon and C.-L. Hwang, Multiple attribute decision making: an
introduction. Sage publications, 1995.
[16] “Ranking popular distributed computing packages for data science,” ac-
cessed November. 2, 2020. [Online]. Available: https://www.kdnuggets.
com/2018/03/top-distributed-computing-packages-data-science.html
[17] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holder-
baugh, Z. Liu, K. Nusbaum, K. Patil, B. J. Peng et al., “Benchmarking
streaming computation engines: Storm, flink and spark streaming,” in
2016 IEEE international parallel and distributed processing symposium
workshops (IPDPSW). IEEE, 2016, pp. 1789–1792.
[18] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schul-
man, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint
arXiv:1606.01540, 2016.
[19] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg,
J. Gonzalez, M. Jordan, and I. Stoica, “Rllib: Abstractions for distributed
reinforcement learning,” in International Conference on Machine Learn-
ing. PMLR, 2018, pp. 3053–3062.
[20] H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang, R. Marcus,
R. Addanki, M. Khani Shirkoohi, S. He et al., “Park: An open plat-
form for learning-augmented computer systems,” Advances in Neural
Information Processing Systems 32 (NIPS 2019), 2019.
[21] C. Whong, “Foiling nyc’s taxi trip data,” FOILing NYCs Taxi Trip Data.
Np, vol. 18, 2014.
[22] D. Duplyakin, R. Ricci, A. Maricq, G. Wong, J. Duerig, E. Eide,
L. Stoller, M. Hibler, D. Johnson, K. Webb, A. Akella, K. Wang,
G. Ricart, L. Landweber, C. Elliott, M. Zink, E. Cecchet, S. Kar, and
P. Mishra, “The design and operation of CloudLab,” in Proceedings of
the USENIX Annual Technical Conference (ATC), Jul. 2019, pp. 1–14.
[Online]. Available: https://www.flux.utah.edu/paper/duplyakin-atc19
[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-
imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,
2017.
[24] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein-
forcement learning,” in International conference on machine learning.
PMLR, 2016, pp. 1928–1937.
[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier-
stra, and M. Riedmiller, “Playing atari with deep reinforcement learn-
ing,” arXiv preprint arXiv:1312.5602, 2013.
[26] J. Xu, B. Palanisamy, and Q. Wang, “Resilient stream processing in
edge computing,” in 2021 IEEE/ACM 21st International Symposium on
Cluster, Cloud and Internet Computing (CCGrid). IEEE, 2021, pp.
504–513.
[27] T. Li, Z. Xu, J. Tang, and Y. Wang, “Model-free control for distributed
stream data processing using deep reinforcement learning,” Proceedings
of the VLDB Endowment, vol. 11, no. 6, pp. 705–718, 2018.
[28] X. Ni, J. Li, M. Yu, W. Zhou, and K.-L. Wu, “Generalizable resource
allocation in stream processing via deep reinforcement learning,” in
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34,
no. 01, 2020, pp. 857–864.
[29] G. R. Russo, V. Cardellini, and F. L. Presti, “Reinforcement learning
based policies for elastic stream processing on heterogeneous resources,”
in Proceedings of the 13th ACM International Conference on Distributed
and Event-based Systems, 2019, pp. 31–42.