
Model-based Reinforcement Learning for Elastic

Stream Processing in Edge Computing

Jinlai Xu and Balaji Palanisamy

School of Computing and Information, University of Pittsburgh, Pittsburgh, PA 15213, USA

Email: jinlai.xu@pitt.edu, bpalan@pitt.edu

Abstract—Low-latency data processing is critical for enabling next-generation Internet-of-Things (IoT) applications. Edge computing-based stream processing techniques that optimize for low latency and high throughput provide a promising solution to ensure a rich user experience by meeting strict application requirements. However, manual performance tuning of stream processing applications in heterogeneous and dynamic edge computing environments is not only time consuming but also not scalable. Our work presented in this paper achieves elasticity for stream processing applications deployed at the edge by automatically tuning the applications to meet the performance requirements. The proposed approach adopts a learning model to configure the parallelism of the operators in the stream processing application using a reinforcement learning (RL) method. We model the elastic control problem as a Markov Decision Process (MDP) and solve it by reducing it to a contextual Multi-Armed Bandit (MAB) problem. The techniques proposed in our work use Upper Confidence Bound (UCB)-based methods to improve the sample efficiency in comparison to traditional random exploration methods such as the ϵ-greedy method. They achieve a significantly improved rate of convergence compared to other RL methods through the innovative use of MAB methods to deal with the tradeoff between exploration and exploitation. In addition, the use of model-based pre-training results in substantially improved performance by initializing the model with appropriate and well-tuned parameters. The proposed techniques are evaluated using realistic and synthetic workloads through both simulation and real testbed experiments. The experiment results demonstrate the effectiveness of the proposed approach compared to standard methods in terms of cumulative reward and convergence speed.

I. INTRODUCTION

The proliferation of Internet-of-Things (IoT) devices is rapidly increasing the demand for efficient processing of low-latency stream data generated close to the edge of the IoT network. IHS Markit forecasts that the number of IoT devices will increase to more than 125 billion by 2030 [1]. Edge computing complements traditional cloud computing solutions by moving operations from remote datacenters to computing resources at the edge of the network that are close to the IoT end-devices.

A key objective in enabling low-latency edge computing is

to minimize the volume of data that needs to be transported

and reduce the response time for the requests. Stream data

processing is an integral component of low-latency data analytics systems, and several open-source systems (e.g., Apache Storm [2] and Apache Flink [3]) provide efficient solutions

for processing data streams. These systems optimize the

performance of stream data processing for achieving high

throughput and low (or bounded) response time (latency)

for the stream queries. However, when applied in an edge computing environment, these techniques incur a huge operational cost. In addition, the performance of each application needs

to be manually tuned and further reconﬁgured for varying

workload conditions and changing execution environments.

Recent studies indicate that administrative labor accounts for 20-50% of the overall operational cost of deploying an IoT

application [4]. Therefore, automated tuning can drastically

decrease the cost of deploying and maintaining stream pro-

cessing applications in edge-based computing systems.

Due to the inherent difficulty of predicting the highly

dynamic changes in an edge computing environment, it is

important to adopt a self-adaptive approach using techniques

such as reinforcement learning (RL) methods that are suitable

for adapting to the changes in the environment. Recently,

RL-based methods were developed in the distributed system

domain to enhance heuristic-based system optimization algo-

rithms and provide more effective solutions to problems for

which heuristic-based solutions are less effective. However,

applying RL methods in the systems domain incurs several

challenges. A key performance factor in effective RL-based

methods is the sample efﬁciency of the method. Currently,

most RL methods are based on deep learning techniques that

use deep neural networks (DNNs) to handle the approximation

of the environment dynamics and the reward distribution. The

use of DNN enhances the models to handle more complex

conditions. However, most of the DNN-based RL algorithms

require a large amount of data to converge a good result

which makes the optimization of the sample efﬁciency even

harder. Therefore, improving the sample efﬁciency is a critical

problem when applying RL-based algorithms for optimizing

distributed systems management.

In this paper, we propose an RL-based algorithm to optimize

the performance of distributed stream processing (DSP) appli-

cations to meet various quality of service (QoS) requirements.

To deploy a stream processing application, the system needs to decide how many resources to allocate to each operator to meet the QoS requirements set by the users. This

is further challenged by the heterogeneous nature of the edge

computing resources and the dynamically changing working

conditions. Based on the recent developments of the RL

algorithms, we model the DSP scaling problem as a contextual

Multi-Armed Bandit (MAB) problem, which is reduced from

the original Markov Decision Process (MDP). With the above

simpliﬁcation, the elastic parallelism conﬁguration can be

efficiently solved using state-of-the-art algorithms that work

well on MAB problems [5]. It can automatically achieve

tradeoffs between exploring the solution space to ﬁnd an

optimal solution (exploration) and utilizing the data gathered

from the previous trials (exploitation). We investigate the use

of the LinUCB [5] algorithm to dynamically decide the parallelism

configuration during the execution of the DSP application, aiming to improve multiple QoS metrics including the end-to-end latency upper bound, throughput, and resource usage.

We further improve the sample efficiency of LinUCB using a model-based method that pre-trains the RL agent with a queuing model simulation to improve the accuracy of the initial parameters. The main contributions of this paper

are summarized as follows:

• We model the elastic parallelism configuration problem for distributed stream processing (DSP) applications in an edge computing environment.

• We formulate the problem as a Markov Decision Process (MDP) and reduce the MDP to a contextual MAB problem, where we apply the LinUCB method to find tradeoffs between exploration and exploitation.

• We propose a model-based learning approach to improve the sample efficiency of LinUCB by generating more data based on a queuing model simulation.

• We evaluate our proposed method, MBLinUCB, against other state-of-the-art RL methods using realistic workloads through both simulation and real testbed experiments. The experiment results demonstrate its effectiveness in terms of cumulative reward and convergence speed.

II. ELASTIC STREAM PROCESSING PROBLEM

DSP applications are often long-running and can experience

variable workloads. Additionally, the highly dynamic edge

computing environments may change the working conditions

of the applications, requiring operators to be migrated between

nodes with different capacities due to failures or mobility

requirements. To bound the performance of the applications

within an acceptable range, it is important to design an elastic

parallelism conﬁguration algorithm to adapt to the changes in

the dynamic edge computing environment. In this section, we

first explain the terminology used in modeling the elastic parallelism configuration problem.

Logic Plan: we assume that there is a stream processing application submitted to the system. The code of the application is translated into a Directed Acyclic Graph (DAG) denoted as G_dag(V_dag, E_dag), shown as the logic plan in Figure 1. Here, the vertices V_dag represent the operators and the edges E_dag represent the streams connecting the operators. We use i ∈ V_dag to denote operator i in the application.

Cluster: we assume that the resources of the cluster are also organized as a graph, G_res(V_res, E_res), where V_res denotes the nodes in the cluster, and E_res indicates the virtual links

connecting the nodes. As shown in Figure 1, in the edge

computing environment, there are multiple tiers of resources

such as the micro datacenters (MDCs) and the smart gateways

which are deployed near the edge of the network that are

used for processing the data locally to provide low-latency computing to the applications.

Fig. 1. Elastic Stream Processing Framework

Thus, it is natural to consider edge computing as a heterogeneous environment with

highly dynamic changes in the execution environment. We

simplify the physical resources as virtual nodes in the graph,

G_res. For example, if a node v represents a micro datacenter (MDC), we group its resource capacity as C_v by considering all the resources we can use in the MDC as a virtual node.

Preliminary Operator Placement: the operator placement is a map between the operators, i ∈ V_dag, and the nodes, v ∈ V_res, in the cluster. We assume that the operator placement for the stream processing application is already provided, as shown in Figure 1. It is denoted as a map X_0 = {x_i^v | i ∈ V_dag, v ∈ V_res}. For each operator i ∈ V_dag, if it is placed on node v ∈ V_res, then x_i^v = 1. It is worth noting that, in current state-of-the-art DSP engines (such as Apache Storm and Flink), one operator can be replicated to multiple instances, and the instances of the operator can be placed on different nodes. We assume that each operator is placed on one node, as it simplifies the representation complexity of the model. Additionally, if the operator placement needs to be reconfigured to fit changes in the working environment, we can treat the reconfiguration as a new submission, as it does not affect the performance of the parallelism configuration algorithm.

Elastic Parallelism Configuration: the objective of configuring the parallelism is to change the number of instances of each operator to optimize one or multiple QoS requirements of the application. We need to decide the parallelism of each operator i, denoted k_i ∈ [1, K_max], where K_max is an upper bound on the parallelism. Therefore, if all the operators have the maximum parallelism K_max, the number of possible parallelism configurations for the whole application is K_max^|V_dag|. The parallelism determines the number of threads running for the instances of the operator, which is not directly related to the number of tasks provisioned for an operator. We discuss this in detail in Section IV. The configuration can be either static, fixed when the application is submitted to the engine, or dynamic, changeable while the application is running. In this work, we deal with the dynamic parallelism configuration problem and we present the detailed solution in Section III.

A. Quality of Service Metrics

The objective of the parallelism conﬁguration can be de-

cided by the user in terms of QoS requirements. As most stream processing applications need to handle the incoming data within an acceptable latency, the goal of the parallelism

conﬁguration can be to minimize the response time while

reducing the resource cost and minimizing the gap between

the throughput and the arrival rate to avoid back-pressure [6].

End-to-end latency upper bound: we assume the end-to-end latency of the application is primarily composed of computational or queuing latency. If the network latency or other latencies, e.g., the I/O latency caused by memory swapping, are significant in an application, we rely on other techniques to optimize the application first before deploying it in the edge computing environment [7]–[9]. To ensure that the end-to-end latency is bounded by a user-defined target, we traverse the paths in the application's DAG to get the estimated end-to-end latency upper bound. We first define a path as a sequence of operators, starting at a source and ending at a sink, denoted p ∈ P, where P denotes all the paths in the application. We can estimate the latency upper bound (not tight) of the application as the longest path in the DAG:

\bar{l}_{dag} = \max_{p \in P} \sum_{i \in p} \bar{l}_i    (1)

where \bar{l}_i is the latency upper bound when passing through one of the instances of operator i.
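For illustration, the longest-path bound in Equation 1 can be computed with a simple traversal. The following Python sketch (our illustration; the adjacency-list representation and the numbers are assumptions, not from the system) evaluates the bound for a small DAG:

    from functools import lru_cache

    def latency_bound_dag(edges, l_bar, sources):
        """Equation 1: max over source-to-sink paths of the summed
        per-operator latency bounds, via DAG longest-path recursion."""
        @lru_cache(maxsize=None)
        def longest_from(i):
            downstream = edges.get(i, ())
            if not downstream:
                return l_bar[i]  # sink operator: the path ends here
            return l_bar[i] + max(longest_from(j) for j in downstream)
        return max(longest_from(s) for s in sources)

    # Toy DAG 0 -> {1, 2} -> 3 with per-operator latency bounds in ms.
    edges = {0: (1, 2), 1: (3,), 2: (3,)}
    l_bar = {0: 10.0, 1: 40.0, 2: 25.0, 3: 15.0}
    print(latency_bound_dag(edges, l_bar, sources=(0,)))  # 65.0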

Throughput: in stream processing applications, the through-

put requirement is typically deﬁned by matching the process-

ing rate of the application to the arrival rate. If the processing

rate is larger than the arrival rate, the application will not incur back-pressure [6], which would otherwise influence the performance of the application and may increase the resource usage (e.g., the memory usage for caching unprocessed tuples). Therefore, to evaluate the throughput performance, we use the queue length, denoted as ω, which is widely used in queuing models

to represent the state of the queue. It also captures the gap

between the throughput and the arrival rate in the long run,

which is easy to monitor in the stream processing engine.

Resource usage: the resource usage can be directly estimated from the parallelism configuration, which is k_i for operator i. With an increase in parallelism, there will be more

threads allocated to the operator so that the resource usage will

increase. Thus, parallelism can be used as the representation

of the resource usage.

Reconﬁguration cost: as we change the parallelism conﬁgu-

ration when the stream processing application is running, it is

important to consider the reconﬁguration cost if the parallelism

is changed (e.g., the operator needs to be restarted to apply

the parallelism change). However, most of the previous works

assume a static reconﬁguration cost [8], which is a constant

cost related to the downtime. This kind of measurement is not accurate due to the correlation between the reconfiguration

downtime and other metrics such as latency and throughput.

The downtime caused by the reconfiguration will lead to a peak in latency and throughput after the downtime. Therefore, in

this work, we do not include the reconﬁguration cost in the

objective. Instead, we include the downtime inﬂuence in the

other metrics such as the end-to-end latency and throughput.

Fig. 2. Stream Processing Model

B. Stream Processing Model

With the notion of parallelism conﬁguration and the QoS

metrics deﬁned above, we represent the stream processing

model used to estimate the relationship between the decision

(parallelism conﬁguration) and requirements (QoS metrics)

based on the queuing model of an operator and the message

passing model of the stream processing application. This discussion guides the RL method design in Section III.

As discussed in Section I, the highly dynamic workload

and the heterogeneous resources make it very difﬁcult to

predict the environment dynamics when the stream processing

application is deployed in the edge computing environment.

However, it is important to extract the invariants from the dynamics so that human operators can understand the problem and the condition of the whole system and debug potential problems. Based on the intuition above, we adopt the model

from queuing theory and choose the M/M/1 queue (can also

be extended to G/G/1 based on distribution information) to

model the characteristics of the operator. For each operator i,

as shown in Figure 2, we assume that its instances do not share the input and output queues, so that each can be treated as an M/M/1 queue. An M/M/1 queue can be described by two variables, λ_i and µ_i, and one state ω_i, where λ_i is the arrival rate, µ_i is the service rate, and ω_i is the queue length. Based on the theory of the M/M/1 queue [10], we can get the response time distribution (which is referred to as latency in our work) and the throughput in closed form. When the queue is stable, which means µ_i > λ_i, the queue length will not grow infinitely. Without loss of generality, we analyze the latency upper bound here as an example. The 95th percentile of the latency can be calculated from the cumulative distribution function of the exponential distribution Exp(µ_i − λ_i) as follows:

\bar{l}_i = \frac{\ln 20}{\mu_i - \lambda_i}    (2)

Similarly, the other metrics can also be represented in closed form. For example, the average latency is 1/(µ_i − λ_i). If the arrival rate and service rate distributions are given (a G/G/1 queue), Equation 2 can be modified correspondingly to represent the upper bound (95th percentile) of the latency using the cumulative distribution function. In addition, we add a variable ψ_i to the queuing model, which represents the selectivity of operator i, so that we get the output rate as ψ_i λ_i if µ_i > λ_i, as shown in Figure 2.
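As a concrete illustration, the following Python sketch (the parameter values are illustrative) computes the 95th-percentile latency bound of Equation 2 for a single stable M/M/1 operator:

    import math

    def latency_upper_bound(arrival_rate, service_rate, percentile=0.95):
        """95th-percentile response time of a stable M/M/1 queue. The response
        time is Exp(mu - lambda), so the p-quantile is -ln(1 - p)/(mu - lambda);
        for p = 0.95 this is ln(20)/(mu - lambda), as in Equation 2."""
        if service_rate <= arrival_rate:
            return math.inf  # unstable queue: the backlog grows without bound
        return -math.log(1.0 - percentile) / (service_rate - arrival_rate)

    print(latency_upper_bound(100.0, 110.0))  # lambda = 100/s, mu = 110/s: ~0.30 s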

After introducing the queuing model of a single operator, we now move to the model used to estimate the performance of an application. As described at the beginning of this section, we assume that the application is organized as a DAG, G_dag(V_dag, E_dag), where each vertex i ∈ V_dag represents an operator and each edge (i, j) ∈ E_dag represents a stream. The tuples transmitted between two operators are partitioned by a default shuffling function, or by a user-defined partitioning function, which calculates the index of the downstream instance that a tuple will go to. In the message passing model, instead of composing the overall latency from source to sink as shown in Equation 1, we break down the latency incurred at each operator and, based on that, set per-operator target latencies from the overall latency requirement. With this split objective, each operator's performance can be tuned without taking the other operators or the overall application into consideration. Here, we use a simple heuristic that sets the maximum latency target of each operator proportional to its contribution to the overall latency:

\bar{l}_i^{max} = \frac{\bar{l}_i}{\bar{l}_{dag}} \, \bar{l}_{max}    (3)

where \bar{l}_{max} is the upper-bound latency of the overall application set by the user. If the profiling information is not available or not possible to obtain, we can use other heuristics, such as evenly dividing the latency upper bound into the sub-objectives of the operators given the number of stages in the DAG. We leave the dynamic orchestration of the sub-objectives, using information gathered while executing the application, as future work. For the other metrics, such as throughput, we can monitor the input rate and processing rate of an operator, and the throughput sub-objective can be directly obtained from the local information (e.g., queue length) of that operator, so we can rely on local information to optimize the throughput. Therefore, in the RL algorithm, we only need to focus on tuning the parallelism of one operator given its sub-objective. The use of sub-objectives decreases the complexity of the parallelism configuration problem, which we discuss in detail in Section III-B.
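A minimal sketch of the proportional split in Equation 3 follows (the profiled per-operator bounds and the user budget are illustrative values):

    def split_latency_budget(l_bar, l_dag, l_max):
        """Equation 3: l_max_i = (l_bar_i / l_bar_dag) * l_max, giving each
        operator a share of the user-set bound proportional to its profiled
        contribution to the critical path."""
        return {i: (l_i / l_dag) * l_max for i, l_i in l_bar.items()}

    l_bar = {0: 10.0, 1: 40.0, 2: 25.0, 3: 15.0}  # profiled bounds (ms)
    targets = split_latency_budget(l_bar, l_dag=65.0, l_max=1000.0)
    print(round(targets[1]))  # operator 1 receives ~615 ms of the 1000 ms budget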

In the next section, we present the details of the proposed

model-based RL method based on the above model to automat-

ically decide the parallelism in a dynamic and heterogeneous

edge computing environment.

III. REINFORCEMENT LEARNING FOR ELASTIC STREAM PROCESSING

We structure the elastic parallelism conﬁguration as a

Markov Decision Process (MDP) that represents an RL agent’s

decision-making process when making parallelism decisions. We then reduce the MDP to a contextual MAB problem

and apply LinUCB with a model-based pre-training.

A. A Markov Decision Process Formulation

The problem of continuously configuring the parallelism of stream processing applications in an edge computing environment can be naturally modelled as an MDP. Formally, an MDP algorithm proceeds in discrete time steps, t = 1, 2, 3, ...:

(i) The algorithm observes the current DSP application state s_t, which is a set of metrics gathered from monitor threads running outside of the application (e.g., node utilization, network usage) or from metrics reported by the application itself (e.g., latency, throughput, and queue length, as described in Section II-A).

(ii) Based on the rewards observed in the previous steps, the algorithm chooses an action k_t ∈ A, where A is the overall action space, and receives a reward r_{k_t}, whose expectation depends on both the state s_t and the action k_t. In the parallelism configuration process, each action is composed of the parallelism configurations of all the operators, denoted k_t = {k_{t,i} | i ∈ V_dag}.

(iii) The algorithm improves its state-action-selection strategy with the new observations, (s_t, k_t, r_{k_t}, s_{t+1}).

We choose the finite-horizon undiscounted return [11] as the objective of the MDP:

\sum_{t=0}^{T} r_{k_t}(s_t, s_{t+1})    (4)

where T is the number of consecutive time steps considered in the objective. It is a cumulative measure of the undiscounted rewards over a predefined T time steps. As the equation shows, compared to the infinite discounted reward, the finite undiscounted reward treats each time step equally. This fits the objective of the parallelism configuration, which aims to maximize the utility uniformly across time steps. It also fits well into the contextual MAB problem, which we discuss in the next subsection.

B. Model-based Reinforcement Learning

As discussed in Section I, traditional RL methods based on Q-value tables need a large number of data points to converge. Deep reinforcement learning methods use DNNs to improve the convergence rate, but they also need a lot of effort, either to tune the hyperparameters to trade off between expressivity and convergence rate, or to gather enough data to feed into the neural networks, which may be costly or even impossible in some conditions. In addition, the incomprehensible and non-adjustable deep neural model is a major barrier to making those kinds of models practical in system operations [12]. In our work, we use LinUCB [5], which assumes a linear relationship between the state and the reward, and has been proved effective under the contextual MAB assumptions even when the process is non-stationary.

The LinUCB method fits the parallelism configuration problem well based on two observations: (i) the parallelism configuration MDP (defined in Section III-A) can be reduced to a contextual MAB, and (ii) we can define the reward function to have a linear relationship with the parallelism configuration, since most of the common objectives (e.g., latency, throughput) are linear (or can be transformed to be linear) in the parallelism configuration. Next, we discuss the above two observations in detail.

The major difference between the MDP and the contextual

MAB is based on whether the agent considers the state

transitions to make the decision. From the theory of M/M/1

queue [10], we can see that for each time step, the state

transition is only dependent on the arrival rate λ, the service

rate µ, and the initial state of the time step, ω, which is the

initial length of the queue. Therefore, if we have the above

variables in a particular state, we can get the state transition

probability for any possible states in the next time steps. If

the distributions of the arrival process and service process are

stationary, the reward (determined by any QoS metrics) can

be determined by the current state and action regardless of

the trajectory of the previous states. It also means that the

decision of the action can be made based on the current state

instead of the trajectory. The above observation is intuitive

when there is only one operator. If there are multiple connected

operators organized as a DAG, the problem is signiﬁcantly

more complex. However, instead of connecting the queuing

model of each operator to build a queuing network, we can

break the objective (reward) function of the overall application

using a heuristic (as discussed in Section II-B) based on the

message passing model to the individual objective (reward)

for each operator so that for each operator, we can safely use

the LinUCB algorithm to ﬁt the queuing model with the given

objectives and also reduce the possible action space for the

RL method (from exponential to linear).

For the second observation, namely the linear relationship between the state and the reward, we begin by analyzing it using a single queuing model. To keep it simple, we omit the time step notation t in the following discussion. Consider an operator i with only one instance. We then have the arrival rate λ_i, the service rate µ_i (for one instance), and the queue length ω_i. We assume that the relationship between the parallelism setup and the speedup of the operator relative to the single-parallelism condition obeys Gustafson's law [13] with a parameter ρ_i that defines the portion of the operation that can benefit from increasing the resource usage. Here µ_i(k_i, ρ_i) is the estimated service rate when the parallelism is k_i and the parallel portion is ρ_i, which can be estimated by:

\mu_i(k_i, \rho_i) = (1 - \rho_i + \rho_i k_i)\,\mu_i    (5)

Without loss of generality, we estimate the latency (response time) distribution in a time step as an example, which can be modeled as an exponential distribution Exp(µ_i(k_i, ρ_i) − λ_i) plus an estimated upper bound on the processing time of the queued tuples, ω_i Exp(µ_i(k_i, ρ_i)) (not tight). Therefore, combining with Equation 2, the latency upper bound can be estimated as:

\bar{l}_i(\lambda_i, \mu_i, \omega_i) = \frac{\ln 20}{\mu_i(k_i, \rho_i) - \lambda_i} + \omega_i \frac{\ln 20}{\mu_i(k_i, \rho_i)}    (6)

With a given parallel portion ρ_i and average processing rate µ_i, the overall processing rate of operator i with k_i instances is proportional to the number of instances, k_i. If throughput is part of the reward, it will therefore have a linear relationship with the parallelism. For latency, in Equation 6, the operator starts from a state where the queue length is zero, ω_i = 0, and the first part of the equation is inversely proportional to the processing rate if the arrival rate λ_i is fixed. Therefore, through a simple transformation (e.g., setting x_1 = 1/(µ_i(k_i, ρ_i) − λ_i)), we obtain a linear relationship between the latency upper bound and the parallelism. For the other metrics, such as throughput, queue length, and resource utilization, we can analyze the relationship with the parallelism in the same way and obtain similar results. Through similar simple transformations, the linear relationship between the metrics (which represent the states in RL methods) and the parallelism (number of instances) can be obtained.
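To make the estimation concrete, a small Python sketch combining Equations 5 and 6 (the values below are arbitrary illustrations):

    import math

    LN20 = math.log(20.0)  # 95th-percentile factor of an exponential distribution

    def service_rate(mu, k, rho):
        """Equation 5: mu_i(k_i, rho_i) = (1 - rho + rho * k) * mu."""
        return (1.0 - rho + rho * k) * mu

    def latency_bound(lam, mu, omega, k, rho):
        """Equation 6: queuing delay of newly arriving tuples plus a (loose)
        bound on draining the omega tuples already queued."""
        mu_k = service_rate(mu, k, rho)
        if mu_k <= lam:
            return math.inf  # this parallelism cannot keep up with the input
        return LN20 / (mu_k - lam) + omega * (LN20 / mu_k)

    # Base rate 10 tuples/s, 90% parallel portion, 16 instances,
    # arrival rate 100 tuples/s, 50 tuples already queued: ~1.1 s.
    print(latency_bound(lam=100.0, mu=10.0, omega=50, k=16, rho=0.9))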

Based on the two observations, we apply LinUCB as an RL agent to decide the parallelism configuration for each operator and pass the messages between the connected operators in the DAG. Using the notation of Section III-A, we assume that the expected reward of an action (parallelism configuration) is linear in its d-dimensional state s_{t,k_i} with some unknown coefficient vector θ*_{k_i}. Therefore, the linear relationship between the reward and the state can be described as:

\mathbb{E}[r_{t,k_i} \mid s_{t,k_i}] = s_{t,k_i}^{\top} \theta^{*}_{k_i}    (7)

As described in LinUCB [5], a ridge regression is used to fit the linear model to the training data to get an estimate of the coefficients \hat{\theta}_{k_i} for each action k_i at each time step t. We omit the detailed steps of the LinUCB algorithm and refer the interested reader to the original paper [5]. Here, we only discuss the action selection policy of LinUCB, which can be represented as:

k_{t,i} = \arg\max_{k_i \in [1, K_{max}]} \left( s_{t,k_i}^{\top} \hat{\theta}_{k_i} + \alpha \sqrt{s_{t,k_i}^{\top} (D_{k_i}^{\top} D_{k_i} + I_d)^{-1} s_{t,k_i}} \right)    (8)

where α is a constant, D_{k_i} is a design matrix of dimension m × d at time step t, whose rows correspond to the m training inputs (states) observed for action k_i, and I_d is a d × d identity matrix. From the above equation, we can see that the action selection of LinUCB considers both the current knowledge obtained from the previous trials, in s_{t,k_i}^{\top} \hat{\theta}_{k_i}, and the uncertainty (the upper confidence bound) of the action-reward distribution, in the second term of Equation 8. This is why LinUCB is able to trade off between exploration and exploitation.
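A minimal numpy sketch of this selection and update rule (a generic disjoint LinUCB under our notation, not the authors' released code):

    import numpy as np

    class LinUCBArm:
        """One arm (one parallelism level k_i) of disjoint LinUCB."""
        def __init__(self, d, alpha=1.0):
            self.alpha = alpha
            self.A = np.eye(d)    # D_k^T D_k + I_d
            self.b = np.zeros(d)  # accumulated reward-weighted states

        def ucb(self, s):
            theta = np.linalg.solve(self.A, self.b)  # ridge-regression estimate
            bonus = self.alpha * np.sqrt(s @ np.linalg.solve(self.A, s))
            return s @ theta + bonus                 # the two terms of Equation 8

        def update(self, s, r):
            self.A += np.outer(s, s)
            self.b += r * s

    def select_parallelism(arms, state):
        """k_{t,i} = argmax over the arms of the UCB score."""
        return max(arms, key=lambda k: arms[k].ucb(state))

    arms = {k: LinUCBArm(d=3) for k in range(1, 5)}  # K_max = 4, 3-dim state
    s = np.array([0.2, 0.5, 1.0])
    k = select_parallelism(arms, s)
    arms[k].update(s, r=-0.3)  # observe the reward and update the chosen arm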

Algorithm 1 Model-based LinUCB pre-train
1: procedure PRETRAIN(G_dag) → Θ
2:   q is initialized as an empty queue
3:   O is the set of sources of G_dag (in-degrees are zero)
4:   for o ∈ O do
5:     q.append((o, λ_o))
6:   while q is not empty do
7:     i, λ_i = q.pop()
8:     θ_i = TRAIN(i, λ_i)
9:     add the trained parameters θ_i to the output Θ
10:    for all downstream operators i' of i do
11:      λ_{i'} = λ_{i'} + ψ_i λ_i
12:      remove edge (i, i') from G_dag
13:      if indegree(i') == 0 then
14:        q.append((i', λ_{i'}))
15: procedure TRAIN(i, λ_i) → θ_i
16:   initialize the model parameters θ_i
17:   while not terminated and not converged do
18:     k_{t,i} = selection(θ_i, s_{t−1,k_i}) by Equation 8
19:     r_{t,k_i}, s_{t,k_i} = simulate(λ_i, µ_i, k_{t,i})
20:     θ_i = updateLinUCB(θ_i, r_{t,k_i}, s_{t,k_i}, k_{t,i})
21:     reset operator i's state to the initial state

With the above analysis, we can see that if LinUCB is directly used to set the parallelism of one operator, it can be efficient, as the uncertainty of the operator can be captured by the linear model (e.g., the parallel portion, the base processing rate). However, on one hand, it has a cold-start phase which needs multiple rounds to gather enough data for each possible action to reach a reasonable performance level. On the other hand, in a stream processing DAG, the operators are connected to each other and the overall performance of the application

may vary due to different bottlenecks. Given the DAG, it is

a challenging problem to determine how to relate the overall

performance of the application to the metrics of every operator

in it. Therefore, instead of directly optimizing the overall

application, we use the objective function split as discussed

in Section II-B to only deal with the optimization for one

operator for each RL agent. In addition, we use the queuing

model-based simulation to validate the conﬁguration to set

the initial parameters for the LinUCB model. The simulation

also gives additional beneﬁts. On one hand, we can assume

different distributions for the arrival rate and service rate

that can support arbitrary G/G/1 queuing models, which is

evaluated in the experiment in Section V. On the other hand,

the simulator can work in different modes to either generate a

lot of synthetic data to directly feed into the model or interact

with the RL agent as a simulation environment, which can

ﬁt into more RL algorithms. In the simulation, instead of

trying all the combinations of the parallelism conﬁguration

of the operators at the same time, we gradually train the

model for each operator in a topological order [14] of the

DAG to ensure that the upstream operators’ conﬁguration

is ﬁxed before the downstream operator’s model is trained.

The simulation process is shown in Algorithm 1. The reward function for each operator i is defined using the Simple Additive Weighting (SAW) technique [15]:

r_i(s_{t,k_i}) = w_{lat} r_i^{lat} + w_{que} r_i^{que} + w_{res} r_i^{res}    (9)

where s_{t,k_i} is the state of time slot t with parallelism k_i, and r_i^{lat}, r_i^{que}, r_i^{res} are the reward components for latency, throughput, and resource usage based on the application's requirements. To balance the optimization of latency, queue length (the gap between throughput and input rate), and resource usage, we add w_{lat}, w_{que}, w_{res} as the weights of the components, with w_{lat} + w_{que} + w_{res} = 1. For different requirements, the reward function can take different forms. For example, if the application requires deadline-awareness and has a strict latency bound, we can set r_i^{lat} = −1 when \bar{l}_i ≥ l_i^{max}, and zero otherwise. If the application's utility is linear in the latency, we can set r_i^{lat} = −\bar{l}_i / l_i^{max}, which decreases as the latency increases. Without loss of generality, we define the reward function by setting r_i^{lat} = −1 if \bar{l}_i ≥ l_i^{max} and zero otherwise, r_i^{que} = −1 if ω_i ≥ ω_i^{max} and zero otherwise, and r_i^{res} = −k_i / k_i^{max}. In the definition above, both the latency and throughput penalties are bounded. The resource usage reward is linear in the number of instances running. To eliminate the impact of state transitions the model has experienced through a previously selected bad action (e.g., a bad parallelism configuration may leave too many tuples waiting for processing and hence influence the subsequent states), we reset the state of the simulation each time we update the model with one parallelism configuration, in line 21 of Algorithm 1. Therefore, each sample starts the operator from the same initial state, so that a sample is not influenced by the previous sample's state.
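A small sketch of this reward under the default equal weights (the thresholds below are illustrative):

    def reward(l_bar, omega, k, l_max, omega_max, k_max,
               w_lat=1/3, w_que=1/3, w_res=1/3):
        """Equation 9 with the penalty-style components defined above: bounded
        latency and queue penalties, and a linear resource-usage cost."""
        r_lat = -1.0 if l_bar >= l_max else 0.0
        r_que = -1.0 if omega >= omega_max else 0.0
        r_res = -k / k_max
        return w_lat * r_lat + w_que * r_que + w_res * r_res

    # Latency and queue within bound, 16 of 64 instances in use: reward ~ -0.083.
    print(reward(l_bar=0.8, omega=10, k=16, l_max=1.0, omega_max=100, k_max=64))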

In the real environment, we ﬁrst apply the trained param-

eters to each model as shown in Algorithm 2. Then, for

each operator, the controller decides the parallelism from the

metrics gathered from the system similar to the steps in the

pre-train iteration. The heterogeneity is captured by the linear

model in LinUCB. If the operator is migrated from one node

to another, it only influences the processing rate distribution µ_i

(when the other latencies are already appropriately optimized).

We evaluate this experimentally in Section V.

In the next section, we discuss the implementation of the above methods in a real-world DSP engine.

Algorithm 2 MBLinUCB
1: procedure MBLINUCB(G_dag)
2:   initialize the model parameters θ_i for each operator i ∈ V_dag
3:   Θ = PRETRAIN(G_dag)
4:   decide the initial k_0 = {k_{0,i} | i ∈ V_dag} from Θ
5:   submit G_dag with k_0 to the stream processing engine
6:   while G_dag has not terminated do
7:     gather the metrics of time slot t as s_t = {s_{t,k_i} | i ∈ V_dag}
8:     for each operator i do
9:       θ_i = updateLinUCB(θ_i, r_{t,k_i}, s_{t,k_i}, k_{t−1,i})
10:      k_{t,i} = selection(θ_i, s_{t,k_i}) by Equation 8
11:      if k_{t,i} ≠ k_{t−1,i} then
12:        submit the parallelism change k_{t,i} to the stream processing engine

Fig. 3. System Architecture Overview

IV. IMPLEMENTATION

We implement a prototype of the proposed method using

Apache Storm (v2.0.0) [2]. Though the proposed algorithms

can be implemented on other DSP engines such as Apache

Flink [3], we chose Apache Storm for implementation due to its widespread use in data science applications [16] and because it has the lowest overall latency [17] among the leading

stream processing engines. We use Apache Storm to deploy

the DSP application which runs on the distributed worker

nodes managed by the Storm framework. In Storm, the DSP

application can be represented as a DAG topology that is

used to schedule and optimize the application. However, when

we actually deploy the application, it has an execution plan

which can be seen as an extension of the DAG topology. The

execution plan replaces each operator with its tasks. A task

represents an instance of an operator and is in charge of a

partition of the incoming tuples of the operator. In addition,

one or more tasks are grouped into executors, implemented

as threads, as shown in Figure 3. Storm can process a large number of tuples in parallel by running multiple executors. The executors are handled by the worker process in Storm, a Java process acting as a container that configures a number of parameters, including the maximum heap memory that can be used. The parallelism is configured by the number

of executors allocated to an operator. When the number of

the executors reaches the number of tasks, the operator gets

its maximum parallelism. The number of executors can be re-

conﬁgured without restarting the application (but the executors

need to be restarted to redistribute the tasks) by the re-

balancing tool provided by Storm.

The implementation of our algorithms in Storm is straight-

forward. As illustrated in Figure 3, we implement a centralized controller of the application in Python. The controller is implemented using the gym environment [18] interface, which can be used directly with most RL libraries. The interface requires the environment to provide several functions, including step(), reset(), and close(). Here,

the most important interface is the step() interface, which

takes in the action for the time step and returns a four-

tuple including the observation (state), the immediate reward,

the end of episode signal, and the auxiliary diagnostic in-

formation. Based on the above interface, we implement the

DSP controller to control the parallelism conﬁguration based

on the action the algorithm computes in each time step.

Additionally, the controller also takes the responsibility of

monitoring the status of the DSP application by capturing the

metrics from the output of the application, each physical node,

and each instance of the operators.
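As an illustration, a skeleton of such a controller environment following the classic gym API (a sketch only: the metric-gathering and parallelism-change helpers are hypothetical placeholders, not our actual implementation):

    import gym
    from gym import spaces
    import numpy as np

    def gather_metrics():
        """Placeholder for metrics pulled from the monitors (hypothetical)."""
        return np.array([0.8, 10.0, 100.0], dtype=np.float32)  # latency, queue, rate

    def apply_parallelism(k):
        """Placeholder for the engine-side rebalance call (hypothetical)."""
        pass

    class DSPControllerEnv(gym.Env):
        """One step(): apply a parallelism action, gather the next state, and
        return an Equation 9-style reward for the long-running application."""
        def __init__(self, k_max=8):
            self.k_max = k_max
            self.action_space = spaces.Discrete(k_max)  # parallelism 1..k_max
            self.observation_space = spaces.Box(0.0, np.inf, shape=(3,), dtype=np.float32)

        def step(self, action):
            k = action + 1
            apply_parallelism(k)
            obs = gather_metrics()
            r_lat = -1.0 if obs[0] >= 1.0 else 0.0    # latency penalty
            r_que = -1.0 if obs[1] >= 100.0 else 0.0  # queue-length penalty
            reward = (r_lat + r_que - k / self.k_max) / 3.0
            return obs, reward, False, {}  # never "done": the job is long-running

        def reset(self):
            return gather_metrics()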

By wrapping the controller of the application, we extended

the algorithm (LinUCB) in RLlib [19] to implement the pro-

posed algorithms, which can directly use the gym environment

to interact with the DSP application (shown as interaction loop

in Figure 3). Therefore, when the DSP application is submit-

ted, a controller is created and based on the algorithm chosen

for conﬁguring the parallelism, an RL agent (or multiple RL

agents) is created and attached to the controller. Additionally,

the pre-training simulator also implements the same gym environment interface, which can directly interact with the RL agent. Using our MBLinUCB method, we tune the parallelism of each operator using a dedicated RL agent; hence, it is possible to attach the agent to the operator itself to make the decision. In that way, the agent does not need to be co-located with the controller and can be placed anywhere near the operator to make the decision without significantly degrading performance.

V. EVALUATION

We evaluate the proposed techniques against several state-of-the-art RL methods. We use both simulation and real

testbed experiments to study the behavior of the RL algorithms

when they are used in optimizing the parallelism conﬁguration

of DSP applications.

A. Experimental setup

We describe the experimental setup for the simulation

environment and real test-bed environment separately.

For the simulation environment, we implement a DSP ap-

plication simulator by extending the queue and load balancing environments provided in the Park project [20] and make it com-

patible with the gym environment as discussed in Section IV.

The default setup of the simulation is shown in Table I.

TABLE I
DEFAULT SIMULATION PARAMETER SETUP
K_max: 64
w_lat, w_que, w_res: 1/3 each
Average input rate: 100 tuples/s
l̄_max: 1000 ms
µ_i: 10 tuples/s
Time step interval: 10 s

In the simulation experiment, we test three different datasets: (i) a

synthetic Poisson distribution dataset with default arrival rate

λ = 100 tuples/s, (ii) a synthetic Pareto distribution dataset with shape parameter α = 2.0 and scale parameter x_m = 50 (so that the default average input rate is also 100 tuples/s), and (iii) a trace-driven dataset made available by Chris Whong [21] that contains information about the activity of New York City taxis. Each data point in the dataset corresponds to a taxi trip, including the timestamp and location of both the pick-up and drop-off events. As the data is too sparse (around three hundred tuples per minute) to be used in stream processing experiments, we speed up the input rate of the dataset by sixty times, which means that the tuples arriving in one minute in the original dataset arrive in one second in the experiment.

Fig. 4. NY Taxi Profitable Area Application
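The sixty-fold speed-up described above can be applied by rescaling inter-arrival times while replaying the trace; a sketch with a simplified record format:

    import time

    SPEEDUP = 60.0  # one trace minute is replayed in one wall-clock second

    def replay(trace, emit):
        """Replay (timestamp_seconds, record) pairs at SPEEDUP x real time."""
        start_wall = time.time()
        start_ts = trace[0][0]
        for ts, record in trace:
            due = start_wall + (ts - start_ts) / SPEEDUP
            delay = due - time.time()
            if delay > 0:
                time.sleep(delay)  # wait until the rescaled arrival time
            emit(record)

    # Three trips 60 s apart in the trace are emitted 1 s apart.
    replay([(0, "trip-a"), (60, "trip-b"), (120, "trip-c")], print)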

In the real testbed experiments, we implement the DSP

application for the 2015 DEBS Grand Challenge (http://www.

debs2015.org/call-grand-challenge.html) to calculate the most

profitable areas for each time window. The dataset used is the New York City taxi dataset mentioned above, with the data rate also sped up by sixty times. We deploy a testbed on CloudLab [22] with nodes organized in three tiers: we use ten xl170 servers in the CloudLab cluster and simulate the three-tier architecture on an OpenStack cluster. The third

tier contains fourteen m1.medium instances (2 vCPUs and 4

GB memory) that act as the smart gateways with relatively

low computing capacity corresponding to the leaf nodes of

the architecture. The second tier has ﬁve m1.xlarge instances

(8 vCPUs and 16 GB memory) and each of them functions

as a micro datacenter. The ﬁrst tier contains one m1.2xlarge

instance (16 vCPUs and 32 GB memory) acting as the

computing resource used in the cloud datacenter. The network

bandwidth, latency, and topology are configured by dividing virtual LANs (local area networks) between the nodes and adding policies to the ports of each node, enforced using the Neutron module of OpenStack and the traffic control (tc) tool in Linux.

TABLE II
DEFAULT PARAMETER SETUP FOR REAL TESTBED
K_max: 8
w_lat, w_que, w_res: 1/3 each
Average input rate: 4500 tuples/s
l̄_max: 2000 ms
Time step interval: 60 s

We deploy the Storm Nimbus service (acting as the master node) on an m1.2xlarge instance and one Storm Supervisor service (acting as a slave node) on each node. For the checkpoint store, we use a single-node Redis service placed on the master node. The default network is set to 100 Mbps bandwidth with 20 ms latency between the gateways and the micro datacenters, and 100 Mbps bandwidth with 50 ms latency between the cloud datacenter and the micro datacenters. We also place a stream generator on

each smart gateway to emulate the input stream. The input

stream comes to an MQTT (Message Queuing Telemetry

Transport) service deployed on each smart gateway. The

dataset is replicated and replayed on each smart gateway and

the average input rate is around 4500 tuples/s overall. The default parameters used in the real testbed experiments are listed in Table II.

Fig. 5. Results of simulation with Synthetic Dataset (Poisson distribution): (a) Reward, (b) 95th percentile Latency, (c) Throughput

Fig. 6. Rewards of simulation with Synthetic Dataset (Pareto distribution, α = 2.0)

Fig. 7. Rewards of simulation on the New York taxi trace

B. Application and Operator Placement

To comprehensively evaluate the proposed algorithm, we

choose a smart city application that ranks the proﬁtability of

the areas for taxis in New York City. As shown in Figure 4,

there are seven operators: (i) source and mapper, which con-

sume the input stream from the MQTT service and transform

the raw tuple to the data type that can be understood by the

system, (ii) taxi aggregator, which aggregates the trips by the

taxi IDs in time windows, (iii) taxi counter, which counts the

number of taxis in a particular area in time windows, (iv) proﬁt

aggregator, which aggregates the proﬁts by the pickup area in

time windows, (v) joiner, which joins the proﬁt and number

of taxis to calculate the proﬁtability of a particular area, (vi)

ranking, which sorts the proﬁtability of the area, (vii) sink,

which stores the results of the most proﬁtable areas into a

database for further usage. We have optimized the placement of the application by placing each operator on one of the three tiers based on its selectivity. The data source, which consumes the input tuples from the MQTT services, is placed on the same node (one of the gateways) where the MQTT service is placed. The heavy-load aggregators are placed in the micro datacenters. The join, ranking, and sink operators are placed in the cloud datacenter. It is worth noting that because the windowed aggregators (the taxi and profit aggregators) handle most of the workload, only those two operators can become bottlenecks in the overall stream processing application. So in the real testbed experiments, we only consider scaling up those two operators.

C. Algorithms

In our experiment evaluation, different mechanisms are

measured and compared: (i) PPO, which is a policy gradient

method for RL [23], (ii) A3C, which is the asynchronous

version of the actor-critic methods [24], (iii) DQN, which is a

method based on DNN to learn the policy by Q-learning [25],

Fig. 8. Evaluation of applicability for heterogeneous resources (Poisson distribution): (a) Reward, (b) Action (parallelism)

(iv) LinUCB, which is a MAB method based on a linear model

to approximate the reward distribution and uses UCB to

select the action [5], and (v) MBLinUCB, which is the method

proposed in this work. All the methods use the default hyper-

parameters conﬁgured in RLlib. For the proposed method, we

generate ten thousand data points to initialize the parameters

in the linear models in the MBLinUCB method as described

in Section III.

D. Simulation Experiment Results

We ﬁrst evaluate the performance of the algorithms in the

simulation environment. As shown in Figure 5, we compare

the algorithms with a synthetic Poisson distribution workload.

We can see that our method converges faster than the other

methods and it only needs three thousand time-steps to reach

an average reward of -0.3. It also starts from a relatively good

initial position above -0.4 compared to -0.5 in the LinUCB

method. With respect to latency and throughput metrics, our

method and LinUCB perform better than the others. However,

the latency of MBLinUCB converges from around one thousand milliseconds, which is much higher than that of the LinUCB method. As MBLinUCB initializes its linear model with the data generated from the environment model, it starts from a configuration that just meets the upper-bound latency requirement (1000 ms) with the minimum parallelism needed.

As shown in Figure 6, we compare the algorithms with

another synthetic workload from a Pareto distribution (for each

time slot, the input rate is sampled from a Pareto distribution).

The workload is used to test the performance of the algorithms

in conditions when the workload has signiﬁcant ﬂuctuations

while executing the application. In Figure 6, we can see similar

results as in Figure 5. The proposed method performs better

than the other methods. We note that even LinUCB does not converge to as good a result as MBLinUCB does, but with enough iterations (around thirty thousand time steps), A3C can reach

similar results to MBLinUCB. This is expected, as A3C uses the actor-critic method to improve the sample efficiency, so it performs better than the other RL methods that also rely on DNNs.

Fig. 9. Evaluation of applicability for heterogeneous operators (Poisson distribution): (a) Reward, (b) Action (parallelism)

In the next set of experiments, we study the

performance of the algorithms using a real trace as shown in

Figure 7. We can see that our method converges to an average

reward greater than -0.3 in less than twenty thousand time

steps. However, LinUCB needs more than sixty thousand time

steps, and the other methods require even more time steps.

In the next two sets of experiments shown in Figure 8

and Figure 9, we evaluate the impact of heterogeneity in

the available resources and operators. As we can see in Figure 8a, for different operator processing rates, which may be influenced by the characteristics of the operator or the power of the underlying server, MBLinUCB can converge to a good reward range within a few episodes. We can see that LinUCB converges to a lower reward when the service rate is ten compared to MBLinUCB (noted as op-10 in the figure). The above results can be explained by comparing the results in Figure 8b. We can see that all the conditions converge to a small range of actions by the end of 50 episodes. However, compared to MBLinUCB, LinUCB converges to a larger parallelism, so it has a higher resource usage penalty and thus a lower reward compared to our method. As shown in Figure 9a, we see similar results for different operator parallel portions (i.e., what percentage of the operator's processing can be parallelized): our mechanism can converge within a limited number of episodes. We also note that LinUCB converges to a lower reward with larger parallelism, as shown in Figure 9b.

E. Real Testbed Experiment Results

We evaluate the performance of LinUCB and our method

in the real testbed with a real stream processing application

and a real dataset. As shown in Figure 10a, we can see that

the proposed method converges faster as it has a better initial

configuration. It only takes fifteen time steps (of one minute each) to reach a reward above -0.3. For the latency analysis, we illustrate the latency upper bound (95th percentile of the latency distribution) in Figure 10b. As shown in the results, our method starts from a latency that already meets the requirement (the latency upper bound is less than two seconds) and is stable during the experiments. However, the original LinUCB method starts from a very high latency (more than five seconds; we cap the latency metric at five seconds if it is larger) and then gradually improves the average latency upper bound to three seconds.

Fig. 10. Real Testbed Results: (a) Reward, (b) 95th percentile latency

Our MBLinUCB already meets the latency bound

requirements at the initial state and tries to improve it by

exploring the possible parallelism conﬁgurations in the real

environment.

VI. RELATED WORK

Over the last few years, the developments in the Big

Data ecosystem have raised higher requirements for scal-

able stream processing engines. Several open-source stream

processing frameworks have been developed. Key examples

include Flink [3] and Storm [2]. There have been several

efforts in recent years to optimize stream processing in edge

computing environments. Xu et al. [9] proposed Amnis to

improve the data locality of edge-based stream processing by

considering resource constraints in an ofﬂine manner. In [26],

Xu et al. address fault tolerance aspects of edge-based stream

processing using a hybrid backup mechanism. The techniques

proposed in [9], [26] can be used to generate the initial plan

used in our work. To dynamically scale the application, several

different approaches have been developed including techniques

for re-conﬁguring the execution graphs of the application or

adjusting the parallelism by increasing the number of instances

of certain operators. Cardellini et al. [8] proposed an elastic

stream processing framework based on an ILP (Integer Linear Programming) model to reconfigure the stream processing ap-

plication to make decisions on operator migration and scaling.

However, given the heterogeneous nature of edge computing

systems, these techniques may require substantial manual

effort to tune the parameters to achieve a self-stabilizing status.

There have also been several efforts on developing techniques

to manage stream processing applications using RL methods.

For example, Li et al. [27] proposed a model-free method to schedule stream processing applications based on an actor-critic RL method [24]. Ni et al. [28] developed a resource allocation mechanism based on a GCN (Graph Convolutional Network)-based RL method to group operators onto different nodes. However, the above DNN-based methods suffer from long training periods and low sampling efficiency, and they need a large amount of data to build the model. There have

been several efforts on leveraging the traditional RL method

(such as Q-learning) to deal with the problem. For instance,

Russo et al. [29] used FA (Function Approximation)-based TBVI (Trajectory-Based Value Iteration) to improve the sample efficiency of traditional RL methods (such as Q-learning) to make operator scaling decisions in heterogeneous environments. However, to reduce the action space, the above work defines the action as increasing or decreasing an operator by only one instance. This increases the convergence trajectory length; therefore, even when the model is trained well, it may take many steps to reach the optimal state, which also incurs a high reconfiguration cost. In contrast to these

existing works, our model-based method automatically ﬁnds

tradeoffs between exploration and exploitation by using the

UCB-based RL method. In addition, the proposed approach

further improves the sample efﬁciency by utilizing a queuing

model-based simulation to generate more data to pre-train the

model. Based on these features, the method proposed in this

paper achieves a signiﬁcantly higher convergence rate and

cumulative rewards during long runs compared to the existing

methods.

VII. CONCLUSION

In this paper, we proposed a learning framework that achieves

elasticity for stream processing applications deployed at the

edge by automatically tuning the application to meet the

Quality of Service requirements. The method adopts a rein-

forcement learning (RL) method to conﬁgure the parallelism of

the operators in the stream processing application. We model

the elastic parallelism conﬁguration for stream processing in

edge computing as a Markov Decision Process (MDP), which

is then reduced to a contextual Multi-Armed Bandit (MAB)

problem. Using the Upper Confidence Bound (UCB)-based

RL method, the proposed approach signiﬁcantly improves

the sample efﬁciency and the convergence rate compared to

traditional random exploration methods. In addition, the use of

model-based pre-training in the proposed approach results in

substantially improved performance by initializing the model

with appropriate and well-tuned parameters. The proposed

techniques are evaluated using realistic workloads through

both simulation and real testbed experiments. The experiment

results demonstrate the effectiveness of the proposed approach

in terms of cumulative reward and convergence speed.

VIII. ACKNOWLEDGEMENT

This work is partially supported by an IBM Faculty award.

REFERENCES

[1] "The internet of things: A movement, not a market," https://ihsmarkit.com/Info/1017/Internet-of-things.html, accessed November 2, 2020.
[2] "Apache Storm," https://storm.apache.org/, accessed July 16, 2021.
[3] "Apache Flink," https://flink.apache.org/, accessed July 16, 2021.

[4] C. Jasper, “The hidden costs of delivering iiot services: Industrial

monitoring & heavy equipment,” 2016.

[5] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit

approach to personalized news article recommendation,” in Proceedings

of the 19th international conference on World wide web, 2010, pp. 661–

670.

[6] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal,

J. M. Patel, K. Ramasamy, and S. Taneja, “Twitter heron: Stream

processing at scale,” in Proceedings of the 2015 ACM SIGMOD In-

ternational Conference on Management of Data, 2015, pp. 239–250.

[7] P. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh, and

M. Seltzer, “Network-aware operator placement for stream-processing

systems,” in Data Engineering, 2006. ICDE’06. Proceedings of the 22nd

International Conference on. IEEE, 2006, pp. 49–49.

[8] V. Cardellini, F. Lo Presti, M. Nardelli, and G. Russo Russo, “Optimal

operator deployment and replication for elastic distributed data stream

processing,” Concurrency and Computation: Practice and Experience,

vol. 30, no. 9, p. e4334, 2018.

[9] J. Xu, B. Palanisamy, Q. Wang, H. Ludwig, and S. Gopisetty, “Amnis:

Optimized stream processing for edge computing,” to appear in Journal

of Parallel and Distributed Computing, 2021.

[10] W. J. Stewart, Probability, Markov chains, queues, and simulation.

Princeton university press, 2009.

[11] H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus, “An adaptive sampling

algorithm for solving markov decision processes,” Operations Research,

vol. 53, no. 1, pp. 126–139, 2005.

[12] Z. Meng, M. Wang, J. Bai, M. Xu, H. Mao, and H. Hu, “Interpreting

deep learning-based networking systems,” in Proceedings of the Annual

conference of the ACM Special Interest Group on Data Communica-

tion on the applications, technologies, architectures, and protocols for

computer communication, 2020, pp. 154–171.

[13] J. L. Gustafson, “Reevaluating amdahl’s law,” Communications of the

ACM, vol. 31, no. 5, pp. 532–533, 1988.

[14] A. B. Kahn, “Topological sorting of large networks,” Communications

of the ACM, vol. 5, no. 11, pp. 558–562, 1962.

[15] K. P. Yoon and C.-L. Hwang, Multiple attribute decision making: an

introduction. Sage publications, 1995.

[16] "Ranking popular distributed computing packages for data science," accessed November 2, 2020. [Online]. Available: https://www.kdnuggets.com/2018/03/top-distributed-computing-packages-data-science.html

[17] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holder-

baugh, Z. Liu, K. Nusbaum, K. Patil, B. J. Peng et al., “Benchmarking

streaming computation engines: Storm, ﬂink and spark streaming,” in

2016 IEEE international parallel and distributed processing symposium

workshops (IPDPSW). IEEE, 2016, pp. 1789–1792.

[18] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schul-

man, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint

arXiv:1606.01540, 2016.

[19] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg,

J. Gonzalez, M. Jordan, and I. Stoica, “Rllib: Abstractions for distributed

reinforcement learning,” in International Conference on Machine Learn-

ing. PMLR, 2018, pp. 3053–3062.

[20] H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang, R. Marcus,

R. Addanki, M. Khani Shirkoohi, S. He et al., “Park: An open plat-

form for learning-augmented computer systems,” Advances in Neural

Information Processing Systems 32 (NIPS 2019), 2019.

[21] C. Whong, “Foiling nyc’s taxi trip data,” FOILing NYCs Taxi Trip Data.

Np, vol. 18, 2014.

[22] D. Duplyakin, R. Ricci, A. Maricq, G. Wong, J. Duerig, E. Eide,

L. Stoller, M. Hibler, D. Johnson, K. Webb, A. Akella, K. Wang,

G. Ricart, L. Landweber, C. Elliott, M. Zink, E. Cecchet, S. Kar, and

P. Mishra, “The design and operation of CloudLab,” in Proceedings of

the USENIX Annual Technical Conference (ATC), Jul. 2019, pp. 1–14.

[Online]. Available: https://www.flux.utah.edu/paper/duplyakin-atc19

[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-

imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,

2017.

[24] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,

D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein-

forcement learning,” in International conference on machine learning.

PMLR, 2016, pp. 1928–1937.

[25] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier-

stra, and M. Riedmiller, “Playing atari with deep reinforcement learn-

ing,” arXiv preprint arXiv:1312.5602, 2013.

[26] J. Xu, B. Palanisamy, and Q. Wang, “Resilient stream processing in

edge computing,” in 2021 IEEE/ACM 21st International Symposium on

Cluster, Cloud and Internet Computing (CCGrid). IEEE, 2021, pp.

504–513.

[27] T. Li, Z. Xu, J. Tang, and Y. Wang, “Model-free control for distributed

stream data processing using deep reinforcement learning,” Proceedings

of the VLDB Endowment, vol. 11, no. 6, pp. 705–718, 2018.

[28] X. Ni, J. Li, M. Yu, W. Zhou, and K.-L. Wu, “Generalizable resource

allocation in stream processing via deep reinforcement learning,” in

Proceedings of the AAAI Conference on Artiﬁcial Intelligence, vol. 34,

no. 01, 2020, pp. 857–864.

[29] G. R. Russo, V. Cardellini, and F. L. Presti, “Reinforcement learning

based policies for elastic stream processing on heterogeneous resources,”

in Proceedings of the 13th ACM International Conference on Distributed

and Event-based Systems, 2019, pp. 31–42.