
Handover-Enabled Dynamic Computation Offloading for Vehicular Edge Computing Networks

Homa Maleki, Member, IEEE, Mehmet Başaran, Member, IEEE, and Lütfiye Durak-Ata, Senior Member, IEEE

Abstract—The computation offloading technique is a promising solution that empowers resource-limited devices to run delay-constrained applications efficiently. Vehicular edge computing incorporates processing capabilities into vehicles and thus provides computing services for other vehicles through computation offloading. Mobility affects the communication environment and leads to critical challenges for computation offloading. In this paper, we consider an intelligent task offloading scenario for vehicular environments including smart vehicles and roadside units, which can cooperate to perform resource sharing. Intending to minimize the average offloading cost, which accounts for energy consumption together with delay in the transmission and processing phases, we formulate task offloading as an optimization problem and implement an algorithm based on deep reinforcement learning with double Q-learning, which allows user equipments to learn the offloading cost performance by observing the environment and to make steady sequences of offloading decisions despite the uncertainties of the environment. Moreover, given the high mobility of the environment, we propose a handover-enabled computation offloading strategy that leads to a better quality of service and experience for users in beyond-5G and 6G heterogeneous networks. Simulation results demonstrate that the proposed scheme achieves a lower cost than existing offloading decision strategies in the literature.

Index Terms—Computation offloading, handover, intelligent transportation, reinforcement learning, vehicular edge computing.

I. INTRODUCTION

The tremendous growth and rapid development of smart vehicles have led to the use of numerous applications in vehicular environments [1]. This technology allows passengers to travel safely and comfortably with the help of network devices, cameras, and sensors. Furthermore, it provides the ability to store and process information, enabling driving assistance and autonomous vehicles. For instance, augmented reality (AR) can provide useful information for better visibility, especially in unfavorable weather conditions [2]. Most of these applications need resources to perform massive computations, both in terms of energy and central processing unit (CPU) power, which are not available in the vehicles in most cases. Developing on-board computers in smart vehicles may be a solution; however, it may not be economical [3]. Therefore, vehicular edge computing (VEC) could be a suitable solution for task execution in vehicular environments beyond the 5G network, as it incorporates processing capabilities into the vehicles and thus provides computing services for other vehicles as well as pedestrians through computation offloading [4]. Offloading the computation can improve the quality of service (QoS) and quality of experience (QoE) for users and applications. Moreover, by offloading computations, users can extend battery lifetime even if they are capable of running the application locally, paving the way for dense 6G applications. Additionally, in some cases, pedestrians with smart devices can act as edge servers by providing computation services and sharing their resources with other users [5].

H. Maleki and L. Durak-Ata are with the Information and Communications Research Group (ICRG), Informatics Institute, Istanbul Technical University, Istanbul, Turkey (e-mail: maleki18@itu.edu.tr; durakata@itu.edu.tr). M. Başaran is with the Information and Communications Research Group, Istanbul Technical University, 34469, and 6GEN Lab., Turkcell, 34854, Istanbul, Turkey (e-mail: mehmetbasaran@itu.edu.tr, mehmet.basaran@turkcell.com.tr).

Vehicular environments are highly dynamic due to high mobility, where the topology of the network and wireless channel states change rapidly over time in beyond-5G and 6G communications. These circumstances make it difficult to reach steady offloading decisions [6]. Reinforcement learning (RL) is a powerful solution for making decisions under the uncertainties of this scenario [7]. The idea behind RL is to learn by interacting with the environment, and it is generally assumed that the agent has to act despite serious uncertainty about the environment. RL is a learning technique that discovers how to act in a manner that maximizes a numerical reward. The learner is not informed beforehand about which actions it has to take; in other words, the learner itself should determine which actions are more beneficial by trying them [8]. Q-learning is one of the most prominent RL algorithms and can be described in terms of the quality Q of an action a in a state s under a policy π. During the training process, the agent updates and stores Q-values for each state-action pair. Due to the overestimation of action values, Q-learning may perform poorly in stochastic conditions. Therefore, double Q-learning (DQL) has been introduced to solve this problem by using two Q-values instead of one for each state-action pair, which prevents overestimation over a large number of iterations by blocking inflated action values [9]. Finally, in order to remove the time dependency and find a steady solution for computation offloading, DQL can be combined with a deep neural network (DNN) [10], yielding the double deep Q-network (DDQN) [11].
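
To make the remedy concrete, the following minimal sketch (ours, not the paper's; all names are illustrative) shows the tabular double Q-learning update of [9], where two estimates are maintained and each uses the other to evaluate its greedy action, which curbs the overestimation bias of standard Q-learning:

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular double Q-learning step: one table selects the
    greedy next action, the other evaluates it (and vice versa)."""
    if np.random.rand() < 0.5:             # update QA, evaluate with QB
        a_star = np.argmax(QA[s_next])
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:                                  # symmetric update of QB
        b_star = np.argmax(QB[s_next])
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])
```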

II. RELATED WORK AND MAIN CONTRIBUTIONS

The latest research in the literature attests to the importance and superiority of VEC. Different algorithms and strategies have been proposed in the literature to solve the resource allocation problem in vehicular environments, including stochastic methods, game theory, meta-heuristic algorithms, mathematical programming, and machine learning [2], [4], [12].

A commonly invoked strategy in this area is game theory (GT). Typically, GT in computation offloading [13] is an analytical approach to investigate the interaction between cooperating or competing users with respect to shared network resources to satisfy computation requirements. In [14], a multi-user non-cooperative computation offloading game has been proposed, where each vehicle competes with other vehicles for the resources of an edge server, and the payoff function of this game is formulated by considering the transmission model and the distance between the edge server and the vehicle.

Meta-heuristic algorithms have also been used in various research studies in this field. The authors of [15] have presented an event-triggered dynamic computation apportionment to the fog framework using both linear programming-based optimization and binary particle swarm optimization. To achieve a significant result, they have used a real-world taxi-tracking simulation and have performed two illustrative tasks, namely online video streaming and object identification. In [16], the authors have studied a method for optimizing execution delay and resource utilization in a multi-user and multi-server scenario based on the partheno-genetic algorithm, which utilizes heuristic rules. The proposed algorithm decides where to offload the task and also determines the computation sequence in the server. [17] proposes an ant colony-based scheduling algorithm that manages the idle capacities of smart vehicles without the help of any other demanding infrastructure, intending to use them efficiently.

Another state-of-the-art method is machine learning, which generally intends to improve the performance of an algorithm through practice. To enhance the performance of next-generation vehicular networks by reducing the computation delay, the authors of [18] have suggested the k-nearest neighbor algorithm, which decides where to execute the task: locally, at an edge server, or in the cloud.

Additionally, RL is commonly utilized to solve the resource allocation optimization problem in offloading situations. In this context, the authors of [6] have introduced an offloading method called adaptive learning-based task offloading (ALTO), in which they have used multi-armed bandit (MAB) theory for the purpose of minimizing the vehicle-to-vehicle (V2V) offloading delay. In [19], the authors have proposed a semi-Markov decision process (MDP) model for vehicular cloud computing (VCC), which considers heterogeneous vehicles and roadside units (RSUs), and have introduced a technique for determining the optimal strategy of VCC resource apportionment. The authors of [20] have developed a novel computation offloading framework for air-ground integrated VEC, called the learning-based intent-aware upper confidence bound (IUCB) algorithm. [21] has proposed a knowledge-driven computation offloading decision scenario that can adapt to different conditions, including environment changes, and then find the optimal solution directly from the environment via deep RL (DRL). A vehicle-assisted offloading system concerning stochastic vehicle transactions, dynamic execution requests, and uncertainty of communication conditions has been suggested in [22]. Intending to maximize the long-term utility of the VEC system, the authors have presented an optimization problem as a semi-Markov process and introduced two RL approaches, namely Q-learning and DRL. [23] has proposed a DRL-based offloading scheme, which represents the offloading policy with a deep neural network (DNN) and trains the DNN with the proximal policy optimization (PPO) algorithm without awareness of the environment dynamics. In order to save energy and provide efficient utilization of shared resources between user equipments (UEs), the authors of [24] have built an equivalent RL model and proposed a distributed deep learning algorithm to find near-optimal offloading decisions, in which a set of DNNs is used in parallel. In [25], the authors have proposed an intelligent offloading system for VEC by using DRL, where computation scheduling and resource allocation problems are formulated as a joint optimization problem to maximize QoE. In addition, a two-sided matching scheme and a DRL approach are developed to schedule offloading applications and designate network resources, respectively. The authors of [26] have designed an intelligent computation offloading system based on deep Q-network (DQN), where a software-defined network is introduced to achieve information collection. [27] investigates a two-layer unmanned aerial vehicle (UAV) maritime communication network with a centralized UAV on the top and a cluster of distributed bottom UAVs, with the purpose of solving the latency minimization problem for both communication and computation of a mobile edge computing (MEC) network, utilizing deep Q-network and deep deterministic policy gradient algorithms. The authors of [28] present a decentralized computation offloading solution based on the attention-weighted recurrent multi-agent actor-critic algorithm to tackle the challenges of large-scale mixed cooperative-competitive MEC environments. [29] presents a broad architecture for fast-adaptive resource allocation in dynamic vehicular situations, in which the dynamics of the vehicular environment are described as a sequence of connected MDPs and handled by hierarchical RL with meta-learning.

Motivated by these existing research studies [1]-[29], which mainly ignore both the collaboration between vehicles and RSUs in sharing their computational resources as edge servers and the handover issue caused by mobility during computation offloading, the main contributions of this article are summarized as follows:

1) We propose an intelligent task offloading scenario for highly dynamic vehicular environments including smart vehicles and RSUs, which can cooperate to perform resource sharing. Furthermore, in order to increase the QoS and QoE, we apply a handover-enabled partial offloading strategy, which allows the vehicle to search for suitable servers at certain distances according to its location and the state of the environment.

2) To make an efficient execution decision, we define a cost function that calculates the execution cost of the target task, considering energy consumption, transmission delay, and processing delay. Then, we formulate the task offloading problem as an optimization problem and design an algorithm based on DDQN. The proposed algorithm aims to minimize the average weighted cost of computation of a target task with respect to the priorities of UEs (which could be computation delay and/or energy consumption); therefore, it allows UEs to learn the offloading cost performance by observing the environment and guides them at each step toward choosing the best edge server (whether a vehicle or an RSU) for offloading.

3) The proposed algorithm consists of an evaluation neural network and a target neural network, where the former produces the action value for each computation offloading step while the latter produces the target Q-values for training the parameters of the proposed algorithm. To analyze the effectiveness and performance of the proposed algorithm, we perform a series of experiments and comparisons with existing offloading methods in the literature.

Fig. 1. An illustration of a vehicular edge computing scenario in an urban environment, where the target vehicle either offloads parts of the target task to different edge servers, including neighboring vehicles and RSUs in each area, or executes it locally after making a series of decisions at certain distances through the proposed handover-enabled intelligent task offloading method.

The remainder of the article is organized as follows: We introduce the system model and formulate the problem in Section III. We propose the solution for the computation offloading problem in Section IV. In Section V, simulation results are presented. Finally, the paper is concluded in Section VI.

III. SYSTEM MODEL

In this section, we propose an intelligent task offloading framework for vehicular environments. VEC assists resource-restricted smart devices to improve their performance through task offloading, where the user can send computationally intensive applications to a remote edge server to be executed. As shown in Fig. 1, the considered vehicular environment consists of smart vehicles and fixed RSUs, which have the capability of sharing their idle resources with neighboring UEs. The target vehicle generates a target task at the start point of its target route; it then tries to execute the task, according to its situation in the environment, by offloading it to an edge server, which could be a vehicle or an RSU, or by executing it locally. Due to its mobility, the target vehicle explores the environment again after a certain distance (area) and executes another part of the remaining target task in the new area. This procedure continues until the whole task has been executed. Key notations and corresponding definitions used throughout the paper are listed in Table I. In this section, the system architecture, movement characteristics of vehicles, communication model, and computation procedure are defined in detail.

A. System Architecture

In this work, we consider a communication environment in which vehicles and RSUs operate as supporting infrastructure for computation. A set of presumptions is made for the communication environment. We define a set of vehicles $\mathcal{V}$ as
$$\mathcal{V} = \{V_1, V_2, \ldots, V_I\}, \qquad (1)$$

TABLE I
KEY NOTATIONS AND DEFINITIONS

$V_i$: vehicle edge servers
$B_j$: roadside units
$V_T$: target vehicle
$P_{V_T}, P_{V_i}, P_{B_j}$: transmission power of the target vehicle, edge vehicles, and edge RSUs
$R_{V_T,V_i}, R_{V_T,B_j}$: data transmission rate between $V_T$ and the edge servers
$R_{B_j,V_T}, R_{V_i,V_T}$: data transmission rate between the edge servers and $V_T$
$D_{loc}, E_{loc}$: local computing delay and energy consumption
$D_{edge}$: delay of processing the task in an edge server
$D_{UL,V_i}, D_{UL,B_j}$: uplink transmission delay
$D_{DL,V_i}, D_{DL,B_j}$: downlink transmission delay
$D_{off}, E_{off}$: edge computing delay and energy consumption
$D_{tot}, E_{tot}$: delay and energy of whole-task completion
$f_{edge}$: CPU frequency of the selected edge server ($f_{V_i}$, $f_{B_j}$)
$\alpha$: uplink transmission overhead coefficient
$\beta$: joint downlink transmission overhead coefficient and output/input data ratio
$\rho$: CPU processing overhead
$L$: input task size
$\mu_{V_i}, \mu_{B_j}$: task processing size coefficients of each edge server
$\delta_D, \delta_E$: weights of delay and energy consumption

where $\mathcal{V}$ contains $I$ vehicles, each aware of its features including speed, current position, moving direction, CPU frequency, and availability for sharing resources. These features can be shared with neighboring UEs via the dedicated short-range communication (DSRC) protocol, which allows sending periodic beaconing messages. Note that in this paper, we ignore data traffic and queuing issues; therefore, we assume that each candidate server is able to accept only one task at a time, which we call availability.

Furthermore, we define a set $\mathcal{B}$ of RSUs, which are static nodes, as
$$\mathcal{B} = \{B_1, B_2, \ldots, B_J\}, \qquad (2)$$
where $J$ is the number of deployed RSUs.

The vehicular environment that we consider for simulation is designed to resemble urban areas, so the system includes a varying number of vehicles with various speeds. To make the study more feasible and realistic, not all vehicles are assumed to be connected to the network during the entire process; each of them enters the network at a specific time and location and exits from it at a later time.

B. Movement Characteristics of Vehicles

The mobility model contributes greatly to the accurate simulation of VEC networks. Therefore, this paper considers a modified movement model that follows the principles of the Manhattan mobility model [30]. According to this model, and as illustrated in Fig. 1, vehicles move in a vertical or horizontal direction on an urban map. We use a probabilistic strategy to choose the movement direction at each intersection of vertical and horizontal roads. In contrast to the Manhattan model, which assumes a fixed probability of 0.5 for continuing straight and 0.25 for turning right or left, we assign higher probabilities to several points that represent the more appealing parts of the environment, such as shopping centers or schools. Furthermore, we designate several intersections along the route as red traffic lights. When a vehicle arrives at one of these points, it stays there for a specified period of time before continuing on its way. Note that the advantage of assuming attractive points is that the target vehicle enters areas with a high density of vehicles, resulting in a large number of candidates to choose from. Consequently, the algorithm learns to make more accurate offloading decisions.
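
As an illustration of the biased intersection rule (a sketch under our own assumptions; only the 0.5/0.25/0.25 baseline comes from the Manhattan model, the boosted weights are hypothetical):

```python
import random

def choose_direction(at_attractive_point=False, toward_attraction=None):
    """Pick a movement direction at an intersection. Baseline Manhattan
    weights are (straight, right, left) = (0.5, 0.25, 0.25); near an
    assumed attractive point, the direction leading to it is boosted."""
    directions = ["straight", "right", "left"]
    weights = [0.5, 0.25, 0.25]
    if at_attractive_point and toward_attraction in directions:
        weights = [0.2, 0.2, 0.2]
        weights[directions.index(toward_attraction)] = 0.6  # hypothetical boost
    return random.choices(directions, weights=weights, k=1)[0]
```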

C. Communication Model

Communication between the vehicles and RSUs takes place over a wireless channel. Assuming a block-fading Rayleigh channel, the data transmission rate between the target vehicle $V_T$ and the selected edge server $V_i$ or $B_j$ is given by
$$R_{V_T,V_i} = R_{V_T,B_j} = W_0 \log_2\!\left(1 + \frac{P_{V_T} \cdot d^{-\xi} \cdot |h|^2}{N_0}\right), \qquad (3)$$
where $W_0$, $P_{V_T}$, $d$, $\xi$, $h$, and $N_0$ represent the channel bandwidth, transmit power of $V_T$, distance of $V_T$ from the selected edge server, path loss exponent, channel fading coefficient, and additive white Gaussian noise (AWGN) power, respectively [14]. Furthermore, the transmission rates from an RSU and from a vehicle edge server can be modeled respectively as
$$R_{B_j,V_T} = W_0 \log_2\!\left(1 + \frac{P_{B_j} \cdot d^{-\xi} \cdot |h|^2}{N_0}\right), \qquad (4)$$
$$R_{V_i,V_T} = W_0 \log_2\!\left(1 + \frac{P_{V_i} \cdot d^{-\xi} \cdot |h|^2}{N_0}\right), \qquad (5)$$
where $P_{B_j}$ denotes the transmit power of each RSU and, similarly, $P_{V_i}$ denotes the transmit power of the vehicle edge servers. Note that the wireless channel state is assumed to be static throughout the computation offloading during the channel coherence time. Additionally, we model the RSUs deployed in the communication area along the route as properly placed, with separations large enough that their coverage areas do not interfere much with each other compared to the channel fading and noise power.
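
For concreteness, (3)-(5) can be evaluated as below (a minimal sketch; the numeric values follow the simulation setup of Section V, while the distance and fading gain are assumptions):

```python
import math

def tx_rate(W0, P_tx, d, xi, h2, N0):
    """Rate over the block-fading Rayleigh link of (3)-(5):
    W0 * log2(1 + P_tx * d^(-xi) * |h|^2 / N0)."""
    return W0 * math.log2(1.0 + P_tx * d ** (-xi) * h2 / N0)

# W0 = 10 MHz, P_tx = 0.1 W, N0 = 1 (Section V); d = 100 m, xi = 2,
# |h|^2 = 1 are illustrative assumptions.
rate_bps = tx_rate(W0=10e6, P_tx=0.1, d=100.0, xi=2.0, h2=1.0, N0=1.0)
```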

D. Task Generation and Ofﬂoading Strategy

In this paper, we consider an application model in which each task generated by a vehicle or pedestrian is represented by two parameters, $L$ and $\rho$, where $L$ denotes the input data size of the task in bits and $\rho$ the processing overhead in CPU cycles per bit. These parameters influence energy consumption and execution time, whether in local computing or computation offloading. Furthermore, due to the large size of intelligent applications (e.g., image or video processing using deep learning methods), we assume the partial offloading method, where each generated target task can be processed in several parts with different sizes. Therefore, the target vehicle $V_T$ is able to decide either to offload each part to a suitable edge server or to execute it locally, according to the state of the vehicle in the environment. Consequently, the offloading strategy for the target task can be characterized as follows:

1) Initially, a target vehicle of interest $V_T$, or the smart device of a pedestrian in a vehicle, generates an intelligent target task with a large size.

2) $V_T$ begins to investigate the neighboring environment by sending beaconing messages. Afterward, based on the obtained information, such as CPU frequency and availability for accepting a computation, it prepares a list of suitable edge servers including the RSUs and smart vehicles in the communication range. Using this list, it is able to make a decision among candidates, offload the task for edge computing, and receive the result. Note that if there is no suitable candidate in the area, the target vehicle executes the task locally until the next server exploration.

3) Due to the high mobility, and to prevent the data loss that may occur when the communication range changes, the target vehicle explores the environment at regular distances, called areas (Fig. 1), along its target route and updates the candidate list periodically. After a certain assumed distance is exceeded, the environment transitions to a new state, so that the list of possible candidate servers changes. In this case, the target vehicle offloads the remaining part of the task that was not executed during the passage of the previous area. In the case of a red traffic light, the target vehicle remains in that area temporarily, as if in motion, discovers the environment for new candidate servers at specific time intervals, offloads the task, and then continues on the target route.

4) This process continues until the whole target task has been executed.

E. Local Computing Model

As explained in the System Architecture subsection, the system model contains different types of UEs with various computation capacities. In situations where the target UE is not able to find a suitable edge server, it decides to execute the task locally until the next server exploration. For each UE, the computation delay and energy consumption can be determined respectively as
$$D_{loc} = \frac{\rho \cdot L}{f_{loc}}, \qquad (6)$$
$$E_{loc} = P_{CPU} \cdot \rho \cdot L, \qquad (7)$$
where $f_{loc}$ is the CPU cycle frequency, or computation speed, of $V_T$, and $P_{CPU}$ indicates the power consumption per CPU cycle, calculated as $P_{CPU} = \kappa \cdot f_{loc}^2$, with $\kappa$ denoting a power consumption coefficient that depends on the chip architecture [31].
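
A minimal numerical reading of (6)-(7) (our sketch; $\rho$ and $\kappa$ match the later simulation setup, while the 2 GHz local CPU is an assumption):

```python
def local_cost(L_bits, rho, f_loc, kappa=1e-27):
    """Local computing delay (6) and energy (7), with per-cycle power
    P_CPU = kappa * f_loc^2."""
    D_loc = rho * L_bits / f_loc     # seconds
    P_cpu = kappa * f_loc ** 2       # per-cycle power
    E_loc = P_cpu * rho * L_bits     # Joules
    return D_loc, E_loc

# 16 Mbit task, 1000 cycles/bit, assumed 2 GHz CPU -> (8.0 s, 64.0 J)
D, E = local_cost(L_bits=16e6, rho=1000, f_loc=2e9)
```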

F. Computation Ofﬂoading Model

In the proposed computation offloading scenario, $V_T$ first identifies the neighboring vehicles and RSUs inside each area, then selects one of the candidates and offloads the task for execution. Finally, it receives the result. The computation delay at the edge server consists of three parts: the uplink (UL) transmission delay, the execution delay in the edge server $D_{edge}$, and the downlink (DL) transmission delay. The execution delay in the edge server can be calculated as
$$D_{edge} = \frac{\rho \cdot L}{f_{edge}}, \qquad (8)$$
where $f_{edge}$ is the computation speed at the selected edge server. Similarly, the delays in the UL and DL transmissions can be expressed respectively as
$$D_{UL,V_i} = \frac{\alpha \cdot L}{R_{V_T,V_i}}, \qquad D_{UL,B_j} = \frac{\alpha \cdot L}{R_{V_T,B_j}}, \qquad (9)$$
$$D_{DL,V_i} = \frac{\beta \cdot L}{R_{V_i,V_T}}, \qquad D_{DL,B_j} = \frac{\beta \cdot L}{R_{B_j,V_T}}, \qquad (10)$$
where $\alpha$ denotes the UL transmission overhead coefficient, while $\beta$ is a coefficient modeled jointly by the DL transmission overhead and the ratio of output to input data size [32]. Note that we assign these weights because the amount of processed data is generally smaller than the raw data. For instance, in an image processing task, the input data may be a high-quality image with a large size, but the feedback could be a yes or no, which is very small and even negligible. Consequently, the total delay of task execution during computation offloading can be given by
$$D_{off} = D_{UL} + D_{edge} + D_{DL}. \qquad (11)$$
The energy consumption during computation offloading to a vehicle edge server and to an RSU edge server can be expressed respectively as
$$E_{off,V} = \frac{P_{V_T} \cdot \rho \cdot L}{R_{V_T,V_i}}, \qquad E_{off,B} = \frac{P_{V_T} \cdot \rho \cdot L}{R_{V_T,B_j}}. \qquad (12)$$
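
The per-area edge cost of (8)-(12) can then be sketched as follows (an illustrative helper, not the authors' code; the uplink and downlink rates could come from the rate sketch in Section III-C):

```python
def edge_cost(L_bits, rho, alpha, beta, f_edge, R_up, R_down, P_tx):
    """Offloading delay (11) and energy (12) for one task of L_bits bits."""
    D_ul = alpha * L_bits / R_up         # uplink delay, (9)
    D_edge = rho * L_bits / f_edge       # execution delay, (8)
    D_dl = beta * L_bits / R_down        # downlink delay, (10)
    E_off = P_tx * rho * L_bits / R_up   # offloading energy, (12)
    return D_ul + D_edge + D_dl, E_off
```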

Since the proposed method is based on handover, the delay and energy consumption of each area are different and depend on the CPU frequency of the selected edge server and the size of the transmitted data. Therefore, we can calculate the total delay for computing the whole task of size $L$ as
$$D_{tot} = \sum_{i=1}^{I}\left(\frac{\alpha L}{R_{V_T,V_i}} + \frac{\rho\,\mu_{V_i} L}{f_{V_i}} + \frac{\beta\,\mu_{V_i} L}{R_{V_i,V_T}}\right) + \sum_{j=1}^{J}\left(\frac{\alpha (1-\mu_{V_i}) L}{R_{V_T,B_j}} + \frac{\rho\,\mu_{B_j} L}{f_{B_j}} + \frac{\beta\,\mu_{B_j} L}{R_{B_j,V_T}}\right) + n\left(\frac{\rho\,\mu_{loc} L}{f_{loc}}\right), \qquad (13)$$
where $V_i$ for $i \in \{1, 2, \ldots, I\}$ denotes the index of the selected vehicle, $B_j$ for $j \in \{1, 2, \ldots, J\}$ denotes the index of the selected RSU, and $n$ is the number of areas in which the vehicle executes the task locally. Moreover, $\mu_{V_i} L$ is the part of the task executed in vehicle $V_i$, $\mu_{B_j} L$ is the part executed in RSU $B_j$, and $\mu_{loc} L$ is the size of the task computed locally. Therefore, we have $0 \leq \mu_{V_i}, \mu_{B_j}, \mu_{loc} \leq 1$ and
$$\sum_{i=1}^{I} \mu_{V_i} + \sum_{j=1}^{J} \mu_{B_j} + n \cdot \mu_{loc} = 1. \qquad (14)$$

Note that since the target vehicle is not aware of the conditions in the following steps and of the amount of data that will be processed in the next area, it is not able to divide the task into small parts before offloading. Hence, the trick that we use in this situation is to offload only the remaining part of the target task in each step, decreasing the delay and energy consumption caused by the large data size. That is, in each area, after receiving the feedback from the edge server or local processing unit, the algorithm subtracts the amount of processed task size from the target task and offloads the entire remaining part in the following state. Fig. 2 illustrates this procedure.

Fig. 2. An illustration of the handover-enabled dynamic computation offloading algorithm. Example explanation for area 1: $V_T$ generates the task with size $L$ at the start point and starts discovering the environment, building a candidate list of available edge servers from the beaconing-message information. Assuming three appropriate vehicle candidates in this area plus local computing, $V_T$ has four possible actions and the environment could transition into four states. $V_T$ selects a random action (e.g., $V_1$) and offloads the whole task $L$ to $V_1$. $V_1$ executes the $L \cdot \mu_{V_1}$ portion of the task and sends the feedback to $V_T$ after a certain distance (e.g., $\mu_{V_1} = 0.02$ and traveled distance = 200 meters at constant speed). $V_T$ receives a reward or penalty according to the cost (which is related to the data rate and CPU frequency of the selected edge server). The environment transitions into state 2, and the remaining task size is $L - L \cdot \mu_{V_1} = 0.98\,L$.

Similar to the logic of (13) and assuming (14), we can define the total consumed energy for computing a task as
$$E_{tot} = \sum_{i=1}^{I}\left(\frac{P_{V_T} \cdot \rho \cdot L}{R_{V_T,V_i}}\right) + \sum_{j=1}^{J}\left(\frac{P_{V_T} \cdot \rho \cdot (1-\mu_{V_i}) \cdot L}{R_{V_T,B_j}}\right) + n\left(P_{CPU} \cdot \rho \cdot \mu_{loc} L\right), \quad 0 \leq \mu_{V_i}, \mu_{B_j}, \mu_{loc} \leq 1. \qquad (15)$$
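
A minimal sketch of this remaining-task bookkeeping (ours; it assumes, as in the Fig. 2 example, that each area executes a mu-fraction of the original task size L):

```python
def run_route(L, mus):
    """Handover bookkeeping: each area executes a mu-fraction of the
    original task size L; the vehicle offloads only the remaining part
    in the next state (Fig. 2 logic, with sum(mus) = 1 by (14))."""
    remaining = L
    for mu in mus:
        remaining -= mu * L          # subtract the processed portion
        if remaining <= 0:
            break
    return max(remaining, 0.0)

left = run_route(L=16e6, mus=[0.02, 0.18, 0.30, 0.50])  # assumed fractions
```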

G. Problem Formulation

In this article, we propose a method for making intelligent computation offloading decisions in dynamic vehicular environments with the aim of minimizing execution delay and energy consumption. Considering the preceding subsections, and taking into account $D_{tot}$ and $E_{tot}$ defined in (13) and (15), respectively, we can formulate the cost function as a function of the total delay for processing the whole task and of the energy consumption for offloading in time period $t$. Accordingly, it can be defined as
$$C(t) = \delta_D \cdot D_{tot}(t) + \delta_E \cdot E_{tot}(t), \quad 0 < \delta_D, \delta_E < 1, \qquad (16)$$
where $\delta_D$ and $\delta_E$ are the weights of delay (1/second) and energy consumption (1/Joule), respectively. These weights allow the users to make decisions with different priorities. As an illustration, if the user has to process a highly delay-sensitive task, it may give priority to delay by assigning a higher value to $\delta_D$; conversely, if the task generator is the smart device of a pedestrian in the vehicle, which has a limited source of energy, it may assign a higher value to $\delta_E$.
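
As a worked sketch of (16) with assumed magnitudes (the delay and energy numbers are hypothetical), the weighting reads:

```python
def offloading_cost(D_tot, E_tot, delta_D, delta_E):
    """Weighted cost of (16); delta_D + delta_E = 1 per constraint C1."""
    return delta_D * D_tot + delta_E * E_tot

# Same task (assumed 8 s delay, 64 J energy) under two priorities:
c_delay_first  = offloading_cost(8.0, 64.0, delta_D=0.8, delta_E=0.2)  # 19.2
c_energy_first = offloading_cost(8.0, 64.0, delta_D=0.2, delta_E=0.8)  # 52.8
```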

Assuming a total of $T$ consecutive time periods, the main purpose is to minimize the average cost of computation offloading by guiding the user toward choosing a suitable edge server. In this regard, the computation offloading problem is formulated as the following optimization problem:
$$C_{min}(t) = \arg\min_{C} \left(\frac{1}{T}\sum_{t=1}^{T} C(t)\right). \qquad (17)$$
In order to find $C_{min}(t)$, we need to minimize $D_{tot}$ and $E_{tot}$ at the same time:
$$C_{min}(t) = \arg\min_{D_{tot},\,E_{tot}} \left(\frac{1}{T}\sum_{t=1}^{T} \delta_D \cdot D_{tot}(t) + \delta_E \cdot E_{tot}(t)\right) \qquad (18)$$

$$\begin{aligned}
\text{s.t. } C_1&: 0 \leq \delta_D \leq 1,\ 0 \leq \delta_E \leq 1,\ \delta_E + \delta_D = 1\\
C_2&: f_{B_j,min}\ \text{(GHz)} \leq f_{B_j} \leq f_{B_j,max}\ \text{(GHz)}\\
C_3&: f_{V_i,min}\ \text{(GHz)} \leq f_{V_i} \leq f_{V_i,max}\ \text{(GHz)}\\
C_4&: d_{min}\ \text{(m)} \leq d \leq d_{max}\ \text{(m)}\\
C_5&: D_{tot}(t) \leq D_{loc}(t) = D_{deadline}\\
C_6&: 0 \leq \mu_{V_i}, \mu_{B_j}, \mu_{loc} \leq 1\\
C_7&: \sum_{i=1}^{I} \mu_{V_i} + \sum_{j=1}^{J} \mu_{B_j} + n \cdot \mu_{loc} = 1
\end{aligned}$$

By substituting $D_{tot}$ and $E_{tot}$ given in (13) and (15), respectively, into (18), the optimization problem of computation offloading with optimization variables $f_{V_i}, f_{B_j}, R_{V_T,V_i}, R_{V_T,B_j}, R_{V_i,V_T}, R_{B_j,V_T}$ is defined in (19). Therefore, the problem depends on the CPU frequencies of the edge servers, the data transmission rates, and consequently the distances between the selected servers and the target vehicle.

Fig. 3. Double deep Q-network application for the proposed urban scenario.

$$\begin{aligned}
C_{min}(t) = \arg\min_{f_{V_i}, f_{B_j}, R_{V_T}, R_{V_i}, R_{B_j}} \frac{1}{T}\sum_{t=1}^{T}\Bigg[&\delta_D\Bigg(\sum_{i=1}^{I}\Big(\frac{\alpha L}{R_{V_T,V_i}} + \frac{\rho \mu_{V_i} L}{f_{V_i}} + \frac{\beta \mu_{V_i} L}{R_{V_i,V_T}}\Big) + \sum_{j=1}^{J}\Big(\frac{\alpha (1-\mu_{V_i}) L}{R_{V_T,B_j}} + \frac{\rho \mu_{B_j} L}{f_{B_j}} + \frac{\beta \mu_{B_j} L}{R_{B_j,V_T}}\Big) + \frac{n \rho \mu_{loc} L}{f_{loc}}\Bigg)\\
&+ \delta_E\Bigg(\sum_{i=1}^{I}\frac{P_{V_T} \rho L}{R_{V_T,V_i}} + \sum_{j=1}^{J}\frac{P_{V_T} \rho (1-\mu_{V_i}) L}{R_{V_T,B_j}} + n P_{CPU} \rho \mu_{loc} L\Bigg)\Bigg]. \qquad (19)
\end{aligned}$$

Note that we study the problem in discrete time; therefore, the target task is executed at distributed time steps $t = 1, 2, \ldots, T$ and must be fully executed by the end of the time period $t = T$.

IV. PROBLEM SOLUTION USING DDQN ALGORITHM

A. Background on Markov Decision Processes

In order to solve the computation offloading problem given in (19), which requires a steady decision-making solution, we first describe the problem as a discounted Markov decision process (MDP) [33] defined by $(\mathcal{S}, \mathcal{A}, \mathcal{P}, Pr, R, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{P}$ is the probability density over states, $Pr: \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ is the Markov transition probability kernel, $R: \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathbb{R})$ is the distribution of the reward, and $\gamma$ is the discount factor ($0 < \gamma \leq 1$), which determines how much importance is assigned to immediate and future rewards. In particular, for taking any action $a \in \mathcal{A}$ at any state $s \in \mathcal{S}$, $Pr(\cdot \mid s, a)$ corresponds to the probability distribution of the next state and $R(\cdot \mid s, a)$ to the distribution of the immediate reward. The policy $\pi: \mathcal{S} \to \mathcal{A}$ of the MDP maps each state $s$ to a probability distribution over the action set $\mathcal{A}$. Furthermore, the expected action-value function $Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ can be defined over the expectation operator $\mathbb{E}[\cdot]$ as
$$Q(s, a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]. \qquad (20)$$
For each given action-value function $Q$ and each state $s$, the greedy policy $\pi^*$, which chooses the action with the highest $Q$, satisfies
$$\pi^*(a \mid s) = \operatorname*{argmax}_{a \in \mathcal{A}} Q(s, a). \qquad (21)$$
In order to find the optimal policy that achieves the largest reward, the optimal action-value function $Q^*$ can be described as
$$Q^*(s, a) = \max_\pi Q(s, a). \qquad (22)$$
Therefore, the deterministic optimal policy $\pi^*$ can be obtained by maximizing over $Q^*(s, a)$. These optimal value functions are recursively connected by the Bellman optimality equations [22], and they need to satisfy the conditions of these equations, which define the highest reward an agent can earn if it makes the optimal decision at the current state and all upcoming states.

B. Double Deep Q-Learning-based Decision Making

In online interaction-based RL algorithms, the estimation of the action-value function brings uncertainty. The DDQN algorithm has three main differences from other RL algorithms that together provide a solution for these instabilities. The first difference is that the DDQN algorithm implements the optimal action estimation function via a DNN instead of using a Q-table. Secondly, the algorithm uses experience replay [34], [35], which, as shown in Fig. 3, samples the trajectory of the MDP, consisting of states $S_t$, actions $A_t$, rewards $R_t$, and the following state $S_{t+1}$, from the replay memory $\mathcal{M}$ at each iteration as observations and then uses them to train the parameters of the DNNs. Experience replay helps to achieve steadiness in uncertain environments by removing the time dependency of observations. Finally, the third difference is the use of two neural networks, which prevents overestimation of action values by separating the action selection and the strategy evaluation procedures [36].


According to the information given above, our proposed algorithm, shown step by step in Algorithm 1, can be explained as follows. Computation offloading decisions follow a distributed procedure. Therefore, we concentrate on a single target vehicle of interest that has a large target task, which can be broken into parts of different sizes, where each part can be computed locally or on different servers. Following the generation of a task, the target vehicle starts to explore the neighboring vehicles and RSUs within the communication range and builds a candidate list accordingly. Note that in the proposed computation offloading scenario, we may not always have an RSU in each area, because we deploy them randomly; this enables the algorithm to decide more wisely in various circumstances. We define the state set $\mathcal{S}$ to include the number of available RSUs, the number of available vehicles, the location and CPU frequency of each RSU and vehicle, the speed and moving direction of each vehicle, and the remaining task size for execution. Also, we define the action set $\mathcal{A}$ as the vehicles and RSUs available for offloading. Since the main goal of this study is to achieve minimum cost during the computation of a task, which yields shorter delay and less energy consumption, we define the reward function based on the cost of computing a task in each area as
$$R_t = \frac{1}{C_{min}^{current}(t)}, \qquad (23)$$
where $C_{min}^{current}(t)$ is the minimum computation cost for the current state. Note that the aim of the algorithm is to find an optimal policy that maximizes the cumulative reward over a long run, over the time period $t = 0, \ldots, T$. In addition, the proposed DDQN algorithm uses the $\varepsilon$-greedy policy, which balances exploration and exploitation in the environment [37]. When the algorithm chooses exploration, it investigates the environment and collects information that can lead to better offloading decisions by sending tasks to different candidates. On the other hand, exploitation prefers to exploit the promising information found when the algorithm performed the exploration.
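
A minimal sketch of this ε-greedy rule (illustrative; in the proposed algorithm the Q-values would come from the evaluation network, and ε is the exploration probability of Algorithm 1):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Explore a random candidate with probability epsilon; otherwise
    exploit the action with the highest estimated value."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))  # exploration
    return int(np.argmax(q_values))                   # exploitation
```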

As mentioned above, the DDQN algorithm prevents the selection of overestimated values that results from using the same action value both to evaluate and to make the decision [11]. Therefore, the algorithm consists of two DNNs. One is the evaluation network $Q_\theta$, with parameter $\theta$, and the other is the target network $Q_{\theta'}$, with parameter $\theta'$, where $\theta$ is a parameter based on Q-learning and both networks have the same structure. The role of parameter $\theta$ is to select the action with the maximum action value, while the parameter $\theta'$ is used to evaluate the action value of the optimal action. The target network parameter $\theta'$ is updated every $T$ steps with the value of the evaluation network parameter, $\theta' = \theta$, and remains fixed for the next $T$ steps. During each iteration of the DDQN, transitions of the MDP $(S_t, A_t, R_t, S_{t+1})$ are stored in the replay memory $\mathcal{M}$; then a random mini-batch of $B$ independent samples from $\mathcal{M}$ is selected for training the parameters of the neural networks.
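
Such a replay memory can be sketched as below (ours, not the authors' implementation; the default capacity matches the stated M = 1024):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of MDP transitions (s, a, r, s_next) from
    which i.i.d. mini-batches are drawn for training."""
    def __init__(self, capacity=1024):
        self.buffer = deque(maxlen=capacity)  # old transitions drop out

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```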

Algorithm 1: DDQN-Based Task Offloading Algorithm
Input: evaluation network $Q_\theta$, target network $Q_{\theta'}$, replay memory $\mathcal{M}$, mini-batch size $B$, exploration probability $\epsilon$, discount rate $\gamma$, $T$
1:  for learning episode $x = 1, \ldots, X$ do
2:    while $L > 0$ do
3:      explore the environment for candidates
4:      if no candidate is found in the range then
5:        execute the task locally, $L = L -$ executed task
6:      else
7:        for each evaluation step do
8:          observe state $s_t$ and select $a_t \sim \pi(a_t, s_t)$
9:          execute $a_t$, set $L = L -$ executed task, observe next state $s_{t+1}$ and reward $r_t$
10:         store $(s_t, a_t, r_t, s_{t+1})$ in experience memory $\mathcal{M}$
11:       end for
12:     end if
13:     for each target step do
14:       sample a random mini-batch $B_\tau = (s_\tau, a_\tau, r_\tau, s_{\tau+1}) \sim \mathcal{M}$
15:       compute the target value for each $\tau \in [B]$: $Y_\tau = r_\tau + \gamma \max_{a \in \mathcal{A}} Q_{\theta'}(s_{\tau+1}, a)$
16:       update the evaluation network by a gradient descent step on $[Y_\tau - Q_\theta(s_\tau, a_\tau)]^2$
17:       update the target network parameters every target step $T$: $\theta' \leftarrow \theta$
18:     end for
19:   end while
20: end for

Furthermore, with independent samples $(s_\tau, a_\tau, r_\tau, s_{\tau+1})_{\tau \in [B]}$ from $\mathcal{M}$, we can calculate the target $Y$ from the target network as
$$Y_\tau \equiv r_{\tau+1} + \gamma\, Q\!\left(s_{\tau+1}, \operatorname*{argmax}_{a \in \mathcal{A}} Q(s_{\tau+1}, a; \theta_\tau);\, \theta'_\tau\right) \qquad (24)$$
and calculate the evaluation network parameter $\theta$ by performing a gradient step on the loss function $L(\theta)$ defined by
$$L(\theta) = \frac{1}{B}\sum_{\tau=1}^{B}\left[Y_\tau - Q_\theta(s_\tau, a_\tau)\right]^2 \qquad (25)$$
as the mean-squared error between the target value and the evaluated Q-value. Taking $R_\tau$ to be the immediate reward for action $A_\tau$ and $S_{\tau+1}$ to be the next state, the Bellman optimality equation can be presented as
$$(\vartheta Q_{\theta'})(S_\tau, A_\tau) = \mathbb{E}\left[Y_\tau \mid S_\tau, A_\tau\right], \qquad (26)$$
where $\vartheta$ denotes the Bellman optimality operator. Then, we consider $\theta' = \theta$ and calculate the expected value of the loss function in terms of the mean-squared Bellman error and the variance of $Y$ as
$$\mathbb{E}[L(\theta)] = \left\|Q_\theta - \vartheta Q_{\theta'}\right\|^2 + \mathbb{E}\!\left[\left[Y_\tau - (\vartheta Q_{\theta'})(S_\tau, A_\tau)\right]^2\right]. \qquad (27)$$
Since the target $Y$ does not rely on $\theta$, the minimization of the loss function can be solved by
$$\min_\theta \left\|Q_\theta - \vartheta Q_{\theta'}\right\|^2, \qquad (28)$$
where DDQN aims to solve this minimization problem in order to reach the optimal action-value function $Q^*$ that leads to finding the optimal policy $\pi^*$ and, consequently, making the best computation offloading decision $A^*$.

TABLE II
SIMULATION STATE-SPACE PARAMETERS EXAMPLE

vehicle ID | entrance location | leaving location | moving direction | CPU frequency | availability | speed
Target vehicle | 0 | 10000 | → | 2 GHz | available | 60 km/h
1 | 220 | 350 | → | 2.5 GHz | available | 100 km/h
2 | 480 | 3700 | → | 4 GHz | available | 40 km/h
3 | 21 | 1200 | ← | 3.6 GHz | not available | 85 km/h
4 | 6700 | 10000 | → | 4.5 GHz | available | 65 km/h
5 | 9200 | 9500 | ← | 7 GHz | available | 110 km/h
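
Since the simulations below use TensorFlow, (24)-(25) can be sketched as follows (a hedged illustration only, not the authors' implementation; q_eval and q_target are assumed to be Keras models mapping state batches to per-action values):

```python
import tensorflow as tf

def ddqn_targets(q_eval, q_target, r, s_next, gamma=0.9):
    """Double-DQN target of (24): the evaluation network selects the
    greedy next action; the target network evaluates it."""
    a_star = tf.argmax(q_eval(s_next), axis=1)                  # selection (theta)
    q_next = tf.gather(q_target(s_next), a_star, batch_dims=1)  # evaluation (theta')
    return r + gamma * q_next

def ddqn_loss(q_eval, s, a, y):
    """Mean-squared error of (25) between targets and evaluated Q-values."""
    q_sa = tf.gather(q_eval(s), a, batch_dims=1)
    return tf.reduce_mean(tf.square(y - q_sa))
```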

C. Time Complexity

In this section, we investigate the time complexity of the proposed DDQN-based algorithm. Generally, the complexity of RL algorithms depends on the size and properties of the state space and on the initial information of the agents [38]. Since we use a fixed exploration rate in our proposed algorithm, the time required for convergence with a state size of $|S|$ and exploration rate $\epsilon$ is bounded by $O\!\left(|S| \log |S| \left[\log(1/\epsilon)\right]/\epsilon^2\right)$, which is linear in the state size up to a logarithmic factor [39]. Additionally, the complexity of finding the optimal decision is $O(Nc)$, where $N$ is the number of actions in each iteration and $c$ denotes the number of available candidate edge servers, including vehicles and RSUs [40].

V. SIMULATIONS

In this section, we evaluate the performance of the proposed method for computation offloading in VEC. The simulations are performed on the Windows operating system with an Intel Core i7-10750H CPU, 16 GB of memory, an NVIDIA GeForce GTX 1650 GPU, and Python 3.7.6. Moreover, we use the TensorFlow library in order to create the deep learning model of the proposed algorithm and to speed up numerical computing. We implement a fully connected backpropagation neural network for both the evaluation network and the target network. Besides, the rectified linear unit (ReLU) function is used as the activation function of each layer in the DDQN model to avoid vanishing gradients [41].

A. Simulation Setup

Fig. 4. Comparison of average cost during the computation of one whole task.

We consider a computation offloading framework that consists of 80 vehicles and 30 fixed RSUs randomly located along a 10 km route. The speeds of the vehicles are assumed to be different and uniformly distributed in [30, 120] km/h. The communication range is considered to be 200 meters. To increase the feasibility of the scenario, the CPU frequencies of vehicles and RSUs are randomly distributed in the ranges [2, 8] and [8, 16] GHz, respectively. Note that once we set the random values of the CPU frequencies and distances for each vehicle and RSU (at the beginning of running the algorithm), we keep them constant until the end of the episode. Table II shows an example of the environment design parameters. The requested input data size of each task $L$ is assumed to be 16 Mbits, where the vehicle generates tasks continuously along the 10 km route (to measure the overall size of tasks that could be processed during the target route), the processing complexity $\rho$ is fixed at 1000 cycles/bit, and $\kappa = 10^{-27}$ W$\cdot$s$^3$/cycle$^3$. The communication parameters are set as $W$ = 10 MHz, $P_V$ = 0.1 W, $N_0$ = 1, $\alpha$ = 1, and $\beta$ = 0.01. The initial parameters of the proposed DDQN algorithm are $\mathcal{M}$ = 1024, $B$ = 32, $\gamma$ = 0.9, $\epsilon$ = 0.9, and learning rate = 0.05.
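
For readability, the stated setup can be collected into one parameter record (a sketch; the structure and key names are ours, only the values are the paper's):

```python
SIM_PARAMS = {
    "num_vehicles": 80, "num_rsus": 30, "route_m": 10_000,
    "speed_kmh": (30, 120),                  # uniform vehicle speeds
    "comm_range_m": 200,
    "f_vehicle_ghz": (2, 8), "f_rsu_ghz": (8, 16),
    "L_bits": 16e6, "rho_cycles_per_bit": 1000, "kappa": 1e-27,
    "W0_hz": 10e6, "P_v_watt": 0.1, "N0": 1.0, "alpha": 1.0, "beta": 0.01,
    "replay_memory": 1024, "batch_size": 32,
    "gamma": 0.9, "epsilon": 0.9, "learning_rate": 0.05,
}
```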

B. Simulation Results

In this part, we evaluate the performance and reliability of the proposed DDQN algorithm. To illustrate the performance of this algorithm, we adapt the DQN algorithm to our proposed scenario. Moreover, we compare the results with the ALTO algorithm introduced in [6] and with two other RL algorithms, namely the upper confidence bound (UCB) and adaptive upper confidence bound (AdaUCB) algorithms defined in [42] and [43], respectively. Note that these three algorithms are bandit-based and, unlike MDPs, their actions have no influence on subsequent observations. Regarding the learning time needed to achieve our performance, we examine the proposed algorithm with different numbers of episodes. Each episode takes about 9-11 seconds. As illustrated in Figures 4-8, there are no significant changes in the results after around 300 episodes, and, as mentioned in the Time Complexity subsection, the required time for convergence is linear in the state size.

Fig. 5. Comparison of average delay for executing a whole target task.

Fig. 6. Comparison of the amount of executed tasks along a 10 km road during 300 learning episodes.

Fig. 4 indicates the average computation cost performance of the proposed DDQN algorithm during 300 learning episodes over the whole target route. The target task size is large, and due to the high mobility of vehicular environments, it is not feasible to consider task execution without enabling the handover feature. Therefore, we adjust all the compared algorithms to resemble our proposed handover-enabled scenario. Following the generation of the task at the start point, the target vehicle interacts with the environment by offloading the task to different servers in different areas separated by a distance of 200 meters. It calculates the total reward during the computation of the whole task according to the feedback of each area. Then, the algorithm tries to find an optimal policy by carrying out this execution process repeatedly, achieving the highest cumulative reward. In the beginning, when the proposed algorithm starts to run, the evaluation network starts to learn and update the parameters; thus, the loss value starts to increase and, accordingly, the cost increases. As the learning episodes progress and due to the effect of training, the loss value drops, and the cost decreases steadily and tends to become stable. As presented in the figure, the DQN algorithm follows a trend similar to the proposed method, and its results are close to ours. Besides, the other three algorithms, which are based on the multi-armed bandit structure, incur a higher and nearly steady cost.

Fig. 5 shows the performance comparison of the average delay per task during the 300 learning episodes. Simulation results demonstrate that, in terms of computation delay, the proposed method converges to the optimum faster than the existing methods.

In Fig. 6, we compare the amount of executed tasks for a target vehicle traveling at a fixed speed along a road of 10 km. We assume that, immediately after the execution of one task is finished, another task is generated in the target vehicle. Simulation results show that the proposed algorithm outperforms the other existing algorithms by executing larger-sized tasks, thanks to making the best offloading decisions with regard to the computation performance of the possible edge candidates in each area. Furthermore, compared to the local computing approach, the proposed algorithm is capable of handling almost twice the amount of tasks (in Mbits).

We examine the energy consumption during computation offloading in the first five areas of the target route in Fig. 7. The amount of energy consumed is related to the data transmission rate, which depends on the distance from the selected edge server and on the data size for offloading. Our proposed algorithm takes the locations of the edge servers into account in the state space of the environment; therefore, it can learn the energy consumption performance by performing each action considering the distance and make an optimal decision with lower energy consumption. Additionally, the handover strategy helps the target vehicle to offload only the remaining part of the task in each step, which eventually decreases the energy consumption.

We calculate the cost of computation defined in (16) by assigning the weights for energy consumption $\delta_E$ and delay $\delta_D$ to set priorities for each UE. Fig. 8 demonstrates the effect of these preferences on the computation cost of one whole task. Since the total computation delay includes the UL and DL transmission delays, giving higher priority to delay by choosing $\delta_D$ close to 1 causes a higher cost for the UE. Therefore, delay-sensitive tasks are high-cost. Conversely, giving priority to energy consumption yields low-cost computation; therefore, offloading could be a good solution for saving the battery of UEs.

Fig. 7. Comparison of average energy consumption during task offloading in each area of the environment.

Fig. 8. Comparison of computation cost considering different priorities, where a higher $\delta_E$ gives priority to lower energy consumption while a higher $\delta_D$ favors low latency.

VI. CONCLUSION

In this article, we investigate the computation offloading problem in beyond-5G and 6G-enabled VEC networks. We define the offloading cost, which takes into account energy consumption, transmission delay, and processing delay during computation offloading, as the objective of the task allocation optimization problem. Accordingly, we design an intelligent algorithm based on DDQN to minimize the average offloading cost, considering the high instability of vehicular environments, which requires an online solution. The proposed handover-enabled DDQN algorithm enables the user to learn the offloading cost performance by interacting with the environment and to make steady decisions despite the uncertainty of the dynamic environment. The proposed algorithm has three important components: the first is the use of experience memory, which helps achieve steadiness in unpredictable environments by eliminating the time dependency of observations; the second is the use of two neural networks to prevent selecting overestimated values caused by using the same action value both to evaluate and to make the decision; and the third is the use of a DNN for optimal action-value estimation. Simulation results demonstrate that the proposed algorithm achieves better performance in terms of average offloading cost under different conditions, which indicates the feasibility and effectiveness of the proposed method. The DDQN algorithm shows its best performance in large-scale environments with a great number of states; therefore, as future work, we plan to use a more realistic urban scenario, consisting of a high density of vehicles, pedestrians, and infrastructure that collaborate to share resources, with a dynamic queue management system that controls the data traffic.

ACKNOWLEDGMENT

This work has been supported by The Scientific and Technological Research Council of Turkey (TUBITAK) under Project 120E307.

REFERENCES

[1] M. Keertikumar, M. Shubham, and R. Banakar, “Evolution of IoT in

smart vehicles: An overview,” in Proc. IEEE Int. Conf. Green Comput.

and Internet of Things (ICGCIoT), 2015, pp. 804–809.

[2] A. B. De Souza et al., “Computation ofﬂoading for vehicular environ-

ments: A survey,” IEEE Access, vol. 8, pp. 198 214–198 243, 2020.

[3] S. Gyawali, S. Xu, Y. Qian, and R. Q. Hu, “Challenges and solutions

for cellular based V2X communications,” IEEE Commun. Surveys Tuts.,

pp. 1–1, 2020.

[4] L. Liu, C. Chen, Q. Pei, S. Maharjan, and Y. Zhang, “Vehicular edge

computing and networking: A survey,” Mobile Netw. and Appl., vol. 26,

no. 3, pp. 1145–1168, 2021.

[5] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision

and challenges,” IEEE Internet of Things J., vol. 3, no. 5, pp. 637–646,

2016.

[6] Y. Sun, X. Guo, J. Song, S. Zhou, Z. Jiang, X. Liu, and Z. Niu, “Adaptive

learning-based task ofﬂoading for vehicular edge computing systems,”

IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3061–3074, 2019.

[7] H. Maleki, M. Basaran, and L. Durak-Ata, “Reinforcement learning-

based decision-making for vehicular edge computing,” in Proc. 29th

IEEE Signal Process. and Commun. Appl. Conf., 2021, pp. 1–4.

[8] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction,

2nd ed. MIT press, 2018.

[9] H. Hasselt, “Double q-learning,” Advances in Neural Inform. Process.

Syst., vol. 23, pp. 2613–2621, 2010.

[10] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement

learning for multiagent systems: A review of challenges, solutions, and

applications,” IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826–3839, 2020.

[11] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning

with double q-learning,” in Proc. AAAI Conf. Artiﬁcial Intell., vol. 30,

no. 1, 2016.

[12] S. Raza, S. Wang, M. Ahmed, and M. R. Anwar, “A survey on vehicular

edge computing: Architecture, applications, technical issues, and future

directions,” Wireless Commun. and Mobile Comput., vol. 2019, pp. 1–19,

2019.

[13] J. Moura and D. Hutchison, “Game theory for multi-access edge com-

puting: Survey, use cases, and future trends,” IEEE Commun. Surveys

Tuts., vol. 21, no. 1, pp. 260–288, 2019.

[14] Y. Wang et al., “A game-based computation ofﬂoading method in

vehicular multiaccess edge computing networks,” IEEE Internet of

Things J., vol. 7, no. 6, pp. 4987–4996, 2020.

[15] C. Zhu et al., “Folo: Latency and quality optimized task allocation in

vehicular fog computing,” IEEE Internet of Things J., vol. 6, no. 3, pp.

4150–4161, 2019.

content may change prior to final publication. Citation information: DOI 10.1109/TVT.2023.3247889

12

[16] J. Sun, Q. Gu, T. Zheng, P. Dong, A. Valera, and Y. Qin, "Joint optimization of computation offloading and task scheduling in vehicular edge computing networks," IEEE Access, vol. 8, pp. 10466–10477, 2020.
[17] J. Feng, Z. Liu, C. Wu, and Y. Ji, "AVE: Autonomous vehicular edge computing framework with ACO-based scheduling," IEEE Trans. Veh. Technol., vol. 66, no. 12, pp. 10660–10675, 2017.
[18] Y. Cui, Y. Liang, and R. Wang, "Resource allocation algorithm with multi-platform intelligent offloading in D2D-enabled vehicular networks," IEEE Access, vol. 7, pp. 21246–21253, 2019.
[19] C. Lin, D. Deng, and C. Yao, "Resource allocation in vehicular cloud computing systems with heterogeneous vehicles and roadside units," IEEE Internet of Things J., vol. 5, no. 5, pp. 3692–3700, 2018.
[20] H. Liao et al., "Learning-based intent-aware task offloading for air-ground integrated vehicular edge computing," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 8, pp. 5127–5139, 2020.
[21] Q. Qi et al., "Knowledge-driven service offloading decision for vehicular edge computing: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 5, pp. 4192–4203, 2019.
[22] Y. Liu, H. Yu, S. Xie, and Y. Zhang, "Deep reinforcement learning for offloading and resource allocation in vehicle edge computing and networks," IEEE Trans. Veh. Technol., vol. 68, no. 11, pp. 11158–11168, 2019.
[23] W. Zhan, C. Luo, J. Wang, G. Min, and H. Duan, "Deep reinforcement learning-based computation offloading in vehicular edge computing," in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[24] M. Khayyat, I. A. Elgendy, A. Muthanna, A. S. Alshahrani, S. Alharbi, and A. Koucheryavy, "Advanced deep learning-based computational offloading for multilevel vehicular edge-cloud computing networks," IEEE Access, vol. 8, pp. 137052–137062, 2020.
[25] Z. Ning, P. Dong, X. Wang, J. J. Rodrigues, and F. Xia, "Deep reinforcement learning for vehicular edge computing: An intelligent offloading system," ACM Trans. Intell. Syst. Technol. (TIST), vol. 10, no. 6, pp. 1–24, 2019.
[26] H. Guo, J. Liu, J. Ren, and Y. Zhang, "Intelligent task offloading in vehicular edge computing networks," IEEE Wireless Commun., vol. 27, no. 4, pp. 126–132, 2020.
[27] Y. Liu, J. Yan, and X. Zhao, "Deep reinforcement learning based latency minimization for mobile edge computing with virtualization in maritime UAV communication network," IEEE Trans. Veh. Technol., vol. 71, no. 4, pp. 4225–4236, 2022.
[28] Z. Gao, L. Yang, and Y. Dai, "Large-scale computation offloading using a multi-agent reinforcement learning in heterogeneous multi-access edge computing," IEEE Trans. Mobile Comput., early access, 2022.
[29] Y. He, Y. Wang, Q. Lin, and J. Li, "Meta-hierarchical reinforcement learning (MHRL)-based dynamic resource allocation for dynamic vehicular networks," IEEE Trans. Veh. Technol., vol. 71, no. 4, pp. 3495–3506, 2022.
[30] M. Alam, M. Sher, and S. A. Husain, "Integrated mobility model (IMM) for VANETs simulation and its impact," in Proc. Int. Conf. on Emerg. Technol., 2009, pp. 452–456.
[31] A. P. Miettinen and J. K. Nurminen, "Energy efficiency of mobile clients in cloud computing," HotCloud, vol. 10, no. 4, p. 19, 2010.
[32] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, "Mobile-edge computing: Partial computation offloading using dynamic voltage scaling," IEEE Trans. Commun., vol. 64, no. 10, pp. 4268–4282, 2016.
[33] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[34] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Mach. Learn., vol. 8, no. 3–4, pp. 293–321, 1992.
[35] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[36] L. Xi, L. Yu, Y. Xu, S. Wang, and X. Chen, "A novel multi-agent DDQN-AD method-based distributed strategy for automatic generation control of integrated energy systems," IEEE Trans. Sustain. Energy, vol. 11, no. 4, pp. 2417–2426, 2019.
[37] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," in Proc. 2nd Conf. Learn. for Dynamics and Control (PMLR), 2020, pp. 486–489.
[38] S. D. Whitehead, "A complexity analysis of cooperative mechanisms in reinforcement learning," in Proc. AAAI, 1991, pp. 607–613.
[39] M. Kearns and S. Singh, "Finite-sample convergence rates for Q-learning and indirect algorithms," Advances in Neural Inform. Process. Syst., pp. 996–1002, 1999.
[40] S. Goudarzi, M. H. Anisi, H. Ahmadi, and L. Musavian, "Dynamic resource allocation model for distribution operations using SDN," IEEE Internet of Things J., vol. 8, no. 2, pp. 976–988, 2020.
[41] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, "A survey on deep learning for big data," Information Fusion, vol. 42, pp. 146–157, 2018.
[42] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Mach. Learn., vol. 47, no. 2, pp. 235–256, 2002.
[43] H. Wu, X. Guo, and X. Liu, "Adaptive exploration-exploitation tradeoff for opportunistic bandits," in Proc. 35th Int. Conf. Mach. Learn. (PMLR), 2018, pp. 5306–5314.

Homa Maleki (Student Member, IEEE) received the B.Sc. degree in Computer Engineering from Tabriz University, Iran, and the M.Sc. degree in Information and Communication Engineering from Istanbul Technical University, Istanbul, Turkey. She is currently a Ph.D. candidate in Information and Communication Engineering at the Informatics Institute, Istanbul Technical University, and a member of the Information and Communications Research Group (ICRG). She was a Research Assistant at ITU Vodafone Future Lab between 2019 and 2022 and has been a Data Scientist with NTT Data Business Solutions Turkey since June 2022. Her research interests include machine learning in communication, optimization algorithms, cloud computing, and mobile edge computing.

Mehmet Basaran (Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical and electronics engineering from Istanbul University, Istanbul, Turkey, in 2008 and 2011, respectively, and the Ph.D. degree in electronics and communication engineering from Istanbul Technical University (ITU), Istanbul, in 2018, where he was a Research Assistant. He was a Visiting Researcher with RWTH Aachen University, Aachen, Germany, in 2017 in the context of the FET EU Horizon 2020 Project. He was an R&D Operations Manager at ITU Vodafone Future Lab between 2018 and 2021 and a 5G Research Professional with Siemens Turkey between 2021 and 2023. He has been a 6G R&D Principal with Turkcell since February 2023. His research areas mainly include next-generation communication systems and signal processing for wireless communications.

Lutfiye Durak-Ata (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in Electrical Engineering from Bilkent University, Ankara, Turkey, in 1996, 1999, and 2003, respectively. She was with the Statistical Signal Processing Laboratory, Korea Advanced Institute of Science and Technology, from 2004 to 2005, and the Electronics and Communications Engineering Department, Yildiz Technical University, Istanbul, Turkey, from 2005 to 2015. Since 2015, she has been with the Informatics Institute, Istanbul Technical University, Istanbul, where she is currently a Full Professor. Her research interests include wireless communications and networking.
