Maneuver Decision of UAV in Short-Range Air Combat Based on Deep Reinforcement Learning
QIMING YANG1, JIANDONG ZHANG1, GUOQING SHI1, JINWEN HU2, AND YONG WU1
1School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710129, China (e-mail: jdzhang@nwpu.edu.cn)
2School of Automation, Northwestern Polytechnical University, Xi'an 710129, China (e-mail: hujinwen@nwpu.edu.cn)
Corresponding author: Jiandong Zhang (e-mail: jdzhang@nwpu.edu.cn).
This study was supported in part by the Aeronautical Science Foundation of China under Grant 2017ZC53033, in part by the National Natural Science Foundation of China under Grant 61603303 and Grant 61803309, in part by the Natural Science Foundation of Shaanxi Province under Grant 2018JQ6070, and in part by the China Postdoctoral Science Foundation under Grant 2018M633574.
ABSTRACT With the development of artificial intelligence and integrated sensor technologies, unmanned aerial vehicles (UAVs) are increasingly applied in air combat. A bottleneck that constrains the capability of UAVs against manned aircraft is autonomous maneuver decision, which is particularly challenging in short-range air combat, where the enemy performs highly dynamic and uncertain maneuvers. In this paper, an autonomous maneuver decision model is proposed for UAV short-range air combat based on reinforcement learning. It mainly comprises an aircraft motion model, a one-to-one short-range air combat evaluation model, and a maneuver decision model based on the deep Q network (DQN). Such a model has a high-dimensional state and action space, so training the DQN with traditional methods requires a huge computational load. Therefore, a phased training method called "basic-confrontation" training, inspired by the way human beings learn gradually from simple to complex tasks, is proposed to reduce the training time while obtaining suboptimal but effective results. Finally, one-to-one short-range air combats are simulated under different target maneuver policies. Simulation results show that the proposed maneuver decision model and training method enable the UAV to make autonomous decisions in air combat and to obtain an effective decision policy to defeat its opponent.
INDEX TERMS Deep reinforcement learning, Maneuver decision, Independent decision, Deep Q network,
Network training
I. INTRODUCTION
COMPARED with manned aircraft, military UAVs have attracted much attention for their low cost, long endurance, and the absence of risk to a pilot. With the development of sensor, computer, and communication technologies, the performance of military UAVs has improved significantly, and the range of tasks they can execute continues to expand [1]. Although military UAVs can perform reconnaissance and ground attack missions [2], most mission functions are inseparable from human intervention, and decisions are ultimately made by operators in the ground station. This ground-based remote operation mode depends mainly on the data link, which is vulnerable to weather and electromagnetic interference. Traditional ground-based remote operation therefore struggles to command a UAV in air combat, because it is difficult to adapt to fast and varied air combat scenarios [3].
Nomenclature
p_U    position vector of the UAV (subscript U indicates the UAV)
ṗ_T    velocity vector of the target (subscript T indicates the target)
α_U    angle between ṗ_U and (p_T − p_U)
α_T    angle between ṗ_T and (p_T − p_U)
n_x    overload in the velocity direction
n_z    normal overload
µ      roll angle around the velocity vector
η_A    situation advantage
η_D    distance advantage
v      speed
ψ_T    heading angle of the target
γ_U    track angle of the UAV
γ      reward discount factor
D      distance
θ      parameters of the Q network
s_t    state at time t
r_t    reward at time t
a_t    action at time t
a_i    control values of the i-th action of the maneuver library
Therefore, enabling the UAV to make control decisions automatically according to the situation it faces, and thus to conduct air combat independently, is a major research direction of UAV intelligence.
Autonomous maneuver decision in short-range air combat is the most challenging application, because both sides perform the most violent maneuvers and the situation changes very quickly. Autonomous maneuver decision requires automatically generating flight control commands under various air combat situations based on technical methods such as mathematical optimization and artificial intelligence.
The methods of autonomous maneuver decision can be roughly divided into three categories: game theory [4]-[7], optimization methods [8]-[14], and artificial intelligence [15]-[20]. Representative game-theoretic methods include differential games [4]-[5] and influence diagrams [7]. The models established by these methods directly reflect the various factors of air combat confrontation, but models with complex decision sets are difficult to solve because of real-time computation limits [7].
Maneuver decision based on optimization theory includes the genetic algorithm [9], Bayesian inference [10], statistical theory [14], etc. These methods transform the maneuver decision problem into an optimization problem and obtain the maneuver policy by solving the optimization model. However, many of these optimization algorithms have poor real-time performance on large-scale problems and cannot implement online decision making for air combat, so they are only used for offline air combat tactics optimization [9].
Artificial intelligence-based methods mainly include expert systems [16], neural networks [17], and reinforcement learning [18]-[21]. The core of the expert system method is to refine flight experience into a rule base and generate flight control commands through rules. The problem with this method is that the rule base is too complex to build, and the policy encoded in the rule base is simple and fixed, so it is easily cracked by the opponent. The core of the neural network method is to use neural networks to store maneuver rules. Neural network maneuver decision is robust and can learn continuously from a large number of air combat samples. However, producing air combat samples requires a lot of manual work, so this method faces the problem of insufficient learning samples. Compared with expert systems and neural networks, reinforcement learning does not require labeled learning samples: the agent updates its action policy by interacting with the external environment autonomously [22]. This method combines online real-time decision making with self-learning and is an effective way to solve sequential decision problems without a prior model.
Many scholars have studied maneuver decision making based on reinforcement learning. In [3], the approximate dynamic programming method is used to construct the maneuver decision model, and flight experiments verify that UAV autonomous maneuver decision can be realized based on reinforcement learning. However, the UAV is assumed to move in a 2D plane, and the actual 3D motion of aircraft in air combat is not considered. In [18], fuzzy logic is used to divide the state space of the air combat environment, and linear approximation is used to approximate the Q value. However, as the state space is subdivided further, the dramatic increase in the number of states reduces the efficiency of reinforcement learning, which tends to fail due to the curse of dimensionality. In [19]-[20], deep reinforcement learning is used to construct the maneuver decision model of air combat, but speed is not a decision variable and the speeds of both sides are set to constant values, which is inconsistent with real air combat. In [21], the DQN algorithm is used to construct a maneuver decision model for over-the-horizon air combat, and speed control is considered. However, as in [3], the model still assumes that both sides move in the same 2D plane, without considering the influence of altitude changes on air combat. None of these models fully and realistically reflects air combat, especially the characteristics of short-range air combat.
In this paper, an autonomous maneuver decision model for UAV short-range air combat is developed based on reinforcement learning. First, a second-order aircraft motion model and a one-to-one short-range air combat model in 3D space are established, and an air combat advantage evaluation model combining situation and distance is proposed. This model quantitatively reflects the advantage of the aircraft in any short-range air combat situation. Second, the maneuver decision model is constructed under the DQN framework, and a new maneuver library is designed: the 7 classic maneuver actions are extended to 15, enlarging the decision action space. At the same time, a reward function based on the advantage evaluation function is designed to comprehensively reflect how each maneuver changes the situation. Finally, since a maneuver decision model with a high-dimensional state space (13 dimensions) and action space (15 actions) is difficult to train to convergence with traditional methods, a training method called "basic-confrontation" training is proposed, which effectively improves training efficiency. Through a large number of machine-machine and man-machine confrontation simulation experiments under initial situations of advantage, disadvantage, and balance, the maneuver decision model established in this paper is shown to learn the maneuver policy autonomously and thereby gain an advantage in air combat.
Compared with previous reinforcement-learning-based maneuver decision research, in which the air combat state space and action space were simplified because high-dimensional models are difficult to converge, the proposed method brings the model closer to the actual motion state space of air combat, learns an effective air combat maneuver decision policy, and further demonstrates the effectiveness of reinforcement learning for the air combat maneuver decision problem.
The remainder of this paper is organized as follows. Section 2 establishes the short-range air combat maneuver decision model. Section 3 introduces the training method of the DQN model. Section 4 presents the training and testing of the model through simulation analysis. Finally, Section 5 concludes the paper.
II. SHORT-RANGE AIR COMBAT MANEUVER DECISION MODEL
A. AIRCRAFT MOTION MODEL
The aircraft's motion model is the basis of the air combat model. The focus of this paper is maneuver decision making, which mainly concerns the positional relationship and the velocity vectors of the two sides in three-dimensional space. Therefore, a three-degree-of-freedom particle model is used as the motion model of the aircraft. The angle of attack and the sideslip angle are ignored, and the velocity direction is assumed to coincide with the body axis.
In the ground coordinate system, the ox axis points east, the oy axis points north, and the oz axis points vertically upward. The motion model of the aircraft in this coordinate system is shown in (1).
\dot{x} = v \cos\gamma \sin\psi,
\dot{y} = v \cos\gamma \cos\psi,
\dot{z} = v \sin\gamma.    (1)
In the same coordinate system, the dynamic model of the
aircraft is shown in (2).
\dot{v} = g\,(n_x - \sin\gamma),
\dot{\gamma} = \frac{g}{v}\,(n_z \cos\mu - \cos\gamma),
\dot{\psi} = \frac{g\, n_z \sin\mu}{v \cos\gamma}.    (2)
In (1) and (2), x, y, and z represent the position coordinates of the aircraft in the coordinate system, v represents its speed, and ẋ, ẏ, and ż represent the components of the velocity v on the three coordinate axes. The track angle γ is the angle between the velocity vector and the horizontal plane o-x-y. The heading angle ψ is the angle between the projection v_0 of the velocity vector on the o-x-y plane and the oy axis. g is the acceleration of gravity. The position vector is written as p = [x, y, z], with ṗ = [ẋ, ẏ, ż].
[n_x, n_z, µ] is the set of control variables that control the maneuvering of the aircraft; let Λ denote the control space of the UAV. n_x is the overload in the velocity direction, which represents the thrust or deceleration of the aircraft. n_z is the overload in the pitch direction, i.e., the normal overload. µ is the roll angle around the velocity vector. The speed of the aircraft is controlled by n_x, and the direction of the velocity vector is controlled by n_z and µ, thereby controlling the aircraft's maneuvers. The parameters of the aircraft particle model are shown in Figure 1.
FIGURE 1. Aircraft three-degree-of-freedom particle model.
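To make the motion model concrete, the following minimal Python sketch integrates (1) and (2) with a simple forward-Euler step. The integration scheme and the one-second default step are illustrative assumptions, not taken from the paper.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def step_aircraft(state, control, dt=1.0):
    """Advance the three-degree-of-freedom point-mass model of (1)-(2) by dt seconds.

    state   = [x, y, z, v, gamma, psi]  (position, speed, track angle, heading angle)
    control = [nx, nz, mu]              (tangential overload, normal overload, roll angle)
    """
    x, y, z, v, gamma, psi = state
    nx, nz, mu = control

    # Kinematics, equation (1)
    x_dot = v * np.cos(gamma) * np.sin(psi)
    y_dot = v * np.cos(gamma) * np.cos(psi)
    z_dot = v * np.sin(gamma)

    # Dynamics, equation (2)
    v_dot = G * (nx - np.sin(gamma))
    gamma_dot = (G / v) * (nz * np.cos(mu) - np.cos(gamma))
    psi_dot = G * nz * np.sin(mu) / (v * np.cos(gamma))

    # Forward-Euler integration over one decision period
    return np.array([x + x_dot * dt, y + y_dot * dt, z + z_dot * dt,
                     v + v_dot * dt, gamma + gamma_dot * dt, psi + psi_dot * dt])
```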
B. ONE-TO-ONE SHORT-RANGE AIR COMBAT EVALUATION MODEL
Short-range air combat is also known as dog fighting. The goal is to maneuver the aircraft onto the tail of the target aircraft while avoiding letting the target get onto its own tail.
In the one-to-one air combat situation shown in Figure 2, from the attacking point of view the UAV should fly toward the target and chase it, i.e., make the projection of its velocity vector ṗ_U on (p_T − p_U) as large as possible; from the defensive point of view the UAV should avoid the target flying towards it, i.e., the projection of the target velocity vector ṗ_T on (p_U − p_T) should be as small as possible. The situation advantage of the UAV in air combat can then be defined as

\eta_A = \frac{\dot{p}_U \cdot (p_T - p_U)}{\left\| p_T - p_U \right\|} - \frac{\dot{p}_T \cdot (p_U - p_T)}{\left\| p_U - p_T \right\|}.    (3)

The larger η_A is, the greater the situational advantage of the UAV. Conversely, the smaller η_A is, the greater the situational advantage of the target, and the UAV is at a disadvantage.
In addition to the situation, distance is also a key factor in close air combat. The weapons used in close air combat generally have a limited attack range: a target beyond that range cannot be attacked, and within the range, the closer the distance, the greater the probability of destroying the target. In addition, to avoid collision or accidental damage from target debris, the distance between the two sides cannot be too small, so there is a minimum safe distance. Let the maximum attack distance of the weapon be D_max and the minimum safe distance be D_min; the distance advantage of the UAV in air combat can then be defined as

\eta_D = \beta_1 \frac{D_{\max} - \left\| p_T - p_U \right\|}{D_{\max} - D_{\min}} \left( 1 - e^{-\left( \left\| p_T - p_U \right\| - D_{\min} \right) \beta_2} \right),    (4)

where β_1 and β_2 are adjustment coefficients. When the distance between the UAV and the target is less than D_max, η_D gradually increases as the distance decreases. When the distance is less than D_min or greater than D_max, η_D decreases as the distance decreases or increases, respectively.
FIGURE 2. One-to-one short-range air combat situation.
Combining the situation advantage and the distance advantage, the UAV's advantage evaluation function in air combat can be defined as

\eta = \omega_1 \eta_A + \omega_2 \eta_D,    (5)

where ω_1 and ω_2 are weight coefficients. In summary, the UAV short-range air combat maneuver decision can be regarded as an optimization problem: find the action in the control space Λ that maximizes the advantage evaluation function η.
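As an illustration of the evaluation model, the sketch below computes η_A, η_D, and η from (3)-(5) with NumPy. The coefficient values β_1, β_2, ω_1, ω_2 are placeholders (the paper does not list them), and the sign in the exponential of (4) follows the reconstruction above.

```python
import numpy as np

# Placeholder coefficients; the paper does not give numerical values for these.
BETA_1, BETA_2 = 1.0, 0.01
OMEGA_1, OMEGA_2 = 0.5, 0.5
D_MAX, D_MIN = 3000.0, 200.0   # weapon attack range / minimum safe distance (m), Section IV

def situation_advantage(p_u, v_u, p_t, v_t):
    """Situation advantage eta_A of equation (3)."""
    los = p_t - p_u                       # line-of-sight vector from UAV to target
    d = np.linalg.norm(los)
    return np.dot(v_u, los) / d - np.dot(v_t, -los) / d

def distance_advantage(p_u, p_t):
    """Distance advantage eta_D of equation (4), as reconstructed above."""
    d = np.linalg.norm(p_t - p_u)
    return BETA_1 * (D_MAX - d) / (D_MAX - D_MIN) * (1.0 - np.exp(-(d - D_MIN) * BETA_2))

def advantage(p_u, v_u, p_t, v_t):
    """Overall advantage evaluation eta of equation (5)."""
    return OMEGA_1 * situation_advantage(p_u, v_u, p_t, v_t) + OMEGA_2 * distance_advantage(p_u, p_t)
```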
For the optimization problem max η(n_x, n_z, µ), [n_x, n_z, µ] ∈ Λ, the objective function is a complex high-order nonlinear function. It is difficult to obtain its extreme points from derivatives, so an analytical optimal solution is hard to obtain. However, once the air combat situation is determined, the advantage function can be calculated as an evaluation of the current action. Therefore, we use reinforcement learning to learn the maneuver policy from a large number of such feedback evaluation values.
The advantage function lets the UAV understand its pros and cons in the current situation. In addition, a set of variables must be found to describe the current air combat situation so that the UAV understands the current relative situation. The air combat state at any time is completely determined by the UAV position vector p_U, the UAV velocity vector ṗ_U, the target position vector p_T, and the target velocity vector ṗ_T in the same coordinate system. However, the range of the three-dimensional coordinate values is too large, so the coordinate values are not suitable as direct state inputs for reinforcement learning. Therefore, the absolute vector information of the UAV and the target should be transformed into the relative relationship between the two sides, and angles should be used to represent the vector information. This not only reduces the dimension of the state space, but also facilitates normalization of the state information, thereby improving the efficiency of reinforcement learning. The one-to-one short-range air combat state space is composed of the following five aspects:
1. UAV velocity information, including the speed v_U, track angle γ_U, and heading angle ψ_U.
2. Target velocity information, including the speed v_T, track angle γ_T, and heading angle ψ_T.
3. The relative positional relationship between the UAV and the target, characterized by the distance vector (p_T − p_U). The coordinate information of the distance vector is converted into its modulus and angles: the modulus is D = ||p_T − p_U||, γ_D is the angle between (p_T − p_U) and the o-x-y plane, and ψ_D is the angle between the projection of (p_T − p_U) on the o-x-y plane and the oy axis. The relative positional relationship is thus represented by D, γ_D, and ψ_D.
4. The relative motion relationship between the UAV and the target, including the angle α_U between the UAV velocity vector ṗ_U and the distance vector (p_T − p_U), and the angle α_T between the target velocity vector ṗ_T and the distance vector (p_T − p_U).
5. The height of the UAV z_U and the height of the target z_T.
Based on the above variables, the one-to-one air combat situation at any time can be fully characterized.
C. MANEUVER DECISION MODELING BY DEEP Q NETWORK
ARCHITECTURE OF THE MODEL. Reinforcement learning is a method by which an agent optimizes its action policy in an unknown environment [22]. The Markov decision process (MDP) is usually used as the theoretical framework of reinforcement learning. For model-free reinforcement learning problems, the MDP is described by a 4-tuple (S, A, R, γ), where S is the state space, A is the action space, R is the reward function, and γ is the discount factor.
The interaction process of reinforcement learning is as follows. At time t, the agent applies an action a_t to the environment. After a_t is executed, the state transfers from s_t to s_{t+1} at the next moment, and the agent obtains the reward value r_{t+1} from the environment.
The state is evaluated by a state value function or a state-action value function. According to the Bellman equation, the state-action value function can be written as

Q^\pi(s, a) = E_\pi \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right].

The state-action value accumulates the immediate reward and the value of subsequent states, so the evaluation of a state takes the future into account; an action selected according to this value is farsighted, i.e., the obtained policy is long-term optimal in theory. The purpose of reinforcement learning is to solve for an optimal policy

\pi(a \mid s) = \arg\max_{a \in A} Q(s, a)

in the policy space, which maximizes the long-term cumulative reward.
According to this interaction process, the reinforcement learning framework of the UAV short-range air combat maneuver decision is shown in Figure 3. The states of the UAV and of the target are integrated to form a state description of the air combat environment, which is output to the agent. The air combat environment model calculates the UAV's advantage evaluation value based on the current situation of both sides and outputs it to the reinforcement learning agent as the reward value. In the interaction process, the agent outputs an action value to the air combat environment model to change the state of the UAV, while the target changes its state according to its own maneuver policy. The reinforcement learning agent continuously interacts with the air combat environment to obtain air combat states, action values, and reward values as transitions. Based on these transitions, the maneuver policy of the UAV is dynamically updated so that the output actions tend toward the optimum, thus realizing self-learning of the UAV air combat maneuver policy.
FIGURE 3. UAV short-range air combat maneuver decision model framework based on reinforcement learning.
Facing a high-dimensional continuous state space such as the air combat environment, the DQN algorithm [23] is selected as the reinforcement learning framework. The core of DQN is to use a deep neural network to approximate the value function. Based on the Q-learning algorithm, the TD error is used to continuously adjust the parameters θ of the neural network so that the value output by the network approaches the true value, Q(s, a) ≈ Q(s, a | θ). Based on the short-range air combat environment model in Section 2, the maneuver decision model is constructed under the DQN framework.
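Written out explicitly, this TD-error-based parameter update minimizes the standard DQN loss below, consistent with the training algorithm given in Section III:

```latex
% DQN loss minimized with respect to the online parameters \theta, using the
% periodically updated target-network parameters \theta' and replay memory \mathcal{D}.
L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}
    \left[ \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta') - Q(s_t, a_t; \theta) \right)^{2} \right]
```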
STATE SPACE. The state space of the maneuver decision model describes the air combat situation, which is divided into the five aspects of Section 2.B. The state space therefore consists of the following 13 variables: v_U, γ_U, ψ_U, v_U − v_T, γ_T, ψ_T, D, γ_D, ψ_D, α_U, α_T, z_U, and z_U − z_T. To unify the ranges of the state variables and improve the efficiency of network learning, each state variable is normalized as shown in Table 1.
TABLE 1. THE STATE SPACE FOR THE DQN MODEL

State   Definition                                  State   Definition
s_1     (v_U / v_max)·a − b                         s_8     2b·γ_D / π
s_2     (γ_U / 2π)·a − b                            s_9     ψ_D / 2π
s_3     (ψ_U / 2π)·a − b                            s_10    (α_U / 2π)·a − b
s_4     ((v_U − v_T)/(v_max − v_min))·a − b         s_11    (α_T / 2π)·a − b
s_5     (γ_T / 2π)·a − b                            s_12    (z_U / z_max)·a − b
s_6     (ψ_T / 2π)·a − b                            s_13    ((z_U − z_T)/(z_max − z_min))·a − b
s_7     (D / D_threshold)·a − b
v_max and v_min are the maximum and minimum speeds of the aircraft motion model, respectively. z_max and z_min are the ceiling and the minimum safe height of the aircraft, respectively. D_threshold is the distance threshold, taken as the starting distance of short-range air combat. a and b are two positive numbers satisfying a = 2b. The state space is defined as the vector S = [s_1, s_2, ..., s_13].
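The scaling notation of Table 1 is partially garbled in the source; one consistent reading is that each ratio is mapped affinely by x·a − b. Under that assumption, and with the limit values given in Section IV, the 13-dimensional state vector could be assembled as in the following sketch:

```python
import numpy as np

A, B = 10.0, 5.0                      # scaling constants with a = 2b (Table 1 / Section IV)
V_MAX, V_MIN = 400.0, 90.0            # speed limits (m/s)
Z_MAX, Z_MIN = 12000.0, 1000.0        # ceiling / minimum safe height (m)
D_THRESHOLD = 10000.0                 # starting distance of short-range air combat (m)

def build_state(v_u, gamma_u, psi_u, v_t, gamma_t, psi_t,
                d, gamma_d, psi_d, alpha_u, alpha_t, z_u, z_t):
    """Assemble the 13-dimensional normalized state vector, assuming the x*a - b scaling."""
    scale = lambda x: x * A - B       # affine mapping assumed for the entries marked 'a - b'
    return np.array([
        scale(v_u / V_MAX),
        scale(gamma_u / (2 * np.pi)),
        scale(psi_u / (2 * np.pi)),
        scale((v_u - v_t) / (V_MAX - V_MIN)),
        scale(gamma_t / (2 * np.pi)),
        scale(psi_t / (2 * np.pi)),
        scale(d / D_THRESHOLD),
        2 * B * gamma_d / np.pi,      # gamma_d in [-pi/2, pi/2] maps onto [-b, b]
        psi_d / (2 * np.pi),
        scale(alpha_u / (2 * np.pi)),
        scale(alpha_t / (2 * np.pi)),
        scale(z_u / Z_MAX),
        scale((z_u - z_t) / (Z_MAX - Z_MIN)),
    ])
```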
ACTION SPACE. The action space for the DQN model
is the UAV’s maneuver library. The establishment of the
maneuver library can draw on the tactical actions of fighter
pilots during air combat. Pilots can derive many tactical
actions such as barrel rolling, cobra maneuvering, high yo-
yo, and low yo-yo based on various factors such as aircraft
performance, physical endurance and battlefield situation.
However, these complex maneuvers are ultimately derived
from basic maneuvering actions, so as long as the UAV’s
maneuvering library contains these basic maneuvers, it can
meet the requirements of simulation research.
According to common air combat maneuvers, NASA researchers designed 7 basic maneuvers [24]: uniform straight flight, accelerated straight flight, decelerated straight flight, left turn, right turn, upward flight, and downward flight. These maneuvers enable UAV movement in three-dimensional space, but speed can only be increased or decreased when flying in a straight line, which makes it difficult to control speed during other maneuvers. Therefore, the maneuver library is expanded on the basis of these maneuvers. As shown in Figure 4, the UAV can be maneuvered in the forward, left, right, up, and down directions, and maintain, accelerate, and decelerate control is provided in each direction. The maneuver library can thus be expanded arbitrarily to meet different control precisions. Each action a_i in the maneuver library corresponds to a set of control values [n_x, n_z, µ]. The action space A therefore consists of a discrete set of action values, which is a subset of the control space, A = {a_1, a_2, ..., a_n} ⊂ Λ.
FIGURE 4. UAV maneuver library (each direction provides maintain, accelerate, and decelerate control).
REWARD FUNCTION. The reward value is the environment's immediate assessment of the agent's actions. In this paper, the reward value is calculated from the advantage evaluation function of the UAV air combat situation and serves as an immediate evaluation of the maneuver decision. At the same time, the reward value should reflect a penalty for actions that exceed the flight envelope during simulation; the limits of the flight envelope include the flight altitude limits. The penalty function is defined as

\eta_p = \begin{cases} P, & \text{if } z_U < z_{\min} \text{ or } z_U > z_{\max}; \\ 0, & \text{otherwise}. \end{cases}    (6)

Based on the situation assessment function η and the penalty function η_p, the reward function of the reinforcement learning algorithm is r = η + η_p.
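A minimal sketch of the resulting reward computation, reusing the advantage() function sketched after (5) and the altitude limits of Section IV:

```python
P = -10.0                       # penalty value from Section IV
Z_MIN, Z_MAX = 1000.0, 12000.0  # minimum height / ceiling (m)

def reward(p_u, v_u, p_t, v_t):
    """Reward r = eta + eta_p, combining the advantage evaluation (5) with the
    altitude penalty (6); advantage() is the sketch given after equation (5)."""
    eta = advantage(p_u, v_u, p_t, v_t)
    eta_p = P if (p_u[2] < Z_MIN or p_u[2] > Z_MAX) else 0.0
    return eta + eta_p
```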
In summary, the UAV short-range air combat maneuver decision model based on the DQN algorithm is shown in Figure 5. The model runs as follows. Given the current air combat state s_t, the online Q network outputs an action value a_t ∈ A to the air combat environment according to the ε-greedy method. The UAV flies according to the action a_t, the target flies according to its preset policy, the state is then updated to s_{t+1}, and the reward value r_t is obtained. The tuple (s_t, a_t, r_t, s_{t+1}) is stored in the experience replay memory as a transition, completing one interaction. In the learning process, minibatches of transitions are sampled from the experience replay memory according to the prioritized experience replay policy, the parameters θ of the online network are updated by gradient descent on the TD error, and the parameters θ of the online network are periodically copied to the parameters θ' of the target network. Learning and updating continue until the TD error approaches 0, finally yielding a short-range air combat maneuver policy.
III. TRAINING METHOD OF THE DQN MODEL
Since the state space of the air combat model is very large, directly confronting an untrained UAV with a target that has a smart maneuver policy would give very poor results and generate a large number of invalid transition samples. This leads to extremely low reinforcement learning efficiency and can even cause learning to fail because useful samples are sparse. To address this problem, a training method named basic-confrontation training is designed, based on the way humans learn gradually from simple cognition to complex knowledge.
Based on this idea, the method divides the training of the maneuver decision DQN model into two parts. First, the target flies with simple basic actions, such as uniform linear motion and horizontal spiral motion, under different initial situations (advantage, disadvantage, and balance), making the UAV familiar with the air combat situation and letting it learn a basic maneuver policy; this is called basic training. The basic training items progress from simple to complex in terms of the target's maneuver strategy and the initial air combat situation, and each follow-up item is trained directly on the previously trained network, so the learning effects accumulate gradually. Second, after the UAV has learned the basic flight strategy, confrontation training is carried out under different initial situations against a target with a smart maneuver policy, letting the UAV learn a maneuver policy that defeats the target under confrontation conditions.
TARGET POLICY. In confrontation training, the target's maneuver policy adopts a robust maneuver decision algorithm based on the statistics principle [14]. Simulations show that this algorithm outperforms the traditional min-max method [25]. The framework of the algorithm is to test all actions in the maneuver library under the current situation, obtain the membership function values after the execution of each action, and select the action whose membership statistics are best as the next maneuver. A brief introduction to this algorithm follows.
FIGURE 5. Model of UAV short-range air combat maneuver decision based on DQN algorithm.

The current air combat situation is characterized by four parameters: azimuth α, distance D, speed v, and altitude h, and a membership function is defined for each parameter to enhance the robustness of the situation description. The membership function of the azimuth parameter is

f_\alpha = \frac{\alpha_U \alpha_T}{\pi^2}.    (7)
The membership function of the distance parameter is

f_D = \begin{cases} 1, & \text{if } \Delta D \le 0; \\ e^{-\Delta D^2 / (2\sigma^2)}, & \text{if } \Delta D > 0, \end{cases}    (8)

where σ is the standard deviation of the weapon attack distance and ΔD = D − D_max. The membership function of the speed parameter is

f_v = \frac{v_U}{v^*} \, e^{-2 \left| v_U - v^* \right| / v^*},    (9)

where v^* represents the optimal speed for the target to attack the UAV. With Δv = v_max − v_T, its value is set as

v^* = \begin{cases} v_T + \Delta v \left( 1 - e^{-\Delta D / D_{\max}} \right), & \text{if } \Delta D > 0; \\ v_T, & \text{if } \Delta D \le 0. \end{cases}    (10)
The membership function of the height parameter is defined as

f_h = \begin{cases} 1, & \text{if } h_s \le \Delta z \le h_s + \sigma_h; \\ e^{-(\Delta z - h_s)^2 / (2\sigma_h^2)}, & \text{if } \Delta z < h_s; \\ e^{-(\Delta z - h_s - \sigma_h)^2 / (2\sigma_h^2)}, & \text{if } \Delta z > h_s + \sigma_h, \end{cases}    (11)

where Δz denotes the height difference, h_s represents the optimal attack height difference of the target relative to the UAV, and σ_h is the standard deviation of the optimal attack height.
When the membership functions of the above four parameters approach 1, the target is at an advantage; when they approach 0, the target is at a disadvantage. The steps of the algorithm are as follows:
1. At time t, based on the current state of the target and the
UAV, control commands for all actions in the action library
are sent to the motion model for heuristic maneuvering.
2. Step 1 yields all possible positions of the target at time t + Δt, and the situation at each position is evaluated to obtain a set

F_i^{t+\Delta t} = \left\{ f_\alpha^{i,t+\Delta t}(\alpha),\; f_D^{i,t+\Delta t}(\Delta D),\; f_v^{i,t+\Delta t}(v),\; f_h^{i,t+\Delta t}(\Delta z) \right\},    (12)

where i is the index of the action in the maneuver library. The set of membership-function values corresponding to all maneuvers is F^{t+\Delta t} = \left\{ F_1^{t+\Delta t}, F_2^{t+\Delta t}, \cdots, F_n^{t+\Delta t} \right\}.
3. Calculate the mean m_i^{t+\Delta t} and the standard deviation s_i^{t+\Delta t} of F_i^{t+\Delta t}:

m_i^{t+\Delta t} = E\left[ F_i^{t+\Delta t} \right].    (13)
s_i^{t+\Delta t} = \sqrt{ \left( f_\alpha^{i,t+\Delta t}(\alpha) - m_i^{t+\Delta t} \right)^2 + \left( f_D^{i,t+\Delta t}(\Delta D) - m_i^{t+\Delta t} \right)^2 + \left( f_h^{i,t+\Delta t}(\Delta z) - m_i^{t+\Delta t} \right)^2 + \left( f_v^{i,t+\Delta t}(v) - m_i^{t+\Delta t} \right)^2 }.    (14)
A binary array MS_i^{t+\Delta t} = \left( m_i^{t+\Delta t}, s_i^{t+\Delta t} \right) is obtained, and the set MQ^{t+\Delta t} = \left\{ MS_i^{t+\Delta t} \right\}, i = 1, 2, \ldots, n, is built. The element of MQ^{t+\Delta t} with the largest mean is selected, and the corresponding maneuver is taken as the action to be executed by the target. If more than one element shares the largest mean, the maneuver corresponding to the element with the smallest standard deviation among them is output.
4. Execute the action, update the time, and return to step 1.
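The selection logic of steps 1-4 can be sketched as follows; simulate_action and membership_vector are hypothetical helpers standing in for the motion model of (1)-(2) and the membership functions (7)-(11).

```python
import numpy as np

def select_target_action(state, maneuver_library, simulate_action, membership_vector):
    """Statistics-principle maneuver selection (steps 1-4 above)."""
    stats = []
    for action in maneuver_library:                    # step 1: heuristic roll-out of every action
        next_state = simulate_action(state, action)
        f = np.asarray(membership_vector(next_state))  # step 2: membership set F_i = [f_alpha, f_D, f_v, f_h]
        # step 3: mean and standard deviation of F_i (np.std divides by n,
        # unlike equation (14), but the ranking used for tie-breaking is unchanged)
        stats.append((f.mean(), f.std()))

    means = np.array([m for m, _ in stats])
    stds = np.array([s for _, s in stats])
    best = np.flatnonzero(means == means.max())        # candidates with the largest mean
    i = best[np.argmin(stds[best])]                    # tie-break on the smallest standard deviation
    return maneuver_library[i]                         # step 4: execute the selected maneuver
```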
TRAINING EVALUATION. In confrontation training, the target makes maneuver decisions according to the above algorithm. A training process consists of multiple episodes; each episode represents one air combat engagement consisting of multiple steps, and each step represents one decision cycle. To evaluate the training performance of the maneuver decision model, three indicators are defined: the advantage steps rate, the average advantage reward, and the maximum episode value. When the reward value is not less than 0.8 max(η), the UAV is considered to be in the advantage state; the advantage steps rate is the ratio of the number of steps in the advantage state to the total number of steps in the episode. The average advantage reward is the average of the reward values of the advantage-state steps in the episode. The maximum episode value is the sum of all reward values in the episode. If the UAV flies out of the altitude limits described in equation (6), causing the episode to be interrupted, the maximum episode value is set to 0.
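For illustration, the three indicators can be computed from the per-step rewards of one episode as in the following sketch; the 0.8 max(η) threshold is taken from the definition above.

```python
import numpy as np

def episode_indicators(rewards, eta_max, interrupted=False):
    """Compute the three evaluation indicators for one episode.

    rewards : list of per-step reward values
    eta_max : max(eta), used for the 0.8*max(eta) advantage threshold
    """
    rewards = np.asarray(rewards, dtype=float)
    advantage_mask = rewards >= 0.8 * eta_max
    advantage_steps_rate = advantage_mask.mean() if len(rewards) else 0.0
    average_advantage_reward = rewards[advantage_mask].mean() if advantage_mask.any() else 0.0
    maximum_episode_value = 0.0 if interrupted else rewards.sum()
    return advantage_steps_rate, average_advantage_reward, maximum_episode_value
```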
To reflect the learning progress of the agent, evaluation episodes are executed periodically during training. In an evaluation episode, ε-greedy exploration is not used, and the online Q network directly outputs the action with the largest Q value. The advantage steps rate, the average advantage reward, and the maximum episode value in the episode are recorded to evaluate the maneuver policy learned so far.
In the next section, the training process is discussed in detail through simulation experiments. The DQN model training process is shown in the following algorithm.
IV. SIMULATION AND ANALYSIS
A. PLATFORM SETTING
In this paper, the short-range air combat environment model is implemented in Python, and the DQN network model is built with TensorFlow.
HARDWARE. Based on the UAV autonomous maneuver-
ing decision model, the man-machine air combat confronta-
tion system is developed. As shown in Figure 6, the man-
machine air combat confrontation system consists of three
modules: the UAV self-learning module, the manned aircraft
Algorithm: DQN MODEL TRAINING PROCESS
Initialize the online network Q with random parameters θ
Initialize the target network Q' with parameters θ' ← θ
Initialize the replay buffer R
Set the target maneuver policy (basic / confrontation)
for episode = 1, M do
    Initialize the initial air combat situation and receive the initial state s_1
    If episode mod evaluation frequency = 0, perform an evaluation episode
    for t = 1, T do
        With probability ε select a random action a_t
        Otherwise select a_t = argmax_a Q(s_t, a; θ)
        The UAV executes action a_t, and the target acts according to its policy
        Receive reward r_t and observe the new state s_{t+1}
        Store the transition (s_t, a_t, r_t, s_{t+1}) in R
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
        Set y_i = r_i + γ max_{a'} Q'(s_{i+1}, a'; θ')
        Perform a gradient descent step on (y_i − Q(s_i, a_i; θ))^2 with respect to the network parameters θ
        Every C steps reset θ' = θ
    end for
end for
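A simplified TensorFlow 2 sketch of the core update in this algorithm is given below. It assumes Keras Q networks such as the one sketched in Section IV, uses uniform replay sampling instead of the prioritized experience replay mentioned in Section II, and uses the Adam optimizer as an assumption (the paper only states the learning rate); the hyperparameter values repeat those given in Section IV.

```python
import random
from collections import deque

import tensorflow as tf

GAMMA, EPSILON, BATCH, TARGET_SYNC = 0.9, 0.1, 512, 2500   # EPSILON value is an assumption
replay = deque(maxlen=100_000)                              # replay buffer R; transitions are appended here
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)   # optimizer choice assumed

def choose_action(q_net, state, n_actions, epsilon=EPSILON):
    """Epsilon-greedy selection over the 15 maneuver actions."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    q_values = q_net(state[None, :], training=False)
    return int(tf.argmax(q_values[0]))

@tf.function
def train_step(q_net, target_net, states, actions, rewards, next_states):
    """One gradient-descent step on the TD error (y_i - Q(s_i, a_i; theta))^2."""
    targets = rewards + GAMMA * tf.reduce_max(target_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_values = tf.gather(q_net(states), actions, axis=1, batch_dims=1)
        loss = tf.reduce_mean(tf.square(targets - q_values))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss

# Every TARGET_SYNC steps: target_net.set_weights(q_net.get_weights())
```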
operation simulation module, and the air combat environment module. The three modules reside on three computers connected by Ethernet to exchange information. Each computer has an Intel(R) Core(TM) i7-8700K CPU and 16 GB of RAM. The UAV self-learning module computer is additionally equipped with an NVIDIA GeForce GTX 1080 Ti graphics card for TensorFlow acceleration.
FIGURE 6. The man-machine air combat confrontation system: (a) UAV self-learning module; (b) manned aircraft operation simulation module; (c) air combat environment module.
TABLE 2. MANEUVER LIBRARY

No.    Maneuver                n_x    n_z    µ
a1     forward maintain         0      1      0
a2     forward accelerate       2      1      0
a3     forward decelerate      -1      0      0
a4     left turn maintain       0      8     -acos(1/8)
a5     left turn accelerate     2      8     -acos(1/8)
a6     left turn decelerate    -1      8     -acos(1/8)
a7     right turn maintain      0      8      acos(1/8)
a8     right turn accelerate    2      8      acos(1/8)
a9     right turn decelerate   -1      8      acos(1/8)
a10    upward maintain          0      8      0
a11    upward accelerate        2      8      0
a12    upward decelerate       -1      8      0
a13    downward maintain        0      8      π
a14    downward accelerate      2      8      π
a15    downward decelerate     -1      8      π
PARAMETER SETTINGS. The parameters of the short-range air combat environment model are set as follows: the farthest attack distance D_max = 3 km, the minimum distance between the two aircraft D_min = 200 m, the maximum return value is adjusted to 5, the penalty value P = -10, the maximum speed v_max = 400 m/s, the minimum speed v_min = 90 m/s, the ceiling z_max = 12000 m, the minimum height z_min = 1000 m, the distance threshold D_threshold = 10000 m, a = 10, and b = 5. For the control space, n_x ∈ [-1, 2], n_z ∈ [0, 8], and µ ∈ [-π, π]. The 7 basic maneuvers are extended to 15 to achieve both direction and speed control, so the action space contains 15 basic maneuver actions. The control values of the basic actions in the maneuver library are shown in Table 2.
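In code, the maneuver library of Table 2 can be represented as a simple list of control triples [n_x, n_z, µ], for example:

```python
import math

ROLL_TURN = math.acos(1.0 / 8.0)  # roll angle paired with nz = 8 for the turning maneuvers in Table 2

# Each entry is the control triple (nx, nz, mu) of one action a_i in Table 2.
MANEUVER_LIBRARY = [
    (0, 1, 0.0),          # a1  forward maintain
    (2, 1, 0.0),          # a2  forward accelerate
    (-1, 0, 0.0),         # a3  forward decelerate
    (0, 8, -ROLL_TURN),   # a4  left turn maintain
    (2, 8, -ROLL_TURN),   # a5  left turn accelerate
    (-1, 8, -ROLL_TURN),  # a6  left turn decelerate
    (0, 8, ROLL_TURN),    # a7  right turn maintain
    (2, 8, ROLL_TURN),    # a8  right turn accelerate
    (-1, 8, ROLL_TURN),   # a9  right turn decelerate
    (0, 8, 0.0),          # a10 upward maintain
    (2, 8, 0.0),          # a11 upward accelerate
    (-1, 8, 0.0),         # a12 upward decelerate
    (0, 8, math.pi),      # a13 downward maintain
    (2, 8, math.pi),      # a14 downward accelerate
    (-1, 8, math.pi),     # a15 downward decelerate
]
```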
The parameters of the DQN model are set as follows. From the definitions of the state space and the maneuver library, the DQN has 13 input states and 15 output Q values. The online Q network and the target Q network are fully connected networks with 3 hidden layers of 512, 1024, and 512 units, respectively. The output layer has no activation function, and all other layers use the tanh activation. The learning rate is 0.001, and the discount factor is γ = 0.9. The target network is updated every 2500 steps. The weights of each layer are initialized with the variance scaling initializer, and the biases are initialized to zero. The minibatch size is set to 512, and the size of the replay buffer is set to 10^5.
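A Keras sketch of a Q network matching this description (13 inputs, tanh hidden layers of 512/1024/512 units, 15 linear outputs, variance-scaling weight and zero bias initialization) might look as follows; the exact TensorFlow version and layer API used by the authors are not stated, so this is only an approximation.

```python
import tensorflow as tf

def build_q_network(n_states=13, n_actions=15):
    """Fully connected Q network: three tanh hidden layers and a linear output layer."""
    init = tf.keras.initializers.VarianceScaling()
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_states,)),
        tf.keras.layers.Dense(512, activation="tanh", kernel_initializer=init, bias_initializer="zeros"),
        tf.keras.layers.Dense(1024, activation="tanh", kernel_initializer=init, bias_initializer="zeros"),
        tf.keras.layers.Dense(512, activation="tanh", kernel_initializer=init, bias_initializer="zeros"),
        tf.keras.layers.Dense(n_actions, activation=None, kernel_initializer=init, bias_initializer="zeros"),
    ])

online_q = build_q_network()
target_q = build_q_network()
target_q.set_weights(online_q.get_weights())   # theta' <- theta at initialization
```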
In the short-range air combat simulation, the decision period T is set to 1 s, and an episode contains 30 decision steps. If the UAV flies out of the boundary given by equation (6) during an episode, the episode ends.
Next, the basic training and confrontation training experiments of the maneuver decision model are carried out in turn. After the UAV has learned a certain maneuver policy, man-machine confrontation training is implemented.
B. MODEL TRAINING AND TESTING
BASIC TRAINING. In basic training, the target performs uniform linear motion and horizontal spiral maneuvers, respectively, while the UAV starts from an advantage, balance, or disadvantage situation, so that the UAV becomes fully familiar with the air combat situation. The training items are shown in Table 3.

TABLE 3. BASIC TRAINING ITEMS

Item   Target maneuver             Initial situation of UAV
1      uniform linear flight       advantage
2      uniform linear flight       balance
3      uniform linear flight       disadvantage
4      horizontal spiral maneuver  advantage
5      horizontal spiral maneuver  balance
6      horizontal spiral maneuver  disadvantage
In Table 3, advantage means that the UAV pursues the target from behind, balance means that the UAV and the target head toward each other, and disadvantage means that the target pursues the UAV from behind. Training is carried out item by item according to the numbering in Table 3. Each item is trained for 10^6 episodes, and an evaluation episode is performed every 3000 episodes during training.
In each training process, in order to make the UAV fully familiar with the air combat environment and to improve the diversity of transitions, the initial states of the UAV and the target in the training episodes are selected randomly within a wide range; to ensure consistent evaluation, the initial situation is fixed in the evaluation episodes. For example, for the first item, the initial situations of the training and evaluation episodes are shown in Table 4 and Figure 7.
Figure 8 shows the change of the maximum episode value during the training of item 1. The UAV updates its maneuver policy through interaction, and the maximum episode value continues to increase. Figure 9 shows the maneuvering trajectory of an evaluation episode after the training of item 1 is completed. The UAV starts to chase the target from the target's left rear, continuously adjusts its heading and speed, and maintains the tail-chase situation so that the target always stays in the interception area of the missile.
Figure 10 shows the change of the maximum episode value during a training process of item 3. Because the UAV starts from a disadvantage, the maximum episode value is low at the beginning of training, but as training progresses the UAV gradually grasps a maneuver policy that escapes the disadvantage and converts it into an advantage, so the maximum episode value rises. Figure 11 shows the maneuvering trajectory of an evaluation episode after the training of item 3 is completed. The UAV is at a disadvantage in front of the target at the initial moment; it first turns right to face the target, then turns right again onto the target's tail, and adjusts its speed to catch up with the target, keeping the target intercepted.
CONFRONTATION TRAINING. After completing all the basic training items in Table 3, confrontation training against a smart target is carried out.
FIGURE 7. Initial state settings for training item 1 (initial areas of the UAV and the target in training episodes, and initial states in the evaluation episode).

TABLE 4. INITIAL STATE SETTINGS FOR TRAINING ITEM 1

Initial state                 x (m)           y (m)          z (m)          v (m/s)     γ    ψ
Training episode    UAV       [-200, 200]     [-300, 300]    3000           280         0    [-π/3, π/3]
                    Target    [-3500, 3500]   [2000, 3500]   [2000, 4000]   [100, 300]  0    [-π/3, π/3]
Evaluation episode  UAV       0               0              3000           180         0    0
                    Target    3000            3000           3000           180         0    0
The UAV adopts the DQN model obtained from basic training as its maneuver policy, the target adopts the statistics-principle-based method as its policy, and the performance of each method is verified through confrontation simulation.
In confrontation training, the UAV uses the ε-greedy method to gradually explore the maneuver policy, starting from the results of basic training. Since the confrontation states are more complicated, the size of the replay buffer is increased from 10^5 to 10^6 to enlarge the sample space.
In order to ensure the diversity of combat states and the generalization of the maneuver policy, the initial states of the UAV and the target in the training episodes are randomly generated within a certain range, and confrontation training is carried out with the UAV starting from both balance and disadvantage initial situations. Table 5 shows the initial states used for training from the balance initial situation.
In the training process with the balance initial situation, the changes of the advantage steps rate, the average advantage reward, and the maximum episode value are shown in Figure 12, Figure 13, and Figure 14, respectively. In the three figures, the blue line shows the indicator during confrontation training without basic training, and the yellow line shows confrontation training after basic training. Since the UAV already has a fundamental flight policy after basic training, it makes no low-level errors such as flying out of the boundary; accordingly, the maximum episode value is rarely 0 due to episode interruption in the initial stage of confrontation training. The UAV without basic training has no experience of the air combat environment and easily exceeds the boundary limits during maneuver exploration, so its maximum episode value shows many zero values during confrontation training.
TABLE 5. SETTING OF THE BALANCE INITIAL SITUATION IN CONFRONTATION TRAINING

Initial state                 x (m)           y (m)          z (m)          v (m/s)     γ    ψ
Training episode    UAV       [-200, 200]     [-300, 300]    3000           280         0    [-π/3, π/3]
                    Target    [-3500, 3500]   [2000, 3500]   [2000, 4000]   [100, 300]  0    [2π/3, 4π/3]
Evaluation episode  UAV       0               0              3000           180         0    π/4
                    Target    3000            3000           3000           180         0    5π/4
FIGURE 8. The maximum episode value during the training of item 1 (dark line: smoothed version of the light line, smoothing rate 0.5; horizontal axis: training step, M = 1e6).
FIGURE 9. Maneuvering trajectory after training item 1.
In addition, since the target is smarter than the UAV at the beginning of training, it has more opportunities to fire its weapon at the UAV, which results in many negative maximum episode values. As training continues, the basically trained UAV gradually masters the target's maneuver policy and explores a maneuver policy that can defeat the target, so the three indicator values gradually increase. This shows that the UAV's maneuver policy moves it from the balance situation to an advantage situation as quickly as possible and then maintains that advantage. In contrast, over the same confrontation training period, the three indicators of the UAV without basic training do not rise and converge steadily, and the maximum episode value still shows large negative values at the end of training, indicating that the UAV has not learned a maneuver policy that gains an advantage over the target.
FIGURE 10. The maximum episode value during the training of item 3 (dark line: smoothed version of the light line, smoothing rate 0.5; horizontal axis: training step, M = 1e6).
FIGURE 11. Maneuvering trajectory after training item 3.
Figure 15 shows the maneuvering trajectory in an evaluation episode after the confrontation training with the balance initial situation. The two sides start flying head-on from their initial positions. After closing to a certain distance, the UAV flies to the right side of the target, and the target turns right to pursue the UAV. The UAV then reduces its speed and turning radius so that the target overshoots to the front of the UAV, and the UAV thus enters an advantage position.
FIGURE 12. The advantage steps rate during the confrontation training with balance initial situation (with and without basic training; horizontal axis: training step, M = 1e6).
FIGURE 13. The average advantage reward during the confrontation training with balance initial situation (with and without basic training; horizontal axis: training step, M = 1e6).
FIGURE 14. The maximum episode value during the confrontation training with balance initial situation (with and without basic training; horizontal axis: training step, M = 1e6).
Figure 16 shows the maneuvering trajectory in an evaluation episode after the confrontation training with the disadvantage initial situation. At the initial moment, the target is chasing the UAV from its tail, constantly maneuvering toward the UAV with the intention of reducing the distance and bringing the UAV into the missile interception area.
FIGURE 15. Confrontation maneuvering trajectory under balance initial situation.
FIGURE 16. Confrontation maneuvering trajectory under disadvantage initial situation.
The UAV immediately turns right, intending to escape the unfavorable situation of being chased, and constantly adjusts its speed and heading. After meeting the target, the UAV quickly turns right and climbs. When the target performs a barrel roll and intends to turn back, the UAV cuts in behind the target, achieves a tail chase, and obtains a firing opportunity.
The above confrontation training simulations show that the UAV short-range air combat maneuver decision model based on deep reinforcement learning established in this paper can obtain a maneuver policy through autonomous learning and defeat a target using the statistics-principle-based maneuver policy.
DECISION TIME PERFORMANCE. Air combat imposes extremely strict real-time requirements on maneuver decision.
FIGURE 17. One step decision time performance (one-step decision time in ms versus the number of maneuver actions, for the deep Q network and the algorithm based on the statistics principle).
It is therefore necessary to test the real-time performance of the maneuver decision model. According to the size of the maneuver library, 9 groups of tests were performed, with the number of maneuver actions increasing from 7 to 15. The average one-step decision time of the DQN model and of the algorithm based on the statistics principle was measured over 1000 decision steps in each test; the results are shown in Figure 17. As the number of maneuver actions increases, the one-step decision time of the DQN model remains around 0.6 ms, while the decision time of the statistics-principle algorithm increases from about 2 ms to nearly 6 ms.
The experimental results show that the real-time performance of the DQN decision model is better than that of the algorithm based on the statistics principle, even though the latter performs only a single traversal of the maneuver library and its computation time is already relatively short. Other optimization approaches, such as genetic algorithms, require a large number of iterative calculations, and their real-time performance makes online decision making even harder to achieve; indeed, the author of [9] notes that the purpose of such optimization is not online control but finding meaningful new maneuvers for tactical research. It can therefore be concluded that the real-time performance of the model established in this paper is better than that of iterative optimization algorithms.
MAN-MACHINE AIR COMBAT CONFRONTATION. In the confrontation training, the UAV learned to defeat a target that follows a fixed maneuver policy. Such a policy has little randomness, is easy to learn and counter, and therefore cannot reflect the complexity of an opponent's maneuver policy in real air combat. To further verify the self-learning ability of the reinforcement learning method and the correctness of the acquired maneuver policy, the target aircraft should be controlled by a human operator, so a man-machine air combat confrontation system was developed.
The UAV self-learning module is built from the above UAV air combat maneuver decision model; its main function is to update and improve the maneuver policy according to the air combat data collected during man-machine confrontation. As shown in Figure 18 and Figure 6(b), the manned aircraft operation simulation module provides the operator with visualizations of the flight attitude and the air combat situation, and it also provides a HOTAS joystick for real-time control of the aircraft.
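As an illustrative sketch only (the paper does not specify the software used to read the HOTAS device), the joystick polling could be implemented with the pygame joystick API; the axis indices and sign conventions below are assumptions.

```python
import pygame

# Minimal sketch of polling a HOTAS joystick for real-time aircraft control.
pygame.init()
pygame.joystick.init()
stick = pygame.joystick.Joystick(0)   # assumes a single joystick is connected
stick.init()

def read_control_command():
    """Return (roll_cmd, pitch_cmd, throttle_cmd), each roughly in [-1, 1]."""
    pygame.event.pump()                # refresh joystick state
    roll_cmd = stick.get_axis(0)       # assumed: axis 0 = lateral stick deflection
    pitch_cmd = -stick.get_axis(1)     # assumed: axis 1 = longitudinal stick (inverted)
    throttle_cmd = -stick.get_axis(2)  # assumed: axis 2 = throttle lever
    return roll_cmd, pitch_cmd, throttle_cmd
```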
The main function of the air combat environment module is to receive the flight state information of the UAV and the manned aircraft, display the three-dimensional situation of the current air combat, evaluate the current situation, determine whether either side has been shot down, and output the situation information and the evaluation value to both sides.
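A minimal sketch of such an environment interface is given below. The state fields, the situation-evaluation call, and the shoot-down test are placeholders standing in for the paper's evaluation model, not its actual implementation.

```python
from dataclasses import dataclass

@dataclass
class AircraftState:
    x: float              # position east (m)
    y: float              # position north (m)
    z: float              # altitude (m)
    v: float              # speed (m/s)
    track_angle: float    # flight path angle (rad)
    heading: float        # heading angle (rad)

class AirCombatEnvironment:
    """Receives both aircraft states, evaluates the situation, and reports the outcome."""

    def step(self, uav: AircraftState, manned: AircraftState):
        uav_value = self.evaluate_situation(uav, manned)      # advantage value for the UAV
        manned_value = self.evaluate_situation(manned, uav)   # advantage value for the manned aircraft
        done = self.is_shot_down(uav, manned) or self.is_shot_down(manned, uav)
        # Both sides receive the situation information and their evaluation value.
        return uav, manned, uav_value, manned_value, done

    def evaluate_situation(self, own: AircraftState, other: AircraftState) -> float:
        raise NotImplementedError   # placeholder for the air combat evaluation model

    def is_shot_down(self, attacker: AircraftState, defender: AircraftState) -> bool:
        raise NotImplementedError   # placeholder for the missile interception area test
```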
In this model, the flight performance of the manned aircraft and the UAV is identical. During training, the UAV first confronts the manned aircraft online in real time using the policy learned from the confrontation training. After a certain amount of confrontation against the manned aircraft, randomly intercepted segments of the saved manned-aircraft flight path data are replayed as target trajectories for off-line reinforcement learning; the improved maneuver policy is then used to confront the manned aircraft again, and this confrontation-learning process is iterated, as sketched below.
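The alternating online/off-line procedure can be outlined as follows; the agent and environment interfaces assumed here (act, step, train_on_target_trajectory) are hypothetical and only indicate where the paper's modules would plug in.

```python
import random

def confrontation_learning_loop(agent, env, num_rounds: int, offline_updates_per_round: int):
    """Alternate online man-machine confrontation with off-line learning on replayed trajectories."""
    trajectory_store = []   # saved flight path state data of the manned aircraft

    for _ in range(num_rounds):
        # 1) Online real-time confrontation against the human-controlled aircraft.
        state, done, manned_states = env.reset(), False, []
        while not done:
            action = agent.act(state)                             # hypothetical interface
            state, reward, done, manned_state = env.step(action)  # hypothetical interface
            manned_states.append(manned_state)
        trajectory_store.append(manned_states)

        # 2) Off-line learning: randomly intercept a saved manned-aircraft trajectory
        #    and replay it as the target trajectory for DQN updates.
        for _ in range(offline_updates_per_round):
            replay_segment = random.choice(trajectory_store)
            agent.train_on_target_trajectory(replay_segment)      # hypothetical interface

        # 3) The improved policy is then used in the next round of confrontation.
    return agent
```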
Figure 19 shows the trajectory of a confrontation before the policy update; the manned aircraft defeats the UAV. Figure 20 shows the trajectory of a confrontation after the policy update; the UAV defeats the manned aircraft after off-line training.
The man-machine air combat confrontation experiment therefore shows that the UAV short-range air combat maneuver decision model based on deep reinforcement learning can learn autonomously and update its maneuver policy to gain the advantage in air combat.
DQN vs DQN. Finally, an exploratory experiment is carried out in which the UAV and the target use the same DQN maneuver decision model during training. Both sides use the basic training model parameters, and 1,000,000 training episodes with the balance initial situation are performed. The difference between the maximum episode values of the UAV and the target is used as the evaluation index. As shown in Figure 21, in the early stage of training the index oscillates around 0 because of the randomness of exploration. As training progresses, the amplitude of the oscillation gradually decreases and finally converges to 0, indicating that the policies of the two sides become consistent. As shown in Figure 22, after training converges to a balanced policy, the two sides pursue each other in the confrontation, forming an equilibrium situation in which neither side can obtain an advantage.
[Figure 18 screenshot: enemy plane state information panel showing longitude, latitude, height (m), speed (m/s), track angle and heading angle (degrees).]
FIGURE 18. Interaction interface of manned aircraft operation simulation module.
FIGURE 19. Maneuver trajectory of a confrontation before the confrontation
policy update.
FIGURE 20. Maneuver trajectory of a confrontation after the confrontation policy update.
From the experimental results it can be seen that, when the target's maneuver policy is deterministic, the DQN model can learn a maneuver policy that gains the advantage in air combat and the simulation produces an unbalanced result, because the goal of the reinforcement learning algorithm is to maximize the return. When both sides use the DQN model, both can adopt the same policy, and in theory the two sides will reach an equilibrium state.
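As a hedged sketch of the bookkeeping behind the evaluation index in Figure 21 (an assumption about the implementation, not the paper's code), the difference between the maximum episode values of the two sides can be tracked during self-play as follows.

```python
from collections import deque

import numpy as np

class SelfPlayBalanceIndex:
    """Track the difference between the maximum episode values of the UAV and the target.

    An index that converges to 0 over training suggests the two identical DQN agents
    are settling into consistent, balanced policies.
    """

    def __init__(self, window: int = 1000):
        self.uav_values = deque(maxlen=window)      # recent episode values of the UAV
        self.target_values = deque(maxlen=window)   # recent episode values of the target

    def record_episode(self, uav_value: float, target_value: float) -> None:
        self.uav_values.append(uav_value)
        self.target_values.append(target_value)

    def difference_of_max(self) -> float:
        """Evaluation index: max episode value of the UAV minus that of the target."""
        return float(np.max(self.uav_values) - np.max(self.target_values))
```

During training, record_episode would be called once per episode and difference_of_max logged periodically; oscillation around zero early on and convergence to zero later would reproduce the qualitative behavior reported in Figure 21.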
V. CONCLUSIONS
Based on reinforcement learning theory, a maneuver decision model for UAV short-range air combat was established. The improved state dimension of the air combat maneuver decision environment made the state description more realistic, and the action space was expanded to be more comprehensive.
Aiming at the problem of low learning efficiency and
[Figure 21 plot: difference of max episode value versus training step, 1M-9M (M = 1e6).]
FIGURE 21. The difference between the maximum episode value of UAV and
target.
FIGURE 22. Maneuver trajectory of a confrontation after DQN vs DQN
training.
local optima caused by the large state space of air combat, this paper proposed a training method that progresses from basic training to confrontation training. The simulation results showed that this training method can effectively improve the efficiency with which the UAV learns a confrontation maneuver policy. They also showed that the UAV short-range air combat maneuver decision model based on deep reinforcement learning can learn autonomously and update its policy until the target is defeated.
Due to limitations of time and equipment resources, this paper does not provide a more detailed experimental analysis of some issues, such as the impact of the division of the action space on decision effectiveness. A finer division of the action space and a larger maneuver library would undoubtedly improve the precision of UAV control, but too many output units weaken the discrimination ability of the deep neural network. Therefore, for a given network structure, the optimal number of actions should be determined through extensive experiments, which is left as a problem for future research. In addition, when designing the basic training program, the impact of its specific contents on the learning efficiency of the confrontation training was not analyzed; it was only shown that basic training improves the learning efficiency of the confrontation training. Further experimental analysis of the optimization of the basic training design can be carried out in future work.
REFERENCES
[1] Skjervold E, Hoelsreter O T, "Autonomous, Cooperative UAV Operations Using COTS Consumer Drones and Custom Ground Control Station," in Proc. 2018 IEEE Military Communications Conference (MILCOM), 2018, pp. 486-492.
[2] YANG Qiming, ZHANG Jiandong, SHI Guoqing, "Modeling of UAV path planning based on IMM under POMDP framework," Journal of Systems Engineering and Electronics, vol. 30, no. 3, pp. 545-554, 2019.
[3] Mcgrew J S, How J P, Williams B, et al, "Air-Combat Strategy Using Approximate Dynamic Programming," Journal of Guidance, Control, and Dynamics, vol. 33, no. 5, pp. 1641-1654, 2010.
[4] Xu G, Wei S, Zhang H, "Application of Situation Function in Air Combat Differential Games," in Proc. 36th Chinese Control Conference (CCC), 2017, pp. 5865-5870.
[5] Park H, Lee B Y, Tahk M J, et al, "Differential Game Based Air Combat Maneuver Generation Using Scoring Function Matrix," International Journal of Aeronautical and Space Sciences, vol. 17, no. 2, pp. 204-213, 2015.
[6] Xie R Z, Li J Y, Luo D L, "Research on maneuvering decisions for multi-UAVs air combat," in Proc. IEEE International Conference on Control and Automation, 2014, pp. 767-772.
[7] Lin Z, Tong M, Wei Z, et al, "Sequential maneuvering decisions based on multi-stage influence diagram in air combat," Journal of Systems Engineering and Electronics, vol. 18, no. 3, pp. 551-555, 2007.
[8] Zhang S, Zhou Y, Li Z, et al, "Grey wolf optimizer for unmanned combat aerial vehicle path planning," Advances in Engineering Software, vol. 99, pp. 121-136, 2016.
[9] Smith R E, Dike B A, Mehra R K, et al, "Classifier systems in combat: Two-sided learning of maneuvers for advanced fighter aircraft," Computer Methods in Applied Mechanics and Engineering, vol. 186, no. 2, pp. 421-437, 2000.
[10] Changqiang H, Kangsheng D, Hanqiao H, et al, "Autonomous air combat maneuver decision using Bayesian inference and moving horizon optimization," Journal of Systems Engineering and Electronics, vol. 29, no. 1, pp. 86-97, 2018.
[11] Pan Q, Zhou D, Huang J, et al, "Maneuver decision for cooperative close-range air combat based on state predicted influence diagram," in Proc. 2017 IEEE International Conference on Information and Automation (ICIA), 2017, pp. 726-731.
[12] Wang D, Zu W, Chang H, et al, "Research on automatic decision making of UAV based on Plan Goal Graph," in Proc. IEEE International Conference on Robotics and Biomimetics, 2017, pp. 1245-1249.
[13] WANG Yuan, HUANG Changqiang, TANG Chuanlin, "Research on unmanned combat aerial vehicle robust maneuvering decision under incomplete target information," Advances in Mechanical Engineering, vol. 8, no. 10, pp. 1-12, 2016.
[14] Hai-Feng G, Man-Yi H, Qing-Jie Z, et al, "UCAV Robust Maneuver Decision Based on Statistics Principle," Acta Armamentarii, vol. 38, no. 1, pp. 160-167, 2017.
[15] Geng W X, Kong F, Ma D Q, "Study on tactical decision of UAV medium-range air combat," in Proc. 26th Chinese Control and Decision Conference (CCDC), 2014, pp. 135-139.
[16] FU Li, XIE Huaifu, "An UAV air-combat decision expert system based on receding horizon control," Journal of Beijing University of Aeronautics and Astronautics, vol. 41, no. 11, pp. 1994-1999, 2015.
[17] Roger W S, Alan E B, "Neural Network Models of Air Combat Maneuvering," Ph.D. dissertation, New Mexico State University, New Mexico, USA, 1992.
[18] DING Linjing, YANG Qiming, "Research on Air Combat Maneuver Decision of UAVs Based on Reinforcement Learning," Avionics Technology, vol. 49, no. 2, pp. 29-35, 2018.
[19] Liu P, Ma Y, "A Deep Reinforcement Learning Based Intelligent Decision Method for UCAV Air Combat," in Proc. Asian Simulation Conference, Springer, Singapore, 2017, pp. 274-286.
[20] ZUO Jialiang, YANG Rennong, ZHANG Ying, et al, "Intelligent decision-making in air combat maneuvering based on heuristic reinforcement learning," Acta Aeronautica et Astronautica Sinica, vol. 38, no. 10, pp. 321168-1-321168-14, 2017.
[21] ZHANG Xianbing, LIU Guoqing, YANG Chaojie, WU Jiang, "Research on Air Confrontation Maneuver Decision-Making Method Based on Reinforcement Learning," Electronics, vol. 7, no. 11, p. 279, 2018.
[22] Sutton R S, Barto A G, "Reinforcement Learning: An Introduction," IEEE Transactions on Neural Networks, vol. 9, no. 5, pp. 1054-1054, 1998.
[23] Mnih V, Kavukcuoglu K, Silver D, Rusu A A, Veness J, Bellemare M G, et al, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[24] Fred A, Giro C, Michael F, et al, "Automated maneuvering decisions for air-to-air combat," in Proc. AIAA Guidance, Navigation and Control Conference, Monterey, 1987.
[25] Sun T Y, Tsai S J, Lee Y N, et al, "The study on intelligent advanced fighter air combat decision support system," in Proc. 2006 IEEE International Conference on Information Reuse and Integration, Waikoloa, Hawaii, 2006.
QIMING YANG was born in Xining, Qinghai Province, China, in 1988. He received his master's degree from Northwestern Polytechnical University, Xi'an, China, in 2013. Since 2016, he has been a Ph.D. candidate in electronic science and technology at Northwestern Polytechnical University. His main research interests are artificial intelligence and its application to the control and decision-making of UAVs.
JIANDONG ZHANG was born in Yantai, Shandong Province, China, in 1974. He received both his M.S. and Ph.D. degrees in system engineering from Northwestern Polytechnical University, China, where he is now an associate professor in the Department of System and Control Engineering. He has published more than 20 refereed journal and conference papers. His research interests include modeling, simulation and effectiveness evaluation of complex systems, development and design of integrated avionics systems, and system measurement and test technologies.
GUOQING SHI was born in Xi'an, Shaanxi Province, China, in 1974. He received both his M.S. and Ph.D. degrees in system engineering from Northwestern Polytechnical University, China, where he is now an associate professor in the Department of System and Control Engineering. He has published more than 10 refereed journal and conference papers. His research interests include integrated avionics system measurement and test technologies, development and design of embedded real-time systems, and modeling, simulation and effectiveness evaluation of complex systems.
JINWEN HU is an associate professor at the School of Automation, Northwestern Polytechnical University. He received his bachelor's and master's degrees from Northwestern Polytechnical University in 2005 and 2008, respectively, and his Ph.D. degree from Nanyang Technological University in 2013. He worked as a research scientist at the Singapore Institute of Manufacturing Technology from 2012 to 2015. His research interests include multi-agent systems, distributed control, unmanned vehicles, information fusion, and process control.
YONG WU was born in Xi'an, Shaanxi Province, China, in 1964. He is a professor in the Department of System and Control Engineering, Northwestern Polytechnical University, China, where he received his M.S. degree in system engineering. He has published more than 20 refereed journal and conference papers. His research interests include modeling, simulation and effectiveness evaluation of complex systems, development and design of integrated avionics systems, and system measurement and test technologies. Prof. Wu received four national defense science and technology progress awards, in 2004, 2005, and 2011.