Temporal Difference Learning Waveform
Selection
Bin Wang
Northeastern University, Shenyang, China
wangbin_neu@yahoo.com.cn
Jinkuan Wang, Xin Song and Yinghua Han
Northeastern University at Qinhuangdao, Qinhuangdao, China
sxin78916@mail.neuq.edu.cn
Abstract—How to optimally select the radar waveform for the next transmission based on observations of past radar returns is one of the important problems in cognitive radar. In this paper, a stochastic dynamic programming model is proposed. A temporal difference learning method is then used to realize the adaptivity of waveform selection. The simulation results show that the uncertainty of state estimation using temporal difference learning is lower than that using a fixed waveform. The temporal difference learning method approaches the optimal waveform selection scheme but has lower computational cost. Finally, the whole paper is summarized.
Index Terms—waveform selection, stochastic dynamic
programming, temporal difference learning, cognitive radar
I. INTRODUCTION
Radar is the name of an electronic system used for the detection and location of objects. All early radars used radio waves, but some modern radars are based on optical waves and the use of lasers. With the development of modern technology, present-day systems are very sophisticated and advanced. We should consider optimizing the design of the transmitter, not only the receiver. That means there should be a feedback loop from the receiver to the transmitter. This is the core idea of cognitive radar.
Cognitive radar is a new radar system framework proposed by Simon Haykin in 2006. It can perceive the external environment in real time, select the optimal waveform, and match the transmitted waveform to the target environment and to the information demands of the radar task, so that multiple functions such as searching, tracking, guidance, and identification of friend or foe for multiple targets can be realized. It builds on three basic ingredients: intelligent signal processing; feedback from the receiver to the transmitter; and preservation of the information content of radar returns [1]. More and more researchers are now working in this field.
The obvious difference between cognitive radar and traditional radar is that cognitive radar can select appropriate waveforms according to the radar environment. How to realize the adaptivity of the transmitter is therefore an important problem in cognitive radar. The design of an adaptive transmitter involves an adaptive model and an adaptive algorithm. Goodman proposed and simulated a closed-loop active sensor that updates the probabilities on an ensemble of target hypotheses while adapting customized waveforms in response to prior measurements, and compared the performance of two different waveform design techniques [2]. In [3], the author focuses on a cognitive tracking radar, the implementation of which comprises two distinct functional blocks, one in the receiver and the other in the transmitter, with a feedback link from the receiver to the transmitter. In [4], Arasaratnam derived the best approximation to the Bayesian filter in the sense of completely preserving second-order information, which is called the cubature Kalman filter. In [5], the Infomax principle, aimed at maximizing the mutual information, is used for designing the transmitted signal waveform. In [6], an extension to the PDA tracking algorithm to include adaptive waveform selection was developed. In [7], it is
shown that tracking errors are highly dependent on the
waveforms used and in many situations tracking
performance using a good heterogeneous waveform is
improved by an order of magnitude when compared with
a scheme using a homogeneous pulse with the same
energy. The problem of waveform selection can be
thought of as a sensor scheduling problem, as each
possible waveform provides a different means of
measuring the environment, and related works have been
examined in [8], [9]. In [10], an adaptive waveform
selective probabilistic data association algorithm for
tracking a single target in clutter is presented. In [11],
radar waveform selection algorithms for tracking
accelerating targets are considered. In [12], a genetic algorithm is used to perform waveform selection utilizing the autocorrelation and ambiguity functions in the fitness evaluation. In [13], the Incremental Pruning method is used to solve the problem of adaptive waveform selection for target detection. The problem of optimal adaptive waveform selection for target tracking is also presented in
1394JOURNAL OF COMPUTERS, VOL. 5, NO. 9, SEPTEMBER 2010
© 2010 ACADEMY PUBLISHER
doi:10.4304/jcp.5.9.13941401
[14]. In [15], the author uses an ADP method to solve the problem of adaptive waveform selection.
In this paper, under the assumption of range-Doppler resolution cells, a stochastic dynamic programming model for the adaptive transmitter is proposed. We use the temporal difference learning method to realize the adaptivity of waveform selection. The simulation results show the validity of the proposed algorithm.
II. RANGE-DOPPLER RESOLUTION CELL
The design of an adaptive transmitter in cognitive radar involves an adaptive model and an adaptive algorithm. We first consider setting up the adaptive model. Generally speaking, for a target, the most important parameters that a radar measures are range, Doppler frequency, and two orthogonal space angles. If we envision a radar resolution cell that contains a certain four-dimensional hypervolume, we may assume that different targets fall in different resolution cells. That means that if one measured target falls in a resolution cell, another target falls in a different resolution cell and does not interfere with measurements on the first. So as long as each target occupies a resolution cell and the cells are all disjoint, the radar can make measurements on each target free of interference from the others. For a single radar pulse, we may give a general definition by considering the resolution cell to be bounded in range by the compressed pulse’s duration, in Doppler by the reciprocal of the transmitted pulse’s duration, and in the two angles by the antenna pattern’s two orthogonal-plane beamwidths.
For example, if a radar seeks to make measurements on targets resolved in Doppler frequency at the same time, it can provide a bank of matched filters operating in parallel. Each target will excite the filter that is matched to its Doppler frequency, and its response can be used for measurements. Targets resolved in the range coordinate can be separated with range gates followed by measurements. Thus a radar can perform simultaneous measurements on targets unresolved in angle, provided the targets are resolved in range, or Doppler frequency, or both. However, it is difficult to simultaneously measure targets in angle coordinates. Such measurements require either a bank of main beams or the time-sharing of one main beam among the various targets.
Through the preceding discussion, we can conclude that angle resolution can be considered independently from range and Doppler resolution in most circumstances. Under this assumption, the resolution properties of the radar in angle are independent of the resolution properties in range and Doppler frequency [16].
We define the range-Doppler resolution cell for the waveform selection model. $\Delta R$ denotes the range resolution, a radar metric that describes the ability to detect targets in close proximity to each other as distinct objects. Radar systems are normally designed to operate between a minimum range $R_{\min}$ and a maximum range $R_{\max}$. The distance between $R_{\min}$ and $R_{\max}$ is divided into $N$ range bins, each of width $\Delta R$,

$$N = \frac{R_{\max} - R_{\min}}{\Delta R}. \tag{1}$$

Targets separated by at least $\Delta R$ will be completely resolved in range.
Radars use Doppler frequency to extract target radial velocity (range rate), as well as to distinguish moving and stationary targets or objects such as clutter. The Doppler phenomenon describes the shift in the center frequency of an incident waveform.
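As a small numerical illustration of equation (1), the sketch below counts range bins and maps a target range to its bin index. The function names and the example values (1 km to 31 km with 150 m resolution) are hypothetical choices for illustration, not parameters taken from this paper.

```python
import math


def range_bins(r_min_m, r_max_m, delta_r_m):
    """Number of range bins N = (R_max - R_min) / delta_R, as in equation (1)."""
    if delta_r_m <= 0 or r_max_m <= r_min_m:
        raise ValueError("need delta_r_m > 0 and r_max_m > r_min_m")
    return math.floor((r_max_m - r_min_m) / delta_r_m)


def bin_index(target_range_m, r_min_m, delta_r_m):
    """Zero-based index of the range bin that contains the target."""
    return int((target_range_m - r_min_m) // delta_r_m)


# Hypothetical radar: 1 km to 31 km, 150 m range resolution.
n_bins = range_bins(1_000, 31_000, 150)

# Two targets 200 m apart (more than delta_R) fall in different bins.
first_bin = bin_index(10_000, 1_000, 150)
second_bin = bin_index(10_200, 1_000, 150)
```

Targets whose ranges differ by at least one bin width are guaranteed to land in different bins, which matches the statement that such targets are resolved in range.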
Figure 1. Radar and target.
Figure 1 shows a radar and a target. In heavy clutter environments, it is an acute problem that good Doppler resolution and good range resolution cannot be obtained simultaneously from a single tailored waveform. So we need to consider the problem of adaptive waveform selection and make a tradeoff between them. The basic scheme for adaptive waveform selection is to define a cost function that describes the cost of observing a target in a particular location for each individual pulse and to select the waveform that optimizes this function on a pulse-by-pulse basis.
We make no assumptions about the number of targets that may be present. We divide the area covered by a particular radar beam into a grid in range-Doppler space, with the cells in range indexed by $\tau = 1,\ldots,N$ and those in Doppler indexed by $\nu = 1,\ldots,M$. There may be no target, one target, or up to $NM$ targets, so the number of possible scenes or hypotheses about the radar scene is $2^{NM}$. Let the space of hypotheses be denoted by $\mathcal{X}$. The state of our model is $X_t = x$, where $x \in \mathcal{X}$. Let $Y_{t+1}$ be the measurement variable. Let $u_t$ be the control variable that indicates which waveform is chosen at time $t$ to generate measurement $Y_{t+1}$, where $u_t \in U$. The probability of receiving a particular measurement $Y_{t+1} = x'$ will depend on both the true, underlying scene $X_t = x$ and on the choice of waveform used to generate the measurement.
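To make the size of the hypothesis space concrete, the following sketch enumerates every scene for a very small grid. Representing a scene as the set of occupied (range, Doppler) cells is an assumption made here for illustration; the function name is hypothetical.

```python
from itertools import product


def enumerate_scenes(n_range, m_doppler):
    """List every scene as a frozenset of occupied (range, Doppler) cells.

    Each of the N*M cells is either empty or occupied, giving 2**(N*M) scenes.
    """
    cells = [(tau, nu)
             for tau in range(1, n_range + 1)
             for nu in range(1, m_doppler + 1)]
    scenes = []
    for occupancy in product((0, 1), repeat=len(cells)):
        scenes.append(frozenset(c for c, occ in zip(cells, occupancy) if occ))
    return scenes


# A 2 x 2 grid already has 2**4 = 16 possible scenes.
scenes = enumerate_scenes(2, 2)
single_target_scenes = [s for s in scenes if len(s) == 1]
```

The exponential growth in the number of scenes is the reason exact dynamic programming becomes impractical for realistic grid sizes.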
We assume that the evolution of the state is governed by a Markov process and define $a_{xx'}$ as the state transition probability, where

$$a_{xx'} = P(x_{t+1} = x' \mid x_t = x), \tag{2}$$

and $\mathbf{A} = (a_{xx'})_{x, x' \in \mathcal{X}}$ is the state transition matrix. We define $b_{xx'}(u_t)$ as the measurement probability, where

$$b_{xx'}(u_t) = P(Y_{t+1} = x' \mid X_t = x, u_t), \tag{3}$$

and $\mathbf{B}(u_t) = (b_{xx'}(u_t))_{x, x' \in \mathcal{X}}$ is the measurement probability matrix. That means that if the state of our model is $X_t = x$ and we use the waveform $u_t$, the probability that the measurement is $Y_{t+1} = x'$ is $b_{xx'}(u_t)$.
Figure 2. Resolution cell and corresponding parallelogram.
To handle the case in which a combined waveform is used, we define the practical resolution cell as the parallelogram that contains the primitive resolution cell. Figure 2 shows the resolution cell and the corresponding parallelogram.
Let us consider calculating these probabilities. A matched filter is adopted in the receiver. Assume the transmitted baseband signal is $s(t)$ and the received baseband signal is $r(t)$. The matched filter is the one with impulse response $h(t) = s^*(-t)$, so the output process of our matched filter is

$$x(t) = \int s^*(\lambda - t)\, r(\lambda)\, d\lambda. \tag{4}$$

In the radar case, the return signal is expected to be Doppler shifted. The matched filter for a return signal with an expected frequency shift $\nu_0$ has the impulse response

$$h(t) = s^*(-t)\, e^{j 2\pi \nu_0 t}. \tag{5}$$

The output is given by

$$x(t) = \int s^*(\lambda - t)\, e^{-j 2\pi \nu_0 (\lambda - t)}\, r(\lambda)\, d\lambda, \tag{6}$$

where $\nu_0$ is the expected frequency shift.
The baseband received signal will be modeled as a return from a Swerling target.
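A discrete-time sketch of the Doppler-matched filter in (6) may help fix the sign conventions. It assumes sampled signals, a Doppler shift expressed in cycles per sample, and a hypothetical 5-chip Barker code as the pulse; none of these choices come from the paper.

```python
import cmath


def matched_filter_output(received, pulse, doppler_shift=0.0):
    """Discrete analogue of (6): x[k] = sum_m conj(s[m]) e^{-j 2 pi nu0 m} r[k + m].

    Here m plays the role of (lambda - t) and k the role of t.
    doppler_shift is nu0 in cycles per sample (an assumed unit).
    """
    outputs = []
    for k in range(len(received) - len(pulse) + 1):
        acc = 0j
        for m, s_m in enumerate(pulse):
            reference = complex(s_m).conjugate() * cmath.exp(
                -2j * cmath.pi * doppler_shift * m)
            acc += reference * received[k + m]
        outputs.append(acc)
    return outputs


# Hypothetical example: a Barker-5 pulse delayed by 7 samples and Doppler shifted.
pulse = [1, 1, 1, -1, 1]
delay, nu0 = 7, 0.05
received = [0j] * 20
for m, s_m in enumerate(pulse):
    received[delay + m] = s_m * cmath.exp(2j * cmath.pi * nu0 * m)

# Squared magnitude of the filter output at each candidate delay.
magnitudes = [abs(x) ** 2 for x in matched_filter_output(received, pulse, nu0)]
peak_lag = max(range(len(magnitudes)), key=magnitudes.__getitem__)
```

When the filter is matched in both delay and Doppler, all pulse samples add coherently, so the squared magnitude peaks at the true delay with a value equal to the squared pulse energy.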
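At time $t$, the squared magnitude of the output of a filter matched to zero delay and zero Doppler shift is

$$|x(t)|^2 = \left| \int r(\lambda)\, s^*(\lambda - t)\, d\lambda \right|^2. \tag{7}$$

In the following we consider two situations: no target is present, and a target is present.

When there is no target, the received signal contains only the noise $n(t)$, and

$$x(\tau_0) = \int n(\lambda)\, s^*(\lambda - \tau_0)\, d\lambda. \tag{8}$$

The random variable $x(\tau_0)$ is complex Gaussian, with zero mean and variance given by

$$\sigma_0^2 = E\{x(\tau_0)\, x^*(\tau_0)\} = 2 N_0 \xi, \tag{9}$$

where $\xi$ is the energy of the transmitted pulse.

When a target is present,

$$x(\tau_0) = \int \left[ A\, s(\lambda - \tau)\, e^{j 2\pi (\nu - \nu_0) \lambda} + n(\lambda) \right] s^*(\lambda - \tau_0)\, d\lambda. \tag{10}$$

This random variable is still zero mean, with variance given by

$$\sigma_1^2 = E\{x(\tau_0)\, x^*(\tau_0)\} = \sigma_0^2 \left( 1 + \frac{2 \sigma_A^2 \xi^2}{\sigma_0^2}\, A(\tau - \tau_0, \nu - \nu_0) \right), \tag{11}$$

where $A(\tau, \nu)$ is the ambiguity function, given by

$$A(\tau, \nu) = \frac{1}{\xi^2} \left| \int s(\lambda)\, s^*(\lambda - \tau)\, e^{j 2\pi \nu \lambda}\, d\lambda \right|^2. \tag{12}$$

Recall that the squared magnitude of a complex Gaussian random variable $x_i \sim N(0, \sigma_i^2)$ is exponentially distributed, with density given by

$$y = |x_i|^2 \sim \frac{1}{\sigma_i^2}\, e^{-y / \sigma_i^2}. \tag{13}$$

In the case when a target is present in cell $(\tau, \nu)$, assuming its actual location in the cell has a uniform distribution,

$$P_d = \frac{1}{|\mathcal{A}|} \int_{(\tau_a, \nu_a) \in \mathcal{A}} \exp\!\left( - \frac{D}{\sigma_0^2 \left( 1 + \frac{2 \sigma_A^2 \xi^2}{\sigma_0^2}\, A(\tau_a, \nu_a) \right)} \right) d\tau_a\, d\nu_a, \tag{14}$$

where $\mathcal{A}$ is the resolution cell centred on $(\tau, \nu)$ with volume $|\mathcal{A}|$.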
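The role of the exponential distribution in the detection probability can be sketched numerically. The code assumes that the squared filter output is exponentially distributed with mean equal to its variance, that a detection is declared when it exceeds a threshold (read here as the quantity $D$), and that the integral over the resolution cell is approximated by averaging over sampled ambiguity values. The names and numbers below are hypothetical.

```python
import math


def exceed_probability(threshold, variance):
    """P(y > threshold) when y is exponentially distributed with mean `variance`."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return math.exp(-threshold / variance)


def threshold_for_false_alarm(p_fa, sigma0_sq):
    """Threshold that gives false-alarm probability p_fa under noise only."""
    return -sigma0_sq * math.log(p_fa)


def detection_probability(threshold, sigma0_sq, snr_gain, ambiguity_samples):
    """Average exceedance probability over sampled target positions in a cell.

    snr_gain stands for 2 * sigma_A^2 * xi^2 / sigma_0^2, and each entry of
    ambiguity_samples is a value of the ambiguity function in [0, 1].
    """
    total = 0.0
    for a in ambiguity_samples:
        sigma1_sq = sigma0_sq * (1.0 + snr_gain * a)
        total += exceed_probability(threshold, sigma1_sq)
    return total / len(ambiguity_samples)


# Hypothetical numbers: noise variance 2, false-alarm probability 1e-3, gain 9.
sigma0_sq = 2.0
threshold = threshold_for_false_alarm(1e-3, sigma0_sq)
p_peak = detection_probability(threshold, sigma0_sq, 9.0, [1.0])
p_cell = detection_probability(threshold, sigma0_sq, 9.0, [1.0, 0.5, 0.0])
```

Averaging over positions inside the cell lowers the detection probability relative to the peak of the ambiguity function, which is why the shape of the ambiguity function matters for waveform selection.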
III. TEMPORAL DIFFERENCE LEARNING WAVEFORM SELECTION
Define $\pi = \{u_0, u_1, \ldots, u_T\}$, where $T + 1$ is the maximum number of dwells that can be used to detect and confirm targets for a given beam. Then $\pi$ is a sequence of waveforms that could be used for that decision process. We can obtain different $\pi$ according to different environments in cognitive radar. Let

$$V_t(X_t) = E\left[ \sum_{t=0}^{T} \gamma^t R_t(X_t, u_t) \right], \tag{15}$$
where $R_t(X_t, u_t)$ is the reward earned when the scene $X_t$ is observed using waveform $u_t$ and $\gamma$ is the discount factor. Then the aim of our problem is to find the sequence that satisfies

$$V_t^*(X_t) = \max_{\pi} E\left[ \sum_{t=0}^{T} \gamma^t R_t(X_t, u_t) \right]. \tag{16}$$
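The objective in (16) can be illustrated with a deliberately simplified sketch in which the rewards are deterministic, so the expectation disappears, and the best waveform sequence is found by brute force. The waveform labels and the toy reward function are hypothetical and serve only to show how the discounted sum is formed.

```python
from itertools import product


def discounted_return(rewards, gamma):
    """Sum of gamma**t * R_t over the dwells t = 0, ..., T."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def best_waveform_sequence(waveforms, n_dwells, reward_fn, gamma):
    """Exhaustively search all waveform sequences of length n_dwells."""
    best_seq, best_value = None, float("-inf")
    for seq in product(waveforms, repeat=n_dwells):
        rewards = [reward_fn(t, u) for t, u in enumerate(seq)]
        value = discounted_return(rewards, gamma)
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq, best_value


def toy_reward(t, u):
    """Hypothetical reward: "cw" pays off only on even dwells, "lfm" always pays 0.5."""
    if u == "cw":
        return 1.0 if t % 2 == 0 else 0.0
    return 0.5


seq, value = best_waveform_sequence(["cw", "lfm"], 3, toy_reward, 0.9)
```

The exhaustive search visits every sequence, so its cost grows exponentially with the number of dwells; this is one motivation for the recursive formulation developed in the rest of the section.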
However, knowledge of the actual state is not available. Using the method of [17], we can obtain that the optimal control policy that is the solution of (16) is also the solution of

$$V_t^*(\mathbf{p}_0) = \max_{\pi} E\left[ \sum_{t=0}^{T} \gamma^t R_t(\mathbf{p}_t, u_t) \right], \tag{17}$$

where $\mathbf{p}_t$ is the conditional density of the state given the measurements and the controls, and $\mathbf{p}_0$ is the a priori probability density of the scene. $\mathbf{p}_t$ is a sufficient statistic for the true state $X_t$. So we need to solve the following problem:

$$\max_{\pi} E\left[ \sum_{t=0}^{T} \gamma^t R_t(\mathbf{p}_t, u_t) \right]. \tag{18}$$
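Because the scene is not observed directly, the decision maker carries a belief vector over scenes and updates it after each measurement. The sketch below shows one Bayes-filter step of this kind: predict with the transition probabilities, weight by the likelihood of the received measurement, and normalise. The index convention (`transition[i][j]` is the probability of moving from scene `i` to scene `j`) and the use of a likelihood evaluated on the predicted scene are assumptions made for this illustration.

```python
def belief_update(belief, transition, likelihood):
    """One Bayes-filter step for a belief vector over scenes.

    belief[i]        : current probability of scene i
    transition[i][j] : P(next scene = j | current scene = i)
    likelihood[j]    : probability of the received measurement if the scene is j
                       (it depends on the waveform that was used)
    """
    n = len(belief)
    # Prediction: propagate the belief through the Markov transition model.
    predicted = [sum(transition[i][j] * belief[i] for i in range(n))
                 for j in range(n)]
    # Correction: weight by the measurement likelihood, then normalise.
    weighted = [likelihood[j] * predicted[j] for j in range(n)]
    norm = sum(weighted)
    if norm == 0:
        raise ValueError("measurement has zero probability under the belief")
    return [w / norm for w in weighted]


# Hypothetical two-scene example: static scene, informative measurement.
prior = [0.5, 0.5]
static = [[1.0, 0.0], [0.0, 1.0]]
posterior = belief_update(prior, static, likelihood=[0.9, 0.1])
```

The normalisation step plays the role of the denominator in the update formula for the belief, ensuring that the updated vector is again a probability distribution.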
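The update formula for $\mathbf{p}_t$ is given by

$$\mathbf{p}_{t+1} = \frac{\mathbf{B} \mathbf{A} \mathbf{p}_t}{\mathbf{1}' \mathbf{B} \mathbf{A} \mathbf{p}_t}, \tag{19}$$

where $\mathbf{B}$ is the diagonal matrix with the vector $(b_{xx'}(u_t))$ as its nonzero elements, $\mathbf{1}$ is a column vector of ones, and $\mathbf{A}$ is the state transition matrix.

We write the expected profits using policy $\pi$ from $t$ onward as

$$G_t^{\pi}(\mathbf{p}_t) = E\left\{ \sum_{t'=t}^{T-1} R_{t'}(\mathbf{p}_{t'}, U_{t'}^{\pi}(\mathbf{p}_{t'})) + R_T(\mathbf{p}_T) \,\Big|\, \mathbf{p}_t \right\}, \tag{20}$$

where $u_t = U_t^{\pi}(\mathbf{p}_t)$ under policy $\pi$. $G_t^{\pi}(\mathbf{p}_t)$ is the expected total contribution if we are in state $\mathbf{p}_t$ at time $t$ and follow policy $\pi$ from time $t$ onward. However, it seems much more natural to calculate $V_t^{\pi}$ recursively using

$$V_t^{\pi}(\mathbf{p}_t) = R_t(\mathbf{p}_t, U_t^{\pi}(\mathbf{p}_t)) + E\{ V_{t+1}^{\pi}(\mathbf{p}_{t+1}) \mid \mathbf{p}_t \}. \tag{21}$$

Now we will establish the relationship between the original optimization problem and the optimality equations. Clearly, $G_T^{\pi}(\mathbf{p}_T) = V_T^{\pi}(\mathbf{p}_T) = R_T(\mathbf{p}_T)$. Next, assume that $G_{t'}^{\pi} = V_{t'}^{\pi}$ holds for $t' = t+1, t+2, \ldots, T$. We want to show that it is true for $t$. This means that we can write

$$V_t^{\pi}(\mathbf{p}_t) = R_t(\mathbf{p}_t, U_t^{\pi}(\mathbf{p}_t)) + E\left\{ E\left[ \sum_{t'=t+1}^{T-1} R_{t'}(\mathbf{p}_{t'}, U_{t'}^{\pi}(\mathbf{p}_{t'})) + R_T(\mathbf{p}_T) \,\Big|\, \mathbf{p}_{t+1} \right] \Big|\, \mathbf{p}_t \right\}. \tag{22}$$

Because the random variables are discrete and finite, we can obtain that

$$\begin{aligned}
E\{ E[ G_{t+1}^{\pi} \mid \mathbf{p}_{t+1} ] \mid \mathbf{p}_t \}
&= \sum_{\mathbf{p}_{t+1}} \sum_{g \in G} g\, P(G_{t+1}^{\pi} = g \mid \mathbf{p}_{t+1})\, P(\mathbf{p}_{t+1} \mid \mathbf{p}_t) \\
&= \sum_{\mathbf{p}_{t+1}} \sum_{g \in G} g\, P(G_{t+1}^{\pi} = g \mid \mathbf{p}_{t+1}, \mathbf{p}_t)\, P(\mathbf{p}_{t+1} \mid \mathbf{p}_t) \\
&= \sum_{g \in G} g \sum_{\mathbf{p}_{t+1}} P(G_{t+1}^{\pi} = g, \mathbf{p}_{t+1} \mid \mathbf{p}_t) \\
&= \sum_{g \in G} g\, P(G_{t+1}^{\pi} = g \mid \mathbf{p}_t) \\
&= E[ G_{t+1}^{\pi} \mid \mathbf{p}_t ].
\end{aligned} \tag{23}$$

Hence

$$\begin{aligned}
V_t^{\pi}(\mathbf{p}_t)
&= R_t(\mathbf{p}_t, U_t^{\pi}(\mathbf{p}_t)) + E\left\{ E\left[ \sum_{t'=t+1}^{T-1} R_{t'}(\mathbf{p}_{t'}, U_{t'}^{\pi}(\mathbf{p}_{t'})) + R_T(\mathbf{p}_T) \,\Big|\, \mathbf{p}_{t+1} \right] \Big|\, \mathbf{p}_t \right\} \\
&= R_t(\mathbf{p}_t, U_t^{\pi}(\mathbf{p}_t)) + E\left[ \sum_{t'=t+1}^{T-1} R_{t'}(\mathbf{p}_{t'}, U_{t'}^{\pi}(\mathbf{p}_{t'})) + R_T(\mathbf{p}_T) \,\Big|\, \mathbf{p}_t \right] \\
&= E\left[ R_t(\mathbf{p}_t, U_t^{\pi}(\mathbf{p}_t)) + \sum_{t'=t+1}^{T-1} R_{t'}(\mathbf{p}_{t'}, U_{t'}^{\pi}(\mathbf{p}_{t'})) + R_T(\mathbf{p}_T) \,\Big|\, \mathbf{p}_t \right] \\
&= E\left[ \sum_{t'=t}^{T-1} R_{t'}(\mathbf{p}_{t'}, U_{t'}^{\pi}(\mathbf{p}_{t'})) + R_T(\mathbf{p}_T) \,\Big|\, \mathbf{p}_t \right] \\
&= G_t^{\pi}(\mathbf{p}_t).
\end{aligned} \tag{24}$$

Using equation (21), we have a backward recursion for calculating $V_t^{\pi}(\mathbf{p}_t)$ for a given policy $\pi$. Now that we can find the expected reward for a given $\pi$, we would like to find the best $\pi$. That is, we want to find

$$G_t^*(\mathbf{p}_t) = \max_{\pi \in \Pi} G_t^{\pi}(\mathbf{p}_t). \tag{25}$$

If the set $\Pi$ is infinite, we replace the “max” with “sup”. We solve this problem by solving the optimality equations. These are

$$V_t(\mathbf{p}_t) = \max_{u \in U} \left( R_t(\mathbf{p}_t, u) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, u)\, V_{t+1}(\mathbf{p}') \right). \tag{26}$$

First, we show by induction that $V_t(\mathbf{p}_t) \ge G_t^*(\mathbf{p}_t)$ for all $\mathbf{p}_t \in \mathcal{P}$ and $t = 0, 1, \ldots, T-1$. Then, we show that the reverse inequality is true, which gives us the result.

We resort again to our proof by induction. Since $V_T(\mathbf{p}_T) = R_T(\mathbf{p}_T) = G_T^{\pi}(\mathbf{p}_T)$ for all $\mathbf{p}_T$ and all $\pi \in \Pi$, we get that $V_T(\mathbf{p}_T) = G_T^*(\mathbf{p}_T)$. Assume that $V_{t'}(\mathbf{p}_{t'}) \ge G_{t'}^*(\mathbf{p}_{t'})$ for $t' = t+1, t+2, \ldots, T$, and let $\pi$ be an arbitrary policy. For $t' = t$, the optimality equation tells us

$$V_t(\mathbf{p}_t) = \max_{u \in U} \left( R_t(\mathbf{p}_t, u) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, u)\, V_{t+1}(\mathbf{p}') \right). \tag{27}$$

With the induction hypothesis, $V_{t+1}(\mathbf{p}) \ge G_{t+1}^*(\mathbf{p})$, so we get

$$V_t(\mathbf{p}_t) \ge \max_{u \in U} \left( R_t(\mathbf{p}_t, u) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, u)\, G_{t+1}^*(\mathbf{p}') \right). \tag{28}$$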
We have that $G_{t+1}^*(\mathbf{p}) \ge G_{t+1}^{\pi}(\mathbf{p})$ for an arbitrary $\pi$. Also let $U^{\pi}(\mathbf{p})$ be the decision that would be chosen by policy $\pi$ when in state $\mathbf{p}$. Then

$$\begin{aligned}
V_t(\mathbf{p}_t)
&\ge \max_{u \in U} \left( R_t(\mathbf{p}_t, u) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, u)\, G_{t+1}^{\pi}(\mathbf{p}') \right) \\
&\ge R_t(\mathbf{p}_t, U^{\pi}(\mathbf{p}_t)) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, U^{\pi}(\mathbf{p}_t))\, G_{t+1}^{\pi}(\mathbf{p}') \\
&= G_t^{\pi}(\mathbf{p}_t).
\end{aligned} \tag{29}$$

This means that $V_t(\mathbf{p}_t) \ge G_t^{\pi}(\mathbf{p}_t)$ for all $\pi \in \Pi$.

Next we are going to prove the inequality from the other side. Specifically, we want to show that for any $\varepsilon > 0$ there exists a policy $\pi$ that satisfies

$$G_t^{\pi}(\mathbf{p}_t) + (T - t)\varepsilon \ge V_t(\mathbf{p}_t). \tag{30}$$

To do this, we start with the definition

$$V_t(\mathbf{p}_t) = \max_{u \in U} \left( R_t(\mathbf{p}_t, u) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, u)\, V_{t+1}(\mathbf{p}') \right). \tag{31}$$

We may let $u_t(\mathbf{p}_t)$ be the decision rule that solves (31). This rule corresponds to the policy $\pi$. In general, the set $U$ may be infinite, whereupon we have to replace the “max” with a “sup” and handle the case where an optimal decision may not exist. For this case, we know that we can design a decision rule $u_t(\mathbf{p}_t)$ that returns a decision $u$ that satisfies

$$V_t(\mathbf{p}_t) \le R_t(\mathbf{p}_t, u) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, u)\, V_{t+1}(\mathbf{p}') + \varepsilon. \tag{32}$$

We can prove (30) by induction. We first note that (30) is true for $t = T$ since $G_T^{\pi}(\mathbf{p}_T) = V_T(\mathbf{p}_T)$. Now assume that it is true for $t+1, t+2, \ldots, T$. We already know that

$$G_t^{\pi}(\mathbf{p}_t) = R_t(\mathbf{p}_t, U^{\pi}(\mathbf{p}_t)) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, U^{\pi}(\mathbf{p}_t))\, G_{t+1}^{\pi}(\mathbf{p}'). \tag{33}$$

We can use our induction hypothesis, which says $G_{t+1}^{\pi}(\mathbf{p}') \ge V_{t+1}(\mathbf{p}') - (T - (t+1))\varepsilon$, to get

$$\begin{aligned}
G_t^{\pi}(\mathbf{p}_t)
&\ge R_t(\mathbf{p}_t, U^{\pi}(\mathbf{p}_t)) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, U^{\pi}(\mathbf{p}_t)) \left[ V_{t+1}(\mathbf{p}') - (T - (t+1))\varepsilon \right] \\
&= R_t(\mathbf{p}_t, U^{\pi}(\mathbf{p}_t)) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, U^{\pi}(\mathbf{p}_t))\, V_{t+1}(\mathbf{p}') - \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, U^{\pi}(\mathbf{p}_t)) \left[ (T - t - 1)\varepsilon \right] \\
&= \left\{ R_t(\mathbf{p}_t, U^{\pi}(\mathbf{p}_t)) + \sum_{\mathbf{p}' \in \mathcal{P}} P(\mathbf{p}' \mid \mathbf{p}_t, U^{\pi}(\mathbf{p}_t))\, V_{t+1}(\mathbf{p}') + \varepsilon \right\} - (T - t)\varepsilon.
\end{aligned} \tag{34}$$

Now, using equation (32), we replace the term in brackets with the smaller $V_t(\mathbf{p}_t)$:

$$G_t^{\pi}(\mathbf{p}_t) \ge V_t(\mathbf{p}_t) - (T - t)\varepsilon, \tag{35}$$

which proves the induction hypothesis. We have shown that

$$G_t^*(\mathbf{p}_t) + (T - t)\varepsilon \ge G_t^{\pi}(\mathbf{p}_t) + (T - t)\varepsilon \ge V_t(\mathbf{p}_t) \ge G_t^*(\mathbf{p}_t). \tag{36}$$

This proves the result. So we can use $V_t(\mathbf{p}_t)$ to solve our problem [18].

This is the stochastic dynamic model of adaptive waveform selection. We can then use dynamic programming algorithms to solve the problem. The backward dynamic programming algorithm is a basic dynamic programming method. It can be viewed as an optimal adaptive algorithm for waveform selection. When the state space and action space are large, however, this method is hard to use. That means we can hardly find the optimal solution for waveform selection, so approximate solutions are necessary.

Generally speaking, the reward function can take different forms according to different problems. It represents the value of being in a certain state and taking a certain action. In the problem of adaptive waveform selection, two forms of reward function are usually used: the linear reward function and the entropy reward function.

The linear reward function is usually used when $R(\mathbf{p}, u)$ is required to be a piecewise linear function. Its form is simple and easy to calculate. However, it sometimes cannot reflect the whole value. The form of the linear reward function is

$$R_1(\mathbf{p}, u) = \mathbf{p}'\mathbf{p} - 1. \tag{37}$$

The entropy reward function is usually used when $R(\mathbf{p}, u)$ is not required to be a piecewise linear function. It comes from information theory. It can reflect the whole value accurately, but it is more complex than the linear reward function. The form of the entropy reward function is

$$R_2(\mathbf{p}, u) = \sum_{x} p_k(x) \log(p_k(x)). \tag{38}$$

We can choose different forms of reward function according to different problems.

The foundation of approximate dynamic programming is an algorithmic strategy that steps forward through time. If we wanted to solve this problem using classical dynamic programming, we would have to find the value function $V_t(\mathbf{p}_t)$ using

$$V_t(\mathbf{p}_t) = \max_{u_t \in U} \left( C_t(\mathbf{p}_t, u_t) + \gamma E\{ V_{t+1}(\mathbf{p}_{t+1}) \mid \mathbf{p}_t \} \right). \tag{39}$$

Assume $v$ is an unbiased sample estimate of the value of being in state $\mathbf{p}_t$ and the policy is $\pi$. The definition of $v$ is

$$v_t^n = C_t(\mathbf{p}_t, u_t^n) + C_{t+1}(\mathbf{p}_{t+1}, u_{t+1}^n) + \cdots + C_T(\mathbf{p}_T, u_T^n). \tag{40}$$

The standard stochastic gradient algorithm is used to estimate the value of being in state $\mathbf{p}_t$:

$$V_t^n(\mathbf{p}_t) = V_t^{n-1}(\mathbf{p}_t) - \alpha_n \left[ V_t^{n-1}(\mathbf{p}_t) - v_t^n \right]. \tag{41}$$

The temporal difference is

$$D_t = C_t(\mathbf{p}_t, u_t) + V_{t+1}^{n-1}(\mathbf{p}_{t+1}) - V_t^{n-1}(\mathbf{p}_t). \tag{42}$$
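The temporal difference in (42) can be turned into a simple tabular update, sketched below under several simplifying assumptions: states are hashable labels rather than belief vectors, the value table is indexed by (time, state), there is no discounting, and the value after the final dwell is zero. The function name and the three-step trajectory are hypothetical.

```python
def td0_pass(values, trajectory, alpha):
    """One forward pass of a TD(0)-style update over a finite-horizon trajectory.

    values     : dict mapping (t, state) to the current value estimate
    trajectory : list of (state, contribution) pairs for t = 0, ..., T
    alpha      : step size in (0, 1]
    Returns the list of temporal differences D_t computed during the pass.
    """
    differences = []
    horizon = len(trajectory)
    for t, (state, contribution) in enumerate(trajectory):
        if t + 1 < horizon:
            next_state = trajectory[t + 1][0]
            # Entry for time t + 1 has not been touched yet in this pass,
            # so it is still the estimate from the previous iteration.
            next_value = values.get((t + 1, next_state), 0.0)
        else:
            next_value = 0.0  # nothing is earned after the final dwell
        current = values.get((t, state), 0.0)
        d_t = contribution + next_value - current
        values[(t, state)] = current + alpha * d_t
        differences.append(d_t)
    return differences


# Hypothetical deterministic trajectory with contributions 1, 2, 3.
trajectory = [("a", 1.0), ("b", 2.0), ("c", 3.0)]
values = {}
for _ in range(3):
    td0_pass(values, trajectory, alpha=1.0)
final_differences = td0_pass(values, trajectory, alpha=1.0)
```

With a step size of one and a deterministic trajectory, the estimates propagate backward by one time step per pass, so after three passes the value at the first dwell equals the total contribution and all temporal differences vanish. A smaller step size would be used when contributions are noisy.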