Temporal Difference Learning Waveform
Selection
Bin Wang
Northeastern University, Shenyang, China
wangbin_neu@yahoo.com.cn
Jinkuan Wang, Xin Song and Yinghua Han
Northeastern University at Qinhuangdao, Qinhuangdao, China
sxin78916@mail.neuq.edu.cn
Abstract—How to optimally decide or select the radar
waveform for next transmission based on the observation of
past radar returns is one of the important problems in
cognitive radar. In this paper, a stochastic dynamic
programming model is proposed, and a temporal
difference learning method is then used to realize the
adaptivity of waveform selection. The simulation results
show that the uncertainty of state estimation using
temporal difference learning is less than that using a
fixed waveform. The temporal difference learning method
approaches the optimal waveform selection scheme but
has a lower computational cost.
Index Terms—waveform selection, stochastic dynamic
programming, temporal difference learning, cognitive radar
I. INTRODUCTION
Radar is the name of an electronic system used for the
detection and location of objects. Early radars all used
radio waves, but some modern radars are based on
optical waves and the use of lasers. With the
development of modern technology, present-day systems
are very sophisticated and advanced. We should consider
optimizing the design of the transmitter, not only the
receiver. That means there should be a feedback loop
from the receiver to the transmitter. This is the core
idea of cognitive radar.
Cognitive radar is a new framework of radar system
proposed by Simon Haykin in 2006. It can perceive the
external environment in real time and select the optimal
waveform, matching the transmitted waveform to the
target environment and the information demands of the
radar task, so that the multiple functions of searching,
tracking, guidance and identification of friend or foe for
multiple targets can be realized. It builds on three
basic ingredients: intelligent signal processing; feedback
from the receiver to the transmitter; and preservation of
the information content of radar returns [1]. More and
more researchers are now working in this field.
The obvious difference between cognitive radar and
traditional radar is that cognitive radar can select
appropriate waveforms according to different radar
environment. So how to realize the adaptivity of the
transmitter is an important problem in cognitive radar.
The design of an adaptive transmitter involves an adaptive
model and an adaptive algorithm. Goodman et al. proposed
and simulated a closed-loop active sensor that updates the
probabilities on an ensemble of target hypotheses while
adapting customized waveforms in response to prior
measurements, and compared the performance of two
different waveform design techniques [2]. In [3], the
author focuses on a cognitive tracking radar whose
implementation comprises two distinct functional blocks,
one in the receiver and the other in the transmitter, with
a feedback link from the receiver to the transmitter. In [4],
Arasaratnam et al. derived the best approximation to the
Bayesian filter in the sense of completely preserving
second-order information, called the cubature Kalman
filter. In [5], the Infomax principle, aimed at maximizing
the mutual information, is used for designing the
transmitted signal waveform. In [6], an extension of the
PDA tracking algorithm to include adaptive waveform
selection was developed. In [7], it is
shown that tracking errors are highly dependent on the
waveforms used and in many situations tracking
performance using a good heterogeneous waveform is
improved by an order of magnitude when compared with
a scheme using a homogeneous pulse with the same
energy. The problem of waveform selection can be
thought of as a sensor scheduling problem, as each
possible waveform provides a different means of
measuring the environment, and related works have been
examined in [8], [9]. In [10], an adaptive waveform
selective probabilistic data association algorithm for
tracking a single target in clutter is presented. In [11],
radar waveform selection algorithms for tracking
accelerating targets are considered. In [12], genetic
algorithm is used to perform waveform selection utilizing
the autocorrelation and ambiguity functions in the fitness
evaluation. In [13], Incremental Pruning method is used
to solve the problem of adaptive waveform selection for
target detection. The problem of optimal adaptive
waveform selection for target tracking is also presented in
JOURNAL OF COMPUTERS, VOL. 5, NO. 9, SEPTEMBER 2010
© 2010 ACADEMY PUBLISHER
doi:10.4304/jcp.5.9.13941401
[14]. In [15], the author uses ADP method to solve the
problem of adaptive waveform selection.
In this paper, under the assumption of range-Doppler
resolution cells, a stochastic dynamic programming model
for the adaptive transmitter is proposed. We use the
temporal difference learning method to realize the
adaptivity of waveform selection. The simulation results
show the validity of our proposed algorithm.
II. RANGE-DOPPLER RESOLUTION CELL
The design of adaptive transmitter in cognitive radar
involves adaptive model and adaptive algorithm. We first
consider setting up adaptive model. Generally speaking,
for a target, the most important parameters that a radar
measures are range, Doppler frequency, and two
orthogonal space angles. If we envision a radar resolution
cell that contains a certain four-dimensional hypervolume,
we may assume different targets fall in different
resolution cells. That means if one measured target falls
in one resolution cell, another target falls in another
resolution cell and does not interfere with measurements
on the first. So as long as each target occupies a
resolution cell and the cells are all disjoint, the radar can
make measurements on each target free of interference
from the others. For a single radar pulse, we may give a
general sort of definition by considering the resolution
cell to be bounded in range by the compressed pulse’s
duration, in Doppler by the reciprocal of the transmitted
pulse’s duration, and in the two angles by the antenna
pattern’s two orthogonalplane beamwidths.
For example, if a radar seeks to make measurements
on targets resolved in Doppler frequency at the same time,
it can provide a bank of matched filters operating in
parallel. Each target will excite the filter which is
matched to its Doppler frequency, and its response can be
used for measurements. Targets resolved in the range
coordinate can be separated with range gates followed by
measurements. Thus a radar can perform simultaneous
measurements on targets unresolved in angle, provided
the targets are resolved in range, or Doppler frequency, or
both. However, it is difficult to simultaneously measure
targets in angle coordinates. Such measurements require
either a bank of main beams or the time-sharing of one
main beam among the various targets.
Through the preceding discussion, we can conclude
that angle resolution can be considered independently
from range and Doppler resolution in most circumstances.
Given this, the resolution properties of the radar in angle
are independent of the resolution properties in range and
Doppler frequency [16].
We define the range-Doppler resolution cell for the
waveform selection model. ΔR denotes the range
resolution, a radar metric that describes its ability to
detect targets in close proximity to each other as distinct
objects. Radar systems are normally designed to operate
between a minimum range R_min and a maximum range
R_max. The distance between R_min and R_max is
divided into N range bins, each of width ΔR,

ΔR = (R_max - R_min) / N.   (1)

Targets separated by at least ΔR will be completely
resolved in range.

Radars use Doppler frequency to extract target radial
velocity (range rate), as well as to distinguish moving and
stationary targets or objects such as clutter. The Doppler
phenomenon describes the shift in the center frequency of
an incident waveform.
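As a small illustration of equation (1), the sketch below divides an operating range into bins and maps a target range to a bin index. The specific values of R_min, R_max and N are illustrative choices, not taken from the paper.

```python
# Sketch of equation (1): dividing the radar's operating range into N bins.
# All numeric values here are illustrative.

def range_resolution(r_min: float, r_max: float, n_bins: int) -> float:
    """Width of one range bin, Delta_R = (R_max - R_min) / N."""
    return (r_max - r_min) / n_bins

def bin_index(r: float, r_min: float, delta_r: float) -> int:
    """Index of the range bin that a target at range r falls into."""
    return int((r - r_min) // delta_r)

delta_r = range_resolution(r_min=1e3, r_max=41e3, n_bins=400)  # 100 m bins
print(delta_r)                             # 100.0
print(bin_index(12_345.0, 1e3, delta_r))   # 113
```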
Figure 1. Radar and target.
Figure 1 depicts the radar and target. In heavy
clutter environments, it is an acute problem that a single
waveform cannot simultaneously provide both good
Doppler and good range resolution. So we need to
consider the problem of adaptive waveform selection and
make a tradeoff decision between the two. The basic
scheme for adaptive waveform selection is to define a
cost function that describes the cost of observing a target
in a particular location for each individual pulse and to
select the waveform that optimizes this function on a
pulse-by-pulse basis.
We make no assumptions about the number of targets
that may be present. We divide the area covered by a
particular radar beam into a grid in range-Doppler space,
with the cells in range indexed by 1, ..., N and those
in Doppler indexed by 1, ..., M. There may be 0
targets, 1 target, or up to NM targets. So the number of
possible scenes, or hypotheses about the radar scene, is
2^{NM}. Let the space of hypotheses be denoted by X. The
state of our model is X_t, where X_t = x, x ∈ X. Let Y_{t+1} be
the measurement variable, and let u_t be the control variable
that indicates which waveform is chosen at time t to
generate measurement Y_{t+1}, where u_t ∈ U. The
probability of receiving a particular measurement
will depend on both the true, underlying scene
and on the choice of waveform used to generate the
measurement.
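The 2^{NM} count above is just the number of binary occupancy patterns over the N x M grid. A minimal sketch, with deliberately tiny illustrative N and M:

```python
from itertools import product

# Illustrative sketch: the radar scene as an occupancy pattern over an
# N x M range-Doppler grid, giving 2**(N*M) hypotheses.
N, M = 2, 2  # range bins, Doppler bins (small values chosen for illustration)

# Each hypothesis is a tuple of 0/1 flags, one per cell (row-major over the grid).
hypotheses = list(product((0, 1), repeat=N * M))
print(len(hypotheses))   # 16 == 2**(N*M)
print(hypotheses[0])     # (0, 0, 0, 0)  -> empty scene
print(hypotheses[-1])    # (1, 1, 1, 1)  -> a target in every cell
```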
We assume that the evolution of the state is governed
by a Markov process and define a_{x'x} as the state
transition probability, where

a_{x'x} = P(X_{t+1} = x' | X_t = x),   (2)

and A = (a_{x'x}) is the state transition matrix. We define
b_{x'x} as the measurement probability, where

b_{x'x}(u) = P(Y_{t+1} = x' | X_t = x, u_t = u),   (3)

and B(u) = (b_{x'x}(u)) is the measurement probability
matrix. That means that if the state of our model is
X_t = x and we use waveform u_t, the probability of
receiving the measurement Y_{t+1} = x' is b_{x'x}(u_t).
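A minimal sketch of the model in (2)-(3), with a two-state scene and a single waveform; the matrices below are illustrative numbers, not the paper's. Columns are indexed by the current state, so each column must sum to one:

```python
import numpy as np

# A[x_next, x] = P(X_{t+1} = x_next | X_t = x): columns sum to one.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# B[u][y, x] = P(Y_{t+1} = y | X_t = x, u_t = u), one matrix per waveform u.
B = {0: np.array([[0.95, 0.30],
                  [0.05, 0.70]])}

assert np.allclose(A.sum(axis=0), 1.0)
assert np.allclose(B[0].sum(axis=0), 1.0)

# Probability of measuring y=1 when the true state is x=0 and waveform 0 is used:
print(B[0][1, 0])   # 0.05
```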
Figure 2. Resolution cell and corresponding parallelogram.
To handle the case in which a combined waveform is
used, we define as a practical resolution cell the
parallelogram that contains the resolution cell primitive.
Figure 2 shows a resolution cell and its corresponding
parallelogram.
Let us consider calculating these probabilities. A
matched filter is adopted in the receiver. Assume the
transmitted baseband signal is s(t) and the received
baseband signal is r(t). The matched filter is the one
with impulse response h(t) = s*(-t), so the output of
our matched filter is

x(t) = ∫ r(τ) s*(τ - t) dτ.   (4)

In the radar case, the return signal is expected to be
Doppler shifted. The matched filter for a return signal
with an expected frequency shift ν0 has the impulse
response

h(t) = s*(-t) e^{-j2πν0 t}.   (5)

The output is given by

x(t) = ∫ r(τ) s*(τ - t) e^{-j2πν0(τ - t)} dτ,   (6)

where ν0 is the expected frequency shift. The baseband
received signal will be modeled as a return from a
Swerling target.

At time t, the magnitude square of the output of a filter
matched to a zero delay and a zero Doppler shift is

|x0(t)|^2 = |∫ r(τ) s*(τ - t) dτ|^2.   (7)

In the following we consider two situations: no target
present, and a target present.

When there is no target,

x0(0) = ∫ n(τ) s*(τ) dτ.   (8)

The random variable x0(0) is complex Gaussian, with
zero mean and variance given by

E|x0(0)|^2 = 2 ε N0,   (9)

where ε is the energy of the transmitted pulse.

When a target is present,

x0(0) = A0 ∫ s(τ - τd) e^{j2πνd τ} s*(τ) dτ + ∫ n(τ) s*(τ) dτ.   (10)

This random variable is still zero mean, with variance
given by

E|x0(0)|^2 = 2 ε N0 (1 + SNR |A(τd, νd)|^2),   (11)

where A(τ, ν) is the ambiguity function, given by

A(τ, ν) = (1/ε) ∫ s(t) s*(t - τ) e^{j2πνt} dt.   (12)

Recall that the magnitude square of a complex Gaussian
random variable x ~ CN(0, 2σ_i^2) is exponentially
distributed, with density given by

f(y) = (1/(2σ_i^2)) e^{-y/(2σ_i^2)}, y ≥ 0.   (13)

In the case when a target is present in cell (τ, ν),
assuming its actual location (τd, νd) within the cell has a
uniform distribution, the measurement probability is
obtained by averaging this exponential density over the
cell,

P_D = (1/|C|) ∫∫_C exp( -a / (2 ε N0 (1 + SNR |A(τ - τd, ν - νd)|^2)) ) dτd dνd,   (14)

where C is the resolution cell centred on (τ, ν), with
volume |C|, and a is the detection threshold.
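The ambiguity function of (12) can be evaluated numerically; the sketch below does so for a simple unit rectangular pulse (the pulse shape, sample rate and duration are illustrative assumptions, not the paper's waveforms). At the origin its magnitude is 1, and it falls off with either delay or Doppler mismatch:

```python
import numpy as np

fs = 1e6                 # sample rate, Hz (illustrative)
T = 100e-6               # pulse duration, s (illustrative)
t = np.arange(0, T, 1 / fs)
s = np.ones_like(t, dtype=complex)   # unit rectangular pulse
E = np.sum(np.abs(s) ** 2) / fs      # pulse energy

def ambiguity(tau: float, nu: float) -> complex:
    """Discretized A(tau, nu) = (1/E) * sum s(t) s*(t - tau) e^{j 2 pi nu t} dt."""
    shift = int(round(tau * fs))
    s_shift = np.roll(s, shift)      # s(t - tau), with the wrapped part zeroed
    if shift > 0:
        s_shift[:shift] = 0
    elif shift < 0:
        s_shift[shift:] = 0
    integrand = s * np.conj(s_shift) * np.exp(1j * 2 * np.pi * nu * t)
    return np.sum(integrand) / fs / E

print(abs(ambiguity(0.0, 0.0)))          # 1.0
print(abs(ambiguity(T / 2, 0.0)) < 1.0)  # True: delay mismatch reduces the output
print(abs(ambiguity(0.0, 1 / T)) < 0.1)  # True: first Doppler null of a rect pulse
```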
III. TEMPORAL DIFFERENCE LEARNING WAVEFORM
SELECTION
Define T as the maximum number of dwells that can be
used to detect and confirm targets for a given beam. Then
{u_0, u_1, ..., u_{T-1}} is a sequence of waveforms that
could be used for that decision process. We can obtain a
different T according to the different environments in
cognitive radar. Let

V(X_0) = E[ Σ_{t=0}^{T} γ^t R_t(X_t, u_t) ],   (15)

where R_t(X_t, u_t) is the reward earned when the scene
X_t is observed using waveform u_t, and γ is the discount
factor. Then the aim of our problem is to find the
sequence that satisfies

V(X_0) = max E[ Σ_{t=0}^{T} γ^t R_t(X_t, u_t) ].   (16)

However, knowledge of the actual state is not available.
Using the method of [17], we can show that the optimal
control policy that solves (16) is also the
solution of

V(p_0) = max E[ Σ_{t=0}^{T} γ^t R_t(p_t, u_t) ],   (17)

where p_t is the conditional density of the state given the
measurements and the controls, and p_0 is the a priori
probability density of the scene; p_t is a sufficient statistic
for the true state X_t. So we need to solve the following
problem:

max E[ Σ_{t=0}^{T} γ^t R_t(p_t, u_t) ].   (18)
The refreshment formula of p_t is given by

p_{t+1} = B(u_t) A p_t / (1' B(u_t) A p_t),   (19)

where B(u_t) is the diagonal matrix with the vector
(b_{x'x}(u_t)) as its nonzero elements, A is the state
transition matrix, and 1 is a column vector of ones.

We write the expected profit using policy π from t
onward as

G_t^π(p_t) = E{ Σ_{τ=t}^{T} γ^{τ-t} R_τ(p_τ, U_τ^π(p_τ)) },   (20)

where U_τ^π(p_τ) is the decision made under policy π.
G_t^π(p_t) is the expected total contribution if we are in
state p_t at time t and follow policy π from time t
onward. However, it is much more natural to calculate
V_t recursively using

V_t(p_t) = R_t(p_t, u_t) + E{ V_{t+1}(p_{t+1}) | p_t }.   (21)

Now we will establish the relationship between the
original optimization problem and the optimality
equations.

First, we show by induction that V_t(p_t) = G_t^π(p_t).
Clearly, G_T^π(p_T) = V_T(p_T) = R_T(p_T). Next,
assume that the equality holds for t+1, t+2, ..., T. We
want to show that it is true for t. This means that we can
write

V_t(p_t) = R_t(p_t, u_t) + E{ E[ Σ_{τ=t+1}^{T} γ^{τ-t-1} R_τ(p_τ, u_τ) | p_{t+1} ] | p_t }.   (22)

Because the random variables are discrete and finite,
we can obtain

E[ G_{t+1}^π(p_{t+1}) | p_t ] = Σ_g g P(G_{t+1}^π = g | p_t).   (23)

Hence

V_t(p_t) = R_t(p_t, u_t) + E{ E[ G_{t+1}^π(p_{t+1}) | p_{t+1} ] | p_t }
        = R_t(p_t, u_t) + E[ G_{t+1}^π(p_{t+1}) | p_t ]
        = G_t^π(p_t).   (24)

Using equation (21), we thus have a backward recursion
for calculating V_t(p_t) for a given policy π. Now that
we can find the expected reward for a given π, we
would like to find the best policy. That is, we want to find

G_t*(p_t) = max_π G_t^π(p_t).   (25)

If the set of policies is infinite, we replace the "max" with
"sup". We solve this problem by solving the optimality
equations. These are

V_t(p_t) = max_{u ∈ U} ( R_t(p_t, u) + Σ_{p'} P(p' | p_t, u) V_{t+1}(p') ).   (26)

Then, we show that V_t(p_t) = G_t*(p_t) for
t = 0, 1, ..., T-1, which gives us the result.

We first prove V_t(p_t) ≥ G_t*(p_t) by induction. Since
V_T(p_T) = R_T(p_T) = G_T*(p_T) for all p_T, the
equality is true for t = T. Assume that it is true for
t+1, t+2, ..., T, and let π be an arbitrary policy. For t,
the optimality equation is

V_t(p_t) = max_{u ∈ U} ( R_t(p_t, u) + Σ_{p'} P(p' | p_t, u) V_{t+1}(p') ).   (27)

With the induction hypothesis V_{t+1}(p') = G_{t+1}*(p'),
we get

V_t(p_t) = max_{u ∈ U} ( R_t(p_t, u) + Σ_{p'} P(p' | p_t, u) G_{t+1}*(p') ).   (28)

We have that G_{t+1}*(p') ≥ G_{t+1}^π(p') for an
arbitrary policy π. Also, let U_t^π(p_t) be the decision
that would be chosen by policy π when in state p_t. Then

V_t(p_t) = max_{u ∈ U} ( R_t(p_t, u) + Σ_{p'} P(p' | p_t, u) G_{t+1}*(p') )
        ≥ R_t(p_t, U_t^π(p_t)) + Σ_{p'} P(p' | p_t, U_t^π(p_t)) G_{t+1}^π(p')
        = G_t^π(p_t).   (29)

This means V_t(p_t) ≥ G_t*(p_t).

Next we are going to prove the inequality from the
other side. Specifically, we want to show that for any
ε > 0 there exists a policy π that satisfies

V_t(p_t) ≤ G_t^π(p_t) + (T - t)ε.   (30)

To do this, we start with the definition

V_t(p_t) = max_{u ∈ U} ( R_t(p_t, u) + Σ_{p'} P(p' | p_t, u) V_{t+1}(p') ).   (31)

We may let u_t(p_t) be the decision rule that solves
(31); this rule corresponds to the policy π. In general,
the set U may be infinite, whereupon we have to replace
the "max" with a "sup" and handle the case where an
optimal decision may not exist. For this case, we know
that we can design a decision rule u that returns a
decision satisfying

V_t(p_t) ≤ R_t(p_t, u) + Σ_{p'} P(p' | p_t, u) V_{t+1}(p') + ε.   (32)

We can prove (30) by induction. We first note that (30)
is true for t = T, since V_T(p_T) = G_T^π(p_T). Now
assume that it is true for t+1, t+2, ..., T. We already
know that

G_t^π(p_t) = R_t(p_t, U_t^π(p_t)) + Σ_{p'} P(p' | p_t, U_t^π(p_t)) G_{t+1}^π(p').   (33)

We can use our induction hypothesis, which says
G_{t+1}^π(p') ≥ V_{t+1}(p') - (T - (t+1))ε, to get

G_t^π(p_t) = R_t(p_t, U_t^π(p_t)) + Σ_{p'} P(p' | p_t, U_t^π(p_t)) G_{t+1}^π(p')
          ≥ R_t(p_t, U_t^π(p_t)) + Σ_{p'} P(p' | p_t, U_t^π(p_t)) [ V_{t+1}(p') - (T - t - 1)ε ]
          = { R_t(p_t, U_t^π(p_t)) + Σ_{p'} P(p' | p_t, U_t^π(p_t)) V_{t+1}(p') } - (T - t - 1)ε.   (34)

Now, using equation (32), we replace the term in braces
with the smaller V_t(p_t) - ε:

G_t^π(p_t) ≥ V_t(p_t) - ε - (T - t - 1)ε = V_t(p_t) - (T - t)ε,   (35)

which proves the induction hypothesis. We have shown
that

G_t*(p_t) + (T - t)ε ≥ G_t^π(p_t) + (T - t)ε ≥ V_t(p_t) ≥ G_t*(p_t).   (36)

Since ε is arbitrary, V_t(p_t) = G_t*(p_t). This proves the
result. So we can use the optimality equations (26) to
solve our problem [18].
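The belief refreshment of equation (19) is just a Bayes step: predict with A, reweight by the likelihoods of the received measurement, renormalize. A minimal sketch with illustrative numbers:

```python
import numpy as np

def update_belief(p: np.ndarray, A: np.ndarray, b_y: np.ndarray) -> np.ndarray:
    """One step of (19): diag(b_y) @ A @ p, renormalized to sum to one."""
    unnormalized = b_y * (A @ p)
    return unnormalized / unnormalized.sum()

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])             # columns sum to one (illustrative)
p0 = np.array([0.5, 0.5])              # a priori density of the scene
likelihoods = np.array([0.95, 0.05])   # P(observed y | x) for each state x

p1 = update_belief(p0, A, likelihoods)
print(np.isclose(p1.sum(), 1.0))   # True: still a valid probability vector
print(p1[0] > p1[1])               # True: the measurement favours state 0
```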
This is the stochastic dynamic model of adaptive
waveform selection, and we can use dynamic
programming algorithms to solve the problem. The
backward dynamic programming algorithm is a basic
dynamic programming method and can be viewed as an
optimal adaptive algorithm for waveform selection. When
the state space and action space are large, however, it is
hard to use this method; that means we can hardly find
the optimal solution for waveform selection, so
approximate solutions are necessary.

Generally speaking, the reward function can take
different forms according to different problems. It
represents the value obtained by being in a certain state
and taking a certain action. In the problem of adaptive
waveform selection, two forms of reward function are
usually used: the linear reward function and the entropy
reward function.
The linear reward function is usually used in
circumstances where R(p, u) is required to be a
piecewise linear function. The form of this function is
simple and easy to calculate; however, it sometimes
cannot reflect the whole value. The form of the linear
reward function is

R_1(p, u) = 1 - p'p.   (37)

The entropy reward function is usually used in
circumstances where R(p, u) is not required to be a
piecewise linear function. It comes from information
theory and can reflect the whole value accurately, but it
is more complex than the linear reward function. The
form of the entropy reward function is

R_2(p, u) = -Σ_x p(x) log p(x).   (38)

We can choose a different form of reward function
according to different problems.
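The two reward forms can be sketched as follows. The 1 - p'p form matches the uncertainty measure used later in the simulation; the entropy form is the Shannon entropy of the belief. Both functions are minimal illustrative implementations:

```python
import numpy as np

def linear_reward(p: np.ndarray) -> float:
    """Uncertainty-style reward, 1 - p'p: small when the belief is concentrated."""
    return 1.0 - float(p @ p)

def entropy_reward(p: np.ndarray) -> float:
    """Shannon entropy of the belief, -sum p(x) log p(x) (zero terms skipped)."""
    q = p[p > 0]
    return float(-(q * np.log(q)).sum())

certain = np.array([1.0, 0.0, 0.0, 0.0])
uniform = np.full(4, 0.25)
print(linear_reward(certain))                          # 0.0
print(entropy_reward(certain) == 0.0)                  # True
print(linear_reward(uniform) == 0.75)                  # True
print(np.isclose(entropy_reward(uniform), np.log(4)))  # True
```

Both rewards are zero for a fully certain belief and maximal for the uniform belief, which is exactly the behaviour the text describes.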
The foundation of approximate dynamic programming
is an algorithmic strategy that steps forward through
time. If we wanted to solve this problem using classical
dynamic programming, we would have to find the value
function using

V_t(p_t) = max_{u ∈ U} ( C_t(p_t, u) + E{ V_{t+1}(p_{t+1}) | p_t } ).   (39)

Assume v is an unbiased sample estimate of the value
of being in state p_t. Under policy π, the definition of v
is

v_t^n = C_t(p_t^n, u_t^n) + C_{t+1}(p_{t+1}^n, u_{t+1}^n) + ... + C_T(p_T^n, u_T^n).   (40)

A standard stochastic gradient algorithm is used to
estimate the value of being in state p_t:

V_t^n(p_t) = V_t^{n-1}(p_t) + α_{n-1} [ v_t^n - V_t^{n-1}(p_t) ].   (41)

The temporal differences are

D_τ = C_τ(p_τ, u_τ) + V_{τ+1}^{n-1}(p_{τ+1}) - V_τ^{n-1}(p_τ).   (42)
So

v_t^n = V_t^{n-1}(p_t) + Σ_{τ=t}^{T} D_τ.   (43)

Substituting (43) into (41), we obtain

V_t^n(p_t) = V_t^{n-1}(p_t) + α_{n-1} Σ_{τ=t}^{T} D_τ.   (44)

The temporal differences are the errors in our estimates
of the value of being in state p_t; these errors are
stochastic gradients for the problem of minimizing the
estimation error. The discounted form is

V_t^n(p_t) = V_t^{n-1}(p_t) + α_{n-1} Σ_{τ=t}^{T} γ^{τ-t} D_τ.   (45)

Through this formula, we can use this method to
update the value of V.
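The discounted update of (44)-(45) can be sketched on a single episode. The dictionary-keyed value table, stepsize and discount below are illustrative assumptions; belief states are represented by opaque keys purely for demonstration:

```python
def td_update(values: dict, states: list, rewards: list,
              alpha: float = 0.1, gamma: float = 0.9) -> None:
    """One backward pass over an episode, in the spirit of (45):
    V(s_t) += alpha * sum over tau >= t of gamma**(tau - t) * D_tau."""
    T = len(rewards)
    for t in range(T):
        correction = 0.0
        for tau in range(t, T):
            v_next = values.get(states[tau + 1], 0.0) if tau + 1 < len(states) else 0.0
            d_tau = rewards[tau] + gamma * v_next - values.get(states[tau], 0.0)
            correction += gamma ** (tau - t) * d_tau
        values[states[t]] = values.get(states[t], 0.0) + alpha * correction

values = {}
states = ["s0", "s1", "s2"]
rewards = [1.0, 0.5]          # rewards earned leaving s0 and s1 (illustrative)
td_update(values, states, rewards)
print(values["s0"] > 0.0)     # True: estimates move toward the observed returns
```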
Our algorithm is described as follows:

1) Give an initial state p_0 and value function
approximations V_t^0(p) for all t and p; set n = 1.

2) Choose a sample path.

3) For t = 0, 1, 2, ..., T, do optimization and
simulation. Optimization is to compute a decision
u_t ∈ U, and simulation is to find the next state using
p_{t+1} = B(u_t) A p_t / (1' B(u_t) A p_t).

4) Update the value function approximation to obtain
V_t^n(p) for all t.

5) If we have not met our stopping rule, increase n and
go to step 2; otherwise, stop.
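The five steps above can be sketched end-to-end on a toy two-state, two-waveform problem. Everything here (model sizes, matrices, stepsizes, the random waveform choice standing in for the optimization step, and the belief discretization used to index the value table) is an illustrative assumption; the paper's scenario is the 4-state, 5-waveform one of Table I:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.1, 0.8]])
B = [np.array([[0.95, 0.30], [0.05, 0.70]]),    # waveform 0
     np.array([[0.60, 0.10], [0.40, 0.90]])]    # waveform 1
gamma, alpha, T = 0.9, 0.1, 10

def reward(p):                    # uncertainty-based reward, as in (48)
    return 1.0 - float(p @ p)

def key(p):                       # discretize the belief to index the value table
    return round(float(p[0]), 2)

V = {}
for episode in range(200):
    p = np.array([0.5, 0.5])                  # step 1: initial belief
    x = rng.choice(2, p=[0.5, 0.5])           # step 2: sample a state path
    for t in range(T):                        # step 3: optimize + simulate
        u = int(rng.integers(2))              # exploration stands in for argmax
        x = rng.choice(2, p=A[:, x])          # true state evolves
        y = rng.choice(2, p=B[u][:, x])       # simulated measurement
        unnorm = B[u][y, :] * (A @ p)         # belief refresh, equation (19)
        p_next = unnorm / unnorm.sum()
        d = reward(p) + gamma * V.get(key(p_next), 0.0) - V.get(key(p), 0.0)
        V[key(p)] = V.get(key(p), 0.0) + alpha * d   # step 4: one-step TD update
        p = p_next
    # step 5: loop until the episode budget (the stopping rule) is exhausted

print(len(V) > 0)    # True: a value table over visited beliefs has been learned
```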
IV. SIMULATION
In this section, we carry out three experiments. In order
to explain the necessity of waveform selection, we plot
measurement probability versus SNR for three different
waveforms and measurement probability versus SNR for
different targets. The curve of the uncertainty of the state
estimation demonstrates the validity of our proposed
algorithm. We also plot the value space versus state and
waveform.
We adopt a linear frequency modulation (LFM) signal
in the transmitter. The formula of the LFM signal is

s(t) = rect(t/T) e^{j2π(f_c t + (K/2) t^2)},   (46)

where f_c is the carrier frequency and rect is the
rectangular signal

rect(t/T) = 1 for |t| ≤ T/2, and 0
elsewise.   (47)
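A complex-baseband sketch of the chirp in (46) (with f_c = 0, so only the quadratic phase term remains). The pulse duration, bandwidth and sample rate are illustrative values, not those used in the paper's figures:

```python
import numpy as np

T = 10e-6                 # pulse duration, s (illustrative)
Bw = 30e6                 # swept bandwidth, Hz (illustrative)
K = Bw / T                # chirp rate, Hz/s
fs = 4 * Bw               # sample rate
n = round(T * fs)         # 1200 samples across the pulse
t = (np.arange(n) - n / 2) / fs   # time axis centred on zero

# The rect window of (46) is implicit in the finite span of t.
s = np.exp(1j * 2 * np.pi * (K / 2) * t ** 2)

print(len(t))                          # 1200
print(np.allclose(np.abs(s), 1.0))     # True: constant envelope inside the pulse
```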
Figure 3. Real part of chirp signal.
Figure 4. Magnitude spectrum of chirp signal.
Figure 3 shows the real part of the chirp signal and
Figure 4 its magnitude spectrum.
Figure 5. Measurement probability versus SNR with three different
waveforms.
Figure 5 shows the measurement probability versus
SNR for three different waveforms. From this figure we
can see that the measurement probability grows as the
SNR increases. At the same SNR, different waveforms
give different measurement probabilities, so the
measurement can be improved by appropriately
scheduling waveforms in cognitive radar.
Generally speaking, the wider the pulse duration, the
larger the measurement probability. However, a wide
pulse duration means a large transmitted pulse energy.
We should strike a balance between pulse duration and
the energy of the transmitted pulse and appropriately
schedule the waveform in order to obtain a large
measurement probability.
Figure 6. Measurement probability versus SNR for Swerling I and
Swerling V targets.
Figure 6 shows the measurement probability versus
SNR for different targets. From this figure we can see
that at the same SNR, the measurement probability
differs between target types, so we should select different
waveforms for different targets. In practice, the target's
path is complex, and we should change the waveform
according to the changing environment.
We will use the function R(p) = 1 - p'p as the basis
for our reward function. The quantity

E[1 - p'p]   (48)

can be considered as the uncertainty in the state
estimation.

We consider a simple scenario. The state space is
x ∈ {1, 2, 3, 4}. We consider 5 different waveforms,
where for each waveform u and each hypothesis for the
target, the distribution of the measurement x' is given in
Table I. The discount factor is γ = 0.9.

TABLE I.
MEASUREMENT PROBABILITIES FOR THE EXAMPLE SCENARIO
Each entry lists the probabilities of x' = 1, 2, 3, 4 for
waveforms u = 1, ..., 5:

x=1:  u=1: 0.97, 0.01, 0.01, 0.01
      u=2: 0.96, 0.01, 0.02, 0.01
      u=3: 0.02, 0.95, 0.02, 0.01
      u=4: 0.96, 0.01, 0.01, 0.02
      u=5: 0.01, 0.02, 0.04, 0.03

x=2:  u=1: 0.96, 0.01, 0.01, 0.02
      u=2: 0.02, 0.95, 0.01, 0.02
      u=3: 0.02, 0.02, 0.01, 0.95
      u=4: 0.01, 0.96, 0.02, 0.01
      u=5: 0.01, 0.97, 0.01, 0.01

x=3:  u=1: 0.01, 0.01, 0.96, 0.02
      u=2: 0.01, 0.01, 0.01, 0.97
      u=3: 0.02, 0.96, 0.01, 0.01
      u=4: 0.97, 0.01, 0.01, 0.01
      u=5: 0.02, 0.01, 0.96, 0.01

x=4:  u=1: 0.01, 0.95, 0.02, 0.02
      u=2: 0.02, 0.96, 0.01, 0.01
      u=3: 0.01, 0.02, 0.02, 0.95
      u=4: 0.03, 0.95, 0.01, 0.01
      u=5: 0.04, 0.94, 0.01, 0.01
The state transition matrix A is a 4 x 4 matrix with
diagonal entries 0.96, 0.93, 0.95 and 0.95, with the
remaining probability in each column spread over
off-diagonal entries between 0.01 and 0.03.   (49)
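Before running such a simulation, it is worth validating that a candidate scenario is well-formed: A and each per-waveform measurement matrix B[u] must be column-stochastic. The matrices below are illustrative stand-ins consistent with the diagonal structure described above, not the paper's exact values:

```python
import numpy as np

def validate_scenario(A: np.ndarray, B: list) -> None:
    """Check that A and every B[u] are square, column-stochastic matrices."""
    n = A.shape[0]
    assert A.shape == (n, n) and np.allclose(A.sum(axis=0), 1.0), \
        "A must be column-stochastic"
    for u, Bu in enumerate(B):
        assert Bu.shape == (n, n), f"B[{u}] has the wrong shape"
        assert np.allclose(Bu.sum(axis=0), 1.0), f"B[{u}] columns must sum to 1"

A = np.array([[0.96, 0.02, 0.01, 0.01],
              [0.01, 0.93, 0.02, 0.02],
              [0.02, 0.02, 0.95, 0.02],
              [0.01, 0.03, 0.02, 0.95]])
B = [np.full((4, 4), 0.25)]      # a deliberately uninformative waveform
validate_scenario(A, B)
print("scenario ok")             # reached only if all checks pass
```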
Figure 7. Uncertainty of state estimation versus time for the optimal
scheme, temporal difference learning, and a fixed waveform.
Figure 7 shows the curve of the uncertainty of the state
estimation. From this figure, we can see that the
uncertainty of the state estimation decreases over time.
The uncertainty of the state estimation using the
optimally scheduled waveform is lower than that using a
fixed waveform, and the uncertainty using the temporal
difference learning algorithm approaches that of the
optimally scheduled waveform. In the simulation, the
optimal scheduling algorithm takes 55 seconds, while the
temporal difference learning algorithm takes 35 seconds.
So our algorithm is more efficient than the optimal
algorithm.
Figure 8. Value space versus state and waveform.
Figure 8 shows the value space versus state and
waveform. The value of each state-waveform pair can be
read from this figure. We can see that the proposed
algorithm has a lower computational cost.
V. CONCLUSIONS
Adaptive waveform selection is an important problem
in cognitive radar, and it can be viewed as a stochastic
dynamic programming problem. In this paper, under the
assumption of range-Doppler resolution cells, a stochastic
dynamic programming model for the adaptive transmitter
is set up, and temporal difference learning is then used to
solve this problem. The simulation results show that the
uncertainty of the state estimation using temporal
difference learning is less than that using a fixed
waveform. The temporal difference learning method
approaches the optimal waveform selection scheme but
has a lower computational cost. Research on algorithms
that approach the optimal waveform selection scheme
while having a lower computational cost remains an
important issue.
ACKNOWLEDGMENT
The authors would like to thank the anonymous
reviewers for their insightful comments that helped
improve the quality of this paper. This work is supported
by the National Natural Science Foundation of China
under Grant no. 60874108 and the Central University
Fundamental Research Foundation under Grant no.
N090604006.
REFERENCES
[1] S. Haykin, "Cognitive radar: a way of the future", IEEE
Signal Processing Magazine, 2006, 23(1), pp. 30-40.
[2] N. A. Goodman, P. R. Venkata and M. A. Neifeld,
"Adaptive waveform design and sequential hypothesis
testing for target recognition with active sensors", IEEE
Journal of Selected Topics in Signal Processing, 2007,
1(1), pp. 105-113.
[3] S. Haykin, "Cognition is the key to the next generation of
radar systems", IEEE 13th Digital Signal Processing
Workshop and 5th IEEE Signal Processing Education
Workshop (DSP/SPE 2009), 2009, pp. 463-467.
[4] I. Arasaratnam and S. Haykin, "Cubature Kalman filters",
IEEE Trans. Automatic Control, 2009, 54(6), pp. 1254-1269.
[5] S. Haykin, Y. B. Xue and T. Davidson, "Optimal
waveform design for cognitive radar", Asilomar
Conference on Signals, Systems and Computers, 2008.
[6] D. J. Kershaw and R. J. Evans, "Waveform selective
probabilistic data association", IEEE Transactions on
Aerospace and Electronic Systems, 1997, 33(4), pp.
1180-1188.
[7] C. Rago, P. Willett and Y. Bar-Shalom, "Detection-
tracking performance with combined waveforms", IEEE
Transactions on Aerospace and Electronic Systems, 1998,
34(2), pp. 612-624.
[8] Y. He and E. K. P. Chong, "Sensor scheduling for target
tracking in sensor networks", 43rd IEEE Conference on
Decision and Control, Paradise Island, Bahamas, 2004, pp.
743-748.
[9] V. Krishnamurthy, "Algorithms for optimal scheduling of
hidden Markov model sensors", IEEE Trans. Signal
Processing, 2002, 50(6), pp. 1382-1397.
[10] D. J. Kershaw and R. J. Evans, "Waveform selective
probabilistic data association", IEEE Transactions on
Aerospace and Electronic Systems, 1997, 33(4), pp.
1180-1188.
[11] C. O. Savage and B. Moran, "Waveform selection for
maneuvering targets within an IMM framework", IEEE
Transactions on Aerospace and Electronic Systems, 2007,
43(3), pp. 1205-1214.
[12] C. T. Capraro, I. Bradaric, G. T. Capraro and T. K. Lue,
"Using genetic algorithms for radar waveform selection",
2008 IEEE Radar Conference, Utica, NY, May 2008, pp. 1-6.
[13] B. F. La Scala, W. Moran and R. J. Evans, "Optimal
adaptive waveform selection for target detection", The
International Conference on Radar, Adelaide, SA,
Australia, Sept. 2003, pp. 492-496.
[14] B. La Scala, M. Rezaeian and B. Moran, "Optimal adaptive
waveform selection for target tracking", International
Conference on Information Fusion, 2005, pp. 552-557.
[15] B. Wang, J. K. Wang and J. Li, "ADP-based optimal
adaptive waveform selection in cognitive radar",
International Symposium on Intelligent Information
Technology Applications Workshops, Shanghai, China,
Dec. 2008, pp. 788-790.
[16] P. Z. Peebles, Radar Principles, John Wiley & Sons, Inc,
1998.
[17] D. Bertsekas, Dynamic Programming and Optimal Control,
Volume 1, Athena Scientific, 2nd edition, 2001.
[18] Warren B. Powell, Approximate Dynamic Programming:
Solving the Curses of Dimensionality, John Wiley & Sons,
Inc, 2007.
Bin Wang was born in Hebei, China, in 1982. He received
the M.S. degree in communication and information systems
from Northeastern University, China, in 2008. Since March
2008, he has been working toward the PhD degree at
Northeastern University. His research interests are in the area
of cognitive radar and adaptive waveform selection.
Jinkuan Wang received his PhD degree from the University
of Electro-Communications, Japan, in 1993. He has been a
professor in the School of Information Science and Engineering
at Northeastern University, China, since 1998. His main
interests are in the area of intelligent control and adaptive
arrays.
Xin Song was born in Jilin, China, in 1978. She received her
PhD degree in communication and information systems from
Northeastern University, China, in 2008. She is now a teacher
at Northeastern University at Qinhuangdao, China. Her research
interests are robust adaptive beamforming and wireless
communication.
Yinghua Han was born in Jilin, China, in 1979. She
received the M.S. and PhD degrees from the College of
Information Science and Engineering, Northeastern University,
Shenyang, China, in 2005 and 2008, respectively. Since 2003,
she has been with the Engineering Optimization and Smart
Antenna Institute. Her research interests include array signal
processing and mobile wireless communication systems.