Neurocomputing 70 (2006) 525–535

An efficient hidden layer training method for the multilayer perceptron

Changhua Yu (a,*), Michael T. Manry (b), Jiang Li (c), Pramod Lakshmi Narasimha (b)

(a) FastVDO LLC, Columbia, MD 21046, USA
(b) Department of Electrical Engineering, University of Texas at Arlington, TX 76019, USA
(c) Department of Radiology, Clinical Center, National Institutes of Health, Bethesda, MD 20892, USA

Received 17 December 2003; received in revised form 22 November 2005; accepted 23 November 2005

Communicated by S.Y. Lee

Available online 18 May 2006

Abstract

The output-weight-optimization and hidden-weight-optimization (OWO–HWO) training algorithm for the multilayer perceptron

alternately solves linear equations for output weights and reduces a separate hidden layer error function with respect to hidden layer

weights. Here, three major improvements are made to OWO–HWO. First, a desired net function is derived. Second, using the classical

mean square error, a weighted hidden layer error function is derived which de-emphasizes net function errors that correspond to

saturated activation function values. Third, an adaptive learning factor based on the local shape of the error surface is used in hidden

layer training. Faster learning convergence is experimentally verified, using three training data sets.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Hidden weight optimization (HWO); Convergence; Hidden layer error function; Saturation; Adaptive learning factor

1. Introduction

Back propagation (BP) [10,11,26,29] was the first

effective training algorithm for the multilayer perceptron

(MLP). Subsequently, the MLP has been widely applied in

the fields of signal processing [16], remote sensing [17], and

pattern recognition [3]. However, the inherent slow

convergence of BP has delayed the adoption of the MLP

by much of the signal processing community.

During the last two decades, many improvements to BP

have been made. One reason for BP’s slow convergence is the

saturation occurring in some of the nonlinear units. When

hidden units become saturated, the error function reaches a

local minimum early in the training stage. In [14,30], proper

weight initializations are used to avoid premature saturation.

Lee et al. [14] derived the probability of incorrect saturation

for the case of binary training patterns. They concluded that

the saturation probability is a function of the maximum value of the initial weights, the number of units in each layer, and the slope of the sigmoid function. Yam and Chow [30] proposed to initialize the weights to ensure that the outputs of the hidden units are in the active region of the sigmoid
function. Various techniques, such as an error saturation

prevention function in [13], the cross-entropy error function

[23] and its extension [19], are proposed to prevent the units

from premature saturation.

In [2], Battiti reviewed first and second order algorithms

for learning in neural networks. First order methods are

fast and effective for large-scale problems, while second

order techniques have higher precision. The Levenberg–

Marquardt (LM) [6,9,17] method is a well-known second

order algorithm with better convergence properties [2] than

conventional BP. Unfortunately, it requires O(N_w^3) storage and calculations of order O(N_w^3), where N_w is the total number of weights in an MLP [18]. Hence the LM method is impractical for all but small networks.

Many investigators have trained the different MLP

layers separately. In layer-by-layer (LBL) training [20,28],

the output weights are first updated. Then the correspond-

ing desired outputs for the hidden units are found. Finally,

the hidden weights are found by minimizing the difference

between the actual net function and a desired net function.

In [28], for the current training iteration, the optimal

output and hidden weights are found by solving sets of

www.elsevier.com/locate/neucom

0925-2312/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2005.11.008

*Corresponding author. Tel.: +1 817 272 3469; fax: +1 817 272 3483.
E-mail addresses: ychmailuta@yahoo.com (C. Yu), manry@uta.edu (M.T. Manry).

linear equations using pseudo-inverse matrices. In [20], by contrast, the output and hidden weights are updated in the

gradient directions with optimal learning factors. This

strategy makes each step in [20] similar to steepest descent.

Some researchers have developed fast training techniques

by solving sets of linear equations [1,3,16,25]. When output

units have linear activation functions, linear equations can

be solved for the output weights. For example, the output

weight optimization (OWO) algorithm [1,17] has been

successfully used to minimize the MSE by solving linear

equations. In [17], the MLP is trained by the so-called

output weight optimization-back propagation (OWO-BP)

algorithm, where the output weights are found by solving

linear equations and the hidden weights are updated by BP.

Scalero and Tepedelenlioglu [27] developed a non-batch

training approach for feed-forward neural networks in which

separate error functions are minimized for each hidden unit.

This idea greatly improved training efficiency. However, they

didn’t use OWO to solve for the output weights. Using ideas

from [27], Chen [3] constructed a batch mode training

algorithm called output weight optimization-hidden weight

optimization (OWO–HWO). In OWO–HWO, output weights

and hidden unit weights are alternately modified to reduce

training error. The algorithm modifies the hidden weights

based on the minimization of the MSE between the desired

and the actual net function, as originally proposed in [27].

Although OWO–HWO increases training speed, it still has

room for improvement [28] because it uses the delta function

as the desired net function change. This makes HWO similar

to steepest descent, except that net functions are varied instead

of weights. In addition, HWO is equivalent to BP applied to

the hidden weights under certain conditions [3].

In this paper, we develop some improvements to

OWO–HWO. In Section 2, we describe the notation used

in this paper and review the OWO–HWO algorithm. In

Section 3, we introduce a new desired net function change,

which updates the net function in the optimal gradient

direction. We derive a weighted hidden layer error

function, first used in [20], directly from the global MSE.

After that, we construct an adaptation law for the hidden

layer learning factor. The convergence of this method is

shown. In Section 4, by running on several different training data sets, we compare the new HWO with the original OWO–HWO and other training algorithms. Section 5 discusses generalization performance, and Section 6 concludes the paper.

2. Review of OWO–HWO

In this section, we describe the structure and notation of a fully connected MLP and then review the OWO–HWO algorithm.
2.1. Structure and notation of a fully connected MLP

Without loss of generality, we restrict ourselves to a

three layer fully connected MLP, which is commonly used.

The structure of the MLP is shown in Fig. 1. Bypass

weights from input layer to output layer are used, but are

not shown in the figure for clarity. These connections allow

the network to easily model the linear component of the

desired mapping, and reduce the number of required

hidden units. The training data set consists of N_v training patterns {(x_p, t_p)}, where the pth input vector x_p and the pth desired output vector t_p have dimensions N and M, respectively. Thresholds in the hidden layer are handled by letting x_p,(N+1) = 1.

For the kth hidden unit, the net input net_pk and the output activation O_pk for the pth training pattern are

$$ net_{pk} = \sum_{n=1}^{N+1} w_{hi}(k,n)\, x_{pn}, \qquad O_{pk} = f(net_{pk}), \tag{2.1} $$

where x_pn denotes the nth element of x_p, w_hi(k,n) denotes the weight connecting the nth input unit to the kth hidden unit, and N_h is the number of hidden units. The activation function f is sigmoidal:

$$ f(net_{pk}) = \frac{1}{1 + e^{-net_{pk}}}. \tag{2.2} $$

The ith output y_pi for the pth training pattern is

$$ y_{pi} = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} + \sum_{k=1}^{N_h} w_{oh}(i,k)\, O_{pk}. \tag{2.3} $$

For convenience, in the OWO procedure we augment the input vector as

$$ \tilde{x}_{pn} = \begin{cases} x_{pn}, & n = 1, 2, \ldots, N, \\ 1, & n = N+1, \\ O_{p,(n-N-1)}, & n = N+2, \ldots, N+N_h+1. \end{cases} \tag{2.4} $$
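As a concrete illustration of Eqs. (2.1)-(2.4), the following sketch runs the forward pass of a fully connected MLP with bypass weights. This is our own toy code, not the authors' implementation; all array names, sizes, and the random initialization are assumptions for illustration.

```python
import numpy as np

# Toy sketch of the forward pass in Eqs. (2.1)-(2.3): a three-layer MLP
# with bypass (input-to-output) weights, threshold handled by x_{N+1} = 1.
rng = np.random.default_rng(0)
N, Nh, M = 8, 10, 7                         # inputs, hidden units, outputs

W_hi = rng.normal(size=(Nh, N + 1)) * 0.1   # hidden weights w_hi(k, n)
W_oi = rng.normal(size=(M, N + 1)) * 0.1    # bypass weights w_oi(i, n)
W_oh = rng.normal(size=(M, Nh)) * 0.1       # output weights w_oh(i, k)

def forward(x):
    """x: input vector of length N; returns (y, O, net)."""
    xa = np.append(x, 1.0)                  # append threshold input
    net = W_hi @ xa                         # net_pk, Eq. (2.1)
    O = 1.0 / (1.0 + np.exp(-net))          # sigmoid activation, Eq. (2.2)
    y = W_oi @ xa + W_oh @ O                # Eq. (2.3): bypass + hidden terms
    return y, O, net

y, O, net = forward(rng.normal(size=N))
print(y.shape, O.shape)                     # (7,) (10,)
```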

Fig. 1. MLP structure for one hidden layer.

Using the augmented input vector in (2.4), (2.3) is rewritten as

$$ y_{pi} = \sum_{n=1}^{L} w_o(i,n)\, \tilde{x}_{pn}, \tag{2.5} $$

where w_o(i,n) = w_oi(i,n) for 1 ≤ n ≤ N+1 and w_o(i,n) = w_oh(i, n−N−1) for N+2 ≤ n ≤ N+N_h+1 = L.

For each training pattern, the training error is

$$ E_p = \sum_{i=1}^{M} \left[ t_{pi} - y_{pi} \right]^2, \tag{2.6} $$

where t_pi denotes the ith element of the pth desired output vector. In batch mode training, the overall performance of a feed-forward network, measured as the mean square error (MSE), can be written as

$$ E = \frac{1}{N_v} \sum_{p=1}^{N_v} E_p = \frac{1}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} \left[ t_{pi} - y_{pi} \right]^2. \tag{2.7} $$

2.2. Output weight optimization

As the output units have linear activation functions, the OWO procedure for finding the weights w_o(i,n) in (2.5) can be realized by solving linear equations [1,3,17], which result when the gradients of E with respect to the output weights are set to zero. These equations are

$$ \sum_{n=1}^{L} w_o(i,n)\, R(n,m) = C(i,m), \tag{2.8} $$

where C is the cross-correlation matrix with elements

$$ C(i,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} t_{pi}\, \tilde{x}_{pm}, \tag{2.9} $$

and R is the autocorrelation matrix defined as

$$ R(n,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} \tilde{x}_{pn}\, \tilde{x}_{pm}. \tag{2.10} $$

Since the equations are often ill-conditioned, meaning that

the determinant of R is close to 0, it is often unsafe to use

Gauss–Jordan elimination. The singular value decomposi-

tion (SVD) [8], LU decomposition (LUD) [24], and

conjugate gradient (CG) [6] approaches are better.
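The OWO step above can be sketched in a few lines. This is an illustrative stand-in (toy data and names of our own, not the paper's code): we build R and C from Eqs. (2.9)-(2.10) and, following the text's warning about ill-conditioning, solve (2.8) with an SVD-based least-squares routine rather than Gauss-Jordan elimination.

```python
import numpy as np

# Sketch of OWO, Eqs. (2.8)-(2.10): correlation matrices from the data,
# then a linear solve per output unit.
rng = np.random.default_rng(1)
Nv, L, M = 200, 12, 3          # patterns, augmented-input size, outputs

X = rng.normal(size=(Nv, L))   # augmented inputs \tilde{x}_p (inputs, 1, O_pk)
T = rng.normal(size=(Nv, M))   # desired outputs t_p

R = X.T @ X / Nv               # autocorrelation R(n, m), Eq. (2.10)
C = T.T @ X / Nv               # cross-correlation C(i, m), Eq. (2.9)

# Eq. (2.8): sum_n w_o(i,n) R(n,m) = C(i,m); R is symmetric, so solve R w = c
# for each output i with an SVD-based least-squares routine.
W_o, *_ = np.linalg.lstsq(R, C.T, rcond=None)
W_o = W_o.T                    # row i holds the output weights w_o(i, :)

# The solved weights minimize the MSE of the linear output layer:
Y = X @ W_o.T
print(np.mean((T - Y) ** 2))
```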

2.3. Hidden weight optimization

Hidden weight optimization is a full batch algorithm developed from the ideas in [27]. In HWO, the hidden weights are updated by minimizing separate error functions for each hidden unit. The error functions measure the difference between the desired and the actual net function. For the kth hidden unit and pth pattern, the desired net function is constructed as [3]

$$ net_{pkd} \cong net_{pk} + Z \cdot \delta_{pk}, \tag{2.11} $$

where net_pkd is the desired net function and net_pk is the actual one in (2.1). Z is the learning factor, and δ_opi is the delta function [3,11,26] of the ith output:

$$ \delta_{opi} = -\frac{\partial E_p}{\partial y_{pi}} = \left[ t_{pi} - y_{pi} \right]. \tag{2.12} $$

The delta function for the kth hidden unit [2,3,7] is

$$ \delta_{pk} = -\frac{\partial E_p}{\partial net_{pk}} = f'(net_{pk}) \sum_{i=1}^{M} w_{oh}(i,k)\, \delta_{opi}. \tag{2.13} $$

The hidden weights are to be updated as

$$ w_{hi}(k,n) \leftarrow w_{hi}(k,n) + Z \cdot e(k,n), \tag{2.14} $$

where e(k,n) is the weight change. The weight changes are derived using

$$ net_{pk} + Z \cdot \delta_{pk} \cong \sum_{n=1}^{N+1} \left[ w_{hi}(k,n) + Z \cdot e(k,n) \right] x_{pn}. \tag{2.15} $$

Therefore,

$$ \delta_{pk} \cong \sum_{n=1}^{N+1} e(k,n)\, x_{pn}. \tag{2.16} $$

The error of (2.15) and (2.16) for the kth hidden unit is measured as

$$ E_{\delta}(k) = \frac{1}{N_v} \sum_{p=1}^{N_v} \left[ \delta_{pk} - \sum_{n=1}^{N+1} e(k,n)\, x_{pn} \right]^2. \tag{2.17} $$

Equating the gradient of E_δ(k) with respect to the hidden weight changes to zero, we have

$$ \sum_{n=1}^{N+1} e(k,n)\, R(n,m) = C_d(k,m), \tag{2.18} $$

where

$$ C_d(k,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} \delta_{pk}\, x_{pm} = -\frac{\partial E}{\partial w_{hi}(k,m)}. \tag{2.19} $$

The hidden weight change e(k,n) can be found from (2.18) by using the conjugate gradient method. After finding the learning factor Z, the hidden weights are updated as in (2.14).
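One HWO step of Eqs. (2.11)-(2.19) can be sketched as follows. This is our own toy illustration (random data, least-squares solve in place of the paper's conjugate gradient, names of our choosing), not the authors' code.

```python
import numpy as np

# Sketch of one HWO step: compute the delta functions (2.12)-(2.13),
# solve sum_n e(k,n) R(n,m) = C_d(k,m) per Eq. (2.18), update by (2.14).
rng = np.random.default_rng(2)
Nv, N, Nh, M, Z = 100, 5, 4, 2, 0.1

X = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])  # x_p with threshold
T = rng.normal(size=(Nv, M))
W_hi = rng.normal(size=(Nh, N + 1)) * 0.1
W_oi = rng.normal(size=(M, N + 1)) * 0.1
W_oh = rng.normal(size=(M, Nh)) * 0.1

net = X @ W_hi.T                       # (Nv, Nh), Eq. (2.1)
O = 1.0 / (1.0 + np.exp(-net))
Y = X @ W_oi.T + O @ W_oh.T            # Eq. (2.3)

d_o = T - Y                            # output deltas, Eq. (2.12)
delta = O * (1 - O) * (d_o @ W_oh)     # hidden deltas, Eq. (2.13); f' = O(1-O)

R = X.T @ X / Nv                       # autocorrelation over the inputs
Cd = delta.T @ X / Nv                  # C_d(k, m), Eq. (2.19)

E_chg, *_ = np.linalg.lstsq(R, Cd.T, rcond=None)   # rows e(k, :), Eq. (2.18)
W_hi = W_hi + Z * E_chg.T              # update, Eq. (2.14)
print(W_hi.shape)                      # (4, 6)
```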

3. Enhancement of OWO-HWO

From (2.11) and (2.13), we can see that the net functions are updated in the gradient direction. It is well known that optimizing in the gradient direction, as in steepest descent, is very slow. In addition, using E_δ(k) in (2.17) ignores the effects of activation function saturation when |net_pk| is large, as was pointed out in [15] for the BP algorithm. In this section, we update the net functions along new directions that tend to minimize the total training error, and derive the corresponding new hidden layer error function, which takes saturation into account.


3.1. The optimal direction of desired net function change

As mentioned in (2.11), for the original HWO algorithm the desired net function change is Z·δ_pj, where the delta function is the negative gradient of E with respect to the current net function net_pj. This strategy of updating net functions in their gradient directions results in slow training for the hidden units.

In this section, we introduce a new desired net function net_pjd:

$$ net_{pjd} = net_{pj} + Z \cdot \Delta net^{*}_{pj}, \tag{3.1} $$

where Δnet*_pj, written as

$$ \Delta net^{*}_{pj} = net^{*}_{pj} - net_{pj}, \tag{3.2} $$

is the difference between the current net function net_pj and an optimal value net*_pj. Now the net function approaches the optimal value net*_pj instead of moving in the negative gradient direction. The current task for constructing the new HWO algorithm is to find Δnet*_pj.

Using a Taylor series and Eqs. (2.1) and (3.2), the corresponding hidden layer output O*_pj caused by net*_pj can be written as

$$ O^{*}_{pj} = f(net^{*}_{pj}) \cong O_{pj} + f'_{pj}\, \Delta net^{*}_{pj}, \tag{3.3} $$

where f'_pj is shorthand for the activation derivative f'(net_pj). Replacing O_pj by its optimal value O*_pj in (2.3), (2.7) becomes

$$ E = \frac{1}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} \Bigg[ t_{pi} - \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} - \sum_{\substack{k=1 \\ k \neq j}}^{N_h} w_{oh}(i,k)\, O_{pk} - w_{oh}(i,j)\, O^{*}_{pj} \Bigg]^2. \tag{3.4} $$

Now Δnet*_pj can be derived based on

$$ \left. \frac{\partial E}{\partial net_{pj}} \right|_{net_{pj} = net^{*}_{pj}} = 0. \tag{3.5} $$

Using (3.1) to (3.5) yields

$$ \frac{\partial E}{\partial net^{*}_{pj}} = -\frac{2}{N_v} \sum_{i=1}^{M} \Bigg\{ \Bigg[ t_{pi} - \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} - \sum_{\substack{k=1 \\ k \neq j}}^{N_h} w_{oh}(i,k)\, O_{pk} - w_{oh}(i,j)\, O^{*}_{pj} \Bigg] w_{oh}(i,j) \Bigg\} \frac{\partial O^{*}_{pj}}{\partial net^{*}_{pj}} = -\frac{2}{N_v} \sum_{i=1}^{M} \Big\{ \Big[ t_{pi} - y_{pi} - w_{oh}(i,j) \big( O^{*}_{pj} - O_{pj} \big) \Big] w_{oh}(i,j) \Big\} \frac{\partial O^{*}_{pj}}{\partial net^{*}_{pj}}. \tag{3.6} $$

Using (3.3) and (2.12), we replace (O*_pj − O_pj) and (t_pi − y_pi) in (3.6) by f'_pj Δnet*_pj and δ_opi, yielding

$$ \frac{\partial E}{\partial net^{*}_{pj}} = -\frac{2}{N_v} \frac{\partial O^{*}_{pj}}{\partial net^{*}_{pj}} \sum_{i=1}^{M} \Big[ \delta_{opi} - w_{oh}(i,j)\, f'_{pj}\, \Delta net^{*}_{pj} \Big] w_{oh}(i,j) = -\frac{2}{N_v} \frac{\partial O^{*}_{pj}}{\partial net^{*}_{pj}} \Bigg[ \sum_{i=1}^{M} \delta_{opi}\, w_{oh}(i,j) - f'_{pj}\, \Delta net^{*}_{pj} \sum_{i=1}^{M} w^2_{oh}(i,j) \Bigg]. \tag{3.7} $$

Setting the right-hand side of (3.7) to zero, we can find Δnet*_pj as

$$ \Delta net^{*}_{pj} \cong \frac{\sum_{i=1}^{M} \delta_{opi}\, w_{oh}(i,j)}{f'_{pj} \sum_{i=1}^{M} w^2_{oh}(i,j)}. \tag{3.8} $$

Using (2.12) and (2.13), (3.8) becomes

$$ \Delta net^{*}_{pj} \cong \frac{\delta_{pj}}{\left( f'_{pj} \right)^2 \sum_{i=1}^{M} w^2_{oh}(i,j)}. \tag{3.9} $$
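Eq. (3.9) is a simple per-pattern, per-unit rescaling of the delta function, and can be sketched in vectorized form. The arrays below are toy values of our own, not from the paper.

```python
import numpy as np

# Sketch of the new desired net-function change, Eq. (3.9):
# Δnet*_pk ≈ δ_pk / ((f'_pk)^2 · Σ_i w_oh(i,k)^2).
rng = np.random.default_rng(3)
Nv, Nh, M = 50, 4, 3

delta = rng.normal(size=(Nv, Nh))            # hidden deltas δ_pk, Eq. (2.13)
O = rng.uniform(0.05, 0.95, size=(Nv, Nh))   # hidden outputs
fprime = O * (1 - O)                         # sigmoid derivative f'_pk
W_oh = rng.normal(size=(M, Nh))              # output weights w_oh(i, k)

denom = fprime ** 2 * np.sum(W_oh ** 2, axis=0)   # (f'_pk)^2 Σ_i w_oh^2(i,k)
dnet_star = delta / denom                         # Δnet*_pk, Eq. (3.9)
print(dnet_star.shape)                            # (50, 4)
```

Note that small f'_pk (a saturated unit) gives a large desired change toward the optimal net value; the weighted error function of the next subsection keeps this from dominating the weight solve.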

3.2. A weighted hidden error function

If we substitute Eq. (2.3) into (2.7), we can rewrite the total MSE as

$$ E = \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \Bigg[ t_{pi} - \sum_{k=1}^{N_h} w_{oh}(i,k)\, O_{pk} - \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} \Bigg]^2. \tag{3.10} $$

During the HWO procedure, if we define net_pkd as in (3.1), the corresponding hidden unit output O_pkd can be approximated by a Taylor series as

$$ O_{pkd} = f(net_{pkd}) \cong O_{pk} + Z \cdot f'_{pk}\, \Delta net^{*}_{pk}. \tag{3.11} $$

However, because the hidden weights are updated as in (2.14), the actual net function we find is

$$ \overline{net}_{pk} = \sum_{n=1}^{N+1} \left[ w_{hi}(k,n) + Z \cdot e(k,n) \right] x_{pn} = net_{pk} + Z \sum_{n=1}^{N+1} e(k,n)\, x_{pn}. \tag{3.12} $$

The actual kth hidden unit output after HWO is approximated as

$$ \bar{O}_{pk} = f(\overline{net}_{pk}) \cong O_{pk} + Z \cdot f'_{pk} \sum_{n=1}^{N+1} e(k,n)\, x_{pn}. \tag{3.13} $$

If we denote the ith output caused by the inputs and O_pkd as

$$ T_{pi} = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} + \sum_{k=1}^{N_h} w_{oh}(i,k)\, O_{pkd}, \tag{3.14} $$

then after HWO, the actual total error can be rewritten as

$$ E = \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \Bigg[ t_{pi} - T_{pi} + T_{pi} - \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} - \sum_{k=1}^{N_h} w_{oh}(i,k)\, \bar{O}_{pk} \Bigg]^2 = \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \Bigg\{ \big( t_{pi} - T_{pi} \big) + \sum_{k=1}^{N_h} w_{oh}(i,k) \big( O_{pkd} - \bar{O}_{pk} \big) \Bigg\}^2. \tag{3.15} $$

If we assume that [t_pi − T_pi] is uncorrelated with

$$ \Bigg[ \Delta net^{*}_{pk} - \sum_{n} e(k,n)\, x_{pn} \Bigg] $$


and use Eqs. (3.11) and (3.13), Eq. (3.15) becomes

$$ E \cong \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \big( t_{pi} - T_{pi} \big)^2 + \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \Bigg\{ \sum_{k=1}^{N_h} w_{oh}(i,k)\, f'_{pk}\, Z \Bigg[ \Delta net^{*}_{pk} - \sum_{n=1}^{N+1} e(k,n)\, x_{pn} \Bigg] \Bigg\}^2. \tag{3.16} $$

Here Z is the learning factor along the Δnet*_pj direction. We can evaluate (3.16) further because it is reasonable to assume that the values

$$ \Bigg\{ w_{oh}(i,k)\, f'_{pk} \Bigg[ \Delta net^{*}_{pk} - \sum_{n} e(k,n)\, x_{pn} \Bigg] \Bigg\} $$

associated with different hidden units are uncorrelated with each other. This assumption yields

$$ E \cong \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \big( t_{pi} - T_{pi} \big)^2 + \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \sum_{k=1}^{N_h} w^2_{oh}(i,k)\, Z^2 \left( f'_{pk} \right)^2 \Bigg[ \Delta net^{*}_{pk} - \sum_{n=1}^{N+1} e(k,n)\, x_{pn} \Bigg]^2. \tag{3.17} $$

Since Z, [t_pi − T_pi] and w_oh(i,k) are constant during the HWO procedure, minimizing E is equivalent to minimizing

$$ E_{\delta}(k) = \frac{1}{N_v} \sum_{p} \left( f'_{pk} \right)^2 \Bigg[ \Delta net^{*}_{pk} - \sum_{n=1}^{N+1} e(k,n)\, x_{pn} \Bigg]^2. \tag{3.18} $$

This hidden layer error function, which was introduced without derivation in [19], successfully de-emphasizes error in saturated hidden units using the square of the derivative of the activation function. The size of the hidden weight update then depends upon whether the hidden net functions are in the linear or saturation regions of the sigmoid function. If the current net_pk is in a saturation region, the difference between the desired and actual hidden outputs is small even though the difference between net_pk and net_pkd is large. In this case, because there is no need to change the associated weights according to the large difference between net_pk and net_pkd, the small value of the (f'_pk)² term decreases the weight change e(k,n). When net_pk is in the linear region, the hidden weights are updated according to the difference between net_pk and net_pkd. That is, the (f'_pk)² term de-emphasizes net function error corresponding to saturated activations, while errors between net_pk and net_pkd for net_pk in the sigmoid's linear region receive large weight in (3.18).

The hidden weight change e(k,m) in (3.18) can be found as in Section 2. First, the gradient of the new E_δ(k) with respect to the hidden weight change e(k,m) is equated to zero, yielding Eq. (2.18). However, the autocorrelation matrix R and cross-correlation matrix C_d are now found as:

$$ R(n,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} \left( x_{pn}\, x_{pm} \right) \left( f'_{pk} \right)^2 \tag{3.19} $$

and

$$ C_d(k,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} \left[ \Delta net^{*}_{pk}\, x_{pm} \right] \left( f'_{pk} \right)^2. \tag{3.20} $$

Again, we have (N+1) equations in (N+1) unknowns for the kth hidden unit. After finding the learning factor Z, the hidden weights are updated as in (2.14).
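The weighted per-unit solve of Eqs. (3.19)-(3.20) can be sketched as follows. This is an illustrative stand-in of our own (toy arrays, least-squares solver); the paper's implementation may differ.

```python
import numpy as np

# Sketch of the weighted hidden-layer solve, Section 3.2: per hidden unit k,
# both R and C_d are weighted by (f'_pk)^2 (Eqs. (3.19)-(3.20)), so that
# saturated patterns contribute little; then Eq. (2.18) is solved for e(k, :).
rng = np.random.default_rng(4)
Nv, N, Nh = 80, 5, 3

X = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])
O = rng.uniform(0.01, 0.99, size=(Nv, Nh))
fprime = O * (1 - O)                         # sigmoid derivative f'_pk
dnet_star = rng.normal(size=(Nv, Nh))        # Δnet*_pk from Eq. (3.9)

E_chg = np.zeros((Nh, N + 1))
for k in range(Nh):
    w = fprime[:, k] ** 2                    # per-pattern weights (f'_pk)^2
    Rk = (X * w[:, None]).T @ X / Nv         # weighted R(n, m), Eq. (3.19)
    ck = X.T @ (w * dnet_star[:, k]) / Nv    # weighted C_d(k, :), Eq. (3.20)
    E_chg[k] = np.linalg.lstsq(Rk, ck, rcond=None)[0]  # e(k, :), Eq. (2.18)

print(E_chg.shape)                           # (3, 6)
```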

3.3. Improved bold drive technique

In [12,20], the weights and the desired hidden outputs

were updated using an optimal learning factor. However,

those optimal learning factors are found just based on

current gradients. They do not guarantee an optimal

solution for the whole training procedure, and do not

guarantee faster convergence than heuristic learning factors.

As the error function has broad flat regions adjoining narrow steep ones [15], the learning factor should be adapted according to the local shape of the error surface. That is, in flat regions a larger Z can be used, while in steep regions the learning factor should be kept small.

Combining this idea with the BD technique [3,21],

we developed a new adaptive learning factor for the

hidden layer. If the learning factor Z for hidden

weight updating is small, then the change in the output

layer error function E due to the change in whi(k, n) is

approximately

$$ \Delta E = \frac{\partial E}{\partial w_{hi}(k,n)} \cdot \Delta w_{hi}(k,n) = Z \cdot \frac{\partial E}{\partial w_{hi}(k,n)} \cdot e(k,n). \tag{3.21} $$

The total effect of the hidden weight changes on the output error will be

$$ \Delta E = Z \sum_{k=1}^{N_h} \sum_{n=1}^{N+1} \frac{\partial E}{\partial w_{hi}(k,n)} \cdot e(k,n). \tag{3.22} $$

Assume E is reduced by a small factor Z₀ between 0.0 and 0.1,

$$ \Delta E = -Z_0 \cdot E. \tag{3.23} $$

Combining (3.22) and (3.23), the learning factor Z will be

$$ Z = \frac{-Z_0\, E}{\sum_{k} \sum_{n} \left( \partial E / \partial w_{hi}(k,n) \right) \cdot e(k,n)}. \tag{3.24} $$

Here, we develop an adaptive law for Z₀. When the error increases at a given step, the learning factor is decreased by setting

$$ Z_0 \leftarrow Z_0 \cdot 0.5 \tag{3.25} $$

and the best previous weights are reloaded. When the error decreases at a given step, we use

$$ Z_0 \leftarrow Z_0 \cdot G, \tag{3.26} $$

where the gain G is

$$ G = 1 + \frac{a}{1 + e^{R}}. \tag{3.27} $$

The parameter a is used to limit the maximum value of G. In our simulations, a is set to 0.49. The ratio R is calculated using the last epoch's parameters, as

$$ R = \frac{|\Delta E|}{\|\Delta W\|} = \frac{|E_1 - E_2|}{Z\, \|e\|}. \tag{3.28} $$

The hidden weight change vector e in (3.28) is available from solving (3.18) in the previous HWO iteration. E₁ and E₂ are the total output MSEs of the last two iterations, respectively. The learning factor Z is then calculated using (3.24).

In fact, if we assume the hidden weight changes are small, R approximates the current gradient magnitude of the hidden unit error surface. When the gradient magnitude is small, the local shape of the error function is flat; otherwise it is steep [25]. So from (3.26) to (3.28), the learning factor Z is modulated by an adaptive factor G, which varies from 1 to (1 + a) with respect to the local shape of the error function. In the remainder of this paper, we refer to the new OWO–HWO algorithm as new HWO, for brevity. A flow chart for the original OWO–HWO and new HWO algorithms is shown in Fig. 2.
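The adaptive law of Eqs. (3.24)-(3.28) can be sketched as two small helpers. This is our own hedged illustration; function names and the toy call at the end are assumptions, with a = 0.49 as in the paper.

```python
import numpy as np

def adapt_z0(z0, E1, E2, Z, e_vec, a=0.49):
    """Update Z0 from the last two epoch MSEs E1, E2 (Eqs. (3.25)-(3.28))."""
    if E2 > E1:                        # error increased: halve Z0, Eq. (3.25)
        return 0.5 * z0                # (caller should also reload best weights)
    R = abs(E1 - E2) / (Z * np.linalg.norm(e_vec) + 1e-12)  # Eq. (3.28)
    G = 1.0 + a / (1.0 + np.exp(R))    # gain, Eq. (3.27): flatter => larger G
    return z0 * G                      # Eq. (3.26)

def learning_factor(z0, E, grad, e_vec):
    """Eq. (3.24): Z = -Z0*E / sum_{k,n} (dE/dw_hi(k,n)) * e(k,n)."""
    denom = np.sum(grad * e_vec)
    return -z0 * E / denom

z0 = 0.05
z0 = adapt_z0(z0, E1=0.30, E2=0.25, Z=0.1, e_vec=np.ones(10))
print(z0 > 0.05)                       # error fell, so Z0 grew
```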

3.4. Convergence of the modified algorithm

In the following, we show that the hidden weight changes from HWO make the global error E decrease until a local minimum or a given stopping criterion is reached.

Denote W* as the optimal weights; then we have E(W) ≥ E(W*), and E is convex in the neighborhood of W*: N(W*) = {W : ‖W − W*‖ < ε, ε > 0}. Then the change in E caused by the hidden weight changes can be approximated as

$$ \Delta E = E(W + \Delta W) - E(W) \cong \left( \frac{\partial E}{\partial W} \right)^{T} \Delta W. \tag{3.29} $$

In addition, ΔE is the total effect of the N_h hidden units,

$$ \Delta E = \sum_{k=1}^{N_h} \Delta E(k), \tag{3.30} $$

where ΔE(k) is the change in E caused by the kth hidden unit:

$$ \Delta E(k) \cong \sum_{n} \frac{\partial E}{\partial w_{hi}(k,n)}\, \Delta w_{hi}(k,n) = Z \sum_{n} \frac{\partial E}{\partial w_{hi}(k,n)}\, e(k,n) = Z\, \frac{1}{N_v} \sum_{n} \sum_{p=1}^{N_v} \frac{\partial E_p}{\partial w_{hi}(k,n)}\, e(k,n) = -Z\, \frac{1}{N_v} \sum_{p=1}^{N_v} \delta_{pk} \sum_{n} x_{pn}\, e(k,n). \tag{3.31} $$

From (3.18), we can say that, in the neighborhood of W*:

$$ \sum_{n} e(k,n)\, x_{pn} \cong \Delta net^{*}_{pk} = \frac{\delta_{pk}}{\left( f'_{pk} \right)^2 \sum_{i=1}^{M} w^2_{oh}(i,k)}. \tag{3.32} $$

The second step in the above equation is from (3.9). The total change of the error function E, due to changes in all hidden weights, becomes

$$ \Delta E = \sum_{k} \Delta E(k) \cong -Z\, \frac{1}{N_v} \sum_{k} \sum_{p=1}^{N_v} \frac{\delta^2_{pk}}{\left( f'_{pk} \right)^2 \sum_{i=1}^{M} w^2_{oh}(i,k)}. \tag{3.33} $$

As every term in the summation is non-negative, we have ΔE ≤ 0. So the global error E will keep decreasing until reaching a local minimum.

Fig. 2. Flowchart of the original OWO–HWO and new HWO algorithms.

4. Simulation and discussion

The proposed ideas were verified using several training

data sets. The performance of the new HWO was

compared to the original OWO–HWO, the Levenberg–

Marquardt (LM) algorithm, and the new HWO with an

optimal learning factor [12]. The optimal learning factor is

derived in the Appendix. We also compare the new HWO

with the LBL method [20] for one data set. Our simulations

were carried out on a 2.8GHz Pentium IV, Windows NT

workstation using the Visual C++ 6.0 compiler.

4.1. Simulations

Training data set 1: Twod contains simulated data based

on models from backscattering measurements. This train-

ing file is used in the task of inverting the surface scattering

parameters from an inhomogeneous layer above a homo-

geneous half space, where both interfaces are randomly

rough. The parameters to be inverted are the effective

permittivity of the surface, the normalized rms height, the

normalized surface correlation length, the optical depth,

and single scattering albedo of an inhomogeneous irregular

layer above a homogeneous half space from back scattering

measurements.

The training data file contains 1768 patterns. The inputs

consist of eight theoretical values of back scattering

coefficient parameters at V and H polarization and

four incident angles. The outputs were the corresponding

values of permittivity, upper surface height, lower surface

height, normalized upper surface correlation length,

normalized lower surface correlation length, optical depth, and single scattering albedo, which had a joint

uniform pdf [5].

The three-layer MLP has 8 inputs, 10 hidden units and 7

outputs. All the algorithms are trained for 200 iterations.

From the simulation results of Fig. 3, the new HWO has

the fastest convergence speed in terms of both iteration

number and time. The LM algorithm costs much more

time due to its heavy computational load. As in any

location of quadratic error surface, optimal learning rates

will be superior to heuristic ones, we also tried optimal Z in

our experiments. However, as the global error function

actually is not a quadratic function of Z, we found that

using optimal learning factors [12] in every iteration does

not result in better performance. In the following, we

compare the new HWO with the LBL algorithm [20]. In addition, because the initial MSE in LBL is too large, we use OWO to initialize the output weights and term the result OWO+LBL. In Fig. 4, we present the simulation results for the Twod data. From Fig. 4, we can see that LBL has a larger MSE than the new HWO. When LBL is initialized by OWO, the LBL algorithm does not decrease the MSE any further. From the figure we also see that LBL takes more time per iteration than OWO–HWO. This occurs, in part, because LBL requires three passes through the data per iteration while OWO–HWO requires two, and LBL also requires additional calculations for the learning factors for each pattern.

Training data set 2: OH7.TRA. This remote sensing

training data set [22] contains VV and HH polarizations

at L 30, 40 deg, C 10, 30, 40, 50, 60 degree, and X 30,

40, 50 deg, along with the corresponding unknowns Y = {s, l, m_v}^T, where s is the rms surface height, l is the surface correlation length, and m_v is the volumetric soil moisture content in g/cm³.

There are 20 inputs, 3 outputs, and 7320 training

patterns. We used a 20-20-3 network and trained

the network for 200 iterations for all the algorithms.

The simulation results of Fig. 5 show that the advantage

of the new algorithm over the other algorithms is

overwhelming.

Fig. 3. Simulation results for example 1. Data: Twod.tra, structure 8-10-7: (a) training MSE vs. iteration number; (b) training MSE vs. time.

Training data set 3: F17. This prognostics data set contains 3335 training patterns for onboard flight load synthesis (FLS) in helicopters [3]. In FLS, we estimate

mechanical loads on critical parts, using measurements

available in the cockpit. The accumulated loads can then be

used to determine component retirement times. There are

17 inputs and 9 outputs for each pattern. In this approach,

signals available on an aircraft, such as airspeed, control

attitudes, accelerations, altitude, and rates of pitch, roll,

and yaw, are processed into desired output loads such as

fore/aft cyclic boost tube oscillatory axial load (OAL),

lateral cyclic boost tube OAL, collective boost tube OAL,

main rotor pitch link OAL, etc. This data was obtained

from the M430 flight load level survey conducted in

Mirabel, Canada in early 1995 by Bell Helicopter Textron

of Fort Worth.

We chose the MLP structure 17-10-9 and trained the

network for 200 iterations with all the algorithms. The

simulation results of Fig. 6 show that the new algorithm

outperforms the other algorithms again.

We have shown that the new HWO performs better than

the original OWO–HWO and LM methods. The adaptive

learning factor also gave better results than the optimal

learning factors.

5. Discussion

Evaluation of the performance of an algorithm should

not just be based on the training errors. We also need to

consider the testing error [4,11]. In the following, we

evaluate the training algorithms using testing data sets

disjoint from the training sets. The numbers of testing

patterns are 1000 for twod, 3133 for oh7, and 1410 for f17.

For each data set, we present both the training MSE and

testing MSE for every training algorithm in Table 1. From

the numerical results in Table 1, it is obvious that the new

HWO algorithm has the fastest convergence speed and the

best generalization ability.

Fig. 4. Comparison of the algorithms for Twod.tra, structure 8-10-7: (a) training MSE vs. iteration number; (b) training MSE vs. time.

Fig. 5. Simulation results for example 2. Data: oh7.tra, structure 20-20-3: (a) training MSE vs. iteration number; (b) training MSE vs. time.

Table 1. Training and testing MSEs of the algorithms on the three data sets.

This good generalization is due to the facts that (1) the networks are small enough to prevent memorization, (2) the training and testing sets in each experiment are generated by the same random process, and (3) the new algorithm is very effective at

reducing the training MSE. The increased speed of

convergence for the new algorithm is due to the facts that

(1) linear equations are solved for output weights and

hidden unit weight changes, (2) very few passes through the

data are required per iteration, and (3) the error function

places little emphasis on weights feeding into saturated

hidden units.
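To illustrate point (1) above — once the hidden unit activations are fixed, the output weights solve a linear least-squares problem in one pass — here is a minimal sketch with hypothetical toy sizes and random data, not the authors' code or networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: Nv patterns, N inputs, Nh hidden units, M outputs.
Nv, N, Nh, M = 200, 8, 10, 3
X = rng.normal(size=(Nv, N))          # input patterns x_p
T = rng.normal(size=(Nv, M))          # desired outputs t_p
Wh = 0.5 * rng.normal(size=(Nh, N))   # hidden weights, held fixed here

O = 1.0 / (1.0 + np.exp(-X @ Wh.T))   # sigmoid hidden unit activations

# Output-weight optimization: with hidden activations fixed (and raw inputs
# available through bypass connections), the output weights minimize the MSE
# via a single linear least-squares solve -- no gradient iterations.
A = np.hstack([X, O])                 # design matrix: inputs plus hidden outputs
Wo, *_ = np.linalg.lstsq(A, T, rcond=None)

Y = A @ Wo                            # network outputs
mse = np.mean((T - Y) ** 2)
```

Because the solve is exact, the residual T − Y is orthogonal to the design matrix columns, which is a quick way to check the fit.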

6. Conclusions

In order to accelerate convergence of the OWO–HWO algorithm and to reduce the number of heuristics, this paper proposes several improvements. The net functions evolve in a new, more nearly optimal direction. The weighted hidden layer error function is derived directly from the global MSE. The hidden layer learning factor now adapts according to the local shape of the error surface. All of these techniques successfully increase the training speed.

In addition, when tested on several data sets, the new HWO algorithm consistently outperforms the other algorithms. However, some issues need further investigation. It is unclear whether it is necessary to update the net function in the optimal direction during the first few iterations. We also need to design an adaptation law for the learning factor that avoids the heuristic parameter a. Another issue not attacked here is the proper choice of initial hidden weights.

Acknowledgments

This work was supported by the Advanced Technology Program of the state of Texas, under Grant no. 003656-0129-2001.

Appendix

In the following, we present the derivation of the optimal learning factor [12] for the hidden weight optimization.

To calculate $\eta$, the error function $E$ is rewritten as
$$
E = \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \left[ t_{pi} - \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} - \sum_{k=1}^{N_h} w_{oh}(i,k)\, f\!\left( \sum_{n=1}^{N+1} \bigl( w_h(k,n) + \eta \cdot e(k,n) \bigr)\, x_{pn} \right) \right]^2 . \tag{A.1}
$$

With $W$ denoting the hidden weights and $e$ the hidden weight changes, in each iteration the optimal learning rate $\eta$ used in HWO is found by solving
$$
\frac{\mathrm{d} E(W + \eta \cdot e)}{\mathrm{d} \eta} = 0 . \tag{A.2}
$$
The hidden weights are then updated as $W \leftarrow W + \eta \cdot e$.

[Figure: two panels showing training MSE curves for OWO–HWO, new HWO, LM, and new HWO + optimal LF.]

Fig. 6. Simulation results for example 3, data F17.dat, structure 17-10-9: (a) training MSE vs. iteration number; (b) training MSE vs. training time in seconds.

Table 1
Training and testing errors for each training algorithm

                       LM         OWO–HWO    New HWO    OWO+LBL    Layer-by-layer   Optimal LF
Twod  Training MSE     0.153103   0.140873   0.132289   0.353854   0.640023         0.191480
      Testing MSE      0.171514   0.149181   0.147620   0.391923   0.702386         0.210467
OH7   Training MSE     1.372671   1.392534   1.224596   2.3586     3.60798          1.464110
      Testing MSE      1.595475   1.562335   1.554806   2.39681    3.70564          1.610119
F17   Training MSE     0.749289   0.892191   0.716455   2.88507    3.15433          0.892393
      Testing MSE      0.821818   0.951841   0.753218   2.93159    3.16477          0.930940


As it is hard to solve (A.2) directly, we obtain $\eta$ from a Taylor series expansion of $E(W + \eta \cdot e)$. A third-degree expansion of $E(W + \eta \cdot e)$ is
$$
E(W + \eta \cdot e) \simeq A_0 + A_1 \eta + A_2 \eta^2 + A_3 \eta^3 . \tag{A.3}
$$
In Eq. (A.3), $A_0$, $A_1$, $A_2$, and $A_3$ are the Taylor series coefficients. They are obtained as follows:
$$
A_0 = E(W + \eta \cdot e)\big|_{\eta=0}, \quad
A_1 = \frac{\partial E(W + \eta \cdot e)}{\partial \eta}\bigg|_{\eta=0}, \quad
A_2 = \frac{1}{2!}\,\frac{\partial^2 E(W + \eta \cdot e)}{\partial \eta^2}\bigg|_{\eta=0}, \quad
A_3 = \frac{1}{3!}\,\frac{\partial^3 E(W + \eta \cdot e)}{\partial \eta^3}\bigg|_{\eta=0}. \tag{A.4}
$$
Details for obtaining these coefficients are given in the following. Depending on whether the second-degree or the third-degree approximation of the error is used, the optimal learning rate $\eta_{OL}$ can be obtained from Eq. (A.2) either as
$$
\eta_{OL} = \frac{-A_1}{2 A_2}, \tag{A.5}
$$
or
$$
\eta_{OL} = \frac{-A_2 \pm \sqrt{A_2^2 - 3 A_1 A_3}}{3 A_3}. \tag{A.6}
$$
From Eqs. (A.3) and (A.6), the second derivative of $E(\eta)$ is expressed as
$$
\frac{\partial^2 E}{\partial \eta^2} = 2 A_2 + 6 A_3 \eta = \pm 2 \sqrt{A_2^2 - 3 A_1 A_3}. \tag{A.7}
$$
When $E(\eta)$ is at the local minimum, $\partial^2 E(\eta)/\partial \eta^2$ should be greater than zero. So the resulting solution for $\eta_{OL}$ in Eq. (A.6) is
$$
\eta_{OL} = \frac{-A_2 + \sqrt{A_2^2 - 3 A_1 A_3}}{3 A_3}. \tag{A.8}
$$
Whenever the argument of the square root in Eq. (A.8) is negative, we use the solution from Eq. (A.5). Since the correctly obtained direction vector should point toward the trough of the error function, the optimal learning rate should be greater than zero when we obtain the direction vector correctly. So when the solution from either Eq. (A.5) or (A.6) turns out to be negative, we use $-g$ as a new direction vector and obtain the optimal learning rate again from either Eq. (A.5) or (A.8).

In order to find $A_0$, $A_1$, $A_2$, and $A_3$, we have to calculate the derivatives in (A.4). From the global MSE definition in (2.7), we have
$$
\frac{\partial E}{\partial \eta} = -\frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{k=1}^{M} \left( t_{pk} - y_{pk} \right) \frac{\partial y_{pk}}{\partial \eta}, \tag{A.9}
$$
$$
\frac{\partial^2 E}{\partial \eta^2} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{k=1}^{M} \left[ \left( \frac{\partial y_{pk}}{\partial \eta} \right)^2 - \left( t_{pk} - y_{pk} \right) \frac{\partial^2 y_{pk}}{\partial \eta^2} \right], \tag{A.10}
$$
$$
\frac{\partial^3 E}{\partial \eta^3} = \frac{2}{N_v} \sum_{p=1}^{N_v} \sum_{k=1}^{M} \left[ 3\, \frac{\partial y_{pk}}{\partial \eta}\, \frac{\partial^2 y_{pk}}{\partial \eta^2} - \left( t_{pk} - y_{pk} \right) \frac{\partial^3 y_{pk}}{\partial \eta^3} \right]. \tag{A.11}
$$
And the output $y_{pk}$ in terms of $\eta$ is
$$
y_{pk} = \sum_{i=1}^{N+1} w_{oi}(k,i)\, x_{pi} + \sum_{j=1}^{N_h} w_{oh}(k,j)\, f\!\left( net_{pj} + \eta \cdot Net_{pj} \right), \tag{A.12}
$$
where $Net_{pj} = \sum_{n=1}^{N+1} e(j,n)\, x_{pn}$. So
$$
\frac{\partial y_{pk}}{\partial \eta} = \sum_{j=1}^{N_h} w_{oh}(k,j)\, f (1 - f)\, Net_{pj}, \tag{A.13}
$$
where $f = f(net_{pj} + \eta \cdot Net_{pj})$,
$$
\frac{\partial^2 y_{pk}}{\partial \eta^2} = \sum_{j=1}^{N_h} w_{oh}(k,j)\, Net_{pj}^2\, f (1 - f)(1 - 2f), \tag{A.14}
$$
$$
\frac{\partial^3 y_{pk}}{\partial \eta^3} = \sum_{j=1}^{N_h} w_{oh}(k,j)\, Net_{pj}^3\, f (1 - f)\left( 1 - 6f + 6f^2 \right). \tag{A.15}
$$
Substituting (A.13)–(A.15) into (A.9)–(A.11), $A_0$, $A_1$, $A_2$, and $A_3$ can be calculated from (A.4). The optimal learning rate can then be found from (A.5) or (A.8).
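The chain (A.9)–(A.15) can be exercised end-to-end on a toy network. In the sketch below, all sizes, weights, and array names are illustrative (random data, not the paper's networks); it evaluates $A_1$, $A_2$, $A_3$ at $\eta = 0$ and forms the learning factor from (A.8) with the (A.5) fallback. The coefficients can be validated against finite differences of $E(\eta)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy sizes: Nv patterns, N inputs, Nh hidden units, M outputs.
Nv, N, Nh, M = 20, 4, 5, 2
Xb = np.hstack([rng.normal(size=(Nv, N)), np.ones((Nv, 1))])  # inputs plus bias, x_pn
T = rng.normal(size=(Nv, M))                                  # targets t_pk
Wh = 0.3 * rng.normal(size=(Nh, N + 1))                       # hidden weights W
e = 0.1 * rng.normal(size=(Nh, N + 1))                        # hidden weight changes e(j,n)
Woi = 0.3 * rng.normal(size=(M, N + 1))                       # bypass weights w_oi(k,i)
Woh = 0.3 * rng.normal(size=(M, Nh))                          # output weights w_oh(k,j)

net = Xb @ Wh.T                                               # net_pj
Net = Xb @ e.T                                                # Net_pj, Eq. (A.12)

def E(eta):
    """Global MSE E(W + eta*e) as a function of the learning factor, via Eq. (A.12)."""
    f = 1.0 / (1.0 + np.exp(-(net + eta * Net)))
    Y = Xb @ Woi.T + f @ Woh.T
    return np.sum((T - Y) ** 2) / Nv

# Analytic derivatives of y_pk at eta = 0, Eqs. (A.13)-(A.15).
f = 1.0 / (1.0 + np.exp(-net))
dy1 = (f * (1.0 - f) * Net) @ Woh.T
dy2 = (f * (1.0 - f) * (1.0 - 2.0 * f) * Net**2) @ Woh.T
dy3 = (f * (1.0 - f) * (1.0 - 6.0 * f + 6.0 * f**2) * Net**3) @ Woh.T
R = T - (Xb @ Woi.T + f @ Woh.T)                              # residuals t_pk - y_pk

# Eqs. (A.9)-(A.11) give E', E'', E'''; Eq. (A.4) scales them into A1, A2, A3.
A1 = -(2.0 / Nv) * np.sum(R * dy1)
A2 = 0.5 * (2.0 / Nv) * np.sum(dy1**2 - R * dy2)
A3 = (1.0 / 6.0) * (2.0 / Nv) * np.sum(3.0 * dy1 * dy2 - R * dy3)

# Learning factor from Eq. (A.8), falling back to Eq. (A.5) when needed.
disc = A2**2 - 3.0 * A1 * A3
eta_OL = (-A2 + np.sqrt(disc)) / (3.0 * A3) if disc >= 0.0 else -A1 / (2.0 * A2)
```

Central finite differences of E(η) at η = 0 should reproduce A1 and 2·A2, which confirms that the analytic expressions (A.13)–(A.15) were substituted correctly.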

References

[1] S.A. Barton, A matrix method for optimizing a neural network,

Neural Computation 3 (3) (1991) 450–459.

[2] R. Battiti, First- and second-order methods for learning: between steepest descent and Newton's method, Neural Computation 4 (2) (1992) 141–166.

[3] H.-H. Chen, M.T. Manry, H. Chandrasekaran, A neural network training algorithm utilizing multiple sets of linear equations, Neurocomputing 25 (1–3) (1999) 55–72.

[4] Y. LeCun, Generalization and network design strategies, in: Proceedings of Connectionism in Perspective, 1989.

[5] M.S. Dawson, A.K. Fung, M.T. Manry, Surface parameter retrieval

using fast learning neural networks, Remote Sensing Reviews 7 (1)

(1993) 1–18.

[6] R. Fletcher, Practical Methods of Optimization, Wiley, New York,

1987.

[7] M.H. Fun, M.T. Hagan, Levenberg-Marquardt training for modular

networks, The 1996 IEEE International Conference on Neural

Networks, vol. 1, 1996, pp. 468–473.

[8] C.M. Hadzer, R. Hasan, et al., Improved singular value decomposition by using neural networks, IEEE International Conference on Neural Networks, vol. 1, 1995, pp. 438–442.

[9] M.T. Hagan, M.B. Menhaj, Training feedforward networks with the

Marquardt algorithm, IEEE Transaction on Neural Networks 5 (6)

(1994) 989–993.

[10] M.T. Hagan, H.B. Demuth, M.H. Beale, Neural Network Design, PWS Publishing Company, 1996.
[11] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, 1999.

[12] T.H. Kim, M.T. Manry, F.J. Maldonado, New learning factor and testing methods for conjugate gradient training algorithm,



IJCNN’03. International Joint Conference on Neural Networks,

2003, pp. 2011–2016.

[13] H.-M. Lee, C.-M. Chen, T.-C. Huang, Learning efficiency improvement of back-propagation algorithm by error saturation prevention method, Neurocomputing 41 (2001) 125–143.

[14] Y. Lee, S.-H. Oh, M.W. Kim, An analysis of premature saturation in

back-propagation learning, Neural Networks 6 (1993) 719–728.

[15] G.D. Magoulas, M.N. Vrahatis, G.S. Androulakis, Improving the

convergence of the backpropagation algorithm using learning

adaptation Methods, Neural Computation 11 (1999) 1769–1796.

[16] M.T. Manry, H. Chandrasekaran, C.-H. Hsieh, Signal Processing Using the Multilayer Perceptron, CRC Press, Boca Raton, 2001.

[17] M.T. Manry, et al., Fast training of neural networks for remote

sensing, Remote Sensing Reviews 9 (1994) 77–96.

[18] T. Masters, Neural, Novel & Hybrid Algorithms for Time Series

Prediction, Wiley, New York, 1995.

[19] S.-H. Oh, Improving the error back-propagation algorithm with a

modified error function, IEEE Transactions on Neural Networks 8

(3) (1997) 799–803.

[20] S.-H. Oh, S.-Y. Lee, A new error function at hidden layers for fast

training of multilayer perceptrons, IEEE Transactions on Neural

Networks 10 (1999) 960–964.

[21] S.-H. Oh, S.-Y. Lee, Optimal learning rates for each pattern and

neuron in gradient descent training of multilayer perceptrons,

IJCNN'99. International Joint Conference on Neural Networks, vol. 3, 1999, pp. 1635–1638.

[22] Y. Oh, K. Sarabandi, F.T. Ulaby, An empirical model and an

inversion technique for radar scattering from bare soil surfaces, IEEE

Transactions on Geoscience and Remote Sensing 30 (2) (1992)

370–381.

[23] V. Ooyen, B. Nienhuis, Improving the convergence of the back-

propagation algorithm, Neural Networks 5 (1992) 465–471.

[24] W.H. Press, et al., Numerical Recipes, Cambridge University Press,

New York, 1986.

[25] K. Rohani, M.S. Chen, M.T. Manry, Neural subset design by direct

polynomial mapping, IEEE Transactions on Neural Networks 3 (6)

(1992) 1024–1026.

[26] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal

representation by error propagation, Parallel Distributed Processing

1 (1986) 318–362.

[27] R.S. Scalero, N. Tepedelenlioglu, A fast new algorithm for training feedforward neural networks, IEEE Transactions on Signal Processing 40 (1) (1992) 202–210.

[28] G.-J. Wang, C.-C. Chen, A fast multilayer neural-network training

algorithm based on the layer-by-layer optimizing procedures, IEEE

Transactions on Neural Networks 7 (3) (1996) 768–775.

[29] P.J. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Harvard University, Cambridge, MA, 1974.

[30] J.Y.F. Yam, T.W.S. Chow, A weight initialization method for

improving training speed in feedforward neural network, Neurocom-

puting 30 (2000) 219–232.

Changhua Yu received his B.S. in 1995 from Huazhong University of Science and Technology, Wuhan, China, and his M.Sc. in 1998 from Shanghai Jiaotong University, Shanghai, China. He received his Ph.D. degree in Electrical Engineering from the University of Texas at Arlington in 2004. He joined the Neural Networks and Image Processing Lab in the EE department as a research assistant in 2001. His main research interests include neural networks, image processing and pattern recognition. Currently, Dr. Yu works at FastVDO LLC in Columbia, Maryland.

Michael T. Manry was born in Houston, Texas in 1949. He received the B.S., M.S., and Ph.D. in Electrical Engineering in 1971, 1973, and 1976, respectively, from The University of Texas

at Austin. After working there for two years

as an Assistant Professor, he joined Schlumberger

Well Services in Houston where he developed

signal processing algorithms for magnetic resonance well logging and sonic well logging. He joined the Department of Electrical Engineering at the University of Texas at Arlington in 1982, and has held

the rank of Professor since 1993. In summer 1989, Dr. Manry developed

neural networks for the Image Processing Laboratory of Texas

Instruments in Dallas. His recent work, sponsored by the Advanced

Technology Program of the state of Texas, E-Systems, Mobil Research,

and NASA has involved the development of techniques for the analysis

and fast design of neural networks for image processing, parameter

estimation, and pattern classification. Dr. Manry has served as a

consultant for the Office of Missile Electronic Warfare at White Sands

Missile Range, MICOM (Missile Command) at Redstone Arsenal, NSF,

Texas Instruments, Geophysics International, Halliburton Logging

Services, Mobil Research and Verity Instruments. He is a Senior Member

of the IEEE.


Jiang Li received his B.S. in Electrical Engineering

from Shanghai Jiaotong University, China in

1992, his M.S. in Automation from Tsinghua

University, China in 2000. He received his Ph.D. degree in Electrical Engineering from the University of Texas at Arlington in 2004. His research interests focus on neural networks, wireless communication, pattern recognition and signal processing. Currently, Dr. Li works at the National Institutes of Health in Bethesda, Maryland.

Pramod Lakshmi Narasimha received his B.E. in

Telecommunications Engineering from Bangalore

University, India. He joined the Neural Networks

and Image Processing Lab in the EE department,

UTA as a research assistant in 2002. In 2003, he

received his M.S. degree in Electrical Engineering

at the University of Texas at Arlington. Currently, he is working on his Ph.D. His research

interests focus on neural networks, pattern

recognition, image and signal processing.
