Neurocomputing 70 (2006) 525–535
An efficient hidden layer training method for the multilayer perceptron
Changhua Yua,?, Michael T. Manryb, Jiang Lic, Pramod Lakshmi Narasimhab
aFastVDO LLC, Columbia, MD 21046, USA
bDepartment of Electrical Engineering, University of Texas at Arlington, TX 76019
cDepartment of Radiology, Clinical Center National Institutes of Health, Bethesda, MD 20892, USA
Received 17 December 2003; received in revised form 22 November 2005; accepted 23 November 2005
Communicated by S.Y. Lee
Available online 18 May 2006
Abstract
The output-weight-optimization and hidden-weight-optimization (OWO–HWO) training algorithm for the multilayer perceptron alternately solves linear equations for output weights and reduces a separate hidden layer error function with respect to the hidden layer weights. Here, three major improvements are made to OWO–HWO. First, a desired net function is derived. Second, using the classical mean square error, a weighted hidden layer error function is derived which de-emphasizes net function errors that correspond to saturated activation function values. Third, an adaptive learning factor based on the local shape of the error surface is used in hidden layer training. Faster learning convergence is experimentally verified, using three training data sets.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Hidden weight optimization (HWO); Convergence; Hidden layer error function; Saturation; Adaptive learning factor
1. Introduction
Back propagation (BP) [10,11,26,29] was the first
effective training algorithm for the multilayer perceptron
(MLP). Subsequently, the MLP has been widely applied in
the fields of signal processing [16], remote sensing [17], and
pattern recognition [3]. However, the inherent slow
convergence of BP has delayed the adoption of the MLP
by much of the signal processing community.
During the last two decades, many improvements to BP
have been made. One reason for BP’s slow convergence is the
saturation occurring in some of the nonlinear units. When
hidden units become saturated, the error function reaches a
local minimum early in the training stage. In [14,30], proper
weight initializations are used to avoid premature saturation.
Lee et al. [14] derived the probability of incorrect saturation
for the case of binary training patterns. They concluded that
the saturation probability is a function of the maximum
value of the initial weights, the number of units in each layer, and the slope of the sigmoid function. Yam and Chow [30] proposed to initialize the weights to ensure that the outputs
of the hidden units are in the active region of the sigmoid
function. Various techniques, such as the error saturation prevention function in [13] and the cross-entropy error function [23] and its extension [19], have been proposed to prevent the units from premature saturation.
In [2], Battiti reviewed first and second order algorithms
for learning in neural networks. First order methods are
fast and effective for large-scale problems, while second order techniques have higher precision. The Levenberg–Marquardt (LM) [6,9,17] method is a well-known second order algorithm with better convergence properties [2] than conventional BP. Unfortunately, it requires $O(N_w^2)$ storage and calculations of order $O(N_w^3)$, where $N_w$ is the total number of weights in an MLP [18]. Hence the LM method is impractical for all but small networks.
Many investigators have trained the different MLP
layers separately. In layerbylayer (LBL) training [20,28],
the output weights are first updated. Then the corresponding desired outputs for the hidden units are found. Finally,
the hidden weights are found by minimizing the difference
between the actual net function and a desired net function.
In [28], for the current training iteration, the optimal
output and hidden weights are found by solving sets of
0925-2312/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2005.11.008
⁎Corresponding author. Tel.: +1 817 272 3469; fax: +1 817 272 3483.
E-mail addresses: ychmailuta@yahoo.com (C. Yu), manry@uta.edu (M.T. Manry).
linear equations using pseudoinverse matrices. In [20] by
contrast, the output and hidden weights are updated in the
gradient directions with optimal learning factors. This
strategy makes each step in [20] similar to steepest descent.
Some researchers have developed fast training techniques
by solving sets of linear equations [1,3,16,25]. When output
units have linear activation functions, linear equations can
be solved for the output weights. For example, the output
weight optimization (OWO) algorithm [1,17] has been
successfully used to minimize the MSE by solving linear
equations. In [17], the MLP is trained by the so-called output weight optimization–back propagation (OWO–BP) algorithm, where the output weights are found by solving linear equations and the hidden weights are updated by BP.
Scalero and Tepedelenlioglu [27] developed a non-batch training approach for feedforward neural networks in which separate error functions are minimized for each hidden unit. This idea greatly improved training efficiency. However, they did not use OWO to solve for the output weights. Using ideas from [27], Chen [3] constructed a batch mode training algorithm called output weight optimization–hidden weight optimization (OWO–HWO). In OWO–HWO, output weights
and hidden unit weights are alternately modified to reduce
training error. The algorithm modifies the hidden weights
based on the minimization of the MSE between the desired
and the actual net function, as originally proposed in [27].
Although OWO–HWO increases training speed, it still has room for improvement [28] because it uses the delta function as the desired net function change. This makes HWO similar to steepest descent, except that net functions are varied instead of weights. In addition, HWO is equivalent to BP applied to the hidden weights under certain conditions [3].
In this paper, we develop some improvements to
OWO–HWO. In Section 2, we describe the notation used
in this paper and review the OWO–HWO algorithm. In
Section 3, we introduce a new desired net function change,
which updates the net function in the optimal gradient
direction. We derive a weighted hidden layer error
function, first used in [20], directly from the global MSE.
After that, we construct an adaptation law for the hidden
layer learning factor. The convergence of this method is
shown. In Section 4, by running on several different
training data sets, we compare the new HWO with the
original OWO–HWO, and other training algorithms.
Finally, Section 5 concludes this paper.
2. Review of OWO–HWO
In this section, we describe the structure and notation of a fully connected MLP and then review the OWO–HWO algorithm.
2.1. Structure and notation of a fully connected MLP
Without loss of generality, we restrict ourselves to a three-layer fully connected MLP, which is commonly used.
The structure of the MLP is shown in Fig. 1. Bypass
weights from input layer to output layer are used, but are
not shown in the figure for clarity. These connections allow
the network to easily model the linear component of the
desired mapping, and reduce the number of required
hidden units. The training data set consists of $N_v$ training patterns $\{(\mathbf{x}_p, \mathbf{t}_p)\}$, where the pth input vector $\mathbf{x}_p$ and the pth desired output vector $\mathbf{t}_p$ have dimensions N and M, respectively. Thresholds in the hidden layer are handled by letting $x_{p,(N+1)} = 1$.
For the kth hidden unit, the net input $net_{pk}$ and the output activation $O_{pk}$ for the pth training pattern are

$$net_{pk} = \sum_{n=1}^{N+1} w_{hi}(k,n)\, x_{pn}, \qquad O_{pk} = f(net_{pk}), \tag{2.1}$$

where $x_{pn}$ denotes the nth element of $\mathbf{x}_p$ and $w_{hi}(k,n)$ denotes the weight connecting the nth input unit to the kth hidden unit. $N_h$ is the number of hidden units. The activation function f is sigmoidal:

$$f(net_{pk}) = \frac{1}{1 + e^{-net_{pk}}}. \tag{2.2}$$
The ith output $y_{pi}$ for the pth training pattern is

$$y_{pi} = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} + \sum_{k=1}^{N_h} w_{oh}(i,k)\, O_{pk}. \tag{2.3}$$
For convenience, in the OWO procedure we augment the input vector as

$$\tilde{x}_{pn} = \begin{cases} x_{pn}, & n = 1, 2, \ldots, N, \\ 1, & n = N+1, \\ O_{p,(n-N-1)}, & n = N+2, \ldots, N+N_h+1. \end{cases} \tag{2.4}$$
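As an illustration, the forward pass defined by (2.1)–(2.4) can be sketched in a few lines of NumPy. The function and variable names below are ours, not the paper's:

```python
import numpy as np

# Assumed shapes: N inputs, N_h hidden units, M outputs.
# W_hi : (N_h, N+1) hidden weights, W_oi : (M, N+1) bypass/input weights,
# W_oh : (M, N_h) output weights. x_aug carries the threshold x_{p,N+1} = 1.

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))           # Eq. (2.2)

def forward(x_aug, W_hi, W_oi, W_oh):
    """One pattern: x_aug is the (N+1,) input vector with trailing 1."""
    net = W_hi @ x_aug                          # Eq. (2.1): net_pk
    O = sigmoid(net)                            # hidden activations O_pk
    y = W_oi @ x_aug + W_oh @ O                 # Eq. (2.3): bypass + hidden terms
    x_tilde = np.concatenate([x_aug, O])        # Eq. (2.4): augmented input
    return net, O, y, x_tilde
```

The bypass term `W_oi @ x_aug` is what lets the linear part of the mapping skip the hidden layer entirely, as described above.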
Fig. 1. MLP structure for one hidden layer.
Using the augmented input vector in (2.4), (2.3) is rewritten as

$$y_{pi} = \sum_{n=1}^{L} w_o(i,n)\, \tilde{x}_{pn}, \tag{2.5}$$

where the weights $w_o(i,n) = w_{oi}(i,n)$ for $1 \le n \le N+1$ and $w_o(i,n) = w_{oh}(i, n-N-1)$ for $N+2 \le n \le N+N_h+1 = L$.

For each training pattern, the training error is

$$E_p = \sum_{i=1}^{M} \bigl[t_{pi} - y_{pi}\bigr]^2, \tag{2.6}$$

where $t_{pi}$ denotes the ith element of the pth desired output vector. In batch mode training, the overall performance of a feedforward network, measured as the mean square error (MSE), can be written as

$$E = \frac{1}{N_v} \sum_{p=1}^{N_v} E_p = \frac{1}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} \bigl[t_{pi} - y_{pi}\bigr]^2. \tag{2.7}$$
2.2. Output weight optimization
As the output units have linear activation functions, the
OWO procedure, for finding weights wo(i,n) in (2.5), can be
realized by solving linear equations [1,3,17], which result
when gradients of E with respect to the output weights are
set to zero. These equations are
$$\sum_{n=1}^{L} w_o(i,n)\, R(n,m) = C(i,m), \tag{2.8}$$

where C is the cross-correlation matrix with elements

$$C(i,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} t_{pi}\, \tilde{x}_{pm}, \tag{2.9}$$

and R is the autocorrelation matrix defined as

$$R(n,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} \tilde{x}_{pn}\, \tilde{x}_{pm}. \tag{2.10}$$
Since these equations are often ill-conditioned, meaning that the determinant of R is close to 0, it is often unsafe to use Gauss–Jordan elimination. The singular value decomposition (SVD) [8], LU decomposition (LUD) [24], and conjugate gradient (CG) [6] approaches are better.
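Concretely, the OWO step amounts to forming R and C from the training data and solving the linear system (2.8). The sketch below, with our own names, uses NumPy's SVD-based least-squares solver in the spirit of the SVD option just mentioned:

```python
import numpy as np

def owo_output_weights(X_tilde, T):
    """Solve Eq. (2.8) for all output weights at once.
    X_tilde : (Nv, L) augmented inputs of Eq. (2.4); T : (Nv, M) targets."""
    Nv = X_tilde.shape[0]
    R = (X_tilde.T @ X_tilde) / Nv       # Eq. (2.10): autocorrelation, (L, L)
    C = (T.T @ X_tilde) / Nv             # Eq. (2.9): cross-correlation, (M, L)
    # R is often ill-conditioned, so an SVD-based least-squares solve is
    # safer than plain Gaussian elimination.
    Wo, *_ = np.linalg.lstsq(R, C.T, rcond=None)
    return Wo.T                          # (M, L): row i holds w_o(i, .)
```

When the targets really are a linear function of the augmented inputs, this step recovers the generating weights exactly, which is a useful sanity check for an implementation.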
2.3. Hidden weight optimization
Hidden weight optimization is a full batch algorithm developed from the ideas in [27]. In HWO, the hidden weights are updated by minimizing separate error functions for each hidden unit. The error functions measure the difference between the desired and the actual net function. For the kth hidden unit and pth pattern, the desired net function is constructed as [3]
$$net_{pkd} \cong net_{pk} + Z \cdot \delta_{pk}, \tag{2.11}$$

where $net_{pkd}$ is the desired net function and $net_{pk}$ is the actual one in (2.1). Z is the learning factor and $\delta_{opi}$ is the delta function [3,11,26] of the ith output:

$$\delta_{opi} = -\frac{\partial E_p}{\partial y_{pi}} = \bigl[t_{pi} - y_{pi}\bigr]. \tag{2.12}$$

The delta function for the kth hidden unit [2,3,7] is

$$\delta_{pk} = -\frac{\partial E_p}{\partial net_{pk}} = f'(net_{pk}) \sum_{i=1}^{M} w_{oh}(i,k)\, \delta_{opi}. \tag{2.13}$$
The hidden weights are to be updated as

$$w_{hi}(k,n) \leftarrow w_{hi}(k,n) + Z \cdot e(k,n), \tag{2.14}$$

where e(k,n) is the weight change. The weight changes are derived using

$$net_{pk} + Z \cdot \delta_{pk} \cong \sum_{n=1}^{N+1} \bigl[w_{hi}(k,n) + Z \cdot e(k,n)\bigr]\, x_{pn}. \tag{2.15}$$

Therefore,

$$\delta_{pk} \cong \sum_{n=1}^{N+1} e(k,n)\, x_{pn}. \tag{2.16}$$
The error of (2.15) and (2.16) for the kth hidden unit is measured as

$$E_d(k) = \frac{1}{N_v} \sum_{p=1}^{N_v} \Bigl[\delta_{pk} - \sum_{n=1}^{N+1} e(k,n)\, x_{pn}\Bigr]^2. \tag{2.17}$$

Equating the gradient of $E_d(k)$ with respect to the hidden weight change to zero, we have

$$\sum_{n=1}^{N+1} e(k,n)\, R(n,m) = C_d(k,m), \tag{2.18}$$

where

$$C_d(k,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} \delta_{pk}\, x_{pm} = -\frac{\partial E}{\partial w_{hi}(k,m)}. \tag{2.19}$$
The hidden weight change e(k, n) can be found from (2.18)
by using the conjugate gradient method. After finding the
learning factor Z, the hidden weights are updated as in
(2.14).
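A minimal sketch of this HWO step, assuming the delta functions of (2.13) have already been computed; names are ours, and a least-squares solver stands in for the paper's conjugate gradient routine:

```python
import numpy as np

def hwo_weight_changes(X, delta):
    """Solve Eq. (2.18) for the hidden weight changes of every unit.
    X : (Nv, N+1) inputs with threshold column; delta : (Nv, N_h) delta
    functions delta_pk of Eq. (2.13)."""
    Nv = X.shape[0]
    R = (X.T @ X) / Nv            # Eq. (2.10) restricted to the inputs
    Cd = (delta.T @ X) / Nv       # Eq. (2.19): one row per hidden unit
    # One (N+1)-equation system per hidden unit; since they share the same
    # R here, all units can be solved jointly in one call.
    E, *_ = np.linalg.lstsq(R, Cd.T, rcond=None)
    return E.T                    # (N_h, N+1): changes e(k, n)
```

If the deltas happen to be an exact linear function of the inputs, the normal equations recover that linear map, which makes the routine easy to unit-test.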
3. Enhancement of OWO–HWO

From (2.11) and (2.13), we can see that the net functions are updated in the gradient direction. It is well known that optimizing in the gradient direction, as in steepest descent, is very slow. In addition, using $E_d(k)$ in (2.17) ignores the effects of activation function saturation when $|net_{pk}|$ is large, as was pointed out in [15] for the BP algorithm. In this section, we update the net functions along new directions that tend to minimize the total training error, and derive the corresponding new hidden layer error functions, which take saturation into account.
Author's personal copy
3.1. The optimal direction of desired net function change

As mentioned in (2.11), for the original HWO algorithm, the desired net function change is $Z \cdot \delta_{pj}$, where the delta function is the negative gradient of E with respect to the current net function $net_{pj}$. This strategy of updating net functions using their gradient directions results in slow training for the hidden units.

In this section, we introduce a new desired net function $net_{pjd}$:

$$net_{pjd} = net_{pj} + Z \cdot \Delta net^{*}_{pj}, \tag{3.1}$$

where $\Delta net^{*}_{pj}$, written as

$$\Delta net^{*}_{pj} = net^{*}_{pj} - net_{pj}, \tag{3.2}$$

is the difference between the current net function $net_{pj}$ and an optimal value $net^{*}_{pj}$. Now the net function approaches the optimal value $net^{*}_{pj}$, instead of moving in the negative gradient direction. The current task for constructing the new HWO algorithm is to find $\Delta net^{*}_{pj}$.

Using a Taylor series and Eqs. (2.1), (3.2), the corresponding hidden layer output $O^{*}_{pj}$ caused by $net^{*}_{pj}$ can be written as

$$O^{*}_{pj} = f(net^{*}_{pj}) \cong O_{pj} + f'_{pj}\, \Delta net^{*}_{pj}, \tag{3.3}$$

where $f'_{pj}$ is shorthand for the activation derivative $f'(net_{pj})$. Replacing $O_{pj}$ by its optimal value $O^{*}_{pj}$ in (2.3), (2.7) becomes

$$E = \frac{1}{N_v} \sum_{p=1}^{N_v} \sum_{i=1}^{M} \Bigl[t_{pi} - \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} - \sum_{\substack{k=1 \\ k \ne j}}^{N_h} w_{oh}(i,k)\, O_{pk} - w_{oh}(i,j)\, O^{*}_{pj}\Bigr]^2. \tag{3.4}$$

Now $\Delta net^{*}_{pj}$ can be derived based on

$$\left.\frac{\partial E}{\partial net_{pj}}\right|_{net_{pj} = net^{*}_{pj}} = 0. \tag{3.5}$$

Using (3.1) to (3.5) yields

$$\frac{\partial E}{\partial net^{*}_{pj}} = -\frac{2}{N_v} \sum_{i=1}^{M} \Bigl\{\bigl[t_{pi} - y_{pi} - w_{oh}(i,j)\bigl(O^{*}_{pj} - O_{pj}\bigr)\bigr]\, w_{oh}(i,j)\, \frac{\partial O^{*}_{pj}}{\partial net^{*}_{pj}}\Bigr\}. \tag{3.6}$$

Using (3.3) and (2.12), we replace $(O^{*}_{pj} - O_{pj})$ and $(t_{pi} - y_{pi})$ in (3.6) by $f'_{pj}\, \Delta net^{*}_{pj}$ and $\delta_{opi}$, yielding

$$\frac{\partial E}{\partial net^{*}_{pj}} = -\frac{2}{N_v}\, \frac{\partial O^{*}_{pj}}{\partial net^{*}_{pj}} \Bigl[\sum_{i=1}^{M} \delta_{opi}\, w_{oh}(i,j) - f'_{pj}\, \Delta net^{*}_{pj} \sum_{i=1}^{M} w^2_{oh}(i,j)\Bigr]. \tag{3.7}$$

Setting the right-hand side of (3.7) to zero, we can find $\Delta net^{*}_{pj}$ as

$$\Delta net^{*}_{pj} \cong \frac{\sum_{i=1}^{M} \delta_{opi}\, w_{oh}(i,j)}{f'_{pj} \sum_{i=1}^{M} w^2_{oh}(i,j)}. \tag{3.8}$$

Using (2.12) and (2.13), (3.8) becomes

$$\Delta net^{*}_{pj} \cong \frac{\delta_{pj}}{\bigl(f'_{pj}\bigr)^2 \sum_{i=1}^{M} w^2_{oh}(i,j)}. \tag{3.9}$$
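Equation (3.9) is cheap to evaluate in vectorized form. A sketch, under our own naming conventions:

```python
import numpy as np

def optimal_net_change(delta, fprime, W_oh):
    """Eq. (3.9): desired net-function change for every pattern and unit.
    delta : (Nv, N_h) hidden deltas of Eq. (2.13); fprime : (Nv, N_h)
    activation derivatives f'_pk; W_oh : (M, N_h) output weights."""
    # Denominator (f'_pk)^2 * sum_i w_oh(i,k)^2, broadcast over patterns.
    denom = fprime**2 * np.sum(W_oh**2, axis=0)
    return delta / denom          # Delta net*_pk, shape (Nv, N_h)
```

A quick consistency check is that, when the deltas are built from output deltas via (2.13), this expression agrees with the intermediate form (3.8).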
3.2. A weighted hidden error function

If we substitute Eq. (2.3) into (2.7), we can rewrite the total MSE as

$$E = \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \Bigl[t_{pi} - \sum_{k=1}^{N_h} w_{oh}(i,k)\, O_{pk} - \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn}\Bigr]^2. \tag{3.10}$$

During the HWO procedure, if we define $net_{pkd}$ as in (3.1), the corresponding hidden unit output $O_{pkd}$ can be approximated using a Taylor series as

$$O_{pkd} = f(net_{pkd}) \cong O_{pk} + Z \cdot f'_{pk}\, \Delta net^{*}_{pk}. \tag{3.11}$$

However, because the hidden weights are updated as in (2.14), the actual net function we find is

$$\overline{net}_{pk} = \sum_{n=1}^{N+1} \bigl[w_{hi}(k,n) + Z \cdot e(k,n)\bigr] x_{pn} = net_{pk} + Z \sum_{n=1}^{N+1} e(k,n)\, x_{pn}. \tag{3.12}$$

The actual kth hidden unit output after HWO is approximated as

$$\bar{O}_{pk} = f(\overline{net}_{pk}) \cong O_{pk} + Z \cdot f'_{pk} \sum_{n=1}^{N+1} e(k,n)\, x_{pn}. \tag{3.13}$$

If we denote the ith output caused by the inputs and $O_{pkd}$ as

$$T_{pi} = \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} + \sum_{k=1}^{N_h} w_{oh}(i,k)\, O_{pkd}, \tag{3.14}$$

then after HWO, the actual total error can be rewritten as

$$E = \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \Bigl[t_{pi} - T_{pi} + T_{pi} - \sum_{n=1}^{N+1} w_{oi}(i,n)\, x_{pn} - \sum_{k=1}^{N_h} w_{oh}(i,k)\, \bar{O}_{pk}\Bigr]^2 = \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \Bigl\{\bigl[t_{pi} - T_{pi}\bigr] + \sum_{k=1}^{N_h} w_{oh}(i,k)\bigl[O_{pkd} - \bar{O}_{pk}\bigr]\Bigr\}^2. \tag{3.15}$$

If we assume that $[t_{pi} - T_{pi}]$ is uncorrelated with $\bigl[\Delta net^{*}_{pk} - \sum_{n} e(k,n)\, x_{pn}\bigr]$
and use Eqs. (3.11) and (3.13), Eq. (3.15) becomes

$$E \cong \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \bigl[t_{pi} - T_{pi}\bigr]^2 + \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \Bigl\{\sum_{k=1}^{N_h} w_{oh}(i,k)\, f'_{pk}\, Z \Bigl[\Delta net^{*}_{pk} - \sum_{n=1}^{N+1} e(k,n)\, x_{pn}\Bigr]\Bigr\}^2. \tag{3.16}$$

Here Z is the learning factor along the $\Delta net^{*}_{pj}$ direction. We can evaluate (3.16) further because it is reasonable to assume that the values $\bigl\{w_{oh}(i,k)\, f'_{pk}\bigl[\Delta net^{*}_{pk} - \sum_{n} e(k,n)\, x_{pn}\bigr]\bigr\}$ associated with different hidden units are uncorrelated with each other. This assumption yields

$$E \cong \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \bigl[t_{pi} - T_{pi}\bigr]^2 + \frac{1}{N_v} \sum_{p} \sum_{i=1}^{M} \sum_{k=1}^{N_h} w^2_{oh}(i,k)\, Z^2 \bigl(f'_{pk}\bigr)^2 \Bigl[\Delta net^{*}_{pk} - \sum_{n=1}^{N+1} e(k,n)\, x_{pn}\Bigr]^2. \tag{3.17}$$

Since Z, $[t_{pi} - T_{pi}]$ and $w_{oh}(i,k)$ are constant during the HWO procedure, minimizing E is equivalent to minimizing

$$E_d(k) = \frac{1}{N_v} \sum_{p} \bigl(f'_{pk}\bigr)^2 \Bigl[\Delta net^{*}_{pk} - \sum_{n=1}^{N+1} e(k,n)\, x_{pn}\Bigr]^2. \tag{3.18}$$

This hidden layer error function, which was introduced without derivation in [19], successfully de-emphasizes error in saturated hidden units using the square of the derivative of the activation function. The size of the hidden weight update then depends upon whether the hidden net functions are in the linear or saturation regions of the sigmoid function. If the current $net_{pk}$ is in a saturation region, the difference between the desired and actual hidden outputs is small even though the difference between $net_{pk}$ and $net_{pkd}$ is large. In this case, because there is no need to change the associated weights according to the large difference between $net_{pk}$ and $net_{pkd}$, the small value of the $(f'_{pk})^2$ term will decrease the weight change e(k,n). When $net_{pk}$ is in the linear region, the hidden weights will be updated according to the difference between $net_{pk}$ and $net_{pkd}$. That is, the term $(f'_{pk})^2$ de-emphasizes net function error corresponding to saturated activations, while errors between $net_{pk}$ and $net_{pkd}$ for net functions in the sigmoid's linear region receive large weight in (3.18).

The hidden weight change e(k,m) in (3.18) can be found as in Section 2. First, the gradient of the new $E_d(k)$ with respect to the hidden weight change e(k,m) is equated to zero, yielding Eq. (2.18). However, the autocorrelation matrix R and cross-correlation matrix $C_d$ are now found as

$$R(n,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} \bigl[x_{pn}\, x_{pm}\bigr] \bigl(f'_{pk}\bigr)^2 \tag{3.19}$$

and

$$C_d(k,m) = \frac{1}{N_v} \sum_{p=1}^{N_v} \bigl[\Delta net^{*}_{pk}\, x_{pm}\bigr] \bigl(f'_{pk}\bigr)^2. \tag{3.20}$$

Again, we have (N+1) equations in (N+1) unknowns for the kth hidden unit. After finding the learning factor Z, the hidden weights are updated as in (2.14).
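The weighted correlations (3.19)–(3.20) give one linear system per hidden unit, since the weights $(f'_{pk})^2$ differ across units. A sketch with our own names, again using a least-squares solver in place of conjugate gradients:

```python
import numpy as np

def weighted_hwo_changes(X, dnet_star, fprime):
    """Solve Eq. (2.18) with the weighted correlations (3.19)-(3.20).
    X : (Nv, N+1) inputs; dnet_star : (Nv, N_h) optimal net changes from
    Eq. (3.9); fprime : (Nv, N_h) activation derivatives f'_pk."""
    Nv, Nh = dnet_star.shape
    E = np.empty((Nh, X.shape[1]))
    for k in range(Nh):
        w = fprime[:, k] ** 2                  # saturation weights (f'_pk)^2
        Xw = X * w[:, None]
        R = (X.T @ Xw) / Nv                    # Eq. (3.19): weighted autocorrelation
        Cd = (Xw.T @ dnet_star[:, k]) / Nv     # Eq. (3.20): weighted cross-correlation
        E[k] = np.linalg.lstsq(R, Cd, rcond=None)[0]
    return E                                   # (N_h, N+1): changes e(k, n)
```

Note that saturated patterns (small $f'_{pk}$) contribute little to either matrix, which is exactly the de-emphasis argued for above.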
3.3. Improved bold drive technique
In [12,20], the weights and the desired hidden outputs
were updated using an optimal learning factor. However,
those optimal learning factors are found just based on
current gradients. They do not guarantee an optimal
solution for the whole training procedure, and do not
guarantee faster convergence than heuristic learning factors.
As the error function has broad flat regions adjoining
narrow steep ones [15], the learning factor should be
adapted according to the local shape of the error surface.
That is, when in flat regions, larger Z can be used, while in a
steep region, the learning factor should be kept small.
Combining this idea with the bold driver (BD) technique [3,21], we developed a new adaptive learning factor for the hidden layer. If the learning factor Z for hidden
weight updating is small, then the change in the output
layer error function E due to the change in whi(k, n) is
approximately
$$\Delta E = \frac{\partial E}{\partial w_{hi}(k,n)}\, \Delta w_{hi}(k,n) = Z \cdot \frac{\partial E}{\partial w_{hi}(k,n)}\, e(k,n). \tag{3.21}$$

The total effect of the hidden weight changes on the output error will be

$$\Delta E = Z \sum_{k=1}^{N_h} \sum_{n=1}^{N+1} \frac{\partial E}{\partial w_{hi}(k,n)}\, e(k,n). \tag{3.22}$$

Assume E is reduced by a small factor $Z_0$ between 0.0 and 0.1:

$$\Delta E = -Z_0 \cdot E. \tag{3.23}$$

Combining (3.22) and (3.23), the learning factor Z will be

$$Z = \frac{-Z_0\, E}{\sum_{k} \sum_{n} \bigl(\partial E / \partial w_{hi}(k,n)\bigr)\, e(k,n)}. \tag{3.24}$$

Here, we develop an adaptive law for $Z_0$. When the error increases at a given step, the learning factor is decreased by setting

$$Z_0 \leftarrow Z_0 \cdot 0.5 \tag{3.25}$$

and the best previous weights are reloaded. When the error decreases at a given step, we use

$$Z_0 \leftarrow Z_0 \cdot G, \tag{3.26}$$
where the gain G is

$$G = 1 + \frac{a}{1 + e^{R}}. \tag{3.27}$$

The parameter a is used to limit the maximum value of G; in our simulations, a is set to 0.49. The ratio R is calculated using the last epoch's parameters as

$$R = \frac{|\Delta E|}{\|\Delta W\|} = \frac{|E_1 - E_2|}{Z\, \|\mathbf{e}\|}. \tag{3.28}$$

The hidden weight change vector e in (3.28) is available from solving (3.18) in the previous HWO iteration, and $E_1$ and $E_2$ are the total output MSEs of the last two iterations, respectively. The learning factor Z is then calculated using (3.24).

In fact, if we assume the hidden weight changes are small, R approximates the current gradient magnitude of the hidden unit error surface. When the gradient magnitude is small, the local shape of the error function is flat; otherwise it is steep [25]. So from (3.26) to (3.28), the learning factor Z is modulated by an adaptive factor G, which varies from 1 to (1 + a) with respect to the local shape of the error function. In the remainder of this paper, we refer to the new OWO–HWO algorithm as new HWO, for brevity. A flow chart for the original OWO–HWO and new HWO algorithms is shown in Fig. 2.
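The adaptive law (3.25)–(3.28) can be sketched as follows. We assume here that E2 is the more recent of the two MSEs, and reloading of the best previous weights is left to the caller; all names are ours:

```python
import numpy as np

def update_z0(z0, E1, E2, Z, e_norm, a=0.49):
    """One step of the adaptive law for Z0, Eqs. (3.25)-(3.28).
    E1, E2 : output MSEs of the previous and current iterations;
    Z : last learning factor; e_norm : norm of the last hidden weight
    change vector e."""
    if E2 > E1:                        # error increased: halve Z0, Eq. (3.25)
        return z0 * 0.5                # (caller also reloads the best weights)
    R = abs(E1 - E2) / (Z * e_norm)    # Eq. (3.28): local steepness estimate
    G = 1.0 + a / (1.0 + np.exp(R))    # Eq. (3.27): gain, close to 1 when steep
    return z0 * G                      # Eq. (3.26): grow Z0 in flat regions
```

Large R (steep region) drives G toward 1, leaving $Z_0$ nearly unchanged, while small R (flat region) gives the largest gain, matching the discussion above.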
3.4. Convergence of the modified algorithm
In the following, we show that the hidden weight changes from HWO make the global error E decrease until a local minimum or a given stopping criterion is reached.
Denoting $W^{*}$ as the optimal weights, we have $E(W) \ge E(W^{*})$, and E is convex in the neighborhood of $W^{*}$: $N(W^{*}) = \{W : \|W - W^{*}\| < \varepsilon,\ \varepsilon > 0\}$. Then the change in E caused by the hidden weight changes can be approximated as

$$\Delta E = E(W + \Delta W) - E(W) \cong \Bigl(\frac{\partial E}{\partial W}\Bigr)^{T} \Delta W. \tag{3.29}$$

In addition, $\Delta E$ is the total effect of the $N_h$ hidden units:

$$\Delta E = \sum_{k=1}^{N_h} \Delta E(k), \tag{3.30}$$

where $\Delta E(k)$ is the change in E caused by the kth hidden unit:

$$\Delta E(k) \cong \sum_{n} \frac{\partial E}{\partial w_{hi}(k,n)}\, \Delta w_{hi}(k,n) = Z \sum_{n} \frac{\partial E}{\partial w_{hi}(k,n)}\, e(k,n) = Z\, \frac{1}{N_v} \sum_{n} \sum_{p=1}^{N_v} \frac{\partial E_p}{\partial w_{hi}(k,n)}\, e(k,n) = -Z\, \frac{1}{N_v} \sum_{p=1}^{N_v} \delta_{pk} \sum_{n} x_{pn}\, e(k,n). \tag{3.31}$$

From (3.18), we can say that, in the neighborhood of $W^{*}$,

$$\sum_{n} e(k,n)\, x_{pn} \cong \Delta net^{*}_{pk} = \frac{\delta_{pk}}{\bigl(f'_{pk}\bigr)^2 \sum_{i=1}^{M} w^2_{oh}(i,k)}. \tag{3.32}$$

The second step in the above equation is from (3.9). The total change of the error function E, due to changes in all hidden weights, becomes

$$\Delta E = \sum_{k} \Delta E(k) \cong -Z\, \frac{1}{N_v} \sum_{k} \sum_{p=1}^{N_v} \frac{\delta^2_{pk}}{\bigl(f'_{pk}\bigr)^2 \sum_{i=1}^{M} w^2_{oh}(i,k)}. \tag{3.33}$$

As every term in the summation is nonnegative, we have $\Delta E \le 0$. So the global error E will keep decreasing until reaching a local minimum.
Fig. 2. OWO–HWO flowchart.
4. Simulation and discussion
The proposed ideas were verified using several training
data sets. The performance of the new HWO was
compared to the original OWO–HWO, the Levenberg–
Marquardt (LM) algorithm, and the new HWO with an
optimal learning factor [12]. The optimal learning factor is
derived in the Appendix. We also compare the new HWO
with the LBL method [20] for one data set. Our simulations
were carried out on a 2.8 GHz Pentium IV, Windows NT
workstation using the Visual C++ 6.0 compiler.
4.1. Simulations
Training data set 1: Twod contains simulated data based
on models from backscattering measurements. This training file is used in the task of inverting the surface scattering
parameters from an inhomogeneous layer above a homogeneous half space, where both interfaces are randomly
rough. The parameters to be inverted are the effective
permittivity of the surface, the normalized rms height, the
normalized surface correlation length, the optical depth,
and single scattering albedo of an inhomogeneous irregular
layer above a homogeneous half space from back scattering
measurements.
The training data file contains 1768 patterns. The inputs
consist of eight theoretical values of back scattering
coefficient parameters at V and H polarization and
four incident angles. The outputs were the corresponding
values of permittivity, upper surface height, lower surface
height, normalized upper surface correlation length,
normalized lower surface correlation length, optical
depth and single scattering albedo which had a joint
uniform pdf [5].
The three-layer MLP has 8 inputs, 10 hidden units, and 7
outputs. All the algorithms are trained for 200 iterations.
From the simulation results of Fig. 3, the new HWO has
the fastest convergence speed in terms of both iteration
number and time. The LM algorithm costs much more
time due to its heavy computational load. Since, at any location on a quadratic error surface, optimal learning rates are superior to heuristic ones, we also tried an optimal Z in our experiments. However, as the global error function is not actually a quadratic function of Z, we found that using optimal learning factors [12] in every iteration does not result in better performance. In the following, we compare the new HWO with the LBL algorithm [20]. In
addition, as the initial MSE in LBL is too large, we use OWO to initialize the output weights, terming the result OWO+LBL. In Fig. 4, we present the simulation results for the data set Twod. From Fig. 4, we can see that LBL has a larger MSE than the new HWO. If LBL is initialized by OWO, the LBL algorithm does not decrease the MSE any further. From the figure we also see that LBL takes more time per iteration than OWO–HWO. This occurs, in part, because LBL requires three passes through the data per iteration, while OWO–HWO requires only two. LBL also requires additional calculations for the learning factors in each pattern.
Training data set 2: OH7.TRA. This remote sensing training data set [22] contains VV and HH polarizations at L band (30 and 40 deg), C band (10, 30, 40, 50, and 60 deg), and X band (30, 40, and 50 deg), along with the corresponding unknowns $Y = \{s, l, m_v\}^T$, where s is the rms surface height, l is the surface correlation length, and $m_v$ is the volumetric soil moisture content in g/cm³.

There are 20 inputs, 3 outputs, and 7320 training patterns. We used a 20-20-3 network and trained the network for 200 iterations with all the algorithms. The simulation results of Fig. 5 show that the advantage of the new algorithm over the other algorithms is overwhelming.

Training data set 3: F17. This prognostics data set contains 3335 training patterns for onboard flight load synthesis (FLS) in helicopters [3].
Fig. 3. Simulation results for example 1. Data: Twod.tra, structure: 8-10-7: (a) training MSE vs. iteration number; (b) training MSE vs. time.
C. Yu et al. / Neurocomputing 70 (2006) 525–535
mechanical loads on critical parts, using measurements
available in the cockpit. The accumulated loads can then be
used to determine component retirement times. There are
17 inputs and 9 outputs for each pattern. In this approach,
signals available on an aircraft, such as airspeed, control
attitudes, accelerations, altitude, and rates of pitch, roll,
and yaw, are processed into desired output loads such as
fore/aft cyclic boost tube oscillatory axial load (OAL),
lateral cyclic boost tube OAL, collective boost tube OAL,
main rotor pitch link OAL, etc. This data was obtained
from the M430 flight load level survey conducted in
Mirabel, Canada in early 1995 by Bell Helicopter Textron
of Fort Worth.
We chose the MLP structure 17-10-9 and trained the
network for 200 iterations with all the algorithms. The
simulation results of Fig. 6 show that the new algorithm
again outperforms the other algorithms.
We have shown that the new HWO performs better than
the original OWO–HWO and LM methods. The adaptive
learning factor also gave better results than the optimal
learning factors.
5. Discussion
Evaluation of the performance of an algorithm should
not just be based on the training errors. We also need to
consider the testing error [4,11]. In the following, we
evaluate the training algorithms using testing data sets
disjoint from the training sets. The numbers of testing
patterns are 1000 for twod, 3133 for oh7, and 1410 for f17.
For each data set, we present both the training MSE and
testing MSE for every training algorithm in Table 1. From
the numerical results in Table 1, it is obvious that the new
HWO algorithm has the fastest convergence speed and the
best generalization ability. This good generalization is due
to the facts that (1) the networks are small enough to
prevent memorization, (2) the training and testing sets, in
each experiment, are generated by the same random
Fig. 4. Comparison with the LBL algorithms. Data: Twod.tra, structure: 8-10-7:
(a) training MSE vs. iteration number; (b) training MSE vs. training time in seconds.
Fig. 5. Simulation results for example 2. Data: oh7.tra, structure: 20-20-3:
(a) training MSE vs. iteration number; (b) training MSE vs. training time in seconds.
process, and (3) the new algorithm is very effective at
reducing the training MSE. The increased speed of
convergence for the new algorithm is due to the facts that
(1) linear equations are solved for output weights and
hidden unit weight changes, (2) very few passes through the
data are required per iteration, and (3) the error function
places little emphasis on weights feeding into saturated
hidden units.
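Point (1) can be illustrated with a brief sketch. This is our illustration rather than the authors' code: it assumes a single hidden layer of sigmoid units with bypass (input-to-output) connections, and the array names and shapes are our own choices. With hidden weights fixed, the network output is linear in the output weights, so the MSE-optimal output weights solve a linear least-squares problem.

```python
import numpy as np

def owo_output_weights(X, T, W_hidden):
    """Output weight optimization (OWO) sketch: with the hidden weights
    held fixed, the output layer is linear in its weights, so the
    MSE-optimal output weights come from one least-squares solve.

    X : (Nv, N+1) input patterns with a trailing bias column of ones.
    T : (Nv, M)   desired outputs.
    W_hidden : (Nh, N+1) fixed hidden weights.
    """
    net = X @ W_hidden.T                  # hidden net functions
    O = 1.0 / (1.0 + np.exp(-net))        # sigmoid activations
    basis = np.hstack([X, O])             # bypass (linear) + hidden basis
    # Solve basis @ W_out.T ~= T in the least-squares sense.
    W_out, *_ = np.linalg.lstsq(basis, T, rcond=None)
    return W_out.T                        # (M, N+1+Nh)

# Usage with illustrative sizes: 8 inputs + bias, 10 hidden units, 3 outputs.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 8)), np.ones((100, 1))])
T = rng.normal(size=(100, 3))
Wh = rng.normal(size=(10, 9))
Wo = owo_output_weights(X, T, Wh)
assert Wo.shape == (3, 19)
```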
6. Conclusions
In order to accelerate convergence of the OWO–HWO
algorithm and to reduce the number of heuristics, this
paper proposes several improvements. The net functions
now evolve in a more nearly optimal direction. The weighted
hidden layer error function is derived directly from the
global MSE. The hidden layer learning factor adapts
according to the local shape of the error surface. Together,
these techniques successfully increase the training speed.
In addition, when tested on several data sets, the new
HWO algorithm consistently outperforms the other algorithms.
However, some issues need further investigation. It is unclear
whether it is necessary to update the net function in the
optimal direction in the first few iterations. We also need to
design an adaptation law for the learning factor that avoids
the heuristic parameter a. Another issue not attacked here is
the proper choice of initial hidden weights.
Acknowledgments
This work was supported by the Advanced Technology
Program of the state of Texas, under grant number
003656-0129-2001.
Appendix
In the following, we present the derivation of the
optimal learning factor [12] for hidden weight optimization.

To calculate η, the error function E is rewritten as

E = \frac{1}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}\left[t_{pi} - \sum_{k=1}^{N_h} w_{oh}(i,k)\,f\!\left(\sum_{n=1}^{N+1}\big(w_h(k,n) + \eta\,e(k,n)\big)\,x_{pn}\right) - \sum_{n=1}^{N+1} w_{oi}(i,n)\,x_{pn}\right]^2.   (A.1)

With W denoting the hidden weights and e the hidden weight
changes, in each iteration the optimal learning rate η used in
HWO is found by solving

\frac{dE(W + \eta e)}{d\eta} = 0.   (A.2)

Then the hidden weights are updated as W ← W + η·e.
Fig. 6. Simulation results for example 3. Data: F17.dat, structure: 17-10-9:
(a) training MSE vs. iteration number; (b) training MSE vs. training time in seconds.
Table 1
Training and testing errors for each training algorithm

Algorithm        Twod                    OH7                     F17
                 Training   Testing      Training   Testing      Training   Testing
OWO–HWO          0.153103   0.171514     1.372671   1.595475     0.749289   0.821818
LM               0.140873   0.149181     1.392534   1.562335     0.892191   0.951841
New HWO          0.132289   0.147620     1.224596   1.554806     0.716455   0.753218
OWO+LBL          0.353854   0.391923     2.3586     2.39681      2.88507    2.93159
Layer-by-layer   0.640023   0.702386     3.60798    3.70564      3.15433    3.16477
Optimal LF       0.191480   0.210467     1.464110   1.610119     0.892393   0.930940
As it is hard to solve (A.2) directly, we obtain η from a
Taylor series expansion of E(W + η·e). A third-degree
expansion of E(W + η·e) is

E(W + \eta e) \approx A_0 + A_1\,\eta + A_2\,\eta^2 + A_3\,\eta^3.   (A.3)

In Eq. (A.3), A_0, A_1, A_2, and A_3 are the Taylor series
coefficients, obtained as

A_0 = E(W + \eta e)\big|_{\eta=0}, \qquad A_1 = \frac{\partial E(W + \eta e)}{\partial\eta}\Big|_{\eta=0},
A_2 = \frac{1}{2!}\,\frac{\partial^2 E(W + \eta e)}{\partial\eta^2}\Big|_{\eta=0}, \qquad A_3 = \frac{1}{3!}\,\frac{\partial^3 E(W + \eta e)}{\partial\eta^3}\Big|_{\eta=0}.   (A.4)

Details for obtaining these coefficients are given below.
Depending on whether the second-degree or the third-degree
approximation of the error is used, the optimal learning rate
η_OL can be obtained from Eq. (A.2) either as

\eta_{OL} = -\frac{A_1}{2A_2},   (A.5)

or

\eta_{OL} = \frac{-A_2 \pm \sqrt{A_2^2 - 3A_1A_3}}{3A_3}.   (A.6)
From Eqs. (A.3) and (A.6), the second derivative of E(η) is

\frac{\partial^2 E}{\partial\eta^2} = 2A_2 + 6A_3\,\eta = \pm 2\sqrt{A_2^2 - 3A_1A_3}.   (A.7)

When E(η) is at a local minimum, ∂²E(η)/∂η² should be
greater than zero, so the resulting solution for η_OL in
Eq. (A.6) is

\eta_{OL} = \frac{-A_2 + \sqrt{A_2^2 - 3A_1A_3}}{3A_3}.   (A.8)

Whenever the argument of the square root in Eq. (A.8) is
negative, we use the solution from Eq. (A.5). Since a
correctly obtained direction vector should point toward the
trough of the error function, the optimal learning rate
should be greater than zero when the direction vector is
correct. So when the solution from either Eq. (A.5) or (A.6)
turns out to be negative, we reverse the direction vector and
obtain the optimal learning rate again from Eq. (A.5) or (A.8).

In order to find A_0, A_1, A_2, and A_3, we have to calculate
the derivatives in (A.4). From the global MSE definition in
(2.7), we have
\frac{\partial E}{\partial\eta} = -\frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{k=1}^{M} (t_{pk} - y_{pk})\,\frac{\partial y_{pk}}{\partial\eta},   (A.9)

\frac{\partial^2 E}{\partial\eta^2} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{k=1}^{M}\left[\left(\frac{\partial y_{pk}}{\partial\eta}\right)^2 - (t_{pk} - y_{pk})\,\frac{\partial^2 y_{pk}}{\partial\eta^2}\right],   (A.10)

\frac{\partial^3 E}{\partial\eta^3} = \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{k=1}^{M}\left[3\,\frac{\partial y_{pk}}{\partial\eta}\,\frac{\partial^2 y_{pk}}{\partial\eta^2} - (t_{pk} - y_{pk})\,\frac{\partial^3 y_{pk}}{\partial\eta^3}\right].   (A.11)
The output y_pk in terms of η is

y_{pk} = \sum_{i=1}^{N+1} w_{oi}(k,i)\,x_{pi} + \sum_{j=1}^{N_h} w_{oh}(k,j)\,f\big(net_{pj} + \eta\,Net_{pj}\big),   (A.12)

where Net_{pj} = \sum_{n=1}^{N+1} e(j,n)\,x_{pn}. So

\frac{\partial y_{pk}}{\partial\eta} = \sum_{j=1}^{N_h} w_{oh}(k,j)\,f\,(1 - f)\,Net_{pj},   (A.13)

where f = f(net_{pj} + \eta\,Net_{pj}),

\frac{\partial^2 y_{pk}}{\partial\eta^2} = \sum_{j=1}^{N_h} w_{oh}(k,j)\,Net_{pj}^2\,f\,(1 - f)(1 - 2f),   (A.14)

\frac{\partial^3 y_{pk}}{\partial\eta^3} = \sum_{j=1}^{N_h} w_{oh}(k,j)\,Net_{pj}^3\,f\,(1 - f)\big(1 - 6f + 6f^2\big).   (A.15)

Substituting (A.13)-(A.15) into (A.9)-(A.11), A_0, A_1, A_2,
and A_3 can be calculated from (A.4). Then the optimal
learning rate can be found from (A.5) or (A.8).
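As a hedged numerical sketch of this appendix (our illustration, not the authors' implementation), the coefficients A_1–A_3 and the resulting η_OL can be assembled directly from (A.9)–(A.15). The array names and shapes below are assumptions, and f is taken to be the logistic sigmoid so that f′ = f(1 − f):

```python
import numpy as np

def taylor_coefficients(X, T, W_oi, W_oh, W_h, e):
    """Assemble A1, A2, A3 of (A.4) from (A.9)-(A.15).

    Assumed shapes: X (Nv, N+1) inputs with a bias column, T (Nv, M)
    targets, W_oi (M, N+1) bypass weights, W_oh (M, Nh) output weights,
    W_h (Nh, N+1) hidden weights, e (Nh, N+1) hidden weight direction.
    """
    Nv = X.shape[0]
    net = X @ W_h.T                        # net_pj
    Net = X @ e.T                          # Net_pj, from (A.12)
    f = 1.0 / (1.0 + np.exp(-net))         # sigmoid at eta = 0
    err = T - (X @ W_oi.T + f @ W_oh.T)    # t_pk - y_pk, y from (A.12)
    d1 = (f * (1 - f) * Net) @ W_oh.T                              # (A.13)
    d2 = (f * (1 - f) * (1 - 2 * f) * Net**2) @ W_oh.T             # (A.14)
    d3 = (f * (1 - f) * (1 - 6 * f + 6 * f**2) * Net**3) @ W_oh.T  # (A.15)
    dE1 = -(2.0 / Nv) * np.sum(err * d1)                           # (A.9)
    dE2 = (2.0 / Nv) * np.sum(d1**2 - err * d2)                    # (A.10)
    dE3 = (2.0 / Nv) * np.sum(3 * d1 * d2 - err * d3)              # (A.11)
    return dE1, dE2 / 2.0, dE3 / 6.0       # A_k = (1/k!) d^k E/d eta^k (A.4)

def optimal_learning_factor(A1, A2, A3):
    """eta_OL from (A.5)/(A.8), with the fallback described in the text."""
    disc = A2 * A2 - 3.0 * A1 * A3
    if disc >= 0.0 and abs(A3) > 1e-12:
        # '+' root of (A.8): makes d2E/deta2 = 2*sqrt(disc) >= 0 by (A.7)
        return (-A2 + np.sqrt(disc)) / (3.0 * A3)
    return -A1 / (2.0 * A2)                # quadratic model (A.5)
```

If the returned learning rate is negative, the text's prescription is to reverse the direction vector e and recompute.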
References
[1] S.A. Barton, A matrix method for optimizing a neural network,
Neural Computation 3 (3) (1991) 450–459.
[2] R. Battiti, First- and second-order methods for learning: between
steepest descent and Newton's method, Neural Computation 4 (2)
(1992) 141–166.
[3] H.H. Chen, M.T. Manry, H. Chandrasekaran, A neural network
training algorithm utilizing multiple sets of linear equations,
Neurocomputing 25 (1–3) (1999) 55–72.
[4] Y. LeCun, Generalization and network design strategies, in:
Connectionism in Perspective, 1989.
[5] M.S. Dawson, A.K. Fung, M.T. Manry, Surface parameter retrieval
using fast learning neural networks, Remote Sensing Reviews 7 (1)
(1993) 1–18.
[6] R. Fletcher, Practical Methods of Optimization, Wiley, New York,
1987.
[7] M.H. Fun, M.T. Hagan, Levenberg-Marquardt training for modular
networks, in: The 1996 IEEE International Conference on Neural
Networks, vol. 1, 1996, pp. 468–473.
[8] C.M. Hadzer, R. Hasan, et al., Improved singular value decomposition
by using neural networks, in: IEEE International Conference on
Neural Networks, vol. 1, 1995, pp. 438–442.
[9] M.T. Hagan, M.B. Menhaj, Training feedforward networks with the
Marquardt algorithm, IEEE Transactions on Neural Networks 5 (6)
(1994) 989–993.
[10] M.T. Hagan, H.B. Demuth, M.H. Beale, Neural Network Design,
PWS Publishing Company, 1996.
[11] S. Haykin, Neural networks: a comprehensive foundation, Prentice
Hall, Englewood Cliffs, NJ, 1999.
[12] T.H. Kim, M.T. Manry, F.J. Maldonado, New learning factor and
testing methods for conjugate gradient training algorithm, in:
IJCNN'03, International Joint Conference on Neural Networks,
2003, pp. 2011–2016.
[13] H.M. Lee, C.M. Chen, T.C. Huang, Learning efficiency improvement
of backpropagation algorithm by error saturation prevention
method, Neurocomputing 41 (2001) 125–143.
[14] Y. Lee, S.H. Oh, M.W. Kim, An analysis of premature saturation in
backpropagation learning, Neural Networks 6 (1993) 719–728.
[15] G.D. Magoulas, M.N. Vrahatis, G.S. Androulakis, Improving the
convergence of the backpropagation algorithm using learning rate
adaptation methods, Neural Computation 11 (1999) 1769–1796.
[16] M.T. Manry, H. Chandrasekaran, C.H. Hsieh, Signal Processing
Using the Multilayer Perceptron, CRC Press, Boca Raton, 2001.
[17] M.T. Manry, et al., Fast training of neural networks for remote
sensing, Remote Sensing Reviews 9 (1994) 77–96.
[18] T. Masters, Neural, Novel & Hybrid Algorithms for Time Series
Prediction, Wiley, New York, 1995.
[19] S.H. Oh, Improving the error backpropagation algorithm with a
modified error function, IEEE Transactions on Neural Networks 8
(3) (1997) 799–803.
[20] S.H. Oh, S.Y. Lee, A new error function at hidden layers for fast
training of multilayer perceptrons, IEEE Transactions on Neural
Networks 10 (1999) 960–964.
[21] S.H. Oh, S.Y. Lee, Optimal learning rates for each pattern and
neuron in gradient descent training of multilayer perceptrons, in:
IJCNN'99, International Joint Conference on Neural Networks,
vol. 3, 1999, pp. 1635–1638.
[22] Y. Oh, K. Sarabandi, F.T. Ulaby, An empirical model and an
inversion technique for radar scattering from bare soil surfaces, IEEE
Transactions on Geoscience and Remote Sensing 30 (2) (1992)
370–381.
[23] A. van Ooyen, B. Nienhuis, Improving the convergence of the back-
propagation algorithm, Neural Networks 5 (1992) 465–471.
[24] W.H. Press, et al., Numerical Recipes, Cambridge University Press,
New York, 1986.
[25] K. Rohani, M.S. Chen, M.T. Manry, Neural subset design by direct
polynomial mapping, IEEE Transactions on Neural Networks 3 (6)
(1992) 1024–1026.
[26] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal
representations by error propagation, Parallel Distributed Processing
1 (1986) 318–362.
[27] R.S. Scalero, N. Tepedelenlioglu, A fast new algorithm for training
feedforward neural networks, IEEE Transactions on Signal Processing
40 (1) (1992) 202–210.
[28] G.J. Wang, C.C. Chen, A fast multilayer neural-network training
algorithm based on the layer-by-layer optimizing procedures, IEEE
Transactions on Neural Networks 7 (3) (1996) 768–775.
[29] P.J. Werbos, Beyond regression: new tools for prediction and analysis
in the behavioral sciences, Ph.D. Thesis, Harvard University,
Cambridge, MA, 1974.
[30] J.Y.F. Yam, T.W.S. Chow, A weight initialization method for
improving training speed in feedforward neural network, Neurocom
puting 30 (2000) 219–232.
Changhua Yu received his B.S. in 1995 from
Huazhong University of Science and Technology,
Wuhan, China, and his M.Sc. in 1998 from
Shanghai Jiaotong University, Shanghai, China.
He received his Ph.D. degree in Electrical
Engineering from the University of Texas at
Arlington in 2004. He joined the Neural Networks
and Image Processing Lab in the EE department
as a research assistant in 2001. His main research
interests include neural networks, image processing
and pattern recognition. Currently, Dr. Yu works at
FastVDO LLC in Columbia, Maryland.
Michael T. Manry was born in Houston, Texas in
1949. He received the B.S., M.S., and Ph.D. degrees
in Electrical Engineering in 1971, 1973, and 1976,
respectively, from The University of Texas at Austin.
After working there for two years as an Assistant
Professor, he joined Schlumberger Well Services in
Houston, where he developed signal processing
algorithms for magnetic resonance well logging and
sonic well logging. He joined the Department of
Electrical Engineering at the University of Texas at
Arlington in 1982, and has held the rank of Professor
since 1993. In summer 1989, Dr. Manry developed
neural networks for the Image Processing Laboratory of Texas
Instruments in Dallas. His recent work, sponsored by the Advanced
Technology Program of the state of Texas, E-Systems, Mobil Research,
and NASA, has involved the development of techniques for the analysis
and fast design of neural networks for image processing, parameter
estimation, and pattern classification. Dr. Manry has served as a
consultant for the Office of Missile Electronic Warfare at White Sands
Missile Range, MICOM (Missile Command) at Redstone Arsenal, NSF,
Texas Instruments, Geophysics International, Halliburton Logging
Services, Mobil Research, and Verity Instruments. He is a Senior
Member of the IEEE.
Jiang Li received his B.S. in Electrical Engineering
from Shanghai Jiaotong University, China in 1992,
and his M.S. in Automation from Tsinghua
University, China in 2000. He received his Ph.D.
degree in Electrical Engineering from the University
of Texas at Arlington in 2004. His research interests
focus on neural networks, wireless communication,
pattern recognition and signal processing. Currently,
Dr. Li works at the National Institutes of Health in
Bethesda, Maryland.
Pramod Lakshmi Narasimha received his B.E. in
Telecommunications Engineering from Bangalore
University, India. He joined the Neural Networks
and Image Processing Lab in the EE department,
UTA as a research assistant in 2002. In 2003, he
received his M.S. degree in Electrical Engineering
at the University of Texas at Arlington. Currently,
he is working on his Ph.D. His research
interests focus on neural networks, pattern
recognition, image and signal processing.