IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 1, JANUARY 1999
Dynamic Tunneling Technique for Efficient Training
of Multilayer Perceptrons
Pinaki RoyChowdhury, Y. P. Singh, and R. A. Chansarkar
Abstract—A new efficient computational technique for training multilayer feedforward neural networks is proposed. The proposed algorithm consists of two learning phases. The first phase is a local search which implements gradient descent, and the second phase is a direct search scheme which implements dynamic tunneling in weight space, avoiding the local trap and thereby generating the point of next descent. The repeated alternating application of these two phases forms a new training procedure which leads to a global minimum point from any arbitrary initial choice in the weight space. Simulation results are provided for five test examples to demonstrate the efficiency of the proposed method, which overcomes the problem of initialization and of local minimum points in multilayer perceptrons.
Index Terms—DTT, Lipschitz condition, MLP.
Manuscript received January 16, 1998; revised August 20, 1998.
P. RoyChowdhury is with the Defence Terrain Research Laboratory, Delhi, India.
Y. P. Singh is with the Department of Computer Engineering, IT-BHU, Varanasi, India.
R. A. Chansarkar was with the Terrain Research, Defence Research and Development Organization, Delhi, India.
Publisher Item Identifier S 1045-9227(99)01226-6.

Fig. 1. Architecture of a typical MLP. Circles represent nodes, and lines indicate connection weights. Vertical lines on the nodes represent bias weights.
Fig. 2. Model of a typical neuron in MLP. BLK1 is the summation unit, whereas BLK2 is the thresholding unit using sigmoid nonlinearity (snl).

I. INTRODUCTION

MULTILAYER perceptrons (MLP's) form a class of feedforward neural networks (FFNN's). They have found a wide variety of applications in diverse areas, viz. pattern recognition, classification, function approximation, prediction of currency exchange rates, maximizing the yields of chemical processes, and identification of precancerous cells. The most commonly used training method for MLP's is the error backpropagation (EBP) algorithm. EBP has been tested successfully on different kinds of tasks; however, it is deficient in several respects, such as slow convergence and getting trapped in local minimum points. Considerable progress has taken place in EBP learning theory. Efforts have also been made to modify the EBP training method using variable step size, layer-by-layer optimization, and linear block optimization based on least-squares techniques. Second-order methods have also been studied for MLP training; these include the Newton method, the Broyden–Fletcher–Goldfarb–Shanno method, the Levenberg–Marquardt modification, conjugate gradient, and scaled conjugate gradient.

In this paper a new method is developed for efficient training of MLP's by combining EBP and the dynamic tunneling technique. EBP is used here to find a local minimum point, and the dynamic tunneling technique (DTT) is employed to detrap the local minimum. Thus the application of DTT
results in the point of next descent. This technique, applied alternately with EBP in the weight space, leads to the point of global minimum, as demonstrated using various case studies. In all these experiments various initial conditions in the weight space were selected to demonstrate the method's efficacy in reaching the global minimum point, overcoming the problem of initialization in MLP's.

In this work, Section II discusses the architecture of the MLP. The proposed learning technique is discussed in Section III. Section IV gives the computational steps along with the complexity of the algorithm, whereas Section V deals with the simulation results and discussions, confirming the good performance of the proposed learning methodology. Finally, Section VI gives the conclusions.
II. MLP ARCHITECTURE
The general architecture of the MLP is shown in Fig. 1. It has an input layer, an arbitrary number of hidden layers, and an output layer. The input is fed to each of the input-layer (layer 1) neurons, the outputs of the input layer feed into each of the layer 2 neurons, and so on, as shown in Fig. 1. In this work the layered structure having one input layer, one output layer, and $L-2$ hidden layers will be termed an $L$-layer network. The model of a typical neuron in the MLP is shown in Fig. 2. The output from a neuron can be expressed as
$$o = \frac{1}{1 + e^{-\lambda s}} \qquad (1)$$
where $\lambda$ indicates the gain of the sigmoid and $s$ is the sum of the weighted responses from the neurons in the preceding layer. In Fig. 2, snl indicates the sigmoid nonlinearity.
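The neuron model of Fig. 2 (a weighted sum followed by a sigmoid with gain $\lambda$) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the names `weights`, `bias`, and `gain` are chosen here for clarity.

```python
import numpy as np

def neuron_output(inputs, weights, bias, gain=1.0):
    """BLK1: weighted sum of inputs plus bias; BLK2: sigmoid nonlinearity with the given gain."""
    s = np.dot(weights, inputs) + bias          # summation unit (BLK1)
    return 1.0 / (1.0 + np.exp(-gain * s))      # sigmoid nonlinearity (BLK2), eq. (1)

# Example: a single neuron with two inputs.
print(neuron_output(np.array([0.5, -1.0]), np.array([0.3, 0.7]), bias=0.1))
```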
III. TRAINING METHOD
This section describes the proposed learning method as a weight-updating scheme. The computational scheme of the proposed technique proceeds as follows.

An initial point is chosen in the weight space at random and then slightly perturbed. The new point is tested for either the gradient-descent or the tunneling phase in the following manner. Let the random point chosen in the weight space be denoted by $\mathbf{w}$, whose dimension is that of the search space. The new point is represented by $\mathbf{w}+\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ has the same dimension as $\mathbf{w}$ and each component of $\boldsymbol{\epsilon}$ is a small quantity ($\ll 1$). The choice of phase depends on the relative values of the mean squared error (mse), which is the mean of the squared error given by (2) below, at $\mathbf{w}$ and $\mathbf{w}+\boldsymbol{\epsilon}$; i.e., if mse$(\mathbf{w}+\boldsymbol{\epsilon})$ < mse$(\mathbf{w})$, then learning by gradient descent is initiated, else tunneling takes place in the weight space. If a gradient-descent phase is initiated, it will converge to a local minimum point. If tunneling is initiated, it finds a point for the next gradient descent. After this, the algorithm automatically enters the two phases alternately, and the weights are updated according to the modification rule of the respective phase, explained later. The learning procedure continues till the globally optimal weights, with minimum mse, are obtained.
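The phase-selection test above can be illustrated as follows; `mse` stands for any function returning the mean squared error of the network at a given weight vector, and all names and magnitudes here are illustrative assumptions rather than the authors' notation.

```python
import numpy as np

def choose_phase(w, mse, eps_scale=1e-3, rng=np.random.default_rng(0)):
    """Perturb the current weights slightly and compare mse values to pick the next phase."""
    eps = eps_scale * rng.standard_normal(w.shape)   # each component << 1
    if mse(w + eps) < mse(w):
        return "gradient_descent", w + eps           # descent is promising from the perturbed point
    return "tunneling", w                            # otherwise detrap via dynamic tunneling
```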
A. Phase I—EBP
The most popular learning algorithm for the MLP is EBP, which is described in brief with the following notation.

$o_{j}^{(l)}$   Output of the $j$th node in layer $l$.
$w_{ji}^{(l)}$  Weight connecting the $i$th node in layer $l-1$ to the $j$th node in layer $l$.
$x_{p}$         $p$th training sample.
$d_{jp}$        Desired response of the $j$th output node for the $p$th training sample.
$N_{l}$         Number of nodes in layer $l$.
$L$             Number of layers.
$P$             Number of training patterns.

In the above notation the index $i = 0$ corresponds to a constant input of unity, and $w_{j0}^{(l)}$ represents the bias weight of node $j$. EBP implements a gradient search technique to find the network weights that minimize the squared error function
$$E = \frac{1}{2}\sum_{p=1}^{P}\sum_{j=1}^{N_{L}}\bigl(d_{jp} - o_{jp}^{(L)}\bigr)^{2}. \qquad (2)$$
The weights of the network are updated iteratively according to
$$w_{ji}^{(l)}(k+1) = w_{ji}^{(l)}(k) - \eta\,\frac{\partial E}{\partial w_{ji}^{(l)}}(k) \qquad (3)$$
where $\eta$ is a positive constant, called the learning rate, and $k$ represents the index of iteration. Thus, application of gradient descent in the weight space results in
$$\Delta w_{ji}^{(l)} = -\eta\,\frac{\partial E}{\partial w_{ji}^{(l)}}. \qquad (4)$$
The dynamics of the error function is then given by
$$\frac{dE}{dt} = \sum_{l,j,i}\frac{\partial E}{\partial w_{ji}^{(l)}}\,\frac{dw_{ji}^{(l)}}{dt} = -\eta\sum_{l,j,i}\Bigl(\frac{\partial E}{\partial w_{ji}^{(l)}}\Bigr)^{2}. \qquad (5)$$
Thus for a nonnegative $\eta$ the total error $E$ will be a nonincreasing function of time. This proves that there will be a global decrease in the value of $E$.
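A minimal sketch of Phase I for a one-hidden-layer MLP is given below; it is a standard EBP epoch written for illustration under the sigmoid model of (1), not the authors' implementation, and all variable names are assumptions.

```python
import numpy as np

def ebp_epoch(X, D, W1, W2, eta=0.5, gain=1.0):
    """One cycle of error backpropagation (Phase I) for a 1-hidden-layer MLP.
    X: (P, n_in) inputs, D: (P, n_out) targets; W1, W2 carry an extra bias row."""
    sig = lambda s: 1.0 / (1.0 + np.exp(-gain * s))
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])          # append constant bias input
    H = sig(Xb @ W1)                                        # hidden-layer outputs
    Hb = np.hstack([H, np.ones((H.shape[0], 1))])
    O = sig(Hb @ W2)                                        # output layer
    E = O - D                                               # output error
    dO = E * gain * O * (1 - O)                             # output-layer deltas
    dH = (dO @ W2[:-1].T) * gain * H * (1 - H)              # hidden-layer deltas (drop bias row)
    W2 -= eta * Hb.T @ dO / X.shape[0]                      # gradient-descent updates, as in (4)
    W1 -= eta * Xb.T @ dH / X.shape[0]
    return W1, W2, np.mean(E ** 2)                          # mse after the forward pass
```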
B. Phase II—The Dynamic Tunneling Technique
The dynamic tunneling technique is an implementation of a direct search method, and it can be treated as a modified Hooke–Jeeves pattern search method, which is discussed below. An equilibrium point $x_{e}$ of the dynamical system
$$\frac{dx}{dt} = f(x) \qquad (6)$$
is termed an attractor (repeller) if no (at least one) eigenvalue of the matrix
$$\frac{\partial f_{i}}{\partial x_{j}}\Big|_{x = x_{e}} \qquad (7)$$
has a positive real part. Typically, dynamical systems such as (6) obey the Lipschitz condition
$$\Bigl|\frac{\partial f}{\partial x}\Bigr| < \infty \qquad (8)$$
which guarantees the existence of a unique solution for each initial condition. Usually such systems have an infinite relaxation time to an attractor and an infinite escape time from a repeller. The tunneling algorithm is based on the violation of the Lipschitz condition at equilibrium points, which induces singular solutions such that each solution approaches an attractor, or escapes from a repeller, in finite time. To exemplify the above statement, consider the system
$$\frac{dx}{dt} = -x^{1/3}. \qquad (9)$$
The system represented by (9) has an equilibrium point at $x = 0$, which violates the Lipschitz condition at $x = 0$, since
$$\Bigl|\frac{\partial \dot{x}}{\partial x}\Bigr| = \frac{1}{3}\,x^{-2/3} \to \infty \quad \text{as } x \to 0. \qquad (10)$$
The equilibrium point of the above-mentioned system is termed an attracting equilibrium point, since from any initial condition $x_{0}$ the dynamical system in (9) reaches the equilibrium in a finite time $t_{0}$ given by
$$t_{0} = \int_{x_{0}}^{0}\frac{dx}{-x^{1/3}} = \frac{3}{2}\,x_{0}^{2/3}. \qquad (11)$$
Similarly, the dynamical system
$$\frac{dx}{dt} = x^{1/3} \qquad (12)$$
has a repelling unstable equilibrium point at $x = 0$, which also violates the Lipschitz condition. Any initial condition which is infinitesimally close to the repelling point escapes from the repeller to reach a point $x_{f}$ in a finite time given by
$$t_{f} = \int_{0}^{x_{f}}\frac{dx}{x^{1/3}} = \frac{3}{2}\,x_{f}^{2/3}. \qquad (13)$$
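The finite relaxation time can be checked numerically. The sketch below integrates $\dot{x} = -x^{1/3}$ with a simple Euler scheme and compares the time at which the state numerically reaches the equilibrium against the analytic value $(3/2)x_{0}^{2/3}$ of (11); the step size and stopping threshold are arbitrary illustration choices.

```python
import numpy as np

x0, dt = 1.0, 1e-4
x, t = x0, 0.0
while x > 1e-12:                       # integrate dx/dt = -x**(1/3) until (numerical) equilibrium
    x = max(x - dt * np.cbrt(x), 0.0)  # explicit Euler step, clipped at the attractor
    t += dt
print("numerical time to reach the attractor:", t)
print("analytic time (3/2)*x0**(2/3):        ", 1.5 * x0 ** (2 / 3))
```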
The concept of the dynamic tunneling algorithm is based on this violation of the Lipschitz condition at an equilibrium point: any particle placed at a small perturbation from the point of equilibrium will move away from the current point to another within a finite amount of time, as discussed above. Tunneling is implemented by solving the differential equation
$$\frac{dw}{dt} = \rho\,(w - w^{*})^{1/3} \qquad (14)$$
where $w^{*}$ indicates the last local minimum obtained from the gradient-descent phase and $\rho$ represents the strength of learning. It is obvious from (6) that the local minimum point $w^{*}$ is also the point of equilibrium of the tunneling system. One component of the weight vector is perturbed by a small amount away from $w^{*}$, and (14) is integrated for a fixed amount of time, with a small time-step, keeping the remaining components of the weight vector fixed. After every time-step the mse is computed with the new value of the perturbed component. Tunneling comes to a halt when the mse becomes smaller than the mse of the last local minimum (the condition for descent), and the next gradient descent is initiated. If this condition of descent is not satisfied, then the process is repeated with all the components of the weight vector until the above condition of descent is reached. If for no value of the components the above condition is satisfied, then the last local minimum is taken as the global minimum point. In this way the repeated application of gradient descent and tunneling in the weight space may lead to the point of global minimum. In this work, to avoid saturation of the neurons, a corrective term scaled by the strength of learning is introduced into the tunneling dynamics; the modified system is given in (15).

Finally, a general expression for the dynamical system discussed separately in the two phases can be written as
$$\frac{dw}{dt} = -\eta\,\frac{\partial E}{\partial w}\,\bigl[1-\Theta(\mathrm{diff})\bigr] + \rho\,(w - w^{*})^{1/3}\,\Theta(\mathrm{diff}) \qquad (16)$$
where $\Theta(\cdot)$ is a Heaviside step function defined as
$$\Theta(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (17)$$
and diff is defined as mse$(w)$ $-$ mse$(w^{*})$. The implementation issues of (16) are as discussed earlier.
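Under the description above, the tunneling phase can be sketched component-wise: each weight is perturbed in turn away from the last local minimum $w^{*}$, the terminal-repeller dynamics (14) is integrated for a fixed duration with a small time step, and tunneling halts as soon as the mse drops below its value at $w^{*}$ (the condition for descent). The names and parameter values below (`mse`, `rho`, `eps`, `dt`, `t_max`) are illustrative choices, not the authors' notation, and a plain Euler step stands in for the integrator.

```python
import numpy as np

def tunneling_phase(w_star, mse, rho=1.0, eps=1e-2, dt=1e-3, t_max=1.0):
    """Dynamic tunneling (Phase II): search each component for a point of next descent."""
    mse_star = mse(w_star)
    for i in range(w_star.size):                            # take the components one at a time
        w = w_star.copy()
        w[i] += eps                                         # small perturbation off the repeller
        t = 0.0
        while t < t_max:
            w[i] += dt * rho * np.cbrt(w[i] - w_star[i])    # Euler step of dw/dt = rho*(w - w*)^(1/3)
            t += dt
            if mse(w) < mse_star:                           # condition for descent satisfied
                return w, True
    return w_star, False                                    # no descent point: w* taken as global minimum
```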
IV. COMPUTATIONAL SCHEME
In this section the proposed learning algorithm is expressed as a sequence of computational steps, given below.

A. Computational Steps

1) Initialize the weight vector with small random values.
2) Perturb the weight vector slightly; each component of the perturbation is a small value ($\ll$ 1).
3) Compute the mse at the original point and at the perturbed point.
4) Compare the two mse values.
5) If the mse at the perturbed point is smaller, then EBP; else Dynamic Tunneling.
6) EBP:
   6.1) Choose the tolerance limit (tol) at which gradient descent will come to a halt.
   6.2) Repeat
        6.2.1) Feed Forward.
        6.2.2) Compute Gradient.
        6.2.3) Update Weight.
   6.3) Until the termination condition is satisfied.
   Steps 6.2.1)–6.2.3) can be expressed as computational models in the guarded-command style; they correspond to the standard feedforward, gradient computation, and weight-update operations of EBP.
7) Dynamic Tunneling:
   7.1) Take the last local minimum (as obtained from EBP) as the equilibrium point of the tunneling system.
   7.2) Perturb one component of the weight vector at a time, proceeding layer by layer.
   7.3) Integrate the tunneling equation (14) for a fixed duration with a small time step.
   7.4) Compute diff using the solution of 7.3) at each time step.
   7.5) Repeat Tunnel with the next component until diff < 0.
8) If diff < 0, then EBP using the weights obtained from tunneling (go to Step 6).
9) If diff $\ge$ 0 over the entire domain, then the last local minimum is taken as the global minimum and the algorithm Terminates.
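Putting the steps together, the alternation of the two phases can be sketched as below. `ebp_phase` is assumed to run gradient descent from a starting point until the tolerance `tol` is met and to return the local minimum it finds, and `tunneling_phase` is as sketched in Section III; both names and the stopping logic are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def train_ebpdt(w0, mse, ebp_phase, tunneling_phase, tol=1e-4, max_cycles=100,
                eps_scale=1e-3, rng=np.random.default_rng(0)):
    """Alternate EBP (local search) and dynamic tunneling (detrapping) as in Section IV."""
    w = w0 + eps_scale * rng.standard_normal(w0.shape)   # step 2: slight initial perturbation
    for _ in range(max_cycles):
        w = ebp_phase(w, tol=tol)                         # step 6: descend to a local minimum
        w_next, descended = tunneling_phase(w, mse)       # step 7: look for a point of next descent
        if not descended:                                 # steps 8-9: no descent anywhere, so stop
            return w
        w = w_next
    return w
```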
TABLE I: NUMBER OF EPOCHS (NOE) TAKEN TO CONVERGE TO THE GLOBAL MINIMUM IN THE ERROR SURFACE FOR THE PARITY, 4-2-4 ENCODER, 2-1-4-4 ENCODER, AND FULL ADDER PROBLEMS, ALONG WITH mse AND THE CONTROL PARAMETERS: STEPS (SMALL PERTURBATION), THE LEARNING PARAMETERS, AND tol (TERMINATION CRITERION)

B. Complexity of the Algorithm

In this section the worst-case complexity of the proposed method is analyzed in terms of the number of iterations performed in both EBP and DTT. To proceed with the analysis, EBP is considered first; the symbols used carry the interpretations stated earlier. The total number of iterations (TNOI) required by the feedforward step 6.2.1) for one pattern is determined by the numbers of nodes in the successive layers, and similar counts can be computed for 6.2.2) and 6.2.3). For $P$ training patterns the total number of iterations in one cycle of EBP is obtained by summing these counts over all patterns. Assuming the number of cycles required, using all the training patterns, for EBP to converge to a local minimum is known, the total number of iterations TEBP can be expressed as the product of the number of cycles and the per-cycle count. For the tunneling phase, the relevant quantities are the number of times tunneling occurs for a particular variable and the number of variables; if each time tunneling occurs the condition of descent is satisfied only in the last phase of tunneling, and in the last variable, then the total number of iterations required during tunneling is the product of these quantities. During the integration by the RK45 method, the number of function evaluations per time step is a fixed constant, so the count of function evaluations for RK45 reduces to a constant multiple of the number of time steps.
Fig. 3. Convergence curve for the 4-2-4 encoder problem.
The total number of iterations during the tunneling phase then follows from the product of these counts and the number of RK45 function evaluations, and the total number of function evaluations during one combined phase of EBP and tunneling is given by the sum of the two contributions. If the number of such combined cycles is known, the worst-case cost of the proposed method is the product of this cycle count and the per-cycle cost. Thus the expected computational effort of the proposed method is cubic in one set of problem variables and quadratic in another; i.e., it remains polynomial in the problem size.
V. SIMULATION RESULTS AND DISCUSSIONS
To assess the performance of the proposed training scheme, experiments were conducted on the standard problems of 1) parity; 2) encoders; 3) adders; 4) a demultiplexer; and 5) character recognition. The setup in terms of the number of layers and nodes for each of the above-mentioned problems is discussed below.
A. Parity

The parity problem is a standard problem where the output of the network is required to be "1" if the input pattern contains an odd number of "1's" and "0" otherwise. In this problem the most similar patterns, which differ by a single bit, require different answers. The XOR problem is a parity problem with an input pattern of size two. For this problem a three-layer network of size 4-4-1 is considered. Five different initial conditions are considered in the weight space for training the network, and the results are provided in Table I. The results confirm the robustness of the algorithm in overcoming the problem of initialization.
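For reference, the 4-bit parity training set used with the 4-4-1 network can be generated as below; this is only a sketch, with the 0/1 encoding of inputs and outputs taken from the problem statement.

```python
import itertools
import numpy as np

X = np.array(list(itertools.product([0, 1], repeat=4)), dtype=float)  # all 16 input patterns
y = X.sum(axis=1) % 2                                                  # 1 if the number of 1's is odd
print(X.shape, y)                                                      # (16, 4) and the parity targets
```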
B. Encoders

In this example two types of encoders are considered. In the first case a set of orthogonal input patterns is mapped to a set of orthogonal output patterns; for this a three-layer network of size 4-2-4 was employed. The next problem taken up in this section was that of a four-layer network of dimension 2-1-4-4. For the first case the network was trained starting from three different initial conditions, whereas in the second case it was trained using two different initial conditions. The results are presented in Table I. From Fig. 3 it can be concluded that there is a continuous decrease of the error with each epoch, and from the performance of the network it can be concluded that the training process is not dependent on the initial choice of the weights. The number of epochs taken to arrive at the global minimum is comparatively smaller than previously reported. The training patterns for the first case map each orthogonal (one-hot) input vector to the identical output vector; a corresponding coding scheme was used for the second case.
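The 4-2-4 encoder maps each orthogonal (one-hot) input pattern to the identical output pattern, so its training set can be sketched as below; the coding scheme for the 2-1-4-4 case is not reproduced here.

```python
import numpy as np

X = np.eye(4)    # four orthogonal (one-hot) input patterns
D = X.copy()     # each pattern is mapped to itself at the output
print(X)
```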
C. Adders

In this example both the full-adder and the half-adder are investigated. A full-adder has three inputs and two outputs; for this case study a three-layer network of size 3-3-2 was used. The next problem in the adders section is the half-adder, which has two input bits and two output bits; a three-layer network of size 2-2-2 was used for this problem. For the full-adder the network was trained starting from three different initial conditions, whereas for the half-adder two different initial conditions were taken. The results are presented in Tables I and II, respectively, again indicating the robustness of the algorithm toward the initial choice of weights. The training patterns used for the full-adder and the half-adder correspond to their standard truth tables.

TABLE II: NUMBER OF EPOCHS (NOE) TAKEN TO CONVERGE TO THE GLOBAL MINIMUM IN THE ERROR SURFACE FOR THE HALF ADDER, DEMULTIPLEXER, AND CHARACTER RECOGNITION PROBLEMS, ALONG WITH mse AND THE CONTROL PARAMETERS: STEPS (SMALL PERTURBATION), THE LEARNING PARAMETERS, AND tol (TERMINATION CRITERION)
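The adder training patterns can be generated from the standard digital-logic definitions, as sketched below; this reconstructs the usual truth tables rather than reproducing the paper's tables.

```python
import itertools
import numpy as np

# Full-adder: inputs (a, b, carry-in), outputs (sum, carry-out).
full_X = np.array(list(itertools.product([0, 1], repeat=3)))
full_D = np.array([[a ^ b ^ c, (a & b) | (c & (a ^ b))] for a, b, c in full_X])

# Half-adder: inputs (a, b), outputs (sum, carry).
half_X = np.array(list(itertools.product([0, 1], repeat=2)))
half_D = np.array([[a ^ b, a & b] for a, b in half_X])
```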
D. Demultiplexer

A demultiplexer is a circuit that receives information on a single line and transmits this information on one of $2^{n}$ output lines, where the selection of a specific output line is controlled by the bit values of the $n$ selection lines. In this case study one input line, two selection lines, and four output lines were considered. For training purposes a three-layer network of size 3-3-4 was used. Again, five different initial conditions were taken for training the network. The results, presented in Table II, confirm the efficacy of the algorithm in overcoming the need for a judicious selection of initial weights. Also, from Fig. 4 it can be observed that the error decreases continuously with the number of epochs, thereby indicating the trend of a global decrease. The output coding should be read as follows: the output line addressed by the selection lines takes the same value as the data input, while the other outputs are maintained at one.
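Assuming that coding (the output addressed by the two selection lines follows the data input, and the remaining outputs are held at one), the eight training patterns for the 3-3-4 network can be sketched as below; the interpretation of the inactive level is an assumption drawn from the text.

```python
import itertools
import numpy as np

patterns = []
for d, s1, s0 in itertools.product([0, 1], repeat=3):   # data bit and two selection bits
    out = np.ones(4, dtype=int)                          # non-selected outputs held at one (assumed)
    out[2 * s1 + s0] = d                                 # selected line follows the data input
    patterns.append(([d, s1, s0], out.tolist()))
for x, y in patterns:
    print(x, "->", y)
```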
E. Character Recognition
In this problem the training set is constituted by the ten digits and 26 letters (the first 13 lower case and the first 13 upper case) of the English alphabet. Each of them is represented by a matrix of 7 × 5 black-and-white pixels. Each character has a specified target given by a seven-bit ASCII code; character I, for example, has an ASCII code of 1 0 0 1 0 0 1. The network used for training has 35 input nodes, ten hidden nodes, and seven nodes in the output layer. The results presented in Table II, along with Fig. 5, confirm the accuracy and robustness of the algorithm.

Fig. 4. Convergence curve for the demultiplexer problem.
Fig. 5. Convergence curve for the character recognition problem.
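The seven-bit ASCII targets for the 36 training characters (ten digits plus the first 13 lower- and upper-case letters) can be built as below; the 7 × 5 pixel bitmaps themselves are not reproduced here.

```python
import string

chars = string.digits + string.ascii_lowercase[:13] + string.ascii_uppercase[:13]   # 36 characters
targets = {c: [int(b) for b in format(ord(c), '07b')] for c in chars}                # 7-bit ASCII codes
print(targets['I'])   # -> [1, 0, 0, 1, 0, 0, 1], matching the example in the text
```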
Finally, the comparison of the proposed method with EBP presented in Table III underlines the fact that the proposed method can attain a lower value of mse in a smaller number of epochs. Table III, along with Figs. 3–5, establishes that dynamic tunneling ensures the traversal of only those regions of the state space which guarantee a continuous decrease in the value of the objective function, thereby eliminating the problem of initialization in the MLP.
TABLE III: COMPARISON IN TERMS OF NOE AND mse BETWEEN THE PROPOSED METHOD (EBPDT) AND THE CONVENTIONAL EBP ALGORITHM FOR THE PROBLEMS DISCUSSED IN THE TEXT
VI. CONCLUSIONS

In this paper a new algorithm for training the MLP, based on combining the gradient-descent method with the dynamic tunneling technique, is proposed. The new algorithm is tested on several test cases, and the results obtained confirm robust learning with the proposed algorithm, which overcomes the problems of initialization and of local minimum points encountered in learning with the EBP algorithm. The algorithm usually leads to the point of global minimum, subject to the termination condition of the gradient-descent phase, and it reaches the global minimum in polynomial time. The algorithm is sensitive to the initial choice of weights only in terms of the number of epochs: a good initial point converges to the global minimum very rapidly, whereas a poor choice of the starting condition takes more time. The algorithm is computationally efficient as it requires the evaluation of first-order derivatives only. However, second-order methods can also be investigated in combination with the dynamic tunneling technique.
REFERENCES

[1] G. E. Hinton, "How neural networks learn from experience," Sci. Amer., pp. 145–151, Sept. 1992.
[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by backpropagating errors," Nature, vol. 323, pp. 533–536, 1986.
[3] R. Beale and T. Jackson, Neural Computing: An Introduction. Bristol, U.K.: Inst. Phys., 1992.
[4] X.-H. Yu and G.-A. Chen, "Efficient backpropagation learning using optimal learning rate and momentum," Neural Networks, vol. 10, no. 3, pp. 517–527, 1997.
[5] R. Battiti, "First- and second-order methods for learning: Between steepest descent and Newton's method," Neural Comput., vol. 4, pp. 141–166, 1992.
[6] S. Osowski, P. Bojarczak, and M. Stodolski, "Fast second-order learning algorithm for feedforward multilayer neural networks and its applications," Neural Networks, vol. 9, no. 9, pp. 1583–1596, 1996.
[7] M. T. Hagan and M. B. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Trans. Neural Networks, vol. 5, pp. 989–993, Nov. 1994.
[8] E. M. Johansson, F. U. Dowla, and D. M. Goodman, "Backpropagation learning for multilayer feedforward neural networks using the conjugate gradient method," Int. J. Neural Syst., vol. 2, no. 4, pp. 291–302, 1991.
[9] M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, no. 4, pp. 525–534, 1993.
[10] G. D. Magoulas, M. N. Vrahatis, and G. S. Androulakis, "Effective backpropagation training with variable stepsize," Neural Networks, vol. 10, no. 1, pp. 69–82, 1997.
[11] G.-J. Wang and C.-C. Chen, "A fast multilayer neural-network training algorithm based on the layer-by-layer optimizing procedures," IEEE Trans. Neural Networks, vol. 7, pp. 768–775, May 1996.
[12] R. Parisi, E. D. Di Claudio, G. Orlandi, and B. D. Rao, "A generalized learning paradigm exploiting the structure of feedforward neural networks," IEEE Trans. Neural Networks, vol. 7, pp. 1450–1460, Nov. 1996.
[13] J. Barhen, V. Protopopescu, and D. Reister, "TRUST: A deterministic algorithm for global optimization," Science, vol. 276, pp. 1094–1097, 1997.
[14] B. C. Cetin, J. Barhen, and J. W. Burdick, "Terminal repeller unconstrained subenergy tunneling (TRUST) for fast global optimization," J. Optimization Theory Applicat., vol. 77, no. 1, pp. 97–126, Apr. 1993.
[15] K. Deb, Optimization for Engineering Design: Algorithms and Examples. New Delhi, India: Prentice-Hall of India, 1995.
[16] C. G. Looney, "Advances in feedforward neural networks: Demystifying knowledge acquiring black boxes," IEEE Trans. Knowledge Data Eng., vol. 8, pp. 211–226, Apr. 1996.
[17] D. Gries, The Science of Programming. New Delhi, India: Narosa.
[18] P. RoyChowdhury, Y. P. Singh, and R. A. Chansarkar, "A new global minimization method for training feedforward neural networks," Neural Networks, to appear.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructures of Cognition, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, pp. 318–362.
[20] M. M. Mano, Digital Design, 2nd ed. New Delhi, India: Prentice-Hall of India, 1995.
Pinaki RoyChowdhury, photograph and biography not available at the time of publication.

Y. P. Singh, photograph and biography not available at the time of publication.

R. A. Chansarkar, photograph and biography not available at the time of publication.