On the Kernel Widths in Radial-Basis Function
Networks
NABIL BENOUDJIT and MICHEL VERLEYSEN
Université Catholique de Louvain, Microelectronics Laboratory, Place du Levant 3, B-1348
Louvain-la-Neuve, Belgium. e-mail: {benoudjit, verleysen}@dice.ucl.ac.be
Abstract. RBFN (Radial-Basis Function Networks) represent an attractive alternative to
other neural network models. Their learning is usually split into an unsupervised part, where centers and widths of the basis functions are set, and a linear supervised part for weight com-
putation. Although available literature on RBFN learning widely covers how basis function
centers and weights must be set, little effort has been devoted to the learning of basis func-
tion widths. This paper addresses this topic: it shows the importance of a proper choice of
basis function widths, and how inadequate values can dramatically influence the approximation performance of the RBFN. It also suggests a one-dimensional searching procedure as a
compromise between an exhaustive search on all basis function widths, and a non-optimal a
priori choice.
Key words. clusters, Gaussian kernels, radial basis function networks, width scaling factor
1. Introduction
Artificial Neural Networks (ANN) are largely used in applications involving classi-
fication or function approximation. It has been proved that several classes of ANN
such as Multilayer Perceptron (MLP) and Radial-Basis Function Networks (RBFN)
are universal function approximators [1–3]. Therefore, they are widely used for func-
tion approximation [3].
Radial-Basis Function Networks and Multilayer Perceptrons can be used for a wide range of applications primarily because they can approximate any function under mild conditions; however, the training of RBFN proves faster than the training of multilayer perceptrons [4]. This fast learning speed comes from the fact that a RBFN has just two layers of parameters (centers and widths, and weights; see Figure 1), and each layer can be determined sequentially. This paper deals with the training of RBF networks.
MLPs are trained by supervised techniques: the weights are computed by minimizing a non-linear cost function. On the contrary, the training of RBF networks can be
split into an unsupervised part and a linear supervised part. Unsupervised updating
techniques are straightforward and relatively fast. Moreover, the supervised part of
the learning consists in solving a linear problem, which is therefore also fast, with the
additional benefit of avoiding the problem of local minima usually encountered
when using multilayer perceptrons [5]. The training procedure for RBFN can be
decomposed naturally into three distinct stages: (i) RBF centers are determined by
some unsupervised/clustering techniques, (ii) the widths of the Gaussian kernels, which are the subject of this paper, are optimized, and (iii) the network weights between the
radial functions layer and the output layer are calculated.
Several algorithms and heuristics are available in the literature regarding the com-
putation of the centers of the radial functions [6–8] and the weights [2, 9]. However,
only very few papers are dedicated to the optimization of the widths of the Gaussian kernels. In this paper we show first that the problem of fixing the widths is not trivial (it is largely data-dependent) and secondly that it certainly depends on the dimension of the input space. When only a small number of samples is available (which is always the case in large-dimensional spaces), there is no other choice than an exhaustive search by computation in order to optimize the widths of the Gaussian kernels.
In this paper we suggest a one-dimensional searching procedure as a compromise
between an exhaustive search on all basis function widths, and a non-optimal a
priori choice. Note that gradient descent on all RBFN parameters is still possible,
but in this case the speed advantage of RBFN learning is at least partially lost.
The paper is organized as follows. Section 2 reviews the basic principles of a
RBFN and presents the width optimization procedure. Section 3 presents simula-
tions performed on simple examples. The simulation results obtained for different
dimensions of the input space and a comparison of our approach to three commonly
accepted rules are given in Section 4. Section 5 concludes the paper.
2. Radial Basis Function Network
Figure 1. Architecture of a radial basis function network with scalar output.

A RBF network is a two-layer ANN. Consider an unknown function $f(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$. In a regression context, RBF networks approximate $f(\mathbf{x})$ by a weighted sum of $d$-dimensional radial activation functions (plus linear and independent terms, see
below). The radial basis functions are centred on well-positioned data points, called
centroids; the centroids can be regarded as the nodes of the hidden layer. The posi-
tions of the centroids and the widths of the radial basis functions are obtained by an
unsupervised learning rule. The weights of the output layer are calculated by a super-
vised process using pseudo-inverse matrices or singular value decomposition (SVD)
[2]. However, it should be noted that other authors use the gradient descent algo-
rithm to optimize the parameters of the RBF network [10]. The training strategies of
‘spherical’ RBF networks will be detailed in Subsection 2.1.
Suppose we want to approximate a function $f(\mathbf{x})$ with a set of $M$ radial basis functions $\phi_j(\mathbf{x})$, centred on the centroids $\mathbf{c}_j$ and defined by:

$$\phi_j: \mathbb{R}^d \to \mathbb{R}: \quad \phi_j(\mathbf{x}) = \phi_j(\|\mathbf{x} - \mathbf{c}_j\|), \qquad (1)$$

where $\|\cdot\|$ denotes the Euclidean distance, $\mathbf{c}_j \in \mathbb{R}^d$ and $1 \leq j \leq M$.
The approximation of the function $f(\mathbf{x})$ may be expressed as a linear combination of the radial basis functions [11]:

$$\hat{f}(\mathbf{x}) = \sum_{j=1}^{M} \lambda_j \, \phi_j(\|\mathbf{x} - \mathbf{c}_j\|) + \sum_{i=1}^{d} a_i x_i + b, \qquad (2)$$

where $\lambda_j$ are weight factors, and $a_i$, $b$ are the weights for the linear and independent terms respectively.
A typical choice for the radial basis functions is a set of multi-dimensional Gaussian kernels:

$$\phi_j(\|\mathbf{x} - \mathbf{c}_j\|) = \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{c}_j)^{T} \Sigma_j^{-1} (\mathbf{x} - \mathbf{c}_j) \right), \qquad (3)$$

where $\Sigma_j$ is a covariance matrix.
Three cases can be considered:
1. Full covariance matrices $\Sigma_j$ are replaced by an identical scalar width $\sigma_j = \sigma$ for all Gaussian kernels. In the literature several authors used this option [3, 12–15].
2. Full covariance matrices $\Sigma_j$ are replaced by a different scalar width $\sigma_j$ for each Gaussian kernel $j$. For example, in [16–18], each scalar width $\sigma_j$ is estimated independently.
3. Full covariance matrices $\Sigma_j$ represent the general case. Musavi et al. [19] used the covariance matrix to estimate the width of each Gaussian kernel.
In this paper we limit our discussion to the first two cases, (1) and (2). One argument to avoid case (3) is that the number of parameters then grows drastically in Equation 2; applications dealing with a small number of learning data are thus difficult to handle by this method. Note also that procedure (3) is very sensitive to outliers [20]. Of course in other situations the use of full covariance matrices may be
interesting [2, 19]. Statistical algorithms used for the estimation of parameters in mixture modelling (such as the EM algorithm) could also be used (see for example [2] for the estimation of centers and widths, and [21]). Nevertheless, the use of the EM algorithm is based on the maximization of likelihood, making this algorithm more often used for classification and probability estimation problems. The EM algorithm is also sensitive to outliers [22]. In fact, many authors (see for example [4]) suggest using solution (2) and the learning strategies described below for simplicity reasons.
In the following, we will thus concentrate on the study of so-called ‘spherical’ RBFN
covering cases (1) and (2).
2.1. SPHERICAL RBFN LEARNING STRATEGIES
Once the number and the general shape of the radial basis functions $\phi_j(\mathbf{x})$ are chosen, the RBF network has to be trained properly. Given a training data set $T$ of size $N_T$,

$$T = \left\{ (\mathbf{x}_p, y_p) \in \mathbb{R}^d \times \mathbb{R},\; 1 \leq p \leq N_T : y_p = f(\mathbf{x}_p) \right\}, \qquad (4)$$
the training algorithm consists of finding the parameters $\mathbf{c}_j$, $\sigma_j$ and $\lambda_j$ such that $\hat{f}(\mathbf{x})$ fits the unknown function $f(\mathbf{x})$ as closely as possible. This is realised by minimising a cost function (usually the mean square error between $\hat{f}(\mathbf{x})$ and $f(\mathbf{x})$ on the learning points). Often, the training algorithm is decoupled into a three-stage procedure:
- determining the centers $\mathbf{c}_j$ of the Gaussian kernels,
- computing the widths $\sigma_j$ of the Gaussian kernels,
- computing the weights $\lambda_j$ and independent terms $a_i$ and $b$.
Moody and Darken [16] proposed to use the k-means clustering algorithm to find the location of the centroids $\mathbf{c}_j$. Other authors use a stochastic online (competitive learning) method [7, 17], which leads to similar results, with the advantage of being adaptive (continuous learning, even with evolving input data).
The computation of the Gaussian function widths $\sigma_j$ is the subject of this paper; it will be detailed at the end of this section.
Once the basis function parameters are determined, the transformation between
the input data and the corresponding outputs of the hidden units is fixed. The net-
work can thus be viewed as an equivalent single-layer network with linear output
units. Minimisation of the average mean square error yields the well-known least-
square solution for the weights.
$$\boldsymbol{\lambda} = \Phi^{+} \mathbf{y} = (\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{y}, \qquad (5)$$

where $\boldsymbol{\lambda}$, $\mathbf{y}$ are the vectors of $\lambda_j$ weight factors and $y_p$ training data outputs (of sizes $M$ and $N_T$ respectively), $\Phi$ is the $N_T \times M$ matrix of $\phi_{ij} = \exp\left( -\|\mathbf{x}_i - \mathbf{c}_j\|^2 / 2\sigma_j^2 \right)$ values and $\Phi^{+} = (\Phi^{T}\Phi)^{-1}\Phi^{T}$ denotes the pseudo-inverse of $\Phi$. In practice, to avoid possible numerical difficulties due to an ill-conditioned matrix $\Phi^{T}\Phi$, singular value decomposition (SVD) is usually used to find the weights [2].
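To make the linear stage concrete, here is a minimal NumPy sketch of the model of Equations (2) and (3) and of the least-squares fit of Equation (5), assuming spherical Gaussian kernels; the function names, the stacking of the linear and constant terms, and the use of numpy.linalg.lstsq (which is SVD-based) are our own choices rather than the authors' code.

```python
import numpy as np

def rbf_design_matrix(X, centers, sigmas):
    """Gaussian kernel activations (Equation 3, spherical case): one column per centroid."""
    # Squared Euclidean distances between every sample and every centroid.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigmas ** 2))

def fit_output_weights(X, y, centers, sigmas):
    """Least-squares output weights of Equation 5, solved through an SVD-based routine."""
    Phi = rbf_design_matrix(X, centers, sigmas)
    # Append the linear and constant terms of Equation 2: columns [Phi | X | 1].
    A = np.hstack([Phi, X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)      # robust to an ill-conditioned Phi^T Phi
    M = centers.shape[0]
    return w[:M], w[M:-1], w[-1]                   # lambdas, linear weights a_i, constant b
```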
The second stage of the training process involves the computation of the Gaussian function widths, which fixes the degree of overlap between the Gaussian kernels. This allows finding a compromise between locality and smoothness of the function $\hat{f}(\mathbf{x})$. We consider here both cases (1) and (2) quoted in Section 2. Case (1) consists in taking identical widths $\sigma_j = \sigma$ for all Gaussian kernels [3, 12–15]. In [15] for example, the widths are fixed as follows:

$$\sigma = \frac{d_{\max}}{\sqrt{2M}}, \qquad (6)$$
where $M$ is the number of centroids and $d_{\max}$ is the maximum distance between any
pair of them. This choice would be close to the optimal solution if the data were uni-
formly distributed in the input space, leading to a uniform distribution of centroids.
Unfortunately most real-life problems show non-uniform data distributions. The
method is thus inadequate in practice and an identical width for all Gaussian kernels
should be avoided.
If the distances between the centroids are not equal, it is better to assign a specific
width to each Gaussian kernel. For example, it would be reasonable to assign a lar-
ger width to centroids that are widely separated from each other and a smaller width to
closer ones [4]. Case (2) therefore consists of estimating the width of each Gaussian
kernel independently. This can be done, for example, by splitting the learning points $\mathbf{x}_p$ into clusters according to the Voronoi region associated with each centroid (a Voronoi region is the part of the space that is nearer to a specific centroid than to any other one), and then computing the standard deviation $\sigma^{c}_{j}$ of the distance between the learning points in a cluster and the corresponding centroid; in reference [17] for example, it is suggested to use an iterative procedure to estimate this standard deviation. Moody and Darken [16], on the other hand, proposed to compute the width factors $\sigma_j$ (the radius of kernel $j$) by the $p$-nearest-neighbours heuristic:

$$\sigma_j = \left( \frac{1}{p} \sum_{i=1}^{p} \|\mathbf{c}_j - \mathbf{c}_i\|^2 \right)^{1/2}, \qquad (7)$$
where the $\mathbf{c}_i$ are the $p$ nearest neighbours of centroid $\mathbf{c}_j$. A suggested value for $p$ is 2 [16]. Saha and Keeler [18] proposed to compute the width factors $\sigma_j$ by a nearest-neighbour heuristic, where $\sigma_j$ (the radius of kernel $j$) is set to the Euclidean distance between $\mathbf{c}_j$ (the vector determining the centre of the $j$th RBF) and its nearest neighbour $\mathbf{c}_i$, multiplied by an overlap constant $r$:

$$\sigma_j = r \cdot \min_{i \neq j} \|\mathbf{c}_j - \mathbf{c}_i\|. \qquad (8)$$
This second class of methods offers the advantage of taking the distribution variations of the data into account. In practice, they are able to perform much better than fixed-width methods, as they offer a greater adaptability to the data. Even so, as we will show in Section 4, the width values given by the above rules remain sub-optimal.
2.2. WIDTH SCALING FACTOR OPTIMIZATION
We suggest in this subsection a procedure for the computation of the Gaussian function widths, based on an exhaustive search and belonging to the second class of algorithms quoted in Subsection 2.1; the purpose is to show the importance of the optimization of the Gaussian widths. Therefore we select the widths in such a way as to guarantee a natural overlap between the Gaussian kernels, preserving the local properties of the RBFN, and at the same time to maximize the generalization ability of the network.
First we compute the standard deviations $\sigma^{c}_{j}$ of the learning data in each cluster in a classical way.
DEFINITION. Sigma_cluster is the empirical standard deviation of the learning
data contained in a cluster or Voronoi region associated to a centroid.
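As a sketch of this definition, sigma_cluster can be estimated from the learning data and the centroids as the root-mean-square distance of the points of each Voronoi region to their centroid (one common reading of the empirical standard deviation of a cluster around its centroid); variable and function names are our own.

```python
import numpy as np

def sigma_cluster(X, centers):
    """Empirical spread of the learning data in each Voronoi region:
    root-mean-square distance of the points assigned to a centroid."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                     # Voronoi (nearest-centroid) assignment
    out = np.zeros(len(centers))
    for j in range(len(centers)):
        cluster_d2 = d2[labels == j, j]
        out[j] = np.sqrt(cluster_d2.mean()) if cluster_d2.size else 0.0
    return out
```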
Subsequently, we determine a width scaling factor WSF, common to all Gaussian kernels. The widths of the kernels are then defined as:

$$\forall j, \quad \sigma_j = \mathrm{WSF} \cdot \sigma^{c}_{j}. \qquad (9)$$
Although the EM algorithm (see for example [22, 23]) could be used to optimize all $\sigma_j$ simultaneously, it appears that in practical situations it is sometimes difficult to escape from local minima, leading to a non-optimal solution. Equation (9) then offers a compromise between the usual methods without optimization of $\sigma_j$ and an M-dimensional optimization of all $\sigma_j$ together.
By inserting the width factor WSF, the approximation function $\hat{f}(\mathbf{x})$ is smoothed such that the generalization process is possibly improved, and an optimal overlapping of the Gaussian kernels is allowed. Unfortunately, the optimal width factor WSF depends on the function to approximate, on the dimension of the input set, as well as on the data distribution. The choice of the optimal WSF value is thus obtained by extensive simulations (cross-validation): the optimal value $\mathrm{WSF}_{\mathrm{opt}}$ will be chosen as the one minimizing the error criterion (mean square error on a validation set), among a set $Q$ of possible WSF values.
When several minima appear, it is recommended to choose the one corresponding to the smallest width scaling factor. Indeed, large WSF values have to be avoided for reasons of complexity, reproducibility and/or numerical stability.
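A minimal sketch of this cross-validated search for WSF_opt, reusing the `rbf_design_matrix`, `fit_output_weights` and `sigma_cluster` helpers sketched earlier; the grid of candidate WSF values is an arbitrary illustration, and scanning it in increasing order with a strict inequality keeps the smallest WSF when several candidates reach the same minimum, as recommended above.

```python
import numpy as np

def select_wsf(X_train, y_train, X_val, y_val, centers,
               wsf_grid=np.arange(0.1, 4.05, 0.1)):
    """Grid search over the width scaling factor of Equation 9:
    sigma_j = WSF * sigma_cluster_j, keeping the WSF with the lowest validation MSE."""
    sc = sigma_cluster(X_train, centers)
    best_wsf, best_mse = None, np.inf
    for wsf in wsf_grid:
        sigmas = wsf * sc
        lambdas, a, b = fit_output_weights(X_train, y_train, centers, sigmas)
        Phi_val = rbf_design_matrix(X_val, centers, sigmas)
        y_hat = Phi_val @ lambdas + X_val @ a + b
        mse = np.mean((y_val - y_hat) ** 2)
        if mse < best_mse:                         # strict '<' keeps the smallest WSF
            best_wsf, best_mse = wsf, mse
    return best_wsf, best_mse
```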
3. Simulations
We consider a simple example, i.e. we try to approximate the constant unity function ($y_p = 1$) on a $d$-dimensional hypercube domain $[0, 10]^d$.
It must be mentioned here that this problem is purely theoretical: there is no interest in approximating such a linear (and even constant!) function by a RBFN. If the RBFN model in Equation 2 is used to approximate this function, all weights $\lambda_j$ multiplying the Gaussian kernels should be equal to zero. In order to reach the goal of
this paper, i.e. to have insights about the optimal values of the kernel widths, the lin-
ear and constant terms were removed from Equation 2 in the simulations. Neverthe-
less, the objective of this paper is to evaluate the variances of the Gaussian kernels
with respect to the dimension of the input space. In order to avoid the consequences
of the other parts of the RBFN training algorithm, we chose to work with a constant
target function, in order to remove the influence of its variations from our conclu-
sions. Simulations were performed for different dimensions d, in order to see the
influence of the dimension on the results.
For all simulations presented in this paper, the density of learning points is uni-
form in the d-dimensional input space. For this reason, the traditional vector quan-
tization (VQ) step in the RBFN learning process is skipped; the centroids are
attached to the nodes of a square grid in dimension d. The goal of this setting is
to eliminate the influence of the VQ results on the simulations. It is well known that the placement of centroids on the nodes of a square grid is not the ideal result of a vector quantization when $d \geq 2$. For example, it has been demonstrated that in dimension $d = 2$, an ideal vector quantization on a uniform density gives a honeycomb (hexagonal) lattice, and not a square grid, as shown in Figure 2 [7]. Nevertheless, it can be shown through a simple calculation that the quantization error obtained with the square grid (Figure 2a) is only about 4% higher than the one obtained with the ideal result (Figure 2b).
As this ideal result is not known in dimensions greater than 2, the assumption is made that the results obtained by placing the centroids on a square grid are a good approximation of those that would be obtained with a true vector quantization.
Once the centroids are placed on a regular grid, the next subsection shows a theo-
retical way to calculate the optimal width by considering that all the weights are
identical in Equation 2. Next, in Subsection 3.2, the optimal width will be estimated
by setting all weights $\lambda_j$ free and calculating them according to Equation 5.
Figure 2. (a) scalar quantization (square grid); (b) vector quantization (honeycomb lattice).
3.1. THEORETICAL VALUE OF THE OPTIMAL WIDTH OF THE GAUSSIAN KERNELS
As mentioned above, the centroids are placed on a regular grid, and the function to be approximated is constant ($y = \text{constant}$); therefore it is expected that the weights $\lambda_j$ in Equation 2 will be identical for all centroids. For a theoretical calculation of the optimal WSF coefficient, we will make this assumption and further suppose that their values are equal to 1. Then, we calculate by Equation 2 (without linear and constant terms) the theoretical output function of the network, and this for various values of $\sigma_j$; again, as the centroids are placed on a regular grid, we will suppose that all $\sigma_j$ values are identical. The goal is to find the value of $\sigma_j$ giving the 'flattest' possible output function $\hat{f}(\mathbf{x})$. This function will not be around 1 (there is no reason for it to be, since we chose the $\lambda_j$ equal to 1), but rather around another average value $m$. Taking $\lambda_j = 1$ does not change the problem: if the $\lambda_j$ were set to $\lambda_j = 1/m$, we would have found an output function with an average value of 1, which was the initial problem. Nevertheless, the two problems obviously lead to the same conclusions regarding the widths $\sigma_j$.
Note that, $\sigma^{c} = \sigma^{c}_{j}$ ($\forall j \in [1, \ldots, M]$) being constant over all clusters, it is equivalent by Equation 9 to find an optimal value of $\sigma$ or an optimal value of WSF. In the following section, we will estimate optimal values of $\sigma$, in order to make possible the comparison with other methods from the literature (Section 4).
For each value of $\sigma$, to find the mean value $m$ we simply take the mean of the output function $\hat{f}(\mathbf{x})$. To quantify the 'flatness' of the output function, we calculate its standard deviation $\mathrm{std}_y$ around the mean value $m$. It should be mentioned here that, in order to avoid as much as possible the border effects encountered when using RBFN, the mean and the standard deviation of $\hat{f}(\mathbf{x})$ are taken only in the middle of the distribution, i.e. in the compact set $[3.85, 6.05]^d$. For each dimension, the $\sigma$ giving the flattest function $\hat{f}(\mathbf{x})$ is called sigma_theo.
DEFINITION. Identical Gaussian kernels with unit weights ($\lambda_j = 1$) are summed for various $\sigma$ values:

$$\hat{f}(\mathbf{x}) = \sum_{j=1}^{M} \exp\left( -\frac{\|\mathbf{x} - \mathbf{c}_j\|^2}{2\sigma^2} \right),$$

where $M$ is the number of centroids. Sigma_theo is the $\sigma$ value corresponding to the smallest standard deviation of $\hat{f}(\mathbf{x})$.
As an example, Figure 3 gives the standard deviation $\mathrm{std}_y$ of $\hat{f}(\mathbf{x})$ according to $\sigma$ in dimension 1, and Figure 4 gives the same result in dimension 2.

Figure 3. $\mathrm{std}_y$ according to $\sigma$ in dimension 1.
Figure 4. $\mathrm{std}_y$ according to $\sigma$ in dimension 2.
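The sigma_theo computation can be sketched as follows for small dimensions, under the assumptions of this section (unit weights, 9^d centroids on a regular grid with spacing 1.1, flatness measured on the $[3.85, 6.05]^d$ compact set); the grid resolution and the number of evaluation points are arbitrary illustration values.

```python
import numpy as np
from itertools import product

def sigma_theo(d, n_per_axis=9, spacing=1.1,
               sigma_grid=np.arange(0.1, 3.0, 0.01), n_eval=2000, seed=0):
    """Width giving the 'flattest' sum of unit-weight Gaussian kernels on a regular grid;
    flatness is measured as the standard deviation of f_hat over the centre of the domain."""
    axis = 0.55 + spacing * np.arange(n_per_axis)        # centroid positions on each axis
    centers = np.array(list(product(axis, repeat=d)))    # regular grid of n_per_axis**d centroids
    X = np.random.default_rng(seed).uniform(3.85, 6.05, size=(n_eval, d))  # away from the borders
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    best_sigma, best_std = None, np.inf
    for sigma in sigma_grid:
        f_hat = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)   # unit weights, no linear terms
        if f_hat.std() < best_std:
            best_sigma, best_std = sigma, f_hat.std()
    return best_sigma
```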
3.2. EXPERIMENTAL VALUE OF THE OPTIMAL WIDTH OF THE GAUSSIAN KERNELS
In this second set of simulations, we still consider centroids placed on a regular grid, but without the assumption that all $\lambda_j$ weights will be identical. On the contrary, all weights are set free and we calculate them according to Equation 5 (using Singular Value Decomposition). As in the previous section, we repeat the experiment for a large set of possible values of $\sigma$ (identical for all Gaussian kernels), and for several dimensions $d$ of the input space. If the principle of 'locality' of the Gaussian kernels is respected and the border effects are neglected, we should expect identical $\lambda_j$. In practice, this is not the case, mainly because of the border effects, as shown in Figure 5a. As in Subsection 3.1, we only used the Gaussian kernels in the centre of the distribution (see Figure 5b) in order to decrease the influence of the border effects.

Figure 5. (a) with border effects; (b) without border effects.
After the best-fit function is calculated, the performance of the RBF network is estimated by computing an error criterion. Consider a validation data set $V$, containing $N_V$ data points:

$$V = \left\{ (\mathbf{x}_q, y_q) \in \mathbb{R}^d \times \mathbb{R},\; 1 \leq q \leq N_V : y_q = f(\mathbf{x}_q) \right\}. \qquad (10)$$
The error criterion can be chosen as the mean square error:

$$\mathrm{MSE}_V = \frac{1}{N_V} \sum_{q=1}^{N_V} \left( y_q - \hat{f}(\mathbf{x}_q) \right)^2, \qquad (11)$$
where $y_q$ are the desired outputs. The minimum of the mean square error ($\mathrm{MSE}_V$) now gives another value of sigma, called sigma_exp.
DEFINITION. Identical Gaussian kernels are summed for various $\sigma$ values:

$$\hat{f}(\mathbf{x}) = \sum_{j=1}^{M} \lambda_j \exp\left( -\frac{\|\mathbf{x} - \mathbf{c}_j\|^2}{2\sigma^2} \right),$$

where $M$ is the number of centroids. Sigma_exp is the $\sigma$ value corresponding to the smallest $\mathrm{MSE}_V$.
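A sketch of the sigma_exp search, reusing `rbf_design_matrix` from the earlier sketch: for each candidate $\sigma$ (common to all kernels) the weights $\lambda_j$ are refitted by least squares (linear and constant terms omitted, as in these simulations), and the width giving the smallest validation MSE is kept. The candidate grid is an arbitrary illustration value.

```python
import numpy as np

def sigma_exp(X_train, y_train, X_val, y_val, centers,
              sigma_grid=np.arange(0.1, 3.0, 0.01)):
    """sigma_exp: the common width minimizing the validation MSE of Equation 11."""
    best_sigma, best_mse = None, np.inf
    for sigma in sigma_grid:
        sigmas = np.full(len(centers), sigma)
        Phi_tr = rbf_design_matrix(X_train, centers, sigmas)
        lambdas, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)   # SVD-based fit of Equation 5
        Phi_val = rbf_design_matrix(X_val, centers, sigmas)
        mse = np.mean((y_val - Phi_val @ lambdas) ** 2)
        if mse < best_mse:
            best_sigma, best_mse = sigma, mse
    return best_sigma, best_mse
```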
Figure 6(a) gives $\mathrm{MSE}_V$ according to $\sigma$ in dimension 1 and Figure 6(b) gives the same result in dimension 3. Figure 6 shows the presence of two minima. The first one corresponds to a local decomposition of the function into a sum of Gaussian kernels; this interpretation is consistent with the classical RBF approach. However, the second one corresponds to a non-local decomposition of the function. As a consequence, the weights $\lambda_j$ turn out to be enormous in absolute value (positive or negative) in order to compensate for the non-flat slopes. This leads to a greater complexity of the RBFN. In addition, large $\lambda_j$ dramatically increase numerical instability. The optimal value chosen for $\sigma$ is therefore the one related to the first minimum, i.e. the smallest $\sigma$.

Figure 6. (a) $\mathrm{MSE}_V$ versus $\sigma$ in dimension 1; (b) $\mathrm{MSE}_V$ versus $\sigma$ in dimension 3.
4. Results
The simulations were made on databases of points distributed uniformly in a hyper-
cube of edge lengths equal to 10, in various dimensions d. The number of centroids is
chosen equal to $9^d$. Other simulations made with a different number of centroids
gave similar results. The number of training and test points is chosen sufficiently
large to avoid (as much as possible) falling into the difficulties due to the empty space
phenomenon (between 1000 and 50000 training points according to the dimension of
the input space). These restrictions have limited the simulations to a dimension of
input space equal to 5.
Table 1 gives the results. In order to obtain values independent of the number of centroids, the 'width scaling factors' WSF_theo and WSF_exp are defined as the ratio between sigma_theo and sigma_cluster on the one hand, and between sigma_exp and sigma_cluster on the other hand, respectively. Indeed it is more appropriate to compare results on the scale-independent WSF coefficient instead of the values of $\sigma$ for two reasons:
- most results in the literature are based on the sigma_cluster value, making the use of WSF easier for comparisons;
- the WSF values are independent of the number of centroids, while those of $\sigma$ are not.
Table 1. WSF_theo and WSF_exp according to the dimension of the input space.

Dim   Sigma_cluster   Sigma_theo   Sigma_exp   WSF_theo   WSF_exp
1     0.3175          0.92         0.5715      2.897      1.8
2     0.4490          0.92         0.5837      2.048      1.30
3     0.5499          0.92         0.5506      1.673      1.01
4     0.6350          0.92         0.5676      1.448      0.89
5     0.7099          0.92         0.5471      1.2959     0.77

Several comments result from Table 1:

- sigma_cluster is proportional to the square root of the dimension, as shown by a simple analytical calculation (a short derivation is sketched after this list):

  $$\sigma_{\mathrm{cluster}} = \sqrt{d}\,\frac{a}{2\sqrt{3}}, \qquad (12)$$

  where $a$ is the edge length of the hypercube corresponding to the Voronoi zone of a centroid. In the simulations, the $9^d$ centroids are placed a priori at the positions $0.55 + (k-1) \cdot 1.1$ (with $1 \leq k \leq 9$) measured on each axis of the input space; $a$ is thus equal to 1.1.
- We notice that sigma_theo does not depend on the dimension of the input space. Therefore, WSF_theo is inversely proportional to the square root of the dimension of the input space.
- We also notice that the sigma_exp values are systematically lower, by about 30 to 35%, than the sigma_theo values. This is due to the increased freedom given to the network coefficients by allowing weight variations rather than fixing them.
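As a quick check on Equation 12, here is a short derivation, under the assumption that the learning points are uniformly distributed over the cubic Voronoi cell of edge $a$ centred on the centroid, and that sigma_cluster is computed as the root-mean-square distance of the points to their centroid:

$$\sigma_{\mathrm{cluster}}^2 = E\left[ \|\mathbf{x} - \mathbf{c}_j\|^2 \right] = \sum_{i=1}^{d} E\left[ (x_i - c_{j,i})^2 \right] = d\,\frac{a^2}{12} \quad\Longrightarrow\quad \sigma_{\mathrm{cluster}} = \sqrt{d}\,\frac{a}{2\sqrt{3}},$$

since a uniform variable on an interval of length $a$ has variance $a^2/12$. With $a = 1.1$ and $d = 1$ this gives $1.1/(2\sqrt{3}) \approx 0.3175$, in agreement with the first row of Table 1.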
We also compared our method (calculation of sigma_exp) to the three approaches of Moody & Darken [16], S. Haykin [15] and A. Saha & J. D. Keeler [18], quoted in Section 2. Figures 7, 8 and 9 illustrate the mean square error obtained according to sigma for different dimensions $d$ ($1 \leq d \leq 3$) with the various calculation methods of $\sigma$. We notice here that, whatever the dimension of the input space, we always find two minima. Figures 10, 11 and 12 clearly show that the choice of the sigma value has a great influence on the local character of the decomposition of the function to approximate (in dimension 1) into a sum of Gaussian kernels.

Figure 7. $\mathrm{MSE}_V$ according to sigma in dimension 1 with the various calculation methods of $\sigma$.
Figure 8. $\mathrm{MSE}_V$ according to sigma in dimension 2 with the various calculation methods of $\sigma$.
Figure 9. $\mathrm{MSE}_V$ according to sigma in dimension 3 with the various calculation methods of $\sigma$.
Figure 10. Local decomposition of $y = 1$ with sigma_exp = 0.5715.
Figure 11. Local decomposition of $y = 1$ with sigma_Moody = 0.7778.
Figure 12. Non-local decomposition of $y = 1$ with sigma_Haykin = 2.0742.
5. Conclusion
This paper gives some insights into the choice of the Gaussian kernel widths to use during the training of spherical RBF networks. Indeed, a major part of the literature
in the field of RBFN covers the optimization of the positions of Gaussian kernels
and the multiplicative weights. On the other hand, the choice of their widths is often
based on heuristics, without real theoretical justification.
In this paper, we first show the importance of the choice of the kernel widths. In
many situations, a bad choice can lead to an approximation error markedly higher than the optimum, sometimes by several orders of magnitude. Then, we show, by two types of simulations, that a classic choice (taking the width of Gaussian kernels
equal to the standard deviation of the points in a cluster) is certainly not optimal.
For example, in simulations in dimension 1, it appears that the width should be
twice this value. It is then suggested to optimize the widths; an example of a one-dimensional optimization procedure is presented in this paper, through the use of a multiplicative factor applied to the widths. Finally, we show that the dimension of the data space has an important influence on the choice of $\sigma$. In particular, the multiplicative correction that must be applied to the standard deviation of points in a cluster is shown to be inversely proportional to the square root of the dimension of the input space.
Simulations on real databases (see for example [24]) show, similarly to the curves
illustrated in this paper, a strong dependency of the approximation error on the width (and width scaling factor) of the Gaussian kernels. A similar methodo-
logy can thus be applied to choose optimum widths, despite the fact that their numer-
ical values depend on the function to approximate.
The results show the need for greater attention to be given to the optimization of the widths of the Gaussian kernels in RBF networks, and to the development of methods allowing these widths to be fixed according to the problem without using an exhaustive search.
Acknowledgements
Michel Verleysen is Senior Research Associate of the Belgian F.N.R.S. (National
Fund For Scientific Research).
References
1. Park, J. and Sandberg, I.: Approximation and radial basis function networks, Neural
Comput. 5(1993), 305–316.
2. Bishop, C. M.: Neural Networks for Pattern Recognition, Oxford University Press, 1995.
3. Park, J. and Sandberg, I. W.: Universal approximation using radial-basis-function
networks, Neural Comput. 3(1991), 246–257.
4. Hwang, Y.-S. and Bang, S.-Y.: An efficient method to construct a radial basis function neural network classifier, Neural Networks, 10(8) (1997), 1495–1503.
5. Howlett, R. J. and Jain, L. C.: Radial Basis Function Networks 2: New Advances in Design, Physica-Verlag, Heidelberg, 2001.
6. Ahalt, S. C. and Fowler, J. E.: Vector quantization using artificial neural network models, In: Proceedings of the International Workshop on Adaptive Methods and Emergent Techniques for Signal Processing and Communications, June 1993, pp. 42–61.
7. Gersho, A. and Gray, R. M.: Vector Quantization and Signal Compression, Kluwer International Series in Engineering and Computer Science, Norwell, Kluwer Academic Publishers, 1992.
8. David Sanchez, V. A.: Second derivative dependent placement of RBF centers, Neurocom-
puting 7(3) (1995), 311–317.
9. Omohundro, S. M.: Efficient algorithms with neural network behavior, Complex Systems, 1 (1987), 273–347.
10. Verleysen, M. and Hlaváčková, K.: An optimized RBF network for approximation of functions, In: European Symposium on Artificial Neural Networks (ESANN 94), pp. 175–180, Brussels, April 20-21-22, 1994.
11. Poggio, T. and Girosi, F.: Networks for approximation and learning, Proceedings of the IEEE, 78(9) (1990), 1481–1497.
12. Orr, M. J.: Introduction to Radial Basis Function Networks, Technical report, April 1996, www.anc.ed.ac.uk/mjo/papers/intro.ps.
13. David Sanchez, V. A.: On the number and the distribution of RBF centers, Neurocomput-
ing 7(2) (1995), 197–202.
14. Chen, S. and Billings, S. A.: Neural networks for nonlinear dynamic system modelling and
identification, Int. J. Control,56(2) (1992), 319–346.
15. Haykin, S.: Neural Networks: A Comprehensive Foundation, Prentice-Hall, second edition, 1999.
16. Moody, J. and Darken, C. J.: Fast learning in networks of locally-tuned processing units,
Neural Comput. 1(1989), 281–294.
17. Verleysen, M. and Hlaváčková, K.: Learning in RBF networks, In: International Conference on Neural Networks (ICNN), Washington, DC, June 3–9, 1996, pp. 199–204.
18. Saha, A. and Keeler, J. D.: Algorithms for Better Representation and Faster Learning in
Radial Basis Function Networks, Advances in Neural Information Processing Systems 2,
Edited by David S. Touretzky, pp. 482–489, 1989.
19. Musavi, M. T., Ahmed, W., Chan, K. H., Faris, K. B. and Hummels, D. M.: On the training of radial basis function classifiers, Neural Networks, 5 (1992), 595–603.
20. Ripley, B. D.: Pattern Recognition and Neural Networks, Cambridge University Press, first edition, 1996.
21. Lázaro, M., Santamaría, I. and Pantaleón, C.: A new EM-based training algorithm for RBF networks, Neural Networks, 16 (2003), 69–77.
22. Archambeau, C., Lee, J. and Verleysen, M.: On convergence problems of the EM algo-
rithm for finite Gaussian mixtures, In: European Symposium on Artificial Neural Networks
(ESANN’2003), pp. 99–104, Bruges, April 23-24-25, 2003.
23. Xu, L. and Jordan, M. I.: On convergence properties of the EM algorithm for Gaussian
mixtures, Neural Computation,8(1) (1996), 129–151.
24. Benoudjit, N., Archambeau, C., Lendasse, A., Lee, J. and Verleysen, M.: Width optimi-
zation of the Gaussian kernels in radial basis function networks, In: European Symposium
on Artificial Neural Networks (ESANN’2002), pp. 425–432, Bruges, April 24-25-26, 2002.