On the Kernel Widths in Radial-Basis Function
Networks
NABIL BENOUDJIT and MICHEL VERLEYSEN
Université Catholique de Louvain, Microelectronics Laboratory, Place du Levant 3, B-1348
Louvain-la-Neuve, Belgium. e-mail: {benoudjit, verleysen}@dice.ucl.ac.be
Abstract. RBFN (Radial-Basis Function Networks) represent an attractive alternative to
other neural network models. Their learning is usually split into an unsupervised part, where
center and widths of the basis functions are set, and a linear supervised part for weight com-
putation. Although available literature on RBFN learning widely covers how basis function
centers and weights must be set, little effort has been devoted to the learning of basis func-
tion widths. This paper addresses this topic: it shows the importance of a proper choice of
basis function widths, and how inadequate values can dramatically influence the approxima-
tion performances of the RBFN. It also suggests a one-dimensional searching procedure as a
compromise between an exhaustive search on all basis function widths, and a non-optimal a
priori choice.
Key words. clusters, Gaussian kernels, radial basis function networks, width scaling factor
1. Introduction
Artificial Neural Networks (ANN) are largely used in applications involving classi-
fication or function approximation. It has been proved that several classes of ANN
such as Multilayer Perceptron (MLP) and Radial-Basis Function Networks (RBFN)
are universal function approximators [1–3]. Therefore, they are widely used for func-
tion approximation [3].
Radial-Basis Function Networks and Multilayer Perceptrons can be used for a
wide range of applications primarily because they can approximate any function
under mild conditions; however, the training of RBFN is faster than the training
of multilayer perceptrons [4]. This fast learning speed comes from the fact that a
RBFN has just two layers of parameters (centers + widths, and weights; see Figure 1),
and each layer can be determined sequentially. This paper deals with the training
of RBF networks.
MLP are trained by supervised techniques: the weights are computed by minimi-
zing a non-linear cost function. On the contrary the training of RBF networks can be
split into an unsupervised part and a linear supervised part. Unsupervised updating
techniques are straightforward and relatively fast. Moreover, the supervised part of
the learning consists in solving a linear problem, which is therefore also fast, with the
additional benefit of avoiding the problem of local minima usually encountered
when using multilayer perceptrons [5]. The training procedure for RBFN can be
decomposed naturally into three distinct stages: (i) RBF centers are determined by
some unsupervised/clustering techniques, (ii) widths of the Gaussian kernels that
are the subject of this paper are optimized, (iii) the network weights between the
radial functions layer and the output layer are calculated.
Several algorithms and heuristics are available in the literature regarding the com-
putation of the centers of the radial functions [6–8] and the weights [2, 9]. However,
only very few papers are dedicated to the optimization of the widths of the Gaussian
kernels. In this paper we show first that the problem of fixing the widths is not trivial
(it is largely data-dependent) and secondly that it certainly depends on the dimension of
the input space. When only a small number of samples is available (which is
always the case in large-dimensional spaces), there is no other choice than an exhaus-
tive search by computation in order to optimize the widths of the Gaussian kernels.
In this paper we suggest a one-dimensional searching procedure as a compromise
between an exhaustive search on all basis function widths, and a non-optimal a
priori choice. Note that gradient descent on all RBFN parameters is still possible,
but in this case the speed advantage of RBFN learning is at least partially lost.
The paper is organized as follows. Section 2 reviews the basic principles of a
RBFN and presents the width optimization procedure. Section 3 presents simula-
tions performed on simple examples. The simulation results obtained for different
dimensions of the input space and a comparison of our approach to three commonly
accepted rules are given in Section 4. Section 5 concludes the paper.
2. Radial Basis Function Network
A RBF network is a two-layer ANN. Consider an unknown function $f(x): \mathbb{R}^d \to \mathbb{R}$.
In a regression context, RBF networks approximate f(x) by a weighted sum of
d-dimensional radial activation functions (plus linear and independent terms, see below).

Figure 1. Architecture of a radial basis function network with scalar output.

The radial basis functions are centred on well-positioned data points, called
centroids; the centroids can be regarded as the nodes of the hidden layer. The posi-
tions of the centroids and the widths of the radial basis functions are obtained by an
unsupervised learning rule. The weights of the output layer are calculated by a super-
vised process using pseudo-inverse matrices or singular value decomposition (SVD)
[2]. However, it should be noted that other authors use the gradient descent algo-
rithm to optimize the parameters of RBF network [10]. The training strategies of
‘spherical’ RBF networks will be detailed in Subsection 2.1.
Suppose we want to approximate a function f(x) with a set of M radial basis functions
$\phi_j(x)$, centred on the centroids $c_j$ and defined by:

$$\phi_j: \mathbb{R}^d \to \mathbb{R}: \quad \phi_j(x) = \phi_j(\|x - c_j\|), \qquad (1)$$

where $\|\cdot\|$ denotes the Euclidean distance, $c_j \in \mathbb{R}^d$ and $1 \le j \le M$.
The approximation of the function f(x) may be expressed as a linear combination
of the radial basis functions [11]:
$$\hat{f}(x) = \sum_{j=1}^{M} \lambda_j \, \phi_j(\|x - c_j\|) + \sum_{i=1}^{d} a_i x_i + b, \qquad (2)$$

where $\lambda_j$ are weight factors, and $a_i$, $b$ are the weights of the linear and independent
terms respectively.
A typical choice for the radial basis functions is a set of multi-dimensional Gaus-
sian kernels:
$$\phi_j(\|x - c_j\|) = \exp\left(-\frac{1}{2}(x - c_j)^T \Sigma_j^{-1} (x - c_j)\right), \qquad (3)$$

where $\Sigma_j$ is a covariance matrix.
Three cases can be considered:
1. Full covariance matrices $\Sigma_j$ are replaced by an identical scalar width $\sigma_j = \sigma$ for all
Gaussian kernels. In the literature several authors used this option [3, 12–15].
2. Full covariance matrices $\Sigma_j$ are replaced by different scalar widths $\sigma_j$ for
each Gaussian kernel j. For example, in [16–18], each scalar width $\sigma_j$ is estimated
independently.
3. Full covariance matrices $\Sigma_j$ represent the general case. Musavi et al. [19] used the
covariance matrix to estimate the width of each Gaussian kernel.
In this paper we limit our discussion to the first two cases, (1) and (2). One argu-
ment to avoid case (3) is that the number of parameters then grows drastically in
Equation 2; applications dealing with a small number of learning data are thus dif-
ficult to handle by this method. Note also that procedure (3) is very sensitive to out-
liers [20]. Of course in other situations the use of full covariance matrices may be
interesting [2, 19]. Statistical algorithms used for the estimation of parameters in
mixture modelling (such as the EM algorithm) could also be used (see for example [2]
for the estimation of centers and widths, and [21]). Nevertheless, the EM algorithm
is based on the maximization of likelihood, making it more often used for
classification and probability estimation problems. The EM algorithm is also
sensitive to outliers [22]. In fact, many authors (see for example [4]) suggest using
solution (2) and the learning strategies described below, for simplicity reasons.
In the following, we will thus concentrate on the study of so-called 'spherical' RBFN,
covering cases (1) and (2).
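To make the spherical model concrete, the following minimal Python sketch (ours, not
the authors'; function and variable names are illustrative) evaluates the output of
Equation (2) when each kernel of Equation (3) is restricted to a scalar width $\sigma_j$:

```python
import numpy as np

def rbfn_output(x, centers, widths, lambdas, a, b):
    """Spherical RBFN output: sum_j lambda_j*exp(-||x-c_j||^2/(2*sigma_j^2)) + a.x + b."""
    # x: (d,), centers: (M, d), widths: (M,), lambdas: (M,), a: (d,), b: scalar
    sq_dist = np.sum((centers - x) ** 2, axis=1)      # ||x - c_j||^2 for every kernel j
    phi = np.exp(-sq_dist / (2.0 * widths ** 2))      # Gaussian activations, Equation (3)
    return float(lambdas @ phi + a @ x + b)           # weighted sum plus linear terms, Equation (2)
```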
2.1. SPHERICAL RBFN LEARNING STRATEGIES
Once the number and the general shape of the radial basis functions $\phi_j(x)$ are chosen,
the RBF network has to be trained properly. Given a training data set T of size $N_T$,

$$T = \{(x_p, y_p) \in \mathbb{R}^d \times \mathbb{R}, \; 1 \le p \le N_T : y_p = f(x_p)\}, \qquad (4)$$
the training algorithm consists of finding the parameters $c_j$, $\sigma_j$ and $\lambda_j$, such that $\hat{f}(x)$
fits the unknown function f(x) as closely as possible. This is realised by minimising a
cost function (usually the mean square error between $\hat{f}(x)$ and f(x) on the learning
points). Often, the training algorithm is decoupled into a three-stage procedure:
- determining the centers $c_j$ of the Gaussian kernels,
- computing the widths $\sigma_j$ of the Gaussian kernels,
- computing the weights $\lambda_j$ and the independent terms $a_i$ and b.
Moody and Darken [16] proposed to use the k-means clustering algorithm to find
the location of the centroids $c_j$. Other authors use a stochastic online process (com-
petitive learning) method [7, 17], which leads to similar results, with the advantage of
being adaptive (continuous learning, even with evolving input data).
The computation of the Gaussian function widths $\sigma_j$ is the subject of this paper; it
will be detailed at the end of this section.
Once the basis function parameters are determined, the transformation between
the input data and the corresponding outputs of the hidden units is fixed. The net-
work can thus be viewed as an equivalent single-layer network with linear output
units. Minimisation of the average mean square error yields the well-known least-
square solution for the weights.
$$\lambda = \Phi^{+} y = (\Phi^T \Phi)^{-1} \Phi^T y, \qquad (5)$$

where $\lambda$, $y$ are the row vectors of the $\lambda_j$ weight factors and of the $y_p$ training data
outputs (of sizes M and $N_T$ respectively), $\Phi$ is the $N_T \times M$ matrix of values
$\Phi_{ij} = \exp(-\|x_i - c_j\|^2 / (2\sigma_j^2))$, and $\Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T$ denotes the pseudo-inverse of $\Phi$.
In practice, to avoid possible numerical difficulties due to an ill-conditioned matrix
$\Phi^T \Phi$, singular value decomposition (SVD) is usually used to find the weights [2].
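As an illustration of this linear stage, a small sketch (ours; names are illustrative) can
solve Equation (5) with an SVD-based least-squares routine rather than forming
$\Phi^T \Phi$ explicitly:

```python
import numpy as np

def fit_weights(X, y, centers, widths):
    """Least-squares weights of Equation (5); numpy's lstsq relies on an SVD internally."""
    # X: (N_T, d) training inputs, y: (N_T,) targets, centers: (M, d), widths: (M,)
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N_T, M) squared distances
    Phi = np.exp(-sq_dist / (2.0 * widths ** 2))                        # design matrix of kernel activations
    lambdas, *_ = np.linalg.lstsq(Phi, y, rcond=None)                   # SVD-based least-squares solve
    return lambdas
```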
The second stage of the training process involves the computation of the Gaussian
function widths, while fixing the degree of overlapping between the Gaussian
kernels. This allows finding a compromise between locality and smoothness of the
function $\hat{f}(x)$. We consider here both cases (1) and (2) quoted in Section 2. Case
(1) consists in taking identical widths $\sigma_j = \sigma$ for all Gaussian kernels [3, 12–15]. In
[15] for example, the widths are fixed as follows:
$$\sigma = \frac{d_{\max}}{\sqrt{2M}}, \qquad (6)$$

where M is the number of centroids and $d_{\max}$ is the maximum distance between any
pair of them. This choice would be close to the optimal solution if the data were uni-
formly distributed in the input space, leading to a uniform distribution of centroids.
Unfortunately most real-life problems show non-uniform data distributions. The
method is thus inadequate in practice, and an identical width for all Gaussian kernels
should be avoided.
If the distances between the centroids are not equal, it is better to assign a specific
width to each Gaussian kernel. For example, it would be reasonable to assign a larger
width to centroids that are widely separated from each other, and a smaller width to
closer ones [4]. Case (2) therefore consists of estimating the width of each Gaussian
kernel independently. This can be done, for example, by splitting the learning points
$x_p$ into clusters according to the Voronoi region associated to each centroid (the
Voronoi region is the part of the space nearer to a specific centroid than to any other
one), and then computing the standard deviation $\sigma_j^c$ of the distance between the
learning points in a cluster and the corresponding centroid; in reference [17] for
example, it is suggested to use an iterative procedure to estimate this standard
deviation. Moody and Darken [16], on the other hand, proposed to compute the width
factors $\sigma_j$ (the radius of kernel j) by the p-nearest neighbours heuristic:
$$\sigma_j = \left(\frac{1}{p}\sum_{i=1}^{p} \|c_j - c_i\|^2\right)^{1/2}, \qquad (7)$$

where the $c_i$ are the p nearest neighbours of centroid $c_j$. A suggested value for p is
2 [16]. Saha and Keeler [18] proposed to compute the width factors $\sigma_j$ by a nearest-
neighbour heuristic, where $\sigma_j$ (the radius of kernel j) is set to the Euclidean distance
between $c_j$ (the vector determining the centre of the jth RBF) and its nearest neigh-
bour $c_i$, multiplied by an overlap constant r:

$$\sigma_j = r \cdot \min_i \|c_j - c_i\|. \qquad (8)$$
This second class of methods offers the advantage of taking the distribution vari-
ations of the data into account. In practice, they are able to perform much better
than fixed-width methods, as they offer a greater adaptability to the data. Even so,
as we will show in Section 4, the width values given by the above rules
remain sub-optimal.
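For concreteness, here is a small Python sketch (ours; the paper gives only the
formulas) of the three width rules of Equations (6)–(8), applied to an array of
centroids of shape (M, d):

```python
import numpy as np

def widths_haykin(centers):
    """Equation (6): one common width sigma = d_max / sqrt(2M) for all kernels."""
    M = len(centers)
    d_max = max(np.linalg.norm(ci - cj) for i, ci in enumerate(centers)
                                        for cj in centers[i + 1:])
    return np.full(M, d_max / np.sqrt(2 * M))

def widths_moody_darken(centers, p=2):
    """Equation (7): RMS distance to the p nearest neighbouring centroids."""
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # exclude the centroid itself
    nearest = np.sort(dists, axis=1)[:, :p]      # p smallest distances per centroid
    return np.sqrt((nearest ** 2).mean(axis=1))

def widths_saha_keeler(centers, r=1.0):
    """Equation (8): r times the distance to the single nearest centroid."""
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    return r * dists.min(axis=1)
```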
2.2. WIDTH SCALING FACTOR OPTIMIZATION
We suggest in this subsection a procedure for the computation of the Gaussian func-
tion widths, based on an exhaustive search and belonging to the second class of
algorithms quoted in Subsection 2.1; the purpose is to show the importance of the
optimization of the Gaussian widths. We therefore select the widths in such a way as
to guarantee a natural overlap between the Gaussian kernels, preserving the local
properties of the RBFN, and at the same time to maximize the generalization ability
of the network.
First we compute the standard deviations $\sigma_j^c$ of the learning data in each cluster in
a classical way.
DEFINITION. Sigma_cluster is the empirical standard deviation of the learning
data contained in a cluster or Voronoi region associated to a centroid.
Subsequently, we determine a width scaling factor WSF, common to all Gaussian
kernels. The widths of the kernels are then defined as:

$$\forall j, \quad \sigma_j = \text{WSF} \cdot \sigma_j^c. \qquad (9)$$
Although the EM algorithm (for example [22, 23]) could be used to optimize all $\sigma_j$
simultaneously, it appears that in practical situations it is sometimes difficult to
escape from local minima, leading to non-optimal solutions. Equation (9) then offers
a compromise between the usual methods without optimization of the $\sigma_j$ and an
M-dimensional optimization of all $\sigma_j$ together.
By inserting the width factor WSF, the approximation function $\hat{f}(x)$ is smoothed
such that the generalization process is possibly improved, and an optimal overlap-
ping of the Gaussian kernels is allowed. Unfortunately, the optimal width factor
WSF depends on the function to approximate, on the dimension of the input space,
as well as on the data distribution. The optimal WSF value is thus obtained by
extensive simulations (cross-validation): the optimal value $WSF_{opt}$ is chosen as
the one minimizing the error criterion (the mean square error on a validation set)
among a set Q of possible WSF values.
When several minima appear, it is recommended to choose the one corresponding
to the smallest width scaling factor. Indeed, large WSF values have to be avoided for
reasons of complexity, reproducibility and numerical stability.
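A minimal sketch of this one-dimensional search (ours; it assumes the centers, the
per-cluster standard deviations and a validation set are already available, and the
candidate grid Q is illustrative) could look as follows:

```python
import numpy as np

def design_matrix(X, centers, widths):
    """Gaussian kernel activations Phi_pj = exp(-||x_p - c_j||^2 / (2 sigma_j^2))."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * widths ** 2))

def select_wsf(X_train, y_train, X_val, y_val, centers, sigma_cluster,
               wsf_grid=np.arange(0.5, 3.01, 0.05)):
    """One-dimensional cross-validated search over the WSF of Equation (9)."""
    best_wsf, best_mse = None, np.inf
    for wsf in wsf_grid:                                    # scan the candidate set Q
        widths = wsf * sigma_cluster                        # Equation (9)
        lam, *_ = np.linalg.lstsq(design_matrix(X_train, centers, widths),
                                  y_train, rcond=None)      # linear stage, Equation (5)
        pred = design_matrix(X_val, centers, widths) @ lam
        mse = np.mean((y_val - pred) ** 2)                  # MSE on the validation set
        if mse < best_mse:                                  # strict '<' keeps the smallest WSF on ties
            best_wsf, best_mse = wsf, mse
    return best_wsf, best_mse
```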
3. Simulations
We consider a simple example, i.e. we try to approximate the constant unity function
($y_p = 1$) on a d-dimensional hypercube domain $[0, 10]^d$.
It must be mentioned here that this problem is purely theoretical: there is no inter-
est in approximating such a linear (and even constant!) function by a RBFN. If the
RBFN model in Equation 2 is used to approximate this function, all weights $\lambda_j$ mul-
tiplying the Gaussian kernels should be equal to zero. In order to reach the goal of
this paper, i.e. to have insights about the optimal values of the kernel widths, the lin-
ear and constant terms were removed from Equation 2 in the simulations. Neverthe-
less, the objective of this paper is to evaluate the variances of the Gaussian kernels
with respect to the dimension of the input space. In order to avoid the consequences
of the other parts of the RBFN training algorithm, we chose to work with a constant
target function, in order to remove the influence of its variations from our conclu-
sions. Simulations were performed for different dimensions d, in order to see the
influence of the dimension on the results.
For all simulations presented in this paper, the density of learning points is uni-
form in the d-dimensional input space. For this reason, the traditional vector quan-
tization (VQ) step in the RBFN learning process is skipped; the centroids are
attached to the nodes of a square grid in dimension d. The goal of this setting is
to eliminate the influence of the VQ results on the simulations. It is well known that
the placement of centroids on the nodes of a square grid is not the ideal result of a
vector quantization when $d \geq 2$. For example, it has been demonstrated that in
dimension d = 2, an ideal vector quantization on a uniform density gives a honeycomb
(hexagonal) lattice, and not a square grid, as shown in Figure 2 [7]. Never-
theless, it can be shown through a simple calculation that the quantization error
obtained with the square grid (Figure 2a) is only about 4% higher than the one
obtained with the ideal result (Figure 2b).
As this ideal result is not known in dimensions greater than 2, the assumption is
made that the results obtained by placing the centroids on a square grid are a good
approximation of those that would be obtained with a true vector quantization.
Once the centroids are placed on a regular grid, the next subsection shows a theo-
retical way to calculate the optimal width, by considering that all the weights are
identical in Equation 2. Next, in Subsection 3.2, the optimal width will be estimated
by setting all weights $\lambda_j$ free and calculating them according to Equation 5.
Figure 2. (a) scalar quantization (square grid), (b) vector quantization (honeycomb lattice).
3.1. THEORETICAL VALUE OF THE OPTIMAL WIDTH OF THE GAUSSIAN KERNELS
As mentioned above, the centroids are placed on a regular grid, and the function to
be approximated is constant (y = constant); therefore it is expected that the weights
$\lambda_j$ in Equation 2 will be identical for all centroids. For a theoretical calculation of the
optimal WSF coefficient, we will make this assumption and further suppose that
their values are equal to 1. Then, we calculate by Equation 2 (without linear
and constant terms) the theoretical output function of the network, and this for vari-
ous values of $\sigma_j$; again, as the centroids are placed on a regular grid, we will suppose
that all $\sigma_j$ values are identical. The goal is to find the value of $\sigma_j$ giving the 'flat-
test' possible output function $\hat{f}(x)$. This function will not be around 1 (there is no
reason for it, since we chose the $\lambda_j$ equal to 1), but rather around another average
value $\mu$. Taking $\lambda_j = 1$ does not change anything to the problem: if the $\lambda_j$ were set to
$\lambda_j = 1/\mu$, we would have found an output function with an average value of 1, which
was the initial problem. Nevertheless, the two problems obviously lead to the same
conclusions regarding the widths $\sigma_j$.
Note that, $\sigma^c = \sigma_j^c$ ($\forall j \in [1, \ldots, M]$) being constant over all clusters, it is equivalent
by Equation 9 to find an optimal value of $\sigma$ or an optimal value of WSF. In the follow-
ing section, we will estimate optimal values of $\sigma$, in order to make possible the
comparison with other methods from the literature (Section 4).
For each value of $\sigma$, to find the mean value $\mu$ we simply take the mean of the out-
put function $\hat{f}(x)$. To quantify the 'flatness' of the output function, we calculate its
standard deviation $std_y$ around the mean value $\mu$. It should be mentioned here that,
in order to avoid as much as possible the border effects encountered when using
RBFN, the mean and the standard deviation of $\hat{f}(x)$ are taken only in the middle
of the distribution, i.e. in the compact $[3.85, 6.05]^d$. For each dimension, the $\sigma$ giving
the flattest function $\hat{f}(x)$ is called sigma_theo.
DEFINITION. Identical Gaussian kernels with unit weights ($\lambda_j = 1$) are summed
for various $\sigma$ values:

$$\hat{f}(x) = \sum_{j=1}^{M} \exp\left(-\frac{\|x - c_j\|^2}{2\sigma^2}\right),$$

where M is the number of centroids. Sigma_theo is the $\sigma$ value corresponding to the
smallest standard deviation of $\hat{f}(x)$.
As an example, Figure 3 gives the standard deviation $std_y$ of $\hat{f}(x)$ according to $\sigma$ in
dimension 1, and Figure 4 gives the same result in dimension 2.
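The following sketch (ours; the candidate $\sigma$ grid, the number of evaluation points and
the indexing convention for the centroid positions are our assumptions) reproduces
this flatness search on the regular grid of $9^d$ centroids used in the paper:

```python
import itertools
import numpy as np

def sigma_theo(d, sigma_grid=np.arange(0.1, 3.01, 0.01), n_eval=2000, seed=0):
    """Find the sigma giving the flattest sum of unit-weight Gaussian kernels."""
    # 9 centroid positions per axis with spacing a = 1.1 (indexing convention ours)
    axis = 0.55 + 1.1 * np.arange(9)
    centers = np.array(list(itertools.product(axis, repeat=d)))   # 9^d grid points
    # evaluation points restricted to the central compact [3.85, 6.05]^d (border effects)
    rng = np.random.default_rng(seed)
    X = rng.uniform(3.85, 6.05, size=(n_eval, d))
    # pairwise squared distances; memory grows as 9^d * n_eval, practical for small d only
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    stds = [np.exp(-sq_dist / (2.0 * s ** 2)).sum(axis=1).std() for s in sigma_grid]
    return sigma_grid[int(np.argmin(stds))]
```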
3.2. EXPERIMENTAL VALUE OF THE OPTIMAL WIDTH OF THE GAUSSIAN KERNELS
In this second set of simulations, we still consider centroids placed on a regular grid,
but without the assumption that all ljweights will be identical. On the contrary, all
weights are set free and we calculate them according to Equation 5 (using Singular
Value Decomposition). As in the previous section, we repeat the experiment for a
large set of possible values of $\sigma$ (identical for all Gaussian kernels), and for several
dimensions d of the input space. If the principle of 'locality' of the Gaussian kernels
is respected and the border effects are neglected, we should expect identical $\lambda_j$. In
practice, this is not the case, mainly because of the border effects, as shown in
Figure 5a. As in Subsection 3.1, we only used the Gaussian kernels in the centre of
the distribution (see Figure 5b) in order to decrease the influence of the border effects.
After the best-fit function is calculated, the performance of the RBF network is
estimated by computing an error criterion. Consider a validation data set V, contain-
ing $N_V$ data points:

$$V = \{(x_q, y_q) \in \mathbb{R}^d \times \mathbb{R}, \; 1 \le q \le N_V : y_q = f(x_q)\}. \qquad (10)$$

The error criterion can be chosen as the mean square error:

$$MSE_V = \frac{1}{N_V}\sum_{q=1}^{N_V} \left(y_q - \hat{f}(x_q)\right)^2, \qquad (11)$$
Figure 3. $std_y$ according to $\sigma$ in dimension 1.
Figure 4. $std_y$ according to $\sigma$ in dimension 2.
where $y_q$ are the desired outputs. The minimum of the mean square error ($MSE_V$)
now gives another value of sigma, called sigma_exp.
DEFINITION. Identical Gaussian kernels are summed for various $\sigma$ values:

$$\hat{f}(x) = \sum_{j=1}^{M} \lambda_j \exp\left(-\frac{\|x - c_j\|^2}{2\sigma^2}\right),$$

where M is the number of centroids. Sigma_exp is the $\sigma$ value corresponding to the
smallest $MSE_V$.
Figure 6(a) gives $MSE_V$ according to $\sigma$ in dimension 1 and Figure 6(b) gives the
same result in dimension 3. Figure 6 shows the presence of two minima. The first
one corresponds to a local decomposition of the function into a sum of Gaussian ker-
nels; this interpretation is consistent with the classical RBF approach. However, the
second one corresponds to a non-local decomposition of the function. As a conse-
quence, the weights $\lambda_j$ turn out to be enormous in absolute value (positive or
negative) in order to compensate for the non-flat slopes. This leads to a greater
complexity of the RBFN. In addition, large $\lambda_j$ dramatically increase numerical
instability. The optimal value chosen for $\sigma$ is therefore the one related to the smallest
minimum (the first, local one).

Figure 5. (a) with border effects, (b) without border effects.
Figure 6. (a) $MSE_V$ versus $\sigma$ in dimension 1. (b) $MSE_V$ versus $\sigma$ in dimension 3.
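A sketch of this experiment (ours; sample sizes and the $\sigma$ grid are illustrative) sweeps
$\sigma$, fits the weights by least squares for each value, and records the validation MSE
curve whose minima are discussed above:

```python
import numpy as np

def mse_curve(X_train, y_train, X_val, y_val, centers, sigma_grid):
    """Validation MSE of Equation (11) as a function of a common kernel width sigma."""
    d_tr = ((X_train[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    d_va = ((X_val[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    curve = []
    for s in sigma_grid:
        Phi_tr = np.exp(-d_tr / (2.0 * s ** 2))
        Phi_va = np.exp(-d_va / (2.0 * s ** 2))
        lam, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)   # free weights, Equation (5)
        curve.append(np.mean((y_val - Phi_va @ lam) ** 2))       # Equation (11)
    # sigma_exp is read off the curve as the sigma of the preferred (smallest-sigma) minimum
    return np.array(curve)
```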
4. Results
The simulations were made on databases of points distributed uniformly in a hyper-
cube of edge lengths equal to 10, in various dimensions d. The number of centroids is
chosen equal to $9^d$. Other simulations made with a different number of centroids
gave similar results. The number of training and test points is chosen sufficiently
large to avoid (as much as possible) falling into the difficulties due to the empty space
phenomenon (between 1000 and 50000 training points according to the dimension of
the input space). These restrictions have limited the simulations to a dimension of
input space equal to 5.
Table 1 gives the results. In order to obtain values independent from the number
of centroids, the 'Width Scaling Factors' WSF_theo and WSF_exp are defined as
being the ratio between sigma_theo and sigma_cluster on the one hand, and between
sigma_exp and sigma_cluster on the other hand, respectively. Indeed it is more
appropriate to compare results on the scale-independent WSF coefficient instead of
the values of $\sigma$, for two reasons:

- most results in the literature are based on the sigma_cluster value, making the
use of WSF easier for comparisons;
- the WSF values are independent from the number of centroids, while those of $\sigma$
are not.
Table 1. WSF_theo and WSF_exp according to the dimension of the input space.

Dim   Sigma_cluster   Sigma_theo   Sigma_exp   WSF_theo   WSF_exp
1     0.3175          0.92         0.5715      2.897      1.8
2     0.4490          0.92         0.5837      2.048      1.30
3     0.5499          0.92         0.5506      1.673      1.01
4     0.6350          0.92         0.5676      1.448      0.89
5     0.7099          0.92         0.5471      1.2959     0.77

Several comments result from Table 1:

- sigma_cluster is proportional to the square root of the dimension, as shown by a
simple analytical calculation (a short derivation is sketched after this list):

$$\sigma_{cluster} = \frac{\sqrt{d}\, a}{2\sqrt{3}}, \qquad (12)$$

where a is the length of an edge of the hypercube corresponding to the Vor-
onoi zone of a centroid. In the simulations, the $9^d$ centroids are placed a priori
at the positions $0.55 + k \cdot 1.1$ (with $1 \le k \le 9$) measured on each axis of the input
space; a is thus equal to 1.1.
- We notice that sigma_theo does not depend on the dimension of the input
space. Therefore, WSF_theo is inversely proportional to the square root of
the dimension of the input space.
- We also notice that the sigma_exp values are systematically lower, by about 30 to
35%, than the sigma_theo values. This is due to the increased freedom given to
the network coefficients by allowing weight variations rather than fixing them.
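The 'simple analytical calculation' behind Equation (12) can be sketched as follows
(our reconstruction, assuming the points of a cluster are uniformly distributed over
the hypercubic Voronoi cell of edge a centred on the centroid): each coordinate of
$x - c_j$ is then uniform on $[-a/2, a/2]$, with variance $a^2/12$, so that

$$\sigma_{cluster}^2 = E\left[\|x - c_j\|^2\right] = \sum_{i=1}^{d} E\left[(x_i - c_{j,i})^2\right] = d\,\frac{a^2}{12}
\quad\Longrightarrow\quad \sigma_{cluster} = \frac{\sqrt{d}\, a}{2\sqrt{3}}.$$

With a = 1.1 this gives 0.3175 for d = 1 and 0.7099 for d = 5, in agreement with the
Sigma_cluster column of Table 1.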
We also compared our method (calculation of sigma_exp) to the three approaches
of Moody & Darken [16], S. Haykin [15] and A. Saha & J. D. Keeler [18], quoted in
Section 2. Figures 7, 8 and 9 illustrate the mean square error obtained according to
sigma for different dimensions d ($1 \le d \le 3$) with the various calculation methods of
$\sigma$. We notice here that, whatever the dimension of the input space, we always find
two minima. Figures 10, 11 and 12 clearly show that the choice of the sigma value
has a great influence on the local character of the decomposition of the function to
approximate (in dimension 1) into a sum of Gaussian kernels.

Figure 7. $MSE_V$ according to sigma in dimension 1 with the various calculation methods of $\sigma$.
Figure 8. $MSE_V$ according to sigma in dimension 2 with the various calculation methods of $\sigma$.
Figure 9. $MSE_V$ according to sigma in dimension 3 with the various calculation methods of $\sigma$.
Figure 10. Local decomposition of y = 1 with sigma_exp = 0.5715.
Figure 11. Local decomposition of y = 1 with sigma_Moody = 0.7778.
Figure 12. Non-local decomposition of y = 1 with sigma_Haykin = 2.0742.
5. Conclusion
This paper gives some insights about the choice of the Gaussian kernel widths to be
used during the training of spherical RBF networks. Indeed, a major part of the
literature in the field of RBFN covers the optimization of the positions of the
Gaussian kernels and of the multiplicative weights. On the other hand, the choice of
their widths is often based on heuristics, without real theoretical justification.
In this paper, we first show the importance of the choice of the kernel widths. In
many situations, a bad choice can lead to an approximation error definitely higher
than the optimum, sometimes by several orders of magnitude. Then, we show by
two types of simulations that a classic choice (taking the width of the Gaussian kernels
equal to the standard deviation of the points in a cluster) is certainly not optimal.
For example, in simulations in dimension 1, it appears that the width should be
twice this value. It is then suggested to optimize the widths; an example of a one-
dimensional optimization procedure is presented in this paper, through the use of
a multiplying factor applied to the widths. Finally, we show that the dimension of
the data space has an important influence on the choice of $\sigma$. In particular, the
multiplicative correction that must be applied to the standard deviation of the points
in a cluster is shown to be inversely proportional to the square root of the dimension
of the input space.
Simulations on real databases (see for example [24]) show, similarly to the curves
illustrated in this paper, a strong dependency of the approximation error with respect
to the width (and Width Scaling Factor) of the Gaussian kernels. A similar methodo-
logy can thus be applied to choose optimum widths, despite the fact that their numer-
ical values depend on the function to approximate.
The results show the need for greater attention to be given to the optimization of
the widths of the Gaussian kernels in RBF networks, and to the development of
methods allowing these widths to be fixed according to the problem at hand without
using an exhaustive search.
Acknowledgements
Michel Verleysen is Senior Research Associate of the Belgian F.N.R.S. (National
Fund For Scientific Research).
References
1. Park, J. and Sandberg, I.: Approximation and radial basis function networks, Neural
Comput. 5(1993), 305–316.
2. Bishop, C. M.: Neural Networks for Pattern Recognition, Oxford university press, 1995.
3. Park, J. and Sandberg, I. W.: Universal approximation using radial-basis-function
networks, Neural Comput. 3(1991), 246–257.
4. Young-Sup Hwang and Sung-Yang Bang.: An efficient method to construct a radial basis
function neural network classifier, Neural Networks,10(8) (1997), 1495–1503.
5. Robert J. Howlett and Lakhmi C. Jain, Radial Basis Function Networks 2: New Advan-
ces in Design, Physica-Verlag: Heidelberg, 2001.
6. Ahalt, S. C. and Fowler, J. E.: Vector quantization using artificial neural networks mod-
els. In Proceedings of the International Workshop on Adaptive Methods and Emergent Tech-
niques for Signal Processing and Communications, June 1993, pp. 42–61.
7. Gersho, A. and Gray, R. M.: Vector Quantization and Signal Compression, Kluwer
International Series in Engineering and Computer Science, Norwell, Kluwer Academic
Publishers, 1992.
8. David Sanchez, V. A.: Second derivative dependent placement of RBF centers, Neurocom-
puting 7(3) (1995), 311–317.
9. Omohundro, S. M.: Efficient algorithms with neural networks behavior, Complex Systems
1(1987), 273–347.
10. Verleysen, M. and Hlaváčková, K.: An optimized RBF network for approximation of
functions, In: European Symposium on Artificial Neural Networks (ESANN 94), pp. 175–
180, Brussels, April 20-21-22, 1994.
11. Tomaso Poggio and Federico Girosi: Networks for approximation and learning, Proceed-
ings of the IEEE,78(9) (1990), 1481–1497.
12. Orr, M. J.: Introduction to Radial Basis Functions Networks, Technical reports, April
1996, www.anc.ed.ac.uk/mjo/papers/intro.ps.
13. David Sanchez, V. A.: On the number and the distribution of RBF centers, Neurocomput-
ing 7(2) (1995), 197–202.
14. Chen, S. and Billings, S. A.: Neural networks for nonlinear dynamic system modelling and
identification, Int. J. Control,56(2) (1992), 319–346.
15. Haykin, S.: Neural Networks a Comprehensive Foundation, Prentice-Hall Inc, second
edition, 1999.
16. Moody, J. and Darken, C. J.: Fast learning in networks of locally-tuned processing units,
Neural Comput. 1(1989), 281–294.
17. Verleysen, M. and Hlaváčková, K.: Learning in RBF Networks, International Conference
on Neural Networks (ICNN), Washington, DC, June 3–9 (1996), pp. 199–204.
18. Saha, A. and Keeler, J. D.: Algorithms for Better Representation and Faster Learning in
Radial Basis Function Networks, Advances in Neural Information Processing Systems 2,
Edited by David S. Touretzky, pp. 482–489, 1989.
19. Musavi, M. T., Ahmed, W., Chan, K. H., Faris, K. B. and Hummels, D. M.: On the
Training of radial basis function classifiers, Neural Networks,5(1992), 595–603.
20. Ripley, B. D.: Pattern Recognition and Neural Network, Cambridge University Press, first
edition, 1996.
21. Lázaro, M., Santamaría, I. and Pantaleón, C.: A new EM-based training algorithm for
RBF networks, Neural Networks, 16 (2003), 69–77.
22. Archambeau, C., Lee, J. and Verleysen, M.: On convergence problems of the EM algo-
rithm for finite Gaussian mixtures, In: European Symposium on Artificial Neural Networks
(ESANN’2003), pp. 99–104, Bruges, April 23-24-25, 2003.
23. Xu, L. and Jordan, M. I.: On convergence properties of the EM algorithm for Gaussian
mixtures, Neural Computation,8(1) (1996), 129–151.
24. Benoudjit, N., Archambeau, C., Lendasse, A., Lee, J. and Verleysen, M.: Width optimi-
zation of the Gaussian kernels in radial basis function networks, In: European Symposium
on Artificial Neural Networks (ESANN’2002), pp. 425–432, Bruges, April 24-25-26, 2002.