
Neural Networks 23 (2010) 89–107


Clustering: A neural network approach

K.-L. Du∗

Department of Electrical and Computer Engineering, Concordia University, 1455 de Maisonneuve West, Montreal, Canada, H3G 1M8

Article info

Article history:

Received 10 September 2007

Accepted 13 August 2009

Keywords:

Clustering

Neural network

Competitive learning

Competitive learning network

Vector quantization

Abstract

Clustering is a fundamental data analysis method. It is widely used for pattern recognition, feature

extraction, vector quantization (VQ), image segmentation, function approximation, and data mining.

As an unsupervised classification technique, clustering identifies some inherent structures present in

a set of objects based on a similarity measure. Clustering methods can be based on statistical model

identification (McLachlan & Basford, 1988) or competitive learning. In this paper, we give a comprehensive

overview of competitive learning based clustering methods. Importance is attached to a number of

competitive learning based clustering neural networks such as the self-organizing map (SOM), the

learning vector quantization (LVQ), the neural gas, and the ART model, and clustering algorithms such as

the C-means, mountain/subtractive clustering, and fuzzy C-means (FCM) algorithms. Associated topics

such as the under-utilization problem, fuzzy clustering, robust clustering, clustering based on non-

Euclidean distance measures, supervised clustering, hierarchical clustering as well as cluster validity are

also described. Two examples are given to demonstrate the use of the clustering methods.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Vector quantization (VQ) is a classical method for approximating a continuous probability density
function (PDF) $p(x)$ of the vector variable $x \in \mathbb{R}^n$ by using a finite number of prototypes.
A set of feature vectors $x$ is represented by a finite set of prototypes
$\{c_1, \ldots, c_K\} \subset \mathbb{R}^n$, referred to as the codebook. Codebook design

can be performed by using clustering. Once the codebook is spec-

ified, approximation of x is to find the reference vector c from the

codebook that is closest to x (Kohonen, 1989, 1997). This is the

nearest-neighbor paradigm, and the procedure is actually simple

competitive learning (SCL).
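The nearest-neighbor encoding described above can be sketched in a few lines. This is only an illustrative sketch: the function names `encode` and `mean_quantization_error`, and the toy codebook and samples, are assumptions made here and do not come from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def encode(x, codebook):
    """Return the index of the codebook vector closest to x."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(x, codebook[i]))


def mean_quantization_error(samples, codebook):
    """Sample average of the squared distance to the nearest prototype."""
    total = sum(squared_distance(x, codebook[encode(x, codebook)])
                for x in samples)
    return total / len(samples)


# Hypothetical toy data: two prototypes in the plane and three samples.
codebook = [(0.0, 0.0), (1.0, 1.0)]
samples = [(0.1, -0.1), (0.9, 1.2), (0.4, 0.4)]
indices = [encode(x, codebook) for x in samples]  # [0, 1, 0]
```

The sample average computed by `mean_quantization_error` is an empirical stand-in for the expected squared quantization error, with the density replaced by the observed samples.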

The codebook can be designed by minimizing the expected squared quantization error

$$E = \int \|x - c\|^2 p(x)\, dx \tag{1}$$

where $c$ is a function of $x$ and $c_i$. Given the sample $x_t$, an iterative approximation scheme
for finding the codebook is derived by Kohonen (1997)

$$c_i(t+1) = c_i(t) + \eta(t)\, \delta_{wi}\, [x_t - c_i(t)] \tag{2}$$

where the subscript $w$ corresponds to the winning prototype, which is the prototype closest to $x_t$,
$\delta_{wi}$ is the Kronecker delta, with $\delta_{wi}$ taking 1 for $w = i$ and 0 otherwise, and
$\eta > 0$ is a small learning rate that satisfies the classical Robbins–Monro conditions, that is,
$\sum \eta(t) = \infty$ and $\sum \eta^2(t) < \infty$. Typically, $\eta$ is selected to be decreasing
monotonically in time. For example, one can select $\eta(t) = \eta_0 \left(1 - \frac{t}{T}\right)$,
where $\eta_0 \in (0, 1]$ and $T$ is the iteration bound. This is the SCL based VQ.

Voronoi tessellation, also called Voronoi diagram, is useful for demonstrating VQ results. The space
is partitioned into a finite number of regions bordered by hyperplanes. Each region is represented by
a codebook vector, which is the nearest neighbor to any point within the region. All vectors in each
region constitute a Voronoi set. For a smooth underlying probability density $p(x)$ and a large $K$,
all regions in an optimal Voronoi partition have the same within-region variance $\sigma_k$
(Gersho, 1979).

Given a competitive learning based clustering method, learning is first conducted to adjust the
algorithmic parameters; after the learning phase is completed, the network is ready for
generalization. When a new input pattern $x$ is presented to the map, the map gives the corresponding
output $c$ based on the nearest-neighborhood rule. Clustering is a fundamental data analysis method,
and is widely used for pattern recognition, feature extraction, VQ, image segmentation, and data
mining. In this paper, we provide a comprehensive introduction to clustering. Various clustering
techniques based on competitive learning are described. The paper is organized as follows. In
Section 2, we give an introduction to competitive learning. In Section 3, the Kohonen network and the
self-organizing map (SOM) are treated. Section 4 is dedicated to learning vector quantization (LVQ).
Sections 5–7 deal with the C-means, mountain/subtractive, and neural gas clustering methods,
respectively. ART and ARTMAP models are treated in Section 8.

This work was supported by the NSERC of Canada.

∗ Tel.: +1 514 8482424x7015.

E-mail addresses: kldu@ieee.org, kldu@ece.concordia.ca.

0893-6080/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.

doi:10.1016/j.neunet.2009.08.007


Fig. 1. Architecture of the competitive learning network. The output selects one of
the prototypes $c_i$ by setting $y_i = 1$ and all $y_j = 0$, $j \neq i$.

Fuzzy clustering is described in Section 9, and supervised clustering is described in Section 10.
In Section 11, the under-utilization problem and strategies for avoiding it are discussed.
Robust clustering is treated in Section 12, and clustering using non-Euclidean distance measures
is covered in Section 13. Hierarchical clustering and its hybridization with partitional clustering
are described in Section 14. Constructive clustering methods and other clustering methods are
introduced in Sections 15 and 16, respectively. Some cluster validity criteria are given in
Section 17. Two examples are given in Section 18 to demonstrate the use of the clustering methods.
We conclude with a summary in Section 19.

2. Competitive learning

Competitive learning can be implemented using a two-layer

(J–K) neural network, as shown in Fig. 1. The input and output lay-

ers are fully connected. The output layer is called the competition

layer, wherein lateral connections are used to perform lateral inhi-

bition.

Based on the mathematical statistics problem called cluster

analysis, competitive learning is usually derived by minimizing the

mean squared error (MSE) functional (Tsypkin, 1973)

$$E = \frac{1}{N} \sum_{p=1}^{N} E_p \tag{3}$$

$$E_p = \sum_{k=1}^{K} \mu_{kp} \left\| x_p - c_k \right\|^2 \tag{4}$$

where $N$ is the size of the pattern set, and $\mu_{kp}$ is the connection weight assigned to
prototype $c_k$ with respect to $x_p$, denoting the membership of pattern $p$ into cluster $k$.
When $c_k$ is the closest (winning) prototype to $x_p$ in the Euclidean metric, $\mu_{kp} = 1$;
otherwise $\mu_{kp} = 0$. The SCL is derived by minimizing (3) under the assumption that the weights
are obtained by the nearest prototype condition. Thus

$$E_p = \min_{1 \le k \le K} \left\| x_p - c_k \right\|^2 \tag{5}$$

which is the squared Euclidean distance between the input $x_p$ and its closest prototype $c_k$.

Based on the criterion (4) and the gradient-descent method, assuming $c_w = c_w(t)$ to be the
winning prototype of $x = x_t$, we get the SCL as

$$c_w(t+1) = c_w(t) + \eta(t) [x_t - c_w(t)] \tag{6}$$

$$c_i(t+1) = c_i(t), \quad i \neq w \tag{7}$$

where $\eta(t)$ can be selected according to the Robbins–Monro conditions. The process is known as
winner-take-all (WTA). The WTA mechanism plays an important role in most unsupervised learning
networks. If each cluster has its own learning rate as $\eta_i = \frac{1}{N_i}$, $N_i$ being the
number of samples assigned to the $i$th cluster, the algorithm achieves the minimum output variance
(Yair, Zeger, & Gersho, 1992).
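As an illustration of the winner-take-all rule (6) and (7), the following minimal sketch trains a small codebook in plain Python. The function name `scl`, its default parameters, the initialization from randomly chosen samples, and the linearly decaying rate $\eta(t) = \eta_0 (1 - t/T)$ are assumptions chosen for the example, not a prescription from the cited works.

```python
import random


def scl(samples, k, eta0=0.5, epochs=20, seed=0):
    """Simple competitive learning: only the winning prototype is moved."""
    rng = random.Random(seed)
    samples = list(samples)
    # Assumption: prototypes start at k distinct, randomly chosen samples.
    prototypes = [list(x) for x in rng.sample(samples, k)]
    total_steps = epochs * len(samples)
    t = 0
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        for x in order:
            eta = eta0 * (1.0 - t / total_steps)  # decreasing learning rate
            # Winner w: the prototype closest to x in the Euclidean metric.
            w = min(range(k),
                    key=lambda i: sum((xj - cj) ** 2
                                      for xj, cj in zip(x, prototypes[i])))
            # Rule (6): move the winner toward x; rule (7): others unchanged.
            prototypes[w] = [cj + eta * (xj - cj)
                             for xj, cj in zip(x, prototypes[w])]
            t += 1
    return prototypes
```

Replacing the global rate by a per-prototype rate $1/N_i$, with $N_i$ counting the wins of prototype $i$, would turn each prototype into the running mean of the samples it has won.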

Many WTA models were implemented based on the continuous-time Hopfield network topology
(Dempsey & McVey, 1993; Majani, Erlanson, & Abu-Mostafa, 1989; Sum et al., 1999; Tam, Sum, Leung,
& Chan, 1996), or based on the cellular neural network (CNN) (Chua & Yang, 1988) model with linear
circuit complexity (Andrew, 1996; Seiler & Nossek, 1993). There are also some circuits for realizing
the WTA function (Lazzaro, Lyckebusch, Mahowald, & Mead, 1989; Tam et al., 1996).
k-winners-take-all (k-WTA) is a process of selecting the k largest components from an N-dimensional
vector. It is a key task in decision making, pattern recognition, associative memories, or
competitive learning networks. k-WTA networks are usually based on the continuous-time Hopfield
network (Calvert & Marinov, 2000; Majani et al., 1989; Yen, Guo, & Chen, 1998), and k-WTA circuits
(Lazzaro et al., 1989; Urahama & Nagao, 1995) can be implemented using the Hopfield network based
on the penalty method and have infinite resolution.
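The input-output behavior of k-WTA can be stated compactly in software. The sketch below only illustrates the selection itself by sorting; it does not model the Hopfield-network or circuit realizations cited above, and the function name is hypothetical.

```python
def k_winners_take_all(values, k):
    """Return a 0/1 vector marking the k largest components of values."""
    # Indices sorted by decreasing value; the first k are the winners.
    winners = sorted(range(len(values)),
                     key=lambda i: values[i], reverse=True)[:k]
    output = [0] * len(values)
    for i in winners:
        output[i] = 1
    return output


result = k_winners_take_all([0.2, 0.9, 0.1, 0.7], 2)  # [0, 1, 0, 1]
```

With k = 1 this reduces to the WTA selection used by the SCL.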

3. The Kohonen network

Von der Malsburg's model (von der Malsburg, 1973) and Kohonen's self-organizing map (SOM)
(Kohonen, 1982, 1989) are two topology-preserving competitive learning models that are inspired
by the cortex of mammals. The SOM is popular for VQ, clustering analysis, feature extraction, and
data visualization.

The Kohonen network has the same structure as the competitive learning network. The output layer
is called the Kohonen layer. Lateral connections are used as a form of feedback whose magnitude is
dependent on the lateral distance from a specific neuron, which is characterized by a neighborhood
parameter. The Kohonen network defined on $\mathbb{R}^n$ is a one-, two-, or higher-dimensional grid
$A$ of neurons characterized by prototypes $c_k \in \mathbb{R}^n$ (Kohonen, 1989, 1990). Input
patterns are presented sequentially through the input layer, without specifying the desired output.
The Kohonen network is called the SOM when the lateral feedback is more sophisticated than the WTA
rule. For example, the lateral feedback used in the SOM can be selected as the Mexican hat function,
which is found in the visual cortex. The SOM is more successful in classification and pattern
recognition.

3.1. The self-organizing map

The SOM computes the Euclidean distance of the input pattern $x$ to each neuron $k$, and finds the
winning neuron, denoted neuron $w$ with prototype $c_w$, using the nearest-neighbor rule. The winning
node is called the excitation center.

For all the input vectors that are closest to $c_w$, update all the prototype vectors by the Kohonen
learning rule (Kohonen, 1990)

$$c_k(t+1) = c_k(t) + \eta(t) h_{kw}(t) \left[ x_t - c_k(t) \right], \quad k = 1, \ldots, K \tag{8}$$

where $\eta(t)$ satisfies the Robbins–Monro conditions, and $h_{kw}(t)$ is the excitation response or
neighbor function, which defines the response of neuron $k$ when $c_w$ is the excitation center. If
$h_{kw}(t)$ takes $\delta_{kw}$, (8) reduces to the SCL. $h_{kw}(t)$ can be selected as a function
that decreases with the increasing distance between $c_k$ and $c_w$, and typically as the Gaussian
function

$$h_{wk}(t) = h_0 e^{-\frac{\|c_k - c_w\|^2}{\sigma^2(t)}} \tag{9}$$

where the constant $h_0 > 0$, $\sigma(t)$ is a decreasing function of $t$ with a popular choice,
$\sigma(t) = \sigma_0 e^{-t/\tau}$, $\sigma_0$ being a positive constant and $\tau$ a time constant
(Obermayer, Ritter, & Schulten, 1991). The Gaussian function is biologically more reasonable than a
rectangular one. The SOM using the Gaussian neighborhood converges more quickly than that using a
rectangular one (Lo & Bavarian, 1991).
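The Kohonen learning rule with a Gaussian neighborhood can be sketched for a one-dimensional lattice as follows. Several choices here are assumptions made for illustration: the neighborhood is evaluated on the lattice index distance $|k - w|$, as is common in SOM implementations, rather than on the prototype distance; the learning rate decays linearly; and the function name, defaults, and random initialization in the unit cube are hypothetical.

```python
import math
import random


def train_som(samples, k, eta0=0.5, sigma0=2.0, tau=50.0,
              steps=200, h0=1.0, seed=0):
    """Train a 1-D SOM with k neurons using a Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Assumption: prototypes are initialized uniformly in the unit cube.
    prototypes = [[rng.random() for _ in range(dim)] for _ in range(k)]
    for t in range(steps):
        x = rng.choice(samples)
        eta = eta0 * (1.0 - t / steps)           # decreasing learning rate
        sigma = sigma0 * math.exp(-t / tau)      # shrinking neighborhood width
        # Excitation center: the neuron whose prototype is closest to x.
        w = min(range(k),
                key=lambda i: sum((a - c) ** 2
                                  for a, c in zip(x, prototypes[i])))
        for i in range(k):
            # Gaussian response, here as a function of lattice distance.
            h = h0 * math.exp(-((i - w) ** 2) / sigma ** 2)
            prototypes[i] = [c + eta * h * (a - c)
                             for a, c in zip(x, prototypes[i])]
    return prototypes
```

Early in training the wide neighborhood drags neighboring neurons along with the winner, which is what orders the map; as the width shrinks, the update approaches the SCL.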

$c_k(0)$ can be selected as random values, or from available samples, or any ordered initial state.
The algorithm terminates when the map achieves an equilibrium with a given accuracy or when a
specified number of iterations is reached. In the convergence phase, $h_{wk}$ can be selected as
time-invariant, and each prototype can be updated by using an individual learning rate $\eta_k$
(Kohonen, 1997)

$$\eta_k(t+1) = \frac{\eta_k(t)}{1 + h_{wk} \eta_k(t)}. \tag{10}$$

Normalization of $x$ is suggested since the resulting reference vectors tend to have the same
dynamic range. This may improve the numerical accuracy (Kohonen, 1990).

The SOM (Kohonen, 1989) is a clustering network with a set of heuristic procedures: it is not based
on the minimization of any known objective function. It suffers from several major problems, such as
forced termination, unguaranteed convergence, non-optimized procedure, and the output being often
dependent on the sequence of data. The Kohonen network is closely related to the C-means clustering
(Lippman, 1987). There are some proofs for the convergence of the one-dimensional SOM based on the
Markov chain analysis (Flanagan, 1996), but no general proof of convergence for multi-dimensional
SOM is available (Flanagan, 1996; Kohonen, 1997).

The SOM performs clustering while preserving topology. It is useful for VQ, clustering, feature
extraction, and data visualization. The Kohonen learning rule is a major development of competitive
learning. The SOM is related to adaptive C-means, but performs a topological feature map which is
more complex than just cluster analysis. After training, the input vectors are spatially ordered in
the array. The Kohonen learning rule provides a codebook in which the distortion effects are
automatically taken into account. The SOM is especially powerful for the visualization of
high-dimensional data. It converts complex, nonlinear statistical relations between
high-dimensional data into simple geometric relations at a low-dimensional display. The SOM can be
used to decompose complex information processing systems into a set of simple subsystems
(Gao, Ahmad, & Swamy, 1991). A fully analog integrated circuit of the SOM has been designed in Mann
and Gilbert (1989). A comprehensive survey of SOM applications is given in Kohonen (1996).

However, the SOM is not a good choice in terms of clustering performance compared to other popular
clustering algorithms such as the C-means, the neural gas, and the ART 2A (He, Tan, & Tan, 2004;
Martinetz, Berkovich, & Schulten, 1993). For large output dimensions, the number of nodes in the
adaptive grid increases exponentially with the number of function parameters. The prespecified
standard grid topology may not be able to match the structure of the distribution, leading to poor
topological mappings.

3.2. Extensions of the self-organizing map

Adaptive subspace SOM (ASSOM) (Kohonen, 1996, 1997; Koho-

nen, Oja, Simula, Visa, & Kangas, 1996) is a modular neural net-

work model comprising an array of topologically ordered SOM

submodels. ASSOM creates a set of local subspace representations

by competitive selection and cooperative learning. Each submodel

is responsible for describing a specific region of the input space

by its local principal subspace, and represents a manifold such as

a linear subspace with a small dimensionality, whose basis vec-

tors are determined adaptively. ASSOM not only inherits the topo-

logical representation property of the SOM, but provides learning

results which reasonably describe the kernels of various transfor-

mation groups like the PCA. The hyperbolic SOM (HSOM) (Ritter,

1999) implements its lattice by a regular triangulation of the hy-

perbolic plane. The hyperbolic lattice provides more freedom to

map a complex information space such as language into spatial

relations.

Extraction of knowledge from databases is an essential task of data analysis and data mining. The
multi-dimensional data may involve quantitative and qualitative (nominal, ordinal) variables such
as categorical data, which is the case in survey data. The SOM can

be viewed as an extension of principal component analysis (PCA)

due to its topology-preserving property. For qualitative variables,

the SOM has been generalized for multiple correspondence analy-

sis (Cottrell, Ibbou, & Letremy, 2004).

The SOM is designed for real-valued vectorial data analysis,

and it is not suitable for non-vectorial data analysis such as the

structured data analysis. Examples of structured data are tempo-

ral sequences such as time series, language, and words, spatial

sequences like the DNA chains, and tree or graph structured

data arising from natural language parsing and from chemistry.

Prominent unsupervised self-organizing methods for non-vectorial

data are the temporal Kohonen map (TKM), the recurrent SOM

(RSOM), the recursive SOM (RecSOM), the SOM for structured data

(SOMSD), and the merge SOM (MSOM). All these models introduce

recurrence into the SOM, and have been reviewed and compared

in Hammer, Micheli, Sperduti, and Strickert (2004) and Strickert

and Hammer (2005).

4. Learning vector quantization

The k-nearest-neighbor (k-NN) algorithm (Duda & Hart, 1973) is a conventional classification
technique. It is also used for outlier detection. It generalizes well for large training sets, and
the training set can be extended at any time. The theoretical asymptotic classification error is
upper-bounded by twice the Bayes error. However, it uses a large storage space, and has a
computational complexity of $O(N^2)$. It also takes a long time for recall.

LVQ (Kohonen, 1990) employs the same network architecture as the competitive learning network. The
unsupervised LVQ is essentially the SCL based VQ. There are two families of the LVQ-style models,
supervised models such as the LVQ1, the LVQ2, and the LVQ3 (Kohonen, 1989) as well as unsupervised
models such as the LVQ (Kohonen, 1989) and the incremental C-means (MacQueen, 1967). The supervised
LVQ is based on the known classification of feature vectors, and can be treated as a supervised
version of the SOM. The LVQ is used for VQ and classification, as well as for fine tuning the SOM
(Kohonen, 1989, 1990). LVQ algorithms define near-optimal decision borders between classes, even in
the sense of classical Bayesian decision theory.

The supervised LVQ minimizes the functional (3), where $\mu_{kp} = 1$ if neuron $k$ is the winner
and zero otherwise, when pattern pair $p$ is presented. It works on a set of $N$ pattern pairs
$(x_p, y_p)$, where $x_p \in \mathbb{R}^J$ is the input vector and $y_p \in \mathbb{R}^K$ is the
binary target vector coding the class membership, that is, only one entry of $y_p$ takes the value
unity while all its other entries are zero. Assuming that the $p$th pattern is presented at time
$t$, the LVQ1 is given as (Kohonen, 1990)

$$\begin{aligned}
c_w(t+1) &= c_w(t) + \eta(t) [x_t - c_w(t)], & y_{p,w} &= 1 \\
c_w(t+1) &= c_w(t) - \eta(t) [x_t - c_w(t)], & y_{p,w} &= 0 \\
c_i(t+1) &= c_i(t), & i &\neq w
\end{aligned} \tag{11}$$

where $w$ is the index of the winning neuron, $x_t = x_p$ and $\eta(t)$ is defined as in earlier
formulations. When it is used to fine-tune the SOM, one should start with a small $\eta(0)$, usually
less than 0.1. This algorithm tends to reduce the point density of $c_i$ around the Bayesian
decision surfaces. The OLVQ1 is an optimized version of the LVQ1 (Kohonen, Kangas, Laaksonen, &
Torkkola, 1992). In the OLVQ1, each codebook vector $c_i$ is assigned an individual adaptive
learning rate $\eta_i$. The OLVQ1 converges at a rate up to one order of magnitude faster than the
LVQ1.
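A single LVQ1 update can be sketched as below. For brevity the class information is stored as an integer label per prototype and per sample instead of a binary target vector; this encoding, the function name `lvq1_step`, and the toy values are assumptions made for the example.

```python
def lvq1_step(prototypes, proto_labels, x, label, eta):
    """Apply one LVQ1 update in place and return the winner index."""
    # Winner: the prototype closest to x in the Euclidean metric.
    w = min(range(len(prototypes)),
            key=lambda i: sum((a - c) ** 2
                              for a, c in zip(x, prototypes[i])))
    # Attract the winner if its class matches the sample, repel it otherwise.
    sign = 1.0 if proto_labels[w] == label else -1.0
    prototypes[w] = [c + sign * eta * (a - c)
                     for a, c in zip(x, prototypes[w])]
    return w


# Hypothetical toy setting: one prototype per class.
protos = [[0.0, 0.0], [1.0, 1.0]]
labels = [0, 1]
lvq1_step(protos, labels, (0.2, 0.0), 0, 0.1)  # winner 0 moves toward x
```

Only the winner changes in each step; all other prototypes are left untouched.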

LVQ2 and LVQ3 comply better with the Bayesian decision sur-

face. In LVQ1, only one codebook vector ciis updated at each step,

while LVQ2 and LVQ3 change two codebook vectors simultane-

ously. Different LVQ algorithms can be combined in the clustering

process. However, both LVQ2 and LVQ3 have the problem of refer-

ence vector divergence (Sato & Yamada, 1995). In a generalization

of the LVQ2 (Sato & Yamada, 1995), this problem is eliminated by

applying gradient descent on a nonlinear cost function. Some ap-

plications of the LVQ were reviewed in Kohonen et al. (1996).

Addition of training counters to individual neurons can effec-

tively record the training statistics of the LVQ (Odorico, 1997). This

allows for dynamic self-allocation of the neurons to classes during

the course of training. At the generalization stage, these counters

provide an estimate of the reliability of classification of the individual
neurons. The method is especially valuable in handling strongly

overlapping class distributions in the pattern space.

5. C-means clustering

The most well-known data clustering technique is the statistical C-means, also known as the k-means
(MacQueen, 1967; Moody & Darken, 1989; Tou & Gonzalez, 1976). The C-means algorithm approximates
the maximum likelihood (ML) solution for determining the location of the means of a mixture density
of component densities. The C-means clustering is closely related to the SCL, and is a special case
of the SOM. The algorithm partitions the set of $N$ input patterns into $K$ separate subsets $C_k$,
each containing $N_k$ input patterns by minimizing the MSE

$$E(c_1, \ldots, c_K) = \frac{1}{N} \sum_{k=1}^{K} \sum_{x_n \in C_k} \|x_n - c_k\|^2 \tag{12}$$

where $c_k$ is the prototype or center of the cluster $C_k$. By minimizing $E$ with respect to
$c_k$, the optimal location of $c_k$ is obtained as the mean of the samples in the cluster,
$c_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i$.

The C-means can be implemented in either the batch mode (Linde, Buzo, & Gray, 1980; Moody & Darken,
1989) or the incremental mode (MacQueen, 1967). The batch C-means (Linde et al., 1980), also called
the Linde–Buzo–Gray, LBG or generalized Lloyd algorithm, is applied when the whole training set is
available. The incremental C-means is suitable for a training set that is obtained on-line. In the
batch C-means, the initial partition is arbitrarily defined by placing each input pattern into a
randomly selected cluster, and the prototypes are defined to be the average of the patterns in the
individual clusters. When the C-means is performed, at each step the patterns keep changing from
one cluster to the closest cluster $c_k$ according to the nearest-neighbor rule and the prototypes
are then recalculated as the mean of the samples in the clusters. In the incremental C-means, each
cluster is initialized with a random pattern as its prototype; the C-means updates the prototypes
upon the presentation of each new pattern. The incremental C-means gives the new prototype as

$$c_k(t+1) = \begin{cases} c_k(t) + \eta(t) (x_t - c_k(t)), & k = w \\ c_k(t), & k \neq w \end{cases} \tag{13}$$

where $w$ is the index of the winning neuron, $\eta(t)$ is defined as in earlier formulations. The
general procedure for the C-means clustering is to repeat the redistribution of patterns among the
clusters using criterion (12) until there is no further change in the prototypes of the clusters.
After the algorithm converges, one can calculate the variance vector $\vec{\sigma}_k$ for each
cluster.

As a gradient-descent technique, the C-means achieves a local optimum solution that depends on the
initial selection of the cluster prototypes. The number of clusters must also be prespecified.
Numerous improvements on the C-means have been made. The local minimum problem can be eliminated by
using global optimization methods such as the genetic algorithm (GA) (Bandyopadhyay & Maulik, 2002;
Krishna & Murty, 1999), the simulated annealing (SA) (Bandyopadhyay, Maulik, & Pakhira, 2001), and
a hybrid SA and evolutionary algorithm (EA) system (Delport, 1996). In Chinrunrueng and Sequin
(1995), the incremental C-means is improved by biasing the clustering towards an optimal Voronoi
partition (Gersho, 1979) via a cluster variance-weighted MSE as the objective function, and by
adjusting the learning rate dynamically according to the current variances in all partitions. The
method always converges to an optimal or near-optimum configuration. The enhanced LBG (Patane &
Russo, 2001) avoids bad local minima by incorporation of the concept of utility of a codeword. The
enhanced LBG outperforms the LBG with utility (LBG-U) (Fritzke, 1997b) both in terms of accuracy
and the number of required iterations. The LBG-U is also based on the LBG and the concept of
utility.
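The batch C-means iteration described earlier in this section (assign each pattern to its nearest prototype, then recompute each prototype as the mean of its cluster) can be sketched as follows. As a simplifying assumption, the sketch starts from caller-supplied prototypes rather than from a random initial partition, and it keeps the old prototype when a cluster becomes empty; the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def batch_c_means(samples, prototypes, max_iter=100):
    """Batch C-means; returns the prototypes and the last partition used."""
    prototypes = [list(c) for c in prototypes]
    clusters = [[] for _ in prototypes]
    for _ in range(max_iter):
        # Assignment step: nearest-neighbor rule.
        clusters = [[] for _ in prototypes]
        for x in samples:
            w = min(range(len(prototypes)),
                    key=lambda i: sq_dist(x, prototypes[i]))
            clusters[w].append(x)
        # Update step: each prototype becomes the mean of its cluster.
        updated = []
        for c, members in zip(prototypes, clusters):
            if members:
                updated.append([sum(col) / len(members)
                                for col in zip(*members)])
            else:
                updated.append(c)  # empty cluster: keep the old prototype
        if updated == prototypes:  # no further change: converged
            break
        prototypes = updated
    return prototypes, clusters


def mse(samples, prototypes):
    """MSE of the nearest-prototype partition induced by the prototypes."""
    return sum(min(sq_dist(x, c) for c in prototypes)
               for x in samples) / len(samples)
```

Because the result depends on the starting prototypes, running the sketch from several initializations and keeping the lowest `mse` is a simple, if crude, guard against poor local optima.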

When an initial prototype is in a region with few training

patterns, this results in a large cluster. This disadvantage can be

remedied by a modified C-means (Wilpon & Rabiner, 1985). The

clustering starts from one cluster. It splits the cluster with the

largest intracluster distance into two. After each splitting, the C-

means is applied until the existing clusters are convergent. This

procedure is continued until K clusters are obtained.

The relation between the PCA and the C-means has been es-

tablished in Ding and He (2004). Principal components have been

proved to be the continuous solutions to the discrete cluster

membership indicators for the C-means clustering, with a clear

simplex cluster structure (Ding & He, 2004). PCA based dimensionality
reductions are particularly effective for the C-means clustering.
Lower bounds for the C-means objective function (12) are

derived as the total variance minus the eigenvalues of the data co-

variance matrix (Ding & He, 2004).

In the two-stage clustering procedure (Vesanto & Alhoniemi,

2000), the SOM is first used to cluster the data set, and the proto-

types produced are further clustered using an agglomerative clus-

tering algorithm or the C-means. The clustering results using the

SOM as an intermediate step are comparable to that of direct clus-

tering of the data, but with a significantly reduced computation

time.

6. Mountain and subtractive clusterings

The mountain clustering (Yager & Filev, 1994a, 1994b) is a

simple and effective method for estimating the number of clusters

and the initial locations of the cluster centers. The method grids

the data space and computes a potential value for each grid point

based on its distance to the actual data points. Each grid point is

a potential cluster center. The potential for each grid is calculated

based on the density of the surrounding data points. The grid with

the highest potential is selected as the first cluster center and then

the potential values of all the other grids are reduced according to

their distances to the first cluster center. The next cluster center

is located at the grid point with the highest remaining potential.

This process is repeated until the remaining potential values of all

the grids fall below a threshold. However, the grid structure causes

the complexity to grow exponentially with the dimension of the

problem.

The subtractive clustering (Chiu, 1994a), as a modified moun-

tain clustering, uses all the data points to replace all the grid points

as potential cluster centers. This effectively reduces the number of

grid points to N (Chiu, 1994a). The potential measure for each data

point xiis defined as a function of the Euclidean distances to all the

other input data points

$$P(i) = \sum_{j=1}^{N} e^{-\alpha \|x_i - x_j\|^2}, \quad i = 1, \ldots, N \tag{14}$$

where the constant $\alpha = \frac{4}{r_a^2}$, $r_a$ being a normalized radius defining the
neighborhood. A data point surrounded by many neighboring data points has a high potential value.
Thus, the mountain and subtractive clustering techniques are less sensitive to noise than other
clustering algorithms, such as the C-means and the fuzzy C-means (FCM) (Bezdek, 1981).

After the data point with the highest potential, $x_u$, is selected as the $k$th cluster center,
that is, $c_k = x_u$ with $P(k) = P(u)$ as its potential value, the potential of each data point
$x_i$ is modified by subtracting a term associated with $c_k$

$$P(i) = P(i) - P(k) e^{-\beta \|x_i - c_k\|^2} \tag{15}$$

where the constant $\beta = \frac{4}{r_b^2}$, $r_b$ being a normalized radius defining the
neighborhood. In order to avoid closely located cluster centers, $r_b$ is set greater than $r_a$,
typically $r_b = 1.25 r_a$. The algorithm continues until the remaining potentials of all the data
points are below some fraction of the potential of the first cluster center

$$P(k) = \max_i P(i) < \varepsilon P(1) \tag{16}$$

where $\varepsilon$ is selected within (0, 1). A small $\varepsilon$ leads to a large number of
hidden nodes, while a large $\varepsilon$ generates a small network structure. Typically,
$\varepsilon$ is selected as 0.15.

The training data $x_i$ is recommended to be scaled before applying the method for easy selection of
$\alpha$ and $\beta$. Since it is difficult to select a suitable $\varepsilon$ for all data
patterns, additional criteria for accepting/rejecting cluster centers can be used. One method is to
select two thresholds (Chiu, 1994a, 1994b), namely, $\overline{\varepsilon}$ and
$\underline{\varepsilon}$. Above $\overline{\varepsilon}$, $c_k$ is definitely accepted as a cluster
center, while below $\underline{\varepsilon}$ it is definitely rejected. If $P(k)$ falls between the
two thresholds, a trade-off between a reasonable potential and its distance to the existing cluster
centers must be examined.

Unlike the C-means and the FCM, which require iterations of many epochs, the subtractive clustering
requires only one pass of the training data. Besides, the number of clusters does not need to be
prespecified. The subtractive clustering is a deterministic method: For the same neural network
structure, the same network parameters are always obtained. Both the C-means and the FCM require
$O(KNT)$ computations, where $T$ is the total number of epochs and each computation requires the
calculation of the distance and the memberships. The computational load for the subtractive
clustering is $O(N^2 + KN)$, each computation involving the calculation of the exponential function.
Thus, for small- or medium-size training sets, the subtractive clustering is relatively fast, but it
requires more training time when $N \gg KT$ (Dave & Krishnapuram, 1997).

The subtractive clustering provides only rough estimates of the cluster centers, since the cluster
centers obtained are situated at some data points. Moreover, since $\alpha$ and $\beta$ are not
determined from the data set and no cluster validity is used, the clusters produced may not
appropriately represent the actual clusters in the data. The result by the subtractive clustering
can be used for initializing iterative optimization based clustering algorithms such as the C-means
and the FCM.

The subtractive clustering can be improved by performing a search over $\alpha$ and $\beta$, which
makes it essentially equivalent to the least-biased fuzzy clustering algorithm (Beni & Liu, 1994).
The least-biased fuzzy clustering, based on the deterministic annealing approach (Rose, 1998; Rose,
Gurewitz, & Fox, 1990), tries to minimize the clustering entropy of each cluster under the
assumption of unbiased centroids. In Angelov and Filev (2004), an on-line clustering method has been
implemented based on a first-order Cauchy type potential function. In Pal and Chakraborty (2000),
the mountain and subtractive clustering methods are improved by tuning the prototypes obtained using
the gradient-descent method to maximize the potential function. By modifying the potential function,
the mountain method can also be used to detect other types of clusters like circular shells (Pal &
Chakraborty, 2000). In Kim, Lee, Lee, and Lee (2005), a kernel-induced distance is used to replace
the Euclidean distance in the potential function. This makes it possible to cluster data that is
linearly inseparable in the original space into homogeneous groups in the transformed
high-dimensional space, where the data separability is increased.
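The subtractive clustering procedure of this section, with the single stopping threshold on the remaining potential, can be sketched as follows. The function name, the default radius `ra`, and the toy data in the usage note are assumptions made for illustration; the two-threshold acceptance test is not included.

```python
import math


def subtractive_clustering(samples, ra=1.0, eps=0.15):
    """Return cluster centers chosen among the samples by potential."""
    rb = 1.25 * ra                       # typical choice rb = 1.25 ra
    alpha = 4.0 / ra ** 2
    beta = 4.0 / rb ** 2

    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Initial potential of every data point (assumes samples is non-empty).
    potential = [sum(math.exp(-alpha * d2(xi, xj)) for xj in samples)
                 for xi in samples]
    centers = []
    first = None
    while True:
        u = max(range(len(samples)), key=lambda i: potential[i])
        p_u = potential[u]
        if first is None:
            first = p_u                  # potential of the first center
        elif p_u < eps * first:
            break                        # remaining potentials are too low
        centers.append(samples[u])
        # Subtract the influence of the new center from every potential.
        potential = [p - p_u * math.exp(-beta * d2(x, samples[u]))
                     for p, x in zip(potential, samples)]
    return centers
```

For instance, on two small groups of points near (0, 0) and (1, 1), the sketch is expected to return one center per group, taken from the data points themselves; such centers could then seed the C-means or the FCM.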

7. Neural gas

The neural gas (NG) (Martinetz et al., 1993) is a VQ model which

minimizes a known cost function and converges to the C-means

quantization error via a soft-to-hard competitive model transition.

The soft-to-hard annealing process helps the algorithm escape

from local minima. The NG is a topology-preserving network, and

can be treated as an extension to the C-means. It has a fixed

number of processing units, K, with no lateral connection.

A data optimal topological ordering is achieved by using neighborhood ranking within the input space
at each training step. To find its neighborhood rank, each neuron compares its distance to the input
vector with those of all the other neurons to the input vector. Neighborhood ranking provides the
training strategy with mechanisms related to robust statistics, and the NG does not suffer from the
prototype under-utilization problem (Rumelhart & Zipser, 1985). At step $t$, the Euclidean distances
between an input vector $x_t$ and all the prototype vectors $c_k(t)$ are calculated by
$d_k(x_t) = \|x_t - c_k(t)\|$, $k = 1, \ldots, K$, and $d(t) = (d_1(x_t), \ldots, d_K(x_t))^T$.
Each prototype $c_k(t)$ is assigned a rank $r_k(t)$, which takes an integer value from
$0, \ldots, K-1$, with 0 for the smallest and $K-1$ for the largest $d_k(x_t)$.

The prototypes are updated by

$$c_k(t+1) = c_k(t) + \eta h(r_k(t)) (x_t - c_k(t)) \tag{17}$$

where $h(r) = e^{-r/\rho(t)}$ realizes a soft competition, $\rho(t)$ being the neighborhood width.
When $\rho(t) \to 0$, (17) reduces to the C-means update rule (13). During the iteration, both
$\rho(t)$ and $\eta(t)$ decrease exponentially, $\eta(t) = \eta_0 (\eta_f / \eta_0)^{t/T_f}$ and
$\rho(t) = \rho_0 (\rho_f / \rho_0)^{t/T_f}$, where $\eta_0$ and $\rho_0$ are the initial decay
parameters, $\eta_f$ and $\rho_f$ are the final decay parameters, and $T_f$ is the maximum number of
iterations. The prototypes $c_k$ are initialized by randomly assigning vectors from the training
set.

Unlike the SOM, which uses predefined static neighborhood

relations, the NG determines a dynamical neighborhood relation

as learning proceeds. The NG is an efficient and reliable cluster-

ing algorithm, which is not sensitive to the neuron initialization.

The NG converges faster to a smaller MSE E than the C-means,

the maximum-entropy clustering (Rose et al., 1990), and the SOM.

This advantage comes at the price of a higher computational effort.

In serial implementation, the complexity for the NG is O(K logK)

while the other three methods all have a complexity of O(K).

Nevertheless, in parallel implementation all the four algorithms

have a complexity of O(logK) (Martinetz et al., 1993). The NG can

be derived from a gradient-descent procedure on a potential func-

tion associated with the framework of fuzzy clustering (Bezdek,

1981).

To accelerate the sequential NG, a truncated exponential func-

tion is used as the neighborhood function and the neighborhood

ranking is implemented without evaluating and sorting all the dis-

tances (Choy & Siu, 1998b). In Rovetta and Zunino (1999), an im-

proved NG and its analog VLSI subcircuitry have been developed

basedonpartialsorting.Theapproachreducesthetrainingtimeby

up to two orders of magnitude, without reducing the performance.

In the Voronoi tessellation, when the prototype of each Voronoi

region is connected to all the prototypes of its bordering Voronoi

regions,aDelaunaytriangulationisobtained.CompetitiveHebbian

learning(Martinetz,1993;Martinetz&Schulten,1994)isamethod

that generates a subgraph of the Delaunay triangulation, called

?

ηf

η0

? t

Tfand ρ(t) = ρ0

?

ρf

ρ0

? t

Tf,


Fig. 2. An illustration of the Delaunay triangulation and the induced Delaunay
triangulation. The Delaunay triangulation is represented by a mix of thick and thick
dashed lines, the induced Delaunay triangulation by thick lines, Voronoi tessellation
by thin lines, prototypes by circles, and a data distribution P(x) by shaded regions.
To generate the induced Delaunay triangulation, two prototypes are connected only
if at least a part of the common border of their Voronoi polygons lies in a region
where P(x) > 0.

the induced Delaunay triangulation by masking the Delaunay

triangulation with a data distribution P(x). This is shown in

Fig. 2. The induced Delaunay triangulation is optimally topology-

preserving in a general sense (Martinetz, 1993). Given a number

of prototypes in $R^J$, competitive Hebbian learning successively

adds connections among them by evaluating input data drawn

from P(x). The method does not change the prototypes, but only

generates topology according to these prototypes. For each input

x, its two closest prototypes are connected by an edge. This leads

to the induced Delaunay triangulation, which is limited to those

regions of the input space $R^J$, where P(x) > 0. The topology-representing
network (Martinetz & Schulten, 1994) is obtained

by alternating the learning steps of the NG and the competitive

Hebbian learning, where the NG is used to distribute a certain

number of prototypes and the competitive Hebbian learning is

then used to generate the topology. An edge aging scheme is used

to remove obsolete edges. Competitive Hebbian learning avoids

the topological defects observed for the SOM.
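The edge-generation rule of competitive Hebbian learning can be sketched as follows, with the prototypes held fixed. The function name `competitive_hebbian_edges` and the particular aging rule (every edge ages by one per input and is dropped once older than `max_age`) are simplifying assumptions for illustration; the source states only that an edge aging scheme removes obsolete edges.

```python
import math

def competitive_hebbian_edges(protos, samples, max_age=None):
    """Connect the two prototypes closest to each sample; prototypes stay fixed."""
    ages = {}  # edge (i, j) with i < j -> number of inputs since last refresh
    for x in samples:
        order = sorted(range(len(protos)), key=lambda k: math.dist(x, protos[k]))
        i, j = sorted(order[:2])
        if max_age is not None:
            # Assumed aging rule: all edges grow older, stale ones are removed.
            for e in list(ages):
                ages[e] += 1
                if ages[e] > max_age:
                    del ages[e]
        ages[(i, j)] = 0  # create the edge or refresh its age
    return set(ages)
```

Because an edge is created only where data actually fall, prototypes separated by an empty region remain unconnected, which is the masking effect described above.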

8. ART networks

Adaptive resonance theory (ART) (Grossberg, 1976) is biologi-

cally motivated and is a major advance in the competitive learning

paradigm. The theory leads to a series of real-time unsupervised

network models for clustering, pattern recognition, and associa-

tive memory (Carpenter & Grossberg, 1987a, 1987b, 1988, 1990;

Carpenter, Grossberg, & Rosen, 1991a, 1991b; Carpenter, Gross-

berg, Markuzon, Reynolds, & Rosen, 1992). These models are ca-

pable of stable category recognition in response to arbitrary input

sequences with either fast or slow learning. ART models are characterized
by systems of differential equations that formulate stable

self-organizing learning methods. Instar and outstar learning rules

are the two learning rules used. The ART has the ability to adapt,

yet not forget the past training, and it overcomes the so-called

stability–plasticity dilemma (Carpenter & Grossberg, 1987a;

Grossberg, 1976). At the training stage, the stored prototype of a

category is adapted when an input pattern is sufficiently similar to

the prototype. When novelty is detected, the ART adaptively and

autonomously creates a new category with the input pattern as the

prototype. The similarity is characterized by a vigilance parame-

ter ρ ∈ (0,1]. A large ρ leads to many finely divided categories,

while a smaller ρ gives fewer categories. The stability and plastic-

ity properties as well as the ability to efficiently process dynamic

data make the ART attractive for clustering large, rapidly cha-

nging sequences of input patterns, such as in the case of data

mining (Massey, 2003). However, the ART approach does not cor-

respond to the C-means algorithm for cluster analysis and VQ in

the global optimization sense (Lippman, 1987).
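The role of the vigilance parameter can be illustrated with a simplified fast-learning sketch for binary patterns. The choice function, the match ratio $|x \wedge w_j| / |x|$, and the parameter `beta` follow a commonly used simplified ART 1 formulation and are assumptions here, since this section does not give the equations; the function name `art1_cluster` is hypothetical.

```python
def art1_cluster(patterns, rho, beta=1.0):
    """Simplified fast-learning ART 1 sketch; patterns are non-zero 0/1 lists."""
    prototypes = []  # one binary prototype per committed category
    labels = []
    for x in patterns:
        norm_x = sum(x)
        overlaps = [sum(a & b for a, b in zip(x, w)) for w in prototypes]
        # Search committed categories in order of decreasing choice value.
        order = sorted(range(len(prototypes)),
                       key=lambda j: -overlaps[j] / (beta + sum(prototypes[j])))
        for j in order:
            if overlaps[j] / norm_x >= rho:  # vigilance test passed
                # Fast learning: the prototype becomes the logical AND.
                prototypes[j] = [a & b for a, b in zip(x, prototypes[j])]
                labels.append(j)
                break
        else:
            # Novelty detected: the input itself becomes a new prototype.
            prototypes.append(list(x))
            labels.append(len(prototypes) - 1)
    return prototypes, labels
```

For the three patterns `[1,1,0,0]`, `[1,1,1,0]`, `[0,0,1,1]`, a vigilance of 0.9 yields three categories while 0.5 yields two, in line with the statement that a larger vigilance produces more finely divided categories.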

8.1. ART models

The ART model family includes a series of unsupervised learning
models. ART networks employ a J–K recurrent architecture, which
is a different form of Fig. 1. The input layer F1, called the comparing
layer, has J neurons while the output layer F2, called the recognizing

layer, has K neurons. F1 and F2 are fully interconnected in both

directions. F2 acts as a WTA network. The feedforward weights

connecting to the F2 neuron j are represented by the vector $w_j$,
while the feedback weights from the same neuron are represented
by the vector $c_j$ that stores the prototype of cluster j. The number

of clusters K varies with the size of the problem.

The ART models are characterized by a set of short-term mem-

ory (STM) and long-term memory (LTM) time-domain nonlinear

differential equations. The STM equations describe the evolution

of the neurons and their interactions, while the LTM equations de-

scribe the change of the interconnection weights with time as a

function of the system state. F1 stores the STM for the current input
pattern, while F2 stores the prototypes of clusters as the LTM. There
are three types of ART implementations: full mode, STM steady-state
mode, and fast learning mode (Carpenter & Grossberg, 1987b;

Serrano-Gotarredona & Linares-Barranco, 1996). In the full mode,

both the STM and LTM differential equations are realized. The STM

steady-state mode only implements the LTM differential equa-

tions, while the STM behavior is governed by nonlinear algebraic

equations. In the fast learning mode, both the STM and the LTM are

implemented by their steady-state nonlinear algebraic equations,

and thus proper sequencing of STM and LTM events is required. The

fast learning mode is inexpensive and is most popular.

Like the incremental C-means, the ART model family is sensitive

to the order of presentation of the input patterns. ART models tend

to build clusters of the same size, independently of the distribution

of the data.

8.1.1. ART 1

The simplest and most popular ART model is the ART 1 (Carpenter
& Grossberg, 1987a) for learning to categorize arbitrarily many,

complex binary input patterns presented in an arbitrary order. A

popular fast learning implementation is given by Du and Swamy

(2006), Moore (1988), Massey (2003) and Serrano-Gotarredona

and Linares-Barranco (1996). The ART 1 is stable for a finite train-

ing set. However, the order of the training patterns may influ-

ence the final prototypes and clusters. Unlike the SOM (Kohonen,

1982), the Hopfield network (Hopfield, 1982), and the neocogni-

tron (Fukushima, 1980), the ART 1 can deal with arbitrary com-

binations of binary input patterns. In addition, the ART 1 has no

restriction on memory capacity since its memory matrices are not

square.

Other popular ART 1-based clustering algorithms are the im-

proved ART 1 (IART 1) (Shih, Moh, & Chang, 1992), the adaptive

Hamming net (AHN) (Hung & Lin, 1995), the fuzzy ART (Carpenter

et al., 1991a, 1992; Carpenter & Ross, 1995), the fuzzy AHN (Hung

& Lin, 1995), and the projective ART (PART) (Cao & Wu, 2002). The

fuzzy ART (Carpenter et al., 1991a) simply extends the logical AND

in the ART 1 to the fuzzy AND. Both the fuzzy ART and the fuzzy

AHN have an analog architecture, and function like the ART 1 but

for analog input patterns.
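The step from the logical AND to the fuzzy AND can be shown in a few lines. The fuzzy AND is taken here as the component-wise minimum, and the match ratio $|x \wedge w| / |x|$ with the $L_1$ norm is the quantity usually compared with the vigilance parameter in the fuzzy ART literature; the helper names are hypothetical.

```python
def logical_and(x, w):
    """Component-wise AND for binary (0/1) vectors, as in the ART 1."""
    return [a & b for a, b in zip(x, w)]

def fuzzy_and(x, w):
    """Component-wise minimum, valid for analog vectors in [0, 1]."""
    return [min(a, b) for a, b in zip(x, w)]

def match_ratio(x, w):
    """|x ^ w| / |x| using the L1 norm; assumes x is not all-zero."""
    return sum(fuzzy_and(x, w)) / sum(x)
```

On binary inputs the two operators coincide, which is why the fuzzy ART behaves like the ART 1 when restricted to binary patterns.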

The ART models, typically governed by differential equations,

have a high computational complexity for numerical implementa-

tions. Implementations using analog or optical hardware are more


desirable. A modified ART 1 in the fast learning mode has been de-

rived for easy hardware implementation in Serrano-Gotarredona

and Linares-Barranco (1996), and the method has also been ex-

tended for the full mode and the STM steady-state mode. A num-

ber of hardware implementations of the ART 1 in different modes

are also surveyed in Serrano-Gotarredona and Linares-Barranco

(1996).

8.1.2. ART 2

The ART 2 (Carpenter & Grossberg, 1987b) is designed to

categorize analog or binary random input sequences. It is similar

to the ART 1, but has a more complex F1 field so as to allow

the ART 2 to stably categorize sequences of analog inputs that

can be arbitrarily close to one another. The F1 field includes a

combination of normalization and noise suppression, as well as the

comparison of the bottom-up and top-down signals needed for the

reset mechanism. The clustering behavior of the ART 2 was found

to be similar to that of the C-means clustering (Burke, 1991).

The ART 2 is computationally expensive and has difficulties in

parameter selection. The ART 2A (Carpenter et al., 1991b) employs

the same architecture as the ART 2, and can accurately reproduce

the behavior of the ART 2 in the fast learning limit. The ART 2A

is two to three orders of magnitude faster than the ART 2, and

also suggests efficient parallel implementations. The ART 2A is also

fast at intermediate learning rates, which captures many desirable

properties of slow learning of the ART 2 such as noise tolerance.

In Carpenter and Grossberg (1987b), F2 initially contains a number

of uncommitted nodes, which get committed one by one upon

the input presentation. An implementation of the ART 2A, with F2

being initialized as the null set and dynamically growing during

learning, is given in Du and Swamy (2006) and He et al. (2004). The
ART 2A with an intermediate learning rate η copes better with
noisy inputs than it does with a fast learning rate, and the emergent

category structure is less dependent on the input presentation

order (Carpenter et al., 1991b). The ART-C 2A (He et al., 2004)

applies a constraint reset mechanism on the ART 2A to allow a

direct control on the number of output clusters generated, by

adaptively adjusting the value of ρ. The ART 2A and the ART-C 2A

have clustering quality comparable to that of the C-means and the

SOM, but with less computational time (He et al., 2004).

8.1.3. Other ART models

The ART 3 (Carpenter & Grossberg, 1990) carries out parallel

searches by testing hypotheses about distributed recognition

codes in a multilevel network hierarchy. The ART 3 introduces a

search process for ART architectures that can robustly cope with

sequences of asynchronous analog input patterns in real time. The

distributed ART (dART) (Carpenter, 1997) combines the stable fast

learning capability of ART systems with the noise tolerance and

code compression capabilities of the multilayer perceptron (MLP).

With a WTA code, the unsupervised dART model reduces to the

fuzzy ART (Carpenter et al., 1991a). Other ART-based algorithms

include the efficient ART (EART) family (Baraldi & Alpaydin, 2002),

the simplified ART (SART) family (Baraldi & Alpaydin, 2002), the

symmetric fuzzy ART (S-Fuzzy ART) (Baraldi & Alpaydin, 2002),

the Gaussian ART (Williamson, 1996) as an instance of the SART
family, and the fully self-organizing SART (FOSART) (Baraldi &
Parmiggiani, 1997).

8.2. ARTMAP models

ARTMAP models (Carpenter, Grossberg, & Reynolds, 1991; Car-

penter et al., 1992; Carpenter & Ross, 1995), which are self-

organizing and goal-oriented, are a class of supervised learning

methods. The ARTMAP, also called predictive ART, autonomously

learns to classify arbitrarily many, arbitrarily ordered vectors

into recognition categories based on predictive success (Carpen-

ter et al., 1991). Compared to the backpropagation (BP) learn-

ing (Rumelhart, Hinton, & Williams, 1986), the ARTMAP has a

number of advantages such as being self-organizing, self-stabili-

zing, match learning, and real time. The ARTMAP learns orders of

magnitude faster and is also more accurate than the BP. These are

achieved by using an internal controller that jointly maximizes

predictive generalization and minimizes predictive error by link-

ing predictive success to category size on a trial-by-trial basis, us-

ing only local operations. However, the ARTMAP is very sensitive

to the order of the training patterns compared to learning by the

radial basis function network (RBFN) (Broomhead & Lowe, 1988).

The ARTMAP learns predetermined categories of binary input
patterns in a supervised manner. It is based on a pair of ART modules,
namely, $\mathrm{ART}_a$ and $\mathrm{ART}_b$. $\mathrm{ART}_a$ and $\mathrm{ART}_b$ can be fast learning
ART 1 modules coding binary input vectors. These modules are
connected by an inter-ART module that resembles ART 1. The inter-ART
module includes a map field that controls the learning of an
associative map from $\mathrm{ART}_a$ recognition categories to $\mathrm{ART}_b$ recognition
categories. The map field also controls match tracking of the
$\mathrm{ART}_a$ vigilance parameter. The inter-ART vigilance resetting signal
is a form of backpropagation of information. Given a stream
of input–output pairs $\{(x_p, y_p)\}$, during training $\mathrm{ART}_a$ receives a
stream $\{x_p\}$ and $\mathrm{ART}_b$ receives a stream $\{y_p\}$. During generalization,
when a pattern x is presented to $\mathrm{ART}_a$, its prediction is produced
at $\mathrm{ART}_b$.

The fuzzy ARTMAP (Carpenter et al., 1992; Carpenter & Ross,
1995) can be trained in a supervised manner to learn predetermined
categories of binary or analog input patterns. The fuzzy ARTMAP
incorporates two fuzzy ART modules. The fuzzy ARTMAP is capable
of fast, but stable, on-line recognition learning, hypothesis testing,
and adaptive naming in response to an arbitrary stream of analog
or binary input patterns. The fuzzy ARTMAP is also shown to
be a universal approximator (Verzi, Heileman, Georgiopoulos, &
Anagnostopoulos, 2003). Other members of the ARTMAP family are
the ART-EMAP (Carpenter & Ross, 1995), the ARTMAP-IC (Carpenter
& Markuzon, 1998), the Gaussian ARTMAP (Williamson, 1996),
the distributed ARTMAP (dARTMAP) (Carpenter, 1997), the default
ARTMAP (Carpenter, 2003), and the simplified fuzzy ARTMAP (Kasuba,
1993; Vakil-Baghmisheh & Pavesic, 2003). The distributed
vs. the WTA-coding representation is a primary factor differentiating
the various ARTMAP networks. The relations of some of the
ARTMAP variants are given by Carpenter (2003): fuzzy ARTMAP ⊂
default ARTMAP ⊂ ARTMAP-IC ⊂ dARTMAP.

9. Fuzzy clustering

Fuzzy clustering is an important class of clustering algorithms.

Fuzzy clustering helps to find natural vague boundaries in data.

Preliminaries of fuzzy sets and logic are given in Buckley and

Eslami (2002) and Du and Swamy (2006).

9.1. Fuzzy C-means clustering

The discreteness of each cluster makes the C-means analytically

and algorithmically intractable. Partitioning the dataset in a fuzzy

manner avoids this problem. The FCM clustering (Bezdek, 1974,

1981), also known as the fuzzy ISODATA (Dunn, 1974), treats

each cluster as a fuzzy set, and each feature vector is assigned to

multiple clusters with some degree of certainty measured by the

membership function. The FCM optimizes the following objective

function (Bezdek, 1974, 1981)

$$E = \sum_{j=1}^{K} \sum_{i=1}^{N} \mu_{ji}^{m} \left\| x_i - c_j \right\|^2 \qquad (18)$$


where the membership matrix $U = [\mu_{ji}]$, $\mu_{ji} \in [0,1]$ denoting
the membership of $x_i$ into cluster j. The following condition must hold:

$$\sum_{j=1}^{K} \mu_{ji} = 1, \quad i = 1, \ldots, N. \qquad (19)$$

The weighting parameter $m \in (1, \infty)$ is called the fuzzifier. m
determines the fuzziness of the partition produced, and reduces
the influence of small membership values. When $m \to 1^{+}$,
the resulting partition asymptotically approaches a hard or crisp
partition. On the other hand, the partition becomes a maximally
fuzzy partition if $m \to \infty$.

By minimizing (18) subject to (19), the optimal solution is
derived as

$$\mu_{ji} = \frac{\left( \dfrac{1}{\|x_i - c_j\|^2} \right)^{\frac{1}{m-1}}}{\sum_{l=1}^{K} \left( \dfrac{1}{\|x_i - c_l\|^2} \right)^{\frac{1}{m-1}}}, \qquad (20)$$

$$c_j = \frac{\sum_{i=1}^{N} \mu_{ji}^{m} x_i}{\sum_{i=1}^{N} \mu_{ji}^{m}} \qquad (21)$$

for $i = 1, \ldots, N$, $j = 1, \ldots, K$. Eq. (20) corresponds to a soft-max
rule and (21) is similar to the mean of the data points in a
cluster. Both equations are dependent on each other. The iterative
alternating optimization procedure terminates when the change
in the prototypes is sufficiently small (Bezdek, 1981; Karayiannis
& Mi, 1997). The FCM clustering with a high degree of fuzziness
diminishes the probability of getting stuck at local minima (Bezdek,
1981). A typical value for m is 1.5 or 2.0.

The FCM needs to store U and all $c_i$'s, and the alternating
estimation of U and $c_i$'s causes a computational and storage burden
for large-scale data sets. The computation can be accelerated
by combining their updates (Kolen & Hutcheson, 2002), and
consequently the storage of U is avoided. The single iteration
time of the accelerated method is $O(K)$, while that of the FCM is
$O(K^2)$ (Kolen & Hutcheson, 2002). The C-means is a special case of
the FCM, when $\mu_{ji}$ is unity for only one class and zero for all the
other classes. Like the C-means, the FCM may find a local optimum
solution, and the result is dependent on the initialization of U or
$c_j(0)$.

There are many variants of the FCM. The penalized FCM (Yang,
1993) is a convergent generalized FCM obtained by adding a
penalty term associated with $\mu_{ji}$. The compensated FCM (Lin, 1999)
speeds up the convergence of the penalized FCM by modifying
the penalty. A weighted FCM (Tsekouras, Sarimveis, Kavakli, &
Bafas, 2004) is used for fuzzy modeling towards developing a
Takagi–Sugeno–Kang (TSK) fuzzy model of optimal structure. All
these and many other existing generalizations of the FCM can
be analyzed in a unified framework called the generalized FCM
(GFCM) (Yu & Yang, 2005), by using the Lagrange multiplier
method from an objective function comprising a generalization
of the FCM criterion and a regularization term. The multistage
random sampling FCM (Cheng, Goldgof, & Hall, 1998) reduces the
clustering time normally by a factor of 2 to 3, with a quality of the
final partitions equivalent to that created by the FCM. The FCM
has been generalized by introducing the generalized Boltzmann
distribution to escape local minima (Richardt, Karl, & Muller, 1998).
Existing global optimization techniques can be incorporated into
the FCM to provide globally optimum solutions. The ε-insensitive
FCM (εFCM) is an extension to the FCM by introducing the robust
statistics using Vapnik's ε-insensitive estimator to reduce the
effect of outliers (Leski, 2003a). The εFCM is based on $L_1$-norm
clustering (Kersten, 1999). Other robust extensions to the FCM
include the $L_p$-norm clustering ($0 < p < 1$) (Hathaway & Bezdek,
2000) and the $L_1$-norm clustering (Kersten, 1999). The FCM has also
been extended for clustering other data types, such as symbolic
data (El-Sonbaty & Ismail, 1998).
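The alternating optimization of the basic FCM, that is, the membership update (20) followed by the prototype update (21), can be sketched as follows. The function name `fcm`, the optional `init` argument, the stopping tolerance, and the crisp assignment used when a data point coincides with a prototype (to avoid division by zero) are assumptions made for this illustration.

```python
import math
import random

def fcm(data, K, m=2.0, init=None, max_iter=100, tol=1e-6, seed=0):
    """Sketch of fuzzy C-means; returns (U, centers) with U[j][i] = mu_ji."""
    rng = random.Random(seed)
    centers = [list(c) for c in (init if init is not None else rng.sample(data, K))]
    N, dim = len(data), len(data[0])
    U = [[0.0] * N for _ in range(K)]
    for _ in range(max_iter):
        # Membership update, Eq. (20).
        for i, x in enumerate(data):
            d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
            if min(d2) == 0.0:
                # Assumed convention: crisp membership when x equals a prototype.
                hit = d2.index(0.0)
                for j in range(K):
                    U[j][i] = 1.0 if j == hit else 0.0
                continue
            w = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in d2]
            s = sum(w)
            for j in range(K):
                U[j][i] = w[j] / s
        # Prototype update, Eq. (21).
        shift = 0.0
        for j in range(K):
            um = [U[j][i] ** m for i in range(N)]
            total = sum(um)
            new_c = [sum(um[i] * data[i][d] for i in range(N)) / total
                     for d in range(dim)]
            shift = max(shift, math.dist(new_c, centers[j]))
            centers[j] = new_c
        if shift < tol:  # change in the prototypes is sufficiently small
            break
    return U, centers
```

Note that this sketch stores the full membership matrix, so it exhibits the storage burden discussed above rather than the accelerated variant.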

For a blend of unlabeled and labeled patterns, the FCM with

partial supervision (Pedrycz & Waletzky, 1997) can be applied

and the method is derived following the same procedure as

that of the FCM. The classification information is added to the

objective function, and a weighting factor balances the supervised

and unsupervised terms within the objective function (Pedrycz &

Waletzky, 1997). The conditional FCM (Pedrycz, 1998) develops

clusters preserving homogeneity of the clustered patterns with

regard to their similarity in the input space, as well as their

respective values assumed in the output space. It is a supervised

clustering. The conditional FCM is based on the FCM, but requires

the output variable of a cluster to satisfy a particular condition,

which can be treated as a fuzzy set, defined via the corresponding

membership. This results in a reduced computational complexity

for classification problems by splitting the problem into a series

of condition-driven clustering problems. A family of generalized

weighted conditional FCM algorithms are derived in Leski (2003b).

9.2. Other fuzzy clustering algorithms

Many other clustering algorithms are based on the concept of

fuzzy membership. The Gustafson–Kessel algorithm (Gustafson &

Kessel, 1979) extends the FCM by using the Mahalanobis distance,

and is suited for hyperellipsoidal clusters of equal volume. The

algorithm takes typically five times as long as the FCM to com-

plete cluster formation (Karayiannis & Randolph-Gips, 2003).

The adaptive fuzzy clustering (AFC) (Anderson, Bezdek, & Dave,

1982) also employs the Mahalanobis distance, and is suitable for

ellipsoidal or linear clusters. The Gath–Geva algorithm (Gath &

Geva, 1989) is derived from a combination of the FCM and fuzzy ML
estimation. The method incorporates the hypervolume and density
criteria as cluster validity measures and performs well in situations

of large variability of cluster shapes, densities, and number of data

points in each cluster.

The C-means and the FCM are based on the minimization

of the trace of the (fuzzy) within-cluster scatter matrix. The

minimum scatter volume (MSV) and minimum cluster volume

(MCV) algorithms are two iterative clustering algorithms based

on determinant (volume) criteria (Krishnapuram & Kim, 2000).

The MSV algorithm minimizes the determinant of the sum of

the scatter matrices of the clusters, while the MCV minimizes

the sum of the volumes of the individual clusters. The behavior

of the MSV is similar to that of the C-means, whereas the MCV

is more versatile. The MCV in general gives better results than

the C-means, MSV, and Gustafson–Kessel algorithms, and is less

sensitive to initialization than the expectation–maximization (EM)

algorithm (Dempster, Laird, & Rubin, 1977). Volume prototypes

extend the cluster prototypes from points to regions in the

clustering space (Kaymak & Setnes, 2002). A cluster represented by

a volume prototype implies that all data points close to a cluster

center belong fully to that cluster. In Kaymak and Setnes (2002),

the Gustafson–Kessel algorithm and the FCM have been extended

by using the volume prototypes and similarity-driven merging of

clusters.

There are various fuzzy clustering methods that are based on

the Kohonen network, the LVQ, the ART models, and the Hopfield

network.

9.2.1. Kohonen network and learning vector quantization based fuzzy

clustering

The fuzzy SOM (Huntsberger & Ajjimarangsee, 1990) modifies

the SOM by replacing the learning rate with fuzzy membership


of the nodes in each class. The fuzzy LVQ (FLVQ) (Bezdek & Pal,

1995), originally named the fuzzy Kohonen clustering network

(FKCN) (Bezdek, Tsao, & Pal, 1992), is a batch algorithm that

combines the ideas of fuzzy membership values for learning rates,

the parallelism of the FCM, and the structure and self-organizing

update rules of the Kohonen network. Soft competitive learning

in clustering has the same function as fuzzy clustering (Baraldi

& Blonda, 1999). The soft competition scheme (SCS) (Yair et al.,

1992) is a sequential, deterministic version of LVQ, obtained by

modifying the neighborhood mechanism of the Kohonen learning

rule and incorporating the stochastic relaxation technique. The SCS

consistently provides better codebooks than the incremental C-

means (Linde et al., 1980), even for the same computation time,

and is relatively insensitive to the choice of the initial codebook.

The learning rates of the FLVQ and SCS algorithms have opposite

tendencies (Bezdek & Pal, 1995). The SCS has difficulty in selecting

good parameters (Bezdek & Pal, 1995). Other extensions to the

FLVQ, LVQ, and FCM algorithms are the extended FLVQ family

learning schemes (Karayiannis & Bezdek, 1997), the non-Euclidean

FLVQ (NEFLVQ) and the non-Euclidean FCM (NEFCM) (Karayiannis

& Randolph-Gips, 2003), the generalized LVQ (GLVQ) (Pal, Bezdek,

& Tsao, 1993), the generalized LVQ family (GLVQ-F) (Karayiannis,

Bezdek, Pal, Hathaway, & Pai, 1996), the family of fuzzy algorithms

for LVQ (FALVQ) (Karayiannis, 1997; Karayiannis & Pai, 1996),

entropy-constrained fuzzy clustering (ECFC) algorithms, and

entropy-constrained LVQ (ECLVQ) algorithms (Karayiannis, 1999).

9.2.2. ART networks based fuzzy clustering

In Section 8.1, we have mentioned some fuzzy ART models such

as the fuzzy ART, the S-fuzzy ART, and the fuzzy AHN, as well

as some fuzzy ARTMAP models such as the fuzzy ARTMAP, the

ART-EMAP, default ARTMAP, the ARTMAP-IC, and the dARTMAP.

The supervised fuzzy min–max classification network (Simpson,

1992) as well as the unsupervised fuzzy min–max clustering net-

work (Simpson, 1993) is a kind of combination of fuzzy logic and

the ART 1 (Carpenter & Grossberg, 1987a). The operations in these

models require only complements, additions and comparisons that

are most suitable for parallel hardware execution. Some cluster-

ing and fuzzy clustering algorithms including the SOM (Kohonen,

1989), the FLVQ (Bezdek & Pal, 1995), the fuzzy ART (Carpenter

et al., 1991a), the growing neural gas (GNG) (Fritzke, 1995a), and

the FOSART (Baraldi & Parmiggiani, 1997) are surveyed and com-

pared in Baraldi and Blonda (1999).

9.2.3. Hopfield network based fuzzy clustering

The clustering problem can be cast as a problem of minimiza-

tion of the MSE between the training patterns and the cluster cen-

ters. This optimization problem can be solved using the Hopfield

network (Lin, 1999; Lin, Cheng, & Mao, 1996). In the fuzzy Hopfield

network (FHN) (Lin et al., 1996) and the compensated fuzzy Hop-

field network (CFHN) (Lin, 1999), the training patterns are mapped

to a Hopfield network of a two-dimensional neuron array, where

each column represents a cluster and each row a training pattern.

The state of each neuron corresponds to a fuzzy membership func-

tion. A fuzzy clustering strategy is included in the Hopfield net-

work to eliminate the need for finding the weighting factors in the

energy function. This energy function is called the scatter energy

function, and is formulated based on the within-class scatter matrix.
These models have inherent parallel structures. In the FHN (Lin

et al., 1996), an FCM strategy is imposed for updating the neuron

states. The CFHN (Lin, 1999) integrates the compensated FCM into

the learning scheme and updating strategies of the Hopfield network
to avoid the NP-hard problem (Swamy & Thulasiraman, 1981)
and to accelerate the convergence for the clustering procedure. The

CFHN learns more rapidly and more effectively than clustering us-

ing the Hopfield network, the FCM, and the penalized FCM (Yang,

1993). The CFHN has been used for VQ in image compression (Liu &
Lin, 2000), so that the parallel implementation for codebook design

is feasible.

10. Supervised clustering

When output patterns are used in clustering, this leads to

supervised clustering. The locations of the cluster centers are

determined by both the input pattern spread and the output

pattern deviations. For classification problems, the class mem-

bership of each training pattern is available and can be used

for clustering, thus significantly improving the decision accuracy.

Examples of supervised clustering include the LVQ family (Koho-

nen, 1990), the ARTMAP family (Carpenter et al., 1991), the con-

ditional FCM (Pedrycz, 1998), the supervised C-means (Al-Harbi &

Rayward-Smith, 2006), and the C-means plus k-NN based clustering
(Bruzzone & Prieto, 1998).

Supervised clustering can be implemented by augmenting the
input pattern with its output pattern, $\tilde{x}_i = \left[ x_i^T, y_i^T \right]^T$, so as to obtain
an improved distribution of the cluster centers by an unsupervised
clustering (Chen, Chen, & Chang, 1993; Pedrycz, 1998; Runkler &
Bezdek, 1999; Uykan, Guzelis, Celebi, & Koivo, 2000). A scaling
factor β is introduced to balance between the similarities in the
input and output spaces, $\tilde{x}_i = \left[ x_i^T, \beta y_i^T \right]^T$ (Pedrycz, 1998). By
applying the FCM, the new cluster centers $c_j = \left[ c_{x,j}^T, c_{y,j}^T \right]^T$ are
obtained. The resulting cluster codebook vectors are projected onto
the input space to obtain the centers.

Based on the enhanced LBG (Patane & Russo, 2001), the clustering
for function approximation (CFA) (Gonzalez, Rojas, Pomares,
Ortega, & Prieto, 2002) algorithm is a supervised clustering method
designed for function approximation. The CFA increases the density
of the prototypes in the input areas where the target function
presents a more variable response, rather than just in the zones
with more input examples (Gonzalez et al., 2002). The CFA minimizes
the variance of the output response of the training examples
belonging to the same cluster. In Staiano, Tagliaferri, and Pedrycz
(2006), a prototype regression function is built as a linear combination
of local linear regression models, one for each cluster, and is
then inserted into the FCM. Thus, the prototypes are adjusted according
to both the input distribution and the regression function
in the output space.
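The augmentation approach to supervised clustering can be sketched in three steps: build the augmented vectors with the scaling factor β, cluster them with an unsupervised method, and project the resulting centers onto the input space. For brevity, a plain batch C-means is used below as a stand-in for the FCM named in the text; the function names and the requirement that `init` be given in the augmented space are assumptions of this sketch.

```python
import math

def augment(X, Y, beta):
    """Form [x^T, beta * y^T]^T for every input-output pair."""
    return [list(x) + [beta * v for v in y] for x, y in zip(X, Y)]

def c_means(data, centers, n_iter=20):
    """Plain batch C-means, used only as a stand-in clusterer."""
    centers = [list(c) for c in centers]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
            groups[j].append(x)
        for j, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def supervised_centers(X, Y, beta, init):
    """Cluster augmented vectors, then project the centers onto the input space."""
    dim_x = len(X[0])
    centers = c_means(augment(X, Y, beta), init)  # init lives in augmented space
    return [c[:dim_x] for c in centers]
```

With inputs 0, 1, 2, 3 and outputs 0, 1, 1, 1, setting β = 0 splits the inputs evenly, whereas β = 10 isolates the first pattern because its output differs from the others.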

11. The under-utilization problem

Conventional competitive learning based clustering like the C-

means or the LVQ suffers from a severe initialization problem

called prototype under-utilization or dead-unit problem, since

some prototypes, called dead units (Grossberg, 1987; Rumelhart

& Zipser, 1985), may never win the competition. This problem is

caused by the fact that only the winning prototype is updated for

every input. Initializing the prototypes with random input vectors

can reduce the probability of the under-utilization problem, but

does not eliminate it. Many efforts have been made to solve the

under-utilization problem.

11.1. Competitive learning with conscience

In the leaky learning strategy (Grossberg, 1987; Rumelhart &

Zipser, 1985), all the prototypes are updated. The winning proto-

type is updated by employing a fast learning rate, while all the los-

ing prototypes move towards the input vector with a much slower

learning rate. Each processing unit is assigned a threshold, which
is increased if the unit wins and decreased otherwise
(Rumelhart & Zipser, 1985).
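A single leaky-learning step can be sketched as below. Only the two learning rates are modeled; the per-unit threshold mechanism is omitted, and the function name and the rate values `eta_win` and `eta_lose` are illustrative assumptions.

```python
import math

def leaky_step(protos, x, eta_win=0.1, eta_lose=0.001):
    """Move every prototype towards x; the winner uses the faster rate."""
    w = min(range(len(protos)), key=lambda k: math.dist(x, protos[k]))
    for k, c in enumerate(protos):
        eta = eta_win if k == w else eta_lose
        protos[k] = [ci + eta * (xi - ci) for ci, xi in zip(c, x)]
    return w  # index of the winning prototype
```

Because losing prototypes also drift towards the data, a prototype initialized far from all inputs is gradually pulled into a region where it can start winning.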

The conscience strategy realizes a similar idea by reducing

the winning rate of the frequent winners (Desieno, 1988). The

frequent winner receives a bad conscience by adding a penalty


term to its distance from the input signal. This leads to an entropy

maximization, that is, each unit wins at an approximately equal

probability. Thus, the probability of under-utilized neurons being

selected as winners is increased.

The popular frequency sensitive competitive learning (FSCL)

(Ahalt, Krishnamurthy, Chen, & Melton, 1990) reduces the under-utilization
problem by introducing a distortion measure that
ensures that all codewords in the codebook are updated with a similar

probability. The codebooks obtained by the FSCL algorithm have

sufficient entropy so that Huffman coding of the VQ indices would

not provide significant additional compression. In the FSCL, each

prototype incorporates a count of the number of times it has been

the winner, uj, j = 1,...,K. The distance measure is modified

to give prototypes with a lower count value a chance to win the

competition. The only difference with the VQ algorithm is that the

winning neuron is found by Ahalt et al. (1990)

?

uw(t) = uw(t − 1) + 1, ui(t) = ui(t − 1) for i ?= w, where w

is the index of the winning neuron and ui(0) = 0, i = 1,...,K.

In (22), uj

selecting the fairness function as F?

codewords initially and gradually turns into competitive learning

as training proceeds to minimize the MSE function.

In the multiplicatively biased competitive learning (MBCL)

model (Choy & Siu, 1998a), the competition among the neurons is

biased by a multiplicative term. The MBCL avoids neuron under-

utilization with probability one, as time goes to infinity. The

FSCL (Ahalt et al., 1990; Krishnamurthy, Ahalt, Melton, & Chen,

1990) is a member of the MBCL family. In the MBCL, only one

weight vector is updated per step. The fuzzy FSCL (FFSCL) (Chung

& Lee, 1994) combines the frequency sensitivity with fuzzy

competitive learning. Since both the FSCL and the FFSCL use a non-

Euclideandistancetodeterminethewinner,theproblemofshared

clusters may occur: a number of prototypes move into the same

cluster as learning proceeds.

cw(t) = argcj

min

j=1,...,K

uj(t − 1)??xt− cj(t − 1)???

??can be generalized as F(uj)??xt− cj

being constants, the FSCL emphasizes the winning uniformity of

(22)

??xt− cj

??. When

uj

?

= uβ0e−t/T0

j

, β0and T0
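The FSCL winner selection in (22) can be illustrated with a minimal Python sketch. The function name `fscl_step`, the learning rate `lr` and the plain-list data layout are assumptions made for illustration and do not come from Ahalt et al. (1990); the generalized fairness function $F(u_j)$ is not included.

```python
import math


def fscl_step(x, prototypes, counts, lr=0.1):
    """Run one FSCL iteration in place and return the winner index."""

    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Winner minimizes the count-weighted distance u_j * ||x - c_j||, as in (22).
    # Ties (for example, while counts are still zero) go to the lowest index.
    w = min(range(len(prototypes)),
            key=lambda j: counts[j] * distance(x, prototypes[j]))

    # Only the winner moves toward the input; `lr` is a hypothetical learning rate.
    prototypes[w] = [c + lr * (xi - c) for c, xi in zip(prototypes[w], x)]
    counts[w] += 1  # u_w(t) = u_w(t-1) + 1; all other counts stay unchanged
    return w
```

With all counts starting at zero, every prototype wins once before the distance term starts to matter, which is one way to see how the count discourages dead units.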

11.2. Rival-penalized competitive learning

The problem of shared clusters is considered in the rival-penalized
competitive learning (RPCL) algorithm (Xu, Krzyzak, & Oja, 1993). The RPCL
adds a new mechanism to the FSCL by creating a rival penalizing force. For
each input, the winning unit is modified to adapt to the input, the
second-place winner called the rival is also updated by a smaller learning
rate along the opposite direction, and all the other prototypes remain
unchanged

$$c_i(t+1) = \begin{cases} c_i(t) + \eta_w \left( x_t - c_i(t) \right), & i = w \\ c_i(t) - \eta_r \left( x_t - c_i(t) \right), & i = r \\ c_i(t), & \text{otherwise} \end{cases} \tag{23}$$

where $w$ and $r$ are the indices of the winning and rival prototypes,
which are decided by (22), and $\eta_w$ and $\eta_r$ are their respective
learning rates, $\eta_w(t) \gg \eta_r$.

This actually pushes the rival away from the sample pattern
so as to prevent it from interfering with the competition. The RPCL
automatically allocates an appropriate number of prototypes for an
input data set, and all the extra candidate prototypes will finally
be pushed to infinity. It provides a better performance than the
FSCL. The RPCL can be regarded as an unsupervised extension of
the supervised LVQ2 (Kohonen, 1990), which simultaneously modifies
the weight vectors of both the winner and its rival when the
winner is in a wrong class but the rival is in a correct class for an
input vector (Xu et al., 1993). The lotto type competitive learning

(LTCL) (Luk & Lien, 1998) can be treated as a generalization of

the RPCL, where instead of just penalizing the nearest rival, all

the losers are penalized equally. The generalized LTCL (Luk & Lien,

1999) modifies the LTCL by allowing more than one winner, which

are divided into tiers, with each tier being rewarded differently.

The RPCL may, however, encounter the over-penalization or
under-penalization problem (Zhang & Liu, 2002). The STepwise
Automatic Rival-penalized (STAR) C-means (Cheung, 2003) is a
generalization of the C-means based on the FSCL (Ahalt et al., 1990)
and a Kullback–Leibler divergence based criterion. The STAR C-means
has a mechanism similar to the RPCL, but penalizes the
rivals in an implicit way, thereby avoiding this problem of the
RPCL.
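A minimal Python sketch of one RPCL step, following the update rule (23) with the winner and rival ranked by the count-weighted distance of (22), is given here for illustration. The function name `rpcl_step`, the default learning rates and the list-based layout are assumptions; the sketch needs at least two prototypes.

```python
import math


def rpcl_step(x, prototypes, counts, lr_win=0.05, lr_rival=0.002):
    """Run one RPCL iteration in place and return (winner, rival) indices."""

    def score(j):
        d = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, prototypes[j])))
        return counts[j] * d  # count-weighted distance, as in (22)

    # Best and second-best prototypes are the winner w and the rival r.
    order = sorted(range(len(prototypes)), key=score)
    w, r = order[0], order[1]

    # Winner moves toward the input; rival is pushed away with a much smaller rate.
    prototypes[w] = [c + lr_win * (xi - c) for c, xi in zip(prototypes[w], x)]
    prototypes[r] = [c - lr_rival * (xi - c) for c, xi in zip(prototypes[r], x)]
    counts[w] += 1  # only the winner's count grows
    return w, r
```

Starting the counts at one (an assumption made in the usage, not taken from the paper) avoids the all-zero ties of the first few steps.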

11.3. Soft competitive learning

The winner-take-most rule relaxes the WTA rule by allowing
more than one neuron as winners to a certain degree. This is
soft competitive learning. Examples are the SCS (Yair et al., 1992),

the SOM (Kohonen, 1989), the NG (Martinetz et al., 1993), the

GNG (Fritzke, 1995a), maximum-entropy clustering (Rose et al.,

1990), the GLVQ (Pal et al., 1993), the FCM (Bezdek, 1981), the

fuzzy competitive learning (FCL) (Chung & Lee, 1994), and fuzzy

clustering algorithms. The FCL algorithms (Chung & Lee, 1994) are

a class of sequential algorithms obtained by fuzzifying competitive

learning algorithms, such as the SCL and the FSCL. The enhanced

sequential fuzzy clustering (ESFC) (Zheng & Billings, 1999) is a

modification to the FCL to better overcome the under-utilization

problem. The SOM (Kohonen, 1989) employs the winner-take-most
strategy at the early stages and approaches a WTA method as
time goes on. Due to the soft competitive strategy, these algorithms
are less likely to be trapped at local minima and to generate dead
units than hard competitive alternatives (Baraldi & Blonda, 1999).

The maximum-entropy clustering (Rose et al., 1990) circumvents
the under-utilization problem and local minima in the error
function by using soft competitive learning and deterministic
annealing. The prototypes are updated by

$$c_i(t+1) = c_i(t) + \eta(t) \left[ \frac{e^{-\beta \left\| x_t - c_i(t) \right\|^2}}{\sum_{j=1}^{K} e^{-\beta \left\| x_t - c_j(t) \right\|^2}} \right] \left( x_t - c_i(t) \right), \tag{24}$$

where $\eta$ is the learning rate, $1/\beta$ anneals from a large number to zero,
and the term within the bracket turns out to be the Boltzmann
distribution. The SCS (Yair et al., 1992) employs a similar soft
competitive strategy, but $\beta$ is fixed as unity.

The winner-take-most criterion, however, detracts some prototypes
from their corresponding clusters, and consequently becomes
biased toward the global mean of the clusters, since all the
prototypes are attracted to each input pattern (Liu, Glickman, &
Zhang, 2000).
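The soft update (24) can be sketched in a few lines of Python. The function name `soft_update` and the choice to return the Boltzmann weights alongside the new prototypes are assumptions for illustration; the annealing schedule for $\beta$ is left to the caller.

```python
import math


def soft_update(x, prototypes, beta, lr):
    """Return (new_prototypes, weights) for one soft competitive step, as in (24)."""
    sq = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in prototypes]

    # Subtracting the smallest squared distance leaves the ratios unchanged
    # and keeps the exponentials from underflowing for large beta.
    smallest = min(sq)
    raw = [math.exp(-beta * (s - smallest)) for s in sq]
    total = sum(raw)
    weights = [r / total for r in raw]  # Boltzmann distribution over prototypes

    # Every prototype moves toward x in proportion to its weight.
    new_prototypes = [
        [ci + lr * p * (xi - ci) for ci, xi in zip(c, x)]
        for c, p in zip(prototypes, weights)
    ]
    return new_prototypes, weights
```

With `beta = 0` every prototype receives the same weight, while a large `beta` concentrates the update on the nearest prototype and approaches the WTA rule.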

12. Robust clustering

Outliers in a data set affect the result of clustering. The influence
of outliers can be eliminated by using the robust statistics
approach (Huber, 1981). This idea has also been incorporated into
many robust clustering methods (Bradley, Mangasarian, & Street,
1996; Dave & Krishnapuram, 1997; Frigui & Krishnapuram, 1999;
Hathaway & Bezdek, 2000; Kersten, 1999; Leski, 2003a). The C-median
clustering (Bradley et al., 1996) is derived by solving a
bilinear programming problem that utilizes the $L_1$-norm distance.
The fuzzy C-median (Kersten, 1999) is a robust FCM method that
uses the $L_1$-norm with the exemplar estimation based on the fuzzy
median. Robust clustering algorithms can be derived by optimizing
an objective function $E_T$, which comprises the cost $E$ for the
conventional algorithms and a constraint term $E_c$ for describing the
noise.


12.1. Noise clustering

In the noise clustering approach (Dave, 1991), all outliers

are collected into a separate, amorphous noise cluster, whose

prototype has the same distance δ from all the data points, while

all the other points are collected into K clusters. The threshold δ

is relatively large compared to the distances of the good points to

their respective cluster prototypes. If a noisy point is far away from

all the K clusters, it is attracted to the noise cluster. In the noise

clustering approach (Dave, 1991), the constraint term is given by

$$E_c = \sum_{i=1}^{N} \delta^2 \left( 1 - \sum_{j=1}^{K} \mu_{ji} \right)^m. \tag{25}$$

Optimizing on $E_T$ yields

$$\mu_{ji} = \frac{\left( \frac{1}{\left\| x_i - c_j \right\|^2} \right)^{\frac{1}{m-1}}}{\sum_{k=1}^{K} \left( \frac{1}{\left\| x_i - c_k \right\|^2} \right)^{\frac{1}{m-1}} + \left( \frac{1}{\delta^2} \right)^{\frac{1}{m-1}}}. \tag{26}$$

The second term in the denominator, due to outliers, lowers $\mu_{ji}$.
The formula for the prototypes is the same as that in the FCM. Thus,
the noise clustering can be treated as a robustified FCM. When
all the K clusters have a similar size, the noise clustering is very
effective. However, a single threshold is too restrictive if the cluster
size varies widely in the data set.
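The membership formula (26) can be evaluated directly, as in the following Python sketch. The function name `noise_clustering_memberships` and the extra returned value (the membership left over for the noise cluster) are assumptions for illustration; the sketch also assumes that the data point does not coincide with any prototype, since that would divide by zero.

```python
def noise_clustering_memberships(x, prototypes, delta, m=2.0):
    """Return (memberships in the K good clusters, membership in the noise cluster)."""
    power = 1.0 / (m - 1.0)

    # One term (1 / ||x - c_k||^2)^(1/(m-1)) per good cluster.
    terms = [
        (1.0 / sum((xi - ci) ** 2 for xi, ci in zip(x, c))) ** power
        for c in prototypes
    ]
    # The noise prototype sits at the same distance delta from every point.
    noise_term = (1.0 / delta ** 2) ** power

    denominator = sum(terms) + noise_term
    memberships = [t / denominator for t in terms]
    return memberships, noise_term / denominator
```

For a point far from all prototypes the good-cluster terms shrink while the noise term stays fixed, so most of the membership goes to the noise cluster.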

12.2. Possibilistic C-means

Unlike fuzzy clustering, the possibilistic C-means (PCM) (Kr-

ishnapuram & Keller, 1993) does not require the sum of the

memberships of a data point across the clusters to be unity. The

membership functions represent a possibility of belonging rather

than a relative degree of membership between clusters. Thus, the

derived degree of membership does not decrease as the number of

clusters increases. Without this constraint, the modified objective

function is decomposed into many individual objective functions,

one for each cluster, which can be optimized separately.

The constraint term for the PCM is given by a sum associated
with the fuzzy complements of all the K clusters

$$E_c = \sum_{j=1}^{K} \beta_j \sum_{i=1}^{N} \left( 1 - \mu_{ji} \right)^m, \tag{27}$$

where $\beta_j$ are suitable positive numbers. The individual objective
functions are given as

$$E_T^j = \sum_{i=1}^{N} \mu_{ji}^m \left\| x_i - c_j \right\|^2 + \beta_j \sum_{i=1}^{N} \left( 1 - \mu_{ji} \right)^m, \quad j = 1, \ldots, K. \tag{28}$$

Optimizing (28) with respect to $\mu_{ji}$ yields the solution

$$\mu_{ji} = \frac{1}{1 + \left( \frac{\left\| x_i - c_j \right\|^2}{\beta_j} \right)^{\frac{1}{m-1}}}. \tag{29}$$

For outliers, $\mu_{ji}$ is small. Some heuristics for selecting $\beta_j$ are given
in Krishnapuram and Keller (1993).

Given a number of clusters K, the FCM will arbitrarily split or

merge real clusters in the data set to produce exactly the specified

number of clusters, while the PCM can find those natural clusters

in the data set. When K is smaller than the number of actual

clusters, only K good clusters are found, and the other data points

are treated as outliers. When K is larger than the number of actual

clusters, all the actual clusters can be found and some clusters will

coincide. In the noise clustering, there is only one noise cluster,

while in the PCM there are K noise clusters. The PCM behaves as

a collection of K independent noise clustering algorithms, each

searching a single cluster. The performance of the PCM, however,

relies heavily on initialization of cluster prototypes and estimation

of $\beta_j$, and the PCM tends to converge to coincidental clusters (Dave

& Krishnapuram, 1997).
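The PCM membership (29) depends only on one cluster at a time, which the following Python sketch makes explicit. The function name `pcm_membership` and the argument names are assumptions for illustration; the heuristics for choosing $\beta_j$ are not reproduced.

```python
def pcm_membership(x, prototype, beta_j, m=2.0):
    """Possibilistic membership of point x in a single cluster, as in (29)."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, prototype))
    # No normalization across clusters: each cluster is evaluated independently.
    return 1.0 / (1.0 + (d2 / beta_j) ** (1.0 / (m - 1.0)))
```

A point lying on the prototype gets membership 1, a point at squared distance $\beta_j$ gets 0.5, and the memberships of one point across several clusters need not sum to unity.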

12.3. Other robust clustering problems

A family of robust clustering algorithms has been obtained by
treating outliers as the fuzzy complement (Yang & Wang, 2004).
Assuming that a noise cluster exists outside each data cluster,
the fuzzy complement of $\mu_{ji}$ can be viewed as the membership
of $x_i$ in the noise cluster with a distance $\beta_j$. Based on this idea,
many different implementations of the probabilistic approach can
be proposed (Dave & Krishnapuram, 1997; Yang & Wang, 2004),
and a general form of $E_c$ is obtained as a generalization of that for
the PCM (Yang & Wang, 2004). The alternating cluster estimation
method (Runkler & Bezdek, 1999) is a simple extension of the
general method (Dave & Krishnapuram, 1997; Yang & Wang, 2004).
The fuzzy robust C-spherical shells algorithm (Yang & Wang,
2004) searches the clusters that belong to the spherical shells by
combining the concept of the fuzzy complement and the fuzzy
C-spherical shells algorithm (Krishnapuram, Nasraoui, & Frigui,
1992). The hard robust clustering algorithm (Yang & Wang, 2004) is
an extension of the GLVQ-F algorithm (Karayiannis et al., 1996). All
these robust algorithms are highly dependent on the initial values
and adjustment of $\beta_j$.

The robust competitive agglomeration (RCA) algorithm (Frigui

& Krishnapuram, 1999) combines the advantages of both the

hierarchical and partitional clustering techniques. The objective

function also contains a constraint term. An optimum number of

clusters is determined via a process of competitive agglomeration,

while the knowledge of the global shape of the clusters is

incorporated via the use of prototypes. Robust statistics such as the
M-estimator (Huber, 1981) are incorporated to combat the outliers.

Overlapping clusters are handled by using fuzzy memberships.

Clustering of a vectorial data set with missing entries belongs to

robust clustering. In Hathaway and Bezdek (2001), four strategies,

namely the whole data, partial distance, optimal completion and

nearest prototype strategies, are discussed for implementing the

FCM for incomplete data. The introduction of the concept of

noise clustering into relational clustering techniques leads to their

robust versions (Dave & Sen, 2002). A review of robust clustering

methods is given in Dave and Krishnapuram (1997).

13. Clustering using non-Euclidean distance measures

Due to the Euclidean distance measure, conventional clustering

methods favor hyperspherically shaped clusters of equal size,

but have the undesirable property of splitting big and elongated

clusters (Duda & Hart, 1973). The Mahalanobis distance can be

used to look for hyperellipsoid shaped clusters. However, the

C-means algorithm using the Mahalanobis distance tends to

produce unusually large or unusually small clusters (Mao & Jain,

1996). The hyperellipsoidal clustering (HEC) network (Mao & Jain,

1996) integrates PCA and clustering into one network, and can

adaptively estimate the hyperellipsoidal shape of each cluster.

The HEC implements clustering using a regularized Mahalanobis

distance that is a linear combination of the Mahalanobis and

Euclidean distances. The regularized distance achieves a trade-off

between the hyperspherical and hyperellipsoidal cluster shapes

to prevent the HEC network from producing unusually large or

unusually small clusters. The Mahalanobis distance is used in the

Gustafson–Kessel algorithm (Gustafson & Kessel, 1979) and the

AFC (Anderson et al., 1982). The symmetry based C-means (Su

& Chou, 2001) employs the C-means as a coarse search for the K
cluster centroids and an ensuing fine-tuning procedure based on the


point-symmetry distance as the dissimilarity measure. The method

can effectively find clusters with symmetric shapes, such as the

human face.

A number of algorithms for detecting circles and hyperspherical
shells have been proposed as extensions of the C-means and FCM
algorithms. These include the fuzzy C-shells (Dave, 1990), fuzzy C-ring
(Man & Gath, 1994), hard C-spherical shells (Krishnapuram
et al., 1992), unsupervised C-spherical shells (Krishnapuram
et al., 1992), fuzzy C-spherical shells (Krishnapuram et al.,
1992), and possibilistic C-spherical shells (Krishnapuram & Keller,
1993) algorithms. All these algorithms are based on iterative
optimization of objective functions similar to that for the FCM, but
define the distance from a prototype $\lambda_i = (c_i, r_i)$ to the point $x_j$ as

$$d_{j,i}^2 = d^2\left( x_j, \lambda_i \right) = \left( \left\| x_j - c_i \right\| - r_i \right)^2, \tag{30}$$

where $c_i$ and $r_i$ are the center and radius of the hypersphere,
respectively. The optimal number of substructures in the data
set can be effectively estimated by using some validity criteria
such as spherical shell thickness (Krishnapuram et al., 1992), fuzzy
hypervolume and fuzzy density (Gath & Geva, 1989; Man & Gath,
1994).

By using different distance measures, many clustering algorithms
can be derived for detecting clusters of various shapes such
as lines and planes (Bezdek, 1981; Dave & Krishnapuram, 1997;
Frigui & Krishnapuram, 1999; Kaymak & Setnes, 2002; Zhang &
Liu, 2002), circles and spherical shells (Krishnapuram et al., 1992;
Pal & Chakraborty, 2000; Zhang & Liu, 2002), ellipses (Frigui &
Krishnapuram, 1999; Gath & Hoory, 1995), curves, curved surfaces,
ellipsoids (Bezdek, 1981; Frigui & Krishnapuram, 1999; Gath &
Geva, 1989; Kaymak & Setnes, 2002; Mao & Jain, 1996), rectangles,
rectangular shells and polygons (Hoeppner, 1997). Relational data
can be clustered by using the non-Euclidean relational FCM (NERFCM)
(Hathaway & Bezdek, 1994, 2000). Fuzzy clustering for relational
data is reviewed in Dave and Sen (2002).
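The shell distance (30) measures how far a point lies from the surface of a hypersphere rather than from its center. A minimal Python sketch, with the hypothetical function name `shell_distance_sq`, is:

```python
import math


def shell_distance_sq(x, center, radius):
    """Squared distance from point x to the shell with the given center and radius."""
    to_center = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
    # Zero on the shell itself, growing both inside and outside of it.
    return (to_center - radius) ** 2
```

A point on the shell has distance zero, whereas the center of the shell is at squared distance `radius ** 2`, so prototypes of this kind fit rings and not filled discs.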

14. Hierarchical clustering

Existing clustering algorithms are broadly classified into

partitional, hierarchical, and density based clustering. Clustering

methods discussed thus far belong to partitional clustering.

14.1. Partitional, hierarchical, and density based clustering

Partitional clustering can be either hard or fuzzy. Fuzzy
clustering can deal with overlapping cluster boundaries. Partitional
clustering is dynamic, where points can move from one cluster
to another. Knowledge of the shape or size of the clusters can
be incorporated by using appropriate prototypes and distance
measures. Partitional clustering is susceptible to local minima of
its objective function, and the number of clusters K is usually
required to be prespecified. Also, it is sensitive to noise and outliers.
Partitional clustering has a typical complexity of O(N).

Hierarchical clustering consists of a sequence of partitions in
a hierarchical structure, which can be represented as a clustering
tree called a dendrogram. Hierarchical clustering takes the form of
either an agglomerative or a divisive technique. New clusters are formed
by reallocating the membership degree of one point at a time,
based on a certain measure of similarity or distance. Agglomerative
clustering is suitable for data with dendritic substructure. Outliers
can be easily identified in hierarchical clustering, since they merge
with other points less often due to their larger distances from the
other points, and the number of outliers is typically much smaller than
the number of points in a cluster. The number of clusters K need not be
specified, and the local minimum problem arising from initialization does
not occur. However, prior knowledge of the shape or size of the
clusters cannot be incorporated, and overlapping clusters cannot
be separated. Moreover, hierarchical clustering is static, and points
committed to a given cluster cannot move to a different cluster.
Hierarchical clustering has a typical complexity of $O(N^2)$, making
it impractical for larger data sets. Divisive clustering reverses the
procedure, but is computationally more expensive (Xu & Wunsch
II, 2005).

Density based clustering groups objects of a data set into clusters
based on density conditions. Clusters are dense regions of
objects in the data space and are separated by regions of low density.
The method is robust against outliers since an outlier affects
clustering only in the neighborhood of this data point. It can handle
outliers and discover clusters of arbitrary shape. Density based
clustering has a complexity of the same order as hierarchical clustering.
The DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) is a widely
known density based clustering algorithm. In the DBSCAN, a region
is defined as the set of points that lie in the $\epsilon$-neighborhood of
some point p. Cluster label propagation from p to the other points
in a region R happens if |R|, the cardinality of R, exceeds a given
threshold for the minimal number of points.
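The region test that drives label propagation in the DBSCAN can be sketched as follows. The function names `region` and `is_core`, and the parameter names `eps` and `min_pts`, are assumptions for illustration. The sketch counts the point p itself as part of its region and uses the "at least `min_pts`" form of the threshold; it shows only the test, not the full cluster expansion.

```python
import math


def region(points, p_index, eps):
    """Indices of all points within distance eps of points[p_index] (p included)."""
    p = points[p_index]
    return [
        i for i, q in enumerate(points)
        if math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) <= eps
    ]


def is_core(points, p_index, eps, min_pts):
    """True if the eps-neighborhood of the point is dense enough to propagate labels."""
    return len(region(points, p_index, eps)) >= min_pts
```

An isolated outlier has a region containing only itself, so it never passes the test and never spreads a cluster label to other points.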

14.2. Distance measures and cluster representations

The inter-cluster distance is usually characterized by the single-linkage
or the complete-linkage technique. The single-linkage
technique calculates the inter-cluster distance using the two closest
data points in different clusters. The method is more suitable for
finding well-separated stringy clusters. In contrast, the complete-linkage
technique defines the inter-cluster distance as the farthest
distance between any two data points in different clusters. Other
more complicated methods are group-average-linkage, median-linkage,
and centroid-linkage techniques.
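The two basic linkage rules differ only in whether the closest or the farthest pair of points is used, as the following Python sketch shows. The function names are assumptions for illustration, and each cluster is assumed to be a non-empty list of points.

```python
import math


def _distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest points of the two clusters."""
    return min(_distance(p, q) for p in cluster_a for q in cluster_b)


def complete_linkage(cluster_a, cluster_b):
    """Distance between the two farthest points of the two clusters."""
    return max(_distance(p, q) for p in cluster_a for q in cluster_b)
```

For two elongated clusters lying end to end, the single-linkage distance stays small while the complete-linkage distance grows with the cluster length, which is why the former tends to follow stringy shapes.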

A cluster is conventionally represented by its centroid or prototype.
This is desirable only for spherically shaped clusters, but
causes cluster splitting for a large or arbitrarily shaped cluster,
since the centroids of its subclusters can be far apart. At the other
extreme, if all data points in a cluster are used as its representatives,
the clustering algorithm is extremely sensitive to noise and
outliers. This all-points representation can cluster arbitrary shapes.
The scatter-points representation (Guha, Rastogi, & Shim, 2001),
as a trade-off between the two extremes, represents each cluster
by a fixed number of points that are generated by selecting well-scattered
points from the cluster and then shrinking them toward
the center of the cluster by a specified fraction. This reduces the adverse
effects of the outliers since the outliers are typically farther
away from the mean and are thus shifted by a larger distance due to
shrinking. The scatter-points representation achieves robustness
to outliers, and identifies clusters that have non-spherical shape
and wide variations in size.
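The scatter-points representation can be sketched in Python as follows. The selection rule used here, a greedy farthest-point heuristic that starts from the point farthest from the centroid, is an assumption about how "well-scattered" points may be chosen; the function name `scatter_points` and the parameters `n_rep` and `shrink` are likewise hypothetical.

```python
import math


def scatter_points(cluster, n_rep, shrink):
    """Pick up to n_rep well-scattered points and shrink them toward the centroid."""

    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    dim = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

    # Start with the point farthest from the centroid, then repeatedly add
    # the point whose nearest already-chosen representative is farthest away.
    chosen = [max(cluster, key=lambda p: distance(p, centroid))]
    while len(chosen) < min(n_rep, len(cluster)):
        chosen.append(
            max(cluster, key=lambda p: min(distance(p, c) for c in chosen))
        )

    # Move every representative toward the centroid by the fraction `shrink`.
    return [
        [pi + shrink * (ci - pi) for pi, ci in zip(p, centroid)]
        for p in chosen
    ]
```

Because the shift is proportional to the distance from the centroid, a far-away outlier chosen as a representative is pulled in by a larger absolute amount than a typical point.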

14.3. Agglomerative clustering

Agglomerative clustering starts from N clusters, each containing
one data point. A series of nested merging is performed until
all the data points are grouped into one cluster. The algorithm
processes a set of $N^2$ numerical relationships between the N data
points, and agglomerates according to their similarity or distance.
Agglomerative clustering is based on a local connectivity criterion.
The run time is $O(N^2)$. A dendrogram is used to illustrate the clusters
produced by agglomerative clustering. Agglomerative clustering
can be based on the centroid (Zhang, Ramakrishnan, & Livny, 1996),
all-points (Zahn, 1971), or scatter-points (Guha et al., 2001)
representation. For large data sets, storage or multiple input/output
scans of the data points is a bottleneck for the existing clustering
algorithms. Some strategies can be applied to combat this problem
(Guha et al., 2001; Vesanto & Alhoniemi, 2000; Wang & Rau,
2001; Zhang et al., 1996).


The conventional minimum spanning tree (MST) algorithm
(Zahn, 1971) is a graph-theoretical technique (Swamy & Thulasiraman,
1981; Thulasiraman & Swamy, 1992). It uses the all-points
representation. The method first finds an MST for the input data.
Then, by removing the longest K − 1 edges, K clusters are obtained.
The MST algorithm is good at clustering arbitrary shapes.
The method, however, is very sensitive to the outliers, and it may
merge two clusters due to a chain of outliers between them.

The BIRCH method (Zhang et al., 1996) first performs an incremental
and approximate preclustering phase in which dense regions
of points are represented by compact summaries, and a centroid
based hierarchical algorithm is then used to cluster the set of summaries.
The outliers are eliminated from the summaries via the
identification of the sparsely distributed data points in the feature
space. The BIRCH needs only a little more than one scan of the data.
However, the method fails to identify clusters with non-spherical
shapes or a wide variation in size, since it splits larger clusters and
merges smaller clusters.

The CURE method (Guha et al., 2001) is a
robust clustering algorithm based on the scatter-points representation.
To handle large databases, the CURE employs a combination
of random sampling and partitioning. The complexity of the CURE
is not worse than that of centroid based hierarchical algorithms.
The CURE provides a better performance with less execution time
compared to the BIRCH (Guha et al., 2001). It can discover clusters
with interesting shapes and is less sensitive to the outliers than the
MST.

The CHAMELEON (Karypis, Han, & Kumar, 1999) first creates
a graph, where each node represents a pattern and all the nodes
are connected according to the k-NN paradigm. The graph is recursively
partitioned into many small unconnected subgraphs, each
partitioning yielding two subgraphs of roughly equal size. Agglomerative
clustering is applied to the subclusters. Two subclusters are
merged only when the interconnectivity and the closeness between them
are comparable to the internal interconnectivity and closeness of
the individual clusters. The CHAMELEON automatically
adapts to the characteristics of the clusters being merged. The
method is more effective than the CURE in discovering clusters of
arbitrary shapes and varying densities (Karypis et al., 1999).
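The MST clustering procedure described above (build an MST, then remove the longest K − 1 edges) can be sketched in Python. The use of Prim's algorithm on the complete Euclidean graph, the union-find labelling and the function name `mst_clusters` are implementation choices made for illustration and are not taken from Zahn (1971).

```python
import math


def mst_clusters(points, k):
    """Label points by cutting the k-1 longest edges of a minimum spanning tree."""
    n = len(points)

    def distance(i, j):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))

    # Prim's algorithm on the complete graph: O(N^2) time.
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge weight into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = distance(u, v)
                if d < best[v]:
                    best[v], parent[v] = d, u

    # An MST has n-1 edges; keeping the n-k shortest removes the k-1 longest.
    edges.sort()
    kept = edges[: n - k]

    # Connected components of the remaining forest, via union-find.
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, a, b in kept:
        root[find(a)] = find(b)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Placing a few extra points in a line between two groups keeps every MST edge short, so cutting the longest edge no longer separates the groups; this is the chaining sensitivity mentioned in the text.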

14.4. Hybridization of hierarchical and partitional clusterings

The advantages of both the hierarchical and the partitional

clustering have been incorporated into many methods (Frigui

& Krishnapuram, 1999; Geva, 1999; Su & Liu, 2005; Vesanto

& Alhoniemi, 2000; Wang & Rau, 2001). The VQ-clustering and

VQ-agglomeration methods (Wang & Rau, 2001) involve a VQ

process followed, respectively, by clustering and agglomerative

clustering that treat the codewords as initial prototypes. Each

codeword is associated with a gravisphere that has a well defined

attraction radius. The agglomeration algorithm requires that each

codeword be moved directly to the centroid of its neighboring

codewords. A similar two-stage clustering procedure that uses the

SOM for VQ and an agglomerative clustering or the C-means for

further clustering is given in Vesanto and Alhoniemi (2000). The

performance results of these two-stage methods are comparable

to those of direct methods, with a significantly reduced execution

time (Vesanto & Alhoniemi, 2000; Wang & Rau, 2001). A two-stage

procedure given in Su and Liu (2005) can cluster data with arbi-

trary shapes, where an ART-like algorithm partitions data into a

set of small multi-dimensional hyperellipsoids and an agglomerative
algorithm sequentially merges those hyperellipsoids. Dendrograms
and the so-called tables of relative frequency counts are then
used to pick some trustable clustering results from a lot of different
clustering results. In the hierarchical unsupervised fuzzy clustering
(HUFC) (Geva, 1999), PCA is applied to each cluster for optimal feature
extraction. This method is effective for data sets with a wide

dynamic variation in both the covariance matrix and the number

of members in each class. The robust competitive agglomeration

(RCA) (Frigui & Krishnapuram, 1999) finds the optimum number

of clusters by competitive agglomeration, and achieves noise im-

munity by integrating robust statistics.

15. Constructive clustering techniques

Conventional partitional clustering algorithms assume a network
with a fixed number of clusters (nodes) K. However, selecting
the appropriate value of K is a difficult task without prior
knowledge of the input data. Constructive clustering can solve this
difficulty.

A simple strategy for determining K is to perform clustering for

a range of K, and select the value of K that minimizes a cluster

validity measure. This procedure is computationally intensive

when the actual number of clusters is large. Examples of such

strategy are the scatter based FSCL clustering (Sohn & Ansari,

1998) and a method using the distortion errors plus a codebook

complexity term as the cost function (Buhmann & Kuhnel, 1993).

The ISODATA (Ball & Hall, 1967) can be treated as a variant of the

incremental C-means (MacQueen, 1967) by incorporating some

heuristics for merging and splitting clusters, and for handling

outliers; thus, it realizes a variable number of clusters K.

A self-creating mechanism in the competitive learning process

can adaptively determine the natural number of clusters. The self-

creating and organizing neural network (SCONN) (Choi & Park,

1994) employs adaptively modified node thresholds to control

its self-growth. For a new input, the winning node is updated if

it is active; otherwise a new node is created from the winning

node. Activation levels of all the nodes decrease with time, so that

the weight vectors are distributed at the final stage according to

the input distribution. Nonuniform VQ is realized by decreasing

the activation levels of the active nodes and increasing those

of the other nodes to estimate the asymptotic point density

automatically. The SCONN avoids the under-utilization problem,

and has VQ accuracy and speed advantage over the SOM and the

batch C-means (Linde et al., 1980).

The growing cell structures (GCS) network (Fritzke, 1994a)

can be viewed as a modification of the SOM by integrating node

recruiting/pruning functions. It assigns each node a local
accumulated statistical variable called signal counter $u_i$. For each
new pattern, only the winning node increases its signal counter $u_w$
by 1, and then all the signal counters $u_i$ decay with a forgetting

factor. After a fixed number of iterations, a new node is inserted

between the node with the largest signal counter and its farthest

neighbor. The algorithm occasionally prunes a node with its signal

counter below a specified threshold during a complete epoch.

The growing grid network (Fritzke, 1995b) is strongly related to

the GCS. As opposed to the GCS, the growing grid has a strictly

rectangular topology. By inserting complete rows or columns of

units, the grid may adapt its height/width ratio to the given

pattern distribution. The branching competitive learning (BCL)

network (Xiong, Swamy, Ahmad, & King, 2004) adopts the same

technique for recruiting and pruning nodes as the GCS except that

a new geometrical criterion is applied to the winning node before

updating its signal counter uw.
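The signal-counter bookkeeping of the GCS can be illustrated with a short Python sketch. The multiplicative form of the decay, the placement of the new node at the midpoint, and the names `update_signal_counters`, `insertion_site`, `forgetting` and `neighbors` are all assumptions made for illustration; the text only states that counters decay with a forgetting factor and that the new node is inserted between the two selected nodes.

```python
import math


def update_signal_counters(counters, winner, forgetting=0.05):
    """Increment the winner's counter, then decay all counters (assumed multiplicative)."""
    counters = list(counters)
    counters[winner] += 1.0
    return [u * (1.0 - forgetting) for u in counters]


def insertion_site(counters, prototypes, neighbors):
    """Return (q, f, new_prototype) for the node q with the largest counter.

    `neighbors` maps a node index to the indices of its topological neighbors;
    f is the neighbor of q that lies farthest from q in input space.
    """

    def distance(i, j):
        return math.sqrt(
            sum((a - b) ** 2 for a, b in zip(prototypes[i], prototypes[j]))
        )

    q = max(range(len(counters)), key=lambda i: counters[i])
    f = max(neighbors[q], key=lambda j: distance(q, j))
    # Assumed placement: halfway between q and its farthest neighbor.
    midpoint = [(a + b) / 2.0 for a, b in zip(prototypes[q], prototypes[f])]
    return q, f, midpoint
```

A node whose counter stays near zero over a whole epoch is a natural candidate for the pruning step described in the text.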

The GNG model (Fritzke, 1995a, 1997a) is based on the GCS

(Fritzke, 1994a) and the NG (Martinetz et al., 1993). The GNG is ca-

pable of generating and removing neurons and lateral connections

dynamically. Lateral connections are generated by the competitive

Hebbian learning rule. The GNG achieves robustness against noise

and performs perfect topology-preserving mapping. The GNG with

utility criterion (GNG-U) (Fritzke, 1997a) integrates an on-line cri-

terion to identify and delete useless neurons, and can thus track

nonstationary data input. A similar on-line clustering method is

given in Furao and Hasegawa (2005). The dynamic cell structures

(DCS) model (Bruske & Sommer, 1995) uses a modified Kohonen

learning rule to adjust the prototypes and the competitive Heb-

bian rule so as to establish a dynamic lateral connection structure.

Applying the DCS to the GCS yields the DCS-GCS algorithm, which

has a behavior similar to that of the GNG. The life-long learning cell


structures (LLCS) algorithm (Hamker, 2001) is an on-line clustering

and topology representation method. It employs a strategy similar

to that of the ART, and incorporates similarity based unit pruning

and aging based edge pruning procedures.

The self-splitting competitive learning (SSCL) (Zhang & Liu,

2002) can find the natural number of clusters based on the

one-prototype-take-one-cluster (OPTOC) paradigm and a validity

measure for self-splitting. The OPTOC enables each prototype to

situate at the centroid of one natural cluster when the number of

clusters is greater than that of the prototypes. The SSCL starts with

a single prototype and splits adaptively until all the clusters are

found. During the learning process, one prototype is chosen to split

into two prototypes according to the validity measure, until the

SSCL achieves an appropriate number of clusters.

16. Miscellaneous clustering methods

There are also numerous density based and graph theory

based clustering algorithms. Here, we mention some algorithms

associated with competitive learning and neural networks. The

LBG has been implemented by storing the data points via a k-d
tree, which typically runs an order of magnitude faster than the
conventional LBG (Kanungo et al., 2002). The expectation–maximization (EM)

clustering (Bradley, Fayyad, & Reina, 1998) represents each cluster

using a probability distribution, typically a Gaussian distribution.

Each cluster is represented by a mean and a $J_1 \times J_1$ covariance
matrix, where $J_1$ is the dimension of an input vector. Each

pattern belongs to all the clusters with the probabilities of

membership determined by the distributions of the corresponding

clusters. Thus, the EM clustering can be treated as a fuzzy

clustering technique. The EM technique is derived by maximizing

the log likelihood of the probability density function of the

mixture model. The C-means is equivalent to the classification

EM (CEM) algorithm corresponding to the uniform spherical

Gaussian model (Celeux & Govaert, 1992; Xu & Wunsch II, 2005).

Kernel based clustering first nonlinearly maps the patterns into an

arbitrarily high-dimensional feature space, and clustering is then

performed in the feature space. Some examples are the kernel

C-means (Scholkopf, Smola, & Muller, 1998), kernel subtractive

clustering (Kim et al., 2005), variants of kernel C-means based

on the SOM and the ART (Corchado & Fyfe, 2000), a kernel based

algorithm that minimizes the trace of the within-class scatter

matrix (Girolami, 2002), and support vector clustering (SVC) (Ben-

Hur, Horn, Siegelmann, & Vapnik, 2001; Camastra & Verri, 2005;

Chiang & Hao, 2003). The SVC can effectively deal with the outliers.
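To illustrate how clustering can be carried out in the feature space without computing the nonlinear map explicitly, the sketch below implements a hard kernel C-means on a precomputed kernel (Gram) matrix. The function names, the RBF kernel, its parameter `gamma`, and the random initial labels are assumptions made for illustration; the cited papers may differ in detail.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def feature_space_sqdist(Kmat, labels, n_clusters):
    """Squared distance of each mapped pattern to each cluster mean.

    Uses ||phi(x_i) - m_k||^2 = K_ii - 2 mean_j K_ij + mean_{j,l} K_jl,
    where j and l run over the members of cluster k.
    """
    N = Kmat.shape[0]
    dist = np.full((N, n_clusters), np.inf)  # empty clusters stay at infinity
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue
        dist[:, k] = (np.diag(Kmat)
                      - 2.0 * Kmat[:, idx].mean(axis=1)
                      + Kmat[np.ix_(idx, idx)].mean())
    return dist

def kernel_cmeans(Kmat, init_labels, n_clusters, n_iter=20):
    """Alternate between distance evaluation and reassignment."""
    labels = init_labels.copy()
    for _ in range(n_iter):
        new_labels = feature_space_sqdist(Kmat, labels, n_clusters).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy data (assumed): two blobs, random initial labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])
init = rng.integers(0, 2, size=len(X))
labels = kernel_cmeans(rbf_kernel(X, X, gamma=0.5), init, n_clusters=2)
```

With the linear kernel `X @ X.T`, the distances returned by `feature_space_sqdist` reduce to the squared Euclidean distances to the cluster centroids, so the procedure falls back to the ordinary C-means.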

17. Cluster validity

A number of clusters is optimal, or a clustering algorithm is good,

only in the sense of a certain cluster validity criterion. Many cluster

validity measures are defined for this purpose.

17.1. Measures based on maximal compactness and maximal separation of clusters

A good clustering algorithm should generate clusters with small

intra-cluster deviations and large inter-cluster separations. Cluster

compactness and cluster separation are two measures for the

performance of clustering. A popular cluster validity measure is

defined as (Davies & Bouldin, 1979; Du & Swamy, 2006)

$$E_{\mathrm{WBR}} = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \left[ \frac{d_{\mathrm{WCS}}(\mathbf{c}_k) + d_{\mathrm{WCS}}(\mathbf{c}_l)}{d_{\mathrm{BCS}}(\mathbf{c}_k, \mathbf{c}_l)} \right], \tag{31}$$

where the within-cluster scatter for cluster $k$, denoted $d_{\mathrm{WCS}}(\mathbf{c}_k)$, and the between-cluster separation for clusters $k$ and $l$, denoted $d_{\mathrm{BCS}}(\mathbf{c}_k, \mathbf{c}_l)$, are calculated by

$$d_{\mathrm{WCS}}(\mathbf{c}_k) = \frac{\sum_{i} \left\| \mathbf{x}_i - \mathbf{c}_k \right\|}{N_k}, \qquad d_{\mathrm{BCS}}(\mathbf{c}_k, \mathbf{c}_l) = \left\| \mathbf{c}_k - \mathbf{c}_l \right\|, \tag{32}$$

$N_k$ being the number of data points in cluster $k$. The best clustering minimizes $E_{\mathrm{WBR}}$. This index indicates good clustering results for spherical clusters (Vesanto & Alhoniemi, 2000). Alternative criteria for the cluster compactness, cluster separation, and overall cluster quality measures are given in He et al. (2004). In Xie and Beni (1991), the ratio of compactness and separation is used as a cluster validity criterion for fuzzy clustering. Entropy cluster validity measures based on class conformity are given in Boley (1998) and He et al. (2004). Some cluster validity measures are described and compared in Bezdek and Pal (1998).
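The index of Eqs. (31) and (32) can be evaluated directly from a hard partition. The following is a small sketch; the function name `e_wbr` and the four-point example are illustrative assumptions, and the code assumes at least two non-empty clusters with distinct prototypes.

```python
import numpy as np

def e_wbr(X, labels, centers):
    """Validity index of Eq. (31) for a hard partition.

    X: (N, J) patterns; labels: (N,) cluster index of each pattern;
    centers: (K, J) prototypes c_k.
    """
    K = centers.shape[0]
    # Eq. (32): within-cluster scatter, the mean distance of members to c_k.
    d_wcs = np.array([
        np.linalg.norm(X[labels == k] - centers[k], axis=1).mean()
        for k in range(K)
    ])
    # Eq. (32): between-cluster separation, the distance between prototypes.
    d_bcs = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    # Exclude l = k from the maximum: the diagonal ratio becomes zero.
    np.fill_diagonal(d_bcs, np.inf)
    ratio = (d_wcs[:, None] + d_wcs[None, :]) / d_bcs
    return ratio.max(axis=1).mean()

# Worked example: two clusters with scatter 1 each and separation 10.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
value = e_wbr(X, labels, centers)  # (1 + 1) / 10 = 0.2
```

A smaller value indicates more compact and better separated clusters, so the index can be compared across candidate numbers of clusters.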

17.2. Measures based on minimal hypervolume and maximal density of clusters

A good partitioning of the data usually leads to a small

total hypervolume and a large average density of the clusters.

Cluster validity measures can be thus selected as the hypervolume

and average density of the clusters. The fuzzy hypervolume

criterion (Gath & Geva, 1989; Krishnapuram et al., 1992) is defined

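as the sum of the volumes of all the clusters, $V_i$, and $V_i = \left[\det(\mathbf{F}_i)\right]^{\frac{1}{2}}$, where $\mathbf{F}_i$, the fuzzy covariance matrix of the $i$th cluster, is defined by Gustafson and Kessel (1979)

$$\mathbf{F}_i = \frac{1}{\sum_{j=1}^{N} \mu_{ij}^{m}} \sum_{j=1}^{N} \mu_{ij}^{m} \left( \mathbf{x}_j - \mathbf{c}_i \right) \left( \mathbf{x}_j - \mathbf{c}_i \right)^{T}. \tag{33}$$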
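As an illustration of Eq. (33), the sketch below computes the fuzzy covariance matrices and the fuzzy hypervolume from a membership matrix. The function names, the array layout (`U[i, j]` is the membership of pattern j in cluster i), and the default fuzzifier `m=2.0` are assumptions for this example.

```python
import numpy as np

def fuzzy_covariances(X, centers, U, m=2.0):
    """Fuzzy covariance matrix F_i of Eq. (33) for every cluster.

    X: (N, J) patterns; centers: (K, J) prototypes c_i;
    U: (K, N) memberships mu_ij; m: fuzzifier.
    """
    Um = U ** m
    F = []
    for i in range(centers.shape[0]):
        diff = X - centers[i]  # rows are x_j - c_i
        F.append((Um[i][:, None] * diff).T @ diff / Um[i].sum())
    return np.array(F)

def fuzzy_hypervolume(X, centers, U, m=2.0):
    """Sum over clusters of V_i = sqrt(det(F_i))."""
    F = fuzzy_covariances(X, centers, U, m)
    return np.sqrt(np.linalg.det(F)).sum()

# Worked example: one cluster at the origin with full memberships.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
centers = np.array([[0.0, 0.0]])
U = np.ones((1, 4))
F = fuzzy_covariances(X, centers, U)        # diag(0.5, 2.0)
volume = fuzzy_hypervolume(X, centers, U)   # sqrt(0.5 * 2.0) = 1.0
```

A partition with a smaller total hypervolume is preferred under this criterion.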

The average fuzzy density criterion (Gath & Geva, 1989) is defined as the average of the fuzzy density in each cluster, $S_i / V_i$, where $S_i$ sums the membership degrees of only those members within a hyperellipsoid defined by $\mathbf{F}_i$. The fuzzy hypervolume criterion typically has a clear extremum; the average fuzzy density criterion is not desirable when there is a substantial cluster overlapping and a large variation in the compactness of the clusters (Gath & Geva, 1989). A partitioning that results in both dense and loose clusters may lead to a large average fuzzy density.

For shell clustering, the hypervolume and average density measures are still applicable. However, the distance vector between a pattern and a prototype needs to be redefined. In the case of spherical shell clustering, the displacement or distance vector between a pattern $\mathbf{x}_j$ and a prototype $\boldsymbol{\lambda}_i = (\mathbf{c}_i, r_i)$ is defined by

$$\mathbf{d}_{ji} = \left( \mathbf{x}_j - \mathbf{c}_i \right) - r_i \frac{\mathbf{x}_j - \mathbf{c}_i}{\left\| \mathbf{x}_j - \mathbf{c}_i \right\|}. \tag{34}$$

The fuzzy hypervolume and average fuzzy density measures for spherical shell clustering are obtained by replacing the distance vector $\left( \mathbf{x}_j - \mathbf{c}_i \right)$ in (33) by $\mathbf{d}_{ji}$. For shell clustering, the shell thickness measure can be used to describe the compactness of a shell. In the case of fuzzy spherical shell clustering, the fuzzy shell thickness of a cluster is defined in Krishnapuram et al. (1992). The average shell thickness of all clusters can be used as a cluster validity measure for shell clustering.
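The displacement vector of Eq. (34) points from the nearest point on the spherical shell to the pattern, so its length is the absolute difference between the pattern's distance to the center and the radius. A minimal sketch follows; the function name and the example values are assumptions, and patterns that coincide with the center are assumed not to occur.

```python
import numpy as np

def shell_distance_vectors(X, center, radius):
    """Distance vectors d_ji of Eq. (34) for one spherical shell prototype.

    X: (N, J) patterns x_j; center: (J,) shell center c_i; radius: r_i.
    """
    diff = X - center
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    # Subtract the radial component of length r_i from each displacement.
    return diff - radius * diff / norms

# Worked example: a shell of radius 2 centered at the origin.
X = np.array([[3.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
d = shell_distance_vectors(X, np.zeros(2), 2.0)
# Rows: [1, 0] (outside the shell), [0, -1] (inside), [0, 0] (on the shell).
```

Substituting these vectors for the differences between patterns and centers in Eq. (33) gives the shell versions of the hypervolume and density measures described above.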

18. Computer simulations

In this section, we give two examples to illustrate the application

of clustering algorithms.

18.1. An artificial example

Fig. 3. Random data points in the two-dimensional space. In each of the two quarters, there are 1000 uniformly random points. (a) The output cells are arranged in a 10 × 10 grid. (b) The output cells are arranged in a one-dimensional grid of 100 cells. Both panels show the prototypes in the (c_{i,1}, c_{i,2}) plane after 1000 epochs.

Fig. 4. The iris classification: (a) The iris data set and the class information. (b) The clustering result by the FCM. (c) The clustering result by the subtractive clustering. Each panel is plotted over the feature triples (x1, x2, x3), (x1, x2, x4), and (x2, x3, x4).

Consider a data set of 1000 random data points in the two-dimensional space, with 500 uniformly random data points in each of the two half rings. We use the SOM to realize VQ and topology preservation by producing a grid of cells. Simulation is based on the Matlab Neural Network Toolbox. In the first group of simulations, the output cells are arranged in a 10 × 10 grid, and the hexagonal neighborhood topology is used. The training result for 1000 epochs is shown in Fig. 3a. When the 100 output cells are arranged in one dimension, the training result for 1000 epochs is shown in Fig. 3b. Given a test point, the trained network can always find the prototype based on the nearest-neighbor paradigm.
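The experiment above was run with the Matlab Neural Network Toolbox. For readers without that toolbox, the following NumPy sketch shows the same kind of procedure under several assumptions that are not from the paper: the half-ring geometry in `make_half_rings`, a rectangular lattice instead of the hexagonal one, a Gaussian neighborhood function, linearly decaying learning rate and neighborhood width, and far fewer epochs than 1000.

```python
import numpy as np

def make_half_rings(n_per_ring=500, seed=0):
    """Hypothetical data: two half rings, 500 uniform points each."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(0.8, 1.2, size=(2, n_per_ring))
    t = rng.uniform(0.0, np.pi, size=(2, n_per_ring))
    upper = np.column_stack([r[0] * np.cos(t[0]) - 0.5, r[0] * np.sin(t[0])])
    lower = np.column_stack([r[1] * np.cos(t[1]) + 0.5, -r[1] * np.sin(t[1])])
    return np.vstack([upper, lower])

def train_som(X, grid_shape=(10, 10), n_epochs=20, lr0=0.5, seed=0):
    """Online SOM training on a rectangular lattice of output cells."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Lattice coordinates of the output cells, used by the neighborhood.
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    # Initialize the prototypes with randomly chosen patterns.
    protos = X[rng.choice(len(X), size=rows * cols, replace=False)].copy()
    sigma0 = max(rows, cols) / 2.0
    n_steps = n_epochs * len(X)
    step = 0
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                     # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac  # shrinking neighborhood
            winner = np.argmin(((protos - x) ** 2).sum(axis=1))
            g = np.exp(-((grid - grid[winner]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
            protos += lr * g[:, None] * (x - protos)
            step += 1
    return protos

def nearest_prototype(protos, x):
    """Index of the prototype closest to a test point."""
    return int(np.argmin(((protos - x) ** 2).sum(axis=1)))

X = make_half_rings()
protos = train_som(X, grid_shape=(10, 10), n_epochs=5)
cell = nearest_prototype(protos, np.array([0.0, 0.0]))
```

Passing `grid_shape=(1, 100)` arranges the 100 output cells in one dimension, as in the second group of simulations. The results of this sketch are not expected to match Fig. 3 exactly, since the data geometry and training schedule differ from the toolbox defaults.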