
International Journal of Computational Intelligence and Applications

© World Scientific Publishing Company

MATCH TRACKING STRATEGIES FOR FUZZY ARTMAP

NEURAL NETWORKS

PHILIPPE HENNIGES, ERIC GRANGER* AND ROBERT SABOURIN

Laboratoire d’imagerie, de vision et d’intelligence artiﬁcielle

Dépt. de génie de la production automatisée
École de technologie supérieure, Montreal, Canada

LUIZ S. OLIVEIRA

Dept. de Inform´

atica Aplicada

Pontif´

ıcia Universidade Cat´

olica do Paran´

a (PUCPR), Curitiba, Brazil

Received (received date)

Revised (revised date)

Training fuzzy ARTMAP neural networks for classification using data from complex real-world

environments may lead to category proliferation and yield poor performance. This problem is known

to occur whenever the training set contains noisy or overlapping data. Moreover, when the training set

contains inconsistent cases (i.e., identical input patterns that belong to different recognition classes),

fuzzy ARTMAP will fail to converge. To circumvent these problems, some alternatives to the network’s

original match tracking (MT) process have been proposed in the literature, such as using negative MT, and

removing MT altogether. In this paper, the impact on fuzzy ARTMAP performance of training with

different MT strategies is assessed empirically, using different synthetic data sets, and the NIST SD19

data set (a handwritten numerical character recognition problem). During computer simulations, fuzzy

ARTMAP is trained with the original (positive) match tracking (MT+), with negative match tracking

(MT-), and without MT algorithm (WMT). Their performance is compared to that of fuzzy ARTMAP

where the MT parameter is optimized during training using a Particle Swarm Optimisation (PSO)-

based strategy, denoted PSO(MT). Through a comprehensive set of simulations, it has been observed

that by training with MT-, fuzzy ARTMAP expends fewer resources than with other MT strategies,

but can achieve a signiﬁcantly higher generalization error, especially for data with overlapping class

distributions. In particular, degradation of error in fuzzy ARTMAP performance due to overtraining

is more pronounced for MT- than for MT+. Generalization error achieved using WMT is signiﬁcantly

higher than other strategies on data with complex non-linear decision bounds. Furthermore, the number

of internal categories required to represent decision boundaries increases signiﬁcantly. Optimizing

the value of the match tracking parameter using PSO(MT) yields the lowest overall generalization

error, and requires fewer internal categories than WMT, but generally more categories than MT+ and

MT-. However, this strategy requires a large number of training epochs to converge. Based on these empirical results with PSO(MT), the MT process as such can provide a significant improvement in fuzzy ARTMAP performance, assuming that the MT parameter is tuned for the specific application at hand.

Keywords: Pattern Recognition, Classiﬁcation, Supervised Learning, Neural Networks, Adaptive Res-

onance Theory (ART), Fuzzy ARTMAP, Match Tracking, Particle Swarm Optimisation, Character

Recognition, NIST SD19.

*Corresponding author: École de technologie supérieure, 1100 Notre-Dame Ouest, Montreal, Quebec, H3C 1K3, Canada, email: eric.granger@etsmtl.ca, phone: 1-514-396-8650, fax: 1-514-396-8595.


1. Introduction

The fuzzy ARTMAP neural network architecture is capable of self-organizing stable recog-

nition categories in response to arbitrary sequences of analog or binary input patterns. It can

perform fast, stable, on-line, unsupervised or supervised, incremental learning, classiﬁca-

tion, and prediction 6,7. As such, it has been successfully applied in complex real-world

pattern recognition tasks such as the recognition of radar signals 15,31, multi-sensor im-

age fusion, remote sensing and data mining 9,30,34,38, recognition of handwritten charac-

ters 3,13,22, and signature veriﬁcation 28.

A drawback of fuzzy ARTMAP is its inability to learn decision boundaries between class
distributions that consistently yield low generalization error for a wide variety of pattern

recognition problems. For instance, when trained for automatic classiﬁcation of handwrit-

ten characters, fuzzy ARTMAP cannot achieve a level of performance that is competitive

with some other commonly-used models 16 a. In the context of batch supervised learning

of a ﬁnite training set, the main factors affecting fuzzy ARTMAP’s capacity to generalize

are:

(1) internal dynamics of network: prototype choice and class prediction functions, learning

rule, match tracking process, hyper-parameter values, and representation of categories

with hyper-rectangles.

(2) learning process: supervised learning strategy (and thus, the number of training

epochs), proportion of patterns in the training subset to those in validation and test

subsets, user-deﬁned hyper-parameter values, data normalisation technique, sequential

gradient-based learning, and data presentation order.

(3) data set structure: overlap and dispersion of patterns, etc., and therefore of the geometry

of decision boundaries among patterns belonging to different recognition classes.

Several ARTMAP networks have been proposed to reﬁne the decision boundaries cre-

ated by fuzzy ARTMAP. For instance, many variants attempt to improve the accuracy

of fuzzy ARTMAP predictions by providing for probabilistic (density based) predic-

tions 10,14,24,35,37,39.

When learning data from complex real-world environments, fuzzy ARTMAP is known

to suffer from overtraining, often referred to in the literature as the category proliferation problem. It occurs when the training data set contains overlapping class distributions 18,21,23. Increasing the amount of training data then requires significantly more internal category neurons, and therefore greater computational complexity, while yielding a higher generalisation error. The

category proliferation problem is directly connected to the match tracking (MT) process

of fuzzy ARTMAP. During fuzzy ARTMAP training, when a mismatch occurs between

predicted and desired output responses, MT allows the selection of alternate category neurons.

a In handwritten character recognition, statistical classifiers (e.g., linear and quadratic discriminant functions, Gaus-

sian mixture classiﬁer, and k-Nearest-Neighbor (kNN)), neural networks (e.g., Multi-Layer Perceptron (MLP), the

Radial Basis Function (RBF) network), and Support Vector Machines (SVM) are commonly used for classiﬁca-

tion due to their learning ﬂexibility and inexpensive computation 25. Such recognition problems typically exhibit

complex decision boundaries, with moderate overlap between character classes.


The match tracking process is parameterized by hyper-parameter ε, and was originally

introduced as a small positive value 10. In the fuzzy ARTMAP literature, this parameter is commonly set to a value (ε = 0+) that minimizes network resources. Such a choice may however contribute to overtraining, and significantly degrade the capacity to generalize. As

a result, some authors have studied the impact on performance of removing MT altogether, and conclude that the usefulness of MT is questionable 2,26. However, training

without MT may lead to a network with a greater number of internal categories, and possi-

bly a higher generalization error.

In an extreme case, a well known convergence problem occurs when learning inconsis-

tent cases – identical training subset patterns that belong to different classes 10. The con-

sequence is a failure to converge, as identical prototypes linked to these inconsistent cases

proliferate. This anomalous situation is a result of the original match tracking process.

This convergence problem may be circumvented by using the feature of ARTMAP-IC 10

called negative match tracking (i.e., setting ε = 0− after mismatch reset). This allows fuzzy

ARTMAP training to converge and ﬁnd solutions with fewer internal categories, but may

however lead to a higher generalization error.

In this paper, the impact on fuzzy ARTMAP performance of training with different MT

strategies – the original positive MT (MT+), negative MT (MT-) and without MT (WMT)

– is assessed empirically. As an alternative, a Particle Swarm Optimization (PSO)-based approach called PSO(MT) is used to optimize the value of the MT hyper-parameter ε during fuzzy ARTMAP training, such that the generalization error is minimized. The architecture, weights, and MT parameter are in effect selected to minimize generalisation error by virtue of ARTMAP training, which allows the network architecture (i.e., the number of category neurons) to grow with the problem's complexity. An experimental protocol

has been deﬁned such that the generalization error and resource requirements of fuzzy

ARTMAP trained with different MT strategies may be compared using different types of

pattern recognition problems. The ﬁrst two types consist of synthetic data with overlapping

class distributions, and with complex decision boundaries but no overlap, respectively. The

third type consists of real-world data - handwritten numerical characters extracted from the

NIST SD19.

In the next section, the MT strategies for fuzzy ARTMAP training are brieﬂy reviewed.

Section III presents the experimental methodology, e.g., protocol, data sets and perfor-

mance measures employed for proof-of-concept computer simulations. Section IV presents and discusses experimental results obtained with synthetic and NIST SD19 data.

2. Fuzzy ARTMAP Match Tracking

2.1. The fuzzy ARTMAP neural network:

ARTMAP refers to a family of neural network architectures based on Adaptive Resonance

Theory (ART) 4 that is capable of fast, stable, on-line, unsupervised or supervised, incremental learning, classification, and prediction 6,7. ARTMAP is often applied using the

simpliﬁed version shown in Figure 1. It is obtained by combining an ART unsupervised

neural network 4 with a map field. The ARTMAP architecture called fuzzy ARTMAP 7 can process both analog and binary-valued input patterns by employing fuzzy ART 5 as the ART network.

Fig. 1. An ARTMAP neural network architecture specialized for pattern classification.

The fuzzy ART neural network consists of two fully connected layers of nodes: an M-node input layer, F1, and an N-node competitive layer, F2. A set of real-valued weights W = {w_ij ∈ [0,1] : i = 1,2,...,M; j = 1,2,...,N} is associated with the F1-to-F2 layer connections. Each F2 node j represents a recognition category that learns a prototype vector w_j = (w_1j, w_2j, ..., w_Mj). The F2 layer of fuzzy ART is connected, through learned associative links, to an L-node map field F^ab, where L is the number of classes in the output space. A set of binary weights W^ab = {w^ab_jk ∈ {0,1} : j = 1,2,...,N; k = 1,2,...,L} is associated with the F2-to-F^ab connections. The vector w^ab_j = (w^ab_j1, w^ab_j2, ..., w^ab_jL) links F2 node j to one of the L output classes.

2.2. Algorithm for supervised learning of fuzzy ARTMAP:

In batch supervised training mode, ARTMAP classifiers learn an arbitrary mapping between training set patterns a = (a_1, a_2, ..., a_m) and their corresponding binary supervision patterns t = (t_1, t_2, ..., t_L). These patterns are coded to have unit value t_K = 1 if K is the target class label for a, and zero elsewhere. The following algorithm describes fuzzy ARTMAP learning:

(1) Initialisation: Initially, all the F2 nodes are uncommitted, all weight values w_ij are initialized to 1, and all weight values w^ab_jk are set to 0. An F2 node becomes committed when it is selected to code an input vector a, and is then linked to an F^ab node. Values of the learning rate β ∈ [0,1], the choice α > 0, the match tracking 0 < ε ≪ 1, and the baseline vigilance ρ̄ ∈ [0,1] hyper-parameters are set.


(2) Input pattern coding: When a training pair (a, t) is presented to the network, a undergoes a transformation called complement coding, which doubles its number of components. The complement-coded input pattern has M = 2m dimensions and is defined by A = (a, a^c) = (a_1, a_2, ..., a_m; a^c_1, a^c_2, ..., a^c_m), where a^c_i = (1 − a_i), and a_i ∈ [0,1]. The vigilance parameter ρ is reset to its baseline value ρ̄.

(3) Prototype selection: Pattern A activates layer F1 and is propagated through weighted connections W to layer F2. Activation of each node j in the F2 layer is determined by the Weber law choice function:

T_j(A) = |A ∧ w_j| / (α + |w_j|)   (1)

where |·| is the L1 norm operator defined by |w_j| ≡ Σ_{i=1}^{M} |w_ij|, ∧ is the fuzzy AND operator, (A ∧ w_j)_i ≡ min(A_i, w_ij), and α is the user-defined choice parameter. The F2 layer produces a binary, winner-take-all pattern of activity y = (y_1, y_2, ..., y_N) such that only the node j = J with the greatest activation value J = argmax{T_j : j = 1,2,...,N} remains active; thus y_J = 1 and y_j = 0 for j ≠ J. If more than one T_j is maximal, the node j with the smallest index is chosen. Node J propagates its top-down expectation, or prototype vector w_J, back onto F1 and the vigilance test is performed. This test compares the degree of match between w_J and A against the dimensionless vigilance parameter ρ ∈ [0,1]:

|A ∧ w_J| / |A| = |A ∧ w_J| / M ≥ ρ.   (2)

If the test is passed, then node J remains active and resonance is said to occur. Otherwise, the network inhibits the active F2 node (i.e., T_J is set to 0 until the network is presented with the next training pair (a, t)) and searches for another node J that passes the vigilance test. If such a node does not exist, an uncommitted F2 node becomes active and undergoes learning (Step 5). The depth of search attained before an uncommitted node is selected is determined by the choice parameter α.

(4) Class prediction: Pattern t is fed directly to the map field F^ab, while the F2 category y learns to activate the map field via associative weights W^ab. The F^ab layer produces a binary pattern of activity y^ab = (y^ab_1, y^ab_2, ..., y^ab_L) = t ∧ w^ab_J in which the most active F^ab node K = argmax{y^ab_k : k = 1,2,...,L} yields the class prediction (K = k(J)). If node K constitutes an incorrect class prediction, then a match tracking (MT) signal raises the vigilance parameter ρ such that:

ρ = |A ∧ w_J| / M + ε   (3)

where ε = 0+, to induce another search among F2 nodes in Step 3. This search continues until either an uncommitted F2 node becomes active (and learning directly ensues in Step 5), or a node J that has previously learned the correct class prediction K becomes active.

(5) Learning: Learning input a involves updating prototype vector w_J, and, if J corresponds to a newly-committed node, creating an associative link to F^ab. The prototype vector of F2 node J is updated according to:

w'_J = β(A ∧ w_J) + (1 − β)w_J   (4)

where β is a fixed learning rate parameter. The algorithm can be set to slow learning with 0 < β < 1, or to fast learning with β = 1. With complement coding and fast learning, fuzzy ARTMAP represents category j as an m-dimensional hyperrectangle R_j that is just large enough to enclose the cluster of training set patterns a to which it has been assigned. That is, an M-dimensional prototype vector w_j records the largest and smallest component values of training subset patterns a assigned to category j. The vigilance test limits the growth of hyperrectangles – a ρ close to 1 yields small hyperrectangles, while a ρ close to 0 allows large hyperrectangles. A new association between F2 node J and F^ab node K (k(J) = K) is learned by setting w^ab_Jk = 1 for k = K, where K is the target class label for a, and 0 otherwise. The next training subset pair (a, t) is presented to the network in Step 2. A minimal sketch of one such pattern presentation is given after this list.
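To make Steps 2 to 5 concrete, the following minimal sketch (in Python, with NumPy) implements one training-pattern presentation under fast learning (β = 1). The function name, the flat array layout, and the use of a simple label vector in place of the binary map field W^ab are illustrative assumptions, not the authors' implementation.

import numpy as np

def present_pattern(W, labels, a, K, alpha=0.001, rho_bar=0.0, eps=0.001):
    # Step 2: complement coding doubles the input to M = 2m components
    A = np.concatenate([a, 1.0 - a])
    M = A.size
    rho = rho_bar                        # vigilance reset to its baseline value
    inhibited = np.zeros(len(W), dtype=bool)
    while len(W) > 0 and not inhibited.all():
        # Step 3: Weber law choice function, Eq. (1), winner-take-all selection
        T = np.array([np.minimum(A, w).sum() / (alpha + w.sum()) for w in W])
        T[inhibited] = -np.inf
        J = int(np.argmax(T))
        match = np.minimum(A, W[J]).sum() / M   # left-hand side of Eq. (2)
        if match < rho:                  # vigilance test failed: inhibit, search on
            inhibited[J] = True
        elif labels[J] == K:             # Step 4: correct class prediction
            W[J] = np.minimum(A, W[J])   # Step 5: fast learning, Eq. (4) with beta = 1
            return W, labels
        else:                            # mismatch: match tracking raises rho, Eq. (3)
            rho = match + eps
            inhibited[J] = True
    # No committed node qualifies: commit a new F2 node that codes A
    W = np.vstack([W, A]) if len(W) > 0 else A[np.newaxis, :]
    labels = np.append(labels, K)
    return W, labels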

Network training proceeds from one epoch to the next, and is halted for validation after

each epoch b. Given a ﬁnite training data set, batch supervised learning ends after the epoch

for which the generalisation error is minimized on an independent validation data set. With

the large data sets considered in this paper, learning through this hold-out validation (HV)

is an appropriate validation strategy. If data were limited, k-fold cross-validation would be

a more suitable strategy, at the expense of some estimation bias due to crossing 18,33.
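A minimal sketch of this hold-out validation loop follows; the network object and its train_one_epoch, error, and snapshot methods are hypothetical stand-ins for the fuzzy ARTMAP routines above, not names from the authors' code.

def train_with_holdout(net, train_set, val_set, max_epochs=1000):
    # Batch supervised learning ends after the epoch whose generalisation
    # error is minimized on the independent validation set.
    best_err, best_state = float("inf"), None
    for epoch in range(max_epochs):
        net.train_one_epoch(train_set)   # one complete presentation of the training set
        err = net.error(val_set)         # validation error after this epoch
        if err < best_err:
            best_err, best_state = err, net.snapshot()
    return best_state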

Once the weights W and W^ab have been found through this process, ARTMAP can predict a class label for an input pattern by performing Steps 2, 3 and 4 without any vigilance or match tests. During testing, a pattern a that activates node J is predicted to belong to class K = k(J). The time complexity required to process one input pattern, during either the training or testing phase, is O(MN).

2.3. Match tracking strategies:

During training, when a mismatch occurs between a predicted response y^ab and a desired response t for an input pattern a, the original positive MT process (MT+) of fuzzy ARTMAP raises the internal vigilance parameter to ρ = |A ∧ w_J|/M + ε in order to induce another search among F2 category nodes. MT+ is parameterized by the MT hyper-parameter ε, which was introduced as a small positive value, 0 < ε ≪ 1 7.

It is well documented that training fuzzy ARTMAP with data from overlapping class

distributions may lead to category proliferation, and that this problem is connected to the

MT process. In this case, increasing the amount of training data requires signiﬁcantly more

resources (i.e., the number of internal category neurons, thus memory space and computa-

tional complexity), yet yields a higher generalisation error 18,21,23. In addition, the MT parameter is commonly set to the value ε = +0.001 in the fuzzy ARTMAP literature to minimize network resources 10. Such a choice may however play a significant role in category

proliferation, and considerably degrade the capacity to generalize.

b An epoch is defined as one complete presentation of all the patterns of the training set.


Consequently, some authors have challenged the need for an MT process 1,26. Training without MT (WMT) implies creating a new category each time that a predictive response y^ab does not match a desired response t. Note that training fuzzy ARTMAP WMT is equivalent to performing MT but setting ε = 1. Training WMT may however lead to a network with a

greater number of internal categories, and possibly a higher generalization error.

In an extreme case, a convergence problem occurs whenever the training set contains

identical patterns that belong to different recognition classes 10. The effect is a proliferation of identical

prototypes associated with the inconsistent cases, and a failure to converge. Consider for example that on the first training epoch, fuzzy ARTMAP learns two completely overlapping, minimum-sized prototypes, w_A.1 (linked to class A) and w_B.1 (linked to class B), for two identical input patterns, a_1 and a_2. In a subsequent epoch, w_A.1 is initially selected to learn a_2, since T_A.1 = T_B.1 ≈ 1, and w_A.1 was created prior to w_B.1 (index A.1 is smaller than B.1). Since w_A.1 is not linked to class B, mismatch reset raises the vigilance parameter ρ to (|A_2 ∧ w_A.1|/M) + ε, where |A_2 ∧ w_A.1| = |A_2 ∧ w_B.1|. As a result, w_B.1 can no longer pass the vigilance test required to become selected for a_2, and fuzzy ARTMAP must create another minimum-sized prototype w_B.2 = w_B.1. From epoch to epoch, the same phenomenon repeats itself, yielding ever more prototypes w_B.n = w_B.1 for n = 3, 4, ...

ARTMAP-IC 10 is an extension of fuzzy ARTMAP that produces a binary winner-take-all pattern y when training, but uses distributed activation of coded F2 nodes when testing. It further extends fuzzy ARTMAP in two ways. First, it biases distributed test set predictions according to the number of times F2 nodes are assigned to training set patterns. Second, it uses a negative MT process (MT-) to address the problem of inconsistent cases, whereby identical training set patterns correspond to different class labels.

With negative MT (MT-), ρ is also initially raised after mismatch reset, but is allowed to decay slightly before a different node J is selected. To this end, the MT parameter is set to a small negative value, ε ≤ 0 (typically ε = −0.001), which allows identical inputs that predict different classes to establish distinct recognition categories. In the example above, mismatch reset raises ρ, but w_B.1 would still pass the vigilance test. This allows the network to learn fully overlapping prototypes for training set patterns that belong to different classes.

In some applications, incorporating the MT- feature of ARTMAP-IC into fuzzy ARTMAP may be essential to avoid the convergence problem observed with the original MT+. Training fuzzy ARTMAP with MT- would thereby find solutions with fewer internal categories, but may nonetheless lead to a higher generalization error.
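As a simple illustration of how the three fixed strategies differ, the snippet below evaluates the vigilance raised by Equation (3) after a mismatch reset, for a hypothetical match value of 0.8:

# Vigilance raised after a mismatch reset, Eq. (3): rho = |A ∧ w_J|/M + eps
def raised_vigilance(match, eps):
    return match + eps

for name, eps in [("MT+", +0.001), ("MT-", -0.001), ("WMT (as eps = 1)", +1.0)]:
    # MT-: rho falls just below the match value, so an identical prototype
    # linked to the correct class can still pass the vigilance test.
    # WMT: rho exceeds any attainable match, forcing a new category instead.
    print(name, raised_vigilance(0.8, eps))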

An alternate approach consists in optimizing the MT hyper-parameter during batch su-

pervised learning of a fuzzy ARTMAP neural network. In effect, both network (weights and

architecture) and εvalues are co-optimized for a given problem, using the same cost func-

tion. The next subsection presents a Particle Swarm Optimization (PSO)-based approach called PSO(MT) that automatically selects a value (magnitude and polarity) of ε during fuzzy ARTMAP training such that the generalization error is minimized. This approach is based on the PSO training strategy proposed in 16, but focused only on the one-dimensional optimization space of ε ∈ [−1,1].

Fig. 2. PSO update of a particle's position s^q to s^{q+1} in a 2-dimensional space during iteration q+1.

2.4. Particle Swarm Optimisation (PSO) of the match tracking parameter

PSO is a population-based stochastic optimization technique that was inspired by social

behavior of bird ﬂocking or ﬁsh schooling 19. It shares many similarities with evolutionary

computation techniques such as genetic algorithms (GAs), yet has no evolution operators

such as crossover and mutation. PSO belongs to the class of evolutionary algorithm tech-

niques that does not utilize the “survival of the ﬁttest” concept, nor a direct selection func-

tion. A solution with lower ﬁtness values can therefore survive during the optimization and

potentially visit any point of the search space 12. Finally, while GAs were conceived to deal

with binary coding, PSO was designed, and has proved very effective, for solving real-valued global optimization problems, which makes it suitable for this study.

With PSO, each particle corresponds to a single solution in the search space, and the population of particles is called a swarm. All particles are assigned position values, which are evaluated according to the fitness function being optimized, and velocity values, which direct their movement. Particles move through the search space by following the particles with the best fitness. Assuming a d-dimensional search space, the position of particle i in a P-particle swarm is represented by a d-dimensional vector s_i = (s_i1, s_i2, ..., s_id), for i = 1,2,...,P. The velocity of this particle is denoted by vector v_i = (v_i1, v_i2, ..., v_id), while the best previously-visited position of this particle is denoted as p_i = (p_i1, p_i2, ..., p_id). For each new iteration q+1, the velocity and position of particle i are updated according to:

v_i^{q+1} = w^q v_i^q + c_1 r_1 (p_i^q − s_i^q) + c_2 r_2 (p_g^q − s_i^q)   (5)

s_i^{q+1} = s_i^q + v_i^{q+1}   (6)

where p_g represents the global best particle position in the swarm, w^q is the particle inertia weight, c_1 and c_2 are two positive constants called cognitive and social parameters, respectively, and r_1 and r_2 are random numbers uniformly distributed in the range [0,1].

The role of w^q in Equation (5) is to regulate the trade-off between exploration and exploitation. A large inertia weight facilitates global search (exploration), while a small one tends to facilitate fine-tuning of the current search area (exploitation). This is why inertia


Algorithm 1: PSO learning strategy for fuzzy ARTMAP.

A. Initialization:
   set the maximum number of iterations qmax and/or the fitness objective E*
   set PSO parameters P, vmax, w^0, c_1, c_2, r_1 and r_2
   initialize particle positions at random such that p_g^0, s_i^0 and p_i^0 ∈ [−1,1]^d, for i = 1,2,...,P
   initialize particle velocities at random such that 0 ≤ v_i^0 ≤ vmax, for i = 1,2,...,P

B. Iterative process:
   set iteration counter q = 0
   while q ≤ qmax or E(p_g^q) ≥ E* do
      for i = 1,2,...,P do
         train fuzzy ARTMAP using hold-out validation and s_i^q
         compute fitness value E(s_i^q) of the resulting network
         if E(s_i^q) < E(p_i^q) then
            update the particle's best personal position: p_i^q = s_i^q
         end
      end
      select the particle with the best global fitness: g = argmin{E(s_i^q) : i = 1,2,...,P}
      for i = 1,2,...,P do
         update velocity: v_i^{q+1} = w^q v_i^q + c_1 r_1 (p_i^q − s_i^q) + c_2 r_2 (p_g^q − s_i^q)
         update position: s_i^{q+1} = s_i^q + v_i^{q+1}
      end
      q = q + 1
      update the particle inertia w^q
   end

weight values are defined by some monotonically decreasing function of q. Proper fine-tuning of c_1 and c_2 may result in faster convergence of the algorithm and alleviation of local minima. Kennedy and Eberhart propose that the cognitive and social scaling parameters be selected such that c_1 = c_2 = 2 20. Finally, the parameters r_1 and r_2 are used to maintain the diversity of the population. Figure 2 depicts the update by PSO of a particle's position from s_i^q to s_i^{q+1}.

Algorithm 1 shows the pseudo-code of a PSO learning strategy specialized for supervised training of fuzzy ARTMAP neural networks. It essentially seeks to minimize fuzzy ARTMAP generalisation error E(s_i^q) in the d-dimensional space of hyper-parameter values. For enhanced computational throughput and global search capabilities, Algorithm 1 is inspired by the synchronous parallel version of PSO 32. It utilizes a basic type of neighborhood called global best or gbest, which is based on a sociometric principle that conceptually connects all the members of the swarm to one another. Accordingly, each particle is influenced by the very best performance of any member of the entire swarm. Exchange of information only takes place between the particle's own experience (the location of its personal best p_i^q, lbest) and the experience of the best particle in the swarm (the location of the global best p_g^q, gbest).

The PSO(MT) approach is obtained by setting d = 1, and the particle positions to MT parameter values, s_i^q = ε_i^q. Measurement of any fitness value E(s_i^q) in this algorithm involves computing the generalisation error on a validation subset for the fuzzy ARTMAP network which has been trained using the MT hyper-parameter value at particle position ε_i^q. When selecting p_i^q or p_g^q, if the two fitness values being compared are equal, then the particle/network requiring fewer F2 category nodes is chosen. The same training and validation sets are used throughout this process. Following the last iteration of Algorithm 1, the overall generalisation error is computed on a test set for the network corresponding to particle position p_g^q.

3. Experimental Methodology

In order to observe the effects of MT strategies from the perspective of different data structures, several data sets were selected for computer simulations. Four synthetic data sets are representative of pattern recognition problems that involve either (1) simple decision boundaries with overlapping class distributions, or (2) complex decision boundaries, where class distributions do not overlap on the decision boundaries. A set of handwritten numerical characters from the NIST SD19 database is representative of complex real-world pattern recognition problems. Prior to a simulation trial, these data sets were normalized according to the min-max technique, and partitioned into three parts – training, validation, and test subsets.
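The min-max technique referred to above rescales each feature to [0,1]; a minimal sketch follows (in practice the minima and maxima would be estimated once and reused across the three subsets):

import numpy as np

def min_max_normalize(X, lo=None, hi=None):
    # Per-feature (column-wise) rescaling of X to the range [0,1]
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return (X - lo) / span, lo, hi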

During each simulation trial, the performance of fuzzy ARTMAP is compared from the perspective of different training subset sizes and match tracking strategies. In order to assess the effect on performance of training subset size, the number of training subset patterns used for supervised learning was progressively increased, while the corresponding validation and test subsets were held fixed. The performance is compared for fuzzy ARTMAP neural networks trained according to four different MT strategies: MT+ (ε = 0.001), MT- (ε = −0.001), WMT (equivalent to setting ε = 1) and PSO(MT). Training is performed by setting the other three hyper-parameters such that the resources (number of categories, training epochs, etc.) are minimized: α = 0.001, β = 1 and ρ = 0. In all cases, training is performed using the HV strategy 33 described in Subsection 2.2.

The PSO(MT) strategy also uses the hold-out validation technique on the fuzzy ARTMAP network to calculate the fitness of each particle, and therefore to find the network and ε value that minimize generalization error. Other fuzzy ARTMAP hyper-parameters are left unchanged. In all simulations involving PSO, the search space of the MT parameter was set to the range ε ∈ [−1,1]. Each simulation trial was performed with P = 15 particles, and ended after a maximum of qmax = 100 iterations (although none of our simulations ever attained that limit). A fitness objective E* was not used to end training, but a trial was ended if the global best fitness E(p_g^q) remained constant for 10 consecutive iterations. The initial position s_1^0 of one particle was set according to MT- (ε = −0.001). All the remaining particle vectors were initialized randomly, according to a uniform distribution over the search space. The PSO parameters were set as follows: c_1 = c_2 = 2; r_1 and r_2 were random numbers uniformly distributed in [0,1]; w^q was decreased linearly from 0.9 to 0.4 over the qmax iterations; the maximum velocity vmax was set to 0.2. At the end of a trial, the fuzzy ARTMAP network with the best global fitness value p_g^q was retained.


Independent trials were repeated 4 times^c with different initializations of particle vectors, and the network with the best p_g^q of the four was retained.

Since fuzzy ARTMAP performance is sensitive to the presentation order of the train-

ing data, each simulation trial was repeated 10 times with either 10 different randomly

generated data sets (synthetic data), or 10 different randomly selected data presentation

orders (NIST SD19 data). The average performance of fuzzy ARTMAP was assessed in

terms of resources required during training, and its generalisation error on the test sets. The

amount of resources required during training is measured by compression and convergence

time. Compression refers to the average number of training patterns per category prototype

created in the F2 layer. Convergence time is the number of epochs required to complete

learning for a learning strategy. It does not include presentations of the validation subset

used to perform hold-out validation. Generalisation error is estimated as the ratio of in-

correctly classiﬁed test subset patterns over all test set patterns. Given that compression

indicates the number of F2 nodes, the combination of compression and convergence time

provides useful insight into the amount of processing required by fuzzy ARTMAP during

training to produce its best asymptotic generalisation error. Average results, with corre-

sponding standard error, are always obtained as a result of the 10 independent simulation

trials.
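Under these definitions, both resource measures and the error estimate reduce to simple ratios; the helpers below are only a restatement of the text in code form:

def compression(n_training_patterns, n_f2_nodes):
    # Average number of training patterns per F2 category prototype
    return n_training_patterns / n_f2_nodes

def generalisation_error(n_misclassified, n_test_patterns):
    # Ratio of incorrectly classified test patterns over all test patterns
    return n_misclassified / n_test_patterns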

The Quadratic Bayes classiﬁer (CQB) and k-Nearest-Neighbour with Euclidean dis-

tance (kNN) classiﬁer were included for reference with generalisation error results. These

are classic parametric and non-parametric classiﬁcation techniques from statistical pattern

recognition, which are immune to the effects of overtraining. For each computer simula-

tion, the value of k employed with kNN was selected among k = 1, 3, 5, 7, and 9, using

hold-out validation. The rest of this section gives some additional details on the synthetic

and real data sets employed during computer simulations.
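Selecting k by hold-out validation, as done for the kNN reference classifier, can be sketched with scikit-learn (an assumption; the paper does not state which implementation was used):

from sklearn.neighbors import KNeighborsClassifier

def select_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9)):
    # Return the k with the highest accuracy on the validation subset
    best_k, best_acc = None, -1.0
    for k in candidates:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = knn.score(X_val, y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k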

3.1. Synthetic data sets:

All four synthetic data sets described below are composed of a total of 30,000 randomly-

generated patterns, with 10,000 patterns for the training, validation, and test subsets. They

correspond to two-class problems with a two-dimensional input feature space. Each data subset

is composed of an equal number of 5,000 patterns per class. In addition, the area occupied

by each class is equal. During simulation trials, the number of training subset patterns used

for supervised learning was progressively increased from 10 to 10,000 patterns according

to a logarithmic rule: 5, 6, 8, 10, 12, 16, 20, 26, 33, 42, 54, 68, 87, 110, 140, 178, 226, 286,

363, 461, 586, 743, 943, 1197, 1519, 1928, 2446, 3105, 3940, 5000 patterns per class. This

corresponds to 30 different simulation trials over the entire 10,000 pattern training subset.

These data sets have been selected to facilitate the observation of fuzzy ARTMAP be-

havior on different tractable problems. Of the four sets, two have simple linear decision

boundaries with overlapping class distributions, Dµ(ξtot) and Dσ(ξtot), and two have complex non-linear decision boundaries without overlap, DCIS and DP2.

c From a previous study with our data sets, it was determined that performing 4 independent trials of the PSO learning strategy with only 15 particles leads to better optimization results than performing 1 trial with 60 particles.

Fig. 3. Representation of the synthetic data sets used for computer simulations: (a) Dµ(ξtot), (b) Dσ(ξtot), (c) DCIS, (d) DP2.

The total theoretical probability of error associated with Dµ and Dσ is denoted by ξtot. Note that with DCIS and DP2, the decision boundaries between class distributions are longer, and fewer training patterns are available in the neighborhood of these boundaries than with Dµ(ξtot) and Dσ(ξtot). In addition, note that the total theoretical probability of error with DCIS and DP2 is 0, since class distributions do not overlap on decision boundaries. The four synthetic data sets are now described:

Dµ(ξtot): As represented in Figure 3(a), this data consists of two classes, each one defined by a multivariate normal distribution in a two-dimensional input feature space. It is assumed that the data is randomly generated by sources with the same Gaussian noise. Both sources are described by variables that are independent and have equal variance σ², therefore the distributions are hyperspherical. In fact, Dµ(ξtot) refers to 13 data sets, where the degree of overlap, and thus the total probability of error between classes, differs for each set. The degree of overlap is varied from a total probability of error of ξtot = 1% to ξtot = 25%, with 2% increments, by adjusting the mean vector µ2 of class 2.

Dσ(ξtot): As represented in Figure 3(b), this data is identical to Dµ(ξtot), except that the degree of overlap between classes is varied by adjusting the variance σ₂² of both classes. Note that for a same degree of overlap, Dσ(ξtot) data sets have a larger overlap boundary than Dµ(ξtot), yet they are not as dense.

DCIS: As represented in Figure 3(c), the Circle-in-Square problem 6 requires a classifier to identify the points of a square that lie inside a circle, and those that lie outside the circle. The circle's area equals half that of the square. It consists of one non-linear decision boundary where classes do not overlap.

DP2: As represented in Figure 3(d), each decision region of the DP2 problem is delimited by one or more of the four following polynomial and trigonometric functions:

f1(x) = 2 sin(x) + 5   (7)

f2(x) = (x − 2)² + 1   (8)

f3(x) = −0.1x² + 0.6 sin(4x) + 8   (9)

f4(x) = (x − 10)²/2 + 7.902   (10)

and belongs to one of the two classes, indicated by the Roman numerals I and II 36. It consists of four non-linear boundaries, and class definitions do not overlap. Note that equation f4(x) was slightly modified from the original equation such that the area occupied by each class is approximately equal.
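For reference, Equations (7) to (10) translate directly into code; the input range of x and the class-labeling convention are not restated here, since the text does not specify them:

import numpy as np

# The four DP2 boundary functions, Eqs. (7)-(10)
def f1(x): return 2.0 * np.sin(x) + 5.0
def f2(x): return (x - 2.0) ** 2 + 1.0
def f3(x): return -0.1 * x ** 2 + 0.6 * np.sin(4.0 * x) + 8.0
def f4(x): return (x - 10.0) ** 2 / 2.0 + 7.902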

3.2. NIST Special Database 19 (SD19):

Automatic reading of numerical ﬁelds has been attempted in several domains of application

such as bank cheque processing, postal code recognition, and form processing. Such appli-

cations have been very popular in handwriting recognition research, due to the availability

of relatively inexpensive CPU power, and to the possibility of considerably reducing the

manual effort involved in these tasks 29.

The NIST SD19 17 data set has been selected due to the great variability and difﬁculty of

such handwriting recognition problems (see Figure 4). It consists of images of handwritten

sample forms (hsf) organized into eight series, hsf-{0,1,2,3,4,6,7,8}. SD19 is divided into 3 sections, which contain samples representing isolated handwritten digits ('0', '1', ..., '9')

extracted from hsf-{0123}, hsf-7 and hsf-4.

For our simulations, the data in hsf-{0123} has been further divided into a training subset (150,000 samples), validation subset 1 (15,000 samples), validation subset 2 (15,000 samples) and validation subset 3 (15,000 samples). The training and validation subsets contain an equal number of samples per class. All 60,089 samples in hsf-7 have been used as a standard test subset. The distribution of samples per class in the test set is approximately equal.

Fig. 4. Examples in the NIST SD19 data of: (a) a handwriting sample form, and (b) some images of handwritten digits extracted from the forms.

The set of features extracted from samples is a mixture of concavity, contour, and surface characteristics 29. Accordingly, 78 features are used to describe concavity, 48 features are

used to describe contour, and 6 features are used to describe surface. Each sample is therefore composed of 132 features, which are normalized between 0 and 1 by summing up the feature values of a sample, and then dividing each one by this sum. With this feature set, the NIST SD19 database exhibits complex decision boundaries, with moderate overlap between digit classes. Some experimental results obtained with Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), and k-NN classifiers are reported in 16.
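The per-sample normalization described above can be sketched as follows, with each row of X holding the 132 raw feature values of one sample:

import numpy as np

def normalize_features(X):
    # Divide each feature value by the sum of its sample's feature values,
    # so that all features fall between 0 and 1
    sums = X.sum(axis=1, keepdims=True)
    return X / np.where(sums > 0, sums, 1.0)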

During simulations, the number of training subset patterns used for supervised learning was progressively increased from 100 to 150,000 patterns, according to a logarithmic rule. These training subsets consist of the first 10, 16, 28, 47, 80, 136, 229, 387, 652, 1100, 1856, 3129, 5276, 8896, and all 15,000 patterns per class.

4. Simulation Results

4.1. Synthetic data with overlapping class distributions:

Figure 5 presents the average performance obtained when fuzzy ARTMAP is trained with the four MT strategies – MT-, MT+, WMT and PSO(MT) – on Dµ(13%). The generalisation errors for the Quadratic Bayes classifier (CQB), as well as the theoretical probability of error (ξtot), are also shown for reference.

Fig. 5. Average performance of fuzzy ARTMAP (with MT+, MT-, WMT and PSO(MT)) versus training subset size for Dµ(ξtot = 13%): (a) generalisation error, (b) compression, (c) convergence time, (d) MT parameter for PSO(MT). Error bars are standard error of the sample mean.

As shown in Figure 5(a), PSO(MT) generally yields the lowest generalisation error

over training set sizes, followed by WMT, MT+, and then MT-. With more than 20 training

patterns per class, the error of both MT- and MT+ algorithms tends to increase in a manner

that is indicative of fuzzy ARTMAP overtraining 18. However, with more than about 500 training patterns per class, the generalization error for MT- grows more rapidly with the training set size than for MT+, WMT and PSO(MT). With a training set of 5000 patterns

per class, a generalization error of about 21.22% is obtained with MT+, 26.17% with MT-,

16.22% with WMT, and 15.26% with PSO(MT). The degradation in performance of MT- is

accompanied by a notably higher compression and a lower convergence time than other MT

strategies. MT- produces networks with fewer but larger categories than other MT strategies

because of the MT polarity. Those large categories contribute to a lower resolution of the

decision boundary, and thus a greater generalization error.

Fig. 6. Average performance of fuzzy ARTMAP (with MT+, MT-, WMT and PSO(MT)) as a function of ξtot for all Dµ(ξtot) data sets: (a) net generalisation error, (b) compression, (c) convergence time. Error bars are standard error of the sample mean.

By training with WMT, the generalization error is significantly lower than with both MT- and MT+, especially with a large amount of training patterns, but the compression is the lowest of all training strategies. Based on the error alone, the effectiveness of the MT algorithm is debatable for overlapping data when compared with MT- and MT+, especially for applications in which resource requirements are not an issue.

By training with PSO(MT), fuzzy ARTMAP yields a significantly lower generalization error than all other strategies, and a compression that falls between that of WMT and MT- or MT+. With a training set of 5000 patterns per class, a compression of about 8.0 is obtained with MT+, 26.4 with MT-, 4.8 with WMT, and 5.3 with PSO(MT). The convergence time is generally comparable with WMT, MT- and MT+. However, PSO(MT) requires a considerable number of training epochs to complete the optimization process. With a training set of 5000 patterns per class, a convergence time of about 8.2 epochs is obtained with MT+, 3.6 with MT-, 12.3 with WMT, and 2534 with PSO(MT).

Empirical results indicate that the MT process of fuzzy ARTMAP has a considerable impact on performance obtained with overlapping data, especially when ε is optimized. As shown in Figure 5(d), when α = 0.001, β = 1 and ρ = 0, and class distributions overlap, the value of ε that minimizes error tends from about 0 towards 0.8 as the training set size grows. Higher ε settings tend to create a growing number of category hyperrectangles close to the boundary between classes. The generalisation error of PSO(MT) tends toward that of WMT on this data set. Furthermore, PSO(MT) and WMT do not show the performance degradation due to overtraining observed with MT+ and MT-.

Very similar tendencies are found in simulation results where fuzzy ARTMAP is trained using the other Dµ(ξtot) and Dσ(ξtot) data sets. However, as ξtot increases, the performance degradation due to training subset size tends to become more pronounced, and occurs for fewer training set patterns. Let us define the net error as the difference between the generalization error obtained by using all the training data (5,000 patterns per class) and the theoretical probability of error ξtot of the database. Figure 6 shows the performance of fuzzy ARTMAP as a function of ξtot for all Dµ(ξtot) data sets. As shown, using PSO(MT) always provides the lowest net error over ξtot values for overlapping data, followed by WMT, MT+ and MT-. Again, MT- obtains the highest compression, whereas PSO(MT) obtains a compression between those of WMT and MT+. The convergence time of PSO(MT) is orders of magnitude longer than that of the other strategies.

Figure 7 presents an example of decision boundaries obtained for Dµ(ξtot = 13%) when fuzzy ARTMAP is trained with 5,000 patterns per class and different MT strategies. For overlapping class distributions, MT- tends to create far fewer F2 nodes (908 categories with 5000 patterns per class) than the other MT strategies because of the polarity of ε. Although it leads to a higher compression, and can resolve inconsistent cases, the larger categories produce a coarse granulation of the decision boundary, and thus a higher generalization error. With PSO(MT) and WMT, the lower error is a consequence of the finer resolution on overlap regions of the decision boundary between classes.

4.2. Synthetic data with complex decision boundaries:

Figure 8 presents the average performance obtained when fuzzy ARTMAP is trained on

DCIS using the four MT strategies – MT-, MT+, WMT and PSO(MT). The generalisation error for the k-NN classifier, as well as the theoretical probability of error, ξtot, are also

shown for reference.

In this case, MT+, MT- and PSO(MT) obtain a similar generalization error across training set sizes, while WMT yields an error that is significantly higher than the other strategies for larger training set sizes. For example, with a training set of 5000 patterns per class, a generalization error of about 1.51% is obtained with MT+, 1.64% with MT-, 4.36% with WMT, and 1.47% with PSO(MT).

Fig. 7. An example of decision boundaries formed by fuzzy ARTMAP in the input space for Dµ(ξtot = 13%). Training is performed (a) with MT+, (b) with MT-, (c) with WMT, and (d) with PSO(MT) on 5,000 training patterns per class. The optimal decision boundary for Dµ(ξtot = 13%) is also shown for reference. Note that virtually no training, validation or test subset patterns are located in the upper-left and lower-right corners of these figures.

Compression of fuzzy ARTMAP grows in a similar way with training set size for MT-, MT+ and PSO(MT). With a training set of 5000 patterns per class, a compression of 107 is obtained with MT+, 108 with MT-, 14 with WMT, and 109 with PSO(MT). WMT does not allow the creation of a network with higher compression because the data structure leads to the creation of many small categories that overlap on the decision boundary between classes. However, WMT requires the fewest training epochs to converge, while PSO(MT) requires a considerable number of epochs. With a training set of 5000 patterns per class, a convergence time of about 18.4 epochs is required with MT+, 14.4 with MT-, 6.6 with WMT, and 4186 with PSO(MT).

Empirical results indicate that the MT process of fuzzy ARTMAP also has a considerable impact on performance obtained on data with complex decision boundaries, especially when ε is optimized.

Fig. 8. Average performance of fuzzy ARTMAP (with MT+, MT-, WMT and PSO(MT)) versus training subset size for DCIS: (a) generalisation error, (b) compression, (c) convergence time, (d) MT parameter for PSO(MT). Error bars are standard error of the sample mean.

As shown in Figure 8(d), when α = 0.001, β = 1 and ρ = 0, and decision boundaries are complex, the value of ε that minimizes error tends from about 0.4 towards 0 as the training set size grows. Lower ε settings tend to create fewer category hyperrectangles close to the boundary between classes. The generalisation error of PSO(MT) tends toward that of MT+ and MT- on this data set.

Similar tendencies are found in simulation results where fuzzy ARTMAP is trained

using the DP2 data set. However, since the decision boundaries are more complex with DP2,

a greater number of training patterns is required for fuzzy ARTMAP to asymptotically reach its minimum generalisation error. Moreover, none of the MT strategies tested on data with non-linear decision boundaries leads to overtraining 18.

Figure 9 presents an example of decision boundaries obtained for DCIS when fuzzy ARTMAP is trained with 5,000 patterns per class and different MT strategies.

Fig. 9. An example of decision boundaries formed by fuzzy ARTMAP in the input space for DCIS. Training is performed (a) with MT+, (b) with MT-, (c) with WMT, and (d) with PSO(MT) on 5,000 training patterns per class. The optimal decision boundary for DCIS is also shown for reference.

For data with complex decision boundaries, training fuzzy ARTMAP WMT yields a higher generalization error since it initially tends to create some large categories, and then compensates by

creating many small categories. This leads to coarse granulation of the decision boundary,

and thus a higher generalization error.

Table 1 shows the average generalisation error obtained with the reference classifiers and the fuzzy ARTMAP neural network using different MT strategies on Dµ(ξtot), DCIS and DP2. Training was performed on 5,000 patterns per class. When using PSO(MT), the generalisation error of fuzzy ARTMAP is always lower than when using MT+, MT- and WMT, but is always significantly higher than that of the Quadratic Bayes and k-NN classifiers. When data contains overlapping class distributions, the values of ε that minimize error tend towards +1. In contrast, when decision boundaries are complex, these ε values tend towards 0.

Table 1. Average generalisation error (%) of reference and fuzzy ARTMAP classifiers using different MT strategies on synthetic data sets. Values in parentheses are standard error of the sample mean; the last column gives the ε value selected by PSO(MT).

Data set   CQB           k-NN          FAM w/ MT+    FAM w/ MT-    FAM w/ WMT    FAM w/ PSO(MT)   → ε
Dµ(1%)     1.00 (0.04)   1.08 (0.03)   1.87 (0.04)   2.31 (0.19)   1.30 (0.03)   1.24 (0.04)      0.61 (0.06)
Dµ(3%)     3.08 (0.05)   3.31 (0.06)   5.44 (0.09)   7.52 (0.16)   3.84 (0.09)   3.66 (0.06)      0.75 (0.05)
Dµ(5%)     4.87 (0.07)   5.26 (0.08)   8.48 (0.13)   11.15 (0.36)  6.01 (0.07)   5.75 (0.08)      0.79 (0.04)
Dµ(7%)     7.00 (0.10)   7.48 (0.11)   11.85 (0.15)  16.05 (0.47)  8.63 (0.20)   8.07 (0.08)      0.73 (0.04)
Dµ(9%)     9.12 (0.08)   9.88 (0.08)   15.01 (0.14)  19.88 (0.74)  11.30 (0.21)  10.62 (0.11)     0.72 (0.02)
Dµ(11%)    11.00 (0.08)  11.81 (0.13)  18.06 (0.18)  23.85 (0.37)  13.29 (0.15)  12.72 (0.12)     0.77 (0.05)
Dµ(13%)    13.16 (0.15)  14.27 (0.18)  21.22 (0.17)  26.17 (0.41)  16.22 (0.19)  15.26 (0.16)     0.74 (0.05)
Dµ(15%)    15.11 (0.15)  16.13 (0.13)  23.69 (0.16)  29.05 (0.48)  18.40 (0.32)  17.42 (0.15)     0.74 (0.04)
Dµ(17%)    16.96 (0.10)  18.39 (0.09)  26.25 (0.16)  31.87 (0.25)  20.49 (0.13)  19.79 (0.33)     0.71 (0.08)
Dµ(19%)    19.25 (0.16)  20.71 (0.16)  29.13 (0.09)  34.19 (0.44)  23.30 (0.26)  22.24 (0.11)     0.79 (0.05)
Dµ(21%)    20.97 (0.13)  22.70 (0.16)  31.63 (0.14)  36.28 (0.34)  25.86 (0.54)  24.35 (0.12)     0.79 (0.05)
Dµ(23%)    22.99 (0.12)  25.04 (0.13)  33.77 (0.21)  38.15 (0.28)  28.40 (0.41)  26.72 (0.19)     0.71 (0.03)
Dµ(25%)    25.11 (0.10)  27.23 (0.12)  36.08 (0.14)  39.52 (0.18)  31.05 (0.40)  29.05 (0.14)     0.72 (0.04)
DCIS       N/A           0.86 (0.03)   1.51 (0.04)   1.64 (0.04)   4.36 (0.43)   1.47 (0.04)      0.01 (0.00)
DP2        N/A           1.65 (0.04)   3.45 (0.19)   4.33 (0.22)   7.13 (0.48)   3.44 (0.06)      0.01 (0.00)

4.3. NIST SD19 data:

Figure 10 presents the average performance obtained when fuzzy ARTMAP is trained on

the NIST SD19 data using the four MT strategies – MT-, MT+, WMT and PSO(MT). The generalisation error for the k-NN classifier is also shown for reference.

As shown in this figure, MT- and MT+ obtain a similar average generalization error across training set sizes. Using a training set of 52760 patterns, a generalization error of about 5.81% is obtained with MT+, 6.02% with MT-, 32.84% with WMT, and 5.57% with PSO(MT). When optimizing the MT parameter with PSO(MT), the generalization error is lower than that of the other MT strategies for small numbers of training patterns, and similar to that of MT- and MT+ for greater numbers of training patterns. WMT is unable to create a fuzzy ARTMAP network with low generalization error on NIST SD19. Since the NIST database possesses complex decision boundaries with a small degree of overlap, WMT cannot generate a good representation of the decision boundaries because it generates too many categories that overlap between classes.

Using all the training data, MT- achieves the highest compression, followed by MT+, PSO(MT) and WMT. However, with small amounts of training patterns, PSO(MT) generates the highest compression. For example, with a training set of 52760 patterns, a compression rate of about 237.4 is obtained with MT+, 281.9 with MT-, 2.7 with WMT, and 141.6 with PSO(MT). WMT obtains the lowest compression rate because it creates many very small categories to define the decision boundaries. With a training set of 52760 patterns, a convergence time of about 15.7 epochs is obtained with MT+, 6.8 with MT-, 1 with WMT, and 381 with PSO(MT). WMT still possesses the fastest convergence time. The low generalization error of PSO(MT) comes with a high convergence time (about 24.3 times higher than that of MT+ with all training patterns).


Fig. 10. Average performance of fuzzy ARTMAP (with MT+, MT-, WMT and PSO(MT)) versus training subset size for the NIST SD19 data set: (a) generalisation error, (b) compression, (c) convergence time, (d) MT parameter for PSO(MT). Error bars are standard error of the sample mean.

As shown in Figure 10(d), when α = 0.001, β = 1 and ρ = 0, and decision boundaries are complex, the value of ε that minimizes error tends from about −0.2 towards 0 as the training set size grows. As with DCIS and DP2, the generalisation error of PSO(MT) tends toward that of MT+ and MT- on this data set. Despite the promising results of training fuzzy ARTMAP with PSO(MT), other pattern classifiers (such as the SVM) have achieved significantly lower generalization error 27,29.

5. Conclusions

A fuzzy ARTMAP neural network applied to complex real-world problems such as hand-

written character recognition may achieve poor performance and encounter a convergence

problem whenever the training set contains very similar or identical patterns that belong to

different classes. In this paper, the impact on fuzzy ARTMAP performance of adopting different MT strategies – the original positive MT (MT+), negative MT (MT-) and training without MT (WMT) – is assessed. As an alternative, the value of the MT parameter is optimized along with the network weights using a Particle Swarm Optimization (PSO)-based strategy called PSO(MT). An experimental protocol has been defined such that the generalization error and resource requirements of fuzzy ARTMAP trained with the different MT strategies may be assessed on several types of synthetic data and on a real-world handwritten numerical character recognition problem.
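For reference, the strategies compared here differ only in how the vigilance parameter ρ is adjusted when the category chosen during training predicts the wrong class. A common formulation from the fuzzy ARTMAP literature (summarized here; not an exact restatement of each variant's algorithm) raises the vigilance according to

\[
\rho \leftarrow \frac{|\mathbf{A} \wedge \mathbf{w}_J|}{|\mathbf{A}|} + \varepsilon,
\]

where A is the (complement-coded) input pattern, w_J is the prototype of the mismatching category J, and ε is the MT parameter: MT+ uses a small positive ε, MT- uses a small negative ε (so the offending category remains eligible for other inputs), and WMT omits this vigilance adjustment altogether.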

Overall, empirical results indicate that using the MT process for batch supervised learn-

ing has a signiﬁcant impact on fuzzy ARTMAP performance. When data is deﬁned by over-

lapping class distributions, training with MT- tends to produce fewer categories than the

other MT strategies, although this advantage coincides with a higher generalization error.

The need for MT+ or MT- is debatable in this case, as WMT yields a significantly lower generalization error on such data. However, PSO(MT) has been shown to create fuzzy ARTMAP networks with a finer resolution along decision bounds, and an even lower error than WMT. In addition, it has been shown to eliminate the degradation of error due to overtraining. To represent overlapping

class distributions with PSO(MT), the lowest errors are obtained for MT parameter val-

ues that tend toward the maximum value (ε=1) as the training set size grows. PSO(MT)

thereby favors the creation of new internal categories to deﬁne decision boundaries.
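As a rough illustration of the PSO(MT) idea, the following minimal sketch optimizes the scalar ε in [-1, 1] with a global-best PSO. The objective val_error is a stand-in for training a fuzzy ARTMAP network with a candidate ε and returning its validation error, and the swarm parameters are common defaults – neither is taken from the experiments reported here.

import random

def val_error(eps: float) -> float:
    # Placeholder objective with a minimum near eps = 0 (illustrative only);
    # in PSO(MT) this would train fuzzy ARTMAP and return validation error.
    return (eps - 0.05) ** 2

def pso_mt(n_particles=10, n_iter=50, w=0.72, c1=1.49, c2=1.49):
    pos = [random.uniform(-1.0, 1.0) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest, pbest_err = pos[:], [val_error(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_err[i])
    gbest, gbest_err = pbest[g], pbest_err[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            # Standard velocity update: inertia + cognitive + social terms.
            vel[i] = (w * vel[i] + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = max(-1.0, min(1.0, pos[i] + vel[i]))  # clamp to [-1, 1]
            err = val_error(pos[i])
            if err < pbest_err[i]:
                pbest[i], pbest_err[i] = pos[i], err
                if err < gbest_err:
                    gbest, gbest_err = pos[i], err
    return gbest, gbest_err

print(pso_mt())  # best eps found and its (stand-in) validation error

Because every fitness evaluation requires training a network to convergence, the loop above makes clear why PSO(MT) needs far more training epochs than a single run with a fixed ε.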

When data is defined by complex decision boundaries, training with PSO(MT) creates decision boundaries that yield the lowest generalization error, followed closely by MT- and then MT+. Training with WMT yields a considerably higher generalization error and lower compression than the other MT strategies, especially for larger training set sizes. To represent complex decision boundaries with PSO(MT), the lowest errors are obtained for MT parameter values that tend toward 0 as the training set size grows.

Finally, with the NIST SD19 data set, when using all training patterns, the generalization error obtained with PSO(MT) is about 0.84% lower than that of MT-, but comes at the expense of lower compression and a convergence time that can be two orders of magnitude greater than with the other strategies. Training with a Multi-Objective PSO (MOPSO)-based strategy, where the cost function accounts for both generalization error and compression, could provide solutions that require fewer internal categories. In addition, lightweight versions of PSO may reduce the convergence time.
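A minimal sketch of the kind of aggregated cost such an extension could use is given below. It is a simple scalarization rather than a full Pareto-based MOPSO, and the weight lam is a hypothetical design choice, not a value from this study.

def mopso_cost(error: float, compression: float, lam: float = 0.01) -> float:
    # Trade off generalization error against network size; lower is better.
    # 1/compression grows with the number of F2 nodes per training pattern.
    return error + lam / compression

print(mopso_cost(error=0.0557, compression=141.6))  # e.g., PSO(MT) on NIST SD19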

In this paper, training fuzzy ARTMAP with PSO(MT) has been shown to produce a significantly lower generalization error than with the other MT strategies. These results always come at the expense of a significantly higher number of training epochs. Nonetheless, results obtained with PSO(MT) underline the importance of optimizing the MT parameter during training for different problems. The MT parameter values found using this strategy vary significantly according to training set size and data set structure, and differ considerably from the popular choice (ε = 0+), especially when data has overlapping class distributions.

Acknowledgements

This research was supported in part by the Natural Sciences and Engineering Research Council of Canada, and le Fonds québécois de la recherche sur la nature et les technologies.


References

1. Anagnostopoulos, G. C., Georgiopoulos, M., Verzi, S. J., and Heileman, G. L., "Boosted Ellipsoid ARTMAP," Proc. SPIE – Applications and Science of Computational Intelligence V, 4739, 74-85, 2002.
2. Anagnostopoulos, G. C., and Georgiopoulos, M., "Putting the Utility of Match Tracking in Fuzzy ARTMAP Training to the Test," Lecture Notes in Computer Science, 2774, 1-6, 2003.
3. Bote-Lorenzo, M. L., Dimitriadis, Y., and Gómez-Sánchez, E., "Automatic extraction of human-recognizable shape and execution prototypes of handwritten characters," Pattern Recognition, 36:7, 1605-1617, 2003.
4. Carpenter, G. A., and Grossberg, S., "A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine," Computer Vision, Graphics, and Image Processing, 37, 54-115, 1987.
5. Carpenter, G. A., Grossberg, S., and Rosen, D. B., "Fuzzy ART: Fast Stable Learning and Categorization of Analog Patterns by an Adaptive Resonance System," Neural Networks, 4:6, 759-771, 1991.
6. Carpenter, G. A., Grossberg, S., and Reynolds, J. H., "ARTMAP: Supervised Real-Time Learning and Classification of Nonstationary Data by a Self-Organizing Neural Network," Neural Networks, 4, 565-588, 1991.
7. Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., and Rosen, D. B., "Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps," IEEE Trans. on Neural Networks, 3:5, 698-713, 1992.
8. Carpenter, G. A., and Ross, W. D., "ART-EMAP: A Neural Network Architecture for Object Recognition by Evidence Accumulation," IEEE Trans. on Neural Networks, 6:4, 805-818, 1995.
9. Carpenter, G. A., Gjaja, M. N., Gopal, S., and Woodcock, C. E., "ART Neural Networks for Remote Sensing: Vegetation Classification from Landsat TM and Terrain Data," IEEE Trans. on Geoscience and Remote Sensing, 35:2, 1997.
10. Carpenter, G. A., and Markuzon, N., "ARTMAP-IC and Medical Diagnosis: Instance Counting and Inconsistent Cases," Neural Networks, 11:2, 323-336, 1998.
11. Carpenter, G. A., Milenova, B. L., and Noeske, B. W., "Distributed ARTMAP: a neural network for fast distributed supervised learning," Neural Networks, 11, 793-813, 1998.

12. Eberhart, R. C., and Shi, Y., "Comparison Between Genetic Algorithms and Particle Swarm Optimization," in Evolutionary Programming VII, V. W. Porto et al., eds., Springer, 611-616, 1998.
13. Gómez-Sánchez, E., Gago-Gonzalez, J. A., Dimitriadis, Y. A., Cano-Izquierdo, J. M., and Lopez-Coronado, J., "Experimental study of a novel neuro-fuzzy system for on-line handwritten UNIPEN digit recognition," Pattern Recognition Letters, 19, 357-364, 1998.
14. Gómez-Sánchez, E., Dimitriadis, Y. A., Cano-Izquierdo, J. M., and Lopez-Coronado, J., "µARTMAP: Use of Mutual Information for Category Reduction in Fuzzy ARTMAP," IEEE Trans. on Neural Networks, 13:1, 58-69, 2002.
15. Granger, E., Rubin, M., Grossberg, S., and Lavoie, P., "A What-and-Where Fusion Neural Network for Recognition and Tracking of Multiple Radar Emitters," Neural Networks, 14, 325-344, 2001.
16. Granger, E., Henniges, P., Sabourin, R., and Oliveira, L. S., "Supervised Learning of Fuzzy ARTMAP Neural Networks Through Particle Swarm Optimization," Journal of Pattern Recognition Research, 2:1, 27-60, 2007.
17. Grother, P. J., "NIST Special Database 19 - Handprinted forms and characters database," National Institute of Standards and Technology (NIST), 1995.
18. Henniges, P., Granger, E., and Sabourin, R., "Factors of Overtraining with Fuzzy ARTMAP Neural Networks," International Joint Conference on Neural Networks 2005, 1075-1080, Montreal, Canada, August 1-4, 2005.
19. Kennedy, J., and Eberhart, R. C., "Particle Swarm Optimization," Proc. Int'l Conference on Neural Networks, 1942-1948, 1995.

20. Kennedy, J., and Eberhart, R. C., Swarm Intelligence, Morgan Kaufmann, 2001.
21. Koufakou, A., Georgiopoulos, M., Anagnostopoulos, G., and Kasparis, T., "Cross-Validation in Fuzzy ARTMAP for Large Databases," Neural Networks, 14, 1279-1291, 2001.
22. Lee, S.-J., and Tsai, H.-L., "Pattern Fusion in Feature Recognition Neural Networks for Handwritten Character Recognition," IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, 28:4, 612-617, 1998.
23. Lerner, B., and Vigdor, B., "An Empirical Study of Fuzzy ARTMAP Applied to Cytogenetics," IEEE Convention of Electrical and Electronics Engineers in Israel, 301-304, 2004.
24. Lim, C. P., and Harrison, R. F., "Modified Fuzzy ARTMAP Approaches for Bayes Optimal Classification Rates: An Empirical Demonstration," Neural Networks, 10:4, 755-774, 1997.
25. Liu, C.-L., Sako, H., and Fujisawa, H., "Performance Evaluation of Pattern Classifiers for Handwritten Character Recognition," Int'l J. on Document Analysis and Recognition, 4, 191-204, 2002.
26. Marriott, S., and Harrison, R. F., "A modified fuzzy ARTMAP architecture for the approximation of noisy mappings," Neural Networks, 8:4, 619-641, 1995.
27. Milgram, J., Cheriet, M., and Sabourin, R., "Estimating Accurate Multi-class Probabilities with Support Vector Machines," International Joint Conference on Neural Networks 2005, 1906-1911, Montreal, Canada, August 1-4, 2005.
28. Murshed, N. A., Bortolozzi, F., and Sabourin, R., "A Cognitive Approach to Signature Verification," International Journal of Pattern Recognition and Artificial Intelligence (Special Issue on Bank Cheques Processing), 11:7, 801-825, 1997.
29. Oliveira, L. S., Sabourin, R., Bortolozzi, F., and Suen, C. Y., "Automatic Recognition of Handwritten Numerical Strings: A Recognition and Verification Strategy," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:11, 1438-1454, 2002.
30. Parsons, O., and Carpenter, G. A., "ARTMAP neural networks for information fusion and data mining: map production and target recognition methodologies," Neural Networks, 16, 1075-1089, 2003.

31. Rubin, M. A., "Application of Fuzzy ARTMAP and ART-EMAP to Automatic Target Recognition Using Radar Range Profiles," Neural Networks, 8:7, 1109-1116, 1995.
32. Schutte, J. F., Reinbolt, J. A., Fregly, B. J., Haftka, R. T., and George, A. D., "Parallel Global Optimization with Particle Swarm Algorithm," International J. for Numerical Methods in Engineering, 61, 2296-2315, 2004.
33. Stone, M., "Cross-Validatory Choice and Assessment of Statistical Predictions," Journal of the Royal Statistical Society, Series B, 36, 111-147, 1974.
34. Sumathi, S., Sivanandam, S. N., and Jagadeeswari, R., "Design of Soft Computing Models for Data Mining Applications," Indian J. of Engineering and Materials Sciences, 7:3, 107-121, 2000.
35. Srinivasa, N., "Learning and Generalization of Noisy Mappings Using a Modified PROBART Neural Network," IEEE Trans. on Signal Processing, 45:10, 2533-2550, 1997.
36. Valentini, G., "An Experimental Bias-Variance Analysis of SVM Ensembles Based on Resampling Techniques," IEEE Trans. on Systems, Man, and Cybernetics – Part B: Cybernetics, 35:6, 1252-1271, 2005.
37. Verzi, S. J., Heileman, G. L., Georgiopoulos, M., and Healy, M. J., "Boosting the Performance of ARTMAP," IEEE International Joint Conference on Neural Networks Proceedings 1998, Anchorage, USA, 396-401, 1998.
38. Waxman, A. M., Verly, J. G., Fay, D. A., Liu, F., Braun, M. I., Pugliese, B., Ross, W., and Streilein, W., "A Prototype System for 3D Color Fusion and Mining of Multisensor/Spectral Imagery," Proc. of the 4th International Conference on Information Fusion, Vol. 1, pp. WeC1-(3-10), Montreal, Canada, August 7-10, 2001.
39. Williamson, J. R., "A Constructive, Incremental-Learning Neural Network for Mixture Modeling and Classification," Neural Computation, 9:7, 1517-1543, 1997.