IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001 497
Learn++: An Incremental Learning Algorithm for
Supervised Neural Networks
Robi Polikar, Member, IEEE, Lalita Udpa, Senior Member, IEEE, Satish S. Udpa, Senior Member, IEEE, and Vasant Honavar
Abstract—We introduce Learn++, an algorithm for incre-
mental training of neural network (NN) pattern classifiers. The
proposed algorithm enables supervised NN paradigms, such as
the multilayer perceptron (MLP), to accommodate new data,
including examples that correspond to previously unseen classes.
Furthermore, the algorithm does not require access to previously
used data during subsequent incremental learning sessions, yet at
the same time, it does not forget previously acquired knowledge.
Learn++ utilizes an ensemble of classifiers by generating multiple
hypotheses using training data sampled according to carefully
tailored distributions. The outputs of the resulting classifiers
are combined using a weighted majority voting procedure. We
present simulation results on several benchmark datasets as well
as a real-world classification task. Initial results indicate that the
proposed algorithm works rather well in practice. A theoretical
upper bound on the error of the classifiers constructed by Learn++
is also provided.
Index Terms—Catastrophic forgetting, classification algo-
rithms, ensemble of classifiers, incremental learning, knowledge
acquisition and retention, pattern recognition, supervised neural networks.

I. INTRODUCTION

MACHINE LEARNING offers one of the most cost-effective and practical approaches to the design of pattern
classifiers for a broad range of pattern recognition applications.
The performance of the resulting classifier relies heavily on the
availability of a representative set of training examples. In many practical applications, acquisition of representative training data is expensive and time consuming. Consequently, it is not
uncommon for such data to become available in small batches
over a period of time. In such settings, it is necessary to update
an existing classifier in an incremental fashion to accommo-
date new data without compromising classification performance
on old data. Learning new information without forgetting pre-
viously acquired knowledge, however, raises the so-called sta-
bility–plasticity dilemma, one of the fundamental problems
in knowledge management (KM): some information may have to be lost in order to learn new information, as learning new patterns will tend to overwrite formerly acquired knowledge. The dilemma points out the fact that a completely stable classifier will preserve existing knowledge but will not accommodate any new information, whereas a completely plastic classifier will learn new information but will not conserve prior knowledge.

Manuscript received June 1, 2001; revised October 1, 2001.
R. Polikar is with the Department of Electrical and Computer Engineering, Rowan University, Glassboro, NJ 08028 USA (e-mail: email@example.com).
L. Udpa is with the Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011 USA (e-mail: firstname.lastname@example.org).
S. S. Udpa is with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824 USA.
V. Honavar is with the Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, Ames, IA 50011 USA.
Publisher Item Identifier S 1094-6977(01)11261-7.
A typical approach for learning new information involves
discarding the existing classifier, and retraining the classifier
using all of the data that have been accumulated thus far. Ex-
amples of this approach include common neural network (NN)
paradigms, such as multilayer perceptron (MLP), radial basis
function (RBF) networks, wavelet networks, and Kohonen net-
works. This approach, lying on the “stability” end of the spectrum, however, results in the loss of all previously acquired information, which is known as catastrophic forgetting. Furthermore,
this approach may not even be feasible in many applications,
particularly if the original data is no longer available. An alter-
native approach, lying toward the “plasticity” end of the spec-
trum, involves the use of online training algorithms. However,
many existing online algorithms assume a rather restricted form
of classifiers, such as classifiers that compute conjunctions of
Boolean features. Consequently, such algorithms have limited
applicability in real-world applications. A third approach to in-
cremental learning is the use of instance-based learners such
as nearest neighbor classifiers. However, this approach entails
storing all of the data.
Various algorithms suggested in the literature for incre-
mental learning typically use one or a combination of the
above-mentioned approaches, and fall somewhere along
the stability–plasticity spectrum. Some of the more recent and
prominent of such algorithms are discussed in the next section.
We should also note that the term “incremental learning” has
been used rather loosely in the literature, where it has referred to concepts as diverse as incremental network growing and pruning, online learning, or relearning of formerly misclassified instances. Furthermore, various other terms, such as constructive learning, lifelong learning, and evolutionary learning have also been used to imply learning new information.
Against this background, precise formulations of the in-
cremental learning problem, characterizations of information
requirements of incremental learning, and the establishment
of necessary and sufficient conditions for incremental learning
need to be established. In this paper, we therefore define an
incremental learning algorithm as one that meets the following criteria.
1) It should be able to learn additional information from new data.
2) It should not require access to the original data used to train the existing classifier.
3) It should preserve previouslyacquired knowledge(that is,
it should not suffer from catastrophic forgetting).
4) It should be able to accommodate new classes that may
be introduced with new data.
An algorithm that possesses these properties would be an
indispensable tool for pattern recognition and machine learning
researchers, since a virtually unlimited number of applica-
tions can benefit from such a versatile incremental learning
algorithm. The problem addressed in this paper is therefore de-
signing a supervised incremental learning algorithm satisfying
all of the above-mentioned criteria.
The rest of this paper is organized as follows. In Section II,
we provide an overview of various approaches suggested for in-
cremental learning algorithms, as well as an overview of en-
semble-based learning algorithms, which were originally pro-
posed for improving generalization performance of classifiers.
In Section III, we show how ensemble-based approaches can be
used in an incremental learning setting, and present the Learn++
algorithm in detail. In Section IV, we explain the benchmark and
real-world databases used to evaluate the algorithm, along with
simulation results obtained on these databases. We also com-
pare the Learn++ performance to that of fuzzy ARTMAP on the
real-world database. Finally, in Section V, we summarize our
conclusions and point out future research directions.
A. Incremental Learning
As mentioned earlier, various algorithms have been sug-
gested for incremental learning, where incremental learning
implied different problems. For example, in some cases, the phrase “incremental learning” has been used to refer to growing or pruning of classifier architectures, or to the selection of the most informative training samples. In other cases, some form of controlled modification of classifier weights has been suggested, typically by retraining with misclassified signals. These algorithms are capable of learning new information; however, they do not simultaneously satisfy all of the above-mentioned criteria for incremental learning: they either require access to old data, forget prior knowledge along the way, or are unable to accommodate new classes. One notable exception is the (fuzzy) ARTMAP algorithm, which
is based on generating new decision clusters in response to
new patterns that are sufficiently different from previously seen
instances. This sufficiency is controlled by a user-defined vigilance parameter. Each cluster learns a different hyper-rectangle-shaped portion of the feature space in an unsupervised mode; these clusters are then mapped to target classes. Since previously
generated clusters are always retained, ARTMAP does not
suffer from catastrophic forgetting. Furthermore, ARTMAP
does not require access to previously seen data, and it can
accommodate new classes. Therefore, ARTMAP fits perfectly
into our description of incremental learning.
ARTMAP is a very powerful and versatile algorithm; how-
ever, it has its own drawbacks. In many applications, researchers have noticed that ARTMAP is very sensitive to the selection of the vigilance parameter, to the noise level in the training data, and to the order in which the training data are presented to the algorithm. Furthermore, if the vigilance parameter is not chosen correctly, the algorithm generates a large number of clusters, causing overfitting and resulting in poor generalization performance. Therefore, this parameter is typically chosen in an ad hoc manner by trial and error. Various algorithms have been suggested to overcome such difficulties.
Other incremental learning algorithms, such as incremental construction of support vector machine classifiers with provable performance guarantees, incremental learning based on reproducing kernel Hilbert spaces, or incrementally adding new IF–THEN rules to an existing fuzzy inference system, have also been suggested. These algorithms also fit the incremental learning setting described above; however, they require either precise a priori knowledge of data distributions or an ad hoc selection of a large number of parameters.
B. Ensemble of Classifiers
In this paper, we follow a different approach to the incre-
mental learning problem, and present an algorithm that not only
satisfies all criteria mentioned above, but also overcomes the
difficulties that are associated with ARTMAP and ARTMAP
based classifiers. In essence, instead of generating new cluster
nodes for each previously unseen (or sufficiently different) in-
stance, we generate multiple new “weak classifiers” for previ-
ously unseen portions of the feature space. This conceptually subtle difference allows us to develop a fundamentally different
incremental learning algorithm that is insensitive to the order of
presentation of the training data, or even to the minor adjust-
ments of the algorithm parameters.
Learn++, the proposed incremental learning algorithm
described in the next section, was inspired by the AdaBoost
(adaptive boosting) algorithm, originally developed to improve
the classification performance of weak classifiers. Schapire
showed that for a two-class problem, a weak learner that can do just slightly better than random guessing can be transformed into a strong learner that almost always achieves an arbitrarily low error rate, using a procedure called boosting. Freund et
al. later developed AdaBoost, extending boosting to multiclass
and regression problems , . In essence, both Learn++
and AdaBoost generate an ensemble of weak classifiers, each
trained using a different distribution of training samples.
The outputs of these classifiers are then combined using
Littlestone’s majority-voting scheme to obtain the final classification rule. Combining weak classifiers takes advantage
of the so-called instability of the weak classifier. This instability
causes the classifiers to construct sufficiently different decision
surfaces for minor modifications in their training datasets.
The idea of generating an ensemble of classifiers for
improving classification accuracy was formerly introduced
by many other researchers. For example, Wolpert suggested
combining hierarchical levels of classifiers, using a procedure
called stacked generalization . Jordan and Jacobs intro-
duced hierarchical mixture of experts (HME), where multiple
classifiers were highly trained (hence experts) in different re-
gions of the feature space, and their outputs were then weighted
using a gating network. Kittler et al. analyzed error
sensitivities of various voting and combination schemes, whereas Rangarajan et al. investigated the capacity of voting systems. Ji and Ma proposed an alternative approach to AdaBoost that generates simple perceptrons with random parameters and then combines the perceptron outputs using majority voting, similar to generating an ensemble of classifiers by randomizing the internal parameters of a base classifier, previously introduced by Ali and Pazzani. Ji and Ma give an excellent review of various methods for combining classifiers, whereas Dietterich compares ensembles of classifiers to other types of learners, such as reinforcement and stochastic learners.
There have also been some attempts at using HMEs in an online setting to incrementally learn from incoming data; however, such attempts have not addressed all of the above-mentioned issues of incremental learning, in particular, learning new classes. Consequently, research on combining classifiers has been mostly limited to improving the performance of classifiers, rather than incremental learning. This leads us to consider
adaptations of ensemble-based methods such as AdaBoost or
HME to achieve incremental learning.
III. ENSEMBLE OF CLASSIFIERS FOR INCREMENTAL LEARNING: LEARN++
The combination of an ensemble of classifiers in Learn++ is specifically geared toward achieving incremental learning, as described by the criteria mentioned earlier. However, due to their similarities, Learn++ also inherits the performance-improvement properties of AdaBoost, as shown in the simulation results. Learn++ is based
on the following intuition: Each new classifier added to the en-
semble is trained using a set of examples drawn according to a
distribution, which ensures that examples that are misclassified
by the current ensemble have a high probability of being sam-
pled. In an incremental learning setting, the examples that have
a high probability of error are precisely those that are unknown
or that have not yet been used to train the classifier.
As mentioned earlier, both AdaBoost and Learn++ generate
weak hypotheses and combine them through weighted majority voting of the classes predicted by the individual hypotheses. The hypotheses are obtained by retraining a base classifier (weak learner) using strategically updated distributions of the training database. AdaBoost’s distribution update rule is optimized for improving classifier accuracy, whereas the Learn++ distribution update rule is optimized for incremental learning of new data, in particular, of data that introduce new classes.
The Learn++ algorithm is given in Fig. 1. The inputs to Learn++ are
1) the training data $S_k = \{(x_i, y_i) \mid i = 1, \ldots, m_k\}$, where the $x_i$ are training instances and the $y_i$ are the corresponding correct labels, randomly selected from the current database $\mathcal{D}_k$;
2) a weak learning algorithm WeakLearn, to be used as the base classifier;
3) an integer $T_k$, specifying the number of classifiers (hypotheses) to be generated.
Recall that a weak learning algorithm is used as a base clas-
sifier, to allow sufficiently different decision boundaries to be
generated by slightly modified training datasets. Also note that
most strong classifiers spend a majority of their training time in fine-tuning the decision boundary. As described below, Learn++ requires each weak learner to generate only a rough estimate of the actual decision boundary, effectively eliminating the costly fine-tuning step, allowing faster training and less overfitting.
Each classifier can be thought of as a hypothesis $h: X \to Y$ from the input space $X$ to the output space $Y$. Learn++ asks WeakLearn to generate multiple hypotheses using different subsets of the training data $S_k$, and each hypothesis learns only a portion of the input space. The algorithm maintains a distribution $D_t$ over $S_k$, from which training subsets are chosen; the distribution weight of each instance reflects the performance of the classifiers on that instance (Step 1). In general, the instance weights for the first iteration are initialized to $w_1(i) = 1/m_k$, giving equal likelihood to each instance to be selected into the first training subset.
At each iteration $t = 1, \ldots, T_k$, Learn++ first divides $S_k$ into a training subset $TR_t$ and a test subset $TE_t$ according to the current distribution $D_t$ (Step 2), and calls WeakLearn to generate the hypothesis $h_t$ (Step 3) using the training subset $TR_t$. The error of $h_t$ on $TR_t + TE_t$ is defined as (Step 4)
$$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$$
which is simply the sum of the distribution weights of the misclassified instances.
If $\epsilon_t > 1/2$, $h_t$ is discarded and a new $TR_t$ and $TE_t$ are selected. That is, each weak hypothesis is only expected to achieve at least 50% correct empirical classification performance, i.e., $\epsilon_t < 1/2$. For a binary class problem, this is the least restrictive requirement one could have, since an error of one-half for a binary class problem means random guessing. However, obtaining a maximum error of one-half becomes increasingly difficult as the number of classes increases, since for a $C$-class problem the error generated by random guessing is $(C-1)/C$. Therefore, the choice of a weak learning algorithm with a minimum classification performance of 50% may not be trivial. However, NN algorithms can easily be configured to simulate weak learners by modifying their size and error-goal parameters. Use of strong learners, on the other hand, is not recommended in algorithms using the ensemble-of-classifiers approach, since there is little to be gained from their combination, and/or they may lead to overfitting of the data.
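As a concrete illustration of Step 4, the snippet below computes the hypothesis error as the sum of the distribution weights of the misclassified instances, along with the normalized error derived from it. This is a hedged sketch with made-up numbers; the function and variable names are ours, not the paper's.

```python
import numpy as np

def weak_error(D, y_true, y_pred):
    """Step 4: the error is the sum of the distribution weights D(i)
    over the instances the hypothesis misclassifies."""
    D = np.asarray(D, dtype=float)
    return float(D[np.asarray(y_true) != np.asarray(y_pred)].sum())

# Five instances under a uniform distribution; the weak hypothesis
# misclassifies only the last one.
D = np.full(5, 1 / 5)
y_true = [0, 1, 1, 0, 2]
y_pred = [0, 1, 1, 0, 1]

eps = weak_error(D, y_true, y_pred)   # 0.2 < 1/2, so the hypothesis is kept
beta = eps / (1 - eps)                # normalized error: 0.25
```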
If the requirement $\epsilon_t < 1/2$ is satisfied, then the normalized error is computed as
$$\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}.$$
All hypotheses generated in the previous $t$ iterations are then combined using weighted majority voting (Step 5). The voting weights are computed as the logarithms of the reciprocals of the normalized errors, $\log(1/\beta_t)$. Therefore, those hypotheses that perform
Fig. 1. Algorithm Learn++.
well on their own training and test data are given larger voting
powers. A classification decision is then made based on the combined outputs of the individual hypotheses, which constitutes the composite hypothesis
$$H_t(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log\frac{1}{\beta_t}.$$
$H_t$ decides on the class that receives the highest total vote from all $t$ hypotheses. The composite error made by $H_t$ is then computed as
$$E_t = \sum_{i:\, H_t(x_i) \neq y_i} D_t(i) = \sum_{i} D_t(i)\,[\![H_t(x_i) \neq y_i]\!]$$
on misclassified instances, where $[\![\cdot]\!]$ is 1 if the predicate is true, and 0 otherwise. If $E_t > 1/2$, the current $H_t$ is discarded, a new training subset is selected, and a new $h_t$ is generated. We can only exceed this threshold during the iteration immediately after a new database $\mathcal{D}_k$ is introduced. At all other iterations, $E_t < 1/2$ will be satisfied, since all hypotheses that make up the composite hypothesis have already been verified in Step 4 to achieve a minimum of 50% performance on the current data. If $E_t < 1/2$, the composite normalized error is computed as
$$B_t = \frac{E_t}{1 - E_t}.$$
The distribution weights are then updated for computing the next distribution $D_{t+1}$, which in turn is used in selecting the next training and testing subsets, $TR_{t+1}$ and $TE_{t+1}$, respectively.
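The weighted majority vote of Step 5 can be sketched as follows: each hypothesis votes for its predicted class with weight $\log(1/\beta_t)$, and the class with the largest total vote wins. The names and numbers here are illustrative, not the paper's code.

```python
import numpy as np

def composite_hypothesis(hypotheses, betas, x, n_classes):
    """Step 5: weighted majority vote over all hypotheses so far."""
    votes = np.zeros(n_classes)
    for h, beta in zip(hypotheses, betas):
        votes[h(x)] += np.log(1.0 / beta)   # voting weight log(1/beta_t)
    return int(np.argmax(votes))

# Three toy hypotheses on a 3-class problem. A lower normalized error
# beta yields a stronger vote, so the accurate first hypothesis
# (beta = 0.10, weight ~2.30) outvotes the two weaker ones combined
# (weights ~0.92 and ~0.80).
hyps = [lambda x: 0, lambda x: 1, lambda x: 1]
betas = [0.10, 0.40, 0.45]
winner = composite_hypothesis(hyps, betas, x=None, n_classes=3)   # class 0
```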
The distribution update rule constitutes the heart of the algorithm, as it allows Learn++ to learn incrementally:
$$w_{t+1}(i) = w_t(i) \times B_t^{1 - [\![H_t(x_i) \neq y_i]\!]}.$$
According to this rule, if instance $x_i$ is correctly classified by the composite hypothesis $H_t$, its weight is multiplied by a factor of $B_t$, which, by its definition, is less than 1. If $x_i$ is misclassified, its distribution weight is kept unchanged. This rule reduces the probability of correctly classified instances being selected into the next training subset, while increasing the probability of misclassified instances being selected into $TR_{t+1}$. If we interpret instances that are repeatedly misclassified as hard instances, and
those that are correctly classified as simple instances, the al-
gorithm focuses more and more on hard instances, and forces
additional classifiers to be trained with them. Instances coming
from previously unseen parts of the feature space, such as those
from new classes, can be interpreted as hard instances at the time
they are introduced to the algorithm. Note that using the composite hypothesis $H_t$ makes incremental learning possible,
particularly when instances from new classes are introduced,
since these instances will be misclassified by the composite hypothesis and forced into the next training dataset. The procedure would not work nearly as efficiently if the weight update rule were based on the performance of the previous hypothesis $h_t$ alone (as AdaBoost does) instead of the composite hypothesis $H_t$. Apart from the distribution update rule, Learn++ also differs from AdaBoost in the definition of the training error and in the evaluation of individual hypotheses. During each iteration, Learn++ generates an additional test subset $TE_t$ on which the training error and hypothesis evaluation are based, whereas AdaBoost computes the individual hypothesis errors on their own training data only. Finally, since AdaBoost does not compute a composite hypothesis, the composite error is also not applicable in AdaBoost.
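The distribution update rule just described can be sketched as follows (illustrative names and numbers): weights of instances the composite hypothesis got right are multiplied by $B_t < 1$, so misclassified instances become relatively more likely to be sampled into the next training subset.

```python
import numpy as np

def update_weights(w, correct, B_t):
    """Multiply the weights of correctly classified instances by B_t < 1;
    leave the weights of misclassified instances unchanged."""
    w = np.asarray(w, dtype=float).copy()
    w[np.asarray(correct)] *= B_t
    return w

w = np.full(4, 0.25)
correct = [True, True, True, False]      # the composite missed instance 3
w = update_weights(w, correct, B_t=0.5)
D_next = w / w.sum()                     # normalized next distribution
# D_next = [0.2, 0.2, 0.2, 0.4]: the missed instance is now twice as
# likely to be drawn as any correctly classified one.
```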
Once $T_k$ hypotheses are generated for each database $\mathcal{D}_k$, the final hypothesis is obtained by the weighted majority voting of all composite hypotheses:
$$H_{final}(x) = \arg\max_{y \in Y} \sum_{k} \sum_{t:\, H_t(x) = y} \log\frac{1}{B_t}.$$
Note that while incremental learning is achieved through
generating additional classifiers, former knowledge is not lost,
since all classifiers are retained. Another important property
of Learn++ is its independence of the base classifier used
as a weak learner. In particular, it can be used to convert
any supervised classifier, originally incapable of incremental
learning, to one that can learn from new data.
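Putting the pieces together, here is a hedged, self-contained sketch of the overall procedure on a toy one-dimensional problem in which a second database introduces a previously unseen class. The weak learner (a nearest-centroid classifier rather than the paper's small MLPs), the clipping of the normalized errors, and the flattened (non-hierarchical) final vote are our simplifications; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

class Centroid:
    """Toy weak learner: predict the class of the nearest class centroid.
    (The paper uses small MLPs; any weak supervised learner fits here.)"""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.mu_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

def vote(ensemble, X, n_classes):
    """Weighted majority vote over all retained weak hypotheses
    (a flattened stand-in for the paper's hierarchical combination)."""
    V = np.zeros((len(X), n_classes))
    for h, beta in ensemble:
        pred = h.predict(X)
        for c in range(n_classes):
            V[pred == c, c] += np.log(1.0 / beta)
    return V.argmax(axis=1)

def learnpp_session(X, y, ensemble, n_classes, T):
    """One incremental session: previously seen databases are NOT revisited."""
    m = len(y)
    w = np.full(m, 1.0 / m)
    for _ in range(T):
        D = w / w.sum()                           # Step 1
        idx = rng.choice(m, size=m, p=D)          # Step 2: sample per D_t
        h = Centroid().fit(X[idx], y[idx])        # Step 3
        eps = D[h.predict(X) != y].sum()          # Step 4: weak error
        if eps > 0.5:
            continue                              # discard weak hypothesis
        ensemble.append((h, max(eps / (1 - eps), 1e-4)))  # clip beta > 0
        Hy = vote(ensemble, X, n_classes)         # Step 5: composite
        E = D[Hy != y].sum()                      # Step 6: composite error
        if E > 0.5 + 1e-9:                        # tolerance for float sums
            ensemble.pop()
            continue
        B = max(E / (1 - E), 1e-4)
        w[Hy == y] *= B                           # distribution update

def cluster(center, n, label):
    return center + 0.05 * rng.standard_normal((n, 1)), np.full(n, label)

X0, y0 = cluster(0.0, 20, 0)
X1, y1 = cluster(1.0, 20, 1)
X2, y2 = cluster(2.0, 20, 2)

ensemble = []
# Session 1: only classes 0 and 1 exist.
learnpp_session(np.vstack([X0, X1]), np.hstack([y0, y1]), ensemble, 3, T=3)
# Session 2: class 2 appears alongside fresh class-0/1 data; session 1's
# data is never touched again, yet its hypotheses are all retained.
Xa, ya = cluster(0.0, 10, 0)
Xb, yb = cluster(1.0, 10, 1)
learnpp_session(np.vstack([Xa, Xb, X2]), np.hstack([ya, yb, y2]),
                ensemble, 3, T=5)

X_all = np.vstack([X0, X1, X2])
y_all = np.hstack([y0, y1, y2])
acc = (vote(ensemble, X_all, 3) == y_all).mean()
print(f"accuracy on all three classes: {acc:.2f}")
```

Session 1's data is never revisited in session 2, yet the final vote still classifies all three classes, because session 1's hypotheses are retained while session 2's hypotheses concentrate on the new class.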
Fig. 2 conceptually illustrates the Learn++ architecture on an example. The dark curve is the decision boundary to be learned, and the two sides of the dashed line represent the feature space regions covered by the two training databases $\mathcal{D}_1$ and $\mathcal{D}_2$, which need not be mutually exclusive. Weak hypotheses are illustrated with simple geometric figures generated by the weak learners, where hypotheses $h_1$ through $h_n$ are generated by training with different subsets of $\mathcal{D}_1$, and $h_{n+1}$ through $h_m$ are generated by training with different subsets of $\mathcal{D}_2$. The hypotheses decide whether a data point is within the decision boundary. They are hierarchically combined to form the composite hypotheses $H_t$, which are then combined to form the final hypothesis $H_{final}$.
Learn++ guarantees convergence on any given training dataset by reducing the classification error with each added hypothesis. We state the theorem that relates the overall upper error bound of Learn++ to the individual errors of each hypothesis; the proof is given in the Appendix.
Theorem: The training error $E$ of the Learn++ algorithm given in Fig. 1 is bounded above by
$$E \leq 2^{T} \prod_{t=1}^{T} \sqrt{E_t (1 - E_t)}$$
where $E_t$ is the error of the $t$th composite hypothesis $H_t$. Furthermore, $E_t$ is itself bounded above by $2^{t} \prod_{j=1}^{t} \sqrt{\epsilon_j (1 - \epsilon_j)}$, where $\epsilon_j$ is the error of the individual hypothesis $h_j$.
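Assuming a bound of this product form, a few made-up composite-error values illustrate its behavior: every $E_t < 1/2$ contributes a factor $2\sqrt{E_t(1-E_t)} < 1$, so the bound decays geometrically as hypotheses are added.

```python
import math

def error_bound(composite_errors):
    """2^T * prod_t sqrt(E_t * (1 - E_t)) over the listed errors."""
    prod = 1.0
    for E in composite_errors:
        prod *= math.sqrt(E * (1.0 - E))
    return (2 ** len(composite_errors)) * prod

b3 = error_bound([0.3] * 3)     # each factor is 2*sqrt(0.21) ~ 0.917
b20 = error_bound([0.3] * 20)   # many more hypotheses => smaller bound
```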
IV. EXPERIMENTS WITH LEARN++
The algorithm was tested on various benchmark and
real-world databases. Due to space limitations, results on four
databases are presented here, with additional results available on the web. Detailed descriptions of each database along
with the performance of the algorithm on these databases are
explained in the following sections. In all experiments, previ-
ously seen data were not used in subsequent stages of learning,
and in each case the algorithm was tested on an independent
validation dataset that was not used during training. In all cases,
we have used a relatively small MLP trained with a large error
goal as the base classifier to simulate a weak learner. We note
that Learn++ itself is independent of the classifier used. The MLP was used since it is the most commonly employed classification algorithm that is not capable of incremental learning without catastrophic forgetting.
Different architectures and error goals were tried to test the
algorithm’s sensitivity and invariance to minor modifications
to parameter selections, including the MLP architecture, mean
square error (MSE) goal, and the number of hypotheses gen-
erated. The parameters given below are typical representatives
of those that have been tried. Furthermore, in order to test the
sensitivity of Learn++ to the order of presentation of the data,
multiple experiments were performed for all databases, where
the order of the datasets introduced to the algorithm at different times was varied. The results for all cases were virtually the
same. Average representative performance results are presented below.
A. Optical Digits Database
This benchmark database, obtained from the UCI machine
learning repository , consisted of 5620 instances of dig-
itized handwritten characters; 1200 instances were used for
training and all remaining instances were used for validation.
The characters were the numerals 0–9, and they were digitized on an 8 × 8 grid, creating 64 attributes. This dataset was used to
evaluate Learn++ on incremental learning without introducing
new classes. Fig. 3 shows sample images of this database. The
training dataset of 1200 instances was divided into six subsets, $\mathcal{D}_1$ through $\mathcal{D}_6$, each with 200 instances containing all ten classes,
Fig. 2. Combining classifiers for incremental learning.
to be used in six training sessions. In each training session,
only one of these datasets was used. For each of the six training sessions, 30 weak hypotheses were generated by Learn++. Each hypothesis $h_t$ of the $k$th training session was generated using a training subset $TR_t$ and a testing subset $TE_t$ (used to compute the hypothesis error), each with 100 instances drawn from $\mathcal{D}_k$. The base classifier was a single-hidden-layer MLP with 30 hidden-layer nodes and ten output nodes, with an MSE goal of 0.1. An additional validation set,
TEST, of 4420 instances was used for validation purposes.
Note that NNs can simulate a weak learner, when their ar-
chitecture is kept small and their error goal is kept high with
respect to the complexity of the particular problem. The rela-
tively high error goal of 0.1 allowed the MLP to serve as a weak
learner in this case, as shown in the Average/learner column of
Table I, which indicates the average performance of individual
hypotheses on each database $\mathcal{D}_k$. On average, the weak learners performed a little over 50%, which improved to over 90% when the hypotheses were combined. This improvement demonstrates the performance-improvement property of Learn++ (as inherited from AdaBoost) on a given single database. Each column thereafter indicates the Learn++ performance on the current and previous
training datasets as additional data were introduced. Previous
datasets were not used for training in subsequent training ses-
sions, but they were only used to evaluate the algorithm per-
formance on previously seen instances to make sure that previ-
ously acquired knowledge was not lost. The last row of Table I
shows the classification performance on the validation dataset,
which gradually and consistently improved from 82% to 93% as new databases became available, demonstrating the incremental
learning capability of the proposed algorithm.
In order to compare the performance of Learn++ to that of
a strong learner trained with the entire training data of 1200
instances, various architecture–error goal combinations were
tried. An MLP with 50 hidden-layer nodes and a 100-times-smaller error goal of 0.001 was able to match (and slightly exceed) the Learn++ performance, correctly classifying 95% of the TEST data.
B. Vehicle Silhouette Database
Also obtained from the UCI repository, the vehicle silhouette database consisted of 18 features from which the type of a vehicle is determined. The database consisted of 846 instances, which were divided into three training datasets of 210 instances each and a validation dataset, TEST, of 216 instances in four classes.

Fig. 3. Sample images from the optical digits database.

For each training session, the hypotheses were generated using a 30-node single-hidden-layer
MLP with an error goal of 0.1. This particular benchmark data-
base is considered one of the more difficult databases in the
repository, since generalization performances using various al-
gorithms (strong learners) have been in the 65%–80% range. The results are presented in Table II, where the Av-
erage/learner column indicates the average performance of a
weak hypothesis (a single MLP). We note from the average 62% performance that the chosen MLP architecture and error goal were able to simulate a weak learner. The other columns indi-
cate the Learn++ performance on individual training datasets
and on the validation dataset after each of the three training ses-
sions. As seen in Table II, there is a minor and gradual loss of
information on the previous training datasets as new datasets are
introduced; however, the generalization performance on the val-
idation dataset improved from 78% to 83%. This performance
was comparable to, or better than, the performance of most algorithms that were trained using the entire data.
C. Concentric Circles Database
This rather simple synthetic database of concentric rings with two attributes and five classes was generated for testing Learn++ performance on incremental learning when new classes are in-
troduced. Fig. 4 illustrates this database. The database was divided into six training datasets, $\mathcal{D}_1$ through $\mathcal{D}_6$, and a validation dataset TEST. Datasets $\mathcal{D}_1$ and $\mathcal{D}_2$ had 50 instances from each of the classes 1, 2, and 3; datasets $\mathcal{D}_3$ and $\mathcal{D}_4$ had 50 instances from each of the classes 1, 2, 3, and 4; and datasets $\mathcal{D}_5$ and $\mathcal{D}_6$ had 50 instances from each of the classes 1–5. The validation set TEST had 500 instances from all five classes. Table III presents
the classification performance results. The validation on the TEST
dataset shows steadily increasing generalization performance,
indicating the algorithm was able to learn the new informa-
tion, and the new classes, successfully. Note that larger improvements in the performance are obtained after the third and fifth
training sessions, since these training sessions introduced new
TABLE I
TRAINING AND GENERALIZATION PERFORMANCE OF LEARN++ ON OPTICAL DIGITS
TABLE II
TRAINING AND GENERALIZATION PERFORMANCE OF LEARN++
ON VEHICLE DATABASE
Fig. 4. Circular regions database.
classes that were not available earlier. Similarly, the improve-
ments in the performance after the fourth and sixth training ses-
sions are minor compared to the previous sessions, since these
sessions did not introduce new classes. This is also reflected in
the number of hypotheses generated during each training ses-
sion, which are given in parentheses in the first row of the table. Note that when new classes are introduced, the number of hypotheses generated in each session is not the same. The number
of hypotheses generated was determined simply by monitoring
the classification performance, where each training session was
terminated when the performance no longer improved.
The last column, titled “Last 7,” indicates the Learn++ performance using only the last seven hypotheses. Although these hypotheses
were trained with a dataset that included all classes, they were
not adequate to give satisfactory performance, demonstrating
that all hypotheses are required for the final classification.
An alternative set of six datasets was also generated from this
database, by changing the order of classes introduced incremen-
tally, to test the algorithm’s sensitivity to the order of presenta-
tion of the data. The results, which are provided on the web,
were virtually the same.
Finally, in order to compare the incremental learning perfor-
mance of Learn++ to that of a strong learner trained on the en-
tire training data, a larger MLP with 50 hidden nodes and an
error goal of 0.005 was trained. The performance of this strong
learner was 95%, only slightly better than that of Learn++.
D. Gas Sensing Dataset
Learn++ was then implemented on real-world data obtained
from a set of six polymer-coated quartz crystal microbalances
(QCMs) used to detect volatile organic compounds (VOCs).
Detection and identification of VOCs are of crucial importance
for environmental monitoring and in gas sensing. Piezoelectric
acoustic wave sensors, which comprise a versatile class of
chemical sensors, are used for the detection of VOCs. For
sensing applications, a sensitive polymer film is cast on the
surface of the QCM. This layer can bind a VOC of interest,
altering the resonant frequency of the device, in proportion
to the added mass. Addition or subtraction of gas molecules
from the surface or bulk of an acoustic wave sensor results
in a change in its resonant frequency. The frequency change
\Delta f caused by a deposited mass \Delta m can be described by the Sauerbrey equation

\Delta f = -2.3 \times 10^6 \, f_0^2 \, \Delta m / A

where f_0 is the fundamental resonant frequency (in MHz) of the
bare crystal, and A is the active surface area (in cm^2). The
sensor typically consists of an array of
several crystals, each coated with a different polymer. This
design is aimed at improving identification, hampered by the
limited selectivity of individual films. Employing more than
one crystal, and coating each with a different partially selective
polymer, different responses can be obtained for different
gases. The combined response of these crystals can then be
used as a signature pattern of the VOC detected.
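To make the mass-loading relation concrete, the following sketch (not from the paper; the crystal frequency, film area, and sorbed masses are illustrative assumptions) computes Sauerbrey frequency shifts for a six-crystal array and normalizes them into a signature pattern of the kind described above:

```python
# Sketch (illustrative values, not from the paper): Sauerbrey mass
# loading on a six-crystal QCM array.

def sauerbrey_shift(f0_mhz, delta_m_g, area_cm2):
    """Frequency shift (Hz) for mass delta_m_g (g) on area_cm2 (cm^2),
    per the Sauerbrey relation: df = -2.3e6 * f0^2 * dm / A."""
    return -2.3e6 * f0_mhz**2 * delta_m_g / area_cm2

def signature(sorbed_masses_g, f0_mhz=10.0, area_cm2=0.2):
    """Combine the shifts of the array into a normalized pattern."""
    shifts = [sauerbrey_shift(f0_mhz, m, area_cm2) for m in sorbed_masses_g]
    peak = max(abs(s) for s in shifts) or 1.0
    return [s / peak for s in shifts]  # normalized frequency change

# Each polymer coating sorbs a different amount of the same VOC,
# producing a VOC-specific pattern across the six crystals:
pattern = signature([2e-9, 5e-9, 1e-9, 8e-9, 3e-9, 4e-9])
```

Because each partially selective polymer sorbs a different mass, the normalized six-element vector serves as the signature pattern fed to the classifier.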
The gas sensing dataset used in this study consisted of re-
sponses of six QCMs to five VOCs, including ethanol (ET),
xylene (XL), octane (OC), toluene (TL), and trichloroethylene
(TCE). Fig. 5 illustrates sample patterns for each VOC from
TRAINING AND GENERALIZATION PERFORMANCE OF LEARN++ ON CONCENTRIC CIRCLES
Fig. 5. Sample responses of the six-QCM sensor array to VOCs of interest.
DATA-CLASS DISTRIBUTION FOR THE GAS SENSING DATABASE
six QCMs coated with different polymers, where the vertical
axis represents normalized frequency change. Note that the pat-
terns from toluene, xylene, and TCE look considerably similar;
hence, they are difficult to distinguish from each other. Further
details on VOC recognition using QCMs can be found in ,
whereas more information on this dataset, experimental setup
for generating the gas sensing signals, and sample patterns are
provided on the web .
The dataset consisted of 384 six-dimensional patterns, half
of which were used for training. Table IV presents the distri-
bution of the datasets, where subsequent datasets are strongly
biased toward the new class. Such a distribution results in an
even tougher challenge, since the algorithm no longer has the
opportunity to see an adequate number of instances from pre-
viously introduced classes in the subsequent training sessions.
The performance of Learn++ on this dataset is shown in Table V.
The generalization performance of Learn++ on the validation
dataset, gradually improving from 61% to 88% as new data was
introduced, demonstrates its incremental learning capability
even when instances of new classes are introduced in subsequent
training sessions. Learn++ performance on this dataset was
comparable to that of a strong learner, a two hidden layer MLP
of error goal 0.001, trained with the entire training data, which
had a classification performance of 90%. Learn++ was able to
perform as well as the strong learner, by seeing only a portion of
the dataset at a time. This dataset was also presented to the algo-
rithm in a different order, and the resulting performances were
virtually the same, implying that the algorithm is not sensitive
to the order of presentation of the training data. Furthermore,
various minor modifications to the base classifier architecture
(number of hidden nodes) and error goal also resulted in
similar performances, indicating that the algorithm is not very
sensitive to minor changes in its parameters. A formal analysis
on how much such parameters can be changedwithout affecting
the performance is currently underway.
Finally, Learn++ was also compared to fuzzy ARTMAP on the
identical dataset described in Table IV for various values of the
vigilance parameter. According to the ARTMAP learning algorithm,
once convergence is reached, each training instance is assigned
to a cluster, and future training does not alter this clustering.
Therefore, ARTMAP never forgets what it has seen as a training
data instance, making it inherently capable of incremental
learning. However, fuzzy ARTMAP was indeed sensitive to slight
changes in its vigilance parameter, and even its best performance
was lower than that of Learn++.
SUMMARY AND DISCUSSION
This paper introduced Learn++, a versatile incremental
learning algorithm based on synergistic performance of an
ensemble of weak classifiers/learners. Learn++ can learn
from new data even when the data introduces new classes.
TRAINING AND GENERALIZATION PERFORMANCE OF LEARN++ ON THE GAS SENSING DATABASE
FUZZY ARTMAP PERFORMANCE ON THE GAS SENSING DATABASE
Learn++ does not require access to previously used data during
subsequent training sessions, and it is able to retain previously
acquired knowledge. Learn++ makes no assumptions as to
what kind of weak learning algorithm is to be used. Any
weak learning algorithm can serve as the base classifier of
Learn++, though the algorithm is optimized for supervised
NN-type classifiers, whose weakness can be easily controlled
via network size and error goal.
Learn++ is also intuitively simple, easy to implement, and
converges much faster than strong learning algorithms. This is
because using weak learners eliminates the problems of fine-
tuning and overfitting, since each learner only roughly approx-
imates the decision boundary.
Initial results using this algorithm look promising, but there is
significant room for improvement and many questions to be an-
swered. The algorithm has two key components, both of which
can be improved. The first one is the selection of the subsequent
training dataset, which depends on the distribution update rule.
AdaBoost depends solely on the performance of individual
hypotheses for its distribution update, whereas Learn++ uses
the performance of the overall composite hypothesis. The
former guarantees robustness and prevents performance
deterioration, whereas the latter allows efficient incremental
learning capability when new classes are introduced.
An appropriate combination ofthe two updating schemes might
provide optimum performance levels. Initialization of the dis-
tribution when a new database is introduced can also be opti-
mized by an initial classification evaluation of the composite
hypotheses on the new database.
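The suggested combination of the two update schemes can be sketched as follows. This is purely illustrative, not part of Learn++: the mixing parameter `alpha` and the per-instance correctness flags are assumptions introduced here to show how an interpolated rule might look.

```python
# Hypothetical blend of the two distribution-update rules: `alpha`
# interpolates between an AdaBoost-style update (individual hypothesis h,
# normalized error beta_h) and a Learn++-style update (composite
# hypothesis H, normalized error B_k).

def update_distribution(weights, h_correct, H_correct, beta_h, B_k, alpha=0.5):
    """Discount the weights of correctly classified instances, then
    renormalize so the weights again form a distribution."""
    new_w = []
    for w, ch, cH in zip(weights, h_correct, H_correct):
        factor = (beta_h ** alpha if ch else 1.0) * \
                 (B_k ** (1 - alpha) if cH else 1.0)
        new_w.append(w * factor)
    total = sum(new_w)
    return [w / total for w in new_w]

# Four instances, uniform initial weights; the last instance is
# misclassified by both h and H, so its relative weight grows most:
D = update_distribution([0.25] * 4,
                        [True, True, False, False],
                        [True, False, True, False],
                        beta_h=0.2, B_k=0.3)
```

With `alpha = 1` the rule reduces to the AdaBoost-style update, and with `alpha = 0` to the Learn++-style update, so the interpolation covers both extremes discussed above.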
The second key factor in Learn++ is the hypothesis combina-
tion rule. Currently, voting weights are determined based on
performances of the hypotheses on their own training data subset.
This is suboptimal, since the performance of a hypothesis on a
specific subset of the input space does not guarantee the perfor-
mance of that hypothesis on an unknown instance, which may
come from a different subset of the space. This static combina-
tion rule can be replaced by a dynamic rule that estimates which
hypotheses are likely to correctly classify a given (unknown) in-
stance, based on statistical distance metrics or a posteriori prob-
abilities, and determine voting weights accordingly for each instance.
Other issues include selection of algorithm parameters, and
using other classifiers as weak learners. The algorithm param-
eters, such as base classifier architecture, error goal, number of
hypotheses to be generated, are currently chosen in a rather ad
hoc manner. Although the algorithm appears to be insensitive
to minor changes in these parameters, a formal method for se-
lecting them would be beneficial. Future work will also include
evaluating Learn++ with other classifiers used as weak learners,
such as RBF NNs and non-NN-based classification/clustering algorithms.
Finally, the weighted majority voting for combining the hy-
potheses hints at a simple way of estimating the reliability of the
final decision and confidence limits of the performance figures.
In particular, if a vast (marginal) majority of the hypotheses
agree on the class of a particular instance, then this can be
interpreted as the
algorithm having high (low) confidence in the final decision. A
formal analysis of classifier reliability and confidence intervals
of the classifier outputs can be done by computing a posteriori
probabilities of classifier outputs, which can then be compared
to those obtained by using the vote count mechanism.
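The weighted majority combination and the vote-margin confidence heuristic described above can be sketched as follows (a minimal illustration; the VOC class labels and normalized-error values are made up for the example):

```python
# Weighted majority voting: each hypothesis casts a vote for its
# predicted class, weighted by log(1/B_k), where B_k is its normalized
# error. The winning class's share of the total vote serves as a crude
# confidence estimate, as discussed in the text.
import math

def weighted_majority(predictions, betas):
    """predictions: one class label per hypothesis;
    betas: normalized errors B_k in (0, 1)."""
    votes = {}
    for label, B in zip(predictions, betas):
        votes[label] = votes.get(label, 0.0) + math.log(1.0 / B)
    total = sum(votes.values())
    winner = max(votes, key=votes.get)
    confidence = votes[winner] / total  # near 1.0 => vast majority agrees
    return winner, confidence

# Three of four hypotheses vote for ethanol; one (the weakest) dissents:
label, conf = weighted_majority(["ET", "ET", "TCE", "ET"],
                                [0.1, 0.2, 0.4, 0.25])
```

A confidence near 1.0 corresponds to the "vast majority" case, while a value near the reciprocal of the number of classes signals a marginal, low-confidence decision.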
ERROR BOUND ANALYSIS FOR LEARN++
Theorem: The training error E of the Learn++ algorithm given
in Fig. 1 is bounded above by

E \le 2^K \prod_{k=1}^{K} \sqrt{E_k (1 - E_k)}

where E_k, the error of the kth combined hypothesis H_k, is also
bounded above by the AdaBoost.M1 error

E_k \le 2^{T_k} \prod_{t=1}^{T_k} \sqrt{\varepsilon_t (1 - \varepsilon_t)}.

Proof: Following a similar approach given in , we first
show that the above error bound holds for a two-class problem,
and then show that a multiclass problem can be reduced to a
binary class problem, allowing the same error bound to hold for
the multiclass case as well. Let us call the algorithm working
on binary problems Learn+ (as opposed to Learn++, which is
reserved for the multiclass problem).

In a binary class setting where the two possible values for y_i
are 0 and 1, the equations for error terms and distribution update
rules given in Fig. 1 can be simplified as follows. The combined
hypothesis H_k is obtained by

H_k(x_i) = 1 if \sum_{t: h_t(x_i)=1} \log(1/\beta_t) \ge \frac{1}{2} \sum_{t} \log(1/\beta_t), and 0 otherwise    (9)

whereas the error E_k of H_k used for the distribution update
rule, with weights updated as w_{k+1}(i) = w_k(i) \, B_k^{\,1 - |H_k(x_i) - y_i|},
is given by

E_k = \sum_{i: H_k(x_i) \ne y_i} D_k(i)    (10)

and the final classification rule for each dataset is

H_{final}(x_i) = 1 if \sum_{k: H_k(x_i)=1} \log(1/B_k) \ge \frac{1}{2} \sum_{k=1}^{K} \log(1/B_k), and 0 otherwise.    (11)

We define the error E of the final hypothesis as the sum of the
initial weights of the misclassified instances, that is

E = \sum_{i: H_{final}(x_i) \ne y_i} D_1(i) = \sum_{i \in M} D_1(i)    (12)

where M is the set of misclassified instances. To find an upper
bound for E, we analyze the final weights of the instances after
K iterations, and associate these weights with the errors
committed by the combined hypotheses. After K rounds, the final
weight for any instance is

w_{K+1}(i) = D_1(i) \prod_{k=1}^{K} B_k^{\,1 - |H_k(x_i) - y_i|}.    (13)

The summation over all instances gives

\sum_i w_{K+1}(i) = \sum_i D_1(i) \prod_{k=1}^{K} B_k^{\,1 - |H_k(x_i) - y_i|}.    (14)

Comparing the sum of the weights of all instances to the sum of
the weights of those that are misclassified,

\sum_i w_{K+1}(i) \ge \sum_{i \in M} w_{K+1}(i) = \sum_{i \in M} D_1(i) \prod_{k=1}^{K} B_k^{\,1 - |H_k(x_i) - y_i|}.    (15)

We now note that the final hypothesis H_{final} will make a mis-
take on instance x_i if and only if

\sum_{k=1}^{K} |H_k(x_i) - y_i| \log(1/B_k) \ge \frac{1}{2} \sum_{k=1}^{K} \log(1/B_k)    (16)

or alternatively, if and only if

\prod_{k=1}^{K} B_k^{-|H_k(x_i) - y_i|} \ge \Big( \prod_{k=1}^{K} B_k \Big)^{-1/2}.    (17)

Incorporating (17) into (15) for misclassified instances, we ob-
tain

\sum_i w_{K+1}(i) \ge \Big( \sum_{i \in M} D_1(i) \Big) \Big( \prod_{k=1}^{K} B_k \Big)^{1/2} = E \prod_{k=1}^{K} B_k^{1/2}    (18)

giving us an upper bound

E \le \Big( \prod_{k=1}^{K} B_k^{-1/2} \Big) \sum_i w_{K+1}(i)    (19)

for the error of the final hypothesis.
However, this upper bound based on the weights of individual
instances is of little use, since it is difficult to keep track of the
weights of every instance used for each hypothesis. This sum
of weights can also be limited by an upper bound, based on the
errors E_k of each H_k. Recognizing that B^x \le 1 - (1 - B)x
for B \in [0, 1] and x \in [0, 1], and starting with the sum of the
weights of all instances,

\sum_i w_{k+1}(i) = \sum_i w_k(i) \, B_k^{\,1 - |H_k(x_i) - y_i|} \le \sum_i w_k(i) \big[ 1 - (1 - B_k)\big(1 - |H_k(x_i) - y_i|\big) \big].    (20)

We now define the intermediate variable \ell_k(i) = |H_k(x_i) - y_i|
as the loss of the kth hypothesis on instance x_i; then the total error
of the kth combined hypothesis is

E_k = \sum_i D_k(i) \, \ell_k(i).    (21)

Furthermore, recall from Step 1 of the Learn++ algorithm that

D_k(i) = w_k(i) \Big/ \sum_i w_k(i).    (22)

Substituting (21) and (22) into (20), we obtain

\sum_i w_{k+1}(i) \le \Big( \sum_i w_k(i) \Big) \big[ 1 - (1 - B_k)(1 - E_k) \big]    (23)

and unrolling this recursion over all K iterations, with
\sum_i w_1(i) = \sum_i D_1(i) = 1, we obtain

\sum_i w_{K+1}(i) \le \prod_{k=1}^{K} \big[ 1 - (1 - B_k)(1 - E_k) \big].    (24)

Substituting (24) into (19), we obtain

E \le \prod_{k=1}^{K} \frac{1 - (1 - B_k)(1 - E_k)}{\sqrt{B_k}}    (25)

which gives us an upper bound on the training error in terms
of the normalized error B_k and the actual error E_k of the combined
hypothesis H_k. Note that no relationship has been assumed be-
tween B_k and E_k in this derivation. We now find the optimum
B_k, minimizing the right-hand side of (25). Since all factors in
(25) are positive, we can take the derivative individually for each
B_k; setting it to zero yields

B_k = \frac{E_k}{1 - E_k}.    (26)

Finally, substituting (26) into (25), we obtain

E \le 2^K \prod_{k=1}^{K} \sqrt{E_k (1 - E_k)}    (27)

which is identical in form to the bound of AdaBoost, except that the
errors \varepsilon_t of individual hypotheses h_t are replaced by the errors of
the combined hypotheses E_k. Furthermore, since each combined
hypothesis is obtained from individual hypotheses much like the
final hypothesis is obtained from the combined hypotheses, an
identical error analysis can be carried out for each E_k, whose
upper bound obviously will be 2^{T_k} \prod_{t=1}^{T_k} \sqrt{\varepsilon_t (1 - \varepsilon_t)}, the error of
the AdaBoost algorithm.

So far, we have only shown the error bound for the binary
classification problem; however, it is easy to show that the
same analysis holds for multiclass problems, by establishing a
one-to-one mapping between the binary class and multiclass
problems. Again, following a similar approach to that in ,
for each instance (x_i, y_i) in the Learn++ training set, we define
a Learn+ instance (\tilde{x}_i, \tilde{y}_i) = (z_i, 0), with z_i some random number.
We also define the initial distribution for the Learn+
instances to be the same as that for the Learn++ instances.
For each iteration k, we pass the hypothesis

\tilde{H}_k(z_i) = 0 if H_k(x_i) = y_i, and 1 otherwise

as if WeakLearn returns it to Learn+. Note
that according to this formulation, if Learn++ misclassifies x_i,
then \tilde{H}_k will return 1 for z_i. Since the correct class of
z_i is 0 (all instances for Learn+ are of class 0 by
our previous definition), \tilde{H}_k misclassifies this instance
as well. On the other hand, if Learn++ correctly classifies
x_i, it will return 0 for z_i, and since this is also the
correct class for all Learn+ instances, \tilde{H}_k classifies the
instance correctly. In other words, when the
multiclass algorithm makes an error, the binary class algorithm
makes an error, and when the multiclass algorithm correctly
classifies an instance, so does the binary class algorithm. Since
the initial distributions for both algorithms were defined to be
identical, the errors computed by both algorithms will also be
identical, as will E_k, B_k, and D_k. Therefore,
the error of the final hypothesis of Learn+ will also be identical to that
given in (27).
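The optimization step leading from (25) to (27) can be checked numerically. The sketch below (a verification aid, not part of the paper) evaluates the per-round factor of (25) on a grid of B values and confirms that the closed-form choice (26) attains the minimum, which equals 2*sqrt(E*(1-E)):

```python
# Numerical check of the optimization step (26): for the factor in (25),
# f(B) = (1 - (1 - B)(1 - E)) / sqrt(B), the minimizer is B = E/(1 - E),
# where the factor evaluates to 2*sqrt(E*(1 - E)).
import math

def factor(B, E):
    """One round's contribution to the training-error bound in (25)."""
    return (1.0 - (1.0 - B) * (1.0 - E)) / math.sqrt(B)

E = 0.2                       # example combined-hypothesis error
B_opt = E / (1.0 - E)         # closed-form minimizer from (26)
f_opt = factor(B_opt, E)      # should equal 2*sqrt(E*(1-E)) = 0.8

# A fine grid search over B in (0, 1) never beats the closed-form choice:
grid_min = min(factor(b / 1000.0, E) for b in range(1, 1000))
```

Since each factor is less than 1 whenever E_k < 1/2, the product in (27) drives the training error down exponentially in the number of combined hypotheses.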
 S. Grossberg, “Nonlinear neural networks: principles, mechanisms and
architectures,” Neural Netw., vol. 1, no. 1, pp. 17–61, 1988.
 E. H. Wang and A. Kuh, “A smart algorithm for incremental learning,”
in Proc. Int. Joint Conf. Neural Netw., vol. 3, 1992, pp. 121–126.
 B. Zhang, “An incremental learning algorithm that optimizes network
size and sample size in one trial,” in Proc. IEEE Int. Conf. Neural Netw.,
1994, pp. 215–220.
 F. S. Osorio and B. Amy, “INSS: A hybrid system for constructive ma-
chine learning,” Neurocomput., vol. 28, pp. 191–205, 1999.
 A. P. Engelbrecht and R. Brits, “A clustering approach to incremental
learning for feedforward neural networks,” in Proc. Int. Joint Conf.
Neural Netw., vol. 3, 2001, pp. 2019–2024.
 C. H. Higgins and R. M. Goodman, “Incremental learning with rule-
based neural networks,” in Proc. Int. Joint Conf. Neural Netw., vol. 1,
1991, pp. 875–880.
 M. T. Vo, “Incremental learning using the time delay neural network,”
in Proc. IEEE Int. Conf. Acoust., Speech, Signal Proces., vol. 2, 1994,
 T. Hoya and A. G. Constantidines, “A heuristic pattern correction
scheme for GRNNs and its application to speech recognition,” in Proc.
IEEE Signal Proces. Soc. Workshop, 1998, pp. 351–359.
 K. Yamauchi, N. Yamaguchi, and N. Ishii, “An incremental learning
method with retrieving of interfered patterns,” IEEE Trans. Neural
Netw., vol. 10, pp. 1351–1365, Nov. 1999.
 L. Fu, “Incremental knowledge acquisition in supervised learning net-
works,” IEEE Trans. Syst., Man, Cybern. A, vol. 26, pp. 801–809, Nov.
 L. Fu, H. H. Hsu, and J. C. Principe, “Incremental backpropagation
learning networks,” IEEE Trans. Neural Netw., vol. 7, pp. 757–762, May
 L. Grippo, “Convergent on-line algorithms for supervised learning in
neural networks,” IEEE Trans. Neural Netw., vol. 11, pp. 1284–1299,
 G. A. Carpenter, S. Grossberg, and J. H. Reynolds, “ARTMAP: Super-
vised real-time learning and classification of nonstationary data by a self
organizing neural network,” Neural Netw., vol. 4, no. 5, pp. 565–588,
 G. A. Carpenter, S. Grossberg, N. Markuzon, J. H. Reynolds, and D.
B. Rosen, “Fuzzy ARTMAP: A neural network architecture for incre-
mental supervised learning of analog multidimensional maps,” IEEE
Trans. Neural Netw., vol. 3, pp. 698–713, Sept. 1992.
 J. R. Williamson, “Gaussian ARTMAP: A neural network for fast in-
cremental learning of noisy multidimensional maps,” Neural Netw., vol.
9, no. 5, pp. 881–897, 1996.
 C. P. Lim and R. F. Harrison, “An incremental adaptive network for
on-line supervised learning and probability estimation,” Neural Netw.,
vol. 10, no. 5, pp. 925–939, 1997.
 G. Tontini, “Robust learning and identification of patterns in statistical
process control charts using a hybrid RBF fuzzy ARTMAP neural
network,” in Proc. Int. Joint Conf. Neural Netw., vol. 3, 1998, pp.
 F. H. Hamker, “Life-long learning cell structures—Continuously
learning without catastrophic interference,” Neural Netw., vol. 14, no.
4, pp. 551–573, 2000.
 G. C. Anagnostopoulos and M. Georgiopoulos, “Ellipsoid ART and
ARTMAP for incremental clustering and classification,” in Proc. Int.
Joint Conf. Neural Netw., vol. 2, 2001, pp. 1221–1226.
 S. J. Verzi, G. L. Heileman, M. Georgiopoulos, and M. J. Healy,
“Rademacher penalization applied to fuzzy ARTMAP and boosted
ARTMAP,” in Proc. Int. Joint Conf. Neural Netw., vol. 2, 2001, pp.
 D. Caragea, A. Silvescu, and V. Honavar, “Learning in open-ended
environments: Distributed learning and incremental learning,” in Archi-
tectures for Intelligence, Wermter et al., Eds. New York: Springer-
 S. Vijayakumar and H. Ogawa, “RKHS-based functional analysis for
exact incremental learning,” Neurocomput., vol. 29, pp. 85–113, 1999.
 G. G. Yen and P. Meesad, “An effective neurofuzzy paradigm for ma-
chine condition health monitoring,” IEEE Trans. Syst., Man, Cybern. B,
vol. 31, pp. 523–536, Aug. 2001.
 R. Schapire, “Strength of weak learning,” Machine Learn., vol. 5, pp.
 Y. Freund and R. Schapire, “A decision theoretic generalization of
on-line learning and an application to boosting,” J. Comput. Syst. Sci.,
vol. 57, no. 1, pp. 119–139, 1997.
 R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, “Boosting the margin:
A new explanation for the effectiveness of voting methods,” Ann. Stat.,
vol. 26, no. 5, pp. 1651–1686, 1998.
 N. Littlestone and M. Warmuth, “Weighted majority algorithm,” Inform.
Comput., vol. 108, pp. 212–261, 1994.
 D. H. Wolpert, “Stacked generalization,” Neural Netw., vol. 5, no. 2, pp.
 R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive
mixtures of local experts,” Neural Comput., vol. 3, pp. 79–87, 1991.
 M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the
EM algorithm,” Neural Comput., vol. 6, no. 2, pp. 181–214, 1994.
 J. Kittler, M. Hatef, R. P. Duin, and J. Matas, “On combining classifiers,”
IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 226–239, Mar.
 S. Rangarajan, P. Jalote, and S. Tripathi, “Capacity of voting systems,”
IEEE Trans. Software Eng., vol. 19, pp. 698–706, July 1993.
 C. Ji and S. Ma, “Combination of weak classifiers,” IEEE Trans. Neural
Netw., vol. 8, pp. 32–42, Jan. 1997.
 K. M. Ali and M. J. Pazzani, “Error reduction through learning multiple
descriptions,” Machine Learn., vol. 24, no. 3, pp. 173–202, 1996.
 C. Ji and S. Ma, “Performance and efficiency: Recent advances in su-
pervised learning,” Proc. IEEE, vol. 87, pp. 1519–1535, Sept. 1999.
 T. G. Dietterich, “Machine learning research,” AI Mag., vol. 18, no. 4,
pp. 97–136, 1997.
 C. K. Tham, “On-line learning using hierarchical mixtures of experts,”
in IEE Conf. Artificial Neural Netw., 1995, pp. 347–351.
 R. Polikar, “Algorithms for enhancing pattern separability, feature
selection and incremental learning with applications to gas sensing
electronic nose systems,” Ph.D. dissertation, Iowa State Univ., Ames,
Aug. 2000. [Online]. Available: http://engineering.rowan.edu/~po-
 C. L. Blake and C. J. Merz. (1998) UCI repository of machine learning
databases. Dept. Inform. and Comput. Sci., Univ. of California,
Irvine. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepos-
 Y. Freund and R. Schapire, “Experiments with a new boosting algo-
rithm,” in Proc. 13th Int. Conf. Machine Learning, 1996, pp. 148–156.
 R. Parekh, J. Yang, and V. Honavar, “Constructive neural network algo-
rithms for pattern classification,” IEEE Trans. Neural Netw., vol.11, pp.
436–451, Mar. 2000.
 A. D’Amico, C. Di Natale, and E. Verona, “Acoustic devices,” in Hand-
book of Biosensors and Electronic Nose. Medicine, Food and the Envi-
ronment, E. Kress-Rogers, Ed. Boca Raton, FL: CRC, 1997, ch. 9, pp.
 R. Polikar. [Online]. Available: http://engineering.rowan.edu/~po-
Robi Polikar (S’92–M’01) received the B.S. degree
in electronics and communications engineering from
Istanbul Technical University, Istanbul, Turkey,
in 1993, and the M.S. and Ph.D. degrees, both
co-majors in biomedical engineering and electrical
engineering, from Iowa State University, Ames, in
1995 and in 2000, respectively.
He is currently an Assistant Professor with the
Department of Electrical and Computer Engineering
at Rowan University, Glassboro, NJ. His current
research interests include signal processing, pattern
recognition, neural systems, machine learning, and computational models of
learning, with applications to biomedical engineering and imaging, chemical
sensing, nondestructive evaluation, and testing. He also teaches upper under-
graduate and graduate level courses in wavelet theory, pattern recognition,
neural networks, and biomedical systems and devices at Rowan University.
Dr. Polikar is a Member of ASEE, Tau Beta Pi, and Eta Kappa Nu.
Lalita Udpa (S’84–M’86–SM’91) received the M.S.
and Ph.D. degrees in electrical engineering from
Colorado State University, Fort Collins, in 1981 and
She is currently a Professor with the Electrical
and Computer Engineering Department at Iowa
State University, Ames. She works primarily in
the broad areas of computational modeling, signal
processing, and pattern recognition with applications
to nondestructive evaluation (NDE). Her research
interests include development of finite element
models for electromagnetic NDE phenomena, applications of neural networks
and signal processing algorithms for the analysis of NDE measurements,
and development of image processing techniques for flaw detection in noisy,
low-contrast X-ray images.
Dr. Udpa is a senior member of Sigma Xi and Eta Kappa Nu.
Satish S. Udpa (S’82–M’83–SM’91) received the
B.Tech. degree in electrical engineering in 1975
and a postgraduate diploma in 1977, both from
J.N.T. University, Hyderabad, India, and the M.S.
and Ph.D. degrees in electrical engineering from
Colorado State University, Fort Collins, in 1980 and
He began serving as the Chairperson of the
Department of Electrical and Computer Engineering
at Michigan State University, East Lansing, in
August 2001. Prior to joining Michigan State, he
was the Whitney Professor of Electrical and Computer Engineering at Iowa
State University, Ames, and Associate Chairperson for Research and Graduate
Studies. He holds three patents and has published more than 180 journal
articles, book chapters, and research reports. His research interests include
nondestructive evaluation, biomedical signal processing, electromagnetics,
signal and image processing, and pattern recognition.
Dr. Udpa is a Fellow of the American Society for Nondestructive Testing and
the Indian Society for Nondestructive Testing.
Vasant Honavar received the B.E. degree in electronics engineering from Ban-
galore University, Bangalore, India, the M.S. degree in electrical and computer
engineering from Drexel University, Philadelphia, PA, in 1984, and the M.S. and
Ph.D. degrees in computer science from the University of Wisconsin, Madison,
in 1989 and 1990, respectively.
He founded and directs the Artificial Intelligence Research Laboratory
(www.cs.iastate.edu/~honavar/aigroup.html) at Iowa State University (ISU),
Ames, where he is currently a Professor of computer science. He is also a
Member of the Laurence H. Baker Center for Bioinformatics and Biological
Statistics, the Virtual Reality Application Center, Information Assurance
Center, the faculty of Bioinformatics and Computational Biology, the
faculty of Neuroscience, and the faculty of Information Assurance at ISU.
His research and teaching interests include artificial intelligence, machine
learning, bioinformatics and computational biology, grammatical inference,
intelligent agents and multiagent systems, distributed intelligent information
networks, intrusion detection, neural and evolutionary computing, data mining,
knowledge discovery and visualization, knowledge-based systems, and applied
artificial intelligence. He has published over 100 research articles in refereed
journals, conferences and books, and has co-edited four books. He is a
Co-editor-in-Chief of the Journal of Cognitive Systems Research.