Unveiling the Tapestry: The Interplay of
Generalization and Forgetting in Continual Learning
Zenglin Shi, Jie Jing, Ying Sun, Joo-Hwee Lim, Senior Member, IEEE, and Mengmi Zhang
Abstract—In artificial intelligence (AI), generalization refers
to a model’s ability to perform well on out-of-distribution data
related to the given task, beyond the data it was trained on. For
an AI agent to excel, it must also possess the continual learning
capability, whereby an agent incrementally learns to perform
a sequence of tasks without forgetting the previously acquired
knowledge to solve the old tasks. Intuitively, generalization within
a task allows the model to learn underlying features that can
readily be applied to novel tasks, facilitating quicker learning and
enhanced performance in subsequent tasks within a continual
learning framework. Conversely, continual learning methods
often include mechanisms to mitigate catastrophic forgetting,
ensuring that knowledge from earlier tasks is retained. This
preservation of knowledge over tasks plays a role in enhancing
generalization for the ongoing task at hand. Despite the intuitive
appeal of the interplay of both abilities, existing literature on
continual learning and generalization has proceeded separately.
In the preliminary effort to promote studies that bridge both fields, we first present empirical evidence showing that each of these fields has a mutually positive effect on the other. Next, building upon this finding, we introduce a simple and effective technique known as shape-texture consistency regularization (STCR), which caters to continual learning. STCR learns both shape and texture representations for each task, consequently enhancing generalization and thereby mitigating forgetting. Remarkably, extensive experiments validate that our STCR can be seamlessly integrated with existing continual learning methods, including replay-free approaches. Its performance surpasses these continual learning methods in isolation or when combined with established generalization techniques by a large margin. Our data and source code are available at https://github.com/ZhangLab-DeepNeuroCogLab/distillation-style-cnn.
Received 17 January 2024; revised 15 July 2024, 13 October 2024, and 9 January 2025; accepted 18 February 2025. This work was supported in part by the National Research Foundation, Singapore, under its AI Singapore Program (AISG), under Award AISG2-RP-2021-025; and in part by the National Research Foundation, Singapore, under its National Research Foundation Fellowship (NRFF), under Award NRF-NRFF15-2023-0001. The work of Zenglin Shi was supported by the National Natural Science Foundation of China under Grant 62472138. The work of Mengmi Zhang was supported in part by the Start-Up Grant from the Agency for Science, Technology, and Research (A*STAR); and in part by the Start-Up Grant from Nanyang Technological University. (Zenglin Shi and Jie Jing contributed equally to this work.) (Corresponding author: Mengmi Zhang.)
Zenglin Shi is with the College of Computing and Data Science, Nanyang Technological University (NTU), Singapore 639798, also with the Agency for Science, Technology and Research (A*STAR), Singapore 138632, and also with the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230002, China.
Jie Jing is with NTU, Singapore 639798, also with A*STAR, Singapore 138632, and also with the Department of Computer Science, Sichuan University, Chengdu 610017, China.
Ying Sun is with the Institute for Infocomm Research, Agency for Science, Technology, and Research (A*STAR), Singapore 138632.
Joo Hwee Lim is with the Institute for Infocomm Research, Agency for Science, Technology, and Research (A*STAR), Singapore 138632.
Mengmi Zhang is with NTU, Singapore 639798, and also with A*STAR, Singapore 138632 (e-mail: mengmi.zhang@ntu.edu.sg).
Digital Object Identifier 10.1109/TNNLS.2025.3546269
Index Terms—Continual learning, generalization, robustness,
shape-texture bias.
I. INTRODUCTION
ARTIFICIAL intelligent agents must possess the capability
not only to learn and recognize out-of-distribution data
related to a given task but also to excel at continual learning,
acquiring knowledge over a sequence of tasks without for-
getting information from earlier tasks. Both these capabilities
are crucial for the deployment of artificial intelligence (AI) in
dynamically changing and complex environments. Consider
an autonomous vehicle journeying from California to Boston
in the U.S., where it encounters varying weather conditions,
transitioning from sunny to snowy. In addition, the vehicle
may come across novel objects along the route, requiring
continuous learning to detect these new objects without for-
getting the previously learned ones. While both generalization
and continual learning are essential for an AI agent, existing
literature has rarely explored the intersection of these fields in
a systematic, comprehensive, and quantitative manner.
In AI, generalization refers to the model’s robustness to
out-of-distribution data related to the given task, beyond the
data it was trained on. Existing solutions for out-of-domain
generalization often rely on increasing data diversity [1],
[2],[3],[4], self-supervised learning [5], and meta-learning
[6],[7].
Concurrently, in the continual learning setting, agents are
tasked with incrementally learning a sequence of tasks without
experiencing catastrophic forgetting of the in-distribution data
of earlier tasks. This challenge of catastrophic forgetting has
led to the development of regularization-based approaches,
e.g., [8],[9],[10],[11],[12],[13],[14], and [15]. These
methods aim to preserve essential parameters for old tasks
through knowledge distillation or heuristics. Replay-based
approaches, e.g., [12],[13],[16],[17],[18], and [19], involve
storing or synthesizing exemplars from old tasks to train
alongside new ones. More recently, architecture expansion
methods, e.g., [20], have emerged, altering the model structure
or tuning the learnable prompts to accommodate new tasks.
While several pioneering works apply generalization
approaches for continual learning problems [6],[21],[22],
[23],[24], none of these studies systematically explored the
reciprocal effect between out-of-distribution generalization and
forgetting within continual learning. In our initial efforts to
Fig. 1. Illustration of the interplay between out-of-distribution generalization
and continual learning. Over successive tasks, the model incrementally learns
to recognize new object classes (black arrows), such as strawberries in Task 1
followed by birds in Task 2, and so on. During this continual learning process,
the model exhibits a progressive loss of previously acquired knowledge, known as catastrophic forgetting (orange arrows). Generalization refers to the model's capability to perform well on the out-of-distribution data related to the given task, beyond the data it has seen during training (blue arrows). The sketch of the real-world strawberry in Task 1 refers to one out-of-distribution test sample, which was unseen by the model during training. Other examples of out-of-distribution data also include birds corrupted by noise or style-transferred cats. Here, we investigate the interplay of generalization within a given task and the forgetting of the previous tasks.
close this research gap, we conduct a comprehensive study
involving the direct integration of existing generalization
methods with continual learning baselines, followed by an
evaluation of these integrated models. We assess their perfor-
mance in terms of forgetting about earlier tasks and handling
out-of-distribution data from all the tasks trained so far (see
Fig. 1). Remarkably, empirical evidence from our study indi-
cates a mutually beneficial relationship between generalization
and continual learning capabilities. Specifically, continual
learning methods exhibiting reduced forgetting showcase
enhanced generalization for all the previous tasks and the
current task. Conversely, a superior generalization method
contributes to a reduction in forgetting when it comes to earlier
tasks. These findings underscore the intricate and reciprocal
relationship between the two fundamental capabilities in AI.
Building upon this finding, we further enhance the existing
generalization and continual learning baselines by introducing
a simple yet effective regularization method, dubbed shape-texture consistency regularization (STCR). In each task, STCR distills both shape and texture representations of the training images by regularizing the logits of their style-transferred versions against those of the originals. Extensive experiments validate that our approach
can be seamlessly integrated with existing continual learning
methods, including replay-free approaches. Its performance
significantly surpasses any single continual learning baseline or its combination with any existing generalization method. This highlights the effectiveness of our STCR in
simultaneously improving both generalization and continual
learning capabilities.
We summarize our key contributions as follows.
1) Building upon pioneering works applying generaliza-
tion approaches in continual learning, our work fosters
connections between both fields by exploring their
reciprocal effects over a sequence of tasks. Through
a series of systematic and comprehensive experiments
involving ten established generalization and contin-
ual learning baselines and eight evaluation metrics,
our empirical evidence reveals interesting insights that
generalization and continual learning capabilities are
mutually beneficial.
2) To enhance the generalization capabilities of existing
continual learning baselines, we have developed a sim-
ple and effective STCR method. Our method distills
both shape and texture representations from the train-
ing images of the current task, subsequently enhancing
generalization and thereby mitigating forgetting.
3) Our STCR can be seamlessly integrated with both
existing replay and replay-free continual learning meth-
ods. The experimental results demonstrate significant
improvements in both generalization and continual
learning performances, surpassing these existing contin-
ual learning methods in isolation or when combined with
established generalization techniques.
The structure of the subsequent sections is as follows.
In Section II, we survey related works and highlight the differences between our work and the existing literature. In Section III, we introduce empirical evidence on the interplay of generalization and forgetting based on existing generalization and continual learning methods. Next, building upon this evidence, we introduce our proposed STCR method in Section IV. Subsequently, we elaborate on the experimental setups
(see Section V) and conduct extensive analysis on STCR (see
Section VI). Finally, we present our conclusions and discuss
future work in Section VII.
II. RELATED WORK
A. Continual Learning
Continual learning involves acquiring knowledge in a
sequential manner by training a single neural network on a
series of tasks. During this process, only data from the current
task is utilized for training. One grand challenge in continual
learning is catastrophic forgetting [25],[26]. To tackle this
challenge, the following major types of approaches have been
proposed.
Regularization-based approaches [8],[9],[10] impose strin-
gent constraints on model parameters. This is achieved by
penalizing changes in parameters that are crucial for retaining
knowledge related to older tasks. Despite demonstrating some
success in alleviating forgetting, these methods fall short of
delivering satisfactory performance in demanding continual
learning scenarios [13],[27]. Recent approaches have incorpo-
rated knowledge distillation [28] as a regularization technique
to minimize the changes to the decision boundaries of old tasks
while learning new tasks. Methods based on knowledge distil-
lation [11], [12], [13], [14], [15] typically involve maintaining a snapshot of the model trained on earlier tasks in a memory buffer. Subsequently, the knowledge encapsulated in the stored
old model is distilled and transferred to the current model as
part of the learning process. Recent approaches have adopted
similar ideas of consistency regularization [24] on prototypes
for class-incremental learning.
Exemplar replay methods [12],[13],[16],[17],[18],[19]
preserve a subset of representative examples from prior tasks
in a memory buffer. Many of these approaches [12], [13], [17],
[18] leverage herding heuristics [29] for selecting exemplars.
The stored exemplars, in conjunction with new data, are then
utilized to optimize the network parameters during the learning
process of a new task. While replay strategies can be notably
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
SHI et al.: UNVEILING THE TAPESTRY: THE INTERPLAY OF GENERALIZATION AND FORGETTING 3
effective, they come with certain drawbacks. The replay of a
restricted set of stored examples may result in overfitting. In
addition, storing a substantial number of images for replay
purposes can be memory-intensive.
To address memory constraints, [30],[31] proposed a
scoring method to select what examples to store, and [31]
introduced the memory update method to decide what exam-
ples in the memory buffer to replace. To avoid the problem
of overfitting due to a limited number of replay samples,
generative replay methods integrate data from new tasks with
synthetic data generated by generative models, aiming to
replicate stimuli encountered in previous instances [16],[19],
[32],[33],[34],[35],[36]. Nevertheless, the generative models
required for generating suitable synthetic data still tend to be
sizable, memory-intensive, and challenging to train.
To eliminate these issues, replay-free approaches have been
proposed [37],[38],[39]. These approaches often rely on
augmenting prototypes [39], generating pseudo features with
prototypes as inputs [37], and computing feature covariance
relations for classification based on prototypes [38].
Dynamic architecture expansions [40],[41],[42],[43],[44]
address the challenge of catastrophic forgetting by expanding
the network. In this strategy, a new network is trained for
each task, while the preceding networks are held constant.
This ensures that the initially generated features for earlier
task classes remain preserved. Despite showcasing enhanced
performance compared to single-model approaches, these
methods result in models that quickly escalate in size as the
number of tasks increases. As a result, the scalability issue
renders them impractical for many real-world applications.
Despite many of these advancements in continual learning,
a predominant focus remains on testing models with in-
distribution data. The generalization ability of these methods
on out-of-distribution data has yet to be comprehensively evaluated.
In a pioneering effort, we provide empirical evidence sug-
gesting that continual learning methods demonstrating reduced
forgetting of old tasks also exhibit improved generalization for
all the trained tasks.
B. Generalization
Generalization methods initially focused on increasing data
diversity with augmentation techniques [2],[45],[46],[47],
[48],[49]. Subsequently, researchers delved into adversarial
training as a means to enhance generalization [50],[51],[52].
However, such adversarial training approaches often result
in compromised performance within the training distribution
itself [53].
An alternative avenue of research aimed at improving out-
of-domain generalization involves exploring shape and texture
representation learning. Geirhos et al. [48] uncovered that con-
volutional networks trained on natural images tend to acquire
texture representations that generalize well to in-distribution
data but exhibit suboptimal performance on out-of-distribution
data. In response to this, researchers have introduced methods
focused on shape representation learning [48],[49],[54],
often at the expense of in-distribution performance. Li et al.
[1] proposed shape-texture debiased training using a mixup
loss to prevent bias toward either shape or texture, striking
a performance balance between in-distribution and out-of-
distribution data.
In contrast to existing works [6],[21],[22],[23],[24],
our research delves into the impact of generalization methods
on forgetting over a sequence of tasks within the continual
learning framework. While several pioneering works apply
generalization approaches, such as data augmentation [6]
with text label using bidirectional encoder representations
from transformers (BERT) [55], transfer learning [21],[23],
and meta-learning [22] for continual learning problems, none
of these studies systematically explored the effect of out-of-distribution generalization in every task on catastrophic forgetting of earlier tasks. Wang et al. [56] provided a comprehensive survey of continual learning theories, methods, and applications, highlighting the stability-plasticity tradeoff
Consistent with this framework, our empirical results empha-
size the importance of enhancing intratask generalization in
continual learning. Building upon this evidence, we introduce
a straightforward yet effective STCR approach tailored for
continual learning. The experimental results demonstrate that
our model, serving as a plug-and-play module seamlessly inte-
grated with both replay and replay-free methods, outperforms
various combinations of existing generalization and continual
learning baselines by a substantial margin. We further val-
idated the effectiveness of STCR through a loss landscape
analysis similar to [56].
III. INTERPLAY OF GENERALIZATION AND FORGETTING
We started this section by formulating the problem setups
(see Section III-A) followed by introducing competitive
baselines, datasets, and protocols in continual learning (see
Section III-B). We experimented on continual learning base-
lines and produced empirical evidence on the interplay of
generalization and less forgetting in Section III-C. Note that
we use the common evaluation metrics, such as Forgetting F and the corruption error R-C from [57]. To quantify the bidirectional effects between generalization and forgetting, we also introduce ΔR-C and ΔF. See the detailed introduction to these metrics in Section V-C. Moreover, to provide insights into the reasons why generalization mitigates forgetting, we present a loss landscape analysis in Section III-D.
A. Problem Setting
We focus on the class-incremental scenario in continual
learning, where the objective is to learn a unified classifier
over incrementally occurring sets of classes [17]. Formally,
let T = {T_1, ..., T_T} be a sequence of classification tasks, D_t = {(x_{i,t}, y_{i,t})}_{i=1}^{n_t} be the labeled training set of the task T_t with n_t samples from m_t distinct classes, and (x_{i,t}, y_{i,t}) be the i-th (image, label) pair in the training set of the task T_t. A single model Θ is trained to solve the tasks T incrementally. The model Θ consists of a feature extractor and a unified classifier. For the initial task T_1, the model Θ_1 learns a standard classifier for the first m_1 classes. In the incremental step t, only the classifier is extended by adding m_t new output nodes to learn m_t new classes, leading to a new model Θ_t. During the training of Θ_t, only the training set D_t for the task T_t is available. In the testing phase, Θ_t is required to classify all classes seen so far, i.e., {m_1, ..., m_t}. Unlike the existing class-incremental works, Θ_t should perform well on both the
in-distribution and out-of-distribution test data. For example,
if Θ_t is trained to recognize real-world dog images, it should
be capable of recognizing both real-world dog images and
cartoon dog images during testing.
For replay-based continual learning methods, a memory buffer is often employed to store the original training images from the earlier tasks. We denote this memory buffer as the exemplar set P. For fair comparisons, in all the experiments, we allow P to store 20 exemplars per object class by default, unless otherwise specified. The terms "exemplar" and "original training image" in the memory buffer are used interchangeably
throughout the text.
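To make this setup concrete, the following is a minimal sketch of the exemplar buffer and the classifier-head expansion described above; the class and function names (e.g., ExemplarBuffer, expand_classifier) are illustrative and not taken from our released code.

```python
# Minimal sketch of the class-incremental protocol: a fixed-size exemplar buffer
# (20 images per seen class) and classifier-head expansion by m_t new output nodes.
import random
import torch
import torch.nn as nn

class ExemplarBuffer:
    """Stores up to `per_class` original training images per seen class."""
    def __init__(self, per_class=20):
        self.per_class = per_class
        self.store = {}  # class label -> list of images

    def add_task_data(self, dataset):
        for image, label in dataset:
            bucket = self.store.setdefault(label, [])
            if len(bucket) < self.per_class:
                bucket.append(image)

    def sample(self, k):
        pool = [(img, lbl) for lbl, imgs in self.store.items() for img in imgs]
        return random.sample(pool, min(k, len(pool)))

def expand_classifier(head: nn.Linear, num_new_classes: int) -> nn.Linear:
    """Adds m_t new output nodes while keeping the old class weights."""
    new_head = nn.Linear(head.in_features, head.out_features + num_new_classes)
    with torch.no_grad():
        new_head.weight[:head.out_features] = head.weight
        new_head.bias[:head.out_features] = head.bias
    return new_head
```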
B. Baselines and Protocols
We introduce a list of competitive continual learning base-
lines used in this experiment.
Naive trains the model on the sequence of tasks without any
measures to prevent forgetting.
Replay: Replay-based methods are the most effective group of continual learning methods. We implement the naive replay method based on [17], which stores the images from the previous tasks in a memory buffer and replays these images together with the training images in the current task. Consistent with [17], we used a memory buffer size of 20 images per class.
GDumb [58] greedily stores upcoming samples while balancing the class distribution of the memory buffer. During testing, the model is trained from scratch using only the samples in the memory buffer.
LwF [11] is a knowledge distillation approach that regular-
izes the training process by aligning the predictions of the new
model with those of the previous model on the new images at
the current task.
iCaRL [17] combines Replay and the knowledge distillation technique of LwF [11] introduced above. During training, an
exemplar set of images will be dynamically selected based on
feature similarity with the prototypes. Image augmentation is
also applied during training and replays.
Cumulative is an upper bound. It involves training the model
on all the aggregated data from D_1 at T_1 to D_t at the current task T_t, ensuring that the model never forgets about the old
knowledge while learning the current task.
We benchmarked all these continual learning baselines on
ImageNet-100 [59]. We used ImageNet-100-C [57] as its
out-of-distribution generalization test set. All the models are
trained on the first task T_1 containing 50 classes, followed by
ten classes in every subsequent task. There are a total of six
tasks. See Sections V-A and V-B for details on datasets and
protocols.
C. Empirical Evidence on the Interplay of Generalization and Less Forgetting Reveals Their Mutual Benefits
Here, we first examine how continual learning techniques,
aimed at mitigating catastrophic forgetting, influence the
overall generalization capabilities of models. Conversely, we
delve into the impact of generalization methods on reducing
forgetting across a sequence of tasks. We report both results
in Fig. 2.
From Fig. 2(a), we note a strong positive linear correlation
between F and R-C (r = 0.93 with a p-value of 0.01)
across diverse continual learning methods. See Section V-C
Fig. 2. Empirical evidence about the interplay of generalization and forgetting
in continual learning on ImageNet-100 when T=6. (a) Forgetting (F)
and generalization capability (R-C) across a range of continual learning
methods. For each continual learning algorithm, we report its R-C and F. See
Section V-C for the definitions of R-C and F. We also performed a linear
fitting (dashed line) between R-C and F based on all the sample points on the subplot using RANSACRegressor [60]. (b) Impact of generalization on reduced forgetting. We computed the performance difference in R-C and F between existing continual learning baselines with and without the B-Aug generalization baseline, and denote these differences as ΔF and ΔR-C. Positive ΔF and ΔR-C imply that the continual learning model has improved generalization capability and reduced forgetting after integrating the B-Aug generalization algorithm, compared to its counterpart without B-Aug. The dashed line indicates the result of linear fitting between ΔF and ΔR-C using RANSACRegressor [60]. See Section III-B for the introduction
of the continual learning baselines. See Section V-F for the introduction of the
generalization baselines. See Section V-C for the introduction of the evaluation
metrics.
for the definitions of F and R-C. This suggests that methods
exhibiting reduced forgetting also exhibit enhanced general-
ization capability. Essentially, a continual learning approach
with diminished forgetting retains more historical knowledge,
consequently augmenting the diversity of acquired features.
This retained knowledge, in turn, contributes to the model’s
improved ability to generalize when confronted with out-of-
distribution data from all previously encountered tasks.
From Fig. 2(b), a strong positive linear correlation between ΔF and ΔR-C (r = 0.80 with a p-value of 0.02) was observed
across various continual learning methods with and without
the basic augmentation (B-Aug) generalization method intro-
duced in Section V-F. Briefly, B-Aug is a data augmentation
technique involving color jittering, color drops, Gaussian
noise, and Gaussian blur [61].
The strong positive linear correlation suggests that better
generalization capability contributes to diminishing forgetting
for earlier tasks. For instance, the Replay method (dark blue),
incorporating the B-Aug, achieves a reduction of 12.90 in F and 19.46 in R-C compared to its isolated counterpart. Similarly, the cumulative baseline (cyan) also experiences a substantial decrease in F and R-C when integrated with the
B-Aug generalization method. One plausible explanation is
that the generalization method empowers the model to capture
more invariant and generic features in the current task, proving
beneficial for both knowledge transfer to novel tasks and
preserving learned knowledge from older tasks due to flatter
loss minima [62]. In Section VI, we provide a detailed analysis
of these empirical results.
However, we observe two exceptions in Fig. 2(b). Specif-
ically, the integration of the B-Aug generalization method
Fig. 3. Loss landscape analysis on existing continual learning methods with and without existing data augmentation techniques. The loss landscape is produced
by evaluating the final models at task T_6 with perturbed weights in random directions on the test sets of tasks T_1 (darker line) and T_6 (lighter line) of the
the existing B-Aug generalization method (solid line) and without B-Aug (dotted line) are presented. See Section III-B for the introduction of the continual
learning baselines. See Section V-F for the introduction of the generalization baselines.
has adversely affected the continual learning and general-
ization performances of Naive and iCaRL. Naive serves as
a lower bound (LB) without any mechanisms to mitigate
catastrophic forgetting. Consequently, the inclusion of the
B-Aug generalization method exacerbates overfitting, resulting
in a performance decline. In the case of iCaRL [17], which
has its own built-in data augmentation method during training,
the impact of the B-Aug generalization method is mitigated.
These built-in data augmentations in iCaRL include random
cropping, horizontal flipping, and rotation.
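As a side note, the correlation and robust linear-fit analysis reported in Fig. 2 can be reproduced with standard tools; the sketch below assumes the per-method (ΔF, ΔR-C) pairs have already been collected into two arrays (the numbers shown are hypothetical placeholders, not measured values) and uses scipy together with the RANSACRegressor [60] mentioned in the caption.

```python
# Sketch of a Fig. 2-style analysis: Pearson correlation plus a robust linear fit.
# delta_f / delta_rc hold hypothetical per-method measurements.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RANSACRegressor

delta_f = np.array([12.0, 8.0, 5.0, -1.0, 9.0, 3.0])    # hypothetical ΔF values
delta_rc = np.array([19.0, 11.0, 8.0, -1.0, 14.0, 5.0])  # hypothetical ΔR-C values

r, p_value = pearsonr(delta_f, delta_rc)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")

ransac = RANSACRegressor()                    # robust to outlier methods
ransac.fit(delta_f.reshape(-1, 1), delta_rc)
slope = ransac.estimator_.coef_[0]
intercept = ransac.estimator_.intercept_
print(f"RANSAC fit: dR-C ~ {slope:.2f} * dF + {intercept:.2f}")
```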
D. Loss Landscape Analysis Provides Insights Into Spurious
Feature Learning Across Tasks
In machine learning literature [63],[64],[65],[66], a
spurious feature is an irrelevant pattern that a model can learn
to capture. These features are correlated with the correct class
labels in the training data but are not inherently relevant to the
learning problem. A model with better generalization ability
is less likely to capture spurious features. This encourages the
model to make class predictions based on generic features,
ensuring it is more robust to input perturbations.
To quantify the extent to which models capture generic
features for the current task, we computed the loss landscape
[67], [68] on the test set of T_6 using the final models Θ_6 trained after the last task T_6. Specifically, to compute the loss landscape, we perturbed the weights of the models in random directions and tested the models on the current tasks. We reported their losses on T_6 as a function of weight perturbation
magnitudes in Fig. 3. From the results, we observed that the
models with the B-Aug generalization method have wider loss
landscapes compared to their counterparts in the same task.
This implies that the generalization methods help the models
learn to capture more generic features for object recognition
in the current task.
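For clarity, the probe we use can be sketched as follows: perturb the trained weights along a random direction with increasing magnitude and record the resulting test loss at each step. This is a simplified single-direction version (the filter-normalized visualization of [67] adds per-filter scaling); the model and data-loader names are placeholders.

```python
# Sketch of a 1-D loss landscape probe: test loss vs. weight-perturbation magnitude.
# `model` and `test_loader` are placeholders for a trained network and one task's test set.
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_landscape_1d(model, test_loader, magnitudes, device="cuda"):
    model.eval()
    base_state = copy.deepcopy(model.state_dict())
    # Draw one random direction per parameter tensor, rescaled to that tensor's norm.
    direction = {}
    for name, param in base_state.items():
        if param.dtype.is_floating_point:
            d = torch.randn_like(param)
            direction[name] = d * (param.norm() / (d.norm() + 1e-12))
    losses = []
    for alpha in magnitudes:
        perturbed = {name: param + alpha * direction[name] if name in direction else param
                     for name, param in base_state.items()}
        model.load_state_dict(perturbed)
        total_loss, total_samples = 0.0, 0
        for images, labels in test_loader:
            logits = model(images.to(device))
            total_loss += F.cross_entropy(logits, labels.to(device), reduction="sum").item()
            total_samples += labels.numel()
        losses.append(total_loss / total_samples)
    model.load_state_dict(base_state)  # restore the unperturbed weights
    return losses
```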
In continual learning, spurious features are specific to indi-
vidual tasks and do not generalize well. In other words, if a
model captures spurious features for a particular task, these
features are less effective for the other tasks. As a result, the
models prone to capturing spurious features may struggle with
earlier tasks, leading to increased forgetting. To validate this
point, we performed another loss landscape analysis on the
test set of the first task T_1 using the final model Θ_6 at T_6. As shown in Fig. 3, final models trained with B-Aug in T_6 consistently have wider loss landscapes even on the test sets of T_1, indicating better generalization. This suggests that these
models capture more generic features shared across tasks,
thereby reducing catastrophic forgetting in earlier tasks.
IV. OUR PROPOSED METHOD—STCR
We present empirical evidence in Section III-C high-
lighting how superior generalization performance in all the
tasks trained so far contributes to reduced forgetting during
continual learning. To enhance the generalization capability
of established continual learning methods, we introduce a
straightforward and effective regularization approach, dubbed STCR. See Fig. 4 for the method schematic. Our STCR
can seamlessly integrate with any existing continual learning
method. For instance, it can leverage exemplars in replay-
based continual learning methods to enhance generalization
ability. We introduce our STCR in this section.
A. Shape-Texture Consistency Regularization
Existing works in object recognition have suggested that
models trained with original naturalistic images are often
Fig. 4. Schematic of our STCR method. Given an image from the training set,
we create its shape-texture conflict counterpart by performing style transfers
on the training images of the current task. The style images come from the
examples stored in the memory buer. The original image and shape-texture
conflict image are combined in the mini-batch for training. The feature encoder
extracts both texture-biased and shape-biased representations from these mini-
batches. Their output logits are then normalized for classification via cross-
entropy losses. Moreover, these logits are also regularized to make consistent
class distributions. During replays, the model also rehearses the old images
using cross-entropy losses. By using the proposed STCR, the resulting model
can eectively prevent forgetting, and meanwhile, generalize to corrupted data,
and domain-shifted data during inference. See Section IV for the detailed
design of our STCR.
biased toward textures [48]. Augmenting the training data
with style-transferred images such as shape-texture conflict
images is an effective way to encourage models to learn shape-
biased features, which are more generic for object recognition
[48]. Inspired by these works, our STCR is designed to learn
shape and texture representations for each classification task,
enabling generalization to both in-distribution and out-of-
distribution data. While standard training on natural images
tends to learn texture representations for in-distribution gener-
alization, shape representations are more robust to distribution
shifts. To encourage the model to learn shape representations,
we adopt the approach of Geirhos et al. [48] by using shape-
texture conflict images for training. However, rather than using the external artistic paintings of [48] for style transfers, we use different images within the current training set D_t as style
images. This is because storing paintings in memory is not
feasible for memory-constrained continual learning.
Specifically, for any generic continual learning methods
with or without replay buers, during optimization at the
incremental step t, we randomly sample two mini-batches of images with a size of k per batch from the current training set D_t. Next, we use one mini-batch of images as style
templates and the other as content images. In this case, both
mini-batches come from the current training set and they
are randomly paired for style transfers. Every pair of style
and content images could be from the same or different object classes. Mathematically, we denote the first mini-batch X = \{x_{1,t}, \ldots, x_{k,t}\} as content images and the second mini-batch \hat{X} = \{\hat{x}_{1,t}, \ldots, \hat{x}_{k,t}\} as style images. Together with these paired images, we can then generate a mini-batch of shape-texture conflict images \tilde{X} = \{\tilde{x}_{1,t}, \ldots, \tilde{x}_{k,t}\} based on X and \hat{X}, where

\tilde{x}_{i,t} = f(x_{i,t}, \hat{x}_{i,t}).    (1)
Here, f(·) is the style transfer operation. Specifically, we use
the real-time style transfer approach, AdaIN [69], pretrained
using microsoft common objects in context (MS-COCO) [70]
and WikiArt [71]. Note that none of these datasets overlap
with the datasets we use for continual learning (see Section
V-A for the details about our datasets). To examine the quality of images style-transferred by AdaIN on data that AdaIN has never been trained on, we provide additional visualization results in Appendix B (see Supplementary Material) and verify that the quality of those generated images is reasonable. Moreover,
we emphasize that AdaIN [69] is used as a proof-of-concept
for style transfers. There are many other style transfer meth-
ods available, such as [72], [73], and [74], which are more compute-efficient for real-world applications.
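The batch-level construction in (1) amounts to shuffling a mini-batch of style candidates and feeding the resulting content-style pairs through a frozen style-transfer network; a minimal sketch is given below, where adain_model is an assumed placeholder for any pretrained AdaIN-style transfer module.

```python
# Sketch of shape-texture conflict image generation following (1).
# `adain_model` is a placeholder for a frozen, pretrained style-transfer network (e.g., AdaIN).
import torch

@torch.no_grad()
def make_conflict_batch(content_images, style_images, adain_model, alpha=1.0):
    """content_images, style_images: tensors of shape (k, C, H, W), randomly paired."""
    # Shuffle the style batch so that content/style pairs are random
    # and may come from the same or different object classes.
    perm = torch.randperm(style_images.size(0))
    styles = style_images[perm]
    return adain_model(content_images, styles, alpha=alpha)

# Replay-free variant: the current batch serves as both content and style.
# conflict = make_conflict_batch(x_curr, x_curr, adain_model)
# Replay-based refinement (Section IV-B): buffered exemplars serve as style templates.
# conflict = make_conflict_batch(x_curr, x_buffer, adain_model)
```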
When training solely on the original natural images, the
model tends to acquire texture representations. Conversely,
training exclusively on shape-texture conflict images leads to
the learning of shape representations. To concurrently develop
both shape and texture representations, we utilize a consistency
regularization approach that involves both the original natural
images and the shape-texture conflict images. Specifically, let X = \{x_{1,t}, \ldots, x_{k,t}\} be a mini-batch of k images randomly sampled from the current training set D_t in the incremental step t, and \tilde{X} = \{\tilde{x}_{1,t}, \ldots, \tilde{x}_{k,t}\} be the corresponding mini-batch of shape-texture conflict images generated following (1). Next, we obtain a new mini-batch \{X, \tilde{X}\} by combining X and \tilde{X} for model training. Let Z = \{z_{1,t}, \ldots, z_{k,t}\} and \tilde{Z} = \{\tilde{z}_{1,t}, \ldots, \tilde{z}_{k,t}\} be the corresponding logits of X and \tilde{X} produced by the model Θ_t, respectively. Z is generated using original natural images as inputs and encodes texture information. Conversely, \tilde{Z} is generated using shape-texture conflict images as inputs and encodes shape information. To ensure that the model captures both texture and shape information, we encourage Z and \tilde{Z} to predict similar class distributions with a consistency regularization loss

L_{STCR} = D_{KL}(p \| q) + D_{KL}(q \| p)    (2)

where p = σ(Z/τ), q = σ(\tilde{Z}/τ), and τ is a temperature hyperparameter controlling the smoothness of the probability distributions. We empirically set it to 2. D_{KL} is the Kullback-Leibler divergence comparing the distance between two distributions, and σ(·) is the softmax function. Our consistency regu-
larization loss only requires a mini-batch of training images
from the current task and it can be computed in an online
manner.
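A minimal PyTorch rendering of (2) is sketched below, assuming logits_natural and logits_conflict hold the logits Z and \tilde{Z} produced by the same model for the natural and shape-texture conflict mini-batches; the function name is illustrative.

```python
# Sketch of the symmetric KL consistency loss in (2), with temperature tau = 2.
import torch.nn.functional as F

def stcr_consistency_loss(logits_natural, logits_conflict, tau=2.0):
    p = F.softmax(logits_natural / tau, dim=1)   # texture-biased distribution
    q = F.softmax(logits_conflict / tau, dim=1)  # shape-biased distribution
    log_p = F.log_softmax(logits_natural / tau, dim=1)
    log_q = F.log_softmax(logits_conflict / tau, dim=1)
    # KL(p || q) + KL(q || p), averaged over the mini-batch.
    kl_pq = F.kl_div(log_q, p, reduction="batchmean")
    kl_qp = F.kl_div(log_p, q, reduction="batchmean")
    return kl_pq + kl_qp
```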
B. Refinement of STCR for Replay-Based Continual Learning
Just like many existing generalization methods, L_{STCR} can
be integrated with any established continual learning baselines,
such as elastic weight consolidation (EWC) [8], exemplar
replay [17], or knowledge distillation [11]. Section IV-A
describes the generic STCR, where both style and content
images come from the current task’s training set Dt. However,
replay-based continual learning methods use memory buffers P to store exemplars. We can leverage these exemplars to
refine our STCR for replay-based methods, as detailed in this
section.
In replay-based continual learning methods, training images
from earlier tasks are often stored in the memory buffer P for replays in the current task. These stored images often retain additional information from earlier tasks. We capitalized on these exemplars P to enhance our STCR. Specifically, instead of using training images from the current task as style templates, we employed images from the memory buffer P as style templates and kept the training images from the current task as content to generate shape-texture conflict images. These exemplar sets P in the memory buffer encompass a broader
range of classes from previous tasks, enhancing the diversity
of the shape-texture conflict images. Simultaneously, these
conflict images carry styles from earlier tasks, serving as
a form of feature rehearsal to mitigate forgetting. In our
experiments, we also observed that applying STCR during
replays introduces noise from shape-texture conflict images,
which can detrimentally impact replay performances (refer to
Section VI-E).
The schematic of our STCR is illustrated in Fig. 4, and its
implementation is outlined in Algorithm S1 (see Supplemen-
tary Material). At every incremental step t, the overall loss
function of STCR to train a neural network is defined as

L_t = L_{CE}^{D_t, \tilde{D}_t} + β L_{CE}^{P} + γ L_{STCR}^{D_t, \tilde{D}_t}    (3)

where the cross-entropy loss L_{CE}^{D_t, \tilde{D}_t} is computed on the combination of the shape-texture conflict images \tilde{D}_t and the original natural images D_t from the current task t. For the replay-based continual learning methods, L_{CE}^{P} is computed only on the original natural images from the exemplar set P. The hyperparameters β = 1 and γ = 0.01 regulate the loss for exemplar replays and the STCR loss, respectively. We analyzed the effect of β and γ in Section VI-E.
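To make the use of (3) concrete, a hedged sketch of a single training step is given below; model, optimizer, the style-transfer callable, and the consistency-loss callable (e.g., the stcr_consistency_loss sketch above) are caller-supplied placeholders rather than names from our released code.

```python
# Sketch of one STCR training step implementing the overall loss in (3).
# All callables and tensors are placeholders supplied by the surrounding training loop.
import torch
import torch.nn.functional as F

def stcr_training_step(model, optimizer, style_transfer, consistency_loss,
                       x_curr, y_curr, x_style, x_replay=None, y_replay=None,
                       beta=1.0, gamma=0.01):
    # Shape-texture conflict images: current-task content with buffer (or current-task) styles.
    with torch.no_grad():
        x_conflict = style_transfer(x_curr, x_style)

    logits_curr = model(x_curr)          # texture-encoding logits Z
    logits_conflict = model(x_conflict)  # shape-encoding logits Z~

    # Cross-entropy on the combined natural + conflict mini-batch (labels are shared).
    loss_ce = F.cross_entropy(torch.cat([logits_curr, logits_conflict], dim=0),
                              torch.cat([y_curr, y_curr], dim=0))

    # Replay cross-entropy on original exemplars only (skipped for replay-free methods).
    loss_replay = F.cross_entropy(model(x_replay), y_replay) if x_replay is not None else 0.0

    # Shape-texture consistency regularization, Eq. (2).
    loss_stcr = consistency_loss(logits_curr, logits_conflict)

    loss = loss_ce + beta * loss_replay + gamma * loss_stcr
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```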
C. Implementation Details
The experiments are conducted based on the publicly available official code provided by Mittal et al. [15]. For ImageNet-100 and ImageNet-1000, we use an 18-layer ResNet with randomly initialized weights. The network is trained for 70 epochs in the initial base task with a base learning rate of 1e-1, and is trained for 40 epochs in the incremental tasks with a base learning rate of 1e-2. The base learning rate is divided by 10 at epochs {30, 60} in the initial base task, and is divided by 10 at epochs {25, 35} in the incremental steps. For CIFAR-100, we use a 32-layer ResNet with randomly initialized weights. The network is trained for 120 epochs in the initial base task with a base learning rate of 1e-1, and is trained for 60 epochs in the subsequent incremental tasks with a base learning rate of 1e-2. We adopt a cosine learning rate schedule, in which the learning rate decays until 1e-4. For all networks, the last layer is cosine normalized, as suggested in [18]. All networks are optimized using the SGD optimizer with a mini-batch size of 128, a momentum of 0.9, and a weight decay of 1e-4. Following [15] and [18], we use an adaptive weighting function for the knowledge distillation loss. At each incremental step t, λ_t = λ_base (Σ_{i=1}^{t} m_i / m_t)^{2/3}, where Σ_{i=1}^{t} m_i denotes the number of all seen classes from step 1 to step t, and m_t denotes the number of classes in step t. λ_base is set to 20 for CIFAR-100, 100 for ImageNet-100, and 600 for ImageNet-1000, as suggested by [15]. To ensure statistical significance,
we conducted three runs for all the experiments and reported
their means and standard deviations. All codes will be made
publicly available upon publication.
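Under our reading of the schedule above, the adaptive distillation weight can be written as a small helper; this is a sketch (λ_base scaled by the ratio of all seen classes to current-task classes, raised to the 2/3 power), not an excerpt from the released code.

```python
# Sketch of the adaptive distillation weight: lambda_t = lambda_base * (sum_i m_i / m_t) ** (2/3).
# This mirrors the schedule described above; the helper name is illustrative.
def adaptive_distillation_weight(classes_per_task, t, lambda_base):
    """classes_per_task: list with m_1, ..., m_t; t: 1-indexed incremental step."""
    seen_classes = sum(classes_per_task[:t])   # number of all classes seen up to step t
    current_classes = classes_per_task[t - 1]  # m_t, classes introduced at step t
    return lambda_base * (seen_classes / current_classes) ** (2.0 / 3.0)

# Example: ImageNet-100 with a 50-class base task and five 10-class increments (lambda_base = 100).
lambdas = [adaptive_distillation_weight([50] + [10] * 5, t, 100) for t in range(2, 7)]
```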
V. EXPERIMENTAL SETUP
A. Datasets
Same as previous continual learning works [18],[19],
[75],[76], we perform class-incremental experiments on three
standard image datasets CIFAR-100 [77], ImageNet-100, and
ImageNet-1000 [59] and follow the same training and test
data splits. CIFAR-100 contains 60,000 color images of size 32×32 from 100 classes, of which 50,000 images are for training and the remaining 10,000 are for testing. ImageNet-1000 contains around 1.3 million color images of size 224×224 from 1000 classes. For each class, there are 1300 images for training and 50 images for validation. ImageNet-100 is a subset of ImageNet-1000 with 100 classes that are randomly sampled using the NumPy seed of 1993.
Since there is a lack of out-of-distribution test sets in
the existing continual learning literature, we use the datasets
ImageNet-1000-C [57], CIFAR-100-C [57], and ImageNet-
1000-R [78] for evaluating out-of-distribution generalization.
ImageNet-1000-C and CIFAR-100-C consist of 15 diverse
corruption types applied to validation images of ImageNet-
1000 and CIFAR-100, respectively, with five different severity
levels for each corruption type. ImageNet-1000-R is a real-
world distribution shift dataset that includes changes in image
styles and geographic locations. We curated ImageNet-100-C
and ImageNet-100-R from ImageNet-1000-C and ImageNet-
1000-R also by random sampling with seed 1993.
B. Protocols
We follow the standard protocols used in [14],[15],[18],
[19],[75], and [76] for benchmarking class-incremental learn-
ing methods in continual learning. The protocol consists of an
initial base task T_1 followed by T incremental tasks, where T
is set to 6 or 11. For all datasets, we assign half of the classes
to the initial base task and distribute the remaining classes
equally over the incremental steps.
C. Metrics
We use the following metrics to measure the continual
learning and generalization performance. Let A_{j,i} denote the accuracy on task i after training the continual learning model Θ_j on task j.
Average incremental accuracy (Acc.) [17] is computed
by averaging the accuracies of all the models {Θ_1, ..., Θ_T} obtained in all the incremental steps. Each model's accuracy is evaluated on all classes seen thus far. The mathematical formulation is listed below

Acc. = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{1}{t} \sum_{i=1}^{t} A_{t,i} \right).    (4)
Acc. is a common evaluation metric in the continual learn-
ing literature [14],[15],[17],[79], measuring the overall
model performance on the in-distribution data over all the
trained tasks. The higher the Acc., the better.
Average accuracy A_T computes the mean accuracy over all the test sets of all tasks for the final model Θ_T trained after the last task T in a continual learning protocol. We denote the average accuracy as A_T

A_T = \frac{1}{T} \sum_{t=1}^{T} A_{T,t}.    (5)
Forgetting (F) [19] measures the performance drop of Θ_T on the initial task T_1. It is the gap between the accuracy of Θ_1 and the accuracy of Θ_T on the same in-distribution data of T_1. Formally, F is defined as

F = A_{1,1} - A_{T,1}.    (6)
In the continual learning literature, F reflects how much the model has forgotten about T_1. The smaller F, the better.
1) Generalization (R-C and R-R): Here, we introduce two
evaluation metrics to assess the out-of-distribution general-
ization capabilities of a model. This includes the robustness
against data corruption and the domain shifts. Consistent
with [57], we assess the robustness against data corruptions
of the ultimate model Θ_T at the last task T (R-C) by calculating its mean corruption error across all the classes of ImageNet-100-C, ImageNet-1000-C, and CIFAR-100-C, respectively. The R-C of Θ_T can then be defined as its own mean corruption error normalized by the mean corruption error of AlexNet [80] (mCE_{AlexNet})

R-C = mCE_{model} / mCE_{AlexNet}.    (7)
Note that AlexNet is trained on the entire dataset in an offline setting without class-incremental learning. B-Aug in
Section V-F is applied during the training of AlexNet.
Following the method described in [78], we also evaluate
the generalization of the final model Θ_T against domain shifts
(R-R) with accuracy measurements on all the classes of
ImageNet-100-R and ImageNet-1000-R.
2) Change in Forgetting and Robustness (ΔF and ΔR-C): To quantitatively assess the impact of generalization methods on reducing forgetting across a sequence of tasks, we introduce two evaluation metrics, ΔF and ΔR-C. Let F denote the forgetting score for the final model Θ_T trained using a continual learning method, with and without B-Aug as specified in Section V-F. These forgetting scores are represented as F_{wB,Θ_T} and F_{woB,Θ_T}, respectively. The difference ΔF can be defined as

ΔF = F_{woB,Θ_T} - F_{wB,Θ_T}.    (8)

Similarly, we can compute the difference in R-C, denoted ΔR-C, between the continual learning baseline with and without the B-Aug generalization algorithm

ΔR-C = R-C_{woB,Θ_T} - R-C_{wB,Θ_T}.    (9)
3) Backward Transfer: Backward transfer (BWT) [81],
[82] measures the influence of learning a new task on the
performance of previously learned tasks. A positive BWT
indicates that learning a new task improves performance on
earlier tasks, while a negative BWT suggests a performance
decline due to catastrophic forgetting.
To compute BWT, we evaluate the model’s accuracy on all
previously learned tasks before and after training on a new
task. Formally, BWT is defined as
BWT = \frac{\sum_{i=2}^{T} \sum_{j=1}^{i-1} (A_{i,j} - A_{j,j})}{T(T-1)/2}    (10)

where A_{j,i} denotes the accuracy on task i after training on the j-th task.
Without loss of generality, we reported the results of average
accuracy A_T and BWT in Appendix A (see Supplementary
Material). The conclusions are consistent with those obtained
from other evaluation metrics.
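All four accuracy-matrix-based metrics above reduce to a few lines of array arithmetic; the sketch below assumes a lower-triangular matrix A in which A[j, i] stores the accuracy on task i after training on task j (0-indexed), with the forgetting entry following (6) as reconstructed here, and the function names are illustrative.

```python
# Sketch of the accuracy-matrix metrics: Acc. (4), A_T (5), F (6), and BWT (10).
# A[j, i] is the accuracy on task i after training on task j (0-indexed, i <= j).
import numpy as np

def incremental_accuracy(A):                 # Acc., Eq. (4)
    T = A.shape[0]
    return np.mean([A[t, :t + 1].mean() for t in range(T)])

def average_accuracy(A):                     # A_T, Eq. (5)
    return A[-1, :].mean()

def forgetting(A):                           # F, Eq. (6): accuracy drop on the first task
    return A[0, 0] - A[-1, 0]

def backward_transfer(A):                    # BWT, Eq. (10)
    T = A.shape[0]
    diffs = [A[i, j] - A[j, j] for i in range(1, T) for j in range(i)]
    return sum(diffs) / (T * (T - 1) / 2)

# Example with T = 3 tasks (hypothetical accuracies, in %).
A = np.array([[80.0,  0.0,  0.0],
              [70.0, 75.0,  0.0],
              [65.0, 68.0, 72.0]])
print(incremental_accuracy(A), average_accuracy(A), forgetting(A), backward_transfer(A))
```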
D. Generic Continual Learning Frameworks
First, to demonstrate that our STCR can be integrated with
any generic continual learning frameworks, we provide an
overview of major frameworks used in continual learning (see
Section VI-A) and combine our STCR with these frameworks.
In Section III-B, we introduced Naive and Replay. We list the
other major frameworks below.
1) Weight Regularization-Based (WReg): WReg-based
methods are one category of continual learning methods. EWC
[8] is one of these methods. We implement EWC, which
leverages the Fisher Information matrix to identify important
parameters and penalizes further changes of these parameters
in the later tasks.
2) Knowledge Distillation-Based (KD): KD methods are
special cases of weight-regularization methods. They turn out
to be very effective, especially when they are combined with a
replay-based method. We implement the knowledge distillation
loss from [11], which saves a snapshot of the model in the
previous tasks and distills knowledge from the old model to
the new model based on the training images at the current
task.
E. Existing Continual Learning Baselines
Next, to demonstrate that our STCR can further enhance
the state-of-the-art (SOTA) continual learning algorithms, we
integrate our STCR with three SOTA continual learning
methods, each of which could incorporate a combination of
multiple continual learning frameworks introduced above. We
implement these approaches using their publicly available
code.
Continual class-incremental learning (CCIL) [15] employs
a hybrid approach combining replay and regularization strate-
gies.
PODNet [14] utilizes an efficient distillation loss across mul-
tiple spatial dimensions of feature maps after various pooling
operations.
Adaptive feature consolidation (AFC) [79] estimates the
relationship between the representation changes and the result-
ing loss increases incurred by model updates. The greater the
loss increase on the old representations, the more important
the weights are to the old task. Based on the importance of
the weights, the model restricts updates of important features
while allowing changes in less critical features, thereby pre-
venting forgetting.
F. Generalization Baselines
We compare our STCR against the following generalization
approaches summarized below.
LB does not use any techniques to encourage generalization.
B-Aug follows [49] and applies simple and naturalistic augmentations, including color distortion, noise, and blur. Specifically, we employ color jitter with an 80% probability, color drops with a 20% probability, Gaussian noise (mean = 0 and std = 0.025) with a 50% probability, and Gaussian blur (kernel size is 10% of the image width/height) with a 50% probability. A sketch of this pipeline is given after the baseline descriptions below.
Auto-augmentation (A-Aug) [47] represents an automated
technique for identifying data augmentation strategies from
the dataset, frequently resulting in improved generalization
compared to B-Aug techniques.
Style augmentation (S-Aug) [83] adopts [48] to create
shape-texture conflict images using artistic painting styles and
trains models on both shape-texture conflict images and orig-
inal natural images. In contrast, our STCR method eliminates
the need for external memory to store artistic-style images,
and we introduce a regularization loss between original and
shape-texture conflict images.
Mixup loss (mixup) initially generates shape-texture conflict images by following [45] and then applies a mixup loss to these conflict images. The mixup hyperparameter determining the relative importance of shape and texture is set to 0.5, consistent with [1].
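For reference, the B-Aug pipeline described above can be sketched with torchvision transforms; the application probabilities, the noise std of 0.025, and the relative blur kernel size follow the description, while the jitter strengths are assumed illustrative values.

```python
# Sketch of the B-Aug pipeline: color jitter (p=0.8), color drop (p=0.2),
# Gaussian noise (std=0.025, p=0.5), and Gaussian blur (kernel ~10% of image size, p=0.5).
# Jitter strengths are assumed values; the probabilities follow the description above.
import torch
from torchvision import transforms

class RandomGaussianNoise:
    def __init__(self, std=0.025, p=0.5):
        self.std, self.p = std, p

    def __call__(self, img):  # expects a tensor in [0, 1]
        if torch.rand(1).item() < self.p:
            img = (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)
        return img

def build_b_aug(image_size=224):
    kernel = max(3, int(0.1 * image_size) // 2 * 2 + 1)  # ~10% of width/height, forced odd
    return transforms.Compose([
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),                       # color drop
        transforms.ToTensor(),
        RandomGaussianNoise(std=0.025, p=0.5),
        transforms.RandomApply([transforms.GaussianBlur(kernel)], p=0.5),
    ])
```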
It is important to note that our experimentation involves
four generic continual learning frameworks, seven continual
learning baselines, and five generalization methods. Ideally,
applying each generalization method to every continual learn-
ing baseline or framework would result in a total of (4 + 7) × 5 = 55 combinations. However, due to limitations in comput-
ing resources, we pragmatically select a subset of continual
learning methods and generalization methods for controlled
experiments. In these experiments, we either vary the continual
learning methods or vary the generalization methods, but not
both simultaneously. Refer to Section VI for specifications of
each experiment.
VI. EXPERIMENTS AND RESULTS
A. STCR Can Be Seamlessly Integrated With Any Generic
Continual Learning Frameworks
Given the mutually beneficial relationship between gener-
alization and reduced forgetting discussed in Section III-C,
we introduced STCR as a generalization method specifically
designed for continual learning (see Section IV). Here, we
demonstrate that our STCR can be seamlessly integrated with
any generic continual learning framework. Specifically, we
selected the three most commonly used continual learning
frameworks as well as one naive framework (see Section III-B)
and then applied STCR to them. These integrated frameworks
are labeled as “the name of the framework +STCR.” We
evaluated “framework +STCR” on ImageNet-100 with two
task sequence lengths of T=6 and T=11 (see Section III).
For comparisons, we also evaluate these continual learning
frameworks without our STCR in the same experiment set-
tings. We report their generalization and continual learning
performances in Table I.
From Table I, across both task sequence lengths T=6 and
T=11, we observed that all the continual learning frameworks
integrated with our STCR outperform their counterparts with-
out STCR by a large margin, in terms of both generalization
TABLE I
GENERALIZATION AND CONTINUAL LEARNING PERFORMANCE OF THE GENERIC CONTINUAL LEARNING FRAMEWORKS WITH AND WITHOUT OUR STCR. We reported the performance of four generic continual learning frameworks with (highlighted in gray background) and without our STCR on ImageNet-100 with T=6 (first column) and T=11 (second column) in terms of their generalization ability against data corruption (R-C) and continual learning ability (Acc. and F). See Sections III-B and V-D for the introduction of these generic continual learning frameworks. See Section V-C for the introduction of the evaluation metrics. The best results are in bold. The numbers in brackets indicate the standard deviation after three runs.
(R-C) and continual learning performances (Acc. and F).
For instance, when T=6, WReg +STCR beats WReg
alone by 6.28% in R-C, 3.49% in Acc. and 11.91% in F.
This implies that our STCR, when integrated with continual
learning frameworks, is effective in enhancing generalization
and reducing forgetting.
Notably, among the four continual learning frameworks,
Naive, WReg, and KD are replay-free frameworks and they
do not require exemplar sets P. Incorporating the generic
STCR introduced in Section IV-A into these frameworks also
significantly reduces forgetting and enhances generalization.
In particular, the naive continual learning framework, which
lacks measures to prevent forgetting, often has chance-level
continual learning performance. With STCR, we observed
a dramatic boost in continual learning and generalization
performance across all evaluation metrics (compare Naive + STCR versus Naive).
Moreover, we also found that the positive effect of STCR becomes more prominent when it is integrated with those frameworks that are better at continual learning. For example, the existing works have shown that KD and Replay are the most effective strategies in reducing catastrophic forgetting, which most of the SOTA continual learning methods adopt. Indeed, compared with Naive and WReg, this is also demonstrated by the higher Acc. and lower F of KD and Replay as shown in Table I. With our STCR, the generalization performance of Replay + STCR and KD + STCR gets boosted by 11.73% and 10.77% compared to their counterparts, respec-
tively. This is in high contrast with a relative improvement of
7.54% and 6.28% in R-C for Naive +STCR and WReg +STCR.
TABLE II
GENERALIZATION AND CONTINUAL LEARNING PERFORMANCE OF THE SOTA CONTINUAL LEARNING METHODS WITH AND WITHOUT OUR STCR. WE REPORTED THE PERFORMANCE OF THREE SOTA CONTINUAL LEARNING METHODS WITH (HIGHLIGHTED IN GRAY BACKGROUND) AND WITHOUT OUR STCR ON TWO LEARNING PARADIGMS (T=6, THE FIRST SIX ROWS) AND (T=11, THE LAST SIX ROWS) ACROSS ALL THREE DATASETS (THREE COLUMNS) IN TERMS OF THEIR GENERALIZATION ABILITY (R-C AND R-R) AND CONTINUAL LEARNING ABILITY (ACC. AND F). SEE SECTION V-E FOR THE INTRODUCTION OF THESE SOTA METHODS. SEE SECTION V-C FOR THE INTRODUCTION OF THE EVALUATION METRICS. THE BEST RESULTS ARE IN BOLD. THE NUMBERS IN BRACKETS INDICATE THE STANDARD DEVIATION AFTER THREE RUNS
B. SOTA Continual Learning Methods With STCR Beat
Their Standalone Configurations
In Section VI-A, we demonstrated the feasibility of integrat-
ing our STCR with any generic continual learning framework.
Here, we specifically assess the impact of STCR on SOTA continual learning methods. While these SOTA methods may align with one or more of the continual learning frameworks discussed earlier, they also have distinct continual learning strategies. We highlighted these differences from the generic continual learning frameworks in Section V-E.
In this experiment, we incorporate STCR into three SOTA
continual learning methods (see Section V-E) and assess their
generalization and continual learning performances across
three image datasets, given two task sequence lengths of T=6
and T=11. The versions of the SOTA methods integrated
with our STCR are labeled as “SOTA +STCR.” To facilitate
comparisons, we also include the outcomes of their standalone
versions without STCR. We present our results in Table II.
SOTA +STCR consistently outperforms the SOTA methods alone in terms of generalization and reduced forgetting. For instance,
when STCR is applied to AFC on ImageNet-100 with T=6,
AFC +STCR increases Acc. from 76.46% to 79.24% and
decreases F from 9.87% to 7.43%, dramatically improving
the continual learning performance. Meanwhile, AFC +STCR
reduces R-C by 15.67% and improves R-R by 8.83%, demonstrating its effectiveness in enhancing the generalization ability
of these SOTA continual learning methods.
Fig. 5. Loss landscape analysis on existing continual learning methods with and without our STCR on the ImageNet-100 dataset. We perform the loss landscape analysis using the final model θ6 at task T6 with perturbed weights in random directions on the test sets of tasks T1 (left) and T6 (right) of the ImageNet-100 dataset. These final models are trained with competitive continual learning baselines with and without our STCR. See Section V-E for the introduction to the continual learning baselines.
C. Continual Learning Methods With STCR Have Wider
Loss Landscapes Across Tasks After Weight Perturbations
In Section VI-B, SOTA continual learning methods with
STCR demonstrated remarkable generalization and continual
learning abilities across all four evaluation metrics (see Section
V-C). Here, we investigate why STCR, as a generalization
method, leads to less forgetting for SOTA continual learning
methods. We applied the same experimental protocol and conducted the loss landscape analysis introduced in Section III-D.
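As a reference for how such a probe can be set up, the following is a hedged sketch (PyTorch) of a 1-D loss-landscape scan: the final model's weights are perturbed along a random, per-parameter-scaled direction, loosely in the spirit of the filter normalization of Li et al. [67], and the test loss is recorded at each perturbation magnitude. The direction sampling and scaling used in the paper may differ.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_along_random_direction(model, test_loader, alphas, device="cpu"):
    base = copy.deepcopy(model).to(device).eval()

    # One random direction, rescaled per parameter tensor so that its norm
    # matches that of the corresponding weights.
    direction = []
    for p in base.parameters():
        d = torch.randn_like(p)
        direction.append(d * p.norm() / (d.norm() + 1e-12))

    losses = []
    for alpha in alphas:                      # e.g., torch.linspace(-1, 1, 21)
        probe = copy.deepcopy(base)
        for p, d in zip(probe.parameters(), direction):
            p.add_(alpha * d)                 # perturbed weights: theta + alpha * d
        total, count = 0.0, 0
        for x, y in test_loader:
            x, y = x.to(device), y.to(device)
            total += F.cross_entropy(probe(x), y, reduction="sum").item()
            count += y.numel()
        losses.append(total / count)          # average test loss at this alpha
    return losses
```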
In Fig. 5, we present the loss landscapes of the models
trained using SOTA continual learning methods with and
without STCR. Consistent with observations in Section III-D,
we found that SOTA methods with STCR have wider loss landscapes compared to their counterparts (dotted versus solid lines in the same color in both test sets of T1 and T6). This suggests that models with STCR are more robust to weight perturbations, avoiding spurious features and enhancing the learning of generic features shared across tasks, thereby reducing catastrophic forgetting.
Fig. 6. Continual learning baseline with our STCR outperforms its counterpart with existing generalization methods. We report the generalization (R-C) and continual learning performances (Acc. and F) of the continual learning method CCIL [15] integrated with our STCR, as well as its counterpart integrated with established generalization baselines, on ImageNet-100 with T=6. See Section V-F for the introduction of each generalization baseline. See Section V-C for the introduction of the evaluation metrics. The best performance, achieved by our STCR, is shown in red.
D. Continual Learning Methods With STCR Outperform
Their Counterparts Combined With Existing Generalization
Methods
We compare our STCR and the existing generalization
baselines (see Section V-F) in a continual learning setting on
ImageNet-100 with T=6. To control for the effect of the continual learning baseline, we use CCIL [15] (see Section V-E) as the continual learning backbone, apply all the generalization baselines to CCIL in every task, and evaluate their performances in R-C, Acc., and F (see Section V-C). As controls, we also introduce the lower bound (LB), which is CCIL alone without any generalization methods.
We report the results in Fig. 6. As expected, the LB without
any generalization methods performs the worst in terms of all
the evaluation metrics. All the generalization baselines perform
slightly better than the LB; however, their performance is still
inferior to our STCR. This implies that our STCR is more effective in boosting the generalization ability within a task, thereby enhancing the continual learning performance
with reduced forgetting.
It is also worth noting that S-Aug and mixup share simi-
larities with our STCR as both methods involve shape-texture
conflict images. Even with the extra storage of artistic style
templates, S-Aug still underperforms our STCR. This indicates
that the data augmentation techniques alone are not enough.
Additional regularization losses are necessary for the continual
learning setting.
Moreover, in contrast to the mixup loss, which balances the learned shape and texture biases, our STCR still achieves
much better performance. This suggests that our proposed
consistency regularization loss is more eective than mixup
for generalization in the continual learning setting.
E. Network Analysis Reveals Our Key Design Decisions
To evaluate the contribution of each component in our STCR
method, we repeat the experiments on ImageNet-100 using the
ablated versions of our method. As control experiments, we
fix the continual learning baseline CCIL [15] (see Section V-E) and integrate it with our ablated versions. In these
experiments, the number of replay exemplars in CCIL is set to
2000 by default, unless otherwise specified. We also introduce
additional application scenarios and demonstrate the benefits
that our STCR can bring in these scenarios.
1) Effect of Coefficient γ: We analyze the effect of γ, which controls the strength of L_STCR in (3), by varying its value over γ ∈ {0, 0.001, 0.01, 0.1}. Note that L_STCR is not in effect when γ is set to 0. The results (COEFFICIENT γ) in Table III demonstrate that incorporating STCR with a small γ leads to improved performance compared to the case of γ = 0 in all evaluation metrics. This suggests that STCR is essential for enhancing the continual learning performance. We also observed that higher values of γ lead to less forgetting but may deteriorate the generalization ability.
It is important to note that the impact of γ remains consistent across different learning protocols and datasets. For all protocols and datasets, we fixed the default value of γ to 0.01.
2) Effect of Coefficient β: We analyze the effect of the coefficient β, which controls the strength of the term βL_CE^P in (3), by varying its value over β ∈ {0.1, 1, 10}. We reported the results in Table III (COEFFICIENT β). From the results, we observed that β of 1 yields the best performance. Either a small β of 0.1 or a large β of 10 leads to decreased performance. Intuitively, a higher β emphasizes the effect of naive replays of the original exemplars in the memory buffer, leading to weaker representation alignment on shape-texture conflict images and, hence, poorer generalization performance. A lower β impairs the effect of exemplar rehearsals during replays, leading to more catastrophic forgetting.
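A hedged sketch of the overall training objective suggested by this discussion of (3): a cross-entropy loss on the current task, a replay cross-entropy term on the exemplar set P weighted by β, and the STCR consistency term weighted by γ. The names and exact composition are assumptions for illustration; `stcr_consistency` stands for a precomputed consistency term such as the one sketched in Section VI-A.

```python
import torch.nn.functional as F

def stcr_objective(model, current_batch, exemplar_batch, stcr_consistency,
                   gamma=0.01, beta=1.0):
    x, y = current_batch          # images/labels of the current task
    x_p, y_p = exemplar_batch     # replayed exemplars from the memory buffer P
    loss_ce = F.cross_entropy(model(x), y)        # current-task term
    loss_ce_p = F.cross_entropy(model(x_p), y_p)  # replay term L_CE^P
    return loss_ce + beta * loss_ce_p + gamma * stcr_consistency
```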
3) Hyper-Parameter Temperature τ:The hyper-parameter
temperature in (2) controls the smoothness of a probabil-
ity distribution. Typically, the higher the temperature, the
smoother the distributions, leading to a lower KL divergence
and a softer alignment between the distributions. To study the effect of temperature, we conducted an extra network analysis by varying the temperature over 1, 2, and 3. We reported the results of these models with different temperatures in Table III (TEMPERATURE). A temperature of 2 leads
to optimal performance, while the performance decreases
with temperatures of 1 and 3.
TABLE III
GENERALIZATION AND CONTINUAL LEARNING PERFORMANCE OF OUR STCR VARIATIONS IN DIVERSE APPLICATION SCENARIOS. WE REPORTED THE GENERALIZATION ABILITY (R-C AND R-R) AND CONTINUAL LEARNING ABILITY (ACC. AND F) OF OUR STCR VARIATIONS IN MULTIPLE SCENARIOS ON IMAGENET-100. SEE SECTION VI-E FOR THE INTRODUCTION OF EACH APPLICATION SCENARIO. SEE SECTION V-C FOR THE INTRODUCTION OF THE EVALUATION METRICS. THE ROWS HIGHLIGHTED IN GRAY INDICATE OUR DEFAULT METHOD DESIGN. THE BEST RESULTS ARE IN BOLD. THE NUMBERS IN BRACKETS INDICATE THE STANDARD DEVIATION AFTER THREE RUNS
These results suggest that the selection of temperature is essential for learning generic features of shape-texture conflict images. Empirically, we set the temperature to 2, consistent with [28].
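A toy computation (with made-up logits) illustrating the smoothing effect described above: as τ grows, both distributions flatten and the KL divergence between the original and shape-texture conflict predictions shrinks, giving a softer alignment.

```python
import torch
import torch.nn.functional as F

logits_original = torch.tensor([[4.0, 1.0, 0.5]])   # hypothetical original image
logits_conflict = torch.tensor([[2.5, 2.0, 0.5]])   # hypothetical stylized view

for tau in (1.0, 2.0, 3.0):
    log_p = F.log_softmax(logits_conflict / tau, dim=1)
    q = F.softmax(logits_original / tau, dim=1)
    kl = F.kl_div(log_p, q, reduction="batchmean")
    print(f"tau={tau}: KL={kl.item():.4f}")   # KL decreases as tau increases
```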
4) Exemplar Set Size: Following [15] and [17], we fixed the memory buffer size to 20 exemplars per object class in all the previous experiments. Here, we added an extra ablation study in which we varied the memory buffer size over 10, 20, and 30 exemplars per class on CCIL +STCR and studied their impact. From the results in Table III (EXEMPLAR SIZE), we observed that the generalization and continual learning performances increased from 10 to 20. This implies that larger buffer sizes increase the diversity of exemplars during replays, hence enhancing the performance of CCIL +STCR. We also noticed that the performance saturates from 20 to 30. This implies that an overly large memory buffer does not necessarily lead to further gains.
5) STCR in Replay: We investigate the impact of STCR in
replays and present our findings in (REPLAY) in Table III.
As emphasized in Section IV, performing style transfers on
the replay images and training on such synthesized shape-
texture conflict images during replays hinder the performance
of continual learning. The synthesized images may frequently exhibit low quality, potentially due to the limited number of exemplars in the replay buffers. Rehearsing with these images could disrupt the retention of knowledge from earlier tasks.
6) Styles From Exemplar Sets Versus Current Training Set:
For replay-based continual learning methods, we emphasized
the importance of using style templates from exemplar sets in
the memory buffer to generate shape-texture conflict images. Here, we report the performance when the style templates come from the current training set instead. The results in (STYLES FROM) in Table III show that using styles from exemplar sets outperforms using styles from the current training set by around 3% across the four evaluation metrics. Two possible reasons are:
first, exemplar sets encompass a broader range of classes from
previous tasks, enhancing the diversity of the shape-texture
conflict images. Second, these conflict images carry styles
from earlier tasks, serving as a form of feature rehearsal to
mitigate forgetting.
7) Color Transfer: In our work, we used the style transfer method AdaIN [69] as a proof-of-concept for generating shape-texture conflict images. However, running AdaIN can be slow and compute-intensive in real-world practice. To verify that our STCR can be adapted to other types of style or color transfer methods, we designed a model variation by replacing AdaIN with the color transfer method of [85]. Color transfer is a
process in digital image manipulation where the color palette
of one image is applied to another, aiming to preserve the
original content while changing its esthetic and mood. From
Table III (COLOR TRANSFER), we observed that the results
of color transfers perform slightly worse than the ones with
style transfers (AdaIN). This implies that color transfer, as
a data augmentation method, is not as effective as S-Aug at
capturing generic image representations. Interestingly, we also
noted that the results of color transfer are better than the ones
of CCIL alone across all four evaluation metrics. This suggests
that a generic data augmentation technique could improve the
model generalization ability; hence, leading to less forgetting
in continual learning.
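For reference, a minimal sketch of color transfer in the spirit of Reinhard et al. [85]: the per-channel mean and standard deviation of the content image are matched to those of the style image in a perceptual color space. The original work operates in the lαβ space; the CIELAB space used here is a common, close substitute, and this sketch is not the paper's exact implementation.

```python
import numpy as np
from skimage import color

def reinhard_color_transfer(content_rgb, style_rgb):
    # Both inputs: float RGB arrays in [0, 1] of shape (H, W, 3).
    c = color.rgb2lab(content_rgb)
    s = color.rgb2lab(style_rgb)
    c_mean, c_std = c.mean(axis=(0, 1)), c.std(axis=(0, 1)) + 1e-6
    s_mean, s_std = s.mean(axis=(0, 1)), s.std(axis=(0, 1))
    out = (c - c_mean) / c_std * s_std + s_mean   # match channel statistics
    return np.clip(color.lab2rgb(out), 0.0, 1.0)
```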
8) Longer Task Sequences: We conducted additional exper-
iments on ImageNet-100 to evaluate the effectiveness of our STCR with longer task sequences of T=26 and T=51. We evaluated CCIL [15] (see Section V-E) with and without our STCR
in these two experiments. We presented the results in (TASK
LENGTH) of Table III. The results demonstrate that STCR
can consistently improve both the generalization and continual
learning performance of CCIL even over a much longer task
sequence. For example, CCIL with STCR beats CCIL alone by 15.7% and 6.88% in R-C and R-R, respectively, and by 3.12% in Acc. and F over T=51.
9) Connecting to Vision Transformers: We assessed the effectiveness of our STCR on continual learning methods involving vision transformer architectures [86]. Specifically, we applied our STCR to Yu et al. [84] (ViTIL) for training
vision transformers in continual learning settings. We followed
their implementation details and reported the results of their
method with and without our STCR in (MODEL) in Table III.
Consistent with the results obtained from the convolutional
neural network (CNN)-based CL approaches such as CCIL
[15] (see (MODEL) in Table III), we observed improved
performances of ViTIL in both generalization and continual
learning evaluation metrics. Interestingly, we also noted that
the performance boost in vision transformers is relatively lower than in CNNs. This may be attributed to the fact that vision
transformers have already learned better shape and texture
representations than CNNs [87].
10) Combining STCR With Mixup: We combined STCR
with mixup using the losses proposed in (3). To achieve this,
we followed the standard practice of the mixup operation [45] by using λ = 0.5 to blend a pair of the original training
images in the current task. We then applied style transfers
to these mixed-up images and used the proposed consistency
regularization loss in (2) on their corresponding logits. Note
that the logits of the mixed-up images are computed by taking
the weighted sum of their logits given their class labels. We
reported the result of this ablation study (MIXUP) in Table III.
MixUp +STCR significantly underperforms STCR alone by
35.1%, 19.67%, 21.81%, and 4.15% in R-C, R-R, Acc, and
F, respectively. This suggests that the mixed images may
interfere with the model’s ability to maintain consistent shape
and texture representations, which are critical for mitigating
forgetting and enhancing generalization.
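For clarity, a hedged sketch of the MixUp +STCR ablation as we read the description above: a pair of training images is blended with λ = 0.5 [45], the blend is stylized, and the consistency term aligns its logits with the λ-weighted combination of the two source images' logits. The helper `stylize` and the exact target construction are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mixup_stcr_loss(model, images, labels_onehot, stylize, lam=0.5, tau=2.0):
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]

    # Target logits: lam-weighted sum of the two source images' logits.
    with torch.no_grad():
        target_logits = lam * model(images) + (1.0 - lam) * model(images[perm])

    logits_mixed = model(mixed)
    logits_conflict = model(stylize(mixed))        # shape-texture conflict view

    ce = -(mixed_labels * F.log_softmax(logits_mixed, dim=1)).sum(dim=1).mean()
    log_p = F.log_softmax(logits_conflict / tau, dim=1)
    q = F.softmax(target_logits / tau, dim=1)
    consistency = F.kl_div(log_p, q, reduction="batchmean")
    return ce, consistency
```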
11) Evaluation in a Protocol of Ten Tasks With Every Ten
Classes Per Task: We also tested our model’s performance
using a different continual learning protocol, which prescribes
a uniform number of classes per task. Specifically, we evalu-
ated the competitive continual learning baselines CCIL and
AFC, as well as their counterparts with our STCR (CCIL
+STCR and AFC +STCR) on the ImageNet-100 dataset,
where each task has 10 classes and there are a total of ten
tasks. The results can be seen in Table III (TEN TASKS, TEN
CLASSES). We observed that all models performed worse
than those in the previous protocol, where the initial task
contained 50 classes and each subsequent task had ten classes.
This could be attributed to the increased difficulty of the
protocol. Moreover, we found that CCIL +STCR outperforms
CCIL alone in this new protocol, suggesting that our STCR
consistently enhances the performance of existing continual
learning methods by improving generalization and reducing
forgetting. Interestingly, we also compared the relative performance differences between AFC +STCR and CCIL +STCR in the current protocol with those in the previous protocol and found that the relative performance differences remain
consistent. This indicates that our conclusions are independent
of the protocol variation, as our STCR consistently benefits
continual learning baselines across all evaluation metrics.
VII. CONCLUSION AND FUTURE WORKS
In AI, the existing literature on generalization and continual
learning has evolved independently. In our eort to bridge
these fields, we presented empirical evidence showcasing
their mutually beneficial relationship: effective generalization
within a task facilitates quicker learning and improved per-
formance in subsequent tasks within the continual learning
framework. On the flip side, continual learning methods
are designed to combat catastrophic forgetting, ensuring the
preservation of knowledge from earlier tasks, and ultimately
contributing to enhanced generalization for ongoing tasks.
Building upon this insight, we introduced STCR, a simple
yet effective regularization technique that learns both shape and texture representations for each task in continual learning. This approach, integrated with any continual learning method, not only enhances generalization but also mitigates forgetting.
Our extensive experiments highlight that the existing continual
learning methods, when seamlessly integrated with STCR, not only surpass their standalone performance but also outperform their
counterparts integrated with any other existing generalization
methods.
Despite the promising performance of our method, we have
identified several limitations that need to be addressed in future
work. First, as a proof-of-concept, our STCR uses a compute-
heavy style transfer technique. For real-world applications,
alternative style transfer methods that can perform few-shot
fine-tuning and real-time inference should be explored.
Second, we have only examined the interplay of generaliza-
tion and continual learning in fully supervised settings. The
dynamics of this interplay in semi-supervised and unsuper-
vised settings remain to be explored.
REFERENCES
[1] Y. Li et al., “Shape-texture debiased neural network training,” in Proc.
ICLR, 2021.
[2] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Laksh-
minarayanan, “AugMix: A simple data processing method to improve
robustness and uncertainty,” 2019, arXiv:1912.02781.
[3] M. A. Islam et al., “Shape or texture: Understanding discriminative
features in CNNs,” 2021, arXiv:2101.11604.
[4] A. Doerig et al., “The neuroconnectionist research programme,” Nature
Rev. Neurosci., vol. 24, no. 7, pp. 431–450, Jul. 2023.
[5] K. Thakral, S. Mittal, U. Uppal, B. Giddwani, M. Vatsa, and R. Singh.
(2023). Self-supervised Continual Learning. [Online]. Available: https://
openreview.net/forum?id=udl9OobOxZu
[6] S. Ho, M. Liu, L. Du, Y. Li, L. Gao, and S. Gao, “Semi-supervised
continual learning with meta self-training,” in Proc. 31st ACM Int. Conf.
Inf. Knowl. Manage., Oct. 2022, pp. 4024–4028.
[7] K. Javed and M. White, “Meta-learning representations for
continual learning,” in Proc. Adv. Neural Inf. Process. Syst.,
vol. 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., Jan. 2019, pp. 1818–1828. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/f4dd765c12f2ef67f98f3558c282a9cd-Paper.pdf
[8] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural
networks,” Proc. Nat. Acad. Sci. USA, vol. 114, no. 13, pp. 3521–3526,
Mar. 2017.
[9] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic
intelligence,” in Proc. ICML, Jan. 2017, pp. 3987–3995.
[10] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars,
“Memory aware synapses: Learning what (not) to forget,” in Proc.
ECCV, Nov. 2018, pp. 139–154.
[11] Z. Li and D. Hoiem, “Learning without forgetting,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 40, no. 12, pp. 2935–2947, Dec. 2017.
[12] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari, “End-to-end incremental learning,” in Proc. ECCV, Jan. 2018,
pp. 241–257.
[13] Y. Wu et al., “Large scale incremental learning,” in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 374–382.
[14] A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “PODNet:
Pooled outputs distillation for small-tasks incremental learning,” in Proc.
ECCV, Jan. 2020, pp. 86–102.
[15] S. Mittal, S. Galesso, and T. Brox, “Essentials for class incremental
learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
Workshops (CVPRW), Jun. 2021, pp. 3508–3517.
[16] A. Robins, “Catastrophic forgetting, rehearsal and pseudorehearsal,”
Connect. Sci., vol. 7, no. 2, pp. 123–146, 1995.
[17] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “iCaRL:
Incremental classifier and representation learning,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5533–5542.
[18] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin, “Learning a unified
classifier incrementally via rebalancing,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 831–839.
[19] Y. Liu, Y. Su, A.-A. Liu, B. Schiele, and Q. Sun, “Mnemonics
training: Multi-class incremental learning without forgetting,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
pp. 12242–12251.
[20] B. Han, F. Zhao, Y. Zeng, W. Pan, and G. Shen, “Enhancing efficient
continual learning with dynamic structure development of spiking neural
networks,” 2023, arXiv:2308.04749.
[21] M. Boschini et al., “Transfer without forgetting,” in Proc. Eur. Conf.
Comput. Vis., 2022, pp. 692–709.
[22] M. Riemer et al., “Learning to learn without forgetting by maximizing
transfer and minimizing interference,” 2018, arXiv:1810.11910.
[23] J. Peng, D. Ye, B. Tang, Y. Lei, Y. Liu, and H. Li, “Lifelong learning
with cycle memory networks,” IEEE Trans. Neural Netw. Learn. Syst.,
vol. 35, no. 11, pp. 16439–16452, Nov. 2024.
[24] Y. Shi et al., “Multi-granularity knowledge distillation and prototype
consistency regularization for class-incremental learning,” Neural Netw.,
vol. 164, pp. 617–630, Jul. 2023.
[25] M. McCloskey and N. J. Cohen, “Catastrophic interference in con-
nectionist networks: The sequential learning problem,” in Psychology
of Learning and Motivation, vol. 24. Amsterdam, The Netherlands:
Elsevier, 1989, pp. 109–165.
[26] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio,
“An empirical investigation of catastrophic forgetting in gradient-based
neural networks,” 2013, arXiv:1312.6211.
[27] G. M. van de Ven and A. S. Tolias, “Three scenarios for continual
learning,” 2019, arXiv:1904.07734.
[28] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural
network,” 2015, arXiv:1503.02531.
[29] M. Welling, “Herding dynamical weights to learn,” in Proc. 26th Annu.
Int. Conf. Mach. Learn., Jun. 2009, pp. 1121–1128.
[30] D. Shim, Z. Mai, J. Jeong, S. Sanner, H. Kim, and J. Jang,
“Online class-incremental continual learning with adversarial Shap-
ley value,” in Proc. AAAI Conf. Artif. Intell., 2021, vol. 35, no. 11,
pp. 9630–9638.
[31] G. Sun, B. Ji, L. Liang, and M. Chen, “CeCR: Cross-entropy contrastive
replay for online class-incremental continual learning,” Neural Netw.,
vol. 173, May 2024, Art. no. 106163.
[32] H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep
generative replay,” in Proc. Adv. Neural Inf. Process. Syst., 2017.
[33] C. Atkinson, B. McCane, L. Szymanski, and A. Robins, “Pseudo-
Recursal: Solving the catastrophic forgetting problem in deep neural
networks,” 2018, arXiv:1802.03875.
[34] X. Liu et al., “Generative feature replay for class-incremental learning,”
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops,
Jun. 2020, pp. 226–227.
[35] G. Shen, S. Zhang, X. Chen, and Z.-H. Deng, “Generative feature replay
with orthogonal weight modification for continual learning,” in Proc. Int.
Joint Conf. Neural Netw. (IJCNN), Jul. 2021, pp. 1–8.
[36] G. M. van de Ven, Z. Li, and A. S. Tolias, “Class-incremental learning
with generative classifiers,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit. Workshops (CVPRW), Jun. 2021, pp. 3611–3620.
[37] D. Goswami, Y. Liu, B. Twardowski, and J. van de Weijer, “FeCAM:
Exploiting the heterogeneity of class distributions in exemplar-free
continual learning,” in Proc. Adv. Neural Inf. Process. Syst., vol. 36,
2023, pp. 6582–6595.
[38] G. Petit, A. Popescu, H. Schindler, D. Picard, and B. Delezoide, “FeTrIL:
Feature translation for exemplar-free class-incremental learning,” in
Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2023,
pp. 3911–3920.
[39] F. Zhu, X.-Y. Zhang, C. Wang, F. Yin, and C.-L. Liu, “Prototype
augmentation and self-supervision for incremental learning,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021,
pp. 5871–5880.
[40] R. Aljundi, P. Chakravarty, and T. Tuytelaars, “Expert gate: Lifelong
learning with a network of experts,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jul. 2017, pp. 7120–7129.
[41] J. Rajasegaran, M. Hayat, S. Khan, F. S. Khan, and L. Shao, “Random
path selection for continual learning,” in Proc. NeurIPS, Jan. 2019,
pp. 12648–12658.
[42] S. C. Y. Hung, C.-H. Tu, C. Wu, C. Chen, Y.-M. Chan, and C.-S. Chen,
“Compacting, picking and growing for unforgetting continual learning,”
in Proc. NeurIPS, vol. 32, 2019.
[43] S. Yan, J. Xie, and X. He, “DER: Dynamically expandable representation
for class incremental learning,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jun. 2021, pp. 3013–3022.
[44] Z. Hu, Y. Li, J. Lyu, D. Gao, and N. Vasconcelos, “Dense network
expansion for class incremental learning,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 11858–11867.
[45] H. Zhang, M. Cissé, Y. Dauphin, and D. López-Paz, “mixup: Beyond
empirical risk minimization,” in Proc. ICLR, 2017.
[46] T. DeVries and G. W. Taylor, “Improved regularization of convolutional
neural networks with cutout,” 2017, arXiv:1708.04552.
[47] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le,
“AutoAugment: Learning augmentation strategies from data,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
pp. 113–123.
[48] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann,
and W. Brendel, “ImageNet-trained CNNs are biased towards tex-
ture; increasing shape bias improves accuracy and robustness,” 2018,
arXiv:1811.12231.
[49] K. L. Hermann, T. Chen, and S. Kornblith, “The origins and prevalence
of texture bias in convolutional neural networks,” in Proc. NeurIPS,
vol. 33, 2020, pp. 19000–19015.
[50] R. Volpi, H. Namkoong, O. S¸ ener, J. C. Duchi, V. Murino, and
S. Savarese, “Generalizing to unseen domains via adversarial data
augmentation,” in Proc. NeurIPS, vol. 31, 2018.
[51] M. Yi et al., “Improved OOD generalization via adversarial training and
pretraing,” in Proc. ICML, Jul. 2021, pp. 11987–11997.
[52] R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk,
“Improving robustness without sacrificing accuracy with patch Gaussian
augmentation,” 2019, arXiv:1906.02611.
[53] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry,
“Robustness may be at odds with accuracy,” in Proc. ICLR,
2019.
[54] B. Shi, D. Zhang, Q. Dai, J. Wang, Z. Zhu, and Y. Mu, “Informative
dropout for robust representation learning: A shape-bias perspective,” in
Proc. ICML, 2020, pp. 8828–8839.
[55] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training
of deep bidirectional transformers for language understanding,” 2018,
arXiv:1810.04805.
[56] L. Wang, X. Zhang, H. Su, and J. Zhu, “A comprehensive sur-
vey of continual learning: Theory, method and application,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 5362–5383,
Aug. 2024.
[57] D. Hendrycks and T. Dietterich, “Benchmarking neural network
robustness to common corruptions and perturbations,” 2019,
arXiv:1903.12261.
[58] A. Prabhu, P. H. Torr, and P. K. Dokania, “GDumb: A simple approach
that questions our progress in continual learning,” in Proc. Eur. Conf.
Comput. Vis. (ECCV), Glasgow, U.K. Cham, Switzerland: Springer,
2020, pp. 524–540.
[59] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[60] S. T. Teoh, M. Kitamura, Y. Nakayama, S. Putri, Y. Mukai, and
E. Fukusaki, “Random sample consensus combined with partial least
squares regression (RANSAC-PLS) for microbial metabolomics data
mining and phenotype improvement,” J. Biosci. Bioeng., vol. 122, no. 2,
pp. 168–175, 2015.
[61] D. Marr and E. Hildreth, “Theory of edge detection,” Proc. Roy. Soc.
London B, Biol. Sci., vol. 207, no. 1167, pp. 187–217, Feb. 1980.
[62] J. Geiping, M. Goldblum, G. Somepalli, R. Shwartz-Ziv, T. Goldstein,
and A. G. Wilson, “How much data are augmentations worth? An
investigation into scaling laws, invariance, and implicit regularization,”
in Proc. 11th Int. Conf. Learn. Represent., 2023, pp. 187–217. [Online].
Available: https://openreview.net/forum?id=3aQs3MCSexD
[63] S. Singla and S. Feizi, “Salient ImageNet: How to discover spurious
features in deep learning?,” 2021, arXiv:2110.04301.
[64] P. Izmailov, P. Kirichenko, N. Gruver, and A. G. Wilson, “On feature
learning in the presence of spurious correlations,” in Proc. Adv. Neural
Inf. Process. Syst., vol. 35, Jan. 2022, pp. 38516–38532.
[65] C. Zhou, X. Ma, P. Michel, and G. Neubig, “Examining and combating
spurious features under distribution shift,” in Proc. 38th Int. Conf.
Mach. Learn., vol. 139, M. Meila and T. Zhang, Eds., Jul. 2021,
pp. 12857–12867. [Online]. Available: https://proceedings.mlr.press/
v139/zhou21g.html
[66] F. Khani and P. Liang, “Removing spurious features can hurt accuracy
and affect groups disproportionately,” in Proc. ACM Conf. Fairness, Accountability, Transparency, New York, NY, USA, Mar. 2021,
pp. 196–205, doi: 10.1145/3442188.3445883.
[67] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the
loss landscape of neural nets,” in Proc. NeurIPS, vol. 31, 2018.
[68] S. Fort, H. Hu, and B. Lakshminarayanan, “Deep ensembles: A loss
landscape perspective,” 2019, arXiv:1912.02757.
[69] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with
adaptive instance normalization,” in Proc. IEEE Int. Conf. Comput. Vis.
(ICCV), Oct. 2017, pp. 1510–1519.
[70] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in
Proc. ECCV, Jan. 2014, pp. 740–755.
[71] N. Nichol. (2016). Painter by Numbers. WikiArt. [Online]. Available:
https://www.kaggle.com/c/painter-by-numbers
[72] H. Zhang and K. Dana, “Multi-style generative network for real-time
transfer,” in Proc. Eur. Conf. Comput. Vis. (ECCV) Workshops, Jan.
2017, pp. 1–16.
[73] L. Sheng, Z. Lin, J. Shao, and X. Wang, “Avatar-Net: Multi-scale zero-
shot style transfer by feature decoration,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8242–8250.
[74] C. Reich, T. Prangemeier, C. Wildner, and H. Koeppl, “Multi-StyleGAN:
Towards image-based simulation of time-lapse live-cell microscopy,” in
Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham,
Switzerland: Springer, Jan. 2021, pp. 476–486.
[75] X. Tao, X. Chang, X. Hong, X. Wei, and Y. Gong, “Topology-preserving
class-incremental learning,” in Proc. ECCV, Jan. 2020, pp. 254–270.
[76] A. Douillard, A. Ramé, G. Couairon, and M. Cord, “DyTox: Trans-
formers for continual learning with dynamic token expansion,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022,
pp. 9275–9285.
[77] A. Krizhevsky et al., “Learning multiple layers of features from tiny
images,” Master’s thesis, Univ. Toronto, Toronto, ON, Canada, 2009.
[78] D. Hendrycks et al., “The many faces of robustness: A critical analysis
of out-of-distribution generalization,” in Proc. IEEE/CVF Int. Conf.
Comput. Vis. (ICCV), Oct. 2021, pp. 8320–8329.
[79] M. Kang, J. Park, and B. Han, “Class-incremental learning by
knowledge distillation with adaptive feature consolidation,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022,
pp. 16050–16059.
[80] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Proc. Adv. Neural Inf.
Process. Syst., vol. 25, May 2012, pp. 84–90.
[81] S. Lin, L. Yang, D. Fan, and J. Zhang, “Beyond not-forgetting: Continual
learning with backward knowledge transfer,” in Proc. NeurIPS, vol. 35,
2022, pp. 16165–16177.
[82] N. Díaz-Rodríguez, V. Lomonaco, D. Filliat, and D. Maltoni, “Don’t forget, there is more than forgetting: New metrics for continual learning,”
2018, arXiv:1810.13166.
[83] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware
minimization for efficiently improving generalization,” in Proc. ICLR,
2021.
[84] P. Yu, Y. Chen, Y. Jin, and Z. Liu, “Improving vision transformers for
incremental learning,” 2021, arXiv:2112.06103.
[85] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley, “Color trans-
fer between images,” IEEE Comput. Graph. Appl., vol. 21, no. 4,
pp. 34–41, Jan. 2001.
[86] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers
for image recognition at scale,” in Proc. ICLR, 2021.
[87] S. Tuli, I. Dasgupta, E. Grant, and T. L. Griffiths, “Are convolutional
neural networks or transformers more like human vision?,” 2021,
arXiv:2105.07197.
Zenglin Shi received the Ph.D. degree from the Uni-
versity of Amsterdam, Amsterdam, The Netherlands,
in 2022.
He was with the University of Amsterdam, where
he worked closely with Prof. Cees Snoek and Dr.
Pascal Mettes. After the Ph.D. degree, he moved
to Singapore to work as a Research Scientist with
A*STAR, Singapore. He is currently a Full Pro-
fessor with Hefei University of Technology, Hefei,
China. He has published dozens of research papers
in top-tier journals and conferences, such as IEEE
TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IJCV, IEEE TRANSACTIONS ON IMAGE PROCESSING, CVPR, and ICCV.
His research interests include computer vision and machine learning.
Jie Jing received the B.S. degree in software engi-
neering from Sichuan University, Chengdu, China,
in 2020, where he is currently pursuing the Ph.D.
degree with the Department of Computer Science.
He is a Visiting Ph.D. Student with the Deep
NeuroCognition Laboratory, Agency for Science,
Technology and Research (A*STAR), Singapore.
His research interests include computer vision, con-
tinual learning, and medical imaging.
Ying Sun received the B.Eng. degree from Tsinghua
University, Beijing, China, in 1998, the M.Phil.
degree from The Hong Kong University of Science
and Technology, Hong Kong, in 2000, and the Ph.D.
degree in electrical and computer engineering from
Carnegie Mellon University, Pittsburgh, PA, USA, in
2004.
She is currently a Principal Scientist with the
Institute for Infocomm Research, Agency for Sci-
ence, Technology and Research (A*STAR), Singa-
pore. Her research interests include computer vision
and machine learning, especially video understanding and generation, visual
representation learning, and visual reasoning.
Joo-Hwee Lim (Senior Member, IEEE) received the
B.Sc. (Hons.) and M.Sc. (by research) degrees in
computer science from the National University of
Singapore, Singapore, in 1989 and 1991, respec-
tively, and the Ph.D. degree in computer science
and engineering from the University of New South
Wales, Sydney, NSW, Australia, in 2004.
He is currently a Senior Principal Scientist III
and the Head of the Visual Intelligence Department,
Institute for Infocomm Research (I2R), A*STAR,
Singapore, and an Adjunct Professor with the Col-
lege of Computing and Data Science (CCDS), Nanyang Technological
University (NTU), Singapore. He pioneered a novel framework (i.e., the
popular bag-of-visual-words approach) for indexing images based on spatial
aggregation of visual keywords, modeled from statistical learning, and a visual
query language based on visual patterns, spatial quantifiers, and Boolean
operators. His research experience includes connectionist expert systems,
neural-fuzzy systems, handwritten recognition, multi-agent systems, content-
based image retrieval, scene/object recognition, medical image analysis, with
more than 320 international refereed journal articles and conference papers,
and holds 30 patents (awarded and pending).
Mengmi Zhang is currently an Assistant Profes-
sor and a Principal Investigator leading the Deep
NeuroCognition Laboratory, Nanyang Technologi-
cal University (NTU), Singapore. She also holds
a joint appointment as a Principal Scientist with
the Agency for Science, Technology and Research
(A*STAR), Singapore. Her research interests include
the intersection of artificial intelligence and compu-
tational neuroscience, with significant contributions
to visual attention, contextual reasoning, semantic
and episodic memory, and continual learning.