
RECALL: Replay-based Continual Learning in Semantic Segmentation

Andrea Maracani, Umberto Michieli, Marco Toldo, Pietro Zanuttigh
Department of Information Engineering, University of Padova, {umberto.michieli,toldomarco,zanuttigh}
Deep networks achieve outstanding results in semantic segmentation; however, they need to be trained in a single shot with a large amount of data. Continual learning settings, where new classes are learned in incremental steps and previous training data is no longer available, are challenging due to the catastrophic forgetting phenomenon. Existing approaches typically fail when several incremental steps are performed or in the presence of a distribution shift of the background class. We tackle these issues by recreating no longer available data for the old classes and outlining a content inpainting scheme on the background class. We propose two sources for replay data. The first resorts to a generative adversarial network to sample from the class space of past learning steps. The second relies on web-crawled data to retrieve images containing examples of old classes from online databases. In both scenarios no samples of past steps are stored, thus avoiding privacy concerns. Replay data are then blended with new samples during the incremental steps. Our approach, RECALL, outperforms state-of-the-art methods.
1. Introduction
A common requirement for many machine learning applications is the ability to learn a sequence of tasks in multiple incremental steps, e.g., progressively introducing novel classes to be recognized, instead of using a single-shot training procedure on a large dataset [...]. This problem has been widely studied in image classification, and many methods propose to alleviate the forgetting of previous tasks and the intransigence in learning new ones [...]. When the model is exposed to samples of novel classes and is trained on them without additional provisions, the optimization leads to the so-called catastrophic forgetting phenomenon [...], i.e., knowledge about previously seen classes tends to be lost.
These authors share the first authorship. Our work was in part supported by the Italian Ministry of Education (MIUR) under the “Departments of Excellence” initiative (Law 232/2016).
Figure 1: Replay images of previously seen classes are retrieved by a web crawler or a generative network and further labeled. Then, the network is incrementally trained with a mixture of new and replay data.
Incremental learning on dense tasks (e.g., semantic segmentation), where pixel-wise predictions are performed, has
only recently been explored, and the first experimental studies show that catastrophic forgetting is even more severe than in the classification task [...]. Current approaches for class-incremental semantic segmentation re-frame knowledge distillation strategies inspired by previous works on image classification [...]. Although they partially alleviate forgetting, they often fail when multiple incremental steps are performed or when background shift [...] (i.e., a change of the statistics of the background class across learning steps, as it incorporates old or future classes) occurs.
In this paper, we follow a completely different strategy and, instead of distilling knowledge from a teacher model (i.e., the old one) to avoid forgetting, we propose to generate samples of old classes by using replay strategies. We propose RECALL (REplay in ContinuAL Learning), a method that re-creates representations of old classes and mixes them with the available training data, i.e., the data containing the novel classes being learned (see Fig. 1). To reduce background shift we introduce a self-inpainting strategy that re-assigns the background region according to the predictions of the previous model.
To generate representations of past classes we pursue two possible directions. The first is based on a pre-trained generative model, i.e., a Generative Adversarial Network (GAN) [...] conditioned to produce samples of an input class. The GAN has been trained beforehand on a dataset different from the target one (we chose ImageNet as it comprises a wide variety of classes and domains), thus requiring a Class Mapping Module to perform the translation between the two label spaces. The second strategy, instead, is based on crawling images from the web, querying the class names to drive the search. Both approaches allow retrieving a large amount of weakly labeled data. Finally, we generate pseudo-labels for semantic segmentation using a side labeling module, which requires only minimal extra storage.
Our main contributions are: 1) we propose RECALL, which is the first approach to use replay data in continual semantic segmentation; 2) to the best of our knowledge, we are the first to introduce the webly-supervised paradigm in continual learning, showing how we can extract useful clues from extremely weakly supervised and noisy samples; 3) we devise a background inpainting strategy to generate pseudo-labels and overcome the background shift; 4) we achieve state-of-the-art results on a wide range of scenarios, especially when performing multiple incremental steps.
2. Related Works
Continual Learning (CL). Deep neural networks witnessed remarkable improvements in many fields; however, such models are prone to catastrophic forgetting when they are trained to continuously update the learned knowledge (e.g., with new categories) from progressively provided data [...]. Catastrophic forgetting is a long-standing problem [...], which has been recently tackled in a variety of visual tasks such as image classification [...], object detection [40, 25] and semantic segmentation [29, 31, 5, 23]. Current techniques can be grouped into four main (non mutually exclusive) categories [...]: namely, dynamic architectures, regularization-based, rehearsal and generative replays. Dynamic architectures can be explicit [...], if new network branches are grown, or implicit [...], if some network weights are available for certain tasks only. Regularization-based approaches mainly propose to compute penalty terms to regularize training (e.g., based on the importance of the weights for a specific task) [...] or to distill knowledge from the old model [...]. Rehearsal approaches store a set of raw samples of past tasks into memory, which are then used while training for the new task [...]. Finally, generative replay approaches [...] rely on generative models typically trained on the same data distribution, which are later used to generate artificial samples to preserve previous knowledge. Generative models are usually GANs [...] or auto-encoders [...]. In this work, we employ two kinds of generative replays: either resorting to a standard pre-trained GAN or to web-crawled images to avoid forgetting, without storing any of the samples related to previous tasks. When using the generative model, differently from previous works on continual image classification, we do not select real exemplars as anchors to support the learned distribution [...], nor do we train or fine-tune the GAN architecture on the current data distribution [39, 17, 43], thus reducing memory and computation time.
CL in Semantic Segmentation. Semantic segmentation has experienced wide research interest in the past few years, and deep networks have achieved remarkable results on this task. Current techniques are based on the auto-encoder structure first employed by FCN [...] and subsequently improved by many approaches [...]. Recently, increasing attention has been devoted to class-incremental semantic segmentation [...] to learn new categories from new data. In [...] the problem is first introduced and tackled with regularization approaches such as parameter freezing (e.g., fixing the encoder after the initial training stage) and knowledge distillation. In [...] knowledge distillation is coupled with a class importance weighting scheme to emphasize gradients on difficult classes. Cermelli et al. [...] study the distribution shift of the background class. In [...] long- and short-range spatial relationships at feature level are preserved. In [...] the latent space is regularized to improve class-conditional feature separation.
Webly-Supervised Learning is an emerging paradigm in which large amounts of web data are exploited for learning CNNs [...]. Recently, it was also employed in semantic segmentation to provide a plentiful source of both images [...] and videos [...] with weak image-level class labels during training. The most active research directions are devoted to understanding how to query images and how to filter and exploit them (e.g., assigning pseudo-labels). To our knowledge, however, webly-supervised learning has not yet been explored in continual learning as a replay strategy.
3. Problem Formulation
The semantic segmentation task consists in labeling each pixel in an image by assigning it to a class from a collection of possible semantic classes $\mathcal{C}$, which typically also comprises a special background category that we denote as $b$. More formally, given an image $X \in \mathcal{X} \subset \mathbb{R}^{H \times W \times 3}$, we aim at producing a map $Y \in \mathcal{Y} \subset \mathcal{C}^{H \times W}$ that is a prediction of the ground truth map $\hat{Y}$. This is nowadays usually achieved by using a suitable deep learning model $M: \mathcal{X} \to \mathbb{R}^{H \times W \times |\mathcal{C}|}$, commonly made of a feature extractor $E$ followed by a decoding module $D$, i.e., $M = D \circ E$.
In standard supervised learning, the model is learned in a single shot over a training set $\mathcal{T} \subset \mathcal{X} \times \mathcal{Y}$, available in its complete form to the training algorithm. In class-incremental learning, instead, we assume that the training is performed in multiple steps and only a subset of training data is made available to the algorithm at each step $k = 0, \dots, K$. More in detail, we start from an initial step $k = 0$ where only training data concerning a subset of all the classes $\mathcal{C}_0 \subset \mathcal{C}$ is available (we assume that $b \in \mathcal{C}_0$). We denote with $M_0: \mathcal{X} \to \mathbb{R}^{H \times W \times |\mathcal{C}_0|}$ the model trained after this initial step. Moving to a generic step $k$, a new set of classes $\mathcal{C}_k$ is added to the class collection learned up to that point, resulting in an expanded set of learnable classes $\mathcal{C}_{0 \to k} = \mathcal{C}_{0 \to (k-1)} \cup \mathcal{C}_k$ (we assume $\mathcal{C}_{0 \to (k-1)} \cap \mathcal{C}_k = \emptyset$). The model after the $k$-th step of training is $M_k: \mathcal{X} \to \mathbb{R}^{H \times W \times |\mathcal{C}_{0 \to k}|}$, where $E_k = E_0$, since in our approach the encoder $E$ is not trained during the incremental steps and only the decoder is updated [29].
Two main continual scenarios have been proposed (see [...] for a more detailed description) and we tackle both in a unified framework.
Disjoint setup: in the initial step, all the images in the training set with at least one pixel belonging to a class of $\mathcal{C}_0$ (except for $b$) are assumed to be available. We denote with $\mathcal{Y}_{\mathcal{C}_0 \cup \{b\}} \subset \mathcal{C}^{H \times W}$ the corresponding output space, where labels can only belong to $\mathcal{C}_0$, while all the pixels not pertaining to these classes are assigned to $b$. The incremental partitions are built as disjoint subsets of the whole training set. The training data associated to the $k$-th step, $\mathcal{T}_k \subset \mathcal{X} \times \mathcal{Y}_{\mathcal{C}_k \cup \{b\}}$, contains only images corresponding to classes in $\mathcal{C}_k$, with just the classes of step $k$ annotated (possible old classes are labeled as $b$), and is disjoint w.r.t. previous and future partitions.
Overlapped setup: in the first phase we select the subset of training images having only $\mathcal{C}_0$-labeled pixels. Then, the training set at each incremental step $k$ contains all the images with labeled pixels from $\mathcal{C}_k$, i.e., $\mathcal{T}_k \subset \mathcal{X} \times \mathcal{Y}_{\mathcal{C}_k \cup \{b\}}$. Similarly to the initial step, labels are limited to semantic classes in $\mathcal{C}_k$, while the remaining pixels are assigned to $b$.
In both setups, $b$ undergoes a semantic shift at each step, as pixels of ever-changing class sets are assigned to it.
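The two setups above can be sketched in code. The following is a minimal illustration under the simplifying assumption that an image-level set of present classes is available per image; function and variable names are hypothetical, not taken from the released code.

```python
def disjoint_splits(images, step_classes):
    """images: dict image_id -> set of present class ids (background excluded).
    step_classes: list of class sets, one per incremental step.
    Each image is assigned to the earliest step whose classes it contains,
    so the resulting partitions are pairwise disjoint."""
    assigned = set()
    splits = []
    for classes in step_classes:
        part = [i for i, present in sorted(images.items())
                if i not in assigned and present & classes]
        assigned.update(part)
        splits.append(part)
    return splits


def overlapped_splits(images, step_classes):
    """An image may appear in every step whose classes it contains; only
    the current step's classes are annotated, the rest is background."""
    return [sorted(i for i, present in images.items() if present & classes)
            for classes in step_classes]
```

In both cases only the current step's classes would be kept in the label maps, with every other pixel mapped to $b$.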
4. General Architecture
In the standard setup, the segmentation model $M$ is trained with annotated samples from a training set $\mathcal{T}$. Data should be representative of the task we would like to solve, meaning that multiple instances of all the considered semantic classes $\mathcal{C}$ should be available in the provided dataset for the segmentation network to properly learn them. Once $\mathcal{T}$ has been assembled, the cross-entropy objective is commonly employed to optimize the weights of $M$:
$$\mathcal{L}_{ce}(M; \mathcal{C}, \mathcal{T}) = -\frac{1}{|\mathcal{T}|} \sum_{(X,Y) \in \mathcal{T}} \sum_{c \in \mathcal{C}} Y[c] \cdot \log M(X)[c] \quad (1)$$
In the incremental learning setting, when performing an incremental training step $k$, only samples related to the new classes are assumed to be at our disposal. Following the simplest approach, we could initialize our model's weights from the previous step ($M_{k-1}$) and learn the segmentation task over classes $\mathcal{C}_k$ by optimizing the standard objective with data from the current training partition $\mathcal{T}_k$. However, simple fine-tuning leads to catastrophic forgetting, being unable to preserve previous knowledge.
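Eq. (1) can be made concrete with a small numpy sketch, where the model output $M(X)$ is stood in for by precomputed per-pixel softmax probabilities (this is an illustration, not the authors' implementation):

```python
import numpy as np

def cross_entropy(prob_maps, label_maps, num_classes):
    """Per-pixel cross-entropy as in Eq. (1).
    prob_maps: (N, H, W, |C|) softmax outputs.
    label_maps: (N, H, W) integer ground-truth class ids.
    Returns the mean negative log-likelihood over all pixels."""
    one_hot = np.eye(num_classes)[label_maps]            # (N, H, W, |C|)
    nll = -(one_hot * np.log(prob_maps + 1e-12)).sum(-1)  # (N, H, W)
    return nll.mean()
```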
Architecture of the Replay Block. To cope with this issue, we opt for a replay strategy. Our goal is to retrieve task-related knowledge of past classes to be blended into the ongoing incremental step, all without accessing training data of previous iterations. To this end, we introduce a Replay Block, whose target is twofold. First, it has to provide images resembling instances of classes from previous steps, whether generating them from scratch or retrieving them from an available alternative source (e.g., a web database). Second, it has to obtain reliable semantic labels for those images, by resorting to learned knowledge from past steps.
The Replay Block's image retrieval task is executed by what we call the Source Block $S$: this module takes in input a set of classes ($b$ excluded) and provides images whose semantic content can be ascribed to those categories (e.g., $X^{rp} \in \mathcal{X}^{rp}$). We adopt two different solutions for the Source Block, namely GAN-based and web-based techniques, both detailed in Sec. 5.
The Source Block provides unlabeled image data (if we exclude the weak image-level classification labels), and for this reason we introduce an additional Label Evaluation Block, which aims at annotating the examples provided by the replay module. This block is made of separate instances $L_{\mathcal{C}_k}$, each denoting a segmentation model trained to classify a specific set of semantic categories $\mathcal{C}_k \cup \{b\}$ (i.e., the classes in $\mathcal{C}_k$ plus the background). These modules share the encoder $E_0$ from the initial training step, so that only a minimal portion of the segmentation network (i.e., the helper decoder $D^H_{\mathcal{C}_k}$, which accounts for only a few parameters, see Sec. 7.1) is stored for each block's instance. Notice that a single instance recognizing all classes could be used, leading to an even more compact representation, but it experimentally led to worse performance.
Provided that $S$ and $L_{\mathcal{C}_k}$ are available, replay training data can be collected for the classes in $\mathcal{C}_k$. A query to $S$ outputs a generic image example $X^{rp}$, which is then associated to its prediction $\hat{Y}^{rp} = \arg\max L_{\mathcal{C}_k}(X^{rp})$. By retrieving multiple replay examples, we build a replay set $\mathcal{R}_{\mathcal{C}_k}$, where the number of replay samples per class $N_r$ is a fixed hyperparameter empirically set (see Sec. 6).
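The pseudo-labeling step of the Label Evaluation Block reduces to an argmax over a helper head's scores. A minimal sketch, where `helper_head` is a hypothetical callable standing in for $E_0$ plus the helper decoder $D^H_{\mathcal{C}_k}$:

```python
import numpy as np

def pseudo_label(logits):
    """logits: (H, W, |C_k|+1) helper-head scores (background at index 0).
    Returns the (H, W) argmax pseudo-label map."""
    return logits.argmax(-1)

def build_replay_set(images, helper_head, n_replay):
    """Pair the first n_replay retrieved images with their pseudo-labels,
    mimicking the construction of a replay set R_{C_k}."""
    return [(x, pseudo_label(helper_head(x))) for x in images[:n_replay]]
```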
Background Self-Inpainting. To deal with the background shift phenomenon, we propose a simple yet effective inpainting mechanism to transfer knowledge from the previous model into the current one. While the Replay Block re-creates samples of previously seen classes, background inpainting acts on the background regions of current samples, reducing the background shift and at the same time bringing a regularization effect similar to knowledge distillation [...], although its implementation is quite different. At every step $k$ with training set $\mathcal{T}_k$, we take the background region of each ground truth map and we label it with the associated prediction from the previous model $M_{k-1}$ (see Fig. 3). We call it background inpainting since the background regions in the label maps are changed according to a self-teaching scheme based on the prediction of the old model. More formally, we replace each original label map $Y$ available at step $k > 0$ with its inpainted version $Y^{bi}$:
$$Y^{bi}[h,w] = \begin{cases} Y[h,w] & \text{if } Y[h,w] \in \mathcal{C}_k \\ \arg\max M_{k-1}(X)[h,w] & \text{otherwise} \end{cases}$$
where $(X, Y) \in \mathcal{T}_k$, while $[h,w]$ denotes the pixel coordinates. Labels at step $k = 0$ are not inpainted, as at that stage we lack any prior knowledge of past classes. When background inpainting is performed, each set $\mathcal{T}^{bi}_k \subset \mathcal{X} \times \mathcal{Y}_{\mathcal{C}_{0 \to k}}$ ($k > 0$) contains all samples of $\mathcal{T}_k$ after being inpainted.
Figure 2: Overview of the proposed RECALL: class labels from past incremental steps are provided to a Source Block, either a web crawler or a pre-trained conditional GAN, which retrieves a set of unlabeled replay images for the past semantic classes. Then, a Label Evaluation Block produces the missing annotations. Finally, the segmentation network is incrementally trained with a replay-augmented dataset, composed of both new classes data and replay data.
Figure 3: Background self-inpainting process.
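The inpainting rule above can be sketched in a few lines of numpy; `prev_pred` stands in for the argmax of $M_{k-1}(X)$, assumed precomputed (illustrative names, not the authors' code):

```python
import numpy as np

def inpaint_background(label_map, prev_pred, new_classes):
    """label_map, prev_pred: (H, W) integer arrays; new_classes: set of
    class ids in C_k. Pixels labeled with a new class are kept; every
    other (background) pixel takes the previous model's prediction."""
    inpainted = prev_pred.copy()
    keep = np.isin(label_map, list(new_classes))
    inpainted[keep] = label_map[keep]
    return inpainted
```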
Incremental Training with the Replay Block. The training procedure of RECALL is detailed in Algorithm 1 and the process is depicted in Fig. 2. Suppose we are at the incremental step $k$, with only training data of classes in $\mathcal{C}_k$ from partition $\mathcal{T}_k$ available. In a first stage, the Replay Block is fixed and it is used to retrieve annotated data for steps from $0$ to $k-1$, uniformly distributed among all the past classes. Following the described pipeline, the generative and labeling models are applied independently over each incremental class set $\mathcal{C}_i$, $i = 0, \dots, k-1$. The replay training dataset for step $k$ is the union of the single replay sets for each previous step: $\mathcal{R}_{\mathcal{C}_{0 \to (k-1)}} = \bigcup_{i=0}^{k-1} \mathcal{R}_{\mathcal{C}_i}$. Once we have assembled $\mathcal{R}_{\mathcal{C}_{0 \to (k-1)}}$, by merging it with $\mathcal{T}^{bi}_k$ we get an augmented step-$k$ training partition $\mathcal{T}^{rp}_k = \mathcal{T}^{bi}_k \cup \mathcal{R}_{\mathcal{C}_{0 \to (k-1)}}$. This new set, in principle, is complete with annotated samples containing both old and new classes, thanks to replay data. Therefore, we effectively learn the segmentation model $M_k$ through the cross-entropy objective computed over the replay-augmented training data. This mitigates the bias toward new classes, thus preventing forgetting.

Algorithm 1 RECALL: incremental training procedure.
Input: $\{\mathcal{T}_k\}_{k=0}^K$ and $\{\mathcal{C}_k\}_{k=0}^K$
Output: $M_K$
  train $M_0 = E_0 \circ D_0$ with $\mathcal{L}_{ce}(M_0; \mathcal{C}_0, \mathcal{T}_0)$
  train $S$ on $(\mathcal{C}_0, \mathcal{T}_0)$
  train $D^H_{\mathcal{C}_0}$ with $\mathcal{L}_{ce}(L_{\mathcal{C}_0}; \mathcal{C}_0, \mathcal{T}_0)$
  for $k \leftarrow 1$ to $K$ do
    background inpainting on $\mathcal{T}_k$ to obtain $\mathcal{T}^{bi}_k$
    train $S$ on $(\mathcal{C}_k, \mathcal{T}_k)$
    train $D^H_{\mathcal{C}_k}$ with $\mathcal{L}_{ce}(L_{\mathcal{C}_k}; \mathcal{C}_k \cup \{b\}, \mathcal{T}_k)$
    generate $\mathcal{T}^{rp}_k = \mathcal{T}^{bi}_k \cup \mathcal{R}_{\mathcal{C}_{0 \to (k-1)}}$
    train $D_k$ with $\mathcal{L}_{ce}(M_k; \mathcal{C}_{0 \to k}, \mathcal{T}^{rp}_k)$
  end for

In a second stage, we exploit $\mathcal{T}_k$ to train the Class Mapping Module if needed (see Sec. 5). In particular, we teach the Source Block $S$ to produce samples of $\mathcal{C}_k$ and we optimize the decoder $D^H_{\mathcal{C}_k}$ to correctly segment, in conjunction with $E_0$, images from $\mathcal{T}_k$ by minimizing $\mathcal{L}_{ce}(L_{\mathcal{C}_k}; \mathcal{C}_k \cup \{b\}, \mathcal{T}_k)$. This stage is not exploited in the current step, but will be necessary in future ones.
During a standard incremental training stage, we follow a mini-batch gradient descent scheme, where batches of annotated training data are sampled from $\mathcal{T}^{rp}_k$. However, to guarantee a proper stream of information, we opt for an interleaving sampling policy, rather than a random one. In particular, at a generic iteration of training, a batch of data supplied to the network is made of samples from the current training partition and replay samples. The ratio $r_{old}/r_{new}$ between them controls the proportion of replay and new data (see also Sec. 7.1). We need, in fact, to carefully balance how new data is dosed with respect to replay data, so that enough information about the new classes is provided within the learning process, while we concurrently assist the network in recalling knowledge acquired in past steps to prevent catastrophic forgetting.
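The interleaving sampling policy can be sketched as a simple generator that alternates fixed quotas of new and replay samples per batch (an illustration under the assumption of in-memory lists; not the authors' data pipeline):

```python
from itertools import cycle, islice

def interleaved_batches(new_data, replay_data, n_new, n_old):
    """Yield batches containing n_new samples from the current partition
    followed by n_old replay samples; n_old / n_new plays the role of
    the interleaving ratio (set to 1 in the paper)."""
    new_it, old_it = cycle(new_data), cycle(replay_data)
    while True:
        yield list(islice(new_it, n_new)) + list(islice(old_it, n_old))
```

With `n_new == n_old` this reproduces the 1:1 ratio used in the experiments.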
5. Replay Strategies
In this section we describe in more detail the replay strategies employed for the image generation task of the Source Block $S$. As mentioned previously, we opt for a generative approach based on a GAN framework and for an online retrieval solution, where images are collected by a web crawler.
Replay by GAN. The GAN-based strategy exploits a deep generative adversarial framework to re-create the no longer available samples of previously seen classes. We use a conditional GAN, $G$, pre-trained on a generic large-scale visual dataset with data from a wide set of semantic classes $\mathcal{C}_G$ and different domains. For the experiments, we choose an ImageNet [...] based pre-training. In this regard, we remark that classes and domains are not required to be completely coherent: for instance, person does not exist in ImageNet, but related classes (e.g., hat) still allow to preserve its knowledge (further considerations on this are reported in the Suppl. Mat.). When performing the $k$-th incremental step, we retrieve images containing previously seen classes by sampling the GAN's generator output, i.e., $X^{rp} = G(n, c_G)$, conditioned on the GAN's classes $c_G \in \mathcal{C}_G$ corresponding to the target ones from the original training data ($n$ is a generic noise input).
Since the GAN is pre-trained on a separate dataset, it typically inherits a different label set. For this reason, the Source Block with GAN is composed of two main modules, namely the actual GAN for image generation and a Class Mapping Module to translate each class of the semantic segmentation incremental dataset to the most similar class of the GAN's training dataset. Provided that we have trained both the GAN and class mapping modules, first we use the latter to translate the class set $\mathcal{C}_k$ to the matching set $\mathcal{C}^G_k$. Then, a set of queries to the conditioned GAN's generator, $X^{rp} = G(n, c_G)$ with $c_G \in \mathcal{C}^G_k$, provides samples resembling the ones in $\mathcal{T}_k$, as long as the mapping is able to properly associate each original class to a statistically similar counterpart in the GAN's label space.
At each incremental step $k$, the Source Block with GAN goes through two separate training and inference stages. In a first training phase, samples from $\mathcal{T}_k$ are fed to an Image Classifier $I$, which is pre-trained to solve an image classification task on the GAN's dataset. In particular, for each class $c \in \mathcal{C}_k$ we select the corresponding training subset $\mathcal{T}^c_k \subset \mathcal{T}_k$, i.e., all the samples of set $\mathcal{T}_k$ associated to class $c$, and we sum the resulting class probability vectors from the classification output. Then, the GAN's class $c_G$ with the highest probability score is identified by:
$$c_G = \arg\max_j \sum_{X \in \mathcal{T}^c_k} I(X)[j] \quad (6)$$
where $X$ is extracted from $\mathcal{T}^c_k$ (labels are not used) and $I(X)$ denotes the vector output of the last softmax layer of $I$, whose $j$-th entry corresponds to the $j$-th GAN's class. By repeating this procedure for every class in $\mathcal{C}_k$, we build the mapped set $\mathcal{C}^G_k$. Class correspondences are stored, so that at each step we have access to the class mappings of past iterations.
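Eq. (6) amounts to summing classifier softmax vectors over the class subset and taking an argmax; a minimal sketch, where `classifier` is a hypothetical stand-in for the pre-trained image classifier $I$:

```python
import numpy as np

def map_class(images, classifier):
    """Class Mapping Module as in Eq. (6): sum the classifier's softmax
    vectors over all images of one segmentation class and return the
    index of the highest-scoring GAN class."""
    scores = sum(classifier(x) for x in images)
    return int(np.argmax(scores))
```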
In a second evaluation phase, the classes in $\mathcal{C}_{0 \to (k-1)}$ are given as input to the Source Block. Thanks to the class correspondences saved in previous steps, they are mapped to the GAN's classes. Next, image generation conditioned on each class is performed, and the resulting replay images are fed to the Label Evaluation Block to be associated to their corresponding semantic labels. By following this procedure, we end up with self-annotated data of past classes suitable to support the supervised training at the current step, which otherwise would be limited to new classes.
Replay by Web Crawler. As an alternative, we propose to retrieve training examples from an online source. For the evaluation, we searched images from the Flickr website, but any other online database or search engine can be used. Assuming we are at the incremental step $k$ and we have access to the names of every class in the past iterations ($c \in \mathcal{C}_{0 \to (k-1)}$), we download images whose tag and description both contain the class name through Flickr's web crawler. Then, the web-crawled images are fed to the Label Evaluation Block for their annotation.
Compared to the GAN-based approach, the online retrieval solution is simpler, as no learnable modules are introduced. In addition, we completely avoid assuming that a larger dataset is available, whose class range should be sufficiently ample and diverse to cope with the continuous stream of novel classes incrementally introduced. On the other side, this approach requires the availability of an internet connection and in some way exploits additional training data, even if almost unsupervised. Plus, we lack control over the weak labeling performed by the web source.
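The tag-and-description filter described above can be illustrated on already-downloaded metadata records; the record format here is hypothetical, and the actual crawling/API calls are omitted:

```python
def filter_records(records, class_name):
    """Keep only records whose tags AND description both contain the
    queried class name, mirroring the weak web-side filtering."""
    name = class_name.lower()
    return [r for r in records
            if any(name in t.lower() for t in r["tags"])
            and name in r["description"].lower()]
```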
6. Implementation Details
We use DeepLab-V2 [...] as the segmentation architecture with ResNet-101 [...] as backbone. Nonetheless, RECALL is independent of the specific network architecture. The encoder's weights are pre-trained on ImageNet [...] and all the network's weights are trained in the initial step. In the following steps, only the main decoder is trained, together with the additional helper decoders, which are needed to annotate replay samples (as discussed in Sec. 4). For fair comparison, all competing approaches are trained with the same backbone. SGD with momentum is used for weight optimization, with the initial learning rate decreased according to a polynomial decay. Following previous works [...], we train the model for $|\mathcal{C}_k| \times 1000$ learning steps in the disjoint setup and $|\mathcal{C}_k| \times 1500$ steps in the overlapped setup. Each helper decoder $D^H_{\mathcal{C}_k}$ is trained with a polynomially decaying learning rate for $|\mathcal{C}_k| \times 1000$ steps. As Source Block, we use BigGAN-deep [...] pre-trained [...] on ImageNet. At each incremental step we generate 500 replay samples per old class, i.e., $N_r = 500$. To map classes from the segmentation dataset to the GAN's one, we use the EfficientNet-B2 [...] classifier implemented at [...] and pre-trained on ImageNet. The interleaving ratio $r_{old}/r_{new}$ is set to 1.
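The polynomial learning-rate decay mentioned above follows the usual schedule; since the concrete rates and power were lost in extraction, the values below are placeholders, not the paper's settings:

```python
def poly_lr(step, total_steps, lr0, lr_end=0.0, power=0.9):
    """Standard polynomial decay: interpolate from lr0 down to lr_end
    over total_steps with the given power (all values illustrative)."""
    frac = min(step, total_steps) / total_steps
    return lr_end + (lr0 - lr_end) * (1.0 - frac) ** power
```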
As input pre-processing, random scaling and mirroring are followed by random padding and cropping to 321 × 321 px. The entire framework is developed in TensorFlow [...] and trained on a single NVIDIA RTX 2070 Super. Training time varies depending on the setup, with the longest run taking about [...] hours. Code and replay data are available at
7. Experimental Results
In this section we present the experimental evaluation on the Pascal VOC 2012 dataset [...]. Following previous works on this topic [...], we start by analyzing the performance on three widely used incremental scenarios: i.e., addition of the last class (19-1), addition of the last 5 classes at once (15-5) and addition of the last 5 classes sequentially (15-1). Moreover, we report the performance on three more challenging scenarios in which 10 classes are added sequentially one by one (10-1), in 2 batches of 5 elements each (10-5) and all at once (10-10). Classes for the incremental steps are selected according to alphabetical order. We compare with the naïve fine-tuning approach (FT), which defines the lower limit to the accuracy of an incremental model, and with the joint training on the complete dataset in one step, which serves as upper bound. We also report the results of a simple Store and Replay (S&R) method, where at each incremental step we store a certain number of true samples for the newly added classes, such that the respective size on average matches the size of the helper decoders needed by RECALL (see Fig. 6). As comparison, we include methods extended from classification (i.e., LwF [...] and its single-headed version LwF-MC [...]) and the most relevant methods designed for continual segmentation (i.e., ILT [...], CIL [...], MiB [...] and SDR [...]). Exhaustive quantitative results in terms of mIoU are shown in Table 1. For each setup we report the mean accuracy for the initial set of classes, for the classes in the incremental steps and for all classes, computed after the overall training.
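For reference, the reported mIoU metric is conventionally computed from a per-pixel confusion matrix; a minimal numpy sketch of that standard computation (not the authors' evaluation code):

```python
import numpy as np

def miou(conf):
    """conf: (C, C) confusion matrix with conf[i, j] = number of pixels
    of ground-truth class i predicted as class j. Returns the mean
    intersection-over-union across classes."""
    inter = np.diag(conf).astype(float)
    union = conf.sum(0) + conf.sum(1) - inter
    return float(np.mean(inter / np.maximum(union, 1)))
```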
Addition of the last class. First, we train over the first 19 classes during step 0. Then, we perform a single incremental step to learn tv/monitor. Looking at Table 1 (upper-left section), we notice that FT results in a drastic performance degradation w.r.t. joint training, due to catastrophic forgetting. RECALL, instead, shows higher overall mIoU than the competitors and it is especially effective on the last class, whilst still retaining high accuracy on the past ones thanks to the regularization brought in by the background inpainting and replay strategies. S&R, instead, heavily forgets previous classes, thus confirming the usefulness of replay data.
Addition of the last 5 classes. In this setup, 15 classes are learned in the initial step, while the remaining 5 are added in one shot (15-5) or sequentially one at a time (15-1). Compared to the 19-1 setup, the addition of multiple classes in the incremental iterations makes catastrophic forgetting even more severe. The accuracy gap between FT and joint training, in fact, rises from about [...] in the 19-1 case to more than [...] of mIoU in the 15-1 scenario. Taking a closer look at the results in Table 1 (upper mid and right sections), our replay approaches strongly limit the degradation caused by catastrophic forgetting. This trend can be observed in the 15-5 setup and, more evidently, in the 15-1 one, both in the disjoint and overlapped settings: exploiting generated or web-derived replay samples proves to effectively restore knowledge of past classes, leading to a final mIoU approaching that of the joint training. Storing and replaying original samples, instead, improves the performance w.r.t. FT, but ultimately leads to a mIoU lower by more than [...] if compared to our approaches. This is due to the limited number of samples that can be stored in order to match the helper decoder size: their sole addition is, in fact, insufficient to adequately preserve learned knowledge. Finally, we observe that RECALL scales much better than competitors when multiple incremental steps are performed (scenario 15-1), as typically encountered in real-world applications.
Addition of the last 10 classes. To analyze the previous claim, we introduce some new challenging experiments, not evaluated in previous works. In these tests only 10 classes are observed in the initial step, while the remaining ones are added in a single batch (10-10), in 2 steps of 5 classes each (10-5), or individually (10-1). Again, FT is heavily affected by the information loss that occurs when performing incremental training without regularization, leading to performance drops of up to about [...] of mIoU w.r.t. the joint training in the most challenging 10-1 setting. Thanks to the introduction of replay data, RECALL brings a remarkable performance boost to the segmentation accuracy and becomes more and more valuable as the difficulty of the
Table 1: mIoU on Pascal VOC 2012 for different incremental setups. Results of competitors in the upper part come from [30, 5], while we run their implementations for the new scenarios in the bottom part.
19-1 15-5 15-1
Disjoint Overlapped Disjoint Overlapped Disjoint Overlapped
Method 1-19 20 all 1-19 20 all 1-15 16-20 all 1-15 16-20 all 1-15 16-20 all 1-15 16-20 all
FT 35.2 13.2 34.2 34.7 14.9 33.8 8.4 33.5 14.4 12.5 36.9 18.3 5.8 4.9 5.6 4.9 3.2 4.5
S&R 55.3 43.2 56.2 54.0 48.0 55.1 38.5 43.1 41.6 36.3 44.2 40.3 41.0 31.8 40.7 38.6 31.2 38.9
LwF [26] 65.8 28.3 64.0 62.6 23.4 60.8 39.7 33.3 38.2 67.0 41.8 61.0 26.2 15.1 23.6 24.0 15.0 21.9
LwF-MC [35] 38.5 1.0 36.7 37.1 2.3 35.4 41.5 25.4 37.6 59.8 22.6 51.0 6.9 2.1 5.7 6.9 2.3 5.8
ILT [29] 66.9 23.4 64.8 50.2 29.2 49.2 31.5 25.1 30.0 69.0 46.4 63.6 6.7 1.2 5.4 5.7 1.0 4.6
CIL [23] 62.6 18.1 60.5 35.1 13.8 34.0 42.6 35.0 40.8 14.9 37.3 20.2 33.3 15.9 29.1 6.3 4.5 5.9
MiB [5] 69.6 25.6 67.4 70.2 22.1 67.8 71.8 43.3 64.7 75.5 49.4 69.0 46.2 12.9 37.9 35.1 13.5 29.7
SDR [30] 69.9 37.3 68.4 69.1 32.6 67.4 73.5 47.3 67.2 75.4 52.6 69.9 59.2 12.9 48.1 44.7 21.8 39.2
RECALL (GAN) 65.2 50.1 65.8 67.9 53.5 68.4 66.3 49.8 63.5 66.6 50.9 64.0 66.0 44.9 62.1 65.7 47.8 62.7
RECALL (Web) 65.0 47.1 65.4 68.1 55.3 68.6 69.2 52.9 66.3 67.7 54.3 65.6 67.6 49.2 64.3 67.8 50.9 64.8
Joint 75.5 73.5 75.4 75.5 73.5 75.4 77.5 68.5 75.4 77.5 68.5 75.4 77.5 68.5 75.4 77.5 68.5 75.4
10-10 10-5 10-1
Disjoint Overlapped Disjoint Overlapped Disjoint Overlapped
Method 1-10 11-20 all 1-10 11-20 all 1-10 11-20 all 1-10 11-20 all 1-10 11-20 all 1-10 11-20 all
FT 7.7 60.8 33.0 7.8 58.9 32.1 7.2 41.9 23.7 7.4 37.5 21.7 6.3 2.0 4.3 6.3 2.8 4.7
S&R 25.1 53.9 41.7 18.4 53.3 38.2 26.0 28.5 29.7 22.2 28.5 27.9 30.2 19.3 27.3 28.3 20.8 27.1
LwF [26] 63.1 61.1 62.2 70.7 63.4 67.2 52.7 47.9 50.4 55.5 47.6 51.7 6.7 6.5 6.6 16.6 14.9 15.8
LwF-MC [35] 52.4 42.5 47.7 53.9 43.0 48.7 44.6 43.0 43.8 44.3 42.0 43.2 6.9 1.7 4.4 11.2 2.5 7.1
ILT [29]67.7 61.3 64.7 70.3 61.9 66.3 53.4 48.1 50.9 55.0 44.8 51.7 14.1 0.6 7.5 16.5 1.0 9.1
CIL [23] 37.4 60.6 48.4 38.4 60.0 48.7 27.5 41.4 34.1 28.8 41.7 34.9 7.1 2.4 4.9 6.3 0.8 3.6
MiB [5] 66.9 57.5 62.4 70.4 63.7 67.2 54.3 47.6 51.1 55.2 49.9 52.7 14.9 9.5 12.3 15.1 14.8 15.0
SDR [30] 67.5 57.9 62.9 70.5 63.9 67.4 55.5 48.2 52.0 56.9 51.3 54.2 25.5 15.7 20.8 26.3 19.7 23.2
RECALL (GAN) 62.6 56.1 60.8 65.0 58.4 63.1 60.0 52.5 57.8 60.8 52.9 58.4 58.3 46.0 53.9 59.5 46.7 54.8
RECALL (Web) 64.1 56.9 61.9 66.0 58.8 63.7 63.2 55.1 60.6 64.8 57.0 62.3 62.3 50.0 57.8 65.0 53.7 60.7
Joint 76.6 74.0 75.4 76.6 74.0 75.4 76.6 74.0 75.4 76.6 74.0 75.4 76.6 74.0 75.4 76.6 74.0 75.4
settings increases. In the 10-10 case, our method achieves
slightly lower mIoU results than competitors (although comparable). As the complexity increases, our approach outperforms all competitors in 10-5 and by an even wider margin in 10-1 (see Table 1). We remark how RECALL
shows a convincing capability of providing rather steady accuracy across different setups, regardless of the number of incremental steps used to introduce new classes. For example, in the disjoint scenario, when moving from simpler to more challenging setups (i.e., from 10-10 to 10-1, passing through
10-5), the mIoU of FT drops as 33.0% → 23.7% → 4.3%, and that of SDR (i.e., the best compared approach) as 62.9% → 52.0% → 20.8%, while our approach maintains a stable mIoU trend of 61.9% → 60.6% → 57.8%. Finally, we
report the mIoU after each incremental step in the 10-1 disjoint scenario in Fig. 4, where our approaches show much higher mIoU than competitors at every learning step, indicating improved resilience to forgetting and background shift.
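The mIoU figures discussed above are the mean of per-class intersection-over-union scores, which can be computed from a pixel-level confusion matrix. Below is a minimal sketch of the standard metric definition, not the authors' evaluation code:

```python
import numpy as np

def mean_iou(conf):
    """mIoU from a confusion matrix.

    conf[i, j] counts pixels of ground-truth class i predicted as class j.
    Classes that never appear (IoU denominator zero) are skipped.
    """
    tp = np.diag(conf).astype(float)          # correctly labeled pixels per class
    fp = conf.sum(axis=0) - tp                # predicted as class c but wrong
    fn = conf.sum(axis=1) - tp                # class c pixels missed
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return np.nanmean(iou)

# Toy 2-class example: IoU = 50/55 and 45/50, mean ≈ 0.905
conf = np.array([[50, 2],
                 [3, 45]])
print(round(float(mean_iou(conf)), 3))
```

In the incremental setting, the same metric is reported separately over old classes, new classes, and all classes, which is what the columns of Table 1 show.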
In the qualitative results in Fig. 5 we observe that RECALL effectively alleviates forgetting and reduces the bias
towards novel classes. In the first row, the bus is correctly
preserved while FT, S&R and inpainting wrongly classify it
as train (i.e., one of the novel classes); in the second row, FT
places sheep and tv (newly added classes) in place of cow;
in the third row, some horse’s features are either mixed with
those of person and cat or completely destroyed, while they
are preserved by our methods (the web scheme shows higher
accuracy than the GAN here). Additional results, a few tests combining RECALL with competitors, and a preliminary evaluation on ADE20K [47] are presented in the Supplementary Material.

Figure 4: Evolution of mIoU over the 10 tasks of 10-1 disjoint.
7.1. Ablation Study
To further validate the robustness of our approach, we
perform some ablation studies. First of all, we analyze the
memory requirements. The plot in Fig. 6 shows, on a semi-log scale, the memory occupation (in MB) of the data to be stored at the end of each incremental step, as a function of the number of classes learned up to that point. We denote as standard an incremental approach that does not store any samples (e.g., FT, LwF, ILT, MiB, SDR). The saved model
generally corresponds to a fixed-size encoder and a decoder whose size slightly increases at each step to account for the additional output channels of the new learnable classes.
Saving images, instead, refers to the extreme scenario where training images of past steps are stored, thus being available throughout the entire incremental process.

Figure 5: Qualitative results on disjoint incremental setups, from top to bottom: 15-1, 15-5 and 10-1. Columns: RGB, GT, FT, S&R, Inpainting, RECALL (GAN), RECALL (Web), Joint (best viewed in colors).

As concerns our
approach, to annotate the originally weakly-labeled replay images, we devise a specific module (Sec. 4), which requires saving a set of helper decoders, one for each past step. Finally, for the GAN-based approach we add the storage required for the generative model. Fig. 6 shows that
our web-based solution is very close to the standard ones in
terms of memory occupation. The space required to store
the GAN is comparable to that needed to save images in the
very initial steps, but then remains constant while the space
for saving all training data quickly grows.
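The qualitative trends of Fig. 6 can be reproduced with a back-of-the-envelope accounting model. All sizes below (encoder, decoder, per-image, helper-decoder and GAN storage) are illustrative assumptions for the sketch, not measurements from the paper:

```python
def memory_mb(num_classes, mode,
              encoder_mb=150.0, decoder_mb=5.0, per_class_mb=0.02,
              image_mb=0.15, images_per_class=500, gan_mb=350.0,
              helper_decoder_mb=5.0, classes_per_step=1, base_classes=10):
    """Illustrative storage (MB) after learning `num_classes` classes."""
    # Every strategy keeps the current model; output channels grow per class.
    model = encoder_mb + decoder_mb + per_class_mb * num_classes
    if mode == "standard":                      # e.g., FT, LwF, ILT, MiB, SDR
        return model
    if mode == "saving images":                 # store all past training images
        return model + image_mb * images_per_class * num_classes
    steps = max(0, num_classes - base_classes) // classes_per_step
    if mode == "helper decoders (web)":         # one small decoder per past step
        return model + helper_decoder_mb * steps
    if mode == "helper decoders + GAN":         # plus a fixed-size generator
        return model + helper_decoder_mb * steps + gan_mb
    raise ValueError(mode)

for mode in ("standard", "helper decoders (web)",
             "helper decoders + GAN", "saving images"):
    print(f"{mode:>22}: {memory_mb(20, mode):8.1f} MB")
```

Whatever concrete sizes are plugged in, the shape of the curves matches the discussion: image storage grows linearly with the number of classes, while the GAN cost is a constant offset that the image curve quickly overtakes.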
We further analyze the contribution of the background inpainting and replay techniques in Table 2. While inpainting alone provides a solid contribution in terms of knowledge preservation, acting similarly to knowledge distillation, we
Table 2: mIoU results showing the contribution of each module (D: Disjoint, O: Overlapped).

            19-1        15-5        15-1        10-10       10-5        10-1
Method      D     O     D     O     D     O     D     O     D     O     D     O
Bgr inp.    65.6  66.7  52.2  52.5  49.7  49.9  58.8  60.7  47.5  47.1  34.0  39.0
GAN         54.5  56.2  49.8  49.1  47.9  48.2  45.8  48.8  38.1  43.7  36.6  40.8
Web         57.3  57.4  55.2  54.7  55.0  53.7  55.2  58.2  47.9  52.1  45.4  50.1
GAN+inp.    65.8  68.4  63.5  64.0  62.1  62.7  60.8  63.1  57.8  58.4  53.9  54.8
Web+inp.    65.4  68.6  66.3  65.6  64.3  64.8  61.9  63.7  60.6  62.3  57.8  60.7
observe that its effect tends to attenuate over multiple incremental steps: for example, moving from the 10-10 to the 10-1 overlapped setup, the mIoU of inpainting alone drops by more than 20%. On the other hand, the proposed replay techniques prove beneficial when multiple training stages are involved: in the same setting, replay alone limits the degradation to only 8%. Yet, jointly employing replay and inpainting further boosts the final results in all setups, proving that they can be effectively combined.
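The background inpainting idea amounts to relabeling background pixels of a new-step ground truth with the previous model's predictions on old classes. The snippet below is a hypothetical sketch of that relabeling step: the label arrays, the `BACKGROUND` index, and the absence of any confidence thresholding are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

BACKGROUND = 0  # assumed index of the background class

def inpaint_background(label_map, old_model_pred, old_classes):
    """Replace background pixels with the old model's old-class predictions.

    label_map:       (H, W) ground truth of the current step (old classes
                     are collapsed into background here).
    old_model_pred:  (H, W) argmax predictions of the previous-step model.
    old_classes:     set of class indices learned in past steps.
    """
    out = label_map.copy()
    # Only touch pixels currently labeled background where the old model
    # recognizes one of its known classes.
    mask = (label_map == BACKGROUND) & np.isin(old_model_pred, list(old_classes))
    out[mask] = old_model_pred[mask]
    return out

labels = np.array([[0, 0],
                   [5, 0]])        # 5: a newly added class; 0: background
old_pred = np.array([[3, 0],
                     [3, 2]])      # 3 and 2: classes from past steps
print(inpaint_background(labels, old_pred, {1, 2, 3}))
```

Note how the pixel already labeled with the new class (5) is left untouched: inpainting only resolves the semantic shift of the background, which is exactly why it behaves similarly to knowledge distillation in Table 2.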
Finally, we analyze how results vary w.r.t. the proportion of new (r_new) and replay (r_old) samples seen during training (Fig. 7): the mIoU is quite stable w.r.t. this ratio; however, the maximum value is reached when the same number of new and replay samples is used, i.e., r_new/r_old = 1.
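The interleaving policy of Fig. 7 can be sketched as a stream that alternates blocks of r_new new samples with r_old replay samples. This is a hypothetical sketch of such a policy, not the authors' training loop; sample objects and the cycling of the replay set are assumptions:

```python
import itertools
import random

def interleave(new_samples, replay_samples, r_new=1, r_old=1, seed=0):
    """Yield training samples mixing new and replay data at ratio r_new:r_old.

    Replay samples are cycled, so the stream length is driven by the
    amount of new data: for each block of r_new new samples, r_old
    replay samples are mixed in and the block is shuffled.
    """
    rng = random.Random(seed)
    replay = itertools.cycle(replay_samples)
    it = iter(new_samples)
    while True:
        block = list(itertools.islice(it, r_new))
        if not block:            # new data exhausted: epoch over
            break
        block += [next(replay) for _ in range(r_old)]
        rng.shuffle(block)       # avoid a fixed new/replay ordering
        yield from block

stream = list(interleave([f"new{i}" for i in range(4)],
                         [f"old{i}" for i in range(2)],
                         r_new=1, r_old=1))
print(stream)
```

With r_new = r_old = 1 (the best-performing setting above), every new sample is paired with one replay sample, so old and new classes receive balanced gradient updates.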
8. Conclusions
In this paper we introduced RECALL, which targets continual semantic segmentation by means of replay strategies to alleviate catastrophic forgetting and of background inpainting to mitigate background shift. Two replay schemes are proposed to retrieve data related to former training stages, either reproducing them via a conditional GAN or crawling them from the web. The experimental analyses proved the efficacy of our framework in improving accuracy and robustness over multiple incremental steps compared to competitors. Future work will improve the generative model, coupling it more tightly with the incremental setup, and explore how to control and refine weak supervision during web crawling. Evaluation on further datasets, such as ADE20K, will also be performed.
References

TensorFlow module of BigGAN-deep 512, https:// 512/1. Accessed on 18/03/2020.
TensorFlow module of EfficientNet-b2, classification/1. Accessed on 18/03/2020.
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large
scale GAN training for high fidelity natural image synthesis.
In Proceedings of the International Conference on Learning
Representations, 2019.
Fabio Cermelli, Massimiliano Mancini, Samuel Rota Bulò, Elisa Ricci, and Barbara Caputo. Modeling the background for incremental learning in semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
Liang-Chieh Chen, George Papandreou, Florian Schroff, and
Hartwig Adam. Rethinking atrous convolution for semantic
image segmentation. arXiv preprint arXiv:1706.05587, 2017.
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image
segmentation with deep convolutional nets, atrous convolu-
tion, and fully connected crfs. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 40:834–848, 2018.
Xinlei Chen and Abhinav Gupta. Webly supervised learning of convolutional networks. In Proceedings of the International Conference on Computer Vision, pages 1431–1439, 2015.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
Santosh K Divvala, Ali Farhadi, and Carlos Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3270–3277, 2014.
Arthur Douillard, Yifu Chen, Arnaud Dapogny, and Matthieu Cord. PLOP: Learning without forgetting for continual semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
and A. Zisserman. The PASCAL Visual Object Classes
Challenge 2012 (VOC2012) Results. http://www.pascal-
Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori
Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and
Daan Wierstra. Pathnet: Evolution channels gradient descent
in super neural networks. arXiv preprint arXiv:1701.08734,
Robert M French. Catastrophic forgetting in connectionist
networks. Trends in cognitive sciences, 3(4):128–135, 1999.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville,
and Yoshua Bengio. An empirical investigation of catas-
trophic forgetting in gradient-based neural networks. arXiv
preprint arXiv:1312.6211, 2013.
Chen He, Ruiping Wang, Shiguang Shan, and Xilin Chen.
Exemplar-supported generative reproduction for class incre-
mental learning. In Proceedings of the British Machine Vision
Conference, page 98, 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 770–778, 2016.
Seunghoon Hong, Donghun Yeo, Suha Kwak, Honglak Lee,
and Bohyung Han. Weakly supervised semantic segmentation
using web-crawled videos. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pages
7322–7330, 2017.
Bin Jin, Maria V Ortiz Segovia, and Sabine Susstrunk. Webly
supervised semantic segmentation. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 3626–3635, 2017.
Nitin Kamra, Umang Gupta, and Yan Liu. Deep generative
dual memory network for continual learning. arXiv preprint
arXiv:1710.10368, 2017.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel
Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan,
John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska,
et al. Overcoming catastrophic forgetting in neural net-
works. Proceedings of the national academy of sciences,
114(13):3521–3526, 2017.
Marvin Klingner, Andreas Bär, Philipp Donn, and Tim Fingscheidt. Class-incremental learning for semantic segmentation re-using neither old data nor old labels. In International Conference on Intelligent Transportation Systems, 2020.
Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian, Davide Maltoni, David Filliat, and Natalia Díaz-Rodríguez. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information Fusion, 58:52–68, 2020.
Dawei Li, Serafettin Tasci, Shalini Ghosh, Jingwen Zhu, Junting Zhang, and Larry Heck. Efficient incremental learning for mobile object detection. arXiv preprint arXiv:1904.00781, 2019.
Zhizhong Li and Derek Hoiem. Learning without forget-
ting. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 40(12):2935–2947, 2018.
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation. In Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 3431–3440, 2015.
David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017.
Umberto Michieli and Pietro Zanuttigh. Incremental learning techniques for semantic segmentation. In Proceedings of the International Conference on Computer Vision Workshops, 2019.
Umberto Michieli and Pietro Zanuttigh. Continual semantic segmentation via repulsion-attraction of sparse and disentangled latent representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
Umberto Michieli and Pietro Zanuttigh. Knowledge distilla-
tion for incremental learning in semantic segmentation. Com-
puter Vision and Image Understanding, 2021.
Li Niu, Ashok Veeraraghavan, and Ashutosh Sabharwal. We-
bly supervised learning meets zero-shot learning: A hybrid
approach for fine-grained classification. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 7171–7180, 2018.
Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jah-
nichen, and Moin Nabi. Learning to remember: A synaptic
plasticity driven framework for continual learning. In Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 11321–11329, 2019.
German I Parisi, Ronald Kemker, Jose L Part, Christopher
Kanan, and Stefan Wermter. Continual lifelong learning with
neural networks: A review. Neural Networks, 2019.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
Anthony Robins. Catastrophic forgetting, rehearsal and pseu-
dorehearsal. Connection Science, 7(2):123–146, 1995.
Joan Serra, Didac Suris, Marius Miron, and Alexandros Karat-
zoglou. Overcoming catastrophic forgetting with hard atten-
tion to the task. In Proceedings of the International Confer-
ence on Machine Learning, 2018.
Tong Shen, Guosheng Lin, Chunhua Shen, and Ian Reid. Bootstrapping the performance of webly supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1363–1371, 2018.
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari.
Incremental learning of object detectors without catastrophic
forgetting. In Proceedings of the International Conference on
Computer Vision, pages 3400–3409, 2017.
Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking
model scaling for convolutional neural networks. In Proceed-
ings of the International Conference on Machine Learning,
pages 6105–6114, 2019.
Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Grow-
ing a brain: Fine-tuning by increasing model capacity. In
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2471–2480, 2017.
Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye,
Zicheng Liu, Yandong Guo, Zhengyou Zhang, and Yun Fu.
Incremental classifier learning with generative adversarial
networks. arXiv preprint arXiv:1802.00853, 2018.
Fisher Yu and Vladlen Koltun. Multi-scale context aggrega-
tion by dilated convolutions. In Proceedings of the Interna-
tional Conference on Learning Representations, 2016.
Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual
learning through synaptic intelligence. In Proceedings of
the International Conference on Machine Learning, pages
3987–3995, 2017.
Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2881–2890, 2017.
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Bar-
riuso, and Antonio Torralba. Scene parsing through ade20k
dataset. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017.
... Regularization-based methods [9,18,39] focus on distilling knowledge, e.g., output probability, intermedia features, from pre-trained model of previous step. Replay-based methods [36] propose to store the information of previous old classes or web-crawled images and replay for new training steps. However, a key barrier to further develop these methods is the requirement for pixel-level annotations for new classes. ...
... Incremental Learning for Semantic Segmentation. In addition to an exhaustive exploration of incremental learning for image classification [7,10,17,25,28,33,41,44,46,49,50], a relatively few methods [9,18,29,36,38,39] have been proposed to tackle the incremental learning for semantic segmentation task, which can be classified into regularization-based and replay-based methods. ...
... SDR [39] proposes to optimize the classconditional features by minimizing feature discrepancy of the same class. In addition, as a replay-based method, RE-CALL [36] uses web-crawled images with pseudo labels to remedy the forgetting problem. Pixel-by-pixel labeling for semantic segmentation is time-consuming and laborintensive. ...
Modern incremental learning for semantic segmentation methods usually learn new categories based on dense annotations. Although achieve promising results, pixel-by-pixel labeling is costly and time-consuming. Weakly incremental learning for semantic segmentation (WILSS) is a novel and attractive task, which aims at learning to segment new classes from cheap and widely available image-level labels. Despite the comparable results, the image-level labels can not provide details to locate each segment, which limits the performance of WILSS. This inspires us to think how to improve and effectively utilize the supervision of new classes given image-level labels while avoiding forgetting old ones. In this work, we propose a novel and data-efficient framework for WILSS, named FMWISS. Specifically, we propose pre-training based co-segmentation to distill the knowledge of complementary foundation models for generating dense pseudo labels. We further optimize the noisy pseudo masks with a teacher-student architecture, where a plug-in teacher is optimized with a proposed dense contrastive loss. Moreover, we introduce memory-based copy-paste augmentation to improve the catastrophic forgetting problem of old classes. Extensive experiments on Pascal VOC and COCO datasets demonstrate the superior performance of our framework, e.g., FMWISS achieves 70.7% and 73.3% in the 15-5 VOC setting, outperforming the state-of-the-art method by 3.4% and 6.1%, respectively.
... In general, the assumption is to have a model trained on an initial class-set which at different stages is fine-tuned on different new data, where images contain annotation for one or more new classes (see Figure 1.12). The class-incremental learning problem has a long history in image clas-sification; yet, this problem has been only recently addressed in the SIS context (Cermelli et al., 2020;Michieli and Zanuttigh, 2021;Douillard et al., 2021;Maracani et al., 2021;Cha et al., 2021). Most works focus on the scenario where the original dataset on which the model was trained is not available, and propose to fine-tune the model on samples available for the current new class (Cermelli et al., 2020;Michieli and Zanuttigh, 2021;Douillard et al., 2021;Maracani et al., 2021) -without storing samples from the new classes over time. ...
... The class-incremental learning problem has a long history in image clas-sification; yet, this problem has been only recently addressed in the SIS context (Cermelli et al., 2020;Michieli and Zanuttigh, 2021;Douillard et al., 2021;Maracani et al., 2021;Cha et al., 2021). Most works focus on the scenario where the original dataset on which the model was trained is not available, and propose to fine-tune the model on samples available for the current new class (Cermelli et al., 2020;Michieli and Zanuttigh, 2021;Douillard et al., 2021;Maracani et al., 2021) -without storing samples from the new classes over time. This is a realistic scenario under the assumption that the model has been trained by a third-party and therefore one does not have access to the original training set or the access is prohibited by the rights to use the original training samples for copyright issues. ...
... Instead, Maracani et al. (2021) propose two strategies to recreate data that comprise the old classes; the first one relies on GANs and the second exploits images retrieved from the web. Cermelli et al. (2022) propose a classincremental learning pipeline for SiS where new classes are learned by relying on global image-level labels instead of pixel-level annotations, hence related to weekly supervised SiS (see Section 1.3.2). ...
Full-text available
Semantic image segmentation (SiS) plays a fundamental role in a broad variety of computer vision applications, providing key information for the global understanding of an image. This survey is an effort to summarize two decades of research in the field of SiS, where we propose a literature review of solutions starting from early historical methods followed by an overview of more recent deep learning methods including the latest trend of using transformers. We complement the review by discussing particular cases of the weak supervision and side machine learning techniques that can be used to improve the semantic segmentation such as curriculum, incremental or self-supervised learning. State-of-the-art SiS models rely on a large amount of annotated samples, which are more expensive to obtain than labels for tasks such as image classification. Since unlabeled data is instead significantly cheaper to obtain, it is not surprising that Unsupervised Domain Adaptation (UDA) reached a broad success within the semantic segmentation community. Therefore, a second core contribution of this book is to summarize five years of a rapidly growing field, Domain Adaptation for Semantic Image Segmentation (DASiS) which embraces the importance of semantic segmentation itself and a critical need of adapting segmentation models to new environments. In addition to providing a comprehensive survey on DASiS techniques, we unveil also newer trends such as multi-domain learning, domain generalization, domain incremental learning, test-time adaptation and source-free domain adaptation. Finally, we conclude this survey by describing datasets and benchmarks most widely used in SiS and DASiS and briefly discuss related tasks such as instance and panoptic image segmentation, as well as applications such as medical image segmentation.
... CL studies solutions to enable the incrementation of models with novel classes without losing previously acquired knowledge. The main families of CL strategies are generally categorized into 1) replay-based [31,26,33,26,36], 2) regularized-based [4,3,25,21] and 3) dynamic architecture-based methods [32,1,24,40,9]. In the following sections, we focus on regularization-based and dynamic architecture-based methods since we propose a hybrid strategy between these two approaches to build our Dynamic Y-KD network. ...
... CL studies solutions to enable the incrementation of models with novel classes without losing previously acquired knowledge. The main families of CL strategies are generally categorized into 1) replay-based [31,26,33,26,36], 2) regularized-based [4,3,25,21] and 3) dynamic architecture-based methods [32,1,24,40,9]. In the following sections, we focus on regularization-based and dynamic architecture-based methods since we propose a hybrid strategy between these two approaches to build our Dynamic Y-KD network. ...
Despite the success of deep learning methods on instance segmentation, these models still suffer from catastrophic forgetting in continual learning scenarios. In this paper, our contributions for continual instance segmentation are threefold. First, we propose the Y-knowledge distillation (Y-KD), a knowledge distillation strategy that shares a common feature extractor between the teacher and student networks. As the teacher is also updated with new data in Y-KD, the increased plasticity results in new modules that are specialized on new classes. Second, our Y-KD approach is supported by a dynamic architecture method that grows new modules for each task and uses all of them for inference with a unique instance segmentation head, which significantly reduces forgetting. Third, we complete our approach by leveraging checkpoint averaging as a simple method to manually balance the trade-off between the performance on the various sets of classes, thus increasing the control over the model's behavior without any additional cost. These contributions are united in our model that we name the Dynamic Y-KD network. We perform extensive experiments on several single-step and multi-steps scenarios on Pascal-VOC, and we show that our approach outperforms previous methods both on past and new classes. For instance, compared to recent work, our method obtains +2.1% mAP on old classes in 15-1, +7.6% mAP on new classes in 19-1 and reaches 91.5% of the mAP obtained by joint-training on all classes in 15-5.
... In general, these methods update an entire classifier continuously during training on a stream of tasks rather than freezing a feature extractor. The methods in the second group mostly freeze an ImageNet-pretrained feature extractor after fine-tuning in a base training session Maracani et al., 2021;Ganea et al., 2021;Cheraghian et al., 2021;Wu et al., 2022). They focused on finding a novel CL method that can efficiently train a classification layer. ...
... Deep SLDA ) trained a classification layer using linear discriminant analysis, and CEC trained a classification layer to transfer context information for adaptation with a graph model. Lately, a continual semantic segmentation method, called RECALL (Maracani et al., 2021), which uses a pre-trained model as an encoder was proposed. However, MuFAN is the first online CL method that utilizes a richer multi-scale feature map from different layers of the pre-trained encoder to obtain a strong training signal from a single data point. ...
The aim of continual learning is to learn new tasks continuously (i.e., plasticity) without forgetting previously learned knowledge from old tasks (i.e., stability). In the scenario of online continual learning, wherein data comes strictly in a streaming manner, the plasticity of online continual learning is more vulnerable than offline continual learning because the training signal that can be obtained from a single data point is limited. To overcome the stability-plasticity dilemma in online continual learning, we propose an online continual learning framework named multi-scale feature adaptation network (MuFAN) that utilizes a richer context encoding extracted from different levels of a pre-trained network. Additionally, we introduce a novel structure-wise distillation loss and replace the commonly used batch normalization layer with a newly proposed stability-plasticity normalization module to train MuFAN that simultaneously maintains high plasticity and stability. MuFAN outperforms other state-of-the-art continual learning methods on the SVHN, CIFAR100, miniImageNet, and CORe50 datasets. Extensive experiments and ablation studies validate the significance and scalability of each proposed component: 1) multi-scale feature maps from a pre-trained encoder, 2) the structure-wise distillation loss, and 3) the stability-plasticity normalization module in MuFAN. Code is publicly available at
... Class-Incremental Learning (CIL) aims to continuously learn new classes from incrementally obtained training data and has drawn remarkable research interest in recent years [1], [2], [3], [4], [5], [6], [7]. When applying to a classification task, most existing CIL methods [8], [1], [9], [10], [11], [3], [2], [12], [13], [14] generally first assume that each image only contains a single object, and then develop anti-forgetting mechanisms such as knowledge distillation [15] to learn new classes without forgetting the old ones (the single-label CIL problem). ...
Current class-incremental learning research mainly focuses on single-label classification tasks while multi-label class-incremental learning (MLCIL) with more practical application scenarios is rarely studied. Although there have been many anti-forgetting methods to solve the problem of catastrophic forgetting in class-incremental learning, these methods have difficulty in solving the MLCIL problem due to label absence and information dilution. In this paper, we propose a knowledge restore and transfer (KRT) framework for MLCIL, which includes a dynamic pseudo-label (DPL) module to restore the old class knowledge and an incremental cross-attention(ICA) module to save session-specific knowledge and transfer old class knowledge to the new model sufficiently. Besides, we propose a token loss to jointly optimize the incremental cross-attention module. Experimental results on MS-COCO and PASCAL VOC datasets demonstrate the effectiveness of our method for improving recognition performance and mitigating forgetting on multi-label class-incremental learning tasks.
... The majority of research in continual semantic segmentation can be broadly divided into class-incremental semantic segmentation (CiSS) and continual unsupervised domain adaptation (CUDA). The challenges of CiSS are mostly addressed by using a knowledge distillation-based loss with hard or soft pseudo labels [4], [5], [6], [7], addressing the background shift [8], disentangling the representation of individual classes [9] or combining knowledge-distillation with replay [10], [11], [12], [13]. The goal of CUDA is to adapt a model trained on a supervised source dataset to a sequence of unlabeled domains. ...
Research in the field of Continual Semantic Segmentation is mainly investigating novel learning algorithms to overcome catastrophic forgetting of neural networks. Most recent publications have focused on improving learning algorithms without distinguishing effects caused by the choice of neural architecture.Therefore, we study how the choice of neural network architecture affects catastrophic forgetting in class- and domain-incremental semantic segmentation. Specifically, we compare the well-researched CNNs to recently proposed Transformers and Hybrid architectures, as well as the impact of the choice of novel normalization layers and different decoder heads. We find that traditional CNNs like ResNet have high plasticity but low stability, while transformer architectures are much more stable. When the inductive biases of CNN architectures are combined with transformers in hybrid architectures, it leads to higher plasticity and stability. The stability of these models can be explained by their ability to learn general features that are robust against distribution shifts. Experiments with different normalization layers show that Continual Normalization achieves the best trade-off in terms of adaptability and stability of the model. In the class-incremental setting, the choice of the normalization layer has much less impact. Our experiments suggest that the right choice of architecture can significantly reduce forgetting even with naive fine-tuning and confirm that for real-world applications, the architecture is an important factor in designing a continual learning model.
... Data-centric CIL algorithms mainly concentrate on resisting forgetting with the help of former data. Replay is a simple yet effective strategy, which has been widely applied to image-based camera localization [129], semantic segmentation [130], [131], [132], video classification [133], and action recognition [95]. However, since the exemplar set only saves a tiny portion of the training set, data replay may suffer the overfitting problem [59], [134]. ...
Deep models, e.g., CNNs and Vision Transformers, have achieved impressive results in many vision tasks in the closed world. However, novel classes emerge from time to time in our ever-changing world, requiring a learning system to acquire new knowledge continually. For example, a robot needs to understand new instructions, and an opinion monitoring system should analyze emerging topics every day. Class-Incremental Learning (CIL) enables the learner to incorporate the knowledge of new classes incrementally and build a universal classifier among all seen classes. Correspondingly, when the model is directly trained with new class instances, a fatal problem occurs: the model tends to catastrophically forget the characteristics of former classes, and its performance drastically degrades. There have been numerous efforts to tackle catastrophic forgetting in the machine learning community. In this paper, we comprehensively survey recent advances in deep class-incremental learning and summarize these methods from three aspects, i.e., data-centric, model-centric, and algorithm-centric. We also provide a rigorous and unified evaluation of 16 methods on benchmark image classification tasks to empirically characterize the different algorithms. Furthermore, we notice that the current comparison protocol ignores the influence of the memory budget in model storage, which may result in unfair comparisons and biased results. Hence, we advocate fair comparison by aligning the memory budget in evaluation, along with several memory-agnostic performance measures. The source code to reproduce these evaluations is available at
Class-incremental learning for semantic segmentation (CiSS) is a highly active research field that aims at updating a semantic segmentation model by sequentially learning new semantic classes. A major challenge in CiSS is overcoming catastrophic forgetting, i.e., the sudden drop of accuracy on previously learned classes after the model is trained on a new set of classes. Despite the latest advances in mitigating catastrophic forgetting, the underlying causes of forgetting specifically in CiSS are not well understood. Therefore, in a set of experiments and representational analyses, we demonstrate that the semantic shift of the background class and a bias towards new classes are the major causes of forgetting in CiSS. Furthermore, we show that both causes mostly manifest themselves in the deeper classification layers of the network, while the early layers of the model are not affected. Finally, we demonstrate how both causes can be effectively mitigated using the information contained in the background, with the help of knowledge distillation and an unbiased cross-entropy loss.
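The "unbiased cross-entropy" idea mentioned above can be illustrated by treating the background label as a superset of background plus old classes, so that a pixel the old model would have labeled with an old class is not penalized. This is a hedged toy sketch of that general idea, not the authors' exact formulation; the class indexing and the `old_classes` argument are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def unbiased_ce(logits, target, old_classes):
    """Cross-entropy where background (class 0) probability is the sum
    over {background} union old_classes, since a 'background' pixel in
    the current step may actually belong to a previously learned class."""
    p = softmax(logits)
    if target == 0:
        prob = p[..., [0] + list(old_classes)].sum(axis=-1)
    else:
        prob = p[..., target]
    return -np.log(prob + 1e-12)
```

Because the background probability mass absorbs the old classes, the loss no longer pushes old-class responses toward zero on background-labeled pixels.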
Deep learning architectures have shown remarkable results in scene understanding problems; however, they exhibit a critical drop in performance when required to incrementally learn new tasks without forgetting old ones. This catastrophic forgetting phenomenon impacts the deployment of artificial intelligence in real-world scenarios where systems need to learn new and different representations over time. Current approaches for incremental learning deal only with image classification and object detection tasks, while in this work we formally introduce incremental learning for semantic segmentation. We tackle the problem by applying various knowledge distillation techniques on the previous model. In this way, we retain the information about learned classes while updating the current model to learn the new ones. We develop four main methodologies of knowledge distillation working on both the output layers and internal feature representations. We do not store any image belonging to previous training stages, and only the last model is used to preserve high accuracy on previously learned classes. Extensive experimental results on the Pascal VOC2012 and MSRC-v2 datasets show the effectiveness of the proposed approaches in several incremental learning scenarios.
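Output-level knowledge distillation, one of the methodologies described above, can be sketched as a soft cross-entropy between the previous (teacher) model's outputs and the current (student) model's outputs over the old classes. A minimal NumPy illustration follows; the temperature `T` is an assumed hyperparameter, and the exact loss in the cited work may differ:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(new_logits, old_logits, T=2.0):
    """Soft cross-entropy between the old model's (teacher) softened
    distribution and the new model's (student) distribution, averaged
    over samples/pixels. Minimized when the student matches the teacher."""
    p_old = softmax(old_logits / T)
    log_p_new = np.log(softmax(new_logits / T) + 1e-12)
    return -np.mean(np.sum(p_old * log_p_new, axis=-1))
```

During an incremental step this term is added to the standard cross-entropy on the new classes, anchoring the old-class responses without storing any old images.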
Continual learning (CL) is a machine learning paradigm in which the data distribution and learning objective change through time, or in which all the training data and objective criteria are never available at once. The evolution of the learning process is modeled by a sequence of learning experiences, where the goal is to learn new skills along the sequence without forgetting what has been previously learned. CL can be seen as online learning, where knowledge fusion needs to take place in order to learn from streams of data presented sequentially in time. Continual learning also aims to optimize the memory, computation power, and speed of the learning process. An important challenge for machine learning is not necessarily finding solutions that work in the real world, but rather finding stable algorithms that can learn in the real world. Hence, the ideal approach would be tackling the real world in an embodied platform: an autonomous agent. Continual learning would then be effective in an autonomous agent or robot, which would learn autonomously through time about the external world and incrementally develop a set of complex skills and knowledge. Robotic agents have to learn to adapt and interact with their environment using a continuous stream of observations. Some recent approaches aim at tackling continual learning for robotics, but most recent papers on continual learning only evaluate their approaches in simulation or with static datasets. Unfortunately, the evaluation of those algorithms does not provide insights on whether their solutions may help continual learning in the context of robotics. This paper reviews the existing state of the art in continual learning, summarizes existing benchmarks and metrics, and proposes a framework for presenting and evaluating both robotics and non-robotics approaches in a way that makes transfer between the two fields easier.
We shed light on continual learning in the context of robotics to create connections between the fields and normalize approaches.
While deep learning has led to remarkable advances across diverse applications, it struggles in domains where the data distribution changes over the course of learning. In stark contrast, biological neural networks continually adapt to changing domains, possibly by leveraging complex molecular machinery to solve many tasks simultaneously. In this study, we introduce intelligent synapses that bring some of this biological complexity into artificial neural networks. Each synapse accumulates task-relevant information over time and exploits this information to rapidly store new memories without forgetting old ones. We evaluate our approach on the continual learning of classification tasks and show that it dramatically reduces forgetting while maintaining computational efficiency.
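The per-synapse importance accumulation described above can be sketched as a path integral: each parameter's importance grows when a training step in that direction reduced the loss, and a quadratic penalty then anchors important parameters when learning the next task. A hedged NumPy illustration of the core quantities follows (the function names and the `xi`, `c` defaults are assumptions for this sketch):

```python
import numpy as np

def accumulate_path_importance(grads, steps):
    """omega_k = -sum_t g_k(t) * delta_theta_k(t): importance accumulated
    along the training trajectory of one task. A step opposing the gradient
    (i.e., one that lowered the loss) contributes positively."""
    return -np.sum(np.asarray(grads) * np.asarray(steps), axis=0)

def consolidation_strength(omega, task_delta, xi=0.1):
    """Normalize by the squared total parameter displacement over the task
    (plus a damping term xi to avoid division by zero)."""
    return omega / (task_delta ** 2 + xi)

def surrogate_loss(theta, theta_star, strength, c=1.0):
    """Quadratic penalty pulling parameters important for old tasks
    back toward the old solution theta_star."""
    return c * np.sum(strength * (theta - theta_star) ** 2)
```

During training on a new task, `surrogate_loss` is added to the task loss, so unimportant parameters stay free to move while important ones are consolidated.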