Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization

Abstract

The pursuit of long-term autonomy mandates that robotic agents must continuously adapt to their changing environments and learn to solve new tasks. Continual learning seeks to overcome the challenge of catastrophic forgetting, where learning to solve new tasks causes a model to forget previously learnt information. Prior-based continual learning methods are appealing for robotic applications as they are space efficient and typically do not increase in computational complexity as the number of tasks grows. Despite these desirable properties, prior-based approaches typically fail on important benchmarks and consequently are limited in their potential applications compared to their memory-based counterparts. We introduce Bayesian adaptive moment regularization (BAdam), a novel prior-based method that better constrains parameter growth, leading to lower catastrophic forgetting. Our method boasts a range of desirable properties for robotic applications such as being lightweight and task label-free, converging quickly, and offering calibrated uncertainty that is important for safe real-world deployment. Results show that BAdam achieves state-of-the-art performance for prior-based methods on challenging single-headed class-incremental experiments such as Split MNIST and Split FashionMNIST, and does so without relying on task labels or discrete task boundaries.
Jack Foster¹ and Alexandra Brintrup²
I. INTRODUCTION
A common assumption in machine learning (ML) is
that training data is independent and identically distributed
(i.i.d.). However, in continual learning (CL) an ML model
is presented with a non-i.i.d. stream of sequential tasks.
Under such circumstances traditional learning methods
suffer from a phenomenon known as catastrophic forgetting,
where previously learnt knowledge is lost as a result of
model parameters biasing towards recently observed data
[1], [2], [3]. Continual learning is a major, long-standing
challenge within machine learning [4], and it is naturally
entwined with robotics [5]. A canonical example of
continual learning is that of a Mars rover needing to adapt
to ever-changing terrain as it traverses the surface [6].
An equally valid example would be a warehouse packing
robot here on Earth, needing to continuously recognise new
tools or new products in order to appropriately package them.
This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may
no longer be accessible.
* Funding from the UK Engineering & Physical Sciences Research
Council (EPSRC) for the AgriForwards Centre for Doctoral Training hosted
at University of Cambridge with UK university partners is gratefully
acknowledged (Grant Reference: EP/S023917/1). Work is also supported
by the James Dyson Foundation.
¹Jack Foster is with the Department of Engineering, University of
Cambridge, CB3 0FS, UK jwf40@cam.ac.uk
²Alexandra Brintrup is with the Department of Engineering, University
of Cambridge, CB3 0FS, UK ab702@cam.ac.uk
There are several approaches to continual learning, such
as memory-based, where past experiences are stored in
a replay-buffer, and (prior-based) regularization methods,
which seek to constrain weight updates based on their
importance to previously learned tasks. Regularization
methods are attractive as they are biologically accurate [4],
do not require additional memory storage, and typically
do not grow in computational complexity as the number
of tasks increases. These properties make regularization
approaches particularly appealing for robotic applications,
due to the need to operate in real-time, in the real-world,
and often on edge or lightweight computing hardware.
In [6], five desiderata for CL evaluations are introduced
which ensure that evaluations are robust and representative
of real-world CL challenges. They are as follows: input data
from later tasks must bear some resemblance to previous
tasks, task labels should not be available at test-time (since
if they were then separate models could be trained for
each task), all tasks should share an output head (multi-
head architectures simplify the problem by making the
assumption that the task label is known at test-time), there
cannot be unconstrained retraining on old tasks (motivations
of CL preclude this), and finally CL evaluations must
use more than two tasks. Two experiments that satisfy
these requirements are the single-headed Split MNIST [7],
and Split FashionMNIST problems. Prior-based methods
typically fail to improve upon naïve baseline performance
in these evaluations, experiencing the same catastrophic
forgetting as ordinary supervised learning. In order to excel,
these methods require the experimental setup to violate
the desiderata (e.g. using multiple heads and test-time task
labels). This is a significant barrier to the application of
prior-based methods to robotics and other domains.
A further weakness of regularization approaches is they
often make strong assumptions about the structure of the
data stream. Specifically, it is typically assumed that tasks
are discrete (i.e. no overlaps), and that the boundaries
between tasks are known. In [8] a different paradigm is
noted, where task boundaries are unknown, and where tasks
may overlap. This is a much harder problem as task labels
cannot be used, and also because taking multiple passes
over each task’s data is no longer possible. While such
experimental conditions are very challenging, they may be
more representative of real-world online continual learning
challenges, where there is no known structure to the data
stream. Bayesian Gradient Descent (BGD) was proposed as
a solution to continual learning with unknown, graduated
task boundaries by solving the problem with a closed-form
update rule that does not rely on task labels [9]. BGD
is insightful and possesses several appealing properties,
particularly for robotics: it requires no auxiliary memory
storage, it does not require task labels, it does not require
tasks to be separated discretely, and it has calibrated
uncertainties, which are valuable in safety-critical robotic
applications [10]. However, BGD has two key failings:
first, it is slow to converge, and second, it fails to solve
single-headed Split MNIST and Split FashionMNIST.
arXiv:2309.08546v1 [cs.LG] 15 Sep 2023
We present Bayesian Adaptive Moment Regularization
(BAdam), a novel continual learning method that unifies the
closed-form update rule of BGD with properties found in the
Adam optimizer [11], maintaining adaptive per-parameter
learning rates based on the first- and second-order moments
of the gradient. The consequence of this is improved
convergence rates, more effective constraining of parameter
updates and therefore, we hypothesize, the learning of
parameters that better generalize across multiple tasks.
We show empirically that our method makes substantial
improvements over previous prior-based approaches on
single-headed class-incremental problems.
Finally, there has been no evaluation of methods in an
environment that enforces the desiderata from [6], while
also featuring graduated task boundaries (where tasks have
some overlap as one finishes and another begins), no train-
time task labels, and that permits only a single-epoch of
training per task. This formulation is more reflective of the
scenarios encountered by a robot in the real-world, where
the environment often changes gradually, and retraining
over multiple epochs is hard due to compute restrictions.
Therefore, in this work we introduce this formulation of
a continual learning challenge and evaluate regularization
methods in such an environment.
II. RELATED WORK
As discussed in [8], there are three primary continual
learning scenarios: task-, domain-, and class-incremental
learning, differentiated by task label availability (Table I).
Task-IL problems are the easiest, as the label is given and
thus task-specific components, such as separate heads or sub-
networks, may be utilized. Such scenarios are less likely to
be present in real-world tasks, due to the uncertainty sur-
rounding the observed datastream. Class-IL scenarios are the
most challenging, and class-incremental split MNIST is the
simplest benchmark that accurately evaluates the efficacy of
CL methods [6]. There exist many approaches to solving CL
problems, which can be loosely classified into the following
four paradigms: memory-based [12], [13], [14], [15], [16],
[17], architectural [18], [19], [20], data-augmentation [21],
[22], [23], [24], and regularization. Here, we restrict our
review to regularization methods.
TABLE I
OVERVIEW OF THE THREE CONTINUAL LEARNING SCENARIOS FROM [8]
Scenario    Task-ID
Task-IL     Task-ID provided
Domain-IL   Task-ID not provided
Class-IL    Must infer task-ID
Regularization techniques are an abstraction of biological
neuro-plasticity [25], [26], reducing parameter update
rates proportional to their importance to previous tasks,
thereby protecting past knowledge. This is effective as
many parameter sets will yield equivalent performance for a
given neural network [27], [28], thus constraining parameter
updates does not preclude strong solutions being found.
Elastic weight consolidation (EWC) protects knowledge
of previous tasks with a quadratic penalty on the difference
between the previous task’s parameters, and that of
the current task [29]. The Fisher information matrix of
previous tasks is used to estimate parameter importance.
Memory aware synapses (MAS) quantify parameter
importance through measuring the sensitivity of the
output to perturbations of the weight [30], while synaptic
intelligence (SI) calculates importance as a function of a
parameter’s contribution to the path integral of the gradient
vector field along the parameter trajectory [7]. Variational
continual learning (VCL) takes a Bayesian approach
[31], where online variational inference is leveraged to
mitigate catastrophic forgetting. A limitation of all of these
methods is that they all assume task boundaries are discrete,
and that the current task is known. Task-free continual
learning (TFCL) removes the reliance on task labels [32].
Utilising MAS as the underlying CL method, TFCL detects
plateaus in the loss surface (indicating learnt tasks), taking
subsequent peaks to indicate the introduction of new tasks.
This provides an online, label-free means of determining
when to update importance weights; however, it is somewhat
vulnerable to graduated task boundaries, as the intermingling
of tasks may cause sporadic changes in the loss surface.
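To make the notion of an importance-weighted penalty concrete, the following is a minimal sketch of the quadratic penalty used by EWC [29], assuming a diagonal Fisher estimate. The function name `ewc_penalty`, the `strength` coefficient, and the toy values are illustrative assumptions and are not taken from any evaluated implementation.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    old_params: Sequence[float],
    fisher_diag: Sequence[float],
    strength: float = 1.0,
) -> float:
    """Quadratic penalty: (strength / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    if not (len(params) == len(old_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for theta, theta_old, fisher in zip(params, old_params, fisher_diag):
        # Parameters with a large Fisher value are expensive to move.
        total += fisher * (theta - theta_old) ** 2
    return 0.5 * strength * total


# Toy values (hypothetical): the first weight is important, the second is not.
old = [1.0, -2.0]
new = [1.5, 0.0]
fisher = [4.0, 0.0]
print(ewc_penalty(new, old, fisher))  # 0.5 * (4.0 * 0.25 + 0.0 * 4.0) = 0.5
```

In this sketch the second weight moves freely because its importance is zero, whereas the first weight incurs a cost that grows quadratically with its displacement. The penalty is added to the loss of the current task, which is why such methods need to know when one task ends and the next begins.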
Bayesian Gradient Descent addresses both graduated
task boundaries and no task labels through a closed form
update rule derived from online variational Bayes [9].
BGD’s theoretical derivation assumes that data arrives in
a sequential, online manner, offering a promising route
to truly online continual learning. In practice, however,
convergence is slow, and many epochs are required to reach
optimal performance, violating the conditions described in
[8]. Furthermore, like other prior-based methods, BGD fails
to solve single-headed class-incremental problems.
Other noteworthy papers include [33], which is not a
prior-based method but is a form of hard regularization,
constraining the parameter space to be within a specific
hyper-rectangle. A variational form of the Adam optimizer
[11] is introduced in [34]; it bears some resemblance to our
proposed method, but it is not a continual learning
algorithm, nor does it prevent catastrophic forgetting. Finally,
we note there have been several works highlighting the need
for continual learning in robotics [5], [35], [36].
III. METHODS
A. Preliminaries
BGD attempts to learn the online posterior distribution of
model parameters given by:
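$$p(\theta \mid D_n) = \frac{p(D_n \mid \theta)\, p(\theta \mid D_{n-1})}{p(D_n)} \tag{1}$$
where θ are the model parameters, and $D_n$ is the $n$th
dataset (task) in the sequence. As calculating the exact
posterior is intractable, online variational Bayes is used to
approximate it. From [37], a parametric distribution $q(\theta \mid \phi)$
is used to approximate a posterior distribution by minimising
the Kullback-Leibler divergence:
$$\mathrm{KL}\big(q(\theta \mid \phi)\,\|\,p(\theta \mid D)\big) = -\mathbb{E}_{\theta \sim q(\theta \mid \phi)}\left[\log \frac{p(\theta \mid D)}{q(\theta \mid \phi)}\right] \tag{2}$$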
In online variational Bayes [38], the optimal variational
parameters are calculated by:
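$$\begin{aligned}
\phi &= \arg\min_{\phi} \int q_n(\theta \mid \phi) \log \frac{q_n(\theta \mid \phi)}{p(\theta \mid D_n)}\, d\theta \\
&= \arg\min_{\phi} \mathbb{E}_{\theta \sim q_n(\theta \mid \phi)}\big[\log q_n(\theta \mid \phi) - \log q_{n-1}(\theta) - \log p(D_n \mid \theta)\big]
\end{aligned} \tag{3}$$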
The following transformation is defined to arrive at our
Bayesian neural network:
$$\theta_i = \mu_i + \epsilon_i \sigma_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1), \qquad \phi = (\mu, \sigma) \tag{4}$$
Assuming the parametric distribution is Gaussian and that
components are independent (mean-field approximation), the
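optimization problem in equation 3 can be solved using unbiased
Monte Carlo gradients as in [39]. This results in the following
closed-form updates for the variational parameters (µ, σ):
$$\mu_t = \mu_{t-1} - \eta\, \sigma_{t-1}^2\, \mathbb{E}_\epsilon\left[\frac{\partial(-\log p(D_n \mid \theta))}{\partial \theta_t}\right] \tag{5}$$
$$\sigma_t = \sigma_{t-1}\sqrt{1 + \left(\frac{1}{2}\sigma_{t-1}\, \mathbb{E}_\epsilon\left[\frac{\partial(-\log p(D_n \mid \theta))}{\partial \theta_t}\epsilon_t\right]\right)^2} - \frac{1}{2}\sigma_{t-1}^2\, \mathbb{E}_\epsilon\left[\frac{\partial(-\log p(D_n \mid \theta))}{\partial \theta_t}\epsilon_t\right] \tag{6}$$
where η is a learning rate. Since $\frac{\partial(-\log p(D_n \mid \theta))}{\partial \theta_i}$ depends
on $\mu_i$ and $\sigma_i$, the solution is predicted by evaluating
this derivative using the prior parameters. Finally, the
expectations are estimated using the Monte Carlo method.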
For a more complete derivation of the method, we direct
the reader to the original paper [9].
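As an illustration of how equations 4 to 6 fit together, the following is a minimal sketch of a single BGD-style update for one scalar parameter, using only the Python standard library. The function name `bgd_step`, the default values of `eta` and `n_samples`, and the toy quadratic loss are assumptions made for this example; it is not the reference implementation of [9].

```python
import math
import random
from typing import Callable, Optional, Tuple


def bgd_step(
    mu: float,
    sigma: float,
    grad_fn: Callable[[float], float],
    eta: float = 1.0,
    n_samples: int = 10,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """One BGD-style update of (mu, sigma) for a single scalar parameter."""
    rng = rng if rng is not None else random.Random(0)
    grad_sum = 0.0
    grad_eps_sum = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        theta = mu + eps * sigma          # reparameterization, equation 4
        grad = grad_fn(theta)             # dL/dtheta at the sampled weight
        grad_sum += grad
        grad_eps_sum += grad * eps
    e_grad = grad_sum / n_samples         # Monte Carlo estimate of E[dL/dtheta]
    e_grad_eps = grad_eps_sum / n_samples  # Monte Carlo estimate of E[dL/dtheta * eps]

    new_mu = mu - eta * sigma**2 * e_grad                         # equation 5
    half = 0.5 * sigma * e_grad_eps
    new_sigma = sigma * math.sqrt(1.0 + half**2) - sigma * half   # equation 6
    return new_mu, new_sigma


# Hypothetical toy loss L(theta) = 0.5 * 4 * (theta - 2)^2, so dL/dtheta = 4 * (theta - 2).
def toy_grad(theta: float) -> float:
    return 4.0 * (theta - 2.0)


rng = random.Random(0)
mu, sigma = 3.0, 0.1
for _ in range(5):
    mu, sigma = bgd_step(mu, sigma, toy_grad, rng=rng)
print(mu, sigma)
```

Note that the step taken by µ is scaled by σ², so a parameter with small uncertainty barely moves. This is the mechanism that protects earlier tasks, and with small initial σ it also makes early updates small, which is consistent with the slow convergence discussed in this paper.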
B. The BAdam Optimizer
The update rule for µ in BGD closely resembles the
update rule in stochastic gradient descent, with the key
difference being the variance weighting, which leads to
smaller updates for highly certain parameters. This is
the mechanism through which BGD lowers plasticity and
prevents forgetting. To alleviate the catastrophic forgetting
experienced by BGD, we propose a new update rule for µ
which leads to significantly less plasticity. This update rule
is given in equation 7.
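$$\begin{aligned}
m_t &= \frac{\beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \mathbb{E}_\epsilon\left[\frac{\partial L_n(\theta)}{\partial \theta_i}\right]}{1 - \beta_1^t} \\
v_t &= \frac{\beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot \left(\mathbb{E}_\epsilon\left[\frac{\partial L_n(\theta)}{\partial \theta_i}\right]\right)^2}{1 - \beta_2^t} \\
\mu_t &= \mu_{t-1} - \eta\, \sigma^2 \cdot \frac{m_t}{\sqrt{v_t} + \gamma}
\end{aligned} \tag{7}$$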
where $L_n(\theta) = -\log p(D_n \mid \theta)$ is the negative log-likelihood
loss, and γ is a small value, typically $10^{-8}$. This is
derived from the update rule in Adam [11], with $m_t$
being the bias-corrected mean of the gradient and $v_t$ being
its bias-corrected uncentred variance. The approach introduces a
per-parameter learning rate and a momentum characteristic,
which facilitates a greater robustness to saddle points and
areas of small curvature in the loss surface, as well as faster
convergence [11].
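The update for µ in equation 7 can be sketched for a single scalar parameter as follows. This is a hedged illustration rather than the authors' code: the function name `badam_mu_step` is hypothetical, the defaults for `beta1` and `beta2` are the usual Adam defaults [11] and are not stated in this paper, the square root in the denominator follows Adam, and the raw moments are stored with bias correction applied separately, as is conventional for Adam.

```python
import math
from typing import Tuple


def badam_mu_step(
    mu: float,
    sigma: float,
    expected_grad: float,
    m: float,
    v: float,
    t: int,
    eta: float = 0.01,
    beta1: float = 0.9,
    beta2: float = 0.999,
    gamma: float = 1e-8,
) -> Tuple[float, float, float]:
    """One Adam-style, variance-weighted update of mu; returns (new_mu, m, v)."""
    if t < 1:
        raise ValueError("t counts update steps and must start at 1")
    # Exponential moving averages of the expected gradient and its square.
    m = beta1 * m + (1.0 - beta1) * expected_grad
    v = beta2 * v + (1.0 - beta2) * expected_grad**2
    # Bias correction (assumed to follow Adam's convention).
    m_hat = m / (1.0 - beta1**t)
    v_hat = v / (1.0 - beta2**t)
    # The step is additionally scaled by sigma^2, as in the BGD update for mu.
    new_mu = mu - eta * sigma**2 * m_hat / (math.sqrt(v_hat) + gamma)
    return new_mu, m, v


# Toy usage with a constant expected gradient (hypothetical values).
mu, m, v = 1.0, 0.0, 0.0
sigma = 0.1
for t in range(1, 4):
    mu, m, v = badam_mu_step(mu, sigma, expected_grad=5.0, m=m, v=v, t=t)
print(mu)
```

Under these assumptions the first step has magnitude close to ησ² whatever the size of the gradient, so the uncertainty σ, rather than the raw gradient scale, determines how far a parameter may move.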
Fig. 1. Change in average µ magnitude during training. Catastrophic
forgetting can be observed in BGD’s parameters, while this is less present
for BAdam.
The benefits of BAdam’s µ update rule, in a continual
learning context, are two-fold. First, recent work has shown
that Adam leads to considerably less plasticity than SGD
[40]. While this can be a negative as it can limit learning
in later tasks, we propose that this reduced plasticity could
play a role in reducing catastrophic forgetting. Second, the
variance of each parameter (σ) is minimised when µ is at
an optimal value, and since the plasticity of a parameter is
controlled by this σ value, better and faster optimization of
µ leads to lower update rates for those parameters, which
will reduce catastrophic forgetting. To better understand this,
we revisit the discussion in [9]. Using a Taylor expansion
we can see that $\mathbb{E}_\epsilon\left[\frac{\partial L_n(\theta)}{\partial \theta_i}\epsilon_i\right]$ closely approximates the
curvature of the loss, for small values of σ:
$$\begin{aligned}
\mathbb{E}_\epsilon\left[\frac{\partial L_n(\theta)}{\partial \theta_i}\epsilon_i\right]
&= \mathbb{E}_\epsilon\left[\left(\frac{\partial L_n(\mu)}{\partial \theta_i} + \sum_j \frac{\partial^2 L_n(\mu)}{\partial \theta_i \partial \theta_j}\epsilon_j \sigma_j + O(\|\sigma\|^2)\right)\epsilon_i\right] \\
&= \frac{\partial^2 L_n(\mu)}{\partial \theta_i^2}\sigma_i + O(\|\sigma\|^2)
\end{aligned} \tag{8}$$
where for $\epsilon_i$ and $\epsilon_j$ the following holds: $\mathbb{E}_\epsilon[\epsilon_i] = 0$ and
$\mathbb{E}_\epsilon[\epsilon_i \epsilon_j] = \delta_{ij}$. The consequence of this is that in areas
with positive curvature, such as near local and global
minima, the uncertainty will decrease. This is because the
right hand term to be subtracted in equation 6 will be
positive, causing a reduction in σ. Similarly, uncertainty
will increase at points of negative curvature, such as near
maxima or saddle points. Due to these characteristics of σ,
our method introduces key advantages over vanilla BGD.
Since the curvature of the loss strongly depends on the
value of µ, faster convergence of µ will lead to the faster
shrinking of σ, which is desirable as it becomes more likely
a task will be protected upon the introduction of data from
new tasks. Finally, since saddle points are associated with
higher uncertainty, and BAdam has the ability to better
overcome saddle points, our method can avoid getting stuck
in low-certainty states.
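As a worked check of the curvature argument, consider a hypothetical one-dimensional quadratic loss with curvature $a$ and minimiser $c$ (both symbols are introduced here purely for illustration). In this case the Taylor expansion is exact, and substituting the result into equation 6 shows directly how the sign of the curvature controls σ:

```latex
\begin{align*}
L_n(\theta) &= \tfrac{a}{2}(\theta - c)^2, \qquad \theta = \mu + \epsilon\sigma, \qquad \epsilon \sim \mathcal{N}(0, 1) \\
\mathbb{E}_\epsilon\left[\frac{\partial L_n(\theta)}{\partial \theta}\,\epsilon\right]
  &= \mathbb{E}_\epsilon\big[a(\mu + \epsilon\sigma - c)\,\epsilon\big]
   = a(\mu - c)\,\mathbb{E}_\epsilon[\epsilon] + a\sigma\,\mathbb{E}_\epsilon[\epsilon^2]
   = a\sigma \\
\sigma_t &= \sigma_{t-1}\left(\sqrt{1 + h^2} - h\right), \qquad h = \tfrac{1}{2}\, a\, \sigma_{t-1}^2
\end{align*}
```

Since $\sqrt{1 + h^2} - h = 1 / (\sqrt{1 + h^2} + h)$, the multiplicative factor is below one when $a > 0$ (so σ shrinks near a minimum) and above one when $a < 0$ (so σ grows near a maximum), in agreement with the discussion above.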
Figure 1 further highlights the benefits of BAdam’s µ
update rule. The figure shows the average magnitude for
all µ parameters in a simple feedforward neural network
trained on the SplitMNIST task. Two key observations
are immediately obvious: first, BAdam yields far smaller
parameter values, which is indicative of lower plasticity
since parameters can’t grow as quickly. Second, every time
a new task is introduced, parameters optimized by BGD
experience a sudden drop in magnitude, essentially ‘soft
resetting’ the parameters, which is indicative of catastrophic
forgetting. In contrast, parameters optimized by BAdam
suffer a proportionally much smaller drop in magnitude,
indicating that these parameters are either better protected,
or that they occupy a position in parameter space that
better generalizes to subsequent tasks. This is the key
benefit of BAdam compared to other CL methods: BAdam
better impedes the plasticity of model parameters to protect
existing knowledge. As will be discussed in section V,
BAdam achieves this without sacrificing model performance.
IV. EXPERIMENTS
All experiments are conducted on an Nvidia RTX
4090 and results for continual learning experiments
are averaged over 15 seeds [1,15] (25 seeds, for the
graduated experiments). In the first experiment, we compare
the convergence properties of BAdam to that of BGD
and Adam by training a small neural network with 2
convolutional and 3 fully connected layers for 10 epochs on
the CIFAR10 dataset.
Next, BAdam’s performance is evaluated against other
prior-based methods on the three standard benchmark
datasets: SplitMNIST, Split FashionMNIST, and PMNIST,
as in [6]. Split MNIST and split FashionMNIST are
framed as single-headed class-incremental problems. In
split MNIST, the model must successfully classify each
digit as in standard MNIST training, however the digits are
presented in pairs, sequentially: [0,1],[2,3],[4,5], and so
forth. Split FashionMNIST shares the same formulation,
but with clothing items. For completeness, we also evaluate
on the domain-incremental PMNIST dataset [29], where a
model must classify a sequential set of permuted MNIST
digits. We evaluate BAdam against BGD, MAS, EWC, SI,
and VCL (without an auxiliary coreset). We adopt the same
architectures, hyper-parameters, and experimental setup
found in [6], with the exception that we do not allow VCL
to train for additional epochs on account of being Bayesian,
since BGD and BAdam also use Bayesian neural networks
but are not afforded additional training time. MAS, BAdam
and BGD hyper-parameters were found via a grid search
(typically for BAdam we found η values in the range
[0.01, 0.1] and initial values for σ in the range [0.005, 0.02]
to be most effective). The implementation utilizes PyTorch
[41], and the Avalanche continual learning library [42].
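The class-incremental split described above can be sketched in a few lines. The helper below is an illustrative assumption and not part of Avalanche or of the experimental code: it groups sample indices into sequential tasks of two classes each while leaving the original labels untouched, since in the single-headed setting the model must still predict one of all ten classes.

```python
from typing import List, Sequence


def split_into_tasks(labels: Sequence[int], classes_per_task: int = 2) -> List[List[int]]:
    """Return, for each sequential task, the indices of the samples it contains."""
    if classes_per_task < 1:
        raise ValueError("classes_per_task must be at least 1")
    classes = sorted(set(labels))
    tasks: List[List[int]] = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = set(classes[start:start + classes_per_task])
        # Labels are not remapped: every task shares the same output head.
        tasks.append([idx for idx, label in enumerate(labels) if label in task_classes])
    return tasks


# Toy label list standing in for a dataset (hypothetical values).
labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 3]
tasks = split_into_tasks(labels)
print(len(tasks))  # 5 tasks: classes [0,1], [2,3], [4,5], [6,7], [8,9]
print(tasks[1])    # [2, 3, 11]
```

A multi-headed variant would instead remap each pair of labels to {0, 1} and select the head using the task label, which is exactly the simplification that the desiderata of [6] rule out.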
Fig. 2. Probability of a sample being taken from each task every batch for
graduated SplitMNIST
Fig. 3. Optimizer convergence rate comparison on single-task CIFAR10
Since there exists no standard robotic continual learning
benchmark that is class-incremental and satisfies the
desiderata of [6], we introduce a novel formulation
of the standard SplitMNIST benchmarks that features
specific conditions reflective of learning in the real world.
Referring to these experiments henceforth as the graduated
experiments, we take the original SplitMNIST and Split
FashionMNIST evaluations and impose the following
constraints: first, each data sample is seen only once,
since storing and extensively training on data samples is
challenging for robotics and edge devices. Second, task
labels are unavailable to the methods, which is reflective
of the real-world where an autonomous agent does not
necessarily have access to this meta-information. Finally,
tasks may have some small overlap, which is representative
of how real-world tasks and environments change gradually
over time. The graduated boundaries are implemented by
assigning draw probabilities to each task based on the index
of the current sample. The probability of drawing from a task
decreases with the squared distance between the current sample
index and the index of the centre point of that task. This creates peaks
where a task is near-guaranteed to be selected, which slowly
decays into a graduated boundary with the next task. This
can be seen in figure 2. Since no task labels are available,
regularization is recalculated after every batch. While these
experiments bear a strong resemblance to experiments in
[9], they are different in that only a single epoch is permitted
and also that we restrict the experiments to single-headed
class-incremental problems, which was not enforced in
[9]. For these tasks, we train a fully-connected neural
network with 2 hidden layers of 200 units each, and use
the ReLU activation function. We conduct a grid-search
for each method’s hyper-parameters, optimising for final
model performance. In addition to the methods evaluated
in the labelled experiments, we also evaluate the Task-Free
Continual Learning approach (TFCL), which allows MAS
to not require task-labels.
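One way the graduated boundaries described above could be realised is sketched below. The exact functional form is an assumption: here the draw weight of each task falls off with the inverse of the squared distance between the current sample index and the task's centre index, with a smoothing constant `eps` to avoid division by zero. The function names and the constant are hypothetical and are not taken from the experimental code.

```python
import random
from typing import List, Optional


def task_draw_probabilities(
    sample_index: int, n_tasks: int, samples_per_task: int, eps: float = 1.0
) -> List[float]:
    """Probability of drawing from each task at a given position in the stream."""
    # Centre index of each task if tasks were laid out back to back.
    centres = [(k + 0.5) * samples_per_task for k in range(n_tasks)]
    # Assumed form: weight decays with the squared distance to the centre.
    weights = [1.0 / ((sample_index - c) ** 2 + eps) for c in centres]
    total = sum(weights)
    return [w / total for w in weights]


def draw_task(
    sample_index: int,
    n_tasks: int,
    samples_per_task: int,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample the task that supplies the next data point."""
    rng = rng if rng is not None else random.Random(0)
    probs = task_draw_probabilities(sample_index, n_tasks, samples_per_task)
    return rng.choices(range(n_tasks), weights=probs, k=1)[0]


# At the centre of the first task it is near-certain to be drawn;
# at the boundary between the first two tasks they are equally likely.
print(task_draw_probabilities(500, n_tasks=5, samples_per_task=1000))
print(task_draw_probabilities(1000, n_tasks=5, samples_per_task=1000))
```

Because the task identity is only ever used to pick the next sample, nothing in this procedure exposes task labels or boundaries to the learning method itself.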
V. RESULTS
A. CIFAR10 Convergence Analysis
Figure 3 shows that BAdam converges much faster than
BGD, demonstrating the efficacy of the amended update rule
for µ. Furthermore, the convergence rates of BAdam are comparable
to Adam, which is a strong property since BAdam
utilizes a Bayesian neural network. The ability to converge
quickly is of significant importance for all optimization tasks,
but this is especially true for those likely to be undertaken
on robotic hardware where compute power may be limited.
Fig. 4. CL method performance on class-incremental split MNIST
B. Standard Benchmark Experiments
For split MNIST (figure 4), no previous prior-based
method outperforms the naïve baseline, all achieving
approximately 20% accuracy. In contrast, BAdam makes substantial
improvements, reaching over 40% accuracy, doubling the
efficacy of other methods. On the more challenging Split
FashionMNIST benchmark, table II shows that existing
methods once again perform similarly to the naïve baseline,
achieving approximately 21% accuracy. BAdam improves upon this
significantly, reaching 31% accuracy, being the only method
to successfully retain some previous knowledge. These
improvements are statistically significant at the p = 0.05 level
using a standard t-test. While there remains clear room for
improvement, BAdam is the first prior-based method that
makes substantial steps towards solving these challenging
single-headed class-incremental tasks. On the
domain-incremental PMNIST task, BAdam is competitive with all
other methods, whereas VCL’s poor performance may be
attributed to underfitting.
Fig. 5. CL method performance on class-incremental split MNIST with
graduated boundaries, single epoch training, no task labels
TABLE II
METHOD RESULTS FOR ALL EXPERIMENTS
Method  SplitMNIST   Split F-MNIST  PMNIST       Graduated SplitMNIST  Graduated Split F-MNIST
Naïve   0.20 ± 0.02  0.21 ± 0.03    0.65 ± 0.02  0.20 ± 0.01           0.20 ± 0.02
BAdam   0.41 ± 0.07  0.31 ± 0.07    0.81 ± 0.01  0.41 ± 0.07           0.25 ± 0.07
BGD     0.20 ± 0.02  0.21 ± 0.03    0.76 ± 0.02  0.20 ± 0.05           0.14 ± 0.05
EWC     0.20 ± 0.01  0.21 ± 0.03    0.76 ± 0.01  0.20 ± 0.02           0.21 ± 0.02
MAS     0.21 ± 0.03  0.23 ± 0.04    0.85 ± 0.01  0.13 ± 0.03           0.13 ± 0.04
SI      0.20 ± 0.02  0.21 ± 0.03    0.75 ± 0.02  0.20 ± 0.01           0.20 ± 0.02
VCL     0.20 ± 0.01  0.20 ± 0.02    0.51 ± 0.02  0.24 ± 0.07           0.23 ± 0.04
TFCL    -            -              -            0.27 ± 0.07           0.22 ± 0.07
C. Graduated Experiments
As seen in figure 5 and table II, BAdam outperforms all
other methods on both tasks, and is considerably better than
BGD. BAdam not only reaches superior performance as the
number of tasks grows, but it is also the only method with
stable and consistent improvement. Other methods achieve
stronger intermediate performance, but they fail to maintain
that level due to catastrophic forgetting. BAdam is seemingly
slower to reach good performance compared to other meth-
ods, despite its strong convergence properties, however this is
explained by method hyper-parameters being optimized for
final performance on all tasks, not intermediate performance.
BAdam’s performance on SplitMNIST is perfectly preserved
between the original and graduated experiments, while Split
FashionMNIST is slightly worse, which is likely due to
underfitting on the more challenging task. These findings
indicate that the method is robust to the additional challenges
posed by this experiment. Interestingly, VCL improves when
applied to the graduated experiments.
VI. DISCUSSION
BAdam exhibits state-of-the-art performance on single-
headed, class-incremental tasks for prior-based methods.
BAdam is the strongest method evaluated, both in the
discrete, labelled domain, and the graduated, label-free,
single epoch domain, and is also the first prior-based method
to make convincing improvements over the baseline in these
experiments. The ability to retain knowledge on these
challenging benchmarks is an important step for prior-based
methods, although there is still clear room for improvement.
Many previous contributions to regularization CL have
been new parameter importance estimation methods;
however, existing prior-based approaches do not seem
to have inaccurate importance estimation, and thus
further contributions in this direction have failed to
improve performance on class-incremental scenarios. The
contributions in this work are instead focused on
improving the convergence properties of BGD, which seems
to be a promising avenue forward.
BAdam is promising, however it has two key limitations.
Firstly, like BGD, BAdam has two hyper-parameters to
optimize, and we found that both of these methods are fairly
sensitive to changes in the initialisation value of the standard
deviation. Further work on identifying good initialisation
values a priori is worthwhile, to help ensure the optimal
performance of BAdam may be realised in real-world
challenges. Secondly, the primary barrier for progress is the
rate at which neural networks can learn more challenging
problems. If a model cannot learn a single task in a single
epoch (such as complex graph, vision, or reinforcement
learning problems), then solving continual learning tasks
under such conditions is limited, since the upper bound
on performance is unsatisfactory. Future research into fast
convergence and concepts such as few-shot learning is
important to the progress of online continual learning.
Furthermore, since BAdam is a closed-form update rule, if
improved learning rates and convergence are reached through
advances in stochastic optimization, BAdam will likely need
to be further extended to incorporate these characteristics.
VII. CONCLUSION
In this work we present BAdam, which unifies desirable
properties of Adam and BGD, yielding a fast-converging
continual learning method with no reliance on task labels.
The extensions to BGD offered by BAdam are easy to
implement, computationally efficient, and built upon familiar
concepts from stochastic optimization literature.
We evaluated BAdam alongside a range of regularization
CL methods for single-head class-incremental problems
in both traditional continual learning conditions, and in a
novel experimental setup focused on continual learning in
a single epoch, without requiring task labels or a strongly
structured data-stream. This is often more reflective of
the real-world and is considerably more challenging than
other CL environments [8]. We found BAdam to be the
most efficacious approach in both setups, being the only
prior-based continual learning method to improve upon the
naïve baseline and more than doubling the performance
of other methods in the SplitMNIST task. While further
work is required to fully solve class-incremental problems,
BAdam takes the first steps towards retaining previous task
knowledge in these challenging domains using prior-based
methods, laying the groundwork for important future
work in this direction. Further work could explore several
avenues. A simple direction would be to explore alternative
ways to improve convergence by leveraging other concepts
from stochastic optimization. A more interesting direction
is to investigate other limitations of prior-based approaches
that exist beyond the importance estimation methods.
REFERENCES
[1] M. McCloskey and N. J. Cohen, “Catastrophic interference in connec-
tionist networks: The sequential learning problem,” in Psychology of
learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.
[2] S. Thrun and T. M. Mitchell, “Lifelong robot learning,” Robotics and
Autonomous Systems, vol. 15, no. 1, pp. 25–46, 1995, the Biology and
Technology of Intelligent Autonomous Agents. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/092188909500004Y
[3] R. M. French, “Catastrophic forgetting in connectionist networks,”
Trends in cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999.
[4] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick,
“Neuroscience-inspired artificial intelligence,” Neuron, vol. 95, no. 2,
pp. 245–258, 2017.
[5] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, and N. Díaz-Rodríguez,
“Continual learning for robotics: Definition, framework,
learning strategies, opportunities and challenges,” Information fusion,
vol. 58, pp. 52–68, 2020.
[6] S. Farquhar and Y. Gal, “Towards robust evaluations of continual
learning,” arXiv preprint arXiv:1805.09733, 2018.
[7] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through
synaptic intelligence,” in Proceedings of the 34th International
Conference on Machine Learning, ser. Proceedings of Machine
Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.
PMLR, 06–11 Aug 2017, pp. 3987–3995. [Online]. Available:
https://proceedings.mlr.press/v70/zenke17a.html
[8] G. M. Van de Ven and A. S. Tolias, “Three scenarios for continual
learning,” arXiv preprint arXiv:1904.07734, 2019.
[9] C. Zeno, I. Golan, E. Hoffer, and D. Soudry, “Task agnostic
continual learning using online variational bayes,” arXiv preprint
arXiv:1803.10123, 2018.
[10] L. V. Jospin, H. Laga, F. Boussaid, W. Buntine, and M. Bennamoun,
“Hands-on bayesian neural networks—a tutorial for deep learning
users,” IEEE Computational Intelligence Magazine, vol. 17, no. 2,
pp. 29–48, 2022.
[11] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv preprint arXiv:1412.6980, 2014.
[12] D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne, “Expe-
rience replay for continual learning,” Advances in Neural Information
Processing Systems, vol. 32, 2019.
[13] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for con-
tinual learning,” Advances in neural information processing systems,
vol. 30, 2017.
[14] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari,
“End-to-end incremental learning,” in Proceedings of the European
conference on computer vision (ECCV), 2018, pp. 233–248.
[15] M. K. Titsias, J. Schwarz, A. G. d. G. Matthews, R. Pascanu, and Y. W.
Teh, “Functional regularisation for continual learning using gaussian
processes,” International Conference on Learning Representations,
2020.
[16] R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio, “Gradient based
sample selection for online continual learning,” Advances in neural
information processing systems, vol. 32, 2019.
[17] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “iCaRL:
Incremental classifier and representation learning,” in Proceedings of
the IEEE conference on Computer Vision and Pattern Recognition,
2017, pp. 2001–2010.
[18] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick,
K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural
networks,” arXiv preprint arXiv:1606.04671, 2016.
[19] A. V. Terekhov, G. Montone, and J. K. O’Regan, “Knowledge transfer
in deep block-modular neural networks,” in Conference on Biomimetic
and Biohybrid Systems. Springer, 2015, pp. 268–279.
[20] T. J. Draelos, N. E. Miner, C. C. Lamb, J. A. Cox, C. M. Vineyard,
K. D. Carlson, W. M. Severa, C. D. James, and J. B. Aimone, “Neuro-
genesis deep learning: Extending deep networks to accommodate new
classes,” in 2017 International Joint Conference on Neural Networks
(IJCNN). IEEE, 2017, pp. 526–533.
[21] Z. Li and D. Hoiem, “Learning without forgetting,” IEEE transactions
on pattern analysis and machine intelligence, vol. 40, no. 12, pp.
2935–2947, 2017.
[22] H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning with deep
generative replay,” Advances in neural information processing systems,
vol. 30, 2017.
[23] A. Rannen, R. Aljundi, M. B. Blaschko, and T. Tuytelaars, “Encoder
based lifelong learning,” in Proceedings of the IEEE International
Conference on Computer Vision, 2017, pp. 1320–1328.
[24] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter,
“Continual lifelong learning with neural networks: A review,” Neural
networks, vol. 113, pp. 54–71, 2019.
[25] L. G. Cohen, P. Celnik, A. Pascual-Leone, B. Corwell, L. Faiz,
J. Dambrosia, M. Honda, N. Sadato, C. Gerloff, M. Hallett, et al.,
“Functional relevance of cross-modal plasticity in blind humans,”
Nature, vol. 389, no. 6647, pp. 180–183, 1997.
[26] J. Cichon and W.-B. Gan, “Branch-specific dendritic Ca2+ spikes cause
persistent synaptic plasticity,” Nature, vol. 520, no. 7546, pp. 180–185,
2015.
[27] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in
Neural networks for perception. Elsevier, 1992, pp. 65–93.
[28] H. J. Sussmann, “Uniqueness of the weights for minimal feedforward
nets with a given input-output map,” Neural networks, vol. 5, no. 4,
pp. 589–593, 1992.
[29] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins,
A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska,
et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings
of the national academy of sciences, vol. 114, no. 13, pp.
3521–3526, 2017.
[30] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars,
“Memory aware synapses: Learning what (not) to forget,” in
Proceedings of the European conference on computer vision (ECCV),
2018, pp. 139–154.
[31] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner, “Variational continual
learning,” arXiv preprint arXiv:1710.10628, 2017.
[32] R. Aljundi, K. Kelchtermans, and T. Tuytelaars, “Task-free continual
learning,” in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), June 2019.
[33] M. Wołczyk, K. Piczak, B. Wójcik, L. Pustelnik, P. Morawiecki,
J. Tabor, T. Trzcinski, and P. Spurek, “Continual learning with guarantees
via weight interval constraints,” in International Conference on
Machine Learning. PMLR, 2022, pp. 23897–23911.
[34] M. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava,
“Fast and scalable bayesian deep learning by weight-perturbation in
adam,” in International conference on machine learning. PMLR,
2018, pp. 2611–2620.
[35] K. Shaheen, M. A. Hanif, O. Hasan, and M. Shafique, “Continual
learning for real-world autonomous systems: Algorithms, challenges
and frameworks,” Journal of Intelligent & Robotic Systems, vol. 105,
no. 1, p. 9, 2022.
[36] N. Churamani, S. Kalkan, and H. Gunes, “Continual learning for affec-
tive robotics: Why, what and how?” in 2020 29th IEEE International
Conference on Robot and Human Interactive Communication (RO-
MAN). IEEE, 2020, pp. 425–431.
[37] A. Graves, “Practical variational inference for neural networks,” in
Advances in Neural Information Processing Systems, J. Shawe-Taylor,
R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., vol. 24.
Curran Associates, Inc., 2011.
[38] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I.
Jordan, “Streaming variational bayes,” Advances in neural information
processing systems, vol. 26, 2013.
[39] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight
uncertainty in neural network,” in International conference on machine
learning. PMLR, 2015, pp. 1613–1622.
[40] S. Dohare, J. F. Hernandez-Garcia, P. Rahman, R. S. Sutton, and A. R.
Mahmood, “Maintaining plasticity in deep continual learning,” arXiv
preprint arXiv:2306.13812, 2023.
[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An
imperative style, high-performance deep learning library,” Advances
in neural information processing systems, vol. 32, 2019.
[42] V. Lomonaco, L. Pellegrini, A. Cossu, A. Carta, G. Graffieti, T. L.
Hayes, M. D. Lange, M. Masana, J. Pomponi, G. van de Ven,
M. Mundt, Q. She, K. Cooper, J. Forest, E. Belouadah, S. Calder-
ara, G. I. Parisi, F. Cuzzolin, A. Tolias, S. Scardapane, L. Antiga,
S. Amhad, A. Popescu, C. Kanan, J. van de Weijer, T. Tuytelaars,
D. Bacciu, and D. Maltoni, “Avalanche: an end-to-end library for
continual learning,” in Proceedings of IEEE Conference on Computer
Vision and Pattern Recognition, ser. 2nd Continual Learning in Com-
puter Vision Workshop, 2021.