
# Hebbian Learning Meets Deep Convolutional Neural Networks


## Abstract and Figures

Neural networks are said to be biologically inspired since they mimic the behavior of real neurons. However, several processes in state-of-the-art neural networks, including Deep Convolutional Neural Networks (DCNN), are far from the ones found in animal brains. One relevant difference is the training process. In state-of-the-art artificial neural networks, the training process is based on backpropagation and Stochastic Gradient Descent (SGD) optimization. However, studies in neuroscience strongly suggest that this kind of process does not occur in the biological brain. Rather, learning methods based on Spike-Timing-Dependent Plasticity (STDP) or the Hebbian learning rule seem to be more plausible, according to neuroscientists. In this paper, we investigate the use of the Hebbian learning rule when training Deep Neural Networks for image classification by proposing a novel weight update rule for shared kernels in DCNNs. We perform experiments using the CIFAR-10 dataset in which we employ Hebbian learning, along with SGD, to train parts of the model or whole networks for the task of image classification, and we discuss their performance thoroughly, considering both effectiveness and efficiency.
Giuseppe Amato¹ [0000-0003-0171-4315], Fabio Carrara¹ [0000-0001-5014-5089],
Fabrizio Falchi¹ [0000-0001-6258-5313], Claudio Gennaro¹ [0000-0002-0967-5050],
and Gabriele Lagani²
¹ CNR Pisa, Italy
{giuseppe.amato, fabio.carrara, fabrizio.falchi, claudio.gennaro}@isti.cnr.it
² University of Pisa, Italy
gabriele.lagani@gmail.com
Keywords: Hebbian learning · deep learning · computer vision · convolutional neural networks
© Springer Nature Switzerland AG 2019
E. Ricci et al. (Eds.): ICIAP 2019, LNCS 11751, pp. 324-334, 2019.
Final authenticated publication: https://doi.org/10.1007/978-3-030-30642-7_29
This work was partially supported by “Automatic Data and documents Analysis
to enhance human-based processes” (ADA), CUP CIPE D55F17000290009, and by
the AI4EU project, funded by the EC (H2020 - Contract n. 825619). We gratefully
acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40
GPU used for this research.
1 Introduction
Backpropagation is the most common learning rule for artificial neural networks.
Although it was initially developed for biologically inspired artificial
networks, it is widely accepted in neuroscience that this process is unlikely
to be implemented by nature. Seeking more plausible models that mimic
biological brains, researchers have introduced several alternative learning
rules for artificial networks.
In this work, we explore one of these alternative learning rules, Hebbian
learning, in the context of modern deep neural networks for image
classification. Specifically, the concept of Hebbian learning refers to a
family of learning rules, inspired by biology, according to which the weight
associated with a synapse increases proportionally to the values of the
pre-synaptic and post-synaptic stimuli at a given instant of time [2,4].
Different variants of Hebbian rules can be found in the literature. In this
work, we investigate two main Hebbian learning approaches: the Winner-Takes-All
competition [3,12] and the supervised Hebbian learning solution. We apply these
rules to train image classifiers that we extensively evaluate and compare with
standard models trained with Stochastic Gradient Descent (SGD). Moreover, we
experiment with hybrid models in which we apply Hebbian and SGD updates to
different parts of the network. Experiments on the CIFAR-10 dataset suggest
that the Hebbian approach is adequate for training the lower and the higher
layers of deep convolutional neural networks, while current results suggest
that it has some limitations when used in the intermediate layers.¹ On the
other hand, the Hebbian approach is much faster than Gradient Descent in terms
of the number of epochs required for training. Moreover, Hebbian update rules
are inherently local and thus fully parallelizable in the backward/update
phase, and we think that strategies to enhance the scalability of current
models can benefit from this property. The main contributions of this work are:
- the use of Hebbian learning in DCNNs, with a novel proposal for weight
  updates in shared kernels (Section 3.4);
- the definition of various hybrid deep neural networks, obtained by combining
  SGD and Hebbian learning in the various network layers (Section 4);
- extensive experimentation and analysis of the results (Section 5).
The paper is organized as follows. Section 2 gives a brief overview of other works
in this context. Section 3 introduces the Hebbian learning model. Section 4
discusses the deep network architecture that we defined and how we set up the
experiments to assess the performance of the approach. Section 5 discusses the
experiments and the results. Section 6 concludes.
2 Related works
Recently, several works investigated the Hebbian rule for training neural
networks for image classification. In [15], the authors propose a deep
Convolutional Neural Network (CNN) architecture consisting of three
convolutional layers, followed by an SVM classifier. The convolutional layers
are trained, without supervision, to extract relevant features from the inputs.
This technique was applied to different image datasets, including CIFAR-10 [6],
on which the algorithm achieved above 75% accuracy with a three-layer network.

¹ For the implementation details about the experiments described in this
document and the related source code, the reader is referred to [8,9].

In [10], the authors obtain the Hebbian weight update rules by minimizing
an appropriate loss function, defined as the strain loss. Intuitively, they aim
to minimize the discrepancy between the pairwise similarities of the input
vectors and those of the corresponding output vectors, that is, how much
similarities are distorted when moving from the input space to the output
space. Also in this case, the authors use CIFAR-10 to perform the experiments,
achieving accuracy up to 80% with a single layer followed by an SVM classifier [1].
In the above approaches, the application of the Hebbian rule remains limited to
relatively shallow networks. In our work, instead, we explore the possibility
of applying Hebbian learning rules to deeper network architectures and discuss
the opportunities and limitations that arise in this context.
3 Hebbian learning model
The Hebbian plasticity rule can be expressed as
$$\Delta w = \eta \, y(x, w) \, x , \tag{1}$$
where $x$ is the vector of input signals on the neuron synapses, $w$ is the
weight vector associated with the neuron, $\eta$ is the learning rate
coefficient, $\Delta w$ is the weight update vector, and $y(x, w)$ is the
post-synaptic activation of the neuron, a function of the input and the weights
that is assumed to be non-negative (e.g. a dot product followed by a ReLU or
sigmoid activation).
3.1 Weight decay
Rule 1 only allows weights to grow, not to decrease. In order to prevent the
weight vector from growing unbounded, Rule 1 is extended by introducing a
weight decay (forgetting) term $\gamma(x, w)$ [2]:
$$\Delta w = \eta \, y(x, w) \, x - \gamma(x, w) . \tag{2}$$
When the weight decay term is $\gamma(x, w) = \eta \, y(x, w) \, w$ [4], we obtain
$$\Delta w = \eta \, y(x, w) \, (x - w) . \tag{3}$$
If we assume that $\eta \, y(x, w)$ is smaller than 1, Equation 3 admits the
following interpretation: at each iteration, the weight vector is modified
by taking a step towards the input, the size of the step being proportional to
the similarity between the input and the weight vector, so that if a similar
input is presented again in the future, the neuron will be more likely to
produce a stronger response. If an input (or a cluster of similar inputs) is
presented repeatedly to the neuron, the weight vector tends to converge towards
it, eventually acting as a matching filter. In other words, the input is
memorized in the synaptic weights. In this perspective, the neuron can be seen
as an entity that, when stimulated with a frequent pattern, learns to
recognize it.
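To make the interpretation of Eq. 3 concrete, the following minimal NumPy sketch repeatedly presents one input to a single neuron and checks that the weight vector moves towards it. The function name `hebbian_step`, the learning rate, the choice of a dot product followed by a ReLU as activation, and the numeric values are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np

def hebbian_step(w, x, eta=0.1):
    """One update of Eq. 3: delta_w = eta * y * (x - w) (hypothetical helper)."""
    # Post-synaptic activation: dot product followed by ReLU, hence non-negative.
    y = max(0.0, float(np.dot(w, x)))
    return w + eta * y * (x - w)

x = np.array([0.6, 0.8])   # a fixed input pattern with unit norm
w = np.array([0.9, 0.1])   # arbitrary initial weights with positive overlap with x

dist_before = np.linalg.norm(x - w)
for _ in range(50):        # present the same input repeatedly
    w = hebbian_step(w, x)
dist_after = np.linalg.norm(x - w)

print(dist_before, dist_after)
```

Since $w + \eta y (x - w) = (1 - \eta y)\, w + \eta y\, x$, each step is a convex combination of the old weights and the input whenever $0 \le \eta y \le 1$, which is why the distance to the input cannot grow in this setting.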
3.2 Competitive Hebbian Learning: Winner Takes All
Equation 3 can also be used in the context of competitive learning [14,3,5,12].
When more than one neuron is involved in the learning process, it is possible to
introduce some form of lateral interaction in order to force different neurons
to learn different patterns. A possible scheme of interaction is Winner Takes
All (WTA) competition [3,12], which works as follows:
1. when an input is presented to the network, the neurons start a competition;
2. the winner of the competition is the neuron whose weight vector is the
   closest to the input vector (according to some distance metric, e.g. angular
   distance [3] or Euclidean distance [4]), while all the other neurons get
   inhibited;
3. the neurons update their weights according to Eq. 3, where $y$ is set to 0
   for the inhibited neurons and to 1 for the winner neuron.
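The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `wta_step`, the use of cosine similarity as a stand-in for angular distance, and the small constant `eps` are choices made here, not details taken from the paper's code.

```python
import numpy as np

def wta_step(W, x, eta=0.1, eps=1e-12):
    """One WTA update. W: (n_neurons, dim) weights, x: (dim,) input.

    Returns the updated weights and the index of the winner (hypothetical helper).
    """
    # Step 2: the winner has the smallest angular distance to x,
    # i.e. the largest cosine similarity.
    cos = (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x) + eps)
    winner = int(np.argmax(cos))
    # Step 3: y = 1 for the winner, y = 0 for the inhibited neurons.
    y = np.zeros(W.shape[0])
    y[winner] = 1.0
    # Eq. 3 applied to every neuron; only the winner actually moves.
    W_new = W + eta * y[:, None] * (x[None, :] - W)
    return W_new, winner

W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x = np.array([0.9, 0.1])
W_new, winner = wta_step(W, x)
print(winner, W_new)
```

In this toy case the first neuron wins and takes a step of size `eta` towards the input, while the second neuron is left unchanged.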
3.3 Supervised Hebbian Learning
Hebbian learning is inherently an unsupervised approach to neural network
training because each neuron updates its weights without relying on labels
provided with the data. However, it is possible to use a simple trick to apply
Hebbian rules in a supervised fashion: the teacher neuron technique [13,11]
involves imposing a teacher signal on the output of the neurons that we want to
train, thus replacing the output that they would naturally produce. By doing so
and by applying a Hebbian learning rule, neurons adapt their weights in order
to actually reproduce the desired output when the same input is provided.
Applying this technique to the output layer is straightforward, as the teacher
signal coincides with the output target, but the choice of the teacher signal
for supervised training of internal neurons is not trivial. Similarly to [15],
we use the following technique to guide the neurons to develop a certain
class-specificity in intermediate layers: we divide the kernels of a layer into
as many groups as the number of classes and associate each group with a unique
class; in addition, we also devote a set of kernels to be common to all the
classes. When an input of a given class is presented to the network, a high
teacher signal $y = 1$ is provided to all the neurons sharing kernels that
belong to the group corresponding to the given class, while the others receive
a low teacher signal $y = 0$ (neurons sharing kernels associated with the set
common to all the classes always receive a high teacher signal).
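The grouping scheme described above can be illustrated by building the per-kernel teacher signal for a given label. The sketch below assumes a particular kernel ordering (class groups first, in class order, followed by the common kernels); this ordering and the helper name `teacher_signal` are hypothetical and only serve the example.

```python
import numpy as np

def teacher_signal(label, n_classes, kernels_per_class, n_common):
    """Per-kernel teacher signal for one input of class `label` (hypothetical helper).

    Assumed layout: [class 0 group | class 1 group | ... | common kernels].
    """
    n_class_kernels = n_classes * kernels_per_class
    y = np.zeros(n_class_kernels + n_common)
    # High signal for the kernels of the group associated with the input class.
    start = label * kernels_per_class
    y[start:start + kernels_per_class] = 1.0
    # Common kernels always receive a high signal.
    y[n_class_kernels:] = 1.0
    return y

# Example: 10 classes, 16 kernels per class plus 32 common kernels (192 in total).
y = teacher_signal(label=3, n_classes=10, kernels_per_class=16, n_common=32)
print(y.shape, int(y.sum()))
```

Every neuron that shares a given kernel would then use the corresponding entry of `y` as its step coefficient in Eq. 3.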
3.4 Hebbian rule with shared kernels in DCNN
Equation 3 allows computing $\Delta w$ for each neuron in a given layer. Due to
weight sharing in convolutional neural networks, different neurons in the same
convolutional layer that share the same kernel might be associated with
different $\Delta w$'s. In order to allow weight sharing in kernels of deep
convolutional layers, we propose to perform an aggregation step in which the
different $\Delta w$'s, obtained at different spatial locations, are used to
produce a global $\Delta w_{agg}$ used for the update of the kernel.
$\Delta w_{agg}$ is computed as a weighted average of the $\Delta w$'s, where
the weights are proportional to the coefficient that determines the step size
($y$ when the basic Hebbian rule is used, 1 or 0 for winners and losers,
respectively, when the WTA rule is used, and the teacher signal when supervised
Hebbian learning is used).
Fig. 1: Architecture of the Deep Convolutional Neural Network used in our
experiments.
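The aggregation step of Section 3.4 can be sketched as follows for a single shared kernel. The text only states that the weights of the average are proportional to the step coefficient; normalizing by the sum of the coefficients, returning a zero update when all coefficients are zero, and the helper name `aggregated_update` are assumptions of this sketch.

```python
import numpy as np

def aggregated_update(w, patches, y, eta=0.1, eps=1e-12):
    """Aggregate per-location Hebbian updates for one shared kernel (hypothetical helper).

    w:       (k,) flattened kernel
    patches: (n_locations, k) flattened input patches seen by the kernel
    y:       (n_locations,) non-negative step coefficients, one per location
    """
    # Per-location updates from Eq. 3: delta_w_i = eta * y_i * (x_i - w).
    deltas = eta * y[:, None] * (patches - w[None, :])
    total = y.sum()
    if total <= eps:
        # No location contributes, so the kernel is left unchanged.
        return np.zeros_like(w)
    # Weighted average of the deltas, with weights proportional to y_i.
    return (y[:, None] * deltas).sum(axis=0) / total

w = np.zeros(2)
patches = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [5.0, 5.0]])
y = np.array([1.0, 1.0, 0.0])   # e.g. two winning locations, one inhibited
delta_agg = aggregated_update(w, patches, y)
print(delta_agg)
```

Locations with a zero coefficient, such as the third patch here, have no influence on the aggregated update.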
4 Neural network architecture and experiment settings
To evaluate the Hebbian rule in deep learning, we designed a reference deep
network architecture inspired by AlexNet [7]. The deep network structure, shown
in Figure 1, is composed of four convolutional layers followed by a fully
connected layer (layer 5). Layer 6 is a linear classifier with one output per
class. We performed several experiments in which we combined Hebbian learning
with SGD learning in various ways, and we measured the classification
performance obtained on the CIFAR-10 dataset [6]. In the first and second
experiments, discussed in Sections 5.1 and 5.2, we modified the architecture in
Figure 1 as shown in Figure 2a. Specifically, we placed in turn Hebbian
classifiers and SGD classifiers on top of features extracted from various
layers of, respectively, an SGD-trained and a Hebbian-trained network. In other
words, the entire network was trained using a single approach (either Hebbian
or SGD) and just the top layer (the classifier) was changed. In the third and
fourth experiments, discussed in Sections 5.3 and 5.4, we placed SGD-trained
layers on top of Hebbian-trained layers, and vice versa, at various levels of
the network, as shown in Figure 2b. In the fifth experiment, discussed in
Section 5.5, we put various Hebbian-trained layers in between SGD-trained
layers, as shown in Figure 2c. In all the experiments, when lower layers were
trained with the Hebbian rule, we also pre-processed images with ZCA-whitening
[6], which provided us with better performance.
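For readers unfamiliar with the pre-processing step mentioned above, the following is a generic NumPy sketch of ZCA-whitening on flattened images. The regularization constant `eps`, the helper names `zca_fit` and `zca_apply`, and the synthetic data are assumptions; the exact settings used in the experiments are not stated here.

```python
import numpy as np

def zca_fit(X, eps=1e-5):
    """Estimate ZCA parameters from X of shape (n_samples, n_features)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / X.shape[0]
    # Symmetric eigendecomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA matrix: rotate, rescale each direction to unit variance, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W

def zca_apply(X, mean, W):
    """Whiten X with previously fitted parameters."""
    return (X - mean) @ W

# Synthetic correlated data standing in for flattened image vectors.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.5, 0.0],
                   [0.0, 0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0, 1.0]])
X = rng.normal(size=(500, 4)) @ mixing
mean, W = zca_fit(X)
Xw = zca_apply(X, mean, W)
cov_w = Xw.T @ Xw / Xw.shape[0]
print(np.round(cov_w, 3))
```

The parameters would normally be fitted on the training images only and then applied unchanged to validation and test images.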
[Figure 2, three panels: (a) a classifier on top of the i-th layer of the
network; (b) SGD layers on top of Hebbian-trained layers (and vice versa);
(c) Hebbian-trained layers in between SGD layers. Each panel shows the same
layer stack: Input; 5x5 Conv, 96, ReLU, 2x2 Max Pool, Batch Norm; 3x3 Conv,
128, ReLU, Batch Norm; 3x3 Conv, 192, ReLU, 2x2 Max Pool, Batch Norm; 3x3 Conv,
256, ReLU, Batch Norm; FC, 300, ReLU, Batch Norm, Dropout; Classifier (Flat;
FC, 10 outputs, 1 per class).]
Fig. 2: Architecture modifications applied in our experiments.
5 Experiments
As a baseline for the various experiments, we trained the defined network by
applying the Stochastic Gradient Descent (SGD) algorithm to minimize the
Cross-Entropy loss. Training was executed using the following configuration:
SGD with Nesterov correction and momentum of 0.9, L2 penalty of 0.06, and
a batch size of 64. The network was trained for 20 epochs on the first four of
the five training batches of the CIFAR-10 dataset [6], corresponding to the
first 40,000 samples, while the fifth batch, corresponding to the last 10,000
samples, was used for validation. Testing was performed on the CIFAR-10 test
batch provided specifically for this purpose. Images were normalized to have
zero mean and unit standard deviation. Early stopping was used so that, at the
end of the 20 epochs, the network parameter configuration that we kept was the
one achieving the highest accuracy. The learning rate was set to $10^{-3}$ for
the first ten epochs, then halved every epoch for the next ten epochs. This
baseline was used both for comparison with the performance obtained with
Hebbian learning and to produce pre-trained SGD layers to be combined with
Hebbian-trained layers. In the next sections, we discuss the results obtained
by the various combinations of Hebbian-trained and SGD-trained layers
introduced in Section 4.
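The learning rate schedule of the baseline can be written as a small function. The sketch assumes 0-based epoch indices and that the first halving is applied at the eleventh epoch; the text does not specify this detail, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, constant_epochs=10):
    """Learning rate for a 0-based epoch index (hypothetical helper).

    Constant for the first `constant_epochs` epochs, then halved at every epoch.
    Assumption: the first halving already applies to epoch index 10.
    """
    if epoch < constant_epochs:
        return base_lr
    return base_lr * 0.5 ** (epoch - constant_epochs + 1)

schedule = [learning_rate(e) for e in range(20)]
print(schedule[9], schedule[10], schedule[19])
```

Under these assumptions the rate at the twentieth epoch is the base rate divided by 1024.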
5.1 Hebbian vs SGD classifiers on SGD-trained layers
In this first experiment, we used the baseline trained network, discussed in
Section 5, as a pre-processing module to extract features from an image, which
were then fed to a classifier as shown in Figure 2a. We compared a classifier
trained using SGD with one trained using the Hebbian rule.

Table 1: Accuracy (%) of SGD- and Hebbian-trained classifiers built on top of
various internal layers of an SGD- or Hebbian-trained network.

| Network layers | Classifier | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 |
|----------------|------------|---------|---------|---------|---------|---------|
| SGD            | SGD        | 60.71   | 66.30   | 72.39   | 82.69   | 84.95   |
| SGD            | Hebbian    | 46.58   | 56.59   | 67.79   | 82.18   | 84.88   |
| Hebbian        | SGD        | 63.92   | 63.81   | 58.28   | 52.99   | 41.78   |

To perform an exhaustive test, we measured the performance obtained by
placing (and training) a classifier after every layer of the network. The SGD
classifiers were trained with the same parameters used for the baseline, except
that the L2 penalty was reduced to $5 \cdot 10^{-4}$. The Hebbian classifiers
were trained with learning rate 0.1, the similarity between neuron inputs and
weight vectors was measured in terms of angular distance (lower angular
distance means higher similarity and vice versa), and the activation function
adopted was simply the scalar product between the input and the normalized
weight vectors.
Table 1 (top two rows) reports the results of the tested configurations. We can
see that classifiers placed on top of higher layers and trained with the
Hebbian rule achieve accuracy values practically overlapping with those of an
SGD classifier (82.18% vs. 82.69% at layer 4, 84.88% vs. 84.95% at layer 5). On
the other hand, Hebbian classifiers placed on top of lower layers obtain a
lower accuracy. However, it is worth mentioning that Hebbian classifiers can be
trained in just a few epochs (usually one or two in our experiments), while
classifiers trained with Gradient Descent need from five to ten epochs to
converge.
5.2 SGD classifiers on Hebbian-trained layers
Experiments discussed in this section are complementary to those discussed
above. The entire deep network is trained with the Hebbian approach, and the
features extracted from the various layers are fed to an SGD classifier.
The goal is to evaluate the performance of the Hebbian approach for training
the feature extraction layers of the network. To train the network, we set the
learning rate of the Hebbian weight update rule to 0.1. The similarity between
neuron inputs and weight vectors was measured in terms of angular distance
(lower angular distance means higher similarity and vice versa), and the
activation function used for neurons of Hebbian hidden layers was the cosine
similarity between input and weight vector, followed by the ReLU non-linearity.
We used the WTA approach (Section 3.2) for updating the weights of the internal
layers, and we applied ZCA-whitening (see Section 4) to the input images. We
imposed a teacher signal on the layers of the network trained with the Hebbian
approach, even if they are not classification layers, according to the logic
discussed in Section 3.3. In our experiments, we used 96 common kernels at
layer 1, 8 kernels per class plus 16 common kernels at layer 2, 16 kernels per
class plus 32 common kernels at layer 3, 24 kernels per class plus 16 common
kernels at layer 4, and 28 kernels per class plus 20 common kernels at layer 5.
Table 1 (bottom row) reports the achieved results. It can be observed that the
accuracy degrades as features are taken from deeper layers, from 63.92% at
layer 1 to 41.78% at layer 5. We conclude that training with the basic Hebbian
rule has some disadvantages when the depth of the network increases, while it
is still competitive with fewer than four layers.
5.3 Hybrid: SGD layers on top of Hebbian layers
In the previous experiments, the entire network was trained using a single
approach and we changed just the classifier. Here, we created hybrid networks
where the bottom layers were trained with the Hebbian rule and the top layers
were trained with SGD. The goal of the following set of experiments is to assess
the limits within which layers trained with the Hebbian algorithm can replace
layers trained with SGD.
The architectures of these hybrid networks are as in Figure 2b, where a
layer was chosen as the splitting point between Hebbian-trained layers and
SGD-trained ones. All the layers from the first to the fifth were used in
different experiments as splitting points. The features extracted from the
Hebbian-trained layers up to the splitting point were fed to the remaining
network, which was re-trained from scratch with SGD on the Hebbian feature maps
provided as input. During this re-training process, the Hebbian-trained network
was kept in evaluation mode and its parameters were left untouched. As before,
when training the Hebbian layers, we used the WTA approach, and the images were
processed with ZCA-whitening.
Table 2 (second group) shows the accuracy on CIFAR-10 of a network composed of
bottom layers trained with the Hebbian algorithm and top layers trained with
Gradient Descent. In addition, the accuracy of the baseline fully trained with
Gradient Descent and that of the same network fully trained with the Hebbian
algorithm are also shown for comparison (Table 2, first group). We also report
the results of a network where the bottom layers are left completely untrained
(randomly initialized), so that it is possible to assess whether Hebbian
training gives a positive contribution w.r.t. pure randomness or is completely
destructive (Table 2, third group).
It can be observed that the first layer trained with the Hebbian algorithm can
replace the corresponding Gradient Descent layer with essentially no accuracy
loss (84.93% vs. 84.95%). There is a certain accuracy loss when the second
layer is also switched to Hebbian learning (78.61%); however, the result is
still competitive. Accuracy heavily degrades when further layers are set to
Hebbian learning. As expected, Hebbian training is better than leaving the
layers untrained.
5.4 Hybrid: Hebbian layers on top of SGD layers
The experiments presented in this section complement those discussed in
Section 5.3: bottom layers are SGD-trained, while the top layers are
Hebbian-trained. Table 2 (fourth group) shows the accuracy on CIFAR-10 of a
network composed of bottom layers trained with Gradient Descent and top layers
trained with the Hebbian algorithm. The table compares the accuracy of hybrid
networks obtained by choosing layers 1, 2, ..., 5 to be the splitting point,
i.e. all the layers on top of the first, second, ..., fifth (respectively) were
trained with the Hebbian algorithm, while the rest of the network was kept in
evaluation mode and its parameters were left untouched.
In this case, we can observe that the last layer trained with the Hebbian
algorithm (i.e. the supervised Hebbian classifier) can replace the
corresponding Gradient Descent layer with almost no accuracy loss (84.88% vs.
84.95%). The fifth layer can also be replaced with a Hebbian layer with only a
minimal accuracy decrease (83.16%). We observe a larger accuracy loss when the
fourth layer is also replaced (71.18%). Accuracy heavily degrades when further
layers are replaced.
5.5 Hybrid: Hebbian layers between SGD layers
Finally, more complex configurations were also considered in which network
layers were divided into three groups: bottom layers, middle layers, and top
layers, as shown in Figure 2c. The bottom layers were trained with SGD, the
middle layers with the Hebbian algorithm, and the top layers again with SGD.
Table 2 (fifth to eighth group) shows the accuracy on CIFAR-10 of these
configurations. The table compares the accuracy of hybrid networks obtained by
choosing various combinations of inner layers to be converted to Hebbian
training. It can be observed that a minor accuracy loss occurs when a single
inner layer is switched to a Hebbian equivalent. A slightly larger accuracy
loss occurs when two layers are replaced. Specifically, lower layers are more
susceptible than higher layers. Accuracy degrades more when further layers are
replaced. However, the replacement of inner layers has more influence on the
resulting accuracy than the replacement of outer layers.
6 Conclusion
We explored the use of Hebbian rules for training a deep convolutional neural
network for image classification. We extended the Hebbian weight update rule to
convolutional layers, and we tested various combinations of Hebbian and SGD
learning.
Experiments on CIFAR-10 showed that the Hebbian algorithm can be effectively
used to train a few layers of a neural network, but the performance decreases
when more layers are involved. In particular, the Hebbian algorithm is adequate
for training the lower and the higher layers of a neural network, but not the
intermediate layers, which lead to the main performance drops when switched
from Gradient Descent to their Hebbian equivalent.
On the other hand, the algorithm is advantageous with respect to Gradient
Descent in terms of the number of epochs needed for training. In fact, a stack
of Hebbian layers (for instance the top portion of a hybridly-trained network,
or even a full Hebbian network) can be trained in fewer epochs (e.g. one or two
on the architecture we used) than a network trained with SGD, which needs
twenty epochs. Although the performance of deep full Hebbian networks is not
yet comparable to that of gradient-based models, according to our results,
current Hebbian learning approaches could be efficiently and effectively adopted
in scenarios like fine-tuning and transfer learning, where Hebbian layers on top
of pre-trained SGD layers can be re-trained quickly and effectively.
Moreover, the local nature of the Hebbian rule potentially provides huge
speed-ups for large models with respect to backpropagation, thus encouraging
further research to improve current approaches.

Table 2: Accuracy (%) on the CIFAR-10 test set of various configurations of
learning rules. Columns 'L1-L5' and 'Classif' report the learning rule (G:
gradient descent, H: Hebbian rule, R: random init.) used to train, respectively,
layers 1 to 5 and the final classifier.

| Group | Description           | L1 | L2 | L3 | L4 | L5 | Classif | Accuracy |
|-------|-----------------------|----|----|----|----|----|---------|----------|
| 1     | Full SGD              | G  | G  | G  | G  | G  | G       | 84.95    |
| 1     | Full Hebbian          | H  | H  | H  | H  | H  | H       | 28.59    |
| 2     | 1-bottom Hebbian      | H  | G  | G  | G  | G  | G       | 84.93    |
| 2     | 2-bottom Hebbian      | H  | H  | G  | G  | G  | G       | 78.61    |
| 2     | 3-bottom Hebbian      | H  | H  | H  | G  | G  | G       | 67.87    |
| 2     | 4-bottom Hebbian      | H  | H  | H  | H  | G  | G       | 57.56    |
| 2     | 5-bottom Hebbian      | H  | H  | H  | H  | H  | G       | 41.78    |
| 3     | 1-bottom Random       | R  | G  | G  | G  | G  | G       | 80.19    |
| 3     | 2-bottom Random       | R  | R  | G  | G  | G  | G       | 71.87    |
| 3     | 3-bottom Random       | R  | R  | R  | G  | G  | G       | 54.96    |
| 3     | 4-bottom Random       | R  | R  | R  | R  | G  | G       | 45.56    |
| 3     | 5-bottom Random       | R  | R  | R  | R  | R  | G       | 9.52     |
| 4     | 1-top Hebbian         | G  | G  | G  | G  | G  | H       | 84.88    |
| 4     | 2-top Hebbian         | G  | G  | G  | G  | H  | H       | 83.16    |
| 4     | 3-top Hebbian         | G  | G  | G  | H  | H  | H       | 71.18    |
| 4     | 4-top Hebbian         | G  | G  | H  | H  | H  | H       | 50.43    |
| 4     | 5-top Hebbian         | G  | H  | H  | H  | H  | H       | 32.95    |
| 5     | Layer 2 Hebbian       | G  | H  | G  | G  | G  | G       | 80.36    |
| 5     | Layer 3 Hebbian       | G  | G  | H  | G  | G  | G       | 80.68    |
| 5     | Layer 4 Hebbian       | G  | G  | G  | H  | G  | G       | 80.92    |
| 5     | Layer 5 Hebbian       | G  | G  | G  | G  | H  | G       | 83.75    |
| 6     | Layer 2-3 Hebbian     | G  | H  | H  | G  | G  | G       | 72.12    |
| 6     | Layer 3-4 Hebbian     | G  | G  | H  | H  | G  | G       | 74.98    |
| 6     | Layer 4-5 Hebbian     | G  | G  | G  | H  | H  | G       | 76.86    |
| 7     | Layer 2-3-4 Hebbian   | G  | H  | H  | H  | G  | G       | 63.68    |
| 7     | Layer 3-4-5 Hebbian   | G  | G  | H  | H  | H  | G       | 62.43    |
| 8     | Layer 2-3-4-5 Hebbian | G  | H  | H  | H  | H  | G       | 47.24    |

References
1. Yanis Bahroun and Andrea Soltoggio. Online representation learning with
multi-layer hebbian networks for image classification tasks. arXiv preprint
arXiv:1702.06456, 2017.
2. Wulfram Gerstner and Werner M Kistler. Spiking neuron models: Single neurons,
populations, plasticity. Cambridge university press, 2002.
3. Stephen Grossberg. Adaptive pattern classification and universal recoding: I.
parallel development and coding of neural feature detectors. Biological
cybernetics, 23(3):121–134, 1976.
4. Simon Haykin. Neural networks and learning machines. Pearson, 3 edition, 2009.
5. Teuvo Kohonen. Self-organized formation of topologically correct feature maps.
Biological cybernetics, 43(1):59–69, 1982.
6. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from
tiny images. 2009.
7. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. Advances in neural information
processing systems, 2012.
8. Gabriele Lagani. Hebbian learning algorithms for training convolutional neural
networks. Master’s thesis, School of Engineering, University of Pisa, Italy, 2019.
9. Gabriele Lagani. Hebbian learning algorithms for training convolutional neural
networks - project code. https://github.com/GabrieleLagani/HebbianLearningThesis,
2019.
10. Cengiz Pehlevan, Tao Hu, and Dmitri B Chklovskii. A hebbian/anti-hebbian neural
network for linear subspace learning: A derivation from multidimensional scaling
of streaming data. Neural computation, 27(7):1461–1495, 2015.
11. Filip Ponulak. Resume-new supervised learning method for spiking neural
networks. technical report. Institute of Control and Information Engineering,
Poznan University of Technology, 2005.
12. David E Rumelhart and David Zipser. Feature discovery by competitive learning.
Cognitive science, 9(1):75–112, 1985.
13. Amar Shrestha, Khadeer Ahmed, Yanzhi Wang, and Qinru Qiu. Stable spike-timing
dependent plasticity rule for multilayer unsupervised and supervised learning.
pages 1999–2006, 2017.
14. Chr Von der Malsburg. Self-organization of orientation sensitive cells in the
striate cortex. Kybernetik, 14(2):85–100, 1973.
15. ...tations using the hebbian principle. arXiv preprint arXiv:1611.04228, 2016.
... Nonetheless, only relatively shallow networks were considered. Deeper network architectures were also considered in [1], but still, a thorough investigation of the various competitive learning strategies is missing. Our experiments on different object recognition datasets show that the Hebbian approaches are effective to train early feature extraction layers, or to re-train higher layers of a pre-trained network, when compared to supervised backprop. ...
... Still, the experiments are limited to relatively shallow networks. Deeper networks trained by Hebbian WTA were considered in [1], were it was confirmed that the WTA approach was effective for training early feature extraction layers, thus being suitable for relatively shallow networks, but also to retrain higher layers of a pre-trained network (including the final classifier, by a supervised Hebbian learning variant [15]), while requiring fewer training epochs than backprop, thus suggesting potential applications in the context of transfer learning [26]. Nonetheless, the results of this latter work were preliminary, and involved a single approach (WTA) and a single dataset for testing (CIFAR10). ...
... Notice that this does not raise biological plausibility issues, because backpropagation is not required when SGD is used to train a single layer. Even if the Hebbian approach is unsupervised, it is also possible to apply a supervised variant [1,15] for training the linear classifier, although, at this stage, we preferred to use SGD in all cases, in order to make comparisons on an equal footing. Indeed, the SGD weight update can be considered as a form of supervised Hebbian update, modulated by a teacher signal. ...
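The closing remark of the excerpt above can be checked numerically. The following is a minimal sketch, assuming a single linear layer y = Wx trained with a squared-error loss; the layer, the loss, and every function name are illustrative assumptions, not taken from the cited works. Under these assumptions the SGD step is proportional to the outer product of the error (t - y), which plays the role of the teacher signal, and the input x, which is the presynaptic term.

```python
def forward(W, x):
    """Linear layer: y_i = sum_j W[i][j] * x[j]."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def loss(W, x, t):
    """Squared error 0.5 * ||t - y||^2 (an assumed loss, for illustration)."""
    y = forward(W, x)
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y))


def delta_rule_update(W, x, t):
    """Negative loss gradient, written as error (t - y) times input x."""
    y = forward(W, x)
    return [[(ti - yi) * xj for xj in x] for ti, yi in zip(t, y)]


def numeric_neg_grad(W, x, t, eps=1e-6):
    """Central finite differences of -loss with respect to each weight."""
    grad = []
    for i, row in enumerate(W):
        grad_row = []
        for j in range(len(row)):
            W_plus = [r[:] for r in W]
            W_minus = [r[:] for r in W]
            W_plus[i][j] += eps
            W_minus[i][j] -= eps
            diff = loss(W_plus, x, t) - loss(W_minus, x, t)
            grad_row.append(-diff / (2 * eps))
        grad.append(grad_row)
    return grad


# Hypothetical toy values: 2 outputs, 3 inputs.
W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
x = [1.0, 2.0, -1.0]
t = [1.0, 0.0]

hebbian_like = delta_rule_update(W, x, t)
numeric = numeric_neg_grad(W, x, t)
```

Here `hebbian_like` and `numeric` agree up to rounding error, so for this single-layer case the gradient step involves only quantities local to each weight: its input and the error at its output unit.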
Chapter
We explore competitive Hebbian learning strategies to train feature detectors in Convolutional Neural Networks (CNNs), without supervision. We consider variants of the Winner-Takes-All (WTA) strategy explored in previous works, i.e. k-WTA, e-soft-WTA and p-soft-WTA, performing experiments on different object recognition datasets. Results suggest that the Hebbian approaches are effective to train early feature extraction layers, or to re-train higher layers of a pre-trained network, with soft competition generally performing better than other Hebbian approaches explored in this work. Our findings encourage a path of cooperation between neuroscience and computer science towards a deeper investigation of biologically inspired learning principles.
... Previous works [2,36,37] already showed that Hebbian learning variants are suitable for training relatively shallow networks (with two or three layers), which are appealing for applications on constrained devices. For instance, in [1], preliminary results showed that HWTA competition was effective to re-train higher layers of a pre-trained network, achieving results comparable with backprop, but requiring fewer training epochs, thus suggesting potential applications in the context of transfer learning. ...
... Furthermore, in order to assess the impact of switching from backprop to Hebbian training layer by layer, we also considered hybrid models in which some network layers are trained with backprop and others with Hebbian learning. Such hybrid models were also studied in [1], but only preliminary results were presented involving just the HWTA learning rule and just one dataset. In this work, we provide a more comprehensive evaluation of the HWTA rule, as well as the HPCA rule, using more datasets in our experiments. ...
... In [1,19], the authors provided preliminary experiments on a single dataset (CIFAR10), by applying Hebbian-WTA learning to CNNs with up to six layers, comparing the results with those obtained by training the same network with backprop. The WTA approach, as it is, is unsupervised, but a supervised Hebbian learning variant was also proposed in order to train the final classification layer. ...
Article
In this paper, we investigate Hebbian learning strategies applied to Convolutional Neural Network (CNN) training. We consider two unsupervised learning approaches, Hebbian Winner-Takes-All (HWTA), and Hebbian Principal Component Analysis (HPCA). The Hebbian learning rules are used to train the layers of a CNN in order to extract features that are then used for classification, without requiring backpropagation (backprop). Experimental comparisons are made with state-of-the-art unsupervised (but backprop-based) Variational Auto-Encoder (VAE) training. For completeness, we consider two supervised Hebbian learning variants, Supervised Hebbian Classifiers (SHC) and Contrastive Hebbian Learning (CHL), for training the final classification layer, which are compared to Stochastic Gradient Descent training. We also investigate hybrid learning methodologies, where some network layers are trained following the Hebbian approach, and others are trained by backprop. We tested our approaches on MNIST, CIFAR10, and CIFAR100 datasets. Our results suggest that Hebbian learning is generally suitable for training early feature extraction layers, or to retrain higher network layers in fewer training epochs than backprop. Moreover, our experiments show that Hebbian learning outperforms VAE training, with HPCA performing generally better than HWTA.
... However, pure Hebbian multi-layer networks suffer from poor performance. Notably, a consistent result is that decoding performance decreases over successive layers, suggesting loss of information [Amato et al., 2019]. This is a particularly disconcerting finding, since the purpose of hierarchical networks is precisely to aggregate information and produce more exploitable representations over successive layers [DiCarlo and Cox, 2007]. ...
... Recently several authors have proposed using modern deep learning frameworks for Hebbian learning in multi-layer convolutional networks [Amato et al., 2019, Talloen et al., 2021]. These projects rely on intricate hand-crafted machinery to implement Hebbian learning. ...
... Rather, they simply learn large, simple features (Gabor-like oriented edge detectors and blobs), by combining lower-level features appropriately. This explains the reduction in decoding performance over successive layers, despite a higher number of filters, as was observed in previous work [Amato et al., 2019]. To combat this tendency, we introduce several interventions ("triangle" method for computing activations [Coates et al., 2011] and massive pruning of connections between layers) which both prevent the formation of high-level Gabors and massively increase higher-level performance, allowing higher layers to produce more informative representations than the first layer. ...
Preprint
Deep learning networks generally use non-biological learning methods. By contrast, networks based on more biologically plausible learning, such as Hebbian learning, show comparatively poor performance and difficulties of implementation. Here we show that hierarchical, convolutional Hebbian learning can be implemented almost trivially with modern deep learning frameworks, by using specific losses whose gradients produce exactly the desired Hebbian updates. We provide expressions whose gradients exactly implement a plain Hebbian rule (dw ~= xy), Grossberg's instar rule (dw ~= y(x-w)), and Oja's rule (dw ~= y(x-yw)). As an application, we build Hebbian convolutional multi-layer networks for object recognition. We observe that higher layers of such networks tend to learn large, simple features (Gabor-like filters and blobs), explaining the previously reported decrease in decoding performance over successive layers. To combat this tendency, we introduce interventions (denser activations with sparse plasticity, pruning of connections between layers) which result in sparser learned features, massively increase performance, and allow information to increase over successive layers. We hypothesize that more advanced techniques (dynamic stimuli, trace learning, feedback connections, etc.), together with the massive computational boost offered by modern deep learning frameworks, could greatly improve the performance and biological relevance of multi-layer Hebbian networks.
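The three update rules named in this abstract can be written out directly for a single linear neuron. The sketch below is illustrative only: it applies the updates by hand rather than through the gradient-based losses the abstract describes, and the data, learning rate, epoch count, and function names are all assumptions made for the example.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(w):
    return dot(w, w) ** 0.5


def hebb_delta(w, x, y):
    """Plain Hebbian rule: dw = y * x."""
    return [y * xi for xi in x]


def instar_delta(w, x, y):
    """Grossberg's instar rule: dw = y * (x - w)."""
    return [y * (xi - wi) for xi, wi in zip(x, w)]


def oja_delta(w, x, y):
    """Oja's rule: dw = y * (x - y * w)."""
    return [y * (xi - y * wi) for xi, wi in zip(x, w)]


def train(rule, data, w, lr=0.05, epochs=200):
    """Apply one update per sample, with a linear neuron y = w . x."""
    w = list(w)
    for _ in range(epochs):
        for x in data:
            y = dot(w, x)
            dw = rule(w, x, y)
            w = [wi + lr * di for wi, di in zip(w, dw)]
    return w


# Hypothetical 2-D samples lying close to the main diagonal.
DATA = [[1.0, 0.9], [-1.0, -1.1], [0.5, 0.4], [-0.6, -0.5]]
W0 = [0.3, 0.1]

w_hebb = train(hebb_delta, DATA, W0)
w_oja = train(oja_delta, DATA, W0)
```

On this toy data the plain Hebbian weights grow without bound, since each update can only increase the weight norm. Oja's extra decay term keeps the norm close to 1 while the weight vector aligns with the dominant direction of the data. The instar rule is usually paired with non-negative activations, so it is shown here as a formula only and is not run on this signed linear neuron.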
... 3. However, it was only recently that Hebbian learning started gaining attention in the context of DNN training [2,3,25,26,42,43]. ...
... However, the previous approaches were based on relatively shallow network architectures (2-3 layers). A further step was taken in [2,26], where a Hebbian WTA learning rule was considered. The learning rule was applied for training a 6-layer Convolutional Neural Network (CNN). ...
Chapter
We propose a semi-supervised learning strategy for deep Convolutional Neural Networks (CNNs) in which an unsupervised pre-training stage, performed using biologically inspired Hebbian learning algorithms, is followed by supervised end-to-end backprop fine-tuning. We explored two Hebbian learning rules for the unsupervised pre-training stage: soft-Winner-Takes-All (soft-WTA) and nonlinear Hebbian Principal Component Analysis (HPCA). Our approach was applied in sample efficiency scenarios, where the amount of available labeled training samples is very limited, and unsupervised pre-training is therefore beneficial. We performed experiments on CIFAR10, CIFAR100, and Tiny ImageNet datasets. Our results show that Hebbian outperforms Variational Auto-Encoder (VAE) pre-training in almost all the cases, with HPCA generally performing better than soft-WTA.
... A brief overview is given in Section 5. However, it was only recently that Hebbian learning started gaining attention in the context of DNN training (Amato, Carrara, Falchi, Gennaro, & Lagani, 2019; Bahroun & Soltoggio, 2017; Krotov & Hopfield, 2019; Lagani, 2019; Wadhwa & Madhow, 2016a, 2016b). In Krotov and Hopfield (2019), a Hebbian learning rule based on inhibitory competition was used to train a neural network composed of fully connected layers on object recognition tasks. ...
... However, the previous approaches were based on relatively shallow network architectures (2-3 Layers). A further step was taken in Amato et al. (2019) and Lagani (2019), where a Hebbian WTA learning rule was investigated for training a 6-layer Convolutional Neural Network (CNN). Also, a supervised variant of Hebbian learning was proposed to train the final classification layer. ...
Article
We propose to address the issue of sample efficiency, in Deep Convolutional Neural Networks (DCNN), with a semi-supervised training strategy that combines Hebbian learning with gradient descent: all internal layers (both convolutional and fully connected) are pre-trained using an unsupervised approach based on Hebbian learning, and the last fully connected layer (the classification layer) is trained using Stochastic Gradient Descent (SGD). In fact, as Hebbian learning is an unsupervised learning method, its potential lies in the possibility of training the internal layers of a DCNN without labels. Only the final fully connected layer has to be trained with labeled examples. We performed experiments on various object recognition datasets, in different regimes of sample efficiency, comparing our semi-supervised (Hebbian for internal layers + SGD for the final fully connected layer) approach with end-to-end supervised backprop training, and with semi-supervised learning based on Variational Auto-Encoder (VAE). The results show that, in regimes where the number of available labeled samples is low, our semi-supervised approach outperforms the other approaches in almost all the cases.
... Hebbian learning has a rich history in artificial neural networks, dating back to the neocognitron [7], and including recent attempts at introducing it into deep architectures [8]. However, to the best of our knowledge, ours is the first paper to clearly demonstrate gains in robustness from its incorporation in DNNs. ...
Preprint
While end-to-end training of Deep Neural Networks (DNNs) yields state of the art performance in an increasing array of applications, it does not provide insight into, or control over, the features being extracted. We report here on a promising neuro-inspired approach to DNNs with sparser and stronger activations. We use standard stochastic gradient training, supplementing the end-to-end discriminative cost function with layer-wise costs promoting Hebbian ("fire together," "wire together") updates for highly active neurons, and anti-Hebbian updates for the remaining neurons. Instead of batch norm, we use divisive normalization of activations (suppressing weak outputs using strong outputs), along with implicit $\ell_2$ normalization of neuronal weights. Experiments with standard image classification tasks on CIFAR-10 demonstrate that, relative to baseline end-to-end trained architectures, our proposed architecture (a) leads to sparser activations (with only a slight compromise on accuracy), (b) exhibits more robustness to noise (without being trained on noisy data), (c) exhibits more robustness to adversarial perturbations (without adversarial training).
... In [24], BCM theory, Competitive Hebbian Learning, and Stochastic Gradient Descent are considered to derive a new learning rule. The integration of Hebbian-based learning with ConvNets has also been proposed [25][26][27][28], but BCM learning rules have been barely considered [29]. In addition, some of the previous works focused on improving the TDL algorithm, taking into account the results of [1], which includes the articles by [30][31][32][33]. ...
Article
This research integrates key concepts of Computational Neuroscience, including the Bienenstock-Cooper-Munro (BCM) rule, Spike Timing-Dependent Plasticity (STDP) rules, and the Temporal Difference Learning algorithm, with an important structure of Deep Learning (Convolutional Networks) to create an architecture with the potential of replicating observations of some cognitive experiments (particularly, those that provided some basis for sequential reasoning) while sharing the advantages already achieved by the previous proposals. In particular, we present Ring Model B, which is capable of associating visual with auditory stimuli, performing sequential predictions, and predicting reward from experience. Despite its simplicity, we consider such abilities to be a first step towards the formulation of more general models of prelinguistic reasoning.
Article
Lately, cross-modal retrieval has attained plenty of attention due to enormous multi-modal data generation every day in the form of audio, video, image, and text. One vital requirement of cross-modal retrieval is to reduce the heterogeneity gap among miscellaneous modalities so that one modality’s results can be retrieved from the other in an efficient way. So, a novel unsupervised cross-modal retrieval framework based on associative learning is proposed in this paper where two traditional SOMs are trained separately for images and collateral text and then they are associated together using the Hebbian learning network to facilitate the cross-modal retrieval process. Experimental outcomes on a popular Wikipedia dataset demonstrate that the presented technique outshines various existing state-of-the-art techniques.
Article
In this paper, we compare the efficiency of three different techniques used to predict the daily power consumption for a local industrial region (the studied case). First, a variant of the Multiple Model Particle Filter is suggested as a probabilistic approach. Then, two different ANNs with one and two hidden layers respectively are designed and tested. Finally, we demonstrate a developed ANN-based design that has the ability to adapt its own structure according to the historical fluctuations provided by a given dataset that contains the consumed power for the same region between 2011 and 2015 (1825 days). The potential of AI-based techniques is emphasized by summarizing a complementary heuristic study that employs the genetic algorithm to suggest an optimal outage schedule for the generators supplying the above-mentioned region, either to accomplish maintenance activities that could be needed from time to time or to rest some of the units if the predicted consumption for a given period doesn't require the total produced power.
Conference Paper
Unsupervised learning permits the development of algorithms that are able to adapt to a variety of different data sets using the same underlying rules thanks to the autonomous discovery of discriminating features during training. Recently, a new class of Hebbian-like and local unsupervised learning rules for neural networks have been developed that minimise a similarity matching cost-function. These have been shown to perform sparse representation learning. This study tests the effectiveness of one such learning rule for learning features from images. The rule implemented is derived from a nonnegative classical multidimensional scaling cost-function, and is applied to both single and multi-layer architectures. The features learned by the algorithm are then used as input to an SVM to test their effectiveness in classification on the established CIFAR-10 image dataset. The algorithm performs well in comparison to other unsupervised learning algorithms and multi-layer networks, thus suggesting its validity in the design of a new class of compact, online learning networks.
Article
Neural network models of early sensory processing typically reduce the dimensionality of streaming input data. Such networks learn the principal subspace, in the sense of principal component analysis (PCA), by adjusting synaptic weights according to activity-dependent learning rules. When derived from a principled cost function these rules are nonlocal and hence biologically implausible. At the same time, biologically plausible local rules have been postulated rather than derived from a principled cost function. Here, to bridge this gap, we derive a biologically plausible network for subspace learning on streaming data by minimizing a principled cost function. In a departure from previous work, where cost was quantified by the representation, or reconstruction, error, we adopt a multidimensional scaling (MDS) cost function for streaming data. The resulting algorithm relies only on biologically plausible Hebbian and anti-Hebbian local learning rules. In a stochastic setting, synaptic weights converge to a stationary state which projects the input data onto the principal subspace. If the data are generated by a nonstationary distribution, the network can track the principal subspace. Thus, our result makes a step towards an algorithmic theory of neural computation.
Article
In this report I introduce ReSuMe, a new supervised learning method for Spiking Neural Networks. The research on ReSuMe has been primarily motivated by the need of inventing an efficient learning method for control of movement for the physically disabled. However, thorough analysis of the ReSuMe method reveals its suitability not only to the task of movement control, but also to other real-life applications including modeling, identification and control of diverse non-stationary, nonlinear objects. ReSuMe integrates the idea of learning windows, known from the spike-based Hebbian rules, with a novel concept of remote supervision. General overview of the method, the basic definitions, the network architecture and the details of the learning algorithm are presented. The properties of ReSuMe such as locality, computational simplicity and the online processing suitability are discussed. ReSuMe learning abilities are illustrated in a verification experiment.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
The "fire together, wire together" Hebbian model is a central principle for learning in neuroscience, but surprisingly, it has found limited applicability in modern machine learning. In this paper, we take a first step towards bridging this gap, by developing flavors of competitive Hebbian learning which produce sparse, distributed neural codes using online adaptation with minimal tuning. We propose an unsupervised algorithm, termed Adaptive Hebbian Learning (AHL). We illustrate the distributed nature of the learned representations via output entropy computations for synthetic data, and demonstrate superior performance, compared to standard alternatives such as autoencoders, in training a deep convolutional net on standard image datasets.
Article
This work contains a theoretical study and computer simulations of a new self-organizing process. The principal discovery is that in a simple network of adaptive physical elements which receives signals from a primary event space, the signal representations are automatically mapped onto a set of output responses in such a way that the responses acquire the same topological order as that of the primary events. In other words, a principle has been discovered which facilitates the automatic formation of topologically correct maps of features of observable events. The basic self-organizing system is a one- or two-dimensional array of processing units resembling a network of threshold-logic units, and characterized by short-range lateral feedback between neighbouring units. Several types of computer simulations are used to demonstrate the ordering process as well as the conditions under which it fails.