Noise-Resilient Designs for Optical Neural Networks
Gianluca Kosmella (a,b,*), Ripalta Stabile (a), Jaron Sanders (b)
(a) Department of Electrical Engineering, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands
(b) Department of Mathematics and Computer Science, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands
(*) Corresponding author. Email address: g.k.kosmella@tue.nl (Gianluca Kosmella)
Abstract
All analog signal processing is fundamentally subject to noise, and this is also the case in modern implementations of Optical
Neural Networks (ONNs). Therefore, to mitigate noise in ONNs, we propose two designs that are constructed from a given,
possibly trained, Neural Network (NN) that one wishes to implement. Both designs ensure that the resulting ONNs
give outputs close to those of the desired NN.
To establish the latter, we analyze the designs mathematically. Specifically, we investigate a probabilistic framework for
the first design that establishes that the design is correct, i.e., for any feed-forward NN with Lipschitz continuous activation
functions, an ONN can be constructed that produces output arbitrarily close to the original. ONNs constructed with the first
design thus also inherit the universal approximation property of NNs. For the second design, we restrict the analysis to NNs
with linear activation functions and characterize the ONNs' output distribution using exact formulas.
Finally, we report on numerical experiments with LeNet ONNs that give insight into the number of components required in
these designs for certain accuracy gains. We specifically study the effect of noise as a function of the depth of an ONN. The
results indicate that in practice, adding just a few components in the manner of the first or the second design can already be
expected to increase the accuracy of ONNs considerably.
Keywords: Optical Neural Networks, Law of Large Numbers, Universal Approximation
1. Introduction
Machine Learning (ML) is a computing paradigm in which
problems that are traditionally challenging for programmers
to explicitly write algorithms for, are solved by learning algo-
rithms that improve automatically through experience. That
is, they “learn” structure in data. Prominent examples include
image recognition [1], semantic segmentation [2], human-level control in video games [3], visual tracking [4], and language translation [5].
Classical computers are designed and best suited for se-
rialized operations (they have a central processing unit and
separated memory), while the data-driven ML approach re-
quires decentralized and parallel calculations at high band-
width as well as continuous processing of parallel data. To
illustrate how ML can benefit from a different architecture,
we can consider performance relative to the number of ex-
ecuted operations, often measured as Multiply–Accumulate Operation (MAC) rates, and the energy efficiency, i.e., the
amount of energy spent to execute one single operation. Com-
putational efficiency in classical computers levels off below 10
GMAC/s/W [6].
An alternative computing architecture with a more dis-
tributed interconnectivity and memory would allow for greater
energy efficiency and computational speed. An inspiring ex-
ample would be an architecture such as the brain. The brain is able to perform about $10^{18}$ MAC/s using only 20 W of power [6], and operates approximately $10^{11}$ neurons, each with an average of about $10^4$ synaptic inputs. This leads to an estimated total of $10^{15}$ synaptic connections, all conveying signals up to 1 kHz bandwidth. The brain's computational efficiency (less than 1 aJ per MAC) is then about 8 orders of magnitude higher than that of current supercomputers, which operate instead at 100 pJ per MAC [6].
Connecting software to hardware through computing ar-
chitecture tailored to ML tasks is the endeavor of research
within the field of neuromorphic computing. The electronics
community is now busy developing non-von Neumann com-
puting architectures to enable information processing with
an energy efficiency down to a few pJ per operation. Aim-
ing to replicate fundamentals of biological neural circuits in
dedicated hardware, important advances have been made in
neuromorphic accelerators [7]. These advances are based on the spiking architectural models, which are still not fully un-
derstood. Deep Learning (DL)-focused approaches, on the
other hand, aim to construct hardware that efficiently realizes
DL architectures, while eliminating as much of the complexity
of biological neural networks as possible. Among the most
powerful DL hardware are the GPU-based DL accelerators [8, 9, 10, 11, 12], as well as emerging analogue electronic Artificial Intelligence chipsets that tend to collocate processing and memory to minimize the memory–processor communication energy costs (e.g., the analogue crossbar approaches [13]). Mythic's architecture, for example, can yield high accuracy in inference applications with a remarkable energy efficiency of just half a pJ per MAC. Even though the implementation of neuromorphic approaches is visibly bringing outstanding record energy efficiencies and computation speeds, neuromorphic electronics still struggles to offer the desired data throughput at the neuron level. Neuro-
morphic processing for high-bandwidth applications requires
GHz operation per neuron, which calls for a fundamentally
different technology approach.
1.1. Optical Neural Networks
A major concern with neuromorphic electronics is that
the distributed hardware needed for parallel interconnections
is impractical to realize with classical metal wiring: a trade-
off applies between interconnectivity and bandwidth, limiting these engines' utilization to applications in the kHz and sub-GHz regime. When information is sent not through electrical signals but via optical signals, the optical interconnections do not suffer from interference and the optical bandwidth is virtually unlimited. This can, for example, be achieved by exploiting the color, space, polarization, and/or time domains, thus allowing for applications in the GHz regime. It has been theorized that photonic neuromorphic processors could operate ten thousand times faster while using less energy per computation [14, 15, 16, 17]. Photonics therefore seems
to be a promising platform for advances in neuromorphic
computing.
Implementations of weighted addition for Optical Neural Networks (ONNs) include Mach–Zehnder Interferometer-based Optical Interference Units [18], time-multiplexed and coherent detection [19], free-space systems using spatial light modulators [20], and Micro-Ring-Resonator-based weighting banks on silicon [21]. Furthermore, Indium-phosphide-integrated optical cross-connects using Semiconductor Optical Amplifiers as single-stage weight elements, as well as Semiconductor Optical Amplifier-based wavelength converters [22, 23, 24], have been demonstrated to allow All-Optical (AO) Neural Networks (NNs). A comprehensive review of all the approaches used in integrated photonics can be found in [25].
Next to these promises, aspects like implementation of
nonlinearities, access and storage of weights in on-chip mem-
ory, and noise sources in analog photonic implementations, all
pose challenges in devising scalable photonic neuromorphic
processors and accelerators. These challenges also occur when
they are embedded within end-to-end systems. Fortunately,
arbitrary scalability of these networks has been demonstrated,
albeit with a certain level of noise and accuracy. However, it would be useful to envision new architectures that reduce noise even further.
1.2. Noise in ONNs
The types of noise in ONNs include thermal crosstalk [26], cumulative noise in optical communication links [27, 28] and noise deriving from applying an activation function [29].
In all these studies, the noise is considered to be approximated well by Additive White Gaussian Noise (AWGN).
For example, taking the studies [26, 28, 27, 29, 30] as a starting point, the authors of [31] model an ONN as a com-
munication channel with AWGN. We follow this assumption
and will model an ONN as having been built up from inter-
connected nodes with noise in between them. This generic
approach does not restrict us to any specific device that may
be used in practice.
The model also applies to the two alternative designs of an AO implementation of a NN (see for example [32]) and the case of an optical/electrical/optical (O/E/O) NN [22]. In an AO NN, the activation function is applied by
manipulating an incoming electromagnetic wave. Modulation
(and the AWGN it causes) only occurs prior to entering an
AO NN (or equivalently, in the first layer). For the remainder
of the network the signal remains in the optical domain. Here,
when applying the optical activation function a new source
of noise is introduced as AWGN at the end of each layer.
Using the O/E/O network architecture, the weighted addition
is performed in the optical realm, but the light is captured
soon after each layer, where it is converted into an electrical
and digital signal and the activation function is applied via
software on a computer. The operation on the computer
can be assumed to be noiseless. However, since the result
again needs to be modulated (to be able to act as input to
the next layer), modulation noise is added. We can further
abstract from the specifics of the AO and O/E/O design and
see that in either implementation noise occurs at the same
locations within the mathematical modeling, namely AWGN
for weighted addition and afterwards AWGN from an optical
activation function or from modulation, respectively. This
means that we do not need to distinguish between the two
design choices in our modeling; we only need to choose the
corresponding AWGN term after activation.
The operation of a layer of a feed-forward NN can be modeled by multiplying a matrix $W$ with an input vector $x$ (a bias term $b$ can be absorbed into the matrix–vector product and will therefore be suppressed in notation here) and then applying an activation function $f : \mathbb{R} \to \mathbb{R}$ element-wise to the result. Symbolically, $x \mapsto f(Wx)$.
Now, concretely, the noise model that we study is described by
$$x \mapsto f\big(Wx + \mathrm{Normal}(0, \Sigma_w)\big) + \mathrm{Normal}(0, \Sigma_a), \qquad (1)$$
for each hidden layer of the ONN. Here $\mathrm{Normal}(0, \Sigma)$ denotes the multivariate normal distribution with mean vector $0$ and covariance matrix $\Sigma$. More specifically, $\Sigma_w$, $\Sigma_a$ and $\Sigma_m$ are the covariance matrices associated with weighted addition, application of the activation function, and modulation, respectively. Figure 1 gives a schematic representation of the noise model under study. As we have seen above, in the O/E/O case we have $\Sigma_a = \Sigma_m$; otherwise $\Sigma_a$ is due to the specific structure of the photonic activation function. The first layer, regardless of an AO or O/E/O network, sees a modulated input $x$, i.e., $x + \mathrm{Normal}(0, \Sigma_m)$, and afterwards the same
Figure 1: Schematic depiction of the noise model of ONNs that we study. First, data $x$ is modulated onto light. This step adds an AWGN term $N_m$. This light enters the photonic layer, in which a weighted addition takes place, adding AWGN $N_w$. The activation function is then applied, adding AWGN $N_a$. The activation function may be applied by photo-detecting the signal of the weighted addition, turning it into a digital signal and applying the activation function on a computer. The result of that action would then be modulated again, to produce the optical output of the photonic neuron. The modulator is thus only required in the first layer, as each photonic neuron takes in light and outputs light.
steps of weighting and applying an activation function, that is, (1). Arguably, the hidden layers and their noise structure are the most important parts, especially in deep NNs. Therefore, the main equation governing the behavior of the noise propagation in an ONN will remain (1).
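For illustration, the following is a minimal NumPy sketch of one noisy hidden layer according to (1) (our illustration, not code from the paper; the diagonal noise levels sigma_w and sigma_a are placeholder values):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_layer(x, W, b, f, sigma_w=0.1, sigma_a=0.1):
    """One ONN hidden layer per (1): AWGN after weighted addition and after activation."""
    pre = W @ x + b + rng.normal(0.0, sigma_w, size=W.shape[0])  # weighted addition + N(0, Sigma_w)
    return f(pre) + rng.normal(0.0, sigma_a, size=W.shape[0])    # activation + N(0, Sigma_a)

# Example: a single 4 -> 3 layer with tanh activation.
W = rng.normal(size=(3, 4)); b = np.zeros(3); x = rng.normal(size=4)
print(noisy_layer(x, W, b, np.tanh))
```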
1.3. Noise-resistant designs for ONNs
The main contribution of this paper lies in analyzing two
noise reduction mechanisms for feed-forward ONNs. The
mechanisms are derived from the insight that noise can be
mitigated through averaging because of the law of large num-
bers, and they are aimed at using the enormous bandwidth
that photonics offers. The first design (Design A) and its analysis are inspired by recent advancements for NNs with random edges in [33]; the second design (Design B) is new and simpler to implement, but comes without a theoretical guarantee of correctness for nonlinear ONNs specifically.
Both designs—illustrated in Figure 2—are built from a
given NN for which an optical implementation is desired.
Each design proposes a larger ONN by taking parts of the
original NN, and duplicating and arranging them in a certain
way. If noise is absent, then this larger ONN produces the
same output as the original NN; and, if noise is present, then
this ONN produces an output closer to the desired NN than
the direct implementation of the NN as an ONN without
modifications would give.
The first mechanism to construct a larger ONN that suppresses the inherent noise of analog systems starts with a certain number $N$ of copies of the input data. The copies are all processed independently by (in parallel arranged copies of) the layers. Each copy of a layer takes in multiple input copies to produce the result of weighted addition, to which the activation mechanism is applied. The copies that are transmitted to each layer (or set of parallel arrayed layers) are independent of each other. The independent outputs function as inputs to the upcoming (copies of the) layers, and so on and so forth.
The idea of the second design is to use multiple copies
of the input, on which the weighted addition is performed.
The noisy products of the weighted addition are averaged to
a single number/light beam. This average is then copied and
the multiple copies are fed through the activation function,
creating multiple noisy activations to be used as the next
layer’s input, and so on.
1.4. Summary of results
Using Design A, we are able to establish that ONNs possess the same theoretical properties as NNs. Specifically, we can prove that any NN can be approximated arbitrarily well by an ONN built using Design A (Theorem 1). Similar considerations for NNs with random edges can be found in [33], but the noise model and proof method are different. Here, we first bound the deviation between an ONN and a noiseless NN. To this bound, Hoeffding's inequality is then applied.
Establishing this theoretical guarantee, however, is done by increasing the number of components exponentially as the depth of the network increases. The current proof shows that for an ONN with Design A meant to approximate a NN with $L$ layers arbitrarily well (and thus reduce the noise to negligible levels), a sufficient number of components is $\omega\big(K^{L(L+1)}L^{L}\big)$ for some constant $K > 0$. This is however not to say that such a large number is necessary: it is merely sufficient.
From a practical viewpoint, however, having to use as
few components as possible would be more attractive. We
therefore also investigate Design B, in which the number
of components increases only linearly with the depth of the
network. Because Design A already allows us to establish
the approximation property of ONNs, we limit our analysis of
Design B to linear NNs for simplicity. We specifically establish in Theorem 2, for any linear NN, the exact output distribution of an ONN built using Design B. Similar to the guarantee for Design A in Theorem 1, but more restrictively, this implies that any linear NN can be approximated arbitrarily well by some ONN built using Design B. Strictly speaking, Design B has no guarantee of correctness for nonlinear NNs, but this should not hold us back in practice (especially when activations, for instance, are close to linear).
We conduct numerical experiments with Designs A and
B by constructing LeNet ONNs. The numerical results
indicate that in practice, adding some components for noise
negation is already sufficient to increase the accuracy of an
(a) The original NN.
(b) Design A.
(c) Design B.
Figure 2: (a) Base 4–3–2 network; light circles indicate activations, boxes indicate post-activations. (b) Example for Design A with 2 layers as input copies to each subsequent layer. The light circles indicate the linear operations/matrix–vector products. The results of the linear operation are averaged (single solid-blue circle) and fed through the activation function, producing the multiple versions of the layer's output (boxes). (c) Example of Design B.
ONN; an exponential number does not appear to be necessary (see Figures 3 to 4).
Finally, we want to remark that the high bandwidth of
photonic circuits can be exploited to implement the designs
as efficiently as possible.
1.5. Outline of the paper
We introduce the AWGN model formally in Section 2.
This model is the basis for the analysis of the proposed noise
reduction schemes that are next discussed in Sections 3 and 4. There, we specifically define Designs A and B, and each design is followed by a mathematical analysis. The main results are Theorems 1 and 2. Section 5 contains numerical simulations on LeNet ONNs to which we apply Designs A and B. Section 6
concludes; technical details are deferred to the Appendix.
2. Model
We consider general feed-forward NNs implemented on
analog optical devices. Noise occurs due to various reasons
in those optical devices. Reasons include quantum noise in modulation, chip imperfections, and crosstalk [26, 28, 27, 29, 30].
The noise profiles and levels of different devices differ,
but we can, to good approximation, expect AWGN to occur
at three separate instances [31]: when modulating, when
weighting, and when applying an activation function. The
thus proposed AWGN model is formalized next in Section 2.1.
2.1. Feed-forward nonlinear ONNs
We assume that our aim is to implement a feed-forward nonlinear NN with domain $\mathbb{R}^{d_0}$ and range $\mathbb{R}^{d_L}$, that can be represented by a parameterized function $\Psi^{\mathrm{NN}} : \mathbb{R}^{d_0} \times \mathbb{R}^{n} \to \mathbb{R}^{d_L}$ as follows. For $\ell = 1, \ldots, L \in \mathbb{N}_+$, $\Psi^{\mathrm{NN}}$ must be the composition of the functions
$$\Psi^{\mathrm{NN}}_{\ell} : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}, \quad x \mapsto \sigma^{(\ell)}\big(W^{(\ell)}x + b^{(\ell)}\big).$$
Here, $W^{(\ell)} \in \mathbb{R}^{d_\ell\times d_{\ell-1}}$ denotes the weight matrix in the $\ell$-th layer, $b^{(\ell)} \in \mathbb{R}^{d_\ell\times 1}$ the bias vector in the $\ell$-th layer, and $\sigma^{(\ell)} : \mathbb{R}^{d_\ell\times 1} \to \mathbb{R}^{d_\ell\times 1}$ the activation function in the $\ell$-th layer. Specifically, the NN satisfies
$$\Psi^{\mathrm{NN}}(\cdot, w) = \Psi^{\mathrm{NN}}_{L}(\cdot, w^{(L)}) \circ \cdots \circ \Psi^{\mathrm{NN}}_{1}(\cdot, w^{(1)}), \qquad (2)$$
where $w^{(\ell)} = (W^{(\ell)}, b^{(\ell)})$ represents the parameters in the $\ell$-th layer. Note that we do not necessarily assume that the activation function is applied component-wise (it could be any high-dimensional function). Such cases are simply contained within the model.
Suppose now that the NN in (2) is implemented as an ONN, but without amending its design. AWGN will then disrupt the output of each layer. Specifically, for depths $L \in \mathbb{N}_+$, the ONN will be representable by a function $\Psi^{\mathrm{ONN}}$ that is the composition of the noisy functions
$$\Psi^{\mathrm{ONN}}_{\ell} : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}, \quad x \mapsto \sigma^{(\ell)}\big(W^{(\ell)}x + b^{(\ell)} + N^{(\ell)}_{w}\big) + N^{(\ell)}_{a} \qquad (3)$$
for $\ell = 1, \ldots, L \in \mathbb{N}_+$. Here,
$$N^{(\ell)}_{w} \stackrel{(d)}{=} \mathrm{Normal}\big(0, \Sigma^{(\ell)}_{w}\big) \quad \text{and} \quad N^{(\ell)}_{a} \stackrel{(d)}{=} \mathrm{Normal}\big(0, \Sigma^{(\ell)}_{\mathrm{act}}\big)$$
denote multivariate normal distributions that describe the AWGN within the ONN. In other words, the ONN will satisfy
$$\Psi^{\mathrm{ONN}}(\cdot, w) = \Psi^{\mathrm{ONN}}_{L}(\cdot, w^{(L)}) \circ \cdots \circ \Psi^{\mathrm{ONN}}_{1}(\cdot, w^{(1)}) \qquad (4)$$
instead of (2). Observe that (4) is a random NN; its outcome is uncertain, but hopefully close to that of (2).
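As an illustration of (2)–(4), the following NumPy sketch (ours, with made-up layer sizes and noise levels) composes noisy layers into $\Psi^{\mathrm{ONN}}$ and compares one realization against the noiseless $\Psi^{\mathrm{NN}}$:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]                              # d_0, ..., d_L (illustrative)
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) / np.sqrt(sizes[l]) for l in range(3)]
bs = [np.zeros(sizes[l + 1]) for l in range(3)]
sigma_w, sigma_a, sigma_m = 0.05, 0.05, 0.05      # assumed noise levels

def psi_nn(x):
    for W, b in zip(Ws, bs):
        x = np.tanh(W @ x + b)                    # noiseless layer, eq. (2)
    return x

def psi_onn(x):
    x = x + rng.normal(0, sigma_m, x.shape)       # modulation noise in the first layer
    for W, b in zip(Ws, bs):
        x = np.tanh(W @ x + b + rng.normal(0, sigma_w, W.shape[0]))  # eq. (3)
        x = x + rng.normal(0, sigma_a, W.shape[0])
    return x

x = rng.normal(size=4)
print(np.linalg.norm(psi_onn(x) - psi_nn(x)))     # one realization of the deviation
```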
2.2. Feed-forward linear ONNs
Let us briefly examine the special case of a feed-forward linear ONN in more detail. That is, we now assume additionally that for $\ell = 1, \ldots, L$, there exist $e^{(\ell)} \in \mathbb{R}^{d_\ell}$ such that $\sigma^{(\ell)}(y) = D^{(\ell)}y$, where $D^{(\ell)} = \mathrm{diag}(e^{(\ell)})$. In other words, each activation function $\sigma^{(\ell)}$ does element-wise multiplications by constants.
If each activation function is linear, then the output distribution of each layer will remain multivariate normally distributed due to the so-called linear transformation theorem [34, Theorem 1.2.6]. The mean and covariance matrix of the underlying multivariate normal distribution will however be transformed in each layer.
Let us illustrate how the covariance matrix transforms by discussing the first layer in detail. Each layer in (3) can be interpreted as a random function that takes the noisy vector $A^{(\ell-1)} = (A^{(\ell-1)}_1, \ldots, A^{(\ell-1)}_{d_{\ell-1}})$, say, as input, and produces the even noisier vector $A^{(\ell)} = (A^{(\ell)}_1, \ldots, A^{(\ell)}_{d_\ell})$, say, as output. Specifically, the noisy input to the first layer is modeled by
$$A^{(0)} \mid x \stackrel{(d)}{=} x + N(0, \Sigma_m) \qquad (5)$$
because of the modulation error within the first layer. Here $\cdot\mid\cdot$ indicates a conditional random variable. This input next experiences weighted addition and more noise is introduced: the noisy preactivation of the first layer satisfies
$$U^{(1)} \mid A^{(0)} \stackrel{(d)}{=} W^{(1)}A^{(0)} + b^{(1)} + N\big(0, \Sigma^{(1)}_{w}\big). \qquad (6)$$
Combining (5) and (6) with the linear transformation theorem for the multivariate normal distribution, as well as the fact that sums of independent multivariate normal random variables are again multivariate normally distributed [34, Theorem 1.2.14], we find that
$$U^{(1)} \mid x \stackrel{(d)}{=} W^{(1)}x + b^{(1)} + W^{(1)}N(0, \Sigma_m) + N\big(0, \Sigma^{(1)}_{w}\big) \stackrel{(d)}{=} W^{(1)}x + b^{(1)} + N\big(0, W^{(1)}\Sigma_m(W^{(1)})^{\top} + \Sigma^{(1)}_{w}\big).$$
After applying the linear activation function, we obtain
$$A^{(1)} \mid x \stackrel{(d)}{=} \sigma^{(1)}\big(U^{(1)}\big) + N\big(0, \Sigma^{(1)}_{a}\big) \mid x \stackrel{(d)}{=} D^{(1)}\big(W^{(1)}x + b^{(1)}\big) + N\Big(0, \Sigma^{(1)}_{a} + D^{(1)}\big(W^{(1)}\Sigma_m(W^{(1)})^{\top} + \Sigma^{(1)}_{w}\big)(D^{(1)})^{\top}\Big) = N\big(\Psi^{\mathrm{NN}}_{1}(x, w), \Sigma^{(1)}_{\mathrm{ONN}}\big),$$
say. Observe that the unperturbed network's output remains intact, and is accompanied by a centered normal distribution with an increasingly involved covariance matrix:
$$\Sigma^{(1)}_{\mathrm{ONN}} = D^{(1)}\big(W^{(1)}\Sigma_m(W^{(1)})^{\top} + \Sigma^{(1)}_{w}\big)(D^{(1)})^{\top} + \Sigma^{(1)}_{a} = D^{(1)}W^{(1)}\Sigma_m(W^{(1)})^{\top}(D^{(1)})^{\top} + D^{(1)}\Sigma^{(1)}_{w}(D^{(1)})^{\top} + \Sigma^{(1)}_{a}. \qquad (7)$$
Observe furthermore that the covariance matrix in (7) is independent of the bias $b^{(1)}$.
The calculations in eqs. (5) to (7) can readily be extended into a recursive proof that establishes the covariance matrix of the entire linear ONN. Specifically, for $\ell = 1, \ldots, L$, define the maps
$$T^{(\ell)}(\Sigma) := D^{(\ell)}W^{(\ell)}\Sigma(W^{(\ell)})^{\top}(D^{(\ell)})^{\top} + D^{(\ell)}\Sigma^{(\ell)}_{w}(D^{(\ell)})^{\top} + \Sigma^{(\ell)}_{a}. \qquad (8)$$
We then have the following:
Proposition 1 (Distribution of linear ONNs). Assume that there exist vectors $e^{(\ell)} \in \mathbb{R}^{d_\ell}$ such that $\sigma^{(\ell)}(y) = \mathrm{diag}(e^{(\ell)})\,y$. The feed-forward linear ONN in (4) then satisfies
$$\Psi^{\mathrm{ONN}}(\cdot, w) \stackrel{(d)}{=} N\big(\Psi^{\mathrm{NN}}(\cdot, w), \Sigma^{(L)}_{\mathrm{ONN}}\big),$$
where for $\ell = L, L-1, \ldots, 1$,
$$\Sigma^{(\ell)}_{\mathrm{ONN}} = T^{(\ell)}\big(\Sigma^{(\ell-1)}_{\mathrm{ONN}}\big); \quad \text{and} \quad \Sigma^{(0)}_{\mathrm{ONN}} = \Sigma_m.$$
In linear ONNs with symmetric noise (that is, the AWGN of each layer's noise sources has the same covariance matrix), Proposition 1's recursion simplifies. Introduce $P^{(\ell)} := \prod_{i=\ell+1}^{L}D^{(i)}W^{(i)}$ for notational convenience. The following is proved in Appendix A.1.1:
Corollary 1 (Symmetric noise case). Within the setting of Proposition 1, assume additionally that for all $\ell \in \mathbb{N}_+$, $\Sigma^{(\ell)}_{a} = \Sigma_a$ and $\Sigma^{(\ell)}_{w} = \Sigma_w$. Then,
$$\Sigma^{(L)}_{\mathrm{ONN}} = P^{(0)}\Sigma_m(P^{(0)})^{\top} + \sum_{\ell=1}^{L}P^{(\ell)}\Sigma_a(P^{(\ell)})^{\top} + \sum_{\ell=1}^{L}P^{(\ell)}D^{(\ell)}\Sigma_w(D^{(\ell)})^{\top}(P^{(\ell)})^{\top}.$$
If moreover for all $\ell \in \mathbb{N}_+$, $W^{(\ell)} = W$, $D^{(\ell)} = D$, and $\|D\|_F\|W\|_F < 1$, then
$$\lim_{L\to\infty}\Sigma^{(L)}_{\mathrm{ONN}} = \sum_{n=0}^{\infty}(DW)^n\big(D\Sigma_wD^{\top} + \Sigma_a\big)\big((DW)^n\big)^{\top}.$$
Proposition 1 and Corollary 1 describe the output distribution of linear ONNs completely.
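The recursion of Proposition 1 is straightforward to evaluate numerically. The NumPy sketch below (ours, with illustrative matrices) iterates $T^{(\ell)}$ and checks the result against the series of Corollary 1 in the symmetric case:

```python
import numpy as np

rng = np.random.default_rng(2)
d, L = 4, 6
W = 0.4 * rng.normal(size=(d, d))      # symmetric case: same W, D, Sigma_w, Sigma_a in every layer
D = np.diag(rng.uniform(0.5, 1.0, d))
Sigma_m = Sigma_w = Sigma_a = 0.01 * np.eye(d)

def T(Sigma):                          # the map (8)
    return D @ W @ Sigma @ W.T @ D.T + D @ Sigma_w @ D.T + Sigma_a

Sigma = Sigma_m                        # Proposition 1: Sigma^(0)_ONN = Sigma_m
for _ in range(L):
    Sigma = T(Sigma)

# Corollary 1 in the symmetric case, with P^(l) = (DW)^(L-l).
P = lambda l: np.linalg.matrix_power(D @ W, L - l)
series = P(0) @ Sigma_m @ P(0).T
series += sum(P(l) @ Sigma_a @ P(l).T for l in range(1, L + 1))
series += sum(P(l) @ D @ Sigma_w @ D.T @ P(l).T for l in range(1, L + 1))
print(np.allclose(Sigma, series))      # True: recursion and series agree
```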
2.3. Discussion
One way to think of the AWGN model in Section 2.1 is to
take a step back from the microscopic analysis of individual
devices, and consider an ONN as a series of black box devices
(recall also Figure 1). Each black box device performs its designated task and acts as a communication channel with AWGN. This way of modeling in order to analyze the impact
of noise can also be seen in [31]; other papers modeling optical channels include [28, 27]. Further papers considering noise in optical systems with similar noise assumptions are [35, 36], where furthermore multiplicative noise is considered when an amplifier is present within the circuit [35]. Qualitatively, the results for Design A also apply to multiplicative noise; the scaling, however, may differ.
Limitations of the model. We note firstly that modeling the
noise in ONNs as AWGN is warranted only in an operating
regime with many photons, and is thus unlikely to be a good
model for ONNs that operate in a regime with just a few
photons.
Secondly, due to physical device features and operation
conditions, weights, activations, and outputs can only be
realized in ONNs if their values lie in certain ranges. Such
constraints are no part of the model in Section 2. Fortu-
nately, however, the implied range restrictions are usually not
a problem in practice. For example, activation functions like
sigmoid and
tanh
map into [0
,
1] and [
1
,
1], respectively.
Additional regularization rules like weight decay also move
the entries of weight matrices in NNs towards smaller values.
In case physical constraints were met one can increase the
weight decay parameter to further penalize large weights dur-
ing training, leading to smaller weights so that the ONN is
again applicable.
3. Results—Design A
3.1. Reducing the noise in feed-forward ONNs (Design A)
Recall that an example of Design A is presented in Fig-
ure 2(b). Algorithm 1 constructs this tree-like network, given the desired number of copies $n_0, \ldots, n_L$ per layer.
Observe that in Design A, the numbers of copies utilized in each layer, the $n_\ell$, are fixed. There is however only a single copy in the last layer. Its output is the unique output of the ONN. Each other layer receives multiple independent inputs. With each of the independent copies weighted addition is performed, and the results are averaged to produce the layer's single output. Having independent incoming copies is achieved by having multiple independent branches of the prior partial networks incoming into a given layer. This means that the single layer $L$ receives $n_{L-1}$ independent inputs of $n_{L-1}$ independent layers $L-1$. Each of the $n_{L-1}$ copies of layer $L-1$ receives $n_{L-2}$ inputs from independent copies of layer $L-2$. Generally, let $n_{\ell-1}$ be the number of copies of layer $\ell-1$ that act as inputs to layer $\ell$.
Observe that all copies are created upfront. That means there are $\prod_{\ell=0}^{L-1}n_\ell$ copies of the data. By Algorithm 1,
Algorithm 1 Algorithm to construct a noise reducing network
Require: Input $n = (n_\ell)_{\ell=0,\ldots,L}$
Require: $\prod_{i=0}^{L}n_i$ copies of the input $x^{(0)}$, named ${}^{1}x^{(0)}, \ldots, {}^{\prod_{i=0}^{L}n_i}x^{(0)}$
for $\ell = 0, \ldots, L-1$ do
    for $\alpha = 1, \ldots, \prod_{i=\ell}^{L-1}n_i$ do
        ${}^{\alpha}\xi^{(\ell)} \leftarrow W^{(\ell+1)}\,{}^{\alpha}x^{(\ell)} + b^{(\ell+1)} + \mathrm{Normal}(0, \Sigma_w)$
    end for
    for $\alpha = 0, \ldots, \prod_{i=\ell+1}^{L-1}n_i - 1$ do
        ${}^{\alpha}y^{(\ell)} \leftarrow$ averaging: $n_\ell^{-1}\big({}^{\alpha\cdot n_\ell+1}\xi^{(\ell)} + \cdots + {}^{\alpha\cdot n_\ell+n_\ell}\xi^{(\ell)}\big)$
        ${}^{\alpha}x^{(\ell+1)} \leftarrow \sigma^{(\ell+1)}\big({}^{\alpha}y^{(\ell)}\big) + \mathrm{Normal}(0, \Sigma_a)$
    end for
end for
return ${}^{1}x^{(L)}$
$\prod_{\ell=1}^{L-1}n_\ell$ copies of the first layer are arrayed in parallel to each other, and each of them processes $n_0$ copies of the data. The outputs of the $\prod_{\ell=1}^{L-1}n_\ell$ arrayed copies of the first layer are the input to the $\prod_{\ell=2}^{L-1}n_\ell$ arrayed copies of the second layer, and so on.
Notice that noise stemming from applying the activation
function is subject to a linear transformation in the next layer.
The activation function noise can therefore be considered as
weight-noise by inserting an identity layer with
σ
=
id
,
W
=
I
and b= 0.
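To make Algorithm 1 concrete, below is a recursive NumPy sketch of a Design A forward pass (our illustration; layer sizes, noise levels, and copy counts are made up). Each call for layer $\ell$ draws $n_{\ell-1}$ independent copies of the previous layer, averages their noisy weighted additions, and applies the noisy activation, mirroring the tree structure in Figure 2(b):

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [4, 8, 8, 3]                  # d_0, ..., d_L (illustrative)
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) / np.sqrt(sizes[l]) for l in range(3)]
bs = [np.zeros(sizes[l + 1]) for l in range(3)]
sigma_m = sigma_w = sigma_a = 0.05    # assumed noise levels
n = [3, 3, 3]                         # n_0, ..., n_{L-1}; n_L = 1 (single output copy)

def design_a(x, layer):
    """Output of one copy of layer `layer` (1-indexed), built from n[layer-1] independent subtrees."""
    W, b = Ws[layer - 1], bs[layer - 1]
    pre = np.zeros(W.shape[0])
    for _ in range(n[layer - 1]):
        # each incoming copy is an independent realization of the previous partial network
        inp = (x + rng.normal(0, sigma_m, x.shape) if layer == 1
               else design_a(x, layer - 1))
        pre += W @ inp + b + rng.normal(0, sigma_w, W.shape[0])   # noisy weighted addition
    pre /= n[layer - 1]                                           # averaging step
    return np.tanh(pre) + rng.normal(0, sigma_a, W.shape[0])      # noisy activation

x = rng.normal(size=4)
print(design_a(x, layer=len(Ws)))     # the single copy of the last layer is the ONN output
```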
We want to verify that a Design A ONN yields outputs that are with high probability close to those of the original noiseless NN. Let $\tilde{\Psi}^{\mathrm{ONN}}(x, w)$ denote the Design A ONN, and let
$$\mathbb{P}\Big[\sup_{x\in\mathbb{R}^d}\big\|\Psi^{\mathrm{NN}}(x, w) - \tilde{\Psi}^{\mathrm{ONN}}(x, w)\big\|_2 < D_L\Big] > 1 - C_L \qquad (9)$$
be the desired property. The main result of this section is the following:
Theorem 1. For any $C_L \in (0,1)$, any $D_L \in (0,\infty)$, and any nonlinear NN $\Psi^{\mathrm{NN}}$ with Lipschitz-continuous activation functions with Lipschitz constants $a^{(i)}$ and weight matrices $W^{(i)}$, Algorithm 1 is able to construct an ONN $\tilde{\Psi}^{\mathrm{ONN}}$ that satisfies (9).
Let the covariance matrices of the occurring AWGN be diagonal matrices and let each of the values of the covariance matrices be upper bounded by $\sigma^2 \geq 0$. For any set of $(\kappa_i)_{i=1,\ldots,L}$, $(\delta_i)_{i=1,\ldots,L}$ such that $\prod_\ell(1-\kappa_\ell) > 1 - C_L$ and $\sum_\ell\delta_\ell \leq D_L$, a sufficient number of copies to construct an ONN $\tilde{\Psi}^{\mathrm{ONN}}$ that satisfies (9) is given by $n_L = 1$ and
$$n_\ell \geq \frac{\sigma^2\Big(\prod_{i=\ell+1}^{L}a^{(i)}\Big)^2\Big(\prod_{i=\ell+2}^{L}\|W^{(i)}\|_{\mathrm{op}}\Big)^2}{\delta_{\ell+1}^2}\times\Bigg(\sqrt{2}\,\frac{\Gamma\big((d_{\ell+1}+1)/2\big)}{\Gamma\big(d_{\ell+1}/2\big)} + \sqrt{\frac{C^2}{c}\,\frac{4\sqrt[d_{\ell+1}]{4}}{2\sqrt[d_{\ell+1}]{4}-2}\,\frac{-\ln(\kappa_{\ell+1}/2)}{\prod_{i=\ell+1}^{L}n_i}}\Bigg)^2, \qquad \ell = L-1, \ldots, 0.$$
Here $\Gamma$ is the gamma function and $C, c > 0$ are absolute constants.
This result is proven in Section 3.3. A consideration on the
asymptotic total amount of copies in deep ONNs is relegated
to Appendix A.2.1.
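The bound in Theorem 1 can be evaluated backwards from $\ell = L-1$ down to $0$, since $n_\ell$ only depends on the already-chosen $n_{\ell+1}, \ldots, n_L$. The sketch below (ours) does this for illustrative Lipschitz constants, operator norms, and noise level; the absolute constants $C$ and $c$ from Hoeffding's inequality are not specified in the paper, so they are set to 1 here purely as placeholders.

```python
import math

# Illustrative problem data (assumptions, not values from the paper).
L = 3
a = [1.0] * (L + 1)            # a[i] = Lipschitz constant of sigma^(i), index 1..L
W_op = [1.2] * (L + 1)         # W_op[i] = operator norm of W^(i), index 1..L
d = [784, 128, 64, 10]         # d[i] = width of layer i
sigma2 = 0.01                  # common upper bound on the AWGN variances
C, c = 1.0, 1.0                # absolute constants from Theorem 1 (placeholders)
kappa = [0.01 / L] * (L + 1)   # per-layer failure probabilities, index 1..L
delta = [0.1 / L] * (L + 1)    # per-layer deviation budgets, index 1..L

def mu(dim):                   # mean of a chi distribution, eq. (17), via log-gamma for stability
    return math.sqrt(2) * math.exp(math.lgamma((dim + 1) / 2) - math.lgamma(dim / 2))

n = [0] * (L + 1)
n[L] = 1
for l in range(L - 1, -1, -1):                       # l = L-1, ..., 0
    prod_a = math.prod(a[l + 1:L + 1])
    prod_W = math.prod(W_op[l + 2:L + 1])
    m = math.prod(n[l + 1:L + 1])                    # number of copies "above" layer l
    root = 4 * 4 ** (1 / d[l + 1]) / (2 * 4 ** (1 / d[l + 1]) - 2)
    tail = math.sqrt(C ** 2 / c * root * (-math.log(kappa[l + 1] / 2)) / m)
    n[l] = math.ceil(sigma2 * (prod_a * prod_W) ** 2 / delta[l + 1] ** 2
                     * (mu(d[l + 1]) + tail) ** 2)
print(n)   # sufficient copies n_0, ..., n_L per Theorem 1 (for these placeholder inputs)
```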
3.2. Idea behind Design A
Having the law of large numbers in mind, it seems reasonable that the average of multiple experiments would help in achieving a more precise output in the presence of noise. However, it would typically not be correct to just input $n$ identical, deterministic copies of $x$ into $n$ independent ONNs, thus producing $n$ noisy realizations $\Psi^{\mathrm{ONN},1}(x,w), \ldots, \Psi^{\mathrm{ONN},n}(x,w)$, say, and then calculate their average in the hope of recovering $\Psi^{\mathrm{NN}}(x,w)$. This is because while by the law of large numbers it is true that
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\Psi^{\mathrm{ONN},i}(x,w) = \mathbb{E}\,\Psi^{\mathrm{ONN}}(x,w),$$
it is not necessarily true that the expectation $\mathbb{E}\,\Psi^{\mathrm{ONN}}(x,w)$ equals $\Psi^{\mathrm{NN}}(x,w)$. The reason is that activation functions in NNs are typically nonlinear.
We can circumvent the issue by modifying the approach and instead exploit the law of large numbers layer-wise. Recall that in the noiseless NN, layer $\ell$ maps a fixed input
$$x \mapsto \sigma^{(\ell)}\big(W^{(\ell)}x + b^{(\ell)}\big), \qquad (10)$$
and that the same layer in the ONN maps the same fixed input
$$x \mapsto \sigma^{(\ell)}\big(W^{(\ell)}x + b^{(\ell)} + N^{(\ell)}_{w}\big)$$
instead. If we let $(N^{(i)})_{i\in\{1,\ldots,n\}}$ be independent realizations of the distribution of $N^{(\ell)}_{w}$ (which has mean zero), we can expect by the law of large numbers that for sufficiently large $n$, the realized quantities
$$\frac{1}{n}\sum_{i=1}^{n}\big(W^{(\ell)}x + b^{(\ell)} + N^{(i)}\big) \quad \text{and} \quad W^{(\ell)}x + b^{(\ell)}$$
are close to each other. If $\sigma^{(\ell)}$ is moreover sufficiently regular, then we may expect that the realized quantity
$$\sigma^{(\ell)}\Big(\frac{1}{n}\sum_{i=1}^{n}\big(W^{(\ell)}x + b^{(\ell)} + N^{(i)}\big)\Big) \qquad (11)$$
is close to (10) for sufficiently large $n$, i.e., close to the unperturbed output of the original layer.
The implementation in (11) can be realized by using $n$ times as many nodes in the hidden layer; thus to essentially create $n$ copies of the original hidden layer. These independent copies are then averaged. Furthermore, one can allow for different inputs $(x_i)_{i\in\{1,\ldots,n\}}$, assuming some statistical properties of their distribution. This will be formalized next in the proof of Theorem 1 in Section 3.3.
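The distinction drawn above can be checked numerically: averaging noisy post-activations is biased for a nonlinear $\sigma$, whereas averaging the noisy pre-activations before applying $\sigma$ is not. A small NumPy sketch (ours, with an arbitrary scalar example):

```python
import numpy as np

rng = np.random.default_rng(4)
z = 1.0                     # a fixed pre-activation W x + b (scalar for simplicity)
sigma = 1.0                 # noise standard deviation
n = 200_000                 # number of copies

noise = rng.normal(0, sigma, n)
naive = np.tanh(z + noise).mean()        # average of noisy activations: E[tanh(z + N)] != tanh(z)
layerwise = np.tanh((z + noise).mean())  # tanh of the averaged noisy pre-activation, cf. (11)

print(f"target tanh(z)       = {np.tanh(z):.4f}")
print(f"naive averaging      = {naive:.4f}")      # visibly biased
print(f"layer-wise averaging = {layerwise:.4f}")  # close to the target
```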
3.3. Proof of Theorem 1
For the proof we will first upper bound the deviation
between an ONN constructed with Design A and the noiseless
NN (Section 3.3.1), and then we find a probabilistic bound on the deviation bound (Section 3.3.2).
3.3.1. An upper bound for the ONN-NN deviation
The output of the Design A network is
$$\tilde{x} = \sigma^{(L)}\Big(\frac{1}{n_{L-1}}\sum_{i=1}^{n_{L-1}}\big(W^{(L)}\tilde{x}_i + b^{(L)} + N^{(i)}\big)\Big), \qquad (12)$$
where each $\tilde{x}_i$ is recursively calculated as
$$\tilde{x}_i = \sigma^{(L-1)}\Big(\frac{1}{n_{L-2}}\sum_{j_i=1}^{n_{L-2}}\big(W^{(L-1)}\tilde{x}_{j_i} + b^{(L-1)} + N^{(j_i)}\big)\Big),$$
the $\tilde{x}_{j_i}$ are calculated as
$$\tilde{x}_{j_i} = \sigma^{(L-2)}\Big(\frac{1}{n_{L-3}}\sum_{k_{j_i}=1}^{n_{L-3}}\big(W^{(L-2)}\tilde{x}_{k_{j_i}} + b^{(L-2)} + N^{(k_{j_i})}\big)\Big),$$
and so on and so forth. The difference in $L_2$-norm of (12) and the noiseless NN
$$\sigma^{(L)}\big(W^{(L)}\sigma^{(L-1)}\big(W^{(L-1)}\ldots + b^{(L-1)}\big) + b^{(L)}\big)$$
can iteratively be bounded by using the Lipschitz property of the activation functions, the triangle inequality, and the submultiplicativity of the norms.
We start the iteration by bounding
$$\Big\|\sigma^{(L)}\Big(\frac{1}{n_{L-1}}\sum_{i=1}^{n_{L-1}}\big(W^{(L)}\tilde{x}_i + b^{(L)} + N^{(i)}\big)\Big) - \sigma^{(L)}\big(W^{(L)}\sigma^{(L-1)}\big(W^{(L-1)}\ldots + b^{(L-1)}\big) + b^{(L)}\big)\Big\|_2$$
$$\leq a^{(L)}\Big\|\frac{1}{n_{L-1}}\sum_{i=1}^{n_{L-1}}\Big(W^{(L)}\big(\tilde{x}_i - \sigma^{(L-1)}\big(W^{(L-1)}\ldots + b^{(L-1)}\big)\big) + N^{(i)}\Big)\Big\|_2$$
$$\leq \frac{a^{(L)}\|W^{(L)}\|_{\mathrm{op}}}{n_{L-1}}\sum_{i=1}^{n_{L-1}}\big\|\tilde{x}_i - \sigma^{(L-1)}\big(W^{(L-1)}\ldots + b^{(L-1)}\big)\big\|_2 + a^{(L)}\Big\|\frac{1}{n_{L-1}}\sum_{i=1}^{n_{L-1}}N^{(i)}\Big\|_2.$$
In the next iteration step the term
$$\sum_{i=1}^{n_{L-1}}\big\|\tilde{x}_i - \sigma^{(L-1)}\big(W^{(L-1)}\ldots + b^{(L-1)}\big)\big\|_2$$
is further bounded by first using the triangle inequality and thereafter bounding in the same way as we did in the first layer:
$$\sum_{i=1}^{n_{L-1}}\Big\|\sigma^{(L-1)}\Big(\frac{1}{n_{L-2}}\sum_{j_i=1}^{n_{L-2}}\big(W^{(L-1)}\tilde{x}_{j_i} + b^{(L-1)} + N^{(j_i)}\big)\Big) - \sigma^{(L-1)}\big(W^{(L-1)}\sigma^{(L-2)}\big(W^{(L-2)}\ldots + b^{(L-2)}\big) + b^{(L-1)}\big)\Big\|_2$$
$$\leq \frac{a^{(L-1)}\|W^{(L-1)}\|_{\mathrm{op}}}{n_{L-2}}\sum_{i=1}^{n_{L-1}}\sum_{j_i=1}^{n_{L-2}}\big\|\tilde{x}_{j_i} - \sigma^{(L-2)}\big(W^{(L-2)}\ldots + b^{(L-2)}\big)\big\|_2 + a^{(L-1)}\sum_{i=1}^{n_{L-1}}\Big\|\frac{1}{n_{L-2}}\sum_{j_i=1}^{n_{L-2}}N^{(j_i)}\Big\|_2.$$
Here,
$$\sum_{i=1}^{n_{L-1}}\sum_{j_i=1}^{n_{L-2}}\big\|\tilde{x}_{j_i} - \sigma^{(L-2)}\big(W^{(L-2)}\ldots + b^{(L-2)}\big)\big\|_2$$
may again be bounded in the same fashion. This leads to the following recursive argument.
Let $F^{(\ell)}$ be the sum of the differences between, loosely speaking, the ends of the remaining Design A "subtrees" and the noiseless NN's "subtrees" at layer $\ell$. More specifically, let
$$F^{(L)} := \big\|\tilde{x} - \sigma^{(L)}\big(W^{(L)}\ldots + b^{(L)}\big)\big\|_2;$$
$$F^{(\ell)} := \sum_{i_L=1}^{n_{L-1}}\sum_{i_{L-1}L=1}^{n_{L-2}}\cdots\sum_{i_{\ell+2\ldots L-1L}=1}^{n_{\ell+1}}\sum_{i_{\ell+1\,\ell+2\ldots L-1L}=1}^{n_{\ell}}\big\|\tilde{x}_{i_{\ell+1\,\ell+2\ldots L-1L}} - \sigma^{(\ell)}\big(W^{(\ell)}\ldots + b^{(\ell)}\big)\big\|_2, \qquad \ell = 1,\ldots,L-1;$$
the special case of $F^{(0)}$ will be considered in detail later. For simplicity, we join the sums outside the norm into one. Notice that because $n_L = 1$, we have $\prod_{k=\ell+1}^{L-1}n_k = \prod_{k=\ell+1}^{L}n_k$, and we can write
$$F^{(\ell)} = \sum_{i=1}^{\prod_{k=\ell+1}^{L}n_k}\sum_{j_i=1}^{n_{\ell}}\big\|\tilde{x}_{j_i} - \sigma^{(\ell)}\big(W^{(\ell)}\ldots + b^{(\ell)}\big)\big\|_2,$$
where specifically
$$\tilde{x}_{j_i} = \sigma^{(\ell)}\Big(\frac{1}{n_{\ell-1}}\sum_{k_{j_i}=1}^{n_{\ell-1}}\big(W^{(\ell)}\tilde{x}_{k_{j_i}} + b^{(\ell)} + N^{(k_{j_i})}\big)\Big),$$
and the $j_i$ and $k_{j_i}$ are nothing more than relabelings.
Bounding $F^{(\ell)}$ using the triangle inequality, the Lipschitz property, and submultiplicativity yields
$$F^{(\ell)} \leq a^{(\ell)}\sum_{i=1}^{\prod_{k=\ell+1}^{L}n_k}\sum_{j_i=1}^{n_{\ell}}\Big\|\frac{1}{n_{\ell-1}}\sum_{k_{j_i}=1}^{n_{\ell-1}}\Big(N^{(k_{j_i})} + W^{(\ell)}\big(\tilde{x}_{k_{j_i}} - \sigma^{(\ell-1)}\big(W^{(\ell-1)}\ldots + b^{(\ell-1)}\big)\big)\Big)\Big\|_2$$
$$\leq \frac{a^{(\ell)}\|W^{(\ell)}\|_{\mathrm{op}}}{n_{\ell-1}}F^{(\ell-1)} + a^{(\ell)}\sum_{i=1}^{\prod_{k=\ell+1}^{L}n_k}\sum_{j_i=1}^{n_{\ell}}\Big\|\frac{1}{n_{\ell-1}}\sum_{k_{j_i}=1}^{n_{\ell-1}}N^{(k_{j_i})}\Big\|_2. \qquad (13)$$
We thus found a recursive formula for the bound.
The recursion ends at $F^{(0)}$. The noiseless NN receives $x$ as input, while the ONN receives the modulated input $x + N^{(j_i)}$, where $N^{(j_i)}$ is the modulation noise, i.e., AWGN. Therefore,
$$F^{(0)} = \sum_{i=1}^{\prod_{k=1}^{L}n_k}\sum_{j_i=1}^{n_0}\big\|\big(x + N^{(j_i)}\big) - x\big\|_2 = \sum_{i=1}^{\prod_{k=1}^{L}n_k}\sum_{j_i=1}^{n_0}\big\|N^{(j_i)}\big\|_2. \qquad (14)$$
Observe that the $x$-dependence disappeared.
Readily iterating (13) leads to the bound
$$F^{(L)} \leq \sum_{\ell=L,L-1,\ldots,1}\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}\times\frac{1}{\prod_{k=\ell}^{L}n_k}\,\frac{1}{n_{\ell-1}}\sum_{i=1}^{\prod_{k=\ell}^{L}n_k}\sum_{j_i=1}^{n_{\ell-1}}\big\|N^{(j_i)}\big\|_2.$$
Therefore, if all the $L_2$-norms of the sums of the Gaussians are small at the same time, the network is close to the noiseless NN. Let
$$S_\ell := \prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}\times\frac{1}{\prod_{k=\ell}^{L}n_k}\,\frac{1}{n_{\ell-1}}\sum_{i=1}^{\prod_{k=\ell}^{L}n_k}\sum_{j_i=1}^{n_{\ell-1}}\big\|N^{(j_i)}\big\|_2.$$
If for all $\ell$,
$$\mathbb{P}\big[S_\ell \leq \delta_\ell\big] > 1 - \kappa_\ell, \qquad (15)$$
and moreover $\sum_\ell\delta_\ell \leq D_L$ as well as $\prod_\ell(1-\kappa_\ell) > 1 - C_L$, then (9) holds. This can be seen by bounding
$$\mathbb{P}\Big[\sup_{x\in\mathbb{R}^d}\big\|\Psi^{\mathrm{NN}}(x,w) - \tilde{\Psi}^{\mathrm{ONN}}(x,w)\big\|_2 < D_L\Big] \geq \mathbb{P}\Big[\sum_\ell S_\ell < D_L\Big] \geq \mathbb{P}\Big[\bigcap_\ell\{S_\ell < \delta_\ell\}\Big] = \prod_\ell\mathbb{P}\big[S_\ell < \delta_\ell\big] > \prod_\ell(1-\kappa_\ell) > 1 - C_L.$$
Here, in the first inequality the dependence on $x$ disappears due to (14).
3.3.2. Bound for deviations
We next consider the $S_\ell$ for which we want to guarantee that
$$\mathbb{P}\big[S_\ell < \delta_\ell\big] > 1 - \kappa_\ell.$$
Let $m = \prod_{k=\ell}^{L}n_k$. By assumption the $N^{(j_i)}_k$ are independent and identically $\mathrm{Normal}(0,\sigma_k^2)$ distributed, where $\sigma_k^2 \leq \sigma^2$ for some common $\sigma^2$. We are lower bounding the number of copies required; therefore, using AWGN with higher variance only increases the lower bound, as the calculations below show. We calculate the bound exemplarily for $N^{(j_i)}$ distributed according to $\mathrm{Normal}(0,\sigma^2)$; re-substituting $\sigma_k^2$ below in (18) (which is the bound given in Theorem 1) thus covers the case of $N^{(j_i)}_k \stackrel{(d)}{=} \mathrm{Normal}(0,\sigma_k^2)$.
Each component of the vector
$$\sum_{j_i=1}^{n_{\ell-1}}N^{(j_i)} = \Big(\sum_{j_i=1}^{n_{\ell-1}}N^{(j_i)}_1, \ldots, \sum_{j_i=1}^{n_{\ell-1}}N^{(j_i)}_d\Big)$$
is assumed to be $\mathrm{Normal}(0, n_{\ell-1}\sigma^2) = \sqrt{n_{\ell-1}}\,\sigma\,\mathrm{Normal}(0,1)$ distributed. It then holds that
$$\sum_{i=1}^{m}\Big\|\sum_{j_i=1}^{n_{\ell-1}}N^{(j_i)}\Big\|_2 \stackrel{(d)}{=} \sum_{i=1}^{m}\sqrt{n_{\ell-1}}\,\sigma\,\big\|\mathrm{Normal}(0, I_d)\big\|_2.$$
This is a sum of independent chi-distributed random variables, which means they are sub-gaussian (see below that we can calculate the sub-gaussian norm and it is indeed finite). Thus Hoeffding's inequality applies, according to which, for $X_1,\ldots,X_N$ independent, mean-zero, sub-gaussian random variables and every $t \geq 0$,
$$\mathbb{P}\Big[\Big|\sum_{i=1}^{N}X_i\Big| < t\Big] > 1 - 2\exp\Big(-\frac{ct^2}{\sum_{i=1}^{N}\|X_i\|_{\psi_2}^2}\Big) \qquad (16)$$
holds; see e.g. [37, Theorem 2.6.2]. Here $c > 0$ is an absolute constant (see [37, Theorem 2.6.2]) and
$$\|X\|_{\psi_2} := \inf\big\{t > 0 : \mathbb{E}[\exp(X^2/t^2)] \leq 2\big\}.$$
To apply Hoeffding’s inequality in our setting, we need to cen-
ter the occurring random variables. For
N(i)Normal
(0
, Id
),
the term N(i)2is chi distributed with mean
µd=2Γ((d+ 1)/2)
Γ(d/2) ,(17)
where Γis the gamma function, see e.g. [38, p.238].
Consider
$$\mathbb{P}\Bigg[\frac{\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}}{m\sqrt{n_{\ell-1}}}\,\sigma\sum_{i=1}^{m}\big\|N^{(i)}\big\|_2 < \delta_\ell\Bigg]$$
$$= \mathbb{P}\Bigg[\frac{\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}}{m\sqrt{n_{\ell-1}}}\,\sigma\sum_{i=1}^{m}\Big(\big\|N^{(i)}\big\|_2 - \mu_d\Big) < \delta_\ell - \frac{\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}}{m\sqrt{n_{\ell-1}}}\,\sigma\,m\,\mu_d\Bigg],$$
which equals
$$\mathbb{P}\Bigg[\sum_{i=1}^{m}\Big(\big\|N^{(i)}\big\|_2 - \mu_d\Big) < \frac{m\sqrt{n_{\ell-1}}\,\delta_\ell}{\sigma\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}} - m\,\mu_d\Bigg]$$
and is lower bounded (compare to (16)) by
$$1 - 2\exp\Bigg(-\frac{c\Big(\frac{m\sqrt{n_{\ell-1}}\,\delta_\ell}{\sigma\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}} - m\,\mu_d\Big)^2}{\sum_{i=1}^{m}\big\|\,\|N^{(i)}\|_2 - \mu_d\,\big\|_{\psi_2}^2}\Bigg),$$
which in turn is lower bounded by
$$1 - 2\exp\Bigg(-\frac{c\Big(\frac{m\sqrt{n_{\ell-1}}\,\delta_\ell}{\sigma\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}} - m\,\mu_d\Big)^2}{\sum_{i=1}^{m}C^2\big\|\,\|N^{(i)}\|_2\,\big\|_{\psi_2}^2}\Bigg),$$
where $C > 0$ is an absolute constant (see [37, Lemma 2.6.8]).
For a chi-distributed random variable $X$ it holds that
$$\mathbb{E}[\exp(X^2/t^2)] = M_{X^2}(1/t^2),$$
where $M_{X^2}(s)$ is the moment generating function of $X^2$, a chi-squared distributed random variable. It is known (see e.g. [39, Appendix 13]) that
$$M_{X^2}(s) = (1-2s)^{-d/2}$$
for $s < \tfrac{1}{2}$. Accordingly, for $2 < t^2$, the property in the definition of the sub-gaussian norm,
$$\mathbb{E}[\exp(X^2/t^2)] = \Big(1 - \frac{2}{t^2}\Big)^{-d/2} \leq 2,$$
is satisfied for all $t$ for which
$$t \geq \max\Bigg\{\sqrt{\frac{4\sqrt[d]{4}}{2\sqrt[d]{4}-2}},\,\sqrt{2}\Bigg\} = \sqrt{\frac{4\sqrt[d]{4}}{2\sqrt[d]{4}-2}}$$
holds. The square of the sub-gaussian norm of the chi-distributed random variables is thus
$$\big\|\,\|N^{(i)}\|_2\,\big\|_{\psi_2}^2 = \frac{4\sqrt[d]{4}}{2\sqrt[d]{4}-2}.$$
Substituting the norm into the lower bound yields
$$1 - 2\exp\Bigg(-\frac{c\Big(\frac{m\sqrt{n_{\ell-1}}\,\delta_\ell}{\sigma\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}} - m\,\mu_d\Big)^2}{C^2\,m\,\frac{4\sqrt[d]{4}}{2\sqrt[d]{4}-2}}\Bigg).$$
In order to achieve (15), a sufficient criterion is
$$\kappa_\ell \geq 2\exp\Bigg(-\frac{c\Big(\frac{\sqrt{m\,n_{\ell-1}}\,\delta_\ell}{\sigma\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}} - \sqrt{m}\,\mu_d\Big)^2}{C^2\,\frac{4\sqrt[d]{4}}{2\sqrt[d]{4}-2}}\Bigg).$$
Solving for $n_{\ell-1}$ leads to
$$n_{\ell-1} \geq \frac{\sigma^2\Big(\prod_{i=\ell}^{L}a^{(i)}\prod_{i=\ell+1}^{L}\|W^{(i)}\|_{\mathrm{op}}\Big)^2}{\delta_\ell^2}\Bigg(\sqrt{\frac{C^2\,\frac{4\sqrt[d]{4}}{2\sqrt[d]{4}-2}\,\big(-\ln(\kappa_\ell/2)\big)}{c\,m}} + \mu_d\Bigg)^2. \qquad (18)$$
If we substitute the expression in (17) for $\mu_d$, (18) becomes the bound as seen in Theorem 1.
3.4. Conclusion
Within the context of the model described in Section 2,
we have established that any feed-forward NN can be approxi-
mated arbitrarily well by ONNs constructed using Design A.
This is Theorem 1 in essence.
This result has two consequences when it comes to the
physical implementation of ONNs. On the one hand, it is
guaranteed that the theoretical expressiveness of NNs can
be retained in practice. On the other hand, Design A allows
one to improve the accuracy of a noisy ONN to a desired
level, and in fact bring the accuracy arbitrarily close to that of
any state-of-the-art feed-forward noiseless NN. Let us finally
remark that the high bandwidth of photonic circuits may be
of use when implementing Design A.
4. Results—Design B
4.1. Reducing noise in feed-forward linear ONNs (Design B)
Recall that an example of Design B is presented in Fig-
ure 2(c). Algorithm 2 constructs this network, given a desired number of copies $m$ in each layer.
Calculating the output of a NN by using Design B first requires fixing a number $m$. The input data $x^{(0)}$ is then modulated $m$ times, creating $m$ noisy realizations of the input $({}^{\alpha}x^{(0)})_{\alpha=1,\ldots,m}$. The weighted addition step and the activation function of each layer are singled out and copied $m$ times. Both the copies of the weighted addition step and of the activation function of each layer are arrayed parallel to each other and performed on the $m$ inputs, resulting in $m$ outputs. The $m$ parallel outputs of the weighted addition are
Algorithm 2 Algorithm to construct a noise reducing network
Require: Fix a number $m \in \mathbb{N}$
Require: $m$ copies of the input ${}^{1}x^{(0)}, \ldots, {}^{m}x^{(0)}$
for $\ell = 1, \ldots, L$ do
    for $\alpha = 1, \ldots, m$ do
        ${}^{\alpha}\xi^{(\ell)} \leftarrow W^{(\ell)}\,{}^{\alpha}x^{(\ell-1)} + b^{(\ell)} + \mathrm{Normal}(0, \Sigma_w)$
    end for
    $y^{(\ell)} \leftarrow$ combining: ${}^{1}\xi^{(\ell)} + \cdots + {}^{m}\xi^{(\ell)}$
    ${}^{1}y^{(\ell)}, \ldots, {}^{m}y^{(\ell)} \leftarrow$ splitting: $m^{-1}y^{(\ell)}$
    for $\alpha = 1, \ldots, m$ do
        ${}^{\alpha}x^{(\ell)} \leftarrow \sigma^{(\ell)}\big({}^{\alpha}y^{(\ell)}\big) + \mathrm{Normal}(0, \Sigma_{\mathrm{act}})$
    end for
end for
return $m^{-1}\sum_{\alpha=1}^{m}{}^{\alpha}x^{(L)}$
merged to a single output, and afterwards split into $m$ pieces. The $m$ pieces are each sent to one of the $m$ activation function mechanisms for processing. The resulting $m$ activation values are the output of the layer. If it is the last layer, the $m$ activation values are merged to produce the final output. These steps are formally described in Algorithm 2. A schematic representation of Design B can be seen in Figure 2(c).
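A minimal NumPy sketch of Algorithm 2 follows (ours; layer sizes, noise levels, and the combining/splitting noise levels sigma_sum and sigma_spl are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
sizes = [4, 8, 8, 3]
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) / np.sqrt(sizes[l]) for l in range(3)]
bs = [np.zeros(sizes[l + 1]) for l in range(3)]
m = 4                                         # number of copies per layer
sigma_m = sigma_w = sigma_act = 0.05          # assumed noise levels
sigma_sum = sigma_spl = 0.01                  # combining / splitting noise (assumed)

def design_b(x):
    # m noisy modulations of the input
    xs = [x + rng.normal(0, sigma_m, x.shape) for _ in range(m)]
    for W, b in zip(Ws, bs):
        # m parallel noisy weighted additions
        xis = [W @ xc + b + rng.normal(0, sigma_w, W.shape[0]) for xc in xs]
        y = sum(xis) + rng.normal(0, sigma_sum, W.shape[0])                    # combine into one beam
        ys = [y / m + rng.normal(0, sigma_spl, W.shape[0]) for _ in range(m)]  # split again
        # m parallel noisy activations become the next layer's input copies
        xs = [np.tanh(yc) + rng.normal(0, sigma_act, W.shape[0]) for yc in ys]
    return sum(xs) / m                        # merge the final m activation values

x = rng.normal(size=4)
print(design_b(x))
```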
4.2. Analysis of Design B
We now consider the physical and mathematical conse-
quences of Design B.
Observe that in Design B, the $m$ weighted additions of the $\ell$-th layer's input $x^{(\ell-1)}$ result in realizations $({}^{\alpha}\xi^{(\ell)})_{\alpha=1,\ldots,m}$ of $W^{(\ell)}x^{(\ell-1)} + b^{(\ell)} + \mathrm{Normal}(0,\Sigma_w)$. These realizations are then combined, resulting in
$$mW^{(\ell)}x^{(\ell-1)} + mb^{(\ell)} + \mathrm{Normal}(0, m\Sigma_w) + \mathrm{Normal}(0, \Sigma_{\mathrm{sum}}).$$
Splitting the signal again into $m$ parts, each signal carries information following the distribution
$$W^{(\ell)}x^{(\ell-1)} + b^{(\ell)} + \mathrm{Normal}(0, m^{-1}\Sigma_w + m^{-2}\Sigma_{\mathrm{sum}}) + \mathrm{Normal}(0, \Sigma_{\mathrm{spl}}).$$
The mean of the normal distribution therefore is the original network's pre-activation obtained from this input (that is, without perturbations). The covariance matrix of the normal distribution is $m^{-1}\Sigma_w + m^{-2}\Sigma_{\mathrm{sum}} + \Sigma_{\mathrm{spl}}$. Each of those signals is fed through the mechanism applying the activation function, yielding $m$ noisy versions of the output, distributed according to
$$x^{(\ell)} \mid x^{(\ell-1)} \stackrel{(d)}{=} \sigma^{(\ell)}\big(W^{(\ell)}x^{(\ell-1)} + b^{(\ell)} + \mathrm{Normal}(0, m^{-1}\Sigma_w + m^{-2}\Sigma_{\mathrm{sum}} + \Sigma_{\mathrm{spl}})\big) + \mathrm{Normal}(0, \Sigma_{\mathrm{act}}).$$
The effect of Design B is thus that $T^{(\ell)}(\Sigma)$ in (8) is replaced by
$$T_m^{(\ell)}(\Sigma) := \frac{1}{m}D^{(\ell)}W^{(\ell)}\Sigma\big(D^{(\ell)}W^{(\ell)}\big)^{\top} + \frac{1}{m}D^{(\ell)}\Sigma_w\big(D^{(\ell)}\big)^{\top} + \frac{1}{m^2}D^{(\ell)}\Sigma_{\mathrm{sum}}\big(D^{(\ell)}\big)^{\top} + D^{(\ell)}\Sigma_{\mathrm{spl}}\big(D^{(\ell)}\big)^{\top} + \Sigma_a;$$
see Appendix A.2.2. Observe also that $\Sigma_a^{(\ell)}$ can be written as $(1/m)\,m\Sigma_a^{(\ell)}$. Therefore, if we substitute the matrix $\bar{\Sigma}_a^{(\ell)} = m\Sigma_a^{(\ell)}$ for $\Sigma_a^{(\ell)}$ in $T^{(\ell)}(\Sigma)$, we can write
$$T_m^{(\ell)}(\Sigma) = m^{-1}T^{(\ell)}(\Sigma) + m^{-2}D^{(\ell)}\Sigma_{\mathrm{sum}}\big(D^{(\ell)}\big)^{\top} + D^{(\ell)}\Sigma_{\mathrm{spl}}\big(D^{(\ell)}\big)^{\top}.$$
We have the following analogs to Proposition 1 and Corollary 1:
Theorem 2 (Distribution of Design B). Assume that there exist vectors $a^{(\ell)} \in \mathbb{R}^{d_\ell}$ such that $\sigma^{(\ell)}(y) = \mathrm{diag}(a^{(\ell)})\,y$. The feed-forward linear ONN constructed using Design B with $m$ copies then satisfies
$$\Psi^{\mathrm{ONN}}_{m}(\cdot, w) \stackrel{(d)}{=} \mathrm{Normal}\big(\Psi^{\mathrm{NN}}(\cdot, w), \Sigma^{(L)}_{\mathrm{ONN},m}\big),$$
where for $\ell = L, L-1, \ldots, 1$,
$$\Sigma^{(\ell)}_{\mathrm{ONN},m} = T^{(\ell)}_{m}\big(\Sigma^{(\ell-1)}_{\mathrm{ONN},m}\big); \quad \text{and} \quad \Sigma^{(0)}_{\mathrm{ONN},m} = \Sigma_m.$$
Under the assumption of symmetric noise, a simplification of the recursion in Theorem 2, similar to that in Proposition 1, is possible. Assume $\Sigma_{\mathrm{sum}} = \Sigma_{\mathrm{spl}} = 0$. Introduce again $P^{(\ell)} := \prod_{i=\ell+1}^{L}D^{(i)}W^{(i)}$ for notational convenience. The following is proved in Appendix A.2:
Corollary 2 (Symmetric noise case). Assume that for all $\ell \in \mathbb{N}_+$, $\Sigma_a^{(\ell)} = \Sigma_a$ and $\Sigma_w^{(\ell)} = \Sigma_w$. Then,
$$\Sigma^{(L)}_{\mathrm{ONN},m} = \sum_{\ell=1}^{L}(m^{-1})^{L-\ell+1}P^{(\ell)}\,m\Sigma_a\,\big(P^{(\ell)}\big)^{\top} + \sum_{\ell=1}^{L}(m^{-1})^{L-\ell+1}P^{(\ell)}D^{(\ell)}\Sigma_w\big(D^{(\ell)}\big)^{\top}\big(P^{(\ell)}\big)^{\top} + m^{-L}P^{(0)}\Sigma_m\big(P^{(0)}\big)^{\top}.$$
We will next consider the limit of the covariance matrix in a large, symmetric linear ONN with Design B that we can grow infinitely deep. Algorithm 2 is namely able to guarantee boundedness of the covariance matrix in such deep ONNs if the parameter $m$ is chosen appropriately:
Corollary 3. Consider a linear ONN with Design B and parameter $m$, that has $L$ layers, and that satisfies the following symmetry properties: for all $\ell \in \{1,\ldots,L\}$, $W^{(\ell)} = W$, $D^{(\ell)} = D$, $\Sigma_a^{(\ell)} = \Sigma_a$ and $\Sigma_w^{(\ell)} = \Sigma_w$. Then, if $\|D\|_F\|W\|_F < \sqrt{m}$, the limit $\lim_{L\to\infty}\Sigma^{(L)}_{\mathrm{ONN},m}$ exists. Moreover,
$$\lim_{L\to\infty}\Sigma^{(L)}_{\mathrm{ONN},m} = \sum_{n=0}^{\infty}m^{-(n+1)}(DW)^n\big(D\Sigma_wD^{\top} + m\Sigma_a\big)\big((DW)^n\big)^{\top}.$$
Notice that the bound on the number of copies needed
for the covariance matrix of an ONN to converge to a limit
is independent of e.g. the Frobenius norms of the covariance
matrices that describe the noise distributions. This is because,
here, we are not interested in bounding the covariance matrix
to a specific level; instead, we are merely interested in the
existence of a limit.
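Corollary 2 and Corollary 3 are easy to probe numerically. The sketch below (ours, with illustrative symmetric matrices and $\Sigma_{\mathrm{sum}} = \Sigma_{\mathrm{spl}} = 0$) iterates the Design B recursion $T_m^{(\ell)}$ for a deep network with $m$ just above the Corollary 3 threshold and checks the result against the limiting series:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
W = rng.normal(size=(d, d))
D = np.diag(rng.uniform(0.5, 1.0, d))
Sigma_m = Sigma_w = Sigma_a = 0.01 * np.eye(d)
m = int((np.linalg.norm(D) * np.linalg.norm(W)) ** 2) + 1   # m > (||D||_F ||W||_F)^2

def T_m(Sigma):      # Design B update (Sigma_sum = Sigma_spl = 0), cf. Theorem 2
    return D @ W @ Sigma @ W.T @ D.T / m + D @ Sigma_w @ D.T / m + Sigma_a

Sigma = Sigma_m      # recursion of Theorem 2 for an increasingly deep symmetric ONN
for _ in range(100):
    Sigma = T_m(Sigma)

# Partial sum of the limiting series from Corollary 3.
limit = np.zeros((d, d))
DW_n, scale = np.eye(d), 1.0 / m
for n in range(100):
    limit += scale * DW_n @ (D @ Sigma_w @ D.T + m * Sigma_a) @ DW_n.T
    DW_n, scale = DW_n @ (D @ W), scale / m

print(m, np.allclose(Sigma, limit))   # the deep-ONN covariance matches the series limit
```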
4.3. Discussion & Conclusion
Compared to Theorem 2’s recursive description of the
covariance matrix in any linear ONN with Design B, Corol-
lary 2provides a series that describes the covariance matrix in
any linear, symmetric ONN with Design B. While the result
holds more restrictively, it is more insightful. For example,
it allows us to consider the limit of the covariance matrix
in an extremely deep ONNs (see Corollary 3). Corollary 3
suggests that in deep ONNs with Design B, one should choose
m
(
DFWF
)
2
in order to control the noise and not
be too inefficient with the number of copies.
These results essentially mean that in a physical implemen-
tation of an increasingly deep and linear ONN, the covariance
matrix can be reduced (and thus remain bounded) by applying
Design B with multiple copies. The quality of the ONN’s
output increases as the number of copies in Design B (or
Design A for that matter) is increased. Finally, it is worth
mentioning that Design B could potentially be implemented
such that it leverages the enormous bandwidth of optics.
5. Simulations
We investigate the improvements of output quality achieved by Designs A and B on a benchmark example: the convolutional neural network LeNet [40]. As measures of quality we consider the Mean Squared Error (MSE)
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big\|\Psi^{\mathrm{ONN}}(x^{(i)}, w) - \Psi^{\mathrm{NN}}(x^{(i)}, w)\big\|^2$$
and the prediction accuracy
$$\frac{\#\{\text{correctly classified images}\}}{\#\{\text{images}\}}.$$
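These two quality measures translate directly into code; a small NumPy sketch (ours), assuming the network outputs are stacked row-wise:

```python
import numpy as np

def mse(onn_outputs, nn_outputs):
    """Mean squared error between ONN and noiseless NN outputs (one row per input x^(i))."""
    return np.mean(np.sum((onn_outputs - nn_outputs) ** 2, axis=1))

def accuracy(onn_outputs, labels):
    """Fraction of correctly classified images, taking the arg-max as the predicted class."""
    return np.mean(np.argmax(onn_outputs, axis=1) == labels)
```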
5.1. Empirical variance
We extracted plausible values for $\Sigma_w$ and $\Sigma_a$ from the ONN implementation [41] of a 2-layer NN for classification of the Modified National Institute of Standards and Technology (MNIST) database [42]. In [41], the authors trained the NN on a classical computer and implemented the trained weights afterwards in an ONN. They then tuned noise (with the same noise model as in Section 2 of this paper) into the noiseless computer model, assuming that $\Sigma_w = \mathrm{diag}(\sigma_w^2)$ and $\Sigma_a = \mathrm{diag}(\sigma_a^2)$. They found $\sigma_w \in [0.08, 0.1]\cdot d$ and $\sigma_a \in [0.1, 0.15]\cdot d$ to reach the same accuracy levels as the ONN, where $d$ denotes the diameter of the range.
5.2. LeNet ONN: Performance when implemented via Design A and B
Convolutional NNs can be regarded as feedforward NNs by stacking the (2D or 3D) images into column vectors and arranging the filters into a weight matrix. Thus Designs A and B are well-defined for Convolutional NNs. We apply the designs to LeNet5 [40], which is trained for classifying the handwritten digits in the MNIST dataset [42]. The layers are:
1. 2D convolutional layer with kernel size 5, stride 1 and 2-padding. The output has 6 channels of 28x28 pixel representations, with the activation function being tanh;
2. average pooling layer, pooling 2x2 blocks; the output therefore is 14x14;
3. 2D convolutional layer with kernel size 5, stride 1 and no padding. The output has 16 channels of 10x10 pixel representations and the activation function is tanh;
4. average pooling layer, pooling 2x2 blocks; the output therefore is 5x5;
5. 2D convolutional layer with kernel size 5, stride 1 and no padding. The output has 120 channels of 1-pixel representations and the activation function used is tanh;
(5.) flattening layer, which turns the 120 one-dimensional channels into one 120-dimensional vector;
6. dense layer with 84 neurons and tanh activation function;
7. dense layer with 10 neurons and softmax activation function.
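For reference, a PyTorch sketch of this architecture (ours; the paper does not provide code, and details such as weight initialization are not specified):

```python
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2), nn.Tanh(),  # 1: 6 x 28 x 28
    nn.AvgPool2d(2),                                                 # 2: 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.Tanh(),            # 3: 16 x 10 x 10
    nn.AvgPool2d(2),                                                 # 4: 16 x 5 x 5
    nn.Conv2d(16, 120, kernel_size=5, stride=1), nn.Tanh(),          # 5: 120 x 1 x 1
    nn.Flatten(),                                                    # (5.): 120
    nn.Linear(120, 84), nn.Tanh(),                                   # 6
    nn.Linear(84, 10), nn.Softmax(dim=1),                            # 7
)
```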
Figures 3 and 4 show the MSE and the prediction accuracy of Designs A and B for an increasing number of copies, respectively.
For simplicity we set all individual copies $n_i$ per layer $i$ in Design A to equal $m$, that is, $n_i = m$ for all $i$. The total number of copies that Design A starts with then is $m^L$. Here $L$ is equal to 7. In Design B the number of copies is $m$ per layer and the total number of copies is $mL$. In the case of one copy, Designs A and B are identical to the original network, while we focus on the effect once the designs deviate from the original network ($m \geq 2$).
The axes in Figures 3 and 4 denote the number of copies per layer. Here, we scale the copies per layer for Design A linearly, because the total amount of copies for Design A grows exponentially, and we scale the copies per layer for Design B exponentially, because the total number of copies for Design B grows linearly. This way the comparison is on equal terms.
Figure 3 displays the MSE seen for LeNet, depending on the amount of copies for each design. In the trade-off between the additional resources needed for the additional copies and the diminishing benefits of adding further copies, we see that, for both measures, MSE (Figure 3) and relative accuracy (Figure 4), already 2 to 5 copies per layer yield good results. The relative accuracy in Figure 4 is scaled such
Figure 3: MSE ($\cdot\,10^{-2}$) for Design A (top) and Design B (bottom) as a function of copies on LeNet5 trained for MNIST classification. The pale area contains the 95%-confidence intervals.
that 0 corresponds to the accuracy of the original NN with noise profile (i.e., the ONN without modifications; we call this the original ONN) and 1 to the accuracy of the original NN without noise. The designs do not alter the fundamental operation of the original NN; therefore there should be no performance gain, and the original NN's accuracy should be considered the highest achievable, thus constituting the upper bound in relative accuracy of 1. Likewise, the lowest accuracy should be given by the original ONN, as there is no noise reduction involved.
Figure 4: Relative accuracy for Design A (top) and Design B (bottom) as a function of copies on LeNet5 trained for MNIST classification. The pale area contains the 56.5%-confidence intervals.
5.3. Effect of additional layers in LeNet
In order to investigate how the depth affects the noise
at the output, while keeping the operation of the network
the same to ensure the results are commensurable, we insert
additional layers with identity matrix and identity activation
function (we will call them identity layers) into a network.
Figure 5: Accuracy of LeNet ONNs, depending on the amount of inserted identity layers and the variance level of the ONN, for (a) a network with tanh activation function and one copy, (b) a network with ReLU activation function and one copy, (c) a network with linear activation function and one copy, (d) a network with tanh activation function and two copies, (e) a network with ReLU activation function and two copies, (f) a network with linear activation function and two copies.
Specifically, we take networks with the LeNet architecture as in Section 5.2, using different activation functions, while fixing the output layer to be softmax. We then insert identity layers between layers 1 and 2, 3 and 4, 5 and 6, as well as between layers 6 and 7. For a fixed total of additional layers, the layers are inserted in the four spots between layers 1&2, 3&4, 5&6, and 7&8 according to the tuple
$$n \mapsto \Big(\Big\lfloor\frac{n+3}{4}\Big\rfloor, \Big\lfloor\frac{n+2}{4}\Big\rfloor, \Big\lfloor\frac{n+1}{4}\Big\rfloor, \Big\lfloor\frac{n}{4}\Big\rfloor\Big).$$
The insertion pattern is illustrated in Table 1:
# of additional layers 1&2 3&4 5&6 7&8
1 1 0 0 0
2 1 1 0 0
3 1 1 1 0
4 1 1 1 1
5 2 1 1 1
6 2 2 1 1
... ... ... ... ...
Table 1: Insertion pattern.
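The insertion rule is just the floor-division tuple above; a short Python check (ours) reproduces Table 1:

```python
def insertion_pattern(n):
    """Number of identity layers inserted at the four spots, for n additional layers in total."""
    return ((n + 3) // 4, (n + 2) // 4, (n + 1) // 4, n // 4)

for n in range(1, 7):
    print(n, insertion_pattern(n))
```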
Finally, we tune the variance terms of the covariance matrix
in our noise model. The results are displayed in Figure 5.
In Figure 5, we observe that the tanh and the ReLU networks perform as expected. Additional noisy layers decrease the accuracy, and thus the same level of performance can only be achieved if the variance is lower. This trend can also be seen in the linear network, but to a lesser extent.
5.4. Simulations on effective values for Design B
According to Corollary 3, the covariance matrix of a linear ONN constructed by Design B is bounded if $m > (\|D\|_F\|W\|_F)^2$, and therefore $m = (\|D\|_F\|W\|_F)^2$ is sufficient to ensure that the covariance matrix of the output distribution $\Psi^{\mathrm{ONN}}_{m}(\cdot, w) \stackrel{(d)}{=} \mathrm{Normal}\big(\Psi^{\mathrm{NN}}(\cdot, w), \Sigma^{(L)}_{\mathrm{ONN},m}\big)$ in Theorem 2 is bounded in linear NNs. This is derived by using submultiplicativity of the norm (see (A.8)) and is therefore possibly a loose bound. We use the exact relation given by Corollary 2 for the covariance matrix in Theorem 2 to investigate the lowest values for $m$ for which the covariance matrix starts being bounded. In Figure 6 we depict a linear NN with constant width 4. We vary the values for $\|D\|_F$ and $\|W\|_F$. Upon close inspection we see that the lowest value for $m$ seems to be $g(x,y) \approx (xy)^2/\|\mathrm{Id}\|_F^4$, where $\mathrm{Id}$ is the identity matrix of dimension $d$; see Figure 6. Because $\|\mathrm{Id}\|_F = \sqrt{d}$, the value for $m$ found numerically is $m \approx (\|D\|_F\|W\|_F/d)^2$.
6. Discussion & Conclusion
Design A, introduced in