Noise-Resilient Designs for Optical Neural Networks
Gianluca Kosmellaa,b,∗, Ripalta Stabilea, Jaron Sandersb
aDepartment of Electrical Engineering, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands
bDepartment of Mathematics and Computer Science, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands
∗Corresponding author. Email address: g.k.kosmella@tue.nl (Gianluca Kosmella)
Abstract
All analog signal processing is fundamentally subject to noise, and this is also the case in modern implementations of Optical
Neural Networks (ONNs). Therefore, to mitigate noise in ONNs, we propose two designs that are constructed from a given,
possibly trained, Neural Network (NN) that one wishes to implement. Both designs have the property that the resulting ONN
gives outputs close to those of the desired NN.
To establish the latter, we analyze the designs mathematically. Specifically, we investigate a probabilistic framework for
the first design that establishes that the design is correct, i.e., for any feed-forward NN with Lipschitz continuous activation
functions, an ONN can be constructed that produces output arbitrarily close to the original. ONNs constructed with the first
design thus also inherit the universal approximation property of NNs. For the second design, we restrict the analysis to NNs
with linear activation functions and characterize the ONNs’ output distribution using exact formulas.
Finally, we report on numerical experiments with LeNet ONNs that give insight into the number of components required in
these designs for certain accuracy gains. We specifically study the effect of noise as a function of the depth of an ONN. The
results indicate that in practice, adding just a few components in the manner of the first or the second design can already be
expected to increase the accuracy of ONNs considerably.
Keywords: Optical Neural Networks, Law of Large Numbers, Universal Approximation
1. Introduction
Machine Learning (ML) is a computing paradigm in which
problems that are traditionally challenging for programmers
to explicitly write algorithms for, are solved by learning algo-
rithms that improve automatically through experience. That
is, they “learn” structure in data. Prominent examples include
image recognition [1], semantic segmentation [2], human-level control in video games [3], visual tracking [4], and language translation [5].
Classical computers are designed and best suited for se-
rialized operations (they have a central processing unit and
separated memory), while the data-driven ML approach re-
quires decentralized and parallel calculations at high band-
width as well as continuous processing of parallel data. To
illustrate how ML can benefit from a different architecture,
we can consider performance relative to the number of ex-
ecuted operations, also indicated as Multiply-–Accumulate
Operation (MAC) rates, and the energy efficiency, i.e., the
amount of energy spent to execute one single operation. Com-
putational efficiency in classical computers levels off below 10
GMAC/s/W [6].
An alternative computing architecture with a more dis-
tributed interconnectivity and memory would allow for greater
energy efficiency and computational speed. An inspiring ex-
ample would be an architecture such as the brain. The brain
is able to perform about $10^{18}$ MAC/s using only 20 W of power [6], and operates approximately $10^{11}$ neurons with an average number of inputs of about $10^{4}$ synapses each. This leads to an estimated total of $10^{15}$ synaptic connections, all conveying signals of up to 1 kHz bandwidth. The brain's computational efficiency (less than 1 aJ per MAC) is then about 8 orders of magnitude higher than that of current supercomputers, which instead operate at 100 pJ per MAC [6].
Connecting software to hardware through computing ar-
chitecture tailored to ML tasks is the endeavor of research
within the field of neuromorphic computing. The electronics
community is now busy developing non-von Neumann com-
puting architectures to enable information processing with
an energy efficiency down to a few pJ per operation. Aim-
ing to replicate fundamentals of biological neural circuits in
dedicated hardware, important advances have been made in
neuromorphic accelerators [7]. These advances are based on spiking architectural models, which are still not fully understood. Deep Learning (DL)-focused approaches, on the other hand, aim to construct hardware that efficiently realizes DL architectures, while eliminating as much of the complexity of biological neural networks as possible. Among the most powerful DL hardware we can name GPU-based DL accelerators [8, 9, 10, 11, 12], as well as emerging
analogue electronic Artificial Intelligence chipsets that tend
to collocate processing and memory to minimize the mem-
ory–processor communication energy costs (e.g. the analogue
crossbar approaches [13]). Mythic's architecture, for example, can yield high accuracy in inference applications with a remarkable energy efficiency of just half a pJ per MAC. Even though the implementation of neuromorphic approaches is visibly bringing record energy efficiencies and computation speeds, neuromorphic electronics already struggles to offer the desired data throughput at the neuron level. Neuro-
offer the desired data throughput at the neuron level. Neuro-
morphic processing for high-bandwidth applications requires
GHz operation per neuron, which calls for a fundamentally
different technology approach.
1.1. Optical Neural Networks
A major concern with neuromorphic electronics is that
the distributed hardware needed for parallel interconnections
is impractical to realize with classical metal wiring: a trade-
off applies between interconnectivity and bandwidth, limiting
these engine’s utilization to applications in the kHz and sub-
GHz regime. When sending information not through electrical
signals but via optical signals, the optical interconnections do
not undergo interference and the optical bandwidth is virtually
unlimited. This can for example be achieved when exploiting
the color and/or the space and/or the polarization and/or the
time domain, thus allowing for applications in the GHz regime.
It has been theorized that photonic neuromorphic processors
could operate ten thousand times faster while using less energy
per computation [14, 15, 16, 17]. Photonics therefore seems
to be a promising platform for advances in neuromorphic
computing.
Implementations of weighted addition for Optical Neural Networks (ONNs) include Mach–Zehnder Interferometer-based Optical Interference Units [18], time-multiplexed coherent detection [19], free-space systems using spatial light modulators [20], and Micro–Ring–Resonator-based weighting banks on silicon [21]. Furthermore, Indium phosphide-integrated optical cross-connects using Semiconductor Optical Amplifiers as single-stage weight elements, as well as Semiconductor Optical Amplifier-based wavelength converters [22, 23, 24], have been demonstrated to enable All-Optical (AO) Neural Networks (NNs). A comprehensive review of the approaches used in integrated photonics can be found in [25].
Next to these promises, aspects like implementation of
nonlinearities, access and storage of weights in on-chip mem-
ory, and noise sources in analog photonic implementations, all
pose challenges in devising scalable photonic neuromorphic
processors and accelerators. These challenges also occur when
they are embedded within end-to-end systems. Fortunately,
arbitrary scalability of these networks has been demonstrated, albeit at a certain noise level and accuracy. It would therefore be useful to envision new architectures that reduce the noise even further.
1.2. Noise in ONNs
The types of noise in ONNs include thermal crosstalk [26], cumulative noise in optical communication links [27, 28] and noise deriving from applying an activation function [29].
In all these studies, the noise is considered to be approximated well by Additive White Gaussian Noise (AWGN).
For example, taking the studies [26, 28, 27, 29, 30] as starting point, the authors of [31] model an ONN as a com-
munication channel with AWGN. We follow this assumption
and will model an ONN as having been built up from inter-
connected nodes with noise in between them. This generic
approach does not restrict us to any specific device that may
be used in practice.
The model also applies to the two alternative designs
of an AO implementation of a NN (see for example [32]) and the case of an optical/electrical/optical (O/E/O) NN [22]. In an AO NN, the activation function is applied by
manipulating an incoming electromagnetic wave. Modulation
(and the AWGN it causes) only occurs prior to entering an
AO NN (or equivalently, in the first layer). For the remainder
of the network the signal remains in the optical domain. Here,
when applying the optical activation function a new source
of noise is introduced as AWGN at the end of each layer.
Using the O/E/O network architecture, the weighted addition
is performed in the optical realm, but the light is captured
soon after each layer, where it is converted into an electrical
and digital signal and the activation function is applied via
software on a computer. The operation on the computer
can be assumed to be noiseless. However, since the result
again needs to be modulated (to be able to act as input to
the next layer), modulation noise is added. We can further
abstract from the specifics of the AO and O/E/O design and
see that in either implementation noise occurs at the same
locations within the mathematical modeling, namely AWGN
for weighted addition and afterwards AWGN from an optical
activation function or from modulation, respectively. This
means that we do not need to distinguish between the two
design choices in our modeling; we only need to choose the
corresponding AWGN term after activation.
The operation of a layer of a feed-forward NN can be
modeled by multiplying a matrix $W$ with an input vector $x$ (a bias term $b$ can be absorbed into the matrix–vector product and will therefore be suppressed in notation here) and then applying an activation function $f : \mathbb{R} \to \mathbb{R}$ element-wise to the result. Symbolically, $x \mapsto f(Wx)$.
Now, concretely, the noise model that we study is described by
\[
x \mapsto f\big(Wx + \mathrm{Normal}(0, \Sigma_w)\big) + \mathrm{Normal}(0, \Sigma_a), \tag{1}
\]
for each hidden layer of the ONN. Here $\mathrm{Normal}(0, \Sigma)$ denotes the multivariate normal distribution with mean vector $0$ and covariance matrix $\Sigma$. More specifically, $\Sigma_w$, $\Sigma_a$ and $\Sigma_m$ are the covariance matrices associated with weighted addition, application of the activation function, and modulation, respectively. Figure 1 gives a schematic representation of the noise model under study. As we have seen above, in the O/E/O case we have $\Sigma_a = \Sigma_m$; otherwise $\Sigma_a$ is due to the specific structure of the photonic activation function. The first layer, regardless of an AO or O/E/O network, sees a modulated input $x$, i.e., $x + \mathrm{Normal}(0, \Sigma_m)$, and afterwards the same steps of weighting and applying an activation function, that is, (1). Arguably, the hidden layers and their noise structure are the most important parts, especially in deep NNs. Therefore, the main equation governing the behavior of the noise propagation in an ONN will remain (1).

Figure 1: Schematic depiction of the noise model of ONNs that we study (signal chain: data $x$ is modulated to $x' = x + N_m$; weighted addition gives $y' = Wx' + N_w$; the activation gives $y = \sigma(y') + N_a$, which is the photonic layer's output $y_{\mathrm{out}}$, feeding further layers). First, data $x$ is modulated onto light. This step adds an AWGN term $N_m$. This light enters the photonic layer, in which a weighted addition takes place, adding AWGN $N_w$. The activation function is then applied, adding AWGN $N_a$. The activation function may be applied by photo-detecting the signal of the weighted addition, turning it into a digital signal and applying the activation function on a computer. The result of that action would then be modulated again, to produce the optical output of the photonic neuron. The modulator is thus only required in the first layer, as each photonic neuron takes in light and outputs light.
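To make the model in (1) concrete, the following minimal sketch (ours, not from any ONN implementation) simulates one noisy hidden layer, assuming diagonal covariances $\Sigma_w = \sigma_w^2 I$ and $\Sigma_a = \sigma_a^2 I$, a tanh activation, and illustrative noise levels.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_layer(x, W, b, sigma_w=0.1, sigma_a=0.1, activation=np.tanh):
    """One ONN hidden layer following Eq. (1):
    x -> f(W x + Normal(0, sigma_w^2 I)) + Normal(0, sigma_a^2 I).
    Diagonal covariances are assumed for illustration only."""
    pre = W @ x + b + rng.normal(0.0, sigma_w, size=W.shape[0])          # weighted addition + AWGN
    return activation(pre) + rng.normal(0.0, sigma_a, size=W.shape[0])   # activation + AWGN

# Example: compare one noisy layer with its noiseless counterpart.
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
b = np.zeros(d_out)
x = rng.normal(size=d_in)
x_mod = x + rng.normal(0.0, 0.1, size=d_in)   # modulation noise in the first layer
print(noisy_layer(x_mod, W, b))               # noisy ONN layer output
print(np.tanh(W @ x + b))                     # noiseless NN layer output
```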
1.3. Noise-resistant designs for ONNs
The main contribution of this paper lies in analyzing two
noise reduction mechanisms for feed-forward ONNs. The
mechanisms are derived from the insight that noise can be
mitigated through averaging because of the law of large num-
bers, and they are aimed at using the enormous bandwidth
that photonics offer. The first design (Design A) and its
analysis are inspired by recent advancements for NNs with
random edges in [33]; the second design (Design B) is new
and simpler to implement, but comes without a theoretical
guarantee of correctness for nonlinear ONNs, specifically.
Both designs—illustrated in Figure 2—are built from a
given NN for which an optical implementation is desired.
Each design proposes a larger ONN by taking parts of the
original NN, and duplicating and arranging them in a certain
way. If noise is absent, then this larger ONN produces the
same output as the original NN; and, if noise is present, then
this ONN produces an output closer to the desired NN than
the direct implementation of the NN as an ONN without
modifications would give.
The first mechanism to construct a larger ONN suppress-
ing inherent noise of analog systems starts with a certain
number of copies
N
of the input data. The copies are all
processed independently by (in parallel arranged copies of)
the layers. Each copy of a layer takes in multiple input copies
to produce the result of weighted addition, to which the acti-
vation mechanism is applied. The copies that are transmitted
to each layer (or set of parallel arrayed layers) are independent
of each other. The independent outputs function as inputs to
the upcoming (copies of the) layers, and so on and so forth.
The idea of the second design is to use multiple copies
of the input, on which the weighted addition is performed.
The noisy products of the weighted addition are averaged to
a single number/light beam. This average is then copied and
the multiple copies are fed through the activation function,
creating multiple noisy activations to be used as the next
layer’s input, and so on.
1.4. Summary of results
Using Design A, we are able to establish that ONNs possess
the same theoretical properties as NNs. Specifically, we can
prove that any NN can be approximated arbitrarily well by an
ONN built using Design A (Theorem 1). Similar considerations
for NNs with random edges can be found in [33], but the
noise model and proof method are different. Here, we first
bound the deviation of an ONN and a noiseless NN. To this
bound Hoeffding’s inequality is then applied.
Establishing this theoretical guarantee, however, is done
by increasing the number of components exponentially as the
depth of the network increases. The current proof shows that
for an ONN with Design A meant to approximate a NN with
L
layers arbitrarily well (and thus reduce the noise to negligible
levels), a sufficient number of components is $\omega\big(K^{L(L+1)} L^{L}\big)$ for some constant $K > 0$. This is however not to say that such a large number is necessary: it is merely sufficient.
From a practical viewpoint, however, having to use as
few components as possible would be more attractive. We
therefore also investigate Design B, in which the number
of components increases only linearly with the depth of the
network. Because Design A already allows us to establish
the approximation property of ONNs, we limit our analysis of
Design B to linear NNs for simplicity. We specifically establish in Theorem 2, for any linear NN, the exact output distribution of an ONN built using Design B. Similar to the guarantee for Design A in Theorem 1, but more restrictively, this implies that any linear NN can be approximated arbitrarily well by some ONN built using Design B. Strictly speaking, Design B thus comes with no guarantee of correctness for nonlinear NNs, but this should not hold us back in practice (especially when activations, for instance, are close to linear).
We conduct numerical experiments with Designs A and B by constructing LeNet ONNs. The numerical results indicate that in practice, adding some components for noise negation is already sufficient to increase the accuracy of an ONN; an exponential number does not appear to be necessary (see Figures 3 to 4).

Figure 2: (a) The original NN: a base 4–3–2 network; light circles indicate activations, boxes indicate post-activations. (b) Example for Design A with 2 layers as input copies to each subsequent layer. The light circles indicate the linear operations/matrix–vector products. The results of the linear operations are averaged (single solid-blue circle) and fed through the activation function, producing the multiple versions of the layer's output (boxes). (c) Example of Design B.
Finally, we want to remark that the high bandwidth of
photonic circuits can be exploited to implement the designs
as efficiently as possible.
1.5. Outline of the paper
We introduce the AWGN model formally in Section 2.
This model is the basis for the analysis of the proposed noise
reduction schemes that are next discussed in Sections 3 and 4. There, we specifically define Designs A and B, and each design is followed by a mathematical analysis. The main results are Theorems 1 and 2. Section 5 contains numerical simulations on LeNet ONNs to which we apply Designs A and B. Section 6
concludes; technical details are deferred to the Appendix.
2. Model
We consider general feed-forward NNs implemented on
analog optical devices. Noise occurs due to various reasons
in those optical devices. Reasons include quantum noise in
modulation, chip imperfections, and crosstalk [26, 28, 27, 29, 30].
The noise profiles and levels of different devices differ,
but we can, to good approximation, expect AWGN to occur
at three separate instances [31]: when modulating, when
weighting, and when applying an activation function. The
thus proposed AWGN model is formalized next in Section 2.1.
2.1. Feed-forward nonlinear ONNs
We assume that our aim is to implement a feed-forward
nonlinear NN with domain $\mathbb{R}^{d_0}$ and range $\mathbb{R}^{d_L}$, that can be represented by a parameterized function $\Psi^{\mathrm{NN}} : \mathbb{R}^{d_0} \times \mathbb{R}^{n} \to \mathbb{R}^{d_L}$ as follows. For $\ell = 1, \dots, L \in \mathbb{N}_+$, $\Psi^{\mathrm{NN}}$ must be the composition of the functions
\[
\Psi^{\mathrm{NN}}_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}, \quad x \mapsto \sigma^{(\ell)}\big(W^{(\ell)} x + b^{(\ell)}\big).
\]
Here, $W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ denotes the weight matrix in the $\ell$-th layer, $b^{(\ell)} \in \mathbb{R}^{d_\ell \times 1}$ the bias vector in the $\ell$-th layer, and $\sigma^{(\ell)} : \mathbb{R}^{d_\ell \times 1} \to \mathbb{R}^{d_\ell \times 1}$ the activation function in the $\ell$-th layer. Specifically, the NN satisfies
\[
\Psi^{\mathrm{NN}}(\cdot, w) = \Psi^{\mathrm{NN}}_L(\cdot, w^{(L)}) \circ \cdots \circ \Psi^{\mathrm{NN}}_1(\cdot, w^{(1)}), \tag{2}
\]
where $w^{(\ell)} = (W^{(\ell)}, b^{(\ell)})$ represents the parameters in the $\ell$-th layer. Note that we do not necessarily assume that the activation function is applied component-wise (it could be any high-dimensional function). Such cases are simply contained within the model.
Suppose now that the NN in (2) is implemented as an ONN, but without amending its design. AWGN will then disrupt the output of each layer. Specifically, for depths $L \in \mathbb{N}_+$, the ONN will be representable by a function $\Psi^{\mathrm{ONN}}$ that is the composition of the noisy functions
\[
\Psi^{\mathrm{ONN}}_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}, \quad x \mapsto \sigma^{(\ell)}\big(W^{(\ell)} x + b^{(\ell)} + N^{(\ell)}_w\big) + N^{(\ell)}_a \tag{3}
\]
for $\ell = 1, \dots, L \in \mathbb{N}_+$. Here,
\[
N^{(\ell)}_w \overset{(d)}{=} \mathrm{Normal}(0, \Sigma^{(\ell)}_w) \quad \text{and} \quad N^{(\ell)}_a \overset{(d)}{=} \mathrm{Normal}(0, \Sigma^{(\ell)}_{\mathrm{act}})
\]
denote multivariate normal distributions that describe the AWGN within the ONN. In other words, the ONN will satisfy
\[
\Psi^{\mathrm{ONN}}(\cdot, w) = \Psi^{\mathrm{ONN}}_L(\cdot, w^{(L)}) \circ \cdots \circ \Psi^{\mathrm{ONN}}_1(\cdot, w^{(1)}) \tag{4}
\]
instead of (2). Observe that (4) is a random NN; its outcome is uncertain, but hopefully close to that of (2).
2.2. Feed-forward linear ONNs
Let us briefly examine the special case of a feed-forward
linear ONN in more detail. That is, we now assume additionally that for $\ell = 1, \dots, L$, there exist $e^{(\ell)} \in \mathbb{R}^{d_\ell}$ such that $\sigma^{(\ell)}(y) = D^{(\ell)} y$, where $D^{(\ell)} = \mathrm{diag}(e^{(\ell)})$. In other words, each activation function $\sigma^{(\ell)}$ does element-wise multiplications by constants.
If each activation function is linear, then the output distri-
bution of each layer will remain multivariate normally distributed due to the so-called linear transformation theorem [34, Theorem 1.2.6]. The mean and covariance matrix of the underlying
multivariate normal distribution will however be transformed
in each layer.
Let us illustrate how the covariance matrix transforms by
discussing the first layer in detail. Each layer in (3) can be interpreted as a random function that takes the noisy vector $A^{(\ell-1)} = (A^{(\ell-1)}_1, \dots, A^{(\ell-1)}_{d_{\ell-1}})$ say as input, and produces the even noisier vector $A^{(\ell)} = (A^{(\ell)}_1, \dots, A^{(\ell)}_{d_\ell})$ say as output. Specifically, the noisy input to the first layer is modeled by
\[
A^{(0)} \mid x \overset{(d)}{=} x + N(0, \Sigma_m) \tag{5}
\]
because of the modulation error within the first layer. Here $\cdot \mid \cdot$ indicates a conditional random variable. This input next experiences weighted addition and more noise is introduced: the noisy preactivation of the first layer satisfies
\[
U^{(1)} \mid A^{(0)} \overset{(d)}{=} W^{(1)} A^{(0)} + b^{(1)} + N(0, \Sigma^{(1)}_w). \tag{6}
\]
Combining (5) and (6) with the linear transformation theorem for the multivariate normal distribution as well as the fact that sums of independent multivariate normal random variables are again multivariate normally distributed [34, Theorem 1.2.14], we find that
\[
U^{(1)} \mid x \overset{(d)}{=} W^{(1)} x + b^{(1)} + W^{(1)} N\big(0, \Sigma_m\big) + N\big(0, \Sigma^{(1)}_w\big) \overset{(d)}{=} W^{(1)} x + b^{(1)} + N\big(0,\, W^{(1)} \Sigma_m (W^{(1)})^{\intercal} + \Sigma^{(1)}_w\big).
\]
After applying the linear activation function, we obtain
\[
A^{(1)} \mid x \overset{(d)}{=} \sigma^{(1)}(U^{(1)}) + N(0, \Sigma^{(1)}_a) \mid x \overset{(d)}{=} D^{(1)}\big(W^{(1)} x + b^{(1)}\big) + N\Big(0,\, \Sigma^{(1)}_a + D^{(1)}\big(W^{(1)} \Sigma_m (W^{(1)})^{\intercal} + \Sigma^{(1)}_w\big)(D^{(1)})^{\intercal}\Big) = N\big(\Psi^{\mathrm{NN}}_1(x, w),\, \Sigma^{(1)}_{\mathrm{ONN}}\big)
\]
say. Observe that the unperturbed network's output remains intact, and is accompanied by a centered normal distribution with an increasingly involved covariance matrix:
\[
\Sigma^{(1)}_{\mathrm{ONN}} = D^{(1)}\big(W^{(1)} \Sigma_m (W^{(1)})^{\intercal} + \Sigma^{(1)}_w\big)(D^{(1)})^{\intercal} + \Sigma^{(1)}_a = D^{(1)} W^{(1)} \Sigma_m (W^{(1)})^{\intercal} (D^{(1)})^{\intercal} + D^{(1)} \Sigma^{(1)}_w (D^{(1)})^{\intercal} + \Sigma^{(1)}_a. \tag{7}
\]
Observe furthermore that the covariance matrix in (7) is independent of the bias $b^{(1)}$.
The calculations in eqs. (5) to (7) can readily be extended
into a recursive proof that establishes the covariance matrix
of the entire linear ONN. Specifically, for $\ell = 1, \dots, L$, define the maps
\[
T^{(\ell)}(\Sigma) := D^{(\ell)} W^{(\ell)} \Sigma (W^{(\ell)})^{\intercal} (D^{(\ell)})^{\intercal} + D^{(\ell)} \Sigma^{(\ell)}_w (D^{(\ell)})^{\intercal} + \Sigma^{(\ell)}_a. \tag{8}
\]
We then have the following:

Proposition 1 (Distribution of linear ONNs). Assume that there exist vectors $e^{(\ell)} \in \mathbb{R}^{d_\ell}$ such that $\sigma^{(\ell)}(y) = \mathrm{diag}(e^{(\ell)})\, y$. The feed-forward linear ONN in (4) then satisfies
\[
\Psi^{\mathrm{ONN}}(\cdot, w) \overset{(d)}{=} N\big(\Psi^{\mathrm{NN}}(\cdot, w),\, \Sigma^{(L)}_{\mathrm{ONN}}\big),
\]
where for $\ell = L, L-1, \dots, 1$,
\[
\Sigma^{(\ell)}_{\mathrm{ONN}} = T^{(\ell)}\big(\Sigma^{(\ell-1)}_{\mathrm{ONN}}\big); \quad \text{and} \quad \Sigma^{(0)}_{\mathrm{ONN}} = \Sigma_m.
\]
In linear ONNs with symmetric noise (that is, the AWGN
of each layer’s noise sources has the same covariance ma-
trix), Proposition 1's recursion simplifies. Introduce $P^{(\ell)} := \prod_{i=\ell+1}^{L} D^{(i)} W^{(i)}$ for notational convenience. The following is proved in Appendix A.1.1:

Corollary 1 (Symmetric noise case). Within the setting of Proposition 1, assume additionally that for all $\ell \in \mathbb{N}_+$, $\Sigma^{(\ell)}_a = \Sigma_a$ and $\Sigma^{(\ell)}_w = \Sigma_w$. Then,
\[
\Sigma^{(L)}_{\mathrm{ONN}} = P^{(0)} \Sigma_m (P^{(0)})^{\intercal} + \sum_{\ell=1}^{L} P^{(\ell)} \Sigma_a (P^{(\ell)})^{\intercal} + \sum_{\ell=1}^{L} P^{(\ell)} D^{(\ell)} \Sigma_w (D^{(\ell)})^{\intercal} (P^{(\ell)})^{\intercal}.
\]
If moreover for all $\ell \in \mathbb{N}_+$, $W^{(\ell)} = W$, $D^{(\ell)} = D$, and $\lVert D \rVert_F \lVert W \rVert_F < 1$, then
\[
\lim_{L \to \infty} \Sigma^{(L)}_{\mathrm{ONN}} = \sum_{n=0}^{\infty} (DW)^n \big(D \Sigma_w D^{\intercal} + \Sigma_a\big) \big((DW)^n\big)^{\intercal}.
\]
Proposition 1 and Corollary 1 describe the output distribution of linear ONNs completely.
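As an illustration of Proposition 1, the sketch below (our own, with arbitrary example matrices and noise levels) iterates the maps $T^{(\ell)}$ of (8) to compute $\Sigma^{(L)}_{\mathrm{ONN}}$ for a small linear ONN, and compares it against a Monte Carlo estimate of the output covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 3, 4                                     # depth and (constant) width, illustrative
W = [0.5 * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]
D = [np.diag(rng.uniform(0.5, 1.0, size=d)) for _ in range(L)]   # linear activations sigma(y) = D y
Sigma_m = 0.01 * np.eye(d)
Sigma_w = 0.01 * np.eye(d)
Sigma_a = 0.01 * np.eye(d)

def T(ell, Sigma):
    """The map T^(ell) of Eq. (8)."""
    DW = D[ell] @ W[ell]
    return DW @ Sigma @ DW.T + D[ell] @ Sigma_w @ D[ell].T + Sigma_a

Sigma = Sigma_m                                 # Sigma^(0)_ONN = Sigma_m
for ell in range(L):
    Sigma = T(ell, Sigma)                       # Sigma^(ell)_ONN = T^(ell)(Sigma^(ell-1)_ONN)

# Monte Carlo check of the same covariance.
def onn_sample(x):
    a = x + rng.multivariate_normal(np.zeros(d), Sigma_m)
    for ell in range(L):
        u = W[ell] @ a + rng.multivariate_normal(np.zeros(d), Sigma_w)
        a = D[ell] @ u + rng.multivariate_normal(np.zeros(d), Sigma_a)
    return a

x = rng.normal(size=d)
samples = np.array([onn_sample(x) for _ in range(20000)])
print(np.max(np.abs(np.cov(samples.T) - Sigma)))   # small deviation, in line with Proposition 1
```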
2.3. Discussion
One way to think of the AWGN model in Section 2.1 is to
take a step back from the microscopic analysis of individual
devices, and consider an ONN as a series of black box devices
(recall also Figure 1). Each black box device performs its designated task and acts as a communication channel with AWGN. This way of modeling in order to analyze the impact of noise can also be seen in [31]; other papers modeling optical channels include [28, 27]. Further papers considering noise in optical systems with similar noise assumptions are [35, 36], where furthermore multiplicative noise is considered when an amplifier is present within the circuit [35]. Qualitatively, the results for Design A also apply to multiplicative noise; the scaling, however, may differ.
the scaling however may differ.
Limitations of the model. We note firstly that modeling the
noise in ONNs as AWGN is warranted only in an operating
regime with many photons, and is thus unlikely to be a good
model for ONNs that operate in a regime with just a few
photons.
Secondly, due to physical device features and operation
conditions, weights, activations, and outputs can only be
realized in ONNs if their values lie in certain ranges. Such constraints are not part of the model in Section 2. Fortunately, however, the implied range restrictions are usually not a problem in practice. For example, activation functions like sigmoid and tanh map into $[0,1]$ and $[-1,1]$, respectively. Additional regularization rules like weight decay also move the entries of weight matrices in NNs towards smaller values. In case the physical constraints are reached, one can increase the weight decay parameter to further penalize large weights during training, leading to smaller weights so that the ONN is again applicable.
3. Results—Design A
3.1. Reducing the noise in feed-forward ONNs (Design A)
Recall that an example of Design A is presented in Fig-
ure 2(b). Algorithm 1 constructs this tree-like network, given the desired number of copies $n_0, \dots, n_L$ per layer.
Observe that in Design A, the number of copies utilized in each layer, the $n_\ell$, are fixed. There is however only a single copy in the last layer. Its output is the unique output of the ONN. Each other layer receives multiple independent inputs. With each of the independent copies, weighted addition is performed, and the results are averaged to produce the layer's single output. Having independent incoming copies is achieved by having multiple independent branches of the prior partial networks incoming into a given layer. This means that the single layer $L$ receives $n_{L-1}$ independent inputs from $n_{L-1}$ independent copies of layer $L-1$. Each of the $n_{L-1}$ copies of layer $L-1$ receives $n_{L-2}$ inputs from independent copies of layer $L-2$. Generally, let $n_{\ell-1}$ be the number of copies of layer $\ell-1$ that act as inputs to layer $\ell$.
Observe that all copies are created upfront. That means
there are $\prod_{\ell=0}^{L-1} n_\ell$ copies of the data. By Algorithm 1, $\prod_{\ell=1}^{L-1} n_\ell$ copies of the first layer are arrayed in parallel to each other, and each of them processes $n_0$ copies of the data. The outputs of the $\prod_{\ell=1}^{L-1} n_\ell$ arrayed copies of the first layer are the input to the $\prod_{\ell=2}^{L-1} n_\ell$ arrayed copies of the second layer, and so on.

Algorithm 1 Algorithm to construct a noise reducing network
Require: Input $n = (n_\ell)_{\ell=0,\dots,L}$
Require: $\prod_{i=0}^{L} n_i$ copies of input $x^{(0)}$, named ${}_1x^{(0)}, \dots, {}_{\prod_{i=0}^{L} n_i}x^{(0)}$
for $\ell = 0, \dots, L-1$ do
    for $\alpha = 1, \dots, \prod_{i=\ell}^{L-1} n_i$ do
        ${}_\alpha\xi^{(\ell)} \leftarrow W^{(\ell+1)}\, {}_\alpha x^{(\ell)} + b^{(\ell+1)} + \mathrm{Normal}(0, \Sigma_w)$
    end for
    for $\alpha = 0, \dots, \prod_{i=\ell+1}^{L-1} n_i - 1$ do
        ${}_\alpha y^{(\ell)} \xleftarrow{\text{averaging}} n_\ell^{-1}\big({}_{\alpha n_\ell + 1}\xi^{(\ell)} + \cdots + {}_{\alpha n_\ell + n_\ell}\xi^{(\ell)}\big)$
        ${}_\alpha x^{(\ell+1)} \leftarrow \sigma^{(\ell)}({}_\alpha y^{(\ell)}) + \mathrm{Normal}(0, \Sigma_a)$
    end for
end for
return ${}_1x^{(L)}$
Notice that noise stemming from applying the activation
function is subject to a linear transformation in the next layer.
The activation function noise can therefore be considered as
weight-noise by inserting an identity layer with $\sigma = \mathrm{id}$, $W = I$ and $b = 0$.
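The following is a minimal simulation sketch (ours) of how a Design A ONN could be evaluated, following the tree structure of Algorithm 1 but written top-down: layer $\ell$ averages the noisy weighted additions of $n_{\ell-1}$ independently evaluated copies of the preceding subtree before applying its activation. The network, noise levels and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def design_a(x, Ws, bs, acts, n, sigma_w=0.1, sigma_a=0.1, sigma_m=0.1, ell=None):
    """Evaluate a Design A ONN recursively.
    n[ell-1] independent copies of the subtree up to layer ell-1 feed layer ell;
    their noisy weighted additions are averaged before the activation."""
    L = len(Ws)
    if ell is None:
        ell = L
    if ell == 0:                                  # a fresh, independently modulated copy of the data
        return x + rng.normal(0.0, sigma_m, size=x.shape)
    terms = []
    for _ in range(n[ell - 1]):                   # independent branches feeding layer ell
        xi = design_a(x, Ws, bs, acts, n, sigma_w, sigma_a, sigma_m, ell - 1)
        terms.append(Ws[ell - 1] @ xi + bs[ell - 1]
                     + rng.normal(0.0, sigma_w, size=bs[ell - 1].shape))
    y = np.mean(terms, axis=0)                    # average the noisy weighted additions
    return acts[ell - 1](y) + rng.normal(0.0, sigma_a, size=y.shape)

# Small illustrative 4-3-2 network with n_0 = n_1 = 3 and n_2 = 1 copies.
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
bs = [np.zeros(3), np.zeros(2)]
acts = [np.tanh, np.tanh]
x = rng.normal(size=4)
print(design_a(x, Ws, bs, acts, n=[3, 3, 1]))
print(np.tanh(Ws[1] @ np.tanh(Ws[0] @ x + bs[0]) + bs[1]))   # noiseless reference
```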
We want to verify that a Design A ONN yields outputs
that are with high probability close to the original noiseless
NN. Let $\tilde{\Psi}^{\mathrm{ONN}}(x, w)$ denote the Design A ONN, and let
\[
\mathbb{P}\Big[\sup_{x \in \mathbb{R}^d} \big\lVert \Psi^{\mathrm{NN}}(x, w) - \tilde{\Psi}^{\mathrm{ONN}}(x, w) \big\rVert_2 < D_L\Big] > 1 - C_L \tag{9}
\]
be the desired property. The main result of this section is the
following:
Theorem 1. For any $C_L \in (0, 1)$, any $D_L \in (0, \infty)$, and any nonlinear NN $\Psi^{\mathrm{NN}}$ with Lipschitz-continuous activation functions with Lipschitz constants $a^{(i)}$ and weight matrices $W^{(i)}$, Algorithm 1 is able to construct an ONN $\tilde{\Psi}^{\mathrm{ONN}}$ that satisfies (9).
Let the covariance matrices of the occurring AWGN be diagonal matrices and let each of the values of the covariance matrices be upper bounded by $\sigma^2 \geq 0$. For any set of $(\kappa_i)_{i=1,\dots,L}$, $(\delta_i)_{i=1,\dots,L}$ such that $\prod_\ell (1 - \kappa_\ell) > 1 - C_L$ and $\sum_\ell \delta_\ell \leq D_L$, a sufficient number of copies to construct an ONN $\tilde{\Psi}^{\mathrm{ONN}}$ that satisfies (9) is given by $n_L = 1$ and
\[
n_\ell \geq \frac{\sigma^2 \Big(\prod_{i=\ell+1}^{L} a^{(i)} \prod_{i=\ell+2}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}\Big)^2}{\delta_{\ell+1}^2} \times \left( \frac{\sqrt{2}\,\Gamma((d_{\ell+1} + 1)/2)}{\Gamma(d_{\ell+1}/2)} + \sqrt{ \frac{C^2}{c} \cdot \frac{4 \sqrt[d_{\ell+1}]{4}}{2 \sqrt[d_{\ell+1}]{4} - 2} \cdot \big(-\ln(\kappa_{\ell+1}/2)\big) \cdot \frac{1}{\prod_{i=\ell+1}^{L} n_i} } \right)^2, \quad \ell = L-1, \dots, 0.
\]
Here $\Gamma$ is the gamma function and $C, c > 0$ are absolute constants.
This result is proven in Section 3.3. A consideration on the
asymptotic total amount of copies in deep ONNs is relegated
to Appendix A.2.1.
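To illustrate how the sufficient copy numbers of Theorem 1 can be evaluated, the sketch below computes $n_{L-1}, \dots, n_0$ recursively (later layers first, since $n_\ell$ depends on $\prod_{i=\ell+1}^{L} n_i$). The absolute constants $C$ and $c$ are not specified by the theorem; setting $C = c = 1$ here is purely an assumption for illustration, as are all numerical inputs.

```python
import math

def mu(d):
    """Mean of a chi distribution with d degrees of freedom, Eq. (17)."""
    return math.sqrt(2.0) * math.exp(math.lgamma((d + 1) / 2) - math.lgamma(d / 2))

def psi_sq(d):
    """Squared sub-gaussian norm of the chi-distributed variables used in the proof."""
    r = 4.0 ** (1.0 / d)
    return 4.0 * r / (2.0 * r - 2.0)

def sufficient_copies(a, W_op, dims, sigma, deltas, kappas, C=1.0, c=1.0):
    """Evaluate the recursive bound of Theorem 1 for n_{L-1}, ..., n_0.
    a[i], W_op[i], dims[i] hold a^(i+1), ||W^(i+1)||_op and d_{i+1} (0-based lists);
    C and c are the unknown absolute constants (set to 1 as an assumption)."""
    L = len(a)
    n = [1] * (L + 1)                       # n[L] = 1
    for ell in range(L - 1, -1, -1):
        prod_a = math.prod(a[ell:])         # prod_{i=ell+1}^{L} a^(i)
        prod_W = math.prod(W_op[ell + 1:])  # prod_{i=ell+2}^{L} ||W^(i)||_op
        m = math.prod(n[ell + 1:])          # prod_{i=ell+1}^{L} n_i
        d = dims[ell]                       # d_{ell+1}
        tail = math.sqrt(C ** 2 / c * psi_sq(d) * (-math.log(kappas[ell] / 2)) / m)
        n[ell] = math.ceil(sigma ** 2 * (prod_a * prod_W) ** 2 / deltas[ell] ** 2
                           * (mu(d) + tail) ** 2)
    return n

# Illustrative 3-layer example with Lipschitz-1 activations.
print(sufficient_copies(a=[1, 1, 1], W_op=[1.5, 1.5, 1.5], dims=[64, 32, 10],
                        sigma=0.1, deltas=[0.05] * 3, kappas=[0.01] * 3))
```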
3.2. Idea behind Design A
Having the law of large numbers in mind it seems reason-
able that the average of multiple experiments would help in
achieving a more precise output in the presence of noise. How-
ever, it would typically not be correct to just input $n$ identical, deterministic copies of $x$ into $n$ independent ONNs—thus producing $n$ noisy realizations $\Psi^{\mathrm{ONN},1}(x, w), \dots, \Psi^{\mathrm{ONN},n}(x, w)$ say—and then calculate their average in the hope of recovering $\Psi^{\mathrm{NN}}(x, w)$. This is because, while by the law of large numbers it is true that
\[
\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \Psi^{\mathrm{ONN},i}(x, w) = \mathbb{E}\big[\Psi^{\mathrm{ONN}}(x, w)\big],
\]
it is not necessarily true that the expectation $\mathbb{E}\big[\Psi^{\mathrm{ONN}}(x, w)\big]$ equals $\Psi^{\mathrm{NN}}(x, w)$. The reason is that activation functions in NNs are typically nonlinear.
We can circumvent the issue by modifying the approach
and instead exploit the law of large numbers layer-wise. Recall
that in the noiseless NN, layer $\ell$ maps a fixed input
\[
x \mapsto \sigma^{(\ell)}\big(W^{(\ell)} x + b^{(\ell)}\big), \tag{10}
\]
and that the same layer in the ONN maps the same fixed input
\[
x \mapsto \sigma^{(\ell)}\big(W^{(\ell)} x + b^{(\ell)} + N^{(\ell)}_w\big)
\]
instead. If we let $(N^{(i)})_{i \in \{1,\dots,n\}}$ be independent realizations of the distribution of $N^{(\ell)}_w$ (which has mean zero), we can expect by the law of large numbers that for sufficiently large $n$, the realized quantities
\[
\frac{1}{n} \sum_{i=1}^{n} \big(W^{(\ell)} x + b^{(\ell)} + N^{(i)}\big) \quad \text{and} \quad W^{(\ell)} x + b^{(\ell)}
\]
are close to each other. If $\sigma^{(\ell)}$ is moreover sufficiently regular, then we may expect that the realized quantity
\[
\sigma^{(\ell)}\Big(\frac{1}{n} \sum_{i=1}^{n} \big(W^{(\ell)} x + b^{(\ell)} + N^{(i)}\big)\Big) \tag{11}
\]
is close to (10) for sufficiently large $n$, i.e., close to the unperturbed output of the original layer.
The implementation in (11) can be realized by using $n$ times as many nodes in the hidden layer, thus essentially creating $n$ copies of the original hidden layer. These independent copies are then averaged. Furthermore, one can allow for different inputs $(x_i)_{i \in \{1,\dots,n\}}$, assuming some statistical properties of their distribution. This will be formalized next in the proof of Theorem 1 in Section 3.3.
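A small numerical check (ours, using ReLU as an example Lipschitz nonlinearity and assumed noise levels) illustrates the point: averaging the noisy outputs of the nonlinearity retains a bias, whereas averaging the noisy pre-activations before the nonlinearity, as in (11), does not.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100000, 0.5
relu = lambda z: np.maximum(z, 0.0)

W = np.array([[0.5, -0.5], [0.25, 0.25], [1.0, 1.0]])
x = np.array([0.3, 0.2])
pre = W @ x                                 # pre-activations W x (+ b = 0), some close to 0
noise = rng.normal(0.0, sigma, size=(n, pre.size))

# Averaging noisy *outputs* of the nonlinearity keeps a bias: E[relu(pre + N)] != relu(pre).
avg_of_outputs = relu(pre + noise).mean(axis=0)
# Averaging noisy pre-activations *before* the nonlinearity, as in Eq. (11), removes it.
output_of_avg = relu(pre + noise.mean(axis=0))

print(np.abs(avg_of_outputs - relu(pre)))   # noticeable bias for components near 0
print(np.abs(output_of_avg - relu(pre)))    # tends to 0 as n grows
```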
3.3. Proof of Theorem 1
For the proof we will first upper bound the deviation
between an ONN constructed with Design A and the noiseless
NN (Section 3.3.1), and then we find a probabilistic bound on the deviation bound (Section 3.3.2).
3.3.1. An upper bound for the ONN–NN deviation
The output of the Design A network is
\[
\tilde{x} = \sigma^{(L)}\Big(\frac{1}{n_{L-1}} \sum_{i=1}^{n_{L-1}} \big(W^{(L)} \tilde{x}_i + b^{(L)} + N^{(i)}\big)\Big), \tag{12}
\]
where each $\tilde{x}_i$ is recursively calculated as
\[
\tilde{x}_i = \sigma^{(L-1)}\Big(\frac{1}{n_{L-2}} \sum_{j_i=1}^{n_{L-2}} \big(W^{(L-1)} \tilde{x}_{j_i} + b^{(L-1)} + N^{(j_i)}\big)\Big),
\]
the $\tilde{x}_{j_i}$ are calculated as
\[
\tilde{x}_{j_i} = \sigma^{(L-2)}\Big(\frac{1}{n_{L-3}} \sum_{k_{j_i}=1}^{n_{L-3}} \big(W^{(L-2)} \tilde{x}_{k_{j_i}} + b^{(L-2)} + N^{(k_{j_i})}\big)\Big),
\]
and so on and so forth. The difference in $L_2$-norm of (12) and the noiseless NN
\[
\sigma^{(L)}\big(W^{(L)} \sigma^{(L-1)}(W^{(L-1)} \dots + b^{(L-1)}) + b^{(L)}\big)
\]
can iteratively be bounded by using the Lipschitz property of the activation functions, the triangle inequality, and submultiplicativity of the norms.
We start the iteration by bounding
\[
\begin{aligned}
&\Big\lVert \sigma^{(L)}\Big(\frac{1}{n_{L-1}} \sum_{i=1}^{n_{L-1}} \big(W^{(L)} \tilde{x}_i + b^{(L)} + N^{(i)}\big)\Big) - \sigma^{(L)}\big(W^{(L)} \sigma^{(L-1)}(W^{(L-1)} \dots + b^{(L-1)}) + b^{(L)}\big) \Big\rVert_2 \\
&\quad \leq a^{(L)} \Big\lVert \frac{1}{n_{L-1}} \sum_{i=1}^{n_{L-1}} \Big(W^{(L)} \big(\tilde{x}_i - \sigma^{(L-1)}(W^{(L-1)} \dots + b^{(L-1)})\big) + N^{(i)}\Big) \Big\rVert_2 \\
&\quad \leq \frac{a^{(L)} \lVert W^{(L)} \rVert_{\mathrm{op}}}{n_{L-1}} \sum_{i=1}^{n_{L-1}} \big\lVert \tilde{x}_i - \sigma^{(L-1)}(W^{(L-1)} \dots + b^{(L-1)}) \big\rVert_2 + a^{(L)} \Big\lVert \frac{1}{n_{L-1}} \sum_{i=1}^{n_{L-1}} N^{(i)} \Big\rVert_2.
\end{aligned}
\]
In the next iteration step, the term
\[
\sum_{i=1}^{n_{L-1}} \big\lVert \tilde{x}_i - \sigma^{(L-1)}(W^{(L-1)} \dots + b^{(L-1)}) \big\rVert_2
\]
is further bounded by first using the triangle inequality and thereafter bounding in the same way as we did in the first layer:
\[
\begin{aligned}
&\sum_{i=1}^{n_{L-1}} \Big\lVert \sigma^{(L-1)}\Big(\frac{1}{n_{L-2}} \sum_{j_i=1}^{n_{L-2}} \big(W^{(L-1)} \tilde{x}_{j_i} + b^{(L-1)} + N^{(j_i)}\big)\Big) - \sigma^{(L-1)}\big(W^{(L-1)} \sigma^{(L-2)}(W^{(L-2)} \dots + b^{(L-2)}) + b^{(L-1)}\big) \Big\rVert_2 \\
&\quad \leq \frac{a^{(L-1)} \lVert W^{(L-1)} \rVert_{\mathrm{op}}}{n_{L-2}} \sum_{i=1}^{n_{L-1}} \sum_{j_i=1}^{n_{L-2}} \big\lVert \tilde{x}_{j_i} - \sigma^{(L-2)}(W^{(L-2)} \dots + b^{(L-2)}) \big\rVert_2 + a^{(L-1)} \sum_{i=1}^{n_{L-1}} \Big\lVert \frac{1}{n_{L-2}} \sum_{j_i=1}^{n_{L-2}} N^{(j_i)} \Big\rVert_2.
\end{aligned}
\]
Here,
\[
\sum_{i=1}^{n_{L-1}} \sum_{j_i=1}^{n_{L-2}} \big\lVert \tilde{x}_{j_i} - \sigma^{(L-2)}(W^{(L-2)} \dots + b^{(L-2)}) \big\rVert_2
\]
may again be bounded in the same fashion. This leads to the
following recursive argument.
Let $F^{(\ell)}$ be the sum of the differences between—loosely speaking—the ends of the remaining Design A “subtrees” and the noiseless NN's “subtrees” at layer $\ell$. More specifically, let
\[
F^{(L)} := \big\lVert \tilde{x} - \sigma^{(L)}(W^{(L)} \dots + b^{(L)}) \big\rVert_2;
\]
\[
F^{(\ell)} := \sum_{i_L=1}^{n_{L-1}} \sum_{i_{L-1\,L}=1}^{n_{L-2}} \cdots \sum_{i_{\ell+2 \dots L-1\,L}=1}^{n_{\ell+1}} \sum_{i_{\ell+1\,\ell+2 \dots L-1\,L}=1}^{n_{\ell}} \big\lVert \tilde{x}_{i_{\ell+1\,\ell+2 \dots L-1\,L}} - \sigma^{(\ell)}(W^{(\ell)} \dots + b^{(\ell)}) \big\rVert_2, \quad \forall \ell = 1, \dots, L-1;
\]
the special case of $F^{(0)}$ will be considered in detail later. For simplicity, we join the sums outside the norm into one. Notice that because $n_L = 1$, we have $\prod_{k=\ell+1}^{L-1} n_k = \prod_{k=\ell+1}^{L} n_k$, and we can write
\[
F^{(\ell)} = \sum_{i=1}^{\prod_{k=\ell+1}^{L} n_k} \sum_{j_i=1}^{n_{\ell}} \big\lVert \tilde{x}_{j_i} - \sigma^{(\ell)}(W^{(\ell)} \dots + b^{(\ell)}) \big\rVert_2,
\]
where specifically
\[
\tilde{x}_{j_i} = \sigma^{(\ell)}\Big(\frac{1}{n_{\ell-1}} \sum_{k_{j_i}=1}^{n_{\ell-1}} \big(W^{(\ell)} \tilde{x}_{k_{j_i}} + b^{(\ell)} + N^{(k_{j_i})}\big)\Big),
\]
and the $j_i$ and $k_{j_i}$ are nothing more than relabelings.
Bounding $F^{(\ell)}$ using the triangle inequality, the Lipschitz property, and submultiplicativity yields
\[
\begin{aligned}
F^{(\ell)} &\leq a^{(\ell)} \sum_{i=1}^{\prod_{k=\ell+1}^{L} n_k} \sum_{j_i=1}^{n_\ell} \Big\lVert \frac{1}{n_{\ell-1}} \sum_{k_{j_i}=1}^{n_{\ell-1}} \Big( N^{(k_{j_i})} + W^{(\ell)} \big(\tilde{x}_{k_{j_i}} - \sigma^{(\ell-1)}(W^{(\ell-1)} \dots + b^{(\ell-1)})\big) \Big) \Big\rVert_2 \\
&\leq \frac{a^{(\ell)} \lVert W^{(\ell)} \rVert_{\mathrm{op}}}{n_{\ell-1}} F^{(\ell-1)} + a^{(\ell)} \sum_{i=1}^{\prod_{k=\ell+1}^{L} n_k} \sum_{j_i=1}^{n_\ell} \Big\lVert \frac{1}{n_{\ell-1}} \sum_{k_{j_i}=1}^{n_{\ell-1}} N^{(k_{j_i})} \Big\rVert_2. \tag{13}
\end{aligned}
\]
We thus found a recursive formula for the bound.
The recursion ends at $F^{(0)}$. The noiseless NN receives $x$ as input, while the ONN receives modulated input $x + N^{(j_i)}$, where $N^{(j_i)}$ is the modulation noise, i.e., AWGN. Therefore,
\[
F^{(0)} = \sum_{i=1}^{\prod_{k=1}^{L} n_k} \sum_{j_i=1}^{n_0} \big\lVert (x + N^{(j_i)}) - x \big\rVert_2 = \sum_{i=1}^{\prod_{k=1}^{L} n_k} \sum_{j_i=1}^{n_0} \big\lVert N^{(j_i)} \big\rVert_2. \tag{14}
\]
Observe that the $x$-dependence disappeared.
Readily iterating (13) leads to the bound
\[
F^{(L)} \leq \sum_{\ell=L, L-1, \dots, 1} \prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}} \times \frac{1}{\prod_{k=\ell}^{L} n_k} \cdot \frac{1}{n_{\ell-1}} \sum_{i=1}^{\prod_{k=\ell}^{L} n_k} \Big\lVert \sum_{j_i=1}^{n_{\ell-1}} N^{(j_i)} \Big\rVert_2.
\]
Therefore, if all the $L_2$-norms of the sums of the Gaussians are small at the same time, the network is close to the noiseless NN. Let
\[
S_\ell := \prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}} \times \frac{1}{\prod_{k=\ell}^{L} n_k} \cdot \frac{1}{n_{\ell-1}} \sum_{i=1}^{\prod_{k=\ell}^{L} n_k} \Big\lVert \sum_{j_i=1}^{n_{\ell-1}} N^{(j_i)} \Big\rVert_2.
\]
If for all $\ell$
\[
\mathbb{P}\big[S_\ell \leq \delta_\ell\big] > 1 - \kappa_\ell, \tag{15}
\]
and moreover $\sum_\ell \delta_\ell \leq D_L$ as well as $\prod_\ell (1 - \kappa_\ell) > 1 - C_L$, then (9) holds. This can be seen by bounding
\[
\mathbb{P}\Big[\sup_{x \in \mathbb{R}^d} \big\lVert \Psi^{\mathrm{NN}}(x, w) - \tilde{\Psi}^{\mathrm{ONN}}(x, w) \big\rVert_2 < D_L\Big] \geq \mathbb{P}\Big[\sum_\ell S_\ell < D_L\Big] \geq \mathbb{P}\Big[\bigcap_\ell \{S_\ell < \delta_\ell\}\Big] = \prod_\ell \mathbb{P}\big[S_\ell < \delta_\ell\big] > \prod_\ell (1 - \kappa_\ell) > 1 - C_L.
\]
Here, in the first inequality the dependence on $x$ disappears due to (14).
3.3.2. Bound for deviations
We next consider the $S_\ell$ for which we want to guarantee that
\[
\mathbb{P}\big[S_\ell < \delta_\ell\big] > 1 - \kappa_\ell.
\]
Let $m_\ell = \prod_{k=\ell}^{L} n_k$. By assumption the $N^{(j_i)}_k$ are independent and identically $\mathrm{Normal}(0, \sigma^2_k)$ distributed, where $\sigma^2_k \leq \sigma^2$ for some common $\sigma^2$. We are lower bounding the number of copies required; therefore, using AWGN with higher variance only increases the lower bound, as the calculations below show. We calculate the bound exemplarily for $N^{(j_i)}$ distributed according to $\mathrm{Normal}(0, \sigma^2)$; re-substituting $\sigma^2_k$ below in (18) (which is the bound given in Theorem 1) thus covers the case of $N^{(j_i)}_k \overset{(d)}{=} \mathrm{Normal}(0, \sigma^2_k)$.
Each component of the vector
\[
\sum_{j_i=1}^{n_{\ell-1}} N^{(j_i)} = \Big(\sum_{j_i=1}^{n_{\ell-1}} N^{(j_i)}_1, \dots, \sum_{j_i=1}^{n_{\ell-1}} N^{(j_i)}_{d_\ell}\Big)^{\intercal}
\]
is assumed to be $\mathrm{Normal}(0, n_{\ell-1}\sigma^2) = \sqrt{n_{\ell-1}}\,\sigma\,\mathrm{Normal}(0, 1)$ distributed. It then holds that
\[
\sum_{i=1}^{m_\ell} \Big\lVert \sum_{j_i=1}^{n_{\ell-1}} N^{(j_i)} \Big\rVert_2 \overset{(d)}{=} \sum_{i=1}^{m_\ell} \sqrt{n_{\ell-1}}\,\sigma\, \big\lVert \mathrm{Normal}(0, I_d) \big\rVert_2.
\]
This is a sum of independent chi-distributed random variables, which means they are sub-gaussian (see below that we can calculate the sub-gaussian norm and it is indeed finite). Thus Hoeffding's inequality applies, according to which, for $X_1, \dots, X_N$ independent, mean zero, sub-gaussian random variables and every $t \geq 0$,
\[
\mathbb{P}\Big[\sum_{i=1}^{N} X_i < t\Big] > 1 - 2\exp\Big(\frac{-c t^2}{\sum_{i=1}^{N} \lVert X_i \rVert^2_{\psi_2}}\Big) \tag{16}
\]
holds; see e.g. [37, Theorem 2.6.2]. Here $c > 0$ is an absolute constant (see [37, Theorem 2.6.2]) and
\[
\lVert X \rVert_{\psi_2} := \inf\big\{ t > 0 : \mathbb{E}[\exp(X^2/t^2)] \leq 2 \big\}.
\]
To apply Hoeffding's inequality in our setting, we need to center the occurring random variables. For $N^{(i)} \sim \mathrm{Normal}(0, I_d)$, the term $\lVert N^{(i)} \rVert_2$ is chi-distributed with mean
\[
\mu_d = \frac{\sqrt{2}\,\Gamma((d+1)/2)}{\Gamma(d/2)}, \tag{17}
\]
where $\Gamma$ is the gamma function; see e.g. [38, p. 238].
Consider
\[
\mathbb{P}\bigg[\frac{\prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}}{m_\ell \sqrt{n_{\ell-1}}}\, \sigma \sum_{i=1}^{m_\ell} \lVert N^{(i)} \rVert_2 < \delta_\ell\bigg]
= \mathbb{P}\bigg[\frac{\prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}}{m_\ell \sqrt{n_{\ell-1}}}\, \sigma \sum_{i=1}^{m_\ell} \big(\lVert N^{(i)} \rVert_2 - \mu_{d_\ell}\big) < \delta_\ell - \frac{\prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}}{m_\ell \sqrt{n_{\ell-1}}}\, \sigma\, m_\ell\, \mu_{d_\ell}\bigg],
\]
which equals
\[
\mathbb{P}\bigg[\sum_{i=1}^{m_\ell} \big(\lVert N^{(i)} \rVert_2 - \mu\big) < \frac{m_\ell \sqrt{n_{\ell-1}}\, \delta_\ell}{\sigma \prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}} - m_\ell\, \mu\bigg]
\]
and is lower bounded (compare to (16)) by
\[
1 - 2\exp\Bigg(\frac{-c \Big(\frac{m_\ell \sqrt{n_{\ell-1}}\, \delta_\ell}{\sigma \prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}} - m_\ell\, \mu_{d_\ell}\Big)^2}{\sum_{i=1}^{m_\ell} \big\lVert \lVert N^{(i)} \rVert_2 - \mu \big\rVert^2_{\psi_2}}\Bigg),
\]
which in turn is lower bounded by
\[
1 - 2\exp\Bigg(\frac{-c \Big(\frac{m_\ell \sqrt{n_{\ell-1}}\, \delta_\ell}{\sigma \prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}} - m_\ell\, \mu_{d_\ell}\Big)^2}{\sum_{i=1}^{m_\ell} C^2 \big\lVert \lVert N^{(i)} \rVert_2 \big\rVert^2_{\psi_2}}\Bigg),
\]
where $C > 0$ is an absolute constant (see [37, Lemma 2.6.8]).
For a chi-distributed random variable $X$ it holds that
\[
\mathbb{E}[\exp(X^2/t^2)] = M_{X^2}(1/t^2),
\]
where $M_{X^2}(s)$ is the moment generating function of $X^2$—a chi-squared distributed random variable. It is known (see e.g. [39, Appendix 13]) that
\[
M_{X^2}(s) = (1 - 2s)^{-d_\ell/2}
\]
for $s < \frac{1}{2}$. Accordingly, for $2 < t^2$, the property in the definition of the sub-gaussian norm,
\[
\mathbb{E}[\exp(X^2/t^2)] = \Big(1 - \frac{2}{t^2}\Big)^{-d_\ell/2} \leq 2,
\]
is satisfied for all $t$ for which
\[
t \geq \max\Bigg\{\sqrt{\frac{4\sqrt[d_\ell]{4}}{2\sqrt[d_\ell]{4} - 2}},\ \sqrt{2}\Bigg\} = \sqrt{\frac{4\sqrt[d_\ell]{4}}{2\sqrt[d_\ell]{4} - 2}}
\]
holds. The square of the sub-gaussian norm of the chi-distributed random variables is thus
\[
\big\lVert \lVert N^{(i)} \rVert_2 \big\rVert^2_{\psi_2} = \frac{4\sqrt[d_\ell]{4}}{2\sqrt[d_\ell]{4} - 2}.
\]
Substituting the norm into the lower bound yields
\[
1 - 2\exp\Bigg(\frac{-c \Big(\frac{m_\ell \sqrt{n_{\ell-1}}\, \delta_\ell}{\sigma \prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}} - m_\ell\, \mu_{d_\ell}\Big)^2}{C^2\, m_\ell\, \frac{4\sqrt[d_\ell]{4}}{2\sqrt[d_\ell]{4} - 2}}\Bigg).
\]
In order to achieve (15), a sufficient criterion is
\[
\frac{\kappa_\ell}{2} \geq \exp\Bigg(\frac{-c\, m_\ell \Big(\frac{\sqrt{n_{\ell-1}}\, \delta_\ell}{\sigma \prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}} - \mu_{d_\ell}\Big)^2}{C^2\, \frac{4\sqrt[d_\ell]{4}}{2\sqrt[d_\ell]{4} - 2}}\Bigg).
\]
Solving for $n_{\ell-1}$ leads to
\[
n_{\ell-1} \geq \frac{\sigma^2 \Big(\prod_{i=\ell}^{L} a^{(i)} \prod_{i=\ell+1}^{L} \lVert W^{(i)} \rVert_{\mathrm{op}}\Big)^2}{\delta_\ell^2} \times \Bigg( \sqrt{ \frac{C^2\, \frac{4\sqrt[d_\ell]{4}}{2\sqrt[d_\ell]{4} - 2}\, \big(-\ln(\kappa_\ell/2)\big)}{c\, m_\ell} } + \mu_{d_\ell} \Bigg)^2. \tag{18}
\]
If we substitute the expression in (17) for $\mu_{d_\ell}$, (18) becomes the bound as seen in Theorem 1. □
3.4. Conclusion
Within the context of the model described in Section 2,
we have established that any feed-forward NN can be approxi-
mated arbitrarily well by ONNs constructed using Design A.
This is Theorem 1, in essence.
This result has two consequences when it comes to the
physical implementation of ONNs. On the one hand, it is
guaranteed that the theoretical expressiveness of NNs can
be retained in practice. On the other hand, Design A allows
one to improve the accuracy of a noisy ONN to a desired
level, and in fact bring the accuracy arbitrarily close to that of
any state-of-the-art feed-forward noiseless NN. Let us finally
remark that the high bandwidth of photonic circuits may be
of use when implementing Design A.
4. Results—Design B
4.1. Reducing noise in feed-forward linear ONNs (Design B)
Recall that an example of Design B is presented in Fig-
ure 2(c). Algorithm 2 constructs this network, given a desired number of copies $m$ in each layer.
Calculating the output of a NN by using Design B first requires fixing a number $m$. The input data $x^{(0)}$ is then modulated $m$ times, creating $m$ noisy realizations of the input $({}_\alpha x^{(0)})_{\alpha=1,\dots,m}$. The weighted addition step and the activation function of each layer are singled out and copied $m$ times. Both the copies of the weighted addition step and of the activation function of each layer are arrayed parallel to each other and performed on the $m$ inputs, resulting in $m$ outputs. The $m$ parallel outputs of the weighted addition are merged into a single output, and afterwards split into $m$ pieces. The $m$ pieces are each sent to one of the $m$ activation function mechanisms for processing. The resulting $m$ activation values are the output of the layer. If it is the last layer, the $m$ activation values are merged to produce the final output. These steps are formally described in Algorithm 2. A schematic representation of Design B can be seen in Figure 2(c).

Algorithm 2 Algorithm to construct a noise reducing network
Require: Fix number $m \in \mathbb{N}$
Require: $m$ copies of input ${}_1x^{(0)}, \dots, {}_m x^{(0)}$
for $\ell = 1, \dots, L$ do
    for $\alpha = 1, \dots, m$ do
        ${}_\alpha\xi^{(\ell)} \leftarrow W^{(\ell)}\, {}_\alpha x^{(\ell-1)} + b^{(\ell)} + \mathrm{Normal}(0, \Sigma_w)$
    end for
    $y^{(\ell)} \xleftarrow{\text{combining}} {}_1\xi^{(\ell)} + \cdots + {}_m\xi^{(\ell)}$
    ${}_1y^{(\ell)}, \dots, {}_m y^{(\ell)} \xleftarrow{\text{splitting}} m^{-1} y^{(\ell)}$
    for $\alpha = 1, \dots, m$ do
        ${}_\alpha x^{(\ell)} \leftarrow \sigma^{(\ell)}({}_\alpha y^{(\ell)}) + \mathrm{Normal}(0, \Sigma_{\mathrm{act}})$
    end for
end for
return $m^{-1} \sum_{\alpha=1}^{m} {}_\alpha x^{(L)}$
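A minimal simulation sketch of Algorithm 2 (ours, with assumed noise levels, and with the combining/splitting noise terms $\Sigma_{\mathrm{sum}}$ and $\Sigma_{\mathrm{spl}}$ omitted for simplicity) could look as follows.

```python
import numpy as np

rng = np.random.default_rng(4)

def design_b(x, Ws, bs, acts, m, sigma_w=0.1, sigma_a=0.1, sigma_m=0.1):
    """Evaluate a Design B ONN with m copies per layer (Algorithm 2);
    combining/splitting noise is omitted for simplicity."""
    copies = [x + rng.normal(0.0, sigma_m, size=x.shape) for _ in range(m)]        # m modulated inputs
    for W, b, act in zip(Ws, bs, acts):
        xi = [W @ c + b + rng.normal(0.0, sigma_w, size=b.shape) for c in copies]  # m weighted additions
        y = np.mean(xi, axis=0)                      # combine into one signal and split into m equal parts
        copies = [act(y) + rng.normal(0.0, sigma_a, size=y.shape) for _ in range(m)]  # m noisy activations
    return np.mean(copies, axis=0)                   # merge the m outputs of the last layer

Ws = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
bs = [np.zeros(3), np.zeros(2)]
acts = [np.tanh, np.tanh]
x = rng.normal(size=4)
print(design_b(x, Ws, bs, acts, m=4))
print(np.tanh(Ws[1] @ np.tanh(Ws[0] @ x + bs[0]) + bs[1]))   # noiseless reference
```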
4.2. Analysis of Design B
We now consider the physical and mathematical conse-
quences of Design B.
Observe that in Design B, the $m$ weighted additions of the $\ell$-th layer's input $x^{(\ell-1)}$ result in realizations $({}_\alpha\xi^{(\ell)})_{\alpha=1,\dots,m}$ of $W^{(\ell)} x^{(\ell-1)} + b^{(\ell)} + \mathrm{Normal}(0, \Sigma_w)$. These realizations are then combined, resulting in
\[
m W^{(\ell)} x^{(\ell-1)} + m b^{(\ell)} + \mathrm{Normal}(0, m\Sigma_w) + \mathrm{Normal}(0, \Sigma_{\mathrm{sum}}).
\]
Splitting the signal again into $m$ parts, each signal carries information following the distribution
\[
W^{(\ell)} x^{(\ell-1)} + b^{(\ell)} + \mathrm{Normal}(0, m^{-1}\Sigma_w + m^{-2}\Sigma_{\mathrm{sum}}) + \mathrm{Normal}(0, \Sigma_{\mathrm{spl}}).
\]
The mean of the normal distribution therefore is the original network's pre-activation obtained from this input (that is, without perturbations). The covariance matrix of the normal distribution is $m^{-1}\Sigma_w + m^{-2}\Sigma_{\mathrm{sum}} + \Sigma_{\mathrm{spl}}$. Each of those signals is fed through the mechanism applying the activation function, yielding $m$ noisy versions of the output, distributed according to
\[
x^{(\ell)} \mid x^{(\ell-1)} \overset{(d)}{=} \sigma^{(\ell)}\big(W^{(\ell)} x^{(\ell-1)} + b^{(\ell)} + \mathrm{Normal}(0, m^{-1}\Sigma_w + m^{-2}\Sigma_{\mathrm{sum}} + \Sigma_{\mathrm{spl}})\big) + \mathrm{Normal}(0, \Sigma_{\mathrm{act}}).
\]
The effect of Design B is thus that $T^{(\ell)}(\Sigma)$ in (8) is replaced by
\[
T^{(\ell)}_m(\Sigma) := \frac{1}{m} D^{(\ell)} W^{(\ell)} \Sigma \big(D^{(\ell)} W^{(\ell)}\big)^{\intercal} + \frac{1}{m} D^{(\ell)} \Sigma_w (D^{(\ell)})^{\intercal} + \frac{1}{m^2} D^{(\ell)} \Sigma_{\mathrm{sum}} (D^{(\ell)})^{\intercal} + D^{(\ell)} \Sigma_{\mathrm{spl}} (D^{(\ell)})^{\intercal} + \Sigma_a;
\]
see Appendix A.2.2. Observe also that $\Sigma^{(\ell)}_a$ can be written as $(1/m)\, m \Sigma^{(\ell)}_a$. Therefore, if we substitute the matrix $\bar{\Sigma}^{(\ell)}_a = m \Sigma^{(\ell)}_a$ for $\Sigma^{(\ell)}_a$ in $T^{(\ell)}(\Sigma)$, we can write
\[
T^{(\ell)}_m(\Sigma) = m^{-1} T^{(\ell)}(\Sigma) + m^{-2} D^{(\ell)} \Sigma_{\mathrm{sum}} (D^{(\ell)})^{\intercal} + D^{(\ell)} \Sigma_{\mathrm{spl}} (D^{(\ell)})^{\intercal}.
\]
We have the following analogs to Proposition 1 and Corollary 1:

Theorem 2 (Distribution of Design B). Assume that there exist vectors $a^{(\ell)} \in \mathbb{R}^{d_\ell}$ such that $\sigma^{(\ell)}(y) = \mathrm{diag}(a^{(\ell)})\, y$. The feed-forward linear ONN constructed using Design B with $m$ copies then satisfies
\[
\Psi^{\mathrm{ONN}}_m(\cdot, w) \overset{(d)}{=} \mathrm{Normal}\big(\Psi^{\mathrm{NN}}(\cdot, w),\, \Sigma^{(L)}_{\mathrm{ONN},m}\big),
\]
where for $\ell = L, L-1, \dots, 1$,
\[
\Sigma^{(\ell)}_{\mathrm{ONN},m} = T^{(\ell)}_m\big(\Sigma^{(\ell-1)}_{\mathrm{ONN},m}\big); \quad \text{and} \quad \Sigma^{(0)}_{\mathrm{ONN},m} = \Sigma_m.
\]

Under the assumption of symmetric noise, a simplification of the recursion in Theorem 2, similar to that in Proposition 1, is possible. Assume $\Sigma_{\mathrm{sum}} = \Sigma_{\mathrm{spl}} = 0$. Introduce again $P^{(\ell)} := \prod_{i=\ell+1}^{L} D^{(i)} W^{(i)}$ for notational convenience. The following is proved in Appendix A.2:

Corollary 2 (Symmetric noise case). Assume that for all $\ell \in \mathbb{N}_+$, $\Sigma^{(\ell)}_a = \Sigma_a$ and $\Sigma^{(\ell)}_w = \Sigma_w$. Then,
\[
\Sigma^{(L)}_{\mathrm{ONN},m} = \sum_{\ell=1}^{L} m^{-(L-\ell+1)}\, P^{(\ell)}\, m\Sigma_a\, (P^{(\ell)})^{\intercal} + \sum_{\ell=1}^{L} m^{-(L-\ell+1)}\, P^{(\ell)} D^{(\ell)} \Sigma_w (D^{(\ell)})^{\intercal} (P^{(\ell)})^{\intercal} + m^{-L}\, P^{(0)} \Sigma_m (P^{(0)})^{\intercal}.
\]
We will next consider the limit of the covariance matrix in a large, symmetric linear ONN with Design B that we can grow infinitely deep. Algorithm 2 is namely able to guarantee boundedness of the covariance matrix in such deep ONNs if the parameter $m$ is chosen appropriately:

Corollary 3. Consider a linear ONN with Design B and parameter $m$, that has $L$ layers, and that satisfies the following symmetry properties: for all $\ell \in \{1, \dots, L\}$, $W^{(\ell)} = W$, $D^{(\ell)} = D$, $\Sigma^{(\ell)}_a = \Sigma_a$ and $\Sigma^{(\ell)}_w = \Sigma_w$. Then, if $\lVert D \rVert_F \lVert W \rVert_F < \sqrt{m}$, the limit $\lim_{L \to \infty} \Sigma^{(L)}_{\mathrm{ONN},m}$ exists. Moreover,
\[
\lim_{L \to \infty} \Sigma^{(L)}_{\mathrm{ONN},m} = \sum_{n=0}^{\infty} m^{-(n+1)} (DW)^n \big(D \Sigma_w D^{\intercal} + m\Sigma_a\big) \big((DW)^n\big)^{\intercal}.
\]
Notice that the bound on the number of copies needed
for the covariance matrix of an ONN to converge to a limit
is independent of e.g. the Frobenius norms of the covariance
matrices that describe the noise distributions. This is because,
here, we are not interested in bounding the covariance matrix
to a specific level; instead, we are merely interested in the
existence of a limit.
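The sketch below (ours, with diagonal $W$ and $D$ chosen so that the behaviour is transparent, and $\Sigma_{\mathrm{sum}} = \Sigma_{\mathrm{spl}} = 0$) iterates the recursion of Theorem 2 for increasingly deep linear ONNs. It shows the covariance norm diverging for $m = 1$ and stabilizing for larger $m$; for these diagonal choices the norm already stabilizes well below the sufficient Frobenius-norm condition of Corollary 3.

```python
import numpy as np

d = 4
delta, omega = 1.5, 1.5
D = delta * np.eye(d)                      # linear activations, sigma(y) = D y
W = omega * np.eye(d)                      # weight matrix (diagonal for a transparent example)
Sigma_m = Sigma_w = Sigma_a = 0.01 * np.eye(d)

def T_m(Sigma, m):
    """The Design B map T_m^(ell) with Sigma_sum = Sigma_spl = 0."""
    DW = D @ W
    return (DW @ Sigma @ DW.T) / m + (D @ Sigma_w @ D.T) / m + Sigma_a

frob_condition = (np.linalg.norm(D) * np.linalg.norm(W)) ** 2   # Corollary 3 asks m > this (= 81 here)
for m in (1, 6, int(np.ceil(frob_condition))):
    Sigma = Sigma_m
    for _ in range(60):                    # grow the linear ONN deeper and deeper
        Sigma = T_m(Sigma, m)
    print(m, np.linalg.norm(Sigma))        # m = 1 blows up; m = 6 and m = 81 stay bounded
```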
4.3. Discussion & Conclusion
Compared to Theorem 2’s recursive description of the
covariance matrix in any linear ONN with Design B, Corollary 2 provides a series that describes the covariance matrix in any linear, symmetric ONN with Design B. While the result holds more restrictively, it is more insightful. For example, it allows us to consider the limit of the covariance matrix in extremely deep ONNs (see Corollary 3). Corollary 3 suggests that in deep ONNs with Design B, one should choose $m \approx \lceil (\lVert D \rVert_F \lVert W \rVert_F)^2 \rceil$ in order to control the noise and not be too inefficient with the number of copies.
These results essentially mean that in a physical implemen-
tation of an increasingly deep and linear ONN, the covariance
matrix can be reduced (and thus remain bounded) by applying
Design B with multiple copies. The quality of the ONN’s
output increases as the number of copies in Design B (or
Design A for that matter) is increased. Finally, it is worth
mentioning that Design B could potentially be implemented
such that it leverages the enormous bandwidth of optics.
5. Simulations
We investigate the improvements in output quality achieved by Designs A and B on a benchmark example: the convolutional neural network LeNet [40]. As measures of quality we consider the Mean Squared Error (MSE)
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big\lVert \Psi^{\mathrm{ONN}}(x^{(i)}, w) - \Psi^{\mathrm{NN}}(x^{(i)}, w) \big\rVert^2
\]
and the prediction accuracy
\[
\frac{\#\{\text{correctly classified images}\}}{\#\{\text{images}\}}.
\]
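For completeness, a short sketch (ours) of how these two quality measures can be computed from batched outputs; the array names and the placeholder data are assumptions.

```python
import numpy as np

def mse(onn_outputs, nn_outputs):
    """Mean squared error between ONN and noiseless NN outputs, averaged over n inputs."""
    return np.mean(np.sum((onn_outputs - nn_outputs) ** 2, axis=1))

def accuracy(onn_outputs, labels):
    """Fraction of correctly classified images."""
    return np.mean(np.argmax(onn_outputs, axis=1) == labels)

# Example with random placeholders for 100 ten-class outputs.
rng = np.random.default_rng(6)
nn_out = rng.normal(size=(100, 10))
onn_out = nn_out + rng.normal(0.0, 0.1, size=nn_out.shape)
labels = rng.integers(0, 10, size=100)
print(mse(onn_out, nn_out), accuracy(onn_out, labels))
```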
5.1. Empirical variance
We extracted plausible values for $\Sigma_w$ and $\Sigma_a$ from the ONN implementation [41] of a 2-layer NN for classification of the Modified National Institute of Standards and Technology (MNIST) database [42]. In [41], the authors trained the NN on a classical computer and implemented the trained weights afterwards in an ONN. They then tuned noise (with the same noise model as in Section 2 of this paper) into the noiseless computer model, assuming that $\Sigma_w = \mathrm{diag}(\sigma^2_w)$ and $\Sigma_a = \mathrm{diag}(\sigma^2_a)$. They found $\sigma_w \in [0.08, 0.1] \cdot d$ and $\sigma_a \in [0.1, 0.15] \cdot d$ to reach the same accuracy levels as the ONN, where $d$ denotes the diameter of the range.
5.2. LeNet ONN: Performance when implemented via Designs A and B
Convolutional NNs can be regarded as feedforward NNs
by stacking the (2D or 3D) images into column vectors and
arranging the filters into a weight matrix. Thus Designs A and B are well-defined for convolutional NNs. We apply the designs to LeNet5 [40], which is trained for classifying the handwritten digits in the MNIST dataset [42]. The layers are:
1. 2D convolutional layer with kernel size 5, stride 1 and 2-padding. The output has 6 channels of 28x28 pixel representations, with the activation function being tanh;
2. average pooling layer, pooling 2x2 blocks; the output therefore is 14x14;
3. 2D convolutional layer with kernel size 5, stride 1 and no padding. The output has 16 channels of 10x10 pixel representations and the activation function is tanh;
4. average pooling layer, pooling 2x2 blocks; the output therefore is 5x5;
5. 2D convolutional layer with kernel size 5, stride 1 and no padding. The output has 120 channels of 1-pixel representations and the activation function used is tanh;
(5.) flattening layer, which turns the 120 one-dimensional channels into one 120-dimensional vector;
6. dense layer with 84 neurons and tanh activation function;
7. dense layer with 10 neurons and softmax activation function (a code sketch of this architecture follows below).
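For reference, a sketch of this architecture in PyTorch (our own rendering of the list above; the use of torch is an assumption and the noise model of Section 2 is not included here):

```python
import torch
import torch.nn as nn

# LeNet5 as listed above: conv(tanh) -> avgpool -> conv(tanh) -> avgpool
# -> conv(tanh) -> flatten -> dense(tanh) -> dense(softmax).
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2), nn.Tanh(),   # 6 x 28 x 28
    nn.AvgPool2d(2),                                                  # 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.Tanh(),             # 16 x 10 x 10
    nn.AvgPool2d(2),                                                  # 16 x 5 x 5
    nn.Conv2d(16, 120, kernel_size=5, stride=1), nn.Tanh(),           # 120 x 1 x 1
    nn.Flatten(),                                                     # 120
    nn.Linear(120, 84), nn.Tanh(),                                    # 84
    nn.Linear(84, 10), nn.Softmax(dim=1),                             # 10 class probabilities
)

print(lenet5(torch.zeros(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
```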
Figures 3 and 4 show the MSE and the prediction accuracy of Designs A and B for an increasing number of copies, respectively.
For simplicity, we set all individual copy counts $n_i$ per layer $i$ in Design A equal to $m$, that is, $n_i = m$ for all $i$. The total number of copies that Design A starts with then is $m^L$. Here $L$ is equal to 7. In Design B the number of copies is $m$ per layer and the total number of copies is $mL$. In the case of one copy, Designs A and B are identical to the original network, while we focus on the effect once the designs deviate from the original network ($m \geq 2$).
The axes in Figures 3 and 4 denote the number of copies per layer. Here, we scale the copies per layer for Design A linearly, because the total amount of copies for Design A grows exponentially, and we scale the copies per layer for Design B exponentially, because the total number of copies for Design B grows linearly. This way the comparison is on equal terms.
Figure 3 displays the MSE seen for LeNet, depending on the amount of copies for each design. In the trade-off between the additional resources needed for the additional copies and the diminishing benefits of adding further copies, we see that, for both measures, MSE (Figure 3) and relative accuracy (Figure 4), already 2 to 5 copies per layer yield good results. The relative accuracy in Figure 4 is scaled such that 0 corresponds to the accuracy of the original NN with noise profile (i.e., the ONN without modifications; we call this the original ONN) and 1 to the accuracy of the original NN without noise. The designs do not alter the fundamental operation of the original NN; therefore there should be no performance gain, and the original NN's accuracy should be considered the highest achievable, thus constituting the upper bound in relative accuracy of 1. Likewise, the lowest accuracy should be given by the original ONN, as there is no noise reduction involved.

Figure 3: MSE ($\cdot\,10^{2}$) for Design A (top) and Design B (bottom) as a function of copies on LeNet5 trained for MNIST classification. The pale area contains the 95%-confidence intervals.

Figure 4: Relative accuracy for Design A (top) and Design B (bottom) as a function of copies on LeNet5 trained for MNIST classification. The pale area contains the 56.5%-confidence intervals.
5.3. Effect of additional layers in LeNet
In order to investigate how the depth affects the noise
at the output, while keeping the operation of the network
the same to ensure the results are commensurable, we insert
additional layers with identity matrix and identity activation
function (we will call them identity layers) into a network.
Figure 5: Accuracy of LeNet ONNs, depending on the amount of inserted identity layers and the variance level of the ONN, for (a) a network with tanh activation function and one copy, (b) a network with ReLU activation function and one copy, (c) a network with linear activation function and one copy, (d) a network with tanh activation function and two copies, (e) a network with ReLU activation function and two copies, (f) a network with linear activation function and two copies.
Specifically, we take networks with the LeNet architecture
as in Section 5.2, using different activation functions, while
fixing the output layer to be
softmax
. We then insert identity
layers between layers 1 and 2, 3 and 4, 5 and 6, as well as
between layers 6 and 7. For a fixed total of additional layers,
the layers are inserted in the four spots between layers 1&2, 3&4, 5&6, and 7&8 according to the tuple
\[
n \mapsto \Big(\Big\lfloor \frac{n+3}{4} \Big\rfloor, \Big\lfloor \frac{n+2}{4} \Big\rfloor, \Big\lfloor \frac{n+1}{4} \Big\rfloor, \Big\lfloor \frac{n}{4} \Big\rfloor\Big).
\]
The insertion pattern is illustrated in Table 1:
# of additional layers 1&2 3&4 5&6 7&8
1 1 0 0 0
2 1 1 0 0
3 1 1 1 0
4 1 1 1 1
5 2 1 1 1
6 2 2 1 1
... ... ... ... ...
Table 1: Insertion pattern.
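The rows of Table 1 follow directly from the tuple above; a small helper (ours) reproduces them:

```python
def insertion_pattern(n):
    """Number of identity layers inserted at the four spots 1&2, 3&4, 5&6 and 7&8
    for a total of n additional layers, following the tuple above."""
    return ((n + 3) // 4, (n + 2) // 4, (n + 1) // 4, n // 4)

for n in range(1, 7):
    print(n, insertion_pattern(n))   # reproduces the rows of Table 1
```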
Finally, we tune the variance terms of the covariance matrix
in our noise model. The results are displayed in Figure 5.
In Figure 5, we observe that the tanh and the ReLU networks perform as expected. Additional noisy layers decrease the accuracy, and thus the same level of performance can only be achieved if the variance is lower. This trend can also be seen in the linear network, but to a lesser extent.
5.4. Simulations on effective values for Design B
According to Corollary 3, the covariance matrix of a linear ONN constructed by Design B is bounded if $m > (\lVert D \rVert_F \lVert W \rVert_F)^2$, and therefore $m = \lceil (\lVert D \rVert_F \lVert W \rVert_F)^2 \rceil$ is sufficient to ensure that the covariance matrix of the output distribution $\Psi^{\mathrm{ONN}}_m(\cdot, w) \overset{(d)}{=} \mathrm{Normal}(\Psi^{\mathrm{NN}}(\cdot, w), \Sigma^{(L)}_{\mathrm{ONN},m})$ in Theorem 2 is bounded in linear NNs. This is derived by using submultiplicativity of the norm (see (A.8)) and is therefore possibly a loose bound. We use the exact relation given by Corollary 2 for the covariance matrix in Theorem 2 to investigate the lowest values of $m$ for which the covariance matrix starts being bounded. In Figure 6 we depict a linear NN with constant width 4. We vary the values of $\lVert D \rVert_F$ and $\lVert W \rVert_F$. Upon close inspection we see that the lowest value for $m$ seems to be $g(x, y) \simeq \lceil (xy)^2 / \lVert I_d \rVert_F^4 \rceil$, where $I_d$ is the identity matrix of dimension $d$; see Figure 6. Because $\lVert I_d \rVert_F = \sqrt{d}$, the value for $m$ found numerically is $m \simeq \lceil (\lVert D \rVert_F \lVert W \rVert_F / d)^2 \rceil$.
6. Discussion & Conclusion
Design A, introduced in Section 3, guarantees an approx-
imation property (Theorem 1). This is achieved through
technical machinery to control the noise, even though there
are nonlinear activation functions involved. This method is
powerful enough to yield the universal approximation property,
as NNs can be approximated arbitrarily well with ONNs that
are constructed through the first design, and NNs themselves can approximate any continuous function arbitrarily well [43, Theorem 1]. Our mathematical guarantee, however, only states a sufficient number of copies, and this number grows exponentially as the number of layers increases.

Figure 6: The contour lines denote the lowest $m$ for which $\lVert \Sigma^{(L)}_{\mathrm{ONN},m} \rVert_F$ stops growing exponentially, as a function of $\lVert W \rVert_F$ and $\lVert D \rVert_F$.
We then introduced Design B in Section 4, in which the growth of the number of copies is much more benign. However, the analysis of Design B was restricted to linear NNs, and Design B might therefore not be expressive enough to have the universal approximation property. Linear NNs, or NNs with algebraic polynomials as activation functions for that matter, namely do not possess the universal approximation property. On the flip side, the assumption of linear activation functions did allow us to characterize the distribution of the output exactly (Theorem 2).
In short, in this paper we have discussed the noise present in ONNs and described a mathematical model for the noise. We also investigated the numerical implications of the mathematical model, with a specific focus on the effects of depth (Figure 5). The proposed noise reduction schemes yield greater accuracy, and the theoretical results (Theorem 1 and Corollary 3) guarantee that ONNs work just like noiseless NNs in the many-copies limit. With the designs and findings of Sections 3 to 4 we have a framework to exploit known NN wisdom, as no new training is required. Further research should address optimization algorithms that take the noise of ONNs into account, to investigate the regularization, generalization and minimization properties of trained ONNs.
Acknowledgments
This research was supported by the European Union’s
Horizon 2020 research and innovation programme under the
Marie Skłodowska-Curie grant agreement no. 945045, and by
the NWO Gravitation project NETWORKS under grant no.
024.002.003.
We would finally like to thank Bin Shi for advice on the noise level parameters of ONNs for our simulations. Furthermore, we want to thank Albert Senen–Cerda, Sanne van Kempen, and Alexander Van Werde for feedback on a draft version of this document.
References
[1]
K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image
Recognition, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016.
[2]
J. Long, E. Shelhamer, T. Darrell, Fully Convolutional Networks for
Semantic Segmentation, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2015.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
et al., Human-level control through deep reinforcement learning,
Nature (2015).
[4]
H. Nam, B. Han, Learning Multi-Domain Convolutional Neural Net-
works for Visual Tracking, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2016.
[5]
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, et al., Google’s
Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation, arXiv preprint arXiv:1609.08144
(2016).
[6]
J. Hasler, H. Marr, Finding a roadmap to achieve large neuromorphic
hardware systems, Frontiers in Neuroscience (2013).
[7]
Z. Du, D. Rubin, Y. Chen, L. He, T. Chen, L. Zhang, C. Wu,
O. Temam, Neuromorphic Accelerators: A Comparison Between
Neuroscience and Machine-Learning Approaches, in: MICRO-48:
Proceedings of the 48th International Symposium on Microarchi-
tecture, 2015.
[8]
F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur,
P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam, B. Taba,
M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk,
B. Jackson, D. S. Modha, TrueNorth: Design and Tool Flow of a
65 mW 1 Million Neuron Programmable Neurosynaptic Chip, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems (2015).
[9]
B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chan-
drasekaran, J.-M. Bussat, R. Alvarez Icaza, J. V. Arthur, P. A.
Merolla, K. Boahen, Neurogrid: A Mixed-Analog-Digital Multichip
System for Large-Scale Neural Simulations, Proceedings of the
IEEE (2014).
[10]
S. Furber, F. Galluppi, S. Temple, L. Plana, The SpiNNaker Project,
Proceedings of the IEEE (2014).
[11]
P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy,
J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Naka-
mura, et al., A million spiking-neuron integrated circuit with a
scalable communication network and interface, Science (2014).
[12]
J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, S. Millner,
A wafer-scale neuromorphic hardware system for large-scale neural
modeling, in: 2010 IEEE International Symposium on Circuits and
Systems (ISCAS), 2010.
[13]
S. A. Siddiqui, S. Dutta, A. Tang, L. Liu, C. A. Ross, M. A. Baldo,
Magnetic Domain Wall Based Synaptic and Activation Function
Generator for Neuromorphic Accelerators, Nano Letters (2019).
[14]
T. De Lima, B. Shastri, A. Tait, M. Nahmias, P. Prucnal, Progress
in neuromorphic photonics, Nanophotonics (2017).
[15]
K. Kitayama, M. Notomi, M. Naruse, K. Inoue, S. Kawakami,
A. Uchida, Novel frontier of photonics for data processing—Photonic
accelerator, APL Photonics (2019).
[16]
M. Miscuglio, J. Meng, O. Yesiliurt, Y. Zhang, L. J. Prokopeva,
A. Mehrabian, J. Hu, A. V. Kildishev, V. J. Sorger, Artificial
Synapse with Mnemonic Functionality using GSST-based Photonic
Integrated Memory, in: 2020 International Applied Computational
Electromagnetics Society Symposium (ACES), 2020.
[17]
B. J. Shastri, A. N. Tait, T. F. de Lima, M. A. Nahmias, H.-T.
Peng, P. R. Prucnal, Principles of Neuromorphic Photonics, arXiv
preprint arXiv:1801.00016 (2017).
[18] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, M. Soljačić, Deep Learning with Coherent Nanophotonic Circuits, Nature Photonics (2017).
[19] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, D. Englund, Large-Scale Optical Neural Networks Based on Photoelectric Multiplication, Physical Review X (2019).
[20] L. Bernstein, A. Sludds, R. Hamerly, V. Sze, J. Emer, D. Englund, Freely scalable and reconfigurable optical hardware for deep learning, Scientific Reports (2021).
[21] C. Huang, S. Fujisawa, T. F. De Lima, A. N. Tait, E. Blow, Y. Tian, S. Bilodeau, A. Jha, F. Yaman, H. G. Batshon, et al., Demonstration of photonic neural network for fiber nonlinearity compensation in long-haul transmission systems, in: 2020 Optical Fiber Communications (OFC) Conference and Exhibition, 2020.
[22] B. Shi, N. Calabretta, R. Stabile, Deep Neural Network through an InP SOA-Based Photonic Integrated Cross-Connect, IEEE Journal of Selected Topics in Quantum Electronics (2019).
[23] B. Shi, K. Prifti, E. Magalhães, N. Calabretta, R. Stabile, Lossless Monolithically Integrated Photonic InP Neuron for All-Optical Computation, in: Optical Fiber Communication Conference, 2020.
[24] B. Shi, N. Calabretta, R. Stabile, First Demonstration of a Two-Layer All-Optical Neural Network by Using Photonic Integrated Chips and SOAs, in: 45th European Conference on Optical Communication (ECOC 2019), 2019.
[25] B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. Pernice, H. Bhaskaran, C. Wright, P. R. Prucnal, Photonics for artificial intelligence and neuromorphic computing, Nature Photonics (2021).
[26] A. N. Tait, T. F. De Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, P. R. Prucnal, Neuromorphic photonic networks using silicon photonic weight banks, Scientific Reports (2017).
[27] R.-J. Essiambre, G. Kramer, P. Winzer, G. Foschini, B. Goebel, Capacity Limits of Optical Fiber Networks, Journal of Lightwave Technology (2010).
[28] X. Li, R. Mardling, J. Armstrong, Channel Capacity of IM/DD Optical Communication Systems and of ACO-OFDM, in: 2007 IEEE International Conference on Communications, 2007.
[29] T. de Lima, A. Tait, H. Saeidi, M. Nahmias, H. Peng, S. Abbaslou, B. Shastri, P. Prucnal, Noise Analysis of Photonic Modulator Neurons, IEEE Journal of Selected Topics in Quantum Electronics (2019).
[30] I. Chakraborty, G. Saha, A. Sengupta, K. Roy, Toward Fast Neural Computing using All-Photonic Phase Change Spiking Neurons, Scientific Reports (2018).
[31] N. Passalis, M. Kirtas, G. Mourgias-Alexandris, G. Dabos, N. Pleros, A. Tefas, Training Noise-Resilient Recurrent Photonic Networks for Financial Time Series Analysis, in: 2020 28th European Signal Processing Conference (EUSIPCO), 2021.
[32] G. Mourgias-Alexandris, A. Tsakyridis, N. Passalis, A. Tefas, K. Vyrsokinos, N. Pleros, An all-optical neuron with sigmoid activation function, Optics Express (2019).
[33] O. A. Manita, M. A. Peletier, J. W. Portegies, J. Sanders, A. Senen-Cerda, Universal approximation in dropout neural networks, Journal of Machine Learning Research (2022).
[34] R. J. Muirhead, Aspects of Multivariate Statistical Theory, 2009.
[35] N. Semenova, L. Larger, D. Brunner, Understanding and mitigating noise in trained deep neural networks, Neural Networks (2022).
[36] N. Semenova, X. Porte, L. Andreoli, M. Jacquot, L. Larger, D. Brunner, Fundamental aspects of noise in analog-hardware neural networks, Chaos: An Interdisciplinary Journal of Nonlinear Science (2019).
[37] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Cambridge University Press, 2018.
[38] M. L. Abell, J. P. Braselton, J. A. Rafter, Statistics with Mathematica, Academic Press, 1999.
[39] C. Clapham, J. Nicholson, J. R. Nicholson, The Concise Oxford Dictionary of Mathematics, Oxford University Press, 2014.
[40] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE (1998).
[41] B. Shi, N. Calabretta, R. Stabile, InP photonic integrated multi-layer neural networks: Architecture and performance analysis, APL Photonics (2022).
[42] L. Deng, The MNIST database of handwritten digit images for machine learning research, IEEE Signal Processing Magazine (2012).
[43] M. Leshno, V. Y. Lin, A. Pinkus, S. Schocken, Multilayer Feedforward Networks with Non-Polynomial Activation Function Can Approximate Any Function, Neural Networks (1993).
[44] S. Banach, Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales, Fundamenta Mathematicae (1922).
Appendix A
A.1. Proofs of Section 2
A.1.1. Proof of Corollary 1
The first conclusion of Corollary 1 follows immediately from expanding the recursion.
Assume that for all $\ell \in \mathbb{N}$, $\mathrm{diag}(a^{(\ell)}) = a$ and $W^{(\ell)} = W$, as well as $\Sigma_a^{(\ell)} = \Sigma_a$ and $\Sigma_w^{(\ell)} = \Sigma_w$. Let $T := T^{(\ell)}$ be the common map under those conditions.
To prove the second conclusion of Corollary 1, observe that for any two matrices $X$ and $Y$ of the same dimension as $a$ and $W$,
\[
\|T(X) - T(Y)\|_F = \|aW(X - Y)(aW)^\intercal\|_F \le \|a\|_F^2\,\|W\|_F^2\,\|X - Y\|_F \tag{A.1}
\]
by submultiplicativity of the Frobenius norm. Let us now consider the setting of Proposition 1 for a moment, that is, we initialize $X_1 = \Sigma_m$ and calculate $X_{\ell+1} = T(X_\ell)$ recursively. The sequence $X_1, X_2, \ldots$ converges if $\|a\|_F\|W\|_F < 1$, as a consequence of the Banach fixed point theorem [44] combined with (A.1). We may therefore consider the unique fixed point $X^\star := \lim_{\ell\to\infty} X_\ell$.
It must satisfy the fixed point equation $T(X^\star) = X^\star$, which reads
\[
X^\star - (aW)\,X^\star\,(aW)^\intercal = a\Sigma_w a^\intercal + \Sigma_a.
\]
Equivalently,
\[
\mathrm{vec}(X^\star) - \big((aW)\otimes(aW)\big)\,\mathrm{vec}(X^\star) = \mathrm{vec}(a\Sigma_w a^\intercal) + \mathrm{vec}(\Sigma_a).
\]
Here, $\otimes$ denotes the Kronecker product and $\mathrm{vec}$ the vectorization of a matrix (effectively, we stack the columns of the matrix on top of one another). This vectorization trick allows us to write the solution to the fixed point equation as
\[
\mathrm{vec}(X^\star) = \big(I - (aW)^{\otimes 2}\big)^{-1}\,\mathrm{vec}\big(a\Sigma_w a^\intercal + \Sigma_a\big). \tag{A.2}
\]
Here, $(aW)^{\otimes 2}$ denotes $(aW)\otimes(aW)$.
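As a sanity check, the following minimal NumPy sketch (with assumed toy dimensions and noise covariances) verifies the vectorization identity $\mathrm{vec}(aW\,X\,(aW)^\intercal) = ((aW)\otimes(aW))\,\mathrm{vec}(X)$ and confirms that solving the linear system in (A.2) yields a matrix satisfying $T(X^\star) = X^\star$. It is only an illustration under these assumptions, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4
# Hypothetical small instance: a diagonal "activation slope" matrix and a weight
# matrix, scaled so that ||a||_F * ||W||_F < 1 (the contraction condition above).
a = np.diag(rng.uniform(0.1, 0.3, size=d))
W = rng.normal(size=(d, d))
W *= 0.9 / (np.linalg.norm(a, 'fro') * np.linalg.norm(W, 'fro'))

Sigma_w = 0.01 * np.eye(d)  # assumed weight-noise covariance
Sigma_a = 0.01 * np.eye(d)  # assumed activation-noise covariance

def vec(M):
    return M.flatten(order='F')  # stack the columns, as in the text

def T(S):
    return aW @ S @ aW.T + a @ Sigma_w @ a.T + Sigma_a

aW = a @ W
X = rng.normal(size=(d, d))
X = X @ X.T  # an arbitrary covariance-like matrix

# Vectorization identity: vec(aW X (aW)^T) = ((aW) ⊗ (aW)) vec(X).
assert np.allclose(vec(aW @ X @ aW.T), np.kron(aW, aW) @ vec(X))

# Closed-form fixed point from (A.2), then check T(X*) = X*.
I = np.eye(d * d)
x_star = np.linalg.solve(I - np.kron(aW, aW), vec(a @ Sigma_w @ a.T + Sigma_a))
X_star = x_star.reshape((d, d), order='F')
assert np.allclose(T(X_star), X_star)
print("fixed point of (A.2) verified")
```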
Formally, we rewrite the inverse in (A.2) as a von Neumann series,
\[
\big(I - (aW)^{\otimes 2}\big)^{-1} = \sum_{n=0}^{\infty}\big((aW)^{\otimes 2}\big)^{n}. \tag{A.3}
\]
This is, however, justified only if
\[
\big\|\big((aW)^{\otimes 2}\big)^{n}\big\|_F \to 0, \tag{A.4}
\]
which we verify next.
For the Kronecker product it holds that $\mathrm{tr}(A\otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B)$. Therefore, $\|(aW)^{\otimes 2}\|_F = \mathrm{tr}\big((aW)^\intercal(aW)\big) = \|aW\|_F^2$ by definition of the Frobenius norm. Furthermore, by submultiplicativity, $\|((aW)^{\otimes 2})^{n}\|_F \le \|(aW)^{\otimes 2}\|_F^{n}$. Thus, by the assumption that $\|a\|_F\|W\|_F < 1$, condition (A.4) holds and the expression in (A.3) is proper. This leads to the representation of $X^\star$ as
\[
\mathrm{vec}(X^\star) = \sum_{n=0}^{\infty}\big((aW)^{\otimes 2}\big)^{n}\,\mathrm{vec}\big(a\Sigma_w a^\intercal + \Sigma_a\big)
= \mathrm{vec}\big(a\Sigma_w a^\intercal + \Sigma_a\big) + \sum_{n=1}^{\infty}\big((aW)^{\otimes 2}\big)^{n}\,\mathrm{vec}\big(a\Sigma_w a^\intercal + \Sigma_a\big).
\]
Returning to matrix notation, we have
\[
X^\star = a\Sigma_w a^\intercal + \Sigma_a + \sum_{n=1}^{\infty}(aW)^{n}\big(a\Sigma_w a^\intercal + \Sigma_a\big)\big((aW)^{n}\big)^\intercal.
\]
This proves the second conclusion of Corollary 1.
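The series representation can likewise be checked numerically. The sketch below, again with assumed toy values satisfying $\|a\|_F\|W\|_F < 1$, iterates the recursion $X_{\ell+1} = T(X_\ell)$ and compares its limit against a truncation of the series above.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
# Assumed example values; only the contraction condition ||a||_F ||W||_F < 1 matters.
a = np.diag(rng.uniform(0.2, 0.4, size=d))
W = rng.normal(size=(d, d))
W *= 0.8 / (np.linalg.norm(a, 'fro') * np.linalg.norm(W, 'fro'))
Sigma_w, Sigma_a = 0.02 * np.eye(d), 0.05 * np.eye(d)

aW = a @ W
C = a @ Sigma_w @ a.T + Sigma_a  # constant term of the recursion

# Iterate X_{l+1} = T(X_l) starting from an arbitrary covariance X_1.
X = np.eye(d)
for _ in range(200):
    X = aW @ X @ aW.T + C

# Truncated series X* = C + sum_{n>=1} (aW)^n C ((aW)^n)^T.
X_series = C.copy()
M = np.eye(d)
for n in range(1, 200):
    M = M @ aW
    X_series += M @ C @ M.T

assert np.allclose(X, X_series)
print("recursion limit matches the series representation")
```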
A.2. Additional material for Section 4
A.2.1. Additional considerations on Design A
To bound the total number of copies needed in Design A to guarantee (9), we multiply all the $n_\ell$ from Theorem 1. To simplify the terms, we upper bound
\[
\frac{\sqrt{2}\,\Gamma\big((d_{\ell+1}+1)/2\big)}{\Gamma\big(d_{\ell+1}/2\big)}, \qquad \ell \in \{0,\ldots,L-1\},
\]
by a constant $D$ (assuming the sequence of $d_\ell$ is bounded).
We also replace the remaining $\ell$-dependent factor from Theorem 1, which involves $C_c$, $d_{\ell+1}$, $\ln\kappa_{\ell+1}$ and $1/\prod_{i=\ell+1}^{L} n_i$,
by a constant E. If the total number of copies satisfies
\[
\prod_{\ell=0}^{L} n_\ell \;\ge\; \frac{\prod_{\ell=1}^{L}\big(a^{(\ell)}\big)^{2\ell}\,\prod_{\ell=1}^{L}\big\|W^{(\ell)}\big\|_{\mathrm{op}}^{2(\ell-1)}}{\prod_{\ell=1}^{L}\delta_\ell^{2}} \times \sigma^{2L}(D+E)^{2L},
\]
then we are able to construct an ONN $\tilde{\Psi}_{\mathrm{ONN}}$ that satisfies (9). The product $\prod_{\ell=1}^{L}\delta_\ell^{2}$ is maximized if all $\delta_\ell$ are equal, that is, $\delta_\ell = D_L/L$. We furthermore upper bound $\prod_{\ell=1}^{L}\big(a^{(\ell)}\big)^{2\ell}$ by $a^{2L^{2}}$ and $\prod_{\ell=1}^{L}\big\|W^{(\ell)}\big\|_{\mathrm{op}}^{2(\ell-1)}$ by $W^{2L^{2}}$. We then have
\[
N = \prod_{\ell=0}^{L} n_\ell = \omega\big(K^{2L+2L^{2}}L^{L}\big) = \omega\big(K^{2L(L+1)}L^{L}\big).
\]
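To give a feeling for this bound, the sketch below evaluates the displayed lower bound for a set of purely illustrative, assumed parameter values (the constants $D$, $E$, the bounds $a$ and $W$, the noise level $\sigma$ and the per-layer accuracies $\delta_\ell$ are all placeholders); its only purpose is to show how quickly the required total number of copies grows with the depth $L$.

```python
import numpy as np

# Illustrative evaluation of the lower bound on the total number of copies
# prod_l n_l from the display above. All numbers below are assumptions chosen
# only to show the growth of the bound with the depth L.
L = 4            # number of layers
a_bar = 1.2      # assumed common bound on a^{(l)}
W_op = 1.5       # assumed common bound on ||W^{(l)}||_op
delta = 0.1      # assumed per-layer accuracy delta_l (all equal)
sigma = 0.05     # assumed noise level
D, E = 2.0, 1.0  # the constants D and E introduced above (assumed values)

numerator = np.prod([a_bar**(2 * l) for l in range(1, L + 1)]) \
          * np.prod([W_op**(2 * (l - 1)) for l in range(1, L + 1)])
denominator = delta**(2 * L)
bound = numerator / denominator * sigma**(2 * L) * (D + E)**(2 * L)
print(f"total number of copies must be at least {bound:.3e}")
```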
A.2.2. Deriving the covariance matrix for Design B
We now derive $T^{(\ell)}_m(\Sigma)$, the transformation that the covariance matrix of an input undergoes as it becomes the output of layer $\ell$. Recall that this input is distributed as $\mathrm{Normal}(x^{(\ell-1)}, \Sigma)$. Denote the random variable for the pre-activation (from which the realizations are drawn) after joining and splitting beams by $P^{(\ell)}$. Then
\begin{align*}
P^{(\ell)} \mid x^{(\ell-1)} &\overset{(d)}{=} m^{-1}\Big[\sum_{i=1}^{m}\Big(W^{(\ell)}\,\mathrm{Normal}\big(x^{(\ell-1)},\Sigma\big) + \mathrm{Normal}(0,\Sigma_w)\Big) + \mathrm{Normal}(0,\Sigma_{\mathrm{sum}})\Big] + \mathrm{Normal}(0,\Sigma_{\mathrm{spl}}) \\
&\overset{(d)}{=} m^{-1}\Big[\mathrm{Normal}\big(mW^{(\ell)}x^{(\ell-1)},\, mW^{(\ell)}\Sigma(W^{(\ell)})^\intercal\big) + \mathrm{Normal}(0, m\Sigma_w) + \mathrm{Normal}(0,\Sigma_{\mathrm{sum}})\Big] + \mathrm{Normal}(0,\Sigma_{\mathrm{spl}}) \\
&\overset{(d)}{=} \mathrm{Normal}\Big(W^{(\ell)}x^{(\ell-1)},\; m^{-1}\big(W^{(\ell)}\Sigma(W^{(\ell)})^\intercal + \Sigma_w + m^{-1}\Sigma_{\mathrm{sum}}\big) + \Sigma_{\mathrm{spl}}\Big).
\end{align*}
The random variable $P^{(\ell)}$ is then channeled through the activation function, which subsequently adds another noise term. The resulting activation is the random variable $A^{(\ell)}$,
\begin{align*}
A^{(\ell)} \mid x^{(\ell-1)} &\overset{(d)}{=} \sigma^{(\ell)}\big(P^{(\ell)}\big) + \mathrm{Normal}(0,\Sigma_a) \\
&\overset{(d)}{=} D^{(\ell)}W^{(\ell)}x^{(\ell-1)} + \mathrm{Normal}\Big(0,\; \frac{D^{(\ell)}W^{(\ell)}\Sigma\big(D^{(\ell)}W^{(\ell)}\big)^\intercal}{m} + \frac{D^{(\ell)}\Sigma_w\big(D^{(\ell)}\big)^\intercal}{m} + \frac{D^{(\ell)}\Sigma_{\mathrm{sum}}\big(D^{(\ell)}\big)^\intercal}{m^{2}} + D^{(\ell)}\Sigma_{\mathrm{spl}}\big(D^{(\ell)}\big)^\intercal + \Sigma_a\Big) \\
&\overset{(d)}{=} \mathrm{Normal}\Big(\Phi^{\mathrm{NN}}_{\ell}\big(x^{(\ell-1)}\big),\; T^{(\ell)}_m(\Sigma)\Big).
\end{align*}
As we can see, instead of
\[
T^{(\ell)}(\Sigma) := D^{(\ell)}W^{(\ell)}\Sigma\big(D^{(\ell)}W^{(\ell)}\big)^\intercal + D^{(\ell)}\Sigma_w^{(\ell)}\big(D^{(\ell)}\big)^\intercal + \Sigma_a^{(\ell)},
\]
we have
\[
T^{(\ell)}_m(\Sigma) := m^{-1}D^{(\ell)}W^{(\ell)}\Sigma\big(D^{(\ell)}W^{(\ell)}\big)^\intercal + m^{-1}D^{(\ell)}\Sigma_w\big(D^{(\ell)}\big)^\intercal + m^{-2}D^{(\ell)}\Sigma_{\mathrm{sum}}\big(D^{(\ell)}\big)^\intercal + D^{(\ell)}\Sigma_{\mathrm{spl}}\big(D^{(\ell)}\big)^\intercal + \Sigma_a.
\]
A.2.3. Proof of Corollary 2
As Corollary 2 is the Design B analog of Corollary 1, the proofs are similar. The first expression in Corollary 2 is again immediate from expansion. For the limit we use the same Banach fixed-point argument, where only the variables have to be exchanged. The following executes these steps.
Assume again that for all $\ell \in \mathbb{N}$, $D^{(\ell)} = D$ and $W^{(\ell)} = W$, as well as $\Sigma_a^{(\ell)} = \Sigma_a$ and $\Sigma_w^{(\ell)} = \Sigma_w$. Let $T_m := T^{(\ell)}_m$ be the common map under those conditions.
Recall (A.1). In the setting of Theorem 2, that is, $X_1 = \Sigma_m$ and $X_{\ell+1} = T_m(X_\ell)$, the so-defined sequence converges if $\|D\|_F\|W\|_F < \sqrt{m}$ (see also below (A.8)), due to (A.1) and the Banach fixed point theorem [44]. We therefore let the unique fixed point be $\lim_{\ell\to\infty} X_\ell = X^\star$.
We can write the fixed point equation $T_m(X^\star) = X^\star$ as
\[
X^\star - m^{-1}(DW)X^\star(DW)^\intercal = m^{-1}D\Sigma_w D^\intercal + \Sigma_a + m^{-2}D\Sigma_{\mathrm{sum}}D^\intercal + D\Sigma_{\mathrm{spl}}D^\intercal,
\]
and further write it as
\[
\mathrm{vec}(X^\star) - m^{-1}\big((DW)\otimes(DW)\big)\,\mathrm{vec}(X^\star) = \mathrm{vec}\big(m^{-1}D\Sigma_w D^\intercal + \Sigma_a + m^{-2}D\Sigma_{\mathrm{sum}}D^\intercal + D\Sigma_{\mathrm{spl}}D^\intercal\big).
\]
Here, $\otimes$ again denotes the Kronecker product and $\mathrm{vec}$ the vectorization of a matrix. Applying the vectorization trick as in the proof of Corollary 1 allows us to write the solution to the fixed point equation as
\[
\mathrm{vec}(X^\star) = \big(I - m^{-1}(DW)^{\otimes 2}\big)^{-1}\,\mathrm{vec}\big(m^{-1}D\Sigma_w D^\intercal + \Sigma_a + m^{-2}D\Sigma_{\mathrm{sum}}D^\intercal + D\Sigma_{\mathrm{spl}}D^\intercal\big). \tag{A.5}
\]
Again, $(DW)^{\otimes 2}$ denotes $(DW)\otimes(DW)$.
Formally, we rewrite the inverse in (A.5) as a von Neumann series,
\[
\big(I - m^{-1}(DW)^{\otimes 2}\big)^{-1} = \sum_{n=0}^{\infty}\big(m^{-1}(DW)^{\otimes 2}\big)^{n}. \tag{A.6}
\]
This is again only justified if
\[
\big\|\big(m^{-1}(DW)^{\otimes 2}\big)^{n}\big\|_F \to 0. \tag{A.7}
\]
By submultiplicativity it holds that
\[
\big\|\big(m^{-1}(DW)^{\otimes 2}\big)^{n}\big\|_F \le \big\|m^{-1}(DW)^{\otimes 2}\big\|_F^{n}. \tag{A.8}
\]
For the Kronecker product it holds that $\mathrm{tr}(A\otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B)$, and thus $\|(DW)^{\otimes 2}\|_F = \mathrm{tr}\big((DW)^\intercal(DW)\big) = \|DW\|_F^{2}$ by definition of the Frobenius norm. Therefore, by our assumption that $\|D\|_F\|W\|_F < \sqrt{m}$, condition (A.7) holds and (A.6) is valid. To simplify the notation we let $\Sigma_{\mathrm{sum}} = \Sigma_{\mathrm{spl}} = 0$, leading to the representation of $X^\star$ as
\[
\mathrm{vec}(X^\star) = \sum_{n=0}^{\infty} m^{-n}\big((DW)^{\otimes 2}\big)^{n}\,\mathrm{vec}\big(m^{-1}D\Sigma_w D^\intercal + \Sigma_a\big).
\]
Returning to matrix notation, we have
\[
X^\star = \sum_{n=0}^{\infty} m^{-n}(DW)^{n}\big(m^{-1}D\Sigma_w D^\intercal + \Sigma_a\big)\big((DW)^{n}\big)^\intercal.
\]
This completes the proof of Corollary 2. □
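As with Corollary 1, the fixed point can be verified numerically. The short sketch below (assumed toy values with $\|D\|_F\|W\|_F < \sqrt{m}$ and $\Sigma_{\mathrm{sum}} = \Sigma_{\mathrm{spl}} = 0$) iterates $T_m$ and compares the limit with a truncation of the series representation of $X^\star$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 3, 4
# Assumed values with ||D||_F ||W||_F < sqrt(m), the contraction condition for T_m.
D = np.diag(rng.uniform(0.5, 0.9, size=d))
W = rng.normal(size=(d, d))
W *= 0.9 * np.sqrt(m) / (np.linalg.norm(D, 'fro') * np.linalg.norm(W, 'fro'))
Sigma_w, Sigma_a = 0.02 * np.eye(d), 0.05 * np.eye(d)  # Sigma_sum = Sigma_spl = 0

DW = D @ W
C = D @ Sigma_w @ D.T / m + Sigma_a

# Iterate T_m and compare with the series representation of X*.
X = np.eye(d)
for _ in range(300):
    X = DW @ X @ DW.T / m + C

X_series, M = C.copy(), np.eye(d)
for n in range(1, 300):
    M = M @ DW
    X_series += m**(-n) * M @ C @ M.T

assert np.allclose(X, X_series)
print("Design B fixed point matches the series representation")
```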