Scientic Reports | (2021) 11:3144 | 
www.nature.com/scientificreports
Freely scalable and recongurable
optical hardware for deep learning
Liane Bernstein1,5*, Alexander Sludds1,5*, Ryan Hamerly1,2, Vivienne Sze1, Joel Emer3,4 &
Dirk Englund1*
As deep neural network (DNN) models grow ever-larger, they can achieve higher accuracy and solve more complex problems. This trend has been enabled by an increase in available compute power; however, efforts to continue to scale electronic processors are impeded by the costs of communication, thermal management, power delivery and clocking. To improve scalability, we propose a digital optical neural network (DONN) with intralayer optical interconnects and reconfigurable input values. The path-length-independence of optical energy consumption enables information locality between a transmitter and a large number of arbitrarily arranged receivers, which allows greater flexibility in architecture design to circumvent scaling limitations. In a proof-of-concept experiment, we demonstrate optical multicast in the classification of 500 MNIST images with a 3-layer, fully-connected network. We also analyze the energy consumption of the DONN and find that digital optical data transfer is beneficial over electronics when the spacing of computational units is on the order of >10 µm.
Machine learning has become ubiquitous in modern data analysis, decision-making, and optimization. A prominent subset of machine learning is the artificial deep neural network (DNN), which has revolutionized many fields, including classification1, translation2 and prediction3,4. An important step toward unlocking the full potential of DNNs is improving the energy consumption and speed of DNN tasks. To this end, emerging DNN-specific hardware5–8 optimizes data access, reuse and communication for mathematical operations: most importantly, general matrix–matrix multiplication (GEMM) and convolution9. However, despite these advances, a central challenge in the field is scaling hardware to keep up with exponentially-growing DNN models10 (see Fig. 1) due to electronic communication11, clocking12, thermal management13 and power delivery14.
To overcome these electronic limitations, optical systems have previously been proposed to perform linear algebra and data transmission. Analog weighting of optical inputs can be implemented with masks, holography or optical interference using acousto-optic modulation15–18, spatial light modulation19, electro-optic or thermo-optic modulation20–23, phase-change materials24 or printed diffractive elements25. Due to their analog nature, system errors can decrease the accuracy of large DNN models processed on this hardware. Prior works in digital optical interconnects have focused on integrated point-to-point connections26,27, free-space point-to-point transmission28,29, and small-scale free-space multicast30. These ideas would be difficult to scale since they incur significant overhead in the number of components and introduce compounded component losses.
In this Article, we introduce a novel optical DNN accelerator that encodes inputs and weights into reconfigurable on-off optical pulses. Free-space optical elements passively transmit and copy data from memory to large-scale electronic multiplier arrays (fan-out). The length-independence of this optical data routing enables freely scalable systems, where single transmitters are fanned out to many arbitrarily arranged receivers with fast and energy-efficient links. This system architecture is similar to our previous coherent optical neural network23, but in contrast to this work and the other analog schemes described above, we propose an entirely digital system. Incoherent optical paths for data transmission (not computation) replace electrical on-chip interconnects, and can thus preserve accuracy. Unlike prior digital optical interconnect systems, our 'digital optical neural network' (DONN) uses free-space fan-out for data distribution to a large number of receivers for the specific application of matrix multiplication of the type found in modern DNNs.
We rst illustrate the DONN architecture and discuss possible implementations. en, in a proof-of-concept
experiment, we demonstrate that digital optical transmission and fan-out with cylindrical lenses has little eect
on the classication accuracy of the MNIST handwritten digit dataset (< 0.6%). Crosstalk is the primary cause of
These authors contributed equally: Liane Bernstein and Alexander Sludds. *email: lbern@mit.edu; asludds@mit.edu; englund@mit.edu
Content courtesy of Springer Nature, terms of use apply. Rights reserved
this drop in accuracy, and because it is deterministic, it can be compensated: with a simple crosstalk correction
scheme, we reduce our bit error rates by two orders of magnitude. Alternatively, crosstalk can be greatly reduced
through optimized optical design. Since shot and thermal noise are negligible (see “Discussion”), the accuracy
of the DONN can therefore be equivalent to an all-electronic DNN accelerator.
We also compare the energy consumption of optical interconnects (including light source energy) against that of electronic interconnects over distances representative of logic, multi-chiplet interconnects and multi-chip interconnects in a 7 nm CMOS node. Multiple chips44 or partitioned chips45,46 are regularly employed to process large networks since they can ease electronic constraints and improve performance over a monolithic equivalent through greater mapping flexibility47, at the cost of increased communication energy. Our calculations show an advantage in data transmission costs for distances ≥5 µm (roughly the size of the basic computation unit: an 8-bit multiply-and-accumulate (MAC), with length 5–8 µm). The DONN thus scales favorably with respect to very large DNN accelerators: the DONN's optical communication cost for an 8-bit MAC, i.e., the energy to transmit two 8-bit values, remains constant at ~3 fJ/MAC, whereas multi-chiplet systems have much higher electrical interconnect costs (~1000 fJ/MAC), and multi-chip systems have a higher energy consumption still (~30,000 fJ/MAC). Thus, the efficient optical data distribution provided by the DONN architecture will become critical for continued growth of DNN performance through increased model sizes and greater connectivity.
Results
Problem statement. A DNN consists of a sequence of layers, in which input activations from one layer are connected to the next layer via weighted paths (weights), as shown in Fig. 2a. We focus on inference tasks in this paper (where weights are known from prior training), which, in addition to the energy consumption problem, place stringent requirements on latency and throughput. Modern inference accelerators expend the majority of energy (>90%) on memory access, data movement, and computation in fully-connected (FC) and convolutional (CONV) layers5.
Parallelized vector operations, such as matrix–matrix multiplication or successive vector–vector inner products, are the largest energy consumers in CONV and FC layers. In an FC layer, a vector x of input values ('input activations', of length K) is multiplied by a matrix W_{K×N} of weights (Fig. 2b). This matrix–vector product yields a vector of output activations (y, of length N). Most DNN accelerators process vectors in B-sized batches, where the inputs are represented by a matrix X_{B×K}. The FC layer then becomes a matrix–matrix multiplication (X_{B×K} · W_{K×N}). CONV layers can also be processed as matrix multiplications, e.g., with a Toeplitz matrix9.
In matrix multiplication, fan-out, where data is read once from main memory (DRAM) and used multiple times, can greatly reduce data movement and memory access. This amortization of read cost across numerous operations is critical for overall efficiency, since retrieving a single matrix element from DRAM requires two to three orders of magnitude more energy than the MAC11. A simple input-weight product illustrates the benefit of fan-out, since activation and weight elements appear repeatedly, as highlighted by the repetition of X_{11} and W_{11}:

$$\begin{pmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix} \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} = \begin{pmatrix} X_{11}W_{11} + X_{12}W_{21} & X_{11}W_{12} + X_{12}W_{22} \\ X_{21}W_{11} + X_{22}W_{21} & X_{21}W_{12} + X_{22}W_{22} \end{pmatrix} \quad (1)$$
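The reuse behind this input-weight product can be counted directly from the triple loop of a naive matrix multiplication (an illustrative sketch with arbitrary small dimensions, not the paper's code):

```python
from collections import Counter

B, K, N = 2, 3, 4                        # arbitrary batch, input, output sizes
uses = Counter()

# Naive matmul Y[b][n] = sum_k X[b][k] * W[k][n]; count each operand read.
for b in range(B):
    for n in range(N):
        for k in range(K):
            uses[("X", b, k)] += 1       # activation element read
            uses[("W", k, n)] += 1       # weight element read

# Every activation element is used N times and every weight element B times,
# so reading each operand once and fanning it out amortizes the DRAM access.
assert uses[("X", 0, 0)] == N and uses[("W", 0, 0)] == B
```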
[Figure 1: log-scale plot of the number of model parameters (roughly 10^5 to 10^11) by year (2012–2020) for AlexNet, VGG16, GoogLeNet, ResNet-50, DQN, Inception V3, Xception, Transformer (Base), Transformer (Big), NASNet, SENet, BERT, Transformer-XL, GPT-2, ALBERT and GPT-3.]

Figure 1. Number of parameters, i.e., weights, in recent landmark neural networks1,2,31–43 (references dated by first release, e.g., on arXiv). The number of multiplications (not always reported) is not equivalent to the number of parameters, but larger models tend to require more compute power, notably in fully-connected layers. The two outlying nodes (pink) are AlexNet and VGG16, now considered over-parameterized. Subsequently, efforts have been made to reduce DNN sizes, but there remains an exponential growth in model sizes to solve increasingly complex problems with higher accuracy.
Consequently, DNN hardware design focuses on optimizing data transfer and input and weight matrix element reuse. Accelerators based on conventional electronics use efficient memory hierarchies, a large array of tightly packed processing elements (PEs, i.e., multipliers with or without local storage), or some combination of these approaches. Memory hierarchies optimize temporal data reuse in memory blocks near the PEs to boost performance under the constraint of chip area9. This strategy can enable high throughput in CONV layers5. With fewer intermediate memory levels, a larger array of PEs (e.g., TPU v18) can further increase throughput and lower energy consumption on workloads with a high-utilization mapping, due to potentially reduced overall memory accesses and a greater number of parallel multipliers (spatial reuse). Therefore, for workloads with large-scale matrix multiplication such as those mentioned in the Introduction, if we maximize the number of available PEs, we can improve efficiency.
Digital optical neural network architecture. Our DONN architecture replaces electrical interconnects with optical links to relax the design constraints of reducing inter-multiplier spacing or colocating multipliers with memory. Specifically, optical elements transfer and fan out activation and weight bits to electronic multipliers to reduce communication costs in matrix multiplication, where each element X_{bk} is fanned out N times, and W_{kn} is fanned out B times. The DONN scheme shown in Fig. 2c spatially encodes the first column of X_{B×K} activations into a column of on-off optical pulses. At the first time step, the activation matrix transmitters fan out the first bit of each of the matrix elements X_{b1}, b ∈ {1...B} to the PEs (here, k = 1). Simultaneously, a row of weight matrix light sources transmits the corresponding weight bits W_{1n} to each PE. The photons from these activation and weight bits generate photoelectrons in the detectors, producing the voltages required at the inputs of electronic multipliers (either 0 V for a '0' or 0.8 V for a '1'). After 8 time steps, a multiplier has received 2×8 bits (8 bits for the activation value and 8 bits for the weight value), and the electronic multiplication occurs as it would in an all-electronic system. The activation-weight product is completed, and is added to the locally stored partial sum. The entire matrix–matrix product is therefore computed in 8×K time steps; this dataflow is commonly called 'output stationary'. Instead of this bit-serial implementation, bits can be encoded spatially, using a bus of parallel transmitters and receivers. The trade-off between added energy and latency in bit-serial multiplication versus increased area from photodetectors for a parallel multiplier can be analyzed for specific applications and CMOS nodes.
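The output-stationary, bit-serial dataflow described above can be simulated in a few lines. This is a behavioral sketch, not the hardware: the function name and test sizes are our own, and unsigned 8-bit operands are assumed:

```python
import numpy as np

def bit_serial_matmul(X, W, bits=8):
    """Output-stationary bit-serial matmul sketch: for each k, stream the
    bits of X[:, k] and W[k, :] over 'bits' time steps (one optical pulse
    per bit per link), then multiply electronically and accumulate into
    the stationary partial sums. Total: bits * K time steps."""
    B, K = X.shape
    K2, N = W.shape
    assert K == K2
    Y = np.zeros((B, N), dtype=np.int64)
    for k in range(K):
        xa = np.zeros(B, dtype=np.int64)     # activation shift registers
        wa = np.zeros(N, dtype=np.int64)     # weight shift registers
        for t in range(bits):                # receive one bit per time step
            xa = (xa << 1) | ((X[:, k] >> (bits - 1 - t)) & 1)
            wa = (wa << 1) | ((W[k, :] >> (bits - 1 - t)) & 1)
        Y += np.outer(xa, wa)                # local multiply + accumulate
    return Y

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(3, 5))
W = rng.integers(0, 256, size=(5, 4))
assert np.array_equal(bit_serial_matmul(X, W), X @ W)
```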
We illustrate an exemplary experimental DONN implementation in Fig. 3. Each source in a linear array of vertical cavity surface emitting lasers (VCSELs) or µLEDs emits a cone of light into free space, which is collimated by a spherical lens. A diffractive optical element (DOE) focuses the light to a 1D spot array on a 2D receiver, where the activations and weights are brought into close proximity using a beamsplitter. 'Receiverless' photodetectors48 convert the optical signals to the electrical domain. An electronic multiplier then multiplies the values. The output is either saved to memory, or routed directly to another DONN that implements the next layer of computation. Note that the data distribution pattern is not confined to regular rows and columns. A spatial light modulator (SLM), an array of micromirrors, scattering waveguides or a DOE can route and fan out bits to arbitrary locations. Since free-space propagation is lossless and mirrors, SLMs and diffractive elements
[Figure 2: schematic panels showing (a) an FC-NN diagram; (b) the single and batch matrix products y = x·W and Y = X·W with dimensions K, N and B; (c) optical versus electronic fan-out to a bit-serial multiplier array.]

Figure 2. Digital fully-connected neural network (FC-NN) and hardware implementations. (a) FC-NN with input activations (red, vector length K) connected to output activations (vector length N) via weighted paths, i.e., weights (blue, matrix size K×N). (b) Matrix representation of one layer of an FC-NN with B-sized batching. (c) Example bit-serial multiplier array, with output-stationary accumulation across k. Fan-out of X across n ∈ {1...N}; fan-out of W across b ∈ {1...B}. Bottom panel: all-electronic version with fan-out by copper wire (for clarity, fan-out of W not illustrated). Top panel: digital optical neural network version, where X and W are fanned out passively using optics, and transmitted to an array of photodetectors. Each pixel contains two photodetectors, where the activations and weights can be separated by, e.g., polarization or wavelength filters. Each photodetector pair is directly connected to a multiplier in close proximity.
are highly ecient (> 95%), most length- or receiver-number-dependent losses can be attributed to imperfect
focusing, e.g., from optical aberrations far from the optical axis. ese eects can be mitigated through judicious
optical design. We assume for the remainder of our analysis that energy is length-independent.
Bit error rate and inference experiments. We used a DONN implementation similar to Fig. 3a to test optical digital data transmission and fan-out for DNNs, as described in "Methods". In our first experiment, we determined the bit error rate of our system. Figure 4a shows an example of a background-subtracted and normalized image, captured on the camera when the digital micromirror devices (DMDs) displayed random vectors of '1's and '0's. The camera's de-Bayering algorithm (described in "Methods"), as well as optical aberrations and misalignment, caused some crosstalk between pixels (see Fig. 4b). Using a region of 357×477 superpixels on the camera, we calculated bit error rates (in a single shot) of 1.2×10⁻² and 2.6×10⁻⁴ for the blue and red channels, respectively. When we confined the region of interest to 151×191 superpixels, the bit error rate (averaged over 100 different trials, i.e., 100 pairs of input vectors) was 4.4×10⁻³ and 4.6×10⁻⁵ for the blue and red arms. See
[Figure 3 schematic: (a) source arrays for W (n = 1...N) and X (b = 1...B), each followed by a lens and DOE, combined through a beamsplitter onto a multiplier array whose outputs go to memory or the next layer; (b) receiver circuit with photodetectors biased at Vbias driving Vout nodes under supply VDD.]
Figure3. Possible implementation of digital optical neural network. (a)Digital inputs and weights are
transmitted electronically to an array of light sources (red and blue, respectively, illustrating dierent paths).
Single-mode light from a source is collimated by a spherical lens (Lens), then focused to a 1D spot array by
a diractive optical element (DOE). A 50:50 beamsplitter brings light from the inputs and weights into close
proximity on a custom CMOS receiver. (b)Example circuit with 2 photodetectors (biased by voltage
Vbias
) per
PE: 1 for activations; 1 for weights. Received bits (
Vout
) proceed to multiplier, then memory or next layer.
[Figure 4 plot: (a) 2D receiver image with 'Activations', 'Weights' and 'Optical fan-out' regions over multiplier (x, y) coordinates; (b) intensity of one column versus multiplier (y), with a threshold separating pixels correctly received as '1' from those correctly received as '0'.]

Figure 4. Background-subtracted and normalized receiver output from free-space digital optical neural network experiment with random vectors of '1's and '0's displayed on DMDs. (a) Full 2D image. (b) One column: pixels received as '1' in red and '0' in black.
Supplementary Note1 for more details on bit error rate and error maps. Because crosstalk is deterministic, and
not a source of random noise, we can compensate for it. We applied a simple crosstalk correction scheme that
assumes uniform crosstalk on the detector and subtracts a xed fraction of an element’s nearest neighbors from
the element itself (see Supplementary Note2). e bit error rates for the blue and red channels then respectively
dropped to
2.9 ×103
and 0 for the
357 ×477
-pixel, single shot image and
2.6 ×105
and 0 for the
151 ×191
-pixel, 100-image average. In other words, aer crosstalk correction, there were no errors in the red channel, and
the errors in the blue channel dropped signicantly.
Next, we experimentally tested the DONN's effect on the classification accuracy of 500 MNIST images using a three-layer (i.e., two-hidden-layer), fully-connected neural network (FC-NN), with the dataset and training steps described in Supplementary Note 3. We compared our uncorrected experimental classification results with inference performed entirely on CPU (ground truth) in two ways. The simplest analysis, reported in Table 1,

Figure 5. Experimentally measured 3-layer FC-NN output scores, otherwise known as a confusion matrix, for 500 MNIST images from the test dataset. The values along the diagonal represent correct classification by the model. Each column is an average of 50 vectors. (a) DONN output scores (no crosstalk correction applied). (b) Ground-truth (all-electronic) output scores. (c, d) Box plots of the diagonals of subfigures (a) and (b), respectively. (e) Difference in diagonals of DONN output scores versus ground-truth output scores. Box plots represent the median (orange), interquartile range (IQR, box) and 'whiskers' extending 1.5 IQRs beyond the first and third quartile; outliers are displayed as yellow circles.

Table 1. MNIST classification accuracy of DONN (no crosstalk correction applied) versus all-electronic hardware with custom fully-connected neural network models.

                            2 layers (%)   3 layers (%)
Electronic (ground truth)   95.8           96.4
DONN                        95.4           95.8
shows a 0.6% drop in classification accuracy for the DONN versus the ground truth values (or 3 additional incorrectly classified images). Figure 5 illustrates more detailed results, where we analyzed the network output scores. An output score is roughly equivalent to the assigned likelihood that an input image belongs to a given class, and is defined as the normalized (via the softmax function) output vector of a DNN. We found that, along the matrix diagonal, the first and third quartiles of the difference in output scores between the DONN and the ground truth have a magnitude <3%. The absolute difference in average output scores is also <3%. We also performed this experiment with a single hidden layer ('2-layer' case), and achieved similar results (a 0.4% drop in accuracy, or 2 misclassified images). No crosstalk error correction was applied to these results, to illustrate the worst-case impact on accuracy.
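The output score defined above is simply the softmax of the network's final-layer outputs. As a minimal sketch (the example logits are arbitrary, not from the experiment):

```python
import numpy as np

def softmax(z):
    """Normalize raw network outputs (logits) into output scores that sum to 1."""
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])         # toy final-layer outputs
scores = softmax(logits)                   # per-class output scores
predicted_class = int(np.argmax(scores))   # class with the highest score
```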
Energy analysis: DONN compared with all-electronic hardware. In this section, we compare the theoretical interconnect energy consumption of the DONN with its all-electronic equivalent, where interconnects are illustrated in green in Fig. 6. We assume an implementation in a 7 nm CMOS process for both cases. The interconnect energy, which must include any source inefficiencies, is the energy required to charge the parasitic wire, detector, and inverter capacitances, where a CMOS inverter is representative of the input to a multiplier. See "Methods" for full energy calculations. In the electronic case, a long wire transports data to a row of multipliers using low-cost (0.06 fJ/bit) repeaters (see Supplementary Note 6). The wire has a large parasitic capacitance, but also produces an effective electrical fan-out. In the DONN, the energetic requirements of the detectors contrast with those of conventional optical receivers, which aim to maximize sensitivity to the optical input field, rather than minimize the energetic cost of the system as a whole. The parameters used for electronic and optical components are summarized in Table 2, where hν/e must be greater than or equal to the bandgap Eg of the detector material (here, we have chosen silicon as an example, and set hν/e = Eg). Cwire/µm is the wire capacitance per micrometer, VDD is the supply voltage and Cdet is a theoretical approximation of the
[Figure 6 schematic: panels (a–d) showing memory (Mem) blocks fanning one bit out, electrically or optically, to rows or blocks of PEs.]

Figure 6. Fan-out of one bit from memory (Mem) to multiple processing elements (PEs). (a) Fan-out by electrical wire to a row of PEs in a monolithic chip. (b) DONN equivalent of monolithic chip, where green wire is replaced by optical paths. (c) Fan-out by electrical wire to blocks of PEs divided into chiplets, or separated by memory and logic. (d) DONN equivalent of fan-out to PEs in multiple blocks [energetically equivalent to (b)].
Table 2. Parameters. We assume a square multiplier and scale reported 8-bit multiplier areas in a 45 nm node59–61 to a 7 nm node (the current state of the art) with the scaling factors from literature58. A MAC unit comprises both an 8-bit multiplier and a 32-bit adder, so we are placing a lower bound on the minimum length of Lwire. Recent work62 optimizes MAC units for DNNs, and reports a 337 µm² area in a 28 nm node, where the MAC unit comprises an 8-bit multiplier and a 32-bit adder. Extrapolated to a 7 nm node with a fourth-order polynomial fit of the scaling factors from literature58, the MAC unit is of size (7 µm)², which falls within the 5–8 µm range. *EMAC, the energy required for one multiply-and-accumulate, shown for reference.

Cwire/µm               0.2 fF/µm [48,55,56]
CT                     0.1 fF [48,53]
Cdet                   0.1 fF [48]
hν/e                   1.12 eV
WPE                    0.55 [51,52]
Adet                   1 µm × 1 µm [48]
Lwire_intra-chiplet    5–8 µm
Lwire_inter-chiplet    2.5 mm [45]
Lwire_inter-chip       5 cm [57]
VDD                    0.80 V [58]
EMAC*                  25 fJ/MAC [11,58]
capacitance of a receiverless cubic photodetector48 with surface area Adet = (1×1) µm². Several past examples of small CMOS integrated detectors in older CMOS nodes49,50 showcase the feasibility of receiverless detectors in advanced nodes. The optical source power conversion efficiency (wall-plug efficiency, i.e., WPE) is a measured value for VCSELs51,52. CT is an approximation for the capacitance of an inverter48,53. Lwire is the distance between MAC units in various scenarios: with abutted MAC units (intra-chiplet), between chiplets (inter-chiplet) and between chips (inter-chip).
As shown in Fig.7, we nd that the optical communication energy is
Ecomm 3
fJ/MAC, independent
of length, when we use receiverless detectors in a modern CMOS process (limited by the photodetector and
inverter capacitances). On the other hand, the electrical interconnect energy scales from
Ecomm =3
–4fJ/MAC
for inter-multiplier communication for abutted MAC units, to
1000fJ/MAC for inter-chiplet interconnects, to
30,000fJ/MAC for inter-chip interconnects. e crossover point where the optical interconnect energy drops
below the electrical energy occurs when
Lwire 5µm
. e DONN therefore provides an improvement in the
interconnect energy for data transmission and can scale to greatly decrease the energy consumption of data dis-
tribution with regular distribution patterns. In Fig.7, we have also included the optical communication energy
per MAC with a large, commercial photodiode, which illustrates the need for receiverless photodetectors in a
7nm CMOS process. In the future, plasmonic photodetectors may lower the capacitance further than 0.1fF54.
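The shape of this comparison can be reproduced with a back-of-the-envelope model built from the Table 2 parameters. This is a simplified sketch, not the paper's full calculation: it charges only the wire or receiver capacitances, ignores repeater optimization, and therefore lands near, but not exactly on, the ~3 fJ/MAC and ~5 µm figures quoted above:

```python
# Parameters follow Table 2 of the text.
C_WIRE_PER_UM = 0.2e-15   # F/um, wire capacitance per micrometer
C_DET = 0.1e-15           # F, receiverless photodetector
C_T = 0.1e-15             # F, CMOS inverter input
VDD = 0.80                # V, supply voltage
WPE = 0.55                # source wall-plug efficiency
HNU_OVER_E = 1.12         # V, photon energy over electron charge (silicon bandgap)
BITS_PER_MAC = 16         # two 8-bit operands per MAC

def e_elec_fJ(length_um):
    """Electrical energy per MAC: charging the wire plus the gate it drives."""
    c = C_WIRE_PER_UM * length_um + C_T
    return c * VDD**2 * BITS_PER_MAC * 1e15

def e_donn_fJ():
    """Optical energy per MAC: photocharge Q = C*V delivered at hv/e volts per
    photoelectron, paid through the wall-plug efficiency. Length-independent."""
    q = (C_DET + C_T) * VDD
    return q * HNU_OVER_E / WPE * BITS_PER_MAC * 1e15

# The optical cost is flat while the electrical cost grows with distance,
# so a crossover appears at micrometre-scale wire lengths.
crossover_um = next(L for L in range(1, 100) if e_elec_fJ(L) > e_donn_fJ())
```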
Discussion
With minimal impact on accuracy, the DONN yields an energy advantage over all-electronic accelerators with long wire lengths for digital data transfer. In our proof-of-concept experiment, we performed inference on 500 MNIST images with 2- and 3-layer FC-NNs and found a <0.6% drop in accuracy and a <3% absolute difference in average output scores with respect to the ground truth implementation on CPU. We attributed these errors to crosstalk due to imperfect alignment and blurring from the camera's Bayer filter. In fact, a simple crosstalk correction scheme lowered measured bit error rates by two orders of magnitude. We could thus transmit bits with 100% measured fidelity in the activation arm (better aligned than the weight arm), which illustrates that crosstalk can be mitigated and possibly eliminated through post-processing, charge sharing at the detectors, greater spacing of receivers, or optimized design of optical elements and receiver pixels. In the hypothetical regime where error due to crosstalk is negligible, the remaining noise sources are shot and thermal noise. Intuitively, shot and thermal noise are also present in an all-electronic system, and the number of photoelectrons at the input to an inverter in the DONN is equal to the number of electrons at the input to an inverter in electronics. Therefore, if these noise sources do not limit accuracy in the all-electronic case, the same can be said for the DONN48. For mathematical validation that shot and thermal noise have a trivial impact on bit error rate in the DONN, see Supplementary Note 7. These analyses demonstrate that the fundamental limit to the accuracy of the DONN is no different than the accuracy of electronics, and thus, we do not expect accuracy to hinder DONN scaling in an optimized system.
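The scale of this shot-noise argument can be checked with a short estimate (our own Gaussian approximation, not the paper's full derivation): the receiver charge C·VDD corresponds to roughly a thousand photoelectrons, for which the shot-noise-limited bit error rate is vanishingly small.

```python
import math

E_CHARGE = 1.602e-19          # C, electron charge
C_RX = 0.2e-15                # F, detector (0.1 fF) + inverter (0.1 fF)
VDD = 0.80                    # V, receiver swing

n_pe = C_RX * VDD / E_CHARGE  # photoelectrons needed to swing the receiver

# Gaussian shot-noise estimate for on-off keying with a mid-swing threshold:
# Q = (n - n/2) / sqrt(n) = sqrt(n) / 2, BER = 0.5 * erfc(Q / sqrt(2)).
q_factor = math.sqrt(n_pe) / 2
ber = 0.5 * math.erfc(q_factor / math.sqrt(2))
```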
[Figure 7 plot: Ecomm (fJ/MAC) versus length (mm) on log–log axes, showing flat EDONN lines for Cdet = 0.1 fF and Cdet = 1 pF, a rising Eelec line marked with intra-chiplet, inter-chiplet and inter-chip wire regimes, and the EMAC reference level.]

Figure 7. Energy required to transmit 16 bits (communication energy per 8-bit MAC, i.e., Ecomm). Electronic data transfer energy (Eelec) increases with wire length, whereas optical data transfer energy (EDONN) remains constant. Optical data transfer evaluated for two detector capacitances: Cdet = 1 pF for large, commercially-available photodiodes63; and Cdet = 0.1 fF for emerging receiverless, (1 µm)³-sized cubic detectors in modern CMOS processes48. Below Cdet = 0.1 fF, the capacitance of the overall receiver becomes limited by the capacitance of the CMOS inverter. Otherwise, the capacitance of the photodetector is energy-limiting. Energy of one 8-bit multiply-and-accumulate operation (EMAC = 25 fJ/MAC) also shown for reference.
In our theoretical energy calculations, we compared the length-independent data delivery costs of the DONN
with those of an all-electronic system. We found that in the worst case, when multipliers are abutted in a
multiplier array, optical transmitters have a similar interconnect energy cost compared to copper wires in a
7 nm node. The regime where the DONN shows important gains over copper interconnects is in architectures with
increased spacing between computation units. As problems scale beyond the capabilities of existing single
electronic chips, multiple chiplets or chips perform DNN tasks in concert. In the multi-chiplet and multi-chip
cases, the costs to transmit two 8-bit values in electronics (1000 fJ/MAC and 30,000 fJ/MAC, respectively)
are therefore significantly larger than that of an 8-bit MAC (25 fJ/MAC)11,58. On the other hand, in optics, the
interconnect cost (3 fJ/MAC, including source energy) remains an order of magnitude smaller than the MAC
cost. Since multi-chiplet and multi-chip systems offer a promising approach to increasing throughput on large
DNN models, optical connectivity can further these scaling efforts by reducing inter-chiplet and inter-chip
communication energy by orders of magnitude. We further discuss the scalability of the DONN in Supplementary
Note 8. In terms of the DONN's area, we assume the added chip area at the receiver is negligible, since the
area of a photodetector Adet = 1 µm2 is 50× smaller than a MAC unit of size (Lwire_intra-chiplet)2.
Furthermore, for many practical applications (e.g., workstations, servers, data centers), chip area, which
sets fabrication cost, and energy efficiency are much more important than overall packaged volume. In data
centers today, space is required between chips for heat sinks and airflow, and the addition of lenses need not
increase this volume significantly. Finally, as discussed in Supplementary Note 9, optical devices do not
restrict the clock speed of the system, since their bandwidths are >10 GHz. In fact, the clock speed of a
digital electronic system is generally limited to ~1 GHz due to thermal dissipation requirements; it could be
improved in the DONN, since greater component spacing for thermal management would not increase energy
consumption.
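These cost ratios can be made concrete with a few lines of arithmetic (a sketch using the figures quoted above; the intra-chiplet entry is an assumption, since the text only states that abutted-multiplier optical and copper costs are comparable):

```python
E_MAC = 25.0  # energy of one 8-bit MAC, fJ (from the text)

# Communication energy per MAC (two 8-bit values), fJ, taken from the text.
# The intra-chiplet figure is approximate: the text only says the optical
# cost is comparable to copper at a 7 nm node when multipliers are abutted.
e_comm_fj = {
    "intra-chiplet (electronic)": 25.0,
    "inter-chiplet (electronic)": 1000.0,
    "inter-chip (electronic)": 30000.0,
    "DONN (optical, any length)": 3.0,
}

for link, e in e_comm_fj.items():
    print(f"{link}: {e / E_MAC:.2f}x the MAC energy")
```

Only the optical link stays below the MAC energy itself; the inter-chip electronic link costs over a thousand MACs' worth of energy per transfer.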
Because length-independent data distribution is a tool currently unavailable to digital system designers,
relaxing electronic constraints on locality can open new avenues for DNN accelerator architectures. For
example, memory can be devised such that numerous small pieces of memory are located far away from the point
of computation and reused many times spatially, with a small fixed cost for doing so. Designers can then lay
out smaller memory blocks with higher bandwidth, lower energy consumption, and higher yield. If memory and
computation are spatially distinct, we have the added benefit of allowing for more compact memories that
consume less energy and area, e.g., DRAM, which is fabricated with a different process than typical CMOS to
achieve higher density than on-chip memories. Furthermore, due to its massive fan-out potential, the DONN can,
firstly, reduce overhead by minimizing a system's reliance on a memory hierarchy and, secondly, amortize the
cost of weight delivery to multiple clients running the same neural network inference on different inputs.
Additionally, some newer neural network models require irregular connectivity (e.g., graph neural networks,
which show state-of-the-art performance on recommender systems, but are restricted in size due to insufficient
compute power64,65). These systems have arbitrary connections with potentially long wire lengths between MAC
units, representing different edges in the graph. The DONN can implement these links without incurring
additional costs in energy from a complex network-on-chip in electronics. Yet another instance of greater
distance between multipliers is in higher-bit-precision applications, as in training, which require larger
MAC units.
In future work, we plan to assess the performance of the DONN on state-of-the-art DNN workloads, such as
the models described in MLPerf66. Firstly, we will benchmark the DONN against all-electronic state-of-the-art
accelerators by using Timeloop67. Through a search for optimal mappings (ways to organize data and
computation), this software can simulate the total energy consumption and latency of running various workloads
on a given hardware architecture, including computation and memory access. Timeloop therefore enables us to
perform an in-depth comparison of all-electronic accelerators against the proposed instances of the DONN,
including variable data transmission costs for different electronic wire lengths. Secondly, we will design an
optical setup and receiver to reduce experimental crosstalk, power consumption and latency. We can then test
larger workloads on this optimized hardware. Finally, beyond neural networks, there are many examples of
matrix multiplication which a DONN-style architecture can accelerate, such as optimization, Ising machines and
statistical analysis, and we plan to investigate these applications as well.
In summary, the DONN implements arbitrary transmission and fan-out of data with an energy cost per
MAC that is independent of data transmission length and number of receivers. This property is key to scaling
deep neural network accelerators, where increasing the number of processing elements for greater throughput in
all-electronic hardware typically implies higher data communication costs due to longer electronic path
length. Contrary to other proposed optical neural networks21–25, the DONN does not require digital-to-analog
conversion and is therefore less prone to error propagation. The DONN is also reconfigurable, in that the
weights and activations can be easily updated. Our work indicates that the length-independent communication
enabled by optics is useful for digital neural network system design, for example to simplify memory access to
weight data. We find that optical data transfer begins to save energy when the spacing of MAC computational
units is on the order of >10 µm. More broadly, further gains can be expected through the relaxation of
electronic system architecture constraints.
Methods
Digital optical neural network implementation for bit error rate and inference experiments. We performed
bit error rate and inference experiments with optical data transfer and fan-out of point sources using
cylindrical lenses. Two digital micromirror devices (DMDs, Texas Instruments DLP3000, DLP4500) illuminated by
spatially-filtered and collimated LEDs (Thorlabs M625L3, M455L3) acted as stand-ins for the two linear source
arrays. For the input activations/weights, each 10.8 µm-long mirror in one DMD column/row either reflected the
red/blue light toward the detector ('1') or a beam dump ('0'). Then, for each of the DMDs, an f = 100 mm
spherical lens followed by an f = 100 mm cylindrical achromatic lens imaged one DMD pixel to an entire
row/column of superpixels of a color camera (Thorlabs DCC3240C). Each camera superpixel is made up of four
pixels of size (5.3 µm)2: two green, one red and one blue. The camera acquisition program applies a
'de-Bayering' interpolation to automatically extract color information for each sub-pixel; this interpolation
causes blurring, and therefore it increases crosstalk in our system. In a future version of the DONN, a
specialized receiver will reduce this crosstalk and also operate at a higher speed.
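The row/column imaging described above realizes, in effect, an outer product of two bit vectors: every receiver pixel sees one activation bit and one weight bit. A minimal NumPy model of this fan-out (an illustration of the data movement only, not of the optics; the function name is ours) might look like:

```python
import numpy as np

def optical_fanout_product(x_bits, w_bits):
    """Model the cylindrical-lens fan-out: each activation bit x[i] is imaged
    across an entire row of receivers and each weight bit w[j] across an
    entire column, so pixel (i, j) receives the pair (x[i], w[j]).
    The local 1-bit product is then just a logical AND."""
    X = np.tile(x_bits[:, None], (1, w_bits.size))  # fan-out of x across columns
    W = np.tile(w_bits[None, :], (x_bits.size, 1))  # fan-out of w across rows
    return X & W                                    # per-pixel 1-bit multiply

x = np.array([1, 0, 1], dtype=np.uint8)
w = np.array([1, 1, 0, 1], dtype=np.uint8)
P = optical_fanout_product(x, w)  # 3x4 array of bit products
```

In the experiment, accumulating such bit products over bit positions and over the inner dimension yields the full multi-bit matrix product.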
To process the image received on the camera, we subtracted the background, normalized, then thresholded
by a fixed value for each channel. (We acquired normalization and background curves with all DMD pixels in the
'on' and 'off' states, respectively. This background subtraction and normalization could be implemented
on-chip by precharacterizing the system, and biasing each receiver pixel by some fixed voltage.) If the
detected intensity was above the threshold value, it was labeled a '1'; below threshold, a '0'. For the bit
error rate experiments, we compared the parsed values from the camera with the known values transmitted by the
DMDs, and defined the bit error rate as the number of incorrectly received bits divided by the total number of
bits. In the inference experiments, the DMDs displayed the activations and pre-trained weights, which
propagated through the optical system to the camera. After background subtraction and normalization, the CPU
multiplied each activation with each weight, and applied the nonlinear function (ReLU after the hidden layers
and softmax at the output). We did not correct for crosstalk here, to illustrate the worst-case scenario of
impact on accuracy. The CPU then fed the outputs back to the input activation DMD for the next layer of
computation. We used a DNN model with two hidden layers with 100 activations each and a 10-activation output
layer. We also tested a model with a single hidden layer with 100 activations.
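The background subtraction, normalization, thresholding and bit-error-rate steps can be sketched as follows (the function names and the 0.5 threshold are illustrative assumptions; the experiment used a fixed per-channel threshold):

```python
import numpy as np

def decode_bits(raw, off_frame, on_frame, threshold=0.5):
    """Convert a camera frame of intensities into bits.

    off_frame / on_frame: calibration frames taken with all DMD mirrors in
    the 'off' / 'on' states (the background and normalization curves)."""
    span = np.clip(on_frame - off_frame, 1e-12, None)  # avoid divide-by-zero
    norm = (raw - off_frame) / span                    # normalize to ~[0, 1]
    return (norm > threshold).astype(np.uint8)         # above threshold -> '1'

def bit_error_rate(received, transmitted):
    """Incorrectly received bits divided by total bits."""
    return float(np.mean(received != transmitted))
```

A toy usage: with an all-zero background frame and an all-one normalization frame, `decode_bits` reduces to a plain threshold on the raw intensities.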
MNIST preprocessing. For the inputs to the network, a bilinear interpolation algorithm transformed the
28 × 28-pixel images into 7 × 7-pixel images, which were then flattened into a 1D 49-element vector. The
following standard mapping (Eq. 2) quantized both input and weight matrices into 8-bit integer
representations, where Quantized is the returned value, QuantizedMin is the minimum value expressible in the
quantized datatype (here, always 0), Input is the input data to be quantized, FloatingMin is the minimum value
in Input, and Scale is the scaling factor mapping between the two datatype ranges,
(FloatingMax − FloatingMin)/(QuantizedMax − QuantizedMin). See the gemmlowp documentation68 for more
information on implementations of this quantization. In practice, 8-bit representations are widely used in
DNNs, since 8-bit MACs are generally sufficient to maintain accuracy in inference8,69,70.
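A sketch of this affine mapping in NumPy, with QuantizedMin = 0 as in the text (the function name is ours, and the sketch assumes a non-constant input; see the gemmlowp documentation for production implementations):

```python
import numpy as np

def quantize_uint8(x):
    """Map a float array onto 8-bit integers following Eq. (2)."""
    q_min, q_max = 0, 255                      # quantized datatype range
    f_min, f_max = float(x.min()), float(x.max())
    scale = (f_max - f_min) / (q_max - q_min)  # Scale in Eq. (2); assumes f_max > f_min
    q = q_min + np.round((x - f_min) / scale)  # Eq. (2)
    return np.clip(q, q_min, q_max).astype(np.uint8)

q = quantize_uint8(np.array([0.0, 0.5, 1.0]))  # -> [0, 128, 255]
```

Note that `np.round` rounds half to even, so 127.5 maps to 128 here; a production quantizer would also carry the scale and zero-point forward for dequantization.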
Electronic and optical interconnect energy calculations. When an electronic wire transports data
over a distance Lwire to the gate of a CMOS inverter (representative of a full-adder's input, the basic
building block of multipliers), the energy consumption per bit is given by Eq. (3), where VDD is the supply
voltage, Cwire/µm is the wire capacitance per micrometer, Lwire is the wire length between two multipliers and
CT is the inverter capacitance. Interconnects consume energy predominantly when a load capacitance, such as a
wire, is charged from a low (0 V) to a high (~1 V) voltage, i.e., in a 0→1 transition. If we assume a low
leakage current, maintaining a value of '1' (i.e., 1→1) consumes little additional energy. To switch a wire
from a '1' to a '0', the wire is discharged to the ground for free (Supplementary Note 4). Lastly, maintaining
a value of '0' simply keeps the voltage at 0 V, at no cost. Assuming a random distribution of '0' and '1'
bits, we therefore include a factor of 1/4 in Eq. (3) to account for this dependence on switching activity.
In the DONN, a light source replaces the wire for fan-out. The low capacitances of the receiverless detectors
in the DONN allow for the removal of receiving amplifiers48. Thus, the DONN's minimum energy consumption
corresponds to the optical energy required to generate a voltage swing of 0.8 V on the load capacitance (i.e.,
the photodetector (Cdet) and an inverter (CT)), all divided by the source's power conversion efficiency
(wall-plug efficiency, WPE). Subsequent transistors in the multiplier are powered by the off-chip voltage
supply, as in the all-electronic architecture. Assuming a detector responsivity of unity71, the DONN
interconnect energy cost is given by Eq. (4), where hν is the photon energy and the number of photons per bit,
np, is determined by Eq. (5). As in the all-electronic case, we assume low leakage on the receiverless
photodetector. Photons are received for every '1' and therefore, to avoid charge buildup, charge on the output
capacitor must be reset after every clock cycle. In Supplementary Note 5, we propose a CMOS discharge circuit
that actively resets the receiver. (Another possible method is a dual-rail encoding scheme48.) Thus, the
switching activity factor is 1/2 instead of 1/4: as for the all-electronic case, we assume a random
distribution of bits, but here, both 1→1 and 0→1 transitions have a nonzero cost.
The energy consumption per 8-bit multiply-and-accumulate (Ecomm in fJ/MAC) is simply the energy per bit
multiplied by 16, representative of transmitting two 8-bit values.
$$\mathrm{Quantized} = \mathrm{QuantizedMin} + \frac{\mathrm{Input} - \mathrm{FloatingMin}}{\mathrm{Scale}} \tag{2}$$

$$E_{\mathrm{elec/bit}} = \frac{1}{4}\left(C_{\mathrm{wire}/\mu\mathrm{m}} \cdot L_{\mathrm{wire}} + C_T\right) \cdot V_{DD}^2 \tag{3}$$

$$E_{\mathrm{DONN/bit}} = \frac{1}{2 \cdot \mathrm{WPE}} \cdot h\nu \cdot n_p \tag{4}$$

$$n_p = \frac{\left(C_{\mathrm{det}} + C_T\right) \cdot V_{DD}}{e} \tag{5}$$
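Equations (3)–(5) can be combined into a short numerical sketch (the capacitance, wavelength and WPE values below are illustrative assumptions chosen for the small-detector regime, not the paper's exact parameters):

```python
E_CHARGE = 1.602e-19      # elementary charge, C
H_NU = 2.3e-19            # photon energy, J, at an assumed ~850 nm wavelength
V_DD = 0.8                # supply / swing voltage, V
C_T = 0.1e-15             # inverter input capacitance, F (assumed)
C_DET = 0.1e-15           # receiverless detector capacitance, F
C_WIRE_PER_UM = 0.2e-15   # wire capacitance per micrometer, F/um (assumed)
WPE = 0.3                 # source wall-plug efficiency (assumed)

def e_elec_per_bit(l_wire_um):
    """Eq. (3): electronic transfer energy per bit, with the 1/4 activity factor."""
    return 0.25 * (C_WIRE_PER_UM * l_wire_um + C_T) * V_DD**2

def e_donn_per_bit():
    """Eqs. (4)-(5): optical transfer energy per bit, independent of length."""
    n_p = (C_DET + C_T) * V_DD / E_CHARGE  # photons to charge C_det + C_T to V_DD
    return H_NU * n_p / (2.0 * WPE)        # 1/2 activity factor folded in

def e_comm_per_mac(e_per_bit):
    """Communication energy per 8-bit MAC: 16 bits (two 8-bit values)."""
    return 16.0 * e_per_bit
```

With these assumed values the optical cost is roughly 0.4 fJ/bit (about 6 fJ/MAC) and matches the electronic cost near a wire length of roughly 10 µm, consistent with the crossover reported in the text.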
Data availability
The data generated and analyzed in this study are available from the corresponding authors upon reasonable
request.
Code availability
Code used for acquiring and processing the MNIST dataset can be found at
https://github.com/alexsludds/Digital-Optical-Neural-Network-Code. Code used for image processing, hardware
control, and calculations for energy, crosstalk and bit error rate is available from the corresponding authors
upon reasonable request.
Received: 24 October 2020; Accepted: 12 January 2021
References
1. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
2. Dai, Z. et al. Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2978–2988, https://doi.org/10.18653/v1/P19-1285 (Association for Computational Linguistics, Florence, Italy, 2019).
3. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118. https://doi.org/10.1038/nature21056 (2017).
4. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444. https://doi.org/10.1038/nature14539 (2015).
5. Chen, Y., Krishna, T., Emer, J. S. & Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52, 127–138. https://doi.org/10.1109/JSSC.2016.2616357 (2017).
6. Chen, Y.-H., Yang, T.-J., Emer, J. & Sze, V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Sel. Top. Circuits Syst. 9, 292–308. https://doi.org/10.1109/JETCAS.2019.2910232 (2019).
7. Yin, S. et al. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications. IEEE J. Solid-State Circuits 53, 968–982. https://doi.org/10.1109/JSSC.2017.2778281 (2018).
8. Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 1–12, https://doi.org/10.1145/3079856.3080246 (2017).
9. Sze, V., Chen, Y., Yang, T. & Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105, 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740 (2017).
10. Xu, X. et al. Scaling for edge inference of deep neural networks. Nat. Electron. 1, 216–222. https://doi.org/10.1038/s41928-018-0059-3 (2018).
11. Horowitz, M. Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 10–14, https://doi.org/10.1109/ISSCC.2014.6757323 (2014).
12. Poulton, J. W. et al. A 1.17-pJ/b, 25-Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication using a process- and temperature-adaptive voltage regulator. IEEE J. Solid-State Circuits 54, 43–54. https://doi.org/10.1109/JSSC.2018.2875092 (2019).
13. Shrivastava, M. et al. Physical insight toward heat transport and an improved electrothermal modeling framework for FinFET architectures. IEEE Trans. Electron. Devices 59, 1353–1363. https://doi.org/10.1109/TED.2012.2188296 (2012).
14. Gupta, M. S., Oatley, J. L., Joseph, R., Wei, G. & Brooks, D. M. Understanding voltage variations in chip multiprocessors using a distributed power-delivery network. In 2007 Design, Automation and Test in Europe Conference and Exhibition, 1–6, https://doi.org/10.1109/DATE.2007.364663 (2007).
15. Casasent, D., Jackson, J. & Neuman, C. Frequency-multiplexed and pipelined iterative optical systolic array processors. Appl. Opt. 22, 115–124. https://doi.org/10.1364/AO.22.000115 (1983).
16. Rhodes, W. & Guilfoyle, P. Acoustooptic algebraic processing architectures. Proc. IEEE 72, 820–830. https://doi.org/10.1109/PROC.1984.12941 (1984).
17. Caulfield, H., Rhodes, W., Foster, M. & Horvitz, S. Optical implementation of systolic array processing. Opt. Commun. 40, 86–90. https://doi.org/10.1016/0030-4018(81)90333-3 (1981).
18. Xu, S., Wang, J., Wang, R., Chen, J. & Zou, W. High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays. Opt. Express 27, 19778–19787. https://doi.org/10.1364/OE.27.019778 (2019).
19. Liang, Y.-Z. & Liu, H.-K. Optical matrix–matrix multiplication method demonstrated by the use of a multifocus hololens. Opt. Lett. 9, 322–324. https://doi.org/10.1364/ol.9.000322 (1984).
20. Athale, R. A. & Collins, W. C. Optical matrix–matrix multiplier based on outer product decomposition. Appl. Opt. 21, 2089–2090. https://doi.org/10.1364/AO.21.002089 (1982).
21. Shen, Y. et al. Deep learning with coherent nanophotonic circuits. Nat. Photon. 11, 441–446. https://doi.org/10.1038/nphoton.2017.93 (2017).
22. Tait, A. N. et al. Neuromorphic photonic networks using silicon photonic weight banks. Sci. Rep. 7, 1–10. https://doi.org/10.1038/s41598-017-07754-z (2017).
23. Hamerly, R., Bernstein, L., Sludds, A., Soljacic, M. & Englund, D. Large-scale optical neural networks based on photoelectric multiplication. Phys. Rev. X 9, 021032. https://doi.org/10.1103/PhysRevX.9.021032 (2019).
24. Feldmann, J. et al. Parallel convolution processing using an integrated photonic tensor core (2020). arXiv:2002.00281.
25. Lin, X. et al. All-optical machine learning using diffractive deep neural networks. Science 361, 1004–1008. https://doi.org/10.1126/science.aat8084 (2018).
26. Krishnamoorthy, A. V. et al. Computer systems based on silicon photonic interconnects. Proc. IEEE 97, 1337–1361. https://doi.org/10.1109/JPROC.2009.2020712 (2009).
27. Mehta, N., Lin, S., Yin, B., Moazeni, S. & Stojanović, V. A laser-forwarded coherent transceiver in 45-nm SOI CMOS using monolithic microring resonators. IEEE J. Solid-State Circuits 55, 1096–1107. https://doi.org/10.1109/JSSC.2020.2968764 (2020).
28. Xue, J. et al. An intra-chip free-space optical interconnect. ACM SIGARCH Comput. Archit. News 38, 94–105. https://doi.org/10.1145/1816038.1815975 (2010).
29. Hamedazimi, N. et al. Firefly: A reconfigurable wireless data center fabric using free-space optics. In Proceedings of the 2014 ACM conference on SIGCOMM, 319–330, https://doi.org/10.1145/2619239.2626328 (2014).
30. Bao, J. et al. Flycast: Free-space optics accelerating multicast communications in physical layer. ACM SIGCOMM Comput. Commun. Rev. 45, 97–98. https://doi.org/10.1145/2829988.2790002 (2015).
31. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556.
32. Szegedy, C. et al. Going deeper with convolutions (2014). arXiv:1409.4842.
33. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533. https://doi.org/10.1038/nature14236 (2015).
34. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2826, https://doi.org/10.1109/CVPR.2016.308 (2016).
35. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778, https://doi.org/10.1109/CVPR.2016.90 (2016).
36. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1800–1807, https://doi.org/10.1109/CVPR.2017.195 (2017).
37. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
38. Zoph, B., Vasudevan, V., Shlens, J. & Le, Q. V. Learning transferable architectures for scalable image recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8697–8710, https://doi.org/10.1109/CVPR.2018.00907 (2018).
39. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7132–7141, https://doi.org/10.1109/CVPR.2018.00745 (2018).
40. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805.
41. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 1 (2019).
42. Lan, Z. et al. ALBERT: A lite BERT for self-supervised learning of language representations (2019). arXiv:1909.11942.
43. Brown, T. B. et al. Language models are few-shot learners (2020). arXiv:2005.14165.
44. Fowers, J. et al. A configurable cloud-scale DNN processor for real-time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 1–14, https://doi.org/10.1109/ISCA.2018.00012 (2018).
45. Shao, Y. S. et al. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture - MICRO '52, 14–27, https://doi.org/10.1145/3352460.3358302 (2019).
46. Yin, J. et al. Modular routing design for chiplet-based systems. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 726–738, https://doi.org/10.1109/ISCA.2018.00066 (2018).
47. Samajdar, A. et al. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software, 304–315 (IEEE, 2020).
48. Miller, D. A. B. Attojoule optoelectronics for low-energy information processing and communications. J. Light. Technol. 35, 346–396. https://doi.org/10.1109/JLT.2017.2647779 (2017).
49. Keeler, G. A. et al. Optical pump-probe measurements of the latency of silicon CMOS optical interconnects. IEEE Photon. Technol. Lett. 14, 1214–1216. https://doi.org/10.1109/LPT.2002.1022022 (2002).
50. Latif, S., Kocabas, S., Tang, L., Debaes, C. & Miller, D. Low capacitance CMOS silicon photodetectors for optical clock injection. Appl. Phys. A 95, 1129–1135. https://doi.org/10.1007/s00339-009-5122-5 (2009).
51. Iga, K. Vertical-cavity surface-emitting laser: Its conception and evolution. Jpn. J. Appl. Phys. 47, 1. https://doi.org/10.1143/JJAP.47.1 (2008).
52. Jäger, R. et al. 57% wallplug efficiency oxide-confined 850 nm wavelength GaAs VCSELs. Electron. Lett. 33, 330–331. https://doi.org/10.1049/el:19970193 (1997).
53. Zheng, P., Connelly, D., Ding, F. & Liu, T.-J. K. FinFET evolution toward stacked-nanowire FET for CMOS technology scaling. IEEE Trans. Electron Dev. 62, 3945–3950. https://doi.org/10.1109/TED.2015.2487367 (2015).
54. Tang, L. et al. Nanometre-scale germanium photodetector enhanced by a near-infrared dipole antenna. Nat. Photon. 2, 226–229. https://doi.org/10.1038/nphoton.2008.30 (2008).
55. Keckler, S. W., Dally, W. J., Khailany, B., Garland, M. & Glasco, D. GPUs and the future of parallel computing. IEEE Micro 31, 7–17. https://doi.org/10.1109/MM.2011.89 (2011).
56. Dally, W. J. et al. Hardware-enabled artificial intelligence. In 2018 IEEE Symposium on VLSI Circuits, 3–6, https://doi.org/10.1109/VLSIC.2018.8502368 (2018).
57. Chao, C. & Saeta, B. Cloud TPU: Codesigning architecture and infrastructure. Hot Chips 31, 1 (2019).
58. Stillmaker, A. & Baas, B. Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm. Integration 58, 74–81. https://doi.org/10.1016/j.vlsi.2017.02.002 (2017).
59. Saadat, H., Bokhari, H. & Parameswaran, S. Minimally biased multipliers for approximate integer and floating-point multiplication. IEEE Trans. Comput. Des. Integr. Circuits Syst. 37, 2623–2635. https://doi.org/10.1109/TCAD.2018.2857262 (2018).
60. Shoba, M. & Nakkeeran, R. Energy and area efficient hierarchy multiplier architecture based on Vedic mathematics and GDI logic. Eng. Sci. Technol. Int. J. 20, 321–331. https://doi.org/10.1016/j.jestch.2016.06.007 (2017).
61. Ravi, S., Patel, A., Shabaz, M., Chaniyara, P. M. & Kittur, H. M. Design of low-power multiplier using UCSLA technique. In Artificial Intelligence and Evolutionary Algorithms in Engineering Systems 119–126, https://doi.org/10.1007/978-81-322-2135-7_14 (2015).
62. Johnson, J. Rethinking floating point for deep learning (2018). arXiv:1811.01721.
63. Thorlabs. High-speed fiber-coupled detectors. https://www.thorlabs.com/newgrouppage9.cfm?objectgroup_id=1297&pn=DET02AFC (2020).
64. Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks Learn. Syst. 1–21, https://doi.org/10.1109/TNNLS.2020.2978386 (2020).
65. Zhang, Z., Cui, P. & Zhu, W. Deep learning on graphs: A survey. IEEE Transactions on Knowl. Data Eng. 1–1, https://doi.org/10.1109/TKDE.2020.2981333 (2020).
66. Mattson, P. et al. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro 40, 8–16. https://doi.org/10.1109/MM.2020.2974843 (2020).
67. Parashar, A. et al. Timeloop: A systematic approach to DNN accelerator evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software, 304–315, https://doi.org/10.1109/ISPASS.2019.00042 (IEEE, 2019).
68. Jacob, B. & Warden, P. et al. gemmlowp: A small self-contained low-precision GEMM library. https://github.com/google/gemmlowp (2015, accessed 2020).
69. Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M. & Moshovos, A. Stripes: Bit-serial deep neural network computing. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–12, https://doi.org/10.1109/MICRO.2016.7783722 (2016).
70. Albericio, J. et al. Bit-pragmatic deep neural network computing. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 382–394, https://doi.org/10.1145/3123939.3123982 (2017).
71. Coimbatore Balram, K., Audet, R. & Miller, D. Nanoscale resonant-cavity-enhanced germanium photodetectors with lithographically defined spectral response for improved performance at telecommunications wavelengths. Opt. Express 21, 10228–33. https://doi.org/10.1364/OE.21.010228 (2013).
Acknowledgements
Thanks to Christopher Panuski for helpful discussions about µLEDs and Angshuman Parashar and Yannan
(Nellie) Wu for insights into all-electronic DNN accelerators. We would also like to thank Mohamed Ibrahim
for useful discussions on receiver discharging circuits. Anthony Pennes helped with several machining tasks.
Thanks to Ronald Davis III and Zhen Guo for manuscript revisions. We also thank the NVIDIA Corporation for
the donation of the Tesla K40 GPU used for training the fully-connected networks. Equipment was purchased
thanks to the U.S. Army Research Office through the Institute for Soldier Nanotechnologies (ISN) at MIT under
grant no. W911NF-18-2-0048. L.B. is supported by a Postgraduate Scholarship from the Natural Sciences and
Engineering Research Council of Canada, National Science Foundation (NSF) E2CDA Grant No. 1640012 and
the aforementioned ISN Grant. A.S. is supported by an NSF Graduate Research Fellowship Program under Grant
No. 1122374, NTT Research Inc., NSF EAGER program Grant No. 1946967, and the NSF/SRC E2CDA and ISN
grants mentioned above. R.H. was supported by an Intelligence Community Postdoctoral Research Fellowship
at MIT, administered by ORISE through the U.S. DoE/ODNI.
Author contributions
D.E. and R.H. developed the original concept. L.B. designed and performed the hardware experiments with the
support of A.S. and D.E. A.S. developed the data acquisition, training, and confusion matrix analysis
software. L.B. developed the output image processing software and performed the bit error rate calculations.
L.B. and A.S. performed the energy calculations, with critical insights from R.H. J.E. and V.S. provided
critical insights into all-electronic hardware comparisons. L.B. and A.S. wrote the manuscript with input from
all authors. R.H., J.E., V.S. and D.E. supervised the project.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1038/s41598-021-82543-3.
Correspondence and requests for materials should be addressed to L.B., A.S. or D.E.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to
the material. If material is not included in the article's Creative Commons licence and your intended use is
not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission
directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2021
Content courtesy of Springer Nature, terms of use apply. Rights reserved
1.
2.
3.
4.
5.
6.
... E.g. AlexNet and ResNet-50 both contain more than one million weights, biases and activations [102], and that is without considering the memory required for the model structure, input data, and application. ...
Article
Full-text available
Downtime caused by failing equipment can be extremely costly for organizations. Predictive Maintenance (PdM), which uses data to predict when maintenance should be conducted, is an essential tool for increasing safety, maximizing uptime and minimizing costs. Contemporary PdM systems primarily rely on sensors to collect information about the equipment under observation. This information is afterwards transmitted off the device for processing at a high-performance computer system. While this can allow high-quality predictions, it also imposes barriers that keep some organizations from adopting PdM. For example, some applications prevent data transmission off sensor devices due to regulatory or infrastructure limitations. Being able to process the collected information right at the sensor device is, therefore, desirable in many sectors - something that recent progress in the field of TinyML promises to deliver. This paper investigates the intersection between PdM and TinyML and explores how TinyML can enable many new PdM applications. We consider a holistic view of TinyML-based PdM, focusing on the full stack of Machine Learning (ML) models, hardware, toolchains, data and PdM applications. Our main findings are that each part of the TinyML stack has received varying degrees of attention. In particular, ML models and their optimisations have seen a lot of attention, while data optimisations and TinyML datasets lack contributions. Furthermore, most TinyML research focuses on image and audio classification, with little attention paid to other application areas such as PdM. Based on our observations, we suggest promising avenues of future research to scale and improve the application of TinyML to PdM.
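The memory pressure that the citing context above alludes to can be made concrete with simple arithmetic. A minimal sketch, using an illustrative fully-connected layer size and float32 precision (assumptions for illustration, not figures from the cited works):

```python
# Back-of-envelope memory footprint for a fully-connected layer's
# parameters (K*N weights plus N biases) at a given numeric precision.
# Layer size and precision are illustrative assumptions.

def fc_layer_params(k_in, n_out):
    """Parameter count of one fully-connected layer."""
    return k_in * n_out + n_out

def param_memory_bytes(num_params, bytes_per_param=4):
    """Memory needed to store the parameters (default: float32)."""
    return num_params * bytes_per_param

# A single 1000x1000 layer already holds about a million parameters,
# i.e. roughly 4 MB at float32 precision.
p = fc_layer_params(1000, 1000)
print(p, param_memory_bytes(p))
```

Multiplying out a few such layers shows why million-parameter models quickly exceed the on-device memory of small sensor hardware.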
... Advances in machine learning, particularly deep learning, have enabled applications in various real-world scenarios [1]. Accompanying improvements in performance, the increasing model complexity and the exploding number of parameters [2] prohibit human users from comprehending the decisions made by these data-driven models, as the decision rules are implicitly learned from the data presented. The absence of reasoning for model decisions keeps raising concerns about the transparency of AI-driven systems [3]. ...
Preprint
Recent literature highlights the critical role of neighborhood construction in deriving model-agnostic explanations, with a growing trend toward deploying generative models to improve synthetic instance quality, especially for explaining text classifiers. These approaches overcome the challenges in neighborhood construction posed by the unstructured nature of texts, thereby improving the quality of explanations. However, the deployed generators are usually implemented via neural networks and lack inherent explainability, sparking arguments over the transparency of the explanation process itself. To address this limitation while preserving neighborhood quality, this paper introduces a probability-based editing method as an alternative to black-box text generators. This approach generates neighboring texts by implementing manipulations based on in-text contexts. Substituting the generator-based construction process with recursive probability-based editing, the resultant explanation method, XPROB (explainer with probability-based editing), exhibits competitive performance according to the evaluation conducted on two real-world datasets. Additionally, XPROB's fully transparent and more controllable construction process leads to superior stability compared to the generator-based explainers.
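The idea of building a neighborhood by probability-based word substitution can be sketched as follows. This is a deliberately simplified, hypothetical illustration (the toy substitution table and all probabilities are invented), not the actual XPROB algorithm:

```python
import random

# Hypothetical sketch of neighborhood construction by probability-based
# word substitution (NOT the cited XPROB method): each neighbor replaces
# some words with alternatives drawn from a toy in-context probability
# table.

SUBSTITUTIONS = {            # toy table: word -> (alternatives, probs)
    "good":  (["great", "fine", "bad"], [0.5, 0.4, 0.1]),
    "movie": (["film", "show"],         [0.7, 0.3]),
}

def make_neighbors(text, n_neighbors=5, edit_prob=0.5, seed=0):
    rng = random.Random(seed)
    words = text.split()
    neighbors = []
    for _ in range(n_neighbors):
        edited = []
        for w in words:
            if w in SUBSTITUTIONS and rng.random() < edit_prob:
                alts, probs = SUBSTITUTIONS[w]
                edited.append(rng.choices(alts, weights=probs)[0])
            else:
                edited.append(w)
        neighbors.append(" ".join(edited))
    return neighbors

print(make_neighbors("a good movie overall"))
```

Because every edit is a table lookup with explicit probabilities, the construction process stays inspectable, which is the transparency point the abstract makes against neural text generators.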
... - Performance Bottleneck Identification: Analyze architectural designs to find and eliminate performance bottlenecks [288,289]. - Scalability Optimization: Ensure that the chip architecture scales well with increasing system complexity (e.g., more cores or memory) [290,291]. ...
Preprint
Full-text available
Large Language Models (LLMs) are emerging as promising tools in hardware design and verification, with recent advancements suggesting they could fundamentally reshape conventional practices. In this survey, we analyze over 54 research papers to assess the current role of LLMs in enhancing automation, optimization, and innovation within hardware design and verification workflows. Our review highlights LLM applications across synthesis, simulation, and formal verification, emphasizing their potential to streamline development processes while upholding high standards of accuracy and performance. We identify critical challenges, such as scalability, model interpretability, and the alignment of LLMs with domain-specific languages and methodologies. Furthermore, we discuss open issues, including the necessity for tailored model fine-tuning, integration with existing Electronic Design Automation (EDA) tools, and effective handling of complex data structures typical of hardware projects. This survey not only consolidates existing knowledge but also outlines prospective research directions, underscoring the transformative role LLMs could play in the future of hardware design and verification.
... The computational complexity of processing raw videos increases further with the adaptation of larger models-a trend that continues to expand over the years (Tan & Le, 2019; Bernstein et al., 2021). When the training data is limited, using larger input dimensionality and larger models also increases the chance of overfitting (Defernez & Kemsley, 1999). ...
Preprint
Full-text available
We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models. These models extract high-level features related to visual scenery, objects, characters, text, speech, music, and audio effects. To intelligently fuse these pretrained features, we train small classifier models with low time and memory requirements. Employing the transformer model, our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling, efficiently exploiting the correspondence between all elements, as opposed to the fixed and low number of frames typically used by traditional methods. Our approach fuses features originating from different tasks and modalities, with different dimensionalities, different temporal lengths, and complex dependencies as opposed to current approaches. Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP). To foster future research, we make the pretrained features for the entire MovieNet dataset, along with our genre classification code and the trained models, publicly available.
Conference Paper
The hardware limitations of conventional electronics in deep neural network (DNN) applications have spurred explorations into alternative architectures, including architectures using optical- and/or quantum-domain signal-processing subroutines. This work investigates the scalability and performance metrics, such as throughput, energy consumption, and latency, of various such architectures, with a focus on recently developed hardware error correction techniques, in-situ training methods, initial field trials, as well as extensions into DNN-based inference on quantum signals with reversible, quantum-coherent resources.
Article
Full-text available
Physical reservoirs are a promising approach for realizing high‐performance artificial intelligence devices utilizing physical devices. Although nonlinear interfered spin‐wave multi‐detection exhibits high nonlinearity and the ability to map in high dimensional feature space, it does not have sufficient performance to process time‐series data precisely. Herein, development of an iono–magnonic reservoir by combining such interfered spin wave multi‐detection and ion‐gating involving protonation‐induced redox reaction triggered by the application of voltage is reported. This study is the first to report the manipulation of the propagating spin wave property by ion‐gating and the application of the same to physical reservoir computing. The subject iono–magnonic reservoir can generate various reservoir states in a single homogenous medium by utilizing a spin wave property modulated by ion‐gating. Utilizing the strong nonlinearity resulting from chaos, the reservoir shows good computational performance in completing the Mackey–Glass chaotic time‐series prediction task, and the performance is comparable to that exhibited by simulated neural networks.
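The Mackey–Glass prediction task mentioned above is driven by a standard chaotic benchmark series. A minimal sketch of generating it via a coarse Euler discretization of the delay differential equation dx/dt = βx(t−τ)/(1+x(t−τ)ⁿ) − γx(t), with the usual chaotic parameter choices (the step size and initial history are simplifying assumptions):

```python
def mackey_glass(n_steps, beta=0.2, gamma=0.1, n=10, tau=17, dt=1.0, x0=1.2):
    """Coarse Euler discretization of the Mackey-Glass delay differential
    equation, the chaotic benchmark series referenced above."""
    delay = int(tau / dt)
    history = [x0] * (delay + 1)   # constant initial history
    for _ in range(n_steps):
        x_t = history[-1]
        x_tau = history[-1 - delay]
        dx = beta * x_tau / (1.0 + x_tau ** n) - gamma * x_t
        history.append(x_t + dt * dx)
    return history[-n_steps:]

series = mackey_glass(500)
print(min(series), max(series))
```

A reservoir is then typically trained to predict series[t+1] from its internal state at time t; only the linear readout is fitted, which is what makes physical substrates such as spin waves usable as the reservoir itself.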
Article
Full-text available
Photonic Stochastic Emergent Learning (PSEL) represents an innovative paradigm rooted in mathematical brain modelling and emergent memories. In this study, we explore the intersection of these concepts to address memory storage and classification tasks. Leveraging optical computing principles and random projections, PSEL constructs memory representations from the inherent randomness in nature. Specifically, we select a set of highly similar random states generated by coherent light scattered from a diffusive medium. Classification is performed by organizing the memories spatially into different classes and comparing inputs to those stored memories. The results demonstrate the efficacy of PSEL in memory construction and parallel classification, emphasizing its potential applications in high-performance computing and artificial intelligence systems.
Article
Full-text available
All analog signal processing is fundamentally subject to noise, and this is also the case in next generation implementations of optical neural networks (ONNs). Therefore, we propose the first hardware-based approach to mitigate noise in ONNs. A tree-like and an accordion-like design are constructed from a given NN that one wishes to implement. Both designs are constructed so that the resulting ONNs give outputs close to the desired solution. To establish the latter, we analyze the designs mathematically. Specifically, we investigate a probabilistic framework for the tree-like design that establishes the correctness of the design, i.e. for any feed-forward NN with Lipschitz continuous activation functions, an ONN can be constructed that produces output arbitrarily close to the original. ONNs constructed with the tree-like design thus also inherit the universal approximation property of NNs. For the accordion-like design, we restrict the analysis to NNs with linear activation functions and characterize the ONNs’ output distribution using exact formulas. Finally, we report on numerical experiments with LeNet ONNs that give insight into the number of components required in these designs for certain accuracy gains. The results indicate that adding just a few components and/or adding them only in the first (few) layers in the manner of either design can already be expected to increase the accuracy of ONNs considerably. To illustrate the effect, we point to a specific simulation of a LeNet implementation, in which adding one copy of the layers components in each layer reduces the mean-squared error (MSE) by 59.1% for the tree-like design and by 51.5% for the accordion-like design.
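The redundancy intuition behind such designs can be sketched generically: averaging the outputs of several independently noisy copies of a computation shrinks the mean-squared error roughly in proportion to the number of copies. A toy Monte-Carlo illustration of that statistical effect (not the paper's actual tree-like or accordion-like construction; all numbers are assumed):

```python
import random

# Toy noise-averaging sketch: averaging n independently noisy copies of
# a value reduces the MSE roughly as 1/n. Generic illustration only,
# not the cited ONN designs.

def noisy_output(true_value, sigma, rng):
    return true_value + rng.gauss(0.0, sigma)

def mse_with_copies(true_value, sigma, n_copies, n_trials=10000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        avg = sum(noisy_output(true_value, sigma, rng)
                  for _ in range(n_copies)) / n_copies
        total += (avg - true_value) ** 2
    return total / n_trials

single = mse_with_copies(1.0, 0.1, n_copies=1)
doubled = mse_with_copies(1.0, 0.1, n_copies=2)
print(single, doubled)   # doubling the copies roughly halves the MSE
```

This matches the abstract's observation that adding even one copy of a layer's components already yields a large MSE reduction.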
Article
Full-text available
With the proliferation of ultrahigh-speed mobile networks and internet-connected devices, along with the rise of artificial intelligence (AI)¹, the world is generating exponentially increasing amounts of data that need to be processed in a fast and efficient way. Highly parallelized, fast and scalable hardware is therefore becoming progressively more important². Here we demonstrate a computationally specific integrated photonic hardware accelerator (tensor core) that is capable of operating at speeds of trillions of multiply-accumulate operations per second (10¹² MAC operations per second or tera-MACs per second). The tensor core can be considered as the optical analogue of an application-specific integrated circuit (ASIC). It achieves parallelized photonic in-memory computing using phase-change-material memory arrays and photonic chip-based optical frequency combs (soliton microcombs³). The computation is reduced to measuring the optical transmission of reconfigurable and non-resonant passive components and can operate at a bandwidth exceeding 14 gigahertz, limited only by the speed of the modulators and photodetectors. Given recent advances in hybrid integration of soliton microcombs at microwave line rates³,⁴,⁵, ultralow-loss silicon nitride waveguides⁶,⁷, and high-speed on-chip detectors and modulators, our approach provides a path towards full complementary metal–oxide–semiconductor (CMOS) wafer-scale integration of the photonic tensor core. Although we focus on convolutional processing, more generally our results indicate the potential of integrated photonics for parallel, fast, and efficient computational hardware in data-heavy AI applications such as autonomous driving, live video processing, and next-generation cloud computing services.
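The tera-MACs-per-second scale follows from simple arithmetic: a crossbar that performs one multiply-accumulate per matrix element per modulation cycle delivers rows × columns × rate operations per second. A back-of-envelope sketch (the array size below is an illustrative assumption, not the paper's exact configuration):

```python
# Back-of-envelope throughput for a photonic crossbar performing one
# MAC per matrix element per modulation cycle. Array size is an
# illustrative assumption; the 14 GHz rate is the bandwidth quoted
# in the abstract.

def tensor_core_macs_per_s(rows, cols, mod_rate_hz):
    return rows * cols * mod_rate_hz

# e.g. a 16x16 array clocked at 14 GHz:
throughput = tensor_core_macs_per_s(16, 16, 14e9)
print(throughput / 1e12, "tera-MACs per second")
```

Even a modest array at gigahertz modulation rates lands in the tera-MACs regime, which is the parallelism argument the abstract makes.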
Article
Full-text available
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
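One of ALBERT's two reduction techniques is factorized embedding parameterization: instead of a direct V × H vocabulary embedding, it uses V × E plus E × H with E ≪ H. A short sketch of the parameter arithmetic, using BERT-base-like sizes (V = 30000, H = 768) and ALBERT's E = 128:

```python
# Parameter count of a direct vocabulary embedding vs. ALBERT-style
# factorized embedding (V x E followed by E x H, with E << H).

def embedding_params(vocab, hidden):
    return vocab * hidden

def factorized_embedding_params(vocab, hidden, e_dim):
    return vocab * e_dim + e_dim * hidden

direct = embedding_params(30000, 768)                     # 23,040,000
factored = factorized_embedding_params(30000, 768, 128)   #  3,938,304
print(direct, factored)
```

The factorization cuts the embedding parameters by roughly a factor of six at these sizes; cross-layer parameter sharing, ALBERT's other technique, shrinks the transformer stack similarly.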
Article
Full-text available
Optical neural networks (ONNs) have become competitive candidates for the next generation of high-performance neural network accelerators because of their low power consumption and high-speed nature. Beyond fully-connected neural networks demonstrated in pioneer works, optical computing hardware can also conduct convolutional neural networks (CNNs) by hardware reusing. Following this concept, we propose an optical convolution unit (OCU) architecture. By reusing the OCU architecture with different inputs and weights, convolutions with arbitrary input sizes can be done. A proof-of-concept experiment is carried out by cascaded acousto-optical modulator arrays. When the neural network parameters are ex-situ trained, the OCU conducts convolutions with SDR up to 28.22 dBc and performs well on inferences of typical CNN tasks. Furthermore, we conduct in-situ training and achieve a higher SDR of 36.27 dBc, verifying that the OCU can be further refined by in-situ training. Besides the effectiveness and high accuracy, the simplified OCU architecture, serving as a building block, could be easily duplicated and integrated into future chip-scale optical CNNs.
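The hardware-reuse idea can be sketched in software: a single fixed compute "unit" that evaluates one dot product is applied repeatedly over sliding input patches, so convolutions of arbitrary input size need only one physical unit. This is a generic illustration of the concept, not the cited OCU implementation:

```python
# Hardware-reuse sketch: one dot-product "unit" is invoked once per
# output sample, sliding over the input. Generic illustration only.

def unit_dot(patch, kernel):
    """The reused compute unit: one dot product per invocation."""
    return sum(p * k for p, k in zip(patch, kernel))

def conv1d_by_reuse(signal, kernel):
    k = len(kernel)
    return [unit_dot(signal[i:i + k], kernel)
            for i in range(len(signal) - k + 1)]

print(conv1d_by_reuse([1, 2, 3, 4], [1, 0, -1]))  # [1-3, 2-4] = [-2, -2]
```

In the optical setting, each invocation corresponds to re-driving the same modulator array with a new patch and weight set, trading latency for hardware footprint.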
Article
Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications, where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on the existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this article, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial-temporal GNNs. We further discuss the applications of GNNs across various domains and summarize the open-source codes, benchmark data sets, and model evaluation of GNNs. Finally, we propose potential research directions in this rapidly growing field.
Article
Deep learning has been shown to be successful in a number of domains, ranging from acoustics, images, to natural language processing. However, applying deep learning to the ubiquitous graph data is non-trivial because of the unique characteristics of graphs. Recently, substantial research efforts have been devoted to applying deep learning methods to graphs, resulting in beneficial advances in graph analysis techniques. In this survey, we comprehensively review the different types of deep learning methods on graphs. We divide the existing methods into five categories based on their model architectures and training strategies: graph recurrent neural networks, graph convolutional networks, graph autoencoders, graph reinforcement learning, and graph adversarial methods. We then provide a comprehensive overview of these methods in a systematic manner mainly by following their development history. We also analyze the differences and compositions of different methods. Finally, we briefly outline the applications in which they have been used and discuss potential future research directions.
Article
We describe the design choices behind MLPerf, a machine learning performance benchmark that has become an industry standard. The first two rounds of the MLPerf Training benchmark helped drive improvements to software-stack performance and scalability, showing a 1.3x speedup in the top 16-chip results despite higher quality targets and a 5.5x increase in system scale. The first round of MLPerf Inference received over 500 benchmark results from 14 different organizations, showing growing adoption.
Article
For the silicon photonic links to meet the target bit error rate (BER), the laser source must output enough optical power to overcome the optical channel loss and limited receiver sensitivity. Combined with poor wall-plug efficiency, this optical power requirement makes the electrical power consumed by the laser source a significant portion of the link energy cost. This article proposes a laser-forwarded coherent link that greatly reduces the required laser optical power by averaging optical path loss between the main and forwarded paths and by improving the receiver sensitivity using a homodyne coherent detector. The link performance analysis reveals that the laser forwarded link improves the laser power budget by ≈6–8 dB compared with a conventional intensity-modulation/direct detection link using typical photonic link components. To demonstrate the laser forwarded link, a microring resonator phase modulator, a balanced detector, and a 3-dB coupler are integrated with CMOS circuits in a GFUS 45-nm SOI process. The transmit driver and the receiver consume 40 and 450 fJ/bit, respectively. Aided by the ≈8-dB boost from laser forwarding and the coherent detection gain, the receiver achieves −15.6-dBm optical modulation amplitude sensitivity. The link operates at 10 Gb/s with BER < 10⁻⁹ and electrical energy efficiency of 2.3 pJ/bit.
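The power-budget argument reduces to dB bookkeeping: the laser must supply at least the receiver sensitivity plus the channel loss, so every dB of sensitivity improvement lowers the laser requirement by a dB. A simplified sketch (channel loss and the baseline sensitivity are illustrative assumptions; −15.6 dBm is the sensitivity quoted in the abstract):

```python
# Simplified optical link power budget in dB units. Channel loss and
# the baseline receiver sensitivity are illustrative assumptions.

def required_laser_dbm(sensitivity_dbm, channel_loss_db, margin_db=0.0):
    """Minimum laser output (dBm) to close the link."""
    return sensitivity_dbm + channel_loss_db + margin_db

direct = required_laser_dbm(sensitivity_dbm=-8.0, channel_loss_db=10.0)
coherent = required_laser_dbm(sensitivity_dbm=-15.6, channel_loss_db=10.0)
print(direct, coherent, direct - coherent)
```

With these assumed numbers the coherent receiver saves 7.6 dB of laser power, illustrating how the ≈6–8 dB budget improvement translates directly into laser (and hence wall-plug) savings.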
Conference Paper
Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with batch size of one, delivering inference latency of 0.50 ms.
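The Simba throughput and latency figures above are mutually consistent: at batch size one with a single image in flight, latency is the reciprocal of throughput. A one-line check (assuming one image in flight, which the batch-size-one setting suggests):

```python
# Reciprocity check of the reported Simba figures: 1988 images/s at
# batch size one implies ~0.50 ms per image, assuming one image in
# flight at a time.
images_per_s = 1988
latency_ms = 1.0 / images_per_s * 1e3
print(round(latency_ms, 2))  # ~0.50 ms, matching the reported latency
```

The agreement indicates the 0.50 ms figure is per-image end-to-end latency rather than a pipelined stage time.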