Scientific Reports | (2021) 11:3144
www.nature.com/scientificreports
Freely scalable and reconfigurable
optical hardware for deep learning
Liane Bernstein1,5*, Alexander Sludds1,5*, Ryan Hamerly1,2, Vivienne Sze1, Joel Emer3,4 &
Dirk Englund1*
As deep neural network (DNN) models grow ever-larger, they can achieve higher accuracy and
solve more complex problems. This trend has been enabled by an increase in available compute
power; however, efforts to continue to scale electronic processors are impeded by the costs of
communication, thermal management, power delivery and clocking. To improve scalability,
we propose a digital optical neural network (DONN) with intralayer optical interconnects and
reconfigurable input values. The path-length-independence of optical energy consumption enables
information locality between a transmitter and a large number of arbitrarily arranged receivers, which
allows greater flexibility in architecture design to circumvent scaling limitations. In a proof-of-concept
experiment, we demonstrate optical multicast in the classification of 500 MNIST images with a
3-layer, fully-connected network. We also analyze the energy consumption of the DONN and find that
digital optical data transfer is beneficial over electronics when the spacing of computational units is on
the order of > 10 µm.
Machine learning has become ubiquitous in modern data analysis, decision-making, and optimization. A
prominent subset of machine learning is the artificial deep neural network (DNN), which has revolutionized many
fields, including classification1, translation2 and prediction3,4. An important step toward unlocking the full
potential of DNNs is improving the energy consumption and speed of DNN tasks. To this end, emerging DNN-specific
hardware5–8 optimizes data access, reuse and communication for mathematical operations: most importantly,
general matrix–matrix multiplication (GEMM) and convolution9. However, despite these advances, a central
challenge in the field is scaling hardware to keep up with exponentially-growing DNN models10 (see Fig. 1) due
to electronic communication11, clocking12, thermal management13 and power delivery14.
To overcome these electronic limitations, optical systems have previously been proposed to perform linear
algebra and data transmission. Analog weighting of optical inputs can be implemented with masks, holography
or optical interference using acousto-optic modulation15–18, spatial light modulation19, electro-optic or
thermo-optic modulation20–23, phase-change materials24 or printed diffractive elements25. Due to their analog nature,
system errors can decrease the accuracy of large DNN models processed on this hardware. Prior works in
digital optical interconnects have focused on integrated point-to-point connections26,27, free-space point-to-point
transmission28,29, and small-scale free-space multicast30. These ideas would be difficult to scale since they incur
significant overhead in number of components and introduce compounded component losses.
In this Article, we introduce a novel optical DNN accelerator that encodes inputs and weights into
reconfigurable on-off optical pulses. Free-space optical elements passively transmit and copy data from memory to
large-scale electronic multiplier arrays (fan-out). The length-independence of this optical data routing enables
freely scalable systems, where single transmitters are fanned out to many arbitrarily arranged receivers with fast
and energy-efficient links. This system architecture is similar to our previous coherent optical neural network23,
but in contrast to that earlier work and the other analog schemes described above, we propose an entirely digital system.
Incoherent optical paths for data transmission (not computation) replace electrical on-chip interconnects, and
can thus preserve accuracy. Unlike prior digital optical interconnect systems, our ‘digital optical neural network’
(DONN) uses free-space fan-out for data distribution to a large number of receivers for the specific application
of matrix multiplication of the type found in modern DNNs.
We first illustrate the DONN architecture and discuss possible implementations. Then, in a proof-of-concept
experiment, we demonstrate that digital optical transmission and fan-out with cylindrical lenses has little effect
on the classification accuracy of the MNIST handwritten digit dataset (≤ 0.6%). Crosstalk is the primary cause of
NTT
NVIDIA,
These authors contributed equally: Liane Bernstein and
*email: lbern@mit.edu; asludds@mit.edu; englund@mit.edu
this drop in accuracy, and because it is deterministic, it can be compensated: with a simple crosstalk correction
scheme, we reduce our bit error rates by two orders of magnitude. Alternatively, crosstalk can be greatly reduced
through optimized optical design. Since shot and thermal noise are negligible (see “Discussion”), the accuracy
of the DONN can therefore be equivalent to an all-electronic DNN accelerator.
We also compare the energy consumption of optical interconnects (including light source energy) against
that of electronic interconnects over distances representative of logic, multi-chiplet interconnects and multi-chip
interconnects in a 7 nm CMOS node. Multiple chips44 or partitioned chips45,46 are regularly employed to process
large networks since they can ease electronic constraints and improve performance over a monolithic equivalent
through greater mapping flexibility47, at the cost of increased communication energy. Our calculations show
an advantage in data transmission costs for distances ≥ 5 µm (roughly the size of the basic computation unit:
an 8-bit multiply-and-accumulate (MAC), with length 5–8 µm). The DONN thus scales favorably with respect
to very large DNN accelerators: the DONN’s optical communication cost for an 8-bit MAC, i.e., the energy to
transmit two 8-bit values, remains constant at ∼3 fJ/MAC, whereas multi-chiplet systems have much higher
electrical interconnect costs (∼1000 fJ/MAC), and multi-chip systems have a higher energy consumption still
(∼30,000 fJ/MAC). Thus, the efficient optical data distribution provided by the DONN architecture will become
critical for continued growth of DNN performance through increased model sizes and greater connectivity.
Results
Problem statement. A DNN consists of a sequence of layers, in which input activations from one layer are
connected to the next layer via weighted paths (weights), as shown in Fig. 2a. We focus on inference tasks in this
paper (where weights are known from prior training), which, in addition to the energy consumption problem,
place stringent requirements on latency and throughput. Modern inference accelerators expend the majority of
energy (> 90%) on memory access, data movement, and computation in fully-connected (FC) and convolutional
(CONV) layers5.
Parallelized vector operations, such as matrix–matrix multiplication or successive vector–vector inner
products, are the largest energy consumers in CONV and FC layers. In an FC layer, a vector x of input values (‘input
activations’, of length K) is multiplied by a matrix W_{K×N} of weights (Fig. 2b). This matrix–vector product yields
a vector of output activations (y, of length N). Most DNN accelerators process vectors in B-sized batches, where
the inputs are represented by a matrix X_{B×K}. The FC layer then becomes a matrix–matrix multiplication
(X_{B×K} · W_{K×N}). CONV layers can also be processed as matrix multiplications, e.g., with a Toeplitz matrix9.
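To make this mapping concrete, the following is a minimal NumPy sketch (not the authors' code) of a batched FC layer as a single matrix product, and of a 1D sliding-window convolution rewritten as a Toeplitz matrix product. The function names, array sizes and random 8-bit values are illustrative assumptions.

```python
import numpy as np

def fc_layer(X, W):
    """Batched fully-connected layer: (B x K) @ (K x N) -> (B x N)."""
    return X @ W

def toeplitz_from_kernel(w, n):
    """Build the (n-k+1) x n Toeplitz matrix whose rows are shifted copies of w."""
    k = len(w)
    T = np.zeros((n - k + 1, n), dtype=w.dtype)
    for i in range(n - k + 1):
        T[i, i:i + k] = w  # entry T[i, j] = w[j - i], constant along diagonals
    return T

rng = np.random.default_rng(0)

# FC layer with hypothetical sizes B=4, K=6, N=3
X = rng.integers(0, 256, size=(4, 6))
W = rng.integers(0, 256, size=(6, 3))
Y = fc_layer(X, W)  # shape (B, N)

# 1D 'valid' sliding dot product (CNN-style convolution) as a matrix product
x = rng.integers(0, 256, size=10)
w = rng.integers(0, 256, size=3)
conv_as_matmul = toeplitz_from_kernel(w, len(x)) @ x
```

The Toeplitz form duplicates kernel (or, in other formulations, input) values in memory, which is the price paid for reusing a plain matrix-multiplication engine.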
In matrix multiplication, fan-out, where data is read once from main memory (DRAM) and used multiple
times, can greatly reduce data movement and memory access. This amortization of read cost across numerous
operations is critical for overall efficiency, since retrieving a single matrix element from DRAM requires two to
three orders of magnitude more energy than the MAC11. A simple input-weight product illustrates the benefit of
fan-out, since activation and weight elements appear repeatedly, as highlighted by the repetition of X_{11} and W_{11}:
(1)
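As a worked illustration, assuming the smallest nontrivial sizes B = K = N = 2 (a choice made here only for compactness), the product expands as follows. X_{11} appears in N = 2 output elements (the first row) and W_{11} appears in B = 2 output elements (the first column), so each value read from memory is used twice:

```latex
\begin{equation*}
\begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}
\begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}
=
\begin{bmatrix}
X_{11}W_{11} + X_{12}W_{21} & X_{11}W_{12} + X_{12}W_{22} \\
X_{21}W_{11} + X_{22}W_{21} & X_{21}W_{12} + X_{22}W_{22}
\end{bmatrix}
\end{equation*}
```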
[Figure 1 plot: number of model parameters (log scale, 10^5 to 10^11) versus year (2012 to 2020). Labeled models:
AlexNet, VGG16, GoogLeNet, ResNet-50, DQN, Inception V3, Xception, Transformer (Base), Transformer (Big),
NASNet, SENet, BERT, GPT-2, ALBERT, Transformer-XL, GPT-3.]
Figure 1. Number of parameters, i.e., weights, in recent landmark neural networks1,2,31–43 (references dated by
first release, e.g., on arXiv). The number of multiplications (not always reported) is not equivalent to the number
of parameters, but larger models tend to require more compute power, notably in fully-connected layers.
The two outlying nodes (pink) are AlexNet and VGG16, now considered over-parameterized. Subsequently,
efforts have been made to reduce DNN sizes, but there remains an exponential growth in model sizes to solve
increasingly complex problems with higher accuracy.
Consequently, DNN hardware design focuses on optimizing data transfer and input and weight matrix
element reuse. Accelerators based on conventional electronics use efficient memory hierarchies, a large array of
tightly packed processing elements (PEs, i.e., multipliers with or without local storage), or some combination
of these approaches. Memory hierarchies optimize temporal data reuse in memory blocks near the PEs to
boost performance under the constraint of chip area9. This strategy can enable high throughput in CONV layers5.
With fewer intermediate memory levels, a larger array of PEs (e.g., TPU v18) can further increase throughput
and lower energy consumption on workloads with a high-utilization mapping due to potentially reduced
overall memory accesses and a greater number of parallel multipliers (spatial reuse). Therefore, for workloads with
large-scale matrix multiplication such as those mentioned in the Introduction, if we maximize the number of
available PEs, we can improve efficiency.
Digital optical neural network architecture. Our DONN architecture replaces electrical interconnects
with optical links to relax the design constraints of reducing inter-multiplier spacing or colocating multipliers
with memory. Specifically, optical elements transfer and fan out activation and weight bits to electronic
multipliers to reduce communication costs in matrix multiplication, where each element X_{bk} is fanned out N times,
and W_{kn} is fanned out B times. The DONN scheme shown in Fig. 2c spatially encodes the first column of X_{B×K}
activations into a column of on-off optical pulses. At the first time step, the activation matrix transmitters fan
out the first bit of each of the matrix elements X_{b1}, ∀ b ∈ {1 ... B}, to the PEs (here, k = 1). Simultaneously, a
row of weight matrix light sources transmits the corresponding weight bits W_{1n} to each PE. The photons from
these activation and weight bits generate photoelectrons in the detectors, producing the voltages required at the
inputs of electronic multipliers (either 0 V for a ‘0’ or 0.8 V for a ‘1’). After 8 time steps, a multiplier has received
2 × 8 bits (8 bits for the activation value and 8 bits for the weight value), and the electronic multiplication occurs
as it would in an all-electronic system. The activation-weight product is completed, and is added to the locally
stored partial sum. The entire matrix–matrix product is therefore computed in 8 × K time steps; this dataflow
is commonly called ‘output stationary’. Instead of this bit-serial implementation, bits can be encoded spatially,
using a bus of parallel transmitters and receivers. The trade-off between added energy and latency in bit-serial
multiplication versus increased area from photodetectors for a parallel multiplier can be analyzed for specific
applications and CMOS nodes.
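The bit-serial, output-stationary dataflow described above can be sketched in NumPy as follows. This is a functional model only, not the authors' implementation: the helper name `donn_matmul_bit_serial`, the LSB-first bit order and the unsigned 8-bit operands are assumptions made for illustration, and the optical fan-out is modeled simply as array broadcasting.

```python
import numpy as np

BITS = 8  # assumed operand width (unsigned), as in the 8-bit example in the text

def donn_matmul_bit_serial(X, W):
    """Functional model of the output-stationary, bit-serial dataflow.

    X: (B, K) uint8 activations; W: (K, N) uint8 weights.
    Returns the (B, N) product and the number of bit time steps used.
    """
    B, K = X.shape
    N = W.shape[1]
    partial_sums = np.zeros((B, N), dtype=np.int64)  # one accumulator per PE
    steps = 0
    for k in range(K):
        x_reg = np.zeros((B, N), dtype=np.int64)  # activation register in each PE
        w_reg = np.zeros((B, N), dtype=np.int64)  # weight register in each PE
        for bit in range(BITS):  # LSB first (assumption)
            x_bits = ((X[:, k] >> bit) & 1).astype(np.int64)  # B transmitted bits
            w_bits = ((W[k, :] >> bit) & 1).astype(np.int64)  # N transmitted bits
            # Fan-out: each activation bit reaches all N PEs in its row,
            # each weight bit reaches all B PEs in its column.
            x_reg += x_bits[:, None] << bit
            w_reg += w_bits[None, :] << bit
            steps += 1
        partial_sums += x_reg * w_reg  # electronic multiply-and-accumulate
    return partial_sums, steps

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(3, 5), dtype=np.uint8)
W = rng.integers(0, 256, size=(5, 4), dtype=np.uint8)
Y, steps = donn_matmul_bit_serial(X, W)
```

With K = 5, the loop takes 8 × K = 40 bit time steps, and each transmitted bit is consumed by N (activations) or B (weights) processing elements at once.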
We illustrate an exemplary experimental DONN implementation in Fig. 3. Each source in a linear array of
vertical cavity surface emitting lasers (VCSELs) or µLEDs emits a cone of light into free space, which is
collimated by a spherical lens. A diffractive optical element (DOE) focuses the light to a 1D spot array on a 2D
receiver, where the activations and weights are brought into close proximity using a beamsplitter. ‘Receiverless’
photodetectors48 convert the optical signals to the electrical domain. An electronic multiplier then multiplies
the values. The output is either saved to memory, or routed directly to another DONN that implements the next
layer of computation. Note that the data distribution pattern is not confined to regular rows and columns. A
spatial light modulator (SLM), an array of micromirrors, scattering waveguides or a DOE can route and fan out
bits to arbitrary locations. Since free-space propagation is lossless and mirrors, SLMs and diffractive elements
[Figure 2 graphic. Panel labels: input activations, output classification; single classification (x · W = y);
batch classification (B objects to process, X · W = Y); optical and electronic arrays indexed by b = 1 ... B,
n = 1 ... N and k, with 8 bits per value.]
Figure 2. Digital fully-connected neural network (FC-NN) and hardware implementations. (a) FC-NN with
input activations (red, vector length K) connected to output activations (vector length N) via weighted paths, i.e.,
weights (blue, matrix size K × N). (b) Matrix representation of one layer of an FC-NN with B-sized batching.
(c) Example bit-serial multiplier array, with output-stationary accumulation across k. Fan-out of X across
n ∈ {1 ... N}; fan-out of W across b ∈ {1 ... B}. Bottom panel: all-electronic version with fan-out by copper
wire (for clarity, fan-out of W not illustrated). Top panel: digital optical neural network version, where X and
W are fanned out passively using optics, and transmitted to an array of photodetectors. Each pixel contains two
photodetectors, where the activations and weights can be separated by, e.g., polarization or wavelength filters.
Each photodetector pair is directly connected to a multiplier in close proximity.
are highly efficient (> 95%), most length- or receiver-number-dependent losses can be attributed to imperfect
focusing, e.g., from optical aberrations far from the optical axis. These effects can be mitigated through judicious
optical design. We assume for the remainder of our analysis that energy is length-independent.
Bit error rate and inference experiments. We used a DONN implementation similar to Fig. 3a to test
optical digital data transmission and fan-out for DNNs, as described in “Methods”. In our first experiment, we
determined the bit error rate of our system. Figure 4a shows an example of a background-subtracted and
normalized image, captured on the camera when the digital micromirror devices (DMDs) displayed random vectors
of ‘1’s and ‘0’s. The camera’s de-Bayering algorithm (described in “Methods”), as well as optical aberrations and
misalignment, caused some crosstalk between pixels (see Fig. 4b). Using a region of 357 × 477 superpixels on the
camera, we calculated bit error rates (in a single shot) of 1.2 × 10^−2 and 2.6 × 10^−4 for the blue and red channels,
respectively. When we confined the region of interest to 151 × 191 superpixels, the bit error rate (averaged over
100 different trials, i.e., 100 pairs of input vectors) was 4.4 × 10^−3 and 4.6 × 10^−5 for the blue and red arms. See
[Figure 3 graphic. Panel (a) labels: X (b = 1 ... B), W (n = 1 ... N), Lens, DOE, BS, multipliers with adders,
‘To memory or next layer’. Panel (b) labels: Vbias, Vout, VDD.]
Figure 3. Possible implementation of digital optical neural network. (a) Digital inputs and weights are
transmitted electronically to an array of light sources (red and blue, respectively, illustrating different paths).
Single-mode light from a source is collimated by a spherical lens (Lens), then focused to a 1D spot array by
a diffractive optical element (DOE). A 50:50 beamsplitter brings light from the inputs and weights into close
proximity on a custom CMOS receiver. (b) Example circuit with 2 photodetectors (biased by voltage V_bias) per
PE: 1 for activations; 1 for weights. Received bits (V_out) proceed to multiplier, then memory or next layer.
[Figure 4 graphic. (a) 2D image over Multiplier (x) and Multiplier (y), with the activation and weight arms and the
optical fan-out direction marked. (b) Intensity (0 to 1) versus Multiplier (y), with legend: correctly received
as 1, correctly received as 0, threshold.]
Figure 4. Background-subtracted and normalized receiver output from free-space digital optical neural
network experiment with random vectors of ‘1’s and ‘0’s displayed on DMDs. (a) Full 2D image. (b) One
column: pixels received as ‘1’ in red and ‘0’ in black.
Supplementary Note 1 for more details on bit error rate and error maps. Because crosstalk is deterministic, and
not a source of random noise, we can compensate for it. We applied a simple crosstalk correction scheme that
assumes uniform crosstalk on the detector and subtracts a fixed fraction of an element’s nearest neighbors from
the element itself (see Supplementary Note 2). The bit error rates for the blue and red channels then respectively
dropped to 2.9 × 10^−3 and 0 for the 357 × 477-pixel, single shot image and 2.6 × 10^−5 and 0 for the 151 × 191-pixel,
100-image average. In other words, after crosstalk correction, there were no errors in the red channel, and
the errors in the blue channel dropped significantly.
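The idea of subtracting a fixed fraction of each element's nearest neighbors can be illustrated with a toy model. The sketch below is not the procedure of Supplementary Note 2: the 4-neighbor kernel, the crosstalk fraction `ALPHA`, the 0.5 threshold and the synthetic 64 × 64 bit pattern are all assumptions chosen so that the effect is easy to see.

```python
import numpy as np

def neighbor_sum(a):
    """Sum of the 4 nearest neighbors of every pixel (zero padding at the edges)."""
    p = np.pad(a, 1)
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]

def bit_error_rate(received, sent, threshold=0.5):
    """Fraction of pixels whose thresholded value differs from the sent bit."""
    return float(np.mean((received > threshold) != (sent == 1)))

ALPHA = 0.15  # assumed uniform crosstalk fraction per neighbor (hypothetical)

rng = np.random.default_rng(1)
sent = rng.integers(0, 2, size=(64, 64)).astype(float)

# Toy deterministic crosstalk: each pixel picks up ALPHA of each neighbor.
received = sent + ALPHA * neighbor_sum(sent)

# First-order correction: subtract the same fraction of the *received* neighbors.
corrected = received - ALPHA * neighbor_sum(received)

ber_before = bit_error_rate(received, sent)
ber_after = bit_error_rate(corrected, sent)
```

In this toy model, a ‘0’ surrounded by four ‘1’s reads 0.6 and is misclassified before correction; after correction the residual error is of order ALPHA² and no pixel crosses the threshold. Real crosstalk is neither perfectly uniform nor limited to nearest neighbors, so the measured improvement need not be this complete.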
Next, we experimentally tested the DONN’s effect on the classification accuracy of 500 MNIST images using
a three-layer (i.e., two-hidden-layer), fully-connected neural network (FC-NN), with the dataset and training
steps described in Supplementary Note 3. We compared our uncorrected experimental classification results with
inference performed entirely on CPU (ground truth) in two ways. The simplest analysis, reported in Table 1,
Figure 5. Experimentally measured 3-layer FC-NN output scores, otherwise known as confusion matrix,
for 500 MNIST images from test dataset. The values along the diagonal represent correct classification by the
model. Each column is an average of ∼50 vectors. (a) DONN output scores (no crosstalk correction applied).
(b) Ground-truth (all-electronic) output scores. (c, d) Box plot of the diagonals of subfigures (a) and (b)
respectively. (e) Difference in diagonals of DONN output scores versus ground-truth output scores. Box plots
represent the median (orange), interquartile range (IQR, box) and ‘whiskers’ extending 1.5 IQRs beyond the first
and third quartile; outliers are displayed as yellow circles.
Table 1. MNIST classification accuracy of DONN (no crosstalk correction applied) versus all-electronic
hardware with custom fully-connected neural network models.
                             2 layers (%)    3 layers (%)
Electronic (ground truth)    95.8            96.4
DONN                         95.4            95.8
shows a 0.6% drop in classification accuracy for the DONN versus the ground truth values (or 3 additional
incorrectly classified images). Figure 5 illustrates more detailed results, where we analyzed the network output
scores. An output score is roughly equivalent to the assigned likelihood that an input image belongs to a given
class, and is defined as the normalized (via the softmax function) output vector of a DNN. We found that, along
the matrix diagonal, the first and third quartiles in the difference in output scores between the DONN and the
ground truth have a magnitude < 3%. The absolute difference in average output scores is also < 3%. We also
performed this experiment with a single hidden layer (‘2-layer’ case), and achieved similar results (a 0.4% drop
in accuracy, or 2 misclassified images). No crosstalk error correction was applied to these results to illustrate the
worst-case impact on accuracy.
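For reference, the softmax normalization behind the output scores, and the conversion of the accuracy drops in Table 1 into image counts, can be written as a short sketch. The logits passed to `softmax` are hypothetical; only the accuracies and the 500-image test set come from the text.

```python
import numpy as np

def softmax(z):
    """Normalize a vector of raw network outputs into scores that sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

scores = softmax([2.0, 1.0, 0.1])  # hypothetical logits for a 3-class example

N_IMAGES = 500  # size of the MNIST test subset used in the experiment

def extra_errors(acc_ref_pct, acc_donn_pct, n=N_IMAGES):
    """Additional misclassified images implied by a drop in accuracy."""
    return round((acc_ref_pct - acc_donn_pct) / 100 * n)

extra_3layer = extra_errors(96.4, 95.8)  # 0.6% of 500 images
extra_2layer = extra_errors(95.8, 95.4)  # 0.4% of 500 images
```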
Energy analysis: DONN compared with all-electronic hardware. In this section, we compare the
theoretical interconnect energy consumption of the DONN with its all-electronic equivalent, where
interconnects are illustrated in green in Fig. 6. We assume an implementation in a 7 nm CMOS process for both cases.
The interconnect energy, which must include any source inefficiencies, is the energy required to charge the
parasitic wire, detector, and inverter capacitances, where a CMOS inverter is representative of the input to a
multiplier. See “Methods” for full energy calculations. In the electronic case, a long wire transports data to a
row of multipliers using low-cost (0.06 fJ/bit) repeaters (see Supplementary Note 6). The wire has a large
parasitic capacitance, but also produces an effective electrical fan-out. In the DONN, the energetic requirements
of the detectors contrast with those of conventional optical receivers, which aim to maximize sensitivity to the
optical input field, rather than minimize the energetic cost of the system as a whole. The parameters used for
electronic and optical components are summarized in Table 2, where hν/e must be greater than or equal to the
bandgap E_g of the detector material (here, we have chosen silicon as an example, and set hν/e = E_g).
C_wire/µm is the wire capacitance per micrometer, V_DD is the supply voltage and C_det is a theoretical approximation of the
[Figure 6 graphic: panels (a) to (d), each showing memory (Mem) blocks and rows of processing elements (PE).]
Figure 6. Fan-out of one bit from memory (Mem) to multiple processing elements (PEs). (a) Fan-out by
electrical wire to a row of PEs in a monolithic chip. (b) DONN equivalent of monolithic chip, where green wire
is replaced by optical paths. (c) Fan-out by electrical wire to blocks of PEs divided into chiplets, or separated by
memory and logic. (d) DONN equivalent of fan-out to PEs in multiple blocks [energetically equivalent to (b)].
Table 2. Parameters.
Parameter               Value          Refs.
C_wire/µm               ∼0.2 fF/µm     48,55,56
C_T                     ∼0.1 fF        48,53
C_det                   0.1 fF         48
hν/e                    1.12 eV
WPE                     ∼0.5           51,52
A_det                   1 µm × 1 µm    48
L_wire_intra-chiplet    5–8 µm†
L_wire_inter-chiplet    2.5 mm         45
L_wire_inter-chip       ∼5 cm          57
V_DD                    0.80 V         58
E_MAC*                  25 fJ/MAC      11,58
† We assume a square multiplier and scale reported 8-bit multiplier areas in a 45 nm
node59–61 to a 7 nm node (the current state of the art) with the scaling factors from literature58. A MAC unit
comprises both an 8-bit multiplier and a 32-bit adder, so we are placing a lower bound on the minimum length
of L_wire. Recent work62 optimizes MAC units for DNNs, and reports a 337 µm² area in a 28 nm node, where the
MAC unit comprises an 8-bit multiplier and a 32-bit adder. Extrapolated to a 7 nm node with a fourth-order
polynomial fit of the scaling factors from literature58, the MAC unit is of size (7 µm)², which falls within the
5–8 µm range.
* E_MAC, the energy required for one multiply-and-accumulate, shown for reference.
capacitance of a receiverless cubic photodetector48 with surface area A_det = (1 × 1) µm². Several past examples
of small CMOS integrated detectors in older CMOS nodes49,50 showcase the feasibility of receiverless detectors
in advanced nodes. The optical source power conversion efficiency (wall-plug efficiency, i.e., WPE) is a measured
value for VCSELs51,52. C_T is an approximation for the capacitance of an inverter48,53. L_wire is the distance between
MAC units in various scenarios: with abutted MAC units (intra-chiplet), between chiplets (inter-chiplet) and
between chips (inter-chip).
As shown in Fig. 7, we find that the optical communication energy is E_comm ≈ 3 fJ/MAC, independent
of length, when we use receiverless detectors in a modern CMOS process (limited by the photodetector and
inverter capacitances). On the other hand, the electrical interconnect energy scales from E_comm = 3–4 fJ/MAC
for inter-multiplier communication for abutted MAC units, to ∼1000 fJ/MAC for inter-chiplet interconnects, to
∼30,000 fJ/MAC for inter-chip interconnects. The crossover point where the optical interconnect energy drops
below the electrical energy occurs when L_wire ≥ 5 µm. The DONN therefore provides an improvement in the
interconnect energy for data transmission and can scale to greatly decrease the energy consumption of data
distribution with regular distribution patterns. In Fig. 7, we have also included the optical communication energy
per MAC with a large, commercial photodiode, which illustrates the need for receiverless photodetectors in a
7 nm CMOS process. In the future, plasmonic photodetectors54 may lower the capacitance below 0.1 fF.
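The orders of magnitude quoted above can be checked with a back-of-envelope model built from the Table 2 parameters. The sketch below is not the calculation from “Methods”: it assumes an average switching energy of (1/4)·C·V_DD² per transmitted bit for wires, a probability of 1/2 that an optical bit is a ‘1’, and it ignores repeater energy. These factors are assumptions chosen here because they land close to the reported values.

```python
# Table 2 parameters (SI units)
C_WIRE_PER_UM = 0.2e-15  # wire capacitance per micrometer, F/µm
C_T = 0.1e-15            # inverter capacitance, F
C_DET = 0.1e-15          # receiverless photodetector capacitance, F
PHOTON_V = 1.12          # hν/e, in volts (set equal to the silicon bandgap)
WPE = 0.5                # wall-plug efficiency of the light source
V_DD = 0.80              # supply voltage, V
BITS_PER_MAC = 16        # two 8-bit operands per MAC

ACTIVITY = 0.25  # assumed average factor for charging a wire with random data
P_ONE = 0.5      # assumed probability that a transmitted optical bit is a '1'

def e_elec_fj(l_wire_um):
    """Electrical communication energy per MAC (fJ) for a wire of given length."""
    c_total = C_WIRE_PER_UM * l_wire_um + C_T
    return BITS_PER_MAC * ACTIVITY * c_total * V_DD**2 * 1e15

def e_donn_fj(c_det=C_DET):
    """Optical communication energy per MAC (fJ), independent of length."""
    charge = (c_det + C_T) * V_DD  # charge needed to swing the input node
    return BITS_PER_MAC * P_ONE * charge * PHOTON_V / WPE * 1e15

def crossover_um():
    """Wire length at which electrical and optical energies are equal."""
    c_equal = e_donn_fj() / (BITS_PER_MAC * ACTIVITY * V_DD**2 * 1e15)
    return (c_equal - C_T) / C_WIRE_PER_UM

e_opt = e_donn_fj()
e_intra = (e_elec_fj(5), e_elec_fj(8))
e_inter_chiplet = e_elec_fj(2.5e3)  # 2.5 mm
e_inter_chip = e_elec_fj(5e4)       # 5 cm
l_cross = crossover_um()
```

Under these assumptions the optical cost is roughly 3 fJ/MAC, the electrical cost grows from a few fJ/MAC for abutted units to about 10^3 fJ/MAC at 2.5 mm and a few times 10^4 fJ/MAC at 5 cm, and the crossover falls near 5 µm, in line with the figures quoted in the text.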
Discussion
With minimal impact on accuracy, the DONN yields an energy advantage over all-electronic accelerators with
long wire lengths for digital data transfer. In our proof-of-concept experiment, we performed inference on 500
MNIST images with 2- and 3-layer FC-NNs and found a ≤ 0.6% drop in accuracy and a < 3% absolute difference
in average output scores with respect to the ground truth implementation on CPU. We attributed these errors to
crosstalk due to imperfect alignment and blurring from the camera’s Bayer filter. In fact, a simple crosstalk
correction scheme lowered measured bit error rates by two orders of magnitude. We could thus transmit bits with
100% measured fidelity in the activation arm (better aligned than the weight arm), which illustrates that crosstalk
can be mitigated and possibly eliminated through post-processing, charge sharing at the detectors, greater
spacing of receivers, or optimized design of optical elements and receiver pixels. In the hypothetical regime where
error due to crosstalk is negligible, the remaining noise sources are shot and thermal noise. Intuitively, shot and
thermal noise are also present in an all-electronic system, and the number of photoelectrons at the input to an
inverter in the DONN is equal to the number of electrons at the input to an inverter in electronics. Therefore,
if these noise sources do not limit accuracy in the all-electronic case, the same can be said for the DONN48. For
mathematical validation that shot and thermal noise have a trivial impact on bit error rate in the DONN, see
Supplementary Note 7. These analyses demonstrate that the fundamental limit to the accuracy of the DONN is
no different than the accuracy of electronics, and thus, we do not expect accuracy to hinder DONN scaling in
an optimized system.
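A rough count of the carriers involved supports this intuition. The sketch below is not the analysis of Supplementary Note 7: it assumes that a received ‘1’ must deliver the charge Q = (C_det + C_T)·V_DD to the inverter input, using the Table 2 values, and that photoelectron arrivals follow Poisson statistics.

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge, C
C_DET = 0.1e-15             # photodetector capacitance from Table 2, F
C_T = 0.1e-15               # inverter capacitance from Table 2, F
V_DD = 0.80                 # supply voltage from Table 2, V

# Electrons needed to charge the receiver node to V_DD for a '1'
n_electrons = (C_DET + C_T) * V_DD / E_CHARGE

# Poisson (shot-noise) standard deviation and relative fluctuation (assumed statistics)
shot_sigma = math.sqrt(n_electrons)
relative_fluctuation = shot_sigma / n_electrons
```

With about a thousand photoelectrons per ‘1’, the shot-noise fluctuation is a few percent of the signal, far from a decision threshold placed midway between the ‘0’ and ‘1’ levels.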
[Figure 7 plot: E_comm (fJ/MAC, log scale, 10^0 to 10^6) versus length (mm, log scale, 10^−4 to 10^2). Curves:
E_elec, E_DONN | C_det = 0.1 fF, E_DONN | C_det = 1 pF, and E_MAC; regions marked intra-chiplet wire,
inter-chiplet wire and inter-chip wire.]
Figure 7. Energy required to transmit 16 bits (communication energy per 8-bit MAC, i.e., E_comm). Electronic
data transfer energy (E_elec) increases with wire length, whereas optical data transfer energy (E_DONN) remains
constant. Optical data transfer evaluated for two detector capacitances: C_det = 1 pF for large,
commercially-available photodiodes63; and C_det = 0.1 fF for emerging receiverless, (1 µm)³-sized cubic detectors in modern
CMOS processes48. Below C_det = 0.1 fF, the capacitance of the overall receiver becomes limited by the
capacitance of the CMOS inverter. Otherwise, the capacitance of the photodetector is energy-limiting. Energy of
one 8-bit multiply-and-accumulate operation (E_MAC = 25 fJ/MAC) also shown for reference.
In our theoretical energy calculations, we compared the length-independent data delivery costs of the DONN
with those of an all-electronic system. We found that in the worst case, when multipliers are abutted in a
multiplier array, optical transmitters have a similar interconnect energy cost compared to copper wires in a 7 nm
node. The regime where the DONN shows important gains over copper interconnects is in architectures with
increased spacing between computation units. As problems scale beyond the capabilities of existing single
electronic chips, multiple chiplets or chips perform DNN tasks in concert. In the multi-chiplet and multi-chip
cases, the costs to transmit two 8-bit values in electronics (∼1000 fJ/MAC and ∼30,000 fJ/MAC, respectively)
are therefore significantly larger than that of an 8-bit MAC (25 fJ/MAC)11,58. On the other hand, in optics, the
interconnect cost (∼3 fJ/MAC, including source energy) remains an order of magnitude smaller than the MAC
cost. Since multi-chiplet and multi-chip systems offer a promising approach to increasing throughput on large
DNN models, optical connectivity can further these scaling efforts by reducing inter-chiplet and inter-chip
communication energy by orders of magnitude. We further discuss the scalability of the DONN in Supplementary
Note 8. In terms of the DONN’s area, we assume the added chip area at the receiver is negligible, since the area
of a photodetector A_det = 1 µm² is ∼50× smaller than a MAC unit of size (L_wire_intra-chiplet)². Furthermore, for
many practical applications (e.g., workstations, servers, data centers), chip area, which sets fabrication cost, and
energy efficiency are much more important than overall packaged volume. In data centers today, space is required
between chips for heat sinks and airflow, and the addition of lenses need not increase this volume significantly.
Finally, as discussed in Supplementary Note 9, optical devices do not restrict the clock speed of the system since
their bandwidths are > 10 GHz. In fact, the clock speed of a digital electronic system is generally limited to
∼1 GHz due to thermal dissipation requirements; it could be improved in the DONN, since greater component
spacing for thermal management would not increase energy consumption.
Because length-independent data distribution is a tool currently unavailable to digital system designers,
relaxing electronic constraints on locality can open new avenues for DNN accelerator architectures. For example,
memory can be devised such that numerous small pieces of memory are located far away from the point of
computation and reused many times spatially, with a small fixed cost for doing so. Designers can then lay out smaller
memory blocks with higher bandwidth, lower energy consumption, and higher yield. If memory and
computation are spatially distinct, we have the added benefit of allowing for more compact memories that consume less
energy and area, e.g., DRAM, which is fabricated with a different process than typical CMOS to achieve higher
density than on-chip memories. Furthermore, due to its massive fan-out potential, the DONN can, firstly, reduce
overhead by minimizing a system’s reliance on a memory hierarchy and, secondly, amortize the cost of weight
delivery to multiple clients running the same neural network inference on different inputs. Additionally, some
newer neural network models require irregular connectivity (e.g., graph neural networks, which show
state-of-the-art performance on recommender systems, but are restricted in size due to insufficient compute power64,65).
These systems have arbitrary connections with potentially long wire lengths between MAC units, representing
different edges in the graph. The DONN can implement these links without incurring additional costs in energy
from a complex network-on-chip in electronics. Yet another instance of greater distance between multipliers is
in higher-bit-precision applications, as in training, which require larger MAC units.
In future work, we plan to assess the performance of the DONN on state-of-the-art DNN workloads, such as
the models described in MLPerf66. First, we will benchmark the DONN against all-electronic state-of-the-art
accelerators by using Timeloop67. Through a search for optimal mappings (ways to organize data and
computation), this software can simulate the total energy consumption and latency of running various workloads on
a given hardware architecture, including computation and memory access. Timeloop therefore enables us to
perform an in-depth comparison of all-electronic accelerators against the proposed instances of the DONN,
including variable data transmission costs for different electronic wire lengths. Second, we will design an optical
setup and receiver to reduce experimental crosstalk, power consumption and latency. We can then test larger
workloads on this optimized hardware. Finally, beyond neural networks, there are many examples of matrix
multiplication which a DONN-style architecture can accelerate, such as optimization, Ising machines and statistical
analysis, and we plan to investigate these applications as well.
In summary, the DONN implements arbitrary transmission and fan-out of data with an energy cost per MAC that is independent of data transmission length and number of receivers. This property is key to scaling deep neural network accelerators, where increasing the number of processing elements for greater throughput in all-electronic hardware typically implies higher data communication costs due to longer electronic path length. Contrary to other proposed optical neural networks21–25, the DONN does not require digital-to-analog conversion and is therefore less prone to error propagation. The DONN is also reconfigurable, in that the weights and activations can be easily updated. Our work indicates that the length-independent communication enabled by optics is useful for digital neural network system design, for example to simplify memory access to weight data. We find that optical data transfer begins to save energy when the spacing of MAC computational units is on the order of > 10 μm. More broadly, further gains can be expected through the relaxation of electronic system architecture constraints.
Methods
Digital optical neural network implementation for bit error rate and inference experiments. We performed bit error rate and inference experiments with optical data transfer and fan-out of point sources using cylindrical lenses. Two digital micromirror devices (DMDs, Texas Instruments DLP3000, DLP4500) illuminated by spatially-filtered and collimated LEDs (Thorlabs M625L3, M455L3) acted as stand-ins for the two linear source arrays. For the input activations/weights, each 10.8 µm-long mirror in one DMD column/row either reflected the red/blue light toward the detector (‘1’) or a beam dump (‘0’). Then, for each of the DMDs, an $f = 100$ mm spherical lens followed by an $f = 100$ mm cylindrical achromatic lens imaged
one DMD pixel to an entire row/column of superpixels of a color camera (Thorlabs DCC3240C). Each camera superpixel is made up of four pixels of size (5.3 µm)²: two green, one red and one blue. The camera acquisition program applies a ‘de-Bayering’ interpolation to automatically extract color information for each sub-pixel; this interpolation causes blurring, and therefore it increases crosstalk in our system. In a future version of the DONN, a specialized receiver will reduce this crosstalk and also operate at a higher speed.
To process the image received on the camera, we subtracted the background, normalized, then thresholded by a fixed value for each channel. (We acquired normalization and background curves with all DMD pixels in the ‘on’ and ‘off’ states, respectively. This background subtraction and normalization could be implemented on-chip by precharacterizing the system, and biasing each receiver pixel by some fixed voltage.) If the detected intensity was above the threshold value, it was labeled a ‘1’; below threshold, a ‘0’. For the bit error rate experiments, we compared the parsed values from the camera with the known values transmitted by the DMDs, and defined the bit error rate as the number of incorrectly received bits divided by the total number of bits. In the inference experiments, the DMDs displayed the activations and pre-trained weights, which propagated through the optical system to the camera. After background subtraction and normalization, the CPU multiplied each activation with each weight, and applied the nonlinear function (ReLU after the hidden layers and softmax at the output). We did not correct for crosstalk here, to illustrate the worst-case scenario of impact on accuracy. The CPU then fed the outputs back to the input activation DMD for the next layer of computation. We used a DNN model with two hidden layers with 100 activations each and a 10-activation output layer. We also tested a model with a single hidden layer with 100 activations.
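The receive-side processing described above (background subtraction, normalization, thresholding, then counting bit errors) can be sketched as follows. This is a minimal illustration and not the authors' acquisition code: the function names `parse_bits` and `bit_error_rate`, the threshold of 0.5 and the synthetic camera frame are all assumptions made for the example.

```python
import numpy as np

def parse_bits(frame, background, bright, threshold=0.5):
    """Background-subtract, normalize and threshold one camera color channel.

    `background` and `bright` stand for frames recorded with all DMD pixels
    'off' and 'on', respectively (hypothetical calibration inputs).
    """
    span = bright.astype(float) - background.astype(float)
    span = np.where(span == 0, 1.0, span)  # avoid division by zero on dead pixels
    normalized = (frame.astype(float) - background) / span
    return (normalized > threshold).astype(np.uint8)

def bit_error_rate(received, transmitted):
    """Number of incorrectly received bits divided by the total number of bits."""
    received = np.asarray(received)
    transmitted = np.asarray(transmitted)
    return np.count_nonzero(received != transmitted) / transmitted.size

# Synthetic example: an 8 x 8 bit pattern with bounded additive noise.
rng = np.random.default_rng(0)
sent = rng.integers(0, 2, size=(8, 8), dtype=np.uint8)
background = np.full(sent.shape, 10.0)
bright = np.full(sent.shape, 200.0)
noise = rng.uniform(-20.0, 20.0, size=sent.shape)
frame = background + sent * (bright - background) + noise

received = parse_bits(frame, background, bright)
print("bit error rate:", bit_error_rate(received, sent))
```

In the experiment the threshold is a fixed value per color channel; a single scalar is used here only to keep the sketch short.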
MNIST preprocessing. For the inputs to the network, a bilinear interpolation algorithm transformed the 28 × 28-pixel images into 7 × 7-pixel images, which were then flattened into a 1D 49-element vector. The standard mapping of Eq. (2) quantized both input and weight matrices into 8-bit integer representations. In Eq. (2), Quantized is the returned value, QuantizedMin is the minimum value expressible in the quantized datatype (here, always 0), Input is the input data to be quantized, FloatingMin is the minimum value in Input, and Scale is the scaling factor to map between the two datatype ranges, $\mathrm{Scale} = \frac{\mathrm{FloatingMax} - \mathrm{FloatingMin}}{\mathrm{QuantizedMax} - \mathrm{QuantizedMin}}$. See the gemmlowp documentation68 for more information on implementations of this quantization. In practice, 8-bit representations are widely used in DNNs, since 8-bit MACs are generally sufficient to maintain accuracy in inference8,69,70.
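As an illustration of the quantization mapping in Eq. (2), the sketch below converts a floating-point array to 8-bit integers and back. It is written under stated assumptions: Eq. (2) does not specify how the result is rounded, so rounding to nearest and clipping to the 8-bit range are choices made here, and the names `quantize` and `dequantize` are hypothetical (this is not the gemmlowp API).

```python
import numpy as np

def quantize(values, quantized_min=0, quantized_max=255):
    """Map floating-point data onto [quantized_min, quantized_max] as in Eq. (2)."""
    values = np.asarray(values, dtype=np.float64)
    floating_min = values.min()
    floating_max = values.max()
    # Scale = (FloatingMax - FloatingMin) / (QuantizedMax - QuantizedMin)
    scale = (floating_max - floating_min) / (quantized_max - quantized_min)
    if scale == 0:  # constant input: every element maps to quantized_min
        return np.full(values.shape, quantized_min, dtype=np.uint8), scale, floating_min
    quantized = quantized_min + (values - floating_min) / scale
    # Assumption: round to nearest integer and clip to the representable range.
    quantized = np.clip(np.rint(quantized), quantized_min, quantized_max)
    return quantized.astype(np.uint8), scale, floating_min

def dequantize(quantized, scale, floating_min, quantized_min=0):
    """Approximate inverse of `quantize`."""
    return floating_min + (quantized.astype(np.float64) - quantized_min) * scale

# Example: a flattened 49-element input vector with values in [0, 1].
vector = np.linspace(0.0, 1.0, 49)
q, scale, fmin = quantize(vector)
print(q.dtype, q.min(), q.max())
print("max round-trip error:", np.max(np.abs(dequantize(q, scale, fmin) - vector)))
```

With rounding to nearest, the round-trip error is bounded by about half of Scale, which is why the choice of FloatingMin and FloatingMax matters for accuracy.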
Electronic and optical interconnect energy calculations. When an electronic wire transports data over a distance $L_{\mathrm{wire}}$ to the gate of a CMOS inverter (representative of a full-adder’s input, the basic building block of multipliers), the energy consumption per bit is given by Eq. (3), where $V_{DD}$ is the supply voltage, $C_{\mathrm{wire}}/\mu\mathrm{m}$ is the wire capacitance per micrometer, $L_{\mathrm{wire}}$ is the wire length between two multipliers and $C_T$ is the inverter capacitance. Interconnects consume energy predominantly when a load capacitance, such as a wire, is charged from a low (0 V) to a high (∼1 V) voltage, i.e., in a 0→1 transition. If we assume a low leakage current, maintaining a value of ‘1’ (i.e., 1→1) consumes little additional energy. To switch a wire from a ‘1’ to a ‘0’, the wire is discharged to the ground for free (Supplementary Note 4). Lastly, maintaining a value of ‘0’ simply keeps the voltage at 0 V, at no cost. Assuming a random distribution of ‘0’ and ‘1’ bits, we therefore include a factor of 1/4 in Eq. (3) to account for this dependence on switching activity.
In the DONN, a light source replaces the wire for fan-out. The low capacitances of the receiverless detectors in the DONN allow for the removal of receiving amplifiers48. Thus, the DONN’s minimum energy consumption corresponds to the optical energy required to generate a voltage swing of 0.8 V on the load capacitance (i.e., the photodetector ($C_{\mathrm{det}}$) and an inverter ($C_T$)), all divided by the source’s power conversion efficiency (wall-plug efficiency, WPE). Subsequent transistors in the multiplier are powered by the off-chip voltage supply, as in the all-electronic architecture. Assuming a detector responsivity of ∼1 (ref. 71), the DONN interconnect energy cost is given by Eq. (4), where $h\nu$ is the photon energy and the number of photons per bit, $n_p$, is determined by Eq. (5).
As in the all-electronic case, we assume low leakage on the receiverless photodetector. Photons are received for every ‘1’ and therefore, to avoid charge buildup, charge on the output capacitor must be reset after every clock cycle. In Supplementary Note 5, we propose a CMOS discharge circuit that actively resets the receiver. (Another possible method is a dual-rail encoding scheme48.) Thus, the switching activity factor is 1/2 instead of 1/4: as for the all-electronic case, we assume a random distribution of bits, but here, both 1→1 and 0→1 have a nonzero cost.
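The two switching activity factors can be checked by enumerating the four equally likely (previous bit, current bit) pairs of a random bit stream and counting those that cost energy. The helper name `activity_factor` is hypothetical, and the sketch assumes independent, uniformly distributed bits as in the text.

```python
from itertools import product

def activity_factor(costly_transitions):
    """Fraction of equiprobable (previous, current) bit pairs that cost energy."""
    pairs = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)
    return sum(1 for pair in pairs if pair in costly_transitions) / len(pairs)

# All-electronic wire: only a 0 -> 1 transition charges the load capacitance.
electronic = activity_factor({(0, 1)})

# DONN receiver reset every clock cycle: every received '1' needs photons,
# so both 0 -> 1 and 1 -> 1 have a nonzero cost.
optical = activity_factor({(0, 1), (1, 1)})

print(electronic, optical)  # 0.25 0.5
```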
The energy consumption per 8-bit multiply-and-accumulate ($E_{\mathrm{comm}}$ in fJ/MAC) is simply the energy per bit multiplied by 16, representative of transmitting two 8-bit values.
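$$\mathrm{Quantized} = \mathrm{QuantizedMin} + \frac{\mathrm{Input} - \mathrm{FloatingMin}}{\mathrm{Scale}} \tag{2}$$

$$E_{\mathrm{elec/bit}} = \frac{1}{4}\left(\frac{C_{\mathrm{wire}}}{\mu\mathrm{m}} \cdot L_{\mathrm{wire}} + C_T\right) \cdot V_{DD}^{2} \tag{3}$$

$$E_{\mathrm{DONN/bit}} = \frac{1}{2 \cdot \mathrm{WPE}} \cdot h\nu \cdot n_p \tag{4}$$

$$n_p = \frac{(C_{\mathrm{det}} + C_T) \cdot V_{DD}}{e} \tag{5}$$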
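A numerical sketch of Eqs. (3)–(5) and of the per-MAC communication energy may help in following the comparison. All component values below (wire capacitance, inverter and detector capacitances, supply voltage, wavelength and WPE) are illustrative placeholders chosen for this example; they are assumptions and are not the values used in the paper. The function names are likewise hypothetical.

```python
PLANCK = 6.62607015e-34            # Planck constant, J*s
LIGHT_SPEED = 299_792_458.0        # speed of light, m/s
ELECTRON_CHARGE = 1.602176634e-19  # elementary charge, C

def elec_energy_per_bit(c_wire_per_um, l_wire_um, c_t, v_dd):
    """Eq. (3): wire plus inverter load, with a switching activity factor of 1/4."""
    return 0.25 * (c_wire_per_um * l_wire_um + c_t) * v_dd ** 2

def photons_per_bit(c_det, c_t, v_dd):
    """Eq. (5): photons needed to charge the detector and inverter capacitance."""
    return (c_det + c_t) * v_dd / ELECTRON_CHARGE

def donn_energy_per_bit(wavelength_m, wpe, c_det, c_t, v_dd):
    """Eq. (4): optical energy per bit, with a switching activity factor of 1/2."""
    photon_energy = PLANCK * LIGHT_SPEED / wavelength_m
    return photon_energy * photons_per_bit(c_det, c_t, v_dd) / (2.0 * wpe)

def comm_energy_per_mac(energy_per_bit, bits_per_mac=16):
    """Energy per 8-bit MAC: two 8-bit operands, i.e. 16 bits."""
    return bits_per_mac * energy_per_bit

def crossover_length_um(e_donn_bit, c_wire_per_um, c_t, v_dd):
    """Wire length at which Eq. (3) equals a given optical energy per bit."""
    return (4.0 * e_donn_bit / v_dd ** 2 - c_t) / c_wire_per_um

# Placeholder parameters (assumptions, not taken from the paper).
params = dict(c_t=0.1e-15, v_dd=0.8)  # farads, volts
e_opt = donn_energy_per_bit(850e-9, 0.5, c_det=0.1e-15, **params)
for length_um in (1.0, 10.0, 100.0):
    e_el = elec_energy_per_bit(0.2e-15, length_um, **params)
    print(f"{length_um:6.1f} um: electronic {comm_energy_per_mac(e_el) * 1e15:.3f} fJ/MAC, "
          f"optical {comm_energy_per_mac(e_opt) * 1e15:.3f} fJ/MAC")
```

The optical term does not depend on `length_um`, which is the length-independence the text refers to; the break-even length for any other parameter set follows from `crossover_length_um`.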
Data availability
The data generated and analyzed in this study are available from the corresponding authors upon reasonable request.
Code availability
Code used for acquiring and processing the MNIST dataset can be found at https://github.com/alexsludds/Digital-Optical-Neural-Network-Code. Code used for image processing, hardware control, and calculations for energy, crosstalk and bit error rate is available from the corresponding authors upon reasonable request.
Received: 24 October 2020; Accepted: 12 January 2021
References
1. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
2. Dai, Z. et al. Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2978–2988, https://doi.org/10.18653/v1/P19-1285 (Association for Computational Linguistics, Florence, Italy, 2019).
3. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118. https://doi.org/10.1038/nature21056 (2017).
4. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444. https://doi.org/10.1038/nature14539 (2015).
5. Chen, Y., Krishna, T., Emer, J. S. & Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52, 127–138. https://doi.org/10.1109/JSSC.2016.2616357 (2017).
6. Chen, Y.-H., Yang, T.-J., Emer, J. & Sze, V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Sel. Top. Circuits Syst. 9, 292–308. https://doi.org/10.1109/JETCAS.2019.2910232 (2019).
7. Yin, S. et al. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications. IEEE J. Solid-State Circuits 53, 968–982. https://doi.org/10.1109/JSSC.2017.2778281 (2018).
8. Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 1–12, https://doi.org/10.1145/3079856.3080246 (2017).
9. Sze, V., Chen, Y., Yang, T. & Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105, 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740 (2017).
10. Xu, X. et al. Scaling for edge inference of deep neural networks. Nat. Electron. 1, 216–222. https://doi.org/10.1038/s41928-018-0059-3 (2018).
11. Horowitz, M. Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 10–14, https://doi.org/10.1109/ISSCC.2014.6757323 (2014).
12. Poulton, J. W. et al. A 1.17-pj/b, 25-gb/s/pin ground-referenced single-ended serial link for off- and on-package communication using a process- and temperature-adaptive voltage regulator. IEEE J. Solid-State Circuits 54, 43–54. https://doi.org/10.1109/JSSC.2018.2875092 (2019).
13. Shrivastava, M. et al. Physical insight toward heat transport and an improved electrothermal modeling framework for FinFET architectures. IEEE Trans. Electron. Devices 59, 1353–1363. https://doi.org/10.1109/TED.2012.2188296 (2012).
14. Gupta, M. S., Oatley, J. L., Joseph, R., Wei, G. & Brooks, D. M. Understanding voltage variations in chip multiprocessors using a distributed power-delivery network. In 2007 Design, Automation and Test in Europe Conference and Exhibition, 1–6, https://doi.org/10.1109/DATE.2007.364663 (2007).
15. Casasent, D., Jackson, J. & Neuman, C. Frequency-multiplexed and pipelined iterative optical systolic array processors. Appl. Opt. 22, 115–124. https://doi.org/10.1364/AO.22.000115 (1983).
16. Rhodes, W. & Guilfoyle, P. Acoustooptic algebraic processing architectures. Proc. IEEE 72, 820–830. https://doi.org/10.1109/PROC.1984.12941 (1984).
17. Caulfield, H., Rhodes, W., Foster, M. & Horvitz, S. Optical implementation of systolic array processing. Opt. Commun. 40, 86–90. https://doi.org/10.1016/0030-4018(81)90333-3 (1981).
18. Xu, S., Wang, J., Wang, R., Chen, J. & Zou, W. High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays. Opt. Express 27, 19778–19787. https://doi.org/10.1364/OE.27.019778 (2019).
19. Liang, Y.-Z. & Liu, H.-K. Optical matrix–matrix multiplication method demonstrated by the use of a multifocus hololens. Opt. Lett. 9, 322–324. https://doi.org/10.1364/ol.9.000322 (1984).
20. Athale, R. A. & Collins, W. C. Optical matrix–matrix multiplier based on outer product decomposition. Appl. Opt. 21, 2089–2090. https://doi.org/10.1364/AO.21.002089 (1982).
21. Shen, Y. et al. Deep learning with coherent nanophotonic circuits. Nat. Photon. 11, 441–446. https://doi.org/10.1038/nphoton.2017.93 (2017).
22. Tait, A. N. et al. Neuromorphic photonic networks using silicon photonic weight banks. Sci. Rep. 7, 1–10. https://doi.org/10.1038/s41598-017-07754-z (2017).
23. Hamerly, R., Bernstein, L., Sludds, A., Soljacic, M. & Englund, D. Large-scale optical neural networks based on photoelectric multiplication. Phys. Rev. X 9, 021032. https://doi.org/10.1103/PhysRevX.9.021032 (2019).
24. Feldmann, J. et al. Parallel convolution processing using an integrated photonic tensor core (2020). arXiv:2002.00281.
25. Lin, X. et al. All-optical machine learning using diffractive deep neural networks. Science 361, 1004–1008. https://doi.org/10.1126/science.aat8084 (2018).
26. Krishnamoorthy, A. V. et al. Computer systems based on silicon photonic interconnects. Proc. IEEE 97, 1337–1361. https://doi.org/10.1109/JPROC.2009.2020712 (2009).
27. Mehta, N., Lin, S., Yin, B., Moazeni, S. & Stojanović, V. A laser-forwarded coherent transceiver in 45-nm soi cmos using monolithic microring resonators. IEEE J. Solid-State Circuits 55, 1096–1107. https://doi.org/10.1109/JSSC.2020.2968764 (2020).
28. Xue, J. et al. An intra-chip free-space optical interconnect. ACM SIGARCH Comput. Archit. News 38, 94–105. https://doi.org/10.1145/1816038.1815975 (2010).
29. Hamedazimi, N. et al. Firefly: A reconfigurable wireless data center fabric using free-space optics. In Proceedings of the 2014 ACM conference on SIGCOMM, 319–330, https://doi.org/10.1145/2619239.2626328 (2014).
30. Bao, J. et al. Flycast: Free-space optics accelerating multicast communications in physical layer. ACM SIGCOMM Comput. Commun. Rev. 45, 97–98. https://doi.org/10.1145/2829988.2790002 (2015).
31. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556.
32. Szegedy, C. et al. Going deeper with convolutions (2014). arXiv:1409.4842.
33. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533. https://doi.org/10.1038/nature14236 (2015).
34. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2826, https://doi.org/10.1109/CVPR.2016.308 (2016).
35. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778, https://doi.org/10.1109/CVPR.2016.90 (2016).
36. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1800–1807, https://doi.org/10.1109/CVPR.2017.195 (2017).
37. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
38. Zoph, B., Vasudevan, V., Shlens, J. & Le, Q. V. Learning transferable architectures for scalable image recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8697–8710, https://doi.org/10.1109/CVPR.2018.00907 (2018).
39. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7132–7141, https://doi.org/10.1109/CVPR.2018.00745 (2018).
40. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805.
41. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog 1, 1 (2019).
42. Lan, Z. et al. ALBERT: A lite BERT for self-supervised learning of language representations (2019). arXiv:1909.11942.
43. Brown, T. B. et al. Language models are few-shot learners (2020). arXiv:2005.14165.
44. Fowers, J. et al. A configurable cloud-scale dnn processor for real-time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 1–14, https://doi.org/10.1109/ISCA.2018.00012 (2018).
45. Shao, Y. S. et al. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture - MICRO ’52, 14–27, https://doi.org/10.1145/3352460.3358302 (2019).
46. Yin, J. et al. Modular routing design for chiplet-based systems. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 726–738, https://doi.org/10.1109/ISCA.2018.00066 (2018).
47. Samajdar, A. et al. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software, 304–315 (IEEE, 2020).
48. Miller, D. A. B. Attojoule optoelectronics for low-energy information processing and communications. J. Light. Technol. 35, 346–396. https://doi.org/10.1109/JLT.2017.2647779 (2017).
49. Keeler, G. A. et al. Optical pump-probe measurements of the latency of silicon CMOS optical interconnects. IEEE Photon. Technol. Lett. 14, 1214–1216. https://doi.org/10.1109/LPT.2002.1022022 (2002).
50. Latif, S., Kocabas, S., Tang, L., Debaes, C. & Miller, D. Low capacitance CMOS silicon photodetectors for optical clock injection. Appl. Phys. A 95, 1129–1135. https://doi.org/10.1007/s00339-009-5122-5 (2009).
51. Iga, K. Vertical-cavity surface-emitting laser: Its conception and evolution. Jpn. J. Appl. Phys. 47, 1. https://doi.org/10.1143/JJAP.47.1 (2008).
52. Jäger, R. et al. 57% wallplug efficiency oxide-confined 850 nm wavelength GaAs VCSELs. Electron. Lett. 33, 330–331. https://doi.org/10.1049/el:19970193 (1997).
53. Zheng, P., Connelly, D., Ding, F. & Liu, T.-J. K. FinFET evolution toward stacked-nanowire FET for CMOS technology scaling. IEEE Trans. Electron Dev. 62, 3945–3950. https://doi.org/10.1109/TED.2015.2487367 (2015).
54. Tang, L. et al. Nanometre-scale germanium photodetector enhanced by a near-infrared dipole antenna. Nat. Photon. 2, 226–229. https://doi.org/10.1038/nphoton.2008.30 (2008).
55. Keckler, S. W., Dally, W. J., Khailany, B., Garland, M. & Glasco, D. GPUs and the future of parallel computing. IEEE Micro 31, 7–17. https://doi.org/10.1109/MM.2011.89 (2011).
56. Dally, W. J. et al. Hardware-enabled artificial intelligence. In 2018 IEEE Symposium on VLSI Circuits, 3–6, https://doi.org/10.1109/VLSIC.2018.8502368 (2018).
57. Chao, C. & Saeta, B. Cloud TPU: Codesigning architecture and infrastructure. Hot Chips 31, 1 (2019).
58. Stillmaker, A. & Baas, B. Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm. Integration 58, 74–81. https://doi.org/10.1016/j.vlsi.2017.02.002 (2017).
59. Saadat, H., Bokhari, H. & Parameswaran, S. Minimally biased multipliers for approximate integer and floating-point multiplication. IEEE Trans. Comput. Des. Integr. Circuits Syst. 37, 2623–2635. https://doi.org/10.1109/TCAD.2018.2857262 (2018).
60. Shoba, M. & Nakkeeran, R. Energy and area efficient hierarchy multiplier architecture based on Vedic mathematics and GDI logic. Eng. Sci. Technol. Int. J. 20, 321–331. https://doi.org/10.1016/j.jestch.2016.06.007 (2017).
61. Ravi, S., Patel, A., Shabaz, M., Chaniyara, P. M. & Kittur, H. M. Design of low-power multiplier using UCSLA technique. In Artificial Intelligence and Evolutionary Algorithms in Engineering Systems 119–126, https://doi.org/10.1007/978-81-322-2135-7_14 (2015).
62. Johnson, J. Rethinking floating point for deep learning (2018). arXiv:1811.01721.
63. Thorlabs. High-speed fiber-coupled detectors. https://www.thorlabs.com/newgrouppage9.cfm?objectgroup_id=1297&pn=DET02AFC (2020).
64. Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks Learn. Syst. 1–21, https://doi.org/10.1109/TNNLS.2020.2978386 (2020).
65. Zhang, Z., Cui, P. & Zhu, W. Deep learning on graphs: A survey. IEEE Transactions on Knowl. Data Eng. 1–1, https://doi.org/10.1109/TKDE.2020.2981333 (2020).
66. Mattson, P. et al. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro 40, 8–16. https://doi.org/10.1109/MM.2020.2974843 (2020).
67. Parashar, A. et al. Timeloop: A systematic approach to DNN accelerator evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software, 304–315, https://doi.org/10.1109/ISPASS.2019.00042 (IEEE, 2019).
68. Jacob, B. & Warden, P. et al. gemmlowp: A small self-contained low-precision GEMM library. https://github.com/google/gemmlowp (2015, accessed 2020).
69. Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M. & Moshovos, A. Stripes: Bit-serial deep neural network computing. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–12, https://doi.org/10.1109/MICRO.2016.7783722 (2016).
70. Albericio, J. et al. Bit-pragmatic deep neural network computing. In 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 382–394, https://doi.org/10.1145/3123939.3123982 (2017).
71. Coimbatore Balram, K., Audet, R. & Miller, D. Nanoscale resonant-cavity-enhanced germanium photodetectors with lithographically defined spectral response for improved performance at telecommunications wavelengths. Opt. Express 21, 10228–33. https://doi.org/10.1364/OE.21.010228 (2013).
Acknowledgements
Thanks to Christopher Panuski for helpful discussions about µLEDs and Angshuman Parashar and Yannan (Nellie) Wu for insights into all-electronic DNN accelerators. We would also like to thank Mohamed Ibrahim for useful discussions on receiver discharging circuits. Anthony Pennes helped with several machining tasks. Thanks to Ronald Davis III and Zhen Guo for manuscript revisions. We also thank the NVIDIA Corporation for
the donation of the Tesla K40 GPU used for training the fully-connected networks. Equipment was purchased thanks to the U.S. Army Research Office through the Institute for Soldier Nanotechnologies (ISN) at MIT under grant no. W911NF-18-2-0048. L.B. is supported by a Postgraduate Scholarship from the Natural Sciences and Engineering Research Council of Canada, National Science Foundation (NSF) E2CDA Grant No. 1640012 and the aforementioned ISN Grant. A.S. is supported by an NSF Graduate Research Fellowship Program under Grant No. 1122374, NTT Research Inc., NSF EAGER program Grant No. 1946967, and the NSF/SRC E2CDA and ISN grants mentioned above. R.H. was supported by an Intelligence Community Postdoctoral Research Fellowship at MIT, administered by ORISE through the U.S. DoE/ODNI.
Author contributions
D.E. and R.H. developed the original concept. L.B. designed and performed the hardware experiments with the support of A.S. and D.E. A.S. developed the data acquisition, training, and confusion matrix analysis software. L.B. developed the output image processing software and performed the bit error rate calculations. L.B. and A.S. performed the energy calculations, with critical insights from R.H. J.E. and V.S. provided critical insights into all-electronic hardware comparisons. L.B. and A.S. wrote the manuscript with input from all authors. R.H., J.E., V.S. and D.E. supervised the project.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1038/s41598-021-82543-3.
Correspondence and requests for materials should be addressed to L.B., A.S. or D.E.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2021