royalsocietypublishing.org/journal/rsta
Research
Cite this article: Akgun OC, Mei J. 2019 An energy efficient time-mode digit classification neural network implementation. Phil. Trans. R. Soc. A 378: 20190163.
http://dx.doi.org/10.1098/rsta.2019.0163
Accepted: 16 September 2019
One contribution of 13 to a theme issue
‘Harmonizing energy-autonomous computing
and intelligence’.
Subject Areas:
electrical engineering, microsystems, artificial intelligence
Keywords:
neural network, handwritten digit, classification, time-mode, energy efficiency, ultra-low energy
Author for correspondence:
O. C. Akgun
e-mail: o.c.akgun@tudelft.nl
Present address: Department of Anatomy,
Université du Québec à Trois-Rivières (UQTR),
Trois-Rivières, Canada.
An energy ecient time-mode
digit classication neural
network implementation
O. C. Akgun1and J. Mei2,
1Section Bioelectronics, Department of Microelectronics,
Delft University of Technology, The Netherlands
2Department of Neurology and Department of Experimental
Neurology, Charité - Universitätsmedizin, Berlin, Germany
OCA, 0000-0003-1572-5891
This paper presents the design of an ultra-low
energy neural network that uses time-mode signal
processing. Handwritten digit classification using
a single-layer artificial neural network (ANN) with
a Softmin-based activation function is described
as an implementation example. To realize time-
mode operation, the presented design makes use
of monostable multivibrator-based multiplying
analogue-to-time converters, fixed-width pulse
generators and basic digital gates. The time-mode
digit classification ANN was designed in a standard
CMOS 0.18 µm IC process and operates from a
supply voltage of 0.6 V. The system operates on the
MNIST database of handwritten digits with quantized
neuron weights and has a classification accuracy of
88%, which is typical for single-layer ANNs, while
dissipating 65.74 pJ per classification with a speed of
2.37 k classifications per second.
This article is part of the theme issue ‘Harmonizing
energy-autonomous computing and intelligence’.
1. Introduction
Machine learning is the study of models and algorithms that derive generalizable understanding from data and complete tasks without explicitly programmed instructions. As one of the many approaches in machine learning, artificial neural networks (ANNs) are partly inspired by the connectivity and properties of biological neurons, and have proven to achieve considerable performance in a number of application areas.
These areas include machine translation [1,2], computer
vision [3,4], pattern recognition [5–7], game-playing [8,9] and medical diagnosis [10,11].
Figure 1. Proposed TMSP ANN high-level block diagram. (Online version in colour.)
In applications that require real-time operation, e.g. speech [5] and human action [7] recognition, and physical activity and patient monitoring [12], there is a need for always-on sensing. However, one of the challenges of modern machine learning algorithms is their energy dissipation [13]. Most machine learning hardware development is done using either standard-cell digital design methods [14,15] or mixed-signal methods [16] employing analogue processing techniques in CMOS technologies.
The advancement and scaling of CMOS technologies have always been based on improving
the performance of digital systems. With each new technology node, the threshold voltages of the
available MOS transistors and the supply voltage of the process node are scaled as well. Scaling
of the supply voltage reduces the headroom that is available to the transistors for operating in
the saturation region. Without transistors operating in the saturation region, it is very hard to
realize signal processing and amplification functions in the analogue domain. One solution to
this problem is using time-mode signal processing (TMSP) techniques [17–19]. Time-mode (TM)
circuits represent an analogue signal by the time difference between two binary switching events.
Furthermore, when compared to standard digital design practices, TM operation is inherently lower power. For example, in standard CMOS digital operation, transferring N bits of data in parallel may require anywhere from 0 to N switchings on the data lines, whereas in a TM circuit the transfer always takes exactly two switchings, because the rising and falling edges of a single pulse are used for information transmission. There are
other advantages of TM operation, especially for machine learning hardware implementations:
(i) TM operation allows the designer to reduce the supply voltage and still realize analogue-like
functions, as will be shown in this paper, and (ii) using single wires for data transmission instead
of using data buses will allow a hardware designer to realize densely connected ANNs on chip
more easily. Based on these observations, it is arguable that more low-power signal processing
and machine learning systems will be implemented using TMSP techniques in the future.
The research work presented here focuses on developing a TM digit-classification single-layer
neural network for ultra-low-energy operation. The proposed system is shown in figure 1. A digit
classification ANN was chosen for its simplicity, and well-studied and understood behaviour.
During the design and training of the ANN, image data from a widely available dataset, MNIST
[20], was used. n-by-n image data were converted into analogue values and applied to the TM
ANN. The applied image data are processed by the TM ANN by accumulating weighted delay
values, and a classification signal for the input image is generated. As will be presented in the paper, TMSP allows such an ANN to work with extremely low energy dissipation values and
with classification accuracy that is typical for single-layer ANNs.
Contributions of this paper are as follows. A TM implementation of a handwritten digit
classification ANN is presented. Optimization steps for both system level and hardware level
design are given, followed by the details of the sub-block designs. The designed ANN is verified by both system-level mathematical simulations and transistor-level SPICE simulations.
The design is characterized for classification accuracy, energy dissipation and classification
speed. The organization of the paper is as follows. Section 2 presents the high-level details
and implementation steps of the ANN in software. Section 3 describes the TMSP ANN
implementation with sub-block design and performance improvement steps. Transistor-level
simulation results are presented in §4 together with performance metrics, and, finally, the
conclusion is drawn in §5.
2. Articial neural network
In this study, we implemented a hardware version of a TM, fully connected, single-layer neural
network to recognize handwritten digits (figure 2), using the MNIST database of handwritten
digits [20]. The MNIST database contains a training set of 60 000 images and a test set of 10 000
images, with all images of size 28 × 28 pixels.
As presented in figure 2, a single layer of neurons applying a linear transformation to the
input data (i.e. MNIST handwritten digits) was constructed. The size of the input sample was set to 784 (MNIST handwritten digits of 28 × 28 pixels) with an input range of [0, 1], and the size of the output sample was set to 10 (the 10 possible output digits, ranging from 0 to 9).
An artificial neuron in the implemented ANN receives pixel data from input units (figure 3).
Each input $x_1, \ldots, x_n$ is multiplied by its respective weight $w_1, \ldots, w_n$, and the artificial neuron receives and sums all weighted inputs according to

$$f(X) = w_1 x_1 + w_2 x_2 + \cdots + w_i x_i + \cdots + w_n x_n = \sum_{i=1}^{n} w_i x_i. \tag{2.1}$$
Afterwards, an activation function is used to process the weighted sums. In the presented implementation, the Softmin activation function, which takes all the weighted sums as input, is used (figure 2). By processing the weighted sums, the Softmin function rescales them and assigns probabilities to the classified digit outputs. As a result of the Softmin function, each output is squashed to a value in the range (0, 1) and the sum of all outputs adds up to 1. The Softmin function is defined as

$$\mathrm{Softmin}(x_i) = \frac{\exp(-x_i)}{\sum_j \exp(-x_j)}. \tag{2.2}$$
After the weighted sums of the inputs are transformed by the activation function, the final classification results are supplied by the ANN: if the nth value is the highest in the output vector of the Softmin function, the weighted sum output of the nth neuron is the smallest, and the nth neuron has the highest probability of correctly classifying the input as digit n.
Unlike most implementations, which use Softmax as the activation function, the present study uses Softmin. The reason is that with Softmax, an artificial neuron with a greater weighted sum (i.e. a greater accumulated delay in the hardware implementation) wins. This does not correspond to the targeted hardware implementation, in which the fastest neuron is favoured; using a Softmax activation function would therefore decrease the operating speed of the ANN. Hence, we chose the Softmin activation function, with which the fastest neuron wins and the classification speed of the ANN is higher.
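As an illustration (not part of the original design flow), the Softmin selection rule and its 'fastest neuron wins' reading can be sketched in a few lines of PyTorch; the numerical values below are made up.

```python
import torch

def softmin(x):
    # Equation (2.2): the smallest weighted sum receives the highest probability.
    e = torch.exp(-x)
    return e / e.sum()

weighted_sums = torch.tensor([3.2, 1.1, 4.0])  # illustrative neuron outputs
probs = softmin(weighted_sums)                 # probabilities summing to 1
winner = torch.argmin(weighted_sums)           # equivalently, the fastest neuron
```

PyTorch's built-in torch.nn.functional.softmin computes the same quantity (over a specified dimension) with numerical stabilization.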
The described high-level ANN was implemented, trained and tested on the PyTorch
framework [21]. The training of the ANN was done using batches of 100 images and
for 10 epochs. During the off-line training of the ANN, floating point values were used; however,
during hardware implementation and high-level verification simulations, the weight values were
scaled to a range and quantized.
The adaptive moment estimation method (Adam) [22] was used as the optimization method
during the training of the ANN. Adam optimization combines the advantages of both Adaptive
Gradient Algorithm (AdaGrad [23]), which works well with sparse gradients, and Root Mean
Square Propagation (RMSProp [24]), which works well in online and non-stationary settings.
Figure 2. A fully connected, single-layer neural network receiving inputs from an image of handwritten digit 3.
Figure 3. An artificial neuron with n × n inputs. Each input is associated with a weight. The weighted sum of all inputs is transformed by the activation function to produce the output.
Kingma & Ba [22] suggested that, instead of generating parameter updates using a momentum term (as RMSProp with momentum does), updates in Adam may be directly estimated using averages of the first and second moments of the gradient. As a result, Adam performed equally to or better than RMSProp regardless of hyperparameter setting. In L2-regularized multi-class logistic regression, Adam converged faster than AdaGrad. On a dataset with sparse features, Adam converged as fast as AdaGrad while dealing with the sparse features efficiently. In an experiment with convolutional neural networks, Adam converged considerably faster than AdaGrad. For a
more detailed discussion, see [22,25]. During training, the learning rate (α) was set to 0.01 while
all other parameters were configured based on the default settings recommended in [22] ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$), and we aimed to minimize the cross-entropy loss, which is given by

$$\mathrm{loss}(x, \mathit{class}) = -\log\!\left(\frac{\exp(x[\mathit{class}])}{\sum_j \exp(x[j])}\right) = -x[\mathit{class}] + \log\sum_j \exp(x[j]), \tag{2.3}$$

where class is the digit to be classified.

Figure 4. Classification accuracy of the designed ANN for varying image length/width sizes and weight quantization bits. Accuracy contour lines are also drawn for easier reference. Maximum accuracy and chosen implementations are marked.
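A minimal PyTorch sketch of this training setup is given below; it is a hypothetical reconstruction (the authors' actual script is not reproduced here), and synthetic random data stand in for the MNIST loader.

```python
import torch
import torch.nn as nn

model = nn.Linear(28 * 28, 10, bias=False)   # single layer: 784 inputs, 10 digits
optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                             betas=(0.9, 0.999), eps=1e-8)
criterion = nn.CrossEntropyLoss()            # implements the loss in (2.3)

for epoch in range(10):                      # 10 epochs
    for _ in range(600):                     # 600 batches of 100 = 60 000 images
        images = torch.rand(100, 28 * 28)    # stand-in for MNIST pixels in [0, 1]
        labels = torch.randint(0, 10, (100,))
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            model.weight.clamp_(min=0)       # constrain weights to be non-negative
```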
To be able to implement the ANN hardware in an efficient manner, we first investigated the
effects of image size reduction (downsampling) and quantization of neuron weights. Multiple
ANNs with varying image sizes (from 1 pixel/side to 28 pixels/side) and varying weight
quantization bits (1–8 bits) were created, trained and tested. To ease the hardware implementation
and consecutive implementation steps that will be explained in the next sections, in addition
to standard training settings, we constrained the minimum value of neuron weights to be 0.
Therefore, all the weights obtained from the training were either 0 or positive numbers. The
results of our simulations are presented in figure 4. During our tests, maximum accuracy
achievable from the presented single-layer shallow ANN was 92.95% (for an image size of 26 × 26 and 8-bit quantization), which is in line with the classification accuracy of single-layer ANNs in
the literature [26].
After the successful software implementation, training and testing of the ANN, multiple steps
were taken to prepare the design for TM hardware implementation. First, image size and number
of quantization bits were chosen for the required accuracy. In this implementation, we opted for
a 9 pixels/side input image (81 input pixels) in order to (i) reduce the energy dissipation without significant loss of accuracy, (ii) have an ANN implementation that is directly comparable to an implementation in the literature [13], and (iii) reduce the transistor-level transient simulation time significantly. With input image scaling, the maximum digit classification accuracy dropped from 92.95 to 89.65% (for 8-bit quantization).
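The downsampling method is not specified in the text; one plausible sketch, using bilinear interpolation in PyTorch, is the following.

```python
import torch
import torch.nn.functional as F

images = torch.rand(100, 1, 28, 28)          # a batch of 28 x 28 images
small = F.interpolate(images, size=(9, 9),
                      mode='bilinear', align_corners=False)  # -> (100, 1, 9, 9)
```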
Figure 5. Weights of a trained neuron, visual representation and values before and after quantization.
Table 1. Number of non-zero weights after quantization for all the neurons in the designed ANN.

neuron    no. of non-zero weights
0         63
1         51
2         47
3         39
4         61
5         50
6         54
7         45
8         64
9         53
After the input image size was chosen, the weights were scaled to the range [0, 1] and were later
quantized. From our simulations (figure 4), 4-bit quantized weights were a good compromise
between the expected energy dissipation and accuracy. In the implemented system, which will
be explained in the next section, most of the energy is dissipated in the switched capacitances.
For a fixed total switched capacitance, the number of quantization bits has a negligible effect on the total energy dissipation, and employing a higher number of quantization bits only results in added implementation complexity. However, if smaller capacitance values can be tolerated, i.e. less stringent noise and mismatch considerations in the system, then the smallest number of quantization bits for a given image size, and hence the smallest total capacitance in the implementation, should be chosen, as the energy dissipation scales linearly with the number of quantization bits. Furthermore, energy dissipation increases quadratically with the image side length, making the image size a more important parameter for energy reduction.
Therefore, the smallest possible image size that satisfies the accuracy requirements should be
chosen to minimize the energy dissipation. For the presented ANN implementation, we assumed
9 × 9 input images and chose 4-bit quantized weights to represent an average case for the
number of quantization bits. The classification accuracy loss due to quantization was minimal,
i.e. from 89.65 to 89.35%.
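A minimal sketch of the scale-then-quantize step, assuming simple uniform rounding (the authors' exact rounding scheme is not stated):

```python
import torch

def quantize(w, bits=4):
    # Scale non-negative trained weights to [0, 1], then round to
    # 2**bits - 1 uniform levels; many small weights collapse to zero.
    w = w / w.max()
    levels = 2 ** bits - 1
    return torch.round(w * levels) / levels

w_q = quantize(torch.rand(10, 81))   # e.g. weights of ten 81-input neurons
```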
Results of quantization on the weights of the neuron (used to classify handwritten digit 9)
are shown in figure 5. The leftmost figure shows the intensity of the weights, darker pixels
representing smaller values. The middle figure shows the floating point scaled weights and the
rightmost figure shows the weights after quantization. When these two figures are compared, it is observed that, due to quantization, many weights were reduced to zero. These zero weights have no effect on the weighted sum given in (2.1) and can therefore be removed, to both simplify the hardware implementation and reduce energy dissipation. The numbers of non-zero weights for all the neurons are given in table 1.

Figure 6. A classifier neuron using TMSP. (Online version in colour.)
Owing to the nature of the designed TMSP circuits which will be explained in the next section,
each circuit that realizes the multiply accumulate (MAC) operation has an inherent non-zero fixed
delay. Therefore, not to penalize the neurons which have more non-zero weights and for the
correct operation of the designed system, we designed the circuit implementation of the ANN
such that each neuron has an equal number of MAC elements, equal to the maximum number of non-zero weights given in table 1, i.e. 64 for neuron 8. Such an implementation allowed us to reduce the number of MAC units for the 9 × 9 pixel design from 810 to 640, effectively reducing the expected average energy dissipation by 21% through high-level design choices.
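The resulting MAC-unit count can be checked with a few lines; `w_q` below is a hypothetical matrix of quantized weights, not the trained values.

```python
import torch

w_q = torch.round(torch.rand(10, 81) * 15) / 15   # hypothetical 4-bit weights
nonzero_per_neuron = [int((w_q[j] > 0).sum()) for j in range(10)]
chain_length = max(nonzero_per_neuron)            # 64 in the presented design (table 1)
total_mac_units = 10 * chain_length               # 640, down from 10 * 81 = 810
```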
3. A time-mode MNIST digit classier ANN implementation
Following the mathematical modelling, training, verification and quantization of the ANN, we
applied TM operation and TMSP methods to the design of a digit classification ANN in a standard
0.18 µm IC process. Each neuron defined by (2.1) is mapped to a TMSP implementation, as shown
in figure 6. As in (2.1), a chain of multiplying analogue-to-time converters (mATC) converts a
voltage input value into a pulse whose width is proportional to both the input signal value and
the assigned weight. The signal propagates through the chain of mATCs and fixed-width pulse
generators (FWPGs). FWPGs, represented by the pulse blocks in figure 6, are required to be able
to trigger the next mATC in the chain with the falling edge of the previous mATC pulse. The
structures and operation principles of both the mATC and the negative-edge triggered fixed-
width pulse generator are explained in the following paragraphs. In this specific implementation,
we created a chain of 64 mATCs for each neuron. Owing to the resulting zero weights after
quantization, not all the pixels are connected to each neuron, further simplifying the hardware
implementation and future on-chip routing.
The operation of the TM ANN neuron is as follows: once the neuron has been triggered with
the Begin Classification signal, the chain of mATCs and FWPGs operate sequentially to accumulate
the delay information from each mATC, each of which represents the weighted input pixel
data. As explained in the previous section, the ANN has been trained with a Softmin activation
function, meaning that the neuron with the smallest weighted sum output value, i.e. in TMSP
terms, the fastest response (earliest falling edge at the output of the last (Nth) mATC), will get to
classify the input image first. Therefore, we placed negative-edge triggered flip-flops at the output
of the neurons to capture the final falling edge of the signal generated by the chain of mATCs. This
‘faster response wins’ approach directly mimics the Softmin function explained in the previous section and is also similar to the behaviour of some repeatedly trained biological neural networks.
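This behaviour can be captured by a small behavioural model (a sketch only, not the circuit; the delay constants `t_unit` and `t_fwpg` are illustrative assumptions):

```python
def tm_classify(pixels, weights, t_unit=1e-6, t_fwpg=50e-9):
    # Each mATC contributes a delay proportional to weight * pixel value;
    # each FWPG adds a fixed pulse delay. The neuron whose chain produces
    # the earliest final falling edge classifies the image (Softmin analogue).
    times = [sum(w * p * t_unit + t_fwpg for w, p in zip(row, pixels) if w > 0)
             for row in weights]
    return min(range(len(times)), key=times.__getitem__)
```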
During the design of the TM ANN, we employed a modified version of the basic monostable multivibrator (MSMV) [27] to work as an mATC in the system, as shown in figure 7.
Figure 7. Monostable multivibrator based multiplying ATC with time-linearizing capacitor Cx. (Online version in colour.)
In this implementation, a pMOS transistor (M1) acts as a variable resistor whose resistance is modulated by the current input voltage signal. When the MSMV is triggered by an input pulse, nodes n1 and n2 are pulled to logic-low and M1 starts charging node n2. The gate of M1 is driven by the input signal that is to be converted into time, and sampling is realized by modulating the instantaneous resistance of M1. Thus, the RC time constant of the multivibrator is modulated as well, resulting in a pulse whose width is proportional to the amplitude of the input signal. The pulse width generated by the ATC is given in [28] by

$$T = C(R + R_{on}) \ln\!\left(\frac{R}{R + R_{on}} \cdot \frac{V_{DD}}{V_{DD} - V_{th}}\right), \tag{3.1}$$

where $R$ is the average resistance of the pMOS transistor during pulse generation, $R_{on}$ the resistance of the NOR gate, and $V_{th}$ the switching threshold of the inverter. Assuming $R_{on} \ll R$ and $V_{th} = V_{DD}/2$, (3.1) simplifies to $T = 0.69RC$. Furthermore, this mATC implementation has
an inherent timeout feature and will always generate a pulse event at node n1 regardless of the
input signal value at Vin, avoiding stalling of the chain. Transistor M1 was made larger than the minimum size required for correct operation to mitigate process variation effects. The ATC given in [28] was modified with the inclusion of the extra switchable capacitors C0–C3 to allow the ATC to realize the time-multiplication operation. The capacitors C0–C3 increase in a binary-weighted fashion, C0 being the unit least-significant bit (LSB) capacitance and C3 = 8·C0 being
multiplication coefficient, i.e. 0001, the mATC still generates a pulse response that is proportional
to the input signal value. The switches were implemented as transmission gates using minimum
size (0.22 µm/0.18 µm) MOS transistors.
In the first iteration of the design, the minimum unit capacitor that satisfies this requirement
was found to be 20 fF. In this iteration, we used only switchable capacitors as the charged capacitor
to reduce the total switched capacitance, hence the total energy dissipation. However, during our
transistor-level simulations, we saw that due to the parasitic capacitances at node n2 and the
non-idealities of the switches, the pulse-width ratio between the successive weights degraded,
especially for the smaller values, i.e. for 0001 and 0010. Therefore, we placed a fixed 10 fF time-linearizing capacitor Cx in parallel with the switched capacitors. The addition of Cx also allowed us to reduce the value of the unit switched capacitor from 20 fF to 10 fF, as for the smallest weight setting the charged capacitance at n2 is still 20 fF.
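Combining equation (3.1) with the capacitor weighting gives a quick numerical sketch; the resistances R and Ron are illustrative assumptions, while the capacitor values, VDD and the 4-bit coding follow the text.

```python
import math

def matc_pulse_width(code, C_unit=10e-15, C_x=10e-15,
                     R=1.5e8, R_on=1e4, V_DD=0.6, V_th=0.3):
    C = C_x + code * C_unit                 # fixed C_x plus binary-weighted caps
    # Equation (3.1); with R_on << R and V_th = V_DD/2 this tends to 0.69*R*C.
    return C * (R + R_on) * math.log((R / (R + R_on)) * V_DD / (V_DD - V_th))

print(matc_pulse_width(1))                  # weight code 0001: ~2 us for these values
```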
Transistor-level simulations using the HSPICE simulator were run to characterize the mATC.
Simulations were run for a supply voltage (VDD) of 0.6 V, while sweeping the input signal voltage from 300 to 400 mV to represent the expected input signal values from an imager. In all the transistor-level simulations, black and white pixels are represented by 300 mV and 400 mV input voltages to the mATCs, respectively.
Figure 8. Pulse-width ranges generated by the mATC for different configuration weights.
The advantages of the placement of Cx in the second iteration of the mATC design are shown
in figure 8. The range of pulses generated by both versions of the mATC for different weights as
well as their mean are presented in the figure. There are multiple points that should be noted from the transistor-level simulation results: (i) the slope of the time response of the mATC has been reduced, effectively making the mean pulse-width values fit a binary progression more closely, hence the name time-linearizing (TL); (ii) due to the better fit of the mean to a binary progression, the error between the multiplication steps has been reduced (the root mean squared error (RMSE) is reduced from 6.61 to 1.59%; see table 2 for more details); and (iii) due to the reduced total capacitance, the system response is faster (the average pulse width is reduced from 81.16 to 43.72 µs) and the average energy dissipation is lower (reduced from 254 to 157 fJ).
A negative-edge triggered FWPG, shown in figure 9, is used between the mATC blocks as
we require the triggering of the next mATC in the chain to occur during the falling edge of the
pulse generated by the previous mATC. By triggering the next mATC with the falling edge of
the previous mATC output, time addition operation is realized. In this implementation, we used
an FWPG which generates pulses with a width of 50 ns. The minimum value of this pulse width can be any value that satisfies the following requirements during triggering: (i) both nodes n1 and n2 are completely driven to ground during the pulse, and (ii) the other input of the NOR gate is completely driven to VDD, with sufficient timing margin to account for process mismatch, before the output of the FWPG goes low. The maximum value of the FWPG pulse width is limited by the minimum pulse width generated by the mATC, i.e. 1.94 µs for a 0001 input. During our simulations, the same pulse width was also used for the Begin Classification signal.
4. Simulation results
After characterization and verification of the sub-blocks of the hardware ANN, extensive
transistor-level SPICE simulations using the HSPICE simulator were run to verify the correct time-
mode operation of the designed system. As in the characterization of the sub-blocks, a supply
of 0.6 V is used. Separate testbenches were programmatically created to simulate 100 samples from the test dataset, and transient simulations were run.
Figure 9. Negative-edge triggered fixed-width pulse generator. (Online version in colour.)
Table 2. mATC expected multiplication weight ratios and % errors for different designs.

ratio       expected ratio    % error - base mATC    % error - TL mATC
w2/w1       2.00              91.09                  20.91
w3/w2       1.50              15.03                  7.17
w4/w3       1.33              4.27                   0.22
w5/w4       1.25              2.87                   1.62
w6/w5       1.20              1.96                   0.41
w7/w6       1.17              1.50                   0.41
w8/w7       1.14              0.24                   0.87
w9/w8       1.12              1.11                   0.26
w10/w9      1.11              1.36                   0.45
w11/w10     1.10              1.08                   0.27
w12/w11     1.09              0.99                   0.45
w13/w12     1.08              1.07                   0.37
w14/w13     1.08              1.34                   0.74
w15/w14     1.07              1.08                   0.48
Results of one such simulation run, for a classification of digit 2, are shown in figure 10. As can be seen, the correct classifier neuron generates a faster output response than the other neurons, successfully classifying the input digit. For this specific case, the fastest neuron, i.e. neuron 2, responded 62.1 µs faster than the next fastest neuron.
The average energy dissipation per classification while operating at a VDD of 0.6 V is 65.74 pJ. The average classification response time for the test dataset is 421.8 µs, resulting in 2.37 k classifications per second at a classification accuracy of 88%. As the focus of the present study is on an energy-efficient design with low dissipation rather than state-of-the-art classification accuracy, the accuracy of 88%, which is typical for single-layer neural networks [26], is acceptable at the current stage. Moreover, the classification accuracy is still significantly higher than random guessing on the MNIST dataset.
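As a quick sanity check on the quoted figures, the implied average power follows directly:

```python
energy_per_classification = 65.74e-12    # J, from simulation
rate = 2.37e3                            # classifications per second
print(energy_per_classification * rate)  # ~1.56e-7 W, i.e. roughly 156 nW average
```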
We also investigated the effects of process mismatch on the performance of, first, a chain of
mATCs, and, later, on the ANN. We first simulated a chain of mATCs with varying number of
elements for the effects of local mismatch. For these simulations, to represent an average case
of operation, all the analogue input voltages to the mATCs and the multiplication coefficients
were set to 350 mV and 1000, respectively. One hundred-point Monte Carlo simulations were
run, and the results are presented in figure 11. The figure shows the curve fits of the normalized
probability density functions over a varying number of elements in an mATC chain and the
decrease in the coefficient of variation (CV = σ/μ) with increasing number of elements in the chain. The improvement in CV is a factor of √2 for every doubling of the number of mATC elements in the chain.
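This matches the idealized averaging of independent mATC delays, where the CV of a chain of N elements falls as 1/√N (a model check, not the Monte Carlo data):

```python
import math

for n in [1, 2, 4, 8, 16, 32, 64]:
    # CV relative to a single mATC: a factor of sqrt(2) per doubling.
    print(n, 1.0 / math.sqrt(n))
```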
As is apparent from the mATC chain mismatch simulations, increasing the number of elements in the chain reduces the relative variability of the ANN.
Figure 10. Transistor-level transient simulation of an example correct classification of a handwritten digit 2 by neuron 2. The figure shows the output signals of the neurons with timing information. (Online version in colour.)
Figure 11. One hundred-point Monte Carlo simulations showing the improvement in the coefficient of variation with increasing number of mATCs inside the neuron. (Online version in colour.)
Even though the implemented ANN is trained off-line and there is no straightforward way to address variability during training, reliability issues due to process mismatch may be addressed in two ways: (i) algorithmically testing each neuron for the variability of its elements by applying
multiple analogue input and digital control combinations and extracting the linear transfer curve, and (ii) by increasing the number of mATCs in the chain to average out and reduce the effects of variation, as shown in figure 11.

Figure 12. One hundred-point Monte Carlo simulation (classification of a handwritten digit 3) response time variation of the neurons of the designed ANN due to process mismatch. (Online version in colour.)
To test the performance of the ANN under process mismatch, for each of the 100 image samples we used to simulate and characterize the system, we ran 100-point Monte Carlo mismatch simulations (100 × 100 transient simulations in total); the average standard deviation in neuron response time due to process mismatch was 9.2 µs. Results of one such simulation run, for the classification of a handwritten digit 3, are presented in figure 12 as an example. The figure shows the response time variation distribution of each neuron in the designed ANN due to process mismatch.
For a misclassification to occur due to mismatch, the second fastest neuron must respond faster than the neuron that is fastest under nominal conditions. For the case shown in figure 12, this is possible between neurons 3 and 7. This probability can be modelled as half of the area of overlap of two Gaussian distributions with the same standard deviation and differing means (figure 13); the intersection point of the distributions depends on the distance between the means. For mean differences less than 1.2σ, the errors that occurred were already due to the training (88% accuracy). For mean differences greater than 1.2σ, we calculated the added misclassification probability for each neuron and found the total added probability of error due to process mismatch to be 1.17%, reducing the expected minimum accuracy to 86.63%.
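Under this equal-σ Gaussian model, half the overlap area reduces to a single tail probability; a short sketch with σ = 9.2 µs from the mismatch simulations (the mean gap is a free parameter):

```python
from math import erf, sqrt

def misclassification_probability(mean_gap, sigma=9.2e-6):
    # Half the overlap of two equal-sigma Gaussians equals the tail of
    # either distribution beyond their crossing point at mean_gap / 2.
    z = (mean_gap / 2) / sigma
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

print(misclassification_probability(1.2 * 9.2e-6))  # gap of 1.2 sigma: ~0.27
```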
When compared to the state-of-the-art hardware ANN implementations, the design presented
in this work compares favourably in terms of reduced energy dissipation, which is the main aim
of this design exercise. A comparison of results with a recent and directly comparable hardware
9 × 9 pixel MNIST classification ANN in [13] is given in table 3. Even though the presented ANN is designed in an older technology, i.e. a 0.18 µm process, it compares favourably in terms of energy dissipation. One metric where the design in [13] performs better than the presented implementation is the classification speed. However, due to the design constraints in [13], its operating voltage cannot be lowered further; hence, its energy dissipation, which is proportional to the square of the supply voltage in digital circuits, cannot be further reduced. Furthermore, it is expected that our implementation will achieve much better average operating speed and energy dissipation numbers when this design is migrated to more advanced technologies.
Figure 13. Response variation of two competing neurons due to process mismatch. Half of the value of the green shaded area represents the probability of misclassification.
Table 3. Comparison of the implemented TM ANN with an implementation in the literature.

                                                 SRAM classifier [13]    this work
technology (nm)                                  130                     180
supply voltage (V)                               1.2                     0.6
classification accuracy (%)                      90                      88
classification speed (Hz)                        50 × 10⁶                2370
analogue-to-digital conversion energy included   no                      yes
energy dissipation (pJ)                          630                     66
The energy dissipation per classification is reduced by a factor of 9.58, from 630 pJ down to 65.74 pJ, when compared to [13]. It should also be noted that the ANN implementation presented in this paper works with analogue signal inputs, without requiring the input data to be converted to digital for further processing. If the analogue-to-digital conversion energy cost per image is added to the classification energy numbers reported for [13] in table 3, the presented ANN implementation is even more energy efficient.
Extending the single-layer ANN presented in this study to a multi-layer version is ongoing work. However, from our preliminary results, it has been observed that, once a value/variable is
converted to a TM signal, in order to operate in the most energy efficient way, processing should
continue in TM without conversion between the TM and analogue/digital domains. For example,
an asynchronous time-to-digital converter (TDC) in 0.18 µm process dissipates 1.48 pJ [19], and a
similar TDC in a 65 nm process dissipates 0.97 pJ [29] per conversion. When compared to the
average energy dissipation of each neuron (6.6 pJ), it can be observed that conversion between
different operating domains incurs energy dissipation overhead values which are comparable to
the energy dissipation of the data processing circuitry.
5. Conclusion
This paper presents the hardware design and the simulation results of a TM, single-layer
ANN with Softmin activation function for handwritten digit classification. TMSP techniques
have been applied for accumulating weighted image signal values using energy-efficient time-
mode circuitry. Optimization steps for both system level and hardware level design are given.
The system was designed and simulated in a standard 0.18 µm process and operates from a
supply voltage of 0.6 V. By applying the presented design guidelines, an energy-optimal 9 × 9
handwritten digit image classification ANN with 4-bit quantized weights was designed. The
energy dissipation of the design for each classification is 65.74 pJ while operating at a speed of
2.37 k classifications per second, with a classification accuracy of 88%.
Data accessibility. This article has no additional data.
Authors’ contributions. O.C.A. conceived, designed, simulated and verified the transistor-level, time-mode ANN
implementation. O.C.A. and J.M. created and optimized Python level ANN for time-mode implementation.
Both authors drafted, read and approved the manuscript.
Competing interests. We declare we have no competing interests.
Funding. This project has received funding from the European Union’s Horizon 2020 research and innovation
programme under the Marie Sklodowska-Curie grant agreement no. 752819 for the MSCA IF Project ATiNaRI.
Acknowledgements. The authors thank the anonymous reviewers for their constructive comments and helpful
suggestions.
References
1. Sutskever I, Vinyals O, Le QV. 2014 Sequence to sequence learning with neural networks. In
Advances in neural information processing systems, pp. 3104–3112. San Diego, CA: NIPS.
2. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y.
2014 Learning phrase representations using RNN encoder-decoder for statistical machine
translation. (http://arxiv.org/abs/1406.1078).
3. Cireşan D, Meier U, Schmidhuber J. 2012 Multi-column deep neural networks for image classification. (http://arxiv.org/abs/1202.2745).
4. Ba J, Mnih V, Kavukcuoglu K. 2014 Multiple object recognition with visual attention. (http://arxiv.org/abs/1412.7755).
5. Deng L, Hinton G, Kingsbury B. 2013 New types of deep neural network learning for speech
recognition and related applications: an overview. In 2013 IEEE Int. Conf. on Acoustics, Speech
and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May, pp. 8599–8603. Piscataway,
NJ: IEEE.
6. Ji S, Xu W, Yang M, Yu K. 2013 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 221–231. (doi:10.1109/TPAMI.2012.59)
7. Liang M, Hu X. 2015 Recurrent convolutional neural network for object recognition. In Proc. of
the IEEE Conf. on Computer Vision and Pattern Recognition, Boston, MA, 7–12 June, pp. 3367–3375.
8. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. 2013
Playing Atari with deep reinforcement learning. (http://arxiv.org/abs/1312.5602).
9. Silver D et al. 2016 Mastering the game of go with deep neural networks and tree search.
Nature 529, 484–489. (doi:10.1038/nature16961)
10. Khan J et al. 2001 Classification and diagnostic prediction of cancers using gene expression
profiling and artificial neural networks. Nat. Med. 7, 673–679. (doi:10.1038/89044)
11. Al-Shayea QK. 2011 Artificial neural networks in medical diagnosis. Int. J. Comput. Sci. Issues
8, 150–154.
12. Kodali S, Hansen P, Mulholland N, Whatmough P, Brooks D, Wei G-Y. 2017 Applications
of deep neural networks for ultra low power IoT. In 2017 IEEE Int. Conf. on Computer Design
(ICCD), pp. 589–592. Piscataway, NJ: IEEE.
13. Zhang J, Wang Z, Verma N. 2017 In-memory computation of a machine-learning classifier
in a standard 6T SRAM array. IEEE J. Solid-State Circuits 52, 915–924. (doi:10.1109/JSSC.2016.2642198)
14. Chen Y-H, Krishna T, Emer JS, Sze V. 2017 Eyeriss: an energy-efficient reconfigurable
accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52, 127–138.
(doi:10.1109/JSSC.2016.2616357)
15. Lee J, Kim C, Kang S, Shin D, Kim S, Yoo H-J. 2018 Unpu: an energy-efficient deep neural
network accelerator with fully variable weight bit precision. IEEE J. Solid-State Circuits 54,
173–185.
16. Bankman D, Yang L, Moons B, Verhelst M, Murmann B. 2018 An always-on 3.8 µJ/86% CIFAR-
10 mixed-signal binary CNN processor with all memory on chip in 28 nm CMOS. In 2018 IEEE
Int. Solid-State Circuits Conf. (ISSCC), Boston, MA, 11–15 February, pp. 222–224. Piscataway, NJ:
IEEE.
17. Yuan F. 2014 CMOS time-to-digital converters for mixed-mode signal processing. J. Eng. 2014,
140–154. (doi:10.1049/joe.2014.0044)
18. Chen Z, Gu J. 2016 Analysis and design of energy efficient time domain signal processing. In
Proc. of the 2016 Int. Symp. on Low Power Electronics and Design, San Francisco, CA, 8–10 August,
pp. 100–105. New York, NY: ACM.
19. Akgun OC, Mangia M, Pareschi F, Rovatti R, Setti G, Serdijn WA. 2019 An energy-efficient
multi-sensor compressed sensing system employing time-mode signal processing techniques.
In Proc. of IEEE Int. Symp. on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May, pp. 1–5.
Piscataway, NJ: IEEE.
20. LeCun Y. 1998 The MNIST database of handwritten digits. See http://yann.lecun.com/exdb/
mnist/.
21. Paszke A et al. 2017 Automatic differentiation in PyTorch. In NIPS-W. San Diego, CA: NIPS.
22. Kingma DP, Ba J. 2014 Adam: a method for stochastic optimization. (http://arxiv.org/abs/1412.6980).
23. Duchi J, Hazan E, Singer Y. 2011 Adaptive subgradient methods for online learning and
stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159.
24. Tieleman T, Hinton G. 2012 Lecture 6.5-rmsprop: divide the gradient by a running average of
its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4, 26–31.
25. Ruder S. 2016 An overview of gradient descent optimization algorithms.
26. LeCun Y, Bottou L, Bengio Y, Haffner P. 1998 Gradient-based learning applied to document
recognition. Proc. IEEE 86, 2278–2324. (doi:10.1109/5.726791)
27. Akgun OC, Gurkaynak FK, Leblebici Y. 2009 A current sensing completion detection method
for asynchronous pipelines operating in the sub-threshold regime. Int. J. Circuit Theory Appl.
37, 203–220. (doi:10.1002/cta.540)
28. Sedra AS, Smith KC. 1998 Microelectronic circuits. 4th edn. Oxford, UK: Oxford University
Press.
29. Akgun OC. 2018 An asynchronous pipelined time-to-digital converter using time-domain
subtraction. In Proc. of IEEE Int. Symp. on Circuits and Systems (ISCAS), Florence, Italy, 27–30
May, pp. 1–5. Piscataway, NJ: IEEE.