
# Novel Arithmetics to Accelerate Machine Learning Classifiers in Autonomous Driving Applications

IEEE ICECS 2019 SS Advances in Circuits and Systems for HPC Accelerators and Processors
Abstract. Autonomous driving techniques frequently need the
clustering and classification of data coming from several sensors,
and these tasks need to be implemented in real time in embedded
on-board computing units. The trend for data classification and
clustering in the signal processing community is moving towards
machine learning (ML) algorithms. One of them, which plays a central
role, is the k-nearest neighbors (k-NN) algorithm. To meet stringent
requirements in terms of real-time computing capability and
circuit/memory complexity, ML accelerators are needed. Innovation is
required in terms of computing arithmetic, since classic integer
numbers lead to low classification accuracy with respect to the needs
of safety-critical applications like autonomous driving, while
floating-point numbers require too much circuitry and memory. To
overcome these issues, the paper shows that a new format, called
Posit, implemented in a new cppPosit software library, can lead to a
k-NN implementation with the same accuracy as floats but with half
the bit size. This means that a Posit Processing Unit (PPU) reduces
the data transfer and storage complexity of ML accelerators by a
factor higher than 2. We also show that a complete LUT-based
tabulated implementation of a PPU for 8-bit Posits requires just
64 kB of storage, compliant with memory-constrained devices.
Index Terms - k-Nearest Neighbors (k-NN), Alternative Real
Representation, Posits, Machine Learning (ML) Accelerator
I. INTRODUCTION
Autonomous driving is a safety-critical application, as also
specified in functional safety standards like ISO 26262, with strict
real-time requirements on both throughput and latency [1, 2]. At
Levels 1 and 2 of the SAE autonomous driving scale [1], only
assistance to the human driver is needed. Hence, signal processing
based on deterministic algorithms is still sufficient, e.g. FFT-based
processing of Frequency Modulated Continuous Wave (FMCW) radar
signals. At higher autonomous driving levels, from L3 to L5, the
complexity of the scenario and the signal processing demands are very
high, not only for sensing but also for localization, navigation,
decision and actuation. As a consequence, recent state-of-the-art
work proposes Machine Learning (ML) signal processing on board
vehicles [1-4]. ML approaches have reached the state of the art in
several signal processing domains [4-7], such as image processing.
Tasks such as scene understanding (image segmentation,
region-of-interest extraction, sub-scene classification, etc.) must
be done on board the vehicle, since cloud-based computing scenarios
(where the processing is done on a remote cloud server and the
vehicle hosts only a client generating requests to the server) suffer
from several issues: privacy, authentication, integrity, connection
latency and contention, or even communication unavailability in
uncovered areas (highway tunnels, etc.). On-board ML computing is
feasible only if the computational complexity of the algorithm is not
too high and high-performance HW is adopted. Hence, on-board
computing units for ML should be optimized in terms of the ratio
between processing throughput and resources (memory, bandwidth,
power consumption, ...) [7-9]. Large industrial players trying to
enter the autonomous driving market, such as Google, Nvidia and
Intel, are following this trend, as is Tesla with its recently
announced Full Self Driving (FSD) chip. This topic is also the core
of the automotive stream in the H2020 European Processor Initiative
(embedded HPC for autonomous driving with BMW as main technology
end-user [9, 10]), which funds this work.
To address the above issues, new computing arithmetic styles are
appearing in research [11-20], overcoming the classic fixed-point
(INT) vs. IEEE-754 floating-point duality in embedded DNN signal
processing. As an example, Intel is proposing BFLOAT16 (Brain
Floating Point), which has the same number of exponent bits as
single-precision floating point and can therefore replace binary32 in
practical uses, although with less precision. BFLOAT16 is supported
in the Google TensorFlow software, in Google Tensor Processing Units
(TPU) and in Intel AI processors. Intel is also proposing Flexpoint
[11, 12], in which exponent information is shared among a group of
numbers. NVIDIA's latest Turing architecture supports the concurrent
execution of floating-point and integer instructions in the Turing
SM, with Float32/Float16 and INT32/8/4 precision modes for
inferencing workloads that can tolerate quantization [13]. The Tesla
FSD chip exploits a neural processing unit using 8-bit by 8-bit
integer multiplication and 32-bit integer addition. Transprecision
computing for DNNs is also proposed in the state of the art by
academia [14] and industry, e.g. IBM and Greenwaves in [15]. Signal
processing sparsity has been exploited recently [16, 17] to compress
ML complexity and reach real-time computing on edge devices. However,
the deep compression in [16] comes at the cost of reduced accuracy,
e.g. 76.6% Top-1 (far from the requirements, typically above 95%, of
functional safety applications) on the ImageNet object classification
challenge. Quantized neural networks are proposed in [18], where,
using datasets such as MNIST, CIFAR-10 and ImageNet, weights and
activations are reduced to 1 or 2 bits, but the Top-1 accuracy is
limited to 51%. Recently, a novel way to represent real numbers,
called Posit, has been proposed [19, 20]. Basically, the Posit format
can be thought of as a compressed floating-point representation,
where more mantissa bits are used for small numbers and fewer
mantissa bits for large numbers, within a fixed-length format (the
exponent bits adapt accordingly, to keep the format fixed in length).
In this work we present the results obtained when exploiting the
Posit format, with a newly proposed cppPosit library and tabulated
look-up table (LUT) based HW calculation, on a widely used ML
algorithm, the k-NN classifier [4].
Authors: Marco Cococcioni*, IEEE SM, Federico Rossi*, Emanuele Ruffaldi#, Sergio Saponara*, IEEE SM
*Dept. of Information Engineering, University of Pisa, Italy - #MMI spa, Calci, Pisa, Italy
II. THE POSIT FORMAT
The Posit representation [19, 20] is depicted in Fig. 1. A Posit
contains a maximum of four fields: the sign bit, the regime field,
the exponent field and the fraction field. The fields have variable
length, with priority given to the encoding of the regime, then the
exponent and finally the fraction. The maximum length of the exponent
is decided a priori, together with the total length in bits. These
two lengths characterize the different types of Posit
representations. The length of the regime field is determined using
a run-length method: the regime is the run of identical bits that
follows the sign bit, ended by the first opposite bit. Following
[20], a run of m consecutive 0s (ended by the first 1) encodes the
negative regime value -m, while a run of m consecutive 1s (ended by
the first 0) encodes the non-negative regime value m-1. Once the
length of the regime is known, the length of the mantissa can be
determined as the number of remaining bits (after skipping the
exponent bits). The formula for retrieving the real number is given
in [20], and two examples of its application are shown in Fig. 2.
Note that the two Posit representations shown in the figure reserve
a different number of bits for the fraction field (8 and 9,
respectively), because their regime fields have different lengths (4
vs 3).
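As an illustration of the decoding rule, the sketch below converts a
Posit bit pattern to a double using the formula of [20]:
x = (-1)^s * useed^k * 2^e * (1 + f), with useed = 2^(2^es), where k
is the regime value, e the exponent and f the fraction in [0, 1).
The function name decode_posit and its signature are hypothetical and
are not part of cppPosit; the pattern 10...0 is mapped to infinity
here, as in the original proposal [20], which is an assumption about
how the exception value should be surfaced.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a cppPosit function): decodes the low `nbits`
// bits of `bits` as a posit with at most `es` exponent bits, following [20].
double decode_posit(std::uint32_t bits, int nbits, int es) {
    assert(nbits >= 2 && nbits <= 32 && es >= 0 && es <= 8);
    const std::uint32_t mask =
        (nbits == 32) ? 0xFFFFFFFFu : ((1u << nbits) - 1u);
    const std::uint32_t sign_bit = 1u << (nbits - 1);
    bits &= mask;
    if (bits == 0) return 0.0;  // unique representation of zero
    if (bits == sign_bit)       // exception pattern 10...0 (assumed: infinity)
        return std::numeric_limits<double>::infinity();

    const bool negative = (bits & sign_bit) != 0;
    if (negative) bits = (~bits + 1u) & mask;  // two's complement -> magnitude

    // Regime: run of identical bits after the sign bit.
    int pos = nbits - 2;  // index of the first regime bit
    const bool run_bit = ((bits >> pos) & 1u) != 0;
    int run = 0;
    while (pos >= 0 && (((bits >> pos) & 1u) != 0) == run_bit) {
        ++run;
        --pos;
    }
    --pos;  // skip the terminating bit (absent if the run reaches bit 0)
    const int k = run_bit ? run - 1 : -run;

    // Exponent: up to `es` bits; bits cut off by the word end count as 0.
    int e = 0;
    int ebits = 0;
    while (ebits < es && pos >= 0) {
        e = (e << 1) | static_cast<int>((bits >> pos) & 1u);
        --pos;
        ++ebits;
    }
    e <<= (es - ebits);

    // Fraction: all remaining bits, read as f in [0, 1).
    const int fbits = pos + 1;
    double f = 0.0;
    if (fbits > 0) {
        const std::uint32_t frac = bits & ((1u << fbits) - 1u);
        f = static_cast<double>(frac) / static_cast<double>(1u << fbits);
    }

    const int scale = k * (1 << es) + e;  // useed^k * 2^e = 2^scale
    return (negative ? -1.0 : 1.0) * std::ldexp(1.0 + f, scale);
}
```

For instance, the 16-bit pattern 0000 1101 1101 1101 with es = 3 has
regime 0001 (k = -3), exponent 101 (e = 5) and an 8-bit fraction
11011101 (221/256), so decode_posit(0x0DDD, 16, 3) evaluates
(1 + 221/256) * 2^(-19).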
Posits can be conveniently placed on a circle, sharing the concept
of projecting the reals onto a circle, but different design decisions
allow Posit operations to be implemented without requiring a LUT.
Fig. 3 shows the circle for a 4-bit Posit with 1 exponent bit.
Posits enjoy several interesting properties, such as:
- Unique representation for zero.
- No representations wasted for Not-a-Number (NaN). When using
  Posits, an exception is raised instead of reserving representations
  for NaNs. The IEEE 754 standard wastes many representations on
  NaNs, which also makes the HW for comparing floats complex.
- No need to support denormalized (subnormal) numbers (which are
  generally introduced in floats to increase the accuracy around
  zero, but make the FPU implementation significantly more complex).
Even more interestingly, Posits are ordered like signed integers
represented in two's complement. Thus, two Posits can be compared in
the ALU by recasting them to signed integers: negative Posits are
expressed in two's complement like integers, and the other three
fields allow direct ordering. We think the brightest idea of the
Posit representation is to reserve more bits for the mantissa of
small real numbers (close to zero) and fewer for large real numbers,
within a fixed-length format (the total length is fixed, although the
lengths of the regime and of the mantissa vary). A Posit can also be
viewed as a (lossy) compressed version of a float.
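The ordering property can be sketched in a few lines: the raw bits of
two 8-bit Posits are reinterpreted as signed 8-bit integers and
compared with the ordinary integer comparison. The alias Posit8Bits
and the functions posit_less and sort_posits are hypothetical names
used only for this illustration; they are not cppPosit identifiers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Raw bit pattern of an 8-bit posit (hypothetical alias for illustration).
using Posit8Bits = std::uint8_t;

// Compare two posits by reinterpreting their bits as two's complement
// integers; no decoding of regime, exponent or fraction is needed.
inline bool posit_less(Posit8Bits a, Posit8Bits b) {
    return static_cast<std::int8_t>(a) < static_cast<std::int8_t>(b);
}

// Sort posits in increasing numerical order using only integer compares.
inline void sort_posits(std::vector<Posit8Bits>& values) {
    std::sort(values.begin(), values.end(), posit_less);
    assert(std::is_sorted(values.begin(), values.end(), posit_less));
}
```

With 0 exponent bits, the patterns 0xC0, 0x00, 0x20, 0x40 and 0x60
encode -1, 0, 0.5, 1 and 2, and the signed-integer order of the
patterns (-64, 0, 32, 64, 96) matches the order of the encoded values.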
Fig. 3. Posit circle when the total number of bits is 4 and the number
of exponent bits is just 1. Observe how the mantissa is at most 1 bit
in this case (the last blue bit, when present).
III. THE CPPPOSIT LIBRARY
In this work we present the implementation of a new open-source
C++11 library, called cppPosit, available on GitHub. It uses a
generic programming approach and traits to achieve compactness of
representation and speed. The library supports Posits with a total
number of bits ranging from 4 to 64, and supports many variants of
Posits, as controlled by the template parameters:
template <class T, int totalbits, int esbits, class FT, PositSpec>
where class T is the storage type for the Posit itself (a signed
integer data type, such as int8); totalbits is the number of bits of
the Posit, which can be smaller than the size of the storage type to
experiment with different memory layouts; esbits is the maximum
number of bits of the exponent; class FT is the storage type for the
fractional part of the mantissa during manipulation (an unsigned data
type, such as uint16). The FT data type is useful when performing the
four elementary operations on unpacked Posits (see below). The
presence of NaN is optional.
A. Different operations performed at different levels
The cppPosit library works with Posits represented at different
decoding levels.
We defined four levels of operations for working with Posits:
1. At level 1, operations are performed by directly manipulating the
bits of the encoding. The cost is that of the integer ALU.
Examples of operations that work at level 1 are:
o 1/x (inversion)
o -x (unary minus) and |x| (absolute value)
o x < y, x <= y, x > y, x >= y (comparisons)
o 1-x (for esbits=0 and x in [0,1])
o Pseudosigmoid (for esbits=0)
2. At level 2, the Posit is unpacked into its underlying fields
(sign, regime, exponent and fraction), without building the complete
exponent. The operations are performed on these fields, and the cost
includes encoding and decoding. Examples of operations are:
o x/2 (halving x)
o 2*x (doubling x)
3. At level 3 we have a fully unpacked version (sign, exponent,
fraction). In addition to the level 2 operations, it is necessary to
compute the full exponent. Examples are:
o conversion to/from float, Posit or fixed point
4. At level 4 the unpacked version is used to perform the operations
in two possible ways:
o Software floating point
o Hardware floating point
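To make level 1 concrete, the following sketch implements three of
the listed operations with integer instructions only: unary minus
(two's complement of the bit pattern), absolute value, and the
pseudo-sigmoid for esbits=0, which [20] obtains by flipping the sign
bit and shifting the pattern right by two positions. The type alias
and function names are hypothetical and do not reproduce the cppPosit
interface; Posits of up to 16 bits are assumed to be stored in the
low bits of a 16-bit word.

```cpp
#include <cassert>
#include <cstdint>

// Raw posit bits held in the low `nbits` bits (hypothetical alias).
using PositBits = std::uint16_t;

// Mask selecting the low `nbits` bits of the storage word.
inline PositBits mask_of(int nbits) {
    assert(nbits >= 3 && nbits <= 16);
    return static_cast<PositBits>((1u << nbits) - 1u);
}

// -x: two's complement of the pattern, truncated to nbits.
inline PositBits l1_negate(PositBits x, int nbits) {
    return static_cast<PositBits>((0u - x) & mask_of(nbits));
}

// |x|: negate only when the sign bit is set.
inline PositBits l1_abs(PositBits x, int nbits) {
    const unsigned sign_bit = 1u << (nbits - 1);
    return (x & sign_bit) ? l1_negate(x, nbits) : x;
}

// Pseudo-sigmoid, valid for esbits = 0 only: flip the sign bit, then
// shift logically right by two positions.
inline PositBits l1_pseudo_sigmoid(PositBits x, int nbits) {
    const unsigned sign_bit = 1u << (nbits - 1);
    return static_cast<PositBits>(((x ^ sign_bit) & mask_of(nbits)) >> 2);
}
```

For an 8-bit Posit with esbits=0, the input pattern 0x00 (value 0)
maps to 0x20 (value 0.5), the pattern 0x40 (value 1) maps to 0x30
(value 0.75) and the pattern 0xC0 (value -1) maps to 0x10 (value
0.25), to be compared with the exact sigmoid values 0.5, 0.731 and
0.269.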
In the cppPosit library everything is template based, which is one
of its most important advantages. cppPosit defines three key types:
Posit (level 1), UnpackedLow (level 2) and Unpacked (level 3).
Unpacked is parametrized on the type of the fraction (mantissa) and
can handle conversion to/from any type of IEEE floating-point number
(expressed via a trait) and Posit. A specialized class PostF provides
all operations at level 4 over single or double. In the next section
we present an implementation of the k-NN that can work with different
data types, in particular with Posits and floats.
B. Tabulated Posits
When the total number of bits of the considered Posit is lower than
or equal to 14, Posits can be tabulated. This speeds up the
computation on architectures that do not yet have a HW PPU (Posit
Processing Unit). To save memory, cppPosit uses some tricks, such as
a single LUT that stores both the sum and the difference of two
Posits. This is possible because the sum is symmetric (so its values
are stored on the main diagonal and above it), while the difference
is antisymmetric (so its values can be stored below the diagonal).
Finally, to avoid the need for multiplication and division tables, we
have tabulated both the natural logarithm and the exponential
function. To compute the product of two Posits, we first compute the
logarithms of both, then sum these values, and finally compute the
exponential function of the sum.
A similar trick is used for the division (this time the difference
between the two logarithms is computed). Following this strategy, we
can store everything needed for the four elementary operations in a
single square LUT of size 2^X * 2^X, plus two vectors of length 2^X
(see Table 1), where X is the total number of bits of the Posit. For
Posit8 or Posit10, the whole tabulated computation can be kept in
cache memory or in an on-chip SRAM buffer, thus allowing high-speed
computation. For Posit8 the storage size is compliant with
memory-constrained units.
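The packing of sum and difference into one square table can be
sketched as follows. To keep the example short, Posits are replaced
by a toy number type (the integers -7..7 with saturating arithmetic);
this stand-in, the struct AddSubLut and all function names are
assumptions made for illustration and are not taken from cppPosit.
The function lut_bytes additionally estimates the size of a full
table for an X-bit format, assuming one entry per operand pair and
whole bytes per entry.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy stand-in for a small posit type: integers -7..7, saturating.
constexpr int kMax = 7;
constexpr std::size_t kCodes = 2 * kMax + 1;  // number of encodings
using Code = std::uint8_t;                    // table index of a value

inline int to_value(Code c) { return static_cast<int>(c) - kMax; }
inline Code to_code(int v) {
    const int clamped = std::max(-kMax, std::min(kMax, v));
    return static_cast<Code>(clamped + kMax);
}
inline Code negate(Code c) { return to_code(-to_value(c)); }

// One square table: sums on and above the diagonal, differences below.
struct AddSubLut {
    std::array<std::array<Code, kCodes>, kCodes> cell{};

    AddSubLut() {
        for (std::size_t i = 0; i < kCodes; ++i) {
            for (std::size_t j = 0; j < kCodes; ++j) {
                const int a = to_value(static_cast<Code>(i));
                const int b = to_value(static_cast<Code>(j));
                cell[i][j] = (i <= j) ? to_code(a + b) : to_code(a - b);
            }
        }
    }

    // a + b == b + a, so always read the upper triangle.
    Code add(Code a, Code b) const {
        return a <= b ? cell[a][b] : cell[b][a];
    }

    // a - b == -(b - a), so read the lower triangle and negate if needed.
    Code sub(Code a, Code b) const {
        if (a == b) return to_code(0);
        return a > b ? cell[a][b] : negate(cell[b][a]);
    }
};

// Bytes for a full 2^X-by-2^X table (assumption: whole bytes per entry).
inline std::uint64_t lut_bytes(int total_bits) {
    assert(total_bits >= 1 && total_bits <= 16);
    const std::uint64_t entries = std::uint64_t(1) << (2 * total_bits);
    const std::uint64_t bytes_per_entry = (total_bits + 7) / 8;
    return entries * bytes_per_entry;
}
```

Under these assumptions, lut_bytes(8) gives 65536 bytes, which agrees
with the 64 kB figure quoted for Posit8.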
Tab. 1 Memory required to store the single LUT as a function of X
(total number of bits of the Posit).
IV. IMPLEMENTING THE k-NN WITH POSITS SUPPORT
The k-Nearest-Neighbors (k-NN) is a simple learning algorithm used
for classification and regression problems. In classification, a new
object is classified by a majority vote of its neighbors: the object
is assigned to the class most common among its k nearest neighbors.
The position of each object in the n-dimensional space is determined
by its n features. The algorithm can be summarized as:
- A positive integer k is specified, along with a new object.
- The k entries in the dataset that are closest to the new object are
  selected (e.g. using the Euclidean distance).
- The most voted class among these entries is found. That label is
  how the object is classified.
The best choice of k depends upon the data. Larger values of k reduce
the effect of noise on the classification, but make the boundaries
between classes less distinct. A good k can be selected by various
heuristic techniques.
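The three steps above can be sketched as a brute-force classifier
that is generic in the scalar type, so that the same code could in
principle be instantiated with float, double or a Posit class
providing the usual arithmetic and comparison operators. This is a
minimal sketch under that assumption: Sample, squared_distance and
knn_classify are hypothetical names, and the code does not reproduce
the nanoflann-based implementation used in this work. Squared
Euclidean distances are used, since they rank neighbors in the same
order as Euclidean distances.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One labeled observation with n features (hypothetical type).
template <class Scalar>
struct Sample {
    std::vector<Scalar> features;
    int label;
};

// Squared Euclidean distance, accumulated in the scalar type itself.
template <class Scalar>
Scalar squared_distance(const std::vector<Scalar>& a,
                        const std::vector<Scalar>& b) {
    assert(a.size() == b.size());
    Scalar acc = Scalar(0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        const Scalar d = a[i] - b[i];
        acc = acc + d * d;
    }
    return acc;
}

// Brute-force k-NN: majority vote among the k closest samples.
// Ties between classes are resolved in favor of the smallest label.
template <class Scalar>
int knn_classify(const std::vector<Sample<Scalar>>& dataset,
                 const std::vector<Scalar>& query, std::size_t k) {
    assert(k >= 1 && k <= dataset.size());
    typedef std::pair<Scalar, int> Entry;  // (squared distance, label)
    std::vector<Entry> dist;
    dist.reserve(dataset.size());
    for (const auto& s : dataset)
        dist.emplace_back(squared_distance(s.features, query), s.label);

    // Move the k smallest distances to the front.
    std::partial_sort(dist.begin(),
                      dist.begin() + static_cast<std::ptrdiff_t>(k),
                      dist.end(),
                      [](const Entry& x, const Entry& y) {
                          return x.first < y.first;
                      });

    // Count the votes of the k nearest neighbors.
    std::map<int, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i) ++votes[dist[i].second];

    int best_label = dist[0].second;
    std::size_t best_votes = 0;
    for (const auto& v : votes) {
        if (v.second > best_votes) {
            best_votes = v.second;
            best_label = v.first;
        }
    }
    return best_label;
}
```

Accumulating the distance in the scalar type is a deliberate choice
in this sketch: it exposes the limited range and precision of a
narrow format, which is the effect that rescaling the dataset is
meant to mitigate.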
The k-NN can also be used for regression; in this case the output is
the property value for the object, computed as the average of the
values of its k nearest neighbors. We have implemented both a C++
library for Posit support (cppPosit) and a generic k-NN algorithm,
able to work with both floats and Posits, see Fig. 4. In particular,
we extended the nanoflann library
(https://github.com/jlblancoc/nanoflann) to operate with Posits.
For small-size Posits (up to 12 bits) we have also considered the
tabulated version.
Fig. 4. SW architecture of the implemented k-NN library able to run
with different data types.
V. EXPERIMENTAL RESULTS
We have tested our implementation on three well-known datasets [4]:
sift-128-euclidean, mnist-784-euclidean and
fashion-mnist-784-euclidean, which are commonly used as
classification benchmarks in the literature. Table 2 and Fig. 5 show
the performance of the different data types on these datasets when
the scaling factor changes from 0 to 1. The scaling factor rescales
the whole dataset. Scaling = 1 means no scaling (the original dataset
is used). In the other cases, a scaled version of the original
dataset is used (the variability of the dataset is thus reduced,
which allows data types with lower dynamic range to accommodate the
observations without truncation). Table 2 shows that Posit16 with 3
exponent bits works very well. From Fig. 5, the k-NN with a 16-bit
Posit with 3 exponent bits attains performance close to float32, and
an 8-bit Posit outperforms float16. The achieved results show that
Posit16-3 can ensure the same precision as Float32 on the original
dataset (Scaling = 1); in pattern recognition, precision is the
fraction of relevant instances among the retrieved instances.
Moreover, Posit8-1 outperforms Float16, since it achieves higher
precision for the same scaling factor. By applying an appropriate
scaling factor to the dataset values, we can adapt the test to types
with a smaller range. For the k-NN implementations, the algorithm
chosen for the test is nanoflann, a pure-template version of the more
widespread FLANN [4]. Although nanoflann is parametrized over generic
types, it required some patching to support non-floating-point
values, with specific care in the accumulation of vector norms. For
testing with ann-benchmarks, we created a library that contains the
nanoflann templates instantiated with each of the relevant types.
Fig. 5. Precision as a function of the scaling factor, MNIST dataset
(similar results for the other benchmark datasets)
Tab. 2. Precision obtained on the three datasets, when using float32,
Posit16 and Posit32. Scaling factor=1.
VI. CONCLUSIONS
This work compared the performance of a k-NN classifier when using
Posits and floats, showing the benefits introduced by the former. For
example, using a PPU instead of floats for k-NN classification (on
known benchmark datasets), Posit16 achieved the same accuracy as
float32, while Posit8 outperformed float16 (which has been proposed
in the literature as an alternative to float32 for artificial
intelligence applications). This is a remarkable result, not only for
saving storage space, but also for better exploiting CPU
vectorization and all levels of cache, and for increasing the
effective bandwidth of data transfers between CPU and RAM to counter
the memory wall phenomenon. As shown in Table 1, a LUT-based
tabulated implementation of a PPU for Posit8 requires 64 kB of
storage, compliant with memory-constrained embedded devices. The
impact of this work is thus high, since besides autonomous driving
there are many safety-critical applications where the accuracy of
ML-based decisions is an issue but low complexity and real-time
operation are needed (robotics, Industry 4.0, avionics). As future
work, we are considering the implementation of the fused dot product,
high-level synthesis to HDL starting from the cppPosit SW library for
the HW design of a PPU, and the use of Posits in DNNs.
ACKNOWLEDGMENTS
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant
agreement No 826647.
REFERENCES
[1] S. Saponara et al., “Radar-on-chip/in-package in autonomous driving
vehicles and intelligent transport systems: opportunities and challenges”,
IEEE Signal Processing Magazine, 36 (5), 2019
[2] L. Lo Bello et al., “Recent advances and trends in on-board embedded
and networked automotive systems”, IEEE Tran. Ind. Inf., 15 (2), 2019
[3] J. L. Rojo-Álvarez et al., “From Signal Processing to Machine
Learning”, in Digital Signal Processing with Kernel Methods, Wiley, 2018
[4] M. Muja et al., “Scalable nearest neighbor algorithms for high
dimensional data”, IEEE Trans. Pattern Analysis and Mach. Int., 36 (11), 2014
[5] T. Bubolz et al., “Quality and Energy-Aware HEVC Transrating Based
on Machine Learning”, IEEE Tran. Circ. and Syst. I, 66 (6), 2019
[6] Li Du et al., “A Reconfigurable 64-Dimension K-Means Clustering
Accelerator with Adaptive Overflow Control”, IEEE Trans. Circ. and Sys.
II, 2019, doi 10.1109/TCSII.2019.2922657
[7] P. Nousi et al., “Convolutional Neural Networks for visual information
analysis with limited computing resources”, IEEE ICIP 2018, pp. 321-325
[8] Yu Cheng et al., “Model Compression and Acceleration for Deep Neural
Networks”, IEEE Signal Proc. Mag., 35 (1), pp. 126-136, 2018
[9] D. Reinhardt et al., “High performance processor architecture for
automotive large scaled integrated systems within the European Processor
Initiative research project”, SAE Tech. Paper 2019-01-0118
[10] https://www.european-processor-initiative.eu/
[11] U. Köster et al., “Flexpoint: An Adaptive Numerical Format for
Efficient Training of Deep Neural Networks”, NIPS 2017, pp. 1740-1750
[12] V. Popescu et al., “Flexpoint: predictive numerics for deep learning”,
IEEE Symposium on Computer Arithmetic, 2018
[13] “NVIDIA Turing GPU Architecture, graphics reinvented”, White
paper n. WP-09183-001_v01, pp. 1-80, 2018
[14] G. Tagliavini et al., “FlexFloat: A Software Library for Transprecision
Computing”, IEEE Trans. on CAD of Int. Cir. and Syst., 2019
[15] A. Malossi et al., “The transprecision computing paradigm: concept,
design, and applications”, IEEE DATE 2018, pp. 1105-1110
[16] G. Venkatesh et al., “Accelerating Deep Convolutional Networks Using
Low-Precision and Sparsity”, IEEE ICASSP 2017
[17] G. Srivastava et al., “Joint optimization of quantization and
structured sparsity for compressed deep neural networks”, ICASSP 2019
[18] I. Hubara et al., “Quantized neural networks: training neural networks
with low precision weights and activations”, J. ML Research, 18 (1), 2017
[19] M. Cococcioni et al., “Exploiting posit arithmetic for Deep Neural
Networks in autonomous driving applications”, IEEE Automotive 2018
[20] J. L. Gustafson et al., “Beating floating point at its own game: Posit
arithmetic”, Supercomp. Frontiers and Innov., 4 (2), 2017
... Other formats also come from the concept of transprecision computing [6,7] (NVIDIA Turing architectures allow computation with 4-, 8-, and 32-bit integers and with 16-and 32-bit floats). The up-and-coming Posit format has been theoretically [8][9][10] and practically [11] proven to be a perfect replacement for IEEE float numbers when applied to DNNs in terms of efficiency and accuracy. ...
... The Posit format has been introduced by John L. Gustafson in Reference [8] and was further investigated in Reference [9,10,12]. The format is a fixed-length one with up to 4 fields as also reported in Figure 1: ...
... For this paper, we employ our software implementation of Posit numbers developed at the University of Pisa, called cppPosit. As already described in References [9,12], the library classifies Posit operations into four different classes (from L1 to L4), with increasing computational complexity. ...
Article
Full-text available
With increasing real-time constraints being put on the use of Deep Neural Networks (DNNs) by real-time scenarios, there is the need to review information representation. A very challenging path is to employ an encoding that allows a fast processing and hardware-friendly representation of information. Among the proposed alternatives to the IEEE 754 standard regarding floating point representation of real numbers, the recently introduced Posit format has been theoretically proven to be really promising in satisfying the mentioned requirements. However, with the absence of proper hardware support for this novel type, this evaluation can be conducted only through a software emulation. While waiting for the widespread availability of the Posit Processing Units (the equivalent of the Floating Point Unit (FPU)), we can already exploit the Posit representation and the currently available Arithmetic-Logic Unit (ALU) to speed up DNNs by manipulating the low-level bit string representations of Posits. As a first step, in this paper, we present new arithmetic properties of the Posit number system with a focus on the configuration with 0 exponent bits. In particular, we propose a new class of Posit operators called L1 operators, which consists of fast and approximated versions of existing arithmetic operations or functions (e.g., hyperbolic tangent (TANH) and extended linear unit (ELU)) only using integer arithmetic. These operators introduce very interesting properties and results: (i) faster evaluation than the exact counterpart with a negligible accuracy degradation; (ii) an efficient ALU emulation of a number of Posits operations; and (iii) the possibility to vectorize operations in Posits, using existing ALU vectorized operations (such as the scalable vector extension of ARM CPUs or advanced vector extensions on Intel CPUs). 
As a second step, we test the proposed activation function on Posit-based DNNs, showing how 16-bit down to 10-bit Posits represent an exact replacement for 32-bit floats while 8-bit Posits could be an interesting alternative to 32-bit floats since their performances are a bit lower but their high speed and low storage properties are very appealing (leading to a lower bandwidth demand and more cache-friendly code). Finally, we point out how small Posits (i.e., up to 14 bits long) are very interesting while PPUs become widespread, since Posit operations can be tabulated in a very efficient way (see details in the text).
... Lately, this novel representation has been proven to equalize FP32 accuracy performance using only 16 bits. [8][9][10][11][12]. Furthermore, we also recently saw the rise of hardware implementations for posit numbers. ...
... The posit format [7][8][9]19] is a fixed length format that can be configured in the number of overall bits (nbits) and the maximum number of exponent bits (esbits). ...
Conference Paper
Full-text available
Real-time processing of images and videos is becoming considerably crucial in modern applications of machine learning (ML) and deep neural networks. Having a faster and compressed floating point arithmetic can significantly increase the performance of such applications optimizing memory occupation and transfer of information. In this field, the novel posit number system is very promising. In this paper we exploit posit numbers to evaluate the performance of several machine learning algorithms in real-time image and video processing applications. Future steps will involve further hardware accelerations for native posit operations.
... Another very promising alternative to IEEE 32-bit Floatingpoint standard is the posit ™ number system, proposed by Gustafson [9]. This format has been proven to match single precision accuracy performance with only 16 bits used for the representation [10][11][12][13][14]. Furthermore, the first hardware implementations of this novel type are very promising in terms of energy consumption and area occupation [15][16][17]. ...
... The posit format [9][10][11]19] is a configurable fixed length format for real number representation; the format configuration involves the number of overall bits (nbits) and the maximum number of exponent bits (esbits). ...
Article
Full-text available
With the arrival of the open-source RISC-V processor architecture, there is the chance to rethink Deep Neural Networks (DNNs) and information representation and processing. In this work we will exploit the following ideas: i) reduce the number of bits needed to represent the weights of the DNNs using our recent findings and implementation of the posit number system, ii) exploit RISC-V vectorization as much as possible to speed up the format encoding/decoding, the evaluation of activations functions (using only arithmetic and logic operations, exploiting approximated formulas) and the computation of core DNNs matrix-vector operations. The comparison with the well-established architecture ARM Scalable Vector Extension (SVE) is natural and challenging due to its closedness and mature nature. The results show how it is possible to vectorize posit operations on RISC-V, gaining a substantial speed-up on all the operations involved. Furthermore, the experimental outcomes highlight how the new architecture can catch up, in terms of performance, with the more mature ARM architecture. Towards this end, the present study is important because it anticipates the results that we expect to achieve when we will have an open RISC-V hardware co-processor capable to operate natively with posits.
... Another promising representation that diverges from the floating-point standard is the posit number system [5][6][7]. This type has been proven to be a perfect drop-in replacement of 32-bit IEEE 754 floats in machine learning, using just 16 bits [8][9][10][11][12][13]. Moreover, it has been productively exploited in low-precision inference down to 8-bit posit representation, with very little degradation of network inference accuracy. ...
... As widely shown in [7,8,10,17,18], the posit format is a fixed-length alternative representation to float numbers. A posit can be configured in the total number of bits (nbits) and the number of exponent bits (es). ...
Article
Full-text available
With the advent of image processing and computer vision for automotive under real-time constraints, the need for fast and architecture-optimized arithmetic operations is crucial. Alternative and efficient representations for real numbers are starting to be explored, and among them, the recently introduced posit$$^{\mathrm{TM}}$$ number system is highly promising. Furthermore, with the implementation of the architecture-specific mathematical library thoroughly targeting single-instruction multiple-data (SIMD) engines, the acceleration provided to deep neural networks framework is increasing. In this paper, we present the implementation of some core image processing operations exploiting the posit arithmetic and the ARM scalable vector extension SIMD engine. Moreover, we present applications of real-time image processing to the autonomous driving scenario, presenting benchmarks on the tinyDNN deep neural network (DNN) framework.
... The posit TM format ( [8][9][10]) is one of the most promising representations that deviates from the IEEE 754 standard. In machine learning, this kind has been shown to be a great drop-in replacement for 32-bit IEEE 754 floats, using only 16 bits [11][12][13][14][15][16]. Furthermore, it has been successfully used in low-precision inference down to 8-bit posit representation with minimal network inference accuracy degradation. ...
Chapter
Full-text available
With the pervasiveness of deep neural networks in scenarios that bring real-time requirements, there is the increasing need for optimized arithmetic on high performance architectures. In this paper we adopt two key visions: i) extensive use of vectorization to accelerate computation of deep neural network kernels; ii) adoption of the posit compressed arithmetic in order to reduce the memory transfers between the vector registers and the rest of the memory architecture. Finally, we present our first results on a real hardware implementation of the ARM Scalable Vector Extension.
... These results have been obtained on a single dataset, but scaling it multiple times in order to reduce the dynamic range of the input data (thus allowing low-precision data types to be competitive with Float32). More details can be found in [63]. The obtained results confirm that Posits are powerful in a number of machine learning application and thus this means that implementing Posit-based HW accelerators will be beneficial for a number of different applications. ...
Article
Full-text available
This paper focuses on trends, opportunities and challenges of novel arithmetics for DNN signal processing, with particular reference to assisted and autonomous drivingapplications. Due to strict constrains in terms of latency, dependability and security of autonomous driving, machine perception (i.e. detection or decisions tasks) based on DNN cannot be implemented relying on a remote cloud access. These tasks must be performed in real-time on embedded systems on-board the vehicle, particularly for the inference phase (considering the use of DNNs pre-trained during an off-line step). When developing a DNN computing platform, the choice of the computing arithmetics matters. Moreover, functional safe applications like autonomous driving pose severe constraints on the effect that signal processing accuracy has on final rate of wrong detection/decisions. Hence, after reviewing the different choices and trade-off concerning arithmetics, both in academia and industry, we highlight the issues in implementing DNN accelerators to achieve accurate and low-complex processingof automotive sensor signals (the latter coming from diversesources like cameras, radars, lidars, ultrasonics). The focus ison both on general-purpose operations massively used in DNN like multiply, accumulation, compare, or on specific functionslike for example sigmoid or hyperbolic tangent, used for neuron activation.
... The Posit format as proposed in [7][8][9] is a fixed-length representation composed of at most 4 fields, as shown in Fig. 1: a 1-bit sign field, a variable-length regime field, a variable-length (up to es bits) exponent field and a variable-length fraction field. The overall length and the maximum exponent length are decided a priori. ...
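The field layout described in this excerpt can be made concrete with a small decoder. The following is a minimal sketch, not the cppPosit implementation: the function name `decode_posit` and its parameters `nbits` and `es` are hypothetical, and the sketch assumes the usual posit conventions (negative values stored in two's complement, regime encoded as a run of identical bits, scale factor 2^(k·2^es + exponent)).

```python
def decode_posit(bits: int, nbits: int = 8, es: int = 0) -> float:
    """Decode an nbits-wide posit bit pattern into a Python float.

    Hypothetical helper for illustration; assumes standard posit conventions.
    """
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):
        return float("nan")  # NaR (Not a Real) pattern: 100...0

    # Sign field: 1 bit; negative posits are stored in two's complement.
    sign = bits >> (nbits - 1)
    if sign:
        bits = (-bits) & mask
    body = bits & ((1 << (nbits - 1)) - 1)

    # Regime field: run of identical bits, ended by the opposite bit.
    pos = nbits - 2
    first = (body >> pos) & 1
    run = 0
    while pos >= 0 and ((body >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first else -run
    pos -= 1  # skip the terminating bit, if it exists

    # Whatever is left holds up to `es` exponent bits, then the fraction.
    left = max(pos + 1, 0)
    tail = body & ((1 << left) - 1)
    exp_bits = min(es, left)
    # Exponent bits that were cut off by the regime count as zeros.
    exponent = (tail >> (left - exp_bits)) << (es - exp_bits)
    frac_bits = left - exp_bits
    frac = tail & ((1 << frac_bits) - 1)

    value = (1.0 + frac / (1 << frac_bits)) * 2.0 ** (k * (1 << es) + exponent)
    return -value if sign else value


print(decode_posit(0x40))        # 1.0
print(decode_posit(0x7F))        # 64.0, largest 8-bit posit with es = 0
print(decode_posit(0x50, es=1))  # 2.0
```

Because the regime length varies, patterns near 1.0 keep more fraction bits while very large or very small magnitudes trade fraction bits for range, which is the property the excerpt refers to with "variable-length" fields.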
Conference Paper
Deep Neural Networks (DNNs) are being used in more and more fields. Among others, automotive is a field where deep neural networks are being exploited the most. An important aspect to be considered is the real-time constraint that this kind of application puts on neural network architectures. This poses the need for fast and hardware-friendly information representation. The recently proposed Posit format has proven to be extremely efficient as a low-bit replacement of traditional floats. Its format has already enabled the construction of a fast approximation of the sigmoid function, an activation function frequently used in DNNs. In this paper we present a fast approximation of another activation function widely used in DNNs: the hyperbolic tangent. In the experiment, we show how the approximated hyperbolic tangent outperforms the approximated sigmoid counterpart. The implication is clear: the posit format again shows itself to be DNN friendly, with important outcomes.
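The abstract does not spell out the sigmoid approximation it mentions. As a hedged illustration, the sketch below uses the bit-level trick commonly described in the posit literature for posits with es = 0: flip the sign bit and shift the pattern right by two positions. This is an assumption about which approximation is meant; the names `fast_sigmoid_bits` and `posit8_to_float` are hypothetical, and the tanh approximation proposed in the paper is not reproduced here.

```python
import math


def posit8_to_float(bits: int) -> float:
    """Decode an 8-bit posit with es = 0 (hypothetical helper)."""
    bits &= 0xFF
    if bits == 0:
        return 0.0
    if bits == 0x80:
        return float("nan")  # NaR pattern
    negative = bits & 0x80
    if negative:
        bits = (-bits) & 0xFF  # negative posits use two's complement
    body = bits & 0x7F
    first = (body >> 6) & 1
    run, pos = 0, 6
    while pos >= 0 and ((body >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first else -run
    frac_bits = max(pos, 0)  # bits below the regime's terminating bit
    frac = body & ((1 << frac_bits) - 1)
    value = (1.0 + frac / (1 << frac_bits)) * 2.0 ** k
    return -value if negative else value


def fast_sigmoid_bits(bits: int) -> int:
    """Approximate sigmoid on the raw pattern: flip sign bit, shift right by 2."""
    return ((bits & 0xFF) ^ 0x80) >> 2


# Compare the approximation with the exact sigmoid for x = -1, 0, 1.
for pattern in (0xC0, 0x00, 0x40):
    x = posit8_to_float(pattern)
    approx = posit8_to_float(fast_sigmoid_bits(pattern))
    exact = 1.0 / (1.0 + math.exp(-x))
    print(f"x={x:+.2f}  approx={approx:.4f}  exact={exact:.4f}")
```

The approximation needs only one XOR and one shift on the integer pattern, with no decoding step at run time; the decoder above is included only to display the results.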
Article
Growing constraints on memory utilization, power consumption, and I/O throughput have increasingly become limiting factors to the advancement of high performance computing (HPC) and edge computing applications. IEEE-754 floating-point types have been the de facto standard for floating-point number systems for decades, but the drawbacks of this numerical representation leave much to be desired. Alternative representations are gaining traction, both in HPC and machine learning environments. Posits have recently been proposed as a drop-in replacement for the IEEE-754 floating-point representation. We survey the state-of-the-art and state-of-the-practice in the development and use of posits in edge computing and HPC. The current literature supports posits as a promising alternative to traditional floating-point systems, both as a stand-alone replacement and in a mixed-precision environment. Development and standardization of the posit type is ongoing, and much research remains to explore the application of posits in different domains, how to best implement them in hardware, and where they fit with other numerical representations.
Conference Paper
Nowadays, real-time applications are exploiting DNNs more and more for computer vision and image recognition tasks. Such applications pose strict constraints in terms of both fast and efficient information representation and processing. New formats for representing real numbers have been proposed and among them the Posit format appears to be very promising, providing means to implement fast approximated versions of widely used activation functions in DNNs. Moreover, information processing performance is continuously improved thanks to advanced vectorized SIMD (single-instruction multiple-data) processor architectures and instructions like ARM SVE (Scalable Vector Extension). This paper explores both approaches (Posit-based implementation of activation functions and vectorized SIMD processor architectures) to obtain faster DNNs. The two proposed techniques are able to speed up both DNN training and inference steps.
Conference Paper
Autonomous driving systems and connected mobility are the next big developments for the car manufacturers and their suppliers during the next decade. To achieve the high computing power needs and fulfill new upcoming requirements due to functional safety and security, heterogeneous processor architectures with a mixture of different core architectures and hardware accelerators are necessary. To tackle this new type of hardware complexity and nevertheless stay within monetary constraints, high performance computers, inspired by state of the art data center hardware, could be adapted in order to fulfill automotive quality requirements. The European Processor Initiative (EPI) research project tries to come along with that challenge for next generation semiconductors. To be as close as possible to series development needs for the next upcoming car generations, we present a hybrid semiconductor system-on-chip architecture for automotive. This microprocessor is inspired and derived from HPC architecture of the European Processor Initiative research project. Furthermore we suggest a possible future architecture for high performance automotive microprocessors integrated on an automotive computing platform. We describe our architectural hardware approach for a generic high performance Central Processing Unit (CPU) for deep embedded operation up to hosting POSIX based automotive systems. It implements different kinds of non-functional requirements for functional safety like fail-operational and for security crypto-accelerators within a single package.
Article
Modern cars consist of a number of complex embedded and networked systems with steadily increasing requirements in terms of processing and communication resources. Novel automotive applications, such as automated driving, raise new needs and novel design challenges that cover a broad range of hardware/software engineering aspects. In this context, this paper provides an overview of the current technological challenges in on-board and networked automotive systems. The paper encompasses both the state-of-the-art design strategies and the upcoming hardware/software solutions for the next generation of automotive systems, with a special focus on embedded and networked technologies. In particular, the work surveys current solutions and future trends on models and languages for automotive software development, on-board computational platforms, in-car network architectures and communication protocols, and novel design strategies for cybersecurity and functional safety.
Conference Paper
This paper discusses the introduction of an integrated Posit Processing Unit (PPU) as an alternative to the Floating-point Processing Unit (FPU) for Deep Neural Networks (DNNs) in automotive applications. Autonomous driving tasks increasingly depend on DNNs. For example, the detection of obstacles by means of object classification needs to be performed in real-time without involving remote computing. To speed up the inference phase of DNNs, the CPUs on-board the vehicle should be equipped with co-processors, such as GPUs, which embed specific optimizations for DNN tasks. In this work, we review an alternative arithmetic that could be used within the co-processor. We argue that a new representation for floating point numbers called Posit is particularly advantageous, allowing for a better trade-off between computation accuracy and implementation complexity. We conclude that implementing a PPU within the co-processor is a promising way to speed up the DNN inference phase.
Article
Video transrating has become an essential task to allow the transmission of different versions of the same video in streaming services and live applications. However, as the transrating operation comprises a decoding and an encoding step in sequence, it demands high processing time and energy consumption, which is prohibitive in large-scale systems. This work proposes a scalable quality and time/energy-aware HEVC transrating system based on decision trees. The scalable scheme operates under three different modes that employ the decision tree outcomes in different ways according to the desired tradeoff between image quality and time/energy savings. Experimental results presented a transrating time reduction of up to 57.5%, with a minimum energy consumption reduction of 49.5% and an average memory bandwidth reduction of 24% in comparison to the original transcoder. These results were achieved at the cost of a BD-rate increase of only 0.664% in the most conservative transrating mode, which allows a transrating time reduction of 48.5%. The proposed decision trees were implemented as an IP core and synthesized targeting 45nm ASIC technology, achieving the capability of processing 7680×4320 videos at 240 frames per second with a negligible power consumption of 0.849 mW.
Article
This paper presents a novel reconfigurable K-Means clustering accelerator that is suitable for integration in both IoT and data center systems. The high vector dimension reconfigurability and design cost reduction are achieved through vector-streaming and adaptive overflow control to adapt distance computation using as-needed precision (dynamic 16-bit fixed-point data format). A two-stage shift-bit counted comparator is proposed. It can determine most results through only turning on the shift-bit comparator (3-bit), reducing the power consumption by 7x compared to the direct full dynamic range comparison. Four vectors with two cluster centroids are processed simultaneously. Up to 8-dimension cluster vectors are stored in a local buffer to reduce data exchange between the main memory and the processing engine. A prototype accelerator was implemented in TSMC 65 nm. The accelerator occupied 0.26 mm² and can support up to 64-dimensional vector clustering. It achieved 31.2M query vectors/sec with 41 mW power consumption at 250 MHz clock (Cluster Number: 2, Vector Dimension: 64) and an energy efficiency of 0.41 TOPS/W at 30 MHz clock.
Article
In recent years approximate computing has been extensively explored as a paradigm to design hardware and software solutions that save energy by trading off on the quality of the computed results. In applications that involve numerical computations with wide dynamic range, precision tuning of floating-point (FP) variables is a key knob to leverage the energy/quality trade-off of program results. This aspect assumes maximum relevance in the transprecision computing scenario, where accuracy of data is tuned at fine grain in application code. Performing precision tuning at fine grain requires a software development flow that streamlines the assessment of which variables have “precision slack” within an application. In this paper we introduce FlexFloat, an open-source software library that has been expressly designed to aid the development of transprecision applications. FlexFloat provides a C/C++ interface for supporting multiple FP formats. Unlike alternative libraries, FlexFloat enables control of the bit-width of the mantissa and exponent fields and provides advanced features for the collection of runtime statistics, reducing the FP emulation time compared to the state-of-the-art solutions. Its design makes it possible to emulate the behavior of standard IEEE FP types and custom extensions for reduced-precision computation. This makes the library suitable for adoption in multiple contexts, from manual exploration to integration into automatic tools. Experimental findings demonstrate that our approach can be used to perform a complete precision analysis from which multiple program versions can be derived depending on the energy/quality trade-off. Furthermore, we show that the adoption of our methodology can lead to a significant reduction of energy consumption even on current commercial hardware (an embedded GPGPU).