A Fast Approximation of the Hyperbolic Tangent
when Using Posit Numbers and its Application
to Deep Neural Networks
Marco Cococcioni1, Federico Rossi1, Emanuele Ruffaldi2, Sergio Saponara1
1Dept of Information Engineering, University of Pisa, 56122 – Italy
2MMI spa, Calci, Pisa, 56011 – Italy
Abstract. Deep Neural Networks (DNNs) are being used in more and more fields, and
automotive is among the fields exploiting them most heavily. An important aspect to
consider is the real-time constraint that this kind of application puts on neural network
architectures, which calls for fast and hardware-friendly information representations.
The recently proposed Posit format has proven to be extremely efficient as a low-bit
replacement for traditional floats. Its encoding has already allowed the construction of
a fast approximation of the sigmoid function, an activation function frequently used in
DNNs. In this paper we present a fast approximation of another activation function
widely used in DNNs: the hyperbolic tangent. In our experiments, we show how the
approximated hyperbolic tangent outperforms the approximated sigmoid counterpart.
The implication is clear: the Posit format proves once again to be DNN friendly, with
important practical outcomes.
Keywords. Deep Neural Networks (DNNs), Posit, Activation functions
1 Introduction
The use of deep neural networks (DNNs) as a general tool for signal and data
processing is increasing both in industry and academia. One of the key challenges is
the cost-effective computation of DNNs, to ensure that these techniques can be
implemented at low cost, with low power and in real time for embedded applications in
IoT devices, robots, autonomous cars and so on. To this aim, an open research field is
devoted to the cost-effective implementation of the main operators used in DNNs,
among them the activation function. The basic node of a DNN computes the sum of
products of its inputs (X) and the corresponding weights (W) and then applies an
activation function f(·) to it, to produce the output of that layer and feed it as input to
the next layer. If no activation function is applied, the output signal is simply a linear
function of the input, which has low complexity but is not powerful enough to learn
the complex (typically non-linear) mappings found in data. This is why the most used
activation functions, like Sigmoid, Tanh (hyperbolic tangent) and ReLu (Rectified
Linear Unit), introduce non-linear properties into DNNs [1,2]. Choosing the
activation function for a DNN model must take into account various aspects of both
the considered data distribution and the underlying information representation.
Moreover, for decision-critical applications like machine perception for robotics and
autonomous cars, the implementation accuracy is also important.
Indeed, one of the main trends in industry for keeping the complexity of DNN
computation low is to avoid costly arithmetic such as double-precision (64-bit)
floating point, relying instead on much more compact formats like BFLOAT16 or
Flexpoint [3, 4] (i.e. revised versions of the 16-bit IEEE 754 floating point format
adopted by Google Tensor Processing Units and Intel AI processors) or on
transprecision computing [5, 6] (e.g. the recent Turing GPUs from NVIDIA support
INT32, INT8, INT4, fp32 and fp16 computation [5]). To this aim, this paper presents
a fast approximation of the hyperbolic tangent activation function, combined with a
new hardware-friendly information representation based on the Posit numerical format.
Hereafter, Section 2 introduces the Posit format and the cppPosit library developed
at the University of Pisa for computing with this numerical format. Section 3
introduces the hyperbolic tangent and its approximation. Results obtained when the
proposed technique is applied to DNNs on well-known benchmark datasets are
reported in Section 4, together with a comparison against other common activation
functions, such as the sigmoid. Conclusions are drawn in Section 5.
2 Posit Arithmetic and the CppPosit Library
The Posit format, as proposed in [7-9], is a fixed-length representation composed of at
most 4 fields, as shown in Fig. 1: a 1-bit sign field, a variable-length regime field, a
variable-length (up to es bits) exponent field and a variable-length fraction field. The
overall length and the maximum exponent length are decided a priori. The regime
length and value are determined by the number of consecutive zeroes or ones,
terminated, respectively, by a single one (negative regime) or a single zero (positive
regime); see Fig. 2.
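To make the field layout concrete, the following is a minimal C++ sketch that decodes
a Posit<8,0> bit pattern into a double according to the sign/regime/fraction scheme just
described (with es = 0 there is no exponent field). The helper name and the use of a
plain uint8_t for the raw pattern are our own illustrative choices, not part of the
cppPosit API.

#include <cstdint>
#include <cmath>

// Decode a Posit<8,0> bit pattern into a double (illustrative sketch).
double posit8_es0_to_double(uint8_t bits) {
    if (bits == 0x00) return 0.0;               // special pattern: zero
    if (bits == 0x80) return NAN;               // special pattern: NaR (not a real)

    bool negative = bits & 0x80;
    uint8_t p = negative ? (uint8_t)(-bits) : bits;   // posits negate by 2's complement

    // Regime: run of identical bits after the sign, terminated by the opposite bit.
    int regime_bit = (p >> 6) & 1;
    int run = 0, i = 6;
    while (i >= 0 && ((p >> i) & 1) == regime_bit) { ++run; --i; }
    int k = regime_bit ? (run - 1) : -run;      // regime value

    --i;                                        // skip the terminating bit
    double frac = 0.0, weight = 0.5;            // remaining bits are the fraction
    for (; i >= 0; --i, weight *= 0.5)
        if ((p >> i) & 1) frac += weight;

    double value = std::ldexp(1.0 + frac, k);   // (1 + f) * 2^k  (es = 0, so useed = 2)
    return negative ? -value : value;
}

For example, the pattern 0x40 decodes to 1.0, 0x20 to 0.5 and 0x7F to maxpos = 64 for
this configuration.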
In this work we are going to use the cppPosit library, a modern C++14
implementation of the original Posit number system. The library identifies four
different operational levels (L1-L4):
- L1 operations involve only bit manipulations of the posit, treated as an integer
without decoding it; they are thus performed on the ALU and are fast.
- L2 operations involve unpacking the Posit into its four fields, with no exponent
computation.
- L3 operations involve full exponent unpacking, but without the need to perform
arithmetic operations on the unpacked fields (examples are conversions to/from
float, other posits or fixed point).
- L4 operations require the fully unpacked version and perform software or hardware
floating point computation on the unpacked fields.
L1 operations are the most interesting ones, since they are the most efficient. They
include inversion, negation, comparisons and absolute value. Moreover, when
esbits=0, L1 operations also include doubling/halving, the 1's complement (when the
specific Posit value falls within the range [0,1]) and an approximation of the sigmoid
function, called here FastSigmoid and described in [9]. Table 1 reports some of the
implemented L1 operations, stating whether the formula is exact or an approximation,
together with the requirements in terms of Posit configuration and value. It is
important to underline that every effort put into finding an L1 expression for a
function or operation has two advantages: faster execution when using a
software-emulated PPU (Posit Processing Unit), and a lower area requirement (i.e.
fewer transistors) when the PPU is implemented in hardware.
Table 1. L1 operations summary

Operation              Approximation   Requirements
2*x                    no              esbits=0
x/2                    no              esbits=0
1/x                    no              none
1-x                    no              esbits=0, x in [-1,1]
FastSigmoid [9]        yes             esbits=0
FastTanh (see below)   yes             esbits=0
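As a concrete illustration of what "L1" means in practice, the following sketch
implements two of the operations above directly on the raw bit pattern of a Posit<8,0>.
Negation is the 2's complement of the pattern (a general posit property), while
FastSigmoid follows the bit trick described in [9]: flip the sign bit and shift the pattern
two places to the right. The function names and the uint8_t representation are our own
illustrative choices, not the cppPosit API.

#include <cstdint>

// Negation: posits negate by taking the 2's complement of the bit pattern.
inline uint8_t posit8_neg(uint8_t bits) {
    return (uint8_t)(-bits);
}

// FastSigmoid for esbits = 0, after [9]: flip the sign bit, then shift the
// whole pattern right by two positions (logical shift, zeros shifted in).
inline uint8_t posit8_fast_sigmoid(uint8_t bits) {
    return (uint8_t)(((uint8_t)(bits ^ 0x80)) >> 2);
}

For instance, feeding the pattern of 0 (0x00) returns the pattern 0x20, i.e. 0.5 =
sigmoid(0); feeding the pattern of 1 (0x40) returns 0x30, i.e. 0.75, against
sigmoid(1) ≈ 0.731.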
3 The Hyperbolic Tangent and its Approximation FastTanh
The hyperbolic tangent is a non-linear activation function typically adopted as a
replacement for the sigmoid activation function. The advantage of the hyperbolic
tangent over the sigmoid is its better treatment of negative values. Indeed, the output
of the hyperbolic tangent spans [-1, 1], while the sigmoid output covers only half of
that range, lying in [0, 1]. Furthermore, this difference in output range heavily impacts
performance when using small number representations, such as Posits with 10 or 8
bits. If we consider the sigmoid function applied to a Posit with x bits, we are actually
using, as output, a Posit with x-1 bits, since we are discarding the range [-1,0], which
is significantly dense when using the Posit format (see Fig. 3).
Fig. 3. The posit circle when the total number of bits is 5. The hyperbolic tangent uses all the
numbers in [-1, 1], while the sigmoid function uses only the ones in [0, 1].
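The claim can be checked by simply counting bit patterns. The short sketch below
reuses the posit8_es0_to_double decoder sketched in Section 2 (an illustrative helper
of ours, not a cppPosit function) to count how many Posit<8,0> patterns fall in [0,1]
and how many fall in [-1,1]: the second set is roughly twice as large, which is exactly
the extra bit of output resolution the hyperbolic tangent can exploit.

#include <cstdint>
#include <cstdio>

double posit8_es0_to_double(uint8_t bits);   // decoder sketched in Section 2

int main() {
    int in_unit = 0, in_sym = 0;
    for (int i = 0; i < 256; ++i) {
        if (i == 0x80) continue;                       // skip NaR
        double v = posit8_es0_to_double((uint8_t)i);
        if (v >= 0.0 && v <= 1.0) ++in_unit;           // patterns a sigmoid output can use
        if (v >= -1.0 && v <= 1.0) ++in_sym;           // patterns a tanh output can use
    }
    std::printf("[0,1]: %d patterns, [-1,1]: %d patterns\n", in_unit, in_sym);
    return 0;
}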
However, as already mentioned before, the sigmoid function

sigmoid(x) = 1 / (1 + e^(-x))
has a fast and efficient L1 approximation when using Posits with 0 exponent bits [9]
(FastSigmoid). In order to exploit a similar trick for the hyperbolic tangent, we first
introduce the scaled sigmoid function:
sSigmoid_k(x) = k · sigmoid(k·x) − k/2      (1)
Particularly interesting is the case k=2, when the scaled sigmoid coincides with the
hyperbolic tangent:
sSigmoid_2(x) = (1 − e^(-2x)) / (1 + e^(-2x)) = tanh(x)      (2)
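For completeness, the algebra connecting Eq. (1) at k = 2 to Eq. (2) is a single chain
of elementary steps:

\[
2\,\mathrm{sigmoid}(2x) - 1
  \;=\; \frac{2}{1 + e^{-2x}} - 1
  \;=\; \frac{1 - e^{-2x}}{1 + e^{-2x}}
  \;=\; \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
  \;=\; \tanh(x).
\]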
Now that we can express the hyperbolic tangent as a linear function of the sigmoid,
we must rework the expression in order to obtain a fast and efficient approximation
to be used with Posits.
We know that Posit properties guarantee that, when using a 0-exponent-bit format,
doubling a Posit value and computing its sigmoid approximation are just a matter of
bit manipulations, so both can be obtained efficiently. The subtraction in Equation (1),
instead, does not come with an efficient bit-manipulation implementation as-is. In
order to transform it into an L1 operation, we first rewrite it (for k = 2) as:

FastTanh(x) = 2·sigmoid(2·x) − 1      (3)
Then let us focus on negative values of x only. For these values, the quantity
2·sigmoid(2·x) lies inside the unitary region [0, 1], so the L1 1's complement can be
applied and Eq. (3) becomes FastTanh(x) = −(1 − 2·sigmoid(2·x)). Finally, negation is
always an L1 operation, thus for all negative values of x the hyperbolic tangent
approximation can be computed entirely with L1 operations.
Moreover, thanks to the anti-symmetry of the hyperbolic tangent, this approach can
also be extended to positive values. The following is a possible pseudo-code
implementation:
FastTanh(x) → y
  s   = x > 0                                         // remember the sign of x
  x_n = s ? -x : x                                    // work on the negative half only
  y_n = neg(compl1(twice(FastSigmoid(twice(x_n)))))   // tanh(x_n), all L1 steps
  y   = s ? -y_n : y_n                                // mirror back by anti-symmetry
where twice is the L1 operation that computes 2x and compl1 is the L1 operation that
computes the 1's complement 1 − x.
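The following plain-float reference mirrors the pseudo-code step by step; it is only a
sketch for checking the algebra, since in cppPosit each named step is an L1 bit
manipulation rather than ordinary arithmetic (the function names are ours).

#include <cmath>
#include <cstdio>

// True sigmoid, standing in for FastSigmoid in this reference version.
static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Same restructuring as the FastTanh pseudo-code above, in double precision.
double fast_tanh_reference(double x) {
    bool s    = x > 0;                       // remember the sign of x
    double xn = s ? -x : x;                  // work on the negative half only
    double t  = 2.0 * sigmoid(2.0 * xn);     // twice(FastSigmoid(twice(x_n))), in [0, 1]
    double yn = -(1.0 - t);                  // neg(compl1(...)) = tanh(x_n)
    return s ? -yn : yn;                     // mirror back by anti-symmetry
}

int main() {
    const double xs[] = {-2.0, -0.5, 0.0, 0.5, 2.0};
    for (double x : xs)
        std::printf("x=% .2f  reference=% .6f  tanh=% .6f\n",
                    x, fast_tanh_reference(x), std::tanh(x));
    return 0;
}

With the true sigmoid the two columns coincide; with FastSigmoid in place of sigmoid
the result becomes the approximation whose error is quantified in Section 4.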
Since we are also interested in training neural networks, we need an efficient
implementation of the derivative of the hyperbolic tangent as well:

d(tanh(x))/d(x) = 1 − tanh(x)²

Letting y = tanh(x)², we know that 1 − y is always an L1 operation when esbits = 0,
since tanh(x)² always lies in [0,1]. In order to provide an efficient way to compute the
square of the hyperbolic tangent, we can tabulate the square operator for all Posit
values. This approach is inexpensive, since squaring is a unary operator, and it remains
affordable even for Posits with 16 bits.
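A possible shape for such a table is sketched below for the 8-bit case (256 entries),
reusing the posit8_es0_to_double decoder from Section 2; the nearest-value search
used here as the double-to-posit conversion is only a brute-force stand-in for proper
rounding (all names are illustrative, not the cppPosit API).

#include <array>
#include <cstdint>
#include <cmath>

double posit8_es0_to_double(uint8_t bits);      // decoder sketched in Section 2

// Brute-force round-to-nearest conversion, good enough to build a table once.
static uint8_t double_to_posit8(double v) {
    uint8_t best = 0;
    double best_err = INFINITY;
    for (int i = 0; i < 256; ++i) {
        if (i == 0x80) continue;                // skip NaR
        double err = std::fabs(posit8_es0_to_double((uint8_t)i) - v);
        if (err < best_err) { best_err = err; best = (uint8_t)i; }
    }
    return best;
}

// Unary square operator tabulated over all 256 Posit<8,0> patterns.
std::array<uint8_t, 256> build_square_table() {
    std::array<uint8_t, 256> table{};
    for (int i = 0; i < 256; ++i) {
        if (i == 0x80) { table[i] = 0x80; continue; }   // NaR squares to NaR
        double v = posit8_es0_to_double((uint8_t)i);
        table[i] = double_to_posit8(v * v);
    }
    return table;
}

Given y = tanh(x) from the forward pass, the derivative then costs one table lookup
(y²) plus one L1 one's complement (1 − y²), since y² always lies in [0, 1].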
4 Experimental Results
We compared the approximated hyperbolic tangent to the exact version in terms of
execution time and precision. Figure 4 shows the precision comparison, reporting also,
for Posit8 and Posit16, the mean squared error between the approximated and the
exact form (for both types, we used 0 bits of exponent). Figure 5 shows the execution
time comparison over several repetitions; each repetition consists of computing about
60,000 hyperbolic tangents with the approximated formula and with the exact one. As
reported, the precision degradation is in the order of 10⁻³, while the gain in speed is
around a factor of 6. In Figs. 4 and 5, fast appr tanh is the Posit-based implementation,
using L1 operations, of the Tanh function via the FastTanh formula in Eq. (3). It
corresponds to the column labeled FastTanh in Table 2.
Then we tested the approximated hyperbolic tangent as the activation function of the
LeNet-5 convolutional neural network, replacing the exact hyperbolic tangent used in
the original implementation proposed in [10,11] and comparing the results against the
original activation. The network model was trained on the MNIST [11] and Fashion-
MNIST [12] datasets.
Table 2 shows the performance comparison between the two activation functions
(FastTanh and Tanh) on the two datasets. The results obtained with Sigmoid and ReLu
are also reported, since they are widely adopted in the literature as activation functions
for DNNs. In terms of accuracy, the results in Table 2 show that FastTanh outperforms
both ReLu and FastSigmoid (a well-known approximation of the sigmoid function),
which are widely used in the state of the art to implement activation functions in
DNNs.
Fig. 4. Comparison between the exact hyperbolic tangent (True tanh, in blue) and FastTanh (fast
appr. tanh, in black), for Posit<8,0> (top) and Posit<16,0> (bottom). For Posit<8,0> the mean
squared error is 2.816·10⁻³, while for Posit<16,0> it is 2.947·10⁻³.
Fig. 5. Comparison of the execution time of multiple consecutive executions of the exact
hyperbolic tangent (True tanh, in blue) and of FastTanh (fast appr. tanh, in black).
Table 2. Accuracy (%) and inference time (ms) comparison between different activation
functions and different Posit configurations (MNIST and Fashion-MNIST datasets)

MNIST
Activation    FastTanh (this paper)   True Tanh      FastSigmoid [9]   ReLu
              %        ms             %      ms      %       ms        %      ms
Posit16,0     98.5     3.2            98.8   5.28    97.1    3.31      89     2
Posit14,0     98.5     2.9            98.8   4.64    97.1    3.09      89     1.9
Posit12,0     98.5     2.9            98.8   4.66    97.1    3.04      89     1.9
Posit10,0     98.6     2.9            98.7   4.62    96.9    3.08      89     1.9
Posit8,0      98.6     3.01           98.4   4.84    94.2    3.01      88     1.9

FASHION-MNIST
Activation    FastTanh (this paper)   True Tanh      FastSigmoid [9]   ReLu
              %        ms             %      ms      %       ms        %      ms
Posit16,0     89.6     3.4            90.0   5.5     85.2    3.4       85     2.1
Posit14,0     89.6     2.9            90.0   5.0     85.2    3.2       85     1.9
Posit12,0     89.7     2.9            90.0   5.1     85.2    3.1       85     1.9
Posit10,0     89.7     2.9            89.7   5.1     85.1    3.2       85     1.9
Posit8,0      89.6     3.1            89.3   5.2     84.3    3.0       84     1.9
5 Conclusions
In this work we have introduced FastTanh, a fast approximation of the hyperbolic
tangent for numbers represented in the Posit format that uses only L1 operations. We
have used this approximation to speed up the training phase of deep neural networks.
The proposed approximation has been tested on common deep neural network
benchmarks. Using this approximation resulted in a slightly less accurate neural
network with respect to the slower, exact hyperbolic tangent, but with better
performance in terms of inference time. In our experiments, FastTanh also
outperforms both ReLu and FastSigmoid, a well-known approximation of the sigmoid
function, which is a de facto standard activation function in neural networks. We are
now working on deriving fast L1 approximations for other activation functions, such
as the softplus.
Acknowledgements
Work partially supported by H2020 European Project EPI (European Processor
Initiative) and by the Italian Ministry of Education and Research (MIUR) in the
framework of the CrossLab project (Departments of Excellence program), granted to
the Department of Information Engineering of the University of Pisa.
References
1. D. Pedamonti, “Comparison of non-linear activation functions for deep neural networks on
MNIST classification task”, arXiv:1804.02763, 2018
2. V. Nair, G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines”, 27th
International Conference on Machine Learning (ICML), 2010, pp. 807-814
3. U. Köster et al. “Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep
Neural Networks”, NIPS 2017, pp. 1740-1750
4. V. Popescu et al., “Flexpoint: predictive numerics for deep learning”, IEEE Symposium on
Computer Arithmetic, 2018
5. “NVIDIA TURING GPU Architecture, graphics reinvented”, White paper n. WP-09183-
001_v01, pp. 1-80, 2018
6. A. Malossi et al., “The transprecision computing paradigm: concept, design, and
applications”, IEEE DATE 2018, pp. 1105-1110
7. M. Cococcioni, F. Rossi, E. Ruffaldi, S. Saponara, “Novel Arithmetics to Accelerate
Machine Learning Classifiers in Autonomous Driving Applications”, submitted to IEEE
ICECS 2019
8. M. Cococcioni, E. Ruffaldi, S. Saponara, “Exploiting Posit arithmetic for Deep Neural
Networks in Autonomous Driving Applications”, IEEE Automotive 2018
9. J. L. Gustafson and I. T. Yonemoto, “Beating floating point at its own game: Posit
arithmetic,” Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, 2017
10. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to
document recognition”, Proceedings of the IEEE, 1998
11. Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U.
Muller, E. Sackinger, P. Simard, and V. Vapnik, “Comparison of learning algorithms for
handwritten digit recognition”, in International Conference on Artificial Neural Networks,
Paris, F. Fogelman and P. Gallinari, Eds. EC2 and Cie, 1995, pp. 53–60
12. H. Xiao, K. Rasul, R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking
machine learning algorithms”, arXiv:1708.07747, 2017