A New Range-Reduction Algorithm
Nicolas Brisebarre, David Defour, Peter Kornerup, Member, IEEE,
Jean-Michel Muller, Senior Member, IEEE, and Nathalie Revol
Abstract—Range-reduction is a key point for getting accurate elementary function routines. We introduce a new algorithm that is fast
for input arguments belonging to the most common domains, yet accurate over the full double-precision range.
Index Terms—Range-reduction, elementary function evaluation, floating-point arithmetic.
ALGORITHMS for the evaluation of elementary functions give correct results only if the argument is within a given small interval, usually centered at zero. To evaluate an elementary function f(x) for any x, it is necessary to find some "transformation" that makes it possible to deduce f(x) from some value g(x*), where

. x*, called the reduced argument, is deduced from x;
. x* belongs to the convergence domain of the algorithm implemented for the evaluation of g.

In practice, range-reduction needs care for the trigonometric functions. With these functions, x* is equal to x − kC, where k is an integer and C an integer multiple of π/4. Also of potential interest is the case C = ln(2) for the implementation of the exponential function.
A poor range-reduction method may lead to catastrophic accuracy problems when the input argument is large or close to an integer multiple of C. It is easy to understand why a poor range-reduction algorithm gives inaccurate results. The naive method consists of performing the computation x − kC using machine precision. When kC is close to x, almost all the accuracy, if not all, is lost when performing the subtraction x − kC. For instance, if C = π/2 and x = 8248.251512, the correct value of x* is −2.14758367... × 10^(−12) and the corresponding value of k is 5,251. Directly computing x − kπ/2 on a calculator with 10-digit decimal arithmetic (assuming rounding to the nearest and replacing π/2 by the nearest exactly representable number), one gets −1.0 × 10^(−6). Hence, such a poor range-reduction would lead to a computed value of cos(x) equal to −1.0 × 10^(−6), whereas the correct value is −2.14758367... × 10^(−12).
A first solution to overcome the problem consists of
using arbitrary-precision arithmetic, but this may make the
computation much slower. Moreover, it is not that easy to
predict on the fly the precision with which the computation
should be performed.
Most common input arguments to the trigonometric functions are small (say, less than 8) or sometimes medium (say, between 8 and approximately 2^60). They are rarely huge (say, greater than 2^60). We want to design methods that are fast for the frequent cases and accurate in all cases. A rough estimate, based on Sun's fdlibm library, is that the cost of trigonometric range-reduction, when reduction is necessary, is approximately one third of the total function-evaluation cost.
First, we describe Payne and Hanek's method [11], which provides an accurate range-reduction but has the drawback of being fairly expensive in terms of operations; this method is very commonly implemented and is used in the SUN fdlibm library in particular.
To know with which precision the intermediate calculations must be carried out to get an accurate result, one must know the worst cases, that is, the input arguments that are hardest to reduce. Also, to estimate the average performance of the algorithms (and to tune them so that this performance is good), one must have at least a rough estimate of the statistical distribution of the reduced arguments. These two problems are dealt with at the end of this section.
In the second section, we present our algorithm, dedicated to the reduction of small and medium-size arguments. In the third section, we compare our method with some other available methods, which justifies the use of our algorithm for small and medium-size arguments.
1.1 The Payne and Hanek Reduction Algorithm
We assume in this section that we want to perform range-reduction for the trigonometric functions, with C = π/4, and that the convergence domain of the algorithm used for
. N. Brisebarre, J.-M. Muller, and N. Revol are with Laboratoire LIP, École Normale Supérieure de Lyon, 46 Allée d'Italie, 69364 Lyon Cedex 07, France. N. Brisebarre is also with LArAl, Saint-Étienne. E-mail: {Nicolas.Brisebarre, Jean-Michel.Muller, Nathalie.Revol}@
. D. Defour is with Laboratoire LP2A, Université de Perpignan, 52 Avenue Paul Alduy, 66860 Perpignan, France.
. P. Kornerup is with the Department of Mathematics and Computer Science, Southern Danish University, Campusvej 55, DK-5230 Odense, Denmark. E-mail:
Manuscript received 28 Nov. 2003; revised 15 June 2004; accepted 16 Sept. 2004; published online 18 Jan. 2005.
For information on obtaining reprints of this article, please send e-mail to:, and reference IEEECS Log Number TCSI-0233-1103.
0018-9340/05/$20.00 © 2005 IEEE. Published by the IEEE Computer Society.
evaluating the functions contains I = [0, π/4]. An adaptation to other cases is straightforward.
From an input argument x, we want to find the reduced argument x* and an integer k that satisfy

x* = x − k × π/4.    (1)

Once x* is known, it suffices to know k mod 8 to calculate sin(x) or cos(x) from x*. If x is large or if x is very close to a multiple of π/4, the direct use of (1) to determine x* may require the knowledge of 4/π with very large precision, and a cost-expensive multiple-precision computation, if we wish the range-reduction to be accurate.
Now, let us present Payne and Hanek's reduction method [11], [12]. Assume an n-bit mantissa, radix-2 floating-point format (the number of bits n includes the possible hidden bit; for instance, with an IEEE double-precision format, n = 53). Let x be the positive floating-point argument to be reduced and let e be its unbiased exponent, so that

x = X × 2^(e−n+1),

where X is an n-bit integer satisfying 2^(n−1) ≤ X < 2^n. We can assume e ≥ −1 (since, if e < −1, no reduction is necessary). Consider the infinite binary expansion of 4/π and define an integer parameter p used to specify the required accuracy of the range-reduction. Then, rewrite 4/π as

4/π = Left(e,p) × 2^(n−e+2) + (Middle(e,p) + Right(e,p)) × 2^(−n−e−p−1),

where

Left(e,p) is the integer formed by the bits of 4/π of weight at least 2^(n−e+2) (in particular, Left(e,p) = 0 if e < n+2);
Middle(e,p) is the integer formed by the bits of weight 2^(n−e+1), 2^(n−e), 2^(n−e−1), ..., 2^(−n−e−p−1);
Right(e,p), in [0,1), is formed by the remaining bits, of weight 2^(−n−e−p−2), 2^(−n−e−p−3), ....

Fig. 1 shows the splitting of the binary expansion of 4/π. The basic idea of the Payne-Hanek reduction method is to notice that, if p is large enough, Middle(e,p) contains the only bits of 4/π that matter for the range-reduction. Since

x × (4/π) = Left(e,p) × X × 8 + Middle(e,p) × X × 2^(−2n−p) + Right(e,p) × X × 2^(−2n−p),

the number Left(e,p) × X × 8 is a multiple of 8 so that, once multiplied by π/4 (see (1)), it will have no influence on the trigonometric functions. Right(e,p) × X × 2^(−2n−p) is less than 2^(−n−p); therefore, it can be made as small as desired by adequately choosing p. How p is chosen will be explained in Section 2.3.
1.2 Worst Cases
Assume we want the reduced argument to belong to [−C/2, C/2).¹ Define x mod C as the number y ∈ [−C/2, C/2) such that y = x − kC, where k is an integer. There are two important points that must be considered when trying to design accurate yet fast range-reduction algorithms:

. First, what is the "worst case"? That is, what will be the smallest possible absolute value of the reduced argument over all possible inputs in a given format? That value will allow us to immediately deduce the precision with which the reduction must be carried out to make sure that, even for the most difficult cases, the returned result will be accurate enough.

. What is the statistical distribution of the smallest absolute values of the reduced arguments? That is, given a small value ε, what is the probability that the reduced argument will have an absolute value less than ε? This point is important if we want to design algorithms that are fast for the most frequent cases and remain accurate in all cases.
Computing the worst case is rather easy, using an algorithm due to Kahan [4] (a C program that implements the method can be found at ~wkahan/; a Maple program is given in [9]). The algorithm uses continued-fraction theory. For instance, a few minutes of calculation suffice to find the double-precision number between 8 and 2^63 − 1 that is closest to a multiple of π/4. This number is

x = 6411027962775774 × 2^(−48).

The distance between this number and the closest multiple of π/4 is

≈ 3.094903 × 10^(−19) ≈ 0.71 × 2^(−61).

So, if we apply a range-reduction from a double-precision argument in [8, 2^63 − 1] to [−π/4, π/4) and if we wish to get a reduced argument with relative accuracy better than 2^(−ℓ), we must perform the range-reduction with absolute error better than 2^(−61−ℓ).

Also, the double-precision number greater than 8 and less than 710 which is closest to a multiple of ln(2) is

x = 7804143460206699 × 2^(−49).

The distance between this number and the closest multiple of ln(2) is

≈ 1.972015 × 10^(−17) > 2^(−56).

In that case, we considered only numbers less than 710, since exponentials of numbers larger than that are mere overflows in double-precision arithmetic.
1. In practice, we can reduce to an interval of size slightly larger than C to facilitate the reduction.

Fig. 1. The splitting of the digits of 4/π in Payne and Hanek's reduction method.
1.3 Statistical Distribution of the Reduced Arguments
Now, let us turn to the statistical distribution of reduced arguments. We assume that C is a positive fractional multiple of π or ln(2). Let e_min and e_max be two rational integers such that 2^(e_min) ≤ C/2 < 2^(e_min+1) and e_min ≤ e_max. Let p ∈ ℕ be such that 2^(−p+1) ≤ C. Our aim is to estimate the number of floating-point numbers x with n-bit mantissas and exponents between e_min and e_max such that

|x mod C| < 2^(−p),    (2)

where x mod C is defined as the unique number y ∈ [−C/2, +C/2) such that y = x − kC, where k is an integer.

Let E be a rational integer such that e_min ≤ E ≤ e_max. As 2^(−p+1) ≤ C, we have 2^(−p) ≤ C/2 < 2^(e_min+1) ≤ 2^(E+1). Therefore, 2^(−p) ≤ 2^E, i.e., p + E ≥ 0.
We start by estimating the number of floating-point numbers x with n-bit mantissas and exponent E that satisfy (2). Hence, we search for the j ∈ ℕ, 2^(n−1) ≤ j ≤ 2^n − 1, such that the inequality

|kC − j × 2^(E−n+1)| < 2^(−p)    (3)

has solutions in k ∈ ℤ. Such k necessarily satisfy

(2^E − 2^(−p))/C < k < (2^(E+1) − 2^(E−n+1) + 2^(−p))/C,    (4)

since 2^(n−1) ≤ j ≤ 2^n − 1. We note that, as p + E ≥ 0, the left-hand side of (4) is nonnegative. Hence,

max(1, 1 + ⌊(2^E − 2^(−p))/C⌋) ≤ k ≤ ⌊(2^(E+1) − 2^(E−n+1) + 2^(−p))/C⌋,    (5)

and these inequalities are sharp since the upper bound in (4) is irrational and the lower bound is either zero or an irrational number. Denoting by m_E and M_E the lower and upper bounds in (5), the number of possible k is exactly

N_E = M_E − m_E + 1.    (6)

Inequality (3) is equivalent to

kC × 2^(n−1−E) − 2^(n−1−p−E) < j < kC × 2^(n−1−E) + 2^(n−1−p−E).    (7)

Hence, for every k satisfying (5), there are exactly

min(2^n − 1, ⌊kC × 2^(n−1−E) + 2^(n−1−p−E)⌋) − max(2^(n−1), ⌈kC × 2^(n−1−E) − 2^(n−1−p−E)⌉) + 1    (8)

integer solutions j, since the numbers kC × 2^(n−1−E) − 2^(n−1−p−E) and kC × 2^(n−1−E) + 2^(n−1−p−E) are irrational (we saw before that k ≥ 1). Moreover, as 2^(−p+1) ≤ C, if k ≥ m_E + 1, the lower bound in (7) is larger than 2^(n−1) and, if k ≤ M_E − 1, the upper bound in (7) is smaller than 2^n − 1, so that the min and max in (8) can then be dropped.
Now, to analyze (8), we have to distinguish two cases.

First case: 2^(n−1−p−E) ≥ 1/2, i.e., n − E ≥ p. This case is the easy one and (7) yields the conclusion. For every k, m_E + 1 ≤ k ≤ M_E − 1, there are exactly 2^(n−p−E) integer solutions j, since the numbers kC × 2^(n−1−E) − 2^(n−1−p−E) and kC × 2^(n−1−E) + 2^(n−1−p−E) are irrational. When k ∈ {m_E, M_E}, we can only say that there are at least 1 and at most 2^(n−p−E) integer solutions j. Notice that these solutions can easily be enumerated by a program. Therefore, the number of floating-point numbers x with n-bit mantissas and exponent E that satisfy (2) is upper bounded by N_E × 2^(n−p−E) and lower bounded by (N_E − 2) × 2^(n−p−E) + 2.
Second case: 2^(n−1−p−E) < 1/2, i.e., n − E < p. We need results about the uniform distribution of sequences [8] that we briefly recall now.

For a real number x, {x} denotes the fractional part of x, i.e., {x} = x − ⌊x⌋, and ||x|| denotes the distance from x to the nearest integer, namely,

||x|| = min_{m ∈ ℤ} |x − m| = min({x}, 1 − {x}).
Let us recall the following definitions from [8].

Definition 1. Let (x_n), n ≥ 1, be a given sequence of real numbers and let N be a positive integer. For a subset E of [0,1), the counting function A(E; N; (x_n)) is the number of terms x_n, 1 ≤ n ≤ N, for which {x_n} ∈ E. Let y_1, ..., y_N be a finite sequence of real numbers. The quantity

D_N((y_n)) = sup_{0 ≤ a < b ≤ 1} |A([a,b); N; (y_n)) − N(b − a)|

is called the discrepancy of the sequence y_1, ..., y_N. For an infinite sequence (x_n) of real numbers (or for a finite sequence containing at least N terms), D_N((x_n)) is meant to be the discrepancy of the initial segment formed by the first N terms of (x_n).
Thus, in particular, the number of values x_n with 1 ≤ n ≤ N satisfying {x_n} ∈ [a, b), for any 0 ≤ a < b ≤ 1, is bounded from above by N(b − a) + D_N((x_n)). Hence, the number of values kC × 2^(n−1−E), with m_E ≤ k ≤ M_E, that satisfy (7), i.e., that satisfy 0 ≤ {kC × 2^(n−1−E)} < 2^(n−1−p−E) or 1 − 2^(n−1−p−E) < {kC × 2^(n−1−E)} < 1, is bounded from above by N_E × 2^(n−p−E) + 2 D_{N_E}((kC × 2^(n−1−E))).

Definition 2. Let η be a positive real number or infinity. The irrational number α is said to be of type η if η is the supremum of all γ for which lim inf_{q→∞, q ∈ ℕ} q^γ ||qα|| = 0.
Theorem 3.2 from [8, chapter 2] states the following result.

Theorem 1. Let α be of finite type η. Then, for every ε > 0, the discrepancy D_N(u) of u = (nα) satisfies

D_N(u) = O(N^(1 − 1/η + ε)).

Let us apply this theorem to the values of interest for this paper, namely, C = q ln(2) and C = qπ with q ∈ ℚ.

. If C is a nonzero fractional multiple of ln(2): We know from [2] that any nonzero fractional multiple of ln(2) has a type η ≤ 2.9. Thus, the number of floating-point numbers x with n-bit mantissas and exponent E that satisfy (2) is upper bounded by 2^(n−p−E) (N_E + O(N_E^((19/29)+ε))) for every ε > 0.

. If C is a nonzero fractional multiple of π: We know from [3] that any nonzero fractional multiple of π has a type η ≤ 7.02. Hence, the number of floating-point numbers x with n-bit mantissas and exponent E that satisfy (2) is upper bounded by 2^(n−p−E) (N_E + O(N_E^((301/351)+ε))) for every ε > 0.
From this theorem, we can deduce the following result.

Proposition 1. Let C be a positive fractional multiple of π or ln(2). Let e_min and e_max be two rational integers such that 2^(e_min) ≤ C/2 < 2^(e_min+1) and e_min ≤ e_max. Let p ∈ ℕ be such that 2^(−p) ≤ C/2. The number ν_E of floating-point numbers x with n-bit mantissas and exponent E between e_min and e_max that satisfy (2) is as follows:

. (N_E − 2) × 2^(n−p−E) + 2 ≤ ν_E ≤ N_E × 2^(n−p−E) if n − E ≥ p. In that case, ν_E is easily computable by a program;
. ν_E = 2^(n−p−E) (N_E + O(N_E^(θ+ε))) if n − E < p, for every ε > 0, with θ = 19/29 for C a nonzero fractional multiple of ln(2) and θ = 301/351 for C a nonzero fractional multiple of π.

From this proposition, numerous experiments, and a well-known result by Khintchine [5], [6] that states that almost all real numbers are of type 1, we can assume that, for any E, the reduced arguments behave as if they were uniformly distributed in [−C/2, C/2): The expected number of reduced arguments of absolute value less than ε is close to 2ε/C times the number of considered arguments.    (10)

We have checked this result by computing all reduced arguments for some values of n, e_min, and e_max for which this exhaustive computation remains feasible in a reasonable time. Some of the obtained results are given in Figs. 2, 3, and 4. These results show that the estimate provided by (10) is a good one. These estimates will be used at the end of Section 2.3.
Fig. 2. Actual number of reduced arguments of absolute value less than ε and expected number using (10), for various values of ε, in the case C = ln(2), n = 14, e_min = 2, and e_max = 6. Notice that the estimate obtained from (10) is adequate.

Fig. 3. Actual number of reduced arguments of absolute value less than ε and expected number using (10), for various values of ε, in the case C = π/4, n = 18, with e_min = e_max = 5. The estimate given by (10) is adequate.

Fig. 4. Actual number of reduced arguments of absolute value less than ε and expected number using (10), for various values of ε, in the case C = π/4, n = 18, with e_min = e_max = 7. Again, the estimate given by (10) is adequate.
In this section, we assume that we perform range-reduction for the trigonometric functions, with C = π/2. Extension to other values of C (such as a fractional multiple of π, still for the trigonometric functions, or a fractional multiple of ln(2), for the exponential function) is straightforward.
As stated before, our general philosophy is that we must
give results that are:
1. always correct, even for rare cases;
2. computed as quickly as possible for frequent cases.
A way to deal with these requirements is to build a fast algorithm for input arguments with a small exponent and to use a slower yet still accurate algorithm for input arguments with a large exponent.
2.1 Medium-Size Arguments (in [8, 2^63 − 1])
To do so, in the following, we focus on input arguments with a "reasonably small" exponent. More precisely, we assume that the double-precision input argument x has absolute value less than 2^63 − 1. For larger arguments, we assume that Payne and Hanek's method will be used or that x mod C will be computed using multiple-precision arithmetic. For straightforward symmetry reasons, we can assume that x is positive. We also assume that x is greater than or equal to 8. We then proceed as follows:

1. We define I(x) as x rounded to the nearest integer. Then, x is split into its residual part δ(x) = x − I(x) and I(x), which is split into eight 7-bit parts I_i(x), 0 ≤ i ≤ 7, as follows:

I_7(x) = I(2^(−56) x),
I_6(x) = I(2^(−48) (x − 2^56 I_7(x))),
I_5(x) = I(2^(−40) (x − 2^56 I_7(x) − 2^48 I_6(x))),
...
I_0(x) = I(x − Σ_{i=1..7} 2^(8i) I_i(x)),

so that

x = 2^56 I_7(x) + 2^48 I_6(x) + ... + 2^8 I_1(x) + I_0(x) + δ(x).

Note that δ(x) is exactly representable in double precision and that, for x ≥ 2^52, we have δ(x) = 0 and I(x) = x. Also, since x ≥ 8, the last mantissa bit of δ(x) has a weight greater than or equal to 2^(−49).
Important remark. One could get a very similar algorithm, certainly easier to understand, by replacing the values I_k(x) by the values J_k(x) defined as:

J_0(x) contains bits 0 to 7 of I(x);
J_1(x) contains bits 8 to 15 of I(x);
J_2(x) contains bits 16 to 23 of I(x);
J_3(x) contains bits 24 to 31 of I(x);
J_4(x) contains bits 32 to 39 of I(x);
J_5(x) contains bits 40 to 47 of I(x);
J_6(x) contains bits 48 to 55 of I(x);
J_7(x) contains bits 56 to 63 of I(x);

but that would lead to tables twice as large as the ones required by our algorithm. Indeed, the values I_0 up to I_7 are stored on 8 bits each, but the sign bit will not be used and, thus, only 7 bits are necessary to index the tables.
The general idea behind our algorithm is to first compute

S(x) = (I_0(x)) mod π/2 + (2^8 I_1(x)) mod π/2 + (2^16 I_2(x)) mod π/2 + ... + (2^56 I_7(x)) mod π/2 + δ(x).

It holds that x − S(x) is a multiple of π/2 and S(x) will be smaller than x but, in general, S(x) will not be the desired reduced argument: A second, simpler reduction step will be necessary. In practice, the various possible values of (2^(8i) |I_i(x)|) mod π/2 are stored in tables as a sum of two or three floating-point numbers.

As mentioned above, our goal is to always provide correct results, even for the worst case, for which we lose 61 bits of accuracy. Then, we need to store the values ((2^(8i) I_i(x)) mod π/2) with at least

61 (leading zeros) + 53 (nonzero significant bits) + g (extra guard bits) = 114 + g bits.
To reach that precision (with a value of g equal to 39, which will be deduced in the following), all the numbers ((2^(8i) w) mod π/2), which belong to [−1, 1], are stored in tables as the sum of three double-precision numbers:

T_hi(i, w) is the multiple of 2^(−49) that is closest to ((2^(8i) w) mod π/2);
T_med(i, w) is the multiple of 2^(−99) that is closest to ((2^(8i) w) mod π/2) − T_hi(i, w);
T_lo(i, w) is the double-precision number that is closest to ((2^(8i) w) mod π/2) − T_hi(i, w) − T_med(i, w);

where w is a 7-bit nonnegative integer.

Note that T_hi(i, w) = T_med(i, w) = T_lo(i, w) = 0 for w = 0. The three tables T_hi, T_med, and T_lo need 10 address bits. The total amount of memory required by these tables is 3 × 2^10 × 8 = 24 Kbytes. From the definitions, one can easily deduce |T_med(i, w)| ≤ 2^(−50) and |T_lo(i, w)| ≤ 2^(−100). The sum T_hi(i, w) + T_med(i, w) + T_lo(i, w) approximates (2^(8i) w) mod π/2 with 153 bits of precision, which corresponds to g = 39. Computing T_hi, T_med, and T_lo for the 1,024 different possible values of (i, w) allows us to get slightly sharper bounds, given in Table 1.
2. Define

S_hi(x) = δ(x) + Σ_{i=0..7} sign(I_i(x)) T_hi(i, |I_i(x)|).

Its absolute value is bounded by 2π + 1/2, which is less than 8. Since S_hi(x) is a multiple of 2^(−49) and has absolute value less than 8, it is exactly representable in double-precision floating-point arithmetic (it is even representable with 52 bits only). Therefore, with a correctly rounded arithmetic (such as the one provided by any system that complies with the IEEE-754 standard for floating-point arithmetic), it will be exactly computed, without any rounding error. Also, consider

S_med(x) = Σ_{i=0..7} sign(I_i(x)) T_med(i, |I_i(x)|),
S_lo(x) = Σ_{i=0..7} sign(I_i(x)) T_lo(i, |I_i(x)|).
The number S_med(x) is a multiple of 2^(−99) and its absolute value is less than 2^(−47). Hence, it is exactly representable, and exactly computed, in double-precision floating-point arithmetic. |S_lo| is less than 2^(−97) and, if S_lo is computed with round-to-nearest arithmetic as a balanced binary tree of additions:

S_lo(x) = ((t_0 + t_1) + (t_2 + t_3)) + ((t_4 + t_5) + (t_6 + t_7)), where t_i = sign(I_i(x)) T_lo(i, |I_i(x)|),    (11)

then the rounding error is less than 3 × 2^(−151). For each of the values T_lo(i, |I_i(x)|), the fact that it is rounded to the nearest yields an accumulated error (for these eight values) less than 8 × 2^(−154). Thus, the absolute error on S_lo(x) is less than or equal to

8 × 2^(−154) + 3 × 2^(−151) = 2^(−149).

Since S_hi(x) + S_med(x) is exactly computed, the number S(x) = S_hi(x) + S_med(x) + S_lo(x) is equal to x minus an integer multiple of π/2, plus an error bounded by 2^(−149).

And yet, S(x) may not be the final reduced argument since its absolute value may be significantly larger than π/4. We therefore may have to add or subtract a multiple of π/2 from S(x) to get the final result, and straightforward calculations show that this multiple can only be kπ/2 with 1 ≤ k ≤ 4.
2.2 Small Arguments (Smaller than 8)
Define C_hi(k), for k = 1, 2, 3, 4, as the multiple of 2^(−49) that is closest to kπ/2. C_hi(k) is exactly representable as a double-precision number. Define C_med(k) as the multiple of 2^(−99) that is closest to kπ/2 − C_hi(k) and C_lo(k) as the double-precision number that is closest to kπ/2 − C_hi(k) − C_med(k). We now proceed as follows:

. If |S_hi(x)| ≤ π/4, then we define

R_hi(x) = S_hi(x),
R_med(x) = S_med(x),
R_lo(x) = S_lo(x).

. Else, let k_x be such that C_hi(k_x) is closest to |S_hi(x)|. We successively compute:

- If S_hi(x) > 0:

R_hi(x) = S_hi(x) − C_hi(k_x),
R_med(x) = S_med(x) − C_med(k_x),
R_lo(x) = S_lo(x) − C_lo(k_x).

- Else:

R_hi(x) = S_hi(x) + C_hi(k_x),
R_med(x) = S_med(x) + C_med(k_x),
R_lo(x) = S_lo(x) + C_lo(k_x).
Again, R_hi(x) and R_med(x) are exactly representable, and exactly computed, in double-precision arithmetic:

- R_hi(x) has an absolute value less than π/4² and is a multiple of 2^(−49);
- R_med(x) has an absolute value less than 2^(−47) + 2^(−50) and is a multiple of 2^(−99).

|R_lo(x)| is less than 2^(−97) + 2^(−100) and it is computed with an error less than or equal to

2^(−149) + 2^(−150) + 2^(−154) = 49 × 2^(−154),

where:
- 2^(−149) is the error bound on S_lo;
- 2^(−154) bounds the error due to the floating-point representation of C_lo(k_x);
- 2^(−150) bounds the rounding error that occurs when computing S_lo(x) − C_lo(k_x) in round-to-nearest mode.

Therefore, the number R(x) = R_hi(x) + R_med(x) + R_lo(x) is equal to x minus an integer multiple of π/2, plus an error bounded by 49 × 2^(−154) < 2^(−148).
This step is also used (alone, without the previous steps) to reduce small input arguments, less than 8. This allows our algorithm to perform range-reduction for both kinds of arguments, small and medium size. The reduced argument is now stored as the sum of three double-precision numbers, R_hi(x), R_med(x), and R_lo(x). We want to return the reduced argument as the sum of two double-precision numbers (one double-precision number may not suffice if we wish to compute trigonometric functions with very good accuracy). To do that, we will use the Fast2Sum algorithm presented hereafter.

TABLE 1
Maximum Values of T_hi, T_med, and T_lo
2.3 Final Step
We will get the final result of the range-reduction as follows: Let p be an integer parameter, 1 ≤ p ≤ 44, used to specify the required accuracy. This choice comes from the fact that we work in double-precision arithmetic and that, in the most frequent cases, the final relative error will be bounded by 2^(−100+p): To allow an accurate double-precision function result even in the very worst case, we must have a relative error significantly less than 2^(−53). The problem here is only to propagate the possible carry when summing the three components R_hi(x), R_med(x), and R_lo(x). This is performed using floating-point addition and the following result.

Theorem 2 (Fast2Sum algorithm) [7, p. 221, Theorem C]. Let a and b be floating-point numbers, with |a| ≥ |b|. Assume the floating-point arithmetic used provides correctly rounded results with rounding to the nearest. The following algorithm:

s ← a + b;
z ← s − a;
r ← b − z;

computes two floating-point numbers s and r that satisfy:

. s is the floating-point number which is closest to a + b;
. s + r = a + b exactly.
We now consider the different possible cases:

. If |R_hi(x)| > 2^(−p), then, since |R_med(x)| < 2^(−47) + 2^(−50), the reduced argument will be close to R_hi(x). In that case, we first compute

t_med(x) = R_med(x) + R_lo(x).

The error on t_med(x) is bounded by the former error on R_lo(x) plus the rounding error due to the addition. Assuming rounding to nearest, this last error is less than or equal to 2^(−100). Hence, the error on t_med(x) is less than or equal to 2^(−100) + 2^(−148). Then, we perform (without rounding error)

(y_hi, y_lo) = Fast2Sum(R_hi(x), t_med(x)).

After that, the two floating-point numbers (y_hi, y_lo) represent the reduced argument with an absolute error bounded by 2^(−100) + 2^(−148) ≈ 2^(−100). Hence, the relative error on the reduced argument will be bounded by a value very close to 2^(−100+p).
. If R_hi(x) = 0, then we perform

(y_hi, y_lo) = Fast2Sum(R_med(x), R_lo(x)).

After that, since the absolute value of the reduced argument is always larger than 0.71 × 2^(−61), the two floating-point numbers (y_hi, y_lo) represent the reduced argument with a relative error smaller than

(49 × 2^(−154)) / (0.71 × 2^(−61)) < 2^(−86).
. If 0 < |R_hi(x)| ≤ 2^(−p), then, since the absolute value of the reduced argument is always larger than 0.71 × 2^(−61) and since |R_lo(x)| < 2^(−97) + 2^(−100), most of the information on the reduced argument is in R_hi(x) and R_med(x). We first perform

(y_hi, t_med) = Fast2Sum(R_hi(x), R_med(x)).

Let k be the integer satisfying 2^k ≤ |y_hi| < 2^(k+1). We easily find

|t_med| ≤ 2^(k−53).

After that, we compute

y_lo = t_med + R_lo(x).

The rounding error due to this addition is bounded by 2^(k−107). Hence, the two floating-point numbers (y_hi, y_lo) represent the reduced argument with an absolute error smaller than

49 × 2^(−154) + max(2^(k−107), 2^(−150)).

Therefore, (y_hi, y_lo) represent the reduced argument with a relative error better than

49 × 2^(−154−k) + max(2^(−107), 2^(−150−k)),

which is less than 2^(−87) since the absolute value of the reduced argument is larger than 0.71 × 2^(−61), which implies 2^k ≥ 2^(−61).
A first solution is to try to make the various error bounds equal. This is done by choosing p = 14. By doing that, in the worst case, the bound on the relative error will be 2^(−86), which is quite good. We should notice that, in this case, assuming (10) with C = π/2, the probability that |R_hi(x)| will be less than 2^(−p) is around 7.8 × 10^(−5).

A possibly better solution is to make the most frequent case (i.e., |R_hi(x)| > 2^(−p)) more accurate and to assume that a more accurate yet slower algorithm is used in the other cases (an easy solution is to split the variables into four floating-point values instead of three as we did here). This is done by using a somewhat smaller value of p. For instance, with p = 10 and C = π/2, still assuming (10), the probability that |R_hi(x)| < 2^(−p) is around 1.25 × 10^(−3). In the most frequent case (|R_hi(x)| ≥ 2^(−p)), the error bound on the computed reduced argument will be 2^(−90). Due to its low probability, the other case can be processed with an algorithm a hundred times slower without significantly changing the average time of computation, cf. Amdahl's law.
2.4 The Algorithm
We can now sketch the complete algorithm:

Algorithm Range-Reduction:
Input: A double-precision floating-point number x > 0 and an integer p > 0 specifying the required precision in bits.
Output: The reduced argument y given as the sum of two double-precision floating-point numbers y_hi and y_lo, such that −π/4 ≤ y < π/4 and y = x − kπ/2, within an error given in the analysis of Section 2.3, for some integer k.

if x ≥ 2^63 − 1 then
  {Apply the method of Payne and Hanek.}
else if x ≤ 8 then
  S_hi ← x; S_med ← 0; S_lo ← 0;
else
  I ← round(x); δ ← x − I;
  S_hi ← δ; S_med ← 0; S_lo ← 0;
  i ← 7; j ← 56;
  while i ≥ 0 do
    w ← round(I × 2^(−j));
    S_hi ← S_hi + sign(w) T_hi(i, |w|);
    S_med ← S_med + sign(w) T_med(i, |w|);
    I ← I − w × 2^j; i ← i − 1; j ← j − 8;
  S_lo ← Σ_{i=0..7} sign(w_i) T_lo(i, |w_i|), where w_i is the value of w at iteration i, summed as in (11);
if |S_hi| > π/4 then
  k ← Reduce(|S_hi|); s ← sign(S_hi);
  S_hi ← S_hi − s C_hi(k);
  S_med ← S_med − s C_med(k);
  S_lo ← S_lo − s C_lo(k);
if |S_hi| > 2^(−p) then
  temp ← S_med + S_lo;
  (y_hi, y_lo) ← Fast2Sum(S_hi, temp);
else if S_hi = 0 then
  (y_hi, y_lo) ← Fast2Sum(S_med, S_lo);
else
  (y_hi, temp) ← Fast2Sum(S_hi, S_med);
  y_lo ← temp + S_lo.

Where: The function Reduce(|S_hi|) chooses the appropriate multiple k of π/2, represented as the triple (C_hi(k), C_med(k), C_lo(k)).
In this section, we compare our method to other algorithms on the same input range [8, 2^63 − 1]: Payne and Hanek's method (see Section 1.1) and the Modular range-reduction method described in [1]. Concerning Payne and Hanek's method, we used the version of the algorithm used by Sun Microsystems [10]. We chose as criteria for the evaluation of the algorithms the table size, the number of table accesses, and the number of floating-point multiplications, divisions, and additions.
Table 2 shows the potential advantages of our algorithm for small and medium-sized input arguments. Payne and Hanek's method over that range does not need much memory, but requires roughly three times as many operations. The Modular range-reduction method has the same characteristics as Payne and Hanek's method concerning the table size needed and the number of elementary operations involved, but requires more table accesses. Our algorithm is thus a good compromise between table size and number of operations for range-reduction of medium-sized arguments.
To get more accurate figures than by just counting the operations, we have implemented this algorithm in ANSI C. The program can be downloaded from http://gala.univ-. This implementation shows that our algorithm is 4 to 5 times faster than the Sun implementation of Payne and Hanek's algorithm, depending on the required final precision, provided that the tables are in main memory (which will be true when the trigonometric functions are frequently called in a numerical program; when they are not frequently called, the speed of range-reduction is no longer an issue). Our algorithm is thus a good compromise between table size and delay for range-reduction of small and medium-sized arguments.
A variant of our algorithm would consist of first computing S_hi, S_med and R_hi, R_med only. Then, during the fourth step of the algorithm, if the accuracy does not suffice, compute T_lo and R_lo. This slight modification can reduce the number of elementary operations in the (most frequent) cases where no extra accuracy is needed. We can also reduce the table size by 4 Kbytes by storing the T_lo values in single precision only, instead of using double precision.

Another variant (that can be useful depending on the processor and compiler) would be to replace the loop "while i ≥ 0" with "while I ≠ 0 and i ≥ 0." In that case (for a medium-sized argument x), the number N of double-precision floating-point operations becomes at most N = 17 + 2⌈log_256(x)⌉, i.e., 19 ≤ N ≤ 33. Also, the number of table accesses becomes at most 11 + 2⌈log_256(x)⌉.
We have presented an algorithm for accurate range-reduction of input arguments with absolute value less than 2^63 − 1. This table-based algorithm gives accurate results for the most frequent cases. In order to cover the whole double-precision domain for input arguments, we suggest using Payne and Hanek's algorithm for huge arguments. A major drawback of our method lies in the table size needed; thus, a future effort will be to reduce the table size while keeping a good tradeoff between speed and accuracy.

TABLE 2
Comparison of Our Algorithm with Payne and Hanek's Algorithm and the Modular Range-Reduction Algorithm

2. In fact, the absolute value of the reduced argument is less than π/4 plus the largest possible value of |S_med + S_lo|, hence, less than π/4 + 2^(−47) + 2^(−97). In practice, this has no influence on the elementary function algorithms.
[1] M. Daumas, C. Mazenc, X. Merrheim, and J.M. Muller, “Modular
Range-Reduction: A New Algorithm for Fast and Accurate
Computation of the Elementary Functions,” J. Universal Computer
Science, vol. 1, no. 3, pp. 162-175, Mar. 1995.
[2] M. Hata, “Legendre Type Polynomials and Rationality Measures,”
J. reine angew. Math., vol. 407, pp. 99-125, 1990.
[3] M. Hata, "Rational Approximations to π and Some Other Numbers," Acta Arithmetica, vol. 63, no. 4, pp. 335-349, 1993.
[4] W. Kahan, "Minimizing q*m-n," at the beginning of the file "nearpi.c," available at ~wkahan/, 1983.
[5] A. Ya. Khintchine, "Einige Sätze über Kettenbrüche, mit Anwendungen auf die Theorie der diophantischen Approximationen," Math. Ann., vol. 92, pp. 115-125, 1924.
[6] A. Ya Khintchine, Continued Fractions. Chicago, London: Univ. of
Chicago Press, 1964.
[7] D. Knuth, The Art of Computer Programming, vol. 2. Reading, Mass.:
Addison Wesley, 1973.
[8] L. Kuipers and H. Niederreiter, Uniform Distribution of Sequences.
Pure and Applied Mathematics. New York-London-Sydney:
Wiley-Interscience (John Wiley & Sons), 1974.
[9] J.-M. Muller, Elementary Functions, Algorithms and Implementation. Boston: Birkhäuser, 1997.
[10] K.C. Ng, "Argument Reduction for Huge Arguments: Good to the Last Bit," technical report, SunPro, 1992. http://www.validlab.com/arg.pdf.
[11] M. Payne and R. Hanek, “Radian Reduction for Trigonometric
Functions,” SIGNUM Newsletter, vol. 18, pp. 19-24, 1983.
[12] R.A. Smith, “A Continued-Fraction Analysis of Trigonometric
Argument Reduction,” IEEE Trans. Computers, vol. 44, no. 11,
pp. 1348-1351, Nov. 1995.
Nicolas Brisebarre received the PhD degree in
pure mathematics from the Université Bordeaux I,
France, in 1998. He has been Maître de
conférences (associate professor) in pure
mathematics at LArAl, Université de Saint-
Étienne, France, since 1999. He is currently on
sabbatical leave at INRIA, France, within the
Arenaire Project, LIP, École Normale Supérieure
de Lyon. His research interests are in computer
arithmetic and number theory.
David Defour received the PhD degree in
computer science from the École Normale
Supérieure de Lyon, Lyon, France, in 2003.
He has been an assistant professor of computer
science at the University of Perpignan,
France, since September 2004. His research
interests are in computer arithmetic and
computer architecture.
Peter Kornerup received the mag.scient. de-
gree in mathematics from Aarhus University,
Denmark, in 1967. After a period with the
University Computing Center, starting in 1969
he was involved in establishing the computer
science curriculum at Aarhus University, where
he helped found the Computer Science Depart-
ment in 1971. Through most of the 1970s and
1980s he served as chairman of that department.
Since 1988, he has been a professor of
computer science at Odense University, now the University of Southern
Denmark, where he also served a period as the chairman of the
department. He spent a leave during 1975-1976 with the University of
Southwestern Louisiana, Lafayette, four months in 1979 and shorter
stays in many years with Southern Methodist University, Dallas, Texas,
one month with the Université de Provence in Marseille in 1996 and two
months with the École Normale Supérieure de Lyon in 2001. His
interests include compiler construction, microprogramming, computer
networks, and computer architecture, but, in particular, his research has
been in computer arithmetic and number representations, with applica-
tions in cryptology and digital signal processing. He has served on the
program committees for numerous IEEE, ACM, and other meetings, in
particular, on the program committees for the fourth through the 16th
IEEE Symposia on Computer Arithmetic and served as program cochair
for these symposia in 1983, 1991, and 1999. He has been a guest editor
for a number of journal special issues and served as an associate editor
of the IEEE Transactions on Computers from 1991-1995. He is a
member of the IEEE.
Jean-Michel Muller received the PhD degree in
1985 from the Institut National Polytechnique de
Grenoble. He is Directeur de Recherches
(senior researcher) at CNRS, France, and he
is head of the LIP laboratory (LIP is a joint
laboratory of CNRS, École Normale Supérieure
de Lyon, INRIA, and the Université Claude
Bernard Lyon 1). His research interests are in
computer arithmetic. He was program cochair of
the 13th IEEE Symposium on Computer Arithmetic
(Asilomar, USA, June 1997) and general chair of the 14th IEEE
Symposium on Computer Arithmetic (Adelaide, Australia, April 1999).
He served as an associate editor of the IEEE Transactions on
Computers from 1996 to 2000. He is a senior member of the IEEE
and the IEEE Computer Society.
Nathalie Revol received the PhD degree in
applied mathematics from the Institut National
Polytechnique de Grenoble, France, in 1994.
She was an associate professor in applied
mathematics in the laboratory ANO of the
University of Lille, France, from 1996 to 2002.
She is currently a research scientist at INRIA,
France, within the Arenaire Project, LIP, École
Normale Supérieure de Lyon. Her research
interests focus on computer arithmetic and,
especially, arbitrary precision interval arithmetic: library and algorithms.