A New Range-Reduction Algorithm
Nicolas Brisebarre, David Defour, Peter Kornerup, Member, IEEE,
Jean-Michel Muller, Senior Member, IEEE, and Nathalie Revol
Abstract—Range-reduction is a key point for getting accurate elementary function routines. We introduce a new algorithm that is fast
for input arguments belonging to the most common domains, yet accurate over the full double-precision range.
Index Terms—Range-reduction, elementary function evaluation, floating-point arithmetic.
1 INTRODUCTION
Algorithms for the evaluation of elementary functions give correct results only if the argument is within a given small interval, usually centered at zero. To evaluate an elementary function $f(x)$ for any $x$, it is necessary to find some "transformation" that makes it possible to deduce $f(x)$ from some value $g(x^*)$, where
• $x^*$, called the reduced argument, is deduced from $x$;
• $x^*$ belongs to the convergence domain of the algorithm implemented for the evaluation of $g$.
In practice, range-reduction needs care for the trigonometric functions. With these functions, $x^*$ is equal to $x - kC$, where $k$ is an integer and $C$ an integer multiple of $\pi/4$. Also of potential interest is the case $C = \ln(2)$ for the implementation of the exponential function.
A poor range-reduction method may lead to catastrophic accuracy problems when the input argument is large or close to an integer multiple of $C$. It is easy to understand why a poor range-reduction algorithm gives inaccurate results. The naive method consists of performing the computations
$$k = \left\lfloor \frac{x}{C} \right\rfloor, \qquad x^* = x - kC,$$
using machine precision. When $kC$ is close to $x$, almost all the accuracy, if not all, is lost when performing the subtraction $x - kC$. For instance, if $C = \pi/2$ and $x = 8248.251512$, the correct value of $x^*$ is $-2.14758367\cdots \times 10^{-12}$, and the corresponding value of $k$ is $5{,}251$. Directly computing $x - k\pi/2$ on a calculator with 10-digit decimal arithmetic (assuming rounding to the nearest and replacing $\pi/2$ by the nearest exactly representable number) gives $-1.0 \times 10^{-6}$. Hence, such a poor range-reduction would lead to a computed value of $\cos(x)$ equal to $-1.0 \times 10^{-6}$, whereas the correct value is $-2.14758367\cdots \times 10^{-12}$.
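The 10-digit calculator example can be replayed with Python's `decimal` module. This is only an illustrative sketch: the helper name `naive_reduce` is hypothetical, and $k$ is obtained here by rounding $x/C$ to the nearest integer (an assumption made so that $k = 5251$, the value quoted in the text).

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# 50 significant digits of pi, far more than this illustration needs.
PI = Decimal("3.1415926535897932384626433832795028841971693993751")

def naive_reduce(x, digits):
    """Reduce x modulo pi/2 with every operation rounded to `digits` digits."""
    ctx = Context(prec=digits, rounding=ROUND_HALF_EVEN)
    c = ctx.divide(PI, Decimal(2))          # pi/2 rounded to the working precision
    k = int(ctx.divide(x, c).to_integral_value(rounding=ROUND_HALF_EVEN))
    return k, ctx.subtract(x, ctx.multiply(Decimal(k), c))

x = Decimal("8248.251512")
k, naive = naive_reduce(x, 10)       # calculator-like 10-digit arithmetic
_, accurate = naive_reduce(x, 40)    # same formula, 40-digit arithmetic
print(k, naive, accurate)
```

The same formula gives a usable result only when the working precision exceeds the number of digits cancelled in the subtraction, which is the difficulty the rest of the paper addresses.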
A first solution to overcome the problem consists of
using arbitrary-precision arithmetic, but this may make the
computation much slower. Moreover, it is not that easy to
predict on the fly the precision with which the computation
should be performed.
Most common input arguments to the trigonometric functions are small (say, less than $8$) or sometimes medium (say, between $8$ and approximately $2^{60}$). They are rarely huge (say, greater than $2^{60}$). We want to design methods that are fast for the frequent cases, and accurate for all cases.
A rough estimate, based on SUN fdlibm library, is that the
cost of trigonometric range-reduction—when reduction is
necessary—is approximately one third of the total function
evaluation cost.
First, we describe Payne and Hanek's method, which provides an accurate range-reduction but has the drawback of being fairly expensive in terms of operations. This method is very commonly implemented; it is used in the SUN fdlibm library in particular.
To know with which precision the intermediate calculations must be carried out to get an accurate result, one must know the worst cases, that is, the input arguments that are hardest to reduce. Also, to estimate the average performance of the algorithms (and to tune them so that this performance is good), one must have at least a rough estimate of the statistical distribution of the reduced arguments. These two problems are dealt with at the end of this section.
In the second section, we present our algorithm
dedicated to the reduction of small and medium size
arguments. In the third section, we compare our method
with some other available methods, which justifies the use
of our algorithm for small and medium size arguments.
1.1 The Payne and Hanek Reduction Algorithm
We assume in this section that we want to perform range-reduction for the trigonometric functions, with $C = \pi/4$, and that the convergence domain of the algorithm used for
evaluating the functions contains
1
tion to other cases is straightforward.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 3, MARCH 2005
• N. Brisebarre, J.-M. Muller, and N. Revol are with Laboratoire LIP (CNRS, ENSL, INRIA, UCBL), École Normale Supérieure de Lyon, 46 allée d'Italie, 69364 Lyon Cedex 07, France. N. Brisebarre is also with LArAI, Saint-Etienne. E-mail: {Nicolas.Brisebarre, Jean-Michel.Muller, Nathalie.Revol}@ens.lyon.fr.
• D. Defour is with Laboratoire LP2A, Université de Perpignan, 52 Avenue Paul Alduy, 66860 Perpignan, France. E-mail: david.defour@univ-perp.fr.
• P. Kornerup is with the Department of Mathematics and Computer Science, Southern Danish University, Odense Campusvej 55 DK-5230
Manuscript received 28 Nov. 2003; revised 15 June 2004; accepted 16 Sept. 2004; published online 18 Jan. 2005.
tc@computer.org, and reference IEEECS Log Number TCSI-0233-1103.
From an input argument $x$, we want to find the reduced argument $x^*$ and an integer $k$ that satisfy:
$$k = \left\lfloor \frac{4}{\pi} x \right\rfloor, \qquad x^* = \frac{\pi}{4}\left(\frac{4}{\pi} x - k\right). \tag{1}$$
Once $x^*$ is known, it suffices to know $k \bmod 8$ to calculate $\sin(x)$ or $\cos(x)$ from $x^*$. If $x$ is large or if $x$ is very close to a multiple of $\pi/4$, the direct use of (1) to determine $x^*$ may require the knowledge of $4/\pi$ with very large precision and a costly multiple-precision computation if we wish the range-reduction to be accurate.
Now, let us present Payne and Hanek's reduction method. Assume an $n$-bit mantissa, radix 2 floating-point format (the number of bits $n$ includes the possible hidden bit; for instance, with an IEEE double-precision format, $n = 53$). Let $x$ be the positive floating-point argument to be reduced and let $e$ be its unbiased exponent, so
$$x = X \times 2^{e-n+1},$$
where $X$ is an $n$-bit integer satisfying $2^{n-1} \leq X < 2^{n}$. We can assume $e \geq -1$ (since, if $e < -1$, no reduction is necessary). Let
$$\alpha_0 . \alpha_{-1}\alpha_{-2}\alpha_{-3}\alpha_{-4}\alpha_{-5}\ldots$$
be the infinite binary expansion of $\alpha = 4/\pi$ and define an integer parameter $p$ used to specify the required accuracy of the range-reduction. Then, rewrite $\alpha = 4/\pi$ as
$$\mathrm{Left}(e,p) \times 2^{n-e+2} + \bigl(\mathrm{Middle}(e,p) + \mathrm{Right}(e,p)\bigr) \times 2^{-n-e-1-p},$$
where
$$\begin{cases}
\mathrm{Left}(e,p) = \begin{cases} 0 & \text{if } e < n+2, \\ \alpha_0\alpha_{-1}\cdots\alpha_{n-e+2} & \text{otherwise;} \end{cases} \\
\mathrm{Middle}(e,p) = \alpha_{n-e+1}\alpha_{n-e}\cdots\alpha_{-n-e-1-p}; \\
\mathrm{Right}(e,p) = 0.\alpha_{-n-e-2-p}\alpha_{-n-e-3-p}\cdots.
\end{cases}$$
Fig. 1 shows the splitting of the binary expansion of $\alpha$.
The basic idea of the Payne-Hanek reduction method is to notice that, if $p$ is large enough, $\mathrm{Middle}(e,p)$ contains the only bits of $\alpha = 4/\pi$ that matter for the range-reduction. Since
$$\frac{4}{\pi} x = \mathrm{Left}(e,p) \times X \times 8 + \mathrm{Middle}(e,p) \times X \times 2^{-2n-p} + \mathrm{Right}(e,p) \times X \times 2^{-2n-p},$$
the number $\mathrm{Left}(e,p) \times X \times 8$ is a multiple of $8$ so that, once multiplied by $\pi/4$ (see (1)), it will have no influence on the trigonometric functions. $\mathrm{Right}(e,p) \times X \times 2^{-2n-p}$ is less than $2^{-n-p}$; therefore, it can be made as small as desired by choosing $p$ large enough. How $p$ is chosen will be explained in Section 2.3.
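The splitting can be checked in exact rational arithmetic. The sketch below is illustrative only: the function names `split_alpha` and `reduce_mod8` are hypothetical, $4/\pi$ is derived from a hard-coded 100-digit value of $\pi$ (enough for the bit positions used here), and the test value $x = 10^{22}$ and $p = 20$ are arbitrary choices.

```python
from fractions import Fraction
import math

# 100 decimal digits of pi; 4/pi is then known to roughly 330 bits.
PI = Fraction("3.1415926535897932384626433832795028841971693993751"
              "058209749445923078164062862089986280348253421170679")
ALPHA = 4 / PI

def split_alpha(x, p, n=53):
    """Return X, e, Left, Middle, Right for a positive double x = X * 2**(e-n+1)."""
    m, ex = math.frexp(x)             # x = m * 2**ex with 0.5 <= m < 1
    X, e = int(m * 2**n), ex - 1      # n-bit integer significand, unbiased exponent
    width = 2 * n + p + 3             # number of bits kept in Middle
    scaled = ALPHA * Fraction(2) ** (n + e + 1 + p)
    bits = math.floor(scaled)         # alpha truncated after the last Middle bit
    left, middle = bits >> width, bits % (1 << width)
    right = scaled - bits             # discarded tail, 0 <= right < 1
    return X, e, left, middle, right

def reduce_mod8(x, p, n=53):
    """Approximate (4/pi)*x mod 8 from the Middle bits only."""
    X, e, left, middle, right = split_alpha(x, p, n)
    return Fraction(middle * X, 2 ** (2 * n + p)) % 8

x, p = 1.0e22, 20
approx = reduce_mod8(x, p)
exact = (ALPHA * Fraction(x)) % 8
print(float(approx), float((exact - approx) % 8))
```

The printed difference is the contribution of the Right part, which stays below $2^{-n-p}$; the Left part disappears entirely in the reduction modulo 8.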
1.2 Worst Cases
Assume we want the reduced argument to belong to $[-C/2, C/2)$. Define $x \bmod C$ as the number $y \in [-C/2, C/2)$ such that $y = x - kC$, where $k$ is an integer. There are two important points that must be considered when trying to design accurate yet fast range-reduction algorithms.
• First, what is the "worst case"? That is, what will be the smallest possible absolute value of the reduced argument for all possible inputs in a given format. That value will allow us to immediately deduce the precision with which the reduction must be carried out to make sure that, even for the most difficult cases, the returned result will be accurate enough.
• What is the statistical distribution of the smallest absolute values of the reduced arguments? That is, given a small value $\epsilon$, what is the probability that the reduced argument will have an absolute value less than $\epsilon$? This point is important if we want to design algorithms that are fast for the most frequent cases and remain accurate on all cases.
Computing the worst case is rather easy, using an algorithm due to Kahan [4] (a C program that implements the method can be found at http://http.cs.berkeley.edu/~wkahan/. A Maple program is given in ). The algorithm uses the continued-fraction theory. For instance, a few minutes of calculation suffice to find the double-precision number between $8$ and $2^{63} - 1$ that is closest to a multiple of $\pi/4$. This number is:
$$\Gamma_{\pi/4} = 6411027962775774 \times 2^{-48} \approx 22.776546738526000979.$$
The distance between $\Gamma_{\pi/4}$ and the closest multiple of $\pi/4$ is
$$\epsilon_{\pi/4} \approx 3.094903 \times 10^{-19} \approx 0.71 \times 2^{-61}.$$
So, if we apply a range-reduction from a double-precision argument in $[8, 2^{63} - 1]$ to $[-\pi/4, \pi/4)$ and if we wish to get a reduced argument with relative accuracy better than $2^{-\mu}$, we must perform the range reduction with absolute error better than $2^{-\mu-61}$.
Also, the double-precision number greater than $8$ and less than $710$ which is closest to a multiple of $\ln(2)$ is:
$$\Gamma_{\ln(2)} = 7804143460206699 \times 2^{-49} \approx 13.8629436111989061.$$
The distance between $\Gamma_{\ln(2)}$ and the closest multiple of $\ln(2)$ is
$$\epsilon_{\ln(2)} \approx 1.972015 \times 10^{-17} > 2^{-56}.$$
In that case, we considered only numbers less than $710$ since exponentials of numbers larger than that are mere overflows in double-precision arithmetic.
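The quoted worst case for $\pi/4$ can be re-checked with exact rationals. This is a verification sketch, not Kahan's continued-fraction search: the helper `distance_to_multiple` is a hypothetical name, and $\pi$ is taken from a hard-coded 50-digit constant.

```python
from fractions import Fraction

# 50 significant digits of pi, ample for a distance near 1e-19.
PI = Fraction("3.1415926535897932384626433832795028841971693993751")

def distance_to_multiple(x, c):
    """Return (k, x - k*c) where k*c is the multiple of c nearest to x."""
    k = round(x / c)
    return k, x - k * c

gamma = Fraction(6411027962775774, 2**48)   # worst case quoted in the text
k, eps = distance_to_multiple(gamma, PI / 4)
print(k, float(gamma), float(eps), float(eps * 2**61))
```

The last printed value is the distance expressed in units of $2^{-61}$, which is how the 61 lost bits used later in the table design arise.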
1. In practice, we can reduce to an interval of size slightly larger than C
to facilitate the reduction.
Fig. 1. The splitting of digits of $4/\pi$ in Payne and Hanek's reduction method.
1.3 Statistical Distribution of the Reduced
Arguments
Now, let us turn to the statistical distribution of reduced
arguments.
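We assume that $C$ is a positive fractional multiple of $\pi$ or $\ln(2)$. Let $e_{\min}$ and $e_{\max}$ be two rational integers such that $2^{e_{\min}} \leq C/2 < 2^{e_{\min}+1}$ and $e_{\min} \leq e_{\max}$.
Let $p \in \mathbb{N}$ such that $2^{-p+1} \leq C$. Our aim is to estimate the number of floating-point numbers $x$ with $n$-bit mantissas and exponents between $e_{\min}$ and $e_{\max}$ such that
$$|x \bmod C| < 2^{-p}, \tag{2}$$
where $x \bmod C$ is defined as the unique number $y \in [-C/2, +C/2)$ such that $y = x - kC$, where $k$ is an integer.
Let $E$ be a rational integer such that $e_{\min} \leq E \leq e_{\max}$. As $2^{-p+1} \leq C$, we have $2^{-p} < 2^{e_{\min}+1} \leq 2^{E+1}$. Therefore, $2^{-p} \leq 2^{E}$, i.e., $p + E \geq 0$.
We start by estimating the number of floating-point numbers $x$ with $n$-bit mantissas and exponent $E$ that satisfy (2). Hence, we search for the $j \in \mathbb{N}$, $2^{n-1} \leq j \leq 2^{n} - 1$, such that the inequality
$$\left| kC - \frac{j}{2^{n-1}} 2^{E} \right| < 2^{-p} \tag{3}$$
has solutions in $k \in \mathbb{Z}$. Such $k$ necessarily satisfy
$$\frac{1}{C}\left(-\frac{1}{2^{p}} + \frac{j}{2^{n-1}} 2^{E}\right) < k < \frac{1}{C}\left(\frac{1}{2^{p}} + \frac{j}{2^{n-1}} 2^{E}\right). \tag{4}$$
We note that, as $p + E \geq 0$ and $j \geq 2^{n-1}$, the left-hand side of (4) is nonnegative. Hence,
$$\underbrace{\max\left(1, \left\lceil \frac{1}{C}\left(-\frac{1}{2^{p}} + 2^{E}\right) \right\rceil\right)}_{m_E} \leq k \leq \underbrace{\left\lfloor \frac{1}{C}\left(\frac{1}{2^{p}} + 2^{E+1} - \frac{2^{E}}{2^{n-1}}\right) \right\rfloor}_{M_E} \tag{5}$$
since $2^{n-1} \leq j \leq 2^{n} - 1$ and these inequalities are sharp since the upper bound in (4) is irrational and the lower bound is either zero or an irrational number. The number of possible $k$ is exactly
$$N_E = M_E - m_E + 1. \tag{6}$$
Inequality (3) is equivalent to
$$\left| kC2^{n-1-E} - j \right| < 2^{n-1-p-E}. \tag{7}$$
Hence, for every $k$ satisfying (5), there are exactly
$$\min\left(2^{n} - 1, \left\lfloor kC2^{n-1-E} + 2^{n-1-p-E} \right\rfloor\right) - \max\left(2^{n-1}, \left\lceil kC2^{n-1-E} - 2^{n-1-p-E} \right\rceil\right) + 1 \tag{8}$$
integers $j$ solutions since the numbers $kC2^{n-1-E} - 2^{n-1-p-E}$ and $kC2^{n-1-E} + 2^{n-1-p-E}$ are irrational (we saw before that $k \neq 0$).
As $2^{-p+1} \leq C$, if $k \geq m_E + 1$, we have
$$2^{n-1} \leq \left\lceil kC2^{n-1-E} - 2^{n-1-p-E} \right\rceil$$
and, if $k \leq M_E - 1$, we have
$$2^{n} - 1 \geq \left\lfloor kC2^{n-1-E} + 2^{n-1-p-E} \right\rfloor.$$
Now, to analyze (8), we have to distinguish two cases.
First case: $2^{n-1-p-E} \geq 1/2$, i.e., $n - E \geq p$.
This case is the easy one and (7) yields the conclusion. For every $k$, $m_E + 1 \leq k \leq M_E - 1$, there are exactly $2^{n-p-E}$ integer solutions $j$ since the numbers $kC2^{n-1-E} - 2^{n-1-p-E}$ and $kC2^{n-1-E} + 2^{n-1-p-E}$ are irrational. When $k \in \{m_E, M_E\}$, we can only say that there are at least $1$ and at most $2^{n-p-E}$ integer solutions $j$. Notice that these solutions can easily be enumerated by a program. Therefore, the number of floating-point numbers $x$ with $n$-bit mantissas and exponent $E$ that satisfy (2) is upper bounded by $N_E 2^{n-p-E}$, and lower bounded by $(N_E - 2) 2^{n-p-E} + 2$.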
Second case: $2^{n-1-p-E} < 1/2$, i.e., $n - E < p$.
We need results about uniform distribution of sequences [8] that we briefly recall now.
For a real number $x$, $\{x\}$ denotes the fractional part of $x$, i.e., $\{x\} = x - \lfloor x \rfloor$, and $\|x\|$ denotes the distance from $x$ to the nearest integer, namely,
$$\|x\| = \min_{n \in \mathbb{Z}} |x - n| = \min(\{x\}, 1 - \{x\}).$$
Let us recall the following definitions from [8].
Definition 1. Let $(x_n)_{n \geq 1}$ be a given sequence of real numbers. Let $N$ be a positive integer. For a subset $E$ of $[0, 1)$, the counting function $A(E; N; (x_n))$ is the number of terms $x_n$, $1 \leq n \leq N$, for which $\{x_n\} \in E$.
Let $y_1, \ldots, y_N$ be a finite sequence of real numbers. The number
$$D_N((y_n)) = \sup_{0 \leq a < b \leq 1} \left| \frac{A([a, b); N; (y_n))}{N} - (b - a) \right|$$
is called the discrepancy of the sequence $y_1, \ldots, y_N$. For an infinite sequence $(x_n)$ of real numbers (or for a finite sequence containing at least $N$ terms), $D_N((x_n))$ is meant to be the discrepancy of the initial segment formed by the first $N$ terms of $(x_n)$.
Thus, in particular, the number of values $x_n$ with $1 \leq n \leq N$ satisfying $\{x_n\} \in [a, b)$, for any $0 \leq a < b \leq 1$, is bounded from above by $N(b - a) + N D_N((x_n))$. Hence, the number of values $kC2^{n-1-E}$, with $m_E \leq k \leq M_E$, that satisfy (7), i.e., that satisfy $0 \leq \{kC2^{n-1-E}\} < 2^{n-1-p-E}$ or $1 - 2^{n-1-p-E} < \{kC2^{n-1-E}\} < 1$, is bounded from above by $N_E 2^{n-p-E} + 2 N_E D_{N_E}((kC2^{n-1-E}))$.
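For a finite sequence, the discrepancy of Definition 1 can be evaluated directly. The sketch below assumes the standard closed form for sorted points $x_1 \leq \cdots \leq x_N$ in $[0, 1)$, namely $D_N = 1/N + \max_i(i/N - x_i) - \min_i(i/N - x_i)$; the function name is hypothetical, and the rational $355/113$ is only a stand-in for $\pi$ so that the arithmetic stays exact.

```python
from fractions import Fraction

def discrepancy(points):
    """Extreme discrepancy D_N of the fractional parts of `points`.

    Uses the closed form for sorted points x_1 <= ... <= x_N in [0, 1):
    D_N = 1/N + max_i(i/N - x_i) - min_i(i/N - x_i).
    """
    xs = sorted(Fraction(x) % 1 for x in points)
    n = len(xs)
    gaps = [Fraction(i, n) - x for i, x in enumerate(xs, start=1)]
    return Fraction(1, n) + max(gaps) - min(gaps)

# First N multiples of a rational stand-in for C = pi/4 (illustrative only).
c = Fraction(355, 113) / 4
for n in (10, 100):
    print(n, float(discrepancy([k * c for k in range(1, n + 1)])))
```

A small discrepancy means the fractional parts fill $[0, 1)$ almost uniformly, which is what the counting argument above relies on.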
Definition 2. Let $\eta$ be a positive real number or infinity. The irrational number $\alpha$ is said to be of type $\eta$ if $\eta$ is the supremum of all $\tau$ for which $\liminf_{q \to \infty,\, q \in \mathbb{N}} q^{\tau} \|q\alpha\| = 0$.
Theorem 3.2 from [8, chapter 2] states the following result:
Theorem 1. Let $\alpha$ be of finite type $\eta$. Then, for every $\varepsilon > 0$, the discrepancy $D_N(u)$ of $u = (n\alpha)$ satisfies
$$D_N(u) = O\left(N^{(-1/\eta) + \varepsilon}\right).$$
Let us apply this theorem to values of interest for this paper, namely, $C = q \ln(2)$ and $C = q\pi$ with $q \in \mathbb{Q}$, $q \neq 0$.
• If $C$ is a nonzero fractional multiple of $\ln(2)$. We know from [2] that any nonzero fractional multiple of $\ln(2)$ has a type $\leq 2.9$. Thus, the number of floating-point numbers $x$ with $n$-bit mantissas and exponent $E$ that satisfy (2) is upper bounded by $2^{n-p-E}\left(N_E + O\left(N_E^{(19/29) + \varepsilon}\right)\right)$ for every $\varepsilon > 0$.
• If $C$ is a nonzero fractional multiple of $\pi$. We know from [3] that any nonzero fractional multiple of $\pi$ has a type $\leq 7.02$. Hence, the number of floating-point numbers $x$ with $n$-bit mantissas and exponent $E$ that satisfy (2) is upper bounded by $2^{n-p-E}\left(N_E + O\left(N_E^{(301/351) + \varepsilon}\right)\right)$ for every $\varepsilon > 0$.
From this theorem, we can deduce the following result.
Proposition 1. Let $C$ be a positive fractional multiple of $\pi$ or $\ln(2)$. Let $e_{\min}$ and $e_{\max}$ be two rational integers such that $2^{e_{\min}} \leq C/2 < 2^{e_{\min}+1}$ and $e_{\min} \leq e_{\max}$. Let $p \in \mathbb{N}$ such that $2^{-p} \leq C/2$. The number $\nu_E$ of floating-point numbers $x$ with $n$-bit mantissas and exponent $E$ between $e_{\min}$ and $e_{\max}$ such that
$$|x \bmod C| < 2^{-p} \tag{9}$$
satisfies
• $2^{n-p-E}(N_E - 2) + 2 \leq \nu_E \leq 2^{n-p-E} N_E$ if $n - E \geq p$. In that case, $\nu_E$ is easily computable by a program;
• $\nu_E = 2^{n-p-E}\left(N_E + O\left(N_E^{\tau + \varepsilon}\right)\right)$ if $n - E < p$, for every $\varepsilon > 0$, with $\tau = 19/29$ for $C$ nonzero fractional multiple of $\ln(2)$ and $\tau = 301/351$ for $C$ nonzero fractional multiple of $\pi$;
where
$$N_E = \left\lfloor \frac{1}{C}\left(\frac{1}{2^{p}} + 2^{E+1} - \frac{2^{E}}{2^{n-1}}\right) \right\rfloor - \left\lceil \frac{1}{C}\left(-\frac{1}{2^{p}} + 2^{E}\right) \right\rceil + 1.$$
From this proposition, numerous experiments, and a well-known result by Khintchine [5], [6] that states that almost all real numbers are of type 1, we can assume that, for any $E$, we have
$$\nu_E \approx 2^{n-p-E} N_E. \tag{10}$$
We have checked this result by computing all reduced arguments for some values of $n$, $e_{\min}$, and $e_{\max}$ such that this exhaustive computation remains possible in a reasonable time. Some obtained results are given in Figs. 2, 3, and 4. These results show that the estimate provided by (10) is a good one. These estimates will be used at the end of Section 2.3.
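An exhaustive check of this kind is easy to reproduce for small formats. The sketch below uses the setting of Fig. 3 ($C = \pi/4$, $n = 18$, $E = 5$); the threshold $p = 5$ and the function names are assumptions made for illustration, and ordinary double-precision arithmetic is accurate enough here because every test value $x$ is exactly representable.

```python
import math

def count_small_reductions(n, E, p, c):
    """Count n-bit floats x in [2**E, 2**(E+1)) with |x mod c| < 2**-p."""
    ulp = 2.0 ** (E - n + 1)
    bound = 2.0 ** -p
    return sum(
        1
        for j in range(2 ** (n - 1), 2 ** n)
        if abs(math.remainder(j * ulp, c)) < bound
    )

def n_e(n, E, p, c):
    """N_E = M_E - m_E + 1, with m_E and M_E as in (5)."""
    m = max(1, math.ceil((2.0 ** E - 2.0 ** -p) / c))
    M = math.floor((2.0 ** -p + 2.0 ** (E + 1) - 2.0 ** (E - n + 1)) / c)
    return M - m + 1

n, E, p, c = 18, 5, 5, math.pi / 4
actual = count_small_reductions(n, E, p, c)
estimate = 2 ** (n - p - E) * n_e(n, E, p, c)
print(actual, estimate)
```

Since $n - E \geq p$ for these parameters, the count falls under the first case of Proposition 1, so it must lie between $2^{n-p-E}(N_E - 2) + 2$ and $2^{n-p-E} N_E$.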
Fig. 2. Actual number of reduced arguments of absolute value less than $\epsilon$ and expected number using (10), for various values of $\epsilon$, in the case $C = \ln(2)$, $n = 14$, $e_{\min} = 2$, and $e_{\max} = 6$. Notice that the estimation
Fig. 3. Actual number of reduced arguments of absolute value less than $\epsilon$ and expected number using (10), for various values of $\epsilon$, in the case $C = \pi/4$, $n = 18$, with $e_{\min} = e_{\max} = 5$. The estimation given by (10) is
Fig. 4. Actual number of reduced arguments of absolute value less than $\epsilon$ and expected number using (10), for various values of $\epsilon$, in the case $C = \pi/4$, $n = 18$, with $e_{\min} = e_{\max} = 7$. Again, the estimation given by
In this section, we assume that we perform range-reduction for the trigonometric functions, with $C = \pi/2$. Extension to other values of $C$ (such as a fractional multiple of $\pi$, still for the trigonometric functions, or a fractional multiple of $\ln(2)$, for the exponential function) is straightforward.
As stated before, our general philosophy is that we must give results that are:
1. always correct, even for rare cases;
2. computed as quickly as possible for frequent cases.
A way to deal with these requirements is to build a fast algorithm for input arguments with a small exponent and to use a slower yet still accurate algorithm for input arguments with a large exponent.
2.1 Medium-Size Arguments (in $[8, 2^{63} - 1]$)
To do so, in the following, we focus on input arguments with a "reasonably small" exponent. More precisely, we assume that the double-precision input argument $x$ has absolute value less than $2^{63} - 1$. For larger arguments, we assume that Payne and Hanek's method will be used or that $x \bmod C$ will be computed using multiple-precision arithmetic. For straightforward symmetry reasons, we can assume that $x$ is positive. We also assume that $x$ is greater than or equal to $8$. We then proceed as follows:
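1. We define $I(x)$ as $x$ rounded to the nearest integer. Then, $x$ is split into its residual part $\rho(x) = x - I(x)$ and $I(x)$, which is split into eight 7-bit parts $I_i(x)$ for $0 \leq i \leq 7$ as follows:
$$\begin{cases}
I_7(x) = I\left(2^{-56} x\right), \\
I_6(x) = I\left(2^{-48}\left(x - 2^{56} I_7(x)\right)\right), \\
I_5(x) = I\left(2^{-40}\left(x - \left(2^{56} I_7(x) + 2^{48} I_6(x)\right)\right)\right), \\
I_4(x) = I\left(2^{-32}\left(x - \sum_{i=5}^{7} 2^{8i} I_i(x)\right)\right), \\
I_3(x) = I\left(2^{-24}\left(x - \sum_{i=4}^{7} 2^{8i} I_i(x)\right)\right), \\
I_2(x) = I\left(2^{-16}\left(x - \sum_{i=3}^{7} 2^{8i} I_i(x)\right)\right), \\
I_1(x) = I\left(2^{-8}\left(x - \sum_{i=2}^{7} 2^{8i} I_i(x)\right)\right), \\
I_0(x) = I\left(x - \sum_{i=1}^{7} 2^{8i} I_i(x)\right), \\
\rho(x) = x - \sum_{i=0}^{7} 2^{8i} I_i(x),
\end{cases}$$
so that
$$x = 2^{56} I_7(x) + 2^{48} I_6(x) + \cdots + 2^{8} I_1(x) + I_0(x) + \rho(x).$$
Note that $\rho(x)$ is exactly representable in double-precision and that, for $x \geq 2^{52}$, we have $\rho(x) = 0$ and $I(x) = x$. Also, since $x \geq 8$, the last mantissa bit of $\rho(x)$ has a weight greater than or equal to $2^{-49}$.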
Important remark. One could get a very similar algorithm, certainly easier to understand, by replacing the values $I_k(x)$ by the values $J_k(x)$ defined as
$$\begin{cases}
J_0(x) \text{ contains bits 0 to 7 of } I(x), \\
J_1(x) \text{ contains bits 8 to 15 of } I(x), \\
J_2(x) \text{ contains bits 16 to 23 of } I(x), \\
J_3(x) \text{ contains bits 24 to 31 of } I(x), \\
J_4(x) \text{ contains bits 32 to 39 of } I(x), \\
J_5(x) \text{ contains bits 40 to 47 of } I(x), \\
J_6(x) \text{ contains bits 48 to 55 of } I(x), \\
J_7(x) \text{ contains bits 56 to 63 of } I(x),
\end{cases}$$
but that would lead to tables twice as large as the ones required by our algorithm. Indeed, the values $I_0$ up to $I_7$ are stored on 8 bits each, but the sign bit will not be used and, thus, only 7 bits are necessary to index the tables.
The general idea behind our algorithm is to compute first
$$S(x) = \left(I_0(x)\right) \bmod \pi/2 + \left(2^{8} I_1(x)\right) \bmod \pi/2 + \left(2^{16} I_2(x)\right) \bmod \pi/2 + \cdots + \left(2^{56} I_7(x)\right) \bmod \pi/2 + \rho(x).$$
It holds that $x - S(x)$ is a multiple of $\pi/2$ and $S(x)$ will be smaller than $x$, but, in general, $S(x)$ will not be the desired reduced argument: A second, simpler reduction step will be necessary. In practice, the various possible values of $\left(|2^{8i} I_i(x)|\right) \bmod \pi/2$ are stored in tables as a sum of two or three floating-point numbers.
As mentioned above, our goal is to always provide correct results even for the worst case for which we lose 61 bits of accuracy. Then, we need to store $\left(|2^{8i} I_i(x)| \bmod \pi/2\right)$ with at least
$$61 \text{ (bits lost in the worst case)} + 53 \text{ (nonzero significant bits)} + g \text{ (extra guard bits)} = 114 + g \text{ bits}.$$
To reach that precision (with a value of $g$ equal to 39, which will be deduced in the following), all the numbers $\left(|2^{8i} I_i(x)| \bmod \pi/2\right)$, which belong to $[-1, 1]$, are stored in tables as the sum of three double-precision numbers:
• $T_{hi}(i, w)$ is the multiple of $2^{-49}$ that is closest to $\left((2^{8i} w) \bmod \pi/2\right)$;
• $T_{med}(i, w)$ is the multiple of $2^{-99}$ that is closest to $\left((2^{8i} w) \bmod \pi/2\right) - T_{hi}(i, w)$;
• $T_{lo}(i, w)$ is the double-precision number that is closest to $\left((2^{8i} w) \bmod \pi/2\right) - T_{hi}(i, w) - T_{med}(i, w)$,
where $w$ is a 7-bit nonnegative integer.
Note that $T_{hi}(i, w) = T_{med}(i, w) = T_{lo}(i, w) = 0$ for $w = 0$. The three tables $T_{hi}$, $T_{med}$, and $T_{lo}$ need 10 address bits. The total amount of memory required by these tables is $3 \times 2^{10} \times 8 = 24$ Kbytes. From the definitions, one can easily deduce $|T_{med}(i, w)| \leq 2^{-50}$ and $|T_{lo}(i, w)| \leq 2^{-100}$. The sum $T_{hi}(i, w) + T_{med}(i, w) + T_{lo}(i, w)$ approximates $(2^{8i} w) \bmod \pi/2$ with 153 bits of precision, which corresponds to $g = 39$. Computing $T_{hi}$, $T_{med}$, and $T_{lo}$ for the 1,024 different possible values of $(i, w)$ allows us to get slightly sharper bounds, given in Table 1.
2. Define
$$S_{hi}(x) = \left(\sum_{i=0}^{7} \mathrm{sign}(I_i(x))\, T_{hi}(i, |I_i(x)|)\right) + \rho(x).$$
Its absolute value is bounded by $2\pi + \frac{1}{2}$, which is less than $8$. Since $S_{hi}(x)$ is a multiple of $2^{-49}$ and has absolute value less than $8$, it is exactly representable in double-precision floating-point arithmetic (it is even representable with 52 bits only). Therefore, with a correctly rounded arithmetic (such as the one provided on any system that complies with the IEEE-754 standard for floating-point arithmetic), it will be exactly computed, without any rounding error.
Also, consider
$$S_{med}(x) = \sum_{i=0}^{7} \mathrm{sign}(I_i(x))\, T_{med}(i, |I_i(x)|), \qquad S_{lo}(x) = \sum_{i=0}^{7} \mathrm{sign}(I_i(x))\, T_{lo}(i, |I_i(x)|).$$
The number $S_{med}(x)$ is a multiple of $2^{-99}$ and its absolute value is less than $2^{-47}$. Hence, it is exactly representable, and exactly computed, in double-precision floating-point arithmetic. $|S_{lo}|$ is less than $2^{-97}$ and, if $S_{lo}$ is computed with round-to-nearest arithmetic as a balanced binary tree of additions:
$$\begin{aligned}
\Bigl(\bigl(&(\mathrm{sign}(I_0(x))\, T_{lo}(0, |I_0(x)|) + \mathrm{sign}(I_1(x))\, T_{lo}(1, |I_1(x)|)) \\
{}+{} &(\mathrm{sign}(I_2(x))\, T_{lo}(2, |I_2(x)|) + \mathrm{sign}(I_3(x))\, T_{lo}(3, |I_3(x)|))\bigr) \\
{}+{} \bigl(&(\mathrm{sign}(I_4(x))\, T_{lo}(4, |I_4(x)|) + \mathrm{sign}(I_5(x))\, T_{lo}(5, |I_5(x)|)) \\
{}+{} &(\mathrm{sign}(I_6(x))\, T_{lo}(6, |I_6(x)|) + \mathrm{sign}(I_7(x))\, T_{lo}(7, |I_7(x)|))\bigr)\Bigr),
\end{aligned} \tag{11}$$
then the rounding error is less than $3 \times 2^{-151}$. For each of the values $T_{lo}(i, I_i(x))$, the fact that it is rounded to the nearest yields an accumulated error (for these eight values) less than $8 \times 2^{-154}$. Thus, the absolute error on $S_{lo}(x)$ is less than or equal to $8 \times 2^{-154} + 3 \times 2^{-151} = 2^{-149}$.
Since $S_{hi}(x) + S_{med}(x)$ is exactly computed, the number $S(x) = S_{hi}(x) + S_{med}(x) + S_{lo}(x)$ is equal to $x$ minus an integer multiple of $\pi/2$ plus an error bounded by $2^{-149}$.
And yet, $S(x)$ may not be the final reduced argument since its absolute value may be significantly larger than $\pi/4$. We therefore may have to add or subtract a multiple of $\pi/2$ from $S(x)$ to get the final result and straightforward calculations show that this multiple can only be $k\pi/2$ with $k = 1$, $2$, $3$, or $4$.
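The first reduction step of this section can be sketched with exact rational arithmetic. This is not the double-precision implementation analyzed above: the table entries are recomputed on demand instead of being stored, the three sums are accumulated without rounding, and all function names as well as the 100-digit constant for $\pi$ are assumptions made for illustration.

```python
from fractions import Fraction

PI = Fraction("3.1415926535897932384626433832795028841971693993751"
              "058209749445923078164062862089986280348253421170679")
HALF_PI = PI / 2

def centered_mod(v):
    """v minus the nearest multiple of pi/2, as an exact rational."""
    return v - round(v / HALF_PI) * HALF_PI

def nearest_multiple(v, k):
    """Multiple of 2**-k closest to the rational v."""
    return Fraction(round(v * 2**k), 2**k)

def table_entry(i, w):
    """(T_hi, T_med, T_lo) for (2**(8*i) * w) mod pi/2, with w >= 0."""
    r = centered_mod(Fraction(w * 2 ** (8 * i)))
    t_hi = nearest_multiple(r, 49)
    t_med = nearest_multiple(r - t_hi, 99)
    t_lo = Fraction(float(r - t_hi - t_med))   # nearest double
    return t_hi, t_med, t_lo

def split(x):
    """Signed 8-bit digits I_7..I_0 of round(x) and the residual rho."""
    integer = round(x)
    rho = Fraction(x) - integer
    digits = []
    for i in range(7, -1, -1):
        w = round(Fraction(integer, 2 ** (8 * i)))
        integer -= w * 2 ** (8 * i)
        digits.append((i, w))
    return digits, rho

def first_step(x):
    """S_hi, S_med, S_lo for a double x with 8 <= x < 2**63 (exact sums)."""
    digits, rho = split(x)
    s_hi, s_med, s_lo = rho, Fraction(0), Fraction(0)
    for i, w in digits:
        sign = (w > 0) - (w < 0)
        t_hi, t_med, t_lo = table_entry(i, abs(w))
        s_hi += sign * t_hi
        s_med += sign * t_med
        s_lo += sign * t_lo
    return s_hi, s_med, s_lo

x = 6411027962775774 * 2.0 ** -48        # worst case quoted in Section 1.2
s_hi, s_med, s_lo = first_step(x)
residual = centered_mod(Fraction(x) - (s_hi + s_med + s_lo))
print(float(s_hi), float(residual * 2**150))
```

The residual measures how far $x - S(x)$ is from an exact multiple of $\pi/2$; in this sketch it comes only from rounding the eight $T_{lo}$ entries to double precision.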
2.2 Small Arguments (Smaller than 8)
Define $C_{hi}(k)$, for $k = 1, 2, 3, 4$, as the multiple of $2^{-49}$ that is closest to $k\pi/2$. $C_{hi}(k)$ is exactly representable as a double-precision number. Define $C_{med}(k)$ as the multiple of $2^{-99}$ that is closest to $k\pi/2 - C_{hi}(k)$ and $C_{lo}(k)$ as the double-precision number that is closest to $k\pi/2 - C_{hi}(k) - C_{med}(k)$.
We now proceed as follows:
• If $|S_{hi}(x)| \leq \pi/4$, then we define
$$R_{hi}(x) = S_{hi}(x), \quad R_{med}(x) = S_{med}(x), \quad R_{lo}(x) = S_{lo}(x).$$
• Else, let $k_x$ be such that $C_{hi}(k_x)$ is closest to $|S_{hi}(x)|$. We successively compute:
- If $S_{hi}(x) > 0$,
$$R_{hi}(x) = S_{hi}(x) - C_{hi}(k_x), \quad R_{med}(x) = S_{med}(x) - C_{med}(k_x), \quad R_{lo}(x) = S_{lo}(x) - C_{lo}(k_x).$$
- Else,
$$R_{hi}(x) = S_{hi}(x) + C_{hi}(k_x), \quad R_{med}(x) = S_{med}(x) + C_{med}(k_x), \quad R_{lo}(x) = S_{lo}(x) + C_{lo}(k_x).$$
Again, $R_{hi}(x)$ and $R_{med}(x)$ are exactly representable (hence, they are exactly computed) in double-precision arithmetic:
- $R_{hi}(x)$ has an absolute value less than $\pi/4$ and is a multiple of $2^{-49}$;
- $R_{med}(x)$ has an absolute value less than $2^{-47} + 2^{-50}$ and is a multiple of $2^{-99}$.
$|R_{lo}(x)|$ is less than $2^{-97} + 2^{-100}$ and it is computed with error less than or equal to
$$2^{-149} + 2^{-150} + 2^{-154} = 49 \times 2^{-154}:$$
- $2^{-149}$ is the error bound on $S_{lo}$;
- $2^{-154}$ bounds the error due to the floating-point representation of $C_{lo}(k_x)$;
- $2^{-150}$ bounds the rounding error that occurs when computing $S_{lo}(x) \pm C_{lo}(k_x)$ in round-to-nearest mode.
Therefore, the number $R(x) = R_{hi}(x) + R_{med}(x) + R_{lo}(x)$ is equal to $x$ minus an integer multiple of $\pi/2$ plus an error bounded by $49 \times 2^{-154} < 2^{-148}$.
This step is also used (alone, without the previous steps) to reduce small input arguments, less than $8$. This allows our algorithm to perform range-reduction for both kinds of arguments, small and medium size. The reduced argument is now stored as the sum of three double-precision numbers, $R_{hi}(x)$, $R_{med}(x)$, and $R_{lo}(x)$. We want to return the reduced argument as the sum of two double-precision numbers (one double-precision number may not suffice if we wish to compute trigonometric functions with very good accuracy). To do that, we will use the Fast2sum algorithm presented hereafter.
TABLE 1
Maximum Values of $T_{hi}$, $T_{med}$, and $T_{lo}$
2.3 Final Step
We will get the final result of the range-reduction as follows: Let $p$ be an integer parameter, $1 \leq p \leq 44$, used to specify the required accuracy. This choice comes from the fact that we work in double precision arithmetic and that, in the most frequent cases, the final relative error will be bounded by $2^{-100+p}$. To allow an accurate double precision function result even in the very worst case, we must have a relative error significantly less than $2^{-53}$. The problem here is only to propagate the possible carry when summing the three components $R_{hi}(x)$, $R_{med}(x)$, and $R_{lo}(x)$. This is performed using floating-point addition and the following result.
Theorem 2 (Fast2sum algorithm) [7, p. 221, Theorem C]. Let $a$ and $b$ be floating-point numbers, with $|a| \geq |b|$. Assume the used floating-point arithmetic provides correctly rounded results with rounding to the nearest. The following algorithm:
fast2sum(a,b):
    s := a + b
    z := s - a
    r := b - z
computes two floating-point numbers $s$ and $r$ that satisfy:
• $r + s = a + b$ exactly;
• $s$ is the floating-point number which is closest to $a + b$.
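Theorem 2 translates directly into code. The sketch below assumes Python floats are IEEE-754 double-precision numbers with round-to-nearest, which is the case on common platforms; the exactness claim is checked with exact rationals.

```python
from fractions import Fraction

def fast2sum(a, b):
    """Return (s, r) with s = fl(a + b) and r = a + b - s, assuming |a| >= |b|."""
    s = a + b
    z = s - a
    r = b - z
    return s, r

a, b = 1.0, 2.0 ** -60            # b lies far below the last bit of a
s, r = fast2sum(a, b)
print(s, r, Fraction(s) + Fraction(r) == Fraction(a) + Fraction(b))
```

The pair $(s, r)$ carries the information that a single rounded addition would discard, which is why the final step below can return the reduced argument as two doubles.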
We now consider the different possible cases:
• If $|R_{hi}(x)| > 1/2^{p}$, then, since $|R_{med}(x)| < 2^{-47} + 2^{-50}$, the reduced argument will be close to $R_{hi}(x)$. In that case, we first compute
$$t_{med}(x) = R_{med}(x) + R_{lo}(x).$$
The error on $t_{med}(x)$ is bounded by the former error on $R_{lo}(x)$ plus the rounding error due to the addition. Assuming rounding to nearest, this last error is less than or equal to $2^{-100}$. Hence, the error on $t_{med}(x)$ is less than or equal to $2^{-100} + 2^{-148}$. Then, we perform (without rounding error)
$$(y_{hi}, y_{lo}) = \mathrm{fast2sum}(R_{hi}(x), t_{med}(x)).$$
After that, the two floating-point numbers $(y_{hi}, y_{lo})$ represent the reduced argument with an absolute error bounded by $2^{-100} + 2^{-148} \approx 2^{-100}$. Hence, the relative error on the reduced argument will be bounded by a value very close to $2^{-100+p}$.
• If $R_{hi}(x) = 0$, then we perform
$$(y_{hi}, y_{lo}) = \mathrm{fast2sum}(R_{med}(x), R_{lo}(x)).$$
After that, since the absolute value of the reduced argument is always larger than $0.71 \times 2^{-61}$, the two floating-point numbers $(y_{hi}, y_{lo})$ represent the reduced argument with a relative error smaller than
$$\frac{49 \times 2^{-154}}{0.71 \times 2^{-61}} < 2^{-86}.$$
• If $0 < |R_{hi}(x)| \leq 2^{-p}$, then, since the absolute value of the reduced argument is always larger than $0.71 \times 2^{-61}$ and since $|R_{lo}(x)| < 2^{-97} + 2^{-100}$, most of the information on the reduced argument is in $R_{hi}(x)$ and $R_{med}(x)$. We first perform
$$(y_{hi}, t_{med}) = \mathrm{fast2sum}(R_{hi}(x), R_{med}(x)).$$
Let $k$ be the integer satisfying
$$2^{k} \leq |y_{hi}| < 2^{k+1}.$$
We easily find
$$|t_{med}| \leq 2^{k-53}.$$
After that, we compute
$$y_{lo} = t_{med} + R_{lo}(x).$$
The rounding error due to this addition is bounded by $2^{k-107}$. Hence, the two floating-point numbers $(y_{hi}, y_{lo})$ represent the reduced argument with an absolute error smaller than
$$49 \times 2^{-154} + \max\{2^{k-107}, 2^{-150}\}.$$
Therefore, $(y_{hi}, y_{lo})$ represent the reduced argument with a relative error better than
$$49 \times 2^{-154-k} + \max\{2^{-107}, 2^{-150-k}\},$$
which is less than $2^{-87}$ since the absolute value of the reduced argument is larger than $0.71 \times 2^{-61}$, which implies $2^{k} \geq 2^{-61}$.
A first solution is to try to make the various error bounds equal. This is done by choosing $p = 14$. By doing that, in the worst case, the bound on the relative error will be $2^{-86}$, which is quite good. We should notice that, in this case, assuming (10) with $C = \pi/2$, the probability that $|R_{hi}(x)|$ will be less than $2^{-p}$ is around $7.8 \times 10^{-5}$.
A possibly better solution is to make the most frequent case (i.e., $|R_{hi}(x)| > 2^{-p}$) more accurate and to assume that a more accurate yet slower algorithm is used in the other cases (an easy solution is to split the variables into four floating-point values instead of three as we did here). This is done by using a somewhat smaller value of $p$. For instance, with $p = 10$ and $C = \pi/2$, still assuming (10), the probability that $|R_{hi}(x)| < 2^{-p}$ is around $1.25 \times 10^{-3}$. In the most frequent case ($|R_{hi}(x)| \geq 2^{-p}$), the error bound on the computed reduced argument will be $2^{-90}$. Due to its low probability, the other case can be processed with an algorithm a hundred times slower without significantly changing the average time of computation, cf. Amdahl's law.
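The two probabilities quoted above follow from (10) if one further assumes $N_E \approx 2^{E}/C$, which gives a fraction $2^{1-p}/C$ of inputs with $|R_{hi}(x)| < 2^{-p}$. The sketch below evaluates that fraction and the resulting average cost; the function names and the cost model (fast path of unit cost, rare path `slow_factor` times slower) are assumptions made for illustration.

```python
import math

def prob_slow_path(p, c=math.pi / 2):
    """Estimated probability that |R_hi(x)| < 2**-p, assuming estimate (10)."""
    return 2.0 ** (1 - p) / c

def average_slowdown(p, slow_factor, c=math.pi / 2):
    """Average cost relative to the fast path when the rare path is slower."""
    q = prob_slow_path(p, c)
    return (1 - q) + q * slow_factor

for p in (14, 10):
    print(p, prob_slow_path(p), average_slowdown(p, 100.0))
```

Lowering $p$ tightens the error bound of the frequent case at the price of sending a larger, though still small, share of inputs to the slower path.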
2.4 The Algorithm
We can now sketch the complete algorithm:
Algorithm Range-Reduction:
Input: A double-precision floating-point number x > 0 and an integer p > 0 specifying the required precision in bits.
Output: The reduced argument y given as the sum of two double-precision floating-point numbers y_hi and y_lo, such that −π/4 ≤ y < π/4 (see footnote 2) and y = x − kπ/2 within an error given in the analysis of Section 2.3, for some integer k.
Method:
if x ≥ 2^63 − 1 then
    {Apply the method of Payne and Hanek.}
else if x ≤ 8 then
    S_hi ← x; S_med ← 0; S_lo ← 0;
else
    I ← round(x); ρ ← x − I;
    S_hi ← ρ; S_med ← 0; S_lo ← 0;
    i ← 7;
    j ← 56;
    while i ≥ 0 do
        w_i ← round(I >> j);
        S_hi ← S_hi + sign(w_i) T_hi(i, |w_i|);
        S_med ← S_med + sign(w_i) T_med(i, |w_i|);
        I ← I − (w_i << j); i ← i − 1; j ← j − 8
    S_lo ← Σ_{i=0..7} sign(w_i) T_lo(i, |w_i|) (cf. (11));
if |S_hi| ≥ π/4 then
    k ← Reduce(|S_hi|);
    s ← sign(S_hi);
    S_hi ← S_hi − s C_hi(k);
    S_med ← S_med − s C_med(k);
    S_lo ← S_lo − s C_lo(k);
if |S_hi| > 2^(−p) then
    temp ← S_med + S_lo;
    (y_hi, y_lo) ← fast2sum(S_hi, temp);
else if S_hi = 0 then
    (y_hi, y_lo) ← fast2sum(S_med, S_lo);
else
    (y_hi, temp) ← fast2sum(S_hi, S_med);
    y_lo ← temp + S_lo.
Where: The function Reduce(|S_hi|) chooses the appropriate multiple k of π/2, represented as the triple (C_hi(k), C_med(k), C_lo(k)). The sign s is saved before S_hi is updated so that the three components are corrected in the same direction, as in Section 2.2.
3 COST OF THE ALGORITHM
In this section, we compare our method to other algorithms on the same input range $[8, 2^{63} - 1]$: Payne and Hanek's method (see Section 1.1) and the Modular range-reduction method described in [1]. Concerning Payne and Hanek's method, we used the version of the algorithm used by Sun Microsystems. We chose as criteria for the evaluation of the algorithms the table size, the number of table accesses, and the number of floating-point multiplications, divisions,
Table 2 shows the potential advantages of our algorithm for small and medium-sized input arguments. Payne and Hanek's method over that range does not need much memory, but requires roughly three times as many operations. The Modular range-reduction has the same characteristics as Payne and Hanek's method concerning the table size needed and the number of elementary operations involved, but requires more table accesses. Our algorithm is then a good compromise between table size and number of operations for range-reduction of medium-sized arguments.
To get more accurate figures than by just counting the
operations, we have implemented this algorithm in ANSI-C.
shows that our algorithm is 4 to 5 times faster, depending
on the required final precision, than the Sun implementa-
tion of Payne and Hanek’s algorithm, provided that the
tables are in main memory (which will be true when the
trigonometric functions are frequently called in a numerical
program; and, when they are not frequently called, the
speed of range-reduction is no longer an issue). Our
algorithm is then a good compromise between table size
and delay for range-reduction of small and medium-sized
arguments.
A variant of our algorithm would consist of first
computing Shi,Smed and Rhi ,Rmed only. Then, during the
fourth step of the algorithm, if the accuracy does not suffice,
compute Tlo and Rlo. This slight modification can reduce the
number of elementary operations in the (most frequent)
cases where no extra accuracy is needed. We can also
reduce the table size by 4Kbytes by storing the Tlo values in
single-precision only, instead of using double-precision.
Another variant (that can be useful depending on the
processor and compiler) would be to replace the loop
“while i0” with “while I<>0and i0.” In that case
(for a medium-sized argument $x$), the number $N$ of
double-precision floating-point operations becomes at most
$N = 17 + 2\lceil \log_{256} x \rceil$, i.e., $19 \le N \le 33$. Also, the number of
table accesses becomes at most $11 + 2\lceil \log_{256} x \rceil$.
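As an illustration of these two bounds, the following Python sketch evaluates them for a given argument. It is only a worked example of the formulas, not part of our ANSI-C implementation; the function names are hypothetical, and the sketch assumes a finite argument $x > 1$ so that $\lceil \log_{256} x \rceil \ge 1$.

```python
import math


def bytes_needed(x: float) -> int:
    """Return ceil(log_256(x)) for finite x > 1, using exact integer powers."""
    if not math.isfinite(x) or x <= 1:
        raise ValueError("this sketch assumes a finite argument x > 1")
    k, power = 0, 1
    while power < x:  # find the smallest k such that 256**k >= x
        power *= 256
        k += 1
    return k


def max_fp_operations(x: float) -> int:
    """Upper bound on double-precision operations: 17 + 2*ceil(log_256 x)."""
    return 17 + 2 * bytes_needed(x)


def max_table_accesses(x: float) -> int:
    """Upper bound on table accesses: 11 + 2*ceil(log_256 x)."""
    return 11 + 2 * bytes_needed(x)


if __name__ == "__main__":
    for x in (2.0, 1000.0, 2.0**40, 2.0**63):
        print(x, max_fp_operations(x), max_table_accesses(x))
```

For arguments just above 1 the operation bound evaluates to 19, and for arguments near $2^{63}$ it evaluates to 33, which matches the range stated above.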
4 CONCLUSIONS
We have presented an algorithm for accurate range-reduction
of input arguments with absolute value less than
$2^{63} - 1$. This table-based algorithm gives accurate results for
the most frequent cases. In order to cover the whole
double-precision domain for input arguments, we suggest using
Payne and Hanek’s algorithm for huge arguments. A major
drawback of our method lies in the table size needed; thus, a
future effort will be to reduce the table size while keeping a
good trade-off between speed and accuracy.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 3, MARCH 2005
TABLE 2
Comparison of Our Algorithm with Payne and Hanek’s Algorithm
and the Modular Range-Reduction Algorithm
2. In fact, the absolute value of the reduced argument is less than $\pi/4$
plus the largest possible value of $|S_{med} + S_{lo}|$, hence, less than
$\pi/4 + 2^{-47} + 2^{-97}$. In practice, this has no influence on the elementary
function algorithms.
REFERENCES
 M. Daumas, C. Mazenc, X. Merrheim, and J.M. Muller, “Modular
Range-Reduction: A New Algorithm for Fast and Accurate
Computation of the Elementary Functions,” J. Universal Computer
Science, vol. 1, no. 3, pp. 162-175, Mar. 1995.
 M. Hata, “Legendre Type Polynomials and Rationality Measures,”
J. reine angew. Math., vol. 407, pp. 99-125, 1990.
 M. Hata, “Rational Approximations to π and Some Other
Numbers,” Acta Arithmetica, vol. 63, no. 4, pp. 335-349, 1993.
 W. Kahan, Minimizing q*m-n, http://http.cs.berkeley.edu/~wkahan/
at the beginning of the file “nearpi.c,” 1983.
 A. Ya Khintchine, “Einige Sätze über Kettenbrüche, mit
Anwendungen auf die Theorie der diophantischen Approximationen,”
Math. Ann., vol. 92, pp. 115-125, 1924.
 A. Ya Khintchine, Continued Fractions. Chicago, London: Univ. of
Chicago Press, 1964.
 D. Knuth, The Art of Computer Programming, vol. 2. Reading, Mass.:
 L. Kuipers and H. Niederreiter, Uniform Distribution of Sequences.
Pure and Applied Mathematics. New York-London-Sydney:
Wiley-Interscience (John Wiley & Sons), 1974.
 J.-M. Muller, Elementary Functions, Algorithms and Implementation.
Boston: Birkhäuser, 1997.
 K.C. Ng, “Argument Reduction for Huge Arguments: Good to the
Last Bit,” technical report, SunPro, 1992. http://www.validlab.com/arg.pdf.
 M. Payne and R. Hanek, “Radian Reduction for Trigonometric
Functions,” SIGNUM Newsletter, vol. 18, pp. 19-24, 1983.
 R.A. Smith, “A Continued-Fraction Analysis of Trigonometric
Argument Reduction,” IEEE Trans. Computers, vol. 44, no. 11,
pp. 1348-1351, Nov. 1995.
Nicolas Brisebarre received the PhD degree in
pure mathematics from the Université Bordeaux I,
France, in 1998. He has been Maître de
Conférences (associate professor) in pure
mathematics at LArAl, Université de Saint-Étienne,
France, since 1999. He is currently on
sabbatical leave at INRIA, France, within the
Arenaire Project, LIP, École Normale Supérieure
de Lyon. His research interests are in computer
arithmetic and number theory.
David Defour received the PhD degree in
computer science from the École Normale
Supérieure de Lyon, Lyon, France, in 2003.
He has been an assistant professor of compu-
ter science at the University of Perpignan,
France, since September 2004. His research
interests are in computer arithmetic and com-
puter architecture.
Peter Kornerup received the mag.scient. de-
gree in mathematics from Aarhus University,
Denmark, in 1967. After a period with the
University Computing Center, starting in 1969
he was involved in establishing the computer
science curriculum at Aarhus University, where
he helped found the Computer Science Depart-
ment in 1971. Through most of the 1970s and
1980s he served as chairman of that department.
Since 1988, he has been a professor of
computer science at Odense University, now the University of Southern
Denmark, where he also served a period as the chairman of the
department. He spent a leave during 1975-1976 with the University of
Southwestern Louisiana, Lafayette, four months in 1979 and shorter
stays in many years with Southern Methodist University, Dallas, Texas,
one month with the Université de Provence in Marseille in 1996 and two
months with the École Normale Supérieure de Lyon in 2001. His
interests include compiler construction, microprogramming, computer
networks, and computer architecture, but, in particular, his research has
been in computer arithmetic and number representations, with applica-
tions in cryptology and digital signal processing. He has served on the
program committees for numerous IEEE, ACM, and other meetings, in
particular, on the program committees for the fourth through the 16th
IEEE Symposia on Computer Arithmetic and served as program cochair
for these symposia in 1983, 1991, and 1999. He has been a guest editor
for a number of journal special issues and served as an associate editor
of the IEEE Transactions on Computers from 1991-1995. He is a
member of the IEEE.
Jean-Michel Muller received the PhD degree in
1985 from the Institut National Polytechnique de
Grenoble. He is Directeur de Recherches
(senior researcher) at CNRS, France, and he
is head of the LIP laboratory (LIP is a joint
laboratory of CNRS, École Normale Supérieure
de Lyon, INRIA, and the Université Claude
Bernard Lyon 1). His research interests are in
computer arithmetic. He was program cochair of
the 13th IEEE Symposium on Computer Arithmetic
(Asilomar, USA, June 1997) and general chair of the 14th IEEE
Symposium on Computer Arithmetic (Adelaide, Australia, April 1999).
He served as an associate editor of the IEEE Transactions on
Computers from 1996 to 2000. He is a senior member of the IEEE
and the IEEE Computer Society.
Nathalie Revol received the PhD degree in
applied mathematics from the Institut National
Polytechnique de Grenoble, France, in 1994.
She was an associate professor in applied
mathematics in the laboratory ANO of the
University of Lille, France, from 1996 to 2002.
She is currently a research scientist at INRIA,
France, within the Arenaire Project, LIP, École
Normale Supérieure de Lyon. Her research
interests focus on computer arithmetic and,
especially, arbitrary precision interval arithmetic: library and algorithms.