Fast Taylor polynomial evaluation for the computation
of the matrix cosine
Jorge Sastre1
Instituto de Telecomunicaciones y Aplicaciones Multimedia
Javier Ibáñez1
Instituto de Instrumentación para Imagen Molecular
Pedro Alonso-Jordá1,∗
Department of Information Systems and Computation
Jesús Peinado1
Department of Information Systems and Computation
Emilio Defez1
Instituto de Matemática Multidisciplinar
Abstract
In this work we introduce a new method to compute the matrix cosine. It is based on recent matrix polynomial evaluation methods for the Taylor approximation and on a mixed forward and backward error analysis. These matrix polynomial evaluation methods make it possible to evaluate the Taylor polynomial approximation of the matrix cosine more efficiently than the Paterson-Stockmeyer method. A sequential MATLAB implementation of the new algorithm is provided, giving better efficiency and accuracy than state-of-the-art algorithms. Moreover, we provide a MATLAB implementation that can use NVIDIA GPUs easily and efficiently.
Keywords: matrix, matrix trigonometric functions, matrix cosine, Taylor, fast
matrix polynomial evaluation, GPU computing
∗Corresponding author
Email address: palonso@upv.es (Pedro Alonso-Jordá)
1All authors belong to Universitat Politècnica de València.
Preprint submitted to Journal of Computational and Applied Mathematics, September 4, 2018
1. Introduction
The exact solution of many engineering processes described by second order differential equations is given in terms of the trigonometric matrix functions sine and/or cosine. This is the case, for instance, of the wave problem. The most popular state-of-the-art algorithms used to calculate these matrix functions are based on polynomial and rational approximations with scaling and recovering techniques [1, 2, 3, 4, 5]. The Paterson-Stockmeyer method [6] is the one most commonly used to compute the matrix polynomials that appear in these approximations in order to reduce computational costs. Recently, a new family of methods for the evaluation of general matrix polynomials has been proposed [7], which reduces the number of matrix products needed to evaluate a polynomial with respect to the Paterson-Stockmeyer method. In this work, we present competitive algorithms for the computation of the matrix cosine based on the evaluation of Taylor approximations using those methods. Sequential and NVIDIA GPU based MATLAB implementations of the new algorithms are given. The basic computational kernel of algorithms based on Taylor approximations is matrix multiplication. This kernel can be executed very rapidly on accelerator devices such as GPUs (Graphics Processing Units). In this paper we have exploited this fact, together with our previous experience on this subject [8], to build a MATLAB script plus a mex file capable of executing the new algorithm very efficiently.
The next section presents a scaling and squaring Taylor algorithm for computing the matrix cosine based on the methods described in [7]. Section 3 describes a forward and backward error analysis for computing the Taylor approximation using our algorithm. In Section 4, we show some numerical results of the sequential and GPU MATLAB implementations from both the performance and accuracy points of view. Finally, Section 5 gives some conclusions.
2. Taylor algorithm for computing the matrix cosine
The matrix cosine can be defined for all A ∈ C^{n×n} by the series

    cos(A) = Σ_{i=0}^{∞} (−1)^i A^{2i} / (2i)!.                                  (1)

Let

    T_{2m}(A) = Σ_{i=0}^{m} p_i A^{2i} = Σ_{i=0}^{m} p_i B^i ≡ P_m(B),           (2)

be the Taylor approximation of order 2m of cos(A), where p_i = (−1)^i/(2i)! and B = A^2. Algorithm 1 shows a general algorithm for computing the matrix cosine based on the Taylor series. Since (2) is accurate near the origin, for computing cos(A) from the Taylor approximation it is often necessary to scale matrix B and recover cos(A) from the Taylor approximation of the cosine of the scaled matrix. Scaling matrix B by an integer s > 0 consists of computing B := 4^{-s} B (Step 3).
Algorithm 1 Given a matrix A ∈ C^{n×n} and a maximum order m_M ∈ N, this algorithm computes C = cos(A) by a Taylor approximation of order 2m_k ≤ 2m_M, where m_k are the optimal degrees of the Taylor polynomial, i.e. the maximum degrees of the Taylor polynomial which can be evaluated for a certain number of matrix products.
1: B = A^2
2: Scaling phase: choose m_k ≤ m_M and an integer scaling parameter s for the Taylor approximation with scaling.
3: B = B/4^s
4: Compute C = P_{m_k}(B)
5: for i = 1:s do
6:   C = 2C^2 − I
7: end for
Once the cosine of the scaled matrix has been computed, cos(A) can be computed (Steps 5–7) by repeatedly using the double angle formula cos(2X) = 2 cos^2(X) − I [9, p. 288].
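For illustration, a minimal MATLAB sketch of Algorithm 1 could be organized as follows, where select_order_scaling and eval_taylor_poly are placeholders for Steps 2 and 4 (they are hypothetical helpers, not the actual cosmpol routines):

  function C = cos_taylor_sketch(A)
  % Minimal sketch of Algorithm 1 (not the actual cosmpol implementation).
  B = A*A;                              % Step 1: B = A^2
  [m, s] = select_order_scaling(B);     % Step 2: placeholder for order/scaling selection
  B = B/4^s;                            % Step 3: scale B
  C = eval_taylor_poly(B, m);           % Step 4: placeholder for evaluating P_m(B)
  for i = 1:s                           % Steps 5-7: recover cos(A) with the
      C = 2*C*C - eye(size(A));         % double angle formula cos(2X) = 2 cos(X)^2 - I
  end
  end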
Matrix A can be preprocessed to reduce its norm as described in [10, Alg. 1.2]; this procedure will not be discussed in this paper. Step 4 was traditionally performed by using the Paterson-Stockmeyer method [6]; however, in this paper we use a more efficient method for evaluating C = P_{m_k}(B) based on [7]. This method depends on the value of m_k selected in Step 2 of Algorithm 1 (from now on we will write m instead of m_k for simplicity). For reference, a simple Paterson-Stockmeyer-style evaluation is sketched below; after it, we analyse each case of the new method.
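A straightforward (not fully cost-optimized) Paterson-Stockmeyer-style evaluation of a general P_m(B) = Σ_{i=0}^{m} p_i B^i might look as follows; the coefficient vector p and the block size q are generic, so this is only a baseline sketch and not the code used in the algorithms of this paper:

  function P = ps_eval(B, p)
  % Paterson-Stockmeyer-style evaluation of P(B) = sum_{i=0}^{m} p(i+1)*B^i,
  % where p(1) = p_0, ..., p(m+1) = p_m.  Baseline sketch for reference only.
  n = size(B,1);  m = numel(p) - 1;
  q = ceil(sqrt(m));                       % block size (a common choice)
  Bpow = cell(q,1);  Bpow{1} = B;          % precompute B, B^2, ..., B^q
  for k = 2:q, Bpow{k} = Bpow{k-1}*B; end
  P = zeros(n);
  for j = floor(m/q):-1:0                  % Horner scheme in B^q over coefficient blocks
      block = p(j*q+1)*eye(n);             % degree-0 term of block j
      for r = 1:min(q-1, m - j*q)
          block = block + p(j*q+r+1)*Bpow{r};
      end
      if j == floor(m/q)
          P = block;
      else
          P = P*Bpow{q} + block;           % one matrix product per block
      end
  end
  end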
For m = 1, 2 and 4, similarly to (10) from [11], the Taylor polynomials P_m(B) can be computed by using the following expressions:

    P_1(B) = −B/2 + I,                                                           (3)
    P_2(B) = (B^2/12 − B)/2 + I,
    P_4(B) = (((B^2/56 − B)/30 + I)B^2/12 − B)/2 + I.
Following [7, Ex. 3.1], P_8(B) can be evaluated by using the following formulae:

    y_02(B) = B^2(c_1 B^2 + c_2 B),                                              (4)
    P_8(B) = (y_02(B) + c_3 B^2 + c_4 B)(y_02(B) + c_5 B^2)
             + c_6 y_02(B) + B^2/24 − B/2 + I,

with a cost of 3 matrix products. With that cost the maximum approximation order available with Paterson-Stockmeyer is m = 6. The coefficients c_i for IEEE double precision arithmetic are given in Table 1, see [7, Table 4].
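For illustration, a direct MATLAB transcription of (4) could be the following sketch (it is not the cosmpol code itself; c is assumed to hold the m = 8 column of Table 1):

  function P = p8_eval(B, c)
  % Evaluation of P_8(B) as in (4) with 3 matrix products (sketch).
  % c(1:6) are the m = 8 coefficients of Table 1.
  I   = eye(size(B));
  B2  = B*B;                                    % 1st matrix product
  y02 = B2*(c(1)*B2 + c(2)*B);                  % 2nd matrix product
  P   = (y02 + c(3)*B2 + c(4)*B)*(y02 + c(5)*B2) ...
        + c(6)*y02 + B2/24 - B/2 + I;           % 3rd matrix product
  end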
In the following we show that the most efficient methods proposed in [7] to evaluate the Taylor polynomial for m > 8 are not accurate enough for the matrix cosine approximation. Therefore, other possibilities are proposed to increase accuracy, in exchange for a higher cost. Despite this higher cost, the proposed matrix polynomial evaluation methods are more efficient than the Paterson-Stockmeyer method.
Table 1: Coefficients for computing the matrix cosine Taylor approximation.

            m = 8                        m = 12                       m = 15
c1      2.186201576339059e-07      1.269542268337734e-12      6.140022498994532e-17
c2     -2.623441891606870e-05     -3.503936660612145e-10     -2.670909787062621e-14
c3      6.257028774393310e-03      1.135275478038335e-07      1.438284920333222e-11
c4     -4.923675742167775e-01     -2.027712316612395e-05     -1.050202496489896e-08
c5      1.441694411274536e-04      1.647243380001247e-03      4.215975785860907e-06
c6      5.023570505224926e+01     -6.469859264308602e-01     -1.238347173261210e-03
c7                                -4.008589447357360e-05     -3.234597615453410e-09
c8                                 9.187724869020796e-03      9.292820886910254e-07
c9                                -1.432942184841715e-02      2.466381973203188e-01
c10                                4.555439797286385e-03     -9.369018510939971e-10
Following [12, Sec. 3.2], and similarly to [7, Ex. 5.1], with a cost of 4 matrix products it is possible to obtain a Taylor-based approximation P_16(B) of the matrix cosine of order m = 15, with several real solutions for the coefficients involved. However, for all the real solutions rounded to IEEE double precision arithmetic, the stability check proposed in [7, Ex. 3.1] gives errors of order 10^{-14} > u or greater, where u is the unit roundoff in IEEE double precision arithmetic, i.e. u = 2^{-53} ≈ 1.11 × 10^{-16}. We have checked that these evaluation formulae provide reduced accuracy in numerical tests.
Then, for a cost of 4 matrix products we use (34) and (35) from [7] to evaluate P_12(B) by means of the following formulae:

    y_02(B) = B^3(c_1 B^3 + c_2 B^2 + c_3 B),                                    (5)
    P_12(B) = (y_02(B) + c_4 B^3 + c_5 B^2 + c_6 B)(y_02(B) + c_7 B^3 + c_8 B^2)
              + c_9 y_02(B) + c_10 B^3 + B^2/24 − B/2 + I,

where the coefficients c_i are given in Table 1. From the different real solutions for the coefficients, we selected the ones giving the lowest maximum error in the stability test, similarly to [7, Ex. 3.1], with errors lower than 1.31 × 10^{-16} = O(u). Equation (5) provides a lower order than 15, but it behaves in a stable manner and is, in turn, more efficient than the Paterson-Stockmeyer method, since with a cost of 4 matrix products the maximum approximation order available with Paterson-Stockmeyer is m = 9.
The highest order m used in [5] for P_m(B) is m = 16, available with 6 matrix products using the Paterson-Stockmeyer method. Using (34) and (35) from [7] it is possible to evaluate P_16(B) with 5 matrix products and several possibilities for the real coefficients. The stability check proposed in [7, Ex. 3.1] gives a maximum error of 1.03 × 10^{-15} > u, and we checked that the numerical results were not accurate enough. The stability can be improved using expression (52) from [7], with s = 3 and p = 3, giving the following formulae for m = 15:

    y_02(B) = B^3(c_1 B^3 + c_2 B^2 + c_3 B),                                    (6)
    P_15(B) = −((y_02(B) + c_4 B^3 + c_5 B^2 + c_6 B)(y_02(B) + c_7 B^3 + c_8 B^2)
              + c_9 y_02(B) + c_10 B^3 + B^2/3628800 − B/40320 + I/720)B^3
              + B^2/24 − B/2 + I.
The stability check for the coefficients c_i given in Table 1, selected among all the possible real solutions for the coefficients, gives a maximum error of order u, in exchange for a lower order m = 15 < 16. The minus sign at the beginning of the expression for P_15(B) makes it possible to obtain real solutions for all the coefficients involved, as suggested in [7, p. 237]. With a cost of 5 matrix products the maximum approximation order available with Paterson-Stockmeyer is m = 12.
All the coefficients c_i that appear in expressions (4)–(6) were computed with the MATLAB R2018a Symbolic Math Toolbox, using 200 decimal digit arithmetic. Table 1 shows these values in IEEE double precision arithmetic.
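As with (4), a direct MATLAB transcription of (6) could be the following sketch (again, not the cosmpol code itself; c is assumed to hold the m = 15 column of Table 1):

  function P = p15_eval(B, c)
  % Evaluation of P_15(B) as in (6) with 5 matrix products (sketch).
  % c(1:10) are the m = 15 coefficients of Table 1.
  I   = eye(size(B));
  B2  = B*B;                                    % 1st matrix product
  B3  = B2*B;                                   % 2nd matrix product
  y02 = B3*(c(1)*B3 + c(2)*B2 + c(3)*B);        % 3rd matrix product
  Q   = (y02 + c(4)*B3 + c(5)*B2 + c(6)*B)*(y02 + c(7)*B3 + c(8)*B2) ...
        + c(9)*y02 + c(10)*B3;                  % 4th matrix product
  P   = -(Q + B2/3628800 - B/40320 + I/720)*B3 ...
        + B2/24 - B/2 + I;                      % 5th matrix product
  end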
3. Error analysis
In [2] an absolute forward error analysis of the Taylor approximation of the matrix cosine was developed. In [5] a combination of a relative forward and backward error analysis was developed for the same function. In this section we present a unified study of the error analysis for the computation of that matrix function, selecting among the three types of analysis the one giving the most efficient option for each degree m of the cosine Taylor approximation. The following theorem is used in this study:
Theorem 1 ([2]). Let h_l(x) = Σ_{i≥l} p_i x^i be a power series with radius of convergence w, let h̃_l(x) = Σ_{i≥l} |p_i| x^i, B ∈ C^{n×n} with ρ(B) < w, l ∈ N and t ∈ N with 1 ≤ t ≤ l. If t_0 is the multiple of t such that l ≤ t_0 ≤ l + t − 1 and

    β_t = max{ d_j^{1/j} : j = t, l, l+1, ..., t_0−1, t_0+1, t_0+2, ..., l+t−1 },

where d_j is an upper bound for ||B^j||, d_j ≥ ||B^j||, then ||h_l(B)|| ≤ h̃_l(β_t).
If we apply Theorem 1 with t = l, then ||h_l(B)|| ≤ h̃_l(β_l), where

    β_l = max{ d_j^{1/j} : j = l, l+1, ..., 2l−1 }.                              (7)

In [13, Sec. 4.1] the authors approximated β_min = min{ β_t^{(l)}, 1 ≤ t ≤ m+1 } by

    β_min ≈ max{ d_{l+1}^{1/(l+1)}, d_{l+2}^{1/(l+2)} },                         (8)

corresponding to the two first terms of (7).
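As a simple illustration of (8), for a given square matrix B and l = 8 one could compute (here the power norms are taken exactly for clarity; in practice only bounds or estimates d_j ≥ ||B^j|| are used and these powers are never formed explicitly):

  % Sketch of the approximation (8) for l = 8.
  l  = 8;
  d1 = norm(B^(l+1), 1);                        % plays the role of d_{l+1}
  d2 = norm(B^(l+2), 1);                        % plays the role of d_{l+2}
  beta_min = max(d1^(1/(l+1)), d2^(1/(l+2)));   % approximation (8)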
3.1. Absolute and relative forward errors
Let A ∈ C^{n×n} and B = A^2. Using (1), the absolute forward error in exact arithmetic of the Taylor approximation (2) of cos(A), denoted by E_f^(1), is

    E_f^(1) = ||cos(A) − P_m(B)|| = || Σ_{i≥m+1} f_{m,i}^(1) B^i ||,             (9)

where f_{m,i}^(1) = (−1)^i/(2i)!. This error analysis is used in [2, Sec. 4].
If ||B|| = ||A^2|| < acosh^2(2) ≈ 1.7343, then cos^{-1}(A) exists [5, Proposition 1] and it follows that the relative forward error of computing cos(A) in exact arithmetic, denoted by E_f^(2), is [5, Sec. 2.1]

    E_f^(2) = || cos^{-1}(A)(cos(A) − P_m(B)) || = || I − P_m(B) cos^{-1}(A) ||
            = || Σ_{i≥m+1} f_{m,i}^(2) B^i ||.                                   (10)
If we define g_{m+1}^(k)(x) = Σ_{i≥m+1} f_{m,i}^(k) x^i and g̃_{m+1}^(k)(x) = Σ_{i≥m+1} |f_{m,i}^(k)| x^i, k = 1, 2, and we apply Theorem 1, then

    E_f^(k) = || g_{m+1}^(k)(B) || ≤ g̃_{m+1}^(k)(β_t),                           (11)

for every t, 1 ≤ t ≤ m+1.
3.2. Relative backward error
The backward error ΔA of approximating cos(A) by the Taylor approximation T_{2m}(A) satisfies

    T_{2m}(A) = cos(A + ΔA).

From [5, Sec. 2.2] the backward error ΔA can be expressed as

    ΔA = Σ_{i≥m} b_{m,i} A^{2i+1},

where the coefficients b_{m,i} can be computed symbolically, see (8)–(11) of [5, Sec. 2.2] for details. Then, the relative backward error E_b in exact arithmetic of approximating cos(A) by T_{2m}(A) can be computed as

    E_b = ||ΔA|| / ||A|| = || Σ_{i≥m} b_{m,i} A^{2i+1} || / ||A||
        ≤ || Σ_{i≥m} b_{m,i} A^{2i} || = || Σ_{i≥m} b_{m,i} B^i ||.

If we define h_m(x) = Σ_{i≥m} b_{m,i} x^i and h̃_m(x) = Σ_{i≥m} |b_{m,i}| x^i, and we apply Theorem 1, then

    E_b = ||h_m(B)|| ≤ h̃_m(β_t),                                                 (12)

for every t, 1 ≤ t ≤ m. In [5], an error analysis combining the relative forward and backward error analyses in exact arithmetic was used.
Table 2: Values of Θ_f1(m), Θ_f2(m), Θ_b(m), and Θ(m) for m = 8, 12, 15.

           m = 8                m = 12               m = 15
Θ_f1       0.96                 6.59                 16.45
Θ_f2       0.91                 -                    -
Θ_b        0.94                 6.75                 9.91
Θ          0.9625107544271462   6.752349007371135    16.45123831556254
3.3. Computation of the Taylor order m and the scaling parameter s
Let Θ_fk(m), k = 1, 2, be

    Θ_fk(m) = max{ θ ≥ 0 : g̃_{m+1}^(k)(θ) = Σ_{i≥m+1} |f_{m,i}^(k)| θ^i ≤ u },   (13)

and let Θ_b(m) be

    Θ_b(m) = max{ θ ≥ 0 : h̃_m(θ) = Σ_{i≥m} |b_{m,i}| θ^i ≤ u },                  (14)

where u = 2^{-53} is the unit roundoff in double precision floating-point arithmetic.
We have used the MATLAB Symbolic Math Toolbox to compute Θ_fk(m), k = 1, 2, and Θ_b(m) for m = 8, 12, 15 in 250-decimal digit arithmetic, considering enough terms to obtain all the Θ values for each m with enough significant digits, and obtaining the coefficients symbolically. Note that Θ_b(15) needs more than 1500 terms to obtain three significant digits, similarly to what happens with Θ_b(16) and Θ_b(20) in [5, Sec. 2.2]. Then, a numerical zero-finder is invoked to determine the highest values Θ_fk(m), k = 1, 2, and Θ_b(m) such that

    g̃_{m+1}^(k)(Θ_fk(m)) = Σ_{i≥m+1} |f_{m,i}^(k)| Θ_fk(m)^i ≤ u

and

    h̃_m(Θ_b(m)) = Σ_{i≥m} |b_{m,i}| Θ_b(m)^i ≤ u

hold. The values of Θ_fk(m), k = 1, 2, and Θ_b(m) for m = 8, 12, 15 are shown in Table 2. For m = {1, 2, 4}, [5, Sec. 2.2] shows that Θ_f2(m) > Θ_b(m), and comparing [2, Table 2] and [5, Table 1] one gets that Θ_f1(m) ≈ Θ_f2(m). This is a normal behaviour since, for those values of m, it follows that cos(√Θ_f1(m)) ≈ 1 and, in that case, the forward absolute error bound (9) and the forward relative error bound (10) are approximately equal. Analogously to [5, Sec. 2], and to minimize the computational cost, we select Θ(m) = max{Θ_f1(m), Θ_f2(m), Θ_b(m)} for each m_k = {1, 2, 4, 8, 12, 15}, k = 1, 2, ..., 6, i.e. Θ_f1(m) for m = {1, 2, 4, 8, 15} and Θ_b(m) for m = 12.
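As an illustration of (13) for the absolute forward error (k = 1), where |f_{m,i}^(1)| = 1/(2i)!, a truncated series and a standard zero-finder closely reproduce the Θ_f1 values of Table 2 (this is only a double precision sketch; the paper's values were obtained with 250-digit symbolic arithmetic):

  % Sketch of (13) for the absolute forward error; gammaln avoids overflow in (2i)!.
  u      = 2^(-53);                 % unit roundoff
  m      = 8;                       % Taylor order (try also 12 and 15)
  idx    = m+1 : m+300;             % truncation of the series
  gtilde = @(theta) sum(exp(idx*log(theta) - gammaln(2*idx+1)));
  theta_f1 = fzero(@(theta) gtilde(theta) - u, [0.1, 20])
  % theta_f1 is approximately 0.96 for m = 8, matching the first row of Table 2.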
Then, considering (11) and taking into account the values of Θ(m) from Table 2, it follows that, for m = {1, 2, 4, 8, 15}, if β_t ≤ Θ_f1(m) the absolute forward error is lower than or equal to the unit roundoff:

    E_f^(1) ≤ g̃_{m+1}^(1)(β_t) ≤ g̃_{m+1}^(1)(Θ_f1(m)) ≤ u,

and using (8) one gets

    β_{m+1}^{(m)} ≈ max{ d_{m+1}^{1/(m+1)}, d_{m+2}^{1/(m+2)} },                 (15)

where the superscript (m) stands for the Taylor approximation order used (see (15) from [5]). For m = 12, considering (12), if β_min^{(m)} ≤ Θ_b(m), then the relative backward error is lower than or equal to the unit roundoff:

    E_b ≤ h̃_m(β_t) ≤ h̃_m(Θ_b(m)) ≤ u,

and using (8) one gets

    β_min^{(m)} ≈ max{ d_m^{1/m}, d_{m+1}^{1/(m+1)} },                           (16)

where the superscript (m) also stands for the Taylor approximation order used (see (16) from [5]).
We provide here a scaling algorithm without norm estimation of matrix powers and another one with norm estimation of matrix powers, similar to the algorithms developed in [5, Sec. 2.4]: if there exists a value m ≤ 15 such that β_min^{(m_k)} ≤ Θ(m), then one of the above conditions is verified and, in this case, we choose the lowest order m_k verifying it, with a scaling parameter s = 0. Otherwise, we choose the Taylor approximation of order 12 or 15 providing the lowest cost, with

    s = max{ 0, ⌈ (1/2) log_2( β_min^{(m_k)} / Θ(m_k) ) ⌉ },   m_k = 12 or 15.

Note that Θ(8) < Θ(12)/4 and Θ(12) > Θ(15)/4, and then only m_k = 12 and 15 are efficient orders for scaling.
The algorithm without estimation of norms of matrix powers uses bounds of matrix powers based on products of the norms of matrix powers previously computed, and is analogous to Algorithm 2 from [5]: for m_k ≤ 8, only B and B^2 are available and then, using Theorem 1 and (15), we take

    β_min^{(m_k)} = β_2 = max{ (||B^2||^{m_k/2} ||B||)^{1/(m_k+1)}, (||B^2||^{(m_k+2)/2})^{1/(m_k+2)} }
                  = (||B^2||^{m_k/2} ||B||)^{1/(m_k+1)}.
For m_k = 12 and 15, B^3 is also available and then, using Theorem 1, we take β_min^{(m_k)} = min{β_2, β_3}. The functions ms_selectNoNormEst and beta_NoNormEst from http://personales.upv.es/jorsasma/software/cosmpol.m are MATLAB implementations of the scaling algorithm with no estimation of norms of matrix powers and of the computation of β_2 and β_3, respectively.
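A rough MATLAB sketch of this selection, restricted to m = 8, 12, 15 for brevity and using only the β_2-type bound, could be the following (the actual implementation, ms_selectNoNormEst/beta_NoNormEst, also handles m = 1, 2, 4 and the refined bound β_3 built from ||B^3||_1):

  function [m, s] = select_order_scaling_sketch(B)
  % Rough sketch of the order/scaling selection without norm estimation.
  orders = [8 12 15];
  Theta  = [0.9625107544271462 6.752349007371135 16.45123831556254]; % Table 2
  nB  = norm(B, 1);
  nB2 = norm(B*B, 1);                         % in Algorithm 1, B^2 is reused later
  betafun = @(m) (nB2^(m/2) * nB)^(1/(m+1));  % beta_2-type bound, cf. (15)
  s = 0;
  for k = 1:numel(orders)
      m = orders(k);
      if betafun(m) <= Theta(k)
          return;                             % lowest order that needs no scaling
      end
  end
  m = 15;                                     % otherwise scale; the paper chooses
                                              % between m = 12 and 15 by cost
  s = max(0, ceil(0.5*log2(betafun(m)/Theta(end))));
  end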
The algorithm with estimation of norms of matrix powers uses the estimation of two matrix powers, taking into account the simplifications (15) and (16), where the values d_k are computed approximately by the block 1-norm estimation algorithm of [14]; it is also analogous to the one in [5]: it reduces the number of estimations by combining estimations of the values d_k based on products of norms of matrix powers previously computed or estimated, see function ms_selectNormEst from cosmpol.m.
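As an illustration of the kind of estimation involved (not the authors' ms_selectNormEst code), ||B^j||_1 can be estimated without forming B^j explicitly with MATLAB's block 1-norm estimator normest1, which implements [14], by passing it a function handle:

  function est = est_norm_power(B, j)
  % Sketch: estimate ||B^j||_1 with normest1 without forming B^j.
      est = normest1(@afun);
      function Z = afun(flag, X)
          switch flag
              case 'dim'                      % dimension of the operator
                  Z = size(B, 1);
              case 'real'                     % whether the operator is real
                  Z = isreal(B);
              case 'notransp'                 % Z = (B^j) * X
                  Z = X;
                  for k = 1:j, Z = B*Z; end
              case 'transp'                   % Z = (B^j)' * X
                  Z = X;
                  for k = 1:j, Z = B'*Z; end
          end
      end
  end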
4. Numerical experiments
In this section we compare the following three MATLAB functions:
• cosm: code based on the Padé rational approximation for the matrix cosine [3]. The MATLAB function cosm has an argument which allows us to compute cos(A) by means of just Padé approximants, or also using the real or the complex Schur decomposition. In these tests we did not use the Schur decomposition since the tests carried out in [5] showed that the Taylor algorithms from [5] with the Schur decomposition are more efficient than the Padé method from [3] with the Schur decomposition, with similar accuracy. The MATLAB code can be found at http://github.com/sdrelton/cosm_sinm.
• cosmtay: code based on the Taylor series for the matrix cosine [5]. This code can optionally use norm estimation; in this paper, we did not use it. The MATLAB code of this algorithm can be found at http://personales.upv.es/jorsasma/software/cosmtay.m.
• cosmpol: the new code presented in this paper for computing the matrix cosine. This code can also optionally use norm estimation but, as with cosmtay, we have not used norm estimation here.
4.1. Numerical tests
To test and compare the accuracy of the three algorithms we define the
following tests:
• Test 1: one hundred diagonalizable 128 × 128 real matrices with 1-norms varying from 2.32 to 220.04. These matrices have the form A = V^T D V, where D is diagonal with real and complex eigenvalues and V is an orthogonal matrix obtained as V = H/16, H being a Hadamard matrix, i.e. a square matrix whose entries are either +1 or −1 and whose rows are mutually orthogonal, with H^{-1} = H^T, where H^T is the transpose of H.
• Test 2: one hundred non-diagonalizable 128 × 128 real matrices whose 1-norms vary from 6.5 to 249.5. These matrices have the form A = V^T J V, where J is a Jordan matrix with real and complex eigenvalues. The algebraic multiplicity of the eigenvalues varies between 1 and 3. Matrix V is an orthogonal matrix obtained as V = H/16, H being a Hadamard matrix.
• Test 3: fifteen matrices with dimension lower than or equal to 128 from the Eigtool MATLAB package [15] and forty-four 128 × 128 real matrices from the function matrix of the Matrix Computation Toolbox [16]. Those matrices whose condition number could not be calculated were dropped from the test. In addition, we have scaled some matrices of this test so that their 1-norm is lower than or equal to 1024 and their matrix cosine can be calculated with the compared functions.
The "exact" matrix cosine is computed as cos(A) = V^T cos(D) V for the matrices of Test 1 and as cos(A) = V^T cos(J) V for the matrices of Test 2 (see [9, p. 10]), by using the MATLAB Symbolic Math Toolbox with 256 decimal digit arithmetic in all the computations. Following [4, Sec. 4.1], for the other matrices we used MATLAB symbolic versions of a scaled Padé rational approximation from [3] and of a scaled Taylor Paterson-Stockmeyer approximation [5, p. 67], both with 4096 decimal digit arithmetic and several orders m and/or scaling parameters s higher than the ones used by cosm and cosmtay, respectively, checking that their relative difference was small enough. The algorithm accuracy was tested by computing the relative error

    E = ||cos(A) − Ỹ||_1 / ||cos(A)||_1,

where Ỹ is the computed solution and cos(A) is the exact solution.
To compute the condition number of the matrix cosine function we have used the MATLAB function funm_condest1, which estimates the condition number for the matrix 1-norm.
Table 3 shows the computational costs. In this table, the computational cost of each algorithm has been calculated by counting the number of matrix products (M) of each code, since the cost of the rest of the operations is negligible compared to matrix products for sufficiently large matrices. The cost of the solution of the linear systems that appear in the code based on Padé approximations has been counted as 4/3 products because, from a computational point of view, the cost of that operation compared to the cost of a matrix product is approximately 4/3 (see Table C.1 from [9, p. 336]). According to the figures in this table, cosmpol is clearly faster than the other two routines.
Table 4 compares the relative errors. It shows the percentage of cases in which the relative errors of cosmpol are lower than the relative errors of the MATLAB codes cosm and cosmtay.
We have plotted in Figures 1, 2, and 3 the normwise relative errors (a), the performance profiles (b), the ratios of relative errors (c), E(cosmpol)/E(cosm) and E(cosmtay)/E(cosm) (to show whether these ratios are significant), and the ratios of matrix products (d), M(cosmpol)/M(cosm) and M(cosmtay)/M(cosm), for the three tests, respectively.
Table 3: Matrix products (M) for the three tests using the MATLAB functions cosmpol, cosmtay, and cosm. The values shown in columns cosmtay and cosm are the percentages of extra products carried out by these routines with respect to cosmpol.

          M(cosmpol)    M(cosmtay)    M(cosm)
Test 1    854           11.00%        32.20%
Test 2    871           10.67%        31.57%
Test 3    511            9.20%        31.70%
Table 4: Relative error comparison.

                            Test 1    Test 2    Test 3
E(cosmpol) < E(cosm)        97%       97%       71.27%
E(cosmpol) < E(cosmtay)     60%       55%       50.85%
In the performance profile, the α coordinate varies between 1 and 5 in steps of 0.1, and the p coordinate is the probability that the considered algorithm has a relative error lower than or equal to α times the smallest error over all methods. The ratios of relative errors are presented in decreasing order with respect to E(cosmpol)/E(cosm). The solid lines in Figures 1a, 2a and 3a represent the function k_cos u, where k_cos is the condition number of the matrix cosine function [9, Chap. 3] and u = 2^{-53} is the unit roundoff in double precision floating-point arithmetic.
In the light of the results shown in the tables and figures we can make the following analysis:
• Regarding numerical stability, Figure 1a shows that all the functions behave in a numerically stable way in Test 1. Figure 2a shows that in Test 2 the Taylor functions are numerically more stable than the Padé function cosm. Figure 3a shows that the three functions have a similar numerical stability in Test 3. Only for one matrix of this test do the three functions present a certain numerical instability, with a relative error more than 10^8 times higher than the solid line (see Figure 3a).
• The functions based on polynomial approximations are more accurate than the one based on Padé approximants, with the new function cosmpol being slightly more accurate than our former cosmtay function. The performance profiles (Figures 1b, 2b, and 3b) show that the graph of cosmpol is above the graphs of the other two functions, demonstrating that, in general, it is the most accurate. This is also shown by Table 4, where we see that cosmpol has a lower relative error than cosm in 71.27%–97% of the matrices, and a lower relative error than cosmtay in 50.85%–60% of the matrices.
Figure 1: Experimental results for Test 1: (a) normwise relative errors, (b) performance profile, (c) ratio of relative errors, (d) ratio of matrix products.
• Regarding the computational costs, Table 3 shows that cosmpol has a lower computational cost than the other two functions. This is also confirmed by Figures 1d, 2d, and 3d, which show that the ratio of matrix products of cosmpol to cosm, i.e. M(cosmpol)/M(cosm), is lower than 1 for all the test matrices, and in almost all cases lower than the ratio of cosmtay to cosm, i.e. M(cosmtay)/M(cosm).
4.2. The accelerated algorithm
We have implemented an “accelerated” version of Algorithm 1 that can use
one NVIDIA GPU. The accelerated version of the algorithm has been developed
with the aim of being efficient and easy to use, for which we implemented a
MATLAB mex file.
We used the CUDA and C++ languages to implement the mex file. This code accelerates those parts of the original MATLAB function that have a high computational cost, i.e. matrix multiplications.
Figure 2: Experimental results for Test 2: (a) normwise relative errors, (b) performance profile, (c) ratio of relative errors, (d) ratio of matrix products.
In this work we have taken the mex file developed in [8] and modified and adapted it to cosmpol, using the new method for selecting the degree m and the scaling parameter s from Section 3, corresponding to Step 2 of Algorithm 1, and the new methods for evaluating the Taylor matrix polynomial approximations of the matrix cosine from Section 2, corresponding to Step 4 of Algorithm 1. The mex function is unique, but it can perform the different operations required by the algorithm. This way, data (matrices) are kept in the device (GPU) memory between consecutive calls to the mex function. The GPU is mainly in charge of executing matrix multiplications, but it also performs low cost operations, e.g. the calculation of the 1-norm of a matrix, to avoid transmitting data between CPU and GPU only to perform these operations. Other low cost operations are carried out on the host CPU. The MATLAB mex function, called call_gpu, executes different operations (init, power, scale, ...) depending on the arguments with which it is called [8]. The only operation that has been changed in this paper is eval, which evaluates a matrix polynomial. Now, this operation implements Eqs. (3)–(6) using the coefficients of Table 1.
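The mex interface itself is not reproduced here. As a rough illustration of the same idea (keeping the matrices on the device so that the dominant matrix products run on the GPU), the following sketch uses MATLAB's built-in gpuArray type; note that the actual cosmpol GPU version relies on the call_gpu mex file instead:

  % Rough illustration only (not the call_gpu mex approach of cosmpol):
  % offloading the dominant matrix products to the GPU with gpuArray.
  A  = rand(4000);                % example matrix on the host
  Ad = gpuArray(A);               % copy A to the device once
  Bd = Ad*Ad;                     % B = A^2, computed and kept on the GPU
  B2 = Bd*Bd;                     % further powers also stay on the device
  B3 = B2*Bd;
  % ... the polynomial evaluations (3)-(6) and the double angle recurrence
  % would operate on these device matrices; only the final result needs to
  % be transferred back to the host, e.g. C = gather(Cd).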
With this implementation of Algorithm 1, the MATLAB script can be executed on either the CPU or the GPU.
Figure 3: Experimental results for Test 3: (a) normwise relative errors, (b) performance profile, (c) ratio of relative errors, (d) ratio of matrix products.
Table 5 shows the execution times (in seconds) obtained on both devices for randomly generated matrices. To obtain the CPU time we used two processors with 12 cores each (Intel Xeon CPU E5-2697 v2 @ 2.70 GHz); thus, the matrix multiplication used by MATLAB exploits the 24 cores available in our host. The GPU time was obtained on an NVIDIA Tesla K20Xm, a high performance device that features 2688 CUDA cores. We observe that our new algorithm cosmpol is faster than cosmtay on both devices, a reduction in time due to the saving in matrix products. The reduction in time from the CPU to the GPU is not as high for cosmpol as for cosmtay, but it is still important since the algorithm also relies on matrix multiplication, a highly optimized operation included in the cuBLAS library [17] for NVIDIA GPUs.

Table 5: Execution time (sec.) of the algorithm on the CPU and of the accelerated version on the GPU.

             cosmtay             cosmpol
    n      CPU      GPU       CPU      GPU
  1000     0.21     0.19      0.17     0.16
  1500     0.54     0.37      0.52     0.31
  2000     1.05     0.56      0.77     0.48
  2500     1.98     0.93      1.83     0.81
  3000     3.36     1.40      3.26     1.23
  3500     5.12     1.97      4.71     1.78
  4000     7.19     2.69      5.83     2.46
  4500     8.30     3.61      8.07     3.32
  5000    10.86     4.77     10.13     4.36
  5500    15.55     6.13     15.02     5.62
  6000    26.01     7.91     21.83     7.33
5. Conclusions
In this paper we have introduced a new method to compute the matrix
cosine function. This method is based on the Taylor approximation of the cosine
function using matrix polynomial evaluation methods from [7] and an improved
version of the scaling algorithm from [5]. From the different real solutions of the
coefficients of the formulae from [7], the coefficients selected in this paper give
the lowest maximum error in the stability check from [7], providing a maximum
order of approximation m_M = 15, i.e. a maximum order of approximation of
the cosine Taylor series equal to 30, and excellent accuracy results in numerical
tests.
A MATLAB implementation (cosmpol) based on that method has been developed and compared with other state-of-the-art algorithms based on Taylor approximations (cosmtay), which use the Paterson-Stockmeyer method for evaluating the Taylor matrix polynomial approximations, and on Padé approximants (cosm). Numerical experiments show that, in general, cosmpol has a lower computational cost in terms of matrix products than the other functions cosmtay and cosm; moreover, cosmpol is more accurate in the majority of the tests than the other codes, with a similar numerical stability.
Finally, we note that all the above discussion on the fast computation of the matrix cosine is applicable to the computation of the matrix sine, since sin(A) = cos(A − (π/2)I).
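For example, assuming a function cosmpol(A) that returns cos(A), the matrix sine could be obtained as in the following one-line sketch:

  % Matrix sine via the identity sin(A) = cos(A - (pi/2) I)
  % (assumes cosmpol(A) computes the matrix cosine).
  sinA = cosmpol(A - (pi/2)*eye(size(A)));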
Acknowledgements
This work has been partially supported by Spanish Ministerio de Economía y Competitividad and European Regional Development Fund (ERDF) grants TIN2014-59294-P and TIN2017-89314-P.
References
[1] E. Defez, J. Sastre, J. J. Ibáñez, P. A. Ruiz, Computing matrix functions arising in engineering models with orthogonal matrix polynomials, Math. Comput. Model. 57 (7-8) (2013) 1738–1743.
[2] J. Sastre, J. Ibáñez, P. Ruiz, E. Defez, Efficient computation of the matrix cosine, Appl. Math. Comput. 219 (2013) 7575–7585.
[3] A. H. Al-Mohy, N. J. Higham, S. D. Relton, New algorithms for computing the matrix sine and cosine separately or simultaneously, SIAM J. Sci. Comput. 37 (1) (2015) A456–A487.
[4] P. Alonso, J. Ibáñez, J. Sastre, J. Peinado, E. Defez, Efficient and accurate algorithms for computing matrix trigonometric functions, J. Comput. Appl. Math. 309 (2017) 325–332.
[5] J. Sastre, J. Ibáñez, P. Alonso, J. Peinado, E. Defez, Two algorithms for computing the matrix cosine functions, Appl. Math. Comput. 312 (2017) 66–77.
[6] M. S. Paterson, L. J. Stockmeyer, On the number of nonscalar multiplications necessary to evaluate polynomials, SIAM J. Comput. 2 (1) (1973) 60–66.
[7] J. Sastre, Efficient evaluation of matrix polynomials, Linear Algebra Appl. 539 (2018) 229–250.
[8] P. Alonso, J. Peinado, J. Ibáñez, J. Sastre, E. Defez, Computing matrix trigonometric functions with GPUs through Matlab, The Journal of Supercomputing (2018), online.
[9] N. J. Higham, Functions of Matrices: Theory and Computation, SIAM, Philadelphia, PA, USA, 2008.
[10] G. I. Hargreaves, N. J. Higham, Efficient algorithms for the matrix cosine and sine, Numer. Algorithms 40 (2005) 383–400.
[11] J. Sastre, J. J. Ibáñez, E. Defez, P. A. Ruiz, Accurate matrix exponential computation to solve coupled differential models in engineering, Math. Comput. Model. 54 (2011) 1835–1840.
[12] J. Sastre, J. J. Ibáñez, E. Defez, Boosting the computation of the matrix exponential, Appl. Math. Comput., in press.
[13] P. Ruiz, J. Sastre, J. Ibáñez, E. Defez, High performance computing of the matrix exponential, J. Comput. Appl. Math. 291 (2016) 370–379.
[14] N. J. Higham, Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation, ACM Trans. Math. Softw. 14 (4) (1988) 381–396.
[15] T. G. Wright, Eigtool, version 2.1 (2009). URL: web.comlab.ox.ac.uk/pseudospectra/eigtool.
[16] N. J. Higham, The Test Matrix Toolbox for MATLAB, Numerical Analysis Report No. 237, Manchester, England (Dec. 1993).
[17] NVIDIA, CUDA Toolkit, cuBLAS library, docs.nvidia.com/cuda/cublas, v9.2.148 Edition, last accessed July 2018.