
Fast Taylor polynomial evaluation for the computation of the matrix cosine

Jorge Sastre1
Instituto de Telecomunicaciones y Aplicaciones Multimedia

Javier Ibáñez1
Instituto de Instrumentación para Imagen Molecular

Pedro Alonso-Jordá1,∗
Department of Information Systems and Computation

Jesús Peinado1
Department of Information Systems and Computation

Emilio Defez1
Instituto de Matemática Multidisciplinar

Abstract

In this work we introduce a new method to compute the matrix cosine. It is based on recent matrix polynomial evaluation methods for the Taylor approximation and on a mixed forward and backward error analysis. These evaluation methods allow the Taylor polynomial approximation of the matrix cosine to be evaluated more efficiently than with the Paterson–Stockmeyer method. A sequential Matlab implementation of the new algorithm is provided, giving better efficiency and accuracy than state-of-the-art algorithms. Moreover, we provide an implementation in Matlab that can use NVIDIA GPUs easily and efficiently.

Keywords: matrix, matrix trigonometric functions, matrix cosine, Taylor, fast matrix polynomial evaluation, GPU computing

∗ Corresponding author. Email address: palonso@upv.es (Pedro Alonso-Jordá)
1 All authors belong to Universitat Politècnica de València.

Preprint submitted to Journal of Computational and Applied Mathematics, September 4, 2018

1. Introduction

The exact solution of many engineering processes described by second-order differential equations is given in terms of the trigonometric matrix functions sine and/or cosine. This is the case, for instance, of the wave problem. The most popular state-of-the-art algorithms used to calculate these matrix functions are based on polynomial and rational approximations with scaling and recovering techniques [1, 2, 3, 4, 5]. The Paterson–Stockmeyer method [6] is the one most widely used to compute the matrix polynomials that appear in these approximations at a reduced computational cost. Recently, a new family of methods for the evaluation of general matrix polynomials has been proposed [7], which reduces the number of matrix products needed to evaluate a polynomial with respect to the Paterson–Stockmeyer method. In this work, we present competitive algorithms for the computation of the matrix cosine based on the evaluation of Taylor approximations using those methods. Sequential and NVIDIA GPU based Matlab implementations of the new algorithms are given. The basic computational kernel of algorithms based on Taylor approximations is matrix multiplication. This kernel can be executed very rapidly on accelerator devices such as GPUs (Graphics Processing Units). In this paper we have exploited this fact, together with our previous experience on this subject [8], to build a Matlab script plus a mex file capable of executing the new algorithm very efficiently.

The next section presents a scaling and squaring Taylor algorithm for computing the matrix cosine based on the methods described in [7]. Section 3 describes a forward and backward error analysis for computing the Taylor approximation using our algorithm. In Section 4, we show some numerical results for the Matlab sequential and GPU implementations, from both the performance and the accuracy points of view. Finally, Section 5 gives some conclusions.

2. Taylor algorithm for computing the matrix cosine

The matrix cosine can be defined for all A ∈ C^{n×n} by the series

    \cos(A) = \sum_{i=0}^{\infty} \frac{(-1)^i A^{2i}}{(2i)!}.    (1)

Let

    T_{2m}(A) = \sum_{i=0}^{m} p_i A^{2i} = \sum_{i=0}^{m} p_i B^i \equiv P_m(B),    (2)

be the Taylor approximation of order 2m of cos(A), where p_i = (-1)^i/(2i)! and B = A^2. Algorithm 1 shows a general algorithm for computing the matrix cosine based on the Taylor series. Since (2) is accurate only near the origin, to compute cos(A) from the Taylor approximation it is often necessary to scale the matrix B and then recover cos(A) from the Taylor approximation of the cosine of the scaled matrix. Scaling the matrix B by an integer s > 0 consists of computing B := 4^{-s}B (Step 3).

Algorithm 1. Given a matrix A ∈ C^{n×n} and a maximum order m_M ∈ N, this algorithm computes C = cos(A) by a Taylor approximation of order 2m_k ≤ 2m_M, where the m_k are the optimal degrees of the Taylor polynomial, i.e. the maximum degrees of the Taylor polynomial that can be evaluated for a given number of matrix products.

1: B = A^2
2: Scaling phase: choose m_k ≤ m_M and an integer scaling parameter s for the Taylor approximation with scaling.
3: B = B/4^s
4: Compute C = P_{m_k}(B)
5: for i = 1:s do
6:     C = 2C^2 − I
7: end for

Once the cosine of the scaled matrix has been computed, cos(A) can be recovered (Steps 5–7) by repeatedly applying the double angle formula cos(2X) = 2 cos^2(X) − I [9, p. 288].
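The overall structure translates directly into MATLAB. The following is a minimal sketch, where select_order_and_scaling and eval_Pm are hypothetical stubs standing for the selection procedure of Section 3.3 and the evaluation formulae (3)–(6) below:

```matlab
% Minimal sketch of Algorithm 1. The helpers select_order_and_scaling
% and eval_Pm are hypothetical stubs for Section 3.3 and Eqs. (3)-(6).
function C = cos_taylor_sketch(A)
B = A * A;                              % Step 1: B = A^2
[mk, s] = select_order_and_scaling(B);  % Step 2: choose degree and scaling
B = B / 4^s;                            % Step 3: scale B
C = eval_Pm(B, mk);                     % Step 4: evaluate P_mk(B)
for i = 1:s                             % Steps 5-7: double angle recovery
    C = 2 * (C * C) - eye(size(A));
end
end
```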

The matrix A can be preprocessed to reduce its norm as described in [10, Alg. 1.2]; this procedure will not be discussed in this paper. Step 4 was traditionally performed using the Paterson–Stockmeyer method [6]; in this paper, however, we use a more efficient method for evaluating C = P_{m_k}(B) based on [7]. This method depends on the value of m_k selected in Step 2 of Algorithm 1 (from now on we will write m instead of m_k for simplicity). Below we analyse each case.

For m = 1, 2 and 4, similarly to (10) from [11], the Taylor polynomials P_m(B) can be computed by using the following expressions:

    P_1(B) = -B/2 + I,    (3)
    P_2(B) = (B^2/12 - B)/2 + I,
    P_4(B) = (((B^2/56 - B)/30 + I)B^2/12 - B)/2 + I.
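For instance, a minimal MATLAB sketch of the P_4 case of (3), which costs only two matrix products once B = A^2 is available:

```matlab
% Sketch: evaluate P4(B) from (3) with 2 matrix products (B = A^2 given).
I  = eye(size(B));
B2 = B * B;                                     % product 1
P4 = (((B2/56 - B)/30 + I) * B2/12 - B)/2 + I;  % product 2
```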

Following [7, Ex. 3.1], P_8(B) can be evaluated by using the following formulae:

    y_{02}(B) = B^2(c_1 B^2 + c_2 B),    (4)
    P_8(B) = (y_{02}(B) + c_3 B^2 + c_4 B)(y_{02}(B) + c_5 B^2) + c_6 y_{02}(B) + B^2/24 - B/2 + I,

with a cost of 3 matrix products. With that cost the maximum approximation order available with Paterson–Stockmeyer is m = 6. The coefficients c_i for IEEE double precision arithmetic are given in Table 1, see [7, Table 4].
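A minimal MATLAB sketch of (4), with the m = 8 column of Table 1 hard-coded:

```matlab
% Sketch: evaluate P8(B) from (4) with 3 matrix products, using the
% m = 8 coefficients c(1)..c(6) of Table 1 (B = A^2 given).
c = [ 2.186201576339059e-7, -2.623441891606870e-5, ...
      6.257028774393310e-3, -4.923675742167775e-1, ...
      1.441694411274536e-4,  5.023570505224926e1 ];
I   = eye(size(B));
B2  = B * B;                                          % product 1
y02 = B2 * (c(1)*B2 + c(2)*B);                        % product 2
P8  = (y02 + c(3)*B2 + c(4)*B) * (y02 + c(5)*B2) ...  % product 3
      + c(6)*y02 + B2/24 - B/2 + I;
```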

In the following we show that the most efficient methods proposed in [7] to evaluate the Taylor polynomial for m > 8 are not accurate enough for the matrix cosine approximation. Therefore, other possibilities are proposed to increase accuracy in exchange for a higher cost. Despite this higher cost, the proposed matrix polynomial evaluation methods are still more efficient than the Paterson–Stockmeyer method.

Table 1: Coefficients for computing the matrix cosine Taylor approximation.

        m = 8                      m = 12                     m = 15
c1       2.186201576339059e-07      1.269542268337734e-12      6.140022498994532e-17
c2      -2.623441891606870e-05     -3.503936660612145e-10     -2.670909787062621e-14
c3       6.257028774393310e-03      1.135275478038335e-07      1.438284920333222e-11
c4      -4.923675742167775e-01     -2.027712316612395e-05     -1.050202496489896e-08
c5       1.441694411274536e-04      1.647243380001247e-03      4.215975785860907e-06
c6       5.023570505224926e+01     -6.469859264308602e-01     -1.238347173261210e-03
c7                                 -4.008589447357360e-05     -3.234597615453410e-09
c8                                  9.187724869020796e-03      9.292820886910254e-07
c9                                 -1.432942184841715e-02      2.466381973203188e-01
c10                                 4.555439797286385e-03     -9.369018510939971e-10

Following [12, Sec. 3.2], and similarly to [7, Ex. 5.1], with a cost of 4 matrix products it is possible to obtain a Taylor based approximation P_{16}(B) of the matrix cosine of order m = 15, with several real solutions for the coefficients involved. However, for all the real solutions rounded to IEEE double precision arithmetic, the stability check proposed in [7, Ex. 3.1] gives errors of order 10^{-14} > u or greater, where u is the unit roundoff in IEEE double precision arithmetic, i.e. u = 2^{-53} ≈ 1.11 × 10^{-16}. We have checked that these evaluation formulae give results of reduced accuracy in numerical tests.

Then, for a cost of 4 matrix products we use (34) and (35) from [7] to evaluate P_{12}(B) by means of the following formulae:

    y_{02}(B) = B^3(c_1 B^3 + c_2 B^2 + c_3 B),    (5)
    P_{12}(B) = (y_{02}(B) + c_4 B^3 + c_5 B^2 + c_6 B)(y_{02}(B) + c_7 B^3 + c_8 B^2) + c_9 y_{02}(B) + c_{10} B^3 + B^2/24 - B/2 + I,

where the coefficients c_i are given in Table 1. From the different real solutions for the coefficients, we selected the ones giving the lowest maximum error in the stability test, similarly to [7, Ex. 3.1], with errors lower than 1.31 × 10^{-16} = O(u). Equation (5) provides a lower order than 15, but behaves in a stable manner and is, in turn, more efficient than the Paterson–Stockmeyer method, since with a cost of 4 matrix products the maximum approximation order available with Paterson–Stockmeyer is m = 9.
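Analogously to the P_8 case, a minimal MATLAB sketch of (5), with c now holding the m = 12 column of Table 1:

```matlab
% Sketch: evaluate P12(B) from (5) with 4 matrix products, using the
% m = 12 coefficients c(1)..c(10) of Table 1 (B = A^2 given).
I   = eye(size(B));
B2  = B * B;                                  % product 1
B3  = B2 * B;                                 % product 2
y02 = B3 * (c(1)*B3 + c(2)*B2 + c(3)*B);      % product 3
P12 = (y02 + c(4)*B3 + c(5)*B2 + c(6)*B) ...
      * (y02 + c(7)*B3 + c(8)*B2) ...         % product 4
      + c(9)*y02 + c(10)*B3 + B2/24 - B/2 + I;
```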

The highest order m used in [5] for P_m(B) is m = 16, available with 6 matrix products using the Paterson–Stockmeyer method. Using (34) and (35) from [7] it is possible to evaluate P_{16}(B) with 5 matrix products and several possibilities of real coefficients. The stability check proposed in [7, Ex. 3.1] gives a maximum error of 1.03 × 10^{-15} > u, and we checked that the numerical results were not accurate enough. The stability can be improved using expression (52) from [7], with s = 3 and p = 3, giving the following formulae for m = 15:

    y_{02}(B) = B^3(c_1 B^3 + c_2 B^2 + c_3 B),    (6)
    P_{15}(B) = -((y_{02}(B) + c_4 B^3 + c_5 B^2 + c_6 B)(y_{02}(B) + c_7 B^3 + c_8 B^2) + c_9 y_{02}(B) + c_{10} B^3 + B^2/3628800 - B/40320 + I/720)B^3 + B^2/24 - B/2 + I.

The stability check for the coefficients c_i given in Table 1, selected among all the possible real solutions for the coefficients, gives a maximum error of order u, in exchange for a lower order m = 15 < 16. The minus sign at the beginning of the expression for P_{15}(B) allows real solutions to be obtained for all the coefficients involved, as suggested in [7, p. 237]. With a cost of 5 matrix products the maximum approximation order available with Paterson–Stockmeyer is m = 12.
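A minimal MATLAB sketch of (6), with c holding the m = 15 column of Table 1 (note that the low-order terms 1/720, 1/40320 and 1/3628800 are 1/6!, 1/8! and 1/10!):

```matlab
% Sketch: evaluate P15(B) from (6) with 5 matrix products, using the
% m = 15 coefficients c(1)..c(10) of Table 1 (B = A^2 given).
I   = eye(size(B));
B2  = B * B;                                      % product 1
B3  = B2 * B;                                     % product 2
y02 = B3 * (c(1)*B3 + c(2)*B2 + c(3)*B);          % product 3
P15 = -((y02 + c(4)*B3 + c(5)*B2 + c(6)*B) ...
        * (y02 + c(7)*B3 + c(8)*B2) ...           % product 4
        + c(9)*y02 + c(10)*B3 ...
        + B2/3628800 - B/40320 + I/720) * B3 ...  % product 5
      + B2/24 - B/2 + I;
```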

All the coefficients c_i that appear in expressions (4)–(6) were computed with the MATLAB R2018a Symbolic Math Toolbox, using 200-decimal-digit arithmetic. Table 1 shows these values in IEEE double precision arithmetic.

3. Error analysis

In [2] an absolute forward error analysis of the Taylor approximation for the matrix cosine was developed. In [5] a combination of a relative forward and backward error analysis was developed for the same function. In this section we present a unified study of the error analysis for the computation of that matrix function, selecting, among the three types of analysis, the one giving the most efficient option for each degree m of the cosine Taylor approximation. The following theorem is used in this study:

Theorem 1 ([2]). Let h_l(x) = \sum_{i \ge l} p_i x^i be a power series with radius of convergence w, let \tilde{h}_l(x) = \sum_{i \ge l} |p_i| x^i, let B ∈ C^{n×n} with ρ(B) < w, and let l ∈ N and t ∈ N with 1 ≤ t ≤ l. If t_0 is the multiple of t such that l ≤ t_0 ≤ l + t − 1 and

    \beta_t = \max\{ d_j^{1/j} : j = t, l, l+1, \dots, t_0-1, t_0+1, t_0+2, \dots, l+t-1 \},

where d_j is an upper bound for \|B^j\|, d_j \ge \|B^j\|, then

    \|h_l(B)\| \le \tilde{h}_l(\beta_t).

If we apply Theorem 1 with t = l, then \|h_l(B)\| \le \tilde{h}_l(\beta_l), where

    \beta_l = \max\{ d_j^{1/j} : j = l, l+1, \dots, 2l-1 \}.    (7)

In [13, Sec. 4.1] the authors approximated \beta_{\min} = \min\{ \beta_t^{(l)}, 1 \le t \le m+1 \} by

    \beta_{\min} \approx \max\{ d_{l+1}^{1/(l+1)}, d_{l+2}^{1/(l+2)} \},    (8)

corresponding to the first two terms of (7).
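For illustration, (8) reads in MATLAB as follows; here the powers are formed explicitly only for clarity, whereas in practice the d_j are bounds or block 1-norm estimates (see Section 3.3):

```matlab
% Sketch of (8): approximate beta_min from bounds d_j of ||B^j||_1.
% Forming B^(l+1) and B^(l+2) explicitly is for illustration only; the
% algorithms below use bounds or 1-norm estimates instead.
l  = 8;
d1 = norm(B^(l+1), 1);
d2 = norm(B^(l+2), 1);
beta_min = max(d1^(1/(l+1)), d2^(1/(l+2)));
```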

3.1. Absolute and relative forward errors

Let A ∈ C^{n×n} and B = A^2. Using (1), the absolute forward error in exact arithmetic of the Taylor approximation (2) to cos(A), denoted by E_f^{(1)}, is

    E_f^{(1)} = \| \cos(A) - P_m(B) \| = \Big\| \sum_{i \ge m+1} f_{m,i}^{(1)} B^i \Big\|,    (9)

where f_{m,i}^{(1)} = (-1)^i/(2i)!. This error analysis is used in [2, Sec. 4].

If \|B\| = \|A^2\| < \mathrm{acosh}^2(2) \approx 1.7343, then \cos^{-1}(A) exists [5, Proposition 1] and it follows that the relative forward error of computing cos(A) in exact arithmetic, denoted by E_f^{(2)}, is [5, Sec. 2.1]

    E_f^{(2)} = \| \cos^{-1}(A) (\cos(A) - P_m(B)) \| = \| I - P_m(B) \cos^{-1}(A) \| = \Big\| \sum_{i \ge m+1} f_{m,i}^{(2)} B^i \Big\|.    (10)

If we define g_{m+1}^{(k)}(x) = \sum_{i \ge m+1} f_{m,i}^{(k)} x^i and \tilde{g}_{m+1}^{(k)}(x) = \sum_{i \ge m+1} |f_{m,i}^{(k)}| x^i, k = 1, 2, and we apply Theorem 1, then

    E_f^{(k)} = \| g_{m+1}^{(k)}(B) \| \le \tilde{g}_{m+1}^{(k)}(\beta_t),    (11)

for every t, 1 \le t \le m+1.

3.2. Relative backward error

The backward error ΔA of approximating cos(A) by the Taylor approximation T_{2m}(A) verifies

    T_{2m}(A) = \cos(A + \Delta A).

From [5, Sec. 2.2] the backward error ΔA can be expressed as

    \Delta A = \sum_{i \ge m} b_{m,i} A^{2i+1},

where the coefficients b_{m,i} can be computed symbolically; see (8)–(11) of [5, Sec. 2.2] for details. Then, the relative backward error E_b in exact arithmetic of approximating cos(A) by T_{2m}(A) can be computed as

    E_b = \frac{\|\Delta A\|}{\|A\|} = \frac{\big\| \sum_{i \ge m} b_{m,i} A^{2i+1} \big\|}{\|A\|} \le \Big\| \sum_{i \ge m} b_{m,i} A^{2i} \Big\| = \Big\| \sum_{i \ge m} b_{m,i} B^i \Big\|.

If we define h_m(x) = \sum_{i \ge m} b_{m,i} x^i and \tilde{h}_m(x) = \sum_{i \ge m} |b_{m,i}| x^i, and we apply Theorem 1, then

    E_b = \| h_m(B) \| \le \tilde{h}_m(\beta_t),    (12)

for every t, 1 \le t \le m. In [5] an error analysis was used that combines the relative forward and backward error analyses in exact arithmetic.

Table 2: Values of Θ_{f1}(m), Θ_{f2}(m), Θ_b(m), and Θ(m) for m = 8, 12, 15.

            m = 8                 m = 12               m = 15
Θ_f1        0.96                  6.59                 16.45
Θ_f2        0.91                  -                    -
Θ_b         0.94                  6.75                 9.91
Θ           0.9625107544271462    6.752349007371135    16.45123831556254

3.3. Computation of the Taylor order m and the scaling parameter s

Let Θ_{fk}(m), k = 1, 2, be

    \Theta_{fk}(m) = \max\Big\{ \theta \ge 0 : \tilde{g}_{m+1}^{(k)}(\theta) = \sum_{i \ge m+1} |f_{m,i}^{(k)}| \theta^i \le u \Big\},    (13)

and let Θ_b(m) be

    \Theta_b(m) = \max\Big\{ \theta \ge 0 : \tilde{h}_m(\theta) = \sum_{i \ge m} |b_{m,i}| \theta^i \le u \Big\},    (14)

where u = 2^{-53} is the unit roundoff in double precision floating-point arithmetic. We have used the MATLAB Symbolic Math Toolbox to compute Θ_{fk}(m), k = 1, 2, and Θ_b(m) for m = 8, 12, 15 in 250-decimal-digit arithmetic, considering enough terms to obtain all the Θ values for each m with enough significant digits, and obtaining the coefficients symbolically. Note that Θ_b(15) needs more than 1500 terms to obtain three significant digits, similarly to what happens with Θ_b(16) and Θ_b(20) in [5, Sec. 2.2]. Then, a numerical zero-finder is invoked to determine the highest values Θ_{fk}(m), k = 1, 2, and Θ_b(m) such that

    \tilde{g}_{m+1}^{(k)}(\Theta_{fk}(m)) = \sum_{i \ge m+1} |f_{m,i}^{(k)}| \, \Theta_{fk}^i(m) \le u,

and

    \tilde{h}_m(\Theta_b(m)) = \sum_{i \ge m} |b_{m,i}| \, \Theta_b^i(m) \le u

hold. The values of Θ_{fk}(m), k = 1, 2, and Θ_b(m) for m = 8, 12, 15 are depicted in Table 2.
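For the absolute forward error case the series coefficients are |f_{m,i}^{(1)}| = 1/(2i)!, so Θ_{f1}(m) can be reproduced in double precision with a truncated series and a scalar zero-finder. A minimal sketch follows (the truncation length and the search bracket are our choices for this illustration):

```matlab
% Sketch: reproduce Theta_f1(8) ~ 0.96 in double precision.
% |f^(1)_{m,i}| = 1/(2i)!; truncating at 300 terms and bracketing the
% root in [0.1, 2] are choices made for this illustration only.
m = 8;  u = 2^-53;  i = m+1:300;
gtilde   = @(theta) sum(theta.^i ./ factorial(2*i));
theta_f1 = fzero(@(theta) gtilde(theta) - u, [0.1, 2]);
```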

For m = {1, 2, 4}, [5, Sec. 2.2] shows that Θ_{f2}(m) > Θ_b(m), and comparing [2, Table 2] and [5, Table 1] one gets that Θ_{f1}(m) ≈ Θ_{f2}(m). This is normal behaviour since, for those values of m, it follows that cos(√(Θ_{f1}(m))) ≈ 1 and, in that case, the forward absolute error bound (9) and the forward relative error bound (10) are approximately equal. Analogously to [5, Sec. 2], and to minimize the computational cost, we select Θ(m) = max{Θ_{f1}(m), Θ_{f2}(m), Θ_b(m)} for each m_k = {1, 2, 4, 8, 12, 15}, k = 1, 2, ..., 6, i.e. Θ_{f1}(m) for m = {1, 2, 4, 8, 15} and Θ_b(m) for m = 12. Then, considering (11) and taking into account the values of Θ(m) from Table 2, it follows that, for m = {1, 2, 4, 8, 15}, if β_t ≤ Θ_{f1}(m) then the absolute forward error is lower than or equal to the unit roundoff:

    E_f^{(1)} \le \tilde{g}_{m+1}^{(1)}(\beta_t) \le \tilde{g}_{m+1}^{(1)}(\Theta_{f1}(m)) \le u,

and using (8) one gets

    \beta_{\min}^{(m)} \approx \max\{ d_{m+1}^{1/(m+1)}, d_{m+2}^{1/(m+2)} \},    (15)

where the superscript (m) stands for the Taylor approximation order used (see (15) from [5]). For m = 12 and considering (12), if \beta_{\min}^{(m)} \le \Theta_b(m), then the relative backward error is lower than or equal to the unit roundoff:

    E_b \le \tilde{h}_m(\beta_t) \le \tilde{h}_m(\Theta_b(m)) \le u,

and using (8) one gets

    \beta_{\min}^{(m)} \approx \max\{ d_m^{1/m}, d_{m+1}^{1/(m+1)} \},    (16)

where the superscript (m) again stands for the Taylor approximation order used (see (16) from [5]).

We provide here a scaling algorithm without norm estimations of matrix powers and another one with norm estimations of matrix powers, similar to the algorithms developed in [5, Sec. 2.4]. If there exists a value m ≤ 15 such that β_min^{(m_k)} ≤ Θ(m), then one of the above conditions is verified and, in this case, we choose the lowest order m_k verifying it, with scaling parameter s = 0. Otherwise, we choose the Taylor approximation of order 12 or 15 providing the lower cost, with

    s = \max\left( 0, \left\lceil \frac{1}{2} \log_2 \frac{\beta_{\min}^{(m_k)}}{\Theta(m_k)} \right\rceil \right), \quad m_k = 12 \text{ or } 15.

Note that Θ(8) < Θ(12)/4 and Θ(12) > Θ(15)/4, and hence only m_k = 12 and 15 are efficient orders for scaling.
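In MATLAB the choice of s can be sketched as follows, with beta12 and beta15 holding β_min^{(12)} and β_min^{(15)} from (15)–(16); the variable names and the explicit ceil are our reading of the formula above:

```matlab
% Sketch: smallest integer s >= 0 with 4^(-s) * beta <= Theta(mk),
% for mk = 12 and 15 (Theta values taken from Table 2).
Theta = [6.752349007371135, 16.45123831556254];   % Theta(12), Theta(15)
s12 = max(0, ceil(0.5 * log2(beta12 / Theta(1))));
s15 = max(0, ceil(0.5 * log2(beta15 / Theta(2))));
% the order (12 or 15) giving the lower overall cost is then chosen
```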

The algorithm without estimation of norms of matrix powers uses bounds of matrix powers based on products of norms of matrix powers previously computed, and is analogous to Algorithm 2 from [5]. For m_k ≤ 8, only B and B^2 are available and then, using Theorem 1 and (15), we take

    \beta_{\min}^{(m_k)} = \beta_2 = \max\{ (\|B^2\|^{m_k/2} \|B\|)^{1/(m_k+1)}, (\|B^2\|^{(m_k+2)/2})^{1/(m_k+2)} \} = (\|B^2\|^{m_k/2} \|B\|)^{1/(m_k+1)}.

For m_k = 12 and 15, B^3 is also available and then, using Theorem 1, we take \beta_{\min}^{(m_k)} = \min\{\beta_2, \beta_3\}. The functions ms_selectNoNormEst and beta_NoNormEst from http://personales.upv.es/jorsasma/software/cosmpol.m are MATLAB implementations of the scaling algorithm with no estimation of norms of matrix powers and of the computation of β_2 and β_3, respectively.
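For instance, the β_2 bound above reads in MATLAB as follows (mk, B and B2 = B^2 given; the full code, including β_3, is in beta_NoNormEst from cosmpol.m):

```matlab
% Sketch of beta_2 for mk <= 8, when only B and B2 = B^2 are available:
% beta_2 = ( ||B2||^(mk/2) * ||B|| )^(1/(mk+1)).
nB    = norm(B, 1);
nB2   = norm(B2, 1);
beta2 = (nB2^(mk/2) * nB)^(1/(mk+1));
```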


The algorithm with estimation of norms of matrix powers uses the estimation of two matrix powers, taking into account the simplifications (15) and (16), where the values d_k are computed approximately by the block 1-norm estimation algorithm of [14]; it is also analogous to that of [5]. It reduces the number of estimations by combining estimations of the values d_k based on products of norms of matrix powers previously computed or estimated; see the function ms_selectNormEst from cosmpol.m.

4. Numerical experiments

In this section, we compare the new MATLAB function developed in this paper, cosmpol, with two other functions:

• cosm: code based on the Padé rational approximation for the matrix cosine [3]. The MATLAB function cosm has an argument which allows us to compute cos(A) by means of just Padé approximants, or also using the real or the complex Schur decomposition. In these tests we did not use the Schur decomposition since the tests carried out in [5] showed that the Taylor algorithms from [5] with the Schur decomposition provide higher efficiency than the Padé method from [3] with the Schur decomposition, with similar accuracy. The MATLAB code can be found at http://github.com/sdrelton/cosm_sinm.

• cosmtay: code based on the Taylor series for the matrix cosine [5]. This code allows us to use norm estimation or not; in this paper, we did not use norm estimation. The MATLAB code of this algorithm can be found at http://personales.upv.es/jorsasma/software/cosmtay.m.

• cosmpol: the new code presented in this paper for computing the matrix cosine. This code is also able to use norm estimation or not but, as with cosmtay, we have not used norm estimation here.

4.1. Numerical tests

To test and compare the accuracy of the three algorithms we define the following tests:

• Test 1: one hundred diagonalizable 128 × 128 real matrices with 1-norms varying from 2.32 to 220.04. These matrices have the form A = V^T D V, where D is diagonal with real and complex eigenvalues and V is an orthogonal matrix obtained as V = H/16, H being a Hadamard matrix, i.e. a square matrix whose entries are either +1 or −1 and whose rows are mutually orthogonal, so that H H^T = nI (a minimal generation sketch is given after this list).

• Test 2: one hundred non-diagonalizable 128 × 128 real matrices whose 1-norms vary from 6.5 to 249.5. These matrices have the form A = V^T J V, where J is a Jordan matrix with real and complex eigenvalues whose algebraic multiplicities vary between 1 and 3, and V is an orthogonal matrix obtained as V = H/16, H being a Hadamard matrix.

• Test 3: fifteen matrices with dimensions lower than or equal to 128 from the Eigtool MATLAB package [15] and forty-four 128 × 128 real matrices from the function matrix of the Matrix Computation Toolbox [16]. The matrices whose condition number could not be calculated were dropped from the test. In addition, we scaled some matrices of this test so that their 1-norm is lower than or equal to 1024 and their matrix cosine can be computed with the compared functions.
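As an illustration, the following is a simplified sketch of a Test-1-style matrix with a known cosine; as simplifications of the construction above, we restrict it to real eigenvalues and scale the Hadamard matrix by sqrt(n), which makes V exactly orthogonal:

```matlab
% Simplified sketch of a Test-1-style matrix with a known cosine.
% Simplifications: real eigenvalues only, and V = H/sqrt(n), so that
% V is exactly orthogonal (H*H' = n*I for a Hadamard matrix H).
n = 128;
H = hadamard(n);
V = H / sqrt(n);                            % orthogonal matrix
D = diag(20 * randn(n, 1));                 % diagonal eigenvalue matrix
A = V' * D * V;                             % test matrix
cosA_exact = V' * diag(cos(diag(D))) * V;   % reference cos(A)
```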

The “exact” matrix cosine is computed as cos(A) = V^T cos(D) V for the matrices of Test 1, and as cos(A) = V^T cos(J) V for the matrices of Test 2 (see [9, p. 10]), by using MATLAB's Symbolic Math Toolbox with 256-decimal-digit arithmetic in all the computations. Following [4, Sec. 4.1], for the other matrices we used MATLAB symbolic versions of a scaled Padé rational approximation from [3] and of a scaled Taylor Paterson–Stockmeyer approximation [5, p. 67], both with 4096-decimal-digit arithmetic and several orders m and/or scaling parameters s higher than the ones used by cosm and cosmtay, respectively, checking that their relative difference was small enough. The accuracy of the algorithms was tested by computing the relative error

    E = \frac{\| \cos(A) - \tilde{Y} \|_1}{\| \cos(A) \|_1},

where \tilde{Y} is the computed solution and cos(A) is the exact solution.
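Continuing the sketch above, this relative error for a computed solution Y is obtained as:

```matlab
% Relative 1-norm error of a computed solution against the reference
% cosine cosA_exact from the previous sketch:
Y = cosmpol(A);
E = norm(cosA_exact - Y, 1) / norm(cosA_exact, 1);
```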

To compute the condition number of the matrix cosine function we have used the MATLAB function funm_condest1, which estimates the condition number for the matrix 1-norm.

Table 3 shows the computational costs. In this table, the computational cost of each algorithm has been calculated by counting the number of matrix products (M) of each code, since the cost of the remaining operations is negligible compared to that of the matrix products for large enough matrices. The cost of the solution of the linear systems that appear in the code based on Padé approximations has been counted as 4/3 of a matrix product because, from a computational point of view, the cost of that operation is approximately 4/3 the cost of a matrix product (see Table C.1 from [9, p. 336]). According to the figures in this table, cosmpol is clearly faster than the other two routines.

To compare the relative errors, see Table 4. This table shows the percentage of cases in which the relative errors of cosmpol are lower than the relative errors of the MATLAB codes cosm and cosmtay.

We have plotted in Figures 1, 2, and 3 the normwise relative errors (a), the performance profiles (b), and the ratios of relative errors (c), to show whether these ratios are significant:

    E(cosmpol)/E(cosm), E(cosmtay)/E(cosm),

and the ratios of matrix products (d):

    M(cosmpol)/M(cosm), M(cosmtay)/M(cosm),

for the three tests, respectively. In the performance profiles, the α coordinate varies between 1 and 5 in steps of 0.1, and the p coordinate is the probability that the considered algorithm has a relative error lower than or equal to α times the smallest error over all methods. The ratios of relative errors are presented in decreasing order with respect to E(cosmpol)/E(cosm). The solid lines in Figures 1a, 2a and 3a represent the function k_cos·u, where k_cos is the condition number of the matrix cosine function [9, Chap. 3] and u = 2^{-53} is the unit roundoff in double precision floating-point arithmetic.

Table 3: Matrix products (M) for the three tests using the MATLAB functions cosmpol, cosmtay, and cosm. The values shown in the cosmtay and cosm columns are the percentages of extra products carried out by these routines with respect to cosmpol.

          M(cosmpol)   M(cosmtay)   M(cosm)
Test 1    854          11.00%       32.20%
Test 2    871          10.67%       31.57%
Test 3    511           9.20%       31.70%

Table 4: Relative error comparison.

                           Test 1   Test 2   Test 3
E(cosmpol) < E(cosm)       97%      97%      71.27%
E(cosmpol) < E(cosmtay)    60%      55%      50.85%
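The performance profiles in Figures 1b–3b can be reproduced from the per-matrix relative errors as in the following sketch (Err is a #matrices × #methods array; the variable names are ours):

```matlab
% Sketch of the performance profile: Err(i,k) is the relative error of
% method k on matrix i; p(j,k) is the fraction of matrices for which
% method k is within alpha(j) times the smallest error over all methods.
alpha = 1:0.1:5;
best  = min(Err, [], 2);
p = zeros(numel(alpha), size(Err, 2));
for j = 1:numel(alpha)
    p(j, :) = mean(Err <= alpha(j) * best, 1);
end
plot(alpha, p);  legend('cosmpol', 'cosmtay', 'cosm');
```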

In the light of the results shown in the tables and figures we can make the following analysis:

• Regarding numerical stability, the figures showing the normwise relative errors lead to the following observations: Figure 1a shows that all the functions behave in a numerically stable way in Test 1. Figure 2a shows that in Test 2 the Taylor-based functions are numerically more stable than the Padé-based function cosm. Figure 3a shows that all three functions have a similar numerical stability in Test 3. Only for one matrix of this test do all three functions present a certain numerical instability, with a relative error more than 10^8 times higher than the solid line (see Figure 3a).

• The functions based on polynomial approximations are more accurate than the one based on Padé approximants, and the new function cosmpol is slightly more accurate than our former cosmtay function. The performance profiles (Figures 1b, 2b, and 3b) show that the graph of cosmpol is above the graphs of the other two functions, demonstrating that, in general, it is the most accurate. This is also shown by Table 4, where we see that the function cosmpol has a lower relative error than cosm for 71.27%–97% of the matrices, and a lower relative error than cosmtay for 50.85%–60% of the matrices.

[Figure 1: Experimental results for Test 1. (a) Normwise relative errors against cond·u; (b) performance profile; (c) ratio of relative errors; (d) ratio of matrix products.]

• Regarding the computational costs, Table 3 shows that the function cosmpol has a lower computational cost than the other two functions. This is also confirmed by Subfigures 1d, 2d, and 3d, which show that the ratio of matrix products computed by cosmpol and cosm, i.e. M(cosmpol)/M(cosm), is lower than 1 for all the test matrices, and lower in almost all cases than the corresponding ratio for cosmtay and cosm, i.e. M(cosmtay)/M(cosm).

4.2. The accelerated algorithm

We have implemented an “accelerated” version of Algorithm 1 that can use an NVIDIA GPU. This accelerated version has been developed with the aim of being efficient and easy to use, for which we implemented a MATLAB mex file.

[Figure 2: Experimental results for Test 2. (a) Normwise relative errors; (b) performance profile; (c) ratio of relative errors; (d) ratio of matrix products.]

We used the CUDA and C++ languages to implement the mex file. This code accelerates those parts of the original Matlab function that have a high computational cost, i.e. the matrix multiplications. In this work we have taken the mex file developed in [8] and modified and adapted it to cosmpol, using the new method for selecting the degree m and the scaling parameter s from Section 3, corresponding to Step 2 of Algorithm 1, and the new methods from Section 2 for evaluating the Taylor matrix polynomial approximations of the matrix cosine, corresponding to Step 4 of Algorithm 1. The mex function is unique but can perform the different operations required by the algorithm. This way, data (matrices) are kept in the device (GPU) memory between consecutive calls to the mex function. The GPU is mainly in charge of executing matrix multiplications, but it also performs some low cost operations, e.g. the calculation of the 1-norm of a matrix, to avoid transferring data between the CPU and the GPU only to perform these operations. Other low cost operations are carried out on the host CPU. The Matlab mex function, called call_gpu, executes different operations (init, power, scale, ...) depending on the arguments with which it is called [8]. The only operation that has been changed in this paper is eval, which evaluates a matrix polynomial. Now, this operation implements Eqs. (3)–(6) using the coefficients of Table 1.
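The call sequence from the Matlab script looks roughly as follows; the operation names beyond init, power, scale and eval, as well as the exact argument lists, are our assumptions rather than the actual interface of [8]:

```matlab
% Hedged sketch of driving the mex file; operation names beyond
% init/power/scale/eval and the exact signatures are assumptions.
call_gpu('init', A);         % upload A to the GPU; B = A^2 stays there
call_gpu('power');           % form the required powers B^2 (and B^3)
call_gpu('scale', s);        % B := B / 4^s
call_gpu('eval', mk);        % evaluate P_mk(B) via (3)-(6), Table 1
C = call_gpu('recover', s);  % double angle recovery, download cos(A)
```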

[Figure 3: Experimental results for Test 3. (a) Normwise relative errors; (b) performance profile; (c) ratio of relative errors; (d) ratio of matrix products.]

With this implementation of Algorithm 1, the Matlab script can be executed on either the CPU or the GPU. Table 5 shows the execution times (in seconds) obtained on both devices for randomly generated matrices. To obtain the CPU times we used two processors with 12 cores each (Intel Xeon CPU E5-2697 v2 @ 2.70 GHz); thus, the matrix multiplication used by Matlab exploits the 24 cores available in our host. The GPU times were obtained on an NVIDIA Tesla K20Xm, a high performance device that features 2688 CUDA cores. We observe that our new algorithm cosmpol is faster than cosmtay on both devices, a reduction in time due to the saving in matrix products. The reduction in time from the CPU to the GPU is not as large for cosmpol as for cosmtay, but it is still important, since the algorithm is also supported on matrix multiplication, a highly optimized operation included in the CUBLAS library [17] for NVIDIA GPUs.

5. Conclusions

Table 5: Execution time (sec.) of the algorithm on the CPU and of the accelerated version on the GPU.

        cosmtay           cosmpol
n       CPU      GPU      CPU      GPU
1000     0.21    0.19      0.17    0.16
1500     0.54    0.37      0.52    0.31
2000     1.05    0.56      0.77    0.48
2500     1.98    0.93      1.83    0.81
3000     3.36    1.40      3.26    1.23
3500     5.12    1.97      4.71    1.78
4000     7.19    2.69      5.83    2.46
4500     8.30    3.61      8.07    3.32
5000    10.86    4.77     10.13    4.36
5500    15.55    6.13     15.02    5.62
6000    26.01    7.91     21.83    7.33

In this paper we have introduced a new method to compute the matrix cosine function. This method is based on the Taylor approximation of the cosine function, using the matrix polynomial evaluation methods from [7] and an improved version of the scaling algorithm from [5]. From the different real solutions of the coefficients of the formulae from [7], the coefficients selected in this paper give the lowest maximum error in the stability check from [7], providing a maximum order of approximation m_M = 15, i.e. a maximum order of approximation of the cosine Taylor series equal to 30, and excellent accuracy results in the numerical tests.

A MATLAB implementation (cosmpol) based on that method has been developed and compared with other state-of-the-art algorithms: one based on Taylor approximations (cosmtay), which uses the Paterson–Stockmeyer method to evaluate the Taylor matrix polynomial approximations, and one based on Padé approximants (cosm). The numerical experiments show that, in general, cosmpol has a lower computational cost in terms of matrix products than the functions cosmtay and cosm; moreover, cosmpol is more accurate in the majority of the tests than the other codes, with a similar numerical stability.

Finally, we note that all the above discussion on the fast computation of the matrix cosine is applicable to the computation of the matrix sine, since sin(A) = cos(A − (π/2)I).
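In MATLAB terms, the matrix sine therefore comes at the cost of one call to the cosine code:

```matlab
% Matrix sine via the identity sin(A) = cos(A - (pi/2) I):
sinA = cosmpol(A - (pi/2) * eye(size(A)));
```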

Acknowledgements

This work has been partially supported by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF) under grants TIN2014-59294-P and TIN2017-89314-P.

References

[1] E. Defez, J. Sastre, J. J. Ibáñez, P. A. Ruiz, Computing matrix functions arising in engineering models with orthogonal matrix polynomials, Math. Comput. Model. 57 (7-8) (2013) 1738-1743.

[2] J. Sastre, J. Ibáñez, P. Ruiz, E. Defez, Efficient computation of the matrix cosine, Appl. Math. Comput. 219 (2013) 7575-7585.

[3] A. H. Al-Mohy, N. J. Higham, S. D. Relton, New algorithms for computing the matrix sine and cosine separately or simultaneously, SIAM J. Sci. Comput. 37 (1) (2015) A456-A487.

[4] P. Alonso, J. Ibáñez, J. Sastre, J. Peinado, E. Defez, Efficient and accurate algorithms for computing matrix trigonometric functions, J. Comput. Appl. Math. 309 (2017) 325-332.

[5] J. Sastre, J. Ibáñez, P. Alonso, J. Peinado, E. Defez, Two algorithms for computing the matrix cosine function, Appl. Math. Comput. 312 (2017) 66-77.

[6] M. S. Paterson, L. J. Stockmeyer, On the number of nonscalar multiplications necessary to evaluate polynomials, SIAM J. Comput. 2 (1) (1973) 60-66.

[7] J. Sastre, Efficient evaluation of matrix polynomials, Linear Algebra Appl. 539 (2018) 229-250.

[8] P. Alonso, J. Peinado, J. Ibáñez, J. Sastre, E. Defez, Computing matrix trigonometric functions with GPUs through Matlab, The Journal of Supercomputing (2018), online.

[9] N. J. Higham, Functions of Matrices: Theory and Computation, SIAM, Philadelphia, PA, USA, 2008.

[10] G. I. Hargreaves, N. J. Higham, Efficient algorithms for the matrix cosine and sine, Numer. Algorithms 40 (2005) 383-400.

[11] J. Sastre, J. J. Ibáñez, E. Defez, P. A. Ruiz, Accurate matrix exponential computation to solve coupled differential models in engineering, Math. Comput. Model. 54 (2011) 1835-1840.

[12] J. Sastre, J. J. Ibáñez, E. Defez, Boosting the computation of the matrix exponential, Appl. Math. Comput., in press.

[13] P. Ruiz, J. Sastre, J. Ibáñez, E. Defez, High performance computing of the matrix exponential, J. Comput. Appl. Math. 291 (2016) 370-379.

[14] N. J. Higham, FORTRAN codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation, ACM Trans. Math. Softw. 14 (4) (1988) 381-396.

[15] T. G. Wright, Eigtool, version 2.1 (2009). URL: web.comlab.ox.ac.uk/pseudospectra/eigtool.

[16] N. J. Higham, The Test Matrix Toolbox for MATLAB, Numerical Analysis Report No. 237, Manchester, England (Dec. 1993).

[17] NVIDIA, CUDA Toolkit: cuBLAS library, docs.nvidia.com/cuda/cublas, v9.2.148 edition, last accessed July 2018.