Automatic Blocking of QR and LU Factorizations for Locality

07/2004; DOI: 10.1145/1065895.1065898
Source: CiteSeer

ABSTRACT QR and LU factorizations are important numerical linear algebra computations. To run efficiently, these factorization algorithms need to be blocked when operating on large matrices, so that they exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization have complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, more benefit can be gained by automatically generating blocked versions of the computations, such as automatic adaptation to different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present the blocked versions that can be generated by our optimizer and compare the performance of the auto-blocked versions.
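The core idea the paper automates, restructuring a factorization's loop nest so that most of the arithmetic happens on cache-resident blocks, can be illustrated with a minimal right-looking blocked LU. This is an illustrative sketch (no pivoting, square matrix, my own function name), not the paper's compiler-generated code:

```python
import numpy as np

def lu_blocked(A, nb=64):
    """Right-looking blocked LU factorization without pivoting, in place.
    Afterwards the strict lower triangle of A holds L (unit diagonal
    implied) and the upper triangle holds U."""
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Unblocked factorization of the current column panel A[k:, k:e].
        for j in range(k, e):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        if e < n:
            # U12 := L11^{-1} * A12  (small triangular solve).
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
            # Trailing-submatrix update: the matrix-matrix (Level-3) part
            # where blocking concentrates almost all of the arithmetic.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```

The inner panel loop touches a narrow strip that fits in cache, while the rank-`nb` trailing update does the bulk of the flops as a matrix-matrix product; deriving this shape automatically from the naive triple loop is what the blocking transformation achieves.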


Available from: Haihang You, Dec 05, 2013
    ABSTRACT: This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix against the state-of-the-art open-source and commercial numerical software packages, namely LAPACK compiled with the optimized, multithreaded BLAS from MKL, as well as Intel MKL version 10.2.
    ACM Transactions on Mathematical Software 04/2013; 39(3). DOI:10.1145/2450153.2450154 · 3.29 Impact Factor
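Feature (1) in the abstract above, tile data layout, stores each nb-by-nb tile contiguously so that a kernel working on one tile gets unit-stride, cache-friendly access. A minimal sketch of such a layout conversion (function names are my own, not the authors' API):

```python
import numpy as np

def to_tiles(A, nb):
    """Repack a matrix into tile layout: a 2-D grid of contiguous
    nb-by-nb blocks, one block per tile."""
    m, n = A.shape
    assert m % nb == 0 and n % nb == 0, "sketch assumes divisible dimensions"
    return [[np.ascontiguousarray(A[i:i + nb, j:j + nb])
             for j in range(0, n, nb)]
            for i in range(0, m, nb)]

def from_tiles(tiles):
    """Reassemble the original matrix from the tile grid."""
    return np.block(tiles)
```

In a real tile algorithm each kernel (factor a tile, update a tile) receives one or a few of these contiguous blocks, which is what keeps the working set in cache; the translation layer the authors describe performs this mapping for algorithms written against column-major storage.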
    ABSTRACT: The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and the widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization, due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster than the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. Copyright © 2013 John Wiley & Sons, Ltd.
    Concurrency and Computation Practice and Experience 05/2014; 26(7). DOI:10.1002/cpe.3110 · 0.78 Impact Factor
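The recursive formulation of the panel factorization described in the abstract above can be sketched in serial form: split the panel's columns in half, factor the left half recursively, apply its row swaps and a triangular solve to the right half, update, and recurse on the right half. This is a simplified NumPy rendering under my own naming, not the authors' QUARK-scheduled implementation:

```python
import numpy as np

def rec_lu(A):
    """Recursively factor the m-by-n panel A (m >= n) in place with
    partial pivoting. Returns LAPACK-style pivots: at step i, rows i
    and piv[i] were swapped."""
    m, n = A.shape
    if n == 1:                                  # base case: one column
        p = int(np.argmax(np.abs(A[:, 0])))     # select the pivot row
        A[[0, p]] = A[[p, 0]]
        A[1:, 0] /= A[0, 0]
        return [p]
    n1 = n // 2
    piv1 = rec_lu(A[:, :n1])                    # factor the left half
    for i, p in enumerate(piv1):                # replay its swaps on
        A[[i, p], n1:] = A[[p, i], n1:]         # the right half
    L11 = np.tril(A[:n1, :n1], -1) + np.eye(n1)
    A[:n1, n1:] = np.linalg.solve(L11, A[:n1, n1:])   # U12 := L11^-1 A12
    A[n1:, n1:] -= A[n1:, :n1] @ A[:n1, n1:]          # trailing update
    piv2 = rec_lu(A[n1:, n1:])                  # factor the right half
    for i, p in enumerate(piv2):                # replay its swaps on
        A[[n1 + i, n1 + p], :n1] = A[[n1 + p, n1 + i], :n1]  # the left half
    return piv1 + [n1 + p for p in piv2]
```

The recursion keeps each active column block narrow, which mitigates the memory-bound character of the panel step, while pivot selection remains the serialized point that the paper's fine-grained parallel variant addresses.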