Automatic Blocking of QR and LU Factorizations for Locality

Proceedings of the ACM SIGPLAN Workshop on Memory System Performance (MSP 2004), 07/2004; DOI: 10.1145/1065895.1065898
Source: CiteSeer


QR and LU factorizations are important dense linear algebra computations. To run efficiently on modern computers, the factorization algorithms need to be blocked to exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization have complex loop structures, few compilers can fully automate the blocking of these algorithms. Though linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, by automatically generating blocked versions of the computations, more benefit can be gained, such as automatic adaptation of blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique to produce efficient blockings for QR and LU with partial pivoting. We present blocked versions that can be generated by our optimizer and compare the performance of auto-block…
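
To make the blocking idea concrete, here is a minimal sketch, assuming a row-major n x n matrix overwritten with its L\U factors; it omits the partial pivoting the paper handles, uses plain loops instead of tuned BLAS, and is not the code the paper's optimizer generates.

#include <stddef.h>

/* Unblocked right-looking LU: the rank-1 trailing update touches the
   whole submatrix every step, so large matrices thrash the cache. */
static void lu_unblocked(double *A, size_t n, size_t lda)
{
    for (size_t k = 0; k < n; k++)
        for (size_t i = k + 1; i < n; i++) {
            A[i*lda + k] /= A[k*lda + k];              /* L(i,k) */
            for (size_t j = k + 1; j < n; j++)
                A[i*lda + j] -= A[i*lda + k] * A[k*lda + j];
        }
}

/* Blocked variant: factor an nb-wide panel, solve a small triangular
   system, then do one big matrix-matrix update with high data reuse. */
static void lu_blocked(double *A, size_t n, size_t lda, size_t nb)
{
    for (size_t k = 0; k < n; k += nb) {
        size_t b = (k + nb < n) ? nb : n - k;
        /* 1. Panel factorization: updates stay within columns k..k+b-1. */
        for (size_t kk = k; kk < k + b; kk++)
            for (size_t i = kk + 1; i < n; i++) {
                A[i*lda + kk] /= A[kk*lda + kk];
                for (size_t j = kk + 1; j < k + b; j++)
                    A[i*lda + j] -= A[i*lda + kk] * A[kk*lda + j];
            }
        /* 2. Triangular solve for the row panel U(k:k+b, k+b:n). */
        for (size_t kk = k; kk < k + b; kk++)
            for (size_t i = kk + 1; i < k + b; i++)
                for (size_t j = k + b; j < n; j++)
                    A[i*lda + j] -= A[i*lda + kk] * A[kk*lda + j];
        /* 3. Trailing update A22 -= L21*U12: a GEMM-shaped loop where
           blocking pays off, since each panel element is reused across
           the whole trailing submatrix. */
        for (size_t i = k + b; i < n; i++)
            for (size_t kk = k; kk < k + b; kk++)
                for (size_t j = k + b; j < n; j++)
                    A[i*lda + j] -= A[i*lda + kk] * A[kk*lda + j];
    }
}

The best block size nb depends on the cache sizes of the target machine, which is exactly the kind of parameter the empirical-tuning work cited below searches for.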



  • Source
    • "Meanwhile, compiler optimization techniques have been developed to transform programs written in high-level languages to run efficiently on modern architectures[2] [3]. These program transformations include loop blocking[4] [5], loop unrolling[2], loop permutation, fusion and distribution[6] [7]. To select optimal parameters such as block size, unrolling * This work supported in part by the NSF under CNS-0325873. "
    ABSTRACT: Empirical software optimization and tuning is an active research topic in the high performance computing research community. It is an adaptive system to generate optimized software using empirically searched parameters. Due to the large parameter search space, an appropriate search heuristic is an essential part of the system. This paper describes an effective search method that can be generally applied to empirical optimization. We apply this method to ATLAS (Automatically Tuned Linear Algebra Software), which is a system for empirically optimizing dense linear algebra kernels. Our experiments on four different platforms show that the new search scheme can produce parameters that can lead ATLAS to generate a library with better performance.
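    As a concrete illustration of the empirical approach described in this abstract, here is a minimal timing harness; a sketch only, where `run_candidate` is a hypothetical stand-in for generating, compiling and running one code variant with a given block size, which a system like ATLAS does internally.

    #include <stdio.h>
    #include <time.h>

    extern void run_candidate(int nb);  /* hypothetical: build and run one variant */

    /* Time each candidate block size and keep the fastest.  A real system
       like ATLAS prunes this space with search heuristics rather than
       timing every point exhaustively. */
    int best_block_size(const int *sizes, int count)
    {
        int best = sizes[0];
        double best_time = 1e300;
        for (int i = 0; i < count; i++) {
            clock_t t0 = clock();
            run_candidate(sizes[i]);
            double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("nb=%d: %.3fs\n", sizes[i], elapsed);
            if (elapsed < best_time) { best_time = elapsed; best = sizes[i]; }
        }
        return best;
    }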
  • Source
    • "One thing that a DAG of tasks does not convey is which variant of a given algorithm it represents: left-looking or right-looking [37]. The DAG is the same for either variant and it is the order of visiting the DAG's nodes during execution that determines which variant is used. "
    ABSTRACT: While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architectures, getting high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and difficult research problem due to expensive memory-bound operations occurring during the panel factorization. The processor-memory speed gap continues to widen, which has further exacerbated the problem. This paper focuses on an efficient implementation of the tridiagonal reduction, which is the first algorithmic step toward computing the spectral decomposition of a dense symmetric matrix. The original matrix is translated into a tile layout, i.e., a high-performance data representation, which substantially enhances data locality. Following a two-stage approach, the tile matrix is then transformed into band tridiagonal form using compute-intensive kernels. The band form is further reduced to the required tridiagonal form using a left-looking bulge-chasing technique to reduce memory traffic and memory contention. A dependence translation layer associated with a dynamic runtime system allows for scheduling and overlapping tasks generated from both stages. The obtained tile tridiagonal reduction significantly outperforms state-of-the-art numerical libraries (10x against multithreaded LAPACK with optimized MKL BLAS and 2.5x against the commercial numerical software Intel MKL) from medium to large matrix sizes.
    25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2011, Anchorage, Alaska, USA, 16-20 May, 2011 - Conference Proceedings; 05/2011
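    The left-looking/right-looking distinction quoted in this item can be illustrated with unpivoted LU (a generic sketch, not code from the cited paper): both variants perform exactly the same arithmetic, but in a different order, which is why a task DAG alone cannot distinguish them.

    #include <stddef.h>

    /* Left-looking (lazy): all pending updates from columns 0..j-1 are
       applied to column j only when column j is visited. */
    static void lu_left_looking(double *A, size_t n, size_t lda)
    {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < j; k++)
                for (size_t i = k + 1; i < n; i++)
                    A[i*lda + j] -= A[i*lda + k] * A[k*lda + j];
            for (size_t i = j + 1; i < n; i++)
                A[i*lda + j] /= A[j*lda + j];   /* form L(i,j) */
        }
    }

    /* Right-looking (eager): each step immediately updates the entire
       trailing submatrix. */
    static void lu_right_looking(double *A, size_t n, size_t lda)
    {
        for (size_t k = 0; k < n; k++) {
            for (size_t i = k + 1; i < n; i++)
                A[i*lda + k] /= A[k*lda + k];
            for (size_t i = k + 1; i < n; i++)
                for (size_t j = k + 1; j < n; j++)
                    A[i*lda + j] -= A[i*lda + k] * A[k*lda + j];
        }
    }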
  • Source
    • "Meanwhile, the compiler community has developed optimization techniques to transform programs written in high-level languages to run efficiently on these modern architectures [1], [2]. Some of these program transformations include loop blocking[3], [4], loop unrolling[1], loop permutation, fusion and distribution[5], [6]. To select parameters for transformations such as blocking and unrolling, most compilers use analytical models. "
    ABSTRACT: This paper describes the application of various search techniques to the problem of automatic empirical code optimization. The search process is a critical aspect of auto-tuning systems because the large size of the search space and the cost of evaluating the candidate implementations make it infeasible to find the true optimum point by brute force. We evaluate the effectiveness of Nelder-Mead Simplex, Genetic Algorithms, Simulated Annealing, Particle Swarm Optimization, Orthogonal search, and Random search in terms of the performance of the best candidate found under varying time limits.
    2008 IEEE International Conference on Cluster Computing; 11/2008
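    Of the techniques this abstract compares, simulated annealing is simple to sketch (a generic illustration, not the paper's implementation; `measure` is a hypothetical hook that builds and times one candidate with the given block size and unroll factor):

    #include <stdlib.h>
    #include <math.h>

    extern double measure(int nb, int unroll);  /* hypothetical timing hook */

    /* Random-walk annealing over the discrete (block size, unroll) space:
       always accept improvements, and accept regressions with a
       probability that shrinks as the temperature cools. */
    void anneal(int *nb, int *unroll, int steps)
    {
        double cost = measure(*nb, *unroll);
        double temp = 1.0;
        for (int s = 0; s < steps; s++, temp *= 0.95) {
            int nb2 = *nb + (rand() % 3 - 1) * 8;   /* neighbor: step nb by 8 */
            int un2 = *unroll + (rand() % 3 - 1);   /* neighbor: step unroll by 1 */
            if (nb2 < 8) nb2 = 8;
            if (un2 < 1) un2 = 1;
            double c2 = measure(nb2, un2);
            if (c2 < cost || exp((cost - c2) / temp) > (double)rand() / RAND_MAX) {
                *nb = nb2; *unroll = un2; cost = c2;
            }
        }
    }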