Article

Design and optimization of a portable LQCD Monte Carlo code using OpenACC


Abstract

The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data-parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent, making it tedious and error-prone to keep different code versions aligned. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance portability can be reached.
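As a minimal illustration of the descriptive programming style referred to in the abstract (a sketch, not code from the paper; the kernel and variable names are hypothetical), an OpenACC parallel loop only states that the iterations are independent and lets the compiler map them onto the target device, so the same C source can be built for multi-core CPUs or GPUs:

#include <stdlib.h>

/* Hypothetical axpy-like update over a lattice-sized array: the same
   source compiles for CPUs or GPUs, depending on compiler flags. */
void site_axpy(int n, double a, const double *restrict x, double *restrict y)
{
    #pragma acc parallel loop present(x[0:n], y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}

int main(void)
{
    int n = 1 << 20;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Structured data region: copy data once and keep it resident on the device. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        site_axpy(n, 0.5, x, y);
    }
    free(x); free(y);
    return 0;
}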


... Using this approach, the same source code can be run on different processors, such as CPUs and GPUs, as long as they are supported by the compiler, achieving an easy and good level of code portability. OpenACC is becoming increasingly popular in several computational communities and is used in many stencil-based applications, mostly running on GPUs, including Lattice Boltzmann Methods [11][12][13], and more recently also Lattice QCD [14,15]. ...
... In this paper we analyze the computing and energy performance of the OpenACC Staggered Parallel LatticeQCD Everywhere (OpenStaPLE) application that we have recently developed [15,16], compiling with PGI 18.1 and running on the Development for an Added Value Infrastructure Designed in Europe [17] (DAVIDE) HPC cluster. ...
... OpenStaPLE derives from an earlier version coded with CUDA [19], and ported to OpenACC to allow for code portability on multiple architectures [20]. Its first implementation, able to run only on single-accelerator systems, GPUs (NVIDIA and AMD) and CPUs, has been described in [15]. In this work we adopt the latest version of OpenStaPLE, described in [16], in which the physical data domain can be sliced along one dimension and parallelized on multiple computing nodes, and multiple processors (e.g. ...
Chapter
In this contribution we measure the computing and energy performance of the recently developed DAVIDE HPC-cluster, a massively parallel machine based on IBM POWER CPUs and NVIDIA Pascal GPUs. We use as an application benchmark the OpenStaPLE Lattice QCD code, written using the OpenACC programming framework. Our code exploits the computing performance of GPUs through the use of OpenACC directives, and uses OpenMPI to manage the parallelism among several GPUs. We analyze the speed-up and the aggregate performance of the code, and try to identify possible bottlenecks that harm performances. Using the power monitor tools available on DAVIDE we also discuss some energy aspects pointing out the best trade-offs between time-to-solution and energy-to-solution.
... Using this approach, the same source code may run on all processors supported by the compiler, GPUs as well as CPUs, achieving an easy and good level of code portability. OpenACC is becoming increasingly popular among several scientific communities for coding many lattice-based applications to run mainly on GPU accelerators, including Lattice Boltzmann Methods 16,17,18, and more recently also Lattice QCD 19,20. ...
... In our previous work 20 we have described an OpenACC implementation of a state-of-the-art Monte Carlo Lattice QCD application, derived from an earlier version coded using CUDA 21, and able to run only on single-accelerator systems, including GPUs (NVIDIA and AMD), and also CPUs. In this paper we extend our OpenACC code to run also on accelerator-based parallel computing machines, discussing in detail how we have structured the code, the strategies that have guided our design choices, and presenting several performance results on different computing architectures. ...
... We store the pseudofermion fields in the vec3_soa structure (see listing 1 and Figure 1). It consists of 3 arrays of double (or float) C99 complex numbers, arranged in a Structure of Arrays (SoA) layout 44,20. Each array has LNH_SIZEH = n_{0,loc} n_{1,loc} n_{2,loc} (n_{3,loc} + 2h)/2 elements, where n_{i,loc} are the sizes of the local lattice and h is the "largest" needed halo size, which is 1 for the Dirac operator, and 1 or 2 for the gauge part depending on the choice of S_g. ...
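A minimal sketch of a structure laid out as described in the excerpt above; the constant value, the exact halo bookkeeping and the field names are assumptions for illustration, not code from the paper:

#include <complex.h>

/* Assumed compile-time constant: number of even (or odd) local sites
   including halos, LNH_SIZEH = n0_loc*n1_loc*n2_loc*(n3_loc + 2*h)/2. */
#define LNH_SIZEH 131072

/* Structure-of-Arrays storage for a pseudofermion field: the three color
   components live in separate contiguous arrays, so neighboring lattice
   sites are adjacent in memory, which favors coalesced GPU accesses and
   CPU vectorization.  Instances would normally be allocated on the heap. */
typedef struct {
    double complex c0[LNH_SIZEH];
    double complex c1[LNH_SIZEH];
    double complex c2[LNH_SIZEH];
} vec3_soa;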
Article
Full-text available
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes using OpenACC to manage parallelism within the node, and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies to be adopted to maximize performances, we then describe selected relevant details of the code, and finally measure the level of performance and scaling-performance that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly high level of performances for this application, but also compares with results measured on other processors.
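A common way to realize the node-level/inter-node split described in this abstract is to bind one accelerator to each MPI rank through the OpenACC runtime API; the round-robin mapping below is a generic sketch under that assumption, not the paper's actual code:

#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Bind each rank to one of the accelerators visible on its node.
       acc_get_num_devices()/acc_set_device_num() are standard OpenACC
       runtime calls; the modulo mapping is an assumption. */
    acc_device_t dev_type = acc_get_device_type();
    int ndev = acc_get_num_devices(dev_type);
    if (ndev > 0)
        acc_set_device_num(rank % ndev, dev_type);

    /* ... halo exchanges along the sliced dimension via MPI,
       OpenACC kernels on the local sub-lattice ... */

    MPI_Finalize();
    return 0;
}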
... Indeed, our implementation extends a former multi-node and multi-GPU distributed code (publicly available in [83]) where the RHMC algorithm applies to a single lattice, partitioned in one of the space-time directions (usually the longest one) into multiple MPI ranks associated with different devices, as described in refs. [84,85]. In our parallel implementation of the parallel tempering algorithm, each copy of the lattice runs the same RHMC operations of the code without replicas in a single instruction multiple data (SIMD) fashion, with the only difference lying in the boundary conditions at the defect, which vary depending on the replica label. ...
... In our parallel implementation of the parallel tempering algorithm, each copy of the lattice runs the same RHMC operations of the code without replicas in a single instruction multiple data (SIMD) fashion, with the only difference lying in the boundary conditions at the defect, which vary depending on the replica label. This parallelization becomes straightforward thanks to the use of independent MPI communicators and groups for each lattice copy, therefore guaranteeing communications only between devices associated with sub-lattices that share borders during the RHMC steps, mimicking the former implementation without replicas which was already introduced in [84,85]. However, since shared memory is not available when dealing with larger lattices and numbers of replicas, instead of swapping whole configurations between different devices as discussed above, our practical choice is to swap only the boundary conditions at the defect, which involves only a small amount of data. ...
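The communicator layout sketched in the excerpt can be expressed with MPI_Comm_split; the ranks-per-replica arithmetic below is a hypothetical illustration of the idea, not the cited code:

#include <mpi.h>

/* Give each lattice replica its own communicator, so halo exchanges during
   the RHMC steps only involve ranks holding sub-lattices of the same copy. */
MPI_Comm make_replica_comm(MPI_Comm world, int ranks_per_replica)
{
    int world_rank;
    MPI_Comm_rank(world, &world_rank);

    int replica_id = world_rank / ranks_per_replica;   /* color: which copy  */
    int key        = world_rank % ranks_per_replica;   /* order inside copy  */

    MPI_Comm replica_comm;
    MPI_Comm_split(world, replica_id, key, &replica_comm);
    return replica_comm;
}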
Article
Full-text available
Abstract We simulate Nf = 2+1 QCD at the physical point combining open and periodic boundary conditions in a parallel tempering framework, following the original proposal by M. Hasenbusch for 2d CP^{N−1} models, which has been recently implemented and widely employed in 4d SU(N) pure Yang-Mills theories too. We show that using this algorithm it is possible to achieve a sizable reduction of the auto-correlation time of the topological charge in dynamical fermion simulations both at zero and finite temperature, allowing topology freezing to be avoided down to lattice spacings as fine as a ∼ 0.02 fm. Therefore, this implementation of the Parallel Tempering on Boundary Conditions algorithm has the potential to substantially push forward the investigation of the QCD vacuum properties by means of lattice simulations.
... Numerical simulations have been performed on the COKA cluster, using 5 computing nodes, each with 8 NVIDIA K80 dual-GPU boards and two 56 Gb/s FDR InfiniBand network interfaces. Our parallel code (OpenStaPLE²) is a single [78] and multi [79] GPU implementation of a standard RHMC algorithm. It is an evolution of a previous CUDA code [80], developed using the OpenACC and OpenMPI frameworks to manage respectively parallelism on the GPUs and among the nodes. ...
... Apart from first order and 3D-Ising exponents, we also report tricritical indices: they are expected to describe the critical behavior exactly at the separation point between the first order and the second order regions. ¹Notice that, as usual in these cases (compare, e.g., with what is done in the numerical study of the Ising model), the absolute value of Im(L) is used in place of the order parameter itself, since otherwise residual tunnelings taking place on a finite volume would ruin the finite size scaling analysis on the side of the broken phase. ²The name of the code was not explicitly mentioned in Refs. [78,79] where the code was presented. In fact, the name has been decided afterwards and will be used for a public release that will appear soon. ...
Article
Full-text available
We investigate the fate of the Roberge-Weiss endpoint transition and its connection with the restoration of chiral symmetry as the chiral limit of Nf=2+1 QCD is approached. We adopt a stout staggered discretization on lattices with Nt=4 sites in the temporal direction; the chiral limit is approached maintaining a constant physical value of the strange-to-light mass ratio and exploring three different light quark masses, corresponding to pseudo-Goldstone pion masses mπ≃100, 70 and 50 MeV around the transition. A finite size scaling analysis provides evidence that the transition remains second order, in the 3D Ising universality class, in all the explored mass range. The residual chiral symmetry of the staggered action also allows us to investigate the relation between the Roberge-Weiss endpoint transition and the chiral restoration transition as the chiral limit is approached: our results, including the critical scaling of the chiral condensate, are consistent with a coincidence of the two transitions in the chiral limit; however we are not able to discern the symmetry controlling the critical behavior, because the critical indices relevant to the scaling of the chiral condensate are very close to each other for the two possible universality classes [3D Ising or O(2)].
... Larger spatial sizes, up to ∼4 fm, have been explored in a few cases, in order to check the impact of finite size effects, or to perform a finite size scaling analysis around the transition: in those cases, b_z has been increased accordingly in order to keep eB fixed, see Eq. (6). Additional simulations, needed for zero temperature subtractions or normalization, have been performed for eB = 0, 4, and 9 GeV² on lattices with a temporal extension of around 5.5 fm, which is large enough to be considered as a good approximation for T ≃ 0. Monte Carlo sampling of gauge configurations has been performed based on a rational hybrid Monte Carlo (RHMC) algorithm running on GPUs [35,36]. For each simulation we performed O(10³) RHMC trajectories of unit length, taking measures every five trajectories. ...
Article
Full-text available
We provide numerical evidence that the thermal QCD crossover turns into a first order transition in the presence of large enough magnetic background fields. The critical end point is found to be located between eB = 4 GeV² [where the pseudocritical temperature is Tc = (98±3) MeV] and eB = 9 GeV² [where the critical temperature is Tc = (63±5) MeV]. Results are based on the analysis of quark condensates and number susceptibilities, determined by lattice simulations of Nf = 2+1 QCD at the physical point, discretized via rooted stout staggered fermions and a Symanzik tree level improved pure gauge action, adopting two different lattice spacings, a = 0.114 and 0.086 fm, for eB = 9 GeV² and three, a = 0.114, 0.086, and 0.057 fm, for eB = 4 GeV². We also present preliminary results regarding the confining properties of the thermal theory, suggesting that they could change drastically going across the phase transition.
... A summary of all simulation points is reported in Table II. Monte Carlo sampling of gauge configurations has been performed based on a rational hybrid Monte Carlo (RHMC) algorithm running on Graphics Processing Units (GPUs) [113,114]. For each simulation we performed O(10³) RHMC steps, taking measures every 10 unit trajectories. ...
Article
Full-text available
We investigate, by numerical lattice simulations, the static quark-antiquark potential, the flux tube properties and the chiral condensate for Nf=2+1 QCD with physical quark masses in the presence of strong magnetic fields, going up to eB = 9 GeV², with continuum extrapolated results. The string tension for quark-antiquark separations longitudinal to the magnetic field is suppressed by 1 order of magnitude at the largest explored magnetic field with respect to its value at zero magnetic background, but is still nonvanishing; in the transverse direction, instead, the string tension is enhanced but seems to reach a saturation at around 50% of its value at B=0. The flux tube shows a consistent suppression/enhancement of the overall amplitude, with mild modifications of its profile. Finally, we observe magnetic catalysis in the whole range of explored fields with a behavior compatible with a lowest Landau level approximation, in particular with a linear dependence of the chiral condensate on B which is in agreement, within errors, with that already observed for eB ∼ 1 GeV².
... For this reason in this work we study the energy-efficiency of the Intel Knights Landing (KNL) architecture, using as a benchmarking code a real HPC application, that has been heavily optimized for several architectures and routinely used for production runs of fluid-dynamics simulations based on the Lattice Boltzmann method [3]. This application is a good representative of a wider class of lattice based stencil-codes, including also HPC Grand Challenge applications such as Lattice Quantum Chromodynamics (LQCD) [4][5][6][7][8]. ...
Preprint
Full-text available
Energy consumption of processors and memories is quickly becoming a limiting factor in the deployment of large computing systems. For this reason it is important to understand the energy performance of these processors and to study strategies allowing to use them in the most efficient way. In this work we focus on computing and energy performance of the Knights Landing Xeon Phi, the latest Intel many-core architecture processor for HPC applications. We consider the 64-core Xeon Phi 7230, and profile its performance and energy efficiency using both its on-chip MCDRAM and the off-chip DDR4 memory as the main storage for application data. As a benchmark application we use a Lattice Boltzmann code heavily optimized for this architecture, and implemented using several different arrangements of the application data in memory (data-layouts, in short). We also assess the dependence of energy consumption on data-layouts, memory configurations (DDR4 or MCDRAM), and number of threads per core. We finally consider possible trade-offs between computing performance and energy efficiency, tuning the clock frequency of the processor using the Dynamic Voltage and Frequency Scaling (DVFS) technique.
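On KNL in flat memory mode, placing the application data either in DDR4 or in MCDRAM can be done without code changes (e.g. with numactl) or explicitly; the sketch below uses the memkind library's hbw_malloc as one possible explicit route and is not taken from the benchmark code described above:

#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind high-bandwidth-memory interface */

/* Allocate the main data domain either in MCDRAM (high-bandwidth memory)
   or in regular DDR4, depending on a runtime switch; matching
   hbw_free()/free() calls are needed on release. */
double *alloc_lattice(size_t nelems, int use_mcdram)
{
    if (use_mcdram && hbw_check_available() == 0)
        return hbw_malloc(nelems * sizeof(double));
    return malloc(nelems * sizeof(double));
}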
... LQCD codes come in a very large number of different versions, each developed to study specific areas of the underlying theory [38]. In the following, we consider the staggered formulation, commonly adopted for investigating QCD thermodynamics, for which we recently developed the OpenACC Staggered Parallel LatticeQCD Everywhere (OpenStaPLE) code, based on the OpenACC framework [39,40]. For benchmark purposes, we further restricted our analysis to the Dirac operator, a linear-algebra routine specifically optimized for this application [12,41] that typically accounts for ≈ 80% of the computational load of a production run. ...
Article
Full-text available
In recent years, the energy efficiency of HPC systems has increasingly become of paramount importance for environmental, technical, and economic reasons. Several projects have investigated the use of different processors and accelerators in the quest of building systems able to achieve high energy efficiency levels for data centers and HPC installations. In this context, the Arm CPU architecture has received a lot of attention given its wide use in low-power and energy-limited applications, but server grade processors have appeared on the market just recently. In this study, we targeted the Marvell ThunderX2, one of the latest Arm-based processors developed to fit the requirements of high performance computing applications. Our interest is mainly focused on the assessment in the context of large HPC installations, and thus we evaluated both computing performance and energy efficiency, using the ERT benchmark and two HPC production ready applications. We finally compared the results with other processors commonly used in large parallel systems and highlighted the characteristics of applications which could benefit from the ThunderX2 architecture, in terms of both computing performance and energy efficiency. Pursuing this aim, we also describe how ERT has been modified and optimized for ThunderX2, and how to monitor power drain while running applications on this processor.
... The scale determination is affected by an overall systematic error of the order of 2%-3% [34,35]; however, this is not relevant to our final results, which are based on the determination of dimensionless ratios of quantities measured at the critical temperature. We have adopted the standard rational hybrid Monte Carlo algorithm [39][40][41] implemented in two different codes, one running on standard clusters (NISSA), the other on GPUs (OpenStaPLE [42,43]) and developed in OpenACC starting from previous GPU implementations [44]. ...
Article
Full-text available
We determine the curvature of the pseudocritical line of Nf=2+1 QCD with physical quark masses via Taylor expansion in the quark chemical potentials. We adopt a discretization based on stout improved staggered fermions and the tree level Symanzik gauge action; the location of the pseudocritical temperature is based on chiral symmetry restoration. Simulations are performed on lattices with different temporal extent (Nt=6, 8, 10), leading to a continuum extrapolated curvature κ=0.0145(25), which is in very good agreement with the continuum extrapolation obtained via analytic continuation and the same discretization, κ=0.0135(20). This result eliminates the possible tension emerging when comparing analytic continuation with earlier results obtained via Taylor expansion.
... The partition function is periodic in b with period N_x N_y. Numerical simulations have been performed using the rational hybrid Monte Carlo algorithm (RHMC) [126] implemented in the NISSA code [127] and in the OpenStaPLE code for GPUs [128,129]. We have performed around 100 runs with different combinations of T and B for each value of the pion mass, with average statistics of approximately 3000 RHMC trajectories for each run. ...
Article
Full-text available
We investigate the behavior of the pseudocritical temperature of Nf=2+1 QCD as a function of a static magnetic background field for different values of the pion mass, going up to mπ≃660 MeV. The study is performed by lattice QCD simulations, adopting a stout staggered discretization of the theory on lattices with Nt=6 slices in the Euclidean temporal direction; for each value of the pion mass the temperature is changed moving along a line of constant physics. We find that the decrease of Tc as a function of B, which is observed for physical quark masses, persists in the whole explored mass range, even if the relative variation of Tc appears to be a decreasing function of mπ, approaching zero in the quenched limit. The location of Tc is based on the renormalized quark condensate and its susceptibility; determinations based on the Polyakov loop lead to compatible results. On the contrary, inverse magnetic catalysis, i.e., the decrease of the quark condensate as a function of B in some temperature range around Tc, is not observed when the pion mass is high enough. That supports the idea that inverse magnetic catalysis might be a secondary phenomenon, while the modifications induced by the magnetic background on the gauge field distribution and on the confining properties of the medium could play a primary role in the whole range of pion masses.
... This approach is appropriate for several computational Grand Challenge applications, such as Lattice QCD (LQCD) and LBM, for which a large effort has gone in the past in porting and optimizing codes and libraries for both custom and commodity HPC computing systems (Bernard et al., 2002; Bilardi, Pietracaprina, Pucci, Schifano & Tripiccione, 2005). More recently many efforts have been focused on GPUs (Bernaschi, Fatica, Melchionna, Succi, & Kaxiras, 2010; Bonati et al., 2017; Bonati, Cossu, D'Elia, & Incardona, 2012; Pedro Valero-Lara, 2014; Pedro Valero-Lara et al., 2015; Januszewski & Kostur, 2014; Tölke, 2008). These efforts have allowed obtaining significant performance levels, at least on one or just a small number of GPUs (Bailey, Myre, Walsh, Lilja, & Saar, 2009; Biferale et al., 2012; Biferale et al., 2010; Rinaldi, Dari, Vénere, & Clausse, 2012), (Jonas Tölke, 2008; Xian & Takayuki, 2011). ...
Chapter
GPUs deliver higher performance than traditional processors, offering remarkable energy efficiency, and are quickly becoming very popular processors for HPC applications. Still, writing efficient and scalable programs for GPUs is not an easy task as codes must adapt to increasingly parallel architecture features. In this chapter, the authors describe in full detail design and implementation strategies for lattice Boltzmann (LB) codes able to meet these goals. Most of the discussion uses a state-of-the-art thermal lattice Boltzmann method in 2D, but all lessons learned in this particular case can be immediately extended to most LB and other scientific applications. The authors describe the structure of the code, discussing in detail several key design choices that were guided by theoretical models of performance and experimental benchmarks, having in mind both single-GPU codes and massively parallel implementations on commodity clusters of GPUs. The authors then present and analyze performances on several recent GPU architectures, including data on energy optimization.
... For this reason, in this work, we study the energy efficiency of the Intel Knights Landing (KNL) architecture, using as a benchmarking code a real HPC application that has been heavily optimized for several architectures and routinely used for production runs of fluid-dynamics simulations based on the lattice Boltzmann method [3]. This application is a good representative of a wider class of lattice-based stencil codes, including also HPC Grand Challenge applications such as Lattice Quantum Chromodynamics (LQCD) [4][5][6][7][8]. ...
Article
Full-text available
Energy consumption of processors and memories is quickly becoming a limiting factor in the deployment of large computing systems. For this reason, it is important to understand the energy performance of these processors and to study strategies allowing their use in the most efficient way. In this work, we focus on the computing and energy performance of the Knights Landing Xeon Phi, the latest Intel many-core architecture processor for HPC applications. We consider the 64-core Xeon Phi 7230 and profile its performance and energy efficiency using both its on-chip MCDRAM and the off-chip DDR4 memory as the main storage for application data. As a benchmark application, we use a lattice Boltzmann code heavily optimized for this architecture and implemented using several different arrangements of the application data in memory (data-layouts, in short). We also assess the dependence of energy consumption on data-layouts, memory configurations (DDR4 or MCDRAM) and the number of threads per core. We finally consider possible trade-offs between computing performance and energy efficiency, tuning the clock frequency of the processor using the Dynamic Voltage and Frequency Scaling (DVFS) technique.
... For this reason in this work we study the energy-efficiency of the Intel Knights Landing (KNL) architecture, using as a benchmarking code a real HPC application, heavily optimized for several architectures and used for production runs of fluid-dynamics simulations based on Lattice Boltzmann methods. This application is a good representative of a wider class of lattice based stencil-codes, including also HPC Grand Challenge applications such as Lattice Quantum Chromodynamics (LQCD) [3,4,5,6]. ...
Article
Full-text available
Energy consumption is increasingly becoming a limiting factor to the design of faster large-scale parallel systems, and development of energy-efficient and energy-aware applications is today a relevant issue for HPC code-developer communities. In this work we focus on energy performance of the Knights Landing (KNL) Xeon Phi, the latest many-core architecture processor introduced by Intel into the HPC market. We take into account the 64-core Xeon Phi 7230, and analyze its energy performance using both the on-chip MCDRAM and the regular DDR4 system memory as main storage for the application data-domain. As a benchmark application we use a Lattice Boltzmann code heavily optimized for this architecture and implemented using different memory data layouts to store its lattice. We then assess the energy consumption using different memory data layouts, kinds of memory (DDR4 or MCDRAM) and numbers of threads per core.
... In this article we concentrate mainly on multi-GPUs and we illustrate some (mainly MPI) constructs below using snippets of our code in FORTRAN. For code examples of a single node implementation see [29](FORTRAN) and [30](C). ...
Article
We present the results of an effort to accelerate a Rational Hybrid Monte Carlo (RHMC) program for lattice quantum chromodynamics (QCD) simulation for 2 flavours of staggered fermions on multiple Kepler K20X GPUs distributed on different nodes of a Cray XC30. We do not use CUDA but adopt a higher-level, directive-based programming approach using the OpenACC platform. The lattice QCD algorithm is known to be bandwidth bound; our timing results illustrate this clearly, and we discuss how this limits the parallelization gains. We achieve more than a factor of three speed-up compared to the CPU-only MPI program.
Chapter
This paper presents an early performance assessment of the ThunderX2, the most recent Arm-based multi-core processor designed for HPC applications. We use as benchmarks well known stencil-based LBM and LQCD algorithms, widely used to study respectively fluid flows, and interaction properties of elementary particles. We run benchmark kernels derived from OpenMP production codes, we measure performance as a function of the number of threads, and evaluate the impact of different choices for data layout. We then analyze our results in the framework of the roofline model, and compare with the performances measured on mainstream Intel Skylake processors. We find that these Arm based processors reach levels of performance competitive with those of other state-of-the-art options.
Article
Full-text available
This report is devoted to Lattice QCD simulations carried out at the “Govorun” supercomputer. The basics of Lattice QCD methodology, the main Lattice QCD algorithms and the most numerically demanding routines are reviewed. We then present details of our multiGPU code implementation and the specifics of its application on the “Govorun” architecture. We show that implementation of very efficient and scalable code is possible and present scalability tests. We finally relate our program with the goals of the NICA project and review the main physical scenarios that are to be studied numerically.
Article
Full-text available
We study the dependence of the static quark free energy on the baryon chemical potential for Nf=2+1 QCD with physical quark masses, in a range of temperature spanning from 120 MeV up to 1 GeV and adopting a stout staggered discretization with two different values of the Euclidean temporal extension, Nt=6 and Nt=8. In order to deal with the sign problem, we exploit both Taylor expansion and analytic continuation, obtaining consistent results. We show that the dependence of the free energy on μB is sensitive to the location of the chiral crossover, in particular the μB susceptibility, i.e., the linear term in μB² in the Taylor expansion of the free energy, has a peak around 150 MeV. We also discuss the behavior expected in the high temperature regime based on perturbation theory, and obtain a good quantitative agreement with numerical results.
Conference Paper
Highly-optimized parallel molecular dynamics programs have allowed researchers to achieve ground-breaking results in biological and materials sciences. This type of performance has come at the expense of portability: a significant effort is required for performance optimization on each new architecture. Using a metric that emphasizes speedup, we assess key accelerating programming components of four different best-performing molecular dynamics programs (GROMACS, NAMD, LAMMPS and CP2K), each having a particular scope of application, for contribution to performance and for portability. We use builds with and without these components, tested on HPC systems. We also analyze the code-bases to determine compliance with portability recommendations. We find that for all four programs, the contributions of the non-portable components to speed are essential to the programs' performances; without them we see a reduction in time-to-solution of a magnitude that is insufferable to domain scientists. This characterizes the performance efficiency that must be approached for good performance portability on a programmatic level, suggesting solutions to this difficult problem, which should come from developers, industry and funding institutions, and possibly new research in programming languages.
Chapter
Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become an increasingly important initiative due to large differences in emerging supercomputer architectures. Here we test some key kernels from molecular dynamics (MD) to determine whether the use of the OpenACC directive-based programming model when applied to these kernels can result in performance within an acceptable range for these types of programs in the HPC setting. We find that for easily parallelizable kernels, performance on the GPU remains within this range. On the CPU, OpenACC-parallelized pairwise distance kernels would not meet the performance standards required, when using AMD Opteron “Interlagos” processors, but with IBM Power 9 processors, performance remains within an acceptable range for small batch sizes. These kernels provide a test for achieving performance portability with compiler directives for problems with memory-intensive components as are often found in scientific applications.
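A minimal OpenACC version of a pairwise-distance kernel of the kind discussed in this abstract might look as follows; array names and the SoA coordinate layout are assumptions for illustration, not code from the study:

#include <math.h>

/* Squared distances between n 3D points and m 3D reference points,
   coordinates stored in SoA form; the collapse(2) clause exposes the
   full n*m iteration space to the device. */
void pairwise_dist2(int n, int m,
                    const double *restrict xa, const double *restrict ya,
                    const double *restrict za,
                    const double *restrict xb, const double *restrict yb,
                    const double *restrict zb,
                    double *restrict d2)
{
    #pragma acc parallel loop collapse(2) \
        copyin(xa[0:n], ya[0:n], za[0:n], xb[0:m], yb[0:m], zb[0:m]) \
        copyout(d2[0:n*m])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            double dx = xa[i] - xb[j];
            double dy = ya[i] - yb[j];
            double dz = za[i] - zb[j];
            d2[i * m + j] = dx*dx + dy*dy + dz*dz;
        }
    }
}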
Article
Full-text available
We discuss the extension of gauge-invariant electric and magnetic screening masses in the Quark-Gluon Plasma to the case of a finite baryon density, defining them in terms of a matrix of Polyakov loop correlators. We present lattice results for Nf=2+1 QCD with physical quark masses, obtained using the imaginary chemical potential approach, which indicate that the screening masses increase as a function of μB. A separate analysis is carried out for the theoretically interesting case μB/T = 3iπ, where charge conjugation is not explicitly broken and the usual definition of the screening masses can be used for temperatures below the Roberge-Weiss transition. Finally, we investigate the dependence of the static quark free energy on the baryon chemical potential, showing that it is a decreasing function of μB which displays a peculiar behavior as the pseudocritical transition temperature at μB = 0 is approached.
Article
Energy efficiency is becoming increasingly important for computing systems, in particular for large scale HPC facilities. In this work we evaluate, from a user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors, in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable running the full code on a HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need of any additional hardware, and we enable it to change CPU and GPU clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy saving strategies which can be easily adopted on recent high-end HPC systems for generic applications.
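As an illustration of the kind of GPU-side instrumentation described above, the sketch below uses the NVML API; error handling is omitted, and the chosen clock values are placeholders rather than the settings used in the study:

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    /* Lower the application clocks before an energy-sensitive phase
       (memory clock and graphics clock in MHz; placeholder values). */
    nvmlDeviceSetApplicationsClocks(dev, 2505, 705);

    /* Read the instantaneous board power draw, reported in milliwatts. */
    unsigned int mw;
    nvmlDeviceGetPowerUsage(dev, &mw);
    printf("power: %.1f W\n", mw / 1000.0);

    nvmlShutdown();
    return 0;
}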
Article
Full-text available
In this article we will explore the OpenACC platform for programming Graphics Processing Units (GPUs). The OpenACC platform offers a directive based programming model for GPUs which avoids the detailed data flow control and memory management necessary in a CUDA programming environment. In the OpenACC model, programs can be written in high level languages with OpenMP like directives. We present some examples of QCD simulation codes using OpenACC and discuss their performance on the Fermi and Kepler GPUs.
Article
Full-text available
Large-scale simulations play a central role in science and the industry. Several challenges occur when building simulation software, because simulations require complex software developed in a dynamical construction process. That is why simulation software engineering (SSE) is emerging lately as a research focus. The dichotomous trade-off between efficiency and scalability (SE) on one hand and maintainability and portability (MP) on the other hand is one of the core challenges. We report on the SE/MP trade-off in the context of an ongoing systematic literature review (SLR). After characterizing the issue of the SE/MP trade-off using two examples from our own research, we (1) review the 33 identified articles that assess the trade-off, (2) summarize the proposed solutions for the trade-off, and (3) discuss the findings for SSE and future work. Overall, we see evidence for the SE/MP trade-off and first solution approaches. However, a strong empirical foundation has yet to be established, general quantitative metrics and methods supporting software developers in addressing the trade-off have to be developed. We foresee considerable future work in SSE across scientific communities.
Article
Full-text available
Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020 and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signalling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.
Conference Paper
Full-text available
Nucleon transfer experiments have in recent years begun to be exploited in the study of nuclei far from stability, using radioactive beams in inverse kinematics. New techniques are still being developed in order to perform these experiments. The present experiment is designed to study the odd-odd nucleus ²⁶Na, which has a high density of states and therefore requires gamma-ray detection to distinguish between them. The experiment employed an intense beam of up to 3 × 10⁷ pps of ²⁵Na at 5.0 MeV/nucleon from the ISAC-II facility at TRIUMF. The new silicon array SHARC was used for the first time and was coupled to the segmented clover gamma-ray array TIGRESS. A novel thin plastic scintillator detector was employed at zero degrees to identify and reject reactions occurring on the carbon component of the (CD)2 target. The efficiency of the background rejection using this detector is described with respect to the proton and gamma-ray spectra from the (d,p) reaction.
Article
Full-text available
Improved staggered-fermion formulations are a popular choice for lattice QCD calculations. Historically, the algorithm used for such calculations has been the inexact R algorithm, which has systematic errors that only vanish as the square of the integration step size. We describe how the exact rational hybrid Monte Carlo (RHMC) algorithm may be used in this context, and show that for parameters corresponding to current state-of-the-art computations it leads to a factor of approximately seven decrease in cost as well as having no step-size errors.
Article
Full-text available
Many advances in the development of Krylov subspace methods for the iterative solution of linear systems during the last decade and a half are reviewed. These new developments include different versions of restarted, augmented, deflated, flexible, nested, and inexact methods. Also reviewed are methods specifically tailored to systems with special properties such as special forms of symmetry and those depending on one or more parameters. Copyright © 2006 John Wiley & Sons, Ltd.
Conference Paper
Full-text available
We explore the opportunities offered by current and forthcoming VLSI technologies to on-chip multiprocessing for Quantum Chromo Dynamics (QCD), a computational grand challenge for which over half a dozen specialized machines have been developed over the last two decades. Based on a careful study of the information exchange requirements of QCD both across the network and within the memory system, we derive the optimal partition of die area between storage and functional units. We show that a scalable chip organization holds the promise to deliver from hundreds to thousands of flops per cycle as VLSI feature size scales down from 90 nm to 20 nm, over the next dozen years.
Article
Full-text available
Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
Article
Full-text available
Explicit velocity- and position-Verlet-like algorithms of the second order are proposed to integrate the equations of motion in many-body systems. The algorithms are derived on the basis of an extended decomposition scheme in the presence of a free parameter. The nonzero value for this parameter is obtained by reducing the influence of truncated terms to a minimum. As a result, the proposed algorithms appear to be more efficient than the original Verlet versions, which correspond to the particular case when the introduced parameter is equal to zero. Like the original versions, the extended counterparts are symplectic and time reversible, but lead to an improved accuracy in the generated solutions at the same overall computational costs. The advantages of the optimized algorithms are demonstrated in molecular dynamics simulations of a Lennard-Jones fluid.
Article
Full-text available
We examine a new 2nd order integrator recently found by Omelyan et al. The integration error of the new integrator, measured in the root mean square of the energy difference, ⟨ΔH²⟩^{1/2}, is about 10 times smaller than that of the standard 2nd order leapfrog (2LF) integrator. As a result, the step size of the new integrator can be made about three times larger. Taking into account a factor 2 increase in cost, the new integrator is about 50% more efficient than the 2LF integrator. Integrating over positions first, then momenta, is slightly more advantageous than the reverse. Further parameter tuning is possible. We find that the optimal parameter for the new integrator is slightly different from the value obtained by Omelyan et al., and depends on the simulation parameters. This integrator could also be advantageous for the Trotter-Suzuki decomposition in Quantum Monte Carlo.
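For reference, one step of a second-order minimum-norm (Omelyan-type) scheme for a separable Hamiltonian can be sketched as below; the value of lambda is the one commonly quoted in the literature, and the force/update callbacks are generic placeholders rather than the tuned parameters discussed in the abstract:

/* One second-order minimum-norm (Omelyan) integration step for a separable
   Hamiltonian H = p^2/2 + S(q); force(q) must return -dS/dq.
   lambda ~ 0.193 is the commonly quoted value; treat it as tunable,
   as discussed above. */
static const double lambda = 0.1931833275037836;

void omelyan2_step(double *q, double *p, double dt, double (*force)(double))
{
    *p += lambda * dt * force(*q);
    *q += 0.5 * dt * (*p);
    *p += (1.0 - 2.0 * lambda) * dt * force(*q);
    *q += 0.5 * dt * (*p);
    *p += lambda * dt * force(*q);
}

Setting lambda to zero or one half recovers leapfrog-like limits; in molecular-dynamics trajectories the step above is simply repeated n_md times.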
Article
Full-text available
There has been much recent progress in the understanding and reduction of the computational cost of the hybrid Monte Carlo algorithm for lattice QCD as the quark mass parameter is reduced. In this letter we present a new solution to this problem, where we represent the fermionic determinant using n pseudofermion fields, each with an nth root kernel. We implement this within the framework of the rational hybrid Monte Carlo algorithm. We compare this algorithm with other recent methods in this area and find it is competitive with them.
Conference Paper
Full-text available
Numerical simulations of the strong nuclear force, known as quantum chromodynamics or QCD, have proven to be a demanding, forefront problem in high-performance computing. In this report, we describe a new computer, QCDOC (QCD On a Chip), designed for optimal price/performance in the study of QCD. QCDOC uses a six-dimensional, low-latency mesh network to connect processing nodes, each of which includes a single custom ASIC, designed by our collaboration and built by IBM, plus DDR SDRAM. Each node has a peak speed of 1 Gigaflops and two 12,288-node, 10+ Teraflops machines are to be completed in the fall of 2004. Currently, a 512-node machine is running, delivering efficiencies as high as 45% of peak on the conjugate gradient solvers that dominate our calculations, and a 4096-node machine with a cost of $1.6M is under construction. This should give us a price/performance of less than $1 per sustained Megaflops.
Article
Full-text available
An analytic method of smearing link variables in lattice QCD is proposed and tested. The differentiability of the smearing scheme with respect to the link variables permits the use of modern Monte Carlo updating methods based on molecular dynamics evolution for gauge-field actions constructed using such smeared links. In examining the smeared mean plaquette and the static quark-antiquark potential, no degradation in effectiveness is observed as compared to link smearing methods currently in use, although an increased sensitivity to the smearing parameter is found.
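For orientation, the analytic (stout) smearing step referred to in this abstract is conventionally written as follows; this is the standard form from the literature, with ρ the smearing parameter and C_μ(x) the ρ-weighted sum of staples perpendicular to the link U_μ(x), both notational assumptions rather than text taken from the abstract above:

\Omega_\mu(x) = C_\mu(x)\, U_\mu^\dagger(x), \qquad
Q_\mu(x) = \frac{i}{2}\Big(\Omega_\mu^\dagger(x) - \Omega_\mu(x)\Big)
         - \frac{i}{2N}\,\mathrm{Tr}\Big(\Omega_\mu^\dagger(x) - \Omega_\mu(x)\Big),
\qquad
U_\mu'(x) = \exp\!\big(i\, Q_\mu(x)\big)\, U_\mu(x)

Since Q_μ(x) is an analytic function of the links, the molecular-dynamics force can be propagated through the smearing by the chain rule, which is what makes the scheme usable inside HMC-type updates.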
Conference Paper
Lattice Quantum Chromodynamics (QCD) is a powerful tool to numerically access the low energy regime of QCD in a straightforward way with quantifiable uncertainties. In this approach, QCD is discretized on a four-dimensional, Euclidean space-time grid with millions of degrees of freedom. In modern lattice calculations, most of the work is still spent in solving large, sparse linear systems. This part has two challenges, i.e. optimizing the sparse matrix application as well as BLAS-like kernels used in the linear solver. We are going to present performance optimizations of the Dirac operator (dslash) with and without clover term for recent Intel® architectures, i.e. Haswell and Knights Landing (KNL). We were able to achieve a good fraction of peak performance for the Wilson-Dslash kernel, and Conjugate Gradients and Stabilized BiConjugate Gradients solvers. We will also present a series of experiments we performed on KNL, i.e. running MCDRAM in different modes, enabling or disabling hardware prefetching as well as using different SoA lengths. Furthermore, we will present a weak scaling study up to 16 KNL nodes.
Article
This biennial Review summarizes much of Particle Physics. Using data from previous editions, plus 2205 new measurements from 667 papers, we list, evaluate, and average measured properties of gauge bosons, leptons, quarks, mesons, and baryons. We also summarize searches for hypothetical particles such as Higgs bosons, heavy neutrinos, and supersymmetric particles. All the particle properties and search limits are listed in Summary Tables. We also give numerous tables, figures, formulae, and reviews of topics such as the Standard Model, particle detectors, probability, and statistics. This edition features expanded coverage of CP violation in B mesons and of neutrino oscillations. For the first time we cover searches for evidence of extra dimensions (both in the particle listings and in a new review). Another new review is on Grand Unified Theories. A booklet is available containing the Summary Tables and abbreviated versions of some of the other sections of this full Review. All tables, listings, and reviews (and errata) are also available on the Particle Data Group website: http://pdg.lbl.gov
Conference Paper
Current development trends of fast processors call for an increasing number of cores, each core featuring wide vector processing units. Applications must then exploit both directions of parallelism to run efficiently. In this work we focus on the efficient use of vector instructions. These process several data elements in parallel, and memory data layout plays an important role in making this efficient. An optimal memory layout depends in principle on the access patterns of the algorithm but also on the architectural features of the processor. However, different parts of the application may have different requirements, and then the choice of the most efficient data structure for vectorization has to be carefully assessed. We address these problems for a Lattice Boltzmann (LB) code, widely used in computational fluid dynamics. We consider a state-of-the-art two-dimensional LB model that accurately reproduces the thermo-hydrodynamics of a 2D fluid. We write our codes in C and expose vector parallelism using a directive-based programming approach. We consider different data layouts and analyze the corresponding performance. Our results show that, if an appropriate data layout is selected, it is possible to write a code for this class of applications that is automatically vectorized and performance portable on several architectures. We end up with a single code that runs efficiently on traditional multi-core processors as well as on recent many-core systems such as the Xeon Phi.
Book
Numerical simulation of lattice-regulated QCD has become an important source of information about strong interactions. In the last few years there has been an explosion of techniques for performing ever more accurate studies on the properties of strongly interacting particles. Lattice predictions directly impact many areas of particle and nuclear physics theory and phenomenology. This book provides a thorough introduction to the specialized techniques needed to carry out numerical simulations of QCD: A description of lattice discretizations of fermions and gauge fields, methods for actually doing a simulation, descriptions of common strategies to connect simulation results to predictions of physical quantities, and a discussion of uncertainties in lattice simulations. More importantly, while lattice QCD is a well-defined field in its own right, it has many connections to continuum field theory and elementary particle physics phenomenology, which are carefully elucidated in this book. © 2006 by World Scientific Publishing Co. Pte. Ltd. All rights reserved.
Article
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed using specific programming languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives to mark regions of existing C, C++, or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test-bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
Conference Paper
This paper describes our OpenACC porting efforts for a code that we have developed that solves the isothermal incompressible Navier Stokes equation via the lattice Boltzmann method (LBM). Implemented initially as a hybrid MPI/OpenMP parallel program, we ported our code to use OpenACC in order to obtain the benefit of accelerators in a high performance computing (HPC) environment. We describe the elements of parallelism inherent in the LBM algorithm and the way in which that parallelism can be expressed using OpenACC directives. By setting compile-time flags during the build process, the program alternatively can be compiled for, and continue to run without the benefit of, accelerators. Through this porting process we were able to accelerate our code in an incremental fashion without extensive code transformation. We point out some additional efforts that were required to expose C++ class data members to the OpenACC compiler and describe difficulties encountered in this process. We show an instance where similar code segments decorated with the same OpenACC directives can result in different compiler outputs with significantly different performance properties. The result of the code porting process, which occurred primarily during OpenACC EuroHack, a 5-day intensive computing workshop, was a 5.5x speedup over the non-accelerated version of the code.
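The compile-time switching described above typically relies on the standard _OPENACC macro, which compilers predefine when OpenACC is enabled; the toy kernel below is a generic sketch of the mechanism, not the actual build machinery of the cited code:

/* _OPENACC is predefined by the compiler when OpenACC is enabled, so the
   same source builds and runs with or without an accelerator. */
void stream_step(int n, double *restrict f_new, const double *restrict f_old)
{
#ifdef _OPENACC
    #pragma acc parallel loop copyin(f_old[0:n]) copyout(f_new[0:n])
#endif
    for (int i = 0; i < n; i++) {
        f_new[i] = f_old[(i + n - 1) % n];   /* toy periodic shift along one direction */
    }
}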
Chapter
This chapter illustrates some of the key concepts behind optimizing the so-called Wilson-Dslash kernel, which is a frequently used component in lattice QCD calculations. Quantum chromodynamics (QCD) is the theory of the strong nuclear force responsible for binding quarks and gluons together into nuclei. It is one of the fundamental forces making up the Standard Model of particle interactions. Lattice QCD is a version of the theory suitable for numerical computation, which is used heavily in theoretical calculation for nuclear and high energy particle physics.
Conference Paper
Today's HPC systems are increasingly utilizing accelerators to lower time to solution for their users and reduce power consumption. To utilize the higher performance and energy efficiency of these accelerators, application developers need to rewrite at least parts of their codes. Taking the C++ flow solver ZFS as an example, we show that the directive-based programming model allows one to achieve good performance with reasonable effort, even for mature codes with many lines of code. Using OpenACC directives permitted us to incrementally accelerate ZFS, focusing on the parts of the program that are relevant for the problem at hand. The two new OpenACC 2.0 features, unstructured data regions and atomics, are required for this. OpenACC's interoperability with existing GPU libraries via the host_data use_device construct allowed us to use CUDA-aware MPI to achieve multi-GPU scalability comparable to the CPU version of ZFS. Like many other codes, the data structures of ZFS have been designed with traditional CPUs and their relatively large private caches in mind. This leads to suboptimal memory access patterns on accelerators, such as GPUs. We show how the texture cache on NVIDIA GPUs can be used to minimize the performance impact of these suboptimal patterns without writing platform specific code. For the kernel most affected by the memory access pattern, we compare the initial array of structures memory layout with a structure of arrays layout.
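The host_data interoperability pattern mentioned above looks roughly like this; buffer and function names are assumptions, and the halo array is assumed to be device-resident inside an enclosing OpenACC data region:

#include <mpi.h>

/* Exchange a device-resident halo buffer with a neighbor rank.  Inside the
   host_data region, `halo` refers to the device address, so a CUDA-aware
   MPI library can transfer it without an explicit copy to the host. */
void exchange_halo(double *halo, int count, int neighbor, MPI_Comm comm)
{
    #pragma acc host_data use_device(halo)
    {
        MPI_Sendrecv_replace(halo, count, MPI_DOUBLE,
                             neighbor, 0, neighbor, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}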
Conference Paper
Many scientific software applications that solve complex compute- or data-intensive problems, such as large parallel simulations of physics phenomena, increasingly use HPC systems in order to achieve scientifically relevant results. An increasing number of HPC systems adopt heterogeneous node architectures, combining traditional multi-core CPUs with energy-efficient massively parallel accelerators, such as GPUs. The need to exploit the computing power of these systems, in conjunction with the lack of standardization in their hardware and/or programming frameworks, raises new issues with respect to scientific software development choices, which strongly impact software maintainability, portability and performance. Several new programming environments have been introduced recently in order to address these issues. In particular, the OpenACC programming standard has been designed to ease the software development process for codes targeted for heterogeneous machines, helping to achieve code and performance portability. In this paper we present, as a specific OpenACC use case, an example of design, porting and optimization of an LQCD Monte Carlo code intended to be portable across present and future heterogeneous HPC architectures; we describe the design process, the most critical design choices and evaluate the trade-off between portability and efficiency.
Conference Paper
Lattice Quantum Chromodynamics (QCD) is one of the most challenging applications running on massively parallel supercomputers. To reproduce these physical phenomena on a supercomputer, a precise simulation is demanded, requiring well-optimized and scalable code. We have optimized lattice QCD programs on Blue Gene family supercomputers and shown their strength in lattice QCD simulation. Here we optimized for the third-generation Blue Gene/Q supercomputer: i) by changing the data layout, ii) by exploiting new SIMD instruction sets, and iii) by pipelining boundary data exchange to overlap communication and calculation. The optimized lattice QCD program shows excellent weak scalability on the large scale Blue Gene/Q system, and with 16 racks we sustained 1.08 Pflop/s, 32.1% of the theoretical peak performance, including the conjugate gradient solver routines.
Article
We extend a preconditioning technique for lattice Wilson fermions, previously introduced for propagator measurements, to the case of simulations with dynamical fermions. We give a general (but simple) form for the inclusion of preconditioning in dynamical fermion simulation algorithms. Some very encouraging preliminary results are also discussed.
Article
This is a transcript of the recorded panel discussion on the cost of dynamical quark simulations at Lattice2001.
Article
A mechanism for total confinement of quarks, similar to that of Schwinger, is defined which requires the existence of Abelian or non-Abelian gauge fields. It is shown how to quantize a gauge field theory on a discrete lattice in Euclidean space-time, preserving exact gauge invariance and treating the gauge fields as angular variables (which makes a gauge-fixing term unnecessary). The lattice gauge theory has a computable strong-coupling limit; in this limit the binding mechanism applies and there are no free quarks. There is unfortunately no Lorentz (or Euclidean) invariance in the strong-coupling limit. The strong-coupling expansion involves sums over all quark paths and sums over all surfaces (on the lattice) joining quark paths. This structure is reminiscent of relativistic string models of hadrons.
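For reference, the gauge action introduced in this work is nowadays usually written (in modern notation, not quoted from the original paper) as

$$ S_G[U] \;=\; \beta \sum_{x}\sum_{\mu<\nu}\left(1-\frac{1}{N}\,\mathrm{Re}\,\mathrm{Tr}\,U_\mu(x)\,U_\nu(x+\hat\mu)\,U_\mu^{\dagger}(x+\hat\nu)\,U_\nu^{\dagger}(x)\right), \qquad \beta = \frac{2N}{g^2}, $$

where the product of link variables around an elementary plaquette is the smallest gauge-invariant object on the lattice.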
Article
The APE computer is a high performance processor designed to provide massive computational power for intrinsically parallel and homogeneous applications. APE is a linear array of processing elements and memory boards that execute in parallel in SIMD mode under the control of a CERN/SLAC 3081/E. Processing elements and memory boards are connected by a ‘circular’ switchnet. The hardware and software architecture of APE, as well as its implementation are discussed in this paper. Some physics results obtained in the simulation of lattice gauge theories are also presented.
Article
We present an OpenCL-based Lattice QCD application using a heatbath algorithm for the pure gauge case and Wilson fermions in the twisted mass formulation. The implementation is platform independent and can be used on AMD or NVIDIA GPUs, as well as on classical CPUs. On the AMD Radeon HD 5870 our double precision dslash implementation performs at 60 GFLOPS over a wide range of lattice sizes. The hybrid Monte Carlo code presented reaches a speedup of four over the reference code running on a server CPU.
Article
We report on our implementation of the RHMC algorithm for the simulation of lattice QCD with two staggered flavors on Graphics Processing Units, using the NVIDIA CUDA programming language. The main feature of our code is that the GPU is not used just as an accelerator: the whole Molecular Dynamics trajectory is performed on it. After pointing out the main bottlenecks and how to circumvent them, we discuss the performance obtained. We present some preliminary results regarding OpenCL and multi-GPU extensions of our code and discuss future perspectives.
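As a reminder of the rational approximation from which the RHMC algorithm takes its name (standard form, not quoted from this paper), the fractional power of the fermion matrix acting on a pseudofermion field is replaced by a partial-fraction expansion,

$$ (M^\dagger M)^{-\alpha}\,\phi \;\approx\; \Big( a_0 + \sum_{k=1}^{n} a_k\,\big(M^\dagger M + b_k\big)^{-1} \Big)\,\phi, $$

so that all the shifted inversions can be obtained at once with a multi-shift Krylov solver; the exponent $\alpha$ depends on the number of staggered flavors being simulated.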
Article
We present a new method for the numerical simulation of lattice field theory. A hybrid (molecular dynamics/Langevin) algorithm is used to guide a Monte Carlo simulation. There are no discretization errors even for large step sizes. The method is especially efficient for systems such as quantum chromodynamics which contain fermionic degrees of freedom. Detailed results are presented for four-dimensional compact quantum electrodynamics including the dynamical effects of electrons.
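The exactness claim above rests on the Metropolis correction applied at the end of each molecular-dynamics trajectory (standard form, not quoted from the paper): a proposed configuration is accepted with probability

$$ P_{\mathrm{acc}} \;=\; \min\!\left(1,\; e^{-\Delta H}\right), \qquad \Delta H = H(U',P') - H(U,P), $$

so the finite step size of the integrator affects only the acceptance rate, not the sampled distribution.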
Article
A complete classification of all explicit self-adjoint decomposition algorithms with up to 11 stages is given and the derivation process used is described. As a result, we have found 37 (6 non-gradient plus 31 force-gradient) new schemes of orders 2 to 6, in addition to 8 (4 non- and 4 force-gradient) integrators known earlier. It is shown that the derivation process proposed can be extended, in principle, to arbitrarily higher stage numbers without loss of generality. In practice, due to the restricted capabilities of modern computers, the maximal number of stages that can be handled within the direct decomposition approach is limited to 23. This corresponds to force-gradient algorithms of order 8. Combining the decomposition method with an advanced composition technique allows one to increase the overall order up to a value of 16. The implementation and application of the introduced algorithms to the numerical integration of the equations of motion in classical and quantum systems is considered as well. As is predicted theoretically and confirmed in molecular dynamics and celestial mechanics simulations, some of the new algorithms are particularly outstanding. They lead to significantly better integration than known decomposition schemes such as the Verlet, Forest–Ruth, Chin, Suzuki, Yoshida, and Li integrators.
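A well-known low-order member of this family, widely used in lattice QCD and written here in the notation common in that literature (not quoted from the paper), is the second-order minimum-norm scheme

$$ e^{\epsilon(\hat T+\hat V)} \;=\; e^{\lambda\epsilon \hat V}\, e^{\frac{\epsilon}{2}\hat T}\, e^{(1-2\lambda)\epsilon \hat V}\, e^{\frac{\epsilon}{2}\hat T}\, e^{\lambda\epsilon \hat V} \;+\; \mathcal{O}(\epsilon^3), \qquad \lambda \simeq 0.193, $$

where $\hat T$ and $\hat V$ denote the two non-commuting pieces of the Hamiltonian evolution and the value of $\lambda$ is chosen to minimize the norm of the leading error term.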
Article
Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson–Dirac sparse matrix–vector product that performs at up to 40, 135 and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
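As a hedged illustration of the general mixed-precision idea (this is the simple defect-correction pattern that the reliable-update strategy of the paper refines; all function names are illustrative placeholders):

    /* Outer loop in double precision, inner inexact solve in single
       precision; the true residual is recomputed in double precision
       so the final answer reaches full double-precision accuracy. */
    #include <stdlib.h>

    void   solve_sp(int n, const float *rhs, float *dx, float tol); /* placeholder inner solver */
    void   residual_dp(int n, const double *b, const double *x, double *r); /* r = b - A x     */
    double norm_dp(const double *v, int n);                         /* placeholder norm        */

    void solve_mixed(int n, const double *b, double *x, double tol)
    {
        double *r  = malloc(n * sizeof *r);
        float  *rs = malloc(n * sizeof *rs);
        float  *dx = malloc(n * sizeof *dx);

        for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; }

        while (norm_dp(r, n) > tol) {
            for (int i = 0; i < n; i++) rs[i] = (float) r[i];   /* demote residual   */
            solve_sp(n, rs, dx, 0.1f);                          /* cheap inner solve */
            for (int i = 0; i < n; i++) x[i] += (double) dx[i]; /* accumulate update */
            residual_dp(n, b, x, r);                            /* true residual     */
        }
        free(r); free(rs); free(dx);
    }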
Article
We discuss a class of reversible, discrete approximations to Hamilton's equations for use in the hybrid Monte Carlo algorithm and derive an asymptotic formula for the step-size-dependent errors arising from this family of approximations. For lattice QCD with Wilson fermions, we construct several different updates in which the effect of fermion vacuum polarization is given a longer time step than the gauge field's self-interaction. On a $4^4$ lattice, one of these algorithms with an optimal choice of step size is 30% to 40% faster than the standard leapfrog update with an optimal step size.
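A common way of writing a two-level scheme in this spirit (standard nested-leapfrog notation, not quoted from the paper) applies the fermion force $\hat V_f$ with the long step $\epsilon$ and the gauge force $\hat V_g$ with a finer step $\epsilon/m$:

$$ e^{\epsilon(\hat T + \hat V_g + \hat V_f)} \;\approx\; e^{\frac{\epsilon}{2}\hat V_f}\, \left[\, e^{\frac{\epsilon}{2m}\hat V_g}\; e^{\frac{\epsilon}{m}\hat T}\; e^{\frac{\epsilon}{2m}\hat V_g} \right]^{m} e^{\frac{\epsilon}{2}\hat V_f}, $$

so the expensive fermion force is evaluated far less often than the cheap gauge force.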
Article
A Monte Carlo procedure is constructed for lattice gauge theories with fermions by replacing integration over fermion degrees of freedom in the path integral by conventional integration over effective boson degrees of freedom. The method is applied to gauge theories over two discrete subgroups of SU(2).
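In its modern form (standard notation, not quoted from the original paper) the replacement reads, up to an irrelevant normalization,

$$ \det\!\left(M^\dagger M\right) \;\propto\; \int \mathcal{D}\phi^\dagger\,\mathcal{D}\phi\;\, e^{-\,\phi^\dagger (M^\dagger M)^{-1} \phi}, $$

where $\phi$ is an ordinary complex ("pseudofermion") boson field, so that the Grassmann integration over the quark fields is traded for an inversion of the fermion matrix inside a conventional Monte Carlo.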
Article
It is shown that Symanzik's program, to all leading logarithms, can be accomplished for pure Yang-Mills theory by adding to the Wilson Lagrangian the simplest term of dimension 6, i.e. the rectangle. Moreover, no changes in the source terms are needed.
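In the notation that later became standard (not quoted from the paper), the resulting tree-level improved gauge action combines $1\times1$ plaquettes and $1\times2$ rectangles,

$$ S_G \;=\; \beta \sum_x \left[\, c_0 \sum_{\mu<\nu}\Big(1-\tfrac{1}{N}\,\mathrm{Re\,Tr}\,U^{1\times1}_{\mu\nu}(x)\Big) \;+\; c_1 \sum_{\mu\neq\nu}\Big(1-\tfrac{1}{N}\,\mathrm{Re\,Tr}\,U^{1\times2}_{\mu\nu}(x)\Big) \right], $$

with the normalization $c_0 + 8 c_1 = 1$ and, at tree level, $c_1 = -1/12$.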
Article
We present a variant of the HMC algorithm with mass preconditioning (Hasenbusch acceleration) and multiple time scale integration. We have tested this variant for standard Wilson fermions at β=5.6 and at pion masses ranging from 380 to 680 MeV. We show that in this situation its performance is comparable to the recently proposed HMC variant with domain decomposition as preconditioner. We give an update of the “Berlin Wall” figure, comparing the performance of our variant of the HMC algorithm to other published performance data. Advantages of the HMC algorithm with mass preconditioning and multiple time scale integration are that it is straightforward to implement and can be used in combination with a wide variety of lattice Dirac operators.
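The determinant splitting underlying mass preconditioning can be written (standard form, not quoted from the paper) as

$$ \det\!\left(M^\dagger M\right) \;=\; \det\!\left(W^\dagger W\right)\, \det\!\left[(W^\dagger W)^{-1}(M^\dagger M)\right], \qquad W = M + \mu, $$

with each factor estimated by its own pseudofermion field; the heavier, cheaper factor can then be placed on a finer time scale of the integrator than the light, expensive ratio.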
Article
Three computer-time saving techniques for lattice QCD programs are presented. They concern the link updating order, the compacting of the data and the pseudo-heatbath algorithm.
Article
Symanzik's programme for constructing a lattice action with improved continuum limit behaviour is considered for the case of pure Yang-Mills theory. The structure of the action is proposed and discussed in detail to lowest order in perturbation theory.
Article
We determine the topological susceptibility of the gauge configurations generated by lattice simulations using two flavors of optimal domain-wall fermion on the $16^3 \times 32$ lattice with length 16 in the fifth dimension, at the lattice spacing $a \simeq 0.1$ fm. Using the adaptive thick-restart Lanczos algorithm, we project the low-lying eigenmodes of the overlap Dirac operator, and obtain the topological charge of each configuration, for eight ensembles with pion masses in the range $220$-$550$ MeV. From the topological charge, we compute the topological susceptibility and the second normalized cumulant. Our result for the topological susceptibility agrees with the sea-quark mass dependence predicted by chiral perturbation theory and provides a determination of the chiral condensate, $\Sigma^{\overline{\mathrm{MS}}}(2\,\mathrm{GeV}) = [259(6)(7)\,\mathrm{MeV}]^3$, and the pion decay constant, $F_\pi = 92(12)(2)$ MeV.
Article
QPACE is a novel massively parallel architecture optimized for lattice QCD simulations. Each node comprises an IBM PowerXCell 8i processor. The nodes are interconnected by a custom 3-dimensional torus network implemented on an FPGA. The architecture was systematically optimized with respect to power consumption. This put QPACE in the number one spot on the Green500 List published in November 2009. In this paper we give an overview of the architecture and highlight the steps taken to improve power efficiency.
Article
In this work we explore the performance of CUDA in quenched lattice SU(2) simulations. CUDA, the NVIDIA Compute Unified Device Architecture, is a hardware and software architecture developed by NVIDIA for computing on the GPU. We present an analysis and performance comparison between the GPU and CPU in single and double precision. Analyses with multiple GPUs and two different architectures (G200 and Fermi) are also presented. In order to obtain high performance, the code must be optimized for the GPU architecture, i.e., the implementation must exploit the memory hierarchy of the CUDA programming model. We produce codes for the Monte Carlo generation of SU(2) lattice gauge configurations, for the mean plaquette, for the Polyakov loop at finite T and for the Wilson loop. We also present results for the potential using many configurations ($50\,000$) without smearing and almost $2\,000$ configurations with APE smearing. With two Fermi GPUs we have achieved an excellent performance, around 110 Gflops/s in single precision, a speedup of $200\times$ over one CPU. We also find that, using the Fermi architecture, double precision computations for the static quark-antiquark potential are not much slower (less than $2\times$ slower) than single precision computations.
Article
apeNEXT is the latest in the APE collaboration's series of parallel computers for computationally intensive calculations such as quantum chromodynamics on the lattice. The authors describe the computer architectural choices that have been shaped by almost two decades of collaboration activity.
Article
The speed, bandwidth and cost characteristics of today's PC graphics cards make them an attractive target as general purpose computational platforms. High performance can be achieved also for lattice simulations but the actual implementation can be cumbersome. This paper outlines the architecture and programming model of modern graphics cards for the lattice practitioner with the goal of exploiting these chips for Monte Carlo simulations. Sample code is also given.
M. Albanese et al., Comput. Phys. Commun. 45, 345 (1987), doi:10.1016/0010-4655(87)90172-X.
K. G. Wilson, Physical Review D 10, 2445 (1974).
S. Wienke, C. Terboven, J. Beyer and M. Müller, Lecture Notes in Computer Science 8632, 812 (2014).
P. de Forcrand, D. Lellouch and C. Roiesnel, Journal of Computational Physics 59, 324 (1985).