Conference Paper

A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations


Abstract

The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver, capable of revealing the coupled electrothermal properties of atomically-resolved nano-transistors, has been improved by up to two orders of magnitude through a data-centric reorganization of the application. The approach yields coarse- and fine-grained data-movement characteristics that can be used for performance and communication modeling, communication-avoidance, and dataflow transformations. The resulting code has been tuned for two top-6 hybrid supercomputers, reaching a sustained performance of 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of the peak) in double precision, and 90.89 Pflop/s in mixed precision. These computational achievements enable the restructured QT simulator to treat realistic nanoelectronic devices made of more than 10,000 atoms in a 14x shorter time than the original code needs to handle a system with 1,000 atoms, on the same number of CPUs/GPUs and with the same physical accuracy.
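For readers unfamiliar with the data-centric methodology, here is a minimal sketch (using the open-source DaCe framework that underlies the paper; an illustration of the workflow, not code from the QT simulator) of how a NumPy-like program is captured as a stateful dataflow multigraph (SDFG) and how a dataflow transformation such as map fusion reduces intermediate data movement:

```python
# Minimal sketch of the data-centric workflow with DaCe (pip install dace).
# Illustrative only; the simulator's actual kernels are far more complex.
import dace
import numpy as np

N = dace.symbol('N')

@dace.program
def scale_and_add(A: dace.float64[N], B: dace.float64[N], C: dace.float64[N]):
    tmp = A * 2.0   # first elementwise map over N
    C[:] = tmp + B  # second elementwise map over N

sdfg = scale_and_add.to_sdfg()  # all data movement is now explicit in the IR

# Attempt to fuse the two maps so `tmp` never round-trips through memory --
# the same kind of data-movement reduction the paper applies at scale.
from dace.transformation.dataflow import MapFusion
sdfg.apply_transformations_repeated(MapFusion)

A = np.random.rand(1024); B = np.random.rand(1024); C = np.zeros(1024)
sdfg(A=A, B=B, C=C, N=1024)
assert np.allclose(C, A * 2.0 + B)
```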


... First-principles materials simulation is the most accurate and effective quantum-mechanical methodology to explore ab initio electronic structures for designing new high-efficiency energy materials and electronic devices. For new quantum multifunctional materials and next-generation electronics [1], [2], nanoscopic (< 10 nm) and mesoscopic (> 100 nm) heterostructures [3] with complex atomic structures and electronic properties have been proposed as strong candidates for solar cells, battery electrodes, field-effect transistors (FETs) [2], PN junctions and diodes, owing to their superior electronic properties (e.g., bandgap opening, band alignment and charge transfer), as shown in Fig. 1(a). For example, as one of the most important two-dimensional (2D) materials, graphene and its interfaces with metals have attracted much attention for graphene FETs [4] because of graphene's high mobility. ...
... However, even for the first maximum magic angle (1.1°) in MATBG, the smallest unit cell (10 nm) contains more than 10K atoms. Thus, multilayer MATBG systems with smaller magic angles in supercells readily reach up to 1M atoms (> 100 nm) in real applications [2]. Metal alloys, such as lithium-sodium (Li/Na) and gold-copper (Au/Cu), act as electrodes in batteries [6] and FETs [1]; the light-metal Li/Na alloy in particular exerts a strong quantum effect on battery performance. ...
... Following Moore's law, the transistor count on leading microprocessors has doubled roughly every two years over the last decades, accompanied by a proportional reduction of the lateral dimensions of a transistor [1,2], with the newest 7 nm technology in industry containing billions of transistors in a single die [3]. Meanwhile, the conventional planar MOSFET (metal-oxide semiconductor field effect transistor) has been replaced by the FinFET (fin field effect transistor), and future technologies such as the GAAFET (gate-all-around nanowire FET), graphene-based nanoribbon transistors and coaxially-gated carbon nanotube FETs are surging [1,4]. ...
... As the density of transistors increases, the problem becomes more conspicuous and even fatal for high-performance microprocessor chips. In fact, the heat flux intensity has now reached more than 30 W/cm² [1]. ...
Article
A highly efficient and novel atomistic simulation framework is established, for the first time, for the thermal and mechanical behaviors of a whole microprocessor chip or its constituent functional modules, which are important for the performance and reliability of high-end microprocessor circuits. Taking the simulation of thermal behavior as a model system, we reached the simulation of constituent functional modules integrating about 55.3 thousand nano-transistors with around 107 billion atoms. Traditionally, macroscopic continuum methods struggle to treat such nanoscale factors as doping, thin dielectric layers, surfaces and interfaces in nano-transistor devices, while microscopic quantum-mechanical methods can only calculate one or a few nano-transistors. The proposed simulation realizes the integrated treatment of the above nanoscale factors and complex gate layouts by coupling multiple interatomic potential models for different materials and designing efficient parallel algorithms, and bridges the mesoscale simulation gap between the aforementioned macroscopic and microscopic methods. This development is the first atomic-scale simulation framework for predicting and modulating the thermal behavior of a microprocessor circuit or its functional modules, which paves an exciting way toward the atomic-resolution design of novel high-performance microprocessor chips in the post-Moore era.
... In quantum physics, matrix size scales as 2^(number of qubits). In physical chemistry or density functional theory (DFT), simulations require factorizing matrices of atom interactions, yielding sizes ranging from N = 1,024 up to N = 131,072 [18,66]. In machine learning, matrix factorizations are used for inverting Kronecker factors [52], whose sizes are usually around N = 4,096. ...
... Furthermore, throughput-oriented hardware, such as GPUs and FPGAs, may benefit even more from the communication reduction of our schedules. Thus, COnfLUX and COnfCHOX not only outperform the state-of-the-art libraries at relatively small scales - which are the most common use cases in practice [18,52,66] - but also promise speedups on full-scale performance runs on modern supercomputers. ...
... Pebbling [13,26,37,45,56] (scope: general cDAGs); projection-based [8,15,20,21,23,51] (scope: programs with a geometric structure of the iteration space); problem-specific [1,9,17,48,66] (scope: individually tailored for a given problem). ...
Preprint
Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating N^3/(P*sqrt(M)) elements per processor, where M is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 262,144 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.
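As a back-of-the-envelope illustration of the claimed communication volume (hypothetical machine parameters, not figures from the paper), the N^3/(P*sqrt(M)) bound can be evaluated directly:

```python
# Hypothetical example: elements communicated per processor under the
# N^3/(P*sqrt(M)) cost of a communication-optimal 2.5D factorization.
from math import sqrt

N = 131_072           # matrix dimension (upper end of the DFT-like sizes above)
P = 512               # number of processors
M = 16 * 2**30 // 8   # local memory in 8-byte words (16 GiB per processor)

elements = N**3 / (P * sqrt(M))
print(f"~{elements:.3e} elements (~{elements * 8 / 2**30:.2f} GiB) per processor")
```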
... MLIR serves as a natural "connective tissue", with lowering and conversion from frontends in different languages and to backends covering a plethora of hardware architectures. For the datacentric representation, we choose the DaCe framework and its stateful dataflow multigraph (SDFG) IR [3], which is capable of optimizing a wide variety of applications through data movement minimization [4,6,8,17,28,37,38]. ...
... It also provides a transformation API on the IR to separate the concerns of the developer and the performance engineer. DaCe has successfully improved the performance of applications in weather and climate models [4,8], sparse linear algebra in quantum transport simulation [37], graph analytics [3], and full neural network optimization in deep learning [28]. ...
Preprint
With the rise of specialized hardware and new programming languages, code optimization has shifted its focus towards promoting data locality. Most production-grade compilers adopt a control-centric mindset - instruction-driven optimization augmented with scalar-based dataflow - whereas other approaches provide domain-specific and general purpose data movement minimization, which can miss important control-flow optimizations. As the two representations are not commutable, users must choose one over the other. In this paper, we explore how both control- and data-centric approaches can work in tandem via the Multi-Level Intermediate Representation (MLIR) framework. Through a combination of an MLIR dialect and specialized passes, we recover parametric, symbolic dataflow that can be optimized within the DaCe framework. We combine the two views into a single pipeline, called DCIR, showing that it is strictly more powerful than either view. On several benchmarks and a real-world application in C, we show that our proposed pipeline consistently outperforms MLIR and automatically uncovers new optimization opportunities with no additional effort.
... Single precision can also be applied to Coupled Cluster [20,21], including time dependent variants [22]. Other targets of single precision optimizations include DMRG [23], quantum transport calculations [24], and GW [25]. ...
... We vary the effective precision of the mantissa used for multiplication from 11 (half) to 53 (double) bits. For each calculation, we use one of three fixed accumulation mantissa widths (24, 37, 53) or the "same" precision as the multiplication. ...
Preprint
The abundant demand for deep learning compute resources has created a renaissance in low precision hardware. Going forward, it will be essential for simulation software to run on this new generation of machines without sacrificing scientific fidelity. In this paper, we examine the precision requirements of a representative kernel from quantum chemistry calculations: calculation of the single particle density matrix from a given mean field Hamiltonian (i.e. Hartree-Fock or Density Functional Theory) represented in an LCAO basis. We find that double precision affords an unnecessarily high level of precision, leading to optimization opportunities. We show how an approximation built from an error-free matrix multiplication transformation can be used to potentially accelerate this kernel on future hardware. Our results provide a road map for adapting quantum chemistry software for the next generation of High Performance Computing platforms.
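The mantissa sweep described in the excerpt above can be emulated in software by rounding values to a fixed number of mantissa bits; a minimal sketch (my own illustration, not the authors' experimental harness):

```python
# Emulate reduced mantissa precision by rounding to `bits` mantissa bits.
# Sketch of the kind of precision sweep described above; illustrative only.
import numpy as np

def truncate_mantissa(x, bits):
    m, e = np.frexp(x)                     # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 2.0**bits) / 2.0**bits
    return np.ldexp(m, e)

x = np.random.rand(4)
for bits in (11, 24, 53):                  # half-, single-, double-like widths
    err = np.abs(truncate_mantissa(x, bits) - x).max()
    print(f"{bits:2d} mantissa bits: max abs error {err:.2e}")
```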
... Parametric optimization is the process of searching for an optimal set of input parameters to an application-specific cost function, such that its output value is minimal. Parametric optimization problems are often encountered in physics [7], [8], chemistry [9], medicine [10], [11], engineering [12], [13], [14] and economics [15]. ...
... At the same time, it is also possible that a client is merely busy with the calculation of a work item for longer than expected and therefore participates in the system again after an alleged error has been detected. To ensure that the specification S can be satisfied even if an error f ∈ F occurs, the server must be independent of the number of available clients at all times and ensure that a failing client does not affect the satisfaction of S. To enable this, as shown in Figure 5, a client-server model with two-sided timeouts is used. [Footnote 7: For example, all state information would need to be backed up to a fail-over server that can replace the server in the event of a failure; the task of synchronization alone adds a huge amount of overhead and is not trivial to implement. Footnote 8: The number of work items is a finite constant. Footnote 9: I.e., any number of faults may occur for a finite time, meaning that temporary failures of clients or network connections are irrelevant for the satisfaction of S.] ...
Preprint
Full-text available
Many challenges of today's science are parametric optimization problems that are extremely complex and computationally intensive to calculate. At the same time, the hardware for high-performance computing is becoming increasingly powerful. Geneva is a framework for parallel optimization of large-scale problems with highly nonlinear quality surfaces in grid and cloud environments. To harness the immense computing power of high-performance computing clusters, we have developed a new networking component for Geneva, the so-called MPI Consumer, which makes Geneva suitable for HPC. Geneva is most prominent for its evolutionary algorithm, which requires repeatedly evaluating a user-defined cost function. The MPI Consumer parallelizes the computation of the candidate solutions' cost functions by sending them to remote cluster nodes. By using an advanced multithreading mechanism on the master node and by using asynchronous requests on the worker nodes, the MPI Consumer is highly scalable. Additionally, it provides fault tolerance, which is usually not the case for MPI programs but becomes increasingly important for HPC. Moreover, the MPI Consumer provides a framework for the intuitive implementation of fine-grained parallelization of the cost function. Since the MPI Consumer conforms to the standard paradigm of HPC programs, it vastly improves Geneva's user-friendliness on HPC clusters. This article gives insight into Geneva's general system architecture and the system design of the MPI Consumer as well as the underlying concepts. Geneva, including the novel MPI Consumer, is publicly available as an open source project on GitHub and is currently used for fundamental physics research at GSI in Darmstadt, Germany.
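The two-sided-timeout idea described in the excerpt can be reduced to a toy model (purely illustrative; Geneva's MPI Consumer implements this over MPI with asynchronous requests and a multithreaded master): the server never blocks on any single client, and work items not returned within a deadline are simply resubmitted, so a crashed or slow client cannot violate the specification.

```python
# Toy illustration of server-side timeouts with work-item resubmission.
# Conceptual sketch only -- not Geneva code.
import queue, random, threading, time

NUM_ITEMS, TIMEOUT = 8, 1.0
work, results = queue.Queue(), {}
for item in range(NUM_ITEMS):
    work.put(item)

def flaky_client():
    while True:
        item = work.get()
        if random.random() < 0.3:      # simulate a crashed client: item is lost
            continue
        results[item] = item ** 2      # evaluate the "cost function"

for _ in range(4):
    threading.Thread(target=flaky_client, daemon=True).start()

deadline = time.time() + TIMEOUT
while len(results) < NUM_ITEMS:        # server resubmits overdue work items
    if time.time() > deadline:
        for item in range(NUM_ITEMS):
            if item not in results:
                work.put(item)
        deadline = time.time() + TIMEOUT
    time.sleep(0.05)
print(sorted(results.items()))
```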
... SDFGs allow representing programs by their dataflow and control flow independent of the chosen FPGA backend, enable compatibility across FPGA vendors through code generation, and are amenable to optimizing transformations performed directly on the graph. SDFGs have been proven effective for load/store workloads in various domains, ranging from linear algebra kernels and graph algorithms [3] to numerical weather prediction [5] and supercomputer-scale quantum transport simulations [6]. When FPGA SDFGs are manually authored [3], their performance is on-par with state-of-the-art implementations and libraries. ...
... In hardware, we distinguish between two ways of exploiting the parallelism implied by maps: pipelined maps, where iterations are executed in sequence, but exploit pipeline parallelism in the mapped computation; and unrolled maps, which represent parametrically replicated hardware, such as systolic arrays (see Section 2.6) or SIMD-style vectorization. [Fig. 3: Kernel state with four processing elements (right), with pre- and post-states (left) copying memory between host and device.] The purple box in Figure 3 contains an inner map, which will be generated as a pipelined inner loop, and an outer map over tiles, orchestrating the buffering behavior. ...
Preprint
Although high-level synthesis (HLS) tools have significantly improved programmer productivity over hardware description languages, developing for FPGAs remains tedious and error prone. Programmers must learn and implement a large set of vendor-specific syntax, patterns, and tricks to optimize (or even successfully compile) their applications, while dealing with ever-changing toolflows from the FPGA vendors. We propose a new way to develop, optimize, and compile FPGA programs. The Data-Centric parallel programming (DaCe) framework allows applications to be defined by their dataflow and control flow through the Stateful DataFlow multiGraph (SDFG) representation, capturing the abstract program characteristics, and exposing a plethora of optimization opportunities. In this work, we show how extending SDFGs with multi-level Library Nodes incorporates both domain-specific and platform-specific optimizations into the design flow, enabling knowledge transfer across application domains and FPGA vendors. We present the HLS-based FPGA code generation backend of DaCe, and show how SDFGs are code generated for either FPGA vendor, emitting efficient HLS code that is structured and annotated to implement the desired architecture.
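In DaCe's Python frontend, the map scopes discussed above are written roughly as follows (a sketch under my own assumptions; the choice between pipelining and unrolling is made by transformations on the SDFG and by the FPGA code generator, not in this source):

```python
# Sketch of a DaCe map scope. On the FPGA backend, an inner map like this can
# be generated as a pipelined loop, while a map over processing elements can
# be unrolled into replicated hardware (e.g., a systolic array).
import dace
import numpy as np

N = dace.symbol('N')

@dace.program
def saxpy(a: dace.float64, X: dace.float64[N], Y: dace.float64[N]):
    for i in dace.map[0:N]:   # parallel map scope over the range [0, N)
        Y[i] = a * X[i] + Y[i]

X, Y = np.ones(256), np.ones(256)
saxpy(2.0, X, Y)              # runs on CPU by default; FPGA requires codegen
assert np.allclose(Y, 3.0)
```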
... It also promotes collaboration with reproducible scientific workflows shared using Jupyter notebooks [42]. Therefore, numerous scientific fields, ranging from machine learning [2, 61] to climate [72] and quantum transport [75] have already adopted Python as their language of choice for new developments. ...
... In the following, we show results for single-node shared-memory parallel programs created using data-centric Python for CPU, GPU and FPGA, and compare these with other frameworks: NumPy over the CPython interpreter, Numba, Pythran, and CuPy. We collect a set of existing Python codes from different scientific and HPC domains [3,5,8,9,15,20,37,41,49,51,60,67,70-72,75], as well as a NumPy version of Polybench [63] ported from the C benchmark. In this adaptation, we strive to express the algorithms of the original benchmark in a way that is natural to a Python programmer. ...
Preprint
Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python's high productivity while achieving portable performance across different architectures. The workflow's key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes.
... Pebbling [11,25,36,42,52] (scope: general cDAGs); projection-based [7,13,18,20,22,49] (scope: programs with a static geometric structure of the iteration space); problem-specific [1,8,15,45,65] (scope: individually tailored for a given problem). ...
... Specifically, we choose 4,096 ≤ N ≤ 16,384. For example, physical chemistry or density functional theory (DFT) simulations require factorizing matrices of atom interactions, yielding sizes of N ≥ 10,000 [65]. For node count, we measure the algorithms starting from small square and cubic node counts (P = 4, 8) up to P = 1,024, reflecting different scales for various use-cases. ...
Preprint
Full-text available
Dense linear algebra kernels, such as linear solvers or tensor contractions, are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad family of programs. Based on the X-partitioning abstraction, our method explicitly captures inter-statement dependencies. Applying our analysis to LU factorization, we derive COnfLUX, an LU algorithm with the parallel I/O cost of N^3/(P*sqrt(M)) communicated elements per processor -- only 1/3× over our established lower bound. We evaluate COnfLUX on various problem sizes, demonstrating empirical results that match our theoretical analysis, communicating asymptotically less than Cray ScaLAPACK or SLATE, and outperforming the asymptotically-optimal CANDMC library. Running on 1,024 nodes of Piz Daint, COnfLUX communicates 1.6× less than the second-best implementation and is expected to communicate 2.1× less on a full-scale run on Summit.
... This leads to considerable additional complexity in TCAD theory and methodology, also causing enormous computational costs. To improve the predictability and the computational efficiency of TCAD simulations, there have been many studies on developing advanced physical models beyond drift-diffusion (DD) frameworks [1], as well as on incorporating high-performance computing [2]. ...
Article
Full-text available
Advancements in the semiconductor industry introduce novel channel materials, device structures, and integration methods, leading to intricate physics challenges when characterizing devices at the circuit level. Nevertheless, accurate models for emerging devices are crucial for physics-driven TCAD-to-SPICE flows to enable the increasingly vital design technology co-optimization (DTCO). Particularly for ultra-scaled devices where quantum effects become significant, this has led to the introduction of empirical model parameters and a disconnection from manufacturing processes. To catch up with these developments, an alternative to traditional white-box modeling methods has attracted much attention: machine learning-assisted compact modeling (MLCM). These black-box methods aim at general-purpose modeling of complex mathematics and physics through the training of neural networks on experimental and simulated data, generating an accurate closed-form mapping between output characteristics and input parameters for the fabrication process and device operation. To address this new trend, this work provides a comprehensive overview of emerging device modeling methodologies, spanning from device physics to machine learning engines. By analyzing, structuring, and extending distributed efforts on this topic, it is shown how MLCM can overcome the limitations of traditional compact modeling and contribute to effective DTCO to further advance semiconductor technologies.
... Currently, there is no effective way to address the weak scalability issue of industrial CFD applications on large-scale multi-GPU platforms. The data-centric model is a programming paradigm that emphasizes data as the central element of software design and development [9]. This model emphasizes generating appropriate data structures and organizing the data itself to build applications. ...
Chapter
Full-text available
Scalability is a crucial factor determining the performance of massive heterogeneous parallel CFD applications on multi-GPU platforms, particularly after single-GPU implementations have achieved optimal performance through numerous optimizations. A novel data-centric hybrid MPI-CUDA CFD model is proposed in this paper to enable efficient scalability of CFD applications on large-scale heterogeneous platforms. Based on the data-centric approach, a minimum-cost MPI transfer strategy and a code refactoring technique are realized for a better balance between data transfer and floating-point computation performance, which significantly improves scalability and reduces the time-to-solution. Subsequently, these approaches are integrated into the industrial unstructured CFD software FlowStar to evaluate their effectiveness. Numerical results demonstrate that the minimum-cost MPI strategy achieves more than a 2.0 times performance improvement compared to the traditional model-centric implementation, and the code refactoring technique boosts performance by 40% to 50% over the minimum-cost MPI version. Moreover, the data-centric implementation on a 64-GPU A100 platform shows a speedup ratio of over 120 compared to the original MPI implementation with 64 ranks.
... We construct a reference implementation of our approach for the SDFG representation that is used by the optimization framework DaCe [4]. This framework has become a popular choice for optimizing scientific high performance computing applications from numerous fields, where engineers write purpose-built, custom optimizing transformations to get high performance gains out of their applications [1,5,26,70,72]. Together with the ability to express arbitrary programs from Python, C, or Fortran, this makes the SDFG IR a good choice to demonstrate the effectiveness of our proposed approach. However, the techniques outlined in this paper are generally applicable to any parametric program representation adhering to the requirements outlined in Table 1. ...
Preprint
The current hardware landscape and application scale is driving performance engineers towards writing bespoke optimizations. Verifying such optimizations, and generating minimal failing cases, is important for robustness in the face of changing program conditions, such as inputs and sizes. However, isolation of minimal test-cases from existing applications and generating new configurations are often difficult due to side effects on the system state, mostly related to dataflow. This paper introduces FuzzyFlow: a fault localization and test case extraction framework designed to test program optimizations. We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations to enable fast checking for semantic equivalence. To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation. We demonstrate FuzzyFlow on example use cases in real-world applications where the approach provides up to 528 times faster optimization testing and debugging compared to traditional approaches.
... In the context of optimization algorithms, like, e.g., the evolutionary algorithm, the fork-join model would mean splitting the population of candidate solutions into equally sized groups at the beginning of each iteration of the algorithm and sending one group to each worker process. However, since the evaluation of individual work items can take very different computation times, the fork-join model would in this case lead to suboptimal use of the nodes' computational resources. The reason is that early-finishing processes have to wait for slower nodes to complete their computation, because the evolutionary algorithm generates the next population only once all processes have returned their candidate solutions. ...
Article
Full-text available
Many challenges of today’s science are parametric optimization problems that are extremely complex and computationally intensive to calculate. At the same time, the hardware for high-performance computing is becoming increasingly powerful. Geneva is a framework for parallel optimization of large-scale problems with highly nonlinear quality surfaces in grid and cloud environments. To harness the immense computing power of high-performance computing clusters, we have developed a new networking component for Geneva—the so-called MPI Consumer—which makes Geneva suitable for HPC. Geneva is most prominent for its evolutionary algorithm, which requires repeatedly evaluating a user-defined cost function. The MPI Consumer parallelizes the computation of the candidate solutions’ cost functions by sending them to remote cluster nodes. By using an advanced multithreading mechanism on the master node and by using asynchronous requests on the worker nodes, the MPI Consumer is highly scalable. Additionally, it provides fault tolerance, which is usually not the case for MPI programs but becomes increasingly important for HPC. Moreover, the MPI Consumer provides a framework for the intuitive implementation of fine-grained parallelization of the cost function. Since the MPI Consumer conforms to the standard paradigm of HPC programs, it vastly improves Geneva’s user-friendliness on HPC clusters. This article gives insight into Geneva’s general system architecture and the system design of the MPI Consumer as well as the underlying concepts. Geneva—including the novel MPI Consumer—is publicly available as an open source project on GitHub (https://github.com/gemfony/geneva) and is currently used for fundamental physics research at GSI in Darmstadt, Germany.
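The scheduling argument from the excerpt above can be demonstrated in a few lines (a toy model with synthetic timings, not Geneva code): with uneven evaluation times, dynamic assignment to the earliest-free worker yields a shorter makespan than fork-join's static equal split.

```python
# Toy comparison: fork-join vs. dynamic work distribution for candidate
# solutions with uneven evaluation times (illustrative only).
import random

random.seed(0)
times = [random.uniform(0.1, 2.0) for _ in range(64)]  # per-item cost (s)
workers = 8

# Fork-join: equally sized groups fixed up front; makespan = slowest group.
groups = [times[w::workers] for w in range(workers)]
fork_join = max(sum(g) for g in groups)

# Dynamic: each item goes to the earliest-free worker (greedy scheduling).
loads = [0.0] * workers
for t in sorted(times, reverse=True):
    loads[loads.index(min(loads))] += t
dynamic = max(loads)

print(f"fork-join makespan: {fork_join:.2f}s, dynamic: {dynamic:.2f}s")
```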
... The dataflow graph is a potential solution for unified abstraction because it can explicitly describe both the computation and the data movement. Existing studies, such as DaCe [17,18], implement the abstraction as a stateful dataflow multigraph (SDFG), thereby enabling users to develop applications and port them to achieve high performance. ...
Article
Unified programming models can effectively improve program portability on various heterogeneous high-performance computers. Existing unified programming models devote a lot of effort to code portability but are still far from achieving good performance portability. In this paper, we present a preliminary design of a performance-portable unified programming model covering four aspects: programming language, programming abstraction, compilation optimization, and scheduling system. Specifically, domain-specific languages introduce domain knowledge to decouple the optimizations for different applications and architectures. The unified programming abstraction unifies the common features of different architectures to support common optimizations. Multi-level compilation optimization enables comprehensive performance optimization based on multi-level intermediate representations. A resource-aware lightweight runtime scheduling system improves the resource utilization of heterogeneous computers. This is a perspective paper to show our viewpoints on programming models for emerging heterogeneous systems.
... Our approach is similar to methods used in quantum transport simulations, where solutions to non-equilibrium Green's functions also necessitate selected inversions (see, e.g., [60,61,62]), as well as in Kalman-Bucy filtering [63]. In both cases the authors derive strategies to efficiently compute the block diagonal elements of the inverse of block tridiagonal matrices. ...
Preprint
Bayesian inference tasks continue to pose a computational challenge. This especially holds for spatial-temporal modeling where high-dimensional latent parameter spaces are ubiquitous. The methodology of integrated nested Laplace approximations (INLA) provides a framework for performing Bayesian inference applicable to a large subclass of additive Bayesian hierarchical models. In combination with the stochastic partial differential equations (SPDE) approach it gives rise to an efficient method for spatial-temporal modeling. In this work we build on the INLA-SPDE approach, by putting forward a performant distributed memory variant, INLA-DIST, for large-scale applications. To perform the arising computational kernel operations, consisting of Cholesky factorizations, solving linear systems, and selected matrix inversions, we present two numerical solver options, a sparse CPU-based library and a novel blocked GPU-accelerated approach which we propose. We leverage the recurring nonzero block structure in the arising precision (inverse covariance) matrices, which allows us to employ dense subroutines within a sparse setting. Both versions of INLA-DIST are highly scalable, capable of performing inference on models with millions of latent parameters. We demonstrate their accuracy and performance on synthetic as well as real-world climate dataset applications.
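The selected-inversion kernel mentioned in the excerpt, computing only the diagonal blocks of the inverse of a block-tridiagonal matrix, can be written as a compact two-sweep recursion. Below is a sketch of the standard scheme (as used in recursive Green's function solvers; my own illustration, not the INLA-DIST implementation):

```python
# Selected inversion of a block-tridiagonal matrix: return only the diagonal
# blocks of the inverse via a forward/backward recursion (illustrative).
import numpy as np

def selected_inverse_diag(A, U, L):
    """A: n diagonal blocks; U[i] = M[i, i+1]; L[i] = M[i+1, i]."""
    n, inv = len(A), np.linalg.inv
    g = [None] * n                    # forward sweep: left-connected inverses
    g[0] = inv(A[0])
    for i in range(1, n):
        g[i] = inv(A[i] - L[i-1] @ g[i-1] @ U[i-1])
    G = [None] * n                    # backward sweep: true diagonal blocks
    G[n-1] = g[n-1]
    for i in range(n - 2, -1, -1):
        G[i] = g[i] + g[i] @ U[i] @ G[i+1] @ L[i] @ g[i]
    return G

# Verify against a dense inverse on a small well-conditioned example.
rng = np.random.default_rng(0)
b, n = 3, 4
A = [rng.standard_normal((b, b)) + 5 * np.eye(b) for _ in range(n)]
U = [rng.standard_normal((b, b)) for _ in range(n - 1)]
L = [rng.standard_normal((b, b)) for _ in range(n - 1)]
M = np.zeros((b * n, b * n))
for i in range(n):
    M[i*b:(i+1)*b, i*b:(i+1)*b] = A[i]
    if i < n - 1:
        M[i*b:(i+1)*b, (i+1)*b:(i+2)*b] = U[i]
        M[(i+1)*b:(i+2)*b, i*b:(i+1)*b] = L[i]
Minv = np.linalg.inv(M)
for i, Gi in enumerate(selected_inverse_diag(A, U, L)):
    assert np.allclose(Gi, Minv[i*b:(i+1)*b, i*b:(i+1)*b])
```

The two sweeps cost O(n b^3) instead of the O((n b)^3) of a dense inversion, which is exactly why these solvers exploit the block-tridiagonal structure.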
... The application of AI-based methods allowed to achieve notable results in areas like protein structure prediction [4,25], photonics [44], the solution to Schrödinger equation for fermions [37], quantum transport [45], molecular dynamics [23], climate analytics [28], weather prediction [43], computational fluid dynamics (CFD) [16,30], solid mechanics [19], Earth radiation belt modeling [10] and others. ...
Article
Full-text available
Partial differential equations (PDEs) are pervasive in vast domains of science and engineering. Although there is a huge legacy of numerical methods for solving direct and inverse PDE problems, these methods are computationally expensive for many fundamental and real-life applications, demanding supercomputer resources. Moreover, existing methods for PDE identification assume concrete functional forms for the coefficients to be found, significantly limiting the range of possible solutions. These circumstances lead to increasing interest in AI-based methods for the direct solving and identification of PDEs. In this study, we propose a novel method based on artificial neural networks (ANNs) for the identification of partial differential equations. The method does not require any strong a priori assumptions regarding the family of functions approximating the PDE coefficients. It allows one to approximate the coefficients of a PDE based on the observed evolution of the PDE's direct solution. We demonstrate the efficacy and high accuracy of the ANN-based method in the case of the diffusion equation and the nonlinear diffusion-advection equation (Richards equation) applied to the simulation of heat and moisture transfer in soil. We demonstrate that the novel method, implemented on the Ascend platform using mixed-precision floating-point operations, outperforms the classical gradient descent method in its Barzilai-Borwein stabilized modification (BBstab, realized on a conventional central processor) in terms of the MAPE (mean absolute percentage error) and RMSE (root mean square error) of the approximated coefficients by at least an order of magnitude. We also found that the ANN method is much less sensitive to the initial guess of parameters compared to the BBstab approach. Since the considered equations are of generic form, we anticipate that the proposed ANN-based method can be successfully exploited in other applications. These potential applications include hydrodynamic-type problems, e.g., the optimization of turbulence closures, where the assumed reference solutions of PDEs are usually obtained from high-resolution direct Navier-Stokes simulations.
... The OMEN [61] quantum transport simulator was used to compute the device characteristics. It self-consistently solves the Schrödinger and Poisson equations for electrons and a dynamical equation for phonons via the non-equilibrium Green's function (NEGF) formalism [62]. ...
Article
Full-text available
The encapsulation of single-layer 2D materials within hBN has been shown to improve the mobility of these compounds. Nevertheless, the interplay between the semiconductor channel and the surrounding dielectrics is not yet fully understood, especially their electron–phonon interactions. Therefore, here, we present an ab initio study of the coupled electrons and phonon transport properties of MoS2-hBN devices. The characteristics of two transistor configurations are compared to each other: one where hBN is treated as a perfectly insulating, non-vibrating layer and one where it is included in the ab initio domain as MoS2. In both cases, a reduction of the ON-state current by about 50% is observed as compared to the quasi-ballistic limit. Despite the similarity in the current magnitude, explicitly accounting for hBN leads to additional electron–phonon interactions at frequencies corresponding to the breathing mode of the MoS2-hBN system. Moreover, the presence of an hBN layer around the 2D semiconductor affects the Joule-induced temperature distribution within the transistor.
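For orientation, the NEGF formalism referenced above reduces, in its simplest ballistic form, to computing the retarded Green's function G^R = [E*I - H - Sigma_L - Sigma_R]^(-1) and the Caroli transmission T(E) = Tr[Gamma_L G^R Gamma_R (G^R)†]. The textbook toy below (a 1-D tight-binding chain with wide-band-limit contacts, nothing like OMEN's self-consistent ab initio treatment) shows the mechanics:

```python
# Textbook NEGF toy: ballistic transmission of a 1-D tight-binding chain with
# wide-band-limit contacts (illustrative only; OMEN couples the NEGF equations
# self-consistently to Poisson and to phonon transport from first principles).
import numpy as np

n, t, gamma = 20, 1.0, 0.5
H = -t * (np.eye(n, k=1) + np.eye(n, k=-1))  # nearest-neighbour Hamiltonian

Sigma_L = np.zeros((n, n), complex); Sigma_L[0, 0] = -0.5j * gamma
Sigma_R = np.zeros((n, n), complex); Sigma_R[-1, -1] = -0.5j * gamma
Gamma_L = 1j * (Sigma_L - Sigma_L.conj().T)  # contact broadening matrices
Gamma_R = 1j * (Sigma_R - Sigma_R.conj().T)

for E in (-1.0, 0.0, 1.0):                   # energies inside the band
    GR = np.linalg.inv(E * np.eye(n) - H - Sigma_L - Sigma_R)
    T = np.trace(Gamma_L @ GR @ Gamma_R @ GR.conj().T).real
    print(f"E = {E:+.1f}: T(E) = {T:.3f}")
```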
... Klinkert et al conducted a large-scale computational study to investigate the suitability of 100 different 2D materials for potential use in ultra-scaled FETs [8]. To that end, the authors applied DFT and quantum transport simulations using OMEN (Schrödinger-Poisson solver via an NEGF formalism) [253] to calculate current-voltage characteristics for each material. In another review, Wang et al provided an overview on the advancements concerning Schottky barriers within the context of FETs based on 2D materials [254]. ...
Article
Full-text available
Quantum electronics has significantly evolved over the last decades. Where initially the clear focus was on light-matter interactions, nowadays approaches based on the electron's wave nature have solidified themselves as additional focus areas. This development is largely driven by continuous advances in electron quantum optics, electron based quantum information processing, electronic materials, and nanoelectronic devices and systems. The pace of research in all of these areas is astonishing and is accompanied by substantial theoretical and experimental advancements. What is particularly exciting is the fact that the computational methods, together with broadly available large-scale computing resources, have matured to such a degree so as to be essential enabling technologies themselves. These methods allow to predict, analyze, and design not only individual physical processes but also entire devices and systems, which would otherwise be very challenging or sometimes even out of reach with conventional experimental capabilities. This review is thus a testament to the increasingly towering importance of computational methods for advancing the expanding field of quantum electronics. To that end, computational aspects of a representative selection of recent research in quantum electronics are highlighted where a major focus is on the electron's wave nature. By categorizing the research into concrete technological applications, researchers and engineers will be able to use this review as a source for inspiration regarding problem-specific computational methods.
... Using the DaCe framework, SDFGs were shown to accelerate a wide range of application classes in dense/sparse linear algebra and graph algorithms [8], deep learning Transformer architectures [26], numerical weather prediction on FPGAs [16], and extreme-scale quantum transport simulations on the world's largest supercomputer [51]. ...
Preprint
C is the lingua franca of programming and almost any device can be programmed using C. However, programming modern heterogeneous architectures such as multi-core CPUs and GPUs requires explicitly expressing parallelism as well as device-specific properties such as memory hierarchies. The resulting code is often hard to understand, debug, and modify for different architectures. We propose to lift C programs to a parametric dataflow representation that lends itself to static data-centric analysis and enables automatic high-performance code generation. We separate writing code from optimizing for different hardware: simple, portable C source code is used to generate efficient specialized versions with a click of a button. Our approach can identify parallelism when no other compiler can, and outperforms a bespoke parallelized version of a scientific proxy application by up to 21%.
... The framework provides an API to programmatically instrument and explore, e.g., different layouts and kernel fusion strategies, all without modifying the original code. DaCe was shown to map applications to different hardware architectures, including CPUs, GPUs, and FPGAs [64], enabling both whole-program and micro-optimizations of nontrivial applications to state-ofthe-art performance [65]. ...
Preprint
Transformer neural networks have become widely used for language modeling and sequence learning tasks, and are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute performance, training has now become memory-bound. Further, existing frameworks use suboptimal data layouts. Using these insights, we present a recipe for globally optimizing data movement in transformers. We reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks when training BERT. Our approach is applicable more broadly to optimizing deep neural networks, and offers insight into how to tackle emerging performance bottlenecks.
... DaCe [14], a data-centric approach, has been used to scale ab initio quantum transport simulations to extreme scale, reaching an extraordinary performance of 85.45 Pflop/s (42.55% of the peak) in double precision on 4,560 nodes of Summit. This study optimizes the ab initio quantum transport solver by analyzing its data dependencies. ...
Article
China is playing an increasingly important role in international supercomputing. In the high-performance computing domain, there are two famous awards: the TOP500 list for the fastest 500 supercomputers in the world and the Gordon Bell Prize for the best HPC (high-performance computing) applications. China has been recognized in both the TOP500 list and the Gordon Bell Prize. In this paper, we review the supercomputers in the latest TOP500 list and seven Gordon Bell Prize applications to show the research trends in large-scale supercomputers and applications. The first trend we observe is that heterogeneous architectures are widely used in the construction of supercomputing systems. The second trend is that artificial intelligence applications are expected to become one of the mainstream applications of supercomputing. The third trend is that applying heterogeneous systems to complex scientific simulation applications will be more difficult.
... PENCIL [37] and Polly-ACC [21] automate the accelerator mapping using the polyhedral model. DaCe [38] allows performance engineers to select and develop target-specific transformations. All approaches are generic and, for the same level of performance and automation, solve a more complex problem than a domain-specific compiler. ...
Preprint
Traditional compilers operate on a single generic intermediate representation (IR). These IRs are usually low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM's extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.
... This year, the prize went to a team from the Swiss Federal Institute of Technology Zurich for "A Data-Centric Approach to Extreme-Scale Ab Initio Dissipative Quantum Transport Simulations". They used Piz Daint at the Swiss National Supercomputing Centre and Summit at Oak Ridge National Laboratory, US, to better understand the thermal properties of transistors which would, appropriately enough, help manage heat generation and dissipation as computer architecture shrinks [62]. ...
Article
Full-text available
Computers are becoming ever more powerful, along with the hyperbole used to discuss their potential in modelling. As we are about to enter the era of quantum and exascale computing, they are being used to perform simulations across a vast range of domains, from subatomic physics to cosmology, straddling fields as diverse as chemistry, biology, astrophysics, climate science, economics, psychology, social science, health care, engineering and many more. Machine learning and artificial intelligence have entered the field in a major way, their applications likewise spreading across the gamut of disciplines and domains. In this article we take a look at the state of the art and seek to distinguish rhetoric from reality in assessing the future of modelling and simulation, highlighting how to overcome the profound limitations of digital computers.
... The results are listed in Table 8, proving that the electro-thermal properties of nano-devices of this magnitude can be computed in under 7 minutes per iteration, as required for practical applications. A full-scale run on Summit, with further optimizations, is described by Ziogas et al. [29]. ...
Preprint
Designing efficient cooling systems for integrated circuits (ICs) relies on a deep understanding of the electro-thermal properties of transistors. To shed light on this issue in currently fabricated FinFETs, a quantum mechanical solver capable of revealing atomically-resolved electron and phonon transport phenomena from first-principles is required. In this paper, we consider a global, data-centric view of a state-of-the-art quantum transport simulator to optimize its execution on supercomputers. The approach yields coarse- and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication-avoidance, and data-layout transformations. The transformations are tuned for the Piz Daint and Summit supercomputers, where each platform requires different caching and fusion strategies to perform optimally. The presented results make ab initio device simulation enter a new era, where nanostructures composed of over 10,000 atoms can be investigated at an unprecedented level of accuracy, paving the way for better heat management in next-generation ICs.
Article
Full-text available
After more than five decades, Moore's Law for transistors is approaching the end of the international technology roadmap for semiconductors (ITRS). The fate of the complementary metal oxide semiconductor (CMOS) architecture has become increasingly uncertain. In this era, 3D transistors in the form of gate-all-around (GAA) transistors are being considered as an excellent solution for scaling down beyond the 5 nm technology node, as they address the difficulties of carrier transport in the channel region that are mainly rooted in short-channel effects (SCEs). In parallel to Moore, during the last two decades, transistors with a fully depleted SOI (FDSOI) design have also been processed for low-power electronics. Among all the possible designs, there are also tunneling field-effect transistors (TFETs), which offer very low power consumption and decent electrical characteristics. This review article presents new transistor designs, along with the integration of electronics and photonics, simulation methods, and the continuation of CMOS process technology to the 5 nm technology node and beyond. The content highlights the innovative methods, challenges, and difficulties in device processing and design, as well as how to apply suitable metrology techniques as a tool to find out imperfections, lattice distortions, strain status, and composition in the device structures.
Article
Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel programs, and data communication is the main performance bottleneck of FFT, seriously affecting its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting the “high precision computation, low precision communication” strategy. To enable “low precision communication”, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve the performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is 1.23x faster than double-precision MFFT on average, and double-precision MFFT achieves performance 3.53x and 9.48x higher on average than the open-source library 2Decomp&FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. The parallel efficiency of double-precision MFFT increased from 53.2% to 78.1% compared with 2Decomp&FFT, and shared-exponent MFFT further increases the parallel efficiency to 83.8%.
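The shared-exponent compression mentioned above can be illustrated in a few lines (a conceptual sketch of block shared-exponent packing under my own assumptions; MFFT's actual format and GPU kernels may differ): every value in a block is stored as a fixed-point mantissa relative to the block's largest exponent, so communication shrinks from 8 to about 2 bytes per value.

```python
# Conceptual sketch of shared-exponent compression: one exponent per block,
# fixed-point int16 mantissas per value (illustrative; not MFFT's format).
import numpy as np

def compress(block, mant_bits=14):
    e = int(np.max(np.frexp(np.abs(block))[1]))  # largest exponent in block
    q = np.round(np.ldexp(block, mant_bits - e)).astype(np.int16)
    return e, q                                   # 2 bytes/value + one exponent

def decompress(e, q, mant_bits=14):
    return np.ldexp(q.astype(np.float64), e - mant_bits)

x = np.random.randn(1024)
e, q = compress(x)
err = np.abs(decompress(e, q) - x).max()
print(f"max abs error {err:.2e} at ~2 bytes/value instead of 8")
```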
Article
Device simulation is nowadays fully integrated into the production tool chain of transistors. The geometry of the latter can be carefully optimized, possible design pitfalls can be identified early on, and the obtained experimental data can be analyzed in detail thanks to state-of-the-art technology computer aided design tools. However, on the one hand, the dimensions of transistors are reaching the atomic scale. On the other hand, novel functionalities (e.g., light emission/detection) and materials, for example III-V semiconductors, are being added to silicon-based chips. To cope with these challenges, it is crucial that device simulators go beyond classical theories, pure electronic transport, and continuum models. The inclusion of quantum mechanical phenomena, electro-thermal effects, and light-matter interactions in systems made of thousands of atoms and of various materials has become critical. In this paper, we review one approach that satisfies all these requirements, the Non-equilibrium Green's Function (NEGF) formalism, focusing on its combination with ab initio bandstructure models. The NEGF method makes it possible to treat electrical, thermal, and optical transport at the quantum mechanical level in multi-material, multi-functional devices, without any empirical parameters. Besides advanced logic switches, it can be used to simulate, e.g., photo-detectors, thermoelectric generators, or memory cells composed of almost any materials, in the ballistic limit of transport and in the presence of scattering. The key features of NEGF are summarized first, then selected applications are presented, and finally challenges and opportunities are discussed.
Article
Advances in computational capabilities have transformed national security. For the US nuclear deterrent these advances have allowed wiser options for a production complex that is a fraction of its cold war-era size, enabled more effective military solutions, and have been a safeguard against consequential mistakes. An ability to faithfully simulate increasingly complex physical systems in human learning times has been the key to this progress. Experience spanning almost 80 years shows that qualitative leaps in simulation capabilities have allowed new contributions to national security. Through detailed quantitative analysis of representative simulations, we show that reliance on technologies developed for popular market applications will likely stall progress. We argue that if the United States is to continue benefiting from advances in computing, investments in deeper codesign of hardware and software addressing levels of branching and sparsity not found in Machine Learning (ML) or most other major market applications will be needed.
Conference Paper
Full-text available
We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC supercomputers as well as cloud data centers and clusters. FatPaths exposes and exploits the rich ("fat") diversity of both minimal and non-minimal paths for high-performance multi-pathing. Moreover, FatPaths features a redesigned "purified" transport layer that removes virtually all TCP performance issues (e.g., the slow start), and uses flowlet switching, a technique used to prevent packet reordering in TCP networks, to enable very simple and effective load balancing. Our design enables recent low-diameter topologies to outperform powerful Clos designs, achieving 15% higher net throughput at 2× lower latency for comparable cost. FatPaths will significantly accelerate Ethernet clusters that form more than 50% of the Top500 list and it may become a standard routing scheme for modern topologies.
Article
Thanks to their remarkable properties single-layer 2-D materials appear as excellent candidates to extend Moore’s scaling law beyond the currently manufactured silicon FinFETs. However, the known 2-D semiconducting components, essentially transition metal dichalcogenides, are still far from delivering the expected performance. Based on a recent theoretical study that predicts the existence of more than 1,800 exfoliable 2-D materials, we investigate here the 100 most promising contenders for logic applications. Their “current vs. voltage” characteristics are simulated from first-principles, combining density-functional theory and advanced quantum transport calculations. Both n- and p-type configurations are considered, with gate lengths ranging from 15 down to 5 nm. From this large collection of electronic materials, we identify 13 compounds with electron and hole currents potentially much higher than in future Si FinFETs. The resulting database widely expands the design space of 2-D transistors and provides original guidelines to the materials and device engineering community.
Article
Power and energy consumption are becoming key challenges in the supercomputers' exascale race. HPC systems' processors waste active power during communication and synchronization among the MPI processes of large-scale HPC applications. However, due to the time scale at which communication happens, transitioning into low-power states while waiting for the completion of each communication may introduce unacceptable overhead in applications' execution time. In this paper, we present COUNTDOWN, a run-time library for identifying and automatically reducing the power consumption of the CPUs during communication and synchronization. COUNTDOWN saves energy without penalizing the time-to-completion by lowering CPU power consumption only during idle times for which the power-state transition overhead is negligible. This is done transparently to the user, without requiring labor-intensive and error-prone application code modifications, nor recompilation of the application. We test our methodology on a production Tier-1 system. For the NAS benchmarks, COUNTDOWN saves between 6% and 50% energy, with a time-to-solution penalty lower than 5%. In a complete production application --- Quantum ESPRESSO --- on a 3.5K-core run, COUNTDOWN saves 22.36% energy, with a performance penalty below 3%. Energy saving increases to 37%, with a performance penalty of 6.38%, if the application is executed without communication tuning.
Conference Paper
Designing efficient cooling systems for integrated circuits (ICs) relies on a deep understanding of the electro-thermal properties of transistors. To shed light on this issue in currently fabricated FinFETs, a quantum mechanical solver capable of revealing atomically-resolved electron and phonon transport phenomena from first-principles is required. In this paper, we consider a global, data-centric view of a state-of-the-art quantum transport simulator to optimize its execution on supercomputers. The approach yields coarse- and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication-avoidance, and data-layout transformations. The transformations are tuned for the Piz Daint and Summit supercomputers, where each platform requires different caching and fusion strategies to perform optimally. The presented results make ab initio device simulation enter a new era, where nanostructures composed of over 10,000 atoms can be investigated at an unprecedented level of accuracy, paving the way for better heat management in next-generation ICs.
Conference Paper
The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization. By combining fine-grained data dependencies with high-level control-flow, SDFGs are both expressive and amenable to program transformations, such as tiling and double-buffering. These transformations are applied to the SDFG in an interactive process, using extensible pattern matching, graph rewriting, and a graphical user interface. We demonstrate SDFGs on CPUs, GPUs, and FPGAs over various motifs --- from fundamental computational kernels to graph analytics. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code.
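For concreteness, here is a minimal sketch of how such a data-centric front end is used, based on the open-source DaCe framework that implements SDFGs (following its published Python front end; exact API details may differ across versions):

```python
import dace
import numpy as np

N = dace.symbol('N')   # symbolic size: the SDFG stays size-generic

@dace.program
def saxpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    # The program is parsed into an SDFG, on which transformations such
    # as tiling or double-buffering can later be applied without
    # touching this scientific code.
    y[:] = a * x + y

x = np.random.rand(1024)
y = np.random.rand(1024)
saxpy(2.0, x, y)   # compiled to optimized native code on first call
```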
Article
Full-text available
The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
Conference Paper
Full-text available
The capabilities of CP2K, a density-functional theory package, and OMEN, a nano-device simulator, are combined to study transport phenomena from first-principles in unprecedentedly large nanostructures. Based on the Hamiltonian and overlap matrices generated by CP2K for a given system, OMEN solves the Schrödinger equation with open boundary conditions (OBCs) for all possible electron momenta and energies. To accelerate this core operation, a robust algorithm called SplitSolve has been developed. It allows the OBCs to be treated on CPUs and, simultaneously, the Schrödinger equation on GPUs, taking advantage of hybrid nodes. Our key achievements on the Cray-XK7 Titan are (i) a reduction in time-to-solution by more than one order of magnitude compared to standard methods, enabling the simulation of structures with more than 50,000 atoms, (ii) a parallel efficiency of 97% when scaling from 756 up to 18,564 nodes, and (iii) a sustained performance of 15 DP-PFlop/s.
Article
Full-text available
We have developed an efficient simulation tool 'GOLLUM' for the computation of electrical, spin and thermal transport characteristics of complex nanostructures. The new multi-scale, multi-terminal tool addresses a number of new challenges and functionalities that have emerged in nanoscale transport over the past few years. To illustrate the flexibility and functionality of GOLLUM, we present a range of demonstrator calculations encompassing charge, spin and thermal transport, corrections to density functional theory such as local density approximation +U (LDA+U) and spectral adjustments, transport in the presence of non-collinear magnetism, the quantum Hall effect, Kondo and Coulomb blockade effects, finite-voltage transport, multi-terminal transport, quantum pumps, superconducting nanostructures, environmental effects, and pulling curves and conductance histograms for mechanically-controlled break-junction experiments.
Article
Full-text available
Kwant is a Python package for numerical quantum transport calculations. It aims to be a user-friendly, universal, and high-performance toolbox for the simulation of physical systems of any dimensionality and geometry that can be described by a tight-binding model. Kwant has been designed such that the natural concepts of the theory of quantum transport (lattices, symmetries, electrodes, orbital/spin/electron-hole degrees of freedom) are exposed in a simple and transparent way: Defining a new simulation setup is very close to describing the corresponding mathematical model. Kwant offers direct support for calculations of transport properties (conductance, noise, scattering matrix), dispersion relations, modes, wave functions, various Green's functions, and out-of-equilibrium local quantities. Other computations involving tight-binding Hamiltonians can be implemented easily thanks to its extensible and modular nature. Kwant is free software available at http://kwant-project.org/.
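A minimal usage sketch, closely following Kwant's published tutorial: a square-lattice wire attached to two semi-infinite leads, from which the transmission (conductance in units of e²/h) is obtained:

```python
import kwant

L, W = 30, 10
lat = kwant.lattice.square(a=1, norbs=1)

syst = kwant.Builder()
syst[(lat(x, y) for x in range(L) for y in range(W))] = 4   # on-site energy
syst[lat.neighbors()] = -1                                  # hopping

# Semi-infinite lead, translationally invariant along -x:
lead = kwant.Builder(kwant.TranslationalSymmetry((-1, 0)))
lead[(lat(0, y) for y in range(W))] = 4
lead[lat.neighbors()] = -1
syst.attach_lead(lead)
syst.attach_lead(lead.reversed())

fsyst = syst.finalized()
smatrix = kwant.smatrix(fsyst, energy=1.0)
print("T =", smatrix.transmission(1, 0))   # transmission from lead 0 to lead 1
```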
Article
Full-text available
Quantization in the inversion layer and phase-coherent transport are anticipated to have significant impact on device performance in “ballistic” nanoscale transistors. While the role of some quantum effects has been analyzed qualitatively using simple one-dimensional ballistic models, two-dimensional (2D) quantum mechanical simulation is important for quantitative results. In this paper, we present a framework for 2D quantum mechanical simulation of a nanotransistor/metal-oxide field-effect transistor. This framework consists of the nonequilibrium Green’s function equations solved self-consistently with Poisson’s equation. Solution of this set of equations is computationally intensive. An efficient algorithm to calculate the quantum mechanical 2D electron density has been developed. The method presented is comprehensive in that the treatment includes the three open boundary conditions, where the narrow channel region opens into physically broad source, drain, and gate regions. Results are presented for (i) drain current vs. drain and gate voltages, (ii) comparison to results from Medici, and (iii) gate tunneling current, using 2D potential profiles. Methods to reduce the gate leakage current are also discussed based on simulation results.
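For reference, the standard NEGF relations underlying such simulators (textbook forms, not equations quoted from the paper) are:

```latex
% Retarded Green's function of the device region; \Sigma^{R}_{L/R} are
% the open-boundary self-energies of the source and drain contacts:
G^{R}(E) = \left[E S - H - \Sigma^{R}_{L}(E) - \Sigma^{R}_{R}(E)\right]^{-1}

% Contact broadenings and the Landauer transmission:
\Gamma_{L/R}(E) = i\left[\Sigma^{R}_{L/R}(E) - \Sigma^{R\,\dagger}_{L/R}(E)\right],
\qquad
T(E) = \mathrm{Tr}\!\left[\Gamma_{L}\, G^{R}\, \Gamma_{R}\, G^{R\,\dagger}\right]

% Electron density, fed back into Poisson's equation until the
% electrostatic potential and the charge are self-consistent:
n(\mathbf{r}) = -\,i \int \frac{\mathrm{d}E}{2\pi}\, G^{<}(\mathbf{r},\mathbf{r};E)
```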
Conference Paper
Full-text available
A quantum transport approach based on the Non-equilibrium Green's Function formalism and the tight-binding method has been developed to investigate the performance of atomistically resolved nanoelectronic devices in the presence of electron-phonon scattering. The model is integrated into a quad-level parallel environment (bias, momentum, energy, and spatial domain decomposition) that scales almost perfectly up to 220k cores in the ballistic limit of electron transport. In this case, the momentum and energy points form a quasi-embarrassingly parallel problem. The novelty in this paper is the inclusion of scattering self-energies that couple all the momenta and several energies together, requiring substantial inter-processor communication. An efficient parallel implementation of electron-phonon scattering is therefore proposed and applied to a realistically extended transistor structure. A good scaling of the simulation walltime up to 95,256 cores and a sustained performance of 142 TFlop/s are reported on the Cray-XT5 Jaguar.
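One way to picture such a quad-level decomposition is via successive communicator splits. The sketch below uses mpi4py with assumed group sizes (an illustration, not the paper's code; the product of the four sizes must equal the total rank count, here 256):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Assumed group sizes for the four levels: bias, momentum, energy, space.
N_BIAS, N_MOM, N_ENE, N_SPA = 2, 4, 8, 4
assert comm.Get_size() == N_BIAS * N_MOM * N_ENE * N_SPA

spa  = rank % N_SPA
ene  = (rank // N_SPA) % N_ENE
mom  = (rank // (N_SPA * N_ENE)) % N_MOM
bias = rank // (N_SPA * N_ENE * N_MOM)

# Ranks sharing the same (bias, momentum, energy) indices solve one spatial
# problem together; scattering self-energies that couple momenta/energies
# are then reduced over the energy (and analogously momentum) communicator.
spatial_comm = comm.Split(color=(bias * N_MOM + mom) * N_ENE + ene, key=spa)
energy_comm  = comm.Split(color=(bias * N_MOM + mom) * N_SPA + spa, key=ene)
```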
Conference Paper
Full-text available
We present a multi-dimensional, atomistic, quantum transport simulation approach to investigate the performance of realistic nanoscale transistors for various geometries and material systems. The central computation consists in solving the Schrödinger equation with open boundary conditions several thousand times. To do that, a Wave Function approach is used since it can be relatively easily parallelized. To further improve the computational efficiency, three additional levels of parallelization are identified, the workload is optimally balanced between the CPUs, computational interleaving is applied where possible, and a mixed-precision scheme is introduced. Using two different device types, a high electron mobility transistor and a band-to-band tunneling transistor, sustained performances of up to 1.28 PFlop/s in double precision (55% of the peak performance) and 1.44 PFlop/s in mixed precision are reached on 221,400 cores of the CRAY-XT5 Jaguar at Oak Ridge National Lab.
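The benefit of a mixed-precision scheme can be illustrated with a generic iterative-refinement sketch (not the paper's actual scheme): the expensive solve runs in single precision, while cheap double-precision residual updates recover full accuracy:

```python
import numpy as np

def mixed_precision_solve(A, b, iters=3):
    """Solve A x = b: factor/solve in float32, refine in float64."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                  # residual in double
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x += dx.astype(np.float64)                     # correction step
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)) + 200 * np.eye(200)  # well-conditioned
b = rng.standard_normal(200)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))   # near double-precision accuracy
```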
Article
Full-text available
Manufacturers will likely offer multiple products with differing numbers of cores to cover multiple price-performance points, since Moore's Law will permit the doubling of the number of cores per chip every two years. While diversity may be understandable in this time of uncertainty, it exacerbates the already difficult jobs of programmers, compiler writers, and even architects. Hence, an easy-to-understand model that offers performance guidelines would be especially valuable. This article proposes one such model called Roofline, demonstrating it on four diverse multicore computers using four key floating-point kernels. The proposed Roofline model ties together floating-point performance, operational intensity, and memory performance in a 2D graph. The Roofline sets an upper bound on performance of a kernel depending on the kernel's operational intensity. If people think of operational intensity as a column that hits the roof, either it hits the flat part of the roof, meaning performance is compute-bound, or it hits the slanted part of the roof, meaning performance is ultimately memory-bound.
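The model reduces to a one-line bound, attainable performance = min(peak compute, bandwidth × operational intensity); a sketch with hypothetical machine numbers (not taken from any paper here):

```python
# Hypothetical machine parameters, for illustration only:
PEAK_FLOPS = 7.8e12   # double-precision peak [flop/s]
PEAK_BW = 0.9e12      # memory bandwidth [byte/s]

def roofline(oi):
    """Attainable performance for operational intensity oi [flop/byte]:
    the lower of the flat compute roof and the slanted memory roof."""
    return min(PEAK_FLOPS, PEAK_BW * oi)

print(roofline(0.25) / 1e12, "Tflop/s  (memory-bound kernel)")   # 0.225
print(roofline(50.0) / 1e12, "Tflop/s  (compute-bound kernel)")  # 7.8
print("ridge point:", PEAK_FLOPS / PEAK_BW, "flop/byte")
```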
Article
Full-text available
As the active dimensions of metal-oxide field-effect transistors are approaching the atomic scale, the electronic properties of these "nanowire" devices must be treated on a quantum mechanical level. In this paper, the transmission coefficients and the density of states of biased and unbiased Si and GaAs nanowires are simulated using the sp3d5s* empirical tight-binding method. Each atom, as well as the connections to its nearest neighbors, is represented explicitly. The material parameters are optimized to reproduce bulk band-structure characteristics in various crystal directions and various strain conditions. A scattering boundary method to calculate the open boundary conditions in nanowire transistors is developed to reduce the computational burden. Existing methods such as iterative or generalized eigenvalue problem approaches are significantly more expensive than the transport simulation through the device. The algorithm can be coupled to nonequilibrium Green's function and wave function transport calculations. The speed improvement is even larger if the wire transport direction is different from [100]. Finally, it is demonstrated that strain effects can be easily included in the present nanowire simulations.
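As a caricature of the tight-binding approach (the sp3d5s* model above uses ten orbitals per atom and full 3-D geometry), a one-orbital 1-D chain already shows the essential ingredients, on-site energy and nearest-neighbor hopping, and has the analytic dispersion E(k) = ε + 2t cos(ka):

```python
import numpy as np

eps, t, a = 0.0, -1.0, 1.0                      # assumed model parameters
k = np.linspace(-np.pi / a, np.pi / a, 201)     # first Brillouin zone
E = eps + 2 * t * np.cos(k * a)                 # single cosine band
print(f"bandwidth = {E.max() - E.min():.1f}")   # 4|t|, as expected
```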
Article
Full-text available
As transistor gate lengths are scaled towards the 10-nm range, thermal device design is becoming an important part of microprocessor engineering. Decreasing dimensions lead to nanometer-scale hot spots in the transistor drain region, which may increase the drain series and source injection electrical resistances. Such trends are accelerated by the introduction of novel materials and nontraditional transistor geometries, including ultrathin body, FinFET, or nanowire devices, which impede heat conduction. Thermal analysis is complicated by subcontinuum phenomena including ballistic electron transport, which reshapes the heat generation region compared with classical diffusion theory predictions. Ballistic phonon transport from the hot spot and between material boundaries impedes conduction cooling. The increased surface to volume ratio of novel transistor designs also leads to a larger contribution from material boundary thermal resistance. This paper surveys trends in transistor geometries and materials, from bulk silicon to carbon nanotubes, along with their implications for the thermal design of electronic systems.
Article
Through advanced quantum mechanical simulations combining electron transport and phonon transport from first-principles, self-heating effects are investigated in n-type transistors with single-layer MoS2, WS2, and black phosphorus as channel materials. The selected 2-D crystals all exhibit different phonon-limited mobility values, as well as electron and phonon properties, which have a direct influence on the increase in their lattice temperature and on the power dissipated inside their channel as a function of the applied gate voltage and electrical current magnitude. This computational study reveals (i) that self-heating plays a much more important role in 2-D materials than in Si nanowires, (ii) that it could severely limit the performance of 2-D devices at high current densities, and (iii) that black phosphorus appears less sensitive to this phenomenon than transition metal dichalcogenides.
Conference Paper
The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This case often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases to obtain performance within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries.
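The effect being exploited can be reproduced with NumPy, where stacking many small GEMMs into one broadcasted call mirrors what vendor batched routines (e.g., cublasDgemmBatched) do: per-call overhead is amortized over the whole batch instead of dominating each tiny product:

```python
import numpy as np

# 10,000 independent 16x16 GEMMs: the small-matrix regime the paper targets.
batch, n = 10_000, 16
A = np.random.rand(batch, n, n)
B = np.random.rand(batch, n, n)

# Naive approach: one call per tiny matrix; dispatch overhead dominates
# the ~2*n^3 flops of each individual product.
C_loop = np.stack([a @ b for a, b in zip(A, B)])

# Batched approach: a single call over the stacked operands; matmul
# broadcasts over the leading batch axis.
C_batched = A @ B

assert np.allclose(C_loop, C_batched)
```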
Article
From a theory of Hohenberg and Kohn, approximation methods for treating an inhomogeneous system of interacting electrons are developed. These methods are exact for systems of slowly varying or high density. For the ground state, they lead to self-consistent equations analogous to the Hartree and Hartree-Fock equations, respectively. In these equations the exchange and correlation portions of the chemical potential of a uniform electron gas appear as additional effective potentials. (The exchange portion of our effective potential differs from that due to Slater by a factor of 2/3.) Electronic systems at finite temperatures and in magnetic fields are also treated by similar methods. An appendix deals with a further correction for systems with short-wavelength density oscillations.
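In modern notation, the resulting self-consistent (Kohn-Sham) equations read, in Hartree atomic units:

```latex
% Auxiliary single-particle problem whose effective potential contains
% the external, Hartree, and exchange-correlation contributions:
\left[-\tfrac{1}{2}\nabla^{2} + v(\mathbf{r})
  + \int \frac{n(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,\mathrm{d}\mathbf{r}'
  + v_{\mathrm{xc}}\big(n(\mathbf{r})\big)\right]\psi_{i}(\mathbf{r})
  = \varepsilon_{i}\,\psi_{i}(\mathbf{r})

% Density from the N occupied orbitals, iterated to self-consistency:
n(\mathbf{r}) = \sum_{i=1}^{N} |\psi_{i}(\mathbf{r})|^{2}
```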
Conference Paper
The personalized all-to-all collective exchange is one of the most challenging communication patterns in HPC applications in terms of performance and scalability. In the context of the fat tree family of interconnection networks, widely used in current HPC systems and datacenters, we show that there is potential for optimizing this traffic pattern by deriving a tight theoretical lower bound for the bandwidth needed in the network to support such communication in a non-contending way. Current state-of-the-art methods require up to twice as much bisection bandwidth as this theoretical minimum. We propose a set of optimized exchanges that use exactly the minimum amount of resources and exhibit close to ideal performance. This enables cost-effective networks, i.e., with as little as half the bisection bandwidth required by current state-of-the-art methods, to exhibit quasi-optimal performance under all-to-all traffic. In addition to supporting our claims by mathematical proofs, we include simulation results that confirm their correctness in practical system configurations.
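A classic building block for contention-free personalized all-to-all on a power-of-two number of ranks (a generic scheme, not the paper's optimized exchanges) is the XOR pairwise-exchange schedule, in which each of the P-1 rounds is a perfect matching, so every rank talks to exactly one partner per round:

```python
def pairwise_schedule(P):
    """Return P-1 rounds; in round s, rank r exchanges with rank r XOR s.
    Because XOR with a fixed s is an involution, each round pairs up all
    ranks with no contention."""
    assert P > 1 and P & (P - 1) == 0, "power-of-two rank count assumed"
    return [[(r, r ^ s) for r in range(P)] for s in range(1, P)]

for s, pairs in enumerate(pairwise_schedule(8), start=1):
    print(f"round {s}:", pairs)
```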
Conference Paper
Algorithms have two costs: arithmetic and communication, i.e. moving data between levels of a memory hierarchy or processors over a network. Communication costs (measured in time or energy per operation) already greatly exceed arithmetic costs, and the gap is growing over time following technological trends. Thus our goal is to design algorithms that minimize communication. We present algorithms that attain provable lower bounds on communication, and show large speedups compared to their conventional counterparts. These algorithms are for direct and iterative linear algebra, for dense and sparse matrices, as well as direct n-body simulations. Several of these algorithms exhibit perfect strong scaling, in both time and energy: run time (resp. energy) for a fixed problem size drops proportionally to p (resp. is independent of p). Finally, we describe extensions to algorithms involving arbitrary loop nests and array accesses, assuming only that array subscripts are linear functions of the loop indices.
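For dense n×n matrix multiplication on P processors with local memories of size M, the communication lower bound referred to here takes the well-known form below (due to Irony, Toledo, and Tiskin; stated for reference, not quoted from the abstract):

```latex
% Bandwidth lower bound (words moved per processor):
W \;=\; \Omega\!\left(\frac{n^{3}}{P\,\sqrt{M}}\right)
% With c-fold data replication, i.e. M \approx c\,n^{2}/P, so-called
% 2.5D algorithms attain this bound and cut the bandwidth cost by a
% factor of \sqrt{c} relative to the non-replicated (c = 1) case.
```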
Article
Contents: 1. Preliminary concepts; 2. Conductance from transmission; 3. Transmission function, S-matrix and Green's functions; 4. Quantum Hall effect; 5. Localisation and fluctuations; 6. Double-barrier tunnelling; 7. Optical analogies; 8. Non-equilibrium Green's function formalism.
Article
Cooling technologies that address high-density and asymmetric heat dissipation in CPU packages of high-performance servers are discussed. Thermal management schemes and the development of associated technologies are reviewed from a viewpoint of industrial application. Particular attention is directed to heat conduction in the package and heat removal from the package/heat sink module. Power dissipation and package cooling characteristics of high-performance microprocessors are analyzed. The development of a new metallic thermal interface technology is introduced, where thermal and mechanical performance of an indium-silver alloy in the chip/heat spreader assembly was studied. The paper also reports on research on other thermal management materials, such as diamond composite heat-spreading materials. Some actual package designs are described to illustrate the enhanced heat spreading capability of heat pipes and vapor chambers.
Article
We present the Gaussian and plane waves (GPW) method and its implementation in Quickstep which is part of the freely available program package CP2K. The GPW method allows for accurate density functional calculations in gas and condensed phases and can be effectively used for molecular dynamics simulations. We show how derivatives of the GPW energy functional, namely ionic forces and the Kohn–Sham matrix, can be computed in a consistent way. The computational cost of computing the total energy and the Kohn–Sham matrix is scaling linearly with the system size, even for condensed phase systems of just a few tens of atoms. The efficiency of the method allows for the use of large Gaussian basis sets for systems up to 3000 atoms, and we illustrate the accuracy of the method for various basis sets in gas and condensed phases. Agreement with basis set free calculations for single molecules and plane wave based calculations in the condensed phase is excellent. Wave function optimisation with the orbital transformation technique leads to good parallel performance, and outperforms traditional diagonalisation methods. Energy conserving Born–Oppenheimer dynamics can be performed, and a highly efficient scheme is obtained using an extrapolation of the density matrix. We illustrate these findings with calculations using commodity PCs as well as supercomputers.
Current CPUs produce 4 times more heat than hot plates
  • R. Pawlik
Pushing Back the Limit of Ab-initio Quantum Transport Simulations on Hybrid Supercomputers. In Proc. Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC '15)
  • M. Calderara
Atomistic Nanoelectronic Device Engineering with Sustained Performances Up to 1.44 PFlop/s. In Proc. Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC '11)
  • M. Luisier