Chapter

JUWELS Booster – A Supercomputer for Large-Scale AI Research


Abstract

In this article, we present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center. With its system architecture, most importantly its large number of powerful Graphics Processing Units (GPUs) and its fast interconnect via InfiniBand, it is an ideal machine for large-scale Artificial Intelligence (AI) research and applications. We detail its system architecture, parallel, distributed model training, and benchmarks indicating its outstanding performance. We exemplify its potential for research application by presenting large-scale AI research highlights from various scientific fields that require such a facility.
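The parallel, distributed model training mentioned in the abstract is typically realized with data parallelism across many GPUs connected by the fast interconnect. As a minimal, hedged illustration (not the benchmark code used by the authors), the sketch below sets up PyTorch DistributedDataParallel training; the model, dataset, and launcher environment variables are placeholders.

```python
# Minimal data-parallel training sketch (illustrative only, not the JUWELS benchmark code).
# Assumes the job launcher (e.g. srun or torchrun) sets RANK, WORLD_SIZE, MASTER_ADDR/PORT and LOCAL_RANK.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")             # NCCL can use the InfiniBand fabric where available
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)    # placeholder model
    model = DDP(model, device_ids=[local_rank])          # gradients are all-reduced across all GPUs

    data = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)                   # each rank sees a disjoint shard of the data
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()               # backward triggers the gradient all-reduce
            opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```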

... This means that all services have registered their expected communication via the InfiniBand network interface (Pentakalos, 2002). We refer to the literature (Krause, 2019; Alvarez, 2021; Kesselheim et al., 2021) for an in-depth overview of the supercomputing system that we are using. On the Jülich Wizard for European Leadership Science (JUWELS), one cluster compute node has two Intel Xeon Platinum 8168 CPUs. ...
... We especially highlight the embedding of Synavis into a modular supercomputing system (Suarez et al., 2019). Both the JUWELS (Alvarez, 2021; Kesselheim et al., 2021; Krause, 2019) and JURECA supercomputers (Krause and Thörnig, 2018; Thörnig, 2021) are modular supercomputers. Newer systems, such as LUMI, further developed this concept to focus their accelerator systems on general compute-based programming paradigms (Markomanolis et al., 2022). ...
Article
Full-text available
In plant science it is an established method to obtain structural parameters of crops using image analysis. In recent years, deep learning techniques have improved the underlying processes significantly. However, since data acquisition is time- and resource-consuming, the availability of reliable training data is currently a limiting factor. To overcome this bottleneck, synthetic data is a promising option, not only for improving accuracy by offering more training data but also for validating results. However, the creation of synthetic data is complex and requires extensive knowledge in Computer Graphics, Visualization and High-Performance Computing. We address this by introducing Synavis, a framework that allows users to train networks on data generated in real time. We created a pipeline that integrates realistic plant structures, simulated by the functional-structural plant model framework CPlantBox, into the game engine Unreal Engine. For this purpose, we needed to extend CPlantBox by introducing a new leaf geometrization that results in realistic leaves. All parameterized geometries of the plant are directly provided by the plant model. In the Unreal Engine, it is possible to alter the environment. WebRTC enables the streaming of the final image composition, which can then be used directly to train deep neural networks to increase parameter robustness, for further plant trait detection and validation of original parameters. We provide user-friendly, ready-to-use pipelines, including virtual plant experiment and field visualizations, a Python binding library to access synthetic data, and a ready-to-run example to train models.
... Despite not being a specialized DL supercomputer, JUWELS can effectively rival dedicated and monolithic AI supercomputers. The authors in [9] underscored the system's versatility by demonstrating a full training run of the ResNet50 model on the ImageNet dataset. Employing 1,536 GPUs, they completed the training phase in a remarkable 43 seconds, attaining a throughput of 1.7 million images per second. ...
... Data parallelism benefits neural networks of any size when trained on large datasets (it allows distributing the memory related to the activations). In [9], the authors trained a multispectral ResNet-152 on a large RS dataset, achieving 80% efficiency when scaling up from one to 256 GPUs. ...
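The 80% efficiency quoted above is simply the measured speedup divided by the ideal linear speedup. A tiny illustrative calculation (the throughput numbers below are invented placeholders, not measurements from [9]):

```python
# Illustrative scaling-efficiency calculation; throughput values are hypothetical placeholders.
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Measured speedup divided by ideal speedup."""
    return (throughput_n / throughput_1) / n_gpus

single_gpu = 700.0        # images/s on 1 GPU (hypothetical)
many_gpu = 143_000.0      # images/s on 256 GPUs (hypothetical)
print(f"efficiency: {scaling_efficiency(many_gpu, single_gpu, 256):.0%}")  # ~80%
```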
Conference Paper
Full-text available
High-Performance Computing (HPC) enables precise analysis of large and complex Earth Observation (EO) datasets. However, the adoption of supercomputing in the EO community faces challenges from the increasing heterogeneity of HPC systems, limited expertise, and the need to leverage novel computing technologies. This paper explores the implications of exascale computing advancements and the inherent heterogeneity of HPC architectures. It highlights EU-supported projects optimizing software development and harnessing the capabilities of heterogeneous HPC configurations. Methodologies addressing challenges of modular supercomputing, large-scale Deep Learning (DL) models, and hybrid quantum-classical algorithms are presented, aiming to enhance the utilization of supercomputing in EO for improved research, industrial applications, and SME support.
... We use the AdamW optimizer [Loshchilov and Hutter, 2019] with a constant learning rate schedule, the temperature parameter τ in the loss function (Equation 1) set to 1, and the regularization parameter λ equal to 5 × 10³. The models were run on two different machines, JUWELS BOOSTER [Kesselheim et al., 2021] with NVIDIA A100 GPUs. ...
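The excerpt above mentions the AdamW optimizer and a temperature parameter τ in the loss. Below is a generic, hedged sketch of a temperature-scaled (InfoNCE-style) contrastive loss between two sets of embeddings; it is not the OneProt implementation, and the encoders, batch size, and dimensions are placeholders.

```python
# Generic temperature-scaled contrastive (InfoNCE-style) loss sketch; not the OneProt code.
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, tau=1.0):
    """Symmetric cross-entropy over cosine similarities between paired embeddings."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau                      # similarity matrix scaled by the temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical usage with two placeholder modality encoders producing 256-d embeddings:
encoder_a = torch.nn.Linear(64, 256)
encoder_b = torch.nn.Linear(32, 256)
opt = torch.optim.AdamW(list(encoder_a.parameters()) + list(encoder_b.parameters()),
                        lr=1e-4, weight_decay=1e-2)   # constant learning rate, as in the excerpt
x_a, x_b = torch.randn(8, 64), torch.randn(8, 32)
loss = contrastive_loss(encoder_a(x_a), encoder_b(x_b), tau=1.0)
loss.backward()
opt.step()
```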
Preprint
Full-text available
Recent AI advances have enabled multi-modal systems to model and translate diverse information spaces. Extending beyond text and vision, we introduce OneProt, a multi-modal AI for proteins that integrates structural, sequence, alignment, and binding site data. Using the ImageBind framework, OneProt aligns the latent spaces of modality encoders along protein sequences. It demonstrates strong performance in retrieval tasks and surpasses state-of-the-art methods in various downstream tasks, including metal ion binding classification, gene-ontology annotation, and enzyme function prediction. This work expands multi-modal capabilities in protein models, paving the way for applications in drug discovery, biocatalytic reaction planning, and protein engineering.
Article
Full-text available
The disordered nature of Intrinsically Disordered Proteins (IDPs) makes their structural ensembles particularly susceptible to changes in chemical environmental conditions, often leading to an alteration of their normal functions. A Radial Distribution Function (RDF) is considered a standard method for characterizing the chemical environment surrounding particles during atomistic simulations, commonly averaged over an entire or part of a trajectory. Given their high structural variability, such averaged information might not be reliable for IDPs. We introduce the Time-Resolved Radial Distribution Function (TRRDF), implemented in our open-source Python package SPEADI, which is able to characterize dynamic environments around IDPs. We use SPEADI to characterize the dynamic distribution of ions around the IDPs Alpha-Synuclein (AS) and Humanin (HN) from Molecular Dynamics (MD) simulations, and some of their selected mutants, showing that local ion–residue interactions play an important role in the structures and behaviors of IDPs.
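For reference, the radial distribution function that the article generalizes in time can be written as follows (a textbook-style definition, not the exact SPEADI formulation); the time-resolved variant restricts the ensemble average to a sliding window of frames rather than the whole trajectory.

```latex
g_{ab}(r) = \frac{1}{4\pi r^{2} \rho_{b} N_{a}}
            \left\langle \sum_{i \in a} \sum_{j \in b}
            \delta\!\left(r - \lvert \mathbf{r}_i - \mathbf{r}_j \rvert\right) \right\rangle_{t},
\qquad
g_{ab}(r, t) \approx \frac{1}{4\pi r^{2} \rho_{b} N_{a}}
            \left\langle \sum_{i \in a} \sum_{j \in b}
            \delta\!\left(r - \lvert \mathbf{r}_i - \mathbf{r}_j \rvert\right) \right\rangle_{[t,\, t+\Delta t]}
```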
Article
Serine 129 can be phosphorylated in pathological inclusions formed by the intrinsically disordered protein human α-synuclein (AS), a key player in Parkinson’s disease and other synucleinopathies. Here, molecular simulations provide insight into the structural ensemble of phosphorylated AS. The simulations allow us to suggest that phosphorylation significantly impacts the structural content of the physiological AS conformational ensemble in aqueous solution, as the phosphate group is mostly solvated. The hydrophobic region of AS contains β-hairpin structures, which may increase the propensity of the protein to undergo amyloid formation, as seen in the nonphysiological (nonacetylated) form of the protein in a recent molecular simulation study. Our findings are consistent with existing experimental data with the caveat of the observed limitations of the force field for the phosphorylated moiety.
Article
Full-text available
Solving combinatorial optimization problems of the kind that can be codified by quadratic unconstrained binary optimization (QUBO) is a promising application of quantum computation. Some problems of this class suitable for practical applications such as the traveling salesman problem (TSP), the bin packing problem (BPP), or the knapsack problem (KP) have inequality constraints that require a particular cost function encoding. The common approach is the use of slack variables to represent the inequality constraints in the cost function. However, the use of slack variables considerably increases the number of qubits and operations required to solve these problems using quantum devices. In this work, we present an alternative method that does not require extra slack variables and consists of using an unbalanced penalization function to represent the inequality constraints in the QUBO. This function is characterized by larger penalization when the inequality constraint is not satisfied than when it is. We evaluate our approach on the TSP, BPP, and KP, successfully encoding the optimal solution of the original optimization problem near the ground state of the cost Hamiltonian. Additionally, we employ D-Wave Advantage and D-Wave hybrid solvers to solve the BPP, surpassing the performance of the slack variables approach by achieving solutions for up to 29 items, whereas the slack variables approach only handles up to 11 items. This new approach can be used to solve combinatorial problems with inequality constraints with a reduced number of resources compared to the slack variables approach using quantum annealing or variational quantum algorithms.
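As a schematic of the idea described above (notation ours, not taken verbatim from the paper): for an inequality constraint \(\sum_i w_i x_i \le C\), the slack encoding introduces extra binary variables \(s_k\), whereas the unbalanced penalization adds a function of the slack expression \(h(x)\) that penalizes violations more strongly than satisfied constraints, so no additional qubits are needed.

```latex
h(x) = C - \sum_i w_i x_i, \qquad
\text{slack encoding: } \lambda \Big( h(x) - \sum_k 2^{k} s_k \Big)^{2}, \qquad
\text{unbalanced penalization: } -\lambda_1\, h(x) + \lambda_2\, h(x)^{2}
```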
Article
Full-text available
The confrontation of complex Earth system model (ESM) codes with novel supercomputing architectures poses challenges to efficient modeling and job submission strategies. The modular setup of these models naturally fits a modular supercomputing architecture (MSA), which tightly integrates heterogeneous hardware resources into a larger and more flexible high-performance computing (HPC) system. While parts of the ESM codes can easily take advantage of the increased parallelism and communication capabilities of modern GPUs, others lag behind due to the long development cycles or are better suited to run on classical CPUs due to their communication and memory usage patterns. To better cope with these imbalances between the development of the model components, we performed benchmark campaigns on the Jülich Wizard for European Leadership Science (JUWELS) modular HPC system. We enabled the weather and climate model Icosahedral Nonhydrostatic (ICON) to run in a coupled atmosphere–ocean setup, where the ocean and the model I/O run on the CPU Cluster, while the atmosphere is simulated simultaneously on the GPUs of JUWELS Booster (ICON-MSA). Both atmosphere and ocean are running globally with a resolution of 5 km. In our test case, an optimal configuration in terms of model performance (core hours per simulation day) was found for the combination of 84 GPU nodes on the JUWELS Booster module to simulate the atmosphere and 80 CPU nodes on the JUWELS Cluster module, of which 63 nodes were used for the ocean simulation and the remaining 17 nodes were reserved for I/O. With this configuration the waiting times of the coupler were minimized. Compared to a simulation performed on CPUs only, the MSA approach reduces energy consumption by 45 % with comparable runtimes. ICON-MSA is able to scale up to a significant portion of the JUWELS system, making best use of the available computing resources. A maximum throughput of 170 simulation days per day (SDPD) was achieved when running ICON on 335 JUWELS Booster nodes and 268 Cluster nodes.
Conference Paper
Full-text available
In times of ever-increasing data sizes, data management and insightful analysis are amidst the most severe challenges of high-performance computing. While high-level libraries such as NetCDF, HDF5, and ADIOS2, as well as the associated self-describing data formats, offer convenient interfaces to complex data sets, they were built on outdated assumptions of storage systems and interfaces. They mostly rely on the POSIX interface that researchers have been aiming to replace for decades. Among others, its strict file semantics are not suitable for current HPC systems. As object storage has become increasingly prominent to store datasets of data formats like HDF5, providing a scalable object store back-end is necessary. Therefore, we looked into Ceph's object store BlueStore and developed a backend for the storage framework JULEA that uses BlueStore without the need for a full-fledged working Ceph cluster. This way, we significantly reduce the prerequisites of running it on an existing HPC cluster. BlueStore works directly on a raw block device and thereby circumvents the problems of other Ceph storage backends like Filestore and KStore. In a first evaluation, we examine the performance of BlueStore and compare it to a POSIX-based solution, which shows that our prototype is functional yet not optimized enough to keep up with the POSIX-based object store. For example, the peak for explicitly synced writes is 50 MB/s for POSIX with a block size of 4,096 KiB and thereby twice as high as BlueStore's with 20.5 MB/s.
Article
Full-text available
A main route for SARS-CoV-2 (severe acute respiratory syndrome coronavirus) transmission involves airborne droplets and aerosols generated when a person talks, coughs, or sneezes. The residence time and spatial extent of these virus-laden aerosols are mainly controlled by their size and the ability of the background flow to disperse them. Therefore, a better understanding of the role played by the flow driven by respiratory events is key in estimating the ability of pathogen-laden particles to spread the infection. Here, we numerically investigate the hydrodynamics produced by a violent expiratory event resembling a mild cough. Coughs can be split into an initial jet stage during which air is expelled through mouth and a dissipative phase over which turbulence intensity decays as the puff penetrates the environment. Time-varying exhaled velocity and buoyancy due to temperature differences between the cough and the ambient air affect the overall flow dynamics. The direct numerical simulation (DNS) of an idealized isolated cough is used to characterize the jet/puff dynamics using the trajectory of the leading turbulent vortex ring and extract its topology by fitting an ellipsoid to the exhaled fluid contour. The three-dimensional structure of the simulated cough shows that the assumption of a spheroidal puff front fails to capture the observed ellipsoidal shape. Numerical results suggest that, although analytical models provide reasonable estimates of the distance traveled by the puff, trajectory predictions exhibit larger deviations from the DNS. The fully resolved hydrodynamics presented here can be used to inform new analytical models, leading to improved prediction of cough-induced pathogen-laden aerosol dispersion.
Article
Full-text available
Simulations of turbulent fluid flow around long cylindrical structures are computationally expensive because of the vast range of length scales, requiring simplifications such as dimensional reduction. Current dimensionality reduction techniques such as strip-theory and depth-averaged methods do not take into account the natural flow dissipation mechanism inherent in the small-scale three-dimensional (3-D) vortical structures. We propose a novel flow decomposition based on a local spanwise average of the flow, yielding the spanwise-averaged Navier–Stokes (SANS) equations. The SANS equations include closure terms accounting for the 3-D effects otherwise not considered in 2-D formulations. A supervised machine-learning (ML) model based on a deep convolutional neural network provides closure to the SANS system. A-priori results show up to 92% correlation between target and predicted closure terms; more than an order of magnitude better than the eddy viscosity model correlation. The trained ML model is also assessed for different Reynolds regimes and body shapes to the training case where, despite some discrepancies in the shear-layer region, high correlation values are still observed. The new SANS equations and ML closure model are also used for a-posteriori prediction. While we find evidence of known stability issues with long time ML predictions for dynamical systems, the closed SANS simulations are still capable of predicting wake metrics and induced forces with errors from 1-10%. This results in approximately an order of magnitude improvement over standard 2-D simulations while reducing the computational cost of 3-D simulations by 99.5%.
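The spanwise average underlying the SANS equations can be sketched as follows (a schematic of the decomposition, not the exact notation of the paper): averaging the Navier–Stokes equations over the span introduces residual terms from the nonlinear convection that must be closed.

```latex
\overline{u_i}(x, y, t) = \frac{1}{L_z} \int_0^{L_z} u_i(x, y, z, t)\,\mathrm{d}z, \qquad
\overline{u_i u_j} = \overline{u_i}\,\overline{u_j} + \mathcal{R}_{ij}
```

Here \(\mathcal{R}_{ij}\) stands for the spanwise-stress-like closure terms that, in the approach described above, are predicted by the convolutional neural network instead of an eddy-viscosity model.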
Article
Full-text available
Solving fluid dynamics problems relies mainly on experimental methods and numerical simulation. However, experimental methods struggle to reproduce real physical problems and are costly, while numerical simulation methods are sensitive to the meshing of complicated structures and time-consuming owing to the billions of degrees of freedom in the relevant spatial-temporal flow fields. Therefore, constructing a cost-effective model for fluid dynamics problems is of great significance. Deep learning (DL) has a strong ability to handle nonlinearity and high dimensionality, which has attracted much attention for solving fluid problems. Unfortunately, the surrogate models proposed in DL are mostly black-box models and lack interpretability. In this paper, a Physics-Informed Neural Network (PINN) combined with ResNet blocks is proposed to solve fluid flows governed by partial differential equations (i.e., the Navier-Stokes equations), which are embedded into the loss function of the deep neural network to drive the model. In addition, the initial conditions and boundary conditions are also considered in the loss function. To validate the performance of the PINN with ResNet blocks, Burgers' equation with a discontinuous solution and the Navier-Stokes (N-S) equations with a continuous solution are selected. The results show that the PINN with ResNet blocks (Res-PINN) has stronger predictive ability than traditional deep learning methods. In addition, Res-PINN can predict the whole velocity and pressure fields of spatial-temporal fluid flows, with the mean square error of the flow reaching the order of 10⁻⁵. Inverse problems of the fluid flows are also handled well: the errors of the inverse parameters are 0.98% and 3.1% on clean data and 0.99% and 3.1% on noisy data.
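To illustrate the general PINN idea described above (a generic sketch, not the Res-PINN architecture of the paper), the residual of Burgers' equation can be embedded in the loss via automatic differentiation; the network layout, viscosity, and collocation points below are placeholders.

```python
# Generic PINN loss sketch for Burgers' equation u_t + u*u_x = nu*u_xx (not the Res-PINN code).
import torch

net = torch.nn.Sequential(                     # placeholder fully connected network
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
nu = 0.01 / torch.pi                           # viscosity (a common benchmark value)

def pde_residual(x, t):
    x.requires_grad_(True)
    t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    grad = lambda out, inp: torch.autograd.grad(out, inp, grad_outputs=torch.ones_like(out),
                                                create_graph=True)[0]
    u_t = grad(u, t)
    u_x = grad(u, x)
    u_xx = grad(u_x, x)
    return u_t + u * u_x - nu * u_xx           # vanishes where the PDE is satisfied

x = torch.rand(256, 1) * 2 - 1                 # collocation points in [-1, 1] x [0, 1]
t = torch.rand(256, 1)
loss_pde = pde_residual(x, t).pow(2).mean()    # combined with data/IC/BC losses in practice
loss_pde.backward()
```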
Article
Full-text available
Accurate and efficient Machine Learning algorithms are of vital importance to many problems, especially classification and clustering tasks, but they need a universal AI model standard. Unifying machine learning models into a common ecosystem can lead to shorter development time and better framework interoperability. ONNX (Open Neural Network Exchange Format) is a popular open format for representing deep learning models so that AI developers can more easily move models between state-of-the-art tools. On top of that, hardware companies such as Nvidia or Intel try to keep up with this trend and produce hardware-optimized runtimes (i.e. for CPUs, GPUs, FPGAs) that can handle open-format AI models like ONNX. That enables developers to leverage a heterogeneous mix of hardware and use whichever AI framework they prefer. FPGAs require a more challenging development strategy, yet as a platform they have proven to address these kinds of problems very efficiently in terms of performance and power. This work is based on HLS4ML, an early-development-stage project originally created for particle physics applications via the automatic generation of neural networks (NNs) for embedded Xilinx FPGAs. Our work involves hardware-aware NN training and a generalized optimization scheme on top of HLS4ML that boosts the performance and power efficiency of this package and adds functionality for generating cloud FPGA firmware from any NN model. We start from the FPGA-oriented training of a model in Keras for image recognition, convert it into the open ONNX format, and then port and optimize it for cloud FPGAs using a novel scheme with optimizations in host, memory and kernels while using multiple levels of network precision. To the best of our knowledge this is a novel approach that achieves a speed-up of up to 102× over a single CPU in performance and up to 5.5× over a GPU in performance per watt.
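The workflow outlined above starts from a Keras model and converts it to the ONNX format before FPGA-specific optimization. A minimal conversion step might look like the sketch below, using the tf2onnx package; treat the exact calls, the toy model, and the opset choice as assumptions rather than the authors' actual toolchain.

```python
# Hedged sketch: export a small Keras model to ONNX with tf2onnx (not the authors' exact flow).
import tensorflow as tf
import tf2onnx

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

spec = (tf.TensorSpec((None, 32, 32, 3), tf.float32, name="input"),)
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())   # the .onnx file can then be fed to FPGA tooling
```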
Conference Paper
Full-text available
Born from the need for a pure "pay-per-use" model and a highly scalable platform, the "Serverless" paradigm emerged and has the potential to become a dominant way of building cloud applications. Although it was originally designed for cloud environments, Serverless is finding its position in the Edge Computing landscape, aiming to bring computational resources closer to the data source. That is, Serverless is crossing cloud borders to assess its merits in Edge computing, whose principal partner will be Internet of Things (IoT) applications. This move sounds promising, as Serverless brings particular benefits such as eliminating always-on services that cause high electricity usage. However, the community is still hesitant to adopt Serverless Edge Computing because of the cloud-driven design of current Serverless platforms and the distinctive characteristics of the edge landscape and IoT applications. In this paper, we evaluate both sides to shed light on this new territory for Serverless. Our in-depth analysis promotes a broad vision for bringing Serverless to Edge Computing. It also identifies major challenges that Serverless must meet before entering Edge computing.
Article
Full-text available
The Coronavirus Disease 2019 (COVID-19) pandemic continues to have a devastating effect on the health and well-being of the global population. A critical step in the fight against COVID-19 is effective screening of infected patients, with one of the key screening approaches being radiology examination using chest radiography. It was found in early studies that patients present abnormalities in chest radiography images that are characteristic of those infected with COVID-19. Motivated by this and inspired by the open source efforts of the research community, in this study we introduce COVID-Net, a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest X-ray (CXR) images that is open source and available to the general public. To the best of the authors’ knowledge, COVID-Net is one of the first open source network designs for COVID-19 detection from CXR images at the time of initial release. We also introduce COVIDx, an open access benchmark dataset that we generated comprising 13,975 CXR images across 13,870 patient cases, with the largest number of publicly available COVID-19 positive cases to the best of the authors’ knowledge. Furthermore, we investigate how COVID-Net makes predictions using an explainability method in an attempt to not only gain deeper insights into critical factors associated with COVID cases, which can aid clinicians in improved screening, but also audit COVID-Net in a responsible and transparent manner to validate that it is making decisions based on relevant information from the CXR images. By no means a production-ready solution, the hope is that the open access COVID-Net, along with the description on constructing the open source COVIDx dataset, will be leveraged and built upon by both researchers and citizen data scientists alike to accelerate the development of highly accurate yet practical deep learning solutions for detecting COVID-19 cases and accelerate treatment of those who need it the most.
Article
Full-text available
DeepCOVID-XR, an artificial intelligence algorithm for detecting COVID-19 on chest radiographs, demonstrated performance similar to the consensus of experienced thoracic radiologists.
Key Results:
• DeepCOVID-XR classified 2,214 test images (1,194 COVID-19 positive) with an accuracy of 83% and AUC of 0.90 compared with the reference standard of RT-PCR.
• On 300 random test images (134 COVID-19 positive), DeepCOVID-XR's accuracy was 82% (AUC 0.88) compared to 5 individual thoracic radiologists (accuracy 76%-81%) and the consensus of all 5 radiologists (accuracy 81%, AUC 0.85).
Abstract: Background: There are characteristic findings of Coronavirus Disease 2019 (COVID-19) on chest imaging. An artificial intelligence (AI) algorithm to detect COVID-19 on chest radiographs might be useful for triage or infection control within a hospital setting, but prior reports have been limited by small datasets and/or poor data quality.
Article
Full-text available
Recurrent neural networks are good at solving prediction problems. However, finding a network that suits a problem is quite hard because their performance is strongly affected by their architecture configuration. Automatic architecture optimization methods help to find the most suitable design, but they are not extensively adopted because of their high computational cost. In this work, we introduce the Random Error Sampling-based Neuroevolution (RESN), an evolutionary algorithm that uses the mean absolute error random sampling, a training-free approach to predict the expected performance of an artificial neural network, to optimize the architecture of a network. We empirically validate our proposal on four prediction problems, and compare our technique to training-based architecture optimization techniques, neuroevolutionary approaches, and expert designed solutions. Our findings show that we can achieve state-of-the-art error performance and that we reduce by half the time needed to perform the optimization.
Article
Full-text available
Adaptive lattice Boltzmann methods (LBMs) are based on velocity discretizations that self-adjust to local macroscopic conditions such as velocity and temperature. While this feature improves the accuracy and the stability of LBMs for large velocity and temperature fluctuations, it also strongly impacts the efficiency of the algorithm due to space interpolations that are required to get populations at grid nodes. To avoid this defect, the present work proposes new formulations of adaptive LBMs which do not rely anymore on space interpolations, hence, drastically improving their parallel efficiency for the simulation of high-speed compressible flows. To reach this goal, the adaptive phase discretization is restricted to particular states that are compliant with the efficient ``collide and stream'' algorithm, and as a consequence, it does not require additional interpolation steps. The development of proper state-adaptive solvers with on-grid propagation imposes new restrictions and challenges on the discrete stencils, namely the need for an extended operability range allowing for the transition between two phase discretizations. Achieving the minimum operability range for discrete polynomial equilibria requires rather large stencils (e.g. D2Q81, D2Q121) and is therefore not competitive for compressible flow simulations. However, as shown in the article, the use of numerical equilibria can provide for overlaps in the operability ranges of neighboring discrete shifts at acceptable cost using the D2Q21 lattice. Through several numerical validations, the present approach is shown to allow for an efficient realization of discrete state-adaptive LBMs for high Mach number flows even in the low viscosity regime.
Article
Full-text available
Operating data-intensive applications on edge systems is challenging, due to the extreme workload and device heterogeneity, as well as the geographic dispersion of compute and storage infrastructure. Serverless computing has emerged as a compelling model to manage the complexity of such systems, by decoupling the underlying infrastructure and scaling mechanisms from applications. Although serverless platforms have reached a high level of maturity, we have found several limiting factors that inhibit their use in an edge setting. This paper presents a container scheduling system that enables such platforms to make efficient use of edge infrastructures. Our scheduler makes heuristic trade-offs between data and computation movement, and considers workload-specific compute requirements such as GPU acceleration. Furthermore, we present a method to automatically fine-tune scheduler parameters to optimize high-level operational objectives such as minimizing task execution time, uplink usage, or cloud execution cost. We implement a prototype that targets the container orchestration system Kubernetes, and deploy it on an edge testbed we have built. We evaluate our system with trace-driven simulations in different infrastructure scenarios, using traces generated from running representative workloads on our testbed. Our results show that (a) our scheduler significantly improves the quality of task placement compared to the state-of-the-art scheduler of Kubernetes, and (b) our method for fine-tuning the weights of scheduling constraints helps significantly in meeting operational goals.
Article
Full-text available
Machine learning algorithms have been used widely in various applications and areas. To fit a machine learning model into different problems, its hyper-parameters must be tuned. Selecting the best hyper-parameter configuration for machine learning models has a direct impact on the model’s performance. It often requires deep knowledge of machine learning algorithms and appropriate hyper-parameter optimization techniques. Although several automatic optimization techniques exist, they have different strengths and drawbacks when applied to different types of problems. In this paper, optimizing the hyper-parameters of common machine learning models is studied. We introduce several state-of-the-art optimization techniques and discuss how to apply them to machine learning algorithms. Many available libraries and frameworks developed for hyper-parameter optimization problems are provided, and some open challenges of hyper-parameter optimization research are also discussed in this paper. Moreover, experiments are conducted on benchmark datasets to compare the performance of different optimization methods and provide practical examples of hyper-parameter optimization. This survey paper will help industrial users, data analysts, and researchers to better develop machine learning models by identifying the proper hyper-parameter configurations effectively. Github code: https://github.com/LiYangHart/Hyperparameter-Optimization-of-Machine-Learning-Algorithms
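As a small, hedged illustration of the kind of automatic tuning surveyed above (one of many possible techniques, here random search with scikit-learn; the model and search ranges are arbitrary examples, not recommendations from the survey):

```python
# Random-search hyper-parameter tuning sketch with scikit-learn (illustrative ranges only).
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 400),
        "max_depth": randint(3, 20),
        "min_samples_split": randint(2, 11),
    },
    n_iter=25, cv=3, n_jobs=-1, random_state=0,
)
search.fit(X, y)                                # cross-validated search over sampled configurations
print(search.best_params_, round(search.best_score_, 3))
```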
Article
Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called “internal covariate shift”. In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.
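For context, the BatchNorm transform discussed above normalizes each activation with mini-batch statistics and then rescales it with learned parameters (the standard textbook formulation):

```latex
\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_{\mathcal{B}}^{2} = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^{2}, \qquad
\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}, \qquad
y_i = \gamma\, \hat{x}_i + \beta
```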
Article
Generative Adversarial Networks (GANs) are difficult to train because of pathologies such as mode and discriminator collapse. Similar pathologies have been studied and addressed in competitive evolutionary computation by increased diversity. We study a system, Lipizzaner, that combines spatial coevolution with gradient-based learning to improve the robustness and scalability of GAN training. We study different features of Lipizzaner’s evolutionary computation methodology. Our ablation experiments determine that communication, selection, parameter optimization, and ensemble optimization each, as well as in combination, play critical roles. Lipizzaner succumbs less frequently to critical collapses and, as a side benefit, demonstrates improved performance. In addition, we show a GAN-training feature of Lipizzaner: the ability to train simultaneously with different loss functions in the gradient descent parameter learning framework of each GAN at each cell. We use an image generation problem to show that different loss function combinations result in models with better accuracy and more diversity in comparison to other existing evolutionary GAN models. Finally, Lipizzaner with multiple loss function options promotes the best model diversity while requiring a large grid size for adequate accuracy.
Article
X-ray scattering experiments using Free Electron Lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments are a challenge to computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences analyzing data from two experiments at the Linac Coherent Light Source (LCLS) during September 2020. Raw data were analyzed on NERSC's Cori XC40 system, using the Superfacility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where it is analyzed using the software package CCTBX. We achieved real time data analysis with a turnaround time from data acquisition to full molecular reconstruction in as little as 10 min -- sufficient time for the experiment's operators to make informed decisions. By hosting the data analysis on Cori, and by automating LCLS-NERSC interoperability, we achieved a data analysis rate which matches the data acquisition rate. Completing data analysis within 10 minutes is a first for XFEL experiments and an important milestone if we are to keep up with data collection trends.
Conference Paper
The recent advances in computer-assisted learning systems and the availability of open educational resources today promise a pathway to providing cost-efficient high-quality education to large masses of learners. One of the most ambitious use cases of computer-assisted learning is to build a lifelong learning recommendation system. Unlike short-term courses, lifelong learning presents unique challenges, requiring sophisticated recommendation models that account for a wide range of factors such as background knowledge of learners or novelty of the material while effectively maintaining knowledge states of masses of learners for significantly longer periods of time (ideally, a lifetime). This work presents the foundations towards building a dynamic, scalable and transparent recommendation system for education, modelling learner’s knowledge from implicit data in the form of engagement with open educational resources. We i) use a text ontology based on Wikipedia to automatically extract knowledge components of educational resources and, ii) propose a set of online Bayesian strategies inspired by the well-known areas of item response theory and knowledge tracing. Our proposal, TrueLearn, focuses on recommendations for which the learner has enough background knowledge (so they are able to understand and learn from the material), and the material has enough novelty that would help the learner improve their knowledge about the subject and keep them engaged. We further construct a large open educational video lectures dataset and test the performance of the proposed algorithms, which show clear promise towards building an effective educational recommendation system.
Book
This book constitutes extended, revised and selected papers from the 10th International Conference on Cloud Computing and Services Science, CLOSER 2020, held in Prague, Czech Republic, in May 2020. Due to the COVID-19 pandemic the conference was held in a virtual format. The 14 papers presented in this volume were carefully reviewed and selected from a total of 69 submissions. CLOSER 2020 focuses on the emerging area of cloud computing, inspired by the latest advances that concern the infrastructure, operations, and available services through the global network.
Article
Microservices are becoming the defining paradigm of cloud applications, which raises urgent challenges for efficient datacenter management. Guaranteeing end-to-end Service Level Agreement (SLA) while optimizing resource allocation is critical to both cloud service providers and users. However, one application may contain hundreds of microservices, which constitute an enormous search space that is unfeasible to explore exhaustively. Thus, we propose RAMBO, an SLA-aware framework for microservices that leverages multi-objective Bayesian Optimization (BO) to allocate resources and meet performance/cost goals. Experiments conducted on a real microservice workload demonstrate that RAMBO can correctly characterize each microservice and efficiently discover Pareto-optimal solutions. We envision that the proposed methodology and results will benefit future resource planning, cluster orchestration, and job scheduling.
Article
Computational science is crucial for delivering reliable weather and climate predictions. However, despite decades of high-performance computing experience, there is serious concern about the sustainability of this application in the post-Moore/Dennard era. Here, we discuss the present limitations in the field and propose the design of a novel infrastructure that is scalable and more adaptable to future, yet unknown computing architectures. There have been substantial developments in weather and climate prediction over the past few decades, attributable to advances in computational science. The rise of new technologies poses challenges to these developments, but also brings opportunities for new progress in the field.
Presentation
The recorded video can be found in the following link: https://www.youtube.com/watch?v=GcK2theDr34&list=PLy9rIbGDXrG2Ly0LPYNuNn1ohQTqO6mmp&index=3&fbclid=IwAR0CgqzS-2E0jhWDbDF_tuGHRN65R7Bj_mLO2NJtoolyiah5ZXkPAnlNndk
Chapter
This article presents an approach using parallel/distributed generative adversarial networks for image data augmentation, applied to generate COVID-19 training samples for computational intelligence methods. This is a relevant problem nowadays, considering the recent COVID-19 pandemic. Computational intelligence and learning methods are useful tools to assist physicians in the process of diagnosing diseases and in acquiring valuable medical knowledge. A specific generative adversarial network approach trained using a co-evolutionary algorithm is implemented, including a three-level parallel approach combining distributed memory and fine-grained parallelization using CPU and GPU. The experimental evaluation of the proposed method was performed on the high performance computing infrastructure provided by National Supercomputing Center, Uruguay. The main experimental results indicate that the proposed model is able to generate accurate images and that the 3×3 version of the distributed GAN has better robustness properties in its training process, allowing it to generate better and more diverse images.
Article
The rapid spread of COVID-19 cases in recent months has strained hospital resources, making rapid and accurate triage of patients presenting to emergency departments a necessity. Machine learning techniques using clinical data such as chest X-rays have been used to predict which patients are most at risk of deterioration. We consider the task of predicting two types of patient deterioration based on chest X-rays: adverse event deterioration (i.e., transfer to the intensive care unit, intubation, or mortality) and increased oxygen requirements beyond 6 L per day. Due to the relative scarcity of COVID-19 patient data, existing solutions leverage supervised pretraining on related non-COVID images, but this is limited by the differences between the pretraining data and the target COVID-19 patient data. In this paper, we use self-supervised learning based on the momentum contrast (MoCo) method in the pretraining phase to learn more general image representations to use for downstream tasks. We present three results. The first is deterioration prediction from a single image, where our model achieves an area under receiver operating characteristic curve (AUC) of 0.742 for predicting an adverse event within 96 hours (compared to 0.703 with supervised pretraining) and an AUC of 0.765 for predicting oxygen requirements greater than 6 L a day at 24 hours (compared to 0.749 with supervised pretraining). We then propose a new transformer-based architecture that can process sequences of multiple images for prediction and show that this model can achieve an improved AUC of 0.786 for predicting an adverse event at 96 hours and an AUC of 0.848 for predicting mortalities at 96 hours. A small pilot clinical study suggested that the prediction accuracy of our model is comparable to that of experienced radiologists analyzing the same information.
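The core mechanism of the momentum contrast (MoCo) pretraining mentioned above is a slowly updated key encoder. A hedged sketch of that update rule in generic PyTorch (placeholder encoders, not the paper's code):

```python
# Momentum-encoder update sketch in the style of MoCo pretraining (illustrative, not the paper's code).
import copy
import torch

query_encoder = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
key_encoder = copy.deepcopy(query_encoder)          # key encoder starts as a copy of the query encoder
for p in key_encoder.parameters():
    p.requires_grad = False                         # updated only via momentum, not by gradients

@torch.no_grad()
def momentum_update(m: float = 0.999):
    """key = m * key + (1 - m) * query, applied parameter-wise."""
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.mul_(m).add_(q, alpha=1.0 - m)

# Called after each optimizer step on the query encoder:
momentum_update()
```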
Conference Paper
Film cooling is an essential cooling method to prevent high-pressure turbine blade from melting down due to the high inlet temperature. In order to improve the film cooling efficiency, several flow control methods have been proposed. In this paper, large-eddy simulations are performed to study the effectiveness of a vortex generator (VG) and a semi-sphere installed downstream of the cooling jet. Before the detailed analyses, the numerical framework is validated against the available experimental data. Both the laminar and turbulent approaching boundary layers are considered. The turbulent boundary layer is generated by a numerical plasma actuator. After validation, the influence of VG and semi-sphere on the film cooling efficiency at various blowing ratios are analyzed. It is found that a counter-rotating vortex pair (CVP) is formed downstream and its strength increases with the blowing ratio in the configuration without VG/semi-sphere. When the VG is installed, it produces another vortex pair that rotates in the reverse direction of the CVP, which reduces the CVP strength and increases the lateral diffusion of the coolant. As a result, the film cooling efficiency is greatly improved, especially at a higher blowing ratio. For the case with a semi-sphere, the film cooling efficiency is also improved, especially at low–medium blowing ratios. However, it is not as effective as the VG in terms of enhancing cooling efficiency. In addition, the total pressure loss is calculated to examine the aerodynamic penalty associated with the VG and semi-sphere. It is found that the total pressure loss increased by only 1% due to the VG or semi-sphere, within the range of blowing ratio investigated in the current study. Considering the overall performance and the feasibility of being applied in practice, a semi-sphere installed downstream of the cooling hole is a promising method to improve the cooling efficiency.
Article
State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using a performance-optimal setting in feature-rich frameworks, however, involves a non-trivial amount of performance profiling efforts and often relies on domain-specific knowledge. This article takes a deep dive into analyzing the performance impact of key design features in a machine learning framework and quantifies the role of parallelism. The observations and insights distill into a simple set of guidelines that one can use to achieve much higher training and inference speedup. Across a diverse set of real-world deep learning models, the evaluation results show that the proposed performance tuning guidelines outperform the Intel and TensorFlow recommended settings by 1.30× and 1.38×, respectively.
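One concrete knob of the kind the article analyzes is the degree of intra- and inter-operator parallelism. A hedged PyTorch example follows; the thread counts are placeholders, and, as the article stresses, optimal values are hardware- and model-dependent.

```python
# Thread-parallelism tuning sketch (placeholder values; optimal settings depend on hardware and model).
import torch

torch.set_num_threads(16)            # intra-op parallelism: threads used inside a single operator
torch.set_num_interop_threads(2)     # inter-op parallelism: independent operators that may run concurrently

x = torch.randn(1024, 1024)
y = x @ x.t()                        # this matmul now uses up to 16 threads
print(torch.get_num_threads(), torch.get_num_interop_threads())
```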
Chapter
One of the primary challenges facing scientists is extracting understanding from the large amounts of data produced by simulations, experiments, and observational facilities. The use of data across the entire lifetime ranging from real-time to post-hoc analysis is complex and varied, typically requiring a collaborative effort across multiple teams of scientists. Over time, three sets of tools have emerged: one set for analysis, another for visualization, and a final set for orchestrating the tasks. This trifurcated tool set often results in the manual assembly of analysis and visualization workflows, which are one-off solutions that are often fragile and difficult to generalize. To address these challenges, we propose a serviced-based paradigm and a set of abstractions to guide its design. These abstractions allow for the creation of services that can access and interpret data, and enable interoperability for intelligent scheduling of workflow systems. This work results from a codesign process over analysis, visualization, and workflow tools to provide the flexibility required for production use. Finally, this paper describes a forward-looking research and development plan that centers on the concept of visualization and analysis technology as reusable services, and also describes several real-world use cases that implement these concepts.
Conference Paper
To cope with the rapid growth in available data, the efficiency of data analysis and machine learning libraries has recently received increased attention. Although great advancements have been made in traditional array-based computations, most are limited by the resources available on a single computation node. Consequently, novel approaches must be made to exploit distributed resources, e.g. distributed memory architectures. To this end, we introduce HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI. It provides both low-level array computations, as well as assorted higher-level algorithms. With HeAT, it is possible for a NumPy user to take full advantage of their available resources, significantly lowering the barrier to distributed data analysis. When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
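A minimal usage sketch of the NumPy-like, MPI-distributed API described above is given below; it is based on HeAT's public interface, but treat the exact calls and attributes as assumptions.

```python
# Hedged HeAT usage sketch: run with e.g. `mpirun -n 4 python this_script.py`.
import heat as ht

x = ht.random.randn(10_000, 1_000, split=0)   # rows are distributed across the MPI processes
col_mean = x.mean(axis=0)                      # reducing over the split axis triggers MPI communication
print(col_mean.shape, x.comm.size)             # every process sees the same global result
```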
Article
In this paper, a new mesoscopic approach with both the adjustable Prandtl number and the ratio of bulk to shear viscosity has been developed to simulate three-dimensional compressible decaying homogeneous isotropic turbulence under the framework of discrete unified gas kinetic scheme (DUGKS). In the new approach, two reduced model Boltzmann equations with newly designed source terms are solved. In the continuum limit, the Navier–Stokes–Fourier system can be recovered by applying the Chapman–Enskog analysis. A three-dimensional DUGKS code has been developed, incorporating the fifth-order weighted essentially non-oscillatory scheme to better reconstruct the particle distribution functions at the cell interfaces. In addition, a new lattice velocity model with 77 discrete particle velocities is applied to ensure that the accuracy of the Gauss–Hermite quadrature is up to the ninth-order, and as such, the heat flux can be accurately evaluated. To validate our code, we simulate two cases with different initial turbulent Mach numbers and Taylor microscale Reynolds numbers. The simulation results converge with the increase in resolution and agree well with the results from the literature. As a direct application of our DUGKS, we briefly study the influence of bulk viscosity on turbulence statistics and flow structures. Our results show that the DUGKS is a reliable tool for simulating compressible decaying isotropic turbulence at low and moderate turbulent Mach numbers. More parametric studies are needed in the future to further explore the full capabilities of this specific mesoscopic method.
Chapter
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes—from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
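In the same transfer-learning spirit, the sketch below fine-tunes a pretrained torchvision backbone on a new target task; it is a generic example, not the BiT models or the hyper-parameter rule described in the chapter.

```python
# Generic fine-tuning sketch in the spirit of transfer learning (not the BiT recipe itself).
import torch
import torchvision

model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # replace the head for a 10-class target task

optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)                   # placeholder mini-batch
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss_fn(model(images), labels).backward()
optimizer.step()
```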
Chapter
Every day, supercomputers execute 1000s of jobs with different characteristics. Data centers monitor the behavior of jobs to support the users and improve the infrastructure, for instance, by optimizing jobs or by determining guidelines for the next procurement. The classification of jobs into groups that express similar run-time behavior aids this analysis as it reduces the number of representative jobs to look into. It is state of the practice to investigate job similarity by looking into job profiles that summarize the dynamics of job execution into one dimension of statistics and neglect the temporal behavior.
Chapter
HPC applications rely on a distributed-memory parallel programming model to improve the overall execution time. This leads to spawning multiple processes that need to communicate with each other to make the code progress. But these communications involve overheads caused by network latencies or synchronizations between processes. One possible approach to reduce those overheads is to overlap communications with computations. MPI allows this solution through its nonblocking communication mode: a nonblocking communication is composed of an initialization and a completion call. It is then possible to overlap the communication by inserting computations between these two calls. The use of nonblocking collective calls is however still marginal and adds a new layer of complexity. In this paper we propose an automatic static optimization that (i) transforms blocking MPI communications into their nonblocking counterparts and (ii) performs extensive code motion to increase the size of overlapping intervals between initialization and completion calls. Our method is implemented in LLVM as a compilation pass, and shows promising results on two mini applications.
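The overlap pattern described above, expressed with mpi4py rather than C for brevity (a schematic of the transformation, not the LLVM pass itself): the blocking call is split into an initiation and a completion, with independent computation moved in between.

```python
# Communication/computation overlap sketch with mpi4py (schematic; run with `mpirun -n 2`).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                                  # assumes exactly two ranks

send_buf = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty_like(send_buf)

# Initiation: post nonblocking send/receive instead of blocking Send/Recv.
reqs = [comm.Isend(send_buf, dest=peer), comm.Irecv(recv_buf, source=peer)]

# Independent computation moved between initiation and completion to hide communication latency.
local = np.sin(send_buf).sum()

MPI.Request.Waitall(reqs)                        # completion call
print(rank, local, recv_buf[0])
```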
Chapter
The 2019 MPI standard draft specification includes the addition of defined communicator info hints. These hints are assertions that an application makes to an MPI implementation, so that a more optimized implementation is possible. The 2019 draft specification defines four assertions: mpi_assert_no_any_tag, mpi_assert_no_any_source, mpi_assert_exact_length and mpi_assert_allow_overtaking. In this paper we will explore the capability of a Clang/LLVM based static analysis to check whether these assertions hold for a given program. With this tool, existing codebases can benefit from this new addition to the MPI standard without the need for costly human intervention.
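For illustration, such hints are attached to a communicator via an MPI info object. The mpi4py sketch below assumes an MPI implementation that understands the draft assertions; unknown hints are simply ignored, so this is a schematic rather than a guaranteed optimization.

```python
# Hedged sketch: attaching assertion-style info hints to a communicator with mpi4py.
from mpi4py import MPI

info = MPI.Info.Create()
info.Set("mpi_assert_no_any_source", "true")       # the code never receives from MPI.ANY_SOURCE
info.Set("mpi_assert_allow_overtaking", "true")    # message ordering need not be preserved

comm = MPI.COMM_WORLD.Dup_with_info(info)          # implementations may optimize based on the hints
info.Free()
# ... use comm as usual; hints the implementation does not recognize are silently ignored.
comm.Free()
```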
Article
In the present study, an effective optimization framework of aerodynamic shape design is established based on the multi-fidelity deep neural network (MFDNN) model. The objective of the current work is to construct a high-accuracy multi-fidelity surrogate model correlating the configuration parameters of an aircraft and its aerodynamic performance by blending different fidelity information and adaptively learning their linear or nonlinear correlation without any prior assumption. In the optimization framework, the high-fidelity model using a CFD evaluation with fine grid and the low-fidelity model using the same CFD model with coarse grid are applied. Moreover, in each optimization iteration, the high-fidelity infilling strategy by adding the current optimal solution of surrogate model into the high-fidelity database is applied to improve the surrogate accuracy. The low-fidelity infilling strategy which can generate the solutions distributed uniformly in the whole design space is used to update the low-fidelity database for avoiding local optimum. Then, the proposed multi-fidelity optimization framework is validated by two standard synthetic benchmarks. Finally, it is applied to the high-dimensional aerodynamic shape optimization of a RAE2822 airfoil parameterized by 10 design variables and a DLR-F4 wing-body configuration parameterized by 30 design variables. The optimization results demonstrate that the proposed multi-fidelity optimization framework can remarkably improve optimization efficiency and outperform the single-fidelity method.
Book
This volume provides a broad and uniform introduction to PDE-constrained optimization and documents a number of interesting and challenging applications. Many science and engineering applications necessitate the solution of optimization problems constrained by physical laws that are described by systems of partial differential equations (PDEs). As a result, PDE-constrained optimization problems arise in a variety of disciplines including geophysics, earth and climate science, material science, chemical and mechanical engineering, medical imaging and physics. This volume is divided into two parts. The first part provides a comprehensive treatment of PDE-constrained optimization including discussions of problems constrained by PDEs with uncertain inputs and problems constrained by variational inequalities. We place special emphasis on algorithm development and numerical computation. The second part of this volume focuses on applications of PDE-constrained optimization, including problems in optimal control, optimal design and inverse problems, with a comprehensive treatment of inverse problems arising in the oil and gas industry, among other topics.
Article
Modeling the performance and energy consumption of the sparse matrix-vector product (SpMV) is essential to perform off-line analysis and, for example, choose a target computer architecture that delivers the best performance-energy consumption ratio. However, this task is especially complex given the memory-bounded nature and irregular memory accesses of the SpMV, mainly dictated by the input sparse matrix. In this paper, we propose a Machine Learning (ML)-driven approach that leverages Convolutional Neural Networks (CNNs) to provide accurate estimations of the performance and energy consumption of the SpMV kernel. The proposed CNN-based models use a blockwise approach to make the CNN architecture independent of the matrix size. These models are trained to estimate execution time as well as total, package, and DRAM energy consumption at different processor frequencies. The experimental results reveal that the overall relative error ranges between 0.5% and 14%, while at the matrix level it does not exceed 10%. To demonstrate the applicability and accuracy of the SpMV CNN-based models, this study is complemented with an ad-hoc time-energy model for the PageRank algorithm, a popular algorithm for web information retrieval used by search engines, which internally realizes the SpMV kernel.
Preprint
Physics-based co-evolutionary models such as direct coupling analysis (DCA) in combination with machine learning (ML) techniques based on deep neural networks are able to predict protein contact maps with astonishing accuracy. Such contacts can be used as constraints in structure prediction and massively increase prediction accuracy. Unfortunately, the same ML methods cannot readily be applied to RNA as they rely on large structural datasets only available for proteins but not for RNAs. Here, we demonstrate how the small amount of data available for RNA can be used to significantly improve prediction of RNA contact maps. We introduce an algorithm called CoCoNet that is based on a combination of a Coevolutionary model and a shallow Convolutional Neural Network. Despite its simplicity and the small number of trained parameters, the method boosts the contact prediction accuracy by about 70% with respect to straightforward DCA as tested by cross-validation on a dataset of about sixty RNA structures. Both our extensive robustness tests and the limited number of parameters support the generalization properties of our model. Finally, applications to other RNAs highlight the power of our approach. CoCoNet is freely available and can be found at https://github.com/KIT-MBS/coconet.
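The architecture outlined above combines a coevolutionary coupling matrix with a shallow convolutional network. Below is a generic, hedged sketch of such a shallow 2D CNN acting on an L×L coupling map; the sizes, layer widths, and symmetrization are placeholders, not the CoCoNet implementation.

```python
# Generic shallow-CNN sketch over an L x L coevolutionary coupling map (not the CoCoNet code).
import torch

L = 70                                             # placeholder RNA length
coupling_map = torch.randn(1, 1, L, L)             # e.g. DCA coupling scores as a single-channel image

shallow_cnn = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, kernel_size=5, padding=2), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 1, kernel_size=5, padding=2), torch.nn.Sigmoid(),  # per-pair contact probability
)

contact_probs = shallow_cnn(coupling_map)                                  # shape (1, 1, L, L)
contact_probs = 0.5 * (contact_probs + contact_probs.transpose(-1, -2))   # symmetrize the prediction
print(contact_probs.shape)
```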