Article

FPGA-based Tsunami Simulation: Performance Comparison with GPUs, and Roofline Model for Scalability Analysis

Authors: K. Nagasu, K. Sano, F. Kono, N. Nakasato

Abstract

MOST (Method Of Splitting Tsunami) is widely used to solve the shallow water equations (SWEs) for tsunami simulation. This paper presents high-performance and power-efficient computation of MOST for practical tsunami simulation with an FPGA. The custom hardware for simulation is based on a stream computing architecture with deep pipelining to increase performance under a limited bandwidth. We design a stream processing element (SPE) of computing kernels combined with stencil buffers. We also introduce an SPE array architecture with spatial and temporal parallelism to further exploit available hardware resources by implementing multiple SPEs with parallel internal pipelines. Our prototype implementation with an Arria 10 FPGA demonstrates that the FPGA-based design performs numerically stable tsunami simulation with real ocean-depth data in single precision by introducing non-dimensionalization. We explore the design space of SPE arrays, and find that the design of six cascaded SPEs with a single pipeline achieves a sustained performance of 383 GFlops and a performance per power of 8.41 GFlops/W with a stream bandwidth of only 7.2 GB/s. These numbers are 8.6 and 17.2 times higher than those of an NVIDIA Tesla K20c GPU, and 1.7 and 7.1 times higher than those of an AMD Radeon R9 280X GPU, respectively, for the same tsunami simulation in single precision. Moreover, we propose a roofline model for stream computing with the SPE array in order to investigate the factors of performance degradation and the possible performance improvement for given FPGAs. With the model, we estimate that an upcoming Stratix 10 GX2800 FPGA can achieve a sustained performance of at most 8.7 TFlops with our SPE array architecture for tsunami simulation.
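To illustrate why cascaded SPEs can approach the compute roof with such a small stream bandwidth, the sustained performance of an SPE array can be sketched roughly as below; the symbols are illustrative and are not the paper's notation: N_spe is the number of cascaded SPEs, F_spe the floating-point operations each SPE applies per grid cell, f_clk the pipeline clock, B the external stream bandwidth, and S_cell the bytes streamed per cell.

\[
P_{\mathrm{sustained}} \;\approx\; \min\!\left( N_{\mathrm{spe}}\, F_{\mathrm{spe}}\, f_{\mathrm{clk}},\;\; \frac{B}{S_{\mathrm{cell}}}\, N_{\mathrm{spe}}\, F_{\mathrm{spe}} \right)
\]

Because temporal parallelism (cascading) multiplies the work performed per byte streamed, the compute-bound term can be reached even when B is only a few GB/s, which is consistent with the 383 GFlops at a 7.2 GB/s stream bandwidth reported above.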


... However, increasing only spatial parallelism did not improve performance as we expected, due to the lack of memory bandwidth. The paper [8], which presented the design of custom hardware for the MOST algorithm, reported that exploiting spatial parallelism did not improve performance on their design. Instead, it showed that exploiting temporal parallelism to compute using multiple pipelines was effective for their design. ...
... A study [11] shows a design with both spatial and temporal parallelism for stencil computation. However, we here evaluated them separately according to the previous reports in [8]. We first illustrate the kernel design to show tricks for temporal parallelism. ...
... Additionally, our design actively attempts burst transmission, and it succeeds for N_buf = 1 to 4. However, we find it fails for N_buf = 5, which degrades the performance. The performance model of the MOST algorithm on the same Arria 10 has been presented in [8]. According to that model, the required memory bandwidth is proportional to N_buf, and it is 36.1 GB/s for Code SP with N_buf = 4, provided that the size of struct SR is 32 bytes and the clock speed of the pipeline is 282 MHz. ...
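The 36.1 GB/s figure quoted above follows directly from the stated parameters, assuming each of the N_buf buffers streams one 32-byte struct SR per clock cycle:

\[
B \;=\; N_{\mathrm{buf}} \times \mathrm{sizeof}(\mathrm{SR}) \times f_{\mathrm{clk}} \;=\; 4 \times 32\ \mathrm{B} \times 282\ \mathrm{MHz} \;\approx\; 36.1\ \mathrm{GB/s}.
\]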
Conference Paper
We developed and evaluated tsunami simulations on FPGAs by designing optimized OpenCL kernels that execute 2-D stencil calculations. By using the Intel FPGA SDK for OpenCL, we obtained efficient FPGA designs exploiting temporal parallelism. The performance of our optimal implementation is 446 and 790 GFlops for Arria10 and Stratix10, respectively. These implementations are much faster than a design exploiting only spatial parallelism. The Stratix10 implementation is also faster than our GPU implementation on a Tesla V100 GPU.
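To make the temporal-parallelism idea above concrete, the following is a minimal software sketch (not the cited OpenCL kernels) of a cascade of stencil stages, each holding a small on-chip window, so that D time steps are applied in a single pass over the data stream; the 1-D 3-point stencil, its coefficients, and all names are illustrative.

#include <array>
#include <cstdio>
#include <optional>
#include <vector>

// One pipeline stage of a 1-D 3-point stencil: it keeps a 3-cell shift
// register (the software analogue of an FPGA stencil buffer) and emits an
// output cell once the window is full. Boundaries are simply dropped, so the
// valid interior shrinks by one cell per side per stage.
struct Stage {
    std::array<float, 3> w{};   // w[0] oldest cell, w[2] newest cell
    int filled = 0;
    std::optional<float> push(float x) {
        w = {w[1], w[2], x};
        if (++filled < 3) return std::nullopt;
        return 0.25f * w[0] + 0.5f * w[1] + 0.25f * w[2];
    }
};

int main() {
    const int D = 6;                         // cascade depth (illustrative)
    std::vector<Stage> pipe(D);              // D stages = D fused time steps
    std::vector<float> in(32, 0.0f), out;
    in[16] = 1.0f;                           // an initial bump to propagate

    for (float x : in) {                     // stream cells through the cascade
        std::optional<float> v = x;
        for (Stage& s : pipe) {
            if (!v) break;                   // downstream stages still filling
            v = s.push(*v);
        }
        if (v) out.push_back(*v);            // cell after D time steps
    }
    std::printf("streamed %zu cells in, %zu cells out after %d fused steps\n",
                in.size(), out.size(), D);
    return 0;
}

On the FPGA the D stages run concurrently as pipeline stages, so external memory is read and written only once per D time steps, which is the effect the fused Arria10/Stratix10 designs above exploit.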
... This paper presents the design and architecture of a stream computing platform, where custom computing units are cascaded over multiple FPGAs in a 1D ring topology. To efficiently utilize the available resources on multiple FPGAs, we rely on extending the pipeline with temporal and spatial parallelism [9]- [11], which introduces a vast design space. ...
... However, it was also discovered that 99.6% consumption of the FP digital signal processors (DSPs) in a single FPGA limits the scalability. As with LBM, the tsunami simulation in [11] is capable of delivering high throughput with an FPGA-based stream computing approach. It was previously demonstrated that, for a single Arria 10 FPGA, the highest sustained performance was achieved by a single pipeline with 6 cascaded computing units, where its scalability is also limited by the available FP-DSPs. ...
... Performance models for FPGA applications are important for scalability analysis and estimating achievable performance in different variants including future devices, as done by [8], [9], [11], [15]. Dohi et al. [8] introduced performance modeling of stream-based stencil computations on a single Maxeler Technology FPGA accelerator. ...
Article
Full-text available
Since the hardware resource of a single FPGA is limited, one idea to scale the performance of FPGA-based HPC applications is to expand the design space with multiple FPGAs. This paper presents a scalable architecture of a deeply pipelined stream computing platform, where available parallelism and inter-FPGA link characteristics are investigated to achieve a scaled performance. For a practical exploration of this vast design space, a performance model is presented and verified with the evaluation of a tsunami simulation application implemented on Intel Arria 10 FPGAs. Finally, scalability analysis is performed, where speedup is achieved when increasing the computing pipeline over multiple FPGAs while maintaining the problem size of computation. Performance is scaled with multiple FPGAs; however, performance degradation occurs with insufficient available bandwidth and large pipeline overhead brought by inadequate data stream size. Tsunami simulation results show that the highest scaled performance for 8 cascaded Arria 10 FPGAs is achieved with a single pipeline of 5 stream processing elements (SPEs), which obtained a scaled performance of 2.5 TFlops and a parallel efficiency of 98%, indicating the strong scalability of the multi-FPGA stream computing platform.
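For reference, a common definition of the parallel efficiency quoted above, with P_1 the sustained performance of one FPGA and P_N that of N FPGAs, is

\[
E \;=\; \frac{P_N}{N \cdot P_1},
\]

so, under this definition, the reported 98% on 8 cascaded Arria 10 FPGAs means the 2.5 TFlops figure is within about 2% of ideal linear scaling from a single device.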
... On the other hand, Nagasu et al. [10] designed a stream computing architecture and hardware for practical tsunami simulation. They introduced multiple stream processing element (SPE) arrays with parallel internal pipelines to further exploit available hardware resources. ...
... Therefore, the dedicated implementation for Arria 10 FPGA shows higher performance than our best GPU implementation. The performance per power of the FPGA implementation is also better than the GPU implementation [10]. ...
... In the latest implementation by Nagasu et al. [10], they evaluated the performance and power consumption of their dedicated FPGA implementation of the MOST algorithm on the same Arria 10 FPGA. In addition, they presented a performance model that applies both spatial and temporal parallelism. ...
Article
Full-text available
When a tsunami occurs in a sea area, prediction of its arrival time is critical for evacuating people from the coastal area. Many problems related to tsunamis must be solved to reduce the negative effects of this serious disaster. Numerical modeling of tsunami wave propagation is a computationally intensive problem whose calculations need to be accelerated by parallel processing. The method of splitting tsunami (MOST) is one of the well-known numerical solvers for tsunami modeling. We have developed a tsunami propagation code based on the MOST algorithm and implemented different parallel optimizations for GPU and FPGA. In our latest study, our best-performing OpenCL kernel implements the tsunami simulation on an AMD Radeon 280X GPU. This paper targets design and evaluation on an FPGA using OpenCL. With several kernel modifications, the performance of the FPGA design generated automatically by the Altera Offline Compiler follows the GPU results.
... Nagasu et al. [7] compared the energy consumption of FPGA and GPU computations for the same tsunami modeling application and demonstrated the effectiveness of FPGAs. They showed that their implementation on the Arria10 FPGA consumed approximately 5x less energy than the initial implementation on an AMD Radeon GPU. ...
Preprint
Full-text available
General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its performance and accuracy significantly impact the performance and accuracy of applications that depend on it. One such application is semidefinite programming (SDP), and it often requires binary128 or higher precision arithmetic to solve problems involving SDP stably. However, only some processors support binary128 arithmetic, which makes SDP solvers generally slow. In this study, we focused on accelerating GEMM with binary128 arithmetic on field-programmable gate arrays (FPGAs) to enable the flexible design of accelerators for the desired computations. Our binary128 GEMM designs on a recent high-performance FPGA achieved approximately 90GFlops, 147x faster than the computation executed on a recent CPU with 20 threads for large matrices. Using our binary128 GEMM design on the FPGA, we successfully accelerated two numerical applications: LU decomposition and SDP problems, for the first time.
... Satria et al. [15] proposed the GPU Acceleration of the Tsunami Propagation Model, which is based on the two-step finite-difference MacCormack scheme. Kohei et al. [16] present high-performance and power-efficient computation of MOST for practical tsunami simulation with FPGA. Parent's work [17] presents a GPU implementation for the real-time solution of shallow water equations. ...
Article
Full-text available
In this paper, we consider numerical simulation and GPU (graphics processing unit) computing for the two-dimensional non-linear tsunami equation, which is a fundamental equation of tsunami propagation in shallow water areas. Tsunamis are highly destructive natural disasters that have a significant impact on coastal regions. These events are typically caused by undersea earthquakes, volcanic eruptions, landslides, and possibly an asteroid impact. To solve the equations numerically, we first discretized them in a rectangular domain and then transformed the partial differential equations into semi-implicit finite difference schemes. The spatial and time derivatives are approximated by second-order centered differences following the Crank-Nicolson method, the calculation is based on the Jacobi method, the computation is performed using the C++ programming language, and the visualization of numerical results is done in Matlab 2021. The initial condition was given as a Gaussian, and the basin profile was approximated by a hyperbolic tangent. To accelerate the sequential algorithm, a parallel computation algorithm is developed using CUDA (Compute Unified Device Architecture) technology. CUDA has long been used for the numerical solution of partial differential equations (PDEs), exploiting the parallel computing capabilities of graphics processing units (GPUs) to speed up the PDE solution. By taking advantage of the GPU's massive parallelism, CUDA can significantly speed up PDE computations, making it an effective tool for scientific computing in a variety of fields. The performance of the parallel implementation is tested by comparing the computation time between the sequential (CPU) solver and the CUDA implementation for various mesh sizes. The comparison shows that our CUDA implementation gives significant acceleration.
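As a generic reminder of the two numerical ingredients named above (not the paper's exact discretization), Crank-Nicolson averages the spatial operator L between the old and new time levels, and the resulting linear system Ax = b is relaxed with Jacobi sweeps:

\[
\frac{u^{n+1}-u^{n}}{\Delta t} \;=\; \frac{1}{2}\left( L(u^{n+1}) + L(u^{n}) \right),
\qquad
x_i^{(k+1)} \;=\; \frac{1}{a_{ii}}\Big( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)} \Big).
\]

The Jacobi update of every unknown depends only on values from the previous sweep, which is what makes it straightforward to parallelize with one CUDA thread per grid point.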
... Running the tsunami simulation to make a threat level map on the fly during an event is not feasible, as the CPU time for the desired modeling setup described above, even using NeSI HPC, was about 3 days. However, recent developments of GPU-based tsunami numerical models (e.g., Furuyama & Maihara, 2014; Nagasu et al., 2017) showed that the simulation time can be significantly reduced. In this study, the threat level map from the estimated source model is used purely as a reference to evaluate the forecast based on the pre-computed scenarios. ...
Article
Full-text available
A tsunamigenic earthquake with thrust faulting mechanism occurred southeast of the Loyalty Islands, New Caledonia, in the Southern Vanuatu subduction zone on the 10th of February 2021. The tsunami was observed at coastal gauges in the surrounding islands and in New Zealand. The tsunami was also recorded at a new DART network designed to enhance the tsunami forecasting capability of the Southwestern Pacific. We used the tsunami waveforms in an inversion to estimate the fault slip distribution. The estimated major slip region is located near the trench with maximum slip of 4 m. This source model with an assumed rupture velocity of 1.0 km/s can reproduce the observed seismic waves. We evaluated two tsunami forecasting approaches for coastal regions in New Zealand: selecting a pre‐computed scenario, and interpolating between two pre‐computed scenarios. For the evaluation, we made a reference map of tsunami threat levels in New Zealand using the estimated source model. The results show that the threat level maps from the pre‐computed Mw 7.7 scenario located closest to the epicenter, and from an interpolation of two scenarios, match the reference threat levels in most coastal regions. Further improvements to enhance the system toward more robust warnings include expansion of scenario database and incorporation of tsunami observation around tsunami source regions. We also report on utilization of the coastal gauge and DART station data for updating forecasts in real‐time during the event and discuss the differences between the rapid‐response forecast and post‐event retrospective forecasts.
... In fact, the Roofline Model has already been used in the past to evaluate the performance of specific applications [32] being ported to FPGAs. But few works provide a generic, application-independent extension of this model for these architectures, mainly due to the difficulty in defining the maximum compute performance for a reconfigurable device. ...
Article
Full-text available
Nowadays, the use of hardware accelerators to boost the performance of HPC applications is a consolidated practice, and among others, GPUs are by far the most widespread. More recently, some data centers have successfully deployed FPGA accelerated systems as well, especially to boost machine learning inference algorithms. Given the growing use of machine learning methods in various computational fields, and the increasing interest towards reconfigurable architectures, we may expect that in the near future FPGA based accelerators will be more common in HPC systems, and that they could be exploited also to accelerate general purpose HPC workloads. In view of this, tools able to benchmark FPGAs in the context of HPC are necessary for code developers to estimate the performance of applications, as well as for computer architects to model that of systems at scale. To fulfill these needs, we have developed FER (FPGA Empirical Roofline), a benchmarking tool able to empirically measure the computing performance of FPGA based accelerators, as well as the bandwidth of their on-chip and off-chip memories. FER measurements enable drawing Roofline plots for FPGAs, allowing for performance comparisons with other processors, such as CPUs and GPUs, and for estimating at the same time the performance upper-bounds that applications could achieve on a target device. In this paper we describe the theoretical model on which FER relies, its implementation details, and the results measured on Xilinx Alveo accelerator cards.
... The Roofline model has been introduced to assist the designer when targeting hardware acceleration of HPC applications, so as to explore the design space, estimate the performance, and evaluate the throughput due to its dependency on communication and computation. Roofline is applied by Du et al. [161] in the acceleration of stencil computation kernels, by Karp et al. [162] for the hardware implementation of a spectral element method, and by Nagasu et al. [163] in the context of an FPGA-based tsunami simulation. ...
Article
Full-text available
Hardware accelerators based on field programmable gate array (FPGA) and system on chip (SoC) devices have gained attention in recent years. One of the main reasons is that these devices contain reconfigurable logic, which makes them feasible for boosting the performance of applications. High-level synthesis (HLS) tools facilitate the creation of FPGA code from a high level of abstraction using different directives to obtain an optimized hardware design based on performance metrics. However, the complexity of the design space depends on different factors such as the number of directives used in the source code, the available resources in the device, and the clock frequency. Design space exploration (DSE) techniques comprise the evaluation of multiple implementations with different combinations of directives to obtain a design with a good compromise between different metrics. This paper presents a survey of models, methodologies, and frameworks proposed for metric estimation, FPGA-based DSE, and power consumption estimation on FPGA/SoC. The main features, limitations, and trade-offs of these approaches are described. We also present the integration of existing models and frameworks in diverse research areas and identify the different challenges to be addressed.
... The original Roofline paper (Williams et al. 2009) suggests various optimizations that could be performed on workloads bound by memory bandwidth and/or computational power and applies them to traditional scientific workloads. Since then it has been used to profile and optimize various architectures such as Intel KNL (Doerfler et al. 2016), NVIDIA GPUs (Lopes et al. 2017), and Google TPUs (Jouppi et al. 2017a), and applications including, but not limited to, disaster detection (Nagasu et al. 2017), large-scale simulations (Kim et al. 2011), wireless network detection (Sarker et al. 2002), and even matrix multiplication (Kong et al. 2015). ...
Article
Full-text available
Over the last decade, technologies derived from convolutional neural networks (CNNs), called Deep Learning applications, have revolutionized fields as diverse as cancer detection, self-driving cars, virtual assistants, etc. However, many users of such applications are not experts in Machine Learning itself. Consequently, there is limited knowledge among the community on how to run such applications in an optimized manner. The performance question for Deep Learning applications has typically been addressed by employing bespoke hardware (e.g., GPUs) better suited for such compute-intensive operations. However, such a degree of performance is only accessible at increasingly high financial costs, leaving only big corporations and governments with resources sufficient to employ them at a large scale. As a result, an average user is only left with access to commodity clusters with, in many cases, only CPUs as the sole processing element. For such users to make effective use of the resources at their disposal, concerted efforts are necessary to figure out optimal hardware and software configurations. This study is one such step in this direction, as we use the Roofline model to perform a systematic analysis of representative CNN models and identify opportunities for black box and application-aware optimizations. Using the findings from our study, we are able to obtain up to 3.5× speedup compared to vanilla TensorFlow with default configurations.
... Performance models alleviate the design effort when using hardware accelerators [98]. They can be used to compare hardware accelerators [30], [109] in order to select the best-performing technology, or to evaluate the communication impact [161], [136]. Performance models have also been proposed for heterogeneous computing systems [143], but they mostly consider systems composed of a single type of hardware accelerator. ...
Thesis
Full-text available
Field-Programmable Gate Arrays (FPGAs) increasingly assume roles as hardware accelerators which significantly speed up computations in a wide range of streaming applications. For instance, specific streaming applications related to audio or image processing also demand high performance, runtime dynamism and power efficiency. Such applications demand a low latency while presenting a large amount of parallelism, both well-known features offered by FPGAs nowadays. Although the flexibility offered by FPGAs makes it possible to implement customized architectures with higher computational performance and better power efficiency than multi-core CPUs and GPUs respectively, the design of such architectures is a very time-consuming task. Moreover, heterogeneous FPGA-based platforms and devices can only be fully exploited when modelling and analysing architectures combining the best of each technology. The aim of this thesis is the acceleration of streaming applications by overcoming the challenges that the available FPGA-based systems present when mapping high-performance demanding streaming applications. On the one hand, performance analysis and techniques are proposed to exploit customized architectures for acoustic streaming applications demanding a real-time computation of the incoming signals from dense microphone arrays. The proposed design-space exploration of reconfigurable architectures, including a complete analysis of the different trade-offs in terms of performance, power and frequency response, leads to designs providing the dynamic response, the high performance or the power efficiency demanded by highly constrained applications such as acoustic beamforming. On the other hand, performance models for heterogeneous FPGA-based systems, such as the roofline model, are adapted for FPGAs to guide the design methodology toward the highest performance. High-Level Synthesis tools are used not only as a complement to our roofline model but also for performance prediction. These models are applied to accelerate simple convolutional image filters and a more complex image algorithm for pedestrian detection.
... FPGA, or Field Programmable Gate Array, is an accelerator that consists of a matrix of configurable logic blocks (CLBs) connected through programmable interconnects. In [6], a stencil-based computation is employed for tsunami simulation on an Intel Stratix 10 GX2800 FPGA. In [7], the implementation of scalable stencil computation on a multi-FPGA accelerator is shown. ...
... To conquer this problem, they proposed an analytical design procedure using the roofline model [49] and achieved a peak performance of 61.62 GFLOPS while running at 100 MHz. Recently, Nagasu et al. [50] proposed a high-performance computing system for simulating tsunamis with FPGA hardware, using a deeply pipelined architecture that performs within a limited bandwidth. Also, in a recent work by the authors of the present paper, the feasibility of FPGA hardware for accelerating the numerical solution of the Laplace problem as well as the 1-D Euler equation is studied, and a 20x speed-up is achieved [51]. ...
Article
Full-text available
The objective of the present study is to investigate the capability of field-programmable gate array hardware in numerical simulation of a model of a dielectric barrier discharge plasma actuator to accelerate the calculations. The reconfigurable hardware is designed such that it is possible to reprogram its architecture after manufacturing. This provides the capability to design and implement various architectures for several applications. Two reconfigurable chips are used in the present study, one of which consists of a programmable logic unit and a typical microprocessor. This hybrid architecture makes the high performance of the reconfigurable hardware in custom computing and the efficiency of the microprocessor in data flow control accessible. An automated design procedure is used for the design of the reconfigurable hardware. Further, a finite difference representation of a phenomenological model of a plasma actuator is derived and implemented on the field-programmable gate array hardware. The results are validated against other numerical data, and the computational time is compared to different conventional processors. Using the reconfigurable hardware results in up to 96% computational time reduction compared to a recent Core i7 processor.
... In general, the amount of calculation related to sea bottom friction becomes too large for tsunami inundation simulations. Moreover, Nagasu et al. [17] developed a hardware architecture for FPGA-based custom computing with high Gflop/s/W. The performance and performance per power were 1.7 and 7.1 times higher than those of an AMD Radeon R9 280X GPU. ...
Article
Full-text available
The tsunami disasters that occurred in Indonesia, Chile, and Japan have inflicted serious casualties and damaged social infrastructures. Tsunami forecasting systems are thus urgently required worldwide. We have developed a real-time tsunami inundation forecast system that can complete a tsunami inundation and damage forecast for coastal cities at the level of 10-m grid size in less than 20 min. As the tsunami inundation and damage simulation is a vectorizable memory-intensive program, we incorporate NEC’s vector supercomputer SX-ACE. In this paper, we present an overview of our system. In addition, we describe an implementation of the program on SX-ACE and evaluate its performance of SX-ACE in comparison with the cases using an Intel Xeon-based system and the K computer. Then, we clarify that the fulfillment of a real-time tsunami inundation forecast system requires a system with high-performance cores connected to the memory subsystem at a high memory bandwidth such as SX-ACE.
... On the other hand, Nagasu et al. [7] designed a stream computing architecture and hardware for practical tsunami simulation. They introduced multiple stream processing element (SPE) arrays with parallel internal pipelines to further exploit available hardware resources. ...
Conference Paper
FPGAs are receiving increased attention as a promising architecture for accelerators in HPC systems. Evolving and maturing development tools based on high-level synthesis promise productivity improvements for this technology. However, up to now, FPGA designs for complex simulation workloads, like shallow water simulations based on discontinuous Galerkin discretizations, rely to a large degree on manual application-specific optimizations. In this work, we present a new approach to port shallow water simulations to FPGAs, based on a code-generation framework for high-level abstractions in combination with a template-based stencil processing library that provides FPGA-specific optimizations for a streaming execution model. The new implementation uses a structured grid representation suitable for stencil computations and is compared to an adaptation from an existing hand-optimized FPGA dataflow design supporting unstructured meshes. While there are many differences, for example in the numerical details and problem scalability to be discussed, we demonstrate that overall both approaches can yield meaningful results at competitive performance for the same target FPGA, thus demonstrating a new level of maturity for FPGA-accelerated scientific simulations. Keywords: FPGA, Reconfigurable Computing, Shallow Water Simulations, Code Generation, Dataflow, SYCL, OpenCL, Discontinuous Galerkin Method
Conference Paper
As the performance of high-end FPGAs has increased in recent years, it is becoming more important to construct FPGA clusters for both improved processing performance and power efficiency in data centers and supercomputers. For higher utilization of FPGA resources across various applications, we require a flexible inter-FPGA network which provides topologies appropriate to different applications, while a conventional direct-connection network (DCN) provides only a fixed topology, such as a 2D torus. In this paper, we propose a virtual circuit-switching network (VCSN) for a large-scale FPGA cluster to provide a flexible inter-FPGA network, where communication links connecting FPGAs are virtualized on top of Ethernet frames. We can easily configure the VCSN topology optimized for the application by modifying the destination MAC addresses registered in a table of a frame encoder. We present its efficient protocol, hardware implementation, demonstration with 100Gbps Ethernet, and performance comparison with a conventional direct-connection network for FPGAs. We show that VCSN has higher but acceptable latency and slightly higher throughput in comparison with DCN, so that numerical simulation running with a ring of FPGAs achieves performance comparable to DCN.
Article
In this article the complexity and runtime performance of two Multiuser Detectors for Direct Sequence-Code Division Multiple Access were evaluated on two different hardware platforms. The innovation and aim is to take advantage of present parallel hardware to bring Multiuser technology to present and future Base Stations in order to increase the capacity of the overall system, to reduce the transmission power of the mobile stations, and to reduce base station hardware requirements in the Universal Mobile Telecommunications System. The detectors are based on the Frequency Shift Canceller concatenated with a Parallel Interference Canceller. This detector requires the inversion of multiple small matrices of identical size and is therefore very scalable, contrary to other solutions/detectors that only permit a sequential implementation despite their lower complexity. Implementations for Time Division-Code Division Multiple Access were done on two software platforms, one in OpenMP and the other in CUDA, taking into account the carrier and Doppler frequency offsets (a different offset for each user). The results show that this deployment-aware real-time implementation of the Multiuser Detectors is possible, with a Graphics Processing Unit being three times faster than required.
Chapter
As tsunamis may cause damage over a wide area, it is difficult to immediately grasp the full extent of the damage. To quickly estimate the damage and respond to the disaster, we have developed a real-time tsunami inundation forecast system that utilizes the vector supercomputer SX-ACE for simulating tsunami inundation phenomena. The forecast system can complete a tsunami inundation and damage forecast for the southwestern part of the Pacific coast of Japan at the level of a 30-m grid size in less than 30 min. The forecast system requires higher-performance supercomputers to increase resolutions and expand forecast areas. In this paper, we compare the performance of the tsunami inundation simulation on SX-Aurora TSUBASA, which is a new vector supercomputer released in 2018, with those on Xeon Gold and SX-ACE. We clarify that SX-Aurora TSUBASA achieves the highest performance among the three systems and has a high potential for increasing resolutions as well as expanding forecast areas.
Article
Tsunami disasters can cause serious casualties and damage to social infrastructures. An early understanding of disaster states is required in order to advise evacuations and plan rescues and recoveries. We have developed a real-time tsunami inundation forecast system using the vector supercomputer SX-ACE. The system can complete a tsunami inundation and damage estimation for coastal city regions at the resolution of a 10 m grid size in under 20 minutes, and distribute tsunami inundation and infrastructure damage information to local governments in Japan. We also develop a new configuration for the computational domain, which is changed from rectangles to polygons and called a polygonal domain, in order to effectively simulate the entire coast of Japan. Meanwhile, new supercomputers have been developed, and their peak performances have increased year by year. In 2016, a new Xeon Phi processor called Knights Landing was released for high-performance computing. In this paper, we present an overview of our real-time tsunami inundation forecast system and the polygonal domain, which can decrease the amount of computation in a simulation, and then discuss its performance on the vector supercomputer SX-ACE and a supercomputer system based on Intel Xeon Phi. We also clarify that the real-time tsunami inundation forecast system requires the efficient vector processing of a supercomputer with high-performance cores.
Conference Paper
Full-text available
Stream computation is one of the approaches suitable for FPGA-based custom computing due to its high throughput capability brought by pipelining with regular memory access. To increase the performance of iterative stream computation, we can exploit both temporal and spatial parallelism by deepening and duplicating pipelines, respectively. However, the performance is constrained by several factors including the available hardware resources on the FPGA, the external memory bandwidth, and the utilization of pipeline stages, and therefore we need to find the best mix of the different kinds of parallelism to achieve the highest performance per power. In this paper, we present a domain-specific language (DSL) based design space exploration for temporally and/or spatially parallel stream computation with FPGA. We define a DSL in which we can easily design a hierarchical structure of parallel stream computation with an abstract description of the computation. For iterative stream computation of a fluid dynamics simulation, we design hardware structures with different mixes of temporal and spatial parallelism. By measuring the performance and the power consumption, we find the best among them.
Article
Full-text available
The potential of FPGAs as accelerators for high-performance computing applications is very large, but many factors are involved in their performance. The design for FPGAs and the selection of the proper optimizations when mapping computations to FPGAs lead to prohibitively long development times. Alternatives are high-level synthesis (HLS) tools, which promise fast design space exploration through design at a high level, and analytical performance models, which provide realistic performance expectations, potential impediments to performance, and optimization guidelines. In this paper we propose the combination of both, in order to construct a performance model for FPGAs which is able to visually condense all the helpful information for the designer. Our proposed model extends the roofline model by considering the resource consumption and the parameters used in the HLS tools, to maximize the performance and the resource utilization within the area of the FPGA. The proposed model is applied to optimize the design exploration of a class of window-based image processing applications using two different HLS tools. The results show the accuracy of the model as well as its flexibility to be combined with any HLS tool.
Conference Paper
Full-text available
This paper presents a performance model of an LBM accelerator to be implemented on a tightly-coupled FPGA cluster. In strong scaling, each accelerator node has a smaller computation as the number of nodes increases, and consequently communication overhead becomes apparent and limits the scalability. Our tightly-coupled FPGA cluster has a 1D ring of the accelerator-domain network (ADN), which allows FPGAs to send and receive data with low communication overhead. We propose the LBM accelerator architecture and its stream computation appropriate for using the ADN. We formulate a sustained-performance model of the accelerator, which consists of three cases depending on which of resource availability, network bandwidth, or shift-register size is the limiting factor. With the model, we show that the network bandwidth is much more important than the memory bandwidth. The wider the network bandwidth is, the more FPGAs can scale the sustained performance in computing a constant-size lattice. This result demonstrates the importance of the ADN in the tightly-coupled FPGA cluster.
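Schematically (the paper's exact formulation is not reproduced here), such a three-case sustained-performance model takes the form of a minimum over the candidate bottlenecks,

\[
P_{\mathrm{sustained}} \;\approx\; \min\big( P_{\mathrm{resource}},\; P_{\mathrm{network}},\; P_{\mathrm{buffer}} \big),
\]

where the first term is bounded by the logic and DSPs available per FPGA, the second by the ring (ADN) bandwidth, and the third by the on-chip shift-register capacity.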
Article
Full-text available
Stencil computation is one of the important kernels in scientific computations. However, sustained performance is limited owing to restriction on memory bandwidth, especially on multicore microprocessors and graphics processing units (GPUs) because of their small operational intensity. In this paper, we present a custom computing machine (CCM), called a scalable streaming-array (SSA), for high-performance stencil computations with multiple field-programmable gate arrays (FPGAs). We design SSA based on a domain-specific programmable concept, where CCMs are programmable with the minimum functionality required for an algorithm domain. We employ a deep pipelining approach over successive iterations to achieve linear scalability for multiple devices with a constant memory bandwidth. Prototype implementation using nine FPGAs demonstrates good agreement with a performance model, and achieves 260 and 236 GFlop/s for 2D and 3D Jacobi computation, which are 87.4 and 83.9 percent of the peak, respectively, with a memory bandwidth of only 2.0 GB/s. We also evaluate the performance of SSA for state-of-the-art FPGAs.
Article
Full-text available
In this research we studied the effect of testing temperature on both the static and dynamic fracturing behaviors of low-silicon CA-15 martensitic stainless steel (MSS) castings after austenitizing and tempering treatments. The results showed that the material's microstructure was influenced by heat treatment and that different testing temperatures caused different fracturing mechanisms. In the static tensile tests, the 573–673 K tempered specimens exhibited secondary strengthening at testing temperatures of 423 K and 298 K. In contrast, weakening occurred at 123 K for the same type of tempered samples. The phenomenon was mainly triggered by local cracking at the ferrite/martensite interface and at incoherent precipitate sites in the material because of shrinkage stress at subzero temperatures. In the dynamic strain-rate tests, impact embrittlement occurred in the 573–673 K tempered samples as a result of the tempered martensite embrittlement (TME) phenomenon. The ductile-to-brittle transition temperature (DBTT) of the tempered material was obviously lower than that of the as-cast material. Also, optical microscopy (OM), scanning electron microscopy (SEM), and transmission electron microscopy (TEM) were performed to correlate the properties attained with the microstructural observations.
Article
Full-text available
The numerical solution of shallow water systems is useful for several applications related to geophysical flows, but the big dimensions of the domains suggest the use of powerful accelerators to obtain numerical results in reasonable times. This paper addresses how to speed up the numerical solution of a first order well-balanced finite volume scheme for 2D one-layer shallow water systems by using modern Graphics Processing Units (GPUs) supporting the NVIDIA CUDA programming model. An algorithm which exploits the potential data parallelism of this method is presented and implemented using the CUDA model in single and double floating point precision. Numerical experiments show the high efficiency of this CUDA solver in comparison with a CPU parallel implementation of the solver and with respect to a previously existing GPU solver based on a shading language. Keywords: General Purpose computation on Graphics Processing Units (GPGPU), Shallow water systems, OpenMP, CUDA
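For context, the 2D one-layer shallow water system targeted by such solvers can be written in a standard conservative form, with h the water depth, (u, v) the depth-averaged velocities, g gravity, and b the bottom topography; source terms such as friction are omitted here and the cited scheme's exact formulation may differ:

\[
\begin{aligned}
&\partial_t h + \partial_x (hu) + \partial_y (hv) = 0,\\
&\partial_t (hu) + \partial_x \big(hu^2 + \tfrac{1}{2} g h^2\big) + \partial_y (huv) = -\,g h\, \partial_x b,\\
&\partial_t (hv) + \partial_x (huv) + \partial_y \big(hv^2 + \tfrac{1}{2} g h^2\big) = -\,g h\, \partial_y b.
\end{aligned}
\]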
Conference Paper
Full-text available
The strong earthquake of December 26, 2004 generated a catastrophic tsunami in the Indian Ocean. This shows that, in spite of recent technological progress, populations in coastal zones are not protected against the tsunami hazard. Here, we address the problem of tsunami risk mitigation. Note that prediction of tsunami wave parameters at certain locations should be made as early as possible to provide enough time for evacuation. Modern computational technologies can accurately calculate tsunami wave propagation over the deep ocean provided that the initial displacement (perturbation of the sea bed at the tsunami source) is known. Modern deep ocean tsunameters provide direct measurement of the passing tsunami wave in real time, which helps to estimate initial displacement parameters right after the tsunami wave is recorded at one of the deep ocean buoys. Therefore, a fast tsunami propagation code that can calculate tsunami evolution from an estimated model source becomes critical for timely evacuation decisions for many coastal communities in case of a strong tsunami. Numerical simulation of tsunami waves is a very important task for risk evaluation, assessment, and mitigation. Here we discuss a part of the MOST (Method of Splitting Tsunami) software package, which has been accepted by the U.S. National Oceanic and Atmospheric Administration as the basic tool to calculate tsunami wave propagation and evaluate inundation parameters. Our main objectives are speeding up the sequential program and adapting it for shared memory systems (OpenMP) and the CELL architecture. Optimization of the existing parallel and sequential code for the task of tsunami wave propagation modeling, as well as the adaptation of this code for systems based on CELL BE processors (e.g. SONY PlayStation 3), is discussed. The paper also covers approaches and techniques for program optimization and adaptation, and the obtained results.
Article
Full-text available
For scientific numerical simulation that requires a relatively high ratio of data access to computation, the scalability of memory bandwidth is the key to performance improvement, and therefore custom-computing machines (CCMs) are one of the promising approaches to provide bandwidth-aware structures tailored for individual applications. In this article, we propose a scalable FPGA-array with a bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient CCMs for scientific simulations based on finite difference methods. With the FPGA-array, we construct a systolic computational-memory array (SCMA), which is given a minimum of programmability to provide flexibility and high productivity for various computing kernels and boundary computations. Since the systolic computational-memory architecture of the SCMA provides scalability of both memory bandwidth and arithmetic performance according to the array size, we introduce a homogeneous partitioning approach to the SCMA so that it is extensible over a 1D or 2D array of FPGAs connected with a mesh network. To satisfy the bandwidth requirement of inter-FPGA communication, we propose BRM based on time-division multiplexing. BRM decreases the required number of communication channels between adjacent FPGAs at the cost of delay cycles. We formulate the trade-off between bandwidth and delay of inter-FPGA data transfer with BRM. To demonstrate feasibility and evaluate performance quantitatively, we design and implement an SCMA of 192 processing elements over two ALTERA Stratix II FPGAs. The implemented SCMA running at 106 MHz has a peak performance of 40.7 GFlops in single precision. We demonstrate that the SCMA achieves sustained performances of 32.8 to 35.7 GFlops for three benchmark computations with high utilization of computing units. The SCMA has complete scalability to an increasing number of FPGAs due to the highly localized computation and communication. In addition, we also demonstrate that the FPGA-based SCMA is power-efficient: it consumes 69% to 87% of the power and requires only 2.8% to 7.0% of the energy of the same computations performed by a 3.4-GHz Pentium 4 processor. With software simulation, we show that BRM works effectively for the benchmark computations, and therefore commercially available low-end FPGAs with relatively narrow I/O bandwidth can be utilized to construct a scalable FPGA-array.
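Illustratively (the paper's exact formulation is not reproduced here), time-division multiplexing with a factor m trades inter-FPGA channel count for transfer delay:

\[
C_{\mathrm{physical}} \;\approx\; \frac{C_{\mathrm{logical}}}{m},
\qquad
T_{\mathrm{transfer}} \;\approx\; m \cdot T_{\mathrm{word}},
\]

that is, m logical channels share one physical channel, so fewer inter-FPGA I/O pins are needed at the cost of roughly m-fold longer transfer time per logical channel.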
Article
Full-text available
Manufacturers will likely offer multiple products with differing numbers of cores to cover multiple price-performance points, since Moore's Law will permit the doubling of the number of cores per chip every two years. While diversity may be understandable in this time of uncertainty, it exacerbates the already difficult jobs of programmers, compiler writers, and even architects. Hence, an easy-to-understand model that offers performance guidelines would be especially valuable. This article proposes one such model called Roofline, demonstrating it on four diverse multicore computers using four key floating-point kernels. The proposed Roofline model ties together floating-point performance, operational intensity, and memory performance in a 2D graph. The Roofline sets an upper bound on performance of a kernel depending on the kernel's operational intensity. If people think of operational intensity as a column that hits the roof, either it hits the flat part of the roof, meaning performance is compute-bound, or it hits the slanted part of the roof, meaning performance is ultimately memory-bound.
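In symbols, the model bounds attainable performance by the minimum of the compute roof and the memory roof, where the operational intensity I is measured in flops per byte of DRAM traffic:

\[
P_{\mathrm{attainable}} \;=\; \min\big( P_{\mathrm{peak}},\; B_{\mathrm{peak}} \times I \big).
\]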
Conference Paper
Full-text available
For decades, the high-performance computing (HPC) community has focused on performance, where performance is defined as speed. To achieve better performance per compute node, microprocessor vendors have not only doubled the number of transistors (and speed) every 18-24 months, but they have also doubled the power densities. Consequently, keeping a large-scale HPC system functioning properly requires continual cooling in a large machine room, thus resulting in substantial operational costs. Furthermore, the increase in power densities has led (in part) to a decrease in system reliability, thus leading to lost productivity. To address these problems, we propose a power-aware algorithm that automatically and transparently adapts its voltage and frequency settings to achieve significant power reduction and energy savings with minimal impact on performance. Specifically, we leverage a commodity technology called "dynamic voltage and frequency scaling" to implement our power-aware algorithm in the run-time system of commodity HPC systems.
Article
High-performance and low-power computation is required for large-scale fluid dynamics simulation. Due to their inefficient architecture and structure, CPUs and GPUs now have difficulty improving power efficiency for the target application. Although FPGAs have become promising alternatives for power-efficient and high-performance computation due to their new architecture having floating-point (FP) DSP blocks, their relatively narrow memory bandwidth requires an appropriate way to fully exploit the advantage. This paper presents an architecture and design for scalable fluid simulation based on data-flow computing with a state-of-the-art FPGA. To exploit available hardware resources including FP DSPs, we introduce spatial and temporal parallelism to further scale the performance by adding more stream processing elements (SPEs) in an array. Performance modeling and prototype implementation allow us to explore the design space for both the existing Altera Arria10 and the upcoming Intel Stratix10 FPGAs. We demonstrate that the Arria10 10AX115 FPGA achieves 519 GFlops at 9.67 GFlops/W with a stream bandwidth of only 9.0 GB/s, which is 97.9% of the peak performance of the 18 implemented SPEs. We also estimate that the Stratix10 FPGA can scale up to 6844 GFlops by combining spatial and temporal parallelism adequately.
Conference Paper
This paper describes architectural enhancements in the Altera Stratix 10 HyperFlex FPGA architecture, fabricated in the Intel 14nm FinFET process. Stratix 10 includes ubiquitous flip-flops in the routing to enable a high degree of pipelining. In contrast to the earlier architectural exploration of pipelining in pass-transistor based architectures, the direct drive routing fabric in Stratix-style FPGAs enables an extremely low-cost pipeline register. The presence of ubiquitous flip-flops simplifies circuit retiming and improves performance. The availability of predictable retiming affects all stages of the cluster, place and route flow. Ubiquitous flip-flops require a low-cost clock network with sufficient flexibility to enable pipelining of dozens of clock domains. Different cost/performance tradeoffs in a pipelined fabric and the use of a 14nm process lead to other modifications to the routing fabric and the logic element. User modification of the design enables even higher performance, averaging 2.3X faster in a small set of designs.
Conference Paper
This work describes the architecture of a new FPGA DSP block supporting both fixed and floating point arithmetic. Each DSP block can be configured to provide one single precision IEEE-754 floating point multiplier and one IEEE-754 floating point adder, or, when configured in fixed point mode, the block is completely backwards compatible with current FPGA DSP blocks. The DSP block operating frequency is similar in both modes, in the region of 500MHz, offering up to 2 GMACs fixed point and 1 GFLOPs performance per block. In floating point mode, support for multi-block vector modes is provided, where multiple blocks can be seamlessly assembled into real or complex dot products of any size. By efficient reuse of the fixed point arithmetic modules, as well as the fixed point routing, the floating point features have only minimal power and area impact. We show how these blocks are implemented in a modern Arria 10 FPGA family, offering over 1 TFLOPs using only embedded structures, and how scaling to multiple TFLOPs densities is possible for planned devices.
Conference Paper
Convolutional neural networks (CNNs) have been widely employed for image recognition because they can achieve high accuracy by emulating the behavior of optic nerves in living creatures. Recently, the rapid growth of modern applications based on deep learning algorithms has further driven research and implementations. In particular, various accelerators for deep CNNs have been proposed based on FPGA platforms because of their advantages of high performance, reconfigurability, fast development cycles, etc. Although current FPGA accelerators have demonstrated better performance over generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not well match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve the best performance due to under-utilization of either logic resources or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. In order to overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with the best performance and the lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS under a 100MHz working frequency, which outperforms previous approaches significantly.
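As a minimal illustration of the loop tiling mentioned above (a generic sketch, not the cited accelerator design), blocking the loops of a convolution-like nest lets each tile, plus its halo, be staged in fast on-chip buffers and reused instead of being refetched from external memory; the 3x3 kernel, the tile size, and all names are illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Generic sketch: a 3x3 convolution over an HxW image, tiled into TxT blocks.
// The caller must size out to H*W. On an FPGA accelerator, one tile plus its
// 1-cell halo would be staged in on-chip memory so the inner loops hit fast
// local buffers instead of external DRAM.
void conv3x3_tiled(const std::vector<float>& in, std::vector<float>& out,
                   std::size_t H, std::size_t W, const float k[3][3],
                   std::size_t T = 32) {
    for (std::size_t ti = 1; ti + 1 < H; ti += T)              // tile rows
        for (std::size_t tj = 1; tj + 1 < W; tj += T)          // tile cols
            for (std::size_t i = ti; i < std::min(ti + T, H - 1); ++i)
                for (std::size_t j = tj; j < std::min(tj + T, W - 1); ++j) {
                    float acc = 0.0f;
                    for (int di = -1; di <= 1; ++di)            // 3x3 window
                        for (int dj = -1; dj <= 1; ++dj)
                            acc += k[di + 1][dj + 1] * in[(i + di) * W + (j + dj)];
                    out[i * W + j] = acc;
                }
}

Choosing the tile size so that a (T+2)x(T+2) input block fits in on-chip memory is exactly the kind of parameter a roofline-guided analysis of computing throughput versus required memory bandwidth helps select.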
Article
To contribute to tsunami early warning systems, we investigated the currently achievable speed of tsunami inundation simulations on a parallel computer as well as the benefits of high-resolution and faster-than-real-time inundation predictions. We found that 5-m resolution inundation simulation can be 75 times faster than real time, requiring only 1.5 min to overview the inundation situation in Sendai City for the 2011 Tohoku tsunami. We developed a novel parallel tsunami model based on the well-known TUNAMI-N2 model and achieved 9.17 TFLOPS on 9,469 CPU cores. The present model can accurately hindcast the observed inundated regions of the 2011 Tohoku tsunami using tsunami source estimations of the tFISH and tFISH/RAPiD inversion algorithms, which can be instantly derived from real-time observation data. The present high-resolution predictions can provide clear images of imminent hazards/disasters and can provide guidance for appropriate evacuation actions.
Conference Paper
In this paper, we discuss user-space parameters and performance modeling of 3-D stencil computing on a stream-based FPGA accelerator. We use a heat conduction simulation as a benchmark and evaluate the performance of a design developed with MaxCompiler, a high-level synthesis tool for FPGAs, and MaxGenFD, a domain-specific framework on top of MaxCompiler for finite-difference equations. A performance comparison with a multi-threaded and SIMD-enabled CPU implementation shows that the FPGA design achieved about a six-fold speedup when the user chose the best architectural parameters. The energy consumption of the FPGA accelerator was measured, and it is shown that the best configuration in terms of performance also shows the lowest energy consumption.
Article
Purpose: To determine the relationship between abdominal breathing and tear meniscus volume in healthy women, we investigated the change in tear meniscus volume in two groups: normal breathing and abdominal breathing. Methods: We used a crossover experimental model and examined 20 healthy women aged 20-54 years (mean ± SD, 32.7 ± 11.1 years). The participants were randomly assigned to one of two groups. During the first visit, the normal breathing group was subjected to normal breathing for 3 min, whereas the abdominal breathing group was subjected to abdominal breathing (4-second inhalation and 6-second exhalation) for 3 min. During the second visit, the protocols were swapped between the two groups. We estimated the R wave to R wave (R-R) interval, tear meniscus volume, salivary amylase activity, pulse, and blood pressure before, immediately after, 15 min after, and 30 min after breathing initiation. Results: After abdominal breathing, compared to that before breathing, the tear meniscus volume increased significantly 15 min after breathing (P<0.01). Furthermore, systolic blood pressure showed a significant decrease immediately after abdominal breathing (P<0.05). No significant difference was found in the test parameters in the normal breathing group. Conclusion: Abdominal breathing for 3 minutes increases the tear meniscus volume in healthy women. Consequently, abdominal breathing may be considered in the treatment of dry eye disease.
Conference Paper
One of the most essential and challenging components in a climate system model is the atmospheric model. To solve the multi-physical atmospheric equations, developers have to face extremely complex stencil kernels. In this paper, we propose a hybrid CPU-FPGA algorithm that applies single and multiple FPGAs to compute the upwind stencil for the global shallow water equations. Through mixed-precision arithmetic, we manage to build a fully pipelined upwind stencil design on a single FPGA, which can perform 428 floating-point and 235 fixed-point operations per cycle. The CPU-FPGA algorithm using one Virtex-6 FPGA provides 100 times speedup over a 6-core CPU and 4 times speedup over a hybrid node with 12 CPU cores and a Fermi GPU card. The algorithm using four FPGAs provides 330 times speedup over a 6-core CPU; it is also 14 times faster and 9 times more power efficient than the hybrid CPU-GPU node.
Article
Tsunami propagation in the shallow water zone is often modeled by the shallow water equations (also called Saint-Venant equations) that are derived from the conservation of mass and conservation of momentum equations. Adding a friction slope to the conservation of momentum equations enables the system to simulate the propagation over the coastal area. This means the system is also able to estimate the inundation zone caused by the tsunami. Applying the Neumann boundary condition and the Hansen numerical filter brings more interesting complexities into the system. We solve the system using the two-step finite-difference MacCormack scheme, which is potentially parallelizable. In this paper, we discuss the parallel implementation of the MacCormack scheme for the shallow water equations on modern graphics processing unit (GPU) architecture using NVIDIA CUDA technology. On a single Fermi-generation NVIDIA GPU C2050, we achieved a 223x speedup with the result output at each time step over the original C code compiled with the -O3 optimization flag. If the experiment only outputs the final time step result to the host, our CUDA implementation achieved around an 818x speedup over its single-threaded CPU counterpart.
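For reference, the two-step MacCormack scheme for a 1D conservation law u_t + f(u)_x = 0 (written generically here; the cited code applies it to the 2D shallow water system with friction and boundary conditions) is a forward-difference predictor followed by a backward-difference corrector:

\[
u_i^{*} = u_i^{n} - \frac{\Delta t}{\Delta x}\big( f(u_{i+1}^{n}) - f(u_i^{n}) \big),
\qquad
u_i^{n+1} = \frac{1}{2}\left[\, u_i^{n} + u_i^{*} - \frac{\Delta t}{\Delta x}\big( f(u_i^{*}) - f(u_{i-1}^{*}) \big) \right].
\]

Each point's update touches only its immediate neighbours, which is what makes the scheme naturally data-parallel across the grid and thus well suited to the CUDA implementation described above.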
Article
Double precision floating point Sparse Matrix-Vector Multiplication (SMVM) is a critical computational kernel used in iterative solvers for systems of sparse linear equations. The poor data locality exhibited by sparse matrices along with the high memory bandwidth requirements of SMVM result in poor performance on general purpose processors. Field Programmable Gate Arrays (FPGAs) offer a possible alternative with their customizable and application-targeted memory subsystem and processing elements. In this work we investigate two separate implementations of the SMVM on an SRC-6 MAPStation workstation. The first implementation investigates the peak performance capability, while the second implementation balances the amount of instantiated logic with the available sustained bandwidth of the FPGA subsystem. Both implementations yield the same sustained performance, with the second producing a much more efficient solution. The metrics of processor and application balance are introduced to help provide some insight into the efficiencies of the FPGA and CPU based solutions, explicitly showing the tight coupling of the available bandwidth to peak floating point performance. Due to the FPGA's ability to balance the amount of implemented logic with the available memory bandwidth, it can provide a much more efficient solution. Finally, making use of the lessons learned implementing the SMVM, we present a fully implemented non-preconditioned Conjugate Gradient algorithm utilizing the second SMVM design.
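For readers unfamiliar with the kernel, a standard CSR (compressed sparse row) SpMV loop is sketched below; the irregular, data-dependent gathers from x are what make the kernel memory-bandwidth-bound on general-purpose processors. This is a generic sketch, not the SRC-6 design.

#include <cstddef>
#include <vector>

// y = A * x with A stored in CSR form:
//   val[k]     : nonzero values
//   col_idx[k] : column index of the k-th nonzero
//   row_ptr[i] : index of the first nonzero of row i (row_ptr has n+1 entries)
// The caller must size y to n = row_ptr.size() - 1.
void spmv_csr(const std::vector<double>& val,
              const std::vector<std::size_t>& col_idx,
              const std::vector<std::size_t>& row_ptr,
              const std::vector<double>& x,
              std::vector<double>& y) {
    const std::size_t n = row_ptr.size() - 1;
    for (std::size_t i = 0; i < n; ++i) {
        double acc = 0.0;
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += val[k] * x[col_idx[k]];   // gather: irregular access to x
        y[i] = acc;
    }
}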
Article
The acceleration of molecular dynamics (MD) simulations using high-performance reconfigurable computing (HPRC) has been much studied. Given the intense competition from multicore and GPUs, there is now a question whether MD on HPRC can be competitive. We concentrate here on the MD kernel computation: determining the short-range force between particle pairs. In one part of the study, we systematically explore the design space of the force pipeline with respect to arithmetic algorithm, arithmetic mode, precision, and various other optimizations. We examine simplifications and find that some have little effect on simulation quality. In the other part, we present the first FPGA study of the filtering of particle pairs with nearly zero mutual force, a standard optimization in MD codes. There are several innovations, including a novel partitioning of the particle space, and new methods for filtering and mapping work onto the pipelines. As a consequence, highly efficient filtering can be implemented with only a small fraction of the FPGA's resources. Overall, we find that, for an Altera Stratix-III EP3ES260, 8 force pipelines running at nearly 200 MHz can fit on the FPGA, and that they can perform at 95% efficiency. This results in an 80-fold per core speed-up for the short-range force, which is likely to make FPGAs highly competitive for MD.
Conference Paper
Inspired by the attractive Flops/dollar ratio and the incredible growth in the speed of modern graphics processing units (GPUs), we propose to use a cluster of GPUs for high performance scientific computing. As an example application, we have developed a parallel flow simulation using the lattice Boltzmann model (LBM) on a GPU cluster and have simulated the dispersion of airborne contaminants in the Times Square area of New York City. Using 30 GPU nodes, our simulation can compute a 480x400x80 LBM in 0.31 second/step, a speed which is 4.6 times faster than that of our CPU cluster implementation. Besides the LBM, we also discuss other potential applications of the GPU cluster, such as cellular automata, PDE solvers, and FEM.
Speeding up of the MOST Program
  • M Lavrentiev
  • A Romanenko
M. Lavrentiev Jr., A. Romanenko, Speeding up of the MOST Program, in: Geophysical Research Abstracts, Vol. 10, 2008.
Accelerating tsunami simulation with FPGA and GPU through automatic compilation
  • M Fujita
M. Fujita, Accelerating tsunami simulation with FPGA and GPU through automatic compilation, Proceedings of International Conference on Wireless Technologies for Humanitarian Relief (2011) 79.
Stream computation of shallow water equation solver for FPGA-based 1D tsunami simulation
  • K Sano
  • F Kono
  • N Nakasato
  • A Vazhenin
  • S Sedukhin
K. Sano, F. Kono, N. Nakasato, A. Vazhenin, S. Sedukhin, Stream computation of shallow water equation solver for FPGA-based 1D tsunami simulation, ACM SIGARCH Computer Architecture News.
Tsunami simulation accelerator exploiting fine and coarse-grain parallelism with FPGA
  • K Nagasu
  • K Sano
  • F Kono
  • N Nakasato
K. Nagasu, K. Sano, F. Kono, N. Nakasato, Tsunami simulation accelerator exploiting fine and coarse-grain parallelism with FPGA, in: Proceedings of the International Conference on Parallel Computational Fluid Dynamics (ParCFD2016), 2016, pp. 60-61.