Hiroyuki Takizawa

Tohoku University | Tohokudai · Cyberscience Center

PhD

About

190 Publications
13,448 Reads
1,144 Citations

Citations since 2017: 59 Research Items, 533 Citations
[Chart: citations per year, 2017-2023]

Publications (190)
Chapter
The purpose of this work is to reduce the burden on system administrators by virtually reproducing job scheduling and power management of their target systems and thereby helping them properly configure the system parameters and policies. Specifically, this paper focuses on a real computing system, named Supercomputer AOBA, as an example to discuss...
Article
As the number of cores on a processor increases, cache hierarchies contain more cache levels and a larger last level cache (LLC). Thus, the power and energy consumption of the cache hierarchy becomes non-negligible. Meanwhile, because the cache usage behaviors of individual applications can be different, it is possible to achieve higher energy effi...
Chapter
SX-Aurora TSUBASA (SX-AT) is a vector supercomputer equipped with Vector Engines (VEs). In addition to this new system architecture, SX-AT provides several execution modes to achieve high performance when executing real-world applications, which often consist of vector-friendly and vector-unfriendly parts. Vector Engine Offloading (VEO) is a programming framew...
Chapter
NEC SX-Aurora TSUBASA is the latest vector supercomputer, consisting of host processors called Vector Hosts (VHs) and vector processors called Vector Engines (VEs). The final goal of this work is to simultaneously use both VHs and VEs to increase the resource utilization and improve the system throughput by co-executing more workloads. However, per...
Article
Full-text available
This study aimed to evaluate the impact of climate change on flood damage and the effects of mitigation measures and combinations of multiple adaptation measures in reducing flood damage. The inundation depth was calculated using a two-dimensional unsteady flow model. The flood damage cost was estimated from the unit evaluation value set for each l...
Article
Full-text available
This paper presents an OpenCL-like offload programming framework for NEC SX-Aurora TSUBASA (SX-Aurora) and also discusses the benefit of employing metaprogramming to describe architecture-specific parts of the programs. Unlike traditional vector systems, one node of an SX-Aurora system consists of a host processor and some vector processors on PCI-...
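For context, "OpenCL-like offload programming" means a host program that compiles a kernel at run time, ships data to an attached device, launches the kernel, and reads the results back. The sketch below shows the generic OpenCL host-side pattern such a framework resembles; it is illustrative only, not the paper's actual API, and the kernel and buffer names are made up.

```c
/* Generic OpenCL host-side offload pattern (illustrative only; not the
 * SX-Aurora framework described above). Error handling is omitted. */
#include <CL/cl.h>
#include <stdio.h>

static const char *kernel_src =
    "__kernel void scale(__global float *a) {"
    "  int i = get_global_id(0);"
    "  a[i] = 2.0f * a[i];"
    "}";

int main(void) {
    enum { N = 1024 };
    float host[N];
    for (int i = 0; i < N; i++) host[i] = (float)i;

    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    /* Compile the device kernel at run time and transfer the data. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", &err);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(host), host, &err);

    /* Offload: launch N work-items on the device, then read the result back. */
    size_t global = N;
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);

    printf("host[2] = %f\n", host[2]);   /* expected: 4.0 */

    clReleaseMemObject(buf);
    clReleaseKernel(k);
    clReleaseProgram(prog);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}
```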
Article
Full-text available
Dedicated infrastructures are commonly used for urgent computations. However, using dedicated resources is not always affordable due to budget constraints. As a result, utilizing shared infrastructures becomes an alternative solution for urgent computations. Since the infrastructures are meant to serve many users, the urgent jobs may arrive when re...
Chapter
Full-text available
In this paper, we present results of the second phase of the project ExaFSA within the priority program SPP1648—Software for Exascale Computing. Our task was to establish a simulation environment consisting of specialized highly efficient and scalable solvers for the involved physical aspects with a particular focus on the computationally challengi...
Article
Full-text available
Mapping MPI processes to processor cores, called process mapping, is crucial to achieving scalable performance on multi-core processors. By analyzing the communication behavior among MPI processes, process mapping can improve the communication locality and thus reduce the overall communication cost. However, on modern non-uniform memory access...
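To make the kind of input such mapping relies on concrete, the sketch below counts the bytes each rank sends to each peer and gathers the resulting communication matrix on rank 0. It is a hand-rolled illustration (point-to-point sends are wrapped manually, and the ring exchange is a toy pattern), not the profiling method used in the paper.

```c
/* Minimal sketch: build a per-pair communication matrix, the usual input
 * to locality-aware MPI process mapping. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static long *bytes_to;   /* bytes this rank has sent to each peer */

/* Thin wrapper used instead of MPI_Send in the application code. */
static int counted_send(const void *buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm) {
    int size;
    MPI_Type_size(type, &size);
    bytes_to[dest] += (long)count * size;
    return MPI_Send(buf, count, type, dest, tag, comm);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bytes_to = calloc(nprocs, sizeof(long));

    /* Toy communication pattern: a ring exchange. */
    int right = (rank + 1) % nprocs, left = (rank + nprocs - 1) % nprocs;
    double payload[1024] = {0}, recvbuf[1024];
    MPI_Request req;
    MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req);
    counted_send(payload, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* Gather the full matrix on rank 0; rows = senders, columns = receivers. */
    long *matrix = NULL;
    if (rank == 0) matrix = malloc((size_t)nprocs * nprocs * sizeof(long));
    MPI_Gather(bytes_to, nprocs, MPI_LONG, matrix, nprocs, MPI_LONG,
               0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (int i = 0; i < nprocs; i++)
            for (int j = 0; j < nprocs; j++)
                if (matrix[i * nprocs + j])
                    printf("%d -> %d : %ld bytes\n", i, j, matrix[i * nprocs + j]);
        free(matrix);
    }
    free(bytes_to);
    MPI_Finalize();
    return 0;
}
```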
Chapter
Full-text available
As field programmable gate arrays (FPGAs) become a favorable choice in exploring new computing architectures for the post-Moore era, a flexible network architecture for scalable FPGA clusters becomes increasingly important in high performance computing (HPC). In this paper, we introduce a scalable platform of indirectly-connected FPGAs, where its E...
Article
Full-text available
The mapping of tasks to processor cores, called task mapping, is crucial to achieving scalable performance on multicore processors. On modern NUMA (non-uniform memory access) systems, the memory congestion problem could degrade the performance more severely than the data locality problem because heavy congestion on shared caches and memory controll...
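To show what applying a mapping decision looks like in practice, the sketch below pins each OpenMP thread to a core taken from an explicit mapping table using Linux sched_setaffinity. The core list is hypothetical, and this is only the generic mechanism a locality- or congestion-aware policy would act through, not the method proposed in the paper.

```c
/* Minimal sketch: pin each OpenMP thread to a core from a mapping table. */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical mapping produced by some policy: thread i -> core_map[i]. */
    int core_map[] = {0, 2, 4, 6, 1, 3, 5, 7};
    int nthreads = sizeof(core_map) / sizeof(core_map[0]);

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_map[tid], &set);
        /* pid 0 = the calling thread; binds this OpenMP thread to one core. */
        sched_setaffinity(0, sizeof(set), &set);
        printf("thread %d pinned to core %d\n", tid, core_map[tid]);
    }
    return 0;
}
```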
Article
Recently, many researchers have been investigating quantum annealing as a solver for real-world combinatorial optimization problems. However, due to the format of the problems that quantum annealing solves and the structure of the physical annealer, these problems often require additional setup prior to solving. We study how these setup steps affect per...
Article
This paper introduces the Xevolver code transformation framework to separate system-aware code optimizations from HPC application codes. System-aware code optimizations often make it difficult for programmers to maintain HPC application codes. On the other hand, system-aware code optimizations are mandatory to exploit the performance of target HPC...
Conference Paper
Full-text available
MPI process mapping is an important step to achieve scalable performance on non-uniform memory access (NUMA) systems. Conventional approaches have focused only on improving the locality of communication. However, related studies have shown that on modern NUMA systems, the memory congestion problem could cause more severe performance degradation tha...
Conference Paper
Field-Programmable Gate Arrays (FPGAs) offer a fairly non-invasive way to specialize custom architectures towards a specific application domain. Recent studies have successfully demonstrated that single-node FPGAs can rival both CPUs and GPUs in performance. Unfortunately, most existing studies limit themselves to using a single FPGA devi...
Article
Full-text available
Since the hardware resource of a single FPGA is limited, one idea to scale the performance of FPGA-based HPC applications is to expand the design space with multiple FPGAs. This paper presents a scalable architecture of a deeply pipelined stream computing platform, where available parallelism and inter-FPGA link characteristics are investigated to...
Conference Paper
Full-text available
Thread mapping is crucial to improve the performance and energy consumption of modern NUMA systems. In this work, we investigate the impacts of locality and memory congestion-aware thread mapping on the energy consumption. Our evaluation shows that considering both the locality and memory congestion can significantly reduce not only the performance...
Article
Full-text available
Modern supercomputers consist of multi-core processors, and these processors have recently employed vector instructions, so-called SIMD instructions, to improve performance. Numerical simulations need to be vectorized in order to achieve higher performance on these processors. Various legacy numerical simulation codes that have been utilized fo...
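As a small illustration of what vectorizing a loop means in practice, the fragment below uses the standard OpenMP simd directive on a unit-stride loop with no loop-carried dependence. It is a generic example, not code from the legacy applications discussed above.

```c
/* Minimal sketch: a loop written so the compiler can vectorize it; the
 * "omp simd" directive makes the intent explicit. */
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f * i; }

    /* Unit-stride accesses and no loop-carried dependence: vector friendly. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 0.5f * c[i];

    printf("%f\n", a[N - 1]);
    return 0;
}
```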
Chapter
It is hard to envision all possible use cases or environmental conditions that might happen to a VLSI system during its lifetime and could adversely affect its performance and/or dependability. The job of designing and testing a VLSI system includes the challenge of being prepared even for problems that are hard to foresee, within the restrictions of practical...
Chapter
Since different systems usually require different performance optimizations, an application code is likely “specialized” for a particular system configuration to fully extract the system performance. This is one major reason why migration of an existing application code, so-called legacy code migration, is so labor-intensive and error-prone especia...
Conference Paper
Full-text available
The OpenMP specification introduces thread team for hierarchical parallelism. A thread team is a team of synchronizable threads, and the number of threads in a thread team is called thread team size. OpenMP allows static adjustment of the thread team size, where the team size must be specified before executing an application and has to stay constan...
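For reference, the static adjustment mentioned above looks like the following in standard OpenMP: the team size is fixed when the parallel region starts, for example via the num_threads clause or the OMP_NUM_THREADS environment variable, and cannot change while the region is running. This is a generic snippet, not the dynamic scheme the paper studies.

```c
/* Minimal sketch: the team size is fixed at region entry via num_threads. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        printf("team size: %d\n", omp_get_num_threads());   /* prints 4 */
    }
    return 0;
}
```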
Poster
Full-text available
On modern NUMA systems, the memory congestion problem could degrade performance more than the memory access locality problem because a large number of processor cores in the systems can cause heavy congestion on memory controllers. In this work, we propose a thread mapping method that considers the spatio-temporal communication behavior of multi-th...
Conference Paper
Full-text available
Checkpointing with a constant checkpoint interval, a so-called constant checkpointing method, is commonly used in the HPC field and has been proven to be the optimal solution for failures whose inter-arrival times are distributed exponentially. On the other hand, previous works have shown that there is a high correlation between processor temperature a...
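For background, the classical first-order result for that optimal constant interval is Young's approximation, stated below; the paper may of course rely on a refined model rather than this formula.

```latex
% Young's approximation of the optimal constant checkpoint interval T_opt,
% for checkpoint cost C and mean time between failures M (exponential failures):
\[
  T_{\mathrm{opt}} \approx \sqrt{2\,C\,M}
\]
% Example: C = 5 min and M = 24 h = 1440 min give
% T_opt ~ sqrt(2 * 5 * 1440) = 120 min.
```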
Article
Sparse Matrix-Vector multiplication (SpMV) is a computational kernel widely used in many applications. Because of its importance, many different implementations have been proposed to accelerate this computational kernel. The performance characteristics of those SpMV implementations are quite different, and it is basically difficult to select the im...
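For concreteness, the usual baseline among such implementations is the scalar CSR kernel sketched below; alternative formats (ELL, blocked, GPU-oriented, vectorized) trade memory layout and padding against this reference, and their relative merit depends on the sparsity pattern. The example is generic, not one of the implementations evaluated in the paper.

```c
/* Minimal sketch: the standard CSR SpMV kernel. */
#include <stdio.h>

/* y = A * x for a CSR matrix with n rows. */
static void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}

int main(void) {
    /* 3x3 example: [[4,0,1],[0,3,0],[2,0,5]] */
    int row_ptr[] = {0, 2, 3, 5};
    int col_idx[] = {0, 2, 1, 0, 2};
    double val[]  = {4, 1, 3, 2, 5};
    double x[] = {1, 2, 3}, y[3];
    spmv_csr(3, row_ptr, col_idx, val, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);   /* 7 6 17 */
    return 0;
}
```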
Poster
Full-text available
Most machine learning models use hyperparameters empirically defined in advance of their training processes. Even a classic machine learning model, the so-called multilayer perceptron, has a lot of hyperparameters. In the case of using such a model for a classification problem, one difficulty is that the achieved classification accuracy could drastically...
Poster
Full-text available
Checkpointing with a fixed checkpoint interval, a so-called constant checkpointing method, is commonly used in the field of fault tolerance for high-performance computing (HPC) systems. It can achieve the minimum total execution time if the failure follows an exponential distribution. Related work shows that there is a high correlation between temperatu...
Conference Paper
Full-text available
MPI process placement is an important step to achieve scalable performance on modern non-uniform memory access (NUMA) systems. A recent study on NUMA architectures has shown that, on modern NUMA systems, the memory congestion problem could cause more severe performance degradation than the data locality problem because heavy congestion on memory co...
Article
Full-text available
Coordinated checkpointing is a widely-used checkpoint/restart (CPR) protocol for fault tolerance in large-scale HPC systems. However, this protocol involves massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that...
Conference Paper
Full-text available
Although incremental checkpointing is an effective way of reducing the checkpointing overhead, it has been discussed mostly for system-level checkpointing. Since the whole memory space of a running application is saved in a checkpoint file, system-level checkpointing will be less practical for future-generation extreme-scale computing systems, in w...
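One common mechanism behind incremental checkpointing is dirty-page tracking: all pages are write-protected at the start of an interval, the first write to a page triggers a handler that marks it dirty and restores write access, and only dirty pages are saved at the next checkpoint. The sketch below is a generic Linux illustration of that mechanism under simplifying assumptions (a single anonymous region, no error handling), not the application-level scheme proposed in the paper.

```c
/* Minimal sketch: dirty-page tracking with mprotect for incremental
 * checkpointing (Linux-specific illustration). */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;
static unsigned char *region;      /* write-protected memory region */
static size_t region_pages;
static unsigned char *dirty;       /* one flag per page */

/* SIGSEGV handler: mark the touched page dirty and make it writable again. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    unsigned char *addr = (unsigned char *)si->si_addr;
    if (addr >= region && addr < region + region_pages * page_size) {
        size_t idx = (size_t)(addr - region) / (size_t)page_size;
        dirty[idx] = 1;
        mprotect(region + idx * page_size, page_size, PROT_READ | PROT_WRITE);
    } else {
        _exit(1);                  /* an unrelated, real segmentation fault */
    }
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    region_pages = 16;
    region = mmap(NULL, region_pages * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    dirty = calloc(region_pages, 1);

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_write_fault;
    sigaction(SIGSEGV, &sa, NULL);

    /* Start of a checkpoint interval: write-protect everything. */
    mprotect(region, region_pages * page_size, PROT_READ);
    memset(dirty, 0, region_pages);

    /* Application work: only two pages are actually modified. */
    region[0] = 42;
    region[5 * page_size + 7] = 7;

    /* At the next checkpoint, only dirty pages would be saved. */
    for (size_t i = 0; i < region_pages; i++)
        if (dirty[i]) printf("page %zu is dirty\n", i);
    return 0;
}
```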
Article
Full-text available
In the field of high performance computing, massively-parallel many-core processors such as Intel Xeon Phi coprocessors are becoming popular because they can significantly accelerate various applications. In order to efficiently parallelize applications for such many-core processors, several high-level programming models have been proposed. The de...
Article
The emergence of various high-performance computing (HPC) systems compels users to write code that considers the characteristics of each HPC system. To describe the system-dependent information without drastic code modifications, directive sets such as the OpenMP and OpenACC directive sets have proven to be useful. However, the c...
Article
Full-text available
Achieving a high sustained simulation performance is the most important concern in the HPC community. To this end, many kinds of HPC system architectures have been proposed, and the diversity of the HPC systems grows rapidly. Under this circumstance, a vector-parallel supercomputer SX-ACE has been designed to achieve a high sustained performance of...
Conference Paper
Full-text available
Coordinated checkpointing is a widely-used checkpoint/restart (CPR) technique for fault-tolerance in large-scale HPC systems. However, this CPR technique will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on multi-level checkpointing that allows the u...
Chapter
Xevolver is a code transformation framework for users to define their own code transformation rules. In the framework, an abstract syntax tree (AST) of an application code is written in an XML format, and its transformation rules are expressed in the XSLT format, which is a standard XML format to describe XML data conversion; an AST and its transfo...
Chapter
This paper proposes a directive translation approach that translates a special placeholder to different directives, depending on the target HPC system. The special placeholder in an application code is used as a trigger for the directive translation. By employing a code translation framework, Xevolver, the special placeholder can be translated to v...
Chapter
In this work, we propose an Automatic Performance Tracking System for analyzing changes in execution performance and finding the source code modifications that cause degradation of performance portability. The proposed system supports evolving a large-scale numerical application while maintaining its performance portability...
Conference Paper
The Xevolver framework is a code transformation framework for supporting evolutional modifications of high performance computing codes. This paper introduces a set of software modules that facilitates administrative tasks about multiple code transformations on multiple source codes. We call this set of software modules the xevdriver, because it dri...
Conference Paper
The appearance of various high-performance computing (HPC) systems compels users to write code that considers the characteristics of each HPC system. To describe the system-dependent information without drastic code modifications, directive sets such as the OpenMP and OpenACC directive sets are useful. However, a code becomes co...
Conference Paper
Recently, massively-parallel many-core processors such as Intel Xeon Phi coprocessors have attracted researchers' attention because various applications are significantly accelerated with those processors. In the field of high-performance computing, OpenMP is a standard programming model commonly used to parallelize a kernel loop for many-core pro...
Conference Paper
Full-text available
Sparse Matrix-Vector multiplication (SpMV) is a computational kernel widely used in many applications. There are many different implementations using different processors and algorithms for SpMV. The performances of different SpMV implementations are quite different, and it is basically difficult to choose the implementation that has the best perfo...
Conference Paper
The last-level cache (LLC) of a modern chip-multiprocessor (CMP) keeps two kinds of data: shared data accessed by multiple cores and private data accessed by only one core. Although the former are likely to have a larger performance impact than the latter, the LLC manages both of those data in the same fashion. To realize a highly efficient executi...
Article
Full-text available
As the diversity of high-performance computing (HPC) systems increases, even legacy HPC applications often need to use accelerators for higher performance. To migrate large-scale legacy HPC applications to modern HPC systems equipped with accelerators, a promising way is to use OpenACC because its directive-based approach can prevent drastic code m...
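As a reminder of why the directive-based approach keeps code modifications small, the fragment below offloads a loop with OpenACC by adding pragmas only; the loop body itself stays unchanged. It is a generic example, not taken from the migrated applications discussed above.

```c
/* Minimal sketch: directive-based offloading with OpenACC. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    /* Data region keeps x and y on the accelerator for the whole loop. */
    #pragma acc data copyin(x) copy(y)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];
    }
    printf("%f\n", y[N - 1]);
    return 0;
}
```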
Article
Full-text available
High performance scientific codes are written to achieve high performance on a modern HPC (High Performance Computing) platform, and are less readable and less manageable because of complex hand optimization which is often platform-dependent. We are developing a toolset to mitigate that maintainability problem by user-defined easy-to-use code trans...
Article
Full-text available
Automatic performance tuning of a practical application can be time-consuming and sometimes infeasible because it often needs to evaluate the performances of a large number of code variants to find the best one. Hence, this paper proposes a lightweight rollback mechanism to evaluate each code variant at a low cost. In the proposed me...
Conference Paper
This paper reports a case study of using the Xevolver code transformation framework for data layout optimizations of high-performance computing (HPC) applications. Due to the variety of data structures used in individual applications, a code transformation rule for data layout optimizations is generally specific to a particular application. Since t...
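A typical instance of such an application-specific data layout transformation is the array-of-structures to structure-of-arrays rewrite sketched below. The struct and field names are hypothetical, and in the paper such rewrites are expressed as Xevolver transformation rules rather than hand edits; this fragment only shows what the before and after layouts look like.

```c
/* Array-of-structures (AoS) layout: the fields of one element are contiguous,
 * so a loop touching only x strides over unused y/z data. */
struct particle_aos { double x, y, z; };

/* Structure-of-arrays (SoA) layout: each field is its own contiguous array,
 * giving unit-stride, vector-friendly access per field. */
struct particles_soa { double *x, *y, *z; };

/* The same kernel under both layouts (n = number of elements). */
void shift_x_aos(struct particle_aos *p, int n, double dx) {
    for (int i = 0; i < n; i++)
        p[i].x += dx;
}

void shift_x_soa(struct particles_soa *p, int n, double dx) {
    for (int i = 0; i < n; i++)
        p->x[i] += dx;
}
```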
Conference Paper
As the diversity of HPC systems increases, even legacy HPC applications often need to use accelerators for higher performance. To migrate large-scale legacy HPC applications to modern HPC systems including accelerators, OpenACC is a promising approach because its directive-based approach can prevent drastic code modifications. This paper shows a ca...
Conference Paper
HPC scientific codes are less readable and less manageable because of complex hand optimization which is often platform-dependent. We are developing a toolset that hopefully mitigates that maintainability problem by user-defined easy-to-use code transformation: The code is written in a simpler form, and coding technique for high performance is intr...
Conference Paper
Empirical auto-tuning is getting attention in the field of high-performance computing (HPC) because it effectively reduces the programmers' burden of improving the execution performance of an application. In the tuning process, a programmer selects a high-performance kernel variant of the application by evaluating the performances of multiple kernel vari...
Chapter
To exploit the potential of many-core processors, a serial code is generally optimized for a particular compiler, called the target compiler, so that the compiler can understand the code structure for automatic parallelization. However, the performance of such a serial code is not always portable to a new system that uses a different compiler. To impr...