Satoshi Matsuoka

  • Doctor of Philosophy
  • Professor (Full) at Tokyo Institute of Technology

About

368
Publications
55,982
Reads
10,505
Citations
Current institution
Tokyo Institute of Technology
Current position
  • Professor (Full)

Publications (368)
Article
In this paper, we investigate contention management in lock-based thread safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization when protecting communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work distinguishes between lock acquisitions with respect to wor...
Article
Full-text available
Sunway TaihuLight, with its sustained performance of 93 PFLOPS, is now the No. 1 supercomputer on the latest Top500 list. It provides a high-level directive language called OpenACC that is compatible with the OpenACC 2.0 standard, with some customized extensions. GTC-P is a discovery-science-capable, real-world application code based on the particle...
Conference Paper
Full-text available
Bird sounds have been studied in recent years due to their significance in helping ornithologists and ecologists monitor birds' activities, which reflect climate change, biodiversity, and the local protection status of reserves. Given the increasingly large amount of bird sound data collected from experts and amateurs, how to handle and employ the...
Article
Full-text available
There are many large-scale graphs in the real world, such as Web graphs and social graphs, and interest in large-scale graph analysis has grown in recent years. Breadth-First Search (BFS) is one of the most fundamental graph algorithms, used as a component of many other graph algorithms. Our new method for distributed parallel BFS can compute BFS for one tril...
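The distributed algorithm itself is not reproduced in this excerpt; as a point of reference, the serial level-synchronous BFS that such methods parallelize can be sketched in a few lines (the function name and dict-based adjacency are illustrative choices, not from the paper):

```python
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous BFS: return the distance (level) of each vertex
    reachable from `source`. `adj` maps a vertex to its neighbor list."""
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in level:          # first visit fixes the level
                level[w] = level[v] + 1
                frontier.append(w)
    return level
```

Distributed variants partition the vertex set across nodes and exchange the frontier each level, which is where communication cost dominates.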
Article
Sparse matrix vector multiplication (SpMV) is the dominant kernel in scientific simulations. Many-core processors such as GPUs accelerate SpMV computations with high parallelism and memory bandwidth compared to CPUs; however, even for many-core processors the performance of SpMV is still strongly limited by memory bandwidth and lower locality of me...
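For readers unfamiliar with the kernel, a minimal serial SpMV over the common CSR (compressed sparse row) layout might look like the sketch below; the function name and list-based storage are illustrative, and real GPU kernels parallelize the outer loop across threads:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A in CSR format: `values` holds the
    nonzeros, `col_idx` their column indices, and `row_ptr[i]` the offset
    where row i begins in `values`."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

The irregular, data-dependent accesses to `x` are exactly the low-locality memory traffic the abstract refers to.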
Conference Paper
Full-text available
Solving word analogies became one of the most popular benchmarks for word embeddings on the assumption that linear relations between word pairs (such as king:man :: woman:queen) are indicative of the quality of the embedding. We question this assumption by showing that the information not detected by linear offset may still be recoverable by a more...
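The linear-offset rule that the abstract questions can be sketched as follows: answer a : b :: c : ? by finding the vocabulary word nearest, by cosine similarity, to b - a + c. The function name and toy embedding format are assumptions for illustration only:

```python
def analogy(emb, a, b, c):
    """Answer a : b :: c : ? by the linear-offset rule: pick the word
    whose vector is closest (cosine) to emb[b] - emb[a] + emb[c]."""
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(len(emb[a]))]

    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = sum(x * x for x in u) ** 0.5
        nv = sum(y * y for y in v) ** 0.5
        return dot / (nu * nv)

    # conventionally the three query words are excluded from candidates
    return max((w for w in emb if w not in (a, b, c)),
               key=lambda w: cos(emb[w], target))
```

The paper's point is that failure of this linear probe does not prove the relation is absent from the embedding.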
Conference Paper
We propose an out-of-core sorting acceleration technique, called xtr2sort, that deals with multi-level memory hierarchies of device memory (GPU), host memory (CPU), and semi-external non-volatile memory (Flash NVM) for leveraging the high computational performance and memory bandwidth of GPUs, while offloading bandwidth-oblivious operations onto se...
Conference Paper
We propose a method of accelerating Python code by just-in-time compilation, leveraging the type hints mechanism introduced in Python 3.5. In our approach, performance-critical kernels are expected to be written as if Python were a strictly typed language, yet without the need to extend Python syntax. This approach can be applied to any Python applica...
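A hypothetical kernel in that style, using standard annotations from the `typing` module, is shown below; the kernel itself is an illustrative example, not taken from the paper. The annotations give a JIT compiler enough information to specialize the loop, while the code remains valid ordinary Python:

```python
from typing import List

def saxpy(n: int, a: float, x: List[float], y: List[float]) -> List[float]:
    """Fully type-annotated kernel: every argument and the return type
    are declared, as if Python were strictly typed."""
    out = [0.0] * n
    for i in range(n):
        out[i] = a * x[i] + y[i]
    return out
```

Without a JIT the function runs as plain Python, so the annotations cost nothing when compilation is unavailable.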
Conference Paper
Full-text available
Snore sound (SnS) data has been demonstrated to carry very important information for the diagnosis and evaluation of highly prevalent sleep-related breathing disorders, such as Primary Snoring and Obstructive Sleep Apnea (OSA), a serious chronic sleep disorder affecting a large population. With the increasing number of SnS recordings collected from subjects, ho...
Conference Paper
Poor scalability on parallel architectures can be attributed to several factors, among which idle times, data movement, and runtime overhead are predominant. Conventional parallel loops and nested parallelism have proved successful for regular computational patterns. For more complex and irregular cases, however, these methods often perform poorly...
Article
This special issue features papers that extend the state of the art in various aspects of cluster computing.
Article
We extend an abstract agent-based swarming model based on the evolution of neural network controllers, to explore further the emergence of swarming. Our model is grounded in the ecological situation, in which agents can access some information from the environment about the resource location, but through a noisy channel. Swarming critically improve...
Conference Paper
Lossless interconnection networks are omnipresent in high performance computing systems, data centers and network-on-chip architectures. Such networks require efficient and deadlock-free routing functions to utilize the available hardware. Topology-aware routing functions become increasingly inapplicable, due to irregular topologies, which either a...
Conference Paper
The slowdown and inevitable end of exponential scaling in processor performance, the end of the so-called "Moore's Law", is predicted to occur in the 2025-2030 timeframe. Because CMOS semiconductor voltage is also approaching its limits, logic transistor power will become constant, and as a result system FLOPS will cease to improve...
Article
Splitter-based parallel sorting algorithms are known to be highly efficient for distributed sorting due to their low communication complexity. Although using GPU accelerators could help to reduce the computation cost in general, their effectiveness in distributed sorting algorithms remains unclear. We investigate the applicability of using GPU devices...
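The splitter-selection step common to splitter-based (sample) sorts can be sketched as below: each rank contributes regular samples, the samples are gathered and sorted, and p-1 evenly spaced splitters partition the key space so rank i receives keys between consecutive splitters. This is a serial stand-in for the distributed gather; names are illustrative:

```python
def choose_splitters(samples_per_rank, p):
    """Gather each rank's local samples, sort them globally, and pick
    p-1 evenly spaced splitters for a p-way partition."""
    gathered = sorted(s for rank in samples_per_rank for s in rank)
    step = len(gathered) // p
    return [gathered[(i + 1) * step] for i in range(p - 1)]
```

Because only samples (not the full data) are exchanged to pick splitters, the communication volume of this phase stays small, which is the source of the low communication complexity the abstract mentions.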
Conference Paper
Full-text available
This paper presents a case study of discovering and classifying verbs in large web-corpora. While many tasks in natural language processing require corpora containing billions of words, with such volumes of data co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a...
Conference Paper
We present a case study of Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in the area of natural language processing are typically solved in many steps which require transformation of the data to vastly different formats (in our case, raw text...
Conference Paper
GPUs are now one of the mainstream high-performance processors, embodying rich sets of computational as well as bandwidth resources. However, an individual GPU application typically does not exploit the resources on a GPU in its entirety, and thus concurrent execution of multiple applications may be advantageous in terms of total execution time and...
Article
Intel Initial Many-Core Instructions (IMCI) for Xeon Phi introduce hardware-implemented Gather and Scatter (G/S) instructions that load/store the contents of SIMD registers from/to non-contiguous memory locations. However, they can be one of the key performance bottlenecks on Xeon Phi. Modelling G/S can provide insights into Xeon Phi performance; however, the existin...
Conference Paper
With the increasing prominence of many-core architectures and decreasing per-core resources on large supercomputers, a number of applications developers are investigating the use of hybrid MPI+threads programming to utilize computational units while sharing memory. An MPI-only model that uses one MPI process per system core is capable of effectivel...
Article
Scientific simulations often require solving extremely large sparse linear equations, whose dominant kernel is sparse matrix vector multiplication. On modern many-core processors such as GPU or MIC, the operation has been known to pose a significant bottleneck and thus results in extremely poor efficiency because of limited processor-to-memory...
Article
OpenACC is gaining momentum as an implicit and portable interface in porting legacy CPU-based applications to heterogeneous, highly parallel computational environment involving many-core accelerators such as GPUs and Intel Xeon Phi. OpenACC provides a set of loop directives similar to OpenMP for the parallelization and also to manage data movement,...
Conference Paper
Hybrid MPI+Threads programming has emerged as an alternative model to the “MPI everywhere” model to better handle the increasing core density in cluster nodes. While the MPI standard allows multithreaded concurrent communication, such flexibility comes with the cost of maintaining thread safety within the MPI implementation, typically implemented u...
Article
The growing system size of high performance computers results in a steady decrease of the mean time between failures. Exchanging network components often requires whole-system downtime, which increases the cost of failures. In this work, we study a fail-in-place strategy where broken network elements remain untouched. We show that a fail-in-place s...
Article
We introduce a memory-efficient implementation of the NVM-based Hybrid BFS algorithm that merges redundant data structures into a single graph data structure, while offloading infrequently accessed graph data to NVMs based on a detailed analysis of access patterns, and demonstrate extremely fast BFS execution for large-scale unstructured graphs whos...
Conference Paper
Full-text available
From colonies of bacteria to swarms of bees and flocks of birds, countless organisms exhibit a swarming behavior based on local, individual decision making. In such species, information is used efficiently at the group level to reach optimal behaviors in tasks such as food foraging, which allows them to overcome noisy sensory inputs and local minima....
Article
Big data means big datacenters, comprised of hundreds or thousands of machines. With so many machines, failures are commonplace. Failure detection is crucial: undetected failures may lead to data loss and outages. Recent health monitoring approaches use anomaly detection to forecast failures: anomalous machines are considered to be at risk of futu...
Conference Paper
Full-text available
Modern supercomputer performance is principally limited by power. TSUBAME-KFC is a state-of-the-art prototype for our next-generation TSUBAME3.0 supercomputer and towards future exascale. In collaboration with Green Revolution Cooling and others, TSUBAME-KFC submerges compute nodes configured with extremely high processor/component density, into no...
Conference Paper
Full-text available
InfiniCortex: concurrent supercomputing across the globe utilising trans-continental InfiniBand and a Galaxy of Supercomputers. We propose to merge four separately important and interesting concepts, integrated together for the first time, to realise the InfiniCortex demonstration: i) high-bandwidth intercontinental connectivity b...
Article
Full-text available
The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component; we are rapidly moving to an era in which computing is cheap and massively parallel while data...
Conference Paper
Splitter-based parallel sorting algorithms are known to be highly efficient for distributed sorting due to their low communication complexity. Although using GPU accelerators could help to reduce the computation cost in general, their effectiveness in distributed sorting algorithms on large-scale heterogeneous GPU-based systems remains unclear. We...
Article
This paper addresses the issue of efficient sorting of strings on multi- and many-core processors. We propose CPU and GPU implementations of the most-significant-digit radix sort algorithm using different parallelization strategies at various stages of the execution to achieve good workload balance and optimal use of system resources. We evaluate th...
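A minimal serial MSD radix sort for strings, as a baseline for the parallelized CPU/GPU variants the paper evaluates, can be sketched as below (the function name and recursive structure are illustrative assumptions, not the paper's implementation):

```python
def msd_radix_sort(strings, pos=0):
    """Most-significant-digit radix sort: bucket strings by the character
    at `pos`, then recurse into each bucket on the next position."""
    if len(strings) <= 1:
        return strings
    buckets = {}
    done = []  # strings with no character at `pos` sort first
    for s in strings:
        if len(s) <= pos:
            done.append(s)
        else:
            buckets.setdefault(s[pos], []).append(s)
    out = done
    for ch in sorted(buckets):
        out.extend(msd_radix_sort(buckets[ch], pos + 1))
    return out
```

MSD order makes the buckets independent after each pass, which is what makes the algorithm attractive for parallelization: each bucket can be handed to a separate thread or GPU block.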
Article
We propose extending common performance measurement and visualization tools to identify network bottlenecks within MPI collectives. By creating additional trace points in the Peruse utility of Open MPI, we track low-level InfiniBand communication events and then visualize the communication profile in Boxfish for a more comprehensive analysis. The p...
Conference Paper
GPUs can accelerate edge scan performance of graph processing applications; however, the capacity of device memory on GPUs limits the size of graph to process, whereas efficient techniques to handle GPU memory overflows, including overflow detection and performance analysis in large-scale systems, are not well investigated. To address the problem,...
Article
The performance and energy efficiency of multicore systems are increasingly dominated by the costs of communication. As hardware parallelism grows, developers require more powerful tools to assess the data sharing and reuse properties of their algorithms. The reuse distance is an effective metric to study the temporal locality of programs and model...
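The reuse distance itself is simple to state: for each memory access, count the distinct addresses touched since the previous access to the same address (infinite on first touch). A naive O(n²) sketch over an address trace, for illustration only and not the paper's tooling:

```python
def reuse_distances(trace):
    """Return the reuse distance of each access in `trace`: the number of
    distinct addresses seen since the last access to the same address,
    or inf for a first touch (a cold miss)."""
    dists = []
    last_seen = {}
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses between the two accesses to `addr`
            dists.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            dists.append(float('inf'))
        last_seen[addr] = i
    return dists
```

An access hits in a fully associative LRU cache of capacity C exactly when its reuse distance is below C, which is why the metric models cache behavior so directly.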
Conference Paper
On March 11th, 2011, a high-magnitude earthquake and the consequent tsunami struck the east coast of Japan, resulting in a nuclear accident unprecedented in time and extent. After scram started at all power stations affected by the earthquake, diesel generators began operating as designed until tsunami waves reached the power plants located on the east...
Conference Paper
Future supercomputers built with more components will enable larger, higher-fidelity simulations, but at the cost of higher failure rates. Traditional approaches to mitigating failures, such as checkpoint/restart (C/R) to a parallel file system incur large overheads. On future, extreme-scale systems, it is unlikely that traditional C/R will recover...
Conference Paper
The semidefinite programming (SDP) problem is one of the central problems in mathematical optimization. The primal-dual interior-point method (PDIPM) is one of the most powerful algorithms for solving SDP problems, and many research groups have employed it for developing software packages. However, two well-known major bottlenecks, i.e., the gener...
Conference Paper
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-performance computing applications that run continuously for hours or days at a time. However, even with state-of-the-art checkpoint/restart techniques, high failure rates at large scale will limit application efficiency. To alleviate the problem, we consider usi...
Conference Paper
This paper describes a performance model for the read alignment problem, one of the most computationally intensive tasks in bioinformatics. We adapted a Burrows-Wheeler-transform-based index for use with GPUs to reduce the overall memory footprint. A mathematical model of computation and communication costs was developed to find optimal memory partitionin...
Article
Full-text available
The technical papers program for SC13 received 449 submissions, of which 90 were selected for the program, giving an acceptance rate of 20%. A rigorous peer review process, including author rebuttals and a 1.5-day face-to-face program committee meeting, ensured that selected papers were the very best in our field. One of the tasks at the face-to-face...
Conference Paper
This paper proposes a methodology to study the data reuse quality of task-parallel runtimes. We introduce a coarse-grained version of the reuse distance method called Kernel Reuse Distance (KRD). The metric is a low-overhead alternative designed to analyze data reuse at the socket level while minimizing perturbation to the parallel schedule. Using t...
Conference Paper
The problem size of the stencil computation on GPU is limited by the GPU memory capacity, which is typically smaller than that of host memory. This paper proposes and evaluates a multi-level optimization method for stencil computation to achieve both larger problem size than GPU memory and high performance. It is based on the temporal blocking meth...
Conference Paper
The Japanese-French FP3C (Framework and Programming for Post-Petascale Computing) Project ANR/JST-2010-JTIC-003 aims at studying the software technologies, languages and programming models on the road to exascale computing. The ability to efficiently exploit these future systems is challenging because of their ultra-large scale and highly hierarchic...
Conference Paper
Both energy efficiency and system reliability are significant concerns towards exascale high-performance computing. In such large HPC systems, applications are required to conduct massive I/O operations to local storage devices (e.g., NAND flash memory) for scalable checkpoint and restart. However, checkpoint/restart can use a large portion of ru...
Conference Paper
Extracting maximum performance of multi-core architectures is a difficult task primarily due to bandwidth limitations of the memory subsystem and its complex hierarchy. In this work, we study the implications of fork-join and data-driven execution models on this type of architecture at the level of task parallelism. For this purpose, we use a highl...
Conference Paper
OpenACC is a new accelerator programming interface that provides a set of OpenMP-like loop directives for the programming of accelerators in an implicit and portable way. It allows the programmer to express the offloading of data and computations to accelerators, such that the porting process for legacy CPU-based applications can be significantly s...
Conference Paper
As the failure frequency is increasing with the components count in modern and future supercomputers, resilience is becoming critical for extreme scale systems. The association of failure prediction with proactive checkpointing seeks to reduce the effect of failures in the execution time of parallel applications. Unfortunately, proactive checkpoint...
Conference Paper
Fast processing of extremely large-scale graphs is becoming increasingly important in various domains such as health care, social networks, intelligence, systems biology, and electric power grids. The GIM-V algorithm, based on the MapReduce programming model, is designed as a general graph processing method supporting petabyte-scale graph data. On the...
Article
Hybrid-core systems speedup applications by offloading certain compute operations that can run faster on hardware accelerators. However, such systems require significant programming and porting effort to gain a performance benefit from the accelerators. ...
Conference Paper
Semidefinite programming (SDP) is one of the most important problems among optimization problems at present. It is relevant to a wide range of fields such as combinatorial optimization, structural optimization, control theory, economics, quantum chemistry, sensor network location and data mining. The capability to solve extremely large-scale SDP pr...
Conference Paper
Full-text available
As the capability and component count of systems increase, the MTBF decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is suc...
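As background for the checkpoint-interval trade-off underlying this line of work, Young's classical first-order approximation gives the optimal time between checkpoints as t_opt = sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint. This is standard background, not the paper's multi-level scheme:

```python
def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint
    interval (seconds): t_opt = sqrt(2 * C * MTBF), where C is the
    per-checkpoint cost and MTBF the mean time between failures."""
    return (2.0 * checkpoint_cost * mtbf) ** 0.5
```

For example, a 60-second checkpoint on a system with a 3000-second MTBF suggests checkpointing roughly every 600 seconds; multi-level schemes improve on this by giving cheap, frequent local checkpoints and rare, expensive PFS checkpoints different intervals.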
Conference Paper
For scalable 3-D FFT computation using multiple GPUs, efficient all-to-all communication between GPUs is the most important factor in good performance. Implementations with point-to-point MPI library functions and CUDA memory copy APIs typically exhibit very large overheads especially for small message sizes in all-to-all communications between man...
Conference Paper
Full-text available
Future high performance computing systems will need to use novel techniques to allow scientific applications to progress despite frequent failures. Checkpoint-Restart is currently the most popular way to mitigate the impact of failures during long-running executions. Different techniques try to reduce the cost of Checkpoint-Restart, some of them su...
Conference Paper
Fast processing of extremely large-scale graphs, which consist of millions to trillions of vertices and 100 billion to 100 trillion edges, is becoming increasingly important in various domains such as health care, social networks, intelligence, systems biology, and electric power grids. The GIM-V algorithm, based on the MapReduce programming mod...
Conference Paper
Full-text available
With increasing interest among mainstream users to run HPC applications, Infrastructure-as-a-Service (IaaS) cloud computing platforms represent a viable alternative to the acquisition and maintenance of expensive hardware, often out of the financial capabilities of such users. Also, one of the critical needs of HPC applications is an efficient, sc...
Conference Paper
Accelerator-based computing systems invest significant fractions of hardware real estate to execute critical computation with vastly higher efficiency than general-purpose CPUs. Amdahl’s Law of the Multi-core Era suggests that such a heterogeneous approach to parallel computing is bound to deliver better scalability and power-efficiency than homog...
Conference Paper
Climate simulation models are used for a variety of scientific problems, and the accuracy of climate prognoses is mostly limited by the resolution of the models. Finer resolution results in more accurate prognoses but, at the same time, significantly increases computational complexity. This explains the increasing interest in High Performance Co...
Conference Paper
Massive and large scale content distribution over Internet is attracting a lot of research efforts as many challenges remain to be solved. Recent studies show that Internet video including video-to-TV and video calling is dominating the Internet traffic. As Internet becomes widely accessible to wired, mobile and wireless users, it is important to d...
Conference Paper
  • This conference focuses strongly on computational accelerator technologies, a specific new technology proving very useful in support of computationally intensive research.
  • I will, as promised, summarize the state of use of accelerators ...
Article
Full-text available
Non-blocking communications are widely used in parallel applications for hiding communication overheads through overlapped computation and communication. While most of the existing implementations provide a non-blocking version of point-to-point communications, there is no portable and efficient implementation of non-blocking collectives, partly be...
Conference Paper
Bioinformatics is a quickly emerging area of science with many important applications to human life. Sequence alignment in various forms is one of the main instruments used in bioinformatics. This work is motivated by the ever-increasing amount of sequence data that requires more and more computation power for its processing. This task calls for ne...
Chapter
Although the Inter-cloud environment enables new possibilities for data-intensive e-Sciences applications, some challenging issues such as dynamic change of computing resources and management complexity of large-scale network still remain. We propose a novel structured overlay network for Inter-Cloud environment, called “Multi-Ring Structured netwo...
Article
Fast Fourier transform is one of the most important computations used in many kinds of applications. Although there are several works on single-GPU FFT, we also need large-scale transforms that require multiple GPUs due to the capacity of device memory. We present a high-performance 3-D FFT using multiple GPU devices, both on a single node and...
Conference Paper
Full-text available
We address the problem of performing faster read alignment on GPU devices. The task of DNA sequence processing is extremely computationally intensive, as constant progress in sequencing technology leads to ever-increasing amounts of sequence data [6]. One of the possible solutions to this problem is to use the extreme parallel capacities of modern GPU d...
Conference Paper
Full-text available
The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting patterns in solidified metals would be indispensable. The phase-field simulation is the most powerful method known to simulate the micro-scale dendritic growth during soli...
Conference Paper
Full-text available
We present a computational framework for multi-scale simulations of real-life biofluidic problems. The framework allows simulating suspensions composed of hundreds of millions of bodies interacting with each other and with a surrounding fluid in complex geometries. We apply the methodology to the simulation of blood flow through the human coronary...
Conference Paper
Full-text available
Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead high-frequency multi-level checkpoint technique in w...
Conference Paper
Full-text available
This paper proposes a compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters. To enable such automatic translations, we design a small set of declarative constructs that allow the user to express stencil computations in a portable and...
Article
Graph500 is a new benchmark for supercomputers based on large-scale graph analysis, which is becoming an important form of analysis in many real-world applications. Graph algorithms run well on supercomputers with shared memory. For the Linpack-based supercomputer rankings, TOP500 reports that heterogeneous and distributed-memory supercomputers wi...
Conference Paper
Supercomputers of the past pursued performance at all costs, including power consumption, but nowadays supercomputers require even higher power-performance efficiency than normal computers. For the past 25 years the rate of supercomputer performance increase has constantly exceeded the so-called “Moore's Law”, but this has been partly achieved by i...
Conference Paper
In this paper, we propose a new Identity-Based Certificateless Proxy Signature scheme, for the grid environment, in order to enable attribute-based authorization, fine-grained delegation and enhanced delegation chain establishment and validation, all without relying on any kind of PKI Certificates or proxy certificates. We show that our scheme is c...
Conference Paper
Today, CUDA is the de facto standard programming framework to exploit the computational power of graphics processing units (GPUs) to accelerate various kinds of applications. For efficient use of a large GPU-accelerated system, one important mechanism is checkpoint-restart that can be used not only to improve fault tolerance but also to optimize no...
Article
Full-text available
Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have...
Conference Paper
Full-text available
MapReduce is a programming model that enables efficient massive data processing in large-scale computing environments such as supercomputers and clouds. Such large-scale computers employ GPUs to exploit their good peak performance and high memory bandwidth. Since the performance of each job depends on the running application's characteristics and underl...
Conference Paper
Although the Inter-Cloud environment enables new possibilities for several data-intensive e-Sciences applications, some challenging issues such as dynamic change of computing resources and management complexity of large-scale overlay network remain. The structured peer-to-peer overlay network approach is hereby adapted onto the Inter-Cloud environm...
