Rio Yokota
  • PhD (Mechanical Engineering)
  • Professor at Institute of Science Tokyo

About

  • 153 Publications
  • 31,682 Reads
  • 1,814 Citations
Current institution
Institute of Science Tokyo
Current position
  • Professor
Additional affiliations
September 2010 - August 2011
Boston University
Position
  • Postdoctoral Researcher
September 2011 - March 2015
King Abdullah University of Science and Technology
Position
  • Researcher

Publications

Publications (153)
Preprint
Full-text available
Graph Convolutional Networks (GCNs), particularly for large-scale graphs, are crucial across numerous domains. However, training distributed full-batch GCNs on large-scale graphs suffers from inefficient memory access patterns and high communication overhead. To address these challenges, we introduce SuperGCN, an efficient and scalable distributed...
Preprint
Full-text available
The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowC...
Preprint
Full-text available
Instruction tuning is crucial for enabling Large Language Models (LLMs) to solve real-world tasks. Prior work has shown the effectiveness of instruction-tuning data synthesized solely from LLMs, raising a fundamental question: Do we still need human-originated signals for instruction tuning? This work answers the question affirmatively: we build st...
Article
This paper conducts empirical experiments on continual pre-training using both artificially generated and real images, and proposes an approach to model-parameter initialization in continual pre-training. For the synthetic pre-training dataset, we use the VisualAtom-1k from formula-driven supervised learning, and for the real-image...
Preprint
Full-text available
The double descent phenomenon, which deviates from the traditional bias-variance trade-off theory, has attracted considerable research attention; however, the mechanism of its occurrence is not fully understood. Meanwhile, in the study of convolutional neural networks (CNNs) for image recognition, methods have been proposed to quantify the bias on sha...
Preprint
Full-text available
The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from...
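A rough NumPy sketch of the upcycling initialization described above (the helpers `upcycle_ffn` and `moe_forward`, the shapes, and the top-2 routing are illustrative assumptions, not the paper's recipe): every expert starts as a copy of the pre-trained dense FFN, and only the router is trained from scratch.

```python
import numpy as np

def upcycle_ffn(w_in, w_out, num_experts, rng):
    """Initialize an MoE layer from a dense FFN by replicating its weights.

    w_in: (d_model, d_ff) up-projection; w_out: (d_ff, d_model) down-projection.
    """
    experts = [(w_in.copy(), w_out.copy()) for _ in range(num_experts)]
    router = rng.normal(scale=0.02, size=(w_in.shape[0], num_experts))
    return experts, router

def moe_forward(x, experts, router, top_k=2):
    """Token-level top-k routing over the upcycled experts (dense loop for clarity)."""
    logits = x @ router                                   # (tokens, num_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]         # selected experts
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates - gates.max(-1, keepdims=True))  # softmax over selection
    gates /= gates.sum(-1, keepdims=True)
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for k in range(top_k):
            w_in, w_out = experts[top[t, k]]
            y[t] += gates[t, k] * (np.maximum(x[t] @ w_in, 0.0) @ w_out)
    return y

rng = np.random.default_rng(0)
experts, router = upcycle_ffn(rng.normal(size=(64, 256)),
                              rng.normal(size=(256, 64)), num_experts=8, rng=rng)
y = moe_forward(rng.normal(size=(4, 64)), experts, router)
```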
Preprint
Full-text available
Hierarchical low-rank approximation of dense matrices can reduce the complexity of their factorization from O(N^3) to O(N). However, the complex structure of such hierarchical matrices makes them difficult to parallelize. The block size and ranks can vary between the sub-blocks, which creates load imbalance. The dependency between the sub-blocks du...
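The compression step at the heart of such methods fits in a few lines of NumPy (a minimal illustration, not the paper's parallel algorithm): each admissible sub-block is replaced by a truncated SVD, and because the retained rank depends on the block's singular value decay, the work per block varies, which is exactly the load-imbalance problem noted above.

```python
import numpy as np

def compress_block(block, tol=1e-6):
    """Truncated SVD of one off-diagonal sub-block; the rank adapts to the block."""
    u, s, vt = np.linalg.svd(block, full_matrices=False)
    rank = max(1, int(np.sum(s > tol * s[0])))
    return u[:, :rank] * s[:rank], vt[:rank]       # factors U*S and V^T

# A kernel block between well-separated clusters is numerically low-rank.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(2.0, 3.0, 200)
block = 1.0 / np.abs(x[:, None] - y[None, :])      # 1/r interactions
us, vt = compress_block(block)
print(us.shape[1], "of", block.shape[0], "columns retained")
```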
Preprint
Full-text available
Why do we build local large language models (LLMs)? What should a local LLM learn from the target language? Which abilities can be transferred from other languages? Do language-specific scaling laws exist? To explore these research questions, we evaluated 35 Japanese, English, and multilingual LLMs on 19 evaluation benchmarks for Japanese and Engli...
Preprint
Full-text available
Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects, and given current hardware trends, communication is likely to become a major bottleneck. While gradient compression techniques have been explored for SGD and Adam, the Lion optimizer has the distinct advantage that its update vectors...
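The property alluded to here is that a Lion update is the sign of a blend of momentum and gradient, so each coordinate carries a single bit. The NumPy sketch below shows the standard Lion step and a lossless 1-bit packing of the sign vector; it illustrates the general idea only, not the paper's actual communication scheme.

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update; the applied update is sign(.), i.e. one bit per weight."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    param -= lr * (update + wd * param)
    m *= beta2
    m += (1 - beta2) * grad
    return update

# A sign vector compresses losslessly to 1 bit/parameter before communication.
u = np.where(np.random.randn(1024) > 0, 1.0, -1.0)
packed = np.packbits(u > 0)                        # 1024 bits -> 128 bytes
restored = np.where(np.unpackbits(packed).astype(bool), 1.0, -1.0)
assert np.array_equal(restored, u)
```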
Preprint
Full-text available
Graph Convolutional Networks (GCNs) are widely used in various domains. However, training distributed full-batch GCNs on large-scale graphs poses challenges due to inefficient memory access patterns and high communication overhead. This paper presents general and efficient aggregation operators designed for irregular memory access patterns. Additio...
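For reference, the aggregation being optimized is the sparse step of a standard full-batch GCN layer; a single-process NumPy/SciPy sketch of the textbook renormalized formulation (not the paper's distributed operators) looks like this:

```python
import numpy as np
import scipy.sparse as sp

def gcn_layer(adj, x, w):
    """One full-batch GCN layer: renormalized sparse aggregation, then projection.

    The sparse product `a_hat @ x` has the irregular memory access pattern
    that distributed GCN training must handle efficiently.
    """
    a = adj + sp.eye(adj.shape[0])                 # add self-loops
    deg = np.asarray(a.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    a_hat = d_inv_sqrt @ a @ d_inv_sqrt            # D^-1/2 (A+I) D^-1/2
    return np.maximum(a_hat @ (x @ w), 0.0)        # aggregate, project, ReLU
```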
Preprint
Full-text available
In large language model (LLM) training, several parallelization strategies, including Tensor Parallelism (TP), Pipeline Parallelism (PP), Data Parallelism (DP), as well as Sequence Parallelism (SP) and Context Parallelism (CP), are employed to distribute model parameters, activations, and optimizer states across devices. Identifying the optimal par...
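As a toy illustration of one of these axes, the snippet below emulates Megatron-style tensor parallelism for an MLP block on a single process (NumPy stand-ins; the final `sum` plays the role of the all-reduce across TP ranks):

```python
import numpy as np

def tensor_parallel_mlp(x, w1, w2, tp=4):
    """Column-split w1 and row-split w2 across `tp` ranks; sum partial outputs."""
    w1_shards = np.split(w1, tp, axis=1)           # column-parallel
    w2_shards = np.split(w2, tp, axis=0)           # row-parallel
    partials = [np.maximum(x @ a, 0.0) @ b         # each rank's local compute
                for a, b in zip(w1_shards, w2_shards)]
    return sum(partials)                           # stands in for all-reduce

x = np.random.randn(8, 512)
w1 = np.random.randn(512, 2048)
w2 = np.random.randn(2048, 512)
assert np.allclose(tensor_parallel_mlp(x, w1, w2),
                   np.maximum(x @ w1, 0.0) @ w2)
```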
Preprint
Full-text available
Large Language Models (LLMs) have attracted significant attention due to their human-like language understanding and generation capabilities, as well as their applicability across various domains. These models, characterized by their massive scale and extensive training data, continue to push the boundaries of what is possible in natural language p...
Preprint
Full-text available
The Privacy Preserving Federated Learning Document VQA (PFL-DocVQA) competition challenged the community to develop provably private and communication-efficient solutions in a federated setting for a real-life use case: invoice processing. The competition introduced a dataset of real invoice documents, along with associated questions and answers re...
Preprint
Full-text available
Local learning, which trains a network through layer-wise local targets and losses, has been studied as an alternative to backpropagation (BP) in neural computation. However, its algorithms often become more complex or require additional hyperparameters because of the locality, making it challenging to identify desirable settings in which the algor...
Preprint
Full-text available
Although research throughout the history of computer vision has explored the integration of images (visual) and point clouds (geometric), many advancements in image and 3D object recognition have tended to process these modalities separately. We aim to bridge this divide by integrating images and point clouds in a unified transformer model. This appr...
Preprint
In this work, we investigate the understudied effect of the training data used for image super-resolution (SR). Most commonly, novel SR methods are developed and benchmarked on common training datasets such as DIV2K and DF2K. However, we investigate and rethink the training data from the perspectives of diversity and quality, thereby addressing th...
Preprint
Pre-training and transfer learning are an important building block of current computer vision systems. While pre-training is usually performed on large real-world image datasets, in this paper we ask whether this is truly necessary. To this end, we search for a minimal, purely synthetic pre-training dataset that allows us to achieve performance sim...
Article
Hierarchical low-rank approximation of dense matrices can reduce the complexity of their factorization from O(N^3) to O(N). However, the complex structure of such hierarchical matrices makes them difficult to parallelize. The block size and ranks can vary between the sub-blocks, which creates load imbalance. The dependen...
Article
Full-text available
Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast...
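The fixed-point pattern described above can be sketched in NumPy, assuming simple symmetric per-tensor quantization (zero points, per-channel scales, and saturation handling omitted): quantize the inputs and weights to int8, multiply with a wide integer accumulator, and rescale once at the end.

```python
import numpy as np

def quantize(x, bits=8):
    """Symmetric linear quantization to signed fixed-point values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

a, sa = quantize(np.random.randn(64, 64).astype(np.float32))
b, sb = quantize(np.random.randn(64, 64).astype(np.float32))
# int8 x int8 products accumulated in int32, then one dequantization rescale.
c = (a.astype(np.int32) @ b.astype(np.int32)).astype(np.float32) * (sa * sb)
```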
Article
We propose the Natural Gradient Primal-Dual (NGPD) method for decentralized learning of parameters in Deep Neural Networks (DNNs). Conventional approaches, such as the primal-dual method, constrain the local parameters to be similar between connected nodes. However, since most of them follow a first-order optimization method and the loss functions...
Conference Paper
Full-text available
Pre-training is a strong strategy for enhancing visual models to efficiently train them with a limited number of labeled images. In semantic segmentation, creating annotation masks requires an intensive amount of labor and time, and therefore, a large-scale pre-training dataset with semantic labels is quite difficult to construct. Moreover, what ma...
Article
Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be design...
Preprint
Full-text available
NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, whose theoretical peak performance is more than 300 TFlop/s on the NVIDIA A100 GPU. NVIDIA provides the WMMA API for using Tensor Cores in custom kernel functions. The most common way to use Tensor Core is to supply the input matrices from shared memory, w...
Preprint
Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categor...
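To make the formula-driven idea concrete, here is a chaos-game sketch of fractal image generation via an iterated function system (NumPy; uniform map selection for simplicity, with coefficients from the classic Barnsley fern rather than any actual FDSL category):

```python
import numpy as np

def render_ifs(maps, n_points=200_000, size=256, seed=0):
    """Iterate randomly chosen 2x3 affine maps and rasterize the point set."""
    rng = np.random.default_rng(seed)
    pts = np.empty((n_points, 2))
    p = np.zeros(2)
    for i in range(n_points):
        a = maps[rng.integers(len(maps))]
        p = a[:, :2] @ p + a[:, 2]                 # x' = Ax + t
        pts[i] = p
    img, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=size)
    return (img > 0).astype(np.float32)            # binary, contour-like image

maps = np.array([[[0.85, 0.04, 0.00], [-0.04, 0.85, 1.60]],
                 [[0.20, -0.26, 0.00], [0.23, 0.22, 1.60]],
                 [[-0.15, 0.28, 0.00], [0.26, 0.24, 0.44]],
                 [[0.00, 0.00, 0.00], [0.00, 0.16, 0.00]]])
image = render_ifs(maps)                           # one synthetic training image
```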
Preprint
Full-text available
Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast...
Chapter
Quantum circuit simulation provides the foundation for the development of quantum algorithms and the verification of quantum supremacy. Among the various methods for quantum circuit simulation, tensor network contraction has been increasing in popularity due to its ability to simulate a larger number of qubits. During tensor contraction, the input...
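A tiny, self-contained example of contraction-based simulation (a textbook construction, not the chapter's specific method): gates and states are tensors, applying a gate is an einsum contraction, and the order in which a large network of such contractions is carried out determines the size of the intermediate tensors.

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
CNOT = np.eye(4).reshape(2, 2, 2, 2)               # [out_c, out_t, in_c, in_t]
CNOT[1, :, 1, :] = [[0.0, 1.0], [1.0, 0.0]]        # flip target if control is 1

psi = np.zeros((2, 2))                             # 2-qubit state as a tensor
psi[0, 0] = 1.0                                    # |00>
psi = np.einsum('ab,bc->ac', H, psi)               # H on qubit 0
psi = np.einsum('abcd,cd->ab', CNOT, psi)          # CNOT, qubit 0 as control
print(np.round(psi, 3))                            # (|00> + |11>) / sqrt(2)
```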
Preprint
Full-text available
Gradient preconditioning is a key technique to integrate the second-order information into gradients for improving and extending gradient-based learning algorithms. In deep learning, stochasticity, nonconvexity, and high dimensionality lead to a wide variety of gradient preconditioning methods, with implementation complexity and inconsistent perfor...
Preprint
Full-text available
Random projection can reduce the dimension of data while capturing its structure and is a fundamental tool for machine learning, signal processing, and information retrieval, which deal with a large amount of data today. RandNLA (Randomized Numerical Linear Algebra) leverages random projection to reduce the computational complexity of low-rank deco...
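The core range-finder idea behind much of RandNLA fits in a few lines (a sketch after Halko, Martinsson, and Tropp; not the specific decompositions studied in the paper): a Gaussian sketch captures the dominant column space, and the expensive factorization is done on the much smaller projected matrix.

```python
import numpy as np

def randomized_svd(a, rank, oversample=10, seed=0):
    """Low-rank SVD via random projection: roughly O(mn*rank) instead of O(mn^2)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((a.shape[1], rank + oversample))
    q, _ = np.linalg.qr(a @ g)                     # orthonormal basis for range(A)
    u_small, s, vt = np.linalg.svd(q.T @ a, full_matrices=False)
    return (q @ u_small)[:, :rank], s[:rank], vt[:rank]

a = np.random.randn(2000, 50) @ np.random.randn(50, 2000)   # rank-50 matrix
u, s, vt = randomized_svd(a, rank=50)
print(np.linalg.norm(a - (u * s) @ vt) / np.linalg.norm(a)) # near machine eps
```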
Chapter
The QR factorization, which is a fundamental operation in linear algebra, is used extensively in scientific simulations. Accelerating it and reducing its memory footprint are important research targets. QR factorization using block low-rank matrices (BLR-QR) has previously been proposed to address these issues. In this study, we consider its implementation...
Preprint
Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers, where ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours mattered more than textures when pre-training vision transformers. However, the lack of a systematic inv...
Article
This note gives highlights and key takeaways from Computing in Science & Engineering editors who attended the 2023 Society for Industrial and Applied Mathematics Conference on Computational Science and Engineering. The editors discuss themes such as the benefits of returning to in-person conferences, progress in scientific software development and...
Poster
Full-text available
Tall-Skinny QR (TSQR) is an efficient algorithm for calculating the QR decomposition of m × n matrices where m ≫ n, which is done by recursively performing QR decomposition on subdivided blocks of the tall and skinny matrix. Such operations are useful for low-rank approximation methods, which are replacing more and more dense linear algebra in both...
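A single-level NumPy sketch of the TSQR reduction (the full algorithm applies this recursively and across processes): factor the row blocks independently, QR the stacked small R factors, and fold the second-stage Q back into the block Qs.

```python
import numpy as np

def tsqr(a, block_rows=1024):
    """Tall-skinny QR: independent block QRs plus one small combining QR."""
    m, n = a.shape
    blocks = [a[i:i + block_rows] for i in range(0, m, block_rows)]
    qs, rs = zip(*(np.linalg.qr(b) for b in blocks))
    q2, r = np.linalg.qr(np.vstack(rs))            # combine stacked R factors
    q = np.vstack([qi @ q2[j * n:(j + 1) * n] for j, qi in enumerate(qs)])
    return q, r

a = np.random.randn(10_000, 32)
q, r = tsqr(a)
assert np.allclose(q @ r, a)
```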
Preprint
Full-text available
Among various supervised deep metric learning methods, proxy-based approaches have achieved high retrieval accuracies. Proxies, which are class-representative points in an embedding space, receive updates based on proxy-sample similarities in a similar manner to sample representations. In existing methods, a relatively small number of samples can pr...
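For context, a minimal NumPy version of a proxy-based loss in the Proxy-NCA style (the generic formulation, not the method proposed here): each sample is pulled toward its class proxy and pushed from all others through a softmax over cosine similarities.

```python
import numpy as np

def proxy_nca_loss(embeddings, labels, proxies):
    """Softmax cross-entropy over sample-proxy cosine similarities."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    sim = e @ p.T                                  # (batch, num_classes)
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```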
Preprint
Full-text available
Modern deep learning systems are fragile and do not generalize well under distribution shifts. While much promising work has been accomplished to address these concerns, a systematic study of the role of optimizers and their out-of-distribution generalization performance has not been undertaken. In this study, we examine the performance of popular...
Preprint
Full-text available
Factorization of large dense matrices is ubiquitous in engineering and data science applications, e.g. preconditioners for iterative boundary integral solvers, frontal matrices in sparse multifrontal solvers, and computing the determinant of covariance matrices. HSS and $\mathcal{H}^2$-matrices are hierarchical low-rank matrix formats that can red...
Preprint
We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR, and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of $\mathcal{O}(mn)$, while the tiled BLR-QR exhibits $\mathcal{O}(mn^{1.5...
Preprint
In the present work, we show that the performance of formula-driven supervised learning (FDSL) can match or even exceed that of ImageNet-21k without the use of real images, human-, and self-supervision during the pre-training of Vision Transformers (ViTs). For example, ViT-Base pre-trained on ImageNet-21k shows 81.8% top-1 accuracy when fine-tuned...
Article
Full-text available
Tensor Core is a mixed-precision matrix–matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand for dense matrix multiplication in machine learning. However, many applications in scientific computing such as preconditi...
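The splitting trick behind this line of work can be emulated on a CPU (a sketch of the general error-correction technique, not this article's exact scheme): round the fp32 inputs to fp16, keep the rounding residuals in fp16 as well, and accumulate the hi/lo cross products in fp32, mimicking Tensor Core accumulation.

```python
import numpy as np

def split_fp16(a):
    """Split fp32 into an fp16 high part plus an fp16 residual: a ~ hi + lo."""
    hi = a.astype(np.float16)
    lo = (a - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def corrected_matmul(a, b):
    f = np.float32
    a_hi, a_lo = split_fp16(a)
    b_hi, b_lo = split_fp16(b)
    return (a_hi.astype(f) @ b_hi.astype(f)        # plain fp16-input product
            + a_hi.astype(f) @ b_lo.astype(f)      # correction terms recover
            + a_lo.astype(f) @ b_hi.astype(f))     # most of the rounding error

a = np.random.randn(256, 256).astype(np.float32)
b = np.random.randn(256, 256).astype(np.float32)
ref = a.astype(np.float64) @ b.astype(np.float64)
plain = split_fp16(a)[0].astype(np.float32) @ split_fp16(b)[0].astype(np.float32)
print(np.abs(plain - ref).max(), np.abs(corrected_matmul(a, b) - ref).max())
```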
Article
We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR, and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of \(\mathcal{O}(mn)\), while the tiled BLR-QR exhibits \(\mathcal{O}(...
Preprint
Full-text available
Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand for dense matrix multiplication in machine learning. However, many applications in scientific computing such as preconditi...
Preprint
Full-text available
Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious. However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance. This limits IRL applications in the real world, where environment interactions can become highl...
Preprint
Full-text available
The coupled dynamics of quantum turbulence (QT) and normal-fluid turbulence (NFT) have been a central challenge in quantum hydrodynamics, since it is expected to cause the unsolved T2 state of QT. We numerically studied the coupled dynamics of the two turbulences in thermal counterflow. NFT is driven by external forces to control its turbulent inte...
Preprint
Full-text available
The use of iterative pose refinement is a critical processing step for 6D object pose estimation, and its performance depends greatly on one's choice of image representation. Image representations learned via deep convolutional neural networks (CNN) are currently the method of choice as they are able to robustly encode object keypoint locations. Ho...
Preprint
Full-text available
This paper describes a viewpoint-robust object-based change detection network (OBJ-CDNet). Mobile cameras such as drive recorders capture images from different viewpoints each time due to differences in camera trajectory and shutter timing. However, previous methods for pixel-wise change detection are vulnerable to the viewpoint differences because...
Article
Full-text available
Large-scale distributed training of deep neural networks results in models with worse generalization performance as a result of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or ad hoc modifications of batch normalization. We prop...
Preprint
Large-scale distributed training of deep neural networks results in models with worse generalization performance as a result of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or ad hoc modifications of batch normalization. We prop...
Conference Paper
Hierarchical Matrix (H-matrix) is an approximation technique that splits a target dense matrix into multiple submatrices and low-rank approximates a selected portion of them. The technique substantially reduces both the time and space complexity of dense matrix-vector multiplication, and hence has been applied to numerous practical p...
Article
Full-text available
The parallel scaling of classical molecular dynamics simulations is limited by the communication of the 3D fast Fourier transform of the particle-mesh electrostatics methods, which are used by most molecular simulation packages. The Fast Multipole Method (FMM) has much lower communication requirements and would, therefore, be a promising alternativ...
Article
The QR factorization of a matrix is a fundamental operation in linear algebra and is widely utilized in scientific simulations. Although QR factorization requires O(N²) memory storage and O(N³) computational cost, both can be reduced if the matrix can be approximated using low-rank structured matrices, such as hierarchical matr...
Preprint
Full-text available
Hierarchical Matrix (H-matrix) is an approximation technique that splits a target dense matrix into multiple submatrices and low-rank approximates a selected portion of them. The technique substantially reduces both the time and space complexity of dense matrix-vector multiplication, and hence has been applied to numerous practical p...
Conference Paper
Faster training of deep neural networks is desired to speed up the research and development cycle in deep learning. Distributed deep learning and second-order optimization methods are two different techniques for accelerating the training of deep neural networks. In previous work, researchers showed that an approximated second-order optimization met...
Article
We parallelize the LU factorization of a hierarchical low-rank matrix (\(\mathcal{H}\)-matrix) on a distributed-memory computer. This is much more difficult than the \(\mathcal{H}\)-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the...
Preprint
Full-text available
Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentatio...
Conference Paper
Classical learning theory states that when the number of parameters of the model is too large compared to the data, the model will overfit and the generalization performance deteriorates. However, it has been empirically shown that deep neural networks (DNN) can achieve high generalization capability by training with the extremely large amount of d...
Chapter
We present an overview of our project that aimed to achieve both high performance and high productivity. In order to achieve our aim, we designed and developed high-level domain-specific frameworks that can automate many of the tedious and complicated program optimizations for certain computation patterns. This article walks through some of our researc...
Preprint
Full-text available
Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or through ad hoc modifications of batch normalization. We propose an alternative appro...
Article
Full-text available
Algorithmic and architecture-oriented optimizations are essential for achieving performance worthy of anticipated energy-austere exascale systems. In this paper, we present an extreme scale FMM-accelerated boundary integral equation solver for wave scattering, which uses FMM as a matrix-vector multiplication inside the GMRES iterative method. Our F...
Preprint
Full-text available
Algorithmic and architecture-oriented optimizations are essential for achieving performance worthy of anticipated energy-austere exascale systems. In this paper, we present an extreme scale FMM-accelerated boundary integral equation solver for wave scattering, which uses FMM as a matrix-vector multiplication inside the GMRES iterative method. Our F...
Chapter
Full-text available
The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (\(\mathcal{H}\)-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix c...
Article
Full-text available
Among optimal hierarchical algorithms for the computational solution of elliptic problems, the Fast Multipole Method (FMM) stands out for its adaptability to emerging architectures, having high arithmetic intensity, tunable accuracy, and relaxed global synchronization requirements. We demonstrate that, beyond its traditional use as a solver in prob...
Book
This book constitutes the refereed post-conference proceedings of 13 workshops held at the 33rd International ISC High Performance 2018 Conference, in Frankfurt, Germany, in June 2018: HPC I/O in the Data Center, HPC-IODC 2018; Workshop on Performance and Scalability of Storage Systems, WOPSSS 2018; 13th Workshop on Virtualization in High-Performa...
Book
This book constitutes the refereed proceedings of the 4th Asian Supercomputing Conference, SCFA 2018, held in Singapore in March 2018. Supercomputing Frontiers will be rebranded as Supercomputing Frontiers Asia (SCFA), which serves as the technical programme for SCA18. The technical programme for SCA18 consists of four tracks: • Application, Algorithms...
Book
This book constitutes the refereed proceedings of the 33rd International Conference, ISC High Performance 2018, held in Frankfurt, Germany, in June 2018. The 20 revised full papers presented in this book were carefully reviewed and selected from 81 submissions. The papers cover the following topics: Resource Management and Energy Efficiency; Perfor...
Conference Paper
With the recent development of Convolutional Neural Networks (CNNs) and their increasing depth, it is important to reduce the amount of computation and data associated with convolution processing. Several methods for compressing convolutional filters using low-rank approximation have been studied. The common goal of these studies is to acc...
Chapter
Full-text available
There has been a large increase in the amount of work on hierarchical low-rank approximation methods, where the interest is shared by multiple communities that previously did not intersect. The objective of this article is two-fold: to provide a thorough review of the recent advancements in this field from both analytical and algebraic perspective...
Conference Paper
Full-text available
Manycore optimizations are essential for achieving performance worthy of anticipated exascale systems. Utilization of manycore chips is inevitable to attain the desired floating point performance of these energy-austere systems. In this work, we revisit ExaFMM, the open source Fast Multipole Method (FMM) library, in light of highly tuned shared-memo...
Conference Paper
Full-text available
Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical N-Body algorithms like Fast Multipole Method (FMM). In the present work, we propose three independent strategies to improve partitioning and reduce communication. First, we show that the conventional wisdom of using space-filling curve par...
Preprint
Full-text available
Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical $N$-Body algorithms like FMM. In the present work, we propose four independent strategies to improve partitioning and reduce communication. First of all, we show that the conventional wisdom of using space-filling curve partitioning may no...
