Shigang Li
Beijing University of Posts and Telecommunications (BUPT) · School of Computer Science

PhD
I'm leading the Parallel Computing and Intelligent Systems Lab at BUPT, and we're looking for Master's students, PhD students, and postdocs.

About

76 Publications
15,681 Reads
1,066 Citations
Additional affiliations
August 2018 - August 2022
ETH Zurich
Position
  • Postdoc
Description
  • Currently I'm mainly focusing on parallel and distributed training for large DL models. The core goal is to improve training throughput and thus speed up the training process while guaranteeing model quality. See Chimera (SC'21), Ok-Topk (PPoPP'22), eager-SGD (PPoPP'20), WAGMA-SGD (TPDS'21), etc. for details.
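The common baseline behind these works is synchronous data-parallel SGD with a gradient allreduce. A minimal illustrative sketch follows (simulated in NumPy on one process; not the Chimera/eager-SGD/Ok-Topk implementations, and the toy least-squares objective and all variable names are hypothetical):

```python
# Minimal sketch: synchronous data-parallel SGD. Each worker computes a gradient
# on its own mini-batch; the gradients are averaged (an allreduce in a real
# distributed setting) before an identical update is applied on every replica.
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, lr = 4, 8, 0.1
w = np.zeros(dim)                      # replicated model parameters
x_true = rng.normal(size=dim)          # hypothetical target for a toy least-squares loss

for step in range(100):
    grads = []
    for _ in range(num_workers):       # each worker: gradient on its local mini-batch
        xb = rng.normal(size=(32, dim))
        yb = xb @ x_true
        grads.append(xb.T @ (xb @ w - yb) / len(yb))
    g = np.mean(grads, axis=0)         # stands in for allreduce(sum) / num_workers
    w -= lr * g                        # same update on every replica keeps them in sync

print("final error:", np.linalg.norm(w - x_true))
```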
June 2014 - August 2018
Institute of Computing Technology, Chinese Academy of Sciences
Position
  • Assistant Professor
Education
September 2011 - September 2013
University of Illinois at Urbana-Champaign
Field of study
  • Computer Architecture
September 2009 - June 2014
University of Science and Technology Beijing
Field of study
  • Computer Architecture

Publications (76)
Article
Full-text available
Many-core systems with a rapidly increasing number of cores pose a significant challenge for parallel applications in using their complex memory hierarchies efficiently. Many such applications rely on collective communications in performance-critical phases, which become a bottleneck if they are not optimized. We address this issue by proposing cache-...
Article
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the ongoing AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we...
Article
Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which places a heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a dist...
Preprint
Full-text available
Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands for ve...
Preprint
Full-text available
Gradient preconditioning is a key technique to integrate the second-order information into gradients for improving and extending gradient-based learning algorithms. In deep learning, stochasticity, nonconvexity, and high dimensionality lead to a wide variety of gradient preconditioning methods, with implementation complexity and inconsistent perfor...
Preprint
Full-text available
Recent advances in deep learning build on growing model sizes and the necessary scaling of compute power. Training such large-scale models requires an intricate combination of data-, operator-, and pipeline parallelism in complex distributed systems. We show how to use OneFlow's Split, Broadcast, and Partial Sum (SBP) tensor formulations to enable n...
Preprint
Full-text available
Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters. Yet, pipeline bubbles during startup and tear-down reduce the utilization of accelerators. Although efficient pipeline schemes with micro-batching and bidirectional pipelines have been proposed to maximize utilization, a...
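For context, the idle ("bubble") fraction of a standard single-direction synchronous pipeline with p stages and m micro-batches is (p - 1)/(m + p - 1). A tiny illustrative calculation of that baseline quantity (not this paper's or Chimera's scheme) is:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    # Idle fraction of a GPipe-style single-direction synchronous pipeline.
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(stages=4, micro_batches=4))   # ~0.43: startup/tear-down dominates
print(bubble_fraction(stages=4, micro_batches=32))  # ~0.09: more micro-batches amortize it
```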
Preprint
The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. Howeve...
Preprint
Full-text available
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investig...
Article
Since machine learning platforms can provide one-stop artificial intelligence (AI) application solutions, they have been widely used in industrial and commercial internet fields in recent years. Building on heterogeneous accelerator cards, scientific discovery using large-scale computation and massive data is a significant trend in the futu...
Preprint
Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm...
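A minimal sketch of the general top-k sparsification idea with error feedback (the basic technique behind sparse allreduce schemes such as Ok-Topk, not its actual allreduce algorithm; the function name and parameters are illustrative):

```python
# Only the k largest-magnitude gradient entries are communicated; the remainder
# is accumulated in a local residual (error-feedback) buffer so no update is lost.
import numpy as np

def topk_sparsify(grad, k, residual):
    acc = grad + residual                        # add back what was left out before
    idx = np.argpartition(np.abs(acc), -k)[-k:]  # indices of the k largest magnitudes
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                       # only these values would be sent
    return sparse, acc - sparse                  # (communicated part, new residual)

rng = np.random.default_rng(0)
grad = rng.normal(size=1000)
residual = np.zeros_like(grad)
sparse, residual = topk_sparsify(grad, k=10, residual=residual)
print("nonzeros sent:", np.count_nonzero(sparse), "of", grad.size)
```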
Article
Full-text available
Global issues pertaining to climate change have necessitated the rapid deployment of new energy sources, such as photovoltaic (PV) generation. In smart grids, accurate forecasting is essential to ensure the reliability and economy of the power system. However, PV generation is severely affected by meteorological factors, which hinders accurate fore...
Article
The atmospheric general circulation model (AGCM) has been an important research tool in the study of climate change for decades. As the demand for high-resolution simulation is becoming urgent, the scalability and simulation efficiency are faced with great challenges, especially for latitude-longitude mesh-based models. In this paper, we propose...
Article
Full-text available
The power output of photovoltaic (PV) systems is chiefly affected by climate and weather conditions. Consequently, a PV farm requires accurate weather data, particularly solar irradiance, in order to predict its power output and thereby improve solar energy utilization. Nevertheless, publicly available datasets which contain both power and weather data...
Preprint
Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize optimization to patterns in popular networks, they implicitly constrain novel and diverse models that drive progress in research. We empower deep learning researchers by defining a...
Preprint
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. C...
Preprint
Full-text available
The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, which aggregate the data received from the hosts and send back the aggregated result. However, existing solu...
Article
Full-text available
Quantifying uncertainty in weather forecasts is critical, especially for predicting extreme weather events. This is typically accomplished with ensemble prediction systems, which consist of many perturbed numerical weather simulations, or trajectories, run in parallel. These systems are associated with a high computational cost and often involve st...
Article
To achieve better performance, many researchers usually put more computing resources into their applications. However, the scalability and performance reproducibility of parallel machine learning training applications, which mainly use stochastic optimization algorithms, are limited, and there are few studies focusing on why these indexes are limite...
Article
Full-text available
The second version of Chinese Academy of Sciences Earth System Model (CAS-ESM 2) is described with emphasis on the development process, strength and weakness, and climate sensitivities in simulations of the Coupled Model Intercomparison Project (CMIP6) DECK experiments. CAS-ESM 2 was built as a numerical model to simulate both the physical climate...
Article
Full-text available
Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve...
Preprint
Full-text available
Hybrid density-functional calculation is one of the most commonly adopted electronic structure theories in computational chemistry and materials science because of its balance between accuracy and computational cost. Recently, we have developed a novel scheme called NAO2GTO to achieve linear scaling (Order-N) calculations for hybrid density-func...
Article
Full-text available
In molecular dynamics simulations, an important step is the construction of a neighbor list for each particle, which involves a distance calculation for each particle pair in the simulation space. However, these distance calculations incur costly floating-point operations. In this paper, we propose a novel algorithm, called Fast Neighbor List...
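For reference, the conventional all-pairs construction that such work accelerates can be sketched as follows (an illustrative NumPy version under a cubic periodic box, not the paper's Fast Neighbor List algorithm; the function name and parameters are hypothetical):

```python
# Every particle pair requires a distance computation, so the cost grows
# quadratically with the number of particles.
import numpy as np

def build_neighbor_list(pos, box, cutoff):
    n = len(pos)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = pos[i] - pos[j]
            d -= box * np.round(d / box)            # minimum-image convention (periodic box)
            if np.dot(d, d) < cutoff * cutoff:      # the costly floating-point distance test
                neighbors[i].append(j)
    return neighbors

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 10.0, size=(200, 3))
nl = build_neighbor_list(pos, box=10.0, cutoff=2.5)
print("pairs within cutoff:", sum(map(len, nl)))
```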
Preprint
Transformer neural networks have become widely used for language modeling and sequence learning tasks, and are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementati...
Preprint
Full-text available
Quantifying uncertainty in weather forecasts typically employs ensemble prediction systems, which consist of many perturbed trajectories run in parallel. These systems are associated with a high computational cost and often include statistical post-processing steps to inexpensively improve their raw prediction qualities. We propose a mixed predicti...
Preprint
Full-text available
Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve...
Article
Full-text available
The first main paragraph of the paper contains errors. The correct wording is given below.
Conference Paper
Full-text available
With more attention paid to nuclear energy, the formation mechanism of solute cluster precipitation within complex alloys has become an intriguing research topic in the embrittlement of nuclear reactor pressure vessel (RPV) steels. Such phenomena can be simulated with atomic kinetic Monte Carlo (AKMC) software, which evaluates the interactions of sol...
Preprint
Full-text available
Modern weather forecast models perform uncertainty quantification using ensemble prediction systems, which collect nonparametric statistics based on multiple perturbed simulations. To provide accurate estimation, dozens of such computationally intensive simulations must be run. We show that deep neural networks can be used on a small set of numeric...
Preprint
To gain better performance, many researchers put more computing resources into an application. However, in the AI area, there is still a lack of successful large-scale machine learning training applications: the scalability and performance reproducibility of parallel machine learning training algorithms are limited, and there are few pieces of re...
Chapter
In the big data era, users can obtain massive amounts of information from the Internet, but the value density is very low. To help users find the information they need more quickly, this paper presents a mechanism for diverse demand estimation and ranking based on user behaviors. First, a definition of a classification system for users' query intent is...
Preprint
Full-text available
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every tra...
Article
Full-text available
Hybrid density-functional calculation is one of the most commonly adopted electronic structure theories in computational chemistry and materials science because of its balance between accuracy and computational cost. Recently, we have developed a novel scheme called NAO2GTO to achieve linear scaling (Order-N) calculations for hybrid density-functio...
Chapter
Asynchronous FTRL-proximal and the L2 norm done at the server are two widely used tricks in Parameter Server, which is an implementation of delayed SGD. What they have in common is leaving part of the update computation on the server, which reduces the network burden by making the transmitted data sparse. However, the convergence of these tricks is not well proved. In this paper, bas...
Preprint
AutoML is a key technology for machine learning problems. Current state-of-the-art hyperparameter optimization methods are based on traditional black-box optimization methods like SMBO (SMAC, TPE). The objective function in black-box optimization is non-smooth, time-consuming to evaluate, or noisy. In recent years, many researchers have offered...
Conference Paper
The dynamical core is one of the most time-consuming parts of the global atmospheric general circulation model, which is widely used for the numerical simulation of the dynamic evolution of the global atmosphere. Due to its complicated calculation procedures and the non-uniformity of the latitude-longitude mesh, the parallelization suffers from high co...
Conference Paper
Full-text available
The limitation of simulation scales leads to a gap between simulation results and physical phenomena. This paper reports our efforts on increasing the scalability of metal material microscopic damage simulation on the Sunway TaihuLight supercomputer. We use a multiscale modeling approach that couples Molecular Dynamics (MD) with Kinetic Monte Carlo...
Article
With the development of big data technology, Gradient Boosting Decision Tree (GBDT) has become one of the most important machine learning algorithms for its accurate output. However, the training process of GBDT needs a lot of computational resources and time. In order to accelerate the training process of GBDT, the asynchronous parallel sampling...
Article
Asynchronous FTRL and the $L_2$ norm done at the server are two widely used tricks to improve training efficiency, but their convergence is not well proved. In this paper, we propose the asynchronous COMID algorithm and prove its convergence. Then, we establish the equivalence between asynchronous COMID and the above two tricks. Thus, the convergences of abov...
Article
Full-text available
Stochastic gradient descent (SGD) is a popular stochastic optimization method in machine learning. Traditional parallel SGD algorithms, e.g., SimuParallel SGD, often require all nodes to have the same performance or to consume equal quantities of data. However, these requirements are difficult to satisfy when the parallel SGD algorithms run in a he...
Conference Paper
Full-text available
In the many-core era, the performance of MPI collectives is more dependent on the intra-node communication component. However, the communication algorithms generally inherit from the inter-node version and ignore the cache complexity. We propose cache-oblivious algorithms for MPI all-to-all operations, in which data blocks are copied into the recei...
Article
In the many-core era, the performance of MPI collectives is more dependent on the intra-node communication component. However, the communication algorithms generally inherit from the inter-node version and ignore the cache complexity. We propose cache-oblivious algorithms for MPI all-to-all operations, in which data blocks are copied into the recei...
Article
Full-text available
Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although the previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth remain the critical performance bottlenecks. We present our novel solutions to these problems, for bot...
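A minimal sketch of the baseline row-wise CSR SpMV kernel that such optimization work starts from (illustrative only; the toy matrix is hypothetical, and the varying number of nonzeros per row is the source of the load imbalance mentioned above):

```python
# Compressed Sparse Row (CSR) SpMV: y = A @ x, traversing each row's nonzeros.
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):   # nonzeros of this row
            y[row] += values[k] * x[col_idx[k]]
    return y

# CSR encoding of the toy matrix [[4, 0, 1], [0, 0, 2], [3, 5, 0]]
values  = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 2, 0, 1])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.array([1.0, 2.0, 3.0])))  # -> [ 7.  6. 13.]
```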
Article
The volume, variety, and velocity properties of big data and the valuable information it contains have motivated the investigation of many new parallel data processing systems in addition to the approaches using traditional database management systems (DBMSs). MapReduce pioneered this paradigm change and rapidly became the primary big data processi...
Article
The parallel Kinetic Monte Carlo (KMC) algorithm based on domain decomposition has been widely used in large-scale physical simulations. However, the communication overhead of the parallel KMC algorithm is critical, and severely degrades the overall performance and scalability. In this paper, we present a hybrid optimization strategy to reduce the...
Article
To optimize short-range force computations in Molecular Dynamics (MD) simulations, multi-threading and SIMD optimizations are presented in this paper. With respect to multi-threading optimization, a Partition-and-Separate-Calculation (PSC) method is designed to avoid write conflicts caused by using Newton’s third law. Serial bottlenecks are elimina...
Conference Paper
The Kinetic Monte Carlo (KMC) algorithm has been widely applied to the simulation of radiation damage, grain growth, and chemical reactions. To simulate at large temporal and spatial scales, domain decomposition is commonly used to parallelize the KMC algorithm. However, through experimental analysis, we find that the communication overhead is the main bo...
Conference Paper
The recently released MPI-3.0 standard introduced a process-level shared-memory interface which enables processes within the same node to have direct load/store access to each others' memory. Such an interface allows applications to declare data structures that are shared by multiple MPI processes on the node. In this paper, we study the capabiliti...
Article
To have good performance and scalability, parallel applications should be sophisticatedly optimized to exploit intra-node parallelism and reduce inter-node communication on multicore clusters. This paper investigates the automatic tuning of the sparse matrix-vector (SpMV) multiplication kernel implemented in a partitioned global address space langu...
Article
Due to its lower power consumption and cost, heterogeneous multi-core makes up a major computing resource in current supercomputers. However, because heterogeneous multi-core processors feature high bandwidth and loose memory consistency, programmers must pay attention to hardware details to get ideal memory and computation performance. This paper introduce...
Article
As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory and we demonstrate severa...
Conference Paper
Full-text available
As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimizations of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we develop several...
Conference Paper
Work stealing is a popular policy for dynamic load balancing of irregular applications. However, communication overhead incurred by work stealing may make it less efficient, especially on distributed memory systems. In this work we propose an asynchronous work stealing (AsynchWS) strategy which exploits opportunities to overlap communication with l...
Article
Full-text available
In this paper, we present an extension to the CCA component architecture. The extension defines a minimal set of non-functional attributes of parallel components. We have implemented the common CCA components for the management of these attributes. Parallel components can optionally provide some non-functional interfaces, and they provide related...
Conference Paper
The ability to express multiple levels of parallelism is one of the significant features of the OpenMP parallel programming model. However, pipeline parallelism is not well supported in OpenMP. This paper proposes extensions to OpenMP directives aimed at expressing pipeline parallelism effectively. The extended directives are divided into two grou...
Conference Paper
In this paper, we present a multi-paradigm and multi-grain parallel component model. It is an extension to the Common Component Architecture (CCA). Components have two kinds of paradigms: running paradigms and programming paradigms. Running paradigms can be serial execution, message-passing parallel, or shared-memory parallel. Programming paradigm...
Conference Paper
The OpenMP task is the most significant feature in the new specification, providing a way to handle unstructured parallelism. This paper presents a runtime library for the task model on the Cell heterogeneous multicore, which attempts to maximally utilize its architectural advantages. Moreover, we propose two optimizations: an original scheduling strat...
