Mingzhe Li's research while affiliated with Meta and other places

Publications (15)

Chapter
Full-text available
Deep Learning has recently been gaining popularity. From the micro-architecture field to upper-layer end applications, a great deal of research has been proposed in the literature to advance the knowledge of Deep Learning. Deep Learning benchmarking is one such hot spot in the community. There are a number of Deep Learning benchmarks available...
Conference Paper
Intel Knights Landing (KNL) and IBM POWER architectures are becoming widely deployed on modern supercomputing systems due to their powerful components. The MPI Remote Memory Access (RMA) model, which provides one-sided communication semantics, has been seen as an attractive approach for developing High-Performance Data Analytics (HPDA) applications such as...
Conference Paper
Full-text available
Intel Many Integrated Core (MIC) architectures have been playing a key role in modern supercomputing systems due to their high performance and low power consumption. This makes them an attractive choice for accelerating HPC applications. MPI-3 RMA is an important part of the MPI-3 standard. It provides one-sided semantics that reduce...
Article
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-GPU inter-node communication had to move data from GPU memory to host memory be...
Article
Virtualization plays a central role in HPC Cloud due to its ease of management and the low cost of computation and communication. Recently, Single Root I/O Virtualization (SR-IOV) technology has been introduced for high-performance interconnects such as InfiniBand and can attain near-native performance for inter-node communication. However, the SR-IOV...
Article
The MPI two-sided programming model has been widely used for scientific applications. However, the benefits of MPI one-sided communication are still not well exploited. Recently, MPI-3 Remote Memory Access (RMA) was introduced with several advanced features which provide better performance, programmability, and flexibility over MPI-2 RMA. However,...
Article
Full-text available
The MPI programming model has been widely used for scientific applications. The emergence of Partitioned Global Address Space (PGAS) programming models presents an alternative approach to improve programmability. With the global data view and lightweight communication operations, PGAS has the potential to increase the performance of scientific appl...
Conference Paper
High Performance Computing (HPC) systems are becoming increasingly complex and are also associated with very high operational costs. The cloud computing paradigm, coupled with modern Virtual Machine (VM) technology offers attractive techniques to easily manage large scale systems, while significantly bringing down the cost of computation, memory an...

Citations

... For example, MLPerf [50] is a comprehensive benchmark for measuring ML inference performance across a spectrum of use cases. Architecture-oriented DNN benchmarks [39], [32], [38] focus on analyzing the architectural features of DNNs on computing systems of different sizes. MDLBench [39], Embench [32] and AIoTBench [23] are representative benchmarks that characterize the features of different AI models on edge or mobile devices, while NNBench-X [57] and GNNMark [46] target acceleration hardware design for different DNNs. ...
... Therefore, there is no need to pin pages. As noted in the paper, modern Mellanox InfiniBand NICs support many of these technologies, such as on-demand paging (ODP) [26], [27]; recent advances in current InfiniBand verbs allow registering the entire address space for ODP [28]. Next, we compare this scheme against ours. ...
... Prior research [18,24] has shown that RMA-based communication can provide a better overlap of computation and communication by avoiding CPU intervention in communication, and hence could provide better performance than point-to-point programs. Recently, the pioneering works of Li et al. [32,33] show that RMA can be used in graph processing systems to improve their performance significantly over P2P implementations. In this article, we leverage their experience to design a novel translation scheme from Green-Marl to MPI RMA. ...
... Furthermore, performing two sends instead of memory copies, although more efficient for large data sizes, performs poorly for smaller ones [13]. Thus, the available options are to use memory copies, which are known to hinder the algorithm's performance; to employ derived datatypes, whose communication presents several overheads in MPI implementations [36]; or a third option not yet discussed, which is pipelining multiple sends. ...
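The pipelining option mentioned above amounts to splitting one large transfer into fixed-size chunks that are posted back-to-back, so transmission of one chunk overlaps preparation of the next. The chunk arithmetic can be sketched independently of any MPI calls (the function name and chunk size below are illustrative; the memcpy stands in for posting one send per chunk):

```c
#include <stddef.h>
#include <string.h>

/* Split a logical message of `total` bytes into pipeline chunks of at
 * most `chunk` bytes; copy each chunk from `src` to `dst` in turn, as a
 * stand-in for posting one send per chunk. Returns the chunk count. */
static size_t pipeline_copy(char *dst, const char *src,
                            size_t total, size_t chunk)
{
    size_t nchunks = 0;
    for (size_t off = 0; off < total; off += chunk) {
        size_t len = (total - off < chunk) ? (total - off) : chunk;
        memcpy(dst + off, src + off, len);  /* would be a nonblocking send */
        ++nchunks;
    }
    return nchunks;
}
```

In a real pipelined implementation each iteration would issue a nonblocking send and the sender would wait on all of them at the end, trading per-message overhead on small chunks against overlap on large ones.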
... As a result, a lot of code (and some processing power) is devoted to packing and unpacking messages. The User-mode Memory Registration (UMR) feature of Mellanox InfiniBand can support MPI derived datatype communication, which may reduce some of this overhead [16], but it requires the programmer to duplicate datatype definitions in order to inform the MPI library about the datatypes used in the program. In parallel C++ this information is available to the compiler, which can generate the UMR instructions. ...
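The packing and unpacking referred to above means gathering a non-contiguous (e.g. strided) region into a contiguous send buffer and scattering it back on the receiving side. A minimal sketch in plain C of that pattern, for a strided layout (a generic illustration only, not Mellanox UMR or any MPI library's internals):

```c
#include <stddef.h>
#include <string.h>

/* Pack `count` blocks of `blocklen` bytes, spaced `stride` bytes apart
 * in `src`, into the contiguous buffer `packed`. */
static void pack_strided(char *packed, const char *src,
                         size_t count, size_t blocklen, size_t stride)
{
    for (size_t i = 0; i < count; ++i)
        memcpy(packed + i * blocklen, src + i * stride, blocklen);
}

/* Inverse operation on the receiving side: scatter the contiguous
 * buffer back into the strided layout of `dst`. */
static void unpack_strided(char *dst, const char *packed,
                           size_t count, size_t blocklen, size_t stride)
{
    for (size_t i = 0; i < count; ++i)
        memcpy(dst + i * stride, packed + i * blocklen, blocklen);
}
```

Hardware support such as UMR aims to let the NIC perform this gather/scatter itself, so neither the extra copies nor the intermediate buffer are needed.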
... Hou [29] proposed a framework called AAlign to automatically vectorize pairwise sequence alignment algorithms for bioinformatics applications. Not only for bioinformatics applications, Li et al. [30] took advantage of the new features from MPI-3 Remote Memory Access model to re-design a scalable Graph500 application kernel. The authors in [31] proposed a hybrid MPI + OpenSHMEM approach to optimize the original MiniMD [32] application, which is a simple proxy for the force computations in a typical molecular dynamics applications. ...
... To efficiently perform parallel computing tasks on multiple GPU nodes, the concept of CUDA-aware MPI [28] has been introduced and widely adopted in many production MPI libraries. It is worth noting that an intelligent CUDA-aware MPI implementation, which leverages the cutting-edge hardware technologies, can transparently provide significant performance improvement to the end applications [23,24]. As a result, the use of MPI for parallel applications significantly increases productivity and improves performance. ...
... There has been some work on combining shared-memory parallelism with a PGAS environment, much like MPI+X with the message-passing model. The MiniMD application was ported to PGAS in [12] using UPC and POSIX threads. Jose et al. implement a PGAS runtime that achieves considerable speedups due to shared-memory support [10]. These efforts focus on specific implementations, whereas MEPHISTO is agnostic to the back-end. ...
... However, each VM has its own hostname, and therefore this approach is no longer feasible for virtualized clusters. In contrast to previous works [53,54], we use a unique identifier at a pre-defined offset within the shared-memory segment to obtain the same information. It is set by the first process accessing the shared-memory segment and read by all subsequent processes. ...
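The identifier scheme described in that snippet, where the first process installs a marker at a fixed offset and later processes read it back, can be sketched with an ordinary zero-initialized buffer standing in for the shared-memory segment (the offset, segment size, and function name are illustrative; in the real system the segment is OS shared memory and the check-and-set would need to be atomic):

```c
#include <stdint.h>
#include <string.h>

#define ID_OFFSET    64     /* pre-defined offset within the segment */
#define SEGMENT_SIZE 4096

/* Attach to the segment: if no identifier is present yet (the segment
 * starts zero-filled), this process is the first one and installs
 * `my_id`; otherwise the existing identifier is returned, so every
 * attaching process observes the same value. */
static uint64_t attach_segment(unsigned char *segment, uint64_t my_id)
{
    uint64_t current;
    memcpy(&current, segment + ID_OFFSET, sizeof current);
    if (current == 0) {                          /* first process */
        memcpy(segment + ID_OFFSET, &my_id, sizeof my_id);
        return my_id;
    }
    return current;                              /* follower */
}
```

Because the identifier lives inside the segment itself, the scheme works regardless of the hostname each VM reports, which is the point the snippet makes about virtualized clusters.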