About
90 Publications
12,680 Reads
469 Citations
Introduction
Juan Chen is a full professor in the College of Computer at the National University of Defense Technology, China. Her research focuses on power and energy efficiency in areas such as high-performance computing and HPC interconnection network power management. She has published papers with others in highly competitive conferences and journals such as SC, IPDPS, Cluster, TOC, PARCO, and Communications of the ACM. Recognition for her research includes the best paper award at Cluster 2010.
Additional affiliations
June 2007 - present
Publications (90)
In an era where power and energy are the first-class constraints of computing systems, precise power information is crucial for energy efficiency optimization in parallel computing systems. Current power monitoring techniques rely on either software-centric power models that suffer from poor accuracy or hardware measurement schemes that have coarse...
Including dispositions and skills in computing curricula is beginning to take root in education circles. These two dimensions complement the knowledge dimension to form an understanding of competency taken in context. In a parallel movement, the Internet of Everything (IoE) is an emerging area of learning that focuses on the interaction between peo...
With the build-out of new infrastructure in recent years, data centers have become relevant to everyone's lives. Whether for an individual, a small company, or a large corporation in education, finance, telecommunications, retail, or social networking services, data centers provide a convenient and efficient platform for storing and computing data. In other words,...
In today's world, the demand for big data and computing power has made High-Performance Computing (HPC) popular in various fields. Supercomputing is one of the representative HPC applications, and its combination with the Internet of Everything (IoE) throughout its usage cycle cannot be ignored. For HPC, the Internet is the fundamental prerequisite for...
High-accuracy processor power modeling and forecasting are critical for power management and optimization. Though there are many works about processor power modeling, improving the processor power forecasting accuracy is still a challenge, especially for the static processor power modeling method. To achieve high-accuracy forecasting, the complexit...
Although GPUs have been used to accelerate various convolutional neural network algorithms with good performance, the demand for performance improvement is still continuously increasing. CPU/GPU overclocking technology brings opportunities for further performance improvement in CPU-GPU heterogeneous platforms. However, CPU/GPU overclocking inevitab...
Cross-platform power/performance prediction is becoming increasingly important due to the rapid development and variety of software and hardware architectures in an era of heterogeneous multi-core. However, accurate power/performance prediction is faced with an obstacle caused by the large gap between architectures, which is often overcome by labor...
Employers seek recruits who can apply the knowledge, skill, and culture they acquire in college to solve problems as soon as they enter the workforce.
Distributed parallel MPI applications are the dominant workload in many high-performance computing systems. While optimizing MPI application execution is a well-studied field, little work has considered optimizing the initial MPI application launching phase, which incurs extensive cross-machine communications and synchronization. The overhead of MP...
Deep learning has achieved high accuracy and fast training speed and has been successfully applied to many fields, including speech recognition, text processing, image processing and video processing. However, the high accuracy and training speed of Deep Neural Networks (DNNs) come at the cost of high power and energy consumption. This inspires researcher...
Graphs are an effective approach for data representation and organization, and graph analysis is a promising killer application for AI systems. However, recently emerging extremely large graphs (consisting of trillions of vertices and edges) exceed the capacity of any small-/medium-scale clusters and thus necessitate the adoption of supercomputers...
Achieving faster performance without increasing power and energy consumption for computing systems is an outstanding challenge. This paper develops a novel resource allocation scheme for memory-bound applications running on High-Performance Computing (HPC) clusters, aiming to improve application performance without breaching peak power constraints...
Facing the challenges of the next generation exascale computing, National University of Defense Technology has developed a prototype system to explore opportunities, solutions, and limits toward the next generation Tianhe system. This paper briefly introduces the prototype system, which is deployed at the National Supercomputer Center in Tianjin an...
Energy and power density have forced the industry to introduce many-cores where a large number of processor cores are integrated into a single chip. In such settings, the communication latency of the network on chip (NoC) can be a performance bottleneck for multi-core and many-core processors. Unfortunately, existing approaches for mapping the runn...
Understanding the characteristics of scientific computing programs has been of great importance due to its close relationship with the design and implementation of program optimization methods. Generally, scientific computing programs can be divided into three categories according to their computing, memory access and communication characteristics,...
High power consumption has become one of the main obstacles affecting the reliability, stability, and performance of high-performance computers. Obtaining the power of the CPU and memory instantaneously and accurately is an important basis for evaluating their power optimization methods. At present, much work has been done to model...
To enhance training in software development, we argue that students of software engineering should be exposed to software development activities early in the curriculum. This entails meeting the challenge of engaging students in software development before they take the software engineering course. In this paper, we propose a method to connect cour...
Reducing the energy consumption of the storage system's disk read/write requests plays an important role in improving the overall energy efficiency of high-performance computing systems. We propose a method to reduce disk energy consumption by delaying the dispatch of disk requests to the end of a time window, which we call time window-based lazy sc...
Component overclocking, which includes processor overclocking and memory overclocking, is an effective approach to speed up the components of a system to realize higher program performance. However, overclocking will unavoidably result in an increase in power consumption. Our goal is to optimally improve the performance of scientific computing applicatio...
MPI libraries are widely used in applications of high performance computing. Yet, effective tuning of MPI collectives on large parallel systems is an outstanding challenge. This process often follows a trial-and-error approach and requires expert insights into the subtle interactions between software and the underlying hardware. This paper presents...
Parallel programming skills are becoming more popular due to the unprecedented boom in artificial intelligence and high-performance computing. Programming assignments are widely used in parallel programming courses to measure student performance and expose students to constraints in real projects. However, due to the difficulty level of these assign...
Exascale computing is one of the major challenges of this decade, and several studies have shown that communications are becoming one of the bottlenecks for scaling parallel applications. Analyzing the characteristics of communications can effectively help improve the performance of scientific applications. In this paper, we focus on the st...
With the continued surge in the popularity of CS classes, an unprecedented number of non-major undergraduate students enroll in an introductory computer science course. It is difficult for freshmen to fully understand the working principle of a CPU in one or two lessons. In the literature, many CS0 curriculum teaching methods have been proposed a...
Teaching and training for high-performance computing in our college have not kept up with the level of HPC research. Thus, it is imperative to promote teaching reform of the parallel computing course in our college. Our first parallel programming course is mainly for first-year graduate students majoring in CS and related branches with no previous HPC...
Situational case-based teaching is one of the promising teaching strategies in engineering education, e.g., computer science and information security. This strategy is student-centered and can reinforce practical experience and context-aware problem solving. In this paper, we propose a novel evaluation framework by applying concept mapping to address t...
With the increasing demand for big data technology, there has been a growing interest in introducing high performance computing into the computer science curriculum. One challenge in helping students understand the nature of efficiency and scalability issues in high performance computing is the lack of opportunities for them to be engaged in large-scale a...
Mesh partitioning is significant to the efficiency of parallel computational fluid dynamics simulations. The most time-consuming parts of parallel computational fluid dynamics simulations are iteratively solving linear systems derived from partial differential equation discretizations. This article aims at mesh partitioning for better iterative con...
Mesh partitioning plays an important role in Computational Fluid Dynamics (CFD) simulation. However, it is difficult to produce a good mesh partitioning that achieves a high-performance simulation because mesh partitioning is an NP-complete problem. A good mesh partitioning may be determined by many factors. Nowadays, there is a lack of systemat...
This paper focuses on mesh-partitioning metrics in large-scale parallel computational fluid dynamics (CFD) simulations. Mesh partitioning has a significant influence on the efficiency of the parallel preconditioned conjugate gradient (PCG) solving procedure, which is the most representative and time-consuming part of parallel CFD. As the efficiency of...
The sound propagation in a wedge-shaped waveguide with perfectly reflecting boundaries is one of the few range-dependent problems with an analytical solution. This provides a benchmark for the theoretical and computational studies on the simulation of ocean acoustic applications. We present a direct finite volume method (FVM) simulation for the ide...
In numerical simulations, a great deal of effort has been made to control errors so as to guarantee the correctness of numerical simulations. Moreover, errors may propagate along meshes, and a slight error might be amplified during propagation, which ultimately deteriorates the overall accuracy. Therefore, to study the characteristic of error propagati...
"Sustainable development" is one of the major issues in the 21st century. Thus the notions of green computing, green development and so on show up one after another. As the large-scale parallel computing systems develop rapidly, energy consumption of such systems is becoming very huge, especially system performance reaches Petascale (1015 Flops) or...
Flow instabilities of non-Newtonian fluids severely hamper the quality of products during various chemical processes, such as fibre spinning, extrusion, and film blowing. The origin of extrusion instability has been studied over many decades. However, no consensus has been reached among the research community so far. In this paper, the possible cau...
The performance and energy consumption of high performance computing (HPC) interconnection networks are of great significance to the whole supercomputer, and building an HPC interconnection network simulation platform is very important for research on HPC software and hardware technologies. To effectively evaluate the performance and energy consump...
Exascale computing is one of the major challenges of this decade, and several studies have shown that communication is becoming one of the bottlenecks for scaling parallel applications. Analyzing the characteristics of communication is an important means to improve the performance of scientific applications. In this paper, we focus on the statist...
The phase transition of complex fluids is intrinsically a multi-scale problem. In this paper we propose a multi-scale two-fluid model that couples a coarse-grained microscopic method to the two-fluid framework for studying multi-phase fluids under shear flow. In this model the macroscopic viscoelastic stress is calculated by tracking massive...
Soft errors are becoming a prominent problem for massively parallel scientific applications. Dual-modular redundancy (DMR) can provide approximately 100% error coverage, but it incurs excessive overhead. The stencil kernel is one of the most important routines applied in the context of structured grids. In this paper, we propose Grid Sampling...
An efficient IBLF-dts scheme is proposed to integrate the bounce-back LBM and FVM scheme to solve the Navier-Stokes equations and the constitutive equation, respectively, for the simulation of viscoelastic fluid flows. In order to improve the efficiency, the bounce-back boundary treatment for LBM is introduced to improve the grid mapping of LBM...
Routers are one of the major components in supercomputer interconnection networks and have great influence on the performance and energy consumption of these networks. Given a network topology and routing rule, different resource allocation strategies bring about different numbers of idle routers. A key theoretical...
The scalability and the efficiency of the OpenFOAM-based parallel viscoelastic solver are greatly restricted by MPI operations. In this work, we proposed a comprehensive strategy of the computation and communication overlap and the hybrid parallelization with MPI and OpenMP to optimize the parallel viscoelastic solver on HPC platforms. Critical sim...
With the development of electronic technology, the processor count in a supercomputer has reached the million scale. However, the process scale of an application is limited to several thousand, and scalability faces bottlenecks from several aspects, including I/O, communication, cache access, etc. In this paper, we focus on the communication b...
The particle-in-cell (PIC) method has benefited greatly from GPU-accelerated heterogeneous systems. However, the performance of PIC is constrained by the interpolation operations in the weighting process on the GPU (graphics processing unit). Aiming at this problem, a fast weighting method for PIC simulation on GPU-accelerated systems was proposed to avoid...
Multi/many-core design combined with wide vector extension has become the mainstream of modern processor architectures. Recently, Intel released Knights Corner, a many-core processor of Intel's Many Integrated Core (MIC) architecture. Knights Corner comprises up to 62 cores, each supporting 512-bit SIMD operation, that is, 8-way double precision float...
Excessive energy consumption is widely recognized to be a critical problem in large-scale parallel computing systems. The LogP-based energy-saving model and the frequency scaling method were proposed to reduce energy consumption analytically and systematically for two other representative barrier algorithms: tournament barrier and central counter ba...
Energy consumption has become a serious problem in high-performance computing (HPC) systems. Parallel loops often occupy a significant part of the execution time of overall parallel programs. Thus, reducing their energy consumption is the key to the reduction in energy consumption of the program. This paper discusses energy optimization in OpenMP l...
Huge energy consumption in a large-scale parallel system can have a negative impact on the reliability and availability of the system. Based on the LogP model, we apply frequency scaling for the first time to reduce energy consumption analytically and systematically for the combining tree barrier, one representative barrier algorithm used as MPI collect...
High-performance computing (HPC) systems consume a huge amount of power, bringing about higher operational costs and lower reliability. Power optimization has become an important target for HPC system design. Network energy optimization is significant for the energy optimization of the whole system. In this paper, the dynamic energy optimization m...
This paper presents an expanded-hypothesis-based OpenMP static scheduling energy optimization algorithm, Improved Energy-Optimal OpenMP Static Scheduling (IEOSS). Based on the EOSS algorithm, the IEOSS algorithm exploits the impact of memory access latency on the performance and energy of parallel loops. Due to cache miss, the optimal chunk S* scales down processors...
In this paper we present the programming of the Linpack benchmark on TianHe-1 system, the first petascale supercomputer system of China, and the largest GPU-accelerated heterogeneous system ever attempted before. A hybrid programming model consisting of MPI, OpenMP and streaming computing is described to explore the task parallel, thread parallel a...
In this paper, we describe our experiment developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer system, the largest GPU-accelerated system ever attempted before. An adaptive optimization framework is presented to balance the workload distribution across the GPUs and CPUs with the negligible runtime ov...
As the scale of high performance computing (HPC) systems increases, their power consumption increases greatly, and the power consumption of the massive storage system plays an important role in it. Measurement and analysis of the massive storage system are significant for reducing the power consumption of the storage system. Tianhe-1 comes in at...
We discuss an implementation and optimization of GPU-accelerated Molecular Dynamics (MD) simulation of high-speed collision molecular model in NVIDIA CUDA language. A series of optimization methods are presented: spatial decomposition, use of shared memory and use of blockcell-link structure. These optimization methods effectively improve the perfo...
General-purpose computation on GPUs is becoming more and more popular because of the GPU's powerful computing ability. This paper is concerned with how to use a GPU to accelerate a sparse linear system solver, preconditioned QMRCGSTAB (PQMRCGSTAB for short). We implemented a GPU-accelerated PQMRCGSTAB algorithm on NVIDIA Tesla C870. Three optimization metho...
In HPC, power-related concerns have become dominant aspects of hardware and software design. Significant research effort has been devoted to the energy optimization of parallel loops. This article focuses on the energy-oriented OpenMP static and dynamic parallel loop scheduling problem. DVS alone cannot obtain the maximum energy savings. It is necessa...
Solving complex convection-diffusion equations is very important to many practical mathematical and physical problems. After finite difference discretization, most of the solution time is spent in sparse linear equation solvers. In this paper, our goal is to solve 2D Nonlinear Unsteady Convection-Diffusion Equations by acceleratin...
Processors that support Dynamic Voltage/Frequency Scaling (DVFS) make it possible for a system to reduce energy consumption by scaling down the frequency/voltage of the processors in high performance computing. For MPI collective operations, network communication time accounts for most of the total time. Scaling down CPU voltage/frequency in non-c...
In high performance parallel computing, energy optimization for parallel loops is key because loops often take up a significant part of the whole execution time. The energy-constrained problem is one of the important research focuses. This paper studies the energy-constrained problem based on OpenMP static loop scheduling. Firstly, we pr...
We first demonstrate software prefetching provides an average 66.28% performance enhancement with much higher average power on six memory-intensive benchmarks. Then we propose a power-directed software prefetching algorithm with dynamic voltage scaling (PDP-DVS) that monitors a system’s power and adapts the voltage level accordingly to guarantee n...
Some traditional optimizations improve the performance of processors but increase power dissipation. We study this trade-off using software prefetching as a performance-oriented optimization technique. We first demonstrate that software prefetching provides a significant performance boost with higher power on several memory-intensive...
Although parallel systems with high peak performance have been exciting, high peak performance often means high power consumption. In this paper, power-aware parallel systems are investigated, where each node can perform dynamic voltage scaling (DVS). Based on the characteristics of communication and memory access in MPI programs, a compiler is used t...
Compiler-directed dynamic voltage scaling (DVS) is one of the effective low-power techniques for real-time applications. Using the technique, the compiler inserts voltage scaling points into a real-time application, and supply voltage and clock frequency are adjusted according to the relationship between the remaining time and the remaining workload at each volt...
In mobile and embedded devices, the energy supply is strictly constrained by battery capacity and energy-saving ability. In these energy-constrained settings, the available energy budget is not sufficient to meet the optimal performance objective. This paper presents an energy-constrained software prefetching optimization approach, which can...