Conference Paper

Towards Efficient Supercomputing: A Quest for the Right Metric.

DOI: 10.1109/IPDPS.2005.440 Conference: 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), CD-ROM / Abstracts Proceedings, 4-8 April 2005, Denver, CO, USA
Source: DBLP

ABSTRACT Over the past decade, we have been building less and less efficient supercomputers, resulting in the construction of substantially larger machine rooms and even new buildings. In addition, because of the thermal power envelope of these supercomputers, a small fortune must be spent to cool them. These infrastructure costs, coupled with the additional costs of administering and maintaining such (unreliable) supercomputers, dramatically increase their total cost of ownership. As a result, there has been substantial interest in recent years in producing more reliable and more efficient supercomputers that are easy to maintain and use. But how does one quantify efficient supercomputing? That is, what metric should be used to evaluate how efficiently a supercomputer delivers answers? We argue that existing efficiency metrics such as the performance-power ratio are insufficient and motivate the need for a new type of efficiency metric, one that incorporates notions of reliability, availability, productivity, and total cost of ownership (TCO), for instance. In doing so, however, this paper raises more questions than it answers with respect to efficiency. In the end, we still return to the performance-power ratio as an efficiency metric with respect to power and use it to evaluate a menagerie of processor platforms in order to provide a set of reference data points for the high-performance computing community.
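The performance-power ratio the abstract falls back on is simple to sketch. The function below computes it directly; the sample numbers are illustrative placeholders, not measurements from the paper:

```python
def perf_power_ratio(flops: float, watts: float) -> float:
    """Return sustained floating-point operations per second per watt,
    the performance-power efficiency metric discussed in the abstract."""
    return flops / watts

# Illustrative example: a node sustaining 4 GFLOPS while drawing 250 W.
ratio = perf_power_ratio(4e9, 250.0)
print(f"{ratio:.2e} FLOPS/W")
```

As the paper argues, this single ratio captures power efficiency but says nothing about reliability, availability, or total cost of ownership.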

  •
    ABSTRACT: Future supercomputers will consume enormous amounts of energy. These very large scale systems will gather many homogeneous clusters. In this paper, we analyze the power consumption of the nodes from different homogeneous clusters during different workloads. We classically observe that these nodes exhibit the same level of performance. But we also show that different nodes from a homogeneous cluster may exhibit heterogeneous idle power consumption even if they are made of identical hardware. Hence, we propose an experimental methodology to understand such differences. We show that CPUs are responsible for this heterogeneity, which can reach 20% in terms of energy consumption. Energy-aware (green) schedulers must therefore account for such hidden heterogeneity in order to propose efficient mappings of tasks. To consume less energy, we propose an energy-aware scheduling approach that takes into account the heterogeneous idle power consumption of homogeneous nodes. It shows that we are able to save up to 17% of energy by exploiting the high power heterogeneity that may exist in some homogeneous clusters.
    Green Computing Conference (IGCC), 2013 International; 01/2013
  •
    ABSTRACT: In recent years, the high-performance computing (HPC) community has recognized the need to design energy-efficient HPC systems. The main focus, however, has been on improving the energy efficiency of computation, resulting in an oversight on the energy efficiencies of other aspects of the system such as memory or disks. Furthermore, the energy consumption of the non-computational parts of a HPC system continues to consume an increasing percentage of the overall energy consumption. Therefore, to capture a more accurate picture of the energy efficiency of a HPC system, we seek to create a benchmark suite and associated methodology to stress different components of a HPC system, such as the processor, memory, and disk. Doing so, however, results in a potpourri of benchmark numbers that make it difficult to "rank" the energy efficiency of HPC systems. This leads to the following question: What metric, if any, can capture the energy efficiency of a HPC system with a single number? To address the above, we propose The Green Index (TGI), a metric to capture the system-wide energy efficiency of a HPC system as a single number. Then, in turn, we present (1) a methodology to compute TGI, (2) an evaluation of system-wide energy efficiency using TGI, and (3) a preliminary comparison of TGI to the traditional performance-to-power metric, i.e., floating-point operations per second (FLOPS) per watt.
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International; 01/2012
  •
    ABSTRACT: Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
    Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis; 11/2013
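The "fault rates double for every 10°C rise in core temperature" rule of thumb cited in the last abstract above can be sketched as a simple exponential model. The base rate and temperatures below are illustrative assumptions, not values from the paper:

```python
def fault_rate(base_rate: float, base_temp_c: float, temp_c: float) -> float:
    """Fault rate under the rule of thumb that it doubles
    for every 10 degrees C rise above a reference temperature."""
    return base_rate * 2.0 ** ((temp_c - base_temp_c) / 10.0)

# Illustrative: restraining core temperature from 70 C down to 58 C
# reduces the fault rate by a factor of 2**1.2, roughly 2.3x --
# on the order of the reliability improvement the abstract reports.
improvement = fault_rate(1.0, 70.0, 70.0) / fault_rate(1.0, 70.0, 58.0)
print(f"reliability improvement: {improvement:.2f}x")
```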
