Conference Paper

Towards Efficient Supercomputing: A Quest for the Right Metric.

DOI: 10.1109/IPDPS.2005.440 Conference: 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), CD-ROM / Abstracts Proceedings, 4-8 April 2005, Denver, CO, USA
Source: DBLP

ABSTRACT Over the past decade, we have been building less and less efficient supercomputers, resulting in the construction of substantially larger machine rooms and even new buildings. In addition, because of the thermal power envelope of these supercomputers, a small fortune must be spent to cool them. These infrastructure costs, coupled with the additional costs of administering and maintaining such (unreliable) supercomputers, dramatically increase their total cost of ownership. As a result, there has been substantial interest in recent years in producing more reliable and more efficient supercomputers that are easy to maintain and use. But how does one quantify efficient supercomputing? That is, what metric should be used to evaluate how efficiently a supercomputer delivers answers? We argue that existing efficiency metrics such as the performance-power ratio are insufficient and motivate the need for a new type of efficiency metric, one that incorporates notions of reliability, availability, productivity, and total cost of ownership (TCO), for instance. In doing so, however, this paper raises more questions than it answers with respect to efficiency. And in the end, we still return to the performance-power ratio as an efficiency metric with respect to power and use it to evaluate a menagerie of processor platforms in order to provide a set of reference data points for the high-performance computing community.
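The performance-power ratio the abstract keeps returning to is simply delivered performance divided by power drawn, commonly reported as FLOPS per watt. A minimal sketch of that metric, using made-up system names and numbers (none of them are measurements from the paper):

```python
# Sketch of the performance-power ratio discussed in the abstract.
# All system names and figures below are hypothetical illustrations.

def perf_per_watt(gflops: float, watts: float) -> float:
    """Performance-power ratio in GFLOPS per watt."""
    return gflops / watts

# Hypothetical data points, not reference values from the paper.
systems = {
    "system_a": (1000.0, 500.0),   # 1 TFLOPS at 500 W
    "system_b": (4000.0, 2500.0),  # 4 TFLOPS at 2.5 kW
}

for name, (gflops, watts) in systems.items():
    print(f"{name}: {perf_per_watt(gflops, watts):.2f} GFLOPS/W")
```

The paper's point is precisely that this single ratio omits reliability, availability, productivity, and TCO; the sketch only shows the baseline metric it falls back on.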

  • ABSTRACT: Energy efficiency is now a top priority. The first four years of the Green500 have seen the importance of energy efficiency in supercomputing grow from an afterthought to the forefront of innovation as we approach a point where systems become increasingly constrained by power consumption. Even so, the landscape of energy efficiency in supercomputing continues to shift—with new trends emerging and unexpected shifts in previous predictions. This paper offers an in-depth analysis of the new and shifting trends in the Green500. In addition, the analysis offers early indications of the path that we are taking towards exascale and what an exascale machine in 2018 is likely to look like. Lastly, we discuss the emerging efforts and collaborations toward designing and establishing better metrics, methodologies, and workloads for the measurement and analysis of energy-efficient supercomputing.
    Computer Science - Research and Development 05/2013; 28(2-3).
  • ABSTRACT: The High Performance Computing (HPC) community aimed for many years to increase performance regardless of energy consumption. By the end of the decade, the next generation of HPC systems is expected to reach sustained performance on the order of exaflops. This requires many times the performance of today's fastest supercomputers. Achieving this goal is unthinkable with current technology due to strict constraints on supplied power. Therefore, finding ways to improve energy efficiency has become a main challenge in state-of-the-art research. The present paper investigates energy efficiency on heterogeneous CPU+GPU architectures using a scientific application from the agroforestry domain as a case study. Differently from other works, our work evaluates how the workload of the application may affect energy efficiency on hybrid architectures. Results indicate that power supply constraints also depend on the workload.
    Cluster Computing 09/2013; 16(3).
  • ABSTRACT: Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
    Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis; 11/2013
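    The rule stated in that abstract—fault rates doubling for every 10°C rise in core temperature—can be sketched as a simple exponential model. The baseline rate and baseline temperature below are assumed placeholders, not values from the paper:

    ```python
    # Sketch of the stated rule: fault rates double per 10 degrees C rise.
    # base_rate and base_temp_c are assumed baselines for illustration only.

    def fault_rate(temp_c: float, base_rate: float = 1.0,
                   base_temp_c: float = 50.0) -> float:
        """Relative fault rate at temp_c, doubling every 10 C above base_temp_c."""
        return base_rate * 2 ** ((temp_c - base_temp_c) / 10.0)

    # Under this model, lowering core temperature by 12 C cuts the fault
    # rate by 2^1.2, roughly the 2.3x reliability gain the abstract reports.
    improvement = fault_rate(70.0) / fault_rate(58.0)
    print(f"{improvement:.2f}x lower fault rate")
    ```

    Only the doubling-per-10°C relationship comes from the abstract; the specific temperatures chosen here are illustrative.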
