Conference Paper

Towards Efficient Supercomputing: A Quest for the Right Metric.

DOI: 10.1109/IPDPS.2005.440 Conference: 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), CD-ROM / Abstracts Proceedings, 4-8 April 2005, Denver, CO, USA
Source: DBLP


Over the past decade, we have been building less and less efficient supercomputers, resulting in the construction of substantially larger machine rooms and even new build- ings. In addition, because of the thermal power envelope of these supercomputers, a small fortune must be spent to cool them. These infrastructure costs coupled with the ad- ditional costs of administering and maintaining such (un- reliable) supercomputers dramatically increases their to tal cost of ownership. As a result, there has been substantial in - terest in recent years to produce more reliable and more ef- ficient supercomputers that are easy to maintain and use. But how does one quantify efficient supercomputing? That is, what metric should be used to evaluate how efficiently a supercomputer delivers answers? We argue that existing efficiency metrics such as the performance-power ratio are insufficient and motivate the need for a new type of efficiency metric, one that incorpo- rates notions of reliability, availability, productivity , and to- tal cost of ownership (TCO), for instance. In doing so, how- ever, this paper raises more questions than it answers with respect to efficiency. And in the end, we still return to the performance-power ratio as an efficiency metric with re- spect to power and use it to evaluate a menagerie of pro- cessor platforms in order to provide a set of reference data points for the high-performance computing community.

Download full-text


Available from: Wu Feng
  • Source
    • "This is done due to fear of increased node failures at higher temperatures because Mean Time Between Failure (MTBF) of a processor is directly proportional to the exponential of its temperature [20] [21] [22]. It has also been reported that for every 10 • C increase in temperature, fault rate doubles [20] [23] [24] [25]. Therefore, besides reducing the cooling energy of the data center, restraining the temperature of the processors also increases the MTBF of the system. "
    Dataset: sc13

    Full-text · Dataset · Jul 2014
  • Source
    • "[They] raise awareness about power consumption, promote alternative total cost of ownership performance metrics, and ensure that supercomputers only simulate climate change and not create it.'' Between its inception in 2007 and the November release of 2012, the GREEN500 list was a reordering of the TOP500 list in order of decreasing power efficiency measured as the maximal GFLOPS per Watt (GFLOPS/W), a metric proposed in the parallel and distributed processing community [12]. Power consumption of a system is measured by a digital power meter plugged into the system's power strip, and readings are sent to a profiling computer at a rate of 50 kHz. "
    [Show abstract] [Hide abstract]
    ABSTRACT: The biannual TOP500 list of the highest performing supercomputers has chronicled, and even fostered, the development of recent supercomputing platforms. Coupled with the GREEN500 list that launched in November 2007, the TOP500 list has enabled analysis of multiple aspects of supercomputer design. In this comparative and retrospective study, we examine all of the available data contained in these two lists through November 2012 and propose a novel representation and analysis of the data, highlighting several major evolutionary trends.
    Full-text · Article · Apr 2013 · Parallel Computing
    • ".) but they are circumscribed to either one [8] or a few benchmarks types [10]. Furthermore, the most prevailing infrastructure comprises only an external wattmeter [1] [11]. Another issue arises on how to process the power/energy samples due to the high variability to which they are subject to. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Large-scale distributed systems (e.g., datacenters, HPC systems, clouds, large-scale networks, etc.) consume and will consume enormous amounts of energy. Therefore, accurately monitoring the power and energy consumption of these systems is increasingly more unavoidable. The main novelty of this contribution is the analysis and evaluation of different external and internal power monitoring devices tested using two different computing systems, a server and a desktop machine. Furthermore, we also provide experimental results for a variety of benchmarks which exercise intensively the main components (CPU, Memory, HDDs, and NICs) of the target platforms to validate the accuracy of the equipment in terms of power dispersion and energy consumption. This paper highlights that external wattmeters do not offer the same measures as internal wattmeters. Thanks to the high sampling rate and to the different measured lines, the internal wattmeters allow an improved visualization of some power fluctuations. However, a high sampling rate is not always necessary to understand the evolution of the power consumption during the execution of a benchmark.
    No preview · Chapter · Apr 2013
Show more