November 2024
Publications (23)
November 2022 · 4 Reads · 2 Citations
April 2019 · 30 Reads · 2 Citations
Lecture Notes in Computer Science
The PAPI performance library is a widely used tool for gathering self-monitored performance data from running applications. A key aspect of self-monitoring is the ability to read hardware performance counters with the minimum possible overhead. If the read overhead becomes too large, the act of measurement starts to interfere with the gathered results, adversely affecting the performance analysis.
April 2019 · 24 Reads
Lecture Notes in Computer Science
The PAPI performance library is a widely used tool for gathering performance data from running applications. Modern processors support advanced sampling interfaces, such as Intel’s Precise Event Based Sampling (PEBS) and AMD’s Instruction Based Sampling (IBS). The current PAPI sampling interface predates the existence of these interfaces and only provides simple instruction-pointer based samples.
October 2018 · 448 Reads · 17 Citations
Modern memory hierarchies are complex pieces of hardware, combining memory controllers, caches, and memory-management units in a desperate attempt to keep modern high-speed processors fed with data. Recent developments, including the introduction of non-volatile memory, further complicate the situation. Operating systems are highly tuned to current memory systems; modifying them to take advantage of new developments is a difficult process. The low-level code involved is complex and hard to follow. This makes teaching students about modern memory systems a struggle, as wading through the complicated code in a full operating system like Linux can be frustrating. To this end we develop vmwOS, a simple operating system designed for low-cost Raspberry Pi development boards. Although inexpensive, these widely used boards support most of the features of modern memory systems, including 64-bit addresses, multi-level caches, multi-core, and full ARMv8 processor and MMU support. vmwOS is simple enough that students can follow the code and make modifications for memory hierarchy exploration. We describe a number of memory research topics that can be explored with this infrastructure, including a detailed examination of simulating non-volatile RAM support.
October 2016 · 782 Reads · 121 Citations
Recent Intel processors support the Running Average Power Limit (RAPL) interface, which among other things provides estimated energy measurements for the CPUs, integrated GPU, and DRAM. These measurements are easily accessible by the user, and can be gathered by a wide variety of tools, including the Linux perf_event interface. This allows unprecedented easy access to energy information when designing and optimizing energy-aware code. While greatly useful, on most systems these RAPL measurements are estimated values, generated on the fly by an on-chip energy model. The values are not documented well, and the results (especially the DRAM results) have undergone only limited validation. We validate the DRAM RAPL results on both desktop and server Haswell machines, with multiple types of DDR3 and DDR4 memory. We instrument the hardware to gather actual power measurements and compare them to the RAPL values returned via Linux perf_event. We describe the many challenges encountered when instrumenting systems for detailed power measurement. We find that the RAPL results match overall energy and power trends, usually differing by a constant power offset. The results match best when the DRAM is being heavily utilized, but do not match as well in cases where the system is idle, or when an integrated GPU is using the memory. We also verify that Haswell server machines produce more accurate results, as they include actual power measurements gathered through the integrated voltage regulator.
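As a rough illustration of the comparison this abstract describes, the sketch below turns two cumulative energy readings into an average power figure and estimates a constant offset between two power traces. It is not the authors' method: the function names, the microjoule unit, and the single-wraparound handling (with `max_range_uj` treated as the counter modulus) are all assumptions made for the example, and the readings could come from any source, such as an energy counter exposed by the operating system.

```python
from statistics import fmean


def average_power_watts(e_start_uj, e_end_uj, elapsed_s, max_range_uj):
    """Average power between two cumulative energy readings (microjoules).

    Assumption: the counter wraps at most once between the two readings,
    and max_range_uj is the value at which it wraps.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = e_end_uj - e_start_uj
    if delta_uj < 0:
        # The counter wrapped around; add the range back in.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / elapsed_s


def constant_offset_watts(measured_w, estimated_w):
    """Mean difference between instrumented and estimated power samples.

    Assumption: both sequences are sampled at the same instants.
    """
    if len(measured_w) != len(estimated_w) or not measured_w:
        raise ValueError("need two non-empty sequences of equal length")
    return fmean(m - e for m, e in zip(measured_w, estimated_w))
```

If the two traces really differ by a constant offset, subtracting the value returned by `constant_offset_watts` from the measured trace should leave residuals close to zero; large residuals would suggest the mismatch depends on the workload.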
September 2016 · 295 Reads · 56 Citations
Electronics
Power consumption has become an increasingly important metric when building large supercomputing clusters. One way to reduce power usage in large clusters is to use low-power embedded processors rather than the more typical high-end server CPUs (central processing units). We investigate various power-related metrics for seventeen different embedded ARM development boards in order to judge the appropriateness of using them in a computing cluster. We then build a custom cluster out of Raspberry Pi boards, which is specially designed for per-node detailed power measurement. In addition to serving as an embedded cluster testbed, our cluster’s power measurement, visualization and thermal features make it an excellent low-cost platform for education and experimentation.
July 2015 · 16 Reads · 3 Citations
PAPI is a widely used portable library for accessing hardware counters on modern microprocessors. PAPI offers both counting and sampling interfaces, but the sampling interface is extremely limited, consisting of a simple interrupt-driven interface that can periodically report processor state. In the past few years, the hardware and operating systems of modern processors have added support for new, more advanced sampling features. These features enable information about non-uniform memory access (NUMA) behavior to be obtained. Currently, performance tool developers who want to provide sampling data to their users must make use of a complex low-level kernel interface, sometimes developing their own kernel patch to access the features they need. This paper reports on initial efforts to develop a middleware layer that will serve as a stable interface and enable tool developers to access sampling data through standard PAPI calls and to obtain data important for NUMA analysis.
April 2015 · 250 Reads · 38 Citations
Most modern CPUs include hardware performance counters: architectural registers that allow programmers to gain low-level insight into system performance. Low-overhead access to these counters is necessary for accurate performance analysis, making the operating system interface critical to providing low-latency performance data. We investigate the overhead of self-monitoring performance counter measurements on the Linux perf_event interface. We find that default code (such as that used by PAPI) implementing the perf_event self-monitoring interface can have large overhead: up to an order of magnitude larger than the previously used perfctr and perfmon2 performance counter implementations. We investigate the causes of this overhead and find that with proper coding this overhead can be greatly reduced on recent Linux kernels.
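The idea of measuring how much a read operation itself costs can be sketched in a few lines. The example below is only an illustration under stated assumptions: `read` is a hypothetical zero-argument callable standing in for a counter read (it does not call perf_event or PAPI), and the timing uses Python's `time.perf_counter_ns`, so the result also includes the timer's own cost and interpreter overhead.

```python
import statistics
import time


def estimate_read_overhead_ns(read, samples=10_000):
    """Median wall-clock cost, in nanoseconds, of one call to `read`.

    `read` is a hypothetical stand-in for a counter-read operation.
    The figure includes the cost of the surrounding timer calls.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    costs = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        read()
        t1 = time.perf_counter_ns()
        costs.append(t1 - t0)
    # The median is less sensitive than the mean to rare slow outliers
    # (interrupts, context switches).
    return statistics.median(costs)
```

Calling `estimate_read_overhead_ns(lambda: None)` gives a baseline for the timing harness; subtracting that baseline from the figure for a real read operation gives a rough estimate of the read's own cost.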
November 2014 · 46 Reads · 46 Citations
Citations (19)
... Improving HPC Security with Targeted Syscall Fuzzing [69] 2022 USA Software Methodology: Integrates domain-specific knowledge, tailored fuzzing strategies, and hybrid execution techniques for robust vulnerability detection in critical system calls essential for HPC applications. Findings: The paper highlights the efficacy of targeted syscall fuzzing as a key asset for enhancing HPC security, and uncovering vulnerabilities in the Linux performance infrastructure. ...
- Citing Conference Paper
November 2022
... Examples of measured quantities include: aggregate error rates, per-cell probabilities of error, and spatial/temporal error distributions. These measurements can be made using testing infrastructures ranging from industry-standard large-scale testing equipment [354,355] to home-grown tools based on commodity FPGAs [16,29,55,65,90,171,264,273,356,357,358,359] or DRAM-based computing systems [236,262,274,360,361]. ...
- Citing Conference Paper
October 2018
... energy consumption data from RAPL registers, simplifying the development of energy measurement tools. The accuracy of the RAPL measurements has been evaluated in several studies [18], [19], [20], [8] showing good concordance with energy measured at plug power. ...
- Citing Conference Paper
October 2016
... The emergence of the Pi2B in 2015, which boasted a quad-core processor, provided a significant performance boost over the Pi1B, as evidenced by the studies conducted by Christian Baun 9 and Cloutier et al. 10. Baun compared 8-node Pi1B and Pi2B clusters, finding that Pi2B demonstrated approximately four times higher performance with 80% memory usage. ...
- Citing Article
- Full-text available
September 2016
Electronics
... To overcome Pi1B's constraints, Justin Moore 7 utilized a 4-node Pi1B cluster with 512MB RAM per node and, by using the Automatically Tuned Linear Algebra Software (ATLAS) library, achieved 0.836 GFlops with overclocking. Cloutier et al. 8 tested a 32 Pi1B (Pi1B and Pi 1B+) cluster with added memory, achieving up to 6.25 GFlops overclocked. ...
- Citing Conference Paper
November 2014
... PAPI is a widely used portable library for accessing hardware counters on modern microprocessors [7]. PAPI offers both counting and sampling interfaces, but the sampling interface is extremely limited, consisting of a simple interrupt-driven interface that can periodically report processor state. ...
Reference:
A prototype sampling interface for PAPI
- Citing Chapter
November 2010
... Power consumption and performance modeling and management in data centers is of great interest [16], [17], [18], [19], [20], [21]. The Dynamic Frequency Management of processor/cores' clock frequency has been considered and used as an alternative option to consolidation in saving energy and also in increasing the lifetime of equipment [8], [9], [22]. ...
- Citing Article
July 2015
... Linux Perf Monitor is a popular Linux module that provides an interface for user-level applications to configure and monitor performance counters. Nevertheless, instrumenting and monitoring systems at user-level may introduce extraneous overhead [21]. Run-DMC [16] is a runtime dynamic performance and power estimator for Heterogeneous Multicore platforms. ...
- Citing Article
April 2015
... However, the runtime overhead is usually very high. The SESC simulator [18] for instance runs hundreds of times slower than a native platform [19]. The memory subsystem is responsible for a significant part of this overhead [20]. ...
- Citing Article
December 2010
... On the server side, we evaluate the performance degradation using the High Performance Linpack [62] benchmark which is used to evaluate performance on Top 500 supercomputers. ...
- Citing Conference Paper
- Full-text available
August 2011
Lecture Notes in Computer Science