Vincent M. Weaver’s research while affiliated with University of Maine and other places


Publications (23)


Performance Measurement on Heterogeneous Processors with PAPI
  • Conference Paper

November 2024

Willow E. Cunningham · Vincent M. Weaver


Enhancing PAPI with Low-Overhead rdpmc Reads

April 2019 · 30 Reads · 2 Citations

Lecture Notes in Computer Science

The PAPI performance library is a widely used tool for gathering self-monitored performance data from running applications. A key aspect of self-monitoring is the ability to read hardware performance counters with minimum possible overhead. If read overhead becomes too large then the act of measurement will start to interfere with the gathered results, adversely affecting the performance analysis.


Advanced Event-Sampling Support for PAPI

April 2019 · 24 Reads

Lecture Notes in Computer Science

The PAPI performance library is a widely used tool for gathering performance data from running applications. Modern processors support advanced sampling interfaces, such as Intel’s Precise Event Based Sampling (PEBS) and AMD’s Instruction Based Sampling (IBS). The current PAPI sampling interface predates the existence of these interfaces and only provides simple instruction-pointer based samples.


A Raspberry Pi operating system for exploring advanced memory system concepts

October 2018 · 448 Reads · 17 Citations

Modern memory hierarchies are complex pieces of hardware, combining memory controllers, caches, and memory-management units in a desperate attempt to keep modern high-speed processors fed with data. Recent developments, including the introduction of non-volatile memory, further complicate the situation. Operating systems are highly tuned to current memory systems; modifying them to take advantage of new developments is a difficult process. The low-level code involved is complex and hard to follow. This makes teaching students about modern memory systems a struggle, as wading through the complicated code in a full operating system like Linux can be frustrating. To this end we develop vmwOS, a simple operating system designed for low-cost Raspberry Pi development boards. Although inexpensive, these widely used boards support most of the features of modern memory systems, including 64-bit addresses, multi-level caches, multi-core, and full ARMv8 processor and MMU support. vmwOS is simple enough that students can follow the code and make modifications for memory hierarchy exploration. We describe a number of memory research topics that can be explored with this infrastructure, including a detailed examination of simulating non-volatile RAM support.


A Validation of DRAM RAPL Power Measurements

October 2016 · 782 Reads · 121 Citations

Recent Intel processors support the Running Average Power Limit (RAPL) interface, which among other things provides estimated energy measurements for the CPUs, integrated GPU, and DRAM. These measurements are easily accessible by the user, and can be gathered by a wide variety of tools, including the Linux perf_event interface. This allows unprecedented easy access to energy information when designing and optimizing energy-aware code. While greatly useful, on most systems these RAPL measurements are estimated values, generated on the fly by an on-chip energy model. The values are not documented well, and the results (especially the DRAM results) have undergone only limited validation. We validate the DRAM RAPL results on both desktop and server Haswell machines, with multiple types of DDR3 and DDR4 memory. We instrument the hardware to gather actual power measurements and compare them to the RAPL values returned via Linux perf_event. We describe the many challenges encountered when instrumenting systems for detailed power measurement. We find that the RAPL results match overall energy and power trends, usually by a constant power offset. The results match best when the DRAM is being heavily utilized, but do not match as well in cases where the system is idle, or when an integrated GPU is using the memory. We also verify that Haswell server machines produce more accurate results, as they include actual power measurements gathered through the integrated voltage regulator.
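As a rough illustration of how cumulative RAPL energy readings turn into the power figures discussed in this abstract, the sketch below converts two energy samples into average power. It is not taken from the paper: the function name and parameters are hypothetical, and it assumes readings in microjoules from a counter that wraps at most once between samples (as with the `energy_uj` and `max_energy_range_uj` files of the Linux powercap interface, rather than the perf_event path the paper uses).

```python
def average_power_watts(start_uj, end_uj, elapsed_s, max_range_uj):
    """Average power between two cumulative energy readings.

    start_uj, end_uj: energy counter values in microjoules (assumed unit).
    elapsed_s: wall-clock seconds between the two readings.
    max_range_uj: value at which the counter is assumed to wrap to zero.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped between samples; assume it wrapped only once.
        delta_uj += max_range_uj
    # Microjoules -> joules, then joules per second -> watts.
    return delta_uj / 1e6 / elapsed_s


if __name__ == "__main__":
    # Two made-up samples taken 2 s apart: 2 J consumed, so 1 W on average.
    print(average_power_watts(1_000_000, 3_000_000, 2.0, 10_000_000))
```

Sampling more often than the counter can wrap keeps the single-wrap assumption safe; any comparison against instrumented hardware measurements, as in the paper, would sit on top of a conversion like this one.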


Figure 1. Performance (GFLOPS) compared to average power. Upper left is the best.  
Figure 2. Performance (GFLOPS) compared to cost in US$. Upper left is the best.  
Figure 4. The raspberry-pi cluster.  
Figure 5. The circuit used to measure power. An op-amp provides a gain of 20 to the voltage drop across a sense resistor. This can be used to calculate current and then power. The values from four Pis are fed to a measurement node using an SPI A/D converter. The 5-V SPI bus is converted down to the 3.3 V expected by the Pi measurement node.  
Figure 6. Detailed per-node power measurement with 4-Hz sampling while running a 12-node 10 k HPL (high-performance Linpack) run. All 24 nodes are shown, but only half are being used, which is why 12 remain at idle throughout the run. node05–2 is currently down and node03–0, node03–3 and node01–0 have malfunctioning power measurement.  


A Raspberry Pi Cluster Instrumented for Fine-Grained Power Measurement
  • Article
  • Full-text available

September 2016 · 295 Reads · 56 Citations

Electronics

Power consumption has become an increasingly important metric when building large supercomputing clusters. One way to reduce power usage in large clusters is to use low-power embedded processors rather than the more typical high-end server CPUs (central processing units). We investigate various power-related metrics for seventeen different embedded ARM development boards in order to judge the appropriateness of using them in a computing cluster. We then build a custom cluster out of Raspberry Pi boards, which is specially designed for per-node detailed power measurement. In addition to serving as an embedded cluster testbed, our cluster’s power measurement, visualization and thermal features make it an excellent low-cost platform for education and experimentation.
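The Figure 5 caption for this article describes the measurement chain: an op-amp with a gain of 20 amplifies the voltage drop across a sense resistor, from which current and then power are calculated. A minimal sketch of that arithmetic follows. The 0.1-ohm resistor and the 5 V node supply used here are assumptions for illustration only (the listing does not state them), and the function name is hypothetical.

```python
def node_power_watts(amp_out_v, sense_ohms, supply_v=5.0, gain=20.0):
    """Estimate node power from the amplified sense-resistor voltage.

    amp_out_v: voltage at the op-amp output, as seen by the A/D converter.
    sense_ohms: sense resistor value (assumed; not given in the listing).
    supply_v: node supply voltage (assumed 5 V).
    gain: op-amp gain (20, per the Figure 5 caption).
    """
    if sense_ohms <= 0 or gain <= 0:
        raise ValueError("sense_ohms and gain must be positive")
    drop_v = amp_out_v / gain          # undo the amplifier gain
    current_a = drop_v / sense_ohms    # Ohm's law: I = V / R
    return supply_v * current_a        # P = V * I


if __name__ == "__main__":
    # 1.0 V at the op-amp output with an assumed 0.1-ohm sense resistor.
    print(node_power_watts(1.0, 0.1))
```

With these assumed values, 1.0 V at the amplifier output corresponds to a 50 mV drop, about 0.5 A, and roughly 2.5 W for the node.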


A prototype sampling interface for PAPI

July 2015 · 16 Reads · 3 Citations

PAPI is a widely used portable library for accessing hardware counters on modern microprocessors. PAPI offers both counting and sampling interfaces, but the sampling interface is extremely limited, consisting of a simple interrupt-driven interface that can periodically report processor state. In the past few years, the hardware and operating systems of modern processors have added support for new more advanced sampling features. These features enable information about non-uniform memory access (NUMA) behavior to be obtained. Currently, performance tool developers who want to provide sampling data to their users must make use of a complex low-level kernel interface, sometimes developing their own kernel patch to access the features they need. This paper reports on initial efforts to develop a middleware layer that will serve as a stable interface and enable tool developers to access sampling data through standard PAPI calls and to obtain data important for NUMA analysis.


Self-monitoring overhead of the Linux perf_event performance counter interface

April 2015 · 250 Reads · 38 Citations

Most modern CPUs include hardware performance counters: architectural registers that allow programmers to gain low-level insight into system performance. Low-overhead access to these counters is necessary for accurate performance analysis, making the operating system interface critical to providing low-latency performance data. We investigate the overhead of self-monitoring performance counter measurements on the Linux perf_event interface. We find that default code (such as that used by PAPI) implementing the perf_event self-monitoring interface can have large overhead: up to an order of magnitude larger than the previously used perfctr and perfmon2 performance counter implementations. We investigate the causes of this overhead and find that with proper coding this overhead can be greatly reduced on recent Linux kernels.
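One common way to estimate the cost of a counter read is to issue two reads back to back with no work in between and take the difference. The sketch below shows that idea only; it is not the paper's methodology. The function name is hypothetical, and `time.perf_counter_ns` stands in for a hardware counter read, since calling perf_event directly from Python would need extra machinery.

```python
import statistics
import time


def read_overhead(read_counter, samples=1000):
    """Median difference between back-to-back counter reads.

    read_counter: zero-argument callable returning a monotonically
    increasing integer (a stand-in for a hardware counter read).
    The result is in the counter's own units.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    deltas = []
    for _ in range(samples):
        first = read_counter()
        second = read_counter()
        # Nothing runs between the reads, so the difference
        # approximates the cost of a single read.
        deltas.append(second - first)
    # The median is less sensitive to interrupts and scheduling noise
    # than the mean.
    return statistics.median(deltas)


if __name__ == "__main__":
    print(read_overhead(time.perf_counter_ns), "ns per read (approx.)")
```

The number printed reflects Python call overhead as much as the underlying clock, so it illustrates the method rather than the kernel-interface costs the abstract reports.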



Citations (19)


... Improving HPC Security with Targeted Syscall Fuzzing [69] 2022 USA Software Methodology: Integrates domain-specific knowledge, tailored fuzzing strategies, and hybrid execution techniques for robust vulnerability detection in critical system calls essential for HPC applications. Findings: The paper highlights the efficacy of targeted syscall fuzzing as a key asset for enhancing HPC security, and uncovering vulnerabilities in the Linux performance infrastructure. ...

Reference:

High Performance Computing (HPC) System Security: Threats and Vulnerabilities, Challenges and Solutions
Improving HPC Security with Targeted Syscall Fuzzing
  • Citing Conference Paper
  • November 2022

... Examples of measured quantities include: aggregate error rates, per-cell probabilities of error, and spatial/temporal error distributions. These measurements can be made using testing infrastructures ranging from industry-standard large-scale testing equipment [354,355] to home-grown tools based on commodity FPGAs [16,29,55,65,90,171,264,273,356-359] or DRAM-based computing systems [236,262,274,360,361]. ...

A Raspberry Pi operating system for exploring advanced memory system concepts
  • Citing Conference Paper
  • October 2018

... The emergence of the Pi2B in 2015, which boasted a quad-core processor, provided a significant performance boost over the Pi1B, as evidenced by the studies conducted by Christian Baun [9] and Cloutier et al. [10]. Baun compared 8-node Pi1B and Pi2B clusters, finding that Pi2B demonstrated approximately four times higher performance with 80% memory usage. ...

A Raspberry Pi Cluster Instrumented for Fine-Grained Power Measurement

Electronics

... To overcome Pi1B's constraints, Justin Moore [7] utilized a 4-node Pi1B cluster with 512MB RAM per node and, by using the Automatically Tuned Linear Algebra Software (ATLAS) library, achieved 0.836 GFlops with overclocking. Cloutier et al. [8] tested a 32 Pi1B (Pi1B and Pi1B+) cluster with added memory, achieving up to 6.25 GFlops overclocked. ...

Design and Analysis of a 32-bit Embedded High-Performance Cluster Optimized for Energy and Performance
  • Citing Conference Paper
  • November 2014

... Power consumption and performance modeling and management in data centers is of great interest [16], [17], [18], [19], [20], [21]. The Dynamic Frequency Management of processor/cores' clock frequency has been considered and used as an alternative option to consolidation in saving energy and also in increasing the lifetime of equipment [8], [9], [22]. ...

A prototype sampling interface for PAPI
  • Citing Article
  • July 2015

... Linux Perf Monitor is a popular Linux module that provides an interface for user-level applications to configure and monitor performance counters. Nevertheless, instrumenting and monitoring systems at user-level may introduce extraneous overhead [21]. Run-DMC [16] is a runtime dynamic performance and power estimator for Heterogeneous Multicore platforms. ...

Self-monitoring overhead of the Linux perf_event performance counter interface
  • Citing Article
  • April 2015

... On the server side, we evaluate the performance degradation using the High Performance Linpack [62] benchmark which is used to evaluate performance on Top 500 supercomputers. ...

Evaluation of the HPC Challenge Benchmarks in Virtualized Environments

Lecture Notes in Computer Science
