Why Many Benchmarks Might Be Compromised
Johannes Manner and Guido Wirtz
Distributed Systems Group
Otto-Friedrich-University
Bamberg, Germany
{johannes.manner, guido.wirtz}@uni-bamberg.de
Abstract—Benchmarking experiments often draw strong con-
clusions but lack information about the environmental influences
like the hardware used to deploy the investigated system. Fairness
and repeatability of these benchmarks are at least questionable.
Developing for or migrating applications to the cloud or DevOps
environments often requires performance testing, either for
ensuring quality of service or for choosing the correct service
parameters when deciding on a cloud offering.
While building a benchmarking pipeline for cloud functions, the typical assumption is that a CPU scales its performance linearly with the utilization. Due to heat generation, noise and other constraints, this is not the case; there is a trade-off between efficiency and performance. To investigate this trade-off and its implications, we set up experiments to evaluate the influence of these factors on benchmark results. We solely focus on Intel CPUs. Beginning with the second generation (Sandy Bridge), Intel uses its own scaling driver intel_pstate. Our
results show that different settings for this scaling driver have a
significant impact on the measured performance and therefore
on the linear regression models we computed using LINPACK
benchmarks. These benchmarks are executed at different CPU
utilization points. On one of the benchmarked machines, an active intel_pstate scaling driver with turbo boost enabled and the powersave governor reached an R² of 0.7349, whereas the performance governor shows a significantly better, near-ideal coefficient of determination of 0.9999. Therefore, we propose a
methodology for system calibration to ensure fair and repeatable
benchmarks.
Index Terms—Intel pstate, Benchmark Design, Benchmarking,
Simulation, Profiling, Experiment Design, Repeatability
I. INTRODUCTION
Benchmarking is the process of understanding system qual-
ities and the basis to re-engineer closed source systems and
unravel their implementation mysteries. Often investigated
qualities of a system are the behavior under heavy load
and implicitly the scalability and resilience properties. To
understand the performance and runtime characteristics of
Systems under Test (SUT), benchmarking these systems is
an indispensable procedure. It is vital that experimenters
document all influencing factors modified in such a way
that other researchers/practitioners are able to interpret the
results correctly. As KUHLENKAMP and WERNER [1] stated
in a literature review about Function as a Service (FaaS)
benchmarks, only 12% of the experiments (3 out of 26) specify
all information to reproduce the experiment.
This problem of non-repeatable experiments gets even
worse, when a series of experiments does not state hypotheses
upfront and does not start with single, isolated experiments
to confirm or reject these hypotheses before conducting load
tests. Tools like JMeter1 allow stress testing a SUT, which can lead to predictions on how the system will behave under heavy load. CPU and other hardware resources are strained, but the implications of the hardware used are often neglected [2]–[4].
Since these SUT are running on different hardware within
the software lifecycle, the runtime behavior may diverge
which is important when configuring one system based on
the measured quality of service of another system. Dev and
prod environments, for example, rarely run on the exact
same hardware. General purpose processors, i.e., consumer
processors, focus on optimizing the average-case performance
by employing runtime performance enhancements. However,
these techniques are normally not documented since they are
Intellectual Property (IP) of the vendor. End users have limited
control over them. Frequency scaling of the CPU is influenced
by a lot of factors specified in the Advanced Configuration and
Power Interface (ACPI) specification [5]. In particular, perfor-
mance states (P-states), cooling requirements and turbo boost
options influence the frequency scaling, power consumption
and heat generation of the CPU. However, for load tests a
linear scaling of resources is important for interpretable and
fair results.
In a previous experiment, we faced some unpredictable and
non-linear performance distributions on two Intel i7 proces-
sors. One of our machines showed three performance ranges
due to power saving aspects. However, there are many settings
influencing each other and ultimately the performance of the
CPU making it hard to trace performance variations back to
a single factor. DELL's configuration of High Performance Computing (HPC) servers [6], [7] or the SPEC BIOS settings descriptions2 for their CPU benchmarks are examples of
this. Simply disabling all power-saving or performance boost options is not an option since the default settings do not ensure linear scaling of resources. It is likely that other investigations also face this non-linear performance distribution on the machines used in their experiments without detecting it due to noise in the benchmarking data.
Therefore, in this paper we propose a simple function
to calibrate hardware and understand the scaling algorithms
determining CPU performance in order to allow for fair
benchmarks. This concept is also usable for later processors
1https://jmeter.apache.org/
2https://www.spec.org/cpu2017/flags/Intel-Platform-Settings-V1.1.html
of the Intel i-series or Xeon processors which, dependent on the model, use intel_pstate as their scaling driver.
Since a lot of factors influence the CPU frequency scaling at
different CPU utilization points, an understanding of runtime
characteristics is necessary in order to compare different
systems. For fair experiments, we assume a linear scaling when
increasing the system resources, particularly the CPU access
time of a processor. An implementation of our simple hardware
calibration functionality is available as a CLI feature3. A run-
ning Docker environment is the only prerequisite to execute
the calibration. Our benchmark can serve as a standard for
normalizing results of benchmark experiments.
This leads to the following research question:
• RQ: How can a consistent CPU scaling behavior across
different processors be achieved and made visible to
ensure the validity and comparability of benchmark ex-
periments?
We limit the discussion to Linux4 and Intel CPUs5 since
they are the predominant technology used in the cloud. The
agenda is as follows: In Section II we recap some important
fundamentals for P-states and the CPU frequency scaling
options in the Linux kernel. The next Section discusses
related work of benchmark characteristics and how P-states
were addressed in other research. Section IV describes the
motivation of our work in greater detail and continues with
the presentation of the problem. The next Section proposes
our methodology to answer RQ followed by an evaluation.
The discussion of the results and limitations of our work is
presented in Section VII. Finally, we conclude our paper with
ideas and next steps for future work.
II. FUNDAMENTALS
According to the ACPI specification6, Control States (C-
states) and Performance States (P-states) determine power
consumption and frequency of the CPU. C-states handle the
working mode of the machine, e.g. active or suspended. Since
we assume an active one, C-states will not be considered in the
following. P-states determine the CPU frequency and therefore
the computational power. They can be changed by algorithms
when demand changes. There are a few conflicting goals which need to be balanced when changing the P-states [5]: performance on the one hand, and power consumption, battery life, thermal constraints and fan noise on the other.
The Linux kernel CPU Frequency scaling (CPUFreq) sub-
system consists of three components7:
• Scaling Governors - Each scaling governor imple-
ments an algorithm for estimating the CPU demand
3https://github.com/johannes-manner/SeMoDe/releases/tag/v0.4
4Market share Linux: https://www.rackspace.com/en-gb/blog/
realising-the- value-of-cloud- computing-with- linux (last accessed 2021-
04-16)
5Market share Intel: https://www.statista.com/statistics/1130315/
worldwide-x86- intel-amd-laptop-market-share/ (last accessed 2021-04-
16)
6Especially Section 8 in the specification is of particular interest (pages
509ff. in [5])
7https://www.kernel.org/doc/html/v5.4/admin-guide/pm/cpufreq.html
and changes the processors’ frequency accordingly. Also
mixed strategies where different scaling governors work
together are reasonable to achieve a good system per-
formance under various loads. Examples for governors
are performance (highest frequency) or powersave
(lowest frequency).
• Scaling Drivers - "Provide scaling governors with information on the available P-states (or P-state ranges in some cases) and access platform-specific hardware interfaces to change CPU P-states as requested by scaling governors."8
• CPUFreq Core - Basic code infrastructure framework
which the other two components integrate with.
The options listed here are the implementation in the Linux
kernel. Since the scaling driver communicates with the hard-
ware, vendor-specific options cannot be addressed by a generic
implementation. Therefore, the implementation of vendor-
specific scaling drivers was introduced. Since Sandy Bridge (the second generation of Intel Core processors), intel_pstate has been such a scaling driver, making it possible to implement custom scaling governors or overwrite existing ones. This driver circumvents the generic implementations and also adds new features, where Hardware-Managed P-states (HWP) enable customized scaling algorithms to deal with the specialties of each processor family and model. When HWP is turned off, the generic scaling information specified in the ACPI tables is used.
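As an illustration, the currently active driver and governor can be read from sysfs. The following is a minimal sketch, assuming the paths from the kernel documentation cited above exist on the target system (the intel_pstate entries are only present when that driver is loaded):

# scaling driver and governor of the first logical CPU
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# operation mode of intel_pstate: active, passive or off
cat /sys/devices/system/cpu/intel_pstate/status
# turbo boost state: 0 = enabled, 1 = disabled
cat /sys/devices/system/cpu/intel_pstate/no_turbo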
III. RELATED WORK
A. Benchmark Characteristics
One of the first publications about benchmark characteristics
is the work of HUPPLER [8]. He states that a good bench-
mark design needs to be relevant, repeatable, fair, verifiable
and economical. As already mentioned in the introduction,
especially the repeatability and therefore also the fairness of
some experiments are difficult to ensure. Missing information
about the configuration, the number of executions or the
load distribution diminish the confidence in an experiment
and its results. Also, the equivalence of the benchmarking
environment and the production environment, often called dev-
prod parity, is seldom achieved and introduces another source
of discrepancy.
Other publications also introduce additional benchmark
characteristics for the cloud, e.g. COOPER et al. [9] and BERMBACH et al. [2]. The latter especially mention the underlying hardware stack, whereas COOPER et al. introduce
portability, scalability and simplicity as characteristics. For the
cloud in particular, different CPU architectures are present [3],
[4] which emphasizes the importance of transferable results of
benchmarks and the need to compare physical resources.
B. Experiments with intel_pstate
Since the scaling of CPU resources determines the CPU
frequency and therefore the speed, we cluster previous re-
search by their use of intel_pstate if this configuration
8 https://www.kernel.org/doc/html/v5.4/admin-guide/pm/cpufreq.html#cpu-performance-scaling-in-linux
is explicitly mentioned. The Linux kernel has supported this scaling driver since kernel version 3.9, released in 2013 [10]. As the kernel documentation describes9, how the P-
state is translated into frequencies depends also on the specific
processor model and family. Energy consumption grows pro-
portionally with the frequency, therefore also ACPI compliant
low power solutions are researched for the cloud [11].
Overall, there are four configuration options10. Options one and two are to use the intel_pstate scaling driver in active mode with (1) or without (2) hardware support (no_hwp). The third option is to use it in passive mode (3), whereas the last option is to disable intel_pstate (4). Disabling it results in the usage of the generic acpi-cpufreq scaling driver. Some vendor-specific hardware properties, which can be read by the intel_pstate scaling driver, cannot be used in this case. There are several reasons to disable it nevertheless. BECKER and CHAKRABORTY [12] fixed the CPU performance of their
and CH AKRABORT Y [12] fixed the CPU performance of their
system by disabling this feature and also the turbo boost
options. Reducing noise for a particular use case was another
reason to disable it [13]. Some researchers, e.g. [14], wanted to be more flexible in changing the frequency by hand during their benchmarks. Since some of the generic scaling governors are overwritten (by using the same name) by intel_pstate and others are not usable, researchers [15], [16] sometimes disabled it to use the generic ones. We refer the interested reader to the Linux kernel documentation. None of the papers we identified by searching Google Scholar and the ACM Digital Library specifies one of the first three options (1-3) explicitly11.
Since the default configuration is an active intel_pstate (for some models also with HWP enabled), we assume that a lot of benchmarks use this configuration when doing experiments on-premise.
IV. PROBLEM ANALYSIS
While working on our simulation and benchmarking
pipeline proposed in [17], we have seen the performance
behavior shown in Figure 1 when executing our calibration
function.
H60 and H90 are Intel quad-core machines with Ubuntu
20.04.2 as the OS and Docker for executing the function
seen in Figure 1. The machines’ specifications can be found
in Table I. We specified the cpus12 Docker CLI option to
limit the CPU usage by the containers. At each point in time,
only a single container is running on the mentioned machines
9 https://www.kernel.org/doc/html/v5.4/admin-guide/pm/intel_pstate.html#processor-support
10 In the /etc/default/grub file, the GRUB_CMDLINE_LINUX_DEFAULT property can be changed to a value explained in the kernel documentation (https://www.kernel.org/doc/html/v5.4/admin-guide/pm/intel_pstate.html#kernel-command-line-options-for-intel-pstate). Via sudo update-grub, these changes can be applied and the system can be rebooted.
11 The search term intel pstate resulted in ten records at the ACM Digital Library and 97 at Google Scholar. The result list at Google Scholar contained a lot of presentations and also links to the Linux kernel documentation. All conference and journal publications identified as relevant were included in this paragraph.
12 https://docs.docker.com/config/containers/resource_constraints/#cpu
[Figure 1: two panels, H60 and H90 - 1.a; x-axis: CPU Quota, y-axis: GFLOPS]
Fig. 1. Calibrating local Machines for Benchmarking.
TABLE I
SPECIFICATIONS OF THE TWO MACHINES OF THE SHOWN EXPERIMENTS.

                 H60        H90
Processor        i7-2600    i7-7700
Model            42         158
Base Frequency   3.40 GHz   3.60 GHz
Turbo Boost      3.80 GHz   3.90 GHz
Linux Kernel     5.4.0-65   5.4.0-70
together with our prototype which collects the metrics. The
impact of our prototype on the CPU utilization is negligible.
We measured it using the sar command in three system states: when no function is running, when the prototype starts, and when the prototype idles. We did not see a noteworthy deviation from the clean system state13.
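For reference, a minimal sar invocation for such an idle measurement could look as follows; the sampling interval and count are illustrative choices, not the exact parameters we used:

# report the overall CPU utilization once per second for 30 seconds
sar -u 1 30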
We executed LINPACK [18], [19] which solves linear
equations as a CPU intensive function packaged in a Docker
container at runtime. Each setting presented in Figure 1 was executed 25 times by increasing the Docker cpus option in steps of 0.1.
Both machines use intel_pstate in active mode as their
scaling driver and powersave as the scaling governor. HWP
is enabled on H90 and not available on H60. At each share of
the CPU, e.g. 0.5 cpus, the assigned portion of the CPU is
nearly fully utilized due to the LINPACK characteristics. This
is important to keep in mind when interpreting the diagrams
and the subsequent results. In other words, we mimic artificial
13 Look at the following file which shows the utilization and changes when starting the prototype: https://github.com/johannes-manner/SeMoDe/files/6336159/utilization.txt
situations where a defined portion of the system is under heavy
load and look at the performance of our system. For example, when assigning 0.5 cpus on a system with four cores, the CPU utilization is around 12.5%14.
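A minimal shell sketch of such a sweep, using the public LINPACK image referenced in Section V, is shown below. The step width of 0.1 follows the experiment description; the quota range, the loop itself and the log file names are illustrative and not our prototype's implementation:

# run LINPACK once per CPU quota, one container at a time
for quota in $(seq 0.1 0.1 4.0); do
  docker run --rm --cpus="$quota" jmnnr/linpack:v1 > "linpack_${quota}.log"
done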
The situation described here is contrived since in a normal load test, the impact of the frequency scaling might be hidden within the noise of other influencing factors. The CPU normally does not run long enough at a given utilization level to make the phenomenon visible in the data. Therefore, it is necessary to
create a testbed where we can assess this influencing factor in
isolation and make changes in the configuration visible.
As mentioned before, for laptops or machines with cooling
problems etc., the choice of the scaling driver impacts the
power consumption and heat generation. Furthermore, even
different models within the same generation of a processor line
have an impact on the frequency scaling due to their specific
hardware support for HWP. Being aware of this scaling
phenomenon makes it easier for experimenters to choose a
suitable scaling behavior for their benchmarks. This enables a
performance estimation under low, moderate and high load of
a system and does not jeopardize the results and, hence, the conclusions drawn.
TABLE II
LINEAR REGRESSION MODELS FOR DATA PRESENTED IN FIGURE 1

             H60        H90 - 1.a
p-value      <2.2e-16   <2.2e-16
R²           0.9995     0.7349
Intercept    -3.081     -50.340
Slope        23.400     56.357
Max GFLOPS   90.905     215.818
These problems are now expressed in numbers. The orange
lines in Figure 1 show the linear regressions. Table II shows
the statistics to the figures for H60 and H90. The coefficient
of determination (R²) for H60 shows a near ideal relation-
ship between the dependent Giga Floating Point Operations
per Second (GFLOPS) and the independent cpus variable.
Therefore, the results of a benchmark on this machine are
comparable and fair since doubling the resources results in
doubling the GFLOPS. The intercept is negligible in this case
and explainable due to inherent computational overhead. Con-
trary, on H90, the relationship between cpus and GFLOPS is
good with R² being 0.7349, but it is obvious when looking at
Figure 1 (right) that three performance ranges are visible from
[0, 0.5], [0.6, 2.8] and [3.0, 4.0] with different slopes. When
checking the available governors at H90, powersave and
performance are active. The kernel documentation states
that the ”processor is permitted to take over performance
scaling control”15 when exceeding a threshold. When further
looking at the different CPUs and their frequencies at runtime
14Due to other processes running on the system, the utilization is a bit
higher, but shared services running in the background are negligible as can
be seen for the prototype influence measured via sar.
15 https://www.kernel.org/doc/html/v5.4/admin-guide/pm/intel_pstate.html#turbo-p-states-support
via tools like turbostat16, we can see that the powersave
scaling governor is used for the second interval operating at
minimum frequency and the performance scaling governor
is used for the first and third interval. Therefore, a fair
comparison of SUT deployed on H60 and H90 is questionable
since H90 performs worse under moderate load than under
peak load.
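For reference, a minimal turbostat invocation to observe this behavior could look as follows; it assumes turbostat is installed (e.g. via the linux-tools packages) and that the column names of the installed version match:

# print per-CPU load and effective frequency once per second while LINPACK runs
sudo turbostat --quiet --interval 1 --show CPU,Busy%,Bzy_MHz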
This observation and the statistical evaluation already em-
phasize the need for a calibration of the CPU performance.
V. METHODOLOGY
We propose the following solution (RQ). We use the
LINPACK benchmark as a CPU intensive calibration function
and report the metrics specified in Table II to the user of our
research prototype’s CLI17. Even though the case described
in the previous section is contrived, it gives us the option to
isolate CPU performance and to make changes in the con-
figuration of intel_pstate visible. To be able to restrict
CPU resources to a single function, we use the Docker CLI
cpus option. This gives us the chance to artificially fix the
CPU utilization at a given value and understand the scaling
of the CPU frequency and the performance by looking at the
computed GFLOPS.
LINPACK is especially suited to assess the performance of multi-core hardware since it makes use of all available CPUs and is machine independent [18]. The same holds true
for load testing tools, where concurrent users of a system
can be simulated to stress the SUT. Other functions using the
available resources in a similar way are also possible for this
proposed calibration step, but LINPACK is well established in
this domain. An excerpt of the LINPACK output is shown in
Listing 1. We run our Docker container with a CPU share of
1.0 cpus on H90 in this example.
Listing 1. Sample LINPACK execution on H90 for a CPU share of 1.0.

> docker run --cpus=1.0 jmnnr/linpack:v1
...
Intel(R) Optimized LINPACK Benchmark data
Current date/time: Fri Apr 16 11:10:52 2021
...
============ Single Runs ============
Size    LDA     Align.  Time(s)   GFlops
1000    1000    4       0.005     144.8319
1000    1000    4       0.095     7.0746
...
10000   25000   4       58.080    11.4819
10000   25000   4       58.289    11.4406
Performance Summary (GFlops)
Size    LDA     Align.  Average   Maximal
1000    1000    4       87.1978   144.8319
5000    18000   4       10.9034   10.9638
10000   25000   4       11.4612   11.4819
Residual checks PASSED
...
16https://www.linux.org/docs/man8/turbostat.html
17https://github.com/johannes-manner/SeMoDe/releases/tag/v0.4
The Single Runs and Performance Summary sections present
the size of the matrix which is used for the linear computation
and the leading dimension of A (LDA) which also determines
the storage of arrays in memory. What is interesting in the
Single Runs section of the output is the problem size of the linear equation system. Problem size 1'000 reached quite a high number of GFLOPS. At this point in time, the frequency scaling of the CPU is not yet stable, and the equations for this problem size are solved within a few milliseconds, which distorts the accuracy of the CPU performance measurement in GFLOPS. In addition, the problem sizes are executed repeat-
edly for more stable results as can be seen for the two runs of
problem size 10’000. Therefore, we use the average GFLOPS
of the experiment with the largest problem size because the equations in this case run for a sufficient period of time to get a stable scaling under this portion of CPU utilization. For the sake of simplicity and comparability, we package the LINPACK function and push the image18 to Docker Hub, which is used by our prototype by default. We further parse the LINPACK results (Listing 1) to get the GFLOPS.
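Our prototype performs this parsing internally; a minimal shell sketch with the same effect, assuming the output format of Listing 1 and the hypothetical log file name from the sweep sketch in Section IV, reads the Average column of the largest problem size from the Performance Summary:

# extract the average GFLOPS of problem size 10'000 from the Performance Summary
awk '/Performance Summary/ {summary=1}
     summary && $1 == "10000" {print $4}' linpack_1.0.log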
VI. EVALUATION
In our evaluation, we look at the four cases introduced
in related work. We solely focus on H90 since H60 shows
an already acceptable CPU performance distribution under
diverse load settings. For option one, we further investigate
sub-cases to be more precise in drawing conclusions on this
specific machine and show the most important settings.
1) Scaling driver intel_pstate in active mode with
HWP support.
a) Turbo boost on, powersave scaling governor.
b) Turbo boost off19, powersave scaling governor.
c) Turbo boost on, performance scaling governor.
d) Turbo boost off, performance scaling governor.
2) Scaling driver intel_pstate in active mode without
HWP support, powersave scaling governor20.
3) Scaling driver intel_cpufreq since
intel_pstate is in passive mode. Scaling governor
is ondemand21.
4) Scaling driver is here acpi-cpufreq and governor ondemand22 (a minimal command sketch for these settings follows the list).
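The commands behind these settings, collected from the footnotes in this paper, boil down to the following minimal sketch; changing the GRUB option requires a reboot, whereas the governor and turbo boost can be changed at runtime:

# select the intel_pstate mode in /etc/default/grub, e.g. for options 2-4:
#   GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=no_hwp"    (2)
#   GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=passive"   (3)
#   GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable"   (4)
sudo update-grub && sudo reboot

# disable (1) or enable (0) turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# switch the scaling governor of CPU 0 to performance
sudo cpufreq-set --cpu 0 --governor performance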
For each setting (except for 1.a), we have only a single exe-
cution. In our methodology we propose that by default we only
need a single run to assess the quality of a system. Especially
the active HWP case is investigated further by changing the
scaling governor to performance and enabling/disabling
turbo boost. Options 2 to 4 are investigated in the default
setting when updating grub, so turbo boost is enabled in all
of these cases. The input for LINPACK is constant for all
executions with three different matrix sizes (1’000, 5’000 and
18https://hub.docker.com/repository/docker/jmnnr/linpack
19 Changing /sys/devices/system/cpu/intel_pstate/no_turbo to "1" disables the turbo boost; "0" indicates an enabled turbo boost.
20 GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=no_hwp"
21 GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=passive"
22 GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable"
10’000) as can be seen in Listing 1. We use the average of
the highest problem size (10’000) for the statistical evaluation
and the figures presented in the following.
The first one (1.a) was already shown in Figure 1 (right)
and evaluated in Table II. The CPU is configured with active
HWP and turbo boost23.
[Figure 2: three panels, H90 - 1.b, H90 - 1.c, H90 - 1.d; x-axis: CPU Quota, y-axis: GFLOPS]
Fig. 2. Calibrating H90 in different settings by changing scaling governor and turbo boost.
TABLE III
LINEAR REGRESSION MODELS FOR DATA PRESENTED IN FIGURE 2

             H90 - 1.b   H90 - 1.c   H90 - 1.d
p-value      6.6e-16     <2.2e-16    <2.2e-16
R²           0.7406      0.9999      0.9999
Intercept    -45.203     -1.820      -1.715
Slope        51.649      54.118      49.389
Max GFLOPS   197.101     215.939     196.908
The other three sub-configurations under option 1 are inves-
tigated in Table III and Figure 2. In 1.b we turned off turbo
boost, resulting in the same distribution but the maximum
achieved GFLOPS is around 10% lower, which is reasonable
when looking at the base clock rate of 3.6 GHz and 3.9 GHz
in turbo boost mode. The same observation can be made when
comparing 1.c and 1.d with each other. The distributions are
equal except for the absolute value of GFLOPS.
The difference between 1.a/1.c and 1.b/1.d respectively is
the scaling governor used. We exchanged the powersave
23 https://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html
with the performance governor24. For the performance
governor, switching from one to the other algorithm does not
happen since the CPU is operated at maximum frequency.
Therefore, this configuration is a candidate for fair and repeatable benchmarks on H90. The drawback is that the power consumption is also at its maximum, resulting in additional heat generation.
[Figure 3: three panels, H90 - 2, H90 - 3, H90 - 4; x-axis: CPU Quota, y-axis: GFLOPS]
Fig. 3. Calibrating H90 in different settings by changing driver.
TABLE IV
LINEAR REGRESSION MODELS FOR DATA PRESENTED IN FIGURE 3

             H90 - 2    H90 - 3    H90 - 4
p-value      <2.2e-16   <2.2e-16   <2.2e-16
R²           0.9953     0.9975     0.9976
Intercept    -14.378    -7.169     -7.775
Slope        55.362     54.254     54.395
Max GFLOPS   215.835    215.880    215.740
Options 2 to 4 are presented graphically in Figure 3 and statistically in Table IV. Compared to 1.a, the only difference
of option 2 is an active intel_pstate without HWP
support. The HWP support has an impact on the scaling
algorithm and enables switching between powersave and
performance governor as already seen in Figure 1 (right),
whereas the system configured without HWP only uses the
powersave governor which "selects P-states proportional to
24 sudo cpufreq-set --cpu n --governor performance, where n is the processor number.
the current CPU utilization"25 in this operation mode. This
results in the undulations seen in Figure 3 (top).
Passivating intel_pstate (option 3) results in the usage
of the intel_cpufreq scaling driver and the ondemand
scaling governor. As stated in the documentation, the HWP
support is also disabled. Compared to the second option, the
performance behavior is quite similar; however, the governor uses the CPU load to determine the CPU frequency. Only 16 fixed P-states are provided by the generic ACPI frequency table, which explains the non-linear scaling behavior.
Most researchers disable intel_pstate and use the generic acpi-cpufreq scaling driver. Since options 3 and 4 use the same governor and the information available in the ACPI tables, their results are similar.
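Whether such a fixed ACPI frequency table is in use can be checked, for example, with cpupower; this is a minimal sketch, and the exact output depends on the active driver:

# show the active driver, the available frequency steps (if a table exists) and the current policy
cpupower frequency-info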
As for the first option, a fine tuning for the other options is
necessary to reach a good and stable scaling. The presented
differences for the latter three options are only a starting point.
VII. CONCLUSION
A. Discussion of the Results
The presented methodology enables a calibration of systems
with respect to the CPU performance. We use LINPACK as
a machine independent benchmark to assess the frequency
scaling of the CPU and express the power via GFLOPS
when solving linear equations. The approach showed one
solution to solve the initially motivated situation, where an
unpredictable scaling was present. We investigated the most
important influencing factors for Intel CPUs under Linux
namely the scaling driver and its corresponding governors.
Due to vendor-specific implementations, it is possible to make use of vendor-specific knowledge about the system components, as intel_pstate showed. Nevertheless, in
some configurations, this leads to performance distributions
which are questionable when conducting fair and repeatable
benchmarks. The four options identified in related work cover only a small portion of the possible system configurations when taking all available settings into consideration. Why many researchers disable intel_pstate and use the generic acpi-cpufreq scaling driver is not discussed explicitly in their work, and doing so is also not reasonable when looking at the results of our evaluation. Therefore, it is important to conduct calibration experiments like ours upfront to find a good configuration for benchmarking a SUT. Only option 1 was examined in detail by changing the scaling governor and testing the system with turbo boost enabled and disabled, but without other fine-tuning.
Figure 4 shows how important such a linear scaling is for
dev-prod parity considerations where a LINPACK benchmark
was executed on AWS Lambda (region eu-central-1) 100 times
for different memory settings on Intel Xeon processors with
2.50 GHz, model 63. This cloud platform shows a linear scaling with an R² of 0.9973 (Intercept: -1.995, Slope: 0.0197, Max. GFLOPS: 209.757) and therefore guarantees a stable
25 https://www.kernel.org/doc/html/v5.4/admin-guide/pm/intel_pstate.html#powersave
[Figure 4: panel title "Calibration on AWS Lambda"; x-axis: CPU Quota (0-10000), y-axis: GFLOPS]
Fig. 4. Executing LINPACK on AWS Lambda for different memory settings.
quality of service. Without a similar performance distribution locally, we cannot draw implications about how software will run on other machines or in the cloud.
B. Threats to Validity
Single Machine - We only looked in detail at a single Intel
processor (i7-7700, model 158) in this paper and assessed
its specific configuration. For other generations of Intel processors (second generation onwards), intel_pstate is used as the scaling driver, and we assume that the behavior shown here is also present on some of these machines, dependent on the model and processor line.
Single Vendor - We only looked at Intel processors, but we
assume that our methodology also works on other processors, such as AMD's.
Minimal Dimension - CPU performance was the only dimension of interest in this paper. To draw strong
conclusions, we tried to reduce the influence of all other
factors to a minimum. We are aware that the CPU performance
is also influenced by cooling requirements, network access,
hard disk speed, system bus, etc.
Sample Size - As mentioned in the evaluation, we only
executed a single run for each experimental setup since we also
propose to do this when using our methodology in practice. We
have seen in Figure 1 that especially the transition from the
second to the third performance interval is interesting since
the influences of the scaling algorithm switch can be seen
in greater detail for the 25 executions. There is a trade-off
between execution time and accuracy of the results. However,
in Figure 2 (top) we have seen that even a single execution is enough to reveal a non-linear scaling.
LINPACK Configuration - The problem size, number
of runs and the LDA of the LINPACK benchmark can be
specified as input. All calibration runs in this paper were executed with the same set of parameters, which could bias the results since the LDA is also responsible for how the matrix A containing the linear equations is stored on the heap.
VIII. FUTURE WORK
Due to the various threats mentioned, there is a lot to do
in future work to enable an even fairer setup for benchmarks.
Our plan for future work is twofold.
Firstly, we want to look at other factors related to the CPU
and influencing the performance. First and foremost, we want
to look at memory influences by changing the LINPACK input
parameters. System bus capabilities and other components also play a vital role, which might require specific calibration considerations to assess the quality of a system. Furthermore,
the cooling capabilities are important to keep in mind when
operating the system with the performance governor and
in turbo boost mode. Finally, different Intel processors and
other vendors are in focus of the next research step.
Secondly, we want to implement some visual support to
generate figures as presented in this paper by our prototype.
Currently, the support is limited to the statistical evaluation by
getting the intercept, slope and R².
REFERENCES
[1] J. Kuhlenkamp and S. Werner, “Benchmarking FaaS Platforms: Call for
Community Participation,” in Proc. of WoSC, 2018.
[2] D. Bermbach et al., Cloud Service Benchmarking. Springer International Publishing, 2017.
[3] J. O’Loughlin and L. Gillam, “Performance evaluation for cost-efficient
public infrastructure cloud use,” in Proc. of GECON, 2014.
[4] R. Cordingly et al., "Predicting performance and cost of serverless computing functions with SAAF," in Proc. of DASC/PiCom/CBDCom/CyberSciTech, 2020.
[5] UEFI Forum, "ACPI specification, version 6.3," UEFI Forum, Inc., Tech. Rep., 2019. [Online]. Available: https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
[6] J. Beckett, "BIOS performance and power tuning guidelines for Dell PowerEdge 12th generation servers," DELL, Tech. Rep., 2012.
[7] G. Kocchar et al., "Optimal BIOS settings for HPC with Dell PowerEdge 12th generation servers," Citeseer, Tech. Rep., 2012.
[8] K. Huppler, “The Art of Building a Good Benchmark,” in Proc. of
TPCTC, 2009.
[9] B. F. Cooper et al., “Benchmarking cloud serving systems with YCSB,”
in Proc. of SoCC, 2010.
[10] C. Gough, I. Steiner, and W. Saunders, “Operating systems,” in Energy
Efficient Servers. Apress, 2015, pp. 173–207.
[11] M. Karpowicz et al., “Energy and power efficiency in cloud,” in
Computer Communications and Networks. Springer International
Publishing, 2016, pp. 97–127.
[12] M. Becker and S. Chakraborty, “Measuring software performance on
linux,” arXiv e-Prints - 1811.01412, 2018.
[13] J. Dorn et al., “Automatically exploring tradeoffs between software
output fidelity and energy costs,” IEEE Transactions on Software Engi-
neering, vol. 45, no. 3, pp. 219–236, 2019.
[14] E. Calore et al., “Software and DVFS tuning for performance and
energy-efficiency on intel KNL processors,” Journal of Low Power
Electronics and Applications, vol. 8, no. 2, p. 18, 2018.
[15] A. Rumyantsev et al., “Evaluating a single-server queue with asyn-
chronous speed scaling,” in Lecture Notes in Computer Science.
Springer International Publishing, 2018, pp. 157–172.
[16] M. Horikoshi et al., “Scaling collectives on large clusters using intel(r)
architecture processors and fabric,” in Proc. of HPC Asia, 2018.
[17] J. Manner, “Towards Performance and Cost Simulation in Function as
a Service,” in Proc. of ZEUS, 2019.
[18] J. J. Dongarra et al., LINPACK Users' Guide. Society for Industrial and Applied Mathematics, 1979.
[19] ——, "The LINPACK benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803–820, 2003.